Autocorrect – how does it decide what you “meant”?
This lesson builds on Levenshtein (minimum edit) distance. In class we acted out a simplified autocorrect engine that combines distance with other signals to rank suggestions — and sometimes chooses to KEEP your original word.
Learning goals
- “I can generate candidate corrections for a misspelled word using a small Levenshtein edit distance table.”
- “I can score candidates with multiple signals (distance, keyboard proximity, frequency, and context).”
- “I can justify when an autocorrect system should not correct (the KEEP decision).”
Success criteria
You should be able to respond to the following, in one or two sentences each:
- Why can using distance alone result in bad suggestions?
- How does keyboard adjacency changes the ranking? Why?
- When to KEEP the typed word, as-is, rather than autocorrecting.
The pipeline we used
- Candidate generation: List bank words within edit distance ≤ 2 of the misspelling (use a Levenshtein distance grid if needed).
- Score each candidate with this formula:
TOTAL = (−Distance) + Frequency (+0/+1/+2) + Keyboard (+0…+2) + Context (+0/+2)
- Distance: Levenshtein distance (insert/delete/substitute cost = 1).
- Keyboard adjacency: +1 for each substitution where the letters are adjacent on QWERTY (horizontal, vertical, or diagonal neighbors), cap +2. Insertions/deletions earn 0.
- Frequency: Common = +2, Medium = +1, Rare = +0.
- Context: If the candidate fits a provided blank, +2; else 0.
- KEEP rule: If the best TOTAL ≤ −1, the system does not autocorrect.
Why more than distance?
Distance narrows the search, but it can be fooled by rare words or keyboard typos. Frequency says what users usually type, keyboard says what typos are likely, and context checks whether the word fits the sentence.
Quick example
Misspelling: tehn; context: “I’ll do it ___.” (bank includes then, than, this, them, thin, thus)
- Distance tends to be small for then, than, them, thin (2 edits).
- Keyboard adjacency favors then (letters swapped are near), and context strongly favors then (“do it then”).
- TOTAL comes out largest for then, so we autocorrect to then.
Exercise
Pick a short misspelling (≤ 5 letters), make a small bank of similar words, and score them with the formula. Check distances with the class demo if you like.
Curriculum alignment
Here is what students practiced today:
- Knowledge & Understanding: terms (candidate, frequency tier, keyboard adjacency), recall Levenshtein.
- Thinking: balancing signals, tuning the KEEP threshold, defending a choice with numbers.
- Communication: short oral/written justifications for your ranking.
- Application: running the full pipeline on new slips and verifying distances.
Here is a summary of Ontario curriculum connections for this course:
- A. Programming Concepts & Skills: algorithmic thinking; explicit cost models and decision rules.
- B. Software Development: iterate and tune a heuristic; verify part of the system (distance) against an oracle.
- C. Designing Modular Programs & Algorithm Analysis: separate candidate generation from ranking; reason about trade-offs.
- D. Topics in CS: user impact of over/under-correction; trust and usability.