NOTE
This text was produced by ChatGPT 5 Thinking, then edited for clarity and length by Mr. Gordon.
How does word candidate generation work for an autocorrect system?
This article explains how an autocorrect engine quickly produces a small set of likely candidates to score — without checking the entire dictionary.
Big picture
When you type a misspelling like tehn, the keyboard:
- Looks up likely words from a compact lexicon (trie/DAWG).
- Traverses that structure while tracking Levenshtein edit distance ≤ k (usually 1–2).
- Adds keyboard‑nearby substitutions first (typos you’re likely to make).
- Keeps only the top‑N cheapest/most promising candidates (a small beam).
- Hands this tiny list to the ranker:
TOTAL = −distance + frequency + keyboard + context.
What’s a trie?
A trie is a tree of characters. Every path from the root spells a prefix; words are marked at terminal nodes.
flowchart LR r1((root)) r1 --> n_t[t] n_t --> n_th[th] n_th --> n_the[the] n_the --> n_them["them (word)"] n_the --> n_then["then (word)"] n_th --> n_tha[tha] n_tha --> n_than["than (word)"] n_t --> n_th2[th] n_th2 --> n_thi[thi] n_thi --> n_this["this (word)"]
Why it helps: Prefix search is fast — as you type th…, you’re already at the th node; you only explore children below it.
What’s a DAWG?
A DAWG (Directed Acyclic Word Graph) merges identical suffix subtrees of a trie to save memory.
flowchart LR r2((root)) r2 --> t0[t] t0 --> h0[h] h0 --> a0[a] h0 --> e0[e] a0 --> n0[n] e0 --> sfx["shared e-suffix"] sfx --> w_them["them (word)"] sfx --> w_then["then (word)"] a0 --> w_than["than (word)"] t0 --> h1[h] h1 --> i0[i] i0 --> w_this["this (word)"]
Concept sketch: Instead of duplicating the e → (m|n) tail, a DAWG shares it.
Trade‑off: Same power as a trie for lookup; much smaller in RAM, slightly trickier to build.
Sidebar: Pronunciation
-
trie → pronounced like “tree” (/triː/).
Originates from “retrieval.” Some people say “try,” but that’s less common. -
DAWG (Directed Acyclic Word Graph) → pronounced like “dog” (/dɔːɡ/ or /dɑːɡ/).
You may also hear people spell it out “D‑A‑W‑G.” The acronym’s spelling hints at the “dog” pronunciation.
Safe choices in mixed company: say “tree” and “dog.”
Traversing
We walk the trie/DAWG while carrying the current Levenshtein edit costs for the prefix we’ve read. Paths whose best possible distance already exceeds k are pruned.
Intuition: you don’t check all words — you only follow branches that could still end within k edits of what the user typed.
Tiny example
Misspelling: teh, limit k = 1
- At root: state says “0 edits used.”
- Follow
t(match) → still 0 edits used. - From
tevs targetteh:- Insert a letter, Delete a typed letter, or Substitute → each costs 1.
- Path
t → hwithout theeexceedsk=1? Prune. - Path
t → e → hreaches the within 1 edit → keepthe.
Reasonable substitutions
We bias expansions toward likely typos by preferring neighboring keys (8‑neighborhood: horizontal/vertical/diagonal).
flowchart LR a1[typed: t] -->|adjacent| y1[try y] a1 -->|adjacent| r1[try r] a1 -->|adjacent| g1[try g] %% Make the link labeled "far" and then style it dotted: a1 -- far --- q1["try q<br/>(deprioritized)"]
During generation, we expand adjacent letters before distant ones and/or give them lower provisional cost, so they’re more likely to stay in the beam.
Keep it tiny
Maintain a beam or small list (e.g., top 20) of partial paths with the lowest provisional cost. When a new path is worse than the worst in the beam, drop it. This gives you milliseconds‑fast candidate sets.
Putting it together
Candidate generator (pseudocode):
function candidates(typo, k=2, beam_size=20):
beam = [(root, state_for_empty(typo), cost=0, prefix="")]
results = {}
while beam not empty:
node, state, cost, pref = pop_lowest_cost(beam)
if is_word(node) and state.min_cost <= k:
results[pref] = min(results.get(pref, +∞), state.min_cost)
for ch, child in expand_children(node, prefer_adjacent=True):
next_state = advance_levenshtein(state, ch)
next_cost = next_state.min_cost
if next_cost <= k:
push(beam, (child, next_state, next_cost, pref + ch))
trim_to_top_k(beam, beam_size)
return top_N_by_cost(results)advance_levenshteinupdates the edit distance table for one more character.prefer_adjacent=Trueskews which edges we try first.resultsmaps each discovered word to its minimum edit cost (to be used by your scoring sheet).
Score and decide
Take the small candidate list and apply your class ranker:
Pick the highest TOTAL. If the best score is ≤ −1, KEEP what the user typed.
One‑page summary
- Trie/DAWG: fast map from prefixes to words; lets you avoid scanning the whole dictionary.
- Levenshtein automaton: carries edit costs as you walk; prunes hopeless paths.
- Keyboard bias: try nearby keys first; give them cheaper provisional cost.
- Beam: keep only the few most promising partial paths at any time.
That’s how a keyboard can propose the “right” word in real time.