NOTE

This text was produced by ChatGPT 5 Thinking, then edited for clarity and length by Mr. Gordon.

How does word candidate generation work for an autocorrect system?

This article explains how an autocorrect engine quickly produces a small set of likely candidates to score — without checking the entire dictionary.

Big picture

When you type a misspelling like tehn, the keyboard:

  1. Looks up likely words from a compact lexicon (trie/DAWG).
  2. Traverses that structure while tracking Levenshtein edit distance ≤ k (usually 1–2).
  3. Adds keyboard‑nearby substitutions first (typos you’re likely to make).
  4. Keeps only the top‑N cheapest/most promising candidates (a small beam).
  5. Hands this tiny list to the ranker: TOTAL = −distance + frequency + keyboard + context.

What’s a trie?

A trie is a tree of characters. Every path from the root spells a prefix; words are marked at terminal nodes.


flowchart LR
    r1((root))
    r1 --> n_t[t]
    n_t --> n_th[th]
    n_th --> n_the[the]
    n_the --> n_them["them (word)"]
    n_the --> n_then["then (word)"]
    n_th --> n_tha[tha]
    n_tha --> n_than["than (word)"]
    n_t --> n_th2[th]
    n_th2 --> n_thi[thi]
    n_thi --> n_this["this (word)"]

Why it helps: Prefix search is fast — as you type th…, you’re already at the th node; you only explore children below it.

What’s a DAWG?

A DAWG (Directed Acyclic Word Graph) merges identical suffix subtrees of a trie to save memory.


flowchart LR
    r2((root))
    r2 --> t0[t]
    t0 --> h0[h]
    h0 --> a0[a]
    h0 --> e0[e]
    a0 --> n0[n]
    e0 --> sfx["shared e-suffix"]
    sfx --> w_them["them (word)"]
    sfx --> w_then["then (word)"]
    a0 --> w_than["than (word)"]
    t0 --> h1[h]
    h1 --> i0[i]
    i0 --> w_this["this (word)"]

Concept sketch: Instead of duplicating the e → (m|n) tail, a DAWG shares it.

Trade‑off: Same power as a trie for lookup; much smaller in RAM, slightly trickier to build.

  • trie → pronounced like “tree” (/triː/).
    Originates from “retrieval.” Some people say “try,” but that’s less common.

  • DAWG (Directed Acyclic Word Graph) → pronounced like “dog” (/dɔːɡ/ or /dɑːɡ/).
    You may also hear people spell it out “D‑A‑W‑G.” The acronym’s spelling hints at the “dog” pronunciation.

Safe choices in mixed company: say “tree” and “dog.”

Traversing

We walk the trie/DAWG while carrying the current Levenshtein edit costs for the prefix we’ve read. Paths whose best possible distance already exceeds k are pruned.

Intuition: you don’t check all words — you only follow branches that could still end within k edits of what the user typed.

Tiny example

Misspelling: teh, limit k = 1

  • At root: state says “0 edits used.”
  • Follow t (match) → still 0 edits used.
  • From te vs target teh:
    • Insert a letter, Delete a typed letter, or Substitute → each costs 1.
    • Path t → h without the e exceeds k=1? Prune.
    • Path t → e → h reaches the within 1 edit → keep the.

Reasonable substitutions

We bias expansions toward likely typos by preferring neighboring keys (8‑neighborhood: horizontal/vertical/diagonal).


flowchart LR
    a1[typed: t] -->|adjacent| y1[try y]
    a1 -->|adjacent| r1[try r]
    a1 -->|adjacent| g1[try g]
    %% Make the link labeled "far" and then style it dotted:
    a1 -- far --- q1["try q<br/>(deprioritized)"]

During generation, we expand adjacent letters before distant ones and/or give them lower provisional cost, so they’re more likely to stay in the beam.

Keep it tiny

Maintain a beam or small list (e.g., top 20) of partial paths with the lowest provisional cost. When a new path is worse than the worst in the beam, drop it. This gives you milliseconds‑fast candidate sets.

Putting it together

Candidate generator (pseudocode):

function candidates(typo, k=2, beam_size=20):
  beam = [(root, state_for_empty(typo), cost=0, prefix="")]
  results = {}
 
  while beam not empty:
    node, state, cost, pref = pop_lowest_cost(beam)
 
    if is_word(node) and state.min_cost <= k:
        results[pref] = min(results.get(pref, +∞), state.min_cost)
 
    for ch, child in expand_children(node, prefer_adjacent=True):
        next_state = advance_levenshtein(state, ch)
        next_cost  = next_state.min_cost
        if next_cost <= k:
            push(beam, (child, next_state, next_cost, pref + ch))
 
    trim_to_top_k(beam, beam_size)
 
  return top_N_by_cost(results)
  • advance_levenshtein updates the edit distance table for one more character.
  • prefer_adjacent=True skews which edges we try first.
  • results maps each discovered word to its minimum edit cost (to be used by your scoring sheet).

Score and decide

Take the small candidate list and apply your class ranker:

Pick the highest TOTAL. If the best score is ≤ −1, KEEP what the user typed.

One‑page summary

  • Trie/DAWG: fast map from prefixes to words; lets you avoid scanning the whole dictionary.
  • Levenshtein automaton: carries edit costs as you walk; prunes hopeless paths.
  • Keyboard bias: try nearby keys first; give them cheaper provisional cost.
  • Beam: keep only the few most promising partial paths at any time.

That’s how a keyboard can propose the “right” word in real time.