Candidate Generation

NOTE

This text was produced by ChatGPT 5 Thinking, then edited for clarity and length by Mr. Gordon.

How does word candidate generation work for an autocorrect system?

This article explains how an autocorrect engine quickly produces a small set of likely candidates to score — without checking the entire dictionary.

Big picture

When you type a misspelling like tehn, the keyboard:

Looks up likely words from a compact lexicon (trie/DAWG).
Traverses that structure while tracking Levenshtein edit distance ≤ k (usually 1–2).
Adds keyboard‑nearby substitutions first (typos you’re likely to make).
Keeps only the top‑N cheapest/most promising candidates (a small beam).
Hands this tiny list to the ranker: TOTAL = −distance + frequency + keyboard + context.

What’s a trie?

A trie is a tree of characters. Every path from the root spells a prefix; words are marked at terminal nodes.


flowchart LR
    r1((root))
    r1 --> n_t[t]
    n_t --> n_th[th]
    n_th --> n_the[the]
    n_the --> n_them["them (word)"]
    n_the --> n_then["then (word)"]
    n_th --> n_tha[tha]
    n_tha --> n_than["than (word)"]
    n_t --> n_th2[th]
    n_th2 --> n_thi[thi]
    n_thi --> n_this["this (word)"]

Why it helps: Prefix search is fast — as you type th…, you’re already at the th node; you only explore children below it.

What’s a DAWG?

A DAWG (Directed Acyclic Word Graph) merges identical suffix subtrees of a trie to save memory.


flowchart LR
    r2((root))
    r2 --> t0[t]
    t0 --> h0[h]
    h0 --> a0[a]
    h0 --> e0[e]
    a0 --> n0[n]
    e0 --> sfx["shared e-suffix"]
    sfx --> w_them["them (word)"]
    sfx --> w_then["then (word)"]
    a0 --> w_than["than (word)"]
    t0 --> h1[h]
    h1 --> i0[i]
    i0 --> w_this["this (word)"]

Concept sketch: Instead of duplicating the e → (m|n) tail, a DAWG shares it.

Trade‑off: Same power as a trie for lookup; much smaller in RAM, slightly trickier to build.

trie → pronounced like “tree” (/triː/).
Originates from “retrieval.” Some people say “try,” but that’s less common.
DAWG (Directed Acyclic Word Graph) → pronounced like “dog” (/dɔːɡ/ or /dɑːɡ/).
You may also hear people spell it out “D‑A‑W‑G.” The acronym’s spelling hints at the “dog” pronunciation.

Safe choices in mixed company: say “tree” and “dog.”

Traversing

We walk the trie/DAWG while carrying the current Levenshtein edit costs for the prefix we’ve read. Paths whose best possible distance already exceeds k are pruned.

Intuition: you don’t check all words — you only follow branches that could still end within k edits of what the user typed.

Tiny example

Misspelling: teh, limit k = 1

At root: state says “0 edits used.”
Follow t (match) → still 0 edits used.
From te vs target teh:
- Insert a letter, Delete a typed letter, or Substitute → each costs 1.
- Path t → h without the e exceeds k=1? Prune.
- Path t → e → h reaches the within 1 edit → keep the.

Reasonable substitutions

We bias expansions toward likely typos by preferring neighboring keys (8‑neighborhood: horizontal/vertical/diagonal).


flowchart LR
    a1[typed: t] -->|adjacent| y1[try y]
    a1 -->|adjacent| r1[try r]
    a1 -->|adjacent| g1[try g]
    %% Make the link labeled "far" and then style it dotted:
    a1 -- far --- q1["try q<br/>(deprioritized)"]

During generation, we expand adjacent letters before distant ones and/or give them lower provisional cost, so they’re more likely to stay in the beam.

Keep it tiny

Maintain a beam or small list (e.g., top 20) of partial paths with the lowest provisional cost. When a new path is worse than the worst in the beam, drop it. This gives you milliseconds‑fast candidate sets.

Putting it together

Candidate generator (pseudocode):

function candidates(typo, k=2, beam_size=20):
  beam = [(root, state_for_empty(typo), cost=0, prefix="")]
  results = {}
 
  while beam not empty:
    node, state, cost, pref = pop_lowest_cost(beam)
 
    if is_word(node) and state.min_cost <= k:
        results[pref] = min(results.get(pref, +∞), state.min_cost)
 
    for ch, child in expand_children(node, prefer_adjacent=True):
        next_state = advance_levenshtein(state, ch)
        next_cost  = next_state.min_cost
        if next_cost <= k:
            push(beam, (child, next_state, next_cost, pref + ch))
 
    trim_to_top_k(beam, beam_size)
 
  return top_N_by_cost(results)

advance_levenshtein updates the edit distance table for one more character.
prefer_adjacent=True skews which edges we try first.
results maps each discovered word to its minimum edit cost (to be used by your scoring sheet).

Score and decide

Take the small candidate list and apply your class ranker:

TOTAL = - Distance + Frequency + KeyboardAdjacency + Context

Pick the highest TOTAL. If the best score is ≤ −1, KEEP what the user typed.

One‑page summary

Trie/DAWG: fast map from prefixes to words; lets you avoid scanning the whole dictionary.
Levenshtein automaton: carries edit costs as you walk; prunes hopeless paths.
Keyboard bias: try nearby keys first; give them cheaper provisional cost.
Beam: keep only the few most promising partial paths at any time.

That’s how a keyboard can propose the “right” word in real time.

📱 ICS4U S1

Navigate this site

Candidate Generation

Big picture

What’s a trie?

What’s a DAWG?

Sidebar: Pronunciation

Traversing

Tiny example

Reasonable substitutions

Keep it tiny

Putting it together

Score and decide

One‑page summary

Navigate this page

When did we do this?