2 min read

RNNs (LSTM / GRU)

Table of Contents

Recurrent neural networks process tokens one at a time, maintaining a hidden state that summarizes everything seen so far. Unlike n-grams, the context window is (in principle) unbounded.

Key idea

At each step, combine the current token with the previous hidden state to produce a new hidden state and a prediction:

hโ‚œ = f(hโ‚œโ‚‹โ‚, xโ‚œ) โ†’ P(wโ‚œโ‚Šโ‚) = softmax(W ยท hโ‚œ)

LSTM and GRU cells add gates so the network can keep or forget information over long spans, mitigating vanishing gradients.

Inputs & representation

  • Input: a sequence of token embeddings (learned dense vectors).
  • State: a fixed-size hidden vector carried left-to-right.
  • Output: next-token distribution at every position.
x1 โ†’ [RNN] โ†’ h1 โ†’ ลท1
x2 โ†’ [RNN] โ†’ h2 โ†’ ลท2   (h1 feeds into h2)
x3 โ†’ [RNN] โ†’ h3 โ†’ ลท3

How it applies to autocomplete

Feed the typed prefix through the RNN; the final hidden state encodes the context and the softmax gives ranked completions. Naturally streaming โ€” you update the state token-by-token as the user types.

Trade-offs

StrengthWeakness
Unbounded context in theorySequential โ†’ hard to parallelize
Compact, good for on-deviceStruggles with very long dependencies
Streams naturally as user typesLargely superseded by transformers

References

  • Hochreiter & Schmidhuber (1997), Long Short-Term Memory.
  • Cho et al. (2014), GRU.

Notes / TODO

  • Prototype an LSTM next-token model; compare perplexity vs. the n-gram baseline.
  • Measure latency for on-device streaming.