CNNs for sequences

Convolutional networks slide learnable filters over the token sequence to capture local patterns, much like character or word n-gram features. All positions are computed in parallel across the whole sequence rather than step by step.

Key idea

Stack 1-D convolutions: each layer widens the receptive field, so deep stacks (or dilated convolutions) can cover long contexts. Causal convolutions see only the current and earlier positions, never future tokens, so the model conditions only on the past, which next-token prediction requires.
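A minimal sketch of one causal (optionally dilated) 1-D convolution may make the masking concrete. It works on a single channel of plain floats; the function name `causal_conv1d` and the convention that `weights[0]` applies to the current step are assumptions made for illustration, not part of any library.

```python
def causal_conv1d(x, weights, dilation=1):
    """Causal 1-D convolution over a list of floats.

    Output position t combines x[t], x[t - dilation], x[t - 2*dilation], ...
    Positions before the start of the sequence count as zeros (the same
    effect as left-padding), so no output ever reads a future input.
    """
    out = []
    for t in range(len(x)):
        total = 0.0
        for j, w in enumerate(weights):
            src = t - j * dilation  # j = 0 is the current step
            if src >= 0:
                total += w * x[src]
        out.append(total)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [0.5, 0.3, 0.2]  # kernel size 3; weights[0] applies to the current step

y = causal_conv1d(x, w)

# Changing a later input must leave all earlier outputs untouched.
x_changed = x[:3] + [100.0, 5.0]
y_changed = causal_conv1d(x_changed, w)
assert y[:3] == y_changed[:3]
```

Because every output position is an independent sum, all positions can be computed at once; the loop over `t` is only there to keep the sketch readable.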

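For stride-1 layers the receptive field is 1 + the sum over layers of (kernel size − 1) × dilation. It grows linearly with depth in an undilated stack and exponentially when the dilation doubles at each layer. It is fixed by the architecture and does not grow with sequence position.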
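As a rough worked example, the receptive field of a stack of stride-1 causal convolutions can be computed as 1 plus the sum of (kernel size − 1) × dilation over the layers. The helper name `receptive_field` is hypothetical, and the sketch assumes every layer shares one kernel size.

```python
def receptive_field(kernel_size, dilations):
    """Input positions visible to one output of a stack of stride-1
    causal convolutions, with one layer per entry in `dilations`."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Four undilated layers, kernel size 3: linear growth with depth.
print(receptive_field(3, [1, 1, 1, 1]))  # 9

# Four layers with dilation doubling per layer: exponential growth.
print(receptive_field(3, [1, 2, 4, 8]))  # 31
```

The same four layers cover more than three times as much context once dilation is used, which is why dilated stacks are the usual route to long contexts.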

Inputs & representation

  • Input: token (or character) embeddings arranged as a 1-D signal.
  • Model: stacked causal/dilated 1-D convs + residual connections.
  • Output: next-token distribution at each position, all positions at once.
[emb][emb][emb][emb]
   \  |  /              causal conv (kernel=3, masked)
   [ feature ]
       ↓ stack/dilate → wider context
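
The pipeline above (embeddings, stacked causal/dilated convolutions with residual connections, one next-token distribution per position) could be sketched roughly as follows. Everything here is a toy assumption: the weights are random and untrained, the convolution is depthwise (each channel filtered separately with shared taps) to keep the code short, ReLU is an arbitrary choice of activation, and all names are hypothetical.

```python
import math
import random


def causal_depthwise_conv(seq, taps, dilation):
    """seq is a list of T vectors. Each channel is filtered independently
    with the same taps; taps[0] applies to the current step."""
    dim = len(seq[0])
    out = []
    for t in range(len(seq)):
        vec = [0.0] * dim
        for j, w in enumerate(taps):
            src = t - j * dilation
            if src >= 0:  # positions before the start act as zero padding
                for c in range(dim):
                    vec[c] += w * seq[src][c]
        out.append(vec)
    return out


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def next_token_distributions(token_ids, embed, layers, out_proj):
    """Return one distribution over the vocabulary per input position."""
    h = [list(embed[i]) for i in token_ids]
    for taps, dilation in layers:
        conv = causal_depthwise_conv(h, taps, dilation)
        # Residual connection: layer input plus ReLU of the conv output.
        h = [[a + max(0.0, b) for a, b in zip(hv, cv)]
             for hv, cv in zip(h, conv)]
    dists = []
    for vec in h:
        logits = [sum(w * v for w, v in zip(row, vec)) for row in out_proj]
        dists.append(softmax(logits))
    return dists


random.seed(0)
VOCAB, DIM = 5, 4  # toy sizes, chosen only for the example
embed = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
out_proj = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
layers = [([0.6, 0.4], 1), ([0.6, 0.4], 2)]  # (taps, dilation) per layer

dists = next_token_distributions([0, 3, 1, 4], embed, layers, out_proj)
print(len(dists), len(dists[0]))  # 4 5
```

Each of the four positions gets its own distribution in a single pass, and the distribution at position t depends only on tokens up to t.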

How it applies to autocomplete

Character-level CNNs handle morphology and typos well; word-level CNNs capture short phrase patterns. Training is highly parallel, but the fixed receptive field caps how much history can influence a completion: anything typed before that window is ignored.
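To illustrate the bounded history, the sketch below keeps only the part of a typed prefix that falls inside the receptive field. The helper `visible_history` is hypothetical, and the receptive field of 9 characters is an arbitrary value for the example.

```python
def visible_history(prefix, receptive_field):
    """Return the part of the typed prefix that can influence the next
    prediction: its last `receptive_field` tokens (characters here)."""
    if receptive_field < 1:
        raise ValueError("receptive_field must be at least 1")
    return prefix[-receptive_field:]


a = visible_history("the quick brown fo", 9)
b = visible_history("a slow brown fo", 9)
print(repr(a))  # ' brown fo'
print(a == b)   # True
```

The two prefixes differ only outside the window, so a model with this receptive field would produce the same completion for both.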

Trade-offs

Strength                            | Weakness
Fully parallel (fast training)      | Receptive field is bounded by depth
Strong at local / sub-word patterns | Long-range context needs many layers
Robust to typos (char-level)        | Less common than transformers for LMs

References

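  • Bai, Kolter & Koltun (2018), "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling" (Temporal Convolutional Networks, TCN).
  • van den Oord et al. (2016), "WaveNet: A Generative Model for Raw Audio" (dilated causal convolutions).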

Notes / TODO

  • Try a char-level causal CNN for typo robustness.
  • Measure how receptive-field size affects completion quality.