Gates & Long Memory
A vanilla RNN forgets quickly. LSTMs and GRUs add small learned controllers — gates — that decide what to keep, what to forget, and what to add. The fix that made sequence learning work for a decade.
An LSTM wraps the recurrent unit in three sigmoid gates: forget, input, output. Each is a learned vector in (0, 1) that scales information as it flows through.
The cell state c_t is a separate channel — a memory bus that runs along the top of the cell, gently modified by the gates rather than rewritten each step. That's the trick: gradients can flow back along the cell state with minimal multiplication, dodging the vanishing-gradient pathology of vanilla RNNs.
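The gate arithmetic is compact enough to sketch directly. Here is a minimal single-step LSTM in NumPy, following the standard equations: c_t = f ⊙ c_{t-1} + i ⊙ g, h_t = o ⊙ tanh(c_t). The (f, i, g, o) stacking order and the toy sizes are arbitrary choices for this sketch, not a library convention:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked parameters for the
    forget (f), input (i), candidate (g), and output (o) transforms."""
    z = W @ x + U @ h_prev + b                     # all four pre-activations at once
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates: learned vectors in (0, 1)
    g = np.tanh(g)                                 # candidate values in (-1, 1)
    c = f * c_prev + i * g                         # memory bus: scale and add, never rewrite
    h = o * np.tanh(c)                             # hidden state: gated readout of c
    return h, c

# toy sizes, random parameters
d_in, d_hid = 4, 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * d_hid, d_in))
U = rng.normal(scale=0.1, size=(4 * d_hid, d_hid))
b = np.zeros(4 * d_hid)

h, c = np.zeros(d_hid), np.zeros(d_hid)
for _ in range(5):
    h, c = lstm_step(rng.normal(size=d_in), h, c, W, U, b)
```

Note the shape of the cell-state update: `c` is only ever scaled by `f` and nudged by `i * g`. Backpropagating through that line multiplies gradients by `f`, not by a weight matrix, which is exactly the "minimal multiplication" path described above.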
The GRU is a leaner variant with two gates instead of three. Slightly fewer parameters; in practice, very similar performance on most tasks.
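For contrast, a GRU step under the same conventions as the sketch above. It keeps only an update gate z and a reset gate r, and has no separate cell state: the hidden state itself is the memory, updated by interpolation. (The interpolation direction, z toward the new candidate, varies between papers and libraries; this is one common convention.)

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U, b):
    """One GRU step: two gates (update z, reset r), no separate cell state."""
    d = h_prev.shape[0]
    zr = sigmoid(W[:2 * d] @ x + U[:2 * d] @ h_prev + b[:2 * d])
    z, r = np.split(zr, 2)
    # candidate state reads a reset-gated copy of the previous state
    h_tilde = np.tanh(W[2 * d:] @ x + U[2 * d:] @ (r * h_prev) + b[2 * d:])
    return (1 - z) * h_prev + z * h_tilde  # update gate interpolates old vs. new

# toy sizes, random parameters (3 stacked transforms vs. the LSTM's 4)
d_in, d_hid = 4, 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3 * d_hid, d_in))
U = rng.normal(scale=0.1, size=(3 * d_hid, d_hid))
b = np.zeros(3 * d_hid)

h = np.zeros(d_hid)
for _ in range(5):
    h = gru_step(rng.normal(size=d_in), h, W, U, b)
```

The parameter count drops from four stacked transforms to three, which is where the "slightly fewer parameters" comes from.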
Before transformers, LSTMs were the workhorse of every sequential task — language modeling, machine translation, speech recognition, time series. The gates' ability to carry information across hundreds of steps was the difference between "useful" and "useless" on real text.
Even today, LSTMs survive in low-resource settings (mobile, edge devices) and in tasks where memory is small but sequence length is large. The gating idea also reappears in newer architectures: Mamba's selectivity mechanism is, in spirit, a learned forget gate.
- Hit Replay on the default sequence. Watch the cell state c stay relatively stable while the gates fluctuate per character — that's the long-memory channel doing its job.
- Toggle each gate in turn (force open or closed). Forcing the forget gate to 0 wipes memory at every step — the network reverts to no-memory behavior.
- Compare the cell state c with the hidden state h. h changes more drastically; c drifts gently. That's the architectural separation between "what to use right now" (h) and "what to remember" (c).
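The gate-forcing experiment can be reproduced outside the playground. The sketch below (random, untrained parameters; `force_forget` is an illustrative knob mirroring the gate toggle) compares how much the cell state moves per step with the learned forget gate versus with f clamped to 0, where c is rebuilt from scratch every step:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b, force_forget=None):
    """One LSTM step; force_forget clamps the forget gate (playground-style toggle)."""
    z = W @ x + U @ h_prev + b
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f) if force_forget is None else np.full_like(c_prev, force_forget)
    i, o = sigmoid(i), sigmoid(o)
    c = f * c_prev + i * np.tanh(g)   # with f = 0, c depends only on the current step
    return o * np.tanh(c), c

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
W = rng.normal(scale=0.1, size=(4 * d_hid, d_in))
U = rng.normal(scale=0.1, size=(4 * d_hid, d_hid))
b = np.zeros(4 * d_hid)

xs = rng.normal(size=(20, d_in))
for force in (None, 0.0):
    h, c = np.zeros(d_hid), np.zeros(d_hid)
    drift = []
    for x in xs:
        c_prev = c.copy()
        h, c = lstm_step(x, h, c, W, U, b, force_forget=force)
        drift.append(np.abs(c - c_prev).mean())
    label = "learned forget gate" if force is None else "forget gate forced to 0"
    print(f"{label}: mean |delta c| per step = {np.mean(drift):.4f}")
```

With f forced to 0, the previous cell state is erased before each update, so c carries nothing across steps: the long-memory channel is gone, which is the "no-memory behavior" the toggle demonstrates.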
The Google Neural Machine Translation system (2016) was a stack of LSTMs with attention — a technical leap that ended decades of phrase-based statistical translation overnight.
Bidirectional LSTMs took streams of audio frames and emitted text in real time. The recipe held until transformer-based models displaced it in the early 2020s, with Whisper-style systems completing the shift.
LSTMs over molecular SMILES strings predict whether a candidate compound binds a protein. Fast inference + small model size keep them in production at pharmaceutical companies.
Recurrent successors to DeepMind's DQN used LSTMs to remember context across many frames — crucial for games where the relevant signal is only partially observable in any single frame.
- Understanding LSTM Networks (essay, Christopher Olah) · The diagrams in every textbook are reproduced from this essay. Required reading.
- Long Short-Term Memory (paper, 1997, Hochreiter & Schmidhuber) · The original LSTM paper. Twenty years ahead of its time and arguably the single most important paper in modern deep learning.
- Empirical Evaluation of Gated Recurrent Networks (paper, 2014, Chung et al.) · The systematic comparison of LSTMs and GRUs that established their rough equivalence.
- Attention and Augmented Recurrent Networks (essay, Distill, Olah & Carter) · The bridge from LSTMs to attention — what came next, and why.