jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XVIII

Gates & Long Memory

A vanilla RNN forgets quickly. LSTMs and GRUs add small learned controllers — gates — that decide what to keep, what to forget, and what to add. The fix that made sequence learning work for a decade.

The concept

An LSTM wraps the recurrent unit in three sigmoid gates: forget, input, output. Each is a learned vector in (0, 1) that scales information as it flows through.

The cell state c_t is a separate channel — a memory bus that runs along the top of the cell, gently modified by the gates rather than rewritten each step. That's the trick: gradients can flow back along the cell state with minimal multiplication, dodging the vanishing-gradient pathology of vanilla RNNs.
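The mechanics fit in a few lines. Below is a minimal single-step LSTM cell in NumPy, a sketch rather than any particular library's implementation; the stacked 4H-row weight layout and the names lstm_step, W, U, b are illustrative choices, not a fixed convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W (4H x D), U (4H x H), b (4H,) stack the parameters
    for the forget, input, candidate, and output paths."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # all four pre-activations at once
    f = sigmoid(z[0*H:1*H])           # forget gate: how much of c_prev survives
    i = sigmoid(z[1*H:2*H])           # input gate: how much new memory to write
    g = np.tanh(z[2*H:3*H])           # candidate memory
    o = sigmoid(z[3*H:4*H])           # output gate: how much of c to expose
    c = f * c_prev + i * g            # the memory bus: one multiply, one add
    h = o * np.tanh(c)                # hidden state actually used downstream
    return h, c

# Tiny smoke test: input size 2, hidden size 3, all-zero weights.
h, c = lstm_step(np.ones(2), np.zeros(3), np.zeros(3),
                 np.zeros((12, 2)), np.zeros((12, 3)), np.zeros(12))
```

The single multiply-and-add on c is the whole trick: along the direct path back through that line, the gradient is scaled elementwise by f rather than squeezed through a full weight matrix at every step, which is why the memory channel resists vanishing.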

The GRU is a leaner variant with two gates instead of three. Roughly a quarter fewer parameters for the same hidden size (three weight blocks instead of four); in practice, very similar performance on most tasks.
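For comparison, here is the same kind of sketch for a GRU, following the common convention in which the update gate z interpolates between the old state and a candidate; as above, the stacked layout and names are assumptions made for readability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U, b):
    """One GRU step. W (3H x D), U (3H x H), b (3H,) stack the parameters
    for the update (z), reset (r), and candidate paths."""
    H = h_prev.shape[0]
    z = sigmoid(W[0*H:1*H] @ x + U[0*H:1*H] @ h_prev + b[0*H:1*H])  # update gate
    r = sigmoid(W[1*H:2*H] @ x + U[1*H:2*H] @ h_prev + b[1*H:2*H])  # reset gate
    h_cand = np.tanh(W[2*H:3*H] @ x + U[2*H:3*H] @ (r * h_prev) + b[2*H:3*H])
    h = (1 - z) * h_prev + z * h_cand   # one state plays both roles: memory and output
    return h

# Tiny smoke test: input size 2, hidden size 3, all-zero weights.
h = gru_step(np.ones(2), np.zeros(3), np.zeros((9, 2)), np.zeros((9, 3)), np.zeros(9))
```

Note there is no separate cell state: the interpolation on the last line of gru_step is the GRU's memory bus.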

Why ML cares

Before transformers, LSTMs were the workhorse of every sequential task — language modeling, machine translation, speech recognition, time series. The gates' ability to carry information across hundreds of steps was the difference between "useful" and "useless" on real text.

Even today, LSTMs survive in low-resource settings (mobile, edge devices) and in tasks where memory is small but sequence length is large. The gating idea also reappears in newer architectures: Mamba's selectivity mechanism is, in spirit, a learned forget gate.

Try this
  1. Hit Replay on the default sequence. Watch the cell state c stay relatively stable while the gates fluctuate per character — that's the long-memory channel doing its job.
  2. Toggle each gate in turn (force open or closed). Forcing the forget gate to 0 rebuilds the cell state from scratch at every step, so long-range memory disappears; see the sketch just after this list for the same experiment in code.
  3. Compare the cell state c with the hidden state h. h changes more drastically; c drifts gently. That's the architectural separation between "what to use right now" (h) and "what to remember" (c).
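An offline stand-in for step 2, assuming nothing about how the page itself is wired: the sketch below runs random inputs through an LSTM with random weights, optionally clamps the forget gate, and tracks the size of the cell state. The function run and all the sizes are hypothetical, chosen only to make the effect visible.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run(T=50, D=4, H=8, force_forget=None, seed=0):
    """Push T random inputs through an LSTM with random weights and return
    the cell-state norm after each step. force_forget=0.0 clamps the forget
    gate shut, 1.0 clamps it open, None leaves it driven by the weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.3, size=(4 * H, D))
    U = rng.normal(scale=0.3, size=(4 * H, H))
    b = np.zeros(4 * H)
    h, c = np.zeros(H), np.zeros(H)
    norms = []
    for _ in range(T):
        x = rng.normal(size=D)
        z = W @ x + U @ h + b
        f = sigmoid(z[:H]) if force_forget is None else np.full(H, force_forget)
        i, g, o = sigmoid(z[H:2*H]), np.tanh(z[2*H:3*H]), sigmoid(z[3*H:])
        c = f * c + i * g        # with f clamped to 0, c is rebuilt from scratch each step
        h = o * np.tanh(c)
        norms.append(float(np.linalg.norm(c)))
    return norms

# Forget gate forced shut, forced open, and left alone:
print(run(force_forget=0.0)[-1], run(force_forget=1.0)[-1], run()[-1])
```

With the gate clamped shut the norm should stay small and flat; clamped open it typically drifts upward as contributions accumulate, the numeric counterpart of what the replay shows.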
· The cell-state bus runs across the top — a near-linear conveyor belt for memory. Three sigmoid valves pinch or open it: forget (multiplies the bus), input (gates new candidate memory before it merges), output (gates what leaves as h_t). Each valve's openness is a learned function of the current input and previous hidden state.
Where you've seen this · 4 examples
Pre-2018 Google Translate

The Google Neural Machine Translation system (2016) was a stack of LSTMs with attention — a technical leap that ended decades of phrase-based statistical translation overnight.

Siri, Alexa, voice typing

Stacks of LSTMs (bidirectional where latency allowed) turned streams of audio frames into text in real time. The recipe held until transformer-based models, and later Whisper, displaced it in the early 2020s.

Drug-target interaction prediction

LSTMs over molecular SMILES strings predict whether a candidate compound binds a protein. Fast inference + small model size keep them in production at pharmaceutical companies.

Atari-playing agents

DeepMind's original DQN stacked recent frames rather than using recurrence, but successor agents such as DRQN and R2D2 added LSTMs to carry context across many frames, crucial for games where the relevant signal is only partially observable in any single frame.

Further reading