Gates & Long Memory
A vanilla RNN forgets quickly. LSTMs and GRUs add small learned controllers — gates — that decide what to keep, what to forget, and what to add. The fix that made sequence learning work for a decade.
An LSTM wraps the recurrent unit in three sigmoid gates: forget, input, output. Each is a learned vector in (0, 1) that scales information as it flows through.
The cell state c_t is a separate channel — a memory bus that runs along the top of the cell, gently modified by the gates rather than rewritten each step. That's the trick: gradients can flow back along the cell state with minimal multiplication, dodging the vanishing-gradient pathology of vanilla RNNs.
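The gate arithmetic is compact enough to sketch directly. Here is a minimal single-step LSTM in NumPy, following the standard equations: c_t = f ⊙ c_{t-1} + i ⊙ g, h_t = o ⊙ tanh(c_t). The (f, i, g, o) stacking order and the toy sizes are arbitrary choices for this sketch, not a library convention:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked parameters for the
    forget (f), input (i), candidate (g), and output (o) transforms."""
    z = W @ x + U @ h_prev + b                     # all four pre-activations at once
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates: learned vectors in (0, 1)
    g = np.tanh(g)                                 # candidate values in (-1, 1)
    c = f * c_prev + i * g                         # memory bus: scale and add, never rewrite
    h = o * np.tanh(c)                             # hidden state: gated readout of c
    return h, c

# toy sizes, random parameters
d_in, d_hid = 4, 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * d_hid, d_in))
U = rng.normal(scale=0.1, size=(4 * d_hid, d_hid))
b = np.zeros(4 * d_hid)

h, c = np.zeros(d_hid), np.zeros(d_hid)
for _ in range(5):
    h, c = lstm_step(rng.normal(size=d_in), h, c, W, U, b)
```

Note the shape of the cell-state update: `c` is only ever scaled by `f` and nudged by `i * g`. Backpropagating through that line multiplies gradients by `f`, not by a weight matrix, which is exactly the "minimal multiplication" path described above.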
The GRU is a leaner variant with two gates instead of three. Slightly fewer parameters; in practice, very similar performance on most tasks.
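For contrast, a GRU step under the same conventions as the sketch above. It keeps only an update gate z and a reset gate r, and has no separate cell state: the hidden state itself is the memory, updated by interpolation. (The interpolation direction, z toward the new candidate, varies between papers and libraries; this is one common convention.)

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U, b):
    """One GRU step: two gates (update z, reset r), no separate cell state."""
    d = h_prev.shape[0]
    zr = sigmoid(W[:2 * d] @ x + U[:2 * d] @ h_prev + b[:2 * d])
    z, r = np.split(zr, 2)
    # candidate state reads a reset-gated copy of the previous state
    h_tilde = np.tanh(W[2 * d:] @ x + U[2 * d:] @ (r * h_prev) + b[2 * d:])
    return (1 - z) * h_prev + z * h_tilde  # update gate interpolates old vs. new

# toy sizes, random parameters (3 stacked transforms vs. the LSTM's 4)
d_in, d_hid = 4, 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3 * d_hid, d_in))
U = rng.normal(scale=0.1, size=(3 * d_hid, d_hid))
b = np.zeros(3 * d_hid)

h = np.zeros(d_hid)
for _ in range(5):
    h = gru_step(rng.normal(size=d_in), h, W, U, b)
```

The parameter count drops from four stacked transforms to three, which is where the "slightly fewer parameters" comes from.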
Before transformers, LSTMs were the workhorse of every sequential task — language modeling, machine translation, speech recognition, time series. The gates' ability to carry information across hundreds of steps was the difference between "useful" and "useless" on real text.
Even today, LSTMs survive in low-resource settings (mobile, edge devices) and in tasks where memory is small but sequence length is large. The gating idea also reappears in newer architectures: Mamba's selectivity mechanism is, in spirit, a learned forget gate.
- Hit Replay on the default sequence. Watch the cell state c stay relatively stable while the gates fluctuate per character — that's the long-memory channel doing its job.
- Toggle each gate in turn (force open or closed). Forcing the forget gate to 0 wipes memory at every step — the network reverts to no-memory behavior.
- Compare the cell state c with the hidden state h. h changes more drastically; c drifts gently. That's the architectural separation between "what to use right now" (h) and "what to remember" (c).
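The gate-forcing experiment can be reproduced outside the playground. The sketch below (random, untrained parameters; `force_forget` is an illustrative knob mirroring the gate toggle) compares how much the cell state moves per step with the learned forget gate versus with f clamped to 0, where c is rebuilt from scratch every step:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b, force_forget=None):
    """One LSTM step; force_forget clamps the forget gate (playground-style toggle)."""
    z = W @ x + U @ h_prev + b
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f) if force_forget is None else np.full_like(c_prev, force_forget)
    i, o = sigmoid(i), sigmoid(o)
    c = f * c_prev + i * np.tanh(g)   # with f = 0, c depends only on the current step
    return o * np.tanh(c), c

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
W = rng.normal(scale=0.1, size=(4 * d_hid, d_in))
U = rng.normal(scale=0.1, size=(4 * d_hid, d_hid))
b = np.zeros(4 * d_hid)

xs = rng.normal(size=(20, d_in))
for force in (None, 0.0):
    h, c = np.zeros(d_hid), np.zeros(d_hid)
    drift = []
    for x in xs:
        c_prev = c.copy()
        h, c = lstm_step(x, h, c, W, U, b, force_forget=force)
        drift.append(np.abs(c - c_prev).mean())
    label = "learned forget gate" if force is None else "forget gate forced to 0"
    print(f"{label}: mean |delta c| per step = {np.mean(drift):.4f}")
```

With f forced to 0, the previous cell state is erased before each update, so c carries nothing across steps: the long-memory channel is gone, which is the "no-memory behavior" the toggle demonstrates.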
The Google Neural Machine Translation system (2016) was a stack of LSTMs with attention — a technical leap that ended decades of phrase-based statistical translation overnight.
Bidirectional LSTMs took streams of audio frames and emitted text in real time. The recipe held until transformer-based models displaced it in the early 2020s, with Whisper-style systems completing the shift.
LSTMs over molecular SMILES strings predict whether a candidate compound binds a protein. Fast inference + small model size keep them in production at pharmaceutical companies.
Recurrent successors to DeepMind's DQN used LSTMs to remember context across many frames — crucial for games where the relevant signal is only partially observable in any single frame.
- Understanding LSTM Networks (essay, Christopher Olah) · The diagrams in every textbook are reproduced from this essay. Required reading.
- Long Short-Term Memory (paper, 1997, Hochreiter & Schmidhuber) · The original LSTM paper. Twenty years ahead of its time and arguably the single most important paper in modern deep learning.
- Empirical Evaluation of Gated Recurrent Networks (paper, 2014, Chung et al.) · The systematic comparison of LSTMs and GRUs that established their rough equivalence.
- Attention and Augmented Recurrent Networks (essay, Distill, Olah & Carter) · The bridge from LSTMs to attention — what came next, and why.