Recurrent Networks
A network with a memory. At each step it sees a new symbol and updates a hidden state — a running summary of everything it has seen so far. The same weights, applied over and over.
A recurrent neural network (RNN) is a network with a loop: at each timestep, it takes a new input and the previous hidden state, and produces a new hidden state.
The defining equation is h_t = tanh(W_xh · x_t + W_hh · h_{t−1} + b). The same weight matrices are reused at every step — that's why an RNN can read sequences of any length with a fixed parameter count.
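A minimal NumPy sketch of that update, to make the shapes concrete. The sizes, the scale of the random weights, and the one-hot input are illustrative choices, not anything the page fixes:

```python
import numpy as np

# One step of a vanilla RNN cell: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b).
# Sizes are arbitrary illustrative choices.
input_size, hidden_size = 6, 8

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """Apply the same weights at every timestep; only the state changes."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

h = np.zeros(hidden_size)            # h_0: empty memory
x = np.zeros(input_size); x[0] = 1.0 # a one-hot input symbol
h = rnn_step(x, h)                   # new hidden state, same shape as before
```

Because `rnn_step` reuses the same `W_xh`, `W_hh`, and `b` on every call, you can feed it a sequence of any length without adding parameters.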
Watch the hidden state evolve as the network reads a sequence. Different dimensions of the hidden vector specialize to track different patterns — vowels, recurring symbols, position in the sequence.
Before transformers ate the field in 2018, RNNs (and their LSTM/GRU variants) were the dominant architecture for everything sequential: language modeling, machine translation, speech recognition, time-series forecasting.
They're still the right tool when sequences are very long or irregular, or when memory is tight: an RNN carries a fixed-size hidden state from step to step, while full attention's cost grows quadratically with sequence length. Modern recurrent-flavored models (Mamba, RWKV, and other "linear attention" variants) revive the recipe with new tricks.
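A back-of-the-envelope illustration of that scaling claim; the state width and sequence length below are arbitrary numbers chosen only to show the orders of magnitude:

```python
# Rough memory comparison; d (state width) and T (sequence length) are illustrative.
d, T = 1024, 100_000

rnn_state   = d          # one hidden vector, regardless of sequence length
kv_cache    = 2 * T * d  # a transformer keeps keys and values for every past token
attn_scores = T * T      # full attention scores grow quadratically with length

print(rnn_state, kv_cache, attn_scores)  # 1024 vs 204800000 vs 10000000000 numbers
```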
- Edit the sequence (the only allowed symbols are a, b, r, c, d, and space). Hit Replay and watch the hidden-state matrix fill in column by column.
- Observe how a repeated pattern like abracadabra creates a recurring rhythm in the hidden state — certain dimensions oscillate as familiar chunks reappear.
- Click ↻ weights to randomize. The same input produces a wildly different state trajectory — but the structure of "memory of recent history" is preserved. A rough sketch after this list reproduces the same experiment in plain code.
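If you want to poke at the same effects outside the widget, here is an offline sketch under assumptions of my own choosing: a six-symbol vocabulary (a, b, r, c, d, space), one-hot inputs, and a small tanh cell. It builds the hidden-state matrix column by column for abracadabra, then re-randomizes the weights to show a different trajectory; all sizes, seeds, and helper names are made up for illustration.

```python
import numpy as np

# Offline version of the demo: six symbols, one-hot inputs, tanh update.
vocab = "abrcd "
hidden_size = 8
rng = np.random.default_rng(1)
W_xh = rng.normal(scale=0.5, size=(hidden_size, len(vocab)))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def one_hot(ch):
    v = np.zeros(len(vocab))
    v[vocab.index(ch)] = 1.0
    return v

def run(seq, W_xh, W_hh, b):
    """Feed the sequence one symbol at a time; return the hidden-state matrix,
    one column per timestep (what the demo fills in as it replays)."""
    h = np.zeros(hidden_size)
    cols = []
    for ch in seq:
        h = np.tanh(W_xh @ one_hot(ch) + W_hh @ h + b)
        cols.append(h)
    return np.stack(cols, axis=1)

H = run("abracadabra", W_xh, W_hh, b)
print(H.shape)  # (8, 11): one column of hidden state per symbol

# The "randomize weights" button: a completely different trajectory,
# but still a running summary of the recent symbols.
W_xh2 = rng.normal(scale=0.5, size=W_xh.shape)
W_hh2 = rng.normal(scale=0.5, size=W_hh.shape)
H2 = run("abracadabra", W_xh2, W_hh2, b)
```

Because abracadabra repeats the chunk "abra", the corresponding columns of H end up similar but not identical: the state depends on the whole prefix, not just the current symbol.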
Google Translate's neural backend launched in 2016 with stacked LSTMs (encoder–decoder with attention). Held the field until transformers swept it away in 2018.
Bidirectional LSTMs powered Siri, Alexa, and Google's voice search through the late 2010s — taking a stream of audio frames and producing a stream of phonemes.
RNNs forecast electricity demand, financial volatility, and short-term weather. Newer state-space models (S4, Mamba) are direct descendants of the recurrent idea.
The famous 2015 blog post whose character-level RNNs generated Shakespeare, Wikipedia markup, and source code letter by letter. The first time many people saw a neural net produce structured prose.
- The Unreasonable Effectiveness of Recurrent Neural Networks · essay · Andrej Karpathy (2015) · The blog post that launched a thousand char-RNN demos. Still the clearest "look what these things can do" introduction.
- Understanding LSTM Networks · essay · Christopher Olah · The diagrams everyone has seen. Bridges from vanilla RNNs to LSTMs and is the natural next read after this page.
- On the difficulty of training recurrent networks · paper · Pascanu et al. (2013) · The vanishing/exploding gradient analysis that motivated all subsequent gating mechanisms.
- Mamba: Linear-Time Sequence Modeling · paper · Gu & Dao (2023) · The state-space model that revived the recurrent idea — competitive with transformers, with linear scaling and constant memory.