jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXXIV

Federated
Learning

A model trained across many devices, each keeping its data local. The server only sees the gradients — never the raw data. The fix when "send everything to the cloud" is illegal, expensive, or just impolite.

The concept

In one sentence: data stays local, gradients move. Each device trains on its own data; only the model updates leave.

The protocol — FedAvg: (1) server broadcasts current global weights to a sample of clients; (2) each client trains locally on its own data for a few epochs; (3) clients send the weight updates (not the data) back to the server; (4) server averages the updates, weighted by data size, into new global weights. Repeat for hundreds of rounds.
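A minimal sketch of one such round in Python/NumPy, following the four steps above; `local_train` (a few local epochs on one client's data) and the `clients` list are hypothetical stand-ins for whatever the real system uses.

```python
import numpy as np

def fedavg_round(global_weights, clients, local_train, frac=0.1, seed=0):
    """One FedAvg round: broadcast -> local training -> weighted average.

    `clients` is a list of (data, n_examples) pairs; `local_train` is a
    hypothetical helper that runs a few local epochs and returns new weights.
    """
    rng = np.random.default_rng(seed)
    k = max(1, int(frac * len(clients)))
    selected = rng.choice(len(clients), size=k, replace=False)   # (1) sample a fraction of clients

    updates, sizes = [], []
    for idx in selected:
        data, n_k = clients[idx]
        w_k = local_train(global_weights.copy(), data)           # (2) local epochs; data never leaves
        updates.append(w_k)                                      # (3) only weights travel back
        sizes.append(n_k)

    n = sum(sizes)                                               # (4) average, weighted by data size
    return sum((n_k / n) * w_k for w_k, n_k in zip(updates, sizes))
```

Run it for hundreds of rounds, feeding each round's output back in as the next broadcast, and you have the loop the visualization animates.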

Variants add differential privacy (noise injected into each update so individual rows can't be reverse-engineered) and secure aggregation (the server can decrypt only the sum, not the individual contributions).
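The differential-privacy half can be sketched in a few lines. This assumes the common DP-FedAvg recipe (clip each client's update to a fixed L2 norm, then add Gaussian noise before it leaves the device); the clip norm and σ here are illustrative, not values from the notebook.

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, sigma=0.5, rng=None):
    """Clip a client's weight delta to a fixed L2 norm, then add Gaussian
    noise scaled to that norm, so no single example can dominate or be recovered."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))       # bound each client's influence
    noise = rng.normal(0.0, sigma * clip_norm, size=update.shape)
    return clipped + noise                                        # what actually leaves the device
```

Raising σ buys privacy at the cost of a noisier global loss, the same tradeoff the DP-σ control in the demo exposes.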

Why ML cares

Federated learning made on-device personalization viable. Gboard's next-word prediction, Apple's QuickType, and Siri voice models are all federated — your typing data never leaves your phone, but your phone contributes to the global model.

It's also the regulatory escape hatch for healthcare, finance, and HR ML: GDPR, HIPAA, and similar laws often forbid pooling raw data across jurisdictions. Federated training keeps each hospital's patient records on its servers while still building a shared model.

Try this
  1. Hit Run rounds. Watch the broadcast phase: white packets fan out from the server to each client (the global weights). Then the upload phase: orange gradient packets travel back. The server averages (a single white packet emerges) and the loss curve drops.
  2. Toggle non-IID data. Each client's local data histogram now skews toward a different class (device 1 mostly class A, device 2 mostly class B). Convergence slows; the loss curve gets noisier — a real-world federated pain point; a sketch of this kind of split follows the list.
  3. Turn up privacy noise (DP-σ). Notice the small jitter added to gradient packets before they leave each client; the global loss curve becomes noisier as σ rises. The privacy-utility tradeoff, made visible.
  4. Raise client dropout. Some clients gray out with an X each round — offline phones, dead laptops. FedAvg still converges; robustness is one of the algorithm's quiet strengths.
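Here is a sketch of how a non-IID split like the one in step 2 is often simulated: per-class Dirichlet proportions skew each client's label histogram toward a few classes. The concentration α and the helper name are illustrative; the notebook's own simulation may do it differently.

```python
import numpy as np

def dirichlet_split(labels, n_clients=10, alpha=0.5, seed=0):
    """Partition example indices across clients with per-class Dirichlet
    proportions; small alpha -> each client sees mostly one or two classes."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(n_clients))          # skewed class shares per client
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(client_idx, np.split(idx, cuts)):
            client.extend(part.tolist())
    return client_idx
```

With α near 0.1, most clients end up dominated by one or two classes, which is why convergence slows and the loss curve gets noisier.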
FedAvg, in plain English
w_{t+1} = \sum_k (n_k / n) \cdot w_{t+1}^{k}
  • w_t, w_{t+1} · the global model weights — a long vector. Same architecture across all devices.
  • w_{t+1}^{k} · client k's local weights after this round of local training (starting from the broadcast w_t).
  • n_k · client k's local data size (how many examples that device trained on).
  • n · total data across all participating clients this round (= \sum_k n_k).
  • The recipe: the server averages the clients' new weights, weighted by data size, so clients with more data get more say. A client with 10× the data has 10× the influence.
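A tiny worked example with made-up numbers makes the weighting concrete: two clients, one holding 10× the data.

```python
import numpy as np

# Hypothetical round: two clients start from the same broadcast weights.
w_client_a = np.array([0.10, 0.90])   # trained on n_a = 5000 examples
w_client_b = np.array([0.50, 0.30])   # trained on n_b =  500 examples
n_a, n_b = 5000, 500

w_global = (n_a * w_client_a + n_b * w_client_b) / (n_a + n_b)
print(np.round(w_global, 3))          # [0.136 0.845] -- pulled ~10x harder toward client A
```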
Top-center: the central server holding the global weights. Bottom: client devices, each surrounded by a cloud of its own local data points (the dots beside each client) and a class histogram. The data clouds never move — only the weight packets travel. White packets fan out from the server (broadcast w_t), clients train locally, orange gradient packets travel back, and the server averages them weighted by data size. Right: global loss per round.
Where you've seen this · 4 examples
Gboard's next-word prediction

Google's keyboard learns from billions of phones without ever uploading your text. Each phone trains locally; encrypted gradient summaries are aggregated server-side. Federated learning at industrial scale.

Apple's QuickType and Siri

iPhone's predictive text and Siri's wake-word personalization combine federated learning with differential privacy. Apple's marketing leans on the fact that the data never leaves your device.

Medical imaging consortia

Multi-hospital tumor classification models — each hospital keeps patient scans local; federated rounds aggregate the model. Used by NVIDIA Clara, the Brain Tumor Federated Learning challenge, and several FDA-cleared products.

Financial fraud detection

Banks have patterns of fraud they can't share with competitors. Federated learning lets them collaboratively train fraud models without exposing customer transactions. Used by SWIFT and several large banking consortia.
