Transfer Learning
A network trained on a million ImageNet images already knows what edges, eyes, and wheels look like. To classify a new dataset of 200 X-rays, you don't start over — you re-use the early layers, re-train just the last one. Freeze, fine-tune, ship.
Transfer learning takes a model pre-trained on a large source task and adapts it to a smaller target task with much less data.
Three flavors, increasingly invasive: (1) feature extraction — freeze all the pretrained weights (lock them so the optimizer doesn't touch them) and train only a small head on top; (2) fine-tuning — unfreeze the last few layers and continue training at a low learning rate, gently nudging them toward the new task; (3) full fine-tuning — unfreeze and update everything, very slowly.
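A minimal PyTorch sketch of the three flavors, assuming torchvision's ImageNet-pretrained ResNet-18 and a hypothetical 10-class target task; the layer names ("layer4", "fc") and learning rates are illustrative choices, not the only reasonable ones.

```python
import torch
import torch.nn as nn
from torchvision import models

def build(strategy: str, num_classes: int = 10):
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained weights (theta_pre)
    model.fc = nn.Linear(model.fc.in_features, num_classes)           # new randomly-initialized head

    if strategy == "feature_extraction":
        # (1) freeze everything except the new head
        for name, p in model.named_parameters():
            p.requires_grad = name.startswith("fc")
        lr = 1e-3
    elif strategy == "fine_tune":
        # (2) unfreeze only the last block plus the head, low learning rate
        for name, p in model.named_parameters():
            p.requires_grad = name.startswith(("layer4", "fc"))
        lr = 1e-4
    else:
        # (3) full fine-tuning: every layer trainable, very low learning rate
        for p in model.parameters():
            p.requires_grad = True
        lr = 1e-5

    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    return model, optimizer
```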
The earlier layers learn task-agnostic features (edges, color blobs, textures); later layers learn task-specific abstractions (cat-shaped, dog-shaped). Transfer works because the early features generalize across many vision tasks.
Transfer learning is the single most practical technique in modern ML. Almost no one trains image models from scratch anymore — they start from a pretrained ResNet, ViT, or CLIP checkpoint. Almost no one trains language models from scratch — they fine-tune a Llama, Mistral, or Gemma checkpoint.
The reason is data: pretraining on millions or billions of examples learns rich, general features; the target task may have only a few thousand labeled examples. Without transfer, a small target dataset can't drag a deep network out of the random-initialization pit.
- Click ▷ Train all three. The architecture strip up top shows which layers the optimizer is allowed to touch (oxblood = trainable, gray = locked). Below, three accuracy curves race: scratch is slowest; frozen jumps to a fast start but flat-lines low; fine-tune climbs steadily and finishes highest.
- Drag the frozen layers slider. The trainable-parameter pill above the curves drops as you freeze more — fewer knobs to tune means a faster, cheaper run, but a lower achievable ceiling. Freeze 0 layers and fine-tune nearly matches scratch in cost; freeze 4 and it nearly matches frozen in cost.
- Reduce target data size. The scratch curve falls fastest because random-init networks are data-hungry; transfer barely flinches. The gap is the entire point.
- Watch the wall-clock counters at the bottom. Each strategy spends compute at a different rate per epoch — frozen is cheap (only the head trains), full fine-tune is expensive (every layer updates).
- θ (theta) · the full vector of model weights — millions of numbers, one per learnable parameter.
- θpre · the pretrained weights you start from (e.g., ResNet's ImageNet checkpoint). Already good; not random.
- Δθ (delta theta) · the small change the optimizer is allowed to make. In frozen, Δθ is zero everywhere except the head. In fine-tuning, Δθ is non-zero on the last few layers. In full FT, Δθ touches everything (gently).
- Trainable parameters · the count of weights with non-zero allowed Δθ — the figure in the pill above the loss curves (see the counting sketch after this list). Fewer trainable parameters = less GPU memory, faster epochs, less risk of overfitting.
- Wall-clock minutes · GPU time elapsed. Frozen is cheap because most layers skip the gradient computation. Fine-tune is dearer because gradients flow through every layer.
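A small sketch of how that trainable-parameter figure could be computed for a PyTorch model such as the one above; `count_trainable` is a hypothetical helper name, not part of any library.

```python
import torch.nn as nn

def count_trainable(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts; only weights with
    requires_grad=True receive a non-zero Δθ from the optimizer."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total
```

For the feature-extraction strategy, only the head's weights are counted — a few thousand parameters instead of ResNet-18's roughly 11 million.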
Classify diabetic retinopathy with 5,000 retinal images? Start from ImageNet-pretrained ResNet, swap the head, fine-tune. Almost every published medical-imaging classifier uses this recipe.
"Train Llama 3 on our company's 100k support tickets" → LoRA adapters or full fine-tune. The base model already speaks English; you're just teaching it your domain.
Conservation orgs use ImageNet-pretrained models to classify camera-trap photos by species. Often only a few hundred labeled examples per species — transfer is the only thing that makes it work.
Detecting cracks in turbine blades or solder defects on PCBs — small, expensive-to-label datasets. Pretrained vision encoders + a fine-tuned head is the dominant recipe.
- Yosinski et al. (2014), "How transferable are features in deep neural networks?" · The empirical paper that nailed down which layers transfer well and which don't. Foundational for the freeze/fine-tune choice.
- Hu et al. (2021), "LoRA: Low-Rank Adaptation of Large Language Models" · The LoRA paper. The technique behind the explosion of fine-tuned Llama variants and most cost-conscious LLM customization today.
- Howard & Thomas, fast.ai — Practical Deep Learning (course) · The course that operationalized "transfer learning by default" in modern practice. Lesson 1 fine-tunes ResNet on a custom dataset in 4 lines of code.
- Hugging Face Transformers — fine-tuning documentation · The reference for fine-tuning pretrained transformers in production. Covers full fine-tuning, LoRA/QLoRA, and other parameter-efficient methods.