Transfer Learning
A network trained on a million ImageNet images already knows what edges, eyes, and wheels look like. To classify a new dataset of 200 X-rays, you don't start over — you re-use the early layers, re-train just the last one. Freeze, fine-tune, ship.
Transfer learning takes a model pre-trained on a large source task and adapts it to a smaller target task with much less data.
Three flavors, increasingly invasive: (1) feature extraction — freeze all the pretrained weights (lock them so the optimizer doesn't touch them) and train only a small head on top; (2) fine-tuning — unfreeze the last few layers and continue training at a low learning rate, gently nudging them toward the new task; (3) full fine-tuning — unfreeze and update everything, very slowly.
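A minimal PyTorch sketch of the three flavors, assuming torchvision's ImageNet-pretrained ResNet-18 and a hypothetical 10-class target task; the layer names ("layer4", "fc") and learning rates are illustrative choices, not the only reasonable ones.

```python
import torch
import torch.nn as nn
from torchvision import models

def build(strategy: str, num_classes: int = 10):
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained weights (theta_pre)
    model.fc = nn.Linear(model.fc.in_features, num_classes)           # new randomly-initialized head

    if strategy == "feature_extraction":
        # (1) freeze everything except the new head
        for name, p in model.named_parameters():
            p.requires_grad = name.startswith("fc")
        lr = 1e-3
    elif strategy == "fine_tune":
        # (2) unfreeze only the last block plus the head, low learning rate
        for name, p in model.named_parameters():
            p.requires_grad = name.startswith(("layer4", "fc"))
        lr = 1e-4
    else:
        # (3) full fine-tuning: every layer trainable, very low learning rate
        for p in model.parameters():
            p.requires_grad = True
        lr = 1e-5

    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    return model, optimizer
```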
The earlier layers learn task-agnostic features (edges, color blobs, textures); later layers learn task-specific abstractions (cat-shaped, dog-shaped). Transfer works because the early features generalize across many vision tasks.
Transfer learning is the single most practical technique in modern ML. Almost no one trains image models from scratch anymore — they start from a pretrained ResNet, ViT, or CLIP checkpoint. Almost no one trains language models from scratch — they fine-tune a Llama, Mistral, or Gemma checkpoint.
The reason is data: pretraining on millions or billions of examples learns rich, general features; the target task may have only a few thousand labeled examples. Without transfer, a small target dataset can't drag a deep network out of the random-initialization pit.
- Click ▷ Train all three. The architecture strip up top shows which layers the optimizer is allowed to touch (oxblood = trainable, gray = locked). Below, three accuracy curves race: scratch is slowest; frozen jumps to a fast start but flat-lines low; fine-tune climbs steadily and finishes highest.
- Drag the frozen layers slider. The trainable-parameter pill above the curves drops as you freeze more — fewer knobs to tune means a faster, cheaper run, but a lower achievable ceiling. Freeze 0 layers and fine-tune nearly matches scratch in cost; freeze 4 and it nearly matches frozen in cost.
- Reduce target data size. The scratch curve falls fastest because random-init networks are data-hungry; transfer barely flinches. The gap is the entire point.
- Watch the wall-clock counters at the bottom. Each strategy spends compute at a different rate per epoch — frozen is cheap (only the head trains), full fine-tune is expensive (every layer updates).
- θ (theta) · the full vector of model weights — millions of numbers, one per learnable parameter.
- θpre · the pretrained weights you start from (e.g., ResNet's ImageNet checkpoint). Already good; not random.
- Δθ (delta theta) · the small change the optimizer is allowed to make. In frozen, Δθ is zero everywhere except the head. In fine-tuning, Δθ is non-zero on the last few layers. In full FT, Δθ touches everything (gently).
- Trainable parameters · the count of weights with non-zero allowed Δθ — the figure in the pill above the loss curves (see the counting sketch after this list). Fewer trainable parameters = less GPU memory, faster epochs, less risk of overfitting.
- Wall-clock minutes · GPU time elapsed. Frozen is cheap because most layers skip the gradient computation. Fine-tune is dearer because gradients flow through every layer.
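A small sketch of how that trainable-parameter figure could be computed for a PyTorch model such as the one above; `count_trainable` is a hypothetical helper name, not part of any library.

```python
import torch.nn as nn

def count_trainable(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts; only weights with
    requires_grad=True receive a non-zero Δθ from the optimizer."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total
```

For the feature-extraction strategy, only the head's weights are counted — a few thousand parameters instead of ResNet-18's roughly 11 million.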
Classify diabetic retinopathy with 5,000 retinal images? Start from ImageNet-pretrained ResNet, swap the head, fine-tune. Almost every published medical-imaging classifier uses this recipe.
"Train Llama 3 on our company's 100k support tickets" → LoRA adapters or full fine-tune. The base model already speaks English; you're just teaching it your domain.
Conservation orgs use ImageNet-pretrained models to classify camera-trap photos by species. Often only a few hundred labeled examples per species — transfer is the only thing that makes it work.
Detecting cracks in turbine blades or solder defects on PCBs — small, expensive-to-label datasets. Pretrained vision encoders + a fine-tuned head is the dominant recipe.
- Yosinski et al. (2014), "How transferable are features in deep neural networks?" · The empirical paper that nailed down which layers transfer well and which don't. Foundational for the freeze/fine-tune choice.
- Hu et al. (2021), "LoRA: Low-Rank Adaptation of Large Language Models" · The LoRA paper. The technique behind the explosion of fine-tuned Llama variants and most cost-conscious LLM customization today.
- Howard & Thomas, fast.ai — Practical Deep Learning (course) · The course that operationalized "transfer learning by default" in modern practice. Lesson 1 fine-tunes ResNet on a custom dataset in 4 lines of code.
- Hugging Face Transformers — fine-tuning documentation · The reference for fine-tuning pretrained transformers in production. Covers full fine-tuning, LoRA/QLoRA, and other parameter-efficient methods.