jthomas.site// notebook · v.4.2026

Stop Wrestling with Boilerplate — Local Tinker Gives You a Clean API for Local LLM Fine-Tuning

A Tinker-style API that runs on your own GPU. Write clean training loops for SFT, PPO, DPO, and GRPO without managing a single CUDA memory allocation.

The boilerplate for local fine-tuning is a lot. Wire up HuggingFace Transformers, bolt on PEFT for LoRA, pick a tokenizer, manage your own gradient accumulation, configure mixed-precision training, fight with bitsandbytes quantization flags, handle checkpointing, and hope your VRAM math was right. By the time the actual training loop runs, you've written three hundred lines of glue code that have nothing to do with your experiment.

Local Tinker is an attempt to collapse all of that into four primitives. It's a high-level, Tinker-style API for LoRA fine-tuning of small LLMs on your own GPU. You bring the model name, the dataset, and the training loop you actually care about. Everything else — quantization, LoRA adapter attachment, optimizer state, checkpointing, memory management — is handled.

It's designed for the 1B–13B class of models: Llama-3, Qwen, Mistral, and the other open-weight families that fit on a single consumer or prosumer GPU. If you have a 3090, a 4090, an A6000, or an H100, you're in scope. If you're training a 70B model, you want something else.

Five Acronyms in One Place

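This post uses a few terms that are usually defined three pages apart in three different papers. One-line intuitions, in the order they show up:

LoRA (Low-Rank Adaptation): freeze the pretrained weights and train only a small low-rank update on top of them.

SFT (supervised fine-tuning): learn from gold answers by maximizing the likelihood of the target tokens.

DPO (Direct Preference Optimization): learn from preference pairs, pushing the chosen completion above the rejected one.

PPO (Proximal Policy Optimization): reinforcement learning from a scalar reward with a value baseline and clipped updates.

GRPO (Group Relative Policy Optimization): sample a group of rollouts per prompt and normalize their rewards within the group.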

Why Low-Rank Works

The surprising empirical fact behind LoRA is that fine-tuning doesn't actually need much capacity. When you specialize a pretrained model — to a domain, a tone, a task — the weight changes ΔW that take you from "general" to "specialized" live in a very low-dimensional subspace of the full parameter space. The model already knows how to read English; you're nudging it, not rebuilding it.

So instead of learning a full d × d update matrix ΔW (millions of parameters per layer), LoRA factorizes it as the product of two skinny matrices: ΔW = A · B, where A is d × r and B is r × d, with r typically 8–64. The update is forced to live inside an r-dimensional subspace. For a 4096-wide layer with rank 16, that's 2 × 4096 × 16 = 131,072 trainable parameters instead of 4096² ≈ 16.8 million, a 128x reduction per layer. At inference, the original W stays frozen and the small A·B is added on top.
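The per-layer arithmetic is easy to check by hand or in a few lines of Python. This is a minimal sketch for a single square d × d layer; the function name `lora_param_counts` is made up for illustration and is not part of Local Tinker.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int, float]:
    """Trainable parameters for a full d x d update vs. a rank-r factorization."""
    full = d * d            # dense update: one entry per weight
    lora = d * r + r * d    # A is d x r, B is r x d
    return full, lora, full / lora


full, lora, ratio = lora_param_counts(d=4096, r=16)
print(f"full: {full:,}  lora: {lora:,}  reduction: {ratio:.0f}x")
# full: 16,777,216  lora: 131,072  reduction: 128x
```

The ratio simplifies to d / (2r), so doubling the rank halves the savings.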

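[Figure: the LoRA forward pass. The input x feeds two branches: the frozen pretrained matrix W (d × d; ~13B parameters in total, 4-bit quantized, no gradients) and the trainable update ΔW = A·B (A is d × r, B is r × d, rank r ≈ 16, ~8M parameters). The branch outputs Wx and ABx are summed: y = Wx + ABx. Parameter budget: full fine-tune, 13,000,000,000 trainable; LoRA rank 16, ~8,000,000 trainable (about 1600x fewer).]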
LoRA in one picture. The pretrained matrix W is huge and frozen; the trainable update ΔW = A·B is forced to live in a rank-r subspace. At inference, both branches are added.
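To make the two-branch picture concrete, here is a toy forward pass in plain Python (no PyTorch), under the same shapes as above: A is d × r and B is r × d. The sizes, seed, and helper names are arbitrary choices for illustration. It checks that keeping the branches separate, y = Wx + A(Bx), gives the same output as merging the update into the weights first, (W + AB)x.

```python
import random


def matvec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Matrix-matrix product for lists of lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


random.seed(0)
d, r = 6, 2
W = [[random.gauss(0, 1.0) for _ in range(d)] for _ in range(d)]   # frozen, d x d
A = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]   # trainable, d x r
B = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]   # trainable, r x d
x = [random.gauss(0, 1.0) for _ in range(d)]

# Two branches: the frozen path plus the low-rank path, summed at the output.
low_rank = matvec(A, matvec(B, x))          # only r numbers pass through the middle
y_branches = [w + l for w, l in zip(matvec(W, x), low_rank)]

# Merged: fold the update into the weights once, then do a single matmul.
delta = matmul(A, B)                        # d x d, but rank at most r
W_merged = [[w + dw for w, dw in zip(wr, dr)] for wr, dr in zip(W, delta)]
y_merged = matvec(W_merged, x)

max_diff = max(abs(a - b) for a, b in zip(y_branches, y_merged))
print(f"max difference: {max_diff:.2e}")
```

The two-branch form is what training uses, since only A and B receive gradients; the merged form is an option once training is done.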

Why Another Fine-Tuning Tool?

Thinking Machines Lab's Tinker API made a convincing case that the right abstraction for fine-tuning is very small: forward_backward, optim_step, and sample. Those three calls plus a client object cover supervised learning, reinforcement learning, and everything in between. It's elegant.

The catch is that Tinker is a hosted cloud service. That's the right call for large models where most users don't have the hardware. But for the 1B–13B range, I already have the hardware. I'd rather run the same clean API against my own GPU, with my own data, without uploading anything or paying per token.

Local Tinker takes the mental model and the surface area of Tinker's API and maps it onto local hardware. Same primitives, same training loop shape, your machine.

The Core Primitives

The API exposes three client objects:

ServiceClient is the entry point. It discovers which models are available locally, checks your GPU, and hands out training and sampling clients. You instantiate it once.

TrainingClient is where fine-tuning happens. It wraps a quantized base model and a LoRA adapter, exposes forward_backward for loss computation and gradient accumulation, and optim_step for applying the accumulated update. Hundreds of lines of HuggingFace + PEFT + bitsandbytes wiring collapse into those two calls.

SamplingClient generates completions from the current (or a saved) checkpoint. It's what you reach for in the middle of an RL loop when you need to roll out trajectories from the current policy, or at the end of training when you want to verify the model actually learned what you told it to.

A minimal SFT loop looks like this: the whole training script, from cold start to a working sampler, in about a dozen lines (given a dataloader that yields batches):

from local_tinker import ServiceClient

service = ServiceClient()                                  # 1. discover hardware, pick a backend
trainer = service.create_lora_training_client(             # 2. attach a LoRA adapter to a quantized base
    base_model="meta-llama/Llama-3.2-3B",
    rank=32,                                            #    rank of the A·B factorization
    quantize="4bit",                                     #    base model in 4-bit, adapter in bf16
)

for batch in dataloader:
    loss = trainer.forward_backward(batch)                 # 3. forward + backward, grads accumulate
    trainer.optim_step()                                   # 4. clip, apply, advance optimizer
    print(f"loss: {loss:.4f}")

sampler = trainer.save_weights_and_get_sampling_client()   # 5. snapshot weights for inference
output = sampler.sample("Explain gradient descent simply:")

One import, one loop, one optim step per batch, one sampling call at the end. That's the whole thing.

Without Local Tinker, you'd write…

For comparison — here is roughly the same SFT setup expressed directly against HuggingFace, PEFT, and bitsandbytes. This is the boilerplate Local Tinker absorbs:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(                                # quantization config
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B", quantization_config=bnb, device_map="auto",
)
model = prepare_model_for_kbit_training(model)            # gradient checkpointing, fp32 layernorms…
peft_cfg = LoraConfig(                                    # LoRA wiring
    r=32, lora_alpha=64, lora_dropout=0.05,
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
    bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_cfg)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
optim = torch.optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=2e-4)
scaler = torch.cuda.amp.GradScaler()                      # mixed-precision plumbing

accum = 8
for step, batch in enumerate(dataloader):                 # manual gradient accumulation
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        out = model(**batch)
        loss = out.loss / accum                           # scale so accumulated grads average
    scaler.scale(loss).backward()
    if (step + 1) % accum == 0:
        scaler.unscale_(optim)                            # unscale before clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optim)
        scaler.update()
        optim.zero_grad()
# …plus checkpointing, eval hooks, an inference path that re-merges adapters, etc.

Everything in that block is real work that someone has to do. Local Tinker's contribution is doing it once, in the library, so you don't keep doing it on every project.

Two-Phase Design: Gradient Accumulation and LoRA

Why two phases? The reason is memory.

Fine-tuning a 7B model with LoRA means loading the base model in 4-bit quantized form (frozen), attaching a small LoRA adapter in bf16 or fp16 (trainable), and keeping the optimizer state in fp32. Even with that arrangement, activation memory limits how many examples fit in one forward and backward pass, so you usually want to accumulate several micro-batches before applying an update to get a reasonable effective batch size.

Phase 1 (forward_backward) runs a single micro-batch through the model, computes the loss, and backpropagates to produce gradients for the LoRA adapter only; the base weights stay frozen. Gradients accumulate in place. You can call it many times in a row without ever calling optim_step.

Phase 2 (optim_step) clips the accumulated gradients, applies them, and advances the optimizer state. You call this once per effective batch.

[Figure: ServiceClient (entry point, hardware discovery) creates a TrainingClient (LoRA adapter on a frozen quantized base). Phase 1, forward_backward(): compute loss, accumulate LoRA grads; call N times per batch. Phase 2, optim_step(): apply accumulated grads, clip + advance optimizer; call once per batch. A SamplingClient generates completions from the current weights.]
Three clients, two phases. Phase 1 runs many times to accumulate gradients; Phase 2 applies them once per effective batch.

Separating the two phases gives you exact control over the effective batch size without writing any gradient-accumulation plumbing yourself. You decide how many forward_backward calls go between optim_step calls; the client handles everything else.
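The two-phase contract can be shown on a toy problem. The class below is a stand-in I made up for illustration, not Local Tinker's implementation: it fits y = w·x with squared error, accumulates gradients in forward_backward, and applies their average in optim_step. Averaging over all accumulated examples is an assumption of this sketch; the real client may scale differently. Two micro-batches of two followed by one step land on the same weight as one batch of four.

```python
class ToyTwoPhaseTrainer:
    """Toy stand-in: one scalar weight, gradients accumulate until optim_step."""

    def __init__(self, lr: float = 0.05):
        self.w = 0.0
        self.lr = lr
        self._grad_sum = 0.0
        self._n_examples = 0

    def forward_backward(self, batch):
        """Phase 1: mean loss on one micro-batch; its gradient is accumulated."""
        loss = 0.0
        for x, y in batch:
            err = self.w * x - y
            loss += err * err
            self._grad_sum += 2.0 * err * x      # d/dw of (w*x - y)^2
        self._n_examples += len(batch)
        return loss / len(batch)

    def optim_step(self):
        """Phase 2: apply the averaged accumulated gradient once, then reset."""
        if self._n_examples == 0:
            raise RuntimeError("optim_step called with no accumulated gradients")
        self.w -= self.lr * (self._grad_sum / self._n_examples)
        self._grad_sum = 0.0
        self._n_examples = 0


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

big = ToyTwoPhaseTrainer()          # one batch of 4, one step
big.forward_backward(data)
big.optim_step()

accum = ToyTwoPhaseTrainer()        # two micro-batches of 2, one step
accum.forward_backward(data[:2])
accum.forward_backward(data[2:])
accum.optim_step()

print(f"{big.w:.3f} {accum.w:.3f}")  # 1.500 1.500
```

The effective batch size is simply micro-batch size times the number of forward_backward calls between steps, here 2 × 2 = 4.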

Beyond SFT: Reinforcement Learning on Your GPU

The same two primitives cover reinforcement learning — which is really the reason this shape matters. SFT is easy to express in any framework; RL is where the boilerplate gets ugly.

Local Tinker has a loss function hierarchy that supports SFT (default), DPO for preference pairs, PPO for advantage-weighted updates with an internal value head, and GRPO for group-relative policy optimization. You pick the loss by passing loss_fn="grpo" (or "dpo", "ppo") to forward_backward. The client handles the bookkeeping.

The four objectives share a skeleton. The differences are entirely in what you feed forward_backward and what loss it computes — the surrounding loop is the same:

STEP        SFT                      DPO                                 PPO                             GRPO
1. input    (prompt, target answer)  (prompt, chosen, rejected)          prompt only (roll out 1)        prompt only (roll out N)
2. signal   teacher tokens are gold  human preference (implicit reward)  scalar reward + value baseline  scalar reward, group-normalized
3. loss     −log p(target)           log-ratio gap (chosen − rejected)   clipped surrogate × advantage   clipped surrogate × group advantage
4. fwd_bwd  trainer.forward_backward(batch, loss_fn="…")  (same call for all four)
5. step     trainer.optim_step()  (same call for all four)

Steps 1–3 differ across objectives. Steps 4–5 are identical: the same two primitives carry every loss.
Same skeleton, four objectives. SFT learns from gold answers; DPO from preference pairs; PPO from a reward model + value baseline; GRPO from a group of rollouts whose rewards are normalized within the group.
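For readers who want the loss row spelled out, here are the textbook per-example forms in plain Python. This is a sketch of the standard formulas, not Local Tinker's code: the function names, the default beta = 0.1, and the default clip range of 0.2 are my assumptions, and real implementations work on per-token tensors rather than single floats.

```python
import math


def sft_loss(target_logprobs):
    """SFT: mean negative log-likelihood of the gold target tokens."""
    return -sum(target_logprobs) / len(target_logprobs)


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: -log sigmoid(beta * gap), where gap is the chosen log-ratio
    minus the rejected log-ratio, each measured against a frozen reference."""
    gap = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * gap))     # equals -log(sigmoid(beta * gap))


def clipped_surrogate_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO/GRPO: negative clipped surrogate for one sampled action or token."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return -min(ratio * advantage, clipped * advantage)


print(sft_loss([-0.5, -1.5]))                        # 1.0
print(round(dpo_loss(0.0, 0.0, 0.0, 0.0), 4))        # 0.6931 (log 2: no preference learned yet)
print(round(clipped_surrogate_loss(0.5, 0.0, 1.0), 4))   # -1.2 (gain capped at 1 + clip_eps)
```

PPO and GRPO share the last function; they differ only in where the advantage comes from (a value baseline versus group normalization).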

A GRPO loop in code — the most involved of the four, but still the same shape:

from local_tinker import ServiceClient

service = ServiceClient()
trainer = service.create_lora_training_client("Qwen/Qwen2.5-7B", rank=16)

for step in range(max_steps):
    # 1. roll out N completions per prompt from the current policy
    sampler = trainer.save_weights_and_get_sampling_client()
    completions = sampler.sample_batch(prompts, n=8)        #    8 = the "group" in GRPO

    # 2. score with YOUR reward function (regex, unit test, classifier, judge LLM…)
    rewards = [reward_fn(p, c) for p, c in zip(prompts, completions)]

    # 3. GRPO update — group normalization happens inside the loss
    trainer.forward_backward(completions, rewards, loss_fn="grpo")
    trainer.optim_step()

The interesting line is rewards = [reward_fn(p, c) for ...]. The reward function is yours. It can be a regex, a unit test, a classifier, or another LLM acting as a judge. Local Tinker doesn't care — it just needs a float per completion.
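As an illustration of the regex flavor, here is one way a reward function could look for arithmetic prompts. Everything in it is hypothetical: the "Answer: <integer>" output convention, the EXPECTED lookup table, and the 0.1 partial credit for a well-formed but wrong answer are choices made for this sketch, not anything Local Tinker prescribes.

```python
import re

# Assumed output convention: the completion ends with "Answer: <integer>".
ANSWER_RE = re.compile(r"Answer:\s*(-?\d+)\s*$")

# Hypothetical lookup from prompt to the expected integer answer.
EXPECTED = {"What is 6 * 7?": 42}


def reward_fn(prompt: str, completion: str) -> float:
    """Return one float per completion: 1.0 correct, 0.1 well-formed, 0.0 otherwise."""
    match = ANSWER_RE.search(completion.strip())
    if match is None:
        return 0.0                               # no parseable final answer
    if int(match.group(1)) == EXPECTED[prompt]:
        return 1.0                               # right format, right answer
    return 0.1                                   # right format, wrong answer


print(reward_fn("What is 6 * 7?", "6 * 7 = 42.\nAnswer: 42"))   # 1.0
print(reward_fn("What is 6 * 7?", "Answer: 41"))                # 0.1
print(reward_fn("What is 6 * 7?", "I think it's 42"))           # 0.0
```

A small partial-credit tier like this is one common way to give early training some signal before the model gets any answers right.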

DPO follows the same shape but takes preferred and dispreferred pairs instead of rollouts. PPO manages a value head internally so you don't have to. All four loss functions use the same forward_backward + optim_step surface. The training loop doesn't change shape when you change objectives, which is the whole point.
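The group normalization that GRPO relies on is small enough to write out. This is a sketch of the usual recipe, subtracting the group mean and dividing by the group standard deviation; the function name, the use of the population standard deviation, and the epsilon value are my assumptions, and Local Tinker's internals may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)                      # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight rollouts for one prompt, scored by a reward function.
rewards = [1.0, 0.0, 0.0, 1.0, 0.1, 0.0, 1.0, 0.0]
adv = group_advantages(rewards)

print([round(a, 2) for a in adv])
print(group_advantages([0.5, 0.5, 0.5, 0.5]))    # [0.0, 0.0, 0.0, 0.0]
```

Completions that beat their group's average get a positive advantage and the rest get a negative one, which is why no separate value head is needed. When every rollout in a group scores the same, the advantages are all zero and that prompt contributes no gradient.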

The Tech Stack

PyTorch · HuggingFace Transformers · PEFT · bitsandbytes

Local Tinker is a thin layer over a short stack of well-maintained libraries. PyTorch provides the tensor and autograd runtime. HuggingFace Transformers provides the base models and tokenizers. PEFT provides the LoRA adapter layers. bitsandbytes provides the 4-bit and 8-bit quantization kernels. Local Tinker is the glue that wires them into a single client API so you don't have to think about any of it during an experiment.

Getting Started

Step 1 — clone and install (the package is editable so you can patch a recipe and rerun without reinstalling):

git clone https://github.com/josephgec/finetuning.git
cd finetuning
pip install -e .

Step 2 — see which models you can train on your current hardware. The CLI inspects your GPU and prints VRAM estimates next to a status flag for each base model in the catalog:

$ local-tinker models

MODEL                          PARAMS    QUANT   VRAM     STATUS
meta-llama/Llama-3.2-1B        1.2B      4bit    ~3 GB    ready
meta-llama/Llama-3.2-3B        3.2B      4bit    ~6 GB    ready
mistralai/Mistral-7B-v0.3      7.3B      4bit    ~9 GB    ready
Qwen/Qwen2.5-7B                7.6B      4bit    ~10 GB   tight
meta-llama/Llama-3.1-13B       13.0B     4bit    ~16 GB   OOM

The status column is the useful one. ready means the model and a reasonable batch size fit comfortably. tight means it'll work but you may need to reduce micro-batch size or LoRA rank. OOM means don't bother — it won't fit on this card.
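For intuition about where VRAM figures like these come from, here is a back-of-the-envelope estimate of the static part of the budget. It is not the CLI's formula. It assumes 4-bit base weights, a bf16 adapter with bf16 gradients, and AdamW keeping two fp32 moments per trainable parameter, and it deliberately ignores activations, quantization metadata, and CUDA context, which is why real numbers come out higher. The 8M adapter size is an illustrative value.

```python
def estimate_static_vram_gb(
    base_params: float,
    lora_params: float,
    base_bits: int = 4,        # 4-bit quantized base weights
    adapter_bytes: int = 2,    # bf16 adapter weights and gradients
    optim_bytes: int = 8,      # AdamW: two fp32 moments per trainable parameter
) -> dict[str, float]:
    """Rough lower bound on VRAM in GB, excluding activations and overhead."""
    gb = 1e9
    parts = {
        "base_weights": base_params * base_bits / 8 / gb,
        "adapter_weights": lora_params * adapter_bytes / gb,
        "adapter_grads": lora_params * adapter_bytes / gb,
        "optimizer_state": lora_params * optim_bytes / gb,
    }
    parts["total"] = sum(parts.values())
    return parts


est = estimate_static_vram_gb(base_params=7.3e9, lora_params=8e6)
for name, value in est.items():
    print(f"{name:16s} {value:6.3f} GB")
print(f"total: {est['total']:.2f} GB")   # total: 3.75 GB
```

The frozen base dominates the static budget; the adapter, its gradients, and the optimizer state together stay under 0.1 GB here. What moves a model from ready to tight is mostly activation memory, which grows with micro-batch size and sequence length.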

Step 3 — launch a training run. The CLI is a thin wrapper around the same primitives shown above; everything it does, you can do from Python:

local-tinker run \
  --model meta-llama/Llama-3.2-3B \
  --dataset ./data/instructions.jsonl \
  --rank 32 \
  --epochs 3 \
  --quantize 4bit

There's a recipes/ directory in the repo with working configs for SFT, DPO, PPO, and GRPO. Copy one, point it at your dataset, and go.
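If you'd rather feed the Python loop yourself, batching a JSONL file takes only the standard library. This is a generic sketch: the "prompt" and "response" field names and the idea that a batch is a list of dicts are assumptions of mine, so check the recipes for the schema the library actually expects.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_batches(path, batch_size: int) -> Iterator[list[dict]]:
    """Yield lists of parsed JSONL records, batch_size at a time."""
    batch = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue                         # tolerate blank lines
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch                              # final partial batch


# Demo on a throwaway file with a hypothetical schema.
records = [
    {"prompt": "2 + 2?", "response": "4"},
    {"prompt": "Capital of France?", "response": "Paris"},
    {"prompt": "Opposite of hot?", "response": "cold"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "instructions.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8")
    batches = list(iter_batches(path, batch_size=2))

print([len(b) for b in batches])   # [2, 1]
```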

Who This Is For

Local Tinker is for developers and researchers who have their own GPU, are working with 1B–13B models, and want to iterate quickly without fighting the framework. If you have a model idea and you want to test it before dinner, this is for you.

It's not a replacement for the cloud Tinker or for a full training platform like Axolotl or TRL. For models larger than ~13B, for distributed training across multiple machines, or for production training pipelines, use the right tool. Use Local Tinker for the tight loop of "wonder if X works → try X → see if it worked" on a model you can fit on one card.

Get Started

The source is on GitHub: github.com/josephgec/finetuning. Clone it, install it, run local-tinker models to see what fits on your GPU, and try an SFT run before you reach for RL. The recipes/ directory has known-good starting points for each loss function.

If you try it and something breaks, or if you have a use case that doesn't map cleanly onto the current primitives, open an issue or a PR. The API surface is small on purpose and I'd rather keep it small — but I'm open to good suggestions.

Happy tinkering.
