Stop Wrestling with Boilerplate — Local Tinker Gives You a Clean API for Local LLM Fine-Tuning

Before this

In 2020 a "fine-tune" meant updating every weight of a model. A 7B-parameter model is 14 GB just for weights in fp16, and the gradients and optimizer state for full fine-tuning need another two to four times that: comfortably 60 GB+ of VRAM, which meant multi-GPU clusters and a weekend of plumbing. LoRA (2021) cut the number of trainable parameters by roughly 1,000x or more, and QLoRA (2023) quantized the frozen base to 4-bit, putting 7B fine-tunes on a single 24 GB consumer card. The remaining problem was the code: even with LoRA available, a working training loop required wiring together HuggingFace Transformers, PEFT, bitsandbytes, accelerate, and a custom optimizer, typically 300+ lines before a single gradient flowed. Tinker (2025) showed that the right surface area was much smaller: four primitives. Local Tinker brings that shape to your own GPU.

Local fine-tuning takes a lot of boilerplate. Wire up HuggingFace Transformers, bolt on PEFT for LoRA, pick a tokenizer, manage your own gradient accumulation, configure mixed-precision training, fight with bitsandbytes quantization flags, handle checkpointing, and hope your VRAM math was right. By the time the actual training loop runs, you've written three hundred lines of glue code that have nothing to do with your experiment.

Local Tinker is an attempt to collapse all of that into four primitives. It's a high-level, Tinker-style API for LoRA fine-tuning of small LLMs on your own GPU. You bring the model name, the dataset, and the training loop you actually care about. Everything else — quantization, LoRA adapter attachment, optimizer state, checkpointing, memory management — is handled.

It's designed for the 1B–13B class of models: Llama-3, Qwen, Mistral, and the other open-weight families that fit on a single consumer or prosumer GPU. If you have a 3090, a 4090, an A6000, or an H100, you're in scope. If you're training a 70B model, you want something else.

Five Acronyms in One Place

This post uses a few terms that are usually defined three pages apart in three different papers. One-line intuitions, in the order they show up:

LoRA — low-rank adaptation. Freeze the giant pretrained weight matrix W and learn a tiny pair of skinny matrices A and B on the side; the effective weight at inference is W + A·B. You train ~0.1% of the original parameters and recover most of the quality.
SFT — supervised fine-tuning. The vanilla case: show the model an input and the correct output, minimize next-token cross-entropy. This is what "fine-tune on instructions" means.
DPO — direct preference optimization. Given pairs of (preferred, dispreferred) responses, push the model to score the preferred one higher — without ever training a separate reward model.
PPO — proximal policy optimization. Sample completions, score them with a reward model, and nudge the policy toward higher-reward completions while clipping how far it can move per step. The workhorse of RLHF.
GRPO — group relative policy optimization. Sample several completions per prompt, normalize their rewards within the group, and use that as the advantage. No separate value head; cheaper and more stable than PPO for LLMs.
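
The "clipping" in the PPO entry is easier to see in code than in prose. Here is a minimal sketch of the clipped surrogate objective for a single sampled completion, in plain Python. The function name and the default clip_eps=0.2 are illustrative choices, not part of Local Tinker's API.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sample (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)                 # pi_new(a) / pi_old(a)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Keep the more pessimistic of the unclipped and clipped estimates.
    return min(ratio * advantage, clipped * advantage)

# Policy made a good completion 1.5x more likely: the credit is capped at 1.2 * A.
capped_gain = ppo_clipped_objective(math.log(1.5), 0.0, advantage=2.0)
# Policy made a bad completion half as likely: the credit is capped at 0.8 * A.
capped_loss = ppo_clipped_objective(math.log(0.5), 0.0, advantage=-2.0)
print(capped_gain, capped_loss)
```

Once the probability ratio leaves the band [1 - clip_eps, 1 + clip_eps] in the direction the advantage favors, the objective stops rewarding further movement. That is what keeps a single update from moving the policy too far.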

Why Low-Rank Works

The surprising empirical fact behind LoRA is that fine-tuning doesn't actually need much capacity. When you specialize a pretrained model — to a domain, a tone, a task — the weight changes ΔW that take you from "general" to "specialized" live in a very low-dimensional subspace of the full parameter space. The model already knows how to read English; you're nudging it, not rebuilding it.

So instead of learning a full d × d update matrix ΔW (millions of parameters per layer), LoRA factorizes it as the product of two skinny matrices: ΔW = A · B, where A is d × r and B is r × d, with r typically 8–64. The update is forced to live inside a rank-r subspace. For a 4096-wide layer with rank 16, that's 2 × 4096 × 16 = 131,072 trainable parameters instead of 4096² ≈ 16.8 million, a 128x reduction per layer. At inference, the original W stays frozen and the small A·B is added on top.
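
The parameter arithmetic is quick to check by hand or in a few lines of Python. This sketch counts the entries of a full update matrix against the entries of the two LoRA factors. The helper name is made up for illustration.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full update params, LoRA params, reduction factor) for one layer."""
    full = d_in * d_out                      # every entry of a dense delta-W
    lora = d_in * rank + rank * d_out        # entries of A (d_in x r) plus B (r x d_out)
    return full, lora, full / lora

full, lora, ratio = lora_param_counts(4096, 4096, 16)
print(full, lora, ratio)   # 16777216 131072 128.0
```

For a square layer the reduction factor simplifies to d / (2r), so halving the rank doubles the savings.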

LoRA in one picture. The pretrained matrix W is huge and frozen; the trainable update ΔW = A·B is forced to live in a rank-r subspace. At inference, both branches are added.
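
To make "both branches are added" concrete, here is a toy check in plain Python that the two-branch computation and a merged weight give the same output. The matrices are tiny made-up integers, and the input is treated as a row vector multiplied from the left. That layout is a convention chosen for this sketch, not something the library prescribes.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(X, Y)]

# Toy sizes: d = 3, r = 1. Integers keep the comparison exact.
W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]   # frozen pretrained weight, d x d
A = [[1], [2], [0]]                      # trainable factor, d x r
B = [[0, 1, 1]]                          # trainable factor, r x d
x = [[1, 2, 3]]                          # one input as a 1 x d row vector

two_branch = add(matmul(x, W), matmul(matmul(x, A), B))   # x·W + (x·A)·B
merged = matmul(x, add(W, matmul(A, B)))                  # x·(W + A·B)
print(two_branch, merged)   # [[10, 7, 10]] [[10, 7, 10]]
```

The two-branch form is what you want during training, because only A and B receive gradients. The merged form is an option at inference time, since it costs nothing extra per token.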

Why Another Fine-Tuning Tool?

Thinking Machines Lab's Tinker API made a convincing case that the right abstraction for fine-tuning is very small: forward_backward, optim_step, and sample. Those three calls plus a client object cover supervised learning, reinforcement learning, and everything in between. It's elegant.

The catch is that Tinker is a hosted cloud service. That's the right call for large models where most users don't have the hardware. But for the 1B–13B range, I already have the hardware. I'd rather run the same clean API against my own GPU, with my own data, without uploading anything or paying per token.

Local Tinker takes the mental model and the surface area of Tinker's API and maps it onto local hardware. Same primitives, same training loop shape, your machine.

The Core Primitives

The API exposes three client objects:

ServiceClient is the entry point. It discovers which models are available locally, checks your GPU, and hands out training and sampling clients. You instantiate it once.

TrainingClient is where fine-tuning happens. It wraps a quantized base model and a LoRA adapter, exposes forward_backward for loss computation and gradient accumulation, and optim_step for applying the accumulated update. Hundreds of lines of HuggingFace + PEFT + bitsandbytes wiring collapse into those two calls.

SamplingClient generates completions from the current (or a saved) checkpoint. It's what you reach for in the middle of an RL loop when you need to roll out trajectories from the current policy, or at the end of training when you want to verify the model actually learned what you told it to.

A minimal SFT loop looks like this. It is the whole training path, from cold start to a working sampler, in about a dozen lines of code:

from local_tinker import ServiceClient

service = ServiceClient()                                  # 1. discover hardware, pick a backend
trainer = service.create_lora_training_client(             # 2. attach a LoRA adapter to a quantized base
    base_model="meta-llama/Llama-3.2-3B",
    rank=32,                                               #    rank of the A·B factorization
    quantize="4bit",                                       #    base model in 4-bit, adapter in bf16
)

for batch in dataloader:
    loss = trainer.forward_backward(batch)                 # 3. forward + backward, grads accumulate
    trainer.optim_step()                                   # 4. clip grads, apply update, advance optimizer
    print(f"loss: {loss:.4f}")

sampler = trainer.save_weights_and_get_sampling_client()   # 5. snapshot weights for inference
output = sampler.sample("Explain gradient descent simply:")

One import, one loop, one optim step per batch, one sampling call at the end. That's the whole thing.

Without Local Tinker, you'd write…

For comparison — here is roughly the same SFT setup expressed directly against HuggingFace, PEFT, and bitsandbytes. This is the boilerplate Local Tinker absorbs:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(                                # quantization config
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B", quantization_config=bnb, device_map="auto",
)
model = prepare_model_for_kbit_training(model)            # gradient checkpointing, fp32 layernorms…
peft_cfg = LoraConfig(                                    # LoRA wiring
    r=32, lora_alpha=64, lora_dropout=0.05,
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
    bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_cfg)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
optim = torch.optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=2e-4)

accum = 8
for step, batch in enumerate(dataloader):                 # manual gradient accumulation
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.autocast("cuda", dtype=torch.bfloat16):    # mixed precision; bf16 needs no GradScaler
        out = model(**batch)
        loss = out.loss / accum                           # scale so accumulated grads average
    loss.backward()
    if (step + 1) % accum == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optim.step()
        optim.zero_grad()
# …plus checkpointing, eval hooks, an inference path that re-merges adapters, etc.

Everything in that block is real work that someone has to do. Local Tinker's contribution is doing it once, in the library, so you don't keep doing it on every project.

Two-Phase Design: Gradient Accumulation and LoRA

Why two phases? The reason is memory.

Fine-tuning a 7B model with LoRA means loading the base model in 4-bit quantized form (frozen), attaching a small LoRA adapter in bf16 or fp16 (trainable), and keeping the optimizer state in fp32. Even with that arrangement, activation memory grows with micro-batch size and sequence length, so the largest micro-batch that fits in VRAM is often only a few examples. To get a reasonable effective batch size, you usually accumulate gradients over several micro-batches before applying an update.
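
A back-of-the-envelope sketch of that arrangement, in plain Python. Everything here is an assumption made for illustration: 32 layers with four adapted 4096 × 4096 projections each, rank 16, gradients stored in the adapter's dtype, and AdamW keeping two fp32 moments per trainable weight. It ignores quantization constants, activations, and framework overhead, so treat the total as a floor rather than an estimate of what the library reports.

```python
GB = 1e9

def lora_static_memory_gb(base_params, lora_params):
    """Rough static VRAM in GB, excluding activations and framework overhead."""
    return {
        "base_4bit": base_params * 0.5 / GB,          # 4 bits = 0.5 byte per frozen weight
        "adapter_bf16": lora_params * 2 / GB,         # 2 bytes per trainable weight
        "adapter_grads": lora_params * 2 / GB,        # one gradient per trainable weight
        "adam_state_fp32": lora_params * 2 * 4 / GB,  # two fp32 moments per trainable weight
    }

# Assumed shape: 32 layers x 4 projections, each with A (4096 x 16) and B (16 x 4096).
lora_params = 32 * 4 * (4096 * 16 + 16 * 4096)
est = lora_static_memory_gb(base_params=7e9, lora_params=lora_params)
for name, gb in est.items():
    print(f"{name:>16}: {gb:.3f} GB")
print(f"{'total':>16}: {sum(est.values()):.3f} GB")
```

Under these assumptions the frozen base dominates the static footprint and the adapter, its gradients, and its optimizer state together come to a fraction of a gigabyte. Whatever VRAM is left over is the budget for activations, which is what micro-batch size actually spends.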

Phase 1 (forward_backward) runs a single micro-batch through the model, computes the loss, and backpropagates, computing gradients for the LoRA adapter only. Gradients accumulate in place. You can call it many times in a row without ever calling optim_step.

Phase 2 (optim_step) clips the accumulated gradients, applies the update, and advances the optimizer state. You call this once per effective batch.
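
Why is it safe to split a batch this way? For equal-sized micro-batches, the average of the per-micro-batch gradients is the full-batch gradient. The sketch below checks that on a one-parameter model with a mean-squared-error loss, in plain Python. The data and the model are made up, and whether Local Tinker averages or sums across micro-batches internally is not something this sketch can tell you.

```python
def grad_mse(w, batch):
    """d/dw of the mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
w = 0.5

full_batch = grad_mse(w, data)                            # one pass over all four examples

micro_batches = [data[0:2], data[2:4]]
accumulated = 0.0
for mb in micro_batches:                                  # "phase 1", repeated
    accumulated += grad_mse(w, mb) / len(micro_batches)   # scale so the sum is a mean

print(full_batch, accumulated)   # -23.25 -23.25
```

The division by the number of micro-batches is the detail that is easy to forget in hand-written loops. Without it the accumulated gradient is a sum, and the effective learning rate scales with the accumulation count.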

Four objects, two phases. Phase 1 runs many times to accumulate gradients; Phase 2 applies them once per effective batch.

Separating the two phases gives you exact control over the effective batch size without writing any gradient-accumulation plumbing yourself. You decide how many forward_backward calls go between optim_step calls; the client handles everything else.
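
In practice that control is a small loop. The helper below is a sketch that uses only the two calls described above, forward_backward and optim_step. FakeTrainer is a hypothetical stand-in that counts calls so the example runs without a GPU or the library installed, and the helper assumes forward_backward returns a numeric loss.

```python
def train_with_accumulation(trainer, micro_batches, accum_steps):
    """Call forward_backward on every micro-batch and optim_step every accum_steps."""
    losses, updates = [], 0
    for i, batch in enumerate(micro_batches, start=1):
        losses.append(trainer.forward_backward(batch))    # phase 1: accumulate
        if i % accum_steps == 0:
            trainer.optim_step()                          # phase 2: apply once
            updates += 1
    return losses, updates

class FakeTrainer:
    """Hypothetical stand-in that only records how the two phases were called."""
    def __init__(self):
        self.pending = 0      # micro-batches accumulated since the last update
        self.applied = []     # number of micro-batches behind each update
    def forward_backward(self, batch):
        self.pending += 1
        return float(len(batch))   # placeholder "loss"
    def optim_step(self):
        self.applied.append(self.pending)
        self.pending = 0

trainer = FakeTrainer()
losses, updates = train_with_accumulation(trainer, [[0]] * 8, accum_steps=4)
print(updates, trainer.applied)   # 2 [4, 4]
```

Note that this sketch does not flush a trailing partial group: if the number of micro-batches is not a multiple of accum_steps, the last few gradients stay pending. Whether to apply or drop them is a choice you would make explicitly.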

Beyond SFT: Reinforcement Learning on Your GPU

The same two primitives cover reinforcement learning — which is really the reason this shape matters. SFT is easy to express in any framework; RL is where the boilerplate gets ugly.

Local Tinker has a loss function hierarchy that supports SFT (default), DPO for preference pairs, PPO for advantage-weighted updates with an internal value head, and GRPO for group-relative policy optimization. You pick the loss by passing loss_fn="grpo" (or "dpo", "ppo") to forward_backward. The client handles the bookkeeping.

The four objectives share a skeleton. The differences are entirely in what you feed forward_backward and what loss it computes — the surrounding loop is the same:

Same skeleton, four objectives. SFT learns from gold answers; DPO from preference pairs; PPO from a reward model + value baseline; GRPO from a group of rollouts whose rewards are normalized within the group.
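
The "normalized within the group" step for GRPO fits in a few lines. This is a minimal sketch using the standard library: each completion's advantage is its reward minus the group mean, divided by the group standard deviation. Using the population standard deviation and a small eps for all-equal groups are choices made for this sketch, and implementations differ on both.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Turn the rewards of one prompt's completions into zero-mean advantages."""
    mu = mean(rewards)
    sigma = pstdev(rewards)                      # population std of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, four sampled completions, two of which earned the reward.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in adv])

# If every completion scores the same, there is no signal to learn from.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Because the baseline is the group's own mean, no value head is needed: a completion is "good" only relative to its siblings for the same prompt.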

A GRPO loop in code — the most involved of the four, but still the same shape:

from local_tinker import ServiceClient

service = ServiceClient()
trainer = service.create_lora_training_client("Qwen/Qwen2.5-7B", rank=16)

for step in range(max_steps):
    # 1. roll out N completions per prompt from the current policy
    sampler = trainer.save_weights_and_get_sampling_client()
    completions = sampler.sample_batch(prompts, n=8)        #    8 = the "group" in GRPO

    # 2. score with YOUR reward function (regex, unit test, classifier, judge LLM…)
    #    (assumes sample_batch returns one list of n completions per prompt)
    rewards = [[reward_fn(p, c) for c in group] for p, group in zip(prompts, completions)]

    # 3. GRPO update: group normalization happens inside the loss
    trainer.forward_backward(completions, rewards, loss_fn="grpo")
    trainer.optim_step()

The interesting line is the one that computes rewards. The reward function is yours. It can be a regex, a unit test, a classifier, or another LLM acting as a judge. Local Tinker doesn't care: it just needs a float per completion.
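
As an illustration of the regex flavor, here is a sketch of a reward function for a made-up arithmetic task. The "Answer: <integer>" output format, the EXPECTED lookup table, and the partial credit of 0.1 for well-formed but wrong answers are all assumptions invented for this example. The only thing taken from the loop above is the (prompt, completion) signature and the float return value.

```python
import re

# Hypothetical task format: the completion must end with "Answer: <integer>".
ANSWER_RE = re.compile(r"Answer:\s*(-?\d+)\s*$")
EXPECTED = {"What is 6 * 7?": 42}    # prompt -> gold answer, illustrative data

def reward_fn(prompt, completion):
    """Return 1.0 for the right answer, 0.1 for the right format only, else 0.0."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    return 1.0 if int(match.group(1)) == EXPECTED.get(prompt) else 0.1

print(reward_fn("What is 6 * 7?", "6 * 7 = 42. Answer: 42"))   # 1.0
print(reward_fn("What is 6 * 7?", "Answer: 41"))               # 0.1
print(reward_fn("What is 6 * 7?", "forty-two"))                # 0.0
```

A small reward for getting the format right is one common way to give the policy something to climb early on, when almost no completions are fully correct.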

DPO follows the same shape but takes preferred and dispreferred pairs instead of rollouts. PPO manages a value head internally so you don't have to. All four loss functions use the same forward_backward + optim_step surface. The training loop doesn't change shape when you change objectives, which is the whole point.
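
For the DPO case, the per-pair loss is compact enough to write out. This sketch follows the formulation in the DPO paper (Rafailov et al.) for a single preference pair, using summed log-probabilities of each response under the policy and under a frozen reference model. The function name, argument names, and beta=0.1 are illustrative, and nothing here describes how Local Tinker computes it internally.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (preferred, dispreferred) pair of responses."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp        # policy vs reference, preferred
    rejected_margin = policy_rejected_logp - ref_rejected_logp  # policy vs reference, dispreferred
    logits = beta * (chosen_margin - rejected_margin)
    return math.log1p(math.exp(-logits))                        # equals -log(sigmoid(logits))

# Policy identical to the reference: no preference learned yet, loss is log(2).
start = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy raised the preferred response and lowered the dispreferred one: loss drops.
later = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(start, later)
```

The loss only looks at how far the policy has moved relative to the reference on each response, which is why no separate reward model appears anywhere in the computation.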

The Tech Stack

PyTorch · HuggingFace Transformers · PEFT · bitsandbytes

Local Tinker is a thin layer over a short stack of well-maintained libraries. PyTorch provides the tensor and autograd runtime. HuggingFace Transformers provides the base models and tokenizers. PEFT provides the LoRA adapter layers. bitsandbytes provides the 4-bit and 8-bit quantization kernels. Local Tinker wires them into a single client API so you don't have to think about any of it during an experiment.
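
If you want to confirm that the stack is present before a run, the standard library can report what is installed. This is a small sketch using importlib.metadata. The names below are the PyPI distribution names of the four libraries, and the helper itself is illustrative, not a Local Tinker command.

```python
from importlib.metadata import PackageNotFoundError, version

STACK = ["torch", "transformers", "peft", "bitsandbytes"]

def installed_versions(packages=STACK):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions().items():
    print(f"{name:<14} {ver or 'not installed'}")
```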

Getting Started

Step 1 — clone and install (the package is editable so you can patch a recipe and rerun without reinstalling):

git clone https://github.com/josephgec/finetuning.git
cd finetuning
pip install -e .

Step 2 — see which models you can train on your current hardware. The CLI inspects your GPU and prints VRAM estimates next to a status flag for each base model in the catalog:

$ local-tinker models

MODEL                          PARAMS    QUANT   VRAM     STATUS
meta-llama/Llama-3.2-1B        1.2B      4bit    ~3 GB    ready
meta-llama/Llama-3.2-3B        3.2B      4bit    ~6 GB    ready
mistralai/Mistral-7B-v0.3      7.3B      4bit    ~9 GB    ready
Qwen/Qwen2.5-7B                7.6B      4bit    ~10 GB   tight
meta-llama/Llama-3.1-13B       13.0B     4bit    ~16 GB   OOM

The status column is the useful one. ready means the model and a reasonable batch size fit comfortably. tight means it'll work but you may need to reduce micro-batch size or LoRA rank. OOM means don't bother — it won't fit on this card.
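
One plausible way to think about the three labels, as a sketch rather than the CLI's actual rule: compare the estimated footprint with the card's total VRAM and leave some headroom for activations. The function name, the 80% headroom threshold, and the 12 GB card are assumptions picked so the example lines up with the sample output above.

```python
def fit_status(estimated_gb, total_vram_gb, headroom=0.8):
    """Classify a model as 'ready', 'tight', or 'OOM' for a given card."""
    if estimated_gb > total_vram_gb:
        return "OOM"                              # weights alone exceed the card
    if estimated_gb > headroom * total_vram_gb:
        return "tight"                            # fits, but little room for activations
    return "ready"

# Estimated footprints from the sample table, checked against an assumed 12 GB card.
for model, gb in [("Llama-3.2-3B", 6), ("Mistral-7B-v0.3", 9),
                  ("Qwen2.5-7B", 10), ("13B model", 16)]:
    print(f"{model:<18} ~{gb} GB  {fit_status(gb, total_vram_gb=12)}")
```

The headroom fraction is the knob that matters: longer sequences and larger micro-batches need more of it, which is why a "tight" model often becomes workable after reducing either one.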

Step 3 — launch a training run. The CLI is a thin wrapper around the same primitives shown above; everything it does, you can do from Python:

local-tinker run \
  --model meta-llama/Llama-3.2-3B \
  --dataset ./data/instructions.jsonl \
  --rank 32 \
  --epochs 3 \
  --quantize 4bit

There's a recipes/ directory in the repo with working configs for SFT, DPO, PPO, and GRPO. Copy one, point it at your dataset, and go.

Who This Is For

Local Tinker is for developers and researchers who have their own GPU, are working with 1B–13B models, and want to iterate quickly without fighting the framework. If you have a model idea and you want to test it before dinner, this is for you.

It's not a replacement for the hosted Tinker service or for a full training framework like Axolotl or TRL. For models larger than ~13B, for distributed training across multiple machines, or for production training pipelines, use one of those instead. Use Local Tinker for the tight loop of "wonder if X works → try X → see if it worked" on a model you can fit on one card.

Get Started

The source is on GitHub: github.com/josephgec/finetuning. Clone it, install it, run local-tinker models to see what fits on your GPU, and try an SFT run before you reach for RL. The recipes/ directory has known-good starting points for each loss function.

If you try it and something breaks, or if you have a use case that doesn't map cleanly onto the current primitives, open an issue or a PR. The API surface is small on purpose and I'd rather keep it small — but I'm open to good suggestions.

Happy tinkering.

References

The papers and projects this one builds on, for readers who want to go deeper.

LoRA and quantization

Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685
Dettmers et al. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. arXiv:2305.14314
Dettmers et al. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022. arXiv:2208.07339

Reinforcement learning from feedback

Schulman et al. Proximal Policy Optimization Algorithms. 2017. arXiv:1707.06347
Rafailov et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. arXiv:2305.18290
Shao et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024 (introduces GRPO). arXiv:2402.03300

Libraries and prior art

Thinking Machines Lab. Tinker API. thinkingmachines.ai/tinker
HuggingFace. PEFT — Parameter-Efficient Fine-Tuning. github.com/huggingface/peft
bitsandbytes. github.com/bitsandbytes-foundation/bitsandbytes
HuggingFace. TRL — Transformer Reinforcement Learning. github.com/huggingface/trl

Project source

Local Tinker on GitHub — github.com/josephgec/finetuning