
Stop Wrestling with Boilerplate — Local Tinker Gives You a Clean API for Local LLM Fine-Tuning

A Tinker-style API that runs on your own GPU. Write clean training loops for SFT, PPO, DPO, and GRPO without managing a single CUDA memory allocation.

The boilerplate for local fine-tuning is a lot. Wire up HuggingFace Transformers, bolt on PEFT for LoRA, pick a tokenizer, manage your own gradient accumulation, configure mixed-precision training, fight with bitsandbytes quantization flags, handle checkpointing, and hope your VRAM math was right. By the time the actual training loop runs, you've written three hundred lines of glue code that have nothing to do with your experiment.

Local Tinker is an attempt to collapse all of that into four primitives. It's a high-level, Tinker-style API for LoRA fine-tuning of small LLMs on your own GPU. You bring the model name, the dataset, and the training loop you actually care about. Everything else — quantization, LoRA adapter attachment, optimizer state, checkpointing, memory management — is handled.

It's designed for the 1B–13B class of models: Llama-3, Qwen, Mistral, and the other open-weight families that fit on a single consumer or prosumer GPU. If you have a 3090, a 4090, an A6000, or an H100, you're in scope. If you're training a 70B model, you want something else.

Why Another Fine-Tuning Tool?

Thinking Machines Lab's Tinker API made a convincing case that the right abstraction for fine-tuning is very small: forward_backward, optim_step, and sample. Those three calls plus a client object cover supervised learning, reinforcement learning, and everything in between. It's elegant.

The catch is that Tinker is a hosted cloud service. That's the right call for large models where most users don't have the hardware. But for the 1B–13B range, I already have the hardware. I'd rather run the same clean API against my own GPU, with my own data, without uploading anything or paying per token.

Local Tinker takes the mental model and the surface area of Tinker's API and maps it onto local hardware. Same primitives, same training loop shape, your machine.

The Core Primitives

The API exposes three client objects:

ServiceClient is the entry point. It discovers which models are available locally, checks your GPU, and hands out training and sampling clients. You instantiate it once.

TrainingClient is where fine-tuning happens. It wraps a quantized base model and a LoRA adapter, exposes forward_backward for loss computation and gradient accumulation, and optim_step for applying the accumulated update. Hundreds of lines of HuggingFace + PEFT + bitsandbytes wiring collapse into those two calls.

SamplingClient generates completions from the current (or a saved) checkpoint. It's what you reach for in the middle of an RL loop when you need to roll out trajectories from the current policy, or at the end of training when you want to verify the model actually learned what you told it to.

A minimal SFT loop looks like this:

from local_tinker import ServiceClient

service = ServiceClient()
trainer = service.create_lora_training_client(
    base_model="meta-llama/Llama-3.2-3B",
    rank=32,
    quantize="4bit",
)

for batch in dataloader:
    loss = trainer.forward_backward(batch)
    trainer.optim_step()
    print(f"loss: {loss:.4f}")

sampler = trainer.save_weights_and_get_sampling_client()
output = sampler.sample("Explain gradient descent simply:")

One import, one loop, one optim step, one sampling call at the end. That's the whole thing.

Two-Phase Design: Gradient Accumulation and LoRA

Why two phases? The reason is memory.

Fine-tuning a 7B model with LoRA means loading the base model in 4-bit quantized form (frozen), attaching a small LoRA adapter in bf16 or fp16 (trainable), and keeping the optimizer state in fp32. Even with that arrangement, the gradients for the LoRA adapter can be substantial, and you usually want to accumulate several micro-batches before applying an update to get a reasonable effective batch size.
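To make that arithmetic concrete, here's a back-of-the-envelope sketch of the trainable side. The layer shapes are illustrative (a Llama-2-7B-style model: 32 layers, hidden size 4096, rank-32 adapters on the q/k/v/o projections), not measurements from Local Tinker:

```python
def lora_params(d_in, d_out, rank):
    # A LoRA adapter on one linear layer is two small matrices:
    # A (d_in x rank) and B (rank x d_out).
    return rank * (d_in + d_out)

def adapter_bytes(n_params):
    weights = 2 * n_params   # bf16 adapter weights
    grads   = 2 * n_params   # bf16 gradients
    adam    = 8 * n_params   # two fp32 AdamW state tensors
    return weights + grads + adam

# Rank-32 adapters on 4 projections per layer, 32 layers:
n = 32 * 4 * lora_params(4096, 4096, 32)   # ~33.5M trainable params
total = adapter_bytes(n)                   # ~0.4 GB for adapter + grads + optimizer
```

Roughly 0.4 GB for the trainable side of a 7B model — small next to the ~3.5 GB of 4-bit base weights, but the micro-batch activations on top of both are what make accumulation worthwhile.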

Phase 1 — forward_backward — runs a single micro-batch through the model, computes the loss, and backprops through the LoRA adapter only. Gradients accumulate in place. You can call it many times in a row without ever calling optim_step.

Phase 2 — optim_step — applies the accumulated gradients, clips them, and advances the optimizer state. You call this once per effective batch.

[Figure: four objects, two phases — ServiceClient (entry point, hardware discovery) creates a TrainingClient (LoRA adapter on a frozen quantized base); Phase 1, forward_backward(), computes loss and accumulates LoRA grads, called N times per batch; Phase 2, optim_step(), clips and applies the accumulated grads, called once per batch; SamplingClient generates completions from the current weights.]
Four objects, two phases. Phase 1 runs many times to accumulate gradients; Phase 2 applies them once per effective batch.

Separating the two phases gives you exact control over the effective batch size without writing any gradient-accumulation plumbing yourself. You decide how many forward_backward calls go between optim_step calls; the client handles everything else.
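The scheduling can be sketched with stubs — this is only the control flow you'd write around the two calls, with placeholder functions standing in for the real client:

```python
# Four forward_backward calls accumulate before each optim_step,
# giving an effective batch of 4 micro-batches.
ACCUM_STEPS = 4
calls = []

def forward_backward(batch):
    calls.append("fb")      # stub: would compute loss + accumulate grads

def optim_step():
    calls.append("step")    # stub: would clip, apply, and zero grads

for i, batch in enumerate(range(12), start=1):   # 12 micro-batches
    forward_backward(batch)
    if i % ACCUM_STEPS == 0:   # every 4th micro-batch...
        optim_step()           # ...apply one optimizer update
```

Twelve micro-batches, three optimizer updates. Changing the effective batch size is changing one constant.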

Beyond SFT: Reinforcement Learning on Your GPU

The same two primitives cover reinforcement learning — which is really the reason this shape matters. SFT is easy to express in any framework; RL is where the boilerplate gets ugly.

Local Tinker has a loss function hierarchy that supports SFT (default), DPO for preference pairs, PPO for advantage-weighted updates with an internal value head, and GRPO for group-relative policy optimization. You pick the loss by passing loss_fn="grpo" (or "dpo", "ppo") to forward_backward. The client handles the bookkeeping.
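For readers new to GRPO, "group-relative" means each completion's reward is normalized against the other completions sampled for the same prompt. A minimal sketch of that advantage computation — the standard formula, not Local Tinker's internals:

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    # Normalize rewards within one prompt's group of sampled completions:
    # advantage_i = (r_i - mean(group)) / (std(group) + eps)
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two of four sampled completions got the reward:
advs = group_advantages([0.0, 1.0, 1.0, 0.0])
```

The winners within a group are pushed up and the losers pushed down, with no learned value function required — which is why GRPO is attractive on memory-constrained hardware.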

A GRPO loop looks like this:

from local_tinker import ServiceClient

service = ServiceClient()
trainer = service.create_lora_training_client("Qwen/Qwen2.5-7B", rank=16)

for step in range(max_steps):
    # roll out completions from the current policy
    sampler = trainer.save_weights_and_get_sampling_client()
    completions = sampler.sample_batch(prompts, n=8)

    # score with YOUR reward function (regex, test, classifier, judge LLM...)
    rewards = [reward_fn(p, c) for p, c in zip(prompts, completions)]

    # GRPO update — same two-phase primitives
    trainer.forward_backward(completions, rewards, loss_fn="grpo")
    trainer.optim_step()

The interesting line is rewards = [reward_fn(p, c) for ...]. The reward function is yours. It can be a regex, a unit test, a classifier, or another LLM acting as a judge. Local Tinker doesn't care — it just needs a float per completion.
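As a concrete (hypothetical) example, a reward function for math-style prompts can be as cheap as a regex over the completion:

```python
import re

def reward_fn(prompt, completion):
    # Illustrative reward: 1.0 if the completion contains a boxed
    # integer answer, else 0.0 — the kind of cheap check that works
    # surprisingly well for verifiable math-style prompts.
    return 1.0 if re.search(r"\\boxed\{-?\d+\}", completion) else 0.0
```

Swapping this for a unit-test runner or a judge model changes nothing else in the loop.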

DPO follows the same shape but takes preferred and dispreferred pairs instead of rollouts. PPO manages a value head internally so you don't have to. All four loss functions use the same forward_backward + optim_step surface. The training loop doesn't change shape when you change objectives, which is the whole point.
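For reference, the per-pair objective a DPO backend computes is the standard formula from the DPO paper (Rafailov et al., 2023), sketched in plain Python:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Maximize the margin between how much the policy prefers the
    # chosen answer over the rejected one, relative to a frozen
    # reference model; beta controls the KL-ish regularization strength.
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At zero margin the loss is log 2; it falls as the policy separates the pair further than the reference does. No reward model and no rollouts — just log-probs from two forward passes.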

The Tech Stack


Local Tinker is a thin layer over a short stack of well-maintained libraries. PyTorch provides the tensor and autograd runtime. HuggingFace Transformers provides the base models and tokenizers. PEFT provides the LoRA adapter layers. bitsandbytes provides the 4-bit and 8-bit quantization kernels. Local Tinker is the glue that wires them into a single client API so you don't have to think about any of them during an experiment.

Getting Started

Step 1 — clone and install:

git clone https://github.com/josephgec/finetuning.git
cd finetuning
pip install -e .

Step 2 — see which models you can train on your current hardware:

$ local-tinker models

MODEL                          PARAMS    QUANT   VRAM     STATUS
meta-llama/Llama-3.2-1B        1.2B      4bit    ~3 GB    ready
meta-llama/Llama-3.2-3B        3.2B      4bit    ~6 GB    ready
mistralai/Mistral-7B-v0.3      7.3B      4bit    ~9 GB    ready
Qwen/Qwen2.5-7B                7.6B      4bit    ~10 GB   tight
meta-llama/Llama-2-13b-hf      13.0B     4bit    ~16 GB   OOM

The status column is the useful one. ready means the model and a reasonable batch size fit comfortably. tight means it'll work but you may need to reduce micro-batch size or LoRA rank. OOM means don't bother — it won't fit on this card.
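The ~VRAM column is easy to sanity-check with rough arithmetic. A weights-only lower bound is just params times bits; everything else (adapter, optimizer state, activations, allocator overhead) comes on top, which is why the table's figures run higher:

```python
def quantized_weight_gb(n_params, bits=4):
    # Weights-only lower bound: n_params * bits / 8 bytes.
    return n_params * bits / 8 / 1e9

# A 7.3B model in 4-bit is ~3.65 GB of weights alone;
# the table's ~9 GB estimate is everything training adds on top.
weights = quantized_weight_gb(7.3e9)
```

If the weights-only number alone is close to your card's VRAM, the status column will say OOM no matter how small you make the batch.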

Step 3 — launch a training run:

local-tinker run \
  --model meta-llama/Llama-3.2-3B \
  --dataset ./data/instructions.jsonl \
  --rank 32 \
  --epochs 3 \
  --quantize 4bit

There's a recipes/ directory in the repo with working configs for SFT, DPO, PPO, and GRPO. Copy one, point it at your dataset, and go.

Who This Is For

Local Tinker is for developers and researchers who have their own GPU, are working with 1B–13B models, and want to iterate quickly without fighting the framework. If you have a model idea and you want to test it before dinner, this is for you.

It's not a replacement for the cloud Tinker or for a full training platform like Axolotl or TRL. For models larger than ~13B, for distributed training across multiple machines, or for production training pipelines, use the right tool. Use Local Tinker for the tight loop of "wonder if X works → try X → see if it worked" on a model you can fit on one card.

Get Started

The source is on GitHub: github.com/josephgec/finetuning. Clone it, install it, run local-tinker models to see what fits on your GPU, and try an SFT run before you reach for RL. The recipes/ directory has known-good starting points for each loss function.

If you try it and something breaks, or if you have a use case that doesn't map cleanly onto the current primitives, open an issue or a PR. The API surface is small on purpose and I'd rather keep it small — but I'm open to good suggestions.

Happy tinkering.
