Stop Wrestling with Boilerplate — Local Tinker Gives You a Clean API for Local LLM Fine-Tuning
A Tinker-style API that runs on your own GPU. Write clean training loops for SFT, PPO, DPO, and GRPO without managing a single CUDA memory allocation.
The boilerplate for local fine-tuning is a lot. Wire up HuggingFace Transformers, bolt on PEFT for LoRA, pick a tokenizer, manage your own gradient accumulation, configure mixed-precision training, fight with bitsandbytes quantization flags, handle checkpointing, and hope your VRAM math was right. By the time the actual training loop runs, you've written three hundred lines of glue code that have nothing to do with your experiment.
Local Tinker is an attempt to collapse all of that into four primitives. It's a high-level, Tinker-style API for LoRA fine-tuning of small LLMs on your own GPU. You bring the model name, the dataset, and the training loop you actually care about. Everything else — quantization, LoRA adapter attachment, optimizer state, checkpointing, memory management — is handled.
It's designed for the 1B–13B class of models: Llama-3, Qwen, Mistral, and the other open-weight families that fit on a single consumer or prosumer GPU. If you have a 3090, a 4090, an A6000, or an H100, you're in scope. If you're training a 70B model, you want something else.
Why Another Fine-Tuning Tool?
Thinking Machines Lab's Tinker API made a convincing case that the right abstraction for fine-tuning is very small: forward_backward, optim_step, and sample. Those three calls plus a client object cover supervised learning, reinforcement learning, and everything in between. It's elegant.
The catch is that Tinker is a hosted cloud service. That's the right call for large models where most users don't have the hardware. But for the 1B–13B range, I already have the hardware. I'd rather run the same clean API against my own GPU, with my own data, without uploading anything or paying per token.
Local Tinker takes the mental model and the surface area of Tinker's API and maps it onto local hardware. Same primitives, same training loop shape, your machine.
The Core Primitives
The API exposes three client objects:
ServiceClient is the entry point. It discovers which models are available locally, checks your GPU, and hands out training and sampling clients. You instantiate it once.
TrainingClient is where fine-tuning happens. It wraps a quantized base model and a LoRA adapter, exposes forward_backward for loss computation and gradient accumulation, and optim_step for applying the accumulated update. Hundreds of lines of HuggingFace + PEFT + bitsandbytes wiring collapse into those two calls.
SamplingClient generates completions from the current (or a saved) checkpoint. It's what you reach for in the middle of an RL loop when you need to roll out trajectories from the current policy, or at the end of training when you want to verify the model actually learned what you told it to.
A minimal SFT loop looks like this:
from local_tinker import ServiceClient
service = ServiceClient()
trainer = service.create_lora_training_client(
    base_model="meta-llama/Llama-3.2-3B",
    rank=32,
    quantize="4bit",
)
for batch in dataloader:
    loss = trainer.forward_backward(batch)
    trainer.optim_step()
    print(f"loss: {loss:.4f}")
sampler = trainer.save_weights_and_get_sampling_client()
output = sampler.sample("Explain gradient descent simply:")
One import, one loop, one optim step, one sampling call at the end. That's the whole thing.
Two-Phase Design: Gradient Accumulation and LoRA
Why two phases? The reason is memory.
Fine-tuning a 7B model with LoRA means loading the base model in 4-bit quantized form (frozen), attaching a small LoRA adapter in bf16 or fp16 (trainable), and keeping the optimizer state in fp32. Even with that arrangement, the gradients for the LoRA adapter can be substantial, and you usually want to accumulate several micro-batches before applying an update to get a reasonable effective batch size.
Phase 1 — forward_backward — runs a single micro-batch through the model, computes the loss, and backprops through the LoRA adapter only. Gradients accumulate in place. You can call it many times in a row without ever calling optim_step.
Phase 2 — optim_step — applies the accumulated gradients, clips them, and advances the optimizer state. You call this once per effective batch.
Separating the two phases gives you exact control over the effective batch size without writing any gradient-accumulation plumbing yourself. You decide how many forward_backward calls go between optim_step calls; the client handles everything else.
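In plain PyTorch, the two phases amount to something like the sketch below. A toy linear model stands in for the LoRA parameters (gradient accumulation works the same way regardless of what the parameters belong to); this illustrates the pattern, not Local Tinker's internals, and the loss scaling and clipping threshold are assumptions.

```python
import torch

# Toy stand-in for the trainable LoRA parameters.
model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

accum_steps = 4  # forward_backward calls per optim_step

def forward_backward(x, y):
    # Phase 1: compute loss, backprop; .grad buffers accumulate in place.
    # Dividing by accum_steps keeps the summed gradient equal to the
    # mean gradient over the effective batch.
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()
    return loss.item() * accum_steps

def optim_step():
    # Phase 2: clip the accumulated gradients, apply the update, clear.
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()
    opt.zero_grad()

# Several micro-batches, then one update: effective batch = 2 * accum_steps.
for _ in range(accum_steps):
    forward_backward(torch.randn(2, 8), torch.randn(2, 1))
optim_step()
```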
Beyond SFT: Reinforcement Learning on Your GPU
The same two primitives cover reinforcement learning — which is really the reason this shape matters. SFT is easy to express in any framework; RL is where the boilerplate gets ugly.
Local Tinker has a loss function hierarchy that supports SFT (default), DPO for preference pairs, PPO for advantage-weighted updates with an internal value head, and GRPO for group-relative policy optimization. You pick the loss by passing loss_fn="grpo" (or "dpo", "ppo") to forward_backward. The client handles the bookkeeping.
A GRPO loop looks like this:
from local_tinker import ServiceClient
service = ServiceClient()
trainer = service.create_lora_training_client("Qwen/Qwen2.5-7B", rank=16)
for step in range(max_steps):
    # roll out completions from the current policy
    sampler = trainer.save_weights_and_get_sampling_client()
    completions = sampler.sample_batch(prompts, n=8)
    # score with YOUR reward function (regex, test, classifier, judge LLM...)
    rewards = [reward_fn(p, c) for p, c in zip(prompts, completions)]
    # GRPO update — same two-phase primitives
    trainer.forward_backward(completions, rewards, loss_fn="grpo")
    trainer.optim_step()
The interesting line is rewards = [reward_fn(p, c) for ...]. The reward function is yours. It can be a regex, a unit test, a classifier, or another LLM acting as a judge. Local Tinker doesn't care — it just needs a float per completion.
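As a concrete (and deliberately trivial) example, a regex-based reward for "did the completion produce a numeric answer" might look like this. The name `reward_fn` and the answer format are hypothetical stand-ins for whatever scorer you actually write:

```python
import re

def reward_fn(prompt: str, completion: str) -> float:
    """Toy reward: 1.0 if the completion contains a numeric answer
    in the form 'Answer: 42', else 0.0. Any callable that maps a
    (prompt, completion) pair to a float slots in the same way."""
    return 1.0 if re.search(r"Answer:\s*-?\d+", completion) else 0.0
```

Swapping this out for a unit-test runner or a judge-LLM call changes nothing about the training loop itself.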
DPO follows the same shape but takes preferred and dispreferred pairs instead of rollouts. PPO manages a value head internally so you don't have to. All four loss functions use the same forward_backward + optim_step surface. The training loop doesn't change shape when you change objectives, which is the whole point.
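For readers who want to see what the DPO objective actually computes on those pairs, here is a minimal sketch of the standard formulation from Rafailov et al. in plain PyTorch. It is illustrative, not Local Tinker's internal code, and the sequence log-probabilities are assumed to be precomputed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss: push the policy's margin
    between the preferred and dispreferred completions above the
    frozen reference model's margin. Inputs are per-sequence summed
    log-probabilities."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# If the policy prefers the chosen completion more strongly than the
# reference does, the loss falls below log(2); if it prefers the
# rejected one, the loss rises above log(2).
good = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                torch.tensor([-6.0]), torch.tensor([-6.0]))
bad = dpo_loss(torch.tensor([-9.0]), torch.tensor([-5.0]),
               torch.tensor([-6.0]), torch.tensor([-6.0]))
```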
The Tech Stack
Local Tinker is a thin layer of glue over a short stack of well-maintained libraries. PyTorch provides the tensor and autograd runtime. HuggingFace Transformers provides the base models and tokenizers. PEFT provides the LoRA adapter layers. bitsandbytes provides the 4-bit and 8-bit quantization kernels. Local Tinker is the glue that wires them into a single client API so you don't have to think about any of it during an experiment.
Getting Started
Step 1 — clone and install:
git clone https://github.com/josephgec/finetuning.git
cd finetuning
pip install -e .
Step 2 — see which models you can train on your current hardware:
$ local-tinker models
MODEL                        PARAMS  QUANT  VRAM    STATUS
meta-llama/Llama-3.2-1B      1.2B    4bit   ~3 GB   ready
meta-llama/Llama-3.2-3B      3.2B    4bit   ~6 GB   ready
mistralai/Mistral-7B-v0.3    7.3B    4bit   ~9 GB   ready
Qwen/Qwen2.5-7B              7.6B    4bit   ~10 GB  tight
meta-llama/Llama-2-13b-hf    13.0B   4bit   ~16 GB  OOM
The status column is the useful one. ready means the model and a reasonable batch size fit comfortably. tight means it'll work but you may need to reduce micro-batch size or LoRA rank. OOM means don't bother — it won't fit on this card.
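As a rough illustration of the arithmetic behind that column, here is a toy heuristic loosely fitted to the table above. The constants and thresholds are assumptions for illustration, not the CLI's actual logic:

```python
def fit_status(params_b: float, vram_gb: float) -> str:
    """Toy VRAM estimate for 4-bit LoRA fine-tuning: ~0.5 bytes/param
    for the quantized weights, with LoRA gradients, optimizer state,
    activations, and CUDA context folded into a rough linear overhead.
    Fitted by eye to the table above; an assumption, not a spec."""
    need_gb = 1.2 * params_b + 1.5  # estimated peak VRAM in GB
    if need_gb <= 0.8 * vram_gb:
        return "ready"   # fits with comfortable headroom
    if need_gb <= vram_gb:
        return "tight"   # fits, but reduce micro-batch size or rank
    return "OOM"         # will not fit on this card

# On a hypothetical 12 GB card:
for params in (1.2, 3.2, 7.6, 13.0):
    print(f"{params}B -> {fit_status(params, 12.0)}")
```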
Step 3 — launch a training run:
local-tinker run \
    --model meta-llama/Llama-3.2-3B \
    --dataset ./data/instructions.jsonl \
    --rank 32 \
    --epochs 3 \
    --quantize 4bit
There's a recipes/ directory in the repo with working configs for SFT, DPO, PPO, and GRPO. Copy one, point it at your dataset, and go.
Who This Is For
Local Tinker is for developers and researchers who have their own GPU, are working with 1B–13B models, and want to iterate quickly without fighting the framework. If you have a model idea and you want to test it before dinner, this is for you.
It's not a replacement for the cloud Tinker or for a full training platform like Axolotl or TRL. For models larger than ~13B, for distributed training across multiple machines, or for production training pipelines, use the right tool. Use Local Tinker for the tight loop of "wonder if X works → try X → see if it worked" on a model you can fit on one card.
Get Started
The source is on GitHub: github.com/josephgec/finetuning. Clone it, install it, run local-tinker models to see what fits on your GPU, and try an SFT run before you reach for RL. The recipes/ directory has known-good starting points for each loss function.
If you try it and something breaks, or if you have a use case that doesn't map cleanly onto the current primitives, open an issue or a PR. The API surface is small on purpose and I'd rather keep it small — but I'm open to good suggestions.
Happy tinkering.
References
The papers and projects this one builds on, for readers who want to go deeper.
LoRA and quantization
- Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685
- Dettmers et al. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. arXiv:2305.14314
- Dettmers et al. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022. arXiv:2208.07339
Reinforcement learning from feedback
- Schulman et al. Proximal Policy Optimization Algorithms. 2017. arXiv:1707.06347
- Rafailov et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. arXiv:2305.18290
- Shao et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024 (introduces GRPO). arXiv:2402.03300
Libraries and prior art
- Thinking Machines Lab. Tinker API. thinkingmachines.ai/tinker
- HuggingFace. PEFT — Parameter-Efficient Fine-Tuning. github.com/huggingface/peft
- bitsandbytes. github.com/bitsandbytes-foundation/bitsandbytes
- HuggingFace. TRL — Transformer Reinforcement Learning. github.com/huggingface/trl
Project source
- Local Tinker on GitHub — github.com/josephgec/finetuning