Two years ago, training a large language model required either renting time at a research lab or accepting that fine-tuning was the preserve of billion-dollar companies. By May 2026, Qwen3.6-27B can be taken from a Hugging Face download to a domain-specialised model on a single rented H100 for less than fifteen dollars. The tools have changed. The underlying mathematics has not, but the population of those who use it has expanded. This article describes how to train an open-source LLM in practice today: what hardware is required, which model to choose, how to format the data so that the trainer does not silently discard it, and how to place the result behind a serving endpoint that responds in milliseconds.
Summary
What this post covers: A working 2026 playbook for fine-tuning open-source LLMs using three concrete anchors — the dense Qwen3.6-27B, the MoE Qwen3.5-122B-A10B, and OpenAI’s GPT-OSS-120B — from environment setup through deployment.
Key insights:
- QLoRA on a single H100 (80GB) now fine-tunes a 27B dense model in 8 to 12 hours for $10 to $16 of cloud rental, retaining 80 to 90 percent of full fine-tuning quality.
- MoE models like Qwen3.5-122B-A10B (10B active) and GPT-OSS-120B (5.1B active) need VRAM to hold all 122B or 117B weights, even though per-token compute is small — the “active parameter” headline number is a runtime FLOPs claim, not a memory one.
- Chat-template mismatch between training and inference is the single most common cause of a “trained but acts untrained” model: Qwen’s <|im_start|> markers and GPT-OSS’s harmony format are not interchangeable.
- GPT-OSS-120B ships post-trained with MXFP4 quantization on the MoE weights, which is why a 117B-total-parameter model fits in a single 80GB H100 at inference time.
- For anything past 70B at full precision, FSDP2 or DeepSpeed ZeRO-3 sharding is no longer optional: single-GPU training caps out around 32B dense in FP16 even on H200 (141GB) hardware.
Main topics: The State of Open-Source LLM Training in 2026; The Three Anchor Models; Choosing Full Fine-Tune, LoRA, or QLoRA; Setting Up the Training Environment; Preparing the Dataset; The Actual Training Run; Substantive Evaluation; Deployment; Common Pitfalls and Debugging.
The State of Open-Source LLM Training in 2026
The open-source LLM landscape in May 2026 bears little resemblance to that of early 2024. Two structural shifts have transformed what a single engineer can accomplish alone.
The first shift is architectural. Mixture-of-Experts (MoE) models, in which each token activates only a small subset of total parameters, have become the dominant configuration for any model larger than 30B. A dense model uses every weight on every token; an MoE model uses a router to direct each token to a small fraction of “expert” sub-networks. Qwen3.5-122B-A10B has 122B total parameters but only approximately 10B active per forward pass. GPT-OSS-120B contains 117B total parameters with 5.1B active. The runtime FLOPs resemble those of a small model; the VRAM footprint does not.
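The routing step described above can be sketched in a few lines. This is a toy illustration rather than the router of any model discussed here: the expert count, the top-k value, the renormalisation over the selected experts, and the function names are all assumptions chosen for readability.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and renormalise their weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


NUM_EXPERTS = 8  # hypothetical expert count
TOP_K = 2        # hypothetical number of experts used per token

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits, TOP_K)

# Only TOP_K of NUM_EXPERTS experts run for this token (compute),
# but all NUM_EXPERTS must be resident in memory, because the choice
# is only known once the router has seen the token.
active_fraction = TOP_K / NUM_EXPERTS
print(selected, active_fraction)
```

The fraction of experts that run per token governs FLOPs; the full set of experts governs VRAM, which is the distinction the active-parameter figures above rely on.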
The second shift concerns post-training tooling. QLoRA, in which the base weights are frozen at 4-bit NF4 (NormalFloat-4, a quantisation format optimised for the distribution of neural network weights) and only a small low-rank adapter is trained, has moved from a research curiosity in 2023 to the default starting point in 2026. LoRA (Low-Rank Adaptation) retains 90 to 95 per cent of full fine-tuning performance. QLoRA retains 80 to 90 per cent while reducing VRAM by approximately 75 per cent compared with FP16.
The practical implication is as follows: a 7B model whose FP16 weights alone occupy approximately 14GB can now be fine-tuned in 5 to 6GB under QLoRA. A 70B model whose FP16 weights occupy approximately 140GB now fits in 46GB. The hardware threshold has dropped sufficiently that the question has shifted from whether training is affordable to what should be trained.
The implications for a practitioner intending to train a model today are as follows: prosumer hardware (a single H100 or H200, or even a 48GB workstation card such as the RTX 6000 Ada) can handle QLoRA on models up to 70B. Beyond that point, multi-GPU LoRA or sharded full fine-tuning is required. Specific recipes for each scenario are presented below.
Pretraining from scratch—the 2.1 million H100-hour run that produced GPT-OSS-120B—remains out of reach for almost all practitioners. Within reach, however, is taking one of these three checkpoints and adapting it to a particular dataset, domain, or task. This is what “training an open-source LLM” means in practice in 2026.
The Three Anchor Models
Three models cover the practical range of what is fine-tuned today: a dense 27B model that fits comfortably on prosumer hardware, a sparse 122B model that requires cluster-class memory but inexpensive compute, and a 117B MoE model that ships pre-quantised to fit on a single 80GB card.
Qwen3.6-27B
Released on 22 April 2026 by Alibaba’s Qwen team. Dense: every one of the 27 billion parameters participates in every forward pass. It uses Gated DeltaNet, a hybrid attention scheme that combines a linear-attention path (constant memory cost per token) with traditional softmax self-attention. The linear path handles long-range context, while the softmax path preserves short-range precision.
Native context is 262,144 tokens, extensible to one million via position-encoding extrapolation. The model is natively multimodal: the same checkpoint accepts images and text. A “Thinking Preservation” mechanism maintains a chain-of-thought reasoning mode and a fast non-thinking mode within a single set of weights.
Benchmark figures from the Qwen team include SWE-bench Verified 77.2 (compared with Qwen3.5-397B-A17B at 76.2), SWE-bench Pro 53.5 (compared with 50.9), Terminal-Bench 2.0 59.3 (compared with 52.5), and SkillsBench 48.2 (compared with 30.0). A 27B dense model surpassing its 397B MoE predecessor on code-related work is the kind of result that re-establishes the importance of architecture choice.
The model can be downloaded from the QwenLM/Qwen3.6 official repository or the Hugging Face Qwen/Qwen3.6-27B mirror. The licence is Apache 2.0: commercial use is permitted with attribution.
Qwen3.5-122B-A10B
Released on 24 February 2026. A sparse MoE: 122 billion total parameters, approximately 10 billion active per forward pass. The “A10B” suffix denotes the active-parameter count. Each token is routed through a small subset of experts, while the remainder of the network remains idle for that token.
The model shares the Gated DeltaNet hybrid attention of Qwen3.6-27B and the same 262K native context, extensible to 1M+. It is text-only at this size. The MoE structure means inference compute resembles that of a 10B model, but VRAM must still hold all 122B weights, because the router cannot determine in advance which expert any given token will require.
This is the appropriate model when strong quality is required alongside inexpensive per-token serving. The active-parameter count determines latency and energy cost; the total parameter count determines hardware purchasing decisions. The trade-off is frequently misunderstood on first encounter.
GPT-OSS-120B
Part of OpenAI’s first open-weight LLM release since GPT-2 (2019), published in August 2025. The model contains 117 billion total parameters with 5.1 billion active, under an Apache 2.0 licence. It was trained on NVIDIA H100 GPUs using PyTorch with custom Triton kernels. The training run consumed 2.1 million H100-hours, which at $2 per hour in cloud pricing represents approximately $4.2 million in compute alone.
What makes GPT-OSS-120B unusual is that it ships post-trained with MXFP4 quantisation on the MoE weights. MXFP4 is a 4-bit floating-point format with a shared scale per micro-block. Because the bulk of the parameter count resides in the MoE expert layers, quantising those layers to 4-bit reduces the on-disk and in-VRAM footprint sufficiently to fit on a single GPU with 80GB of memory or more (an NVIDIA H100 or an AMD MI300X). The non-expert layers remain at higher precision.
Reported benchmarks indicate near-parity with OpenAI’s o4-mini on core reasoning. For a model that can run on a single rented GPU, this is a notable result. The model card and weights are available at huggingface.co/openai/gpt-oss-120b; the official repository is at github.com/openai/gpt-oss; the launch announcement is at openai.com/index/introducing-gpt-oss.
| Attribute | Qwen3.6-27B | Qwen3.5-122B-A10B | GPT-OSS-120B |
|---|---|---|---|
| Total params | 27B | 122B | 117B |
| Active params | 27B (dense) | ~10B | 5.1B |
| Architecture | Dense, Gated DeltaNet | MoE, Gated DeltaNet | MoE, grouped-query attn |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| Release date | 2026-04-22 | 2026-02-24 | August 2025 |
| Native context | 262K (extensible to 1M) | 262K (extensible to 1M+) | 128K |
| Multimodal | Yes (vision + text) | Text only | Text only |
| Download | HF: Qwen/Qwen3.6-27B | HF: Qwen/Qwen3.5-122B-A10B | HF: openai/gpt-oss-120b |
Choosing Full Fine-Tune, LoRA, or QLoRA
Three fine-tuning methods cover essentially the entire field. They occupy positions along a cost-versus-quality spectrum, and the appropriate choice depends on the volume of available data and the degree to which the target domain differs from the base model’s training distribution.
Full fine-tuning updates every parameter. It requires approximately four times the model’s memory footprint during training: model weights, gradients, optimizer states (two for AdamW: first and second moment), and activations. A 7B model requires approximately 14GB in FP16 for weights alone; with optimizer states and gradients, peak usage approaches 60GB.
LoRA (Low-Rank Adaptation) freezes the base weights and inserts trainable low-rank matrices into the attention projection layers. Instead of updating the full weight matrix W (for example, 4096×4096 = approximately 16.7M parameters), two small matrices B (4096×r) and A (r×4096) are trained, where r is typically 8, 16, or 32. The model effectively learns ΔW = B·A, which is added to the frozen W at inference. For r = 16, this amounts to approximately 131K trainable parameters per layer rather than 16.7M, roughly 128 times fewer.
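The parameter arithmetic and the low-rank update can be checked with a short sketch. It uses plain Python lists instead of tensors, and the helper names (`lora_param_count`, `merged_weight`) and the tiny 3×3 example are illustrative assumptions; the `scale` argument stands in for the alpha/r factor that LoRA implementations apply.

```python
def lora_param_count(d_out, d_in, r):
    """Trainable parameters in B (d_out x r) plus A (r x d_in)."""
    return d_out * r + r * d_in


def matmul(X, Y):
    """Plain-Python matrix product of X (m x k) and Y (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merged_weight(W, B, A, scale=1.0):
    """Return W + scale * (B @ A), the effective weight after adaptation."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Figures from the text: one 4096 x 4096 matrix adapted with rank 16.
full_params = 4096 * 4096
lora_params = lora_param_count(4096, 4096, r=16)
reduction = full_params / lora_params
print(full_params, lora_params, reduction)

# Tiny rank-1 demonstration on a 3 x 3 identity matrix.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[0.5], [0.0], [0.0]]   # 3 x 1
A = [[0.0, 2.0, 0.0]]       # 1 x 3
W_eff = merged_weight(W, B, A)
print(W_eff)
```

The count applies to each adapted weight matrix, so a model with several adapted projections per block multiplies the figure accordingly.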
QLoRA extends LoRA further. The frozen base weights are quantised to 4-bit NF4 (NormalFloat-4, designed to match the typical Gaussian distribution of neural network weights), and LoRA adapters sit on top in FP16 or BF16. The weights are de-quantised on the fly only during forward and backward passes. Memory consumption decreases by approximately 75 per cent compared with FP16 training.
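The 75 per cent figure follows directly from the bit widths. The sketch below counts weight storage only, assuming exactly 16 and 4 bits per parameter; it ignores the quantisation constants, the adapters, the optimiser state, and activations, which is why real QLoRA runs need more than the weight figure alone. The function name is an assumption.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_7b = weight_memory_gb(7e9, 16)
nf4_7b = weight_memory_gb(7e9, 4)
saving_7b = 1 - nf4_7b / fp16_7b

fp16_70b = weight_memory_gb(70e9, 16)
nf4_70b = weight_memory_gb(70e9, 4)

print(f"7B:  FP16 {fp16_7b:.1f} GB, 4-bit {nf4_7b:.1f} GB, saving {saving_7b:.0%}")
print(f"70B: FP16 {fp16_70b:.1f} GB, 4-bit {nf4_70b:.1f} GB")
```

The gap between these weight-only figures and the 5 to 6GB (7B) and 46GB (70B) totals quoted in this article is the space taken by adapters, optimiser state, and activations.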
| Method | VRAM (7B) | VRAM (70B) | Wall time (1 H100) | Cost (cloud) | Quality retention |
|---|---|---|---|---|---|
| Full FT | ~60 GB | ~560 GB (needs 8×H100) | 24-48h on 8×H100 | $250-510 | 100% (baseline) |
| LoRA | ~16 GB | ~160 GB (2-4 GPUs) | 10-15h | $20-40 | 90-95% |
| QLoRA | ~6 GB | ~46 GB (1 H100/H200) | 8-12h | $10-16 | 80-90% |
The practical selection heuristic is to begin with QLoRA. If quality is insufficient after a sweep over rank, learning rate, and data size, the next step is LoRA. Full fine-tuning should be reserved for cases in which the domain shift is so substantial that the base model’s representation is genuinely wrong—for example, a model trained predominantly on English required to operate in a low-resource language. The 80 to 90 per cent quality retention of QLoRA is sufficient for the majority of production tasks.
It is worth noting that GPT-OSS-120B’s 4-bit inference figure (approximately 35 GB) is substantially lower than Qwen3.5-122B’s 62 GB despite similar total parameter counts. This is the advantage of MXFP4-native quantisation. Qwen3.5 must be quantised after training (AWQ or GPTQ), incurring some additional accuracy loss; GPT-OSS-120B was post-trained with the 4-bit format already in mind.
Setting Up the Training Environment
Three years ago, this section would have been considerably more complex: CUDA versions, PyTorch builds, mismatched Triton, and broken bitsandbytes. In May 2026 the process remains finicky, but the recipe is more stable.
The requirements are CUDA 12.6 or newer (CUDA 12.8 ships well with the H100/H200 SXM5 drivers), cuDNN 9.5 or newer, PyTorch 2.7 stable or 2.8 nightly, and recent versions of transformers, peft, accelerate, trl, bitsandbytes, and vllm. Flash Attention 3 requires Hopper (H100/H200) or newer; on Ampere (A100), Flash Attention 2 is the fallback.
The cleanest approach uses a Docker container that pins all of these versions. Building locally is the second-cleanest option. Operating in a bare Python environment invites an evening of debugging mismatched CUDA symbols. Containerising the training environment with a known-good base image, typically nvidia/cuda:12.8.0-cudnn-devel-ubuntu24.04, is the standard approach.
A working pyproject.toml for a fine-tuning project as of May 2026 is shown below:
[project]
name = "llm-finetune"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "torch==2.7.0",
    "transformers==4.50.2",
    "peft==0.14.1",
    "bitsandbytes==0.46.0",
    "accelerate==1.4.0",
    "trl==0.16.0",
    "datasets==3.5.0",
    "unsloth==2026.5.3",
    "flash-attn==3.0.1",
    "vllm==0.9.2",
    "wandb==0.19.5",
    "sentencepiece==0.2.0",
    "tiktoken==0.7.0",
    "lm-eval==0.4.7",
]
[tool.uv]
index-strategy = "unsafe-best-match"
[[tool.uv.index]]
name = "pytorch-cuda128"
url = "https://download.pytorch.org/whl/cu128"
A Dockerfile producing a known-good training image is shown below:
FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu24.04
ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    HF_HOME=/workspace/.cache/huggingface \
    TORCH_CUDA_ARCH_LIST="9.0;10.0"
# Ubuntu 24.04 packages Python 3.12 (python3.11 is not in its default repositories);
# 3.12 satisfies requires-python >= 3.11 in pyproject.toml
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-venv python3-pip python3-dev git curl ca-certificates \
    build-essential ninja-build cmake \
    && rm -rf /var/lib/apt/lists/*
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:${PATH}"
WORKDIR /workspace
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
# Flash Attention 3 needs to compile against the installed torch
RUN uv pip install --no-build-isolation flash-attn==3.0.1
COPY . .
CMD ["uv", "run", "python", "-m", "train"]
The framework landscape in 2026 is as follows: TRL is HuggingFace’s official trainer for SFT (supervised fine-tuning) and reinforcement learning post-training. Axolotl is a YAML-config layer on top of TRL that handles much of the data-preparation boilerplate. Unsloth is a Triton-optimised custom kernel package that claims up to twice the training speed and 60 per cent lower VRAM consumption through hand-tuned kernels, and is now stable enough for production use. torchtitan is Meta’s reference scaffolding for large-scale pretraining and full fine-tuning with FSDP2.
| Framework | Primary use case | Scaling target | Ergonomics | Recent activity |
|---|---|---|---|---|
| TRL | SFT, DPO, GRPO, PPO | 1-8 GPUs, single node | Python API, flexible | Very active |
| Axolotl | SFT, DPO with YAML config | 1-8 GPUs | YAML, low boilerplate | Active |
| Unsloth | Single-GPU QLoRA/LoRA, speed | 1 GPU (multi-GPU in 2025) | Drop-in trainer wrapper | Very active |
| torchtitan | Full FT, pretraining at scale | Multi-node FSDP2 | Lower-level, reference impl | Active |
For the large majority of fine-tuning work, the appropriate choice is TRL (or Unsloth-wrapped TRL for single-GPU work). torchtitan should be reserved for full fine-tuning at 70B and above, where FSDP2 sharding becomes mandatory.
Preparing the Dataset
The training loop is the straightforward part. Dataset preparation is where most fine-tuning runs silently fail. The model trains for hours and the loss curve appears acceptable, but at evaluation the outputs are wrong, off-topic, or in the wrong format. The problem is almost always located upstream of the trainer.
A modern instruction-tuning dataset consists of a list of conversations. Each conversation is a list of turns, and each turn has a role (system, user, assistant) and content. JSONL is the standard on-disk format, with one JSON object per line. Parquet is the high-performance alternative for large datasets requiring columnar compression.
A single training example for Qwen3.6-27B appears as follows:
{"messages": [
  {"role": "system", "content": "You are an industrial controls expert."},
  {"role": "user", "content": "Explain why a PID loop on a cobot joint might oscillate after a payload change."},
  {"role": "assistant", "content": "When the end-effector payload changes, the effective inertia at the joint changes. The PID gains were tuned for the original inertia. With higher effective mass, the proportional term applies the same torque correction to a slower-responding system, which can overshoot the setpoint and drive sustained oscillation. Two practical fixes: gain scheduling (load the right PID set when the gripper picks up the part) or an adaptive controller (online inertia estimation). The first is easier to validate in safety review."}
]}
The tokenizer then applies the model’s chat template (a Jinja-style template defined inside tokenizer_config.json) to convert that list of turns into a single tokenised sequence with the model’s special tokens. For Qwen3.6, the chat template wraps each turn in <|im_start|>role\ncontent<|im_end|>. For GPT-OSS-120B, the harmony format is used, with <|start|>, <|message|>, and <|end|> delimiters plus <|channel|> markers. These are not interchangeable. A model trained with the wrong template and queried with the correct one will behave as though it had never been trained.
The standard loss-masking pattern is as follows: the model is trained to predict assistant tokens, but the loss is masked (set to -100, the standard ignore_index for PyTorch’s CrossEntropyLoss) on system and user tokens. It is undesirable to teach the model to generate user messages.
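The masking pattern can be illustrated without a tokenizer. In this minimal sketch the token IDs are invented and each turn is assumed to be tokenised already; a real pipeline would obtain the IDs from the chat template, and the trainer would normally build the labels itself. `build_labels` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # ignore_index default of PyTorch's CrossEntropyLoss


def build_labels(turns):
    """turns: list of (role, token_ids). Returns (input_ids, labels).

    Labels copy the input IDs on assistant turns and are set to
    IGNORE_INDEX everywhere else, so only assistant tokens contribute
    to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Made-up token IDs for a three-turn conversation.
turns = [
    ("system", [11, 12]),
    ("user", [21, 22, 23]),
    ("assistant", [31, 32, 33, 34]),
]
input_ids, labels = build_labels(turns)
print(input_ids)
print(labels)
```

The two lists always have the same length; the model still attends to the system and user tokens, it is simply not penalised for failing to predict them.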
A representative data-loading pipeline for Qwen3.6-27B, using the HuggingFace datasets library, is shown below:
from datasets import load_dataset
from transformers import AutoTokenizer
MODEL_ID = "Qwen/Qwen3.6-27B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
def format_example(example):
    """Apply Qwen's chat template and return the formatted text (not yet tokenized)."""
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}
ds = load_dataset("json", data_files="data/train.jsonl", split="train")
ds = ds.map(format_example, remove_columns=ds.column_names)
# Train/eval split with a fixed seed for reproducibility
split = ds.train_test_split(test_size=0.05, seed=42)
train_ds, eval_ds = split["train"], split["test"]
print(f"Train: {len(train_ds)}, Eval: {len(eval_ds)}")
print("Sample formatted text:")
print(train_ds[0]["text"][:500])
Before training, two additional passes should be performed on the dataset. First, deduplication: exact-match dedup is inexpensive (a hash per example), while MinHash or SimHash near-dedup catches paraphrases. Duplicates distort the loss curve and bias the model toward memorising common patterns.
Second, a contamination check: it must be ensured that none of the training data overlaps with the evaluation benchmarks. If the evaluation is MMLU and the training data was scraped from Common Crawl, there is a real probability that MMLU questions are present. A substring search of evaluation questions against the training set should be conducted, with any matches removed.
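Both passes fit in a few lines for small datasets. The sketch below assumes each example is a plain string and uses a simple lowercase-and-collapse-whitespace normalisation, which is an illustrative choice; the helper names and sample strings are invented. Near-duplicate detection (MinHash, SimHash) is not shown.

```python
import hashlib


def normalise(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def exact_dedup(examples):
    """Keep the first occurrence of each normalised example."""
    seen, unique = set(), []
    for text in examples:
        digest = hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def drop_contaminated(examples, eval_questions):
    """Remove any example that contains an evaluation question as a substring."""
    needles = [normalise(q) for q in eval_questions]
    return [t for t in examples if not any(n in normalise(t) for n in needles)]


train = [
    "What is a PID loop?",
    "what is a  PID loop?",          # duplicate after normalisation
    "Explain gain scheduling.",
    "Q: What is 2 + 2? A: 4",        # contains an evaluation question
]
eval_questions = ["What is 2 + 2?"]

deduped = exact_dedup(train)
clean = drop_contaminated(deduped, eval_questions)
print(clean)
```

The substring scan is quadratic in the worst case, so for large corpora an index over n-grams is the more practical route.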
When data preparation is sufficiently complex to warrant orchestration, Airflow data pipelines are a suitable fit, as the dedup, contamination check, and tokenisation steps map well to a directed acyclic graph.
The output of tokenizer.apply_chat_template should be inspected to confirm that it matches the format expected by the model. The first 1000 characters of a tokenised example should be printed before any long run.
The Actual Training Run
Three concrete recipes are presented below, covering the three anchor models across three hardware budgets. Each provides a known-working starting point from which learning rate, rank, and data mixture may be tuned.
Recipe 1: QLoRA on Qwen3.6-27B, Single H100 (80GB)
This is the most accessible setup. One rented H100 from Lambda Labs, RunPod, or a comparable cloud provider costs approximately $1.80 to $2.50 per hour as of May 2026. With 50,000 training examples and three epochs, the target wall time is eight to twelve hours, for a total bill of $10 to $16. This is the recipe most teams actually use.
# train_qlora_qwen36.py
import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
MODEL_ID = "Qwen/Qwen3.6-27B"
OUTPUT_DIR = "out/qwen36-27b-qlora"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat-4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # nested quantization of the quant constants
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer.padding_side = "right"  # important: right-pad for SFT
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_3",
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False  # cache is not used during training; saves VRAM
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,  # alpha/r = 2 is a common starting ratio
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
# split="train" is the default split name datasets assigns to a single JSON file
train_ds = load_dataset("json", data_files="data/train.jsonl", split="train")
eval_ds = load_dataset("json", data_files="data/eval.jsonl", split="train")
sft_config = SFTConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch = 16
    gradient_checkpointing=True,  # trade compute for VRAM
    learning_rate=2e-4,  # LoRA-typical; full FT would use ~1e-5
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="paged_adamw_8bit",  # 8-bit optimizer to save more VRAM
    bf16=True,
    max_seq_length=4096,
    packing=True,  # pack short examples to maximize GPU use
    eval_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    save_total_limit=3,
    logging_steps=20,
    report_to="wandb",
    seed=42,
)
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # this argument was named `tokenizer` in older TRL releases
    args=sft_config,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model(OUTPUT_DIR)  # with a peft_config, this writes the adapter weights only
The principal design choices in the script merit explanation:
- NF4 with double quantisation: NF4 quantises the weights themselves; double quantisation additionally quantises the per-block scaling constants, saving a further approximately 0.4 bits per parameter on average.
- Gradient checkpointing: activations are recomputed during the backward pass rather than stored. With checkpoints placed at regular intervals, activation memory falls from growing linearly with the number of layers to growing roughly with its square root, at a cost of roughly 30 per cent additional compute. The trade is almost always worthwhile for LoRA and QLoRA.
- Gradient accumulation: with a per-device batch size of 2 and accumulation steps of 8, the effective batch is 16. This is useful when VRAM constrains the per-step batch but the optimisation signal of a larger batch is desired.
- Paged AdamW 8-bit: optimiser states (first and second moments) at 8-bit precision, with paging to CPU when not in use. Reduces optimiser-state memory by a factor of four compared with FP32 AdamW.
- Packing: concatenates multiple short examples into one sequence up to max_seq_length. Without packing, padding to 4096 tokens wastes most of the compute on short examples.
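The effect of packing is easy to see on a toy batch. The sketch below uses a simple greedy strategy that starts a new packed sequence whenever the next example would overflow; this is an assumption made for illustration, and the packing implemented by TRL may differ in detail. Token values and function names are invented.

```python
def pack_sequences(sequences, max_seq_length):
    """Greedily concatenate token sequences into rows of at most max_seq_length."""
    packed, current = [], []
    for seq in sequences:
        seq = seq[:max_seq_length]  # truncate anything longer than one row
        if current and len(current) + len(seq) > max_seq_length:
            packed.append(current)
            current = []
        current = current + seq
    if current:
        packed.append(current)
    return packed


def padding_fraction(rows, max_seq_length):
    """Share of token slots that would be padding if every row were padded."""
    used = sum(len(r) for r in rows)
    return 1 - used / (len(rows) * max_seq_length)


MAX_LEN = 5  # tiny cap so the example stays readable
examples = [[1] * 3, [2] * 2, [3] * 4, [4] * 1]

unpacked_waste = padding_fraction(examples, MAX_LEN)
packed = pack_sequences(examples, MAX_LEN)
packed_waste = padding_fraction(packed, MAX_LEN)

print(packed)
print(unpacked_waste, packed_waste)
```

Four padded rows become two full rows here. In a real run the attention mask or position IDs also need to stop packed examples from attending to each other, which this sketch does not model.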
Recipe 2: Multi-GPU LoRA on Qwen3.5-122B-A10B
122B total parameters corresponds to approximately 244GB in FP16 for the weights alone. Two H200s (141GB each, 282GB combined) or four H100s (320GB combined) can hold the sharded weights, but leave limited headroom for activations, adapter gradients, and optimiser state. The accelerate configuration below therefore specifies FSDP2 with the model sharded across eight GPUs.
# accelerate_config_fsdp.yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_processes: 8
num_machines: 1
machine_rank: 0
gpu_ids: all
fsdp_config:
  fsdp_version: 2
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3MoeDecoderLayer
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_offload_params: false
  fsdp_use_orig_params: true
  fsdp_sync_module_states: true
  fsdp_cpu_ram_efficient_loading: true
  fsdp_activation_checkpointing: true
Launch the run with: accelerate launch --config_file accelerate_config_fsdp.yaml train_lora_qwen35.py
The training script is structurally similar to Recipe 1, with three changes: no BitsAndBytesConfig (LoRA rather than QLoRA), device_map=None (FSDP manages placement), and per-device batch size reduced to 1 with accumulation steps increased to maintain an effective batch of approximately 32. Wall time for 50K examples over three epochs on 8× H100 is approximately 18 to 24 hours.
Recipe 3: Multi-Node Full Fine-Tune on GPT-OSS-120B
Full fine-tuning a 117B MoE is genuinely expensive. The model weights in BF16 alone occupy approximately 234GB. With the addition of gradients, optimiser states (AdamW keeps two FP32 values per parameter at 4 bytes each, so 8 bytes per parameter or approximately 940GB), and activations, cluster-class memory is required. The practical lower bound is 32 H100 GPUs across four nodes, using torchtitan with FSDP2 sharding across all 32 GPUs and tensor parallelism within each node.
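The memory budget can be reproduced with simple arithmetic. The sketch below assumes BF16 weights and gradients (2 bytes each) and two FP32 AdamW moments (4 bytes each); it deliberately leaves out activations and any FP32 master copy of the weights, so the real requirement is higher. The function name and defaults are assumptions.

```python
def full_finetune_memory_gb(num_params, weight_bytes=2, grad_bytes=2, adam_state_bytes=4):
    """Static memory for full fine-tuning with AdamW, in GB (1 GB = 1e9 bytes)."""
    weights = num_params * weight_bytes / 1e9
    gradients = num_params * grad_bytes / 1e9
    optimiser = num_params * 2 * adam_state_bytes / 1e9  # first and second moment
    return {
        "weights": weights,
        "gradients": gradients,
        "optimiser": optimiser,
        "total": weights + gradients + optimiser,
    }


budget = full_finetune_memory_gb(117e9)
cluster_gb = 32 * 80  # 32 H100s at 80 GB each

for name, gb in budget.items():
    print(f"{name:>10}: {gb:,.0f} GB")
print(f"{'cluster':>10}: {cluster_gb:,} GB")
```

Under these assumptions the static state alone exceeds the capacity of two eight-GPU H100 nodes, before a single activation is stored.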
For most use cases this is not the appropriate path. Full fine-tuning also risks losing the post-training calibration and safety tuning baked into the released checkpoint. The pragmatic path for GPT-OSS-120B is LoRA with rank 32, with the adapter applied to attention and MoE expert gate projections only.
| Setup | Combined VRAM | What it can train |
|---|---|---|
| Single H100 QLoRA | 80 GB | Up to ~70B with QLoRA; Qwen3.6-27B comfortably |
| Single H200 QLoRA | 141 GB | Up to ~120B with QLoRA; comfortable 70B LoRA |
| 2× H200 LoRA | 282 GB | Full LoRA on Qwen3.5-122B-A10B with FSDP2 |
| 8× H100 LoRA | 640 GB | LoRA on any model up to ~200B with sharding |
| 8× H100 full FT | 640 GB | Full FT up to ~70B with FSDP2 + activation checkpointing |
| 32× H100 multi-node | 2,560 GB | Full FT on 120B+ MoE; small pretraining runs |
Across all three recipes, the choice of optimiser matters more than is commonly appreciated. AdamW with a cosine learning rate schedule and 3 per cent warm-up is the strong default. For LoRA, the learning rate is typically 1e-4 to 2e-4—substantially higher than the 1e-5 to 5e-5 used for full fine-tuning—because LoRA’s adapter layers begin near zero and require larger steps to learn meaningful deltas. Checkpoints should be saved every 1000 steps. Adapter-only (PEFT) checkpoints are preferable to full-model checkpoints; they are approximately one hundred times smaller.
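The shape of the default schedule is worth seeing in numbers. The following is a minimal sketch of linear warm-up followed by cosine decay to zero, written from the standard formula rather than copied from any library; the function name, the rounding of the warm-up length, and the step counts are assumptions.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warm-up to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical run length; warm-up is then 30 steps
for step in (0, 15, 30, 515, 1000):
    print(step, f"{lr_at_step(step, TOTAL_STEPS):.2e}")
```

The learning rate reaches its peak at the end of warm-up, passes through half the peak midway through the decay phase, and ends at zero.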
For systematic optimisation of learning rate and rank, Bayesian hyperparameter optimisation with Gaussian processes is efficient. Random search is acceptable when the additional complexity is not warranted; grid search is almost never worthwhile for LoRA.
Substantive Evaluation
Most fine-tuning evaluation amounts to theatre. The model is trained, training loss decreases, an “evaluation” runs on a sliver of the training set (or the same data slightly shuffled), and the team declares success. The model is then deployed to production, where it underperforms.
Substantive evaluation requires three properties: the evaluation data must not have been observed during training; the evaluation metric must measure the actual task rather than a proxy; and the metric must be reproducible across runs.
For general language understanding and reasoning, the standard benchmarks are MMLU (multi-task language understanding across 57 subjects), HumanEval (function-completion code), GSM8K (grade-school mathematics word problems), and MT-Bench (multi-turn instruction following, judged by a strong LLM). For code-heavy use cases, SWE-bench Verified and Terminal-Bench 2.0 are the current standards.
The community-standard tool is lm-evaluation-harness from EleutherAI, which runs the model against a registered benchmark suite in a reproducible manner:
lm_eval \
--model hf \
--model_args pretrained=out/qwen36-27b-qlora,trust_remote_code=True \
--tasks mmlu,gsm8k,humaneval \
--batch_size auto \
--output_path eval_results/qwen36-qlora.json
The contamination problem is real and frequently neglected. If the training data was scraped from the public web, there is a non-trivial probability that benchmark questions are present. The decontamination check consists of an n-gram (typically 8-gram) overlap test between the training set and each benchmark’s question text, with any matches removed from training. Without this check, evaluation scores represent an upper bound that obscures the effect of contamination.
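An 8-gram overlap test of the kind described can be sketched as follows. Tokenisation here is lowercase whitespace splitting, which is an assumption made to keep the example self-contained; a production check would normally also strip punctuation or use the model tokenizer. The function names and sample sentences are invented.

```python
def ngrams(text, n=8):
    """Set of word n-grams in text; empty if the text has fewer than n words."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_examples, benchmark_questions, n=8):
    """Drop every training example sharing at least one n-gram with a benchmark."""
    benchmark_grams = set()
    for question in benchmark_questions:
        benchmark_grams |= ngrams(question, n)
    return [ex for ex in train_examples if not (ngrams(ex, n) & benchmark_grams)]


benchmark = [
    "A train leaves the station at noon travelling at sixty miles per hour"
]
train = [
    "Homework help: a train leaves the station at noon travelling at sixty miles per hour, how far does it go?",
    "Explain why a PID loop oscillates after a payload change on a cobot joint",
]

kept = decontaminate(train, benchmark)
print(len(train), "->", len(kept))
```

Examples shorter than eight words can never match under this scheme, so very short benchmark items need a smaller n or an exact-match pass.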
Beyond standard benchmarks, a domain-specific evaluation set should be held out, constructed from realistic prompts drawn from the actual use case. Benchmark suites measure general capability; a custom evaluation set measures whether the model performs better at the relevant task. The two metrics frequently disagree, and the custom set is the one that ultimately matters.
Deployment
When training is complete, the adapter or full checkpoint resides in a directory and must be served.
The two standard serving stacks in 2026 are vLLM and SGLang. vLLM has the broadest support and is the production default for most teams. SGLang is faster for structured-output workloads (JSON, regex-constrained generation) and provides superior RadixAttention KV-cache reuse for repeated-prefix workloads such as RAG and multi-turn chat.
Both implement continuous batching, a serving technique that keeps the GPU saturated by dynamically inserting new requests into the batch as existing requests complete, rather than waiting for the whole batch to finish. The throughput multiplier of continuous batching over static batching is typically a factor of three to five, sometimes more.
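The scheduling difference can be shown with a toy simulation. This sketch counts decode steps only, under the assumption that each request needs a fixed number of steps and that a step costs the same regardless of how full the batch is; prefill, memory limits, and the actual schedulers in vLLM or SGLang are not modelled. The function names and request lengths are invented.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request has finished."""
    return sum(
        max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    queue = list(lengths)
    running = []  # remaining decode steps for each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        steps += 1
        # Requests with one step left finish now; the rest count down.
        running = [r - 1 for r in running if r > 1]
    return steps


# Output lengths (in decode steps) for eight hypothetical requests.
request_lengths = [8, 1, 1, 1, 8, 1, 1, 1]
static_steps = static_batching_steps(request_lengths, batch_size=4)
continuous_steps = continuous_batching_steps(request_lengths, batch_size=4)
print(static_steps, continuous_steps)
```

With a mix of long and short requests, static batching leaves slots idle while the longest request finishes, whereas the continuous scheduler starts the next request as soon as a slot opens.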
For a fine-tuned Qwen3.6-27B served on a single H100, the launch command is as follows:
vllm serve Qwen/Qwen3.6-27B \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 32768 \
    --dtype bfloat16 \
    --enable-lora \
    --lora-modules my-adapter=out/qwen36-27b-qlora \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching \
    --tensor-parallel-size 1
The serving endpoint exposes an OpenAI-compatible API at http://localhost:8000/v1, so on the client side the OpenAI SDK can be pointed at it unchanged:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM ignores the key by default
)
response = client.chat.completions.create(
    model="my-adapter",  # the adapter name registered via --lora-modules
    messages=[
        {"role": "system", "content": "You are an industrial controls expert."},
        {"role": "user", "content": "What causes oscillation after a payload change on a cobot joint?"},
    ],
    temperature=0.2,
    max_tokens=512,
)
print(response.choices[0].message.content)
If the deployment forms part of a larger application, the serving pods may be run on Kubernetes with a GPU-aware scheduler. For tool-augmented workflows, vLLM supports tool calling, but it has to be switched on at launch with --enable-auto-tool-choice and a --tool-call-parser that matches the model (Hermes-style JSON output for Qwen-family models; the parser for GPT-OSS should be confirmed against the vLLM documentation). For broader integrations, the Model Context Protocol (MCP) is emerging as the de facto integration standard for tool-using LLM applications.
Common Pitfalls and Debugging
Most training failures derive from a small set of recurring mistakes. Awareness of these in advance saves substantial debugging time.
Chat template mismatch. Previously noted, but worth repeating because it is the most common silent failure. The training-time template and the inference-time template must be identical. A fully tokenised example with special tokens visible (tokenizer.decode(input_ids, skip_special_tokens=False)) should be printed before beginning any long run.
Out-of-memory mid-training. The loss curve appears acceptable for 5,000 steps, after which a single long sequence in a batch exceeds the activation memory budget. The remedy is to lower max_seq_length, enable packing=True with a sequence cap, or reduce per-device batch size and increase gradient accumulation to compensate.
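The trade between per-device batch size and gradient accumulation is simple arithmetic. The following sketch assumes the effective batch size is the product of per-device batch size, accumulation steps, and device count; rebalance is a hypothetical helper, not a trainer API.

```python
import math


def rebalance(per_device_batch, grad_accum, num_devices, new_per_device_batch):
    """Return (new_grad_accum, new_effective_batch) after shrinking the micro-batch.

    The accumulation steps are rounded up so the effective batch never shrinks.
    """
    effective = per_device_batch * grad_accum * num_devices
    new_accum = math.ceil(effective / (new_per_device_batch * num_devices))
    return new_accum, new_per_device_batch * new_accum * num_devices


# Example: one GPU, micro-batch 8 with 4 accumulation steps, reduced to micro-batch 2.
new_accum, new_effective = rebalance(8, 4, 1, 2)
print(new_accum, new_effective)  # 16 accumulation steps, effective batch still 32
```

Peak activation memory scales with the micro-batch rather than the effective batch, so this change lowers memory pressure while leaving the optimisation dynamics roughly unchanged, at the cost of more forward and backward passes per optimiser step.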
Tokenizer drift. The base model has been loaded with one tokenizer revision and inference performed with another, causing the vocabulary or special-token IDs to shift. The tokenizer commit hash should be locked explicitly: AutoTokenizer.from_pretrained(MODEL_ID, revision="abc123def...").
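In addition to pinning the revision, a fingerprint of the vocabulary can be logged at training time and compared at serving time. The sketch below is one possible approach, not an established convention: tokenizer_fingerprint is a hypothetical helper that takes plain token-to-ID mappings, and the tiny vocabularies are made up. With a Hugging Face tokenizer, the mapping returned by tokenizer.get_vocab() could be passed as vocab.

```python
import hashlib
import json


def tokenizer_fingerprint(vocab, special_tokens):
    """SHA-256 over the sorted token->id pairs of the vocabulary and special tokens."""
    payload = json.dumps(
        {
            "vocab": sorted(vocab.items()),
            "special": sorted(special_tokens.items()),
        },
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Made-up vocabularies: the second one shifts a special-token ID by one.
train_vocab = {"hello": 0, "world": 1, "<|im_start|>": 2, "<|im_end|>": 3}
serve_vocab = {"hello": 0, "world": 1, "<|im_start|>": 2, "<|im_end|>": 4}
special_train = {"<|im_start|>": 2, "<|im_end|>": 3}
special_serve = {"<|im_start|>": 2, "<|im_end|>": 4}

fp_train = tokenizer_fingerprint(train_vocab, special_train)
fp_serve = tokenizer_fingerprint(serve_vocab, special_serve)
drift_detected = fp_train != fp_serve
```

Storing fp_train alongside the checkpoint and refusing to serve when the fingerprints differ turns a silent failure into a loud one.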
Loss spikes. A large upward jump in loss at a specific step almost always indicates a bad batch—corrupted data, a tokenisation error on a single example, or an unusually long sequence. The data at that step should be inspected. If recurrence is rare, gradient clipping (max_grad_norm=1.0) should be added and training resumed from the last good checkpoint.
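Finding the step to inspect can be automated from the logged loss values. This is a minimal sketch under assumed thresholds: find_loss_spikes is a hypothetical helper, and the window of 20 steps and the factor of 2.0 are arbitrary starting points rather than recommended values.

```python
from statistics import median


def find_loss_spikes(losses, window=20, ratio=2.0):
    """Return step indices whose loss exceeds `ratio` times the trailing median."""
    spikes = []
    for step in range(window, len(losses)):
        baseline = median(losses[step - window:step])
        if baseline > 0 and losses[step] > ratio * baseline:
            spikes.append(step)
    return spikes


# Synthetic loss history with a single spike at step 25.
history = [1.0] * 30
history[25] = 5.0
spike_steps = find_loss_spikes(history)
print(spike_steps)  # [25]
```

A trailing median is used instead of a mean so that one spike does not inflate the baseline for the steps that follow it.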
Evaluation/training distribution mismatch. Training loss is low, while evaluation loss is high and fails to improve. Once ordinary overfitting has been ruled out, the usual cause is that the evaluation set is drawn from a different distribution from the training set. Either the evaluation set should be drawn from the same source as the training data (with a fresh seed split), or the gap should be accepted as a measure of generalisation rather than a training failure.
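A same-source split with a fixed seed takes only a few lines. The sketch below is illustrative: split_train_eval is a hypothetical helper, and the 10 per cent evaluation fraction is an assumption rather than a recommendation from this article.

```python
import random


def split_train_eval(examples, eval_fraction=0.1, seed=0):
    """Shuffle with a fixed seed and carve off an evaluation slice from the same source."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


examples = [f"example-{i}" for i in range(100)]
train_set, eval_set = split_train_eval(examples, eval_fraction=0.1, seed=42)
```

Because the seed is fixed, the split is reproducible across runs, and because both parts come from one shuffled pool, any remaining gap between training and evaluation loss reflects generalisation rather than a change of source.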
Gradient explosion. Loss diverges to NaN within a few steps. The learning rate is too high for the task, gradient clipping has been omitted, or the data contain an extreme outlier such as a degenerate or abnormally long example. Training should restart from the last good checkpoint, or from the beginning if none exists, with learning_rate halved and max_grad_norm=1.0.
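What max_grad_norm=1.0 does can be shown without any framework. The sketch below reimplements global-norm clipping on plain Python lists for illustration; clip_by_global_norm is a hypothetical helper, and a real training loop would rely on the framework's own clipping utility instead.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm.

    `grads` is a list of flat lists, one per parameter tensor.
    Returns (clipped_grads, original_norm).
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if not math.isfinite(total_norm):
        raise ValueError("non-finite gradient norm; skip this step and inspect the batch")
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


# Two "tensors" whose combined norm is 5.0, clipped down to about 1.0.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
```

Every gradient is scaled by the same factor, so the update direction is preserved and only its magnitude is bounded. Gradients already below the threshold pass through unchanged.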
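MoE-specific: expert collapse. This failure applies only to MoE training (Qwen3.5-122B-A10B, GPT-OSS-120B). The router learns to send nearly every token to one or two experts, and the remaining experts stop receiving useful gradient signal and atrophy. The mitigation is an auxiliary load-balancing loss. TRL and torchtitan provide load-balancing mechanisms, but whether one is active depends on the model and trainer configuration (in Hugging Face MoE configs this is typically governed by output_router_logits and router_aux_loss_coef), so it should be verified as enabled rather than assumed or silently overridden by a configuration setting.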
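To make the idea of the auxiliary loss concrete, the sketch below computes a Switch Transformer-style balancing term for top-1 routing: the number of experts multiplied by the sum, over experts, of the fraction of tokens dispatched to each expert times its mean router probability. This is an assumed formulation for illustration; the exact loss used by a given model or trainer may differ, and load_balancing_loss is a hypothetical helper.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss for top-1 routing.

    router_probs: per-token list of softmax probabilities over experts.
    assignments:  per-token index of the expert the token was dispatched to.
    Returns about 1.0 for perfectly uniform routing and num_experts for full collapse.
    """
    num_tokens = len(router_probs)
    frac_tokens = [0.0] * num_experts  # f_i: share of tokens sent to expert i
    mean_prob = [0.0] * num_experts    # P_i: mean router probability for expert i
    for probs, expert in zip(router_probs, assignments):
        frac_tokens[expert] += 1.0 / num_tokens
        for e in range(num_experts):
            mean_prob[e] += probs[e] / num_tokens
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


# Four tokens, four experts: perfectly balanced versus fully collapsed routing.
balanced = load_balancing_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)
collapsed = load_balancing_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], num_experts=4)
print(balanced, collapsed)  # 1.0 and 4.0
```

Logging the per-expert token fractions during training gives the same signal more directly: if one or two entries approach 1.0 while the rest approach zero, the router has collapsed.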
FAQ
Can these models be fine-tuned on a consumer GPU such as an RTX 4090?
Qwen3.6-27B can be fine-tuned on a 4090 with QLoRA. The 24GB of VRAM on a 4090 is tight but workable with gradient checkpointing, a paged 8-bit optimiser, and a short sequence length (approximately 2048 tokens). Qwen3.5-122B-A10B and GPT-OSS-120B require at least 80GB of VRAM, which corresponds to H100/H200/MI300X-class hardware. The released GPT-OSS-120B can be served (though not trained) on a single 80GB card due to MXFP4 quantisation.
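The arithmetic behind these thresholds is weights-only memory: parameter count times bits per parameter. The sketch below is a rough lower bound under that assumption; it ignores quantisation metadata, LoRA adapter weights, optimiser state, activations, and the KV cache, all of which add to the total. weight_memory_gb is a hypothetical helper.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9


qwen_27b_4bit = weight_memory_gb(27, 4)    # 13.5 GB: leaves headroom on a 24GB card
qwen_27b_bf16 = weight_memory_gb(27, 16)   # 54.0 GB: needs an 80GB-class card
qwen_122b_4bit = weight_memory_gb(122, 4)  # 61.0 GB: every expert must be resident
print(qwen_27b_4bit, qwen_27b_bf16, qwen_122b_4bit)
```

The third figure shows why the MoE models need 80GB-class hardware even when quantised: all experts must sit in memory, regardless of how few are active per token.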
How much data is actually required?
Less than is commonly expected. For domain adaptation with LoRA or QLoRA, 5,000 to 20,000 high-quality examples are sufficient for most domains. Quality matters considerably more than quantity: a tightly curated 10,000-example set consistently outperforms a noisy 100,000-example set. For format adaptation (teaching the model a new structured output schema), 1,000 to 2,000 examples often suffice.
How does this compare with using a managed API?
The two represent different problem spaces. Managed APIs (OpenAI, Anthropic) excel in convenience and access to the latest models. Self-hosted fine-tuned models excel in cost per million tokens at scale, data sovereignty, custom domain adaptation, and predictable cost (no per-call billing). The crossover point is typically around 100M tokens per month; below this, managed services are usually preferable, and above it, self-hosted is usually cheaper.
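The crossover can be estimated for a specific deployment with a break-even calculation. The sketch below is illustrative only: break_even_tokens_per_month is a hypothetical helper, and the prices in the example are placeholders chosen to make the arithmetic easy, not quotes from any provider.

```python
def break_even_tokens_per_month(
    fixed_monthly_cost,
    api_price_per_million,
    self_hosted_price_per_million=0.0,
):
    """Monthly token volume at which self-hosting costs the same as a managed API.

    fixed_monthly_cost: GPU rental plus operations, in dollars per month.
    The per-million prices are the marginal cost of one million tokens on each side.
    """
    saving_per_million = api_price_per_million - self_hosted_price_per_million
    if saving_per_million <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_monthly_cost / saving_per_million * 1_000_000


# Placeholder figures: $1,500 per month of fixed cost against $15 per million API tokens.
crossover = break_even_tokens_per_month(1500, 15)
print(f"{crossover:,.0f} tokens per month")  # 100,000,000
```

Below the computed volume the fixed cost of the GPU dominates and the managed API is cheaper; above it, each additional million tokens widens the saving from self-hosting.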
What is the quality difference between LoRA and full fine-tuning?
LoRA retains 90 to 95 per cent of full fine-tuning quality across most tasks. QLoRA retains 80 to 90 per cent. The remaining gap is largest on tasks requiring substantial representational shift from the base model—for example, fine-tuning an English-pretrained model to operate fluently in a low-resource language. For typical instruction tuning, code adaptation, or structured-output tasks, the gap is sufficiently small that the cost savings of LoRA dominate.
Should continued pretraining precede instruction tuning?
Only when the domain is genuinely far from the base model’s training distribution—medical literature, legal contracts in a non-English language, or highly specialised scientific notation. For most domains, the base model has sufficient coverage that instruction tuning alone closes the gap. Continued pretraining is expensive and easily mishandled, with the principal risk being catastrophic forgetting of the base model’s general competence.
Related Reading
- Self-supervised learning is the foundation underneath every modern LLM pretraining run
- Transfer learning and fine-tuning for domain adaptation — a working applied example
- Containerizing the training environment with Docker
- Orchestrating data prep pipelines with Apache Airflow
- Bayesian hyperparameter optimization for tuning learning rate and rank
- Kubernetes for distributed training and serving
- Tool calling and function calling for post-trained models
- The Model Context Protocol for deployment integration
References
- Qwen Team — Qwen3.6-27B announcement (April 22, 2026)
- QwenLM — Qwen3.6 official repository
- OpenAI — Introducing GPT-OSS (August 2025)
- OpenAI — GPT-OSS-120B model card on Hugging Face
- OpenAI — openai/gpt-oss GitHub repository
Conclusion
Training open-source LLMs in 2026 is no longer the closed activity it was two years ago. The combination of Apache 2.0 base models with frontier-class reasoning (GPT-OSS-120B approaching o4-mini), QLoRA on a single rented GPU, and serving infrastructure capable of handling thousands of concurrent users on commodity hardware has placed production-grade LLM customisation within reach of any team with a modest budget and a clear use case.
The three anchor models cover the practical range: Qwen3.6-27B for the single-GPU dense workflow, Qwen3.5-122B-A10B for inexpensive MoE serving when multi-GPU capacity is available, and GPT-OSS-120B for single-GPU serving of a frontier-class reasoner enabled by MXFP4. None of these is universally “best”; each addresses different questions about hardware, latency, and quality.
The principal challenge is no longer the technology; it is the data—assembling, deduplicating, formatting, and contamination-checking a dataset that actually teaches the model the intended behaviour. The trainer runs in eight hours. The dataset takes eight weeks. Planning should be adjusted accordingly.