
How to Train Open-Source LLMs in 2026: Qwen3.6, Qwen3.5, GPT-OSS

Last updated: May 27, 2026
Published May 21, 2026 · Updated May 27, 2026 · 34 min read

Two years ago, training a large language model required either renting time at a research lab or accepting that fine-tuning was the preserve of billion-dollar companies. By May 2026, Qwen3.6-27B can be taken from a Hugging Face download to a domain-specialised model on a single rented H100 for less than fifteen dollars. The tools have changed. The underlying mathematics has not, but the population of those who use it has expanded. This article describes how to train an open-source LLM in practice today: what hardware is required, which model to choose, how to format the data so that the trainer does not silently discard it, and how to place the result behind a serving endpoint that responds in milliseconds.

Summary

What this post covers: A working 2026 playbook for fine-tuning open-source LLMs using three concrete anchors — the dense Qwen3.6-27B, the MoE Qwen3.5-122B-A10B, and OpenAI’s GPT-OSS-120B — from environment setup through deployment.

Key insights:

  • QLoRA on a single H100 (80GB) now fine-tunes a 27B dense model in 8 to 12 hours for $10 to $16 of cloud rental, retaining 80 to 90 percent of full fine-tuning quality.
  • MoE models like Qwen3.5-122B-A10B (10B active) and GPT-OSS-120B (5.1B active) need VRAM to hold all 122B or 117B weights, even though per-token compute is small — the “active parameter” headline number is a runtime FLOPs claim, not a memory one.
  • Chat-template mismatch between training and inference is the single most common cause of a “trained but acts untrained” model — Qwen’s <|im_start|> markers and GPT-OSS’s harmony format are not interchangeable.
  • GPT-OSS-120B ships post-trained with MXFP4 quantization on the MoE weights, which is why a 117B-total-parameter model fits in a single 80GB H100 at inference time.
  • For anything past 70B at full precision, FSDP2 or DeepSpeed ZeRO-3 sharding is no longer optional — single-node training caps out around 32B dense in FP16 even on H200 (141GB) hardware.

Main topics: The State of Open-Source LLM Training in 2026; The Three Anchor Models; Choosing Full Fine-Tune, LoRA, or QLoRA; Setting Up the Training Environment; Preparing the Dataset; The Actual Training Run; Substantive Evaluation; Deployment; Common Pitfalls and Debugging.

The State of Open-Source LLM Training in 2026

The open-source LLM landscape in May 2026 bears little resemblance to that of early 2024. Two structural shifts have transformed what a single engineer can accomplish alone.

The first shift is architectural. Mixture-of-Experts (MoE) models, in which each token activates only a small subset of total parameters, have become the dominant configuration for any model larger than 30B. A dense model uses every weight on every token; an MoE model uses a router to direct each token to a small fraction of “expert” sub-networks. Qwen3.5-122B-A10B has 122B total parameters but only approximately 10B active per forward pass. GPT-OSS-120B contains 117B total parameters with 5.1B active. The runtime FLOPs resemble those of a small model; the VRAM footprint does not.

The second shift concerns post-training tooling. QLoRA, in which the base weights are frozen at 4-bit NF4 (NormalFloat-4, a quantisation format optimised for the distribution of neural network weights) and only a small low-rank adapter is trained, has moved from a research curiosity in 2023 to the default starting point in 2026. LoRA (Low-Rank Adaptation) retains 90 to 95 per cent of full fine-tuning performance. QLoRA retains 80 to 90 per cent while reducing VRAM by approximately 75 per cent compared with FP16.

The practical implication is as follows: a 7B model whose weights alone occupy approximately 14GB in FP16 can now be fine-tuned in 5 to 6GB under QLoRA. A 70B model whose FP16 weights occupy approximately 140GB can be fine-tuned in approximately 46GB. The hardware threshold has dropped sufficiently that the question has shifted from whether training is affordable to what should be trained.

Figure: Three Open-Source LLMs at a Glance (May 2026)

  • Qwen3.6-27B (dense, multimodal). Total params: 27B. Active per token: 27B. Architecture: Dense. Attention: Gated DeltaNet (linear + self-attn hybrid). Context: 262K native (extensible to 1M). Modalities: Vision + text. Released: 2026-04-22. License: Apache 2.0. Best for: single-GPU fine-tuning, multimodal agents, long-context tasks.
  • Qwen3.5-122B-A10B (MoE, sparse). Total params: 122B. Active per token: ~10B. Architecture: MoE. Attention: Gated DeltaNet (linear + self-attn hybrid). Context: 262K native (extensible to 1M+). Modalities: Text. Released: 2026-02-24. License: Apache 2.0. Best for: cheap inference, scale via tensor parallelism, reasoning workloads.
  • GPT-OSS-120B (MoE, MXFP4 native). Total params: 117B. Active per token: 5.1B. Architecture: MoE. Attention: Standard (grouped-query). Context: 128K. Modalities: Text. Released: Aug 2025. License: Apache 2.0. Best for: single 80GB GPU serving, reasoning near o4-mini, drop-in OpenAI replacement.

The implications for a practitioner intending to train a model today are as follows: prosumer hardware (a single H100 or H200, or even a 48GB workstation card such as the RTX 6000 Ada) can handle QLoRA on models up to 70B. Beyond that point, multi-GPU LoRA or sharded full fine-tuning is required. Specific recipes for each scenario are presented below.

Pretraining from scratch—the 2.1 million H100-hour run that produced GPT-OSS-120B—remains out of reach for almost all practitioners. Within reach, however, is taking one of these three checkpoints and adapting it to a particular dataset, domain, or task. This is what “training an open-source LLM” means in practice in 2026.

Key Takeaway: Training in 2026 almost always means fine-tuning a released checkpoint. The interesting choice is not pretraining versus fine-tuning but rather which fine-tuning method and which base model to use.

The Three Anchor Models

Three models cover the practical range of what is fine-tuned today: a dense 27B model that fits comfortably on prosumer hardware, a sparse 122B model that requires cluster-class memory but inexpensive compute, and a 117B MoE model that ships pre-quantised to fit on a single 80GB card.

Qwen3.6-27B

Released on 22 April 2026 by Alibaba’s Qwen team. Dense: every one of the 27 billion parameters participates in every forward pass. It uses Gated DeltaNet, a hybrid attention scheme that combines a linear-attention path (constant memory cost per token) with traditional softmax self-attention. The linear path handles long-range context, while the softmax path preserves short-range precision.

Native context is 262,144 tokens, extensible to one million via position-encoding extrapolation. The model is natively multimodal: the same checkpoint accepts images and text. A “Thinking Preservation” mechanism maintains a chain-of-thought reasoning mode and a fast non-thinking mode within a single set of weights.

Benchmark figures from the Qwen team include SWE-bench Verified 77.2 (compared with Qwen3.5-397B-A17B at 76.2), SWE-bench Pro 53.5 (compared with 50.9), Terminal-Bench 2.0 59.3 (compared with 52.5), and SkillsBench 48.2 (compared with 30.0). A 27B dense model surpassing its 397B MoE predecessor on code-related work is the kind of result that re-establishes the importance of architecture choice.

The model can be downloaded from the QwenLM/Qwen3.6 official repository or the Hugging Face Qwen/Qwen3.6-27B mirror. The licence is Apache 2.0: commercial use is permitted with attribution.

Qwen3.5-122B-A10B

Released on 24 February 2026. A sparse MoE: 122 billion total parameters, approximately 10 billion active per forward pass. The “A10B” suffix denotes the active-parameter count. Each token is routed through a small subset of experts, while the remainder of the network remains idle for that token.

The model shares the Gated DeltaNet hybrid attention of Qwen3.6-27B and the same 262K native context, extensible to 1M+. It is text-only at this size. The MoE structure means inference compute resembles that of a 10B model, but VRAM must still hold all 122B weights, because the router cannot determine in advance which expert any given token will require.

This is the appropriate model when strong quality is required alongside inexpensive per-token serving. The active-parameter count determines latency and energy cost; the total parameter count determines hardware purchasing decisions. The trade-off is frequently misunderstood on first encounter.
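The split between the two parameter counts can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only: it assumes 2 bytes per weight (FP16 or BF16) and the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass. The function names are hypothetical, and the parameter counts are the ones quoted in this article.

```python
def weight_memory_gb(total_params_b: float, bytes_per_param: float = 2.0) -> float:
    """VRAM needed just to hold the weights, in GB (billions of params x bytes each)."""
    return total_params_b * bytes_per_param


def forward_gflops_per_token(active_params_b: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2.0 * active_params_b


# (total parameters, active parameters), both in billions
models = {
    "Qwen3.6-27B": (27.0, 27.0),
    "Qwen3.5-122B-A10B": (122.0, 10.0),
    "GPT-OSS-120B": (117.0, 5.1),
}

for name, (total_b, active_b) in models.items():
    print(
        f"{name}: {weight_memory_gb(total_b):.0f} GB of FP16 weights, "
        f"{forward_gflops_per_token(active_b):.1f} GFLOPs per token"
    )
```

Memory is driven by the first number and per-token compute by the second, which is why the two MoE models are cheap to run but still expensive to host.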

GPT-OSS-120B

Part of OpenAI’s first open-weight LLM release since GPT-2 (2019), published in August 2025. The model contains 117 billion total parameters with 5.1 billion active, under an Apache 2.0 licence. It was trained on NVIDIA H100 GPUs using PyTorch with custom Triton kernels. The training run consumed 2.1 million H100-hours, which at $2 per hour in cloud pricing represents approximately $4.2 million in compute alone.

What makes GPT-OSS-120B unusual is that it ships post-trained with MXFP4 quantisation on the MoE weights. MXFP4 is a 4-bit floating-point format with a shared scale per micro-block. Because the bulk of the parameter count resides in the MoE expert layers, quantising those layers to 4-bit reduces the on-disk and in-VRAM footprint sufficiently to fit on a single 80GB GPU such as the H100 (larger-memory accelerators such as the AMD MI300X also work). The non-expert layers remain at higher precision.

Reported benchmark results indicate near-parity with OpenAI’s o4-mini on core reasoning. For a model that can run on a single rented GPU, this is a notable result. The model card and weights are available at huggingface.co/openai/gpt-oss-120b; the official repository is at github.com/openai/gpt-oss; the launch announcement is at openai.com/index/introducing-gpt-oss.

Attribute | Qwen3.6-27B | Qwen3.5-122B-A10B | GPT-OSS-120B
Total params | 27B | 122B | 117B
Active params | 27B (dense) | ~10B | 5.1B
Architecture | Dense, Gated DeltaNet | MoE, Gated DeltaNet | MoE, grouped-query attn
License | Apache 2.0 | Apache 2.0 | Apache 2.0
Release date | 2026-04-22 | 2026-02-24 | August 2025
Native context | 262K (extensible to 1M) | 262K (extensible to 1M+) | 128K
Multimodal | Yes (vision + text) | Text only | Text only
Download | HF: Qwen/Qwen3.6-27B | HF: Qwen/Qwen3.5-122B-A10B | HF: openai/gpt-oss-120b

Choosing Full Fine-Tune, LoRA, or QLoRA

Three fine-tuning methods cover essentially the entire field. They occupy positions along a cost-versus-quality spectrum, and the appropriate choice depends on the volume of available data and the degree to which the target domain differs from the base model’s training distribution.

Full fine-tuning updates every parameter. It requires approximately four times the model’s memory footprint during training: model weights, gradients, optimizer states (two for AdamW: first and second moment), and activations. A 7B model requires approximately 14GB in FP16 for weights alone; with optimizer states and gradients, peak usage approaches 60GB.
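The arithmetic behind the "four times" rule can be sketched as follows. This is a rough estimate under stated assumptions, not a profiler: it counts weights, gradients, and two AdamW moments, ignores activations, and by default assumes every state is held at 2 bytes per parameter. The function name and its parameters are hypothetical.

```python
def full_ft_state_gb(params_b, weight_bytes=2, grad_bytes=2,
                     optim_states=2, optim_bytes=2):
    """Memory for weights + gradients + optimizer states, in GB.

    params_b is the parameter count in billions; activations are excluded.
    """
    bytes_per_param = weight_bytes + grad_bytes + optim_states * optim_bytes
    return params_b * bytes_per_param


# All states at 16-bit: four copies of the weights
print(full_ft_state_gb(7))
print(full_ft_state_gb(70))

# Assumption: AdamW moments kept in FP32 (4 bytes each) instead of 16-bit
print(full_ft_state_gb(7, optim_bytes=4))
```

With 16-bit states the estimate lands close to the 60GB and 560GB figures used in this article; keeping the optimizer moments in FP32 pushes the total noticeably higher, so the rule of thumb is best read as a lower bound.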

LoRA (Low-Rank Adaptation) freezes the base weights and inserts trainable low-rank matrices into the attention projection layers. Instead of updating the full weight matrix W (for example, 4096×4096 = approximately 16.7M parameters), two small matrices B (4096×r) and A (r×4096) are trained, where r is typically 8, 16, or 32. The model effectively learns ΔW = B·A, which is added to the frozen W at inference. For r = 16, this amounts to approximately 131K trainable parameters per layer rather than 16.7M, roughly 128 times fewer.
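The parameter count above can be checked in a few lines. The sketch below uses plain Python lists rather than a tensor library so that it runs anywhere; the tiny 4×4 example and the helper names are hypothetical, chosen only to show that a zero-initialised B makes ΔW = B·A vanish at the start of training.

```python
import random


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters in B (d_out x r) plus A (r x d_in)."""
    return d_out * r + r * d_in


def matmul(x, y):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


d, r = 4096, 16
full = d * d
lora = lora_param_count(d, d, r)
print(f"dense: {full:,}  LoRA: {lora:,}  ratio: {full // lora}x")

# Tiny example: A is Gaussian-initialised and B starts at zero,
# so delta_W = B @ A is all zeros before the first optimizer step.
rng = random.Random(0)
d_small, r_small = 4, 2
A = [[rng.gauss(0.0, 0.02) for _ in range(d_small)] for _ in range(r_small)]
B = [[0.0] * r_small for _ in range(d_small)]
delta_W = matmul(B, A)
print(all(v == 0.0 for row in delta_W for v in row))
```

Because the adapter contributes nothing at step zero, the adapted model starts out behaving exactly like the frozen base model and only drifts from it as B is trained.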

QLoRA extends LoRA further. The frozen base weights are quantised to 4-bit NF4 (NormalFloat-4, designed to match the typical Gaussian distribution of neural network weights), and LoRA adapters sit on top in FP16 or BF16. The weights are de-quantised on the fly only during forward and backward passes. Memory consumption decreases by approximately 75 per cent compared with FP16 training.
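The quantise-once, de-quantise-on-the-fly idea can be illustrated with a deliberately simplified scheme. The sketch below uses uniform 4-bit levels with one absmax scale per block; real NF4 uses a non-uniform codebook matched to a normal distribution, and the bitsandbytes kernels are considerably more involved. All names and the sample block are hypothetical.

```python
def quantize_block(weights, levels=16):
    """Absmax-quantise one block of floats to integer codes in [0, levels - 1]."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    half = (levels - 1) / 2
    codes = [round((w / scale) * half + half) for w in weights]
    return codes, scale


def dequantize_block(codes, scale, levels=16):
    """Recover approximate floats from the codes; done on the fly in each pass."""
    half = (levels - 1) / 2
    return [((c - half) / half) * scale for c in codes]


block = [0.12, -0.40, 0.03, 0.25, -0.08, 0.31, -0.19, 0.00]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print("codes:", codes)
print("scale:", scale)
print("largest reconstruction error:", round(max_err, 4))
```

Only the 4-bit codes and one scale per block need to stay resident in VRAM; the higher-precision values exist briefly inside each forward and backward pass.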

Figure: Cost vs Quality Spectrum: Fine-Tuning Methods. From lower to higher VRAM and cost, with quality retention as a percentage of full fine-tuning: prompting / RAG (~0 VRAM, ~60-70%); QLoRA (80-90%; 7B: ~6GB, 70B: ~46GB; $10-16 on a single H100); LoRA (90-95%; 7B: ~16GB, 70B: ~160GB; 2-4× H100 for 70B); full FT (100% baseline; 7B: ~60GB, 70B: ~560GB; 8× H100, $250-510).

Method | VRAM (7B) | VRAM (70B) | Wall time | Cost (cloud) | Quality retention
Full FT | ~60 GB | ~560 GB (needs 8×H100) | 24-48h on 8×H100 | $250-510 | 100% (baseline)
LoRA | ~16 GB | ~160 GB (2-4 GPUs) | 10-15h on 1×H100 | $20-40 | 90-95%
QLoRA | ~6 GB | ~46 GB (1 H100/H200) | 8-12h on 1×H100 | $10-16 | 80-90%

Figure: How LoRA and QLoRA Work. The frozen base weight W₀ (d × d, e.g. 4096 × 4096 = 16.7M params; FP16 for LoRA, 4-bit NF4 for QLoRA) receives no gradient and no optimizer state. The trainable matrices B (d × r, initialised to 0) and A (r × d, Gaussian initialised) give the effective weight W = W₀ + B·A. For r = 16, B = 4096×16 = 65K and A = 16×4096 = 65K, so 131K trainable parameters replace 16.7M per attention projection (~128× fewer). Quantising W₀ to NF4 (QLoRA only) saves ~75% of VRAM.

The practical selection heuristic is to begin with QLoRA. If quality is insufficient after a sweep over rank, learning rate, and data size, the next step is LoRA. Full fine-tuning should be reserved for cases in which the domain shift is so substantial that the base model’s representation is genuinely wrong—for example, a model trained predominantly on English required to operate in a low-resource language. The 80 to 90 per cent quality retention of QLoRA is sufficient for the majority of production tasks.

Tip: A LoRA rank (r) of 16 serves as a sensible default. It should be increased to 32 or 64 only if the task differs substantially from the base model’s training distribution. Higher rank consumes more VRAM and rarely provides benefits beyond r = 16 for most domains.

Figure: VRAM Budget by Model and Mode (values in GB; reference lines: H100 = 80GB, H200 = 141GB)

Model | Inference FP16 | Inference 4-bit | QLoRA training | Full FT training (peak)
Qwen3.6-27B | 54 | 14 | 22 | ~270
Qwen3.5-122B | 244 | 62 | ~80 | ~600+
GPT-OSS-120B | 234 | 35* | ~75 | ~560

* MXFP4 native

It is worth noting that GPT-OSS-120B’s 4-bit inference figure (approximately 35 GB) is substantially lower than Qwen3.5-122B’s 62 GB despite similar total parameter counts. This is the advantage of MXFP4-native quantisation. Qwen3.5 must be quantised after training (AWQ or GPTQ), incurring some additional accuracy loss; GPT-OSS-120B was post-trained with the 4-bit format already in mind.

Setting Up the Training Environment

Three years ago, this section would have been considerably more complex: CUDA versions, PyTorch builds, mismatched Triton, and broken bitsandbytes. In May 2026 the process remains finicky, but the recipe is more stable.

The requirements are CUDA 12.6 or newer (CUDA 12.8 ships well with the H100/H200 SXM5 drivers), cuDNN 9.5 or newer, PyTorch 2.7 stable or 2.8 nightly, and recent versions of transformers, peft, accelerate, trl, bitsandbytes, and vllm. Flash Attention 3 requires Hopper (H100/H200) or newer; on Ampere (A100), Flash Attention 2 is the fallback.

The cleanest approach uses a Docker container that pins all of these versions. Building locally is the second-cleanest option. Operating in a bare Python environment invites an evening of debugging mismatched CUDA symbols. Containerising the training environment with a known-good base image, typically nvidia/cuda:12.8.0-cudnn-devel-ubuntu24.04, is the standard approach.

A working pyproject.toml for a fine-tuning project as of May 2026 is shown below:

[project]
name = "llm-finetune"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "torch==2.7.0",
    "transformers==4.50.2",
    "peft==0.14.1",
    "bitsandbytes==0.46.0",
    "accelerate==1.4.0",
    "trl==0.16.0",
    "datasets==3.5.0",
    "unsloth==2026.5.3",
    "flash-attn==3.0.1",
    "vllm==0.9.2",
    "wandb==0.19.5",
    "sentencepiece==0.2.0",
    "tiktoken==0.7.0",
    "lm-eval==0.4.7",
]

[tool.uv]
index-strategy = "unsafe-best-match"

[[tool.uv.index]]
name = "pytorch-cuda128"
url = "https://download.pytorch.org/whl/cu128"

A Dockerfile producing a known-good training image is shown below:

FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu24.04

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    HF_HOME=/workspace/.cache/huggingface \
    TORCH_CUDA_ARCH_LIST="9.0;10.0"

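# Ubuntu 24.04 packages Python 3.12 (its default repositories carry no python3.11),
# which still satisfies requires-python = ">=3.11" in pyproject.toml
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-venv python3-pip git curl ca-certificates \
        build-essential ninja-build cmake \
    && rm -rf /var/lib/apt/lists/*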

RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:${PATH}"

WORKDIR /workspace
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev

# Flash Attention 3 needs to compile against the installed torch
RUN uv pip install --no-build-isolation flash-attn==3.0.1

COPY . .

CMD ["uv", "run", "python", "-m", "train"]

The framework landscape in 2026 is as follows: TRL is Hugging Face’s official trainer for SFT (supervised fine-tuning) and reinforcement learning post-training. Axolotl is a YAML-config layer on top of TRL that handles much of the data-preparation boilerplate. Unsloth is a Triton-optimised custom kernel package that claims up to twice the training speed and 60 per cent lower VRAM consumption through hand-tuned kernels, and is now stable enough for production use. torchtitan is Meta’s reference scaffolding for large-scale pretraining and full fine-tuning with FSDP2.

Framework | Primary use case | Scaling target | Ergonomics | Recent activity
TRL | SFT, DPO, GRPO, PPO | 1-8 GPUs, single node | Python API, flexible | Very active
Axolotl | SFT, DPO with YAML config | 1-8 GPUs | YAML, low boilerplate | Active
Unsloth | Single-GPU QLoRA/LoRA, speed | 1 GPU (multi-GPU in 2025) | Drop-in trainer wrapper | Very active
torchtitan | Full FT, pretraining at scale | Multi-node FSDP2 | Lower-level, reference impl | Active

For 99 per cent of fine-tuning work, the appropriate choice is TRL (or Unsloth-wrapped TRL for single-GPU work). torchtitan should be reserved for full fine-tuning at 70B and above, where FSDP2 sharding becomes mandatory.

Preparing the Dataset

The training loop is the straightforward part. Dataset preparation is where most fine-tuning runs silently fail. The model trains for hours and the loss curve appears acceptable, but at evaluation the outputs are wrong, off-topic, or in the wrong format. The problem is almost always located upstream of the trainer.

A modern instruction-tuning dataset consists of a list of conversations. Each conversation is a list of turns, and each turn has a role (system, user, assistant) and content. JSONL is the standard on-disk format, with one JSON object per line. Parquet is the high-performance alternative for large datasets requiring columnar compression.
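Because malformed records tend to be dropped or mis-templated without an error, a structural check before training is cheap insurance. The sketch below validates one JSONL line against the schema just described. The helper name is hypothetical, and the rule that a conversation must end with an assistant turn is an assumption suited to supervised fine-tuning, not a requirement of the file format.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL record (empty if clean)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, turn in enumerate(messages):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        if turn.get("role") not in VALID_ROLES:
            problems.append(f"turn {i}: unknown role {turn.get('role')!r}")
        if not isinstance(turn.get("content"), str) or not turn["content"].strip():
            problems.append(f"turn {i}: empty or non-string content")

    # Assumption: SFT examples should end with the reply the model must learn
    if isinstance(messages[-1], dict) and messages[-1].get("role") != "assistant":
        problems.append("last turn is not from the assistant")
    return problems


good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
bad = '{"messages": [{"role": "bot", "content": ""}]}'
print(validate_line(good))
print(validate_line(bad))
```

Running a check like this over every line of the training file, and refusing to start when any line reports a problem, surfaces formatting mistakes before GPU hours are spent on them.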

A single training example for Qwen3.6-27B appears as follows:

{"messages": [
  {"role": "system", "content": "You are an industrial controls expert."},
  {"role": "user", "content": "Explain why a PID loop on a cobot joint might oscillate after a payload change."},
  {"role": "assistant", "content": "When the end-effector payload changes, the effective inertia at the joint changes. The PID gains were tuned for the original inertia. With higher effective mass, the proportional term applies the same torque correction to a slower-responding system, which can overshoot the setpoint and drive sustained oscillation. Two practical fixes: gain scheduling (load the right PID set when the gripper picks up the part) or an adaptive controller (online inertia estimation). The first is easier to validate in safety review."}
]}

The tokenizer then applies the model’s chat template (a Jinja-style template defined inside tokenizer_config.json) to convert that list of turns into a single tokenised sequence with the model’s special tokens. For Qwen3.6, the chat template wraps each turn in <|im_start|>role\ncontent<|im_end|>. For GPT-OSS-120B, the harmony format is used, built from <|start|>, <|message|>, and <|end|> tokens plus <|channel|> markers. These are not interchangeable. A model trained with the wrong template and served with the correct one will behave as though it had never been trained.

Figure: Chat Template: From Conversation to Training Sequence. Structured messages (system: “You are a Python expert.”; user: “Why does my asyncio.gather() block?”; assistant: “asyncio.gather() awaits the collected futures. If you wrap a blocking call without to_thread() the whole loop stalls…”) pass through apply_chat_template() and tokenizer.encode(). The Qwen chat template output wraps each turn as <|im_start|>role, the content, then <|im_end|>. Loss mask: system and user tokens are set to ignore_index = -100; assistant tokens train normally. GPT-OSS uses the harmony format, not <|im_start|>; templates are not portable.

The standard loss-masking pattern is as follows: the model is trained to predict assistant tokens, but the loss is masked (set to -100, the standard ignore_index for PyTorch’s CrossEntropyLoss) on system and user tokens. It is undesirable to teach the model to generate user messages.
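A minimal sketch of that masking step, assuming the conversation has already been split into per-role runs of token IDs. The IDs below are toy values rather than real tokenizer output, and the helper name is hypothetical. Real pipelines usually also mask the assistant header tokens, which this simplification does not attempt.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(segments):
    """Build (input_ids, labels) from a list of (role, token_ids) segments.

    Assistant tokens keep their IDs as labels; every other token is
    replaced by IGNORE_INDEX so that it contributes nothing to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy token IDs for illustration only
segments = [
    ("system", [1, 11, 12]),
    ("user", [1, 21, 22, 23]),
    ("assistant", [1, 31, 32, 2]),
]
input_ids, labels = build_labels(segments)
print(input_ids)
print(labels)
```

The inputs are left untouched, so the model still conditions on the system and user text; only the training signal is restricted to the assistant reply.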

A representative data-loading pipeline for Qwen3.6-27B, using the Hugging Face datasets library, is shown below:

from datasets import load_dataset
from transformers import AutoTokenizer

MODEL_ID = "Qwen/Qwen3.6-27B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

def format_example(example):
    """Render the conversation with Qwen's chat template (tokenisation happens later, in the trainer)."""
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

ds = load_dataset("json", data_files="data/train.jsonl", split="train")
ds = ds.map(format_example, remove_columns=ds.column_names)

# Train/eval split with a fixed seed for reproducibility
split = ds.train_test_split(test_size=0.05, seed=42)
train_ds, eval_ds = split["train"], split["test"]

print(f"Train: {len(train_ds)}, Eval: {len(eval_ds)}")
print("Sample formatted text:")
print(train_ds[0]["text"][:500])

Before training, two additional passes should be performed on the dataset. First, deduplication: exact-match dedup is inexpensive (a hash per example), while MinHash or SimHash near-dedup catches paraphrases. Duplicates make the training loss look better than it is and bias the model toward memorising common patterns.
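The exact-match pass can be sketched in a few lines. This version hashes each conversation after lowercasing and collapsing whitespace, which is an assumption about what should count as "the same" example; near-duplicate detection with MinHash or SimHash is not shown. The function names are hypothetical.

```python
import hashlib


def canonical_key(example: dict) -> str:
    """SHA-256 of the conversation with case and whitespace normalised."""
    parts = []
    for turn in example["messages"]:
        text = " ".join(turn["content"].lower().split())
        parts.append(f"{turn['role']}:{text}")
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def dedup_exact(examples):
    """Keep the first occurrence of each distinct conversation."""
    seen, kept = set(), []
    for example in examples:
        key = canonical_key(example)
        if key not in seen:
            seen.add(key)
            kept.append(example)
    return kept


data = [
    {"messages": [{"role": "user", "content": "What is LoRA?"},
                  {"role": "assistant", "content": "A low-rank adapter method."}]},
    {"messages": [{"role": "user", "content": "what is  LoRA?"},
                  {"role": "assistant", "content": "A low-rank adapter method."}]},
    {"messages": [{"role": "user", "content": "What is QLoRA?"},
                  {"role": "assistant", "content": "LoRA on a 4-bit base model."}]},
]
print(len(data), "->", len(dedup_exact(data)))
```

Deduplicating before the train/eval split matters: if the same conversation lands on both sides of the split, the evaluation loss no longer measures generalisation.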

Second, a contamination check: it must be ensured that none of the training data overlaps with the evaluation benchmarks. If the evaluation is MMLU and the training data was scraped from Common Crawl, there is a real probability that MMLU questions are present. A substring search of evaluation questions against the training set should be conducted, with any matches removed.

When data preparation is sufficiently complex to warrant orchestration, Airflow data pipelines are a suitable fit, as the dedup, contamination check, and tokenisation steps map well to a directed acyclic graph.

Caution: The most common training failure is also the most silent: chat template mismatch. The text produced by tokenizer.apply_chat_template should always be inspected to confirm that it matches the format expected by the model. The first 1000 characters of a formatted example should be printed before any long run.

The Actual Training Run

Three concrete recipes are presented below, covering the three anchor models across three hardware budgets. Each provides a known-working starting point from which learning rate, rank, and data mixture may be tuned.

Figure: End-to-End Training Pipeline

  1. Data prep: dedup, filter (hours to days, offline)
  2. Tokenize: chat template (minutes, cached)
  3. Forward: compute logits (~50-200ms/step)
  4. Loss: backward + grads (~70-300ms/step)
  5. Optimizer: AdamW step (~10-30ms/step)
  6. Eval: held-out set loss (every N steps)
  7. Checkpoint: save adapter / weights (every K steps or best eval)
  8. Benchmark: lm-eval-harness (end of training)

Steps 3 to 5 repeat for N steps per epoch. Total wall time (QLoRA 27B, single H100, 50K examples, 3 epochs): ~8-12 hours end-to-end; per-step: ~150-400ms; eval: every 500 steps; checkpoint: every 1000 steps.

Recipe 1: QLoRA on Qwen3.6-27B, Single H100 (80GB)

This is the most accessible setup. One rented H100 from Lambda Labs, RunPod, or a comparable cloud provider costs approximately $1.80 to $2.50 per hour as of May 2026. With 50,000 training examples and three epochs, the target wall time is eight to twelve hours. The $10 to $16 total quoted in this article corresponds to an hourly rate of roughly $1.25 to $1.35; at $1.80 to $2.50 per hour the same run costs roughly $14 to $30. This is the recipe most teams actually use.

# train_qlora_qwen36.py
import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

MODEL_ID = "Qwen/Qwen3.6-27B"
OUTPUT_DIR = "out/qwen36-27b-qlora"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat-4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,   # nested quantization of the quant constants
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer.padding_side = "right"  # important: right-pad for SFT

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_3",
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False  # cache is not used during training; saves VRAM

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,             # alpha/r = 2 is a common starting ratio
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

train_ds = load_dataset("json", data_files="data/train.jsonl", split="train")
eval_ds  = load_dataset("json", data_files="data/eval.jsonl",  split="train")

sft_config = SFTConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch = 16
    gradient_checkpointing=True,     # trade compute for VRAM
    learning_rate=2e-4,              # LoRA-typical; full FT would use ~1e-5
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="paged_adamw_8bit",        # 8-bit optimizer to save more VRAM
    bf16=True,
    max_seq_length=4096,
    packing=True,                    # pack short examples to maximize GPU use
    eval_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    save_total_limit=3,
    logging_steps=20,
    report_to="wandb",
    seed=42,
)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,      # replaces the deprecated tokenizer= keyword in recent TRL releases
    args=sft_config,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    peft_config=peft_config,
)

trainer.train()
trainer.save_model(OUTPUT_DIR)

The principal design choices in the script merit explanation:

  • NF4 with double quantisation: NF4 quantises the weights themselves; double quantisation additionally quantises the per-block scaling constants, saving a further approximately 0.4 bits per parameter on average.
  • Gradient checkpointing: activations are stored only at selected checkpoints and recomputed during the backward pass. With checkpoints placed every √L layers, activation memory grows with roughly the square root of the number of layers L rather than linearly, at a cost of roughly 30 per cent additional compute. The trade is almost always worthwhile for LoRA and QLoRA.
  • Gradient accumulation: with a per-device batch size of 2 and accumulation steps of 8, the effective batch is 16. This is useful when VRAM constrains the per-step batch but the optimisation signal of a larger batch is desired.
  • Paged AdamW 8-bit: optimiser states (first and second moments) held at 8-bit precision, with automatic paging to CPU memory when the GPU runs short during memory spikes. Reduces optimiser-state memory by a factor of four compared with FP32 AdamW.
  • Packing: concatenates multiple short examples into one sequence up to max_seq_length. Without packing, padding to 4096 tokens wastes most of the compute on short examples.
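To make the packing point concrete, the sketch below greedily concatenates short examples into sequences of at most max_seq_length tokens and compares the share of padding with and without packing. It is an illustration under stated assumptions, not TRL's implementation: over-long examples are simply truncated, examples are never split across packs, and the token lists are dummies. The function names are hypothetical.

```python
def pack_sequences(sequences, max_seq_length):
    """Greedily concatenate token lists into packs of at most max_seq_length."""
    packs, current = [], []
    for seq in sequences:
        seq = seq[:max_seq_length]  # simplification: truncate over-long examples
        if current and len(current) + len(seq) > max_seq_length:
            packs.append(current)
            current = []
        current = current + seq
    if current:
        packs.append(current)
    return packs


def padding_fraction(batches, max_seq_length):
    """Share of token slots wasted if every entry is padded to max_seq_length."""
    used = sum(len(b) for b in batches)
    return 1 - used / (len(batches) * max_seq_length)


lengths = [300, 1200, 150, 2500, 800, 90]
examples = [[0] * n for n in lengths]  # dummy token IDs
packed = pack_sequences(examples, 4096)

print(len(examples), "examples ->", len(packed), "packed sequences")
print(f"padding without packing: {padding_fraction(examples, 4096):.0%}")
print(f"padding with packing:    {padding_fraction(packed, 4096):.0%}")
```

The shorter the typical example relative to max_seq_length, the larger the saving, which is why packing matters most for chat-style datasets with brief turns.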

Recipe 2: Multi-GPU LoRA on Qwen3.5-122B-A10B

122B total parameters correspond to approximately 244GB in FP16 for the weights alone. Two H200s (141GB each, 282GB combined) or four H100s (320GB combined) can hold the sharded weights, but leave limited headroom for activations and adapter state. The accelerate configuration below therefore uses FSDP2 to shard the model across eight GPUs.

# accelerate_config_fsdp.yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_processes: 8
num_machines: 1
machine_rank: 0
gpu_ids: all

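fsdp_config:
  fsdp_version: 2
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  # Must match the decoder-layer class name in the model's modeling code
  fsdp_transformer_layer_cls_to_wrap: Qwen3MoeDecoderLayer
  # FSDP2 replaces fsdp_sharding_strategy with this flag; true corresponds to FULL_SHARD
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_offload_params: false
  # fsdp_use_orig_params and fsdp_sync_module_states are FSDP1-only options and are omitted
  fsdp_cpu_ram_efficient_loading: true
  fsdp_activation_checkpointing: true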

Launch the run with: accelerate launch --config_file accelerate_config_fsdp.yaml train_lora_qwen35.py

The training script is structurally similar to Recipe 1, with three changes: no BitsAndBytesConfig (LoRA rather than QLoRA), device_map=None (FSDP manages placement), and per-device batch size reduced to 1, with gradient accumulation set to 4 to give an effective batch of 32 across eight GPUs (8 × 1 × 4). Wall time for 50K examples over three epochs on 8× H100 is approximately 18 to 24 hours.

Figure: FSDP2 / ZeRO-3: Sharding Across GPUs

  • Naive data parallel (DDP): each of the four GPUs holds the full parameters, gradients, and optimizer state.
  • FSDP2 / ZeRO-3 sharded: each GPU holds 1/N of each state (P/4, G/4, O/4 on four GPUs).

Per-GPU memory for a 70B model in BF16:

  • DDP (no sharding), 4 GPUs: ~560 GB/GPU (overflows an 80GB H100 by 7×)
  • ZeRO-2 (grads + optim), 4 GPUs: ~280 GB/GPU (still overflows)
  • FSDP2 / ZeRO-3, 4 GPUs: ~140 GB/GPU (fits on H200, tight on H100)
  • FSDP2, 8 GPUs: ~70 GB/GPU (fits comfortably on H100)

Recipe 3: Multi-Node Full Fine-Tune on GPT-OSS-120B

Full fine-tuning a 117B MoE is genuinely expensive. The model weights in BF16 alone occupy approximately 234GB. With the addition of gradients, optimiser states (AdamW keeps two FP32 moments per parameter at 4 bytes each, 8 bytes per parameter in total, approximately 940GB), and activations, cluster-class memory is required. The lower bound is 32 H100 GPUs across four nodes, using torchtitan with FSDP2 sharding across all 32 GPUs and tensor parallelism within each node.

For most use cases this is not the appropriate path. Beyond the cost, full fine-tuning risks eroding the post-training calibration and safety tuning baked into the released checkpoint. The pragmatic path for GPT-OSS-120B is LoRA with rank 32, with the adapter applied to attention and MoE expert gate projections only.

Setup | Combined VRAM | What it can train
Single H100 QLoRA | 80 GB | Up to ~70B with QLoRA; Qwen3.6-27B comfortably
Single H200 QLoRA | 141 GB | Up to ~120B with QLoRA; comfortable 70B LoRA
2× H200 LoRA | 282 GB | Full LoRA on Qwen3.5-122B-A10B with FSDP2
8× H100 LoRA | 640 GB | LoRA on any model up to ~200B with sharding
8× H100 full FT | 640 GB | Full FT up to ~70B with FSDP2 + activation checkpointing
32× H100 multi-node | 2,560 GB | Full FT on 120B+ MoE; small pretraining runs

Across all three recipes, the choice of optimiser matters more than is commonly appreciated. AdamW with a cosine learning rate schedule and 3 per cent warm-up is the strong default. For LoRA, the learning rate is typically 1e-4 to 2e-4—substantially higher than the 1e-5 to 5e-5 used for full fine-tuning—because LoRA’s adapter layers begin near zero and require larger steps to learn meaningful deltas. Checkpoints should be saved every 1000 steps. Adapter-only (PEFT) checkpoints are preferable to full-model checkpoints; they are approximately one hundred times smaller.
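The shape of that schedule is easy to compute by hand. The sketch below implements linear warm-up followed by cosine decay to zero, using the 2e-4 peak and 3 per cent warm-up from Recipe 1 as defaults. It mirrors the usual shape of such schedulers but is not the trainer's exact implementation, and the function name is hypothetical.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warm-up to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000
for step in (0, 15, 30, 515, 1000):
    print(f"step {step:4d}: lr = {lr_at_step(step, total):.2e}")
```

The warm-up protects the freshly initialised adapter from large early updates, and the cosine tail lets the run settle rather than ending at full step size.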

For systematic optimisation of learning rate and rank, Bayesian hyperparameter optimisation with Gaussian processes is efficient. Random search is acceptable when the additional complexity is not warranted; grid search is almost never worthwhile for LoRA.
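As a sketch of the simpler option, random search over learning rate and rank only needs a sampler. The search ranges below are assumptions chosen to bracket the LoRA defaults mentioned above. The learning rate is drawn log-uniformly because its useful values span an order of magnitude. Each sampled configuration would then be passed to a short training run, which is not shown here.

```python
import math
import random


def sample_trial(rng: random.Random) -> dict:
    """Draw one LoRA configuration: log-uniform learning rate, categorical rank."""
    log_lr = rng.uniform(math.log10(5e-5), math.log10(5e-4))
    return {"learning_rate": 10 ** log_lr,
            "lora_rank": rng.choice([8, 16, 32, 64])}


rng = random.Random(0)  # fixed seed so the trial list is reproducible
trials = [sample_trial(rng) for _ in range(8)]
for t in trials:
    print(f"lr={t['learning_rate']:.2e} rank={t['lora_rank']}")
```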

Substantive Evaluation

Most fine-tuning evaluation amounts to theatre. The model is trained, training loss decreases, an “evaluation” runs on a sliver of the training set (or the same data slightly shuffled), and the team declares success. The model is then deployed to production, where it underperforms.

Substantive evaluation requires three properties: the evaluation data must not have been observed during training; the evaluation metric must measure the actual task rather than a proxy; and the metric must be reproducible across runs.

For general language understanding and reasoning, the standard benchmarks are MMLU (multi-task language understanding across 57 subjects), HumanEval (function-completion code), GSM8K (grade-school mathematics word problems), and MT-Bench (multi-turn instruction following, judged by a strong LLM). For code-heavy use cases, SWE-bench Verified and Terminal-Bench 2.0 are the current standards.

The community-standard tool is lm-evaluation-harness from EleutherAI, which runs the model against a registered benchmark suite in a reproducible manner:

lm_eval \
  --model hf \
  --model_args pretrained=out/qwen36-27b-qlora,trust_remote_code=True \
  --tasks mmlu,gsm8k,humaneval \
  --batch_size auto \
  --output_path eval_results/qwen36-qlora.json

The contamination problem is real and frequently neglected. If the training data was scraped from the public web, there is a non-trivial probability that benchmark questions are present. The decontamination check consists of an n-gram (typically 8-gram) overlap test between the training set and each benchmark’s question text, with any matching training examples removed. Without this check, benchmark scores are at best an upper bound on true capability, inflated by an unknown amount of memorisation.
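A minimal version of that overlap test might look like the sketch below. It assumes whitespace tokenisation and lower-casing, which is cruder than what production pipelines use, and the function names and example strings are invented for illustration.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Set of lower-cased whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_examples: list, benchmark_questions: list, n: int = 8) -> list:
    """Drop every training example that shares an n-gram with any benchmark question."""
    banned = set()
    for question in benchmark_questions:
        banned |= ngrams(question, n)
    return [ex for ex in train_examples if not (ngrams(ex, n) & banned)]


bench = ["A train leaves the station at nine and travels north "
         "for three hours at sixty miles per hour"]
train = [
    "Q: A train leaves the station at nine and travels north for three hours. How far?",
    "Explain how PID tuning changes when the payload on a robot joint increases.",
]
clean = decontaminate(train, bench)
print(len(clean), clean)
```

Normalising punctuation before tokenising would catch more near-duplicates, at the cost of a few false positives on common phrases.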

Figure: Reading the training loss curve (loss against training steps, 0 to N). The annotated curves show four cases: healthy (monotonic decline, with eval loss tracking train loss); overfitting (eval loss rises while train loss keeps falling); a loss spike (likely a bad data batch); and gradient explosion to NaN (learning rate too high, no clipping).

Diagnostic checklist:
  • Healthy: smooth curve, eval ≈ train
  • Overfit: stop early, add more data, regularise
  • Spike: inspect the batch at step N, deduplicate
  • Explosion: lower the learning rate, add gradient clipping

Set grad_clip=1.0 as the default and rerun from the last good checkpoint.

Beyond standard benchmarks, a domain-specific evaluation set should be held out, constructed from realistic prompts drawn from the actual use case. Benchmark suites measure general capability; a custom evaluation set measures whether the model performs better at the relevant task. The two metrics frequently disagree, and the custom set is the one that ultimately matters.

Tip: Construct the held-out evaluation set before fine-tuning begins, and store it at a separate file path that the training code cannot access. The temptation to inspect and “improve” the evaluation set after a poor run is a silent destroyer of meaningful evaluation.
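One way to freeze the split before training starts is to assign each example to train or eval from a hash of a stable identifier, so the assignment never changes when data is added or reshuffled. The sketch below assumes each example has such an ID. The function name, the ID format, and the 5 per cent fraction are illustrative.

```python
import hashlib


def is_eval_example(example_id: str, eval_percent: int = 5) -> bool:
    """Deterministically assign an example to the eval split from a hash of its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100 < eval_percent


ids = [f"ticket-{i}" for i in range(1000)]  # hypothetical example IDs
eval_ids = [i for i in ids if is_eval_example(i)]
train_ids = [i for i in ids if not is_eval_example(i)]
print(len(eval_ids), len(train_ids))
```

Because the decision depends only on the ID, a later data refresh cannot move an evaluation example into the training set.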

Deployment

When training is complete, the adapter or full checkpoint resides in a directory and must be served.

The two standard serving stacks in 2026 are vLLM and SGLang. vLLM has the broadest support and is the production default for most teams. SGLang is faster for structured-output workloads (JSON, regex-constrained generation) and provides superior RadixAttention KV-cache reuse for repeated-prefix workloads such as RAG and multi-turn chat.

Both implement continuous batching, a serving technique that keeps the GPU saturated by dynamically inserting new requests into the batch as existing requests complete, rather than waiting for the whole batch to finish. The throughput multiplier of continuous batching over static batching is typically a factor of three to five, sometimes more.
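The effect is easy to reproduce in a toy simulation. The sketch below counts decode steps for a fixed number of batch slots under each policy. It is a deliberately simplified model: every step costs the same, prefill and memory limits are ignored, and it is not how the vLLM or SGLang schedulers are implemented.

```python
from collections import deque


def static_batching_steps(lengths: list, slots: int) -> int:
    """Each batch of `slots` requests runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths: list, slots: int) -> int:
    """Freed slots are refilled from the queue at every decode step."""
    queue, active, steps = deque(lengths), [], 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        # One decode step: each request emits a token; finished requests leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps


# Output lengths in tokens: two long requests mixed with short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, slots=4))
print(continuous_batching_steps(lengths, slots=4))
```

In this workload the static scheduler leaves three slots idle for most of each batch while the long request finishes, so the continuous scheduler completes the same work in roughly half the steps. Real workloads, with more variance in output length, widen the gap.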

Figure: Deployment serving stack. Checkpoint (PEFT adapter + base model) → Quantize (AWQ / GPTQ / MXFP4 / FP8) → vLLM / SGLang (continuous batching) → KV cache (PagedAttention, block-based) → Clients (OpenAI-compatible).

Throughput multiplier from continuous batching (vs static batching on the same GPU; measured tokens/second for 256 concurrent request streams):
  • Static: 1×
  • Continuous batching: ~3×
  • + PagedAttention + prefix cache: ~5× or more on RAG workloads

Quantization trade-offs at inference time:
  • FP16 / BF16: baseline quality, roughly 4× the weight memory of a 4-bit format
  • AWQ (Activation-aware Weight Quantization): 4-bit, ~0.5pp quality loss, fast kernels in vLLM
  • GPTQ: 4-bit, post-training, slightly lower quality than AWQ but broader compatibility
  • MXFP4: 4-bit floating point with a block scale; GPT-OSS-120B ships with it; cleanest precision/cost trade-off

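For a fine-tuned Qwen3.6-27B served on a single H100, the launch command is as follows. The positional argument must be the base model; the adapter directory is passed separately through --lora-modules:

# BASE_MODEL: Hugging Face ID or local path of the base checkpoint used for training
vllm serve "$BASE_MODEL" \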
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --enable-lora \
  --lora-modules my-adapter=out/qwen36-27b-qlora \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --tensor-parallel-size 1

The serving endpoint exposes an OpenAI-compatible API at http://localhost:8000/v1. On the client side, the standard OpenAI SDK works unchanged once base_url points at the server:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM ignores the key by default
)

response = client.chat.completions.create(
    model="my-adapter",
    messages=[
        {"role": "system", "content": "You are an industrial controls expert."},
        {"role": "user",   "content": "What causes oscillation after a payload change on a cobot joint?"},
    ],
    temperature=0.2,
    max_tokens=512,
)

print(response.choices[0].message.content)

If the deployment forms part of a larger application, the serving pods may be run on Kubernetes with a GPU-aware scheduler. For tool-augmented workflows, vLLM supports tool calling, but it is opt-in: the server must be launched with --enable-auto-tool-choice and a --tool-call-parser that matches the model's output format (Hermes-style JSON for Qwen-family models; GPT-OSS uses its own harmony format and therefore needs a different parser). For broader integrations, the Model Context Protocol (MCP) is emerging as the de facto integration standard for tool-using LLM applications.

Common Pitfalls and Debugging

Most training failures derive from a small set of recurring mistakes. Awareness of these in advance saves substantial debugging time.

Chat template mismatch. Previously noted, but worth repeating because it is the most common silent failure. The training-time template and the inference-time template must be identical. A fully tokenised example with special tokens visible (tokenizer.decode(input_ids, skip_special_tokens=False)) should be printed before beginning any long run.
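Once both strings have been decoded, the comparison itself can be automated. The sketch below checks that the inference-time prompt is an exact prefix of the training-time text and reports where they first diverge. The helper name is invented, and the hand-written ChatML-style strings stand in for real decoded tokenizer output.

```python
def prompt_prefix_mismatch(train_text: str, infer_prompt: str):
    """Return None if infer_prompt is an exact prefix of train_text, otherwise
    (index, train_snippet, infer_snippet) at the first point of divergence."""
    if train_text.startswith(infer_prompt):
        return None
    i = next((k for k, (a, b) in enumerate(zip(train_text, infer_prompt)) if a != b),
             min(len(train_text), len(infer_prompt)))
    return i, train_text[i:i + 40], infer_prompt[i:i + 40]


# Stand-ins for tokenizer.decode(..., skip_special_tokens=False) output.
train_text = ("<|im_start|>user\nReset the drive.<|im_end|>\n"
              "<|im_start|>assistant\nHold the reset button for five seconds.<|im_end|>\n")
good_prompt = "<|im_start|>user\nReset the drive.<|im_end|>\n<|im_start|>assistant\n"
bad_prompt = "### User:\nReset the drive.\n### Assistant:\n"

print(prompt_prefix_mismatch(train_text, good_prompt))
print(prompt_prefix_mismatch(train_text, bad_prompt))
```

Running a check like this on a handful of examples before a long run is far cheaper than discovering the mismatch after deployment.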

Out-of-memory mid-training. The loss curve appears acceptable for 5,000 steps, after which a single long sequence in a batch exceeds the activation memory budget. The remedy is to lower max_seq_length, enable packing=True with a sequence cap, or reduce per-device batch size and increase gradient accumulation to compensate.
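The compensation in the last remedy is simple arithmetic: the effective batch size is the per-device batch multiplied by the accumulation steps and the number of GPUs. A small helper, with a hypothetical name and illustrative batch sizes, keeps it constant when the per-device batch has to shrink:

```python
import math


def accumulation_steps(effective_batch: int, per_device_batch: int, n_gpus: int = 1) -> int:
    """Gradient-accumulation steps needed to reach at least the effective batch size."""
    return math.ceil(effective_batch / (per_device_batch * n_gpus))


# Halving the per-device batch after an OOM doubles the accumulation steps.
print(accumulation_steps(64, per_device_batch=8))
print(accumulation_steps(64, per_device_batch=4))
```

Keeping the effective batch fixed means the learning rate and schedule do not need retuning after the change.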

Tokenizer drift. The base model has been loaded with one tokenizer revision and inference performed with another, causing the vocabulary or special-token IDs to shift. The tokenizer commit hash should be locked explicitly: AutoTokenizer.from_pretrained(MODEL_ID, revision="abc123def...").

Loss spikes. A large upward jump in loss at a specific step almost always indicates a bad batch: corrupted data, a tokenisation error on a single example, or an unusually long sequence. The data at that step should be inspected. If spikes are rare, gradient clipping (max_grad_norm=1.0) should be added and training resumed from the last good checkpoint.
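Finding the offending step does not have to be done by eye. The sketch below flags any step whose loss exceeds a multiple of the median over the preceding window. The window size and threshold are arbitrary starting points, and the loss series is synthetic, with one spike injected for demonstration.

```python
from statistics import median


def find_loss_spikes(losses: list, window: int = 50, factor: float = 1.5) -> list:
    """Return step indices whose loss exceeds factor x the median of the preceding window."""
    spikes = []
    for step in range(window, len(losses)):
        baseline = median(losses[step - window:step])
        if losses[step] > factor * baseline:
            spikes.append(step)
    return spikes


losses = [2.0 - 0.001 * s for s in range(200)]  # synthetic, slowly declining loss
losses[120] = 4.5                               # injected spike
print(find_loss_spikes(losses))
```

The median is used rather than the mean so that one spike does not raise the baseline and hide the next one.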

Evaluation/training distribution mismatch. Training loss is low, while evaluation loss is high and fails to improve. The evaluation set is drawn from a different distribution from the training set. Either the evaluation set should be drawn from the same source as the training data (with a fresh seed split), or the gap should be accepted as a measure of generalisation rather than a training failure.

Gradient explosion. Loss diverges to NaN within a few steps. The learning rate is too high for the task, gradient clipping has been omitted, or the data contain an extreme outlier in numerical features. Training should restart with learning_rate halved and max_grad_norm=1.0.

MoE-specific: expert collapse. Specific to MoE training (Qwen3.5-122B, GPT-OSS-120B). The router learns to route everything to one or two experts, and the remainder of the model atrophies. The mitigation is an auxiliary load-balancing loss. The common training stacks support it, but whether it is actually active usually depends on the model and trainer configuration, so it should be verified as enabled with a non-zero coefficient rather than assumed.
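For intuition about what that loss measures, here is a sketch of one common formulation, the Switch-Transformer-style term n_experts × Σ f_i · P_i. Here f_i is the fraction of routed token slots sent to expert i and P_i is the mean router probability for expert i. This is an illustration in plain Python, not necessarily the exact form any particular trainer uses, and the router probabilities are made up.

```python
def load_balancing_loss(router_probs: list, top_k: int = 1) -> float:
    """Auxiliary loss n_experts * sum_i f_i * P_i over a batch of tokens.

    router_probs holds one list of per-expert probabilities per token.
    """
    n_tokens, n_experts = len(router_probs), len(router_probs[0])
    counts = [0] * n_experts
    for probs in router_probs:
        top = sorted(range(n_experts), key=lambda e: probs[e], reverse=True)[:top_k]
        for e in top:
            counts[e] += 1
    f = [c / (n_tokens * top_k) for c in counts]
    p = [sum(probs[e] for probs in router_probs) / n_tokens for e in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
collapsed = [[0.97, 0.01, 0.01, 0.01]] * 4
print(load_balancing_loss(balanced), load_balancing_loss(collapsed))
```

The term sits at its minimum of about 1 when tokens are spread evenly across experts and grows towards the number of experts as routing collapses. Logging per-expert token counts alongside it makes collapse visible early.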

Caution: Training should always be launched with W&B (or an equivalent) logging enabled, and the loss curve should be reviewed every few hundred steps. Detecting a failure in the first hour costs an hour; detecting it at the twelve-hour evaluation costs a day and the cloud bill.

FAQ

Can these models be fine-tuned on a consumer GPU such as an RTX 4090?

Qwen3.6-27B can be fine-tuned on a 4090 with QLoRA. The 24GB of VRAM on a 4090 is tight but workable with gradient checkpointing, a paged 8-bit optimiser, and a short sequence length (approximately 2048 tokens). Qwen3.5-122B-A10B and GPT-OSS-120B require at least 80GB of VRAM, which corresponds to H100/H200/MI300X-class hardware. The released GPT-OSS-120B can be served (though not trained) on a single 80GB card due to MXFP4 quantisation.

How much data is actually required?

Less than is commonly expected. For domain adaptation with LoRA or QLoRA, 5,000 to 20,000 high-quality examples are sufficient for most domains. Quality matters considerably more than quantity: a tightly curated 10,000-example set consistently outperforms a noisy 100,000-example set. For format adaptation (teaching the model a new structured output schema), 1,000 to 2,000 examples often suffice.

How does this compare with using a managed API?

The two represent different problem spaces. Managed APIs (OpenAI, Anthropic) excel in convenience and access to the latest models. Self-hosted fine-tuned models excel in cost per million tokens at scale, data sovereignty, custom domain adaptation, and predictable cost (no per-call billing). The crossover point is typically around 100M tokens per month; below this, managed services are usually preferable, and above it, self-hosted is usually cheaper.

What is the quality difference between LoRA and full fine-tuning?

LoRA retains 90 to 95 per cent of full fine-tuning quality across most tasks. QLoRA retains 80 to 90 per cent. The remaining gap is largest on tasks requiring substantial representational shift from the base model—for example, fine-tuning an English-pretrained model to operate fluently in a low-resource language. For typical instruction tuning, code adaptation, or structured-output tasks, the gap is sufficiently small that the cost savings of LoRA dominate.

Should continued pretraining precede instruction tuning?

Only when the domain is genuinely far from the base model’s training distribution—medical literature, legal contracts in a non-English language, or highly specialised scientific notation. For most domains, the base model has sufficient coverage that instruction tuning alone closes the gap. Continued pretraining is expensive and easily mishandled, with the principal risk being catastrophic forgetting of the base model’s general competence.

Conclusion

Training open-source LLMs in 2026 is no longer the closed activity it was two years ago. The combination of Apache 2.0 base models with frontier-class reasoning (GPT-OSS-120B approaching o4-mini), QLoRA on a single rented GPU, and serving infrastructure capable of handling thousands of concurrent users on commodity hardware has placed production-grade LLM customisation within reach of any team with a modest budget and a clear use case.

The three anchor models cover the practical range: Qwen3.6-27B for the single-GPU dense workflow, Qwen3.5-122B-A10B for inexpensive MoE serving when multi-GPU capacity is available, and GPT-OSS-120B for single-GPU serving of a frontier-class reasoner enabled by MXFP4. None of these is universally “best”; each addresses different questions about hardware, latency, and quality.

The principal challenge is no longer the technology; it is the data—assembling, deduplicating, formatting, and contamination-checking a dataset that actually teaches the model the intended behaviour. The trainer runs in eight hours. The dataset takes eight weeks. Planning should be adjusted accordingly.
