Can I fine-tune any of these models on a consumer GPU like an RTX 4090?

Qwen3.6-27B: yes, with QLoRA, gradient checkpointing, a paged 8-bit optimizer, and a short sequence length. Qwen3.5-122B-A10B and GPT-OSS-120B require at least 80GB of VRAM.

How much data do I actually need to fine-tune an open-source LLM?

For domain adaptation with LoRA or QLoRA, 5,000 to 20,000 high-quality examples is enough for most domains. For format adaptation, 1,000-2,000 examples often suffice.

How does fine-tuning compare to using a managed API?

Managed APIs win on convenience. Self-hosted fine-tuned models win on cost at scale, data sovereignty, and custom domain adaptation. The crossover point is typically around 100M tokens per month.

Should I do continued pretraining before instruction tuning?

Only if your domain is genuinely far from the base model's training distribution. For most domains, instruction tuning alone closes the gap, and continued pretraining risks catastrophic forgetting.

How to Train Open-Source LLMs in 2026: Qwen3.6, Qwen3.5, GPT-OSS

What's the quality difference between LoRA and full fine-tuning?

LoRA retains 90-95 percent of full fine-tuning quality. QLoRA retains 80-90 percent. The gap is largest on tasks requiring substantial representational shift from the base model.

Last updated: May 27, 2026

By kongastral

Published May 21, 2026 · Updated May 27, 2026 · 34 min read

Two years ago, training a large language model required either renting time at a research lab or accepting that fine-tuning was the preserve of billion-dollar companies. By May 2026, Qwen3.6-27B can be taken from a Hugging Face download to a domain-specialised model on a single rented H100 for less than fifteen dollars. The tools have changed. The underlying mathematics has not, but the population of those who use it has expanded. This article describes how to train an open-source LLM in practice today: what hardware is required, which model to choose, how to format the data so that the trainer does not silently discard it, and how to place the result behind a serving endpoint that responds in milliseconds.

Summary

What this post covers: A working 2026 playbook for fine-tuning open-source LLMs using three concrete anchors — the dense Qwen3.6-27B, the MoE Qwen3.5-122B-A10B, and OpenAI’s GPT-OSS-120B — from environment setup through deployment.

Key insights:

QLoRA on a single H100 (80GB) now fine-tunes a 27B dense model in 8 to 12 hours for $10 to $16 of cloud rental, retaining 80 to 90 percent of full fine-tuning quality.
MoE models like Qwen3.5-122B-A10B (10B active) and GPT-OSS-120B (5.1B active) need VRAM to hold all 122B or 117B weights, even though per-token compute is small — the “active parameter” headline number is a runtime FLOPs claim, not a memory one.
Chat-template mismatch between training and inference is the single most common cause of a “trained but acts untrained” model — Qwen’s <|im_start|> markers and GPT-OSS’s harmony format are not interchangeable.
GPT-OSS-120B ships post-trained with MXFP4 quantization on the MoE weights, which is why a 117B-total-parameter model fits in a single 80GB H100 at inference time.
For anything past 70B at full precision, FSDP2 or DeepSpeed ZeRO-3 sharding is no longer optional: single-GPU training caps out around 32B dense in FP16 even on H200 (141GB) hardware.

Main topics: The State of Open-Source LLM Training in 2026; The Three Anchor Models; Choosing Full Fine-Tune, LoRA, or QLoRA; Setting Up the Training Environment; Preparing the Dataset; The Actual Training Run; Substantive Evaluation; Deployment; Common Pitfalls and Debugging.

The State of Open-Source LLM Training in 2026

The open-source LLM landscape in May 2026 bears little resemblance to that of early 2024. Two structural shifts have transformed what a single engineer can accomplish alone.

The first shift is architectural. Mixture-of-Experts (MoE) models, in which each token activates only a small subset of total parameters, have become the dominant configuration for any model larger than 30B. A dense model uses every weight on every token; an MoE model uses a router to direct each token to a small fraction of “expert” sub-networks. Qwen3.5-122B-A10B has 122B total parameters but only approximately 10B active per forward pass. GPT-OSS-120B contains 117B total parameters with 5.1B active. The runtime FLOPs resemble those of a small model; the VRAM footprint does not.
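The active-versus-total distinction can be illustrated with a toy sketch of top-k expert routing in plain Python. The expert count, the value of k, and the router scores below are illustrative assumptions, not the actual configuration of either model, and `top_k_route` is a hypothetical helper name.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over the chosen experts only (subtract the max for stability).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Hypothetical layer: 8 experts, 2 active per token.
NUM_EXPERTS, K = 8, 2
logits_for_one_token = [0.1, 2.3, -1.0, 0.7, 1.9, -0.4, 0.0, 0.5]
routes = top_k_route(logits_for_one_token, K)

active_fraction = K / NUM_EXPERTS
print(routes)           # experts that run for this token, with mixing weights
print(active_fraction)  # share of expert weights doing compute for this token
```

Only two experts compute for this token, but the next token may select a different pair, which is why all eight must remain resident in memory.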

The second shift concerns post-training tooling. QLoRA, in which the base weights are frozen at 4-bit NF4 (NormalFloat-4, a quantisation format optimised for the distribution of neural network weights) and only a small low-rank adapter is trained, has moved from a research curiosity in 2023 to the default starting point in 2026. LoRA (Low-Rank Adaptation) retains 90 to 95 per cent of full fine-tuning performance. QLoRA retains 80 to 90 per cent while reducing VRAM by approximately 75 per cent compared with FP16.

The practical implication is as follows: a 7B model whose weights alone occupy approximately 14GB in FP16 can now be fine-tuned in 5 to 6GB under QLoRA. A 70B model, whose FP16 weights occupy approximately 140GB, now fits in approximately 46GB. The hardware threshold has dropped sufficiently that the question has shifted from whether training is affordable to what should be trained.
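The weight-only part of these figures follows from simple arithmetic, sketched below. The helper `weight_gb` is a hypothetical name, and the calculation deliberately ignores adapters, optimiser state, activations, and quantisation constants, which is why the QLoRA totals quoted above are larger than the raw 4-bit weight size.

```python
def weight_gb(params_billion, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8

for size in (7, 70):
    fp16 = weight_gb(size, 16)
    four_bit = weight_gb(size, 4)
    print(f"{size}B: FP16 weights {fp16:.1f} GB, 4-bit weights {four_bit:.1f} GB")
```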

The implications for a practitioner intending to train a model today are as follows: prosumer hardware (a single H100 or H200, or even a 48GB workstation card such as the RTX 6000 Ada) can handle QLoRA on models up to 70B. Beyond that point, multi-GPU LoRA or sharded full fine-tuning is required. Specific recipes for each scenario are presented below.

Pretraining from scratch—the 2.1 million H100-hour run that produced GPT-OSS-120B—remains out of reach for almost all practitioners. Within reach, however, is taking one of these three checkpoints and adapting it to a particular dataset, domain, or task. This is what “training an open-source LLM” means in practice in 2026.

Key Takeaway: Training in 2026 almost always means fine-tuning a released checkpoint. The interesting choice is not pretraining versus fine-tuning but rather which fine-tuning method and which base model to use.

The Three Anchor Models

Three models cover the practical range of what is fine-tuned today: a dense 27B model that fits comfortably on prosumer hardware, a sparse 122B model that requires cluster-class memory but inexpensive compute, and a 117B MoE model that ships pre-quantised to fit on a single 80GB card.

Qwen3.6-27B

Released on 22 April 2026 by Alibaba’s Qwen team. Dense: every one of the 27 billion parameters participates in every forward pass. It uses Gated DeltaNet, a hybrid attention scheme that combines a linear-attention path (constant memory cost per token) with traditional softmax self-attention. The linear path handles long-range context, while the softmax path preserves short-range precision.

Native context is 262,144 tokens, extensible to one million via position-encoding extrapolation. The model is natively multimodal: the same checkpoint accepts images and text. A “Thinking Preservation” mechanism maintains a chain-of-thought reasoning mode and a fast non-thinking mode within a single set of weights.

Benchmark figures from the Qwen team include SWE-bench Verified 77.2 (compared with Qwen3.5-397B-A17B at 76.2), SWE-bench Pro 53.5 (compared with 50.9), Terminal-Bench 2.0 59.3 (compared with 52.5), and SkillsBench 48.2 (compared with 30.0). A 27B dense model surpassing its 397B MoE predecessor on code-related work is the kind of result that re-establishes the importance of architecture choice.

The model can be downloaded from the QwenLM/Qwen3.6 official repository or the Hugging Face Qwen/Qwen3.6-27B mirror. The licence is Apache 2.0: commercial use is permitted with attribution.

Qwen3.5-122B-A10B

Released on 24 February 2026. A sparse MoE: 122 billion total parameters, approximately 10 billion active per forward pass. The “A10B” suffix denotes the active-parameter count. Each token is routed through a small subset of experts, while the remainder of the network remains idle for that token.

The model shares the Gated DeltaNet hybrid attention of Qwen3.6-27B and the same 262K native context, extensible to 1M+. It is text-only at this size. The MoE structure means inference compute resembles that of a 10B model, but VRAM must still hold all 122B weights, because the router cannot determine in advance which expert any given token will require.

This is the appropriate model when strong quality is required alongside inexpensive per-token serving. The active-parameter count determines latency and energy cost; the total parameter count determines hardware purchasing decisions. The trade-off is frequently misunderstood on first encounter.

GPT-OSS-120B

Part of OpenAI’s first open-weight LLM release since GPT-2 (2019), published in August 2025. The model contains 117 billion total parameters with 5.1 billion active, under an Apache 2.0 licence. It was trained on NVIDIA H100 GPUs using PyTorch with custom Triton kernels. The training run consumed 2.1 million H100-hours, which at $2 per hour in cloud pricing represents approximately $4.2 million in compute alone.

What makes GPT-OSS-120B unusual is that it ships post-trained with MXFP4 quantisation on the MoE weights. MXFP4 is a 4-bit floating-point format with a shared scale per micro-block. Because the bulk of the parameter count resides in the MoE expert layers, quantising those layers to 4-bit reduces the on-disk and in-VRAM footprint sufficiently to fit on a single 80GB GPU (H100 or AMD MI300X). The non-expert layers remain at higher precision.

The benchmark posture indicates near-parity with OpenAI’s o4-mini on core reasoning. For a model that can run on a single rented GPU, this is a notable result. The model card and weights are available at huggingface.co/openai/gpt-oss-120b; the official repository is at github.com/openai/gpt-oss; the launch announcement is at openai.com/index/introducing-gpt-oss.

Attribute	Qwen3.6-27B	Qwen3.5-122B-A10B	GPT-OSS-120B
Total params	27B	122B	117B
Active params	27B (dense)	~10B	5.1B
Architecture	Dense, Gated DeltaNet	MoE, Gated DeltaNet	MoE, grouped-query attn
License	Apache 2.0	Apache 2.0	Apache 2.0
Release date	2026-04-22	2026-02-24	August 2025
Native context	262K (extensible to 1M)	262K (extensible to 1M+)	128K
Multimodal	Yes (vision + text)	Text only	Text only
Download	HF: Qwen/Qwen3.6-27B	HF: Qwen/Qwen3.5-122B-A10B	HF: openai/gpt-oss-120b

Choosing Full Fine-Tune, LoRA, or QLoRA

Three fine-tuning methods cover essentially the entire field. They occupy positions along a cost-versus-quality spectrum, and the appropriate choice depends on the volume of available data and the degree to which the target domain differs from the base model’s training distribution.

Full fine-tuning updates every parameter. It requires approximately four times the model’s memory footprint during training: model weights, gradients, optimizer states (two for AdamW: first and second moment), and activations. A 7B model requires approximately 14GB in FP16 for weights alone; with optimizer states and gradients, peak usage approaches 60GB.

LoRA (Low-Rank Adaptation) freezes the base weights and inserts trainable low-rank matrices alongside selected weight matrices, most commonly the attention projections. Instead of updating the full weight matrix W (for example, 4096×4096 = approximately 16.7M parameters), two small matrices B (4096×r) and A (r×4096) are trained, where r is typically 8, 16, or 32. The model effectively learns ΔW = B·A, which is applied alongside the frozen W during training and can be merged into W for inference. For r = 16, this amounts to approximately 131K trainable parameters per adapted matrix rather than 16.7M, roughly 128 times fewer.
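The parameter arithmetic above can be checked in a few lines. This is a minimal sketch for a single 4096×4096 matrix; `lora_param_counts` is a hypothetical helper name, not part of any library.

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in          # updating W directly
    lora = d_out * r + r * d_in  # B is d_out x r, A is r x d_in
    return full, lora

full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora, full / lora)  # 16777216 131072 128.0
```

Doubling r doubles the adapter size, so the reduction factor halves to 64 at r = 32.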

QLoRA extends LoRA further. The frozen base weights are quantised to 4-bit NF4 (NormalFloat-4, designed to match the typical Gaussian distribution of neural network weights), and LoRA adapters sit on top in FP16 or BF16. The weights are de-quantised on the fly only during forward and backward passes. Memory consumption decreases by approximately 75 per cent compared with FP16 training.

Method	VRAM (7B)	VRAM (70B)	Wall time (1 H100)	Cost (cloud)	Quality retention
Full FT	~60 GB	~560 GB (needs 8×H100)	24-48h on 8×H100	$250-510	100% (baseline)
LoRA	~16 GB	~160 GB (2-4 GPUs)	10-15h	$20-40	90-95%
QLoRA	~6 GB	~46 GB (1 H100/H200)	8-12h	$10-16	80-90%

The practical selection heuristic is to begin with QLoRA. If quality is insufficient after a sweep over rank, learning rate, and data size, the next step is LoRA. Full fine-tuning should be reserved for cases in which the domain shift is so substantial that the base model’s representation is genuinely wrong—for example, a model trained predominantly on English required to operate in a low-resource language. The 80 to 90 per cent quality retention of QLoRA is sufficient for the majority of production tasks.

Tip: A LoRA rank (r) of 16 serves as a sensible default. It should be increased to 32 or 64 only if the task differs substantially from the base model’s training distribution. Higher rank consumes more VRAM and rarely provides benefits beyond r = 16 for most domains.

It is worth noting that GPT-OSS-120B’s 4-bit inference figure (approximately 35 GB) is substantially lower than Qwen3.5-122B’s 62 GB despite similar total parameter counts. This is the advantage of MXFP4-native quantisation. Qwen3.5 must be quantised after training (AWQ or GPTQ), incurring some additional accuracy loss; GPT-OSS-120B was post-trained with the 4-bit format already in mind.

Setting Up the Training Environment

Three years ago, this section would have been considerably more complex: CUDA versions, PyTorch builds, mismatched Triton, and broken bitsandbytes. In May 2026 the process remains finicky, but the recipe is more stable.

The requirements are CUDA 12.6 or newer (CUDA 12.8 ships well with the H100/H200 SXM5 drivers), cuDNN 9.5 or newer, PyTorch 2.7 stable or 2.8 nightly, and recent versions of transformers, peft, accelerate, trl, bitsandbytes, and vllm. Flash Attention 3 requires Hopper (H100/H200) or newer; on Ampere (A100), Flash Attention 2 is the fallback.

The cleanest approach uses a Docker container that pins all of these versions. Building locally is the second-cleanest option. Operating in a bare Python environment invites an evening of debugging mismatched CUDA symbols. Containerising the training environment with a known-good base image, typically nvidia/cuda:12.8.0-cudnn-devel-ubuntu24.04, is the standard approach.

A working pyproject.toml for a fine-tuning project as of May 2026 is shown below:

[project]
name = "llm-finetune"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "torch==2.7.0",
    "transformers==4.50.2",
    "peft==0.14.1",
    "bitsandbytes==0.46.0",
    "accelerate==1.4.0",
    "trl==0.16.0",
    "datasets==3.5.0",
    "unsloth==2026.5.3",
    "flash-attn==3.0.1",
    "vllm==0.9.2",
    "wandb==0.19.5",
    "sentencepiece==0.2.0",
    "tiktoken==0.7.0",
    "lm-eval==0.4.7",
]

[tool.uv]
index-strategy = "unsafe-best-match"

[[tool.uv.index]]
name = "pytorch-cuda128"
url = "https://download.pytorch.org/whl/cu128"

A Dockerfile producing a known-good training image is shown below:

FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu24.04

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    HF_HOME=/workspace/.cache/huggingface \
    TORCH_CUDA_ARCH_LIST="9.0;10.0"

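# Ubuntu 24.04 ships Python 3.12 and has no python3.11 apt package;
# 3.12 satisfies requires-python >=3.11 in pyproject.toml.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-venv python3-pip git curl ca-certificates \
        build-essential ninja-build cmake \
    && rm -rf /var/lib/apt/lists/*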

RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:${PATH}"

WORKDIR /workspace
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev

# Flash Attention 3 needs to compile against the installed torch
RUN uv pip install --no-build-isolation flash-attn==3.0.1

COPY . .

CMD ["uv", "run", "python", "-m", "train"]

The framework landscape in 2026 is as follows: TRL is HuggingFace’s official trainer for SFT (supervised fine-tuning) and reinforcement learning post-training. Axolotl is a YAML-config layer on top of TRL that handles much of the data-preparation boilerplate. Unsloth is a Triton-optimised custom kernel package that claims up to twice the training speed and 60 per cent lower VRAM consumption through hand-tuned kernels, and is now stable enough for production use. torchtitan is Meta’s reference scaffolding for large-scale pretraining and full fine-tuning with FSDP2.

Framework	Primary use case	Scaling target	Ergonomics	Recent activity
TRL	SFT, DPO, GRPO, PPO	1-8 GPUs, single node	Python API, flexible	Very active
Axolotl	SFT, DPO with YAML config	1-8 GPUs	YAML, low boilerplate	Active
Unsloth	Single-GPU QLoRA/LoRA, speed	1 GPU (multi-GPU in 2025)	Drop-in trainer wrapper	Very active
torchtitan	Full FT, pretraining at scale	Multi-node FSDP2	Lower-level, reference impl	Active

For 99 per cent of fine-tuning work, the appropriate choice is TRL (or Unsloth-wrapped TRL for single-GPU work). torchtitan should be reserved for full fine-tuning at 70B and above, where FSDP2 sharding becomes mandatory.

Preparing the Dataset

The training loop is the straightforward part. Dataset preparation is where most fine-tuning runs silently fail. The model trains for hours and the loss curve appears acceptable, but at evaluation the outputs are wrong, off-topic, or in the wrong format. The problem is almost always located upstream of the trainer.

A modern instruction-tuning dataset consists of a list of conversations. Each conversation is a list of turns, and each turn has a role (system, user, assistant) and content. JSONL is the standard on-disk format, with one JSON object per line. Parquet is the high-performance alternative for large datasets requiring columnar compression.

A single training example for Qwen3.6-27B appears as follows:

{"messages": [
  {"role": "system", "content": "You are an industrial controls expert."},
  {"role": "user", "content": "Explain why a PID loop on a cobot joint might oscillate after a payload change."},
  {"role": "assistant", "content": "When the end-effector payload changes, the effective inertia at the joint changes. The PID gains were tuned for the original inertia. With higher effective mass, the proportional term applies the same torque correction to a slower-responding system, which can overshoot the setpoint and drive sustained oscillation. Two practical fixes: gain scheduling (load the right PID set when the gripper picks up the part) or an adaptive controller (online inertia estimation). The first is easier to validate in safety review."}
]}

The tokenizer then applies the model’s chat template (a Jinja-style template defined inside tokenizer_config.json) to convert that list of turns into a single tokenised sequence with the model’s special tokens. For Qwen3.6, the chat template wraps each turn in <|im_start|>role\ncontent<|im_end|>. For GPT-OSS-120B, the harmony format is used, with <|start|>, <|channel|>, <|message|>, and <|end|> markers. These are not interchangeable. A model trained with the wrong template and inferred with the correct one will behave as though it had never been trained.
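For eyeballing what the ChatML-style layout looks like, a hand-rolled approximation is sketched below. It is an assumption-laden illustration only: `render_chatml` is a hypothetical helper, and the real template shipped in tokenizer_config.json may add details (such as a default system prompt or thinking tags), so training code should always call tokenizer.apply_chat_template instead.

```python
def render_chatml(messages):
    """Approximate the <|im_start|>role\\ncontent<|im_end|> layout for inspection."""
    parts = []
    for turn in messages:
        parts.append(f"<|im_start|>{turn['role']}\n{turn['content']}<|im_end|>\n")
    return "".join(parts)

example = [
    {"role": "system", "content": "You are an industrial controls expert."},
    {"role": "user", "content": "Why might a PID loop oscillate?"},
    {"role": "assistant", "content": "The gains no longer match the plant."},
]
rendered = render_chatml(example)
print(rendered)
```

Comparing this kind of rendering against the output of the real template is a quick way to spot a mismatch before a long run.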

The standard loss-masking pattern is as follows: the model is trained to predict assistant tokens, but the loss is masked (set to -100, the standard ignore_index for PyTorch’s CrossEntropyLoss) on system and user tokens. It is undesirable to teach the model to generate user messages.
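The masking pattern can be sketched without any framework. The token ids and per-token role tags below are made-up values for illustration; in practice the trainer or a data collator derives the assistant spans from the chat template.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_non_assistant(token_ids, roles):
    """Copy token_ids into labels, blanking every non-assistant position."""
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]

# Hypothetical token ids, with the role each token belongs to.
token_ids = [11, 12, 13, 21, 22, 31, 32, 33]
roles = ["system"] * 3 + ["user"] * 2 + ["assistant"] * 3
labels = mask_non_assistant(token_ids, roles)
print(labels)  # [-100, -100, -100, -100, -100, 31, 32, 33]
```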

A representative data-loading pipeline for Qwen3.6-27B, using the HuggingFace datasets library, is shown below:

from datasets import load_dataset
from transformers import AutoTokenizer

MODEL_ID = "Qwen/Qwen3.6-27B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

def format_example(example):
    """Apply Qwen's chat template and tokenize."""
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

ds = load_dataset("json", data_files="data/train.jsonl", split="train")
ds = ds.map(format_example, remove_columns=ds.column_names)

# Train/eval split with a fixed seed for reproducibility
split = ds.train_test_split(test_size=0.05, seed=42)
train_ds, eval_ds = split["train"], split["test"]

print(f"Train: {len(train_ds)}, Eval: {len(eval_ds)}")
print("Sample formatted text:")
print(train_ds[0]["text"][:500])

Before training, two additional passes should be performed on the dataset. First, deduplication: exact-match dedup is inexpensive (a hash per example), while MinHash or SimHash near-dedup catches paraphrases. Duplicates make the training loss look better than it is and bias the model toward memorising common patterns.
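The exact-match pass can be sketched with the standard library alone. The helper names are hypothetical, and the sketch assumes each example is a JSON-serialisable dictionary in the messages format used in this article; near-duplicate detection with MinHash or SimHash is not covered.

```python
import hashlib
import json

def example_key(example):
    """Stable hash of one conversation; key order is normalised first."""
    canonical = json.dumps(example, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedup_exact(examples):
    """Keep the first occurrence of each example, preserving order."""
    seen, kept = set(), []
    for ex in examples:
        key = example_key(ex)
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

rows = [
    {"messages": [{"role": "user", "content": "hi"}]},
    {"messages": [{"content": "hi", "role": "user"}]},  # same content, different key order
    {"messages": [{"role": "user", "content": "hello"}]},
]
unique_rows = dedup_exact(rows)
print(len(rows), "->", len(unique_rows))  # 3 -> 2
```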

Second, a contamination check: it must be ensured that none of the training data overlaps with the evaluation benchmarks. If the evaluation is MMLU and the training data was scraped from Common Crawl, there is a real probability that MMLU questions are present. A substring search of evaluation questions against the training set should be conducted, with any matches removed.

When data preparation is sufficiently complex to warrant orchestration, Airflow data pipelines are a suitable fit, as the dedup, contamination check, and tokenisation steps map well to a directed acyclic graph.

Caution: The most common training failure is also the most silent: chat template mismatch. A few examples should always be rendered with tokenizer.apply_chat_template to confirm that the text fed to the trainer matches the format expected by the model. The first 1000 characters of a formatted example should be printed before any long run.

The Actual Training Run

Three concrete recipes are presented below, covering the three anchor models across three hardware budgets. Each provides a known-working starting point from which learning rate, rank, and data mixture may be tuned.

Recipe 1: QLoRA on Qwen3.6-27B, Single H100 (80GB)

This is the most accessible setup. One rented H100 from Lambda Labs, RunPod, or a comparable cloud provider costs approximately $1.80 to $2.50 per hour as of May 2026. With 50,000 training examples and three epochs, the target wall time is eight to twelve hours, for a total bill of $10 to $16. This is the recipe most teams actually use.

# train_qlora_qwen36.py
import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

MODEL_ID = "Qwen/Qwen3.6-27B"
OUTPUT_DIR = "out/qwen36-27b-qlora"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat-4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,   # nested quantization of the quant constants
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer.padding_side = "right"  # important: right-pad for SFT

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_3",
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False  # cache is not used during training; saves VRAM

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,             # alpha/r = 2 is a common starting ratio
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

train_ds = load_dataset("json", data_files="data/train.jsonl", split="train")
eval_ds  = load_dataset("json", data_files="data/eval.jsonl",  split="train")

sft_config = SFTConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch = 16
    gradient_checkpointing=True,     # trade compute for VRAM
    learning_rate=2e-4,              # LoRA-typical; full FT would use ~1e-5
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="paged_adamw_8bit",        # 8-bit optimizer to save more VRAM
    bf16=True,
    max_seq_length=4096,
    packing=True,                    # pack short examples to maximize GPU use
    eval_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    save_total_limit=3,
    logging_steps=20,
    report_to="wandb",
    seed=42,
)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # recent TRL releases renamed `tokenizer` to `processing_class`
    args=sft_config,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    peft_config=peft_config,
)

trainer.train()
trainer.save_model(OUTPUT_DIR)

The principal design choices in the script merit explanation:

NF4 with double quantisation: NF4 quantises the weights themselves; double quantisation additionally quantises the per-block scaling constants, saving a further approximately 0.4 bits per parameter on average.
Gradient checkpointing: activations are recomputed during the backward pass rather than stored. In the classic scheme this reduces activation memory from linear in the number of layers to roughly its square root, at a cost of roughly 30 per cent additional compute. The trade is almost always worthwhile for LoRA and QLoRA.
Gradient accumulation: with a per-device batch size of 2 and accumulation steps of 8, the effective batch is 16. This is useful when VRAM constrains the per-step batch but the optimisation signal of a larger batch is desired.
Paged AdamW 8-bit: optimiser states (first and second moments) at 8-bit precision, with paging to CPU when not in use. Reduces optimiser-state memory by a factor of four compared with FP32 AdamW.
Packing: concatenates multiple short examples into one sequence up to max_seq_length. Without packing, padding to 4096 tokens wastes most of the compute on short examples.
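The packing idea in the last item can be illustrated with a simplified greedy sketch. This is not TRL's implementation; `pack_greedy` is a hypothetical helper, and the example lengths and the 8-token window are made-up values chosen to keep the output readable.

```python
def pack_greedy(sequences, max_len):
    """Concatenate consecutive sequences until the next one would overflow max_len."""
    packed, current = [], []
    for seq in sequences:
        seq = seq[:max_len]  # truncate anything longer than one full window
        if current and len(current) + len(seq) > max_len:
            packed.append(current)
            current = []
        current = current + seq
    if current:
        packed.append(current)
    return packed

# Hypothetical tokenised examples of lengths 3, 4, 2 and 5 with an 8-token window.
examples = [[1] * 3, [2] * 4, [3] * 2, [4] * 5]
packed = pack_greedy(examples, max_len=8)
print([len(p) for p in packed])  # [7, 7]
```

Without packing, the four examples would occupy four padded windows (32 token slots for 14 real tokens); packed, they occupy two.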

Recipe 2: Multi-GPU LoRA on Qwen3.5-122B-A10B

122B total parameters corresponds to approximately 244GB in FP16 for the weights alone. Two H200s (141GB each, 282GB combined) or four H100s (320GB combined) can hold the sharded weights, but leave limited headroom for activations and adapter optimiser state. The accelerate configuration below therefore specifies FSDP2 with the model sharded across eight GPUs.

# accelerate_config_fsdp.yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_processes: 8
num_machines: 1
machine_rank: 0
gpu_ids: all

fsdp_config:
  fsdp_version: 2
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3MoeDecoderLayer
  fsdp_reshard_after_forward: true   # FSDP2 counterpart of fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_activation_checkpointing: true

Launch the run with: accelerate launch --config_file accelerate_config_fsdp.yaml train_lora_qwen35.py

The training script is structurally similar to Recipe 1, with three changes: no BitsAndBytesConfig (LoRA rather than QLoRA), device_map=None (FSDP manages placement), and per-device batch size reduced to 1 with accumulation steps increased to maintain an effective batch of approximately 32. Wall time for 50K examples over three epochs on 8× H100 is approximately 18 to 24 hours.
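The accumulation setting follows from the target effective batch, the per-device batch, and the GPU count. The sketch below is a minimal illustration; `accumulation_steps` is a hypothetical helper, not part of accelerate or TRL.

```python
def accumulation_steps(target_effective_batch, per_device_batch, num_gpus):
    """Gradient-accumulation steps needed to reach the target effective batch."""
    per_step = per_device_batch * num_gpus
    if target_effective_batch % per_step:
        raise ValueError("target batch must be a multiple of per-step batch")
    return target_effective_batch // per_step

print(accumulation_steps(32, per_device_batch=1, num_gpus=8))  # 4
print(accumulation_steps(16, per_device_batch=2, num_gpus=1))  # 8
```

The second call reproduces the single-GPU setting used in Recipe 1.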

Recipe 3: Multi-Node Full Fine-Tune on GPT-OSS-120B

Full fine-tuning a 117B MoE is genuinely expensive. The model weights in BF16 alone occupy approximately 234GB. With the addition of gradients, optimiser states (AdamW keeps two FP32 moments per parameter, 8 bytes per parameter in total, approximately 940GB), and activations, cluster-class memory is required. The lower bound is 32 H100 GPUs across four nodes, using torchtitan with FSDP2 sharding across all 32 GPUs and tensor parallelism within each node.
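The memory budget can be tallied explicitly. The sketch below assumes BF16 weights, BF16 gradients, and two FP32 AdamW moments per parameter; it excludes activations, any FP32 master copy of the weights, and fragmentation, so the result is a floor rather than a sizing recommendation. `full_ft_state_gb` is a hypothetical helper name.

```python
def full_ft_state_gb(params_billion, weight_bytes=2, grad_bytes=2, adam_bytes=8):
    """Per-component training-state memory in GB, excluding activations."""
    return {
        "weights": params_billion * weight_bytes,
        "gradients": params_billion * grad_bytes,     # assumed BF16
        "adam_moments": params_billion * adam_bytes,  # two FP32 moments, 4 bytes each
    }

budget = full_ft_state_gb(117)
total = sum(budget.values())
min_gpus = -(-total // 80)  # ceiling division by 80 GB per H100

print(budget)    # {'weights': 234, 'gradients': 234, 'adam_moments': 936}
print(total)     # 1404
print(min_gpus)  # 18
```

Eighteen GPUs is what the static state alone demands; activations and communication buffers account for much of the gap up to the 32-GPU figure.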

For most use cases this is not the appropriate path. Beyond the cost, full fine-tuning risks eroding the post-training calibration and safety tuning baked into the released checkpoint. The pragmatic path for GPT-OSS-120B is LoRA with rank 32, with the adapter applied to attention and MoE expert gate projections only.

Setup	Combined VRAM	What it can train
Single H100 QLoRA	80 GB	Up to ~70B with QLoRA; Qwen3.6-27B comfortably
Single H200 QLoRA	141 GB	Up to ~120B with QLoRA; comfortable 70B LoRA
2× H200 LoRA	282 GB	Full LoRA on Qwen3.5-122B-A10B with FSDP2
8× H100 LoRA	640 GB	LoRA on any model up to ~200B with sharding
8× H100 full FT	640 GB	Full FT up to ~70B with FSDP2 + activation checkpointing
32× H100 multi-node	2,560 GB	Full FT on 120B+ MoE; small pretraining runs

Across all three recipes, the choice of optimiser matters more than is commonly appreciated. AdamW with a cosine learning rate schedule and 3 per cent warm-up is the strong default. For LoRA, the learning rate is typically 1e-4 to 2e-4—substantially higher than the 1e-5 to 5e-5 used for full fine-tuning—because LoRA’s adapter layers begin near zero and require larger steps to learn meaningful deltas. Checkpoints should be saved every 1000 steps. Adapter-only (PEFT) checkpoints are preferable to full-model checkpoints; they are approximately one hundred times smaller.
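The shape of a cosine schedule with linear warm-up can be sketched as a plain function. This is a simplified illustration rather than the scheduler used by transformers; the step count of 1000 and the helper name `lr_at` are assumptions for the example.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at a given step: linear warm-up, then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

for step in (0, 15, 30, 515, 1000):
    print(step, f"{lr_at(step, 1000):.2e}")
```

With 1000 steps the peak of 2e-4 is reached at step 30, the rate is half the peak at the midpoint of the decay, and it reaches zero at the final step.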

For systematic optimisation of learning rate and rank, Bayesian hyperparameter optimisation with Gaussian processes is efficient. Random search is acceptable when the additional complexity is not warranted; grid search is almost never worthwhile for LoRA.

Substantive Evaluation

Most fine-tuning evaluation amounts to theatre. The model is trained, training loss decreases, an “evaluation” runs on a sliver of the training set (or the same data slightly shuffled), and the team declares success. The model is then deployed to production, where it underperforms.

Substantive evaluation requires three properties: the evaluation data must not have been observed during training; the evaluation metric must measure the actual task rather than a proxy; and the metric must be reproducible across runs.

For general language understanding and reasoning, the standard benchmarks are MMLU (multi-task language understanding across 57 subjects), HumanEval (function-completion code), GSM8K (grade-school mathematics word problems), and MT-Bench (multi-turn instruction following, judged by a strong LLM). For code-heavy use cases, SWE-bench Verified and Terminal-Bench 2.0 are the current standards.

The community-standard tool is lm-evaluation-harness from EleutherAI, which runs the model against a registered benchmark suite in a reproducible manner:

lm_eval \
  --model hf \
  --model_args pretrained=Qwen/Qwen3.6-27B,peft=out/qwen36-27b-qlora,trust_remote_code=True \
  --tasks mmlu,gsm8k,humaneval \
  --batch_size auto \
  --output_path eval_results/qwen36-qlora.json

The contamination problem is real and frequently neglected. If the training data was scraped from the public web, there is a non-trivial probability that benchmark questions are present. The decontamination check consists of an n-gram (typically 8-gram) overlap test between the training set and each benchmark’s question text, with any matches removed from training. Without this check, evaluation scores are at best an upper bound on true performance, inflated by an unknown amount of contamination.
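A minimal version of the 8-gram overlap test is sketched below. It tokenises on whitespace after lowercasing, which is an assumption made for brevity; production checks usually normalise punctuation as well. The helper names and sample texts are hypothetical.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams in text (empty if text has fewer than n words)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_texts, benchmark_texts, n=8):
    """Drop every training text that shares at least one n-gram with the benchmark."""
    bench = set()
    for question in benchmark_texts:
        bench |= ngrams(question, n)
    return [t for t in train_texts if not (ngrams(t, n) & bench)]

benchmark = ["What is the capital city of the country directly north of Spain"]
train = [
    "Trivia dump: what is the capital city of the country directly north of Spain answer Paris",
    "A PID loop can oscillate when the payload on a cobot joint changes suddenly",
]
clean = decontaminate(train, benchmark)
print(len(train), "->", len(clean))  # 2 -> 1
```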

Beyond standard benchmarks, a domain-specific evaluation set should be held out, constructed from realistic prompts drawn from the actual use case. Benchmark suites measure general capability; a custom evaluation set measures whether the model performs better at the relevant task. The two metrics frequently disagree, and the custom set is the one that ultimately matters.

Tip: Construct the held-out evaluation set before fine-tuning begins, and store it at a separate file path that the training code cannot access. The temptation to inspect and “improve” the evaluation set after a poor run is a silent destroyer of meaningful evaluation.

Deployment

When training is complete, the adapter or full checkpoint resides in a directory and must be served.

The two standard serving stacks in 2026 are vLLM and SGLang. vLLM has the broadest support and is the production default for most teams. SGLang is faster for structured-output workloads (JSON, regex-constrained generation), and its RadixAttention mechanism reuses the KV cache across requests that share a prefix, which benefits repeated-prefix workloads such as RAG and multi-turn chat.

Both implement continuous batching, a serving technique that keeps the GPU saturated by dynamically inserting new requests into the batch as existing requests complete, rather than waiting for the whole batch to finish. The throughput multiplier of continuous batching over static batching is typically a factor of three to five, sometimes more.
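The scheduling difference can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a model of either server: each request occupies one batch slot and produces one token per decode step, and prefill cost and memory limits are ignored. Both function names are hypothetical.

```python
import heapq

def static_batching_steps(lengths, batch_size):
    """Decode steps when every batch must finish completely before the next one starts."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled at once by the next queued request."""
    slots = [0] * batch_size          # step at which each slot next becomes free
    heapq.heapify(slots)
    for length in lengths:
        free_at = heapq.heappop(slots)            # earliest available slot
        heapq.heappush(slots, free_at + length)   # request occupies it until done
    return max(slots)

# Hypothetical output lengths (in tokens) for eight queued requests.
lengths = [10, 200, 15, 20, 180, 12, 25, 30]
print(static_batching_steps(lengths, 4), continuous_batching_steps(lengths, 4))
```

In this toy case the static schedule is held up twice by its longest request, while the continuous schedule is bounded by the single longest request. The size of the gap depends on how widely output lengths vary across requests.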

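For a fine-tuned Qwen3.6-27B served on a single H100, the launch command is as follows. The positional argument must resolve to full model weights (the base model or a merged checkpoint), while --lora-modules names the adapter directory; if out/qwen36-27b-qlora holds only adapter weights, the positional path should be replaced with the base model: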

vllm serve out/qwen36-27b-qlora \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --enable-lora \
  --lora-modules my-adapter=out/qwen36-27b-qlora \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --tensor-parallel-size 1

The serving endpoint exposes an OpenAI-compatible API at http://localhost:8000/v1. On the client side, the official OpenAI SDK works unchanged once base_url points at the local server:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM ignores the key by default
)

response = client.chat.completions.create(
    model="my-adapter",
    messages=[
        {"role": "system", "content": "You are an industrial controls expert."},
        {"role": "user",   "content": "What causes oscillation after a payload change on a cobot joint?"},
    ],
    temperature=0.2,
    max_tokens=512,
)

print(response.choices[0].message.content)

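If the deployment forms part of a larger application, the serving pods may be run on Kubernetes with a GPU-aware scheduler. For tool-augmented workflows, vLLM supports tool calling for Qwen3.6 and GPT-OSS, but automatic tool choice is not active by default: the server must be launched with --enable-auto-tool-choice and a --tool-call-parser matching the model’s tool-call format, such as hermes for Hermes-style JSON output. The parser name differs by model family and should be confirmed in the vLLM documentation. For broader integrations, the Model Context Protocol (MCP) is emerging as the de facto integration standard for tool-using LLM applications.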

Common Pitfalls and Debugging

Most training failures derive from a small set of recurring mistakes. Awareness of these in advance saves substantial debugging time.

Chat template mismatch. Previously noted, but worth repeating because it is the most common silent failure. The training-time template and the inference-time template must be identical. A fully tokenised example with special tokens visible (tokenizer.decode(input_ids, skip_special_tokens=False)) should be printed before beginning any long run.

Out-of-memory mid-training. The loss curve appears acceptable for 5,000 steps, after which a single long sequence in a batch exceeds the activation memory budget. The remedy is to lower max_seq_length, enable packing=True with a sequence cap, or reduce per-device batch size and increase gradient accumulation to compensate.
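Both remedies reduce to simple arithmetic, sketched below. The helpers are hypothetical: `length_cap` picks a sequence cap from the observed token-length distribution (nearest-rank percentile), and `rebalance` returns the gradient-accumulation steps that preserve the effective batch size after the per-device batch is reduced.

```python
import math

def length_cap(token_lengths, percent=99):
    """Smallest observed length that covers `percent` of examples (nearest-rank)."""
    ordered = sorted(token_lengths)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]

def rebalance(per_device_batch, grad_accum, new_per_device_batch):
    """Gradient-accumulation steps that keep per_device_batch * grad_accum constant."""
    effective = per_device_batch * grad_accum
    if effective % new_per_device_batch:
        raise ValueError("new batch size must divide the effective batch size")
    return effective // new_per_device_batch
```

For instance, `rebalance(4, 4, 1)` returns 16: the same 16 sequences contribute to each optimiser step, with a quarter of the activation memory held at any one time.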

Tokenizer drift. The base model has been loaded with one tokenizer revision and inference performed with another, causing the vocabulary or special-token IDs to shift. The tokenizer commit hash should be locked explicitly: AutoTokenizer.from_pretrained(MODEL_ID, revision="abc123def...").

Loss spikes. A large upward jump in loss at a specific step almost always indicates a bad batch—corrupted data, a tokenisation error on a single example, or an unusually long sequence. The data at that step should be inspected. If recurrence is rare, gradient clipping (max_grad_norm=1.0) should be added and training resumed from the last good checkpoint.
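Locating the offending step can be automated from the logged loss values. The following is a minimal sketch, assuming the losses are available as a plain list indexed by step; the window size and threshold factor are illustrative defaults rather than recommended values, and `find_loss_spikes` is a hypothetical helper.

```python
from statistics import median

def find_loss_spikes(losses, window=20, factor=2.0):
    """Return step indices whose loss exceeds factor x the median of the prior window."""
    spikes = []
    for step in range(window, len(losses)):
        baseline = median(losses[step - window:step])
        if losses[step] > factor * baseline:
            spikes.append(step)
    return spikes
```

The median is used instead of the mean so that an earlier spike inside the window does not raise the baseline and mask a later one. The returned indices identify which batches to inspect.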

Evaluation/training distribution mismatch. Training loss is low, while evaluation loss is high and fails to improve. The evaluation set is drawn from a different distribution from the training set. Either the evaluation set should be drawn from the same source as the training data (with a fresh seed split), or the gap should be accepted as a measure of generalisation rather than a training failure.

Gradient explosion. Loss diverges to NaN within a few steps. The learning rate is too high for the task, gradient clipping has been omitted, or the data contain an extreme outlier in numerical features. Training should restart with learning_rate halved and max_grad_norm=1.0.
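What max_grad_norm=1.0 does can be shown without a framework. This pure-Python sketch assumes the usual global-norm formulation, in which all gradients are scaled by one shared factor; gradients are represented as nested lists for illustration, and `clip_by_global_norm` is a hypothetical name.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale every gradient by one factor so the combined L2 norm is at most max_norm.

    grads: list of flat lists of floats, one list per parameter tensor.
    Returns (clipped_grads, norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scales gradients up
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude is bounded. Logging the pre-clipping norm is a cheap way to see divergence coming before the loss reaches NaN.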

MoE-specific: expert collapse. This failure applies only to MoE training (Qwen3.5-122B-A10B, GPT-OSS-120B). The router learns to send every token to one or two experts, and the remainder of the model atrophies. The mitigation is an auxiliary load-balancing loss, which TRL and torchtitan include by default; this should nonetheless be verified as enabled rather than silently overridden by a configuration setting.
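One common form of the auxiliary loss is the Switch-Transformer-style formulation sketched below for top-1 routing. This is an illustration of the idea, not necessarily the exact loss any particular trainer implements; the function name and the list-of-lists input are assumptions.

```python
def load_balancing_loss(router_probs):
    """Auxiliary loss num_experts * sum_i(f_i * P_i) for top-1 routing.

    router_probs: one list per token with that token's softmax probabilities
    over the experts. f_i is the share of tokens whose top expert is i and
    P_i is the mean router probability assigned to expert i.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    token_fraction = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for probs in router_probs:
        top = max(range(num_experts), key=probs.__getitem__)
        token_fraction[top] += 1.0 / num_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    return num_experts * sum(f * p for f, p in zip(token_fraction, mean_prob))
```

The value is 1.0 when tokens and probability mass are spread evenly and rises towards the number of experts as routing collapses onto a single expert. A logged auxiliary loss drifting upward is therefore an early warning sign.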

Caution: Training should always be launched with W&B (or an equivalent) logging enabled, and the loss curve should be reviewed every few hundred steps. Detecting a failure in the first hour costs an hour; detecting it at the twelve-hour evaluation costs a day and the cloud bill.

FAQ

Can these models be fine-tuned on a consumer GPU such as an RTX 4090?

Qwen3.6-27B can be fine-tuned on a 4090 with QLoRA. The 24GB of VRAM on a 4090 is tight but workable with gradient checkpointing, a paged 8-bit optimiser, and a short sequence length (approximately 2048 tokens). Qwen3.5-122B-A10B and GPT-OSS-120B require at least 80GB of VRAM, which corresponds to H100/H200/MI300X-class hardware. The released GPT-OSS-120B can be served (though not trained) on a single 80GB card due to MXFP4 quantisation.
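The arithmetic behind these limits is straightforward. The sketch below counts weight storage only, taking the nominal parameter counts from the model names as an assumption; quantisation constants, adapter weights, optimiser state, activations, and the KV cache all come on top, which is why 24GB is tight rather than comfortable.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

print(weight_memory_gb(27e9, 4))    # 27B frozen base at 4 bits per weight
print(weight_memory_gb(27e9, 16))   # the same weights in bf16
print(weight_memory_gb(122e9, 4))   # 122B total parameters at 4 bits per weight
```

At 4 bits the 27B base occupies roughly 13.5GB, leaving about 10GB of a 4090 for everything else. The 122B model needs roughly 61GB for its weights alone, which already exceeds any 24GB card.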

How much data is actually required?

Less than is commonly expected. For domain adaptation with LoRA or QLoRA, 5,000 to 20,000 high-quality examples are sufficient for most domains. Quality matters considerably more than quantity: a tightly curated 10,000-example set consistently outperforms a noisy 100,000-example set. For format adaptation (teaching the model a new structured output schema), 1,000 to 2,000 examples often suffice.

How does this compare with using a managed API?

The two represent different problem spaces. Managed APIs (OpenAI, Anthropic) excel in convenience and access to the latest models. Self-hosted fine-tuned models excel in cost per million tokens at scale, data sovereignty, custom domain adaptation, and predictable cost (no per-call billing). The crossover point is typically around 100M tokens per month; below this, managed services are usually preferable, and above it, self-hosted is usually cheaper.
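The crossover for a specific deployment can be estimated with one division. This sketch assumes the self-hosted bill is a fixed monthly amount and the API bill is purely per-token; the helper name is hypothetical, and the figures in the usage line are placeholders chosen for easy arithmetic, not quoted prices.

```python
def break_even_tokens_per_month(fixed_monthly_cost, api_price_per_million):
    """Monthly token volume at which a fixed self-hosting bill equals API spend."""
    return fixed_monthly_cost / api_price_per_million * 1_000_000

# Placeholder inputs only: 500 per month of fixed cost against 5 per million tokens.
print(break_even_tokens_per_month(500.0, 5.0))
```

Engineering time, idle GPU hours, and redundancy all raise the fixed side, so the real break-even volume is normally higher than this first estimate.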

What is the quality difference between LoRA and full fine-tuning?

LoRA retains 90 to 95 per cent of full fine-tuning quality across most tasks. QLoRA retains 80 to 90 per cent. The remaining gap is largest on tasks requiring substantial representational shift from the base model—for example, fine-tuning an English-pretrained model to operate fluently in a low-resource language. For typical instruction tuning, code adaptation, or structured-output tasks, the gap is sufficiently small that the cost savings of LoRA dominate.

Should continued pretraining precede instruction tuning?

Only when the domain is genuinely far from the base model’s training distribution—medical literature, legal contracts in a non-English language, or highly specialised scientific notation. For most domains, the base model has sufficient coverage that instruction tuning alone closes the gap. Continued pretraining is expensive and easily mishandled, with the principal risk being catastrophic forgetting of the base model’s general competence.

References

Qwen Team — Qwen3.6-27B announcement (April 22, 2026)
QwenLM — Qwen3.6 official repository
OpenAI — Introducing GPT-OSS (August 2025)
OpenAI — GPT-OSS-120B model card on Hugging Face
OpenAI — openai/gpt-oss GitHub repository

Conclusion

Training open-source LLMs in 2026 is no longer the closed activity it was two years ago. The combination of Apache 2.0 base models with frontier-class reasoning (GPT-OSS-120B approaching o4-mini), QLoRA on a single rented GPU, and serving infrastructure capable of handling thousands of concurrent users on commodity hardware has placed production-grade LLM customisation within reach of any team with a modest budget and a clear use case.

The three anchor models cover the practical range: Qwen3.6-27B for the single-GPU dense workflow, Qwen3.5-122B-A10B for inexpensive MoE serving when multi-GPU capacity is available, and GPT-OSS-120B for single-GPU serving of a frontier-class reasoner enabled by MXFP4. None of these is universally “best”; each addresses different questions about hardware, latency, and quality.

The principal challenge is no longer the technology; it is the data—assembling, deduplicating, formatting, and contamination-checking a dataset that actually teaches the model the intended behaviour. The trainer runs in eight hours. The dataset takes eight weeks. Planning should be adjusted accordingly.
