Home AI/ML GP-Based Hyperparameter Optimization: Bayesian Tuning for ML Models

GP-Based Hyperparameter Optimization: Bayesian Tuning for ML Models

Last updated: May 27, 2026
k
Published April 26, 2026 · Updated May 27, 2026 · 38 min read

Summary

What this post covers: A practitioner’s guide to tuning ML hyperparameters with Gaussian Process Bayesian Optimization, walking through the full BayesOpt pipeline, acquisition functions, search-space design, and four working Python implementations (scikit-optimize, BoTorch, qNEHVI multi-objective, Optuna+BoTorch).

Key insights:

  • GP-based Bayesian optimization typically reaches a good configuration in approximately twenty trials, compared with roughly sixty for random search and millions for grid search. It is the appropriate default whenever each training run requires substantial GPU time.
  • GPs perform well for HPO because they natively model observation noise, quantify uncertainty across the search space, and produce a smooth surrogate that an acquisition function can exploit. This combination accounts for their sample efficiency in low-to-moderate dimensions.
  • The choice of acquisition function matters. Expected Improvement is the safe default, UCB exposes an explicit explore-versus-exploit parameter, and Thompson Sampling or qNEHVI are preferable when parallel batches or multi-objective Pareto fronts are required.
  • Search-space design—log-uniform priors for learning rate, integer dimensions, conditional parameters—frequently determines success more than the choice of optimizer. Combining GP-BO with Hyperband (BOHB) is the practical optimum once tens of GPUs are available.
  • For most teams, the appropriate stack is Optuna with the BoTorch sampler. It handles mixed and conditional spaces, parallelizes effectively, and provides GP-grade sample efficiency without requiring direct BoTorch use.

Main topics: Why Hyperparameter Tuning Is Hard, The HPO Landscape: A Survey of Methods, Why Gaussian Processes Are Effective for HPO, The Full BayesOpt Pipeline for HPO, Acquisition Functions Examined in Detail, Search Space Design, Full Python Implementation, Multi-Fidelity and Parallel HPO, Tools Comparison, Real-World Case Studies, Practical Guide and Pitfalls.

Tuning a ten-hyperparameter neural network by grid search with five values per dimension requires 9.7 million experiments. Random search reaches a comparable configuration in approximately sixty trials. Gaussian Process Bayesian Optimization typically requires twenty. The level of accuracy is the same; the compute requirement is reduced by a factor of roughly half a million.

This gap explains why GP-based hyperparameter optimization moved from an academic curiosity to the production default at Google, Meta, and OpenAI. When a single training run requires hours and costs hundreds of dollars in GPU time, grid search is economically infeasible. Random search is unreliable because it cannot incorporate the knowledge accumulated from previous trials. The optimizer must reason between trials, selecting the next configuration in light of every prior one.

Gaussian Processes provide the mathematical machinery that makes this possible. A GP fits a smooth surrogate to the validation-loss landscape, quantifies its own uncertainty across the search space, and an acquisition function converts that uncertainty into a principled decision about where to evaluate next.

This post is a practitioner guide. It does not re-derive GP regression; for the underlying mathematics covering kernels, posterior inference, and marginal likelihood, the Gaussian Process fundamentals post with Python and GPyTorch is the appropriate reference. The focus here is the applied question: how to tune XGBoost, a CNN, or a transformer using GP-based Bayesian optimization in production.

The remainder of the article presents four working code examples (scikit-optimize on XGBoost, BoTorch on a CNN, multi-objective BO with qNEHVI, and Optuna with the BoTorch sampler), a discussion of common acquisition functions, three accompanying diagrams, and a considered recommendation regarding tools.

Why Hyperparameter Tuning Is Hard

Before considering the merits of GPs, it is useful to acknowledge what makes HPO genuinely difficult, since this difficulty is what justifies the additional machinery of Bayesian optimization.

The Combinatorial Explosion

A typical modern machine-learning model has between ten and thirty tunable hyperparameters. A baseline XGBoost has ten to fifteen (learning rate, max depth, n_estimators, subsample, colsample_bytree, min_child_weight, gamma, reg_alpha, reg_lambda, scale_pos_weight, and others). A vision transformer has more (depth, width, heads, MLP ratio, patch size, dropout, attention dropout, learning rate, weight decay, warmup, label smoothing, mixup alpha, drop path, EMA, and similar).

Grid-searching ten hyperparameters with five values each requires 510 ≈ 9.77 million configurations. At thirty minutes per training run on a single GPU, this amounts to 5,580 GPU-years. Even with substantial parallelism, the approach is infeasible.

Non-Trivial Interactions

Hyperparameters are not independent. The optimal learning rate depends on batch size (the linear scaling rule), on the optimizer (Adam versus SGD), on weight initialization, and on architectural depth. Grid search assumes that hyperparameters can be examined one at a time, which is incorrect.

Random search handles this better because it samples jointly and therefore observes interactions. It nevertheless wastes compute on unpromising regions because it has no memory between trials.

Each Evaluation Is Expensive

Training a single configuration can take from minutes for a small XGBoost model to days for a large language model fine-tune. When each evaluation costs $50 to $500 in cloud GPU time, sample efficiency moves from an academic preference to a budgetary necessity.

Noise

The same hyperparameters produce different validation losses across random seeds. Variance arising from data shuffling, dropout randomness, weight initialization, and stochastic optimization means that every observation is noisy. A naive optimizer interprets this noise as signal. GPs handle observation noise natively through the kernel, which is a built-in advantage.

Mixed Types and Conditional Spaces

Real search spaces include continuous parameters such as the learning rate, integers such as max depth and the number of layers, categoricals such as activation function and optimizer choice, and conditional dimensions: the dropout rate matters only if dropout is enabled, and momentum matters only for SGD, not Adam. Standard GPs assume continuous Euclidean inputs, so this is a substantive engineering challenge that the search-space section addresses.

Key Takeaway: HPO is difficult because the search space is substantial and irregularly shaped, evaluations are expensive, observations are noisy, and no gradient is available. Each of these properties points away from grid search and toward a sample-efficient, model-based optimizer, which is precisely the role of GP-based Bayesian optimization.

The HPO Landscape: A Survey of Methods

Before focusing on GPs, the practical taxonomy of methods encountered in real applications is summarized below.

Grid search evaluates the Cartesian product of values for each hyperparameter. It is easy to implement and easy to parallelize, but markedly inefficient. The approach breaks down beyond four or five hyperparameters because of the curse of dimensionality. It is appropriate only for very small problems or the final pinning of two or three parameters.

Random search samples uniformly from the search space. Bergstra and Bengio (2012) demonstrated that it outperforms grid search because most hyperparameters do not matter equally; random search effectively projects onto the important axes. It is the baseline that every other method should be able to surpass. A method that cannot exceed random search is not functioning correctly.

Evolutionary and Genetic Algorithms

An evolutionary algorithm maintains a population of configurations, applies mutation and recombination, and selects the fittest. The method parallelizes well, requires no gradient, and handles unusual search spaces. Sample efficiency is moderate—better than random search but usually worse than Bayesian optimization. The approach is used extensively in neural architecture search (Regularized Evolution and AmoebaNet, for example). For a more detailed exposition, see the genetic algorithm Python implementation guide.

Bandit-Based Methods: Hyperband and ASHA

Bandit-based methods frame HPO as a multi-armed bandit problem. Many configurations are run for a small budget, the worst are eliminated, the budget for survivors is doubled, and the process repeats. Successive Halving is the core idea. Hyperband sweeps over different initial budgets to hedge against poor fidelity choices, and ASHA is the asynchronous variant that scales to substantial parallelism. These are multi-fidelity methods that use cheap proxies, such as early epochs, to filter more expensive trials.

Bayesian Optimization with GPs

This method fits a GP surrogate to pairs of (hyperparameter, validation_loss) values and uses an acquisition function to select the next trial. It is sample-efficient, provides principled uncertainty quantification, and is theoretically well grounded. It is the focus of this post.

TPE (Tree-Structured Parzen Estimator)

TPE is a Bayesian optimization method with a different surrogate. Rather than a GP, it models two densities, p(x | y < threshold) and p(x | y ≥ threshold), and selects x to maximize their ratio. TPE handles conditional spaces natively, scales well to higher dimensions, and underpins the default samplers in Optuna and HyperOpt. It is less sample-efficient than GP-based BO in low dimensions but more flexible in high dimensions and with mixed types.

A Hybrid Method: BOHB

Falkner et al. (2018) combined Bayesian Optimization (with TPE) and Hyperband. The combination yields the compute efficiency of Hyperband through early stopping and the informed sampling of BO in place of random sampling within rungs. BOHB is frequently the appropriate default for deep-learning HPO when tens of GPUs are available.

HPO Method Sample Efficiency on 10-D Problem Approximate trials needed to reach a good configuration (lower is better) Trials (log scale) 10 100 1k 10k 100k 1M+ Grid Search ~9.7M (off-chart) Random Search ~100 Genetic Algorithm ~80 TPE (Optuna) ~40 GP-BO (BoTorch) ~25 BOHB (multi-fid.) ~15 Winners: GP-BO (low-D) BOHB (deep nets) Note: BOHB advantage assumes you can early-stop confidently from partial training curves.

Quick Decision: When to Use What

Method Sample Efficiency Parallelism Complexity Categorical Support Best For
Grid Search Very low Trivial Trivial Native ≤3 hyperparams, final pinning
Random Search Low Trivial Trivial Native Baseline, exploration phase
Genetic Algorithm Medium Excellent Medium Native NAS, irregular spaces
Hyperband / ASHA Medium Excellent Medium Native Big compute, slow training
TPE High Good Medium Native, conditional Mixed types, conditional spaces
GP-BO Highest Good (qEI/Thompson) High Custom kernels needed ≤20 dims, expensive evals
BOHB Highest Excellent High Native (TPE-based) Deep learning at scale

 

Why Gaussian Processes Are Effective for HPO

For the majority of real HPO problems—those with fewer than twenty dimensions, expensive evaluations, and largely continuous parameters—GP-based BO is the strongest method on every published benchmark. The reasons are as follows.

Sample Efficiency Is Paramount

When each evaluation requires hours of GPU time, the few seconds of overhead associated with fitting a GP are inconsequential. The objective is to make every trial count. GPs use the full information of every prior observation when selecting the next one. Random search discards that information.

Principled Uncertainty

A GP does not merely predict the loss; it predicts the loss and a confidence interval. This capability enables intelligent exploration. The GP identifies the regions in which it is uncertain, and the acquisition function exploits this information. Without a probabilistic surrogate, “exploration” reduces to random sampling.

Smooth Surrogate for a Smooth Landscape

Hyperparameter loss landscapes are typically smooth, particularly in log-space coordinates such as learning rate and weight decay. The Matérn 5/2 kernel is a near-perfect inductive bias for this property. GPs interpolate cleanly between observations and provide a credible map of the search space after just ten to twenty trials.

Calibrated Exploration and Exploitation

Acquisition functions such as Expected Improvement automatically balance exploitation (sampling where the model predicts high quality) with exploration (sampling where the model is uncertain). The trade-off emerges from the mathematics rather than from a hand-tuned epsilon-greedy mechanism.

Effective Range: at Most Approximately Twenty Dimensions

GPs become unwieldy beyond approximately twenty dimensions because the kernel struggles to model meaningful similarity in high-dimensional Euclidean space. Fortunately, the vast majority of HPO problems fall within this regime. For higher dimensions, the discussion of TuRBO and random embeddings applies.

Tip: If the search space has fewer than twenty dimensions, a few seconds of GP-fitting overhead per trial is tolerable, and each trial is expensive (more than a minute), GP-based BO is almost always the appropriate choice. The principal exceptions are extreme parallelism (use Thompson sampling), conditional spaces (use TPE), and genuinely high-dimensional problems (use TuRBO).

The Full BayesOpt Pipeline for HPO

The operation of GP-based Bayesian optimization is described step by step below. The loop is the one implemented in BoTorch, scikit-optimize, and Optuna’s GP sampler.

Step 1: Define the Search Space

Specify the bounds and type of each hyperparameter, choosing among continuous (with optional log scale), integer, and categorical. This step is responsible for most production errors: bounds set too tight miss the optimum, bounds set too wide waste trials in poor regions, and incorrect scales (linear rather than log for the learning rate, for example) degrade the optimizer.

Step 2: Initial Random Trials

Five to ten random configurations should be run to seed the GP. Without these observations the GP has no signal, and the acquisition function repeatedly selects the geometric center of the search box. A common rule of thumb is n_init = max(5, 2 · d), where d is the search-space dimension.

Step 3: Fit the GP Surrogate

Given observations (x1, y1), …, (xn, yn), fit a GP with a Matérn 5/2 kernel, which is the standard default for HPO. Optimize the kernel hyperparameters (lengthscales, signal variance, noise) by maximizing the marginal likelihood. This takes seconds for n < 1000.

Step 4: Optimize the Acquisition Function

The acquisition function α(x) takes the GP posterior and returns a scalar that expresses the value of evaluating at x. Maximize α(x) over the search space using L-BFGS, multi-start methods, or random sampling for non-smooth cases. The argmax is the next trial.

Step 5: Run the Trial

Train the model with the proposed hyperparameters and record (xn+1, yn+1).

Step 6: Update and Repeat

Append the new observation, refit the GP, optimize the acquisition function again, and propose the next trial. The loop continues until the budget is exhausted.

BayesOpt Iterations: GP Posterior + Acquisition Function Iteration 5: wide uncertainty, exploring next x Iteration 10: narrowing on promising area next x Iteration 20: converged on local minimum explore Iteration 30: confirmed global optimum ★ optimum acquisition flat—converged GP mean GP ±2σ observed trial next trial acquisition α(x)

Caution: A trade-off seldom mentioned: GP fitting combined with acquisition optimization introduces one to ten seconds of overhead per trial. When each trial completes in five seconds, as for a small model on a small dataset, this overhead dominates and BO underperforms random search. BO is advantageous specifically when each trial requires minutes to days. Applying BO to a scikit-learn linear regression is therefore inappropriate.

Acquisition Functions Examined in Detail

The acquisition function is the mechanism by which exploration is balanced with exploitation. The choice of acquisition function matters less than is sometimes claimed; Expected Improvement is appropriate in roughly 90 percent of cases. Nonetheless, an understanding of the alternatives is helpful when diagnosing problems.

Expected Improvement (EI)

EI(x) = E[max(0, fbest − f(x))], that is, the expected improvement over the current best. For a Gaussian posterior with mean μ(x) and standard deviation σ(x), the expression has a closed form.

EI(x) = (fbest − μ(x)) · Φ(z) + σ(x) · φ(z), where z = (fbest − μ(x)) / σ(x).

Φ denotes the standard normal CDF and φ the PDF. The expression is smooth, differentiable, and well-behaved. EI is the default choice. It exhibits a slight bias toward exploitation, but in practice it explores adequately because σ(x) is large in unexplored regions.

Upper Confidence Bound (UCB)

UCB(x) = μ(x) − β · σ(x) for minimization, with sign flipped for maximization. The coefficient β explicitly controls the level of exploration: larger values produce more exploration. Theoretical regret bounds (Srinivas et al., 2010) establish that, with βt growing logarithmically, UCB has sublinear cumulative regret. In practice, β = 2 is a reasonable default. UCB is more aggressive about exploration than EI when σ is large.

Probability of Improvement (PI)

PI(x) = P(f(x) < fbest) = Φ(z), which is simply the probability of any improvement over the current best. PI is purely greedy: it selects any small improvement and can stagnate by exploiting near the current best indefinitely. It is rarely used in modern HPO except as a pedagogical example.

Thompson Sampling

Thompson sampling draws a function from the GP posterior and takes its argmin. The method is naturally diverse, since independent posterior samples select different points. Its principal advantage is trivial parallelization: for batch HPO of size k, k posterior samples can be drawn and their argmins evaluated simultaneously. It is widely used in production systems with many parallel workers.

Knowledge Gradient (KG)

EI is myopic: it considers only the immediate improvement. KG looks one step ahead and computes the expected best after an observation at x updates the GP. KG is more principled but also more expensive because it requires nested optimization. Empirically, it offers an improvement of roughly 10 to 20 percent for noisy problems. BoTorch’s qKnowledgeGradient is the standard implementation.

Max-Value Entropy Search (MES)

MES is an information-theoretic method: it selects x to maximize mutual information about the location of the optimum. The method is robust to noise and handles batches well, but it is more complex to implement (Wang and Jegelka, 2017). It is available as qMaxValueEntropy in BoTorch.

Acquisition Formula Intuition Strength Weakness When to Use
EI Expected gain over best so far Closed-form, balanced Slight exploitation bias Default—start here
UCB μ − β·σ Tunable exploration, regret bounds Need to set β When EI underexplores
PI Probability of any improvement Simplest Stagnates, no exploration Almost never
Thompson argmin of posterior sample Trivial parallelization Higher variance Batch / parallel HPO
KG Look-ahead expected best Robust to noise Expensive to compute Very noisy objectives
MES Mutual info about optimum Strong batch behavior Implementation complexity Research / best-of-best

 

Search Space Design

This is the most underappreciated aspect of HPO. A GP can only optimize what is specified, and most HPO failures can be traced to a poorly defined search space.

Log Scale for Multiplicative Parameters

Learning rates, weight decay, and regularization coefficients have a fundamentally multiplicative effect: moving from 1e-3 to 1e-4 is comparable in magnitude to moving from 1e-4 to 1e-5. Log-uniform sampling is appropriate, and bounds of 1e-5 to 1e-1 are typical for the learning rate.

Linear Scale for Additive Parameters

Layer sizes, the number of estimators, batch size, and the number of layers have additive and roughly linear effects.

Integer Handling

Most BO libraries treat integers as continuous and round at evaluation time. This works but creates plateaus in the objective. BoTorch’s OneHotToNumeric and Round input transforms handle the case cleanly. Optuna and scikit-optimize handle rounding automatically once the parameter is declared as integer.

Categorical Handling

Three approaches are available: (1) one-hot encode and treat as continuous, which functions adequately but incurs a slight efficiency loss; (2) use a custom kernel such as the categorical Hamming kernel, which is cleaner; or (3) use TPE, which handles categoricals natively. BoTorch’s MixedSingleTaskGP supports mixed continuous-categorical spaces.

Conditional Spaces

A dropout rate is meaningful only when dropout is enabled, and momentum is relevant only for SGD, not for Adam. TPE handles such structure natively and learns the conditional relationships. GP-based BO requires custom handling. The typical approach is to flatten to the union of possibilities and rely on the optimizer to learn that certain dimensions are irrelevant. For deeply conditional spaces, TPE is often preferable.

Hyperparameter Type Recommended Representation Typical Range
Learning rate Log-uniform continuous 1e-5 to 1e-1
Weight decay / L2 Log-uniform continuous 1e-6 to 1e-2
Dropout rate Linear continuous 0.0 to 0.5
Hidden size / width Log-uniform integer 32 to 1024
Number of layers Linear integer 2 to 12
Batch size Log-uniform integer (powers of 2) 8 to 512
Optimizer choice Categorical {Adam, SGD, AdamW, RMSprop}
Activation Categorical {ReLU, GELU, SiLU, Mish}
XGBoost max_depth Linear integer 3 to 12
XGBoost subsample Linear continuous 0.5 to 1.0

 

Caution: GPs extrapolate poorly outside their training data. If the best hyperparameter value lies on the boundary of the search space, this is a strong signal that the bounds were set too tight. The bounds should be widened and the optimization rerun.

Full Python Implementation

Four working examples are presented in order of increasing complexity. Any of them can serve as a starting template for a particular HPO task.

Example 1: Tuning XGBoost with scikit-optimize

scikit-optimize is the gentlest entry point: pip-installable, with a scikit-learn-style API and GP-based defaults. It is well suited to tabular machine learning.

"""
GP-BO for XGBoost using scikit-optimize.
pip install scikit-optimize xgboost scikit-learn matplotlib
"""
import numpy as np
from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.utils import use_named_args
from skopt.plots import plot_convergence, plot_objective
from sklearn.datasets import fetch_openml
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
import matplotlib.pyplot as plt

# Load a real tabular dataset
data = fetch_openml("adult", version=2, as_frame=True)
X = data.data.select_dtypes(include=[np.number]).fillna(0).values
y = (data.target == ">50K").astype(int).values

# Define the search space
space = [
    Real(1e-3, 0.3, prior="log-uniform", name="learning_rate"),
    Integer(3, 12, name="max_depth"),
    Integer(50, 500, name="n_estimators"),
    Real(0.5, 1.0, name="subsample"),
    Real(0.5, 1.0, name="colsample_bytree"),
    Real(1e-6, 1.0, prior="log-uniform", name="reg_alpha"),
    Real(1e-6, 1.0, prior="log-uniform", name="reg_lambda"),
    Real(0.0, 5.0, name="gamma"),
]

@use_named_args(space)
def objective(**params):
    """We minimize negative ROC AUC (skopt minimizes)."""
    clf = XGBClassifier(
        **params,
        tree_method="hist",
        eval_metric="logloss",
        n_jobs=-1,
        random_state=42,
        verbosity=0,
    )
    score = cross_val_score(
        clf, X, y, cv=3, scoring="roc_auc", n_jobs=1
    ).mean()
    return -score

# Run GP-BO with EI acquisition
result = gp_minimize(
    objective,
    space,
    n_calls=50,            # total trials
    n_initial_points=10,   # random seed trials
    acq_func="EI",         # Expected Improvement
    random_state=42,
    verbose=True,
)

print(f"Best AUC: {-result.fun:.4f}")
print("Best hyperparameters:")
for name, val in zip([s.name for s in space], result.x):
    print(f"  {name}: {val}")

# Diagnostics
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
plot_convergence(result, ax=axes[0])
axes[0].set_title("Convergence")
plot_objective(result, ax=axes[1] if False else None)  # separate fig
plt.tight_layout()
plt.savefig("xgb_bo_convergence.png", dpi=120)

The procedure runs ten random seed trials followed by forty GP-guided trials using Expected Improvement. The plot_convergence function displays the running best score against the trial number, the canonical visualization showing that BO outperforms random search. The plot_objective function displays partial-dependence plots for each hyperparameter and reveals which dimensions actually mattered.

On the Adult dataset with fifty trials, GP-BO typically improves on the fifty-trial best from random search by 0.5 to 1.5 percent AUC. The gain is modest in isolation but valuable because it requires no additional trial budget and is reproducible.

Example 2: Tuning a PyTorch CNN with BoTorch

BoTorch is the appropriate next step once scikit-optimize becomes restrictive. It is PyTorch-native, GPU-accelerated, and built on GPyTorch (the same library used in the GP fundamentals post). For research and production deep-learning HPO, it is the established standard.

"""
GP-BO for a PyTorch CNN using BoTorch.
pip install botorch gpytorch torch torchvision
"""
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import qExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

device = "cuda" if torch.cuda.is_available() else "cpu"

# Search space: [log_lr, log_wd, dropout, log_hidden]
# Bounds in normalized space [0,1] mapped to actual ranges below.
BOUNDS = torch.tensor(
    [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]],
    device=device, dtype=torch.double,
)

def unnormalize(x):
    """Map [0,1]^4 to actual hyperparameter ranges."""
    log_lr   = -5.0 + x[..., 0] * 3.0   # 1e-5 to 1e-2
    log_wd   = -6.0 + x[..., 1] * 4.0   # 1e-6 to 1e-2
    dropout  = x[..., 2] * 0.5          # 0 to 0.5
    log_hidden = 5.0 + x[..., 3] * 4.0  # 32 to 512 (log2)
    return {
        "lr": float(10 ** log_lr),
        "wd": float(10 ** log_wd),
        "dropout": float(dropout),
        "hidden": int(2 ** round(log_hidden.item())),
    }

class SmallCNN(nn.Module):
    def __init__(self, hidden, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, hidden), nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 10),
        )
    def forward(self, x):
        return self.net(x)

# Load FashionMNIST (small enough to iterate quickly)
tfm = transforms.Compose([transforms.ToTensor()])
train_ds = datasets.FashionMNIST("./data", train=True, download=True, transform=tfm)
val_ds = datasets.FashionMNIST("./data", train=False, download=True, transform=tfm)
train_loader = DataLoader(train_ds, batch_size=256, shuffle=True, num_workers=2)
val_loader = DataLoader(val_ds, batch_size=512, num_workers=2)

def train_eval(params, epochs=3):
    """Train CNN with given hyperparams, return validation accuracy."""
    model = SmallCNN(params["hidden"], params["dropout"]).to(device)
    opt = optim.AdamW(model.parameters(), lr=params["lr"], weight_decay=params["wd"])
    crit = nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            opt.zero_grad()
            crit(model(xb), yb).backward()
            opt.step()
    # Evaluate
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            preds = model(xb).argmax(1)
            correct += (preds == yb).sum().item()
            total += yb.size(0)
    return correct / total

# Initial random trials
N_INIT = 8
torch.manual_seed(0)
X_obs = torch.rand(N_INIT, 4, device=device, dtype=torch.double)
Y_obs = torch.tensor(
    [[train_eval(unnormalize(x))] for x in X_obs],
    device=device, dtype=torch.double,
)
print(f"Init complete. Best so far: {Y_obs.max().item():.4f}")

# BO loop
N_BO_ITERS = 20
for it in range(N_BO_ITERS):
    # Fit GP (BoTorch handles standardization, kernel, MLL)
    gp = SingleTaskGP(X_obs, Y_obs)
    mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
    fit_gpytorch_mll(mll)

    # qEI acquisition (q=1 for sequential)
    acq = qExpectedImprovement(model=gp, best_f=Y_obs.max())
    candidate, _ = optimize_acqf(
        acq_function=acq,
        bounds=BOUNDS,
        q=1,
        num_restarts=10,
        raw_samples=512,
    )
    # Evaluate candidate
    new_y = train_eval(unnormalize(candidate.squeeze(0)))
    X_obs = torch.cat([X_obs, candidate], dim=0)
    Y_obs = torch.cat([Y_obs, torch.tensor([[new_y]], device=device, dtype=torch.double)], dim=0)
    print(f"Iter {it+1}: y={new_y:.4f} | best={Y_obs.max().item():.4f}")

best_idx = Y_obs.argmax()
print("\nBest hyperparameters:")
print(unnormalize(X_obs[best_idx]))

Several details merit note.

  • The implementation operates in normalized [0,1]d space and unnormalizes before training. BoTorch strongly prefers normalized inputs.
  • BoTorch’s SingleTaskGP uses a Matérn 5/2 kernel by default with automatic relevance determination, which learns per-dimension lengthscales.
  • optimize_acqf uses ten multi-start L-BFGS optimizations with 512 random initial points to find the global optimum of the acquisition function.
  • The loop executes twenty-eight trials in total (eight random plus twenty BO). On a single GPU with three-epoch FashionMNIST, this takes approximately thirty minutes.

Example 3: Multi-Objective BO with qNEHVI

Real-world deployment depends on more than accuracy: latency and memory also matter. Multi-objective BO produces the entire Pareto frontier between competing objectives.

"""
Multi-objective HPO: maximize accuracy AND minimize latency.
Returns the Pareto frontier instead of a single best.
"""
import time
import torch
from botorch.models import SingleTaskGP, ModelListGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition.multi_objective.monte_carlo import qNoisyExpectedHypervolumeImprovement
from botorch.optim import optimize_acqf
from botorch.utils.multi_objective.box_decompositions.dominated import DominatedPartitioning
from gpytorch.mlls import ExactMarginalLogLikelihood

device = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.double

# Search space: same 4-dim CNN tuning problem
BOUNDS = torch.tensor([[0.0]*4, [1.0]*4], device=device, dtype=DTYPE)

# Two objectives: accuracy (maximize) and -latency_ms (maximize, since BoTorch maximizes)
REF_POINT = torch.tensor([0.5, -200.0], device=device, dtype=DTYPE)  # worst-case bounds

def objective_2d(x_norm):
    """Returns [accuracy, -latency_ms]."""
    params = unnormalize(x_norm)  # reuse from Example 2
    acc = train_eval(params, epochs=3)
    # Measure latency on a batch
    model = SmallCNN(params["hidden"], params["dropout"]).to(device).eval()
    dummy = torch.randn(64, 1, 28, 28, device=device)
    # Warm up
    with torch.no_grad():
        _ = model(dummy)
    torch.cuda.synchronize() if device == "cuda" else None
    t0 = time.perf_counter()
    with torch.no_grad():
        for _ in range(20):
            _ = model(dummy)
    torch.cuda.synchronize() if device == "cuda" else None
    latency_ms = (time.perf_counter() - t0) * 1000 / 20
    return torch.tensor([acc, -latency_ms], device=device, dtype=DTYPE)

# Initial design
N_INIT = 10
torch.manual_seed(0)
X_obs = torch.rand(N_INIT, 4, device=device, dtype=DTYPE)
Y_obs = torch.stack([objective_2d(x) for x in X_obs])

# Multi-objective BO loop
for it in range(20):
    # Fit independent GPs for each objective
    models = [SingleTaskGP(X_obs, Y_obs[:, i:i+1]) for i in range(2)]
    model_list = ModelListGP(*models)
    for m in models:
        mll = ExactMarginalLogLikelihood(m.likelihood, m)
        fit_gpytorch_mll(mll)

    # qNEHVI acquisition
    acq = qNoisyExpectedHypervolumeImprovement(
        model=model_list,
        ref_point=REF_POINT,
        X_baseline=X_obs,
        prune_baseline=True,
    )
    candidate, _ = optimize_acqf(
        acq_function=acq, bounds=BOUNDS,
        q=2, num_restarts=10, raw_samples=512,
    )
    new_y = torch.stack([objective_2d(x) for x in candidate])
    X_obs = torch.cat([X_obs, candidate])
    Y_obs = torch.cat([Y_obs, new_y])
    # Compute hypervolume
    hv = DominatedPartitioning(ref_point=REF_POINT, Y=Y_obs).compute_hypervolume()
    print(f"Iter {it+1}: HV={hv.item():.3f} | n_obs={len(X_obs)}")

# Extract Pareto frontier
from botorch.utils.multi_objective.pareto import is_non_dominated
mask = is_non_dominated(Y_obs)
pareto = Y_obs[mask]
print(f"\nPareto frontier: {len(pareto)} points")
for acc, neg_lat in pareto.cpu().numpy():
    print(f"  acc={acc:.4f}, latency={-neg_lat:.2f}ms")

The output is not a single best configuration but a frontier of Pareto-optimal configurations. For each point on this frontier, accuracy cannot be improved without sacrificing latency, and vice versa. The hypervolume metric quantifies the size of the dominated region; larger values are better.

Example 4: Optuna with BoTorch Sampler

Optuna is the most widely adopted HPO library, and an underappreciated feature is that its default TPE sampler can be replaced with a GP-based BoTorch sampler in a single line of code.

"""
Optuna with GP (BoTorch) sampler vs default TPE.
pip install optuna botorch
"""
import optuna
from optuna.samplers import TPESampler
from optuna.integration import BoTorchSampler
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
import numpy as np

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-6, 1.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-6, 1.0, log=True),
    }
    clf = xgb.XGBClassifier(
        **params, tree_method="hist", eval_metric="logloss",
        n_jobs=-1, random_state=42, verbosity=0,
    )
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# A: TPE sampler (Optuna default)
study_tpe = optuna.create_study(
    direction="maximize",
    sampler=TPESampler(seed=42, n_startup_trials=10),
)
study_tpe.optimize(objective, n_trials=50, show_progress_bar=True)

# B: BoTorch (GP) sampler
study_gp = optuna.create_study(
    direction="maximize",
    sampler=BoTorchSampler(n_startup_trials=10, seed=42),
)
study_gp.optimize(objective, n_trials=50, show_progress_bar=True)

print(f"TPE best AUC: {study_tpe.best_value:.4f}")
print(f"GP-BO best AUC: {study_gp.best_value:.4f}")

# Visualize convergence
import matplotlib.pyplot as plt
def running_best(trials):
    vals = [t.value for t in trials]
    return np.maximum.accumulate(vals)

plt.figure(figsize=(10, 5))
plt.plot(running_best(study_tpe.trials), label="TPE", linewidth=2)
plt.plot(running_best(study_gp.trials), label="GP-BO (BoTorch)", linewidth=2)
plt.xlabel("Trial")
plt.ylabel("Best AUC so far")
plt.legend()
plt.grid(alpha=0.3)
plt.title("TPE vs GP-BO convergence")
plt.savefig("tpe_vs_gp.png", dpi=120, bbox_inches="tight")

Empirically, for smaller search spaces (no more than ten dimensions) and noisy objectives, GP-BO converges faster than TPE in trial count. For larger spaces or those with conditional dimensions, TPE closes the gap. The principal benefit of Optuna is the framework: pruning, distributed trials, a web dashboard, and straightforward sampler substitution.

Tip: For an end-to-end HPO orchestration pipeline that queues trials, distributes them to workers, and persists results, Optuna pairs naturally with Apache Airflow. Each Airflow task corresponds to one trial, and the study state lives in a shared database.

Multi-Fidelity and Parallel HPO

A key fact of modern deep learning is that full training is expensive while partial training is informative. A 100-epoch run is ten times more expensive than a 10-epoch run, yet the 10-epoch result correlates strongly with the 100-epoch outcome. Multi-fidelity HPO exploits this relationship.

BOHB (Falkner et al., 2018)

BOHB combines Hyperband (early stopping based on partial training curves) with BO (informed sampling rather than random). Hyperband decides when to terminate a trial; BO decides which configurations to try at each rung. Empirically the combination outperforms either method alone for deep-learning HPO.

BOHB uses TPE rather than a GP for the BO component because the sampling-based density model handles the high-dimensional, conditional spaces of neural-network architectures well. GP variants exist (Falkner discusses the trade-offs), but TPE is the default.

Multi-Fidelity BO (MFBO)

MFBO adds fidelity, such as training epochs or dataset fraction, as an additional dimension in the GP. The GP learns the relationship between fidelity and final performance, and the acquisition function selects both x and a fidelity, balancing information gain against compute cost. BoTorch provides qMultiFidelityKnowledgeGradient for this purpose.

Asynchronous BO (Kriging Believer)

For batch parallelism, while a trial is running, its result is fantasized using the GP posterior mean. The hallucinated observation is added to the training set, a temporary GP is fitted, and the next trial is selected on the assumption that the in-flight trial will reach its predicted value. The observation is corrected when the trial finishes. This decouples scheduling from observations and enables many parallel workers without serializing on the GP fit.

Trust Region BO (TuRBO)

Eriksson et al. (2019) proposed TuRBO for high-dimensional HPO (50 or more dimensions). The method maintains a small trust region around the current best, fits a local GP, and optimizes within. The trust region expands when a step succeeds and contracts when it does not. The approach effectively decomposes a high-dimensional problem into many local low-dimensional problems. It is available in BoTorch.

Key Takeaway: With eight or more GPUs and slow training, BOHB typically outperforms vanilla GP-BO. With one GPU and up to twenty hyperparameters, vanilla GP-BO with Expected Improvement offers the best return on investment. With more than fifty hyperparameters, characteristic of neural architecture search, TuRBO or evolutionary methods are appropriate.

Tools Comparison

HPO Tools Landscape X-axis: Simplicity → Research flexibility  |  Y-axis: Sample efficiency Simplicity (left) ⟶ Research flexibility (right) Sample efficiency ⟶ simple API framework research-flexible high med low skopt GP-based Optuna TPE / GP Ax GP-based BoTorch GP-based HyperOpt TPE Ray Tune mixed W&B mixed Vizier GP-based SageMaker managed Backend legend GP-based TPE Mixed Managed cloud

Tool Default Backend Multi-Objective Constraints Conditional Spaces Best For
Optuna TPE (GP via BoTorch) Yes Limited Native Production engineering
Ax GP (BoTorch) Yes (Pareto) Yes Yes Adaptive experimentation
BoTorch GP (PyTorch) Yes Yes Custom Research, custom algorithms
scikit-optimize GP / RF No No No Quickstart, sklearn integration
HyperOpt TPE Limited No Native Mature distributed TPE
Ray Tune Pluggable (BO/TPE/PBT/ASHA) Yes (via Ax) Via backend Via backend Distributed orchestration
W&B Sweeps Bayes / Random / Grid No No Limited Experiment tracking integration
Vertex AI Vizier GP (Google) Yes Yes Yes Managed, GCP-native
SageMaker AMT GP / Hyperband No No Limited Managed, AWS-native

 

Practical Recommendation

For the majority of practitioners and HPO problems, the following guidance is appropriate.

  • Begin with Optuna. The API is the cleanest, the defaults are sensible, the dashboard is effective, and the BoTorch sampler can be substituted when TPE becomes inadequate.
  • Move to Ax when multi-objective optimization with constraints is required, or when a higher-level service-style API for ongoing experimentation is desirable.
  • Use BoTorch directly when implementing custom acquisition functions, conducting research, or requiring fine-grained control over GP fitting through custom kernels, priors, or multi-task models.
  • Use scikit-optimize for one-off tabular machine-learning tuning where simplicity outweighs power.
  • Use Ray Tune when distributed orchestration is the bottleneck and hundreds of workers require scheduling.

Real-World Case Studies

Google Vizier

Vizier is Google’s internal Bayesian Optimization service, used to tune systems ranging from ad models to ranking systems to LLM training pipelines. The original 2017 paper reported thousands of studies per day across the company. The default algorithm is GP-based BO with batched parallel evaluation. Vertex AI Vizier exposes the service externally on Google Cloud Platform.

Meta’s Ax and BoTorch

Meta open-sourced Ax and BoTorch from work on optimizing ranking models. Published results indicate ranking-quality improvements exceeding 40 percent relative to random search, with substantially fewer trials required. The same stack is used to tune hyperparameters in video-encoding research, ad-auction simulators, and infrastructure scheduling.

AlphaGo and AlphaFold

DeepMind has used Bayesian optimization in inner loops for many years. AlphaGo reportedly used GP-based BO to tune MCTS hyperparameters and training schedules. AlphaFold 2’s training pipeline used multi-fidelity BO for architecture-related hyperparameters where each evaluation was prohibitively expensive.

Drug Discovery and Protein Design

Beyond machine-learning hyperparameters, GP-BO is the standard tool for real experimental design: which molecules to synthesize next, which protein variants to screen, and which experimental conditions to test. Each trial requires days of laboratory time and thousands of dollars in reagents, making sample efficiency essential.

Key Takeaway: GP-based BO is not a research curiosity. It runs in production at scale at every major technology company and at most pharmaceutical firms. The supporting tools (BoTorch, Ax, Optuna, Vizier) reflect hundreds of person-years of engineering. Teams that do not use BO for HPO are likely forgoing accuracy gains of 0.5 to 5 percent.

Practical Guide and Pitfalls

Initial Design: Avoid a Cold Start

Five to ten random trials should be run before BO begins. Without seed observations, the GP has no signal and the acquisition function selects the geometric center of the search box. The rule of thumb is n_init = max(5, 2 · d), where d is the search-space dimension.

Parallelize Four to Eight Trials per BO Step

Modern HPO at scale uses batch acquisition functions (qEI, qNEI, qNEHVI) to propose four to eight candidates per BO iteration. This represents an effective compromise: enough parallelism to utilize a multi-GPU node, but not so much that GP information gain saturates within a batch.

Stopping Criteria

  • Trial budget (the most common): for example, running 100 trials. Simple and reproducible.
  • Time budget: for example, running for 24 hours. Useful in production where wall-clock time matters more than trial count.
  • Convergence: stop when the running-best improvement is less than ε for k consecutive trials. This criterion is risky in isolation because BO can stall before identifying the global optimum.
  • Combination: max(trial_budget, no_improvement_for_k_trials). A practical default.

Reproducibility

All random seeds should be set: numpy, torch, the BO library, and the model-training loop. Every (config, score, wallclock, seed) tuple should be logged. The most common way to lose value from HPO is to be unable to reproduce the best configuration. Pairing the optimizer with experiment tracking such as W&B or MLflow is sufficient.

Debugging GP Fits

If BO recommendations appear pathological (clustered in a corner, or oscillating widely), the following checks are appropriate.

  • Lengthscales: whether the lengthscales are reasonable. Very small values indicate that the GP is treating every observation as noise; very large values indicate that it considers the function constant.
  • Output standardization: BoTorch handles standardization internally; some libraries do not. Standardizing y manually when in doubt is prudent.
  • Input normalization: inputs should always be normalized to [0,1]d before being passed to a GP.
  • Noise: if observation noise is too low, refit with a slightly higher noise prior.

High-Dimensional Pitfalls

Beyond approximately twenty dimensions, vanilla GPs degrade. The symptoms are that BO no longer outperforms random search and that GP lengthscales reach the boundary of their allowed range. Possible remedies include TuRBO (trust regions), random embeddings (REMBO), dimensionality reduction by PCA on a random sample, or a switch to evolutionary methods. For further discussion of high-dimensional optimization, see the companion posts on genetic algorithms and mixed-integer programming.

Constrained BO

Infeasible configurations should not consume evaluations. If a model has a memory budget, latency budget, or hardware constraint, the constraint should be modeled as a separate GP and used with a constrained acquisition function such as expected feasible improvement or qNEHVI with constraints in BoTorch. The savings in trial budget can be substantial.

The Cold-Start Problem

When tuning a new but related task, prior trials from similar tasks are typically available. Transferable BO initializes the GP using observations from prior studies (with appropriate weighting), providing an informative prior in place of a cold start. The method is available in Ax (multi-task BO) and in the academic literature.

Trial Replication and Noise

For genuinely noisy objectives such as reinforcement-learning rewards or classification on small datasets, the best candidates should be replicated to reduce noise. The Central Limit Theorem guide covers the underlying mathematics: averaging k noisy observations reduces the standard error by a factor of √k. Allocating 20 percent of the trial budget to replication yields a substantially more reliable best configuration.

Caution: The most common HPO failure mode is not the wrong method but the wrong objective. If the validation loss is not a good proxy for test loss (a small validation set, data leakage, or distribution shift), no optimizer can compensate. The evaluation pipeline should be audited before tuning begins. Cross-validation, held-out validation, and techniques covered in the semi-supervised learning guide matter more than the choice of optimizer.

Frequently Asked Questions

Why is GP-based BO better than random search for HPO?

GP-based BO uses information from prior trials to pick the next one. Random search throws that information away. On benchmark HPO problems with 5–20 hyperparameters, GP-BO typically reaches the same accuracy as random search using 3–10× fewer trials. When each trial costs hours of GPU time, that compounds into significant compute savings—typically 60–90% of the budget.

When does TPE beat GP-based BO?

Three regimes: (1) high-dimensional spaces (30+ hyperparameters) where GPs degrade, (2) heavily conditional spaces (this hyperparameter only exists if that one is true) where TPE handles structure natively, (3) when you need very fast wall-clock per BO iteration because TPE’s sampling is cheaper than GP fitting + acquisition optimization. For most “normal” HPO with ≤20 dims, GP-BO is more sample-efficient.

How many initial random trials should I run before starting BO?

Rule of thumb: n_init = max(5, 2 · d) where d is the search space dimension. For a 4-dimensional space, 8–10 random trials. For 10 dimensions, 20 random trials. Without enough seeds, the GP has no signal and BO collapses to picking the box center repeatedly.

Can GP-BO handle categorical hyperparameters like activation function or optimizer choice?

Yes, three approaches: (1) one-hot encode and treat as continuous (works, slight efficiency loss), (2) use a custom kernel like Hamming distance for categoricals (cleaner, BoTorch’s MixedSingleTaskGP does this), (3) switch to TPE which handles categoricals natively. For 1–2 categorical dimensions, one-hot is fine. For many categoricals, use TPE or a properly mixed kernel.

BoTorch vs Optuna—which should I use?

For most production HPO, start with Optuna: cleaner API, better tooling (dashboard, study persistence, distributed trials), and you can swap in the BoTorch sampler for GP-BO when needed. Use BoTorch directly when you need custom acquisition functions, multi-task GPs, advanced features (qNEHVI, qKG, MES), or are doing research. Many production setups use both: Optuna for orchestration, BoTorch sampler under the hood.

References and Further Reading

  • Bergstra & Bengio (2012). Random Search for Hyper-Parameter Optimization. JMLR. The paper that established random search as the baseline.
  • Frazier (2018). A Tutorial on Bayesian Optimization. arXiv:1807.02811. The clearest intro to BO mathematics.
  • Falkner et al. (2018). BOHB: Robust and Efficient Hyperparameter Optimization at Scale. ICML. The BOHB paper.
  • Eriksson et al. (2019). Scalable Global Optimization via Local Bayesian Optimization. NeurIPS. TuRBO.
  • Wang & Jegelka (2017). Max-value Entropy Search for Efficient Bayesian Optimization. ICML.
  • BoTorch documentation,official docs for Meta’s Bayesian optimization library.
  • Optuna documentation—practical HPO framework with TPE and GP samplers.
  • scikit-optimize documentation—sklearn-style GP and forest-based BO.
  • Ax (Adaptive Experimentation Platform),Meta’s higher-level wrapper around BoTorch.
Related Reading:

You Might Also Like

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *