xPatch Explained: Dual-Stream Time Series Forecasting with EMA Decomposition

Last updated: May 27, 2026
Published May 7, 2026 · Updated May 27, 2026 · 30 min read

PatchTST established the prevailing benchmark for transformer-based time series forecasting. A subsequent paper from KAIST then demonstrated a less comfortable result: a non-transformer model composed of two simple streams, an MLP and a CNN, outperforms PatchTST. xPatch achieves this with approximately one-quarter of the compute and an established idea, namely exponential moving averages.

The paper is xPatch: Dual-Stream Time Series Forecasting with Exponential Seasonal-Trend Decomposition by Artyom Stitsyuk and Jaesik Choi, published at AAAI 2025 (arXiv:2412.17323). It is the type of paper that quietly recalibrates the field. There is no new attention variant, no foundation model with 100 billion parameters, only a careful re-examination of which inductive biases actually contribute to forecasting performance for electricity load, traffic, weather, or stock returns.

This article examines in detail every load-bearing component of the paper: the EMA decomposition, the dual-stream architecture, the arctangent loss, the sigmoid learning-rate schedule, the experimental results, and the implications for practitioners deploying forecasts in production.

Summary

What this post covers: A detailed examination of the AAAI 2025 xPatch paper by Stitsyuk and Choi, including its EMA decomposition, dual-stream MLP and CNN architecture, training methods (arctangent loss, sigmoid learning-rate schedule, RevIN), benchmark results, and the implications for transformer-dominated time-series forecasting.

Key insights:

  • A non-transformer dual-stream model (a linear stream for the trend and a depthwise-separable CNN for the seasonal component) outperforms CARD, the previous state of the art, by an average of 2.46 percent in MSE and 2.34 percent in MAE across eight standard benchmarks, while training approximately 4.8 times faster per step.
  • The appropriate inductive bias (EMA trend-seasonal decomposition combined with patching and dual specialisation) consistently outperforms generic attention for typical multivariate forecasting, echoing the earlier critique advanced by DLinear in “Are Transformers Effective?”
  • Training-side techniques contribute meaningfully to performance. The arctangent loss (a horizon-weighted MAE whose weights decay slowly with the forecast step, so that no part of the forecast window is neglected) and the sigmoid learning-rate schedule also transfer to PatchTST and CARD, suggesting that many architecture comparisons in the literature have employed suboptimal training recipes.
  • The recommended default for the EMA alpha is 0.3 on large benchmarks (Weather, Traffic, Electricity). On smaller or noisier datasets, a sweep over {0.1, 0.3, 0.5, 0.7, 0.9} is appropriate. A smaller alpha produces smoother trends, while a larger alpha produces more reactive trends.
  • xPatch is preferable to PatchTST as a production default unless the application involves heavy channel correlations that benefit from cross-channel attention, or requires a look-back longer than 96 steps. xPatch is faster to train, faster to infer, slightly more accurate, and easier to debug because the two streams are individually interpretable.

Main topics: Why this paper matters, The EMA Decomposition at the Centre of xPatch, The Dual-Stream Architecture, Training Components: Arctangent Loss, Sigmoid Schedule, and RevIN, Benchmark Results, Ablations: What Drives Performance, How to use xPatch (PyTorch sketch), When to use xPatch versus alternatives, Limitations and open questions, Implications for the Field, Frequently asked questions.

Why this paper matters

For approximately three years, time series forecasting has been dominated by transformer-based models. Informer (2021) made attention practical for long sequences. Autoformer (2021) incorporated series decomposition. FEDformer (2022) shifted attention into the frequency domain. PatchTST (2023) adapted the patching technique from Vision Transformers and became the strongest model on a substantial set of benchmarks. iTransformer (2024) inverted the embedding dimension. CARD (2024) refined the channel-aligned attention design.

DLinear, introduced in 2022, raised an awkward question: is attention actually required for forecasting? A near-trivial linear model, consisting of a moving-average decomposition followed by one fully connected layer per component, could match or surpass several transformer variants on standard benchmarks. The community responded with a wave of “are transformers effective” papers, and the consensus that emerged was nuanced: transformers help on some datasets, harm on others, and the gains are often smaller than the speed advantages forgone.

xPatch takes the next logical step. Rather than abandoning the transformer entirely (as DLinear does) or retaining a transformer while refining attention (as CARD and iTransformer do), it constructs a dual-stream non-transformer model with stronger inductive biases. One stream is a simple MLP. The other is a compact depthwise-separable CNN. Combined with EMA-based decomposition and an improved loss function, the result outperforms CARD, the previous state of the art, while training approximately 4.8 times faster per step.

For an overview of the broader landscape in which these models operate, see the companion overview of time series forecasting models in 2026. xPatch is one of the clearest examples of a non-foundation-model approach that continues to deliver competitive performance on real benchmarks.

Key Takeaway: xPatch provides evidence that for typical multivariate forecasting, appropriate inductive biases (decomposition, patching, and dual specialisation) contribute more than attention itself. Architecture is not the only frontier; loss functions and learning-rate schedules also account for a substantial share of observed performance differences.

Figure: xPatch dual-stream architecture. Input X (L × N) → RevIN normalisation → EMA decomposition into X_T (trend) and X_S = X − X_T. Linear stream (X_T): two blocks of FC → AvgPool(k=2) → LN, no activation, projected to T. CNN stream (X_S): patching (P=16, S=8) → depthwise (k=P) → pointwise → GELU → BatchNorm → residual. The two outputs are concatenated, passed through a linear layer, and de-normalised (de-RevIN) to give Ŷ. The linear stream handles the smooth trend; the CNN stream handles bursty seasonal patterns.

The EMA Decomposition at the Centre of xPatch

The single most important point to retain about xPatch is the following: the model’s first operation is to separate every channel of the input series into a slow component and a fast component, and then to model each component with a distinct network. The separation is performed using an exponential moving average.

Why decomposition matters

Trend and seasonality have fundamentally different dynamics. A trend is slow, often nearly linear over short windows, and dominated by accumulating shifts in level. A seasonal component is fast, often locally periodic, and frequently bursty (for example, traffic spikes or weather fronts). If one network is asked to model both at once, it must compromise: smooth filters blur the seasonal spikes, while sharp filters chase the trend’s drift. Decomposition removes that conflict by assigning each component to a specialist.

This is not a new idea. Classical statistics has applied decomposition for decades:

  • STL (Seasonal-Trend decomposition using Loess): local polynomial regression for seasonality extraction.
  • Holt-Winters: three exponential smoothers (level, trend, and seasonal) chained together.
  • X-11 / X-13ARIMA-SEATS: a workhorse of official statistics based on iterative moving averages.

Recent machine-learning approaches retained the spirit of decomposition while employing different tools. DLinear used a simple moving-average filter, and FEDformer projected the series into the frequency domain via Fourier transforms. xPatch adopts a different choice: an exponential moving average.

The recursive formula

The EMA decomposition is defined by Equation 2 of the paper:

s_0 = x_0
s_t = α · x_t + (1 − α) · s_{t−1}    for t > 0

X_T = EMA(X)         (trend)
X_S = X − X_T        (seasonal residual)

The parameter α is the smoothing factor, taking values in (0, 1). A small α (such as 0.1) produces a very smooth trend dominated by older observations, while a large α (such as 0.9) causes the trend to track the most recent value almost immediately. The seasonal stream consists of whatever the trend cannot explain.
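The effect of α can be checked on a toy series. The sketch below is a plain-Python illustration, not the paper's implementation: the helper names `ema_decompose` and `roughness` and the synthetic ramp-plus-wave input are assumptions introduced here.

```python
import math

def ema_decompose(x, alpha):
    """Split a 1-D series into (trend, seasonal) with the EMA recursion."""
    trend = [x[0]]                       # s_0 = x_0
    for value in x[1:]:
        trend.append(alpha * value + (1.0 - alpha) * trend[-1])
    seasonal = [xi - si for xi, si in zip(x, trend)]
    return trend, seasonal

def roughness(series):
    """Total absolute step-to-step change; smaller means smoother."""
    return sum(abs(b - a) for a, b in zip(series, series[1:]))

# Toy input (an assumption for illustration): slow ramp plus a period-8 wave.
x = [0.05 * t + math.sin(2 * math.pi * t / 8) for t in range(96)]

for alpha in (0.1, 0.3, 0.9):
    trend, seasonal = ema_decompose(x, alpha)
    print(f"alpha={alpha}: trend roughness={roughness(trend):.2f}")
```

By construction the trend and the seasonal residual always sum back to the input, so the decomposition loses no information; α only decides how the signal is shared between the two streams.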

The recursion appears computationally expensive, since it is sequential by definition. However, Appendix D of the paper presents a vectorised form that removes the sequential loop. The technique is to expand the recursion into a closed-form weighted sum, s_t = (1 − α)^t · x_0 + Σ_{j=1..t} α · (1 − α)^(t−j) · x_j, and to compute every s_t at once as a single matrix multiplication with a lower-triangular, Toeplitz-style weight matrix. In practice, the EMA pre-processing is essentially free relative to the rest of the forward pass.
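The equivalence between the recursion and a single weighted sum can be verified numerically. The following is a minimal plain-Python sketch under stated assumptions: the function names are hypothetical, and the weight layout is derived directly from unrolling the recursion rather than copied from Appendix D or the official repository.

```python
def ema_weight_matrix(length, alpha):
    """Lower-triangular W such that trend[t] = sum_j W[t][j] * x[j]."""
    W = [[0.0] * length for _ in range(length)]
    for t in range(length):
        W[t][0] = (1.0 - alpha) ** t                  # weight on the seed x_0
        for j in range(1, t + 1):
            W[t][j] = alpha * (1.0 - alpha) ** (t - j)
    return W

def ema_closed_form(x, alpha):
    """Apply the weight matrix as one matrix-vector product."""
    W = ema_weight_matrix(len(x), alpha)
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def ema_recursive(x, alpha):
    """Reference implementation: the sequential recursion."""
    s = [x[0]]
    for value in x[1:]:
        s.append(alpha * value + (1.0 - alpha) * s[-1])
    return s

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
closed = ema_closed_form(x, alpha=0.3)
looped = ema_recursive(x, alpha=0.3)
print(max(abs(a - b) for a, b in zip(closed, looped)))  # rounding-level gap
```

Because the matrix depends only on the look-back length and α, a tensor version could build it once and reuse it for every batch and channel, which is what makes the matrix-multiplication form attractive on a GPU.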

Why α = 0.3 performs best on large datasets

The paper sweeps α over {0.1, 0.3, 0.5, 0.7, 0.9}. On Weather, Traffic, and Electricity, the larger and more channel-rich benchmarks, α = 0.3 is consistently optimal. The intuition is as follows. With many noisy channels, the trend must be genuinely slow in order to filter short-lived noise while still tracking the multi-step drift. A smaller α oversmooths and deprives the seasonal stream of bandwidth, whereas a larger α allows excessive high-frequency content to leak into the trend. The value 0.3 sits in the appropriate range.

On smaller and noisier datasets the result is less clear-cut. In some cases α = 0.5 or 0.7 is preferable because the trend must react more quickly to abrupt regime changes. The paper treats α as a hyperparameter rather than a learnable parameter; making α learnable is one obvious direction for follow-up research.

Simple moving average versus exponential moving average

| Property | Simple Moving Average (DLinear-style) | Exponential Moving Average (xPatch) |
| --- | --- | --- |
| Weight scheme | Uniform inside a window | Geometric decay, recent > old |
| Hyperparameter | Window length k | Smoothing factor α |
| Edge effects | Hard window boundary | Smooth, no boundary discontinuity |
| Reactivity to recent shocks | Slow (averaged equally with old data) | Fast (recent point gets weight α) |
| Implementation cost | O(k) per step | O(1) per step (vectorized) |

Figure: EMA decomposition (α = 0.3). Panels: original X; trend X_T = EMA(X), computed as s_t = α·x_t + (1 − α)·s_{t−1}; seasonal X_S = X − X_T. The trend is a smooth low-pass of the input via EMA; the seasonal part is the bursty residual that carries the high-frequency structure.

The Dual-Stream Architecture

Once X_T (the trend) and X_S (the seasonal component) are obtained, xPatch processes them in two specialised streams. The design principle is to use the appropriate tool for each component and combine the results at the end.

The linear stream (processing X_T)

The trend is, by construction, smooth. After EMA filtering, little non-linear structure remains. xPatch therefore processes the trend through two MLP-style blocks, each composed of:

  • A fully connected (FC) projection.
  • A 1D average pooling layer with kernel size k = 2.
  • A LayerNorm operation.

Importantly, there is no non-linear activation function anywhere in the linear stream. Up to the LayerNorm, the entire stream consists of a sequence of linear operators. The final output is projected to dimension T (the forecast horizon). Readers familiar with DLinear will recognise the structure: xPatch retains the DLinear approach for trend modelling.

The LayerNorm is the only operator in the stream with a non-linear character, since it divides by an instance-computed standard deviation that is data-dependent. It stabilises training when the trend’s scale varies across samples. The average pooling acts as an additional smoothing step, reducing the probability that the linear stream over-fits to high-frequency noise that leaks through the decomposition.

The CNN stream (processing X_S)

The seasonal stream is where most of the modelling work occurs. Seasonal residuals are bursty, locally periodic, and channel-correlated. xPatch handles them with a depthwise-separable CNN:

  • Patching: the input is segmented into patches of length P = 16 with stride S = 8. The number of patches is ⌊(L − P) / S⌋ + 2, matching the PatchTST configuration, in which the end of the series is padded by S values to produce the extra patch. With L = 96, the result is exactly 12 patches per channel.
  • Depthwise convolution: kernel size P = 16, stride P = 16, with groups equal to the number of channels N. Each channel receives its own filter aligned to patch boundaries, with no cross-channel mixing at this step.
  • Pointwise convolution: a 1×1 convolution that mixes information across channels.
  • GELU activation: the only major non-linearity in the entire model. GELU is smooth, passes large positive values almost unchanged, and softly suppresses negative ones, which suits spiky residuals.
  • BatchNorm: applied for training stability across batches.
  • Residual connection: the input is added back to the output, which simplifies optimisation and allows the stream to behave approximately as an identity if the seasonal component is near zero.
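The patching arithmetic in the list above can be checked with a few lines of plain Python. This is a sketch under one assumption: the extra patch comes from padding the end of the series by repeating the last value S times, following the PatchTST convention. The helper names are hypothetical.

```python
def num_patches(L, P=16, S=8):
    """Patch count, including the one extra patch created by end padding."""
    return (L - P) // S + 2

def make_patches(series, P=16, S=8):
    """Pad the end by repeating the last value S times, then slide a window."""
    padded = list(series) + [series[-1]] * S
    return [padded[start:start + P]
            for start in range(0, len(padded) - P + 1, S)]

series = list(range(96))          # stand-in for one channel with L = 96
patches = make_patches(series)
print(len(patches), num_patches(96))
```

With stride 8 and length 16, consecutive patches overlap by half, so every time step except those at the very start appears in two patches.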

The depthwise plus pointwise pattern is the classic MobileNet-style separable convolution. It reduces parameters substantially relative to a full convolution while retaining a similar receptive field. For time series with many channels (Traffic has 862 and Electricity has 321), the reduction is essential, since a full Conv1D would be prohibitively large.
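The size of the saving is easy to quantify. The sketch below counts parameters for a channel-preserving 1D convolution with bias and kernel size 16; those layer settings and the function names are assumptions for illustration, chosen to match the description above rather than the official configuration.

```python
def full_conv1d_params(channels, kernel):
    """Standard Conv1d: every output channel sees every input channel."""
    return channels * channels * kernel + channels

def separable_conv1d_params(channels, kernel):
    """Depthwise (one filter per channel) followed by a 1x1 pointwise mix."""
    depthwise = channels * kernel + channels
    pointwise = channels * channels + channels
    return depthwise + pointwise

for name, channels in (("Electricity", 321), ("Traffic", 862)):
    full = full_conv1d_params(channels, kernel=16)
    sep = separable_conv1d_params(channels, kernel=16)
    print(f"{name}: full={full:,} separable={sep:,} ratio={full / sep:.1f}x")
```

The pointwise term dominates the separable count, so the saving approaches the kernel size as the channel count grows.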

Why this division of labour is effective

An MLP can learn arbitrary linear projections but must allocate capacity to discover local structure. A patch-aligned CNN encodes locality and translation-equivariance directly into the architecture. By passing only the seasonal residual into the CNN, xPatch allows the CNN to concentrate on local patterns, the task it is best suited to, without expending capacity on re-learning the trend. Conversely, the linear stream is not required to model seasonal spikes that would force a compromise.

This is the same lesson that graph attention networks illustrate in a different domain: the architecture’s inductive biases should align with the structure of the signal being modelled. Attention is a powerful general-purpose mixer, but its generality is not free.

Combining the two streams

The outputs of the linear and CNN streams are concatenated and passed through a final linear layer (Equation 12 in the paper) to produce the forecast over horizon T. The combination is intentionally simple. The model is not required to learn a complex gating mechanism; it learns a linear combination of the two specialists’ outputs.

Tip: For implementations starting from scratch, an effective sanity check is to begin with the linear stream alone and verify that it matches DLinear performance on ETTh1. The CNN stream can then be added, and the gains will become visible on noisier datasets such as Weather and Traffic.

Training Components: Arctangent Loss, Sigmoid Schedule, and RevIN

The architecture is only half of the story. The other half is the training recipe, and the paper makes a strong case that some of the gains derive from techniques that any forecasting model can adopt.

RevIN (Reversible Instance Normalisation)

Distribution shift is endemic in time series. The mean and variance of a channel during training rarely match those at inference time, particularly in non-stationary domains such as finance, traffic, or weather. RevIN addresses this issue with a simple procedure:

  1. Before the model: subtract the per-instance mean and divide by the per-instance standard deviation, where the instance is a single look-back window.
  2. After the model: multiply by the same standard deviation and add back the same mean, along with learnable affine parameters.

The model therefore only sees standardised inputs and does not need to memorise the level or scale of any particular channel. The de-normalisation at the output returns the forecast to the original scale. RevIN is now standard equipment in modern forecasting models, and xPatch employs it in the same manner as PatchTST and CARD.
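The two RevIN steps can be sketched for a single channel in plain Python. The function names are hypothetical, the learnable affine parameters are left out for brevity, and the placement of `eps` inside the square root is an assumption rather than a detail taken from the paper.

```python
import math

def revin_normalise(window, eps=1e-5):
    """Standardise one look-back window and return the statistics used."""
    mean = sum(window) / len(window)
    var = sum((v - mean) ** 2 for v in window) / len(window)
    std = math.sqrt(var + eps)
    return [(v - mean) / std for v in window], (mean, std)

def revin_denormalise(values, stats):
    """Map model outputs back to the original level and scale."""
    mean, std = stats
    return [v * std + mean for v in values]

window = [102.0, 98.0, 105.0, 110.0, 95.0, 101.0]
normed, stats = revin_normalise(window)
restored = revin_denormalise(normed, stats)
print(max(abs(a - b) for a, b in zip(window, restored)))
```

In a real model the de-normalisation is applied to the forecast rather than to the input, using the statistics of the look-back window that produced it.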

The arctangent loss

This is one of the more novel components of the paper. CARD popularised a horizon-weighted loss that scales the error at horizon step i by i^(−1/2), so that distant predictions, which carry higher variance, count for less. The motivation is reasonable, but this weight decays quickly, and far-horizon steps end up contributing little to the optimisation.

xPatch replaces this with a slower-decaying function based on the arctangent (Equations 16 and 17):

ρ_arctan(i) = −arctan(i) + π/4 + 1

L_arctan = (1/T) · Σᵢ ρ_arctan(i) · ||Ŷᵢ − yᵢ||₁

The motivation for the arctangent function is that it is bounded, monotonic, and smooth: the weight starts at ρ_arctan(1) = 1 and decays towards a floor of 1 − π/4 ≈ 0.215 rather than towards zero. Unlike a faster-decaying weighting, it never lets the far end of the forecast window drop out of the gradient. The result is more balanced attention across the entire forecast window, which empirically translates into improved performance on long horizons without degrading performance on shorter ones.
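Evaluating the weight function at a few horizon steps makes its behaviour concrete. The sketch below implements the formula above in plain Python for a single series; the inverse-square-root curve is included only as a faster-decaying reference, and the horizon index is assumed to start at 1.

```python
import math

def rho_arctan(i):
    """Arctangent horizon weight for step i (i starts at 1)."""
    return -math.atan(i) + math.pi / 4 + 1.0

def rho_inverse_sqrt(i):
    """i^(-1/2) weighting, shown only as a faster-decaying reference."""
    return i ** -0.5

def arctangent_loss(pred, target):
    """Horizon-weighted mean absolute error for one univariate forecast."""
    T = len(pred)
    return sum(rho_arctan(i) * abs(p - y)
               for i, (p, y) in enumerate(zip(pred, target), start=1)) / T

for i in (1, 24, 96, 720):
    print(i, round(rho_arctan(i), 3), round(rho_inverse_sqrt(i), 3))
```

Both curves start at 1 for the first step, but the arctangent weight levels off above 1 − π/4 while the reference curve keeps shrinking towards zero.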

The paper’s most notable ablation finding is that the arctangent loss helps even when applied to other models. Substituting it into PatchTST or CARD improves accuracy. The loss is therefore a transferable technique that can serve as a free upgrade for an existing forecasting pipeline.

Sigmoid learning-rate schedule

Standard schedules in this literature are step decay (the learning rate is halved every K epochs) or cosine annealing. xPatch introduces a sigmoid-shaped schedule (Equation 23) with a warm-up parameter w. The shape consists of a smooth ramp-up from a low initial value, a flat plateau in the middle, and a gentle ramp-down. Compared with step decay, it avoids the discontinuities that can destabilise training. Compared with cosine annealing, the explicit warm-up provides the optimiser with time to locate a suitable basin before the learning rate becomes high.
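A schedule with this general shape can be sketched as the difference of two logistic curves. The functional form, the name `sigmoid_lr`, and the constants k, s, and w below are assumptions chosen for illustration, not values taken from Equation 23; the paper and the official repository remain the reference.

```python
import math

def sigmoid_lr(epoch, base_lr=1e-4, k=0.5, s=5.0, w=10.0):
    """Hypothetical sigmoid-shaped schedule: fast logistic rise, slow decay.

    k: steepness of the warm-up, w: warm-up midpoint in epochs,
    s: how much slower (and later) the decay is than the rise.
    """
    rise = 1.0 / (1.0 + math.exp(-k * (epoch - w)))
    fall = 1.0 / (1.0 + math.exp(-(k / s) * (epoch - s * w)))
    return base_lr * (rise - fall)

schedule = [sigmoid_lr(e) for e in range(101)]
peak_epoch = max(range(101), key=schedule.__getitem__)
print(peak_epoch, schedule[0], schedule[peak_epoch], schedule[100])
```

With these constants the rate starts at zero, reaches a broad peak around epoch 20, and decays gently towards zero by epoch 100, with no discontinuity anywhere. In PyTorch, the multiplier `rise - fall` could be supplied to `torch.optim.lr_scheduler.LambdaLR`.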

As with the arctangent loss, the paper shows that the sigmoid schedule transfers cleanly to other models. The implication is that learning-rate schedules are often under-tuned in benchmark comparisons. When all models use the same default, any architecture that claims a win must outperform the also-suboptimal training of every competitor.

Compute footprint

xPatch is trained for 100 epochs on a single NVIDIA Quadro RTX 6000. The configuration corresponds to a single mid-range GPU and a short schedule by current standards. There is no foundation-model pre-training, no distributed setup, and no specialised quantisation. This minimal footprint is part of the paper’s argument: state-of-the-art forecasting does not necessarily require state-of-the-art compute.

Caution: The arctangent weights ρ_arctan(i) decrease with the horizon index, from 1 at i = 1 towards 1 − π/4 ≈ 0.215, so distant steps count for less than near ones, though never for nothing. If the downstream application cares mostly about the next step (for example, real-time anomaly detection on the next minute), a steeper decay or a custom ρ function is more appropriate; if every horizon matters equally, plain unweighted MAE is the natural choice. The paper’s choice is well motivated for the standard all-horizons benchmark, but it is not necessarily optimal for every production setting.

Benchmark Results

The experimental setup is the standard long-horizon forecasting suite that has dominated the literature since Informer.

Datasets

| Dataset | Dim | Frequency | Forecast horizons |
| --- | --- | --- | --- |
| ETTh1, ETTh2 | 7 | Hourly | 96, 192, 336, 720 |
| ETTm1, ETTm2 | 7 | 15 min | 96, 192, 336, 720 |
| Weather | 21 | 10 min | 96, 192, 336, 720 |
| Traffic | 862 | Hourly | 96, 192, 336, 720 |
| Electricity | 321 | Hourly | 96, 192, 336, 720 |
| Exchange-rate | 8 | Daily | 96, 192, 336, 720 |
| Solar | 137 | 10 min | 96, 192, 336, 720 |
| ILI | 7 | Weekly | 24, 36, 48, 60 |

The look-back window is L = 96 for all datasets except ILI, which uses L = 36. The baselines are the principal models of the past few years: Autoformer, FEDformer, ETSformer, TimesNet, DLinear, RLinear, MICN, PatchTST, iTransformer, TimeMixer, and CARD.

Headline numbers

| Dataset | Horizon | xPatch MSE | xPatch MAE |
| --- | --- | --- | --- |
| ETTh1 | 96 | 0.428 | 0.419 |
| Weather | 720 | 0.310 | 0.322 |

Across all eight datasets and all four horizons, xPatch outperforms CARD, the previous state of the art, by an average of 2.46 percent in MSE and 2.34 percent in MAE. The margin is small but clear, given how saturated these benchmarks have become. Gains of 1 to 3 percent are now considered meaningful in the literature, and such gains are typically obtained at the cost of new attention variants, larger models, or longer training.

Speed

While accuracy is the headline result, the speed advantage is equally important. Table 3 of the paper reports per-step training and inference times.

| Model | Training (msec/step) | Inference (msec/step) | Relative speed vs xPatch |
| --- | --- | --- | --- |
| xPatch | 3.099 | 1.303 | 1.0× |
| CARD | 14.877 | (not listed) | 4.8× slower |

Training is approximately 4.8 times faster than CARD per step. The paper does not provide equivalently precise per-step numbers for PatchTST and DLinear, but the general ordering reported is DLinear < xPatch < PatchTST < CARD in training time. In production settings, where forecasting models may be retrained daily on streaming data, this speed advantage matters more than the marginal MSE gain.

Figure: Speed vs accuracy, with xPatch marked as Pareto-optimal. x-axis: training time per step (msec, lower is better); y-axis: MSE (lower is better). Points: DLinear (1 msec, 0.50), xPatch (3 msec, 0.43), PatchTST (~7 msec, ~0.45), iTransformer (~10 msec, ~0.46), CARD (15 msec, 0.44). MSE values are illustrative averages across benchmarks; xPatch achieves both lower MSE and faster training than CARD and PatchTST.
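The Pareto claim can be made precise with a small dominance check. The sketch below uses the illustrative (training time, MSE) pairs from the figure, not measured results, and the function name `pareto_front` is hypothetical.

```python
def pareto_front(models):
    """Names of models not dominated on (train_ms, mse); lower is better."""
    front = []
    for name, (ms, mse) in models.items():
        dominated = any(
            o_ms <= ms and o_mse <= mse and (o_ms < ms or o_mse < mse)
            for other, (o_ms, o_mse) in models.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Illustrative values read off the figure, not benchmark measurements.
models = {
    "DLinear": (1.0, 0.50),
    "xPatch": (3.0, 0.43),
    "PatchTST": (7.0, 0.45),
    "iTransformer": (10.0, 0.46),
    "CARD": (15.0, 0.44),
}
print(pareto_front(models))
```

Under these numbers the frontier contains two models: DLinear, because nothing is faster, and xPatch, which is both faster and more accurate than PatchTST, iTransformer, and CARD.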

Ablations: What Drives Performance

Ablation studies indicate whether a paper’s gains are robust or fragile. The ablations reported for xPatch are transparent and informative.

EMA α sweep

| α | Weather | Traffic | Electricity | Notes |
| --- | --- | --- | --- | --- |
| 0.1 | slightly worse | slightly worse | slightly worse | Trend too smooth, leaks structure |
| 0.3 | best | best | best | Optimal balance for big datasets |
| 0.5 | close | close | close | Reasonable fallback |
| 0.7 | worse | worse | worse | Trend tracks too fast |
| 0.9 | worst | worst | worst | Trend ~= input, decomposition fails |

The pattern is clear: 0.3 dominates on the larger datasets. The paper notes that smaller and noisier datasets sometimes favour higher α values, so fixing α = 0.3 for every problem is unwise. The parameter should instead be swept on a held-out validation split.
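A validation sweep of this kind needs very little scaffolding. In the sketch below, `sweep_alpha` is a hypothetical helper, and `toy_val_loss` is a stand-in for "train xPatch with this α and return the loss on the held-out split"; its values are invented purely so that the example runs.

```python
def sweep_alpha(val_loss_fn, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Evaluate each alpha on the validation split and keep the best."""
    scores = {alpha: val_loss_fn(alpha) for alpha in grid}
    best = min(scores, key=scores.get)
    return best, scores

def toy_val_loss(alpha):
    """Placeholder objective; replace with a real train-and-validate run."""
    return (alpha - 0.35) ** 2 + 0.40

best_alpha, scores = sweep_alpha(toy_val_loss)
print(best_alpha, scores)
```

The important design choice is that the selection uses a validation split only; choosing α on the test split would leak information into the reported score.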

Necessity of both streams

The paper ablates the removal of each stream. Removing the linear stream (so that the CNN handles both trend and seasonal components) degrades performance. Removing the CNN stream (so that the linear stream attempts to capture seasonality) degrades performance more substantially. The two streams are genuinely complementary, and neither is dispensable.

Transferability of the arctangent loss

This is arguably the most important ablation in the paper. When the original training loss of PatchTST or CARD is replaced with the arctangent loss, those models also improve. The loss is therefore a free upgrade for the field. Practitioners operating an existing forecasting pipeline can adopt the new loss as a one-line change and likely gain a few percentage points in accuracy.

Transferability of the sigmoid schedule

The same conclusion applies to the sigmoid schedule: it also helps other models. The implication is uncomfortable for the literature. A non-trivial fraction of past “architecture wins” may have been confounded by suboptimal training schedules. xPatch at least isolates how much of its margin derives from the loss and the schedule, as distinct from the dual-stream design itself.

Key Takeaway: A meaningful share of the gains attributed to xPatch derives from training methods rather than architecture. The honest reading is that xPatch outperforms on multiple dimensions, including better decomposition, better dual-stream design, a better loss, and a better schedule. Practitioners should consider carefully which of these components to adopt independently.

How to use xPatch (PyTorch sketch)

The official implementation is available at github.com/stitsyuk/xPatch and follows the structure of standard long-horizon forecasting library scaffolds. The full code includes data loaders, evaluation harnesses, and configurations for each benchmark, but the model itself is compact enough to summarise in a single screen.

The following is a minimal but faithful PyTorch outline. It is not a drop-in replacement for the official repository, which should be used for benchmarking, but it represents the architecture clearly.

import torch
import torch.nn as nn
import torch.nn.functional as F


class EMADecomp(nn.Module):
    """Exponential moving-average decomposition (Eq. 2)."""
    def __init__(self, alpha: float = 0.3):
        super().__init__()
        self.alpha = alpha

    def forward(self, x):
        # x shape: (B, L, N)  batch, look-back, channels
        B, L, N = x.shape
        trend = torch.zeros_like(x)
        trend[:, 0, :] = x[:, 0, :]
        for t in range(1, L):
            trend[:, t, :] = (
                self.alpha * x[:, t, :]
                + (1.0 - self.alpha) * trend[:, t - 1, :]
            )
        seasonal = x - trend
        return trend, seasonal


class LinearStream(nn.Module):
    """2 FC + AvgPool + LayerNorm blocks, no activation."""
    def __init__(self, L: int, T: int, hidden: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(L, hidden)
        self.pool1 = nn.AvgPool1d(kernel_size=2, stride=1, padding=1)
        self.ln1 = nn.LayerNorm(hidden + 1)
        self.fc2 = nn.Linear(hidden + 1, hidden)
        self.pool2 = nn.AvgPool1d(kernel_size=2, stride=1, padding=1)
        self.ln2 = nn.LayerNorm(hidden + 1)
        self.proj = nn.Linear(hidden + 1, T)

    def forward(self, x):
        # x: (B, L, N) -> (B, N, L) so the FC layers act along the time axis
        x = x.transpose(1, 2)
        # AvgPool1d pools over the last axis: with kernel 2, stride 1 and
        # padding 1 it maps (B, N, hidden) -> (B, N, hidden + 1)
        h = self.pool1(self.fc1(x))
        h = self.ln1(h)
        h = self.pool2(self.fc2(h))
        h = self.ln2(h)
        return self.proj(h)  # (B, N, T)


class CNNStream(nn.Module):
    """Patch -> depthwise -> pointwise -> GELU -> BN (residual omitted)."""
    def __init__(self, N: int, L: int, T: int,
                 P: int = 16, S: int = 8):
        super().__init__()
        self.P, self.S = P, S
        n_patches = (L - P) // S + 2
        # One filter per channel, applied once per patch (kernel = stride = P)
        self.depthwise = nn.Conv1d(
            in_channels=N, out_channels=N,
            kernel_size=P, stride=P, groups=N,
        )
        # 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv1d(N, N, kernel_size=1)
        self.bn = nn.BatchNorm1d(N)
        self.proj = nn.Linear(n_patches, T)

    def forward(self, x):
        # x: (B, L, N) -> (B, N, L)
        x = x.transpose(1, 2)
        # Pad the end with S repeated values so unfold yields n_patches patches
        x = F.pad(x, (0, self.S), mode="replicate")
        patches = x.unfold(-1, self.P, self.S)   # (B, N, n_patches, P)
        h = patches.flatten(start_dim=2)         # (B, N, n_patches * P)
        h = self.depthwise(h)                    # (B, N, n_patches)
        h = self.pointwise(h)
        h = F.gelu(h)
        h = self.bn(h)
        # The residual connection described in the text is omitted for brevity
        return self.proj(h)                      # (B, N, T)


class XPatch(nn.Module):
    def __init__(self, L: int, T: int, N: int, alpha: float = 0.3):
        super().__init__()
        self.decomp = EMADecomp(alpha)
        self.linear_stream = LinearStream(L, T)
        self.cnn_stream = CNNStream(N, L, T)
        self.fuse = nn.Linear(2 * T, T)

    def forward(self, x):
        # RevIN
        mean = x.mean(dim=1, keepdim=True)
        std = x.std(dim=1, keepdim=True) + 1e-5
        x_norm = (x - mean) / std

        trend, seasonal = self.decomp(x_norm)
        y_lin = self.linear_stream(trend)        # (B, N, T)
        y_cnn = self.cnn_stream(seasonal)        # (B, N, T)
        y = torch.cat([y_lin, y_cnn], dim=-1)
        y = self.fuse(y).transpose(1, 2)         # (B, T, N)

        # de-RevIN
        return y * std + mean


def arctangent_loss(pred, target):
    """L_arctan from Eq. 16-17; pred and target have shape (B, T, N)."""
    T = pred.size(1)
    # Horizon index runs from 1 to T, so the first step gets weight rho(1) = 1
    i = torch.arange(1, T + 1, device=pred.device, dtype=torch.float32)
    rho = -torch.atan(i) + torch.pi / 4 + 1.0
    abs_err = (pred - target).abs().mean(dim=-1)  # (B, T)
    return (rho * abs_err).mean()

Several practical notes apply:

  • The Python loop in EMADecomp should be replaced with the vectorised closed-form for a genuine speed-up. The mathematics is presented in Appendix D of the paper, and the official repository implements the vectorised version.
  • The CNN stream’s output projection is sketched in a simplified manner here; the official implementation handles the patching dimensions more carefully.
  • For a clean initial configuration, use L = 96, P = 16, S = 8, α = 0.3, 100 epochs, the sigmoid learning-rate schedule with a warm-up of approximately 10 epochs, and the arctangent loss.

For applications involving anomaly detection on the same series, the overview of time series anomaly detection models is relevant. Many of the same training techniques (RevIN, patching, decomposition) carry over.

Hyperparameter reference

| Hyperparameter | Default | When to change |
| --- | --- | --- |
| Look-back L | 96 (36 for ILI) | Increase if your seasonality is longer than 96 steps |
| Patch size P | 16 | Should align with your series’ natural local period |
| Stride S | 8 | Smaller for more overlap, larger for fewer patches |
| EMA α | 0.3 | Sweep {0.1, 0.3, 0.5, 0.7, 0.9} on small/noisy data |
| Epochs | 100 | Use early stopping to cut wasted compute |
| Loss | Arctangent | Switch to standard MAE if all horizons matter equally |

When to use xPatch versus alternatives

No single model is appropriate for every problem. xPatch occupies a specific region of the design space: low-latency, accuracy-competitive, supervised, point-forecast, and multivariate. The following framework is useful for selecting an appropriate model.

| Need | Recommended approach | Why |
| --- | --- | --- |
| Fastest training/inference, good accuracy | xPatch | Beats CARD, ~5× faster than CARD per training step |
| Foundation model / zero-shot | TimesFM, Chronos, Moirai | Pretrained at scale, generalize across domains without fine-tuning |
| Calibrated uncertainty estimates | Gaussian processes | Native posterior variances, principled credible intervals |
| Long-context attention reasoning | PatchTST, iTransformer | When channel relationships are essential and context exceeds ~512 steps |
| Tabular-style features without temporal structure | XGBoost / LightGBM | When good lag/window features can be engineered, GBMs are difficult to beat on tabular forecasting |
| Linear/stationary signal, minimal compute | DLinear, classical ARIMA | If the data is genuinely simple, simpler is better |
| High-throughput streaming infra | xPatch + Kafka time-series engine | Low-latency model fits well with streaming pipelines |

For principled tuning of hyperparameters in any of these alternatives, the companion note on Bayesian hyperparameter optimisation is a useful reference.

Limitations and open questions

xPatch is a strong paper, but no paper is without weaknesses. The honest limitations are as follows:

  • α is a hyperparameter rather than a learned parameter. A natural extension is to make α differentiable, or even to make it both per-channel and per-timescale. The paper acknowledges this and identifies it as future work.
  • The datasets are relatively small. The largest is Traffic, with 862 channels and approximately 17,000 timesteps. This is small compared with the data on which foundation models such as Chronos and TimesFM are pre-trained. The behaviour of xPatch on substantially larger streams remains untested in the paper.
  • Two streams imply two forward passes. Inference remains fast, but a fused single-pass implementation would be faster still and might be feasible with a careful architectural redesign.
  • The model produces point forecasts only. xPatch produces a single-trajectory forecast without a probabilistic interpretation. For risk-sensitive applications such as finance, energy, and healthcare, quantiles or full distributions are typically required, and xPatch does not provide them natively. A quantile head or a Bayesian wrapper is necessary.
  • Benchmark saturation. The community has acknowledged that ETTh, Weather, and related benchmarks are showing signs of saturation. Gains of 2 to 3 percent may not transfer to messier real-world data with greater drift, missing values, and concept shift. xPatch’s results are state of the art on these benchmarks; whether they generalise to, for example, the tick data of a finance trading desk is an empirical question.
  • The paper presents no theoretical analysis. The contribution is empirical. There is no generalisation bound, no convergence proof for the recursion, and no analysis of the loss landscape. This is acceptable for an applied paper but leaves room for follow-up theory.
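The quantile head mentioned in the point-forecast limitation is usually trained with the pinball (quantile) loss. The following is a minimal sketch of that loss in plain Python, not something the paper provides; the function name `pinball_loss` and the toy numbers are assumptions made for illustration.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a quantile level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        diff = y - y_hat
        # Under-prediction (diff > 0) costs q per unit of error,
        # over-prediction (diff < 0) costs 1 - q per unit of error.
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)


# Toy data (hypothetical): one forecast sits below the truth, one above.
y_true = [10.0, 12.0, 11.0]
low = [9.0, 11.0, 10.0]    # under-predicts every point by 1
high = [11.0, 13.0, 12.0]  # over-predicts every point by 1

loss_low = pinball_loss(y_true, low, q=0.9)
loss_high = pinball_loss(y_true, high, q=0.9)
print(loss_low, loss_high)
```

At q = 0.9 the under-prediction is penalised about nine times more heavily than the over-prediction, which is what pushes a head trained with this loss toward the 90th percentile. One output head per quantile, each trained with its own q, is a common way to obtain prediction intervals from a point-forecast backbone.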
Caution: If an application is characterised by heavy concept drift (for example, post-COVID demand forecasting or regime-changing financial markets), benchmark gains do not automatically transfer. Practitioners should evaluate on their own data with a realistic backtest before relying on leaderboard results.
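A realistic backtest of the kind recommended above is often built as a rolling-origin evaluation: the model is refitted on everything before a cut-off and scored on the next horizon, and the cut-off is then moved forward. The sketch below only generates the index splits; the helper name `rolling_origin_splits` and the sizes used are illustrative assumptions, not settings from the paper.

```python
def rolling_origin_splits(n_samples, min_train, horizon, step):
    """Return (train_idx, test_idx) pairs in which training data always precedes test data."""
    if step <= 0 or horizon <= 0 or min_train <= 0:
        raise ValueError("min_train, horizon and step must be positive")
    splits = []
    origin = min_train
    while origin + horizon <= n_samples:
        train_idx = range(0, origin)                # expanding training window
        test_idx = range(origin, origin + horizon)  # the next `horizon` steps
        splits.append((train_idx, test_idx))
        origin += step
    return splits


# Hypothetical sizes: 1,000 observations, a 96-step horizon, folds that do not overlap.
splits = rolling_origin_splits(n_samples=1000, min_train=600, horizon=96, step=96)
for train_idx, test_idx in splits:
    print(f"train [0, {train_idx[-1]}]  test [{test_idx[0]}, {test_idx[-1]}]")
```

Averaging the error over all folds, and inspecting the fold-by-fold trend, shows whether accuracy degrades as the data drifts, which a single train/test split cannot reveal.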

Implications for the Field

Considered at a higher level, the broader narrative is more interesting than the architectural details alone:

  • Inductive biases continue to matter. Decomposition (the separation of trend from seasonality) has been valuable since the 1950s, and it remains valuable in 2025. Patching, locality, and dual-specialisation all encode useful priors. Generic attention without such priors is rarely the appropriate choice for time series.
  • Loss functions and learning-rate schedules are underrated. The fact that the arctangent loss and the sigmoid schedule transfer to other models suggests that the field has been comparing architectures under suboptimal training. Future benchmark papers should standardise the training recipe before claiming architectural wins.
  • The Pareto frontier is the appropriate evaluation axis. A model that is 1 percent more accurate but 10 times slower may not be worth deploying. xPatch occupies the region in which accuracy is competitive and speed is meaningfully better, which is the appropriate position for production systems.
  • Foundation models are not the only path forward. The same year that produced TimesFM and Chronos also produced xPatch, which is task-specific, compact, fast, and competitive. Both styles will coexist; the appropriate choice depends on deployment constraints.
  • Self-supervised pre-training remains an open opportunity. xPatch is fully supervised. Whether self-supervised pre-training of the CNN stream, analogous to TS2Vec and related methods, would unlock further gains is an open question. The overview of self-supervised pretraining covers the relevant techniques.

For a concise reminder of the statistical foundations on which these models rest (independence, the role of variance, the importance of sample size for stable estimators), the explainer on the Central Limit Theorem is relevant. For deployment considerations, the comparison of databases for preprocessed time series reviews the relevant trade-offs.

Frequently asked questions

Why does a non-transformer model outperform PatchTST?

Three factors combine. First, the EMA decomposition provides the model with two cleaner sub-signals rather than a single mixed signal. Second, the dual-stream architecture matches the appropriate tool to each component: a linear stream for the smooth trend and a CNN for the bursty seasonal residual. Third, the arctangent loss and the sigmoid learning-rate schedule provide a training-side improvement. PatchTST employs channel-independent attention with learnable patch embeddings, but it asks a single stack of attention layers to handle both trend and seasonal components simultaneously. xPatch’s specialisation pays off on the benchmarks: it improves on CARD, the previous best model, by an average of 2.46 percent in MSE across eight standard datasets while running approximately four times faster.

Should xPatch or PatchTST be used in production?

The default choice should be xPatch unless there is a specific reason to prefer PatchTST. xPatch is faster to train, faster to infer, slightly more accurate on the standard benchmarks, and easier to debug because the streams are individually interpretable. PatchTST is worth considering if a look-back longer than 96 steps is required and the global receptive field of attention is needed. Note that PatchTST is itself channel-independent, so it does not supply cross-channel attention; if the dataset is heavily channel-correlated and cross-channel mixing is essential, a model with explicit cross-channel attention is the more relevant alternative.

How is the EMA alpha parameter tuned?

The recommended starting point is α = 0.3, which is optimal for the largest benchmarks in the paper (Weather, Traffic, Electricity). For smaller or noisier datasets, a sweep over {0.1, 0.3, 0.5, 0.7, 0.9} on a held-out validation split is appropriate. A smaller α produces smoother trends, which is suitable when noise dominates. A larger α produces more reactive trends, which is suitable when regime changes are abrupt. The paper deliberately keeps α non-learnable; making it learnable is a reasonable research extension.
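The effect of α on the trend can be inspected before any model is trained. The sketch below assumes the standard EMA recursion s_0 = x_0, s_t = α·x_t + (1 − α)·s_{t−1}, with the seasonal part taken as the residual x − s; the synthetic series and the `roughness` measure are illustrative assumptions, not part of the paper.

```python
import math


def ema_decompose(x, alpha):
    """Split x into (trend, seasonal) using s_0 = x_0, s_t = alpha*x_t + (1 - alpha)*s_{t-1}."""
    trend = [x[0]]
    for value in x[1:]:
        trend.append(alpha * value + (1.0 - alpha) * trend[-1])
    seasonal = [v - s for v, s in zip(x, trend)]
    return trend, seasonal


def roughness(series):
    """Mean absolute first difference; lower values mean a smoother series."""
    return sum(abs(b - a) for a, b in zip(series, series[1:])) / (len(series) - 1)


# Synthetic series (hypothetical): slow linear drift plus a 24-step cycle.
x = [0.05 * t + math.sin(2 * math.pi * t / 24) for t in range(240)]

for alpha in (0.1, 0.3, 0.5, 0.7, 0.9):
    trend, seasonal = ema_decompose(x, alpha)
    print(f"alpha={alpha}: trend roughness={roughness(trend):.4f}")
```

In an actual sweep the selection criterion would be the validation error of the trained model for each α rather than this roughness proxy; the sketch only makes the smooth-versus-reactive trade-off visible.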

What is the arctangent loss and why does it help?

The arctangent loss replaces standard MSE or MAE with a horizon-weighted MAE in which the weights follow ρ(i) = −arctan(i) + π/4 + 1. The weight equals 1 at the first forecast step and decreases with the horizon, but because the arctangent saturates at π/2 the weights level off near 1 − π/4 ≈ 0.21 instead of shrinking toward zero. This is a gentler decay than the horizon weighting used by CARD, so near horizons do not dominate the gradient and distant horizons are not ignored. The result is a more uniform learning signal across all forecast horizons. Empirically, the loss benefits not only xPatch but also other models such as PatchTST and CARD, which makes it a transferable upgrade for any forecasting pipeline.
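The weighting ρ(i) is simple enough to compute directly. The following is a minimal sketch for a single forecast trajectory, assuming the horizon index i starts at 1 and that the weighted absolute errors are averaged over the horizon; the function names are illustrative and are not taken from the official repository.

```python
import math


def arctan_weights(horizon):
    """rho(i) = -arctan(i) + pi/4 + 1 for i = 1..horizon (1-based index is an assumption)."""
    return [-math.atan(i) + math.pi / 4 + 1.0 for i in range(1, horizon + 1)]


def arctan_weighted_mae(y_true, y_pred):
    """Horizon-weighted MAE for one trajectory, averaged over the forecast steps."""
    weights = arctan_weights(len(y_true))
    errors = [abs(a - b) for a, b in zip(y_true, y_pred)]
    return sum(w * e for w, e in zip(weights, errors)) / len(errors)


weights = arctan_weights(96)
print(f"first weight: {weights[0]:.4f}")
print(f"last weight:  {weights[-1]:.4f}")
print(f"lower bound:  {1 - math.pi / 4:.4f}")
```

The first weight is exactly 1 because arctan(1) = π/4, and every later weight stays above 1 − π/4, so an error at step 96 still contributes more than a fifth of what the same error contributes at step 1.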

Does xPatch support multivariate forecasting?

Yes. The architecture is designed for multivariate inputs. The depthwise convolution in the CNN stream operates per channel (groups = N), and the pointwise convolution mixes information across channels. The linear stream processes each channel through the same weights while preserving the channel dimension. The paper evaluates on datasets with up to 862 channels (Traffic) without modification.


This article is for informational and educational purposes only. It summarizes a publicly available academic paper and is not a substitute for reading the original. Implementation details should be verified against the official repository before production use.
