Time-Series Forecasting in 2026: From ARIMA to Foundation Models

Last updated: May 27, 2026

By kongastral

Published April 3, 2026

Summary

What this post covers: A practitioner’s roadmap to time-series forecasting in 2026, tracing the evolution from ARIMA through PatchTST and iTransformer to foundation models like TimesFM, Chronos, and Moirai, with benchmarks and a model-selection framework.

Key insights:

Classical methods (ARIMA, ETS, seasonal naive) remain competitive baselines that the M5 and subsequent competitions show often match deep learning on univariate, well-behaved series, so always benchmark against them first.
Gradient boosting (LightGBM, XGBoost) quietly dominates many real-world, feature-rich forecasting problems and beat all deep learning entries at the M5 competition; ignore it at your peril.
Foundation models like TimesFM, Chronos, and Moirai deliver competitive zero-shot forecasts without any task-specific training and are bridging toward fully-supervised accuracy via efficient fine-tuning on a few hundred examples.
PatchTST and iTransformer demonstrate that the right inductive bias (patching the time axis, inverting which dimension attention operates over) often matters more than model size or attention sophistication.
The best forecasting system is the best pipeline, not the best model: data quality, proper time-series cross-validation, forecast reconciliation, and monitoring matter more than any single architecture choice.

Main topics: Why Time-Series Forecasting Matters More Than Ever, Classical Foundations That Still Work, Gradient Boosting for Time Series: An Underused Practitioner Tool, The Deep Learning Era: N-BEATS, N-HiTS, and TFT, PatchTST: When Vision Meets Time Series (ICLR 2023), iTransformer: Inverting the Attention Paradigm (ICLR 2024), Foundation Models: Zero-Shot Forecasting Arrives, Benchmarks: How Models Actually Compare, Practical Model Selection Guide, Implementation: End-to-End Forecasting Pipeline, The Future of Forecasting, References.

In March 2021, the container ship Ever Given lodged sideways in the Suez Canal, blocking for six days a route that carries roughly 12% of global trade. The economic damage was estimated to exceed 54 billion USD. Supply chain managers across the world had to re-route shipments, adjust inventory forecasts, and estimate when normal flow would resume. The companies that weathered the crisis best were not those with the largest inventories but those with the most accurate demand forecasting models, capable of recalculating their entire supply chain within hours rather than weeks.

Time-series forecasting—the task of predicting future values from historical observations—is the quantitative foundation of decision-making across nearly every industry. Retailers forecast demand to stock shelves. Energy companies forecast load to schedule generation. Financial institutions forecast volatility to price options. Hospitals forecast patient admissions to staff wards. The accuracy of these forecasts directly determines whether resources are allocated efficiently or wasted at scale.

The field has undergone substantial transformation since 2022. For decades, ARIMA and exponential smoothing dominated. They were followed by deep learning architectures—N-BEATS, Temporal Fusion Transformers, DeepAR—that challenged classical methods on complex, multivariate problems. In 2025 and 2026, the most significant shift is the emergence of foundation models pre-trained on billions of time points that can forecast series they have not previously seen, without any task-specific training. The implications for practitioners are substantial, and uncertainty about which model to use has rarely been greater.

This guide aims to clarify that uncertainty. It traces the evolution from classical methods through deep learning to the current frontier, benchmarks the models that matter, and offers a practical framework for selecting the appropriate approach for a given problem. The treatment focuses on what works, what does not, and the reasons for each.

Why Time-Series Forecasting Matters More Than Ever

The volume of time-stamped data generated globally has expanded sharply. IoT sensors, financial markets, application telemetry, social media engagement metrics, weather stations, and wearable health devices all produce continuous streams of sequential observations. Organisations that aim to derive value from this data require not only appropriate forecasting models but also suitable databases for storing preprocessed time-series data and robust pipelines for moving data between systems. The International Data Corporation projected that the global datasphere would reach roughly 175 zettabytes by 2025, with a substantial portion of that data being temporal.

Volume alone, however, does not explain why forecasting has become more important. Three structural trends are increasing demand for accurate predictions:

Just-in-time operations. Modern supply chains, cloud infrastructure, and service delivery systems operate with minimal slack. Real-time complex event processing pipelines built on Apache Flink are increasingly paired with forecasting models to detect anomalies as they occur. Amazon’s fulfilment network, Uber’s driver allocation, and Netflix’s content delivery all depend on accurate short-term forecasts to match supply with demand in near real time. Forecast errors of even 10% result in either costly over-provisioning or customer-visible failures.

Renewable energy integration. As solar and wind generation transitions from supplementary to primary energy sources, grid operators must forecast intermittent generation with high accuracy to maintain stability. A 5% error in the solar generation forecast for a large grid can mean the difference between smooth operation and emergency natural gas peaking, with associated costs measured in millions of dollars and unnecessary emissions.

Algorithmic decision-making at scale. Automated systems, ranging from algorithmic trading to dynamic pricing and autonomous vehicle planning, consume forecasts as inputs to decisions that execute without human review. The performance ceiling of these systems is bounded by the accuracy of their underlying forecasts.

Key Takeaway: Time-series forecasting has evolved from a quarterly planning exercise carried out by analysts into an operational capability that runs continuously, feeds automated systems, and directly affects revenue and reliability. The standard for accuracy, and the cost of inaccuracy, has rarely been higher.

Classical Foundations That Still Work

Before turning to transformers and foundation models, it is important to acknowledge that classical statistical methods remain highly competitive on many forecasting problems. The 2020 M5 competition and subsequent analyses have repeatedly shown that simple methods, properly tuned, often match or surpass complex deep learning models on univariate and low-dimensional problems.
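A seasonal naive forecast is the simplest baseline of this kind: it repeats the last observed seasonal cycle. The following is a minimal sketch rather than a library implementation; the function name `seasonal_naive` and the toy data are assumptions made for illustration.

```python
def seasonal_naive(history, season_length, horizon):
    """Repeat the last observed seasonal cycle over the forecast horizon."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_cycle = history[-season_length:]
    # Forecast step h (0-based) reuses the value at the same position
    # in the most recent cycle.
    return [last_cycle[h % season_length] for h in range(horizon)]


# Toy daily series with a weekly pattern (hypothetical values)
history = [10, 12, 13, 12, 15, 20, 22,
           11, 13, 14, 13, 16, 21, 23]
forecast = seasonal_naive(history, season_length=7, horizon=10)
print(forecast)
```

Any more elaborate model should be compared against a baseline like this on the same held-out window before it is adopted.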

ARIMA and SARIMA

AutoRegressive Integrated Moving Average (ARIMA) models capture three components of a time series: autoregressive behaviour (current values depend on past values), differencing (to achieve stationarity), and moving average effects (current values depend on past forecast errors). The seasonal variant, SARIMA, adds explicit seasonal terms.

ARIMA’s principal strengths are its theoretical foundation and interpretability: every parameter carries a clear statistical meaning. Its weakness is that it assumes linear relationships and handles only univariate series. For a single well-behaved time series with clear trend and seasonality (monthly sales, daily temperature), ARIMA remains a strong, fast, and interpretable baseline. When working with sensor data at scale, pairing ARIMA with a sound metadata management strategy for facility and sensor signals ensures that the appropriate model can be tracked against each data stream.

Exponential Smoothing (ETS)

Exponential Smoothing State Space models (ETS) decompose a time series into error, trend, and seasonal components, each of which can be additive or multiplicative. The Holt-Winters method, which corresponds to an ETS configuration with an additive trend and additive or multiplicative seasonality, is among the most widely deployed forecasting models in industry, particularly in retail demand planning.
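The simplest member of the family, simple exponential smoothing (additive error, no trend, no seasonality), shows the core recursion: the level is a weighted average of the newest observation and the previous level. This is a sketch under simplifying assumptions: the level is initialised with the first observation and `alpha` is fixed by hand, whereas libraries estimate both.

```python
def simple_exponential_smoothing(y, alpha):
    """Return the final smoothed level; the forecast is flat at this level."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = y[0]  # assumption: initialise the level with the first observation
    for value in y[1:]:
        # New level = weighted average of the new observation and the old level
        level = alpha * value + (1.0 - alpha) * level
    return level


series = [10.0, 12.0, 11.0, 13.0]
level = simple_exponential_smoothing(series, alpha=0.5)
forecast = [level] * 3  # every future step gets the same value
print(level, forecast)
```

Trend and seasonal variants add analogous recursions for a slope and for seasonal indices.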

Prophet

Prophet (Taylor and Letham, 2018, Meta) was designed for business forecasting at scale. It decomposes time series into trend, seasonality (multiple periods), and holiday effects, fitted using a Bayesian approach. Prophet’s principal innovation was practical: it handles missing data gracefully, automatically detects changepoints in trend, and allows users to inject domain knowledge (holidays, known events) without statistical expertise. While no longer the most accurate option, Prophet remains one of the fastest paths from raw data to a reasonable forecast for business metrics.

from prophet import Prophet
import pandas as pd

# Prophet requires a DataFrame with 'ds' (date) and 'y' (value) columns.
# `dates` and `values` are placeholders for your own timestamps and observations.
df = pd.DataFrame({'ds': pd.to_datetime(dates), 'y': values})

model = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False,
    changepoint_prior_scale=0.05,  # Controls trend flexibility
)
model.add_country_holidays(country_name='US')
model.fit(df)

# Forecast 90 days ahead
future = model.make_future_dataframe(periods=90)
forecast = model.predict(future)

# forecast contains: yhat, yhat_lower, yhat_upper (prediction intervals)

StatsForecast: Classical Methods at Scale

The StatsForecast library from Nixtla warrants particular attention. It provides highly optimised implementations of classical methods (AutoARIMA, ETS, Theta, CES, MSTL), compiled with Numba, that Nixtla's benchmarks report as running tens to hundreds of times faster than common Python implementations such as pmdarima and Prophet. This speed advantage permits the fitting of individual models per time series across thousands of series, which often yields better results than a single complex model fitted globally.

from statsforecast import StatsForecast
from statsforecast.models import (
    AutoARIMA, AutoETS, AutoTheta, MSTL, SeasonalNaive
)

# Fit multiple models simultaneously across many series
sf = StatsForecast(
    models=[
        AutoARIMA(season_length=7),
        AutoETS(season_length=7),
        AutoTheta(season_length=7),
        MSTL(season_length=[7, 365]),  # Weekly + yearly seasonality
        SeasonalNaive(season_length=7),  # Baseline
    ],
    freq='D',
    n_jobs=-1,  # Parallelize across all CPU cores
)

# df must have columns: unique_id, ds, y
forecasts = sf.forecast(df=train_df, h=30)  # 30-day forecast

Gradient Boosting for Time Series: An Underused Practitioner Tool

An important fact about practical forecasting that often receives insufficient attention is that gradient-boosted decision trees—LightGBM, XGBoost, CatBoost—applied to time-series features often outperform both classical statistical models and deep learning on tabular-structured forecasting problems. This approach, sometimes referred to as “ML forecasting” or “feature-based forecasting,” operates by converting the time-series problem into a supervised regression problem.

The decisive step is feature engineering: instead of feeding raw time-series values to the model, the practitioner constructs features that capture temporal patterns:

import lightgbm as lgb
import pandas as pd
import numpy as np

def create_time_features(df, target_col='y', lags=(1, 7, 14, 28)):
    """Create temporal features for gradient boosting.

    Assumes a single series sorted by 'ds'; for many series, apply per series.
    """
    result = df.copy()

    # Calendar features
    result['dayofweek'] = result['ds'].dt.dayofweek
    result['month'] = result['ds'].dt.month
    result['dayofyear'] = result['ds'].dt.dayofyear
    result['weekofyear'] = result['ds'].dt.isocalendar().week.astype(int)
    result['is_weekend'] = (result['dayofweek'] >= 5).astype(int)

    # Lag features (past values)
    for lag in lags:
        result[f'lag_{lag}'] = result[target_col].shift(lag)

    # Rolling statistics
    for window in [7, 14, 30]:
        result[f'rolling_mean_{window}'] = (
            result[target_col].shift(1).rolling(window).mean()
        )
        result[f'rolling_std_{window}'] = (
            result[target_col].shift(1).rolling(window).std()
        )

    # Expanding mean (long-term average up to current point)
    result['expanding_mean'] = result[target_col].shift(1).expanding().mean()

    return result.dropna()

features_df = create_time_features(df)
feature_cols = [c for c in features_df.columns if c not in ['ds', 'y']]

model = lgb.LGBMRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=31,
    subsample=0.8,
    subsample_freq=1,  # Row subsampling only takes effect when this is > 0
)
model.fit(features_df[feature_cols], features_df['y'])

The reason this approach is effective is that gradient boosting captures complex nonlinear relationships between features—including interactions among calendar effects, lagged values, and rolling statistics that linear models cannot represent. Feature engineering renders the temporal structure explicit, allowing tree-based models to discover patterns such as “demand is high on Fridays in December when the previous week’s demand was above average”—patterns that require multiple conditional splits and that ARIMA cannot represent at all.
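Because lag and rolling features are built from past values, a model of this kind should be validated on splits that respect time order, with every training window ending before its test window begins. The following is a minimal sketch of expanding-window (rolling-origin) splits; the helper name `rolling_origin_splits` is hypothetical and not part of LightGBM.

```python
def rolling_origin_splits(n_obs, n_splits, horizon):
    """Yield (train_indices, test_indices) with training strictly before test."""
    first_cutoff = n_obs - n_splits * horizon
    if first_cutoff <= 0:
        raise ValueError("not enough observations for the requested splits")
    for k in range(n_splits):
        cutoff = first_cutoff + k * horizon
        # Train on everything before the cutoff, test on the next `horizon` rows
        yield range(0, cutoff), range(cutoff, cutoff + horizon)


splits = [(list(train), list(test))
          for train, test in rolling_origin_splits(n_obs=10, n_splits=3, horizon=2)]
for train, test in splits:
    print(f"train ends at {train[-1]}, test = {test}")
```

The indices can be applied to a feature frame sorted by date; a random shuffle split would leak future information into the lag features.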

Tip: In Kaggle forecasting competitions, LightGBM with careful feature engineering has a stronger winning record than any deep learning model. The combination is fast to train, reasonably easy to inspect (via feature importance), handles missing values natively, and scales well to millions of time series. For a production forecasting system without a clear starting point, LightGBM with temporal features is a strong default.

The Deep Learning Era: N-BEATS, N-HiTS, and TFT

N-BEATS: Neural Basis Expansion (2020)

N-BEATS (Oreshkin et al., 2020) was the first pure deep learning model to outperform the strongest statistical and hybrid entries on the M4 competition benchmark, including the hybrid ES-RNN winner, which made it a landmark result. Its architecture is elegantly simple: a deep stack of fully-connected blocks, each producing a partial forecast and a partial backcast (reconstruction of the input). Each block’s backcast is subtracted from its input before the next block sees it, and the final forecast is the sum of all blocks’ partial forecasts.
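The stacking logic can be sketched independently of the networks inside each block. In the sketch below the two toy blocks (`mean_block`, `last_value_block`) are hypothetical stand-ins for the learned fully-connected blocks; only the residual bookkeeping reflects the architecture described above.

```python
def run_stack(blocks, history, horizon):
    """Each block removes what it explains from the input and adds to the forecast."""
    residual = list(history)
    forecast = [0.0] * horizon
    for block in blocks:
        backcast, partial = block(residual, horizon)
        # Subtract the block's reconstruction so later blocks model what is left
        residual = [r - b for r, b in zip(residual, backcast)]
        # Accumulate the block's contribution to the final forecast
        forecast = [f + p for f, p in zip(forecast, partial)]
    return forecast, residual


def mean_block(x, horizon):
    """Toy block: explains the mean level of its input."""
    m = sum(x) / len(x)
    return [m] * len(x), [m] * horizon


def last_value_block(x, horizon):
    """Toy block: explains the most recent value of its input."""
    return [x[-1]] * len(x), [x[-1]] * horizon


forecast, residual = run_stack([mean_block, last_value_block],
                               history=[1.0, 2.0, 3.0, 6.0], horizon=2)
print(forecast, residual)
```

In the interpretable variant, the blocks would be constrained so that one stack produces a trend component and another a seasonal component.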

N-BEATS exists in two variants: a generic architecture in which blocks learn arbitrary basis functions, and an interpretable architecture in which blocks are constrained to learn trend and seasonality components, producing decompositions analogous to those of classical methods but with the expressiveness of deep learning. The interpretable variant is particularly valuable in business settings where stakeholders must understand why the model forecasts what it does.

N-HiTS: Hierarchical Interpolation (2023)

N-HiTS (Challu et al., 2023) extends N-BEATS with a multi-rate signal sampling approach. Different blocks in the stack process the input at different temporal resolutions: some blocks focus on long-term trends (downsampled signal), while others focus on short-term fluctuations (full-resolution signal). This hierarchical approach significantly improves long-horizon forecasting accuracy while reducing computational cost by a factor of three to five compared with N-BEATS.
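The multi-rate idea can be illustrated with max pooling at different kernel sizes: a large kernel gives a block a coarse, low-frequency view of the input, while a kernel of 1 leaves the full resolution. This is a simplified sketch (the helper `max_pool` is hypothetical, and trailing partial windows are simply dropped), not the library's implementation.

```python
def max_pool(series, kernel_size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    n_windows = len(series) // kernel_size
    return [max(series[i * kernel_size:(i + 1) * kernel_size])
            for i in range(n_windows)]


series = [1, 3, 2, 5, 4, 6, 8, 7]
# One view per block: coarse views for trend, fine views for short-term detail
views = {k: max_pool(series, k) for k in (4, 2, 1)}
for kernel_size, view in views.items():
    print(kernel_size, view)
```

Blocks that see the coarse views work on shorter inputs, which is where much of the computational saving comes from.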

Temporal Fusion Transformer (2021)

Temporal Fusion Transformer (TFT) (Lim et al., 2021, Google) is designed for the real-world complexity that pure time-series models ignore: it jointly processes static metadata (store location, product category), known future inputs (holidays, promotions, day of week), and observed past values. TFT uses attention mechanisms to learn which historical time steps are most relevant for each forecast horizon and produces interpretable multi-horizon forecasts with prediction intervals.

TFT’s architecture includes a variable selection network that learns which input features are most important, providing built-in feature importance that other deep models lack. For multi-horizon forecasting with rich covariate information, TFT remains one of the strongest available models.

DeepAR: Probabilistic Forecasting at Scale (2020)

DeepAR (Salinas et al., 2020, Amazon) takes a different approach: it trains a single autoregressive RNN model across all time series in a dataset, learning shared patterns while generating probabilistic (not point) forecasts. DeepAR outputs full probability distributions rather than single values, enabling decision-makers to reason about uncertainty rather than only expected outcomes.

DeepAR’s “global model” approach is especially powerful when individual series are short or sparse. A new product with only 10 days of sales data benefits from patterns learned across millions of other products. This cold-start capability is essential in retail and e-commerce forecasting.

PatchTST: When Vision Meets Time Series (ICLR 2023)

PatchTST (Nie et al., 2023) brought a key insight from computer vision to time-series forecasting. Rather than treating each time step as a separate token (computationally expensive and prone to attention dilution), PatchTST groups consecutive time steps into patches, analogously to the way Vision Transformers (ViT) group image pixels into patches.

A time series of 512 points, with a patch size of 16, becomes a sequence of 32 tokens, each representing a local temporal pattern. The transformer’s self-attention then operates over these 32 patches rather than 512 individual points, substantially reducing computational cost while preserving the model’s ability to capture long-range dependencies between patches.
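The arithmetic above can be checked with a small patching helper. This is a minimal sketch: `make_patches` is a hypothetical name, and the optional `stride` argument is included only to show that overlapping patches produce more tokens than the non-overlapping case described here.

```python
def make_patches(series, patch_len, stride=None):
    """Split a series into fixed-length patches; stride defaults to patch_len."""
    stride = patch_len if stride is None else stride
    if len(series) < patch_len:
        raise ValueError("series is shorter than one patch")
    last_start = len(series) - patch_len
    return [series[s:s + patch_len] for s in range(0, last_start + 1, stride)]


series = list(range(512))
patches = make_patches(series, patch_len=16)               # non-overlapping
overlapping = make_patches(series, patch_len=16, stride=8)  # 50% overlap

print(len(patches), len(patches[0]))  # number of tokens, points per token
print(len(overlapping))
```

Each patch is then linearly projected to the model dimension, so attention cost depends on the number of patches rather than the number of time steps.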

PatchTST also introduced channel-independent processing: in multivariate settings, each variable is processed by the same transformer backbone independently, with shared weights. This counterintuitive choice—ignoring cross-variable correlations—improves generalisation substantially for many datasets, because it prevents the model from overfitting to spurious inter-variable correlations in training data.
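In practice, channel independence is mostly a reshape: the variables are folded into the batch dimension so that one shared backbone sees each variable as a separate univariate series. The sketch below assumes NumPy and uses a trivial last-value forecast as a stand-in for the shared transformer; the helper names are hypothetical.

```python
import numpy as np


def to_channel_independent(x):
    """(batch, seq_len, n_vars) -> (batch * n_vars, seq_len)."""
    b, t, v = x.shape
    return x.transpose(0, 2, 1).reshape(b * v, t)


def from_channel_independent(y, batch, n_vars):
    """(batch * n_vars, horizon) -> (batch, horizon, n_vars)."""
    return y.reshape(batch, n_vars, -1).transpose(0, 2, 1)


batch, seq_len, n_vars, horizon = 2, 8, 3, 4
x = np.arange(batch * seq_len * n_vars, dtype=float).reshape(batch, seq_len, n_vars)

flat = to_channel_independent(x)                          # (6, 8)
# Stand-in for the shared backbone: repeat each series' last value
forecast_flat = np.repeat(flat[:, -1:], horizon, axis=1)  # (6, 4)
forecast = from_channel_independent(forecast_flat, batch, n_vars)  # (2, 4, 3)
print(flat.shape, forecast.shape)
```

Because the same weights process every row of `flat`, the backbone never sees two variables at the same time, which is the source of both the regularisation benefit and the lost cross-variable information.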

Model	Year	Architecture	Key Innovation	Best For
N-BEATS	2020	Fully connected stacks	Basis expansion, interpretable variant	Univariate, interpretability needed
DeepAR	2020	Autoregressive RNN	Global model, probabilistic output	Many related series, cold start
TFT	2021	Transformer + variable selection	Multi-horizon, rich covariates	Complex business forecasting
N-HiTS	2023	Hierarchical FC stacks	Multi-rate signal sampling	Long-horizon forecasting
PatchTST	2023	Patched Transformer	Patching + channel independence	Long-range multivariate

iTransformer: Inverting the Attention Paradigm (ICLR 2024)

iTransformer (Liu et al., 2024, Tsinghua) poses a pointed question: whether transformers have been applied to time series incorrectly to date.

In standard transformer-based forecasting, each time step is a token, and the model applies self-attention across time, with each time step attending to every other time step. The feed-forward layers process individual time-step features, while the attention mechanism captures temporal dependencies.

iTransformer inverts this arrangement: each variable (channel) becomes a token, and the entire time series of that variable becomes the token’s embedding. Self-attention now operates across variables, learning which variables are relevant to each other, while the feed-forward layers process temporal patterns within each variable.

This inversion is highly effective. On standard multivariate benchmarks (ETTh, ETTm, Weather, Electricity, Traffic), iTransformer achieves leading or near-leading results while being simpler to implement than many competitors. The implication is that, for multivariate forecasting, learning cross-variable relationships through attention is more important than learning temporal patterns through attention; temporal patterns can be captured adequately by simpler feed-forward networks.

# iTransformer conceptual structure (simplified)
# Standard Transformer: tokens = time steps, embedding = features
# iTransformer:          tokens = features,   embedding = time steps

import torch.nn as nn

class iTransformerLayer(nn.Module):
    def __init__(self, n_vars, seq_len, d_model, n_heads=8):
        # n_vars is kept for readability only; the number of variable
        # tokens is taken from the input at run time.
        super().__init__()
        # Project each variable's full time series into d_model dims
        self.embed = nn.Linear(seq_len, d_model)  # Shared across variables

        # Attention operates ACROSS variables (not time).
        # d_model must be divisible by n_heads.
        self.attention = nn.MultiheadAttention(
            d_model, num_heads=n_heads, batch_first=True
        )

        # FFN processes temporal patterns within each variable
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )

    def forward(self, x):
        # x: (batch, seq_len, n_vars)
        x = x.permute(0, 2, 1)           # (B, V, T)
        x = self.embed(x)                 # (B, V, D): one token per variable
        attn_out, _ = self.attention(x, x, x)  # Cross-variable attention
        x = x + attn_out
        x = x + self.ffn(x)              # Temporal pattern refinement
        return x                          # (B, V, D)

Foundation Models: Zero-Shot Forecasting Arrives

The paradigm shift that has drawn the most attention in the forecasting community is the emergence of foundation models capable of forecasting time series on which they were never trained. This capability is analogous to GPT’s ability to answer questions on topics it was not explicitly fine-tuned for: the model has learned general patterns of sequential data from substantial pre-training and applies those patterns to new inputs at inference time.

TimesFM (Google, 2024)

TimesFM is a 200M-parameter decoder-only transformer pre-trained on approximately 100 billion time points from Google Trends, Wikipedia page views, synthetic data, and various public datasets. Its architecture uses input patching (similar to PatchTST), with output patches that are longer than the input patches so that long horizons need fewer autoregressive decoding steps, and it is trained across a range of granularities and frequencies.

TimesFM’s zero-shot performance is notable: on datasets it has never previously seen, it approaches, and in some cases matches, supervised models trained specifically on those datasets. Google’s internal evaluations reportedly show TimesFM outperforming tuned ARIMA and ETS on 60% to 70% of retail forecasting series, without a single gradient update on retail data.

import timesfm

# Load the pre-trained model
tfm = timesfm.TimesFm(
    hparams=timesfm.TimesFmHparams(
        backend="gpu",
        per_core_batch_size=32,
        horizon_len=128,
    ),
    checkpoint=timesfm.TimesFmCheckpoint(
        huggingface_repo_id="google/timesfm-1.0-200m-pytorch"
    ),
)

# Zero-shot forecast — no training required
point_forecast, experimental_quantile_forecast = tfm.forecast(
    inputs=[historical_series_1, historical_series_2],  # List of arrays
    freq=[0, 0],  # 0=high-freq, 1=medium, 2=low
)
# Returns forecasts for all input series simultaneously

Chronos (Amazon, 2024)

Chronos tokenises continuous time-series values into discrete bins using mean scaling and quantisation, then applies a T5 language model architecture. By treating forecasting as a language problem—predicting the next token given the sequence so far—Chronos uses decades of NLP architecture innovations and training procedures.
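The scaling and quantisation steps can be sketched in a few lines. This is an illustration of the idea only: the bin range (-4 to 4), the bin count (8), and the helper names are assumptions chosen to keep the example readable, and the real model uses a much larger vocabulary plus special tokens.

```python
from bisect import bisect_right


def mean_scale(series):
    """Divide by the mean absolute value so series of any magnitude look alike."""
    scale = sum(abs(v) for v in series) / len(series)
    scale = scale if scale > 0 else 1.0  # guard against an all-zero series
    return [v / scale for v in series], scale


def make_bin_edges(low, high, n_bins):
    """Interior edges of n_bins equal-width bins covering [low, high]."""
    width = (high - low) / n_bins
    return [low + i * width for i in range(1, n_bins)]


def tokenize(series, edges):
    """Map each scaled value to the index of the bin it falls into."""
    scaled, scale = mean_scale(series)
    # Values outside the covered range fall into the first or last bin
    return [bisect_right(edges, v) for v in scaled], scale


edges = make_bin_edges(-4.0, 4.0, n_bins=8)
tokens, scale = tokenize([2.0, 4.0, 6.0, 0.0], edges)
print(tokens, scale)
```

At inference time the process runs in reverse: sampled token ids are mapped back to bin centres and multiplied by the stored scale to recover values in the original units.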

Chronos offers multiple sizes (20M to 710M parameters) and produces probabilistic forecasts natively, with each prediction representing a distribution over possible future values. The model is well suited to applications where uncertainty quantification matters (inventory planning, risk management, resource allocation).

A noteworthy feature is synthetic data augmentation during pre-training. Chronos generates millions of synthetic time series using Gaussian processes with diverse kernels, ensuring that the model has been exposed to a wide range of temporal patterns—seasonal, trending, noisy, smooth, and multi-scale—even where the real-world training data does not cover all of them.

Moirai (Salesforce, 2024)

Moirai (Woo et al., 2024) is a universal forecasting model designed to handle any time series regardless of frequency, number of variables, or forecast horizon. Its architecture addresses a key limitation of other foundation models: distribution shift across datasets.

Different time series have radically different scales and statistical properties. Server CPU usage ranges from 0 to 100%. Stock prices range from 1 to 5,000 USD. Energy consumption may be measured in megawatts. Moirai uses a mixture distribution output—predicting parameters of a mixture of distributions rather than point values—that adapts naturally to different scales and distributional shapes without manual normalisation.

Moirai also introduces Any-Variate Attention, which allows the model to process multivariate time series with arbitrary numbers of variables at inference time, even when the model was pre-trained on series of different dimensionality. This flexibility makes Moirai one of the most versatile foundation models available.

TSMixer and TimeMixer++ (2023-2025)

TSMixer (Google, 2023) demonstrated that a simple MLP-Mixer architecture, alternating between time-mixing (across time steps) and feature-mixing (across variables), achieves results competitive with transformers while being significantly faster. TimeMixer++ extends this with multi-scale decomposition, processing different frequency components through separate mixing paths.

These mixer-based architectures are particularly attractive for production deployment because their computational complexity scales linearly with sequence length (rather than quadratically as in standard attention), which makes them practical for very long context windows and high-frequency data.

Foundation Model	Organization	Parameters	Open Source	Output Type	Multivariate
TimesFM	Google	200M	Yes	Point + quantiles	Per-channel
Chronos	Amazon	20M–710M	Yes	Probabilistic	Per-channel
Moirai	Salesforce	14M–311M	Yes	Mixture distribution	Native multivariate
MOMENT	CMU	40M–385M	Yes	Point	Per-channel
TimeGPT	Nixtla	Undisclosed	No (API)	Point + intervals	Per-channel
Timer	Tsinghua	67M	Yes	Autoregressive	Per-channel

Caution: Foundation model hype is real, but so are their limitations. Most foundation models process each variable independently (per-channel) and do not capture cross-variable correlations. For problems in which inter-variable relationships are critical (for example, predicting energy demand from weather, price, and grid load), a trained multivariate model such as TFT or iTransformer may still outperform. Foundation models also struggle with domain-specific patterns they have not encountered in pre-training: a financial time series with quarterly earnings seasonality may not be well represented in pre-training data dominated by daily and weekly patterns.

Benchmarks: How Models Actually Compare

The most widely used benchmarks for long-term forecasting are the ETT datasets (Electricity Transformer Temperature), Weather, Electricity, and Traffic. The following table presents representative results using Mean Squared Error (MSE), where lower values are better, on standard prediction horizons.

Model	ETTh1 (96)	ETTh1 (720)	Weather (96)	Electricity (96)	Traffic (96)
ARIMA	0.423	0.618	0.284	0.227	0.662
N-HiTS	0.384	0.464	0.166	0.169	0.415
PatchTST	0.370	0.449	0.149	0.129	0.370
iTransformer	0.355	0.434	0.141	0.126	0.360
TimesFM (zero-shot)	0.391	0.478	0.168	0.155	0.410
Chronos-Base (zero-shot)	0.398	0.491	0.172	0.160	0.425

Numbers are approximate and representative. Lower MSE is better. (96) and (720) denote the forecast horizon length. Results compiled from published papers and reproductions.

Several patterns emerge from the benchmarks:

iTransformer and PatchTST lead among supervised models on most multivariate long-range benchmarks, with iTransformer holding a slight edge on datasets in which cross-variable correlations are important.
Foundation models (zero-shot) are competitive but do not yet surpass trained models. TimesFM and Chronos typically fall between classical methods and the best supervised deep models: a notable result given the absence of task-specific training, but not a dominant one. The gap narrows on datasets whose patterns are well represented in pre-training data.
Classical methods remain surprisingly strong on univariate series, particularly when combined with ensembling (averaging forecasts from AutoARIMA, ETS, and Theta). The overhead of deep learning is not always justified.
The performance gap widens at longer horizons. The advantage of deep models over classical methods is largest at prediction horizons of 336 steps or more, where complex temporal patterns compound and the assumptions of statistical models break down.

Practical Model Selection Guide

Given this landscape, how should a practitioner choose the right model for a given problem? The following decision framework draws on practical constraints:

Scenario 1: Quick deployment with no training-data infrastructure

Use: Foundation model (Chronos or TimesFM) in zero-shot mode

When forecasts are required immediately and investment in a training pipeline is not feasible, foundation models deliver competitive accuracy with no setup. Install the library, feed in the data, and obtain forecasts. This option is well suited to proofs of concept, new data streams, and situations in which the cost of deploying a custom model exceeds the cost of slightly reduced accuracy.

Scenario 2: Thousands of univariate series, where speed and reliability are required

Use: StatsForecast (AutoARIMA + AutoETS + AutoTheta ensemble)

For large-scale retail demand forecasting, financial time series, or IoT monitoring in which each series is relatively independent, fitting per-series statistical models is fast, reliable, and often the most accurate approach. StatsForecast’s optimised implementations make this feasible even for millions of series.

Scenario 3: Multivariate with rich covariates (promotions, holidays, metadata)

Use: Temporal Fusion Transformer or LightGBM with temporal features

When the forecast depends on external factors—promotional calendars, weather forecasts, economic indicators, or product attributes—a model that ingests covariates natively is required. TFT handles this elegantly with built-in variable selection. LightGBM with engineered features is faster to iterate and often equally accurate.

Scenario 4: Long-horizon multivariate forecasting where accuracy is paramount

Use: iTransformer or PatchTST

For applications in which prediction accuracy directly affects high-value decisions (energy trading, infrastructure capacity planning, financial risk management), investment in training a supervised deep model on historical data is appropriate. iTransformer and PatchTST represent the current accuracy frontier for long-range multivariate forecasting.

Scenario 5: Uncertainty quantification is critical

Use: Chronos (probabilistic) or DeepAR

When prediction intervals are required rather than only point forecasts, Chronos provides calibrated probabilistic forecasts out of the box, and DeepAR produces full probability distributions trained on the user’s specific data. These methods are essential for inventory optimisation (balancing stockout against overstock risk) and financial risk management.
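Quantile forecasts from either model are usually scored with the pinball (quantile) loss, which penalises under- and over-prediction asymmetrically. The following is a minimal sketch with hypothetical data; it is independent of the Chronos and DeepAR APIs.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball loss of quantile forecasts y_pred at quantile level q."""
    total = 0.0
    for y, f in zip(y_true, y_pred):
        diff = y - f
        # Under-prediction (diff >= 0) costs q per unit,
        # over-prediction costs (1 - q) per unit.
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


actuals = [10.0, 12.0, 9.0]
p90_forecast = [12.0, 12.0, 8.0]  # hypothetical 90th-percentile forecasts
loss = pinball_loss(actuals, p90_forecast, q=0.9)
print(round(loss, 4))
```

For a 90th-percentile forecast, falling below the actual value is nine times as costly per unit as exceeding it, which mirrors an inventory setting where stockouts are more expensive than overstock.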

Tip: The most reliable practical advice for forecasting accuracy is to ensemble. Averaging forecasts from three to five diverse models (for example a statistical model, a gradient boosting model, and a deep learning model) typically matches or beats the best individual model and is more robust than any one of them. The M-series competitions have demonstrated this repeatedly. Ensembling is unglamorous, but few other practices improve results as dependably.
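Combining forecasts needs very little code. The sketch below averages per-step forecasts from several models and also shows the median as a more outlier-resistant alternative; the model names and values are hypothetical.

```python
from statistics import mean, median


def ensemble(forecasts, combine=mean):
    """Combine same-length forecasts from several models step by step."""
    lengths = {len(f) for f in forecasts.values()}
    if len(lengths) != 1:
        raise ValueError("all forecasts must cover the same horizon")
    # zip(*...) groups the values that every model predicted for the same step
    return [combine(step) for step in zip(*forecasts.values())]


forecasts = {
    "auto_ets": [100.0, 102.0, 104.0],
    "lightgbm": [98.0, 101.0, 107.0],
    "nhits": [105.0, 103.0, 101.0],
}
average = ensemble(forecasts)
robust = ensemble(forecasts, combine=median)
print(average, robust)
```

A simple average is a reasonable default; weighted schemes need a held-out period to estimate the weights and can overfit when that period is short.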

Implementation: End-to-End Forecasting Pipeline

A complete forecasting pipeline involves far more than model selection. A representative production-style pipeline looks as follows:

# Production forecasting pipeline using NeuralForecast + StatsForecast
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS, PatchTST
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, AutoETS, AutoTheta
import pandas as pd

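# Step 1: Data preparation
# df is assumed to be a long-format DataFrame already in memory with
# columns unique_id (series key), ds (timestamp), and y (target)
df['ds'] = pd.to_datetime(df['ds'])  # compare dates as timestamps, not strings
horizon = 30  # 30-day forecast
cutoff = pd.Timestamp('2026-01-01')
train_df = df[df['ds'] < cutoff]
# Keep one horizon of actuals so the test set covers the forecast dates
test_df = df[(df['ds'] >= cutoff) & (df['ds'] < cutoff + pd.Timedelta(days=horizon))]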

# Step 2: Statistical models (fast, per-series)
sf = StatsForecast(
    models=[
        AutoARIMA(season_length=7),
        AutoETS(season_length=7),
        AutoTheta(season_length=7),
    ],
    freq='D',
    n_jobs=-1,
)
stat_forecasts = sf.forecast(df=train_df, h=horizon)

# Step 3: Deep learning models (slower, more expressive)
nf = NeuralForecast(
    models=[
        NHITS(
            input_size=180,
            h=horizon,
            max_steps=1000,
            n_pool_kernel_size=[4, 4, 4],
        ),
        PatchTST(
            input_size=512,
            h=horizon,
            max_steps=1000,
            patch_len=16,
        ),
    ],
    freq='D',
)
nf.fit(df=train_df)
neural_forecasts = nf.predict()

# Step 4: Ensemble (simple average — often the best approach)
combined = stat_forecasts.merge(neural_forecasts, on=['unique_id', 'ds'])
model_cols = [c for c in combined.columns
              if c not in ['unique_id', 'ds']]
combined['ensemble'] = combined[model_cols].mean(axis=1)

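# Step 5: Evaluate
# utilsforecast losses take a DataFrame plus model column names and return
# one score per series, so join actuals to forecasts on (unique_id, ds) first
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mae, mse, smape
eval_df = test_df.merge(
    combined[['unique_id', 'ds', 'ensemble']], on=['unique_id', 'ds']
)
scores = evaluate(eval_df, metrics=[mae, mse, smape], models=['ensemble'])
evaluation = scores.groupby('metric')['ensemble'].mean()  # average across series
print(f"Ensemble performance:\n{evaluation}")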

Important pipeline components beyond the model include:

Data quality checks. Missing values, duplicates, timezone inconsistencies, and outliers in training data directly degrade forecast quality. Automated data validation before model training is essential. If the time-series data originates from InfluxDB, an InfluxDB-to-Iceberg pipeline with Telegraf can centralise and validate data before it reaches the models.
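Cross-validation for time series. Random train-test splits should never be used for time series, because they let the model train on observations that come after the ones it is tested on. Use expanding-window or sliding-window cross-validation that respects temporal ordering. StatsForecast and NeuralForecast both expose a cross_validation method for this, and utilsforecast supplies the evaluation metrics.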
Forecast reconciliation. When forecasts exist at multiple hierarchical levels (store, region, national), they must be coherent: the sum of store forecasts should equal the regional forecast. Methods such as MinTrace reconciliation ensure consistency.
Backtesting and monitoring. Production forecasts must be continuously evaluated against actuals. Forecast accuracy that degrades over time, owing to concept drift, data pipeline issues, or regime changes, requires automated detection and model-retraining triggers.
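The expanding-window scheme described in the cross-validation item can be sketched in a few lines. This is a minimal illustration rather than any library's implementation: the function name `expanding_window_splits` and its parameters are hypothetical, and it works on integer positions of a single series.

```python
from typing import Iterator, List, Optional, Tuple


def expanding_window_splits(
    n_obs: int, horizon: int, n_windows: int, step: Optional[int] = None
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs that respect temporal order.

    Every training window starts at position 0 and ends right before its
    test window, so no future observation leaks into training.
    """
    step = horizon if step is None else step
    first_cutoff = n_obs - horizon - (n_windows - 1) * step
    if first_cutoff <= 0:
        raise ValueError("Series too short for the requested windows")
    for w in range(n_windows):
        cutoff = first_cutoff + w * step
        yield list(range(cutoff)), list(range(cutoff, cutoff + horizon))


splits = list(expanding_window_splits(n_obs=100, horizon=10, n_windows=3))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

In practice the `cross_validation` methods in StatsForecast and NeuralForecast handle this per series through their `n_windows` and `step_size` arguments. The sketch only makes the splitting logic visible.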

The Future of Forecasting

Time-series forecasting sits at an interesting juncture. Classical methods remain competitive for many problems. Deep learning models set the accuracy frontier for complex, multivariate, long-horizon tasks. Foundation models promise to make forecasting more broadly accessible by eliminating the need for per-dataset training. Meanwhile, gradient boosting often outperforms all of them on real-world, feature-rich problems. For teams building production systems, pairing forecasting with Apache Kafka for multivariate time-series streaming provides the real-time data backbone these models require.

Several trends will shape the next wave of innovation:

Foundation model fine-tuning is bridging the gap between zero-shot and fully supervised performance. The pattern is to pre-train on billions of diverse time points and then fine-tune on a specific domain with only a few hundred data points. Early results indicate that fine-tuned Chronos and TimesFM can match or exceed fully supervised models using only a fraction of the training data.

Conformal prediction for calibrated uncertainty is replacing ad hoc prediction interval methods. Conformal prediction provides distribution-free coverage guarantees: when 95% intervals are requested, they contain the true value at least 95% of the time on average, provided the calibration and test data are exchangeable. Time series often violate that assumption, which is why time-series variants such as EnbPI exist. Libraries such as MAPIE, which includes an EnbPI implementation, make this practical for production use.
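To make the idea concrete, here is a minimal sketch of split conformal prediction, the simplest variant. It assumes a point forecaster already exists and that its residuals on a held-out calibration set are available. The function name `conformal_half_width` and the toy Gaussian residuals are illustrative assumptions, not part of MAPIE or any other library.

```python
import math
import random


def conformal_half_width(calib_residuals, alpha=0.05):
    """Half-width of a split-conformal interval from held-out residuals.

    Takes the ceil((n + 1) * (1 - alpha))-th smallest absolute residual,
    which is the finite-sample corrected quantile.
    """
    scores = sorted(abs(r) for r in calib_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return scores[rank - 1]


# Toy data: a hypothetical forecaster whose errors are N(0, 1) noise.
rng = random.Random(0)
calib_residuals = [rng.gauss(0, 1) for _ in range(500)]
test_residuals = [rng.gauss(0, 1) for _ in range(2000)]

q = conformal_half_width(calib_residuals, alpha=0.05)
# The interval for any new point forecast f is [f - q, f + q].
coverage = sum(abs(r) <= q for r in test_residuals) / len(test_residuals)
print(f"half-width={q:.2f}, empirical coverage={coverage:.3f}")
```

The toy residuals are independent and identically distributed, so the coverage guarantee applies. On real series with drift or autocorrelation, coverage can fall below the nominal level unless a time-series-aware variant is used.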

LLM-enhanced forecasting is an emerging research direction in which large language models augment numerical forecasts with textual context. A model that incorporates information such as “Black Friday is next week” or “a competitor has announced a price cut”—information present in text but not in numerical time-series history—can produce forecasts that purely numerical models cannot match. Early papers from Amazon and Google report promising results for retail demand forecasting.

Real-time adaptive models that continuously update their parameters as new data arrives (online learning) are becoming practical for streaming applications. Rather than periodic batch retraining, the model learns from each new observation in real time, automatically adapting to concept drift without human intervention.
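The simplest possible instance of this pattern is a one-parameter model that takes a gradient step after every observation, which is equivalent to simple exponential smoothing. The sketch below uses a hypothetical class name (`OnlineLevelModel`) and a synthetic stream with an artificial regime change. Production online learners are considerably more elaborate.

```python
import random


class OnlineLevelModel:
    """One-parameter forecaster updated after every observation.

    The update is a stochastic-gradient step on squared error, which is
    equivalent to simple exponential smoothing.
    """

    def __init__(self, learning_rate=0.2):
        self.learning_rate = learning_rate
        self.level = None

    def predict(self):
        """Return the one-step-ahead forecast (None before any data)."""
        return self.level

    def update(self, y):
        if self.level is None:
            self.level = y  # initialise on the first observation
        else:
            self.level += self.learning_rate * (y - self.level)


rng = random.Random(42)
# Synthetic stream with a regime change: the mean jumps from 10 to 20 at t=100.
stream = [(10.0 if t < 100 else 20.0) + rng.gauss(0, 0.5) for t in range(200)]

model = OnlineLevelModel(learning_rate=0.2)
forecast_before_shift = None
for t, y in enumerate(stream):
    if t == 100:
        forecast_before_shift = model.predict()  # last forecast of the old regime
    model.update(y)
forecast_after_shift = model.predict()
print(f"before shift: {forecast_before_shift:.2f}, after: {forecast_after_shift:.2f}")
```

The learning rate controls the trade-off: higher values adapt to drift faster but pass more noise through to the forecast.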

The most important practical lesson from the current landscape is that the best forecasting system is not the best model but the best pipeline. Data quality, feature engineering, cross-validation, ensembling, monitoring, and retraining together determine forecast accuracy more than any individual model choice. Teams that invest in pipeline infrastructure consistently outperform teams that chase the latest model architecture. The recommended approach is to begin with a simple, well-engineered pipeline and add complexity only when measured accuracy improvements justify it. A seasonal naive baseline should always be used as a reference point, since even the most sophisticated model is of little value if it cannot improve on “same as last week.”
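The seasonal naive reference point costs almost nothing to compute. The sketch below is a plain-Python illustration: the function names and the toy weekly series are made up for the example.

```python
def seasonal_naive(history, season_length, horizon):
    """Repeat the last full season of `history` over the forecast horizon."""
    if len(history) < season_length:
        raise ValueError("Need at least one full season of history")
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


def mean_absolute_error(actual, forecast):
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


# Toy daily series with a weekly pattern (values are made up).
history = [10, 12, 13, 12, 15, 20, 18] * 4
actual_next_week = [11, 12, 14, 12, 16, 21, 17]

baseline = seasonal_naive(history, season_length=7, horizon=7)
baseline_mae = mean_absolute_error(actual_next_week, baseline)
print(f"seasonal naive forecast: {baseline}")
print(f"baseline MAE: {baseline_mae:.3f}")
```

Any candidate model should be scored on the same test window and kept only if it beats this number. StatsForecast ships the same baseline as `SeasonalNaive(season_length=7)`, so it can sit alongside the other models in the pipeline.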

References

Nie, Yuqi, et al. “A Time Series is Worth 64 Words: Long-term Forecasting with Transformers.” (PatchTST) ICLR 2023.
Liu, Yong, et al. “iTransformer: Inverted Transformers Are Effective for Time Series Forecasting.” ICLR 2024.
Das, Abhimanyu, et al. “A Decoder-Only Foundation Model for Time-Series Forecasting.” (TimesFM) ICML 2024.
Ansari, Abdul Fatir, et al. “Chronos: Learning the Language of Time Series.” arXiv:2403.07815, 2024.
Woo, Gerald, et al. “Unified Training of Universal Time Series Forecasting Transformers.” (Moirai) ICML 2024.
Oreshkin, Boris N., et al. “N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting.” ICLR 2020.
Challu, Cristian, et al. “N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting.” AAAI 2023.
Lim, Bryan, et al. “Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting.” International Journal of Forecasting, 2021.
Salinas, David, et al. “DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks.” International Journal of Forecasting, 2020.
Goswami, Mononito, et al. “MOMENT: A Family of Open Time-Series Foundation Models.” ICML 2024.
Wu, Haixu, et al. “TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis.” ICLR 2023.
Taylor, Sean J. and Benjamin Letham. “Forecasting at Scale.” (Prophet) The American Statistician, 2018.
NeuralForecast GitHub: Production deep learning forecasting
StatsForecast GitHub: Lightning-fast statistical forecasting
Time-Series-Library (THU): Unified deep learning framework
Chronos GitHub Repository
TimesFM GitHub Repository
