
Time-Series Forecasting in 2026: From ARIMA to Foundation Models — A Complete Guide

In March 2021, the container ship Ever Given wedged itself sideways in the Suez Canal, blocking 12% of global trade for six days. The economic damage exceeded $54 billion. Supply chain managers across the world scrambled to re-route shipments, adjust inventory forecasts, and estimate when normal flow would resume. The companies that weathered the crisis best weren’t the ones with the largest inventories — they were the ones with the most accurate demand forecasting models, the ones that could recalculate their entire supply chain within hours rather than weeks.

Time-series forecasting — the task of predicting future values based on historical observations — is the quantitative backbone of decision-making across nearly every industry. Retailers forecast demand to stock shelves. Energy companies forecast load to schedule generation. Financial institutions forecast volatility to price options. Hospitals forecast patient admissions to staff wards. The accuracy of these forecasts directly determines whether resources are allocated efficiently or wasted catastrophically.

The field has undergone a dramatic transformation since 2022. For decades, ARIMA and exponential smoothing dominated. Then came deep learning architectures — N-BEATS, Temporal Fusion Transformers, DeepAR — that challenged classical methods on complex, multivariate problems. Now, in 2025-2026, we’re witnessing the most significant shift yet: foundation models pre-trained on billions of time points that can forecast series they’ve never seen before, without any task-specific training. The implications for practitioners are profound — and the confusion about which model to actually use has never been greater.

This guide cuts through that confusion. We’ll trace the evolution from classical methods through deep learning to the current frontier, benchmark the models that matter, and give you a practical framework for choosing the right approach for your specific problem. No hype. No hand-waving. Just what works, what doesn’t, and why.

Why Time-Series Forecasting Matters More Than Ever

The volume of time-stamped data generated globally has exploded. IoT sensors, financial markets, application telemetry, social media engagement metrics, weather stations, wearable health devices — all produce continuous streams of sequential observations. The International Data Corporation estimates that the global datasphere will exceed 180 zettabytes by 2025, and a significant portion of that data is temporal.

But volume alone doesn’t explain why forecasting has become more critical. Three structural trends are driving increased demand for accurate predictions:

Just-in-time everything. Modern supply chains, cloud infrastructure, and service delivery systems operate with minimal slack. Amazon’s fulfillment network, Uber’s driver allocation, Netflix’s content delivery — all depend on accurate short-term forecasts to match supply with demand in near real-time. When forecasts are wrong by even 10%, the result is either costly over-provisioning or customer-visible failures.

Renewable energy integration. As solar and wind generation grow from supplementary to primary energy sources, grid operators must forecast intermittent generation with high accuracy to maintain grid stability. A 5% error in solar generation forecast for a large grid can mean the difference between smooth operation and emergency natural gas peaking — costing millions of dollars and producing unnecessary emissions.

Algorithmic decision-making at scale. Automated systems — from algorithmic trading to dynamic pricing to autonomous vehicle planning — consume forecasts as inputs to decisions that execute without human review. The quality ceiling of these automated systems is bounded by the accuracy of their underlying forecasts.

Key Takeaway: Time-series forecasting has evolved from a planning exercise done quarterly by analysts into an operational capability that runs continuously, feeds automated systems, and directly impacts revenue and reliability. The bar for accuracy — and the cost of inaccuracy — has never been higher.

Classical Foundations That Still Work

Before diving into transformers and foundation models, it’s essential to acknowledge that classical statistical methods remain remarkably competitive for many forecasting problems. The M5 competition (2020) and subsequent analyses have repeatedly shown that simple methods, properly tuned, often match or beat complex deep learning models on univariate and low-dimensional problems.

ARIMA and SARIMA

AutoRegressive Integrated Moving Average (ARIMA) models capture three components of a time series: autoregressive behavior (current values depend on past values), differencing (to achieve stationarity), and moving average effects (current values depend on past forecast errors). The seasonal variant, SARIMA, adds explicit seasonal terms.

ARIMA’s strength is its strong theoretical foundation and interpretability — every parameter has a clear statistical meaning. Its weakness is that it assumes linear relationships and handles only univariate series. For a single well-behaved time series with clear trend and seasonality (monthly sales, daily temperature), ARIMA remains a strong, fast, and interpretable baseline.

Exponential Smoothing (ETS)

Exponential Smoothing State Space models (ETS) decompose a time series into error, trend, and seasonal components, each of which can be additive or multiplicative. The Holt-Winters method — a specific ETS configuration with additive or multiplicative trend and seasonality — is one of the most widely deployed forecasting models in industry, particularly in retail demand planning.

Prophet

Prophet (Taylor & Letham, 2018, Meta) was designed for business forecasting at scale. It decomposes time series into trend, seasonality (multiple periods), and holiday effects, fitted using a Bayesian approach. Prophet’s key innovation was practical: it handles missing data gracefully, automatically detects changepoints in trend, and allows users to inject domain knowledge (holidays, known events) without statistical expertise. While it’s no longer state-of-the-art in accuracy, Prophet remains one of the fastest paths from raw data to a reasonable forecast for business metrics.

from prophet import Prophet
import pandas as pd

# Prophet requires a DataFrame with 'ds' (date) and 'y' (value) columns
df = pd.DataFrame({'ds': dates, 'y': values})  # your timestamps and observations

model = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False,
    changepoint_prior_scale=0.05,  # Controls trend flexibility
)
model.add_country_holidays(country_name='US')
model.fit(df)

# Forecast 90 days ahead
future = model.make_future_dataframe(periods=90)
forecast = model.predict(future)

# forecast contains: yhat, yhat_lower, yhat_upper (prediction intervals)

StatsForecast: Classical Methods at Scale

The StatsForecast library from Nixtla deserves special mention. It provides highly optimized implementations of classical methods (AutoARIMA, ETS, Theta, CES, MSTL) that run 100-1000x faster than traditional implementations. This speed advantage means you can fit individual models per time series across thousands of series — often yielding better results than a single complex model fitted globally.

from statsforecast import StatsForecast
from statsforecast.models import (
    AutoARIMA, AutoETS, AutoTheta, MSTL, SeasonalNaive
)

# Fit multiple models simultaneously across many series
sf = StatsForecast(
    models=[
        AutoARIMA(season_length=7),
        AutoETS(season_length=7),
        AutoTheta(season_length=7),
        MSTL(season_length=[7, 365]),  # Weekly + yearly seasonality
        SeasonalNaive(season_length=7),  # Baseline
    ],
    freq='D',
    n_jobs=-1,  # Parallelize across all CPU cores
)

# df must have columns: unique_id, ds, y
forecasts = sf.forecast(df=train_df, h=30)  # 30-day forecast

Gradient Boosting for Time Series: The Practitioner’s Secret Weapon

One of the best-kept secrets in practical forecasting is that gradient-boosted decision trees — LightGBM, XGBoost, CatBoost — applied to time-series features often outperform both classical statistical models and deep learning on tabular-structured forecasting problems. This approach, sometimes called “ML forecasting” or “feature-based forecasting,” works by converting the time-series problem into a supervised regression problem.

The key is feature engineering: instead of feeding raw time-series values to the model, you construct features that capture temporal patterns:

import lightgbm as lgb
import pandas as pd
import numpy as np

def create_time_features(df, target_col='y', lags=(1, 7, 14, 28)):
    """Create temporal features for gradient boosting."""
    result = df.copy()

    # Calendar features
    result['dayofweek'] = result['ds'].dt.dayofweek
    result['month'] = result['ds'].dt.month
    result['dayofyear'] = result['ds'].dt.dayofyear
    result['weekofyear'] = result['ds'].dt.isocalendar().week.astype(int)
    result['is_weekend'] = (result['dayofweek'] >= 5).astype(int)

    # Lag features (past values)
    for lag in lags:
        result[f'lag_{lag}'] = result[target_col].shift(lag)

    # Rolling statistics
    for window in [7, 14, 30]:
        result[f'rolling_mean_{window}'] = (
            result[target_col].shift(1).rolling(window).mean()
        )
        result[f'rolling_std_{window}'] = (
            result[target_col].shift(1).rolling(window).std()
        )

    # Expanding mean (long-term average up to current point)
    result['expanding_mean'] = result[target_col].shift(1).expanding().mean()

    return result.dropna()

features_df = create_time_features(df)
feature_cols = [c for c in features_df.columns if c not in ['ds', 'y']]

model = lgb.LGBMRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=31,
    subsample=0.8,
)
model.fit(features_df[feature_cols], features_df['y'])

Why does this work so well? Gradient boosting excels at learning complex non-linear relationships between features — including interactions between calendar effects, lagged values, and rolling statistics that linear models can’t capture. The feature engineering makes the temporal structure explicit, allowing tree-based models to discover patterns like “demand is high on Fridays in December when last week’s demand was above average” — patterns that require multiple conditional splits and that ARIMA fundamentally cannot represent.
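One detail the snippet above leaves open is how to produce multi-step forecasts, since lag features require values that don’t exist yet at prediction time. A common answer is recursive one-step-ahead forecasting: predict one step, append the prediction to the history, rebuild the features, and repeat. A model-agnostic sketch follows; the `MeanOfLags` stub is a hypothetical stand-in for any fitted regressor with a `predict` method.

```python
import numpy as np

class MeanOfLags:
    """Stand-in for any fitted regressor with a .predict() method."""
    def predict(self, X):
        return X.mean(axis=1)

def recursive_forecast(model, y_history, horizon, lags=(1, 7)):
    """Multi-step forecast via recursive one-step-ahead prediction.

    The model must have been trained on lag features in the same
    column order as `lags`.
    """
    history = list(y_history)
    preds = []
    for _ in range(horizon):
        # Lag features come from the growing history, so each step
        # conditions on the previous step's prediction
        x = np.array([[history[-lag] for lag in lags]])
        yhat = float(model.predict(x)[0])
        preds.append(yhat)
        history.append(yhat)
    return np.array(preds)

preds = recursive_forecast(MeanOfLags(), [1.0] * 10, horizon=5)
```

The alternative is the direct strategy: train a separate model per horizon step. Recursive forecasting needs only one model but lets errors compound; direct forecasting avoids compounding at the cost of training many models.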

Tip: In Kaggle time-series competitions, LightGBM with careful feature engineering has won more forecasting competitions than any deep learning model. The combination is fast to train, easy to interpret (via feature importance), handles missing data natively, and scales well to millions of time series. If you’re building a production forecasting system and don’t know where to start, LightGBM with temporal features is a strong default.

The Deep Learning Era: N-BEATS, N-HiTS, and TFT

N-BEATS: Neural Basis Expansion (2020)

N-BEATS (Oreshkin et al., 2020) was the first pure deep learning model to conclusively beat statistical methods on the M4 competition benchmark — a landmark result, since the 2018 M4 winner itself was a hybrid of exponential smoothing and an RNN. Its architecture is elegantly simple: a deep stack of fully-connected blocks, each producing a partial forecast and a partial backcast (reconstruction of the input). The final forecast is the sum of all blocks’ partial forecasts.

N-BEATS comes in two variants: a generic architecture where blocks learn arbitrary basis functions, and an interpretable architecture where blocks are constrained to learn trend and seasonality components — producing decompositions similar to classical methods but with deep learning’s expressiveness. The interpretable variant is particularly valuable in business settings where stakeholders need to understand why the model forecasts what it does.
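A simplified generic N-BEATS stack fits in a few dozen lines of PyTorch. This is a sketch of the doubly-residual structure only; it omits the basis-function constraints of the interpretable variant and the multi-stack composition of the full model.

```python
import torch
import torch.nn as nn

class NBeatsBlock(nn.Module):
    """Generic N-BEATS block: FC stack emitting a backcast and a forecast."""
    def __init__(self, input_size, horizon, hidden=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.backcast_head = nn.Linear(hidden, input_size)
        self.forecast_head = nn.Linear(hidden, horizon)

    def forward(self, x):
        h = self.fc(x)
        return self.backcast_head(h), self.forecast_head(h)

class NBeatsStack(nn.Module):
    """Each block explains part of the input; partial forecasts are summed."""
    def __init__(self, input_size, horizon, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [NBeatsBlock(input_size, horizon) for _ in range(n_blocks)]
        )

    def forward(self, x):
        residual = x
        forecast = 0
        for block in self.blocks:
            backcast, partial = block(residual)
            residual = residual - backcast  # Remove what this block explained
            forecast = forecast + partial   # Accumulate partial forecasts
        return forecast

x = torch.randn(8, 96)            # batch of 8 series, 96-step lookback
yhat = NBeatsStack(96, 24)(x)     # 24-step forecast
```

The interpretable variant replaces the free `backcast_head`/`forecast_head` projections with fixed polynomial (trend) and Fourier (seasonality) bases, so each stack’s output is a readable component.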

N-HiTS: Hierarchical Interpolation (2023)

N-HiTS (Challu et al., 2023) extends N-BEATS with a multi-rate signal sampling approach. Different blocks in the stack process the input at different temporal resolutions — some blocks focus on long-term trends (downsampled signal), while others focus on short-term fluctuations (full-resolution signal). This hierarchical approach significantly improves long-horizon forecasting accuracy while reducing computational cost by 3-5x compared to N-BEATS.

Temporal Fusion Transformer (2021)

Temporal Fusion Transformer (TFT) (Lim et al., 2021, Google) is designed for the real-world complexity that pure time-series models ignore: it jointly processes static metadata (store location, product category), known future inputs (holidays, promotions, day of week), and observed past values. TFT uses attention mechanisms to learn which historical time steps are most relevant for each forecast horizon and produces interpretable multi-horizon forecasts with prediction intervals.

TFT’s architecture includes a variable selection network that learns which input features are most important — providing built-in feature importance that other deep models lack. For multi-horizon forecasting with rich covariate information, TFT remains one of the strongest available models.

DeepAR: Probabilistic Forecasting at Scale (2020)

DeepAR (Salinas et al., 2020, Amazon) takes a different approach: it trains a single autoregressive RNN model across all time series in a dataset, learning shared patterns while generating probabilistic (not point) forecasts. DeepAR outputs full probability distributions, not single values — enabling decision-makers to reason about uncertainty, not just expected outcomes.

DeepAR’s “global model” approach is especially powerful when individual series are short or sparse. A new product with only 10 days of sales data benefits from patterns learned across millions of other products. This cold-start capability is essential in retail and e-commerce forecasting.

PatchTST: When Vision Meets Time Series (ICLR 2023)

PatchTST (Nie et al., 2023) brought a transformative insight from computer vision to time-series forecasting: instead of treating each time step as a separate token (computationally expensive and prone to attention dilution), PatchTST groups consecutive time steps into patches — analogous to how Vision Transformers (ViT) group image pixels into patches.

A time series of 512 points, with a patch size of 16, becomes a sequence of 32 tokens — each representing a local temporal pattern. The transformer’s self-attention then operates over these 32 patches rather than 512 individual points, dramatically reducing computational cost while preserving the model’s ability to capture long-range dependencies between patches.
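When patches don’t overlap, the patching arithmetic is a plain reshape. PatchTST itself slides a window with a stride smaller than the patch length, which yields overlapping patches; the stride of 8 below is illustrative.

```python
import numpy as np

series = np.arange(512, dtype=float)   # one univariate series
patch_len = 16

# Non-overlapping patches: 512 / 16 = 32 tokens of 16 values each
patches = series.reshape(-1, patch_len)
print(patches.shape)  # (32, 16)

# Overlapping patches via a sliding window with stride 8
overlapping = np.lib.stride_tricks.sliding_window_view(series, patch_len)[::8]
print(overlapping.shape)  # (63, 16)
```

Either way, the transformer attends over a few dozen patch tokens instead of 512 time steps, which is where the quadratic attention savings come from.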

PatchTST also introduced channel-independent processing: in multivariate settings, each variable is processed by the same transformer backbone independently, with shared weights. This counterintuitive choice — ignoring cross-variable correlations — turns out to improve generalization significantly for many datasets, because it prevents the model from overfitting to spurious inter-variable correlations in training data.

| Model | Year | Architecture | Key Innovation | Best For |
| --- | --- | --- | --- | --- |
| N-BEATS | 2020 | Fully connected stacks | Basis expansion, interpretable variant | Univariate, interpretability needed |
| DeepAR | 2020 | Autoregressive RNN | Global model, probabilistic output | Many related series, cold start |
| TFT | 2021 | Transformer + variable selection | Multi-horizon, rich covariates | Complex business forecasting |
| N-HiTS | 2023 | Hierarchical FC stacks | Multi-rate signal sampling | Long-horizon forecasting |
| PatchTST | 2023 | Patched Transformer | Patching + channel independence | Long-range multivariate |

iTransformer: Inverting the Attention Paradigm (ICLR 2024)

iTransformer (Liu et al., 2024, Tsinghua) asks a provocative question: what if transformers have been applied to time series incorrectly all along?

In standard transformer-based forecasting, each time step is a token, and the model applies self-attention across time — each time step attends to every other time step. This means the feed-forward layers process individual time-step features, and the attention mechanism captures temporal dependencies.

iTransformer inverts this: each variable (channel) becomes a token, and the entire time series of that variable becomes the token’s embedding. Self-attention now operates across variables — learning which variables are relevant to each other — while the feed-forward layers process temporal patterns within each variable.

This inversion is surprisingly effective. On standard multivariate benchmarks (ETTh, ETTm, Weather, Electricity, Traffic), iTransformer achieves state-of-the-art or near-state-of-the-art results while being simpler to implement than many competitors. The insight it validates: for multivariate forecasting, learning cross-variable relationships through attention is more important than learning temporal patterns through attention — temporal patterns can be captured adequately by simpler feed-forward networks.

# iTransformer conceptual structure (simplified)
# Standard Transformer: tokens = time steps, embedding = features
# iTransformer:          tokens = features,   embedding = time steps

import torch.nn as nn

class iTransformerLayer(nn.Module):
    def __init__(self, n_vars, seq_len, d_model):
        super().__init__()
        # Project each variable's full time series into d_model dims
        self.embed = nn.Linear(seq_len, d_model)  # Per-variable

        # Attention operates ACROSS variables (not time)
        self.attention = nn.MultiheadAttention(d_model, num_heads=8)

        # FFN processes temporal patterns within each variable
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )

    def forward(self, x):
        # x: (batch, seq_len, n_vars)
        # Transpose to (batch, n_vars, seq_len), embed
        x = x.permute(0, 2, 1)           # (B, V, T)
        x = self.embed(x)                 # (B, V, D)
        x = x.permute(1, 0, 2)           # (V, B, D) for attention
        attn_out, _ = self.attention(x, x, x)  # Cross-variable attention
        x = x + attn_out
        x = x + self.ffn(x)              # Temporal pattern refinement
        return x

Foundation Models: Zero-Shot Forecasting Arrives

The paradigm shift that has most excited the forecasting community is the emergence of foundation models that can forecast time series they’ve never been trained on. This is analogous to GPT’s ability to answer questions about topics it wasn’t explicitly fine-tuned for — the model has learned general patterns of sequential data from massive pre-training, and it applies those patterns to new inputs at inference time.

TimesFM (Google, 2024)

TimesFM is a 200M-parameter decoder-only transformer pre-trained on approximately 100 billion time points from Google Trends, Wikipedia page views, synthetic data, and various public datasets. Its architecture uses input patching (similar to PatchTST) with variable patch sizes, allowing it to handle different granularities and frequencies.

TimesFM’s zero-shot performance is remarkable: on datasets it has never seen, it matches or exceeds supervised models that were trained specifically on those datasets. Google’s internal evaluations show TimesFM outperforming tuned ARIMA and ETS on 60-70% of retail forecasting series — without a single gradient update on retail data.

import timesfm

# Load the pre-trained model
tfm = timesfm.TimesFm(
    hparams=timesfm.TimesFmHparams(
        backend="gpu",
        per_core_batch_size=32,
        horizon_len=128,
    ),
    checkpoint=timesfm.TimesFmCheckpoint(
        huggingface_repo_id="google/timesfm-1.0-200m-pytorch"
    ),
)

# Zero-shot forecast — no training required
point_forecast, experimental_quantile_forecast = tfm.forecast(
    inputs=[historical_series_1, historical_series_2],  # List of arrays
    freq=[0, 0],  # 0=high-freq, 1=medium, 2=low
)
# Returns forecasts for all input series simultaneously

Chronos (Amazon, 2024)

Chronos tokenizes continuous time-series values into discrete bins using mean scaling and quantization, then applies a T5 language model architecture. By treating forecasting as a “language” problem — predict the next token given the sequence so far — Chronos leverages decades of NLP architecture innovations and training recipes.
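The tokenization idea can be sketched in numpy. This is a simplified illustration of mean scaling plus uniform binning, not Chronos’s exact bin layout or vocabulary.

```python
import numpy as np

def tokenize(series, n_bins=4096, low=-15.0, high=15.0):
    """Mean-scale a series, then quantize values into discrete token ids."""
    scale = np.mean(np.abs(series))              # mean scaling
    scaled = series / scale
    edges = np.linspace(low, high, n_bins - 1)   # uniform bin edges
    tokens = np.digitize(scaled, edges)          # one token id per time step
    return tokens, scale

def detokenize(tokens, scale, n_bins=4096, low=-15.0, high=15.0):
    """Map token ids back to bin centers, then undo the scaling."""
    centers = np.linspace(low, high, n_bins)
    return centers[tokens] * scale

y = np.array([10.0, 12.0, 9.0, 11.0, 13.0])
tokens, scale = tokenize(y)
reconstructed = detokenize(tokens, scale)
```

Once values are tokens, "forecasting" is literally next-token prediction, and sampling multiple continuations from the language model yields the probabilistic forecasts described below.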

Chronos offers multiple sizes (20M to 710M parameters) and produces probabilistic forecasts natively — each prediction is a distribution over possible future values. This makes it ideal for applications where uncertainty quantification matters (inventory planning, risk management, resource allocation).

A key advantage: Chronos includes synthetic data augmentation during pre-training. It generates millions of synthetic time series using Gaussian processes with diverse kernels, ensuring the model has seen a wide range of temporal patterns — seasonal, trending, noisy, smooth, multi-scale — even if the real-world training data doesn’t cover all of them.

Moirai (Salesforce, 2024)

Moirai (Woo et al., 2024) is a universal forecasting model designed to handle any time series regardless of frequency, number of variables, or forecast horizon. Its architecture addresses a key limitation of other foundation models: distribution shift across datasets.

Different time series have radically different scales and statistical properties. Server CPU usage ranges from 0-100%. Stock prices range from $1 to $5,000. Energy consumption might be measured in megawatts. Moirai uses a mixture distribution output — predicting parameters of a mixture of distributions rather than point values — that naturally adapts to different scales and distributional shapes without manual normalization.

Moirai also introduces Any-Variate Attention, allowing it to process multivariate time series with arbitrary numbers of variables at inference time, even if the model was pre-trained on series with different dimensionality. This flexibility makes Moirai one of the most versatile foundation models available.

TimeMixer++ and TSMixer (2024-2025)

TSMixer (Google, 2023) demonstrated that a simple MLP-Mixer architecture — alternating between time-mixing (across time steps) and feature-mixing (across variables) — achieves competitive results with transformers while being significantly faster. TimeMixer++ extends this with multi-scale decomposition, processing different frequency components through separate mixing paths.

These mixer-based architectures are particularly attractive for production deployment because their computational complexity scales linearly with sequence length (versus quadratically for vanilla attention), making them practical for very long context windows and high-frequency data.
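The time-mixing/feature-mixing alternation is easy to see in code. Below is a minimal PyTorch sketch of one mixing layer; the real TSMixer also adds normalization and dropout, which are omitted here.

```python
import torch
import torch.nn as nn

class MixerLayer(nn.Module):
    """One TSMixer-style layer: MLP across time, then MLP across variables."""
    def __init__(self, seq_len, n_vars, hidden=64):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.Linear(seq_len, hidden), nn.ReLU(), nn.Linear(hidden, seq_len)
        )
        self.feat_mlp = nn.Sequential(
            nn.Linear(n_vars, hidden), nn.ReLU(), nn.Linear(hidden, n_vars)
        )

    def forward(self, x):  # x: (batch, seq_len, n_vars)
        # Time mixing: MLP over the time axis, shared across variables
        x = x + self.time_mlp(x.transpose(1, 2)).transpose(1, 2)
        # Feature mixing: MLP over the variable axis, shared across time
        x = x + self.feat_mlp(x)
        return x

x = torch.randn(4, 96, 7)                # 4 samples, 96 steps, 7 variables
out = MixerLayer(seq_len=96, n_vars=7)(x)
```

Because both MLPs are plain linear maps over fixed-size axes, cost grows linearly with sequence length — the property that makes mixers attractive for long contexts.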

| Foundation Model | Organization | Parameters | Open Source | Output Type | Multivariate |
| --- | --- | --- | --- | --- | --- |
| TimesFM | Google | 200M | Yes | Point + quantiles | Per-channel |
| Chronos | Amazon | 20M–710M | Yes | Probabilistic | Per-channel |
| Moirai | Salesforce | 14M–311M | Yes | Mixture distribution | Native multivariate |
| MOMENT | CMU | 40M–385M | Yes | Point | Per-channel |
| TimeGPT | Nixtla | Undisclosed | No (API) | Point + intervals | Per-channel |
| Timer | Tsinghua | 67M | Yes | Autoregressive | Per-channel |

Caution: Foundation model hype is real, but so are their limitations. Most foundation models process each variable independently (per-channel) and don’t capture cross-variable correlations. For problems where inter-variable relationships are critical (e.g., predicting energy demand from weather + price + grid load), a trained multivariate model like TFT or iTransformer may still outperform. Foundation models also struggle with domain-specific patterns they haven’t seen in pre-training — a financial time series with quarterly earnings seasonality may not be well-represented in pre-training data dominated by daily and weekly patterns.

Benchmarks: How Models Actually Compare

The most widely used benchmarks for long-term forecasting are the ETT datasets (Electricity Transformer Temperature), Weather, Electricity, and Traffic datasets. Below are representative results using Mean Squared Error (MSE) — lower is better — on standard prediction horizons.

| Model | ETTh1 (96) | ETTh1 (720) | Weather (96) | Electricity (96) | Traffic (96) |
| --- | --- | --- | --- | --- | --- |
| ARIMA | 0.423 | 0.618 | 0.284 | 0.227 | 0.662 |
| N-HiTS | 0.384 | 0.464 | 0.166 | 0.169 | 0.415 |
| PatchTST | 0.370 | 0.449 | 0.149 | 0.129 | 0.370 |
| iTransformer | 0.355 | 0.434 | 0.141 | 0.126 | 0.360 |
| TimesFM (zero-shot) | 0.391 | 0.478 | 0.168 | 0.155 | 0.410 |
| Chronos-Base (zero-shot) | 0.398 | 0.491 | 0.172 | 0.160 | 0.425 |

Numbers are approximate and representative. Lower MSE is better. (96) and (720) denote the forecast horizon length. Results compiled from published papers and reproductions.

Several patterns emerge from the benchmarks:

  • iTransformer and PatchTST lead supervised models on most multivariate long-range benchmarks, with iTransformer having a slight edge on datasets where cross-variable correlations matter.
  • Foundation models (zero-shot) are competitive but don’t yet beat trained models. TimesFM and Chronos typically land between classical methods and the best supervised deep models — impressive given zero training, but not dominant. The gap narrows on datasets whose patterns are well-represented in pre-training data.
  • Classical methods remain surprisingly strong on univariate series, especially when combined with ensembling (averaging forecasts from AutoARIMA, ETS, and Theta). The overhead of deep learning is not always justified.
  • The performance gap widens at longer horizons. Deep models’ advantage over classical methods is largest at prediction horizons of 336+ steps, where complex temporal patterns compound and statistical models’ assumptions break down.

Practical Model Selection Guide

Given this landscape, how do you choose the right model for your problem? Here’s a decision framework based on practical constraints:

Scenario 1: Quick deployment, no training data infrastructure

Use: Foundation model (Chronos or TimesFM) → zero-shot

When you need forecasts immediately and can’t invest in a training pipeline, foundation models deliver competitive accuracy with zero setup. Install the library, feed in your data, get forecasts. This is ideal for proofs of concept, new data streams, and situations where the cost of deploying a custom model exceeds the cost of slightly reduced accuracy.

Scenario 2: Thousands of univariate series, need speed and reliability

Use: StatsForecast (AutoARIMA + AutoETS + AutoTheta ensemble)

For large-scale retail demand forecasting, financial time-series, or IoT monitoring where each series is relatively independent, fitting per-series statistical models is fast, reliable, and often the most accurate approach. StatsForecast’s optimized implementations make this feasible even for millions of series.

Scenario 3: Multivariate with rich covariates (promotions, holidays, metadata)

Use: Temporal Fusion Transformer or LightGBM with temporal features

When your forecast depends on external factors — promotional calendars, weather forecasts, economic indicators, product attributes — you need a model that ingests covariates natively. TFT handles this elegantly with built-in variable selection. LightGBM with engineered features is faster to iterate and often equally accurate.

Scenario 4: Long-horizon multivariate forecasting, accuracy is paramount

Use: iTransformer or PatchTST

For applications where prediction accuracy directly impacts high-value decisions (energy trading, infrastructure capacity planning, financial risk management), invest in training a supervised deep model on your historical data. iTransformer and PatchTST represent the current accuracy frontier for long-range multivariate forecasting.

Scenario 5: Uncertainty quantification is critical

Use: Chronos (probabilistic) or DeepAR

When you need prediction intervals — not just point forecasts — Chronos provides calibrated probabilistic forecasts out of the box, and DeepAR produces full probability distributions trained on your specific data. These are essential for inventory optimization (balancing stockout vs. overstock risk) and financial risk management.

Tip: The single best practical advice for forecasting accuracy is: always ensemble. Averaging forecasts from 3-5 diverse models (a statistical model, a gradient boosting model, and a deep learning model) consistently outperforms any individual model. The M-series competitions have demonstrated this repeatedly. Ensembling is boring, unglamorous, and it works better than almost anything else.

Implementation: End-to-End Forecasting Pipeline

A complete forecasting pipeline involves much more than model selection. Here’s the architecture that production systems use:

# Production forecasting pipeline using NeuralForecast + StatsForecast
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS, PatchTST
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, AutoETS, AutoTheta
import pandas as pd
import numpy as np

# Step 1: Data preparation
# df must have columns: unique_id, ds, y
train_df = df[df['ds'] < '2026-01-01']
test_df = df[df['ds'] >= '2026-01-01']
horizon = 30  # 30-day forecast

# Step 2: Statistical models (fast, per-series)
sf = StatsForecast(
    models=[
        AutoARIMA(season_length=7),
        AutoETS(season_length=7),
        AutoTheta(season_length=7),
    ],
    freq='D',
    n_jobs=-1,
)
stat_forecasts = sf.forecast(df=train_df, h=horizon)

# Step 3: Deep learning models (slower, more expressive)
nf = NeuralForecast(
    models=[
        NHITS(
            input_size=180,
            h=horizon,
            max_steps=1000,
            n_pool_kernel_size=[4, 4, 4],
        ),
        PatchTST(
            input_size=512,
            h=horizon,
            max_steps=1000,
            patch_len=16,
        ),
    ],
    freq='D',
)
nf.fit(df=train_df)
neural_forecasts = nf.predict()

# Step 4: Ensemble (simple average — often the best approach)
combined = stat_forecasts.merge(neural_forecasts, on=['unique_id', 'ds'])
model_cols = [c for c in combined.columns
              if c not in ['unique_id', 'ds']]
combined['ensemble'] = combined[model_cols].mean(axis=1)

# Step 5: Evaluate against held-out actuals
eval_df = combined.merge(
    test_df[['unique_id', 'ds', 'y']], on=['unique_id', 'ds']
)
errors = eval_df['ensemble'] - eval_df['y']
evaluation = {
    'MAE': np.mean(np.abs(errors)),
    'MSE': np.mean(errors ** 2),
    'sMAPE': np.mean(2 * np.abs(errors) /
                     (np.abs(eval_df['y']) + np.abs(eval_df['ensemble']))),
}
print(f"Ensemble performance: {evaluation}")

Critical pipeline components beyond the model:

  • Data quality checks: Missing values, duplicates, timezone inconsistencies, and outliers in training data directly degrade forecast quality. Automated data validation before model training is essential.
  • Cross-validation for time series: Never use random train-test splits for time series. Use expanding window or sliding window cross-validation that respects temporal ordering. The statsforecast and neuralforecast libraries provide optimized cross_validation methods for exactly this.
  • Forecast reconciliation: When forecasts exist at multiple hierarchical levels (store-level, region-level, national-level), they must be coherent — the sum of store forecasts should equal the regional forecast. Methods like MinTrace reconciliation ensure consistency.
  • Backtesting and monitoring: Production forecasts must be continuously evaluated against actuals. Forecast accuracy that degrades over time (due to concept drift, data pipeline issues, or regime changes) needs automated detection and model retraining triggers.
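The expanding-window scheme mentioned above can be sketched in a few lines. This is a minimal illustration of the split logic itself, independent of any library.

```python
import numpy as np

def expanding_window_splits(n, horizon, n_windows):
    """Yield (train_idx, test_idx) pairs that respect temporal order.

    Each window trains on everything before its cutoff and tests on the
    next `horizon` points; cutoffs step forward through the series.
    """
    for i in range(n_windows):
        cutoff = n - (n_windows - i) * horizon
        yield np.arange(cutoff), np.arange(cutoff, cutoff + horizon)

# 100 observations, 3 backtest windows of 10 steps each:
# trains on 70, 80, 90 points; tests on the following 10 each time
splits = list(expanding_window_splits(100, horizon=10, n_windows=3))
```

A sliding-window variant would additionally drop the oldest observations so every training set has the same length — useful when old data is no longer representative.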

The Future of Forecasting

Time-series forecasting is at a fascinating crossroads. Classical methods remain competitive for many problems. Deep learning models set the accuracy frontier for complex, multivariate, long-horizon tasks. Foundation models promise to democratize forecasting by eliminating the need for per-dataset training. And gradient boosting quietly outperforms both on many real-world, feature-rich problems.

Several trends will shape the next wave of innovation:

Foundation model fine-tuning is bridging the gap between zero-shot and fully supervised performance. Pre-train on billions of diverse time points, then fine-tune on your specific domain with as little as a few hundred data points. Early results show fine-tuned Chronos and TimesFM matching or exceeding fully supervised models with a fraction of the training data — the best of both worlds.

Conformal prediction for calibrated uncertainty is replacing ad-hoc prediction interval methods. Conformal prediction provides distribution-free coverage guarantees: if you request 95% intervals, they will contain the true value at least 95% of the time on average, regardless of the underlying data distribution. The classical guarantee assumes exchangeable data, which time series violate, so time-series variants such as EnbPI adapt the method to temporal dependence. Libraries like MAPIE make this practical for production use.
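The core mechanism is simple enough to sketch in plain NumPy. This is a minimal split-conformal illustration under an exchangeability assumption, not the MAPIE API: hold out a calibration set, take a finite-sample-corrected quantile of the absolute residuals, and widen the point forecasts by that margin:

```python
# Split-conformal prediction intervals from calibration residuals.
import numpy as np

def conformal_interval(cal_actuals, cal_preds, new_preds, alpha=0.05):
    """Return (lower, upper) bounds with ~(1 - alpha) marginal coverage."""
    residuals = np.abs(cal_actuals - cal_preds)
    n = len(residuals)
    # ceil((n + 1)(1 - alpha))/n is the finite-sample corrected quantile
    # level that yields the coverage guarantee; clip it at 1.0
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(residuals, q_level)
    return new_preds - q, new_preds + q
```

No distributional assumption enters anywhere: the interval width is read directly off the empirical residuals, which is what makes the guarantee distribution-free.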

LLM-enhanced forecasting is an emerging research direction where large language models augment numerical forecasts with textual context. A model that knows “Black Friday is next week” or “a competitor just announced a price cut” — information contained in text but not in numerical time-series history — can produce forecasts that purely numerical models cannot match. Early papers from Amazon and Google show promising results for retail demand forecasting.

Real-time adaptive models that continuously update their parameters as new data arrives — online learning — are becoming practical for streaming applications. Instead of periodic batch retraining, the model learns from each new observation in real-time, automatically adapting to concept drift without human intervention.
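The smallest useful instance of that idea is simple exponential smoothing with its level updated after every observation, so the one-step-ahead forecast tracks drift without any batch retraining. A toy sketch (the class name and the fixed smoothing factor are illustrative choices, not a library API):

```python
# Online simple exponential smoothing: each new observation nudges the
# level toward itself by a factor alpha, adapting to drift incrementally.
class OnlineSES:
    def __init__(self, alpha=0.2):
        self.alpha = alpha    # higher alpha = faster adaptation, more noise
        self.level = None

    def predict(self):
        """One-step-ahead forecast: the current smoothed level."""
        return self.level

    def update(self, y):
        """Fold a new observation into the level in O(1) time and memory."""
        if self.level is None:
            self.level = y
        else:
            self.level = self.alpha * y + (1 - self.alpha) * self.level
```

Production online learners (adaptive gradient boosting, streaming state-space models) are far richer, but they share this shape: constant-time updates per observation and no stored training set.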

The most important practical takeaway from the current landscape is that the best forecasting system is not the best model — it’s the best pipeline. Data quality, feature engineering, cross-validation, ensembling, monitoring, and retraining together determine forecast accuracy more than any individual model choice. The teams that invest in pipeline infrastructure consistently outperform teams that chase the latest model architecture. Start with a simple, well-engineered pipeline. Add complexity only when measured accuracy improvements justify it. And always, always benchmark against a seasonal naive baseline — because the most sophisticated model in the world is worthless if it can’t beat “same as last week.”
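For completeness, the "same as last week" baseline that paragraph insists on is a one-liner worth keeping in every benchmark suite. A minimal sketch:

```python
# Seasonal naive forecast: repeat the last observed seasonal cycle
# forward over the horizon (e.g. season_length=7 for daily data with
# weekly seasonality).
import numpy as np

def seasonal_naive(history, season_length, horizon):
    last_cycle = np.asarray(history)[-season_length:]
    reps = -(-horizon // season_length)    # ceiling division
    return np.tile(last_cycle, reps)[:horizon]
```

If a model cannot beat this on held-out data, its extra complexity is buying nothing.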


References

  • Nie, Yuqi, et al. “A Time Series is Worth 64 Words: Long-term Forecasting with Transformers.” (PatchTST) ICLR 2023.
  • Liu, Yong, et al. “iTransformer: Inverted Transformers Are Effective for Time Series Forecasting.” ICLR 2024.
  • Das, Abhimanyu, et al. “A Decoder-Only Foundation Model for Time-Series Forecasting.” (TimesFM) ICML 2024.
  • Ansari, Abdul Fatir, et al. “Chronos: Learning the Language of Time Series.” arXiv:2403.07815, 2024.
  • Woo, Gerald, et al. “Unified Training of Universal Time Series Forecasting Transformers.” (Moirai) ICML 2024.
  • Oreshkin, Boris N., et al. “N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting.” ICLR 2020.
  • Challu, Cristian, et al. “N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting.” AAAI 2023.
  • Lim, Bryan, et al. “Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting.” International Journal of Forecasting, 2021.
  • Salinas, David, et al. “DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks.” International Journal of Forecasting, 2020.
  • Goswami, Mononito, et al. “MOMENT: A Family of Open Time-Series Foundation Models.” ICML 2024.
  • Wu, Haixu, et al. “TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis.” ICLR 2023.
  • Taylor, Sean J. and Benjamin Letham. “Forecasting at Scale.” (Prophet) The American Statistician, 2018.
  • NeuralForecast GitHub — Production deep learning forecasting
  • StatsForecast GitHub — Lightning-fast statistical forecasting
  • Time-Series-Library (THU) — Unified deep learning framework
  • Chronos GitHub Repository
  • TimesFM GitHub Repository
