Introduction: Why Time Series Forecasting Matters More Than Ever
Time series forecasting, the discipline of predicting future values from historical patterns, has become one of the most consequential applications of artificial intelligence. From predicting stock market movements and energy demand to forecasting supply-chain bottlenecks and hospital admissions, accurate time series predictions can determine the difference between substantial profit and significant loss.
Yet for decades, the field was dominated by classical statistical methods like ARIMA (AutoRegressive Integrated Moving Average) and Exponential Smoothing, joined more recently by Prophet. These methods, while reliable and interpretable, struggled with the complexity of modern datasets: thousands of interrelated variables, irregular sampling intervals, and the need to generalize across entirely different domains without retraining.
This situation changed substantially between 2023 and 2026. A wave of innovation, driven by the same transformer architectures that power ChatGPT and other large language models, transformed the time series field. The result is a new generation of models that forecast with high accuracy, often with zero or minimal fine-tuning on the target data.
This guide examines the most recent and influential time series forecasting models, explains how they work in accessible terms, compares their strengths and weaknesses, and offers practical guidance for selecting an appropriate model for a given use case. It is intended for data scientists, quantitative investors, and business leaders seeking to understand the technology.
The Evolution from Statistical to Deep Learning Models
To appreciate the significance of the most recent models, it is useful to understand the developments that preceded them. Time series forecasting has evolved through several distinct eras, each addressing the limitations of its predecessor.
The Classical Era (1970s-2010s): ARIMA, ETS, and Prophet
The workhorse of time series forecasting for nearly half a century was the ARIMA family of models. Developed by Box and Jenkins in the 1970s, ARIMA models decompose a time series into autoregressive (AR) components, integrated (differencing) components, and moving average (MA) components. They work well for univariate series with clear patterns that are stationary, or that become stationary after differencing.
Exponential Smoothing (ETS) offered a complementary approach, assigning exponentially decreasing weights to older observations. Facebook’s Prophet (released in 2017) made time series accessible to non-specialists by automatically handling seasonality, holidays, and trend changes.
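The idea of exponentially decreasing weights can be shown with simple exponential smoothing, the most basic member of the ETS family. The following is a minimal sketch rather than a full ETS implementation: the function names and the smoothing factor `alpha=0.5` are illustrative choices, and trend and seasonal components are left out.

```python
def simple_exponential_smoothing(values, alpha):
    """Return the smoothed level after each observation.

    alpha is in (0, 1]; a larger alpha puts more weight on recent observations.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = values[0]  # initialise the level with the first observation
    levels = [level]
    for y in values[1:]:
        # New level = weighted average of the new observation and the old level
        level = alpha * y + (1.0 - alpha) * level
        levels.append(level)
    return levels


def observation_weights(n, alpha):
    """Weights the latest level places on the n most recent observations (newest first)."""
    return [alpha * (1.0 - alpha) ** k for k in range(n)]


history = [112, 118, 132, 129, 121, 135]
levels = simple_exponential_smoothing(history, alpha=0.5)
one_step_forecast = levels[-1]  # the forecast is a flat line at the last level
weights = observation_weights(4, alpha=0.5)
print(one_step_forecast, weights)
```

Each step back in time multiplies the weight by `1 - alpha`, which is where the "exponentially decreasing" description comes from.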
All of these methods share fundamental limitations, however: they are univariate (or handle multivariate data awkwardly), they often require manual feature engineering, and they must be fitted separately for each time series. Forecasting 10,000 product SKUs requires 10,000 separate models.
The Early Deep Learning Era (2017-2022): DeepAR, N-BEATS, and Temporal Fusion Transformer
Deep learning entered the time series arena with Amazon’s DeepAR (2017), which used recurrent neural networks (RNNs) to produce probabilistic forecasts across related time series. N-BEATS (2019) from Element AI showed that a pure deep learning architecture could beat statistical ensembles on the benchmark from the M4 forecasting competition.
The Temporal Fusion Transformer (TFT), published by Google in 2021, combined attention mechanisms with gating layers to handle multiple input types (static metadata, known future inputs, and observed past values). TFT became one of the most popular deep learning forecasting models, offering both accuracy and interpretability through its attention weights.
Despite these advances, these models still required substantial training data from the target domain and significant computational resources to train. They were not “general-purpose” forecasters.
The Foundation Model Era (2023-2026): Zero-Shot Forecasting
The breakthrough came when researchers applied the “foundation model” paradigm — pre-training on massive, diverse datasets and then applying the model to new tasks without fine-tuning — to time series data. Just as GPT-3 could answer questions about topics it was never explicitly trained on, these new models can forecast time series they have never seen before.
This paradigm shift was enabled by three key insights:
- Tokenization of time series: Converting continuous numerical values into discrete tokens (as Chronos does, similar to how text is tokenized for language models) or into patches of consecutive values (as TimesFM and Moirai do) allows transformer architectures to process time series data effectively.
- Cross-domain pre-training: Training on hundreds of thousands of diverse time series (energy, finance, weather, retail, healthcare) teaches the model general patterns like seasonality, trends, and level shifts that transfer across domains.
- Scaling laws apply: Larger models trained on more data consistently produce better forecasts, following the same scaling behavior observed in large language models.
Foundation Models for Time Series: The 2024-2026 Shift
Foundation models represent the most significant recent development in time series forecasting. These models are pre-trained on large collections of time series data and can generate forecasts for entirely new datasets without any task-specific training. The most important examples are described below.
Amazon Chronos
Released by Amazon Science in March 2024, Chronos is a family of pre-trained probabilistic time series forecasting models based on the T5 (Text-to-Text Transfer Transformer) architecture. What makes Chronos unique is its approach to tokenization: it converts real-valued time series into a sequence of discrete tokens using scaling and quantization, then trains a language model to predict the next token in the sequence.
How It Works
Chronos treats time series forecasting as a language modeling problem. Given a sequence of historical values [v1, v2, …, vT], the model:
- Scales the values using mean absolute scaling to normalize different magnitudes
- Quantizes the scaled values into a fixed vocabulary of bins (e.g., 4096 bins)
- Feeds the token sequence into a T5 encoder-decoder transformer
- Generates future tokens autoregressively, which are then mapped back to real values
- Produces probabilistic forecasts by sampling multiple trajectories
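The scaling and quantization steps above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the Chronos implementation: the helper names are hypothetical, the bin range of -15 to 15 is an assumed value, and the transformer that predicts the next token is omitted entirely.

```python
def mean_scale(values):
    """Divide by the mean absolute value so series of different magnitude look alike."""
    scale = sum(abs(v) for v in values) / len(values)
    if scale == 0:
        scale = 1.0  # avoid division by zero for an all-zero series
    return [v / scale for v in values], scale


def quantize(scaled, num_bins, low, high):
    """Map each scaled value to the index of a uniform bin on [low, high]."""
    width = (high - low) / num_bins
    tokens = []
    for x in scaled:
        clipped = min(max(x, low), high)
        idx = int((clipped - low) / width)
        tokens.append(min(idx, num_bins - 1))  # the value `high` falls in the last bin
    return tokens


def dequantize(tokens, scale, num_bins, low, high):
    """Map token ids back to real values using bin centres, then undo the scaling."""
    width = (high - low) / num_bins
    return [(low + (t + 0.5) * width) * scale for t in tokens]


NUM_BINS = 4096          # vocabulary size mentioned in the text
LOW, HIGH = -15.0, 15.0  # assumed range of scaled values covered by the bins

history = [112.0, 118.0, 132.0, 129.0, 121.0, 135.0]
scaled, scale = mean_scale(history)
tokens = quantize(scaled, NUM_BINS, LOW, HIGH)
reconstructed = dequantize(tokens, scale, NUM_BINS, LOW, HIGH)
max_error = max(abs(a - b) for a, b in zip(history, reconstructed))
print(tokens, max_error)
```

The round trip is lossy: each value can be off by up to half a bin width times the scale, which is the price of turning a regression problem into next-token prediction.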
Key Strengths
- Zero-shot capability: Performs competitively with models trained specifically on the target dataset
- Multiple model sizes: Available in Tiny (8M), Mini (20M), Small (46M), Base (200M), and Large (710M) parameter variants
- Data augmentation: Uses synthetic data generated by Gaussian processes during pre-training to improve robustness
- Open source: Fully available on Hugging Face under Apache 2.0 license
Benchmark Results
On the extensive benchmark of 27 datasets compiled by the Chronos team, the Large model achieved the best aggregate zero-shot performance, outperforming task-specific models like DeepAR and AutoARIMA on many datasets. On the widely-used Monash Forecasting Archive, Chronos ranked first or second on the majority of datasets.
Google TimesFM
TimesFM (Time Series Foundation Model) was released by Google Research in February 2024. Unlike Chronos, which adapts a language model architecture, TimesFM was designed from scratch specifically for time series forecasting. It uses a decoder-only transformer architecture with a unique patched decoding approach.
How It Works
TimesFM introduces the concept of “input patches” — contiguous segments of the time series that are fed into the model as single tokens. Rather than processing one time step at a time, the model processes chunks of, say, 32 consecutive values as a single input patch. This dramatically reduces sequence length and allows the model to capture longer-range dependencies.
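A second design choice is that the output patch can be longer than the input patch: the model reads the history in short patches (e.g., 32 values) but predicts a longer block of future values (e.g., 128 steps at a time) in each decoding step. This reduces the number of autoregressive steps needed for long horizons, and a single model can still serve arbitrary forecast horizons by decoding repeatedly and truncating the result.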
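To make the effect of patching concrete, the sketch below splits a context into input patches and counts how many decoding steps a given horizon needs. It is a simplified illustration, not TimesFM code: the function names are hypothetical, zero left-padding stands in for the model's own padding and masking, and the patch lengths of 32 and 128 are the example values from the text.

```python
import math


def make_input_patches(series, patch_len):
    """Split a series into non-overlapping patches, left-padding with zeros if needed."""
    remainder = len(series) % patch_len
    pad = (patch_len - remainder) % patch_len
    padded = [0.0] * pad + list(series)
    return [padded[i:i + patch_len] for i in range(0, len(padded), patch_len)]


def decoding_steps(horizon, output_patch_len):
    """Number of autoregressive steps needed to cover the forecast horizon."""
    return math.ceil(horizon / output_patch_len)


context = [float(i) for i in range(500)]
patches = make_input_patches(context, patch_len=32)

print(len(patches))              # tokens the transformer sees instead of 500 time steps
print(decoding_steps(720, 128))  # steps when 128 values are emitted at a time
print(decoding_steps(720, 1))    # steps when one value is emitted at a time
```

Fewer decoding steps means fewer opportunities for errors to compound, which is one plausible reason patched decoding helps at long horizons.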
Key Strengths
- 200M parameters: Trained on a massive corpus of 100 billion time points from Google Trends, Wiki Pageviews, and synthetic data
- Handles variable horizons: A single model can forecast 1 step ahead or 1000 steps ahead without retraining
- Point and probabilistic forecasts: Provides both median forecasts and prediction intervals
- Very fast inference: The patched architecture makes it significantly faster than autoregressive models at long horizons
Benchmark Results
Google’s benchmarks show TimesFM achieving state-of-the-art zero-shot performance on the Darts, Monash, and Informer benchmarks, often matching or exceeding supervised baselines that were trained on the target data. It was particularly strong on long-horizon forecasting tasks (96 to 720 steps ahead).
Salesforce Moirai
Moirai (released by Salesforce AI Research in February 2024) takes yet another approach. It is built on a masked encoder architecture and is designed as a universal forecasting transformer that handles multiple frequencies, prediction lengths, and variable counts within a single model.
How It Works
Moirai’s key innovation is the Any-Variate Attention mechanism. Traditional transformers process multivariate time series by either flattening all variables into one sequence (which loses variable identity) or processing each variable independently (which misses cross-variable relationships). Moirai’s Any-Variate Attention allows the model to dynamically attend to any combination of variables and time steps, regardless of how many variables are present.
The model also uses multiple input/output projection layers for different data frequencies (minutely, hourly, daily, weekly, etc.), allowing a single model to handle data at any sampling rate.
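One way to picture attention over an arbitrary number of variables is to flatten all variables into a single token sequence while keeping track of which variable each token came from, then add a bias to each attention score depending on whether two tokens share a variable. The sketch below shows only this bookkeeping under that assumption; the function names and bias values are placeholders (in a trained model such biases would be learned), and no actual attention is computed.

```python
def flatten_with_ids(num_variates, num_patches):
    """Flatten a (variate, time-patch) grid into one sequence, keeping both indices."""
    return [(v, t) for v in range(num_variates) for t in range(num_patches)]


def attention_bias(tokens, same_variate_bias, cross_variate_bias):
    """Bias added to each attention score, chosen by whether two tokens share a variate."""
    return [
        [same_variate_bias if vi == vj else cross_variate_bias for (vj, _) in tokens]
        for (vi, _) in tokens
    ]


# Three variables with four patches each become a single sequence of 12 tokens.
tokens = flatten_with_ids(num_variates=3, num_patches=4)
bias = attention_bias(tokens, same_variate_bias=1.0, cross_variate_bias=-1.0)

print(len(tokens), len(bias), len(bias[0]))
```

Because the bias depends only on "same variable or not", the same parameters work whether the input has three variables or three hundred.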
Key Strengths
- True multivariate forecasting: Unlike Chronos and TimesFM (which are primarily univariate), Moirai natively handles multivariate time series
- Frequency-agnostic: A single model works across different sampling frequencies
- Three model sizes: Small (14M), Base (91M), and Large (311M) parameters
- Pre-trained on LOTSA: The Large-scale Open Time Series Archive, a curated collection of 27 billion observations across 9 domains
Nixtla TimeGPT
TimeGPT-1, developed by Nixtla, was one of the earliest time series foundation models (first announced in October 2023). Unlike the open-source models above, TimeGPT is offered as a commercial API service, similar to how OpenAI offers GPT access.
How It Works
TimeGPT uses a proprietary transformer-based architecture trained on over 100 billion data points from publicly available datasets spanning finance, weather, energy, web traffic, and more. The exact architecture details are not fully published, but the model follows an encoder-decoder design with attention mechanisms optimized for temporal patterns.
Key Strengths
- Easiest to use: Simple API call — no model loading, no GPU required
- Fine-tuning support: can be fine-tuned on the user’s data through the API for improved performance
- Anomaly detection: Built-in anomaly detection capabilities alongside forecasting
- Conformal prediction intervals: Statistically rigorous uncertainty quantification
Transformer-Based Architectures That Advanced the Field
Beyond the foundation models, several transformer-based architectures have advanced supervised time series forecasting. These models require training on a specific dataset but often achieve the highest accuracy when sufficient training data is available.
PatchTST (Patch Time Series Transformer)
Published at ICLR 2023 by researchers from Princeton and IBM, PatchTST introduced two simple but powerful ideas that dramatically improved transformer performance on time series data.
The Two Key Innovations
Patching: Instead of feeding individual time steps as tokens to the transformer (which creates very long sequences for high-frequency data), PatchTST divides the time series into fixed-length patches (e.g., segments of 16 consecutive values). Each patch becomes a single token. With non-overlapping patches of length 16 this cuts the sequence length by a factor of 16 (overlapping patches reduce it by roughly the stride instead), allowing the attention mechanism to capture much longer-range dependencies within the same computational budget.
Channel Independence: Rather than mixing all variables together (which often confuses the model), PatchTST processes each variable independently through a shared transformer backbone. This counterintuitive design choice turned out to be remarkably effective, as it prevents the model from overfitting to spurious cross-variable correlations in the training data.
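Both ideas can be illustrated without a neural network. The sketch below cuts each variable into patches on its own and counts the resulting tokens; it is not PatchTST's code. The function names, the two example channels, and the stride values are assumptions for illustration, and only the patch length of 16 comes from the text.

```python
def make_patches(series, patch_len, stride):
    """Return (possibly overlapping) patches; each patch becomes one transformer token."""
    last_start = len(series) - patch_len
    return [series[s:s + patch_len] for s in range(0, last_start + 1, stride)]


def patch_channels_independently(multivariate, patch_len, stride):
    """Patch every variable separately, so no token mixes values from different variables."""
    return {
        name: make_patches(values, patch_len, stride)
        for name, values in multivariate.items()
    }


lookback = 96
data = {
    "load": [float(t) for t in range(lookback)],
    "temperature": [20.0 + 0.1 * t for t in range(lookback)],
}

non_overlapping = patch_channels_independently(data, patch_len=16, stride=16)
overlapping = patch_channels_independently(data, patch_len=16, stride=8)

print(len(non_overlapping["load"]))  # tokens per channel with stride 16
print(len(overlapping["load"]))      # tokens per channel with stride 8
```

Each channel's token sequence would then pass through the same shared transformer, which is what "channel independence" means in practice.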
Why It Matters
PatchTST demonstrated that transformers can excel at time series forecasting when the tokenization strategy is right. Prior to PatchTST, Zeng et al. (“Are Transformers Effective for Time Series Forecasting?”, 2023) had argued that simple linear models outperform transformers on long-term forecasting. PatchTST challenged this claim, reporting state-of-the-art results on the standard long-term forecasting benchmarks at the time.
iTransformer
Published at ICLR 2024 by researchers from Tsinghua University and Ant Group, iTransformer (Inverted Transformer) takes a radically different approach to applying transformers to multivariate time series.
The Inversion Idea
In a standard transformer for time series, each token represents a time step across all variables. The attention mechanism then captures relationships between different time steps. iTransformer inverts this: each token represents an entire variable’s history, and the attention mechanism captures relationships between different variables.
Concretely, for a multivariate time series with 7 variables and 96 historical time steps:
- Standard transformer: 96 tokens, each containing 7 values
- iTransformer: 7 tokens, each containing 96 values
This inversion allows the feed-forward layers to learn temporal patterns within each variable, while the attention mechanism learns cross-variable dependencies — a much more natural decomposition of the problem.
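In terms of data layout, the inversion amounts to a transpose of the input window. The sketch below uses the 7-variable, 96-step example from the text; the function names and the synthetic values are illustrative, and the embedding and attention layers that follow are omitted.

```python
def time_step_tokens(series):
    """Standard layout: one token per time step, holding every variable's value at that step."""
    return [list(row) for row in series]


def variate_tokens(series):
    """Inverted layout: one token per variable, holding that variable's whole history."""
    return [list(column) for column in zip(*series)]


num_steps, num_variables = 96, 7
# series[t][v] is the value of variable v at time step t (synthetic values)
series = [[float(t * 10 + v) for v in range(num_variables)] for t in range(num_steps)]

standard = time_step_tokens(series)
inverted = variate_tokens(series)

print(len(standard), len(standard[0]))  # number of tokens, values per token
print(len(inverted), len(inverted[0]))
```

With the inverted layout, attention compares 7 tokens with each other rather than 96, so its cost grows with the number of variables instead of the length of the history.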
Benchmark Results
iTransformer achieved state-of-the-art results on multiple long-term forecasting benchmarks including ETTh1, ETTh2, ETTm1, ETTm2, Weather, Electricity, and Traffic datasets. It showed particular strength on datasets with strong cross-variable correlations, where its inverted attention mechanism could exploit the relationships effectively.
TimeMixer
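Published at ICLR 2024 by researchers from Ant Group and Tsinghua University, TimeMixer introduces a multi-scale mixing architecture that decomposes time series at different temporal resolutions and mixes them together. Unlike PatchTST and iTransformer, it is built entirely from MLP layers rather than attention.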
How It Works
TimeMixer operates on the insight that time series patterns exist at multiple scales: daily patterns, weekly patterns, monthly patterns, and so on. The model:
- Past Decomposable Mixing (PDM): Decomposes the historical data into multiple temporal resolutions using average pooling, then mixes seasonal and trend components across scales
- Future Multipredictor Mixing (FMM): Generates predictions at each scale independently, then combines them using learnable weights
This multi-scale approach is particularly effective for datasets with complex, multi-period seasonality (e.g., electricity consumption with daily, weekly, and annual patterns).
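The multi-resolution views that feed the mixing blocks can be produced with repeated average pooling, as mentioned for PDM. The following sketch shows only that downsampling step; the function names and the pooling window of 2 are assumptions for illustration, and the seasonal-trend decomposition and mixing layers are not included.

```python
def average_pool(series, window):
    """Downsample by averaging non-overlapping windows; a trailing partial window is dropped."""
    usable = len(series) - len(series) % window
    return [sum(series[i:i + window]) / window for i in range(0, usable, window)]


def multiscale_views(series, num_scales, window=2):
    """Return the original series followed by progressively coarser versions of it."""
    views = [list(series)]
    for _ in range(num_scales):
        views.append(average_pool(views[-1], window))
    return views


series = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
views = multiscale_views(series, num_scales=2)

for level, view in enumerate(views):
    print(level, view)
```

Fine scales retain short-period detail while coarse scales expose the slower trend, which is why mixing information across them suits series with several overlapping seasonalities.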
Lightweight Models That Rival Deep Learning
Not every use case requires a model with hundreds of millions of parameters. Recent research has shown that well-designed lightweight models can match or even exceed the performance of complex transformer architectures, while being orders of magnitude faster to train and deploy.
TSMixer and TSMixer-Rev
TSMixer, published by Google Research in 2023, is an MLP-based (Multi-Layer Perceptron) architecture that uses only simple fully-connected layers and achieves competitive performance with transformer models. The key innovation is alternating time-mixing and feature-mixing operations:
- Time-mixing MLPs: Apply shared weights across variables to capture temporal patterns
- Feature-mixing MLPs: Apply shared weights across time steps to capture cross-variable relationships
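The two mixing operations can be sketched on a tiny window of 3 time steps and 2 variables. This is a simplified illustration, not the published architecture: each "MLP" is reduced to one linear layer with a ReLU and a residual connection, normalization and dropout are omitted, and the hand-picked weight matrices stand in for learned parameters.

```python
def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def relu(vector):
    return [max(0.0, x) for x in vector]


def time_mixing(window, time_weights):
    """Apply the same time-axis layer to every variable's column, plus a residual."""
    columns = [list(col) for col in zip(*window)]  # one list per variable
    mixed = [relu(matvec(time_weights, col)) for col in columns]
    rows = [list(r) for r in zip(*mixed)]          # back to time-major layout
    return [[a + b for a, b in zip(old, new)] for old, new in zip(window, rows)]


def feature_mixing(window, feature_weights):
    """Apply the same feature-axis layer to every time step's row, plus a residual."""
    return [
        [a + b for a, b in zip(row, relu(matvec(feature_weights, row)))]
        for row in window
    ]


# window[t][v]: 3 time steps, 2 variables
window = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
time_weights = [[0.0, 0.0, 1.0]] * 3            # placeholder: every step reads the last step
feature_weights = [[0.0, 1.0], [1.0, 0.0]]      # placeholder: each variable reads the other

after_time = time_mixing(window, time_weights)
after_feature = feature_mixing(window, feature_weights)
print(after_time)
print(after_feature)
```

Note how `time_weights` is reused for both variables and `feature_weights` for all three time steps; that weight sharing is what keeps the parameter count small.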
TSMixer-Rev (Revised), published in early 2024, added reversible instance normalization to handle distribution shifts in time series data more effectively, further improving performance.
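Reversible instance normalization can be sketched as a normalize, forecast, denormalize wrapper around any model. The version below is a minimal illustration: the function names are hypothetical, the learnable affine parameters of the full method are left out, and `naive_model` is a stand-in for the actual forecasting network.

```python
import statistics


def revin_normalize(window, eps=1e-5):
    """Remove the window's own mean and standard deviation; return them for later."""
    mean = statistics.fmean(window)
    std = (statistics.pvariance(window) + eps) ** 0.5
    return [(x - mean) / std for x in window], (mean, std)


def revin_denormalize(values, stats):
    """Restore the original level and scale on the model's output."""
    mean, std = stats
    return [v * std + mean for v in values]


def naive_model(normalized_window, horizon):
    """Stand-in for the forecasting network: repeat the last normalized value."""
    return [normalized_window[-1]] * horizon


def forecast(window, horizon):
    normalized, stats = revin_normalize(window)
    return revin_denormalize(naive_model(normalized, horizon), stats)


# Same shape, very different level: the model sees identical normalized inputs.
low_level = [10.0, 11.0, 12.0, 13.0]
high_level = [1010.0, 1011.0, 1012.0, 1013.0]
norm_low, _ = revin_normalize(low_level)
norm_high, _ = revin_normalize(high_level)

print(forecast(low_level, 2), forecast(high_level, 2))
```

Because the statistics are computed per input window, a shift in the series level between training and deployment does not change what the network itself sees.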
Why Consider TSMixer
- 10-100x faster than transformer models to train
- Minimal memory footprint — runs on CPUs
- Competitive accuracy on most benchmarks
- Easy to understand, debug, and maintain
TiDE (Time-series Dense Encoder)
TiDE, also from Google Research (2023), is another MLP-based model that uses an encoder-decoder architecture with dense layers. It encodes the historical time series and covariates into a fixed-size representation, then decodes it into future predictions.
TiDE’s main advantage is its linear computational complexity with respect to both the lookback window and the forecast horizon. While transformers have quadratic complexity (O(n^2)) due to self-attention, TiDE’s MLP-based design scales linearly, making it practical for very long sequences and real-time applications.
Head-to-Head Comparison: Selecting an Appropriate Model
Choosing an appropriate model depends on the specific requirements of the task. The table below summarizes the key characteristics of each model discussed in this article.
| Model | Type | Zero-Shot | Multivariate | Open Source | Best For |
|---|---|---|---|---|---|
| Chronos | Foundation | Yes | No (univariate) | Yes | General-purpose, quick start |
| TimesFM | Foundation | Yes | No (univariate) | Yes | Long-horizon forecasting |
| Moirai | Foundation | Yes | Yes | Yes | Multivariate, mixed frequency |
| TimeGPT | Foundation | Yes | Yes | No (API) | Non-technical users, fast prototyping |
| PatchTST | Supervised | No | Yes (channel-ind.) | Yes | Long-term forecasting with training data |
| iTransformer | Supervised | No | Yes (native) | Yes | Cross-variable correlation datasets |
| TimeMixer | Supervised | No | Yes | Yes | Multi-scale seasonality |
| TSMixer | Supervised | No | Yes | Yes | Resource-constrained, fast training |
| TiDE | Supervised | No | Yes | Yes | Real-time, low-latency applications |
Decision Framework
The following decision framework helps in selecting an appropriate model for a given situation.
Availability of training data for the specific use case.
- None or very little: use a foundation model (Chronos, TimesFM, or Moirai).
- Substantial: consider supervised models (PatchTST, iTransformer) for potentially higher accuracy.
Need for multivariate forecasting.
- Required: Moirai (zero-shot) or iTransformer (supervised).
- Not required: Chronos or TimesFM, for simplicity.
Resource constraints.
- Constrained: TSMixer or TiDE (MLP-based, capable of running on a CPU).
- Unconstrained: any transformer-based model.
Need for interpretability.
- Required: TFT (Temporal Fusion Transformer) remains the best choice for interpretable forecasting.
- Not required: select on the basis of accuracy.
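The framework above can also be written as a small helper function. This is only one possible encoding: the function name is hypothetical, and the order in which the four criteria are checked (interpretability first, then data availability, then resources) is an assumption, since the text does not rank them.

```python
def recommend_models(has_training_data, needs_multivariate,
                     resource_constrained, needs_interpretability):
    """Return candidate models for a use case, following the decision framework."""
    if needs_interpretability:
        return ["TFT"]
    if not has_training_data:
        # Zero-shot foundation models
        return ["Moirai"] if needs_multivariate else ["Chronos", "TimesFM"]
    if resource_constrained:
        # MLP-based models that can run on a CPU
        return ["TSMixer", "TiDE"]
    # Supervised transformer models
    return ["iTransformer"] if needs_multivariate else ["PatchTST", "iTransformer"]


print(recommend_models(has_training_data=False, needs_multivariate=False,
                       resource_constrained=False, needs_interpretability=False))
print(recommend_models(has_training_data=True, needs_multivariate=True,
                       resource_constrained=False, needs_interpretability=False))
```

The output is a shortlist rather than a verdict; the candidates still need to be compared on held-out data from the target domain.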
Practical Guide: Getting Started with Modern Time Series Models
This section describes how to begin with Chronos, the most accessible option for zero-shot forecasting. Supervised models such as PatchTST are most easily reached through the libraries listed after the example.
Getting Started with Chronos
Chronos is available through the chronos-forecasting Python package, with pre-trained weights hosted on Hugging Face, which makes it straightforward to use.
# Install dependencies:
#   pip install chronos-forecasting torch
import numpy as np
import torch
from chronos import ChronosPipeline

# Load the pre-trained model (choose: tiny, mini, small, base, large)
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",  # use "cuda" for GPU inference
    torch_dtype=torch.float32,
)

# Historical data as a 1D torch tensor
historical_data = torch.tensor([
    112, 118, 132, 129, 121, 135, 148, 148, 136, 119,
    104, 118, 115, 126, 141, 135, 125, 149, 170, 170,
    158, 133, 114, 140,  # ... more data points
], dtype=torch.float32)

# Generate forecasts (12 steps ahead, 20 sample paths).
# The historical context is passed as the first positional argument.
forecast = pipeline.predict(
    historical_data,
    prediction_length=12,
    num_samples=20,
)

# Sample paths for the first (and only) series: shape (num_samples, prediction_length)
samples = forecast[0].numpy()

# Get median forecast and prediction intervals from the sample paths
median_forecast = np.quantile(samples, 0.5, axis=0)
lower_bound = np.quantile(samples, 0.1, axis=0)
upper_bound = np.quantile(samples, 0.9, axis=0)
print("Median forecast:", median_forecast)
print("80% prediction interval:", lower_bound, "to", upper_bound)
No training, feature engineering, or hyperparameter tuning is required. The model works by default on any univariate time series.
Key Libraries and Frameworks
The time series ecosystem includes several capable frameworks that implement many of these models under a unified API.
- NeuralForecast (Nixtla): Implements PatchTST, iTransformer, TimeMixer, TiDE, TSMixer, and more under a scikit-learn-like API. Great for supervised models.
- GluonTS (Amazon): Production-grade framework for probabilistic time series modeling. Includes DeepAR, TFT, and integrates with Chronos.
- Darts (Unit8): User-friendly library supporting both classical (ARIMA, ETS) and deep learning models. Good for beginners.
- UniTS: A unified multi-task time series model and codebase from researchers at Harvard, useful for experimenting with a single model across several time series tasks.
Investment and Business Implications
The rapid advancement in time series forecasting models has significant implications for investors and businesses across multiple sectors.
Companies Leading Development
Several publicly traded companies are at the forefront of time series AI development and deployment.
- Amazon (AMZN): Developer of Chronos, DeepAR, and GluonTS. Uses time series forecasting extensively in supply chain optimization and demand forecasting across its retail operations.
- Google/Alphabet (GOOGL): Developer of TimesFM, TiDE, TSMixer, and the original Temporal Fusion Transformer. Applies these models in Google Cloud’s Vertex AI forecasting service.
- Salesforce (CRM): Developer of Moirai and other AI research. Integrates forecasting capabilities into its CRM and analytics products.
- Palantir (PLTR): Uses advanced time series models in its Foundry platform for defense, healthcare, and commercial forecasting applications.
- Snowflake (SNOW): Offers time series forecasting as part of its Cortex AI capabilities within the data cloud platform.
Industries Being Transformed
| Industry | Application | Impact |
|---|---|---|
| Energy | Demand forecasting, renewable output prediction | 10-30% reduction in forecasting error |
| Finance | Volatility modeling, risk assessment, algorithmic trading | Improved risk-adjusted returns |
| Retail | Demand forecasting, inventory optimization | 15-25% reduction in stockouts |
| Healthcare | Patient admissions, resource planning | Better capacity planning, fewer bottlenecks |
| Manufacturing | Predictive maintenance, quality control | 20-40% reduction in unplanned downtime |
ETFs and Investment Vehicles
For investors seeking exposure to the AI and data-analytics companies driving time series forecasting innovation, the following ETFs are relevant.
- Global X Artificial Intelligence & Technology ETF (AIQ): Broad exposure to AI companies including cloud providers
- iShares Exponential Technologies ETF (XT): Includes companies at the intersection of AI, big data, and cloud computing
- ARK Autonomous Technology & Robotics ETF (ARKQ): Focuses on companies leveraging AI for automation
- First Trust Cloud Computing ETF (SKYY): Cloud infrastructure providers that host and serve these models
Conclusion: The Future of Time Series Forecasting
The time series forecasting landscape has undergone a substantial transformation in a few years. The field has moved from a situation in which every forecasting problem required building a custom model from scratch to one in which pre-trained foundation models can generate competitive forecasts by default, across domains they have never previously encountered.
The key conclusions of this analysis are summarized below.
Foundation models are the most important development. Chronos, TimesFM, Moirai, and TimeGPT represent a paradigm shift comparable to what GPT did for natural language processing. They democratize forecasting by making state-of-the-art predictions accessible without deep machine learning expertise.
Transformers have proven their worth for time series. After initial skepticism about whether transformers could outperform simple linear models, architectures like PatchTST and iTransformer have shown that transformer-based models capture complex temporal patterns well when designed with the right inductive biases. TimeMixer reports competitive accuracy with an MLP-based multi-scale design, which suggests that how the series is decomposed matters as much as the backbone.
Lightweight models should not be overlooked. TSMixer and TiDE show that well-designed MLP architectures can match transformer performance at a fraction of the computational cost. For production systems where latency and resource efficiency matter, these models are invaluable.
The field is still rapidly evolving. New models and architectures continue to emerge at a remarkable pace. The integration of time series capabilities into multimodal foundation models (combining text, images, and time series) is an active area of research that could unlock even more powerful forecasting capabilities in the coming years.
For practitioners, the recommended approach is clear: begin with a foundation model such as Chronos to establish a quick zero-shot baseline, then experiment with supervised models if greater accuracy is needed, and consider lightweight alternatives for production deployment. The barrier to entry for high-quality time series forecasting has never been lower.
References
- Ansari, A. F., et al. (2024). “Chronos: Learning the Language of Time Series.” Amazon Science. arXiv:2403.07815
- Das, A., et al. (2024). “A Decoder-Only Foundation Model for Time-Series Forecasting.” Google Research. arXiv:2310.10688
- Woo, G., et al. (2024). “Unified Training of Universal Time Series Forecasting Transformers.” Salesforce AI Research. arXiv:2402.02592
- Garza, A. and Mergenthaler-Canseco, M. (2023). “TimeGPT-1.” Nixtla. arXiv:2310.03589
- Nie, Y., et al. (2023). “A Time Series is Worth 64 Words: Long-term Forecasting with Transformers.” ICLR 2023. arXiv:2211.14730
- Liu, Y., et al. (2024). “iTransformer: Inverted Transformers Are Effective for Time Series Forecasting.” ICLR 2024. arXiv:2310.06625
- Wang, S., et al. (2024). “TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting.” ICLR 2024. arXiv:2405.14616
- Chen, S., et al. (2023). “TSMixer: An All-MLP Architecture for Time Series Forecasting.” Google Research. arXiv:2303.06053
- Das, A., et al. (2023). “Long-term Forecasting with TiDE: Time-series Dense Encoder.” Google Research. arXiv:2304.08424
- Lim, B., et al. (2021). “Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting.” International Journal of Forecasting. arXiv:1912.09363