On July 19, 2024, a faulty content update from CrowdStrike caused 8.5 million Windows machines to crash simultaneously — the largest IT outage in history. Airlines grounded flights. Hospitals postponed surgeries. Banks froze transactions. The total economic damage exceeded $10 billion. The root cause was a single bad configuration file pushed to production. An anomaly detection system monitoring the deployment’s telemetry — CPU spikes, crash rates, memory patterns — could have flagged the cascading failure within seconds and triggered an automatic rollback before 0.1% of those machines were affected.
This is not a hypothetical benefit. Companies like Netflix, Uber, and Meta operate real-time anomaly detection systems that catch exactly these patterns — sudden deviations in request latency, error rates, transaction volumes, or system metrics that indicate something has gone wrong before users notice. The difference between catching an anomaly in 30 seconds versus 30 minutes can mean the difference between a minor incident and front-page news.
Time-series anomaly detection — the task of identifying unusual patterns in sequential, timestamped data — has experienced a remarkable transformation over the past three years. Classical statistical methods that served practitioners for decades are being augmented and in some cases replaced by deep learning architectures, transformer-based models, and most recently, pre-trained foundation models that can detect anomalies in time series they’ve never seen before, without any task-specific training. The pace of innovation in this space has been extraordinary, and the gap between what’s possible in a research paper and what works in production is narrowing rapidly.
This guide covers the full landscape: from classical approaches that remain surprisingly competitive, through the deep learning revolution of 2020-2024, to the foundation model frontier of 2025-2026. Whether you’re building anomaly detection for infrastructure monitoring, financial fraud detection, predictive maintenance, or healthcare, understanding these models — their strengths, limitations, and practical trade-offs — is essential.
Why Anomaly Detection in Time Series Is Harder Than You Think
Detecting anomalies in tabular data is relatively straightforward: a transaction amount of $50,000 when the customer’s average is $200 is clearly unusual. Time-series anomaly detection is fundamentally harder because the definition of “unusual” depends on temporal context — patterns that are normal at one time are anomalous at another.
Consider server CPU usage. A spike to 95% utilization at 3 AM might be perfectly normal — that’s when the batch processing job runs. The same spike at 3 PM, when only light API traffic is expected, might indicate a runaway process or a denial-of-service attack. A gradual drift from 40% baseline to 60% over six weeks might indicate a memory leak that will eventually cause a crash. Each of these requires the detection system to understand not just the current value but its relationship to seasonal patterns, trends, and the broader temporal context.
The challenges break down into several categories:
Rarity of labeled anomalies. In most real-world datasets, anomalies represent less than 1% of observations — often less than 0.01%. Supervised learning approaches struggle because the classes are so imbalanced. Most state-of-the-art methods therefore operate in unsupervised or semi-supervised settings, learning what “normal” looks like and flagging deviations.
Concept drift. What constitutes “normal” changes over time. A system that learned normal patterns from January data may flag perfectly healthy February patterns as anomalous if the business grew, the user base shifted, or infrastructure was upgraded. Models must adapt to evolving baselines without losing sensitivity to genuine anomalies.
Multivariate dependencies. Modern systems generate hundreds or thousands of metrics simultaneously. An anomaly may not be visible in any single metric — CPU looks fine, memory looks fine, disk I/O looks fine — but the specific combination of all three at slightly elevated levels, simultaneously, indicates an emerging problem. Capturing these inter-metric correlations is where deep learning approaches excel over classical univariate methods.
A Taxonomy of Time-Series Anomalies
Before selecting a model, you need to know what kind of anomaly you’re looking for. Different model architectures excel at detecting different anomaly types:
| Anomaly Type | Description | Example | Best Detection Approach |
|---|---|---|---|
| Point anomaly | A single observation far from expected | Sudden CPU spike to 100% | Statistical thresholds, Isolation Forest |
| Contextual anomaly | Normal value in wrong context | High traffic at 4 AM (normally low) | Seasonal decomposition, LSTM, Transformer |
| Collective anomaly | A sequence of observations anomalous together | Sustained elevated error rate for 10 minutes | Sliding-window models, sequence-to-sequence |
| Trend anomaly | Gradual shift from expected trajectory | Memory usage growing 2% weekly (leak) | Change-point detection, trend decomposition |
| Shapelet anomaly | Unusual pattern shape in a subsequence | Abnormal ECG waveform morphology | Matrix Profile, deep autoencoders |
Classical Approaches: Where It All Started
Before deep learning, time-series anomaly detection relied on statistical methods that remain relevant and surprisingly competitive for many use cases. Understanding these foundations is essential — they serve as baselines, they’re interpretable, and they run efficiently without GPU infrastructure.
Statistical and Decomposition Methods
STL Decomposition + Residual Thresholding: Seasonal-Trend decomposition using LOESS (STL) separates a time series into trend, seasonal, and residual components. Anomalies are identified by flagging residuals that exceed a threshold (typically 3 standard deviations). This method is simple, interpretable, and handles seasonality well — making it excellent for business metrics like daily active users or hourly revenue.
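As a minimal sketch of the residual-thresholding idea — substituting a centered moving average for the LOESS trend and per-phase means for the seasonal fit (statsmodels' `STL` is the usual production implementation):

```python
import numpy as np

def seasonal_residual_anomalies(series, period=24, z_thresh=3.0):
    """Rough stand-in for STL + residual thresholding: moving-average
    trend, per-phase-mean seasonal component, z-score on the residual.
    Edge points with no full trend window get NaN and are never flagged."""
    x = np.asarray(series, dtype=float)
    half = period // 2
    trend = np.full(len(x), np.nan)
    trend[half:half + len(x) - period + 1] = np.convolve(
        x, np.ones(period) / period, mode="valid")
    detrended = x - trend
    seasonal = np.array([np.nanmean(detrended[p::period]) for p in range(period)])
    residual = detrended - np.tile(seasonal, len(x) // period + 1)[:len(x)]
    z = (residual - np.nanmean(residual)) / np.nanstd(residual)
    return np.abs(z) > z_thresh  # NaN z-scores compare as False

# Two weeks of hourly data with daily seasonality and one injected spike
rng = np.random.default_rng(0)
t = np.arange(24 * 14)
series = 50 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, len(t))
series[200] += 25  # point anomaly
flags = seasonal_residual_anomalies(series, period=24)
print(flags[200], flags.sum())
```

The seasonal component absorbs the daily sine wave, so only the injected spike leaves a large residual.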
ARIMA-based Detection: AutoRegressive Integrated Moving Average models forecast the next value based on historical patterns. Observations that deviate significantly from the forecast are flagged. ARIMA works well for stationary series with clear autoregressive structure but struggles with complex multi-seasonal patterns or non-linear dynamics.
Exponential Smoothing State Space Models (ETS): Similar in spirit to ARIMA but using exponential weighting of past observations. The Holt-Winters variant handles both trend and seasonality and remains a workhorse in production monitoring systems.
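The non-seasonal core of the ETS idea fits in a few lines: treat the exponentially smoothed level as a one-step-ahead forecast and flag large residuals. This is a deliberately minimal sketch — Holt-Winters adds trend and seasonal terms on top:

```python
import numpy as np

def ses_anomaly_flags(series, alpha=0.3, z_thresh=3.0):
    """Simple exponential smoothing as a one-step-ahead forecaster:
    flag points whose forecast residual exceeds z_thresh standard
    deviations of the residuals."""
    x = np.asarray(series, dtype=float)
    level = x[0]
    errors = np.zeros(len(x))
    for i in range(1, len(x)):
        errors[i] = x[i] - level               # one-step-ahead residual
        level = alpha * x[i] + (1 - alpha) * level
    sigma = errors[1:].std()
    return np.abs(errors) > z_thresh * sigma

rng = np.random.default_rng(7)
series = 100 + rng.normal(0, 2, 500)
series[300] = 130  # sudden level spike
flags = ses_anomaly_flags(series)
print(bool(flags[300]))
```

Note the smoothing lag: the point or two immediately after a spike may also flag, because the level takes a few steps to re-adapt.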
Isolation Forest and Tree-Based Methods
Isolation Forest (Liu et al., 2008) takes a brilliantly different approach: instead of building a model of normal behavior and looking for deviations, it directly identifies anomalies by measuring how easy they are to isolate. Anomalous points, being different from the majority, require fewer random partitions to separate from the rest of the data. Isolation Forest is fast, scales well to high-dimensional data, and handles multivariate anomaly detection naturally.
```python
from sklearn.ensemble import IsolationForest
import numpy as np
import pandas as pd

# Create windowed features from the raw time series
def create_features(series, window=24):
    features = []
    for i in range(window, len(series)):
        window_data = series[i - window:i]
        features.append({
            'mean': np.mean(window_data),
            'std': np.std(window_data),
            'min': np.min(window_data),
            'max': np.max(window_data),
            'last': window_data[-1],
            'trend': np.polyfit(range(window), window_data, 1)[0],
        })
    return pd.DataFrame(features)

# Fit Isolation Forest on the windowed features
features = create_features(cpu_usage_series, window=24)
model = IsolationForest(contamination=0.01, random_state=42)
predictions = model.fit_predict(features)  # -1 = anomaly, 1 = normal
```
Matrix Profile: The Subsequence Analysis Powerhouse
Matrix Profile (Yeh et al., 2016) computes the distance between every subsequence in a time series and its nearest neighbor, producing a profile of how “unique” each subsequence is. Subsequences with high matrix profile values — meaning their nearest neighbor is unusually far away — are anomalous. Matrix Profile excels at detecting shapelet anomalies (unusual pattern shapes) and is remarkably efficient thanks to the STOMP algorithm, which computes the full matrix profile in O(n²) time, independent of subsequence length.
The Python library stumpy provides production-grade Matrix Profile implementations and remains one of the most underappreciated tools in the anomaly detection practitioner’s toolkit.
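To make the quantity concrete, here is a brute-force O(n²·m) illustration of what the matrix profile measures; `stumpy.stump` computes the same thing orders of magnitude faster and is what you should use in practice:

```python
import numpy as np

def matrix_profile_brute(series, m):
    """Brute-force matrix profile: for each length-m subsequence, the
    z-normalized distance to its nearest non-trivial neighbor.
    For illustration only — use stumpy.stump in practice."""
    x = np.asarray(series, dtype=float)
    n = len(x) - m + 1
    subs = np.array([x[i:i + m] for i in range(n)])
    subs = (subs - subs.mean(axis=1, keepdims=True)) / subs.std(axis=1, keepdims=True)
    profile = np.full(n, np.inf)
    excl = m // 2  # exclusion zone around trivial self-matches
    for i in range(n):
        d = np.linalg.norm(subs - subs[i], axis=1)
        d[max(0, i - excl):i + excl + 1] = np.inf
        profile[i] = d.min()
    return profile

# A repeating pattern with one anomalous subsequence
t = np.arange(400)
series = np.sin(2 * np.pi * t / 50)
series[210:225] = 0.0  # flatten one stretch: a shapelet anomaly
mp = matrix_profile_brute(series, m=25)
print(int(np.argmax(mp)))  # peaks in/near the flattened region
```

Every normal window has a near-identical twin one period away (distance near zero); the windows overlapping the flattened stretch have no twin anywhere, so the profile spikes exactly there.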
The Deep Learning Revolution in Anomaly Detection
Starting around 2019, deep learning models began consistently outperforming classical methods on complex, multivariate anomaly detection benchmarks. The key insight: deep neural networks can learn non-linear temporal patterns that are invisible to linear statistical models.
LSTM Autoencoders: The First Deep Success
The LSTM Autoencoder architecture — an encoder that compresses a time-series window into a latent representation, followed by a decoder that reconstructs the original window — became the first widely adopted deep learning approach for time-series anomaly detection. The model learns to reconstruct “normal” patterns during training. At inference time, windows with high reconstruction error are flagged as anomalous, because the model has never learned to reconstruct those patterns.
LSTM Autoencoders handle temporal dependencies (the LSTM component) and learn what to expect (the autoencoder objective) simultaneously. They were the standard deep approach from roughly 2019-2022 and remain effective for many applications.
```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_features, hidden_size=64, n_layers=2):
        super().__init__()
        self.encoder = nn.LSTM(
            n_features, hidden_size, n_layers, batch_first=True
        )
        self.decoder = nn.LSTM(
            hidden_size, hidden_size, n_layers, batch_first=True
        )
        self.output_layer = nn.Linear(hidden_size, n_features)

    def forward(self, x):
        # Encode: compress the sequence into the final hidden state
        _, (hidden, cell) = self.encoder(x)
        # Decode: reconstruct the sequence from the repeated latent vector
        seq_len = x.size(1)
        decoder_input = hidden[-1].unsqueeze(1).repeat(1, seq_len, 1)
        decoder_out, _ = self.decoder(decoder_input)
        reconstruction = self.output_layer(decoder_out)
        return reconstruction

# Anomaly score = reconstruction error (MSE per window)
# High reconstruction error → anomaly
```
GDN and GNN-Based Methods: Modeling Inter-Metric Relationships
Graph Deviation Network (GDN) (Deng & Hooi, 2021) introduced an elegant solution for multivariate anomaly detection: model the relationships between sensors/metrics as a graph, where each node is a time series and edges represent learned dependencies. When a metric deviates from what the graph structure predicts based on its neighbors’ values, it’s flagged as anomalous.
GDN’s key advantage is its ability to identify anomalies that are invisible in individual metrics but manifest as broken inter-metric correlations. For example, in a server cluster, CPU and memory usage typically correlate. If CPU spikes but memory doesn’t — or vice versa — GDN detects the correlation violation, even if both values are individually within normal ranges.
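GDN learns these dependency structures automatically; a hand-coded sketch of the anomaly class it targets — flagging timestamps where a known-correlated pair of metrics decouples, even though each stays individually in range — looks like this (the rolling-correlation check is an illustration, not GDN's method):

```python
import numpy as np

def correlation_break_flags(a, b, window=60, corr_floor=0.5):
    """Flag timestamps where two normally-correlated metrics decouple:
    the rolling Pearson correlation over the trailing window drops
    below corr_floor. Unlike GDN, the pair (a, b) must be known
    in advance to move together."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    flags = np.zeros(len(a), dtype=bool)
    for i in range(window, len(a)):
        wa, wb = a[i - window:i], b[i - window:i]
        if wa.std() > 0 and wb.std() > 0:
            flags[i] = np.corrcoef(wa, wb)[0, 1] < corr_floor
    return flags

# CPU and memory share a common load driver until memory decouples
rng = np.random.default_rng(1)
t = np.arange(600)
load = 50 + 10 * np.sin(2 * np.pi * t / 100)   # shared load cycle
cpu = load + rng.normal(0, 1, 600)
mem = load + rng.normal(0, 1, 600)
mem[400:] = 50 + rng.normal(0, 1, 200)  # memory stops tracking load
flags = correlation_break_flags(cpu, mem, window=60)
print(flags[300], flags[500])
```

After index 400, every memory reading is still inside its historical range — only the broken CPU-memory correlation reveals the problem.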
USAD: UnSupervised Anomaly Detection
USAD (Audibert et al., 2020) combines autoencoders with adversarial training. Two decoder networks compete: one reconstructs the input from the latent space, while the other tries to reconstruct the first decoder’s output. This adversarial training scheme forces the autoencoders to learn sharper boundaries between normal and anomalous patterns, significantly improving detection accuracy compared to standard autoencoders. USAD is fast to train, works well on multivariate data, and has become a popular baseline in academic benchmarks.
Transformer-Based Models: The Current State of the Art
The transformer architecture — originally designed for natural language processing — has proven remarkably effective for time-series analysis. Its self-attention mechanism can capture long-range dependencies in sequences without the vanishing gradient problems that limit RNNs and LSTMs. Several transformer-based models have set new state-of-the-art results on anomaly detection benchmarks.
Anomaly Transformer (ICLR 2022)
Anomaly Transformer (Xu et al., 2022) introduced a key insight: in normal time-series data, each point’s attention pattern should focus on adjacent points (the “prior-association”) and on semantically similar points elsewhere in the series (the “series-association”). These two association patterns align for normal data but diverge for anomalies. Anomaly Transformer introduces an Association Discrepancy metric that measures this divergence, providing a principled anomaly score.
The model achieved state-of-the-art results on six benchmark datasets at the time of publication and remains among the strongest methods for unsupervised multivariate anomaly detection. Its key contribution — using attention pattern discrepancy rather than reconstruction error as the anomaly score — represents a conceptual advance over prior autoencoder-based approaches.
DCdetector: Dual Attention Contrastive (ICML 2023)
DCdetector (Yang et al., 2023) builds on the association discrepancy idea with a contrastive learning framework. It creates two representations of each time step — one from a “patch-wise” attention view and one from a “channel-wise” attention view — and uses contrastive learning to maximize agreement for normal patterns and divergence for anomalies. DCdetector achieved new state-of-the-art results on multiple benchmarks, improving on Anomaly Transformer’s F1 scores by 2-5 points on several datasets.
TimesNet: From Temporal to Spatial (ICLR 2023)
TimesNet (Wu et al., 2023) takes a creative approach: it transforms 1D time-series data into 2D representations by reshaping each period (daily, weekly, etc.) into a 2D image-like tensor, then applies 2D convolutional neural networks to capture both intra-period and inter-period patterns simultaneously. This transformation allows TimesNet to leverage the powerful feature extraction capabilities of CNNs — originally developed for computer vision — on temporal data.
TimesNet is a general-purpose time-series model (it handles forecasting, classification, and anomaly detection), and its multi-task capability makes it a strong choice for teams that need a single architecture serving multiple analytical needs.
| Model | Year | Core Idea | Strengths | Limitations |
|---|---|---|---|---|
| LSTM Autoencoder | 2019 | Reconstruct normal patterns | Simple, well-understood | Limited long-range context |
| GDN | 2021 | Graph-based inter-metric modeling | Catches correlation anomalies | Complex graph construction |
| Anomaly Transformer | 2022 | Attention association discrepancy | Strong benchmark results | Computationally expensive |
| TimesNet | 2023 | 1D→2D transformation + CNN | Multi-task capable | Assumes periodic structure |
| DCdetector | 2023 | Dual-attention contrastive learning | SOTA on multiple benchmarks | Requires careful tuning |
Foundation Models for Time Series: The 2025-2026 Frontier
The most exciting development in time-series analysis over the past two years has been the emergence of foundation models — large, pre-trained models that can perform time-series tasks (including anomaly detection) on data they’ve never seen before, without task-specific training. This is the same paradigm shift that GPT brought to language and CLIP brought to vision: train once on massive diverse data, then apply to arbitrary downstream tasks via fine-tuning or zero-shot inference.
TimesFM (Google, 2024)
TimesFM (Time Series Foundation Model) was developed by Google Research and pre-trained on approximately 100 billion time points from diverse sources — financial markets, weather stations, energy consumption, web traffic, and synthetic data. The 200-million-parameter model is a decoder-only transformer that generates point forecasts; anomaly detection is achieved by flagging observations that deviate significantly from the model’s zero-shot forecast.
TimesFM’s remarkable property is that it produces competitive forecasts — and therefore competitive anomaly detection — without ever seeing your specific data during training. You feed it a time series, it generates a forecast based on patterns learned from 100 billion diverse time points, and you compare actuals against forecasts. This zero-shot capability eliminates the need for per-dataset model training, dramatically reducing time-to-deployment for new monitoring use cases.
Chronos (Amazon, 2024)
Chronos (Ansari et al., 2024) from Amazon takes an innovative approach: it tokenizes time-series values into discrete bins (similar to how language models tokenize words) and then applies a standard language model architecture (T5) to the tokenized sequence. This allows Chronos to leverage battle-tested language model architectures and training recipes for time-series tasks.
Chronos offers multiple model sizes (Mini: 20M, Small: 46M, Base: 200M, Large: 710M parameters) and performs remarkably well in zero-shot evaluations. For anomaly detection, the approach is forecast-based: Chronos generates probabilistic forecasts, and observations falling outside the prediction intervals are flagged as anomalous.
```python
import torch
from chronos import ChronosPipeline

# Load a pre-trained Chronos model
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-base",
    device_map="auto",
    torch_dtype=torch.float32,
)

# Generate a probabilistic forecast (zero-shot — no training needed)
context = torch.tensor(historical_data)  # your time series
forecast = pipeline.predict(
    context,
    prediction_length=24,  # forecast the next 24 steps
    num_samples=100,       # generate 100 forecast samples
)

# Anomaly detection via prediction intervals
# (Tensor.median returns a named tuple; Tensor.quantile returns a plain tensor)
median_forecast = forecast.median(dim=1).values
lower_bound = forecast.quantile(0.025, dim=1)  # 2.5th percentile
upper_bound = forecast.quantile(0.975, dim=1)  # 97.5th percentile

# Points outside the 95% prediction interval are anomalies
anomalies = (actual_values < lower_bound) | (actual_values > upper_bound)
```
MOMENT (CMU, 2024)
MOMENT (Goswami et al., 2024) — Multi-task Open-source pre-trained Model for Every Time series — is a family of models specifically designed for multiple time-series tasks, including anomaly detection, classification, forecasting, and imputation. Unlike TimesFM and Chronos, which approach anomaly detection indirectly through forecasting, MOMENT is explicitly trained with an anomaly detection objective during pre-training.
MOMENT uses a masked reconstruction objective: during pre-training, random patches of the time series are masked, and the model learns to reconstruct them. For anomaly detection, the reconstruction error at each time step serves as the anomaly score. Observations that are hard for the model to reconstruct from context — because they deviate from patterns the model has learned across its massive pre-training dataset — receive high anomaly scores.
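The scoring principle can be illustrated without the model itself: hide a patch, reconstruct it from the surrounding context, and score by reconstruction error. The toy sketch below substitutes linear interpolation for MOMENT's pre-trained transformer — so it only catches what interpolation cannot explain — but the scoring logic has the same shape:

```python
import numpy as np

def masked_reconstruction_scores(series, patch=8):
    """Toy illustration of masked-reconstruction scoring: hide each
    patch, 'reconstruct' it by linearly interpolating between its
    neighbors (MOMENT uses a pre-trained transformer instead), and
    score each point by squared reconstruction error."""
    x = np.asarray(series, dtype=float)
    scores = np.zeros(len(x))
    for start in range(patch, len(x) - patch, patch):
        left, right = x[start - 1], x[start + patch]
        recon = np.linspace(left, right, patch + 2)[1:-1]
        scores[start:start + patch] = (x[start:start + patch] - recon) ** 2
    return scores

t = np.arange(256)
series = np.sin(2 * np.pi * t / 32)
series[100] = 3.0  # point anomaly
scores = masked_reconstruction_scores(series)
print(int(np.argmax(scores)))  # → 100
```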
MOMENT is open-source, available on Hugging Face, and supports fine-tuning for domain-specific applications. Its anomaly detection performance is competitive with specialized models that were trained on the target dataset, despite MOMENT requiring zero task-specific training.
Timer and TimeGPT: Commercial and Research Alternatives
TimeGPT (Nixtla, 2024) is a commercially available foundation model with an API-based interface. Users send time-series data to the API and receive forecasts and anomaly scores without managing any model infrastructure. TimeGPT is attractive for teams that want foundation model capabilities without the complexity of model deployment, though it requires sending data to an external service — a non-starter for sensitive applications.
Timer (Liu et al., 2024) from Tsinghua University is a generative pre-trained transformer for time series that unifies multiple analytical tasks. It uses an autoregressive next-token prediction objective (analogous to GPT) on tokenized time-series data and can perform anomaly detection, forecasting, and imputation in a single framework.
| Foundation Model | Origin | Parameters | Open Source | Anomaly Approach | Key Advantage |
|---|---|---|---|---|---|
| TimesFM | Google | 200M | Yes | Forecast-based | Massive pre-training data (100B points) |
| Chronos | Amazon | 20M-710M | Yes | Probabilistic forecast | Multiple sizes, LLM architecture |
| MOMENT | CMU | 40M-385M | Yes | Masked reconstruction | Explicit anomaly detection objective |
| TimeGPT | Nixtla | Undisclosed | No (API) | Forecast-based | Zero infrastructure, API-ready |
| Timer | Tsinghua | 67M | Yes | Autoregressive | GPT-style unified framework |
Benchmarks and Real-World Performance
The academic community evaluates anomaly detection models on several standard benchmark datasets. Understanding these benchmarks — and their limitations — helps calibrate expectations for real-world performance.
| Dataset | Domain | Dimensions | Anomaly % | Key Challenge |
|---|---|---|---|---|
| SMD | Server Machines | 38 | ~4.2% | Multi-entity, diverse patterns |
| MSL | NASA Spacecraft | 55 | ~10.7% | Telemetry with complex physics |
| SMAP | NASA Soil Moisture | 25 | ~13.1% | Sensor noise, gradual drifts |
| SWaT | Water Treatment Plant | 51 | ~12.1% | Cyber-physical attacks, subtle |
| PSM | eBay Server Metrics | 25 | ~27.8% | High anomaly rate, noisy labels |
Practical Guide: Choosing the Right Model for Your Problem
With so many available models, the selection decision can feel overwhelming. Here’s a practical decision framework based on real-world constraints:
Decision Framework
Do you have labeled anomaly data?
- Yes (100+ labeled anomalies): Fine-tune a supervised or semi-supervised model. Consider fine-tuning MOMENT or training DCdetector with the labels guiding threshold selection.
- No: Use unsupervised methods. Continue to next question.
Is this a new deployment with no historical training data?
- Yes: Use a foundation model (Chronos, TimesFM, or MOMENT) in zero-shot mode. You’ll get competitive detection immediately without any training.
- No (ample historical data): Train a specialized model for best performance. Continue to next question.
Univariate or multivariate?
- Univariate (single metric): STL decomposition + thresholding is hard to beat for simplicity and interpretability. For higher accuracy, use Matrix Profile or an LSTM autoencoder.
- Multivariate (many correlated metrics): Use Anomaly Transformer, DCdetector, or GDN to capture inter-metric correlations.
Latency requirements?
- Real-time (sub-second): Avoid transformer models for inference. Use Isolation Forest, streaming Matrix Profile (via STUMPY), or lightweight LSTM models.
- Near-real-time (seconds to minutes): Any model is feasible with proper infrastructure.
- Batch (hourly/daily): Prioritize accuracy over speed. Use the most capable model available.
Implementation: Building an Anomaly Detection Pipeline
A production anomaly detection system involves more than just a model. Here’s the full pipeline architecture:
```python
# Complete anomaly detection pipeline with Chronos
import torch
import numpy as np
from chronos import ChronosPipeline
from dataclasses import dataclass

@dataclass
class AnomalyResult:
    timestamp: str
    value: float
    expected: float
    lower_bound: float
    upper_bound: float
    anomaly_score: float
    is_anomaly: bool

class TimeSeriesAnomalyDetector:
    def __init__(
        self,
        model_name: str = "amazon/chronos-t5-small",
        context_length: int = 512,
        prediction_length: int = 1,
        confidence_level: float = 0.95,
    ):
        self.pipeline = ChronosPipeline.from_pretrained(
            model_name,
            device_map="auto",
            torch_dtype=torch.float32,
        )
        self.context_length = context_length
        self.prediction_length = prediction_length
        self.alpha = 1 - confidence_level

    def detect(
        self,
        history: np.ndarray,
        actual_value: float,
        timestamp: str,
    ) -> AnomalyResult:
        """Detect if actual_value is anomalous given history."""
        # Use the last context_length points
        context = torch.tensor(
            history[-self.context_length:]
        ).unsqueeze(0).float()
        # Generate a probabilistic forecast
        forecast = self.pipeline.predict(
            context,
            prediction_length=self.prediction_length,
            num_samples=200,
        )
        # Extract prediction intervals from the forecast samples
        # (Tensor.quantile returns a plain tensor, unlike Tensor.median)
        median = forecast.median(dim=1).values[0, 0].item()
        lower = forecast.quantile(self.alpha / 2, dim=1)[0, 0].item()
        upper = forecast.quantile(1 - self.alpha / 2, dim=1)[0, 0].item()
        # Anomaly score: deviation normalized by interval width
        interval_width = upper - lower
        if interval_width > 0:
            score = abs(actual_value - median) / interval_width
        else:
            score = abs(actual_value - median)
        is_anomaly = actual_value < lower or actual_value > upper
        return AnomalyResult(
            timestamp=timestamp,
            value=actual_value,
            expected=median,
            lower_bound=lower,
            upper_bound=upper,
            anomaly_score=score,
            is_anomaly=is_anomaly,
        )

# Usage
detector = TimeSeriesAnomalyDetector()
result = detector.detect(
    history=cpu_usage_last_7_days,
    actual_value=current_cpu_reading,
    timestamp="2026-04-03T08:15:00Z",
)
if result.is_anomaly:
    print(f"ANOMALY at {result.timestamp}: "
          f"value={result.value:.1f}, "
          f"expected={result.expected:.1f} "
          f"[{result.lower_bound:.1f}, {result.upper_bound:.1f}]")
```
Key pipeline components beyond the model itself:
- Data preprocessing: Handle missing values (forward-fill or interpolation), normalize scales across metrics, align timestamps across data sources.
- Threshold calibration: Use a validation period of known-normal data to calibrate anomaly thresholds. A threshold set too low floods operators with false positives; too high misses real incidents.
- Suppression and deduplication: A single incident may trigger dozens of anomaly alerts across correlated metrics. Group alerts by time window and root cause to avoid alert fatigue.
- Feedback loop: Operators who acknowledge or dismiss alerts provide implicit labels. Feed this data back into the model as fine-tuning signal to improve detection over time.
- Seasonal awareness: Explicitly model known business cycles (daily patterns, weekend effects, holiday traffic changes) to reduce false positives during expected-but-unusual periods.
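For the calibration step above, a minimal sketch: given anomaly scores computed over a known-normal validation window, choose the threshold that would produce an acceptable alert rate on normal data (the 0.1% budget below is an assumed operational choice, not a universal constant):

```python
import numpy as np

def calibrate_threshold(validation_scores, target_alert_rate=0.001):
    """Pick an anomaly-score threshold from a known-normal validation
    period so that roughly target_alert_rate of normal points would
    alert. Lower rates mean fewer false positives but less sensitivity."""
    scores = np.asarray(validation_scores, dtype=float)
    return float(np.quantile(scores, 1 - target_alert_rate))

# Example: scores from a quiet validation week, here simulated as
# half-normal for illustration
rng = np.random.default_rng(42)
normal_scores = np.abs(rng.normal(0, 1, 100_000))
threshold = calibrate_threshold(normal_scores, target_alert_rate=0.001)
print(round(threshold, 2))  # ≈ 3.29 for half-normal scores
```

Recalibrate periodically: under concept drift, a threshold fit to January's score distribution will slowly go stale.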
Where the Field Is Heading
Time-series anomaly detection is at an inflection point. The convergence of foundation models, transformer architectures, and practical tooling is making it possible to deploy sophisticated anomaly detection systems with dramatically less effort than even two years ago. Where a 2022 deployment required collecting domain-specific training data, training a specialized model, and calibrating thresholds through iterative experimentation, a 2026 deployment can start with a zero-shot foundation model that delivers competitive performance from day one and improves with domain-specific fine-tuning.
Several trends will shape the next 2-3 years:
Multimodal foundation models that jointly reason over time-series metrics, log messages, and trace data are emerging from research labs. An anomaly detection system that can correlate a latency spike with a specific error message in the application logs and a deployment event in the change management system would dramatically reduce mean time to diagnosis — not just detection.
LLM-augmented anomaly explanation is another frontier. Current systems tell you that something is anomalous; they rarely tell you why. Integrating LLMs that can explain anomaly detections in natural language (“CPU spiked to 95% at 3:14 PM, coinciding with a deployment of version 2.4.1 to the payment service; historical pattern suggests a connection between this deployment and similar spikes”) would close the gap between detection and remediation.
Edge deployment of lightweight anomaly detection models is becoming practical as foundation model distillation techniques improve. Running a compact anomaly detector directly on IoT devices, industrial sensors, or network routers — without round-tripping data to a cloud service — enables real-time detection with lower latency and better data privacy.
The field has moved from “can we detect anomalies automatically?” (yes, reliably, since the late 2010s) to “can we detect anomalies without per-dataset training?” (yes, with foundation models, since 2024) to the current frontier: “can we detect, explain, and suggest remediation, all in real time?” That question is being actively answered, and the pace of progress suggests it won’t be open for long.
References
- Xu, Jiehui, et al. “Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy.” ICLR 2022.
- Yang, Yiyuan, et al. “DCdetector: Dual Attention Contrastive Representation Learning for Time Series Anomaly Detection.” ICML 2023.
- Wu, Haixu, et al. “TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis.” ICLR 2023.
- Ansari, Abdul Fatir, et al. “Chronos: Learning the Language of Time Series.” arXiv:2403.07815, 2024.
- Das, Abhimanyu, et al. “A Decoder-Only Foundation Model for Time-Series Forecasting.” (TimesFM) ICML 2024.
- Goswami, Mononito, et al. “MOMENT: A Family of Open Time-Series Foundation Models.” ICML 2024.
- Deng, Ailin, and Bryan Hooi. “Graph Neural Network-Based Anomaly Detection in Multivariate Time Series.” AAAI 2021.
- Audibert, Julien, et al. “USAD: UnSupervised Anomaly Detection on Multivariate Time Series.” KDD 2020.
- Kim, Siwon, et al. “Towards a Rigorous Evaluation of Time-Series Anomaly Detection.” AAAI 2023.
- Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. “Isolation Forest.” ICDM 2008.
- Yeh, Chin-Chia Michael, et al. “Matrix Profile I: All Pairs Similarity Joins for Time Series.” ICDM 2016.
- Time-Series-Library (THU) — Unified framework for time-series models including anomaly detection
- Amazon Chronos GitHub Repository
- MOMENT GitHub Repository