
Anomaly Detection Metrics Explained: AUROC, AUPRC, F1, Precision, Recall, FAR

Last updated: May 27, 2026
Published April 30, 2026 · Updated May 27, 2026 · 25 min read

This guide examines the evaluation metrics that are appropriate for anomaly detection systems, in which the positive class is by definition rare. When 99.9 percent of transactions are legitimate, a model that flags every record as “normal” attains 99.9 percent accuracy while delivering no operational value. The choice of evaluation metric is therefore one of the most consequential decisions in an anomaly detection project.

The discussion proceeds through the metrics that are relevant for this task, from the basic measures (Precision and Recall) to threshold-independent ranking metrics (AUROC and AUPRC) and the specialised time-series metrics (PA-F1 and VUS). For each metric the formula, the trade-offs, and a full Python implementation are presented so that the material can be applied directly.

Summary

What this post covers: A complete reference for selecting and computing anomaly detection metrics, including Precision, Recall, F1, FAR, MCC, AUROC, AUPRC, the time-series variants, and Top-K measures. The discussion presents the formulas, the trade-offs, and the Python implementations for ML engineers building rare-event detectors in fraud, intrusion, defects, and biometrics.

Key insights:

  • Accuracy is degenerate when anomalies are rare. A constant “normal” predictor can score 99.9 percent, so the first decision in any anomaly-detection project is to discard accuracy as the headline metric.
  • For severely imbalanced data (anomalies below 1 percent), AUPRC is the primary ranking metric and AUROC is secondary. AUROC can appear misleadingly high on heavily imbalanced data because the large TN count keeps the false positive rate, FP / (FP + TN), small even when false positives outnumber true positives.
  • Different stakeholders require different metrics for the same model. Engineers focus on AUROC and AUPRC, operations focuses on FAR and alert volume, and finance focuses on dollar-weighted recall. A single number is therefore always a stakeholder choice in disguise.
  • Standard point-wise F1 fails for time-series anomalies because real anomalies are contiguous events, not isolated samples. Range-based F1, VUS, or NAB Score should be used instead.
  • Most production teams should report a small bundle: AUPRC, Precision@K, Recall, and FAR. This combination covers model quality, operational alert volume, miss rate, and false-alarm rate together.

Main topics: why anomaly metrics matter, the confusion matrix foundation, threshold-dependent metrics, threshold-independent metrics, a decision framework for picking metrics, time-series-specific metrics, Top-K ranking metrics, Python implementations, threshold selection for production, common pitfalls, and domain reporting templates.

Why Anomaly Detection Metrics Matter and Why Accuracy Does Not

Consider a scenario in which a team builds a fraud detector and reports that it attains 99.9 percent accuracy. The result appears impressive. When a stakeholder asks how many actual fraud cases the system caught in the previous quarter, however, the answer may be none. The model achieves 99.9 percent accuracy by predicting “not fraud” for every transaction, because the base rate of fraud at a typical payment processor is approximately 0.1 percent. The model is in effect a constant, the accuracy figure is real, and the system is operationally worthless.

This is the foundational point of anomaly detection: the positive class, namely the anomaly, is rare and sometimes extremely rare. Network intrusions, manufacturing defects, credit-card fraud, and rare diseases all have base rates between approximately 0.01 percent and 5 percent. When the negative class dominates, accuracy becomes a degenerate metric, and a model that predicts “normal” for every input will appear excellent.

This is the imbalance problem. A second issue is equally important: cost asymmetry. Missing a true anomaly (a false negative) almost always costs more than flagging a legitimate event by mistake (a false positive). A missed credit-card fraud may cost $5,000, while an unnecessary alert costs perhaps 30 seconds of an analyst’s time. These errors are not symmetric, and the chosen metric must reflect the asymmetry.

Different stakeholders are concerned with different metrics for the same model:

  • The ML engineer requires AUROC and AUPRC for comparing model architectures.
  • The product manager requires Precision@K because the user interface shows the top 50 alerts per day.
  • The operations lead requires False Alarm Rate (FAR) and Mean Time To Detect (MTTD) because analysts must triage every alert.
  • The CFO requires dollar-weighted recall, namely the fraction of fraud value caught, rather than the count of incidents.

The selection of a single number to optimise implicitly entails a stakeholder choice. The appropriate response is to report a small set of complementary metrics so that each audience receives the information that it requires.
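
The dollar-weighted recall mentioned in the list above can be sketched as follows. This is a minimal illustration in plain Python; the helper names `dollar_weighted_recall` and `count_recall` and the transaction amounts are assumptions made up for the example, not part of any library.

```python
def count_recall(y_true, y_pred):
    """Fraction of fraud incidents that were flagged."""
    caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    return caught / positives if positives else 0.0

def dollar_weighted_recall(amounts, y_true, y_pred):
    """Fraction of total fraud value that was flagged (hypothetical helper)."""
    total = sum(a for a, t in zip(amounts, y_true) if t == 1)
    caught = sum(a for a, t, p in zip(amounts, y_true, y_pred)
                 if t == 1 and p == 1)
    return caught / total if total > 0 else 0.0

# Illustrative data: four frauds; the model catches the three small ones
# and misses the single large one.
amounts = [50.0, 80.0, 70.0, 5000.0, 20.0, 35.0]
y_true  = [1, 1, 1, 1, 0, 0]
y_pred  = [1, 1, 1, 0, 0, 1]

recall_by_count = count_recall(y_true, y_pred)                     # 3 of 4 incidents
recall_by_value = dollar_weighted_recall(amounts, y_true, y_pred)  # 200 of 5,200 dollars
print(f"Recall by count = {recall_by_count:.2f}")
print(f"Recall by value = {recall_by_value:.3f}")
```

In this made-up case the count-based recall is 0.75 while the value-based recall is under 0.04, which is why the two audiences can disagree about the same model.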

Key Takeaway: Accuracy is almost never the appropriate metric for anomaly detection. The base rate is too low, and the cost of false negatives is too high. Precision, Recall, F1, AUPRC, and FAR should be used in combinations selected according to the operational objective.

The Confusion Matrix Foundation

Every metric in this guide is built from four numbers, namely the cells of the confusion matrix. By convention, in anomaly detection the anomaly is the positive class and the normal point is the negative class.

Term | Definition | Fraud Example
True Positive (TP) | Model predicts anomaly, truly is anomaly | Caught a fraudulent transaction
False Positive (FP) | Model predicts anomaly, truly is normal | Flagged a legitimate purchase
True Negative (TN) | Model predicts normal, truly is normal | Correctly cleared a normal payment
False Negative (FN) | Model predicts normal, truly is anomaly | Missed a fraudulent transaction

 

The following is a worked example. Consider 10,000 credit-card transactions in which 100 are fraudulent (a 1 percent anomaly rate) and the model produces the predictions shown below:

Figure: Confusion matrix, fraud detection (1% anomaly rate).

 | Predicted anomaly (positive) | Predicted normal (negative)
Actual anomaly | TP = 95 (caught fraud, of 100 frauds) | FN = 5 (missed fraud, slipped past)
Actual normal | FP = 30 (false alarms, of 9,900 normals) | TN = 9,870 (correctly cleared normal traffic)

Derived metrics: Precision = 95/(95+30) = 0.760. Recall = 95/(95+5) = 0.950. F1 = 2·P·R/(P+R) = 0.844. FAR = 30/(30+9870) = 0.0030. Accuracy = 99.65% (misleading).

Total = 10,000. True anomalies = 100 (1%). Predicted anomalies = 125. Accuracy alone (99.65%) hides the fact that the model missed 5 frauds and raised 30 false alarms.

From the cells above, every metric discussed in this guide is derivable. One observation is important: the accuracy for this model is (95 + 9870) / 10000 = 99.65 percent, which sounds excellent. A constant “always normal” model, however, would score 99.0 percent. The improvement from a real model is therefore only 0.65 percentage points. A comparison of two models on accuracy alone yields almost no useful information.
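
The comparison above can be reproduced with a few lines of arithmetic. The sketch below uses only the four cells of the worked example; the variable names are chosen for illustration.

```python
# Cells of the worked example (10,000 transactions, 100 frauds)
TP, FP, TN, FN = 95, 30, 9870, 5
total = TP + FP + TN + FN

model_accuracy = (TP + TN) / total
# A constant "always normal" predictor gets every normal right
# and every anomaly wrong.
baseline_accuracy = (TN + FP) / total

model_recall = TP / (TP + FN)
baseline_recall = 0 / (TP + FN)   # the constant model catches nothing

print(f"Accuracy: model = {model_accuracy:.4f}, baseline = {baseline_accuracy:.4f}")
print(f"Recall:   model = {model_recall:.4f}, baseline = {baseline_recall:.4f}")
```

Accuracy separates the two models by 0.65 percentage points, whereas recall separates them by 95 points.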

The fundamental trade-off in any threshold-based detector is as follows. Lowering the threshold catches more anomalies (TP increases) but also flags more normals (FP increases). Raising the threshold reduces false alarms (FP decreases) but misses more anomalies (FN increases). Every metric in this guide either fixes one threshold and reports performance at that point, or sweeps over all thresholds and summarises the trade-off.

Threshold-Dependent Metrics: Precision, Recall, F1, FAR, MCC

These metrics require commitment to a single decision threshold (typically 0.5 for probabilities, or a calibrated value for anomaly scores). Once the threshold is fixed, the four-cell confusion matrix can be computed and the metrics below derived.

Precision: The Purity of Alerts

Precision = TP / (TP + FP). The metric answers the question: of everything flagged as anomalous, how many actually were anomalous? In the worked example, Precision = 95/125 = 0.76, which indicates that 76 percent of the alerts were genuine fraud and 24 percent were false alarms.

Precision matters most in the following contexts:

  • Alert fatigue. If a SOC analyst receives 100 alerts per day of which 90 are incorrect, the analyst will cease to trust the system. The corresponding precision is 0.10.
  • Costly interventions. If acting on an alert involves freezing a customer’s account, the alert must be correct.
  • Limited human review capacity. When only the top 50 cases can be investigated, the investigated cases must be of high quality.

Recall (Sensitivity, True Positive Rate): The Proportion Caught

Recall = TP / (TP + FN). The metric answers: of all true anomalies, how many were caught? In the worked example, Recall = 95/100 = 0.95, a 95 percent catch rate.

Recall matters most in the following contexts:

  • Catastrophic miss costs. Cancer screening, cybersecurity intrusions, and aircraft engine faults are domains in which missing an event is unacceptable.
  • Rare but serious anomalies. When the cost of a false negative greatly exceeds the cost of a false positive.
  • Compliance and regulatory contexts. Anti-money-laundering regulations effectively mandate high recall.

F1 Score: A Balanced Measure

F1 = 2·P·R / (P + R) is the harmonic mean of Precision and Recall, constructed so that a low score in either component reduces F1 substantially. In the worked example, F1 = 2 · (0.76)(0.95) / (0.76 + 0.95) = 0.844.

The harmonic mean is preferred to the arithmetic mean because, for example, Precision = 1.0 and Recall = 0.01 (only one true anomaly flagged out of 100) should not average to 0.505, which would be misleading. The harmonic mean gives 0.0198, which more accurately reflects the model’s poor performance.

For asymmetric costs, the F-beta measure should be used:

Fβ = (1 + β²) · P · R / (β²·P + R)

  • β = 1 produces the standard F1, with equal weight on precision and recall.
  • β = 2 produces F2, in which recall is weighted twice as heavily as precision (suitable for medical or security applications).
  • β = 0.5 produces F0.5, in which precision is weighted twice as heavily as recall (suitable for alert-fatigue contexts).
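
A minimal sketch of the F-beta formula, applied to the worked example (Precision = 0.76, Recall = 0.95). The function name `f_beta` is an illustrative choice; scikit-learn's `fbeta_score` computes the same quantity from labels rather than from precomputed precision and recall.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta from precomputed precision and recall; returns 0.0 if undefined."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom > 0 else 0.0

P, R = 0.76, 0.95
f1  = f_beta(P, R, beta=1.0)   # equal weight
f2  = f_beta(P, R, beta=2.0)   # leans toward recall
f05 = f_beta(P, R, beta=0.5)   # leans toward precision

# The degenerate case from the text: perfect precision, 1 percent recall
f1_degenerate = f_beta(1.0, 0.01, beta=1.0)

print(f"F0.5 = {f05:.3f}, F1 = {f1:.3f}, F2 = {f2:.3f}")
print(f"F1 at P=1.0, R=0.01: {f1_degenerate:.4f}")
```

Because recall exceeds precision in the worked example, F2 comes out above F1 and F0.5 comes out below it.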

Specificity (TNR) and False Alarm Rate (FAR/FPR)

Specificity = TN / (TN + FP) is the fraction of true normals correctly left alone. FAR (= FPR = 1 − Specificity) is the fraction of normals that have been flagged. In the worked example, FAR = 30/9900 = 0.30 percent.

FAR is the metric that the operations team typically quotes. When 1 million events are processed per day at FAR = 0.5 percent, the result is 5,000 false alarms per day, which is operationally unworkable. Most operational systems target FAR below 0.1 percent or even 0.01 percent and accept the resulting recall.
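
The alert-volume arithmetic can be sketched as below. The anomaly rate (0.1 percent) and recall (0.9) are assumptions added for illustration, since the paragraph above specifies only the event volume and FAR; the helper names are hypothetical.

```python
def expected_false_alarms(events_per_day, anomaly_rate, far):
    """Expected daily false alarms: FAR applied to the normal events only."""
    normals = events_per_day * (1 - anomaly_rate)
    return normals * far

def precision_at_operating_point(events_per_day, anomaly_rate, far, recall):
    """Precision implied by a given FAR and recall at a given base rate."""
    tp = events_per_day * anomaly_rate * recall
    fp = expected_false_alarms(events_per_day, anomaly_rate, far)
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

events = 1_000_000
rate = 0.001        # assumed base rate, not stated in the text
recall = 0.9        # assumed recall, not stated in the text

for far in (0.005, 0.001, 0.0001):
    fa = expected_false_alarms(events, rate, far)
    prec = precision_at_operating_point(events, rate, far, recall)
    print(f"FAR = {far:.4%}: {fa:,.0f} false alarms/day, precision = {prec:.3f}")

false_alarms_half_percent = expected_false_alarms(events, rate, 0.005)
```

Under these assumptions, FAR = 0.5 percent yields roughly 5,000 false alarms per day, so precision stays low even at 90 percent recall.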

False Reject Rate (FRR)

FRR = FN / (FN + TP) = 1 − Recall. This is biometrics terminology: in face recognition or fingerprint authentication, FRR is the fraction of legitimate users incorrectly rejected. The “False Acceptance Rate” in biometrics is identical to FAR or FPR in this context.

Matthews Correlation Coefficient (MCC)

MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))

The range is [−1, +1]. A value of +1 indicates perfect classification, 0 corresponds to random classification, and −1 indicates inverted classification. Unlike F1, MCC uses all four cells of the confusion matrix and remains informative even under severe imbalance. It is particularly useful when a single, balanced number that is not deceived by a majority-class predictor is required.
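
As a quick check of the formula, the sketch below evaluates MCC on the worked example and on the constant “always normal” predictor. Returning 0 when the denominator is zero is a common convention and is adopted here as an assumption; the function name `mcc` is illustrative.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

mcc_model = mcc(tp=95, fp=30, tn=9870, fn=5)
# "Always normal": no positives predicted, so TP = FP = 0
mcc_always_normal = mcc(tp=0, fp=0, tn=9900, fn=100)

print(f"MCC (worked example) = {mcc_model:.3f}")
print(f"MCC (always normal)  = {mcc_always_normal:.3f}")
```

The constant predictor scores 99 percent accuracy but an MCC of 0, which is the behaviour the paragraph above describes.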

Balanced Accuracy

Balanced Accuracy = (Sensitivity + Specificity) / 2 is the simple average of the per-class accuracies. The “always normal” model achieves 50 percent balanced accuracy regardless of the imbalance. This metric is appropriate when an accuracy-like figure is required that does not reward majority-class prediction.

Metric | Formula | Range | When to Use
Precision | TP / (TP + FP) | [0, 1] | Alert fatigue, costly interventions
Recall (TPR, Sensitivity) | TP / (TP + FN) | [0, 1] | Catastrophic miss costs, security, medical
F1 | 2PR / (P + R) | [0, 1] | Single threshold, balanced trade-off
Fβ | (1+β²)PR / (β²P+R) | [0, 1] | Asymmetric costs (β>1: recall, β<1: precision)
Specificity (TNR) | TN / (TN + FP) | [0, 1] | Medical screening (avoid false positives)
FAR (FPR) | FP / (FP + TN) | [0, 1] | Operations, alert volume control
FRR (FNR) | FN / (FN + TP) | [0, 1] | Biometrics
MCC | see formula above | [−1, 1] | Balanced single number for imbalanced data
Balanced Accuracy | (TPR + TNR) / 2 | [0, 1] | Accuracy-like, imbalance-aware
AUROC | ∫TPR d(FPR) | [0, 1] | Threshold-free comparison, mild imbalance
AUPRC (AP) | ∫P d(R) | [0, 1] | Severe imbalance; preferred over AUROC

 

Threshold-Independent Metrics: AUROC, AUPRC, DET

The metrics above all assume that a threshold has been chosen. During model development, however, a single number that summarises the model’s quality across all possible thresholds is usually required. Ranking metrics serve this purpose.

ROC Curve and AUROC

The Receiver Operating Characteristic (ROC) curve plots TPR (on the y-axis) against FPR (on the x-axis) as the threshold varies. Each point on the curve corresponds to a different decision threshold. The area under this curve, AUROC, has a useful probabilistic interpretation:

AUROC = P(score(positive) > score(negative))

If one anomaly and one normal point are drawn at random, AUROC is the probability that the model scores the anomaly higher. A value of 0.5 corresponds to random guessing, 1.0 corresponds to perfect ranking, and 0.95 indicates that 95 percent of randomly chosen pairs are correctly ordered.
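
The pairwise interpretation can be computed directly. The sketch below counts correctly ordered (anomaly, normal) pairs, scoring ties as half a win, which is the usual Mann-Whitney convention. It runs in O(P·N) time and is meant for illustration on small data only; the function name and the toy scores are assumptions.

```python
def auroc_pairwise(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # anomaly scored higher: correct order
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))

# Toy data: 3 anomalies, 5 normals
y_true  = [0,   0,   0,   0,   1,    0,   1,   1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.7, 0.9]

auroc = auroc_pairwise(y_true, y_score)   # 13 of 15 pairs are ordered correctly
print(f"Pairwise AUROC = {auroc:.4f}")
```

On data without tied scores this agrees with `roc_auc_score` from scikit-learn, which uses a faster rank-based computation.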

AUROC has useful properties: it is threshold-independent, it is scale-invariant (only the rank order of scores matters), and the random baseline is always exactly 0.5 regardless of class balance. The last property is also its weakness.

Situations in Which AUROC Misleads

Consider the following scenario. A dataset of 1 million transactions includes 1,000 fraudulent records (a 0.1 percent rate). The model attains AUROC = 0.97, which sounds impressive. The operational usability is more sobering: at the threshold that produces 1,000 alerts, the model may catch 600 frauds and raise 400 false positives, yielding Precision = 60 percent and Recall = 60 percent. The model still misses 400 frauds, and 40 percent of alerts are false. AUROC = 0.97 has therefore conveyed an impression that the operational reality does not deliver.

The reason is that AUROC averages TPR over the full FPR range from 0 to 1. In production, however, only the range below approximately 1 percent FPR is of practical interest. Most of the AUROC area is contributed by regions in which the system will never operate. Under severe imbalance, even a sub-1 percent FPR generates substantial numbers of false positives because the negative class is very large.

Precision-Recall Curve and AUPRC

The PR curve plots Precision (on the y-axis) against Recall (on the x-axis) as the threshold varies. The area under this curve, AUPRC, also referred to as Average Precision (AP), is considerably more informative for imbalanced data. Saito and Rehmsmeier (2015) demonstrated empirically that PR curves provide a more informative picture than ROC curves when class imbalance is severe.

The random baseline for AUPRC equals the positive-class fraction. If anomalies constitute 1 percent of the data, a coin-flip detector attains AUPRC of approximately 0.01. Exceeding this baseline by a substantial margin is considerably more demanding than exceeding AUROC’s 0.5 baseline.
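
The baseline claim can be checked empirically. The sketch below implements a step-wise average precision in plain Python (the mean of the precision values at the rank of each true anomaly) and evaluates it on uniformly random scores. It does not group tied scores, so it may differ slightly from `average_precision_score` when ties are present; the data and names are illustrative assumptions.

```python
import random

def average_precision(y_true, y_score):
    """Mean of precision evaluated at the rank of each positive."""
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    tp = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

rng = random.Random(0)
n = 20_000
y_true = [1 if rng.random() < 0.01 else 0 for _ in range(n)]   # ~1% anomalies
prevalence = sum(y_true) / n

random_scores = [rng.random() for _ in range(n)]    # uninformative detector
perfect_scores = [float(t) for t in y_true]         # oracle detector

ap_random = average_precision(y_true, random_scores)
ap_perfect = average_precision(y_true, perfect_scores)

print(f"Prevalence       = {prevalence:.4f}")
print(f"AUPRC (random)   = {ap_random:.4f}")
print(f"AUPRC (perfect)  = {ap_perfect:.4f}")
```

The random detector lands close to the prevalence rather than at 0.5, which is why the same absolute AUPRC value means very different things at different base rates.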

The following figure presents the canonical illustration of the same model evaluated by both curves on a severely imbalanced dataset.

Figure: Same model, two stories: ROC vs PR (1% anomaly rate). Left panel: ROC curve (x-axis: False Positive Rate, y-axis: True Positive Rate), AUROC = 0.95, shown against the random diagonal. Right panel: Precision-Recall curve (x-axis: Recall, y-axis: Precision), AUPRC = 0.42, shown against the random baseline of 0.01. Both panels show the same model on the same data; AUROC is inflated by the large negative class.

The two curves describe the same model. AUROC = 0.95 suggests a top-tier detector, while AUPRC = 0.42 indicates that the model is adequate but will produce many false positives in production. The PR curve is closer to operational reality.

Caution: Both AUROC and AUPRC should be reported for imbalanced anomaly detection. Reporting only AUROC for a 0.1 percent anomaly task is, at best, misleading and, at worst, deceptive.

Detection Error Tradeoff (DET) Curve

The DET curve is widely used in biometrics and speaker recognition. It plots FAR (on the x-axis) against FRR (on the y-axis), with both axes on a probit (normal-deviate) scale. This transformation stretches the small-error region and facilitates comparison of near-perfect detectors. The Equal Error Rate (EER), the point at which FAR equals FRR, is a single-number summary commonly quoted in this domain.
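
A rough sketch of how an Equal Error Rate might be estimated from scores: sweep the observed score values as thresholds and take the point where FAR and FRR are closest. Real implementations usually interpolate between thresholds; this version does not, and the function names and toy scores are assumptions for illustration.

```python
def far_frr(y_true, y_score, t):
    """FAR and FRR when every score >= t is flagged as positive."""
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= t)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < t)
    n_neg = sum(1 for y in y_true if y == 0)
    n_pos = len(y_true) - n_neg
    return fp / n_neg, fn / n_pos

def equal_error_rate(y_true, y_score):
    """Return (EER estimate, threshold) at the smallest |FAR - FRR| gap."""
    best = None
    for t in sorted(set(y_score)):
        far, frr = far_frr(y_true, y_score, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

# Toy data: 5 negatives and 5 positives with partial overlap
y_true  = [0,   0,   0,   0,   0,   1,    1,   1,   1,   1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.7, 0.35, 0.6, 0.8, 0.9, 0.95]

eer, eer_threshold = equal_error_rate(y_true, y_score)
print(f"EER = {eer:.2f} at threshold = {eer_threshold:.2f}")
```

On this toy data, one negative (0.7) lies above the threshold of 0.6 and one positive (0.35) lies below it, so FAR and FRR are both 0.2.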

When to Use Which Metric: A Decision Framework

If only one decision aid is to be retained from this article, the following table should be used:

Situation | Recommended Metric(s)
Severe imbalance (anomalies < 1%) | AUPRC (primary), AUROC (secondary)
Need a single threshold for production | F1 (or F-beta if asymmetric costs)
Operations team cares about alert volume | FAR + Recall, or Precision@K
Cost-sensitive (FN ≫ FP) | Recall, F2, cost-weighted score
Cost-sensitive (FP ≫ FN) | Precision, F0.5
Model selection across architectures | AUROC for general comparison; AUPRC if imbalanced
Reporting to non-technical stakeholders | Precision@K, Recall@K, dollar-weighted recall
Time-series anomaly detection | Range-based F1, VUS, NAB Score
Biometrics / authentication | EER, DET curve, FAR @ fixed FRR

 

Most production teams report a small bundle of metrics: AUPRC, Precision@K, Recall, and FAR. This combination covers model quality, operational alert volume, miss rate, and false-alarm rate, and is sufficient for useful discussion across stakeholder groups.

Time-Series-Specific Metrics

Time-series anomaly detection is the domain in which most standard metrics fail. The central issue is that anomalies are typically events, namely contiguous segments of points rather than isolated samples. If a real anomaly lasts from t = 100 to t = 120 (21 timesteps) and a model detects it at t = 103 only, has the model detected the event? Standard point F1 records “1 TP, 20 FN”, which yields a recall of 1/21 = 4.8 percent. Operationally, however, the event has been caught, whereas the point-wise score suggests an almost complete miss.
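
The gap between the two views can be made concrete. The sketch below reproduces the t = 100 to t = 120 example and compares point-wise recall with a simple existence-based event recall (an event counts as caught if any point inside it is flagged). The event-recall definition is a simplification introduced here for illustration and is not one of the published metrics discussed below.

```python
def true_segments(y_true):
    """Return (start, end_inclusive) index pairs for each run of 1s."""
    segments, start = [], None
    for i, y in enumerate(y_true):
        if y == 1 and start is None:
            start = i
        elif y == 0 and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(y_true) - 1))
    return segments

def point_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pos = sum(y_true)
    return tp / pos if pos else 0.0

def event_recall(y_true, y_pred):
    """Fraction of true segments containing at least one flagged point."""
    segs = true_segments(y_true)
    hit = sum(1 for s, e in segs if any(y_pred[s:e + 1]))
    return hit / len(segs) if segs else 0.0

# One anomaly spanning t = 100..120; a single detection at t = 103
y_true = [1 if 100 <= t <= 120 else 0 for t in range(200)]
y_pred = [1 if t == 103 else 0 for t in range(200)]

pr = point_recall(y_true, y_pred)
er = event_recall(y_true, y_pred)
print(f"Point recall = {pr:.3f}, event recall = {er:.3f}")
```

Point recall is 1/21 while event recall is 1.0, which is the disagreement that the range-aware metrics in this section attempt to resolve in a principled way.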

Several alternative metrics have been proposed. None is fully satisfactory, and the appropriate choice remains a subject of active debate. For a more detailed survey of the models that produce these scores, see the companion guide on time-series anomaly detection models.

Point-Adjusted (PA) F1

Proposed in early time-series benchmarks (Xu et al., 2018), Point-Adjusted F1 specifies that if at least one point inside a true anomaly segment is detected, the entire segment is marked as detected. This adjustment substantially addresses the miss-by-one-point problem but it inflates scores in misleading ways. Kim et al. (2022) showed that even random scores can achieve PA-F1 above 0.9 on common benchmarks. PA-F1 should therefore be used with considerable caution and never as the sole metric.

Range-Based Precision and Recall (Tatbul et al., 2018)

The seminal paper by Tatbul et al. introduced a parametric framework for range-based recall and precision. Each detection range overlapping a real anomaly range earns partial credit, with adjustable parameters governing the reward for partial overlap (existence, cardinality, or size), the bias toward early or late detection, and the penalty for fragmentation. The framework is principled, configurable, and widely cited, but its parameters require careful selection for each use case.

NAB Score (Numenta Anomaly Benchmark)

This metric is designed for streaming detection. Each true anomaly segment is associated with a detection window. Points inside the window earn weighted positive credit (with greater credit for earlier detection), while points outside the window earn weighted negative credit. The result is normalised so that a perfect detector scores 100 and a “no detection” baseline scores 0. NAB is opinionated, since it explicitly rewards early detection, which makes it appropriate for streaming applications and inappropriate for retrospective analysis.

VUS (Volume Under the Surface, Paparrizos et al., 2022)

VUS is a range-aware extension of AUROC and AUPRC. Rather than computing area under a 2D curve, VUS computes volume under a 3D surface in which the third dimension is the detection-tolerance buffer. The result is a smooth, parameter-free range-aware metric. VUS-PR is currently among the most defensible single-number summaries for time-series anomaly detection benchmarks.

Affiliation-Based Metrics (Huet et al., 2022)

This metric defines a continuous “affiliation” between predicted and true segments based on temporal distance, with statistical normalisation that makes results comparable across datasets. It is more principled than PA-F1 but less widely supported by tooling.

Metric | Range-Aware? | Threshold-Free? | Notes
Point F1 | No | No | Penalizes brief detection lag harshly
Point-Adjusted F1 | Partially | No | Inflates scores; controversial
Range-Based F1 (Tatbul) | Yes | No | Configurable; needs parameters per use case
NAB Score | Yes | No | Rewards early detection; for streaming
VUS-ROC / VUS-PR | Yes | Yes | Modern, parameter-free, recommended
Affiliation Metrics | Yes | No | Statistical normalization; less tooled

 

Tip: For new time-series benchmarks, VUS-PR and range-based F1 with documented parameters should be reported. Reliance on PA-F1 alone should be avoided, since recent literature has shown that it can be gamed by random scores.

Top-K Metrics for Ranking

In many production environments, the relevant property is not binary classification quality but ranking quality at the top of the list. A SOC analyst reviews the top 50 alerts per shift, and a fraud team escalates the top 100 highest-risk transactions per day. For such contexts, top-K metrics are more appropriate.

  • Precision@K: of the top K most anomalous predictions, the number that correspond to true anomalies. The measure is concrete and operationally meaningful.
  • Recall@K: of all true anomalies, the number that appear in the top K. The measure is useful when a fixed review budget is in place.
  • Mean Average Precision (MAP@K): the average precision computed up to position K, which is sometimes used in ranking contexts.
  • Lift@K: Precision@K divided by the base rate. A lift of 50 indicates that alerts in the top K are 50 times more likely to be anomalies than random samples.
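
The first, second, and fourth measures in the list above can be sketched in a few lines of plain Python. The function name `top_k_metrics` and the toy data are assumptions for illustration.

```python
def top_k_metrics(y_true, y_score, k):
    """Return (Precision@K, Recall@K, Lift@K) for the k highest-scoring items."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    top = order[:k]
    hits = sum(y_true[i] for i in top)
    n_pos = sum(y_true)
    base_rate = n_pos / len(y_true)
    precision_at_k = hits / k
    recall_at_k = hits / n_pos if n_pos else 0.0
    lift_at_k = precision_at_k / base_rate if base_rate else 0.0
    return precision_at_k, recall_at_k, lift_at_k

# Toy data: 20 items, 4 true anomalies (base rate 0.2)
y_true  = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0] + [0] * 10
y_score = [0.99, 0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3] + [0.1] * 10

p_at_5, r_at_5, lift_at_5 = top_k_metrics(y_true, y_score, k=5)
print(f"Precision@5 = {p_at_5:.2f}, Recall@5 = {r_at_5:.2f}, Lift@5 = {lift_at_5:.1f}")
```

Three of the top five items are true anomalies, so Precision@5 is 0.6, Recall@5 is 0.75, and the lift over the 20 percent base rate is 3.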

Top-K metrics require K to be fixed, typically by the available human review capacity. They are less useful for academic comparisons, because different K values produce different rankings, but they are essential for production health monitoring.

Practical Implementation in Python

The following section presents the implementations. The discussion proceeds from the confusion matrix to bootstrapped AUROC confidence intervals, providing both scikit-learn shortcuts and from-scratch implementations.

Setup and Synthetic Data

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score,
    fbeta_score, roc_auc_score, average_precision_score,
    roc_curve, precision_recall_curve, matthews_corrcoef,
    balanced_accuracy_score
)

np.random.seed(42)

# 10,000 samples, 1% anomaly rate
n = 10_000
anomaly_rate = 0.01
y_true = np.random.binomial(1, anomaly_rate, size=n)

# Synthetic anomaly score: anomalies tend to score higher
# Normal points: Beta(2, 5) -> mean ~0.29
# Anomalies: shifted up by 0.4 (clipped at 1.0)
y_score = np.random.beta(2, 5, size=n) + y_true * 0.4
y_score = np.clip(y_score, 0, 1)

print(f"Total samples: {n}")
print(f"Anomalies: {y_true.sum()} ({y_true.mean()*100:.2f}%)")
print(f"Score range: [{y_score.min():.3f}, {y_score.max():.3f}]")

Building the Confusion Matrix from Scratch

def confusion_from_scratch(y_true, y_pred):
    """Compute (TN, FP, FN, TP) without sklearn."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    TP = int(((y_pred == 1) & (y_true == 1)).sum())
    FP = int(((y_pred == 1) & (y_true == 0)).sum())
    TN = int(((y_pred == 0) & (y_true == 0)).sum())
    FN = int(((y_pred == 0) & (y_true == 1)).sum())
    return TN, FP, FN, TP

threshold = 0.5
y_pred = (y_score >= threshold).astype(int)

TN, FP, FN, TP = confusion_from_scratch(y_true, y_pred)
print(f"TP = {TP}, FP = {FP}, TN = {TN}, FN = {FN}")

# Verify against sklearn
cm = confusion_matrix(y_true, y_pred)
assert (TN, FP, FN, TP) == (cm[0,0], cm[0,1], cm[1,0], cm[1,1])

All Threshold-Dependent Metrics, From Scratch

def metrics_from_confusion(TN, FP, FN, TP):
    """Compute every threshold-dependent metric from a confusion matrix."""
    eps = 1e-12
    precision = TP / (TP + FP + eps)
    recall    = TP / (TP + FN + eps)        # TPR / sensitivity
    specificity = TN / (TN + FP + eps)       # TNR
    fpr = FP / (FP + TN + eps)               # FAR / FPR
    fnr = FN / (FN + TP + eps)               # FRR
    accuracy = (TP + TN) / (TP + TN + FP + FN + eps)
    balanced_acc = (recall + specificity) / 2
    f1 = 2 * precision * recall / (precision + recall + eps)
    f2 = 5 * precision * recall / (4 * precision + recall + eps)
    f05 = 1.25 * precision * recall / (0.25 * precision + recall + eps)
    # MCC
    num = TP * TN - FP * FN
    den = np.sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN) + eps)
    mcc = num / den

    return {
        "Precision": precision, "Recall": recall, "Specificity": specificity,
        "FAR (FPR)": fpr, "FRR (FNR)": fnr, "Accuracy": accuracy,
        "BalancedAcc": balanced_acc, "F1": f1, "F2": f2, "F0.5": f05, "MCC": mcc,
    }

m = metrics_from_confusion(TN, FP, FN, TP)
for k, v in m.items():
    print(f"  {k:14s} = {v:.4f}")

# Verify with sklearn
assert abs(m["F1"] - f1_score(y_true, y_pred)) < 1e-6
assert abs(m["MCC"] - matthews_corrcoef(y_true, y_pred)) < 1e-6
assert abs(m["BalancedAcc"] - balanced_accuracy_score(y_true, y_pred)) < 1e-6

AUROC and AUPRC With sklearn

auroc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)
print(f"AUROC = {auroc:.4f}  (random baseline = 0.5)")
print(f"AUPRC = {auprc:.4f}  (random baseline = {y_true.mean():.4f})")

Plotting ROC and PR Curves

fpr, tpr, _ = roc_curve(y_true, y_score)
prec, rec, _ = precision_recall_curve(y_true, y_score)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.plot(fpr, tpr, lw=2, label=f"Model (AUROC = {auroc:.3f})")
ax1.plot([0, 1], [0, 1], "--", color="gray", label="Random")
ax1.set_xlabel("False Positive Rate")
ax1.set_ylabel("True Positive Rate")
ax1.set_title("ROC Curve")
ax1.legend()
ax1.grid(alpha=0.3)

ax2.plot(rec, prec, lw=2, color="crimson", label=f"Model (AUPRC = {auprc:.3f})")
ax2.axhline(y=y_true.mean(), linestyle="--", color="gray",
            label=f"Random = {y_true.mean():.3f}")
ax2.set_xlabel("Recall")
ax2.set_ylabel("Precision")
ax2.set_title("Precision-Recall Curve")
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.savefig("roc_pr_curves.png", dpi=120)

Finding the Optimal F1 Threshold

prec, rec, thresholds = precision_recall_curve(y_true, y_score)
# precision_recall_curve returns one extra point; align with thresholds
prec_t, rec_t = prec[:-1], rec[:-1]

f1_curve = 2 * prec_t * rec_t / (prec_t + rec_t + 1e-12)
best_idx = int(np.argmax(f1_curve))
best_threshold = thresholds[best_idx]
best_f1 = f1_curve[best_idx]

print(f"Best F1 = {best_f1:.4f} at threshold = {best_threshold:.4f}")
print(f"  Precision = {prec_t[best_idx]:.4f}")
print(f"  Recall    = {rec_t[best_idx]:.4f}")

Sweeping the Threshold

def threshold_sweep(y_true, y_score, n_thresholds=100):
    """Compute Precision, Recall, F1, FAR for a grid of thresholds."""
    grid = np.linspace(y_score.min(), y_score.max(), n_thresholds)
    rows = []
    for t in grid:
        y_pred = (y_score >= t).astype(int)
        TN, FP, FN, TP = confusion_from_scratch(y_true, y_pred)
        m = metrics_from_confusion(TN, FP, FN, TP)
        rows.append([t, m["Precision"], m["Recall"], m["F1"], m["FAR (FPR)"]])
    return np.asarray(rows)

sweep = threshold_sweep(y_true, y_score, 200)
t_grid, prec_g, rec_g, f1_g, far_g = sweep.T

plt.figure(figsize=(9, 5))
plt.plot(t_grid, prec_g, color="#e74c3c", label="Precision")
plt.plot(t_grid, rec_g,  color="#3498db", label="Recall")
plt.plot(t_grid, f1_g,   color="#27ae60", label="F1")
plt.plot(t_grid, far_g,  color="#f39c12", label="FAR")
plt.axvline(best_threshold, linestyle="--", color="black", alpha=0.6,
            label=f"Best F1 t={best_threshold:.3f}")
plt.xlabel("Threshold")
plt.ylabel("Metric value")
plt.title("Metric vs Threshold (1% anomaly rate)")
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()

Figure: Threshold trade-off. Precision, Recall, F1, and FAR plotted against the decision threshold (x-axis: Decision Threshold, y-axis: Metric Value, both from 0.0 to 1.0), with the best-F1 threshold marked at t* ≈ 0.55.

Cost-Weighted Metric

def cost_weighted_score(y_true, y_pred, c_fp=1.0, c_fn=10.0):
    """Lower is better. Useful when FN costs ~10x more than FP."""
    TN, FP, FN, TP = confusion_from_scratch(y_true, y_pred)
    return c_fp * FP + c_fn * FN

def best_threshold_by_cost(y_true, y_score, c_fp=1.0, c_fn=10.0, n=200):
    grid = np.linspace(y_score.min(), y_score.max(), n)
    costs = []
    for t in grid:
        y_pred = (y_score >= t).astype(int)
        costs.append(cost_weighted_score(y_true, y_pred, c_fp, c_fn))
    best = int(np.argmin(costs))
    return grid[best], costs[best]

t_cost, c_cost = best_threshold_by_cost(y_true, y_score, c_fp=1, c_fn=20)
print(f"Cost-optimal threshold = {t_cost:.4f}, total cost = {c_cost:.0f}")

Bootstrap Confidence Intervals: An Often Overlooked Step

Single-number reports without uncertainty estimates are problematic. A 1,000-sample test set containing 10 positives can produce widely varying AUPRC values across reasonable bootstrap resamples. The bootstrap is the standard method for attaching a confidence interval: the metric is recomputed on many resamples of the test set drawn with replacement, and the interval is read from the percentiles of the resulting distribution. Resamples that contain only one class are skipped because the ranking metrics are undefined for them.

def bootstrap_ci(y_true, y_score, metric_fn, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap percentile CI for any score-based metric."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        y_t, y_s = y_true[idx], y_score[idx]
        if y_t.sum() == 0 or y_t.sum() == n:
            continue  # degenerate resample
        scores.append(metric_fn(y_t, y_s))
    scores = np.asarray(scores)
    lo = np.quantile(scores, alpha/2)
    hi = np.quantile(scores, 1 - alpha/2)
    return float(np.mean(scores)), (float(lo), float(hi))

mean_auroc, ci_auroc = bootstrap_ci(y_true, y_score, roc_auc_score, n_boot=500)
mean_auprc, ci_auprc = bootstrap_ci(y_true, y_score, average_precision_score, n_boot=500)

print(f"AUROC = {mean_auroc:.4f}  95% CI [{ci_auroc[0]:.4f}, {ci_auroc[1]:.4f}]")
print(f"AUPRC = {mean_auprc:.4f}  95% CI [{ci_auprc[0]:.4f}, {ci_auprc[1]:.4f}]")

Time-Series PA-F1 Implementation

def get_event_segments(y):
    """Return list of (start, end_inclusive) for runs of 1s."""
    y = np.asarray(y).astype(int)
    if len(y) == 0:
        return []
    diff = np.diff(np.concatenate(([0], y, [0])))
    starts = np.where(diff == 1)[0]
    ends   = np.where(diff == -1)[0] - 1
    return list(zip(starts.tolist(), ends.tolist()))

def point_adjusted_predictions(y_true, y_pred):
    """Apply Point-Adjusted (PA) protocol: if any point inside a true
    anomaly segment is detected, flag the entire segment as detected."""
    # astype returns a new array, so the caller's predictions are not modified.
    y_pred = np.asarray(y_pred).astype(int)
    for s, e in get_event_segments(y_true):
        if y_pred[s:e+1].any():
            y_pred[s:e+1] = 1
    return y_pred

# Worked example
y_t = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0])
y_p = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

print("Raw point F1     =", round(f1_score(y_t, y_p), 4))
y_pa = point_adjusted_predictions(y_t, y_p)
print("PA-adjusted pred =", y_pa.tolist())
print("PA-F1            =", round(f1_score(y_t, y_pa), 4))

In this example the raw point F1 is 0.20: there is one TP and four FN inside the first event, three FN from the entirely missed second event, and one FP outside any event, giving precision 0.50 and recall 0.125. After point adjustment, the entire first event is marked as "detected" because one point inside it was flagged; TP rises to five, recall increases to 0.625, and PA-F1 rises to approximately 0.71. This is the inflation effect that Kim et al. (2022) identified: PA-F1 can appear impressive even when the underlying detection is weak. For range-aware alternatives, the VUS package or the Tatbul range-based implementation in the tsad Python library is recommended.
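The inflation effect can also be checked numerically. The following is a minimal sketch on synthetic data, assuming five anomaly segments of 200 points each and a "detector" with no skill that fires at random on roughly 2 percent of points. The helper names `f1_binary` and `point_adjust`, and all of the data, are illustrative assumptions rather than part of any library.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """Point-wise F1 computed from raw counts (hypothetical helper)."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def point_adjust(y_true, y_pred):
    """If any point of a true segment is flagged, flag the whole segment."""
    y_adj = y_pred.copy()
    edges = np.diff(np.concatenate(([0], y_true, [0])))
    starts = np.where(edges == 1)[0]
    ends = np.where(edges == -1)[0]  # exclusive end index of each segment
    for s, e in zip(starts, ends):
        if y_adj[s:e].any():
            y_adj[s:e] = 1
    return y_adj

rng = np.random.default_rng(0)
n = 10_000
y_seg = np.zeros(n, dtype=int)
for start in range(1_000, n, 2_000):  # five segments of 200 points each
    y_seg[start:start + 200] = 1

# Assumed no-skill detector: fires independently at random on about 2% of points.
y_rand = (rng.random(n) < 0.02).astype(int)

raw_f1 = f1_binary(y_seg, y_rand)
pa_f1 = f1_binary(y_seg, point_adjust(y_seg, y_rand))
print(f"raw F1 = {raw_f1:.3f}, PA-F1 = {pa_f1:.3f}")
```

Because a long segment is very likely to contain at least one random hit, point adjustment credits the random detector with almost every anomalous point, which is why PA-F1 is best read alongside a non-adjusted metric.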

Selecting the Threshold for Production

Once the model has been trained and AUROC and AUPRC are acceptable, the question is which threshold to deploy. The five common strategies are presented below, ordered from the simplest to the most sophisticated.

Maximise F1 on the Validation Set

Thresholds are swept on a held-out validation set, and the one with the highest F1 is selected. The procedure is simple, defensible, and yields a balanced precision and recall point. Important caveat: the threshold should never be selected on the test set, as this constitutes data leakage. Validation data must always be reserved for hyperparameter and threshold selection.
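The sweep can be sketched with scikit-learn's `precision_recall_curve`, which returns one candidate threshold per distinct score. This is a minimal illustration under stated assumptions: the function name `threshold_max_f1` and the synthetic validation and test splits are hypothetical and stand in for a real held-out split.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

def threshold_max_f1(y_val, s_val):
    """Return (threshold, F1) for the F1-maximising threshold on validation data."""
    prec, rec, thr = precision_recall_curve(y_val, s_val)
    # prec and rec have one more entry than thr; drop the final (recall = 0) point.
    prec, rec = prec[:-1], rec[:-1]
    denom = prec + rec
    f1 = np.divide(2 * prec * rec, denom, out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thr[best]), float(f1[best])

# Synthetic data (assumption): about 5% positives, positives score higher on average.
rng = np.random.default_rng(0)
y_all = (rng.random(4000) < 0.05).astype(int)
s_all = rng.normal(loc=2.0 * y_all, scale=1.0)
y_va, s_va = y_all[:2000], s_all[:2000]   # validation split: used to choose the threshold
y_te, s_te = y_all[2000:], s_all[2000:]   # test split: used only for the final report

t_f1, f1_va = threshold_max_f1(y_va, s_va)
f1_te = f1_score(y_te, (s_te >= t_f1).astype(int))
print(f"threshold = {t_f1:.4f}, validation F1 = {f1_va:.4f}, test F1 = {f1_te:.4f}")
```

The test F1 is computed with the threshold frozen from the validation split, so it is normally somewhat lower than the validation F1; that gap is the optimism that selecting on the test set would hide.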

Fixed FAR Budget

This is the operations-driven approach. For example, if the team can handle 100 alerts per day across 1 million events per day, FAR must be at most 0.01 percent. The threshold corresponding to FAR = 0.0001 on the validation set is selected, and the corresponding recall is reported. Most cybersecurity and network monitoring systems in production are tuned in this way.

def threshold_for_far_budget(y_true, y_score, far_budget=0.001):
    """Largest recall achievable subject to FAR ≤ far_budget."""
    fpr, tpr, thr = roc_curve(y_true, y_score)
    feasible = fpr <= far_budget
    if not feasible.any():
        return None, 0.0, 0.0
    idx = np.argmax(tpr * feasible)
    return float(thr[idx]), float(tpr[idx]), float(fpr[idx])

t, r, f = threshold_for_far_budget(y_true, y_score, far_budget=0.005)
print(f"Threshold = {t:.4f}, Recall = {r:.4f} at FAR = {f:.4f}")

Cost-Weighted Optimisation

If the dollar cost of a false positive (such as analyst time and customer impact) and a false negative (such as missed fraud value) can be quantified, the threshold that minimises C_FP·FP + C_FN·FN should be selected. This is the most defensible approach when the asymmetry is well understood.

Top-K Selection

This approach forgoes the threshold entirely. Scores are ranked and the top K cases are selected. It is appropriate when human review capacity is the binding constraint and alert volume per period is fixed.
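A minimal sketch of Top-K selection follows. The helper names `top_k_alerts` and `precision_at_k` and the five-element toy arrays are illustrative assumptions; in practice K would be set by the review capacity per period.

```python
import numpy as np

def top_k_alerts(scores, k):
    """Indices of the k highest-scoring cases, highest score first."""
    scores = np.asarray(scores)
    # Stable sort on negated scores: ties keep their original order.
    return np.argsort(-scores, kind="stable")[:k]

def precision_at_k(y_true, scores, k):
    """Fraction of the top-k cases that are true anomalies."""
    idx = top_k_alerts(scores, k)
    return float(np.asarray(y_true)[idx].mean())

toy_scores = [0.1, 0.9, 0.3, 0.8, 0.2]
toy_labels = [0, 1, 0, 0, 1]

top2 = top_k_alerts(toy_scores, k=2)            # indices 1 and 3
p_at_2 = precision_at_k(toy_labels, toy_scores, k=2)  # 0.5: one of the two is a true anomaly
print(top2.tolist(), p_at_2)
```

No threshold appears anywhere in this code: the alert volume is fixed at K regardless of how the score distribution shifts.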

Sliding or Contextual Threshold

Time-of-day, day-of-week, or per-segment thresholds may be used. A retail fraud detector might use a threshold of 0.6 on weekday afternoons and 0.4 on holiday weekends. Implementation typically involves a small lookup table or a contextual model that outputs both score and threshold.
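The lookup-table variant can be sketched in a few lines. The two thresholds are the ones from the retail example above; the context keys, the fallback value of 0.5, and the function name `is_alert` are assumptions made for illustration.

```python
# Per-context thresholds (values from the retail example; keys are hypothetical).
CONTEXT_THRESHOLDS = {
    "weekday_afternoon": 0.6,
    "holiday_weekend": 0.4,
}
DEFAULT_THRESHOLD = 0.5  # assumed fallback for contexts missing from the table

def is_alert(score, context):
    """Flag a case if its score meets the threshold for its context."""
    return score >= CONTEXT_THRESHOLDS.get(context, DEFAULT_THRESHOLD)

# The same score is handled differently depending on context.
print(is_alert(0.5, "weekday_afternoon"))  # False
print(is_alert(0.5, "holiday_weekend"))    # True
```

Each entry in such a table is effectively a separate threshold, so each should be selected on validation data from its own context.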

Caution: Thresholds drift. As the data distribution shifts because of seasonal effects and the evolution of fraud patterns, the threshold that maximised F1 in January may produce twice the alert volume in June. Monthly threshold retuning should be scheduled, and precision and FAR should be monitored continuously.
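One simple way to watch for this is to compare the alert rate at the deployed threshold between a reference window and a recent window. The sketch below uses synthetic Gaussian scores with a small mean shift; the threshold of 2.0, the 50 percent retuning rule, and the name `alert_rate` are assumptions for illustration, not recommended values.

```python
import numpy as np

def alert_rate(scores, threshold):
    """Fraction of events whose score meets or exceeds the threshold."""
    return float(np.mean(np.asarray(scores) >= threshold))

rng = np.random.default_rng(0)
scores_ref = rng.normal(0.0, 1.0, size=100_000)  # reference window (synthetic)
scores_new = rng.normal(0.3, 1.0, size=100_000)  # later window with a shifted mean

fixed_threshold = 2.0  # threshold chosen on the reference window (assumed)
rate_ref = alert_rate(scores_ref, fixed_threshold)
rate_new = alert_rate(scores_new, fixed_threshold)
volume_ratio = rate_new / rate_ref

# Hypothetical rule: schedule retuning if alert volume grows by more than 50%.
needs_retune = volume_ratio > 1.5
print(f"alert rate {rate_ref:.4f} -> {rate_new:.4f} (x{volume_ratio:.2f})")
```

Under these assumptions a shift of only 0.3 standard deviations in the score distribution roughly doubles the alert volume at an unchanged threshold.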

Common Pitfalls to Avoid

The most frequently encountered errors across anomaly detection projects in fraud, manufacturing, security, and healthcare are listed below.

  • Reporting AUROC without AUPRC on imbalanced data. An AUROC of 0.99 with 0.1 percent positives can coexist with an AUPRC of around 0.40. Both should always be reported.
  • Reporting accuracy. For anomaly detection, accuracy is almost always uninformative. The "always negative" baseline often matches or outperforms real models on accuracy.
  • Selecting the threshold on the test set. Tuning should be performed on the validation set, and evaluation on the test set. Maximising F1 across thresholds on the same test set constitutes overfitting.
  • Not using stratified k-fold. With 1 percent positives in 1,000 samples, a random fold may contain zero positives in the validation split. StratifiedKFold should be used.
  • Ignoring confidence intervals. A reported AUPRC of 0.42 ± 0.15 (95 percent CI) is qualitatively different from 0.42 ± 0.02. Bootstrap intervals should be computed and reported.
  • Comparing models on different test sets. This is not a like-for-like comparison. The same fixed test set must be used across all model comparisons.
  • Using point F1 for time series. A single-step detection lag reduces the score substantially. Range-based metrics or VUS should be used instead.
  • Confusing micro-average with macro-average in multi-class anomaly settings. Micro-averaging favours common classes; macro-averaging weights all classes equally. The choice must be made deliberately and documented.
  • Treating PA-F1 as a definitive measure. It can be inflated by random noise. If used, it should be reported alongside non-PA metrics.
  • Optimising offline metrics that do not translate to deployment. When the business operates on alert-volume budgets, the metric that respects that constraint should be optimised, rather than F1 alone.
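The stratified k-fold point from the list above can be illustrated directly. This is a minimal sketch assuming the 1,000-sample, 1 percent positive scenario mentioned there; the placeholder feature matrix and variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 1,000 samples with 10 positives (1 percent), as in the pitfall described above.
y_cv = np.zeros(1000, dtype=int)
y_cv[:10] = 1
X_cv = np.zeros((1000, 1))  # placeholder features; only the labels matter here

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pos_per_fold = [int(y_cv[val_idx].sum()) for _, val_idx in skf.split(X_cv, y_cv)]
print(pos_per_fold)  # each validation fold receives 2 of the 10 positives
```

With a plain `KFold`, the number of positives per validation fold is left to chance, so a fold with no positives is possible and metrics such as AUPRC become undefined on it.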

Real-World Reporting Templates by Domain

Different domains converge on different metric stacks. The following recommendations are distilled from observed production systems. For more detailed treatment of the underlying anomaly detection methods, the companion guides on Deep SVDD and One-Class SVM may be consulted.

| Domain | Recommended Metric Stack | Why |
| --- | --- | --- |
| Fraud detection | AUPRC, Precision@K, Recall, $-weighted recall | Severe imbalance + dollar asymmetry |
| Network intrusion | AUROC, Precision, FAR @ fixed Recall | Operations cares about alert volume |
| Medical screening | Sensitivity (Recall), Specificity, AUROC | Regulatory norms; symmetric reporting |
| Industrial sensor | Range-based F1, Precision@K, time-to-detect | Time-series events; early detection valued |
| Server monitoring | Precision@K, MTTD, false-alert-per-day | Streaming context, on-call workload |
| Biometrics / authentication | EER, DET curve, FAR @ fixed FRR | Field-standard reporting |
| Anti-money-laundering | Recall + Precision@K, regulatory alert quality | Compliance sets minimum recall |
| Manufacturing defect | Recall, Precision, cost-weighted score | Defect cost vs over-inspection cost |

If the model is built on top of transfer learning or fine-tuning approaches, the same metric framework applies. Particular care should be taken with confidence intervals, however, since a gap between the pre-training source distribution and the target distribution can make metric estimates from small test sets highly noisy.

Key Takeaway: A robust default reporting set for any anomaly detection project comprises AUPRC, Precision@K, Recall, and FAR, each reported with bootstrap 95 percent confidence intervals and a documented threshold. This combination covers model quality, top-of-list usefulness, miss rate, and operational alert volume.

Frequently Asked Questions

Why isn't accuracy a good metric for anomaly detection?

Because anomalies are rare. If 99% of your data is normal, a "predict normal always" model achieves 99% accuracy without learning anything. Real models typically lift accuracy by only a few tenths of a percentage point over that baseline, so accuracy can't distinguish good models from useless ones. Use AUPRC, F1, or Precision@K instead.

AUROC vs AUPRC: when should I use which?

For mild imbalance (positives 5–50%), AUROC and AUPRC tell roughly similar stories, and AUROC is fine. For severe imbalance (positives below 1%), AUROC inflates because most of its area comes from FPR regions you'll never operate in. AUPRC is more honest because its random baseline equals the positive class fraction. Best practice: report both, but rely on AUPRC for imbalanced anomaly detection.
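The baseline claim can be checked empirically. The sketch below scores synthetic data with random numbers that carry no information about the labels, assuming 1 percent positives; the variable names and sample sizes are illustrative choices.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n, n_pos = 200_000, 2_000          # 1 percent positives (assumed for illustration)
y_base = np.zeros(n, dtype=int)
y_base[:n_pos] = 1
s_base = rng.random(n)             # scores unrelated to the labels

prevalence = n_pos / n
auroc_rand = roc_auc_score(y_base, s_base)
auprc_rand = average_precision_score(y_base, s_base)
print(f"prevalence = {prevalence:.3f}, AUROC = {auroc_rand:.3f}, AUPRC = {auprc_rand:.3f}")
```

For a no-skill scorer AUROC sits near 0.5 whatever the imbalance, while AUPRC sits near the positive fraction, so an AUPRC of 0.40 at 1 percent prevalence is a far larger lift over chance than the raw number suggests.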

How do I pick a threshold for production?

Pick the strategy that matches your business constraint. If your team has a fixed alert-review budget, use top-K or fixed-FAR. If you can quantify costs, minimize C_FP·FP + C_FN·FN. If neither applies, maximize F1 on a held-out validation set. Always select the threshold on validation, evaluate on test, and re-tune monthly as data shifts.

What's the difference between FAR and FPR?

None. They are the same metric: FP / (FP + TN). "False Alarm Rate" is the operations and biometrics term; "False Positive Rate" is the statistical term. Some literature also uses "False Acceptance Rate" (biometrics, identical concept) or "Type I Error rate" (classical statistics).

Are time-series anomaly detection metrics different?

Yes. Anomalies in time series are typically contiguous events, not isolated points, so naive point-wise F1 over-penalises brief detection lag. Use range-based metrics (Tatbul et al., 2018), VUS-PR (Paparrizos et al., 2022), or NAB Score for streaming. Reliance on Point-Adjusted F1 alone should be avoided, since recent work has shown that it can be gamed by random noise.

References and Further Reading

External References:

  • scikit-learn metrics documentation: https://scikit-learn.org/stable/modules/model_evaluation.html
  • Saito, T. & Rehmsmeier, M. (2015). "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." PLOS ONE.
  • Tatbul, N., Lee, T. J., Zdonik, S., Alam, M., & Gottschlich, J. (2018). "Precision and Recall for Time Series." NeurIPS.
  • Paparrizos, J., Boniol, P., Palpanas, T., Tsay, R., Elmore, A., & Franklin, M. (2022). "Volume Under the Surface: A New Accuracy Evaluation Measure for Time-Series Anomaly Detection." VLDB.
  • Numenta Anomaly Benchmark (NAB): https://github.com/numenta/NAB
  • Huet, A., Navarro, J. M., & Rossi, D. (2022). "Local Evaluation of Time Series Anomaly Detection Algorithms." KDD.
  • Kim, S. et al. (2022). "Towards a Rigorous Evaluation of Time-Series Anomaly Detection." AAAI.

This article is for informational purposes only and does not constitute investment, security, or medical advice. Always validate metrics against your specific operational context.
