
ChronoGraph: A Real-World Graph-Based Multivariate Time Series Dataset

Conference: NeurIPS 2025
arXiv: 2509.04449
Code: https://github.com/bit-ml/ChronoGraph
Area: Time Series & Anomaly Detection
Keywords: microservice telemetry, graph time series, anomaly detection, real-world dataset, service dependency graph

TL;DR

This paper presents ChronoGraph, the first real-world microservice dataset that simultaneously provides multivariate time series, explicit service dependency graphs, and event-level anomaly labels (6 months, ~700 services, 5 metrics per service, 8005 timesteps). Benchmark results reveal substantial room for improvement in long-horizon forecasting and topology-aware modeling among existing methods.

Background & Motivation

Background: In large-scale microservice systems, forecasting short-to-medium-term evolution of service metrics is critical for alerting, auto-scaling, and capacity planning. Existing graph-based time series benchmark datasets are predominantly drawn from traffic (e.g., METR-LA) and air quality domains, and have been widely adopted in time series forecasting research.

Limitations of Prior Work:

  • Traffic and air-quality datasets are univariate and lack anomaly annotations.
  • Industrial control datasets such as SWaT and WADI are multivariate and provide anomaly labels, but supply only process diagrams rather than true adjacency matrices.
  • No existing dataset simultaneously provides multivariate time series, an explicit dependency graph, and anomaly labels.

Key Challenge: The absence of real graph structures forces existing forecasting and anomaly detection methods to either process each series independently (topology-agnostic) or learn dense implicit graphs (e.g., fully connected attention, top-\(k\) similarity graphs). Such data-driven graphs may be inconsistent with the true service topology.
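
To make the implicit-graph alternative concrete, a top-\(k\) similarity graph can be derived from pairwise series correlation. The function below is a hypothetical sketch (not from the paper) of the kind of data-driven graph the authors contrast with the true service topology:

```python
import numpy as np

def topk_similarity_graph(series: np.ndarray, k: int = 3) -> np.ndarray:
    """Build an 'implicit' graph from pairwise series correlation.

    series: (num_nodes, num_timesteps) array.
    Returns a binary adjacency matrix keeping the top-k most
    correlated neighbors per node (self-loops excluded).
    """
    corr = np.corrcoef(series)        # (N, N) Pearson correlations
    np.fill_diagonal(corr, -np.inf)   # exclude self-similarity
    adj = np.zeros_like(corr)
    # keep the k highest-correlation neighbors for each node
    top = np.argsort(-corr, axis=1)[:, :k]
    rows = np.arange(series.shape[0])[:, None]
    adj[rows, top] = 1.0
    return adj

# toy example: 5 series, 100 timesteps
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 100))
A = topk_similarity_graph(x, k=2)
```

Note that such a graph is dense in hubs and driven purely by correlation, so nothing guarantees it matches the actual call topology, which is exactly the gap ChronoGraph's explicit edges close.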

Goal:

  • Provide a multivariate time series dataset with a real service dependency graph, enabling the community to evaluate topology-aware methods.
  • Annotate real operational incidents to support anomaly detection evaluation on genuine failures.
  • Identify, through benchmarking, the dimensions along which existing methods fall short.

Key Insight: Six months of production telemetry from a large enterprise microservice platform, comprising ~700 service nodes, inter-service call edges, and 17 manually annotated anomaly segments.

Core Idea: Construct the first benchmark that jointly provides multivariate time series, a real service dependency graph, and anomaly labels within a single dataset, filling a critical data gap in graph-aware temporal modeling.

Method

Overall Architecture

ChronoGraph is a dataset and benchmark contribution rather than a new model. The overall pipeline consists of:

  • Data Collection: System-level metrics are collected at 30-minute intervals from all services on a production microservice platform.
  • Graph Construction: A directed dependency graph is built from actual inter-service call relationships.
  • Anomaly Annotation: Affected services and time windows are extracted from internal incident reports.
  • Benchmark Evaluation: Several families of methods are evaluated on forecasting and anomaly detection tasks.
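
A minimal sketch of how the pipeline's artifacts might be held in memory, using the dimensions reported in the paper (5 metrics per node, 8 features per edge, 8005 timesteps); the container and field names are hypothetical, not the released API:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GraphTSDataset:
    """Hypothetical container mirroring ChronoGraph's description."""
    node_features: np.ndarray   # (num_nodes, num_timesteps, 5) service metrics
    edge_index: np.ndarray      # (2, num_edges) directed call edges (src, dst)
    edge_features: np.ndarray   # (num_edges, 8) communication statistics
    anomaly_labels: np.ndarray  # (num_nodes, num_timesteps) binary incident labels

# toy instance (3 nodes for brevity; the real dataset has 708)
ds = GraphTSDataset(
    node_features=np.zeros((3, 8005, 5)),
    edge_index=np.array([[0, 1], [1, 2]]),   # calls 0 -> 1 and 1 -> 2
    edge_features=np.zeros((2, 8)),
    anomaly_labels=np.zeros((3, 8005), dtype=int),
)
```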

Key Designs

  1. Dataset Composition:

    • Function: Provide a real-world graph-structured multivariate time series dataset.
    • Core Details: 708 service nodes, each with a 5-dimensional time series (CPU utilization, memory usage, memory working set, network ingress, network egress), totaling 8005 timesteps at 30-minute granularity (~6 months). Edges represent inter-service call dependencies, each carrying 8-dimensional features (request count, return codes, latency, etc.).
    • Design Motivation: Existing datasets either lack a true graph structure or lack anomaly annotations; this dataset is the first to provide all three simultaneously.
  2. Anomaly Annotation Pipeline:

    • Function: Provide anomaly labels aligned with real operational incidents.
    • Mechanism: Internal incident reports written by engineers are parsed to extract affected service names and timestamps, which are then mapped to fixed-length windows centered on the report time, yielding 17 labeled anomaly segments.
    • Design Motivation: Conventional anomaly detection evaluation relies on synthetic anomalies or rule-based injection; this work provides labels derived from genuine failure events.
  3. Evaluation Protocol:

    • Function: Evaluate multiple method families on forecasting and anomaly detection tasks.
    • Mechanism: A 60/40 train-test split is adopted. Forecasting performance is measured with MAE, MSE, and MASE; anomaly detection is evaluated with \(F1_K\)-AUC and \(ROC_K\)-AUC, which overcome the over-optimism of traditional point-adjustment (PA).
    • Design Motivation: The conventional F1 + PA paradigm substantially overestimates anomaly detection performance; \(F1_K\)-AUC integrates over varying \(K\) ratios to provide a more balanced segment-level evaluation.
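
A sketch of the \(F1_K\)-AUC idea, assuming the common PA%K formulation: a ground-truth anomaly segment is credited in full only if at least a fraction \(K\) of its points is flagged, and the resulting point-wise F1 is averaged over a grid of \(K\) values. Function names and the grid are illustrative, not the paper's exact implementation:

```python
import numpy as np

def f1_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Point-wise F1 over boolean prediction/label arrays."""
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def segments(gt: np.ndarray):
    """Yield (start, end) half-open ranges of contiguous anomaly runs."""
    idx = np.flatnonzero(gt)
    if idx.size == 0:
        return []
    splits = np.flatnonzero(np.diff(idx) > 1) + 1
    return [(run[0], run[-1] + 1) for run in np.split(idx, splits)]

def f1_k_auc(pred: np.ndarray, gt: np.ndarray,
             ks=np.linspace(0.0, 1.0, 11)) -> float:
    """Average PA%K-adjusted F1 over a grid of K ratios."""
    scores = []
    for k in ks:
        adj = pred.copy()
        for s, e in segments(gt):
            # credit the whole segment only if >= K of it was flagged
            if pred[s:e].any() and pred[s:e].mean() >= k:
                adj[s:e] = True
        scores.append(f1_score(adj, gt))
    return float(np.mean(scores))
```

At \(K = 0\) this reduces to classic point adjustment (any hit credits the segment); at \(K = 1\) no adjustment happens, so averaging over \(K\) interpolates between the two extremes instead of reporting only the over-optimistic PA score.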

Baseline Coverage

Four families of methods are included:

  • Statistical Models: Prophet (trend + seasonality decomposition)
  • Time Series Foundation Models: Chronos-Bolt Base (zero-shot/few-shot), TabPFN-TS (transformer-based prior-data-fitted network)
  • Anomaly Detectors: Autoencoder (reconstruction error), Isolation Forest, OC-SVM (one-class SVM)
  • Ensemble Methods: Combination of Prophet, Isolation Forest, and Autoencoder
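
A minimal sketch of an Isolation Forest baseline on one service's 5-dimensional metrics, using scikit-learn on synthetic data. The 60/40 split mirrors the paper's protocol; the data, contamination rate, and thresholds are illustrative, not the paper's configuration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# toy stand-in for one service's telemetry: (timesteps, 5 metrics)
normal = rng.normal(0.0, 1.0, size=(480, 5))
spikes = rng.normal(6.0, 1.0, size=(20, 5))   # injected anomalous spikes
series = np.vstack([normal, spikes])

# 60/40 train-test split as in the paper's protocol
split = int(0.6 * len(series))
model = IsolationForest(contamination=0.05, random_state=0)
model.fit(series[:split])

pred = model.predict(series[split:])          # -1 = anomaly, +1 = normal
anomaly_rate = float(np.mean(pred == -1))
```

Each timestep is scored independently here, which is exactly the topology-agnostic setting the benchmark shows to be insufficient: no information from upstream or downstream services enters the detector.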

Key Experimental Results

Main Results — Forecasting Performance

| Model | MAE (full 3202 steps) | MSE (full) | MASE (full) | MAE (first 500 steps) | MSE (first 500 steps) | MASE (first 500 steps) |
|---|---|---|---|---|---|---|
| Prophet | 0.125±0.067 | 0.044±0.054 | 7.182±11.21 | 0.069±0.044 | 0.013±0.022 | 3.143±3.663 |
| Chronos | 0.150±0.173 | 0.343±2.426 | 7.902±12.71 | 0.044±0.030 | 0.007±0.015 | 1.938±1.731 |
| TabPFN-TS | 0.125±0.125 | 0.089±1.172 | 6.205±9.315 | 0.109±0.061 | 0.026±0.031 | 5.082±11.01 |
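
For reference, MASE scales the forecast MAE by the MAE of a naive lag-\(m\) forecast computed on the training series, so values above 1 mean the model is worse than the naive baseline at that lag. The paper does not spell out its exact variant, so the sketch below assumes the standard definition:

```python
import numpy as np

def mase(y_true: np.ndarray, y_pred: np.ndarray,
         y_train: np.ndarray, m: int = 1) -> float:
    """Mean Absolute Scaled Error: forecast MAE divided by the MAE of
    a naive seasonal (lag-m) forecast on the training series."""
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return float(np.mean(np.abs(y_true - y_pred)) / naive_mae)

# toy example: naive lag-1 error on train is 1.0, forecast MAE is 0.5
y_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_true = np.array([6.0, 7.0])
y_pred = np.array([6.5, 6.5])
score = mase(y_true, y_pred, y_train)
```

Under this definition, the MASE values above 6-7 over the full horizon indicate forecasts several times worse than a naive repeat-last-value baseline.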

Main Results — Anomaly Detection Performance

| Method | \(F1_K\) | \(ROC_K\) | FP Rate ↓ | FN Rate ↓ | F1 ↑ |
|---|---|---|---|---|---|
| Prophet | 20.57 | 62.97 | 2.02 | 97.98 | 2.39 |
| Isolation Forest | 17.49 | 56.39 | 46.9 | 50.48 | 7.08 |
| OC-SVM | 14.46 | 54.31 | 22.13 | 77.08 | 5.50 |
| Autoencoder | 13.86 | 59.79 | 0.38 | 99.58 | 0.72 |
| TabPFN-TS | 12.37 | 54.08 | 0.55 | 99.79 | 0.31 |
| Chronos | 12.41 | 49.78 | 2.49 | 97.84 | 2.49 |
| Ensemble* | 16.92 | 60.95 | 0.20 | 99.58 | 0.73 |

*Ensemble of Prophet, Isolation Forest, and Autoencoder.

Key Findings

  • Large gap between short-term and long-term forecasting: Chronos achieves the best performance over the first 500 steps (MAE 0.044) but degrades substantially over the full 3202-step horizon (MAE 0.150), indicating that current methods cannot sustain long-horizon accuracy. TabPFN-TS shows the most consistent performance across both horizons.
  • All anomaly detection methods perform poorly: The best \(F1_K\) is only 20.57 (Prophet), and all methods exhibit extremely high FN rates (97.98% for Prophet), demonstrating that topology-agnostic anomaly detection is far from practical in microservice settings.
  • Spatial clustering of anomalies: Predicted anomalies from the ensemble tend to cluster around densely connected service regions in the graph, suggesting that failures propagate along the dependency graph — a direct motivation for topology-aware methods.
  • Limitations of foundation models: Chronos and TabPFN-TS, as per-series models, are unable to capture cross-node propagation effects, resulting in the worst anomaly detection performance.

Highlights & Insights

  • First "three-in-one" real-world dataset: Simultaneously provides multivariate time series, an explicit dependency graph, and anomaly labels. This combination fills a critical gap in graph-aware temporal modeling research, where practitioners previously had to validate forecasting and anomaly detection on separate datasets.
  • \(F1_K\)-AUC evaluation protocol: The use of metrics integrated over varying \(K\) values avoids the performance overestimation caused by traditional point-adjustment. This protocol is broadly applicable to other time series anomaly detection benchmarks.
  • Empirical evidence for anomaly propagation: Figure 1 visually demonstrates the spatial clustering of predicted anomalies in the graph, providing compelling empirical support for topology-aware anomaly detection.
  • Value of dark data: The paper notes that the dataset may contain transient anomalous behaviors that were never escalated to formal incidents and are currently counted as false positives — yet carry operational significance. This observation offers methodological insights for anomaly detection evaluation.

Limitations & Future Work

  • Sparse annotations: Only 17 anomaly segments are provided, covering only escalated service failures; numerous transient or self-recovering anomalies remain unlabeled, limiting the statistical significance of anomaly detection evaluation.
  • All baselines are topology-agnostic: No method that genuinely leverages graph structure (e.g., GNN-based forecasting or anomaly detection) is evaluated, making it impossible to quantify the benefit of topology awareness.
  • Single data source: The data originates from one enterprise's microservice platform; generalizability is unknown, as microservice architectures vary considerably across organizations.
  • Coarse temporal resolution: The 30-minute interval may miss fine-grained propagation details of rapid failures.
  • Future Directions:
    • Evaluate graph neural networks (e.g., DCRNN, MTGNN, StemGNN) on this dataset for joint spatial-temporal forecasting.
    • Model anomaly propagation along the dependency graph for root cause analysis.
    • Leverage edge features (8-dimensional communication data) for edge-conditioned graph convolution.

Comparison with Prior Work

  • vs. SWaT/WADI: Industrial control datasets provide anomaly labels and are multivariate, but only supply process diagrams without true adjacency matrices. ChronoGraph provides a machine-readable service dependency graph directly usable by graph models.
  • vs. METR-LA / PEMS-BAY: Traffic datasets include spatial graphs but are univariate (speed only) and lack anomaly annotations. ChronoGraph provides 5 dimensions per node and 8 dimensions per edge, along with event labels.
  • vs. Chronos / TabPFN-TS and other foundation models: These models are designed for per-series zero-shot forecasting and cannot exploit graph structure. This dataset directly exposes their deficiencies in structure-aware settings.

Rating

  • Novelty: ⭐⭐⭐⭐ First real-world multivariate time series dataset to simultaneously include an explicit service dependency graph and anomaly labels, filling a critical data gap.
  • Experimental Thoroughness: ⭐⭐⭐ Baselines are comprehensive (statistical models + foundation models + classical AD methods), but no topology-aware method is evaluated.
  • Writing Quality: ⭐⭐⭐⭐ Well-structured with detailed data descriptions and candid discussion of limitations.
  • Value: ⭐⭐⭐⭐ As a benchmark dataset, it offers lasting value for graph-aware time series research, though future work is needed to build genuine graph-based methods on top of it.