ChronoGraph: A Real-World Graph-Based Multivariate Time Series Dataset
Conference: NeurIPS 2025
arXiv: 2509.04449
Code: https://github.com/bit-ml/ChronoGraph
Area: Time Series & Anomaly Detection
Keywords: microservice telemetry, graph time series, anomaly detection, real-world dataset, service dependency graph
TL;DR
This paper presents ChronoGraph — the first real-world microservice dataset that simultaneously provides multivariate time series, explicit service dependency graphs, and event-level anomaly labels (6 months / ~700 services / 5-dimensional metrics / 8005 timesteps). Benchmark results reveal substantial room for improvement in long-horizon forecasting and topology-aware modeling among existing methods.
Background & Motivation
Background: In large-scale microservice systems, forecasting short-to-medium-term evolution of service metrics is critical for alerting, auto-scaling, and capacity planning. Existing graph-based time series benchmark datasets are predominantly drawn from traffic (e.g., METR-LA) and air quality domains, and have been widely adopted in time series forecasting research.
Limitations of Prior Work:

- Traffic and air quality datasets are univariate and lack anomaly annotations;
- Industrial control datasets such as SWaT and WADI provide anomaly labels and are multivariate, but only supply process diagrams rather than true adjacency matrices;
- No existing dataset simultaneously provides multivariate time series + explicit dependency graphs + anomaly labels.
Key Challenge: The absence of real graph structures forces existing forecasting and anomaly detection methods to either process each series independently (topology-agnostic) or learn dense implicit graphs (e.g., fully connected attention, top-\(k\) similarity graphs). Such data-driven graphs may be inconsistent with the true service topology.
Goal:

- Provide a multivariate time series dataset with a real service dependency graph, enabling the community to evaluate topology-aware methods;
- Annotate real operational incidents to support anomaly detection evaluation on genuine failures;
- Identify, through benchmarking, the dimensions along which existing methods fall short.
Key Insight: Six months of production telemetry from a large enterprise microservice platform, comprising ~700 service nodes, inter-service call edges, and 17 manually annotated anomaly segments.
Core Idea: Construct the first benchmark that jointly provides multivariate time series, a real service dependency graph, and anomaly labels within a single dataset, filling a critical data gap in graph-aware temporal modeling.
Method

Overall Architecture
ChronoGraph is a dataset and benchmark contribution rather than a new model. The overall pipeline consists of:

- Data Collection: System-level metrics are collected at 30-minute intervals from all services on a production microservice platform.
- Graph Construction: A directed dependency graph is built from actual inter-service call relationships.
- Anomaly Annotation: Affected services and time windows are extracted from internal incident reports.
- Benchmark Evaluation: Statistical, foundation-model, and classical anomaly detection baselines are evaluated on forecasting and anomaly detection tasks.
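Assuming the shapes stated in this summary, the dataset can be pictured as a small set of arrays. The layout and field names below are illustrative only (the actual ChronoGraph release format may differ); note that the 40% test portion of 8005 steps is exactly the 3202-step horizon reported in the results tables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical in-memory layout implied by the dataset description
# (the actual ChronoGraph release format may differ).
N_SERVICES, N_STEPS, N_NODE_FEATS, N_EDGE_FEATS = 708, 8005, 5, 8

# Node series: one 5-dim metric vector (CPU, memory, working set,
# network ingress/egress) per service per 30-minute step.
X = rng.random((N_SERVICES, N_STEPS, N_NODE_FEATS), dtype=np.float32)

# Directed call edges as (caller, callee) pairs, each with 8-dim
# communication features (request count, return codes, latency, ...).
edges = np.array([(0, 1), (1, 2), (0, 2)])            # toy edges
edge_feats = rng.random((len(edges), N_EDGE_FEATS), dtype=np.float32)

# Anomaly labels as (service_id, start_step, end_step) segments.
labels = [(3, 1200, 1248)]  # toy placeholder for the 17 labeled incidents

# 60/40 chronological train/test split used by the benchmark.
split = round(N_STEPS * 0.6)                  # 4803 training steps
X_train, X_test = X[:, :split], X[:, split:]  # test horizon: 3202 steps
print(X_train.shape, X_test.shape)
```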
Key Designs
- Dataset Composition:
  - Function: Provide a real-world graph-structured multivariate time series dataset.
  - Core Details: 708 service nodes, each with a 5-dimensional time series (CPU utilization, memory usage, memory working set, network ingress, network egress), totaling 8005 timesteps at 30-minute granularity (~6 months). Edges represent inter-service call dependencies, each carrying 8-dimensional features (request count, return codes, latency, etc.).
  - Design Motivation: Existing datasets either lack a true graph structure or lack anomaly annotations; this dataset is the first to provide all three simultaneously.
- Anomaly Annotation Pipeline:
  - Function: Provide anomaly labels aligned with real operational incidents.
  - Mechanism: Internal incident reports written by engineers are parsed to extract affected service names and timestamps, which are then mapped to fixed-length windows centered on the report time, yielding 17 labeled anomaly segments.
  - Design Motivation: Conventional anomaly detection evaluation relies on synthetic anomalies or rule-based injection; this work provides labels derived from genuine failure events.
- Evaluation Protocol:
  - Function: Evaluate multiple method families on forecasting and anomaly detection tasks.
  - Mechanism: A 60/40 train-test split is adopted. Forecasting performance is measured with MAE, MSE, and MASE; anomaly detection is evaluated with \(F1_K\)-AUC and \(ROC_K\)-AUC, which overcome the over-optimism of traditional point adjustment (PA).
  - Design Motivation: The conventional F1 + PA paradigm substantially overestimates anomaly detection performance; \(F1_K\)-AUC integrates over varying \(K\) ratios to provide a more balanced segment-level evaluation.
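The \(F1_K\)-AUC idea can be sketched as follows: apply point adjustment to a ground-truth segment only when at least a fraction \(K\) of its points is flagged, compute point-wise F1, and average over a grid of \(K\) values. Below is a minimal reconstruction of this PA%K scheme; the paper's exact grid and implementation may differ.

```python
import numpy as np

def f1(pred, gt):
    """Point-wise F1 between binary 0/1 arrays."""
    tp = np.sum(pred & gt)
    if tp == 0:
        return 0.0
    prec, rec = tp / pred.sum(), tp / gt.sum()
    return 2 * prec * rec / (prec + rec)

def gt_segments(gt):
    """Contiguous runs of 1s in the ground-truth label array."""
    idx = np.flatnonzero(gt)
    if idx.size == 0:
        return []
    cuts = np.flatnonzero(np.diff(idx) > 1) + 1
    return [(run[0], run[-1]) for run in np.split(idx, cuts)]

def pa_at_k(pred, gt, k):
    """PA%K adjustment: a ground-truth segment counts as fully detected
    only if at least a fraction k of its points were flagged."""
    pred = pred.copy()
    for s, e in gt_segments(gt):
        if pred[s:e + 1].mean() >= k:
            pred[s:e + 1] = 1
    return pred

def f1_k_auc(pred, gt, ks=np.linspace(0, 1, 11)):
    """Average F1 over a grid of K values (K=0 reduces to classic
    point adjustment, K=1 to no adjustment at all)."""
    return float(np.mean([f1(pa_at_k(pred, gt, k), gt) for k in ks]))

gt = np.zeros(50, dtype=int); gt[10:20] = 1       # one labeled incident
pred = np.zeros(50, dtype=int); pred[10:12] = 1   # detector flags 2 of 10 points
print(round(f1_k_auc(pred, gt), 3))
```

Classic PA alone would score this prediction as a perfect detection of the segment; averaging over \(K\) penalizes having flagged only 20% of its points.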
Baseline Coverage
Four categories of methods are included:

- Statistical Models: Prophet (trend + seasonality decomposition)
- Time Series Foundation Models: Chronos-Bolt Base (zero-shot/few-shot), TabPFN-TS (transformer-based prior-data-fitted network)
- Anomaly Detectors: Autoencoder (reconstruction error), Isolation Forest, OC-SVM (one-class SVM)
- Ensemble Methods: Combination of Prophet, Isolation Forest, and Autoencoder
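To illustrate how such topology-agnostic detectors operate on this kind of data, here is a per-series Isolation Forest sketch on synthetic metrics. The point-wise scoring and the top-5% flagging threshold are illustrative choices, not the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Toy stand-in for one service's 5-dimensional metric series
# (CPU, memory, working set, net in, net out); real series have 8005 steps.
series = rng.normal(size=(500, 5))
series[300:310] += 6.0                      # injected spike for illustration

split = round(len(series) * 0.6)            # 60/40 chronological split
det = IsolationForest(random_state=0).fit(series[:split])

# score_samples: higher = more normal, so negate to get an anomaly score.
scores = -det.score_samples(series[split:])
flags = scores > np.quantile(scores, 0.95)  # flag the top 5% as anomalous
print(flags.sum(), "points flagged")
```

Each service is scored in isolation here, which is exactly the limitation the benchmark exposes: a detector like this cannot tell a local glitch from a failure propagating in from an upstream dependency.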
Key Experimental Results

Main Results — Forecasting Performance
| Model | MAE (full 3202 steps) | MSE (full) | MASE (full) | MAE (first 500 steps) | MSE (first 500 steps) | MASE (first 500 steps) |
|---|---|---|---|---|---|---|
| Prophet | 0.125±0.067 | 0.044±0.054 | 7.182±11.21 | 0.069±0.044 | 0.013±0.022 | 3.143±3.663 |
| Chronos | 0.150±0.173 | 0.343±2.426 | 7.902±12.71 | 0.044±0.030 | 0.007±0.015 | 1.938±1.731 |
| TabPFN-TS | 0.125±0.125 | 0.089±1.172 | 6.205±9.315 | 0.109±0.061 | 0.026±0.031 | 5.082±11.01 |
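As a reading aid for the MASE columns: values above 1 mean the forecaster is worse than a naive persistence baseline scaled on the training data. A minimal sketch of the standard non-seasonal definition (the paper may use a seasonal variant):

```python
import numpy as np

def mase(y_true, y_pred, y_train):
    """Mean Absolute Scaled Error: forecast MAE divided by the in-sample
    MAE of a one-step naive (persistence) forecast."""
    naive_mae = np.mean(np.abs(np.diff(y_train)))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

y_train = np.array([1.0, 2.0, 1.0, 2.0, 1.0])  # naive in-sample MAE = 1.0
y_true  = np.array([2.0, 1.0, 2.0])
y_pred  = np.array([1.5, 1.5, 1.5])
print(mase(y_true, y_pred, y_train))           # 0.5 / 1.0 = 0.5
```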
Main Results — Anomaly Detection Performance
| Method | \(F1_K\) ↑ | \(ROC_K\) ↑ | FP Rate ↓ | FN Rate ↓ | F1 ↑ |
|---|---|---|---|---|---|
| Prophet | 20.57 | 62.97 | 2.02 | 97.98 | 2.39 |
| Isolation Forest | 17.49 | 56.39 | 46.9 | 50.48 | 7.08 |
| OC-SVM | 14.46 | 54.31 | 22.13 | 77.08 | 5.50 |
| Autoencoder | 13.86 | 59.79 | 0.38 | 99.58 | 0.72 |
| TabPFN-TS | 12.37 | 54.08 | 0.55 | 99.79 | 0.31 |
| Chronos | 12.41 | 49.78 | 2.49 | 97.84 | 2.49 |
| Ensemble* | 16.92 | 60.95 | 0.20 | 99.58 | 0.73 |

*Ensemble of Prophet, Isolation Forest, and Autoencoder.
Key Findings
- Large gap between short-term and long-term forecasting: Chronos achieves the best performance over the first 500 steps (MAE 0.044) but degrades substantially over the full 3202-step horizon (MAE 0.150), indicating that current methods cannot sustain long-horizon accuracy. TabPFN-TS shows the most consistent performance across both horizons.
- All anomaly detection methods perform poorly: The best \(F1_K\) is only 20.57 (Prophet), and all methods exhibit extremely high FN rates (97.98% for Prophet), demonstrating that topology-agnostic anomaly detection is far from practical in microservice settings.
- Spatial clustering of anomalies: Predicted anomalies from the ensemble tend to cluster around densely connected service regions in the graph, suggesting that failures propagate along the dependency graph — a direct motivation for topology-aware methods.
- Limitations of foundation models: Chronos and TabPFN-TS, as per-series models, are unable to capture cross-node propagation effects, resulting in the worst anomaly detection performance.
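The spatial-clustering observation can be quantified with a simple statistic: compare the fraction of edges whose endpoints are both flagged against the value expected if the same number of flags were placed uniformly at random. A toy sketch with a hypothetical graph and flag set (not the paper's analysis):

```python
import numpy as np

# Toy directed call graph: a dense cluster {0,1,2} plus a sparse chain 3->...->9.
edges = np.array([(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1),
                  (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9)])
n = 10

flagged = np.zeros(n, dtype=bool)
flagged[[0, 1, 2]] = True                  # predicted anomalies sit in the cluster

# Fraction of edges with both endpoints flagged, vs. the value expected
# under uniformly random placement of the same number of flags.
observed = np.mean(flagged[edges[:, 0]] & flagged[edges[:, 1]])
expected = flagged.mean() ** 2
print(f"observed={observed:.2f}, expected={expected:.2f}")
```

An observed value well above the random-placement baseline, as in this toy case, is consistent with failures propagating along call edges rather than occurring independently.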
Highlights & Insights
- First "three-in-one" real-world dataset: Simultaneously provides multivariate time series, an explicit dependency graph, and anomaly labels. This combination fills a critical gap in graph-aware temporal modeling research, where practitioners previously had to validate forecasting and anomaly detection on separate datasets.
- \(F1_K\)-AUC evaluation protocol: The use of metrics integrated over varying \(K\) values avoids the performance overestimation caused by traditional point-adjustment. This protocol is broadly applicable to other time series anomaly detection benchmarks.
- Empirical evidence for anomaly propagation: Figure 1 visually demonstrates the spatial clustering of predicted anomalies in the graph, providing compelling empirical support for topology-aware anomaly detection.
- Value of dark data: The paper notes that the dataset may contain transient anomalous behaviors that were never escalated to formal incidents and are currently counted as false positives — yet carry operational significance. This observation offers methodological insights for anomaly detection evaluation.
Limitations & Future Work
- Sparse annotations: Only 17 anomaly segments are provided, covering only escalated service failures; numerous transient or self-recovering anomalies remain unlabeled, limiting the statistical significance of anomaly detection evaluation.
- All baselines are topology-agnostic: No method that genuinely leverages graph structure (e.g., GNN-based forecasting or anomaly detection) is evaluated, making it impossible to quantify the benefit of topology awareness.
- Single data source: The data originates from one enterprise's microservice platform; generalizability is unknown, as microservice architectures vary considerably across organizations.
- Coarse temporal resolution: The 30-minute interval may miss fine-grained propagation details of rapid failures.
- Future Directions:
- Evaluate graph neural networks (e.g., DCRNN, MTGNN, StemGNN) on this dataset for joint spatial-temporal forecasting.
- Model anomaly propagation along the dependency graph for root cause analysis.
- Leverage edge features (8-dimensional communication data) for edge-conditioned graph convolution.
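The last direction can be illustrated with a single edge-conditioned message-passing step in the spirit of edge-conditioned convolution: a small network maps each edge's 8-dim communication features to an edge-specific transform, so call statistics modulate how state propagates. The shapes come from the dataset description; the layer itself is a hypothetical sketch, not something the paper implements.

```python
import numpy as np

rng = np.random.default_rng(0)
D_NODE, D_EDGE = 5, 8                       # per-node metrics, per-edge features

# Toy graph: 4 services, 3 directed call edges (caller -> callee).
H = rng.normal(size=(4, D_NODE))            # node states (e.g., last metric vector)
edges = np.array([(0, 1), (1, 2), (0, 2)])
E = rng.normal(size=(3, D_EDGE))            # 8-dim communication features per edge

# Weight generator: maps edge features to a D_NODE x D_NODE message transform.
W_gen = rng.normal(size=(D_EDGE, D_NODE * D_NODE)) * 0.1

H_out = H.copy()
for (src, dst), e in zip(edges, E):
    theta = (e @ W_gen).reshape(D_NODE, D_NODE)  # edge-specific weights
    H_out[dst] += theta @ H[src]                 # message along the call edge
H_out = np.tanh(H_out)
print(H_out.shape)
```

Nodes with no incoming calls (0 and 3 here) receive no messages, so their updated state is just the nonlinearity applied to their own state; a practical layer would learn `W_gen` and stack several such steps.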
Related Work & Insights
- vs. SWaT/WADI: Industrial control datasets provide anomaly labels and are multivariate, but only supply process diagrams without true adjacency matrices. ChronoGraph provides a machine-readable service dependency graph directly usable by graph models.
- vs. METR-LA / PEMS-BAY: Traffic datasets include spatial graphs but are univariate (speed only) and lack anomaly annotations. ChronoGraph provides 5 dimensions per node and 8 dimensions per edge, along with event labels.
- vs. Chronos / TabPFN-TS and other foundation models: These models are designed for per-series zero-shot forecasting and cannot exploit graph structure. This dataset directly exposes their deficiencies in structure-aware settings.
Rating
- Novelty: ⭐⭐⭐⭐ First real-world multivariate time series dataset to simultaneously include an explicit service dependency graph and anomaly labels, filling a critical data gap.
- Experimental Thoroughness: ⭐⭐⭐ Baselines are comprehensive (statistical models + foundation models + classical AD methods), but no topology-aware method is evaluated.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with detailed data descriptions and candid discussion of limitations.
- Value: ⭐⭐⭐⭐ As a benchmark dataset, it offers lasting value for graph-aware time series research, though future work is needed to build genuine graph-based methods on top of it.