CausalDynamics: A Large-Scale Benchmark for Structural Discovery of Dynamical Causal Models

Conference: NeurIPS 2025
arXiv: 2505.16620
Code: kausable/CausalDynamics
Area: Time Series
Keywords: causal discovery, dynamical systems, benchmark, time series, ODE/SDE, causal graph

TL;DR

This paper introduces CausalDynamics, the largest benchmark to date for causal discovery in dynamical systems (14,000+ graphs, 50M+ samples). It is organized as a three-tier hierarchy of increasing complexity: 3-dimensional chaotic ODE/SDE systems, hierarchically coupled systems, and realistic climate models. Evaluating 10 state-of-the-art causal discovery algorithms on the benchmark shows that current deep learning methods struggle on high-dimensional nonlinear dynamical systems.

Background & Motivation

Broad demand for causal discovery: In nonlinear dynamical systems arising in climate science, biology, and finance, direct interventions are often infeasible, necessitating causal inference from observational time series data.
Inadequacy of existing benchmarks: Most causal discovery benchmarks are based on static causal graphs or autoregressive models, lacking characterization of continuous state-space evolution, complex feedback loops, stochasticity, and regime shifts.
Absence of ground truth in real data: Existing real-world datasets (e.g., CausalRivers, MoCap) lack fully interpretable causal ground truth, making it impossible to disentangle algorithmic limitations from data characteristics.
Limitations of synthetic data: Prior synthetic benchmarks typically contain only a small number of graphs (e.g., Netsim, DREAM3&4) and predominantly feature weakly nonlinear systems, which are insufficient for comprehensive evaluation on complex dynamical systems.
Unsystematic treatment of key challenges: Practical issues including noise, unobserved confounders, time-lag, and varsortability require a unified framework for systematic evaluation.
Paradigm-shift analogy: Just as CASP13 catalyzed AlphaFold and ImageNet enabled AlexNet, the authors argue that the causal discovery community similarly requires a large-scale standardized benchmark to drive methodological innovation.

Method

Overall Architecture: Three-Tier Progressive Complexity Hierarchy

CausalDynamics adopts a hierarchical design that increases in complexity across tiers:

Tier	Data Source	Challenges	# Graphs
Tier 1 (Simple)	59 three-dimensional chaotic ODE/SDE systems	Confounders, noise	585
Tier 2 (Coupled)	Hierarchically coupled ODE/SDE (\(N=3,5,10\))	Confounders, time-lag, normalization, forcing	14,096
Tier 3 (Climate)	MAOOAM + ENSO climate models	High dimensionality	12

Key Design 1: Structural Dynamical Causal Model (SDCM)

The classical structural causal model (SCM) is combined with differential equations to define the SDCM:

\[\frac{d}{dt}x_{k,t} := f^k(\boldsymbol{x}_{\text{PA}_k, t}, \delta), \quad x_{k,0} = x_k(0)\]

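where \(\boldsymbol{x}_{\text{PA}_k, t}\) denotes the states of the causal parents of \(x_k\) at time \(t\), and \(\delta\) controls the noise magnitude: \(\delta=0\) corresponds to an ODE and \(\delta>0\) to an SDE. The causal graph of each system is represented by an adjacency matrix \(\mathcal{A}\).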
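To make the SDCM definition concrete, here is a minimal sketch that writes the Lorenz-63 system in that form and integrates it with the Euler-Maruyama scheme. Several choices are assumptions made for illustration rather than details taken from the benchmark: the textbook parameter values, additive Gaussian noise scaled by \(\delta\), the step size, and the row-is-effect orientation of the adjacency matrix.

```python
import math
import random

# Textbook Lorenz-63 parameters (assumed, not taken from the benchmark).
SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0


def drift(x, y, z):
    """Deterministic rate functions f^k; each reads only its causal parents."""
    return (
        SIGMA * (y - x),      # dx/dt: parents x, y
        x * (RHO - z) - y,    # dy/dt: parents x, y, z
        x * y - BETA * z,     # dz/dt: parents x, y, z
    )


# ADJACENCY[i][j] = 1 if variable j appears in the equation of variable i.
# The row = effect, column = cause orientation is an assumption.
ADJACENCY = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 1],
]


def simulate(n_steps=1000, dt=0.01, delta=0.0, seed=0, state=(1.0, 1.0, 1.0)):
    """Euler-Maruyama integration; delta=0 reduces to explicit Euler (ODE)."""
    rng = random.Random(seed)
    x, y, z = state
    trajectory = [(x, y, z)]
    scale = delta * math.sqrt(dt)  # additive noise with magnitude delta (assumed form)
    for _ in range(n_steps):
        dx, dy, dz = drift(x, y, z)  # evaluate all rates from the old state
        x += dx * dt + scale * rng.gauss(0.0, 1.0)
        y += dy * dt + scale * rng.gauss(0.0, 1.0)
        z += dz * dt + scale * rng.gauss(0.0, 1.0)
        trajectory.append((x, y, z))
    return trajectory
```

Calling `simulate(delta=0.0)` yields the ODE trajectory and `simulate(delta=0.5)` a stochastic one; in both cases the ground-truth graph is the same `ADJACENCY`, read directly off the equations.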

Key Design 2: GNR-Based Hierarchical Coupled Graph Generation

Tier 2 employs the Growing Network with Redirection (GNR) model to generate scale-free DAGs:

Nodes (causal units): Each node represents a \(d\)-dimensional time series; root nodes can be driven by a dynamical system (Lorenz/Rössler), a periodic forcing (\(A\sin(\omega t + \phi)\)), or a linear driver.
Edges (coupling functions): Implemented via MLPs, with activation functions sampled from {identity, sin, sigmoid, tanh, ReLU} and weights sparsified by a dropout probability \(p_{\text{zero}}\).
Information aggregation: The value of a non-root node \(i\) is obtained by summing the MLP-transformed signals from its parent nodes: \(x_i(t) = \sum_{k \in \text{pa}_i} f_{(k,i)}(x_k(t))\).
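The three ingredients above (graph growth, random coupling functions, and additive aggregation) can be sketched in plain Python as follows. This is an illustrative reading, not the benchmark's code: the function names, the one-hidden-layer MLP shape, the Gaussian weight initialization, and the choice to orient each edge from the newly added node to the older node it attaches to are all assumptions.

```python
import math
import random

# Activation set listed in the text; sigmoid is written via tanh to avoid overflow.
ACTIVATIONS = {
    "identity": lambda v: v,
    "sin": math.sin,
    "sigmoid": lambda v: 0.5 * (1.0 + math.tanh(0.5 * v)),
    "tanh": math.tanh,
    "relu": lambda v: max(0.0, v),
}


def gnr_edges(n_nodes, p_redirect, rng):
    """Growing network with redirection; returns (parent, child) pairs."""
    target_of = {}  # node -> node it attached to
    edges = []
    for new in range(1, n_nodes):
        target = rng.randrange(new)  # uniform pick among existing nodes
        # With probability p_redirect, attach to the target's own target instead.
        if target in target_of and rng.random() < p_redirect:
            target = target_of[target]
        target_of[new] = target
        edges.append((new, target))  # assumed orientation: new node drives older node
    return edges


def make_coupling(dim, hidden, p_zero, rng):
    """Random one-hidden-layer MLP R^dim -> R^dim with weights zeroed w.p. p_zero."""
    def weights(rows, cols):
        return [[0.0 if rng.random() < p_zero else rng.gauss(0.0, 1.0)
                 for _ in range(cols)] for _ in range(rows)]

    w_in, w_out = weights(hidden, dim), weights(dim, hidden)
    act = ACTIVATIONS[rng.choice(sorted(ACTIVATIONS))]

    def coupling(x):
        h = [act(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
        return [sum(w * hi for w, hi in zip(row, h)) for row in w_out]

    return coupling


def aggregate(node, edges, couplings, values):
    """Sum the coupled signals of all parents of `node` at one time step."""
    dim = len(next(iter(values.values())))
    total = [0.0] * dim
    for parent, child in edges:
        if child == node:
            signal = couplings[(parent, child)](values[parent])
            total = [a + b for a, b in zip(total, signal)]
    return total


rng = random.Random(0)
edges = gnr_edges(n_nodes=5, p_redirect=0.5, rng=rng)
couplings = {edge: make_coupling(dim=3, hidden=4, p_zero=0.3, rng=rng) for edge in edges}
```

Because every new node attaches only to an older one, the resulting graph is acyclic by construction, and nodes that nothing attaches to play the role of roots that need an external driver.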

Key Design 3: Systematic Injection of Causal Challenges

Confounders: Two adjacency matrices are sampled; the off-diagonal entries of the second are rotated by 90° and merged with the first, ensuring the resulting confounded graph remains a scale-free DAG.
Time-lag: With probability \(p_t\), a fixed delay \(\tau\) is introduced on an edge \((k, i)\), so that the child reads a delayed parent signal: \(x_i(t) = f_{(k,i)}(x_k(t-\tau))\), forming a temporally cyclic graph.
Normalization: Node values are standardized along the temporal dimension to eliminate varsortability artifacts.
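Two of these challenges reduce to simple operations on a single time series, sketched below under stated assumptions: `standardize` is a plain z-score along time, and `lagged` shifts a parent signal by an integer number of steps, padding the undefined start with `None`. Both helper names and the padding convention are hypothetical, chosen only for illustration.

```python
import statistics


def standardize(series):
    """Z-score one variable along time: zero mean, unit variance."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0.0:
        return [0.0 for _ in series]  # a constant series carries no scale
    return [(v - mean) / std for v in series]


def lagged(series, tau):
    """Return x(t - tau) aligned with t; the first tau steps have no value."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    tau = min(tau, len(series))
    return [None] * tau + list(series[: len(series) - tau])


# Varsortability in miniature: the child has a larger variance than its
# parent purely because of the coupling gain, which leaks the causal order.
parent = [0.0, 1.0, -1.0, 2.0, -2.0]
child = [3.0 * v for v in parent]
raw_variances = (statistics.pvariance(parent), statistics.pvariance(child))
std_variances = (
    statistics.pvariance(standardize(parent)),
    statistics.pvariance(standardize(child)),
)
```

Before standardization the two variances differ by a factor of nine; afterwards both equal one up to rounding, so an algorithm can no longer recover the ordering from marginal scale alone.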

Key Design 4: Tier 3 Realistic Climate Models

ENSO model (XRO): Integrates the Hasselmann stochastic framework with recharge oscillator dynamics, initialized from observed SSTs, with adjustable cross-basin coupling strength.
MAOOAM model (qgs): A quasi-geostrophic two-layer model solving barotropic/baroclinic interactions to simulate high-dimensional atmosphere–ocean coupled dynamics.

Loss & Training

As a benchmark paper, no new model training is involved. Evaluation metrics include:

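AUROC (higher is better): Treats every candidate edge as a binary classification and measures how well the predicted edge scores rank true edges above non-edges; 0.5 corresponds to random ranking.
AUPRC (higher is better): Area under the precision-recall curve. It is more informative than AUROC on sparse graphs because precision directly penalizes false positives. Its chance level equals the edge density of the ground-truth graph, so AUPRC values are not directly comparable across tiers with different densities.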
SHD (lower is better): Structural Hamming Distance, measuring the edit distance between the predicted and ground-truth graphs.
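The three metrics can be computed from a ground-truth adjacency matrix and a matrix of predicted edge scores, as in the following sketch. It is not the benchmark's evaluation code: AUPRC is approximated by average precision, ties in the ranking are not treated specially there, and the SHD convention used (a reversed edge counts as one edit, self-loops are compared individually) is one of several in use.

```python
def flatten(matrix):
    """Row-major list of all entries, including the diagonal."""
    return [v for row in matrix for v in row]


def auroc(truth, scores):
    """Probability that a true edge outranks a non-edge (ties count 0.5)."""
    pos = [s for t, s in zip(truth, scores) if t]
    neg = [s for t, s in zip(truth, scores) if not t]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one edge and one non-edge")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(truth, scores):
    """AUPRC estimate: mean precision at the rank of each true edge."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if truth[i]:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("average precision needs at least one true edge")
    return total / hits


def shd(truth, pred):
    """Structural Hamming distance between two binary adjacency matrices."""
    n = len(truth)
    distance = 0
    for i in range(n):
        distance += truth[i][i] != pred[i][i]  # self-loop mismatch
        for j in range(i + 1, n):
            # Any disagreement on the pair {i, j} (missing, extra, reversed) is one edit.
            if (truth[i][j], truth[j][i]) != (pred[i][j], pred[j][i]):
                distance += 1
    return distance


# Toy 3-node chain 0 -> 1 -> 2 with made-up scores.
truth = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
scores = [[0.1, 0.9, 0.2], [0.3, 0.0, 0.4], [0.2, 0.6, 0.1]]
pred = [[int(s > 0.5) for s in row] for row in scores]  # threshold is arbitrary
```

On this toy input the prediction reverses the edge between nodes 1 and 2, which the SHD counts as a single edit, while AUROC and average precision are computed on the unthresholded scores.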

Key Experimental Results

Table 1: AUROC / AUPRC Results (Selected; 6 of the 10 Algorithms Shown)

Experiment	PCMCI+	F-PCMCI	VARLiNGAM	DYNOTEARS	NGC	TSCI
Simple-Default	.52/.71	.51/.70	.50/.69	.43/.67	.50/.69	.46/.68
Coupled-Default	.67/.25	.67/.27	.60/.19	.59/.21	.50/.15	.60/.23
Coupled-Confounder	.58/.20	.55/.19	.51/.17	.49/.17	.50/.16	.51/.18
Climate-MAOOAM	.69/.88	.50/.81	.50/.81	.64/.86	.50/.81	.58/.84
Climate-ENSO	.57/.70	.57/.70	.56/.69	.55/.69	.50/.67	.50/.67

Table 2: SHD Results (Selected, Lower is Better)

Experiment	PCMCI+	F-PCMCI	NGC	CUTS+	RCD
Simple-Default	41.04	35.30	28.91	48.11	61.85
Coupled-Default	224.80	192.90	840.95	152.00	157.05
Coupled-Time-lag	327.72	350.61	793.67	247.22	201.11
Climate-MAOOAM	80.00	130.00	31.00	130.00	130.00
Climate-ENSO	529.36	530.27	337.09	608.73	665.36

Key Findings:

Non-DL methods (PCMCI+, F-PCMCI) outperform DL methods in most settings, particularly on high-dimensional coupled systems.
The topology-based method TSCI outperforms pure DL approaches in the presence of confounders and in high-dimensional settings.
All methods perform poorly on coupled dynamical systems: spurious autocorrelation inference and overly dense adjacency matrix predictions under non-stationary dynamics are widespread.
Normalization (to eliminate varsortability) yields limited improvements, especially for methods relying on topological sorting such as VARLiNGAM.

Highlights & Insights

Unprecedented scale: 14,000+ graphs and 50M+ samples, far exceeding existing benchmarks of the same type.
Three-tier progressive design: Systematically covers varying complexity from simple chaotic systems to realistic climate models.
Extensible framework: A plug-and-play workflow that allows users to customize coupling structures, noise levels, time-lag parameters, and other settings to generate new data.
Comprehensive evaluation: Simultaneously evaluates 10 causal discovery algorithms spanning 5 major categories (Granger, constraint-based, noise-based, score-based, and topology-based).
Important insight revealed: DL methods perform worse than simple non-DL methods in high-dimensional nonlinear settings — precisely the regime they claim to handle well — pointing to clear directions for future research.

Limitations & Future Work

The current benchmark only handles fixed parameters (time-lag, noise level) and 3-dimensional base systems, without covering higher-dimensional or variable-parameter scenarios.
Tier 3 contains only 12 climate graphs, so results on this tier rest on a small sample and carry limited statistical weight.
A gap remains between generated data and real-world observations; although the framework is extensible, it has not yet been connected to real climate reanalysis data.
The SHD metric does not account for graph size or edge density, making cross-tier comparisons difficult.
Recent Transformer- or diffusion-model-based causal discovery methods have not been included as baselines.

Comparison with Related Benchmarks

Aspect	CausalDynamics	CausalTime	CauseMe	CausalBench
Data Type	ODE/SDE + climate models	DL-generated	SAVAR + climate	Single-cell RNA
# Graphs	14,000+	Few	Few	Thousands (static graphs)
Dynamical Systems	✓ (chaotic, stochastic)	Partial	Partial	✗
Ground Truth	✓ (analytically derived)	✗ (lacks reliable validation)	Partial	✓
Extensibility	✓ (plug-and-play)	Limited	Limited	Limited

Rating

Novelty: ⭐⭐⭐⭐ — The first large-scale benchmark for causal discovery in dynamical systems; the three-tier design and GNR-based coupled graph generation are novel.
Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation of 10 SOTA methods, though the number of Tier 3 graphs is relatively small.
Writing Quality: ⭐⭐⭐⭐ — Clear structure with rigorous theoretical derivations and well-presented experiments.
Value: ⭐⭐⭐⭐ — Has the potential to become a standard benchmark in the causal discovery community and to drive methodological advances.