Efficient Ensemble Conditional Independence Test Framework for Causal Discovery¶
Conference: ICLR 2026 arXiv: 2509.21021 Code: None Area: Causal Inference Keywords: conditional independence test, causal discovery, ensemble method, stable distribution, p-value combination
TL;DR¶
This paper proposes E-CIT (Ensemble Conditional Independence Test), a framework that partitions data into subsets, performs independent tests on each subset, and aggregates the resulting p-values via a stable distribution-based combination method. E-CIT reduces the computational complexity of any base CIT to linear in sample size, while maintaining or improving test power in challenging settings such as heavy-tailed noise and real-world data.
Background & Motivation¶
Background: Constraint-based causal discovery methods (e.g., the PC algorithm) rely on numerous conditional independence tests (CITs) to determine causal graph structure. KCIT (kernel-based CIT) is among the most popular approaches, but incurs \(O(n^3)\) time complexity with respect to sample size.
Limitations of Prior Work:
- The high computational cost of individual CITs is the primary bottleneck in causal discovery, not the number of tests performed.
- Existing acceleration methods (RCIT, FastKCIT) are tailored specifically to KCIT and do not constitute general-purpose frameworks.
- Shah & Peters (2018) proved that no single CIT is uniformly powerful across all conditional dependence structures, making a general acceleration framework more valuable than improving any single method.
Key Challenge: Large samples are necessary to ensure test power, yet the high complexity of CITs renders large-sample computation infeasible.
Goal: Design a general, plug-and-play framework applicable to any CIT method to reduce computational overhead while preserving statistical power.
Key Insight: Drawing inspiration from ensemble learning, the data are partitioned into fixed-size subsets, each subset is tested independently, and the resulting p-values are aggregated. The key innovation lies in the aggregation step: the closure property of stable distributions is exploited to design a p-value combination method with guaranteed consistency.
Core Idea: Divide-and-conquer + stable distribution p-value aggregation = a linear-complexity acceleration framework for arbitrary CITs.
Method¶
Overall Architecture¶
E-CIT follows a three-step pipeline (Figure 1):
1. Divide: Partition \(n\) samples uniformly into \(K\) subsets of fixed size \(n_k\), where \(K = n / n_k\).
2. Test: Apply the base CIT independently to each subset, yielding \(K\) p-values \(\{p_1, \ldots, p_K\}\).
3. Aggregate: Combine the \(K\) p-values into a final p-value using a stable distribution-based method.
With \(n_k\) fixed, the total complexity of the base CIT becomes \(K \times O(f(n_k)) = O(n)\), achieving linearization compared to the original \(O(f(n))\).
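The divide and test steps are simple to sketch. Below is a minimal Python illustration; the function names are ours, and a toy Fisher-Z partial-correlation test stands in for any base CIT (KCIT, RCIT, etc.):

```python
import numpy as np
from scipy import stats

def fisher_z_cit(x, y, z):
    """Toy base CIT: Fisher-Z test for the partial correlation of x and y
    given a single conditioning variable z (a stand-in for KCIT etc.)."""
    r = np.corrcoef(np.column_stack([x, y, z]), rowvar=False)
    pr = (r[0, 1] - r[0, 2] * r[1, 2]) / np.sqrt(
        (1 - r[0, 2] ** 2) * (1 - r[1, 2] ** 2))
    stat = np.sqrt(len(x) - 4) * np.arctanh(pr)   # |Z| = 1 here
    return 2 * stats.norm.sf(abs(stat))

def subtest_pvalues(x, y, z, base_cit, n_k=400, seed=0):
    """Divide + Test: shuffle, cut into K = n // n_k subsets of size n_k,
    and run the base CIT once on each subset."""
    idx = np.random.default_rng(seed).permutation(len(x))
    K = len(x) // n_k
    return np.array([base_cit(x[s], y[s], z[s])
                     for s in np.array_split(idx[:K * n_k], K)])
```

Because each subset has the fixed size \(n_k\), the total cost is \(K \times O(f(n_k)) = O(n)\); the aggregate step then combines the returned p-values.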
Key Designs¶
- Stable Distribution-Based P-value Aggregation (Definition 2):
- Function: Combines the p-values from \(K\) sub-tests into a single final p-value with guaranteed statistical properties.
- Mechanism: Exploits the closure property of stable distributions: if \(X_j \sim \mathbf{S}(\alpha, \beta, \gamma, \delta)\) are i.i.d., then \(\frac{1}{K}\sum_j X_j \sim \mathbf{S}(\alpha, \beta, K^{1/\alpha - 1}\gamma, \delta)\). The test statistic is \(T_e = \frac{1}{K} \sum_{k=1}^K F_S^{-1}(p_k)\), and the final p-value is \(p_e = F_{S'}(T_e)\), where \(S' = \mathbf{S}(\alpha, \beta, K^{1/\alpha-1}\gamma, \delta)\).
- Design Motivation: The parameter \(\alpha\) controls tail heaviness; \(\alpha = 2\) recovers the Stouffer method (Gaussian), and \(\alpha = 1\) corresponds to the Cauchy combination. Tuning \(\alpha\) allows adaptation to different CITs and data characteristics.
- Theoretical Guarantees (Theorems 1 & 2):
- Function: Establishes validity, admissibility, unbiasedness, and consistency of the ensemble test.
- Mechanism:
- Validity: Under the null hypothesis, \(p_e\) is uniformly distributed on \([0,1]\) (for exact p-values).
- Consistency (Theorem 2): Power approaches 1 as \(K \to \infty\), requiring only: ① the expected p-value of sub-tests is \(\le \alpha_e\); ② the p-value density on \([0, 1/2]\) is no less than its mirror value; ③ stable distribution parameters satisfy \(\alpha \ge 1, \beta = \delta = 0\).
- Design Motivation: The consistency conditions impose no assumptions on the data-generating process, requiring only that sub-tests be reasonably valid. This enables E-CIT to provide consistency guarantees even in complex settings where the base CIT itself lacks them.
- Flexibility via the \(\alpha\) Parameter:
- Function: Controls the degree of flexibility in p-value aggregation.
- Mechanism: By the Neyman–Pearson lemma, the optimal combination statistic is a monotone transformation of \(-\sum \log f_1(p_k)\). Since different CITs yield different alternative p-value distributions under different dependence structures, \(\alpha\) enables adaptive tuning.
- Design Motivation: Classical methods (Fisher, Stouffer) correspond to fixed values of \(\alpha\) and lack flexibility; E-CIT provides a parsimonious one-dimensional control via \(\alpha\).
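To make the Neyman–Pearson point above concrete (our reconstruction of a standard argument, not an equation from the paper): suppose the sub-test p-values under the alternative followed a Beta(\(b\), 1) density \(f_1(p) = b\,p^{b-1}\) with \(0 < b < 1\). Then

```latex
-\sum_{k=1}^{K} \log f_1(p_k) \;=\; -K\log b \;+\; (1-b)\sum_{k=1}^{K} \log p_k ,
```

which is a decreasing function of Fisher's statistic \(-2\sum_k \log p_k\), so rejecting for small values of the left-hand side is exactly Fisher's method. A different alternative density \(f_1\) yields a different optimal combiner, which is why a one-parameter family indexed by \(\alpha\) is useful.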
Loss & Training¶
- E-CIT is an unsupervised method requiring no training.
- Practical recommendations: \(n_k = 400\) (empirically determined), \(\alpha \in \{1.75, 2.0\}\), \(\beta = \delta = 0\), \(\gamma = 1\).
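With the recommended defaults (\(\beta = \delta = 0\), \(\gamma = 1\)), the aggregation step has closed forms at \(\alpha = 2\) (since \(\mathbf{S}(2, 0, 1, 0)\) is \(N(0, 2)\), recovering Stouffer) and \(\alpha = 1\) (Cauchy). A minimal sketch, with a function name of our own; other values of \(\alpha\) such as the recommended 1.75 would need a numerical stable CDF like `scipy.stats.levy_stable`:

```python
import numpy as np
from scipy import stats

def stable_combine(pvals, alpha=2.0):
    """Aggregate step: T_e = mean of F_S^{-1}(p_k), p_e = F_{S'}(T_e),
    where S' has scale K^(1/alpha - 1) by the closure property.
    Closed forms: alpha = 2 -> Gaussian (Stouffer), alpha = 1 -> Cauchy."""
    p = np.asarray(pvals, dtype=float)
    K = p.size
    c = K ** (1.0 / alpha - 1.0)   # closure scale of the averaged statistic
    if alpha == 2.0:
        base = stats.norm(scale=np.sqrt(2))          # S(2,0,1,0) = N(0, 2)
        comb = stats.norm(scale=np.sqrt(2) * c)
    elif alpha == 1.0:
        base = comb = stats.cauchy()                 # c = 1 when alpha = 1
    else:
        raise NotImplementedError("use scipy.stats.levy_stable for other alpha")
    return comb.cdf(base.ppf(p).mean())

# Validity check (Theorem 1): uniform sub-test p-values under H0 should
# yield a uniform combined p-value.
rng = np.random.default_rng(0)
p_e = np.array([stable_combine(rng.uniform(size=10)) for _ in range(20000)])
print(abs((p_e <= 0.5).mean() - 0.5))   # deviation from uniformity: small
```

Note that for \(\alpha = 1\) the closure scale \(K^{1/\alpha - 1} = 1\), so the averaged statistic is again standard Cauchy, recovering the Cauchy combination method of Liu & Xie (2020).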
Key Experimental Results¶
Main Results¶
Data generation follows a post-nonlinear model, with \(Z\) drawn from normal or Laplace distributions, and noise from Student-t, Cauchy, or Laplace distributions.
Computational Efficiency (Figure 2, KCIT acceleration):
| Method | Time Complexity | Runtime at n=2000 | Type I Error | Power |
|---|---|---|---|---|
| KCIT (original) | \(O(n^3)\) | ~100s | ~0.05 | baseline |
| RCIT | \(O(n)\) | ~0.1s | ~0.05 | slightly below KCIT |
| FastKCIT | \(O(n \log n)\) | ~1s | ~0.05 | close to KCIT |
| E-KCIT | \(O(n)\) | ~0.1s | ~0.05 | matches or exceeds KCIT (better under heavy tails) |
Cross-Method Generality (Table 2, n=1200, Normal Z, t-noise df=2):
| Method | Orig. Power | Ensemble Power (α=1.75) |
|---|---|---|
| RCIT | 0.548 | 0.623 |
| LPCIT | 0.422 | 0.447 |
| CMIknn | 0.982 | 0.988 |
| FisherZ | 0.510 | 0.561 |
| CCIT | 0.904 (Type I error 0.454) | 0.816 (Type I error 0.286) |
Ablation Study¶
Real Data: Flow-Cytometry (Table 3):
| Method | Orig. F1 | Ensemble F1 |
|---|---|---|
| KCIT | 0.624 | 0.695 |
| RCIT | 0.665 | 0.687 |
| LPCIT | 0.691 | 0.741 |
| CMIknn | 0.779 | 0.756 |
| FisherZ | 0.737 | 0.767 |
Key Findings¶
- Significant speedup: E-KCIT reduces KCIT's \(O(n^3)\) complexity to \(O(n)\), achieving runtime comparable to RCIT.
- Power gains, not losses: Under heavy-tailed noise (Student-t df=2, Cauchy), E-KCIT surpasses both KCIT and RCIT in power — subset-level estimation is more stable.
- Generality: Effective across 6 distinct CIT methods (KCIT, RCIT, LPCIT, CMIknn, CCIT, FisherZ).
- Real-data advantage: On the Flow-Cytometry dataset, E-CIT improves F1-score by 2–5 percentage points for most methods.
- Unexpected finding for CCIT: E-CIT substantially reduces CCIT's inflated Type I error (from 0.45+ to 0.28–0.34), at the cost of a modest reduction in power, yielding better-calibrated tests.
- Causal discovery application (Figure 3): On nonlinear additive noise causal graphs, E-KCIT outperforms both KCIT and RCIT in F1 and SHD.
Highlights & Insights¶
- A general framework, not a specific method: E-CIT functions as an accelerator rather than a new CIT — it can be plugged into any existing method.
- Extremely mild theoretical consistency conditions: No assumptions are placed on the data or model; only reasonable validity of sub-tests is required.
- Elegant application of stable distributions: The closure property of stable distributions enables exact p-value aggregation, with \(\alpha\) providing flexible control.
- Practical insight: In complex settings (heavy tails, real data), ensemble aggregation can improve power — small-sample estimates are more stable, and aggregation compensates for individual weaknesses.
Limitations & Future Work¶
- The theoretical analysis assumes sub-test p-values are i.i.d. — correlation may arise in practice (e.g., time-series data or distributional shift).
- Optimal selection of \(\alpha\) is context-dependent; only empirical recommendations (\(\{1.75, 2.0\}\)) are currently provided.
- Subset size \(n_k\) must be sufficiently large for valid sub-tests — the curse of dimensionality may persist for very high-dimensional conditioning sets \(Z\).
- Methods already exhibiting strong performance (e.g., CMIknn) benefit less, suggesting the framework is most effective for accelerating moderately powerful tests.
- Future directions include handling correlated p-values, optimizing \(\alpha\) for specific CITs, and developing adaptive subset size selection.
Related Work & Insights¶
- RCIT (Strobl et al., 2019): Accelerates KCIT via random Fourier features — applicable only to KCIT, whereas E-CIT is a general framework.
- FastKCIT (Schacht & Huang, 2025): Partitions data using GMMs — conceptually similar but designed exclusively for KCIT.
- Cauchy combination method (Liu & Xie, 2020): Combines p-values via the Cauchy distribution for whole-genome sequencing tests — E-CIT generalizes this to arbitrary stable distributions in the CIT context.
- Insight: Divide-and-conquer with aggregation is a general paradigm for large-scale statistical testing, with potential applicability to other computationally demanding testing problems.
Rating¶
- Novelty: ⭐⭐⭐⭐ The closure property of stable distributions is creatively applied to p-value aggregation for CIT, yielding a clear and practical framework; however, the divide-and-aggregate paradigm itself is not entirely novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Synthetic data + real data + causal discovery application; 6 CIT methods × multiple noise distributions × multiple sample sizes; comprehensive ablation.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with good integration of theory and experiments; Figure 1 provides an intuitive overview; proofs are deferred to the appendix, keeping the main text accessible.
- Value: ⭐⭐⭐⭐ Highly practical — directly integrable into existing causal discovery pipelines; meaningful for large-scale causal discovery applications.