# TabStruct: Measuring Structural Fidelity of Tabular Data
- Conference: ICLR 2026
- arXiv: 2509.11950
- Code: https://github.com/SilenceX12138/TabStruct
- Area: Data Generation / Tabular Data / Causal Structure
- Keywords: Tabular data generation, structural fidelity, causal structure, global utility, conditional independence
## TL;DR
This paper proposes the TabStruct evaluation framework and a global utility metric that measures the structural fidelity of tabular data generators with respect to causal structure, without requiring ground-truth causal graphs. A systematic comparison of 13 generators across 29 datasets reveals that diffusion models significantly outperform other methods in preserving global structure.
## Background & Motivation
Background: Tabular data generation underlies tasks such as training data augmentation and missing value imputation. Existing evaluations focus on three dimensions: density estimation (distributional similarity), ML utility (downstream predictive performance), and privacy protection (distance between synthetic and real data).
Limitations of Prior Work: These three dimensions are adapted from homogeneous modalities (text/images) and do not account for the heterogeneous characteristics of tabular data. A representative failure case is SMOTE, which scores well on density estimation and ML utility yet produces data that severely violates inter-variable causal relationships—for instance, breaking conditional independencies implied by physical laws.
Key Challenge: The fundamental prior for tabular data is the structural causal model (SCM), in which variables exhibit causal dependencies. Conventional metrics evaluate only marginal distributions or predictive performance for a single target, failing to capture global causal interactions among features. The only existing benchmark addressing structural fidelity, CauTabBench, is restricted to toy SCM datasets, because quantifying structural fidelity requires ground-truth causal graphs that are virtually unavailable for real-world datasets.
Goal:
- How can structural fidelity be evaluated without ground-truth causal graphs?
- How do existing generators perform in terms of structural fidelity?
- What is the relationship between structural fidelity and conventional evaluation dimensions?
Key Insight: The authors observe that if a generator truly learns the causal structure of the data, a model trained on synthetic data should be able to predict each variable from all others with performance close to that achieved on real data. This "all-variable predictability" property is closely related to the Markov blanket concept in SCMs.
Core Idea: By rotating each variable as the prediction target and aggregating the ratio of predictive performance across all variables, global utility is defined as a proxy for the global structural fidelity of tabular data generators—without requiring a causal graph.
## Method

### Overall Architecture
TabStruct is a unified evaluation framework that takes a reference dataset \(\mathcal{D}_{\text{ref}}\) and a synthetic dataset \(\mathcal{D}_{\text{syn}}\) as input and outputs scores along four dimensions: density estimation, privacy protection, ML utility, and structural fidelity. Structural fidelity is the core dimension newly introduced by this work.
The evaluation pipeline operates under two scenarios:
- SCM datasets: Conditional independence (CI) tests are used to directly quantify structural fidelity.
- Real-world datasets without SCMs: Global utility serves as an indirect proxy for structural fidelity.
### Key Designs
- Conditional Independence (CI) Score — Structural Fidelity with Known Causal Graphs
  - Function: On datasets with known ground-truth SCMs, structural fidelity is quantified by comparing the consistency of CI statements between real and synthetic data.
  - Mechanism: All CI statements \(\mathcal{C}_{\text{global}}\) are enumerated from the CPDAG of the ground-truth SCM, including both d-separation and d-connection statements. For each CI statement, a statistical test (\(\alpha=0.01\)) is performed on the synthetic data, and the pass rate is computed as \(\mathrm{CI}(\mathcal{C}, \mathcal{D}) = \frac{1}{|\mathcal{C}|}\sum_{c \in \mathcal{C}} \mathbb{1}[\hat{\mathcal{I}}_\alpha(c) = 1]\).
  - Design Motivation: Evaluation is conducted at the CPDAG level rather than the DAG level, because existing causal discovery methods become unreliable when the number of features exceeds 10. Skeleton-level evaluation is also avoided, as it discards directional information.
  - Local vs. Global: Local CI considers only CI statements involving the prediction target \(y\); global CI considers all variable pairs.
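To make the pass-rate mechanism concrete, here is a minimal sketch. It assumes CI statements are given as `(x, y, conditioning set, expected independence)` tuples already enumerated from the CPDAG, and uses a Fisher-z partial-correlation test as a stand-in for whichever CI test the paper actually employs; helper names are illustrative, not TabStruct's API.

```python
import numpy as np
import pandas as pd
from scipy import stats

def partial_corr_ci_test(df, x, y, z, alpha=0.01):
    """Return True if x ⟂ y | z is NOT rejected at level alpha (Fisher-z test)."""
    if z:
        # Residualize x and y on the conditioning set via least squares.
        Z = np.column_stack([df[c] for c in z] + [np.ones(len(df))])
        rx = df[x] - Z @ np.linalg.lstsq(Z, df[x], rcond=None)[0]
        ry = df[y] - Z @ np.linalg.lstsq(Z, df[y], rcond=None)[0]
    else:
        rx, ry = df[x], df[y]
    r, _ = stats.pearsonr(rx, ry)
    # Fisher z-transform yields an approximate test for zero partial correlation.
    n, k = len(df), len(z)
    z_stat = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - k - 3)
    p = 2 * (1 - stats.norm.cdf(abs(z_stat)))
    return p > alpha  # True -> independence not rejected

def ci_score(df, statements, alpha=0.01):
    """Fraction of CI statements the (synthetic) data is consistent with."""
    passed = 0
    for x, y, z, should_be_independent in statements:
        passed += partial_corr_ci_test(df, x, y, z, alpha) == should_be_independent
    return passed / len(statements)
```

A d-separation statement counts as passed when independence is not rejected, and a d-connection statement when it is rejected, so a perfect generator scores 1.0.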
- Global Utility — A Proxy for Structural Fidelity Without Causal Graphs
  - Function: Measures the degree of global structural preservation on datasets without ground-truth SCMs.
  - Mechanism: Each variable \(x_j\) is rotated as the prediction target, with all remaining variables used as predictors. The per-variable utility is defined as the ratio of predictive performance relative to the reference data (balanced accuracy for classification; inverse RMSE for regression). Global utility is the average across all variables: \(\text{Global Utility}(\mathcal{D}) = \frac{1}{D+1}\sum_{j=1}^{D+1}\text{Utility}_j(\mathcal{D})\).
  - Design Motivation: This design addresses two issues: (1) it avoids the target-specific bias of local utility (which only predicts \(y\)); (2) aggregating normalized performance ratios enables comparability across tasks of different types. AutoGluon with an ensemble of 9 predictors is employed to reduce single-model bias.
  - Theoretical Basis: A high-fidelity generator should preserve the conditional distribution \(p(x_j | \mathcal{X} \setminus \{x_j\})\) for every variable, which is consistent with the Markov blanket concept.
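A minimal sketch of the rotation procedure, assuming all-numeric data and using a random forest as a stand-in for the paper's 9-predictor AutoGluon ensemble; the normalization here is a simple real/synthetic RMSE ratio rather than the paper's exact balanced-accuracy / inverse-RMSE scheme.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def per_variable_rmse(train, test):
    """RMSE of predicting each column of `test` from all remaining columns,
    with the model fitted on `train`."""
    rmses = []
    for j in range(train.shape[1]):
        X_tr = np.delete(train, j, axis=1)
        X_te = np.delete(test, j, axis=1)
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        model.fit(X_tr, train[:, j])
        pred = model.predict(X_te)
        rmses.append(mean_squared_error(test[:, j], pred) ** 0.5)
    return np.array(rmses)

def global_utility(real_train, synth_train, real_test):
    """Average per-variable performance ratio of synthetic-trained vs.
    real-trained models. For RMSE (lower is better) the ratio is
    real/synthetic, clipped to [0, 1]."""
    rmse_real = per_variable_rmse(real_train, real_test)
    rmse_synth = per_variable_rmse(synth_train, real_test)
    return float(np.mean(np.clip(rmse_real / rmse_synth, 0.0, 1.0)))
```

Shuffling each column of the synthetic data independently preserves all marginals but destroys inter-variable structure, which is exactly the failure mode this metric is designed to penalize.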
- Local Utility vs. Global Utility
  - Function: This comparison highlights the limitations of local utility, i.e., conventional ML efficacy.
  - Mechanism: Local utility focuses solely on predicting the target \(y\). Experiments show that local utility is strongly correlated with local CI (\(r_s=0.78\)) but nearly uncorrelated with global CI (\(r_s=0.14\)), whereas global utility is strongly correlated with global CI (\(r_s=0.84\)).
  - Design Motivation: This demonstrates that conventional ML efficacy reflects only local structure and is insufficient for comprehensive generator evaluation.
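As an illustration of how such a rank correlation is obtained, the snippet below runs Spearman's test on the Global Utility and Global CI columns of the seven generators in the SCM-dataset results table later in this post; the paper's reported \(r_s\) values aggregate over far more (generator, dataset) pairs, so the numbers will not match exactly.

```python
from scipy.stats import spearmanr

# Global Utility and Global CI for the seven generators in the
# SCM-dataset table (TabSyn, TabDDPM, TabDiff, SMOTE, CTGAN, GReaT, NRGBoost).
global_utility = [0.76, 0.80, 0.75, 0.39, 0.26, 0.25, 0.16]
global_ci = [0.70, 0.69, 0.57, 0.30, 0.08, 0.16, 0.11]

r_s, p = spearmanr(global_utility, global_ci)
print(f"Spearman r_s = {r_s:.2f} (p = {p:.3f})")  # strong rank agreement
```

Even on this small sample the two metrics rank the generators almost identically, mirroring the paper's finding.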
### Integration of Evaluation Dimensions
The framework jointly considers four dimensions: density estimation (Shape/Trend), privacy protection (\(\alpha\)-precision/\(\beta\)-recall/DCR/\(\delta\)-Presence), ML utility (local utility), and structural fidelity (CI score/global utility), providing a comprehensive assessment of generators.
## Key Experimental Results

### Main Results — Structural Fidelity of 13 Generators on SCM Datasets
| Generator | Global CI ↑ | Global Utility ↑ | Local Utility ↑ | Shape ↑ |
|---|---|---|---|---|
| \(\mathcal{D}_\text{ref}\) | 1.00 | 0.99 | 0.99 | 1.00 |
| TabSyn | 0.70 | 0.76 | 0.76 | 0.50 |
| TabDDPM | 0.69 | 0.80 | 0.29 | 0.62 |
| TabDiff | 0.57 | 0.75 | 0.80 | 0.69 |
| SMOTE | 0.30 | 0.39 | 0.92 | 0.82 |
| CTGAN | 0.08 | 0.26 | 0.80 | 0.46 |
| GReaT | 0.16 | 0.25 | 0.27 | 0.62 |
| NRGBoost | 0.11 | 0.16 | 0.75 | 0.65 |
Key Findings: SMOTE achieves the highest local utility (0.92) but a global CI of only 0.30, demonstrating that conventional ML utility metrics can be misleading. Diffusion models (TabDDPM/TabSyn/TabDiff) consistently achieve the best global structural fidelity.
### Global Utility Rankings on Real-World Datasets
| Generator | Global Utility ↑ | Local Utility ↑ |
|---|---|---|
| \(\mathcal{D}_\text{ref}\) | 0.99 | 0.96 |
| TabSyn | 0.73 | 0.76 |
| TabDiff | 0.73 | 0.78 |
| TabDDPM | 0.72 | 0.27 |
| ARF | 0.56 | 0.54 |
| TVAE | 0.53 | 0.70 |
| BN | 0.44 | 0.38 |
| SMOTE | 0.41 | 0.91 |
| GReaT | 0.20 | 0.23 |
| CTGAN | 0.13 | 0.70 |
On real-world datasets, the top-3 generators by global utility remain the diffusion models (TabSyn/TabDiff/TabDDPM), consistent with results on SCM datasets, validating the generalizability of global utility.
### Correlation Analysis
| Metric Pair | Spearman \(r_s\) | p-value |
|---|---|---|
| Global Utility ↔ Global CI | 0.84 | <0.001 |
| Local Utility ↔ Local CI | 0.78 | <0.001 |
| Local Utility ↔ Global CI | 0.14 | <0.001 |
The strong correlation between global utility and global CI (0.84) is the central empirical result, validating the effectiveness of global utility as a proxy metric in the absence of ground-truth SCMs.
## Highlights & Insights
- Filling an Evaluation Gap: This work is the first to systematically incorporate structural fidelity into the evaluation framework for tabular generators, and the proposed global utility requires no ground-truth causal graphs.
- Large-Scale Benchmark: 13 generators × 29 datasets, totaling over 150,000 evaluation runs, far exceeding the coverage of existing benchmarks.
- Explaining Why Diffusion Models Excel: Diffusion models independently add noise to each feature and reconstruct all features jointly during denoising, naturally learning permutation-invariant conditional distributions that align with the structural prior of tabular data.
- Exposing the "Illusion" of SMOTE: Experiments clearly show that SMOTE performs well on conventional metrics while severely violating causal structure, exposing systematic evaluation bias.
## Limitations & Future Work
- The strong correlation between global utility and global CI is an empirical finding without theoretical proof.
- Evaluation at the CPDAG level is coarser than at the full DAG level and may miss certain directional causal relationships.
- The framework relies on AutoGluon ensemble predictors, whose computational cost grows with the number of features (though the Tiny-default variant is efficient at 0.64s/1000 samples).
- SCM datasets use expert-validated causal graphs, but such datasets are scarce (only 6 available), limiting generalizability.
## Related Work & Insights
- Tabular Generation Benchmarks: Synthcity (Qian et al., 2024) and SynMeter (Du & Li, 2024) cover density, privacy, and ML utility but not structural fidelity; CauTabBench (Tu et al., 2024) evaluates structural fidelity but is limited to toy SCMs.
- Tabular Generators: The comparison spans SMOTE, BN, TVAE, CTGAN, NFlow, ARF, diffusion models (TabDDPM/TabSyn/TabDiff/TabEBM), LLM-based GReaT, and tree-based NRGBoost.
- Causal Discovery: DAG learning methods become unreliable for more than 10 features (Zanga et al., 2022), motivating the CPDAG-level evaluation adopted in this work.
- Tabular Foundation Models: Hollmann et al. (2025) empirically demonstrate that SCMs are effective structural priors for tabular data.
## Rating
| Dimension | Score | Remarks |
|---|---|---|
| Novelty | ⭐⭐⭐⭐ | First global structural fidelity metric that requires no causal graph |
| Technical Depth | ⭐⭐⭐⭐ | Clear theoretical analysis; well-motivated CI framework and global utility design |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ | 13 models × 29 datasets, 150,000+ evaluations; highly comprehensive |
| Writing Quality | ⭐⭐⭐⭐ | Well-structured; motivating examples are intuitive |
| Value | ⭐⭐⭐⭐ | Open-source framework directly applicable to evaluating new generators |
| Overall | ⭐⭐⭐⭐ | An important contribution to tabular data generation evaluation; global utility has the potential to become a standard metric |