TabReX: Tabular Referenceless eXplainable Evaluation

Conference: ACL 2026 | arXiv: 2512.15907 | Code: GitHub | Area: Interpretability | Keywords: Tabular evaluation metrics, referenceless evaluation, knowledge graph alignment, explainable evaluation, structured generation

TL;DR

This paper proposes TabReX, a graph-reasoning-based referenceless evaluation framework for tabular generation. It converts both the source text and the generated table into knowledge graph triples and aligns them to compute interpretable, attribute-driven scores. TabReX substantially outperforms existing referenceless metrics in correlation with human judgments while approaching the strongest reference-based methods, and the authors also introduce TabReX-Bench, a large-scale evaluation benchmark.

Background & Motivation

Background: As LLMs are increasingly used to generate or transform structured outputs (e.g., converting reports into financial tables or synthesizing patient data), automatic evaluation of table quality has become a critical need. Existing evaluation metrics fall into several categories: n-gram metrics (BLEU, ROUGE), embedding-based metrics (BERTScore, BLEURT), token-level exact matching (Exact Match, PARENT), QA-based referenceless metrics (QuestEval), and recent LLM-judge metrics (TabEval, TabXEval).

Limitations of Prior Work: (1) N-gram and embedding metrics flatten tables into plain text, entirely ignoring row/column structure and unit semantics; (2) token-level methods cannot distinguish between harmless formatting adjustments and genuine factual errors; (3) QA-based metrics excessively penalize layout changes (e.g., row reordering); (4) most metrics require reference tables, limiting their generality; (5) existing benchmarks are small in scale and cover only limited perturbation types, preventing comprehensive robustness testing.

Key Challenge: Table evaluation must simultaneously account for structural fidelity and factual accuracy, while distinguishing data-preserving transformations (e.g., row reordering, unit conversion) from data-altering transformations (e.g., value tampering, row/column insertion or deletion). No existing metric performs well on both dimensions.

Goal: Design a referenceless, attribute-driven, explainable table evaluation framework that provides cell-level error traceability and a tunable sensitivity–specificity trade-off.

Key Insight: Table evaluation is reformulated as a graph alignment problem — both the source text and the generated table can be represented as knowledge graph triples [subject, predicate, object], and aligning these triples enables precise localization of matched, missing, and spurious information.

Core Idea: Text2Graph and Table2Graph are used to unify both modalities into a shared triple space. LLM-guided graph alignment identifies correspondences and discrepancies, and an attribute-driven scoring function computes interpretable evaluation scores.
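The Table2Graph side of this idea can be illustrated with a minimal, hypothetical sketch (not the authors' implementation): column headers serve as predicates, the first column as the row-identifying subject, and cell values as objects.

```python
def table_to_triples(header, rows):
    """Convert a table into [subject, predicate, object] triples using
    deterministic rules: headers -> predicates, first column -> subjects,
    cells -> objects. Illustrative stand-in for Table2Graph."""
    id_col, *attr_cols = header
    triples = []
    for row in rows:
        subject, *values = row
        for predicate, obj in zip(attr_cols, values):
            triples.append((subject, predicate, obj))
    return triples

header = ["Country", "GDP growth (YoY)", "Population"]
rows = [["France", "2.5%", "68M"],
        ["Japan", "1.1%", "124M"]]

for t in table_to_triples(header, rows):
    print(t)
# first line: ('France', 'GDP growth (YoY)', '2.5%')
```

Because this side needs no LLM call, the table representation stays fast and fully reproducible, matching the paper's stated motivation.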

Method

Overall Architecture

TabReX is a three-stage pipeline: (1) converting the source text and candidate table into knowledge graphs (Text2Graph + Table2Graph); (2) using LLM-guided graph alignment to identify correspondences between triples; and (3) computing attribute-driven structural and content scores from the alignment results. The final output includes a table-level score and cell-level error traceability.

Key Designs

  1. Dual-Modality Knowledge Graph Conversion:

    • Function: Unify text and tables into a comparable triple representation.
    • Mechanism: For text, an LLM extracts atomic factual triples \(\mathcal{G}_S = \{(s_i, p_i, o_i)\}\) following entity-centric syntax, enforcing consistent granularity, normalized predicates, and unit-aware values. For tables, lightweight deterministic rules generate triples using column headers as predicates, row identifiers as subjects, and cell values as objects.
    • Design Motivation: Unifying both modalities into the same representation space reduces evaluation to a graph alignment problem, eliminating bias introduced by modality differences. The deterministic rule-based approach on the table side (without LLM) ensures speed and consistency.
  2. LLM-Guided Graph Alignment:

    • Function: Precisely match triples from the source text and the generated table.
    • Mechanism: A two-step alignment procedure — (1) Deterministic matching: triples with identical subject-predicate pairs or those equivalent after schema normalization are directly aligned; (2) LLM-assisted refinement: handles remaining cases involving paraphrasing, abbreviations, and compound attributes (e.g., "GDP growth (YoY)" ↔ "growth_rate_2021"). Each matched pair is annotated with a difference vector \(\Delta\) recording unit-aware numerical deviation, categorical mismatch, and missing/spurious tokens.
    • Design Motivation: Deterministic matching handles simple cases efficiently, while the LLM addresses difficult cases requiring semantic understanding, balancing efficiency and accuracy.
  3. Attribute-Driven Scoring:

    • Function: Compute interpretable and tunable evaluation scores from alignment results.
    • Mechanism: Two components are defined — TablePenalty computes the normalized proportion of row/column-level missing (MI) and extraneous (EI) entities; CellPenalty computes cell-level missing, extraneous, and partial matches (numerical deviation \(\Gamma\)). The final score is \(\mathcal{S}_{\text{TabReX}} = \text{TablePenalty} + \text{CellPenalty}\). Weight parameters \((\alpha, \beta)\) provide a tunable sensitivity–specificity trade-off: increasing \(\beta_{\text{MI}}\) biases toward sensitivity (rewarding comprehensive coverage), while increasing \(\beta_{\text{EI}}\) biases toward specificity (penalizing hallucinations).
    • Design Motivation: Different domains have different error tolerance (financial applications require precision; clinical applications require recall), and the tunable weights allow the same framework to adapt to diverse requirements.
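A minimal sketch of how the tunable penalties could combine, assuming simple normalized counts (the paper's exact normalization and the role of \(\alpha\) are not reproduced here):

```python
def tabrex_penalty(n_rows, missing_rows, extra_rows,
                   n_cells, missing_cells, extra_cells, partial_devs,
                   beta_mi=1.0, beta_ei=1.0):
    """Illustrative attribute-driven penalty (lower is better).

    TablePenalty: normalized row/column-level missing (MI) and
    extraneous (EI) entities. CellPenalty: cell-level missing,
    extraneous, and partial matches, the latter weighted by a
    normalized numerical deviation in [0, 1].
    Raising beta_mi biases toward sensitivity (rewarding coverage);
    raising beta_ei biases toward specificity (penalizing hallucinations).
    """
    table_penalty = (beta_mi * missing_rows + beta_ei * extra_rows) / max(n_rows, 1)
    cell_penalty = (beta_mi * missing_cells + beta_ei * extra_cells
                    + sum(partial_devs)) / max(n_cells, 1)
    return table_penalty + cell_penalty

# A faithful table incurs zero penalty; hallucinated rows are punished
# more heavily under a specificity-oriented (finance-style) setting.
print(tabrex_penalty(5, 0, 0, 20, 0, 0, []))                 # 0.0
print(tabrex_penalty(5, 0, 2, 20, 0, 3, [], beta_ei=2.0))
```

The design choice worth noting is that the scoring layer contains no LLM: once the alignment produces counts and deviations, the trade-off is controlled entirely by transparent weights.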

Loss & Training

TabReX requires no training and operates purely as an inference-time evaluation framework. LLMs are only invoked in the Text2Graph and graph alignment steps; the scoring function is entirely deterministic.
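The deterministic first step of the alignment, and the missing/spurious bookkeeping the scorer consumes, can be sketched as follows (a simplified illustration; the paper additionally applies schema normalization and LLM-assisted refinement for paraphrases and compound attributes):

```python
def normalize(text):
    """Crude stand-in for schema normalization (case and underscore folding)."""
    return text.strip().lower().replace("_", " ")

def deterministic_align(source_triples, table_triples):
    """Align triples on normalized (subject, predicate) keys.

    Returns matched pairs plus the unmatched remainders: triples only in
    the source are 'missing' from the table; triples only in the table
    are 'spurious' (potential hallucinations). In TabReX the remaining
    hard cases would then be passed to LLM-assisted refinement.
    """
    src = {(normalize(s), normalize(p)): o for s, p, o in source_triples}
    tab = {(normalize(s), normalize(p)): o for s, p, o in table_triples}
    matched = {k: (src[k], tab[k]) for k in src.keys() & tab.keys()}
    missing = {k: src[k] for k in src.keys() - tab.keys()}
    spurious = {k: tab[k] for k in tab.keys() - src.keys()}
    return matched, missing, spurious

src = [("France", "GDP growth (YoY)", "2.5%"), ("France", "Capital", "Paris")]
tab = [("france", "gdp growth (yoy)", "2.5%"), ("France", "Area", "643,801 km²")]
matched, missing, spurious = deterministic_align(src, tab)
print(sorted(matched))   # [('france', 'gdp growth (yoy)')]
print(sorted(missing))   # [('france', 'capital')]
print(sorted(spurious))  # [('france', 'area')]
```

Each matched pair would additionally carry a difference vector Δ (unit-aware numerical deviation, categorical mismatch, missing/spurious tokens), omitted here for brevity.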

Key Experimental Results

Main Results

Correlation with human rankings (Table 2):

Metric Category              Method      Spearman ρ (↑)   Kendall τ (↑)   Tie ratio (↓)
Non-LLM (reference-based)    EM          45.88            39.38           58.40
Non-LLM (reference-based)    BERTScore   36.21            30.66            0.92
LLM (reference-based)        TabXEval    80.27            72.37           45.33
Referenceless                QuestEval   62.93            52.29            3.03
Referenceless                TabReX      74.51            64.24           13.59

Under referenceless conditions, TabReX approaches the correlation of the strongest reference-based method TabXEval, while achieving a substantially lower tie ratio (13.6% vs. 45.3%).

Ablation Study

Ensemble Method     Spearman ρ   Kendall τ   Note
Lex-Emb (Mean)      38.43        32.65       Lexical + embedding ensemble
LLM (Harmonic)      56.00        46.93       LLM metric ensemble
Hybrid (Harmonic)   54.03        42.71       Hybrid ensemble
TabReX              74.51        64.24       Single method

Key Findings

  • TabReX as a single metric outperforms all ensemble methods, demonstrating that the graph alignment paradigm itself is more effective than simple aggregation.
  • From easy to hard perturbations, TabReX maintains a stable sensitivity–specificity trade-off (small shift in the operating point), whereas EM, H-Score, and others degrade substantially.
  • Although TabXEval achieves the highest correlation, its tie ratio of 45.3% means nearly half of distinct variants receive identical scores — indicating insufficient discriminative precision.
  • TabReX-Bench (9,120 perturbed instances built from 710 source tables, spanning 12 perturbation types, 6 domains, and 3 difficulty levels) constitutes the largest tabular evaluation benchmark to date.

Highlights & Insights

  • The use of knowledge graph triples as an intermediate representation is an elegant design choice — it reduces the modality alignment problem to a graph matching problem and naturally supports dual evaluation of structure and semantics.
  • The practical value of the tunable trade-off is significant — financial domains can increase \(\beta_{\text{EI}}\) to strictly penalize hallucinations, while clinical domains can increase \(\beta_{\text{MI}}\) to ensure information completeness.
  • Planner-driven perturbation generation ensures benchmark diversity and reproducibility — generating 12 perturbation types in a single LLM call yields more consistent results than generating them individually.

Limitations & Future Work

  • Text2Graph relies on LLMs for triple extraction, which may lack robustness for complex nested tables or non-standard formats.
  • Evaluation costs depend on the number of LLM calls, making latency and cost non-negligible at scale.
  • Only GPT-5-nano is evaluated as the backbone; performance with open-source models remains to be verified.
  • Cross-table reasoning (requiring facts jointly derived from multiple tables) is not yet covered.
Comparison with Prior Metrics

  • vs. TabXEval: TabXEval is reference-based and achieves the highest correlation, but its high tie ratio leads to insufficient discriminative precision; TabReX is referenceless and provides finer-grained discrimination.
  • vs. QuestEval: Both are referenceless methods, but QuestEval relies on generic QA signals and excessively penalizes table-specific structural transformations (e.g., row reordering); TabReX's graph alignment naturally abstracts away the effects of formatting changes.
  • vs. PARENT/BLEU: These metrics are largely ineffective for structured output evaluation; TabReX represents a fundamental paradigm shift in this evaluation landscape.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The graph alignment paradigm for referenceless table evaluation is a fundamentally novel approach, and the attribute-driven scoring mechanism is elegantly designed.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ TabReX-Bench is large-scale and rigorously designed, baselines are comprehensive, and human evaluation is thorough.
  • Writing Quality: ⭐⭐⭐⭐ The framework is described clearly, though the density of mathematical notation requires careful reading.
  • Value: ⭐⭐⭐⭐⭐ The work makes an important contribution to the evaluation of structured generation, and the framework design is general and extensible.