SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables¶
Conference: ICLR 2026 arXiv: 2602.23286 Code: github.com/pshlego/SPARTA Keywords: Multi-hop Reasoning, Table-Text QA, Benchmark Construction, SQL, Cross-modal Reasoning
TL;DR¶
This paper presents SPARTA, an end-to-end framework for automatically constructing large-scale table-text multi-hop QA benchmarks. By leveraging a reference fact database, provenance-based refinement, and realistic structural constraints to generate high-quality nested SQL queries, SPARTA reduces the F1 of state-of-the-art models by over 30 points.
Background & Motivation¶
- Three major limitations of existing benchmarks:
- Limited question types and shallow reasoning: Most benchmarks require ≤2-hop reasoning and do not support advanced operations such as aggregation and grouping.
- Severe annotation noise: An audit of 100 HybridQA samples reveals that 21% contain errors (redundant modality 52.4%, incomplete answers 23.8%, incorrect/unanswerable 23.8%).
- Reliance on small-scale web tables: Average ~15 rows, far from the thousands of rows found in real-world databases.
- The complexity of manual annotation limits benchmark scale and quality, motivating the need for automated approaches.
Method¶
Overall Architecture: Three-Stage Pipeline¶
- Reference Fact Database Construction — Source tables and grounding tables are merged into a unified relational database.
- Query Generation — An LLM generates nested SQL queries whose depth matches the target number of hops.
- Question Verbalization — Validated SQL queries are converted into fluent natural language questions.
Reference Fact Database¶
- Source table \(\mathcal{S}_T\): Retains the original relational tables (e.g., 6 publicly available Kaggle tables covering NBA salaries, awards, drafts, etc.).
- Grounding table \(\mathcal{G}_T\): Decomposes unstructured text into atomic fact tuples stored in SQL-queryable relational tables.
- Two grounding approaches: (1) using verified corpora such as ROTOWIRE; (2) template-based table-to-text conversion.
- Shared entity attributes (e.g., PLAYER_NAME) are linked via primary–foreign key constraints to ensure join reachability.
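The structure above can be sketched with a toy example. This is a minimal illustration, not the paper's actual schema: the table names (`salaries`, `game_facts`) and columns other than `PLAYER_NAME` are assumptions made for demonstration.

```python
import sqlite3

# Toy reference fact database: a structured source table S_T plus a
# grounding table G_T of atomic fact tuples extracted from text,
# linked on the shared entity attribute PLAYER_NAME.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salaries (            -- source table S_T
    PLAYER_NAME TEXT PRIMARY KEY,
    TEAM        TEXT,
    SALARY      INTEGER
);
CREATE TABLE game_facts (          -- grounding table G_T (atomic tuples)
    PLAYER_NAME TEXT REFERENCES salaries(PLAYER_NAME),
    GAME_DATE   TEXT,
    POINTS      INTEGER
);
""")
conn.executemany("INSERT INTO salaries VALUES (?,?,?)",
                 [("A. Smith", "LAL", 30_000_000),
                  ("B. Jones", "BOS", 12_000_000)])
conn.executemany("INSERT INTO game_facts VALUES (?,?,?)",
                 [("A. Smith", "2024-01-05", 41),
                  ("B. Jones", "2024-01-05", 17)])

# The shared key makes a cross-modal join (table fact + text fact) a
# plain relational query:
rows = conn.execute("""
    SELECT s.PLAYER_NAME, s.SALARY, g.POINTS
    FROM salaries s JOIN game_facts g USING (PLAYER_NAME)
    WHERE g.POINTS > 40
""").fetchall()
print(rows)  # [('A. Smith', 30000000, 41)]
```

Once both modalities live in one relational database, any question answerable over them reduces to an executable SQL query, which is what enables automatic validation downstream.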
Key Designs in Query Generation¶
Post-order Traversal + Realistic Structural Constraints¶
- Nested SQL is modeled as a query graph \(G=(V,E)\), where nodes represent query blocks and edges represent nesting predicates.
- Post-order traversal is adopted for construction: leaf queries are generated and validated first, then recursively encapsulated into higher-level queries.
- This approach outperforms top-down or breadth-first strategies, since post-order guarantees that each intermediate block can be validated by execution at construction time.
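The post-order idea can be sketched in a few lines. This is a simplified stand-in for the paper's pipeline, assuming a toy `salaries` table and using direct execution as the validity check (an LLM would propose the SQL in the real system):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salaries (PLAYER_NAME TEXT, TEAM TEXT, SALARY INTEGER);
INSERT INTO salaries VALUES ('A. Smith','LAL',30000000),
                            ('B. Jones','BOS',12000000),
                            ('C. Lee','LAL',8000000);
""")

def validate(sql):
    """Execute a candidate query block; a non-empty result means valid."""
    rows = conn.execute(sql).fetchall()
    if not rows:
        raise ValueError(f"empty result, reject block: {sql}")
    return rows

# Post-order construction: generate and validate the leaf block first...
leaf = "SELECT TEAM FROM salaries WHERE SALARY > 20000000"
validate(leaf)

# ...then encapsulate it as a nesting predicate of the parent block,
# and validate the composite query again.
parent = f"SELECT PLAYER_NAME FROM salaries WHERE TEAM IN ({leaf})"
print(validate(parent))  # [('A. Smith',), ('C. Lee',)]
```

Because each block is executed before it is wrapped, a failure is localized to the most recently added block instead of being discovered only after the full nested query is assembled.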
Provenance-based Refinement¶
When a query returns empty results:
1. Predicates are stripped in reverse order until a non-empty result is obtained.
2. Tuples are sampled from the non-empty result.
3. A why-not provenance tool is run to identify the blocking predicates.
4. The diagnostic report is fed back to the LLM to rewrite the problematic clause.
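The predicate-stripping step can be illustrated with a minimal sketch. This substitutes a naive strip-and-retry loop for the paper's actual why-not provenance tool, and the table and predicates are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salaries (PLAYER_NAME TEXT, TEAM TEXT, SALARY INTEGER);
INSERT INTO salaries VALUES ('A. Smith','LAL',30000000),
                            ('B. Jones','BOS',12000000);
""")

def diagnose(base, predicates):
    """Strip predicates in reverse order until the query returns rows.
    The first predicate whose removal restores a non-empty result is
    reported as the blocking predicate (a crude why-not diagnosis)."""
    for k in range(len(predicates), -1, -1):
        kept = predicates[:k]
        where = (" WHERE " + " AND ".join(kept)) if kept else ""
        if conn.execute(base + where).fetchall():
            blocking = predicates[k] if k < len(predicates) else None
            return kept, blocking
    return [], None

# The second predicate is unsatisfiable on this data, so the full
# query returns nothing; diagnose() pins the blame on it.
preds = ["TEAM = 'LAL'", "SALARY < 1000000"]
kept, blocking = diagnose("SELECT PLAYER_NAME FROM salaries", preds)
print(blocking)  # SALARY < 1000000
```

In the real pipeline this diagnosis would be rendered as a report and handed back to the LLM, which rewrites only the offending clause rather than regenerating the whole query.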
Question Verbalization¶
- AST-ICL (a state-of-the-art SQL-to-text model) is used to convert SQL into fluent natural language.
- Three CS graduate students perform lightweight validation; annotation efficiency is 4× that of HybridQA.
Domain-Agnostic Design¶
The framework is extensible to arbitrary domains: once source and grounding tables are provided (applying table-to-text generation where needed), the query-generation pipeline carries over unchanged. Extensions to the movie and medical domains are demonstrated.
Key Experimental Results¶
Benchmark Comparison¶
| Benchmark | Table Scale | Question Generation | GROUP BY/HAVING | >3-Hop | Annotation Error Rate |
|---|---|---|---|---|---|
| HybridQA | 4.4 cols × 15.7 rows | Manual | ✗ | ✗ | 21% |
| OTT-QA | 4.4 cols × 15.7 rows | Manual | ✗ | ✗ | 21% |
| TAT-QA | 4.0 cols × 9.4 rows | Manual | ✗ | ✗ | 30% |
| SPARTA (NBA) | 12.2 cols × 3280 rows | Automatic + lightweight validation | ✓ | ✓ | 0% |
Performance of SOTA Models on SPARTA¶
| Model | HybridQA F1 | SPARTA F1 | Drop |
|---|---|---|---|
| Best existing model | >70 | <40 | >30↓ |
| Best OTT-QA model | >50 | <20 | >30↓ |
Ablation Study: Query Generation Strategies¶
| Method | Execution Success Rate | Query Diversity |
|---|---|---|
| One-Shot (no verification) | Low | Low |
| Post-Order (no Provenance) | Medium | Medium |
| Post-Order + Provenance | High | High |
Key Findings¶
- SOTA models (GPT-4, Claude, etc.) suffer substantial F1 drops on SPARTA, exposing fundamental weaknesses in cross-modal reasoning.
- The combination of post-order traversal and provenance-based refinement significantly improves query execution rate and diversity.
- Lightweight human validation requires only 1/4 of the annotation time needed for HybridQA.
- Successful extension to the movie and medical domains validates the domain-agnostic design.
Highlights & Insights¶
- Fundamental redesign of Table-Text QA benchmarks: The SQL-centric pipeline addresses three core issues simultaneously—scale, noise, and logical depth.
- Provenance-based refinement as a key innovation: Database techniques (why-not provenance) are introduced into NLP benchmark construction.
- High discriminative power: F1 drops of 30+ points for SOTA models clearly indicate fundamental deficiencies in existing cross-modal reasoning capabilities.
- Reproducible and extensible: Code, data, and models are fully open-sourced to facilitate future research.
Limitations & Future Work¶
- Atomic fact extraction for grounding tables relies on specific corpora (e.g., ROTOWIRE); extending to new domains requires manual template design.
- Question verbalization depends on LLMs, which may introduce subtle semantic drift.
- Only extractive and generative QA models are evaluated; agent-based and tool-augmented methods have not yet been tested.
Related Work & Insights¶
- Table-Text QA benchmarks: HybridQA, OTT-QA, TAT-QA, FinQA, MultiHiertt, etc.
- Synthetic benchmark generation: ERBench, TDBench, etc. (mostly single-modal or shallow).
- PEEL: Template-based NL–nested SQL pair generation.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The SQL-centric approach to automated benchmark construction is innovative.
- Technical Depth: ⭐⭐⭐⭐ — Provenance-based refinement and post-order traversal constraints are elegantly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive coverage across multiple domains, models, and ablations.
- Practical Value: ⭐⭐⭐⭐⭐ — Directly exposes fundamental weaknesses of SOTA models, offering high community value.