
CRAFT: Training-Free Cascaded Retrieval for Tabular QA

Conference: ACL 2026 · arXiv: 2505.14984 · Code: Project Page · Area: Information Retrieval / Tabular Question Answering · Keywords: Table Retrieval, Cascaded Retrieval, Zero-Shot, Tabular QA, Training-Free

TL;DR

This paper proposes CRAFT, a three-stage cascaded table retrieval framework requiring no dataset-specific training (SPLADE sparse filtering → semantic mini-table ranking → neural re-ranking). By augmenting table representations with Gemini-generated captions and descriptions, CRAFT achieves SOTA on NQ-Tables (R@1 49.84), demonstrates strong zero-shot generalization on OTT-QA, and exhibits remarkable robustness to query paraphrasing.

Background & Motivation

Background: Open-domain tabular question answering (TQA) requires first retrieving relevant tables from large-scale corpora, then reasoning over them to produce answers. Existing approaches include sparse retrieval (BM25), dense retrieval (DPR, DTR), and hybrid retrieval (THYME).

Limitations of Prior Work: (1) Dense retrieval models (DTR, DPR) incur high computational costs and require retraining or fine-tuning on new datasets, limiting adaptability to new domains; (2) naively linearizing tables into text loses row-column structural information; (3) complex architectures (e.g., SSDR's syntax-aware retrievers) demand elaborate modeling and expensive training.

Key Challenge: SOTA table retrieval relies on costly domain-specific fine-tuning, which renders systems inflexible when facing new domains or datasets. The question is whether pretrained models, combined with a carefully designed retrieval pipeline, can achieve competitive performance.

Goal: To construct a modular, extensible multi-stage retrieval framework that leverages off-the-shelf pretrained models to achieve competitive table retrieval and end-to-end QA performance in a zero-shot setting.

Key Insight: A three-stage cascade can transition from high-recall sparse retrieval to high-precision semantic re-ranking by employing progressively stronger (but slower) models at each stage. Gemini-generated table captions and descriptions compensate for the semantic deficiencies of raw table representations.

Core Idea: Applying the "progressive refinement" paradigm of cascaded retrieval to table retrieval — sparse models efficiently filter candidates → mini-table construction reduces token overhead → neural models perform precise re-ranking — achieving SOTA without any training.

Method

Overall Architecture

Preprocessing (Gemini-1.5-Flash generates query sub-questions, table captions, and descriptions; Sentence Transformer ranks table rows by semantic relevance) → Stage 1 (SPLADE sparse retrieval, filtering Top-5000 from 169K/419K tables) → Stage 2 (construct mini-tables comprising column headers + top-5 rows, semantic matching via Sentence Transformer/Jina to obtain Top-K) → Stage 3 (re-rank with OpenAI/Gemini embeddings for final results) → end-to-end LLM answer generation.
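
As a concrete illustration of the preprocessing step, the sketch below generates a caption and description for one table with the google-generativeai client. This is a minimal, hypothetical version: the prompt wording and the augment_table helper are our illustration, not the paper's actual prompts.

```python
import google.generativeai as genai

genai.configure(api_key="...")  # assumes a configured Gemini API key
model = genai.GenerativeModel("gemini-1.5-flash")

def augment_table(table_text: str) -> str:
    """Offline augmentation: ask Gemini for a caption and a short description."""
    prompt = (
        "Given the following table, write a one-line caption and a short "
        "description of its contents.\n\n" + table_text
    )
    return model.generate_content(prompt).text
```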
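The full cascade can also be sketched end-to-end. The version below is a simplified, self-contained approximation, not the authors' implementation: BM25 (via rank_bm25) stands in for SPLADE in Stage 1, one local sentence-transformers encoder stands in for both the Stage-2 bi-encoder and the Stage-3 API embeddings (text-embedding-3-large / gemini-embedding-001), and the dict-shaped table records are an assumed format.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")  # single local stand-in model

def mini_table_text(table: dict, top_k: int = 5) -> str:
    # Placeholder: header plus the first top_k rows. CRAFT instead keeps the
    # rows ranked most relevant by a Sentence Transformer (see Key Designs).
    lines = [" | ".join(table["headers"])]
    lines += [" | ".join(map(str, row)) for row in table["rows"][:top_k]]
    return "\n".join(lines)

def cascade_retrieve(query: str, tables: list[dict], k1=5000, k2=50, k3=10):
    """tables: dicts with 'caption', 'description', 'headers', 'rows', and
    'text' (caption + headers + cell values, flattened for sparse matching)."""
    # Stage 1: cheap lexical filtering over the full corpus.
    bm25 = BM25Okapi([t["text"].lower().split() for t in tables])
    sparse = bm25.get_scores(query.lower().split())
    stage1 = sorted(range(len(tables)), key=lambda i: -sparse[i])[:k1]

    # Stage 2: bi-encoder matching against compact mini-tables.
    minis = {i: mini_table_text(tables[i]) for i in stage1}
    q = encoder.encode(query, convert_to_tensor=True)
    m = encoder.encode([minis[i] for i in stage1], convert_to_tensor=True)
    keep = util.cos_sim(q, m)[0].argsort(descending=True)[:k2]
    stage2 = [stage1[j] for j in keep.tolist()]

    # Stage 3: re-rank survivors on the richest representation (caption +
    # description + mini-table); CRAFT uses a stronger API embedding model here.
    rich = [f"{tables[i]['caption']}\n{tables[i]['description']}\n{minis[i]}"
            for i in stage2]
    r = encoder.encode(rich, convert_to_tensor=True)
    final = util.cos_sim(q, r)[0].argsort(descending=True)[:k3]
    return [stage2[j] for j in final.tolist()]
```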

Key Designs

  1. Three-Stage Cascaded Retrieval:

    • Function: Progressively transitions from high recall to high precision, balancing efficiency and effectiveness.
    • Mechanism: Stage 1 uses SPLADE (sparse lexical expansion) to efficiently scan the full table corpus (leveraging captions, column headers, cell values, and descriptions), filtering to 5,000 candidates. Stage 2 constructs mini-tables (column headers + top-5 rows) and applies bi-encoder semantic matching to narrow down to Top-K. Stage 3 applies the strongest embedding model (text-embedding-3-large or gemini-embedding-001) for final re-ranking.
    • Design Motivation: Running semantic models over the full table corpus is computationally prohibitive; the cascaded design balances precision and efficiency at each stage.
  2. Mini-Table Construction and Table Augmentation:

    • Function: Reduces token overhead while retaining critical table information.
    • Mechanism: Each table retains only its column headers and its five most relevant rows (ranked by semantic relevance with a Sentence Transformer), forming a mini-table; see the sketch after this list. Gemini-1.5-Flash additionally generates a descriptive caption and a detailed description for each table to strengthen semantic matching.
    • Design Motivation: Mini-tables cut online embedding calls by up to 33× and shorten contexts by about 70% without sacrificing retrieval precision.
  3. Dataset-Specific Model Selection:

    • Function: Selects the optimal pretrained model tailored to the characteristics of each dataset.
    • Mechanism: NQ-Tables (single-hop factoid queries) uses all-mpnet-base-v2 + text-embedding-3-large; OTT-QA (multi-hop reasoning, hybrid-mode text) uses Jina Embeddings v3 + gemini-embedding-001. Selection is based on each model's suitability for the specific text characteristics of the dataset.
    • Design Motivation: Query and table characteristics differ across datasets; model selection should match the properties of the data.
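
Mini-table construction itself is simple enough to sketch. One detail the summary leaves open is what the rows are ranked against; the build_mini_table helper below takes an explicit reference text (the query at retrieval time, or plausibly the table's own caption if ranking is done offline), and that choice is our assumption.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def build_mini_table(reference: str, headers: list[str],
                     rows: list[list], top_k: int = 5) -> str:
    """Keep the header row plus the top_k rows most similar to `reference`."""
    row_texts = [" | ".join(map(str, row)) for row in rows]
    q = encoder.encode(reference, convert_to_tensor=True)
    r = encoder.encode(row_texts, convert_to_tensor=True)
    keep = util.cos_sim(q, r)[0].argsort(descending=True)[:top_k]
    return "\n".join([" | ".join(headers)] + [row_texts[i] for i in keep.tolist()])
```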

Loss & Training

No training is involved. All models use pretrained weights or APIs. End-to-end QA employs Llama3-8B, Qwen2.5-7B, and Mistral-7B in zero-shot or few-shot settings for answer generation.
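
A minimal sketch of the answer-generation step, assuming a Hugging Face text-generation pipeline and an instruction-tuned checkpoint (Qwen2.5-7B-Instruct here); the prompt template is our illustration rather than the paper's.

```python
from transformers import pipeline

def answer_from_table(question: str, mini_table: str,
                      model: str = "Qwen/Qwen2.5-7B-Instruct") -> str:
    """Zero-shot answer generation over one retrieved (mini-)table."""
    prompt = (
        "Answer the question using only the table below.\n\n"
        f"Table:\n{mini_table}\n\n"
        f"Question: {question}\nAnswer:"
    )
    generate = pipeline("text-generation", model=model)
    out = generate(prompt, max_new_tokens=64, do_sample=False)
    # The pipeline returns the prompt plus its continuation; strip the prompt.
    return out[0]["generated_text"][len(prompt):].strip()
```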

Key Experimental Results

Main Results

NQ-Tables Retrieval Performance

| Model | Training Required | R@1 | R@10 | R@50 |
| --- | --- | --- | --- | --- |
| THYME (SOTA Hybrid) | Fine-tuning required | 48.55 | 86.38 | 96.08 |
| DTR+HN | Fine-tuning required | 47.33 | 80.96 | 91.51 |
| BIBERT+SPLADE | Fine-tuning required | 45.62 | 86.72 | 95.62 |
| CRAFT (Zero-Shot) | None | 49.84 | 86.83 | 97.17 |

OTT-QA Zero-Shot Retrieval Performance

| Model | R@1 | R@10 | R@50 |
| --- | --- | --- | --- |
| THYME (Fine-tuned) | 66.67 | 91.10 | 96.16 |
| CRAFT (Zero-Shot) | 55.56 | 89.88 | 96.07 |

Ablation Study

Query Robustness (Performance Change Δ Under Paraphrased Queries)

| Model | Original R@10 | Δ (avg) under paraphrasing |
| --- | --- | --- |
| DTR (M) | 75.73 | -8.38 |
| DTR (S) | 73.88 | -11.82 |
| DTR (M)+HN | 80.96 | -5.80 |
| CRAFT | 87.16 | -0.04 |

Key Findings

  • CRAFT surpasses all fine-tuned methods on NQ-Tables in a zero-shot setting (R@1 49.84 vs. THYME 48.55), demonstrating that a carefully designed cascaded pipeline can substitute expensive fine-tuning.
  • On OTT-QA, CRAFT's zero-shot R@50 (96.07) approaches the fine-tuned SOTA (96.16), with a gap of only 0.09.
  • CRAFT is nearly immune to query paraphrasing (Δ = -0.04), whereas fine-tuned DTR degrades by 8–12 points, indicating substantially stronger generalization.
  • Each stage of the cascaded design contributes measurably: Stage 1→2 improves R@10 by approximately 10–21 points, and Stage 2→3 adds a further 5–8 points.
  • Mini-table construction reduces embedding calls by 33× without loss of retrieval precision.

Highlights & Insights

  • CRAFT outperforms fine-tuned methods through "engineering wisdom" — cascaded retrieval combined with table augmentation — suggesting that the general-purpose capabilities of pretrained models have been underestimated.
  • Near-perfect robustness to query paraphrasing (Δ = -0.04) is a highly practical property; fine-tuned models are notably fragile in this regard.
  • Mini-table construction is a simple yet effective efficiency optimization; 70% shorter contexts carry significant practical value in real-world deployment.

Limitations & Future Work

  • Reliance on commercial APIs (Gemini, OpenAI embeddings) constrains cost and reproducibility.
  • Model selection (different models for NQ-Tables vs. OTT-QA) introduces dataset-specific engineering choices.
  • Performance on non-English tables or tables with complex formatting (e.g., merged cells) has not been evaluated.
  • Preprocessing (caption/description generation) requires additional offline LLM calls.
Comparison with Related Work

  • vs. THYME: THYME requires fine-tuning on the target dataset and employs field-aware matching, whereas CRAFT requires no training yet achieves comparable or superior performance through cascaded retrieval.
  • vs. DTR: DTR is a classic dense retriever but is sensitive to query paraphrasing; CRAFT's cascaded design is inherently more robust.
  • vs. T-RAG: T-RAG integrates retrieval and generation end-to-end, whereas CRAFT maintains modularity for easy component replacement.

Rating

  • Novelty: ⭐⭐⭐ The combination of cascaded retrieval and table augmentation is effective but not an entirely novel concept.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across two datasets, robustness testing, stage-wise ablation, and end-to-end QA assessment.
  • Writing Quality: ⭐⭐⭐⭐ Method description is clear and experimental analysis is thorough.
  • Value: ⭐⭐⭐⭐ Demonstrates that training-free retrieval can achieve SOTA, with direct practical value for real-world deployment.