CRAFT: Training-Free Cascaded Retrieval for Tabular QA

Conference: ACL 2026
arXiv: 2505.14984
Code: Project Page
Area: Information Retrieval / Table Question Answering
Keywords: Table Retrieval, Cascaded Retrieval, Zero-shot, Table QA, Training-free

TL;DR

This paper proposes CRAFT, a three-stage cascaded table retrieval framework (SPLADE sparse filtering → semantic mini-table ranking → neural re-ranking) that requires no dataset-specific training. By enriching table representations with Gemini-generated titles and descriptions, it achieves SOTA on NQ-Tables (R@1 49.84), generalizes well zero-shot to OTT-QA (R@50 96.07 vs. 96.16 for the fine-tuned THYME), and is nearly unaffected by query rewrites (average R@10 change of -0.04).

Background & Motivation

Background: Open-domain Table Question Answering (TQA) requires retrieving relevant tables from a large-scale corpus before reasoning over them to derive answers. Existing methods include sparse retrieval (BM25), dense retrieval (DPR, DTR), and hybrid retrieval (THYME).

Limitations of Prior Work: (1) Dense retrieval models (DTR, DPR) are computationally expensive and require retraining or fine-tuning on new datasets, limiting adaptability; (2) Simple linearization of tables into text loses structural information; (3) Complex architectures (e.g., syntax-aware retrievers in SSDR) require meticulous modeling and high training costs.

Key Challenge: Performance in state-of-the-art (SOTA) table retrieval relies on expensive domain-specific fine-tuning, which makes systems inflexible for new domains or datasets. Is it possible to reach competitive performance using pre-trained models through a carefully designed retrieval pipeline?

Goal: To construct a modular, scalable multi-stage retrieval framework that utilizes off-the-shelf pre-trained models to achieve competitive table retrieval and end-to-end QA performance in a zero-shot setting.

Key Insight: A three-stage cascaded design moves gradually from high-recall sparse retrieval to high-precision semantic re-ranking, with each stage using a stronger but slower model. In parallel, Gemini generates table titles and descriptions to compensate for the limited semantic content of raw table representations.

Core Idea: Apply the "progressive refinement" concept of cascaded retrieval to table retrieval: efficient filtering with sparse models → reduced token overhead via mini-table construction → precise re-ranking with neural models, achieving SOTA without any training.

Method

Overall Architecture

The core proposition of CRAFT is that a carefully orchestrated cascaded pipeline built from off-the-shelf pre-trained models can match or exceed fine-tuned SOTA without dataset-specific fine-tuning. Given a natural language question, the system first performs preprocessing (Gemini-1.5-Flash generates query sub-questions and provides a title/description for each table; row selection is performed using Sentence Transformers). It then follows a "Sparse Coarse-filtering → Semantic Mid-filtering → Neural Fine-ranking" funnel, narrowing the candidates from the full corpus of 169k (NQ-Tables) or 419k (OTT-QA) tables down to the top-ranked results that are passed to an LLM for end-to-end answer generation. No weights are updated anywhere in the pipeline.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
    Q["Natural Language Question<br/>Sub-questions generated by Gemini"]
    subgraph PRE["Mini-table Construction & Table Augmentation (Offline)"]
        direction TB
        T["Table Corpus 169k / 419k"] --> MT["Gemini adds Title + Description<br/>Sentence Transformer selects top-5 rows → mini-table"]
    end
    subgraph CAS["Three-stage Cascaded Retrieval"]
        direction TB
        S1["Stage 1 · SPLADE Sparse Filtering<br/>Full Corpus → 5000 Candidates"] --> S2["Stage 2 · Bi-encoder Semantic Mid-filtering<br/>→ Top-K"] --> S3["Stage 3 · Neural Fine-ranking<br/>→ Top Results"]
    end
    Q --> S1
    MT --> S1
    S3 --> ANS["LLM End-to-End Answer Generation"]
    MODEL["Dataset-specific Model Selection<br/>Configure pre-trained models by data features"] -.-> S2
    MODEL -.-> S3

Key Designs

1. Three-Stage Cascaded Retrieval: "Stronger yet Fewer" further down the line

Running semantic models over the entire table corpus is cost-prohibitive, so CRAFT trades off accuracy against efficiency across three stages. Stage 1 uses SPLADE for sparse lexical expansion over titles, headers, cell values, and generated descriptions, scanning the full corpus efficiently and filtering it down to 5000 candidates. Stage 2 compresses each table into a "mini-table" for bi-encoder semantic matching, narrowing the pool to Top-K. Stage 3 employs the strongest embedding models (text-embedding-3-large or gemini-embedding-001) for final re-ranking. Each stage is stronger but slower than the one before it; because the earlier stages drastically shrink the candidate pool, the expensive models only ever run on small sets.
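The funnel described above can be sketched generically as a list of (scorer, keep) stages. This is a minimal illustration, not the paper's implementation: the three scorers are toy stand-ins (token overlap and Jaccard similarity instead of SPLADE, a bi-encoder, and a large embedding model), and the names `cascade`, `token_overlap`, `jaccard`, and `expensive_scorer` are hypothetical.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]


def cascade(query: str, corpus: Sequence[str],
            stages: Sequence[tuple[Scorer, int]]) -> list[str]:
    """Run each (scorer, keep) stage on the survivors of the previous stage."""
    candidates = list(corpus)
    for scorer, keep in stages:
        ranked = sorted(candidates, key=lambda doc: scorer(query, doc), reverse=True)
        candidates = ranked[:keep]
    return candidates


def token_overlap(query: str, doc: str) -> float:
    """Cheap lexical score (toy stand-in for sparse retrieval)."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def jaccard(query: str, doc: str) -> float:
    """Slightly richer score (toy stand-in for a bi-encoder)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


calls = {"expensive": 0}


def expensive_scorer(query: str, doc: str) -> float:
    """Toy stand-in for the costly final re-ranker; counts how often it runs."""
    calls["expensive"] += 1
    return jaccard(query, doc) + 0.01 * token_overlap(query, doc)


corpus = [
    "list of olympic host cities year city country",
    "world cup winners year country",
    "olympic medal table country gold silver bronze",
    "periodic table element symbol atomic number",
    "largest cities by population city country population",
]
query = "which country won the most olympic gold medals"

# Keep 3 after the cheap stage, 2 after the middle stage, 1 after the expensive stage.
result = cascade(query, corpus, [(token_overlap, 3), (jaccard, 2), (expensive_scorer, 1)])
print(result, calls)
```

In this toy run the expensive scorer is invoked on only two of the five documents, which is the point of the design: the cost of the strongest model scales with the size of the surviving pool, not with the corpus.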

2. Mini-table Construction and Table Augmentation: Pruning before Semantic Enrichment

Linearizing an entire table for embedding models is expensive and dilutes key signals. CRAFT keeps only the headers and the top 5 most relevant rows (selected by Sentence Transformers based on semantic relevance to the query) to form a mini-table. Simultaneously, Gemini-1.5-Flash generates a descriptive title and a detailed summary for each table to overcome the semantic limitations of raw tables. This combination results in 33× fewer online embedding calls and 70% shorter contexts without sacrificing retrieval accuracy.
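A rough sketch of the mini-table idea follows: keep the headers, keep the rows most similar to the query, and prepend the generated title and description. Everything here is an assumption for illustration. The bag-of-words `embed` function is a toy replacement for a Sentence Transformers encoder, the title and description strings are hand-written placeholders for LLM output, and `build_mini_table` and `linearize` are hypothetical helper names.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_mini_table(query: str, headers: list[str], rows: list[list[str]],
                     top_k: int = 5) -> dict:
    """Keep the headers plus the top_k rows most similar to the query."""
    q = embed(query)
    ranked = sorted(rows, key=lambda row: cosine(q, embed(" ".join(row))), reverse=True)
    return {"headers": headers, "rows": ranked[:top_k]}


def linearize(title: str, description: str, mini_table: dict) -> str:
    """Flatten title, description, headers, and selected rows into one text."""
    lines = [f"Title: {title}", f"Description: {description}",
             " | ".join(mini_table["headers"])]
    lines += [" | ".join(row) for row in mini_table["rows"]]
    return "\n".join(lines)


headers = ["Year", "City", "Country"]
rows = [
    ["1996", "Atlanta", "United States"],
    ["2000", "Sydney", "Australia"],
    ["2004", "Athens", "Greece"],
    ["2008", "Beijing", "China"],
    ["2012", "London", "United Kingdom"],
    ["2016", "Rio de Janeiro", "Brazil"],
    ["2020", "Tokyo", "Japan"],
]

mini = build_mini_table("olympics host city in 2012", headers, rows)
# Placeholder title/description; in the described pipeline these come from an LLM.
text = linearize("Summer Olympics host cities",
                 "Host city and country for each Summer Olympics.", mini)
print(text)
```

The linearized text has a fixed upper bound on length (three header lines plus at most `top_k` rows) regardless of how large the original table is, which is what shortens the context handed to the embedding model.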

3. Dataset-specific Model Selection: Selecting vs. Tuning

While CRAFT does not train on new datasets, it acknowledges that different data has different textual features. Adaptation occurs at the "model selection" layer: NQ-Tables (single-hop fact queries) uses all-mpnet-base-v2 + text-embedding-3-large, while OTT-QA (multi-hop reasoning, hybrid text-table) uses Jina Embeddings v3 + gemini-embedding-001. This preserves the "zero-training" core while providing a calibration knob for different domains.
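The "selection instead of tuning" knob amounts to a small lookup table. The sketch below assumes that the first model in each pair serves Stage 2 and the second serves Stage 3 (consistent with the Stage 3 models named earlier); the `RetrieverConfig` class, its field names, and `select_models` are hypothetical, and the model names are copied as written in this summary rather than as exact API identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrieverConfig:
    stage2_encoder: str   # bi-encoder used for mini-table ranking
    stage3_embedder: str  # embedding model used for final re-ranking


# Per-dataset choices as reported in the summary; no weights are trained.
MODEL_SELECTION = {
    "nq-tables": RetrieverConfig("all-mpnet-base-v2", "text-embedding-3-large"),
    "ott-qa": RetrieverConfig("Jina Embeddings v3", "gemini-embedding-001"),
}


def select_models(dataset: str) -> RetrieverConfig:
    """Look up the pre-trained models configured for a dataset."""
    key = dataset.strip().lower()
    if key not in MODEL_SELECTION:
        raise KeyError(f"no model selection configured for dataset {dataset!r}")
    return MODEL_SELECTION[key]


print(select_models("NQ-Tables"))
print(select_models("OTT-QA"))
```

Adapting to a new domain under this scheme means adding one entry to the table, which still requires some held-out data to decide which pair works best.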

Loss & Training

This paper involves no training. All models use pre-trained weights or APIs. The end-to-end QA phase uses Llama3-8B, Qwen2.5-7B, or Mistral-7B to generate answers in zero-shot or few-shot settings.

Key Experimental Results

Main Results

NQ-Tables Retrieval Performance

| Model | Training Required | R@1 | R@10 | R@50 |
|---|---|---|---|---|
| THYME (SOTA Hybrid) | Fine-tuning | 48.55 | 86.38 | 96.08 |
| DTR+HN | Fine-tuning | 47.33 | 80.96 | 91.51 |
| BIBERT+SPLADE | Fine-tuning | 45.62 | 86.72 | 95.62 |
| CRAFT (Zero-shot) | None | 49.84 | 86.83 | 97.17 |

OTT-QA Zero-shot Retrieval Performance

| Model | R@1 | R@10 | R@50 |
|---|---|---|---|
| THYME (Fine-tuned) | 66.67 | 91.10 | 96.16 |
| CRAFT (Zero-shot) | 55.56 | 89.88 | 96.07 |
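For readers unfamiliar with the R@k columns in the retrieval results above, the following sketch computes Recall@k under a common convention: a question counts as a hit if at least one gold table appears among its top-k retrieved tables, and the score is the percentage of hits. That convention is an assumption here, not something this summary states, and the table IDs are invented toy data.

```python
def recall_at_k(ranked_ids: list[str], gold_ids: set[str], k: int) -> float:
    """1.0 if any gold table appears in the top-k results, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(gold_ids) else 0.0


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Percentage of questions whose top-k results contain a gold table."""
    return 100.0 * sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)


# Toy data: (ranked table IDs, gold table IDs) for three questions.
runs = [
    (["t3", "t1", "t7"], {"t1"}),  # gold table at rank 2
    (["t2", "t9", "t4"], {"t4"}),  # gold table at rank 3
    (["t5", "t6", "t8"], {"t0"}),  # gold table never retrieved
]

for k in (1, 2, 3):
    print(f"R@{k} = {mean_recall_at_k(runs, k):.2f}")
```

Because the metric is monotone in k, a system can trail at R@1 while nearly matching at R@50, which is the pattern seen for CRAFT on OTT-QA.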

Ablation Study

Query Robustness (Performance Change Δ under Query Rewriting)

| Model | Original R@10 | Rewritten Δ (avg) |
|---|---|---|
| DTR (M) | 75.73 | -8.38 |
| DTR (S) | 73.88 | -11.82 |
| DTR (M)+HN | 80.96 | -5.80 |
| CRAFT | 87.16 | -0.04 |
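One plausible reading of the Δ (avg) column is the mean difference between R@10 on each rewritten query set and R@10 on the original queries. The sketch below encodes that reading as an assumption; the summary does not spell out the exact formula, `average_delta` is a hypothetical helper, and the numbers in the example are invented purely to show the arithmetic.

```python
def average_delta(original: float, rewritten_scores: list[float]) -> float:
    """Mean of (rewritten - original) over all rewrite variants."""
    if not rewritten_scores:
        raise ValueError("need at least one rewritten score")
    return sum(score - original for score in rewritten_scores) / len(rewritten_scores)


# Invented illustration values, not results from the paper.
original_r10 = 80.0
rewritten_r10 = [78.0, 76.0, 77.0]

delta = average_delta(original_r10, rewritten_r10)
print(f"delta(avg) = {delta:+.2f}")
```

Under this definition a value near zero means the retriever ranks tables about equally well regardless of how the question is phrased, while a large negative value signals sensitivity to surface wording.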

Key Findings

  • CRAFT surpasses all listed fine-tuned methods on NQ-Tables in a zero-shot setting (R@1 49.84 vs. THYME 48.55), suggesting that a well-engineered cascaded pipeline can substitute for expensive fine-tuning on this benchmark.
  • On OTT-QA, CRAFT’s zero-shot R@50 (96.07) is nearly identical to the fine-tuned SOTA (96.16), although a clear gap remains at R@1 (55.56 vs. 66.67).
  • CRAFT is virtually immune to query rewriting (Δ=-0.04), whereas the DTR variants drop by roughly 6 to 12 points (-5.80 to -11.82), indicating significantly stronger robustness to query phrasing.
  • Mini-table design reduces embedding calls by 33× without loss of precision.

Highlights & Insights

  • Employs "engineering wisdom" (cascaded retrieval + table augmentation) to beat fine-tuning methods, suggesting that the general capability of pre-trained models is undervalued.
  • Extreme robustness to query rewriting (Δ=-0.04) is a highly practical feature, as fine-tuned models are often fragile in this regard.
  • Mini-table construction is a simple yet effective efficiency optimization; the resulting 70% shorter contexts matter for real-world deployment.

Limitations & Future Work

  • Reliance on commercial APIs (Gemini, OpenAI embeddings) limits cost efficiency and reproducibility.
  • Model selection (different models for NQ-Tables vs. OTT-QA) introduces dataset-specific engineering choices.
  • Performance on non-English tables or tables with complex formatting (merged cells) has not been evaluated.
  • Preprocessing (generating titles/descriptions) requires additional offline LLM calls.

Comparison with Related Work

  • vs. THYME: THYME requires fine-tuning on target datasets and field-aware matching; CRAFT matches performance without training via its cascaded pipeline.
  • vs. DTR: DTR is a classic dense retriever but is sensitive to query variations; CRAFT's design is inherently more robust.
  • vs. T-RAG: While T-RAG integrates retrieval and generation end-to-end, CRAFT remains modular, allowing for easy component replacement.

Rating

  • Novelty: ⭐⭐⭐ The combination of cascaded retrieval and table augmentation is effective but not entirely new.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across two datasets, robustness tests, stage ablations, and end-to-end QA.
  • Writing Quality: ⭐⭐⭐⭐ Clear methodological descriptions and detailed experimental analysis.
  • Value: ⭐⭐⭐⭐ Demonstrates that training-free retrieval can achieve SOTA, providing direct value for practical deployments.