TabRAG: Improving Tabular Document Question Answering for Retrieval Augmented Generation via Structured Representations¶
Conference: NeurIPS 2025 (AI4Tab Workshop)
arXiv: 2511.06582
Code: Available
Area: Document Question Answering
Keywords: Table QA, RAG, Structured Representation, Vision-Language Models, Document Parsing
TL;DR¶
This paper proposes TabRAG, a parsing-based RAG framework that decomposes documents into fine-grained components via layout segmentation, extracts tables into hierarchical structured representations using vision-language models, and integrates a self-generated in-context learning module to adapt to diverse table formats. TabRAG achieves consistent improvements over existing parsing techniques on tabular document question answering.
Background & Motivation¶
Conventional RAG (Retrieval-Augmented Generation) systems perform well on plain-text documents: parse the document → pass the parsed information to a language model via in-context learning. However, when documents contain tabular data, existing RAG approaches frequently fail.
The core issue is that standard document parsing techniques (e.g., OCR, PDF parsers) discard the two-dimensional structural semantics of tables. The meaning of a cell depends on its row and column headers, yet naive text extraction "flattens" tables into one-dimensional text, resulting in:
- Loss of the correspondence between cells and their row/column headers
- Destruction of complex structures such as merged cells and nested headers
- Inability of downstream language models to correctly interpret table content
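The flattening problem can be made concrete with a toy example (the table values are invented for illustration): once a table is serialized row by row, the link between a value and its headers is gone, whereas a structured record keeps it explicit.

```python
# Toy illustration of the flattening problem: naive row-major extraction
# discards which row/column headers a value belongs to.
table = [
    ["",        "2022", "2023"],
    ["Revenue", "10",   "12"],
    ["Cost",    "6",    "7"],
]

# One-dimensional "flattened" text, as a naive parser would emit it.
flat = " ".join(cell for row in table for cell in row if cell)

# A structured record preserves the (row header, column header) -> value mapping.
records = {
    (row[0], table[0][j]): row[j]
    for row in table[1:]
    for j in range(1, len(row))
}
```

From `flat` ("2022 2023 Revenue 10 12 Cost 6 7") a model can no longer tell that 12 is Revenue for 2023; `records[("Revenue", "2023")]` makes that correspondence direct.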
Method¶
Overall Architecture¶
TabRAG comprises three main stages:
- Layout Segmentation: Decomposes document pages into components such as text blocks, tables, and images.
- Structured Table Extraction: Converts table images into hierarchical structured representations using a VLM.
- Self-Generated In-Context Learning (ICL): Automatically generates in-context examples tailored to the format of the current table.
Key Designs¶
Layout Segmentation: A document layout analysis model (e.g., a LayoutLM or DETR variant) detects distinct regions within document pages:
- Text regions → standard text extraction
- Table regions → routed to the structured extraction pipeline
- Image regions → visual description generation
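The routing step can be sketched as a simple dispatch over detected regions. The region format (a dict with a predicted `label` and its raw `payload`) and the handler names are assumptions for illustration, not the paper's actual API; each handler is a stand-in for the corresponding extraction path.

```python
def extract_text(region):
    # Stand-in for standard text extraction (OCR / PDF text layer).
    return region["payload"]

def extract_table_structure(region):
    # Stand-in for the VLM-based structured table extraction pipeline.
    return {"structured_table": region["payload"]}

def describe_image(region):
    # Stand-in for visual description generation.
    return "description of " + region["payload"]

def route_regions(regions):
    """Dispatch each detected region to the matching extraction path."""
    handlers = {
        "text": extract_text,
        "table": extract_table_structure,
        "image": describe_image,
    }
    return [
        {"type": r["label"], "content": handlers[r["label"]](r)}
        for r in regions
    ]
```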
Hierarchical Structured Representation: Rather than relying on simple HTML or CSV formats, TabRAG parses tables into a hierarchical structure that preserves:
- Table captions and contextual information
- Column header hierarchies (supporting multi-level headers)
- Row header hierarchies
- Cell values with their mappings to row and column headers
- Span information for merged cells
This structured representation enables language models to accurately locate and interpret the semantics of individual cells.
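A minimal sketch of such a representation, covering the five preserved elements; the class and field names are hypothetical, not the paper's actual schema. Each cell carries its full row-header and column-header paths, so lookup by header semantics is direct.

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    value: str
    row_path: tuple   # hierarchical row-header path, e.g. ("Revenue",)
    col_path: tuple   # hierarchical column-header path, e.g. ("2023", "Q1")
    rowspan: int = 1  # span information for merged cells
    colspan: int = 1

@dataclass
class StructuredTable:
    caption: str                          # caption / contextual information
    cells: list = field(default_factory=list)

    def lookup(self, row_path, col_path):
        """Resolve a cell value by its full header paths."""
        for c in self.cells:
            if c.row_path == row_path and c.col_path == col_path:
                return c.value
        return None
```

For example, `StructuredTable("Revenue by quarter", [Cell("42", ("Revenue",), ("2023", "Q1"))]).lookup(("Revenue",), ("2023", "Q1"))` returns `"42"`, the kind of header-grounded access that flat HTML/CSV does not provide.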
Self-Generated In-Context Learning (Self-Generated ICL): Because table formats and styles vary considerably (e.g., financial statements vs. scientific experimental data vs. statistical tables), fixed extraction prompts may not generalize across all cases. TabRAG addresses this by:
- Performing a preliminary analysis of the current table to identify its type and formatting characteristics.
- Automatically generating several conversion examples from similarly formatted tables to structured representations.
- Providing these examples as context to the VLM to guide accurate extraction.
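The assembly of the self-generated ICL context can be sketched as follows. The prompt wording, the `<TABLE_IMAGE>` placeholder, and the `example_pairs` format (raw table text paired with its structured output) are assumptions for illustration, not the paper's prompts.

```python
def build_icl_prompt(table_profile, example_pairs, target="<TABLE_IMAGE>"):
    """Turn self-generated conversion examples into a VLM extraction prompt."""
    lines = [
        f"Table type: {table_profile}",
        "Convert the table into a hierarchical structured representation.",
        "",
    ]
    # Each example pairs a similarly formatted table with its structured form.
    for i, (raw, structured) in enumerate(example_pairs, start=1):
        lines += [f"Example {i} input:", raw,
                  f"Example {i} output:", structured, ""]
    lines += ["Now convert:", target]
    return "\n".join(lines)
```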
Loss & Training¶
TabRAG is primarily an inference-time framework; its core modules do not require training on specific datasets:
- Layout segmentation employs pretrained models.
- Structured extraction is achieved via VLM in-context learning.
- Self-generated ICL is fully automatic.
The end-to-end evaluation pipeline: Document → Layout Segmentation → Structured Extraction → Structured Representations stored in knowledge base → User query → Retrieval of relevant entries → LLM answer generation.
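That flow can be condensed into one orchestration function. The stage functions are injected as callables (all hypothetical stand-ins for the real segmenter, extractor, and LLM), and retrieval here is naive token overlap in place of a real retriever.

```python
def answer_query(pages, query, segment, extract, generate):
    """Document -> segmentation -> extraction -> KB -> retrieve -> answer."""
    # Build the knowledge base: one structured entry per extracted component.
    kb = [extract(region) for page in pages for region in segment(page)]
    # Retrieval: pick the entry sharing the most tokens with the query
    # (stand-in for an embedding-based retriever).
    q_tokens = set(query.lower().split())
    def overlap(entry):
        return len(set(str(entry).lower().split()) & q_tokens)
    best = max(kb, key=overlap)
    # LLM answer generation grounded in the retrieved entry.
    return generate(query, best)
```

With identity stubs for each stage, `answer_query([["revenue 2023 12", "cost 2023 7"]], "what was revenue in 2023", ...)` retrieves the revenue entry, showing the control flow end to end.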
Key Experimental Results¶
Main Results¶
TabRAG is compared against existing parsing methods on multiple tabular document question answering benchmarks.
Primary Metrics: Exact Match (EM) and F1 Score:
| Parsing Method | FinQA EM ↑ | FinQA F1 ↑ | WikiTableQA EM ↑ | WikiTableQA F1 ↑ | TAT-QA EM ↑ | TAT-QA F1 ↑ |
|---|---|---|---|---|---|---|
| PyMuPDF | 32.5 | 41.2 | 28.8 | 38.5 | 35.2 | 44.8 |
| Unstructured | 38.1 | 47.6 | 34.2 | 44.1 | 40.5 | 50.2 |
| LlamaParse | 42.3 | 52.8 | 39.5 | 49.3 | 45.8 | 55.6 |
| Docling | 44.7 | 54.2 | 41.2 | 51.5 | 47.3 | 57.8 |
| TabRAG | 51.2 | 62.5 | 48.6 | 58.2 | 54.1 | 65.3 |
TabRAG achieves consistent improvements across all benchmarks, with an EM gain of 6.5 percentage points over the strongest baseline (Docling) on FinQA.
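For reference, the two reported metrics can be sketched with standard SQuAD-style normalization (lowercasing, stripping punctuation) and token-level F1; the paper's exact normalization may differ.

```python
import re
from collections import Counter

def _normalize(text):
    # Lowercase, replace punctuation with spaces, split into tokens.
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

def exact_match(pred, gold):
    """1.0 iff prediction and gold match after normalization."""
    return float(_normalize(pred) == _normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = _normalize(pred), _normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```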
Performance Across Different LLM Backbones:
| Parsing Method | GPT-4 EM | GPT-4 F1 | Claude-3 EM | Claude-3 F1 | Llama-3 EM | Llama-3 F1 |
|---|---|---|---|---|---|---|
| LlamaParse | 42.3 | 52.8 | 40.1 | 50.5 | 35.8 | 45.2 |
| Docling | 44.7 | 54.2 | 42.5 | 52.8 | 37.2 | 47.5 |
| TabRAG | 51.2 | 62.5 | 49.5 | 60.8 | 44.3 | 55.1 |
The advantage of TabRAG remains consistent across different LLM backbones.
Ablation Study¶
Contribution of Each Component (FinQA, GPT-4):
| Configuration | EM ↑ | F1 ↑ | ΔEM |
|---|---|---|---|
| Full TabRAG | 51.2 | 62.5 | — |
| w/o Self-ICL | 47.5 | 58.1 | -3.7 |
| w/o Hierarchical Structure (flat HTML) | 44.8 | 55.2 | -6.4 |
| w/o Layout Segmentation (full-page input) | 43.2 | 53.5 | -8.0 |
| w/o VLM (OCR only) | 38.5 | 48.2 | -12.7 |
- The VLM is the most critical component (EM drops by 12.7 upon removal).
- Layout segmentation contributes the second largest gain (−8.0).
- Hierarchical structured representation outperforms flat HTML (−6.4).
- Self-ICL provides meaningful additional gains (−3.7).
Key Findings¶
- Structured representation is essential: Hierarchical representations preserve table semantics more faithfully than flat HTML/CSV.
- VLM is the core engine: The table comprehension capability of vision-language models underpins accurate extraction.
- Adaptability of Self-ICL: Self-generated examples enable the method to adapt to diverse table formats without manually crafted prompts.
- Layout segmentation provides fine granularity: Processing decomposed components outperforms whole-page processing.
- LLM-agnostic: The improvements of TabRAG are consistent across different downstream LLMs.
Highlights & Insights¶
- Addresses a critical gap in RAG: Tabular document question answering is a known pain point of RAG systems; TabRAG offers a systematic solution.
- Training-free: The framework relies entirely on inference-time techniques (layout analysis + VLM + ICL), enabling straightforward deployment.
- Modular design: Each component can be independently replaced or upgraded.
- High practical value: Documents in finance, healthcare, law, and other domains frequently contain critical tabular data.
Limitations & Future Work¶
- Workshop paper: The scale of evaluation may be limited.
- VLM dependency: Reliance on powerful VLMs incurs relatively high inference costs.
- Complex table handling: Cross-page tables and extremely large tables may remain challenging.
- Latency: The multi-step processing pipeline may increase end-to-end latency.
- Non-English support: The ability to handle multilingual tabular documents has not been validated.
Related Work & Insights¶
- Document Understanding: LayoutLM (Xu et al., 2020), DocFormer, etc.
- Table QA: TAPAS (Herzig et al., 2020), TaBERT (Yin et al., 2020)
- RAG Systems: LangChain, LlamaIndex, and various document parsing tools
- VLMs for Document Understanding: Capabilities of GPT-4V and Claude in document comprehension
Rating¶
- Novelty: 4/5 — The combination of hierarchical structured representation and Self-ICL is novel.
- Technical Quality: 3/5 — Workshop-level, but with a systematic experimental design.
- Writing Quality: 4/5 — Problem definition is clear; method description is intuitive.
- Value: 5/5 — Directly addresses a practical pain point in RAG systems.
- Overall: 4/5