REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark

Conference: ACL 2025
arXiv: 2502.12342
Code: None
Area: Information Retrieval
Keywords: Multi-Modal Retrieval, RAG, Retrieval-Augmented Generation, Query Robustness to Rephrasing, Document Retrieval

TL;DR

This work proposes the REAL-MM-RAG multi-modal document retrieval benchmark, defining four key attributes of real-world retrieval benchmarks (multi-modal documents, enhanced difficulty, realistic RAG queries, and accurate annotations). It introduces multi-level query rephrasing robustness evaluation and achieves SOTA retrieval performance through targeted training datasets (a rephrasing dataset and a financial table dataset).

Background & Motivation

Retrieval-Augmented Generation (RAG) has become an important paradigm for processing large-scale documents. Accurate document retrieval stands as the cornerstone of RAG performance—retrieving incorrect pages inevitably leads to erroneous generated answers. However, existing multi-modal retrieval benchmarks suffer from significant shortcomings.

The authors identify four key attributes for real-world document retrieval benchmarks:

Multi-modal Documents: Datasets should contain mixed content, including text, charts, and tables.

Enhanced Difficulty: Queries should go beyond simple keyword matching and require searching through numerous documents with highly similar contexts.

Realistic RAG Queries: Queries should reflect how users naturally ask questions without prior knowledge of where the answers are located, rather than referencing specific pages.

Accurate Annotations: All relevant documents must be correctly and completely annotated.

Limitations of Prior Work:
- ViDoRe: ColQwen achieves ~90% NDCG@5 on it, indicating the level of difficulty is too low. Queries generated by VLMs often copy the original document text directly, making keyword matching sufficient for retrieval.
- MMLongBench: Based on QA datasets, where queries assume prior knowledge of specific pages, which does not align with actual RAG scenarios.
- All existing benchmarks suffer from extremely high false negative rates: ViDoRe is at 86.9% and MMLongBench is at 77.8% (meaning a large number of correct retrievals are misclassified as incorrect).

Method

Overall Architecture

REAL-MM-RAG comprises two core contributions: (1) a high-quality benchmark construction pipeline meeting the four key attributes; (2) targeted training strategies (a rephrasing training set + a financial table training set) proposed based on weaknesses identified through benchmark analysis.

Key Designs

Document Collection: Focuses on long documents and a large volume of pages within the same sub-domain (centering on IBM corporate data). Roughly 8,600 pages (the per-domain counts below sum to 8,604) are distributed across four sub-domains:
- FinReport: Financial reports (2005-2023), 19 documents/2,687 pages, featuring a mix of text and tables.
- FinSlides: Quarterly financial presentations (2008-2024), 65 documents/2,280 pages, containing heavy tabular data.
- TechReport: FlashSystem technical documentation, 17 documents/1,674 pages, primarily text-based.
- TechSlides: Business and IT automation presentations, 62 documents/1,963 pages, featuring rich visual content.
Query Generation and Filtering: A two-step process to ensure compatibility with RAG scenarios.
- Generation: Uses the Pixtral-12B VLM to generate 10 query-answer pairs per page, with prompts designed to elicit RAG-specific questions.
- Filtering: Employs the Mixtral-8x22B LLM to evaluate if each query is suitable as a retrieval query, filtering out queries that contain page references (e.g., "in Figure 5") or are overly broad.
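The filtering criterion for page references can be illustrated with a cheap lexical pre-check. This is only a sketch under stated assumptions: the benchmark uses an LLM (Mixtral-8x22B) for this judgment, whereas the regex patterns and the function names `has_page_reference` and `prefilter` below are hypothetical and cover only the most obvious location-dependent phrasing.

```python
import re

# Hypothetical patterns for location-dependent phrasing such as "in Figure 5".
# This is a heuristic stand-in, not the LLM-based filter described in the summary.
_PAGE_REFERENCE = re.compile(
    r"\b(?:in|on|from|see)\s+(?:the\s+)?(?:figure|fig\.?|table|page|slide|chart)\s*\d+\b"
    r"|\b(?:this|the above|the following)\s+(?:figure|table|page|slide|chart|document)\b",
    re.IGNORECASE,
)


def has_page_reference(query: str) -> bool:
    """Return True if the query appears to point at a specific page element."""
    return _PAGE_REFERENCE.search(query) is not None


def prefilter(queries):
    """Keep only queries that do not reference a specific figure, table, or page."""
    return [q for q in queries if not has_page_reference(q)]


candidates = [
    "What does the trend in Figure 5 show?",
    "What was IBM's total revenue in 2019?",
    "What is shown on this slide?",
]
kept = prefilter(candidates)
print(kept)
```

A pre-check like this could at most reduce the number of LLM calls; judging whether a query is overly broad still needs a model or a human.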
Multi-level Rephrasing: Addresses the issue where queries generated by VLMs overlap heavily with original document texts.
- Uses Mixtral-8x22B for three levels of rephrasing: Level 1 involves minor lexical substitution, Level 2 modifies vocabulary and sentence order, and Level 3 performs significant lexical rephrasing and sentence restructuring.
- Each query has four versions (original + three levels of rephrasing), all mapped to the same document page.
- After rephrasing, an LLM validates that the original semantics are preserved.
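To make the motivation concrete, the sketch below stores the four versions of a query against one page and measures token overlap with the page text. The `QueryVariants` class, the Jaccard measure, and the example page and queries are all illustrative assumptions, not data or code from the benchmark.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class QueryVariants:
    """Four versions of one query, all mapped to the same page (hypothetical schema)."""
    page_id: str
    levels: Tuple[str, str, str, str]  # index 0 = original, 1..3 = rephrasing level


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Made-up page text and queries, for illustration only.
page_text = "Total revenue for the fiscal year increased due to growth in cloud services"
example = QueryVariants(
    page_id="report_2019_p12",
    levels=(
        "What drove the increase in total revenue for the fiscal year?",
        "What drove the rise in total revenue for the fiscal year?",
        "For the fiscal year, what caused total revenue to grow?",
        "Why did annual sales go up?",
    ),
)
overlaps = [jaccard(q, page_text) for q in example.levels]
for level, score in enumerate(overlaps):
    print(f"level {level}: lexical overlap = {score:.2f}")
```

In this toy case the heavily rephrased query shares no tokens with the page, which is the situation where a purely lexical retriever has nothing to match on.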
Accurate Labeling (False Negative Verification): Uses Pixtral-12B to systematically test each query against all benchmark pages to identify all pages that could potentially contain the answer. Although computationally expensive, this effectively prevents false negatives. Only queries verified to have a single correct page are retained.
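The exhaustive check described above can be sketched as a loop over every query-page pair with a pluggable judge. The function `verify_unique_positive` and the `can_answer` callback are hypothetical names; in the benchmark the judge is a VLM looking at page images, while the toy judge here is a substring test used only to make the example runnable.

```python
from typing import Callable, Dict


def verify_unique_positive(
    queries: Dict[str, str],
    pages: Dict[str, str],
    can_answer: Callable[[str, str], bool],
) -> Dict[str, str]:
    """Return {query_id: page_id} for queries answerable from exactly one page.

    Every query is tested against every page, so the cost is
    len(queries) * len(pages) judge calls.
    """
    kept = {}
    for qid, query in queries.items():
        positives = [pid for pid, content in pages.items() if can_answer(query, content)]
        if len(positives) == 1:
            kept[qid] = positives[0]
    return kept


def toy_judge(query: str, content: str) -> bool:
    """Stand-in judge (assumption): the page 'answers' the query if it contains it."""
    return query.lower() in content.lower()


pages = {
    "p1": "FlashSystem 5200 supports NVMe drives",
    "p2": "Quarterly revenue grew in cloud",
    "p3": "Cloud revenue grew again in the next quarter",
}
queries = {"q1": "nvme", "q2": "revenue"}
labels = verify_unique_positive(queries, pages, toy_judge)
print(labels)
```

Here `q2` is dropped because two pages satisfy the judge, mirroring the rule that only queries with a single verified page survive.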
Targeted Training Strategies:
- Rephrasing Training Set: Half of the queries in the ColPali training set are rephrased at random levels using LLaMA-3-70B (a different LLM from the one used for the benchmark) to force the model to learn semantics rather than keyword matching.
- Financial Table Training Set: Employs FinTabNet (complex tables from S&P 500 corporate reports) to generate 46,000 query-answer-page triplets through the same pipeline.
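A minimal sketch of the rephrasing augmentation follows, assuming that exactly half of the queries are picked uniformly at random and each gets a level drawn uniformly from 1 to 3. The function `rephrase_half` and the `rephrase` callback are hypothetical; the stub used here only tags the query, whereas the described setup calls an LLM.

```python
import random
from typing import Callable, List, Tuple


def rephrase_half(
    queries: List[str],
    rephrase: Callable[[str, int], str],
    seed: int = 0,
) -> List[Tuple[str, int]]:
    """Rephrase a random half of the queries at a random level in {1, 2, 3}.

    Returns (query_text, level) pairs in the original order; level 0 means
    the query was left untouched.
    """
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(queries)), k=len(queries) // 2))
    result = []
    for index, query in enumerate(queries):
        if index in chosen:
            level = rng.randint(1, 3)
            result.append((rephrase(query, level), level))
        else:
            result.append((query, 0))
    return result


def stub_rephrase(query: str, level: int) -> str:
    """Placeholder for an LLM call (assumption); it only marks the level."""
    return f"[L{level}] {query}"


training_queries = [f"query {i}" for i in range(10)]
augmented = rephrase_half(training_queries, stub_rephrase, seed=42)
print(sum(1 for _, level in augmented if level > 0), "of", len(augmented), "rephrased")
```

Keeping the level next to each query makes it straightforward to check the mix afterwards or to reweight levels in later experiments.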

Loss & Training

ColPali-v1.2 and ColQwen2-v1.0 are fine-tuned on the rephrased dataset and/or the financial table dataset.
Training runs for 1 epoch on the new data combined with the original ColPali training set.
This yields three variants per base model: RobCol (rephrase-trained), TabCol (table-trained), and RobTabCol (combining both).

Key Experimental Results

Main Results (NDCG@5, Level 3 Rephrased Queries)

Model	FinReport	FinSlides	TechReport	TechSlides
ColPali	34.5	27.6	62.0	75.8
ColQwen	41.8	31.1	66.9	78.1
RobTabColPali	63.2(↑28.7)	58.3(↑30.7)	70.7(↑8.7)	83.3(↑7.5)
RobTabColQwen	67.1(↑25.3)	61.6(↑30.5)	73.2(↑6.3)	85.0(↑6.9)
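For readers unfamiliar with the metric, the sketch below computes NDCG@5 for a single query using the standard linear-gain formulation. The function names and the example ranking are illustrative assumptions, and the evaluation code behind the table may differ in detail. Because each retained query has one correct page, the score reduces to 1 / log2(rank + 1) when that page appears at position rank within the top 5, and 0 otherwise.

```python
import math
from typing import Dict, Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (0-based rank)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: Sequence[str], relevance: Dict[str, float], k: int = 5) -> float:
    """NDCG@k for one query, given retrieved page ids and graded relevance labels."""
    gains = [relevance.get(page_id, 0.0) for page_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ranking: the single correct page is retrieved at position 3.
relevance = {"gold": 1.0}
score_rank3 = ndcg_at_k(["a", "b", "gold", "c", "d"], relevance)
score_rank1 = ndcg_at_k(["gold", "a", "b", "c", "d"], relevance)
score_miss = ndcg_at_k(["a", "b", "c", "d", "e", "gold"], relevance)
print(score_rank1, score_rank3, score_miss)
```

The benchmark-level numbers in the table are averages of this per-query score, reported on a 0 to 100 scale.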

Impact of Query Rephrasing Levels (Average NDCG@5 Across All Benchmarks)

Rephrasing Level	ColPali	RobTabColPali	ColQwen	RobTabColQwen
0 (No Rephrasing)	71.3	80.8	78.9	85.1
1 (Minor)	65.3	77.8	72.5	81.7
2 (Medium)	60.3	74.9	68.2	78.6
3 (Significant)	56.6	72.7	65.3	76.4
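The degradation in the table above can be quantified directly. The short script below uses only the numbers from that table and computes, for each model, the absolute and relative drop from Level 0 to Level 3; the helper name `drop` is an illustrative choice.

```python
# Average NDCG@5 per rephrasing level (0, 1, 2, 3), copied from the table above.
scores = {
    "ColPali": [71.3, 65.3, 60.3, 56.6],
    "RobTabColPali": [80.8, 77.8, 74.9, 72.7],
    "ColQwen": [78.9, 72.5, 68.2, 65.3],
    "RobTabColQwen": [85.1, 81.7, 78.6, 76.4],
}


def drop(levels):
    """Return (absolute drop in points, relative drop in percent) from level 0 to 3."""
    absolute = levels[0] - levels[-1]
    relative = 100 * absolute / levels[0]
    return round(absolute, 1), round(relative, 1)


drops = {model: drop(levels) for model, levels in scores.items()}
for model, (absolute, relative) in drops.items():
    print(f"{model}: -{absolute} points ({relative}% of the level-0 score)")
```

The base models lose 14.7 (ColPali) and 13.6 (ColQwen) points, while the fine-tuned variants lose 8.1 and 8.7 points, so the targeted training narrows the gap between original and heavily rephrased queries without closing it.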

Human Evaluation of Benchmark Quality

Metric	ViDoRe	MMLongBench	REAL-MM-RAG
False Negative Rate ↓	86.9%	77.8%	31.9%
Realistic RAG Query Rate ↑	43.6%	35.2%	85.0%

Key Findings

Vision models significantly outperform text models: Across all benchmarks, direct page embeddings based on VLMs far exceed OCR + text retrieval.
Financial table documents are extremely challenging: ColPali reaches only 27.6 NDCG@5 on FinSlides with Level 3 rephrased queries, illustrating that table-dense documents are a major weakness of current models.
Query rephrasing leads to massive performance drops: BM25 is affected the most (dropping from 52.7% to 27.1%). Dense retrieval models are more robust but still show a significant degradation.
RobTabCol joint training is the most effective: On Level 3 queries it improves NDCG@5 on the financial benchmarks by roughly 25 to 31 points, and the non-financial benchmarks also gain 6 to 9 points rather than regressing.
Rephrase-training does not hurt non-rephrased performance: RobCol maintains or improves performance even on non-rephrased queries, showing that semantic learning is a win-win.
The false negative problem in existing benchmarks is extremely severe: ViDoRe's 86.9% false negative rate indicates that most "errors" are actually correct retrievals.

Highlights & Insights

Systematic definition of four key attributes: This work is the first to systematically define key attributes that real-world multi-modal retrieval benchmarks should possess, establishing a reference framework for future benchmark designs.
Pioneering rephrasing evaluation: It introduces the first query rephrasing robustness evaluation for multi-modal document RAG, exposing the underlying issue of current models relying on "keyword matching" rather than "semantic understanding."
Importance of false negative verification: Through human evaluation, the paper quantitatively proves severe labeling issues in existing benchmarks, calibrating the true baseline of model performance in this domain.
Complete closed-loop from evaluation to improvement: Identifying problems through the benchmark → developing targeted training strategies → verifying improvement, demonstrating how a solid benchmark can effectively drive model advancement.

Limitations & Future Work

Queries are generated by VLMs, which might not fully capture the diverse nature of human queries.
Labeling and filtering still rely on LLMs/VLMs; although human evaluation validates their effectiveness, omissions may still exist.
The dataset is focused on a single corporate source (IBM), leaving the domain diversity somewhat limited.
Multi-page reasoning queries—queries requiring content synthesis across multiple pages to answer—are not addressed.
The evaluation is restricted to the retrieval component, without extending to the generation stage of RAG.
The training strategies rely on constructing domain-specific training data, which requires recollecting data when adapting to new domains.

Related Work & Insights

Compared to ColPali/ViDoRe, REAL-MM-RAG significantly improves benchmark difficulty and realism.
The query rephrasing robustness evaluation draws inspiration from research in text retrieval, but is applied systematically to multi-modal RAG for the first time.
Insight: The "pseudo-capability" of retrieval models—high scores on simple benchmarks may mask real deficiencies in semantic understanding.
Targeted small-scale data training (46K financial table triplets) can drastically improve domain-specific performance, suggesting that well-targeted data can matter more than sheer volume.

Rating

Novelty: ⭐⭐⭐⭐ Defining the four attributes, multi-level rephrasing evaluation, and false negative verification are all vital contributions.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ The experimental design is exceptionally thorough, covering multiple models, multi-level rephrasing, human evaluation, and training ablation studies.
Writing Quality: ⭐⭐⭐⭐⭐ The problem definition is clear, comparison tables are complete, experimental analyses are in-depth, and the reasoning is highly rigorous.
Value: ⭐⭐⭐⭐⭐ It provides a much-needed high-quality benchmark and improvement schemes for multi-modal RAG retrieval, offering immense practical value.