MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A

Conference: ACL 2026
arXiv: 2606.04231
Code: None (repository not provided)
Area: Multimodal Retrieval-Augmented Generation / Enterprise Q&A
Keywords: MM-RAG, Enterprise Documents, Structure-aware Parsing, Multimodal Retrieval, FastRAGEval

TL;DR

MM-BizRAG demonstrates that enterprise multimodal RAG should not rely solely on page screenshots and visual embeddings. Instead, it should first distinguish reports from slides based on document structure, explicitly parse text, tables, and images, and assemble multimodal contexts during inference. This approach significantly outperforms vision-centric baselines on SlideVQA, FinRAGBench-V, and internal enterprise data.

Background & Motivation

Background: Enterprise knowledge bases commonly contain documents such as PDF, DOCX, PPTX, HTML, and TXT, which mix text, tables, images, charts, and complex layouts. A simplified route in multimodal RAG has recently become popular: treating pages as images, using visual embeddings for retrieval, and passing page images to a VLM to generate answers.

Limitations of Prior Work: The page-as-image route skips parsing, but it leaves reading order, table structure, and image-text spatial relationships for pre-trained VLMs to infer implicitly. Structured information in enterprise documents, especially financial reports, compliance materials, technical documents, and multi-page business reports, is often outside the training distribution of general-purpose models.

Key Challenge: RAG retrieval requires lightweight, stable, and indexable representations, while answer generation requires rich, structure-preserving multimodal contexts. Using the same page screenshot for both retrieval and generation may lead to inaccurate retrieval and the loss of table structures or logical reading orders.

Goal: The authors aim to build an MM-RAG system that is deployable for heterogeneous enterprise documents without model fine-tuning, explicitly handling structured artifacts and systematically comparing the impact of different ingestion representations and embedding strategies.

Key Insight: The paper shunts documents into vertical documents (reports, PDFs, filings) and horizontal documents (slide decks) according to their structure, then designs separate parsing and chunking strategies. Lightweight representations are used during retrieval for indexing, while original artifacts are reassembled during the generation stage.

Core Idea: Decouple the "representation for retrieval" from the "context for generation." During retrieval, text, descriptions, or page representations are indexed lightly. During generation, table Markdown, images, and page screenshots are restored based on placeholders and metadata to construct multimodal evidence closer to the original structure.

Method

The core of MM-BizRAG is document structure-aware ingestion. The system first determines whether a document has a vertical or horizontal structure. Vertical documents usually have a natural reading order that suits layout-aware parsing, whereas in horizontal slide decks each page is a complete semantic unit that suits full-page images plus VLM descriptions. The paper then defines three variants (TCTE, PCMHE, and TCMIE) by changing chunk granularity, embedding models, and artifact injection timing.

Overall Architecture

For vertical documents, tools like Docling extract text blocks, tables, images, and page images. Placeholders for tables/images are inserted into text representations to maintain reading order. Tables are converted to Markdown, and row-by-row descriptions are generated by an LLM; images are described by a VLM, with uninformative content like logos filtered out. For horizontal documents, page images are retained, and detailed slide-level descriptions are generated by a VLM.
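
The placeholder idea for vertical documents can be sketched as follows, assuming the parser yields elements in reading order. The `Element` type, the `[[TABLE:id]]` / `[[IMAGE:id]]` placeholder syntax, and `build_text_with_placeholders` are hypothetical names chosen for illustration; they are not Docling's API, and the summary does not state the paper's actual placeholder format.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str      # "text", "table", or "image" (assumed element kinds)
    content: str   # raw text, table Markdown, or an image reference


def build_text_with_placeholders(elements):
    """Return (text, artifacts): text keeps reading order, artifacts maps id -> element."""
    parts, artifacts = [], {}
    counters = {"table": 0, "image": 0}
    for el in elements:
        if el.kind == "text":
            parts.append(el.content)
        else:
            # Non-text artifacts are stored aside and replaced by a placeholder
            # so the text representation preserves their original position.
            counters[el.kind] += 1
            art_id = f"{el.kind}-{counters[el.kind]}"
            artifacts[art_id] = el
            parts.append(f"[[{el.kind.upper()}:{art_id}]]")
    return "\n".join(parts), artifacts


elements = [
    Element("text", "Revenue grew in Q3."),
    Element("table", "| Quarter | Revenue |\n|---|---|\n| Q3 | 12 |"),
    Element("text", "The chart below shows the trend."),
    Element("image", "page3_fig1.png"),
]
text, artifacts = build_text_with_placeholders(elements)
print(text)
```

Keeping the heavy artifacts in a side store means the indexed text stays small while the position of each table or image is still recoverable.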

For retrieval, chunks are generated according to the variant and indexed with text or multimodal embeddings. At inference, a query rewriter first reformulates the query based on conversation history. The system then runs hybrid dense + BM25 retrieval. After Reciprocal Rank Fusion (RRF), the top 30 candidates are sent to an LLM list-wise reranker, and the top 20 chunks are selected to assemble the multimodal context from which GPT-4.1 generates the answer.
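
As a rough illustration of the fusion step, the sketch below implements standard Reciprocal Rank Fusion over two ranked lists of chunk ids. The constant `k=60` is the value commonly used for RRF and is an assumption here, since the summary does not report the paper's setting; `rrf_fuse` and the toy ids are likewise illustrative.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=30):
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every chunk it contains.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return ordered[:top_n]


dense_hits = ["c3", "c1", "c7"]   # toy dense ranking
bm25_hits = ["c1", "c9", "c3"]    # toy BM25 ranking
fused = rrf_fuse([dense_hits, bm25_hits])
print(fused)
```

Because RRF uses only ranks, it sidesteps the problem that dense similarity scores and BM25 scores live on different scales.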

Key Designs

Structure-aware Document Shunting:
- Function: Selects different ingestion pipelines based on the document layout structure.
- Mechanism: Vertical documents undergo layout-aware parsing to preserve text sequence, table Markdown, image descriptions, and page images. Horizontal documents are not forcibly split into blocks; instead, each slide is treated as a holistic semantic unit represented by page images and VLM descriptions.
- Design Motivation: Reports and slides organize information differently. Reports rely on the sequence of paragraphs and tables, whereas slides rely on spatial layout. A single parsing strategy cannot adapt to both simultaneously.
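
The summary does not describe how the vertical/horizontal classifier works, so the sketch below is purely an assumed heuristic: it votes on page aspect ratio and then dispatches to a pipeline name. `classify_layout`, `route`, and the pipeline labels are hypothetical and only illustrate the shunting step.

```python
def classify_layout(page_sizes):
    """Label a document 'horizontal' if most pages are landscape, else 'vertical'.

    ASSUMPTION: aspect-ratio voting is an illustrative heuristic, not the
    classifier used in the paper.
    """
    if not page_sizes:
        raise ValueError("document has no pages")
    landscape = sum(1 for width, height in page_sizes if width > height)
    return "horizontal" if landscape * 2 > len(page_sizes) else "vertical"


def route(page_sizes):
    """Pick the (hypothetical) ingestion pipeline name for a document."""
    pipelines = {
        # Reports: preserve text order, table Markdown, image descriptions.
        "vertical": "layout_aware_parsing",
        # Slides: keep each page whole as an image plus a VLM description.
        "horizontal": "slide_image_plus_vlm_description",
    }
    return pipelines[classify_layout(page_sizes)]


print(route([(612, 792)] * 10))   # portrait pages (points)
print(route([(960, 540)] * 12))   # 16:9 slide pages
```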
Decoupling Retrieval Representation and Generation Context:
- Function: Maintains efficient retrieval while providing generation with sufficiently rich evidence.
- Mechanism: For instance, the TCTE variant uses text chunks, table descriptions, and image descriptions for text embedding. When a table or image description is retrieved, the system finds the parent text chunk of its placeholder and injects table Markdown, image artifacts, and descriptions back into their original positions.
- Design Motivation: Table Markdown and image Base64 are not suitable for direct indexing, but they are needed for generation. Assembly at inference time avoids index bloat.
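
A minimal sketch of inference-time assembly, assuming a retrieved table or image description maps back to a parent text chunk that contains a placeholder. The `[[TABLE:id]]` syntax, the `parent_of` mapping, and the artifact dictionary layout are assumptions for illustration; a real system would attach image bytes rather than a string payload.

```python
import re

# Hypothetical placeholder syntax, e.g. [[TABLE:table-1]] or [[IMAGE:image-2]].
PLACEHOLDER = re.compile(r"\[\[(?:TABLE|IMAGE):([\w-]+)\]\]")


def assemble_context(hit_id, parent_of, chunks, artifacts):
    """Expand a retrieved hit into its parent chunk with artifacts restored in place."""
    parent_id = parent_of.get(hit_id, hit_id)   # text hits are their own parent
    text = chunks[parent_id]

    def restore(match):
        art = artifacts[match.group(1)]
        return f"{art['payload']}\n(Description: {art['description']})"

    return PLACEHOLDER.sub(restore, text)


chunks = {"chunk-4": "Revenue grew in Q3.\n[[TABLE:table-1]]\nMargins were flat."}
artifacts = {
    "table-1": {
        "payload": "| Quarter | Revenue |\n|---|---|\n| Q3 | 12 |",
        "description": "Q3 revenue was 12.",
    }
}
parent_of = {"table-1-desc": "chunk-4"}   # description hit -> parent text chunk

context = assemble_context("table-1-desc", parent_of, chunks, artifacts)
print(context)
```

The index only ever sees the short description; the full table Markdown enters the prompt at its original position once the description is retrieved.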
FastRAGEval Single-call Evaluation:
- Function: Reduces the LLM judge cost for long-answer enterprise QA and better matches human annotations.
- Mechanism: FastRAGEval extracts atomic facts from both reference and prediction in a single LLM call and calculates precision, recall, and F1. The paper primarily uses FRE recall, replacing the two-stage claim decomposition and matching of RAGChecker.
- Design Motivation: Enterprise long answers are more concerned with critical fact recall; traditional token overlap is inapplicable. RAGChecker has high cost and latency, making a single-call judge more suitable for large-scale system evaluation.
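
The scoring arithmetic can be sketched as below. The judge prompt and output schema are not given in this summary, so the JSON layout (`reference_facts` with a `covered` flag, `prediction_facts` with a `supported` flag) is a hypothetical stand-in for whatever the single LLM call returns.

```python
import json


def fre_scores(judge_json):
    """Compute precision, recall, and F1 from a (hypothetical) single-call judge output."""
    result = json.loads(judge_json)
    ref = result["reference_facts"]      # each: {"fact": str, "covered": bool}
    pred = result["prediction_facts"]    # each: {"fact": str, "supported": bool}
    # Recall: share of reference facts that the prediction covers.
    recall = sum(f["covered"] for f in ref) / len(ref) if ref else 0.0
    # Precision: share of predicted facts that the reference supports.
    precision = sum(f["supported"] for f in pred) / len(pred) if pred else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


judge_json = json.dumps({
    "reference_facts": [
        {"fact": "Revenue was 12 in Q3", "covered": True},
        {"fact": "Margins were flat", "covered": True},
        {"fact": "Headcount rose", "covered": True},
        {"fact": "Guidance was raised", "covered": False},
    ],
    "prediction_facts": [
        {"fact": "Revenue was 12 in Q3", "supported": True},
        {"fact": "Margins were flat", "supported": True},
        {"fact": "Headcount rose", "supported": True},
    ],
})
scores = fre_scores(judge_json)
print(scores)
```

In this toy case the prediction misses one of four reference facts, so recall is lower than precision, which is the kind of omission the recall-focused reporting is meant to surface.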

Loss & Training

This paper does not train dedicated retrievers or generation models, focusing instead on the design of ingestion, retrieval, and assembly. Components used include Docling, EasyOCR, pypdfium2, TableFormer, OpenAI text-embedding-3-large, nomic-multimodal-embed-3b, cohere-embed-v4, the GPT-4.1 model family, ColPali, and VisRAG-Ret. In the inference pipeline, dense retrieval returns 70 chunks and BM25 returns 100 chunks; after RRF, the top 30 enter the list-wise reranker, and the final top 20 are used for answer generation.
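
The retrieval funnel described above can be captured in a small configuration object. The numbers come from the text; the `RetrievalConfig` class, its field names, and the validation rule are illustrative assumptions rather than the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    dense_top_k: int = 70      # chunks taken from dense retrieval
    bm25_top_k: int = 100      # chunks taken from BM25
    rerank_input_k: int = 30   # candidates kept after RRF for the list-wise reranker
    final_top_k: int = 20      # chunks assembled into the generation context

    def __post_init__(self):
        # Each stage should keep no more items than the stage before it.
        max_candidates = self.dense_top_k + self.bm25_top_k
        if not (0 < self.final_top_k <= self.rerank_input_k <= max_candidates):
            raise ValueError("each stage must keep no more than the previous one")


cfg = RetrievalConfig()
print(cfg)
```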

Key Experimental Results

Main Results

Pipeline	SlideVQA FRE	FinRAGBench-V FRE	Internal FRE	Key Findings
Text-Only	67.8	60.3	83.7	Strong on text-only questions, weak on tables/images
ColPali	83.6	49.3	-	Stronger than text-only for slides, weak for reports
VisRAG	78.8	46.0	-	Similarly limited by page-image-centric approach
TCTE (OAI v3-large)	87.3	80.2	88.1	Recommended production config; lower latency for vertical docs
PCMHE (Nomic)	89.9	79.6	87.6	Strongest on SlideVQA
PCMHE (Cohere)	89.06	82.4	87.8	Strongest on FinRAGBench-V
TCMIE (Cohere)	88.2	76.9	88.0	Internal data performance close to TCTE

Ablation Study

Analysis	Value / Phenomenon	Description
Vertical-horizontal classifier	Precision 100.00, Recall 83.28, F1 90.87	Document shunting is reliable, but recall can be improved
FRE vs RAGChecker Pearson	0.808 vs 0.748	FRE has higher correlation with human judgment
FRE vs RAGChecker Spearman	0.808 vs 0.736	Better ranking consistency
FRE vs RAGChecker Kendall	0.808 vs 0.725	Single-call metric does not sacrifice consistency
Human Labeling Consistency	Cohen's kappa 0.966	High agreement between two annotators on 200 instances
TCTE Latency	FinRAGBench-V 11.9s, Internal 11.1s	Vertical doc latency is roughly half of PCMHE Cohere

Key Findings

On SlideVQA, the highest FRE recall for MM-BizRAG is 89.9, an improvement of 6.3 percentage points over ColPali's 83.6, showing that explicit text/visual fusion is beneficial even when slides favor visual routes.
On FinRAGBench-V, PCMHE Cohere achieves an FRE recall of 82.4, while ColPali is only 49.3 and VisRAG 46.0; page-image-centric methods degrade significantly in report-style documents.
Internal data includes 1,908 questions, 1,048 documents, and 20,429 pages. All MM-BizRAG variants show higher FRE recall than the 83.7 of text-only.
TCTE is the recommended production configuration: on vertical documents its recall gap to the best variant is usually only 1-3 percentage points (for example, 80.2 vs 82.4 on FinRAGBench-V), while its latency is approximately half that of PCMHE.
The faithfulness of all MM-BizRAG variants exceeds 90%, indicating that structured assembly supplies more evidence without increasing ungrounded generation.

Highlights & Insights

The paper refutes the intuition that "parsing is unnecessary once VLMs are strong enough." Explicit modeling of tables, headers/footers, cross-page narratives, and image-text order in enterprise documents is still required.
Decoupling retrieval representation from generation context is a highly practical engineering design. Many RAG systems tie index schemas to prompt context schemas, resulting in either heavy retrieval or impoverished generation evidence.
The TCTE result carries a realistic lesson: the strongest configuration is not necessarily the most suitable for production, and latency, cost, and recall must be weighed together.
FastRAGEval is also practical. Enterprise QA answers are often long paragraphs for which token F1 and exact match are uninformative; single-call fact-level recall is closer to how the business judges an answer.

Limitations & Future Work

Public slide evaluations rely mainly on SlideVQA, which is relatively simple and may not fully represent complex enterprise-grade presentations.
FinRAGBench-V only processes a subset of 213 English documents for which the authors could obtain PDFs, not covering the full 1,100+ documents or multilingual enterprise scenarios.
Public baselines only include ColPali and VisRAG, lacking comparisons with more recent or closed-source enterprise RAG systems.
Internal enterprise data cannot be released due to privacy and organizational constraints, limiting reproducibility; future releases of anonymous or synthetic versions would be valuable.
The system relies on GPT-4.1 for description generation, query rewriting, reranking, and answering; cost, rate limits, and model version changes will affect deployment performance.

Related Work & Insights

vs ColPali: ColPali uses VLMs for page-level retrieval, suitable for visual matching. MM-BizRAG explicitly restores text, table, and image structures, showing clear advantages in report-style documents.
vs VisRAG: VisRAG also emphasizes page images and multimodal retrieval. MM-BizRAG argues that generation contexts require artifact-aware assembly rather than just passing page images.
vs Text-only RAG: Text-only remains strong on text-dense questions but fails on tables and images; MM-BizRAG supplements multimodal evidence without sacrificing text advantages.
Insight: Ingestion for enterprise RAG should not just ask "which embedding to use," but also "what is the document structure, which artifacts are used for indexing, and which artifacts are restored during generation."

Rating

Novelty: ⭐⭐⭐⭐☆ Most components use existing technology, but the systematic combination of structure shunting, placeholder alignment, and inference-time assembly is innovative from an engineering standpoint.
Experimental Thoroughness: ⭐⭐⭐⭐☆ Strong support from internal large-scale data and two public benchmarks, though public reproducibility and baseline scope are limited.
Writing Quality: ⭐⭐⭐⭐☆ The system design is clear, and variant comparisons are useful; readers must track symbols and pipelines carefully due to the complexity of the enterprise system.
Value: ⭐⭐⭐⭐⭐ Highly valuable for the implementation of enterprise-level multimodal RAG, particularly as a reminder not to abandon explicit document parsing prematurely.