CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG

Conference: ACL 2025
arXiv: 2506.02544
Code: https://github.com/TyangJN/CoRe-MMRAG
Area: LLM Agent / Multimodal VLM / RAG
Keywords: multimodal RAG, knowledge inconsistency, visual-textual reconciliation, KB-VQA

TL;DR

CoRe-MMRAG proposes an end-to-end multimodal RAG framework that addresses Parametric-Retrieval Knowledge Inconsistency (PRKI) and Visual-Textual Knowledge Inconsistency (VTKI) through a four-stage pipeline: parametric knowledge generation → joint visual-textual reranking → external knowledge generation → internal-external knowledge integration. It reports gains of 5.6% and 9.3% over baselines on InfoSeek and Encyclopedic-VQA, respectively.

Background & Motivation

Background: Multimodal RAG (MMRAG) enhances MLLMs by retrieving external image-text knowledge, and is widely applied to knowledge-based Visual Question Answering (KB-VQA).

Limitations of Prior Work:
- PRKI (Parametric-Retrieval Knowledge Inconsistency): Retrieved information may contradict the model's internal parametric knowledge, making it difficult for the model to judge which source is more reliable; noisy retrieval may override correct parametric knowledge.
- VTKI (Visual-Textual Knowledge Inconsistency): The images and text of retrieved items may point to different entities, causing text-only reranking to select incorrect items.
- Existing MMRAG methods (such as Wiki-LLaVA and RoRA-VLM) rely solely on textual similarity in the reranking stage, ignoring cross-modal inconsistency.

Key Challenge: MMRAG brings in external knowledge, but with it come both types of inconsistency, and existing methods lack explicit mechanisms to reconcile them.

Goal: To design a framework that simultaneously solves both PRKI and VTKI, enabling MLLMs to reliably integrate multi-source multimodal knowledge.

Key Insight: The approach follows the "internal first, external second, then integrate" philosophy of Astute RAG, extends it to multimodal scenarios, and adds joint visual-textual evaluation.

Core Idea: First generate a reference answer using parametric knowledge, then select the most relevant retrieved items using joint visual-textual similarity, and finally compare the internal and external answers to perform reliable integration.

Method

Overall Architecture

An end-to-end four-stage pipeline: (1) Generate an internal answer \(y^{int}\) using only parametric knowledge → (2) Evaluate joint visual and textual similarity to select the most relevant retrieved item → (3) Generate an external answer \(y^{ext}\) based on the best retrieved item → (4) Compare \(y^{int}\) and \(y^{ext}\), integrating the most reliable information to generate the final answer \(y^*\).
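The control flow of the four stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Model` callable, the request dictionaries, the `Retrieved` class, and `toy_model` are hypothetical names invented here, not the interface of the released code. The toy model is hard-coded so that the sketch runs without any weights.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Retrieved:
    """One retrieved knowledge item: an image reference and its paired text."""
    image: str
    text: str


# Hypothetical model interface (an assumption): one request dict in, one string out.
Model = Callable[[dict], str]


def core_mmrag_answer(model: Model, question: str, query_image: str,
                      items: List[Retrieved]) -> Tuple[str, str, int, str]:
    """Run the four stages in order and return (y_int, y_ext, best_index, y_star)."""
    # Stage 1: internal answer from parametric knowledge only (no retrieval).
    y_int = model({"stage": "internal", "question": question, "image": query_image})

    # Stage 2: joint visual-textual reranking over all k items at once.
    best = int(model({"stage": "rerank", "question": question,
                      "image": query_image, "items": items}))

    # Stage 3: external answer grounded in the selected item.
    y_ext = model({"stage": "external", "question": question,
                   "image": query_image, "evidence": items[best]})

    # Stage 4: compare both answers against the evidence and emit the final one.
    y_star = model({"stage": "integrate", "question": question,
                    "image": query_image, "y_int": y_int, "y_ext": y_ext,
                    "evidence": items[best]})
    return y_int, y_ext, best, y_star


def toy_model(request: dict) -> str:
    """Hard-coded stand-in for an MLLM, for illustration only."""
    stage = request["stage"]
    if stage == "internal":
        return "Paris"
    if stage == "rerank":
        return "1"
    if stage == "external":
        # Read the answer off the evidence text: its last word, minus the period.
        return request["evidence"].text.rstrip(".").split()[-1]
    # Integration: keep the external answer only if the evidence supports it.
    y_int, y_ext = request["y_int"], request["y_ext"]
    return y_ext if y_ext in request["evidence"].text else y_int
```

Calling `core_mmrag_answer(toy_model, question, image, items)` makes four model calls per query, which is the source of the inference cost noted under the limitations.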

Key Designs

Joint Visual-Textual Knowledge Integration (Addressing VTKI):
- Function: Simultaneously consider the relevance of both images and text to select the best retrieved item.
- Mechanism: \(I^{tv} = \arg\max_i r^\mathcal{M}(Q, \{V_i, T_i\}_{i=1}^k)\). Each retrieved item's image and text are evaluated together rather than ranked separately.
- Design Motivation: The individual rankings of images and text can be inconsistent (\(I^v \neq I^t\)). Joint ranking leverages cross-modal complementarity.
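A toy example may help show why separate rankings can disagree while a joint ranking settles on a single item. In the formula above the ranking function is tied to the model \(\mathcal{M}\); the product of per-modality scores used below is only an assumed stand-in for that joint judgement, and all scores are invented for illustration.

```python
from typing import Sequence


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def joint_rank(image_scores: Sequence[float], text_scores: Sequence[float]) -> int:
    """Pick one item using both modalities together.

    ASSUMPTION: multiplying the two scores is a simple proxy for the joint
    relevance r^M(Q, {V_i, T_i}); it is not the scoring used in the paper.
    """
    if len(image_scores) != len(text_scores):
        raise ValueError("each item needs one image score and one text score")
    joint = [v * t for v, t in zip(image_scores, text_scores)]
    return argmax(joint)


# Invented relevance scores for k = 3 retrieved items.
image_scores = [0.9, 0.6, 0.2]
text_scores = [0.1, 0.7, 0.8]

i_v = argmax(image_scores)                    # image-only choice: item 0
i_t = argmax(text_scores)                     # text-only choice: item 2
i_tv = joint_rank(image_scores, text_scores)  # joint choice: item 1
```

Here the image-only and text-only rankings pick different items (\(I^v \neq I^t\)), while the joint score prefers the item that is reasonably relevant in both modalities.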
Parametric-Retrieval Knowledge Integration (Addressing PRKI):
- Function: Explicitly compare the internal and external answers, assess their reliability, and generate the final answer.
- Mechanism: \(y^* = \mathcal{M}(Q, y^{int}, y^{ext}, (V_{I^{tv}}, T_{I^{tv}}))\)—the model simultaneously processes the internal reference, the external reference, and the retrieved evidence.
- Design Motivation: Give the model a "second decision" opportunity: after seeing two potentially conflicting answers, it makes the final choice with the retrieved evidence in view.
Inconsistency-Aware Training Paradigm:
- Three training objectives:
  - Knowledge Source Selection: Screen for samples the model answers correctly with only parametric knowledge versus those it answers correctly with only external knowledge, and train the model to distinguish when to trust which source.
  - Multimodal Selection: Train the model to perform joint visual-textual reranking.
  - Unified Answer Generation: Train the entire four-stage pipeline for end-to-end generation.
- Design Motivation: Self-training (inspired by STaR) requires no extra annotations and automatically constructs training data leveraging the model's own "biases."
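The knowledge source selection data could be assembled along the following lines. This is a minimal sketch, not the authors' procedure: the `Sample` fields, the exact-match correctness check, and the choice to drop samples where both or neither source is correct are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sample:
    question: str
    gold: str    # reference answer from the dataset
    y_int: str   # model answer using parametric knowledge only
    y_ext: str   # model answer using retrieved evidence only


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(answer.lower().split())


def source_label(sample: Sample) -> Optional[str]:
    """Return 'internal' or 'external' when exactly one source is correct.

    ASSUMPTION: exact match after normalization decides correctness, and
    samples where both or neither source is right carry no label.
    """
    int_ok = normalize(sample.y_int) == normalize(sample.gold)
    ext_ok = normalize(sample.y_ext) == normalize(sample.gold)
    if int_ok and not ext_ok:
        return "internal"
    if ext_ok and not int_ok:
        return "external"
    return None


def build_selection_set(samples: List[Sample]) -> List[Tuple[Sample, str]]:
    """Keep only the samples with an unambiguous preferred source."""
    labeled = []
    for sample in samples:
        label = source_label(sample)
        if label is not None:
            labeled.append((sample, label))
    return labeled
```

Because the labels come from comparing the model's own two answers against the existing reference answers, no additional annotation is needed, which matches the self-training motivation above.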

Key Experimental Results

Main Results

Method	InfoSeek	Encyclopedic-VQA
Qwen2-VL-7B (baseline)	-	-
Wiki-LLaVA	Low	Low
RoRA-VLM	Mid	Mid
CoRe-MMRAG	+5.6%	+9.3%

Ablation Study

Configuration	InfoSeek
Full CoRe-MMRAG	Highest
w/o Step 1 (w/o parametric knowledge)	Decrease of ~3%
w/o Joint Ranking (text-only ranking)	Decrease of ~2%
w/o Step 4 (no integration, direct external)	Decrease of ~4%
w/o Training Paradigm (zero-shot)	Decrease of ~5%

Key Findings

Parametric knowledge reference (Step 1) is crucial to the final answer quality: Removing it leads to a decrease of roughly 3% on InfoSeek, as the model loses its "comparison anchor."
Joint visual-textual ranking outperforms single-modality ranking: Textual ranking biases are corrected by visual information.
The training paradigm yields the largest contribution (~5%): This indicates that the model needs to learn how to judge the reliability of knowledge sources.

Highlights & Insights

The formalization of PRKI and VTKI is a significant contribution: It deconstructs the vague "noise" issue in multimodal RAG into two distinct types of inconsistencies, providing a clear problem formulation for future research.
The four-stage "internal first, external second, then integrate" design shares similarities with Astute RAG: It validates the generalizability of this knowledge reconciliation paradigm in the multimodal domain.
The self-training paradigm constructs training data in a clever way: It automatically generates labels based on performance discrepancies with versus without retrieval, bypassing the need for manual annotation.

Limitations & Future Work

Tested only on KB-VQA: Broader multimodal tasks are not yet covered.
Base model limited to Qwen2-VL-7B: It remains uncertain whether the gains are consistent across larger or stronger models.
High end-to-end inference cost: The four-stage pipeline requires multiple generation passes by the model, increasing inference latency.

Comparison with Related Work

vs Astute RAG: Astute RAG solves PRKI in textual RAG, while CoRe-MMRAG extends to multimodal scenarios and introduces VTKI resolution.
vs Wiki-LLaVA: Wiki-LLaVA simply injects retrieved text into the prompt without handling inconsistency.
vs EchoSight: EchoSight uses visual retrieval but relies only on text reranking, whereas CoRe-MMRAG performs joint ranking.

Rating

Novelty: ⭐⭐⭐⭐ Clear formalization of VTKI+PRKI, with a well-designed four-stage reconciliation mechanism.
Experimental Thoroughness: ⭐⭐⭐⭐ Two benchmarks, ablation studies, and comparisons between zero-shot and fine-tuned settings.
Writing Quality: ⭐⭐⭐⭐ Clear formal definitions and intuitive architecture diagrams.
Value: ⭐⭐⭐⭐ Provides an explicit solution to the knowledge inconsistency issue in multimodal RAG.