M3KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
Conference: CVPR 2026 · arXiv: 2512.20136 · Code: Project Page · Area: Graph Learning · Keywords: Multimodal Knowledge Graph, Retrieval-Augmented Generation, Audio-Visual Reasoning, Graph Pruning, Multi-hop Reasoning
TL;DR
This paper proposes M3KG-RAG, which constructs a multi-hop multimodal knowledge graph (M3KG) via a lightweight multi-agent pipeline and introduces the GRASP mechanism for entity grounding and selective pruning. By retaining only query-relevant and answer-useful knowledge, the approach substantially improves audio-visual reasoning capabilities of MLLMs.
Background & Motivation
Existing multimodal RAG approaches suffer from two key bottlenecks: (1) current MMKGs primarily cover image-text modalities with limited audio-visual coverage, and most are single-hop graphs lacking multi-hop connections that capture temporal or causal dependencies; (2) similarity-based retrieval over shared embedding spaces suffers from modality gaps, failing to filter off-topic or redundant knowledge and potentially injecting noise even when relevant context is retrieved.
The core innovations of M3KG-RAG are: (1) a multi-hop knowledge graph spanning audio and visual modalities; (2) modality-wise retrieval that bypasses the modality gap; and (3) GRASP for precise retention of answer-useful subgraphs.
Method
Overall Architecture
Raw multimodal corpus → Three-stage agent pipeline to construct M3KG (context-enhanced triple extraction → knowledge anchoring → context-aware description refinement with self-reflection loop) → Modality-wise retrieval of candidate subgraphs → GRASP grounding + pruning → Graph-augmented MLLM generation.
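A minimal sketch of this query-time flow, in Python; the `Triple` fields, the `query` object, and the retriever/GRASP/MLLM callables are illustrative assumptions rather than the authors' released code:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Triple:
    head: str
    relation: str
    tail: str
    modality: str          # "visual" or "audio"
    description: str = ""  # knowledge-anchored entity description

def answer(query, m3kg, *, retrieve_visual: Callable, retrieve_audio: Callable,
           grasp: Callable, mllm: Callable, top_k: int = 10) -> str:
    # 1. Modality-wise retrieval: match the query against knowledge items of the same
    #    modality (InternVL2 for video, CLAP for audio in the paper), then lift the
    #    matched items to candidate triples via graph links.
    if query.has_video:
        candidates: List[Triple] = retrieve_visual(query, m3kg, top_k=top_k)
    else:
        candidates = retrieve_audio(query, m3kg, top_k=top_k)

    # 2. GRASP: keep only triples that are grounded in the query media and answer-useful.
    kept = grasp(candidates, query)

    # 3. Graph-augmented generation: serialize the surviving subgraph into the prompt.
    context = "\n".join(f"({t.head}, {t.relation}, {t.tail}): {t.description}" for t in kept)
    return mllm(media=query.media, prompt=f"Knowledge:\n{context}\n\nQuestion: {query.text}")
```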
Key Designs
- Multi-Agent M3KG Construction Pipeline:
  - Function: Constructs a multi-hop, cross-modal knowledge graph from raw multimodal corpora.
  - Mechanism: Rewriter enhances captions → Extractor extracts triples → Normalizer standardizes entities → Searcher queries knowledge bases for descriptions → Selector chooses context-relevant descriptions → Refiner adapts them to the original expressions → Inspector performs self-reflection to ensure quality.
  - Design Motivation: The pipeline requires only lightweight LLMs such as Qwen3-8B, and the self-reflection loop prevents hallucinated descriptions.
- GRASP (Grounded Retrieval And Selective Pruning):
  - Function: Ensures retrieved knowledge is both query-relevant and answer-useful.
  - Mechanism: Visual grounding (GroundingDINO detects entity presence in video frames → mask-IoU threshold filtering) + audio grounding (TAG model evaluates triple-query audio alignment) + lightweight-LLM binary-mask pruning of uninformative triples (sketched after this list).
  - Design Motivation: Similarity-based retrieval only captures broad semantics; GRASP provides fine-grained filtering through grounding and pruning.
- Modality-Wise Retrieval:
  - Function: Bypasses the modality gap in cross-modal embedding spaces.
  - Mechanism: Video queries use InternVL2 to match visual items; audio queries use CLAP to match audio items; matched items are then lifted to the triple level via graph links.
  - Design Motivation: Matching video queries against text-based knowledge bases in a shared embedding space frequently fails.
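A minimal sketch of GRASP's two stages, reusing the `Triple` dataclass from the sketch above; the grounding callables, the threshold name `eta`, and the 0/1-mask prompt are assumptions standing in for the paper's GroundingDINO/TAG calls and its actual prompts:

```python
from typing import Callable, List

def grasp_filter(triples: List[Triple], query, *,
                 ground_visual: Callable[[str, object], float],
                 ground_audio: Callable[["Triple", object], float],
                 llm: Callable[[str], str], eta: float = 0.5) -> List[Triple]:
    # Stage 1: grounding. Keep a triple only if its entity is actually present in the
    # query media: a detector confidence for visual triples (GroundingDINO-style, with
    # the mask-IoU filtering folded into ground_visual here) and an audio-text
    # alignment score (TAG-style) for audio triples.
    grounded = [t for t in triples
                if (ground_visual(t.head, query.frames) if t.modality == "visual"
                    else ground_audio(t, query.audio)) >= eta]

    # Stage 2: selective pruning. Ask a lightweight LLM for a binary keep/drop mask over
    # the grounded triples, so only answer-useful knowledge reaches the MLLM prompt.
    listing = "\n".join(f"{i}: ({t.head}, {t.relation}, {t.tail})"
                        for i, t in enumerate(grounded))
    mask = llm(f"Question: {query.text}\nTriples:\n{listing}\n"
               "Return one digit per triple, 1 if it helps answer the question, else 0.")
    keep = [c == "1" for c in mask.strip()]
    return [t for t, k in zip(grounded, keep) if k]
```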
Loss & Training
No model training is involved; the approach is a pure pipeline solution. M3KG construction is performed on the training splits of evaluation benchmarks using a single H100 GPU.
Key Experimental Results
Main Results (Model-as-Judge Scoring)
| MLLM | Method | Audio QA | Video QA | AV QA |
|---|---|---|---|---|
| Qwen2.5-Omni | None | 49.00 | 42.21 | 32.42 |
| Qwen2.5-Omni | VAT-KG | 51.30 | 43.50 | 35.44 |
| Qwen2.5-Omni | M3KG-RAG | 60.77 | 44.35 | 44.67 |
Win-Rate Comparison (vs. VAT-KG)
| Benchmark | VAT-KG Win Rate | M3KG-RAG Win Rate |
|---|---|---|
| AudioCaps-QA | 25.6% | 74.4% |
| VCGPT | 47.6% | 52.4% |
| VALOR | 41.8% | 58.2% |
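For context, a hedged sketch of the two evaluation protocols used above; the judge prompts, the 0-100 scale, and the A/B comparison format are illustrative assumptions, not the paper's exact judging setup:

```python
from typing import Callable, List, Tuple

def judge_score(judge: Callable[[str], str], question: str, reference: str, answer: str) -> float:
    # Model-as-Judge: an external LLM rates a single answer against the reference.
    prompt = (f"Question: {question}\nReference: {reference}\nAnswer: {answer}\n"
              "Rate the answer's correctness from 0 to 100. Reply with the number only.")
    return float(judge(prompt).strip())

def win_rate(judge: Callable[[str], str], items: List[Tuple[str, str, str, str]]) -> float:
    # Pairwise win rate: the judge picks the better of two answers (A = ours, B = baseline).
    wins = 0
    for question, reference, ours, baseline in items:
        verdict = judge(f"Question: {question}\nReference: {reference}\n"
                        f"Answer A: {ours}\nAnswer B: {baseline}\nWhich is better? Reply A or B.")
        wins += verdict.strip().upper().startswith("A")
    return wins / len(items)
```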
Key Findings
- Text KGs with simple RAG frequently degrade performance (Wikidata performs worse than no retrieval in multiple settings).
- Single-hop MMKGs (VAT-KG) yield limited improvements; the multi-hop structure is critical.
- Even GPT-4o benefits from M3KG-RAG, demonstrating that external knowledge remains valuable for large-scale models.
- Each component of GRASP (grounding + pruning) independently contributes to performance gains.
Highlights & Insights
- An end-to-end multimodal knowledge graph construction and retrieval framework covering audio, visual, and textual modalities.
- The two-stage "grounding → pruning" design of GRASP is both intuitively clean and effective.
- High-quality knowledge graphs can be constructed using only the lightweight Qwen3-8B, keeping computational cost manageable.
Limitations & Future Work
- The modality-wise retrieval threshold \(\tau\) and GRASP threshold \(\eta\) require manual tuning per dataset.
- Knowledge graph construction depends on training sets; generalization to new domains requires rebuilding the graph.
- Grounding models used in GRASP (GroundingDINO/TAG) may themselves introduce errors.
- Evaluation is limited to open-ended QA and does not cover other multimodal tasks.
Related Work & Insights
- vs. VAT-KG: Single-hop concept graph with simple retrieval; M3KG-RAG uses multi-hop graphs with precise GRASP filtering.
- vs. GraphRAG/LightRAG: Text-only graph RAG; M3KG-RAG extends to audio-visual multimodal settings.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of multi-hop multimodal knowledge graphs and GRASP is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three benchmarks, multiple MLLMs, and dual evaluation via win-rate and Model-as-Judge.
- Writing Quality: ⭐⭐⭐⭐ Architecture diagrams are clear and pipeline steps are described in detail.
- Value: ⭐⭐⭐⭐ Provides a practical knowledge graph-enhanced solution for multimodal RAG.