MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains
Conference: ICLR 2026 · arXiv: 2603.00873 · Code: https://mc-search-project.github.io · Area: LLM Agent · Keywords: Multimodal RAG, Agentic Search, Multi-hop Reasoning, Process-level Evaluation, Retrieval-Augmented Reasoning
TL;DR
This paper proposes MC-Search, the first benchmark for agentic multimodal RAG, comprising 3,333 high-quality samples (averaging 3.7 hops) across 5 reasoning topology types. The benchmark employs HAVE verification to ensure the necessity of each reasoning step, and introduces the Search-Align process-supervised fine-tuning framework, which substantially improves retrieval planning in open-source models (Qwen2.5-VL-7B F1 +13.7).
Background & Motivation
Background: Multimodal large language models (MLLMs) are evolving from fixed retrieve-then-generate paradigms toward more complex agentic multimodal retrieval-augmented generation (MM-RAG), requiring models to iteratively decompose queries, adaptively retrieve across modalities, and integrate multimodal evidence.
Limitations of Prior Work: Existing MM-RAG benchmarks exhibit three critical limitations: (a) most adopt simple QA formats that compress multimodal evidence into text-only pipelines (e.g., MRAG); (b) evaluation is restricted to shallow 1–2-hop retrieval without long reasoning chains (e.g., Dyn-VQA); (c) step-level annotations and explicit reasoning topologies are absent, precluding analysis of modality roles during reasoning.
Key Challenge: Real-world queries are typically ambiguous and complex, demanding multi-step, cross-modal, knowledge-intensive reasoning. Yet no suitable benchmark exists to evaluate whether MLLMs can perform long-chain, structured multimodal search reasoning.
Goal: (a) Construct the first multimodal agentic RAG benchmark supporting long reasoning chains (≥4 hops); (b) provide step-level annotations and diverse reasoning topologies; (c) design process-level evaluation metrics; (d) leverage verified reasoning chains to improve open-source models.
Key Insight: The authors build multimodal knowledge clusters from Wikipedia, define 5 representative reasoning topology structures (serial/parallel, image-initiated/text-initiated/multi-image fork, etc.), and apply HAVE filtering to ensure each reasoning step is both necessary and non-redundant.
Core Idea: Long-chain multi-hop reasoning + 5 reasoning topologies + HAVE verification + process-level metrics + Search-Align fine-tuning = comprehensive evaluation and improvement of agentic MM-RAG.
Method
Overall Architecture
MC-Search comprises two major components: (1) Benchmark Construction—a multimodal knowledge base is built from Wikipedia to generate multi-hop QA pairs covering 5 reasoning topologies, which are then filtered via HAVE and quality validation to yield 3,333 high-quality samples; (2) Evaluation and Training—a unified agentic MM-RAG pipeline and process-level metrics are designed for fair evaluation, and Search-Align is used to fine-tune open-source models on verified reasoning chains.
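Each benchmark sample is a reasoning graph of hop tuples \((q_t, m_t, r_t, a_t)\) as formalized below. A minimal sketch of how such a sample might be represented in Python follows; the class and field names, modality strings, and the example content are illustrative assumptions, not the paper's actual data schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Hop:
    """One reasoning step (q_t, m_t, r_t, a_t): sub-question, retrieval
    modality, retrieved evidence, and intermediate answer."""
    sub_question: str
    modality: str          # assumed values: "text", "image", "image-to-image"
    evidence: str
    answer: str

@dataclass
class ReasoningChain:
    """A full sample G(Q, A): question, gold answer, topology label, hops."""
    question: str
    answer: str
    topology: str          # e.g. "image-initiated-chain" (label is assumed)
    hops: List[Hop]

# Hypothetical 2-hop, image-initiated sample.
chain = ReasoningChain(
    question="Which river flows past the castle shown in the photo?",
    answer="the Danube",
    topology="image-initiated-chain",
    hops=[
        Hop("Which castle is shown in the image?", "image",
            "Image page: Bratislava Castle", "Bratislava Castle"),
        Hop("Which river flows past Bratislava Castle?", "text",
            "Bratislava Castle overlooks the Danube.", "the Danube"),
    ],
)
print(len(chain.hops))  # → 2
```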
Key Designs
- 5 Search-Augmented Reasoning Topologies:
- Function: Defines 5 representative multi-hop reasoning graph structures. Each reasoning chain is formalized as \(\mathcal{G}(Q,A) = \{(q_t, m_t, r_t, a_t)\}_{t=1}^{T}\), where \(q_t\) is the sub-question, \(m_t\) is the retrieval modality, \(r_t\) is the evidence, and \(a_t\) is the intermediate answer.
- 5 Structures: (i) Image-Initiated Chain (image retrieval followed by text retrieval); (ii) Text-Initiated Chain (text retrieval followed by image verification); (iii) Parallel Image-Text Fork (simultaneous image and text retrieval without cross-step dependencies); (iv) Multi-Images Fork (multi-image visual comparison with textual support); (v) Text-Only Chain (pure-text baseline).
- Design Motivation: To capture serial/parallel reasoning patterns and diverse modality combinations found in real-world scenarios, enabling more comprehensive evaluation.
- HAVE (Hop-wise Attribution and Verification of Evidence):
- Function: Filters hallucinated and redundant steps from reasoning chains.
- Mechanism: For each step, a contextual utility score is computed as \(\text{Util}(t) = \text{F1}(\mathcal{C}) - \text{F1}(\mathcal{C} \setminus r_t)\), measuring the drop in answer accuracy upon removing that step's evidence. A navigational role is also assessed: \(\text{Nav}(t)=1\) if the intermediate answer entity of that step appears in downstream sub-questions. Steps with utility below a threshold and \(\text{Nav}=0\) are deemed redundant.
- Design Motivation: LLM-generated long reasoning chains frequently contain fabricated steps (plausible but unsupported by evidence) or superfluous steps (non-contributing to the final answer). HAVE's dual verification (direct utility + navigational role) ensures every retained step is indispensable.
- Process-Level Evaluation Metrics:
- Function: Goes beyond answer accuracy to assess reasoning process quality.
- Mechanism: (i) Hit per Step (HPS)—the proportion of gold reasoning steps successfully covered by the predicted graph; (ii) Rollout Deviation (RD)—the absolute step-count difference between predicted and gold chains, \(\text{RD} = \left|\,|\hat{\mathcal{G}}| - |\mathcal{G}|\,\right|\), reflecting over- or under-retrieval; (iii) LLM-as-a-Judge (LJ)—scoring along four dimensions: answer accuracy, reasoning coherence, entity coverage, and step alignment.
- Design Motivation: Evaluating only the final answer cannot diagnose failures in retrieval planning or modality selection.
- Agentic MM-RAG Pipeline:
- Function: A unified iterative search-reasoning pipeline enabling fair evaluation across models.
- Mechanism: Each iteration consists of: (a) generating a sub-query and retrieval action (text search / image search / image-to-image search); (b) retrieving the top-1 evidence from the multimodal knowledge base; (c) generating a sub-answer and determining whether to continue searching. Modalities and evidence are logged throughout for chain-level evaluation.
- Design Motivation: Existing work employs heterogeneous pipelines, precluding fair comparison.
- Search-Align Process-Supervised Fine-Tuning:
- Function: Fine-tunes open-source MLLMs using HAVE-verified reasoning chains via SFT.
- Mechanism: Reasoning graphs are converted into dialogue format (assistant generates sub-questions and reasoning; user executes retrieval and returns results). Gemini-2.5-Flash is used to generate reasoning thoughts for each step connecting adjacent hops. Supervised fine-tuning is then performed on these dialogue-style traces.
- Design Motivation: Conventional SFT supervises only the final answer, whereas Search-Align provides step-level supervision signals, teaching the model how to plan retrieval, select modalities, and integrate evidence across hops.
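The HAVE redundancy rule and the process-level metrics above reduce to a few lines. The following is a minimal sketch, assuming a utility threshold `tau`, precomputed per-step utility and navigational-role values, and string-matching step coverage; none of these implementation choices are specified by the paper:

```python
from typing import List, Set

def have_filter(utils: List[float], navs: List[int], tau: float = 0.05) -> List[bool]:
    """HAVE rule: keep step t unless its contextual utility
    Util(t) = F1(C) - F1(C \\ r_t) is below tau AND Nav(t) == 0.
    The threshold value 0.05 is a hypothetical choice."""
    return [u >= tau or n == 1 for u, n in zip(utils, navs)]

def hit_per_step(gold_steps: List[str], pred_steps: Set[str]) -> float:
    """HPS: fraction of gold reasoning steps covered by the predicted graph."""
    return sum(1 for s in gold_steps if s in pred_steps) / len(gold_steps)

def rollout_deviation(n_pred: int, n_gold: int) -> int:
    """RD = | |G_hat| - |G| |: absolute step-count gap,
    signalling over- or under-retrieval."""
    return abs(n_pred - n_gold)

# Hypothetical numbers: step 2 has near-zero utility and no navigational
# role, so HAVE marks it redundant.
keep = have_filter(utils=[0.31, 0.01, 0.18], navs=[0, 0, 1])
print(keep)                                             # → [True, False, True]
print(hit_per_step(["a", "b", "c"], {"a", "c", "d"}))   # ≈ 0.667
print(rollout_deviation(5, 3))                          # → 2
```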
Loss & Training
Search-Align applies standard next-token prediction loss over dialogue-style reasoning traces. Training data consists of 3,333 HAVE-verified reasoning chains.
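The graph-to-dialogue conversion that produces these traces can be sketched as follows. The message layout, the `<search>`/`<result>`/`<answer>` tags, and the per-hop `thought` field are assumptions about the format, not the paper's exact template; next-token loss would then be applied to the assistant turns:

```python
def chain_to_dialogue(question, hops, final_answer):
    """Convert a HAVE-verified reasoning chain into dialogue-format SFT data:
    the assistant emits a thought plus a sub-query search action, and the
    user turn carries the retrieval result. Tag names are hypothetical."""
    messages = [{"role": "user", "content": question}]
    for hop in hops:
        messages.append({
            "role": "assistant",
            "content": f"{hop['thought']}\n<search modality={hop['modality']}>"
                       f"{hop['sub_question']}</search>",
        })
        messages.append({"role": "user",
                         "content": f"<result>{hop['evidence']}</result>"})
    messages.append({"role": "assistant",
                     "content": f"<answer>{final_answer}</answer>"})
    return messages

# Hypothetical single-hop chain.
msgs = chain_to_dialogue(
    "Which river flows past the castle in the photo?",
    [{"thought": "First identify the castle.", "modality": "image",
      "sub_question": "Which castle is shown?",
      "evidence": "Bratislava Castle"}],
    "the Danube",
)
print(len(msgs))  # 1 question + 2 turns per hop + 1 final answer → 4
```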
Key Experimental Results
Main Results (Image-Initiated Chain topology)
| Model | F1(↑) | ΔF1(↑) | LJ(↑) | HPS(↑) | RD(↓) | Golden F1 |
|---|---|---|---|---|---|---|
| GPT-4o-Mini | 36.49 | 34.18 | 2.63 | 27.51 | 1.46 | 68.29 |
| Gemini-2.5-Flash | 44.10 | 37.38 | 3.01 | 31.46 | 2.91 | 72.39 |
| Gemini-2.5-Pro | 47.61 | 42.76 | 3.18 | 25.90 | 1.05 | 69.83 |
| Claude-3.7-Sonnet | 37.80 | 33.09 | 2.60 | 27.31 | 1.18 | 72.62 |
| InternVL3.5-8B | 39.11 | 29.49 | 2.27 | 22.59 | 1.58 | - |
| + Search-Align | 42.27 | 32.65 | 2.53 | 32.49 | 0.94 | 63.86 |
| Qwen2.5-VL-7B | 26.30 | 8.65 | 1.34 | 16.51 | 4.04 | - |
| + Search-Align | 45.70 | 28.05 | 2.23 | 33.59 | 0.70 | 60.95 |
Ablation Study (Modality Coverage Analysis)
| Query Type | Modality | Gemini-2.5-Pro Coverage | InternVL-3.5-8B Coverage |
|---|---|---|---|
| With-image queries | Image | 87.35% | 63.84% |
| With-image queries | Text | 78.61% | 82.67% |
| Without-image queries | Image | 29.50% | 0.66% |
| Without-image queries | Text | 83.55% | 89.78% |
Key Findings
- Search-Align yields substantial gains: Qwen2.5-VL-7B achieves an average F1 improvement of +13.7, HPS improvement of +16.0, and RD reduction of 3.1 after fine-tuning, nearly matching Gemini-2.5-Pro.
- Parallel Image-Text Fork is the hardest topology: All models achieve the lowest F1 and HPS on this topology, as it requires simultaneously covering both text and image branches.
- Severe modality bias: When queries contain no explicit image cues, InternVL's image retrieval coverage drops sharply from 63.84% to 0.66%, indicating a strong default preference for text retrieval.
- Performance degrades with chain length: All models exhibit sharp performance drops on 4–5-hop reasoning chains, primarily due to compounding retrieval errors and unstable planning.
- Moderate over-retrieval is beneficial: Retrieving 1–2 extra steps (ΔStep = 1–2) generally improves accuracy, but over-retrieval of ≥4 steps introduces noise and causes performance to collapse.
- Retrieval planning is the primary bottleneck: Error analysis shows that Retrieval-Failure (84.7%), Hallucinated Entity (75.8%), and Step-Omission (74.3%) are the most frequent error types.
Highlights & Insights
- The 5-topology design is highly systematic: Rather than arbitrarily composing multi-hop questions, the authors define a complete combinatorial space of serial/parallel × image/text modalities grounded in real MM-RAG requirements, providing a clear analytical framework for future research.
- HAVE filtering is elegantly designed: Necessity is verified by measuring accuracy drop upon step removal, while navigational steps are identified by checking whether intermediate answer entities appear in downstream sub-questions. This dual criterion avoids both under-filtering and over-pruning.
- Process-level metrics fill a critical gap: HPS and RD precisely localize whether a model suffers from under-retrieval or over-retrieval, making them practically useful for debugging agentic RAG systems.
- The modality bias finding is thought-provoking: Near-zero image retrieval in the absence of explicit image cues indicates that models are far from capable of proactively selecting modalities based on the information needs of the query.
Limitations & Future Work
- The knowledge base is derived from Wikipedia, limiting domain coverage (scientific and mathematical domains are not included).
- Data generation relies on Gemini-2.5-Flash, introducing model-specific biases.
- Evaluation covers only 6 MLLMs, excluding stronger reasoning models (e.g., GPT-5 series, Gemini-2.5-Pro with thinking).
- Search-Align employs only SFT; reinforcement learning and preference-optimization approaches (e.g., PPO-style RL or DPO) remain unexplored.
- The top-1 retrieval constraint may be overly strict, as practical systems typically retrieve multiple results.
Related Work & Insights
- vs. MMSearch: MMSearch is limited to single-hop retrieval and focuses on mixed image-text results from search engines. MC-Search targets long-chain multi-hop reasoning with an emphasis on reasoning structure and process evaluation.
- vs. WebQA: WebQA contains at most 2 hops and lacks step-level annotations. MC-Search averages 3.7 hops and provides complete reasoning graph annotations.
- vs. Agentic RAG systems (e.g., ReAct-style): These systems are predominantly designed for text-only scenarios. MC-Search extends agentic RAG to the multimodal setting and, for the first time, systematically evaluates modality planning capabilities.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The first long-chain multimodal agentic RAG benchmark; the combination of 5 reasoning topologies, HAVE verification, and process-level metrics is highly systematic.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers 6 MLLMs with multi-dimensional analysis (chain length, over-retrieval, modality bias, error types), though broader model coverage would strengthen conclusions.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with complete formalization and rich figures and tables; high information density requires careful reading in places.
- Value: ⭐⭐⭐⭐⭐ Provides much-needed evaluation infrastructure and training methodology for the multimodal agentic search community; the effectiveness of Search-Align further validates the training value of the curated data.