MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification¶

Attribute	Content
Title	MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification
Conference	ACL 2025
arXiv	2502.13383
Code	github.com/Aurora-slz/MM-Verify
Area	LLM Reasoning / Multimodal Math
Keywords	Multimodal Verification, Chain-of-Thought, MCTS, Reward Model, Math Reasoning

TL;DR¶

This paper proposes two models, MM-Verifier and MM-Reasoner. By synthesizing long-chain CoT verification data through simulation-based search combined with rejection sampling, and creating multimodal reasoning data via text distillation, the proposed 7B parameter models achieve an accuracy of 65.3% on MathVista, outperforming GPT-4o (63.8%) and human performance (60.3%).

Background & Motivation¶

Inspiration from Test-Time Scaling: In text-only LLMs, combining external slow-thinking and verification mechanisms has been shown to enhance multi-turn reasoning (e.g., DeepSeek-R1, s1).
Lack of Multimodal Verifiers: Self-criticism methods used in text-only domains perform poorly in multimodal models (validated by experiments in Table 6), highlighting an urgent need to develop powerful multimodal verifiers.
Lack of Long-Chain CoT Reasoning Data: While long CoT data exists in text-only domains (e.g., DeepSeek-R1), most mathematical problems in multimodal domains are not in long CoT format.
Two Core Challenges:
How to synthesize high-quality multimodal verification data to train MM-Verifier?
How to efficiently synthesize multimodal long-chain CoT reasoning data to train MM-Reasoner?

Method¶

Stage 1: Simulation-Based Search for Long-Chain CoT MM-Verifier¶

Data Source Collection¶

A total of 59,772 questions across seven categories are selected from MathV360K to serve as the source data pool, including Geometry3K (33.84%), Super-CLEVR (24.17%), TabMWP (22.45%), FigureQA (18.07%), and GEOS (1.48%).

Simulation-Based Search Algorithm¶

The search is inspired by MCTS but avoids traditional MCTS, because multimodal models struggle to generate reliable rewards. It proceeds as follows:

Starting from the root node $q_i$, $k$ child nodes are simulated for each node.
For each child node, the model directly generates an answer based on the current path: $$\text{Simulation Answer} = LLM\left(\bigoplus_{i=1}^{d-1} u_i\right)$$
Repeat the simulation $l$ times, using the accuracy rate as the reward signal.
Finally, collect $n$ leaf-node solution pairs $\langle q_i, p_j^i \rangle$ as positive/negative samples.

This method generates longer CoT answers than direct sampling (validated in Figure 3).
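The search steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `propose_step` and `rollout` are hypothetical callables standing in for the multimodal model, the tree is expanded exhaustively to a fixed depth, and the reward is only recorded on each node. How the paper uses the reward to guide expansion or to pick the $n$ leaves is not covered by this note.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical callables standing in for the multimodal model:
#   propose_step(question, steps) -> next reasoning step (str)
#   rollout(question, steps)      -> final answer completed from the path (str)
ProposeFn = Callable[[str, List[str]], str]
RolloutFn = Callable[[str, List[str]], str]


@dataclass
class Node:
    steps: List[str]      # reasoning steps on the path from the root
    reward: float = 0.0   # share of simulated answers matching the gold label


def simulate_reward(node: Node, question: str, gold: str,
                    rollout: RolloutFn, l: int) -> float:
    """Complete the current path l times; the reward is the accuracy rate."""
    hits = sum(rollout(question, node.steps) == gold for _ in range(l))
    return hits / l


def simulation_search(question: str, gold: str, propose_step: ProposeFn,
                      rollout: RolloutFn, k: int = 2, l: int = 4,
                      max_depth: int = 3) -> List[Node]:
    """Expand k children per node down to max_depth and return the leaves."""
    frontier = [Node(steps=[])]  # root node: the question with no steps yet
    for _depth in range(max_depth):
        next_frontier = []
        for node in frontier:
            for _child in range(k):
                step = propose_step(question, node.steps)
                child = Node(steps=node.steps + [step])
                child.reward = simulate_reward(child, question, gold, rollout, l)
                next_frontier.append(child)
        frontier = next_frontier
    return frontier
```

With `k` children per node and depth `max_depth`, this sketch returns `k ** max_depth` leaves, so a real pipeline would need to prune or sample them.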

Long-Chain CoT Verification Data Synthesis¶

GPT-4o is employed to perform "step-by-step verification" on each $\langle q_i, p_j^i \rangle$ pair, generating verification text $v_i$.
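Data Cleaning: LLaMA-3.2-3B-Instruct is used to extract answers, and each sample is then retained or discarded by comparing the extracted answer with the golden label and the verification conclusion: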
- If the extracted answer matches the golden label and the verification conclusion is "correct" $\rightarrow$ Retain
- If the extracted answer does not match and the verification conclusion is "incorrect" $\rightarrow$ Retain
- Other cases $\rightarrow$ Discard
Qwen2-VL-7B-Instruct is fine-tuned (SFT) using the cleaned data $D_{clean}$ $\rightarrow$ MM-Verifier (Stage 1) is obtained.
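The retention rule used in the data-cleaning step can be written as a small predicate. This is a sketch under assumptions: `normalize` is a hypothetical helper, the verdict is assumed to be already reduced to the strings "correct" or "incorrect", and any other verdict is treated as one of the discarded "other cases".

```python
def normalize(text: str) -> str:
    """Hypothetical normalization: trim whitespace and ignore case."""
    return text.strip().lower()


def keep_sample(extracted_answer: str, gold_answer: str, verdict: str) -> bool:
    """Keep a verification only if its verdict agrees with the gold label."""
    verdict = normalize(verdict)
    if verdict not in ("correct", "incorrect"):
        return False  # unparseable verdicts are discarded
    answer_matches = normalize(extracted_answer) == normalize(gold_answer)
    # Retain when (match and "correct") or (no match and "incorrect").
    return answer_matches == (verdict == "correct")
```

The predicate keeps verifications of both correct and incorrect solutions, so the cleaned set contains positive and negative examples.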

Stage 2: Rejection Sampling for Further Verification Enhancement¶

Leverage the long-chain CoT reasoning capabilities of the Stage 1 Verifier to generate more verification data.
Perform cleaning by comparing with ground-truth answers via string matching.
Cleaned and filtered data is used to continue training the Stage 1 model $\rightarrow$ MM-Verifier (Stage 2) is obtained.

Key Advantage: Reduces API costs (no longer requiring GPT-4o) while further enhancing verification capabilities.
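A minimal sketch of the Stage 2 rejection-sampling loop, assuming a hypothetical `verify` callable that wraps the Stage 1 verifier and returns a verification text together with a "correct"/"incorrect" verdict. The number of samples per item and the record layout are illustrative choices, not values from the paper.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical wrapper around the Stage 1 verifier:
#   verify(question, solution) -> (verification_text, verdict)
VerifyFn = Callable[[str, str], Tuple[str, str]]


def rejection_sample(items: List[Tuple[str, str, bool]], verify: VerifyFn,
                     num_samples: int = 4) -> List[Dict[str, str]]:
    """Keep sampled verifications whose verdict matches the ground truth.

    Each item is (question, solution, solution_is_correct), where the flag
    comes from comparing the solution's answer with the ground-truth answer.
    """
    kept = []
    for question, solution, is_correct in items:
        expected = "correct" if is_correct else "incorrect"
        for _ in range(num_samples):
            text, verdict = verify(question, solution)
            # String matching against the ground-truth label
            if verdict.strip().lower() == expected:
                kept.append({"question": question, "solution": solution,
                             "verification": text})
    return kept
```

Because the filter needs only ground-truth labels, no external API call is involved in this stage.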

MM-Reasoner: Long-Chain CoT Reasoning Data via Text Distillation¶

Mechanism: Leverage the strong reasoning capability of a text-only reasoning model (Qwen-QwQ) by bridging multimodal data through textual descriptions.

Select the MAVIS-GEOMETRY dataset, which contains geometric diagram drawings and corresponding text description instructions.
Input the geometric text description and the original question into the text-only reasoning model QwQ.
Use the output of QwQ as the training target for MM-Reasoner.
After filtering out incorrect reasoning results, utilize the remaining data for SFT on Qwen2-VL-7B-Instruct.

Data Statistics: There are 32,146 training data samples for MM-Reasoner (all sourced from MAVIS-Geo).
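The distillation steps above can be sketched as follows. The names `reason` and `extract_answer` are hypothetical callables (a wrapper around the text-only model and an answer parser), the prompt template and record fields are illustrative, and pairing the image with the original question as the training input is an assumption of this sketch.

```python
from typing import Callable, Dict, List


def build_text_prompt(description: str, question: str) -> str:
    """Replace the diagram with its text description for the text-only model."""
    return f"Diagram description:\n{description}\n\nQuestion:\n{question}"


def distill(samples: List[Dict[str, str]],
            reason: Callable[[str], str],
            extract_answer: Callable[[str], str]) -> List[Dict[str, str]]:
    """Build SFT records from text-only reasoning, dropping wrong answers.

    Each sample is expected to carry the keys
    "image", "description", "question", and "answer".
    """
    records = []
    for sample in samples:
        prompt = build_text_prompt(sample["description"], sample["question"])
        response = reason(prompt)  # long-chain CoT from the text-only model
        if extract_answer(response).strip() != sample["answer"].strip():
            continue  # filter out incorrect reasoning results
        # Assumption: the multimodal model sees the image, not the description.
        records.append({"image": sample["image"],
                        "question": sample["question"],
                        "response": response})
    return records
```

The text description is used only to obtain the reasoning trace, which is why no multimodal tree search is needed for this part of the data.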

Key Experimental Results¶

MM-Verifier Performance: MathCheck Outcome-Judging¶

The 7B MM-Verifier surpasses all large proprietary models (including GPT-4o, Gemini, Claude) and 72B open-source models (Figure 1).

MathVista Results¶

Method	Sample 4 ALL	Sample 8 ALL	Sample 12 ALL
Qwen2-VL + Majority Voting	57.1	61.1	62.9
Qwen2-VL + Qwen2-VL-72B Judge	53.4	56.2	55.7
Qwen2-VL + MM-Verifier(S2)	59.8	62.5	64.1
MM-Reasoner + Majority Voting	59.4	62.2	64.8
MM-Reasoner + MM-Verifier(S2)	61.5	65.3	65.2

Key Findings:
- MM-Verifier(S2) outperforms both Majority Voting and the Qwen2-VL-72B Judge at every sample count.
- MM-Reasoner + MM-Verifier(S2) with Sample 8 reaches 65.3%, outperforming GPT-4o (63.8%) and human performance (60.3%).
- When acting as a judge, Qwen2-VL-72B exhibits a performance decline when verifying the long outputs of MM-Reasoner, indicating that conventional models struggle to verify long-chain CoT.
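To make the two selection strategies in the MathVista comparison concrete, the sketch below contrasts plain majority voting with a verifier-filtered vote over the same sampled answers. The combination rule (vote among accepted answers, fall back to majority voting when none are accepted) is an assumption for illustration; this note does not specify how the paper aggregates verifier outputs.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the sampled answers."""
    return Counter(answers).most_common(1)[0][0]


def verifier_select(answers: List[str], accepted: List[bool]) -> str:
    """Vote only among answers the verifier judged correct.

    accepted[i] is the verifier's verdict for answers[i]. If the verifier
    rejects every sample, fall back to plain majority voting.
    """
    survivors = [a for a, ok in zip(answers, accepted) if ok]
    return majority_vote(survivors if survivors else answers)
```

A verifier helps precisely in cases where the most common sampled answer is wrong but a minority answer passes verification.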

MathVerse Results¶

MM-Verifier + MM-Reasoner achieves 25.7% (ALL), outperforming Math-LLaVA-13B (22.9%) and LLaVA-OneVision (20.7%).

Final Comprehensive Comparison¶

Model	MathVista ALL	MathVerse ALL
Human	60.3	64.9
GPT-4o	63.8	50.8
Qwen2-VL-7B	52.5	20.1
Math-LLaVA-13B	46.6	22.9
Ours (7B)	65.3	25.7

MM-Reasoner Scalability¶

As the training data scale increases from 6,952 to 32,146, the performance continues to steadily improve (Figure 4), validating the scalability of the proposed data synthesis method.

Stage 1 vs Stage 2¶

Stage 2 outperforms Stage 1 in almost all settings, verifying the effectiveness of rejection sampling in further enhancing the verifier.

Highlights & Insights¶

7B Surpasses GPT-4o and Humans: Achieving 65.3% on MathVista with only a 7B model, outperforming GPT-4o (63.8%) and human performance (60.3%), which fully demonstrates the power of the Verifier+Reasoner combination.
Ingenious Data Synthesis Strategy: Simulation search generates long CoT $\rightarrow$ GPT-4o verification $\rightarrow$ rejection sampling self-enhancement, forming a progressive data-quality improvement pipeline.
Text-Multimodal Bridging: The capability of the text-only reasoning model is distilled into the multimodal model through textual descriptions from MAVIS, avoiding expensive multimodal tree search.
Revealing the Limitations of LLMs in Verifying Long Outputs: Qwen2-VL-72B's performance in verifying MM-Reasoner declines as the number of samples increases, indicating that traditional models lack the capability to verify long-chain CoT.
Two-Stage Verifier Design: Stage 1 utilizes an external API to guarantee quality, whereas Stage 2 bootstraps via its own capabilities, progressively lowering costs.

Limitations & Future Work¶

Resource Constraints: MM-Verifier and MM-Reasoner were not scaled to the 72B parameter size.
Limited Scalability Testing: Data-volume scalability was only examined with fewer than 100K samples.
Narrow Reasoning Domain: Primarily focuses on mathematical reasoning; effectiveness has not been verified on other multimodal reasoning tasks (such as scientific reasoning or commonsense reasoning).
Dependency on External Models: Stage 1 requires GPT-4o to generate verification texts, leading to a relatively high data synthesis cost.
Single Data Source for MM-Reasoner: Employs only MAVIS-Geometry data, offering limited domain coverage.

Related Work¶

Multimodal Math Models: UniMath, G-LLaVA, MAVIS, EAGLE, etc.
Reward Models:
- ORM (Outcome Reward Model): Qwen2.5-Math-RM-72B, which only evaluates the final outcome.
- PRM (Process Reward Model): Math-Shepherd, EurusPRM, Qwen2.5-Math-PRM, which evaluate the reasoning process step-by-step.
LLM-as-a-Judge: Various approaches utilizing LLMs as evaluators.
Test-Time Scaling: Scaling compute during inference, such as DeepSeek-R1, s1, and LIMO.

Rating ⭐⭐⭐⭐⭐¶

Strengths: Impressive results (7B outperforming GPT-4o and humans); the data synthesis pipeline is carefully designed and validated with thorough ablation studies at each step; introduces a new paradigm for verifiers in the multimodal domain; open-source code.

Weaknesses: The approach relies on external APIs (GPT-4o) to some extent; the training sources for MM-Reasoner are relatively narrow (only geometry problems); generalization to non-mathematical tasks remains unproven.