Enhancing Outcome Reward-Based RL Training of MLLMs with Self-Consistency Sampling

Conference: NeurIPS 2025 · arXiv: 2511.10648 · Code: GitHub · Area: Multimodal VLM · Keywords: Reinforcement Learning, Self-Consistency Sampling, Reasoning Faithfulness, Multimodal Reasoning, Outcome Reward

TL;DR

To address the problem of "unfaithful reasoning trajectories induced by outcome-reward RL training in multimodal multiple-choice tasks," this paper proposes Self-Consistency Sampling (SCS), which obtains consistency rewards via truncation-resampling and visual perturbation to penalize spurious reasoning. When combined with RLOO, SCS achieves an average improvement of 7.7 percentage points across six benchmarks.

Background & Motivation

  • Background: Outcome-reward RL methods (e.g., GRPO, RLOO, REINFORCE++) are the dominant paradigm for enhancing reasoning capabilities in MLLMs.
  • Limitations of Prior Work: In multiple-choice questions (the primary format of multimodal reasoning benchmarks), a critical yet overlooked issue exists: unfaithful trajectories gaming the reward — the model guesses the correct option following an erroneous reasoning chain, yet receives the same full reward as a genuinely correct reasoning trace.
  • Key Challenge: The paper reveals the severity of this issue through exploratory experiments:

Insufficient gains from multiple-choice format: On Geometry3K, multiple-choice training yields only a 5.6% improvement, 6.4 points lower than the 12.0% gained with open-ended QA.

Truncation-continuation divergence: Continuations resampled from reasoning trajectories truncated at various positions frequently land on different final answer choices despite sharing the same prefix, indicating that unfaithful reasoning trajectories are pervasive.

Qualitative analysis: The model frequently generates incorrect reasoning yet coincidentally selects the correct answer.

  • Goal: Process Reward Models (PRMs) can mitigate this issue but are computationally expensive. SCS aims to identify and down-weight unreliable reasoning trajectories through self-consistency checks, without introducing any additional reward model.

Method

Overall Architecture

SCS models the reasoning process as a tree structure based on three key assumptions:

  1. Uniqueness of correct trajectories: Exactly one correct reasoning trajectory exists in the reasoning tree.
  2. Leaf-node/option alignment: Every trajectory ultimately points to one answer option.
  3. Relationship between correct/incorrect trajectories and options: A correct trajectory necessarily leads to the correct option; an incorrect trajectory may either guess the correct option or select a wrong one.

Under these assumptions, when using only accuracy reward \(r = r_{\text{acc}}\), unfaithful reasoning still has a non-zero probability of receiving full reward:

\[P(\tau^- \mid y^+) = \frac{P(y^+, \tau^- \mid x)}{P(y^+, \tau^- \mid x) + P(y^+, \tau^+ \mid x)}\]
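
As a quick illustration with assumed numbers (not from the paper): suppose the policy produces a faithful trajectory with probability 0.3 and an unfaithful one with probability 0.7, and an unfaithful trajectory guesses the correct option of a 4-way question uniformly at random. Since a correct trajectory always reaches the correct option,

\[P(y^+, \tau^+ \mid x) = 0.3, \quad P(y^+, \tau^- \mid x) = 0.7 \times \tfrac{1}{4} = 0.175, \quad P(\tau^- \mid y^+) = \frac{0.175}{0.175 + 0.3} \approx 0.37,\]

so over a third of the trajectories receiving the full accuracy reward would be unfaithful.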

Key Designs

Consistency Reward

SCS introduces a consistency reward \(r_{\text{con}}\) to penalize inconsistent reasoning patterns. Given a reasoning trajectory \(\tau\), the model resamples \(N\) times to obtain an answer set \(\mathcal{A}\), and the consistency reward is defined as:

\[r_{\text{con}} = \frac{1}{N}(N - |\mathcal{A}|) \cdot c\]

where \(c\) is a scaling coefficient. Intuitively, if the reasoning is correct, repeated sampling should consistently point to the same answer (\(|\mathcal{A}|=1\), maximum reward); if the reasoning is incorrect, answers will diverge (\(|\mathcal{A}|\) increases, reward decreases).

Consistency reward for a correct trajectory: \(r_{\text{con}}^+ = \frac{1}{N}(N-1) \cdot c\) (the maximum). Consistency reward for an incorrect trajectory: \(r_{\text{con}}^- = \frac{1}{N}(N-|\mathcal{A}^-|) \cdot c\); in expectation \(r_{\text{con}}^+ > r_{\text{con}}^-\) since \(\mathbb{E}(|\mathcal{A}^-|) > 1\).

Total reward: \(r = r_{\text{for}} + r_{\text{acc}} + r_{\text{con}}\) (format reward + accuracy reward + consistency reward).
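
A minimal sketch of this reward in code, assuming the N resampled answers are already available as option letters; the value of c and the unit format/accuracy rewards are illustrative (the paper does not fix them here), and the function names are hypothetical:

```python
def consistency_reward(answers: list[str], c: float = 0.5) -> float:
    """r_con = (1/N) * (N - |A|) * c, where A is the set of distinct resampled answers."""
    n = len(answers)
    return (n - len(set(answers))) / n * c

def total_reward(format_ok: bool, answer_correct: bool, answers: list[str], c: float = 0.5) -> float:
    """r = r_for + r_acc + r_con (format + accuracy + consistency)."""
    r_for = 1.0 if format_ok else 0.0          # illustrative unit format reward
    r_acc = 1.0 if answer_correct else 0.0     # illustrative unit accuracy reward
    return r_for + r_acc + consistency_reward(answers, c)

# All four continuations agree -> |A| = 1, maximum consistency reward.
print(total_reward(True, True, ["B", "B", "B", "B"]))  # 1 + 1 + (3/4)*0.5 = 2.375
# Continuations diverge -> |A| = 3, the consistency reward shrinks.
print(total_reward(True, True, ["B", "C", "A", "B"]))  # 1 + 1 + (1/4)*0.5 = 2.125
```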

Truncation-Resampling (TR)

The initial reasoning trajectory \(\tau\) is truncated at ratio \(k\) to yield an incomplete trajectory \(\tau^<\), which serves as a prefix for \(m\) resampling continuations, each generating a new answer \(a_t\). All answers are collected to form \(\mathcal{A}\).

Core Idea: If the reasoning process is faithful, continuations from the same prefix should yield consistent answers; if the reasoning is spurious, continuations will diverge.
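
A compact sketch of the TR step under stated assumptions: `generate` is any callable that continues a text prompt and `extract_choice` pulls the final option letter out of a completion; both are placeholders rather than the paper's actual interfaces:

```python
from typing import Callable

def truncation_resample(question: str,
                        trajectory: str,
                        generate: Callable[[str], str],
                        extract_choice: Callable[[str], str],
                        k: float = 0.8,
                        m: int = 4) -> set[str]:
    """Truncate the trajectory at ratio k, continue it m times, and collect the answer set A."""
    cut = int(len(trajectory) * k)
    prefix = trajectory[:cut]                       # incomplete trajectory tau^<
    answers = set()
    for _ in range(m):
        continuation = generate(question + prefix)  # resample a continuation from the prefix
        answers.add(extract_choice(continuation))   # each continuation yields an answer a_t
    return answers                                  # |A| feeds the consistency reward
```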

Visual Perturbation (VP)

Slight Gaussian noise is added to the input image at each resampling step:

\[\tilde{\mathbf{x}}_i = \mathbf{x} + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_i^2), \quad \sigma_i \sim \mathcal{U}(\sigma_{\min}, \sigma_{\max})\]

The perturbation magnitude \(\sigma_i\) is independently sampled from a uniform distribution at each step, exposing the model to diverse visual variants. This compels the policy to reason based on perturbed visual evidence; correct reasoning should remain robust to small perturbations.
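
A minimal sketch of the perturbation, assuming images are float arrays normalized to [0, 1]; the \(\sigma\) range shown is illustrative, not the paper's setting:

```python
from typing import Optional
import numpy as np

def perturb_image(x: np.ndarray,
                  sigma_min: float = 0.01,
                  sigma_max: float = 0.05,
                  rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """x_tilde = x + eps, eps ~ N(0, sigma^2), with sigma ~ U(sigma_min, sigma_max)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = rng.uniform(sigma_min, sigma_max)      # fresh noise scale per resampling step
    noise = rng.normal(0.0, sigma, size=x.shape)   # element-wise Gaussian perturbation
    return np.clip(x + noise, 0.0, 1.0)            # keep pixel values in the valid range
```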

Loss & Training

SCS is compatible with multiple RL algorithms: RLOO, GRPO, REINFORCE++, and REINFORCE++-baseline. Training configuration (collected into a config sketch after this list):

  • Base model: Qwen2.5-VL-7B-Instruct
  • Training data: M³CoT (7.8k) + Geometry3K (2.1k) + ScienceQA (6.2k, filtered) — only image-containing multiple-choice questions retained
  • Batch size: 128, with 16 trajectories sampled per prompt
  • Learning rate: 1e-6, temperature: 1.0
  • Truncation ratio \(k\)=0.8 (RLOO/REINFORCE++) or 0.4 (GRPO); resampling count \(m\)=4 (RLOO) or 8 (GRPO)
  • 8 × A800 GPUs, approximately 24 hours of training
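
For reference, the reported setup gathered into a single Python dict; the key names are illustrative and may differ from the paper's actual training scripts:

```python
scs_rloo_config = {
    "base_model": "Qwen2.5-VL-7B-Instruct",
    "datasets": {"M3CoT": 7_800, "Geometry3K": 2_100, "ScienceQA": 6_200},  # image MCQs only
    "batch_size": 128,
    "trajectories_per_prompt": 16,
    "learning_rate": 1e-6,
    "temperature": 1.0,
    "truncation_ratio_k": 0.8,   # 0.4 for GRPO
    "resample_count_m": 4,       # 8 for GRPO
    "hardware": "8 x A800, ~24 h",
}
```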

Key Experimental Results

Main Results

Table 1: Effect of SCS with Different RL Algorithms (Qwen2.5-VL-7B-Instruct)

| Method | SCS | Overall | M3CoT | MMMU | SciQA | WeMath | MathVerse | MathVision |
|---|---|---|---|---|---|---|---|---|
| Baseline (pretrained) | - | 54.9 | 65.5 | 45.7 | 73.7 | 62.5 | 57.7 | 24.1 |
| SFT | - | 58.6 | 78.7 | 52.6 | 51.0 | 90.7 | 49.4 | 29.3 |
| RLOO | ✗ | 57.8 | 67.6 | 51.5 | 53.9 | 86.4 | 56.8 | 30.4 |
| RLOO | ✓ | 65.5 (+7.7) | 75.7 | 59.1 | 68.8 | 88.1 | 67.1 | 34.0 |
| GRPO | ✗ | 63.6 | 72.6 | 57.2 | 66.6 | 88.3 | 64.2 | 32.8 |
| GRPO | ✓ | 64.5 (+0.9) | 73.9 | 58.0 | 66.4 | 88.7 | 67.0 | 33.1 |
| REINFORCE++ | ✗ | 60.9 | 66.8 | 54.9 | 64.8 | 84.3 | 60.9 | 33.4 |
| REINFORCE++ | ✓ | 62.9 (+2.0) | 65.7 | 54.6 | 76.1 | 85.4 | 61.6 | 34.0 |

Table 2: Cross-Model Generalization (RLOO + SCS)

| Model | w/o SCS | w/ SCS | Gain |
|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 57.8 | 65.5 | +7.7 |
| Qwen2.5-VL-3B-Instruct | 54.7 | 57.9 | +3.2 |
| InternVL3-8B | 61.7 | 63.3 | +1.6 |

Ablation Study

Component Effectiveness (based on RLOO):

| TR | VP | Overall | Gain |
|---|---|---|---|
| ✗ | ✗ | 57.8 | - |
| ✓ | ✗ | 63.0 | +5.2 |
| ✗ | ✓ | 62.8 | +5.0 |
| ✓ | ✓ | 65.5 | +7.7 |

Each component contributes approximately 5 points individually; their combination recovers the full 7.7-point gain, demonstrating complementarity.

Hyperparameter Sensitivity:

  • Truncation ratio \(k\): performance increases from 0.1 to 0.8, peaks at 0.8, then declines. Too small a ratio provides insufficient consistency signal; too large a ratio leaves inadequate exploration space.
  • Resampling count \(m\): performance increases from 2 to 4, peaks at \(m=4\), then declines. Excessive resampling introduces randomness and increases computational overhead.
  • Performance fluctuation within the tested hyperparameter ranges remains within 4 points, indicating that the method is relatively robust.

Quantitative Analysis of Reasoning Reliability: 100 correctly answered samples were randomly drawn from each benchmark and manually examined for reasoning faithfulness. After SCS training, the proportion of unfaithful reasoning dropped by roughly 15% in relative terms (human evaluation: 25.0% → 21.2%; o3-mini evaluation: 22.0% → 19.0%).

Key Findings

  1. Under vanilla RL, RLOO even underperforms SFT (57.8 vs. 58.6), but with SCS it jumps to 65.5; among the tested algorithms, SCS yields the largest gain for RLOO (+7.7).
  2. The advantage of vanilla RL over SFT is marginal (approximately 2–3 points), indicating that standard outcome rewards are inefficient for multiple-choice tasks.
  3. The consistency reward in SCS effectively functions as a "poor man's process reward," requiring no additional reward model.
  4. The two components contribute nearly equally (TR +5.2 vs. VP +5.0) and prove complementary when combined.

Highlights & Insights

  • The paper precisely identifies a widely overlooked problem: unfaithful reasoning trajectories under the multiple-choice format cause degeneration of outcome-reward RL.
  • SCS is an elegant and lightweight design: it requires no additional reward model, does not alter the RL algorithm's structure, and only appends a consistency reward term.
  • The theoretical derivation is rigorous: the mathematical formulation from tree-structure modeling to consistency reward follows a logically sound progression.
  • The truncation-resampling mechanism is ingenious in that it exploits the "causal coherence" of reasoning trajectories — a faithful reasoning prefix should consistently lead to the correct conclusion.

Limitations & Future Work

  • Validation is limited to the multiple-choice format; effectiveness in open-ended QA settings remains unknown.
  • Training data scale is relatively small (~16k samples); performance at larger scales awaits verification.
  • The mathematical assumption underlying the consistency reward (uniqueness of the correct trajectory) may not hold for open-ended reasoning.
  • Visual perturbation employs simple Gaussian noise; more sophisticated data augmentation strategies (geometric transformations, occlusion) may yield further improvements.
  • SCS produces a smaller gain for GRPO (+0.9), possibly because GRPO's group-relative advantage estimation already partially suppresses unfaithful trajectories.

Related Work & Discussion

  • Self-Consistency (Wang et al., 2022) improves reasoning consistency by taking the majority vote over multiple samples; SCS integrates the consistency idea into RL training rewards.
  • Process Reward Models (PRMs) provide step-level feedback but are computationally expensive; SCS serves as a low-cost alternative.
  • RLOO benefits the most, likely because its leave-one-out baseline already exhibits low variance, and SCS's consistency signal further stabilizes training.
  • The truncation-resampling mechanism is generalizable to other scenarios requiring verification of reasoning faithfulness.

Rating

  • Novelty: ⭐⭐⭐⭐ — The problem is precisely identified, and the consistency reward design demonstrates strong originality.
  • Technical Depth: ⭐⭐⭐⭐ — The theoretical derivation is complete, and the two-component design is well-motivated.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 4 RL algorithms, 3 models, 6 benchmarks, and multiple ablation groups.
  • Value: ⭐⭐⭐⭐ — Plug-and-play, negligible computational overhead, well-suited for multiple-choice reasoning training.