Enhancing Outcome Reward-Based RL Training of MLLMs with Self-Consistency Sampling¶
Conference: NeurIPS 2025 arXiv: 2511.10648 Code: GitHub Area: Multimodal VLM Keywords: Reinforcement Learning, Self-Consistency Sampling, Reasoning Faithfulness, Multimodal Reasoning, Outcome Reward
TL;DR¶
To address the problem of "unfaithful reasoning trajectories induced by outcome-reward RL training in multimodal multiple-choice tasks," this paper proposes Self-Consistency Sampling (SCS), which obtains consistency rewards via truncation-resampling and visual perturbation to penalize spurious reasoning. When combined with RLOO, SCS achieves an average improvement of 7.7 percentage points across six benchmarks.
Background & Motivation¶
- Background: Outcome-reward RL methods (e.g., GRPO, RLOO, REINFORCE++) are the dominant paradigm for enhancing reasoning capabilities in MLLMs.
- Limitations of Prior Work: In multiple-choice questions (the primary format of multimodal reasoning benchmarks), a critical yet overlooked issue exists: unfaithful trajectories gaming the reward — the model guesses the correct option following an erroneous reasoning chain, yet receives the same full reward as a genuinely correct reasoning trace.
- Key Challenge: The paper reveals the severity of this issue through exploratory experiments:
  - Insufficient gains from the multiple-choice format: on Geometry3K, multiple-choice training yields only a 5.6-point improvement, 6.4 points lower than the 12.0-point gain from open-ended QA.
  - Truncation-continuation divergence: continuations resampled from the same truncated reasoning prefix frequently arrive at different final answer choices, indicating that unfaithful reasoning trajectories are pervasive.
  - Qualitative analysis: the model frequently produces an incorrect reasoning chain yet coincidentally selects the correct option.
- Goal: Process Reward Models (PRMs) can mitigate this issue but are computationally expensive. SCS aims to identify and down-weight unreliable reasoning trajectories through self-consistency checks, without introducing any additional reward model.
Method¶
Overall Architecture¶
SCS models the reasoning process as a tree structure based on three key assumptions:
- Uniqueness of correct trajectories: Exactly one correct reasoning trajectory exists in the reasoning tree.
- Leaf-node/option alignment: Every trajectory ultimately points to one answer option.
- Relationship between correct/incorrect trajectories and options: A correct trajectory necessarily leads to the correct option; an incorrect trajectory may either guess the correct option or select a wrong one.
Under these assumptions, when using only the accuracy reward \(r = r_{\text{acc}}\), an unfaithful trajectory still has a non-zero probability of receiving the full reward, because an erroneous reasoning chain can still land on the correct option by chance.
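To make this failure mode concrete, here is a back-of-the-envelope bound under a simplifying assumption (not the paper's exact derivation) that an unfaithful trajectory guesses uniformly among the \(K\) answer options:

```latex
% Expected accuracy reward collected by an unfaithful trajectory that
% guesses uniformly among K answer options (simplifying assumption):
\[
  \mathbb{E}\bigl[r_{\text{acc}} \mid \text{unfaithful}\bigr]
    = \frac{1}{K}\, r_{\text{acc}}^{\max} \;>\; 0 .
\]
% With K = 4 options, a spurious chain collects the full reward a quarter
% of the time, so outcome-only training systematically reinforces it.
```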
Key Designs¶
Consistency Reward¶
SCS introduces a consistency reward \(r_{\text{con}}\) to penalize inconsistent reasoning patterns. Given a reasoning trajectory \(\tau\), the model resamples \(N\) times to obtain an answer set \(\mathcal{A}\), and the consistency reward is defined as:
\[ r_{\text{con}} = \frac{1}{N}\left(N - |\mathcal{A}|\right) \cdot c \]
where \(c\) is a scaling coefficient. Intuitively, if the reasoning is correct, repeated sampling should consistently point to the same answer (\(|\mathcal{A}|=1\), maximum reward); if the reasoning is incorrect, answers will diverge (\(|\mathcal{A}|\) increases, reward decreases).
Consistency reward for a correct trajectory: \(r_{\text{con}}^+ = \frac{1}{N}(N-1) \cdot c\) (maximum). Consistency reward for an incorrect trajectory: \(r_{\text{con}}^- = \frac{1}{N}(N-|\mathcal{A}^-|) \cdot c\), where \(r_{\text{con}}^+ > r_{\text{con}}^-\) since \(\mathbb{E}(|\mathcal{A}^-|) > 1\).
Total reward: \(r = r_{\text{for}} + r_{\text{acc}} + r_{\text{con}}\) (format reward + accuracy reward + consistency reward).
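A minimal sketch of the consistency reward term above (the function name and the value of \(c\) are illustrative, not from the paper):

```python
def consistency_reward(answers, c=0.5):
    """Consistency reward r_con = c * (N - |A|) / N.

    `answers` holds the N final answers produced by resampled continuations;
    |A| is the number of distinct options among them. Fewer distinct answers
    means a more consistent (more trustworthy) trajectory and a higher
    reward. The coefficient c=0.5 is an illustrative placeholder.
    """
    n = len(answers)
    distinct = len(set(answers))
    return c * (n - distinct) / n

# Faithful reasoning: all continuations agree -> maximum reward
print(consistency_reward(["B", "B", "B", "B"]))  # 0.375 = 0.5 * 3/4
# Spurious reasoning: answers diverge -> reward shrinks
print(consistency_reward(["B", "C", "A", "B"]))  # 0.125 = 0.5 * 1/4
```

The total reward then simply adds this term to the format and accuracy rewards, leaving the underlying RL algorithm untouched.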
Truncation-Resampling (TR)¶
The initial reasoning trajectory \(\tau\) is truncated at ratio \(k\) to yield an incomplete trajectory \(\tau^<\), which serves as a prefix for \(m\) resampling continuations, each generating a new answer \(a_t\). All answers are collected to form \(\mathcal{A}\).
Core Idea: If the reasoning process is faithful, continuations from the same prefix should yield consistent answers; if the reasoning is spurious, continuations will diverge.
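The truncation-resampling loop can be sketched as follows, with `continue_fn` standing in for the policy's decoder (all names are hypothetical):

```python
def truncation_resample(trajectory_tokens, continue_fn, k=0.8, m=4):
    """Truncate a trajectory at ratio k and resample m continuations.

    `continue_fn(prefix)` stands in for the policy decoder: it takes the
    truncated prefix and returns the final answer option of one sampled
    continuation. The returned list of answers forms the set A that the
    consistency reward scores.
    """
    cut = int(len(trajectory_tokens) * k)
    prefix = trajectory_tokens[:cut]
    return [continue_fn(prefix) for _ in range(m)]

# Toy stand-in: a "faithful" policy always finishes the prefix the same way.
faithful = lambda prefix: "B"
answers = truncation_resample(list("some reasoning trace"), faithful)
print(answers)  # ['B', 'B', 'B', 'B']
```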
Visual Perturbation (VP)¶
Slight Gaussian noise is added to the input image at each resampling step:
\[ \tilde{v}_i = v + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma_i^2 I) \]
The perturbation magnitude \(\sigma_i\) is independently sampled from a uniform distribution at each step, exposing the model to diverse visual variants. This compels the policy to reason based on perturbed visual evidence; correct reasoning should remain robust to small perturbations.
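A pure-Python sketch of the perturbation step on a `[0, 1]`-normalized image; the `(sigma_low, sigma_high)` range is an illustrative assumption, not the paper's setting:

```python
import random

def perturb_image(image, sigma_low=0.0, sigma_high=0.05, rng=None):
    """Add slight Gaussian noise to a [0,1]-normalized image (nested lists).

    The noise scale sigma is drawn fresh from a uniform distribution on each
    call, so every resampling step sees a different visual variant. Pixel
    values are clipped back to [0, 1] after perturbation.
    """
    rng = rng or random.Random()
    sigma = rng.uniform(sigma_low, sigma_high)
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]

img = [[0.5] * 4 for _ in range(4)]            # toy 4x4 grayscale image
out = perturb_image(img, rng=random.Random(0))  # seeded for reproducibility
```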
Loss & Training¶
SCS is compatible with multiple RL algorithms: RLOO, GRPO, REINFORCE++, and REINFORCE++-baseline. Training configuration:
- Base model: Qwen2.5-VL-7B-Instruct
- Training data: M³CoT (7.8k) + Geometry3K (2.1k) + ScienceQA (6.2k, filtered) — only image-containing multiple-choice questions retained
- Batch size: 128, with 16 trajectories sampled per prompt
- Learning rate: 1e-6, temperature: 1.0
- Truncation ratio \(k\)=0.8 (RLOO/REINFORCE++) or 0.4 (GRPO); resampling count \(m\)=4 (RLOO) or 8 (GRPO)
- 8 × A800 GPUs, approximately 24 hours of training
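For reference, the RLOO training setup above can be collected into one config sketch (key names are hypothetical; values are taken from the list above):

```python
# Hypothetical config dict mirroring the reported RLOO + SCS setup.
scs_rloo_config = {
    "base_model": "Qwen2.5-VL-7B-Instruct",
    "train_data": ["M3CoT", "Geometry3K", "ScienceQA"],  # image MCQ only, ~16k samples
    "batch_size": 128,
    "trajectories_per_prompt": 16,
    "learning_rate": 1e-6,
    "temperature": 1.0,
    "truncation_ratio_k": 0.8,   # GRPO uses 0.4
    "resample_count_m": 4,       # GRPO uses 8
}
```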
Key Experimental Results¶
Main Results¶
Table 1: Effect of SCS with Different RL Algorithms (Qwen2.5-VL-7B-Instruct)
| Method | SCS | Overall | M3CoT | MMMU | SciQA | WeMath | MathVerse | MathVision |
|---|---|---|---|---|---|---|---|---|
| Baseline (pretrained) | - | 54.9 | 65.5 | 45.7 | 73.7 | 62.5 | 57.7 | 24.1 |
| SFT | - | 58.6 | 78.7 | 52.6 | 51.0 | 90.7 | 49.4 | 29.3 |
| RLOO | ✕ | 57.8 | 67.6 | 51.5 | 53.9 | 86.4 | 56.8 | 30.4 |
| RLOO | ✓ | 65.5 (+7.7) | 75.7 | 59.1 | 68.8 | 88.1 | 67.1 | 34.0 |
| GRPO | ✕ | 63.6 | 72.6 | 57.2 | 66.6 | 88.3 | 64.2 | 32.8 |
| GRPO | ✓ | 64.5 (+0.9) | 73.9 | 58.0 | 66.4 | 88.7 | 67.0 | 33.1 |
| REINFORCE++ | ✕ | 60.9 | 66.8 | 54.9 | 64.8 | 84.3 | 60.9 | 33.4 |
| REINFORCE++ | ✓ | 62.9 (+2.0) | 65.7 | 54.6 | 76.1 | 85.4 | 61.6 | 34.0 |
Table 2: Cross-Model Generalization (RLOO + SCS)
| Model | w/o SCS | w/ SCS | Gain |
|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 57.8 | 65.5 | +7.7 |
| Qwen2.5-VL-3B-Instruct | 54.7 | 57.9 | +3.2 |
| InternVL3-8B | 61.7 | 63.3 | +1.6 |
Ablation Study¶
Component Effectiveness (based on RLOO):
| TR | VP | Overall | Gain |
|---|---|---|---|
| ✕ | ✕ | 57.8 | - |
| ✓ | ✕ | 63.0 | +5.2 |
| ✕ | ✓ | 62.8 | +5.0 |
| ✓ | ✓ | 65.5 | +7.7 |
Each component contributes approximately 5 points individually; their combination recovers the full 7.7-point gain, demonstrating complementarity.
Hyperparameter Sensitivity:
- Truncation ratio \(k\): performance increases from 0.1 to 0.8, peaks at 0.8, then declines. Too small a ratio provides insufficient consistency signal; too large a ratio leaves inadequate exploration space.
- Resampling count \(m\): performance increases from 2 to 4, peaks at \(m=4\), then declines. Excessive resampling introduces randomness and increases computational overhead.
- Performance fluctuation within the tested hyperparameter ranges remains within 4 points, indicating that the method is relatively robust.
Quantitative Analysis of Reasoning Reliability: 100 correctly answered samples were randomly drawn from each benchmark and manually examined for reasoning faithfulness. After SCS training, the proportion of unfaithful reasoning decreased by approximately 15% (human evaluation: 25.0 → 21.2; o3-mini evaluation: 22.0 → 19.0).
Key Findings¶
- Vanilla RLOO actually underperforms SFT (57.8 vs. 58.6), but with SCS it jumps to 65.5, the largest gain SCS delivers for any algorithm (+7.7).
- The advantage of vanilla RL over SFT is marginal (approximately 2–3 points), indicating that standard outcome rewards are inefficient for multiple-choice tasks.
- The consistency reward in SCS effectively functions as a "poor man's process reward," requiring no additional reward model.
- The two components contribute nearly symmetrically (TR +5.2 vs. VP +5.0), yet exhibit synergistic effects when combined.
Highlights & Insights¶
- The paper precisely identifies a widely overlooked problem: unfaithful reasoning trajectories under the multiple-choice format cause degeneration of outcome-reward RL.
- SCS is an elegant and lightweight design: it requires no additional reward model, does not alter the RL algorithm's structure, and only appends a consistency reward term.
- The theoretical derivation is rigorous: the mathematical formulation from tree-structure modeling to consistency reward follows a logically sound progression.
- The truncation-resampling mechanism is ingenious in that it exploits the "causal coherence" of reasoning trajectories — a faithful reasoning prefix should consistently lead to the correct conclusion.
Limitations & Future Work¶
- Validation is limited to the multiple-choice format; effectiveness in open-ended QA settings remains unknown.
- Training data scale is relatively small (~16k samples); performance at larger scales awaits verification.
- The mathematical assumption underlying the consistency reward (uniqueness of the correct trajectory) may not hold for open-ended reasoning.
- Visual perturbation employs simple Gaussian noise; more sophisticated data augmentation strategies (geometric transformations, occlusion) may yield further improvements.
- SCS produces a smaller gain for GRPO (+0.9), possibly because GRPO's group-relative advantage estimation already partially suppresses unfaithful trajectories.
Related Work & Insights¶
- Self-Consistency (Wang et al., 2022) improves reasoning consistency by taking the majority vote over multiple samples; SCS integrates the consistency idea into RL training rewards.
- Process Reward Models (PRMs) provide step-level feedback but are computationally expensive; SCS serves as a low-cost alternative.
- RLOO benefits the most, likely because its leave-one-out baseline already exhibits low variance, and SCS's consistency signal further stabilizes training.
- The truncation-resampling mechanism is generalizable to other scenarios requiring verification of reasoning faithfulness.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The problem is precisely identified, and the consistency reward design demonstrates strong originality.
- Technical Depth: ⭐⭐⭐⭐ — The theoretical derivation is complete, and the two-component design is well-motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 4 RL algorithms, 3 models, 6 benchmarks, and multiple ablation groups.
- Value: ⭐⭐⭐⭐ — Plug-and-play, negligible computational overhead, well-suited for multiple-choice reasoning training.