Sherlock: Self-Correcting Reasoning in Vision-Language Models

Conference: NeurIPS 2025 arXiv: 2505.22651 Code: https://dripnowhy.github.io/Sherlock/ Area: Multimodal VLM / Self-Correction / Visual Reasoning Keywords: self-correction, preference learning, trajectory-level, VLM reasoning, self-improvement

TL;DR

The first systematic study of self-correction capabilities in reasoning VLMs: existing reasoning VLMs are found to be nearly incapable of self-correction (<10% exhibit an aha moment). The paper proposes Sherlock, a three-stage training framework (SFT cold-start → offline trajectory-level preference learning → online self-iterative improvement) that surpasses LLaVA-CoT/Mulberry/LlamaV-o1 (which use 100K–260K annotations) using only 20K labeled samples.

Background & Motivation

Reasoning VLMs (e.g., LLaVA-CoT, VL-Rethinker) are capable of long-chain reasoning but suffer from three critical issues: (1) high sensitivity to reasoning errors—a single mistake propagates through the chain and leads to incorrect final answers; (2) reliance on massive labeled datasets (100K–260K samples); and (3) poor generalization beyond training domains. The authors' key insight is that self-correction can simultaneously address all three issues—correcting a partially correct reasoning chain is easier than generating from scratch, correction pairs naturally form preference data, and dependence on external annotations is reduced.

Core Problem

Can existing reasoning VLMs self-correct? If not, how can they be trained to do so? And can self-correction ability in turn improve direct reasoning performance?

Method

Overall Architecture

Three-stage training pipeline:

  1. SFT Cold-Start: Jointly trains reasoning and correction capabilities on 10K labeled samples.
  2. Offline Preference Training: Applies trajectory-level self-correction objectives for preference learning.
  3. Online Iterative Self-Improvement: Iteratively improves the model using self-generated preference data without external annotations.

Key Designs

  1. Empirical Analysis of Self-Correction in Existing VLMs (Section 3):

    • Step-wise: After perturbing one reasoning step in LLaVA-CoT/VL-Rethinker, fewer than 10% of continuations exhibit an aha moment (a reflection signal), and even when such moments appear, only ~50% lead to a correct final answer.
    • Response-wise: Neither self-correction prompts nor external critics (Critic-V, Qwen2.5-VL) effectively improve reasoning; performance may even degrade.
    • Conclusion: Current reasoning VLMs fundamentally lack self-correction capability.
  2. Trajectory-level Self-Correction Objective: Rather than revising an entire response, only the suffix beginning from the erroneous step is modified. Given a preference pair \((Y_w, Y_l)\), the sequence is split at a random truncation point \(i\), and preference learning is applied only to the suffix at positions \(\geq i\). This avoids unnecessary updates to an already-correct prefix and provides a cleaner learning signal.

  3. Visual Perturbation for Preference Data Construction: Low-quality responses are generated by injecting visual noise—after a random truncation point, Gaussian noise is added to the input image before the model continues generation, producing rejected responses with a controlled quality gap. No external verifier is required.

  4. Dynamic \(\beta\) for DPO: The \(\beta\) parameter is adaptively adjusted based on truncation position \(i\) and noise intensity \(\varepsilon\): \(\beta(i,n,\varepsilon) = \frac{1}{4}\left(0.5 + \left(\frac{i}{n}\right)^{0.5} + \frac{\varepsilon}{2}\right)\). As written, \(\beta\) grows with both the relative truncation position \(i/n\) and the noise intensity \(\varepsilon\); a larger \(\beta\) enforces a more conservative update, while a smaller \(\beta\) permits a stronger learning signal.

  5. Online Self-Improvement (Stage 3): Self-correction capability is used to bootstrap further improvement. For each input, three rounds of self-correction produce \(Y_2/Y_3/Y_4\); if the final answers are consistent, \(Y_4\) serves as the preferred response. The rejected response is constructed from \(Y_1\) with visual perturbation. Each round requires only 5K unlabeled questions.
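The trajectory-level objective, visual perturbation, and dynamic \(\beta\) above can be sketched together in a few lines. This is a minimal toy sketch, assuming the quoted \(\beta\) formula and a standard DPO sigmoid loss; `perturb_pixels` and `suffix_dpo_loss` are illustrative stand-ins, not the authors' implementation, and the chosen/rejected sequences are assumed equal-length for simplicity.

```python
import math
import random

def dynamic_beta(i, n, eps):
    """beta(i, n, eps) = (1/4) * (0.5 + sqrt(i/n) + eps/2), as quoted above."""
    return 0.25 * (0.5 + math.sqrt(i / n) + eps / 2.0)

def perturb_pixels(pixels, eps, rng):
    """Add Gaussian noise of intensity eps to a flat list of pixel values
    (stand-in for perturbing the input image before continued generation)."""
    return [p + rng.gauss(0.0, eps) for p in pixels]

def suffix_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, i, eps):
    """DPO loss restricted to token positions >= truncation point i.

    logp_*: per-token log-probs of the chosen (w) / rejected (l) sequence
    under the policy; ref_logp_*: same under the frozen reference model.
    The shared prefix [0, i) contributes nothing to the gradient.
    """
    n = len(logp_w)
    beta = dynamic_beta(i, n, eps)
    ratio_w = sum(a - b for a, b in zip(logp_w[i:], ref_logp_w[i:]))
    ratio_l = sum(a - b for a, b in zip(logp_l[i:], ref_logp_l[i:]))
    margin = beta * (ratio_w - ratio_l)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the policy assigns the chosen suffix higher likelihood than the rejected one, the margin is positive and the loss falls below \(\log 2\); the prefix terms are simply excluded rather than masked after the fact.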

Loss & Training

  • Base model: Llama3.2-Vision-11B-Instruct
  • Only 20K labeled samples are required (two random 10K subsets \(D_A\), \(D_B\) sampled from LLaVA-CoT's 100K dataset)
  • Online stage: 5K unlabeled samples per round, 2 iterative rounds
  • Training cost: 128 GPU·h, less than LLaVA-CoT (160 h) and LlamaV-o1 (288 h)
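The online stage's data construction can be sketched as follows: run three rounds of self-correction and keep a preference pair only when the corrected answers agree. `generate`, `self_correct`, and `final_answer` are hypothetical stand-ins for the model's decoding calls, not the authors' API.

```python
def build_preference_pair(question, generate, self_correct, final_answer):
    """One online-stage example: Y1 -> Y2 -> Y3 -> Y4 via self-correction,
    filtered by self-consistency of the final answers."""
    y1 = generate(question)          # initial response Y1
    ys = [y1]
    for _ in range(3):               # three correction rounds: Y2, Y3, Y4
        ys.append(self_correct(question, ys[-1]))
    y2, y3, y4 = ys[1], ys[2], ys[3]
    # Self-consistency filter: accept only if the corrected answers agree.
    if final_answer(y2) == final_answer(y3) == final_answer(y4):
        chosen = y4
        rejected = y1  # in the paper, Y1 is further degraded via visual perturbation
        return chosen, rejected
    return None  # inconsistent answers: discard this question
```

Note that, as the limitations section points out, consistency of Y2/Y3/Y4 does not guarantee correctness; this filter only suppresses unstable trajectories.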

Key Experimental Results

Method | Labeled Data | Direct Reasoning Avg | After Self-Correction Avg
--- | --- | --- | ---
LLaVA-CoT | 100K | 63.2 | 63.0 (↓0.2)
Mulberry | 260K | 63.9 | 63.8 (↓0.1)
LlamaV-o1 | 175K | 63.4 | 48.2 (↓15.2)
Sherlock Iter2 | 20K | 64.1 | 65.4 (+1.3)

Key observation: Sherlock is the only model whose performance improves after self-correction. The catastrophic collapse of LlamaV-o1 after correction is attributed to a conflict between its multi-round reasoning format and the correction prompt.

Inference-time scaling: Sherlock combined with the MM-Verify verifier improves MathVista from 52.0 to 55.9 (+3.9), requiring only 8.7 GPU·h compared to 40.2 h for Majority Vote.
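The verifier-guided scaling above amounts to best-of-N selection: sample N candidate responses and keep the one the verifier scores highest. A minimal sketch, assuming a generic scoring interface; `sample_response` and `verifier_score` are hypothetical stand-ins, since MM-Verify's actual API is not detailed in the note.

```python
def best_of_n(question, sample_response, verifier_score, n=8):
    """Sample n candidate responses and return the highest-scoring one."""
    candidates = [sample_response(question) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(question, c))
```

Unlike majority voting, this needs no answer parsing or agreement counting, which is one way the reported GPU-hour gap (8.7 vs. 40.2) can arise: the verifier scores each candidate once instead of requiring many samples for a stable vote.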

Ablation Study

  • Self-correction and direct reasoning are not orthogonal—training for self-correction also improves direct reasoning (Finding 1).
  • Trajectory-level objective >> full-response-level objective—applying correction at the full-response level during online iteration leads to degradation (Finding 2).
  • Dynamic \(\beta\) stabilizes training and continuously improves both capabilities (Finding 3).
  • With 20K samples, Sherlock SFT already outperforms LLaVA-CoT trained on the same data volume by +0.8, confirming that self-correction training is a "free lunch".
  • Each round of self-correction yields consistent gains: 64.1 → 64.5 → 65.2 → 65.4 (3 rounds).

Highlights & Insights

  • The empirical finding that existing reasoning VLMs cannot self-correct—quantifying the aha moment rate at below 10% is highly compelling.
  • Trajectory-level correction rather than full-response correction—modifying only the erroneous suffix while preserving the correct prefix is a principled and fine-grained design choice.
  • Visual perturbation for preference data construction—cleverly exploiting image noise to create a controlled quality gap without any external verifier.
  • The closed loop of self-correction → self-iteration: once the model learns to correct, it can generate its own preference data for continued improvement—an elegant self-bootstrapping paradigm.
  • 20K data outperforming 260K—extremely high data efficiency, achieved by fully exploiting each sample.

Limitations & Future Work

  • Validation is limited to Llama3.2V-11B; larger models (e.g., 72B) and other VLM families remain untested.
  • The self-correction gain is modest (+1.3%), with limited room for improvement on cases already near-correct.
  • Visual perturbation relies solely on Gaussian noise; semantic-level perturbations (e.g., occluding key regions) may be more effective.
  • The online stage uses self-consistency for filtering, but consistency does not guarantee correctness.
  • Integration with RL-based methods (e.g., GRPO) is unexplored—Sherlock follows a pure SFT + DPO paradigm.
Comparison with Related Work

  • vs. LLaVA-CoT: LLaVA-CoT relies on SFT alone without self-correction training and exhibits slight performance degradation after correction; Sherlock surpasses it with 1/5 of the data while enabling effective self-correction.
  • vs. Mulberry: Mulberry generates 260K CoT samples via MCTS with step-wise reflection SFT, yet self-correction still fails; Sherlock's trajectory-level preference learning proves more effective.
  • vs. NoisyRollout (2504.13055): NoisyRollout applies visual perturbation to enhance RL exploration, while Sherlock uses it to construct preference data—the same perturbation strategy applied for different purposes.
  • vs. VL-Rethinker: RL-based but also fails at self-correction (<10% aha), demonstrating that RL alone is insufficient.

Relevance to My Research

  • Complementary to NoisyRollout: NoisyRollout adds visual perturbation during the RL stage for exploration, while Sherlock uses it during preference learning for data construction—the two approaches are composable.
  • The finding that "self-correction is a free lunch" is highly instructive—any reasoning VLM should incorporate self-correction training.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The first systematic study of self-correction in reasoning VLMs; trajectory-level correction combined with visual perturbation–based preference data is a novel contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 8 benchmarks, detailed ablations, integration with MM-Verify, and meticulous case studies.
  • Writing Quality: ⭐⭐⭐⭐⭐ — The logical flow from the analysis in Section 3 to the method design in Section 4 is seamless, with four clearly articulated takeaways.
  • Value: ⭐⭐⭐⭐⭐ — Self-correction is a critical capability for VLMs; the data-efficient training paradigm with 20K samples is highly valuable for resource-constrained settings.