Sherlock: Self-Correcting Reasoning in Vision-Language Models

Conference: NeurIPS 2025 arXiv: 2505.22651 Code: https://dripnowhy.github.io/Sherlock/ Area: Multimodal VLM / Self-Correction / Visual Reasoning Keywords: self-correction, preference learning, trajectory-level, VLM reasoning, self-improvement

TL;DR

The first systematic study of self-correction capabilities in reasoning VLMs: existing reasoning VLMs are found to be nearly incapable of self-correction (<10% exhibit an aha moment). The paper proposes Sherlock, a three-stage training framework (SFT cold-start → offline trajectory-level preference learning → online self-iterative improvement) that surpasses LLaVA-CoT/Mulberry/LlamaV-o1 (which use 100K–260K annotations) using only 20K labeled samples.

Background & Motivation

Reasoning VLMs (e.g., LLaVA-CoT, VL-Rethinker) are capable of long-chain reasoning but suffer from three critical issues: (1) high sensitivity to reasoning errors—a single mistake propagates through the chain and leads to incorrect final answers; (2) reliance on massive labeled datasets (100K–260K samples); and (3) poor generalization beyond training domains. The authors' key insight is that self-correction can simultaneously address all three issues—correcting a partially correct reasoning chain is easier than generating from scratch, correction pairs naturally form preference data, and dependence on external annotations is reduced.

Core Problem

Can existing reasoning VLMs self-correct? If not, how can they be trained to do so? And can self-correction ability in turn improve direct reasoning performance?

Method

Overall Architecture

Three-stage training pipeline:

  1. SFT Cold-Start: Jointly trains reasoning and correction capabilities on 10K labeled samples.
  2. Offline Preference Training: Applies trajectory-level self-correction objectives for preference learning.
  3. Online Iterative Self-Improvement: Iteratively improves the model using self-generated preference data without external annotations.

Key Designs

  1. Empirical Analysis of Self-Correction in Existing VLMs (Section 3):

    • Step-wise: After perturbing one reasoning step in LLaVA-CoT/VL-Rethinker, fewer than 10% of continuations exhibit an aha moment (a reflection signal), and even when such moments appear, only ~50% lead to a correct final answer.
    • Response-wise: Neither self-correction prompts nor external critics (Critic-V, Qwen2.5-VL) effectively improve reasoning; performance may even degrade.
    • Conclusion: Current reasoning VLMs fundamentally lack self-correction capability.
  2. Trajectory-level Self-Correction Objective: Rather than revising an entire response, only the suffix beginning from the erroneous step is modified. Given a preference pair \((Y_w, Y_l)\), the sequence is split at a random truncation point \(i\), and preference learning is applied only to the suffix at positions \(\geq i\). This avoids unnecessary updates to an already-correct prefix and provides a cleaner learning signal.

  3. Visual Perturbation for Preference Data Construction: Low-quality responses are generated by injecting visual noise—after a random truncation point, Gaussian noise is added to the input image before the model continues generation, producing rejected responses with a controlled quality gap. No external verifier is required.

  4. Dynamic \(\beta\) for DPO: The \(\beta\) parameter is adaptively adjusted based on truncation position \(i\) and noise intensity \(\varepsilon\): \(\beta(i,n,\varepsilon) = \frac{1}{4}\left(0.5 + \left(\frac{i}{n}\right)^{0.5} + \frac{\varepsilon}{2}\right)\). As written, \(\beta\) grows with both the relative truncation position \(i/n\) and the noise intensity \(\varepsilon\); a larger \(\beta\) enforces a more conservative update, while a smaller \(\beta\) permits a stronger learning signal.

  5. Online Self-Improvement (Stage 3): Self-correction capability is used to bootstrap further improvement. For each input, three rounds of self-correction produce \(Y_2/Y_3/Y_4\); if the final answers are consistent, \(Y_4\) serves as the preferred response. The rejected response is constructed from \(Y_1\) with visual perturbation. Each round requires only 5K unlabeled questions.
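The trajectory-level objective, visual perturbation, and dynamic \(\beta\) above can be sketched together in a few lines. This is a minimal toy sketch, assuming the quoted \(\beta\) formula and a standard DPO sigmoid loss; `perturb_pixels` and `suffix_dpo_loss` are illustrative stand-ins, not the authors' implementation, and the chosen/rejected sequences are assumed equal-length for simplicity.

```python
import math
import random

def dynamic_beta(i, n, eps):
    """beta(i, n, eps) = (1/4) * (0.5 + sqrt(i/n) + eps/2), as quoted above."""
    return 0.25 * (0.5 + math.sqrt(i / n) + eps / 2.0)

def perturb_pixels(pixels, eps, rng):
    """Add Gaussian noise of intensity eps to a flat list of pixel values
    (stand-in for perturbing the input image before continued generation)."""
    return [p + rng.gauss(0.0, eps) for p in pixels]

def suffix_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, i, eps):
    """DPO loss restricted to token positions >= truncation point i.

    logp_*: per-token log-probs of the chosen (w) / rejected (l) sequence
    under the policy; ref_logp_*: same under the frozen reference model.
    The shared prefix [0, i) contributes nothing to the gradient.
    """
    n = len(logp_w)
    beta = dynamic_beta(i, n, eps)
    ratio_w = sum(a - b for a, b in zip(logp_w[i:], ref_logp_w[i:]))
    ratio_l = sum(a - b for a, b in zip(logp_l[i:], ref_logp_l[i:]))
    margin = beta * (ratio_w - ratio_l)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the policy assigns the chosen suffix higher likelihood than the rejected one, the margin is positive and the loss falls below \(\log 2\); the prefix terms are simply excluded rather than masked after the fact.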

Loss & Training

  • Base model: Llama3.2-Vision-11B-Instruct
  • Only 20K labeled samples are required (two random 10K subsets \(D_A\), \(D_B\) sampled from LLaVA-CoT's 100K dataset)
  • Online stage: 5K unlabeled samples per round, 2 iterative rounds
  • Training cost: 128 GPU·h, less than LLaVA-CoT (160 h) and LlamaV-o1 (288 h)
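The online stage's data construction can be sketched as follows: run three rounds of self-correction and keep a preference pair only when the corrected answers agree. `generate`, `self_correct`, and `final_answer` are hypothetical stand-ins for the model's decoding calls, not the authors' API.

```python
def build_preference_pair(question, generate, self_correct, final_answer):
    """One online-stage example: Y1 -> Y2 -> Y3 -> Y4 via self-correction,
    filtered by self-consistency of the final answers."""
    y1 = generate(question)          # initial response Y1
    ys = [y1]
    for _ in range(3):               # three correction rounds: Y2, Y3, Y4
        ys.append(self_correct(question, ys[-1]))
    y2, y3, y4 = ys[1], ys[2], ys[3]
    # Self-consistency filter: accept only if the corrected answers agree.
    if final_answer(y2) == final_answer(y3) == final_answer(y4):
        chosen = y4
        rejected = y1  # in the paper, Y1 is further degraded via visual perturbation
        return chosen, rejected
    return None  # inconsistent answers: discard this question
```

Note that, as the limitations section points out, consistency of Y2/Y3/Y4 does not guarantee correctness; this filter only suppresses unstable trajectories.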

Key Experimental Results

Method | Labeled Data | Direct Reasoning Avg | After Self-Correction Avg
--- | --- | --- | ---
LLaVA-CoT | 100K | 63.2 | 63.0 (↓0.2)
Mulberry | 260K | 63.9 | 63.8 (↓0.1)
LlamaV-o1 | 175K | 63.4 | 48.2 (↓15.2)
Sherlock Iter2 | 20K | 64.1 | 65.4 (+1.3)

Key observation: Sherlock is the only model whose performance improves after self-correction. The catastrophic collapse of LlamaV-o1 after correction is attributed to a conflict between its multi-round reasoning format and the correction prompt.

Inference-time scaling: Sherlock combined with the MM-Verify verifier improves MathVista from 52.0 to 55.9 (+3.9), requiring only 8.7 GPU·h compared to 40.2 h for Majority Vote.
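The verifier-guided scaling above amounts to best-of-N selection: sample N candidate responses and keep the one the verifier scores highest. A minimal sketch, assuming a generic scoring interface; `sample_response` and `verifier_score` are hypothetical stand-ins, since MM-Verify's actual API is not detailed in the note.

```python
def best_of_n(question, sample_response, verifier_score, n=8):
    """Sample n candidate responses and return the highest-scoring one."""
    candidates = [sample_response(question) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(question, c))
```

Unlike majority voting, this needs no answer parsing or agreement counting, which is one way the reported GPU-hour gap (8.7 vs. 40.2) can arise: the verifier scores each candidate once instead of requiring many samples for a stable vote.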

Ablation Study

  • Self-correction and direct reasoning are not orthogonal—training for self-correction also improves direct reasoning (Finding 1).
  • Trajectory-level objective >> full-response-level objective—applying correction at the full-response level during online iteration leads to degradation (Finding 2).
  • Dynamic \(\beta\) stabilizes training and continuously improves both capabilities (Finding 3).
  • With 20K samples, Sherlock SFT already outperforms LLaVA-CoT trained on the same data volume by +0.8, confirming that self-correction training is a "free lunch".
  • Each round of self-correction yields consistent gains: 64.1 → 64.5 → 65.2 → 65.4 (3 rounds).

Highlights & Insights

  • The empirical finding that existing reasoning VLMs cannot self-correct—quantifying the aha moment rate at below 10% is highly compelling.
  • Trajectory-level correction rather than full-response correction—modifying only the erroneous suffix while preserving the correct prefix is a principled and fine-grained design choice.
  • Visual perturbation for preference data construction—cleverly exploiting image noise to create a controlled quality gap without any external verifier.
  • The closed loop of self-correction → self-iteration: once the model learns to correct, it can generate its own preference data for continued improvement—an elegant self-bootstrapping paradigm.
  • 20K data outperforming 260K—extremely high data efficiency, achieved by fully exploiting each sample.

Limitations & Future Work

  • Validation is limited to Llama3.2V-11B; larger models (e.g., 72B) and other VLM families remain untested.
  • The self-correction gain is modest (+1.3%), with limited room for improvement on cases already near-correct.
  • Visual perturbation relies solely on Gaussian noise; semantic-level perturbations (e.g., occluding key regions) may be more effective.
  • The online stage uses self-consistency for filtering, but consistency does not guarantee correctness.
  • Integration with RL-based methods (e.g., GRPO) is unexplored—Sherlock follows a pure SFT + DPO paradigm.
Comparison with Related Work

  • vs. LLaVA-CoT: LLaVA-CoT relies on SFT alone without self-correction training and exhibits slight performance degradation after correction; Sherlock surpasses it with 1/5 of the data while enabling effective self-correction.
  • vs. Mulberry: Mulberry generates 260K CoT samples via MCTS with step-wise reflection SFT, yet self-correction still fails; Sherlock's trajectory-level preference learning proves more effective.
  • vs. NoisyRollout (2504.13055): NoisyRollout applies visual perturbation to enhance RL exploration, while Sherlock uses it to construct preference data—the same perturbation strategy applied for different purposes.
  • vs. VL-Rethinker: RL-based but also fails at self-correction (<10% aha), demonstrating that RL alone is insufficient.

Relevance to My Research

  • Complementary to NoisyRollout: NoisyRollout adds visual perturbation during the RL stage for exploration, while Sherlock uses it during preference learning for data construction—the two approaches are composable.
  • The finding that "self-correction is a free lunch" is highly instructive—any reasoning VLM should incorporate self-correction training.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The first systematic study of self-correction in reasoning VLMs; trajectory-level correction combined with visual perturbation–based preference data is a novel contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 8 benchmarks, detailed ablations, integration with MM-Verify, and meticulous case studies.
  • Writing Quality: ⭐⭐⭐⭐⭐ — The logical flow from the analysis in Section 3 to the method design in Section 4 is seamless, with four clearly articulated takeaways.
  • Value: ⭐⭐⭐⭐⭐ — Self-correction is a critical capability for VLMs; the data-efficient training paradigm with 20K samples is highly valuable for resource-constrained settings.