Latent Chain-of-Thought for Visual Reasoning

Conference: NeurIPS 2025 | arXiv: 2510.23925 | Code: heliossun/LaCoT | Area: Multimodal VLM / Visual Reasoning | Keywords: visual reasoning, chain-of-thought, amortized variational inference, GFlowNets, inference-time scaling, LVLM

TL;DR

This paper reformulates visual CoT reasoning as a posterior inference problem and proposes LaCoT, a training framework based on amortized variational inference (AVI) comprising reference-guided GFlowNet fine-tuning (RGFN), token-level reward approximation, and Bayesian inference scaling (BiN). On Qwen2.5-VL 3B/7B, LaCoT outperforms GRPO by 10.6% and achieves open-source state-of-the-art across seven visual reasoning benchmarks.

Background & Motivation

Background: Visual chain-of-thought (CoT) reasoning is central to improving the interpretability and reliability of large vision-language models (LVLMs), yet existing training methods exhibit pronounced generalization bottlenecks.

Limitations of Prior Work:

  • SFT: Supervised fine-tuning relies on teacher-forcing log-likelihood, which merely imitates reference reasoning chains without exploration capacity.
  • PPO/GRPO: KL penalties force the policy to remain close to the SFT baseline, constraining the discovery of novel reasoning paths; reward hacking is also common, with models attaining high scores without genuinely solving the problem.
  • Deterministic sampling: Existing methods treat reasoning as a deterministic generation process, failing to capture the diversity and uncertainty of reasoning trajectories.
  • Inference-time scaling cost: Methods such as Best-of-N and Beam Search require additional reward model evaluations, incurring high computational cost and reliance on biased critic models.
  • Token-level reward computation: Multimodal reasoning chains typically span ~1k tokens, making exact per-token reward computation prohibitively expensive.

Method

LaCoT models visual reasoning as posterior inference in a latent variable model: given a question–answer pair \((X, Y)\), the objective is to sample latent reasoning chains \(Z\) from the posterior \(P(Z \mid X, Y)\). The overall framework comprises three core contributions.
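To make the formulation concrete, the standard variational identity underlying this setup is sketched below. This is a textbook decomposition written in the notation above, not the paper's exact training objective (which is realized through the GFlowNet SubTB loss rather than a direct ELBO):

```latex
% Posterior over latent reasoning chains and the variational bound it motivates.
% q_\theta(Z \mid X) is the amortized sampler trained to match P(Z \mid X, Y).
\begin{aligned}
P(Z \mid X, Y) &= \frac{P(Z, Y \mid X)}{P(Y \mid X)}
               \;\propto\; P(Z \mid X)\, P(Y \mid X, Z), \\
\log P(Y \mid X) &\;\ge\;
  \mathbb{E}_{Z \sim q_\theta(Z \mid X)}
  \bigl[\log P(Z, Y \mid X) - \log q_\theta(Z \mid X)\bigr].
\end{aligned}
```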

1. Token-Level Marginal Reward Approximation (ISubTB)

  • The exact reward \(R(z_{1:t}) = \log P(X, z_{1:t}, Y)\) is computed only every \(\lambda = 8\) tokens along the reasoning chain.
  • Intermediate steps are approximated via linear interpolation: \(\tilde{R}(z_{1:t+i}) = R(z_{1:t}) + \frac{i}{\lambda}\,\bigl[R(z_{1:t+\lambda}) - R(z_{1:t})\bigr]\) for \(0 < i < \lambda\).
  • Theoretical analysis shows that when \(\lambda\) is sufficiently small, the interpolation error approaches zero and the flow consistency condition holds.
  • The exact reward in the SubTB loss is replaced by this approximation to yield the ISubTB objective (see the sketch after this list).
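A minimal sketch of the interpolation step, assuming a scalar reward callable and illustrative names (`exact_reward`, `lam`); this is not the authors' implementation:

```python
from typing import Callable, List

def interpolated_rewards(
    exact_reward: Callable[[int], float],  # t -> R(z_{1:t}), e.g. log P(X, z_{1:t}, Y)
    seq_len: int,
    lam: int = 8,  # interval between exact reward evaluations (lambda in the paper)
) -> List[float]:
    """Approximate R(z_{1:t}) for t = 1..seq_len, evaluating exactly every `lam` tokens."""
    # Anchor positions where the exact (expensive) reward is computed.
    anchors = list(range(0, seq_len + 1, lam))
    if anchors[-1] != seq_len:
        anchors.append(seq_len)
    exact = {t: exact_reward(t) for t in anchors}

    # Linearly interpolate between consecutive anchors for all other positions.
    rewards: List[float] = []
    for a, b in zip(anchors[:-1], anchors[1:]):
        for t in range(a + 1, b + 1):
            frac = (t - a) / (b - a)
            rewards.append(exact[a] + frac * (exact[b] - exact[a]))
    return rewards
```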

2. Reference-Guided GFlowNet Fine-Tuning (RGFN)

  • \(m\) candidate reasoning chains \(\{Z_1, \ldots, Z_m\}\) are explored from the policy model \(q_\theta(Z \mid X)\).
  • A reference reasoning chain \(Z_\text{ref}\) (generated by GPT-4o or DeepSeek-R1) serves as an anchor.
  • An indicator function \(\mathbb{I}(Z_i)\) filters out candidates whose reward falls below \(\delta_s \cdot R(Z_\text{ref})\), preventing catastrophic forgetting.
  • The annealing coefficient \(\delta_s\) is progressively tightened over training steps: broader exploration is tolerated in the first 50 steps, after which the threshold is gradually raised.
  • Gradients are back-propagated only through samples that clear the reference threshold, eliminating the need for manual KL-penalty tuning or gradient clipping (see the sketch after this list).
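A minimal sketch of the annealed reference filter, assuming rewards on a higher-is-better scale; the schedule parameters `delta_min`, `delta_max`, and `anneal_steps` are assumptions, since only the 50-step warm-up is reported:

```python
from typing import List

def rgfn_keep_indices(
    candidate_rewards: List[float],  # R(Z_i) for the m explored chains
    ref_reward: float,               # R(Z_ref) for the GPT-4o / DeepSeek-R1 anchor
    step: int,
    warmup: int = 50,                # loose filtering during early exploration
    delta_min: float = 0.5,          # assumed initial threshold coefficient
    delta_max: float = 1.0,          # assumed final threshold coefficient
    anneal_steps: int = 1000,        # assumed annealing horizon
) -> List[int]:
    """Indices of candidates kept for the GFlowNet update: R(Z_i) >= delta_s * R(Z_ref)."""
    if step < warmup:
        delta_s = delta_min
    else:
        progress = min(1.0, (step - warmup) / anneal_steps)
        delta_s = delta_min + progress * (delta_max - delta_min)
    return [i for i, r in enumerate(candidate_rewards) if r >= delta_s * ref_reward]
```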

3. Bayesian Inference Scaling (BiN)

  • \(N\) latent reasoning chains \(Z_i \sim q_\theta(Z \mid X)\) are sampled, each yielding a corresponding answer \(Y_i \sim \pi_\Phi(Y \mid X Z_i)\).
  • Each sampled pair is scored by its length-normalized joint likelihood under the reward model, and the scores are aggregated to approximate the marginal: \(P(Y \mid X) \approx \frac{1}{N} \sum_{i=1}^{N} \frac{\pi_\Phi(Z_i, Y_i \mid X)}{|Z_i Y_i|}\), where \(|Z_i Y_i|\) is the total token length of the chain–answer pair.
  • The answer with the highest marginal likelihood is selected as the final output.
  • No external reward model is required; statistically robust answer selection is achieved via Bayesian sampling (see the sketch after this list).
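A minimal sketch of BiN selection, assuming a `joint_loglik` callable backed by the SFT model \(\pi_\Phi\) and token-level lengths; names and data layout are illustrative, not the authors' code:

```python
import math
from collections import defaultdict
from typing import Callable, List, Tuple

def bin_select(
    samples: List[Tuple[List[int], List[int], str]],        # (Z_tokens, Y_tokens, answer_text)
    joint_loglik: Callable[[List[int], List[int]], float],  # log pi_Phi(Z, Y | X)
) -> str:
    """Pick the answer with the highest aggregated length-normalized joint likelihood."""
    n = len(samples)
    scores = defaultdict(float)
    for z_tokens, y_tokens, answer in samples:
        # Length-normalize the joint log-likelihood of the chain-answer pair.
        norm_ll = joint_loglik(z_tokens, y_tokens) / (len(z_tokens) + len(y_tokens))
        # Chains that lead to the same answer pool their (exponentiated) mass.
        scores[answer] += math.exp(norm_ll) / n
    return max(scores, key=scores.get)
```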

Training Details

  • Backbone: Qwen2.5-VL-3B/7B; SFT is first applied to obtain \(\pi_\Phi\) as the reward model and initialization.
  • The policy model is trained with LoRA (\(r=64\), \(\alpha=128\)) using only 3k visual reasoning samples (see the configuration sketch after this list).
  • A new role token "Analyzer" is introduced to enable the model to optionally provide intermediate reasoning steps.
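For reference, a minimal sketch of an equivalent LoRA setup with the Hugging Face peft library; the checkpoint name and target modules are assumptions, since only \(r\) and \(\alpha\) are reported:

```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

# Backbone as reported; the exact checkpoint name is an assumption.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

lora_cfg = LoraConfig(
    r=64,            # rank, as reported
    lora_alpha=128,  # scaling, as reported
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```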

Key Experimental Results

Model                MathVista  MathVision  MathVerse (V-only)  MMMU (val)  MMMU-Pro  MMVet  MME
GPT-4o               60.0       30.4        40.6                70.7        51.9      69.1   2329
Qwen2.5-VL-7B        63.7       25.4        38.2                50.0        34.6      70.5   2333
R1-Onevision (GRPO)  64.1       23.9        37.8                47.9        28.2      71.1   1111
LaCoT-Qwen-7B        68.4       24.9        43.3                54.9        35.3      74.2   2372
LaCoT-Qwen-3B        63.2       20.7        40.0                48.8        28.9      69.6   2208

Inference Scaling Method  MathVerse  MathVista  MMMU  MMVet
3B w/ BoN                 21.2       57.1       44.7  67.1
3B w/ BiN                 40.0       63.2       48.8  69.6
7B w/ BoN                 26.5       62.2       47.3  71.2
7B w/ BiN                 39.7       68.4       54.9  74.2

Highlights & Insights

  • Substantial theoretical contribution: Modeling CoT reasoning as variational inference and realizing amortized posterior sampling via the GFlowNet SubTB objective provides a rigorous theoretical foundation.
  • 3B surpasses 7B baselines: LaCoT-3B achieves a 14-point jump on MathVerse, surpassing all 7B baselines in the comparison (and even LLaVA-CoT-11B), suggesting that sampling diversity can matter more than raw model scale.
  • BiN greatly outperforms BoN: BiN exceeds BoN by roughly 13–19 percentage points on MathVerse (18.8 for 3B, 13.2 for 7B) without requiring any external reward model.
  • Elegant reference-guided exploration: RGFN replaces KL penalties with annealed filtering, simultaneously ensuring exploration and preventing catastrophic forgetting while reducing gradient variance.
  • Training efficiency: Significant gains are achieved with only 3k samples and LoRA fine-tuning.

Limitations & Future Work

  • Validation is limited to models with \(\leq 7\)B parameters due to resource constraints; the effect on larger models remains unknown.
  • Gains on MathVision (real mathematical competition problems with handwritten or low-resolution figures) are modest, indicating that OCR and visual grounding remain bottlenecks.
  • As an on-policy method, exploring long, complex latent reasoning sequences remains costly in memory and training time.
  • Hallucination is not directly addressed; increasing the sample count \(N\) mitigates it but is not a fundamental solution.
  • The reward function depends on the quality of the SFT model's likelihood estimates; poor SFT training propagates errors downstream.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Introducing GFlowNets into probabilistic modeling of LVLM reasoning is a highly novel perspective supported by complete theoretical derivations.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation across seven benchmarks with ablations covering major components, though validation on larger models is absent.
  • Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are clear and figures are intuitive, though notation density is occasionally high.
  • Value: ⭐⭐⭐⭐⭐ — Introduces a new paradigm for visual reasoning; the BiN inference scaling method is directly applicable to arbitrary reasoning LVLMs.