Latent Chain-of-Thought for Visual Reasoning¶
Conference: NeurIPS 2025 arXiv: 2510.23925 Code: heliossun/LaCoT Area: Multimodal VLM / Visual Reasoning Keywords: visual reasoning, chain-of-thought, amortized variational inference, GFlowNets, inference-time scaling, LVLM
TL;DR¶
This paper reformulates visual CoT reasoning as a posterior inference problem and proposes LaCoT, a training framework based on amortized variational inference (AVI) comprising reference-guided GFlowNet fine-tuning (RGFN), token-level reward approximation, and Bayesian inference scaling (BiN). On Qwen2.5-VL 3B/7B, LaCoT outperforms GRPO by 10.6% and achieves open-source state-of-the-art across seven visual reasoning benchmarks.
Background & Motivation¶
Background: Visual chain-of-thought (CoT) reasoning is central to improving the interpretability and reliability of large vision-language models (LVLMs), yet existing training methods exhibit pronounced generalization bottlenecks.
Limitations of Prior Work:
- SFT: Supervised fine-tuning relies on teacher-forcing log-likelihood, merely imitating reference reasoning chains without any capacity for exploration.
- PPO/GRPO: KL penalties keep the policy close to the SFT baseline, constraining the discovery of novel reasoning paths; reward hacking is also common, with models attaining high scores without genuinely solving the problem.
- Deterministic sampling: Existing methods treat reasoning as a deterministic generation process, failing to capture the diversity and uncertainty of reasoning trajectories.
- Inference-time scaling cost: Methods such as Best-of-N and beam search require additional reward-model evaluations, incurring high computational cost and reliance on biased critic models.
- Token-level reward computation: Multimodal reasoning chains typically span ~1k tokens, making exact per-token reward computation prohibitively expensive.
Method¶
LaCoT models visual reasoning as posterior inference in a latent variable model: given a question–answer pair \((X, Y)\), the objective is to sample latent reasoning chains \(Z\) from the posterior \(P(Z \mid X, Y)\). The overall framework comprises three core contributions.
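As background (this is the standard amortized variational inference bound, not a formula quoted from the paper), training \(q_\theta(Z \mid X)\) to approximate the posterior can be motivated by the evidence lower bound:

```latex
\log P(Y \mid X)\;\ge\;
\mathbb{E}_{Z \sim q_\theta(Z \mid X)}\big[\log P(Y \mid X, Z)\big]
\;-\;\mathrm{KL}\big(q_\theta(Z \mid X)\,\big\|\,P(Z \mid X)\big)
```

with equality when \(q_\theta(Z \mid X)\) matches the true posterior \(P(Z \mid X, Y)\); the GFlowNet objective below plays the role of the amortized sampler for this posterior.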
1. Token-Level Marginal Reward Approximation (ISubTB)¶
- The true reward \(R(z_{1:t}) = \log P(X, z_{1:t}, Y)\) is computed exactly only every \(\lambda = 8\) tokens along the reasoning chain.
- Rewards at intermediate tokens are approximated via linear interpolation: \(\tilde{R}(z_{1:t+i}) = R(z_{1:t}) + \frac{i}{\lambda} \cdot [R(z_{1:t+\lambda}) - R(z_{1:t})]\) for \(0 < i < \lambda\).
- Theoretical analysis shows that when \(\lambda\) is sufficiently small, the interpolation error approaches zero and the flow consistency condition holds.
- The exact reward in the SubTB loss is replaced by this approximation to yield the ISubTB objective.
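The interpolation scheme above can be sketched in a few lines. Here `true_reward` is a toy stand-in for the exact reward \(\log P(X, z_{1:t}, Y)\), which in practice costs a forward pass of the reward model per evaluation; function names and structure are illustrative, not the paper's implementation.

```python
def interpolated_rewards(chain_len, true_reward, lam=8):
    """Approximate per-token rewards by evaluating the exact reward
    only every `lam` tokens and linearly interpolating in between."""
    # exact rewards at anchor positions t = 0, lam, 2*lam, ..., chain_len
    anchors = list(range(0, chain_len + 1, lam))
    if anchors[-1] != chain_len:
        anchors.append(chain_len)  # always evaluate the full chain
    exact = {t: true_reward(t) for t in anchors}

    rewards = []
    for t in range(chain_len + 1):
        if t in exact:
            rewards.append(exact[t])
        else:
            lo = (t // lam) * lam          # preceding anchor
            hi = min(lo + lam, chain_len)  # following anchor
            frac = (t - lo) / (hi - lo)
            rewards.append(exact[lo] + frac * (exact[hi] - exact[lo]))
    return rewards
```

For a chain of length \(T\), only about \(T/\lambda\) exact evaluations are needed, so reward-model cost drops by roughly a factor of \(\lambda\).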
2. Reference-Guided GFlowNet Fine-Tuning (RGFN)¶
- \(m\) candidate reasoning chains \(\{Z_1, \ldots, Z_m\}\) are explored from the policy model \(q_\theta(Z \mid X)\).
- A reference reasoning chain \(Z_\text{ref}\) (generated by GPT-4o or DeepSeek-R1) serves as an anchor.
- An indicator function \(\mathbb{I}(Z_i)\) filters out candidates whose reward falls below \(\delta_s \cdot R(Z_\text{ref})\), preventing catastrophic forgetting.
- The annealing coefficient \(\delta_s\) is progressively tightened over training steps: broader exploration is tolerated in the first 50 steps, after which the threshold is gradually raised.
- Gradients are back-propagated only through samples that outperform the reference, eliminating the need for manual KL penalty tuning or gradient clipping.
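The filtering step can be sketched as follows. The schedule constants and function names are illustrative assumptions (the paper only specifies a ~50-step warm-up and gradual tightening), and the example assumes rewards rescaled to be non-negative so that a larger \(\delta_s\) means a stricter filter.

```python
def anneal_threshold(step, warmup=50, delta_min=0.5, delta_max=0.9):
    """Annealing coefficient delta_s: loose during the first `warmup`
    steps to allow broad exploration, then tightened toward delta_max."""
    if step < warmup:
        return delta_min
    # linear ramp after warm-up (an assumed schedule)
    progress = min(1.0, (step - warmup) / 200.0)
    return delta_min + progress * (delta_max - delta_min)

def filter_candidates(rewards, ref_reward, step):
    """Keep only candidate chains whose reward clears delta_s * R(Z_ref);
    gradients would flow only through the surviving indices."""
    delta_s = anneal_threshold(step)
    return [i for i, r in enumerate(rewards) if r >= delta_s * ref_reward]
```

Early in training most candidates survive and the policy explores broadly; as \(\delta_s\) rises, only chains competitive with the reference anchor contribute gradients.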
3. Bayesian Inference Scaling (BiN)¶
- \(N\) latent reasoning chains \(Z_i \sim q_\theta(Z \mid X)\) are sampled, each yielding a corresponding answer \(Y_i \sim \pi_\Phi(Y \mid X, Z_i)\).
- Each candidate is scored by its length-normalized joint likelihood, approximating the marginal \(P(Y \mid X) \approx \frac{1}{N} \sum_{i=1}^{N} \frac{\pi_\Phi(Z_i, Y_i \mid X)}{|Z_i Y_i|}\).
- The answer with the highest marginal likelihood is selected as the final output.
- No external reward model is required; statistically robust answer selection is achieved via Bayesian sampling.
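A minimal sketch of the selection step, assuming the sampler and scorer are run upstream (`bin_select` and the tuple layout are illustrative, not the paper's API): each sample carries an answer, its joint log-likelihood \(\log \pi_\Phi(Z_i, Y_i \mid X)\), and its token length.

```python
def bin_select(samples):
    """samples: list of (answer, joint_logprob, length) tuples.
    Length-normalize each joint log-likelihood, average the scores of
    the chains supporting each distinct answer, and return the answer
    with the highest average score."""
    scores, counts = {}, {}
    for answer, logp, length in samples:
        norm = logp / length  # length-normalized joint log-likelihood
        scores[answer] = scores.get(answer, 0.0) + norm
        counts[answer] = counts.get(answer, 0) + 1
    return max(scores, key=lambda a: scores[a] / counts[a])
```

Because scoring reuses \(\pi_\Phi\) itself, no separate critic or reward model is needed at inference time, unlike Best-of-N.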
Training Details¶
- Backbone: Qwen2.5-VL-3B/7B; SFT is first applied to obtain \(\pi_\Phi\) as the reward model and initialization.
- The policy model is trained with LoRA (\(r=64\), \(\alpha=128\)) using only 3k visual reasoning samples.
- A new role token "Analyzer" is introduced to enable the model to optionally provide intermediate reasoning steps.
Key Experimental Results¶
| Model | MathVista | MathVision | MathVerse (V-only) | MMMU (val) | MMMU-Pro | MMVet | MME |
|---|---|---|---|---|---|---|---|
| GPT-4o | 60.0 | 30.4 | 40.6 | 70.7 | 51.9 | 69.1 | 2329 |
| Qwen2.5-VL-7B | 63.7 | 25.4 | 38.2 | 50.0 | 34.6 | 70.5 | 2333 |
| R1-Onevision (GRPO) | 64.1 | 23.9 | 37.8 | 47.9 | 28.2 | 71.1 | 1111 |
| LaCoT-Qwen-7B | 68.4 | 24.9 | 43.3 | 54.9 | 35.3 | 74.2 | 2372 |
| LaCoT-Qwen-3B | 63.2 | 20.7 | 40.0 | 48.8 | 28.9 | 69.6 | 2208 |
| Inference Scaling Method | MathVerse | MathVista | MMMU | MMVet |
|---|---|---|---|---|
| 3B w/ BoN | 21.2 | 57.1 | 44.7 | 67.1 |
| 3B w/ BiN | 40.0 | 63.2 | 48.8 | 69.6 |
| 7B w/ BoN | 26.5 | 62.2 | 47.3 | 71.2 |
| 7B w/ BiN | 39.7 | 68.4 | 54.9 | 74.2 |
Highlights & Insights¶
- Substantial theoretical contribution: Modeling CoT reasoning as variational inference and realizing amortized posterior sampling via the GFlowNet SubTB objective provides a rigorous theoretical foundation.
- 3B surpasses 7B: LaCoT-3B achieves a 14-point jump on MathVerse, surpassing all 7B models (including LLaVA-CoT-11B), demonstrating that sampling diversity matters more than model scale.
- BiN greatly outperforms BoN: BiN exceeds BoN by roughly 13–19 points on MathVerse (+18.8 for 3B, +13.2 for 7B) without requiring any external reward model.
- Elegant reference-guided exploration: RGFN replaces KL penalties with annealed filtering, simultaneously ensuring exploration and preventing catastrophic forgetting while reducing gradient variance.
- Training efficiency: Significant gains are achieved with only 3k samples and LoRA fine-tuning.
Limitations & Future Work¶
- Validation is limited to models with \(\leq 7\)B parameters due to resource constraints; the effect on larger models remains unknown.
- Gains on MathVision (real mathematical competition problems with handwritten or low-resolution figures) are modest, indicating that OCR and visual grounding remain bottlenecks.
- As an on-policy method, LaCoT must repeatedly sample long, complex latent reasoning chains during training, which remains costly in both memory and time.
- Hallucination is not addressed—increasing sampling \(N\) provides mitigation but not a fundamental solution.
- The reward function depends on the quality of the SFT model's likelihood estimates; poor SFT training propagates errors downstream.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Introducing GFlowNets into probabilistic modeling of LVLM reasoning is a highly novel perspective supported by complete theoretical derivations.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation across seven benchmarks with ablations covering major components, though validation on larger models is absent.
- Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are clear and figures are intuitive, though notation density is occasionally high.
- Value: ⭐⭐⭐⭐⭐ — Introduces a new paradigm for visual reasoning; the BiN inference scaling method is directly applicable to arbitrary reasoning LVLMs.