Latent Chain-of-Thought for Visual Reasoning

Conference: NeurIPS 2025 | arXiv: 2510.23925 | Code: heliossun/LaCoT | Area: Multimodal VLM / Visual Reasoning | Keywords: visual reasoning, chain-of-thought, amortized variational inference, GFlowNets, inference-time scaling, LVLM

TL;DR

This paper reformulates visual CoT reasoning as a posterior inference problem and proposes LaCoT, a training framework based on amortized variational inference (AVI) comprising reference-guided GFlowNet fine-tuning (RGFN), token-level reward approximation, and Bayesian inference scaling (BiN). On Qwen2.5-VL 3B/7B, LaCoT outperforms GRPO by 10.6% and achieves open-source state-of-the-art across seven visual reasoning benchmarks.

Background & Motivation

Background: Visual chain-of-thought (CoT) reasoning is central to improving the interpretability and reliability of large vision-language models (LVLMs), yet existing training methods exhibit pronounced generalization bottlenecks.

Limitations of Prior Work:

  • SFT: Supervised fine-tuning relies on teacher-forcing log-likelihood, which merely imitates reference reasoning chains without exploration capacity.
  • PPO/GRPO: KL penalties force the policy to remain close to the SFT baseline, constraining the discovery of novel reasoning paths; reward hacking is also common, with models attaining high scores without genuinely solving the problem.
  • Deterministic sampling: Existing methods treat reasoning as a deterministic generation process, failing to capture the diversity and uncertainty of reasoning trajectories.
  • Inference-time scaling cost: Methods such as Best-of-N and Beam Search require additional reward model evaluations, incurring high computational cost and reliance on biased critic models.
  • Token-level reward computation: Multimodal reasoning chains typically span ~1k tokens, making exact per-token reward computation prohibitively expensive.

Method

LaCoT models visual reasoning as posterior inference in a latent variable model: given a question–answer pair \((X, Y)\), the objective is to sample latent reasoning chains \(Z\) from the posterior \(P(Z \mid X, Y)\). The overall framework comprises three core contributions.
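To make the formulation concrete, the standard variational identity underlying this setup is sketched below. This is a textbook decomposition written in the notation above, not the paper's exact training objective (which is realized through the GFlowNet SubTB loss rather than a direct ELBO):

```latex
% Posterior over latent reasoning chains and the variational bound it motivates.
% q_\theta(Z \mid X) is the amortized sampler trained to match P(Z \mid X, Y).
\begin{aligned}
P(Z \mid X, Y) &= \frac{P(Z, Y \mid X)}{P(Y \mid X)}
               \;\propto\; P(Z \mid X)\, P(Y \mid X, Z), \\
\log P(Y \mid X) &\;\ge\;
  \mathbb{E}_{Z \sim q_\theta(Z \mid X)}
  \bigl[\log P(Z, Y \mid X) - \log q_\theta(Z \mid X)\bigr].
\end{aligned}
```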

1. Token-Level Marginal Reward Approximation (ISubTB)

  • The exact reward \(R(z_{1:t}) = \log P(X, z_{1:t}, Y)\) is computed only every \(\lambda = 8\) tokens along the reasoning chain.
  • Intermediate steps are approximated via linear interpolation: \(\tilde{R}(z_{1:t+i}) = R(z_{1:t}) + \frac{i}{\lambda}\,\bigl[R(z_{1:t+\lambda}) - R(z_{1:t})\bigr]\) for \(0 < i < \lambda\).
  • Theoretical analysis shows that when \(\lambda\) is sufficiently small, the interpolation error approaches zero and the flow consistency condition holds.
  • The exact reward in the SubTB loss is replaced by this approximation to yield the ISubTB objective (see the sketch after this list).
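A minimal sketch of the interpolation step, assuming a scalar reward callable and illustrative names (`exact_reward`, `lam`); this is not the authors' implementation:

```python
from typing import Callable, List

def interpolated_rewards(
    exact_reward: Callable[[int], float],  # t -> R(z_{1:t}), e.g. log P(X, z_{1:t}, Y)
    seq_len: int,
    lam: int = 8,  # interval between exact reward evaluations (lambda in the paper)
) -> List[float]:
    """Approximate R(z_{1:t}) for t = 1..seq_len, evaluating exactly every `lam` tokens."""
    # Anchor positions where the exact (expensive) reward is computed.
    anchors = list(range(0, seq_len + 1, lam))
    if anchors[-1] != seq_len:
        anchors.append(seq_len)
    exact = {t: exact_reward(t) for t in anchors}

    # Linearly interpolate between consecutive anchors for all other positions.
    rewards: List[float] = []
    for a, b in zip(anchors[:-1], anchors[1:]):
        for t in range(a + 1, b + 1):
            frac = (t - a) / (b - a)
            rewards.append(exact[a] + frac * (exact[b] - exact[a]))
    return rewards
```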

2. Reference-Guided GFlowNet Fine-Tuning (RGFN)

  • \(m\) candidate reasoning chains \(\{Z_1, \ldots, Z_m\}\) are explored from the policy model \(q_\theta(Z \mid X)\).
  • A reference reasoning chain \(Z_\text{ref}\) (generated by GPT-4o or DeepSeek-R1) serves as an anchor.
  • An indicator function \(\mathbb{I}(Z_i)\) filters out candidates whose reward falls below \(\delta_s \cdot R(Z_\text{ref})\), preventing catastrophic forgetting.
  • The annealing coefficient \(\delta_s\) is progressively tightened over training steps: broader exploration is tolerated in the first 50 steps, after which the threshold is gradually raised.
  • Gradients are back-propagated only through samples that clear the reference threshold, eliminating the need for manual KL-penalty tuning or gradient clipping (see the sketch after this list).
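A minimal sketch of the annealed reference filter, assuming rewards on a higher-is-better scale; the schedule parameters `delta_min`, `delta_max`, and `anneal_steps` are assumptions, since only the 50-step warm-up is reported:

```python
from typing import List

def rgfn_keep_indices(
    candidate_rewards: List[float],  # R(Z_i) for the m explored chains
    ref_reward: float,               # R(Z_ref) for the GPT-4o / DeepSeek-R1 anchor
    step: int,
    warmup: int = 50,                # loose filtering during early exploration
    delta_min: float = 0.5,          # assumed initial threshold coefficient
    delta_max: float = 1.0,          # assumed final threshold coefficient
    anneal_steps: int = 1000,        # assumed annealing horizon
) -> List[int]:
    """Indices of candidates kept for the GFlowNet update: R(Z_i) >= delta_s * R(Z_ref)."""
    if step < warmup:
        delta_s = delta_min
    else:
        progress = min(1.0, (step - warmup) / anneal_steps)
        delta_s = delta_min + progress * (delta_max - delta_min)
    return [i for i, r in enumerate(candidate_rewards) if r >= delta_s * ref_reward]
```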

3. Bayesian Inference Scaling (BiN)

  • \(N\) latent reasoning chains \(Z_i \sim q_\theta(Z \mid X)\) are sampled, each yielding a corresponding answer \(Y_i \sim \pi_\Phi(Y \mid X Z_i)\).
  • Each sampled pair is scored by its length-normalized joint likelihood under the reward model, and the scores are aggregated to approximate the marginal: \(P(Y \mid X) \approx \frac{1}{N} \sum_{i=1}^{N} \frac{\pi_\Phi(Z_i, Y_i \mid X)}{|Z_i Y_i|}\), where \(|Z_i Y_i|\) is the total token length of the chain–answer pair.
  • The answer with the highest marginal likelihood is selected as the final output.
  • No external reward model is required; statistically robust answer selection is achieved via Bayesian sampling (see the sketch after this list).
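A minimal sketch of BiN selection, assuming a `joint_loglik` callable backed by the SFT model \(\pi_\Phi\) and token-level lengths; names and data layout are illustrative, not the authors' code:

```python
import math
from collections import defaultdict
from typing import Callable, List, Tuple

def bin_select(
    samples: List[Tuple[List[int], List[int], str]],        # (Z_tokens, Y_tokens, answer_text)
    joint_loglik: Callable[[List[int], List[int]], float],  # log pi_Phi(Z, Y | X)
) -> str:
    """Pick the answer with the highest aggregated length-normalized joint likelihood."""
    n = len(samples)
    scores = defaultdict(float)
    for z_tokens, y_tokens, answer in samples:
        # Length-normalize the joint log-likelihood of the chain-answer pair.
        norm_ll = joint_loglik(z_tokens, y_tokens) / (len(z_tokens) + len(y_tokens))
        # Chains that lead to the same answer pool their (exponentiated) mass.
        scores[answer] += math.exp(norm_ll) / n
    return max(scores, key=scores.get)
```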

Training Details

  • Backbone: Qwen2.5-VL-3B/7B; SFT is first applied to obtain \(\pi_\Phi\) as the reward model and initialization.
  • The policy model is trained with LoRA (\(r=64\), \(\alpha=128\)) using only 3k visual reasoning samples (see the configuration sketch after this list).
  • A new role token "Analyzer" is introduced to enable the model to optionally provide intermediate reasoning steps.
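For reference, a minimal sketch of an equivalent LoRA setup with the Hugging Face peft library; the checkpoint name and target modules are assumptions, since only \(r\) and \(\alpha\) are reported:

```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

# Backbone as reported; the exact checkpoint name is an assumption.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

lora_cfg = LoraConfig(
    r=64,            # rank, as reported
    lora_alpha=128,  # scaling, as reported
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```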

Key Experimental Results

Model                MathVista  MathVision  MathVerse (V-only)  MMMU (val)  MMMU-Pro  MMVet  MME
GPT-4o               60.0       30.4        40.6                70.7        51.9      69.1   2329
Qwen2.5-VL-7B        63.7       25.4        38.2                50.0        34.6      70.5   2333
R1-Onevision (GRPO)  64.1       23.9        37.8                47.9        28.2      71.1   1111
LaCoT-Qwen-7B        68.4       24.9        43.3                54.9        35.3      74.2   2372
LaCoT-Qwen-3B        63.2       20.7        40.0                48.8        28.9      69.6   2208

Inference Scaling Method  MathVerse  MathVista  MMMU  MMVet
3B w/ BoN                 21.2       57.1       44.7  67.1
3B w/ BiN                 40.0       63.2       48.8  69.6
7B w/ BoN                 26.5       62.2       47.3  71.2
7B w/ BiN                 39.7       68.4       54.9  74.2

Highlights & Insights

  • Substantial theoretical contribution: Modeling CoT reasoning as variational inference and realizing amortized posterior sampling via the GFlowNet SubTB objective provides a rigorous theoretical foundation.
  • 3B surpasses 7B baselines: LaCoT-3B achieves a 14-point jump on MathVerse, surpassing all 7B baselines in the comparison (and even LLaVA-CoT-11B), suggesting that sampling diversity can matter more than raw model scale.
  • BiN greatly outperforms BoN: BiN exceeds BoN by roughly 13–19 percentage points on MathVerse (18.8 for 3B, 13.2 for 7B) without requiring any external reward model.
  • Elegant reference-guided exploration: RGFN replaces KL penalties with annealed filtering, simultaneously ensuring exploration and preventing catastrophic forgetting while reducing gradient variance.
  • Training efficiency: Significant gains are achieved with only 3k samples and LoRA fine-tuning.

Limitations & Future Work

  • Validation is limited to models with \(\leq 7\)B parameters due to resource constraints; the effect on larger models remains unknown.
  • Gains on MathVision (real mathematical competition problems with handwritten or low-resolution figures) are modest, indicating that OCR and visual grounding remain bottlenecks.
  • As an on-policy method, exploring long, complex latent reasoning sequences remains costly in memory and training time.
  • Hallucination is not directly addressed; increasing the sample count \(N\) mitigates it but is not a fundamental solution.
  • The reward function depends on the quality of the SFT model's likelihood estimates; poor SFT training propagates errors downstream.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Introducing GFlowNets into probabilistic modeling of LVLM reasoning is a highly novel perspective supported by complete theoretical derivations.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation across seven benchmarks with ablations covering major components, though validation on larger models is absent.
  • Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are clear and figures are intuitive, though notation density is occasionally high.
  • Value: ⭐⭐⭐⭐⭐ — Introduces a new paradigm for visual reasoning; the BiN inference scaling method is directly applicable to arbitrary reasoning LVLMs.