🧠 VLM Reasoning
🧪 ICML2025 · 5 paper notes
📌 Same area in other venues: 📷 CVPR2026 (150) · 🔬 ICLR2026 (112) · 💬 ACL2026 (32) · 🧪 ICML2026 (31) · 🤖 AAAI2026 (10) · 🧠 NeurIPS2025 (30)
🔥 Top topics: Reasoning ×5 · Multimodal/VLM ×3
- Diffusion-VLA: Generalizable and Interpretable Robot Foundation Model via Self-Generated Reasoning
DiVLA (Diffusion-VLA) unifies the reasoning capabilities of autoregressive VLMs and the action-generation capabilities of diffusion models in a single end-to-end framework. A Reasoning Injection Module embeds self-generated language reasoning directly into policy learning, giving DiVLA generalization to unseen objects, interpretable action decisions, and high-speed inference (82 Hz for the 2B model).
- Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner
This paper proposes a scalable Bayesian Theory-of-Mind (ToM) planner that decomposes multi-step reasoning into step-by-step Bayesian updates. A weak-to-strong control mechanism transfers specialized ToM capabilities from smaller models to large language models (up to 405B), outperforming the previous state of the art by 4.6% on multimodal ToM benchmarks.
- Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger
This paper proposes the RCTS framework, which builds a knowledge base rich in reasoning context via a self-consistency evaluation mechanism and re-ranks retrieved exemplars using Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR). With RCTS, LVLMs outperform both raw ICL and Vanilla-RAG across multiple VQA datasets, by 3-4% on average.
- Reasoning Limitations of Multimodal Large Language Models. A Case Study of Bongard Problems
This paper systematically evaluates the abstract visual reasoning capabilities of 4 closed-source and 4 open-source MLLMs on three datasets: the classic synthetic Bongard Problems (BPs), Bongard-HOI, and Bongard-OpenWorld. It proposes seven problem-solving strategies and a new dataset, Bongard-RWR, which represents synthetic BP concepts using real-world images. The results indicate that the extremely poor performance of MLLMs on synthetic BPs stems not from domain shift but from an inherent limitation in abstract reasoning.
- Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
This work investigates the causes of spatial reasoning failures in VLMs from a mechanistic interpretability perspective. It finds that image tokens receive only ~10% of attention despite making up 90% of the input, and that the geometric distribution of attention is the key factor. The authors propose AdaptVis, a training-free decoding method that adaptively adjusts the image attention temperature based on runtime confidence, achieving an absolute improvement of up to 50 percentage points on the WhatsUp dataset.
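
To make the idea of confidence-dependent attention temperature in the last note more concrete, here is a minimal Python sketch. It is not the authors' implementation: the function name `adaptive_image_attention`, the threshold, the two scaling factors, and the rule "sharpen when confident, smooth when unsure" are all assumptions made for illustration. It operates on a single toy attention row, not a real model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_image_attention(logits, is_image, confidence,
                             threshold=0.5, sharpen=2.0, smooth=0.5):
    """Rescale image-token attention logits, then renormalize.

    All names and default values here are hypothetical, chosen only
    to illustrate the mechanism described in the summary.
    """
    # Assumed rule: a confident model gets a factor > 1 on image logits
    # (lower temperature, sharper focus); an unsure model gets a factor
    # < 1 (higher temperature, broader focus over the image).
    scale = sharpen if confidence >= threshold else smooth
    # Text-token logits are left untouched; only image tokens are rescaled.
    scaled = [x * scale if img else x for x, img in zip(logits, is_image)]
    return softmax(scaled)


# Toy attention row: three image tokens followed by one text token.
row = [1.0, 2.0, 0.5, 0.0]
mask = [True, True, True, False]
confident = adaptive_image_attention(row, mask, confidence=0.9)
unsure = adaptive_image_attention(row, mask, confidence=0.1)
```

In this toy setup, the strongest image token takes a larger share of the image attention in `confident` than in `unsure`, which is the qualitative effect a temperature adjustment is meant to have. A real method would apply such scaling inside the model's attention layers and derive `confidence` from the model's own output probabilities.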