
🧠 VLM Reasoning

🧪 ICML2025 · 5 paper notes

📌 Same area in other venues: 📷 CVPR2026 (150) · 🔬 ICLR2026 (112) · 💬 ACL2026 (32) · 🧪 ICML2026 (31) · 🤖 AAAI2026 (10) · 🧠 NeurIPS2025 (30)

🔥 Top topics: Reasoning ×5 · Multimodal/VLM ×3

Diffusion-VLA: Generalizable and Interpretable Robot Foundation Model via Self-Generated Reasoning: DiVLA (Diffusion-VLA) unifies the reasoning capabilities of autoregressive VLMs and the action generation capabilities of diffusion models in a single end-to-end framework. A Reasoning Injection Module embeds the model's self-generated language reasoning directly into policy learning, giving DiVLA generalization to unseen objects, interpretable action decisions, and high-speed inference (82 Hz for the 2B model).
Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner: This paper proposes a scalable Bayesian Theory-of-Mind (ToM) planner that decomposes multi-step reasoning into step-by-step Bayesian updates. Using a weak-to-strong control mechanism, it transfers specialized ToM capabilities from smaller models to large language models (up to 405B), outperforming the previous state of the art by 4.6% on multimodal ToM benchmarks.
Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger: This paper proposes the RCTS framework, which builds a knowledge base enriched with reasoning context via a self-consistency evaluation mechanism and re-ranks retrieved exemplars using Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR). With RCTS, LVLMs outperform raw ICL and Vanilla-RAG methods across multiple VQA datasets by 3-4% on average.
Reasoning Limitations of Multimodal Large Language Models. A Case Study of Bongard Problems: This paper systematically evaluates the abstract visual reasoning capabilities of 4 closed-source and 4 open-source MLLMs on three datasets: the classic synthetic Bongard Problems (BPs), Bongard-HOI, and Bongard-OpenWorld. It proposes seven problem-solving strategies and a new dataset, Bongard-RWR, which represents synthetic BP concepts using real-world images. The results indicate that the extremely poor performance of MLLMs on synthetic BPs stems not from domain shift but from an inherent limitation in abstract reasoning.
Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas: This work investigates the causes of spatial reasoning failures in VLMs from a mechanistic interpretability perspective. It finds that image tokens receive only ~10% of attention despite making up 90% of the input, and that the geometric distribution of attention is the key factor. The authors propose AdaptVis, a training-free decoding method that adaptively adjusts the image attention temperature based on runtime confidence, achieving an absolute improvement of up to 50 points on the WhatsUp dataset.
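
The confidence-dependent attention temperature described in the last entry can be illustrated with a small sketch. This is not the authors' implementation: the function names, the confidence threshold, and the two scaling coefficients below are hypothetical values chosen for illustration, and the sketch operates on a single toy attention row rather than inside a real VLM. It only shows the general idea of sharpening attention over image tokens when the model is confident and smoothing it when the model is not.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scale_image_attention(logits, image_mask, confidence,
                          threshold=0.5, sharpen=1.5, smooth=0.5):
    """Rescale image-token attention logits, then apply softmax.

    All parameter names and default values are illustrative assumptions.
    Multiplying a logit by a coefficient c is equivalent to using a
    softmax temperature of 1/c for that token: c > 1 sharpens the
    distribution, c < 1 flattens it.
    """
    coeff = sharpen if confidence >= threshold else smooth
    scaled = [x * coeff if is_image else x
              for x, is_image in zip(logits, image_mask)]
    return softmax(scaled)


# Toy attention row: three image tokens followed by one text token.
logits = [2.0, 1.0, 0.5, 3.0]
image_mask = [True, True, True, False]

confident = scale_image_attention(logits, image_mask, confidence=0.9)
uncertain = scale_image_attention(logits, image_mask, confidence=0.2)
print("confident:", [round(p, 3) for p in confident])
print("uncertain:", [round(p, 3) for p in uncertain])
```

In this toy setup, the confident case concentrates the image attention on the strongest image token, while the uncertain case spreads it more evenly across image tokens. How confidence is measured and which layers or heads are rescaled are left out here, since the summary above does not specify them.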