🧠 VLM Reasoning

📷 CVPR2025 · 13 paper notes

📌 Same area in other venues: 📷 CVPR2026 (150) · 🔬 ICLR2026 (112) · 💬 ACL2026 (32) · 🧪 ICML2026 (31) · 🤖 AAAI2026 (10) · 🧠 NeurIPS2025 (30)

🔥 Top topics: Reasoning ×11 · Multimodal/VLM ×10 · LLM ×4

Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation: This paper proposes the CRYSTAL benchmark (6372 instances) to evaluate MLLMs at the intermediate reasoning step level using Match F1 and Ordered Match F1. It reveals widespread cherry-picking behaviors and disordered reasoning processes, and introduces a CPR-Curriculum training strategy to improve reasoning quality.
Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Models: This paper proposes Coarse Correspondences, a lightweight, training-free visual prompting method. By overlaying coarse-grained instance correspondence markers obtained from object tracking onto image frames, it significantly enhances the spatial-temporal reasoning capabilities of MLLMs, achieving improvements of +20.5% on ScanQA, +9.7% on OpenEQA, +6.0% on EgoSchema, and +11% on R2R navigation.
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning: This paper proposes the Critic-V framework, which decouples the VLM reasoning process into a Reasoner and a Critic. By utilizing a DPO-trained Critic model to provide natural language feedback for iteratively optimizing the reasoning path, this approach outperforms GPT-4V on 5 out of 8 benchmarks, showing particularly significant improvements on mathematical reasoning tasks (MathVista +11.8%).
Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents: This paper proposes two large-scale document retrieval benchmarks, DocHaystack and InfoHaystack (1000+ documents per question), along with V-RAG, a vision-centric retrieval-augmented generation framework that improves Recall@1 by 9%-11% over the best baseline.
ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models: This paper proposes ESPIRE, a simulation-based diagnostic benchmark for embodied spatial reasoning. It decomposes VLM evaluation into localization and execution phases and, through a fully generative paradigm, systematically assesses VLMs across multiple spatial reasoning dimensions and granularities.
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models: Insight-V proposes a visual reasoning enhancement scheme consisting of a data generation pipeline and a multi-agent reasoning system: it constructs high-quality long-chain reasoning data through progressive generation and multi-granular evaluation, designs a Reasoning Agent and a Summary Agent to collaboratively solve problems, and incorporates iterative DPO to further improve reasoning quality, achieving an average improvement of 7% across seven visual reasoning benchmarks.
MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning: MM-CondChain is the first MLLM benchmark for visually grounded deep compositional reasoning. By using a Verifiable Programmatic Intermediate Representation (VPIR), it automatically constructs multi-layer conditional chains and chain-style hard negatives. The strongest model achieves only a 53.33 Path F1, revealing that deep compositional reasoning remains a fundamental challenge.
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts: This paper proposes the MV-MATH benchmark, consisting of 2,009 high-quality multi-image math problems sourced from real K-12 scenarios, and uses it to systematically evaluate 25 multimodal large models on multi-image math reasoning. All models perform well below human level (the best, Claude, achieves only 33.9%), showing that multi-image math reasoning remains a significant challenge for MLLMs.
Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence: This paper proposes VAEX-Bench, a benchmark that for the first time systematically evaluates the "abstract spatiotemporal reasoning" capability of MLLMs. Unlike extractive tasks that pull information from single frames, abstract reasoning requires integrating observations across rooms and time, for example to infer global spatial layouts or perform cross-scene counting. The study reveals that all SOTA models (including GPT-5.2 and Gemini-3 Pro) perform significantly worse than humans on abstract reasoning.
SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model: This work proposes the Sequential 3D Affordance Reasoning task and constructs a benchmark of 180K instruction-point cloud pairs. By introducing a <SEG> token and a multi-granular language-point integration module into a 3D MLLM, the model reasons and segments sequential affordance regions from complex human instructions.
Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA: By systematically replacing the image encoder (CLIP/SigLIP/SigLIP2/AIMv2) and introducing 2D-RoPE position embedding within the LLaVA framework, this study reveals that the spatial reasoning capability of VLMs is primarily determined by the encoder's training objective, and relying solely on 2D positional structures to improve spatial understanding is insufficient.
Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World: This paper introduces Dyn-Bench, the first large-scale benchmark designed to systematically evaluate the capability of Multimodal Large Language Models (MLLMs) to perceive, track, and reason about dynamics in a physical 4D world. Comprising 1K videos, 7K VQA pairs, and 3K dynamic object localization pairs, it reveals that existing models fail to perform well on spatiotemporal reasoning and dynamic localization at the same time. The paper also proposes two structured enhancement methods, Mask-Guided Fusion and ST-TCM, which significantly improve performance.
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces: This paper proposes VSI-Bench, a video-based visual spatial intelligence benchmark (5000+ QA pairs), to systematically evaluate the spatial reasoning capabilities of MLLMs. The study finds that spatial reasoning is the primary bottleneck, and traditional language reasoning techniques (such as CoT) fail to improve performance, whereas explicitly generating cognitive maps can enhance spatial distance reasoning.
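
For readers unfamiliar with the Recall@1 figure quoted in the Document Haystacks entry above, the following is a minimal sketch of how Recall@k is commonly computed for retrieval. It assumes exactly one gold document per question, which may not match the benchmark's actual protocol; the function name `recall_at_k` and the document IDs are hypothetical and not taken from the paper.

```python
from typing import Sequence


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    gold_ids: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of queries whose gold document appears in the top-k results.

    Assumption: each query has exactly one gold document. `ranked_ids[i]` is
    the retriever's ranking for query i, best match first.
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have the same length")
    if not gold_ids:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)


# Hypothetical rankings for three questions (illustrative IDs only).
rankings = [["d3", "d1"], ["d2", "d5"], ["d9", "d4"]]
gold = ["d3", "d5", "d4"]

print(recall_at_k(rankings, gold, k=1))  # only the first query hits at rank 1
print(recall_at_k(rankings, gold, k=2))  # all three gold docs are in the top 2
```

With k=1 the metric only credits a query when the single top-ranked document is the gold one, which is why it is a demanding measure when each question is paired with a pile of 1000+ candidate documents.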