🧠 VLM Reasoning
🔬 ICLR2026 · 23 paper notes
🔥 Top topics: Reasoning ×20 · Multimodal/VLM ×11 · Agents ×2 · Robotics ×2
- DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage
This paper proposes DIVA-GRPO, which addresses reward sparsity and advantage vanishing in GRPO training by dynamically assessing question difficulty, adaptively generating semantically consistent variants of varying difficulty, and incorporating difficulty-weighted local-global advantage estimation. The method achieves state-of-the-art multimodal reasoning performance at the 7B model scale.
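The advantage-vanishing problem mentioned above can be illustrated with the plain group-relative advantage used in GRPO-style training. The sketch below is only that baseline, assuming population standard deviation and a small `eps` for numerical safety; it does not reproduce DIVA-GRPO's variant generation or its difficulty-weighted local-global estimator, and the function name is hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Baseline GRPO-style advantage: normalize each reward within its group.

    Assumption: population std and an additive eps; implementations differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A group with mixed outcomes gives a usable learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout fails (question too hard) or every rollout
# succeeds (question too easy) yields all-zero advantages: no gradient signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When every rollout in a group receives the same reward, each advantage is zero, which is the situation that adjusting question difficulty is meant to avoid.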
- Empowering Small VLMs to Think with Dynamic Memorization and Exploration
This paper proposes DyME (Dynamic Memorize-Explore), which progressively and dynamically alternates between an SFT memorization mode and a GRPO exploration mode, enabling—for the first time—reasoning capabilities in small-scale vision-language models (SVLMs, <1B parameters) on domain-specific tasks.
- Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences
This paper systematically evaluates VLMs' spatial reasoning capabilities over robot motion trajectories, proposing four image-querying methods that enable VLMs to select optimal motion paths based on user natural language descriptions. Results show that Qwen2.5-VL achieves 71.4% zero-shot accuracy, with smaller models achieving significant gains after fine-tuning.
- FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models
This paper introduces FRIEDA, a benchmark that systematically evaluates large vision-language models (LVLMs) on multi-step, cross-map cartographic reasoning. The strongest model, Gemini-2.5-Pro, achieves only 38.20% accuracy, far below the human baseline of 84.87%.
- GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
This paper introduces GTR-Bench, a novel benchmark for geo-temporal reasoning of moving targets in large-scale camera networks. Evaluation reveals that the strongest model, Gemini-2.5-Pro (34.9%), falls far short of human performance (78.61%), exposing three critical deficiencies in current VLMs: imbalanced utilization of spatial-temporal context, weak temporal prediction capability, and insufficient map-video alignment.
- Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification
This paper identifies a pervasive "agreement bias" in multimodal large language models (MLLMs) when used as agent behavior verifiers—whereby models systematically over-approve agent actions—and proposes Self-Grounded Verification (SGV), a two-step generation framework (first extracting behavioral priors, then performing conditioned verification) to mitigate this bias. SGV achieves up to 25 pp improvement in failure detection rate and 14 pp improvement in accuracy across web navigation, desktop manipulation, and robotic manipulation tasks.
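The two-step structure described above (priors first, then conditioned verification) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate` stands in for any text-in, text-out model call, the prompt wording is invented, and the choice to hide the trajectory from the first step is an assumption.

```python
from typing import Callable

def self_grounded_verify(
    generate: Callable[[str], str],  # hypothetical model call: prompt -> text
    task: str,
    trajectory: str,
) -> str:
    # Step 1: elicit priors about what success looks like.
    # Assumption: the trajectory is withheld so the priors are not biased by it.
    priors = generate(
        f"Task: {task}\n"
        "Describe what a successful completion of this task looks like."
    )
    # Step 2: verify the observed behavior conditioned on those priors.
    return generate(
        f"Task: {task}\n"
        f"Expected behavior: {priors}\n"
        f"Observed trajectory: {trajectory}\n"
        "Does the trajectory satisfy the expected behavior? "
        "Answer SUCCESS or FAILURE."
    )
```

Any callable that maps a prompt to a string can be passed as `generate`, which keeps the sketch independent of a specific model API.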
- MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning
This paper introduces MMR-Life, a benchmark comprising 2,646 five-choice multi-image questions based on 19,108 real-life images, covering 7 reasoning types and 21 tasks. It is the first systematic evaluation of MLLMs on multi-image reasoning in real-life scenarios. The strongest model, GPT-5, achieves only 58.69% accuracy, 14 percentage points below human performance. Key findings include that reasoning enhancement methods fail to help large models and that RL generalizes worse than Best-of-N (BoN) sampling.
- OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
Grounded in cognitive psychology, this work introduces OmniSpatial—the first comprehensive spatial reasoning benchmark—systematically covering 4 dimensions (dynamic reasoning, complex spatial logic, spatial interaction, and perspective transformation) across 50 subcategories with 8.4K manually annotated QA pairs. The strongest reasoning model, o3, achieves only 56.33% while humans reach 92.63%, revealing that complex spatial reasoning remains a fundamental bottleneck for VLMs.
- Reasoning-Driven Multimodal LLM for Domain Generalization
This paper proposes RD-MLDG, the first framework to incorporate MLLM reasoning chains into domain generalization (DG). It constructs the DomainBed-Reasoning dataset, systematically analyzes two core challenges of reasoning supervision (an optimization gap and a reasoning pattern mismatch), and addresses them jointly via MTCT (Multi-Task Cross-Training) and SARR (Self-Aligned Reasoning Regularization). The method achieves an average accuracy of 86.89% across four standard DG benchmarks, substantially surpassing GPT-4o (83.46%) and all CLIP/ViT-based methods.
- Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
This paper introduces the Ref-Adv benchmark, constructed via a three-stage pipeline: hard distractor pairing, LLM-assisted generation of minimally sufficient expressions, and unanimous verification by three annotators. The benchmark eliminates the "grounding shortcuts" present in classical referring expression comprehension (REC) datasets. Across 13 contemporary MLLMs, including GPT-4o, Gemini 2.5, and Qwen2.5-VL-72B, accuracy drops dramatically from 90%+ on RefCOCO(+/g) to 50–68% on Ref-Adv, systematically exposing severe deficiencies in complex visual reasoning and precise grounding.
- Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes
This paper proposes MV-RoboBench, the first benchmark integrating multi-view spatial reasoning with robotic manipulation tasks, systematically evaluating 40+ VLMs (open-source, closed-source, and reasoning-enhanced). The best-performing model, GPT-5, achieves only 56.4% accuracy, far below the human baseline of 91.0%. The study further reveals a positive correlation between spatial and robotic reasoning, and that performance on single-view benchmarks does not reliably transfer to multi-view settings.
- Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Inspired by the draft-then-verify paradigm of Speculative Decoding, this paper proposes Speculative Verdict (SV), which employs multiple lightweight VLMs to generate diverse reasoning paths as drafts, while a large model serves as the verdict to synthesize, verify, and correct them. Without any training, SV surpasses GPT-4o by 11.9% on information-intensive VQA and recovers correct answers in 47–53% of minority-correct cases.
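The draft-then-verdict flow described above can be sketched in a few lines. This is an illustrative, text-only simplification under stated assumptions: the drafters and the verdict model are arbitrary callables, the prompt wording is invented, and image inputs are omitted entirely.

```python
from typing import Callable, Sequence

def speculative_verdict(
    drafters: Sequence[Callable[[str], str]],  # hypothetical lightweight models
    verdict_model: Callable[[str], str],       # hypothetical large model
    question: str,
) -> str:
    # Each lightweight model produces its own reasoning path as a draft.
    drafts = [draft(question) for draft in drafters]
    numbered = "\n".join(
        f"Draft {i + 1}: {text}" for i, text in enumerate(drafts)
    )
    # The large model sees all drafts and synthesizes one answer. Because it
    # reads the reasoning rather than counting votes, a draft held by a
    # minority of drafters can still win.
    prompt = (
        f"Question: {question}\n{numbered}\n"
        "Cross-check the drafts, correct any errors, and give one final answer."
    )
    return verdict_model(prompt)
```

Unlike majority voting, the final answer here comes from a single verdict call, which is consistent with the summary's note that minority-correct cases can be recovered.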
- SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
This paper proposes SophiaVL-R1, which introduces a holistic-level thinking process reward into rule-based RL training of MLLMs. A Thinking Reward Model (TRM) is trained to evaluate reasoning quality along five dimensions (including logical soundness and redundancy). Trust-GRPO is proposed to compute a reliability weight \(\gamma\) from the contrast of thinking rewards between correct and incorrect answer groups, mitigating reward hacking. A time-based annealing strategy \(e^{-\text{steps}/T}\) gradually reduces the thinking reward contribution so that the model relies more on accurate rule-based rewards in later training. The resulting 7B model comprehensively outperforms LLaVA-OneVision-72B on multiple benchmarks, including MathVista (71.3%) and MMMU (61.3%).
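The annealing factor \(e^{-\text{steps}/T}\) can be made concrete with a short sketch. The decay term follows the summary above; the additive way it is combined with the rule-based reward and \(\gamma\) is an assumption made for illustration, and the function names are hypothetical.

```python
import math

def thinking_reward_weight(step: int, T: float) -> float:
    """Annealing factor e^{-step/T}: 1.0 at step 0, decaying toward 0."""
    return math.exp(-step / T)

def combined_reward(rule_reward: float, thinking_reward: float,
                    gamma: float, step: int, T: float) -> float:
    # Assumed combination: rule-based reward plus the thinking reward, scaled
    # by the reliability weight gamma and the annealing factor.
    return rule_reward + gamma * thinking_reward_weight(step, T) * thinking_reward

# Early in training the thinking reward contributes fully ...
early = combined_reward(1.0, 0.8, gamma=0.5, step=0, T=100.0)
# ... and late in training the rule-based reward dominates.
late = combined_reward(1.0, 0.8, gamma=0.5, step=1000, T=100.0)
```

With these illustrative values the thinking term shrinks from 0.4 at step 0 to nearly zero by step 1000, so later updates rely almost entirely on the rule-based reward.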
- Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models
This paper proposes Spatial-DISE, a unified spatial reasoning benchmark grounded in a cognitive-science-based 2×2 taxonomy (Intrinsic/Extrinsic × Static/Dynamic). The benchmark comprises 559 evaluation VQA pairs and 12K+ training instances. Evaluation across 32 state-of-the-art VLMs reveals a substantial gap between model performance and human-level capability, particularly on dynamic spatial reasoning tasks such as mental rotation and folding.
- Spatial CAPTCHA: Generatively Benchmarking Spatial Reasoning for Human-Machine Differentiation
This paper proposes Spatial CAPTCHA, a novel human verification framework grounded in 3D spatial reasoning. It exploits fundamental capability gaps between humans and multimodal large language models (MLLMs) across geometric reasoning, perspective-taking, occlusion handling, and mental rotation tasks to distinguish humans from machines. The best-performing MLLM achieves only 31.0% Pass@1 accuracy, far below human performance.
- Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA
Through controlled experiments within the LLaVA framework, this paper systematically investigates how image encoder training objectives and 2D positional encoding affect the spatial reasoning capabilities of VLMs. The study finds that encoder choice dominates spatial performance and that AIMv2 yields the most consistent results, whereas improvements from 2D-RoPE are unstable. These findings indicate that spatial reasoning failures are rooted in core design choices of current VLM pipelines.
- SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
This paper introduces SpatiaLab, a real-world spatial reasoning benchmark comprising 1,400 visual QA pairs spanning 30 subcategories across 6 major spatial task categories. Supporting both MCQ and open-ended evaluation formats, SpatiaLab reveals a substantial gap between the strongest current VLMs (InternVL3.5-72B: 54.93% MCQ) and humans (87.57%), with the gap widening further under open-ended settings.
- SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
This paper introduces SpinBench, a cognitively grounded diagnostic benchmark that systematically evaluates spatial reasoning in 37 VLMs through 7 progressively structured task categories—ranging from object identity recognition to perspective taking—revealing systemic deficiencies including egocentric bias and weak rotation understanding.
- ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
ThinkOmni is a training-free framework that leverages a text-only large reasoning model (LRM) to guide an omni-modal LLM (OLLM) during decoding via Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals. The method achieves 70.2% on MathVista and 75.5% on MMAU, matching or surpassing RFT-based approaches.
- Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
This paper proposes VC-STaR (Visual Contrastive Self-Taught Reasoner), motivated by the observation that VLMs perceive visual content more accurately when comparing two similar images. The framework constructs contrastive VQA pairs to elicit more faithful visual analysis from the model, and an LLM then integrates this contrastive analysis into reasoning chains, yielding the high-quality visual reasoning dataset VisCoR-55K. Fine-tuning on this dataset yields gains of 5.7% on MMVP and 3.2% on Hallusion.
- VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL
VidGuard-R1 is the first video authenticity detector that fine-tunes an MLLM with GRPO (Group Relative Policy Optimization). By constructing a 140K shortcut-free real/fake video dataset and designing two specialized reward mechanisms—temporal artifact reward and diffusion-step quality reward—it achieves 86.17% accuracy on its in-house dataset and 95%+ zero-shot SOTA performance on GenVidBench and GenVideo benchmarks, while generating interpretable chain-of-thought reasoning.
- VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?
This paper introduces VLM-SubtleBench, a benchmark for evaluating vision-language models on subtle difference comparative reasoning, covering 10 difference types and 6 image domains (natural, gaming, industrial, aerial, medical, and synthetic). It reveals a performance gap of over 30% between VLMs and humans on spatial, temporal, and viewpoint reasoning tasks.
- VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
This paper proposes VTool-R1, the first framework that trains VLMs via reinforcement fine-tuning to generate interleaved textual and visual intermediate reasoning steps, enabling models to "think with images."