🧠 VLM Reasoning

🧪 ICML2026 · 20 paper notes

📌 Same area in other venues: 📷 CVPR2026 (144) · 💬 ACL2026 (31) · 🔬 ICLR2026 (23) · 🤖 AAAI2026 (10) · 🧠 NeurIPS2025 (30) · 📹 ICCV2025 (13)

🔥 Top topics: Reasoning ×17 · Multimodal/VLM ×13 · LLM ×2

3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models: 3ViewSense argues that the bottleneck in VLM spatial reasoning is not insufficient visual features or weak language reasoning, but the absence of a stable 3D intermediate representation. It therefore requires the model to first infer front, left, and top views from a single image and then reason over these orthographic views, significantly outperforming same-scale VLMs in occlusion counting and view-consistent spatial reasoning.
Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models: This paper transforms VLM spatial reasoning from a passive process of "viewing all perspectives before answering" into an agentic workflow of "active framing based on the question, updating cognitive maps, and verifying reasoning with executable spatial assertions." By fine-tuning Qwen2.5-VL-3B with dense rewards, it achieves 80.5% overall accuracy on MindCube-Tiny, notably improving the Rotation subset to 85.0%.
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning: This paper explicitly splits VLM output into <recognition> perception blocks and <think> reasoning blocks. It introduces a "blindfolded" text reasoning agent (which has no access to images and only sees the perception text written by the VLM) to determine a perception reward \(R_P\) based on its ability to answer correctly. Combined with Structured Verbal Verification (SVV) for output reward \(R_O\), the proposed MoCA uses \(R_P\) as a gate for modal-level credit assignment. This allows a 7B model to improve across 9 perception/reasoning/rich-modality benchmarks simultaneously, surpassing GPT-4o on multiple metrics.
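One way to picture "\(R_P\) as a gate" is the minimal sketch below. It is an illustration only: the note does not give MoCA's actual combination rule, so the function name `assign_credit`, the threshold, and the choice to zero out the reasoning credit when perception fails are all assumptions.

```python
def assign_credit(r_p: float, r_o: float, threshold: float = 0.5) -> dict:
    """Hypothetical gating rule, not the paper's formula.

    r_p: perception reward (did the blindfolded text agent answer correctly
         from the <recognition> text alone?)
    r_o: output reward for the final answer.
    """
    perception_ok = r_p >= threshold
    return {
        # The perception block is always scored by the perception reward.
        "recognition": r_p,
        # Assumption: the reasoning block is only credited for a correct
        # answer when the perception text was sufficient to support it.
        "think": r_o if perception_ok else 0.0,
    }
```

Under this assumed rule, a correct final answer built on an inadequate perception block earns no reasoning credit, which is one possible reading of modal-level credit assignment.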
Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners: To address the "understanding–generation gap" (models that understand a request but fail to generate the matching image) in unified multimodal models for anything-to-image (X2I) tasks, this paper proposes the Self-Adaptive Interleaved Reasoner. A hierarchical data synthesis pipeline routes 50,000 samples into three modes: direct generation, self-reflection, and multi-step planning. Trained with SFT and GRPO using step-wise reasoning rewards and intra-group complexity penalties, Emu3.5 outperforms closed-source models such as GPT-4o and Gemini 2.5 Flash on KRIS-Bench and OmniContext.
Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding: The authors decompose the KL loss of multimodal on-policy distillation into two sub-objectives, "language prior" and "visual grounding," based on the Bayes chain rule. They find that the gradients of the two sub-objectives are nearly orthogonal and that standard distillation merely follows their passive bisector. Consequently, they propose Visual Gradient Steering (VGS) to actively bias the update direction toward the visual subspace, achieving average improvements of +2.37% and +1.56% across seven multimodal reasoning benchmarks for Qwen3-VL 8B→2B and 8B→4B distillation, respectively.
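The "passive bisector" observation can be illustrated with two toy gradient vectors. This is a sketch under stated assumptions: the function `steer` and the scalar weight `lam` are hypothetical stand-ins for whatever steering rule VGS actually uses, and the two-dimensional gradients are made up for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def steer(g_lang, g_vis, lam=1.0):
    """Hypothetical steering rule: up-weight the visual-grounding gradient.

    lam = 0 recovers the plain sum of the two gradients, i.e. the update
    that standard distillation would take.
    """
    return [gl + (1.0 + lam) * gv for gl, gv in zip(g_lang, g_vis)]

# Toy, exactly orthogonal gradients of equal norm.
g_lang = [1.0, 0.0]
g_vis = [0.0, 1.0]

standard = steer(g_lang, g_vis, lam=0.0)  # [1.0, 1.0]
steered = steer(g_lang, g_vis, lam=1.0)   # [1.0, 2.0]
```

With orthogonal, equal-norm gradients the plain sum is equally aligned with both sub-objectives, while any positive `lam` tilts the update toward the visual direction.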
Efficient Reasoning with Hidden Thinking: Heima distills each stage (summary / caption / reasoning) of a multimodal LLM's lengthy Chain-of-Thought (CoT) into a single special thinking token. This allows the model to "think" in latent space, reducing the token count from the 100-200 range to 13-16 while maintaining zero-shot accuracy more stably than LLaVA-CoT. A companion LLM "interpreter" is trained to reconstruct the textual reasoning chain from the hidden states of these thinking tokens, thereby empirically validating the information-theoretic upper bound of compression loss.
Find, Fix, Reason: Context Repair for Video Reasoning: This paper addresses the dilemma in video reasoning where on-policy RL stagnates at capability ceilings while off-policy distillation suffers from entropy collapse. It introduces a frozen, tool-augmented large teacher model that inserts minimal "evidence patches" (key-frame intervals, error types) when student rollouts fail. The student re-answers the same question under these refined conditions, and the repaired trajectories are incorporated into GRPO optimization via a chosen-rollout mechanism.
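The stagnation problem is easy to see in GRPO's group-normalized advantage, sketched below. The normalization itself is the standard GRPO form; how a repaired trajectory enters the group is not described in the note, so simply swapping one failed reward for a successful one is an assumption made for illustration.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Every rollout in the group fails: all advantages are zero, so the
# policy gradient carries no learning signal for this question.
all_failed = group_advantages([0.0, 0.0, 0.0, 0.0])

# Assumption: one failed rollout is replaced by a repaired, successful one.
with_repair = group_advantages([0.0, 0.0, 0.0, 1.0])
```

In the second group the repaired rollout receives a positive advantage and the remaining failures a negative one, which restores a usable gradient.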
From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models: This paper identifies that current VLM post-training overemphasizes "long-chain reasoning" while neglecting perception bottlenecks. It explicitly decouples post-training into three independent stages: "Visual Perception → Textual Reasoning → Visual Reasoning," and utilizes RLVR (instead of caption SFT) to specifically refine perception. This approach enables Qwen3-VL-8B to achieve relative improvements of approximately +5.9% and +1.2% on visual math and perception benchmarks, respectively, while shortening reasoning traces by 20.8%.
iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning: Addressing the counter-intuitive phenomenon where "explicit visual grounding hinders CoT reasoning," the authors propose iVGR, a dual-stream GRPO training framework. It rolls out textual CoT and grounded CoT (with boxes) simultaneously and uses a consistency reward to "internalize" the visual localization capabilities of high-quality grounded trajectories into pure textual CoT. This enables the model to reap the benefits of grounded reasoning during inference without explicitly outputting coordinates.
Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training: VISTA transforms the self-improvement training of multimodal large models into a two-stage pipeline: "supplementing samples for hard problems via prefix resampling, and filtering pseudo-positives via Vision-aware Attention Score (VAS)." It achieves an average improvement of +13.66% in multimodal reasoning for mathematics and medicine on Qwen2.5-VL-3B.
Learning GUI Grounding with Spatial Reasoning from Visual Feedback: Gui-Cursor reformulates GUI grounding from "single-step coordinate prediction" into an interactive search of "moving the cursor to find the target." By utilizing a dense reward function with trajectory penalties and GRPO training, the VLM learns to align numerical coordinates with screen positions via visual feedback from rendered cursors. Using only 8K samples, it improves GPT-4o-level performance on ScreenSpot-Pro from GTA1's 50.1% to 58.1%.
LIMSSR: LLM-Driven Sequence-to-Score Reasoning under Training-Time Incomplete Multimodal Observations: The authors reformulate multimodal Action Quality Assessment (AQA) with "missing modalities during training" as an "LLM-based conditional sequence-to-score reasoning" problem. By using prompts and special tokens, the LLM completes missing semantics without full data supervision. Combined with mask-aware dual-path fusion to suppress hallucinations, the method outperforms state-of-the-art methods that rely on complete training data across three AQA datasets.
CSMR (Look on Demand): A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning: Inspired by Baddeley's working memory theory, CSMR treats "when to introduce visual evidence into reasoning" as a dynamic decision. The LLM maintains the reasoning state and invokes an independent perception module (a VLM) to fetch visual evidence on demand until the evidence is sufficient. This addresses flaws in two existing paradigms (information loss from static textualization before reasoning, and language-prior contamination in unified VL spaces) and outperforms baselines zero-shot across multiple multimodal reasoning benchmarks.
R\(^3\)L: Reasoning 3D Layouts from Relative Spatial Relations: R³L attributes two types of systematic errors (semantic drift and metric drift) in MLLM multi-hop "relative spatial relation" reasoning to "repeated frame transformations." By implementing Invariant Spatial Decomposition (shortening relation chains), Consistent Spatial Imagination (an imagine-and-revise loop for conflict elimination), and Support Spatial Optimization (global-to-local pose reparameterization), it enables GPT-5 to generate open-vocabulary 3D scenes across 9 categories with collision and out-of-bounds rates near zero, significantly outperforming LayoutVLM/Holodeck/LayoutGPT in semantic metrics.
ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning: This paper systematically reveals structural failures in the widely used VSI-Bench due to 3D label drift and frame sampling inconsistency. The authors re-label 381 scenes and 5,365 objects, design frame-budget adaptive QA and "dummy video" (removing query object frames) stress tests to construct ReVSI, a high-fidelity spatial intelligence benchmark. Evaluations show that open-source VLMs drop by up to 40% on ReVSI and maintain high hallucination rates on dummy videos, exposing a systematic overestimation of existing spatial reasoning capabilities.
Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning: SpecFlow shifts multimodal spatial reasoning from "pixel-level thinking" to "spectral-level thinking": it uses Block Discrete Cosine Transform (BDCT) + Flow Matching + Progressive Frequency Activation to maintain visual intermediate thoughts in a fixed-size spectral workspace. Combined with Classifier-Free Guidance (CFG) for text-guided visual evolution, it reduces KV cache by 1.6–2.1× while maintaining spatial reasoning accuracy.
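To make the spectral workspace idea concrete, here is a small pure-Python sketch of a 2D block DCT with a coarse-to-fine frequency mask. The DCT is the standard orthonormal DCT-II; the `activate` schedule (keep coefficients whose frequency indices satisfy u + v < stage) is a hypothetical stand-in for SpecFlow's Progressive Frequency Activation, and Flow Matching and CFG are not modelled.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows are frequencies)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def block_dct(block):
    """2D DCT of a square block: C @ X @ C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))

def block_idct(coeffs):
    """Inverse 2D DCT: C^T @ Y @ C (valid because C is orthonormal)."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)

def activate(coeffs, stage):
    """Assumed schedule: keep only coefficients with u + v < stage."""
    n = len(coeffs)
    return [[coeffs[u][v] if u + v < stage else 0.0 for v in range(n)]
            for u in range(n)]

# A 4 x 4 block with a diagonal intensity ramp.
block = [[float(i + j) for j in range(4)] for i in range(4)]
coeffs = block_dct(block)
coarse = block_idct(activate(coeffs, 1))  # DC only: the block mean everywhere
full = block_idct(activate(coeffs, 7))    # all frequencies: exact reconstruction
```

The number of coefficients stays fixed at every stage; only how many of them are non-zero grows, which is the sense in which the workspace has a fixed size in this sketch.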
The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design: This paper formalizes "VLM inability to see details" as a Sequential Bayesian Optimal Experimental Design (S-BOED) problem. It proposes the training-free FOVEA module based on a computable proxy objective of "coverage \(\times\) resolution," consistently outperforming Direct and ReAct-style baselines on high-resolution and remote sensing benchmarks.
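A toy version of a "coverage \(\times\) resolution" score is sketched below. Every definition here is an assumption for illustration: the note does not specify how FOVEA computes either factor, the region of interest `roi` stands in for whatever belief the method maintains about where the answer lies, and `input_side` is a hypothetical fixed encoder input size. The sequential Bayesian part of S-BOED is not modelled.

```python
def area(box):
    """Area of an (x0, y0, x1, y1) box."""
    return (box[2] - box[0]) * (box[3] - box[1])

def overlap_area(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def score(crop, roi, input_side=448):
    """Assumed proxy: fraction of the ROI seen times retained resolution."""
    coverage = overlap_area(crop, roi) / area(roi)
    longest = max(crop[2] - crop[0], crop[3] - crop[1])
    # Crops larger than the encoder input are downsampled and lose detail.
    resolution = min(1.0, input_side / longest)
    return coverage * resolution

def best_crop(crops, roi, input_side=448):
    return max(crops, key=lambda c: score(c, roi, input_side))

roi = (1000, 1000, 1400, 1400)
full_image = (0, 0, 4000, 4000)     # sees everything, heavily downsampled
tight = (1000, 1000, 1448, 1448)    # covers the ROI at native resolution
partial = (1000, 1000, 1200, 1400)  # sharp, but sees only half the ROI
choice = best_crop([full_image, tight, partial], roi)
```

The product penalizes both failure modes: the full image scores low on resolution, the partial crop scores low on coverage, and the tight crop wins.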
Thinking in Structures: Evaluating Spatial Intelligence in Constraint-Governed Spaces: The authors construct SSI-Bench, a benchmark consisting of 1,000 ranking-style VQA items focusing on "constrained structured spaces" (real 3D structures like roofs, bridges, towers), requiring VLMs to provide a complete permutation of 3-4 candidate components according to geometric or topological criteria. Evaluation of 31 VLMs reveals that the strongest closed-source model, Gemini-3-Flash, achieves only 33.6%, and the best open-source model, GLM-4.6V, reaches 22.2%, compared to a human performance of 91.6%. This highlights a lack of consistent spatial reasoning capabilities in current VLMs when facing real-world 3D scenes jointly constrained by geometry, connectivity, and physical feasibility.
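For context on these numbers, the sketch below scores ranking answers and computes the random-guess baseline. It assumes an item counts as correct only when the full permutation matches exactly; the note does not state SSI-Bench's scoring rule, and the component labels are invented for the example.

```python
from math import factorial

def exact_match(pred, gold):
    """Assumed scoring: the whole ordering must match."""
    return list(pred) == list(gold)

def accuracy(preds, golds):
    return sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)

def chance_accuracy(n_candidates):
    """Probability that a uniformly random permutation is exactly right."""
    return 1.0 / factorial(n_candidates)

golds = [["A", "B", "C"], ["B", "D", "A", "C"]]
preds = [["A", "B", "C"], ["B", "A", "D", "C"]]  # second item has one swap
acc = accuracy(preds, golds)
```

Under this assumption, random guessing yields 1/6 (about 16.7%) on 3-candidate items and 1/24 (about 4.2%) on 4-candidate items, so a single misplaced component costs the whole item.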
Vision-aligned Latent Reasoning for Multi-modal Large Language Model: This paper proposes VaLR: inserting several "latent tokens" before each step of MLLM CoT reasoning and performing representation alignment (REPA) on these tokens using patch features from visual encoders like DINOv3, SigLIP, or π³. This mechanism continuously "feeds back" visual information to the model during long-chain reasoning, increasing the accuracy of Qwen2.5-VL on VSI-Bench from 33.0% to 52.9%, and for the first time enabling MLLMs to exhibit "longer reasoning, higher accuracy" test-time scaling behavior.
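A representation-alignment objective of this kind can be sketched as a cosine-similarity loss between latent tokens and encoder patch features. The sketch assumes a one-to-one pairing between latent tokens and target features of equal dimension and omits any projection layer; the exact pairing and loss used by VaLR are not given in the note.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_loss(latents, targets):
    """1 - mean cosine similarity over paired (latent, target) vectors.

    latents: hidden states of the inserted latent tokens (assumed already
             projected to the encoder's feature dimension).
    targets: frozen visual-encoder features paired with those tokens.
    """
    sims = [cosine(l, t) for l, t in zip(latents, targets)]
    return 1.0 - sum(sims) / len(sims)

aligned = alignment_loss([[1.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 1.0]])
unaligned = alignment_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

The loss is zero when every latent token points in the same direction as its target feature, regardless of scale, and grows as the two drift apart.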
What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity (GLANCE): GLANCE introduces a "think-see alignment" self-supervised head to VLM agent reinforcement learning. It maps the "next-state prediction" produced in the LLM's CoT through a lightweight projector to the actual next-frame representation encoded by an EMA target vision encoder. The gap between prediction and reality serves simultaneously as an intrinsic curiosity reward, a training signal for the vision encoder, and an alignment loss to "ground" the internalized world model. Combined with a curriculum exploration mechanism that periodically resets the projector to combat curiosity drain, GLANCE consistently outperforms existing exploitation-only VLM-RL methods across five agentic tasks.
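The core signal can be sketched as a prediction error against a slowly moving target. Both functions below are illustrative assumptions: mean squared error is one plausible choice for the prediction-reality gap, and the EMA decay value is arbitrary. The projector, the curriculum resets, and the RL loop are not modelled.

```python
def curiosity_reward(predicted, actual):
    """Mean squared gap between the projected next-state prediction and
    the target encoder's embedding of the frame that actually followed."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def ema_update(target_params, online_params, decay=0.99):
    """Move target-encoder parameters slowly toward the online encoder."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]

# A well-predicted transition yields no intrinsic reward ...
familiar = curiosity_reward([0.2, 0.4], [0.2, 0.4])
# ... while a surprising one yields a positive reward.
surprising = curiosity_reward([0.0, 0.0], [1.0, 1.0])

new_target = ema_update([0.0, 1.0], [1.0, 1.0], decay=0.9)
```

Because the same gap is also minimized as a training loss, the reward shrinks as the model learns its environment; this is the curiosity drain that the periodic projector reset is meant to counter.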