
🧠 VLM Reasoning

🔬 ICLR2026 · 112 paper notes

📌 Same area in other venues: 📷 CVPR2026 (150) · 💬 ACL2026 (32) · 🧪 ICML2026 (31) · 🤖 AAAI2026 (10) · 🧠 NeurIPS2025 (30) · 📹 ICCV2025 (15)

🔥 Top topics: Reasoning ×93 · Multimodal/VLM ×54 · Reinforcement Learning ×7 · LLM ×6 · Layout & Composition ×5

AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning: AdaReasoner teaches Multimodal Large Language Models (MLLMs) to dynamically orchestrate a set of visual tools during multi-turn visual reasoning. Through a two-stage training process of "Tool Cold Start + Multi-turn Tool GRPO," it enables a small 7B model to autonomously select and discard tools and to adjust how often it calls them. It achieves an average performance gain of +38.7%, reaching a near-perfect score of 97.6% on VSP, surpassing GPT-5 and Claude Sonnet 4.
Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks: Agent-X is a large-scale benchmark for "vision-centric agents," covering 6 types of scenarios with 828 real-world multimodal tasks (image/multi-image/video/instructional text). It features a fine-grained "step-level + reasoning chain + outcome" three-mode evaluation system. Results indicate that even the strongest models from GPT, Gemini, and Qwen series achieve full-link success rates below 50%, exposing significant flaws in current LMMs regarding multi-step visual reasoning and tool invocation.
Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models: AGILE redefines "jigsaw puzzle solving" as an interactive process where the model generates code and observes feedback. Combined with infinitely scalable procedurally synthesized data, cold-start SFT, and GRPO reinforcement learning, it improves Qwen2.5-VL-7B accuracy on 2×2 puzzles from 9.5% to 82.8% and achieves an average gain of 3.1% across 9 general vision benchmarks.
ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping: ARES utilizes "window entropy" as an exploration trigger and controls exploration depth through a difficulty-aware hierarchical entropy reward. This allows multimodal reasoning models to think less on simple problems and more on difficult ones, simultaneously improving accuracy and reasoning efficiency across mathematical, logical, and multimodal benchmarks.
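The "window entropy" trigger in the ARES entry above can be illustrated with a minimal sketch. The summary does not give the exact definition, so the sliding-window mean, the window size, the threshold, and all function names below are assumptions made for illustration, not the paper's implementation.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def window_entropies(entropies, window):
    """Mean entropy over each sliding window of `window` consecutive tokens.
    Assumption: a plain average; the paper may aggregate differently."""
    return [
        sum(entropies[i:i + window]) / window
        for i in range(len(entropies) - window + 1)
    ]

def trigger_positions(entropies, window, threshold):
    """Window start indices whose mean entropy exceeds a (hypothetical) threshold."""
    return [
        i for i, h in enumerate(window_entropies(entropies, window))
        if h > threshold
    ]

# Toy per-token entropies: the model is uncertain in the middle of the sequence.
per_token = [0.0, 1.0, 1.0, 0.0]
hot_windows = trigger_positions(per_token, window=2, threshold=0.75)
```

Averaging over a window rather than reacting to a single token makes the trigger less sensitive to one-off uncertain tokens, which is presumably the motivation for a windowed measure.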
AutoGPS: Automated Geometry Problem Solving via Multimodal Formalization and Deductive Reasoning: AutoGPS utilizes a neuro-symbolic synergistic framework consisting of a "Multimodal Problem Formalizer (MPF) + Deductive Symbolic Reasoner (DSR)." It first translates plane geometry problems into formal language and then performs rigorous deduction via hypergraph expansion. This process yields both a correct answer and a traceable step-by-step solution, achieving SOTA on Geometry3K / PGPS9K and improving human-evaluated logical accuracy from ~71% to 99%.
Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks: This paper identifies that existing medical VLM benchmarks focus only on classification accuracy, creating an "evaluation illusion." It proposes a "Breadth-Depth" two-axis evaluation framework and builds Neural-MedBench, a deep reasoning benchmark for neurology (120 multimodal cases, 200 reasoning tasks). Empirical results show that top models like GPT-5, Claude-4, and MedGemma fail collectively in deep reasoning, with failures primarily stemming from reasoning rather than perception.
Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs: Inspired by the Wechsler Intelligence Scale for Children, KidGym decomposes "general intelligence" into five measurable abilities: Execution, Perceptual Reasoning, Learning, Memory, and Planning. It comprises 12 2D grid interaction tasks at three difficulty levels and is customizable and dynamic. It systematically reveals significant shortcomings of current top MLLMs in non-semantic abstract vision, quantity perception, and composite ability tasks.
CircuitSense: A Hierarchical MLLM Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process: CircuitSense establishes the first MLLM benchmark organized by engineering abstraction levels, emphasizing the derivation of symbolic equations from circuit schematics. Using 8,006 problems (human-curated + synthetically generated), it systematically evaluates 8 MLLMs, revealing a fundamental gap where closed-source models exceed 85% in perception tasks but plummet below 19% in symbolic derivation.
CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs: CompoDistill finds that existing knowledge distillation (KD) for Multimodal Large Language Models (MLLMs) only acquires "visual recognition" but fails in "visual perception," rooted in the misalignment of visual attention distributions between teacher and student. It introduces a VAT module to align student visual attention to the teacher and a TAF module to reuse the teacher's adapter. With a three-stage training strategy, it elevates a 2B student's Compositional Reasoning (CR) average from 61.5 to 66.7, approaching a 4B teacher without degrading VQA performance.
Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning: This paper employs an evaluation framework based on propositional logic and "six interaction modes" that split facts across modalities. It systematically demonstrates that the true bottleneck of Multimodal Large Language Model (MLLM) reasoning lies in "integration" rather than "perception." Through attention probes and causal interventions, two root causes are identified: the task-composition bottleneck (identification and reasoning cannot be jointly performed in a single forward pass) and the fusion bottleneck (modality fusion in early layers introduces bias). The authors also provide two lightweight remedies: "two-step prompting" and "early-layer attention warming."
Composition-Grounded Data Synthesis for Visual Reasoning: COGS decomposes a small set of seed questions into atomic "perception + reasoning" factors, then recombines these factors with new images to generate large-scale synthetic QA pairs containing sub-questions/intermediate answers. Using factor-level process rewards for reinforcement learning, it enables MLLMs to acquire transferable complex reasoning capabilities in "image-rich but annotation-scarce" artificial image domains like charts and webpages.
CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation: CoPRS enables multimodal large models to perform chain-of-thought reasoning before outputting a "focus token," which is converted into a dense, differentiable heatmap serving as a positional prior. A lightweight decoder then refines this prior into a segmentation mask, achieving SOTA performance on the RefCOCO series and ReasonSeg with interpretably aligned reasoning and segmentation.
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning: DeepEyes enables vision-language models (VLMs) to internalize "zooming into images" as an inherent action within the reasoning chain. Without relying on SFT cold-start or external tools, end-to-end reinforcement learning teaches the model to actively crop and zoom into key regions during reasoning. This elevates a 7B model from 71.2% to 90.1% on the V* high-resolution benchmark.
DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage: This paper proposes DIVA-GRPO, which addresses reward sparsity and advantage vanishing in GRPO training by dynamically evaluating question difficulty, adaptively generating semantically consistent variants of different difficulty levels, and combining difficulty-weighted local-global advantage estimation. It achieves SOTA multimodal reasoning performance on 7B-scale models.
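The "advantage vanishing" problem named in the DIVA-GRPO entry above follows from how standard GRPO standardizes rewards within a group of sampled responses. The sketch below shows only that baseline normalization; it is not DIVA-GRPO's difficulty-weighted local-global estimator, and the function name and `eps` value are illustrative assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: standardize rewards within one group
    of responses sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed outcomes: correct responses get positive advantages, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform outcomes (question far too hard or far too easy for the policy):
# every advantage is zero, so the group contributes no learning signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Generating variants of a question at other difficulty levels, as the summary describes, is one way to get groups with mixed outcomes again.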
Efficient Multimodal Spatial Reasoning via Dynamic and Asymmetric Routing: This paper proposes DARE, which utilizes "differentiable dynamic routing across layers and hops" for asymmetric preservation of vision and text tokens. On multimodal spatial reasoning tasks, it reduces FLOPs by an average of 40.37% and KV-cache by 46.07%, while accuracy in most tasks actually improves.
Empowering Small VLMs to Think with Dynamic Memorization and Exploration: The paper proposes DyME (Dynamic Memorize-Explore), which enables small-scale vision-language models (<1B parameters) to achieve reasoning capabilities on specific tasks for the first time by dynamically switching between SFT memorization and GRPO exploration modes.
Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory: Classic Item Response Theory (IRT) is extended into "modality-decomposed" versions (M2IRT / M3IRT), where model ability and item difficulty are decomposed into "image-only / text-only / cross-modal integration" components. This enables the identification of tasks requiring genuine cross-modal reasoning, the elimination of shortcut items solvable via single modalities, and the restoration of model rankings using only 1%–10% of the original benchmark size.
ExpVid: A Benchmark for Experiment Video Understanding & Reasoning: ExpVid is the first benchmark to systematically evaluate the capability of Multimodal Large Language Models (MLLMs) in understanding real-world wet-lab experiment videos. Using a three-level task hierarchy of "Fine-grained Perception \(\rightarrow\) Procedure Understanding \(\rightarrow\) Scientific Reasoning," it reveals that current models excel at coarse-grained recognition but suffer significant performance drops in detail identification, state tracking, and "inferring scientific conclusions from operations."
FlowGen: Synthesizing Diverse Flowcharts to Enhance and Benchmark MLLM Reasoning: This paper proposes FlowGen, a controllable flowchart synthesizer that utilizes seven structural parameters and four rendering backends to generate diagrams on-demand. It synthesizes massive training data to significantly enhance the flowchart parsing capabilities of open-source MLLMs (approaching closed-source models) and generates a rigorous benchmark where even GPT-4o fails to achieve a 25% F1 score.
Fostering Video Reasoning via Next-Event Prediction: This paper proposes Next-Event Prediction (NEP), a learning task that splits video into "past" and "future" segments. It requires MLLMs to predict textual descriptions of future events based solely on past frames, leveraging the video's inherent future content as a self-supervised signal to elicit temporal reasoning capabilities. The work also introduces the V1-33K training set and the FutureBench evaluation benchmark.
FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting: FrameThinker enables visual language models to "think while watching long videos" like a detective—initially performing a sparse scan, then "zooming in" to key segments for multi-turn frame selection based on reasoning needs. Using SFT for action syntax and RL for decision strategy, it achieves a new SOTA of 76.1% on LongVideo-Reason with an average of 20.6 frames (compared to 512 frames used by competitors).
FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models: The FRIEDA benchmark is proposed to systematically evaluate the multi-step, cross-map cartographic reasoning capabilities of large vision-language models. Results show the strongest model, Gemini-2.5-Pro, achieves only 38.20% accuracy, significantly lower than the human performance of 84.87%.
Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning: Game-RL "distills" game code into verifiable VQA data with step-by-step analysis (GameQA: 30 games / 158 tasks / 140,000 questions). With GRPO reinforcement learning performed solely on this game data, multiple VLMs achieve consistent performance improvements across seven completely out-of-domain (OOD) general vision reasoning benchmarks.
Generative Universal Verifier as Multimodal Meta-Reasoner: This paper elevates the task of "checking whether visual outcomes satisfy task requirements" to a fundamental capability of multimodal reasoning systems. The authors construct ViVerBench to evaluate existing VLM shortcomings in visual verification, train OmniVerifier-7B as a generative universal verifier, and employ OmniVerifier-TTS to convert verification feedback into multi-turn image editing during test-time, thereby improving the quality of complex text-to-image and reasoning-based generation.
GIR-Bench: Versatile Benchmark for Generating Images with Reasoning: GIR-Bench systematically quantifies the understanding-generation gap in unified multimodal models—where models "can reason but cannot draw"—using three complementary subsets (UGC, T2I, Edit) and a task-specific, programmable verification pipeline, effectively bypassing the biases of MLLM-as-a-Judge.
GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models: This paper proposes GTR-Bench, a new benchmark for geo-temporal reasoning of moving targets in large-scale camera networks. Evaluation reveals that the strongest model, Gemini-2.5-Pro (34.9%), lags significantly behind human performance (78.61%), uncovering three major flaws in current VLMs: imbalanced spatio-temporal context utilization, weak temporal prediction ability, and insufficient map-video alignment.
InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models: InternSpatial constructs a large-scale open dataset and diagnostic evaluation set for VLM spatial reasoning. A unified data engine organizes single-view and multi-view inputs, diverse scenarios, and various visual/textual instruction formats into over 12 million QA pairs. Models trained on this data achieve significant improvements on spatial reasoning benchmarks while maintaining general multimodal capabilities.
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs: IV-Bench is the first benchmark for "Image-Grounded Video Perception and Reasoning"—using an externally sourced reference image as visual context to query video content. With 966 videos, 2,560 image-text queries, and 13 task categories, it reveals that the strongest MLLMs achieve only 28.9% accuracy (compared to 88.8% for humans), exposing the fragility of current models' ability to understand videos via visual anchors.
JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation: JointAVBench is the first "audio-visual strongly correlated" joint reasoning benchmark for Omni-LLMs, covering 5 cognitive dimensions, 4 types of audio signals, and 3 scene spans for a total of 15 tasks. It utilizes a semi-automated pipeline to synthesize 2,853 multiple-choice questions from movies that require audio-visual synergy to solve; even the strongest model achieves only 65.3% accuracy.
JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA: JUDO utilizes "juxtaposed normal-defect images" for fine-grained segmentation reasoning, internalizes industrial domain knowledge into model parameters via SFT, and unifies visual grounding with domain semantics using multi-reward GRPO. Using a 7B model, it outperforms GPT-4o and Qwen2.5-VL on the MMAD benchmark.
Latent Visual Reasoning: LVR enables Multimodal Large Language Models (MLLMs) to move beyond "thinking" solely in text space. Instead, it uses the LLM's last hidden states to autoregressively reconstruct question-related visual semantics directly in the visual embedding space ("think before speaking"). Combined with a modified GRPO reinforcement learning, this approach significantly outperforms the "Think about/with Images" paradigms on perception-intensive VQA tasks.
LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models: LENS constructs a unified distribution benchmark consisting of three levels and eight tasks ("Perception-Understanding-Reasoning") using the same set of 3.4K contemporary social media images paired with 60K+ human-annotated questions. It specifically quantifies the synergistic effect of low-level perception on high-level reasoning and proposes SMEC, a self-driven multi-expert collaboration framework without external tools, to improve complex reasoning performance.
Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification: This paper identifies a severe "agreement bias" in Multimodal Large Language Models (MLLMs) when serving as agent behavior verifiers—a systematic over-approval of agent actions. It proposes Self-Grounded Verification (SGV), a two-step generation method (extracting behavioral priors before conditional verification) to mitigate this bias. SGV improves failure detection rates by up to 25pp and accuracy by up to 14pp across web navigation, desktop operations, and robotic manipulation tasks.
LLMs as Rules Oracles: Exploring Real-World Multimodal Reasoning in Tabletop Strategy Game Environments: This paper introduces LudoBench—a multimodal game comprehension benchmark that pairs "real tabletop game photos + complete rulebooks + situated questions." It finds that leading vision-language models fail significantly on the most basic task of "understanding a new tabletop game" for novice players (Perception 63%, Rule Integration 36%, Short-term Optimization only 8%), exposing fundamental defects in cross-modal rule grounding and lightweight forward simulation capabilities.
Math Blind: Failures in Diagram Understanding Undermine Reasoning in MLLMs: This paper introduces the diagnostic benchmark MATHEMETRIC to decouple "perception" from "reasoning," revealing that current MLLMs exhibit extremely poor foundational perception (shape/counting/relationships/grounding) on mathematical diagrams—specifically, fine-grained grounding is near zero, leading to "blindly trusting text" (Math Blind). Furthermore, after training on the graph-structured geometric perception dataset GEOMETRIC, grounding tasks improve by \(+79\%\). This perception gain transfers to reasoning tasks without additional CoT data, resulting in a \(+3\text{--}4\%\) improvement across four public benchmarks.
MathNet: A Global Multimodal Benchmark for Mathematical Reasoning and Retrieval: MathNet constructs the largest Olympiad-level math problem database to date (30K+ problems, 47 countries, 17 languages, spanning 40 years of official exams). It introduces "math-aware retrieval" as an independent task and provides benchmarks for problem-solving, retrieval, and retrieval-augmented generation (RAG), revealing that frontier models remain severely limited in geometry, discrete mathematics, and identifying mathematical equivalence.
Medical Thinking with Multiple Images: This paper introduces MedThinkVQA—the first expert-annotated multi-image medical diagnostic reasoning benchmark, averaging 6.62 images per case. Through a three-step "Think-with-Images" supervision and beyond-accuracy step-level evaluation, it reveals that the true bottleneck for current top-tier multimodal large models is not the length of the reasoning chain, but the ability to "extract-align-compose" visual evidence across multiple views.
MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning: MedVR trains medical VLMs as agents capable of "zooming in" to examine images. It utilizes Entropy-guided Visual Relocation (EVR) to identify moments for re-examining images and Consensus-guided Credit Assignment (CCA) to automatically generate pseudo-labels for visual grounding from multiple successful trajectories. Without requiring any manual annotation for intermediate steps, it achieves SOTA performance on 6 medical VQA benchmarks.
MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse: MetaSpatial models 3D indoor scene layout generation as an RL policy learning problem. It proposes the 3D-SPO algorithm, which injects physics-aware advantage modulation into coordinate tokens based on GRPO and stacks discounted returns from multi-round refinement trajectories during training. This enables the VLM to directly generate physically plausible and format-stable (x,y,z) layouts without any ground-truth annotations or post-processing.
MIMIC-Bench: Exploring the User-Like Thinking and Mimicking Capabilities of Multimodal Large Language Models: This paper crawls 150K+ user videos from real social platforms to construct MIMIC-Data, selects 4,000 high-interaction videos for MIMIC-Bench, and shifts MLLM evaluation from "what happens in the video" to "how humans think and comment." It also trains MIMIC-Chat, which can generate realistic human-like comments.
MindCube: Spatial Mental Modeling from Limited Views: The MindCube benchmark (21,154 questions / 3,268 images) is proposed to systematically expose the deficiency of VLMs in "reconstructing unseen spaces from limited views," where they perform near random guessing. A "map-then-reason" (SFT + RL) scheme is introduced, where the model first draws a cognitive map and then reasons upon it, improving the accuracy of Qwen2.5-VL-3B from 37.8% to 61.3%.
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search: Mini-o3 employs a triplet of "hard-sample dataset + diverse cold-start trajectories + over-turn masking" to enable a VLM trained on only 6 interaction turns to naturally extend to dozens of trial-and-error exploration turns during inference, reproducing OpenAI o3-style deep visual search capabilities and achieving new SOTA results.
Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning: The paper proposes the MoVT paradigm and AdaVaR framework, unifying "text-based reasoning" and "visually-grounded reasoning" into a single LVLM. By employing an improved AdaGRPO algorithm, the model learns to adaptively select the appropriate reasoning mode based on the problem context, leading to simultaneous improvements across tasks such as mathematics, visual search, hallucination reduction, and spatial reasoning.
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization: This paper introduces MM-HELIX, an integrated "Evaluation-Data-Training" platform. It features a benchmark of 42 programmatically generated multimodal puzzles that require iterative trial and error, a SERG pipeline that synthesizes 100k high-quality reflective CoT samples, and a single-stage AHPO algorithm that dynamically fuses offline expert supervision with online RL exploration. This approach boosts Qwen2.5-VL-7B by +18.6% on MM-HELIX and +5.7% on general math/logic tasks.
MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning: This paper proposes the MMR-Life benchmark (2,646 5-way multiple-choice questions based on 19,108 real images, covering 7 reasoning types and 21 tasks) to systematically evaluate the multi-image reasoning capabilities of MLLMs in real-life scenarios. The study finds that the strongest model, GPT-5, achieves only 58.69% accuracy—14% behind human performance—and reveals key insights such as the failure of reasoning enhancement methods on large models and the observation that RL generalization is weaker than BoN (Best-of-N).
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos: MMR-V is an evaluation benchmark for "deep video reasoning" that emphasizes long-range multi-frame evidence mining and implicit reasoning of "what's left unsaid." It reveals that even the strongest Gemini-2.5-pro achieves only 64.3% accuracy, while CoT and test-time scaling are largely ineffective in this domain.
MMReD: A Cross-Modal Benchmark for Dense Context Reasoning: MMReD constructs a "room-character" randomly-evolving visual sequence environment, upgrading long-context reasoning from "needle-in-a-haystack retrieval" to "dense reasoning that requires uniform attention to the entire context." It reveals that nearly 30 LLMs/LVLMs, ranging from GPT-4o to reasoning-specialized models, systematically collapse as the sequence length grows, a limitation that SFT/GRPO fine-tuning fails to mitigate.
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models: The paper reveals the "double-edged sword" nature of multimodal reasoning—longer reasoning improves logic but weakens perceptual grounding due to "visual forgetting." It proposes VAPO (Visual-Anchored Policy Optimization), which inserts visual anchors and utilizes perceptual rewards to pull reasoning back to visual evidence, achieving a new SOTA with VAPO-Thinker-7B.
NePTune: A Neuro-Pythonic Framework for Tunable Compositional Reasoning on Vision-Language: NePTune enables LLMs to translate natural language questions into "hybrid Python programs"—combining imperative control flow with soft logic operators—executed by scoring atomic concepts with a VLM under uncertainty to achieve training-free yet fine-tunable compositional visual reasoning.
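A "hybrid Python program" of the kind the NePTune entry above describes might look like the following minimal sketch. The product-based soft operators, the `score` helper, and the hard-coded `SCORES` table (standing in for VLM concept scores) are all assumptions for illustration; the summary does not specify which operators or scoring interface the paper uses.

```python
def soft_and(a, b):
    """Soft conjunction (product t-norm; an assumed choice)."""
    return a * b

def soft_or(a, b):
    """Soft disjunction (probabilistic sum; an assumed choice)."""
    return a + b - a * b

def soft_not(a):
    """Soft negation."""
    return 1.0 - a

# Hypothetical stand-in for a VLM scoring atomic concepts on detected objects.
SCORES = {
    ("obj1", "red"): 0.9, ("obj1", "cube"): 0.8,
    ("obj2", "red"): 0.2, ("obj2", "cube"): 0.95,
}

def score(obj, concept):
    """Return the (stubbed) confidence in [0, 1] that `obj` shows `concept`."""
    return SCORES[(obj, concept)]

def exists_red_cube(objects):
    """Program for the question "Is there a red cube?": an ordinary Python
    loop (imperative control flow) accumulating soft truth values."""
    result = 0.0
    for obj in objects:
        result = soft_or(result, soft_and(score(obj, "red"), score(obj, "cube")))
    return result

confidence = exists_red_cube(["obj1", "obj2"])
```

Because every operator here is differentiable in its inputs, a program like this returns a graded confidence rather than a hard yes/no, which is what would make such a pipeline usable without training yet still tunable.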
No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers: VALOR is proposed: a completely annotation-free training framework for visual reasoning. It scales programmatic reasoning via RL with LLM verifiers and enhances visual grounding via hard-negative mining with VLM verifiers. A small Qwen3-8B combined with visual expert tools outperforms both open-source and closed-source large models in spatial reasoning.
Not Search, But Scan: Benchmarking MLLMs on Scan-Oriented Academic Paper Reasoning: ScholScan proposes a new "scan-oriented" paradigm for academic paper reasoning—moving away from pre-defined retrieval targets and instead tasking models to read an entire paper like a reviewer to actively discover internal scientific inconsistencies. Based on 715 real-world papers, 9 error categories, and 1,800 tasks, this multimodal benchmark evaluated 15 models under 24 input configurations. The findings reveal that even the strongest MLLMs score below 60 across all error categories, and RAG provides almost no assistance, exposing systematic shortcomings in the existing "search-oriented" paradigm.
OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning: The authors constructed OCR-Reasoning—the first benchmark to systematically evaluate the "text-rich image reasoning" capabilities of multimodal large language models (MLLMs). It includes 1,069 human-annotated samples covering 6 core reasoning capabilities across 18 practical tasks, providing both final answers and step-by-step reasoning processes. Results show that even the strongest MLLMs do not exceed 50% accuracy, revealing that this direction remains far from resolved.
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models: OmniSpatial is the first comprehensive spatial reasoning benchmark built on cognitive psychology. It systematically covers 4 dimensions (Dynamic Reasoning, Complex Spatial Logic, Spatial Interaction, and Perspective Transformation) and 50 subcategories with 8.4K human-annotated QA pairs. Results show that the o3 model achieves only 56.33% compared to a human score of 92.63%, revealing that complex spatial reasoning remains a core bottleneck for VLMs.
Perception-Aware Policy Optimization for Multimodal Reasoning: Identifying that 67% of errors in multimodal RLVR stem from the neglected bottleneck of "inaccurate visual perception," this paper proposes PAPO. It introduces an implicit perception KL loss between "original vs. masked images" (plus double entropy regularization) into the GRPO/DAPO optimization objective. Without additional annotations, reward models, or teacher models, PAPO achieves an overall improvement of 4.4%–17.5% across 8 multimodal reasoning benchmarks and a 30.5% reduction in perception errors.
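The quantity behind the perception term in the PAPO entry above is a KL divergence between the model's output distributions for the original and the masked image. The sketch below only computes that divergence for two toy next-token distributions; how the term is weighted and combined with the GRPO/DAPO objective is not given in the summary and is not shown here. The function name and the toy numbers are illustrative assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary.
    `eps` guards against log(0) when q assigns zero mass to a token."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

# Toy next-token distributions over a 2-token vocabulary.
with_image = [0.9, 0.1]      # model sees the original image
masked_image = [0.5, 0.5]    # model sees a masked image

# Large divergence: the prediction actually depends on visual content.
dependent = kl_divergence(with_image, masked_image)

# Zero divergence: the model answers identically with or without the image,
# i.e. it is effectively ignoring the visual input.
ignoring = kl_divergence(masked_image, masked_image)
```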
Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward: Addressing the limitation where existing Reinforcement Learning from Verifiable Rewards (RLVR) only rewards final answer correctness and fails to improve visual perception, this paper proposes Perception-R1. By extracting atomic "visual annotations" from high-quality CoT trajectories as references, it employs a judge LLM to determine if the model's response faithfully describes these visual facts. Using only 1,442 training samples, it achieves significant gains on 8 multimodal benchmarks and clearly outperforms Vision-R1, which was trained on 200k samples.
Play to Generalize: Learning to Reason Through Game Play: By applying Reinforcement Learning (RL) to let a 7B multimodal large model (MLLM) play arcade games like Snake and 3D rotation recognition—without ever touching math problems, formulas, or diagrams—the model outperforms similarly sized models trained specifically on mathematical data in multimodal reasoning benchmarks like MathVista and MMMU, while preserving general vision capabilities.
ProxyThinker: Test-Time Guidance Through Small Visual Reasoners: ProxyThinker proposes a completely training-free test-time method: by adding the token-level logit difference between a small "RFT Expert" and an equivalent-sized "Base Amateur" to the output logits of a large base model (weighted by coefficient \(\alpha\)), a 32B/72B model can "inherit" the slow-thinking behaviors (e.g., self-verification, self-correction) from RL-tuned small models without any parameter updates. The method matches or even exceeds the performance of full-scale RFT models of the same size on mathematical and multimodal reasoning benchmarks, achieving a 38× speedup through asynchronous parallel implementation in vLLM.
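The logit arithmetic described in the ProxyThinker entry above can be sketched for a single decoding step. The three logit vectors are toy values over a 3-token vocabulary, and the function names are illustrative assumptions; a real system would obtain the logits from three model forward passes over the same prefix.

```python
import math

def guided_logits(large_base, small_expert, small_amateur, alpha=1.0):
    """Steer the large model's logits by the expert-minus-amateur difference:
    large + alpha * (expert - amateur), applied token by token."""
    return [
        b + alpha * (e - a)
        for b, e, a in zip(large_base, small_expert, small_amateur)
    ]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

# Toy logits over a 3-token vocabulary for one decoding step.
large = [2.0, 1.0, 0.0]      # large base model prefers token 0
expert = [0.0, 2.0, 0.0]     # small RL-tuned expert prefers token 1
amateur = [0.0, 0.0, 0.0]    # small base model is indifferent

steered = guided_logits(large, expert, amateur, alpha=1.0)
probs = softmax(steered)
next_token = probs.index(max(probs))
```

With `alpha=0.0` the large model's logits are returned unchanged, so the coefficient controls how strongly the small models' difference steers generation.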
Pursuing Minimal Sufficiency in Spatial Reasoning: Addressing the dual bottlenecks where VLMs "perceive inaccurately" and are "distracted by redundant information" in 3D spatial reasoning, this paper proposes MSSR: a zero-shot dual-agent framework. A Perception Agent actively queries the 3D scene via visual programming, while a Reasoning Agent iteratively prunes and completes information as needed to construct a "Minimal Sufficient Set (MSS)" before answering. It achieves +19.2 and +16.8 percentage point improvements over the GPT-4o backbone on MMSI-Bench and ViewSpatial-Bench, respectively.
PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts: PuzzleWorld collects 667 "puzzlehunt" style multimodal puzzles without explicit problem definitions, annotating each with final answers, stepwise reasoning trajectories, and cognitive skill labels. Results show that current state-of-the-art models achieve final answer accuracies of only 1–18%, far behind puzzle enthusiasts. Through stepwise scoring and fine-tuning experiments, the study reveals three major model shortcomings: "myopic reasoning, over-reliance on language, and lack of visual sketching capabilities."
Read the Room: Video Social Reasoning with Mental-Physical Causal Chains: This paper introduces the R3-Bench benchmark and the R3-FDT large-scale training set to systematically evaluate the video social reasoning capabilities of LVLMs through a "Mental-Physical Causal Chain" structure. The study reveals a significant gap between current state-of-the-art models and human performance and demonstrates that fine-tuning on R3-FDT significantly improves social reasoning across multiple benchmarks.
Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning: RAPID repositions the role of Multi-modal Large Language Models (MLLMs) as "perceptors"—responsible only for translating images into text (query-related captions + tentative solutions), which is then handed over to any external text-only LLM for reasoning. A reinforcement learning algorithm named VPO is used to optimize these text outputs based on the "final correctness of the external LLM," allowing a single trained MLLM to be used plug-and-play with increasingly powerful LLMs to achieve continuous performance gains without expensive vision-language re-alignment.
Reasoning-Driven Multimodal LLM for Domain Generalization: This paper proposes RD-MLDG, the first framework to introduce Multimodal Large Language Model (MLLM) reasoning chains into Domain Generalization (DG). By constructing the DomainBed-Reasoning dataset, the study systematically analyzes two major challenges in reasoning supervision (optimization difficulty and reasoning mode mismatch). These are addressed through the synergy of Multi-Task Cross Training (MTCT) and Self-Aligned Reasoning Regularization (SARR). On four standard DG benchmarks, it achieves an average accuracy of 86.89%, significantly outperforming GPT-4o (83.46%) and all CLIP/ViT-based methods.
Reasoning in Space via Grounding in the World: This paper proposes GS-Reasoner, which utilizes a "dual-path pooling" mechanism to align geometric features with image patch-level semantic and positional features, constructing a unified semantic-geometric hybrid 3D representation. This allows a 3D LLM to perform autoregressive 3D visual grounding without relying on any external detectors or decoders for the first time. By using grounding results as intermediate Chain-of-Thought (CoT) steps to enhance spatial reasoning, the model achieves SOTA performance on benchmarks like VSI-Bench.
Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks: The authors propose the Ref-Adv benchmark, constructed through a pipeline of Hard Distractor Pairing + LLM-assisted Minimally Sufficient Expression Generation + Three-annotator Consistency Verification. This benchmark eliminates "grounding shortcuts" in modern REC. On Ref-Adv, the accuracy of 13 contemporary MLLMs (including GPT-4o, Gemini 2.5, Qwen2.5-VL-72B, etc.) drops significantly from 90%+ on RefCOCO(+/g) to 50-68%, systematically exposing severe deficiencies in complex visual reasoning and authentic grounding capabilities.
ReVisual-R1: Advancing Multimodal Reasoning from Optimized Cold Start to Staged Reinforcement Learning: This paper systematically deconstructs the training pipeline of Multimodal Large Language Models (MLLMs). It discovers that a three-stage curriculum—consisting of "high-difficulty text-only cold start + multimodal RL + text RL"—is the key to activating complex reasoning. Furthermore, it proposes the PAD sampling mechanism to address "gradient stagnation" in multimodal GRPO. ReVisual-R1-7B achieves open-source SOTA across nine reasoning benchmarks, even surpassing GPT-4o.
ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis: To address the bottleneck of high-quality training data for complex video reasoning, this paper develops a multi-stage "agentic data synthesis" pipeline to create the ReWatch dataset (hierarchical captions + high-difficulty QA + re-watching CoT). By applying SFT followed by RLVR with an "Observation & Reasoning (O&R)" reward, Qwen2.5-VL-7B is trained into ReWatch-R1, achieving SOTA performance among models of similar size across five challenging video reasoning benchmarks.
Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning: Rex-Thinker reformulates grounded object referring from "direct coordinate generation" into a process where an open-vocabulary detector provides candidate boxes, followed by a multimodal large model (MLLM) performing box-by-box reasoning via a Planning-Action-Summarization framework with rejection capabilities. This approach simultaneously improves grounding accuracy, explainability, and rejection performance for null-target expressions on HumanRef.
RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy: RIG integrates textual reasoning, low-level action prediction, and future frame generation into a single autoregressive Transformer. Through progressively constructed Minecraft trajectory data, the policy is enabled to "think, imagine the outcome, and then refine actions," simultaneously improving control, generation, and reasoning performance with significantly less environmental interaction data.
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation: ROVER proposes a reciprocal cross-modal reasoning benchmark for unified multimodal models, utilizing 1,312 tasks and 1,876 images to simultaneously examine whether "linguistic reasoning can constrain image generation" and whether "visual intermediate results can assist verbal reasoning." The study finds that current models show gains in concrete physical visual reasoning but still significantly fail in the visualization of abstract symbols.
SCoT: Teaching 3D-LLMs to Think Spatially with Million-scale CoT Annotations: SCoT constructs a 1.1 million-scale 3D scene Chain-of-Thought dataset, categorizing tasks into three levels: perception, analysis, and planning. By constraining the reasoning chain with scene evidence markers (<SI>), it makes 3D-LLMs more interpretable and faithful in complex spatial analysis and planning, while also cautioning that CoT should not be overused for simple perception tasks.
Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes: The authors introduce MV-RoboBench, the first benchmark integrating multi-view spatial reasoning with robotic manipulation execution. It contains 1.7K human-annotated QA pairs and reveals a massive gap between the strongest current VLMs (GPT-5 at only 56.4%) and humans (91.0%).
SketchThinker-R1: Towards Efficient Sketch-Style Reasoning in Large Multimodal Models: SketchThinker-R1 introduces a three-stage pipeline—compressing long reasoning into sketches, training a SketchJudge reward model, and applying GRPO reinforcement learning—enabling Large Multimodal Models (LMMs) to significantly reduce intermediate reasoning tokens in visual question answering and logic/math/physics tasks while maintaining or improving accuracy.
Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation: Inspired by the "draft-then-verify" paradigm of Speculative Decoding, this paper proposes Speculative Verdict (SV). It utilizes multiple lightweight VLMs to generate diverse reasoning paths as drafts, while a large model serves as the verdict to synthesize, verify, and correct errors. SV outperforms GPT-4o by 11.9% on information-intensive VQA without training and can rectify 47-53% of minority-correct cases.
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward: SophiaVL-R1 adds a holistic reward on the thinking process when training MLLM reasoning with rule-based RL. Specifically, a Thinking Reward Model is trained to evaluate reasoning quality across five dimensions (e.g., logical consistency and redundancy). Trust-GRPO then computes a confidence weight \(\gamma\) by comparing the thinking rewards of the correct-answer and incorrect-answer groups, which mitigates reward hacking. Additionally, an annealing strategy \(e^{-\text{steps}/T}\) gradually reduces the thinking reward so that later training stages rely more on the accurate rule-based rewards. The 7B model comprehensively outperforms LLaVA-OneVision-72B across multiple benchmarks, including MathVista (71.3%) and MMMU (61.3%).
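The way the thinking reward fades out in SophiaVL-R1 can be illustrated with a small sketch. The additive combination below and the names `combined_reward`, `rule_reward`, `thinking_reward`, `step`, and `T` are assumptions made for illustration; the summary above only states that the thinking reward is weighted by \(\gamma\) and annealed by \(e^{-\text{steps}/T}\), so the paper's exact formula may differ.

```python
import math


def combined_reward(rule_reward: float, thinking_reward: float,
                    gamma: float, step: int, T: float) -> float:
    """Rule-based reward plus an annealed, confidence-weighted thinking reward.

    Assumed form (not confirmed by the summary):
        r = rule_reward + gamma * exp(-step / T) * thinking_reward
    """
    decay = math.exp(-step / T)  # equals 1.0 at step 0 and shrinks toward 0
    return rule_reward + gamma * decay * thinking_reward


# The thinking reward contributes less and less as training progresses.
early = combined_reward(1.0, 0.8, gamma=0.5, step=0, T=100.0)
late = combined_reward(1.0, 0.8, gamma=0.5, step=1000, T=100.0)
```

With these toy numbers the thinking term contributes 0.4 at step 0 and is close to zero by step 1000, so the rule-based reward dominates late in training.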
SpaCE-Eval: A Benchmark for Real-World Multi-Modal Reasoning: SpaCE-Eval constructs a real-world physical spatial multimodal reasoning VQA benchmark consisting of newly created, human-drawn diagrams. It systematically examines MLLMs using three task categories: spatial reasoning, commonsense knowledge, and environmental interaction. The results demonstrate that current state-of-the-art models remain far below human performance in both overall accuracy and spatial reasoning.
Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models: This paper proposes Spatial-DISE, a unified spatial reasoning benchmark based on a 2×2 cognitive science taxonomy (Intrinsic/Extrinsic × Static/Dynamic). It includes 559 evaluation VQA pairs and 12K+ training samples. Evaluations across 32 SOTA VLMs reveal a significant gap between models and humans in dynamic spatial reasoning, particularly in mental rotation and folding.
Spatial CAPTCHA: Generatively Benchmarking Spatial Reasoning for Human-Machine Differentiation: Spatial CAPTCHA is a novel human-verification framework based on 3D spatial reasoning. It leverages fundamental capability differences between humans and multimodal large language models (MLLMs) in tasks such as geometric reasoning, perspective transformation, occlusion handling, and mental rotation. The best-performing MLLM achieved only a 31.0% Pass@1 accuracy, significantly lower than human performance.
Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes: Addressing ego-centric multi-view scenarios (e.g., autonomous driving/robotics) where cameras simultaneously cover front, rear, left, and right views, this paper establishes the first outdoor 3D spatial reasoning benchmark, Ego3D-Bench (8.6K QA). It proposes Ego3D-VLM, a training-free, plug-and-play framework that localizes queried objects in 3D global coordinates to generate a compact "textual cognitive map." Feeding this map into any VLM improves MCQA accuracy by an average of 12% and reduces absolute distance RMSE by an average of 56%.
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?: SpatiaLab is introduced as a real-world spatial reasoning benchmark containing 1,400 vision-QA pairs across 6 major categories and 30 subcategories. Supporting both MCQ and open-ended evaluations, it reveals a significant spatial reasoning gap between the strongest current VLM (InternVL3.5-72B at 54.93% MCQ) and humans (87.57%), with the disparity widening in open-ended settings.
SpatialLadder: Building Spatial Reasoning Capabilities for Vision-Language Models via Progressive Training: This paper proposes SpatialLadder, which first constructs a 26k spatial dataset covering localization, single-image, multi-view, and video using ScanNet reconstruction. It then employs a three-stage progressive training strategy: "Perception-Localization → Spatial Understanding → Reinforced Reasoning." This approach trains a 3B Qwen2.5-VL to reach spatial reasoning SOTA, achieving a 23.4% overall improvement over the base model and surpassing GPT-4o by 20.8%.
SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs: SpinBench is proposed as a diagnostic benchmark grounded in cognitive science. It systematically evaluates the spatial understanding of 37 VLMs through 7 progressive task categories (ranging from object recognition to perspective taking), revealing systematic flaws such as egocentric bias and weak rotation comprehension.
SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports: SportR is the first large-scale multimodal benchmark for "sports rule reasoning" across multiple sports. It comprises 4,789 images and 2,052 videos covering 50 types of fouls and 12 types of tactics across 5 ball games. The dataset includes 6,841 purely human-written Chain-of-Thought (CoT) trajectories and precise bounding box annotations. MLLMs are evaluated through a progressive QA hierarchy, ranging from foul identification to penalty prediction and evidence localization. Results indicate that even GPT-5 achieves low scores, with visual grounding \(\text{IoU}\) generally below \(7\%\).
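For readers unfamiliar with the grounding metric reported for SportR, the standard intersection-over-union (IoU) between two axis-aligned boxes can be sketched as follows. The `(x1, y1, x2, y2)` corner format and the function name `box_iou` are assumptions for illustration; the benchmark's own evaluation code may use a different box convention.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
example = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

An IoU below 7% therefore means that predicted evidence boxes barely overlap the annotated ones.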
STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning: STVG-R1 reformulates the difficult frame-by-frame coordinate regression in spatial-temporal video grounding into an instance identification problem—"viewing numbered videos and answering target IDs + time segments." By training the VLM with GRPO and task-specific rewards, the model significantly improves spatial consistency and cross-task generalization on multiple benchmarks like HCSTVG, ST-Align, and MeViS.
Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking: To address the issue where Unified Vision-Language Models (UVLMs) treat "understanding" and "generation" as two parallel skills that do not interact during problem-solving, this paper proposes AD-Loop. This method allows models to interleave "textual thinking (Analyzing)" and "latent visual thinking (Drafting)" during the reasoning process. Through a two-stage training of SFT + Adaptive RL, the model learns to switch between these two capabilities as needed, achieving a +2.3% average improvement in understanding and a GenEval total score of 86%.
Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models: This paper points out that the "near-random guessing" performance of multimodal models on compositional reasoning benchmarks is largely an illusion created by artificially depressed evaluation metrics. It proposes a more faithful GroupMatch metric along with SimpleMatch to translate results back to standard metrics. Furthermore, it introduces Test-Time Matching (TTM), an iterative self-training algorithm without external supervision, which allows SigLIP-B16 to outperform GPT-4.1 on MMVP-VLM and enables GPT-4.1 to exceed estimated human performance on Winoground for the first time.
ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning: ThinkMorph proposes the principle that "text and images should be complementary rather than isomorphic reasoning modalities." By fine-tuning a unified multimodal model (Bagel-7B) on approximately 24K carefully constructed interleaved reasoning trajectories, the model learns an interleaved reasoning process of "Textual Hypothesis → Visual Manipulation → Textual Verification." It achieves an average performance gain of 34.7% over the base model on vision-intensive tasks and exhibits high-order intelligence such as emergent visual manipulations unseen during training, autonomous switching of reasoning modes, and superior test-time scaling.
ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding: ThinkOmni is a training-free framework that uses Large Reasoning Models (LRMs) to guide Omni-modal LLMs (OLLMs) during decoding. By employing Stepwise Contrastive Scaling to adaptively balance perception and reasoning signals, it achieves 70.2% on MathVista and 75.5% on MMAU, matching or surpassing reinforcement fine-tuning (RFT) methods.
Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs: The authors propose VC-STaR (Visual Contrastive Self-Taught Reasoner). Based on the observation that "VLMs see more accurately when contrasting two similar images," they design a contrastive self-improving framework: by constructing contrastive VQA pairs, the model generates more faithful visual analysis during comparison. An LLM then integrates this contrastive analysis into the reasoning path to produce the high-quality visual reasoning dataset VisCoR-55K. After fine-tuning, performance improves by 5.7% on MMVP and 3.2% on Hallusion.
Thyme: Think Beyond Images: Thyme enables Multimodal Large Language Models (MLLMs) to autonomously generate and execute code during reasoning for image operations (e.g., cropping, rotation, contrast enhancement) and mathematical calculations. This capability is activated through a two-stage training process: a "500k SFT cold start" followed by "GRPO-ATS reinforcement learning," consistently outperforming Qwen2.5-VL baselines across nearly 20 benchmarks, particularly in high-resolution perception and complex reasoning.
TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning: TimeSearch-R reformulates "temporal search" in long videos as a multi-turn reasoning process where text reasoning and video retrieval are interleaved. It utilizes GRPO with "Completeness Self-Verification" (GRPO-CSV) for reinforcement learning, enabling the model to autonomously learn which frames to inspect and when search is sufficient. It consistently outperforms hand-crafted search workflows and pure text-based reasoning models across temporal search, long video understanding, and complex video reasoning benchmarks.
Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning: TVP enables visual programming agents to solve problems using basic tools and store high-quality programs in an "Experience Library." It then clusters and abstracts reusable high-level tools from these successfully executed programs into a "Tool Library," forming a "Program → Tool → Better Program" closed loop. It outperforms GPT-4o by 22% and previous visual programming systems by 11% on Omni3D-Bench, with the abstracted tools demonstrating zero-shot transferability to unseen spatial reasoning benchmarks.
Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations: The authors propose the STARE benchmark, which systematically evaluates Multimodal Large Language Models (MLLMs) using approximately 4,000 spatial problems requiring "multi-step visual simulation" (2D/3D transformations, cube folding, tangrams, viewpoint, and temporal reasoning). The study finds that while models perform near human levels on simple 2D transformations, their performance drops to near-random on tasks requiring step-by-step "mental imagery" such as folding or tangrams. Furthermore, models fail to consistently utilize intermediate visual steps—revealing a fundamental gap in current MLLMs regarding non-verbal, serialized visual simulation capabilities.
Unleashing Perception-Time Scaling to Multimodal Reasoning Models: Addressing the phenomenon where "inference-time scaling makes Large Vision-Language Models (LVLMs) think longer but not see more accurately," this paper proposes Perception-Time Scaling (PTS). By rewriting perception as a token-dense, decomposable explicit process (symbolic distance + segment-wise accumulation) and employing SFT cold-start followed by GRPO reinforcement, the authors improve high-precision accuracy on their self-built perception benchmark, DisTANCE, from 8.0% to 64.7%, with the ability to generalize to out-of-distribution geometry and real-world multimodal tasks.
Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization: This paper proposes the Aes-R1 framework, which utilizes an automated data pipeline, AesCoT, to distill aesthetic reasoning corpora across five dimensions for cold-start SFT. It then employs RAPO, a reinforcement learning algorithm that simultaneously optimizes "absolute score regression + relative ranking," allowing the MLLM to improve average PLCC/SRCC in image aesthetic assessment by 47.9%/34.8% relative to the backbone using only 15K training samples, surpassing SOTAs of the same scale.
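The two correlation metrics cited for Aes-R1 can be computed as in the sketch below: PLCC is the Pearson correlation between predicted and ground-truth scores, and SRCC is the Pearson correlation of their ranks (Spearman). This is a dependency-free illustration with hypothetical function names, not the paper's evaluation script. It assumes non-constant inputs of equal length.

```python
import math
from typing import List, Sequence


def plcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient (assumes non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


# Toy scores: a monotone but non-linear relation gives SRCC = 1 and PLCC < 1.
predicted = [1.0, 2.0, 3.0, 4.0]
human = [1.0, 4.0, 9.0, 16.0]
```

PLCC rewards predictions that are linearly related to human scores, while SRCC only checks that the ordering of images is preserved, which is why the two are usually reported together.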
VGR: Visual Grounded Reasoning: VGR enables Multimodal Large Language Models (MLLMs) to "replay visual memory" during the thinking process—autonomously framing key regions during reasoning and retrieving high-resolution visual tokens to continue thinking. Coupled with a set of VGR-SFT data containing grounding signals, it significantly outperforms baselines on fine-grained image understanding tasks such as ChartQA, AI2D, and MMStar, while using only 30% of LLaVA-NeXT's visual tokens.
Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction–Reasoning Synergy: Vid-LLM utilizes only monocular video as input. Through a Cross-Task Adapter that mutually enhances "reconstruction" and "reasoning," it injects geometric priors directly reconstructed from the video into the LLM. It achieves performance levels close to models using explicit 3D point clouds across 3D Question Answering (QA), dense captioning, and visual grounding, without requiring any external point cloud, depth, or pose inputs.
VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning: VideoAnchor is a training-free test-time plugin that identifies "visual anchors" stable across frames from video or multi-view image tokens using Sparse Subspace Clustering (SSC). These anchors are converted into Q/K/V attention scaling factors to mitigate the over-reliance of VLMs on textual priors, providing consistent improvements for multiple MLLMs on spatial tasks like VSI-Bench, All-Angles-Bench, SPAR-Bench, and Video-MME.
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Video: VideoMathQA constructs a mathematical reasoning benchmark for real instructional videos, using 420 video QAs, 2,945 expert step annotations, and a multi-layer evaluation protocol to test whether models can perform long-range, multi-step, and diagnostic reasoning across video, subtitles, speech, and mathematical knowledge.
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?: VideoReasonBench constructs a vision-centric complex reasoning benchmark centered on "visible operations + partially visible latent states." The study demonstrates that most current MLLMs remain weak in fine-grained video perception and multi-step state reasoning, while longer test-time thinking significantly benefits such tasks.
VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning: VideoZoomer reformulates long video reasoning as a multi-turn tool-calling task of "glance then zoom." A 7B MLLM autonomously decides when and where to invoke the <video_zoom> tool to capture high-frame-rate clips. Using a two-stage training process of "cold-start SFT + GRPO reinforcement learning," it outperforms open-source models on multiple long video benchmarks with a smaller frame budget and even matches closed-source systems on specific tasks.
VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL: VidGuard-R1 is the first video authenticity detector to utilize Group Relative Policy Optimization (GRPO) reinforcement learning for fine-tuning MLLMs. By constructing a shortcut-free dataset of 140,000 real/fake video pairs and designing two specialized reward mechanisms—temporal artifact rewards and diffusion step quality rewards—it achieves 86.17% accuracy on its internal dataset and reaches 95%+ SOTA zero-shot detection performance on the GenVidBench and GenVideo benchmarks, while generating explainable Chain-of-Thought reasoning.
Vision-SR1: Self-Rewarding Vision-Language Model via Reasoning Decomposition and Multi-Reward Policy Optimization: Vision-SR1 decomposes VLM reasoning into two stages: "visual perception" and "linguistic reasoning." It requires the model to first generate a self-consistent visual description that allows answering the question even if the original image is removed. The same VLM then provides a visual reward by re-answering based solely on this description. Through decoupled multi-reward policy optimization, these two signals are back-propagated separately, mitigating visual hallucinations and suppressing "language shortcut" behaviors (guessing based on linguistic priors without looking at the image) without requiring external visual supervision or additional GPUs.
VisionReasoner: Unified Reasoning-Integrated Visual Perception via Reinforcement Learning: VisionReasoner unifies ten categories of visual perception tasks—including detection, segmentation, and counting—into a "multi-object cognition" problem. By employing a unified reward mechanism and GRPO reinforcement learning, a single Qwen2.5-VL model is trained to generate structured reasoning before outputting results. This approach achieves relative improvements of 29.1%, 22.1%, and 13.2% over baselines on COCO detection, ReasonSeg, and CountBench, respectively.
VisualPRM400K: An Effective Dataset for Training Multimodal Process Reward Models: The authors constructed VisualPRM400K, the first multimodal process supervision dataset with approximately 400,000 samples, using a Monte Carlo automatic annotation pipeline. An 8B multimodal process reward model (PRM), named VisualPRM, was trained as a "judge" for Best-of-N evaluation. This model consistently improves the reasoning capabilities of various MLLM families and scales (e.g., a +5.9 point gain for a 78B model across seven reasoning benchmarks). Additionally, a manually annotated process evaluation benchmark, VisualProcessBench, was released.
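The Best-of-N use of a process reward model described for VisualPRM can be sketched as follows. Everything here is a hypothetical illustration: `best_of_n`, `toy_step_scorer`, and the choice to aggregate step scores by their mean are assumptions, and a real PRM would also condition on the image and question.

```python
from statistics import mean
from typing import Callable, List, Sequence

# A step scorer maps (all steps of a response, index of one step) to a score.
StepScorer = Callable[[List[str], int], float]


def best_of_n(candidates: Sequence[List[str]], step_scorer: StepScorer) -> int:
    """Return the index of the candidate with the highest mean step score.

    Assumes every candidate has at least one reasoning step.
    """
    def response_score(steps: List[str]) -> float:
        return mean([step_scorer(steps, i) for i in range(len(steps))])

    return max(range(len(candidates)), key=lambda idx: response_score(candidates[idx]))


def toy_step_scorer(steps: List[str], i: int) -> float:
    """Stand-in for a trained PRM: penalizes steps that admit to guessing."""
    return 0.0 if "guess" in steps[i] else 1.0


responses = [
    ["read the chart", "guess the value"],
    ["read the chart", "sum the bars"],
]
chosen = best_of_n(responses, toy_step_scorer)
```

Here the second response is selected because both of its steps receive a full score from the toy scorer.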
VisuLogic: A Benchmark for Evaluating Visual Reasoning Capabilities of Multimodal Large Models: VisuLogic constructs a 1,000-question, human-verified, pure visual logic reasoning benchmark across six categories. It deliberately blocks "language shortcuts" where images are converted to text for reasoning. Results show most top Multimodal Large Models (MLLMs) achieve an accuracy under 30% (barely above the 25% random baseline and far below the human 51.4%). A supplementary training set and a reinforcement learning baseline are also provided.
VisuRiddles: Fine-Grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning: Using a real-world riddle benchmark (VisuRiddles) and a synthesizer with structured perceptual descriptions, this paper systematically shows that the root cause of Multimodal Large Language Models (MLLMs) failing in Abstract Visual Reasoning (AVR) is a lack of fine-grained perception rather than weak reasoning ability. Based on this, it proposes the "SFT for perception, then GRPO for reasoning" two-stage training paradigm (PAVR), enabling a 7B model to outperform commercial models like GPT-5 and Gemini-2.5-Pro in AVR tasks.
VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?: This paper proposes VLM-SubtleBench, a benchmark that evaluates the subtle comparative reasoning capabilities of Vision-Language Models. It covers 10 difference types and 6 image domains (Natural, Gaming, Industrial, Aerial, Medical, Synthetic) and reveals a performance gap of over 30% between VLMs and humans in spatial, temporal, and viewpoint reasoning.
VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use: The study proposes VTool-R1, the first framework that trains VLMs via Reinforcement Learning Fine-tuning (RFT) to generate interleaved textual and visual intermediate reasoning steps, enabling models to "think with images."
We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning: We-Math 2.0 integrates a five-level "Mathematical Knowledge System" (491 knowledge points, 1,819 principles) with a model-centric three-dimensional difficulty data space (MathBook-Standard/Pro) and a two-stage reinforcement learning framework (cold-start SFT + progressive alignment RL). Using only ~9.8K training samples, it improves Qwen2.5-VL-7B by an average of 6.1 points across four major visual math benchmarks.
What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging: Addressing the "affirmative bias" in Vision-Language Models (VLMs) for Described Object Detection—where models fail to distinguish "person with a hat" from "person without a hat"—this paper constructs a negation-dense COVAND dataset via a CoT+VQA pipeline. It introduces NegToMe, a module that merges "not + attribute" into a single semantic unit and amplifies negation signals at the tokenization level, combined with deep cross-attention LoRA. Modifying <0.1% of parameters, the method achieves up to a +10.8 mAP improvement in NMS-AP on OVDEval.
Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum: Wiki-R1 addresses the issues of "high retrieval noise, sparse rewards, and RL failing to learn reasoning" in knowledge-based VQA. It generates a data curriculum from easy to difficult via controllable retrieval difficulty and utilizes observation propagation to select samples with the strongest training signals. This allows Qwen2.5-VL to achieve new SOTA results for retrieval-augmented KB-VQA on Encyclopedic VQA and InfoSeek.
Zebra-CoT: A Dataset for Interleaved Vision-Language Reasoning: The authors construct Zebra-CoT, the first large-scale, diverse interleaved text-image reasoning dataset (182K reasoning trajectories across 18 domains). Scaffolding experiments demonstrate that visual CoT offers a potential improvement of up to +43% for frontier models, and fine-tuning enables Anole-7B and Bagel-7B to acquire endogenous visual reasoning capabilities.