🧠 VLM Reasoning

📷 CVPR2026 · 150 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (112) · 💬 ACL2026 (32) · 🧪 ICML2026 (31) · 🤖 AAAI2026 (10) · 🧠 NeurIPS2025 (30) · 📹 ICCV2025 (15)

🔥 Top topics: Reasoning ×115 · Multimodal/VLM ×64 · LLM ×10 · Reinforcement Learning ×6 · Adversarial Robustness ×4

A Causal Marriage between VLM and IRM from Understanding to Reasoning: Starting from token-level causal representations, this paper proves that a "vocabulary-constrained InfoNCE" is formally equivalent to the invariance principle of IRM. Based on this, it proposes CLIP-IRM, a mid-training paradigm that enhances OOD understanding without architectural changes, and transfers the OOD guarantees of IRM to multimodal reasoning by using its invariant alignment score as a process-level reward for GRPO.
A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning: This paper proposes A4VL, a training-free multi-agent perception-action alliance framework. Through event-driven video chunking, clue-guided keyframe selection, and a multi-round agent negotiation-pruning mechanism, it consistently outperforms 28 baseline methods across five VideoQA benchmarks with significantly lower inference latency.
Act2See: Emergent Active Visual Perception for Video Reasoning: Through supervised fine-tuning, Act2See enables video VLMs to autonomously decide when to insert a video frame during the textual CoT reasoning process, either by retrieving a real evidence frame from the original video or by conditionally "imagining" a counterfactual frame. It thereby matches or surpasses closed-source models of similar or even larger sizes on 5 video reasoning benchmarks, including VideoEspresso and ViTIB.
Adversarial Style Optimization: Enhancing VLM Jailbreaks by GRPO-based Stylistic Triggers Optimization: The authors identify a "stylistic inconsistency" vulnerability in VLMs—they can understand content in almost any artistic style, yet their safety alignment is easily bypassed by specific visual style triggers. Based on this, they propose ASO, which fine-tunes an image editing model using GRPO to overlay optimal styles onto existing adversarial images, consistently improving the Attack Success Rate (ASR) across four SOTA VLMs.
Agentic Video Summarization via Self-Reflecting Multimodal Understanding: This paper recasts video summarization from a "one-time importance score regression" into a "predict-verify-reflect" closed-loop workflow composed of three MLLM agents: Summarizer, Verifier, and Reflector. This allows the model to self-correct and retrieve missed keyframes, outperforming previous SOTA on SumMe and TVSum in Kendall's \(\tau\) and Spearman's \(\rho\).
All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models: The authors observe that VLMs trained with GRPO, while achieving deeper reasoning in single trials, suffer from "diversity collapse" early in training—degenerating into a single dominant strategy. They propose MUPO (Multi-group Policy Optimization), which clusters sampled responses into multiple groups based on reasoning patterns, estimates local advantages within groups, and applies inter-group diversity rewards. This allows the model to maintain multiple problem-solving strategies while preserving depth, achieving an average improvement of 2-7% in acc@1/acc@4 across nine reasoning benchmarks.
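
The "local advantages within groups" step in the MUPO entry above can be illustrated with a minimal sketch. It assumes responses have already been assigned cluster ids by some external clustering step, and it z-scores each reward against its own cluster only. The function name, the `eps` term, and the omission of the inter-group diversity reward are assumptions made for illustration, not the paper's implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev


def groupwise_advantages(rewards, cluster_ids, eps=1e-6):
    """Z-score each reward against the responses in its own cluster.

    Hypothetical helper: cluster_ids are assumed to come from a separate
    step that groups sampled responses by reasoning pattern.
    """
    by_cluster = defaultdict(list)
    for reward, cluster in zip(rewards, cluster_ids):
        by_cluster[cluster].append(reward)

    # Per-cluster mean and population standard deviation.
    stats = {c: (mean(v), pstdev(v)) for c, v in by_cluster.items()}

    return [
        (reward - stats[cluster][0]) / (stats[cluster][1] + eps)
        for reward, cluster in zip(rewards, cluster_ids)
    ]


# Two reasoning-pattern clusters, two sampled responses each.
advantages = groupwise_advantages([1.0, 0.0, 1.0, 1.0], [0, 0, 1, 1])
```

In this toy call, the cluster whose responses all receive the same reward yields zero advantage for each of them, while the mixed cluster still separates its correct and incorrect responses.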
ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning: ANTS allows Multimodal Large Language Models (MLLMs) to "understand" cached suspected OOD images at test time. It generates "descriptive negative sentences" to characterize far-OOD and "visually similar negative labels" to characterize near-OOD. These two negative textual spaces are dynamically fused via an adaptive weight. On the ImageNet benchmark, ANTS achieves a zero-shot, training-free 3.1% reduction in FPR95, setting a new SOTA.
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning: ARM-Thinker transforms the multimodal reward model from a "one-pass scoring" system into an agent that actively invokes tools (crop-and-zoom, document retrieval, instruction verification) to seek evidence. Using a two-stage GRPO training strategy—encouraging tool usage followed by refining accuracy—the 7B model achieves average gains of +16.2%, +9.6%, and +4.2% across reward modeling, think-with-images, and general reasoning benchmarks, respectively, matching or even surpassing GPT-4o on reward and tool-use benchmarks.
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs: Addressing the "counting deficiency" in multimodal large language models (MLLMs), this work introduces CG-AV-Counting, the first interpretable counting benchmark for long videos across audio-visual modalities with fine-grained "counting clue" annotations. It also proposes AV-Reasoner, which leverages GRPO and curriculum learning to transfer counting capabilities from related tasks such as localization and QA. While achieving SOTA on several audio-visual reasoning benchmarks, the paper candidly reports that explicit reasoning in the language space offers little help out-of-distribution.
AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision-Language Models: To address the challenge of "detecting and explaining user operation errors in long task videos," this paper utilizes a frozen VLM combined with an automatically constructed "Action Execution Graph (AXG)" and temporal action segmentation. By decomposing each action segment into fine-grained sub-actions and querying the VLM only on keyframes of these sub-actions, the model focuses on sparse spatial-temporal error clues. It achieves SOTA performance in error explanation and detection on EgoPER and CaptainCook4D, significantly surpassing VLM baselines.
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning: GASP moves away from fine-tuning VLMs on 3D VQA data. Instead, it injects a lightweight "correspondence head" into every transformer layer of the LLM, using ground-truth point correspondence and depth from real video scenes for deep supervision. This improves the internal "cross-view matching" capability from <5% to over 70%, achieving 18–29% gains on spatial reasoning benchmarks like All-Angles and VSI-Bench with zero 3D VQA training.
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT: This paper demonstrates that the Multiple-Choice Question Answering (MCQA) format leaks option signals that models can exploit, leading to inflated evaluations and RFT learning "option-guessing" shortcuts. It proposes the ReVeL framework to automatically rewrite MCQA into "OpenQA that remains rule-verifiable" based on answer types. After fine-tuning with GRPO on 20k rewritten samples, OpenQA accuracy improved by approximately 6 percentage points without a drop in MCQA scores, while revealing that MCQA scores are inflated by up to 20 percentage points compared to OpenQA.
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs: This paper discovers that RL (GRPO) fine-tuning forces lightweight (3B) video MLLMs to take "perceptual shortcuts" instead of genuine reasoning. By first training a "bias model" specialized in shortcuts and then using a repulsive objective (CDPO) with a reversed KL divergence sign to push the main model away from the bias model, it achieves a 14.2% improvement on CLEVRER over GRPO using only 1% of the data.
Boosting Reasoning in Large Multimodal Models via Activation Replay: The authors use logit lens to discover that RLVR post-training "excessively" perturbs low-entropy input activations of large multimodal models (LMMs). They propose Activation Replay, a training-free test-time method that optimizes a set of learnable visual tokens to pull the low-entropy activations of the RLVR model back to the base model's distribution, achieving consistent gains across mathematics, o3-style visual agents, and video reasoning.
BOP-Ask: Object-Interaction Reasoning for Vision-Language Models: This paper automatically transforms the 6D object pose benchmark BOP into BOP-Ask, a large-scale object interaction reasoning dataset containing 150K images, 33.8M Q&A pairs, and covering six skill categories (pose/grasp/trajectory/rearrangement/spatial/depth). Fine-tuning open-source VLMs on this dataset significantly outperforms GPT-5 and Gemini on the in-house test set, generalizes to out-of-domain spatial reasoning benchmarks, and enables a real Franka robot to complete 10/15 pick-and-place tasks.
Breaking the Regional Perception Bottleneck of Multimodal Large Language Models via External Reasoning Framework: This paper identifies that the true bottleneck for Multimodal Large Language Models (MLLMs) in pixel-level grounding lies not in "seeing the region" but in the "translating the region into coordinates" (semantics refinement) stage. It proposes R-Ground, an external reasoning framework based on Multimodal Monte Carlo Tree Search (MCTS), which directs computational power specifically to this stage, enabling a 7B model to outperform a 72B model on the RefCOCO series.
Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection: This paper proposes a paradigm that treats the "second-view image (side-view) as a language modality." It introduces the first dual-view multi-modal security benchmark, DualXrayBench, and the GSXray dataset featuring <top>/<side>/<conclusion> Chain-of-Thought (CoT) supervision. The resulting GSR model improves overall accuracy from 53.5 to 65.4 across eight cross-view reasoning tasks, nearly doubling the mIoU.
CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal Reasoning: CARE is a "failure-centric" RLVR post-training framework for multimodal reasoning. It uses the best rollout in a group as an anchor, selects a small set of "near-miss" hard negatives for z-score normalization within a subgroup (only suppressing negatives), and performs structured reflection resampling on representative failures. By transforming "near-miss errors" into supervision signals, it achieves a macro average score 4.62 points higher than GRPO across six verifiable visual reasoning benchmarks using Qwen2.5-VL-7B.
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering: CaST-Bench introduces a new task—"Causal Chain-Grounded Spatio-Temporal Video Question Answering"—where models must not only provide correct answers but also ground them in a causal evidence chain consisting of time segments and bounding boxes. Through a human-machine collaborative pipeline, a high-quality dataset of 1,015 videos and 2,066 questions was constructed. Evaluation metrics were designed to assess both answer accuracy and evidence grounding. Testing on 15 mainstream VLMs showed performance significantly lower than humans (best 50.34% vs. human 91.89%).
Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning: This paper proposes Chain-of-Frames (CoF), enabling video LLMs to directly reference keyframes using identifiers like "Frame-k" within single-stage reasoning, effectively embedding temporal grounding into the CoT text itself. Using a low-cost data pipeline to generate 164,000 training samples with frame citations to finetune InternVL, the method achieves an average performance gain of 3.8%–5.1% across five video understanding benchmarks. It further demonstrates that purely synthetic data can lead to significant improvements.
Chain-of-Thought Guided Multi-Modal Object Re-Identification: CoT-ReID enables multimodal large models to "reason while looking" at RGB/NIR/TIR trimodal objects. It decomposes reasoning chain text into three levels—early, late, and decision-making—to guide visual feature learning, setting new SOTA benchmarks (e.g., MSVR310 mAP 71.7%) across four multispectral ReID datasets.
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts: Targeting "High Information Density (HID) charts" with dense subplots and numerous legend annotations, Chart-FR1 explicitly anchors reasoning steps to OCR text and local bounding boxes using a <focus> tag (Focus-CoT). By employing Focus-GRPO with "Information Efficiency Reward + Adaptive KL Penalty" for reinforcement learning, it improves Qwen2.5-VL-7B by an average of 6.1% across five chart benchmarks, surpassing GPT-4o.
ChartR: Evaluating Reasoning Accuracy and Robustness in Chart Question Answering: ChartR decomposes each chart question into 4–10 dependent sub-questions and provides four visual perturbation variants for each image. Using eight metrics to simultaneously evaluate "step-by-step reasoning accuracy" and "robustness under perturbation," the study reveals that across 12 MLLMs, full-chain accuracy is generally below 10%, numerical value extraction is the primary bottleneck, and models rely heavily on text annotations rather than genuine visual understanding.
CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning: CLiViS decomposes egocentric video question answering into a training-free loop where the "LLM acts as a planner and the VLM acts as a perceptual executor." Together, they maintain a dynamic cognitive map (navigation graph + relationship graph) that evolves during reasoning. This bridges fine-grained perception and high-level reasoning through structured scene representations, achieving SOTA results on OpenEQA, EgoTempo, and EgoSchema benchmarks.
CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning: CodeDance is proposed to utilize executable code as a universal solver for visual reasoning. The MLLM generates code to define, combine, and execute multiple tools, rendering intermediate visual results (bboxes, lines, and charts) to support a verifiable reasoning chain. Through RL training with a tool-call reward that balances exploration and efficiency, emergent behaviors—such as unseen tool combinations and cross-task transfer—are observed. The 7B model outperforms GPT-4o on benchmarks for counting, visual search, and chart QA.
CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization: This paper discovers that visual agents capable of "thinking with images" often answer correctly but use tools unfaithfully (e.g., cropping the wrong area but guessing the right answer). It proposes CodeV, which represents visual tools as executable Python code and utilizes Tool-Aware Policy Optimization (TAPO) on top of GRPO. TAPO introduces a process-level dense reward that only evaluates tool outputs without inspecting the chain-of-thought. Consequently, CodeV maintains or improves accuracy across 10 benchmarks while increasing the faithful tool-use rate to 1.3–2× that of baselines.
CogniVerse: Revolutionizing Multi-Modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning: CogniVerse introduces a "brain-like reflection-retrieval-synthesis" three-step process into Multimodal RAG: first, a cognitive reflection module determines if external knowledge is needed and filters relevant content; second, image-text data and knowledge graphs are aligned in hyperbolic space with spectral-based subgraph pruning; finally, an optimal transport loss is used to generate answers that balance local accuracy and global coherence. It outperforms MuRAG/MMCoQA/GraphRAG across three MMQA datasets in accuracy, coherence, and retrieval precision while reducing latency.
Compositional Transformation Reasoning for Composed Video Retrieval: Addressing the Composed Video Retrieval task ("given a reference video + modification text, retrieve a target video"), this paper proposes MoRe, a zero-shot framework. It employs multi-objective Pareto ranking to recall a small set of high-quality candidates, then utilizes an MLLM to decompose videos into "Entity-Action-Scene" dimensions for pairwise reasoning to determine which candidate best matches the modification intent. This achieves R@1 gains of +5.8 and +10.8 on EgoCVR and WebVid-CoVR, respectively.
Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence: Conan enables a 7B video multimodal large model to work like a detective: first classifying frames into evidence/context/distractor, then reasoning while deciding whether "evidence is sufficient to answer or more frames need to be retrieved." Developed via the self-constructed Conan-91k dataset, a three-stage cold start, and AIR RLVR with joint rewards, it achieves a 10.5% average improvement over the Qwen2.5-VL-7B base across six multi-step reasoning benchmarks, outperforming GPT-4o on most leaderboards.
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR: This paper proposes Consensus Entropy (CE), a training-free and model-agnostic metric that judges output reliability in an unsupervised manner by measuring whether OCR results from multiple VLMs converge. Based on this, the CE-OCR framework is built (consensus entropy-weighted ensemble + entropy threshold routing to a stronger model), improving quality verification F1 by 42.1% compared to VLM-as-Judge and increasing OCR accuracy by 8.2% on datasets like OCRBench, while routing only 7.3% of samples.
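
The idea in the Consensus Entropy entry above, judging reliability by whether several OCR outputs converge and routing uncertain samples to a stronger model, can be sketched as follows. The summary does not give the paper's exact CE formula, so this sketch substitutes a simple stand-in: the mean pairwise string dissimilarity computed with Python's `difflib`. The function names and the `threshold` value are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def disagreement(outputs):
    """Mean pairwise dissimilarity between OCR outputs, in [0, 1].

    A stand-in for a consensus measure (assumption): 0.0 means all
    outputs are identical, values near 1.0 mean they share almost nothing.
    """
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def needs_stronger_model(outputs, threshold=0.2):
    """Route a sample onward when the models disagree too much.

    The threshold is a hypothetical value that would need tuning.
    """
    return disagreement(outputs) > threshold


# Three hypothetical model outputs for the same text crop.
ocr_outputs = ["Invoice 2024", "Invoice 2024", "lnvoice 2O24"]
route = needs_stronger_model(ocr_outputs)
```

No labels are needed for this check, which is what makes the signal usable without supervision; only the threshold has to be chosen.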
COT-FM: Cluster-wise Optimal Transport Flow Matching: This paper proposes COT-FM, a plug-and-play enhancement framework for Flow Matching. By clustering target samples, inverting the pre-trained model to obtain cluster-wise source distributions, and approximating optimal transport within each cluster, it significantly straightens the transport paths. This simultaneously accelerates sampling and improves generation quality without modifying the model architecture.
CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning: A graph-based automatic data generation pipeline is proposed to construct the CRIT dataset and benchmark. It is designed to train and evaluate the cross-modal multi-hop reasoning capabilities of VLMs on interleaved image-text content. Models trained with this data achieve significant improvements on multiple benchmarks, including SPIQA.
Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection: The Phase-wise Decomposition and Alignment (PDA) framework is proposed, utilizing the Chain-of-Thought (CoT) reasoning capabilities of LLMs to decompose action labels into "start-middle-end" phase descriptions. Through text-guided foreground filtering and adaptive phase alignment, it achieves fine-grained action pattern transfer. On THUMOS14 OV-TAD, it reaches an Avg mAP of 46.9, surpassing the 41.2 of the previous SOTA, Ti-FAD.
Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning: Targeting professional fields lacking high-quality annotations such as chemistry, earth sciences, and multimodal mathematics, DoGe decouples the RL self-evolution of VLMs into "Cognitive Process Decoupling" (forcing the Thinker to analyze context first without seeing the question) and "Data Decoupling" (iterative curriculum synthesis of Knowledge Pools and Seed Problem Pools). By using a two-stage RL cycle to avoid reward hacking and entropy collapse caused by synthetic data, the 3B/7B models achieve average improvements of 5.7% / 2.3% across 7 benchmarks.
Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models: This paper discovers that Multimodal Large Language Models (MLLMs) experience spatial dispersion of visual attention in "Chain-of-Thought" (CoT) reasoning modes, drifting away from task-relevant regions (the longer they think, the more they miss). Consequently, the authors propose the training-free VRGA framework: it uses an "Entropy-Focus" criterion to automatically identify attention heads that actually process visual information, locates task-relevant regions, and re-weights these regions during the generation phase. This restores visual grounding and reduces off-topic responses without retraining, improving VQA scores by 1–6 points across different model scales.
DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models: DeepScan is a training-free framework that mimics the human visual reasoning process of "capturing local cues first and then aggregating evidence bottom-up." By wrapping the LVLM in a three-stage pipeline consisting of Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning, it achieves 90.6% accuracy on the V* benchmark using Qwen2.5-VL-7B (+16.3% relative to the base model). It can be seamlessly transferred to different architectures and parameter scales without any fine-tuning.
DPAD: Discriminative Perception via Anchored Description for Reasoning Segmentation: Aiming at the issue where geometric rewards in RL+GRPO training for Reasoning Segmentation (RS) fail to constrain whether the reasoning chain focuses on unique attributes of the target, the DPAD method is proposed. It generates a reasoning chain + geometric localization + anchored description. By introducing a CLIP-based Discriminative Perception Reward to compare similarity differences between the description and ROI/AOI, it forces descriptions to be more discriminative, thereby indirectly constraining the reasoning chain to focus on the target. On ReasonSeg, cIoU improves by 3.09%, and reasoning chain length is reduced by 42%.
dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models: For unified generative-understanding diffusion multi-modal large language models (dMLLMs), this work utilizes the model's own image-text understanding capabilities as a "judge" (Self-Verified Feedback) to score candidate images. Combined with a coarse-to-fine Hierarchical Trajectory Search, it reduces the search cost from the \(O(NT)\) of conventional linear search to \(O(N+T)\). This significantly improves the generation quality of three dMLLMs on GenEval while being 5–6 times faster than linear search.
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding: DocSeeker is proposed to achieve structured reasoning and evidence grounding in long document understanding through an ALR (Analyze-Locate-Reason) visual reasoning paradigm and a two-stage training process (SFT + EviGRPO). Trained on short documents, it generalizes robustly to ultra-long documents.
Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs: When connecting MLLMs to visual tools like depth, optical flow, or matching, the bottleneck is not the number of tool calls or model size, but "how the tool output is fed." This paper proposes Perception Program (P2), which rewrites raw dense pixel-level tool outputs into compact, structured, language-native symbolic summaries. It can be inserted into any MLLM without training or architectural changes. It achieves an average improvement of 19.66% across six BLINK perception tasks, with multi-view reasoning in GPT-5 Mini soaring from 41.35% to 86.47%.
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small VLMs: This study systematically investigates the impact of LLM scaling on multimodal capabilities, finding that visual tasks—rather than LLM-dependent tasks—are most affected, and that perceptual degradation is as severe as reasoning degradation. The proposed Extract+Think method (Visual Extraction Tuning + Step-by-step Reasoning) enables a minimal model with 0.6B perception and 1.7B reasoning to outperform PrismCaptioner and LLaVA-OneVision-0.5B, which are up to 12 times larger.
Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design: The paper argues that the common assumption that "the GRPO training paradigm for linguistic reasoning can be directly transferred to visual perception tasks" is invalid. Addressing two neglected characteristics of perception tasks—the need for a wider output space and finer, more stable rewards—the authors propose the plug-and-play Dr. Seg. It uses <look> tags to encourage breadth of exploration and Distribution-Ranked Reward to map multiple continuous metrics to empirical quantiles. Without altering the model architecture, it achieves SOTA on 5/6 benchmarks across segmentation, detection, and counting.
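
The "map multiple continuous metrics to empirical quantiles" idea in the Dr. Seg entry above can be sketched as below. The sketch assumes a running history of previously observed values per metric and combines the per-metric quantiles by a plain average. The function names, the averaging rule, and the neutral value for an empty history are assumptions, not details confirmed by the summary.

```python
from bisect import bisect_right


def empirical_quantile(value, history):
    """Fraction of previously observed values that are <= value."""
    if not history:
        return 0.5  # assumption: neutral reward when nothing has been observed yet
    ordered = sorted(history)
    return bisect_right(ordered, value) / len(ordered)


def distribution_ranked_reward(metrics, histories):
    """Average the empirical quantiles of several continuous metrics.

    metrics:   {name: current value}
    histories: {name: list of past values for that metric}
    Averaging is an illustrative choice (assumption).
    """
    ranks = [empirical_quantile(metrics[name], histories[name]) for name in metrics]
    return sum(ranks) / len(ranks)


reward = distribution_ranked_reward(
    {"iou": 0.5, "count_acc": 1.0},
    {"iou": [0.1, 0.4, 0.6, 0.9], "count_acc": [0.0, 1.0]},
)
```

Ranking against the observed distribution puts metrics with different scales on a common [0, 1] range, which is one plausible way to get the "finer, more stable rewards" the entry mentions.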
EduDiag: A Benchmark for Educational Diagnostic Reasoning with Error Tracing and Correction on Large Multimodal Models: EduDiag constructs the first benchmark to evaluate the "educational diagnostic reasoning" capability of Large Multimodal Models (LMMs). Given a problem, an image, a reference solution process, and a wrong answer, the model is required to reverse reconstruct the erroneous reasoning chain leading to that wrong answer and generate corrective feedback. Covering 8,345 annotations across common sense, science, and mathematics domains, evaluations of 24 mainstream LMMs show that even GPT-5 performs poorly, with error tracing identified as the core bottleneck.
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs: EgoMind is proposed as a CoT framework that requires no geometric priors. Through two core components—Role-Play Captioning (RPC) and Progressive Spatial Analysis (PSA)—it achieves competitive multi-frame spatial reasoning capabilities using only 5K SFT and 20K RL samples.
EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy: EgoProx is the first benchmark to evaluate whether Multimodal Large Language Models (MLLMs) can perform "body-object" 3D proximity reasoning from a first-person perspective. It organizes tasks into four categories based on the human cognitive hierarchy: Intention, Exploration, Exploitation, and Chain-of-Actions. Utilizing an agent-based data engine with Gemini-2.5-Pro as the controller to orchestrate various 3D tools, it automatically generates 2,405 high-quality QAs. Results show that even GPT-5 and Gemini-2.5-Pro perform far below human levels, yet minimal instruction tuning significantly unlocks "dormant" spatial knowledge within the models' pre-training.
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching: This paper proposes using "wide-baseline matching" (WBM) as a touchstone for probing and training spatial reasoning in MLLMs. It introduces ReasonMatch-Bench, stratified by viewpoint difference and matching granularity (where the strongest baseline achieves only 37.2 F1 compared to human 84.0). Utilizing an automated data pipeline that extracts verifiable correspondences from video-3D corpora and DCRL (Verifiable Reward RL with Dual-level Dynamic Curriculum), the authors improved Qwen3-VL-8B from 27.5 to 70.5 F1 on this benchmark. The model also successfully transfers to multiple spatial intelligence benchmarks without compromising general vision capabilities.
EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models: This paper proposes EMO-R3, which guides MLLMs to perform step-by-step emotional reasoning through Structured Emotional Thought (SET) and designs Reflective Emotional Reward (RER) to allow the model to re-evaluate the vision-text consistency and emotional coherence of its reasoning, significantly enhancing the interpretability and accuracy of multimodal emotion understanding.
EmoThinker: Advancing Visual-Acoustic Emotion Analysis via Structural Token Selection and Chain-of-Thought Reasoning: EmoThinker transforms visual-acoustic emotion analysis from "implicit fusion" to "explicit step-by-step reasoning": the visual end uses structural token selection to separate facial focal regions from text-conditioned backgrounds, while the audio end utilizes text-guided attention to refine paralinguistic features. Combined with the first CoET dataset featuring step-by-step reasoning chains for LoRA post-training, it achieves new SOTA on five benchmarks such as DFEW (with a 10.5% gain in zero-shot WAR on DFEW).
Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory: This paper introduces the MM-SafetyBench++ benchmark and the EchoSafe framework, which accumulates safety insights through a self-reflective memory bank maintained at inference time. This allows MLLMs to distinguish between scenarios with similar appearances but different safety intents based on context, improving contextual safety without requiring additional training.
Fast Reasoning Segmentation for Images and Videos: FastReasonSeg completely decouples "visual perception" from "reasoning"—first compressing scenes into structured digital twin JSONs using SAM-2, depth estimation, and detection; then enabling a small LLM to perform multi-step reasoning over this JSON to retrieve target masks. By employing a "Teacher-generated reasoning chain → Student SFT + RL two-stage distillation" pipeline, a 0.6B model outperforms competitors 20× its size across four image/video reasoning benchmarks while achieving 7.79 FPS with only 2.1GB VRAM usage.
From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training: Addressing severe annotation noise in Multimodal Large Models during Reinforcement Learning with Verifiable Rewards (RLVR), this paper proposes a two-stage token-level entropy scheduling method. In the early training stage, entropy is maximized to encourage exploration, resist overfitting to incorrect labels, and maintain intra-group diversity for GRPO; in the later stage, entropy is minimized to exploit and solidify knowledge into confident predictions. The method is more robust than single-direction entropy methods across GUI grounding, fine-grained classification, and open-vocabulary detection under various noise ratios.
From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs: The authors collected pedestrian-perspective outdoor videos using stereo cameras + LiDAR + IMU/GPS to construct OSI-Bench, the first three-layer (Relational/Metric/Kinematics) outdoor spatial intelligence benchmark with precise metric ground truth (8,736 QAs). Through three diagnostic experiments (blinding tests, abnormal scenes, and geometric information ablation), they demonstrate that current MLLM "spatial intelligence" on indoor benchmarks is primarily supported by language priors, which fails in the open world, particularly in dynamic reasoning.
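
The two-stage schedule in the entry above amounts to flipping the sign of an entropy term partway through training. The following minimal sketch shows that flip. The switch point, the coefficient magnitude, and the way the term is added to the loss are all assumptions chosen for illustration.

```python
def entropy_coefficient(step, total_steps, switch_fraction=0.5, magnitude=0.01):
    """Positive early (reward high entropy), negative late (penalize it).

    switch_fraction and magnitude are hypothetical hyperparameters.
    """
    if step < switch_fraction * total_steps:
        return magnitude
    return -magnitude


def total_loss(policy_loss, token_entropy, step, total_steps):
    """Policy loss minus coefficient * entropy.

    With a positive coefficient, minimizing this loss pushes entropy up
    (exploration); with a negative one, it pushes entropy down (exploitation).
    """
    return policy_loss - entropy_coefficient(step, total_steps) * token_entropy


early = total_loss(policy_loss=1.0, token_entropy=2.0, step=0, total_steps=100)
late = total_loss(policy_loss=1.0, token_entropy=2.0, step=90, total_steps=100)
```

A single-direction method would keep the coefficient's sign fixed for the whole run; the sketch differs only in the step-dependent sign.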
Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models: The authors discover an internal "fuel" signal in reasoning large multimodal models that depletes during Chain-of-Thought (CoT) reasoning. By extracting this signal with a tiny 82k-parameter network and performing linear extrapolation to the "zero-fuel" step, the total CoT length can be predicted before or at the start of inference. This enables predictive KV cache allocation (reducing allocation frequency by up to 13×) and CoT length regulation (linear control of accuracy).
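
The "linear extrapolation to the zero-fuel step" in the Fuel Gauge entry above can be sketched with an ordinary least-squares line fit. The sketch assumes the fuel readings have already been produced by some probe and only shows the extrapolation. The function name and the sample readings are hypothetical.

```python
def predict_zero_fuel_step(steps, fuel):
    """Fit fuel ~ slope * step + intercept and return the step where fuel hits 0.

    steps: token positions at which the signal was read
    fuel:  the corresponding signal values (assumed to decrease over time)
    """
    n = len(steps)
    if n < 2 or n != len(fuel):
        raise ValueError("need at least two (step, fuel) pairs of equal length")

    mean_x = sum(steps) / n
    mean_y = sum(fuel) / n
    sxx = sum((x - mean_x) ** 2 for x in steps)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(steps, fuel))
    if sxx == 0:
        raise ValueError("all readings were taken at the same step")

    slope = sxy / sxx
    if slope >= 0:
        raise ValueError("fuel is not decreasing; cannot extrapolate to zero")

    intercept = mean_y - slope * mean_x
    return -intercept / slope


# Hypothetical early readings: fuel drops by 0.1 every 10 tokens.
predicted_length = predict_zero_fuel_step([0, 10, 20], [1.0, 0.9, 0.8])
```

A predicted length of this kind is what would let a serving system reserve KV cache once up front rather than growing it repeatedly.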
G\(^2\)VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning: G\(^2\)VLM utilizes a "Mixture-of-Transformer-Experts (MoT)" architecture to integrate a feedforward 3D reconstruction expert and a semantic understanding expert within the same VLM. Relying on shared self-attention for mutual reinforcement, this 2B model can directly predict depth, point clouds, and camera poses like VGGT, while outperforming GPT-4o on spatial reasoning tasks (scoring 18.5 points higher on SPAR-Bench).
Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning: This work proposes GAR-SSL, a training-free sound source localization (SSL) framework that recasts the task as a three-stage "Generate-Analyze-Refine" metacognitive reasoning process. By directly leveraging the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs) for audio-visual localization, it achieves performance comparable or superior to trained methods on single-source and multi-source benchmarks.
Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions: Geoint-R1 formalizes "drawing auxiliary lines + formal proof" into a verifiable multimodal geometric reasoning task. By using Lean4 to encode dynamic auxiliary constructions into formal language and employing a Verification Reward Model (modulated by auxiliary line accuracy) to drive curriculum reinforcement learning, a 7B model achieves an average performance exceeding GPT-4o / Gemini-1.5-pro on the self-built Geoint benchmark.
GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models: GGBench introduces a "geometric generative reasoning" benchmark for Unified Multimodal Models (UMMs): 1,411 geometric construction problems, each strictly aligned with "natural language steps + executable GeoGebra code + rendered images." Combined with a four-stage evaluation protocol, experiments reveal that "end-to-end image generation" UMMs significantly lag behind "code-then-render" LLMs, highlighting a gap where models "can solve problems but cannot construct diagrams."
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning: This paper proposes the G2F-RAG paradigm, which renders retrieved structured knowledge into a single "reasoning frame" appended to the end of a video. This facilitates unified reasoning within the visual space of Large Multimodal Models (LMMs), avoiding attention dilution and cognitive load caused by text appending, achieving consistent training-free improvements across 8 video benchmarks.
Grounded Chain-of-Thought for Multimodal Large Language Models: This paper proposes the "Grounded Chain-of-Thought (GCoT)" task and the MM-GCoT benchmark. It requires Multimodal Large Language Models (MLLMs) to provide step-by-step reasoning with coordinate-based grounding before answering. By introducing the "Answer-Grounding Consistency" metric to quantify visual hallucinations, the study reveals that 12 state-of-the-art MLLMs commonly "answer correctly but look at the wrong place," and hallucinations are independent of model scale.
GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking: GThinker addresses the "visual inertia" problem in MLLMs—where "textual logic is flawless but misled by incorrect initial visual judgments"—by proposing a free-form Cue-Rethinking reasoning paradigm anchored by visual cues with self-triggered rethinking. Through a two-stage training process involving an "annotation pipeline + judge-guided selective cold-start + incentive RL," this capability is injected into Qwen2.5-VL-7B, achieving 81.5% on M3CoT and surpassing o4-mini.
HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models: The authors construct HandVQA, a large-scale diagnostic benchmark containing 1.6M+ multiple-choice questions automatically generated from 3D hand joint annotations regarding joint angles, distances, and relative positions. The benchmark systematically exposes severe deficiencies in current VLMs' fine-grained hand spatial reasoning and demonstrates that models fine-tuned on HandVQA achieve zero-shot transfer to downstream tasks such as gesture recognition (+10.33%) and hand-object interaction recognition (+2.63%).
Hear you are: Teaching LLMs Spatial Reasoning with Vision and Spatial Sound: This paper proposes the "Audio-Visual Spatial Reasoning" task and synthesizes a million-scale QA dataset, Hear You Are QA, featuring binaural audio and \(360^{\circ}\) panoramas using SoundSpaces 2.0. It trains "Hear You Are LLM," a multimodal large model connecting a binaural spatial audio encoder and a panoramic vision encoder to Qwen2-7B. In scenarios solvable only through spatial cues—such as "semantic mismatch between sound and visual objects" or "multiple identical objects distinguishable only by orientation"—it significantly outperforms baselines using only monaural audio.
Hierarchical Process Reward Models are Symbolic Vision Learners: This work redefines "geometric diagram understanding" as a symbolic auto-encoding problem—the encoder parses the diagram into a logical form of points/lines/shapes/relations (latents are symbolic graphs rather than pixel vectors), and an executable rendering engine redraws the logical form back into the original image. A Hierarchical Process Reward (SymHPR) + Stabilized GRPO is used to supervise this non-differentiable pipeline, enabling a 7B model to achieve a 98.2% reduction in reconstruction MSE and gains of +13% / +3% on perception and reasoning benchmarks, respectively.
HoneyBee: Data Recipes for Vision-Language Reasoners: This work systematically investigates construction principles for vision-language (VL) reasoning datasets—covering context source strategies, data interventions (image description auxiliary signals + text-only reasoning), and multi-dimensional data scaling. Based on these findings, the authors construct the HoneyBee CoT reasoning dataset with 2.5 million samples. The trained 3B VLM outperforms SOTA by 7.8% on MathVerse, while a proposed test-time scaling strategy reduces decoding costs by 73%.
Improving Vision-language Models with Perception-centric Process Reward Models: Addressing the limitation in VLM reinforcement learning where result-only rewards fail to locate specific errors, this paper introduces Perceval, a perception-centric Process Reward Model. Perceval verifies vision-language consistency step-by-step and identifies hallucinated tokens. These signals are used both during training (via token-level advantage redistribution in GRPO) and inference (via truncate-regeneration). The method achieves consistent improvements across multiple visual reasoning benchmarks and demonstrates that improved perception generalizes to stronger overall reasoning capabilities.
Incentivizing Versatile Video Reasoning in MLLMs via Data-Efficient Reinforcement Learning: This paper proposes VideoReasoner: by using only 3K cold-start data and 5K reinforcement learning data (8K in total) directly on a Base MLLM (Qwen2-VL-7B-Base), it trains three video reasoning capabilities—"event reasoning / keyframe reasoning / direct answering." During the inference phase, these are combined into a pipeline that "first locates key events and keyframes, then performs dense sampling for back-filling to generate answers." It significantly outperforms the Base model across 7 video benchmarks and matches or even surpasses Qwen2.5-VL-7B-Instruct trained on large-scale data.
InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity: InfiniBench is a fully automated, parameterizable 3D scene benchmark "generator." It translates natural language scene descriptions into physically plausible, photorealistic videos with controllable complexity. This allows for the theoretical generation of infinite VLM spatial reasoning evaluation tasks across composition, relation, and observation complexities, specifically exposing model failure modes under diverse spatial conditions.
IPR-1: Interactive Physical Reasoner: IPR enables an 8B VLM to learn physics and causality across 1000+ heterogeneous games through a closed-loop "world model imagination rollout scoring \(\rightarrow\) reinforced VLM strategy" paradigm. It utilizes a physics-centric latent action code, PhysCode, to align "semantic intent" with "visual dynamics" in a shared action space for both prediction and reasoning, achieving an overall average rank that surpasses GPT-5.
Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models: SymPL identifies that VLMs struggle with "allocentric" spatial reasoning (reasoning from the perspective of an object in the scene). It proposes a training-free approach to extract 3D information and rewrite such problems into a "symbolic layout problem" (e.g., "which colored dot falls in the yellow region") using four factors: Projection, Abstraction, Bipartition, and Localization. This converts difficult perspective transformations into simple "color region localization" tasks where VLMs naturally excel, leading to significant performance gains in both allocentric and egocentric tasks.
Latent Implicit Visual Reasoning: LIVR appends a set of learnable latent tokens to Large Multimodal Models (LMMs) and employs a "visual bottleneck" attention mask to force the answers to be generated through these tokens. This allows the model to learn task-beneficial visual abstractions without any intermediate-step supervision, consistently outperforming direct supervised fine-tuning (SFT) on 9 vision-intensive tasks and achieving state-of-the-art (SOTA) performance in multi-task and cross-dataset generalization.
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models: Addressing the common failure of VLMs in "dynamic spatial reasoning" (understanding how objects move/change relative relationships in 3D space over time), this paper proposes DSR Suite: an automated pipeline using visual foundation models to generate multiple-choice QAs with geometric cues from in-the-wild videos, constructing the training set DSR-Train and a human-refined evaluation benchmark DSR-Bench. Furthermore, it designs a lightweight Geometric Selection Module (GSM) (dual Q-Former) to inject "question-relevant" 3D priors into Qwen2.5-VL-7B, substantially outperforming all competitors on DSR-Bench with 58.9% (vs. 38.4% for the runner-up) while preserving general video understanding capabilities.
Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos: This paper proposes SynRL: a method to teach VLMs "temporal primitives" (direction, speed, state tracking, etc.) using synthetic videos (geometric movements/state changes) generated entirely via code. The core finding is that basic temporal skills learned from abstract synthetic videos can be directly transferred to real-world videos. Using only ~7.7K synthetic CoT samples, the model achieves comprehensive improvements across 15 benchmarks, even outperforming Video-R1 which uses 165K real samples (approximately 21× data efficiency).
Let VLMs Grade Their Own Thoughts: A Self-Quantification Approach to Reasoning-Aware Reward Modeling: Video-RAISE lets video VLMs score their own reasoning chains using "intrinsic confidence" (answer token probabilities) during generation. This transforms the sparse 0/1 text-matching rewards in GRPO into continuous, fine-grained learning signals. By designing two reward mechanisms, SCRE for strict logic tasks and IGSR for open-ended tasks, the method achieves SOTA performance on six video understanding benchmarks and reaches approximately 90% reasoning chain consistency.
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling: LongVT enables multimodal large language models to process long videos by emulating the human strategy of "global skimming followed by zooming into suspicious clips." It encapsulates the model's inherent temporal grounding capability into a native crop_video tool, which is interleaved within the reasoning chain to iteratively "re-examine" and correct errors. Supported by the self-constructed VideoSIAH data suite and a three-stage training pipeline, it achieves new open-source SOTA results across four long-video benchmarks.
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens: This paper proposes the Mirage framework, enabling VLMs to treat their own hidden states as "latent visual tokens" and directly append them into text sequences during decoding. This allows interleaved multimodal reasoning without generating any actual pixel-level images. Combined with a two-stage fine-tuning approach of "first visual grounding, then textual relaxation" and reinforcement learning (RL), Mirage consistently outperforms pure-text decoding and explicit image-generation baselines across multiple benchmarks such as spatial planning, jigsaw puzzles, and spatial relations.
Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding: To address the significant performance drop of Multimodal Large Language Models (MLLMs) in multi-image reasoning, this paper mimics human cognition by decomposing multi-image reasoning into five structured "meta-actions": Global / Focus / Hint / Think / Answer (the CINEMA framework). It utilizes "Retrieval-Based Tree Sampling" to generate two high-quality trajectories for cold-start and implements a two-stage reinforcement learning (RL) process—Diversity-Preserving Strategy (DPS) followed by Annealing DAPO—to prevent entropy collapse. The 7B model surpasses GPT-4o on multi-image benchmarks like MUIR and MV-Math, while also achieving gains in video and single-image tasks.
MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents: MindPower introduces a robot-centric Theory-of-Mind (ToM) reasoning framework that organizes the process of Perception → Belief → Desire → Intention → Decision → Action into a six-layer reasoning hierarchy. By optimizing reasoning consistency with Mind-Reward (based on GRPO), the model exceeds GPT-4o by 12.77% in decision-making and 12.49% in action generation.
MINERVA-Cultural: A Benchmark for Cultural and Multilingual Long Video Reasoning: The MINERVA-Cultural benchmark is introduced, featuring 2,400 human-annotated video reasoning questions across 18 languages/regions. Through evidence graphs and an iterative error isolation strategy, it reveals severe deficiencies in the cultural visual perception of current SOTA Video-LLMs (the strongest model, Gemini-2.5-Pro, achieves only 45.07% vs. 95.22% for humans).
MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation: This paper constructs MMTIT-Bench, a multilingual and multi-scenario text-image machine translation benchmark covering 14 non-English and non-Chinese languages. It also proposes the CPR-Trans data paradigm (Cognition → Perception → Translation Reasoning), which significantly improves end-to-end translation quality on 3B and 7B models, with the 7B model achieving performance competitive with a 235B model.
MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models: Expert selection in MoE is modeled as a sequential decision-making problem. The routing policy is optimized via GRPO reinforcement learning with modality-aware routing guidance. This approach consistently outperforms deterministic top-K routing and its variants on image and video understanding tasks for VLMs.
Monet: Reasoning in Latent Visual Space Beyond Image and Language: Monet enables Multimodal Large Language Models (MLLMs) to perform visual reasoning within a continuous latent visual space by generating sequences of latent embeddings as "intermediate visual thoughts," rather than relying on image cropping or external tools. Through a three-stage distilled SFT and a specialized Reinforcement Learning (RL) method called VLPO—which incorporates latent embeddings into the policy gradient—the 7B model achieves consistent gains in both real-world perception/reasoning and out-of-distribution (OOD) abstract visual reasoning.
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning: OASIS redefines streaming video reasoning as a "temporal routing" problem. It employs an online-maintained hierarchical event forest as long-term memory, coupled with a two-stage strategy of "coarse reasoning on short context, followed by refined retrieval based on semantic intent when uncertain." Without altering the MLLM or requiring training, it significantly enhances long-range accuracy and compositional reasoning for multiple streaming MLLM backbones while maintaining constant token costs.
OneThinker: All-in-one Reasoning Model for Image and Video: OneThinker utilizes an 8B model to unify 10 basic visual tasks across image and video (QA, captioning, spatio-temporal grounding, tracking, and segmentation) into a "think-then-structured-output" reasoning paradigm. It introduces EMA-GRPO to resolve optimization imbalances caused by significant differences in reward magnitudes and densities across multiple tasks, outperforming specialized models of comparable size across 31 benchmarks.
OpenMMReasoner: Pushing the Frontiers in Multimodal Reasoning with an Open and General Recipe: OpenMMReasoner provides a fully transparent and reproducible two-stage recipe for training open-source multimodal large models into strong reasoning models: starting with an SFT cold start using 874k high-quality distilled data, followed by RL (GSPO) refinement with 74k data. Based on Qwen2.5-VL-7B, it achieves an average improvement of 11.6% across nine multimodal reasoning benchmarks.
OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection: This work transforms Open-Vocabulary Object Detection (OVOD) from a "one-time static matching of text and regions" into an LLM-free proactive visual reasoning process. It employs an eight-state weak Markov Decision Process (w-MDP) to characterize visual state transitions, uses UCB Bandit to sample reasoning trajectories in uncertain regions, and jointly trains a lightweight Reward-Policy Model (RM) using Markov transition statistics. This creates a self-evolving closed loop that consistently improves rare class detection on COCO/LVIS with minimal inference overhead.
PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning: Directly migrating "confidence growth process rewards" from the language domain to vision-language reasoning fails because sparse visual perception steps are overwhelmed by the statistics of dense textual reasoning steps (mixture-induced signal degradation). To address this, PDCR uses a model-internal Visual Dependence Score combined with an Otsu threshold to cluster steps into "perception" and "reasoning" in an unsupervised manner. Advantages are then calculated via independent min-max normalization within each cluster, providing sparse visual steps with correctly scaled reward signals. This approach consistently outperforms GRPO/DAPO/PACR across 7 V-L reasoning benchmarks.
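The two mechanisms named in the PDCR entry, an Otsu split followed by per-cluster min-max normalization, can be sketched on toy numbers. This is an illustration under stated assumptions, not the paper's code: the visual dependence scores and step rewards below are hypothetical, the function names are invented for the example, and Otsu's criterion is evaluated by exhaustive search over midpoints rather than over a histogram.

```python
def otsu_threshold(scores):
    """Return the cut that maximizes between-class variance (Otsu's criterion)."""
    values = sorted(set(scores))
    best_cut, best_var = values[0], -1.0
    n = len(scores)
    # Candidate cuts lie midway between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        low = [s for s in scores if s <= cut]
        high = [s for s in scores if s > cut]
        w0, w1 = len(low) / n, len(high) / n
        m0, m1 = sum(low) / len(low), sum(high) / len(high)
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_cut, best_var = cut, var
    return best_cut


def min_max(xs):
    """Scale values to [0, 1]; a constant group maps to all zeros."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def decomposed_advantages(visual_scores, rewards):
    """Split steps by visual dependence, then normalize rewards per group."""
    cut = otsu_threshold(visual_scores)
    is_perception = [s > cut for s in visual_scores]
    out = [0.0] * len(rewards)
    for flag in (True, False):
        idx = [i for i, p in enumerate(is_perception) if p == flag]
        if not idx:
            continue
        for i, v in zip(idx, min_max([rewards[i] for i in idx])):
            out[i] = v
    return is_perception, out


# Hypothetical per-step values: two visually dependent steps with small rewards.
visual_scores = [0.9, 0.1, 0.15, 0.85, 0.2, 0.05]
rewards = [0.02, 0.5, 0.9, 0.04, 0.7, 0.3]
labels, advantages = decomposed_advantages(visual_scores, rewards)
```

In this toy case the two perception steps have rewards 0.02 and 0.04, which a single global min-max would squash close to zero; normalizing within the perception group spreads them across the full range instead.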
Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning: Addressing the defect in RLVR training for Vision-Language Models that "only verifies textual answers while allowing upstream visual perception errors to go unchecked," PEARL utilizes a "perception checklist" derived from the original problem to add a set of verifiable perception sub-questions to each reasoning task. It employs the perception reward as both a direct supervision signal and a "fidelity gate" to release reasoning updates, achieving an average improvement of approximately +9.7% over the baseline across 6 multimodal reasoning benchmarks including MathVerse.
PhysInOne: Visual Physics Learning and Reasoning in One Suite: PhysInOne is a large-scale synthetic dataset containing \(153,810\) dynamic 3D scenes and 2 million annotated videos. It covers 71 fundamental physical phenomena across mechanics, optics, fluid dynamics, and magnetism, establishing a new benchmark for physics-aware world models.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs: POINTS-Long equips a pre-trained Multimodal Large Language Model (MLLM) with a "Standby Mode": using a small set of learnable tokens, the entire visual sequence is distilled to 1/40–1/10 of its length. This maintains 97.7%–99.7% of the original accuracy for long video understanding while preserving the high-fidelity "Focus Mode" of the original model. By leveraging a detachable KV Cache, it supports ultra-long streaming videos, achieving up to a 6.2× increase in end-to-end decoding throughput.
PointThinker: Point-Incentivized Parallel Thinking for Multimodal Large Language Model: PointThinker enables Multimodal Large Language Models (MLLM) to explicitly list multiple "key points" in an image during inference and develop independent reasoning paths around each point, thereby amplifying the diversity of parallel thinking. It employs a point-level dense reward RL method, GPPO, which assigns different rewards to "useful points" and "ineffective points" within the same thinking chain. This method improves Qwen2.5-VL-7B by +4~6 points on difficult benchmarks such as HallusionBench.
Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees: The authors propose Proof-of-Perception (PoP), which models multimodal reasoning as an executable directed acyclic graph (DAG). Each perception/logic node outputs set-values with conformal prediction certificates to provide step-by-step reliability guarantees. A lightweight controller adaptively allocates computing power within a budget based on these certificates, outperforming CoT, ReAct, and PoT baselines on document, chart, and multi-image QA benchmarks.
Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation: Addressing the issue in Vision-Language Navigation (VLN) where agents "do not know which step of the instruction they have reached," Progress-Think moves away from predicting numerical completion. Instead, it enables the model to reason the "completed instruction text" from historical observations. Using an annotation-free three-stage framework (Self-supervised Progress Pre-training → Progress-Guided Policy Pre-training → Progress-Policy Joint RL Fine-tuning), it couples progress reasoning with action policies, achieving SOTA on R2R-CE / RxR-CE using only monocular RGB.
Prototypical Action Reasoning Facilitated by Vision-Language Alignment for Egocentric Action Anticipation: PAR-VLA utilizes Vision-Language Models (VLMs) to learn verbs and nouns as "disentangled visual prototypes" which serve as stable semantic anchors. It transforms open, unconstrained future action anticipation into conditional prediction guided by these semantic concepts. By refining verb-noun dependencies through a dual-stream symbiotic decoder, it achieves new SOTA results on three datasets, including EPIC-KITCHENS-100.
QUANTIPHY: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models: QUANTIPHY is the first quantitative benchmark evaluating the physical reasoning abilities of VLMs. Given a video and a single physical prior of an object (size / velocity / acceleration in real-world units), the model is required to infer the numerical values of the target object's kinematic quantities. Using 3.3K+ video-text instances and numerical ground truths, it reveals a gap where current VLMs are "linguistically plausible but numerically systematically incorrect": they rely on pre-trained world knowledge rather than faithfully using the given visual and textual inputs.
R-4B: Incentivizing General-Purpose Auto-Thinking in MLLMs via Bi-Mode Annealing and Reinforce Learning: R-4B teaches a 4B Multimodal Large Language Model (MLLM) to "think only when necessary." By first using bi-mode annealing to train a single backbone to master both "reasoning" and "direct-answering" modes, and then applying Bi-mode Policy Optimization (BPO)—which forces simultaneous sampling of thinking and non-thinking response pairs for joint optimization—it achieves SOTA performance across 25 benchmarks using simple rule-based mathematical rewards. It matches or exceeds larger models on reasoning tasks while significantly reducing redundant inference tokens.
R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning: R-C2 treats the modality gap in multimodal large models—where the same content yields different answers under image versus text inputs—as an unannotated reward signal. The model derives a question from a candidate answer, switches modalities to reconstruct the answer, and receives a reward if the reconstruction is successful. This dense cycle-consistent signal is used for GRPO reinforcement learning, achieving up to a 7.6-point gain across six multimodal reasoning benchmarks.
R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space: R4 attaches a continuously growing "4D Spatio-Temporal Knowledge Base" (semantics + 3D space + time) to frozen Vision-Language Models. During reasoning, it decomposes natural language queries into three keys—semantic, spatial, and temporal—to retrieve evidence from this memory and iteratively inject it into the VLM. Without training any parameters, R4 enables VLMs to recall objects seen minutes ago, reason about occluded or disappeared entities, and coordinate across multiple agents, significantly outperforming strong baselines like GPT-5 and o3 on embodied QA and navigation benchmarks.
Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR: The authors observed that the output entropy of OCR models on "formatted text" like formulas and tables is an order of magnitude higher than on plain text. Consequently, they propose Format Decoupled RL (FD-RL): utilizing entropy to rank and filter format-intensive hard samples, and then applying GRPO training with a suite of separate reward functions for text, formulas, and tables. The method achieves a competitive score of 90.41 on OmniDocBench among end-to-end models.
ReaGEN: Adaptive Generation of Structured Chains-of-Thought for Efficient Multimodal Reasoning: ReaGEN does not fine-tune the base Vision-Language Model (VLM). Instead, it employs a lightweight generator with only 18M parameters to adaptively "output" a structured chain-of-thought (determining which reasoning stages to use and in what order) based on the attention flow of each problem. This achieves accuracy close to deep search with a single inference pass—yielding a maximum improvement of +26 accuracy points over VReST on Qwen3-VL-4B, while reducing average token usage by approximately 53% (reaching up to 79% on certain benchmarks).
ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation: ReAlign first trains a multimodal large language model (MLLM), AIGI-R1, that can "reason" using GRPO. It then uses the generated reasoning text as a "bridge" to distill the reasoning text space into a lightweight CLIP detector via contrastive learning. This allows the small model to inherit both the cross-domain generalization and semantic error sensitivity of the large model, while requiring only the image encoder during inference. It achieves SOTA results on AIGCDetectBenchmark / AIGI-Holmes / UltraSynth-10k (mAcc 96.14% / 99.44% / 97.09%).
ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps: This paper proposes the ReasonMap benchmark, which uses high-resolution transit maps from 30 cities to construct 1,008 QA pairs. Through a two-level evaluation framework (correctness + quality), the fine-grained visual reasoning capabilities of 16 MLLMs are systematically evaluated. The study reveals that among open-source models, base models outperform reasoning models, whereas the opposite is true for closed-source models.
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning: This paper proposes the RLER dual-paradigm framework. In the training phase, GRPO is employed with three novel rewards (Frame-sensitive, Think-transparency, Anti-repetition) to teach the model to generate structured evidence. In the inference phase, a training-free orchestrator performs weighted election and self-check among multiple candidates based on evidence consistency. This approach comprehensively outperforms open-source and RL-based LMMs on 8 video benchmarks with an average gain of 6.3%, requiring only approximately 3.1 candidates.
Reinforcing Structured Chain-of-Thought for Video Understanding: This paper proposes SDRL (Summary-Driven Reinforcement Learning), a single-stage RL framework without SFT. By utilizing structured CoT (Summarize→Think→Answer) and two self-supervised mechanisms (CVK and DVR), it enhances video temporal reasoning and achieves SOTA results on 7 VideoQA benchmarks.
Reinforcing Video Object Segmentation to Think before it Segments: Veason-R1 reformulates "Video Reasoning Segmentation (VRS)" as a two-step sequential decision-making process: "select a keyframe first, then locate the target within that frame." It trains a single policy using Chain-of-Thought (CoT) SFT for cold-starting and GRPO reinforcement learning (with three types of verifiable rewards: temporal, spatial, and consistency). Using only the ReVOS dataset, it achieves SOTA results on ReVOS, ReasonVOS, and MeViS while significantly improving robustness against hallucinations.
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding: REVISOR upgrades "textual reflection" to "visual reflection"—enabling multimodal large models to propose a specific video segment for re-watching after an initial reasoning pass, call tools to densely resample that segment, and conduct a second-stage reasoning with the new visual evidence; combined with DADR (Dual Attribution Decoupled Reward) to ensure correct segment selection, it achieves an average improvement of ~2% for Qwen2.5-VL-7B across VideoMME, LongVideoBench, MLVU, and LVBench.
RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval: RMIR introduces a multimodal image retrieval benchmark requiring 1-2 steps of logical reasoning to find the target image (1,634 test queries across functional, temporal, and causal reasoning), accompanied by a fully automated and scalable data generation pipeline. Evaluations indicate that even the strongest models achieve only 46.53% R@20, with generative embeddings utilizing explicit reasoning significantly outperforming discriminative encoders.
Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework: This paper proposes the Self-Critical Inference (SCI) framework, which addresses both language bias and language sensitivity in LVLMs through logit aggregation of multi-round textual and visual counterfactual reasoning. It also introduces DRBench, a dynamic robustness benchmark for model-specific evaluation. Increasing counterfactual reasoning rounds consistently improves robustness, opening a new direction for test-time scaling.
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection: ForeSight equips VLMs with a set of low-level visual tools (Canny / Zoom / Color) and a mask-based visual reflection mechanism. Using GRPO reinforcement learning, a 7B model autonomously decides "when to invoke tools and whether to overturn draft answers" during reasoning. On the self-built Odd-One-Out saliency localization benchmark CG-SalBench, it improves IoU from 32.56% to 62.24%, approaching the performance of 72B models.
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning: BiPS shifts the "where to look" visual cues from inference-time tools or latent tokens to the training phase. By employing a pair of KL constraints (pulling toward "evidence-only" charts and pushing away from "evidence-ablated" charts) within the GRPO framework, it shapes the perceptual strategy of the VLM. Training on only 13K chart samples, Qwen2.5-VL-7B achieves a 7.3% average improvement across eight benchmarks (rising to 8.2% with 39K math data) with zero additional inference overhead.
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles: State-aware Reasoning (StaR) is proposed to improve GUI toggle control accuracy by over 30% without compromising general agent performance. This is achieved by teaching multimodal agents a three-step reasoning chain: "perceive current state → analyze target state → decide whether to act."
Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness: This paper proposes an efficient plug-and-play module that enhances the recognition and reasoning capabilities of VLMs for rare objects by learning multi-modal class embeddings. A cross-attention adapter refines visual tokens on the vision side, and object detection prompts are injected on the text side, improving the CODA-LM score from 72.8 to 75.4 without fine-tuning the VLM.
Seeing What Matters: A Training-Free Self-Guided Framework for Multimodal Detail Perception and Reasoning: Imitating the human visual process of "scan-locate-focus", SLoFo requires no training and no additional modules. It integrates the MLLM's internal gradient-weighted attention (semantic branch) and PCA reconstruction error (structural branch) into a semantic-structural importance map, crops key sub-images to feed back into the model, and utilizes stage-by-stage token pruning to suppress irrelevant visual noise. It improves TextVQA by 4.79% and DocVQA by 12.01% on LLaVA-v1.5-7B.
SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation: SegCompass uses Sparse Autoencoders (SAEs) to project MLLM Chain-of-Thought (CoT) and visual tokens into a shared high-dimensional sparse concept space. Through codebook aggregation and slot mapping, it generates observable multi-slot heatmaps to guide segmentation. This transforms the "reasoning → segmentation" path from a black box or post-hoc assembly into a verifiable "white-box" alignment, achieving or exceeding SOTA on 5 benchmarks.
Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning: Addressing the issues where uniform sampling dilutes key evidence and existing frame selection lacks purity rewards, this paper proposes EARL (Evidence-Aware Reinforcement Learning). It enables Video LLMs to actively select keyframes during reasoning, performs local resampling around these frames to recover fine-grained temporal details, and utilizes an IoU-based multi-component reward to enforce "selecting less but better." The 7B model achieves 59.8%, 69.0%, and 64.9% on LongVideoBench, MVBench, and VideoMME respectively, setting a new SOTA for open-source Video LLMs.
Self-Consistency for LLM-Based Motion Trajectory Generation and Verification: This paper extends the self-consistency paradigm of LLMs from natural language reasoning to the visual domain by defining shape families of motion trajectories via a hierarchy of Lie transformation groups. By clustering multiple trajectories sampled from LLMs under transformation-invariant distance metrics, it achieves unsupervised improvements in trajectory generation (+4-6%) and verification (+11.8% precision) without training.
Self-Critical Distillation Network for Video-based Commonsense Captioning: SCD-Net addresses two major problems caused by the "video → content description → commonsense" reasoning chain: the lack of visual grounding and the isolation of different commonsense categories. It employs self-critical reinforcement learning to strengthen visual reasoning and a joint reasoning distillation framework (cascaded teacher decoder + student + language adaptive wrapper distillation) to establish inter-class correlations. On the V2C dataset, it outperforms LLM-based methods without relying on Large Language Models.
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL: This paper proposes DIRL (Double Interactive Reinforcement Learning). The approach utilizes a mixture of data from a "single-tool expert IRL teacher + frontier model full-tool teacher" for initial SFT, followed by a second round of interactive RL refinement using the full toolset. This process trains a 3B Qwen2.5-VL into SpaceTools, a spatial reasoning agent capable of autonomously scheduling over ten vision/robotics tools. It achieves SOTA across benchmarks like RoboSpatial, BLINK, and BOP-ASK, and successfully controls a real 7-DOF robotic arm as a tool for pick-and-place tasks (86% success rate).
SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models: The SpatiaLQA benchmark is proposed (9,605 QA pairs, 241 real indoor scenes) to systematically evaluate 41 VLMs on spatial logical reasoning. A Recursive Scene Graph-Aided Reasoning (RSGAR) method is designed to enhance the spatial logical reasoning capabilities of VLMs.
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning: The SpatialStack framework is proposed to inject multi-layer geometric features from a multi-view geometry encoder (VGGT) layer-by-layer into different layers of an LLM decoder (rather than fusing only the last layer). Through hierarchical alignment—shallow layers for fine-grained spatial perception and deep layers for high-level semantic reasoning—it achieves open-source SOTA on multiple 3D spatial reasoning benchmarks.
Stable and Efficient Single-Rollout RL for Multimodal Reasoning: Addressing the dilemma in multimodal RLVR where GRPO with multiple rollouts is computationally expensive while single-rollout methods suffer from entropy collapse, this paper proposes MSSR. By replacing group normalization with a Beta conjugate baseline and introducing an "entropy-based advantage shaping" mechanism, the framework maintains stable training with only one rollout per sample. MSSR matches GRPO performance in half the training steps and exceeds it by over 2 points on average across five benchmarks.
StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering: StaR-KVQA utilizes a single open-source MLLM to autonomously generate "dual-path symbolic relational paths + path-anchored natural language explanations" as structured reasoning traces. It replaces answer-only supervised fine-tuning (SFT) with structure-aware self-distillation (supervising "reasoning trace + answer"). Without any external retrieval, it improves OK-VQA accuracy by +11.3% over the strongest baseline while providing auditable intermediate reasoning.
STAR-R1: Multi-View Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs: STAR-R1 utilizes a two-stage training approach—"Process-supervised SFT cold start + Reference-aware RL"—on Qwen2.5-VL-7B. This allows the model to mimic human behavior by first anchoring key references and then performing cross-view alignment for scene reconstruction, significantly outperforming open-source and several closed-source models on multi-view spatial understanding benchmarks such as TVR, MMSI-Bench, MindCube-Tiny, and SPAR-Bench.
Streaming Video Crime Anticipation with Spatio-Temporal Causal Reasoning: To address the issue where "existing surveillance systems only provide post-event/mid-event alerts and cannot anticipate crimes before they occur," this paper makes two contributions: constructing the STCRC dataset with spatio-temporal causal annotations (73K samples, 5 progressive causal reasoning tasks) and designing a streaming co-processor STCH that converts implicit entity dynamics into explicit causal hypergraphs for VLMs. This achieves a 70.7% relative improvement in crime classification, a 10.1% improvement in detection, and a 3.7% reduction in time prediction error.
TableMix: Enhancing Multimodal Table Reasoning in MLLMs from a Data-Centric Perspective: Addressing the paradoxical phenomenon where Multimodal Large Language Models (MLLMs) underperform text-only models in table reasoning, TableMix adopts a data-centric approach: it simultaneously mixes three types of data—multimodal table reasoning, text-only mathematical reasoning, and simple table perception—within each training batch. This restores the reasoning capability weakened by alignment pre-training while preserving visual perception. Combined with a Difficulty-aware Reward Shaping (DRS) mechanism, TableMix outperforms multimodal baselines and matches or exceeds the strongest text-only method, Table-R1, across seven table benchmarks.
TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation: TerraScope enables remote sensing VLMs to generate segmentation masks at each reasoning step and reinject visual features of masked regions into the reasoning chain ("thinking with pixels"). It features a 1-million-sample pixel-masked CoT dataset named Terra-CoT and the first benchmark evaluating both "answer + mask quality," TerraScope-Bench. It significantly outperforms existing VLMs on fine-grained geospatial tasks such as land cover estimation, area ranking, and change detection.
Think360: Evaluating the Width-centric Reasoning Capability of MLLMs Beyond Depth: This paper introduces Think360, a multimodal benchmark focusing on "reasoning width"—specifically a model's capability in multi-path searching, multi-constraint pruning, and trial-and-error backtracking. It contains 1200+ high-quality samples and utilizes a fine-grained Tree-of-Thought evaluation protocol, revealing significant weaknesses in the width-dimension reasoning of current MLLMs.
Think Visually, Reason Textually: Vision-Language Synergy in Abstract Reasoning: Addressing ARC-AGI abstract reasoning, the authors identify a complementarity where "vision excels at rule induction, while text excels at precise execution." They propose training-free VLSR (using images for rule induction and text for rule application) and MSSC (using vision to verify text answers for cross-modal self-correction). These methods achieve an average improvement of up to 4.33% over text-only baselines on GPT-4o / Gemini-2.5-Pro / o4-mini / Qwen3-VL.
Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views: 3DThinker enables VLMs to directly generate a sequence of "3D latent tokens" within the reasoning chain and align them with the geometric features of the 3D foundation model VGGT. Without requiring any 3D priors as input or relying on dense annotations, it performs spatial reasoning by "imagining 3D scenes" from limited 2D views. It consistently outperforms strong baselines across 8 spatial understanding benchmarks, with the largest model even surpassing o3.
Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs: FiNDR utilizes a reasoning-augmented Large Multimodal Model (LMM) to directly "think of" fine-grained class names for unlabeled images. By employing CLIP for visual filtering and modality coupling to construct a classifier, it pushes vocabulary-free recognition to SOTA on 5 fine-grained datasets (avg. cACC +9.5%), even surpassing the zero-shot upper bound that uses "ground truth class names."
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models: This paper provides the first quantitative analysis of the CoT reasoning process in diffusion multimodal LLMs (dMLLM), identifying two key issues: "early answer generation" and "weak visual dependence." It proposes two training-free methods, Position-Step Penalty (PSP) and Visual Reasoning Guidance (VRG), achieving up to a 7.5% accuracy improvement with 3x acceleration.
Thinking in 360°: Humanoid Visual Search in the Wild: The paper elevates "visual search" from cropping and zooming in static 2D images to an embodied task where humanoid agents actively turn their heads to find objects/paths in a 360° panorama (HVS). It uses panoramic images as a hardware-free, lightweight simulator to close the "perception-action" loop, proposes a matching in-the-wild benchmark H*Bench, and utilizes a two-stage post-training pipeline with SFT+GRPO to boost the object search success rate of a 3B open-source model from 14.83% to 47.38% and path search from 6.44% to 24.94%.
Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World: This paper introduces Dyn-Bench—a large-scale benchmark for dynamic understanding of the physical 4D world (comprising 1k videos, 7k VQA pairs, and 3k dynamic grounding pairs). It systematically evaluates the spatio-temporal reasoning capabilities of general-purpose, spatial-aware, and region-level MLLMs, revealing that existing models fail to maintain consistency between reasoning and grounding. The authors propose two structured integration methods, Mask-Guided Fusion and ST-TCM, to significantly enhance dynamic perception.
Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding: SpecTemp offloads the time-consuming frame magnification process in the "thinking-with-frames" paradigm to a lightweight 3B draft MLLM. This draft model performs dense sampling and selects sparse keyframes, allowing the 7B target MLLM to focus solely on temporal reasoning and verification. Through an iterative speculative-verification loop, it maintains or improves accuracy across 8 video benchmarks while reducing inference latency by approximately 20%.
Thinking with Programming Vision: Towards a Unified View for Thinking with Images: This paper proposes CodeVision, which enables MLLMs to directly "write code" as a unified tool interface to manipulate images (rotation, flipping, cropping, enhancement, etc.). It employs a two-stage training process of "SFT cold-start + dense process reward RL" to give the model robust multi-turn, multi-tool reasoning capabilities on images contaminated by orientation perturbations. CodeVision achieves an average gain of over ten points over base models on a self-constructed orientation transformation benchmark and nearly doubles the score of the second-best model on the multi-tool benchmark MVToolBench.
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning: VITAL equips Multimodal Large Language Models (MLLMs) with a "video clipping" tool, allowing them to densely resample suspicious time intervals into new frames during the reasoning chain to form a "multimodal chain-of-thought." Combined with difficulty-aware DGRPO reinforcement learning to stabilize multi-task training, it achieves 7B-level SOTA performance in long video QA and temporal grounding.
Towards Sparse Video Understanding and Reasoning: ReViSe reformulates video question answering as "question-driven multi-turn sparse frame selection": it selects only a few frames per turn, compresses verified evidence into a structured "summary-as-state" across turns, and stops early once confident. It serves as a plug-and-play wrapper for any VLM and supports reinforced fine-tuning using the label-free reward EAGER, achieving higher accuracy on multiple VQA benchmarks with only a few frames.
Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization: ADPO employs a reinforcement learning objective to enable the same VLM to both generate answers and provide self-verification scores. By using a "preference verification reward" to address class imbalance and "advantage decoupled optimization" to prevent reward hacking, this single-model best-of-N selection outperforms traditional dual-model "generator + verifier" setups across mathematics, visual grounding, and mobile agent tasks, while reducing inference latency by up to 53.5%.
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling: UniT transfers "test-time scaling" from language models to unified multimodal models. By using a multi-model agent pipeline to synthesize "generate→reflect→refine" multi-round Chain-of-Thought (CoT) data, it finetunes a single unified model (Bagel) to iteratively generate, verify, and correct images during inference. Controlled by "budget forcing" over the number of generation rounds, UniT achieves significant improvements in compositional generation, multi-round editing, and visual reasoning.
VAST: Video Ability-Stratified Taxonomy for Data-Efficient Video Reasoning: VAST advocates for organizing video reasoning training data by "underlying reasoning abilities" rather than "task formats." It proposes a three-tier cognitive taxonomy (Perception/Reasoning/Cognition) with the accompanying VAST-15K/VAST-Bench. Utilizing the Video-VAST reinforcement learning framework, which adds only a consistency reward without modifying the architecture, it achieves 66.3% on MVBench, surpassing Video-R1's 62.7% while saving approximately 72% of GPU hours and 96% of training samples.
VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction: VGent decomposes visual grounding into "high-level reasoning" and "low-level box prediction". It utilizes a frozen Multimodal Large Language Model (MLLM) as an encoder responsible solely for reasoning, employs off-the-shelf detectors to generate candidate boxes, and uses a decoder to cross-attend to the encoder's hidden states to "select" target boxes. This avoids the slowness and hallucinations associated with autoregressive word-by-word decoding, achieving a massive +20.6% F1 gain on multi-target benchmarks with constant inference latency.
VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding: VideoARM proposes an agentic reasoning paradigm based on Hierarchical Multimodal Memory (HM3). Through an adaptive cycle of "observe-think-act-memorize" and a coarse-to-fine tool-use strategy, it surpasses SOTA on long-video understanding benchmarks while reducing token consumption to 1/34 of DVD.
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice: This paper proposes VideoAuto-R1, a video understanding framework for "on-demand reasoning": during training it adopts a "thinking once, answering twice" (answer→think→answer) paradigm, and during inference it decides whether to trigger CoT reasoning based on the confidence of the first answer. It maintains SOTA accuracy while compressing the average response length from 149 to 44 tokens (approx. 3.3x compression).
ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking: ViRC introduces the Reason Chunking mechanism, structuring multimodal mathematical CoT into a sequence of "Critical Reasoning Units (CRUs)." This simulates the process of human experts repeatedly examining images to prove intermediate propositions step-by-step. Supported by the CRUX dataset and a progressive training strategy (Instructional SFT → Practice SFT → Strategic RL), ViRC-7B achieves an average improvement of 18.8% across mathematical benchmarks.
VisionLeaf: Entropy-Guided Leaf-First Reasoning for Efficient and Accurate Think-with-Image: VisionLeaf treats the multi-turn tool calling in think-with-image as a reasoning tree. Instead of a single-chain rollout from the root as in standard GRPO, it performs "leaf-first" splitting at nodes with the highest entropy. This approach improves Qwen2.5-VL-7B performance on VStar and HR-Bench by approximately 4.2% while nearly halving the number of tool calls, without modifying the model or training data.
VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs: VisRes is a visual reasoning benchmark constructed using pure images in a four-choice format, expanding tasks across three difficulty levels: "perception completion → single-attribute rules → multi-attribute composition." The study reveals that once linguistic prompts are removed, even frontier VLMs such as GPT-5 and Gemini-2.5 perform near random levels under subtle perturbations, exposing that their "reasoning" largely stems from language priors rather than true visual understanding.
Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models: VACoT enables unified understanding-generation models to achieve high-fidelity multi-reference image generation by first generating a structured "Adaptive Visual Planning" checklist of elements to preserve, followed by "Iterative Visual Correction" through self-reflection. By injecting this "look-and-check" capability into BAGEL via two-stage SFT + flow-GRPO training, it improves the average score on OmniContext from 5.55 to 8.26, outperforming GPT-4o on specific sub-tasks.
VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation: VOLD utilizes a text-only teacher LLM (Qwen3-8B) to train the reasoning capabilities of a vision-language student model (Qwen2.5-VL-3B). It first performs distribution alignment via SFT using teacher-generated reasoning trajectories, then integrates GRPO reinforcement learning with "on-policy distillation" (reverse KL) for joint optimization on the same rollouts. Without using any vision-language reasoning data during the entire process, VOLD outperforms methods trained directly on multimodal data across four visual reasoning benchmarks, including MMMU-Pro, MathVision, and LogicVista.
VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues: This paper introduces the VRR-QA benchmark, containing 1K meticulously annotated video question-answer pairs specifically designed to test model capabilities in reasoning about implicit visual relations (e.g., off-screen events, cross-frame causality, and spatial inference). It reveals that current state-of-the-art VideoQA models (including GPT-O3) exhibit significant deficiencies in implicit reasoning—the best model achieves only 64% accuracy, far below the human performance of 83%.
When to Think and When to Look: Uncertainty-Guided Lookback: This paper provides the first systematic analysis of the impact of test-time thinking on visual reasoning in LVLMs. It discovers that "thinking more is often inferior to looking more"—lengthy reasoning chains frequently overlook the image, leading to "long-wrong" trajectories. Based on this, the authors propose an uncertainty-guided lookback decoding strategy. By injecting visual lookback prompts when the reasoning chain drifts, the method improves performance on 6 benchmarks including MMMU by 2-6 points without modifying the model.
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought: MIRA is a multimodal benchmark specifically designed for problems that "require drawing intermediate diagrams before reasoning": 546 problems across Euclidean Geometry, Physics Reasoning, Abstract Spatial & Logic Puzzles, and Causal Transformation are provided with human-annotated intermediate visual cues. A three-level diagnostic protocol ("Direct / Text-CoT / Visual-CoT") is used to isolate the contribution of visual information. Results show that even GPT-5, Gemini 2.5 Pro, and o3 achieve less than 20% accuracy under direct input, while the average relative improvement reaches 33.7% when provided with human intermediate diagrams, proving that "drawing to think" is a core missing capability in current MLLMs.
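Many of the notes above build on GRPO, and the single-rollout entry (MSSR) contrasts GRPO's group normalization with a Beta conjugate baseline. The sketch below illustrates that contrast under stated assumptions: `grpo_advantages` is the commonly described group-normalized advantage (using the population standard deviation here; implementations differ), while `BetaBaseline` is a hypothetical illustration of a Beta-posterior-mean baseline for binary rewards and is not claimed to be MSSR's actual formulation.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for the rollouts of one prompt.

    Each reward is centered by the group mean and scaled by the group
    standard deviation, so several rollouts per prompt are required.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class BetaBaseline:
    """Hypothetical single-rollout baseline (an assumption, not MSSR itself).

    Tracks a Beta(alpha, beta) posterior over the success rate of binary
    rewards and uses its mean as the baseline, so one rollout suffices.
    """

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha
        self.beta = beta

    def advantage(self, reward):
        # Baseline is the posterior mean before seeing this reward.
        baseline = self.alpha / (self.alpha + self.beta)
        adv = reward - baseline
        # Conjugate update: a success increments alpha, a failure beta.
        self.alpha += reward
        self.beta += 1 - reward
        return adv


if __name__ == "__main__":
    print(grpo_advantages([1, 0, 0, 1]))
    tracker = BetaBaseline()
    print(tracker.advantage(1), tracker.advantage(1))
```

The design difference is that group normalization needs a batch of rollouts per prompt to estimate a mean and spread, whereas a running posterior baseline can be updated from one rollout at a time.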