🧠 VLM Reasoning

💬 ACL2026 · 31 paper notes

📌 Same area in other venues: 📷 CVPR2026 (144) · 🧪 ICML2026 (20) · 🔬 ICLR2026 (23) · 🤖 AAAI2026 (10) · 🧠 NeurIPS2025 (30) · 📹 ICCV2025 (13)

🔥 Top topics: Reasoning ×29 · Multimodal/VLM ×19 · LLM ×3 · Agents ×2

A Survey of Multimodal Mathematical Reasoning: From Perception, Alignment to Reasoning: This survey proposes two complementary perspectives: the Perception–Alignment–Reasoning (PAR) process framework and the Answer–Process–Executable (APE) evaluation framework. It systematically organizes three major task families—geometry, chart/table, and visual word problems—mapping existing methods and benchmarks onto these coordinates, making it the first process-centric multimodal mathematical reasoning survey.
Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization: The GPRO framework is proposed to address overthinking in LVLMs by using a meta-reasoning controller to dynamically route computation into three paths (Fast/Perception Re-check/Reasoning Reflection) at each token generation step, simultaneously improving accuracy and efficiency.
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation: The authors propose AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process based on language-grounded query banks. By explicitly decoupling spatial localization and semantic reasoning via anchor queries and incorporating a Token-Mask cycle consistency training objective, AnchorSeg achieves state-of-the-art (SOTA) performance on ReasonSeg (67.7% gIoU, 68.1% cIoU).
ArrowGEV: Grounding Events in Video via Learning the Arrow of Time: This paper proposes ArrowGEV, a reinforcement learning framework inspired by the physics concept of the "Arrow of Time." It models temporal directionality in videos by distinguishing between time-sensitive and time-insensitive events, improving the grounding accuracy and temporal understanding of VLMs.
Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning: VisReason constructs a multimodal benchmark comprising 1,505 daily visual reasoning questions, specifically designed to test whether models can reason directly based on visual evidence. Results demonstrate that even the strongest models achieve an average accuracy of only 47.5%, significantly lower than the human performance of 71.4%. Furthermore, Chain-of-Thought (CoT) and larger reasoning budgets provide only limited improvements.
ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding: This paper proposes ChemVLR, the first reasoning-based VLM in the chemistry domain. It constructs a 760K reasoning dataset through a cross-modal reverse engineering strategy and employs a three-stage training process (CPT-SFT-RL), significantly outperforming proprietary models and domain-expert VLMs in molecular recognition and reaction prediction tasks.
Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning: SPUR is the first benchmark for the "Perception → Understanding → Reasoning" three-stage evaluation of biomedical experimental images (multi-panel staining/Western blot/statistical charts). It contains 4,264 expert-verified MCQs and reveals that current MLLMs struggle: only Gemini 3 Pro Preview barely exceeds 60%, and quantitative reasoning accuracy is generally 12.76%–31.41% lower than qualitative reasoning accuracy.
Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision: The authors construct EgoPoint-Bench (11.7k QA / 5 dimensions / 3-level semantic reference), the first hybrid real-physical simulation benchmark for first-person "finger pointing" QA. It confirms that current SOTA MLLMs generally rely on "visual proximity/saliency" pseudo-correlations rather than truly parsing fingertip rays. By performing LoRA fine-tuning on simulated data, they achieve an average improvement of up to +25 points and robust sim-to-real generalization.
DRIFT: Transferring Reasoning Priors for Efficient MLLM Fine-Tuning: DRIFT treats the "parameter difference between a text reasoning expert and a multimodal model" as a directional prior, applying lightweight bias to gradients (without altering weights) during multimodal SFT backpropagation. Using ~4K multimodal CoT data and approximately 2 hours of training, it consistently enables Qwen2.5-VL-7B to outperform parameter merging baselines and heavy SFT/RL methods on benchmarks such as MathVista, MathVerse, and WeMath.
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection: This paper formally defines the multimodal error detection task and constructs the ErrorRadar benchmark—comprising 2,500 K-12 multimodal math problems from real-world student responses—to evaluate MLLM performance in error step identification (STEP) and error category classification (CATE). The results show that the strongest model, GPT-4o, still lags behind human experts by approximately 10-15%.
Faithful-First Reasoning, Planning, and Acting for Multimodal LLMs: This paper proposes the Faithful-First RPA framework, which evaluates perceptual faithfulness (whether claimed objects truly exist in the image) at each reasoning step via the FaithEvi pipeline. The FaithAct mechanism enforces evidence-based planning and action during the reasoning generation process, improving perceptual faithfulness by up to 24% without compromising task accuracy.
Forest Before Trees: Latent Superposition for Efficient Visual Reasoning: This paper proposes Laser, which performs visual reasoning in latent space via Dynamic Window Alignment Learning (DWAL). It enables the model to maintain a "probabilistic superposition" of future semantics during reasoning rather than precise token-by-token prediction, achieving a "global-to-local" cognitive hierarchy. Laser achieves SOTA among latent reasoning methods on 6 benchmarks with only 6 reasoning tokens (a 97%+ reduction), outperforming Monet by an average of 5.03%.
GeoArena: Evaluating Open-World Geographic Reasoning in Large Vision-Language Models: This paper introduces GeoArena, a "dynamic, label-free, and process-oriented" evaluation platform for open-world geographic reasoning in LVLMs. It reformulates geographic localization under in-the-wild images as a pairwise reasoning alignment task, ranking 17 frontier LVLMs via human preferences and Bradley-Terry scores, achieving an expert-crowdsourcing agreement rate of 78%.
GeoRC: A Benchmark for Geolocation Reasoning Chains: GeoRC is proposed as the first geolocation reasoning chain benchmark authored by GeoGuessr champion-level experts (800 reasoning chains, 500 scenes). It evaluates the capability of VLMs to generate auditable reasoning chains. The study finds that while closed-source VLMs match human localization accuracy, their reasoning chain quality lags significantly behind, whereas open-source VLMs perform nearly identically to a pure hallucination baseline.
HierVA: Hierarchical Visual Agent — Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning: HierVA uses a "manager–worker" two-layer multimodal agent to manage both image and text contexts during chart reasoning through a disciplined "acquisition–limitation–distillation" process. Without any training, it surpasses strong baselines such as CoT and "thinking with images" on complex chart reasoning benchmarks such as CharXiv.
iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models: iReasoner enables LMMs to perform self-questioning and answering on unlabeled images, extending final answer consistency to consistency rewards for intermediate CoT steps, resulting in an improvement of up to +2.13 points in multimodal reasoning on Qwen2.5-VL-7B.
MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models: This paper proposes MMErroR, a multimodal erroneous reasoning benchmark containing 1,997 samples. Each sample embeds a single reasoning error across six major domains and four error categories. It requires VLMs not only to detect the presence of errors in the reasoning chain but also to classify the error type (Visual Perception Error / Knowledge Application Error / Question Comprehension Error / Reasoning Error). Evaluation of 12 representative VLMs shows that even the strongest model, Gemini-3-Pro-Preview, only achieves 66.65% accuracy.
OMHBench: Benchmarking Balanced and Grounded Omni-Modal Multi-Hop Reasoning: OMHBench constructs a 6,144-task omni-modal three-hop reasoning benchmark covering text, image, and speech contexts. Through entity-attribute chains and six balanced reasoning paths, it exposes systematic weaknesses in current MLLMs regarding speech grounding, path robustness, and cross-modal grounding.
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models: This paper introduces OMIBench, the first large-scale benchmark for Olympiad-level multi-image reasoning. It covers over 1,000 competition problems in Biology, Chemistry, Mathematics, and Physics. The study finds that even the strongest LVLM (Gemini-3-Pro) achieves only approximately 50% accuracy, representing a decline of over 25% compared to single-image benchmarks.
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning: This is a position paper advocating that Multimodal Large Language Models (MLLMs) can significantly advance interdisciplinary scientific reasoning. It proposes a four-stage research roadmap (Broad Knowledge Recognition → Analogical Reasoning Generalization → Insightful Reasoning → Creative Hypothesis Generation) and systematically reviews the current status, five major challenges, and eight future directions of MLLMs in mathematics, physics, chemistry, and biology.
PROGRESSLM: Towards Progress Reasoning in Vision-Language Models: This paper defines the ability to "judge task completion steps from a single-frame observation" as progress reasoning for VLMs. It constructs Progress-Bench and ProgressLM-45K, demonstrating that explicit learning of "episodic retrieval + mental simulation" is more stable than simple inference prompting.
SciMDR: Advancing Scientific Multimodal Document Reasoning: SciMDR proposes a synthesize-and-reground data construction framework. It first synthesizes faithful QA pairs and reasoning chains based on atomic claims, and then re-embeds them into full scientific papers for training. This enables a 7B VLM to approach the performance of the GPT-5 series in scientific multimodal document reasoning.
ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction: ShredBench constructs an evaluation benchmark that requires multimodal large language models to restore content from "shredded" documents. The results demonstrate that while current MLLMs are proficient in conventional OCR, they generally lack the ability to integrate visual fragments, reading order, and semantic context for reasoning.
Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images: This paper proposes the STAR data engine and a two-stage training framework for multi-modal relational knowledge (MMRK) images. Using STAR-64K synthetic data, CoT annotations, and knowledge-aware KGRPO, it significantly improves the capability of MLLMs in understanding and reasoning over abstract structured knowledge images.
TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity: TableVista constructs a multimodal table benchmark consisting of 3,000 high-quality table reasoning questions expanded into 30,000 visual samples. A systematic evaluation of 29 foundation models finds that models are relatively robust to style changes but degrade significantly under complex structures, cross-table reasoning, visual fragmentation, and vision-only input.
TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos: This paper proposes TemporalVLM, which extracts local fine-grained temporal features through a time-aware segment encoder (overlapping sliding Video Q-Former + fusion module) and aggregates global long-range dependencies using a BiLSTM. This marks the first introduction of LSTM into Video LLMs, outperforming previous methods across four tasks: dense video captioning, temporal localization, highlight detection, and action segmentation.
Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry: This paper introduces the PlantInquiryVQA benchmark and the Chain-of-Inquiry (CoI) framework, comprising 24,950 plant images and 138,068 QA pairs. It simulates the adaptive diagnostic questioning strategies of botanists to evaluate the multi-step visual reasoning capabilities of 18 MLLMs in plant pathology diagnosis. The study reveals that structured questioning significantly enhances diagnostic accuracy and reduces hallucinations, although even the strongest model achieved a clinical utility score of only 0.188.
TRACE: Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning: This paper proposes TRACE (Textual Representation of Allocentric Context from Egocentric Video), a prompting method that guides Multimodal Large Language Models (MLLMs) to generate structured textual allocentric 3D environmental representations—including meta-context, camera trajectories, and entity registries—from egocentric videos. These serve as intermediate reasoning steps to enhance spatial question-answering capabilities, consistently outperforming existing prompting strategies on VSI-Bench and OST-Bench.
VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning: VL-Calibration decomposes the verbalized confidence of LVLMs into visual confidence and reasoning confidence. By utilizing image perturbation KL divergence, token entropy, and token-level advantage reweighting for training, the model simultaneously reduces ECE and improves accuracy across 13 visual reasoning benchmarks.
What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning: This paper proposes the UILoop (UI-in-the-Loop) paradigm, which restructures GUI reasoning from the traditional "Screen → Action" into a cyclic "Screen → UI Element → Action" process. Through UI-element-driven reinforcement fine-tuning, the model explicitly learns to locate, understand, and utilize key UI elements, achieving SOTA performance in GUI reasoning tasks.
When Slower Isn't Truer: Inverse Scaling Law of Truthfulness in Multimodal Reasoning: This paper identifies an "inverse scaling law" in multimodal reasoning: reasoning (slow-thinking) models are more prone to generating untruthful outputs than chat (fast-thinking) models when faced with misleading visual inputs. It constructs the TruthfulVQA benchmark (5,000+ samples, 50 annotators, three-tiered prompting) and the TruthfulJudge evaluation model (88.4% accuracy) to systematically diagnose this phenomenon.
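
📐 Metric note: the VL-Calibration entry above reports ECE (expected calibration error) without defining it. The sketch below shows one common way to compute it, assuming the standard equal-width binned definition and confidences in [0, 1]. It is not the paper's evaluation code, and the function name `expected_calibration_error` and the `n_bins` default are illustrative assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    Assumes each confidence lies in [0, 1] (hypothetical input convention).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # bin_stats[b] = [sample count, sum of confidences, number of correct answers]
    bin_stats = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 falls in the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        bin_stats[b][0] += 1
        bin_stats[b][1] += conf
        bin_stats[b][2] += int(ok)

    ece = 0.0
    for count, conf_sum, n_correct in bin_stats:
        if count == 0:
            continue  # empty bins contribute nothing
        gap = abs(n_correct / count - conf_sum / count)
        ece += (count / n) * gap
    return ece
```

A lower value means the stated confidence tracks accuracy more closely. For example, a model that answers four questions at 0.9 confidence and gets three right has a gap of |0.75 − 0.9| in its single occupied bin.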