🧠 VLM Reasoning

💬 ACL2025 · 18 paper notes

🔥 Top topics: Reasoning ×18 · Multimodal/VLM ×12

AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness: This work proposes AdamMeme—an adaptive evaluation framework based on multi-agent collaboration, which probes the reasoning capabilities and specific weaknesses of Multimodal Large Language Models (mLLMs) in harmful content understanding by iteratively generating more challenging meme samples.
Answering Complex Geographic Questions by Adaptive Reasoning with Visual Context and External Commonsense Knowledge: This paper proposes an adaptive reasoning framework for complex geographic questions. It combines visual context (such as maps and satellite images) with external commonsense knowledge bases for multi-step reasoning, dynamically selecting reasoning paths based on question complexity, and significantly outperforms direct end-to-end answering methods on geographic VQA tasks.
Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning: This paper constructs a systematic evaluation benchmark to assess large vision-language models (LVLMs) on basic visual graph structure understanding and reasoning, finding that existing models perform poorly on such tasks, and proposes targeted improvement methods.
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs: This paper proposes a method to transfer the reasoning capabilities of LLMs to VLMs. Through improved chart representation pre-training, construction of large-scale synthetic reasoning datasets, and multi-task fine-tuning, the 5B-parameter PaLI-3 model outperforms models 10 times its size on ChartQA.
FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning: This work constructs FCMR, a cross-modal multi-hop reasoning benchmark in the financial domain that spans three modalities: text, tables, and charts. Its questions are grouped into three difficulty levels: Easy, Medium, and Hard. The strongest model, Claude 3.5 Sonnet, achieves only 30.4% accuracy on the Hard level, revealing critical bottlenecks of MLLMs in the information retrieval phase.
FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation: This work constructs FinMME, an evaluation benchmark containing over 11,000 high-quality financial multimodal samples across 18 financial domains and 10 chart types. It proposes the FinScore evaluation framework, which integrates hallucination penalties with domain normalization. Experimental results show that even GPT-4o scores only 15.34 (with an average accuracy of 46.56%), revealing significant deficiencies of MLLMs in the financial domain.
Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?: This work systematically evaluates 13 open-source small LVLMs (\(\le 9\text{B}\) parameters) serving as judges for chart comprehension and reasoning tasks. It finds that some open-source models (e.g., LLaVA-Critic-7B) can achieve evaluation capabilities close to GPT-4 (about 80% agreement rate), though issues like positional bias and length bias remain prevalent.
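The two quantities mentioned in this note, agreement with a reference judge and positional bias, can be illustrated with a minimal sketch. The function names and the label format (`"A"` or `"B"` naming the preferred response) are assumptions made for illustration and are not taken from the paper, whose exact protocol may differ.

```python
def agreement_rate(judge_labels, reference_labels):
    """Fraction of items where the judge picks the same label as the reference."""
    if len(judge_labels) != len(reference_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        return 0.0
    matches = sum(j == r for j, r in zip(judge_labels, reference_labels))
    return matches / len(judge_labels)


def position_flip_rate(verdicts_original, verdicts_swapped):
    """Fraction of pairs whose preferred response changes when the two
    candidate responses are presented in swapped order.

    Each verdict names the preferred response ("A" or "B"), not the slot,
    so an order-insensitive judge gives identical verdicts in both runs.
    """
    if len(verdicts_original) != len(verdicts_swapped):
        raise ValueError("verdict lists must have the same length")
    if not verdicts_original:
        return 0.0
    flips = sum(o != s for o, s in zip(verdicts_original, verdicts_swapped))
    return flips / len(verdicts_original)


# Hypothetical toy data: the judge matches the reference on 4 of 5 items.
judge = ["A", "B", "A", "A", "B"]
reference = ["A", "B", "B", "A", "B"]
print(agreement_rate(judge, reference))  # 0.8
```

A flip rate near zero suggests the judge's preference does not depend on presentation order; length bias would need a separate check, for example by comparing verdicts against response lengths.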
LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating: This paper proposes the LongDocURL benchmark, which covers 20 subtasks across three primary task categories: understanding, numerical reasoning, and cross-element locating. It contains 2,325 high-quality QA pairs spanning over 33,000 pages of documents. A systematic evaluation of 26 model configurations exposes key performance gaps of current LVLMs in long document understanding.
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale: This paper proposes a scalable, low-cost method for constructing MAmmoTH-VL-Instruct, a multimodal instruction-tuning dataset of 12 million instances rich in Chain-of-Thought (CoT) reasoning, using only open-source models. The resulting model, MAmmoTH-VL-8B, achieves state-of-the-art (SOTA) performance on multimodal reasoning benchmarks (e.g., MathVerse +8.1%, MMMU-Pro +7%, MuirBench +13.3%).
MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning: This paper uses code as a supervisory signal for cross-modal alignment. It constructs the ImgCode-8.6M dataset of 8.6 million image-code pairs and the MM-MathInstruct-3M dataset of 3 million multimodal mathematical instruction-tuning samples. The trained MathCoder-VL achieves SOTA performance in multimodal mathematical reasoning among open-source models and outperforms GPT-4o and Claude 3.5 Sonnet on geometry problems.
MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration: This paper proposes the MMBoundary framework, which inserts natural language confidence statements at each step of the reasoning chain (rather than offering confidence only after the final answer). It combines textual and cross-modal self-reward signals to estimate confidence, and utilizes a two-stage training paradigm of SFT and RL to achieve step-level confidence calibration, reducing the calibration error by an average of 7.5% and improving task accuracy by 8.3%.
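To make "calibration error" concrete, here is a minimal sketch of Expected Calibration Error (ECE), one common way to measure the gap between stated confidence and actual correctness. Whether MMBoundary reports exactly this metric is not stated in this note, so treat the choice of ECE, the bin count, and the per-step input format as assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average, over equal-width confidence bins, of the absolute gap
    between mean confidence and accuracy within each bin.

    confidences: floats in [0, 1], e.g. one per reasoning step (assumed format)
    correct:     booleans, True if the corresponding step was right
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical example: four steps stated at 0.9 confidence, three correct.
# Mean confidence 0.9 vs. accuracy 0.75 gives a gap of about 0.15.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A lower value means stated confidence tracks correctness more closely, which is the property that step-level calibration aims to improve.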
Progressive Multimodal Reasoning via Active Retrieval: This paper proposes the AR-MCTS framework, which combines Active Retrieval with Monte Carlo Tree Search (MCTS) to dynamically retrieve key knowledge at each step of multi-step multimodal reasoning, replacing traditional beam search sampling. It automatically generates step-by-step reasoning annotations to progressively align the Process Reward Model (PRM), significantly improving the reasoning performance of multiple MLLMs on MathVista, We-Math, and GAOKAO-MM.
SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data: This study identifies a severe shortage of spatial relation data in existing VLM datasets (where the top 17% of relations cover over 90% of samples). To address this, the authors propose leveraging LLMs to automatically extract a synthetic spatial reasoning dataset of 455k samples (3.4 million QA pairs) from ultra-detailed image description datasets such as DOCCI, Localized Narratives, and PixMo-Cap. The fine-tuned SpaRE model achieves up to a 49% performance boost on the What's Up benchmark without compromising general vision-language capabilities.
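The skew statistic quoted above (a small share of relation types covering most samples) can be computed as in the sketch below. The relation names and counts are made-up toy values, and `top_fraction_coverage` is a hypothetical helper, not code from the paper.

```python
import math
from collections import Counter


def top_fraction_coverage(relation_counts, fraction):
    """Share of all samples covered by the most frequent `fraction` of relation types."""
    total = sum(relation_counts.values())
    if total == 0:
        return 0.0
    # Always include at least one relation type.
    k = max(1, math.ceil(fraction * len(relation_counts)))
    top_counts = sorted(relation_counts.values(), reverse=True)[:k]
    return sum(top_counts) / total


# Toy data: one of four relation types accounts for 90 of 100 samples.
counts = Counter({"on": 90, "under": 5, "left of": 3, "behind": 2})
print(top_fraction_coverage(counts, 0.25))  # 0.9
```

Running the same computation over a real dataset's relation annotations would show how long-tailed its spatial vocabulary is.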
The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights: This study shows that existing multimodal mathematical reasoning models make very limited use of visual information: shuffling or removing training images has a negligible impact on model performance. It proposes the HC-M3D benchmark to test genuine visual dependency and shows that mainstream models fail to identify subtle variations in images.
Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning: This paper identifies a severe visual forgetting phenomenon in MLLMs during long CoT reasoning: removing the image halfway through reasoning causes only a ~2% drop in accuracy, indicating that models rely excessively on self-generated text while ignoring visual evidence. To address this, the authors propose the Take-along Visual Conditioning (TVC) strategy, which injects an image review mechanism via Dynamic Visual Reaffirmation (DVR) during training, and compresses and re-injects visual tokens via Periodic Visual Calibration (PVC) during inference. TVC outperforms the previous SOTA by 3.4 points on average (43.4 vs. 40.0) across five mathematical reasoning benchmarks.
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search: This paper proposes the VisuoThink framework, which dynamically integrates visual aids during reasoning and explores multiple reasoning paths through vision-text interleaved reasoning and predictive rollout tree search. Without fine-tuning, VisuoThink achieves SOTA performance on geometric and spatial reasoning tasks (reaching up to 48.5% Accuracy@1 on Geomverse-109, a 21.8% improvement over the best baseline).
VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism: VReST applies Monte Carlo Tree Search (MCTS) to Large Vision-Language Models (LVLMs) for multimodal mathematical reasoning: each tree node represents a reasoning step, each path represents a complete reasoning chain, and a multimodal self-reward mechanism that does not introduce any additional models is used to score each step. This systematically explores the reasoning space without training, achieving SOTA on three multimodal mathematical reasoning benchmarks and validating that the test-time scaling law also holds for multimodal tasks.
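The tree structure described above (nodes as reasoning steps, paths as reasoning chains) can be sketched with a generic MCTS skeleton. This shows only textbook UCT selection and reward backpropagation; the `Node` class, the exploration constant, and the reward values are illustrative assumptions, and VReST's actual selection rule and self-reward scoring may differ.

```python
import math


class Node:
    """One reasoning step; the path from the root to a node is a partial chain."""

    def __init__(self, step, parent=None):
        self.step = step
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0
        if parent is not None:
            parent.children.append(self)


def uct_score(node, c=1.4):
    """Standard UCT: mean reward plus an exploration bonus for rarely visited nodes."""
    if node.visits == 0:
        return math.inf  # unvisited steps are tried first
    exploit = node.total_reward / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore


def select_child(node, c=1.4):
    """Pick the child step with the highest UCT score."""
    return max(node.children, key=lambda child: uct_score(child, c))


def backpropagate(node, reward):
    """Add a step-level reward to every node on the path back to the root."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


# Toy usage with placeholder rewards standing in for a self-reward score.
root = Node("question")
step_a = Node("candidate step A", parent=root)
step_b = Node("candidate step B", parent=root)
backpropagate(step_a, 1.0)
backpropagate(step_b, 0.0)
print(select_child(root).step)  # candidate step A
```

In a training-free setup like the one described, the reward passed to `backpropagate` would come from the model scoring its own step rather than from a separate reward model.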
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?: This paper proposes the We-Math benchmark, containing 6.5K visual mathematical problems and 67 hierarchical knowledge concepts. By decomposing composite problems into sub-problems, it introduces a four-dimensional evaluation metric (Insufficient Knowledge (IK), Insufficient Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM)), systematically evaluating the mathematical reasoning process of LMMs from the perspective of knowledge mastery for the first time, rather than focusing solely on the final results.
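One way to read the four categories is as a comparison between a model's answer on a composite problem and its answers on the decomposed sub-problems. The sketch below encodes that reading; the decision rules are an assumption inferred from the category names and may not match the paper's exact definitions (for example, any strict versus loose variants).

```python
from collections import Counter


def classify_mastery(composite_correct, sub_correct):
    """Assign one of IK / IG / CM / RM to a single composite problem.

    composite_correct: True if the model solved the composite problem
    sub_correct:       list of booleans, one per decomposed sub-problem
    """
    all_subs_correct = all(sub_correct)
    if composite_correct:
        # Right answer with every sub-step right: complete mastery.
        # Right answer despite a failed sub-step: likely rote memorization.
        return "CM" if all_subs_correct else "RM"
    # Wrong answer although every sub-step is right: fails to generalize.
    # Wrong answer with a failed sub-step: missing knowledge.
    return "IG" if all_subs_correct else "IK"


def category_rates(results):
    """Fraction of problems in each category; `results` holds (composite, subs) pairs."""
    counts = Counter(classify_mastery(comp, subs) for comp, subs in results)
    total = len(results)
    return {cat: counts[cat] / total for cat in ("IK", "IG", "CM", "RM")} if total else {}


# Hypothetical results for four composite problems.
results = [
    (True, [True, True]),    # CM
    (True, [True, False]),   # RM
    (False, [True, True]),   # IG
    (False, [False, True]),  # IK
]
print(category_rates(results))
```

Reporting these rates side by side separates models that get final answers right for the wrong reasons from models that actually master the underlying concepts.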