
🧠 VLM Reasoning

📹 ICCV2025 · 15 paper notes


🔥 Top topics: Reasoning ×13 · Multimodal/VLM ×7

Boosting MLLM Reasoning with Text-Debiased Hint-GRPO: This paper identifies two critical issues in applying GRPO to MLLM reasoning — low data utilization (invalid gradients when all sampled outputs for a hard question are incorrect) and text bias (the model ignores visual input and relies solely on textual reasoning) — and proposes two corresponding solutions: Hint-GRPO (adaptively providing reasoning hints) and text-debiasing calibration (enhancing image conditioning at test time). The approach achieves significant reasoning improvements across 11 datasets on 3 base MLLMs.
ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning: This paper proposes PointCoT, which integrates reflective visual grounding (bounding boxes) into the chain-of-thought for chart reasoning, enabling MLLMs to interactively verify each reasoning step against the chart's visual content. It also constructs the ChartPoint-SFT-62k dataset containing 19.2K high-quality samples, achieving a +5.04% improvement on ChartBench.
DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning: This paper proposes the DWIM framework, which employs a discrepancy-aware workflow generation strategy to curate high-quality training data and an instruct-masking fine-tuning strategy to clone only effective actions, endowing LLMs with tool-aware capability for compositional visual reasoning and achieving state-of-the-art results on multiple VR benchmarks.
FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging: This paper proposes FinMMR, a bilingual (Chinese–English) multimodal financial numerical reasoning benchmark containing 4,300 questions and 8,700 images spanning 14 financial sub-domains, requiring models to perform multi-step precise numerical computation. Evaluation of 15 state-of-the-art MLLMs shows that the best model achieves only 53% accuracy on the Hard subset, exposing fundamental bottlenecks in current MLLMs for professional-domain multimodal reasoning.
From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning: This paper proposes the MIR benchmark, comprising 22,257 multi-image interleaved reasoning QA pairs with five-stage reasoning steps, and introduces a progressive curriculum learning strategy that trains MLLMs from easy to hard samples to improve multi-image interleaved reasoning capability.
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step: LLaVA-CoT proposes a method enabling vision-language models to perform autonomous multi-stage structured reasoning. By constructing the LLaVA-CoT-100k structured reasoning annotation dataset, the model is trained to sequentially execute four stages—Summary, Caption, Reasoning, and Conclusion—and a Stage-Wise Retracing Search (SWIRES) is proposed for test-time scaling, allowing an 11B model to surpass Gemini-1.5-pro and GPT-4o-mini.
MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning: This paper introduces MMAT-1M, the first million-scale multimodal agent tuning dataset, constructed via a four-stage data engine (Foundation → Rationale → Reflection → Integration). It endows MLLMs with CoT reasoning, tool invocation, and self-reflection capabilities, achieving an average improvement of 2.7% on InternVL2.5-8B and 8.8% on RAG tasks.
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation: This paper proposes the Abstract Perspective Change (APC) framework, which leverages visual foundation models to construct an abstract scene representation and perform perspective transformations, enabling VLMs to reason spatially from arbitrary viewpoints. APC substantially outperforms existing VLMs and fine-tuned models on both synthetic and real-image benchmarks.
Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models: This paper proposes Physics Context Builders (PCBs), a modular framework that fine-tunes small specialized VLMs on simulation data to generate detailed physical scene descriptions, which serve as physical context to augment the physical reasoning capabilities of large foundation VLMs (e.g., GPT-4o), without modifying the large model itself.
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization: This paper proposes StepGRPO, an online reinforcement learning framework that introduces two rule-based step-wise reasoning rewards — StepRAR (Step-wise Reasoning Accuracy Reward) and StepRVR (Step-wise Reasoning Validity Reward) — without requiring a process reward model. The framework addresses the sparse reward problem in RL-based MLLM training, enabling models to autonomously explore and improve their reasoning capabilities.
ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering: This paper proposes ReasonVQA, a dataset constructed through a low-cost and scalable framework that automatically integrates structured encyclopedic knowledge (Wikidata) with images to generate 1-, 2-, and 3-hop reasoning questions. The benchmark comprises 598K images and 4.2M questions, posing significant challenges to existing VQA models.
Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown: Using Monster Hunter: World as a testbed, this paper constructs a multimodal knowledge graph (MH-MMKG) containing text, images, video, and complex entity relations, designs 238 complex queries along with a multi-agent knowledge retrieval method, and reveals the inadequacy of current MLLMs in domain-specific knowledge retrieval and reasoning tasks.
ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools: This paper proposes ToolVQA — a large-scale multimodal tool-augmented VQA dataset containing 23K samples. It is automatically constructed via the ToolEngine pipeline, which combines image-guided DFS with LCS-based example matching, to generate multi-step reasoning data in realistic scenarios. LLaVA-7B fine-tuned on this dataset surpasses GPT-3.5-Turbo on 5 OOD benchmarks.
Training-Free Personalization via Retrieval and Reasoning on Fingerprints: This work proposes R2P, the first training-free VLM personalization method. It leverages the inherent world knowledge of VLMs to extract concept "fingerprint" attributes, achieving personal concept recognition through a retrieval-reasoning paradigm and cross-modal attribute verification, without requiring any fine-tuning or large-scale pre-training.
Understanding Museum Exhibits using Vision-Language Reasoning: This paper constructs Museum-65, a large-scale museum exhibit dataset containing 65 million images and 200 million QA pairs. Fine-tuning BLIP and LLaVA on this dataset shows that models trained on large-scale domain-specific data significantly outperform zero-shot state-of-the-art (SOTA) VLMs: the fine-tuned LLaVA achieves 57% and 70% accuracy on exhibit title and origin identification, respectively, compared to 22% and 33% for GPT-4o.
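
A note on the "low data utilization" issue mentioned in the Hint-GRPO summary above: GRPO scores each sampled answer relative to the other answers drawn for the same question, so a group in which every answer is wrong carries no learning signal. The sketch below illustrates this with a generic group-normalized advantage. It is not the paper's implementation; the 0/1 correctness rewards, the use of the population standard deviation, and the function name `group_relative_advantages` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    Assumption: advantage = (reward - group mean) / (group std + eps).
    Real implementations may differ in the exact normalization.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard question: all four sampled answers are wrong, so every reward is 0.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Easier question: one of the four sampled answers is correct.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

print(all_wrong)  # every advantage is zero, so the policy gradient vanishes
print(mixed)      # the correct answer is pushed up, the wrong ones down
```

When every advantage in a group is zero, that question contributes nothing to the update, which is the wasted-sample case that hint-based approaches aim to reduce by making at least some sampled answers correct.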