🧩 Multimodal VLM

🔬 ICLR2026 · 93 paper notes

📌 Same area in other venues: 💬 ACL2026 (52) · 📷 CVPR2026 (287) · 🤖 AAAI2026 (92) · 🧠 NeurIPS2025 (151) · 📹 ICCV2025 (142)

🔥 Top topics: Multimodal/VLM ×45 · Reasoning ×22 · LLM ×5 · Agents ×4 · Robotics ×4

A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models: This paper proposes the A-TPT framework, which promotes angular diversity by maximizing the minimum pairwise angular distance among normalized text features on the unit hypersphere. It addresses the miscalibration caused by overconfident predictions in test-time prompt tuning (TPT) of VLMs, achieving superior performance over existing TPT calibration methods on both natural distribution shifts and medical datasets.
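To make the quantity in the A-TPT note concrete, here is a minimal sketch of the minimum pairwise angular distance among unit-normalized feature vectors. The function names and the negated-angle loss are illustrative assumptions; the note does not give the paper's exact objective or how it is combined with the TPT loss.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def min_pairwise_angle(features):
    """Smallest angle (radians) between any two normalized feature vectors."""
    unit = [normalize(f) for f in features]
    best = math.pi  # largest possible angle between two vectors
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            cos = max(-1.0, min(1.0, cos))  # guard against rounding drift
            best = min(best, math.acos(cos))
    return best

def angular_diversity_loss(features):
    """Hypothetical loss: minimizing it maximizes the minimum pairwise angle."""
    return -min_pairwise_angle(features)
```

For example, `min_pairwise_angle([[1, 0], [0, 1], [-1, 0]])` is a quarter turn, because the closest pair of vectors is orthogonal.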
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning: This paper proposes BEAT, the first visual backdoor attack framework targeting VLM-driven embodied agents. It employs environmental objects (e.g., knives) as triggers and adopts a two-stage training pipeline (SFT + Contrastive Trigger Learning) to achieve precise backdoor activation. BEAT attains an attack success rate of up to 80% while preserving normal task performance, exposing critical security vulnerabilities in VLM-based embodied agents.
BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models: This paper proposes BioCAP, which trains a biological multimodal foundation model by using an MLLM to generate Wikipedia-knowledge-guided synthetic descriptive captions (rather than relying solely on species labels). BioCAP achieves an average improvement of 8.8% over BioCLIP across 10 species classification benchmarks and a 21.3% gain on text-image retrieval tasks.
Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems: Bongard-RWR+ is a benchmark comprising 5,400 Bongard problems, constructed via a VLM-based pipeline (Pixtral-12B + Flux.1-dev) that automatically generates photorealistic images to represent abstract concepts. Systematic evaluation reveals that state-of-the-art VLMs struggle to discriminate fine-grained visual concepts such as contour, rotation, and angle, with accuracy dropping as low as 19%.
Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting (WS-COC): This paper proposes WS-COC, the first MLLM-based weakly supervised class-agnostic object counting framework. Through three strategies — divide-and-discern dialogue tuning (progressively narrowing the counting range), comparative ranking optimization (learning relative counting relationships across images), and global-local counting enhancement — WS-COC achieves performance comparable to or surpassing fully supervised methods using only image-level count annotations.
Breaking the Limits of Open-Weight CLIP: An Optimization Framework for Self-supervised Fine-tuning of CLIP: This paper proposes TuneCLIP, a self-supervised fine-tuning (SSFT) framework that improves existing open-weight CLIP models through a two-stage design. It first recovers optimizer statistics (OSR) to eliminate cold-start bias, then applies a hinged global contrastive loss (HGCL) with a margin to mitigate over-penalization of false negatives. TuneCLIP achieves consistent general-purpose performance gains without any labels, with improvements of up to +2.5% on ImageNet and variants and +1.2% on the DataComp benchmark.
Can Vision-Language Models Answer Face to Face Questions in the Real-World?: This paper introduces QIVD (Qualcomm Interactive Video Dataset), a face-to-face real-time QA benchmark comprising 2,900 videos with audio and timestamp annotations. It reveals that existing VLMs fall far short of human performance in real-time situated understanding (best model 60% vs. humans 87%), with primary bottlenecks in referential disambiguation, response timing judgment, and situated commonsense. Fine-tuning on this data can substantially close the gap.
Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts: To address the Straggler Effect in MoE inference—where the most heavily loaded expert determines overall latency due to uneven token distribution—this paper proposes Capacity-Aware Token Drop (discarding low-scoring tokens from overloaded experts) and Expanded Drop (re-routing overflow tokens to lightly loaded local experts). The approach achieves a 1.85× speedup on Mixtral-8×7B with a 0.2% performance improvement.
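The token-drop idea in the note above can be sketched as follows: each expert keeps only its highest-scoring tokens up to a fixed capacity, and the rest are dropped. The `(token_id, expert_id, score)` layout, the single shared `capacity`, and the function name are assumptions made for illustration; Expanded Drop (re-routing overflow tokens) is not shown.

```python
def capacity_aware_token_drop(assignments, capacity):
    """Keep at most `capacity` highest-scoring tokens per expert.

    assignments: iterable of (token_id, expert_id, score) tuples.
    Returns (kept, dropped), each a list of (token_id, expert_id).
    """
    by_expert = {}
    for token_id, expert_id, score in assignments:
        by_expert.setdefault(expert_id, []).append((score, token_id))

    kept, dropped = [], []
    for expert_id, items in by_expert.items():
        items.sort(reverse=True)  # highest routing score first
        for rank, (_, token_id) in enumerate(items):
            target = kept if rank < capacity else dropped
            target.append((token_id, expert_id))
    return kept, dropped
```

With a capacity of 2, an expert that receives three tokens drops only its lowest-scoring one, so no expert processes more than two tokens and the per-expert load is bounded.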
CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing: CityLens is introduced as the largest urban socioeconomic sensing benchmark to date (17 cities, 6 domains, 11 prediction tasks), evaluating 17 LVLMs across three paradigms—direct metric prediction, normalized metric estimation, and feature-based regression—for inferring socioeconomic indicators from satellite and street-view imagery. Results show that general-purpose LVLMs still fall short of domain-specialized contrastive learning methods on most tasks.
Closing the Modality Gap Aligns Group-Wise Semantics: This paper demonstrates that the modality gap in CLIP is inconsequential for instance-level tasks (retrieval) yet severely harms group-level tasks (clustering). It proposes a novel objective comprising an Align True Pairs loss and a Centroid Uniformity loss that reduces the gap to nearly zero in both bimodal and trimodal settings, substantially improving clustering V-Measure by +10–17 points while preserving retrieval performance.
Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping: This paper proposes AttWarp, a plug-and-play test-time image warping method that leverages the MLLM's own cross-modal attention maps to perform rectilinear grid resampling — expanding high-attention regions and compressing low-attention regions — achieving consistent accuracy improvements, enhanced compositional reasoning, and reduced hallucinations across 5 benchmarks and 4 MLLMs.
Contamination Detection for VLMs using Multi-Modal Semantic Perturbation: This paper proposes a multi-modal semantic perturbation framework for detecting data contamination in VLMs. It uses an LLM to generate dense captions and Flux ControlNet to alter answer-relevant semantic elements while preserving image composition. Contaminated models suffer sharp performance drops on perturbed samples due to memorization of original image-text pairs, whereas clean models are unaffected thanks to genuine reasoning ability. The paper also provides the first systematic validation that most existing LLM-based contamination detection methods are unreliable in VLM settings.
Context Tokens are Anchors: Understanding the Repetition Curse in dMLLMs from an Information Flow Perspective: This work investigates the underlying mechanism behind the "Repetition Curse" in diffusion multimodal large language models (dMLLMs) when cache-based acceleration is applied, through an information flow perspective. It reveals that context tokens act as anchors that aggregate semantic information, and that caching disrupts this information flow pattern. The proposed CoTA method reduces repetition rates by up to 92%.
Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach: This paper proposes the Emotion Statement Judgment (ESJ) task and the INSETS automatic annotation pipeline, reformulating visual emotion evaluation from "open-ended classification" to "statement veracity judgment." The authors construct the MVEI benchmark (3,086 samples, 424 emotion labels, four cognitive dimensions) and systematically evaluate 19 MLLMs, finding that even GPT-4o lags behind humans (91.6%) by 13.3% in accuracy.
Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification: This paper proposes EUQ (Evidential Uncertainty Quantification), which leverages Dempster-Shafer evidence theory to decompose the epistemic uncertainty of LVLMs into conflict CF (internal contradictions) and ignorance IG (lack of information). EUQ requires no training and only a single forward pass to detect four types of misbehaviors—hallucination, jailbreak, adversarial attacks, and OOD failures—achieving an average AUROC improvement of 10.4%/7.5% over the best baseline.
Directional Embedding Smoothing for Robust Vision Language Models: This paper extends RESTA (Randomized Embedding Smoothing and Token Aggregation) from LLMs to VLMs, demonstrating that directional embedding noise significantly outperforms isotropic noise in the safety-utility tradeoff, serving as a lightweight inference-time defense layer against multimodal jailbreak attacks.
DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage: This paper proposes DIVA-GRPO, which addresses reward sparsity and advantage vanishing in GRPO training by dynamically assessing question difficulty, adaptively generating semantically consistent variants of varying difficulty, and incorporating difficulty-weighted local-global advantage estimation. The method achieves state-of-the-art multimodal reasoning performance at the 7B model scale.
Do Vision-Language Models Respect Contextual Integrity in Location Disclosure?: This paper introduces the VLM-GEOPRIVACY benchmark grounded in Nissenbaum's Contextual Integrity (CI) theory. Through seven progressively structured context-aware questions and a three-tier location disclosure granularity (refusal / city-level / precise location), it systematically evaluates whether 14 mainstream VLMs can determine appropriate location disclosure levels based on social-norm cues present in images. Results show that all models exhibit severe over-disclosure bias (Over-Disclosure rates of 46–52%), and malicious prompting can push the Abstention Violation rate to 100%.
Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models: This paper proposes Dynamic Multimodal Activation Steering (DMAS), a training-free method that constructs a semantics-based truthfulness steering vector database and a visual perception steering vector, dynamically selecting the most relevant steering vectors at inference time to intervene on critical attention heads. DMAS significantly mitigates hallucinations in LVLMs, achieving a gain of 94.66 points on MME and a 20.2% reduction in hallucination rate on CHAIR.
EgoHandICL: Egocentric 3D Hand Reconstruction with In-Context Learning: This work introduces the in-context learning (ICL) paradigm to 3D hand reconstruction for the first time. Through VLM-guided template retrieval, a multimodal ICL tokenizer, and an MAE-driven reconstruction pipeline, EgoHandICL significantly outperforms state-of-the-art methods on the ARCTIC and EgoExo4D benchmarks.
Empowering Small VLMs to Think with Dynamic Memorization and Exploration: This paper proposes DyME (Dynamic Memorize-Explore), which progressively and dynamically alternates between an SFT memorization mode and a GRPO exploration mode, enabling—for the first time—reasoning capabilities in small-scale vision-language models (SVLMs, <1B parameters) on domain-specific tasks.
Enhanced Continual Learning of Vision-Language Models with Model Fusion: This paper proposes the Continual Decoupling-Unifying (ConDU) framework, which is the first to introduce model fusion into VLM continual learning. By maintaining a unified model and performing iterative decoupling-unifying operations guided by task triggers, ConDU surpasses the state of the art by an average of 2% on the MTIL benchmark while simultaneously enhancing zero-shot capability.
Enhancing Multi-Image Understanding through Delimiter Token Scaling: By scaling the hidden states of image delimiter tokens in vision-language models, this work enhances inter-image information isolation and achieves performance gains on multi-image understanding benchmarks (Mantis/MuirBench/MIRB/QBench2) and multi-document/multi-table understanding benchmarks (TQABench/MultiNews/WCEP-10) without introducing any additional training or inference cost.
Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models: This paper proposes a training-free two-stage VLM framework that records corrected reasoning trajectories in an Error Notebook and applies RAG-based test-time adaptation. On specification-driven part retrieval in 3D CAD assemblies, GPT-4o accuracy improves from 41.7% to 65.1% (+23.4 pp), with a further +4.5% gain from a grammar-constrained validator.
Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences: This paper systematically evaluates VLMs' spatial reasoning capabilities over robot motion trajectories, proposing four image-querying methods that enable VLMs to select optimal motion paths based on users' natural-language descriptions. Results show that Qwen2.5-VL achieves 71.4% zero-shot accuracy, with smaller models achieving significant gains after fine-tuning.
FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models: This paper introduces FRIEDA, a benchmark that systematically evaluates large vision-language models (LVLMs) on multi-step, cross-map cartographic reasoning. The strongest model, Gemini-2.5-Pro, achieves only 38.20% accuracy, far below the human baseline of 84.87%.
GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-Guided Latent Diffusion Model?: This paper proposes GLYPH-SR, a vision-language-guided diffusion framework that simultaneously optimizes image quality and text readability via a dual-branch Text-SR fusion ControlNet and a ping-pong scheduler, achieving a 15.18-point improvement in OCR F1 on SVT ×8.
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs: This paper proposes GAR (Grasp Any Region), which employs RoI-aligned feature replay to extract high-fidelity local features while preserving global context, enabling precise single-region captioning, multi-region interaction modeling, and compositional reasoning. The 1B model surpasses InternVL3-78B.
Grounding-IQA: Grounding Multimodal Language Models for Image Quality Assessment: This paper integrates spatial grounding (referring + grounding) with image quality assessment (IQA) and constructs the GIQA-160K dataset to fine-tune a multimodal LLM that generates quality descriptions with bounding boxes and spatial VQA, achieving significantly superior fine-grained quality perception over general-purpose MLLMs.
GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models: This paper introduces GTR-Bench, a novel benchmark for geo-temporal reasoning of moving targets in large-scale camera networks. Evaluation reveals that the strongest model, Gemini-2.5-Pro (34.9%), falls far short of human performance (78.61%), exposing three critical deficiencies in current VLMs: imbalanced utilization of spatial-temporal context, weak temporal prediction capability, and insufficient map-video alignment.
HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit: This paper proposes the HiDrop framework, which conducts a systematic layer-wise behavioral analysis of MLLMs (shallow layers = propagators, middle layers = fusion hubs, deep layers = language reasoners) and designs a three-stage strategy: Late Injection (skipping shallow layers) + Concave Pyramid Pruning (aggressive pruning in middle layers) + Early Exit (discarding tokens in deep layers). The framework compresses approximately 90% of visual tokens with negligible performance degradation and achieves a 1.72× training speedup.
How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images: This work presents the first systematic diagnosis revealing that the root cause of poor zero-shot medical VQA performance in medical MLLMs is insufficient visual grounding—model attention systematically deviates from clinically relevant regions. Building on this finding, the authors propose VGRefine, a training-free inference-time attention correction method that achieves state-of-the-art results across 110K+ samples on 6 benchmarks spanning 8 imaging modalities.
ICYM2I: The Illusion of Multimodal Informativeness under Missingness: This paper identifies a largely overlooked problem in multimodal learning: distribution shift induced by modality missingness leads to severely biased modality value estimation. The proposed ICYM2I framework applies dual inverse probability weighting (IPW) to correct bias in both training and evaluation, achieving unbiased estimates of modality predictive utility and information-theoretic value under the MAR assumption.
Index-Preserving Lightweight Token Pruning for Efficient Document Understanding: A binary patch classifier with only 203K parameters is inserted before the VLM visual encoder to remove background tokens from document images. A \(3 \times 3\) max-pooling operation is then applied to recover fragmented text regions while preserving original spatial indices, achieving 40–60% FLOPs reduction on Qwen2.5-VL with accuracy degradation of no more than ~5 percentage points.
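The mask-repair step in the note above can be illustrated with a small sketch: a \(3 \times 3\) max-pooling pass over a binary keep/drop patch mask re-enables patches adjacent to detected text, and the surviving patches are reported by their original row-major indices. Stride 1 with edge clipping, the list-of-lists mask layout, and both function names are assumptions; the classifier itself is not modeled.

```python
def max_pool_3x3(mask):
    """Stride-1 3x3 max pooling over a binary mask, clipped at the borders."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                mask[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            out[r][c] = max(window)
    return out

def kept_patch_indices(mask):
    """Row-major indices of kept patches, unchanged from the original grid."""
    cols = len(mask[0])
    return [r * cols + c
            for r, row in enumerate(mask)
            for c, value in enumerate(row) if value]
```

Because the kept patches are addressed by their positions in the original grid rather than renumbered, any positional encoding computed from those indices still refers to the correct location on the page.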
IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning: This paper reveals the implicit visual coordinate (IVC) system established by RoPE positional encoding within LVLMs, and proposes a training-free, prompt-aware vision token pruning strategy that preserves IVC tokens and semantic foreground tokens while pruning approximately 50% of visual tokens with ≥99% of original performance retained.
K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge: This paper proposes the K-Sort Eval framework, which leverages posterior correction and dynamic matching strategies to enable VLMs to reliably and efficiently replace human annotators in preference evaluation of visual generation models, typically converging to results consistent with human Arena rankings in fewer than 90 model runs.
KeepLoRA: Continual Learning with Residual Gradient Adaptation: By analyzing the singular value decomposition (SVD) of pretrained model weights, this paper identifies that general knowledge is encoded in the principal subspace while domain-specific knowledge resides in the residual subspace. KeepLoRA is proposed to constrain LoRA updates for new tasks within the residual subspace, while using gradient information for initialization to preserve plasticity, achieving an optimal balance among forward stability, backward stability, and plasticity in continual learning.
Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification: This paper identifies a pervasive "agreement bias" in multimodal large language models (MLLMs) when used as agent behavior verifiers—whereby models systematically over-approve agent actions—and proposes Self-Grounded Verification (SGV), a two-step generation framework (first extracting behavioral priors, then performing conditioned verification) to mitigate this bias. SGV achieves up to 25 pp improvement in failure detection rate and 14 pp improvement in accuracy across web navigation, desktop manipulation, and robotic manipulation tasks.
LiveWeb-IE: A Benchmark For Online Web Information Extraction: This paper introduces LiveWeb-IE, the first benchmark for online web information extraction (WIE), covering multi-type data extraction including text, images, and hyperlinks. It further proposes the Visual Grounding Scraper (VGS) framework, which simulates human cognitive processes—visual scanning to locate regions → precise element localization → XPath generation—to achieve robust information extraction on dynamic webpages.
LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models: This paper proposes LLaVA-FA, an efficient compression method for large multimodal models (LMMs) that performs joint low-rank and quantization weight approximation in the frequency domain. By exploiting the decorrelation property and conjugate symmetry of the Fourier transform, the method achieves more compact and accurate weight representations. It further introduces PolarQuant (polar coordinate quantization) and ODC (Optional Diagonal Calibration), surpassing existing efficient multimodal models on multiple benchmarks with minimal active parameters and computational cost.
Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation: This paper proposes AIR (Adaptive vIsual Reinforcement), a framework that reduces hallucinations in MLLMs at inference time without any training, via prototype-distance-based token reduction combined with optimal-transport-guided selective patch reinforcement (LLaVA-1.5-7B CHAIR_S: 22→18.4, POPE accuracy +5.3%), while preserving general multimodal capabilities.
Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering: This paper proposes MAPD (Meta-Adaptive Prompt Distillation), a MAML-based prompt distillation framework that leverages an attention mapper to distill soft prompts from task-relevant image features, enabling LMMs to adapt to novel visual question answering tasks at test time with only a few gradient steps. MAPD outperforms ICL by 21.2%.
Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models: This paper identifies modality-specific and attention-head-specific semantic redundancy in the KV Cache of LVLMs, demonstrating that importance-only selection fails to preserve semantic coverage. The proposed MixKV adaptively mixes importance and diversity scores per attention head for KV Cache compression, achieving an average improvement of 5.1% under extreme compression ratios.
MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning: This paper introduces MMR-Life, a benchmark comprising 2,646 five-choice multi-image questions based on 19,108 real-life images, covering 7 reasoning types and 21 tasks. It is the first systematic evaluation of MLLMs on multi-image reasoning in real-life scenarios. The strongest model, GPT-5, achieves only 58.69% accuracy—14 percentage points below human performance. Key findings include the failure of reasoning enhancement methods on large models and the weaker generalization of RL compared to BoN.
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs: This paper proposes MMTok, a multimodal visual token selection framework formulated as a Maximum Coverage Problem. By jointly leveraging text-visual and visual-visual coverage signals, MMTok selects the most informative subset of visual tokens in a training-free manner, significantly outperforming unimodal baselines and even surpassing methods that require fine-tuning.
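For readers unfamiliar with the Maximum Coverage Problem mentioned in the MMTok note, the sketch below shows the standard greedy approximation on plain sets: repeatedly pick the candidate that covers the most not-yet-covered items. This is the textbook set-based version, not MMTok's objective, which per the note combines text-visual and visual-visual coverage signals; the names and data layout are illustrative.

```python
def greedy_max_coverage(candidates, budget):
    """Greedy selection for the classic set-based maximum coverage problem.

    candidates: dict mapping a candidate name to the set of items it covers.
    budget: maximum number of candidates to select.
    Returns (chosen names in selection order, set of covered items).
    """
    covered, chosen = set(), []
    remaining = dict(candidates)
    for _ in range(min(budget, len(remaining))):
        # Pick the candidate with the largest marginal gain in coverage.
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # nothing new can be covered
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered
```

The greedy rule favors candidates that add new coverage over candidates that merely overlap with what is already selected, which is the intuition behind choosing a small but informative token subset.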
Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?: This paper identifies and systematically defines the phenomenon of Modal Aphasia — unified multimodal models can generate visual concepts (e.g., movie poster images) from memory with near-perfect fidelity, yet exhibit error rates more than 7× higher when verbally describing the same concepts, with severe hallucinations occurring almost exclusively in the text modality. Through real-world experiments with frontier models (ChatGPT-5) and controlled synthetic experiments with open-source models (Janus-Pro, Harmon), the paper confirms that modal aphasia is a systemic deficiency of current unified architectures rather than a training artifact, and demonstrates its potential threat to AI safety frameworks.
Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional: A large-scale empirical study reveals severe unimodal dependency issues across 23 VQA benchmarks — many benchmarks designed to eliminate text bias have instead introduced image bias, with models exploiting unimodal shortcuts rather than performing genuine cross-modal reasoning.
Multimodal Classification via Total Correlation Maximization: This paper analyzes modality competition in multimodal classification from an information-theoretic perspective and proposes TCMax, a loss function that maximizes the Total Correlation (TC) between multimodal features and labels. TCMax simultaneously addresses joint learning, unimodal learning, and cross-modal alignment without additional hyperparameters, surpassing state-of-the-art methods on multiple audio-visual and image-text classification benchmarks.
Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs: This work is the first to extend automatic prompt optimization (APO) from the pure text space to the multimodal space, proposing the MPO framework. It achieves an average accuracy of 65.1% across 10 datasets spanning image, video, and molecular modalities, surpassing the strongest text-based APO baseline ProTeGi (60.0%). It relies on two key components: alignment-preserving joint exploration (unified semantic gradients synchronously drive text and non-text prompt updates, diversified by Generation/Edit/Mix operators) and prior-inherited Bayesian UCB candidate selection (warm-starting child prompt Beta priors from parent prompt performance).
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models: Grounded in cognitive psychology, this work introduces OmniSpatial—the first comprehensive spatial reasoning benchmark—systematically covering 4 dimensions (dynamic reasoning, complex spatial logic, spatial interaction, and perspective transformation) across 50 subcategories with 8.4K manually annotated QA pairs. The strongest reasoning model, o3, achieves only 56.33% while humans reach 92.63%, revealing that complex spatial reasoning remains a fundamental bottleneck for VLMs.
On the Generalization Capacities of MLLMs for Spatial Intelligence: This paper identifies a fundamental flaw in RGB-only spatial reasoning MLLMs—the focal-length–depth ambiguity arising from the neglect of camera intrinsics—and proposes the Camera-Aware MLLM (CA-MLLM) framework. Through dense camera ray embedding, camera-aware data augmentation, and geometric prior distillation, it improves F1 from 39.1% to 52.1% on cross-camera generalization benchmarks for spatial localization.
Post-hoc Probabilistic Vision-Language Models: A training-free post-hoc uncertainty estimation method is proposed that applies Laplace approximation to the last few layers of VLMs such as CLIP and SigLIP, and analytically derives uncertainty over cosine similarity. The method achieves substantial improvements over baselines in both uncertainty quantification and active learning.
PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models: This paper proposes PPE (Positional Preservation Embedding), which exploits the dimensional independence of rotations in RoPE to encode multiple original position IDs from merged tokens into distinct dimension segments, enabling a single compressed token to carry multiple spatial/temporal positional cues. PPE is a zero-parameter, plug-and-play operator that achieves an average performance drop of only 3.6% on image tasks at 55% compression, and maintains comparable performance at 90% compression via cascaded compression.
PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies: This work introduces PRISMM-Bench, the first benchmark grounded in genuine reviewer-annotated multimodal inconsistencies in scientific papers. Mining 18,009 ICLR open reviews yields 384 cross-modal inconsistencies, evaluated across three tasks—identification, remediation, and paired matching—with a JSON-structured debiasing scheme for answer representation. Among 21 state-of-the-art LMMs, the best achieves only 53.9%, systematically exposing severe deficiencies in cross-modal reasoning over scientific documents.
Procedural Mistake Detection via Action Effect Modeling: This paper proposes a dual-branch multimodal supervision framework for action effect modeling, combining a visual branch (object state and spatial relation features) with a text branch (GPT-4o-generated scene graphs). Learnable effect tokens distill external supervision signals, achieving state-of-the-art mistake detection on egocentric procedural videos.
Reasoning-Driven Multimodal LLM for Domain Generalization: This paper proposes RD-MLDG — the first framework to incorporate MLLM reasoning chains into domain generalization. It constructs the DomainBed-Reasoning dataset, systematically analyzes two core challenges of reasoning supervision (optimization gap + reasoning pattern mismatch), and addresses them jointly via MTCT (Multi-Task Cross-Training) and SARR (Self-Aligned Reasoning Regularization), achieving an average accuracy of 86.89% across four standard DG benchmarks — substantially surpassing GPT-4o (83.46%) and all CLIP/ViT-based methods.
Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks: This paper introduces the Ref-Adv benchmark, constructed via a pipeline of hard distractor pairing + LLM-assisted minimally sufficient expression generation + three-annotator unanimous verification. The benchmark eliminates "grounding shortcuts" present in classical REC datasets. Across 13 contemporary MLLMs — including GPT-4o, Gemini 2.5, and Qwen2.5-VL-72B — accuracy drops dramatically from 90%+ on RefCOCO(+/g) to 50–68% on Ref-Adv, systematically exposing severe deficiencies in complex visual reasoning and precise grounding.
Revisit Visual Prompt Tuning: The Expressiveness of Prompt Experts: This paper reveals the limitations of VPT from a Mixture-of-Experts (MoE) perspective — prompt experts are input-agnostic constant functions with limited expressiveness — and proposes VAPT, which employs token-wise projectors and a shared feature projector to make prompt experts input-adaptive. VAPT achieves superior performance with fewer parameters and is supported by theoretical guarantees on optimal sample efficiency.
Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes: This paper proposes MV-RoboBench, the first benchmark integrating multi-view spatial reasoning with robotic manipulation tasks, systematically evaluating 40+ VLMs (open-source, closed-source, and reasoning-enhanced). The best-performing model, GPT-5, achieves only 56.4% accuracy, far below the human baseline of 91.0%. The study further reveals a positive correlation between spatial and robotic reasoning, and that performance on single-view benchmarks does not reliably transfer to multi-view settings.
Self-Aug: Query and Entropy Adaptive Decoding for Large Vision-Language Models: This paper proposes Self-Aug, a training-free decoding strategy that employs Self-Augmentation Selection (SAS) Prompting to enable LVLMs to leverage their own parametric knowledge for dynamically selecting query-semantically-aligned visual augmentations. It further introduces the Sparsity Adaptive Truncation (SAT) algorithm, which exploits the full entropy of the output distribution to dynamically regulate candidate token set size. Self-Aug consistently outperforms existing contrastive decoding methods across 5 LVLMs and 7 benchmarks.
Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking: This paper proposes EvoQuality, a self-supervised iterative framework that generates pseudo-ranking labels via pairwise majority voting and employs GRPO for self-iterative optimization, enabling VLMs to autonomously improve their image quality perception without any human annotations. The framework achieves a 31.8% PLCC improvement in zero-shot settings and surpasses supervised SOTA on 5 out of 7 IQA benchmarks.
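The voting step in the EvoQuality note can be sketched as turning repeated pairwise comparisons into a pseudo-label per image pair by majority vote. The `"A"`/`"B"` vote encoding, the rule of discarding tied pairs, and the function names are assumptions for illustration; the GRPO optimization stage is not shown.

```python
from collections import Counter

def majority_preference(votes):
    """Return 'A' or 'B' for the majority vote, or None on a tie."""
    counts = Counter(votes)
    first, second = counts.get("A", 0), counts.get("B", 0)
    if first == second:
        return None
    return "A" if first > second else "B"

def pseudo_ranking_labels(pair_votes):
    """Map each image pair to its majority-preferred side, skipping ties.

    pair_votes: dict mapping (image_a, image_b) to a list of 'A'/'B' votes,
    one vote per sampled comparison of that pair.
    """
    labels = {}
    for pair, votes in pair_votes.items():
        winner = majority_preference(votes)
        if winner is not None:
            labels[pair] = winner
    return labels
```

Dropping tied pairs is one simple way to keep only comparisons on which the sampled judgments agree; other tie-handling policies are equally plausible.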
Shuffle-R1: Efficient RL Framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle: Shuffle-R1 is proposed as an RL training framework that addresses two key efficiency bottlenecks—Advantage Collapsing and Rollout Silencing—through Pairwise Trajectory Sampling (selecting high-contrast trajectory pairs) and Advantage-based Batch Shuffle (redistributing training batches by advantage values). The framework achieves a 22% improvement over the baseline on Geo3K and surpasses GPT-4o on MathVerse.
Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation: Inspired by the draft-then-verify paradigm of Speculative Decoding, this paper proposes Speculative Verdict (SV), which employs multiple lightweight VLMs to generate diverse reasoning paths as drafts, while a large model serves as the verdict to synthesize, verify, and correct them. Without any training, SV surpasses GPT-4o by 11.9% on information-intensive VQA and recovers correct answers in 47–53% of minority-correct cases.
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward: This paper proposes SophiaVL-R1, which introduces a holistic-level thinking process reward into rule-based RL training of MLLMs. A Thinking Reward Model (TRM) is trained to evaluate reasoning quality along five dimensions (including logical soundness and redundancy). Trust-GRPO is proposed to compute a reliability weight \(\gamma\) from the contrast of thinking rewards between correct and incorrect answer groups, mitigating reward hacking. A time-based annealing strategy \(e^{-\text{steps}/T}\) gradually reduces the thinking reward contribution so that the model relies more on accurate rule-based rewards in later training. The resulting 7B model comprehensively outperforms LLaVA-OneVision-72B on multiple benchmarks, including MathVista (71.3%) and MMMU (61.3%).
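The annealing term \(e^{-\text{steps}/T}\) can be made concrete with a small sketch. Only the exponential decay and the existence of a reliability weight \(\gamma\) come from the summary above; the additive way the rewards are combined and the names `annealed_thinking_weight` and `combined_reward` are assumptions for illustration.

```python
import math


def annealed_thinking_weight(step, T):
    """Decay factor e^{-step/T}: 1.0 at step 0, approaching 0 as training proceeds."""
    return math.exp(-step / T)


def combined_reward(rule_reward, thinking_reward, gamma, step, T):
    """Rule-based reward plus a trust-weighted, annealed thinking reward.

    gamma is the reliability weight; it is taken as an input here because
    the summary does not give its formula.
    """
    return rule_reward + gamma * annealed_thinking_weight(step, T) * thinking_reward
```

At `step == T` the thinking reward is scaled by \(e^{-1} \approx 0.37\), and by `step == 5 * T` by less than 0.01, so late in training the rule-based reward dominates.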
Sparsity Forcing: Reinforcing Token Sparsity of MLLMs: This paper proposes Sparsity Forcing — a GRPO-based RL post-training framework that treats a sparse-attention MLLM as the policy model and the original MLLM as the reference model. Through multi-budget rollouts exploring different token retention thresholds \(p\), and using a joint reward combining efficiency (token reduction rate) and performance (answer correctness) for within-group contrastive optimization, the method improves the token reduction rate of Qwen2/2.5-VL from 20% to 75% with minimal accuracy loss, achieving 3× memory reduction and 3.3× decoding speedup.
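A joint efficiency-plus-correctness reward with group-relative normalization might look like the sketch below. The additive form, the weight `lam`, and the function names are assumptions; the group normalization follows the standard GRPO recipe (reward minus group mean, divided by group standard deviation) rather than anything specific to this paper.

```python
from statistics import mean, pstdev


def joint_reward(correct, token_reduction_rate, lam=0.5):
    """Correctness (0 or 1) plus a weighted efficiency bonus in [0, lam]."""
    return float(correct) + lam * token_reduction_rate


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one rollout group (one prompt, several budgets)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under this sketch, a rollout that answers correctly while dropping 75% of tokens scores higher than one that answers correctly while dropping 20%, so within a group the sparser correct rollout receives the larger advantage.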
Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models: This paper proposes Spatial-DISE, a unified spatial reasoning benchmark grounded in a cognitive-science-based 2×2 taxonomy (Intrinsic/Extrinsic × Static/Dynamic). The benchmark comprises 559 evaluation VQA pairs and 12K+ training instances. Evaluation across 32 state-of-the-art VLMs reveals a substantial gap between model performance and human-level capability, particularly on dynamic spatial reasoning tasks such as mental rotation and folding.
Spatial CAPTCHA: Generatively Benchmarking Spatial Reasoning for Human-Machine Differentiation: This paper proposes Spatial CAPTCHA, a novel human verification framework grounded in 3D spatial reasoning. It exploits fundamental capability gaps between humans and multimodal large language models (MLLMs) across geometric reasoning, perspective-taking, occlusion handling, and mental rotation tasks to distinguish humans from machines. The best-performing MLLM achieves only 31.0% Pass@1 accuracy, far below human performance.
Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA: Through controlled experiments within the LLaVA framework, this paper systematically investigates the effects of image encoder training objectives and 2D positional encoding on the spatial reasoning capabilities of VLMs. The study finds that encoder choice dominates spatial performance and that AIMv2 yields the most consistent results, while improvements from 2D-RoPE are unstable, indicating that spatial reasoning failures are rooted in core design choices of current VLM pipelines.
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?: This paper introduces SpatiaLab, a real-world spatial reasoning benchmark comprising 1,400 visual QA pairs spanning 30 subcategories across 6 major spatial task categories. Supporting both MCQ and open-ended evaluation formats, SpatiaLab reveals a substantial gap between the strongest current VLMs (InternVL3.5-72B: 54.93% MCQ) and humans (87.57%), with the gap widening further under open-ended settings.
SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery: SpectralGCD represents images as CLIP cross-modal image-text similarity vectors (i.e., mixtures of semantic concepts), employs spectral filtering to automatically select task-relevant concepts, and applies forward-backward knowledge distillation to preserve semantic quality. The method achieves a new multimodal GCD state of the art across six benchmarks at a training cost comparable to unimodal approaches.
SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs: This paper introduces SpinBench, a cognitively grounded diagnostic benchmark that systematically evaluates spatial reasoning in 37 VLMs through 7 progressively structured task categories—ranging from object identity recognition to perspective taking—revealing systemic deficiencies including egocentric bias and weak rotation understanding.
Steering and Rectifying Latent Representation Manifolds in Frozen Multi-Modal LLMs for Video Anomaly Detection: This paper proposes SteerVAD, a framework that identifies "latent anomaly expert" (LAE) attention heads within a completely frozen multimodal large language model (MLLM) and dynamically steers their representation manifolds via a hierarchical meta-controller, achieving tuning-free video anomaly detection SOTA with only 1% of training data.
TableDART: Dynamic Adaptive Multi-Modal Routing for Table Understanding: This paper proposes TableDART, which employs a lightweight MLP gating network with only 2.59M parameters to dynamically select the optimal processing path (Text-only / Image-only / Fusion) for each query-table pair. By reusing frozen unimodal expert models and introducing an LLM Agent for cross-modal fusion, TableDART achieves an average improvement of 4.02% over the strongest MLLM baseline HIPPO across 7 table understanding benchmarks, while reducing inference latency by 24.5%.
ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding: ThinkOmni is a training-free framework that leverages a text-only large reasoning model (LRM) to guide an omni-modal LLM (OLLM) during decoding via Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals. The method achieves 70.2% on MathVista and 75.5% on MMAU, matching or surpassing RFT-based approaches.
Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs: This paper proposes VC-STaR (Visual Contrastive Self-Taught Reasoner), motivated by the observation that VLMs perceive visual content more accurately when comparing two similar images. A contrastive self-improvement framework is designed: contrastive VQA pairs are constructed to elicit more faithful visual analysis from the model, and an LLM integrates this contrastive analysis into reasoning chains, yielding the high-quality visual reasoning dataset VisCoR-55K. Fine-tuning on this dataset achieves +5.7% on MMVP and +3.2% on Hallusion.
U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning: This work systematically ablates the design space of MLLM embedding learning and identifies several key factors: bidirectional attention with mean pooling outperforms the mainstream last-token approach, and the importance of a learnable temperature has been severely underestimated. Based on these findings, the authors construct U-MARVEL, a three-stage framework (progressive transition → filtered hard negatives → reranker distillation) that achieves 63.2% Avg on M-BEIR with a single model, substantially surpassing the existing SOTA, while also leading on zero-shot CIR and T2V transfer.
Unified Vision-Language Modeling via Concept Space Alignment: This paper proposes v-Sonar, which post-hoc aligns a visual encoder to the SONAR text embedding space, enabling the Large Concept Model (LCM) trained in the SONAR space to handle visual inputs in a zero-shot manner. Through instruction fine-tuning, v-Sonar is extended into v-LCM, which surpasses existing VLMs in 61 out of 62 languages.
UniHM: Unified Dexterous Hand Manipulation with Vision Language Model: This paper proposes UniHM, the first unified language-conditioned dexterous hand manipulation framework. It maps heterogeneous robotic hands into a shared discrete space via a morphology-agnostic VQ codebook, leverages a VLM for instruction-driven manipulation sequence generation, and ensures physical feasibility through physics-guided dynamic refinement.
VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL: VidGuard-R1 is the first video authenticity detector that fine-tunes an MLLM with GRPO (Group Relative Policy Optimization). By constructing a 140K shortcut-free real/fake video dataset and designing two specialized reward mechanisms—temporal artifact reward and diffusion-step quality reward—it achieves 86.17% accuracy on its in-house dataset and 95%+ zero-shot SOTA performance on GenVidBench and GenVideo benchmarks, while generating interpretable chain-of-thought reasoning.
VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs: This paper introduces VisioMath, a benchmark comprising 1,800 K-12 mathematics problems in which all answer choices consist of highly visually similar figures. It reveals a core weakness of LMMs in multi-image–text alignment, and explores three alignment strategies that achieve up to +12.6% accuracy improvement.
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models: This paper proposes Vision-R1, which constructs 200K high-quality multimodal CoT data via Modality Bridging for cold-start initialization, followed by Progressive Thinking Suppression Training (PTST) combined with GRPO reinforcement learning. At the 7B parameter scale, Vision-R1 achieves multimodal mathematical reasoning performance approaching OpenAI O1.
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play: This paper proposes Vision-Zero, the first annotation-free gamified self-play framework for VLMs. By casting visual reasoning as a "Who is the Spy?"-style game and combining it with the Iterative-SPO training algorithm, Vision-Zero achieves scalable self-improvement and surpasses SOTA methods trained on human-annotated data across reasoning, chart understanding, and vision-centric tasks.
VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations: This paper introduces VisJudge-Bench, the first comprehensive benchmark for aesthetics and quality assessment of data visualizations (3,090 samples, 32 chart types), and trains the VisJudge model, which reduces MAE by 23.9% compared to GPT-5 and improves agreement with human experts by 60.5%.
Visual Prompt-Agnostic Evolution: This paper proposes Prompt-Agnostic Evolution (PAE), which accelerates VPT convergence (average 1.41×) and improves accuracy by 1–3% across 25 datasets through frequency-aware task initialization (MPA) and a Koopman-Lyapunov dynamical system (KLD) for cross-layer prompt coupling. PAE is plug-and-play for various VPT variants and introduces no inference overhead.
Visual Symbolic Mechanisms: Emergent Symbol Processing in Vision Language Models: This paper discovers that VLMs internally develop a three-stage symbolic processing mechanism (ID retrieval → ID selection → feature retrieval) that uses content-agnostic spatial position indices (position IDs) to solve the visual binding problem, and demonstrates that binding errors can be directly traced to failures in these mechanisms.
VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?: This paper introduces VLM-SubtleBench, a benchmark for evaluating vision-language models on subtle difference comparative reasoning, covering 10 difference types and 6 image domains (natural, gaming, industrial, aerial, medical, and synthetic). It reveals a performance gap of over 30% between VLMs and humans on spatial, temporal, and viewpoint reasoning tasks.
VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use: This paper proposes VTool-R1, the first framework that trains VLMs via reinforcement fine-tuning to generate interleaved textual and visual intermediate reasoning steps, enabling models to "think with images."
WebDS: An End-to-End Benchmark for Web-based Data Science: This paper introduces WebDS, the first end-to-end web-based data science benchmark comprising 870 tasks across 29 websites and 10 domains. The strongest evaluated agent (BrowserUse + GPT-4o) completes only 15% of tasks, while humans achieve 90%, revealing a substantial performance gap in realistic data science workflows.
Why Keep Your Doubts to Yourself? Trading Visual Uncertainties in Multi-Agent Bandit Systems: This paper proposes Agora, a framework that recasts multi-agent VLM coordination as a decentralized uncertainty trading market. Cognitive uncertainty is minted into quantifiable, three-dimensional tradable assets (perceptual / semantic / inferential), and efficient equilibrium allocation is achieved through a profit-driven trading protocol and a market-aware Thompson Sampling Broker. Agora consistently outperforms heuristic baselines across five multimodal benchmarks (e.g., +8.5% accuracy on MMMU with more than 3× cost reduction).
Why Reinforcement Fine-Tuning Preserves Prior Knowledge Better: A Data Perspective: Through a systematic study of how SFT and RFT affect prior knowledge using a jigsaw puzzle task, this paper reveals that the key to RFT avoiding catastrophic forgetting lies in data distribution rather than algorithmic differences — data sampled by RFT naturally aligns with the base model's probability landscape, causing less interference.
Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition: This paper proposes DA-HOI, a zero-shot HOI detection framework that fully decouples object detection from interaction recognition. It replaces conventional CLIP-based features with MLLM VQA capabilities for interaction recognition. The core contributions are deterministic generation (achieving 31.50 mAP training-free), spatial-aware pooling (incorporating spatial priors and cross-attention), and one-pass deterministic matching (reducing \(M\) forward passes to one). DA-HOI comprehensively surpasses the state of the art across all four zero-shot settings on HICO-DET and supports plug-and-play detector substitution after training.