
🧩 Multimodal VLM

📹 ICCV2025 · 142 paper notes

A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition

This paper proposes QME (Quality-guided Mixture of score-fusion Experts), a framework that dynamically integrates similarity scores from multiple biometric modalities (face recognition, gait recognition, and person re-identification) via learnable score-fusion experts and a quality-guided MoE routing mechanism. Combined with a pseudo-quality loss and a score triplet loss, QME achieves state-of-the-art performance on multiple whole-body biometric recognition benchmarks.
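
As a rough illustration of the routing idea, here is a minimal, hypothetical sketch of quality-gated score fusion; the tensor shapes, the two-layer gating net, and the softmax weighting are assumptions for exposition, not the paper's exact design:

```python
import torch
import torch.nn as nn

class QualityGatedScoreFusion(nn.Module):
    """Toy quality-guided MoE over per-modality similarity scores.

    Assumed setup: each probe/gallery pair comes with one similarity
    score per modality (face, gait, body) plus a quality descriptor
    for the probe; a small gating net routes weight to reliable modalities.
    """
    def __init__(self, num_modalities=3, quality_dim=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(quality_dim, 32), nn.ReLU(),
            nn.Linear(32, num_modalities),
        )

    def forward(self, scores, quality):
        # scores:  (B, M) similarity score from each modality expert
        # quality: (B, Q) probe quality descriptor
        weights = torch.softmax(self.gate(quality), dim=-1)  # (B, M)
        return (weights * scores).sum(dim=-1)                # (B,) fused score

fusion = QualityGatedScoreFusion()
fused = fusion(torch.rand(4, 3), torch.rand(4, 16))
```

The intuition: each modality expert contributes its similarity score, and a low-quality probe (say, an occluded face) shifts fusion weight toward the gait or body experts.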

Acknowledging Focus Ambiguity in Visual Questions

This work is the first to formally define and systematically investigate focus ambiguity in visual question answering — the phenomenon arising when a linguistic expression in a question may plausibly refer to multiple regions in an image, a type of ambiguity entirely overlooked by existing VQA systems. The authors construct the VQ-FocusAmbiguity dataset (5,500 samples with 12,880 instance segmentation annotations) and demonstrate that modern models perform poorly at both recognizing and localizing focus ambiguity.

Adaptive Prompt Learning via Gaussian Outlier Synthesis for Out-of-Distribution Detection

This paper proposes APLGOS, a prompt-learning framework for OOD detection in vision-language models. It initializes learnable in-distribution (ID) prompts with ChatGPT-standardized Q&A pairs, synthesizes virtual OOD prompts and images by sampling from the low-likelihood regions of class-conditional Gaussian distributions, and aligns text-image embeddings via contrastive learning to enforce more compact decision boundaries between ID and OOD categories, achieving state-of-the-art performance on four mainstream benchmarks.
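
A minimal sketch of the Gaussian outlier-synthesis step, assuming per-class ID embeddings are available; the candidate count, keep ratio, and regularizer `eps` are illustrative hyperparameters:

```python
import torch

def synthesize_virtual_ood(id_feats, num_samples=1000, keep=50, eps=1e-4):
    """Sample virtual OOD embeddings from the low-likelihood region of a
    class-conditional Gaussian fit to one class's ID features.
    (Illustrative sketch; the paper's exact sampling schedule may differ.)
    """
    mu = id_feats.mean(dim=0)
    cov = torch.cov(id_feats.T) + eps * torch.eye(id_feats.shape[1])
    dist = torch.distributions.MultivariateNormal(mu, covariance_matrix=cov)
    cand = dist.sample((num_samples,))   # candidates from the class Gaussian
    logp = dist.log_prob(cand)           # likelihood under the ID Gaussian
    idx = logp.argsort()[:keep]          # keep the least likely candidates
    return cand[idx]                     # virtual OOD embeddings

virtual_ood = synthesize_virtual_ood(torch.randn(200, 8))
```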

Advancing Textual Prompt Learning with Anchored Attributes

This paper proposes ATPrompt, which embeds general-purpose attribute tokens (e.g., color, shape) into textual prompts, extending the learning space of soft prompts from a one-dimensional class level to a multi-dimensional attribute level. ATPrompt serves as a plug-and-play module that integrates seamlessly into existing textual prompt learning methods, consistently improving baseline performance across 11 datasets.

AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?

This paper proposes AdvDreamer, a framework that generates physically reproducible adversarial 3D transformation (Adv-3DT) samples from single images via zero-shot monocular pose manipulation, a naturalness reward model, and an inverse semantic probability loss. The framework reveals that current VLMs—including GPT-4o—suffer performance drops of 50–80% under 3D variations, and establishes MM3DTBench, the first VQA benchmark for evaluating VLM robustness to 3D variations.

AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

This paper proposes AIGI-Holmes, which adapts MLLMs into a "Holmes"-style detector that both accurately identifies AI-generated images and provides human-verifiable explanations. This is achieved by constructing the Holmes-Set dataset with explanatory annotations and a three-stage training pipeline (visual expert pre-training → SFT → DPO); at inference time, a collaborative decoding strategy further enhances generalization. The method attains state-of-the-art detection accuracy on three benchmarks.

AirCache: Activating Inter-Modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference

This paper proposes AirCache, a KV cache compression method for LVLMs that scores visual token importance via an elite observation window (leveraging text self-attention to select critical text tokens for the evaluation), combined with adaptive layer-wise budget allocation based on the intensity and skewness of the importance score distributions. Retaining only 10% of the visual KV cache keeps performance degradation within 1% while reducing decoding latency by 29%–66%.
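
The scoring-plus-budgeting loop might look roughly like the following sketch; the head averaging, the elite-window size `top_text`, and the skewness-based budget rule are assumptions for illustration, not the paper's exact formulas:

```python
import torch

def visual_token_importance(attn, text_idx, visual_idx, top_text=8):
    """Score visual KV entries by how much the most attentive ('elite')
    text tokens look at them. `attn` is one layer's attention map
    (heads, seq, seq); the index sets are illustrative assumptions.
    """
    a = attn.mean(0)                                    # (seq, seq), head-averaged
    text_to_vis = a[text_idx][:, visual_idx]            # (T, V)
    elite = text_to_vis.sum(-1).topk(top_text).indices  # most visually engaged text tokens
    return text_to_vis[elite].mean(0)                   # (V,) importance per visual token

def keep_indices(scores, base_budget):
    """Adaptive budget heuristic: a sharper (more skewed) score distribution
    can be pruned harder; a flat distribution keeps more tokens."""
    z = (scores - scores.mean()) / (scores.std() + 1e-6)
    skew = (z ** 3).mean().clamp(min=0.0)
    budget = max(1, int(base_budget / (1.0 + skew)))
    return scores.topk(budget).indices

scores = visual_token_importance(torch.rand(8, 600, 600),
                                 torch.arange(576, 600), torch.arange(576))
keep = keep_indices(scores, base_budget=58)   # ~10% of 576 visual tokens
```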

Analyzing Finetuning Representation Shift for Multimodal LLMs Steering

A training-free framework that reveals representation shifts in multimodal large language models (MLLMs) during finetuning through concept-level analysis, and leverages shift vectors for lightweight model behavior steering (debiasing, safety control).

Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

This paper presents the first systematic study of visual correspondence matching deficiencies in multimodal large language models (MLLMs). The authors construct the MMVM benchmark (1,510 samples) and a 220K matching dataset, and propose CoLVA, which leverages object-level contrastive learning and a fine-grained visual expert to substantially improve cross-image instance matching in MLLMs.

Attention to the Burstiness in Visual Prompt Tuning!

This paper reveals the "burstiness" and non-Gaussian statistics of the features entering self-attention modules in Visual Prompt Tuning, and proposes learning "bursty prompts" via data whitening and a bilinear model. The approach substantially outperforms VPT and its variants across multiple benchmarks, e.g., improving accuracy on CUB-200 from 42.15% to 77.86%.

AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs

This paper proposes AutoComPose, the first framework leveraging multimodal large language models (MLLMs) to automatically generate human pose transition descriptions. Through body-part-level description generation, diversification augmentation, and a cyclic consistency loss, AutoComPose achieves superior composed pose retrieval performance while eliminating the need for costly manual annotation.

BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

Inspired by the efficient learning capabilities of human infants, this paper proposes the BabyVLM framework, which includes a synthetic training dataset (converting general-purpose data into child-directed formats) and multiple developmentally aligned evaluation benchmarks. The framework enables data-efficient pretraining of compact VLMs, achieving performance that surpasses models trained solely on SAYCam or generic data.

Background Invariance Testing According to Semantic Proximity

This paper proposes a background invariance testing method based on semantic proximity. It constructs a keyword ontology via association analysis to systematically sample background scenes, achieving an optimal balance between test diversity (recall) and consistency with human judgment (precision). The work further demonstrates that visualization-based testing frameworks are more informative than global statistical metrics.

BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models

By analyzing the semantic refinement of visual embeddings in the shallow layers of LLMs, this paper proposes BASIC, a method that leverages intrinsically refined visual embeddings from within the LLM as supervision signals to directly guide the visual projector in generating better initial visual embeddings along two dimensions: directional alignment and semantic distribution.

Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval

This paper identifies the candidate prior bias problem in MLLM-based retrieval systems — where candidate likelihood estimation tends to favor candidates with high prior probability rather than those that are semantically most relevant — and proposes BLiM (Bidirectional Likelihood Estimation) and CPN (Candidate Prior Normalization) to address this issue, achieving an average R@1 gain of 6.4 across four text-video retrieval benchmarks.
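
Schematically, Candidate Prior Normalization amounts to subtracting a candidate's unconditional log-prior from the (bidirectional) conditional likelihoods; the `alpha`/`beta` weights below are hypothetical, not the paper's tuned values:

```python
def blim_score(logp_cand_given_query, logp_query_given_cand,
               logp_cand_prior, alpha=0.5, beta=1.0):
    """Bidirectional likelihood with Candidate Prior Normalization (sketch).

    Combines both conditional directions and subtracts the candidate's
    unconditional prior so high-prior candidates stop dominating retrieval.
    """
    bidir = alpha * logp_cand_given_query + (1 - alpha) * logp_query_given_cand
    return bidir - beta * logp_cand_prior

# Example: a generic, high-prior candidate loses to a specific one.
print(blim_score(-12.0, -15.0, -3.0))   # specific candidate  -> -10.5
print(blim_score(-11.5, -20.0, -0.5))   # generic candidate   -> -15.25
```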

Boosting MLLM Reasoning with Text-Debiased Hint-GRPO

This paper identifies two critical issues in applying GRPO to MLLM reasoning — low data utilization (invalid gradients when all sampled outputs for a hard question are incorrect) and text bias (the model ignores visual input and relies solely on textual reasoning) — and proposes two corresponding solutions: Hint-GRPO (adaptively providing reasoning hints) and text-debiasing calibration (enhancing image conditioning at test time). The approach achieves significant reasoning improvements across 11 datasets on 3 base MLLMs.
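
The low-data-utilization issue follows directly from GRPO's group-normalized advantage, sketched below: when every rollout in a group receives the same (zero) reward, all advantages vanish and the question contributes no gradient, which is exactly what hint injection avoids.

```python
import torch

def group_relative_advantage(rewards, eps=1e-6):
    """GRPO-style advantage (sketch): normalize each sampled answer's reward
    against its own group. If every rollout for a hard question fails
    (all rewards 0), the advantages are all zero and the sample yields no
    gradient; this is the low-data-utilization problem Hint-GRPO targets.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(group_relative_advantage(torch.tensor([1., 0., 0., 1.])))  # informative signal
print(group_relative_advantage(torch.tensor([0., 0., 0., 0.])))  # all zero -> no signal
```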

CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers

This paper proposes CAD-Assistant, the first tool-augmented vision-language model framework for generic CAD tasks. By integrating a CAD-specific toolset (sketch parameterizer, rendering module, constraint checker, etc.) and the FreeCAD Python API, it surpasses supervised task-specific methods in a zero-shot setting.

Calibrating MLLM-as-a-Judge via Multimodal Bayesian Prompt Ensembles

This paper proposes Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB), which learns image-cluster-conditioned prompt weights to substantially improve calibration and judgment accuracy of MLLMs used as evaluators, addressing the failure of standard prompt ensemble methods in multimodal settings.

CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models

This work is the first to introduce multimodal large language models (MLLMs) into category-agnostic pose estimation (CAPE), enabling keypoint localization for arbitrary categories using only a query image and textual descriptions—without requiring traditional support images or annotations—surpassing the 5-shot state-of-the-art on the MP-100 benchmark.

CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning

CaptionSmiths is a framework that enables slider-style flexible control over three caption attributes — length, descriptiveness, and lexical uniqueness — via continuous scalar interpolation rather than discrete clustering. Trained jointly on multiple datasets, it achieves more precise attribute control and higher lexical alignment quality than baselines.

CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

This paper introduces CAPTURe, a benchmark that evaluates spatial reasoning and world model construction in VLMs by requiring amodal counting of regularly arranged objects under occlusion. Results show that even the strongest model, GPT-4o, achieves a 14.75% counting error under occlusion, while humans perform nearly perfectly.

Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning

This paper proposes the Causal CLIP Adapter (CCA), which applies independent component analysis (ICA) to causally disentangle CLIP visual features, and enhances cross-modal alignment via unidirectional text classifier fine-tuning and bidirectional cross-attention, achieving state-of-the-art few-shot classification performance across 11 benchmark datasets.

ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning

This paper proposes PointCoT, which integrates reflective visual grounding (bounding boxes) into the chain-of-thought for chart reasoning, enabling MLLMs to interactively verify each reasoning step against the chart's visual content. It also constructs the ChartPoint-SFT-62k dataset containing 19.2K high-quality samples, achieving a +5.04% improvement on ChartBench.

Chimera: Improving Generalist Model with Domain-Specific Experts

This paper proposes Chimera, a scalable and low-cost multimodal pipeline that integrates domain-specific expert knowledge (tables, charts, math, documents) into a generalist multimodal large model via a lightweight routing module for dynamic expert selection, a progressive training strategy, and a Generalist-Specialist Collaboration Masking (GSCM) mechanism. Chimera achieves 64.9% on MathVista (SOTA) and matches or surpasses specialist models on multiple visual structure extraction tasks.

CLIPSym: Delving into Symmetry Detection with CLIP

This paper proposes CLIPSym, the first method to leverage the multimodal understanding capability of pretrained CLIP for reflection and rotation symmetry detection. It introduces a Semantics-Aware Prompt Grouping (SAPG) strategy to integrate textual semantic cues and a decoder with theoretical rotation equivariance guarantees, achieving state-of-the-art results on three benchmarks: DENDI, SDRW, and LDRS.

CoA-VLA: Improving Vision-Language-Action Models via Visual-Textual Chain-of-Affordance

This paper proposes CoA-VLA (Chain-of-Affordance), which organizes four categories of robot affordances (object, grasp, spatial, and movement) into a chain-of-thought reasoning process and injects them, in both textual and visual form, into a diffusion policy network via a visual-textual co-injection module. The approach achieves an 85.54% success rate on a real-robot benchmark spanning 7 tasks, outperforming OpenVLA by 30.65%, and generalizes to unseen object poses and obstacles.

CompCap: Improving Multimodal Large Language Models with Composite Captions

This paper proposes CompCap, an automated framework for synthesizing six categories of composite images (collages, image-text mixtures, charts, tables, code, and diagrams) along with high-quality captions. The resulting CompCap-118K dataset, when incorporated into the SFT stage, significantly improves MLLM comprehension of composite images.

Controlling Multimodal LLMs via Reward-guided Decoding

This paper proposes MRGD (Multimodal Reward-Guided Decoding), which trains a PaliGemma-based object-precision (hallucination) reward model and an OWLv2-based object-recall reward model. During MLLM inference, MRGD performs sentence-level beam search, scoring candidates with a linearly weighted combination of the two rewards for fine-grained control over outputs. On CHAIR, it reduces LLaVA-1.5's CHAIRi from 15.05 to 4.53 (a 70% reduction) while enabling dynamic, controllable precision–recall trade-offs.
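
In spirit, the decoding loop scores each candidate sentence with a weighted sum of the two rewards and keeps the best beams; `precision_rm`/`recall_rm` below are placeholders for the two learned reward models, and the weights are the user-facing trade-off knobs:

```python
def rank_candidates(candidates, precision_rm, recall_rm, w_p=1.0, w_r=0.5):
    """Sentence-level reward-guided decoding (sketch): score each candidate
    continuation with a weighted combination of a precision (hallucination)
    reward and a recall (object coverage) reward, then keep the best beams.
    """
    scored = [(w_p * precision_rm(c) + w_r * recall_rm(c), c) for c in candidates]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [c for _, c in scored]

# Toy usage with stand-in reward functions:
ranked = rank_candidates(
    ["a cat on a mat", "a cat, a dog, and a piano"],
    precision_rm=lambda s: 1.0,                 # pretend both are hallucination-free
    recall_rm=lambda s: len(s.split()) / 10.0,  # pretend longer = more objects covered
)
```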

CVPT: Cross Visual Prompt Tuning

To address the computational redundancy and attention disruption caused by prompt tokens participating in self-attention in Visual Prompt Tuning (VPT), this paper proposes CVPT, which decouples the interaction between prompt and image tokens via cross-attention and leverages a weight-sharing mechanism to initialize the cross-attention module. CVPT significantly outperforms VPT across 25 datasets and achieves performance comparable to mainstream adapter-based methods.

DADM: Dual Alignment of Domain and Modality for Face Anti-Spoofing

This paper proposes the DADM framework, which simultaneously addresses intra-domain modality misalignment and inter-domain modality misalignment in multimodal face anti-spoofing via a Mutual Information Mask (MIM) module and a dual domain-modality alignment optimization strategy, achieving state-of-the-art performance across four evaluation protocols.

DASH: Detection and Assessment of Systematic Hallucinations of VLMs

This paper proposes DASH, a fully automated pipeline that systematically discovers false-positive object hallucination clusters in VLMs via two complementary strategies: LLM-based text query generation (DASH-LLM) and diffusion model optimization-based image query generation (DASH-OPT). Applied to ReLAION-5B, DASH uncovers 19k+ clusters and 950k+ images, and constructs the more challenging DASH-B benchmark.

DisenQ: Disentangling Q-Former for Activity-Biometrics

This paper proposes DisenQ (Disentangling Q-Former), which leverages structured language guidance to disentangle video features into three independent spaces—biometric, motion, and non-biometric—achieving state-of-the-art activity-aware person recognition without requiring additional visual modalities.

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

This paper proposes Dita (Diffusion Transformer Policy), which, unlike prior methods that denoise on compressed embeddings using shallow networks, adopts in-context conditioning to directly condition denoising on raw visual tokens. A causal Transformer processes the full token sequence of language, images, timesteps, and noisy actions. With 334M parameters, Dita achieves state-of-the-art or competitive performance on SimplerEnv zero-shot, LIBERO, CALVIN, and other benchmarks.

DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding

This paper proposes DocThinker, the first framework to apply GRPO (Group Relative Policy Optimization) reinforcement learning to document understanding. By training MLLMs with a four-objective rule-based reward (format, answer accuracy, RoI IoU, and question rephrasing quality), DocThinker enables models to autonomously generate interpretable reasoning processes. Using only 4K training samples, it improves Qwen2.5-VL-7B on DocVQA from 0.355 (SFT) to 0.579 (RL) and achieves 82.4% precision on visual grounding tasks.

DOGR: Towards Versatile Visual Document Grounding and Referring

This paper proposes DOGR-Engine, a data engine for document grounding and referring, constructs DOGR-Bench — the first comprehensive benchmark evaluating document grounding and referring capabilities across 7 task types × 3 document types — and develops DOGR, the first document understanding MLLM that integrates precise text localization with interactive grounding and referring capabilities.

DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning

This paper proposes the DWIM framework, which employs a discrepancy-aware workflow generation strategy to curate high-quality training data and an instruct-masking fine-tuning strategy to clone only effective actions, endowing LLMs with tool-aware capability for compositional visual reasoning and achieving state-of-the-art results on multiple VR benchmarks.

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

This paper proposes Dynamic-VLM, which employs a dynamic visual token compressor to flexibly adjust the number of tokens per frame according to video length. Combined with a 2-million-scale high-quality synthetic video QA dataset, the method achieves a 2.7% improvement over LLaVA-OneVision on VideoMME and a 10.7% improvement on MuirBench.

Dynamic Group Detection using VLM-augmented Temporal Groupness Graph

This paper proposes a VLM-augmented temporal groupness graph for detecting dynamically changing groups in video. The core innovation lies in using CLIP to extract groupness-augmented features from bounding boxes containing person pairs and background context to estimate grouping probability, followed by Louvain clustering over a full-sequence temporal graph to enable dynamic group detection.

Dynamic Multimodal Prototype Learning in Vision-Language Models

This paper proposes ProtoMM, a training-free multimodal prototype learning framework that models prototypes as discrete distributions over textual descriptions and visual particles. By leveraging optimal transport to dynamically update multimodal prototypes, ProtoMM achieves state-of-the-art performance across 15 zero-shot benchmarks.

Effective Training Data Synthesis for Improving MLLM Chart Understanding

This paper proposes a modular five-stage chart data synthesis pipeline that produces a high-quality training set, ECD (Effective Chart Dataset), comprising 10k+ chart images and 300k+ QA pairs, consistently improving chart understanding across multiple open-source MLLMs.

Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features

This paper proposes Sparse Attention Vectors (SAVs) — a training-free method that extracts fewer than 5% of attention heads from frozen generative Large Multimodal Models (LMMs) as strong feature representations. With only approximately 20 labeled samples per class, SAVs achieve state-of-the-art performance on vision-language classification tasks, outperforming LoRA fine-tuning by an average of 7% on challenging benchmarks including BLINK, VLGuard, and NaturalBench.
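
Conceptually, the classification step reduces to nearest-class-mean matching on a handful of attention-head activations; the sketch below assumes head selection has already happened and the retained heads are flattened into one feature vector per example:

```python
import torch

def nearest_class_mean(head_feats, labels, query_feats):
    """Sparse Attention Vectors, schematically: treat selected attention-head
    activations from a frozen LMM as features and classify a query by cosine
    similarity to each class's mean vector. Head selection is omitted here;
    assume `head_feats` already holds only the retained (<5%) heads.
    """
    feats = torch.nn.functional.normalize(head_feats, dim=-1)
    query = torch.nn.functional.normalize(query_feats, dim=-1)
    protos = torch.stack([feats[labels == c].mean(0) for c in labels.unique()])
    protos = torch.nn.functional.normalize(protos, dim=-1)
    return (query @ protos.T).argmax(-1)   # predicted class index

pred = nearest_class_mean(torch.randn(40, 64),          # ~20 examples per class
                          torch.randint(0, 2, (40,)),
                          torch.randn(5, 64))
```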

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs

This paper proposes ED-VTG, a two-stage framework for video temporal grounding (VTG) that first enriches the input query and then predicts temporal intervals. By leveraging the descriptive capability of multimodal LLMs to supplement query details, combined with a lightweight interval decoder and a multiple instance learning (MIL) framework, ED-VTG is the first LLM-based method to comprehensively match or surpass specialized models across multiple benchmarks.

Evading Data Provenance in Deep Neural Networks

This paper exposes the false sense of security in existing Dataset Ownership Verification (DOV) methods. Through a unified evasion framework, Escaping DOV, task-relevant but identity-free knowledge is transferred from a teacher model to a surrogate student via OOD data, successfully bypassing all 11 evaluated DOV methods simultaneously.

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

This work systematically investigates the optimal architecture and training strategy for encoder-free VLMs, proposing a Divide-and-Conquer architecture that fully decomposes a transformer into modality-specific components (independent attention/FFN/LayerNorm per modality). Using only 100M publicly available data, EVEv2 surpasses all encoder-free counterparts and approaches the performance of encoder-based VLMs.

Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection

This paper proposes Graph Score Propagation (GSP), a training-free framework that performs score propagation over a graph constructed from class prototypes and test data. By incorporating prompt clustering and a self-training negative prompting strategy, GSP leverages VLMs for efficient OOD detection on 3D point clouds, consistently outperforming existing state-of-the-art methods on both synthetic and real-world datasets.
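
The propagation step is essentially label-propagation-style smoothing of OOD scores over the prototype/test-feature graph; the fixed-point iteration below (with illustrative `alpha` and iteration count) captures the idea:

```python
import torch

def propagate_scores(affinity, init_scores, alpha=0.5, iters=10):
    """Training-free score propagation (sketch): smooth per-node OOD scores
    over a graph built from class prototypes and test features. `affinity`
    is assumed row-normalized; alpha balances propagated vs. initial scores.
    """
    s = init_scores.clone()
    for _ in range(iters):
        s = alpha * affinity @ s + (1 - alpha) * init_scores
    return s

A = torch.softmax(torch.randn(50, 50), dim=-1)   # toy row-normalized affinity
smoothed = propagate_scores(A, torch.rand(50))
```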

FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection

This paper proposes FA (Forced prompt leArning), which introduces a learnable "forced prompt" and trains it to produce higher ID-class matching scores than a frozen original prompt, compelling it to capture richer ID class descriptions beyond label text semantics. FA achieves significant improvements in CLIP-based few-shot OOD detection without external auxiliary data or additional parameters.

FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers

This paper proposes FALCON, which introduces learnable Visual Registers into the ViT encoder. Through the ReCompact mechanism, visual redundancy is eliminated directly during the encoding stage (achieving 9× token compression), while the ReAtten module resolves visual fragmentation caused by image cropping via inter-register interactions.

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

This paper identifies a systematic positional bias in early visual token pruning for VLMs—caused by RoPE, which tends to retain tokens from the bottom of the image—and proposes FEATHER, which addresses this issue via RoPE-free attention, uniform sampling, and multi-stage pruning, achieving over 5× performance improvement on visual grounding tasks.
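
A toy version of a pruning criterion with a uniform-sampling floor is sketched below; the fraction reserved for uniform coverage and the stride computation are illustrative assumptions, not FEATHER's exact recipe:

```python
import torch

def feather_keep(importance, grid_h, grid_w, budget, uniform_frac=0.25):
    """Pruning with a uniform-sampling floor (sketch): besides the top
    attention-scored tokens, always keep a uniform spatial subset so a
    position-biased criterion (e.g., RoPE favoring bottom rows) cannot
    discard entire image regions.
    """
    n = grid_h * grid_w
    stride = max(1, n // max(1, int(budget * uniform_frac)))
    uniform = torch.arange(0, n, stride)          # evenly spaced spatial anchors
    top = importance.topk(budget).indices         # score-based keeps
    return torch.unique(torch.cat([uniform, top]))

keep = feather_keep(torch.rand(576), grid_h=24, grid_w=24, budget=115)
```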

FedMVP: Federated Multimodal Visual Prompt Tuning for Vision-Language Models

This paper proposes FedMVP, which, under a federated learning setting, employs a PromptFormer network to fuse image visual features with LLM-generated category attribute text features, generating dynamic multimodal visual prompts injected into CLIP's visual encoder. FedMVP achieves substantial improvements of 1.57%–2.26% over existing federated prompt learning methods across 20 datasets and three generalization settings.

Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving

This paper proposes VLADBench, a fine-grained vision-language model evaluation benchmark for autonomous driving scenarios, covering 5 major domains, 11 second-level dimensions, and 29 third-level tasks. Using a closed-ended QA format, it progressively assesses VLM capabilities from static knowledge to dynamic reasoning, and trains small-scale domain-specific (DS) models on 1.4M domain-specific QA data to validate cognitive interactions across domains.

FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging

This paper proposes FinMMR, a bilingual (Chinese–English) multimodal financial numerical reasoning benchmark comprising 4,300 questions, 8,700+ financial charts, and 14 financial sub-domains. It systematically evaluates 15 MLLMs to identify bottlenecks in complex domain-specific reasoning, and proposes three improvement strategies: visual filtering, knowledge augmentation, and model collaboration.

FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance

This paper proposes FOLDER — a plug-and-play visual token compression module that systematically analyzes three key factors of information loss (reduction impact, propagation effect, and aggregation method), performs aggressive token merging in the last few layers of the visual encoder, and achieves up to 70% token reduction while maintaining or even improving model performance.

FREE-Merging: Fourier Transform for Efficient Model Merging

This paper is the first to identify the frequency-domain manifestation of task interference in model merging. It proposes FR-Merging, which removes low-frequency interference via high-pass filtering to construct a high-quality merged backbone, and combines it with lightweight task expert modules (FREE-Merging) to achieve an optimal performance–cost trade-off across vision, language, and multimodal tasks.
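
The high-pass filtering of a task vector can be pictured as follows: a hypothetical 1-D FFT sketch where the cutoff ratio is an assumed hyperparameter (the paper's filter design may differ):

```python
import torch

def highpass_task_vector(finetuned, base, keep_ratio=0.7):
    """FR-Merging-style filtering (sketch): treat the task vector
    (finetuned minus base weights) in the frequency domain and zero out
    its lowest-frequency components, which the paper identifies as the
    main carrier of cross-task interference.
    """
    tv = (finetuned - base).flatten()
    spec = torch.fft.rfft(tv)
    cutoff = int(len(spec) * (1 - keep_ratio))
    spec[:cutoff] = 0                      # drop low-frequency interference
    return base + torch.fft.irfft(spec, n=tv.numel()).reshape(base.shape)

base, ft = torch.randn(4, 4), torch.randn(4, 4)
merged_layer = highpass_task_vector(ft, base)
```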

Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference

This paper proposes Free-MoRef, a training-free method inspired by Mixture-of-Experts (MoE) that partitions long video tokens into multiple short sequences as multi-references, queries them in parallel via the MoRef attention mechanism, and fuses unified activation values. The approach enables efficient and comprehensive understanding of 2× to 8× longer frame inputs on a single A100 GPU, surpassing dedicated long-video models on VideoMME, MLVU, and LongVideoBench.

From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning

This paper proposes the MIR benchmark, comprising 22,257 multi-image interleaved reasoning QA pairs with five-stage reasoning steps, and introduces a progressive curriculum learning strategy that trains MLLMs from easy to hard samples to improve multi-image interleaved reasoning capability.

From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning

This paper proposes Dual-LoRA and Visual Cue Enhancement (VCE) modules that adopt a "from holistic to localized" paradigm to address data conflicts in efficient visual instruction fine-tuning, surpassing LoRA-MoE methods with only a 1.16× inference time overhead.

G2D: Boosting Multimodal Learning with Gradient-Guided Distillation

This paper proposes G2D (Gradient-Guided Distillation), which addresses the modality imbalance problem in multimodal learning by combining feature distillation and logit distillation from unimodal teachers to a multimodal student, together with a Sequential Modality Prioritization (SMP) gradient modulation strategy guided by unimodal teacher confidence scores. G2D achieves 85.89% accuracy on CREMA-D, surpassing all state-of-the-art methods focused on modality imbalance.

GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

This paper introduces the DataDoP dataset (29K free-moving camera trajectories with descriptions extracted from real films) and the GenDoP auto-regressive Transformer model, which generates artistic, high-quality camera motion trajectories conditioned on text and/or RGBD input, outperforming existing methods in controllability, motion smoothness, and complexity.

Generalizable Object Re-Identification via Visual In-Context Prompting

VICP proposes a generalizable object re-identification framework in which an LLM infers identity-discriminative rules from a small set of positive/negative image pairs and converts them into dynamic visual prompts injected into a frozen visual foundation model (DINOv2), enabling zero-parameter-update generalization to unseen object categories.

GTA-CLIP: Generate, Transduct, Adapt — Iterative Transduction with VLMs

This paper proposes GTA-CLIP, which iteratively executes three steps — LLM-based attribute generation, attribute-enhanced transductive inference, and encoder fine-tuning — achieving an average zero-shot improvement of 9.5% and few-shot improvement of 3–4% across 12 datasets, and for the first time unifying attribute discovery, transductive inference, and model adaptation in a zero-label setting.

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

This paper introduces GEOBench-VLM, a comprehensive benchmark designed to evaluate VLMs on geospatial tasks, encompassing 31 sub-tasks across 8 major categories and over 10,000 manually verified instructions. The benchmark reveals that current state-of-the-art VLMs, including GPT-4o, still perform poorly on geospatial tasks, with the highest accuracy reaching only 41.7%.

Global and Local Entailment Learning for Natural World Imagery

This paper proposes Radial Cross-Modal Embeddings (RCME), a framework that explicitly models the transitivity of entailment relations to learn hierarchical representations in vision-language models. RCME enables inference at arbitrary taxonomic ranks on the Tree of Life and achieves state-of-the-art performance on hierarchical classification and retrieval tasks.

GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

GRAB is a graph analysis benchmark for large multimodal models (LMMs), comprising 3,284 synthetically generated questions spanning 5 tasks and 23 graph properties. The strongest model evaluated, Claude 3.5 Sonnet, achieves only 21.0% accuracy, revealing critical deficiencies in LMMs' capacity for visual analytical reasoning.

Growing a Twig to Accelerate Large Vision-Language Models

This paper proposes TwigVLM, which attaches a lightweight twig module to the early layers of a VLM to simultaneously enable twig-guided visual token pruning (TTP, for prefilling acceleration) and self-speculative decoding (SSD, for decoding acceleration). On LLaVA-1.5-7B, TwigVLM retains 96% accuracy after pruning 88.9% of visual tokens and achieves a 154% speedup in long-answer generation, substantially outperforming existing methods in both accuracy and speed.

GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-Based VLM Agent

This paper identifies that relying solely on outcome rewards during RL training of VLM agents leads to "thought collapse," and proposes the GTR framework, which employs an external VLM corrector to automatically rectify reasoning processes and jointly trains thoughts and actions via PPO + SFT, achieving 3–5× improvement in task success rates on the Game of 24 and ALFWorld benchmarks.

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

This work identifies that the encoder of masked autoregressive (MAR) models inherently possesses both the fine-grained image features required for generation and the high-level semantic representations required for understanding. Based on this observation, Harmon is proposed — an autoregressive framework that unifies image generation and understanding via a shared MAR encoder. Through three-stage progressive training, Harmon achieves an Overall score of 0.76 on GenEval, surpassing all unified models, while matching the understanding performance of the Janus series that employs a dedicated SigLIP encoder.

Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving

This paper proposes the Hints of Prompt (HoP) framework, which enhances CLIP visual representations through three hierarchical hints (Affinity/Semantic/Question hint) to capture instance-level structure, domain-specific semantics, and question relevance. HoP surpasses the fully trained baseline on autonomous driving VQA tasks using only 25% of the training data.

HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?

This paper introduces HRScene, a benchmark covering 25 real-world scenarios and 2 diagnostic datasets (resolution 1K–35K). Evaluating 28 VLMs reveals that current state-of-the-art models achieve an average accuracy of only ~50% on real high-resolution tasks, with significant regional performance divergence and a pronounced lost-in-middle problem.

IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves

This paper proposes IDEATOR, the first black-box jailbreak framework that uses a VLM to red-team other VLMs. A weakly safety-aligned VLM (MiniGPT-4) serves as the attacker, generating semantically rich image–text jailbreak pairs in conjunction with Stable Diffusion, and a breadth-depth exploration strategy iteratively refines the attacks, achieving a 94% attack success rate (ASR) on MiniGPT-4 with an average of 5.34 queries and transferring to LLaVA, InstructBLIP, and Chameleon at 75–88%. The work also introduces VLJailbreakBench, a safety benchmark of 3,654 samples exposing vulnerabilities across 11 VLMs.

Information Density Principle for MLLM Benchmarks

This paper proposes an "information density" principle to evaluate MLLM benchmark quality along four dimensions — Fallacy, Difficulty, Redundancy, and Diversity — and constructs a three-tier automated evaluation pipeline (Human–Model–Data) to conduct a systematic "benchmark for benchmark" analysis of 19 mainstream benchmarks.

Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models

This paper proposes MVP (Mixture of Visual Projectors), a Mixture-of-Experts framework for visual projectors conditioned on instruction context. Through an expert recommendation strategy and an expert pruning mechanism, MVP enables generative VLMs to continually learn new vision-language tasks without catastrophic forgetting, while maintaining responsiveness to diverse instruction types. MVP consistently outperforms existing methods across classification, captioning, and question-answering tasks.

Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs

This paper proposes the Instruction-oriented Preference Alignment (IPA) framework, which anchors alignment signals to instruction completion efficacy rather than hallucination factors alone, via an automated preference construction mechanism and a progressive preference data collection pipeline. IPA achieves consistent improvements on Qwen2VL-7B across 9 benchmarks spanning hallucination evaluation, general VQA, and text comprehension.

Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model

This paper proposes LaZSL, which leverages Optimal Transport (OT) to achieve fine-grained alignment between local visual regions and semantic attributes, constructing an interpretable zero-shot classifier without additional training. LaZSL demonstrates strong accuracy, interpretability, and domain generalization across 9 datasets.

Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining

Iris introduces two core innovations — Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL) — achieving SOTA on multiple GUI understanding benchmarks with only 850K annotated samples, matching methods that use over 10× more data, while reducing inference time from 3 seconds to 1 second.

Is Less More? Exploring Token Condensation as Training-free Test-time Adaptation

This paper proposes Token Condensation as Adaptation (TCA), a training-free test-time adaptation method that leverages a Domain-aware Token Reservoir (DTR) to guide cross-head token pruning/merging and logits self-correction. Without modifying model parameters, TCA improves cross-dataset performance of CLIP/SigLIP variants by up to 21.4% while reducing GFLOPs by 12.2%–48.9%.

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

This paper identifies a Shuffle Inconsistency between the comprehension capability and the safety capability of multimodal large language models (MLLMs)—models can understand shuffled harmful instructions, yet their safety mechanisms fail to defend against them. Building on this finding, the authors propose SI-Attack, a query-based black-box jailbreak method that achieves substantially higher attack success rates on both open-source and closed-source commercial models.

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

This paper proposes the first automated feature interpretation framework for Large Multimodal Models (LMMs). It employs Sparse Autoencoders (SAEs) to decompose LMM internal representations into monosemantic features, leverages larger LMMs to automatically interpret these features, and demonstrates that feature steering can correct model hallucinations.

LATTE: Collaborative Test-Time Adaptation of Vision-Language Models in Federated Learning

This paper proposes Latte, a framework that enables collaborative test-time adaptation of vision-language models (e.g., CLIP) in decentralized federated learning settings. Through a dual-memory mechanism combining local and external memory, Latte achieves cross-client knowledge sharing while preserving client-level personalization.

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

LLaVA-CoT proposes a method enabling vision-language models to perform autonomous multi-stage structured reasoning. By constructing the LLaVA-CoT-100k structured reasoning annotation dataset, the model is trained to sequentially execute four stages—Summary, Caption, Reasoning, and Conclusion—and a Stage-Wise Retracing Search (SWIRES) is proposed for test-time scaling, allowing an 11B model to surpass Gemini-1.5-pro and GPT-4o-mini.

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

This paper proposes the LLaVA-KD framework, which transfers knowledge from large-scale MLLMs to small-scale MLLMs via Multimodal Distillation (MDist) and Relational Distillation (RDist) strategies combined with a three-stage training scheme (DPT-SFT-DFT), significantly improving small model performance without modifying the model architecture.

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

By exploiting the sparsity of attention scores between the [CLS] token and spatial visual tokens in CLIP-ViT, PruMerge adaptively selects important visual tokens via IQR-based outlier detection and merges the pruned tokens back into the retained ones through k-nearest-neighbor clustering, achieving up to 14× visual token compression (retaining roughly 5.5% of visual tokens) with negligible LMM performance degradation.
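
The outlier-based selection rule is simple enough to state directly; the sketch below covers only the IQR selection step (the k-NN merging of pruned tokens is omitted), with a peaked toy attention vector standing in for real CLS attention:

```python
import torch

def prumerge_select(cls_attn):
    """IQR-based outlier selection (sketch): keep visual tokens whose
    [CLS]-attention is an upper outlier (> Q3 + 1.5 * IQR), exploiting the
    sparsity of CLS-to-patch attention in CLIP-ViT.
    """
    q1, q3 = cls_attn.quantile(0.25), cls_attn.quantile(0.75)
    keep = cls_attn > q3 + 1.5 * (q3 - q1)
    return keep.nonzero(as_tuple=True)[0]

kept = prumerge_select(torch.rand(576) ** 8)   # peaked toy attention distribution
```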

Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness

This paper proposes DataTailor — a collaborative multimodal data selection framework grounded in three principles: informativeness, uniqueness, and representativeness. Using only 15% of the data, DataTailor achieves 101.3% of the performance obtained with full-data fine-tuning, embodying the "Less is More" philosophy.

MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling

This paper proposes MaTVLM, which replaces a portion of Transformer layers in a pretrained VLM with Mamba-2 layers and trains the resulting model via single-stage knowledge distillation, achieving 3.6× inference speedup and 27.5% memory reduction while maintaining competitive performance.

MAVias: Mitigate Any Visual Bias

This paper proposes MAVias, an open-set visual bias mitigation framework that extracts visual attribute tags from images using a tagging foundation model, employs an LLM to filter out tags irrelevant to the target class as potential biases, encodes the identified biases via vision-language embeddings, and incorporates them into training to learn bias-invariant representations. MAVias substantially outperforms existing methods on CelebA, Waterbirds, UrbanCars, and ImageNet9.

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

This paper introduces Multi-Context Visual Grounding as a novel task and the MC-Bench benchmark—comprising 2,000 manually annotated samples, 3 text description styles, and 20 practical skills—to evaluate 20+ MLLMs and foundation models. It reveals a substantial performance gap between current models and humans (human AP50=41.3% vs. best end-to-end model AP50=30.7%), and provides an agentic baseline combining GPT-4o and G-DINO (AP50=36.2%).

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

This paper proposes Visual-Predictive Instruction Tuning (VPiT), which extends a pretrained LLM into a unified model—MetaMorph—capable of both visual understanding and generation via lightweight instruction tuning alone. A key finding is that visual generation ability emerges as a natural byproduct of visual understanding, and the two capabilities mutually benefit each other in an asymmetric manner.

METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models

METEOR proposes the first three-stage progressive token pruning framework for multi-encoder MLLMs: at the encoding stage, feature rank is used to allocate sparsity ratios across encoders; at the fusion stage, collaborative pruning eliminates cross-encoder redundancy; at the decoding stage, pruning ratios are adaptively adjusted based on text prompts. The framework reduces visual tokens by 76% with only a 0.3% performance drop.

Mitigating Object Hallucinations via Sentence-Level Early Intervention

This paper proposes SENTINEL, a framework grounded in the key observation that hallucinations emerge early in generation and propagate forward. By combining in-domain candidate bootstrapping with dual-detector cross-validation to construct sentence-level preference data, and employing Context-aware DPO (C-DPO) for early intervention, SENTINEL reduces hallucinations on Object HalBench by 92% while preserving general capabilities.

MM-IFEngine: Towards Multimodal Instruction Following

This paper proposes the MM-IFEngine pipeline, which systematically generates high-quality image–instruction pair data (in both SFT and DPO variants) and constructs the MM-IFEval benchmark, achieving significant improvements in multimodal instruction following for MLLMs.

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

Apple proposes the CA-VQA dataset and MM-Spatial model, leveraging high-quality 3D scene data and open-set annotations to generate training/evaluation data covering spatial relation prediction, metric estimation, and 3D grounding. The resulting general-purpose MLLM achieves SOTA on 3D spatial understanding benchmarks while remaining competitive on other tasks.

MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning

This paper introduces MMAT-1M, the first million-scale multimodal agent tuning dataset, constructed via a four-stage data engine (Foundation → Rationale → Reflection → Integration). It endows MLLMs with CoT reasoning, tool invocation, and self-reflection capabilities, achieving an average improvement of 2.7% on InternVL2.5-8B and 8.8% on RAG tasks.

MMOne: Representing Multiple Modalities in One Scene

MMOne is a general framework that addresses property disparity and granularity disparity in multi-modal scene representation through a modality modeling module (with modality indicators) and a multi-modal decomposition mechanism. It jointly models RGB, thermal, and language modalities within a single 3DGS representation, achieving consistent improvements across all modalities.

MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild

This paper proposes MolParser, an end-to-end Optical Chemical Structure Recognition (OCSR) method that handles Markush structures via an extended SMILES representation (E-SMILES), constructs a large-scale training set MolParser-7M with 7 million samples, and incorporates real-world literature data through active learning. MolParser achieves 76.9% accuracy on the WildMol benchmark, significantly outperforming existing methods.

Multi-Cache Enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models

This paper proposes MCP/MCP++, a multi-cache enhanced prototype learning framework that constructs compact intra-class distributions via three complementary cache modules—entropy cache, align cache, and negative cache—and further introduces cross-modal residual learning to refine the alignment between visual and textual prototypes, achieving state-of-the-art zero-shot generalization across 15 downstream tasks.

Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

This paper proposes LLaVA-Reward, which leverages the hidden states (rather than text generation outputs) of a pretrained MLLM to directly predict reward scores. A Skip-connection Cross Attention (SkipCA) module is introduced to enhance bidirectional visual-text interaction, and LoRA adapters are employed to handle different evaluation dimensions. The method achieves state-of-the-art performance on text-image alignment, fidelity, and safety evaluation, and can be applied to inference-time scaling for diffusion models.

MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models

This paper proposes MultiVerse, a multi-turn conversation evaluation benchmark comprising 647 dialogues collected from 12 VLM evaluation datasets, spanning 484 task types and 484 interaction goals. Using a checklist-based evaluation approach, the benchmark reveals that even the strongest model, GPT-4o, achieves only ~50% success rate on complex multi-turn conversations.

MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

This paper proposes Semantic Discrete Encoding (SDE), a visual tokenizer that injects pretrained SigLIP semantic features into the VQGAN quantization process, enabling discrete visual tokens to align naturally with language tokens. Built on SDE, the unified autoregressive VLM MUSE-VL, trained on only 24M image–text pairs, outperforms Emu3 by 4.8% on understanding benchmarks, surpasses the specialist model LLaVA-NeXT 34B by 3.7%, and simultaneously supports image generation, achieving state-of-the-art results in both visual understanding and generation.

NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection

This paper proposes NegRefine, which leverages an LLM to filter proper nouns and subcategory labels from the negative label set, and designs a multi-label matching scoring function to handle cases where an image simultaneously matches both in-distribution and negative labels. On the ImageNet-1K benchmark, NegRefine achieves an average AUROC improvement of 1.82% and FPR95 reduction of 4.35%, establishing a new state of the art in zero-shot OOD detection.

On Large Multimodal Models as Open-World Image Classifiers

This paper systematically evaluates 13 large multimodal models (LMMs) on open-world image classification, proposes an evaluation protocol comprising four complementary metrics, and reveals systematic error patterns in LMMs regarding granularity judgment and fine-grained discrimination.

One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

This paper proposes C-PGC, a framework that trains a conditional perturbation generator via malicious contrastive learning to produce a pair of universal image-text adversarial perturbations (UAPs), fundamentally disrupting the multimodal alignment of VLP models and achieving strong attack performance across multiple VLP models and downstream tasks in both white-box and black-box settings.

ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

This paper proposes ONLY, a training-free single-layer intervention decoding method. It selects text-biased attention heads via the Text-to-Visual Entropy Ratio (TVER) to generate textually-enhanced logits, which are then used in adaptive contrastive or collaborative decoding against the original logits. With only 1.07× inference overhead, ONLY outperforms VCD/M3ID by 3.14% on POPE and reduces CHAIR_S by 6.2 points on CHAIR.
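
The contrastive arm of the intervention can be written as a one-line logit combination; the adaptive choice between contrastive and collaborative decoding, and the TVER-based head selection, are omitted, and `alpha` is illustrative:

```python
import torch

def contrast_logits(orig_logits, text_enhanced_logits, alpha=1.0):
    """Contrastive decoding step (sketch): push the next-token distribution
    away from what the text-biased heads alone would predict, keeping
    visual evidence in play. This shows only the contrastive arm of ONLY.
    """
    return (1 + alpha) * orig_logits - alpha * text_enhanced_logits

out = contrast_logits(torch.randn(32000), torch.randn(32000))
```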

OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning

This paper introduces OpenVision — a fully open-source (data, training code, and weights) family of vision encoders (5.9M–632.1M parameters) trained using the CLIPS framework on the Recap-DataComp-1B dataset. When integrated into multimodal frameworks such as LLaVA, OpenVision matches or surpasses OpenAI CLIP and Google SigLIP, providing the community with a transparent and flexible alternative visual backbone.

OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography

This paper proposes OracleFusion, a two-stage semantic typography framework. Stage 1 employs MLLM-enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of oracle bone script (OBS) and localize key components. Stage 2 introduces Structural Oracle Vector Fusion (SOVF), which generates semantically enriched vector glyphs through glyph structure constraints and skeleton-preserving losses, conveying semantic meaning while preserving original glyph integrity to assist expert decipherment of undeciphered OBS characters.

OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding of MLLMs

This paper proposes OrderChain, a prompting paradigm that enhances the ordinal understanding capability of multimodal large language models (MLLMs) via task-aware prompts and a Range-Optimized Chain-of-Thought (RO-CoT), achieving for the first time a unified ordinal regression model across diverse tasks.

Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

This paper proposes the Abstract Perspective Change (APC) framework, which leverages visual foundation models to construct an abstract scene representation and perform perspective transformations, enabling VLMs to reason spatially from arbitrary viewpoints. APC substantially outperforms existing VLMs and fine-tuned models on both synthetic and real-image benchmarks.

Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models

This paper proposes Physics Context Builders (PCBs), a modular framework that fine-tunes small specialized VLMs on simulation data to generate detailed physical scene descriptions, which serve as physical context to augment the physical reasoning capabilities of large foundation VLMs (e.g., GPT-4o), without modifying the large model itself.

PhysSplat: Efficient Physics Simulation for 3D Scenes via MLLM-Guided Gaussian Splatting

This paper proposes PhysSplat, the first approach to leverage multimodal large language models (MLLMs) for zero-shot estimation of physical properties of objects in 3D scenes. Combined with a physics-geometry adaptive sampling strategy, it achieves realistic physics simulation on a single GPU within 2 minutes.

Pi-GPS: Enhancing Geometry Problem Solving by Unleashing the Power of Diagrammatic Information

Pi-GPS leverages diagrammatic information to resolve ambiguities in textual descriptions. By introducing a lightweight Rectifier–Verifier module, it addresses a previously overlooked problem of textual ambiguity, achieving nearly 10% improvement over prior state-of-the-art neuro-symbolic methods on Geometry3K.

PRO-VPT: Distribution-Adaptive Visual Prompt Tuning via Prompt Relocation

This paper proposes PRO-VPT, a framework that co-designs Adaptive Distribution Optimization (ADO) with Visual Prompt Tuning (VPT) via nested optimization. By iteratively relocating prompts through idleness score-based pruning and a reinforcement learning-based allocation strategy, PRO-VPT achieves gains of 1.6 pp and 2.0 pp over VPT on VTAB-1k and FGVC, respectively.

ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition

This paper proposes ProbRes, a framework that leverages a probabilistic residual search strategy based on jump diffusion, combined with ConceptNet commonsense priors and VLM likelihood estimation, to efficiently navigate large-scale search spaces in open-world egocentric activity recognition. ProbRes substantially reduces the number of VLM queries while improving recognition accuracy.

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

This paper proposes StepGRPO, an online reinforcement learning framework that introduces two rule-based step-wise reasoning rewards — StepRAR (Step-wise Reasoning Accuracy Reward) and StepRVR (Step-wise Reasoning Validity Reward) — without requiring a process reward model. The framework addresses the sparse reward problem in RL-based MLLM training, enabling models to autonomously explore and improve their reasoning capabilities.
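
Both rewards are rule-based rather than produced by a learned process reward model. A minimal sketch of what such a step-wise reward could look like (the matching rule and the equal weighting are assumptions, not the paper's exact formulation):

```python
def step_rewards(reasoning: str, key_steps: list[str]) -> float:
    """Illustrative rule-based step-wise reward in the spirit of StepGRPO.

    Combines a StepRAR-style accuracy term (fraction of pre-extracted key
    steps matched in the sampled trace) with a StepRVR-style validity term
    (does the trace end with an explicit final answer?).
    """
    trace = reasoning.lower()
    # StepRAR-style: soft-match annotated key steps against the trace.
    matched = sum(1 for step in key_steps if step.lower() in trace)
    accuracy = matched / max(len(key_steps), 1)
    # StepRVR-style: reward structurally complete, reason-then-answer traces.
    validity = 1.0 if "final answer" in trace else 0.0
    return accuracy + validity
```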

ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering

This paper proposes ReasonVQA, a dataset constructed through a low-cost and scalable framework that automatically integrates structured encyclopedic knowledge (Wikidata) with images, generating 1/2/3-hop multi-hop reasoning questions. The benchmark comprises 598K images and 4.2M questions, posing significant challenges to existing VQA models.

Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks

This work identifies a pervasive vulnerability of mainstream VLMs to Gaussian noise and proposes the Robust-VLGuard safety dataset (covering both image-text aligned and misaligned scenarios) together with noise-augmented fine-tuning to improve robustness. Combining this with DiffPure, which converts adversarial perturbations into Gaussian-like noise, yields the DiffPure-VLM defense framework, which effectively resists adversarial attacks of varying strengths.

SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders

SAUCE leverages sparse autoencoders (SAEs) to identify and selectively suppress features associated with target concepts in VLM intermediate representations, enabling fine-grained concept unlearning without weight updates. Evaluated across 60 concepts, it surpasses the previous SOTA in forgetting quality by 18%.
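
The core operation is an edit in SAE latent space rather than a weight update. A minimal sketch under assumed encoder/decoder callables:

```python
import torch

def suppress_concept(h, sae_encode, sae_decode, concept_idx, scale=0.0):
    """Minimal sketch of SAUCE-style concept suppression (assumed interface).

    `sae_encode`/`sae_decode` are the encoder/decoder of a sparse autoencoder
    trained on VLM intermediate representations; `concept_idx` indexes the
    sparse features found to fire on the target concept. The simple scaling
    rule is an illustration, not the paper's exact intervention.
    """
    z = torch.relu(sae_encode(h))        # sparse feature activations
    z[..., concept_idx] *= scale         # ablate concept-selective features
    return sae_decode(z)                 # edited hidden state, no weight update
```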

SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning

SC-Captioner proposes a multi-turn, policy-gradient reinforcement learning framework. By designing a correction reward function that combines correctness bonuses with mistake penalties, it teaches large vision-language models to self-correct their image captions, and it additionally introduces an improved CAPTURE evaluation metric.

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

This paper proposes the Vision Value Model (VisVM), trained via temporal difference (TD) learning, to guide sentence-level inference-time search in VLMs for generating higher-quality descriptive captions. Compared to greedy decoding and CLIP-PRM, VisVM search significantly reduces hallucination (CHAIRs from 32.4 to 26.2), and data generated through this process, when used for self-training, yields an average improvement of 10.8% across 9 benchmarks.

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

This paper proposes the Vision Value Model (VisVM), a value network trained via TD learning to predict the long-term value of sentences generated by a VLM. VisVM guides sentence-level beam search at inference time to produce image descriptions with fewer hallucinations and richer detail. High-quality captions generated by VisVM are further used for self-training, achieving an average improvement of 10.8% over LLaVA-Next across 9 benchmarks.
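
A minimal sketch of value-guided sentence-level search under assumed interfaces (`propose_sentences` and `score` are placeholders, not a real API):

```python
import torch

@torch.no_grad()
def value_guided_caption(vlm, value_model, image, max_sents=6, k=4):
    """Minimal sketch of VisVM-style sentence-level search.

    At each step the VLM proposes k candidate next sentences, and a value
    network (trained with TD learning to predict long-term caption quality)
    selects the continuation with the highest predicted value.
    """
    caption = ""
    for _ in range(max_sents):
        candidates = vlm.propose_sentences(image, caption, k=k)
        values = [value_model.score(image, caption + c) for c in candidates]
        best = max(range(len(candidates)), key=values.__getitem__)
        caption += candidates[best]
    return caption
```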

Scaling Laws for Native Multimodal Models

By training 457 models across diverse architectures, scales, and training data mixtures, this paper systematically investigates scaling laws for Native Multimodal Models (NMMs). It finds that early-fusion architectures (without pretrained visual encoders) outperform late-fusion counterparts at small parameter scales, are more training-efficient, and simpler to deploy; incorporating MoE further yields substantial performance gains.

SCAN: Bootstrapping Contrastive Pre-training for Data Efficiency

This paper proposes SCAN, a dynamic bootstrapping dataset pruning method that iteratively identifies pruning candidates and applies dataset mutation operations, limiting the average performance drop to under 1% at a 30–35% pruning rate in CLIP and MoCo contrastive pre-training.

ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

This work identifies significant layer-level redundancy in MLLMs—most layers contribute minimally to the transformation of visual tokens—and proposes ShortV: freezing visual tokens (skipping their attention and FFN computations) in approximately 60% of layers. On LLaVA-NeXT-13B, this achieves a 50% reduction in FLOPs with negligible performance degradation. The method is training-free and orthogonal to token pruning approaches, allowing them to be combined.
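
Semantically, a ShortV layer keeps visual token states fixed while text tokens are updated as usual. The sketch below illustrates this with a mask (for clarity it computes the full layer and then restores visual states; a real implementation skips the visual-token computation outright, which is where the FLOPs savings come from):

```python
import torch

def shortv_layer(layer, hidden, visual_mask, freeze_visual: bool):
    """Minimal sketch of ShortV's layer-level behavior (assumed interface).

    In an "ineffective" layer, visual tokens are carried through unchanged
    while text tokens are still updated by attention + FFN.
    """
    out = layer(hidden)  # standard transformer layer over all tokens
    if freeze_visual:
        # visual_mask: bool (B, T), True at visual token positions;
        # restore their pre-layer states, i.e. skip their update
        out = torch.where(visual_mask.unsqueeze(-1), hidden, out)
    return out
```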

SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

SimpleVQA is the first VQA benchmark designed for comprehensive multimodal factuality evaluation of MLLMs. It spans 9 task types and 9 thematic domains, and employs a short-answer design with deterministic references alongside an LLM-as-a-judge scoring protocol to systematically assess the factual capabilities of 18 MLLMs and 8 text-only LLMs.

SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning

This paper identifies a phenomenon termed "dual catastrophic forgetting" in continual visual instruction tuning (CVIT) of multimodal large models, wherein both visual understanding capability and instruction-following capability degrade simultaneously. To address this, SMoLoRA is proposed, employing a separable-routing mixture of LoRA experts to effectively mitigate both forms of forgetting.

SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

This paper reveals a "visual head sparsity" phenomenon in Multimodal Large Language Models (MLLMs), where only approximately 5% of attention heads actively participate in visual understanding. It proposes a training-free visual head identification framework based on OCR tasks and introduces SparseMM — an acceleration strategy that asymmetrically allocates KV-Cache budgets across heads according to their visual scores — achieving 1.38× real-time speedup and 52% memory reduction with no performance degradation.
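
A minimal sketch of the asymmetric budget split (the uniform floor and the purely proportional rule are assumptions for illustration):

```python
def allocate_kv_budget(visual_scores, total_budget, min_per_head=16):
    """Minimal sketch of SparseMM-style asymmetric KV-cache allocation.

    Every head receives a small uniform floor, and the remaining budget is
    split in proportion to each head's visual score, so the few heads that
    actually respond to visual content keep long caches.
    """
    n = len(visual_scores)
    remaining = max(total_budget - min_per_head * n, 0)
    total_score = sum(visual_scores) or 1.0
    return [min_per_head + int(remaining * s / total_score) for s in visual_scores]
```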

SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

This paper proposes SparseVILA—the first VLM inference acceleration framework that decouples visual sparsity between the prefill and decode stages: query-agnostic redundant token pruning during prefill, and query-aware relevant token retrieval during decode. The approach achieves up to 4.0× prefill speedup, 2.5× decode throughput improvement, and 2.6× end-to-end acceleration, while maintaining accuracy in multi-turn conversation settings where existing methods suffer severe degradation due to permanent token deletion.
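
The decoupling can be illustrated as two independent steps; the saliency and relevance proxies below are assumptions, not the paper's exact criteria:

```python
import torch

def prefill_prune(visual_tokens, keep_ratio=0.5):
    """Prefill stage (sketch): drop query-agnostic redundant visual tokens.
    Token norm stands in for the paper's actual pruning criterion."""
    k = max(1, int(keep_ratio * visual_tokens.shape[1]))
    idx = visual_tokens.norm(dim=-1).topk(k, dim=1).indices          # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, visual_tokens.shape[-1])
    return torch.gather(visual_tokens, 1, idx)

def decode_retrieve(query, visual_tokens, k=64):
    """Decode stage (sketch): retrieve the k visual tokens most relevant to
    the current query instead of deleting them permanently, which is what
    preserves information needed by later conversation turns."""
    scores = (visual_tokens @ query.unsqueeze(-1)).squeeze(-1)       # (B, N)
    idx = scores.topk(k, dim=1).indices.unsqueeze(-1)
    idx = idx.expand(-1, -1, visual_tokens.shape[-1])
    return torch.gather(visual_tokens, 1, idx)
```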

Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation

This paper proposes Sparse Optimization (SO), a framework that replaces low-rank adaptation methods (e.g., LoRA) via dynamic sparse gradient selection and importance-based momentum pruning. SO achieves state-of-the-art performance on few-shot VLM adaptation across 11 datasets while reducing memory overhead.
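
A minimal sketch of one sparse update step in this spirit, using plain top-k gradient selection (the paper's dynamic selection and importance-based momentum pruning are more involved):

```python
import torch

@torch.no_grad()
def sparse_step(param, lr=1e-3, density=0.01):
    """Apply only the top-k largest-magnitude gradient entries, so the
    effective update (and any optimizer state) stays extremely sparse."""
    g = param.grad.flatten()
    k = max(1, int(density * g.numel()))
    idx = g.abs().topk(k).indices         # the few coordinates that matter
    update = torch.zeros_like(g)
    update[idx] = g[idx]
    param -= lr * update.view_as(param)
```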

Spatial Preference Rewarding for MLLMs Spatial Understanding

This paper proposes SPR (Spatial Preference Rewarding), a framework that automatically constructs preference data pairs via semantic and localization scores, and trains MLLMs with DPO to distinguish high-precision grounding (chosen) from ambiguous or erroneous grounding (rejected), substantially improving fine-grained spatial understanding—particularly at high IoU thresholds.
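
Preference-pair construction reduces to ranking candidate responses by a combined score; a minimal sketch under an assumed weighting:

```python
def build_preference_pair(candidates, sem_score, loc_score, alpha=0.5):
    """Minimal sketch of SPR-style preference-pair construction: each
    candidate grounded response is scored by a weighted mix of a semantic
    score and a localization (e.g. IoU-based) score, and the best/worst
    candidates become the DPO (chosen, rejected) pair. `alpha` is an
    assumption, not the paper's weighting.
    """
    ranked = sorted(
        candidates,
        key=lambda c: alpha * sem_score(c) + (1 - alpha) * loc_score(c),
    )
    return ranked[-1], ranked[0]   # (chosen, rejected)
```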

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

This paper proposes STI-Bench, a benchmark for evaluating the precise spatial-temporal understanding capabilities of multimodal large language models (MLLMs), covering three scene categories (desktop/indoor/outdoor), eight static and dynamic task types, and over 2,000 QA pairs. The benchmark reveals that the current state-of-the-art MLLM (Gemini-2.5-Pro) achieves an average accuracy of only 41.4%, exposing fundamental deficiencies in precise spatial quantification and temporal dynamic understanding.

Synergistic Prompting for Robust Visual Recognition with Missing Modalities

This paper proposes the Synergistic Prompting (SyP) framework, which employs a dynamic adapter to generate input-adaptive scaling factors that modulate a base prompt (dynamic prompt), synergizing with a static prompt that captures shared cross-modal features. SyP achieves robust visual recognition under missing-modality conditions and consistently outperforms SOTA methods such as DCP on MM-IMDb, Food101, and Hateful Memes.
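
A minimal sketch of the static/dynamic prompt synergy (shapes and the concatenation scheme are assumptions for illustration):

```python
import torch
import torch.nn as nn

class SynergisticPrompt(nn.Module):
    """Minimal sketch of SyP-style prompting: a static prompt captures
    shared cross-modal features, while a small adapter maps the current
    input to scaling factors that modulate a base prompt into an
    input-adaptive dynamic prompt.
    """
    def __init__(self, dim, n_tokens):
        super().__init__()
        self.static = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
        self.base = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
        self.adapter = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(),
                                     nn.Linear(dim // 4, n_tokens))

    def forward(self, feat):                        # feat: (B, dim) pooled input feature
        scale = self.adapter(feat).sigmoid()        # (B, n_tokens) per-token factors
        dynamic = scale.unsqueeze(-1) * self.base   # input-adaptive dynamic prompt
        static = self.static.expand(feat.shape[0], -1, -1)
        return torch.cat([static, dynamic], dim=1)  # prepended to the encoder input
```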

TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models

This paper proposes TAB (Transformer Attention Bottleneck), a single-head co-attention bottleneck layer inserted after standard MHSA. By removing the skip connection and constraining attention values to \([0,1]\), TAB enables precise attention visualization, ground-truth-supervised training, and test-time user editing intervention in VLMs. On change captioning tasks, it establishes for the first time a causal relationship between attention values and VLM outputs.
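
A minimal sketch of such a bottleneck layer (sigmoid is used here to keep attention values in \([0,1]\); the paper's exact normalization may differ):

```python
import torch
import torch.nn as nn

class AttentionBottleneck(nn.Module):
    """Minimal sketch of a TAB-style single-head co-attention bottleneck.

    There is no residual/skip connection and the attention map lies in
    [0, 1], so the output is entirely mediated by the attention values,
    which can therefore be visualized, supervised, or edited at test time.
    """
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, text, vision, attn_override=None):
        logits = self.q(text) @ self.k(vision).transpose(-2, -1)
        attn = torch.sigmoid(logits / vision.shape[-1] ** 0.5)
        if attn_override is not None:   # test-time user intervention
            attn = attn_override
        return attn @ self.v(vision)    # no skip connection
```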

Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown

Using Monster Hunter: World as a testbed, this paper constructs a multimodal knowledge graph (MH-MMKG) containing text, images, video, and complex entity relations, designs 238 complex queries along with a multi-agent knowledge retrieval method, and reveals the inadequacy of current MLLMs in domain-specific knowledge retrieval and reasoning tasks.

The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models

This paper proposes the Inter-Intra Modal Measure (IIMM)—a metric that requires only a single forward pass to predict both the performance gain and the degree of catastrophic forgetting following fine-tuning of vision-language dual-encoder models. By quantifying intra-modal image embedding similarity and inter-modal misaligned label alignment, IIMM demonstrates strong linear predictive power (\(R^2 > 0.85\)) across 4 foundation models and 5 fine-tuning strategies.
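
A minimal sketch of an IIMM-style computation (this formula is an assumed simplification for illustration, not the paper's exact definition):

```python
import torch
import torch.nn.functional as F

def iimm(image_emb, label_emb, labels):
    """Sketch of an Inter-Intra Modal Measure from a single forward pass.

    Intra term: mean pairwise cosine similarity between image embeddings.
    Inter term: mean similarity of each image to the *wrong* class labels.
    `labels` holds each image's ground-truth class index.
    """
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(label_emb, dim=-1)

    sim_ii = img @ img.T                              # (n, n) image-image
    n = img.shape[0]
    intra = (sim_ii.sum() - sim_ii.trace()) / (n * (n - 1))

    sim_it = img @ txt.T                              # (n, n_classes) image-text
    wrong = torch.ones_like(sim_it, dtype=torch.bool)
    wrong[torch.arange(n), labels] = False            # mask out true labels
    inter = sim_it[wrong].mean()

    return (intra + inter) / 2
```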

ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

This paper proposes ToolVQA — a large-scale multimodal tool-augmented VQA dataset containing 23K samples. It is automatically constructed via the ToolEngine pipeline, which combines image-guided DFS with LCS-based example matching, to generate multi-step reasoning data in realistic scenarios. LLaVA-7B fine-tuned on this dataset surpasses GPT-3.5-Turbo on 5 OOD benchmarks.