🧩 Multimodal VLM

📷 CVPR2025 · 136 paper notes

🔥 Top topics: Multimodal/VLM ×83 · LLM ×14 · Alignment/RLHF ×10 · Few-/Zero-Shot Learning ×6 · Robotics ×4

4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models: This paper proposes 4D LangSplat, which constructs a 4D language field by leveraging multimodal large language models (MLLMs) to generate object-wise video captions. Combined with a status deformable network to model the temporally continuous evolution of semantics, it achieves the first time-sensitive and time-agnostic open-vocabulary queries in dynamic scenes.
Active Data Curation Effectively Distills Large-Scale Multimodal Models: Proposes ACID (Active data Curation as Implicit Distillation) and ACED (combined with explicit distillation), demonstrating that actively filtering training data using a larger model as a reference is a more effective multimodal model compression approach than traditional knowledge distillation. Combining the two complementarily achieves SOTA performance on 27 zero-shot tasks with fewer inference FLOPs.
ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding: This paper proposes the ASAP framework, which systematically advances image-text semantic alignment to improve multi-modal manipulation detection and grounding performance through three core modules: Large Model-Assisted Alignment (LMA), Manipulation-Guided Cross-Attention (MGCA), and Patch Manipulation Modeling (PMM). It achieves a 94.38% AUC and 76.52% text grounding F1 on the DGM4 benchmark, significantly outperforming existing methods.

Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning: This paper proposes the AiR (Augmenting discriminative Richness) framework, which uses a LoRA-fine-tuned Stable Diffusion model to generate synthetic images and construct an auxiliary classifier. Fusing this classifier with the text classifier extends the text-to-image matching paradigm of unsupervised prompt learning to image-to-image matching, significantly improving classification accuracy on challenging datasets such as fine-grained categorization and remote sensing.
Calico: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models: This paper proposes Calico, the first large vision-language model designed for part-level semantic co-segmentation. It establishes part-level semantic correspondence across multiple images using a Correspondence Extraction Module (CEM) and a Correspondence Adaptation Module (CAM) while fine-tuning only 0.3% of the parameters. On the newly constructed MixedParts benchmark it outperforms existing methods, with a 6.3% gain in mIoU and a 51.3% speedup in inference.
Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves?: This work systematically investigates the self-correction capabilities of VLMs in semantic grounding tasks. It reveals that intrinsic self-correction (without external feedback) actually degrades performance by 7 to 17 points. However, iterative correction guided by feedback from the same VLM acting as a binary verifier can improve performance by up to 8.4 percentage points, highlighting that feedback quality is the critical bottleneck for self-correction.
CodePercept: Code-Grounded Visual STEM Perception for MLLMs: Through scaling analysis, this work discovers that the true bottleneck of STEM visual reasoning is perception rather than reasoning, and proposes using executable Python code as a precise perceptual medium. By constructing the ICC-1M dataset (Image-Caption-Code triplets) for training, CodePercept-8B improves by $+3.0\%$ to $+12.3\%$ over Qwen3-VL-8B on STEM perception benchmarks.
CoLLM: A Large Language Model for Composed Image Retrieval: This work proposes CoLLM, a unified framework for Composed Image Retrieval (CIR) leveraging Large Language Models. By generating training triplets on-the-fly from image-caption pairs, producing joint multimodal embeddings with an LLM, and constructing a large-scale MTCIR dataset with 3.4 million samples, CoLLM achieves SOTA performance across multiple CIR benchmarks, with MTCIR yielding up to a 15% performance improvement.
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation: Addressing the core issues of poor narrative coherence and inconsistent entity styles in existing interleaved image-text datasets (such as MMC4/OBELICS), this work constructs the CoMM dataset (227K documents, 2.28M images). By targeting instructional content collection combined with a multi-perspective quality filtering strategy, it ensures text coherence, image consistency, and image-text alignment, while proposing four interleaved generation evaluation tasks.
Completion as Enhancement: A Degradation-Aware Selective Image Guided Network: Reformulates image enhancement as a "completion" paradigm, employing a degradation-aware selection mechanism to guide the network to focus on regions requiring enhancement, thereby avoiding over-processing of already clear areas.
Compositional Caching for Training-free Open-vocabulary Attribute Detection: ComCa proposes a training-free open-vocabulary attribute detection method. By leveraging web-scaled image databases and an LLM, the method constructs an auxiliary image cache labeled with soft attribute probabilities. During inference, it aggregates the similarities of cached images to enhance the VLM's attribute prediction capabilities, competing effectively with training-based methods without any additional training.
Context-Aware Multimodal Pretraining: This paper proposes LIxP (Language-Image Contextual Pretraining), which introduces a cross-attention contextualization mechanism into contrastive image-text pretraining. This significantly improves the metric-based few-shot adaptation capability of vision-language models without sacrificing zero-shot performance (achieving an average gain of over 5% across 21 downstream tasks and up to a 4x increase in sample efficiency).
Continual Learning with Vision-Language Models via Semantic-Geometry Preservation: This paper proposes the SeGP-CL framework, which precisely detects vulnerable regions at the semantic boundaries of old and new tasks using adversarial anchors via Dual-Targeted PGD (DPGD). By combining Anchor-guided Cross-modal Geometry Distillation (ACGD) and Text Semantic-Geometry Regularization (TSGR) to preserve the cross-modal geometric structure of VLMs, SeGP-CL achieves state-of-the-art (SOTA) performance on five continual learning benchmarks.
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts: This paper constructs COUNTS, a large-scale OOD dataset featuring 14 natural distribution shifts, over 222K samples, and more than 1.19 million bounding box annotations. It introduces two benchmarks, O(OD)² and OODG, to systematically evaluate the generalization capability of object detectors and multimodal large language models under distribution shifts, revealing that even GPT-4o only achieves a grounding accuracy of 56.7%.
Cropper: Vision-Language Model for Image Cropping through In-Context Learning: This paper proposes the Cropper framework, which is the first to leverage the in-context learning (ICL) capability of large vision-language models (VLMs) for image cropping. Through efficient prompt retrieval and feedback-based iterative crop refinement strategies, it significantly outperforms supervised state-of-the-art (SOTA) methods across three tasks—free cropping, subject-aware cropping, and aspect-ratio cropping—without requiring any training.
Cross-modal Information Flow in Multimodal Large Language Models: Through the "attention knockout" method, the flow path of visual and textual information in MLLMs is systematically traced, revealing that visual information integrates into linguistic representations in two stages (first global, then local), and eventually propagates from the question positions to the last position in middle layers to generate the answer.
Data Distributional Properties as Inductive Bias for Systematic Generalization: This work shows that manipulating only the distributional properties of training data (diversity, burstiness, and latent intervention) can induce systematic generalization in multimodal masked language models. Specifically, increasing attribute diversity boosts out-of-distribution (OOD) shape prediction accuracy from 0.6% to 90%, with no modifications to the model architecture or training strategy.
Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization: Targeting the modality bias issue in MLLMs (over-reliance on language priors or visual details), NaPO constructs a biased dataset, RLAIF-V-Bias, by masking modality information. It proposes a noise-aware preference optimization algorithm based on a negative Box-Cox transformation to achieve robust training on automatically constructed noisy data, yielding superior results in both debiasing and hallucination mitigation.
Distraction is All You Need for Multimodal Large Language Model Jailbreaking: Proposes the "distraction hypothesis": high-contrast composite inputs built from multiple sub-images increase visual complexity and produce out-of-distribution (OOD) effects. Combined with query decomposition and carefully designed benign instructions, this achieves black-box jailbreaking with attack success rates of 42-64% against closed-source MLLMs such as GPT-4o.
Docopilot: Improving Multimodal Models for Document-Level Understanding: This paper constructs Doc-750K, a high-quality, document-level multimodal dataset containing 758K question-answer pairs and 3.1M images. Based on this, the authors train Docopilot, a native document understanding model. It outperforms InternVL2-8B by 19.9 percentage points on MM-NIAH, processing multi-page documents efficiently without relying on RAG.
DocVLM: Make Your VLM an Efficient Reader: Proposes a model-agnostic OCR encoding module that compresses OCR-extracted text and layout information into 64 learned query tokens and injects them into a frozen VLM, significantly improving document understanding capabilities under extremely low visual token counts (up to +30.6 points on DocVQA) and generalizing zero-shot to multi-page documents.
DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models: This work proposes the Dual-Prompt Collaboration (DPC) framework. By freezing the original tuned prompt to maintain new-class generalization and training a parallel prompt to strengthen base-class performance, along with a weighted decoupled inference mechanism, DPC serves as a plug-and-play module that consistently improves the base-new harmonic mean across four prompt tuning baselines.
Dynamic Updates for Language Adaptation in Visual-Language Tracking: Proposes DUTrack, which resolves the semantic inconsistency between static references and dynamic targets in visual-language tracking by dynamically updating multi-modal reference information (template frames + language descriptions). It outperforms the best vision-only trackers on LaSOT for the first time.
DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution: Modeling the dynamic resolution mechanism of human "fixation + saccade", this work constructs multi-level nested views around target regions with random sampling during training and selective combination during inference based on task or image priors, outperforming 7B+ models on region captioning, attribute detection, and dense captioning with only 4.2B parameters.
Efficient Motion-Aware Video MLLM: This paper proposes EMA (Efficient Motion-Aware video MLLM), which utilizes the GOP structure in compressed videos to fuse spatial and motion information. By leveraging a native slow-fast architecture, it reduces redundancy while enhancing motion representation. Additionally, it introduces MotionBench as a motion understanding benchmark, achieving SOTA on various video QA and motion understanding tasks.
EgoLM: Multi-Modal Language Model of Egocentric Motions: This work proposes a unified multimodal language model framework that integrates egocentric motion tracking (sparse sensors $\rightarrow$ full-body motion) and motion understanding (motion $\rightarrow$ language description). By combining a VQ-VAE motion tokenizer and a GPT-2 backbone, the framework jointly models four modalities (text, motion tokens, sensors, and video). Incorporating egocentric video reduces tracking errors by 10-20mm.
Embodied Scene Understanding for Vision Language Models via MetaVQA: A large-scale VQA benchmark (4.3 million questions) based on Set-of-Mark annotations and scene graphs is constructed to systematically evaluate the spatial reasoning and embodied understanding capabilities of VLMs. It demonstrates that fine-tuning on MetaVQA significantly improves spatial reasoning (+28 points), and the capabilities learned from simulator data successfully transfer zero-shot to real-world scenarios and unseen closed-loop driving tasks.
Evaluating Model Perception of Color Illusions in Photorealistic Scenes: This paper proposes an automated framework to generate the RCID dataset containing 19,000 photorealistic color illusion images, systematically revealing for the first time that VLMs indeed exhibit human-like color perception biases, and employs a mixed-training approach to enable models to simultaneously understand both human perception and ground-truth pixel values.
Evaluating Vision-Language Models as Evaluators in Path Planning: This paper introduces the PathEval benchmark to systematically evaluate how well Vision-Language Models (VLMs) serve as path planning evaluators. It finds that although VLMs can abstract features of the optimal path from scene descriptions, their visual components suffer from a severe bottleneck in perceiving low-level details of the path. End-to-end fine-tuning cannot effectively address this issue, so task-specific discriminative adaptation of the visual encoder is needed.
EventGPT: Event Stream Understanding with Multimodal Large Language Models: The first MLLM specifically designed for event camera streams. By employing a three-stage progressive training paradigm (vision-language alignment $\to$ event-language alignment $\to$ instruction tuning), it bridges the massive domain gap between asynchronous event data and language, substantially outperforming general MLLMs in event scene description and VQA.
Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond: Uses SAM's semantic priors to enhance infrared-visible image fusion via a persistent attention module, then transfers the semantic knowledge to an ultra-lightweight student sub-network of only 0.136M parameters through knowledge distillation with bi-level optimization. This achieves SAM-free inference in 10.47ms while outperforming all dedicated fusion methods by more than 3 mIoU on segmentation tasks.
FastVLM: Efficient Vision Encoding for Vision Language Models: Proposes FastViTHD, a hybrid convolution-transformer vision encoder that achieves $32\times$ spatial downsampling through a 5-stage architecture. At comparable accuracy, it generates $16\times$ fewer vision tokens and encodes $3.7\times$ faster than ViT-L/14, cutting time-to-first-token (TTFT) by up to a factor of 85.
Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation: By changing the explanation target of CAM from a single-class logit $y^c$ to the contrastive difference between classes $y^c - \gamma \cdot y^d$ (the logit difference between the target class and a similar class), Finer-CAM upgrades any CAM method into a fine-grained version with zero additional parameters, refining the activation maps from "overall silhouettes" to "discriminative local details".
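To make the Finer-CAM target concrete, here is a minimal sketch on a toy linear classification head. It assumes a CAM-style map computed as a ReLU over weighted channel activations; the function names and the toy numbers are illustrative assumptions and do not come from the paper's code.

```python
def contrastive_target(logits, target, similar, gamma=1.0):
    """Contrastive explanation target y^c - gamma * y^d."""
    return logits[target] - gamma * logits[similar]


def linear_head_channel_weights(head_weights, target, similar, gamma=1.0):
    """Gradient of y^c - gamma * y^d w.r.t. pooled features for a linear head.

    For logits = W f, the gradient is W[c] - gamma * W[d].
    Setting gamma = 0 recovers the usual single-class weights.
    """
    return [wc - gamma * wd
            for wc, wd in zip(head_weights[target], head_weights[similar])]


def cam(activations, channel_weights):
    """ReLU of the weighted channel sum at each spatial position.

    activations: one flat list of spatial values per channel.
    """
    n_positions = len(activations[0])
    heatmap = []
    for p in range(n_positions):
        s = sum(w * channel[p] for w, channel in zip(channel_weights, activations))
        heatmap.append(max(0.0, s))
    return heatmap


# Toy setup (hypothetical): channel 0 fires on the whole object and is shared
# by both classes; channel 1 fires only on the discriminative part.
head = [[1.0, 1.0],   # class 0 uses both channels
        [1.0, 0.0]]   # similar class 1 uses only the shared channel
acts = [[1.0, 1.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 0.0]]

plain = cam(acts, linear_head_channel_weights(head, 0, 1, gamma=0.0))
finer = cam(acts, linear_head_channel_weights(head, 0, 1, gamma=1.0))
```

In this toy case the plain map covers every position, while the contrastive map keeps only the position where the two classes differ, which mirrors the "silhouette versus discriminative detail" contrast described in the entry.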
FLAIR: VLM with Fine-grained Language-informed Image Representations: This work proposes text-conditioned attention pooling, which uses text embeddings as queries to adaptively aggregate relevant visual information from local image tokens. Trained on only 30M synthetic caption data, it significantly outperforms SigLIP/OpenCLIP trained on billions of data in fine-grained retrieval and zero-shot segmentation.
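The text-conditioned pooling idea in FLAIR can be illustrated with a single-head attention pool in which the text embedding acts as the query over local image tokens. This is a simplified sketch: the real model presumably uses learned projections and multiple heads, and the names below are assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def text_conditioned_pool(text_query, image_tokens):
    """Pool image tokens with attention weights derived from a text query.

    Returns the pooled vector and the attention weights per image token.
    """
    d = len(text_query)
    scores = [sum(q * t for q, t in zip(text_query, token)) / math.sqrt(d)
              for token in image_tokens]
    weights = softmax(scores)
    pooled = [sum(w * token[j] for w, token in zip(weights, image_tokens))
              for j in range(d)]
    return pooled, weights


# Toy example: the query points along the first axis, so the first token
# should receive the largest weight.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
pooled, weights = text_conditioned_pool([4.0, 0.0], tokens)
```

Because the weights depend on the caption, the same image yields a different pooled representation for each text, which is what allows fine-grained matching against local regions.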
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion: This work replaces CLIP with the generative vision foundation model Florence-2 as the vision encoder for VLMs. Through "Depth-Breadth Fusion" (DBFusion), it integrates low-level DaViT features with high-level features from three task prompts (caption, OCR, and grounding), achieving performance that surpasses multi-encoder approaches using only a single encoder with 576 tokens.
Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM: FreeTTA proposes a training-free and storage-free test-time adaptation method that explicitly models the target domain distribution via an online EM algorithm. By leveraging CLIP zero-shot predictions as priors to iteratively estimate the Gaussian distribution parameters of each class, it consistently outperforms existing TTA methods across 15 datasets.
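The online EM idea behind FreeTTA can be sketched as follows. This is a deliberately reduced version that assumes unit isotropic covariance and updates only the class means; the class name, the pseudo-count initialization, and the toy data stream are assumptions, not details taken from the paper.

```python
import math


class OnlineGaussianEM:
    """Streaming estimate of per-class Gaussian means with unit variance."""

    def __init__(self, init_means):
        self.means = [list(m) for m in init_means]
        # One pseudo-count per class so the initial mean acts as a weak prior.
        self.counts = [1.0 for _ in init_means]

    def step(self, x, prior):
        """Process one test sample; prior is e.g. a zero-shot class distribution."""
        # E-step: posterior responsibility of each class for x.
        log_post = []
        for mean, p in zip(self.means, prior):
            sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
            log_post.append(math.log(max(p, 1e-12)) - 0.5 * sq_dist)
        m = max(log_post)
        unnorm = [math.exp(lp - m) for lp in log_post]
        z = sum(unnorm)
        resp = [u / z for u in unnorm]

        # M-step (online): responsibility-weighted running mean, no sample storage.
        for k, r in enumerate(resp):
            self.counts[k] += r
            rate = r / self.counts[k]
            self.means[k] = [mi + rate * (xi - mi)
                             for xi, mi in zip(x, self.means[k])]
        return resp


# Toy stream: two 1-D clusters near -2 and 3, with informative priors.
em = OnlineGaussianEM([[0.0], [1.0]])
for _ in range(50):
    em.step([-2.0], [0.9, 0.1])
    last_resp = em.step([3.0], [0.1, 0.9])
```

Each sample is seen once and then discarded, which matches the training-free and storage-free setting described in the entry.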
From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons: GEA adapts a pretrained multimodal LLM (LLaVA-OneVision) to five major domains (manipulation, navigation, gaming, UI control, and planning) via a learned multi-embodiment action tokenizer. It first undergoes SFT using 2.2 million cross-domain expert trajectories, followed by fine-tuning with online PPO reinforcement learning, enabling a single model to outperform or match domain-specific models across multiple benchmarks.
Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding: Galaxy-Walker is proposed as the first geometry-aware vision-language model framework. By performing random walks across Euclidean, spherical, and hyperbolic spaces to generate Geometry Prompts, coupled with a Mixture-of-Geometry-Experts adapter (Geometry Adapter), it substantially outperforms general VLMs and domain-specific models on galaxy attribute estimation (with $R^2$ up to 0.91) and morphological classification tasks (with an F1 score improvement of +0.17).
Generalized Few-Shot 3D Point Cloud Segmentation with Vision-Language Model: GFS-VL proposes a generalized few-shot 3D point cloud segmentation framework that synergistically fuses dense but noisy pseudo-labels generated by a 3D Vision-Language Model (3D VLM) with precise but sparse few-shot annotations. Through prototype-guided pseudo-label selection, adaptive infilling, and novel-base mix augmentation, it achieves SOTA performance on both existing and newly established challenging benchmarks.
GENIUS: A Generative Framework for Universal Multimodal Search: The first universal generative multimodal retrieval framework, which encodes multimodal data into discrete IDs through modal-decoupled semantic quantization and utilizes an autoregressive decoder to directly generate target IDs from queries. It outperforms preceding generative methods by over 25 points on Flickr30K Text-to-Image retrieval, while reducing storage overhead by 99% compared to CLIP.
GeoMM: On Geodesic Perspective for Multi-Modal Learning: This work introduces geodesic distance into multimodal contrastive learning for the first time. By constructing a hierarchical graph structure, it efficiently calculates the manifold distance between samples to replace the traditional cosine distance. This enables more accurate mining of positive and negative sample relationships, improving performance in downstream tasks such as image-text retrieval and VQA.
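To see why a geodesic distance can differ from a direct distance in embedding space, the sketch below approximates manifold distance with shortest paths on a flat k-nearest-neighbor graph. Note that this swaps in plain Dijkstra on a single-level graph and Euclidean edge lengths; the hierarchical graph construction described in the entry is not reproduced, and all names are illustrative.

```python
import heapq
import math


def knn_graph(points, k):
    """Undirected k-nearest-neighbor graph with Euclidean edge lengths."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        neighbors = sorted((math.dist(points[i], points[j]), j)
                           for j in range(n) if j != i)
        for d, j in neighbors[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_distances(graph, source):
    """Dijkstra shortest-path distances from source to every node."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Samples on a unit semicircle: the two endpoints are close in a straight
# line but far apart when travelling along the curve.
points = [(math.cos(math.pi * i / 10), math.sin(math.pi * i / 10))
          for i in range(11)]
graph = knn_graph(points, k=2)
along_curve = geodesic_distances(graph, 0)[10]
straight_line = math.dist(points[0], points[10])
```

Here the path length along the curve is noticeably larger than the straight-line distance, so two samples that look like near neighbors under a direct metric can be separated once the data manifold is taken into account.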
Global-Local Tree Search in VLMs for 3D Indoor Scene Generation: Proposes a global-local tree search algorithm that leverages the spatial reasoning capabilities of VLMs. Through hierarchical scene representations and visual prompts from an emoji grid, it achieves high-quality 3D indoor scene layout generation, ranking first on average in user studies.
Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels: Ground-V is constructed as a dataset containing 500,000 instruction-segmentation pairs to systematically address five major challenges in real-world referring expression segmentation (hallucinated references, multi-object targeting, reasoning, multi-granularity, and part-level references). After training, the VLM achieves an N-Acc improvement of over 20% compared to the previous SOTA on gRefCOCO.
Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment: A novel vision-language alignment framework is proposed: by freezing pre-trained unimodal vision (DINOv2) and language (All-Roberta-Large) encoders and only training lightweight MLP projection layers to achieve multimodal alignment, it reaches or exceeds CLIP-level performance with a 20x reduction in data and a 65x reduction in compute.
HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator: This work proposes HEIE, a Hierarchical Explainable AIGC Image Implausibility Evaluator based on Multimodal Large Language Models (MLLMs). Through a CoT-driven trinity evaluator, it simultaneously outputs defect heatmaps, scores, and textual explanations. An Adaptive Hierarchical Implausibility Mapper is employed to achieve precise localization of both global and local defects, achieving state-of-the-art (SOTA) performance on the RichHF-18K and AbHuman datasets.
HiFICL: High-Fidelity In-Context Learning for Multimodal Tasks: Through a precise mathematical decomposition of the attention formula, this work reveals that the effect of ICL is inherently a query-dependent dynamic mixture of standard self-attention outputs and contextual values. Based on this insight, "virtual KV pairs" (via low-rank decomposition) are directly parameterized to simulate ICL with high fidelity. With only 2.2M parameters, this method outperforms MimIC/LoRA while training 7.5x faster.
HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios: HomeSafe-Bench is the first benchmark to evaluate VLMs on unsafe action detection in household scenarios (438 cases covering 6 functional areas), and proposes HD-Guard, a hierarchical streaming architecture that coordinates a lightweight FastBrain and a large-scale SlowBrain to achieve real-time safety monitoring.
Identifying and Mitigating Position Bias of Multi-image Vision-Language Models: This paper identifies a severe position bias in multi-image Large Vision-Language Models (LVLMs): open-source models place excessive emphasis on trailing images, while closed-source models neglect middle images. It proposes a training-free SoFt Attention (SoFA) method that mitigates this bias by linearly interpolating between causal attention and bidirectional attention across images, improving average accuracy by 2-3% across multiple benchmarks.
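The interpolation in SoFA can be pictured with a small sketch that blends causal and bidirectional attention weights using a single coefficient. It is an assumption that the blend is applied to the normalized attention weights and that each position stands for one image; the paper may apply the interpolation at a different point, and `sigma` is a hypothetical parameter name.

```python
import math


def attention_weights(scores, causal):
    """Row-wise softmax of a square score matrix, optionally causally masked."""
    n = len(scores)
    rows = []
    for i in range(n):
        allowed = [j for j in range(n) if (j <= i or not causal)]
        m = max(scores[i][j] for j in allowed)
        exps = {j: math.exp(scores[i][j] - m) for j in allowed}
        z = sum(exps.values())
        rows.append([exps.get(j, 0.0) / z for j in range(n)])
    return rows


def soft_attention(scores, sigma):
    """Blend: (1 - sigma) * causal weights + sigma * bidirectional weights."""
    causal = attention_weights(scores, causal=True)
    bidirectional = attention_weights(scores, causal=False)
    return [[(1.0 - sigma) * c + sigma * b for c, b in zip(c_row, b_row)]
            for c_row, b_row in zip(causal, bidirectional)]


# Three positions with identical scores: under pure causal attention the first
# position sees only itself, and the blend lets it partially see later ones.
uniform_scores = [[0.0] * 3 for _ in range(3)]
blended = soft_attention(uniform_scores, sigma=0.5)
```

With `sigma = 0` this reduces to ordinary causal attention, and with `sigma = 1` every position attends to all others, so the coefficient controls how much earlier images can draw on later ones.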
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models: Proposes a data synthesis method inspired by contrastive learning. It automatically generates similar image pairs containing subtle object differences along with their difference descriptions. After fine-tuning MLLMs on this data, it outperforms GPT-4V and Gemini on MMVP by 12 points, achieving an average improvement of 3.06% across 8 general MLLM benchmarks.
Improving Personalized Search with Regularized Low-Rank Parameter Updates: This paper proposes POLAR, which applies a rank-1 LoRA update with regularization to the value matrix of the last layer of the CLIP text encoder. With only a few samples, it learns personalized concepts while retaining general knowledge, outperforming previous text-inversion-based methods by 4% to 22% on the DeepFashion2 and ConCon-Chi benchmarks.
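A rank-1 low-rank update of the kind POLAR describes changes a weight matrix by an outer product of two vectors. The sketch below shows that update together with a simple squared-norm penalty; the specific regularizer is an assumption chosen for illustration, since the entry does not state which one the paper uses.

```python
def rank1_update(weight, u, v):
    """Return W + u v^T for a weight matrix stored as a list of rows."""
    return [[weight[i][j] + u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]


def update_penalty(u, v, strength):
    """Squared Frobenius norm of u v^T, which equals ||u||^2 * ||v||^2."""
    return strength * sum(a * a for a in u) * sum(b * b for b in v)


# Toy 2x2 value matrix with a small rank-1 personalization update.
W = [[1.0, 0.0],
     [0.0, 1.0]]
u = [1.0, 2.0]
v = [0.5, 0.0]
W_personal = rank1_update(W, u, v)
penalty = update_penalty(u, v, strength=0.1)
```

Only `len(u) + len(v)` numbers are learned per concept, and penalizing the size of the update keeps the adapted matrix close to the original, which is one plausible way to retain general knowledge.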
Instruction-based Image Manipulation by Watching How Things Move: This paper proposes InstructMove, which constructs a large-scale real-world image editing dataset by sampling frame pairs from videos and generating editing instructions using multimodal large language models (MLLMs). Combined with a spatial conditioning strategy to fine-tune T2I models, it achieves SOTA performance on non-rigid editing tasks such as pose adjustment and viewpoint transformation.
It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data: This paper presents the first systematic study on the feasibility of "blind matching" using only the pairwise internal distances within the respective vision and language embedding spaces, in the complete absence of parallel data. It proposes a factored Hahn-Grant QAP solver (reducing memory complexity from $O(N^4)$ to $O(N^3)$) and demonstrates the feasibility of this matching through large-scale experiments involving 33 vision models $\times$ 27 language models, even achieving unsupervised image classification.
Joint Vision-Language Social Bias Removal for CLIP: This paper reveals the "over-debiasing" problem caused by inconsistent bias distributions in image and text modalities within CLIP. It proposes a joint framework of dual-modal bias alignment and counterfactual debiasing. While effectively reducing gender, age, and racial biases, it preserves vision-language alignment capabilities and designs the ABLE metric to comprehensively evaluate both debiasing performance and downstream capabilities.
LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant: Transforms a generative Large Multimodal Model (LMM) into a general multimodal retriever and reranker. It adds lightweight LoRA modules and trains them with a two-stage process (language pre-training and multimodal instruction tuning) along with joint pointwise/listwise reranking training. The resulting model significantly outperforms dual-encoder approaches across 16 retrieval tasks and shows strong generalization on 10 unseen datasets.
LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models: This paper proposes LayoutVLM, which leverages the semantic knowledge of VLMs to generate a dual scene layout representation containing numerical pose estimations and spatial relation constraints. By jointly optimizing semantic objectives and physical plausibility constraints via differentiable optimization, it significantly outperforms existing methods across 11 room types.
LLaVA-Critic: Learning to Evaluate Multimodal Models: LLaVA-Critic is the first open-source general-purpose multimodal evaluation model. By training on a carefully constructed 113k evaluation instruction dataset, it endows open-source LMMs with pointwise scoring and pairwise ranking capabilities close to the level of GPT-4o. It can also act as a reward model to provide effective preference signals for iterative DPO, surpassing the LLaVA-RLHF reward model trained on human feedback.
Locality-Aware Zero-Shot Human-Object Interaction Detection: This paper proposes the LAIN framework, which enhances the local fine-grained perception and interaction reasoning capabilities of CLIP representations through Locality Adapters (LA) and Interaction Adapters (IA), achieving state-of-the-art performance across various zero-shot HOI detection settings.
MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures: This paper proposes MarkushGrapher, a multimodal approach that recognizes Markush structures (chemical structure templates) in patent documents by jointly encoding text, image, and layout information. It also constructs M2S, the first real-world annotated benchmark for Markush structures, outperforming SOTA chemical-specific and general vision-language models under most evaluation settings.
MARTEN: Visual Question Answering with Mask Generation for Multi-Modal Document Understanding: The VQAMask pre-training paradigm is proposed, which introduces an auxiliary mask generation task (discarded during inference) on top of VQA text parsing. Through explicit spatial alignment supervision, it enhances the vision encoder's perception of text regions in document images. The resulting Marten model achieves state-of-the-art (SOTA) performance among 8B-level MLLMs across multiple document understanding tasks.
Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning: This work constructs D-Negation, the first visual grounding dataset containing positive and negative semantic descriptions, and proposes the Grouped Opposition-Based Learning (GOBL) fine-tuning mechanism to significantly enhance the grounding model's understanding of negative semantics via oppositional semantic constraints.
Mimic In-Context Learning for Multimodal Tasks: This paper mathematically analyzes the "shifting effect" of in-context demonstrations (ICDs) on self-attention in ICL. It proposes the MimIC method, which simulates ICL behavior by inserting a learnable shift vector and a query-dependent scaling factor into each attention head. With only 0.26M parameters, MimIC outperforms 32-shot ICL and all existing shift vector methods on VQA and image captioning tasks.
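The "shifting effect" has a simple algebraic core that can be checked numerically: softmax attention over demonstration tokens concatenated with the query's own tokens equals the query-only attention output plus a query-dependent fraction of the gap to the demonstration-only output. The sketch below verifies that identity for one head with unscaled dot-product scores; it does not implement MimIC's learnable approximation, and the function names are illustrative.

```python
import math


def attend(query, keys, values):
    """Single-head softmax attention; also returns the softmax normalizer."""
    scores = [math.exp(sum(a * b for a, b in zip(query, k))) for k in keys]
    z = sum(scores)
    dim = len(values[0])
    out = [sum(s * v[j] for s, v in zip(scores, values)) / z for j in range(dim)]
    return out, z


def icl_as_shift(query, demo_keys, demo_values, own_keys, own_values):
    """Rewrite attention with demonstrations as own output plus a scaled shift."""
    demo_out, z_demo = attend(query, demo_keys, demo_values)
    own_out, z_own = attend(query, own_keys, own_values)
    mu = z_demo / (z_demo + z_own)  # query-dependent scaling factor
    shifted = [o + mu * (d - o) for o, d in zip(own_out, demo_out)]
    return shifted, mu


q = [0.3, -0.2]
demo_k = [[0.5, 0.1], [-0.4, 0.9]]
demo_v = [[1.0, 0.0], [0.0, 2.0]]
own_k = [[0.2, 0.2], [0.7, -0.3], [0.0, 0.1]]
own_v = [[0.5, 0.5], [1.5, -1.0], [0.0, 1.0]]

shifted, mu = icl_as_shift(q, demo_k, demo_v, own_k, own_v)
full, _ = attend(q, demo_k + own_k, demo_v + own_v)
```

Since the two results agree, replacing the demonstration term with a learned vector and a learned function of the query is a natural way to imitate in-context demonstrations without actually placing them in the context.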
MIMO: A Medical Vision Language Model with Visual Referring Multimodal Input and Pixel Grounding Multimodal Output: This paper proposes MIMO, the first medical vision-language model that simultaneously supports "visual referring multimodal input" (users specify regions of interest via points/boxes) and "pixel-level grounding multimodal output" (the model embeds segmentation masks within textual answers). It constructs the MIMOSeg dataset with 895K samples and demonstrates unique referring + grounding capabilities across various medical VQA and segmentation tasks.
MLLM-as-a-Judge for Image Safety without Human Labeling: Proposes the CLUE framework, which achieves zero-shot image safety judgment without human labeling through rule objectification, CLIP relevance scanning, precondition chain decomposition, and debiased token probability analysis, significantly outperforming baselines across multiple MLLMs.
MMRL: Multi-Modal Representation Learning for Vision-Language Models: MMRL proposes a shared, modality-agnostic learnable representation space that projects representation tokens into high-level layers of image and text encoders (preserving low-level generalization knowledge). Through a decoupled inference strategy (utilizing representation + class features for base classes, and only class features for novel classes), MMRL achieves an optimal balance between few-shot adaptation and generalization across 15 datasets, establishing a new SOTA in base-to-novel generalization.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models: This paper proposes the Molmo family of VLMs and the PixMo dataset. Without relying on any synthetic data from closed-source VLMs, the authors construct high-quality training data from scratch through new data collection methods (voice descriptions of images, interactive Q&A annotation, and 2D pointing annotation). The 72B model outperforms Claude 3.5 Sonnet and Gemini 1.5 Pro on academic benchmarks and human evaluations, ranking second only to GPT-4o.
Mosaic of Modalities: A Comprehensive Benchmark for Multimodal Graph Learning: This paper proposes MM-Graph, the first comprehensive graph learning benchmark that incorporates both textual and visual node attributes. Covering 7 real-world datasets of varying scales and 3 categories of graph tasks (link prediction, node classification, and knowledge graph completion), it systematically evaluates the impact of visual information on graph learning, revealing key findings such as "multimodal GNNs underperforming traditional GNNs" and "the crucial importance of feature alignment."
MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders: This paper proposes MoVE-KD—the first framework to fuse the strengths of multiple visual encoders (CLIP/EVA/ConvNeXt/SAM) into a single encoder from the perspective of knowledge distillation. It alleviates multi-teacher knowledge conflicts through Mixture-of-LoRA-Experts (MoLE), adaptively weights distilled tokens and teachers using CLIP $[CLS]$ attention, and achieves consistent improvements on LLaVA/LLaVA-NeXT.
Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices: This paper systematically studies two core problems of multi-layer visual feature fusion in multimodal LLMs: (1) how to select the most effective visual layers and (2) how to best fuse them into the language model. The study reveals that selecting one layer from each representation similarity stage and applying external direct fusion is the best practice.
A Two-Stage Progressive Pre-training using Multi-Modal Contrastive Masked Autoencoders: This paper proposes a progressive two-stage pre-training strategy. In the first stage, patch-level contrastive learning is used to align cross-modal representations of RGB and depth modalities. In the second stage, joint training of masked autoencoding, diffusion-inspired denoising, and feature distillation is conducted. This achieves a +1.3% mIoU improvement over Mask3D on ScanNet semantic segmentation and reaches SOTA performance on multiple RGB-D downstream tasks.
Multimodal Autoregressive Pre-training of Large Vision Encoders: Apple proposes the AIMV2 series of vision encoders, which pairs a ViT encoder with a multimodal autoregressive decoder—simultaneously generating raw image patches and text tokens as pre-training objectives. While maintaining a simple training pipeline, it achieves general-purpose performance across diverse tasks. AIMV2-3B reaches 89.5% on ImageNet frozen trunk evaluation and comprehensively outperforms CLIP and SigLIP on multimodal understanding benchmarks.
Multimodal OCR: Parse Anything from Documents: This work proposes the Multimodal OCR (MOCR) paradigm, which unifies the parsing of text and graphics (charts, icons, UI, etc.) in documents into a structured text representation (including SVG code). The 3B model achieves a SOTA score of 83.9 on olmOCR-Bench, outperforming Gemini 3 Pro in graphics parsing.
NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval: This paper proposes NeighborRetr, which addresses the hubness problem (where a few samples dominate nearest neighbors) in cross-modal retrieval through a triple mechanism: centrality-weighted loss (reducing training weights of hub samples), neighborhood adjustment loss (distinguishing between good/bad hubs), and uniform regularization (ensuring each sample is retrieved fairly). It achieves 49.5% (+0.9% over SOTA) R@1 on MSR-VTT text-to-video retrieval.
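To make the hubness problem concrete, the sketch below counts how often each gallery item appears among the top-k neighbors of the queries (the usual "k-occurrence" statistic). The function name `k_occurrence` and the toy similarity matrix are illustrative assumptions, not taken from NeighborRetr.

```python
def k_occurrence(sim, k):
    """Count how often each gallery item lands in a query's top-k list.

    sim[i][j] is the similarity between query i and gallery item j.
    Items with a count far above k are "hubs".
    """
    counts = [0] * len(sim[0])
    for row in sim:
        # Indices of the k most similar gallery items for this query.
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        for j in top:
            counts[j] += 1
    return counts


# Toy example (hypothetical numbers): gallery item 0 is close to every query.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.7, 0.2, 0.6],
]
hub_counts = k_occurrence(toy_sim, k=1)
print(hub_counts)
```

In this toy case a single item captures every top-1 slot, which is the kind of imbalance the centrality-weighted loss and uniform regularization are meant to reduce.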
NLPrompt: Noise-Label Prompt Learning for Vision-Language Models: This paper discovers that simply replacing the CE loss with MAE loss in CLIP prompt learning can significantly improve robustness against noisy labels, which is theoretically proven via feature learning theory. Building on this, the authors propose NLPrompt—a method that combines an Optimal Transport-based data purification (PromptOT) to split the dataset into clean and noisy subsets, which are then trained using CE and MAE losses respectively, outperforming existing methods by a wide margin under various noise settings.
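The robustness claim for MAE can be illustrated numerically: for a one-hot target, the MAE between the softmax output and the label never exceeds 2, while CE grows without bound as the model becomes confident in a different class. The snippet below is a minimal sketch with made-up logits; it is not the NLPrompt training code and does not include PromptOT.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_loss(logits, label):
    """Cross-entropy against a one-hot label."""
    return -math.log(softmax(logits)[label])


def mae_loss(logits, label):
    """L1 distance between softmax output and the one-hot label."""
    probs = softmax(logits)
    return sum(abs(p - (1.0 if i == label else 0.0)) for i, p in enumerate(probs))


# Hypothetical case: the model is confident in class 0, but the (noisy) label says class 1.
confident_logits = [8.0, 0.0, 0.0]
noisy_label = 1
print("CE :", ce_loss(confident_logits, noisy_label))
print("MAE:", mae_loss(confident_logits, noisy_label))
```

Because the MAE penalty saturates, a mislabeled sample contributes a bounded loss, which is one intuition for why it is less sensitive to label noise than CE.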
NVILA: Efficient Frontier Visual Language Models: NVILA proposes the "Scale-then-Compress" paradigm. By scaling up spatial and temporal resolutions and subsequently compressing visual tokens, it maintains or even surpasses SOTA accuracy while reducing training costs by 1.9-5.1x, prefill latency by 1.6-2.2x, and decoding latency by 1.2-2.8x.
On the Out-of-Distribution Generalization of Multimodal Large Language Models: This paper systematically evaluates the out-of-distribution (OOD) generalization capabilities of 14 MLLMs across 20 datasets, finding that MLLMs perform near-randomly on domain-specific data such as medical and molecular imaging. Through a three-hypothesis analysis, "semantic-visual mapping deficits" are identified as the primary cause. Additionally, the study demonstrates that In-Context Learning (ICL) significantly mitigates this issue but remains sensitive to label shifts and spurious correlation shifts.
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation: This paper proposes the OpenING benchmark (5,400 human-annotated instances, 56 real-world tasks) and the IntJudge evaluation model (82.42% agreement rate with human judgments), filling the vacuum in open-ended interleaved image-text generation evaluation. It finds that current integrated pipelines (e.g., Gemini+Flux) significantly outperform end-to-end models, yet all methods still fall far short of human annotation quality.
Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy: This paper proposes Optimus-2, which utilizes MLLMs for high-level planning combined with a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. Within this framework, GOAP models the causal relationship between observations and actions using an Action-guided Behavior Encoder, and aligns behavior tokens with language instructions using an MLLM. It achieves average improvements of 27% on Minecraft atomic tasks, 10% on long-horizon tasks, and 18% on open-ended instruction tasks.
PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models: Proposes the PARC framework. Through the three pillars of 11 linguistic/visual prompt variations, reliability scoring, and metric calibration, the framework systematically quantifies and analyzes the prompt sensitivity of 22 VLMs across 7 datasets for the first time. The findings show that VLMs inherit the linguistic sensitivity of LLMs and exhibit symmetric behavior in the visual domain, with the InternVL2 family being the most robust to prompt changes.
PEACE: Empowering Geologic Map Holistic Understanding with MLLMs: This paper constructs the first geologic map understanding benchmark, GeoMap-Bench (covering 5 capabilities, 25 tasks, and 3,864 questions), and proposes GeoMap-Agent (hierarchical information extraction + domain knowledge injection + enhanced QA), which significantly outperforms GPT-4o (scoring 0.811 overall vs. 0.369) in geologic map understanding.
Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model: This paper proposes Period-LLM—the first MLLM equipped with period-perception capabilities. It adopts an "easy-to-hard" progressive training paradigm (text repetition $\rightarrow$ macro-periodic video $\rightarrow$ micro-periodic signals) paired with a "Resisting Logical Oblivion" (RLO) gradient optimization strategy, significantly outperforming existing MLLMs on cross-modal periodic tasks such as repetitive action counting and rPPG heart rate estimation.
Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy: The JOOD framework is proposed to jailbreak LLMs and MLLMs with a high success rate through black-box attacks. By transforming malicious inputs into out-of-distribution (OOD) formats (e.g., mixing image/text), it significantly increases model uncertainty and bypasses safety alignment safeguards.
Post-pre-training for Modality Alignment in Vision-Language Foundation Models: This paper proposes CLIP-Refine, a "post-pre-training" approach positioned between pre-training and fine-tuning. Using two key techniques, Random Feature Alignment (RaFA) and Hybrid Contrastive Distillation (HyCD), it narrows CLIP's modality gap and enhances zero-shot performance with only 1 epoch of training on a small dataset.
RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models: The RAP (Retrieval-Augmented Personalization) framework is proposed to achieve personalization in MLLMs via a three-step "Remember-Retrieve-Generate" pipeline. It stores user concepts in an external database, dynamically retrieves relevant concept information using a multimodal retriever, and injects it into the MLLM to generate personalized responses. Each concept requires only 1 image and its description, supporting real-time updates.
Realistic Test-Time Adaptation of Vision-Language Models: This paper reveals that existing test-time adaptation (TTA) / transductive methods for VLMs can severely damage the zero-shot robustness of CLIP in realistic scenarios (variable number of active classes, non-i.i.d. data streams). It proposes StatA, which introduces a KL-divergence regularization based on text encoder knowledge (statistical anchors) on the parameters of a Gaussian mixture model, maintaining stable improvements across all deployment scenarios.
Reasoning to Attend: Try to Understand How \<SEG> Token Works: This paper conducts an in-depth analysis of the working mechanism of the \<SEG> token in reasoning segmentation tasks, discovering that it learns semantic features similar to direct textual mentions for image-text semantic alignment. Based on this finding, the READ method is proposed to convert the similarity map between the \<SEG> token and image tokens into point prompts, guiding the SAM decoder to generate more precise segmentation masks in a plug-and-play manner.
Recognition-Synergistic Scene Text Editing: This work proposes RS-STE (Recognition-Synergistic Scene Text Editing), which unifies text recognition and text editing into a single multimodal parallel decoder. It leverages the recognition model's inherent ability to implicitly disentangle style and content to assist the editing process, and designs a cyclic self-supervised fine-tuning strategy to enable effective training on real-world data without paired annotations.
Relation-Rich Visual Document Generator for Visual Information Extraction: This paper proposes RIDGE, a relation-rich visual document generator. It leverages LLMs to generate hierarchically structured text content combined with self-supervised content-driven layout generation. By synthesizing document images annotated with entity categories and linkage relations, RIDGE significantly enhances the performance of VIE models across multiple benchmarks.
Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages: By analyzing the learning dynamics of PEFT in few-shot adaptation, this work discovers that the training process naturally divides into two stages: "task-level feature extraction" and "available class specialization". Accordingly, the authors propose 2SFS: first tuning LayerNorm to learn general/task-level features, and then training a linear classifier to enhance known-class discrimination. 2SFS matches or exceeds SOTA performance under both base-to-novel and all-to-all settings.
Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector: Proposes M2F2-Det, the first multimodal face forgery detector that simultaneously outputs deepfake detection scores and textual explanations. It adapts CLIP to learn forgery features via Forgery Prompt Learning, fuses CLIP and deepfake encoder features using a Bridge Adapter, and guides the LLM to generate trustworthy explanations using frequency-domain tokens.
Rethinking VLMs for Image Forgery Detection and Localization: Proposes IFDL-VLM and demonstrates that VLM priors contribute minimally to forgery detection/localization. By decoupling detection/localization from linguistic explanation in a two-stage framework, the method uses a ViT+SAM expert model for detection and localization, then feeds the generated localization mask to the VLM as an auxiliary input so that it learns to produce interpretable textual explanations.
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos: This paper proposes ReVisionLLM, the first vision-language model capable of temporal grounding in hour-long videos. It mimics human search strategies to recursively process videos by first coarsely localizing relevant segments and progressively refining them to precise temporal boundaries, outperforming the state-of-the-art on the MAD dataset by +2.6%.
Revisiting Model Stitching in the Foundation Model Era: This paper systematically studies the stitchability between heterogeneous Vision Foundation Models (e.g., CLIP, DINOv2, SigLIP 2), finding that pre-training the stitch layer with Final Feature Matching enables reliable stitching, and proposes the VFM Stitch Tree architecture to achieve efficient multi-VFM sharing.
RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness: RLAIF-V proposes an alignment framework entirely based on open-source MLLMs. It generates high-quality preference data using a deconfounded candidate response generation strategy and a divide-and-conquer feedback annotation method. When integrated with iterative DPO training and self-feedback inference-time scaling, the framework slashes the hallucination rate of a 7B model by 80.7% and enables a 12B model to surpass the trustworthiness of GPT-4V utilizing only its own feedback.
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics: RoboSpatial constructs a large-scale robotic spatial understanding dataset featuring 1M images, 5k 3D scans, and 3M spatial relation annotations. It leverages an automated pipeline to generate three categories of spatial QA pairs (spatial context, compatibility, and configuration) from existing 3D scene data and introduces three reference frames (ego-centric, world-centric, and object-centric). Training multiple 2D and 3D VLMs on this dataset significantly boosts spatial reasoning performance, with its effectiveness validated through real-world robotic manipulation experiments.
Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents: The MONDAY framework automatically generates mobile navigation datasets from YouTube tutorial videos. Through an OCR-based scene transition detection and a 3-step action recognition pipeline with GPT-4o, it constructs 313K annotated frames covering both iOS and Android platforms at 1/17th of the cost of manual annotation ($0.34 vs $5.76 per video). After pre-training, the agent achieves a performance gain of 18.11% on the unseen Windows Mobile platform.
Seeing the Abstract: Translating the Abstract Language for Vision Language Models: Proposes ACT (Abstract-to-Concrete Translator), which analyzes the representation discrepancy between abstract and concrete texts in the VLM latent space via PCA. During inference, ACT shifts the representation of abstract descriptions towards the concrete direction in a training-free manner, mitigating the VLM's insufficient understanding of abstract language and significantly outperforming fine-tuned models on text-to-image retrieval tasks in the fashion domain.
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories: SegAgent models referring expression segmentation as an iterative operation process of a human annotator—the MLLM observes the current mask state and predicts the next click location, according to which the interactive segmentation model updates the mask, obtaining the final segmentation result after multiple steps of iteration. It significantly improves segmentation accuracy in complex scenarios through the StaR+ policy improvement and PRM with tree search.
Self-Evolving Visual Concept Library using Vision-Language Critics: This paper proposes the Escher framework, which automatically evolves a visual concept library using an iterative loop consisting of a VLM as a critic and an LLM as a concept generator. This evolution improves the performance of concept bottleneck models in image classification, boosting LM4CV from 63.26% to 83.17% (+19.91%) on the CUB dataset.
Self-Supervised Spatial Correspondence Across Modalities: Extends the Contrastive Random Walk (CRW) framework to cross-modal pixel-level correspondence. By simultaneously learning intra-modal and inter-modal cycle-consistent feature representations, it achieves cross-modal dense matching for RGB-Depth, RGB-Thermal, Photo-Sketch, etc., without requiring paired annotations, significantly outperforming existing methods.
Single Domain Generalization for Few-Shot Counting via Universal Representation Matching: Proposes URM, the first single domain generalization model for few-shot counting. By distilling CLIP's universal vision-language representations into learnable prototypes to construct correlations, it significantly improves cross-domain generalization capability (reducing MAE by 27.5%) without sacrificing in-domain performance.
SketchAgent: Language-Driven Sequential Sketch Generation: Without any training or fine-tuning, SketchAgent achieves human-level sketch generation (reaching 85% of human Top-1 recognition rate) stroke-by-stroke using a grid-canvas coordinate system, in-context examples, and a Bézier curve fitting post-processing pipeline designed for pre-trained multimodal LLMs. It supports interactive collaborative drawing and conversational editing.
Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves: This paper reveals that prompt tuning with frozen VLM parameters neither facilitates knowledge transfer nor significantly improves efficiency (only reducing memory by 6% and time by 16%). It proposes Skip Tuning, which shortens the gradient propagation flow of full fine-tuning through Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip), achieving 15× speedup and 6.4× memory efficiency while delivering superior accuracy.
SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design: This paper constructs SldprtNet, a large-scale multimodal CAD dataset containing over 240k industrial parts. Each sample aligns four modalities: 3D models, multi-view images, parametric modeling scripts, and natural language descriptions. An encoder/decoder tool supporting 13 CAD operations is developed to achieve lossless bidirectional conversion. Experiments demonstrate that multimodal input significantly outperforms text-only input.
SmartCLIP: Modular Vision-language Alignment with Identification Guarantees: SmartCLIP achieves modular vision-language alignment by introducing an adaptive masking network, theoretically proving the identifiability of latent variables. It effectively addresses the issues of information misalignment and representation entanglement in CLIP training, significantly outperforming existing methods on various tasks such as long/short text retrieval and zero-shot classification.
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models: SPA-VL constructs a large-scale safety preference alignment dataset for VLMs containing 100,788 quadruplets (query, image, preferred response, dispreferred response), covering 6 domains, 13 categories, and 53 subcategories of harmful content. Based on diverse responses from 12 VLMs and a fully automated annotation pipeline, models trained with DPO/PPO achieve significant safety improvements while maintaining helpfulness.
SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs: This paper proposes the SPARROW framework, which addresses the challenges of poor temporal referential consistency and unstable first-frame initialization in video MLLMs through Target-Specific Tracking Features (TSF) and a dual-prompt (BOX+SEG) mechanism, achieving consistent improvements across 3 mainstream video MLLMs on 6 benchmarks.
StarVector: Generating Scalable Vector Graphics Code from Images and Text: This paper proposes StarVector, a multimodal large language model-based SVG generation framework that reformulates image vectorization as an inverse rendering and code generation task. By leveraging visual semantic understanding, it directly generates compact SVG code comprising rich primitive types (circles, polygons, text, etc.), establishing a new state-of-the-art (SOTA) across 10 datasets on 3 tasks.
BadVision: Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models: This study is the first to reveal the backdoor security threats of SSL vision encoders to LVLMs, and proposes BadVision. Through bi-level trigger optimization and a trigger-focusing backdoor learning mechanism, tampering only with the vision encoder can induce free-form visual hallucinations (ASR > 99%) in downstream LVLMs while bypassing SOTA detection methods.
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection: Constructs the first multimodal X-ray baggage security dataset STCray (46,642 image-description pairs, 21 threat classes including IEDs and 3D printed guns), designs the STING protocol to systematically generate domain-aware high-quality descriptions, and trains the domain-specific VLM STING-BEE, establishing new baselines in scene understanding, threat localization, visual grounding, and VQA, while demonstrating SOTA cross-domain generalization capabilities.
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation: This paper proposes SVLTA, a vision-language temporal alignment benchmark generated through a synthetic simulation environment. It contains 25.3K dynamic scenes, 96 compositional actions, and 77.1K high-quality temporal annotations with a controllable, compositional, and unbiased temporal distribution. Through three evaluation dimensions—temporal QA, sensitivity to distribution shifts, and temporal adaptation—it reveals a severe lack of temporal alignment capabilities in current VidLLMs (even the strongest GPT-4o achieves only 11.69% R@1 at IoU=0.5).
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization: SymDPO identifies the "visual context overlook" issue in multimodal ICL (where replacing demonstration images with blank images does not affect performance) and proposes replacing text answers in demonstrations with semantic-free random symbols. This forces the model to understand the visual content to correctly match symbols with answers. Through DPO training, this consistently improves multimodal ICL performance on OpenFlamingo and IDEFICS.
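For readers unfamiliar with the DPO objective that SymDPO builds on, here is a minimal sketch of the standard per-pair DPO loss computed from sequence log-probabilities. The argument names and the example numbers are illustrative assumptions; the symbol-substitution step that is specific to SymDPO is not shown.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # How much more the policy prefers the chosen response than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written in a numerically friendly form.
    return math.log1p(math.exp(-beta * margin))


# Hypothetical log-probabilities for one (chosen, rejected) pair.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy identical to reference
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)  # policy now favors the chosen answer
print(neutral, improved)
```

The loss starts at log 2 when the policy matches the reference and decreases as the policy raises the chosen response relative to the rejected one.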
Synthetic Data is an Elegant GIFT for Continual Vision-Language Models: GIFT uses Stable Diffusion to generate synthetic images from class names and performs knowledge distillation via contrastive distillation + image-text alignment constraints + adaptive weight consolidation. With only 1K synthetic images per task, this approach outperforms ZSCL, a continual learning method that uses 100K real ImageNet images.
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment: This paper proposes Task Preference Optimization (TPO), which integrates specialized vision task heads (region grounding, temporal grounding, and segmentation) into MLLMs via learnable task tokens. By leveraging vision task annotations as "task preferences" to backpropagate and optimize the MLLM, this approach significantly enhances fine-grained visual understanding without compromising conversation capabilities, achieving an average improvement of 14.6% over the VideoChat baseline.
Taxonomy-Aware Evaluation of Vision-Language Models: Proposes a taxonomy-aware VLM evaluation framework. By mapping the free-text output of VLMs onto a taxonomic tree, it utilizes hierarchical precision (hP) and hierarchical recall (hR) to quantify the correctness and specificity of predictions, solving the problem where traditional exact match and text similarity metrics fail to score "partially correct" answers.
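As a rough illustration of hierarchical precision and recall, the sketch below uses the common ancestor-set formulation: hP is the share of the prediction's ancestor path that is also on the ground truth's path, and hR is the share of the ground truth's path that the prediction covers. The toy taxonomy and helper names are assumptions for illustration, and the paper's exact definitions may differ.

```python
def ancestors(node, parent):
    """Return the node and all of its ancestors, excluding the root."""
    path = set()
    while node in parent:  # the root has no entry in `parent`
        path.add(node)
        node = parent[node]
    return path


def hierarchical_pr(pred, truth, parent):
    """Hierarchical precision and recall from ancestor-set overlap."""
    p, t = ancestors(pred, parent), ancestors(truth, parent)
    overlap = len(p & t)
    hp = overlap / len(p) if p else 0.0
    hr = overlap / len(t) if t else 0.0
    return hp, hr


# Toy taxonomy (hypothetical): child -> parent, with "entity" as the root.
toy_parent = {
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "beagle": "dog",
    "poodle": "dog",
}

# Correct but less specific than the ground truth "beagle".
print(hierarchical_pr("dog", "beagle", toy_parent))
# Wrong leaf, but partially correct at higher levels.
print(hierarchical_pr("poodle", "beagle", toy_parent))
```

A prediction of "dog" for a beagle keeps full precision but loses recall, which is the kind of partially correct answer that exact match would score as a plain miss.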
Teaching Large Language Models to Regress Accurate Image Quality Scores Using Score Distribution: Proposes DeQA-Score, which discretizes the Gaussian distribution of quality scores into soft labels (replacing Q-Align's one-hot labels), significantly reducing discretization information loss (by 10-35 times). It introduces a fidelity loss based on the Thurstone model to achieve joint training on multiple IQA datasets, comprehensively outperforming baseline models on score regression tasks.
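The idea of turning a Gaussian score distribution into soft labels can be sketched as follows. This is an assumption-laden illustration: it uses five rating levels centered at 1 to 5, assigns each level the Gaussian mass within half a unit of its center, and renormalizes. The actual discretization in DeQA-Score may differ.

```python
import math

LEVELS = (1, 2, 3, 4, 5)  # assumed rating levels


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def soft_label(mu, sigma, half_width=0.5):
    """Probability mass of N(mu, sigma^2) around each level, renormalized."""
    masses = [
        normal_cdf(c + half_width, mu, sigma) - normal_cdf(c - half_width, mu, sigma)
        for c in LEVELS
    ]
    total = sum(masses)
    return [m / total for m in masses]


def one_hot_label(mu):
    """Baseline: all mass on the level nearest to the mean score."""
    nearest = min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - mu))
    return [1.0 if i == nearest else 0.0 for i in range(len(LEVELS))]


def expected_score(label):
    return sum(p * c for p, c in zip(label, LEVELS))


# Hypothetical image with mean opinion score 3.4 and standard deviation 0.6.
soft = soft_label(3.4, 0.6)
hard = one_hot_label(3.4)
print("soft label:", [round(p, 3) for p in soft], "->", expected_score(soft))
print("one-hot   :", hard, "->", expected_score(hard))
```

The score recovered from the soft label stays close to the original mean, whereas the one-hot label collapses it to the nearest level, which is the information loss the soft labels are meant to reduce.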
Topo-R1: Detecting Topological Anomalies via Vision-Language Models: This work reveals that existing VLMs (including GPT-5.2 and Gemini-2.5) exhibit near-zero performance on topological anomaly detection (below 1.5% on the reported detection metric). It proposes the Topo-R1 framework, which endows VLMs with topological awareness via SFT + GRPO incorporating a topology-aware composite reward (integrating type-aware Hungarian matching and clDice), achieving a peak of 45.2% on the same metric.
Towards Understanding How Knowledge Evolves in Large Vision-Language Models: This study presents the first systematic analysis of the multimodal knowledge evolution process within LVLMs. It reveals a "critical layer-mutation layer" dual-node pattern of knowledge evolution across three levels: single-token probability, token probability distribution, and feature encoding. The evolution process is categorized into three stages: rapid evolution $\rightarrow$ stabilization $\rightarrow$ mutation, and deep-layer mutations are shown to be closely associated with hallucinations.
UNEM: UNrolled Generalized EM for Transductive Few-Shot Learning: UNEM is proposed to unroll each iteration of the Generalized EM (GEM) algorithm as a neural network layer. It automatically optimizes the class-balance hyperparameter $\lambda$ and temperature scaling $T$ through end-to-end learning. It achieves an average accuracy of 77.8% under the vision-language setting across 11 fine-grained datasets (vs. 73.6% of EM-Dirichlet) and up to a 10% gain under the vision-only setting.
Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly: Reveals the prevalent phenomenon in MLLMs of "understanding visual content but still giving incorrect answers", constructs the MMVU benchmark consisting of 12 categories of positive-negative question pairs, discovers that the root causes lie in training data bias towards positive samples and insufficient attention on visual tokens, and proposes a three-pronged solution: the MMVU-Train dataset (112K positive-negative pairs) + Content Guided Refinement (CGR) + Visual Attention Refinement (VAR).
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation: This paper proposes the UPME framework, which enables multiple MLLMs to generate questions and review each other using only image data through an unsupervised peer review mechanism, a vision-language scoring system, and dynamic weight optimization. It achieves a Pearson correlation of 0.944 with human evaluation on MMStar, effectively mitigating the reliance of MLLM evaluation on human annotations and addressing review bias.
V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents: This work introduces V-Stylist, a video stylization system based on multi-agent collaboration and reflection of MLLMs. By coordinating three agent roles—Video Parser (video shot segmentation), Style Parser (style tree search), and Style Artist (multi-round self-reflective rendering)—V-Stylist achieves state-of-the-art performance on complex transition videos and open-domain style descriptions, outperforming FRESCO by 6.05% on overall metrics.
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?: This paper proposes the VidComposition benchmark, specifically designed to evaluate MLLMs' composition understanding capabilities on compiled videos (movies, animations, etc.). It encompasses 5 major categories and 15 subtasks (shot motion, narrative structure, character understanding, etc.). Evaluation of 33 MLLMs reveals a huge gap between current models and humans in cinematic video understanding (the best model achieves $63.3\%$ vs. $86.3\%$ for humans).
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding: Video-XL leverages the internal KV sparsification capability of LLMs for long-video token compression. It introduces Visual Summarization Tokens (VST) that compress the visual information of each video segment into their KV entries, after which the original visual KV is offloaded. Combined with dynamic compression and curriculum learning, it processes 2048 frames on a single A100 and outperforms GPT-4o on MLVU Dev.
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos: VideoGLaMM is a video large multimodal model that achieves pixel-level fine-grained visual grounding in videos using a dual-visual encoder (spatial + temporal), tunable V→L and L→V adapters, and a spatiotemporal pixel decoder, while establishing the first 38K video-grounded QA dataset.
VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge: Proposes the VILA-M3 framework, which integrates knowledge from medical domain expert models (segmentation/classification) into a generalist VLM on-demand via a four-stage training scheme. It improves over the prior SOTA by ~9% on average across multiple medical benchmarks such as VQA, report generation, and classification, with a model scale significantly smaller than Med-Gemini (3B-40B vs 1.5T).
Vision-Language Model IP Protection via Prompt-based Learning: This paper proposes the IP-CLIP framework, which achieves VLM IP protection on a frozen CLIP backbone via lightweight IP-Prompt learning (domain tokens + image tokens) and a style-augmented memory branch. This allows the model to maintain high accuracy on the authorized domain while deliberately degrading performance on unauthorized domains, resulting in a 0% performance drop on the authorized domain.
Vision-Language Models Do Not Understand Negation: This paper proposes the NegBench benchmark to systematically reveal the severe deficiencies of vision-language models like CLIP in understanding negation (performing close to random-guess levels). By fine-tuning on a large-scale synthetic negation dataset, the retrieval recall of negation queries is improved by 10%, and the MCQ accuracy is boosted by up to 40%.
VisionArena: 230K Real World User-VLM Conversations with Preference Labels: VisionArena constructs a large-scale dataset containing 230K real-world user-VLM interaction records (including preference labels), covering 73K users, 45 VLMs, and 138 languages. It reveals the current limitations of VLMs in spatial reasoning and planning tasks, and demonstrates that fine-tuning on real dialogue data significantly outperforms LLaVA-Instruct.
VisionZip: Longer is Better but Not Necessary in Vision Language Models: VisionZip reveals significant redundancy in visual tokens generated by vision encoders (CLIP/SigLIP), where only a fraction of tokens aggregate the vast majority of attention and information. Based on this observation, a text-independent token selection and merging method is proposed, maintaining over 95% of model performance with only 10% of tokens while achieving an 8x pre-fill acceleration.
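A simplified picture of text-independent token selection and merging: keep the visual tokens that receive the most attention and fold the rest into a single averaged token. The sketch below is only an illustration under that assumption; the averaging merge and the function name `select_and_merge` are not taken from VisionZip, whose merging strategy is more elaborate.

```python
def select_and_merge(tokens, attention, keep):
    """Keep the `keep` most-attended tokens and average the remainder.

    tokens:    list of feature vectors (lists of floats), one per visual token
    attention: one attention score per token (e.g., from the vision encoder)
    """
    order = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    kept_idx = sorted(order[:keep])  # preserve the original spatial order
    rest_idx = order[keep:]
    kept = [tokens[i] for i in kept_idx]
    if rest_idx:
        dim = len(tokens[0])
        # Simplification: merge all remaining tokens into one mean token.
        merged = [sum(tokens[i][d] for i in rest_idx) / len(rest_idx) for d in range(dim)]
        kept.append(merged)
    return kept


# Toy example (hypothetical values): four 2-d tokens, two dominate the attention.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
toy_attention = [0.05, 0.6, 0.3, 0.05]
compressed = select_and_merge(toy_tokens, toy_attention, keep=2)
print(compressed)
```

Because the scores come from the vision encoder alone, the selection can run once per image regardless of the text prompt.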
Visual and Semantic Prompt Collaboration for Generalized Zero-Shot Learning: This paper proposes the Visual and Semantic Prompt Collaboration Network (VSPCN). By concurrently learning visual and semantic prompts in a pre-trained ViT and designing a weak-fusion-at-shallow-layers and strong-fusion-at-deep-layers mechanism, it efficiently adapts ViT to extract semantic-relevant discriminative visual features, achieving state-of-the-art performance on CUB, SUN, and AWA2 benchmarks.
VladVA: Discriminative Fine-tuning of LVLMs: The VladVA framework is proposed to transform generative LVLMs (LLaVA) into strong discriminative models via a hybrid short/long caption data strategy, joint training with contrastive and autoregressive losses, and parameter-efficient adaptation using soft prompting and LoRA. It substantially outperforms CLIP-based models and the 18B EVA-CLIP on image-text retrieval and compositional understanding benchmarks.
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models: VLsI proposes a natural language-based layer-wise distillation method. By introducing "verbalizers" in the intermediate layers of large and small VLMs to map features into the language space, combined with an adaptive layer matching strategy to align inference processes, VLsI enables 2B/7B small models to outperform GPT-4V by an average of 11.0%/17.4% on 10 VL benchmarks without any architectural modifications or parameter increases.
What's in the Image? A Deep-Dive into the Vision of Vision Language Models: This paper systematically analyzes the visual information processing mechanism of VLMs (InternVL2-76B and LLaVA-1.5-7B) through Attention Knockout experiments, revealing three key findings: (1) query text tokens act as global image describers that compress high-level visual information, (2) the middle layers (~25%) dominate the cross-modal information transfer while early and late layers contribute minimally, and (3) fine-grained object details are extracted from image tokens through spatial localization. Based on these findings, an Image Re-prompting application is proposed, which maintains 96% of VQA performance using only 5% of the image tokens.
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?: This paper identifies the phenomenon of "blind faith in text" in VLMs, where models systematically favor text (even when it is incorrect) when visual and textual inputs are inconsistent. By constructing a benchmark with three text variants (Match, Corruption, and Irrelevance), this work evaluates 10 VLMs, analyzes five influencing factors, demonstrates that SFT with text augmentation effectively mitigates this issue, and provides a theoretical explanation tracing the root cause to the imbalance between text-only and multimodal training data.
Your Large Vision-Language Model Only Needs a Few Attention Heads for Visual Grounding: This paper finds that frozen LVLMs naturally contain a small number of "localization heads" that consistently capture object locations corresponding to textual semantics. Using the attention maps of only 3 attention heads, its training-free visual grounding achieves 86.5% on RefCOCO val, outperforming the fine-tuned LISA-7B.
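
The last entry's idea of reading a location out of a few attention maps can be illustrated with a toy sketch. This is not the paper's procedure: the function names (`average_maps`, `attention_to_box`), the relative threshold rule, and the hand-written 4x4 maps are all assumptions made for illustration. It only shows how averaging a handful of per-head attention maps over image patches and thresholding the result can yield a bounding box without any training.

```python
from typing import List, Tuple

Grid = List[List[float]]  # one attention map: rows x cols over image patches


def average_maps(maps: List[Grid]) -> Grid:
    """Element-wise mean of several same-sized attention maps."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[r][c] for m in maps) / len(maps) for c in range(cols)]
        for r in range(rows)
    ]


def attention_to_box(
    maps: List[Grid], rel_threshold: float = 0.5
) -> Tuple[int, int, int, int]:
    """Return (row_min, col_min, row_max, col_max) in patch coordinates.

    Keeps patches whose averaged attention is at least rel_threshold * max.
    The threshold rule is an illustrative assumption, not the paper's method.
    """
    avg = average_maps(maps)
    peak = max(max(row) for row in avg)
    keep = [
        (r, c)
        for r, row in enumerate(avg)
        for c, value in enumerate(row)
        if value >= rel_threshold * peak
    ]
    rows = [r for r, _ in keep]
    cols = [c for _, c in keep]
    return min(rows), min(cols), max(rows), max(cols)


# Hypothetical maps from three heads that all focus on the central 2x2 patches.
head_a = [[0.0, 0.0, 0.0, 0.0],
          [0.0, 0.9, 0.8, 0.0],
          [0.0, 0.7, 0.9, 0.0],
          [0.0, 0.0, 0.0, 0.0]]
head_b = [[0.1, 0.1, 0.1, 0.1],
          [0.1, 0.8, 0.9, 0.1],
          [0.1, 0.9, 0.8, 0.1],
          [0.1, 0.1, 0.1, 0.1]]
head_c = [[0.05, 0.05, 0.05, 0.05],
          [0.05, 1.0, 0.7, 0.05],
          [0.05, 0.8, 1.0, 0.05],
          [0.05, 0.05, 0.05, 0.05]]

box = attention_to_box([head_a, head_b, head_c])
print(box)
```

In a real model the maps would come from the attention weights of selected heads between a text token and the image patch tokens, and the patch-level box would then be scaled to pixel coordinates; both steps are omitted here.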