🧩 Multimodal VLM

🎞️ ECCV2024 · 44 paper notes

🔥 Top topics: Multimodal/VLM ×26 · LLM ×8 · Adversarial Robustness ×3 · Few-/Zero-Shot Learning ×3 · Self-Supervised Learning ×2

A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis

This work constructs the CDDM dataset containing 137k crop disease images and 1 million question-answering pairs, and proposes a strategy to apply LoRA fine-tuning simultaneously to the vision encoder, adapter, and language model. This enables Qwen-VL-Chat and LLaVA to leap from single-digit accuracy to over 90% in crop disease diagnosis.
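
For readers unfamiliar with LoRA, the low-rank update it adds to a frozen weight matrix can be sketched as below. This is a generic illustration of the technique, not the CDDM authors' code: the helper names, the toy matrices, and the `alpha` scaling value are all assumptions made for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank.

    w: frozen weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)                      # rank = number of rows in A
    delta = matmul(b, a)            # low-rank update, shape (d_out, d_in)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy example (hypothetical values): a 2x2 frozen weight with a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]        # shape (1, 2)
b = [[0.5], [0.0]]      # shape (2, 1)
print(lora_effective_weight(w, a, b, alpha=2.0))  # [[2.0, 2.0], [0.0, 1.0]]
```

Only `a` and `b` would be trained; `w` stays frozen, which is what makes it affordable to attach adapters to several components at once.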

AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting

AdaShield comprises a carefully designed static defense prompt (AdaShield-S) and an LLM-based adaptive iterative optimization framework (AdaShield-A). Without fine-tuning MLLMs or training additional modules, it effectively defends against structure-based jailbreak attacks, reducing the attack success rate from over 75% to below 15% while maintaining normal task performance.

AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization

The AddressCLIP framework is proposed, which models the Image Address Localization (IAL) problem as an end-to-end vision-language alignment task through two core components: image-text alignment (contrastive learning of address and scene descriptions) and image-geography matching (manifold learning based on GPS distance). It achieves a Top-1 accuracy of up to 85.92% on three self-constructed IAL datasets.

ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling

This paper proposes to reformulate the visual attribute recognition problem as a sentence generation probability problem under an image-conditioned Prefix Language Model (PrefixLM). By replacing traditional "contrastive retrieval" with "generative retrieval", the model explicitly captures the conditional dependency between objects and attributes, significantly outperforming contrastive methods on both the VAW and the newly proposed VGARank datasets.

Attention Prompting on Image for Large Vision-Language Models

This paper proposes Attention Prompting on Image (API), which utilizes an auxiliary VLM (CLIP or LLaVA) to generate attention attribution maps based on text queries. These maps are overlaid as heatmaps onto the original image to guide the LVLM to focus on relevant regions. API improves LLaVA-1.5 by up to 3.8% on MM-Vet and is widely effective across various LVLMs, including GPT-4V.

BLINK: Multimodal Large Language Models Can See but Not Perceive

Introduces BLINK, a multimodal evaluation benchmark containing 14 classic computer vision perception tasks (3,807 multiple-choice questions) that humans can solve "in a blink" (95.7% accuracy). The strongest model, GPT-4V, reaches only 51.26%, just 13.17 percentage points above random guessing, revealing a severe deficiency of current MLLMs in core visual perception capabilities.

BRAVE: Broadening the Visual Encoding of Vision-Language Models

This paper systematically analyzes the impact of different visual encoders (CLIP, DINOv2, EVA-CLIP, etc.) on VLM performance, finding that no single encoder is optimal across all tasks. Based on this, the BRAVE method is proposed, which utilizes a lightweight MEQ-Former to fuse features from multiple frozen encoders into a compact representation. Consequently, it achieves SOTA results on captioning and VQA tasks with only 116M trainable parameters while significantly reducing visual hallucinations.

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

This paper proposes the CAT model, which captures fine-grained audio-visual features via a question-aware Clue Aggregator. Combined with a hybrid multimodal training strategy and an AI-assisted Vagueness-aware Direct Preference Optimization (ADPO) strategy, it significantly improves MLLM question-answering accuracy in dynamic audio-visual scenarios, achieving SOTA performance on multiple AVQA benchmarks.

CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts

From the perspective of causal generative models, this paper proposes CLAP (Contrastive Learning with Augmented Prompts). It trains a lightweight disentanglement network using text prompt augmentation and contrastive learning to separate content and style within CLIP pre-trained features. Trained solely on text, CLAP simultaneously improves representation quality for both image and text modalities, achieving consistent gains in zero-shot classification, few-shot classification, and adversarial robustness.

Dataset Growth (InfoGrowth)

InfoGrowth is proposed as an efficient online data cleaning and selection algorithm. By estimating the information gain of each sample through nearest neighbor search, it enables continuous dataset growth while maintaining cleanliness and diversity, outperforming full training on CC3M using only 1/6 of the data.
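
The general idea of scoring a new sample by its nearest neighbor can be sketched as follows. This is a simplified stand-in, not InfoGrowth's actual gain estimator: it assumes samples are already embedded as vectors, uses plain Euclidean distance to the closest kept sample as a proxy for information gain, and the `threshold` value is hypothetical.

```python
import math


def nearest_distance(x, kept):
    """Distance from x to its nearest neighbor among the kept samples."""
    return min(math.dist(x, k) for k in kept)


def grow_dataset(stream, threshold):
    """Keep a sample only if it is far enough from everything kept so far."""
    kept = []
    for x in stream:
        # A large nearest-neighbor distance is treated as high novelty (assumption).
        if not kept or nearest_distance(x, kept) > threshold:
            kept.append(x)
    return kept


# Toy 2-D embeddings: two near-duplicates are filtered out.
stream = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 1.05), (3.0, 0.0)]
print(grow_dataset(stream, threshold=0.5))  # [(0.0, 0.0), (1.0, 1.0), (3.0, 0.0)]
```

A brute-force scan is used here for clarity; an online system would normally replace it with an approximate nearest-neighbor index.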

Decoupling Common and Unique Representations for Multimodal Self-supervised Learning

Proposes DeCUR, which explicitly splits embedding dimensions into cross-modal common and modality-unique parts in multimodal self-supervised learning. Alignment of the common dimensions and decoupling of the unique dimensions are both driven by the cross-correlation matrix. Intramodal training is introduced to ensure that the unique dimensions learn meaningful information. DeCUR outperforms baselines like Barlow Twins and CLIP in three multimodal scenarios: SAR-Optical, RGB-DEM, and RGB-Depth.

Efficient Inference of Vision Instruction-Following Models with Elastic Cache

Elastic Cache is a KV cache management method tailored to multimodal instruction-following models. It applies an importance-based cache merging strategy (rather than eviction) in the instruction encoding stage and a fixed-point eviction strategy in the output generation stage. With this "one sequence, two policies" design, it supports efficient inference at any acceleration ratio, delivering a 78% actual speedup with a KV cache budget of only 20% (0.2) while maintaining generation quality.
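
To make the difference between merging and eviction concrete, here is a small sketch of importance-based merging. It is an illustration under stated assumptions rather than the paper's exact rule: the most important tokens are kept as anchors, every other token is averaged into the anchor nearest in sequence position, and the `importance` scores are hypothetical inputs.

```python
def merge_cache(values, importance, budget):
    """Compress per-token cache vectors down to `budget` entries by merging.

    The `budget` most important tokens are kept as anchors; each remaining
    token is averaged into the anchor closest to it in sequence position,
    so its information is blended in rather than discarded.
    """
    n = len(values)
    by_importance = sorted(range(n), key=lambda i: importance[i], reverse=True)
    anchors = sorted(by_importance[:budget])
    groups = {a: [values[a]] for a in anchors}
    for i in range(n):
        if i in groups:
            continue
        nearest = min(anchors, key=lambda a: abs(a - i))
        groups[nearest].append(values[i])
    # Average each group column-wise to get one merged vector per anchor.
    return [[sum(col) / len(groups[a]) for col in zip(*groups[a])] for a in anchors]


values = [[1.0, 1.0], [3.0, 3.0], [10.0, 0.0], [20.0, 0.0]]
importance = [0.9, 0.1, 0.8, 0.2]
print(merge_cache(values, importance, budget=2))  # [[2.0, 2.0], [15.0, 0.0]]
```

With pure eviction, tokens 1 and 3 would simply disappear; here they still influence the two retained entries.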

Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning

SpLIP is a bidirectional multimodal prompt learning framework built on frozen CLIP. By utilizing bidirectional knowledge exchange between vision and text encoders, an adaptive-margin triplet loss, and a conditional cross-modal jigsaw task, it achieves SOTA performance across three sketch retrieval settings: ZS-SBIR, GZS-SBIR, and FG-ZS-SBIR.

Elysium: Exploring Object-level Perception in Videos via MLLM

Elysium is proposed as an end-to-end trainable Multimodal Large Language Model (MLLM). By constructing a million-scale video object perception dataset (ElysiumTrack-1M) and designing a visual token compression network (T-Selector), it extends the object-level perception capability of MLLMs from static images to the video domain, supporting three major tasks: Single Object Tracking (SOT), Referring Single Object Tracking (RSOT), and Video Referring Expression Generation (Video-REG).

Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

ECSO (Eyes Closed, Safety On) is proposed: a training-free MLLM defense method that detects the safety of its own responses and adaptively converts images in unsafe queries into text descriptions, thereby restoring the intrinsic safety mechanism of pre-aligned LLMs. It achieves up to a 71.3% safety improvement on MM-SafetyBench without compromising general performance.

FlexAttention for Efficient High-Resolution Vision-Language Models

This paper proposes FlexAttention, which reduces computational costs by nearly 40% while maintaining or even exceeding the performance of existing high-resolution VLMs, achieved through dynamic high-resolution token selection based on attention maps and a hierarchical self-attention fusion mechanism.

FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models

For the first time, this work achieves open-set human motion synthesis without using any motion capture (MoCap) data, leveraging an MLLM (GPT-4V) as a keyframe designer and animator combined with physics-based motion tracking.

Genixer: Empowering Multimodal Large Language Model as a Powerful Data Generator

The Genixer data generation pipeline is proposed to train an MLLM itself as a data generator, automatically generating high-quality visual instruction tuning data without relying on GPT-4V. The generated 915K VQA and 350K REC samples improve the performance of LLaVA1.5 and Shikra, respectively, across multiple benchmarks.

Grounding Language Models for Visual Entity Recognition

AutoVER is proposed, which is the first method to apply Multimodal Large Language Models (MLLMs) to large-scale visual entity recognition. By integrating retrieval capability directly inside the MLLM, and combining contrastive training with trie-constrained decoding, it substantially outperforms prior methods like PaLI-17B on the Oven-Wiki benchmark.
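
Trie-constrained decoding in general can be sketched as follows. This is a generic illustration, not AutoVER's implementation: the word-level tokens, the entity list, and the toy scoring function are all hypothetical, and a real system would operate on the model's subword vocabulary and logits.

```python
def build_trie(sequences):
    """Build a nested-dict trie; the key None marks the end of a valid entity."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node[None] = {}
    return root


def allowed_next(trie, prefix):
    """Return the set of tokens that may legally follow `prefix`."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()
        node = node[tok]
    return set(node.keys())


def constrained_greedy(trie, score):
    """Greedy decoding restricted to trie paths; score(prefix, token) -> float."""
    prefix = []
    while True:
        options = allowed_next(trie, prefix)
        best = max(options, key=lambda t: score(prefix, t))
        if best is None:          # end-of-entity marker chosen
            return prefix
        prefix.append(best)


entities = [["golden", "retriever"], ["golden", "gate", "bridge"], ["labrador"]]
trie = build_trie(entities)

# Toy scores: the unconstrained favourite "poodle" is not a valid entity.
toy_scores = {"poodle": 9.0, "retriever": 3.0, "golden": 2.0, "gate": 1.0, "labrador": 1.0}
result = constrained_greedy(trie, lambda prefix, t: toy_scores.get(t, 0.0))
print(result)  # ['golden', 'retriever']
```

Because every step is limited to the children of the current trie node, the decoder can only ever emit a name that exists in the entity list.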

LoA-Trans: Enhancing Visual Grounding by Location-Aware Transformers

LoA-Trans proposes a location-aware query selection mechanism to generate multiple potential target locations as location-aware queries (instead of relying solely on the estimated center point), and introduces the TaskSyn network to achieve task collaboration between Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES) in the decoder, significantly improving the accuracy of visual grounding.

m&m's: A Benchmark to Evaluate Tool-Use for Multi-step Multi-modal Tasks

Proposes the m&m's benchmark, which contains 4K+ multi-step multi-modal tasks and 33 executable tools, to systematically evaluate the tool-use capability of 10 LLMs across different planning strategies (multi-step vs. step-by-step), plan formats (JSON vs. code), and feedback types (parsing/validation/execution), discovering that multi-step JSON planning coupled with feedback is the currently optimal design.

MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection

MarvelOVD is proposed to integrate the detector's context-awareness and background recognition capabilities into the pseudo-label generation and training pipeline of VLMs. By purifying noisy pseudo-labels online and adaptively reweighting training boxes, this framework significantly outperforms existing methods on COCO and LVIS.

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

This paper introduces MathVerse, a multimodal mathematical reasoning benchmark containing 2,612 visual math problems (transformed into 6 versions totaling 15K test samples). By systematically manipulating the allocation of information in text and images, MathVerse assesses whether MLLMs truly "understand" mathematical diagrams. The authors also propose a CoT evaluation strategy for fine-grained reasoning process scoring, revealing that most MLLMs rely heavily on text rather than visual diagrams for mathematical reasoning.

Merlin: Empowering Multimodal LLMs with Foresight Minds

Proposes a two-stage training paradigm consisting of Foresight Pre-Training (FPT) and Foresight Instruction-Tuning (FIT). By incorporating trajectory modeling, it empowers Multimodal Large Language Models (MLLMs) with "foresight thinking" capabilities, enabling them to predict future events and reason based on current observations.

Meta-Prompting for Automating Zero-Shot Visual Recognition with LLMs

This paper proposes MPVR (Meta-Prompting for Visual Recognition), which leverages a two-stage meta-prompting strategy to automatically generate diverse, class-specific VLM prompts, significantly improving the zero-shot recognition performance of models like CLIP without requiring manual design of LLM queries.

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Apple systematically ablates the three primary axes of MLLM construction (architecture, data, and training), deriving key design principles: Image Resolution > Model Size > Training Data; the choice of VL connector type has minimal impact; and the meticulous blending of caption, interleaved, and text-only data is crucial. This systematically constructed MM1 model family (ranging from 3B-30B dense models to up to 64B MoE models) achieves state-of-the-art performance in few-shot pre-training evaluations.

MMBench: Is Your Multi-modal Model an All-Around Player?

Proposes MMBench—a bilingual (English/Chinese) multimodal benchmark comprising 3,217 multiple-choice questions across 20 fine-grained ability dimensions, featuring a CircularEval evaluation strategy and an LLM-based choice extraction mechanism to significantly improve evaluation robustness and fairness.
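
The CircularEval idea, in which a question counts as solved only if the model answers correctly under every circular shift of the options, can be sketched as below. The `predict` callables and the sample question are hypothetical stand-ins for a real model and benchmark item.

```python
def circular_eval(choices, answer_idx, predict):
    """Return True only if `predict` is right under every rotation of `choices`.

    predict(shown_choices) -> index of the selected option in the shown list.
    """
    n = len(choices)
    for shift in range(n):
        rotated = choices[shift:] + choices[:shift]
        # The correct option moves left by `shift` positions (wrapping around).
        correct_pos = (answer_idx - shift) % n
        if predict(rotated) != correct_pos:
            return False
    return True


choices = ["Berlin", "Paris", "Rome", "Madrid"]

# A model that actually recognises the answer passes every rotation.
knows_answer = lambda shown: shown.index("Paris")
# A position-biased model that always picks the first option does not.
always_first = lambda shown: 0

print(circular_eval(choices, 1, knows_answer))  # True
print(circular_eval(choices, 1, always_first))  # False
```

The position-biased model is right on exactly one of the four passes, which a single-pass evaluation could have rewarded; requiring all passes removes that luck.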

MyVLM: Personalizing VLMs for User-Specific Queries

MyVLM is the first to explore the VLM personalization problem. It detects user-specific concepts (e.g., "your dog") using an external concept recognition head, and learns concept embeddings in the VLM's intermediate feature space to guide the language model to naturally incorporate the concept in its responses. It achieves personalized captioning and VQA with only 3-5 images.

Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild

Nymeria is currently the world's largest in-the-wild human motion dataset (300 hours, 264 participants), providing synchronized and co-localized multi-device multimodal egocentric data (Project Aria glasses + wristbands + motion capture suits) for the first time, accompanied by 310.5K hierarchical motion-language descriptions.

Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models

Omniview-Tuning (OVT) uses parameter-efficient fine-tuning to significantly improve the robustness of VLP models (e.g., CLIP) to 3D viewpoint changes (averaging +9-10%). It constructs MVCap, a 4.6-million multi-view image-text dataset, and designs a minimax-optimized cross-viewpoint alignment framework, while incurring almost no loss in original performance.

REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

The REVISION framework is proposed to leverage Blender 3D rendering to generate spatially accurate synthetic images. These images guide text-to-image (T2I) models in a training-free manner to generate spatially consistent images. It also implements the RevQA benchmark to evaluate the spatial reasoning capabilities of MLLMs.

Robust Calibration of Large Vision-Language Adapters

This paper finds that CLIP adaptation methods (Adapter/Prompt Learning/TTA) severely impair the calibration of the zero-shot baseline in OOD scenarios. It identifies increased logit range (rather than increased logit norm) as the root cause of miscalibration, and proposes three simple, model-agnostic logit range constraint schemes (ZS-Norm, Penalty, and SaLS) that effectively mitigate miscalibration while maintaining discriminative performance.
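
Why the logit range matters for confidence can be seen in a small sketch. This is only an illustration of the general idea of constraining the range of adapted logits to that of the zero-shot logits; it is not claimed to reproduce ZS-Norm, Penalty, or SaLS, and the logit values are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_range(logits):
    return max(logits) - min(logits)


def rescale_to_range(logits, target_range):
    """Scale logits by a positive factor so their range equals target_range.

    A positive scale keeps the argmax (the predicted class) unchanged.
    """
    current = logit_range(logits)
    if current == 0:
        return list(logits)
    scale = target_range / current
    return [l * scale for l in logits]


zero_shot = [2.0, 1.0, 0.0]   # hypothetical zero-shot logits, range 2
adapted = [8.0, 2.0, 0.0]     # hypothetical adapted logits, range 8

rescaled = rescale_to_range(adapted, logit_range(zero_shot))
print(rescaled)                                    # [2.0, 0.5, 0.0]
print(max(softmax(adapted)), max(softmax(rescaled)))
```

The prediction is the same before and after rescaling, but the maximum softmax probability drops, so the model is less overconfident while its accuracy is untouched.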

Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models

This paper proposes a Selective Dual-Teacher Knowledge Transfer (SND) framework. By measuring the representation discrepancy between the pre-trained VLM and the recently fine-tuned VLM, it adaptively selects the appropriate teacher for knowledge distillation on an unlabeled reference dataset, mitigating catastrophic forgetting while maintaining zero-shot classification capabilities.

Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities

The EdgeVL framework is proposed to adapt large-scale VLMs (such as CLIP) to edge devices through a two-stage adaptation (dual-modality knowledge distillation + quantization-aware contrastive learning), achieving open-vocabulary cross-modality (RGB and non-RGB) classification without requiring human annotations. This achieves up to a 15.4% accuracy improvement and a 93x model compression.

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

ShareGPT4V constructs a high-quality descriptive caption dataset comprising 1.2M entries (seed of 100K generated by GPT4-Vision + expanded to 1.2M via Share-Captioner). By using this dataset to train ShareGPT4V-7B (a model based on the LLaVA architecture) in both pre-training and SFT stages, it achieves state-of-the-art performance on 9 out of 11 multi-modal benchmarks. This demonstrates that high-quality captions are the key bottleneck in multi-modal alignment for LMMs.

SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant

This paper proposes a Visual Self-Questioning (SQ) training paradigm, enabling LLMs to not only learn how to answer questions but also actively ask questions based on images. By fully exploiting the rich semantic information inherent in the questions themselves within instruction-following data, the proposed method enhances vision-language alignment.

The Hard Positive Truth About Vision-Language Compositionality

This paper reveals an evaluation blind spot in existing CLIP compositionality benchmarks—the lack of hard positives testing. It discovers that hard negative fine-tuning causes the model to become "oversensitive" (falsely reducing matching scores for paraphrases that preserve semantics). This issue is mitigated by jointly training with both hard positives and hard negatives.

Towards Open-ended Visual Quality Comparison

This work proposes Co-Instruct, the first large multimodal model for open-ended visual quality comparison. By constructing a 562K instruction-tuning dataset from two "weakly supervised sources" (LLM-merged single-image descriptions + GPT-4V pseudo-labels), Co-Instruct achieves higher accuracy in multi-image quality comparison than its teacher model, GPT-4V, and introduces MICBench, the first multi-image comparison benchmark.

Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models

This paper proposes WResVLM, a semi-supervised learning framework that utilizes vision-language models (VLMs) to provide clearness evaluation and semantic description supervision signals for real-world adverse weather images. It enhances clearness via VLM image evaluation coupled with weather prompt learning, and enhances semantics via description-aided semantic regularization. This approach comprehensively outperforms existing methods on real-world deraining, dehazing, and desnowing tasks.

Uni3DL: Unified Model for 3D and Language Understanding

This paper proposes Uni3DL, a unified 3D vision-language model operating directly on point clouds. By learning task-agnostic semantic/mask outputs through a Query Transformer and then combining multiple functional heads using a Task Router, it achieves functional unification across six tasks: semantic segmentation, instance segmentation, object detection, visual grounding, 3D caption generation, and text-to-3D retrieval. Its performance reaches or exceeds the task-specific SOTA for each task.

UniCode: Learning a Unified Codebook for Multimodal Large Language Models

UniCode proposes learning a unified codebook to tokenize both visual and textual signals simultaneously. It progressively aligns the visual tokenizer's codebook with the LLM's vocabulary through a language-driven iterative training paradigm. Additionally, it introduces an in-context image decompression pre-training task to enhance image generation quality, enabling MLLMs to achieve multimodal understanding and generation without requiring extra alignment modules.

Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

This paper proposes the Vary method, which scales up the vision vocabulary of Large Vision-Language Models (LVLMs) by generating and integrating a new vision vocabulary. This empowers the model with new fine-grained visual perception capabilities, such as document-level OCR and chart understanding, while maintaining its original general capabilities.

X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs

Proposes X-Former, a lightweight Transformer module that fuses complementary visual features from CLIP-ViT (contrastive learning) and MAE-ViT (masked image modeling) through a dual cross-attention mechanism. It significantly outperforms BLIP-2 on fine-grained visual understanding tasks using only 1/10 of the training data.

Zero-shot Object Counting with Good Exemplars (VA-Count)

This work proposes the VA-Count framework, which leverages Grounding DINO via an Exemplar Enhancement Module (EEM) to discover high-quality positive and negative exemplars, combined with a Noise Suppression Module (NSM) utilizing contrastive learning to distinguish positive and negative density maps, achieving state-of-the-art zero-shot object counting performance on FSC-147 and CARPK.