🧩 Multimodal VLM

🧪 ICML2025 · 42 paper notes

📌 Same area in other venues: 📷 CVPR2026 (420) · 🔬 ICLR2026 (211) · 💬 ACL2026 (83) · 🧪 ICML2026 (89) · 🤖 AAAI2026 (75) · 🧠 NeurIPS2025 (107)

🔥 Top topics: Multimodal/VLM ×28 · Alignment/RLHF ×5 · LLM ×2 · Adversarial Robustness ×2

CoCoA-Mix: Confusion-and-Confidence-Aware Mixture Model for Context Optimization

Proposes CoCoA-Mix, a framework that builds a prompt mixture model from a confusion-aware loss (CoA-loss) and confidence-aware weights (CoA-weights). It improves both the specialization and the generalization of VLM prompt tuning without introducing extra network parameters.

CoMemo: LVLMs Need Image Context with Image Memory

This work proposes CoMemo, a dual-path architecture in which the Context path concatenates image tokens with text for autoregressive processing and the Memory path uses cross-attention to provide persistent image memory. It adds RoPE-DHR positional encoding to maintain 2D spatial awareness and alleviate long-range decay, and a three-stage training strategy to balance the two paths. Under equivalent settings, CoMemo outperforms both LVLM-S and LVLM-X.

Context is Key: A Benchmark for Forecasting with Essential Textual Information

This paper proposes the Context is Key (CiK) benchmark, which consists of 71 manually designed forecasting tasks across 7 domains. Each task requires combining numerical history with natural language context to make accurate predictions. The paper also introduces the RCRPS evaluation metric and the Direct Prompt method, and shows that a simple prompting strategy on Llama-3.1-405B (RCRPS=0.159) significantly outperforms all statistical and time-series foundation models.
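
The note above does not define RCRPS. Assuming, as the name suggests, that it builds on the continuous ranked probability score (CRPS), the following minimal sketch shows the standard sample-based CRPS estimate for a single forecast step. The function name `crps_from_samples` is hypothetical, and any region weighting or scaling that the paper's metric applies is not reproduced here.

```python
def crps_from_samples(samples, observed):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|.

    `samples` are draws from the forecast distribution for one time step,
    `observed` is the value that actually occurred. Lower is better.
    """
    n = len(samples)
    # Average distance between forecast samples and the observation.
    accuracy_term = sum(abs(s - observed) for s in samples) / n
    # Average pairwise distance between forecast samples (spread of the forecast).
    spread_term = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy_term - 0.5 * spread_term


# Toy usage: a forecast centred on the observation scores better than a biased one.
centred = crps_from_samples([1.0, 2.0, 3.0], observed=2.0)
biased = crps_from_samples([4.0, 5.0, 6.0], observed=2.0)
print(centred < biased)
```

With a single sample the estimate reduces to the absolute error, which is why CRPS is often described as a probabilistic generalization of MAE.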

Core Knowledge Deficits in Multi-Modal Language Models

This paper proposes the CoreCognition benchmark (comprising 12 core cognitive abilities across 1,503 questions). Following a large-scale evaluation of 230 MLLMs, it reveals that models systematically lag behind humans in foundational cognitive abilities. Moreover, this deficit does not improve with larger model scales; instead, larger models tend to rely more heavily on shortcut learning rather than genuine understanding.

Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces

This work proposes a unified framework for constructing multimodal diffusion models on arbitrary state spaces. By introducing independent decoupled noise schedules for each modality, it simultaneously achieves both unconditional and modality-conditional generation within a single model without requiring external tokenizers or VAE preprocessing.

Do Vision-Language Models Really Understand Visual Language?

This work systematically evaluates the diagram understanding capabilities of large vision-language models (LVLMs) by constructing a comprehensive test suite (including synthetic and real-world diagrams). It reveals that while models can identify entities, their understanding of relationships is extremely limited; their seemingly excellent performance in diagram reasoning actually stems from utilizing background knowledge as a shortcut.

Dynamic Mixture of Curriculum LoRA Experts for Continual Multimodal Instruction Tuning

This paper proposes D-MoLE, a method that automatically evolves the MLLM architecture under a parameter budget to adapt continually to new tasks. It combines a dynamic layer-wise LoRA expert allocator with a gradient-based inter-modal continual curriculum strategy, achieving an average improvement of 15% over the best baseline.

Efficient Quantification of Multimodal Interaction at Sample Level

Proposes the LSMI (Lightweight Sample-wise Multimodal Interaction) estimator, achieving precise and efficient sample-wise quantification of multimodal interactions (redundancy, uniqueness, and synergy) on real-world continuous distribution data for the first time, and demonstrates its practical value in data partitioning, knowledge distillation, and model ensemble.

ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics

ELEMENTAL integrates vision-language models (VLMs) with inverse reinforcement learning (IRL) to extract feature functions via VLMs, optimize weights via IRL, and iteratively improve via self-reflection, achieving a 42.3% improvement over EUREKA across 9 IsaacGym tasks.

ERL-VLM: Enhancing Rating-Based RL to Leverage Feedback from Large VLMs

ERL-VLM is proposed to leverage Large Vision-Language Models (VLMs) to provide absolute ratings for single trajectories instead of pairwise preferences. By combining stratified sampling and MAE loss to address data imbalance and noisy labels, it significantly improves VLM feedback-driven reward learning.
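
The two ingredients named above, stratified sampling and an MAE loss, can be sketched as follows. This is only one plausible reading: the note does not say how ERL-VLM stratifies or how its reward model outputs ratings, so the equal-draws-per-class scheme, the probability-vector output, and the names `stratified_batch` and `mae_loss` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_batch(ratings, per_class, rng):
    """Draw up to `per_class` sample indices from every rating class.

    Equal draws per class keep rare ratings from being swamped by common ones.
    """
    by_class = defaultdict(list)
    for idx, rating in enumerate(ratings):
        by_class[rating].append(idx)
    batch = []
    for rating in sorted(by_class):
        pool = by_class[rating]
        batch.extend(rng.sample(pool, min(per_class, len(pool))))
    return batch


def mae_loss(probs, label):
    """Mean absolute error between predicted class probabilities and a one-hot label.

    Unlike cross-entropy, this loss is bounded, so a single wrong label
    cannot produce an arbitrarily large penalty.
    """
    return sum(
        abs(p - (1.0 if cls == label else 0.0)) for cls, p in enumerate(probs)
    ) / len(probs)


# Toy usage: 8 trajectories rated 0 and 2 rated 1 (imbalanced ratings).
ratings = [0] * 8 + [1] * 2
batch = stratified_batch(ratings, per_class=2, rng=random.Random(0))
print(sorted(ratings[i] for i in batch))
```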

Enhancing Target-unspecific Tasks through a Features Matrix

Proposes the Features Matrix (FM) method, which leverages multiple hand-crafted prompt templates to extract general knowledge from frozen CLIP to construct a features matrix. By aligning unexpected features with fine-tuned visual features, it enhances the model's performance on target-unspecific tasks (e.g., base-to-novel generalization, cross-dataset generalization, domain generalization).

From Black Boxes to Transparent Minds: Evaluating and Enhancing the Theory of Mind in Multimodal Large Language Models

This work evaluates the Theory of Mind (ToM) capabilities of Multimodal Large Language Models (MLLMs) from an interpretability perspective, constructs a 2D grid world-based multimodal ToM dataset named GridToM, and proposes a training-free attention head activation intervention method to significantly enhance the ToM performance of MLLMs.

Graph4MM: Weaving Multimodal Learning with Structural Information

Proposes the Graph4MM framework, which injects multi-hop graph structural information into the self-attention mechanism via Hop-Diffused Attention and uses MM-QFormer for cross-modal fusion, achieving an average improvement of 6.93% on generative and discriminative tasks.

Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin

Proposes the CAP framework, which addresses class imbalance in VLM-generated pseudolabels through concept alignment (detecting and fixing concept mismatch) and a confusion-aware calibrated margin (alleviating concept confusion), achieving a 6.29% relative improvement over SOTA models across six datasets and three paradigms.

Importance Corrected Neural JKO Sampling

Proposes Importance Corrected Neural JKO Sampling (Neural JKO IC), which alternates between the local JKO steps of continuous normalizing flows (CNFs) and rejection resampling steps based on importance weights. This approach overcomes the local optima issues of Wasserstein gradient flows on multimodal distributions while maintaining independent and identically distributed (i.i.d.) sampling and density tractability.

Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models

Proposes Kernel-based Unsupervised Embedding Alignment (KUEA), a method that aligns the visual representations of CLIP and DINOv2 in a kernel space. By fine-tuning solely on image data, it enhances CLIP's fine-grained perception while maintaining compatibility with the text encoder, thereby boosting the performance of downstream MLLMs.

LADA: Scalable Label-Specific CLIP Adapter for Continual Learning

This paper proposes LADA (Label-specific ADApter), which appends lightweight class-specific memory vectors after the frozen CLIP image encoder to concentrate the discriminative information of all learned tasks into a unified feature space. This completely eliminates the parameter selection step during inference and achieves SOTA performance under the X-TAIL continual learning setting.

LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models

This work points out that the classic ImageNet-C out-of-distribution robustness benchmark is no longer truly OOD for models trained on web-scale datasets like LAION. To address this, the authors construct the LAION-C benchmark with 6 novel, highly synthetic image distortions and conduct psychophysical experiments with 19 human subjects, revealing a paradigm shift in OOD generalization where the best models have caught up with or even surpassed humans.

Learning Invariant Causal Mechanism from Vision-Language Models

This paper proves through causal analysis that CLIP embeddings are linear transformations of the true invariant and variant factors. It then proposes the CLIP-ICM framework, which estimates a linear projection matrix from intervention data and restricts predictions to the invariant subspace so that they stay consistent across environments.

Learning Optimal Multimodal Information Bottleneck Representations

This paper proposes the OMIB framework, which guarantees the optimality of multimodal information bottleneck representations (retaining all task-relevant information and eliminating redundant information) by theoretically deriving the upper bound of the regularization parameter \(\beta\) and dynamically adjusting the weight \(r\) of each modality.

LEMoN: Label Error Detection using Multimodal Neighbors

This paper proposes LEMoN, a method that leverages the multimodal neighborhood structure of image-text pairs in the embedding space of contrastively pre-trained multimodal models (such as CLIP). It automatically detects label errors in both classification and image captioning scenarios, achieving a 3-4% F1-score improvement over training-free baselines. Furthermore, downstream classification and captioning performance are enhanced when trained on the filtered data.

LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models

Proposes LlavaGuard, an open visual content safety moderation framework based on open-source VLMs. By utilizing a customizable safety taxonomy, a high-quality human-annotated dataset, and policy-enhanced training, it achieves flexible and precise safety assessment of image content, substantially outperforming existing open-source and closed-source moderation tools in both accuracy and policy adaptability.

M3-JEPA: Multimodal Alignment via Multi-gate MoE based on JEPA

Generalizes JEPA (Joint-Embedding Predictive Architecture) to multimodal alignment of arbitrary modality combinations. It utilizes a Multi-gate MoE as a cross-modal predictor to perform alignment in the latent space (rather than the token space), where the gating function decouples modality-specific and shared information. Alternating gradient descent is employed to avoid gradient conflict between multi-directional tasks. With only 140M trainable parameters, it outperforms state-of-the-art models like BLIP-2 (1.2B) on multiple retrieval and classification tasks.

MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

This paper proposes MMedPO, a clinical-aware multimodal medical preference optimization method. By constructing multimodal preference data through believable hallucination injection and localized lesion noise addition, and leveraging collaboration among multiple medical LLMs to evaluate clinical relevance as a weighting signal integrated into DPO training, it achieves average improvements of 14.2% and 51.7% on Med-VQA and report generation tasks, respectively.

MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding

To address the "Attention Deficit Disorder" issue characterized by cross-modal attention inconsistency and layer-wise decay in Multimodal Large Language Models (MLLMs), this paper proposes a modular duplex attention mechanism named MODA. By decoupling attention into intra-modal self-refinement and inter-modal interaction pathways, and utilizing a Duplex Aligner combined with adaptive masked attention to implement a "Correct-after-Align" strategy, its effectiveness is validated across 21 perception, cognition, and emotion benchmarks.

OmniBal: Towards Fast Instruction-Tuning for Vision-Language Models via Omniverse Computation Balance

To address the computation imbalance caused by data and model heterogeneity in large-scale vision-language model instruction-tuning, the OmniBal framework is proposed to systematically balance computational workloads across devices from three perspectives: data, model, and memory, achieving approximately \(1.8\times\) training speedup on InternVL-Chat.

Parrot: Multilingual Visual Instruction Tuning

Proposes Parrot, which utilizes a text-guided cross-attention mechanism and an MoE module to transform English-biased visual features into language-specific representations, significantly enhancing the multilingual capabilities of MLLMs with an extremely small amount of multilingual data (~10K samples per language).

Ranked from Within: Ranking Large Multimodal Models Without Labels

This work systematically investigates whether the relative performance of LMMs can be predicted in label-free scenarios. By evaluating 47 SOTA LMMs across 9 VQA benchmarks, the study reveals that uncertainty metrics based on softmax distributions provide a robust unsupervised model ranking (Spearman correlation \(\rho=0.92\) with the ground-truth ranking).
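
The idea of ranking models from softmax-based uncertainty and then checking the ranking with a Spearman correlation can be illustrated with a small sketch. The note does not say which uncertainty metric the paper uses, so the mean maximum softmax probability below is an assumed stand-in, and the model names, logits, and accuracies are made-up toy values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_max_prob(all_logits):
    """Label-free confidence score: average top softmax probability."""
    return sum(max(softmax(row)) for row in all_logits) / len(all_logits)


def ranks(values):
    """Rank values with 1 = largest; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = average_rank
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy usage: per-question answer logits for three hypothetical models.
model_logits = {
    "model_a": [[4.0, 0.0, 0.0], [3.0, 0.5, 0.0]],
    "model_b": [[2.0, 0.0, 0.0], [1.5, 0.5, 0.0]],
    "model_c": [[0.5, 0.0, 0.0], [0.2, 0.1, 0.0]],
}
true_accuracy = {"model_a": 0.81, "model_b": 0.67, "model_c": 0.42}

names = list(model_logits)
confidence = [mean_max_prob(model_logits[name]) for name in names]
rho = spearman(confidence, [true_accuracy[name] for name in names])
print(rho)
```

Labels are needed only for the final correlation check; the ranking itself uses nothing but the models' own output distributions.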

Robust Multimodal Large Language Models Against Modality Conflict

Reveals an overlooked source of MLLM hallucinations: modality conflict, the inherent incompatibility between visual and textual inputs. It formally defines modality conflicts at three levels (object, attribute, and relation), constructs the MMMC dataset with 20K samples, and proposes three mitigation approaches (prompt engineering, SFT, and RL), among which RL achieves the best performance.

RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer

This work reveals that the self-attention mechanism in multimodal Transformers loses its dynamic adaptability (favoring a single modality) due to a "self-reinforcing loop." It proposes the RollingQ algorithm to break this loop by rotating query vectors, thereby reviving cross-modal cooperation dynamics.

SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs

Presents SK-VQA, a large-scale synthetic KB-VQA dataset of over 2 million QA pairs generated fully automatically with GPT-4. The dataset is used to train MLLMs for context-augmented generation, and models trained on it generalize across domains significantly better than models trained on existing datasets.

Streamline Without Sacrifice — Squeeze out Computation Redundancy in LMM

Proposes ProxyV, which introduces a small number of proxy vision tokens that stand in for the original vision tokens during the self-attention and FFN operations of the LLM decoder layers. This significantly reduces computational redundancy while retaining all visual information, and even improves performance under certain settings.

The Devil Is in the Details: Tackling Unimodal Spurious Correlations for Generalizable Multimodal Reward Models

This work discovers that Multimodal Reward Models (MM-RMs) over-rely on unimodal text shortcuts during training, leading to poor out-of-distribution (OOD) generalization. To address this, the authors propose a Shortcut-aware MM-RM learning algorithm that reduces reliance on unimodal spurious correlations through dynamic sample reweighting, improving OOD accuracy from 68.1% to 78.5%.

Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage

Proposes the CapMAS multi-agent system, which corrects hallucinations through LLM-MLLM collaboration by decomposing detailed image captions into atomic propositions and verifying their truthfulness one by one. It also introduces a framework that evaluates detailed captions along the dual dimensions of factuality and coverage. The system significantly improves the caption quality of various MLLMs, including GPT-4V.

Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning

This paper proposes Counterfactual Soft Reinforcement Learning (CoSo), which leverages counterfactual reasoning to evaluate the causal impact of each token on the final action. By incorporating causally weighted entropy regularization to concentrate exploration on key tokens, CoSo addresses the text action space explosion in online RL fine-tuning for VLM agents. Experimental results demonstrate performance gains of 12.3%, 9.3%, and 16.7% on Android control, card reasoning games, and embodied AI tasks, respectively.

Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration

This paper proposes the Self-Rationale Calibration (SRC) framework, which guides LVLMs to output intermediate reasoning processes through lightweight rationale fine-tuning. It leverages sentence-level beam search to generate diverse candidate responses, and employs a specially designed R-Scorer with a pairwise scoring strategy to select positive and negative rationale-answer pairs. Finally, DPO-based preference alignment is used to iteratively calibrate the model's rationale-answer consistency, achieving significant improvements across multiple perception, reasoning, and generalization benchmarks.

Understanding and Mitigating Miscalibration in Prompt Tuning for Vision-Language Models

By analyzing the root cause of VLM calibration failure during prompt tuning (text feature shift), this paper proposes a Dynamic Outlier Regularization (DOR) method. It utilizes high semantic similarity nouns from WordNet as text outliers to constrain feature drift during fine-tuning, significantly reducing calibration error.

Universal Retrieval for Multimodal Trajectory Modeling

This work systematically defines the multimodal trajectory retrieval task for the first time. It constructs the Unified Agent Trajectory Dataset (UATD) containing 7,747 demonstrations and 82,793 states, alongside the GAE-Bench benchmark containing 714,628 positive sample pairs. Additionally, the VLM2Vec-based GAE-Retriever framework is proposed, achieving an average improvement of 10.22 percentage points over the strongest baseline, VLM2Vec-V2.2, across 5 GUI environments.

Vision-Language Model Selection and Reuse for Downstream Adaptation

Proposes the Model Label Learning (MLL) paradigm, which performs offline "labeling" of 49 pre-trained VLMs (describing each model's capability across different visual concepts) by constructing a semantic graph. For a new task, it selects and ensembles the most suitable models via semantic matching, achieving data-efficient, computationally efficient, and scalable VLM selection and reuse.

Vision-Language Models Create Cross-Modal Task Representations

This paper discovers that autoregressive vision-language models (VLMs) compress conceptually equivalent inputs (whether text/image examples, instructions, or few-shot prompts) into a shared "task vector". It validates the existence and utility of such representational alignment through cross-modal patching experiments.

Vision Graph Prompting via Semantic Low-Rank Decomposition

This work proposes Vision Graph Prompting (VGP), the first visual prompt learning framework tailored for Vision GNN (ViG). By leveraging the low-rank characteristics of semantic connected components in graphs, VGP designs semantic low-rank prompts at three granularities: graph, edge, and node levels (SeLo-Graph/Edge/Node Prompt). This approach achieves downstream transfer performance close to full fine-tuning while maintaining parameter efficiency.

What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

This paper proposes OmniBench, a graph-based, scalable virtual agent benchmark. By synthesizing tasks of controllable complexity through an automated pipeline, combined with the OmniEval multi-dimensional evaluation framework, it generates 36K tasks across 20 application scenarios, systematically revealing the weaknesses of virtual agents across different capability dimensions.