
🧩 Multimodal VLM

💬 ACL2025 · 111 paper notes

📌 Same area in other venues: 📷 CVPR2026 (420) · 🔬 ICLR2026 (211) · 💬 ACL2026 (83) · 🧪 ICML2026 (89) · 🤖 AAAI2026 (75) · 🧠 NeurIPS2025 (107)

🔥 Top topics: Multimodal/VLM ×88 · LLM ×13 · Adversarial Robustness ×6 · Dialogue ×4 · Alignment/RLHF ×4

A Parameter-Efficient and Fine-Grained Prompt Learning for Vision-Language Models: This paper proposes the DoPL (Detail-oriented Prompt Learning) method. Based on the theory of low-entropy information concentration, it discovers shared-interest tokens between text and vision. It uses these to construct alignment weights to enhance text and visual prompts. With only 0.25M (0.12%) trainable parameters, it achieves fine-grained multimodal semantic alignment, surpassing full-parameter fine-tuning methods on six benchmarks.
Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference: This paper discovers "visual regions" within LLMs—sparse and uniformly distributed subsets of layers similar to the human visual cortex. Updating only 25% of the layers preserves 99% of visual performance while maintaining or even improving language capabilities. Based on this, the authors propose an efficient paradigm for visual-region-targeted training and pruning.
Adaptive Linguistic Prompting (ALP) Enhances Phishing Webpage Detection in Multimodal Large Language Models: Proposes Adaptive Linguistic Prompting (ALP), an 8-shot structured prompting approach that guides multimodal LLMs to jointly reason across HTML text, screenshots, and URLs to detect phishing webpages. Combined analysis achieves an F1-score of $0.93$ on GPT-4o, outperforming traditional zero-shot baselines.
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates: This paper proposes the MAC benchmark and a diversity-promoting self-training method. By leveraging LLMs to generate deceptive texts, it systematically exposes the compositional vulnerabilities of pre-trained multimodal representations like CLIP, significantly outperforming existing methods across image, video, and audio modalities.
Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents: This paper proposes Agent-RewardBench, the first benchmark to evaluate the capability of multimodal LLMs as agent reward models. It covers three dimensions (perception, planning, and safety) across seven real-world scenarios, containing 1,136 high-quality step-level samples. Experiments reveal that even the strongest model, GPT-4o, achieves only 61.4% accuracy, and stronger models surprisingly perform worse in the safety dimension.
AGRI-CM3: A Chinese Massive Multi-Modal Multi-Level Benchmark for Agricultural Understanding: This paper introduces AGRI-CM3, a large-scale Chinese multimodal and multi-level evaluation benchmark for the agricultural domain. It covers various agricultural subtasks, including crop identification, pest and disease diagnosis, and farming operation understanding, to systematically evaluate the capabilities of VLMs in the agricultural vertical domain.
AkaCE: A Multimodal Multi-party Dataset for Emotion Recognition in Movie Dialogues: This work constructs AkaCE—the first multimodal conversational emotion recognition dataset for an African language, covering Akan (the primary language of Ghana, with approximately 20 million speakers). It contains 385 dialogues with 6,162 utterances (spanning audio, visual, and text modalities), 308 speakers (gender-balanced with 155 males and 153 females), and provides the first word-level prosodic prominence annotations for an African language.
Aligning VLM Assistants with Personalized Situated Cognition: Based on the sociological concept of "Role-Set" to characterize user diversity, this paper proposes the PCogAlign framework. By utilizing a cognition-aware, action-oriented reward model to generate personalized responses for VLM assistants, the framework ensures that users with different roles receive advice tailored to their specific needs within the same visual scene.
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models: Proposes AlignMMBench, the first multimodal alignment evaluation benchmark for Chinese visual contexts, covering 13 tasks across 3 major categories, with 1054 images and 4978 QA pairs (including single-turn/multi-turn dialogues). Additionally, a ChatGLM3-6B-based evaluator, CritiqueVLM, is trained, which outperforms GPT-4 in evaluation consistency.
Aria-UI: Visual Grounding for GUI Instructions: This paper proposes Aria-UI, a vision-only multimodal model specifically designed for GUI visual grounding. By utilizing a scalable instruction synthesis data pipeline and an interleaved text-image action history mechanism, Aria-UI achieves state-of-the-art (SOTA) performance on both offline and online agent benchmarks, including 1st place on AndroidWorld (44.8%) and 3rd place on OSWorld (15.2%).
Attacking Vision-Language Computer Agents via Pop-ups: This work systematically designs a set of adversarial pop-up attack methods to attack vision-language model-based computer agents. The proposed methods achieve an average attack success rate of $86\%$ on OSWorld and VisualWebArena, decreasing the task success rate by $47\%$ while showing that basic defense mechanisms are almost ineffective.
AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity: This work integrates a visual granularity scaler (obtaining multi-scale granularity tokens via spatial pyramid pooling) and a visual granularity router (adaptively selecting granularity based on the image + instruction) into LLaVA-NeXT. It also introduces the RGLF training paradigm, which utilizes the LMM's own generation probabilities as feedback to train the router, achieving the effect of "reducing tokens while improving performance" across 11 benchmarks.
Enhancing Multimodal Continual Instruction Tuning with BranchLoRA: To address parameter inefficiency and catastrophic forgetting of MoELoRA in multimodal continual instruction tuning (MCIT), this paper proposes BranchLoRA—an asymmetric architecture. It shares matrix A to capture cross-task general patterns and maintains multi-branch matrices B to encode task-specific knowledge. Complemented by a flexible tuning-freezing mechanism and task-specific routers, it significantly outperforms the previous SOTA MoELoRA on the CoIN benchmark with fewer parameters (ACC: 44.20 vs 37.13, BWT: -20.98 vs -25.91).
Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?: The paper proposes the TempVS benchmark to systematically evaluate the grounding and reasoning capabilities of 38 MLLMs on multi-event temporal relationships in image sequences, revealing a substantial performance gap between state-of-the-art models and humans.
Can MLLMs Understand the Deep Implication Behind Chinese Images?: This paper proposes CII-Bench (Chinese Image Implication Understanding Benchmark), which contains 698 Chinese internet/traditional culture images and 800 multiple-choice questions. It systematically evaluates MLLMs' high-level understanding of the deep implications within Chinese images. The study reveals that the best-performing model achieves an accuracy of only 64.4%, significantly lower than the human average of 78.2%, with models performing worst in the domain of traditional Chinese culture.
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers: This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the capability of multimodal foundation models to understand schematic diagrams in scientific papers. It contains 1,500 expert-annotated samples and reveals a significant performance gap between the state-of-the-art models and human experts.
Can Vision-Language Models Evaluate Handwritten Math?: This paper proposes the FERMAT benchmark to systematically evaluate the error detection, localization, and correction capabilities of 9 VLMs on handwritten mathematical content. Using 609 manually curated Grade 7-12 math problems alongside over 2,200 handwritten erroneous solutions (covering computation, conceptual, notation, and formatting errors), the evaluation reveals that Gemini-1.5-Pro achieves the highest correction rate of 77%, though all models still face significant challenges when processing handwritten content.
Can Vision Language Models Understand Mimed Actions?: This paper proposes the Mime benchmark (86 mimed actions × 10 variations = 860 samples), constructing a controllable evaluation via motion capture + 3D rendering. It finds that while humans maintain near-100% accuracy under various perturbations, the strongest VLM achieves only 52.3% (multiple-choice) / 19.8% (free-form), revealing that VLMs heavily rely on scene context cues rather than the action itself.
MMSafeAware: Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs: This work proposes MMSafeAware, the first multimodal safety awareness benchmark that simultaneously evaluates "unsafe content identification" and "over-sensitivity." It contains 1,500 image-text pairs across 29 safety scenarios. Evaluating 9 MLLMs reveals that all models suffer from a severe trade-off between safety and helpfulness: GPT-4V misclassifies $36.1\%$ of unsafe inputs as safe while misclassifying $59.9\%$ of safe inputs as unsafe. None of the three mitigation methods can fundamentally resolve this issue.
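
The asymmetric layout named in the BranchLoRA entry above (one shared matrix A, several branch matrices B, and a router) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `BranchLoRASketch`, the softmax gating over a single linear router, and all dimensions are assumptions chosen for illustration, and the tuning-freezing mechanism is omitted.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class BranchLoRASketch:
    """Hypothetical layer: frozen W, one shared A, several B branches."""

    def __init__(self, d_in, d_out, rank, n_branches, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

        self.W = rand(d_out, d_in)            # frozen base weight
        self.A = rand(rank, d_in)             # shared down-projection
        # Each branch B_k starts at zero, as in standard LoRA initialization.
        self.B = [[[0.0] * rank for _ in range(d_out)] for _ in range(n_branches)]
        self.router = rand(n_branches, d_in)  # assumed linear router

    def forward(self, x):
        out = matvec(self.W, x)
        z = matvec(self.A, x)                 # computed once, reused by every branch
        gates = softmax(matvec(self.router, x))
        for gate, branch in zip(gates, self.B):
            delta = matvec(branch, z)
            out = [o + gate * d for o, d in zip(out, delta)]
        return out, gates


layer = BranchLoRASketch(d_in=4, d_out=3, rank=2, n_branches=3)
output, gates = layer.forward([1.0, 0.5, -0.5, 2.0])
```

In this sketch the adapter stores `rank * d_in` values for A plus `n_branches * d_out * rank` values for the branches, whereas keeping a separate (A, B) pair per expert would store `n_branches * rank * (d_in + d_out)`. That difference is the parameter saving the sharing is meant to provide.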

CART: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling

Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model: This paper systematically investigates three dimensions of multilingual LVLM training strategies: the number of training languages, training data distribution, and multilingual OCR. It discovers that 100 languages can be trained simultaneously using only 25-50% non-English data, based on which Centurio, a state-of-the-art model covering 100 languages, is trained.
ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation: Proposes ChartCoder, the first dedicated chart-to-code MLLM. Using a Code LLM as the language backbone, combined with a 160K large-scale chart-to-code dataset and a "Snippet-of-Thought" step-by-step reasoning approach, the 7B model beats all open-source MLLMs across three benchmarks and approaches the performance of GPT-4o.
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation: The CoSyn framework is proposed to automatically create 400K text-rich images (charts, documents, diagrams, etc.) and 2.7M instruction-tuning data using the code generation capabilities of text-only LLMs. The trained 7B VLM achieves state-of-the-art (SOTA) performance across 7 benchmarks, outperforming GPT-4V and Gemini 1.5 Flash.
COLING-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models: This paper proposes a scientific chart visual question answering system based on Multimodal Large Language Model (MLLM) ensembling. Utilizing a few-shot exemplar retrieval strategy and a confidence-aware model selection mechanism, the system achieved third place (average F1 = 85.12) in the SciVQA 2025 shared task.
Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities: This paper proposes the Con Instruction method, which optimizes adversarial images or audio to align them with target malicious instructions in the embedding space. This achieves jailbreaking of multimodal large language models (MLLMs) without textual inputs, reaching an attack success rate of 86.6% on LLaVA-v1.5. Additionally, the ARC evaluation framework is introduced to simultaneously measure both the quality and relevance of attack responses.
ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT: This paper constructs ConECT, the first Czech-Polish e-commerce multimodal translation dataset (11,400 sentence pairs + product images + category paths). Through a systematic comparison of three technical routes (VLM end-to-end translation, NMT + category path prefix, and NMT + image description prefix), the authors find that structured category context consistently improves translation quality (COMET $+0.005$), whereas injecting synthetic image descriptions in a cascaded manner severely damages translation performance (COMET drops by more than $0.11$).
Insight Over Sight: Exploring the Vision-Knowledge Conflicts in Multimodal LLMs: This paper presents the first systematic study of commonsense-level vision-knowledge conflicts in MLLMs. It proposes an automated framework to construct the ConflictVis benchmark (374 images + 1122 QA pairs), finding that MLLMs over-rely on parametric knowledge in approximately 20% of conflict scenarios (particularly in Yes-No and action-related questions). Additionally, a Focus-on-Vision prompting strategy is proposed to mitigate this issue.
CORDIAL: Can Multimodal Large Language Models Effectively Understand Coherence Relations?: This paper introduces CORDIAL, the first benchmark designed to evaluate the multimodal discourse analysis capabilities of MLLMs using Coherence Relations. Spanning three discourse domains (disaster management, social media, and online articles), CORDIAL includes coherence relations of varying granularity. Experiments reveal that even Gemini 1.5 Pro and GPT-4o fail to match a simple CLIP-based classifier baseline, highlighting a fundamental deficiency of MLLMs in pragmatic understanding.
CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG: CoRe-MMRAG proposes an end-to-end multimodal RAG framework that addresses Parametric-Retrieval Knowledge Inconsistency (PRKI) and Visual-Textual Knowledge Inconsistency (VTKI) through a four-stage pipeline (parametric knowledge generation → joint visual-textual reranking → external knowledge generation → internal-external knowledge integration), achieving improvements of 5.6% and 9.3% on InfoSeek and Encyclopedic-VQA, respectively.
Coreference as an Indicator of Context Scope in Multimodal Narrative: This paper reveals a significant divergence between human and large multimodal language models in the distribution of coreferential expressions within visual storytelling tasks. While humans dynamically interweave references to different entities and preserve textual-visual alignment, models exhibit limited capacity in tracking mixed references. Furthermore, a suite of quantitative metrics characterizing coreference patterns is proposed.
COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus: Builds COSMMIC, the first comment-sensitive multimodal multilingual dataset for Indian languages—covering 9 Indian languages, 4,959 article-image pairs, and 24,484 reader comments; proposes comment filtering (IndicBERT) and image classification (CLIP) enhancement schemes, and establishes summarization and headline generation benchmarks using GPT-4 and LLaMA3.
CrafText Benchmark: Advancing Instruction Following in Complex Multimodal Open-Ended World: This paper proposes CrafText, a multimodal instruction-following benchmark based on the Craftax open-world environment. It contains 3,924 instructions with 3,423 unique words, covering four task categories: localization, conditional, building, and achievement. It also introduces a dual-evaluation protocol designed to test the language and goal generalization capabilities of agents.
Cultivating Game Sense for Yourself: Making VLMs Gaming Experts: This paper proposes the GameSense framework, elevating the VLM from a direct game controller to a high-level developer. By enabling the VLM to autonomously observe tasks and develop task-specific "game sense" execution modules (ranging from rule-based scripts to neural networks), it achieves, for the first time, smooth gameplay across diverse genres, including action, shooting, and casual games.
DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning: The DALR framework is proposed to address Cross-Modal misalignment Bias (CMB) and Intra-modal Semantic Divergence (ISD) in multimodal sentence representation learning using a dual-level alignment strategy of cross-modal consistency learning and intra-modal rank distillation, achieving SOTA performance on STS and TR tasks.
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation: Proposes a cognitive science-inspired two-stage framework (Perception + Prediction) and constructs WM-ABench, a large-scale benchmark (23 dimensions, 6 simulators, over 100k instances). Through 660 sets of experiments, it systematically reveals severe deficiencies in the foundational world modeling capabilities of 15 state-of-the-art VLMs.
Donate or Create? Comparing Data Collection Strategies for Emotion-labeled Multimodal Social Media Posts: This paper systematically compares three strategies for collecting author-annotated emotion data (creation, donation, recent posts). It reveals that research-created data exhibits significant differences from real-world data in text length, emotional prototypicality, and image-text relations. However, created data remains effective for training generalizable models, whereas real-world data is indispensable for accurate model evaluation.
Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models: Discovers the "blind token" phenomenon in LVLMs—where a small number of semantically irrelevant image tokens attract a disproportionate amount of attention weights—and proposes AvisC, a method that recalibrates the influence of blind tokens via test-time contrastive decoding to effectively mitigate visual hallucinations.
Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation: This paper proposes an autoregressive text-image plan generation framework (MPlanner) that effectively enhances the coherence of visual steps and text-image consistency in multimodal plans through a four-stage iteration: textual drafting, image editing, visual information extraction, and textual refinement.
Error-driven Data-efficient Large Multimodal Model Tuning: Proposes an error-driven data-efficient fine-tuning framework in which a teacher model analyzes the erroneous reasoning steps of a student model to identify missing skills, and retrieves targeted training samples from an external dataset for fine-tuning, achieving an average performance improvement of 7.01% without requiring task-specific data.
Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users: Through user surveys, this work identifies the core needs and challenges of visually impaired individuals regarding AI visual assistants. It designs an evaluation framework covering five user-centric tasks: image captioning, multilingual VQA, optical Braille recognition, video object recognition, and video QA. By systematically evaluating 12 MLLMs, it reveals significant deficiencies of current models in cultural understanding, multilingual support, Braille reading, assistive device recognition, and hallucination control.
Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration: This paper proposes a semi-automated framework for constructing cultural VLM benchmarks. Through human-VLM collaboration, multiple-choice VQA samples are generated to construct the K-Viscuit dataset (657 questions) focusing on Korean culture, revealing a significant gap between open-source and closed-source VLMs in cultural understanding.
Exploring Compositional Generalization of Multimodal LLMs for Medical Imaging: This paper proposes the Med-MAT dataset (comprising 106 medical datasets and 53 subsets) and decomposes medical imaging attributes using MAT-Triplet (Modality-Anatomical area-Task). It systematically verifies, for the first time, the phenomenon of compositional generalization (CG) in multimodal LLMs on medical imaging, and demonstrates that compositional generalization is the key driver of generalization gains in multi-task training.
Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder: This study systematically investigates why generative multimodal LLMs (e.g., LLaVA) drastically outperform CLIP on visual reasoning tasks despite using the exact same vision encoder, revealing that patch tokens, positional encodings, and prompt weighting are the key factors.
Filter-And-Refine: A MLLM Based Cascade System for Industrial-Scale Video Content Moderation: TikTok proposes a two-stage cascade content moderation system based on MLLM (Router-Ranker). By filtering 97.5% of compliant traffic using a lightweight embedding retrieval router, only high-risk videos are routed to a fine-tuned LLaVA for fine-grained classification. This improves F1 by 66.5% while reducing deployment costs to 1.5% of full direct deployment.
Finding Needles in Images: Can Multi-modal LLMs Locate Fine Details?: This paper proposes the NiM benchmark dataset to systematically evaluate the capability of Multimodal Large Language Models (MLLMs) to locate fine-grained information in complex documents, and designs the Spot-IT method to significantly improve model performance in detail extraction tasks through intelligent patch selection and a Gaussian attention mechanism.
FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation: This paper proposes FlagEvalMM, an open-source multimodal model evaluation framework. By leveraging an architectural design that decouples model inference from the evaluation process, it uniformly supports the evaluation of various multimodal tasks, including vision-language understanding (VQA), text-to-image/video generation, and image-text retrieval.
GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art: GODBench presents the first benchmark designed to systematically evaluate the video bullet-screen/comment-art generation capabilities of Multimodal Large Language Models (MLLMs), defining 5 creative dimensions and 25 subcategories. It introduces Ripple of Thought (RoT), a multi-step reasoning framework inspired by physical wave propagation, to enhance models' creative generation performance.
Harnessing PDF Data for Improving Japanese Large Multimodal Models: Proposes a fully automated PDF data extraction pipeline to extract image-text pairs from Japanese PDFs and generate instruction data. By continually fine-tuning the LLaVA1.5 framework, it significantly improves the performance of Japanese multimodal models, achieving a 2.1%–13.8% gain on Heron-Bench.
Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings: This paper systematically evaluates the deception detection capabilities of LLMs and multimodal large models across various modalities such as text, video, and audio. It finds that fine-tuned LLMs achieve SOTA performance in text-based deception detection, whereas multimodal models still exhibit significant deficiencies in utilizing cross-modal cues.
HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model: Through CKA analysis, it is discovered that the top layer of MLLM learns task-specific information while the remaining layers learn general knowledge. This paper proposes HiDe-LLaVA: MoE-style task-specific expansion for top-layer LoRA (using dual-modal anchor matching) paired with uniform merging for LoRA in the remaining layers. On the newly constructed UCIT benchmark (free of information leakage), it achieves a 5.8% improvement over the best baseline.
HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models: This paper proposes HSCR, a hierarchical self-contrastive rewarding method that exposes the model's intrinsic modality misalignment through visual token dropout to automatically generate high-quality preference data. Combined with explicit/implicit multi-level preference optimization, it significantly enhances the zero-shot performance and trustworthiness of medical VLMs using only 2,000 training samples.
I See What You Mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue: This paper proposes a self-supervised pre-training method to learn embeddings of co-speech iconic gestures, grounding skeleton motions into language. It demonstrates the complementarity of gestures and speech in face-to-face reference resolution tasks, where the gesture+speech accuracy of 31% significantly outperforms both speech-only (24%) and gesture-only (19%).
Improving Medical Large Vision-Language Models with Abnormal-Aware Feedback: This paper proposes UMed-LVLM, which enhances the abnormal region localization capability of medical LVLMs through Abnormal-Aware Instruction Tuning and Abnormal-Aware Rewarding (including Relevance Reward, Abnormal Localization Reward, and Vision Relevance Reward) training strategies. It achieves a 58% improvement over the baseline on the MAU dataset and demonstrates excellent cross-modal and OOD generalization capabilities.
Improving MLLM's Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency: This paper proposes the Synchronously Self-Reviewing (SSR) paradigm. By requiring the MLLM to first generate OCR text before outputting the translation during the document image translation process, SSR leverages the "bilingual cognitive advantage" to alleviate catastrophic forgetting caused by fine-tuning, while simultaneously enhancing both OCR and Document Image Machine Translation (DIMT) performance.
iNews: A Multimodal Dataset for Modeling Personalized Affective Responses to News: A personalized affective annotation dataset, iNews, is constructed, containing annotations from 291 UK annotators on 2,899 multimodal Facebook news posts. Annotator profiles (demographics, personality, media trust, etc.) explain 15.2% of the annotation variance, and combining persona information with LLM zero-shot prediction improves accuracy by up to 7%.
Inference Compute-Optimal Video Vision Language Models: This paper presents the first systematic study on the optimal allocation of inference compute budget for video VLMs. Under a fixed inference FLOPs constraint, through large-scale training sweeps (~100k A100 hours) and add-interact parametric modeling ($R^2$=0.98), the optimal trade-off strategy across three dimensions—language model size $x_N$, frame count $x_T$, and visual token count per frame $x_V$—is identified.
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model: Based on InternLM-XComposer2.5, a discriminative multi-modal reward model named IXC-2.5-Reward is constructed. By training on a meticulously curated preference dataset spanning multiple domains across text, images, and videos, it surpasses GPT-4o (62.4%) with a 70.0% Macro Acc on the multi-modal reward benchmark VL-RewardBench. Furthermore, three downstream applications—RL training, Best-of-N test-time scaling, and data cleaning—are successfully demonstrated.
Jailbreak Large Vision-Language Models Through Multi-Modal Linkage: Formulates the Multi-Modal Linkage (MML) attack framework, which jailbreaks state-of-the-art vision-language models with an extremely high success rate (over 99% on GPT-4o) through a cross-modal encryption-decryption mechanism and an "evil alignment" strategy.
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games: This paper proposes the ActVLP training paradigm, which introduces a vision-language post-training stage (incorporating world knowledge, visual alignment, and spatial grounding) prior to action imitation learning. Based on this, the authors construct JARVIS-VLA, the first VLA model capable of successfully executing over 1,000 atomic tasks in Minecraft, outperforming the best baseline by 40%.
LogicQA: Logical Anomaly Detection with Vision Language Model Generated Questions: Proposes the LogicQA framework, which utilizes pretrained VLMs to automatically generate anomaly-related questions and detects logical anomalies through a QA voting mechanism. It achieves SOTA performance in training-free, annotation-free, and few-shot settings while providing natural language explanations for the detected anomalies.
MAGIC-VQA: Multimodal and Grounded Inference with Commonsense Knowledge for Visual Question Answering: The paper proposes the MAGIC-VQA framework, which systematically injects external commonsense knowledge into LVLMs through a three-stage pipeline (explicit knowledge retrieval $\rightarrow$ type-based post-processing $\rightarrow$ GNN-based implicit enhancement). This achieves plug-and-play commonsense reasoning enhancement on benchmarks like ScienceQA, TextVQA, and MMMU, requiring only 0.33M trainable parameters.
Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval: Proposes the MaxMatch method, which addresses the issues of sparse supervision and set collapse in set embedding methods through a maximal pair assignment similarity based on the Hungarian algorithm and two new loss functions, achieving SOTA performance on MS-COCO and Flickr30k.
Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search: Proposes the AutoCaption framework, which uses Monte Carlo Tree Search (MCTS) to automatically and iteratively generate fine-grained video captioning keypoints (averaging 122 per video). It builds the MCTS-VCB benchmark to evaluate the video captioning capabilities of 20+ MLLMs, and demonstrates that the generated data can be used for fine-tuning to significantly improve model performance.
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval: This paper proposes the MegaPairs data synthesis method, which leverages heterogeneous KNN triplets to mine matching image pairs from an open-domain image corpus, combined with VLMs/LLMs to generate retrieval instructions, synthesizing 26 million multimodal training instances. The resulting MMRet model, trained on only 0.5M instances, outperforms MagicLens, which uses 36.7M instances (roughly $70\times$ better data efficiency), and achieves SOTA on 4 CIR benchmarks and 36 MMEB datasets.
MEIT: Multimodal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation: This paper proposes the MEIT framework, which aligns ECG signals with LLMs through multimodal instruction tuning. It injects ECG embeddings into the self-attention layers of the LLM using a lightweight concatenation fusion strategy (requiring no extra parameters) to achieve automatic ECG report generation. It also establishes a comprehensive benchmark covering four tasks: quality evaluation, zero-shot transferability, noise robustness, and expert alignment.
MIRA: Empowering One-Touch AI Services on Smartphones with MLLM-based Instruction Recommendation: This paper proposes the MIRA framework, which enables context-aware AI service instruction recommendations on smartphones by long-pressing text or images. Through structured reasoning, template-enhanced reasoning, and trie-constrained decoding, MIRA with a 7B model outperforms GPT-4V (F1: 0.9121 vs. 0.879) while utilizing only 1/7 of the tokens.
MIRe: Enhancing Multimodal Queries Representation via Fusion-Free Modality Interaction: This paper proposes the MIRe framework, which avoids direct fusion of text features during the vision-language alignment stage through "fusion-free modality interaction". It utilizes a query-guided attention pooling module to let text embeddings guide visual information extraction without feeding back text signals to the visual representation. This effectively mitigates the text-dominant issue in multimodal retrieval, achieving zero-shot SOTA on four benchmarks.
MMInA: Benchmarking Multihop Multimodal Internet Agents: This work introduces the MMInA benchmark, consisting of 1,050 human-annotated multihop multimodal web tasks across 14 real-world dynamic websites (averaging 2.85 hops). It designs a hop-by-hop evaluation protocol and a memory augmentation method, revealing a substantial performance gap on multihop web navigation between the strongest current agent (GPT-4V, with a task success rate of only 21.8%) and humans (96.3%).
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark: Builds MMMU-Pro, a more robust benchmark derived from MMMU through a three-step hardening process (filtering out questions solvable from text alone, expanding the answer options to 10, and introducing a vision-only input setting). Performance across all models drops by $16.8\%$ to $26.9\%$, revealing that current multimodal models are far from achieving true cross-modal understanding.
MMSciBench: Benchmarking Language Models on Chinese Multimodal Scientific Problems: This work proposes MMSciBench, a multimodal scientific reasoning benchmark containing 4,482 Chinese high school mathematics and physics problems. It covers both multiple-choice and question-answering formats, across text-only and multimodal (text-image) settings, complete with human-annotated difficulty levels and a three-level knowledge taxonomy. Evaluation shows that the strongest model, Gemini 1.5 Pro 002, only achieves 63.77% accuracy, with a significant performance drop on multimodal problems (a gap of 36.28 percentage points).
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus: Proposes mOSCAR, the first large-scale multilingual multimodal document-level corpus (163 languages, 303M documents, 200B tokens, 1.15B images), extracted as interleaved image-text documents from Common Crawl, demonstrating significant few-shot learning gains for multilingual mLLMs trained on this data.
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering: This work introduces MTVQA, the first multilingual text-centric visual question answering benchmark covering 9 languages. It resolves the "vision-text misalignment" issue of translation-based approaches through human expert annotation. Evaluations reveal a substantial performance gap between the best MLLM (InternVL-2.5, 32.2%) and the human baseline (79.7%), highlighting the severe challenges of multilingual text understanding.
MultiMM: Cultural Bias Matters — Cross-Cultural Benchmark for Multimodal Metaphors: Proposes MultiMM, the first cross-cultural multimodal metaphor dataset containing 8,461 Chinese-English advertisement image-text pairs with fine-grained annotations, and designs the SEMD model integrating sentiment features to enhance metaphor detection.
Multimodal Coreference Resolution for Chinese Social Media Dialogues: Dataset and Benchmark Approach: Proposes TikTalkCoref, the first multimodal coreference resolution dataset for Chinese social media dialogues (based on Douyin short videos), along with a pipeline benchmark containing three modules: textual coreference resolution, visual character tracking, and cross-modal alignment.
NegVQA: Can Vision Language Models Understand Negation?: Proposes the NegVQA benchmark (7,379 binary-choice VQA questions) to systematically evaluate the negation understanding capabilities of 20 VLMs, revealing a sharp performance drop across all models (averaging 29.7%) and uncovering a "U-shaped" scaling trend.
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference: This work constructs OmniAlign-V (a 200K high-quality multimodal SFT dataset) and the MM-AlignBench evaluation benchmark. By utilizing diverse image sources, open-ended question designs, and varied response formats, it significantly enhances the human preference alignment capability of open-source MLLMs, enabling LLaVA-Next-32B to surpass Qwen2VL-72B after SFT+DPO.
A Survey on Patent Analysis: From NLP to Multimodal AI: A systematic survey of NLP and multimodal AI applications in four core patent analysis tasks (classification, retrieval, quality analysis, and generation), proposing a taxonomy based on the patent lifecycle, and revealing methodological evolution trends from Word2Vec+LSTM to BERT/GPT and multimodal models, along with key research gaps.
Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models: This work systematically reveals a significant performance gap (up to 18%) in entity knowledge extraction in vision-language models (VLMs) between visual and textual representations. Using mechanistic interpretability tools, the authors discover that the key information flow of image tokens occurs deep within the intermediate layers of the model, leaving insufficient subsequent layers for factual reasoning.
PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension: This paper proposes PunchBench, a multimodal humor/sarcasm comprehension benchmark containing 6,000 image-text pairs and 54,000 QA pairs. It eliminates language shortcuts through synonymous/antonymous caption generation, and proposes a Simple-to-Complex Chain-of-Question (SC-CoQ) strategy to consistently improve punchline comprehension capabilities across all models and question formats.
R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding: Proposes R-VLM, which introduces region proposals and IoU-aware losses from traditional object detection into VLM-based GUI element grounding. Through two-stage zoom-in inference and an IoU-weighted cross-entropy loss, it achieves an average improvement of 13% in grounding accuracy on ScreenSpot and AgentStudio.
RATE-Nav: Region-Aware Termination Enhancement for Zero-shot Object Navigation with Vision-Language Models: This paper proposes RATE-Nav, a zero-shot object navigation method based on marginal utility theory. By employing geometric predictive region segmentation and region-based exploration rate estimation, combined with the macro-environmental perception capabilities of VLMs, the method decides when to terminate exploration of the current region. It achieves a 67.8% success rate and 31.3% SPL on HM3D, and improves on prior zero-shot methods by approximately 10% on MP3D.
REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark: This work proposes the REAL-MM-RAG multi-modal document retrieval benchmark, defining four key attributes of real-world retrieval benchmarks (multi-modal documents, enhanced difficulty, realistic RAG queries, and accurate annotations). It introduces multi-level query rephrasing robustness evaluation and achieves SOTA retrieval performance through targeted training datasets (a rephrasing dataset and a financial table dataset).
Redundancy Principles for MLLMs Benchmarks: This paper systematically quantifies the redundancy in current MLLM benchmarks across three levels: dimension redundancy, instance redundancy, and cross-benchmark redundancy. It proposes a redundancy analysis framework based on performance ranking correlations, providing principled guidance for future benchmark design.
Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities: By training linear probes on three intermediate representation spaces of VLMs (visual encoder, VL projection layer, and language decoder), this study systematically reveals a counter-intuitive phenomenon: for most visual tasks, the visual encoder and the VL projection layer actually retain sufficient visual information, and the real bottleneck lies in the representation space of the language decoder—where a significant amount of information is lost during transmission from the projection layer to the final text output.
Scalable Vision Language Model Training via High Quality Data Curation: This work proposes the SAIL-VL family of open-source vision-language models (2B/8B). The core contributions lie in constructing the highest-quality SAIL-Caption dataset of 300 million images, being the first to reveal the log-linear scaling law of data volume in VLM pre-training (validated with 655B token experiments), and shifting the scaling curve from log to near-linear through a three-stage curriculum SFT, achieving SOTA performance on 18 benchmarks.
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification: SciVer is the first benchmark dataset for multimodal scientific claim verification, containing 3,000 expert-annotated samples across 1,113 Computer Science (CS) papers. It designs four reasoning subtasks: Direct, Parallel, Sequential, and Analytical. Evaluation of 21 foundation models shows that the strongest model, o4-mini (77.7%), still trails human experts (93.8%) by about 16 percentage points.
SemEval-2025 Task 1: AdMIRe -- Advancing Multimodal Idiomaticity Representation: The SemEval-2025 AdMIRe shared task is designed to evaluate model comprehension of idiomatic expressions in multimodal (text + image) and multilingual (English + Brazilian Portuguese) contexts through two subtasks: image ranking and image sequence completion. The best-performing system achieves near-human performance using Mixture-of-Experts and multi-query smoothing strategies.
Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions: This work proposes leveraging sighted users to "evaluate" rather than "generate" VLM diagram descriptions. This approach builds Sightation, the first multi-task dataset of 5k diagrams and 137k samples validated by BLV expert educators. After preference fine-tuning, a 2B model achieved an average improvement of $1.67\sigma$ in BLV usefulness ratings.
SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning: This paper proposes SingaKids, a multilingual multimodal dialogic language learning tutoring system tailored for primary school students. Through a picture description task, it integrates dense image captioning, multilingual dialogue, speech understanding, and child-friendly speech generation, supporting interactive learning across four languages: English, Chinese, Malay, and Tamil.
Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation: This paper proposes M4Doc, a document image machine translation framework based on "single-to-mix modality alignment." During the training phase, it leverages the joint vision-language representation of Multimodal Large Language Models (MLLMs) to enhance a lightweight image encoder. During inference, the MLLM is discarded to maintain efficiency. This approach achieves significant translation quality improvements in cross-domain generalization and complex document scenarios.
Can Multimodal Large Language Models Understand Spatial Relations?: Proposes the SpatialMQA benchmark to evaluate the spatial relation reasoning capability of MLLMs in a multiple-choice format, revealing that the state-of-the-art model only achieves 48.14% accuracy, far below the human performance of 98.40%.
Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues: This paper proposes VENUS—the first large-scale multimodal dialogue dataset (89,459 dialogues, 14,910 hours) containing temporally aligned text, 3D facial expressions, and body language annotations. Based on this dataset, the authors develop the MARS multimodal language model, which discretizes nonverbal cues using VQ-VAEs to unify them with text in a single autoregressive framework, enabling joint understanding and generation of text and nonverbal actions in dialogues.
SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation: Introduces SPHERE, a three-tier hierarchical spatial reasoning evaluation framework (single-skill $\rightarrow$ multi-skill $\rightarrow$ reasoning). Based on 2,285 human-annotated QA pairs on MS COCO, it reveals a gap of about 25 percentage points between GPT-4o (67.9%) and humans (93.0%), with severe deficiencies particularly in distance judgment, perspective switching, and physical reasoning.
Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images: Proposes S-VCO (Symmetrical Visual Contrastive Optimization), a novel VLM fine-tuning objective. By symmetrically aligning matched image-text pairs and rejecting contradictory ones, it enhances visual reliance. Coupled with the Minimal Visual Contrastive (MVC) dataset, it reduces hallucinations by 22% and significantly improves performance on vision-dependent tasks.
Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific Tables: This paper introduces the TableEval benchmark (3,017 tables, 5 formats) to systematically compare the performance of text LLMs and multimodal LLMs on scientific vs. non-scientific table understanding tasks. It reveals that while models remain robust to table modalities (image vs. text), their performance decreases significantly on scientific tables.
Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions: This paper proposes the ClearVQA benchmark and an automated data generation pipeline to teach VLMs to actively raise clarification questions rather than forcing an answer when encountering ambiguous visual questions. By systematizing interactive VQA through three categories of ambiguity (referential ambiguity, attribute ambiguity, and relational ambiguity), experiments show that fine-tuned VLMs can significantly improve ambiguity recognition and clarification quality. This work was recognized with the ACL 2025 SAC Highlight Award.
TheoremExplainAgent: Towards Video-based Multimodal Explanations for LLM Theorem Understanding: This paper proposes TheoremExplainAgent, a dual-agent system (Planner + Coder) that automatically generates up to 10-minute-long theorem explanation videos via Manim animation scripts. Accompanying this is TheoremExplainBench (240 STEM theorems evaluated across 5 dimensions), proving that agentic planning is key to generating long-form videos, and showing that visual explanations can expose reasoning flaws that textual evaluations fail to detect.
Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings: This paper systematically studies compression strategies for patch-level embeddings in Visual Document Retrieval (VDR). It finds that pruning is inherently unsuitable for VDR (where simple random pruning unexpectedly performs best), whereas token merging combined with fine-tuning preserves 94.6% of retrieval performance while retaining only 2.8% of storage (Light-ColPali/ColQwen2).
Transferring Textual Preferences to Vision-Language Understanding through Model Merging: This work proposes a training-free method to transfer the preference capability of text-only reward models (RMs) into Large Vision-Language Models (LVLMs) through model parameter merging, building a Vision-Language Reward Model (VLRM) that outperforms direct LVLM scoring and text-only RMs across multiple multimodal evaluation benchmarks.
Unsolvable Problem Detection: Evaluating Trustworthiness of Large Multimodal Models: This work proposes the Unsolvable Problem Detection (UPD) task to systematically evaluate whether Large Multimodal Models (LMMs) can correctly refuse to answer unanswerable MCQA questions. It covers three types of unsolvable problems (absent answers, incompatible options, and image-text mismatches), revealing a dimension of trustworthiness overlooked by existing benchmarks.
Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension: This paper proposes the CAPTex benchmark. Through culture-specific procedural text comprehension tasks (such as step ordering, multiple-choice questions, and conversational reasoning) across 7 countries/languages, it systematically reveals the blind spots and limitations of multilingual large models in understanding culturally specific procedural texts.
Unveiling the Lack of LVLM Robustness to Fundamental Visual Variations: Why and Path Forward: This work proposes V2R-Bench, a benchmark framework to systematically evaluate the robustness of 21 LVLMs against four fundamental visual variations (position, scale, orientation, and context). It reveals significant vulnerabilities in even advanced models on simple visual tasks and demonstrates through component-level analysis that these vulnerabilities stem from insufficient multimodal alignment and error accumulation in pipelined architectures rather than data scarcity.
Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value Decomposition: The paper proposes the Value-Spectrum benchmark, which utilizes over 50K social media short-video screenshots and the Schwartz theory of basic human values to systematically evaluate the intrinsic value preferences of Vision-Language Models (VLMs) and their alignment capability during persona role-playing.
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos: This paper proposes the VF-Eval benchmark to systematically evaluate the capability of 13 MLLMs in providing feedback on AIGC videos across four tasks: consistency verification, error perception, error type detection, and reasoning evaluation. The evaluation reveals that even GPT-4.1 struggles to perform consistently across all tasks, highlighting the challenges of AIGC video understanding.
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding: Proposes ViGiL3D—a linguistically diverse diagnostic dataset and an automatic analysis framework—to evaluate 3D visual grounding (3DVG) methods across various linguistic phenomena such as negation, coarse-grained references, and coreference resolution, revealing a significant performance drop (up to 20+ points) of existing methods on out-of-distribution prompts.
Vision-Language Models Struggle to Align Entities across Modalities: This paper proposes the MATE benchmark (5,500 QA instances) to systematically evaluate the entity linking performance of VLMs via cross-modal attribute retrieval tasks in synthetic 3D scenes. The study reveals that even the strongest closed-source models still lag behind humans by approximately 15 percentage points, and their performance drops sharply as the number of objects in the scene increases—primarily due to cross-modal feature binding difficulties rather than single-modal perception.
VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues: This paper proposes VLM2-Bench, a benchmark designed to evaluate the capability of vision-language models (VLMs) in cross-image/frame "visual cue association". It covers 9 subtasks divided into 3 major categories (general, object-centric, and person-centric cues) with a total of 3000+ test samples. The study reveals that even state-of-the-art commercial models lag behind humans by over $30\%$ on this task, highlighting a significant gap in foundational visual matching capabilities.
VLMInferSlow: Evaluating the Efficiency Robustness of Large Vision-Language Models as a Service: This work is the first to investigate the efficiency robustness of VLMs in black-box settings, proposing the VLMInferSlow method. By searching for adversarial image perturbations via zeroth-order optimization to force VLMs to generate longer sequences, it increases computational costs by up to 128.47%, revealing the efficiency-related security vulnerabilities of VLMs deployed under MLaaS scenarios.
VLSBench: Unveiling Visual Leakage in Multimodal Safety: This work reveals the issue of Visual Safety Information Leakage (VSIL) in existing multimodal safety benchmarks—where hazardous content in images is already exposed in text queries, enabling models to refuse based solely on text and rendering safety evaluations unreliable. To address this, the authors construct the leakage-free VLSBench benchmark (2.2k image-text pairs) and find that multimodal alignment significantly outperforms text-only alignment in VSIL-free scenarios.
Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains: This paper proposes the Focus-Centric Visual Chain multi-image reasoning paradigm, achieving cross-image reasoning through question decomposition and stepwise focusing on key visual information. It constructs the VISC-150K dataset, leading to consistent performance improvements of 2-3% across seven multi-image benchmarks.
WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts: This paper proposes WikiMixQA, a benchmark consisting of 1,000 multiple-choice questions that require cross-modal reasoning over tables and charts. Evaluating 12 VLLMs reveals that while closed-source models achieve around 70% accuracy when provided with precise context, their performance drops drastically when retrieval from long documents is required. Open-source models reach a maximum accuracy of only 27%, highlighting the severe shortcomings of existing vision-language models in long-context multimodal document understanding.
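Several entries above quote a model score next to a human score and then describe the difference as a "gap". As a minimal sketch (not taken from any of the papers), the snippet below recomputes those gaps in percentage points from the figures quoted in this list, along with the MegaPairs data ratio. The function names `gap_points` and `data_ratio` are hypothetical helpers introduced only for illustration, and the numbers are copied from the summaries, not independently verified.

```python
# Scores quoted in the summaries above, as (human %, best model %).
QUOTED_SCORES = {
    "MMInA": (96.3, 21.8),
    "MTVQA": (79.7, 32.2),
    "SciVer": (93.8, 77.7),
    "SPHERE": (93.0, 67.9),
    "SpatialMQA": (98.40, 48.14),
}


def gap_points(human: float, model: float) -> float:
    """Absolute human-model gap in percentage points, rounded to one decimal."""
    return round(human - model, 1)


def data_ratio(larger: float, smaller: float) -> float:
    """How many times more training data the larger setup uses."""
    return round(larger / smaller, 1)


if __name__ == "__main__":
    for name, (human, model) in QUOTED_SCORES.items():
        print(f"{name}: {gap_points(human, model)} percentage points")
    # MegaPairs entry: MagicLens uses 36.7M samples, MMRet uses 0.5M.
    print(f"MegaPairs data ratio: {data_ratio(36.7, 0.5)}x")
```

Under these quoted figures, the SciVer and SPHERE gaps come out at roughly 16 and 25 percentage points, and the MegaPairs ratio at roughly 73, consistent with the rounded values given in the entries.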