🧩 Multimodal VLM

💬 ACL2026 · 52 paper notes

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

This paper presents a systematic survey of Multimodal Large Language Model (MLLM)-based Visually Rich Document Understanding (VRDU), organizing OCR-based and OCR-free methods along two dimensions—feature representation/fusion and training paradigms—while discussing emerging directions such as data scarcity, multi-page documents, multilingual support, RAG, and agent-based frameworks.

Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization

This paper proposes the GPRO framework, which addresses the overthinking problem in LVLMs by inserting a meta-reasoning controller that dynamically routes computation at each token generation step to one of three paths (fast / perception re-examination / reasoning reflection), simultaneously improving both accuracy and efficiency.

AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

This paper proposes AICA-Bench, a comprehensive benchmark covering three dimensions — Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG) — to evaluate 23 VLMs. The evaluation reveals two systematic deficiencies, intensity calibration failure and shallow description, and the paper introduces GAT Prompting, a training-free framework that mitigates both.

All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction

This paper proposes RepMD, a method that constructs a Design Concept Graph (DCG)—inspired by attack trees to model the steps and logic behind malicious meme creation—to guide MLLMs in detecting ever-shifting harmful memes, achieving 81.1% accuracy on GOAT-Bench.

Automatic Slide Updating with User-Defined Dynamic Templates and Natural Language Instructions

This paper formalizes a novel task of dynamic slide updating on user-defined templates guided by natural language instructions, constructs the DynaSlide benchmark comprising 20,036 instruction-execution triplets, and proposes SlideAgent as a strong reference baseline.

Benchmarking Deflection and Hallucination in Large Vision-Language Models

This paper proposes VLM-DeflectionBench, a multimodal benchmark comprising 2,775 samples that systematically evaluates the deflection vs. hallucination behavior of large vision-language models (LVLMs) under insufficient or misleading evidence, through four evaluation scenarios (Parametric / Oracle / Realistic / Adversarial). Experiments covering 20 state-of-the-art LVLMs reveal that virtually no model can reliably deflect under noisy evidence.

CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity

This paper introduces CArtBench — a multi-task benchmark grounded in the Palace Museum collection — to evaluate VLMs across four capabilities in Chinese art understanding (evidence-anchored QA, structured connoisseurship, defensible reinterpretation, and authenticity verification). Even the strongest models exhibit significant performance drops in evidence association and style-period reasoning, while authenticity verification approaches random-chance performance.

CogGen: A Cognitively Inspired Recursive Framework for Deep Research Report Generation

This paper proposes CogGen, a multi-agent recursive framework that simulates the human cognitive writing process: it achieves global restructuring via a macro-cognitive loop, parallel section refinement via a micro-cognitive cycle, and semantic-level text–chart co-planning via Abstract Visual Representation (AVR). On the OWID benchmark, CogGen reaches human expert-level performance and surpasses Gemini Deep Research.

Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

This paper proposes a collaborative multi-agent framework for automatically generating high-quality murder mystery game scripts and training data. Through a two-stage training strategy (CoT fine-tuning + GRPO reinforcement learning with ScoreAgent reward shaping), it enhances VLM multi-hop reasoning under imperfect information, achieving significant improvements on WhodunitBench in narrative reasoning, fact extraction, and deception resistance.

Doc-PP: Document Policy Preservation Benchmark for Large Vision-Language Models

This paper proposes the Doc-PP benchmark, revealing a "reasoning-induced safety gap" in large vision-language models (LVLMs) during multimodal document question answering—models bypass explicit non-disclosure policies and leak sensitive information when cross-modal reasoning is required. A structured reasoning framework, DVA (Decompose–Verify–Aggregation), is proposed to substantially reduce leakage rates.

Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction

This paper proposes VeriGUI, a framework that incorporates a Thinking-Verification-Action-Expectation (TVAE) closed-loop reasoning mechanism and a two-stage training pipeline (Robust SFT + GRPO), enabling GUI agents to verify whether each action succeeds and self-correct upon failure. VeriGUI achieves substantial improvements over baselines at both 3B and 7B scales.

Dynamic Emotion and Personality Profiling for Multimodal Deception Detection

This paper identifies that existing deception detection datasets provide only participant-level emotion/personality labels (all samples from the same subject share identical labels), and proposes a sample-level dynamic annotation scheme along with a reliability-weighted multimodal fusion framework, Rel-DDEP, achieving improvements of 2.53% in deception detection F1, 2.66% in emotion detection F1, and 9.30% in personality detection F1.

Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

This paper proposes a systematic taxonomy for efficient inference in large vision-language models (LVLMs), analyzing bottlenecks across an encode–prefill–decode three-stage pipeline. It identifies a systemic efficiency barrier caused by visual token dominance and presents a comprehensive map of optimization techniques spanning information density shaping, long-context attention management, and memory bandwidth breakthroughs.

Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning

This paper constructs a benchmark for ancient Chinese character evolution analysis comprising 11 tasks and 130,000+ instances, evaluates 19 MLLMs to reveal their limited capacity for glyph-level recognition and evolution reasoning, and proposes GEVO—a glyph-driven contrastive fine-tuning framework—that achieves consistent improvements across all tasks on a 2B-scale model.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

This paper formally defines the multimodal error detection task and constructs the ErrorRadar benchmark — comprising 2,500 K-12 multimodal math problems drawn from real student responses — to evaluate MLLMs on two subtasks: error step identification (STEP) and error type classification (CATE). The strongest model, GPT-4o, still lags behind human evaluators by approximately 10–15%.

Faithful-First Reasoning, Planning, and Acting for Multimodal LLMs

This paper proposes the Faithful-First RPA framework, which employs the FaithEvi pipeline to evaluate perceptual faithfulness at each reasoning step (i.e., whether claimed objects genuinely exist in the image), and the FaithAct mechanism to enforce evidence-grounded planning and acting during reasoning generation. The framework improves perceptual faithfulness by up to 24% without degrading task accuracy.

FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models

FineSteer decomposes inference-time steering into two complementary stages: Subspace-guided Conditional Steering (SCS) determines when to steer — using the subspace energy ratio of IR queries as a gate; Mixture of Steering Experts (MoSE) determines how to steer — dynamically aggregating prototype experts via an attention gating network with residual refinement to produce query-specific steering vectors. The framework surpasses SOTA on both safety and truthfulness benchmarks.
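The "when to steer" gate can be sketched in a few lines. This is a generic illustration of subspace energy-ratio gating, not the paper's implementation; the subspace `Q`, threshold `tau`, and gating rule below are assumptions.

```python
import numpy as np

def scs_gate(z, Q, tau=0.5):
    """Decide *when* to steer: compute the fraction of the query
    representation z's energy lying in the concept subspace spanned by
    the orthonormal columns of Q, and fire only above threshold tau.
    (Illustrative sketch; Q, tau, and the rule are assumptions.)"""
    proj = Q @ (Q.T @ z)                       # projection onto the subspace
    ratio = float(proj @ proj) / float(z @ z)  # subspace energy ratio
    return ratio > tau, ratio

Q = np.array([[1.0], [0.0]])                   # toy 1-D subspace in 2-D space
fire, r = scs_gate(np.array([3.0, 1.0]), Q)
print(fire, round(r, 2))  # True 0.9
```

A query aligned with the concept subspace passes the gate; an orthogonal one does not, so steering cost is paid only where it matters.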

From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models

This paper proposes HONES, a framework that first localizes task-critical attention heads and then uses them as conditions to guide FFN neuron attribution, achieving unified, gradient-free, neuron-level causal analysis and lightweight task performance improvement across heterogeneous tasks in multi-task VLMs.

From Inheritance to Saturation: Disentangling the Evolution of Visual Redundancy for Architecture-Aware MLLM Inference Acceleration

This work identifies two sources of visual redundancy in MLLM inference — inherent visual redundancy (IVR) arising from dense ViT tokenization, and secondary saturation redundancy (SSR) emerging from deep-layer semantic saturation whose manifestation varies across backbone architectures — and proposes the HalfV framework to address each type separately, achieving a 4.1× FLOPs speedup on Qwen2.5-VL while retaining 96.8% of performance.

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck

This paper proposes MM-Mem, a pyramidal multimodal memory architecture inspired by Fuzzy Trace Theory (FTT). The memory is organized into three hierarchical layers — a Sensory Buffer (vision-dominant), an Episodic Stream (event-level summaries), and a Symbolic Schema (knowledge graph) — and achieves SOTA performance on 4 long-video benchmarks by compressing redundancy bottom-up via SIB-GRPO (Semantic Information Bottleneck + RL) and retrieving top-down via entropy-driven adaptive depth selection.

GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models

This paper proposes GAMBIT, a gamified multimodal jailbreak framework that decomposes harmful queries into puzzle images with hidden keywords and embeds them within competitive game scenarios. By exploiting the model's reasoning incentives and cognitive load, GAMBIT bypasses safety filters, achieving attack success rates of 92.13% on Gemini 2.5 Flash and 85.87% on GPT-4o, and is effective against both reasoning and non-reasoning models.

GeoRC: A Benchmark for Geolocation Reasoning Chains

This paper introduces GeoRC, the first geolocation reasoning chain benchmark authored by GeoGuessr champion-level experts (800 reasoning chains, 500 scenes), designed to evaluate VLMs' ability to generate auditable reasoning chains. Findings reveal that while closed-source VLMs can match human-level localization accuracy, their reasoning chain quality remains substantially inferior, and open-source VLMs perform nearly on par with a pure hallucination baseline.

HiPrune: Hierarchical Attention for Efficient Token Pruning in Vision-Language Models

This paper identifies a hierarchical attention pattern in vision encoders—middle layers attend to foreground objects while deep layers capture global information—and proposes HiPrune, a training-free, model-agnostic visual token pruning method. By selecting three categories of tokens (Anchor/Buffer/Register) to preserve information at different semantic levels, HiPrune retains 99.3% of performance using only 1/3 of the tokens while reducing FLOPs by 58.7%.
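The three-tier selection can be sketched as follows; the layer choices, token counts, and neighbour rule here are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def hiprune_select(mid_attn, deep_attn, n_anchor=4, n_buffer=1, n_register=2):
    """Toy sketch of hierarchical token selection. Anchors = top tokens by
    middle-layer attention (foreground objects); buffers = their immediate
    spatial neighbours; registers = top tokens by deep-layer attention
    (global context). All hyperparameters are illustrative."""
    n = len(mid_attn)
    anchors = set(np.argsort(mid_attn)[-n_anchor:])
    buffers = set()
    for a in anchors:                       # keep neighbours of each anchor
        for d in range(1, n_buffer + 1):
            if a - d >= 0:
                buffers.add(a - d)
            if a + d < n:
                buffers.add(a + d)
    registers = set(np.argsort(deep_attn)[-n_register:])
    return sorted(int(i) for i in anchors | buffers | registers)

rng = np.random.default_rng(0)
keep = hiprune_select(rng.random(16), rng.random(16))
print(keep)  # indices of retained visual tokens
```

All remaining visual tokens are simply dropped before the LLM stage, which is where the FLOPs saving comes from.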

LaMI: Augmenting Large Language Models via Late Multi-Image Fusion

This paper proposes LaMI, a late-fusion architecture that integrates visual features with LLM outputs at the final prediction stage, and at inference time generates multiple images from input text for confidence-weighted aggregation. LaMI significantly enhances the visual commonsense reasoning capability of LLMs without compromising their text reasoning performance.
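The confidence-weighted aggregation step might look roughly like this; the mixing weight `alpha` and the weighting scheme are assumptions for illustration, not LaMI's actual formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def late_fuse(text_logits, image_logits_list, confidences, alpha=0.5):
    """Late fusion at the prediction stage (illustrative sketch): the
    visual vote is a confidence-weighted average of the per-image
    prediction distributions, mixed with the text-only distribution."""
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()                          # normalise confidence weights
    visual = sum(wi * softmax(np.asarray(li, float))
                 for wi, li in zip(w, image_logits_list))
    return (1 - alpha) * softmax(np.asarray(text_logits, float)) + alpha * visual

# Text-only prediction is undecided; two generated images both favour option 1.
fused = late_fuse([0.0, 0.0], [[0.0, 3.0], [0.0, 2.0]], [0.9, 0.5])
print(int(np.argmax(fused)))  # 1
```

Because fusion happens only at the output distribution, the LLM's text-reasoning path is left untouched, matching the paper's claim of no text-performance degradation.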

Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Unauthorized Images

This paper proposes ImageProtector, which embeds nearly imperceptible adversarial perturbations into images as a visual prompt injection attack, causing MLLMs to generate refusal responses when analyzing protected images. This prevents malicious actors from exploiting open-weight MLLMs to extract private information from images at scale.

Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

This paper exposes the threat of Adversarial Smuggling Attacks (ASA) in MLLM-based content moderation—encoding harmful content into visually human-readable but AI-imperceptible formats to evade automated detection. The authors construct SmuggleBench, a benchmark of 1,700 samples spanning 9 attack techniques, and demonstrate that SOTA models including GPT-5 achieve attack success rates exceeding 90%.

MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

This work proposes the FlowVerse benchmark—which decomposes mathematical problem information into four components (DI/EI/RP/OQ) and constructs six variant versions—and the MathFlow modular pipeline, which decouples perception and reasoning into independent stages. A dedicated perception model, MathFlow-P-7B, is trained to extract critical information from mathematical diagrams, substantially improving visual mathematical problem-solving performance across diverse reasoning models.

MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models

This paper presents MedLayBench-V, the first large-scale multimodal medical expert-lay semantic alignment benchmark (79,793 image-text pairs). Through a Structured Concept-Grounded Refinement (SCGR) pipeline, professional radiology reports are transformed into lay descriptions, reducing reading difficulty from graduate level to high school level while preserving clinical semantic fidelity. Zero-shot retrieval experiments demonstrate that lay descriptions incur less than 1% performance degradation.

Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation

This paper proposes the MPD framework, which decouples hallucination components via semantics-aware orthogonal subspace projection and selectively updates only the parameters most relevant to hallucinations. MPD reduces hallucinations by 23.4% while preserving 97.4% of general generation capability, without introducing any additional inference overhead.
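The orthogonal-projection idea behind selective updating can be shown in miniature. This is a generic sketch of projecting an update away from a protected subspace, with toy stand-ins for MPD's learned components:

```python
import numpy as np

def protected_update(delta, K):
    """Apply a hallucination-targeted parameter update only in directions
    orthogonal to the protected subspace spanned by the columns of K
    (stand-in for directions carrying general generation capability).
    Illustrative sketch, not the paper's method."""
    Q, _ = np.linalg.qr(K)            # orthonormal basis of protected subspace
    return delta - Q @ (Q.T @ delta)  # strip the protected component

K = np.array([[1.0], [0.0], [0.0]])   # protect the first coordinate direction
d = np.array([2.0, 1.0, -1.0])
print(protected_update(d, K))         # the protected component [2, 0, 0] is removed
```

The residual update can then be applied safely: by construction it has zero component along the capability-critical directions.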

MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models

This paper presents MMErroR, a multimodal erroneous reasoning benchmark comprising 1,997 samples, each containing exactly one deliberately injected reasoning error, spanning 6 domains and 4 error types. The benchmark requires VLMs to not only detect the presence of errors in reasoning chains but also classify the error type (Visual Perception Error / Knowledge Deficiency Error / Question Comprehension Error / Reasoning Error). Evaluation of 12 representative VLMs reveals that even the strongest model, Gemini-3-Pro-Preview, achieves only 66.65% accuracy.

More Than Meets the Eye: Measuring the Semiotic Gap in Vision-Language Models via Semantic Anchorage

This paper exposes the "literal superiority bias" in VLMs from a cognitive-semiotic perspective—models tend toward literal rather than metaphorical/idiomatic interpretations of high-fidelity images. By introducing the DIVA benchmark (iconographically abstracted images) and the Semantic Alignment Gap metric, the paper demonstrates that reducing visual fidelity significantly narrows the gap between literal and idiomatic interpretations.

Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge

This paper proposes MT-RL-Judge, a multi-task reinforcement learning framework that jointly optimizes multiple evaluation tasks via GRPO to train a unified MLLM-as-a-Judge model. The framework consistently outperforms SFT baselines across six benchmarks covering text-image alignment, safety compliance, and visual quality assessment, and demonstrates strong out-of-distribution generalization on the unseen MJ-Bench pairwise comparison format (82.23% on Safety vs. 49.40% for SFT-Unified).

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models

This paper presents OMIBench — the first large-scale benchmark for olympiad-level multi-image reasoning, covering 1,000+ competition problems across biology, chemistry, mathematics, and physics. Even the strongest LVLM (Gemini-3-Pro) achieves only ~50% accuracy, a drop of more than 25% compared to single-image benchmarks.

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

This paper proposes OEA (Omni-Embed-Audio), which employs a multimodal LLM as a unified encoder to construct a retrieval-oriented audio-text embedding space. It introduces the User-Intent Queries (UIQ) benchmark and hard-negative discrimination metrics (HNSR/TFR), demonstrating that the LLM backbone significantly outperforms CLAP-based methods on T2T retrieval (+22%) and hard negative discrimination (+4.3%p HNSR@10).

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

This position paper argues that Multimodal Large Language Models (MLLMs) can significantly advance cross-disciplinary scientific reasoning. It proposes a four-stage research roadmap (broad knowledge recognition → analogical reasoning & generalization → insightful reasoning → creative hypothesis generation), and systematically surveys the current state of MLLM applications across mathematics, physics, chemistry, and biology, identifying five major challenges and eight future directions.

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

This paper proposes the Representational Contrastive Scoring (RCS) framework, which analyzes the geometric structure of intermediate-layer representations within LVLMs. By learning a lightweight projection and applying contrastive scoring, RCS distinguishes malicious intent from benign distributional shift, achieving state-of-the-art jailbreak detection performance under a rigorous cross-attack-type generalization evaluation protocol.
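The contrastive-scoring step reduces to a geometric comparison in a projected space. A toy sketch, where the projection `P` and the two centroids stand in for the paper's learned components:

```python
import numpy as np

def rcs_score(h, P, mu_benign, mu_malicious):
    """Contrastive score on a projected intermediate-layer representation:
    positive when the projected state sits closer to the malicious centroid
    than to the benign one. (Illustrative; P and the centroids are assumed.)"""
    z, zb, zm = P @ h, P @ mu_benign, P @ mu_malicious
    return float(np.linalg.norm(z - zb) - np.linalg.norm(z - zm))

P = np.eye(3)                          # identity projection for the toy case
mu_b, mu_m = np.zeros(3), np.ones(3)
print(rcs_score(np.full(3, 0.9), P, mu_b, mu_m) > 0)  # flagged as malicious-like
```

Thresholding this score gives a lightweight detector that needs no generation pass, consistent with the framework's inference-time setting.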

SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

This paper proposes the SafetyALFRED benchmark, which introduces six categories of kitchen safety hazards into the ALFRED embodied task setting. It reveals a critical alignment gap: multimodal large language models can identify hazards in static QA (up to 92%) but fail to proactively mitigate them during embodied planning (<60%), advocating a paradigm shift from QA-based to embodied safety evaluation.

Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking

This paper proposes Attention-Guided Visual Jailbreaking, which bypasses—rather than directly confronts—safety alignment mechanisms by suppressing model attention to safety instructions and anchoring attention to adversarial image features. The method achieves a 94.4% attack success rate (ASR) on Qwen-VL while reducing gradient conflicts by 45%.

Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation

This paper proposes DaID (Dual-Anchor Introspective Decoding), which mitigates hallucinations within a single forward pass by exploiting layer-wise differences in visual perception within MLLMs — Spotlight layers amplify visual signals while Shadow layers suppress language priors.
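The layer-contrast idea follows the general contrastive-decoding recipe; the linear contrast and `alpha` below are assumptions for illustration, not DaID's exact formula:

```python
import numpy as np

def dual_anchor_step(spotlight_logits, shadow_logits, alpha=1.0):
    """One introspective decoding step (toy sketch): boost what the
    vision-amplifying 'Spotlight' layer predicts relative to the
    prior-driven 'Shadow' layer before picking the next token."""
    s = np.asarray(spotlight_logits, float)
    sh = np.asarray(shadow_logits, float)
    return int(np.argmax(s + alpha * (s - sh)))

# The language prior (Shadow) pulls toward token 0; the visual evidence
# (Spotlight) only weakly supports token 2, but the contrast recovers it.
tok = dual_anchor_step([2.0, 0.5, 1.5], [3.0, 0.5, 0.5])
print(tok)  # 2, even though the raw Spotlight logits alone would pick token 0
```

Since both sets of logits come from the same forward pass at different depths, the correction adds no extra decoding passes.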

Targeted Exploration via Unified Entropy Control for Reinforcement Learning

This paper proposes UEC-RL, a unified bidirectional entropy control framework that addresses entropy collapse and training instability in GRPO by performing high-temperature targeted exploration on difficult prompts (entropy increase) and consolidating high-quality trajectories via an experience replay stabilizer (entropy decrease), achieving a 37.9% relative improvement on Geometry3K.
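The quantity being controlled is the policy's sampling entropy, and temperature is the simplest knob that raises it. A generic illustration (not the paper's update rule):

```python
import numpy as np

def policy_entropy(logits, temperature=1.0):
    """Shannon entropy of the softmax sampling policy at a given
    temperature: higher temperature flattens the distribution and
    raises entropy, which is what high-temperature targeted
    exploration exploits on difficult prompts."""
    p = np.exp(np.asarray(logits, float) / temperature)
    p = p / p.sum()
    return float(-(p * np.log(p)).sum())

logits = [3.0, 1.0, 0.5, 0.2]
print(policy_entropy(logits, 1.0) < policy_entropy(logits, 2.0))  # True
```

The replay stabiliser works in the opposite direction, re-training on high-quality trajectories to pull entropy back down once exploration has paid off.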

TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

This paper proposes TEMA (Text-oriented Entity Mapping Architecture), the first CIR framework designed for multi-modification text (MMT). It enhances entity coverage via a Parsing Assistant (PA), resolves clause-entity misalignment via an Entity Mapping (EM) module, and introduces two multi-modification benchmarks—M-FashionIQ and M-CIRR—achieving state-of-the-art performance in both standard and multi-modification settings.

Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding

This paper proposes the Perception Magnifier (PM), a visual decoding method that, at each autoregressive decoding step, iteratively identifies critical visual regions based on multi-layer attention and adaptively magnifies them. By increasing the effective resolution of key regions, PM mitigates visual hallucinations in VLMs while preserving spatial structure and reasoning capability.

Topology-Aware Layer Pruning for Large Vision-Language Models

This paper proposes TopoVLM, a topology-aware layer pruning framework that models hidden states at each layer as point clouds and quantifies inter-layer topological consistency via zigzag persistent homology. The method adaptively retains critical representation-transition layers while removing structurally redundant ones, achieving significant improvements over existing pruning methods at 50–60% sparsity.

Tree-of-Evidence: Efficient "System 2" Search for Faithful Multimodal Grounding

This paper proposes Tree-of-Evidence (ToE), an inference-time discrete beam search algorithm that formalizes multimodal model interpretability as a discrete optimization problem over coarse-grained evidence units (vital sign time windows, radiology report segments). With only 5 evidence units, ToE retains over 98% of the full-input model's AUROC while generating auditable evidence trace paths.
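The search skeleton is an ordinary discrete beam search over evidence subsets. A self-contained sketch, where `score_fn` stands in for the model's utility on a subset (the evidence-unit names below are hypothetical):

```python
def evidence_beam_search(units, score_fn, budget=5, beam_width=3):
    """Discrete beam search over evidence subsets: grow each kept subset
    by one unit per round, keep the best `beam_width` subsets, stop at
    the budget. Illustrative skeleton, not the paper's code."""
    beams = [((), 0.0)]
    for _ in range(budget):
        candidates = {}
        for subset, _ in beams:
            for u in units:
                if u in subset:
                    continue
                grown = tuple(sorted(subset + (u,)))
                candidates[grown] = score_fn(grown)
        beams = sorted(candidates.items(), key=lambda kv: kv[1],
                       reverse=True)[:beam_width]
    return beams[0]

# Toy utility: each evidence unit contributes a fixed amount of signal.
signal = {"ecg_win": 3.0, "note_seg": 2.0, "lab_win": 1.0, "cxr_seg": 5.0}
best, score = evidence_beam_search(list(signal),
                                   lambda s: sum(signal[u] for u in s),
                                   budget=2, beam_width=2)
print(best, score)  # ('cxr_seg', 'ecg_win') 8.0
```

The sequence of kept subsets doubles as the auditable evidence trace: each round records which unit was added and why it beat the alternatives.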

TRACE: Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

This paper proposes TRACE (Textual Representation of Allocentric Context from Egocentric Video), a prompting method that guides multimodal large language models to generate structured textual allocentric 3D environment representations from egocentric video—comprising meta context, camera trajectory, and an entity registry—as intermediate reasoning steps to enhance spatial question answering. TRACE consistently outperforms existing prompting strategies on both VSI-Bench and OST-Bench.

VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models

This paper proposes VLA-Forget, the first hybrid unlearning framework for vision-language-action (VLA) models. It employs ratio-aware selective editing for perception/cross-modal layers and significance-based selective editing for reasoning/action layers, achieving targeted behavior removal while improving perceptual specificity (+22%) and task success rate (+9%).

What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

This paper proposes UILoop (UI-in-the-Loop), a paradigm that restructures GUI reasoning from the conventional "screen→action" pipeline into a cyclic "screen→UI elements→action" process. Through UI element-driven reinforcement fine-tuning, the model is trained to explicitly locate, understand, and leverage key UI elements, achieving state-of-the-art performance on GUI reasoning benchmarks.

What Do Vision-Language Models Encode for Personalized Image Aesthetics Assessment?

Through linear probing, this paper demonstrates that VLM hidden representations encode rich, multi-level aesthetic attribute information (illumination, color, composition, etc.) that propagates into language decoder layers. Building on this finding, the authors propose a simple linear regression approach for personalized image aesthetics assessment (PIAA) that requires no fine-tuning, significantly outperforming few-shot and LoRA fine-tuning baselines.
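A linear probe of this kind is just closed-form ridge regression on frozen features. A generic sketch with synthetic stand-ins for the VLM hidden states (the feature layer and regulariser are assumptions):

```python
import numpy as np

def ridge_probe(H, y, lam=1e-2):
    """Closed-form ridge-regression probe: map frozen hidden states H
    (n x d) to scalar aesthetic scores y, with no fine-tuning of the
    backbone. Generic linear-probing sketch."""
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ y)

rng = np.random.default_rng(1)
H = rng.normal(size=(200, 16))           # stand-in for frozen VLM hidden states
w_true = rng.normal(size=16)             # latent aesthetic direction
y = H @ w_true + 0.01 * rng.normal(size=200)
w = ridge_probe(H, y)
print(np.allclose(w, w_true, atol=0.1))  # the probe recovers the latent direction
```

If a probe this simple recovers the target signal, the information was already linearly decodable from the representations, which is exactly the paper's argument.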

When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life

This paper introduces SaLAD, a benchmark comprising 2,013 real image-text samples spanning 10 daily-life categories, designed to evaluate the ability of multimodal large language models to identify implicit safety risks and provide safety warnings during everyday assistance. Results reveal that even the strongest model achieves only 57.2% accuracy on unsafe queries.

When Slower Isn't Truer: Inverse Scaling Law of Truthfulness in Multimodal Reasoning

This paper identifies an "inverse scaling law" in multimodal reasoning models — slow-thinking (reasoning) models are more prone to producing untruthful outputs than fast-thinking (chat) models when faced with misleading visual inputs. To systematically diagnose this phenomenon, the authors construct the TruthfulVQA benchmark (5,000+ samples, 50 annotators, three-tier graded prompts) and the TruthfulJudge evaluation model (88.4% accuracy).

When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias

This paper exposes a severe informativeness bias in VLM-as-a-Judge systems—judges tend to favor more detailed and elaborate responses even when such responses contradict the image content. The proposed BIRCH paradigm first calibrates candidate answers against the image before comparison, reducing bias by up to 17% and improving performance by up to 9.8%.

WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

This paper proposes WikiSeeker, which redefines the role of VLMs in multimodal RAG—transforming them from mere answer generators into two specialized agents: a Refiner (trained with RL to rewrite queries) and an Inspector (to verify the reliability of retrieved contexts). WikiSeeker achieves state-of-the-art performance on three benchmarks: EVQA, InfoSeek, and M2KR.