🧩 Multimodal VLM
💬 ACL2026 · 83 paper notes
📌 Same area in other venues: 📷 CVPR2026 (388) · 🔬 ICLR2026 (211) · 🧪 ICML2026 (89) · 🤖 AAAI2026 (75) · 🧠 NeurIPS2025 (105) · 📹 ICCV2025 (106)
🔥 Top topics: Multimodal/VLM ×42 · LLM ×8 · Layout & Composition ×4 · Alignment/RLHF ×3 · RAG ×3
- A Survey of Deep Learning for Geometry Problem Solving
-
To be added after in-depth reading.
- A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
-
This paper systematically reviews Visually Rich Document Understanding (VRDU) based on Multimodal Large Language Models (MLLMs), categorizing OCR-based and OCR-free methods from two dimensions: feature representation/fusion and training paradigms, while discussing emerging directions such as data scarcity, multi-page documents, multilingual support, RAG, and agents.
- AdaTooler-V: Adaptive Tool-Use for Images and Videos
-
This paper identifies a widespread blind tool-use problem in existing "thinking with images" MLLMs—models tend to force zoom-in or frame extraction for all visual questions, resulting in overthinking that degrades accuracy and increases inference costs. To address this, the authors propose AdaTooler-V, which introduces the AT-GRPO reinforcement learning algorithm. By using a sample-level Tool Benefit Score to dynamically adjust reward scales (encouraging tool use when effective and penalizing it when unnecessary), a 7B model achieves 89.8% on the V* high-resolution benchmark, surpassing GPT-4o and Gemini 1.5 Pro.
- AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce
-
This paper proposes AFMRL, a framework that frames fine-grained understanding of e-commerce products as an attribute generation task. It enhances contrastive learning with key attributes generated by an MLLM (AGCL) and, in turn, optimizes the attribute generator using retrieval performance as a reward signal (RAR), achieving SOTA retrieval performance on large-scale e-commerce datasets.
- AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis
-
This paper proposes AICA-Bench, a comprehensive benchmark covering three dimensions: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). Evaluating 23 VLMs reveals two systematic flaws: intensity calibration failure and shallow descriptions. A training-free framework, GAT Prompting, is introduced to mitigate these issues.
- Aligned Multi-View Scripts for Universal Chart-to-Code Generation
-
Utilizing "semantically equivalent scripts for the same chart in Python, R, and LaTeX" as a new supervision signal, this work constructs the 176K quadruplet dataset Chart2NCode. It proposes CharLuMA, a lightweight adapter that integrates a "language-conditional low-rank subspace router" into the LLaVA projector, enabling a single model to achieve high execution rates and visual fidelity across three plotting languages.
- All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction
-
This paper proposes RepMD, a method that constructs Design Concept Graphs (DCGs) to guide MLLMs in detecting evolving harmful memes. Inspired by attack trees, a DCG describes the steps and logic that malicious users follow when designing harmful memes. RepMD achieves 81.1% accuracy on GOAT-Bench.
- Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models
-
BloomBench reconstructs VLM evaluation using Bloom’s cognitive taxonomy by organizing 7,747 bilingual image-text QA samples into 6 cognitive levels and 106 task types. It finds that high scores in current VLMs often mask significant shortcomings in factual recall, creative synthesis, and cross-lingual reasoning.
- Automatic Slide Updating with User-Defined Dynamic Templates and Natural Language Instructions
-
This paper defines the new task of "dynamic slide updating on user-defined templates based on natural language instructions," constructs the DynaSlide benchmark containing 20,036 instruction-execution triplets, and proposes SlideAgent as a strong reference baseline.
- Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations
-
This work constructs AniMINT, the first evaluation set for UI animation understanding (300 densely annotated animation videos + 3 experts + 300 user annotations). Systematic testing of nine SOTA VLMs shows that, while the models recognize basic motion effects, they still fall well short of humans in functional classification and high-level semantic interpretation. Furthermore, enhancing Gemini-2.5-Flash with Motion-Context-Perceptual Cues (MCPC) improves both classification and interpretation performance.
- CARES: Context-Aware Resolution Selector for VLMs
-
CARES adds a lightweight query-aware resolution selector before the target VLM. Using low-resolution images and text queries, it predicts the minimal input resolution "sufficient to answer." It maintains accuracy across 9 multimodal benchmarks while saving approximately 65–85% of prefill computational costs on average.
- CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity
-
This paper constructs CArtBench—a multi-task benchmark based on the collections of the Palace Museum—to evaluate four capabilities of VLMs in Chinese art understanding (evidence-based QA, structured appreciation, defensible re-interpretation, and authenticity discrimination). It finds that even the strongest models show significant performance degradation in evidence association and style-period reasoning, while authenticity discrimination remains near random levels.
- ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts
-
The authors constructed ChartDiff, the first large-scale benchmark for "comparative summarization of chart pairs" (8,541 pairs, covering 6 chart types, 3 visualization libraries, ~60 visual styles, with LLM-generated + human-verified comparative summaries). Systematic evaluation of 14 VLMs/pipelines reveals that state-of-the-art closed-source models lead in GPT Score but yield low ROUGE scores, while specialized chart models show the opposite, exposing a severe mismatch between ROUGE and human-perceived quality. Furthermore, multi-series charts remain a consistent blind spot for all models.
- CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution
-
CharTide attributes the bottleneck of "chart-to-plotting code" to the data itself. It utilizes Tri-Perspective Decomposed SFT (orthogonal data streams for visual perception, text-only code logic, and modality fusion) to break the scaling wall of homogeneous data. Furthermore, it employs a frozen Inspector for objective verification via atomic QA to provide verifiable rewards for RL, allowing 7B/8B open-source models to outperform GPT-4o and approach GPT-5.
- CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language
-
CNSL-bench is the first authoritative benchmark for evaluating Chinese sign language in MLLMs based on the National Common Sign Language Dictionary. It covers 6,707 unique sign entries across text, image, and video modalities, featuring three types of hand articulation (air-writing, finger-spelling, and manual-alphabet), totaling 20,121 four-way multiple-choice questions. Evaluations across 21 SOTA MLLMs reveal that while GPT-5 achieves 89.6% in text, it drops to 67.0% in image and 56.7% in video—a significant gap compared to the 97% human performance. Furthermore, CoT reasoning provides minimal benefit for video understanding.
- CO-EVO: Co-evolving Semantic Anchoring and Style Diversification for Federated DG-ReID
-
Addressing the "semantic-style conflict" in Federated Domain Generalized Person Re-identification (FedDG-ReID), CO-EVO proposes CSA (Camera-invariant Semantic Anchoring) to learn frozen identity-level textual prototypes as "gravitational centers" and GSD (Global Style Diversification) using a lightweight GCSB (Global Camera Style Bank) to synthesize realistic cross-domain perturbations. The coupled optimization of these two components achieves an average mAP improvement of 14 points (34.1→48.1) over SOTA on ViT across Market-1501/MSMT17/CUHK03 leave-one-out experiments.
- CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook
-
CodeBind revamps ImageBind/ViT-Lens style multimodal alignment using shared-specific representation decoupling and a compositional VQ codebook. It simultaneously improves cross-modal classification/retrieval across nine modalities while preserving stronger modality-specific fine-grained information.
- CogGen: A Cognitively Inspired Recursive Framework for Deep Research Report Generation
-
CogGen proposes a multi-agent recursive framework that simulates the human cognitive writing process. It implements macro-cognitive loops for global restructuring, micro-cognitive loops for parallel chapter refinement, and Abstract Visual Representation (AVR) for semantic-level text-chart synergistic planning, achieving human expert-level performance on the OWID benchmark and surpassing Gemini Deep Research.
- CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering
-
CRAFT is a claim-centric pipeline for multi-video question answering in news events. It combines dynamic key-frame selection, ASR transcription, iterative refinement via UNLI/MNLI/LLM critics, and citation consolidation, achieving 0.739 macro average, 0.810 reference recall, and 0.635 citation F1 on MAGMaR-Test.
- Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models
-
This paper proposes Vulca-Bench, a three-tier evaluation framework (Automated Metrics + Single-Judge Scoring + Human Sigmoid Calibration) covering 6 major art traditions, 165 cultural dimensions, and an L1–L5 "visual description → cultural interpretation" hierarchy. For the first time, it quantitatively reveals that across 15 VLMs, "model performance drops significantly in deep cultural interpretation, with a systematic bias toward Western art."
- Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality
-
MACCO enables CLIP to mask compositional concepts such as "relations/attributes" in one modality and reconstruct them using complete information from the other modality. Combined with two auxiliary alignment losses, it significantly enhances the compositional understanding of VLMs without generating hard negative samples.
- Cross-Modal Taxonomic Generalization in (Vision-) Language Models
-
This paper systematically investigates whether language models in VLMs can generalize taxonomic knowledge (hypernym relations) learned from text to visual inputs. It finds that even without hypernym labels during training, pretrained LMs recognize hypernym categories in images, provided there is visual coherence among category members.
- DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs
-
This paper proposes DMN, a multi-image jailbreak evaluation framework that combines distributed instructions, multimodal evidence, and digit chain auxiliary tasks. It demonstrates that current MLLMs supporting multi-image inputs exhibit significant weaknesses in cross-image safety alignment and provides a multi-image-aware filter as a preliminary defense.
- Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding
-
This paper proposes WiserUI-Bench, which utilizes 300 pairs of real-world A/B test-verified UI images and 684 expert explanations to evaluate whether MLLMs understand how interface design influences user behavior. Results show that existing models perform near random chance in selecting winners and significantly lag behind expert levels in explaining the underlying reasons.
- Doc-PP: Document Policy Preservation Benchmark for Large Vision-Language Models
-
This paper introduces the Doc-PP benchmark, revealing a "reasoning-induced safety gap" in Large Vision-Language Models (LVLMs) during multimodal document QA—models bypass explicit non-disclosure policies to leak sensitive information when cross-modal reasoning is required. The study proposes the DVA (Decompose–Verify–Aggregation) structured reasoning framework to significantly reduce leakage rates.
- DraDDP: A Multimodal Multi-Party Dialogue Discourse Parsing Dataset
-
DraDDP constructs the first publicly available English multimodal multi-party dialogue discourse parsing dataset. Using traditional parsers, LLMs, and multimodal LLMs, it systematically evaluates the varying contributions of text, audio, and video cues to dependency edge and discourse relation recognition.
- DualFact: A Multimodal Fact Verification Framework for Procedural Video Understanding
-
The authors decompose the factual evaluation of procedural video captions (e.g., cooking, furniture making) into dual-layer facts: conceptual facts (abstract roles like Action/Ingredient/Tool/Location) and contextual facts (observable predicate–argument relations in the video, such as stir(soup, pot)). They construct two benchmarks, YouCook3-Fact and CraftBench-Fact, which include annotations for Verifiable Implicit Argument (VIA) completion and contrastive facts. Additionally, they propose MultiFactScore, which utilizes multimodal/textual NLI to verify facts at the role level, further classifying errors into Hallucination, Saliency, and Omission. Experiments reveal that SOTA MLLM captions are "fluent but factually incomplete"; evaluating captions alone overestimates Hallucination by approximately half, and only video-grounded evaluation can distinguish saliency from true hallucination.
- Dynamic Emotion and Personality Profiling for Multimodal Deception Detection
-
This paper highlights that existing deception detection datasets only provide participant-level emotion/personality labels (shared across all samples from the same person). It proposes a sample-level dynamic annotation scheme and a reliability-weighted multimodal fusion framework, Rel-DDEP. This approach achieves gains of 2.53% in deception detection F1, 2.66% in emotion detection, and 9.30% in personality detection.
- E2E-GMNER: End-to-End Generative Grounded Multimodal Named Entity Recognition
-
This paper proposes E2E-GMNER, the first end-to-end GMNER framework that unifies entity recognition, semantic classification, visual grounding, and implicit knowledge reasoning within a single multimodal large language model. This framework adaptively determines the availability of visual/knowledge cues through CoT reasoning and introduces Gaussian Risk-aware Box Perturbation (GRBP) to enhance the robustness of generative bounding box prediction.
- EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions
-
The authors release the EDU-CIRCUIT-HW dataset containing 1,334 real-world handwritten university circuit homework samples and propose an "upstream recognition + downstream grading" dual-layer evaluation protocol. They find that even the strongest MLLMs (GPT-5.1 / Gemini-3-Preview) have recognition errors in 37–85% of samples, but only 7–20% propagate to grading. A regrading module using LLM-judge error patterns with only 3.3% human-in-the-loop backup improves point-agreement from 70% to 76%.
- Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
-
This paper proposes a systematic taxonomy for LVLM inference efficiency, analyzing bottlenecks along the encoding-prefilling-decoding pipeline. It reveals the systemic efficiency barrier caused by "vision token dominance" and summarizes a comprehensive technical map ranging from information density shaping and long-context attention management to memory bandwidth breakthroughs.
- Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning
-
This paper constructs a benchmark for ancient Chinese character evolution analysis containing 11 tasks and over 130,000 instances. After evaluating 19 MLLMs, it is observed that existing models have limited capabilities in glyph-level recognition and evolutionary reasoning. Consequently, the authors propose GEVO, a glyph-driven contrastive fine-tuning framework, achieving full-task improvements on a 2B model.
- From Charts to Code: A Hierarchical Benchmark for Multimodal Models
-
This paper proposes Chart2Code, a hierarchical benchmark featuring 2,186 tasks across 22 chart types. It is organized into three progressive difficulty levels: Chart Replication (Level 1), Chart Editing (Level 2), and Long Table-to-Chart (Level 3). Evaluating 29 SOTA multimodal models reveals that even the strongest GPT-5.2 achieves a chart quality score of only 33.41 on editing tasks, highlighting significant deficiencies in current models for practical chart code generation.
- From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models
-
This paper proposes HONES, a framework that first locates task-critical attention heads and then guides FFN neuron attribution conditioned on those heads. This enables unified, gradient-free neuron-level causal analysis across heterogeneous tasks, as well as lightweight performance enhancement in multi-task VLMs.
- From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck
-
This paper proposes MM-Mem, a pyramidal multimodal memory architecture inspired by Fuzzy Trace Theory. It organizes memory into three levels: a Sensory Buffer (visual-dominant), an Episodic Stream (event-level summaries), and a Symbolic Schema (Knowledge Graph). Redundancy is compressed bottom-up via SIB-GRPO (Semantic Information Bottleneck + Reinforcement Learning), while retrieval is conducted top-down driven by entropy. The method achieves SOTA performance on four long-video benchmarks.
- GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
-
GameplayQA is proposed as an end-to-end benchmarking framework based on multiplayer 3D game videos. Using dense timeline annotations (1.22 labels/sec) and a structured distractor taxonomy, it systematically evaluates the perception and reasoning capabilities of Multimodal Large Language Models (MLLMs) in decision-dense, POV-synced scenarios, revealing a significant performance gap between frontier models and human performance.
- GroupToM-Bench: Benchmarking Group Theory of Mind and Nonlinear Social Emergence in MLLMs
-
This paper introduces GroupToM-Bench, which utilizes 240 expert-designed multimodal group interaction scenarios and a 7-layer cognitive audit framework to evaluate whether MLLMs can reason from individual beliefs/desires/intentions to group tensions, structural constraints, and nonlinear collective outcomes. Results indicate a significant "group cognitive gap" in current models.
- GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance
-
GuideDog utilizes an "expert-norm-driven silver-label generation + manual verification for gold labels" pipeline to construct 22K egocentric pedestrian scene image-text pairs (including an 818-question QA benchmark) from 269 global walking videos. This provides the first scaled, geographically diverse, and standardized training and evaluation data for MLLMs in BLV (Blind and Low-Vision) navigation tasks.
- How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study
-
This paper introduces VRUBench, a textual viewpoint rotation understanding (VRU) benchmark. Using layer-wise probing and head-wise path patching, it reveals that the near-random performance of LLMs/VLMs on this task stems from the failure of critical heads in mid-to-late layers to bind "perceived orientation" with "corresponding observations." By fine-tuning only 32 key heads, the authors achieve performance comparable to full fine-tuning in 50% of the GPU time without degrading general capabilities.
- Hybrid Autoregressive-Diffusion Model for Real-Time Sign Language Production
-
This paper proposes HybridSign, which combines autoregressive frame-by-frame generation with flow-based diffusion refinement, incorporating a three-expert multi-scale pose representation and confidence-aware causal attention to achieve a superior quality-latency trade-off on PHOENIX14T and How2Sign.
- "I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?
-
This paper introduces MultiPun—the first multimodal pun benchmark featuring "adversarial non-pun distractors" (445 puns + 890 non-puns, covering both homophonic and homographic types). It systematically evaluates 11 VLMs across three tasks: detection, localization, and explanation. The study finds that all models tend to misidentify non-puns as puns (TNR generally < 0.4). The authors propose Pun-CoT prompting and Pun-Tuning strategies, achieving an average F1 gain of 16.5%.
- Jailbreaking Multimodal Large Language Models using Multi-Clip Video
-
This paper constructs MCV SafetyBench to evaluate the safety of video MLLMs, discovering that multi-clip and multi-context video inputs systematically increase attack success rates (ASR), while simple filtering of sampled image frames significantly mitigates these risks.
- LaMI: Augmenting Large Language Models via Late Multi-Image Fusion
-
This paper proposes LaMI, which uses a late-fusion architecture to integrate visual features with LLM outputs at the final stage of prediction. During inference, multiple images are generated from the text and aggregated based on confidence. This significantly enhances the visual commonsense reasoning capabilities of LLMs without compromising their text reasoning performance.
- Learning More from Less: Exploiting Counterfactuals for Data-Efficient Chart Understanding
-
Targeting the unique attribute of charts as "programmatically generated visual artifacts," this paper proposes ChartCF. By using GPT-5 to make minimal modifications to plotting code, it generates "counterfactual chart pairs" that are visually similar but have different answers. Through joint preference optimization using Text DPO + Image DPO, the VLM learns fine-grained visual discrimination. Using only 4K preference pairs, it matches or exceeds the performance of ECD trained on 300K SFT data across multiple chart QA benchmarks.
- Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Unauthorized Images
-
The authors propose ImageProtector, which embeds near-imperceptible adversarial perturbations as visual prompt injection attacks into images. This induces MLLMs to generate refusal responses for protected images, preventing malicious actors from using open-weight MLLMs to extract private information at scale.
- Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs
-
This work systematically disentangles the relationship between "compositionality" and "long-caption understanding" in contrastive VLMs. It finds that these capabilities mutually promote each other in a bidirectional manner, yet this transfer is highly sensitive to training data quality and optimization strategies. Specifically, grounded long-caption data with high vocabulary coverage combined with full-parameter fine-tuning allows a model to excel in both capabilities simultaneously. Conversely, low-quality synthetic captions (e.g., DAC/DCI) or LoRA-based partial updates lead to failure in both. LongCLIP's strategy of freezing the first 20 positional embeddings, while intended to preserve general alignment, severely restricts compositional learning. The "control model" proposed in this work, LSS, outperforms LongCLIP within the original 77-token context window by fine-tuning on ShareGPT4V with full parameters.
- Lost in Translation: Do LVLM Judges Generalize Across Languages?
-
This paper introduces MM-JudgeBench, the first large-scale multilingual multimodal judgment benchmark (25 languages, 60K+ preference instances). Evaluating 22 LVLMs reveals significant cross-lingual performance gaps—model size and architecture do not predict multilingual robustness, and even state-of-the-art judges exhibit inconsistency, highlighting the necessity for multilingual multimodal evaluation benchmarks.
- MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
-
This work introduces the FlowVerse benchmark (decomposing mathematical problem information into four components: DI/EI/RP/OQ to construct six variants) and the MathFlow modular pipeline (decoupling perception and reasoning into independent stages). By training a specialized perception model, MathFlow-P-7B, to extract key information from mathematical diagrams, the approach significantly enhances the visual mathematical problem-solving capabilities of various reasoning models.
- Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity (MM-Eval)
-
This paper proposes MM-Eval, an evaluation framework for the "Multimodal Summarization with Multimodal Output (MSMO)" task. It aggregates three sub-scores (text quality via OpenFActScore + G-Eval, cross-modal alignment via MLLM-as-Judge, and visual diversity via Truncated CLIP Entropy) into a single score using weights learned via Ridge regression. On the mLLM-EVAL news benchmark, it improves the Kendall \(\tau\) correlation with human preferences from 0.041 for an equal-weight baseline to 0.374.
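As a rough illustration of the aggregation and meta-evaluation steps described in this note, the sketch below combines three sub-scores with a weight vector and measures rank agreement with human scores using Kendall \(\tau\) (the tau-a variant, which counts tied pairs as neither concordant nor discordant). The function names and example weights are assumptions made for illustration, and the Ridge regression fit is omitted; the paper's actual weights and implementation are not given here.

```python
from itertools import combinations


def aggregate(sub_scores, weights):
    """Weighted sum of (text quality, alignment, diversity) sub-scores."""
    return sum(w * s for w, s in zip(weights, sub_scores))


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical weights; in the paper they are learned, not hand-set.
example_weights = (0.5, 0.25, 0.25)
example_score = aggregate((1.0, 0.0, 0.5), example_weights)
```

Comparing `kendall_tau(metric_scores, human_scores)` for equal weights against learned weights mirrors the kind of comparison the note reports.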
- MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models
-
This paper introduces MedLayBench-V, the first large-scale multimodal medical expert-lay semantic alignment benchmark (79,793 image-text pairs). Through the Structured Concept-Grounded Refinement (SCGR) pipeline, professional radiology reports are converted into lay descriptions. This ensures clinical semantic fidelity while reducing reading difficulty from graduate to high school level. Zero-shot retrieval experiments demonstrate that lay descriptions result in less than a 1% performance loss.
- MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A
-
MM-BizRAG demonstrates that enterprise multimodal RAG should not solely rely on page screenshots and vision embeddings. Instead, it should differentiate between reports and slides based on document structure, followed by explicit parsing of text, tables, and images. By assembling multimodal contexts during inference, it significantly outperforms vision-centric baselines on SlideVQA, FinRAGBench-V, and internal enterprise data.
- MONETA: Multimodal Industry Classification through Geographic Information with Multi Agent Systems
-
This paper proposes MONETA, the first multimodal industry classification benchmark combining text (websites, Wikipedia, Wikidata) and geospatial data (OpenStreetMap, satellite imagery). It designs two training-free pipelines—Zero-Shot and Multi-Turn Multi-Agent—achieving 62.10%-74.10% accuracy across 20 NACE categories using open and closed-source MLLMs, with the multi-turn design providing gains of up to 22.80%.
- More Than Meets the Eye: Measuring the Semiotic Gap in Vision-Language Models via Semantic Anchorage
-
This paper reveals the "literal superiority bias" of VLMs from a cognitive semiotics perspective: on high-fidelity images, models tend toward literal interpretation rather than metaphorical or idiomatic understanding. By introducing the DIVA benchmark (iconographically simplified images) and the Semantic Alignment Gap metric, the authors show that reducing visual fidelity significantly narrows the gap between literal and idiomatic interpretations.
- MSEarth: A Multimodal Benchmark for Earth Science Phenomenon Discovery with MLLMs
-
This work extracts 289K images from 64,560 CC-BY open-source Earth science papers and synthesizes "refined captions" using "raw captions + body context." Through 5-model multi-agent voting and three-stage PhD expert validation, it generates a 7,195-item graduate-level test set (including captioning / MCQ / open-ended). The study systematically reveals a 20+ point "perception >> reasoning" gap in SOTA MLLMs regarding multi-image Earth science reasoning and provides a 441K training set that enables open-source 7B models to rival GPT-4o after GRPO.
- PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection
-
PRISM discovers that the non-zero mean of MLLM visual features causes Global Semantic Drift, which contaminates similarity-based data selection. By using training-free mean re-centering and low-correlation sample selection, it achieves 101.7% relative performance while retaining only approximately 30% of visual samples, reducing end-to-end GPU time by about 70%.
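To make the effect of a non-zero feature mean concrete, here is a minimal sketch, assuming features are plain lists of floats and that "low-correlation selection" means keeping the samples with the lowest average absolute cosine similarity to the rest. Both the selection rule and the function names are assumptions for illustration, not PRISM's actual procedure.

```python
import math


def recenter(features):
    """Subtract the per-dimension mean from every feature vector."""
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in features]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_low_correlation(features, keep_ratio):
    """Keep the samples least correlated with the others after re-centering."""
    centered = recenter(features)
    n = len(centered)
    scores = []
    for i in range(n):
        sims = [abs(cosine(centered[i], centered[j])) for j in range(n) if j != i]
        scores.append(sum(sims) / len(sims))
    k = max(1, round(keep_ratio * n))
    ranked = sorted(range(n), key=lambda i: scores[i])
    return sorted(ranked[:k])


# Two vectors sharing a large offset look nearly identical before re-centering.
raw = [[10.0, 11.0], [11.0, 10.0]]
sim_before = cosine(raw[0], raw[1])
centered_pair = recenter(raw)
sim_after = cosine(centered_pair[0], centered_pair[1])
```

In this toy case the shared offset pushes the raw cosine similarity close to 1, while the re-centered vectors point in opposite directions, which is the kind of distortion the note attributes to Global Semantic Drift.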
- Reducing Peak Memory Usage for Modern Multimodal Large Language Model Pipelines
-
The paper argues that the memory bottleneck of Multimodal Large Language Models (MLLMs) has shifted from "long-context caching in the decoding stage" to "peak visual token caching in the prefill stage." It proposes a structure-aware KV-cache framework that performs computation and compression concurrently during prefill, maintaining peak memory within a fixed budget while preserving image and video understanding capabilities.
- Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking
-
This paper proposes Region-R1, which models query-side region cropping in multi-modal re-ranking as a decision problem. By using reinforcement learning (r-GRPO) to learn when and how to crop question-relevant regions in the query image, it improves CondRecall@1 by 20% on E-VQA and 8% on InfoSeek.
- Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
-
Response-G1 utilizes query-guided online scene graphs, historical scene graph retrieval, and timestamped trigger prompts to explicitly align visual evidence with the response conditions of user queries. This approach significantly enhances the "when to answer" decision-making capability of Video-LLMs in streaming videos without requiring fine-tuning.
- Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding
-
ReVisiT discovers that visual tokens in LVLMs already encode interpretable object semantics. By utilizing contextually constrained vocabularies, visual token selection, and logit fusion, it enhances visual grounding and reduces hallucinations without retraining or additional forward passes.
- Prune-then-Merge: Towards Efficient Multi-Vector Visual Document Retrieval
-
This paper proposes Prune-then-Merge, a two-stage training-free multi-vector document compression framework. It first removes low-information patches via adaptive attention pruning, then merges the remaining high-signal patches through hierarchical agglomerative clustering. It extends the near-lossless compression range from 50-60% to 60-70% across 29 VDR datasets and significantly outperforms single-stage methods at high compression rates of 80%+.
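The two stages can be sketched roughly as follows, assuming each patch is a list of floats with a precomputed attention score, a fixed keep ratio in place of the paper's adaptive pruning, and centroid-linkage agglomerative merging. All names and these simplifications are assumptions for illustration, not the paper's implementation.

```python
import math


def prune(patches, attention, keep_ratio):
    """Stage 1: keep the highest-attention patches, preserving their order."""
    k = max(1, round(keep_ratio * len(patches)))
    top = sorted(range(len(patches)), key=lambda i: attention[i], reverse=True)[:k]
    return [patches[i] for i in sorted(top)]


def merge(vectors, target):
    """Stage 2: repeatedly merge the two closest centroids until `target` remain."""
    target = max(1, target)
    clusters = [(list(v), 1) for v in vectors]  # (centroid, member count)
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # Size-weighted mean keeps the centroid equal to the mean of all members.
        centroid = [(a * ni + b * nj) / (ni + nj) for a, b in zip(ci, cj)]
        clusters[i] = (centroid, ni + nj)
        del clusters[j]
    return [c for c, _ in clusters]


patches = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [5.0, 5.0]]
attention = [0.9, 0.8, 0.7, 0.6, 0.01]
kept = prune(patches, attention, keep_ratio=0.8)
compressed = merge(kept, target=2)
```

Pruning first means the low-information patch never pulls a merged centroid toward itself, which is the intuition behind ordering the stages this way.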
- Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing
-
SEA decomposes continuous sign language video subtitle alignment into sign segmentation, text-sign embedding, and episode-level dynamic programming. It achieves state-of-the-art alignment F1 on the BOBSL, How2Sign, WMT-SLT SRF, and SwissSLi datasets, and can efficiently process long videos on CPUs.
- SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding
-
This paper proposes SlideAgent, a hierarchical agentic framework that constructs structured knowledge representations through three levels of specialized agents (global, page, and element), significantly enhancing the fine-grained understanding of multi-page visual documents, particularly slides.
- Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling
-
This paper proposes DASH (Delta Attention Selective Halting), a training-free inference acceleration method that monitors layer-wise update magnitudes \(\Delta_{attn}\) to identify "semantically solidified" tokens and halt their subsequent computation. It achieves significant prefill acceleration on long-context text and vision-language benchmarks with almost no loss in accuracy.
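The halting idea in this entry can be illustrated with a toy loop. This is only a sketch under stated assumptions: the relative L2 norm as the update magnitude, the threshold `tol`, and the per-token `toy_layer` are all hypothetical stand-ins (real attention updates depend on other tokens), not the paper's actual criterion.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def l2_norm(v: Vector) -> float:
    return sum(x * x for x in v) ** 0.5


def prefill_with_halting(
    hidden: List[Vector],
    layers: List[Callable[[Vector], Vector]],
    tol: float,
) -> Tuple[List[Vector], List[bool]]:
    """Run toy layers, freezing tokens whose relative update falls below tol."""
    states = [list(h) for h in hidden]
    active = [True] * len(states)
    for layer in layers:
        for i, h in enumerate(states):
            if not active[i]:
                continue  # halted token: skip all later computation
            update = layer(h)  # stand-in for the attention update of token i
            states[i] = [a + b for a, b in zip(h, update)]
            # Assumed criterion: relative L2 size of the update.
            if l2_norm(update) / (l2_norm(h) or 1.0) < tol:
                active[i] = False
    return states, active


def toy_layer(h: Vector) -> Vector:
    # Hypothetical layer: large updates when h[0] > 0, tiny ones otherwise.
    scale = 0.5 if h[0] > 0 else 0.001
    return [scale * x for x in h]


# Example: the second token barely changes, so it is halted after layer 1.
states, active = prefill_with_halting([[1.0, 2.0], [-1.0, 0.5]], [toy_layer] * 3, tol=0.01)
```

In this toy setup the savings come from the `continue` branch: once a token is marked inactive, no later layer touches it.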
- STELLA: A Multimodal LLM for Protein Functional Annotation via Unified Sequence-Structure Encoding
-
STELLA integrates the unified sequence-structure protein representations of ESM3 into Llama-3.1-8B-Instruct. Through two-stage multimodal instruction tuning, it performs protein functional description and enzyme catalytic reaction prediction, setting new benchmarks for functional annotation on the OPI-Struc series.
- StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs
-
StructBreak proposes the "Structural Cognitive Overload" (SCO) attack paradigm, leveraging the topological complexity of Visual Knowledge Graphs (VKG) to induce safety failures in Multimodal LLMs. It achieves an average attack success rate of 92% across six frontier MLLMs in a black-box setting (reaching 97% on Gemini 2.5) and reveals safety collapse mechanisms through attention dissipation, latent space topology, and geometric analysis.
- TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval
-
This paper proposes TEMA (Text-oriented Entity Mapping Architecture), the first framework for Composed Image Retrieval (CIR) oriented toward multi-modification texts. It enhances modified entity coverage through an MMT Parser Assistant (PA) and addresses the clause-entity alignment problem with an Entity Mapping (EM) module. Furthermore, it constructs two multi-modification benchmarks, M-FashionIQ and M-CIRR, achieving state-of-the-art performance in both original and multi-modification scenarios.
- Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning
-
This is the first survey dedicated to Test-Time Scaling (TTS) in Multimodal Foundation Models (MFMs). It unifies the various methods of "dynamically allocating compute at the inference stage" under the objective \(\pi^*=\arg\max_\pi \mathbb{E}[U(x,y)]\) subject to compute budget constraints, and groups them into three paradigms: sampling-based, feedback-based, and search-based. It covers both multimodal generation and reasoning tasks and provides a roadmap of representative methods, benchmarks, and open challenges.
- TeXOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction
-
This paper advances scientific PDF OCR from "converting to text/Markdown" to "reconstructing page-level LaTeX that is compilable without manual intervention." It proposes TEXOCR-Bench, TEXOCR-Train, and a two-stage SFT+RLVR training paradigm, enabling a Qwen3-VL-2B derivative model to significantly outperform open-source baselines of the same scale in structural consistency, citation validity, and compilation success rate.
- Text-Guided Multi-Scale Frequency Representation Adaptation
-
This paper proposes FreqAdapter, which transforms visual and textual embeddings of CLIP/LLaVA into the DCT frequency domain. It employs text-guided multi-scale global adaptation and cross-modal modulation to fine-tune visual frequency representations. With approximately 0.11% additional parameters, it consistently outperforms common prompt/adapter methods in image-text retrieval and VQA.
- Topology-Aware Layer Pruning for Large Vision-Language Models
-
This paper proposes TopoVLM, a layer pruning framework based on Topological Data Analysis (TDA). It models hidden states as point clouds and quantifies inter-layer topological consistency via zigzag persistent homology to adaptively retain transition-critical layers while pruning structural redundancies. It significantly outperforms existing pruning methods at 50-60% sparsity.
- Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention
-
This paper proposes SPeCTrA-Sum, which integrates a layer-aligned Deep Visual Processor, gated cross-modal attention, and a DPP-distilled image selector. This allows multimodal summarization to maintain near-SOTA ROUGE scores while selecting more relevant and diverse supporting images.
- TRACE: Evidence Localization-based Multi-video Event Understanding and Claim Generation
-
TRACE achieves SOTA on multi-video event understanding tasks, improving F1 from 0.705 to 0.811, by employing a "localize-then-reason" pipeline that builds text-searchable video timelines via OCR and object detection, performs query-conditioned evidence localization with a text LLM, and generates cited claims using an LVLM.
- Tree-of-Evidence: Efficient "System 2" Search for Faithful Multimodal Grounding
-
This paper proposes Tree-of-Evidence (ToE), an inference-time discrete beam search algorithm that formalizes multimodal model interpretability as a discrete optimization problem over coarse-grained evidence units (vital sign time windows, radiology report snippets). Using only 5 evidence units, it retains over 98% of the AUROC of the full-input model while generating auditable evidence tracking paths.
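A beam search over evidence subsets of the kind this entry describes might look like the sketch below. The scoring function, the unit names, and the parameters `budget` and `beam_width` are hypothetical; in the paper's setting the score would come from the multimodal model rather than a lookup table.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Subset = FrozenSet[str]


def beam_search_evidence(
    units: List[str],
    score: Callable[[Subset], float],
    budget: int,
    beam_width: int,
) -> Tuple[Subset, float]:
    """Grow evidence subsets one unit at a time, keeping the top `beam_width`."""
    empty: Subset = frozenset()
    beam: List[Tuple[Subset, float]] = [(empty, score(empty))]
    best = beam[0]
    for _ in range(budget):
        candidates: Dict[Subset, float] = {}
        for subset, _ in beam:
            for unit in units:
                if unit in subset:
                    continue
                grown = subset | {unit}
                if grown not in candidates:  # avoid rescoring duplicates
                    candidates[grown] = score(grown)
        if not candidates:
            break
        ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
        beam = ranked[:beam_width]
        if beam[0][1] > best[1]:
            best = beam[0]
    return best


# Hypothetical additive score over made-up evidence units.
WEIGHTS = {"hr_window": 0.4, "bp_window": 0.1, "report_snippet": 0.3, "noise": -0.2}


def toy_score(subset: Subset) -> float:
    return sum(WEIGHTS[u] for u in subset)


selected, value = beam_search_evidence(list(WEIGHTS), toy_score, budget=2, beam_width=2)
```

The sequence of subsets kept in the beam is what would serve as an auditable trail: each step records which unit was added and how the score changed.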
- UniversalRAG: Retrieval-Augmented Generation for Multimodal Corpora
-
UniversalRAG proposes a general any-to-any RAG framework that utilizes modality-aware routing and granularity-aware retrieval to dynamically select the most appropriate knowledge sources from heterogeneous multimodal corpora (text, image, video at varying granularities). This approach avoids the modality gap problem inherent in unified embedding spaces and significantly outperforms single-modality and unified methods across 10 benchmarks.
- Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
-
This paper shifts multimodal RAG image selection from "semantic similarity ranking" to "utility estimation of helpfulness for final answers." By utilizing lightweight multimodal surrogate models to efficiently predict evidence helpfulness, it simultaneously improves response quality and inference efficiency on MRAG-Bench and Visual-RAG.
- VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
-
This paper proposes VAUQ, which measures whether LVLM responses truly rely on visual evidence using image information scores and attention-driven core region masking. This enables more reliable multimodal self-evaluation and hallucination detection without requiring training or external evaluators.
- VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models
-
VIGNETTE constructs a VQA bias evaluation benchmark with 30M+ synthetic paired images, using four types of questions—factuality, perception, stereotyping, and decision-making—to reveal how VLMs link identity cues, activity contexts, and social hierarchies to produce fine-grained and sometimes contradictory biases.
- ViLL-E: Video LLM Embeddings for Retrieval
-
This paper proposes ViLL-E, the first unified Video LLM architecture supporting both text generation and embedding generation. Through three-stage joint generation-contrastive training and an adaptive KV-Former embedding head, it approaches expert models in video retrieval and temporal grounding while remaining competitive in VideoQA.
- Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues
-
This paper uses 1,360 controlled real-world photos and pre-registered statistical tests to show that current VLMs perform much worse than humans at judging which object a person is looking at, primarily because the models mistake head orientation for gaze direction. Fine-tuning specialized gaze models can alleviate but cannot completely eliminate this bias.
- VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding
-
VULCA-Bench utilizes 8 cultural traditions, 7,410 image-bilingual expert critique pairs, and an L1-L5 five-layer cultural understanding framework to advance VLM evaluation from "seeing objects" to "understanding symbols, history, and aesthetic philosophy," revealing that existing models typically drop 31-40 percentage points on high-level cultural reasoning.
- What Do Vision-Language Models Encode for Personalized Image Aesthetics Assessment?
-
This paper discovers through linear probing that the hidden representations of VLMs encode rich multi-level aesthetic attribute information (lighting, color, composition, etc.), which propagates to the language decoder layers. Based on this, it achieves training-free Personalized Image Aesthetics Assessment (PIAA) using simple linear regression, which significantly outperforms few-shot and LoRA fine-tuning baselines.
- When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models
-
This paper constructs WHOOPS-AHA! to put VLM commonsense knowledge into direct conflict with counterfactual visual evidence, discovering that a small number of late-layer attention heads causally control whether the model relies on internal knowledge or visual input.
- WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering
-
WikiSeeker redefines the role of VLMs in multimodal RAG, turning them from mere answer generators into two specialized agents: a Refiner trained via RL for query rewriting and an Inspector that verifies the reliability of retrieved context. It achieves SOTA performance on the EVQA, InfoSeek, and M2KR benchmarks.