🧩 Multimodal VLM

📷 CVPR2026 · 287 paper notes

A3: Towards Advertising Aesthetic Assessment: This paper proposes the A3 framework, comprising a theory-driven three-stage advertising aesthetic assessment paradigm A3-Law (Perceptual Attention → Formal Interest → Desire Impact), a 120K-annotation dataset A3-Dataset, an SFT+GRPO aligned model A3-Align, and the evaluation benchmark A3-Bench. A3-Align surpasses existing MLLMs on automated advertising aesthetic assessment.
A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks: This paper proposes a training-free, annotation-free debiasing method for VLMs that operates in cross-modal embedding spaces. Via orthogonal decomposition, it achieves a Pareto-optimal fairness–utility trade-off with a closed-form solution and provides theoretical upper bounds on utility loss.
A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks: This paper proposes a closed-form debiasing method for VLMs that performs orthogonal decomposition of attribute subspaces in the cross-modal embedding space and solves via Chebyshev scalarization, achieving Pareto-optimal fairness with bounded utility loss. The approach is training-free and annotation-free, and uniformly covers three downstream tasks: zero-shot classification, text-image retrieval, and text-image generation.
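As a rough illustration of the orthogonal-decomposition idea in the debiasing entry above, the sketch below removes the component of an embedding that lies along a single attribute direction. It is a simplified stand-in, not the paper's closed-form solution: the Chebyshev scalarization and the utility bound are not modelled, and the function name, the single-direction setup, and the toy vectors are all assumptions made for this example.

```python
import math


def remove_attribute_direction(embedding, attribute_direction):
    """Return the embedding minus its projection onto attribute_direction.

    Hypothetical helper for illustration only; a real attribute subspace
    would usually span several directions.
    """
    norm = math.sqrt(sum(a * a for a in attribute_direction))
    if norm == 0:
        raise ValueError("attribute_direction must be non-zero")
    unit = [a / norm for a in attribute_direction]
    # Scalar projection of the embedding onto the unit attribute direction.
    coeff = sum(e * u for e, u in zip(embedding, unit))
    # Subtract the projected component, leaving the orthogonal remainder.
    return [e - coeff * u for e, u in zip(embedding, unit)]


embedding = [3.0, 4.0, 0.0]   # toy embedding (assumed values)
direction = [0.0, 2.0, 0.0]   # toy attribute direction (assumed values)
debiased = remove_attribute_direction(embedding, direction)
print(debiased)  # [3.0, 0.0, 0.0]
```

After the projection, the result has zero dot product with the attribute direction, which is the property a projection-based debiasing step relies on.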
Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models: This paper proposes TANL (Test-time Activated Negative Labels), which dynamically evaluates the "activation degree" of negative labels on OOD samples at test time to identify the most effective negative labels. Combined with an activation-aware scoring function, TANL reduces FPR95 from 17.5% to 9.8% on the ImageNet benchmark, while remaining entirely training-free and test-time efficient.
AVR: Adaptive VLM Routing for Computer Use Agents: This paper proposes AVR, an adaptive routing framework for Computer Use Agents that combines a lightweight multimodal embedding model for action difficulty assessment, small-model logprob confidence probing, and warm agent memory injection, enabling a three-tier routing strategy (simple → small model; difficult → large model; high-risk → large model + guardrail). AVR reduces inference cost by 78% with only a 2-percentage-point accuracy loss.
Adaptive Vision-Language Model Routing for Computer Use Agents: This paper proposes the Adaptive VLM Routing (AVR) framework, which inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. Through three mechanisms — multimodal difficulty classification, logprob confidence probing, and historical memory injection — AVR dynamically selects the most cost-efficient model for each action, reducing inference cost by up to 78% with an accuracy drop of no more than 2 percentage points.
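The three-tier routing described in the two AVR entries above (simple → small model; difficult → large model; high-risk → large model + guardrail) can be sketched as a small decision function. This is a minimal sketch under stated assumptions: the `Route` type, the function name, both thresholds, and the way the confidence probe is combined with the difficulty score are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Route:
    model: str        # "small" or "large" (labels assumed for illustration)
    guardrail: bool   # whether a guardrail check is attached


def route_action(difficulty: float,
                 small_model_confidence: float,
                 high_risk: bool,
                 difficulty_threshold: float = 0.5,
                 confidence_threshold: float = 0.8) -> Route:
    """Pick a model tier for one agent action.

    Thresholds are placeholder values, not values reported by the paper.
    """
    # High-risk actions always go to the large model with a guardrail.
    if high_risk:
        return Route("large", True)
    # Difficult actions, or actions the small model is unsure about,
    # escalate to the large model.
    if (difficulty >= difficulty_threshold
            or small_model_confidence < confidence_threshold):
        return Route("large", False)
    # Everything else is handled by the small model.
    return Route("small", False)


print(route_action(difficulty=0.1, small_model_confidence=0.95, high_risk=False))
print(route_action(difficulty=0.9, small_model_confidence=0.95, high_risk=False))
print(route_action(difficulty=0.1, small_model_confidence=0.95, high_risk=True))
```

Checking the risk flag first keeps the guardrail tier independent of the two scores, so a confident small model cannot bypass it.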
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition: This paper proposes AdaptVision, which enables VLMs to autonomously determine the minimum number of visual tokens required per sample through a coarse-to-fine active visual mechanism and reinforcement learning training, combined with Decoupled Turn Policy Optimization (DTPO) to achieve an optimal trade-off between efficiency and accuracy.
AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models: AGFT proposes an alignment-guided fine-tuning framework that enhances zero-shot adversarial robustness of VLMs while preserving the pre-trained cross-modal semantic structure, through text-guided adversarial training and distribution consistency calibration. The method achieves an average robust accuracy of 46.57% across 15 zero-shot benchmarks, surpassing the previous state of the art by 3.1 percentage points.
Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow: This paper identifies that excessive attention from text tokens to irrelevant visual tokens is the root cause of the "see but misperceive" phenomenon in VLMs. It proposes Adaptive Information Flow (AIF), a training-free method that modulates information flow at inference time by modifying the causal mask based on token dynamic entropy, blocking irrelevant visual-to-text connections and improving perceptual performance across multiple VLMs.
AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors: AnomalyVFM proposes a general framework that transforms arbitrary Vision Foundation Models (VFMs) into strong zero-shot anomaly detectors via a three-stage synthetic data generation pipeline and parameter-efficient LoRA adaptation, achieving 94.1% image-level AUROC across 9 industrial datasets with RADIO as the backbone, surpassing the previous SOTA by 3.3 percentage points.
ApET: Approximation-Error Guided Token Compression for Efficient VLMs: From an information-theoretic angle, this paper proposes a visual token importance measure based on linear-approximation reconstruction error. The method requires no attention weights, is naturally compatible with FlashAttention, and on LLaVA-1.5 retains 95.2% of the original performance after compressing away 88.9% of visual tokens.
ApET: Approximation-Error Guided Token Compression for Efficient VLMs: Grounded in information theory, ApET reconstructs each visual token via linear approximation and measures its informativeness by reconstruction error (larger error = more information = should be retained). The framework is entirely independent of attention weights and fully compatible with FlashAttention. On LLaVA-1.5-7B it retains 95.2% of the original accuracy at 88.9% compression, and on video tasks it even exceeds the uncompressed baseline (100.4% of baseline performance).
Asking like Socrates: Socrates helps VLMs understand remote sensing images: This paper identifies the "pseudo-reasoning" phenomenon in remote sensing VLMs—where explicit reasoning chains actually degrade performance—attributing it to the "Glance Effect" (insufficient single-pass perception). It proposes RS-EoT (Evidence-of-Thought), an iterative evidence search paradigm. A SocraticAgent self-play mechanism synthesizes reasoning trajectories for SFT cold-start, followed by two-stage progressive RL (grounding → VQA) to enhance and generalize reasoning. RS-EoT-7B achieves state-of-the-art performance across multiple remote sensing VQA and grounding benchmarks.
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models: This paper introduces AV-SpeakerBench, a speaker-centric audiovisual reasoning benchmark comprising 3,212 multiple-choice questions, revealing Gemini 2.5 Pro's superiority in audiovisual fusion while exposing significant deficiencies of open-source models in speaker reasoning.
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention: This work revisits visual processing in VLA models from a POMDP perspective and proposes the AVA-VLA framework, which dynamically modulates the importance of visual tokens in the current frame based on historical context via a recurrent state and an active visual attention module, achieving state-of-the-art performance on benchmarks including LIBERO and CALVIN.
BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates: BALM proposes a model-agnostic plug-and-play framework to address multimodal learning under Imbalanced Missing Rates (IMR). It introduces a Feature Calibration Module (FCM) to align representations across different missing patterns, and a Gradient Rebalancing Module (GRM) to balance the optimization dynamics of each modality from both distributional and spatial perspectives. The framework consistently improves the robustness of various backbone networks across multiple multimodal sentiment recognition benchmarks.
Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality: This paper introduces ContrAR, the first benchmark for contradictory virtual content attacks in AR environments, comprising 312 real videos recorded on Meta Quest 3, validated by 10 annotators with an average Likert score of 4.66/5. It systematically evaluates 11 VLMs (including GPT-5/Gemini-2.5/Grok-4) on semantic contradiction detection, finding that GPT-5 achieves the highest accuracy (88.14%) but incurs a 19s latency, while GPT-4o offers the best accuracy–latency trade-off (84.62% / 7.26s). An OCR-only text baseline reaches only 56%, demonstrating that visual reasoning is indispensable.
Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition: This paper reformulates VLM zero-shot image recognition as a Bayesian framework, constructs a concept proposal distribution via an LLM-driven multi-stage concept synthesis pipeline, and employs an adaptive soft-trim likelihood to suppress the influence of outlier concepts, achieving state-of-the-art performance across 11 classification benchmarks.
Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models: By constructing the Isle-Brick-V2 benchmark using psychologically inspired controlled LEGO scenes, this work systematically exposes significant deficiencies in current VLMs' Visual Perspective Taking (VPT) capabilities—even when scene understanding is near-perfect, spatial reasoning and perspective-taking performance degrade substantially, accompanied by persistent directional biases.
Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models: This paper introduces FAQ (Forensic Answer-Questioning), the first large-scale multiple-choice QA benchmark focused on temporal inconsistencies in deepfake videos (33K QA pairs, ~4,500 videos). FAQ is organized as a three-level progressive task hierarchy (facial perception → temporal localization → forensic reasoning). Fine-tuning on FAQ yields significant gains on both in-domain benchmarks and cross-dataset detection (Qwen2.5-VL average accuracy improves from 21.6% to 52.4%).
Beyond the Mean: Modelling Annotation Distributions in Continuous Affect Prediction: This paper proposes a Beta distribution-based framework for modelling affective annotation consensus. The model predicts only the mean and standard deviation of the annotation distribution, from which higher-order descriptors—including skewness, kurtosis, and quantiles—are derived in closed form via moment matching. Experiments on SEWA and RECOLA demonstrate that Beta distributions effectively capture the full distributional characteristics of annotator disagreement.
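The moment-matching step mentioned in the entry above can be illustrated with the standard Beta distribution identities: a mean and standard deviation determine the two shape parameters, and skewness and excess kurtosis then follow in closed form. The sketch below assumes annotations have been rescaled to the open interval (0, 1); the function names are hypothetical and the code is not taken from the paper. Quantiles are left out because they require the inverse regularized incomplete beta function.

```python
import math


def beta_from_mean_std(mean, std):
    """Moment-match a Beta(alpha, beta) to a mean and standard deviation."""
    var = std * std
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("variance must be positive and below mean * (1 - mean)")
    nu = mean * (1.0 - mean) / var - 1.0   # nu equals alpha + beta
    return mean * nu, (1.0 - mean) * nu


def beta_skewness(a, b):
    """Closed-form skewness of Beta(a, b)."""
    return 2.0 * (b - a) * math.sqrt(a + b + 1.0) / ((a + b + 2.0) * math.sqrt(a * b))


def beta_excess_kurtosis(a, b):
    """Closed-form excess kurtosis of Beta(a, b)."""
    num = 6.0 * ((a - b) ** 2 * (a + b + 1.0) - a * b * (a + b + 2.0))
    den = a * b * (a + b + 2.0) * (a + b + 3.0)
    return num / den


# Toy inputs (assumed): mean 0.25 and variance 0.0375 correspond to
# shape parameters close to alpha = 1 and beta = 3.
alpha, beta = beta_from_mean_std(0.25, math.sqrt(0.0375))
print(alpha, beta, beta_skewness(alpha, beta), beta_excess_kurtosis(alpha, beta))
```

The variance check matters in practice: a predicted standard deviation that is too large for the predicted mean has no valid Beta distribution, so a model trained this way would need its outputs constrained accordingly.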
BiCLIP: Domain Canonicalization via Structured Geometric Transformation: This paper proposes BiCLIP, a minimalist few-shot adaptation method for CLIP that applies a bilinear transformation matrix with an upper-triangular structural constraint to geometrically align image features with text embeddings, achieving state-of-the-art performance across 11 standard benchmarks with an exceptionally low parameter count.
BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment: BriMA addresses the non-stationary modality imbalance problem in multi-modal continual action quality assessment (AQA) via memory-guided bridging imputation and modality-aware replay optimization, achieving an average improvement of 6–8% in correlation coefficient and a 12–15% reduction in error across three benchmarks.
BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection: This paper proposes BUSSARD, the first learning-based scene-specific anomalous relationship detection method. It encodes scene graph triplets via pretrained language model embeddings, applies an autoencoder for dimensionality reduction, and employs normalizing flows for likelihood estimation. BUSSARD achieves approximately 10% AUROC improvement on the SARD dataset and demonstrates robustness to synonym variation.
Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions: This paper constructs a synthetic counting benchmark dataset, systematically evaluates the counting capabilities of open-source VLMs under varying image and prompt conditions, and investigates mechanisms for improving counting behavior through visual attention reweighting at the decoder level.
CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment: This paper proposes CAPT, a confusion-aware prompt tuning framework that explicitly models systematic misalignment patterns in VLMs via a Semantic Confusion Miner (SEM) and a Sample Confusion Miner (SAM). A Multi-Granularity Discrepancy Expert (MGDE) further integrates confusion information across different granularities. CAPT achieves a state-of-the-art HM of 83.90% across 11 benchmarks.
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding: This paper introduces ChartNet—a 1.5-million-scale, high-quality multimodal chart dataset. A code-guided synthesis pipeline generates aligned quintuples comprising image–code–data table–text–reasoning QA. Fine-tuning on ChartNet significantly improves VLM performance on chart understanding and reasoning tasks, enabling small models to surpass GPT-4o.
Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking: This paper proposes the first circuit tracing framework for VLMs, training per-layer transcoders on Gemma-3-4B and constructing attribution graphs to reveal the hierarchical integration mechanisms underlying multimodal reasoning, visual arithmetic circuits, and the internal causes of six-finger hallucinations. The causal controllability of the discovered circuits is validated through feature steering and circuit patching.
CLIP-Free, Label-Free, Unsupervised Concept Bottleneck Models: This paper proposes TextUnlock, a method that aligns the output distribution of an arbitrary frozen visual classifier to a vision-language correspondence space, enabling the construction of a fully unsupervised Concept Bottleneck Model (U-F²-CBM) that requires no CLIP, no labels, and no trained linear probes. U-F²-CBM surpasses supervised CLIP-based CBMs across 40+ models.
Concept-wise Attention for Fine-grained Concept Bottleneck Models: CoAt-CBM achieves adaptive fine-grained image–concept alignment via learnable concept-wise visual queries and Concept Contrastive Optimization (CCO), surpassing both existing concept bottleneck models and black-box models while maintaining high interpretability.
CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning: This paper proposes CodeDance, which uses executable code as a unified medium for visual reasoning. Atomic capabilities are instilled via SFT, and a difficulty-adaptive tool-calling reward (BAT) is applied during RL to enable dynamic tool orchestration and self-verification reasoning. The resulting 7B model surpasses GPT-4o on tasks such as counting, visual search, and chart QA.
CodePercept: Code-Grounded Visual STEM Perception for MLLMs: Through systematic scaling analysis, this work identifies perception—rather than reasoning—as the true bottleneck for MLLMs in STEM domains. It proposes the CodePercept paradigm, which uses executable Python code as an anchoring medium, constructs the million-scale ICC-1M dataset and the STEM2Code-Eval benchmark, and achieves significant improvements in STEM visual perception and downstream reasoning after two-stage SFT+RL training.
CodePercept: Code-Grounded Visual STEM Perception for MLLMs: Through systematic scaling analysis, this paper reveals that perception rather than reasoning is the true bottleneck of MLLMs on STEM visual tasks. It proposes a paradigm that uses executable code as a medium to enhance perceptual capability, constructs ICC-1M — a 1M-scale Image-Caption-Code triplet dataset — and introduces two training tasks: code-grounded caption generation and STEM image-to-code translation.
CoMP: Collaborative Multi-Mode Pruning for Vision-Language Models: CoMP proposes a collaborative multi-mode pruning framework that eliminates inconsistencies between parameter and token pruning metrics via a Collaborative Importance Metric (CIM), and adaptively selects the optimal pruning mode at each stage through a Multi-mode Pruning Strategy (MPS), achieving significant improvements over single-mode and naive joint pruning approaches at high pruning ratios.
Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling: This paper proposes CFC (Conditional Factuality Control), a post-hoc conformal framework that learns feature-conditional acceptance threshold functions via augmented quantile regression, providing conditional coverage guarantees (rather than merely marginal guarantees) for LLM sampled outputs. The authors further derive a PAC-style finite-sample certificate CFC-PAC, and validate the approach on synthetic data, reasoning/QA benchmarks, and VLM settings.
Continual Learning with Vision-Language Models via Semantic-Geometry Preservation: This paper proposes SeGP-CL, which constructs adversarial anchors via dual-objective projected gradient descent to probe fragile regions at old–new semantic boundaries. Combined with Anchor-guided Cross-modal Geometry Distillation (ACGD) and Text Semantic Geometry Regularization (TSGR), SeGP-CL effectively preserves the cross-modal semantic-geometric structure of VLMs under exemplar-free conditions, substantially alleviating catastrophic forgetting.
Continual Learning with Vision-Language Models via Semantic-Geometry Preservation: This paper proposes SeGP-CL, which constructs anchor samples at the semantic boundaries between old and new classes via adversarial PGD, and couples them with Anchor-guided Cross-modal Geometry Distillation (ACGD) and Text Semantic Geometry Regularization (TSGR) to preserve cross-modal semantic-geometric structures during VLM continual learning without requiring replay of old data, achieving state-of-the-art performance on five benchmarks.
CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models: This paper identifies the "visual preference conflict" problem in visual encoder fine-tuning within MLLMs, and proposes the CoVFT framework. By introducing Context Vector Extraction (CVE) and Context-aware Mixture of Experts (CoMoE), CoVFT achieves context-aware visual fine-tuning, attaining state-of-the-art performance across 12 multimodal benchmarks with significantly improved stability over existing methods.
CoVR-R: Reason-Aware Composed Video Retrieval: CoVR-R proposes a reasoning-first zero-shot composed video retrieval framework that leverages a large multimodal model (Qwen3-VL) to explicitly reason about the "after-effects" (state transitions, temporal phases, shot changes, etc.) implied by edit instructions. The paper further introduces the CoVR-R benchmark, comprising structured reasoning traces and hard negatives, to evaluate reasoning capability. The method substantially outperforms existing approaches in retrieval accuracy.
CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning: This paper proposes a graph-based automatic data generation pipeline that constructs the CRIT dataset and benchmark for training and evaluating VLMs on cross-modal multi-hop reasoning over interleaved image-text content. Models fine-tuned on CRIT achieve significant improvements on multiple benchmarks including SPIQA.
CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception: This paper proposes CropVLM — a lightweight 256M-parameter cropping network trained via GRPO reinforcement learning (without manual bounding box annotations) that dynamically selects the most informative image regions for VLMs to focus on, enabling plug-and-play integration with both open-source and commercial VLMs to improve fine-grained visual understanding.
CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods: This paper proposes CrossHOI-Bench, the first unified multiple-choice HOI benchmark for evaluating both VLMs and HOI-specific models. Through carefully curated positive and negative examples that eliminate erroneous penalties from incomplete annotations, the benchmark reveals that large VLMs under zero-shot settings surpass state-of-the-art HOI methods by +5.18% in Instance-F1, while still exhibiting systematic weaknesses in multi-action recognition and cross-person action attribution.
Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens: This paper proposes CubiD, the first model to perform discrete diffusion generation over high-dimensional representation tokens (768-dim). By conducting fine-grained mask prediction over an \(h \times w \times d\) cubic tensor, CubiD achieves high-quality image generation while preserving visual understanding capability.
Customized Visual Storytelling with Unified Multimodal LLMs: This paper proposes the VstoryGen framework and its core component CustFilmer, which leverages a unified multimodal large language model (UMLLM) to enable customized multimodal story generation with joint conditioning on text descriptions, character/scene reference images, and shot types. Two new benchmarks, MSB and M2SB, are also introduced.
DC-Merge: Improving Model Merging with Directional Consistency: DC-Merge identifies that the key to effective model merging lies in maintaining directional consistency in singular space between the merged multi-task vector and the original single-task vectors. By combining singular value smoothing with shared orthogonal subspace projection, DC-Merge achieves state-of-the-art merging performance on both Vision and Vision-Language tasks.
DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles: This paper proposes DeAR, which uses a Concept Entropy metric to decompose the deep-layer attention heads of ViT into three functional roles—attribute heads, generalization heads, and mixed heads—and designs a role-based attention masking mechanism to precisely control information flow, achieving the best balance between task adaptation and zero-shot generalization across 15 datasets.
Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation: This paper proposes DASP, which diagnoses biased modalities via a redundancy score and applies an asymmetric adaptation strategy to decouple stability and plasticity, addressing negative transfer and catastrophic forgetting in multi-modal test-time adaptation.
Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification: This paper presents the first systematic evaluation of demographic fairness in face verification across 9 open-source MLLMs, measuring gender and ethnicity bias on the IJB-C and RFW benchmarks using 4 FMR-based fairness metrics, and finds that bias patterns in MLLMs differ substantially from those in traditional face recognition systems.
Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models: This paper identifies an overlooked "Narrow Policy" bottleneck in driving VLA models—over-exploitation during the IL phase causes exploration collapse, which in turn constrains the RL phase. The proposed Curious-VLA framework achieves SOTA on Navsim (PDMS 90.3, Best-of-N 94.8) via feasible trajectory expansion and diversity-aware RL.
Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection: The paper proposes CARE, a framework that first applies causal mediation analysis to precisely localize neurons and layers causally associated with unsafe behavior in VLMs (diagnosis), then constructs a dual-modal safety subspace via generalized eigendecomposition and projects activations onto it at inference time (repair), reducing attack success rates to below 10% with negligible loss of general capability.
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs: This paper proposes DACO, a framework that constructs a multimodal concept dictionary of 15,000 concepts from WordNet and CC-3M, and combines it with sparse autoencoders (SAE) to achieve fine-grained concept control over frozen MLLM activation spaces, significantly improving safety across multiple benchmarks while preserving general capability.
Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement: This paper proposes HRNet, which learns clean shared representations via cross-scale disentanglement and adaptive projection (CDAP), and jointly predicts rigid and non-rigid transformations in a unified coarse-to-fine pipeline without iteration, achieving state-of-the-art performance on four multimodal datasets.
Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks: This paper presents the first systematic study of model inversion (MI) attacks against VLMs. It proposes SMI-AW, a sequence-level inversion method based on adaptive token attention weighting, which dynamically weights token gradients according to their visual relevance to reconstruct private training images from VLMs. The method achieves a human-evaluated attack accuracy of 61.21%.
Do Vision Language Models Need to Process Image Tokens?: This paper systematically demonstrates that image token representations in VLMs stabilize in shallow layers and become functionally interchangeable across deeper layers, while text token representations undergo continuous dynamic reconstruction — the necessity of deep image processing is highly dependent on the output task type.
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding: DocSeeker achieves structured reasoning and evidence grounding in long document understanding via an ALR (Analyze–Locate–Reason) visual reasoning paradigm combined with two-stage training (SFT + EviGRPO). The model is trained exclusively on short documents yet generalizes robustly to documents of extreme length.
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small VLMs: This paper systematically investigates the effect of LLM downscaling on multimodal capabilities, finding that vision-dependent tasks, rather than LLM-intrinsic tasks, suffer the most, and that perception degradation is as severe as reasoning degradation. The proposed Extract+Think method (visual extraction tuning + step-by-step reasoning) uses a 0.6B perception module and a 1.7B reasoning module to outperform PrismCaptioner and LLaVA-OneVision-0.5B, including baselines reported as up to 12× larger.
DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing: DSCA decomposes the VLM representation space into a set of orthogonal semantic subspaces and performs gated residual interventions within each subspace for knowledge editing, achieving >95% editing success rate with near-zero forgetting after 1,000 sequential edits.
DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions: This paper introduces the DSERT-RoLL driving dataset, the first to integrate six sensor modalities — stereo event cameras, RGB, thermal imaging, 4D radar, and dual LiDAR — covering diverse weather and lighting conditions, along with a unified multi-modal 3D detection fusion framework.
DUET-VLM: Dual Stage Unified Efficient Token Reduction for VLM Training and Inference: This paper proposes DUET-VLM, a dual-stage visual token compression framework. Stage 1 operates within the visual encoder: dominant tokens are selected via V2V self-attention, and remaining tokens are merged into contextual tokens through attention-guided local cluster aggregation. Stage 2 operates within the LLM, progressively pruning visual tokens via T2V cross-attention across multiple layers. On LLaVA-1.5-7B, DUET-VLM achieves 67% token compression while retaining 99%+ accuracy, and 89% compression while retaining 97%+ accuracy, with a 31% reduction in training time.
DUET-VLM: Dual Stage Unified Efficient Token Reduction for VLM Training and Inference: DUET-VLM proposes a dual-stage visual token compression framework: the first stage (V2V) merges redundant tokens into compact, information-preserving representations via local cluster aggregation on the vision encoder side; the second stage (T2V) progressively discards low-information tokens through text-guided hierarchical adaptive pruning on the language backbone side. On LLaVA-1.5-7B, 67% compression retains 99% accuracy and 89% compression retains 97% accuracy.
Dynamic Token Reweighting for Robust Vision-Language Models: This paper proposes DTR (Dynamic Token Reweighting), the first inference-time defense against multimodal jailbreak attacks that operates by optimizing the KV cache of VLMs. DTR introduces the concept of "Reversal Safety-Relevant Shift" (RSS) to identify visual tokens responsible for safety degradation, dynamically adjusts their weights to restore the model's safety alignment, and preserves benign task performance.
DTR: Dynamic Token Reweighting for Robust Vision-Language Models: DTR is proposed as the first method to defend against multimodal jailbreak attacks via KV cache optimization. It identifies adversarial visual tokens using a Reversal Safety-Relevant Shift (RSS) and suppresses their influence through dynamic reweighting. With only 4 optimization steps and without relying on image-to-text conversion, DTR substantially reduces attack success rates (HADES S+T+A: 56.9%→15.9%) while preserving VLM performance and inference efficiency.
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs: This paper proposes DynamicGTR, a framework that dynamically routes each query at inference time to the optimal graph topology representation (GTR, 8 variants spanning visual and textual modalities), substantially improving VLM performance on zero-shot graph algorithm QA, with transferability to real-world tasks such as link prediction and node classification.
EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval: EagleNet constructs a text-frame relational graph and employs a relational graph attention network to learn fine-grained text-frame and frame-frame relationships, generating enhanced text embeddings enriched with video contextual information. An energy-based matching mechanism is further introduced to capture the distribution of ground-truth text-video pairs. The method achieves state-of-the-art performance on four benchmark datasets.
EBMC: Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis: This paper proposes EBMC, a two-stage framework that first improves the representation quality of weak modalities via semantic disentanglement and cross-modal enhancement, then achieves balanced multimodal sentiment analysis through energy-guided modality coordination and instance-aware trust distillation, maintaining strong robustness under missing-modality scenarios.
Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs: This paper proposes the AGDI framework for black-box copyright tracking in MLLMs via adversarially optimized trigger images. A dual injection mechanism simultaneously injects copyright information at the response level (CE loss driving an auxiliary model to produce a target answer) and the semantic level (minimizing cosine distance between the trigger image and target text in CLIP space). An adversarial training scheme simulates fine-tuning resistance. AGDI consistently outperforms PLA and RNA baselines on Qwen2-VL and LLaVA-1.5.
Efficient Document Parsing via Parallel Token Prediction: This paper proposes PTP (Parallel Token Prediction), a model-agnostic plug-and-play acceleration method that enables parallel multi-token prediction by inserting learnable register tokens into training sequences, achieving 1.6×–2.2× throughput gains on OmniDocBench without accuracy loss.
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs: This paper proposes EgoMind, a CoT framework that requires no geometric priors. Through two core components—Role-Play Caption (RPC) and Progressive Spatial Analysis (PSA)—it achieves competitive multi-frame spatial reasoning using only 5K SFT and 20K RL samples.
EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models: This paper proposes EMO-R3, which guides MLLMs to perform step-by-step emotional reasoning via Structured Emotional Thinking (SET), and introduces a Reflective Emotional Reward (RER) that prompts the model to re-evaluate the visual-textual consistency and emotional coherence of its reasoning, substantially improving both interpretability and accuracy in multimodal affective understanding.
EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis: This paper introduces EmoVerse — the first large-scale interpretable visual emotion dataset (219K+ images) covering both CES (Mikels 8-class discrete emotions) and DES (1024-dimensional continuous emotion space). It proposes a B-A-S (Background-Attribute-Subject) triplet knowledge graph annotation scheme and an Annotation & Verification Pipeline (Gemini/GPT-4o + EmoViT + CoT Critic Agent), and fine-tunes Qwen2.5-VL-3B to perform 1024-dimensional DES projection and emotion attribution explanation.
EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis: This paper proposes EmoVerse, a 219K-scale visual emotion dataset that achieves word-level and subject-level emotion attribution via knowledge graph-inspired Background-Attribute-Subject triplets. It provides dual emotion annotations in both discrete CES and continuous 1024-dimensional DES spaces, accompanied by a multi-stage annotation validation pipeline and an interpretable emotion model based on Qwen2.5-VL.
Empowering Semantic-Sensitive Underwater Image Enhancement with VLM: This paper proposes a plug-and-play strategy (-SS) that leverages VLMs to generate semantic guidance maps. Through a dual-guidance mechanism comprising cross-attention injection and a semantic alignment loss, the approach directs underwater image enhancement models to focus on semantically critical regions during restoration, yielding significant improvements in perceptual quality as well as downstream detection and segmentation performance.
Empowering Semantic-Sensitive Underwater Image Enhancement with VLM: This paper proposes a VLM-driven semantic-sensitive learning strategy that leverages LLaVA to generate object descriptions, BLIP to construct spatial semantic guidance maps, and a dual-guidance mechanism (cross-attention injection + semantic alignment loss) to steer the UIE decoder during reconstruction. The approach yields consistent improvements in both perceptual quality and downstream detection/segmentation performance.
ENC-Bench: A Benchmark for Evaluating MLLMs in Electronic Navigational Chart Understanding: This paper introduces ENC-Bench, the first professional-grade benchmark for Electronic Navigational Chart (ENC) understanding, comprising 20,490 samples organized under a three-level hierarchical evaluation framework (Perception → Spatial Reasoning → Maritime Decision-Making). Systematic evaluation of 10 MLLMs reveals that the best-performing model achieves only 47.88% accuracy, exposing a critical capability gap of general-purpose models in safety-critical specialized domains.
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards: This paper proposes EvoLMM, a fully unsupervised self-evolving framework that instantiates two roles from a single backbone LMM: a Proposer (generating visual questions) and a Solver (producing multiple answers). By replacing discrete majority voting with continuous self-consistency rewards, the model improves multimodal mathematical reasoning using only raw images (ChartQA +2.7%, MathVista +2.1%).
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards: This paper proposes EvoLMM, a fully unsupervised self-evolving framework that derives a Proposer (generating image-grounded questions) and a Solver (answering those questions) from a single LMM. A continuous self-consistency reward — replacing discrete majority voting — forms a closed-loop training signal. Using only raw images (no annotations, no external reward models), EvoLMM achieves consistent gains of approximately 2–3% across eight multimodal mathematical reasoning benchmarks.
Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory: This paper proposes MM-SafetyBench++ and the EchoSafe framework, which accumulates safety insights by maintaining a self-reflective memory bank at inference time, enabling MLLMs to distinguish visually similar scenarios with different safety intents based on context—improving contextual safety without any training.
EvoPrompt: Evolving Prompt Adaptation for Vision-Language Models: EvoPrompt addresses catastrophic forgetting and modality bias in VLM prompt learning via a trajectory-aware prompt evolution strategy — comprising unified embedding projection, direction–magnitude decoupled training, and feature geometric regularization — achieving state-of-the-art performance across few-shot, cross-dataset, and domain generalization benchmarks while preserving zero-shot capability.
Evolving Prompt Adaptation for Vision-Language Models: This paper proposes EvoPrompt, a framework that treats prompt training as a progressive evolution from general semantic anchors to task-specific features. It introduces a Modal-shared Prompt Projector (MPP) for unified cross-layer and cross-modal prompt generation, an evolution trajectory-aware strategy (direction–magnitude decoupling with historical direction freezing) to prevent forgetting, and Feature Geometry Regularization (FGR) to prevent representation collapse. EvoPrompt achieves an average HM of 80.73% on base-to-novel generalization across 11 datasets, surpassing all existing prompt learning methods.
Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration: This paper proposes the LMEE benchmark and the MemoryExplorer framework, which jointly evaluate the process and outcome of embodied exploration by unifying multi-object navigation with memory-based question answering. By fine-tuning an MLLM via reinforcement learning to actively invoke memory retrieval tools, the method achieves an SR of 23.53% on LMEE-Bench (surpassing 3D-Mem's 16.91%) and an SR of 46.40% on GOAT-Bench.
FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Models: This paper proposes FairLLaVA, a parameter-efficient fairness-aware fine-tuning method that eliminates demographic shortcuts in multimodal large language models by minimizing the mutual information between hidden states and demographic attributes, significantly narrowing inter-group performance gaps in chest X-ray report generation and skin lesion question answering.
FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment: This paper proposes FALCON, a learning-based mini-batch construction strategy that employs a negative mining scheduler to adaptively balance the trade-off between hard negatives and false negatives, substantially improving cross-modal alignment quality in vision-language pretraining (VLP).
Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients: This paper proposes Quantization-aware Integrated Gradients (QIG), advancing sensitivity analysis for LVLM quantization from the modality level to the token level. By leveraging axiomatic attribution principles, QIG precisely quantifies each token's contribution to quantization error, achieving significant accuracy improvements under W4A8 and W3A16 settings with negligible additional computational overhead.
FINER: MLLMs Hallucinate under Fine-grained Negative Queries: This paper shows that MLLM hallucination rates rise sharply under fine-grained negative queries (queries involving multiple objects, attributes, or relations in which only one detail is subtly wrong). It proposes the FINER benchmark and FINER-Tuning (based on DPO), achieving up to 24.2% improvement on InternVL3.5-14B.
FlashCache: Frequency-Domain-Guided Outlier-KV-Aware Multimodal KV Cache Compression: This paper proposes FlashCache, the first method to analyze the importance distribution of multimodal KV Cache from a frequency-domain perspective. It discovers that KV pairs deviating from low-frequency principal components—termed "outlier KVs"—encode features critical for inference. By identifying outlier KVs via DCT low-pass filtering and prioritizing their retention alongside dynamic per-layer budget allocation, FlashCache achieves 1.69× decoding speedup under 80% KV memory compression with negligible task performance degradation, while being natively compatible with FlashAttention.
FlowComposer: Composable Flows for Compositional Zero-Shot Learning: FlowComposer is the first work to introduce Flow Matching into Compositional Zero-Shot Learning (CZSL). It learns two primitive flows—an attribute flow and an object flow—to transport visual features into their corresponding text embedding spaces, and employs a learnable Composer to explicitly combine velocity fields into a compositional flow. A leakage-guided augmentation strategy further converts imperfect feature disentanglement into auxiliary supervision signals. As a plug-and-play module, FlowComposer consistently improves CZSL performance across three benchmarks.
FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching VLA Models: FlowHijack is the first systematic backdoor attack framework targeting the vector field dynamics of flow-matching VLA models. It achieves high attack success rates and behavioral stealthiness via a τ-conditional injection strategy and a dynamic imitation regularizer.
FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy: This paper proposes FluoCLIP, a two-stage vision-language framework that first performs stain-grounding to enable CLIP to learn the semantics of fluorescence stains, then conducts stain-guided ranking for stain-aware focus quality assessment. The paper also introduces FluoMix, the first multi-stain tissue-level fluorescence microscopy dataset for FQA.
PinPoint: Focus, Don't Prune — Identifying Instruction-Relevant Regions for Information-Rich Image Understanding: This paper proposes PinPoint, a two-stage framework that first localizes instruction-relevant image regions via Instruction-Region Alignment, then re-encodes the selected regions at fine granularity, achieving higher VQA accuracy with fewer visual tokens.
From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing: This paper proposes TAR-FAS, the first framework to reformulate Face Anti-Spoofing (FAS) as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm. MLLMs adaptively invoke external visual tools (LBP/FFT/HOG, etc.) during inference, upgrading from "intuitive judgment" to "fine-grained investigation" and achieving SOTA on the 1-to-11 cross-domain protocol.
from masks to pixels and meaning a new taxonomy benchmark and metrics for vlm im: This paper argues that existing image tampering detection benchmarks rely on coarse mask annotations that are severely misaligned with actual edit signals. It proposes PIXAR—a pixel-level, semantically-aware tampering detection benchmark containing 420K+ image pairs—along with a new training framework and evaluation metrics that substantially outperform existing methods in precise localization and semantic understanding.
From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings: This paper proposes LAPS (Latent Action-based Primitive Segmentation), a pipeline that defines a "Latent Action Energy" metric in the latent action space to discover and segment semantic action primitives from unannotated industrial video streams without supervision, providing structured data for VLA model pre-training.
G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval: This paper proposes G-MIXER, a training-free zero-shot composed image retrieval method that achieves state-of-the-art performance via geodesic mixup-based implicit semantic expansion (expanding the retrieval scope along multiple interpolation ratios on the hypersphere) and explicit semantic re-ranking (filtering noisy candidates using MLLM-generated attributes).
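The geodesic mixup idea in the G-MIXER entry above can be sketched as spherical linear interpolation (slerp) between two unit-norm embeddings. This is a minimal illustration, not the paper's implementation: the choice of mixing a reference-image embedding with a text embedding, the function names, and the ratio values are all assumptions made for the example.

```python
import math


def normalize(v):
    """Scale a vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def slerp(a, b, t):
    """Interpolate along the great-circle arc from a (t=0) to b (t=1)."""
    a, b = normalize(a), normalize(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between the two unit vectors
    s = math.sin(omega)
    if s < 1e-8:
        # Parallel or antipodal inputs: the arc is degenerate or undefined,
        # so this sketch simply falls back to the first vector.
        return a
    wa = math.sin((1.0 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


def geodesic_queries(image_emb, text_emb, ratios=(0.25, 0.5, 0.75)):
    """Build one query embedding per interpolation ratio (ratios are hypothetical)."""
    return [slerp(image_emb, text_emb, t) for t in ratios]
```

Each returned query stays on the unit sphere, so it can be compared with candidate embeddings by cosine similarity without renormalization.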
GACD: Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection: The GACD framework estimates each token's (visual/textual/output) contribution to the current prediction via a first-order Taylor gradient approximation, and uses these estimates to simultaneously mitigate text-visual bias (by amplifying visual token influence) and co-occurrence bias (by suppressing visual tokens anchored to previously generated objects). It achieves an 8% improvement in overall AMBER score and an 8% gain in POPE F1, without requiring training or auxiliary models.
Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning: This paper proposes GAR-SSL, a training-free sound source localization (SSL) framework that reframes SSL as a three-stage metacognitive reasoning process—Generate, Analyze, and Refine—leveraging the intrinsic reasoning capabilities of MLLMs via prompt engineering alone. The method achieves performance comparable to or surpassing supervised approaches on both single-source and multi-source localization benchmarks.
GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning: This paper proposes GraphVLM, a benchmark that systematically evaluates VLMs in three roles for multimodal graph learning (MMGL): VLM-as-Encoder (enhancing GNN features), VLM-as-Aligner (bridging modalities for LLM-based reasoning), and VLM-as-Predictor (serving directly as the graph learning backbone). Experiments across six datasets demonstrate that VLM-as-Predictor consistently achieves the best performance, revealing the substantial potential of VLMs as a new foundation for MMGL.
GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning: This work proposes the GraphVLM benchmark, which systematically evaluates VLMs across three roles in multimodal graph learning (Encoder / Aligner / Predictor). The VLM-as-Predictor paradigm consistently achieves the best performance, revealing the substantial potential of VLMs as backbones for multimodal graph reasoning.
GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding: This paper proposes GroundVTS, a query-guided fine-grained visual token sampling architecture for video large language models, which adaptively preserves spatiotemporally relevant information at the token level. It achieves an 18.4-point mIoU improvement on Charades-STA and a 20.6-point mAP improvement on QVHighlights.
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training: This paper proposes GTR-Turbo, a framework that merges historical checkpoints generated during RL training to serve as a free teacher model. Without relying on expensive external API models, GTR-Turbo achieves performance comparable to or better than GTR in multi-turn visual agent training, while reducing training time by 50% and computational cost by 60%.
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training: This paper proposes GTR-Turbo, which generates a "free teacher model" by merging historical checkpoints produced during RL training via TIES, and uses this teacher to guide subsequent training (via SFT or KL distillation). GTR-Turbo matches or surpasses GTR—which relies on external teachers such as GPT-4o—across multiple visual agent benchmarks, while reducing training time by 50% and computational cost by 60%.
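The TIES merging step mentioned in the GTR-Turbo entry above can be sketched on flat parameter lists. This is a simplified illustration of the general TIES recipe (trim, elect sign, merge agreeing values), not the GTR-Turbo code: the flat-list representation, the function names, and the `keep_ratio` and `scale` defaults are assumptions.

```python
def trim(delta, keep_ratio):
    """Keep the largest-magnitude entries of a task vector, zero the rest."""
    k = max(1, round(len(delta) * keep_ratio))
    threshold = sorted((abs(x) for x in delta), reverse=True)[k - 1]
    # Entries tied with the threshold are all kept, so slightly more than k may survive.
    return [x if abs(x) >= threshold else 0.0 for x in delta]


def ties_merge(base, checkpoints, keep_ratio=0.2, scale=1.0):
    """Merge checkpoints into one parameter list relative to a shared base."""
    # Task vectors: how far each checkpoint moved away from the base weights.
    deltas = [
        trim([c - b for c, b in zip(ckpt, base)], keep_ratio)
        for ckpt in checkpoints
    ]
    merged = []
    for i, b in enumerate(base):
        column = [d[i] for d in deltas]
        # Elect the sign carrying the larger total mass for this parameter.
        sign = 1.0 if sum(column) >= 0 else -1.0
        # Average only the non-zero entries that agree with the elected sign.
        agreeing = [x for x in column if x * sign > 0]
        mean = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + scale * mean)
    return merged
```

In this sketch the merged list plays the role of the "free teacher"; how the teacher then guides training (SFT or KL distillation) is outside the scope of the example.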
GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks: This paper proposes GUIDE, a benchmark comprising 67.5 hours of screen recordings and think-aloud annotations from 120 novice users across 10 software applications. It defines three hierarchical tasks—behavioral state detection, intent prediction, and assistance prediction—and finds that current state-of-the-art multimodal models show limited capability in understanding user behavior and judging assistance needs (behavioral detection accuracy of only 44.6%), while providing structured user context substantially improves performance (up to +50.2pp on assistance prediction).
HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding: This paper proposes the HAMMER framework, which extracts contact-aware intention embeddings from an MLLM, enhances point cloud features via hierarchical cross-modal fusion, and injects 3D spatial information into the intention embeddings through a multi-granular geometry lifting module. The framework achieves interaction-image-based 3D affordance grounding and comprehensively outperforms existing methods on the PIAD benchmark.
HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models: This paper introduces HandVQA — a large-scale diagnostic benchmark containing over 1.6 million multiple-choice questions, automatically generated from 3D hand joint annotations. The benchmark covers joint angles, distances, and relative positions, and systematically exposes severe deficiencies of current VLMs in fine-grained hand spatial reasoning. The paper further demonstrates that models fine-tuned on HandVQA can zero-shot transfer to downstream tasks such as gesture recognition (+10.33%) and hand-object interaction recognition (+2.63%).
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models: This paper proposes HAWK, a head importance-aware visual token pruning method that computes per-head contribution weights to visual understanding offline and dynamically evaluates each visual token's importance via text-guided attention scores. On Qwen2.5-VL, HAWK retains 96.0% of original performance after pruning 80.2% of visual tokens while reducing inference latency by 26%.
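The scoring idea in the HAWK entry above, combining offline head weights with text-guided attention, might look roughly like the following. This is a sketch under stated assumptions: the weighted-sum scoring rule, the `attn[h][j]` layout (attention mass that head `h` places on visual token `j`), and the function names are hypothetical and not taken from the paper.

```python
def token_scores(attn, head_weights):
    """Score each visual token as a head-weighted sum of its attention mass."""
    num_tokens = len(attn[0])
    return [
        sum(w * row[j] for w, row in zip(head_weights, attn))
        for j in range(num_tokens)
    ]


def keep_indices(scores, keep_ratio):
    """Return the indices of the highest-scoring tokens, in original order."""
    k = max(1, round(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


# Two heads, four visual tokens; the second head is assumed to matter more.
attn = [
    [0.7, 0.2, 0.1, 0.0],
    [0.1, 0.1, 0.2, 0.6],
]
head_weights = [0.2, 0.8]
kept = keep_indices(token_scores(attn, head_weights), keep_ratio=0.5)
```

Returning the kept indices in their original order preserves the positional layout of the surviving visual tokens.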
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models: This paper proposes HiF-VLA, a framework that uses Motion Vectors (MV) as compact temporal primitives to unify three temporal reasoning capabilities—Hindsight, Insight, and Foresight—enabling bidirectional temporal extension of VLA models. HiF-VLA substantially outperforms baselines on long-horizon manipulation tasks with minimal computational overhead.
HiFICL: High-Fidelity In-Context Learning for Multimodal Tasks: By precisely decomposing the attention formula to reveal the mathematical essence of the ICL effect (a dynamic mixture of standard attention output and demonstration value matrices), this paper proposes HiFICL—which directly parameterizes the source of ICL via learnable low-rank virtual key-value pairs rather than approximating its effect—achieving comprehensive improvements over existing ICL approximation methods on multimodal benchmarks with only 2.2M parameters.
HiFICL: High-Fidelity In-Context Learning for Multimodal Tasks: HiFICL reframes the ICL approximation problem through rigorous attention formula derivation — shifting from "fitting a shift vector" to "directly parameterizing the source of ICL" — by injecting learnable low-rank virtual key-value pairs into attention heads. Trained end-to-end, this yields a dynamic, context-aware parameter-efficient fine-tuning method that surpasses existing ICL approximation methods and LoRA on multiple multimodal benchmarks with significantly fewer parameters.
HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models: HiSpatial decomposes 3D spatial intelligence into four cognitive levels (geometric perception → object attributes → inter-object relations → abstract reasoning), constructs an automated data pipeline processing ~5M images, 45M objects, and 2B QA pairs, and designs an RGB-D VLM that takes metric-scale point cloud maps as auxiliary input. With only 3B parameters, it surpasses GPT-5 and Gemini-2.5-Pro on multiple spatial reasoning benchmarks.
HIVE: Query, Hypothesize, Verify — An LLM Framework for Multimodal Reasoning-Intensive Retrieval: HIVE is a plug-and-play multimodal retrieval framework that improves nDCG@10 from 27.6 (best multimodal model) to 41.7 (+14.1 absolute points) on reasoning-intensive multimodal retrieval through four stages — initial retrieval → LLM-driven compensatory query synthesis (explicitly expressing visual reasoning gaps) → secondary retrieval → LLM verification reranking — without any additional training.
HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models: This paper proposes HOG-Layout, a hierarchical framework for 3D indoor scene generation, optimization, and editing based on VLMs and LLMs. Using RAG-enhanced semantic consistency and force-directed hierarchical optimization for physical plausibility, it outperforms LayoutVLM on SceneEval while running 4.5× faster.
HoneyBee: Data Recipes for Vision-Language Reasoners: This work systematically investigates the principles underlying the construction of vision-language reasoning datasets—covering context source strategies, data interventions (image caption auxiliary signals and text-only reasoning), and multi-dimensional data scaling—and uses these insights to build HoneyBee, a 2.5M-sample CoT reasoning dataset. A 3B VLM trained on HoneyBee surpasses the prior SOTA by 7.8% on MathVerse, while a proposed test-time scaling strategy reduces decoding cost by 73%.
HoneyBee: Data Recipes for Vision-Language Reasoners: This paper systematically investigates the design space of VL reasoning training data—covering data source selection, intervention strategy filtering, and three-dimensional scaling across images, questions, and CoTs. Based on the resulting insights, the authors construct the HoneyBee dataset with 2.5M samples. A 3B VLM trained on HoneyBee surpasses the previous SOTA on MathVerse by 7.8pp, and a shared caption decoding strategy for test-time scaling reduces token consumption by 73%.
HouseMind: Tokenization Allows MLLMs to Understand, Generate and Edit Architectural Floor Plans: This paper proposes HouseMind, a framework that discretizes architectural floor plans into structured sequences of contour tokens and room instance tokens via a hierarchical VQ-VAE. Combined with three-stage multimodal alignment and instruction fine-tuning on Qwen3-0.6B as the backbone, HouseMind achieves unified modeling of floor plan understanding, generation, and editing, substantially outperforming existing methods in geometric validity and controllability.
HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models: This paper proposes HulluEdit, a single-pass, reference-free subspace editing framework that decomposes hidden states into three orthogonal subspaces—a visual evidence subspace, a conflicting prior subspace, and a residual uncertainty subspace—to selectively suppress hallucination patterns without interfering with visual grounding, achieving state-of-the-art hallucination mitigation on the POPE and CHAIR benchmarks.
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks: This paper introduces HumanVBench, a human-centric video understanding benchmark comprising 16 fine-grained tasks, accompanied by two automated pipelines (video annotation and distractor-aware QA synthesis). Evaluation of 30 mainstream video MLLMs reveals critical deficiencies in current models regarding nuanced emotion perception and speech-visual alignment.
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks: This paper presents HumanVBench, a video benchmark comprising 16 fine-grained tasks, systematically evaluating the human-centric video understanding capabilities of MLLMs via two automated pipelines (video annotation and distractor generation). The benchmark reveals significant deficiencies in current models regarding emotion perception and speech-visual alignment.
IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding: This paper proposes IAG, the first multi-target backdoor attack method against VLM-based visual grounding. By employing a text-conditioned U-Net to dynamically generate input-aware triggers, IAG embeds the semantic information of any attacker-specified target object into the visual input, achieving the highest attack success rate in 11 out of 12 evaluated settings.
Interpretable Debiasing of Vision-Language Models for Social Fairness: This paper proposes DeBiasLens, which trains a Sparse Autoencoder (SAE) on VLM encoders to localize "social neurons" encoding social attributes, then selectively deactivates these neurons at inference time to mitigate bias. The method reduces Max Skew by 9–16% on CLIP and reduces gender bias rates by 40–50% on InternVL2, while preserving general performance.
IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment: IsoCLIP provides a theoretical analysis of the CLIP projection head structure, revealing that the cosine similarity computation implicitly contains an inter-modal operator \(\Psi = W_i^\top W_t\) responsible for cross-modal alignment, and an intra-modal operator \(\Psi_i = W_i^\top W_i\) responsible solely for normalization without promoting intra-modal alignment. By applying singular value decomposition to \(\Psi\), the method identifies an approximately isotropic alignment subspace and, by removing anisotropic directions, significantly improves intra-modal retrieval and classification performance without any training.
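The decomposition described in the IsoCLIP entry above can be illustrated with a short derivation. This is a sketch under assumed notation: \(x\) and \(y\) denote the pre-projection image and text features (symbols not used in the summary), and \(\Psi_t = W_t^\top W_t\) is introduced here by analogy with \(\Psi_i\).

```latex
\[
\cos\bigl(W_i x,\; W_t y\bigr)
  = \frac{(W_i x)^\top (W_t y)}{\lVert W_i x \rVert \, \lVert W_t y \rVert}
  = \frac{x^\top \Psi \, y}{\sqrt{x^\top \Psi_i \, x}\;\sqrt{y^\top \Psi_t \, y}}
\]
```

Under this notation the inter-modal operator \(\Psi\) appears only in the numerator, while the intra-modal operators appear only in the normalizing terms, which is consistent with the summary's statement that \(\Psi_i\) normalizes but does not itself promote alignment.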
It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models: This paper reveals that state-of-the-art VLMs still fail to reliably read analog clocks in real-world scenes (zero-shot accuracy below 10%), and proposes TickTockVQA, a real-world dataset of 12K images, along with a Swap-DPO fine-tuning framework that improves Llama-3.2-11B's time-reading accuracy from 1.43% to 46.22%.
Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild: This paper proposes the JALA framework, which constructs a unified latent action space via joint alignment between predictive embeddings and latent actions inferred by an inverse dynamics model, enabling VLAs to learn simultaneously from labeled data and unlabeled in-the-wild human videos. Combined with the 7.5M-sample UniHand-Mix dataset, JALA significantly improves the generalization of robot manipulation policies.
KEC: Hierarchical Textual Knowledge for Enhanced Image Clustering: KEC leverages LLMs to construct hierarchical concept-attribute structured textual knowledge to guide image clustering, outperforming zero-shot CLIP on 14 out of 20 datasets without any training, demonstrating that discriminative attributes are more effective than simple class names.
KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing: This paper proposes KVSmooth, a training-free plug-and-play inference-time method that applies adaptive exponential moving average (EMA) smoothing to KV-Cache guided by attention row entropy, effectively suppressing semantic drift and hallucination generation caused by sink tokens during decoding in multimodal large language models (MLLMs). On LLaVA-1.5, CHAIR_S is reduced from 41.8 to 18.2 (a 56% reduction), while F1 improves from 77.5 to 79.2.
KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing: KVSmooth is a training-free, plug-and-play method that applies attention row entropy-guided adaptive EMA smoothing to the KV-Cache, reducing LLaVA-1.5's CHAIR_S from 41.8 to 18.2 (a 56% reduction) while improving F1 from 77.5 to 79.2, with gains in both precision and recall.
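The entropy-guided EMA smoothing described in the KVSmooth entries above can be sketched for a single key or value vector. This is an illustration only: the linear mapping from normalized entropy to the smoothing coefficient, its direction (more smoothing for more diffuse attention), and the `max_coeff` cap are assumptions, not the paper's rule.

```python
import math


def row_entropy(attn_row):
    """Shannon entropy of one attention row (a probability distribution)."""
    return -sum(p * math.log(p) for p in attn_row if p > 0)


def smoothing_coeff(attn_row, max_coeff=0.9):
    """Map entropy to an EMA coefficient in [0, max_coeff] (assumed mapping)."""
    n = len(attn_row)
    if n < 2:
        return 0.0
    # Normalize by log(n), the entropy of a uniform row of the same length.
    return max_coeff * row_entropy(attn_row) / math.log(n)


def ema_update(prev_smoothed, current, coeff):
    """Blend the running smoothed vector with the newly computed one."""
    return [coeff * p + (1.0 - coeff) * c for p, c in zip(prev_smoothed, current)]


# A diffuse attention row leads to heavy smoothing in this sketch,
# while a sharply peaked row leaves the new vector almost unchanged.
diffuse = smoothing_coeff([0.25, 0.25, 0.25, 0.25])
peaked = smoothing_coeff([1.0, 0.0, 0.0, 0.0])
smoothed = ema_update([1.0, 1.0], [3.0, 5.0], 0.5)
```

Because the coefficient is recomputed from each attention row, the amount of smoothing adapts per decoding step rather than being fixed in advance.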
Label-Free Cross-Task LoRA Merging with Null-Space Compression: Motivated by the observation that the null-space ratio of the down-projection matrix \(\mathbf{A}\) decreases during LoRA fine-tuning and is strongly correlated with task performance, this paper proposes NSC Merging — a label-free, task-agnostic LoRA merging method that achieves state-of-the-art results across 20 heterogeneous vision tasks, 6 NLI tasks, and VLM benchmarks.
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection: This paper proposes the PROGRESS framework, which dynamically selects the most informative training samples by tracking a VLM's learning progress across automatically discovered multimodal concept clusters. Using only 16–20% of annotated data, PROGRESS achieves 99–100% of full-data performance with shorter total training time.
LFPC: Learning to Focus and Precise Cropping for MLLMs: LFPC proposes a two-stage pure reinforcement learning framework that addresses the spurious tool-calling behavior ("answer-before-crop") observed in existing agent-based MLLMs. It introduces an information gap mechanism — deliberately downsampling the global image to force the model to rely on high-resolution cropped regions — and a grounding loss to improve cropping precision, achieving state-of-the-art performance on high-resolution VQA benchmarks.
Linking Perception, Confidence and Accuracy in MLLMs: This paper reveals a severe confidence miscalibration problem in MLLMs—accuracy drops sharply when visual inputs are degraded while confidence remains unchanged—and proposes CDRL (Confidence-Driven Reinforcement Learning with clean-noisy image pairs) for perception-sensitive training. The calibrated confidence is then leveraged for adaptive test-time scaling via CA-TTS, achieving an average improvement of 8.8% across four benchmarks.
LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models: Addressing three core challenges in multimodal multi-turn VLM dialogues—concealed malicious intent, cumulative contextual risk, and cross-modal joint risk—this work constructs the MMDS dataset (4,484 annotated dialogues) and the MCTS-based MMRT red-teaming framework, and proposes the LLaVAShield auditing model, achieving F1 scores of 95.71%/92.24% on the user/assistant sides respectively, substantially outperforming baselines such as GPT-5-mini.
LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models: This paper proposes LLaVAShield — the first content moderation model designed for multimodal multi-turn dialogues — along with the MMDS dataset (4,484 dialogues covering 8 major categories and 60 subcategories of risk) and MMRT, an automated MCTS-based red-teaming framework. LLaVAShield substantially outperforms baselines such as GPT-5-mini on safety auditing of both user and assistant turns.
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models: Inspired by foveal encoding and cortical magnification in the human visual system, this paper proposes LLMind, a training-free adaptive sampling framework that leverages Möbius transformations for non-uniform pixel allocation. A closed-loop semantic feedback mechanism optimizes sampling parameters at test time, achieving substantial improvements over uniform sampling under tight pixel budgets of only 1%–5%.
Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation: This paper proposes LTS-FS (Locate-Then-Sparsify for Feature Steering), a framework that employs causal intervention-based attribution to identify hallucination-relevant layers and applies layer-wise sparse control over feature steering intensity according to attribution scores, effectively mitigating hallucinations in LVLMs while preserving generalization capability.
MA-Bench: Towards Fine-grained Micro-Action Understanding: This paper proposes MA-Bench, a micro-action understanding benchmark comprising 1,000 videos and 12,000 structured QA pairs. It introduces a three-tier "Perception–Comprehension–Reasoning" evaluation architecture to systematically assess fine-grained micro-action understanding across 23 MLLMs, and constructs a 20.5K training corpus, MA-Bench-Train, to support model fine-tuning and improvement.
MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures: MarkushGrapher-2 proposes an end-to-end multimodal chemical structure recognition model that jointly encodes image, text, and layout information via a dedicated chemical OCR module. Combined with a two-stage training strategy (first adapting to OCSR features, then integrating multimodal encoding), the model substantially outperforms existing methods on Markush structure recognition (M2S accuracy 56% vs. 38%), while remaining competitive on standard molecular structure recognition.
MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models: This paper identifies a "smoothing misalignment" problem that arises when channel-wise smooth quantization methods (e.g., SmoothQuant) are directly applied to MLLMs—the large discrepancy in activation magnitudes across modalities causes non-dominant modalities to be over-smoothed. MASQuant is proposed to address this via modality-aware smoothing factors and SVD whitening-based cross-modal low-rank compensation.
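To illustrate the over-smoothing issue described in the MASQuant entry, the following is a minimal sketch of the per-channel smoothing factor used by SmoothQuant-style methods, s_j = max|X_j|^α / max|W_j|^(1−α), computed once from pooled activation statistics and once from text-only statistics. The function name, the toy statistics, and the per-modality split are illustrative assumptions. The summary does not specify how MASQuant derives or applies its modality-aware factors, and the SVD whitening-based compensation is not shown.

```python
def smoothing_factors(act_absmax, weight_absmax, alpha=0.5):
    """SmoothQuant-style per-channel factor: s_j = a_j**alpha / w_j**(1 - alpha).

    act_absmax[j]    -- max |activation| observed on input channel j
    weight_absmax[j] -- max |weight| on input channel j
    """
    return [
        (a ** alpha) / (w ** (1.0 - alpha))
        for a, w in zip(act_absmax, weight_absmax)
    ]


# Toy per-channel statistics (hypothetical numbers, not from the paper).
visual_absmax = [16.0, 4.0]   # visual tokens: large magnitude on channel 0
text_absmax = [1.0, 4.0]      # text tokens: small magnitude on channel 0
weight_absmax = [1.0, 1.0]

# One shared factor from pooled statistics is dominated by the visual tokens.
pooled_absmax = [max(v, t) for v, t in zip(visual_absmax, text_absmax)]
shared = smoothing_factors(pooled_absmax, weight_absmax)

# A factor computed from text statistics alone (assumed modality-aware variant).
text_only = smoothing_factors(text_absmax, weight_absmax)
```

In this toy setup, text activations on channel 0 would be divided by the larger shared factor even though their own statistics call for a much smaller one, which is the kind of mismatch the entry calls smoothing misalignment.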
Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning: This paper constructs D-Negation, the first visual grounding dataset with paired positive/negative semantic descriptions (14K images, 140K annotations), and proposes Grouped Opposition-Based Learning (GOBL), an efficient fine-tuning mechanism with two opposition-based loss functions—PNC and TSO. By tuning fewer than 10% of model parameters, GOBL improves Grounding DINO and APE by up to 5.7 mAP on negation-semantic benchmarks while simultaneously boosting performance on affirmative semantics.
Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning: This paper proposes the D-Negation dataset and a Grouped Opposition-Based Learning (GOBL) fine-tuning mechanism. By leveraging semantically opposed description pairs and two dedicated loss functions, GOBL fine-tunes fewer than 10% of model parameters while substantially improving negation semantic understanding in visual grounding models (up to +5.7 mAP).
Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence: Medic-AD upgrades a general-purpose medical VLM into a clinically intelligent model through a three-stage progressive training framework—anomaly detection (<Ano> token), longitudinal difference reasoning (<Diff> token), and visual explanation (heatmaps)—achieving state-of-the-art performance on multiple medical tasks with capabilities spanning lesion detection, symptom tracking, and visual interpretability.
Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs: This paper identifies that existing VLM-based OOD detection methods select negative texts using intra-modal distances (text-to-text or image-to-image), which are inconsistent with the cross-modal distances optimized by CLIP. The proposed InterNeg framework systematically leverages cross-modal distances from both textual and visual perspectives, achieving a 3.47% FPR95 reduction on ImageNet.
MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents: MindPower proposes a Robot-Centric Theory-of-Mind reasoning framework that organizes perception → belief → desire → intention → decision → action into a three-level six-layer reasoning hierarchy (MindPower Reasoning Hierarchy), and employs Mind-Reward (GRPO-based reinforcement learning) to optimize reasoning consistency, surpassing GPT-4o by 12.77% on decision-making and 12.49% on action generation.
MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents: MindPower proposes a robot-centric Theory-of-Mind (ToM) reasoning framework that organizes perception → belief → desire → intention → decision → action into a six-layer reasoning hierarchy, and optimizes reasoning consistency via Mind-Reward (based on GRPO), surpassing GPT-4o by 12.77% on decision-making and 12.49% on action generation.
Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection: This paper proposes GACD (Gradient-based Influence-Aware Constrained Decoding), which employs first-order Taylor gradient estimation to quantify each token's influence on the output. GACD simultaneously mitigates multimodal hallucinations caused by text-visual bias and co-occurrence bias at inference time, requiring neither auxiliary models nor fine-tuning.
MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with MLLMs: MMR-AD constructs the largest multimodal reasoning-oriented industrial anomaly detection dataset to date (127K images, 188 product categories, 395 anomaly types) and proposes Anomaly-R1, a GRPO reinforcement learning-based baseline model that significantly outperforms general-purpose MLLMs.
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping: This paper proposes MoDES, the first training-free expert skipping framework for MoE multimodal large language models. By leveraging Globally Modulated Local Gating (GMLG) and Dual-Modal Thresholding (DMT), MoDES adaptively skips redundant experts, retaining over 97% of original performance while skipping 88% of experts, and achieving 2.16× prefill speedup.
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping: MoDES is the first expert skipping framework for MoE multimodal large language models. It incorporates layer-level importance into routing probabilities via Global-Modulated Local Gating (GMLG), applies modality-specific skipping strategies for text and visual tokens via a Dual-Modal Threshold (DMT), and efficiently optimizes thresholds via frontier search. On Qwen3-VL-MoE-30B, MoDES retains 97.33% accuracy with 88% expert skipping, achieving a 2.16× prefill speedup.
MODIX: Training-Free Multimodal Information-Driven Positional Index Scaling for VLMs: This paper proposes MODIX, a training-free framework that dynamically adjusts the positional encoding step sizes of visual and textual tokens in VLMs via information-theoretic analysis (covariance entropy + cross-modal alignment), allocating finer positional granularity to information-dense modalities to enhance multimodal reasoning.
MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models: This paper models expert selection in MoE as a sequential decision problem and optimizes the routing strategy via GRPO-based reinforcement learning. By introducing modality-aware router guidance, the proposed method consistently outperforms deterministic top-K routing and its variants on image and video understanding tasks in VLMs.
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes: This paper proposes the Panorama-Language Modeling (PLM) paradigm and the PanoVQA large-scale panoramic VQA dataset (653K QA pairs). A plug-and-play panoramic sparse attention (PSA) module is designed to enable existing VLMs to process equirectangular projection (ERP) panoramic images without retraining, achieving superior global reasoning over multi-view stitching approaches in adverse scenarios such as occlusion and accidents.
Mixture of States (MoS): Routing Token-Level Dynamics for Multimodal Generation: This paper proposes Mixture of States (MoS), a novel fusion paradigm for multimodal diffusion models. A lightweight, learnable token-level router dynamically routes hidden states from arbitrary layers of an understanding tower (frozen LLM/VLM) to arbitrary layers of a generation tower (DiT). With only 3–5B parameters, MoS matches or surpasses the 20B Qwen-Image on both image generation and editing benchmarks.
Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models: MoT probe experiments reveal asymmetric pruning sensitivity between the text and visual pathways in LVLMs — the text pathway is highly sensitive and must be calibrated with text tokens, while the visual pathway is highly redundant and can tolerate 60% sparsity. Based on these findings, ATV-Pruning constructs a calibration pool using all text tokens plus a small, layer-adaptively selected subset of visual tokens.
MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding: This paper proposes MSJoE, a framework that jointly evolves an MLLM and a lightweight keyframe sampler via reinforcement learning. The MLLM generates visual queries to guide frame retrieval, a 1D U-Net sampler learns selection weights from a CLIP similarity matrix, and both components are optimized end-to-end, achieving +8% accuracy improvement on long-form video QA.
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following: This paper introduces Multi-Crit, the first benchmark for evaluating the pluralistic criteria-following capability of multimodal judge models. It features criterion-level human annotations and preference-conflicting samples, along with three new metrics—PAcc, TOS, and CMR—to comprehensively evaluate 25 LMMs, revealing that even the strongest closed-source model achieves only 32.78% multi-criteria consistency on open-ended generation tasks.
Multi-Modal Image Fusion via Intervention-Stable Feature Learning: This paper proposes a causal inference-inspired multi-modal image fusion framework that employs three structured intervention strategies (complementary masking, random masking, and modality dropout) to probe genuine inter-modal dependencies, and designs a Causal Feature Integrator (CFI) to learn intervention-stable features. The method achieves PSNR of 66.02 and AG of 4.129 on MSRS, and mAP of 0.821 on object detection.
Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery: This paper proposes SSR²-GCD, a framework that learns structured representations with uniformly compressed intra-modal distributions via a Semi-Supervised Rate Reduction (SSR²) loss, and introduces a Retrieval-based Text Aggregation (RTA) strategy to enhance cross-modal knowledge transfer. The method surpasses existing multi-modal GCD approaches on 8 benchmarks.
Multimodal OCR: Parse Anything from Documents: This paper proposes the Multimodal OCR (MOCR) paradigm, which unifies the parsing of text and graphics (charts, diagrams, UI components, etc.) in documents into structured textual representations (plain text + SVG code). The trained 3B-parameter dots.mocr model ranks second only to Gemini 3 Pro on OCR Arena, achieves a state-of-the-art score of 83.9 on olmOCR Bench, and surpasses Gemini 3 Pro on the image-to-SVG benchmark.
MUPO: All Roads Lead to Rome - Incentivizing Divergent Thinking in Vision-Language Models: MUPO identifies a reasoning diversity collapse in GRPO training — models prematurely converge to a small number of reasoning strategies while discarding most alternatives. By partitioning responses into groups for localized advantage estimation and introducing a diversity reward, MUPO incentivizes VLMs to maintain divergent thinking, achieving 2–7% improvements across multiple reasoning benchmarks.
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy: Nano-EmoX proposes a cognition-inspired three-level emotional task hierarchy (Perception → Understanding → Interaction) and is the first multimodal language model to unify six core affective tasks within a compact 2.2B parameter framework, employing a P2E progressive training paradigm that cultivates capabilities from basic perception to high-level empathy.
Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning: This paper proposes the Narrative Weaver framework, which combines narrative planning via MLLMs with fine-grained generation via diffusion models. Through learnable queries and a dynamic Memory Bank, the framework achieves long-range visually consistent generation under multi-modal conditioning. The authors also introduce EAVSD, the first e-commerce advertising video storyboard dataset, comprising 330K+ images.
No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models: C2LIP proposes a contrastive learning fine-tuning approach that requires no hard negatives: by decomposing text into noun-phrase concepts and introducing cross-modal attention pooling, it achieves state-of-the-art performance on the SugarCrepe/SugarCrepe++ compositionality benchmarks while maintaining or improving zero-shot and retrieval performance.
No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection: This paper proposes LAVIDA, an end-to-end zero-shot video anomaly detection framework that transforms semantic segmentation datasets into pseudo-anomaly training data via an Anomaly Exposure Sampler. Combined with MLLM-based deep anomaly semantic feature extraction and reverse-attention token compression for spatiotemporal sparsity, LAVIDA achieves frame-level and pixel-level SOTA without any real VAD data.
Noise-Aware Few-Shot Learning through Bi-directional Multi-View Prompt Alignment: This paper proposes NA-MVP, a framework that employs bi-directional (clean + noise-aware) multi-view prompt design combined with Unbalanced Optimal Transport (UOT) for fine-grained patch-to-prompt alignment, and applies classical OT for selective label correction on identified noisy samples, consistently outperforming state-of-the-art methods in noisy few-shot learning scenarios.
Noise-Aware Few-Shot Learning through Bi-directional Multi-View Prompt Alignment: This paper proposes the NA-MVP framework, which employs a bi-directional (clean + noise-aware) multi-view prompt design coupled with Unbalanced Optimal Transport (UOT) for fine-grained patch-to-prompt alignment, and applies classical OT for selective label correction on identified noisy samples, consistently surpassing state-of-the-art methods in noisy few-shot learning scenarios.
OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models: This paper proposes OddGridBench to evaluate the fine-grained visual discrepancy sensitivity of MLLMs (i.e., identifying the element in a grid that differs from others in color, size, rotation, or position). All evaluated MLLMs fall far below human performance. To address this gap, the authors propose OddGrid-GRPO, which combines curriculum learning with a distance-aware reward to significantly improve visual discrimination ability.
OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens: OmniLottie proposes a Lottie Tokenizer that converts Lottie JSON files into structured command-parameter sequences, enabling pretrained VLMs to generate high-quality vector animations from multimodal instructions. The work also introduces the MMLottie-2M large-scale dataset to support training.
On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models: This paper identifies the "token's dilemma" in dynamic MoE continual learning — ambiguous and old tokens in new-task data contribute minimally to new knowledge acquisition yet cause routing drift and catastrophic forgetting. The proposed LLaVA-DyMoE mitigates routing drift via Token Assignment Guidance and Routing Score Regularization, achieving over 7% MFN improvement and 12% forgetting reduction on the CoIN benchmark.
Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models: This paper reveals a novel mechanism underlying VLM hallucinations — overthinking: the model generates an excessive number of competing object hypotheses in intermediate decoding layers, and confounders propagate across layers to corrupt the final prediction. The paper proposes the Overthinking Score to quantify inter-layer hypothesis diversity × uncertainty, achieving F1 of 78.9% on MSCOCO and 71.58% on the OOD benchmark AMBER.
PaddleOCR-VL: Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing: PaddleOCR-VL introduces a coarse-to-fine document parsing framework that first employs a lightweight VRFM module to detect valid regions and predict reading order, then applies a compact 0.9B VLM for fine-grained recognition, achieving state-of-the-art document parsing performance with minimal visual tokens and parameters.
PaddleOCR-VL: Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing: PaddleOCR-VL proposes a coarse-to-fine document parsing architecture: the coarse stage employs a lightweight Valid Region Focusing Module (VRFM) to localize effective visual regions and predict reading order, while the fine stage applies a compact 0.9B vision-language model to perform detailed recognition on cropped regions, achieving state-of-the-art document parsing performance with minimal visual tokens and parameters.
PaddleOCR-VL: Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing: This paper proposes PaddleOCR-VL, a coarse-to-fine document parsing framework. The coarse stage employs a lightweight VRFM module to identify effective visual regions, while the fine stage applies a compact 0.9B VLM to process only those regions. With minimal visual tokens and parameters, the framework achieves state-of-the-art performance on OmniDocBench v1.5, substantially reducing latency and resource consumption.
Parallel In-context Learning for Large Vision Language Models: This paper proposes Parallel-ICL, which partitions the long demonstration context in multimodal in-context learning (MM-ICL) into chunks for parallel processing, and integrates predictions at the logit level via weighted Product-of-Experts (PoE). The method achieves performance on par with or superior to full-context MM-ICL while significantly reducing inference latency.
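The logit-level fusion mentioned in the Parallel-ICL entry can be sketched as a weighted Product-of-Experts, where the combined next-token distribution satisfies log p(y) ∝ Σ_k w_k · log p_k(y). This is a minimal sketch under assumptions: the function names are hypothetical, the per-chunk logits are toy values, and the summary does not say how the paper chooses the chunk weights.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def weighted_poe(chunk_logits, weights):
    """Fuse per-chunk next-token logits with a weighted Product-of-Experts.

    chunk_logits -- one logit vector per demonstration chunk (same vocab size)
    weights      -- one non-negative weight per chunk (assumed to be given)
    Returns the normalized log-probabilities of the fused distribution.
    """
    if len(chunk_logits) != len(weights):
        raise ValueError("need exactly one weight per chunk")
    combined = [0.0] * len(chunk_logits[0])
    for logits, w in zip(chunk_logits, weights):
        for i, lp in enumerate(log_softmax(logits)):
            combined[i] += w * lp
    return log_softmax(combined)


# Two chunks that prefer different tokens; the heavier chunk wins.
chunk_a = [2.0, 0.0, 0.0]
chunk_b = [0.0, 2.0, 0.0]
fused = weighted_poe([chunk_a, chunk_b], [0.9, 0.1])
best_token = max(range(len(fused)), key=lambda i: fused[i])
```

Because each chunk is scored independently, the per-chunk forward passes can run in parallel and only the final fusion is sequential.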
PersonaVLM: Long-Term Personalized Multimodal LLMs: This paper proposes PersonaVLM, a multimodal agent framework for long-term personalization. Through proactive memory management (four-type memory database), multi-step reasoning-based retrieval, and a momentum-based personality evolution mechanism, it transforms a general-purpose MLLM into a personalized assistant capable of adapting to shifting user preferences, surpassing GPT-4o by 5.2% under a 128K context.
Phantasia: Context-Adaptive Backdoors in Vision Language Models: Phantasia introduces the first context-adaptive backdoor attack against VLMs. Rather than generating fixed malicious text, a poisoned model receiving a triggered image silently answers an attacker-specified target question instead of the user's original query. The generated response is semantically consistent with the input image and linguistically fluent, thereby evading defenses such as STRIP-P and ONION-R. The paper also provides the first empirical demonstration that the stealthiness of existing VLM backdoor attacks has been substantially overestimated.
PhysInOne: Visual Physics Learning and Reasoning in One Suite: PhysInOne is a large-scale synthetic dataset comprising 153,810 dynamic 3D scenes and 2 million annotated videos, covering 71 fundamental physical phenomena across mechanics, optics, fluid dynamics, and magnetism, establishing a new benchmark for physically-aware world models.
Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision: This paper proposes the DeepfakeJudge framework, which scales human-annotated reasoning supervision into large-scale structured scoring data via a bootstrapped generator-evaluator pipeline. The framework trains 3B/7B vision-language models as automatic judges for deepfake detection reasoning quality, achieving high human alignment in both pointwise and pairwise evaluation settings.
PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models: PointAlign applies feature-level alignment regularization to point cloud tokens at intermediate LLM layers of 3D VLMs, aligning them with Q-Former outputs. By training only a lightweight alignment projector and LoRA adapters, the method effectively prevents geometric information from degrading during language modeling, achieving a 7.50 pp improvement on open-vocabulary classification.
Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees: This paper proposes Proof-of-Perception (PoP), which models multimodal reasoning as an executable directed acyclic graph (DAG) where each perception/logic node outputs set-valued predictions with conformal certificates providing step-wise reliability guarantees. A lightweight controller adaptively allocates computation within a budget based on these certificates. PoP outperforms CoT, ReAct, and PoT baselines on document, chart, and multi-image QA benchmarks.
Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models: This paper systematically diagnoses visual representation degradation in MLLMs across two levels—global functionality and patch-level semantic structure—revealing that such degradation is an intrinsic "visual sacrifice" induced by the pure text-generation objective. It proposes Predictive Regularization (PRe), which mitigates degradation by training intermediate-layer features to predict the initial visual features, achieving consistent improvements across multiple vision-language benchmarks.
Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation: This paper proposes AReS, which replaces the continuous API calls of conventional zeroth-order optimization (ZOO) with a single-round API query to prime a local encoder. AReS achieves a +27.8% improvement on GPT-4o (where ZOO methods are nearly ineffective), while reducing API calls by over 99.99% and enabling zero-cost inference.
Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving: Prune2Drive is the first plug-and-play token pruning framework for multi-view autonomous driving VLMs. It uses T-FPS (Token-wise Farthest Point Sampling) to preserve semantic and spatial diversity, and view-adaptive pruning rate optimization to allocate token budgets per camera automatically. On DriveLM, the framework achieves 6.40× prefill acceleration while retaining only 10% of tokens, with only a 3% performance drop.
Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving: Prune2Drive is the first plug-and-play token pruning framework designed for multi-view autonomous driving VLMs. It combines T-FPS (Token-wise Farthest Point Sampling) to preserve semantic and spatial diversity with view-adaptive pruning rate optimization to automatically allocate token budgets across camera views. Retaining only 10% of tokens on DriveLM, it achieves 6.40× prefill speedup with only a 3% performance drop.
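The farthest point sampling idea behind T-FPS can be sketched as a greedy loop that repeatedly keeps the token farthest from everything already kept. This is a generic FPS sketch, not the paper's implementation: the Euclidean distance, the fixed starting index, and the toy 2-D "token features" are assumptions, and the view-adaptive pruning rate optimization is not shown.

```python
import math


def farthest_point_sampling(tokens, k, start=0):
    """Greedily select k token indices that are maximally spread out.

    tokens -- list of equal-length feature vectors (tuples or lists of floats)
    k      -- number of tokens to keep
    start  -- index of the first kept token (assumed; often chosen at random)
    """
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and len(tokens)")
    selected = [start]
    # Distance from every token to its nearest already-selected token.
    min_dist = [math.dist(t, tokens[start]) for t in tokens]
    while len(selected) < k:
        nxt = max(range(len(tokens)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, t in enumerate(tokens):
            d = math.dist(t, tokens[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Two near-duplicate pairs plus one isolated token (toy 2-D features).
tokens = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (5.0, 0.0), (10.1, 0.0)]
kept = farthest_point_sampling(tokens, k=3)
```

With a budget of three, the sketch keeps one token from each near-duplicate pair plus the isolated one, which matches the stated goal of preserving diversity rather than keeping redundant tokens.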
Purify-then-Align: Towards Robust Human Sensing under Modality Missing with Knowledge Distillation from Noisy Multimodal Teacher: This paper proposes PTA (Purify-then-Align), a framework that first purifies a noisy multimodal teacher via a meta-learning-driven modality weighting mechanism, then aligns each unimodal student through diffusion-model-driven knowledge distillation, enabling unimodal encoders to maintain strong robustness under modality-missing scenarios. PTA achieves state-of-the-art performance on MM-Fi and XRF55.
Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization: This paper proposes Quant Experts (QE), a token-aware adaptive quantization error compensation framework based on Mixture of Experts (MoE). By partitioning important channels into token-independent and token-dependent groups, QE employs shared experts and routed experts to perform global and local quantization error reconstruction respectively, achieving significant accuracy recovery on VLMs ranging from 2B to 72B parameters.
Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization: This paper proposes Quant Experts (QE), a token-aware adaptive quantization error reconstruction framework based on Mixture-of-Experts. It partitions important channels into token-independent (high-frequency, globally consistent) and token-dependent (low-frequency, locally dynamic) groups, compensating global and local quantization errors via low-rank adapters in shared and routed experts, respectively. QE consistently improves VLM performance across diverse quantization settings ranging from W4A6 to W3A16.
Reason-SVG: Enhancing Structured Reasoning for Vector Graphics Generation with Reinforcement Learning: This paper proposes the Reason-SVG framework, which introduces a "Drawing-with-Thought" (DwT) paradigm that enables LLMs to perform explicit multi-stage design reasoning prior to SVG generation. Combined with SFT and GRPO reinforcement learning with a hybrid reward function, Reason-SVG consistently outperforms existing methods in semantic alignment, structural validity, and visual quality.
ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps: This paper introduces the ReasonMap benchmark, which constructs 1,008 QA pairs from high-resolution transit maps of 30 cities and proposes a two-level evaluation framework (correctness + quality) to systematically assess fine-grained visual reasoning capabilities of 16 MLLMs. A key finding is that among open-source models, base models outperform reasoning models, while the opposite holds for closed-source models.
ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps: This paper introduces the ReasonMap benchmark, constructed from high-resolution transit maps of 30 cities comprising 1,008 QA pairs, to systematically evaluate fine-grained visual understanding and spatial reasoning capabilities of 16 MLLMs. The work reveals the counter-intuitive phenomenon that base variants of open-source models consistently outperform their reasoning counterparts, and establishes a GRPO-based reinforcement fine-tuning training baseline.
ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval: This paper reveals a Capability Degradation phenomenon that occurs when adapting generative MLLMs into discriminative retrievers, and proposes the ReCALL framework — a three-stage pipeline that diagnoses retriever blind spots, leverages the base MLLM's CoT reasoning to generate corrective triplets, and applies grouped contrastive refinement to recover degraded fine-grained compositional reasoning ability. ReCALL achieves R@1 of 55.52% on CIRR and R@10 of 57.04% on FashionIQ.
Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress: This paper proposes R²VLM, a recurrent reasoning framework that processes local video segments sequentially, maintains a dynamically updated CoT record tracking task decomposition and completion status, and leverages a multi-dimensional RL reward scheme to achieve state-of-the-art performance in long-horizon embodied task progress estimation. The framework additionally supports downstream applications including policy learning, reward modeling, and proactive assistance.
Recursive Think-Answer Process for LLMs and VLMs: R-TAP proposes a recursive think-answer process that employs a confidence generator to assess the certainty of model responses and guide iterative reasoning refinement. Combined with dual reinforcement signals—a recursive confidence growth reward and a final answer confidence reward—R-TAP consistently outperforms single-pass inference methods on both LLMs and VLMs, while substantially reducing "Oops!"-style self-reflection expressions during reasoning.
ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation: This paper proposes ReHARK — a training-free one-shot CLIP adaptation framework that constructs a hybrid prior by fusing CLIP text knowledge, GPT-3 semantic descriptions, and visual prototypes, and performs global proximal regularization in RKHS via multi-scale RBF kernels, achieving a new one-shot SOTA of 65.83% average accuracy across 11 benchmarks.
ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation: ReHARK is a four-stage refinement pipeline that constructs hybrid semantic-visual priors, augments the support set, applies adaptive distribution rectification, and integrates multi-scale RBF kernels, achieving 65.83% one-shot adaptation accuracy across 11 benchmarks and substantially outperforming Tip-Adapter and ProKeR.
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training: This paper proposes the World-Env framework, which leverages a physically consistent world model as a virtual environment in place of real-world interaction to perform RL post-training on VLA models. With only 5 demonstrations per task, the framework achieves significant improvements in manipulation success rates.
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training: This paper proposes World-Env, a framework that employs a physically consistent world model as a virtual simulator in place of real-world interaction. Combined with a VLM-guided instant reflector that provides continuous rewards and dynamic termination signals, the framework enables safe and efficient RL post-training of VLA models using only 5 demonstration trajectories per task, improving average success rate from 74.85% to 79.6%.
Relational Visual Similarity: This paper formally defines the problem of relational visual similarity — the intrinsic relational or functional correspondence between two images, as opposed to surface-level attribute similarity — constructs a 114K anonymous-description dataset, trains the relsim model, and reveals fundamental deficiencies in existing similarity metrics (CLIP, DINO, etc.) for capturing relational similarity.
ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding: This paper proposes ReMoRa, which operates directly on compressed video representations (I-frames + motion vectors). A Refined Motion Representation (RMR) module refines coarse block-level motion vectors into fine-grained motion representations approximating optical flow, while a Hierarchical Motion State Space (HMSS) module performs linear-time long-range temporal modeling. ReMoRa surpasses baselines on LongVideoBench, NExT-QA, MLVU, and other benchmarks.
Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance: This paper proposes Residual Decoding (ResDec), a training-free plug-and-play decoding strategy that identifies the semantic anchoring phase by analyzing U-shaped JSD patterns in historical token logit distributions, aggregates logits from this phase as residual guidance to steer current decoding, and effectively suppresses language-prior hallucinations in LVLMs at near-zero additional inference overhead.
Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in VDU: Layer-wise linear probing analysis reveals a significant gap between internal representations and generated responses in LVLMs for visual document understanding (VDU). Intermediate layers encode more linearly accessible task-relevant information than final layers, and fine-tuning intermediate layers simultaneously improves accuracy and narrows the gap.
Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token: This paper proposes SELF1E, the first MLLM segmentation method that needs no dedicated mask decoder and uses only a single [SEG] token. By introducing Residual Features Refilling (RFR) and Residual Features Amplifier (RFA) to recover resolution lost during pixel-shuffle compression, SELF1E achieves performance competitive with decoder-based methods across multiple segmentation tasks.
Rethinking VLMs for Image Forgery Detection and Localization: This work reveals that VLMs inherently favor semantic plausibility over authenticity (CLIP cosine similarity for forged images reaches 96–99%), and proposes IFDL-VLM, which decouples detection/localization from language explanation into two stages: Stage-1 employs ViT+SAM for detection and localization, and Stage-2 feeds the resulting mask as auxiliary input to a VLM to enhance interpretability. The method achieves state-of-the-art performance across 9 benchmarks.
Rethinking VLMs for Image Forgery Detection and Localization: This paper identifies an inherent semantic plausibility bias in VLMs (their tendency to favor semantic coherence over authenticity), which impedes forgery detection performance, and proposes the IFDL-VLM framework in response. The framework decouples detection/localization from language explanation into a two-stage optimization pipeline, and leverages localization masks as auxiliary inputs to VLMs to enhance interpretability, achieving SOTA results across 9 benchmarks.
Revisiting Model Stitching in the Foundation Model Era: This paper systematically investigates the feasibility of stitching heterogeneous Vision Foundation Models (VFMs), finds that conventional methods fail in this setting, and proposes a two-stage training strategy — Final Feature Matching + Task Loss Training — that enables reliable stitching across heterogeneous VFMs. The resulting stitched models can even surpass both constituent VFMs individually. Building on this, the paper introduces the VFM Stitch Tree (VST) architecture, which provides a controllable accuracy–efficiency trade-off for multi-VFM systems.
Revisiting Model Stitching In the Foundation Model Era: A two-stage stitching training method (Final Feature Matching + Task Loss Training) for heterogeneous Vision Foundation Models (VFMs) is proposed, demonstrating that heterogeneous VFMs can be reliably stitched and fused for complementary knowledge. A VFM Stitch Tree (VST) architecture is also designed to achieve controllable accuracy–efficiency trade-offs in multi-VFM systems.
Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach: This paper proposes FlashCache — the first training-free multimodal KV Cache compression framework that requires no attention scores. By identifying Outlier KVs via frequency-domain low-pass filtering and dynamically allocating per-layer budgets, FlashCache achieves 80% memory reduction and 1.69× decoding speedup while preserving model performance.
SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning: This paper proposes SALMUBench—the first association-level machine unlearning benchmark for CLIP-style models—comprising a 60K synthetic person–sensitive-attribute paired dataset, from-scratch-trained Compromised/Clean model pairs, and a structured holdout evaluation protocol. It is the first work to systematically reveal three failure modes of existing unlearning methods: catastrophic destruction, over-generalized unlearning, and ineffective unlearning.
Scaling Spatial Intelligence with Multimodal Foundation Models: SenseNova-SI systematically constructs an 8M-scale diverse spatial dataset (SenseNova-SI-8M) to cultivate spatial intelligence in multimodal foundation models including Qwen3-VL, InternVL3, and Bagel, achieving unprecedented performance on multiple spatial benchmarks such as VSI-Bench and MMSI while preserving general multimodal understanding capabilities.
Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework: This paper proposes the Self-Critical Inference (SCI) framework, which simultaneously addresses language bias and language sensitivity in LVLMs via multi-round textual and visual counterfactual logit aggregation. A dynamic robustness benchmark, DRBench, is introduced to evaluate robustness in a model-specific manner. Increasing the number of counterfactual inference rounds yields consistent robustness gains, opening a new direction for test-time scaling.
Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism: This paper proposes FlexMem — a training-free visual memory mechanism that constructs a visual memory bank via iterative dual-pathway KV cache compression, and introduces both encoding-based and fast index-based memory retrieval strategies, enabling MLLMs to process 1000+ frame long videos on a single 3090 GPU while substantially outperforming existing efficient video understanding methods.
Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models: This paper proposes Scene-VLM — the first VLM fine-tuning-based framework for video scene segmentation — which leverages structured multimodal shot representations (visual frames + dialogue + metadata), causal sequential prediction, a context-focus window mechanism, and token logits-based confidence extraction, achieving substantial gains of +6 AP and +13.7 F1 on MovieNet, while demonstrating natural language explanation capability.
SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts: This work introduces SciPostGen, a large-scale dataset of 18,097 paper–poster pairs. Analysis reveals a moderate correlation between paper structure and the number of poster layout elements. A retrieval-augmented poster layout generation framework is proposed, which leverages contrastive learning to retrieve layout templates matching the input paper and guides an LLM to generate the final poster layout.
SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker: This paper proposes SEATrack, a multimodal tracker that achieves dynamic cross-modal attention map alignment via AMG-LoRA and efficient global relation modeling via HMoE, attaining a state-of-the-art performance–efficiency trade-off on RGB-T/D/E tracking with minimal trainable parameters.
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models: This paper introduces AV-SpeakerBench, a benchmark comprising 3,212 speaker-centric audiovisual reasoning multiple-choice questions, which systematically evaluates multimodal large language models on fine-grained audiovisual fusion capabilities—specifically, who is speaking, what was said, and when—revealing a gap of over 20% between the strongest current models and human performance.
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles: This paper proposes State-aware Reasoning (StaR), which teaches multimodal agents a three-step reasoning chain — "perceive current state → analyze target state → decide whether to act" — improving GUI toggle control accuracy by over 30% without degrading general agent task performance.
Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness: This paper proposes an efficient plug-and-play module that learns multimodal class embeddings to enhance VLM recognition and reasoning on rare objects. On the visual side, a cross-attention adapter refines visual tokens; on the textual side, object detection prompts are injected. Without fine-tuning the VLM, the method achieves a significant gain from 72.8 to 75.4 on CODA-LM.
Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions: This paper introduces the tactile localization task—identifying regions in an image that share the same material properties as a given tactile input—and addresses it via local visual-tactile alignment and a material-diversity pairing strategy for learning dense cross-modal features. Two new tactile-material segmentation datasets are also constructed.
Self-Consistency for LLM-Based Motion Trajectory Generation and Verification: This paper extends the self-consistency paradigm of LLMs from natural language reasoning to the visual domain. It defines shape families for motion trajectories via a Lie transformation group hierarchy, and clusters multiple LLM-sampled trajectories under transformation-invariant distance metrics. Without any training or supervision, this improves trajectory generation (+4–6%) and trajectory verification (precision +11.8%).
Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning: This paper proposes the Similarity-as-Evidence (SaE) framework, which reinterprets VLM text-image similarities as Dirichlet evidence. A Similarity Evidence Head (SEH) is introduced to calibrate overconfident softmax outputs, and a dual-factor acquisition strategy based on vacuity and dissonance enables interpretable, label-efficient medical active learning, achieving a SOTA macro-average accuracy of 82.57% across 10 datasets under a 20% annotation budget.
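The vacuity term in the entry above can be illustrated with the standard subjective-logic reading of a Dirichlet distribution. This is a minimal sketch, not the paper's Similarity Evidence Head: the function name `dirichlet_vacuity`, the `scale` parameter, and the ReLU mapping from similarity to evidence are all assumptions made for illustration, and the dissonance term is omitted.

```python
def dirichlet_vacuity(similarities, scale=10.0):
    """Turn per-class similarities into Dirichlet beliefs and a vacuity score.

    Assumption: evidence is a ReLU of the scaled similarity. The paper
    learns this mapping with a dedicated head; this is only a stand-in.
    """
    evidence = [max(0.0, scale * s) for s in similarities]
    alpha = [e + 1.0 for e in evidence]        # Dirichlet parameters
    strength = sum(alpha)                      # Dirichlet strength S
    beliefs = [e / strength for e in evidence]
    vacuity = len(alpha) / strength            # u = K / S
    return beliefs, vacuity


beliefs, vacuity = dirichlet_vacuity([0.3, 0.1, -0.2])
print(beliefs, vacuity)
```

Beliefs and vacuity sum to one by construction, so a sample with no evidence for any class has vacuity 1, which is the kind of sample a vacuity-driven acquisition rule would prioritize for labeling.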
SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models: SIMPACT is a test-time, simulation-augmented action planning framework that automatically constructs a physics simulation environment from a single RGB-D image, enabling VLMs to propose actions, observe simulation outcomes, and iteratively refine their reasoning. It achieves SOTA performance on both rigid and deformable object manipulation tasks without any additional training.
SoPE: Spherical Coordinate-Based Positional Embedding for 3D LVLMs: This paper identifies spatial perception bias in RoPE when applied to 3D LVLMs (1D indexing disrupts 3D locality and ignores directionality), and proposes SoPE, a spherical coordinate-based positional embedding using a four-dimensional index \((t, r, \theta, \phi)\) with multi-dimensional frequency allocation and multi-scale mixing. SoPE achieves state-of-the-art performance on 3D layout estimation and object detection benchmarks built upon SpatialLM.
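To make the four-dimensional index in the entry above concrete, the following sketch converts a Cartesian point and a temporal index into \((t, r, \theta, \phi)\). The function name `spherical_index`, the physics convention (polar angle \(\theta\) measured from the +z axis, azimuth \(\phi\) in the x-y plane), and the treatment of \(t\) as a plain integer index are assumptions; the summary does not state how SoPE defines them, and the frequency allocation and multi-scale mixing steps are not shown.

```python
import math


def spherical_index(t, x, y, z):
    """Return a hypothetical (t, r, theta, phi) index for the point (x, y, z)."""
    r = math.sqrt(x * x + y * y + z * z)
    # Polar angle in [0, pi]; defined as 0 at the origin to avoid dividing by zero.
    theta = math.acos(z / r) if r > 0 else 0.0
    # Azimuth in [-pi, pi], measured from the +x axis.
    phi = math.atan2(y, x)
    return (t, r, theta, phi)


print(spherical_index(3, 1.0, 0.0, 0.0))
```

Unlike a flattened 1D index, nearby points in space receive nearby \((r, \theta, \phi)\) values, and the two angles carry the direction information that the entry says plain RoPE ignores.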
SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs: This paper proposes the SPARROW framework, which injects temporal referential consistency via Target-Specific Tracked Features (TSF) and stabilizes pixel-level localization through dual-prompt (BOX+SEG) initialization. As a plug-and-play module, SPARROW consistently improves performance across three video MLLM baselines on six benchmarks.
SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs: This paper proposes the SPARROW framework, which injects temporal consistency supervision via Target-Specific Features (TSF), stabilizes first-frame initialization through dual-prompt ([BOX]+[SEG]) coarse-to-fine decoding, and integrates into existing video MLLMs in a plug-and-play manner, achieving consistent improvements across 6 benchmarks on 3 tasks.
SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models: This paper proposes SpatiaLQA, a benchmark comprising 9,605 QA pairs across 241 real-world indoor scenes, systematically evaluates 41 VLMs on spatial logical reasoning, and introduces a recursive scene graph-assisted reasoning method to enhance VLMs' spatial logical reasoning capabilities.
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence: This paper introduces SpatialScore, currently the most comprehensive multimodal spatial intelligence benchmark (5K samples / 30 tasks), and proposes two complementary approaches to enhance spatial understanding in MLLMs: a data-driven fine-tuning scheme via SpatialCorpus (331K QA pairs) and a training-free SpatialAgent system equipped with 12 specialized tools.
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning: This paper proposes SpatialStack, a framework that injects multi-level geometric features from a multi-view geometry encoder (VGGT) into different layers of an LLM decoder (rather than fusing only the final layer), achieving open-source SOTA on multiple 3D spatial reasoning benchmarks through hierarchical alignment where shallow layers handle fine-grained spatial perception and deep layers support high-level semantic reasoning.
SSR2-GCD: Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery: This paper proposes SSR2-GCD, a framework that replaces conventional contrastive losses with a Semi-Supervised Rate Reduction (SSR2) loss to learn uniformly compressed, structured representations. The work further reveals that inter-modal alignment is not only unnecessary but harmful in multi-modal GCD, achieving +3.1% and +6.3% over the prior state of the art on Stanford Cars and Flowers102, respectively.
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles (StaR): This paper reveals the severe failure of existing multimodal GUI agents on toggle control tasks (GPT-5 achieves only 37% O-AMR), and proposes State-aware Reasoning (StaR), a three-step reasoning chain (perceive current state → analyze target state → decide whether to act) that improves execution accuracy by 30%+, without degrading general agent capabilities.
StructXLIP: Enhancing Vision-Language Models with Multimodal Structural Cues: StructXLIP adopts edge maps as proxy representations of visual structure and introduces three structure-centric losses during CLIP fine-tuning — edge-structure text alignment, local region-text chunk matching, and edge-color image connection. By maximizing the mutual information of multimodal structural representations, the model is guided toward more robust and semantically stable optima, surpassing existing competitors on cross-modal retrieval tasks.
Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models: This paper proposes TARA, a framework that injects taxonomic hierarchy knowledge into large multimodal models (LMMs) by aligning their intermediate representations with taxonomy-aware features from a biological foundation model (BFM), substantially improving hierarchical visual recognition performance on both known and novel categories.
Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention: This paper proposes Vision-Guided Attention (VGA), a training-free method that constructs precise visual grounding from the semantic features of visual tokens to guide model attention toward relevant visual regions, effectively mitigating hallucinations in MLLMs while remaining compatible with FlashAttention.
Test-Time Attention Purification for Backdoored Large Vision Language Models: This work identifies that the essence of backdoor behavior in LVLMs is cross-modal attention stealing (trigger visual tokens hijack the attention weights of text tokens), and proposes CleanSight — the first training-free test-time backdoor defense framework — which eliminates backdoor effects by detecting and pruning high-attention trigger tokens.
Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction: This paper proposes TOMCap — a text-only training approach for image captioning that combines retrieval augmentation, modality gap correction, and LoRA fine-tuning. The model trains exclusively on text yet processes images at inference time, surpassing existing training-free and text-only methods.
The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts: This work identifies a critical and overlooked threat: existing multimodal manipulation detection methods fail to account for MLLMs' ability to generate semantically coherent deceptive narratives. The authors construct MDSM, a semantically aligned manipulation dataset of 441k samples, and propose AMD, a framework based on Artifact Tokens and manipulation-oriented reasoning. With only 0.27B parameters, AMD achieves state-of-the-art cross-domain generalization of 88.18 ACC / 60.25 mAP / 61.02 mIoU.
The Coherence Trap: MLLM-Crafted Narratives Exploit Manipulated Visual Contexts: This paper identifies two fundamental flaws in existing multimodal disinformation detection—underestimating semantically coherent fake narratives generated by MLLMs and over-reliance on simple misalignment artifacts—and constructs the 441k-sample MDSM dataset (image manipulation + MLLM-generated semantically aligned text). The proposed AMD framework (Artifact Pre-perception + Manipulation-Oriented Reasoning) achieves 88.18 ACC / 60.25 mAP / 61.02 mIoU on cross-domain detection.
The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition: This paper reveals that open-source LLMs lack hierarchical taxonomic knowledge about the visual world (often failing to recognize even basic biological classification systems), making the LLM the bottleneck for hierarchical visual recognition in Vision LLMs.
The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment: This paper proposes Contrastive Fusion (ConFu), a framework that extends CLIP-style bimodal contrastive learning to tri-modal higher-order alignment, jointly learning paired and fused representations within a unified objective to support both 1→1 and 2→1 retrieval.
Think360: Evaluating the Width-centric Reasoning Capability of MLLMs Beyond Depth: This paper presents Think360, a multimodal benchmark focused on reasoning width—i.e., a model's capability for multi-path search, multi-constraint pruning, backtracking, and trial-and-error exploration. The benchmark comprises 1,200+ high-quality samples and introduces a fine-grained Tree-of-Thought evaluation protocol, revealing significant deficiencies in current MLLMs along the width dimension of reasoning.
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models: This paper presents the first quantitative analysis of CoT reasoning in diffusion multimodal LLMs (dMLLMs), identifying two critical issues, "early answer generation" and "weak visual grounding". It proposes two training-free methods, PSP (Position-Step Penalty) and VRG (Visual Reasoning Guidance), which improve accuracy by up to 7.5% with over 3× speedup.
Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World: This paper proposes Dyn-Bench — a large-scale benchmark for dynamic understanding of the physical 4D world (1k videos, 7k VQA pairs, 3k dynamic grounding pairs) — that systematically evaluates the spatio-temporal reasoning capabilities of general, spatial-aware, and region-level MLLMs. The study finds that existing models fail to maintain consistency between reasoning and grounding simultaneously, and introduces two structured integration methods, Mask-Guided Fusion and ST-TCM, that significantly improve dynamic perception.
TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval: This paper proposes TIGeR, a multimodal Transformer framework that jointly learns a unified geo-temporal embedding space over images, locations, and timestamps, enabling three tasks—geolocalization, capture time prediction, and geo-temporally aware image retrieval—within a single model. A high-quality benchmark dataset of 4.5M images is also introduced.
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs: This paper systematically investigates the key factors for building video temporal grounding (VTG) capabilities in MLLMs from two dimensions — data quality and algorithm design. It releases the high-quality benchmark TimeLens-Bench and training set TimeLens-100K, and constructs the TimeLens model series via interleaved textual timestamp encoding combined with a thinking-free RLVR training paradigm, achieving state-of-the-art performance among open-source models and surpassing GPT-5 and Gemini-2.5-Flash.
TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment: This paper finds that distillation substantially improves patch-text alignment and translates this insight into a new pretraining objective, iBOT++, in which visible tokens also participate in the loss computation. Combined with head-only EMA and multi-granularity text augmentation, the resulting TIPSv2 achieves state-of-the-art performance across 9 tasks and 20 datasets.
Token Warping Helps MLLMs Look from Nearby Viewpoints: This paper proposes simulating viewpoint changes by spatially warping ViT image tokens within MLLMs, rather than by conventional pixel-level warping. The authors find that backward token warping maintains semantic consistency while remaining robust to depth estimation noise. The proposed method substantially outperforms pixel-level warping, specialized spatial-reasoning MLLMs, and generative warping approaches on the newly constructed ViewBench benchmark.
Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans (HouseMind): This paper presents HouseMind, which discretizes architectural floor plans into room-level spatial tokens via a hierarchical VQ-VAE, enabling floor plan understanding, generation, and editing within a unified MLLM framework. The approach comprehensively outperforms diffusion model and general-purpose VLM baselines in geometric validity and controllability.
Topo-R1: Detecting Topological Anomalies via Vision-Language Models: This paper proposes Topo-R1, the first framework to equip VLMs with topology-aware perception. Through an automated data construction pipeline combined with SFT and GRPO reinforcement learning (incorporating a topology-aware composite reward), it enables annotation-free topological anomaly detection and classification in tubular structures.
Towards Calibrating Prompt Tuning of Vision-Language Models: To address the "dual miscalibration" problem in prompt-tuned CLIP (underconfidence on base classes and overconfidence on novel classes), this paper proposes two complementary regularizers — mean-variance margin regularization and text moment-matching loss — as plug-and-play modules that consistently reduce ECE across 7 prompt tuning methods and 11 datasets.
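The ECE metric referenced in the entry above is the standard expected calibration error: the sample-weighted average gap between accuracy and mean confidence within confidence bins. The sketch below is a generic implementation, not code from the paper; the function name, the choice of 10 equal-width bins, and the half-open binning `[lo, hi)` are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Compute ECE from per-sample confidences in [0, 1] and 0/1 correctness flags."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Four predictions at 0.9 confidence, three of them correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

Under this definition, underconfidence on base classes and overconfidence on novel classes both raise ECE, since the metric takes the absolute gap in each bin.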
Towards Multimodal Domain Generalization with Few Labels: This paper defines and investigates the novel problem of Semi-Supervised Multimodal Domain Generalization (SSMDG), and proposes a unified framework integrating consensus-driven pseudo-labeling, disagreement-aware regularization, and cross-modal prototype alignment to achieve cross-domain generalization of multimodal models under limited annotation.
Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training: This paper proposes DocHumming, a data-training co-design framework that constructs the large-scale synthetic dataset DocMix-3M via Realistic Scene Synthesis, and introduces a Document-Aware Training Recipe (DATR) combining progressive learning and structure token weighting. On a 1B-parameter MLLM, DocHumming achieves an OmniDocBench Overall score of 93.75, surpassing Qwen3-VL-235B (89.15), with only a 6.72-point degradation under real-world capture conditions (vs. 18–20 points for pipeline-based methods).
Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation: This paper proposes CES (Coordinator-Executor-State Tracker), a multi-agent framework coupled with a staged execution-feedback reinforcement learning algorithm. By decoupling high-level task planning from low-level execution, and through dedicated training of the Coordinator and State Tracker, CES significantly improves GUI agent planning and state management capabilities on long-horizon tasks.
TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration: This paper proposes TreeTeaming, an automated red-teaming framework based on a hierarchical strategy tree, in which an LLM-driven Orchestrator dynamically explores and evolves attack strategies. The framework achieves state-of-the-art attack success rates (ASR) across 12 mainstream VLMs (87.60% on GPT-4o) and discovers diverse novel attack strategies that go beyond all known strategy sets.
TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration: This paper proposes TreeTeaming, an autonomous red-teaming framework that transforms strategy exploration from static testing into a dynamic evolutionary process. An LLM orchestrator autonomously constructs and expands a hierarchical strategy tree, while a multimodal executor carries out concrete attacks. TreeTeaming achieves state-of-the-art attack success rates on 11 out of 12 evaluated VLMs, reaching 87.60% on GPT-4o.
TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration: This paper proposes TreeTeaming, an autonomous red-teaming framework in which an LLM-driven Orchestrator dynamically constructs and expands a strategy tree, autonomously discovering diverse VLM attack strategies from a single seed example. It achieves state-of-the-art attack success rates across 12 mainstream VLMs (87.60% on GPT-4o), and the diversity of the discovered strategies surpasses the union of all known publicly available strategies.
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition: This paper proposes TRivia, a self-supervised fine-tuning framework that leverages QA-driven GRPO reinforcement learning to enable VLMs to learn table recognition directly from unannotated table images. The resulting TRivia-3B surpasses proprietary models such as Gemini 2.5 Pro and GPT-5 on multiple benchmarks.
Unbiased Dynamic Multimodal Fusion: This paper proposes UDML, an unbiased dynamic multimodal learning framework with two core components. A noise-aware uncertainty estimator injects controllable noise and predicts its intensity, yielding accurate modality quality assessment under both low-noise and high-noise conditions. A modality dependency calculator quantifies the model's inherent dependency bias toward each modality via Dropout and incorporates it into the weighting mechanism. The framework addresses the dual suppression problem in existing methods and consistently improves performance across multiple multimodal benchmarks.
Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models: This paper proposes Beta-KD, an uncertainty-aware knowledge distillation framework grounded in a Bayesian perspective. By modeling teacher supervision as a Gibbs prior and deriving a closed-form solution via Laplace approximation, Beta-KD automatically balances data and teacher signals, consistently improving distillation performance on multimodal VQA benchmarks.
Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models: This paper proposes UNCHA, a framework that models the semantic representativeness of image parts with respect to the whole scene via hyperbolic uncertainty in hyperbolic VLMs. By incorporating uncertainty-guided contrastive loss and entailment loss, UNCHA enhances compositional scene understanding and outperforms existing hyperbolic VLMs across multiple downstream tasks.
Understanding Task Transfer in Vision-Language Models: This paper presents the first systematic study of how fine-tuning a VLM on one visual perception task affects its zero-shot performance on other perception tasks. It proposes the Perfection Gap Factor (PGF), a normalized metric for quantifying cross-task transfer, and reveals structural regularities in task transfer (positive/negative transfer cliques, task personas, scale dependence) across three scales of Qwen-2.5-VL. The paper further demonstrates that PGF can guide data selection to improve fine-tuning efficiency.
UNICBench: UNIfied Counting Benchmark for MLLM: This paper introduces UNICBench, the first unified cross-modal (image/text/audio) multi-level counting benchmark, comprising 5,508 + 5,888 + 2,905 = 14,301 QA pairs organized along a three-level capability taxonomy (Pattern/Semantic/Reasoning) × three-level difficulty taxonomy (Easy/Medium/Hard). The benchmark systematically evaluates 45 state-of-the-art MLLMs, revealing that basic counting tasks are near saturation while reasoning-level and hard-difficulty tasks exhibit substantial performance gaps.
UniGame: Turning a Unified Multimodal Model Into Its Own Adversary: UniGame is the first self-adversarial post-training framework for unified multimodal models (UMMs). By attaching a lightweight perturber at the shared visual token interface, the generation branch actively constructs semantically consistent adversarial samples to challenge the understanding branch, forming a minimax self-play game that substantially improves consistency (+4.6%), understanding (+3.6%), generation, and robustness.
UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression: This paper proposes UniMMAD, the first unified framework for multi-modal (RGB/Depth/IR, etc.) and multi-class anomaly detection. It follows a General-to-Specific paradigm: a general multi-modal encoder compresses features, and a Cross Mixture-of-Experts (C-MoE) decompresses them into domain-specific features. The method achieves state-of-the-art results on 5 datasets spanning industrial, medical, and synthetic scenarios at 59 FPS inference speed.
UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression: This paper proposes UniMMAD, the first unified framework that handles multi-modal and multi-class anomaly detection simultaneously with a single parameter set. The core contribution is an MoE-based feature decompression mechanism that adaptively decomposes general multi-modal encoded features into domain-specific unimodal reconstructions, achieving state-of-the-art performance across 9 datasets spanning 3 domains, 12 modalities, and 66 categories.
V2Drop: Variation-aware Vision Token Dropping for Faster Large Vision-Language Models: This work is the first to approach vision token compression from the perspective of inter-layer token variation. It identifies "lazy" vision tokens with small inter-layer variation as having negligible impact on model output, and proposes V2Drop, a progressive dropping scheme that eliminates low-variation tokens. V2Drop retains 94.0% of image understanding performance while reducing generation latency by 31.5%, and retains 98.6% of video understanding performance while reducing latency by 74.2%, with full compatibility with FlashAttention.
Variation-Aware Vision Token Dropping for Faster Large Vision-Language Models: This paper proposes V2Drop, the first method to approach token importance from the perspective of token variation. By progressively dropping "lazy" vision tokens with minimal variation inside the LLM, V2Drop achieves training-free, position-bias-free, and efficient-operator-compatible LVLM inference acceleration, retaining 94.0% and 98.6% of original performance on image and video understanding tasks while reducing LLM generation latency by 31.5% and 74.2%, respectively.
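The core idea shared by the two V2Drop entries above, ranking vision tokens by how much their hidden state changes between layers and dropping the least-changing ones, can be sketched as follows. This is an illustrative sketch only: the function name `drop_lazy_tokens`, the use of Euclidean distance as the variation measure, and the fixed `keep_ratio` are assumptions, since the summaries do not specify the paper's exact metric or progressive dropping schedule.

```python
import math


def drop_lazy_tokens(prev_layer, next_layer, keep_ratio=0.5):
    """Return indices of the tokens whose hidden state changed most between two layers.

    prev_layer, next_layer: lists of equal-length feature vectors, one per token.
    Assumption: variation is the Euclidean distance between the two states.
    """
    variation = [math.dist(a, b) for a, b in zip(prev_layer, next_layer)]
    n_keep = max(1, int(len(variation) * keep_ratio))
    ranked = sorted(range(len(variation)), key=lambda i: variation[i], reverse=True)
    # Sort the survivors so the kept tokens stay in their original order.
    return sorted(ranked[:n_keep])


prev_states = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
next_states = [[0.0, 0.1], [1.0, 3.0], [2.0, 2.0], [5.0, 3.0]]
print(drop_lazy_tokens(prev_states, next_states))
```

Because the ranking uses only hidden states and never reads attention weights, a criterion of this shape stays compatible with fused attention kernels such as FlashAttention, which do not expose attention maps.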
VecGlypher: Unified Vector Glyph Generation with Language Models: This paper proposes VecGlypher, the first unified language model for text- and image-guided vector glyph generation. Through a two-stage training pipeline (large-scale SVG syntax learning followed by expert-annotated alignment), it autoregressively generates editable SVG paths directly, without intermediate rasterization steps or vectorization post-processing.
Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping: This paper defines the novel task of Aesthetic Guidance (AG) and constructs the AesGuide benchmark (10,748 photos annotated with aesthetic scores, analyses, and guidance), then proposes Venus, a two-stage framework that first empowers MLLMs with aesthetic guidance capability via progressive aesthetic QA, and subsequently activates aesthetic cropping capability through CoT reasoning, achieving state-of-the-art performance on both tasks.
VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving: This paper proposes VGGDrive, a framework that injects cross-view geometric perception into VLMs via a frozen 3D visual foundation model (VGGT). A plug-and-play CVGE module is designed to hierarchically and adaptively fuse 3D features into the 2D visual embeddings at each VLM layer, achieving significant performance gains across five autonomous driving benchmarks.
Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models: This paper proposes VisionToM, a lightweight vision-based intervention framework that probes and intervenes on attention heads sensitive to visual input and ToM reasoning within MLLMs. Without fine-tuning the backbone, VisionToM substantially enhances Theory of Mind reasoning in multimodal large language models, achieving significant performance gains on the EgoToM benchmark.
VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion: This paper proposes VideoFusion, the first large-scale infrared-visible video fusion framework, which jointly models cross-modal complementarity and temporal dynamics via cross-modal differential reinforcement, complete-modality guided fusion, and bidirectional temporal collaborative attention, generating spatiotemporally consistent high-quality fusion videos. The authors also construct the M3SVD dataset comprising 220 videos and 153,797 frames.
VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion: This paper introduces the M3SVD large-scale infrared-visible video dataset (220 videos / 150K frames) and proposes the VideoFusion framework, which achieves spatio-temporal collaborative multi-modal video fusion via a Cross-modal Differential Reinforcement Module (CmDRM), Complete Modal Guided Fusion (CMGF), Bidirectional Co-Attention Module (BiCAM), and a variational consistency loss. The method surpasses existing image fusion and video fusion approaches in both fusion quality and temporal consistency.
ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting: ViKey overlays frame-index visual prompts (VPs) onto video frames and incorporates a lightweight Keyword-Frame Mapping (KFM) module to significantly improve temporal reasoning in VideoLLMs without any training, achieving near-dense-frame performance with as few as 20% of frames.
ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking: ViRC proposes a Reason Chunking mechanism that structures multimodal mathematical CoT into sequential Critical Reasoning Units (CRUs), simulating the process by which human experts repeatedly consult visual information and incrementally verify intermediate propositions. Through the CRUX dataset and a progressive training strategy (Instructional SFT → Practice SFT → Strategic RL), ViRC-7B achieves an average improvement of 18.8% across mathematical benchmarks.
Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning: This paper proposes MedCBR, a framework that integrates clinical diagnostic guidelines (e.g., BI-RADS) into the training and inference pipeline of concept bottleneck models. It uses LVLMs to generate guideline-consistent reports for enhanced concept supervision, and combines multi-task CLIP training with a large reasoning model that generates structured clinical explanations. MedCBR achieves AUROCs of 94.2% and 84.0% on ultrasound and mammography cancer detection, respectively.
VISion On Request: Enhanced VLLM Efficiency with Sparse, Dynamically Selected, Vision-Language Interactions: VISOR proposes a new efficiency paradigm distinct from visual token compression — by sparsifying vision-language interaction layers within the LLM (a small number of cross-attention layers plus dynamically selected self-attention layers), it achieves 8.6–18× FLOPs savings while retaining all high-resolution visual tokens, substantially outperforming token compression methods on challenging tasks that require fine-grained understanding.
VL-RouterBench: A Benchmark for Vision-Language Model Routing: This paper introduces VL-RouterBench, the first systematic routing benchmark for vision-language models, encompassing 14 datasets, 17 candidate models, and 519,180 sample-model pairs. It evaluates 10 routing methods and reveals a significant gap between the current best router and the ideal Oracle.
VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery: This paper proposes a VLM-guided dual-memory self-reflective Critique Agent that generates group-level preference signals for diffusion-based human mesh recovery (HMR), followed by Group Preference Alignment fine-tuning of the diffusion model. The approach substantially improves in-the-wild HMR accuracy without requiring any 3D annotations.
VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models: This paper proposes VLM-Loc, a framework that converts 3D point cloud maps into BEV images and scene graphs for structured spatial reasoning with VLMs, and introduces a Partial Node Assignment (PNA) mechanism for fine-grained text-to-point-cloud localization. On the newly constructed CityLoc benchmark, VLM-Loc achieves a 14.20% improvement in Recall@5m over the previous state of the art.
VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm: This paper proposes VLM-Pruner, a training-free centrifugal token pruning method that balances redundancy elimination and local detail preservation through a Buffering for Spatial Sparsity (BSS) criterion. At an 88.9% pruning rate, it consistently outperforms existing methods across 5 VLMs while achieving end-to-end inference acceleration.
Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks: This paper presents the first systematic study of model inversion (MI) attacks against VLMs, proposing a suite of inversion strategies tailored to token generation (TMI/TMI-C/SMI) and an adaptive attention-weighted method SMI-AW that dynamically weights token gradient contributions based on visual attention intensity. Evaluated across 4 VLMs and 3 datasets, SMI-AW achieves up to 61.21% human-evaluated attack accuracy, revealing severe training data privacy leakage risks in VLMs.
VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments: This paper introduces VS-Bench, a multimodal benchmark comprising ten visual game environments, which systematically evaluates VLMs' strategic abilities in multi-agent settings across three dimensions—perception, strategic reasoning, and decision-making. Results reveal that even the strongest current models exhibit significant gaps from optimal performance in reasoning and decision-making.
Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training: Wan-Weaver proposes a decoupled architecture consisting of a planner (VLM) and a visualizer (DiT). By training the planner on large-scale textual-proxy data instead of real interleaved data, it achieves an Overall score of 8.67 on OpenING—approaching Nano Banana's 8.85—while maintaining strong comprehension capability (MMMU 74.9) and state-of-the-art interleaved text-image generation.
WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs: This paper diagnoses the Time-Agnosticism problem in current Video-LLMs and proposes the WeaveTime framework. During training, a temporal reconstruction auxiliary task (SOPE) endows the model with temporal awareness; during inference, an uncertainty-gated coarse-to-fine memory cache (PCDF-Cache) enables efficient adaptive memory retrieval, achieving significant gains on streaming video QA.
WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs: This paper diagnoses the Time-Agnosticism problem in current Video-LLMs and proposes WeaveTime, a framework that endows models with temporal awareness via a Shuffled-Order Prediction Enhancement (SOPE) auxiliary task during training, and achieves efficient adaptive memory retrieval at inference via an uncertainty-gated coarse-to-fine memory cache (PCDF-Cache), yielding significant gains on streaming video QA benchmarks.
What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models: This paper proposes EmbedLens, a probing tool for systematically analyzing the internal structure of visual tokens in MLLMs. It reports three findings. First, visual tokens fall into three categories (sink, dead, and alive), and approximately 40% are uninformative. Second, alive tokens already encode rich semantics before entering the LLM (a "pre-linguistic" property). Third, intra-LLM visual computation is redundant for most tasks, so direct mid-layer injection suffices.
When to Think and When to Look: Uncertainty-Guided Lookback: This paper presents the first systematic analysis of the effect of test-time thinking on visual reasoning in LVLMs. It reveals that "looking more is better than thinking more"—long reasoning chains frequently neglect the image, producing "long-wrong" trajectories. Based on this finding, the authors propose an uncertainty-guided lookback decoding strategy that injects visual re-inspection prompts when reasoning chains drift, achieving 2–6 point improvements on MMMU and five other benchmarks without modifying the model.
When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs: This paper identifies a phenomenon in which existing token pruning methods underperform random pruning in deep layers of VLLMs, proposes a method for quantifying visual token information based on changes in output probability, and reveals the "Information Horizon"—a critical layer at which visual token information uniformly dissipates to zero. The position of this horizon is dynamically influenced by task visual complexity and model capability. The paper further demonstrates that simply integrating random pruning can effectively improve existing methods.
Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation: This paper proposes Eagle, a lightweight black-box attribution framework that performs spatial attribution for autoregressive token generation in MLLMs via a unified objective combining insight score (sufficiency) and necessity score (indispensability), and quantifies whether each generated token relies on language priors or perceptual evidence. Eagle comprehensively outperforms existing methods in faithfulness, localization, and hallucination diagnosis while substantially reducing GPU memory requirements.
Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models: This paper proposes CORE (COncept-aware REfuser), a framework for continual unlearning in large vision-language models (LVLMs). It decomposes vision-language deletion targets into fine-grained visual attribute concepts and textual intent concepts, employs a concept modulator to identify concept combinations requiring refusal, and generates concept-aligned refusal responses via a mixture of refusal experts (refusers). CORE achieves state-of-the-art unlearning-retention trade-offs of 90.67% CRR and 88.02% AR across 16 sequential unlearning tasks.
Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs: This paper is the first to formally define the Widget-to-Code task, constructing the first image-only widget dataset and a multi-dimensional evaluation framework. It proposes a modular baseline built upon a Perceptual Agent and the WidgetFactory infrastructure, achieving high-fidelity widget reconstruction through component decomposition, icon retrieval, reusable visualization templates, and adaptive rendering.
Zina: Multimodal Fine-grained Hallucination Detection and Editing: Zina formalizes the task of multimodal fine-grained hallucination detection and editing. It proposes a two-stage system (detector MLLM + reviewer MLLM) that delegates token copying to a deterministic function to reduce model burden, and constructs the VisionHall dataset (6.9K human-annotated + 20K graph-based synthetic samples). The system surpasses GPT-4o by 15.8 points in detection F1.
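
As a rough illustration of the variation-based dropping idea summarized in the V2Drop entries above, the sketch below scores each vision token by how much its hidden state changes between two consecutive layers and keeps only the most-changed tokens. It is a minimal sketch, not the authors' implementation: the Euclidean distance as the variation measure, the `keep_ratio` parameter, and the function names `token_variation` and `drop_lazy_tokens` are all assumptions made for illustration.

```python
import math


def token_variation(prev_hidden, curr_hidden):
    """Per-token change between two layers.

    Assumption: variation is measured as the Euclidean distance between a
    token's hidden state at the previous layer and at the current layer.
    """
    return [math.dist(p, c) for p, c in zip(prev_hidden, curr_hidden)]


def drop_lazy_tokens(prev_hidden, curr_hidden, keep_ratio):
    """Return the indices of vision tokens to keep, in original order.

    Tokens with the smallest inter-layer variation ("lazy" tokens) are dropped.
    keep_ratio is a hypothetical knob: the fraction of tokens retained.
    """
    variation = token_variation(prev_hidden, curr_hidden)
    # Always keep at least one token so the sequence never becomes empty.
    num_keep = max(1, round(keep_ratio * len(variation)))
    # Rank token indices from most-changed to least-changed.
    ranked = sorted(range(len(variation)), key=lambda i: variation[i], reverse=True)
    # Restore the original token order for the survivors.
    return sorted(ranked[:num_keep])


# Toy example: four tokens with 2-dimensional hidden states.
prev = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
curr = [[0.0, 0.0], [3.0, 4.0], [0.1, 0.0], [1.0, 0.0]]
kept = drop_lazy_tokens(prev, curr, keep_ratio=0.5)
print(kept)
```

A progressive scheme in this spirit would call `drop_lazy_tokens` at several chosen layers, each time on the tokens that survived the previous call. Because the score depends only on hidden states and not on attention maps, a selection of this kind does not require access to attention weights.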