🧩 Multimodal VLM

📷 CVPR2026 · 388 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (211) · 💬 ACL2026 (83) · 🧪 ICML2026 (89) · 🤖 AAAI2026 (75) · 🧠 NeurIPS2025 (105) · 📹 ICCV2025 (106)

🔥 Top topics: Multimodal/VLM ×191 · Alignment/RLHF ×32 · LLM ×27 · Few-/Zero-Shot Learning ×17 · Adversarial Robustness ×17

4DP-QA: Scalable QA for 4D Perception in Vision Language Models

This paper designs a scalable spatiotemporal QA automatic generation pipeline, producing 400,000 training samples (4DP-QA) and a 2.2K benchmark (4DP-QA-Bench) from various real/synthetic 4D data sources. It introduces "true-motion point tracking" as a new perception task to decouple object motion from camera motion. By fine-tuning standard VLMs with this data, 4D perception accuracy increases from ~42% to ~84%, with generalization to the external benchmark VLM4D.

4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models

4DWorldBench proposes a unified, multimodal, physics-aware 4D world generation evaluation framework. By mapping text/image/video conditions into a unified textual space, it evaluates models across four dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency. It employs an adaptive hybrid scoring strategy involving LLM-as-judge, MLLM-as-judge, and traditional metrics, validated by human subjective experiments to be more aligned with human judgment than existing benchmarks.

A3: Towards Advertising Aesthetic Assessment

The authors propose the A3 framework, which includes a theory-driven three-stage advertising aesthetic assessment paradigm A3-Law (Perceptive Attention → Formal Interest → Desire Impact), a dataset of 120,000 annotated samples (A3-Dataset), a model aligned via SFT and GRPO (A3-Align), and an evaluation benchmark (A3-Bench). It outperforms existing MLLMs in automated advertising aesthetic assessment.

A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks

This paper proposes a closed-form solution for VLM debiasing that achieves Pareto-optimal fairness with bounded utility loss through orthogonal decomposition of the attribute subspace in the cross-modal embedding space and Chebyshev scalarization. It is training-free and label-free, and covers zero-shot classification, text-to-image retrieval, and text-to-image generation in a single framework.

A More Word-like Image Tokenization for MLLMs

DiVT replaces the MLP projector in LLaVA with a clustering-based visual projector, grouping ViT patch features into "visual words" based on semantics. Each cluster generates a single token, with the token count adaptively varying based on image complexity. Trained solely on language modeling objectives, it matches or exceeds full-resolution baselines across 8 multimodal benchmarks using 1/4 or even 1/40 of the visual tokens.
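
The idea of one token per cluster, with a token count that follows image complexity, can be illustrated with a toy example. This is a minimal sketch and not DiVT's actual projector: the greedy cosine clustering, the function name `group_patches`, and the `sim_threshold` parameter are all assumptions made for illustration.

```python
import numpy as np

def group_patches(patches, sim_threshold=0.8):
    """Greedy cosine clustering: return one mean-pooled token per cluster.

    Hypothetical stand-in for a learned clustering projector.
    """
    unit = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    centers, members = [], []  # running unit-norm centers and member indices
    for i, u in enumerate(unit):
        if centers:
            sims = np.array([c @ u for c in centers])
            j = int(sims.argmax())
            if sims[j] >= sim_threshold:
                # Similar enough: join the nearest cluster and refresh its center.
                members[j].append(i)
                mean = unit[members[j]].mean(axis=0)
                centers[j] = mean / np.linalg.norm(mean)
                continue
        # Otherwise this patch opens a new cluster.
        centers.append(u)
        members.append([i])
    # Each cluster contributes exactly one token (the mean of its patches).
    return np.stack([patches[m].mean(axis=0) for m in members])
```

With this rule, an image whose patches all look alike collapses to very few tokens, while a visually varied image keeps more of them.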

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

To address the deficiencies of VLMs in 3D spatial reasoning, this paper proposes the training-free SandboxVLM: it utilizes a video diffusion prior to generate multi-view sequences from a single 2D image, lifts key objects into sparse "abstract 3D bounding boxes," and renders them back to the VLM. This enables zero-shot understanding of 3D structures, achieving a 17.4% improvement over the baseline on SAT-Real.

Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models

This paper introduces TANL (Test-time Activated Negative Labels), which dynamically evaluates the "activation level" of negative labels on OOD samples during test-time to mine the most effective labels. Combined with an activation-aware scoring function, it significantly reduces FPR95 from 17.5% to 9.8% on ImageNet benchmarks while remaining training-free and computationally efficient at inference.

Active Perceptual Inference: A Corticothalamic-Inspired Dynamic Nested Recurrent Network for Multimodal Sentiment Analysis with Incomplete Data

To address the "random frame-level missingness" problem in Multimodal Sentiment Analysis (MSA), this paper incorporates the human brain's "active perceptual inference" mechanism into the network. It proposes a dual-layer nested recurrent network, DNRNet: a local loop simulates intra-cortical pattern completion for intra-modality self-correction, while a global loop simulates the corticothalamic circuit to perform cross-modal weighted completion based on modality confidence. Two corrective signals are iteratively fed back into the input, upgrading "one-pass feedforward passive completion" to "multi-round active inference completion," achieving an average improvement of 1.5%–2.0% across various missing rates on MOSI/MOSEI/SIMS.

Adapting In-context Generation for Enhanced Composed Image Retrieval

This paper proposes DAIG: using 32 target domain samples to perform in-context fine-tuning (CIR-LoRA) on a pre-trained T2I model (Flux). This allows the model to synthesize "unbiased, domain-aligned" Composed Image Retrieval (CIR) triplets in batches. A two-stage training framework (feature-perturbed pre-training DRSP + angular margin fine-tuning FRA) is then used to feed these synthetic data into any off-the-shelf CIR model, achieving significant performance gains on CIRR/FashionIQ in a plug-and-play manner with zero additional inference cost.

Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning

The authors observe that in source-free cross-domain few-shot learning (CDFSL) scenarios, standard few-shot fine-tuning on the target domain significantly exacerbates the attention sink of CLIP. The model concentrates attention on "simple tokens" that are inherently associated with all classes, leading to a loss of inter-class discriminability. To address this, TIR (Token Importance Recalibration) is proposed. It linearly reweights tokens between deep layers of the CLIP vision encoder based on their "cross-class activation" (Sum score). This suppresses sink tokens and amplifies discriminative tokens, achieving new SOTA results across four CDFSL benchmarks.

ADSeeker: A Knowledge-Grounded Reasoning Framework for Industry Anomaly Detection and Reasoning

ADSeeker is a large-scale pre-training-free, plug-and-play industrial anomaly detection (IAD) assistant. It injects domain-specific knowledge into a general MLLM using the first visual document knowledge base SEEK-M&V and a multimodal retrieval framework Q2K RAG. Combined with an AD Expert that fuses defect localization/discrimination information into visual tokens and a Hierarchical Sparse Prompt (HSP) to extract type-level defect features, it achieves SOTA in zero-shot anomaly detection and MMAD anomaly reasoning across 12 industrial and medical datasets.

AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models

AG-VAS introduces three learnable "semantic anchor" tokens into the Large Multimodal Model (LMM) vocabulary: an absolute anchor [SEG] that translates abstract "anomalies" into concrete visual entities (e.g., holes, scratches), and relative anchors [NOR]/[ANO] that model contrastive contexts between normal and anomalous regions. Combined with a Semantic-Pixel Alignment Module (SPAM) and an Anchor-Guided Mask Decoder (AGMD), the model directly outputs binary anomaly masks for unseen categories, achieving new zero-shot SOTA performance across six industrial and medical benchmarks.

AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models

AGFT proposes an alignment-guided fine-tuning framework that enhances zero-shot adversarial robustness of VLMs through text-guided adversarial training and distribution consistency calibration. By preserving the pre-trained cross-modal semantic structure, it achieves an average robust accuracy of 46.57% across 15 zero-shot benchmarks, surpassing SOTA by 3.1 percentage points.

Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow

This paper identifies that the excessive attention of text tokens toward irrelevant vision tokens is the root cause of the "seeing but perceiving incorrectly" phenomenon in VLMs. It proposes Adaptive Information Flow (AIF), a training-free method based on token dynamic entropy that regulates information flow by modifying the causal mask at inference time to block irrelevant vision-to-text connections, enhancing the perceptual capabilities of various VLMs.
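
The mask edit itself is simple to picture. The sketch below assumes the set of irrelevant vision tokens has already been chosen (the entropy-based selection is not reproduced here), and the function and argument names are hypothetical.

```python
import numpy as np

def block_vision_to_text(seq_len, text_positions, blocked_vision_positions):
    """Return a boolean attention mask (True = may attend) with selected links cut."""
    # Standard causal mask: each token attends to itself and earlier tokens.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    rows = np.asarray(text_positions)[:, None]            # querying text tokens
    cols = np.asarray(blocked_vision_positions)[None, :]  # vision tokens to hide
    mask[rows, cols] = False
    return mask

# Tokens 0-3 are vision tokens, 4-5 are text tokens; hide vision tokens 1 and 2.
mask = block_vision_to_text(6, text_positions=[4, 5], blocked_vision_positions=[1, 2])
```

Only the rows belonging to text tokens change, so vision tokens still attend to each other exactly as before.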

Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval

Addressing the failure of the traditional "small-loss" assumption in Composed Image Retrieval (CIR) due to "partial matching" noise, this paper uses an MLLM to offline label a high-precision anchor set and distills it into a lightweight Bayesian proxy for online confidence estimation. By diverting training data into "Clean Alignment" and "Feedback Correction" streams, the method decouples the arbiter from the learner to avoid representation pollution, significantly outperforming existing SOTA in high-noise CIR settings.

Anchor-Guided Gradient Alignment for Incomplete Multimodal Learning

Addressing the learning imbalance problem where "reconstructed samples dominate optimization and suppress the representation of complete samples" under high missing rates, ANGA constructs optimization anchors using complete samples and aligns the gradients of reconstructed samples toward these anchors (three-stage modulation within a conical region). Coupled with a Semantic Enhancement Adapter that generates dynamic prompts from retrieved instances, it consistently outperforms SOTAs like RAGPT on three datasets.

Anti-Degradation Lifelong Multi-View Clustering

Addressing the "streaming views arriving over time" scenario, ALMC projects the prototypes of each new view onto the null space (orthogonal direction) of the old knowledge subspace before fusion. This mathematically ensures that new knowledge does not overwrite old knowledge, achieving SOTA results on six benchmarks (e.g., ALOI-10 ACC improved from 87.4% to 90.9%).
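
The guarantee can be written out in one line. As a sketch under assumed notation (none of these symbols come from the paper): let the columns of \(U \in \mathbb{R}^{d \times r}\) be an orthonormal basis of the old knowledge subspace, and let \(p \in \mathbb{R}^{d}\) be a prototype of the new view.

\[
P = I - U U^{\top}, \qquad \tilde{p} = P p, \qquad U^{\top} \tilde{p} = U^{\top} p - (U^{\top} U)\, U^{\top} p = 0 .
\]

Because \(U^{\top} U = I\), the projected prototype \(\tilde{p}\) has no component along any old direction, so fusing it cannot change what is stored in the old subspace.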

ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

ARGUS discovers that the behaviors of "following user instructions vs. following injected instructions" are linearly separable in the activation space of MLLMs and reside within a "safe subspace." By applying activation steering toward a "defensive yet performance-preserving" direction during inference, combined with a three-stage pipeline (injection detection + adaptive intensity + post-filtering), it reduces attack success rates to near zero across image, video, and audio modalities while maintaining model utility.
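
Activation steering along a linearly separable direction can be sketched as follows. This is a generic difference-of-means illustration, not the ARGUS pipeline: the function names and the scalar `strength` are assumptions, and the detection, adaptive-intensity, and post-filtering stages are omitted.

```python
import numpy as np

def steering_direction(acts_user, acts_injected):
    """Unit vector pointing from 'follows injected' toward 'follows user' activations."""
    d = acts_user.mean(axis=0) - acts_injected.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden, direction, strength):
    """Shift hidden states along the direction; `strength` is a hypothetical scalar."""
    return hidden + strength * direction
```

Because the direction has unit norm, `strength` directly sets how far each hidden state moves along it.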

ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding

ArtiMuse utilizes an InternVL-3-8B based Multimodal Large Language Model (MLLM) to simultaneously output 8-dimensional fine-grained expert aesthetic textual analysis and a continuous aesthetic score. By introducing "Token As Score," the model integrates continuous scoring into discrete LLM token generation. It also introduces ArtiMuse-10K, the first dataset with 10,000 expert-annotated samples per dimension, achieving SOTA performance on multiple aesthetic scoring benchmarks.

AToken: A Unified Tokenizer for Vision

AToken unifies the encoding of images, videos, and 3D assets into a shared sparse 4D latent space. Utilizing a pure Transformer with non-adversarial Gram loss, it achieves high-fidelity reconstruction and semantic understanding simultaneously. A single model achieves performance competitive with specialized methods across three modalities (Image 0.21 rFID / 82.2% ImageNet, Video 3.01 rFVD, 3D 28.3 PSNR / 90.9% accuracy).

Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs

AoD-IP adds three lightweight projectors to a frozen CLIP and uses a "credential token" to lock authorized domains so they can only be activated with a specific key. This allows on-demand hot-swapping of new authorized domains after deployment without retraining the backbone, while outputting a "legality signal" during each inference to detect unauthorized access. It achieves near-zero loss on authorized domains and significant accuracy collapse on unauthorized domains across multiple cross-domain benchmarks.

AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models

AutoTraces extends a multimodal LLM (LLaVA-Video) by introducing a <point> token with a corresponding Point Encoder/Head representation. This maps 2D waypoints into the LLM latent space, allowing the model to predict future robotic trajectories point-by-point through native autoregressive mechanisms. Combined with automatically generated Chain-of-Thought (CoT) reasoning and two-stage training, it outperforms SOTA models on the SCAND dataset in long-horizon, cross-scenario, and variable-length forecasting.

β-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment

β-CLIP decomposes a long description into a three-layer hierarchy of text queries ("full caption → sentences → phrases"). It utilizes cross-attention to dynamically aggregate these queries into query-specific visual features. A contrastive loss with \(\beta\) adjustment (\(\beta\)-CAL) is introduced to handle the inherent semantic overlap between these hierarchical features. Without using any hard negatives, it improves fine-grained retrieval (FG-OVD Hard) to 30.9% and Urban1K retrieval to 91.8/92.3%, establishing a new SOTA under the "no hard negative" setting.

BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates

BALM proposes a model-agnostic, plug-and-play framework to address multimodal learning under Imbalanced Missing Rates (IMR). By employing a Feature Calibration Module (FCM) to align representations across different missing patterns and a Gradient Rebalancing Module (GRM) to balance the optimization dynamics of each modality from both distributional and spatial dimensions, BALM consistently improves the robustness of various backbone networks across multiple multimodal emotion recognition benchmarks.

Benchmarking Single-Factor Physical Video-to-Audio Generation

This paper introduces FlatSounds—a benchmark that audits the physical reasoning capabilities of video-to-audio (V2A) models using "single-factor counterfactual intervention + single-video pattern testing." It reveals that current SOTA models actually "copy" physics and semantics from text captions rather than learning them from pixels, and stronger captions lead to poorer temporal alignment.

Beyond Global Similarity: Multi-Conditional Retrieval for Fine-Grained Cross-Modal Understanding

This paper introduces the MCMR benchmark—a fine-grained cross-modal product retrieval dataset that requires "simultaneous satisfaction of multiple complementary conditions across both image and text" for a match. It systematically evaluates mainstream MLLM retrievers and MLLM-as-Rerankers, finding that while current retrievers excel at coarse-grained recall, they struggle with multi-conditional reranking. Explicit verification of each query-candidate pair via pointwise reranking significantly improves top-tier ranking quality.

Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter

This work replaces the "one category = one deterministic vector" approach in VLM text adapters with "one category = one Gaussian distribution." It initializes distributions using diverse descriptions generated by LLMs, performs probabilistic message passing on the category graph, and incorporates a dynamic multi-backbone fusion scheme based on prediction certainty (kurtosis). The method consistently outperforms SOTA methods like GraphAdapter and AMU-Tuning in few-shot classification and OOD generalization across 11 datasets.

Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition

This paper reformulates VLM zero-shot image recognition as a Bayesian framework. It constructs a concept proposal distribution through an LLM-driven multi-stage concept synthesis pipeline and utilizes an adaptive soft-trim likelihood function to suppress the influence of outlier concepts, outperforming SOTA methods across 11 classification benchmarks.

Beyond Missing Modalities: Hypergraph Guided Diffusion for Uncertainty-Aware Multimodal Emotion Recognition

To address the random loss of audio/text/visual modalities in Multimodal Emotion Recognition in Conversation (MERC), HyperEF utilizes a Masked Hypergraph Attention Network (MHGAT) to capture high-order multivariate dependencies within dialogues. This network serves as a condition to guide a diffusion model in completing missing modality features within the latent space. Finally, Dual-Channel Evidential Fusion (DCEF) quantifies uncertainty from both "feature source" and "discriminative" perspectives to adaptively fuse modalities, achieving new SOTA performance across all missing rates on IEMOCAP and MELD.

Beyond Sequential Tools: A Unified VLM Agent System for Photographic Post-Processing via Dynamic Multi-Expert Fusion

A VLM acts as the "brain" to diagnose multiple coupled degradations in an image and assign weights to corresponding expert LoRAs. These LoRAs are fused into a diffusion backbone once, enabling collaborative restoration (e.g., "deraining + dehazing + deblurring") in a single forward pass. This avoids the generalization issues of all-in-one models and the error accumulation inherent in sequential tool-calling agentic methods.
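
The one-time fusion step can be sketched as a weighted sum of low-rank updates merged into a base weight. This is a minimal illustration under assumptions: the names are hypothetical, each expert is a plain `(B, A)` pair, and the usual LoRA scaling factor is assumed to be folded into the weights.

```python
import numpy as np

def fuse_experts(W0, experts, weights):
    """Return W0 + sum_i w_i * (B_i @ A_i), computed once before inference."""
    W = W0.copy()
    for (B, A), w in zip(experts, weights):
        W += w * (B @ A)  # low-rank update scaled by the VLM-assigned weight
    return W
```

After fusion the backbone runs a single forward pass with `W`, so the cost at inference does not grow with the number of experts.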

Beyond Single Images: A Comprehensive Benchmark for Album-Level Vision-Language Understanding

This paper introduces AlbumBench, the first comprehensive benchmark for "album organization." It decomposes album operations into four tasks: intent selection, intent rating, group labeling, and group clustering. Evaluating 20 mainstream VLM configurations on 27,051 images across 641 albums reveals a significant gap between open-source and closed-source models. "Thinking" modes significantly improve grouping tasks, but at a high cost, and VLMs perform only marginally better than baselines that use just a single-sentence album description.

Beyond Weak Supervision: MLLMs-Guided Graded Knowledge Distillation for Unsupervised Camouflaged Object Detection

Addressing the two major pain points of Unsupervised Camouflaged Object Detection (UCOD)—"weak supervision signals" and "poor utilization of pseudo-labels"—this paper employs a frozen teacher model composed of MLLM and SAM to generate high-quality pseudo-labels. Through a trio of designs—Camouflage-Aware Chain-of-Thought (CA-CoT), Graded Mask Evaluator (GME), and Graded Knowledge Distillation (GKD)—it ensures pseudo-label quality and distills knowledge based on quality differences to a student network. This approach significantly outperforms existing UCOD methods and demonstrates strong performance in zero-shot settings.

Beyond What's Shared: Recovering Lost Unique Information from Intermediate Layers to Boost Multimodal Geo-Foundation Models

The authors observe that the intermediate layers of multimodal contrastive models (e.g., SatCLIP) retain modality-specific (unique) information that is discarded by the final alignment layers. Consequently, they propose BWS (Beyond What's Shared), which performs deep weighted concatenation of intermediate and final layer representations. Without any additional training objectives or external models, this single step leads to consistent performance gains across seven geospatial downstream tasks.

Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models

The authors find that demographic bias in VLM embeddings is not concentrated on a few coordinate dimensions but rather distributed across several linear subspaces. They propose SPD: iteratively learning the entire "bias subspace" that can linearly predict sensitive attributes using INLP, projecting embeddings onto its orthogonal complement (null space) to eliminate decodable attribute signals, and then reinjecting a neutral mean to preserve semantics. Across zero-shot classification, text-to-image retrieval, and image generation, four fairness metrics improve by an average of 18.5% with negligible accuracy loss.
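
The three steps (learn directions, project them out, add a mean back) can be sketched in a few lines. This is an INLP-style illustration under stated assumptions, not the SPD implementation: a least-squares probe stands in for the linear classifier, and the function names and the choice of `neutral_mean` are hypothetical.

```python
import numpy as np

def bias_directions(X, z, n_iters=1):
    """Repeatedly fit a linear probe for attribute z, then remove its direction."""
    Xc = X - X.mean(axis=0)
    zc = z - z.mean()
    dirs = []
    for _ in range(n_iters):
        w, *_ = np.linalg.lstsq(Xc, zc, rcond=None)  # least-squares probe (assumption)
        norm = np.linalg.norm(w)
        if norm < 1e-8:      # nothing left for a linear probe to pick up
            break
        w = w / norm
        dirs.append(w)
        Xc = Xc - np.outer(Xc @ w, w)  # project the found direction out
    return np.array(dirs)

def debias(X, dirs, neutral_mean):
    """Project onto the orthogonal complement of dirs, then reinject the mean."""
    P = np.eye(X.shape[1]) - dirs.T @ dirs
    return (X - neutral_mean) @ P + neutral_mean
```

Each new direction is orthogonal to the ones already removed, so `dirs.T @ dirs` is a projector onto the learned bias subspace.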

BiomedCCPL: Causal Conditional Prompt Learning for Biomedical Vision-Language Models

Addressing the poor generalization of biomedical VLMs on "unseen classes within the same dataset," BiomedCCPL employs a VGAP module to dynamically generate image-conditional prompts from multi-scale adaptive visual prototypes and an SCD module to decouple prompts into causal and non-causal pathways via front-door adjustment for deconfounding. On 11 datasets across 9 modalities, the average HM for Base-to-Novel tasks is improved from 73.53% to 79.98% (+6.45%).
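
For reference, assuming the usual Base-to-Novel convention, HM is the harmonic mean of base-class and novel-class accuracy, and the reported gain is a difference in percentage points:

\[
\mathrm{HM} = \frac{2 \cdot \mathrm{Acc}_{\text{base}} \cdot \mathrm{Acc}_{\text{novel}}}{\mathrm{Acc}_{\text{base}} + \mathrm{Acc}_{\text{novel}}}, \qquad 79.98 - 73.53 = 6.45 .
\]

The harmonic mean is dominated by the smaller of the two accuracies, so a method cannot score well by trading novel-class accuracy for base-class accuracy.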

Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

PaddleOCR-VL uses a lightweight coarse-to-fine two-stage framework that first localizes valid regions and then identifies their content chunk by chunk, so that redundant background in high-resolution documents is filtered out before it reaches the VLM. With only 0.9B parameters and approximately 2.5k visual tokens, it achieves a SOTA overall score of 92.62 on OmniDocBench v1.5, while delivering 50% higher throughput than the strongest baseline.

Boosting Vision-Language Models Towards Cross-Domain Incremental Object Detection

To address the more realistic "Cross-Domain Incremental Object Detection" scenario, this paper establishes a CDIOD benchmark (involving sequential sub-tasks across natural, underwater, and remote sensing domains) and proposes the DGS framework. DGS dynamically groups tasks based on distribution similarity, shares subspaces via expandable LoRA adapters within groups, and performs inference with group-level routing. It achieves SOTA on CDIOD with +11.4 AP using only +1.2% additional parameters.

Boosting Visual Reprogramming for CLIP with Dual Granularity Alignment

Addressing the flaw of "single-level alignment" in CLIP visual reprogramming (which only trains visual prompts at the input while freezing the black-box CLIP), this paper proposes DGA. DGA extracts two overlooked types of structural information—semantic granularity (label hierarchy) and visual granularity (multi-scale). It uses PLH+HKP for hierarchical semantic alignment and multi-scale cropping + UPF for uncertainty-weighted visual alignment. These two paths collaborate to achieve an average improvement of 4.5% over the previous SOTA (DVP) across 12 recognition datasets.

Breaking Multimodal LLM Safety via Video-Driven Prompting

This paper reveals that the video modality is more susceptible to jailbreaking than the image modality. It proposes SPTV: weaving harmful typographic images into a video that is "proximal to safe data in representation space and sufficiently diverse across frames" via bipartite graph matching. SPTV achieves SOTA jailbreak success rates (average 36.4%) across 16 safety policies and 5 open/closed-source MLLMs, while providing an effective Video-aware System Prompt (VSP) defense.

Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding

Addressing the "over-reliance on language priors and neglect of visual evidence" in Vision-Language Models (VLMs) that leads to object hallucinations, this paper proposes a training-free Positive-and-Negative Decoding (PND). By using an external BLIP cross-modal attention to locate visual evidence, PND constructs a "positive path" to amplify evidence and a "negative path" to erase evidence and expose priors. During each decoding step, logits from three paths are contrastively fused to pull generation toward visual facts, achieving up to a 6.5% accuracy improvement on POPE.

Bridging the Modality Gap in Compositional Zero-Shot Learning via Sparse Alignment and Unimodal Memory Bank

Addressing the inherent "modality gap" of CLIP in Compositional Zero-Shot Learning (CZSL), this paper proposes the SAM three-stage framework: Sparse Alignment selects the image patches most relevant to the text to reduce redundant visual information; Visual Adaptive Condensation compresses key cues into a single representation; and a Dynamic Memory Bank bypasses the modality gap through pure visual classification. This approach outperforms CLIP-based methods across three benchmarks in both closed-world and open-world settings.

BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment

BriMA addresses non-stationary modality imbalance in multi-modal continual action quality assessment through memory-guided bridged completion and modality-aware replay, achieving an average gain of 6–8% in correlation coefficients and a 12–15% reduction in error across three benchmarks.

CAD-Refiner: A Unified Framework for CAD Generation and Iterative Editing

CAD-Refiner utilizes a VLM agent to parse text, images, or editing instructions into a "topological graph" as a unified condition. Combined with a "Sequence Injection Strategy," it aligns generation, completion, and editing tasks into a single decoder. It corrects geometric errors using adaptive loss weighting based on OCCT geometric validation, completing a full CAD modeling workflow from initial generation to multi-round iterative editing within a unified framework.

CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models

CADFS reconstructs 450,000 real-world CAD models created by engineers on the Onshape platform into clean, executable FeatureScript code. Supplemented with automatically generated text and multi-view annotations, it enables VLMs to generate complex design histories beyond simple "sketch+extrude" for the first time—including 15 types of operations such as fillet, loft, and revolve—setting new SOTA results in both text-to-CAD and multi-view reconstruction tasks.

Camouflage-aware Image-Text Retrieval via Expert Collaboration

This paper introduces "Image-Text Retrieval" (ITR) to camouflaged scenes for the first time, constructing the CamoIT dataset with 10.5k samples. It proposes CECNet, featuring a dual-branch architecture and Confidence-conditioned Graph Attention (C2GA). By utilizing a Camouflaged Object Detection (COD) expert to "extract" and independently encode camouflaged targets before selectively fusing them back into global representations, CECNet improves retrieval accuracy by approximately 29%, outperforming seven mainstream models in Camouflage-aware ITR (CA-ITR).

Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching

FlowSG reformulates Scene Graph Generation (SGG) from "one-shot classification" into "progressive generation." By using hybrid discrete-continuous flow matching, a noise-polluted graph gradually evolves into object boxes (via continuous CFM) and predicate labels (via discrete DFM) over time. It outperforms the SOTA (USG-Par) by an average of 3 points across closed-set and open-vocabulary settings on VG and PSG datasets.
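
For the continuous (box) part, a flow matching training pair can be sketched as below. This assumes the common linear interpolation path between a noise sample and the target; whether FlowSG uses exactly this path is not stated here, and the function name is hypothetical. The discrete predicate branch is not covered.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Linear-path flow matching: a point on the path and its target velocity."""
    x_t = (1.0 - t) * x0 + t * x1   # where the noisy box sits at time t
    v_target = x1 - x0              # constant velocity the model should regress
    return x_t, v_target

# Hypothetical example: a noisy box (cx, cy, w, h) flowing toward a ground-truth box.
noise_box = np.array([0.9, 0.1, 0.5, 0.5])
gt_box = np.array([0.4, 0.6, 0.2, 0.3])
x_t, v = cfm_training_pair(noise_box, gt_box, t=0.25)
```

At inference, integrating the predicted velocity from `t = 0` to `t = 1` moves the noisy box progressively toward a clean prediction.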

CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation

CapNav proposes a "capability-conditioned navigation" benchmark: given an indoor tour video, a navigation graph, an agent profile with physical/operational capabilities, and a "go from A to B" task, VLMs must determine if and how the agent can navigate the space. Experiments on 13 mainstream VLMs show that navigation performance drops significantly once mobility constraints (e.g., inability to climb stairs, narrow corridors) are introduced.

CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment

The CAPT (Confusion-Aware Prompt Tuning) framework is proposed to explicitly model systematic misalignment patterns in VLMs through a Semantic Confusion Miner (SEM) and a Sample Confusion Miner (SAM). By integrating different levels of confusion information via Multi-Granularity Difference Experts (MGDE), it achieves a state-of-the-art HM of 83.90% across 11 benchmarks.

CaptionQA: Is Your Caption as Useful as the Image Itself?

CaptionQA redefines "caption quality" as "whether the caption can substitute for the image in downstream tasks." By using a text-only LLM to answer 33,027 dense multiple-choice questions based solely on captions, it measures exactly how much usable information is lost relative to the original image. Results show even the strongest closed-source models suffer a 9–16% utility drop, while open-source models drop over 40% in Embodied AI scenarios.

CASPA: Graph-Structured Concept Anchors for Modality-Agnostic Adaptation in Vision-Language Models

CASPA reformulates CLIP downstream adaptation from "learning one prompt set per class" to "sharing a set of semantic anchors across all classes, with each class learning a soft distribution over these anchors." By using cross-modal consistency regularization to align text and visual anchors while freezing the backbone, CASPA adds only 1.1M parameters (0.73% of CLIP). It matches or exceeds SOTA across 11 datasets in four settings, including Base-to-Novel generalization, cross-dataset transfer, and few-shot learning.

CF-IPT: Cross-Modal Fusion Interactive Prompt Tuning of Vision-Language Pre-Trained Model for Multisource Remote Sensing Data Classification

CF-IPT proposes a prompt tuning framework that fuses Hyperspectral (HSI) and LiDAR/SAR data into a single image to generate spectral-spatial prompt matrices. These matrices guide the bidirectional interactive alignment between CLIP's vision and text prompts. By tuning only 0.76% of CLIP's parameters, it transfers the natural-image pre-trained model to multisource remote sensing classification, achieving OA gains of 1.38%/2.27%/1.38% over SOTA on Houston, MUUFL, and Augsburg datasets respectively.

ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

This paper proposes ChartNet, a million-scale chart understanding dataset containing 1.5 million high-quality multimodal aligned samples. Through a code-guided synthesis pipeline, it generates quintuplet data (code, image, data table, text description, and reasoning-based QA) covering 24 chart types and 6 plotting libraries. A 2B model fine-tuned on ChartNet outperforms GPT-4o and 72B open-source models.

CICA: Coupling Confidence-Aware Pretraining with Confidence-Informed Attention for Robust Multimodal Sentiment Analysis

CICA enables each unimodal encoder to "self-evaluate" signal reliability (outputting confidence \(s_m\) and uncertainty \(u_m\)) during pretraining. These signals are then used to modulate the output of a Confidence-Informed Attention mechanism. This allows the model to adaptively amplify reliable modalities and suppress noisy ones when text, visual, or audio signals conflict or are missing, setting a new SOTA across MOSI, MOSEI, CH-SIMS, and CH-SIMSv2.

CLEP: Contrastive Language-Pose Pretraining

CLEP adapts CLIP-style contrastive learning to "3D Human Pose ↔ Natural Language". By combining the hierarchical pose encoder HierFormer (joint/limb/full-body levels + Cross-Scale Attention Fusion, CSAF) with the self-constructed CLEP-2M dataset (2 million pairs) for contrastive pretraining, it boosts mRecall on PoseScript-H zero-shot retrieval from 5.9 to 34.8 (nearly 6x) and outperforms baselines in downstream tasks like pose generation and editing.
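
The CLIP-style objective that CLEP adapts can be sketched as a symmetric cross-entropy over a similarity matrix. This is a generic illustration, not CLEP's training code: the function names and the `temperature` value are assumptions, and the embeddings stand in for the outputs of a pose encoder and a text encoder.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along one axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(pose_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair."""
    p = pose_emb / np.linalg.norm(pose_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = p @ t.T / temperature           # pairwise cosine similarities
    idx = np.arange(len(p))
    loss_p2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # pose -> text
    loss_t2p = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> pose
    return 0.5 * (loss_p2t + loss_t2p)
```

The loss is low when each pose is most similar to its own description and high when the pairing is scrambled.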

CLIP-like Model as a Foundational Density Ratio Estimator

This paper reinterprets contrastively trained vision-language models like CLIP/SigLIP as "off-the-shelf density ratio estimators." The similarity scores implicitly optimized by contrastive objectives are shown to be proportional to log-density ratios. This enables two training-free capabilities: single-prompt importance-weighted pre-training (F1 gain up to +7 points) and image-text KL divergence estimation (measuring semantic diversity for data filtering, achieving results comparable to LAION2B filtering).
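
Under the density-ratio reading, exponentiated similarity scores can serve directly as importance weights. The sketch below is an assumption-laden illustration rather than the paper's procedure: the function name, the `temperature` value, and the self-normalisation step are all hypothetical choices.

```python
import numpy as np

def importance_weights(similarities, temperature=0.01):
    """Self-normalised weights proportional to exp(similarity / temperature)."""
    scaled = np.asarray(similarities, dtype=float) / temperature
    w = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return w * len(w) / w.sum()        # normalise so the mean weight is 1
```

Samples whose image embedding is closer to the single prompt receive larger weights, which is what lets one prompt reweight a training set without any extra training.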

Cluster-aware Anchor Learning for Multi-View Clustering

To address the drawbacks of "globally fixed anchor counts and treating every cluster equally" in anchor-based multi-view clustering, CAL partitions the consensus anchor matrix into \(k\) groups by cluster. It applies column-sparsity penalties to each group to automatically determine the number of anchors per cluster and pulls anchors of different clusters apart via inter-cluster orthogonal regularization. It outperforms 10 SOTAs in ACC/NMI across 8 benchmarks.

Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models

CPT constrains the "Neural Collapse / ETF equiangular separation" from all global classes to the internal semantic clusters inherent in pre-trained VLMs. By incorporating a rotation stability loss that anchors learnable text prototypes to frozen ones, it enhances tail-class separability in long-tailed prompt tuning without destroying CLIP's global semantic hierarchy—outperforming SOTAs like DPC/DeKg/NPT across 11 datasets.
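
For readers unfamiliar with the ETF geometry mentioned above, here is a minimal sketch of a simplex equiangular tight frame built from centered one-hot vectors. It only illustrates the equiangular property (every pair of prototypes has cosine \(-1/(k-1)\)); how CPT applies this within semantic clusters is not reproduced, and the helper names are hypothetical.

```python
import math

def simplex_etf(k):
    """Return k unit vectors in R^k forming a simplex equiangular tight frame."""
    vectors = []
    for i in range(k):
        # Centered one-hot vector: e_i minus the mean of all one-hot vectors.
        v = [(1.0 if j == i else 0.0) - 1.0 / k for j in range(k)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors

def cosine(u, v):
    # Inputs are already unit-norm, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))

protos = simplex_etf(4)
pairwise = [cosine(protos[i], protos[j]) for i in range(4) for j in range(i + 1, 4)]
# Every pair has cosine -1/(k-1); for k = 4 that is -1/3.
```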

CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval

Addressing the gap where "code retrieval only considers text and ignores visual rendering," this paper introduces MMCoIR, the first multimodal and multilingual code retrieval benchmark (covering 5 visual domains, 8 languages, and 11 libraries). Based on Qwen2VL, the authors develop CodeMMR using instruction-conditioned contrastive learning to project text, code, and images into a unified semantic space. CodeMMR outperforms strong baselines like VLM2Vec-v2 and GME by approximately 10 points in average nDCG@10. Integrating it into RAG workflows further improves the execution rate and visual fidelity of image-to-code generation.

CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Through systematic scaling analysis, this paper finds that perception, rather than reasoning, is the true bottleneck for MLLMs in the STEM field. It proposes the CodePercept paradigm, which uses executable Python code as an anchoring medium, and constructs the million-scale ICC-1M dataset and the STEM2Code-Eval benchmark. After two-stage training (SFT+RL), the STEM visual perception and downstream reasoning capabilities of MLLMs improve significantly.

Concept-Aware Batch Sampling Improves Language-Image Pretraining

This paper transforms "data curation" from offline, sample-level, concept-agnostic filtering into online, batch-level, concept-aware sampling. The authors first annotate 128 million image-text pairs with fine-grained concepts (DATACONCEPT), then utilize a pluggable scoring function, CABS, to select sub-batches from a super-batch during training that match target concept distributions—using "Diversity Maximization" for classification and "Frequency Maximization" for retrieval. This achieves a 7% gain in classification and a 9.1% gain in retrieval across 28 benchmarks.
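
The sub-batch selection step can be pictured with a small greedy sketch: from a super-batch of concept-annotated samples, repeatedly pick the sample whose concepts are least represented in the selection so far. This is only one plausible reading of "Diversity Maximization"; the scoring rule, the function name `select_diverse_subbatch`, and the toy data are assumptions, not the CABS scoring function itself.

```python
from collections import Counter

def select_diverse_subbatch(super_batch, sub_batch_size):
    """Greedily pick samples whose concepts are least represented so far.

    super_batch: list of (sample_id, set_of_concepts) pairs.
    """
    counts = Counter()
    remaining = list(super_batch)
    chosen = []
    while remaining and len(chosen) < sub_batch_size:
        # Redundancy = how often this sample's concepts already appear in the
        # selection, averaged over its concepts; lower means more novel.
        def redundancy(item):
            _, concepts = item
            if not concepts:
                return float("inf")
            return sum(counts[c] for c in concepts) / len(concepts)

        best = min(remaining, key=redundancy)
        remaining.remove(best)
        chosen.append(best[0])
        counts.update(best[1])
    return chosen

# Toy super-batch with hypothetical concept annotations.
super_batch = [
    ("a", {"dog"}),
    ("b", {"dog"}),
    ("c", {"cat"}),
    ("d", {"dog", "grass"}),
    ("e", {"car"}),
]
picked = select_diverse_subbatch(super_batch, 3)  # ["a", "c", "e"]
```

A frequency-oriented variant for retrieval would swap the `min` for a rule that favors samples matching a target concept distribution.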

Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach

The paper proposes a training-free CLIP explanation method called CCI (clustering image patches into semantic clusters, masking attention by cluster, and quantifying contributions via similarity drops). This method reveals that "most CLIP errors are fine-grained confusion rather than background dependency." Additionally, the authors build the COVAR benchmark to systematically evaluate the spurious correlation tendencies of 18 CLIP variants across controlled transformations.
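
The "similarity drop" idea can be sketched on toy vectors. Note the swap: the paper masks attention by cluster inside CLIP, whereas this sketch simply removes a cluster's patch embeddings before mean pooling, which is a much cruder stand-in. All names and data below are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_pool(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cluster_importance(patch_embs, cluster_ids, text_emb):
    """Similarity drop when each cluster's patches are removed before pooling.

    Assumes at least two clusters, so some patches always remain.
    """
    base = cosine(mean_pool(patch_embs), text_emb)
    drops = {}
    for c in sorted(set(cluster_ids)):
        kept = [p for p, cid in zip(patch_embs, cluster_ids) if cid != c]
        drops[c] = base - cosine(mean_pool(kept), text_emb)
    return drops

# Toy example: cluster 0 matches the text direction, cluster 1 does not.
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
drops = cluster_importance(patches, [0, 0, 1, 1], text_emb=[1.0, 0.0])
```

A large positive drop marks a cluster the prediction depends on; a negative drop marks a cluster whose removal actually helps the match.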

Condensed Test-Time Adaptation of VLMs for Action Recognition

Addressing the non-transitivity of the mapping chain in training-free cache-based Test-Time Adaptation (TDA), where "vision-vision alignment is dominated by appearance while vision-text alignment is dominated by semantics," CONDA uses text semantics to guide the construction of visual caches. It condenses only patches positively correlated with action semantics (PSPS) into spatio-temporal tubes (ATC), consistently outperforming TDA by 1% to 3.5% across 7 action recognition benchmarks and enabling plug-and-play integration with any VLM.

ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval

Aiming at the "hard noise" in Noisy Triplet Correspondence (NTC) for Composed Image Retrieval (CIR)—where the reference and target images are highly similar but the modification text is incorrect—this paper proposes ConeSep. It first quantifies the matching fidelity of each sample using geometric boundaries in a cone space for noise separation, then learns a "diagonal negative composition" for each query as an explicit semantic negative anchor. Finally, noise correction is modeled as an optimal transport problem for directional unlearning. ConeSep outperforms SOTAs like TME, HABIT, and INTENT across various noise rates on FashionIQ and CIRR.

Conflict-Aware Adaptive Cross-Reconstruction for Multimodal Sentiment Analysis

Addressing the overlooked pain point where "linguistic, visual, and acoustic sentiment polarities contradict each other within the same sample," CACR first quantifies sentiment conflict scores in a shared subspace, then employs a conflict-weighted cross-reconstruction module to implicitly align shared semantics and suppress conflicting modalities. By supplementing textual semantics with fine-grained sentiment refinement, it outperforms existing SOTA on three standard datasets.

Controllable Federated Prompt Learning at Test Time

Addressing the performance collapse of Federated Prompt Learning models when encountering new domain distribution shifts after deployment, this paper introduces the Test-Time Federated Prompt Learning (TTFPL) setting. It proposes the COTE framework, which utilizes a custom unsupervised Model-Data Alignment (MoDA) score to dynamically select the optimal prompt among "Global, Local, and Original CLIP" prompts, improving average accuracy by over 6% across cross-domain settings on five benchmarks.

CoRiM: Conflict-driven Risk Minimization for Dynamic Multimodal Fusion

This paper redefines dynamic multimodal fusion as a "per-sample optimization problem that directly minimizes conflict risk." It designs a differentiable modality conflict risk function \(R(w)\) (comprising fusion uncertainty, modality confidence priors, and JS consistency) and employs the projection-free Frank-Wolfe algorithm to find optimal modality weights on the probability simplex. This approach significantly outperforms state-of-the-art methods like QMF and PDF in high-conflict and noisy scenarios.
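
A minimal sketch of the projection-free Frank-Wolfe update on the probability simplex follows. The paper's risk \(R(w)\) is not reproduced; a simple quadratic is used as a placeholder, and the function names, step-size schedule, and iteration count are assumptions.

```python
def frank_wolfe_simplex(grad, dim, steps=2000):
    """Minimize a differentiable function over the probability simplex.

    grad: callable returning the gradient (list of floats) at a weight vector.
    """
    w = [1.0 / dim] * dim  # start from uniform modality weights
    for t in range(steps):
        g = grad(w)
        # Linear minimization over the simplex: the minimizer is the vertex
        # (one-hot vector) at the coordinate with the smallest gradient.
        k = min(range(dim), key=lambda i: g[i])
        gamma = 2.0 / (t + 2.0)  # standard diminishing step size
        # Convex combination of the current point and the chosen vertex,
        # so the iterate never leaves the simplex and no projection is needed.
        w = [(1.0 - gamma) * wi for wi in w]
        w[k] += gamma
    return w

# Placeholder risk: sum_i (w_i - c_i)^2 with c on the simplex, standing in
# for the paper's conflict risk, which is not reproduced here.
target = [0.6, 0.3, 0.1]
risk_grad = lambda w: [2.0 * (wi - ci) for wi, ci in zip(w, target)]
weights = frank_wolfe_simplex(risk_grad, dim=3)
```

Because each iterate is a convex combination of simplex points, the weights stay non-negative and sum to one throughout, which is what makes the method attractive for per-sample fusion weights.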

CountGD++: Generalized Prompting for Open-World Counting

CountGD++ generalizes the "prompting" mechanism for open-world object counting: it allows users to specify "what to count" and "what NOT to count" using both text and visual exemplars. It enables the model to self-generate visual exemplars (pseudo-exemplars), borrow exemplars from external or synthetic images, and operate as an LLM-invoked counting expert agent. It achieves significant improvements in counting and detection accuracy across 8 datasets without fine-tuning (e.g., blood cell MAE dropped from ~11.5 to 1.52).

CoV-Align: Efficient Fine-grained Cross-Modal Alignment with Cohesive Visual Semantics Priority

CoV-Align proposes a fine-grained image-text retrieval framework that aggregates image patches into semantic regions without text involvement before performing region-word alignment. It uses deformable attention and consistent assign attention to generate regions, refined by spatial concentration and visual contrastive losses. On Flickr30K and MS-COCO, it achieves new SOTA results while being 3–5 times faster than text-guided methods.

CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models

This paper identifies the "Visual Preference Conflict" issue during visual encoder fine-tuning in MLLMs and proposes the CoVFT framework. By implementing Context Vector Extraction (CVE) and Context Mixture-of-Experts (CoMoE), it achieves context-aware visual fine-tuning, reaching SOTA performance on 12 multimodal benchmarks with significantly higher stability than existing methods.

CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods

This paper proposes CrossHOI-Bench, the first HOI benchmark for unified evaluation of VLMs and HOI-specific models via multiple-choice questions. By avoiding erroneous penalties from incomplete annotations through curated positive and negative examples, it reveals that large VLMs outperform SOTA HOI methods by \(+5.18\%\) in Instance-F1 zero-shot, while identifying systematic weaknesses in multi-action recognition and cross-human attribution.

Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

This paper proposes CubiD, the first model to perform discrete diffusion generation on high-dimensional representation tokens (768 dimensions). It achieves high-quality image generation through fine-grained mask prediction on an \(h \times w \times d\) three-dimensional tensor while preserving understanding capabilities.

DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles

This paper proposes DeAR, which decomposes deep attention heads in ViT into three functional roles—attribute, generalization, and mixed heads—using a Concept Entropy metric. By designing a role-based attention mask mechanism to precisely control information flow, it achieves an optimal balance between task adaptation and zero-shot generalization across 15 datasets.

Decoupled and Reusable Adaptation for Efficient Cross-Modal Transfer

This paper decouples the transfer of RGB foundation models to non-RGB modalities (such as infrared, depth, and events) into two stages: one-time modality knowledge learning (self-supervised training of a reusable modality LoRA) and lightweight task knowledge learning (task prompts plus a Mixture of Modality Experts). This eliminates the need for retraining from scratch when switching tasks, yielding efficiency gains on three fronts (data, computation, and storage) across six cross-modal scenarios.

Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation

DASP diagnoses biased modalities via redundancy scores and resolves negative transfer and catastrophic forgetting in multi-modal TTA through an asymmetric adaptation strategy that decouples stability and plasticity.

DeepAlign: Mitigating Modality Conflict through Modality-Specific Alignment

Addressing the "modality conflict" in MLLMs—where visual integration degrades linguistic performance and fails to capture fine-grained details—DeepAlign introduces a plug-and-play post-training framework. It uses classifier gradients to identify and push "modality-specific components" of visual representations toward the LLM's text embedding space, while distilling patch structural relationships from DINOv2 into the MLLM's visual hidden states. By training only an inserted adapter (200M parameters), DeepAlign achieves consistent gains across three major MLLMs on over ten benchmarks and activates emergent capabilities like multimodal in-context learning.

Demo2Tutorial: From Human Experience to Multimodal Software Tutorials

Demo2Tutorial is an agentic framework that automatically distills raw screen recordings and low-level operation logs of human software usage into structured, interleaved image-text tutorials. The generated tutorials outperform official human-authored tutorials in quality (86.2 vs. 79.1) on a self-built benchmark. They significantly improve the planning success rate of GUI Agents on OSWorld (GPT-5 on Chrome: 52.9% \(\rightarrow\) 70.6%), speed up human software learning by 10.5%, and are preferred by 80% of users.

Describe Anything Anywhere At Any Moment

DAAAM decouples "real-time geometric-semantic mapping" from "fine-grained local descriptions generated by large models": it uses an optimization problem to select the minimum number of keyframes, which are then fed in batches to the Describe Anything Model (DAM) to generate open-vocabulary descriptions. This allows for constructing hierarchical 4D scene graphs with detailed textual annotations at 10 Hz real-time, serving as spatio-temporal memory for embodied agents and achieving SOTA in large-scale spatio-temporal QA and sequence task localization.

Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection

This paper proposes the CARE framework, which first pinpoints neurons and layers causally related to unsafe behavior in VLMs using causal mediation analysis (diagnosis), and then constructs a dual-modal safety subspace via generalized eigenvalue decomposition onto which activations are projected during inference (repair). This reduces the attack success rate to below 10% with minimal loss in general capabilities.

DialogueVPR: Towards Conversational Visual Place Recognition

This work transforms language-guided place recognition from a static "one-time query" retrieval into a multi-round dialogue reasoning (DlgPR) framework: "retriever coarse-screening → multi-modal LLM active questioning → user feedback → refined retrieval." It introduces the first conversational place recognition benchmark, DQ-Cities, and a questioning agent, DQ-Pilot, trained via "SFT + GRPO curriculum learning." After 5 rounds of dialogue, the R@1 improves by 13.4% over a 7B base model, even outperforming a 72B model.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

This paper proposes the DACO framework, which constructs a dictionary of 15,000 multimodal concepts from WordNet and CC-3M. Combined with Sparse Autoencoders (SAE), it achieves fine-grained concept control over the activation space of frozen MLLMs, significantly enhancing safety across multiple benchmarks while maintaining general capabilities.

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

AuditDM fine-tunes an MLLM as an "auditor" to actively generate image-text pairs that induce failures in the target model while maintaining consensus among a set of reference models. This systematically uncovers the target model's capability blind spots and converts them into unlabeled training data for feedback—resulting in PaliGemma2-3B outperforming its official 28B version across multiple benchmarks.

Diffusion Guided Chain-of-Vision for Large Autoregressive Vision Models

This paper transfers the Chain-of-Thought concept from language models to pure visual Large Autoregressive Models (LVM). By using a pre-trained diffusion model to generate a sequence of visually coherent intermediate frames in the image space as a "task-agnostic reasoning process" inserted into the input sequence, it transforms LVM downstream tasks (segmentation, depth, pose, etc.) from "single-step direct output" to "multi-step progressive generation," achieving stable performance gains across seven visual tasks and three model scales.

DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Models

This paper proposes Differential Grounding (DiG) as a proxy task—giving the model two highly similar images and requiring it to localize all differences using bounding boxes without knowing the total count. Combined with Blender-automated data generation, GRPO reinforcement learning, and curriculum learning, the fine-grained visual perception of Qwen3-VL is significantly enhanced and successfully transfers to downstream grounding tasks like RefCOCO and general multimodal benchmarks.

DiGraphHal-Bench: Evaluating Multimodal Large Language Models on Complex Directed Graphs

DiGraphHal-Bench is the first large-scale VQA benchmark specifically for "complex directed graphs." It systematically evaluates MLLM hallucinations and compositional reasoning across 2,796 real flowcharts via four capability dimensions and 12 fine-grained tasks. By employing a two-stage pipeline of "LLM generation + algorithmic deterministic verification," it ensures both scale and reliability without manual annotation. Results indicate that even frontier models like GPT-5 and Gemini 2.5 frequently hallucinate during graph structural reasoning; while SFT provides some relief, the core challenge remains largely unresolved.

Direction-aware 3D Large Multimodal Models

Addressing the pain point where existing 3D point cloud benchmarks ask "left/right/front/back" questions without providing the ego pose—making directional problems inherently ill-posed—this paper introduces PoseRecover to automatically retrieve camera poses from RGB-D video extrinsics for each question. It then uses PoseAlign to transform and align the point cloud directly into that pose coordinate system for off-the-shelf 3D LMMs. Through instruction tuning alone, it achieves a relative improvement of 30% in ScanRefer mIoU and an 11.7% increase in LLM-as-judge accuracy for Scan2Cap.
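
The alignment step amounts to a rigid change of coordinates. The sketch below expresses world-frame points in a camera frame, assuming the extrinsics are given as a camera-to-world rotation \(R\) and camera position \(t\), so that \(p_{cam} = R^\top (p_{world} - t)\). This convention, the function name `world_to_camera`, and the example pose are assumptions; datasets differ in how they store extrinsics, and the paper's PoseAlign may differ in detail.

```python
def world_to_camera(points, rotation, translation):
    """Express world-frame points in the camera frame.

    rotation: 3x3 camera-to-world rotation matrix (list of rows).
    translation: camera position in world coordinates.
    Computes p_cam = R^T (p_world - t) for each point.
    """
    aligned = []
    for p in points:
        d = [p[i] - translation[i] for i in range(3)]
        # Multiply by R^T: output entry j uses column j of R.
        aligned.append([sum(rotation[i][j] * d[i] for i in range(3)) for j in range(3)])
    return aligned

# Hypothetical pose: camera at (1, 0, 0), rotated 90 degrees about the z-axis.
R = [[0.0, -1.0, 0.0],
     [1.0,  0.0, 0.0],
     [0.0,  0.0, 1.0]]
t = [1.0, 0.0, 0.0]
cam_points = world_to_camera([[1.0, 2.0, 0.0]], R, t)  # [[2.0, 0.0, 0.0]]
```

Once the point cloud is in the questioner's frame, "left", "right", "front", and "back" map to fixed axis directions, which is what makes the directional question well-posed.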

Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement

The authors propose HRNet, which learns clean shared representations through Cross-scale Feature Disentanglement and Adaptive Projection (CDAP) and non-iteratively predicts joint rigid and non-rigid transformations in a unified coarse-to-fine pipeline, achieving SOTA performance on four multimodal datasets.

Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

MeasureBench constructs a "reading" benchmark using 2,442 real and synthetic images of measuring instruments. It reveals that even the most powerful frontier VLMs achieve an overall accuracy of only around 30%. While they can identify units and instrument types (>90%), they fail to accurately read the values corresponding to pointers or scales, exposing a fundamental bottleneck in fine-grained spatial localization for VLMs.

Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions

Addressing the phenomenon where "VLMs answer correctly on classic visual illusions but provide the same answer even after the inducing factors are reversed," this paper introduces VI-Probe, a controllable illusion probe framework. By applying graded perturbations and matched controls to images, and polarity flipping and instruction variants to questions, and then using metrics like PFC, TFI, and the Hallucination Multiplier \(R\) to decouple "true perception" from "memorizing templates," the study finds that "answer rigidity" in different model families stems from heterogeneous mechanisms—such as memory override, perception-memory competition, and visual processing bottlenecks—rather than a single "language prior" as previously assumed.

DPL: Decoupled Prototype Learning for Enhancing Robustness of Vision-Language Transformers to Missing Modalities

To address the performance drop of vision-language models when a modality is missing, this paper proposes DPL: replacing the fixed fully connected classification head with a decoupled prototype prediction head that selects prototypes based on missing patterns and splits them by modality. Combined with a missing-aware ArcFace loss and prototype relationship contrastive loss, it can be integrated into any prompt-based method as a plug-and-play component, consistently outperforming SOTA across multiple missing scenarios on three datasets.

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

DRS-GUI inserts a training-free "search-then-predict" phase before MLLM coordinate prediction: it uses a UI Perceptor to parse screenshots into UI elements with semantic relevance, then employs MCTS to schedule three human-like perceptual actions (Focus/Shift/Scatter) guided by region quality rewards to iteratively search for the most relevant compact region. This approach improves the grounding accuracy of Qwen2.5-VL-7B and UGround-V1-7B by approximately 14% on the high-resolution dense interface benchmark ScreenSpot-Pro.

DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

DSCA enables knowledge editing by decomposing the representation space of VLMs into a set of orthogonal semantic subspaces and performing gated residual interventions within each subspace. This approach maintains an editing success rate of \(>95\%\) with near-zero forgetting even after 1000 sequential edits.

DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

DuetSVG transforms SVG generation from "pure text generation" to "joint autoregressive generation of image tokens and SVG tokens." This allows the image tokens generated first by the model to serve as internal visual guidance during SVG decoding. Combined with an image-guided test-time resampling strategy, it outperforms existing VLM methods in both text-to-SVG and image-to-SVG tasks.

DuoGen: Towards Autonomous Interleaved Multimodal Generation

DuoGen combines a pre-trained MLLM with a video-pre-trained DiT. Using a special <BOV> token, the MLLM autonomously decides when to generate images, while all preceding images in the sequence serve as conditioning frames for the DiT to continue generation. Combined with a two-stage decoupled training strategy and a high-quality dataset of 298k interleaved instructions synthesized from cleaned web data, it outperforms open-source unified models across interleaved generation, text-to-image (T2I), and image editing tasks.

Dynamic Logits Adjustment and Exploration for Test-Time Adaptation in Vision Language Models

Addressing the issue where Test-Time Adaptation (TTA) for VLMs tends to select only high-confidence samples, leading to "inherited model category bias + insufficient exploration," this paper proposes DLAE. It utilizes Dynamic Logits Adjustment (DLA) to de-bias logits by multiplying them with a balancing factor based on online prediction statistics. Furthermore, it introduces Consistency-Guided Exploration Caching (CGEC) to specifically incorporate decision-boundary samples—those whose predictions "flip" after calibration—into the cache under dual semantic and temporal consistency constraints. This allows for stable exploration of low-confidence regions, consistently outperforming SOTA methods like DPE on both Cross-Domain and OOD benchmarks.
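
One way to picture the logit de-biasing step is sketched below: each class logit is multiplied by a factor that shrinks as that class is predicted more often in the online stream. The exact form of the balancing factor is not given in this summary, so the smoothed inverse-frequency rule, the `strength` parameter, and the function name are all assumptions; the sketch also assumes positive logits, as with scaled cosine similarities.

```python
def adjust_logits(logits, pred_counts, strength=1.0):
    """Rescale logits by an inverse-frequency balancing factor.

    pred_counts[i] is how often class i has been predicted so far.
    Assumes positive logits; the factor equals 1 when predictions are uniform.
    """
    total = sum(pred_counts)
    n = len(logits)
    adjusted = []
    for z, c in zip(logits, pred_counts):
        freq = (c + 1.0) / (total + n)  # Laplace-smoothed prediction frequency
        factor = (1.0 / (n * freq)) ** strength  # below 1 for over-predicted classes
        adjusted.append(z * factor)
    return adjusted

# Hypothetical stream in which class 0 has been heavily over-predicted.
logits = [20.0, 19.5, 10.0]
counts = [80, 10, 10]
adjusted = adjust_logits(logits, counts)
```

In this toy case the prediction flips from class 0 to class 1 after adjustment, which is the kind of decision-boundary sample the exploration cache is described as targeting.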

DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs

DynamicGTR improves VLM performance in zero-shot graph algorithm QA by dynamically routing each query to the optimal graph topology representation (GTR), chosen from 8 visual/textual representations, at inference time. This approach also transfers effectively to real-world scenarios such as link prediction and node classification.

Dynamics-Aware Preference Optimization for Vision-Language Models

This paper diagnoses the root cause of instability in VLM preference fine-tuning from the perspective of "learning dynamics"—the "squeezing effect" (where easy negatives produce near-zero loss but still exert large, misdirected gradients). It proposes the two-stage CW-DPO: first, a constrained smooth SFT "flattens" the distribution, followed by a "cooling weight" that adaptively scales negative sample gradients based on model confidence to suppress uninformative updates. It achieves SOTA across COCO/Flickr30k/NoCaps/MMMU/MMBench (COCO CIDEr 142.6, +3.4 over PPO; MMMU +2.4% absolute accuracy) while improving calibration and halving convergence steps.

EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval

EagleNet constructs a text-frame relationship graph and employs a Relational Graph Attention Network to learn fine-grained interaction between text-frame and frame-frame units. It generates enhanced text embeddings that integrate video contextual information and introduces an energy-aware matching mechanism to capture the distribution of real text-video pairs, achieving SOTA performance across four benchmark datasets.

Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs

This paper proposes the AGDI framework for black-box MLLM copyright tracking by generating trigger images via adversarial optimization. A dual injection mechanism simultaneously injects copyright information at the response level (driven by CE loss to ensure the auxiliary model outputs the target answer) and the semantic level (minimizing the CLIP cosine distance between the trigger image and target text). Furthermore, model adversarial training is introduced to simulate downstream fine-tuning resistance, achieving results that comprehensively outperform PLA and RNA baselines on Qwen2-VL and LLaVA-1.5.

Ego: Embedding-Guided Personalization of Vision-Language Models

Ego identifies a small set of visual tokens that best represent a personalized concept (e.g., "my cup," "my dog") directly from the LVLM's internal cross-modal attention. These are stored as "concept memory" and injected as soft prompts into the context during inference. This approach is completely training-free, independent of external vision modules, and achieves SOTA performance across single-concept, multi-concept, and video personalization scenarios.

EgoAVU: Egocentric Audio-Visual Understanding

Addressing the issue where existing MLLMs "see but don't listen" and mismatch audio with incorrect visual sources in egocentric videos, this paper proposes EgoAVU, a fully automated data engine. It uses modular open-source models to generate modality-specific narrations and an explicit Multimodal Context Graph (MCG) to model audio-source relationships. The engine produces 3 million training samples (EgoAVU-Instruct) and 3,000 human-verified evaluation samples (EgoAVU-Bench). After fine-tuning, the model achieves up to a 113% relative improvement on its own benchmark and successfully generalizes to other egocentric benchmarks.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

EgoSound is the first benchmark to systematically evaluate the "egocentric sound understanding" capabilities of Multimodal Large Language Models (MLLMs). By merging Ego4D and EgoBlind data sources and defining a 7-task taxonomy covering intrinsic sound perception and cross-modal reasoning, it utilizes a three-stage automated pipeline—"interaction annotation → audio-centric captioning → visually-verified OpenQA"—to produce 7,315 open-ended QAs across 900 video segments. Experiments on 9 SOTA omni models show a maximum accuracy of only 56.7% (compared to 83.9% for humans), exposing significant weaknesses in fine-grained spatial and causal sound reasoning.

EMMA: Extracting Multiple physical parameters from Multimodal Data

EMMA aligns video, audio, and chart modalities into a Liquid Time-Constant (LTC) network, combined with differentiable physics simulation and physics-constrained losses. It performs unsupervised one-shot identification of all identifiable parameters in a dynamical system—including unobservable forced inputs in video, implicit dynamics terms not measurable by any modality, and calibration invariants such as coordinate origins and initial conditions. It significantly outperforms baselines using only video or equation discovery on 75 Delfys videos and real rover/drone platforms.

ENC-Bench: A Benchmark for Evaluating MLLMs in Electronic Navigational Chart Understanding

This paper proposes ENC-Bench, the first professional-grade benchmark for Electronic Navigational Chart (ENC) understanding. It contains 20,490 samples and a three-level hierarchical evaluation system (Perception \(\rightarrow\) Spatial Reasoning \(\rightarrow\) Maritime Decision-making). A systematic evaluation of 10 MLLMs reveals that the best model achieves only \(47.88\%\) accuracy, uncovering a significant capability gap of general-purpose models in safety-critical professional domains.

Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis

EBMC introduces a "first enhance, then balance" two-stage framework for multimodal sentiment analysis: it first enriches suppressed audio/visual weak modalities through semantic decoupling and cross-modal complementarity, then employs an energy-based model to equalize the optimization dynamics of each modality and performs instance-level weighted fusion based on credibility. It achieves SOTA on MOSI/MOSEI/IEMOCAP and shows significantly smaller performance degradation than baselines in missing modality scenarios.

Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting

Addressing the issue where current PEFT methods for VLM continual learning only weight prefixes/adapters at the "sample level" (treating all tokens within a sample equally), DPW utilizes a gating module (RePA + CondAct) to calculate fine-grained prefix weights for each token. It allows adapters to supplement task knowledge in a "residual" manner only when prefix weights are insufficient. DPW achieves SOTA results on MTIL and ODCL-CIL domain-class incremental benchmarks.

Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception

This paper proposes Cap-Workflow, which utilizes a suite of off-the-shelf visual expert models (detection, depth, emotion, OCR, fine-grained recognition, HOI) to extract fine-grained attributes and object relationships "unseen by general LMMs" from images. These attributes are integrated into accurate and detailed image descriptions using a two-stage LLM approach. This process re-labels 1.1M images into superior LMM pre-training corpora, enhancing the perception and reasoning capabilities of LLaVA-v1.5/NeXT across 14 benchmarks.

Enhancing Part-Level Point Grounding for Any Open-Source MLLMs

Without fine-tuning any MLLM parameters, this method synthesizes a "grounding-aware query" in the intermediate layers to reshape text-to-image attention and uses a lightweight decoder to upsample that attention into a point heatmap, significantly improving the part-level point grounding accuracy of open-source MLLMs. It can be plugged into any model with an attention mechanism.

Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval

Evo-Retriever couples the "model" and "training curriculum" into a synergistic evolutionary pair—stabilizing representations through multi-viewpoint alignment and bidirectional contrastive learning, while an external LLM meta-controller dynamically adjusts the difficulty of hard negatives based on real-time training states. It achieves new SOTA results on ViDoRe V2 and MMEB(VisDoc) with nDCG@5 scores of 65.2% and 77.1%, respectively.
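
For reference, the nDCG@k metric quoted above can be computed as in the sketch below. It uses the linear-gain form of DCG and takes the ideal ranking from the same candidate list; whether the benchmarks use this exact variant is an assumption, and the relevance lists are toy data.

```python
import math

def dcg_at_k(relevances, k):
    # Linear-gain DCG: rel_i / log2(i + 1) for 1-based rank i.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """relevances: relevance labels of all candidates, in ranked order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([1, 1, 0, 0, 0], k=5)  # relevant documents ranked first
swapped = ndcg_at_k([0, 1, 1, 0, 0], k=5)  # a non-relevant document ranked first
```

The score is 1.0 only when every relevant document precedes every non-relevant one, and it falls as relevant documents slip down the ranking.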

EvoGraph-R1: Self-Evolving Multimodal Knowledge Hypergraphs for Agentic Retrieval

EvoGraph-R1 redefines the knowledge hypergraph in multimodal GraphRAG from a static data structure—"built offline, queried once"—into an MDP environment that co-evolves with the reasoning process. Agents continuously perform four actions: "Query Graph / Web Search / Edit Graph / Answer" to insert, refine, and prune the hypergraph. Optimized end-to-end with GRPO, the system achieves SOTA performance on both multimodal VQA and text-only QA benchmarks.

Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration

This paper introduces the LMEE benchmark and the MemoryExplorer framework, which unify the evaluation of embodied exploration processes and outcomes by combining multi-target navigation with memory-based question answering. By fine-tuning MLLMs with reinforcement learning to actively invoke memory retrieval tools, the method achieves a 23.53% SR on LMEE-Bench (surpassing 3D-Mem's 16.91%) and a 46.40% SR on GOAT-Bench.

Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization

FSENet utilizes facial features as sentiment cues to guide audio-visual interaction. Under a weakly-supervised setting with only "point-level" timestamp annotations, it employs contrastive learning to align sentiment semantics and expands sparse point annotations into pseudo-labels with smooth boundaries. This pushes the average mAP of temporal sentiment localization on TSL300 to 21.45%, outperforming the previous SOTA by approximately 5%.

Factorize, Reconstruct, Enhance: A Unified Framework for Multimodal Sentiment Analysis

FUSE-Net explicitly factorizes each modality into "shared / private / noise" subspaces, employs variational reconstruction based on the Information Bottleneck to preserve sentiment semantics, and utilizes a multi-perspective sample-adaptive dynamic fusion for weighted aggregation and gated noise suppression. It achieves state-of-the-art performance in regression and ordered classification metrics across MOSI, MOSEI, and SIMSv2 benchmarks.

FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Models

This paper proposes FairLLaVA, a parameter-efficient fairness fine-tuning method that eliminates demographic shortcuts in Multimodal Large Language Models (MLLMs) by minimizing the mutual information between hidden states and demographic attributes. It significantly narrows performance gaps across groups in chest X-ray report generation and skin lesion question answering.

FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs

FAVE is a three-layer benchmark specifically designed to evaluate whether Audio-Visual Large Language Models (AVLLMs) can align audio and video streams within the same time window and perform fine-grained temporal reasoning. Using a scalable pipeline involving shot segmentation, dual-modal captioning, GPT synthesis, and human verification, it constructs nearly 10,000 timestamped QA pairs based on QVHighlights. Evaluations of 13 SoTA models show that even the strongest model, Gemini 1.5, performs significantly below human levels, while open-source models suffer near-total failure, indicating that joint cross-modal temporal understanding remains an open problem.

FedMPT: Federated Multi-Label Prompt Tuning of Vision-Language Models

FedMPT models Federated Multi-Label Recognition (MLR) as a causal front-door adjustment problem. It uses an LLM to generate a set of universal "conditions" (e.g., spatial layout, object poses) as mediating variables to constrain label co-occurrence. Through a three-step pipeline—conditional prompts, optimal transport, and gated aggregation—these conditions are aligned to image regions and adaptively weighted. This significantly suppresses spurious correlation overfitting, such as "falsely reporting a chair whenever a cat is seen," especially when client data is heterogeneous.

Finding Distributed Object-Centric Properties in Self-Supervised Transformers

The paper provides a systematic analysis of "where object information is hidden" within self-supervised ViTs like DINO. It finds that such information is distributed across all layers and encoded simultaneously in Query, Key, and Value patch similarities (rather than only in the last layer's [CLS] or key features). Based on this, the authors propose a training-free method, Object-DINO, which identifies "object heads" via cross-layer clustering. This method improves unsupervised object discovery (CorLoc) by +3.6 to +12.4 and provides visual evidence to mitigate object hallucinations in MLLMs.

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Using IKEA furniture assembly videos as a sandbox, the authors construct Flat-Pack Bench, a video QA benchmark (602 multiple-choice questions, 4 task categories) that specifically tests the fine-grained spatio-temporal understanding of Large Vision-Language Models (LVLMs). Even the strongest models, such as GPT-5, achieve only ~38% accuracy, far below the human performance of 94.18%. The study identifies "tracking, contact judgment, and region grounding" as the primary bottlenecks.

FlowComposer: Composable Flows for Compositional Zero-Shot Learning

FlowComposer introduces Flow Matching to Compositional Zero-Shot Learning (CZSL) for the first time. It learns two primitive flows (attribute flow and object flow) to transport visual features into the corresponding text embedding space. It explicitly composes velocity fields through a learnable Composer and utilizes a leakage-guided augmentation strategy to transform imperfect feature decoupling into auxiliary supervision signals. As a plug-and-play module, it consistently improves CZSL performance across three benchmarks.

FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy

FluoCLIP is a two-stage vision-language framework: it first enables CLIP to learn the semantics of fluorescence stains through stain-grounding, and then achieves stain-aware focus quality assessment (FQA) via stain-guided ranking. It also introduces FluoMix, the first multi-stain tissue-level fluorescence microscopy dataset.

PinPoint: Focus, Don't Prune — Identifying Instruction-Relevant Regions for Information-Rich Image Understanding

PinPoint is a two-stage framework: it first locates instruction-relevant image regions through Instruction-Region Alignment, then refines the encoding of the selected regions, achieving higher VQA accuracy with fewer visual tokens.

Foundation Encoders Are All You Need for Preference-Aware Personalization

FAN does not add any additional structures or fine-tuning to text-to-image models. Instead, it "repurposes" the self-attention within pre-trained text encoders into personalized attention and pairs it with a target-query-oriented profiling strategy. This enables personalized synthesis across various base models like SD V1/XL/V3 and FLUX that aligns with user preferences without sacrificing target semantics.

From 3D Pose to Prose: Biomechanics-Grounded Vision-Language Coaching

BioCoach transforms 3D skeletal kinematics and body measurements from streaming fitness videos into explicit, readable intermediate representations that are fed into a frozen vision/language backbone. Through a three-stage pipeline—"Selecting joints → Computing cycles and constraints → Vision-biomechanical conditioned generation"—it produces precise corrective feedback with joint angles, range of motion (ROM), and phase alignment. On the newly constructed QEVD-bio-fit-coach dataset, it achieves a 262.8% improvement in METEOR compared to Stream-VLM.

From Attraction to Equilibrium: Physics-Inspired Semantic Gravitons for Zero-Shot Anomaly Detection

SGNet remodels CLIP's vision-text cross-modal alignment as a physical process of "energy potential field reaching equilibrium." It introduces a set of learnable "semantic gravitons" as dynamic intermediaries between vision and text, pulling the two modalities to stable localized semantic equilibrium points through attraction and equilibrium forces, achieving SOTA in zero-shot anomaly detection across 10 industrial/medical benchmarks.

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

Addressing the pain point in GRPO fine-tuning where "hard samples result in zero rewards with no learning signal," this paper proposes the group-revision paradigm. It first generates an initial response, then directs the model to produce a group of "revised" responses. By calculating the relative improvement (shaping signal) via Hungarian matching, it weights rewards and scales advantages, consistently outperforming existing GRPO methods in segmentation, REC, and counting tasks.

From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings

This paper proposes LAPS (Latent Action-based Primitive Segmentation), a pipeline that uses a "Latent Action Energy" metric defined in a latent action space to discover and segment semantic action primitives from unlabeled industrial video streams without supervision, providing structured data for VLA model pre-training.

From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition

This paper proposes SITH (Semantic Inspection of Transformer Heads), a completely data-free and training-free CLIP interpretability framework. By performing SVD on the Value-Output weight matrices of attention heads and utilizing the self-developed COMP algorithm to map singular vectors to sparse combinations of semantically coherent concepts, SITH achieves significantly finer intra-head interpretability compared to existing methods and supports precise weight editing to enhance downstream performance.

From Where Things Are to What They Are For: Benchmarking Spatial–Functional Intelligence in Multimodal LLMs

This paper proposes SFI-Bench, a video benchmark built on 134 egocentric indoor videos and 1,555 expert-annotated four-choice questions. It shifts the evaluation of Multimodal Large Language Models (MLLMs) from "where objects are" (geometric perception) to "what objects are for" (functional cognition). Covering six task categories across spatial cognition and functional reasoning, it reveals that the integration of "spatial memory + functional reasoning + external knowledge" remains a significant bottleneck for current MLLMs.

G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval

This paper proposes G-MIXER, a training-free method that achieves state-of-the-art (SOTA) performance in zero-shot composed image retrieval. It combines Geodesic Mixup for implicit semantic expansion (expanding the retrieval range along a hypersphere with varying mixup ratios) with Explicit Semantic Re-ranking (filtering noisy candidates using MLLM-generated attributes).
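
As a rough illustration of what "mixing along a hypersphere" can look like, the sketch below uses spherical linear interpolation (slerp) between two unit vectors. It assumes the mixup is taken between an image embedding and a text embedding; the function names, the 3-dimensional toy vectors, and the choice of ratios are hypothetical and are not taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def geodesic_mixup(a, b, lam):
    """Interpolate along the great-circle arc from a (lam=0) to b (lam=1).

    Assumes a and b are not antipodal, where the arc is not unique.
    """
    a, b = normalize(a), normalize(b)
    cos_theta = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(cos_theta)
    if theta < 1e-8:  # vectors (almost) coincide; nothing to interpolate
        return a
    wa = math.sin((1.0 - lam) * theta) / math.sin(theta)
    wb = math.sin(lam * theta) / math.sin(theta)
    return [wa * x + wb * y for x, y in zip(a, b)]

# Hypothetical stand-ins for an image embedding and a text embedding.
image_emb = [1.0, 0.0, 0.0]
text_emb = [0.0, 1.0, 0.0]

# Several mixup ratios give several query points spread along the arc.
queries = [geodesic_mixup(image_emb, text_emb, lam) for lam in (0.25, 0.5, 0.75)]
```

Unlike a plain linear mix, every interpolated query stays on the unit sphere, so cosine-similarity retrieval treats all ratios on an equal footing.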

GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting

This work uses a set of anisotropic 2D Gaussians (position + covariance + color) as a compact surrogate representation for images fed into vision-language models. By "reusing a frozen RGB ViT backbone + a lightweight splat input head + two-stage transfer training," it achieves 3–23.5× compression of visual inputs and up to 31× faster loading on the 12.8M DataComp dataset. It maintains 90–98% of the zero-shot accuracy of RGB baselines across 38 datasets and even outperforms RGB on 6 VQA benchmarks when integrated into LLaVA.

GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics

GeoAgent approaches image geolocation as a human-like task of "hierarchical reasoning to a precise address." It first performs a cold start on a VLLM using GeoSeek, a Chain-of-Thought (CoT) dataset annotated by geographic experts and professional players. It then undergoes Group Relative Policy Optimization (GRPO) fine-tuning using two rewards tailored for geographic tasks: a Geographic Similarity Reward (measuring correctness) and a Consistency Reward (measuring reasoning validity). GeoAgent outperforms existing methods and general-purpose VLLMs across multiple granularities.

Geometrically-Constrained Agent for Spatial Reasoning

Addressing the "strong semantics, weak geometry" gap in VLMs for spatial reasoning, this paper proposes GCA, a training-free agent. It first utilizes the VLM as a "semantic analyst" to translate ambiguous queries into formal task constraints (reference frames + objectives), then as a "task solver" to invoke geometric tools within the deterministic boundaries of these constraints. GCA outperforms previous SOTA by approximately 27% on multiple spatial reasoning benchmarks.

Global-Graph Guided and Local-Graph Weighted Contrastive Learning for Unified Clustering on Incomplete and Noise Multi-View Data

GLGC addresses incomplete and noisy multi-view data without relying on data imputation. It utilizes a global affinity graph to generate new positive/negative pairs for incomplete views (addressing "rare-paired" issues) and a local affinity graph to assign adaptive weights to cross-view pairs (addressing "mis-paired" issues). Integrated into a unified contrastive learning framework, GLGC significantly outperforms SOTA methods.

Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM

Granulon enhances pixel-level visual encoders (represented by DINOv3), which excel at details but lack coarse-grained semantic abstraction, with a "text-conditioned granularity controller + adaptive token aggregation" module. This allows a single encoder to dynamically perform "pixel → fine → coarse" multi-granularity reasoning based on the question's semantics in a single forward pass. Under identical settings, it achieves approximately a 30% increase in inference accuracy and a 20% reduction in hallucination rates.

GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning

This paper proposes the GraphVLM benchmark to systematically evaluate three roles of VLMs in multimodal graph learning: VLM-as-Encoder (enhancing GNN features), VLM-as-Aligner (bridging modalities for LLM reasoning), and VLM-as-Predictor (acting directly as a graph learning backbone). Experiments across six datasets demonstrate that VLM-as-Predictor consistently achieves the best performance, revealing the significant potential of VLMs as a new foundation for multimodal graph learning.

Gravitation-Driven Semantic Alignment for Text Video Retrieval

GraviAlign draws an analogy between cross-modal semantic alignment and universal gravitation. It decomposes the alignment score between Gaussian embeddings of text/video into two orthogonal, closed-form factors: "Semantic Gravitation (Attraction)" and "Geometric Overlap." Each factor possesses independent veto power. The method consistently outperforms the CLIP-ViP baseline by 1.6% to 2.6% R@1 across three text-video retrieval benchmarks.

Grounded 3D-Aware Spatial Vision-Language Modeling

GR3D unifies three grounding capabilities—explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding—into a single spatial VLM. The model generates a spatial Chain-of-Thought (CoT) while simultaneously grounding mentioned objects as region tokens inserted back into the text stream. These grounded regions then serve as queries to predict 3D boxes in the camera view, achieving significant performance gains on Omni3D detection and multiple spatial reasoning benchmarks.

Grounding Everything in Tokens for Multimodal Large Language Models

GETok augments the MLLM vocabulary with a set of "grid tokens + offset tokens," discretizing the image plane into a 2D anchor grid and using small-step offset iterations for error correction. Without altering the autoregressive architecture, the model represents various grounding forms (points, boxes, masks, polylines) as unified token sequences, achieving SOTA performance in both SFT and RL paradigms.

GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

Addressing the issue where existing visual grounding benchmarks are saturated (90%+) by MLLMs despite failing to measure real capabilities, the authors construct GroundingME, a hard benchmark with 1005 samples covering four dimensions: "Fine-grained Discriminative / Complex Spatial / Limited Visibility / Rejection". The study finds that the strongest model achieves only 45.1% accuracy and that most models score 0% on rejection tasks. It proposes two improvement paths: test-time scaling (+4.5%) and negative sample mixture training (rejection 0% → 27.9%).

GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

This paper proposes GTR-Turbo, which generates a "free teacher model" by merging historical checkpoints via TIES during the RL training process to guide subsequent training (via either SFT or KL distillation). It matches or even exceeds the GTR method, which relies on external teachers like GPT-4o, across multiple vision agent tasks while reducing training time by 50% and computational costs by 60%.
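
For readers unfamiliar with TIES merging, the following is a minimal sketch of its three steps (trim, elect sign, disjoint mean) on flat parameter lists. It assumes the checkpoints are compared against a common base such as the initial weights; which base GTR-Turbo uses, and the `keep_ratio` and `scale` values, are assumptions made for illustration.

```python
def ties_merge(base, checkpoints, keep_ratio=0.5, scale=1.0):
    """Merge checkpoints (flat lists of floats) into one parameter vector."""
    n = len(base)
    # Task vectors: each checkpoint's difference from the base weights.
    deltas = [[c[i] - base[i] for i in range(n)] for c in checkpoints]

    # 1) Trim: per checkpoint, keep only the largest-magnitude entries.
    k = max(1, int(round(keep_ratio * n)))
    trimmed = []
    for d in deltas:
        keep = set(sorted(range(n), key=lambda i: abs(d[i]), reverse=True)[:k])
        trimmed.append([d[i] if i in keep else 0.0 for i in range(n)])

    merged = []
    for i in range(n):
        column = [t[i] for t in trimmed]
        # 2) Elect sign: the direction with the larger total mass wins.
        sign = 1.0 if sum(column) >= 0 else -1.0
        # 3) Disjoint mean: average only entries that agree with that sign.
        agree = [v for v in column if v * sign > 0]
        delta = sum(agree) / len(agree) if agree else 0.0
        merged.append(base[i] + scale * delta)
    return merged

# Toy example with two hypothetical checkpoints of four parameters each.
base = [0.0, 0.0, 0.0, 0.0]
ckpt_a = [1.0, 0.1, -2.0, 0.0]
ckpt_b = [3.0, -0.2, 1.0, 0.0]
merged = ties_merge(base, [ckpt_a, ckpt_b])
```

In the toy example the third parameter shows the point of sign election: the two checkpoints disagree, and only the update on the winning side is kept rather than averaging the two toward zero.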

GUI-SAGE: Enhancing GUI Automation with Self-Explanatory Learning

To address the "zero-advantage trap" in GUI Reinforcement Learning—where all rollouts fail and advantages become zero when tasks are too difficult—GUI-SAGE prompts the model to explain "why this action is correct" given ground-truth actions. This generates in-distribution positive samples. An Entropy-Modulated Credit Assignment (EMCA) mechanism then amplifies or suppresses gradients based on prediction confidence, enabling a 3B model to achieve an 81.1% average success rate on AndroidControl / GUI-Odyssey, surpassing larger 7B baselines.
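
The "zero-advantage trap" is easy to reproduce with GRPO-style group normalization, where each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. The sketch below shows only that normalization; it does not implement GUI-SAGE's explanation step or EMCA, and treating one explanation-derived sample as a reward of 1.0 in the same group is an assumption made for illustration.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Task too hard: every rollout fails, so no rollout stands out from the group.
all_failed = group_advantages([0.0, 0.0, 0.0, 0.0])

# Same group with one positive sample mixed in (hypothetical stand-in for a
# self-explained ground-truth action): the learning signal reappears.
one_success = group_advantages([0.0, 0.0, 0.0, 1.0])
```

With identical rewards every advantage is exactly zero, so the policy gradient vanishes for that task no matter how many rollouts are sampled.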

GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks

This paper introduces the GUIDE benchmark, containing 67.5 hours of screen recordings and think-aloud annotations from 120 novice users across 10 software applications. It defines three hierarchical tasks: behavioral state detection, intent prediction, and assistance prediction. Evaluations reveal that current state-of-the-art multimodal models show limited performance in understanding user behavior and judging assistance needs (behavior detection accuracy at only 44.6%), but providing structured user context significantly improves performance (up to a 50.2pp gain in assistance prediction).

Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation

To enable CLIP vision encoders to simultaneously achieve "category discriminability" and "fine-grained perception," this paper proposes DCR. Instead of having the diffusion model reconstruct the original image (which only adds detail but harms discrimination), it injects contrastive signals into the predicted noise of the diffusion model to form a unified loss. This single objective optimizes both capabilities simultaneously, bypassing gradient conflicts inherent in naively combining two losses, and achieves consistent gains across 6 CLIP backbones and downstream MLLMs.

HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

HAMMER performs interaction-image-based 3D affordance grounding in three steps: it extracts contact-aware intention embeddings from MLLMs, enhances point cloud features via hierarchical cross-modal fusion, and injects 3D spatial information into the intention embeddings through a multi-granularity geometry lifting module. It significantly outperforms existing methods on the PIAD benchmark.

HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics

HanDyVQA is a fine-grained video QA benchmark focused on the "Hand-Object Interaction (HOI) dynamic process." It covers the entire "manipulation → effect" chain through six question categories (Action/Process/Objects/Location/State Change/Parts). The dataset contains 11,100 five-way multiple-choice questions and 10,300 segmentation masks. Experimental results show that the strongest model, Gemini-2.5-Pro, achieves only 73% accuracy, significantly lower than the human baseline of 97%.

Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs

To address the dual problem where safety-aligned MLLMs both forget old tasks and lose safety during Continual Visual Instruction Tuning (CVIT), this paper proposes HPA: a training-free parameter-level editing approach applied after each fine-tuning step. It categorizes parameters into "safety-focused" and "task-focused" based on Hessian importance, utilizes an adaptive balance score to select safety parameters for retention, and applies orthogonal projection to task parameter updates to resist forgetting, thereby preserving both safety and capability without altering the original training workflow.

HAVE-Bench: Hierarchical Audio-Visual Evaluation from Perception to Interaction

HAVE-Bench constructs a 2451-item audio-visual evaluation benchmark using a "Perception-Reasoning-Interaction" three-level cognitive hierarchy paired with "Audio-as-Instruction (AaI)/Audio-as-Context (AaC)" dual roles. It is the first to model multi-turn, memory-dependent interaction tasks as task graphs to evaluate Omni-MLLMs. Results indicate a performance cliff for both open-source and closed-source models at the reasoning and interaction levels, and demonstrate that speech-based visual querying performs significantly worse than text-based querying.

HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation

HBridge replaces the two symmetric, layer-wise shared attention MoT experts in "unified understanding + generation" models with a pair of heterogeneous experts (a frozen large VLM + a pre-trained diffusion DiT). By bridging attention only across mid-layers and introducing a set of semantic reconstruction tokens, it outperforms BAGEL on DPG-Bench / GenEval / ImgEdit using only approximately 1/12 of BAGEL's T2I training tokens.

HDR-VLM: HDR-Domain Adaptation of VLMs and Preference-Aligned Quality Assessment for HDR Video Color Grading

HDR-VLM is the first method to adapt pre-trained VLMs—which have only seen SDR—to the HDR domain for evaluating HDR video color grading quality. The first stage utilizes HLG unified encoding + progressive unfreezing to supplement HDR perception, while the second stage employs GRPO with curriculum rewards to align model scoring with noisy human subjective preferences. It achieves PLCC 0.9033 / SROCC 0.8667 on a realistic production HDR dataset and provides interpretable reasoning for score deductions.
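
PLCC and SROCC are the Pearson linear and Spearman rank-order correlation coefficients between predicted and human scores. A small stdlib-only sketch of both follows; the score lists are made-up numbers, not data from the paper.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank-order correlation (SROCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical model scores vs. human opinion scores for five clips.
predicted = [1.0, 2.0, 3.0, 4.0, 10.0]
human = [1.0, 2.0, 3.0, 4.0, 5.0]
plcc = pearson(predicted, human)
srocc = spearman(predicted, human)
```

Here the ordering is perfect, so SROCC is 1, while the outlier prediction pulls PLCC below 1. This is why quality-assessment work usually reports both.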

HiconAgent: History Context-aware Policy Optimization for GUI Agents

HiconAgent utilizes a History Context-aware Policy Optimization (HCPO) reinforcement fine-tuning framework to train GUI navigation agents. During the sampling phase, it dynamically varies history lengths to teach the model to use history "on demand." In the update phase, history screenshots are discarded while history action tokens are retained as anchors, with an all-history branch used for alignment distillation. The 3B model outperforms GUI-R1-7B on GUI-Odyssey with a +11.32% improvement in step success rate, while reducing FLOPs by 60% and increasing inference speed by 2.47×.

Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning

This paper proposes HAM³, which decomposes adversarial attacks on "Multi-Modal Multi-Agent Systems (MM-MAS)" into three interconnected levels: perception, communication, and reasoning. It systematically characterizes how perturbations cascade from single-point inputs to collective decisions. Experiments on the GQA dataset across the ReAct, Plan-and-Solve, and Reflexion paradigms show a maximum Attack Success Rate (ASR) of 78.3% and indicate that reasoning-layer attacks are the most potent, the stealthiest, and the hardest to rectify.

HiFICL: High-Fidelity In-Context Learning for Multimodal Tasks

Through rigorous derivation of the attention formula, HiFICL reformulates the ICL approximation problem from "fitting a shift vector" to "directly parameterizing the source of ICL." By injecting learnable low-rank virtual key-value pairs into attention heads and performing end-to-end training, it achieves a dynamic, context-aware parameter-efficient fine-tuning method that outperforms existing ICL approximation methods and LoRA across multiple multimodal benchmarks with minimal parameters.

HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

HiSpatial proposes decomposing 3D spatial intelligence into four cognitive hierarchies (geometric perception → object attributes → object relations → abstract reasoning). It constructs an automated data pipeline processing ~5 million images, 45 million objects, and 2 billion QA pairs, and designs an RGB-D VLM using metric-scale point cloud maps as auxiliary input. With only 3B parameters, it surpasses GPT-5 and Gemini-2.5-Pro on multiple spatial reasoning benchmarks.

HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models

This paper proposes HOG-Layout, a hierarchical 3D indoor scene generation, optimization, and editing framework based on VLMs and LLMs. By enhancing semantic consistency with RAG and ensuring physical plausibility through force-guided hierarchical optimization, it outperforms LayoutVLM on SceneEval while running 4.5× faster.

HouseMind: Tokenization Allows MLLMs to Understand, Generate and Edit Architectural Floor Plans

The paper proposes HouseMind, which discretizes architectural floor plan outlines and room instances into spatial tokens using a hierarchical VQ-VAE. These are unified with text tokens in a single vocabulary, enabling a small-scale LLM (0.6B) to achieve three major tasks—understanding, generation, and editing—within a single autoregressive framework. It significantly outperforms methods based on diffusion models and large-scale VLMs.

Hugging Visual Prompt and Segmentation Tokens: Consistency Learning for Fine-Grained Visual Understanding in MLLMs

The authors propose FCLM, identifying that the visual prompt embedding <VP> in region captioning and the segmentation token [SEG] in grounding actually point to the same region but represent opposite input/output directions. By utilizing a self-reconstruction loss and a latent space cosine consistency loss to align the two, combined with a progressive hybrid region extractor and a two-stage training strategy, a single MLLM achieves SOTA performance across seven fine-grained visual tasks.

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

This paper proposes HumanVBench, a human-centric video understanding benchmark comprising 16 fine-grained tasks, supported by two automated pipelines (video annotation + distractor-aware QA synthesis). Evaluation of 30 mainstream video MLLMs reveals critical deficiencies in subtle emotion perception and audio-visual alignment.

Hyperbolic Gramian Volumes for Multimodal Alignment

To address the "volume collapse" issue (where \(\det \approx 1\) and variance is near 0) of Euclidean Gramian volumes under L2 normalization, this paper translates Gramian volume alignment to hyperbolic (Lorentz model) space to preserve variance. By using a learnable scalar \(\alpha\) to perform a convex combination of Euclidean and hyperbolic volumes, the proposed HyperGRAM achieves a zero-shot T2V Recall@1 improvement of +1.8% to +2.9% over Euclidean GRAM across four video-text retrieval benchmarks.

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

The authors propose IAG, the first multi-target backdoor attack method targeting VLM-based visual grounding. By dynamically generating input-aware triggers via a text-conditioned U-Net, it embeds semantic information of any specified target object into visual inputs, achieving the highest attack success rate in 11 out of 12 experimental settings.

IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting

This work constructs IF-Bench, the first high-quality benchmark (499 images / 680 VQA / 10 dimensions) for systematically evaluating the infrared image understanding capabilities of Multimodal Large Language Models (MLLMs). After evaluating 40+ models, the authors propose GenViP, a training-free generative visual prompting method. By using an image editing model to translate infrared images into aligned RGB images and feeding them alongside the original infrared images into MLLMs, GenViP alleviates domain shift and achieves up to a 7% relative improvement without any fine-tuning.

Illuminating Visual Identity in Universal Multimodal Embeddings

Addressing the "visual identity discrimination" capability long overlooked by Universal Multimodal Embeddings (UME), this paper formalizes it into 4 meta-tasks, constructs the MVEB benchmark with 522K samples, and introduces a simple framework of "identity-aware sampling + unified contrastive loss." This allows a 7B model to achieve an average score of 78.8 on identity benchmarks (significantly outperforming existing UMEs) while maintaining universal retrieval performance.

Imbalanced View Contribution Evaluation and Refinement for Deep Incomplete Multi-View Clustering

ICER identifies the overlooked issue that "missing views are not merely incomplete data, but also trigger imbalanced view contributions." It quantifies the marginal contribution of each view using Shapley values and characterizes distribution discrepancies via Unbalanced Optimal Transport (UOT) to construct a view contribution imbalance index \(I_\psi\). Subsequently, View-Adaptive Curriculum Learning (VACL) is employed to dynamically strengthen weak views and suppress dominance by strong views, consistently outperforming existing methods across five incomplete multi-view benchmarks.
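
To make "marginal contribution of each view" concrete, here is a minimal exact Shapley computation over three views. The utility table stands in for some quality score (for example, clustering accuracy) of each view subset; its numbers and the view names are invented, and ICER's actual utility function and its UOT term are not shown.

```python
from itertools import permutations

def shapley_values(views, utility):
    """Exact Shapley values: average marginal gain over all view orderings."""
    totals = {v: 0.0 for v in views}
    orderings = list(permutations(views))
    for order in orderings:
        coalition = frozenset()
        for v in order:
            with_v = coalition | {v}
            totals[v] += utility(with_v) - utility(coalition)
            coalition = with_v
    return {v: totals[v] / len(orderings) for v in views}

# Hypothetical quality score for every subset of three views.
TOY_UTILITY = {
    frozenset(): 0.0,
    frozenset({"a"}): 0.6,
    frozenset({"b"}): 0.5,
    frozenset({"c"}): 0.1,
    frozenset({"a", "b"}): 0.8,
    frozenset({"a", "c"}): 0.65,
    frozenset({"b", "c"}): 0.55,
    frozenset({"a", "b", "c"}): 0.85,
}

values = shapley_values(["a", "b", "c"], TOY_UTILITY.__getitem__)
```

The values sum to the score of the full view set, and the large spread between views "a" and "c" is the kind of imbalance a contribution index is meant to expose. Exact enumeration is factorial in the number of views, so it is only practical for a handful of views.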

Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining

This paper discovers that the essence of "adding regularization to TPT to improve calibration" is pushing the prompt towards the flat minima of the loss surface. Consequently, it proposes FPP—a data-free prompt pretraining framework that positions the initial prompt directly in a flat region. By only replacing the initialization without modifying any TPT procedures, it simultaneously achieves SOTA results in both accuracy and calibration (ECE/SCE).
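
For reference, the ECE metric mentioned above can be computed as the bin-size-weighted average of the absolute gap between accuracy and confidence. The sketch below uses equal-width bins; the bin count and the toy predictions are illustrative assumptions, and SCE is not covered.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins [k/n_bins, (k+1)/n_bins)."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 joins the last bin
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece

# Four predictions at 95% confidence, only half of them right.
overconfident = expected_calibration_error([0.95] * 4, [1, 1, 0, 0])

# Four predictions at 75% confidence, three of them right.
calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
```

The first case yields an ECE of about 0.45 and the second yields 0, even though the second model is not more accurate by much. Calibration measures whether confidence matches accuracy, not accuracy itself.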

Information-Theoretic Decomposition for Multimodal Interaction Learning

This paper points out from an information-theoretic perspective that "multimodal interaction (Redundant R / Unique U / Synergistic S) varies dynamically per sample." It proves that conventional joint learning and modality ensembles are each only proficient in one type of interaction. The authors propose DMIL, which explicitly decomposes representations into R/U/S components using variational decomposition and specifically reinforces them through three-stage fine-tuning, achieving optimal performance across samples with different interaction compositions.

Interactive Episodic Memory with User Feedback

Addressing the challenge of "localizing the moment that answers a query in long egocentric videos" (EM-NLQ), which currently only provides one-shot results without error correction, this paper proposes the interactive EM-QnF task, a synthetic feedback data generation recipe requiring no human annotation, and the plug-and-play feedback alignment module FALM. FALM assigns "alignment scores" to each video segment and re-weights the original model features. This allows existing EM-NLQ models to shift focus to the correct segments based on user feedback without introducing heavy LLMs, achieving R1/R5 gains of up to +4.9/+5.4 across three benchmarks.

Interpretable Debiasing of Vision-Language Models for Social Fairness

DeBiasLens locates "social neurons" that encode social attributes by training Sparse Autoencoders (SAEs) on VLM encoders, then selectively deactivates these neurons during inference to mitigate bias. It reduces Max Skew by 9-16% on CLIP and gender bias ratios by 40-50% on InternVL2 while maintaining general performance.

Intervention-Aware Multiscale Representation Learning from Imaging Phenomics and Perturbation Transcriptomics

The method uses paired perturbation transcriptomics (RNA-seq) as "privileged information" during training to guide microscopy image encoders. Through a "transcriptome-conditional teacher → image-only student" distillation framework, mechanistic signals of drug actions are injected into image representations. This enables one-shot transfer to unseen drugs/genetic perturbations and drug-target discovery at test time using only microscopy images, significantly outperforming self-supervised (MAE/DINO) and alignment-based (CLIP-style) baselines.

Is the Modality Gap a Bug or a Feature? A Robustness Perspective

This paper theoretically proves that the "modality gap" (global separation between image and text) in multimodal contrastive models such as CLIP is caused by the combination of initialization and contrastive loss. This phenomenon is orthogonal to downstream performance but monotonically negatively correlated with robustness. Consequently, a training-free post-processing algorithm can shift one modality along the gap vector toward the other, significantly enhancing robustness against noise without sacrificing clean accuracy.
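
One simple way to picture "shifting one modality along the gap vector" is to take the gap as the difference between the two modality centroids and translate the image embeddings by a fraction of it. This is a sketch under that assumption; the paper's exact algorithm, which modality it moves, and the `strength` parameter are not taken from the source.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def gap_norm(image_embs, text_embs):
    """Length of the vector between the two modality centroids."""
    mu_img, mu_txt = mean_vector(image_embs), mean_vector(text_embs)
    return math.sqrt(sum((t - i) ** 2 for t, i in zip(mu_txt, mu_img)))

def close_gap(image_embs, text_embs, strength=1.0):
    """Translate image embeddings toward the text centroid (no training).

    Renormalize afterwards if the downstream scorer expects unit vectors.
    """
    mu_img, mu_txt = mean_vector(image_embs), mean_vector(text_embs)
    gap = [t - i for t, i in zip(mu_txt, mu_img)]
    return [[x + strength * g for x, g in zip(e, gap)] for e in image_embs]

# Hypothetical 2-d embeddings forming two separated clusters.
image_embs = [[1.0, 0.1], [1.0, -0.1]]
text_embs = [[0.1, 1.0], [-0.1, 1.0]]

before = gap_norm(image_embs, text_embs)
half = gap_norm(close_gap(image_embs, text_embs, strength=0.5), text_embs)
after = gap_norm(close_gap(image_embs, text_embs, strength=1.0), text_embs)
```

Because every image embedding moves by the same vector, distances within the image modality are unchanged; only the global offset between the modalities shrinks.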

Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation

This work constructs the first spatial intelligence evaluation benchmark for Unmanned Aerial Vehicle (UAV) perspectives, SpatialSky-Bench (13 fine-grained tasks in 2 categories), accompanied by a 1-million-sample automatically generated training set SpatialSky-Dataset. By employing "SFT + GRPO reinforcement fine-tuning," the authors develop a specialized model, Sky-VLM, which achieves an average score of 53.30, surpassing the strongest baseline GPT-5 (23.07) by 139.6%.

IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment

IsoCLIP theoretically analyzes the structure of CLIP projectors and discovers that the cosine similarity calculation implicitly involves an inter-modal operator \(\Psi = W_i^\top W_t\) responsible for cross-modal alignment and an intra-modal operator \(\Psi_i = W_i^\top W_i\) that only targets normalization without promoting intra-modal alignment. By applying Singular Value Decomposition (SVD) to \(\Psi\), the study identifies an approximately isotropic alignment subspace; removing anisotropic directions significantly improves intra-modal retrieval and classification performance without requiring any training.
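
To see where the two operators come from, a short worked derivation may help. It assumes backbone features \(f_i\) and \(f_t\) are mapped by linear projectors to embeddings \(z_i = W_i f_i\) and \(z_t = W_t f_t\); the symbols \(f_i\), \(f_t\), \(z_i\), \(z_t\), and the text-side operator \(\Psi_t = W_t^\top W_t\) are notation introduced here for illustration and do not come from the summary.

\[
\cos(z_i, z_t) = \frac{z_i^\top z_t}{\lVert z_i \rVert \, \lVert z_t \rVert} = \frac{f_i^\top \Psi f_t}{\sqrt{f_i^\top \Psi_i f_i} \, \sqrt{f_t^\top \Psi_t f_t}}, \qquad \Psi = U \Sigma V^\top .
\]

Under these assumptions, \(\Psi\) appears only in the numerator (the cross-modal term), while \(\Psi_i\) appears only in the normalizing denominator, which matches the roles described above. The SVD on the right is the decomposition from which the approximately isotropic subspace would be selected.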

Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild

The JALA framework is proposed to construct a unified latent action space by jointly aligning predicted embeddings with latent actions generated via inverse dynamics. This allows Vision-Language-Action (VLA) models to learn from both annotated data and unlabeled in-the-wild human videos. Combined with the UniHand-Mix dataset containing 7.5M samples, it significantly improves the generalization of robot manipulation.

Language-guided Frequency Modulation for Large Vision-Language Models

This paper proposes a plug-and-play LFM (Language-guided Frequency Modulation) that shifts vision refinement—before feeding features into the LLM—from the spatial domain to the frequency domain. It uses text features to compute "emphasis maps" that selectively enhance critical frequency bands (high frequency for local details, low frequency for global context). Without adding extra trainable parameters (except for a lightweight MLP projector), LFM consistently improves various LVLMs across benchmarks like GQA, MMB, and MathVista.

LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map

LASAR equips an embodied agent with a "dual memory" system—frame-by-frame episodic memory plus a queryable latent cognitive map. A contrastive objective, ST-CRL, is used to "sculpt" the map into a high-level spatial representation capable of encoding topological, distance, and directional relationships, resulting in a 2%–3.5% performance gain in both navigation (VLN-CE) and zero-shot spatial reasoning (VSI-Bench).

Learning Anchor in Dual Orthogonal Space for Fast Multi-view Clustering

This paper proposes DOSFMVC, which extends anchor learning for large-scale multi-view clustering from a "single space" to a "dual orthogonal space." It jointly learns anchors in a space spanned by the anchors themselves and an additional orthogonal space based on "anchored clustering centers." By replacing traditional consensus anchor graphs with cluster indicator matrices for anchors and raw data, it achieves state-of-the-art (SOTA) performance across ACC/NMI/Purity/F1 on 7 datasets (up to ~300k samples) while maintaining linear complexity.

Learning complete and explainable visual representations from itemized text supervision

Addressing supervision scenarios like medical imaging and remote sensing where "one image is paired with multiple non-overlapping independent text descriptions (itemized text)," this paper proposes ItemizedCLIP. It utilizes a masked cross-attention module to generate "text-item-modulated" visual representations, paired with four SigLIP-style objectives to enforce "item independence" and "representation completeness." Zero-shot performance and fine-grained explainability significantly outperform CLIP-family baselines across four real medical/remote sensing domains and one synthetic domain.

Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models

To address the issues of irreversible RGB degradation and subsequent hallucinations in Multimodal Large Language Models (MLLMs) under overexposed/extremely dark conditions, Event-MLLM introduces event streams as a complementary modality. It utilizes an "illumination indicator" learned from a DINOv2 branch to adaptively regulate Event-RGB fusion, combined with an "Illumination Correction Loss" to align fused features with normal illumination semantics. This enables stable reasoning and counting across extreme brightness ranges from \(0.05\times\) to \(20\times\).

Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection

The PROGRESS framework is proposed to dynamically select the most informative training samples by tracking the VLM's learning progress on automatically discovered multimodal concept clusters. Using only 16-20% of labeled data, it achieves 99-100% of full-data performance with a shorter total training time.

LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks

LifeEval constructs the first multimodal assistant evaluation benchmark oriented toward "egocentric, real-time, and task-oriented" scenarios. Using 591 Ego4D video slices and 4,075 QA pairs with reasoning chains, it examines whether 26 mainstream MLLMs can assist humans in daily tasks in real-time like a personal assistant across six capability dimensions (Perception/Reasoning/Retrieval/Planning/Safety/Multi-turn Collaboration). The results reveal significant collective shortcomings in dynamic reasoning and goal planning.

Linguistic Priors for Visual Decoupling: Towards Symmetric Vision-Brain Alignment

To address the "semantic information asymmetry" between brain signals and natural images, this work utilizes object-level text descriptions as linguistic priors to explicitly decouple foreground objects from background regions in images. This transforms asymmetric vision-brain alignment into symmetric semantic alignment, achieving new SOTA in zero-shot brain-to-image retrieval on THINGS-EEG / THINGS-MEG.

Linking Perception, Confidence and Accuracy in MLLMs

The study reveals severe confidence miscalibration in MLLMs (where accuracy plunges during visual input degradation but confidence remains unchanged). It proposes CDRL (Confidence-Driven RL based on original-noise image pairs) for perception sensitivity training and utilizes the calibrated confidence to implement Adaptive Test-Time Scaling (CA-TTS), achieving an average improvement of 8.8% across four benchmarks.

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

Addressing the gap where "current Multimodal Large Language Models (MLLMs) almost exclusively follow the autoregressive paradigm and the diffusion path remains unverified," this paper grafts visual instruction tuning onto the masked diffusion language model LLaDA to create a pure diffusion MLLM, LLaDA-V. By leveraging bidirectional attention to better capture visual-spatial relationships, it sets a new SOTA for pure diffusion MLLMs across 18 benchmarks and outperforms the autoregressive baseline LLaMA3-V on 11 tasks using the same training data.

LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models

Addressing three major challenges in multimodal multi-turn dialogues—malicious intent concealment, contextual risk accumulation, and cross-modal joint risk—this work constructs the MMDS dataset with 4,484 annotated dialogues and the MMRT red-teaming framework based on MCTS. The proposed LLaVAShield auditing model achieves F1 scores of 95.71% and 92.24% on the user and assistant sides respectively, significantly outperforming baselines such as GPT-5-mini.

LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models

Inspired by human foveal encoding and cortical magnification mechanisms, this paper proposes LLMind, a training-free adaptive sampling framework. It implements non-uniform pixel allocation via Möbius transformations and utilizes closed-loop semantic feedback to optimize sampling parameters at test-time, significantly outperforming uniform sampling under tight budgets of only 1%-5% pixels.

Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning

This work reveals that prompt retrieval in Visual In-Context Learning (VICL) often suffers from label inconsistency due to the neglect of label information. The proposed LaPR framework achieves label-aware prompt retrieval through joint image-label representations and a Mixture-of-Experts (MoE) mechanism, consistently outperforming SOTA on foreground segmentation, object detection, and image colorization tasks.

LVLM-Aided Alignment of Task-Specific Vision Models

Using a Large Vision-Language Model (LVLM) as a "translator," this work translates explanation maps of small task-specific vision models into natural language and turns human category-level descriptions into per-sample error-correction masks. This allows small models to break free from reliance on spurious features (shortcuts) without requiring fine-grained per-image annotation, significantly improving worst-group accuracy on synthetic and real medical data.

M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA

M3Grounder transforms "answer localization" in Document QA from coarse bounding boxes to pixel-level segmentation. While the VLM generates answers, it emits [GROUND] tokens. Each token drives a promptable segmentation module via three MLP heads (phrase, line, and block levels) to produce nested multi-granular evidence masks, achieving SOTA results across four benchmarks.

MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures

MarkushGrapher-2 proposes an end-to-end multimodal chemical structure recognition model. By co-encoding image, text, and layout information through a dedicated chemical OCR module and combining a two-stage training strategy (adapting OCSR features then fusing multimodal encoding), it significantly outperforms existing methods in Markush structure recognition (M2S accuracy 56% vs 38%) while remaining competitive in molecular structure recognition.

MCHDoc: A Comprehensive Benchmark for Reading Multi-Carrier Chinese Historical Documents

MCHDoc organizes 15,724 high-resolution historical document images spanning over 3,000 years and six writing carriers (ancient paper, bamboo/wood slips, calligraphy rice paper, stone inscriptions, silk, and oracle bones) into a unified benchmark. Mirroring the expert workflow of "recognition followed by textual research and correction," it designs four tasks: page-level recognition, character-level recognition, pure LLM post-correction, and knowledge-base augmented post-correction. Systematic evaluation of over 20 open-source and closed-source MLLMs/LLMs reveals that even top-tier models struggle with cross-carrier generalization.

Mechanisms of Object Localization in Vision-Language Models

The authors use a suite of mechanistic interpretability tools (token ablation, attention knockout, causal mediation analysis) to dissect "how" LLaVA-1.5 and InternVL-3.5 internally localize objects. They find that localization relies on a "containerization" mechanism—where a collective set of object-region tokens defines the spatial extent regardless of their internal semantic arrangement. Furthermore, the causal chain is carried by a few sparse attention heads, with nearly non-overlapping sets of specialized heads for classification vs. localization, and localization causally depends on intermediate classification results in a "recognition-then-localization" sequential computation.

Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence

Medic-AD upgrades general-purpose medical VLMs into clinical intelligence models capable of lesion detection, symptom tracking, and visual explainability through a three-stage progressive training framework involving an anomaly detection token (<Ano>), a temporal difference reasoning token (<Diff>), and visual heatmaps. It achieves SOTA performance across multiple medical tasks.

MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals

MERLIN translates the "native MLLM" paradigm to the electromagnetic (IQ) signal domain. The authors first construct a dataset of 134,000 signal-text pairs (EM-134K) and the EM-Bench benchmark covering perception and reasoning. They then propose a two-stage distillation framework ("High-SNR Teacher → Low-SNR Student") featuring a Denoising Subspace Module (DSM) that projects noisy features back into the signal subspace. This ensures robustness in noisy environments where the Signal-to-Noise Ratio (SNR) is below 0 dB, significantly outperforming general large models like GPT-5 and Claude-4 on EM-Bench.

MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction

This paper constructs the first large-scale multimodal dataset for severe weather warning, MP-Bench (420,000 pairs of ERA5 meteorological fields and warning texts), and proposes a Multimodal Large Model (MMLM) capable of directly processing 4D meteorological tensors. Through three plug-and-play fusion modules acting on time, space, and vertical pressure levels, high-dimensional meteorological data is aligned with LLMs to generate natural language warnings.

Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs

This paper points out that existing VLM-based OOD detection methods use intra-modal distances (text-text or image-image) to select negative texts, which is inconsistent with the cross-modal distance optimized by CLIP. It proposes InterNeg to systematically utilize cross-modal distance from both textual and visual perspectives, achieving a 3.47% reduction in FPR95 on ImageNet.

MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction

This paper proposes MM-ReCoder, the first Multi-modal LLM (MLLM) for chart-to-code generation with self-correction capabilities. Through a two-stage multi-turn GRPO reinforcement learning framework (Shared-First-Turn optimization of correction followed by Full-Trajectory optimization of coding), it achieves an 86.5% low-level score on ChartMimic with only 7B parameters, comparable to Qwen3-VL-235B.

MMLandmarks: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

MMLandmarks constructs the first large-scale instance-level geospatial benchmark with one-to-one correspondence for every landmark across four modalities: ground images, aerial images, text, and GPS (18,557 landmarks in the US, with 329k ground and 197k aerial images). It demonstrates that neither existing specialized models nor general foundation models solve it effectively, and provides a simple CLIP-style four-modal contrastive learning baseline (MMCLIP) to show that "training on this data allows a single model to sweep multiple tasks."

MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection

The authors observe that existing multimodal sarcasm detection datasets/methods are restricted to "single-image" settings, failing to capture sarcasm triggered by cross-image comparisons. Consequently, they construct MMSD3.0, the first real-world benchmark consisting entirely of multi-image samples (2–4 images each), and propose a companion Cross-Image Reasoning Model (CIRM) featuring dual-stage bridging and relevance-guided fusion, achieving SOTA performance on MMSD, MMSD2.0, and MMSD3.0.

Modeling Cross-vision Synergy for Unified Large Vision Model

PolyV integrates image, video, and 3D modalities into a unified large vision model using "Dynamic Routing Sparse MoE + Synergy-aware Training." This enables the model to perform "synesthetic" inference, where temporal priors from video and geometric priors from 3D are transferred to complement static image reasoning. It achieves an average improvement of over 10% across 10 benchmarks relative to the Qwen2.5-VL-7B backbone.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Molmo2 is a fully open family of Video-Language Models (weights, data, code, and training recipes are all open, with no data distilled from closed-source VLMs). By building 9 new datasets and utilizing a three-stage training strategy, it adds "video grounding using points and trajectories," a capability missing even from closed-source models. The 8B model significantly outperforms comparable open-source models in video counting, pointing, and tracking, even surpassing Gemini 3 Pro in certain tasks.

MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding

To address the pain points in e-commerce multimodal representation learning—"fixed-ratio mixed training leading to modality imbalance, neglect of intra-product image-text alignment, and high noise in raw data"—MOON2.0 utilizes a modality-driven MoE for end-to-end multimodal joint learning. It employs dual-level alignment to simultaneously align inter-product and intra-product relations, coupled with image-text co-augmentation and dynamic sample filtering to purify data. The authors released the MBE2.0 benchmark with 6.4 million samples, achieving zero-shot SOTA on various e-commerce retrieval, classification, and attribute prediction tasks.

More than the Sum: Panorama-Language Models for Adverse Omni-Scenes

This paper proposes the Panorama-Language Modeling (PLM) paradigm and the PanoVQA large-scale panoramic VQA dataset (653K QA pairs). It designs a plug-and-play Panorama Sparse Attention (PSA) module that allows existing VLMs to handle equirectangular projection (ERP) panoramas without retraining, achieving global reasoning superior to multi-view stitching schemes in adverse scenarios such as occlusions and accidents.

Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models

MoT probe experiments reveal asymmetric pruning sensitivity between text and visual pathways in LVLMs: text pathways are highly sensitive and must be calibrated with text tokens, while visual pathways are highly redundant and can withstand 60% sparsity. Based on this, the proposed ATV-Pruning builds its calibration pool from all text tokens plus a small number of visual tokens selected adaptively per layer.

MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding

The MSJoE framework is proposed to jointly evolve an MLLM and a lightweight keyframe sampler via reinforcement learning. The MLLM generates visual queries to guide frame retrieval, and a 1D U-Net sampler learns to select frames from a CLIP similarity matrix. End-to-end joint optimization achieves a +8% accuracy improvement in long-form video QA.

Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

The authors construct Multi-Crit, the first benchmark for evaluating the multi-criteria following capabilities of multimodal Judge models. It includes criterion-level human annotations and preference conflict samples. Using three new metrics—PAcc, TOS, and CMR—they evaluate 25 LMMs, revealing that even the strongest closed-source model achieves only 32.78% multi-criteria consistency on open-ended generation tasks.

Multi-Hierarchical Contrastive Spectral Fusion for Multi-View Clustering

MCSF integrates differentiable deep spectral embedding into multi-view encoders and fuses multiple views into a "structure-aware" consensus representation using a tri-level contrastive loss (Intra-view Structure Preservation / View-Consensus Alignment / Consensus Structure Refinement), achieving leading clustering accuracy across 8 benchmarks.

Multi-Metric Representation Learning Strategy Based on Clustering for Fine-Grained Multimodal Sentiment Analysis

To address the problem of "overlapping sentiment centers and ambiguous fine-grained boundaries" after fusing different modalities into the same representation space, MMRest first performs k-means sentiment clustering on tri-modal representations. It then employs a multi-metric learning strategy involving a global metric and cluster-specific local metrics to pull similar sentiments closer and push dissimilar ones further apart. Finally, a Projection and Decision-Level Fusion (PDLF) mechanism adds the geometric projection bias derived from the metrics to unimodal prediction scores. MMRest outperforms SOTA models on CMU-MOSI/MOSEI with approximately 30% of the parameters compared to Transformer-based methods.

Multi-Modal Image Fusion via Intervention-Stable Feature Learning

A multi-modal image fusion framework inspired by causal inference is proposed. By probing true inter-modal dependencies through three structured intervention strategies (complementary masking, random masking, and modality dropout), and designing a Causal Feature Integrator (CFI) to learn intervention-stable features, the method achieves a PSNR of 66.02 and AG of 4.129 on MSRS, with an object detection mAP of 0.821.

Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery

This paper proposes the SSR²-GCD framework, which learns structured representations with balanced intra-modal compression via a Semi-Supervised Rate Reduction loss. Combined with a Retrieval-based Text Aggregation strategy to enhance cross-modal knowledge transfer, it outperforms existing multi-modal GCD methods across 8 datasets.

Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration

To address the failure of class-conditional distribution modeling caused by "modality distribution asymmetry" in Multi-modal Test-time Adaptation (TTA), AdaPGC explicitly models the feature distribution of each class using a probabilistic Gaussian model with class-specific covariances. It further suppresses the bias of corrupted modalities through contrastive correction based on symmetric KL divergence, achieving SOTA results across most corruption settings on Kinetics50-C and VGGSound-C.

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Addressing the issue where MLLMs are limited to single-image spatial reasoning and struggle with basic orientations, this paper utilizes annotated 3D/4D scene datasets to automatically generate 27 million multi-frame spatial question-answering pairs (MultiSPA). By injecting foundational capabilities of depth, visual correspondence, and dynamic perception into InternVL2, the trained Multi-SpatialMLLM improves by an average of 36% over the base model on a self-constructed benchmark, matching the performance of closed-source models and specialized 3D models.

Multi-speaker Attention Alignment for Multimodal Social Interaction

This paper discovers that Multimodal Large Language Models (MLLMs) suffer from severe cross-modal attention misalignment between "speaker text tokens and their corresponding visual regions" in multi-speaker dialogue scenarios. It proposes a parameter-free and architecture-agnostic attention alignment method: it first dynamically selects the attention heads responsible for visual grounding, then injects an adaptive bias calculated from speaker positions into these heads to "weld" the visual features and dialogue of the same speaker together. This achieves an average improvement of 2–3% across three MLLMs and three datasets, setting new SOTA records.

Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance

This work redefines catastrophic forgetting in Multimodal Continual Instruction Tuning (MCIT) as the "absence of gradients from old tasks during new task training." DGG approximates the old task gradients using a "direction vector from current parameters to the optimal parameters of previous tasks," adds this to the real gradients from a limited replay buffer, and dynamically regulates the update frequency using Bernoulli sampling. Without expanding the model, DGG achieves SOTA on VQAv2 and UCIT.
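
The gradient combination described above can be sketched as follows. This is a hedged illustration under several assumptions not stated in the summary: the pseudo-gradient is the negated direction vector (so that a descent step moves toward the old optimum), the Bernoulli draw gates whether the old-task terms are applied at a given step, and `guided_gradient`, `p_guidance`, and `scale` are hypothetical names.

```python
import random

def guided_gradient(theta, theta_old_opt, new_grad, replay_grad,
                    p_guidance, rng, scale=1.0):
    """Combine the new-task gradient with an approximated old-task gradient."""
    # Direction vector from the current parameters to the old-task optimum.
    direction = [o - t for o, t in zip(theta_old_opt, theta)]
    # Assumed sign convention: descent subtracts the gradient, so the
    # pseudo-gradient is the negated direction.
    pseudo_old_grad = [-scale * d for d in direction]
    # Bernoulli gate (assumption): apply the old-task terms with
    # probability p_guidance, otherwise use the new-task gradient alone.
    if not rng.random() < p_guidance:
        return list(new_grad)
    return [g_new + g_rep + g_old
            for g_new, g_rep, g_old in zip(new_grad, replay_grad, pseudo_old_grad)]

rng = random.Random(0)
theta = [1.0, 2.0]            # current parameters
theta_old_opt = [0.0, 0.0]    # stored optimum of a previous task
new_grad = [0.5, 0.5]         # gradient on the new task
replay_grad = [0.1, 0.1]      # real gradient from the small replay buffer

always = guided_gradient(theta, theta_old_opt, new_grad, replay_grad, 1.0, rng)
never = guided_gradient(theta, theta_old_opt, new_grad, replay_grad, 0.0, rng)
```

With `p_guidance=1.0` the old-task terms are always added; with `0.0` the update reduces to plain new-task training. How the method actually schedules this probability is not specified in the summary.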

Multimodal Distribution Matching for Vision-Language Dataset Distillation

This paper proposes MDM (Multimodal Distribution Matching), a geometry-aware distribution matching framework for image-text dataset distillation. By intervening simultaneously at the data, model, and loss levels (joint space clustering initialization + angle-guided weight interpolation + geodesic kernel energy matching on the unit hypersphere), it directly aligns the joint distribution of real and synthetic data via single-level optimization. This reduces distillation costs by up to 98% compared to the trajectory-matching SOTA (LoRS) while outperforming baselines in cross-architecture generalization.

Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration

This paper proposes CPSC (Conformal Predictive Self-Calibration), which attributes the seemingly independent "low-quality data" issues of modality imbalance and noise pollution to a single root cause—predictive uncertainty regarding the reliability of modalities or samples. It utilizes Conformal Prediction (CP) to generate real-time reliability scores during training, performing self-calibration at both the feature level (recomposing reliable feature components) and the gradient level (reweighting gradients by sample reliability), achieving new SOTA results across 6 datasets under imbalance and noise settings.

Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

MMRB2 is the first benchmark for evaluating reward models of "omni models" (capable of reading/writing interleaved text and images in a single sequence). Spanning four tasks—text-to-image, image editing, interleaved generation, and multimodal reasoning—with 1,000 expert-annotated preference pairs per task, it reveals a significant gap between the strongest current judge (Gemini 3 Pro at 76.3% average consistency) and human experts (>90%).

µVLM: A Vision Language Model for µNPUs

µVLM is the first vision-language model designed specifically for "µNPUs" (MCU-level, mW power consumption, tens of MBs memory). By replacing hardware-unsupported self-attention with NPU-friendly OverMod encoders and AttSSM decoders, it achieves 117.8 CIDEr on COCO Karpathy while realizing millisecond-level VLM inference (TBT 21 ms, power <300 mW) on µNPUs for the first time.

MVP: Multiple View Prediction Improves GUI Grounding

To address the instability where "minor screenshot perturbations cause drastic coordinate prediction jumps" in GUI grounding models, this paper proposes the training-free MVP framework. It crops multiple sub-views using instruction-vision attention for independent prediction, then performs spatial clustering on these coordinates, selecting the center of the largest cluster as the final output. It improves Qwen3VL-32B from 55.3 to 74.0 on ScreenSpot-Pro.
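
The final aggregation step (cluster the per-view predictions, then take the center of the largest cluster) can be sketched as below. The summary does not say which clustering algorithm is used, so this example assumes a simple greedy distance-threshold grouping; `cluster_points`, `aggregate_predictions`, and `radius` are hypothetical names.

```python
import math

def cluster_points(points, radius):
    """Greedy grouping: a point joins the first existing cluster that has a
    member within `radius`; otherwise it starts a new cluster."""
    clusters = []
    for p in points:
        for c in clusters:
            if any(math.dist(p, q) <= radius for q in c):
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

def aggregate_predictions(points, radius=20.0):
    """Return the mean coordinate of the largest cluster of predictions."""
    clusters = cluster_points(points, radius)
    largest = max(clusters, key=len)
    cx = sum(x for x, _ in largest) / len(largest)
    cy = sum(y for _, y in largest) / len(largest)
    return (cx, cy)

# Four sub-view predictions agree; one is an outlier jump.
predictions = [(100, 200), (104, 198), (98, 203), (640, 50), (102, 201)]
final_click = aggregate_predictions(predictions)
```

Here the outlier at (640, 50) forms its own single-point cluster and is ignored, which is the stabilizing effect the paragraph describes.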

Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy

Nano-EmoX proposes a cognitively inspired three-level emotional task hierarchy (Perception → Understanding → Interaction). It is the first multimodal language model to unify six core emotional tasks with compact 2.2B parameters, gradually cultivating high-level empathy from basic perception through the P2E progressive training framework.

Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning

The Narrative Weaver framework is proposed, combining the narrative planning of MLLMs with the fine-grained generation of diffusion models. It achieves long-range visual consistency under multi-modal conditions through learnable queries and a dynamic Memory Bank. Additionally, it introduces EAVSD, the first e-commerce advertising storyboard dataset containing over 330K images.

No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

C2LIP proposes a contrastive fine-tuning scheme that does not rely on hard negatives. By decomposing text into noun phrase concepts and introducing cross-modal attention pooling, it achieves SOTA on SugarCrepe/SugarCrepe++ compositionality benchmarks while maintaining or improving zero-shot and retrieval performance.

Noise-Aware Few-Shot Learning through Bi-directional Multi-View Prompt Alignment

The NA-MVP framework is proposed, which achieves fine-grained patch-to-prompt alignment through a bi-directional (clean + noise-aware) multi-view prompt design combined with Unbalanced Optimal Transport (UOT). It utilizes classic OT to perform selective label refinement on identified noisy samples, consistently surpassing SOTA in noisy few-shot learning scenarios.

Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models

Addressing the challenge of preventing catastrophic forgetting without storing historical data in Multimodal Large Language Models (MLLMs), Octopus demonstrates that gradient orthogonalization is more critical than parameter orthogonalization. It proposes History-Free Gradient Orthogonalization (HiFGO), which utilizes only historical weights (no historical data), combined with a two-stage fine-tuning strategy (free adaptation followed by constrained refinement). On the UCIT benchmark, it outperforms the previous SOTA by 2.14% in Avg and 6.82% in Last accuracy.

OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models

This paper proposes OddGridBench to evaluate the fine-grained visual discrepancy perception of MLLMs (identifying elements in a grid that differ in color, size, rotation, or position). It finds that all MLLMs perform significantly below human levels. Consequently, it introduces OddGrid-GRPO (curriculum learning + distance-aware reward) to markedly enhance the visual discrimination of models.

OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion

The authors construct OmniFood8K, a multimodal nutrition dataset for Chinese food with 8,036 samples, and NutritionSynth-115K, a synthetic dataset with 115K samples. An end-to-end framework is proposed to predict nutrition information from a single RGB image via a Scale-Shift depth adapter, frequency-aligned fusion, and a mask prediction head.

On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models

This paper reveals the "token's dilemma" in dynamic MoE continual learning—where ambiguity in new task data and weak contributions from old tokens toward new knowledge lead to routing drift and catastrophic forgetting. It proposes LLaVA-DyMoE, which mitigates routing drift through Token Assignment Guidance and Routing Score Regularization, achieving an MFN improvement of over 7% and a 12% reduction in forgetting on the CoIN benchmark.

One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework

This work redefines zero-shot image captioning from "image-centric" to "patch-centric." It utilizes a frozen dense visual backbone (DINOv2 family) to extract patch features, applies non-parametric aggregation for specific regions, and feeds the result into a text-only trained decoder. This unified framework addresses multi-granularity tasks—including single patches, boxes, mouse traces, and whole images—without requiring any region-level annotations.

OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation

OneCAT integrates "understanding + generation + editing" into the same decoder-only Transformer. By utilizing a Modality-MoE with hard routing (text, understanding, and generation specialists), it achieves encoder-free inference. It also introduces multi-scale auto-regressive generation into LLMs via a Scale-Aware Adapter, attaining SOTA performance in a unified model while delivering approximately 10× faster generation speeds than diffusion models.

ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models

ORIC formalizes contextual incongruity—where objects appear in unexpected scenes or are missing from expected ones—as a source of uncertainty. Using LLM-guided and CLIP-guided sampling strategies, the authors construct ORIC-Bench from MSCOCO to test this specific scenario. Results reveal that the Macro F1 of 18 mainstream LVLMs drops from near-perfect to approximately 60–80. Performance is recovered and aligned more closely with human judgment using Visual-RFT fine-tuning on 600 ORIC-style samples.

ORION: ORthonormal Text Encoding for Universal VLM Adaptation

ORION performs LoRA fine-tuning on the CLIP text encoder using only category names (without accessing any images). By adding a Frobenius penalty to the loss to push various text prototypes toward pairwise orthonormality while constraining them from deviating from the original zero-shot prototypes, it creates a set of "universal text classifiers" with more dispersed angles and stronger discriminative power. This serves as a plug-and-play replacement, yielding consistent performance gains across zero-shot, few-shot, and test-time adaptation settings on 11 datasets and 3 backbones.
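
The two terms described above can be sketched in a few lines. This is only an illustration under assumptions: the function names, the weighting factor `lam`, and the use of a squared-distance anchor term are hypothetical and not taken from the paper, which may define the penalty differently.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def gram(protos):
    # Pairwise dot products between prototypes.
    return [[sum(a * b for a, b in zip(p, q)) for q in protos] for p in protos]

def orthonormality_penalty(protos):
    """Squared Frobenius norm of (P P^T - I) for unit-normalised prototypes."""
    unit = [normalize(p) for p in protos]
    g = gram(unit)
    k = len(unit)
    return sum((g[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(k) for j in range(k))

def anchor_penalty(protos, zero_shot):
    """Squared distance between adapted and original zero-shot prototypes (assumed form)."""
    return sum((a - b) ** 2 for p, z in zip(protos, zero_shot) for a, b in zip(p, z))

def total_penalty(protos, zero_shot, lam=0.1):
    # lam is a hypothetical trade-off weight.
    return orthonormality_penalty(protos) + lam * anchor_penalty(protos, zero_shot)

orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
correlated = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
print(orthonormality_penalty(orthogonal), orthonormality_penalty(correlated))
```

Orthogonal prototypes incur no penalty, while prototypes at a 45° angle are penalised, which is the pressure toward "more dispersed angles" that the summary describes.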

P-Flow: Prompting Visual Effects Generation

Addressing the challenge that "dynamic visual effects such as explosions, squashing, and collapsing are difficult to describe precisely with a single text prompt," P-Flow proposes a training-free framework. It treats the text prompt as an optimization variable, using a Vision-Language Model (VLM) to contrast differences between reference and generated videos and iteratively rewrite prompts. Combined with noise prior enhancement and historical trajectory maintenance, it enables a frozen video generation model to replicate target effects with zero fine-tuning, outperforming baselines in FID-VID, FVD, Dynamic Degree, and human evaluations across T2V/I2V tasks.

PACT: Phase-Like Transition Constraints in Adapter-Based Continual Learning of Vision-Language Models

Addressing the limitation where orthogonal constraints isolate task adapters and suppress cross-task knowledge sharing, the authors derive "Phase-Like Transition Constraints (PACT)" from PAC-Bayes theory for the post-convergence phase. This allows adapters to smoothly transition rather than hard-threshold between "frozen" (preserving history) and "melting" (adapting to new tasks) states, similar to the phase transitions of water. Implemented via a dual-branch ViT, Stable Adapter Initialization (SAI), and Prior Anchoring (PA), the method outperforms SOTA across multiple continual learning settings while using 36.96% fewer trainable parameters than standard adapter baselines.

Parameter-Efficient Adaptation for MLLMs via Implicit Modality Decomposition

To address the imbalance issue where the "text modality excessively dominates parameter updates" when fine-tuning Multimodal Large Language Models (MLLMs) with LoRA, this paper proposes IMoD. It implicitly partitions a single LoRA matrix into text-exclusive, non-text-exclusive, and shared blocks, and guides them via two gradient-level constraints directly injected into backpropagation. This achieves an average improvement of approximately 3.3% across audio-visual-text tasks without adding any trainable parameters or compromising weight mergeability.

PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs

PAS diagnoses the instability of temporal encoding in Video LLMs as "sampling an inverse Fourier temporal kernel with high-frequency ripples." It proposes training-free multi-head inverse phase smoothing—applying small, opposite temporal phase offsets to different attention head queries before standard aggregation. This effectively performs a controlled moving average to smooth out ripples, consistently improving performance across nine video benchmarks with near-zero additional overhead.
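
Why opposite phase offsets act as a smoother can be seen from a single-frequency identity. The setup is a simplifying assumption rather than the paper's formulation: two heads, one temporal frequency \(\omega\), a temporal gap \(\Delta t\), and a small offset \(\pm\delta\).

\[
\tfrac{1}{2}\Big[\cos\big(\omega(\Delta t + \delta)\big) + \cos\big(\omega(\Delta t - \delta)\big)\Big] = \cos(\omega\delta)\,\cos(\omega\,\Delta t)
\]

For small \(\delta\), the factor \(\cos(\omega\delta)\) stays close to 1 at low frequencies and shrinks as \(\omega\delta\) grows toward \(\pi/2\), so averaging the two offset heads damps high-frequency ripples more than the slowly varying part of the kernel.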

Personalized Image Descriptions from Attention Sequences

DEPER is the first to treat "how an individual views an image" (attention scanpath trajectories) as a personalization signal. It distills a cross-image stable subject embedding and injects it into a frozen Qwen2-VL via a lightweight adapter. This allows the model to generate personalized descriptions without requiring gaze data at test time or per-person fine-tuning, achieving an average improvement of approximately 24% across four datasets.

PersonaVLM: Long-Term Personalized Multimodal LLMs

This paper proposes PersonaVLM, a multimodal agent framework for long-term personalization. By utilizing active memory management (four memory databases), multi-step reasoning retrieval, and a momentum-based personality evolution mechanism, it transforms general MLLMs into personalized assistants capable of adapting to evolving user preferences, outperforming GPT-4o by 5.2% in 128K context scenarios.

Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention

During the VLM fine-tuning phase, a zero-parameter "Side-Masked Attention Module" (SMAM) is inserted to align each phrase in the answer to text regions on the chart. By supervising the logit contribution of these regions, the model learns to "ground" its generation to correct visual areas during chart QA, consistently outperforming standard SFT on benchmarks like ChartQA and C2T.

PhyCritic: Multimodal Critic Models for Physical AI

PhyCritic utilizes a two-stage RLVR pipeline comprising "physical skill warmup + self-referential critic fine-tuning" to train a 7B multimodal model into a critic specialized for physical AI tasks (perception/causality/planning). The core mechanism involves the critic "solving the problem first, then using its own solution as a reference to judge which of two responses is superior." It achieves state-of-the-art performance among open-source 7B/8B models on the newly established PhyCritic-Bench and enhances physical reasoning capabilities when used as a policy model.

Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision

The DeepfakeJudge framework is proposed to scale human-annotated reasoning supervision into large-scale structured scoring data via a bootstrapped generator-evaluator process. This trains 3B/7B vision-language models as automatic judges for the quality of deepfake detection reasoning, achieving high alignment with human judgment in both pointwise and pairwise evaluations.

PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models

PointAlign applies feature-level alignment regularization to point cloud tokens in the intermediate layers of a 3D VLM's LLM, aligning them with Q-Former outputs. By training only lightweight alignment projectors and LoRA adapters, it effectively prevents the degradation of geometric information during language modeling, achieving a 7.50pp improvement in open-vocabulary classification.

Pointing at Parts: Training-Free Few-Shot Grounding in Multimodal LLMs

POP is a training-free, plug-and-play method that performs element-wise fusion of language-guided attention maps from MLLMs (providing semantic and referential capabilities but remaining coarse) and bidirectional visual correspondences of self-supervised DINOv3 features (precise but ambiguous with multiple objects). This allows MLLMs to achieve precise part-level (e.g., "laptop keyboard") rather than just instance-level pointing in few-shot settings. It improves average scores by up to 8.9 points in 1-shot and 16.4 points in 3-shot across three datasets; even MLLMs without native pointing capabilities see gains of up to 30.9 points.

PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation

This paper proposes PosterIQ, a comprehensive benchmark for poster design containing 7,765 understanding annotations and 822 generation prompts. Spanning 24 task categories such as OCR, font awareness, layout reasoning, design intent understanding, and composition-aware generation, it systematically evaluates the gap in design cognition between MLLMs and diffusion models.

PowerCLIP: Powerset Alignment for Contrastive Pre-Training

PowerCLIP performs exhaustive local-to-global alignment between the "powerset of image region subsets" and "textual syntax tree phrases." It utilizes a linear-complexity Nonlinear Aggregator (NLA) to reduce the exponential overhead of powerset alignment to \(O(M)\). On 28 zero-shot benchmarks, it outperforms existing CLIP-like methods in 22 cases, showing significant gains in compositionality and robustness.

PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks

PP-OCRv5 shuns parameter scaling in favor of a "data-centric" methodology—systematically filtering and expanding training data across the dimensions of difficulty, accuracy, and diversity. This approach scales a 5M-parameter two-stage OCR system to compete with 10B- and 100B-parameter VLMs on standard OCR benchmarks, while maintaining superior localization precision, hallucination suppression, and computational efficiency.

Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models

This paper systematically diagnoses the degradation of visual representations in the intermediate layers of MLLMs at both the global functional level and the patch-level semantic structure level. It reveals that the essence of this phenomenon is "visual sacrifice" under the pure text generation objective and proposes Predictive Regularization (PRe). By requiring degraded intermediate features to predict initial visual features, PRe mitigates degradation and achieves consistent improvements across multiple VL benchmarks.

Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation

This paper proposes AReS, a method that replaces the continuous API calls of traditional Zero-Order Optimization (ZOO) with a single-query API priming of a local encoder. It achieves a +27.8% gain on GPT-4o (where ZOO methods are nearly ineffective) while reducing API calls by over 99.99%, enabling cost-free inference.

Probabilistic Prompt Adaptation for Unified Image Aesthetics and Quality Assessment

PPA treats the choice of text prompt for scoring as a latent variable. By performing probabilistic weighted marginalization over a pool of antithetical prompts pre-sampled by an LLM, it simultaneously learns a high-precision task scorer and a general aesthetics/quality evaluator controllable by arbitrary text prompts. This is achieved using only (task, image, score) triplets without requiring any prompt or attribute annotations.

ProSoftArena: Benchmarking Hierarchical Capabilities of Multi-modal Agents in Professional Software Environments

ProSoftArena is the first multimodal agent benchmark targeting professional software (13 tools including CAD, ChemDraw, ArcGIS, Photoshop, etc.). It categorizes agent capabilities into five levels (L1–L5), utilizes automated scoring within real Windows virtual machines based on execution results, and introduces a "Human-in-the-Loop" evaluation. Results reveal that the strongest agents achieve only a 20.6% success rate in software-level tasks (L2) and fail almost entirely in cross-software workflows (L3).

Protect to Adapt: Orthogonal Subspace Control with Ranked Negative-Prompt Curriculum for Few-Shot Action Recognition

When adapting CLIP to Few-Shot Action Recognition (FSAR), the authors employ "Orthogonal Subspace Control (OSC)" to constrain LoRA updates to the orthogonal complement of the pre-trained weights' principal subspace, preventing the destruction of general semantics and suppressing catastrophic forgetting. Furthermore, a "Ranked Negative-Prompt Curriculum (RNC)" uses an LLM to generate rank-ordered intra-class hard negative samples, filtered by a verifier, to sharpen decision boundaries. By fine-tuning only 2% of parameters, this method achieves SOTA results across 5 FSAR benchmarks.
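
The orthogonal-complement constraint can be illustrated with a plain projection. This is a minimal sketch under assumptions: `basis` stands in for orthonormal directions of the pre-trained weights' principal subspace (how the paper obtains them, for instance via SVD, is not stated here), and `project_out` is a hypothetical helper name.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(update_cols, basis):
    """Remove from each update column its component in span(basis).

    basis must be a list of orthonormal vectors.
    """
    result = []
    for col in update_cols:
        residual = list(col)
        for u in basis:
            coeff = dot(u, col)
            residual = [r - coeff * x for r, x in zip(residual, u)]
        result.append(residual)
    return result

# Hypothetical principal direction and a raw low-rank update (two columns).
basis = [[1.0, 0.0, 0.0]]
update_cols = [[0.5, 2.0, -1.0], [3.0, 0.0, 4.0]]
projected = project_out(update_cols, basis)
print(projected)
```

After projection, every update column has zero inner product with the protected directions, so the update cannot move the weights along them.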

Prototype-as-Prompt: Multimodal Sentiment Prototypes Endowing Large Language Models the Capability to Perform Multimodal Sentiment Analysis

This paper proposes Prototype-as-Prompt (PaP), which compresses audio-visual modalities into a set of sentiment prototypes with explicit semantics as soft prompts for frozen LLMs in multimodal sentiment analysis. Through sentiment supervision, cross-modal alignment, and diversity constraints, these prototypes are forced to encode clear emotional meanings. With only 0.09%–0.26% trainable parameters, it outperforms previous SOTA across four datasets and three different LLM architectures.

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

Proxy3D clusters semantic features and geometric point clouds from video frames into a compact set of 3D "proxy" tokens based on "semantic groups." By utilizing the SpaceSpan dataset for multi-stage alignment training, the VLM achieves performance comparable to or better than SOTA in 3D QA, visual grounding, and spatial reasoning using only 700 visual tokens (less than 1/10 of competitors).

R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment

Addressing the issues of "Computer Graphics (CG) image quality assessment lacking explainable text descriptions" and "VLMs being inaccurate in direct CG quality judgment," R4-CGQA constructs the first 3.5K CG dataset with six-dimensional quality descriptions. It then proposes a content-quality dual-stream retrieval framework. By feeding quality descriptions of visually similar CG images as examples to VLMs without fine-tuning, it consistently improves the CG quality assessment capabilities of models like LLaVA, Llama 3.2-V, and Qwen2.5-VL.

Re-evaluating Continual VQA: Toward Fair and Robust Evaluation for Multimodal Continual Learning

This paper identifies two structural flaws in existing Continual VQA benchmarks: "cross-task shared answer vocabularies" and "identical in-task train/test answer distributions," which lead to overestimating anti-forgetting capabilities. Consequently, the authors reconstruct the UCo-VQA benchmark, which enforces token-wise mutually exclusive answer spaces and introduces in-task distribution shifts. Simultaneously, they propose MaDQ—a parameter-efficient method that replays only historical questions combined with dual-layer distillation and image-text matching regularization—achieving SOTA results in these debiased and more challenging settings.

RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

Addressing the limitations where conventional RGB degrades in low-light, high-dynamic, or fast-motion scenarios, and pure event streams lack color/texture, this paper proposes RE-VLM, the first dual-stream RGB-Event Vision-Language Model. It utilizes parallel RGB/event encoders and a three-stage progressive alignment to map heterogeneous visual features into language space. Furthermore, a graph-driven, degradation-adaptive data pipeline is introduced to convert synchronized RGB-event streams into verifiable scene graphs for large-scale synthesis of captions and Q&A pairs. RE-VLM outperforms RGB-only and event-only models of comparable or larger sizes in image captioning and VQA, especially under adverse lighting.

RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs

RealBirdID is a fine-grained bird identification benchmark focused on "identifying species if possible, and providing reasons if not." By mining \(3.4\text{k}\) "unanswerable" images from real iNaturalist disputes (categorized into three abstention reasons: need vocalization / angle or occlusion / poor quality) paired with "answerable" samples from the same genus, the study evaluates models using three metrics. Results show that top-tier MLLMs like GPT-5 and Gemini-2.5 Pro achieve less than \(13\%\) species-level accuracy, struggle to distinguish answerable from unanswerable samples, and mostly provide incorrect reasons for abstention.

ReBaPL: Repulsive Bayesian Prompt Learning

ReBaPL transforms CLIP prompt learning from searching for a "single optimal solution" to "sampling a diverse set of high-quality prompts from the posterior using cyclical SGHMC." By introducing a "repulsive force" in the representation space via MMD/Wasserstein metrics to prevent sampling collapse into a single mode, it serves as a plug-and-play Bayesian extension for any MLE prompt learning method (e.g., MaPLe, MMRL), significantly improving base-to-novel, cross-dataset, and domain generalization.

ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval

This work reveals the "Capability Degradation" phenomenon when adapting generative MLLMs into discriminative retrievers. It proposes the ReCALL framework, a three-stage pipeline (Diagnosis of retriever blind spots → Generation of corrective triplets via base MLLM CoT reasoning → Grouped Contrastive Refining), to effectively restore degraded fine-grained compositional reasoning. ReCALL achieves 55.52% R@1 on CIRR and 57.04% R@10 on FashionIQ.

Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP

This paper systematically refutes the popular "intra-modal misalignment in CLIP image embeddings" hypothesis. Theoretically, it proves that image-image similarity is fully determined by image-text similarity without additional degrees of freedom. Empirically, it reproduces the so-called "misalignment metrics" in non-CLIP models like DINO and SigLIP2, demonstrating that these metrics are artifacts of the measurement process rather than defects in CLIP's objectives. Finally, a minimal PCA projection method is shown to outperform complex methods specifically designed to "fix" misalignment in retrieval and few-shot classification tasks.

Relational Visual Similarity

This paper formally defines the problem of relational visual similarity—logical or functional correspondence between two images rather than surface attribute similarity. It constructs a dataset of 114K anonymous descriptions and trains the relsim model, revealing fundamental flaws in existing similarity metrics (e.g., CLIP, DINO) in capturing relational structures.

Reliable Clustering Number Estimation for Contrastive Multi-View Clustering

RCNMC utilizes a semantic-aware contrastive module with JSD adaptive weighting to mitigate representation degeneration—where low-quality views degrade high-quality ones. By modeling the "estimation of cluster number \(K\)" as a Markov Decision Process (MDP) and using Reinforcement Learning (RL) to automatically infer \(K\) within a single training session, the method achieves or exceeds the performance of contrastive methods using ground-truth \(K\) across 9 multi-view datasets without pre-setting \(K\) or relying on labels.

ReMatch: Boosting Representation through Matching for Multimodal Retrieval

ReMatch fine-tunes Multi-modal Large Language Models (MLLMs) as embedding models by appending a "chat-style Yes/No matching" task and a "multi-learnable token" representation. This allows generative capabilities to provide instance-level discriminative signals for retrieval embeddings, achieving a new SOTA on MMEB with almost zero additional inference cost.

ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

The authors propose ReMoRa, which directly operates on compressed video representations (I-frames + motion vectors). Through the Refined Motion Representation (RMR) module, coarse block-level motion vectors are refined into fine-grained motion representations similar to optical flow. A Hierarchical Motion State Space (HMSS) module is then utilized for linear-time long-range temporal modeling, surpassing baselines on benchmarks such as LongVideoBench, NExT-QA, and MLVU.

RetFormer: Multimodal Retrieval for Enhancing Image Recognition

RetFormer shifts world knowledge from "compressed model weights" to an "external image-text knowledge base." It performs k-NN retrieval for query images, calculates the contribution of each neighbor using an image-text cross-fusion attention module, and merges this with the backbone branch. This approach improves the overall accuracy on ImageNet-LT from 78.3% to 81.9% in long-tail recognition and noisy label learning.
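
The retrieve-then-merge flow can be sketched as follows. This is a simplification under stated assumptions: softmax-weighted neighbor voting stands in for the paper's image-text cross-fusion attention module, and `knn`, `retrieval_probs`, `fuse`, `temperature`, and `alpha` are hypothetical names and parameters.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn(query, bank, k):
    # Return the k knowledge-base entries most similar to the query.
    return sorted(bank, key=lambda item: cosine(query, item["feat"]), reverse=True)[:k]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def retrieval_probs(query, bank, k, num_classes, temperature=0.1):
    # Weight each neighbor by similarity, then accumulate weights per class.
    neighbors = knn(query, bank, k)
    weights = softmax([cosine(query, n["feat"]) / temperature for n in neighbors])
    probs = [0.0] * num_classes
    for w, n in zip(weights, neighbors):
        probs[n["label"]] += w
    return probs

def fuse(backbone_probs, retrieved, alpha=0.5):
    # Convex combination of the backbone branch and the retrieval branch.
    return [(1 - alpha) * b + alpha * r for b, r in zip(backbone_probs, retrieved)]

bank = [
    {"feat": [1.0, 0.0], "label": 0},
    {"feat": [0.0, 1.0], "label": 1},
    {"feat": [0.1, 0.9], "label": 1},
]
query = [0.2, 0.8]
retrieved = retrieval_probs(query, bank, k=2, num_classes=2)
fused = fuse([0.6, 0.4], retrieved)
print(retrieved, fused)
```

In this toy case the backbone alone prefers class 0, but both retrieved neighbors belong to class 1, so the fused prediction flips to class 1.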

Rethinking BCE Loss for Multi-Label Image Recognition with Fine-Tuning

The authors find that fine-tuning CLIP with BCE for multi-label recognition systematically disrupts the semantic geometry of text embeddings, leading to a breakdown in calibration (under-confidence in base classes and over-confidence in new classes). They propose Class-wise Covariance Regularization (CCR)—which uses predicted covariance estimated from "jointly inactive class pairs" within a batch to align with the text semantic correlation matrix. As a lightweight structural regularizer applied over BCE, it fixes calibration while enhancing generalization.

Rethinking Cross-Modal Anchor Alignment for Mitigating Error Accumulation

Addressing a long-ignored error source in "noisy correspondence learning" for image-text retrieval—where clean anchor pairs themselves exhibit cross-modal inconsistency (anchor correlation discrepancy)—this paper uses Fourier Transform to align anchor representations in the frequency domain. Based on this, it performs geometry-aware soft label correction combined with a Semantic-Constrained Triplet loss to suppress error accumulation, consistently achieving SOTA retrieval accuracy across three datasets.

Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token

This paper proposes SELF1E, the first MLLM segmentation method that operates without a dedicated mask decoder and uses only a single [SEG] token. By employing Residual Features Refilling (RFR) and Residual Features Amplifier (RFA), the approach restores resolution loss caused by pixel-shuffle compression, achieving performance competitive with decoder-based methods across multiple segmentation tasks.

Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance

Addressing the perennial challenge of "which vision encoder to choose for VLM," this paper systematically validates that traditional intuitions—selecting the largest model or the one with the highest zero-shot accuracy—are nearly uncorrelated with final VLM performance. Instead, it proposes using Gromov-Wasserstein (GW) distance to measure the "structural similarity" between visual representations and LLM text representations as a training-free, inference-only proxy metric. Theoretically, the paper proves that GW distance bounds the Lipschitz constant (learnability) of the cross-modal projector. Experimentally, across 60+ full VLM training runs, this metric correlates more strongly with final performance than all baseline indicators, enabling the prediction of the optimal encoder within 1 minute before full training.
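
For reference, the standard discrete squared-loss Gromov-Wasserstein distance is written below. The notation is an assumption for illustration: \(C^{v}\) and \(C^{t}\) denote pairwise-dissimilarity matrices within the visual and text representation sets, \(p\) and \(q\) their sample weights, and \(\Pi(p, q)\) the set of couplings with those marginals; the paper's exact choice of costs and weights may differ.

\[
\mathrm{GW}(C^{v}, C^{t}, p, q) = \min_{T \in \Pi(p, q)} \sum_{i,k} \sum_{j,l} \left| C^{v}_{ik} - C^{t}_{jl} \right|^{2} T_{ij}\, T_{kl}
\]

Because the objective compares only within-space distances, it can relate two embedding spaces of different dimensionality without any learned projector, which is what makes it usable as a training-free proxy.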

Revisiting Model Stitching in the Foundation Model Era

This paper systematically investigates the feasibility of stitching Visual Foundation Models (VFMs). It discovers that traditional stitching methods fail for VFMs and proposes a two-stage training strategy—"Final Feature Matching + Task Loss"—to enable reliable stitching of heterogeneous VFMs. The resulting stitched models can even outperform individual VFMs. Furthermore, the VFM Stitch Tree (VST) architecture is introduced to provide a controllable accuracy-efficiency tradeoff for multi-VFM systems.

Revisiting Visual Corruptions in LVLMs: A Shape-Texture Perspective on Model Failures

Starting from "corruption type heterogeneity," this paper finds that image corruptions disrupt LVLM perception along two complementary dimensions—shape and texture—inducing two opposite failure modes. Accordingly, a training-free dual-path contrastive decoding framework, ST-CD, is proposed. It utilizes edge maps and jigsaw puzzles as probes to amplify respective biases and adaptively fuses correction signals via entropy, consistently improving robustness against heterogeneous corruptions across multiple LVLMs.

RNED: Rotary Number Encoding and Decoding for Medical VLMs

To address the inherent weakness of medical VLMs in "numerical prediction," this paper proposes RNED: the encoding side follows the RoPE paradigm by using a "value-dependent rotary matrix" to rotate a scalar into a dedicated [NUM] token (norm-preserving, order-preserving, wide range), while the decoding side employs score-matching to retrieve continuous values from hidden states. It consistently outperforms existing VLM baselines on radiology measurement estimation and medical visual grounding tasks.
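
The norm-preserving property of a value-dependent rotation can be checked with a small sketch. It is only an illustration in the spirit of RoPE: the base vector, the frequency list, and the function name `rotary_encode` are hypothetical, and the paper's actual construction of the [NUM] token may differ.

```python
import math

def rotary_encode(value, base_vec, freqs):
    """Rotate consecutive coordinate pairs of base_vec by angle value * freq."""
    out = []
    for i, f in enumerate(freqs):
        x, y = base_vec[2 * i], base_vec[2 * i + 1]
        theta = value * f
        c, s = math.cos(theta), math.sin(theta)
        out.extend([c * x - s * y, s * x + c * y])
    return out

def norm(v):
    return math.sqrt(sum(x * x for x in v))

base = [1.0, 0.0, 0.5, 0.5]   # hypothetical learned base embedding
freqs = [1.0, 0.01]           # hypothetical fast and slow frequencies
enc_small = rotary_encode(3.2, base, freqs)
enc_large = rotary_encode(120.0, base, freqs)
print(norm(base), norm(enc_small), norm(enc_large))
```

Each pair is rotated rather than scaled, so the embedding norm is identical for 3.2 and 120.0; mixing fast and slow frequencies is one common way to keep nearby and distant values distinguishable.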

Role-SynthCLIP: A Role-Play Driven Diverse Synthetic Data Approach

Role-SynthCLIP uses "multi-expert role-play prompting" to drive MLLMs to generate multiple complementary captions from cognitive perspectives such as composition, narrative, and emotion. After denoising with a distilled role-aware filter, the approach achieves 64.1% Recall@1 on MS-COCO for CLIP-B/16 using only 1M images, surpassing the strongest synthetic data baseline trained on 5M pairs.

ROSE: Rotate Your Large Language Model to See

Instead of concatenating visual features as tokens into the LLM input (which causes long sequences, quadratic complexity, and dilutes language priors), this work encodes visual semantics into orthogonal rotation matrices that are directly left-multiplied onto the pre-trained weights of the LLM. This avoids context expansion and maintains the angular structure between parameters (i.e., language priors) through orthogonality. The resulting 7B ROSE model matches Qwen2.5-VL-7B across 12 multimodal benchmarks while reducing FLOPs by 80.7% and inference latency by 56.4%.
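
The claim that orthogonality maintains the angular structure can be verified on a toy example. This is a sketch under assumptions: a 2D rotation with a fixed angle stands in for the visually conditioned rotation, and how ROSE derives its rotation matrices from visual features is not modeled.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column(m, j):
    return [row[j] for row in m]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rotation_2d(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

W = [[1.0, 2.0], [0.5, -1.0]]   # toy pre-trained weight matrix
R = rotation_2d(0.7)            # hypothetical angle produced from visual input
W_rot = matmul(R, W)            # left-multiply the rotation onto the weights

before = cosine(column(W, 0), column(W, 1))
after = cosine(column(W_rot, 0), column(W_rot, 1))
print(before, after)
```

The cosine between weight columns is unchanged by the rotation, and no extra tokens are appended to the input sequence.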

Rosetta Stone for Unified MLLMs: A Unified Tokenizer to Decipher Understanding and Generation

To address the long-standing conflict between reconstruction and semantic tasks in unified visual tokenizers, the authors combine four components: hierarchical decoupling within a single encoder (shallow layers for pixel reconstruction, deep layers for semantic alignment), supervision from multiple foundation models (CLIP/DINOv2/SAM), dual codebooks with attention-prioritized mapping, and coarse-to-fine reconstruction guided by converged semantics. This achieves an rFID of 0.33 and zero-shot accuracy of 80.9% on ImageNet, while the resulting 7B unified MLLM outperforms TokenFlow-13B by 3.1% in understanding.

RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning

RxnCaption reformulates "Reaction Diagram Parsing (RxnDP)" from predicting molecular bounding box coordinates to an "image captioning" task. It utilizes a specialized molecular detector, MolYOLO, to pre-annotate molecular boxes and indices on the diagram, allowing the LVLM to describe reactions by simply referencing these indices in natural language. Combined with the newly created U-RxnDiagram-15k real-world dataset, it achieves SOTA performance across multiple metrics.

SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning

The authors propose SALMUBench—the first benchmark for association-level machine unlearning in CLIP-like models. It consists of a \(60\text{K}\) synthetic dataset of person-sensitive attribute pairs, a pair of Compromised/Clean models trained from scratch, and a structured holdout evaluation protocol. The study systematically reveals three failure modes in existing unlearning methods: catastrophic collapse, over-generalized unlearning, and ineffective unlearning.

Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs

The authors propose two benchmarks, REST and REST+, which present the same problem to MLLMs in three forms: "pure text," "pure image (rendered text-as-image)," and "mixed text-image." Under strict control of OCR accuracy, they measure the phenomenon of "same content, different answers" (cross-modal inconsistency). Evaluation of 15 frontier MLLMs reveals that none achieve stable consistency across all three modalities (inconsistency rates of at least ~10%, exceeding 80% at worst). Models generally prefer the text modality, and this inconsistency is significantly correlated with the cosine similarity of internal text-image representations (modality gap).
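
One simple way to quantify "same content, different answers" is the share of questions whose answers disagree across presentation forms. The definition below is an assumption for illustration, not necessarily the benchmark's exact metric, and it omits the OCR-accuracy control described above.

```python
def inconsistency_rate(answers_by_modality):
    """Fraction of questions whose answers are not identical across all modalities."""
    per_modality = list(answers_by_modality.values())
    n = len(per_modality[0])
    mismatched = sum(1 for i in range(n) if len({m[i] for m in per_modality}) > 1)
    return mismatched / n

# Hypothetical model answers to the same four questions in three forms.
answers = {
    "text":  ["A", "B", "C", "D"],
    "image": ["A", "B", "D", "D"],
    "mixed": ["A", "C", "C", "D"],
}
rate = inconsistency_rate(answers)
print(rate)
```

Here two of the four questions receive different answers depending on the modality, giving a rate of 0.5.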

Same or Not? Enhancing Visual Perception in Vision-Language Models

The authors redefine "fine-grained visual perception" as a simple binary task: determining whether two similar images depict the same object instance. Based on this, they construct the TWIN dataset with 561K pairs and apply GRPO reinforcement learning for VLM post-training. This approach improves Qwen2.5-VL's performance on a self-built FGVQA benchmark by up to 19.3% without degrading general VQA capabilities.

SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics

SaPaVe proposes an end-to-end active manipulation framework. By employing a bottom-up training strategy with decoupled camera and manipulation actions, it first learns active perception priors using 200,000 semantic camera control pairs, followed by joint optimization for active manipulation. In real-world scenarios, it improves the success rate by 31.25% over π₀ and GR00T N1.

Scaling Spatial Intelligence with Multimodal Foundation Models

SenseNova-SI cultivates spatial intelligence capabilities in multimodal foundation models (such as Qwen3-VL, InternVL3, and Bagel) by systematically constructing a diverse spatial dataset of 8 million samples (SenseNova-SI-8M). It achieves unprecedented performance on multiple spatial benchmarks like VSI-Bench and MMSI while maintaining general multimodal understanding capabilities.

Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

The authors propose FlexMem, a training-free visual memory mechanism that constructs a visual memory bank through iterative dual-path KV cache compression. Combined with encoding-based and fast-indexing memory retrieval strategies, it enables MLLMs to process long videos exceeding 1000 frames on a single NVIDIA RTX 3090 GPU, significantly outperforming existing efficient video understanding methods.

Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models

This paper proposes Scene-VLM, the first video scene segmentation framework based on fine-tuned VLMs. By utilizing structured multimodal shot representations (visual frames + dialogue + metadata), causal sequence prediction, a context-focus window mechanism, and token logit confidence extraction, it achieves significant gains of +6 AP and +13.7 F1 on MovieNet and demonstrates natural language explanation capabilities.

SEA-Vision: A Multilingual Benchmark for Document and Scene Text Understanding in Southeast Asia

The authors introduce the SEA-Vision benchmark, which provides a unified evaluation for document parsing (15,234 pages) and text-centric VQA (7,496 QA pairs) across 11 Southeast Asian languages. By employing a re-rendering strategy to eliminate visual-text misalignment in multilingual VQA, the study reveals a 3–7x performance degradation in MLLMs when handling low-resource Southeast Asian languages.

SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering

Addressing the lack of suitable metrics for "sketch quality," this paper proposes SEA, a reference-free metric that combines three signals—recognition probability \(P\), total number of commonsense elements \(E\), and the number of elements actually drawn \(V\)—into a reward-penalty score. It specifically measures the abstraction efficiency of "preserving recognizability with minimal strokes." The authors also release CommonSketch, the first sketch dataset with element-level annotations (300 classes, 23,100 human sketches). Experiments demonstrate that SEA achieves high alignment with human judgment (approx. 88%).

SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker

This paper proposes SEATrack, a multimodal tracker that dynamically aligns cross-modal attention maps via AMG-LoRA and performs efficient cross-modal fusion for global relation modeling via HMoE. It achieves a SOTA performance-efficiency balance in RGB-T/D/E tracking with minimal parameters.

SeD-UD: An Influence-Driven and Hierarchically-Decoupled Information Bottleneck for Multimodal Intent Recognition

Addressing the coexistence of redundancy and noise in text/audio/visual features for multimodal intent recognition, SeD-UD proposes an Influence-Driven Adaptive Bottleneck (IDAB) that dynamically adjusts the bottleneck dimension per sample. It hierarchically decouples the process into two steps: parallel unimodal de-redundancy followed by unified denoising after fusion, outperforming existing SOTA on MIntRec, MELD-DA, and CH-SIMS.

See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

Addressing the dual dilemma where structural artifacts generated by modern diffusion models are difficult to annotate manually and VLMs fail to comprehend them, this paper proposes ArtiAgent—a fully automated pipeline comprising perception, synthesis, and verification agents. By manipulating Positional Embeddings (PE) and Value Embeddings within DiT self-attention, it injects plausible artifacts into real images. This enables the synthesis of 100,000 artifact data samples with bounding boxes and explanations under zero human labor. Open-source VLMs fine-tuned on this data outperform GPT-5 across detection, localization, and explanation tasks.

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

SWIM is a training strategy that supervises the "object noun token → visual token" cross-attention in MLLMs using object masks during the training phase only. This enables the model to precisely locate user-specified objects from pure text prompts. At inference, it requires no visual prompts such as points, boxes, or masks, outperforming expert models that rely on visual prompts on video fine-grained understanding benchmarks.

Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions

This paper proposes the tactile localization task: identifying regions in an image that share the same material properties as a given tactile input. The method learns dense cross-modal features through local visual-tactile alignment and a material diversity pairing strategy, and the authors also construct two new tactile-material segmentation datasets.

Self-guided Semantic Inspection for Zero-Shot Composed Image Retrieval

Addressing the training-inference mismatch in Zero-Shot Composed Image Retrieval (ZS-CIR)—where models are trained on "aligned image-text pairs" but must handle "unaligned reference images + modified text" during inference—this paper proposes DiffComp. It introduces a "Differentiate-then-Compose" self-supervised paradigm that actively masks visual regions aligned with text phrases during training to artificially introduce cross-modal differences, followed by difference-aware adaptive fusion. DiffComp achieves SOTA performance across four ZS-CIR benchmarks.

Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning

TG-DP decouples "masked reconstruction" and "contrastive alignment" in audio-visual pre-training into two independent forward passes (each with its own mask ratio). It uses a full-view teacher network to select visible tokens for the contrastive branch and distill global representations, eliminating semantic noise from previous single-pass coupling and achieving SOTA on zero-shot retrieval and linear probing for AudioSet / VGGSound.

SenseSearch: Empowering Vision-Language Models with High-Resolution Agentic Search-Reasoning via Reinforcement Learning

SenseSearch enables a 7B VLM to autonomously coordinate three tools—"text search + image search + image crop"—during multi-turn reasoning. Through two-stage training (cold-start SFT + self-developed BN-GSPO reinforcement learning), the model learns to address both "knowledge-intensive" and "high-resolution fine-grained perception" challenges, outperforming same-scale baselines by 19.18 points on the new HR-MMSearch benchmark.

Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning

This paper proposes the Similarity-as-Evidence (SaE) framework, which reinterprets VLM text-image similarity as Dirichlet evidence. It calibrates overconfident softmax outputs through the Similarity Evidence Head (SEH) and runs an interpretable, efficient medical active learning process based on a dual-factor acquisition strategy of vacuity (knowledge gap) and dissonance (evidence conflict). It achieves a SOTA macro-average accuracy of 82.57% across 10 datasets with a 20% labeling budget.
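
The two acquisition signals named above can be illustrated with the standard subjective-logic definitions of vacuity and dissonance over a Dirichlet distribution. This is a minimal sketch, not the paper's implementation: the function name `dirichlet_uncertainty` is hypothetical, the per-class evidence is assumed to be given (how SaE derives it from similarities is not described here), and the paper's exact formulas may differ.

```python
def dirichlet_uncertainty(evidence):
    """Return (beliefs, vacuity, dissonance) for non-negative per-class evidence.

    Assumes alpha_k = evidence_k + 1, the usual evidential-learning convention.
    """
    k = len(evidence)
    strength = sum(evidence) + k              # S = sum of Dirichlet parameters
    beliefs = [e / strength for e in evidence]
    vacuity = k / strength                    # high when total evidence is low

    def balance(bi, bj):
        # 1 when two beliefs are equal, 0 when one of them dominates completely.
        return 1.0 - abs(bi - bj) / (bi + bj) if (bi + bj) > 0 else 0.0

    dissonance = 0.0
    for i, bi in enumerate(beliefs):
        others = [bj for j, bj in enumerate(beliefs) if j != i]
        denom = sum(others)
        if denom > 0:
            dissonance += bi * sum(bj * balance(bj, bi) for bj in others) / denom
    return beliefs, vacuity, dissonance


# No evidence at all: maximal vacuity, no conflict.
print(dirichlet_uncertainty([0, 0, 0]))
# Strong but evenly split evidence: low vacuity, high dissonance.
print(dirichlet_uncertainty([10, 10]))
```

Under these definitions, a sample with little evidence scores high on vacuity, while a sample with strong but conflicting evidence scores high on dissonance, which is why the two signals are complementary for sample selection.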

SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models

SIMPACT proposes a test-time simulation-augmented action planning framework that automatically constructs physical simulation environments from a single RGB-D image. This enables VLMs to propose actions, observe simulation results, and iteratively refine reasoning, achieving SOTA performance on rigid and deformable object manipulation tasks without additional training.

SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More

SketchVL enables MLLMs to "draw" each step of chart reasoning as visual annotation actions (boxes, lines, points, circles) on the image. It introduces the FinePO algorithm to redistribute the coarse-grained advantage of an entire trajectory to each step based on scores from a Process Reward Model (FinePRM). This achieves step-level fine-grained credit assignment, yielding an average improvement of 7.23% over base models across chart, natural image, and math benchmarks.
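
One way to picture step-level credit assignment is to split a single trajectory-level advantage across steps using weights derived from process-reward scores. The sketch below is an assumption-laden illustration, not FinePO itself: the function `redistribute_advantage`, the softmax weighting, the sign handling, and the `temperature` parameter are all hypothetical choices.

```python
import math


def redistribute_advantage(trajectory_advantage, step_scores, temperature=1.0):
    """Split one trajectory advantage into per-step advantages.

    Hypothetical scheme: softmax weights over step scores. For a positive
    advantage, higher-scored steps receive more credit; for a negative one,
    lower-scored steps receive more blame. The mean of the returned values
    equals the original trajectory advantage.
    """
    sign = 1.0 if trajectory_advantage >= 0 else -1.0
    exps = [math.exp(sign * s / temperature) for s in step_scores]
    total = sum(exps)
    n = len(step_scores)
    return [trajectory_advantage * n * e / total for e in exps]


print(redistribute_advantage(1.0, [0.9, 0.1, 0.5]))
print(redistribute_advantage(-1.0, [0.9, 0.1, 0.5]))
```

Keeping the mean equal to the trajectory advantage is a design choice made here so that the overall update magnitude stays comparable to the coarse-grained baseline.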

Small Object, Great Challenge: A Benchmark for Small Object Visual Grounding

Addressing the bias in existing Visual Grounding (VG) benchmarks toward large objects, this paper constructs the RefCOCOs benchmark (320k referring expressions) using an MLLM-based automated pipeline on COCO, where the average target area is only 1.60% of the image. A strong baseline, SoVG-Net, featuring a Hierarchical Text Injection (HTI) module, is proposed, achieving leading performance in [email protected] and mIoU for small object localization and segmentation.

SMAP: Semantic Route Planning with Map-Grounded Multimodal Alignment

SMAP feeds user queries, structured POI metadata, and a "north-up map tile marking only candidate POIs" into a multimodal large model for semantic route planning. It utilizes a "generator drafts, validator corrects via map" process to automatically create preference pairs, followed by training with Hallucination-Penalized DPO (HDPO). This boosts a 32B open-source model to match or exceed GPT-5 in route efficiency, temporal rationality, and overall quality.

SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

Addressing the overlooked issue of "whether and how experts should specialize by modality" in MoE-VLMs, this paper proposes SMoES. It uses layer-wise dynamic soft modality scores to characterize the actual vision/text fusion degree of tokens, bins experts into groups aligned with deployment devices, and drives specialization through inter-bin mutual information regularization. Across 4 MoE-VLMs and 16 benchmarks, it achieves average gains of 0.9%/4.2% on multimodal/language tasks while reducing expert parallelism (EP) communication overhead by 56.1% and increasing throughput by 12.3%.

SO-Bench: A Structural Output Evaluation of Multimodal LLM

Proposed by Apple, SO-Bench is the first benchmark to systematically evaluate the capability of Multimodal Large Language Models (MLLMs) to convert visual inputs into structured outputs conforming to predefined JSON Schemas. Using a three-stage automated annotation pipeline, SO-Bench constructs 1.8K "Image–Schema–Instruction" triplets from 112K images across four domains and 6.5K JSON Schemas. Accompanied by a three-level evaluation metric, it reveals a significant performance gap: even the strongest model, Gemini-2.5-Pro, achieves an exact match accuracy of only 18.9%.
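
To make the "exact match" notion concrete, the sketch below compares a model's raw text output with a reference object after JSON parsing. It is only an illustration under assumptions: the helper names are hypothetical, and the benchmark's actual three-level metric and matching rules are not specified in this summary.

```python
import json


def exact_match(pred_text, reference):
    """True if pred_text parses as JSON and equals the reference object.

    Object key order is ignored, list order is significant. Python equality
    is used, so 1 and 1.0 compare equal; a stricter metric may differ.
    """
    try:
        pred = json.loads(pred_text)
    except json.JSONDecodeError:
        return False
    return pred == reference


def exact_match_accuracy(pred_texts, references):
    """Fraction of predictions that exactly match their reference."""
    hits = sum(exact_match(p, r) for p, r in zip(pred_texts, references))
    return hits / len(references)


reference = {"title": "Invoice", "items": [{"name": "pen", "qty": 2}]}
print(exact_match('{"items": [{"qty": 2, "name": "pen"}], "title": "Invoice"}', reference))
print(exact_match('{"title": "Invoice", "items": []}', reference))
print(exact_match("not json", reference))
```

A single wrong or missing field fails the whole sample under this criterion, which helps explain why exact-match scores can be far lower than field-level scores.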

SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning

To address the issue in CLIP test-time prompt tuning (TPT) where imposing strict orthogonal constraints to enhance class separability leads to overconfidence and poor calibration, this paper replaces hard orthogonal constraints with a Huber-style smooth orthogonal calibration (SoC). By applying a capped, gentle repulsion to semantically similar class prototypes, SoC significantly reduces the Expected Calibration Error (ECE) while maintaining high classification accuracy.
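
The contrast between a hard orthogonality constraint and a capped, smooth one can be sketched by applying a Huber function to pairwise cosine similarities of class prototypes. This is a minimal illustration, assuming a penalty averaged over prototype pairs; the function names and the `delta` value are hypothetical and the paper's exact loss may differ.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def huber(x, delta):
    """Quadratic near zero, linear beyond delta, so the gradient is capped at delta."""
    ax = abs(x)
    return 0.5 * x * x if ax <= delta else delta * (ax - 0.5 * delta)


def smooth_orthogonality_penalty(prototypes, delta=0.3):
    """Mean Huber penalty over pairwise cosine similarities of prototypes."""
    total, pairs = 0.0, 0
    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            total += huber(cosine(prototypes[i], prototypes[j]), delta)
            pairs += 1
    return total / pairs if pairs else 0.0


print(smooth_orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]]))   # orthogonal pair
print(smooth_orthogonality_penalty([[1.0, 0.0], [1.0, 0.0]]))   # identical pair
```

Because the penalty grows only linearly for highly similar prototypes, semantically close classes are pushed apart with a bounded force rather than being driven toward strict orthogonality.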

Socratic-Geo: Synthetic Data Generation and Cross-Modal Geometric Reasoning via Multi-Agent Interaction

Socratic-Geo utilizes a "Teacher-Solver-Generator" three-agent closed-loop framework. Starting from only 108 seed problems, the Teacher diagnoses Solver failures and procedurally modifies geometric diagrams using Python code with self-verification. This creates a strictly aligned curriculum of geometric problems. The Solver achieves 49.11% across six benchmarks using only 1/4 of the training data (2.43 points higher than the strongest baseline), while the byproduct Generator reaches 42.4 on GenExam-Math, setting a new open-source SOTA.

SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models

SOTA converts classification outputs from various foundation models (VLMs like CLIP, VFMs like DINO) into cost matrices. It utilizes a self-adaptive Optimal Transport (OT) with a "squared inner product" objective to solve for a soft assignment transport plan. This training-free and prior-free approach automatically balances model contributions, achieving significant performance gains over the strongest single models across 26 benchmarks in natural, remote sensing, and medical domains.

SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs

The SPARROW framework is proposed to integrate temporal consistency supervision via Target-Specific Tracking Features (TSF) and stabilize first-frame initialization using dual-prompt ([BOX]+[SEG]) coarse-to-fine decoding. Designed as a plug-and-play module for existing video MLLMs, it achieves consistent improvements across six benchmarks and three tasks.

Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

Addressing the two major efficiency bottlenecks of Masked Discrete Diffusion Models (MDM)—where thousands of redundant mask tokens are fed into the network every step and KV caching is incompatible—Sparse-LaViDa proposes an equivalent transformation using "sparse parameterization + register tokens + step-causal attention masks." Without breaking the bidirectional context of MDM, it allows the model to process only the "small batch of tokens to be decoded" per step, achieving up to ~2.8× acceleration in text-to-image (T2I), image editing, and visual math reasoning with minimal loss in quality and accuracy.

Sparse Spectral LoRA: Routed Experts for Medical VLMs

This paper proposes MedQwen, which partitions SVD spectral segments of pre-trained weights into non-overlapping experts and utilizes a top-k router to select spectral priors based on inputs. Accompanied by theoretically grounded residual compensation and scaling rules, it aligns the training dynamics of low-rank MoE with full-rank full-parameter fine-tuning. MedQwen approaches the performance of full fine-tuning across 23 medical datasets (with 339× fewer parameters) and suppresses catastrophic forgetting in sequential training from \(>20\text{--}50\%\) to approximately \(5\%\).

SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

This paper proposes SpatialScore, the most comprehensive multimodal spatial intelligence benchmark to date (5K samples / 30 tasks), and enhances the spatial understanding capabilities of MLLMs through two complementary paths: a data-driven SpatialCorpus (331K QA) fine-tuning scheme and a training-free SpatialAgent (12 tools).

SpatialTree: How Spatial Intelligence Branches Out in MLLMs

Inspired by cognitive science, this work deconstructs the spatial intelligence of Multi-modal Large Language Models (MLLMs) into 27 atomic capabilities across four layers ("Perception → Mapping → Simulation → Execution"). It introduces SpatialTree-Bench, the first "capability-centric" hierarchical benchmark. Through SFT/RL intervention experiments, the study reveals that low-layer capabilities are mutually independent but exhibit strong transfer toward high-layer ones, and excessive "thinking" can impair intuitive perception. Consequently, an "auto-think" strategy is proposed to achieve stable RL improvements across all hierarchical levels.

Spot The Ball: A Benchmark for Visual Social Inference

This paper introduces the SPOT THE BALL benchmark: humans and VLMs are tasked with inferring the location of a ball from sports images where it has been erased. The study finds that while humans rely on social cues like player gaze and pose—achieving 2–3x the accuracy of models—four leading VLMs only utilize superficial spatial heuristics like "guessing the center" or "near players," exposing systematic deficiencies in current VLMs regarding visual social inference.

STAR: Test-Time Adaptation Can Enhance Universal Prompt Learning for Vision-Language Models

STAR enables CLIP models that have undergone few-shot prompt tuning to continue self-adapting during the inference stage using unlabeled test streams (mixing ID and OOD samples). It first uses Fisher scores for adaptive soft gating to separate ID/OOD, then generates reliable pseudo-labels via conjugate optimization for unsupervised fine-tuning, and finally utilizes a dynamic prototype library for class-calibrated OOD detection—significantly reducing FPR95 compared to LoCoOp/SCT on ImageNet-1K.

STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval

STiTch is a training-free zero-shot composed image retrieval (ZS-CIR) framework. It first leverages an MLLM to sample multiple target descriptions (treated as a discrete distribution), then constructs a "transition vector" in the embedding space using the text modifier to correct these descriptions toward the target image and filter out noise from the reference image. Finally, it models the "description set vs. target image augmentation set" as a set-to-set Bi-directional Conditional Transport (CT) distance for retrieval scoring. It achieves overall state-of-the-art performance among training-free methods on four benchmarks: CIRCO, CIRR, FashionIQ, and GeneCIS.

Streaming Video Instruction Tuning (Streamo)

Streamo integrates the decision of "when to speak" directly into the next-token prediction of video large models. By using three state tokens (Silence/Standby/Response), the model judges response timing frame-by-frame. Combined with a 465,000-sample multi-task streaming instruction dataset for end-to-end training, it transforms offline video models into online assistants capable of real-time narration, localization, and QA, outperforming the previous SOTA, Dispider, by 13.83% on OVO-Bench.

Structural Graph Probing of Vision-Language Models

This paper constructs a "correlation graph" based on the pairwise neuron correlations within each layer of a Vision-Language Model (VLM). Using GCN graph probes, the authors demonstrate that this population-level topological structure can predict model behavior, characterize the evolution of cross-modal fusion with depth, and locate "hub neurons" that significantly alter the output upon perturbation. This introduces a novel intermediate scale for interpretability, positioned between "local attribution" and "full circuit recovery."
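
The first step described above, building a graph from pairwise neuron correlations within a layer, can be sketched as follows. This is an illustration under assumptions: Pearson correlation and a fixed absolute threshold are used here, the function names are hypothetical, and the paper's graph construction and GCN probe are not reproduced.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def correlation_graph(activations, threshold=0.5):
    """activations[i] holds neuron i's activations over a set of inputs.

    Returns undirected edges (i, j), i < j, whose absolute correlation
    is at least the threshold.
    """
    edges = []
    for i in range(len(activations)):
        for j in range(i + 1, len(activations)):
            if abs(pearson(activations[i], activations[j])) >= threshold:
                edges.append((i, j))
    return edges


layer = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 1.0, 2.0]]
print(correlation_graph(layer, threshold=0.9))
```

The resulting edge list is the kind of input a graph probe would consume, with one such graph built per layer.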

StructXLIP: Enhancing Vision-Language Models with Multimodal Structural Cues

StructXLIP utilizes edge maps as proxy representations of visual structure, introducing three structure-centric losses (edge-structural text alignment + local region-text block matching + edge-color image connection) during CLIP fine-tuning. By maximizing the mutual information of multimodal structural representations, the model is guided toward a more robust, semantically stable optimal solution, outperforming existing competitors in cross-modal retrieval tasks.

SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception

SynCLIP identifies "synonym-induced grounding inconsistency" in existing CLIP-based open-vocabulary dense perception methods—where the spatial attention shifts when the same object is described using different synonyms. It introduces a Synonym-to-label Spatial Attention alignment (SSA) loss and a Semantic-induced Attention Refinement (SAR) module that leverages DINOv2 for semantic token selection and context aggregation. On OV-COCO and OV-LVIS, SynCLIP achieves SOTA results among CLIP-based methods and reduces the performance drop caused by synonym replacement from ~9 AP to 4.4 AP.

Synthesizing Visual Concepts as Vision-Language Programs

This paper treats the VLM as a "perception function" rather than a "reasoner": the VLM extracts structured symbolic descriptions from images, and program synthesis over a Domain-Specific Language (DSL) then searches for an executable logical program that expresses visual rules. This approach consistently outperforms direct VLM prompting on inductive visual reasoning tasks, while producing programs that are naturally interpretable and manually correctable.

Tackling Alignment Ambiguity in Person Retrieval through Conversational Attribute Mining

To address the persistent "alignment ambiguity" in Text-to-Image Person Retrieval, this paper utilizes Multimodal Large Language Models (MLLM) to extract fine-grained attributes through "multi-turn QA" and summarizes them into a compact description. A Bi-directional Cross-attention Mixer refines these summaries with image tokens, while a Confidence-Aware Weighted Loss suppresses noise in MLLM-generated dialogues, achieving new SOTA Rank-1 results across three benchmarks.

Tackling Model Bias via Game-theoretic Multi-agent Collaboration Framework for Hateful Meme Classification

GECO organizes three Large Multimodal Models (LMMs), one learnable agent, and one primary decision agent into a regularized game. Driven by a "hybrid reward" system to achieve consensus on correct labels, it suppresses both individual and inter-model cognitive biases, achieving new SOTA results on five hateful meme benchmarks.

TANGO: Text-Anchored Guided Optimization for Robust Fine-tuning Vision-Language Models under Label Noise

TANGO utilizes a set of "clean and immutable semantic anchors" generated by the CLIP text encoder as ground truth references independent of training labels. It replaces the noise-vulnerable parametric linear classification head with a non-parametric retrieval-based voting mechanism and employs anchors to validate and correct noisy samples. It achieves new SOTA results across six noisy benchmarks (e.g., 83.83% on CIFAR-100N, a 4.79% improvement over the strong baseline DeFT).

Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models

The TARA framework is proposed to inject taxonomic hierarchical knowledge into Large Multimodal Models (LMMs) by aligning their intermediate representations with the taxonomy-aware features of Biological Foundation Models (BFMs), significantly improving hierarchical visual recognition performance for both known and novel categories.

TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

TempR1 unifies five video temporal tasks (Temporal Grounding TG, Dense Temporal Grounding DTG, Temporal Action Localization TAL, Video Highlight Detection VHD, and Grounded Video QA GVQA) into a multi-task reinforcement learning framework based on GRPO. The key lies in designing localization rewards based on three types of "predicted interval ↔ ground-truth instance" mappings (one-to-one, many-to-one, and many-to-many). It achieves new SOTA results across five benchmarks, demonstrating positive synergy where multi-task joint training benefits individual tasks.
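
Localization rewards of this kind are typically built on temporal IoU between predicted and ground-truth intervals. The sketch below shows temporal IoU plus one possible many-to-many aggregation; `mean_best_iou` is a hypothetical aggregation chosen for illustration, not the reward defined in the paper.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def mean_best_iou(preds, gts):
    """For each ground-truth interval, take the best-matching prediction, then average."""
    if not preds or not gts:
        return 0.0
    return sum(max(temporal_iou(p, g) for p in preds) for g in gts) / len(gts)


print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
print(mean_best_iou([(0.0, 10.0), (20.0, 30.0)], [(0.0, 10.0), (20.0, 25.0)]))
```

With a single prediction and a single ground-truth instance, `mean_best_iou` reduces to plain temporal IoU, which corresponds to the one-to-one case.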

Test-Time Attention Purification for Backdoored Large Vision Language Models

This paper finds that backdoor behavior in LVLMs is essentially cross-modal attention hijacking, in which trigger visual tokens seize attention from text tokens. It proposes CleanSight, the first training-free test-time backdoor defense framework, which eliminates backdoor effects by detecting and pruning visual tokens with abnormally high attention.
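
The idea of pruning visual tokens with abnormally high attention can be illustrated with a simple outlier rule. This is a hypothetical sketch: the median-ratio test, the `ratio` parameter, and the function name are assumptions made for illustration and are not the detection rule used by CleanSight.

```python
import statistics


def prune_high_attention_tokens(attention, ratio=3.0):
    """Return (kept_indices, pruned_indices) for per-visual-token attention mass.

    A token is pruned when its attention exceeds `ratio` times the median
    attention over all visual tokens (an assumed outlier criterion).
    """
    cutoff = ratio * statistics.median(attention)
    kept = [i for i, a in enumerate(attention) if a <= cutoff]
    pruned = [i for i, a in enumerate(attention) if a > cutoff]
    return kept, pruned


# Token 4 absorbs most of the attention, as a trigger patch might.
print(prune_high_attention_tokens([0.1, 0.1, 0.1, 0.1, 0.6]))
```

A median-based cutoff is used here because a mean-based one would itself be inflated by the very tokens it is meant to detect.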

Text-Printed Image: Bridging the Image-Text Modality Gap by "Printing" Text into Images

To fine-tune Large Vision Language Models (LVLMs) when real images are unavailable and only text descriptions exist, this paper proposes Text-Printed Image (TPI)—rendering text descriptions directly onto a plain white canvas as image input. By forcing text through the vision encoder, TPI bridges the modality gap while preserving 100% of the text semantics. It consistently outperforms "text-only" and "diffusion-generated image (T2I)" baselines across 4 models and 7 benchmarks.

The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models

This paper attributes the root cause of the "ID accuracy / OOD generalization / Adversarial robustness" trilemma in VLM robust finetuning to sharp anisotropic minima in parameter space and deformed feature manifolds under perturbation. It proposes the GRACE framework: utilizing layer-wise adaptive low-rank adversarial weight perturbation to flatten the loss curvature, combined with Gram volume alignment loss to stabilize the feature manifold. When finetuning CLIP on ImageNet, it simultaneously improves all three axes (ID 74.2%, OOD 57.0%, Adversarial 22.4%).

The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition

This paper reveals that open-source LLMs lack hierarchical taxonomic knowledge of the visual world (even failing at basic biological taxonomic systems), which makes the LLM a bottleneck for hierarchical visual recognition in Vision LLMs.

The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

The authors propose the Contrastive Fusion (ConFu) framework, which extends CLIP-style pairwise contrastive learning to tri-modal higher-order alignment. By learning both paired and fused representations within a unified objective, it supports both 1→1 and 2→1 retrieval.

TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval

The TIGeR framework is proposed to learn a unified geo-temporal embedding space for images, locations, and time using a multimodal Transformer. It unifies three tasks—geolocation, time-of-capture prediction, and geo-temporal aware image retrieval—and introduces a high-quality benchmark dataset of 4.5M images.

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

This work systematically investigates the key factors for constructing Video Temporal Grounding (VTG) capabilities in MLLMs. From the dimensions of data quality and algorithmic design, the authors release the high-quality TimeLens-Bench and the TimeLens-100K training set. By adopting an interleaved text-time encoding and a thinking-free RLVR training paradigm, they develop the TimeLens model series, achieving SOTA among open-source models and surpassing GPT-5 and Gemini-2.5-Flash.

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

This paper proposes TIPSv2, discovering that distillation significantly enhances patch-text alignment. This insight is transformed into a new pretraining objective, iBOT++ (where visible tokens also participate in loss computation). Combined with Head-only EMA and multi-granularity text augmentation, the model achieves SOTA across 20 datasets in 9 tasks.

Token Warping Helps MLLMs Look from Nearby Viewpoints

This paper proposes performing spatial warping on ViT image tokens of MLLMs (rather than traditional pixel-level warping) to simulate viewpoint changes. Backward token warping is found to maintain semantic consistency while remaining robust to depth estimation noise, significantly outperforming pixel-level warping, specialized spatial reasoning MLLMs, and generative warping methods on the self-constructed ViewBench.

Towards Calibrating Prompt Tuning of Vision-Language Models

Addressing the "dual miscalibration" issue (base class under-confidence + novel class over-confidence) in prompt-tuned CLIP, this work proposes mean-variance margin regularization and text moment matching loss. These complementary regularization terms serve as plug-and-play modules that significantly reduce ECE across 7 prompt tuning methods and 11 datasets.
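
Expected Calibration Error (ECE), the metric reported above, bins predictions by confidence and averages the gap between confidence and accuracy in each bin. The sketch below is a generic equal-width-bin version; the bin count and the half-open bin boundaries are assumptions, and evaluation code for the paper may bin differently.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins [k/n_bins, (k+1)/n_bins); confidence 1.0 goes to the last bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Over-confident predictions: 90% confidence but only 75% correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

Under-confidence on base classes and over-confidence on novel classes both raise this value, since the absolute gap is taken in each bin.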

Towards Dynamic Modality Alignment in Multimodal Continual Learning

This paper argues that "modality alignment is not a static one-time constraint, but a dynamic process evolving with tasks and network layers." It constructs a "Dynamic Alignment Graph" for each task (nodes are cross-modal cluster centroids, intra-layer edges capture token interactions, and inter-layer edges capture representation propagation). By using three-level graph regularization to lock the evolution of old class subgraphs while keeping new ones flexible, it prevents shallow misalignments from snowballing into deeper layers. On the MTIL 11 dataset, it pushes Avg./Last accuracy to 79.4%/87.1% with only 1.8M trainable parameters, exceeding the previous strongest baseline DIKI by approximately +3.1%/+2.0%.

Towards Multimodal Domain Generalization with Few Labels

This paper defines and investigates the new Semi-Supervised Multi-modal Domain Generalization (SSMDG) problem, proposing a unified framework driven by consensus-based pseudo-labeling, disagreement-aware regularization, and cross-modal prototype alignment to achieve cross-domain generalization under sparse labeling.

Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

This paper constructs the first million-scale industrial defect "image-text pair" dataset, IMDD-1M (1.24 million images, 63 manufacturing domains, 421 defect types), and trains a text-conditioned diffusion foundation model from scratch. It unifies segmentation, detection, classification, and generation into a single framework. Downstream tasks achieve performance close to specialized models using only about 200 samples per class (less than 5% of the annotation volume required by expert models).

Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

This paper proposes DocHumming, a data-training co-design framework. It constructs the large-scale synthetic dataset DocMix-3M via Realistic Scene Synthesis and implements a Document-Aware Training Recipe combining progressive learning with structural token weighting. DocHumming achieves an Overall score of 93.75 on OmniDocBench using only a 1B MLLM (surpassing Qwen3-VL-235B's 89.15), with a performance degradation of only 6.72 points in realistic capture scenarios (compared to 18-20 points for modular methods).

Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models

Addressing "reasoning-capable" Multimodal Large Language Models (MLLMs), this paper proposes RMLLMU-Bench to specifically measure information leakage within reasoning chains and the preservation of reasoning capabilities. It introduces R-MUSE, a training-free, inference-time intervention framework that employs subspace guidance and adaptive steering to erase target answers and intermediate reasoning traces while minimizing disruption to general reasoning.

Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters

TOGA attaches an "image-patch-text" heterogeneous graph teacher during the training phase for fine-grained cross-modal reasoning. These relational insights are distilled into the key-value cache of a Tip-Adapter student. During inference, the entire graph teacher is discarded, keeping the inference path identical to Tip-Adapter (zero extra latency or VRAM). It achieves new SOTA results on 11 benchmarks across 1–16 shot settings.

Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation

The authors propose CES (Coordinator-Executor-State Tracker), a multi-agent framework and phased execution-feedback reinforcement learning algorithm. By decoupling high-level task planning from low-level execution through specialized training of the Coordinator and State Tracker, the framework significantly enhances the planning and state management capabilities of GUI agents in long-horizon tasks.

TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration

TreeTeaming proposes an automated red-teaming framework based on a hierarchical strategy tree. Driven by an LLM-based Orchestrator, it dynamically explores and evolves attack strategies, achieving SOTA Attack Success Rates (ASR) across 12 mainstream VLMs (87.60% on GPT-4o) and identifying diverse new attack methods beyond known strategy sets.

TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

The TRivia self-supervised fine-tuning framework is proposed, which enables VLMs to learn table recognition directly from unlabeled table images through table Question Answering (QA)-driven GRPO reinforcement learning. With 3B parameters, TRivia-3B outperforms proprietary models such as Gemini 2.5 Pro and GPT-5 across multiple benchmarks.

TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models

Addressing the limitation where existing CLIP-based OOD detection relies on "fixed external OOD labels" that fail to cover the open world, TTL updates only a set of learnable OOD textual prompts on the test stream. It employs pseudo-labels to amplify OOD similarity, a purification loss to eliminate noise from ID boundary samples, and a textual knowledge base for cross-batch score calibration. TTL reduces the average FPR95 by 12.67% and improves AUROC by 3.94% across nine OOD datasets in two major benchmarks.
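
FPR95, one of the two metrics reported above, is the false positive rate on OOD samples at the threshold where 95% of ID samples are accepted. The sketch below assumes higher scores mean "more ID-like" and uses a hypothetical function name; tie handling in real evaluation code may differ.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID when at least `tpr` of ID samples are accepted."""
    sorted_id = sorted(id_scores, reverse=True)
    k = math.ceil(tpr * len(sorted_id))      # number of ID samples that must be accepted
    threshold = sorted_id[k - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.75, 0.1]
print(fpr_at_tpr(id_scores, ood_scores, tpr=0.5))
print(fpr_at_tpr(id_scores, ood_scores, tpr=1.0))
```

Lower is better: a detector that separates the two score distributions well can accept 95% of ID data while letting few OOD samples through.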

TTRV: Test-Time Reinforcement Learning for Vision Language Models

TTRV enables off-the-shelf decoder-based VLMs to perform reinforcement learning directly on unlabeled test data during the inference stage. Driven by two self-supervised rewards—"frequency of the model's own output" and "entropy of the output distribution"—through GRPO, it achieves an average 24.6% improvement in object recognition and 10.0% in VQA across 16 datasets. It even pushes the ImageNet recognition of InternVL3-8B beyond GPT-4o.
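
The two label-free signals mentioned above can be computed from a group of sampled answers to the same input. The sketch below is a minimal illustration under assumptions: the function names are hypothetical, the advantage normalization is the usual GRPO-style group standardization, and how the paper weights or combines the frequency and entropy terms is not stated in this summary.

```python
import math
from collections import Counter


def frequency_rewards(samples):
    """Reward each sampled answer by its empirical frequency within the group."""
    counts = Counter(samples)
    return [counts[s] / len(samples) for s in samples]


def empirical_entropy(samples):
    """Shannon entropy (nats) of the empirical answer distribution."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within the group (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


answers = ["cat", "cat", "dog", "cat"]
rewards = frequency_rewards(answers)
print(rewards)
print(empirical_entropy(answers))
print(group_relative_advantages(rewards))
```

Answers that agree with the majority receive positive advantages, so the update pushes the model toward its own most consistent prediction, while the entropy term measures how concentrated the sampled answers are.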

TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

TUNA cascades a VAE encoder and a semantic representation encoder to obtain a set of continuous unified visual representations compatible with both "understanding" and "generation." Combined with an autoregressive text head and a flow-matching generation head, a single native model at 1.5B/7B scale achieves SOTA results in image/video understanding, image/video generation, and image editing (MMStar 61.2, GenEval 0.90).

Twin-T & TwintVQA: A Reliable Structure-Detail Separating VLM and a Comprehensive Benchmark for Chart and Table Tasks

Twin-T explicitly separates and recombines chart structural cues (axes, grids, layout) and detail cues (values, legends, text) using a "dual-head image encoder + Schur-style fusion." It further enhances numerical and keyword fidelity via MINT preference learning. Accompanying this is the TwintVQA benchmark, covering 17 chart types, 11 tasks, and 3 formats. The 7B model outperforms GLM-4.5V-106B on mainstream chart-table leaderboards, approaching the performance of GPT-4o and Gemini-2.5-Pro.

UARE: A Unified Vision-Language Model for Image Quality Assessment, Restoration, and Enhancement

UARE integrates Image Quality Assessment (IQA), restoration, and enhancement into a single Mixture-of-Transformers (MoT) based vision-language model. By employing a two-stage training strategy with interleaved "reason-then-restore" data, the authors systematically validate the hypothesis that "explicit quality analysis improves restoration results" for the first time. The model achieves competitive performance across SR, multi-degradation restoration, and IQA tasks.

UI-Lens: Assessing General MLLMs' Potential to Automate UI Display Quality Assurance

UI-Lens constructs a multilingual UI display defect detection benchmark for real-world commercial Apps (4,759 Chinese interfaces + 3,392 English interfaces, with 6 defect categories and expert naming). Systematic evaluation of 9 mainstream MLLMs/VLMs reveals that they perform nearly identically to random guessing on fine-grained boundary defects (Text Overflow F1 only 22.19%) and cross-interface semantic consistency (F1 only 11.44%), exposing a fundamental shortcoming: current models "recognize what the object is but ignore how it is presented."

Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models

This paper proposes Beta-KD, an uncertainty-aware knowledge distillation framework from a Bayesian perspective. By modeling teacher supervision as a Gibbs prior and deriving a closed-form solution via Laplace approximation, it automatically adjusts the balance between data and teacher signals, consistently improving distillation performance on multimodal VQA benchmarks.

Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models

The UNCHA framework is proposed to model the semantic representativeness of part images relative to the whole scene using hyperbolic uncertainty in hyperbolic VLMs. By utilizing uncertainty-guided contrastive and entailment losses, it enhances compositional scene understanding and outperforms existing hyperbolic VLMs across multiple downstream tasks.

Understanding Counting Mechanisms in Large Language and Vision-Language Models

The authors use a set of controlled "repetitive object counting" experiments and a self-developed causal probing tool, CountScope, to dissect LLMs and LVLMs layer-by-layer and token-by-token. They find that counting is not a one-time summation but a hierarchical process emerging across layers, driven by "internal counters" that update incrementally and rely heavily on structural shortcuts like delimiters.

Understanding Task Transfer in Vision-Language Models

This paper presents the first systematic study of how fine-tuning a Vision-Language Model (VLM) on one visual perception task affects its zero-shot performance on other perception tasks. It proposes the Perfection Gap Factor (PGF), a normalized metric to quantify cross-task transfer. Using three scales of Qwen-2.5-VL, the study reveals structural patterns in task transfer (positive/negative transfer cliques, task persona classification, scale dependencies, etc.) and demonstrates that PGF can guide data selection to enhance fine-tuning efficiency.

UNI-OOD: Unified Object- and Image-level Out-of-Distribution Detection via Cross-Context Attentive Vision-Language Modeling

UNI-OOD employs two identical pairs of CLIP image-text encoders to model the "target object" and "background" respectively. By leveraging four types of cross-context attention (intra-image, inter-image, inter-text, and image-text alignment), it decouples fine-grained object evidence from spurious background associations. It is the first single model to achieve SOTA performance in both object-level and image-level OOD detection without requiring prior knowledge of the task type at inference.

UNICBench: UNIfied Counting Benchmark for MLLM

This paper introduces UNICBench, the first unified cross-modal (Image/Text/Audio) multi-level counting benchmark. It contains 14,301 QA pairs (5,508+5,888+2,905) categorized by three capability levels (Pattern/Semantic/Reasoning) × three difficulty levels (Easy/Medium/Hard). Systematic evaluation of 45 SOTA MLLMs reveals that performance on basic counting tasks is approaching human level, while significant gaps remain on reasoning-level and hard tasks.

Unified Personalized Understanding, Generating and Editing

OmniPersona achieves "personalized understanding, generation, and editing" within a single unified Large Multimodal Model (LMM). By using structurally decoupled concept tokens, the model routes the same concept to different expert subspaces according to the task to reduce mutual interference. It further employs an inference-time "explicit knowledge recaptioning" mechanism to extract concept attributes through QA before feeding them into generation. This framework integrates personalized image editing into a unified model for the first time and introduces the OmniPBench evaluation benchmark.

Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models

Addressing the "modal imbalance" issue where MLLMs become more text-biased and visual representations homogenize in deeper layers, this paper proposes LaVer. It performs masked reconstruction of visual tokens within the LLM's latent semantic space (latent MIM) and utilizes Clipped Gram-Anchoring to prevent feature collapse. This provides direct supervision for visual representations, yielding significant improvements in dense visual tasks like OCR and vision-centric benchmarks (e.g., OCRBench +19.22%).

UVU: Improving Multimodal Understanding via Vision-Language Unified Autoregressive Paradigm

UVU shifts visual supervision from a "post-training auxiliary constraint" to the "main driver of pre-training." It abandons Vector Quantization (VQ), uses continuous visual encoding for lossless input images, and constructs a 200,000-entry pixel-level visual codebook through large-scale iterative hierarchical clustering. This allows the LLM to generate pixel-level image tokens similarly to text tokens during autoregressive next-token prediction. Consequently, fine-grained visual perception is embedded into the model's perception backbone without relying on external decoders. The 3B model significantly outperforms same-class models like Qwen2.5-VL across 12 understanding benchmarks.

VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging

VCU-Bridge proposes a three-layer progressive visual connotation understanding framework ("Foundational Perception → Semantic Bridging → Abstract Connotation") along with HVCU-Bench for layer-wise diagnosis. The study finds that MLLM performance consistently declines as the reasoning hierarchy ascends. By utilizing MCTS-guided instruction tuning data to strengthen low-level perception, the approach achieves improvements on this benchmark and an average gain of +2.53% on general benchmarks (+7.26% on MMStar).

Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping

This paper defines the new task of Aesthetic Guidance (AG) and constructs the AesGuide benchmark (10,748 images with aesthetic scores, analysis, and guidance annotations). It proposes Venus, a two-stage framework that first empowers MLLMs with aesthetic guidance capabilities through progressive aesthetic Q&A, and then activates aesthetic cropping capabilities via CoT reasoning, achieving SOTA performance on both tasks.

Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models

VisionToM is a lightweight vision-based intervention framework that enhances Theory of Mind (ToM) reasoning in MLLMs by probing and intervening in attention heads sensitive to visual input and ToM logic. Without fine-tuning, the method significantly improves performance on the EgoToM benchmark by guiding the model to focus on visual evidence.

VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion

This paper proposes VideoFusion, the first large-scale infrared-visible video fusion framework. By integrating Cross-modal Differential Representation Enhancement (CmDRM), Complete Modality Guided Fusion (CMGF), and Bidirectional Temporal Collaborative Attention Mechanisms (BiCAM), it jointly models cross-modal complementarity and temporal dynamics to generate spatio-temporally consistent high-quality fused videos. Additionally, the M3SVD dataset consisting of 220 videos/154,000 frames is constructed.

ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting

ViKey significantly enhances the temporal reasoning capabilities of VideoLLMs under training-free conditions by overlaying sequential frame index visual prompts (VP) on video frames combined with a lightweight Keyword-Frame Mapping (KFM) module. It approaches dense frame performance using only 20% of the frames.

VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA

VinQA proposes a "visual element interleaved long-form answer generation" task and dataset for real-world documents. Answers are no longer pure text but insert cited figures, tables, and charts directly before the corresponding supporting text. The work introduces two encoding methods for raw page images (Page and Modality Encoding) and a multimodal scoring framework, M-GroSE. Fine-tuning the open-source Qwen2.5-VL-7B on the VinQA training set improves the M-GroSE Avg from ~2.0 to ~3.34, significantly closing the gap with closed-source frontier models like GPT-4.1 and Claude 3.5.

Vision-Language Model Guided Source-Free Domain Adaptation via Optimal Transport

VSFOT liberates Source-Free Domain Adaptation (SFDA) from the self-training "dead loop" of generating pseudo-labels for itself. Instead, it utilizes a frozen CLIP as an external semantic prior to soft-align target features with source classifier prototypes via Optimal Transport (OT). Simultaneously, the task model fine-tunes CLIP through reverse distillation, forming a complementary bidirectional distillation framework that consistently outperforms existing SFDA methods across four benchmarks.
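To make the "soft alignment via Optimal Transport" idea concrete, here is a minimal entropic-OT (Sinkhorn) sketch between target samples and class prototypes. The summary does not specify the cost function, the marginals, or the solver, so the uniform marginals, the regularization strength, and the toy cost matrix below are all assumptions, and Sinkhorn is used here only as a standard stand-in.

```python
import math


def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropic OT with uniform marginals; returns a soft transport plan.

    cost[i][j] is the (assumed) cost of assigning target sample i to prototype j.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the uniform marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy cost: sample 0 is close to prototype 0, sample 1 is close to prototype 1.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost)
# Normalizing each row of the plan gives a soft pseudo-label per target sample.
soft_labels = [[p / sum(row) for p in row] for row in plan]
print(soft_labels)
```

Row-normalizing the plan yields soft assignments that respect a global balance constraint, which is the usual motivation for OT over independent per-sample argmax labels.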

Vision-Speech Models: Teaching Speech Models to Converse about Images

This paper proposes MoshiVis, which uses a set of lightweight gated cross-attention adaptation modules to transform Moshi, a real-time full-duplex speech dialogue large model, into a Vision-Speech Model (VSM) capable of "seeing images and chatting via speech." By utilizing single-stage mixed fine-tuning with "speech-free image-text data + a small amount of image-speech data," the training cost is compressed to one day on 8×H100, with an added inference latency of only about 7ms per step.

VISion On Request: Enhanced VLLM Efficiency with Sparse, Dynamically Selected, Vision-Language Interactions

VISOR proposes a new efficiency paradigm distinct from vision token compression: it sparsifies the vision-language interaction layers within the LLM (using minimal cross-attention and dynamically selected self-attention layers). It achieves 8.6-18\(\times\) FLOPs savings while preserving full high-resolution vision tokens, significantly outperforming token compression methods, particularly on difficult tasks requiring fine-grained understanding.

VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models

VisMem equips Vision-Language Models (VLMs) with a "latent vision memory" system. Based on cognitive psychology, memory is bifurcated into "short-term/vision-led" and "long-term/semantic-led" types. These are dynamically triggered by special tokens during autoregressive generation to instantly generate latent memory vectors for context insertion. Trained via two-stage reinforcement learning, it achieves an average improvement of 11.0% across 12 benchmarks compared to the original model.

VisPlay: Self-Evolving Vision-Language Models

VisPlay enables a single base VLM to simultaneously act as both a "Questioner" and a "Responder." Using only unlabeled images, the system automatically scores questions based on the responder's answer uncertainty and generates pseudo-labels via majority voting. The two roles evolve alternately through self-play using GRPO. VisPlay achieves consistent performance gains across 8 visual reasoning benchmarks, nearly matching the performance of models trained with manual annotations using GRPO.

Visual Grounding for Object Questions

This paper proposes a new task, Visual Grounding for Object Questions (VGOQ), which shifts the focus from "where the direct answer is" to "locating visual evidence/context that supports answering open-ended abstract questions." The authors developed two automated data pipelines to create the VizWiz-VGOQ and ABO-VGOQ benchmarks and trained a lightweight CLIPSeg-style model with only 1.77M parameters. This model outperforms large-scale models like GLaMM, UnifiedIO, and OFA on the VGOQ task and remains competitive with the contemporaneous Qwen3-VL.

VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

This paper constructs VisualOverload using 150 public domain paintings with ultra 4K resolution and highly dense human activities. It is a VQA benchmark featuring 2,720 human-annotated QA pairs with private ground truth, specifically designed to test foundational perception (Activity/Attribute/Counting/OCR/Reasoning/Scene Classification) of VLMs in "visual overload" scenarios. Experimental results across 37 models show that even the strongest model, o3, achieves only 19.6% accuracy on the hardest subset, suggesting that the notion that "foundational visual understanding has been solved" is an illusion.

VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment

VITAL automatically annotates 4.58 million vision-language pairs using six scoring models followed by multi-LMM cross-review. By freezing the LLM and training only the vision encoder via generative pre-training, it produces a foundation vision encoder for visual quality assessment that generalizes across image/video scoring and descriptions while being seamlessly transferable to arbitrary LLM decoders.

VKG-QA: Visual Knowledge Graph-based Question Answer for Large Multimodal Models

By visualizing Knowledge Graphs as images and tasking Large Multimodal Models (LMMs) with "looking at the graph" for question answering, the authors constructed the VKG-QA benchmark covering 3 categories, 14 subtasks, and 3,205 questions. Evaluations of 19 LMMs reveal that current models generally struggle to "understand graph structures," with structural perception (degree, direction, connectivity) being the most prominent weakness. Closed-source models significantly outperform open-source counterparts.

VL-Eraser: Vacuum Distillation for Machine Unlearning in Vision-Language Models

VL-Eraser points out that traditional "reverse-training" unlearning in VLMs primarily destroys cross-modal alignment rather than truly removing knowledge. It reformulates unlearning as a two-stage "distillation-then-deletion" process: first, distilling the targeted knowledge into a set of LoRAs under "vacuum space" constraints, and then subtracting these LoRAs from the original model to achieve cleaner deletion while preserving model utility.
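The "subtract the LoRAs" step can be sketched as plain weight arithmetic. The sketch assumes a single linear layer and a low-rank update of the usual LoRA form (delta = B·A). The "vacuum space" constraints and the distillation stage are not modeled, and the `scale` parameter is a hypothetical knob.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def subtract_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight - scale * (lora_b @ lora_a), removing the distilled update."""
    delta = matmul(lora_b, lora_a)
    return [
        [w - scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 LoRA that is assumed to hold the knowledge to forget.
weight = [[1.0, 2.0], [3.0, 4.0]]
lora_b = [[1.0], [0.0]]   # shape (2, 1)
lora_a = [[0.5, 0.5]]     # shape (1, 2)
unlearned = subtract_lora(weight, lora_b, lora_a)
print(unlearned)
```

Only the first row changes here, because the toy rank-1 update touches only that row. The rest of the layer is left intact, which mirrors the stated goal of preserving model utility.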

VL-RouterBench: A Benchmark for Vision-Language Model Routing

This paper introduces VL-RouterBench, the first systematic routing benchmark for Vision-Language Models (VLMs), covering 14 datasets, 17 candidate models, and 519,180 sample-model pairs. It evaluates 10 routing methods and reveals a significant performance gap between the current optimal routers and the ideal Oracle.
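The gap between a router and the ideal Oracle can be illustrated on a toy correctness matrix. The sketch assumes each sample-model pair is recorded as correct (1) or incorrect (0) and ignores inference cost. The benchmark's actual metrics may differ, and the function names are illustrative.

```python
def oracle_accuracy(correct):
    """Accuracy if every sample were routed to a model that answers it correctly."""
    return sum(max(row) for row in correct) / len(correct)


def best_single_model_accuracy(correct):
    """Accuracy of the best fixed model, i.e. no routing at all."""
    n = len(correct)
    return max(sum(col) / n for col in zip(*correct))


def router_accuracy(correct, choices):
    """Accuracy of a router that sends sample i to model choices[i]."""
    return sum(row[c] for row, c in zip(correct, choices)) / len(correct)


# Rows are samples, columns are candidate models (toy data).
correct = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 0],
]
print(oracle_accuracy(correct))              # upper bound for any router
print(best_single_model_accuracy(correct))   # baseline a router should beat
print(router_accuracy(correct, [0, 0, 1, 2]))
```

A useful router lands between the best single model and the Oracle. The distance to the Oracle is the headroom the benchmark reports.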

VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression

The authors discovered that off-the-shelf VLMs (Gemini 2.5-Flash) can zero-shot reproduce human pairwise preference judgments. By treating the VLM as a "perceptual judge" and utilizing Diffusion DPO to post-train a FlowMo-based diffusion autoencoder, they developed VLIC—an image compression system highly aligned with human perception that achieves SOTA performance across most perceptual metrics.

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

VLM-3R integrates a metric-scale feed-forward 3D reconstruction model (CUT3R) with a VLM. It extracts implicit scene geometry tokens and camera motion tokens from pure monocular video, which are then fused into visual features via cross-attention for instruction tuning. This allows the model to perform spatial and temporal reasoning without relying on depth sensors or pre-built point cloud maps, achieving the highest performance among open-source models on VSI-Bench and the newly proposed VSTI-Bench.

VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery

A VLM-based dual-memory self-reflective Critique Agent is proposed to generate group-level preference signals for diffusion-based human mesh recovery. The diffusion model is fine-tuned via Group Preference Alignment, significantly improving HMR accuracy in in-the-wild scenarios without 3D annotations.

VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models

The VLM-Loc framework is proposed, which converts 3D point cloud maps into BEV images and scene graphs for structured spatial reasoning by VLMs. Combined with a Partial Node Assignment (PNA) mechanism for fine-grained text-to-point cloud alignment, it significantly outperforms previous SOTA on the self-built CityLoc benchmark with a 14.20% improvement in Recall@5m.

VLM4RSDet: Collaborative Optimization with Vision-Language Model for Enhancing Remote Sensing Object Detection

VLM4RSDet enables a conventional closed-set detector and a vision-language model (Florence-2) to share a vision backbone and perform joint backpropagation during the training phase, "distilling" VLM prior knowledge into the detector’s features. During inference, the VLM is discarded, leaving only the standard detection branch. This achieves SOTA detection accuracy with zero additional overhead (e.g., mAP\(_{0.5:0.95}\) on VisDrone2019 improved by 7.5% over previous best methods).

Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks

This paper presents the first systematic study of Model Inversion (MI) attacks on VLMs. It proposes a set of inversion strategies tailored for token generation characteristics (TMI/TMI-C/SMI) and the SMI-AW method, which dynamically weights token gradient contributions based on visual attention intensity. The approach achieves a human-evaluated attack accuracy of up to 61.21% across 4 VLMs and 3 datasets, revealing significant privacy risks regarding training data in VLMs.

Vocabulary Scaling Law: Tuning Open-vocabulary Predictors for Their Openness

This paper theoretically proves that the ability of CLIP to maintain accuracy on old classes (stability) and recognize new classes (extensibility) as the vocabulary expands is lower-bounded by the "prediction confidence over the complete open vocabulary universe \(U\)." Based on this, it proposes three tuning principles (covering the entire \(U\), tuning only class-name embeddings, and adding orthogonal constraints to trained/open class-name embeddings) and implements SVFT, a fine-tuning method that uses submodular greedy selection to approximate \(U\). SVFT significantly outperforms existing fine-tuning methods in both stability and extensibility.

VQ-VA World: Towards High-Quality Visual Question-Visual Answering

This paper brings the capability of "visual question-visual answering" (VQ-VA)—originally exclusive to closed-source systems like GPT-Image or NanoBanana—to open-source models. It utilizes a five-agent pipeline to extract approximately 1.8 million training samples from web-based interleaved documents that "require world knowledge and reasoning to complete image transformations," accompanied by the manually annotated IntelligentBench. After fine-tuning LightFusion on this data, the IntelligentBench score surged from 7.78 to 53.06, surpassing all open-source models and significantly narrowing the gap with closed-source systems.

VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments

This paper introduces VS-Bench, a multimodal benchmark consisting of ten visualized game environments. It systematically evaluates the strategic capabilities of VLMs in multi-agent settings across three dimensions: perception, strategic reasoning, and decision-making. The study reveals that current state-of-the-art models still exhibit a significant gap from optimal performance in reasoning and decision-making.

Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

Wan-Weaver proposes a decoupled architecture consisting of a Planner (VLM) and a Visualizer (DiT). By training the planner with large-scale textual-proxy data instead of real interleaved data, it achieves SOTA interleaved text-image generation. It reaches an Overall score of 8.67 on OpenING, surpassing GPT-4o (8.20) and performing competitively with Nano Banana (8.85), while maintaining strong understanding capabilities (MMMU 74.9).

WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

WEAVE constructs the first interleaved cross-modal comprehension and generation data suite for "multi-turn with historical context." It includes a 100k multi-turn dialogue training set (WEAVE-100k), a 100-question manually annotated benchmark (WEAVEBench), and a hybrid VLM evaluation framework. The study reveals that current unified multimodal models collectively fail at multi-turn image editing/generation requiring "visual memory," whereas fine-tuning with WEAVE-100k enables the emergence of visual memory capabilities.

WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs

This work diagnoses the "Time-Agnosticism" issue in current Video-LLMs and proposes the WeaveTime framework. It endows the model with temporal awareness through a Streaming Temporal Perception Enhancement (SOPE) auxiliary task during training. At inference, it implements efficient adaptive memory retrieval via an uncertainty-gated Past-Current Dynamic Focus Cache (PCDF-Cache), achieving significant improvements in streaming video QA.

WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens

WeMMU bridges a frozen VLM (Qwen2.5-VL) and a trainable diffusion model (Sana) using a set of "noisy query tokens" resampled from \(\mathcal{N}(0,I)\) at each step, alongside an external VAE linear branch to recover fine-grained details. This design resolves the "task generalization collapse" observed when fixed learnable queries migrate to new tasks, enabling efficient and sustainable learning for unified multimodal generation and editing.

Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention

This paper first uses "layer-wise visual masking" to dissect where visual information is integrated into the language stream (finding fusion concentrated in shallow-to-middle layers, with "reviewing" in deep layers), then proposes a training-free contrastive attention method. By subtracting "pre-fusion layer" attention from final layer attention, it extracts truly task-relevant image regions for secondary inference, achieving stable performance gains across 7 MLLMs and multiple VQA benchmarks.
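The subtraction step can be sketched over per-patch attention weights. The sketch assumes attention has already been aggregated into one score per image patch for both layers. Clamping negatives to zero, renormalizing, and selecting the top-k patches are illustrative choices, not details confirmed by the summary.

```python
def contrastive_attention(final_attn, prefusion_attn):
    """Subtract pre-fusion attention from final-layer attention, per patch."""
    diff = [max(f - p, 0.0) for f, p in zip(final_attn, prefusion_attn)]
    total = sum(diff)
    # Renormalize so the surviving scores sum to 1 (if anything survives).
    return [d / total for d in diff] if total > 0 else diff


def top_k_patches(scores, k):
    """Indices of the k highest-scoring patches, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy attention over four image patches.
final_attn = [0.1, 0.5, 0.2, 0.2]
prefusion_attn = [0.3, 0.1, 0.2, 0.4]
scores = contrastive_attention(final_attn, prefusion_attn)
print(scores, top_k_patches(scores, 1))
```

Patches that were already attended to before fusion cancel out, leaving the regions whose attention rises only once the question has been integrated. Those regions could then be cropped or re-weighted for the second inference pass.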

Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

Eagle is proposed as a lightweight black-box attribution framework that performs spatial attribution for MLLM autoregressive token generation using a unified objective function of insight score (sufficiency) and necessity score (indispensability). It quantifies whether each token relies on language priors or perceptual evidence, significantly outperforming existing methods in faithfulness, localization, and hallucination diagnosis while substantially reducing GPU memory requirements.

Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models

This paper proposes CORE (COncept-aware REfuser), a framework for continual unlearning in Large Vision-Language Models (LVLMs). By decomposing vision-language pairs to be deleted into fine-grained visual attributes and textual intent concepts, it utilizes a concept modulator to identify required concept combinations for rejection. Subsequently, a Mixture of Refusers generates concept-aligned refusal responses. CORE achieves the best unlearning-retention trade-off with 90.67% CRR and 88.02% AR across 16 sequential tasks.

Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

This paper explains why RL (GRPO) post-trained VLMs generalize better to out-of-distribution (OOD) data than SFT from a "data perspective": the advantage of RL does not stem from the algorithm itself, but from its advantage function naturally concentrating training signals on "medium-difficulty" samples, acting as an implicit data filter. Accordingly, the authors propose DC-SFT—explicitly removing hard samples before standard SFT—obtaining results that surpass RL on OOD while being more stable and 3–5 times faster.
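The "implicit data filter" argument can be seen in the standard group-relative advantage used by GRPO. When every rollout for a prompt gets the same reward (all correct or all wrong), the advantages are zero and the prompt contributes no gradient. The sketch below shows this, followed by a hypothetical explicit filter in the spirit of DC-SFT. The pass-rate threshold and the function names are assumptions, since the summary does not state the paper's exact selection rule.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def drop_hard_samples(pass_rates, min_pass_rate=0.2):
    """Hypothetical explicit filter: keep indices whose pass rate is high enough."""
    return [i for i, p in enumerate(pass_rates) if p >= min_pass_rate]


print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # too easy: no training signal
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # too hard: no training signal
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # medium difficulty: non-zero signal
print(drop_hard_samples([0.0, 0.5, 0.1, 1.0]))
```

Only the mixed-outcome group produces non-zero advantages, which is why the training signal concentrates on medium-difficulty samples. Filtering the SFT data up front aims to reproduce that effect without running RL.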

Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs

This paper formally defines the Widget-to-Code task for the first time, constructs the first image-only widget dataset and a multi-dimensional evaluation system, and proposes a modular baseline based on Perception Agents and WidgetFactory infrastructure. It achieves high-fidelity widget reconstruction through component decomposition, icon retrieval, reusable visual templates, and adaptive rendering.

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

WikiCLIP revives the contrastive learning paradigm for open-domain Visual Entity Recognition (VER), which has recently been overshadowed by generative methods. It utilizes an LLM to encode Wikipedia text as knowledge representations and employs visual features at the patch level to filter out irrelevant text, resulting in "knowledge-aware entity vectors." Combined with synthetic hard negatives, it outperforms the 13B generative SOTA (AutoVER) by 3.4 points on OVEN unseen entities while being nearly 100 times faster during inference.

Will Multimodal Models Be Dazzled by Multi-Image Visual Puzzles?

This paper proposes the MIRACLE benchmark—an evaluation set containing 4,000 problems and 29,400 images, with an average of 7.35 images per problem (up to 14). It forces models to perform cross-image relational reasoning to arrive at correct answers. Results indicate that even the strongest model, Gemini-2.5-Pro, achieves only 55.91%. All models collapse on high visual density tasks like jigsaw puzzles and numerical constraint reasoning, exposing significant weaknesses in current MLLMs regarding structured and collaborative visual reasoning.

World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

The authors introduce CultureMix, a food VQA benchmark utilizing diffusion models to synthesize 23,000 images featuring "co-occurring multiple cultural elements" (across 4 sub-tasks). The study evaluates 10 Large Vision-Language Models (LVLMs) on their ability to recognize food and its country of origin in mixed-culture scenarios. Findings indicate that models rely heavily on background cues and are frequently misled by "cultural distractors" (accuracy drops by 14% after adding backgrounds). Preliminary evidence suggests that Supervised Fine-Tuning (SFT) can significantly mitigate this vulnerability.