
🧩 Multimodal VLM

🧠 NeurIPS 2025 · 151 paper notes

A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1

This paper proposes M-Attack, which performs random cropping on source images and aligns them with target images via local-global or local-local matching in the embedding space, combined with a multi-CLIP model ensemble. This causes adversarial perturbations to naturally concentrate on semantically critical regions, forming clear semantic details. M-Attack achieves >90% targeted attack success rate against commercial black-box LVLMs including GPT-4.5/4o/o1.
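A minimal sketch of the crop-and-align loop described above, under stated assumptions: a list of differentiable image encoders stands in for the multi-CLIP ensemble, the step sizes and crop scales are illustrative, and the projection is a plain L-infinity ball. This is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def random_crop(x, scale=(0.5, 0.9)):
    # random crop resized back to the original resolution (illustrative)
    b, c, h, w = x.shape
    s = torch.empty(1).uniform_(*scale).item()
    ch, cw = int(h * s), int(w * s)
    top = torch.randint(0, h - ch + 1, (1,)).item()
    left = torch.randint(0, w - cw + 1, (1,)).item()
    crop = x[:, :, top:top + ch, left:left + cw]
    return F.interpolate(crop, size=(h, w), mode="bilinear", align_corners=False)

def crop_align_attack_step(x_adv, x_src, x_target, encoders, eps=16 / 255, alpha=1 / 255):
    """One PGD-style step: align a random local crop of the adversarial image with
    the target image in the embedding space of an encoder ensemble (assumed setup)."""
    x_adv = x_adv.clone().detach().requires_grad_(True)
    loss = 0.0
    for enc in encoders:  # e.g. several CLIP vision towers (stand-ins here)
        z_local = F.normalize(enc(random_crop(x_adv)), dim=-1)
        z_target = F.normalize(enc(x_target), dim=-1)
        loss = loss + (z_local * z_target).sum(-1).mean()  # cosine similarity
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + alpha * x_adv.grad.sign()          # ascend similarity
        x_adv = x_src + (x_adv - x_src).clamp(-eps, eps)   # L_inf projection
        return x_adv.clamp(0, 1)
```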

A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection

This work introduces the first multimodal framing analysis benchmark for oil and gas (O&G) industry video advertisements, comprising 706 videos, 13 framing categories, 50+ entities, and 20 countries. It systematically evaluates six VLMs on greenwashing-related framing detection, finding that GPT-4.1 achieves 79% F1 zero-shot on environmental labels but only 46% on green innovation, thereby exposing implicit framing analysis and cultural context understanding as core challenges for current VLMs.

ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking

This paper proposes ACT (Annotation with Critical Thinking), a data pipeline in which an MLLM annotates all samples in bulk, a second MLLM acting as a critic estimates the error probability of each annotation, and only high-suspicion samples are routed to human reviewers. Combined with a theoretically derived ACT loss function, the approach achieves 70–90% reduction in human annotation cost across six cross-modal datasets while maintaining a downstream performance gap of less than 2%.

AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining

AdaLRS is proposed as a plug-and-play online learning rate search algorithm that adaptively adjusts the learning rate by monitoring the loss descent velocity, reducing the cost of learning rate hyperparameter search from multiple independent training runs to a single run, achieving approximately 50% savings in training cost.
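A rough sketch of the loss-velocity idea, assuming a simple rule that compares the average loss drop over consecutive windows and scales the learning rate multiplicatively; AdaLRS's actual search criterion and update schedule are more involved than this.

```python
class LossGuidedLRController:
    """Illustrative loss-guided LR adjustment (assumed rule, not AdaLRS verbatim):
    compare the recent loss-descent velocity against the previous window and
    scale the optimizer's learning rate up or down."""

    def __init__(self, optimizer, window=200, up=1.25, down=0.8):
        self.opt, self.window = optimizer, window
        self.up, self.down = up, down
        self.losses, self.prev_velocity = [], None

    def step(self, loss_value):
        self.losses.append(float(loss_value))
        if len(self.losses) < 2 * self.window:
            return
        recent = self.losses[-self.window:]
        earlier = self.losses[-2 * self.window:-self.window]
        velocity = (sum(earlier) - sum(recent)) / self.window  # average loss drop
        if self.prev_velocity is not None:
            factor = self.up if velocity > self.prev_velocity else self.down
            for group in self.opt.param_groups:
                group["lr"] *= factor
        self.prev_velocity = velocity
        self.losses = self.losses[-self.window:]
```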

Adapting Vision-Language Models for Evaluating World Models

This paper proposes UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a unified semantic evaluator for world model rollouts constructed by fine-tuning only the projection head of PaliGemma 2 (0.07% of total parameters). UNIVERSE achieves performance comparable to task-specific models on action recognition and character recognition, while exhibiting strong alignment with human judgments.

ADMN: A Layer-Wise Adaptive Multimodal Network for Dynamic Input Noise and Compute Resources

This paper proposes ADMN (Adaptive Depth Multimodal Network), a two-stage training framework: (1) Multimodal LayerDrop fine-tuning to make the backbone robust to arbitrary layer configurations, and (2) a QoI-aware controller that dynamically allocates layer budgets across modalities. ADMN adaptively assigns layers based on per-modality quality-of-information (QoI) under strict compute constraints, matching full-model accuracy while reducing FLOPs by 75% and latency by 60%.

Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning

This paper proposes CLIC, which concatenates two images to form a composite scene and generates hard negatives via cross-image lexical swapping, while constructing multiple positive captions to enhance semantic invariance. By fine-tuning only the CLIP text encoder, CLIC simultaneously improves compositional reasoning (achieving SOTA on SugarCrepe++) and downstream retrieval performance, resolving the long-standing trade-off between compositionality and retrieval in prior methods.

AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

This paper introduces a fine-grained 3D embodied reasoning task—jointly predicting the spatial location, motion type, and motion axis of actionable elements—and proposes rendering 3D point clouds into panoramic views with projected affordance candidates, guided by a customized Chain-of-Thought (CoT) reasoning paradigm for MLLMs, achieving state-of-the-art performance with AP25 of 23.3%.

Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment

This paper proposes BACL (Boundary-Aware Curriculum with Local Attention), which combines a learnable boundary-aware negative sampler (via easy-to-hard curriculum learning) with a contrastive local attention loss (for token-level mismatch localization). On LAION-400M, BACL yields a +32% R@1 improvement over CLIP and achieves state-of-the-art results on four large-scale benchmarks.

AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making

This paper inverts the conventional instruction grounding paradigm — rather than compressing VLM knowledge into intermediate representations (symbolic skills or constraints), it renders candidate robot trajectories into multi-view scene images and evaluates action proposals directly within the VLM's native high-dimensional representation space, enabling zero-shot closed-loop robotic manipulation control.

Approximate Domain Unlearning for Vision-Language Models

This paper introduces Approximate Domain Unlearning (ADU), a novel task that enables pretrained VLMs to selectively forget recognition capabilities for specified domains (e.g., illustrations, sketches) while preserving classification accuracy on other domains (e.g., real photographs). Two modules are proposed — Domain Disentangling Loss (DDL) and Instance-wise Prompt Generator (InstaPG) — achieving substantial improvements over all baselines across four multi-domain datasets.

AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions

This paper proposes AQuaMaM—a Transformer-based autoregressive quaternion manifold model that represents each projected component of the unit quaternion as a geometrically constrained mixture of uniform distributions, enabling exact likelihood computation and fast sampling on the SO(3) rotation manifold. AQuaMaM achieves 52× faster inference and 14% higher log-likelihood compared to IPDF, with sampled distributions that closely match the ground truth.

Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering

This paper presents DeepTumorVQA, a 3D diagnostic-grade visual question answering benchmark for abdominal CT tumors, comprising 9,262 CT volumes (3.7 million slices) and 395K expert-level questions. It systematically evaluates the clinical diagnostic capability of four state-of-the-art VLMs, finding that current models perform acceptably on measurement tasks but fall far short of clinical requirements in lesion recognition and reasoning.

Attention! Your Vision Language Model Could Be Maliciously Manipulated

This paper proposes the Vision-language Model Manipulation Attack (VMA), an image-based adversarial attack method that combines first- and second-order momentum optimization with a differentiable transformation mechanism, enabling precise control over every output token of a VLM. The approach supports a range of attack scenarios (jailbreaking, hijacking, privacy breach, DoS, sponge examples) and can also be repurposed for copyright-protection watermark injection.

Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

This paper proposes Balanced Token Pruning (BTP), which jointly considers the impact of pruning on both the current layer (local) and subsequent layers (global). BTP emphasizes diversity preservation in shallow layers to maintain downstream representation quality, and attention-based selection in deep layers to preserve local output consistency. On multiple LVLMs including LLaVA and Qwen2.5-VL, BTP retains 98% of the original model's performance while keeping only 22% of visual tokens.

Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging

This paper proposes BTB3D, a 3D CT tokenizer based on causal convolutional codec, 3D Haar wavelet compression, and a three-stage progressive training strategy. It achieves substantial state-of-the-art improvements on two downstream tasks—radiology report generation and text-conditioned CT synthesis—demonstrating that "better tokens matter more than larger language models."

Beyond Greedy Exits: Improved Early Exit Decisions for Risk Control and Reliability

UAT (Unsupervised Adaptive Thresholding) designs a reliability function for early-exit DNNs to assess the quality of intermediate layer outputs, and employs a multi-armed bandit (MAB) algorithm to dynamically learn optimal exit thresholds at inference time, achieving 1.7–2.1× speedup with less than 2% performance degradation while remaining robust to distribution shift.
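A toy sketch of choosing an exit threshold with a bandit, assuming a standard UCB1 policy over a few candidate thresholds and a scalar reward that trades off an unsupervised reliability estimate against compute saved; UAT's reliability function and bandit formulation are richer than this.

```python
import math
import random

class ThresholdUCB:
    """UCB1 over candidate early-exit confidence thresholds (illustrative)."""

    def __init__(self, thresholds=(0.5, 0.7, 0.9)):
        self.thresholds = list(thresholds)
        self.counts = [0] * len(thresholds)
        self.values = [0.0] * len(thresholds)
        self.t = 0

    def select(self):
        self.t += 1
        for i, n in enumerate(self.counts):      # play each arm once first
            if n == 0:
                return i
        ucb = [v + math.sqrt(2 * math.log(self.t) / n)
               for v, n in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# usage sketch: the reward combines a reliability estimate of the exited prediction
# with the fraction of layers skipped (both are stand-in quantities here)
bandit = ThresholdUCB()
for _ in range(1000):
    arm = bandit.select()
    reliability, layers_saved = random.random(), random.random()
    bandit.update(arm, 0.7 * reliability + 0.3 * layers_saved)
```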

BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning

BioCLIP 2 trains a ViT-L on TreeOfLife-200M (214M images across 952K species) using hierarchical contrastive learning, achieving an 18% improvement over BioCLIP in zero-shot species recognition. The work further uncovers emergent properties arising from scale: embeddings automatically encode ecological relationships (e.g., Darwin's finches arranged by beak size), and intra-species variation is orthogonal to inter-species variation.

Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression

This paper proposes UltraDelta — the first data-free delta weight compression pipeline — which achieves compression ratios up to 224× across LLM/NLP/vision/multimodal models without performance degradation and even surpasses fine-tuned models, via three components: variance-guided mixed sparsity allocation, distribution-aware compression, and trace-norm-guided rescaling.

BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models

This paper proposes BridgeVLA, which projects 3D point clouds into multi-view 2D images and uses 2D heatmaps as an intermediate representation to align the input and output spaces, enabling efficient and effective 3D robot manipulation learning.

Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

This paper proposes In-Context Representation Learning (ICRL), the first training-free framework that injects representations from non-text-modality foundation models (FMs) into a text-only LLM for few-shot reasoning. Two strategies are introduced: PCA-based text-level injection and optimal transport (OT)-based embedding alignment, enabling cross-modal knowledge utilization without any parameter updates.

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

This paper introduces the Qualcomm Interactive Cooking benchmark and the LiveMamba model, presenting the first systematic evaluation of multimodal LLMs for providing real-time, step-by-step task guidance in streaming video — encompassing instruction delivery, completion detection, and error feedback.

CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

This paper proposes CAPability, a comprehensive visual captioning benchmark covering 12 dimensions across 6 perspectives. It annotates visual elements (rather than sentences) for nearly 11K images and videos, simultaneously evaluating caption correctness (precision) and thoroughness (hit). A novel "Knows but doesn't Tell" (\(K\bar{T}\)) metric is introduced to reveal the significant capability gap between MLLMs in QA versus captioning tasks.

Causal-LLaVA: Causal Disentanglement for Mitigating Hallucination in Multimodal Large Language Models

This paper identifies the root cause of object hallucination in MLLMs at the representation level—semantic entanglement induced by dataset co-occurrence bias—and proposes a dual-path causal disentanglement framework (Causal-Driven Projector + Causal Intervention Module). By applying backdoor adjustment at both the projector and the final Transformer layer to decouple co-occurring object representations, the method achieves a 22.6% improvement on MME-Perception.

ChartMuseum: Testing Chart Visual Reasoning in Large Vision-Language Models

This paper introduces ChartMuseum, a chart question-answering benchmark comprising 1,162 expert-annotated questions and real-world charts from 184 distinct sources. It is the first benchmark to systematically distinguish visual reasoning from textual reasoning, revealing that the current strongest model, Gemini-2.5-Pro, achieves only 63.0% accuracy compared to 93% for humans, with visual reasoning performance lagging behind textual reasoning by 35%–55%.

CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models

This paper proposes CHOICE, a large-scale multi-level VLM benchmark for the remote sensing domain, comprising 10,507 newly collected questions spanning 2 top-level dimensions, 6 sub-dimensions, and 23 leaf tasks across perception and reasoning, enabling the first systematic and objective evaluation of VLM remote sensing capabilities.

CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization

This paper proposes CoIDO, a bi-objective optimization framework for data selection that jointly optimizes data importance and diversity. By training a lightweight scorer on only 20% of randomly sampled data, CoIDO selects a 20% subset from LLaVA-665K that achieves 98.2% of the performance of full-data fine-tuning, while incurring the lowest computational overhead among all compared methods.

Context Informs Pragmatic Interpretation in Vision-Language Models

This work systematically evaluates the pragmatic reasoning capabilities of VLMs using iterated reference games. Models perform substantially worse than humans in the absence of context, but can rapidly leverage relevant dialogue history to achieve approximately 80% accuracy, revealing a strong dependence on contextual information.

Continual Multimodal Contrastive Learning

This paper is the first to formally define the Continual Multimodal Contrastive Learning (CMCL) problem and proposes Dual-side Null-space gradient projection (DNS), which projects gradients from new data into subspaces that do not interfere with previously acquired knowledge. DNS achieves the best stability–plasticity trade-off across 7 datasets.

CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder

This paper proposes CovMatch, which reduces the bi-level optimization of multimodal contrastive learning to a closed-form cross-covariance matrix alignment problem, enabling for the first time joint optimization of both image and text encoders for multimodal dataset distillation. Using only 500 synthetic image-text pairs, CovMatch achieves a mean retrieval recall of 38.4 on Flickr30K (+6.8% over SOTA LoRS), substantially outperforming frozen-text-encoder approaches in extremely data-efficient settings.

CyIN: Cyclic Informative Latent Space for Bridging Complete and Incomplete Multimodal Learning

This paper proposes the CyIN framework, which constructs an informative latent space via token-level and label-level information bottlenecks (IB), and employs cyclic cross-modal translation to reconstruct missing modality information, simultaneously optimizing complete and incomplete multimodal learning within a single unified model.

DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding

This paper introduces DanmakuTPPBench, the first multimodal Temporal Point Process (TPP) benchmark integrating temporal, textual, and visual modalities. DanmakuTPP-Events provides 7,250 video sequences with 10.8 million danmaku events collected from Bilibili, naturally aligned across time, text, and video; DanmakuTPP-QA adds 10 categories of reasoning question-answer tasks constructed via a multi-agent pipeline. The benchmark reveals significant deficiencies of both classical TPP models and current LLMs/MLLMs in understanding multimodal event dynamics.

Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention

This paper proposes HoloV, a plug-and-play visual token pruning framework that adaptively allocates pruning budgets across different spatial crop regions to preserve global visual context rather than retaining only attention-highlighted tokens. On LLaVA-1.5, HoloV retains 95.8% of original performance after pruning 88.9% of visual tokens.

DOTA: DistributiOnal Test-time Adaptation of Vision-Language Models

DOTA proposes shifting test-time adaptation from a "caching sample instances" paradigm to a "continuously estimating test data distributions" paradigm. By combining online Gaussian discriminant analysis with zero-shot prediction probabilities to estimate per-class distributions, DOTA achieves gradient-free, forgetting-resistant, and efficient test-time adaptation, surpassing all baselines in average accuracy across 10 cross-domain benchmarks.
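A simplified sketch of the distribution-tracking idea, assuming per-class running means updated gradient-free from zero-shot pseudo-labels and a fixed interpolation weight; DOTA's actual estimator (online Gaussian discriminant analysis with covariance) and fusion rule are not reproduced here.

```python
import torch

class OnlineDistributionAdapter:
    def __init__(self, num_classes, dim, momentum=0.99, alpha=0.5):
        self.mu = torch.zeros(num_classes, dim)          # running class means
        self.initialized = torch.zeros(num_classes, dtype=torch.bool)
        self.momentum, self.alpha = momentum, alpha

    def predict(self, feat, zero_shot_logits):
        feat = feat.detach()                             # gradient-free adaptation
        p_zs = zero_shot_logits.softmax(-1)
        if self.initialized.all():
            dist_logits = -torch.cdist(feat, self.mu)    # (B, C) distance-based logits
            p = self.alpha * p_zs + (1 - self.alpha) * dist_logits.softmax(-1)
        else:
            p = p_zs
        # update the running mean of each sample's pseudo-labelled class
        for f, c in zip(feat, p.argmax(-1)):
            if not self.initialized[c]:
                self.mu[c], self.initialized[c] = f, True
            else:
                self.mu[c] = self.momentum * self.mu[c] + (1 - self.momentum) * f
        return p
```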

DynamicVL: Benchmarking MLLMs for Dynamic City Understanding

This paper proposes DVL-Suite, a framework comprising the DVL-Bench evaluation benchmark and the DVL-Instruct instruction-tuning dataset, covering 42 U.S. cities and 14,871 high-resolution multi-temporal remote sensing images. It systematically evaluates 18 MLLMs on long-term urban dynamic understanding and introduces DVLChat as a baseline model.

Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

This paper proposes EPIC, a framework that addresses the optimization difficulty caused by feature space perturbation during visual token compression training via progressive consistency distillation along two dimensions (Token and Layer), achieving efficient multimodal LLMs without modifying model architecture.

ElasticMM: Efficient MLLM Serving with Elastic Multimodal Parallelism

This paper proposes the Elastic Multimodal Parallelism (EMP) paradigm and the ElasticMM system, which disaggregates different stages of multimodal inference into independent instances via modality-aware load balancing and elastic partition scheduling, achieving up to 4.2× TTFT reduction and 3.2–4.5× throughput improvement over vLLM.

READ: Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions

This paper proposes READ, a fine-tuning method that enhances the compositional reasoning capability of CLIP's text encoder via two auxiliary objectives: (1) token-level reconstruction, where a frozen decoder reconstructs alternative descriptions from text embeddings, and (2) sentence-level alignment, which enforces consistency among embeddings of paraphrases. READ achieves state-of-the-art performance on 5 compositional reasoning benchmarks, outperforming NegCLIP by 4.5% and FSC-CLIP by 4.1%.

Enhancing Outcome Reward-Based RL Training of MLLMs with Self-Consistency Sampling

To address the problem of "unfaithful reasoning trajectories induced by outcome-reward RL training in multimodal multiple-choice tasks," this paper proposes Self-Consistency Sampling (SCS), which obtains consistency rewards via truncation-resampling and visual perturbation to penalize spurious reasoning. When combined with RLOO, SCS achieves an average improvement of 7.7 percentage points across six benchmarks.

Enhancing Vision-Language Model Reliability with Uncertainty-Guided Dropout Decoding

This paper proposes Dropout Decoding — a training-free inference-time method that projects visual tokens into the text space to quantify their epistemic uncertainty, selectively masks high-uncertainty visual tokens, and aggregates multiple masked decoding results via majority voting to substantially reduce object hallucinations in LVLMs.

Evaluating Multimodal Large Language Models on Core Music Perception Tasks

This paper systematically evaluates multimodal LLMs on three core music perception tasks—syncopation scoring, transposition detection, and chord quality identification—under both audio and MIDI input modalities, revealing that models approach ceiling performance on symbolic reasoning while exhibiting significant deficits in audio perception.

First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training

This paper proposes MM-UPT, a framework that introduces a third-stage "unsupervised post-training" phase following SFT and RL. By combining majority voting as a pseudo-reward signal with GRPO, MM-UPT enables self-improvement of MLLMs, boosting Qwen2.5-VL-7B from 66.3% to 72.9% on MathVista.
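A minimal sketch of the majority-vote pseudo-reward and the group-relative advantage it feeds into, assuming answers have already been parsed from sampled rollouts; the GRPO policy update itself is omitted.

```python
from collections import Counter

def majority_vote_rewards(sampled_answers):
    """Reward each rollout 1.0 if its answer matches the group's modal answer."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in sampled_answers]

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization of rewards within one group of rollouts."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

answers = ["42", "42", "17", "42"]      # answers parsed from 4 rollouts
advantages = group_relative_advantages(majority_vote_rewards(answers))
```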

FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models

FlexAC identifies that associative reasoning in MLLMs is primarily encoded in intermediate layers. By extracting steering vectors from hallucinated responses and injecting them into intermediate-layer representations at inference time, it enables flexible control over faithfulness and creativity—reducing hallucination rate by 29% (CHAIR) and improving creativity by 5.8× (Creation-MMBench), all without any training.

FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models

FlowCut reexamines the emergence of visual token redundancy in VLMs through the lens of Information Flow, and proposes a pruning framework featuring layer-adaptive pruning ratios, multi-criteria fusion scoring, and cumulative importance tracking. The approach aligns pruning decisions with the model's intrinsic information propagation behavior. On LLaVA-1.5-7B, FlowCut surpasses the previous SOTA by 1.6% at an 88.9% token reduction rate; on LLaVA-NeXT-7B, it surpasses the previous SOTA by 4.3% at a 94.4% reduction rate.

FlySearch: Exploring how vision-language models explore

FlySearch introduces a photorealistic 3D outdoor environment built on Unreal Engine 5 to evaluate the exploration capabilities of VLMs. Results reveal that state-of-the-art VLMs fail to reliably complete even simple search tasks, and the performance gap relative to humans widens dramatically as task difficulty increases.

FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering

This paper proposes FOCUS, a training-free visual cropping method that constructs object relevance maps via cosine similarity of value features in the MLLM's internal KV-cache, enabling efficient localization of question-relevant image regions. FOCUS achieves accuracy comparable to state-of-the-art methods on fine-grained VQA benchmarks while improving computational efficiency by 3–6.5×.

ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation

This paper proposes ForceVLA, which introduces 6-axis force/torque sensing as a first-class modality within the VLA framework. A Force-aware Vision-Language Mixture-of-Experts (FVLMoE) module dynamically fuses visual-language embeddings with real-time force feedback at the action decoding stage, achieving an average success rate improvement of 23.2% across five contact-rich manipulation tasks, with individual tasks reaching up to 80%.

GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images

GEM is proposed as the first multimodal large language model that unifies ECG time series, 12-lead ECG images, and text. Through a dual-encoder framework, cross-modal alignment, and knowledge-guided instruction data generation, GEM achieves grounded ECG diagnosis based on quantifiable physiological features, improving diagnostic accuracy by 7.4%, interpretability by 22.7%, and grounding capability by 25.3%.

Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

This paper proposes REVERSE, the first framework to unify generation adjustment and post-hoc verification within a single VLM. Through hallucination-aware training on 1.3M semi-synthetic samples combined with inference-time retrospective resampling, REVERSE enables a VLM to automatically detect and correct hallucinations during generation, achieving a 12% reduction on CHAIR-MSCOCO and a 34% improvement on HaloQuest.

GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization

This paper proposes GeoRanker, a distance-aware ranking framework that leverages large vision-language models (LVLMs) to model spatial relationships between queries and candidates, achieving state-of-the-art worldwide image geolocalization via a multi-order distance loss.

GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity

GLSim is a training-free object hallucination detection method for LVLMs that combines a global scene similarity score (cosine similarity between the object token and the last instruction token) and a local visual grounding similarity score (cosine similarity between the object token and the Top-K image patch embeddings localized via Visual Logit Lens). It achieves 83.7% AUROC on MSCOCO, surpassing SVAR by 9% and Internal Confidence by 10.8%.
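A compact sketch of combining the two similarity scores, assuming the object token embedding, the last instruction token embedding, and the image patch embeddings have already been extracted and projected into a common space; the equal weighting is an assumption.

```python
import torch
import torch.nn.functional as F

def global_local_score(obj_emb, instr_emb, patch_embs, top_k=8, w_global=0.5):
    """Higher score -> object more likely grounded (less likely hallucinated).
    obj_emb, instr_emb: (D,); patch_embs: (num_patches, D)."""
    obj = F.normalize(obj_emb, dim=-1)
    global_sim = (obj * F.normalize(instr_emb, dim=-1)).sum()        # scene-level
    patch_sims = F.normalize(patch_embs, dim=-1) @ obj               # per-patch cosine
    local_sim = patch_sims.topk(min(top_k, patch_sims.numel())).values.mean()
    return w_global * global_sim + (1 - w_global) * local_sim
```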

GoalLadder: Incremental Goal Discovery with Vision-Language Models

This paper proposes GoalLadder, a framework that leverages VLMs to incrementally discover and rank candidate goal states, employs an ELO rating system to handle noisy feedback, and defines distance-based rewards in a learned embedding space. Using only a single language instruction, the method trains RL agents to achieve approximately 95% success rate.
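The ELO update used to rank candidate goal states from noisy pairwise VLM comparisons follows the standard rating formula; a minimal version, with the K-factor as an assumed hyperparameter:

```python
def elo_update(rating_a, rating_b, a_preferred, k=32.0):
    """Update two candidate-goal ratings after one pairwise VLM comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_preferred else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# e.g. the VLM judges goal A closer to the instruction than goal B
r_a, r_b = elo_update(1000.0, 1000.0, a_preferred=True)
```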

Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment

This paper proposes MAPLE, a framework that leverages the inherent modality alignment capabilities of off-the-shelf MLLMs to automatically construct preference data, and introduces a Relative Preference Alignment (RPA) loss to guide cross-modal representation learning, achieving significant improvements on fine-grained retrieval tasks.

HAWAII: Hierarchical Visual Knowledge Transfer for Efficient VLM

This paper proposes the Hawaii framework, which distills knowledge from multiple visual experts into a single visual encoder via Mixture of LoRA Adapters (MoLA) and Hierarchical Knowledge Distillation (HKD), significantly improving the visual understanding capability of VLMs without incurring any additional inference cost.

HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

This work is the first to identify the systematic phenomenon that understanding capability consistently surpasses generation capability in unified multimodal large language models. It proposes the HermesFlow framework, which constructs paired understanding-generation preference data from homologous inputs, and employs Pair-DPO with iterative self-play optimization to simultaneously improve both capabilities and narrow the gap between them—without relying on any external high-quality data.

Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

This paper derives a Hierarchical Self-Attention (HSA) mechanism from the first principle of entropy minimization, providing a theoretically optimal attention computation method for nested signals (multimodal and multi-scale data). It further proves that HSA is the KL-divergence-optimal solution closest to standard Softmax attention under hierarchical block constraints.

HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models

This work presents the first theoretical analysis of frequency allocation strategies in multimodal RoPE for long-context VLMs. It proposes HoPE, which sets the lowest frequency to zero for temporal modeling to guarantee the semantic preference property, coupled with a dynamic temporal scaling mechanism, achieving gains of 8.35% on long video understanding and 22.23% on retrieval tasks.

iFinder: Structured Zero-Shot VLM Grounding for Dash-Cam Video Reasoning

This paper proposes iFinder, a modular training-free framework that decouples dash-cam video understanding into perception (structured scene representation) and reasoning (LLM). Through a hierarchical data structure and a three-block prompting strategy, iFinder endows LLMs with interpretable spatiotemporal reasoning capabilities, achieving zero-shot superiority over end-to-end V-VLMs across four driving video benchmarks, with accident reasoning accuracy gains of up to 39%.

In-Context Compositional Learning via Sparse Coding Transformer

Inspired by sparse coding, this work reinterprets the Transformer attention mechanism as projection onto encoding and decoding dictionaries, explicitly represents compositional rules via sparse coefficients, and transfers compositional rules from in-context tasks to target tasks using a lifting scheme.

In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-

This paper proposes the EgoGazeVQA benchmark and three gaze-guided prompting strategies (textual / visual / salience map), providing the first systematic validation of eye-gaze signals for improving egocentric video intent understanding in MLLMs. The best configuration, Qwen2.5-VL-72B + GazeS, achieves a 5.8 percentage-point gain in average accuracy.

Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats

This paper proposes AllPath, a multi-path hallucination intervention framework grounded in the Transformer causal architecture. It is the first to demonstrate that hallucinations in LVLMs do not stem from a single causal path but from the interaction of three paths — image-to-input-text, image-to-output-text, and text-to-text — and that models adaptively rely on different paths depending on the question-answer alignment format. By designing lightweight key-head identification methods for each path and performing adaptive intervention, AllPath consistently reduces hallucinations across four benchmarks covering different alignment formats: POPE, MCQ-POPE, CHAIR, and MME.

JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models

Inspired by the Eliciting Latent Knowledge (ELK) framework, this paper is the first to reveal that VLMs possess approximable safety decision boundaries in the latent space of fusion layers. It proposes JailBound, a two-stage attack framework comprising Safety Boundary Probing and Safety Boundary Crossing, which jointly optimizes image and text adversarial perturbations to cross this boundary. JailBound achieves average attack success rates of 94.32% and 67.28% in white-box and black-box settings, respectively, significantly surpassing the state of the art.

Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

VG-LLM proposes integrating a 3D visual geometry encoder (VGGT) into multimodal large language models, enabling the extraction and fusion of 3D geometric priors from video input alone—without any explicit 3D data. This approach significantly improves MLLM performance on 3D scene understanding and spatial reasoning tasks, with the 4B model surpassing Gemini-1.5-Pro on VSI-Bench.

Learning Shared Representations from Unpaired Data

This paper proposes SUE (Spectral Universal Embedding), which is the first to demonstrate that cross-modal shared representations can be learned with almost entirely unpaired data. Independent spectral embeddings extract modality-invariant "universal" structure from random walks within each modality; a minimal number of paired samples (~100 pairs) then enables CCA-based linear alignment followed by MMD-based nonlinear fine-tuning. SUE outperforms contrastive learning using the same number of pairs by more than 250% on retrieval benchmarks.

Learning Skill-Attributes for Transferable Assessment in Video

This paper proposes CrossTrainer, a method that discovers sport-agnostic skill attributes (e.g., balance, control, hand positioning) as intermediate representations to train a multimodal language model for generating actionable feedback and proficiency assessments from video. CrossTrainer achieves up to 60% relative improvement over the state of the art in zero-shot cross-sport transfer.

Learning to Instruct for Visual Instruction Tuning

This paper proposes L2T (Learning to Instruct), which improves visual instruction tuning solely by extending the training loss to cover the instruction sequence (rather than computing loss on responses only). Without additional data and with virtually zero computational overhead, L2T achieves up to 9% relative improvement across 16 multimodal benchmarks, an 18% gain on captioning tasks, and notable hallucination reduction.
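A minimal sketch of extending the training loss to instruction tokens, assuming per-token cross-entropy and 0/1 float masks over instruction and response positions; the relative weighting `lam` is an assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def instruction_aware_loss(logits, labels, instr_mask, resp_mask, lam=0.5):
    """logits: (B, T, V); labels: (B, T); masks: (B, T) floats, 1.0 on covered tokens."""
    ce = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")  # (B, T)
    resp_loss = (ce * resp_mask).sum() / resp_mask.sum().clamp(min=1)
    instr_loss = (ce * instr_mask).sum() / instr_mask.sum().clamp(min=1)
    # standard visual instruction tuning uses resp_loss only
    return resp_loss + lam * instr_loss
```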

Learning to Steer: Input-dependent Steering for Multimodal LLMs

Addressing the limitation of existing steering methods that rely on fixed direction vectors incapable of adapting to diverse inputs, this paper proposes L2S (Learn-to-Steer): it first generates ideal input-specific steering vectors via contrastive prompting (P2S), then trains a lightweight 2-layer MLP to predict these vectors from the input context. This achieves input-dependent behavioral steering at negligible overhead, significantly outperforming static steering baselines on both safety enforcement and hallucination mitigation.

MDReID: Modality-Decoupled Learning for Any-to-Any Multi-Modal Object Re-Identification

This paper proposes MDReID, a framework that decouples modality features into modality-shared and modality-specific components, enabling object re-identification under arbitrary modality combinations (any-to-any ReID) and substantially outperforming existing methods in both modality-matched and modality-mismatched scenarios.

Metacognitive Sensitivity for Test-Time Dynamic Model Selection

Inspired by the concept of metacognitive sensitivity (meta-d') from cognitive science, this paper proposes a test-time dynamic model selection framework that quantifies a model's ability to "know what it doesn't know" via meta-d', combines it with instantaneous confidence scores to form a context vector, and employs a contextual bandit to online-select the optimal model, outperforming individual models across multiple datasets.

MIDAS: Misalignment-based Data Augmentation Strategy for Imbalanced Multimodal Learning

This work is the first to propose using cross-modal misaligned samples as supervised training signals—rather than treating them as noise or interference—to alleviate modality imbalance in multimodal learning. The proposed MIDAS data augmentation framework combines three complementary mechanisms: confidence-based labeling of misaligned samples, weak-modality weighting, and hard-sample weighting. MIDAS substantially outperforms existing methods across four multimodal classification benchmarks.

Mint: A Simple Test-Time Adaptation of Vision-Language Models against Common Corruptions

This work identifies embedding variance collapse—the simultaneous shrinkage of intra- and inter-class variance that erodes discriminability in the embedding space—as the root cause of CLIP's performance degradation under image corruptions. It proposes Mint, which restores embedding geometry online by maximizing pseudo-label inter-class variance (PL-inter) using only two lightweight components: a mean accumulator and a gradient accumulator. Mint consistently improves CLIP's classification accuracy across multiple corruption benchmarks even at BS=1, while running 45× faster than the strongest baseline.

MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agriculture

MIRAGE is the first multimodal benchmark constructed from real agricultural expert consultation dialogues (35,000+), evaluating vision-language models on domain-level entity identification, causal reasoning, and clarify-or-respond decision-making. It reveals a severe challenge in which even GPT-4.1 achieves only 43.9% identification accuracy.

MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models

This paper proposes MM-OPERA, an open-ended association reasoning benchmark comprising 11,497 instances. It evaluates the association reasoning capabilities of LVLMs through two tasks — Remote-Item Association (RIA) and In-Context Association (ICA) — and introduces an LLM-as-a-Judge scoring strategy alongside a process reward evaluation method. The benchmark reveals that even the strongest current LVLMs remain significantly behind humans.

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

This paper introduces MME-VideoOCR, a comprehensive video OCR evaluation benchmark comprising 25 tasks, 44 scenarios, 1,464 videos, and 2,000 manually annotated QA pairs, spanning three levels of text recognition, understanding, and reasoning. Evaluation of 18 state-of-the-art MLLMs reveals that the strongest model (Gemini-2.5 Pro) achieves only 73.7% overall, with cross-frame understanding tasks falling below 25%.

MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

This paper introduces MMLongBench, the first comprehensive benchmark for evaluating long-context vision-language models (LCVLMs), comprising 13,331 samples spanning 5 downstream task categories, mixed image types, and 5 standardized input length levels (8K–128K tokens). Evaluation of 46 models reveals that single-task performance is a weak proxy for overall capability, and that stronger reasoning ability positively correlates with long-context performance.

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

This paper introduces MMPerspective, the first benchmark to systematically evaluate the perspective understanding capabilities of multimodal large language models (MLLMs), comprising 10 tasks across 3 dimensions, 2,711 images, and 5,083 question–answer pairs. It reveals significant deficiencies in perspective reasoning and robustness across 43 state-of-the-art models.

MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly Detection

This paper proposes MoniTor, a memory-based online scoring queue framework that leverages LLMs for training-free online video anomaly detection (VAD). It guides LLMs toward real-time anomaly recognition through a dual-layer memory mechanism, behavior prediction, and a standard scoring queue.

Multi-Modal Masked Autoencoders for Learning Image-Spectrum Associations for Galaxy Evolution and Cosmology

This work constructs GalaxiesML-Spectra, a large-scale multi-modal dataset of 134,533 galaxies with images, spectra, and redshifts, and adapts a Multi-Modal Masked Autoencoder (MMAE) for joint image–spectrum reconstruction and redshift regression. It demonstrates that at test time, even with spectra entirely absent, using only 25% masked images achieves a redshift prediction scatter of \(\sigma_{NMAD} = 0.016\), surpassing AstroCLIP.

Multimodal Bandits: Regret Lower Bounds and Optimal Algorithms

For the multimodal multi-armed bandit problem where the reward function has at most \(m\) modes, this paper proposes the first computationally feasible algorithm for solving the Graves-Lai optimization problem, achieves an asymptotically optimal regret bound, and proves that local search strategies are suboptimal.

Multimodal Negative Learning

This paper proposes the Multimodal Negative Learning (MNL) paradigm, in which dominant modalities guide weaker modalities to suppress non-target classes—rather than enforcing alignment on target classes—thereby stabilizing the decision space, preserving modality-specific information, and theoretically tightening the robustness lower bound of multimodal fusion.

Nautilus: A Large Multimodal Model for Underwater Scene Understanding

This paper presents Nautilus, the first large multimodal model supporting eight underwater scene understanding tasks. It introduces a physics-prior-driven Visual Feature Enhancement (VFE) module that explicitly rectifies underwater image degradation in feature space, improving the robustness of LMMs in underwater environments.

NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

This paper systematically investigates the design space and scaling properties of native multimodal large language models (Native MLLMs) under data constraints. It identifies a positive log-linear optimal scaling relationship between the visual encoder and the LLM, and based on this finding proposes NaViL, which achieves competitive performance with state-of-the-art MLLMs using only approximately 600 million pre-training image-text pairs.

NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables

This paper proposes NeedleInATable (NIAT), a benchmark that treats each table cell as a "needle" to evaluate the fine-grained perception capability of LLMs over long structured tables. It reveals that strong performance of existing models on complex downstream tasks may stem from dataset shortcuts rather than genuine table understanding.

NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception

This paper proposes the NegoCollab framework, which introduces a Negotiator module to negotiate a common representation from the local representations of heterogeneous multimodal agents during training, effectively eliminating domain gaps between heterogeneous collaborative agents and enabling low-cost collaborative connected perception.

Omni-Mol: Multitask Molecular Model for Any-to-Any Modalities

This paper proposes Omni-Mol, a unified molecular understanding and generation framework built upon a multimodal LLM. Through a 1.42M-sample instruction tuning dataset, Gradient Adaptive LoRA (GAL), and a Mixture-of-GAL-Experts (MoGE) architecture, Omni-Mol is the first single model to jointly learn 16 molecular tasks (Mol2Mol / Mol2Text / Mol2Num / Text2Mol), achieving SOTA on 13 tasks with only 2.2B parameters.

On the Value of Cross-Modal Misalignment in Multimodal Representation Learning

This paper proposes a latent variable model that formalizes cross-modal misalignment into two mechanisms—selection bias and perturbation bias—and theoretically proves that MMCL-learned representations precisely capture the invariant semantic subset unaffected by both biases, thereby unifying the opposing views of misalignment as harmful vs. beneficial.

OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Models

This paper proposes OpenHOI, a framework that leverages the commonsense reasoning capabilities of multimodal large language models (MLLMs) to infer contact regions and grasp types for unseen objects, enabling open-world hand-object interaction synthesis without requiring per-object training data collection.

PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments

This paper proposes the Active Visual Reasoning (AVR) task paradigm, constructs the CLEVR-AVR simulation benchmark and the AVR-152k dataset (with rich CoT annotations), and trains the PhysVLM-AVR model to iteratively acquire information through a perception–reasoning–action closed loop in partially observable interactive environments, significantly outperforming existing MLLMs.

Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning

This paper discovers that the decision-making reasoning capability of VLMs can be decoupled from visual perception—replacing image inputs with textual descriptions yields equal or higher decision accuracy. Building on this insight, Praxis-VLM trains decision-making reasoning on purely textual scenarios via multi-stage GRPO with adaptive rewards, then transfers zero-shot to visual inputs at inference time, achieving comprehensive improvements over SFT baselines on three decision-making benchmarks, with especially notable gains in OOD generalization.

PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation

PrefixKV identifies that the importance distributions of KV caches vary substantially across layers, and formalizes the per-layer cache sizing problem as a global prefix configuration search. A binary search is employed to find the optimal cumulative priority threshold that maximizes contextual information retention in each layer. At a 20% retention ratio, PrefixKV incurs only a 0.49 PPL degradation while delivering a 1.8× inference speedup.
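A sketch of the global budget search, under the assumption that each layer keeps the smallest prefix of its importance-sorted tokens whose cumulative normalized score reaches a shared threshold, and that binary search tunes this threshold to hit a target overall retention ratio; PrefixKV's actual importance scoring is not reproduced here.

```python
import torch

def tokens_kept(scores, tau):
    """Smallest prefix of importance-sorted tokens whose cumulative normalized
    score reaches tau (one layer)."""
    s = scores.sort(descending=True).values
    cum = s.cumsum(0) / s.sum().clamp(min=1e-8)
    return int((cum < tau).sum().item()) + 1             # keep at least one token

def search_threshold(per_layer_scores, target_ratio, iters=30):
    total = sum(s.numel() for s in per_layer_scores)
    budget = target_ratio * total
    lo, hi = 0.0, 1.0
    for _ in range(iters):                               # binary search on tau
        tau = (lo + hi) / 2
        kept = sum(tokens_kept(s, tau) for s in per_layer_scores)
        if kept > budget:
            hi = tau
        else:
            lo = tau
    return (lo + hi) / 2

layer_scores = [torch.rand(128) for _ in range(32)]      # stand-in importance scores
tau = search_threshold(layer_scores, target_ratio=0.2)
```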

Reading Recognition in the Wild

This paper introduces a novel reading recognition task and the first large-scale multimodal "reading-in-the-wild" dataset (100 hours). A lightweight Transformer model fusing three complementary modalities—RGB, gaze, and IMU—enables real-time reading detection on smart glasses.

Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models

This paper proposes GLOBE — an LVLM-based image geo-localization system trained via GRPO reinforcement learning. By constructing MP16-Reason, a reasoning-oriented dataset with localizability assessment, visual-clue reasoning chains, and geographic accuracy annotations, GLOBE surpasses SOTA methods trained on millions of samples as well as large-scale open-source VLMs using only 33K training examples across multiple benchmarks.

Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion

This paper proposes a classification ability disproportion perspective to understand modality imbalance in multimodal learning, and designs a Sustained Boosting algorithm (shared encoder + multiple configurable classifiers, jointly optimizing classification and residual errors) coupled with Adaptive Classifier Assignment (ACA). The paper theoretically proves that the cross-modal gap loss converges at \(\mathcal{O}(1/T)\), and achieves substantial improvements over SOTA on six datasets including CREMAD.

Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval

This paper proposes Retrv-R1, the first R1-style reasoning-based multimodal retrieval framework. It reduces token consumption via an Information Compression Module (ICM), preserves complete information for hard candidates through a Details Inspection Mechanism (DIM), and employs a curriculum-based RL reward to balance effectiveness and efficiency, achieving state-of-the-art performance on universal multimodal retrieval benchmarks.

Revisiting Logit Distributions for Reliable Out-of-Distribution Detection

This paper proposes LogitGap, a novel post-hoc OOD detection scoring function that explicitly exploits the "gap" between the maximum logit and the remaining logits to distinguish in-distribution (ID) from out-of-distribution (OOD) samples. A top-N selection strategy is introduced to filter noisy logits. Theoretical analysis and experiments demonstrate that LogitGap outperforms MCM and MaxLogit across multiple scenarios.
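A minimal version of a gap-style score, assuming the gap is taken between the top logit and the mean of the next N logits; the paper's exact aggregation and choice of N may differ.

```python
import torch

def logit_gap_score(logits, top_n=10):
    """logits: (B, C). Larger score -> more likely in-distribution."""
    top = logits.topk(min(top_n + 1, logits.shape[-1]), dim=-1).values
    return top[..., 0] - top[..., 1:].mean(dim=-1)

scores = logit_gap_score(torch.randn(4, 1000))
is_ood = scores < scores.quantile(0.05)    # threshold would be chosen on ID data
```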

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

This paper proposes RoboRefer, a 3D-aware reasoning VLM trained via a two-stage SFT + RFT strategy with a metric-sensitive process reward function. It achieves precise single-step spatial understanding and multi-step spatial reasoning on spatial referring tasks, surpassing Gemini-2.5-Pro by 17.4% on RefSpatial-Bench.

RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness

From the perspective of low-rank decomposition, this paper identifies "direction robustness" as the key factor in parameter-efficient module merging (as opposed to sign conflicts in full-parameter merging), and proposes RobustMerge, which maintains singular value direction stability via complementary parameter adaptive scaling and cross-task normalization, achieving average improvements of 3.4% (seen tasks) and 4.5% (unseen tasks) on multimodal generation benchmarks.

RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

This paper proposes RTV-Bench, a fine-grained benchmark for assessing the continuous real-time video analysis capabilities of MLLMs. Comprising 552 videos and 4,608 QA pairs, it evaluates perception, understanding, and reasoning in dynamic video streams through three core designs: multi-timestamp QA (the same question yields different correct answers at different timestamps), a hierarchical question structure, and multi-dimensional assessment. Key findings include that online models outperform offline models, and that simply scaling model size or frame count yields limited gains.

Scene-Aware Urban Design: A Human-AI Recommendation Framework Using Co-Occurrence Embeddings and Vision-Language Models

This paper proposes a human-AI collaborative computer vision framework that employs Grounding DINO for urban object detection, constructs co-occurrence embeddings from the ADE20K dataset to capture real-world spatial configurations, leverages a VLM for scene-aware third-object recommendation, and generates 3D models for AR preview — all aimed at enabling residents to participate in micro-scale urban design.

SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs

This paper proposes SCOPE, a visual token pruning strategy that jointly models saliency and coverage. By iteratively selecting tokens with the highest SCOPE scores, it preserves semantic completeness and retains 96% of LLaVA-1.5's performance under a 9× token reduction.

SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models

This paper proposes MSMU, a large-scale quantitative spatial reasoning dataset (700K QA pairs, 2.5M numerical annotations), and Depth Positional Encoding (DPE), enabling VLMs to achieve strong quantitative spatial measurement and understanding without relying on 3D point clouds. SD-VLM outperforms GPT-4o by 26.91% on MSMU-Bench.

Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models

This paper addresses OCR hallucinations in MLLMs under degraded document conditions. It introduces KIE-HVQA, the first benchmark for evaluating hallucinations in degraded document scenarios, and proposes a multi-objective reward reinforcement learning framework based on GRPO. The resulting 7B-parameter model achieves approximately 28% higher hallucination-suppression accuracy than GPT-4o.

See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model

This paper proposes See&Trek, a training-free and GPU-free spatial prompting framework that enhances spatial understanding in MLLMs through maximum semantic richness sampling and motion reconstruction, achieving up to 3.5% improvement on VSI-Bench.

Sherlock: Self-Correcting Reasoning in Vision-Language Models

The first systematic study of self-correction capabilities in reasoning VLMs: existing reasoning VLMs are found to be nearly incapable of self-correction (<10% exhibit an aha moment). The paper proposes Sherlock, a three-stage training framework (SFT cold-start → offline trajectory-level preference learning → online self-iterative improvement) that surpasses LLaVA-CoT/Mulberry/LlamaV-o1 (which use 100K–260K annotations) using only 20K labeled samples.

SITCOM: Scaling Inference-Time COMpute for VLAs

SITCOM proposes an inference-time compute scaling framework inspired by Model Predictive Control (MPC). It performs multi-step rollout simulation of a pretrained VLA using a learned dynamics model and selects optimal trajectories via a reward model, transforming a single-step VLA into a robust long-horizon planner. On the SIMPLER benchmark, it improves task success rate from 48% to 72%.

Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Models

This work introduces the Situat3DChange dataset (174K data instances) that unifies dynamic scene change perception and situated awareness understanding under a perception–action paradigm, and proposes SCReasoner—an efficient 3D MLLM for point cloud comparative reasoning.

Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

This paper extends sparse autoencoders (SAEs) to vision-language models (e.g., CLIP), proposes the MonoSemanticity score (MS) to quantitatively evaluate the monosemanticity of neurons, and demonstrates that manipulating SAE neurons can directly steer multimodal large language models (e.g., LLaVA) to insert or suppress specific concepts.

SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards

This paper proposes SpatialThinker, which trains MLLMs to construct scene graphs and perform structured spatial reasoning via online RL with multi-objective dense spatial rewards (lexicographic gating over format → count → accuracy → spatial localization). Using only 7K samples, it surpasses GPT-4o on 3DSRBench by 12.1%.
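A sketch of lexicographic reward gating, assuming each later criterion contributes only if every earlier one passes; the stage weights here are illustrative, not the paper's values.

```python
def gated_spatial_reward(format_ok, count_ok, answer_correct, loc_iou,
                         weights=(0.1, 0.1, 0.5, 0.3)):
    """Lexicographic gating: format -> count -> accuracy -> spatial localization.
    A stage is only rewarded if all earlier stages passed (weights assumed)."""
    reward = 0.0
    if not format_ok:
        return reward
    reward += weights[0]
    if not count_ok:
        return reward
    reward += weights[1]
    if not answer_correct:
        return reward
    reward += weights[2]
    return reward + weights[3] * loc_iou

r = gated_spatial_reward(True, True, True, loc_iou=0.8)   # -> 0.94
```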

SpatialTraceGen: High-Fidelity Traces for Efficient VLM Spatial Reasoning Distillation

This paper proposes SpatialTraceGen, a framework that distills high-quality multi-step tool-use reasoning traces from large teacher models via automated verification, enabling efficient fine-tuning of small VLMs for spatial reasoning.

SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

This paper proposes SRPO (Self-Reflection enhanced reasoning with Group Relative Policy Optimization), a two-stage reflection-aware RL framework. Stage 1 constructs reflection data via large model distillation for SFT cold-start; Stage 2 designs a reflection-aware reward function within GRPO to reinforce concise and effective self-reflection. SRPO achieves state-of-the-art results at the 7B/32B scale on multimodal reasoning benchmarks including MathVista, MathVision, and MMMU-Pro.

SSR: Enhancing Depth Perception in VLMs via Rationale-Guided Spatial Reasoning

This paper proposes the SSR framework, which converts raw depth information into structured textual reasoning rationales and compresses them into compact latent embeddings via knowledge distillation, enhancing the spatial reasoning capabilities of existing VLMs in a plug-and-play manner.

Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs

This paper proposes Struct2D, a perception-guided prompting framework that converts 3D perception outputs into structured 2D representations (BEV images + object labels + metadata), enabling MLLMs to perform complex spatial reasoning without explicit 3D input. The authors also construct Struct2D-Set, a large-scale instruction tuning dataset containing 200K QA pairs.

Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning

This paper proposes MuMo, a framework that fuses 2D topological and 3D geometric information into stable structural priors via a Structured Fusion Pipeline (SFP), and asymmetrically integrates these priors into the sequence stream through a Progressive Injection (PI) mechanism, achieving an average improvement of 2.7% over competitive baselines across 29 molecular property prediction benchmarks and ranking first on 22 of them.

Systematic Reward Gap Optimization for Mitigating VLM Hallucinations

This paper proposes Topic-level Preference Rewriting (TPR), which systematically optimizes the reward gap configuration in preference data through fine-grained semantic control at the topic level, combined with a curriculum learning strategy that progressively increases the difficulty of negative samples, achieving approximately 93% hallucination reduction across multiple hallucination benchmarks.

T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with VLMs

This paper proposes the T-Rex framework, which dynamically selects the optimal spatial representation extraction scheme (point / vector / 6D pose) according to task complexity, and introduces Chain of Grounding (CoG) to guide VLMs through step-by-step reasoning, enabling training-free open-vocabulary robotic manipulation.

Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models

This paper proposes STS (Spectrum-Aware Test-Time Steering), a lightweight test-time adaptation method that extracts a low-dimensional semantic subspace via SVD of text embeddings and learns a small set of coefficients to steer text prototypes within this subspace to handle distribution shift. STS requires no backpropagation through large encoders, runs 8× faster than TPT with 12× less memory, and substantially outperforms existing TTA methods on OOD datasets.
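
A minimal sketch of the subspace-steering idea, assuming CLIP-style normalized embeddings and an entropy-minimization objective at test time; the actual STS objective and update rule may differ.

```python
import torch

def build_subspace(text_embeds: torch.Tensor, k: int = 8) -> torch.Tensor:
    # text_embeds: [num_classes, dim]; keep the top-k right singular vectors.
    _, _, Vh = torch.linalg.svd(text_embeds, full_matrices=False)
    return Vh[:k]                          # [k, dim] basis of the semantic subspace

def steer(text_embeds, basis, coeffs):
    # coeffs: [k] learnable; shift every prototype along the shared subspace.
    shift = coeffs @ basis                 # [dim]
    return torch.nn.functional.normalize(text_embeds + shift, dim=-1)

def adapt(image_embed, text_embeds, steps=10, lr=1e-2, k=8):
    # Only the k coefficients are optimized; both encoders stay frozen.
    basis = build_subspace(text_embeds, k)
    coeffs = torch.zeros(k, requires_grad=True)
    opt = torch.optim.Adam([coeffs], lr=lr)
    for _ in range(steps):
        logits = 100.0 * image_embed @ steer(text_embeds, basis, coeffs).T
        probs = logits.softmax(dim=-1)
        loss = -(probs * probs.clamp_min(1e-8).log()).sum()   # entropy (assumed objective)
        opt.zero_grad(); loss.backward(); opt.step()
    return steer(text_embeds, basis, coeffs).detach()
```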

Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models

This paper proposes an end-to-end pipeline that converts natural language input into 3D mesh models via 3D generative AI, then leverages zero-shot multimodal reasoning of VLMs to automatically decompose the mesh into multi-component 3D models (structural components + panel components), which are subsequently assembled into physical objects by a robotic arm. The system also supports interactive user feedback through dialogue to adjust component assignments.

The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models

This paper introduces TTA-VLM, a unified benchmark evaluating 8 episodic and 7 online test-time adaptation (TTA) methods across 15 datasets under controlled experimental conditions. Three surprising findings emerge: (1) existing TTA methods offer only marginal improvements over the early TPT baseline; (2) TTA methods collaborate poorly with training-time fine-tuning approaches; (3) accuracy gains come at the cost of calibration, OOD detection, and robustness.

To Think or Not To Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning

This paper systematically investigates whether explicit thinking is necessary in rule-based reinforcement fine-tuning (RFT). It finds that on visual perception tasks, No-Thinking-RFT consistently outperforms the conventional think-then-answer paradigm, and proposes an Adaptive-Thinking approach that allows models to autonomously determine whether to reason based on their own capability and task complexity.

To See or To Read: User Behavior Reasoning in Multimodal LLMs

This paper proposes BehaviorLens, a benchmarking framework that systematically compares three representations of user behavior history — text sequences, scatter plots, and flowcharts — for next-purchase prediction with MLLMs. Visual representations are shown to improve prediction accuracy by up to 87.5% over equivalent text representations without incurring additional computational overhead.

TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning

This paper proposes TOMCAT, which dynamically updates compositional prototypes by accumulating dual-modality (textual and visual) knowledge from unlabeled test data at test time, addressing label distribution shift and achieving state-of-the-art performance on four CZSL benchmarks.
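
A rough sketch of test-time prototype accumulation, assuming confidence-thresholded pseudo-labels and a momentum update; TOMCAT's actual dual-modality accumulation and weighting scheme are not reproduced here.

```python
import torch

class PrototypeBank:
    def __init__(self, text_protos: torch.Tensor, momentum: float = 0.99):
        # text_protos: [num_compositions, dim], initialized from text embeddings.
        self.protos = torch.nn.functional.normalize(text_protos, dim=-1)
        self.m = momentum

    @torch.no_grad()
    def update(self, image_feats: torch.Tensor, conf_threshold: float = 0.6):
        # image_feats: [batch, dim] features of unlabeled test images.
        feats = torch.nn.functional.normalize(image_feats, dim=-1)
        probs = (100.0 * feats @ self.protos.T).softmax(dim=-1)
        conf, pseudo = probs.max(dim=-1)
        for f, c, y in zip(feats, conf, pseudo):
            if c < conf_threshold:            # accumulate only confident samples
                continue
            self.protos[y] = self.m * self.protos[y] + (1 - self.m) * f
            self.protos[y] = self.protos[y] / self.protos[y].norm()

    def predict(self, image_feats: torch.Tensor) -> torch.Tensor:
        feats = torch.nn.functional.normalize(image_feats, dim=-1)
        return (feats @ self.protos.T).argmax(dim=-1)
```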

Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs

This paper proposes E3VQA, the first multi-view VQA benchmark, and M3CoT, a prompting technique that fuses three complementary scene graphs, to enhance multi-view scene understanding in Large Vision-Language Models (LVLMs), achieving gains of 4.84% on GPT-4o and 5.94% on Gemini 2.0 Flash.

Towards Evaluating Proactive Risk Awareness of Multimodal Language Models

This paper introduces PaSBench, a benchmark for evaluating the proactive risk awareness of multimodal language models — requiring models to autonomously observe environments and issue safety warnings without any user query. An evaluation of 36 models reveals that the strongest model (Gemini-2.5-pro) achieves only 71% accuracy, with 45% of risks failing to be detected consistently. The core bottleneck is identified as unstable proactive reasoning rather than a lack of safety knowledge.

Training-free Online Video Step Grounding

This paper proposes BaGLM, a training-free online video step grounding method that integrates LLM-estimated step dependencies and LMM-estimated step progress into zero-shot LMM predictions via Bayesian filtering, outperforming existing trained offline methods on three datasets.
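
The Bayesian filtering component can be pictured as a standard discrete forward filter over steps; the sketch below is generic, and the transition and likelihood models it assumes are illustrative rather than BaGLM's exact formulation.

```python
import numpy as np

def bayes_filter(likelihoods: np.ndarray, transition: np.ndarray, prior: np.ndarray):
    """
    likelihoods: [T, S]  p(frame_t | step_s), e.g. zero-shot LMM scores per frame
    transition:  [S, S]  p(step_t = j | step_{t-1} = i), from LLM-estimated dependencies
    prior:       [S]     initial distribution over steps
    returns:     [T, S]  filtered posterior over steps at each time t
    """
    T, S = likelihoods.shape
    post = np.zeros((T, S))
    belief = prior / prior.sum()
    for t in range(T):
        belief = belief @ transition            # predict: propagate through dependencies
        belief = belief * likelihoods[t]        # update: weigh by the current observation
        belief = belief / max(belief.sum(), 1e-12)
        post[t] = belief
    return post
```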

TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models

TRoVe proposes an automated method for discovering static feature biases that induce systematic prediction errors in temporal VLMs. Through a dual-scoring mechanism combining an Error Contribution Score (ECS) and a Static Bias Score (SBS), TRoVe outperforms baselines by 28.6% on 101 synthetic models and successfully identifies novel biases in 7 real-world VLMs.

Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition

This paper proposes Uni-MuMER, which performs unified multi-task fine-tuning of an open-source VLM via three data-driven tasks (Tree-CoT, Error-Driven Learning, and Symbol Counting), achieving substantial improvements over both specialized lightweight models and zero-shot commercial VLMs on the CROHME and HME100K benchmarks.

Unified Reinforcement and Imitation Learning for Vision-Language Models

This paper proposes RIL (Unified Reinforcement and Imitation Learning), a training framework that combines GRPO-based reinforcement learning with GAIL-style adversarial imitation learning to substantially improve the performance of small VLMs (7B) by learning the text generation style of large VLMs (72B), without incurring additional inference latency or requiring an explicit "thinking" process.

Unifying Vision-Language Latents for Zero-Label Image Caption Enhancement

This paper proposes the ViZer framework, which improves the image captioning capability of VLMs through a unified vision-language latent space alignment training paradigm—requiring no text annotations whatsoever. Using only raw image data, the model learns to generate more grounded and descriptive captions.

UniTok: A Unified Tokenizer for Visual Generation and Understanding

This paper proposes UniTok, a unified tokenizer for visual generation and understanding that overcomes the representation capacity bottleneck of discrete tokens via Multi-Codebook Quantization (MCQ). UniTok achieves simultaneous state-of-the-art records of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet, and can be seamlessly integrated into MLLMs to enable both generation and understanding.
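
A minimal sketch of the multi-codebook quantization idea: split each latent vector into chunks and quantize each chunk with its own codebook. Codebook sizes, training losses, and the straight-through trick shown here are assumptions, not UniTok's exact recipe.

```python
import torch

class MultiCodebookQuantizer(torch.nn.Module):
    def __init__(self, dim=256, num_codebooks=8, codebook_size=4096):
        super().__init__()
        assert dim % num_codebooks == 0
        self.chunk = dim // num_codebooks
        self.codebooks = torch.nn.ModuleList(
            torch.nn.Embedding(codebook_size, self.chunk) for _ in range(num_codebooks)
        )

    def forward(self, z: torch.Tensor):
        # z: [batch, tokens, dim] continuous latents from the vision encoder.
        chunks = z.split(self.chunk, dim=-1)
        quantized, codes = [], []
        for cb, c in zip(self.codebooks, chunks):
            # Squared distances from each chunk to every code vector.
            d = (c.unsqueeze(-2) - cb.weight).pow(2).sum(-1)   # [batch, tokens, codebook_size]
            idx = d.argmin(dim=-1)                             # nearest code per chunk
            q = cb(idx)
            quantized.append(c + (q - c).detach())             # straight-through estimator
            codes.append(idx)
        return torch.cat(quantized, dim=-1), torch.stack(codes, dim=-1)
```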

Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards

This paper proposes the Chain-of-Step (CoS) reasoning framework, which decomposes VLM reasoning chains into structured steps consisting of Name, Thought, and Reflection components. A step-level Process Reward Model (PRM) is trained to provide fine-grained reward signals. Combined with iterative DPO and step-level beam search, the framework systematically improves VLM reasoning—achieving an average of 73.4% (+4.0%) across 6 benchmarks on InternVL-2.5-MPO-8B and 64.2% (+12.1%) on LLaVA-NeXT-8B—while revealing the counterintuitive finding that quality matters far more than length in VLM reasoning, contrary to trends observed in LLM research.

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

VAGEN is a framework that structures the reasoning process of VLM agents into StateEstimation and TransitionModeling to build an internal world model, and combines a WorldModeling Reward with Bi-Level GAE for efficient multi-turn RL training. A 3B model trained under this framework scores 0.82, surpassing GPT-5 (0.75) and Gemini 2.5 Pro (0.67).

VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models

This paper proposes VaMP, a variational multi-modal prompt learning framework that models text-side prompts as latent variables and performs instance-level uncertainty modeling via variational inference. Combined with a class-aware prior for regularizing the latent space, VaMP significantly improves CLIP's downstream adaptation under few-shot and domain generalization settings.
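
A rough sketch of instance-level variational prompts with a class-aware Gaussian prior (unit variance assumed); VaMP's actual architecture and prior parameterization may differ.

```python
import torch

class VariationalPrompt(torch.nn.Module):
    def __init__(self, img_dim=512, prompt_len=4, prompt_dim=512, num_classes=100):
        super().__init__()
        self.mu = torch.nn.Linear(img_dim, prompt_len * prompt_dim)
        self.logvar = torch.nn.Linear(img_dim, prompt_len * prompt_dim)
        # Class-aware prior means, one per class (illustrative choice).
        self.prior_mu = torch.nn.Parameter(torch.zeros(num_classes, prompt_len * prompt_dim))
        self.shape = (prompt_len, prompt_dim)

    def forward(self, image_feat, label=None):
        mu, logvar = self.mu(image_feat), self.logvar(image_feat)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        kl = None
        if label is not None:
            # KL( N(mu, sigma^2) || N(prior_mu, I) ), regularizing toward the class prior.
            diff = mu - self.prior_mu[label]
            kl = 0.5 * (logvar.exp() + diff.pow(2) - 1.0 - logvar).sum(dim=-1).mean()
        return z.view(-1, *self.shape), kl
```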

Video-R1: Reinforcing Video Reasoning in MLLMs

Inspired by DeepSeek-R1, this paper presents the first systematic exploration of applying the R1 paradigm (rule-based RL) to video reasoning. It proposes the T-GRPO algorithm to explicitly encourage temporal reasoning, constructs a mixed image-video training dataset, and achieves 37.1% accuracy on VSI-Bench, surpassing GPT-4o.

Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs

This paper presents Video-SafetyBench, the first comprehensive benchmark for safety evaluation of video LVLMs. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, constructed via a controllable video generation pipeline. A confidence-based evaluation metric, RJScore, is proposed to assess model outputs. Large-scale evaluation across 24 LVLMs reveals an average attack success rate of 67.2% under benign queries.

VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

This paper proposes VideoRFT, which extends the reinforced fine-tuning (RFT) paradigm to video reasoning via a cognition-inspired multi-expert CoT data construction pipeline and a novel semantic consistency reward. Two datasets are constructed: VideoRFT-CoT-102K (for SFT) and VideoRFT-RL-310K (for RL), achieving state-of-the-art performance on 6 video reasoning benchmarks.

VIPAMIN: Visual Prompt Initialization via Embedding Selection and Subspace Expansion

This paper proposes VIPAMIN, a visual prompt initialization strategy that adds no extra parameters and comprises two modules: attention-guided semantic matching and orthogonal subspace injection. It addresses two failure modes of self-supervised VPT (prompt attention uniformization and subspace collapse), requires only a single forward pass, and achieves state-of-the-art performance across 24 visual tasks.

Vision Function Layer in Multimodal LLMs

This paper identifies that vision-related functional decoding in MLLMs is concentrated in specific narrow layer blocks (Vision Function Layers), exhibiting a consistent hierarchical order across model families (recognition → counting → grounding → OCR). Building on this finding, the authors propose VFL-LoRA (matching full-LoRA performance with only 1/3 of the parameters) and VFL-select (achieving 98% of full-data performance with 20% of the data).
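
The layer-identification procedure is the paper's contribution; the sketch below only illustrates how one might restrict LoRA to an identified layer block, with hypothetical layer indices and a LLaMA-style module layout assumed.

```python
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # freeze the original projection
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

def apply_vfl_lora(llm, vision_function_layers=range(8, 16)):
    # Adapt only the q/v projections inside the identified layer block
    # (indices are placeholders; the paper localizes them per capability).
    for i in vision_function_layers:
        layer = llm.model.layers[i]                 # assumes a LLaMA-style layer list
        layer.self_attn.q_proj = LoRALinear(layer.self_attn.q_proj)
        layer.self_attn.v_proj = LoRALinear(layer.self_attn.v_proj)
    return llm
```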

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

To address the difficulty draft models have with redundant visual tokens, this paper proposes ViSpec, the first framework to achieve significant acceleration (up to 3.22×) in VLM speculative decoding, via a visual adapter for image token compression, global visual feature injection, and synthetic training data generation.

Visual Instruction Bottleneck Tuning

This paper is the first to apply the Information Bottleneck (IB) principle to end-to-end instruction tuning of multimodal large language models. It proposes Visual Instruction Bottleneck Tuning (Vittle), which inserts a lightweight bottleneck layer inside the LLM to learn minimally sufficient representations. Vittle consistently improves robustness across 30 distribution shift scenarios without sacrificing performance on standard benchmarks.

Visual Structures Help Visual Reasoning: Addressing the Binding Problem in LVLMs

This paper proposes VISER (Visual Input Structure for Enhanced Reasoning), which constructs spatial partitions by superimposing equidistant horizontal lines with numeric labels onto input images, combined with a "row-by-row scan" textual instruction. This approach converts the parallel visual processing of LVLMs into sequential region-by-region parsing. Without modifying the model, without training, and within a single query, VISER substantially mitigates the binding problem and improves performance on visual reasoning tasks including counting, visual search, scene description, and spatial relationship understanding.
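
The image-side intervention is straightforward to picture; below is a minimal sketch using PIL, where the number of rows, colors, and prompt wording are illustrative choices rather than the paper's exact settings.

```python
from PIL import Image, ImageDraw

def add_row_structure(image: Image.Image, num_rows: int = 6) -> Image.Image:
    # Overlay equidistant horizontal lines and numeric row labels on the image.
    img = image.convert("RGB").copy()
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for i in range(1, num_rows):
        y = round(i * h / num_rows)
        draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=2)
    for i in range(num_rows):
        y = round((i + 0.5) * h / num_rows)
        draw.text((5, y), str(i + 1), fill=(255, 0, 0))
    return img

PROMPT = ("The image is divided into numbered horizontal rows. "
          "Scan it row by row, from row 1 to the last row, before answering.")
```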

VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching

This paper proposes VLA-Cache, a training-free inference acceleration method for VLA models that identifies visual tokens that remain static across consecutive frames and caches their KV representations, excludes task-relevant tokens from reuse, and adaptively adjusts the reuse ratio per layer, achieving 1.7× speedup with negligible loss in task success rate.
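
A sketch of the token-selection logic only; the thresholds and the attention-based relevance proxy are assumptions, and the actual KV reuse would happen inside the transformer's cache.

```python
import torch

def select_reusable_tokens(prev_tokens, curr_tokens, task_attn,
                           sim_threshold=0.98, attn_quantile=0.8):
    """
    prev_tokens, curr_tokens: [num_tokens, dim] patch embeddings of two consecutive frames
    task_attn: [num_tokens] attention mass the instruction places on each visual token
    returns a boolean mask of tokens whose cached KV representations can be reused
    """
    sim = torch.nn.functional.cosine_similarity(prev_tokens, curr_tokens, dim=-1)
    static = sim > sim_threshold                              # visually unchanged tokens
    relevant = task_attn > torch.quantile(task_attn, attn_quantile)
    return static & ~relevant                                  # reuse static, non-critical tokens
```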

VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning

This paper proposes VT-FSL, a framework that leverages Cross-modal Iterative Prompting (CIP) to jointly exploit class names and support images for driving LLMs to generate accurate, visually grounded textual descriptions and zero-shot synthesize semantically consistent images. Combined with Kernelized Volume Contrastive Learning (CGA) for global nonlinear cross-modal alignment, VT-FSL achieves an average classification accuracy improvement of 4.2% across 10 few-shot learning benchmarks.

Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

This paper proposes TriSense — a tri-modal (visual + audio + speech) large language model that adaptively modulates per-modality weights via a Query-Based Connector for robust video temporal understanding, supported by the TriSense-2M dataset containing 2 million annotated samples.

WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

This paper introduces WearVQA, the first VQA benchmark specifically designed for wearable device (smart glasses) scenarios. It comprises 2,520 egocentric image–question–answer triplets, systematically covering 7 visual domains, 10 cognitive task types, and 6 categories of wearable-specific image quality degradation. An accompanying LLM-as-a-judge evaluation framework achieves 96% accuracy, and the benchmark reveals that current SOTA multimodal models attain only 24–52% accuracy in this setting.

What Can RL Bring to VLA Generalization? An Empirical Study

This paper systematically investigates the effect of RL fine-tuning on the generalization capabilities of Vision-Language-Action (VLA) models. The study finds that PPO is the most effective RL algorithm, significantly outperforming DPO and GRPO; RL yields substantially greater OOD generalization than SFT in semantic understanding and execution robustness, while achieving comparable visual robustness.

When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning

This paper introduces the concept of modality sabotage as a diagnostic failure mode, proposes a lightweight and model-agnostic evaluation layer that treats each modality as an independent agent, and exposes "contributors" versus "saboteurs" through simple fusion. Applied to multimodal sentiment recognition benchmarks, the framework reveals systematic differences in per-modality reliability.

When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations

This paper identifies a "semantic hallucination" problem in Large Multimodal Models (LMMs) for scene text recognition—where non-semantic text is misread as semantically plausible words. Analysis reveals that Transformer layers whose attention is more focused on text regions are less prone to hallucination. Based on this finding, the authors propose a training-free framework, ZoomText + Grounded Layer Correction, achieving approximately 4–5% improvement on TextHalu-Bench and approximately 4% on ST-VQA.

STRUCTURE: With Limited Data for Multimodal Alignment, Let the Structure Guide You

This paper proposes STRUCTURE regularization and a representation-similarity-based layer selection strategy that achieves high-quality cross-modal alignment between frozen unimodal foundation models using only tens of thousands of paired samples (less than 1% of conventional data requirements), yielding average improvements of 51.6% and 91.8% across 24 zero-shot classification and retrieval benchmarks.

Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting

This paper proposes CAW (Confidence-Aware Weighting), an adversarial fine-tuning loss function for CLIP that focuses on hard adversarial examples via confidence-aware weighting, combined with feature alignment regularization to preserve pre-trained semantic knowledge. CAW achieves state-of-the-art zero-shot robustness under AutoAttack with lower memory overhead.
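
A rough sketch of what a confidence-aware weighted adversarial loss with a feature-alignment term could look like; the weighting function and the regularization coefficient are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def caw_loss(adv_logits, labels, adv_feats, frozen_feats, lam=1.0):
    """
    adv_logits:   [B, C] logits on adversarial images from the fine-tuned encoder
    labels:       [B]    ground-truth classes
    adv_feats:    [B, D] adversarial image features from the fine-tuned encoder
    frozen_feats: [B, D] clean-image features from the frozen pre-trained encoder,
                         used to preserve semantic knowledge
    """
    probs = adv_logits.softmax(dim=-1)
    conf = probs.gather(1, labels.unsqueeze(1)).squeeze(1)    # confidence on the true class
    weight = (1.0 - conf).detach()                            # emphasize hard adversarial examples
    ce = F.cross_entropy(adv_logits, labels, reduction="none")
    align = 1.0 - F.cosine_similarity(adv_feats, frozen_feats, dim=-1)
    return (weight * ce).mean() + lam * align.mean()
```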