🧩 Multimodal VLM
🧪 ICML2026 · 30 paper notes
📌 Same area in other venues: 💬 ACL2026 (46) · 📷 CVPR2026 (230) · 🔬 ICLR2026 (88) · 🤖 AAAI2026 (88) · 🧠 NeurIPS2025 (146) · 📹 ICCV2025 (132)
🔥 Top topics: Multimodal/VLM ×18 · Reasoning ×8 · LLM ×5 · Adversarial Robustness ×3 · Compression ×2
- Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds
-
Through a pilot study, the authors find that "explicitly lifting vision to point clouds and then fusing with 2D patches" is the most effective way to inject 3D information into VLA models. To address the scarcity of 3D data and the domain gaps among point cloud sources (simulation, sensor, monocular estimation), they propose Any3D-VLA, which uses hybrid point cloud training to learn source-agnostic geometric representations and achieves a 29.2-percentage-point improvement (62.5% vs 33.3%) over the strongest baseline on real-world zero-shot grasping tasks.
- Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
-
This work requires VLM outputs to be split into `<recognition>` perception blocks and `<think>` reasoning blocks. A "blindfolded" text-only reasoning agent, which cannot access the image and sees only the perception text written by the VLM, then checks whether the question can be answered correctly; the result serves as the perception reward \(R_P\), paired with structured verbal verification (SVV) as the outcome reward \(R_O\). MoCA uses \(R_P\) as a gate for modality-level credit assignment, enabling a 7B model to improve simultaneously across 9 perception/reasoning/rich-modality benchmarks and to surpass GPT-4o on multiple metrics.
- Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners
-
Addressing the "understanding–generation gap" of unified multimodal models on anything-to-image (X2I) tasks (they can understand but cannot generate), this paper proposes the Self-Adaptive Interleaved Reasoner. A hierarchical data synthesis pipeline routes 50,000 samples among direct generation, self-reflection, and multi-step planning modes; SFT + GRPO training is then applied with step-wise reasoning rewards and intra-group complexity penalties, enabling Emu3.5 to surpass GPT-4o, Gemini 2.5 Flash, and other closed-source models on KRIS-Bench and OmniContext.
- Calibrated Multimodal Representation Learning with Missing Modalities
-
Addressing the practical scenario of "training unified multimodal alignment with partial modality data such as V-T, A-T," this work theoretically establishes upper and lower bounds for "anchor shift caused by missing modalities" via singular value perturbation, and proposes CalMRL: a probabilistic PCA-style generative model performs closed-form EM imputation for missing modalities at the representation level, then feeds both observed and imputed representations into the SVD alignment objective of GRAM/PMRL. On VAST, cross-modal average Recall@1 is improved from 44.8 to 54.2 (+9.4).
- DCER: Robust Multimodal Fusion via Dual-Stage Compression and Energy-Based Reconstruction
-
DCER unifies "intra-modal frequency domain compression + cross-modal bottleneck token" as a robust fusion pipeline, employs a learned energy function for gradient-based reconstruction of missing modalities, and uses the final energy value as intrinsic uncertainty, achieving new SOTA on MOSI/MOSEI/SIMS.
- FreeRet: MLLMs as Training-Free Retrievers
-
FreeRet proposes a fully training-free two-stage multimodal retrieval framework: Stage 1 bypasses the MLLM's final MLP and uses controlled generation prompts to extract semantically faithful embeddings for candidate retrieval; Stage 2 reformulates reranking as a multiple-choice question to avoid LLM framing bias. On MMEB, it outperforms retrieval models trained on tens of millions of paired samples.
- FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision–Language Models
-
FRISM refines "VLM × LRM merging" from the layer level to the SVD subspace level: it uses the SVD subspaces of LRM task vectors as reasoning priors, then applies label-free self-distillation (training only learnable gates, with a KL term for vision preservation and spectral norm maximization for reasoning absorption) to find the optimal injection strength, thereby significantly improving VL reasoning performance without notable vision degradation.
- Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs
-
This paper unifies quantization-aware training (QAT) and knowledge distillation (KD) from the Information Bottleneck (IB) perspective, proposing the GRACE framework (confidence-gated decoupled distillation + relational centered kernel alignment + adaptive IB controller). This enables INT4-quantized LLaVA / Qwen-VL models not only to avoid performance drops but to surpass BF16 baselines on multiple benchmarks, achieving 3× throughput and 54% memory savings in real-world deployment.
- Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
-
This work reframes the "regression to the mean" issue of MLLM continuous-value regression under long-tail distributions as a distribution-aware RL problem. Within the GRPO framework, the Concordance Correlation Coefficient (CCC) is used as a batch-level reward that simultaneously accounts for correlation, variance, and mean, thus explicitly penalizing collapse of the prediction distribution. On four long-tail regression tasks with Qwen2.5-VL-3B/7B, it consistently outperforms SFT, SoftLabel, and various point-wise RL methods, with especially significant MAE reductions in medium/few-shot regions.
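To make the batch-level reward concrete, here is a minimal sketch of Lin's Concordance Correlation Coefficient in plain Python. The function name `concordance_correlation` and the zero-denominator convention are assumptions for illustration; how the paper plugs CCC into GRPO is not shown here.

```python
from statistics import fmean


def concordance_correlation(preds, targets):
    """Lin's CCC: 2*cov / (var_p + var_t + (mean_p - mean_t)^2)."""
    if len(preds) != len(targets) or len(preds) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p, mean_t = fmean(preds), fmean(targets)
    # Population (1/n) moments, as in the standard CCC definition.
    var_p = fmean([(p - mean_p) ** 2 for p in preds])
    var_t = fmean([(t - mean_t) ** 2 for t in targets])
    cov = fmean([(p - mean_p) * (t - mean_t) for p, t in zip(preds, targets)])
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        # Both sequences are the same constant; CCC is undefined.
        # Returning 0.0 is a convention assumed for this sketch.
        return 0.0
    return 2 * cov / denom


targets = [1.0, 2.0, 3.0, 4.0]
perfect = concordance_correlation([1.0, 2.0, 3.0, 4.0], targets)
collapsed = concordance_correlation([2.5, 2.5, 2.5, 2.5], targets)
shifted = concordance_correlation([2.0, 3.0, 4.0, 5.0], targets)
```

Predictions that collapse to the mean have zero covariance with the targets, so they score 0 even though their point-wise error may look moderate; a constant offset is penalized through the mean term.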
- Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models
-
This work finds that the intermediate-layer embeddings of instruction tokens in MLLMs naturally filter misleading information introduced from the visual side. Based on this, a training-free InsLen score (Calibrated Local Score + Context Consistency Score) is proposed, which improves object hallucination detection AUROC by up to 13.81% across 5 MLLMs × 4 benchmarks.
- Large Vision-Language Models Get Lost in Attention
-
This paper quantitatively diagnoses the residual stream of LVLMs using a geometric information-theoretic framework based on "information complexity (eRank) + subspace support." It finds that attention almost exclusively performs reconfiguration within the subspace, while the FFN injects new semantic dimensions. Even more strikingly, replacing learned attention weights with Gaussian noise leads to equal or improved performance on most vision tasks, revealing severe misalignment and redundancy in visual attention of contemporary LVLMs.
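For readers unfamiliar with eRank, the sketch below uses the common definition of effective rank: the exponential of the Shannon entropy of the normalized singular values. Whether the paper uses exactly this variant is an assumption, and the helper name `effective_rank` is hypothetical.

```python
import math


def effective_rank(singular_values):
    """Effective rank: exp of the entropy of the normalized singular values."""
    positive = [s for s in singular_values if s > 0]
    if not positive:
        return 0.0
    total = sum(positive)
    # Treat normalized singular values as a probability distribution.
    entropy = -sum((s / total) * math.log(s / total) for s in positive)
    return math.exp(entropy)


flat_spectrum = effective_rank([1.0, 1.0, 1.0, 1.0])   # energy spread over 4 directions
single_direction = effective_rank([5.0, 0.0, 0.0])     # energy in one direction
```

A flat spectrum yields an effective rank equal to the number of directions, while a spectrum dominated by one singular value yields a value close to 1.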
- Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training
-
VISTA transforms self-improvement training for multimodal large models into a two-stage pipeline: "augmenting hard samples via prefix resampling, filtering pseudo-positive samples via Visual Attention Score (VAS)." On Qwen2.5-VL-3B, it achieves an average improvement of +13.66% on mathematical/medical multimodal reasoning.
- Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization's Impact on VLMs Beyond Accuracy
-
Through 700,000 experimental runs covering 16 quantization methods × 10 VLMs × multiple reliability metrics, this work finds that quantization is not merely a destructive process: it suppresses high-rank, low-variance spectral components, thereby improving calibration, OOD detection, and noise robustness, but it also amplifies sensitivity to covariate shift and reliance on spurious correlations.
- LIMSSR: LLM-Driven Sequence-to-Score Reasoning under Training-Time Incomplete Multimodal Observations
-
The authors reformulate multimodal action quality assessment with "missing modalities during training" as an "LLM-based conditional sequence-to-score reasoning" problem. By using prompts and special tokens, the LLM is guided to complete missing semantics without full data supervision. Combined with mask-aware dual-path fusion to suppress hallucination, the method outperforms SOTA models that rely on complete training data across three AQA datasets.
- Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models
-
Model-Dowser scores each parameter in an MLLM using the product of "weight magnitude × input activation × output Jacobian." High-scoring parameters are frozen, and only low-scoring ones are updated. This enables deep fine-tuning on LLaVA/NVILA to learn downstream tasks while retaining pretraining knowledge. Compared to SPIDER and ModelTailor, it consistently leads in H-score.
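The scoring rule above can be sketched for a single linear layer as follows. This is only an illustration under stated assumptions: per-weight scores are taken as |W[i][j]| × |input activation j| × |output Jacobian i|, and the `freeze_ratio` parameter and both function names are hypothetical rather than taken from the paper.

```python
def importance_scores(weights, input_act, output_jac):
    """Score W[i][j] as |weight| * |input activation j| * |output Jacobian i|."""
    return [
        [abs(w) * abs(input_act[j]) * abs(output_jac[i]) for j, w in enumerate(row)]
        for i, row in enumerate(weights)
    ]


def trainable_mask(scores, freeze_ratio):
    """Return True for weights that stay trainable (the low-scoring ones)."""
    flat = sorted((s for row in scores for s in row), reverse=True)
    n_freeze = int(len(flat) * freeze_ratio)
    if n_freeze == 0:
        return [[True for _ in row] for row in scores]
    threshold = flat[n_freeze - 1]
    # Weights scoring at or above the threshold are frozen; ties are frozen too.
    return [[s < threshold for s in row] for row in scores]


weights = [[1.0, -3.0], [0.5, 4.0]]   # rows = output units, columns = input units
scores = importance_scores(weights, input_act=[1.0, 0.5], output_jac=[2.0, 3.0])
mask = trainable_mask(scores, freeze_ratio=0.5)
```

In a training loop, such a mask would be multiplied into the gradient so that only the low-scoring weights receive updates.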
- Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives
-
To address the visual forgetting problem of MLLMs in cross-scenario VQA, this work constructs the MSVQA benchmark (four scenarios: high-altitude, underwater, low-altitude, indoor) and proposes the Unifier framework—adding CSR multi-branch modules and a projector (VRE) for parameter isolation within vision blocks, then aligning different branch representations with a KL-based soft constraint (VCC). With a single inference, Unifier improves VQA by 2.70–10.62% and F1 by 3.40–7.69% over 20-step continual learning.
- MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
-
MUSE attributes the "understanding-generation" zero-sum dilemma of unified visual tokenizers to manifold misalignment, proposing the gradient orthogonality hypothesis—injecting semantics into \(W_V\) while structural gradients flow through \(W_{Q,K}\). Through Synergistic Block + DINOv3 topological alignment + NCE semantic anchoring, the two are fully decoupled. As a result, gFID 3.08 and linear probing 85.2% (even surpassing the InternViT-300M teacher at 82.5%) coexist, achieving genuine "mutual reinforcement" rather than trade-off for the first time.
- OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
-
This paper identifies that existing Omni-LLM token compression methods treat audio and video "symmetrically," which is suboptimal. It proposes OmniSIFT—a two-stage, modality-asymmetric compression framework: first, spatio-temporal saliency prunes video redundancy to obtain "visual anchors," then these anchors guide audio token selection. With only 4.85M extra parameters, OmniSIFT consistently outperforms existing compression baselines and even the original model on Qwen2.5-Omni-7B when retaining 25% of tokens.
- Perceptual Flow Network for Visually Grounded Reasoning
-
Abandoning the traditional RLVR approach of "hard supervision with precise expert bounding boxes," PFlowNet models the perceptual behavior itself as a structured Perceptual Flow latent variable. It approximates the ideal reasoning-oriented posterior with a variational distribution \(p_\theta(Z|X)\), and is trained using Sub-TB variational RL, multi-dimensional rewards, and Vicinal Geometric Shaping. As a result, the 8B Qwen3-VL achieves a new SOTA of 90.6% on V* Bench and 67.0% on MME-RealWorld-lite.
- Referring Multiple Regions with Large Multimodal Models via Contextual Latent Steering
-
CSteer proposes a training-free latent steering method that constructs "contextual vectors" from the difference in hidden activations between incorrect/correct referring answers. These vectors are hierarchically injected into early query layers and mid-to-late decode layers during inference, enabling general LMMs (Qwen3-VL, InternVL-3.5) to outperform specialized region LMMs fine-tuned for multi-region visual referring tasks.
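A minimal sketch of difference-based latent steering, assuming the contextual vector is the mean hidden activation of correct answers minus that of incorrect answers and that it is added to a hidden state with a scalar strength. The sign convention, the `strength` parameter, and all function names are assumptions for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def contextual_vector(correct_acts, incorrect_acts):
    """Direction pointing from incorrect-answer activations to correct ones."""
    mean_correct = mean_vector(correct_acts)
    mean_incorrect = mean_vector(incorrect_acts)
    return [c - i for c, i in zip(mean_correct, mean_incorrect)]


def steer(hidden, vector, strength):
    """Shift a hidden state along the steering direction."""
    return [h + strength * v for h, v in zip(hidden, vector)]


direction = contextual_vector(
    correct_acts=[[1.0, 2.0], [3.0, 4.0]],
    incorrect_acts=[[0.0, 1.0], [2.0, 1.0]],
)
steered = steer([0.0, 0.0], direction, strength=0.5)
```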
- Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models
-
This work redefines LVLM hallucination as "visual information loss suppressed by language priors." By orthogonally projecting out the language prior from the original visual direction to obtain a "pure visual vector," and using risk gating to intervene sparsely at only the single optimal deep layer, the method reduces the CHAIR\(_S\) hallucination rate by ~19% without training, while preserving general capability on MM-Vet.
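The projection step can be illustrated with standard vector rejection: subtract from the visual direction its component along the language-prior direction. This is a sketch of the linear algebra only; the function names and the toy 2-D vectors are hypothetical, and how the paper estimates either direction is not covered.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_out(visual, prior):
    """Remove from `visual` its component along `prior` (vector rejection)."""
    prior_norm_sq = dot(prior, prior)
    if prior_norm_sq == 0:
        # No prior direction to remove; return the vector unchanged.
        return list(visual)
    scale = dot(visual, prior) / prior_norm_sq
    return [v - scale * p for v, p in zip(visual, prior)]


pure_visual = project_out(visual=[3.0, 4.0], prior=[1.0, 0.0])
```

The result is orthogonal to the prior direction, so adding it to a hidden state changes nothing along the language-prior axis.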
- ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning
-
This work systematically reveals that the widely used VSI-Bench suffers from structural failures due to 3D annotation drift and inconsistent frame sampling. The authors re-annotate 381 scenes and 5,365 objects, design frame-budget adaptive QA, and introduce a dummy video stress test by removing all frames containing the queried object, resulting in a high-fidelity spatial intelligence benchmark named ReVSI. Evaluation shows that open-source VLMs experience up to a 40% drop on ReVSI, and still exhibit high hallucination rates on dummy videos, exposing that current spatial reasoning abilities have been systematically overestimated by VSI-Bench.
- ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision
-
Addressing the prevalent use of "sparse grounding" annotation and loss of full-screen structure in GUI agents, this work introduces a fully automated Webshot pipeline to construct the dense screen parsing dataset ScreenParse, comprising 771K screenshots, 21M elements, and 55 classes. The authors train ScreenVLM, a model with only 316M parameters, to parse entire screens into ScreenTag structural sequences, outperforming 8B-scale foundation VLMs on both dense parsing and sparse grounding benchmarks while reducing latency to approximately \(1/4\).
- Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
-
This paper reproduces the puzzling phenomenon that "VLMs outperform their base LLMs on pure text tasks" using a controlled synthetic "color-shape-item" retrieval task, and mechanistically explains it: visual training shifts the model's variable binding strategy from "positional shortcuts" to "semantic-symbolic matching." This shift is retained when switching back to pure text, boosting OOD retrieval accuracy from 37.2% to 69.5%. Consistent increases in the "symbolic/positional ratio" are also observed in real Qwen2/2.5/3 model families.
- Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models
-
This work leverages Pointwise Partial Information Decomposition to quantify vision-text modality interactions and proposes a Multimodal Interaction Gate: it automatically selects samples dominated by "image-unique information" and lets the VLM self-generate captions to inject into the text side, thereby converting unique visual signals into redundant shared signals. As a result, the VLM's visual hallucination under blurry or corrupted inputs drops by 38.3%, and consistency improves by 16.8%.
- SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
-
SLQ appends a small set of "shared latent queries" \(\mathbf{Q}\) to the end of image/text token sequences, leveraging the MLLM's own causal attention to aggregate global context. By training only a few thousand query parameters, a frozen MLLM is turned into a retriever that outperforms full fine-tuning and LoRA on COCO/Flickr30K. The paper also introduces KARR-Bench to evaluate "implicit knowledge reasoning" capabilities.
- The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design
-
This paper formalizes the issue of "VLMs failing to perceive details" as a Sequential Bayesian Optimal Experimental Design (S-BOED) problem, and proposes a training-free FOVEA module based on a computable proxy objective of "coverage × resolution." FOVEA consistently outperforms Direct and ReAct-style baselines on high-resolution and remote sensing benchmarks.
- Toward Structural Multimodal Representations: Specialization, Selection, and Sparsification via Mixture-of-Experts
-
This paper proposes the S3 framework, which uses MoE to decompose multimodal representations into concept-level experts (Specialization), activates relevant experts per task via routing (Selection), and prunes low-contribution paths at inference based on routing scores (Sparsification). On four MultiBench benchmarks, it reveals an "inverted-U" curve where performance peaks at intermediate sparsity, presenting a third paradigm for multimodal representation beyond contrastive learning/InfoMax.
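The Selection and Sparsification steps can be sketched as top-k pruning over softmax routing scores. This is a generic MoE routing illustration, not the paper's implementation: the `keep` parameter, the renormalization of the surviving weights, and the function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparsify_routing(logits, keep):
    """Keep the `keep` highest-scoring experts and renormalize their weights."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = sorted(ranked[:keep])
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Four experts; only the two with the highest routing scores survive.
routing = sparsify_routing([2.0, 0.0, 1.0, -1.0], keep=2)
```

Sweeping `keep` from 1 to the number of experts is one way to trace a sparsity-versus-performance curve like the "inverted-U" described above.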
- Vision-aligned Latent Reasoning for Multi-modal Large Language Model
-
This paper proposes VaLR: inserting several "latent tokens" before each CoT reasoning step in MLLMs, and aligning these tokens with patch features from visual encoders such as DINOv3/SigLIP/π³ (REPA), thereby continuously "feeding back" visual information to the model during long-chain reasoning. This approach boosts Qwen2.5-VL's accuracy on VSI-Bench from 33.0% to 52.9%, and for the first time enables MLLMs to exhibit "the longer the reasoning, the higher the accuracy" test-time scaling behavior.
- What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity (GLANCE)
-
GLANCE introduces a "think-see alignment" self-supervised head into VLM agent reinforcement learning: the LLM’s "next state prediction" in CoT is projected via a lightweight projector to the real next-frame representation encoded by an EMA target vision encoder. The prediction gap serves simultaneously as intrinsic curiosity reward, training signal for the vision encoder, and an alignment loss to ground the internalized world model. Combined with a curriculum exploration mechanism that periodically resets the projector to counteract curiosity drain, GLANCE consistently outperforms existing exploitation-only VLM-RL methods across five agentic tasks.
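Two ingredients named above, the EMA target encoder and the prediction gap used as a curiosity reward, can be sketched as follows. Measuring the gap as mean squared error, the `decay` parameter, and both function names are assumptions for illustration; encoder parameters and representations are reduced to flat lists of floats.

```python
def ema_update(target_params, online_params, decay):
    """Move target parameters toward online parameters by a factor (1 - decay)."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target_params, online_params)]


def curiosity_reward(predicted_next, actual_next):
    """Prediction gap (mean squared error) between predicted and real next-frame features."""
    squared = [(p - a) ** 2 for p, a in zip(predicted_next, actual_next)]
    return sum(squared) / len(squared)


new_target = ema_update([1.0, 1.0], [3.0, 5.0], decay=0.5)
surprise = curiosity_reward([0.0, 0.0], [1.0, 3.0])
no_surprise = curiosity_reward([1.0, 2.0], [1.0, 2.0])
```

A large gap means the agent's predicted next state was wrong, which rewards visiting that state; once predictions become accurate the reward shrinks, which is the "curiosity drain" the periodic projector reset is meant to counteract.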