🧩 Multimodal VLM

🧪 ICML2026 · 89 paper notes

📌 Same area in other venues: 📷 CVPR2026 (420) · 🔬 ICLR2026 (211) · 💬 ACL2026 (83) · 🤖 AAAI2026 (75) · 🧠 NeurIPS2025 (107) · 📹 ICCV2025 (119)

🔥 Top topics: Multimodal/VLM ×43 · Adversarial Robustness ×11 · Alignment/RLHF ×5 · LLM ×4 · Compression ×3

ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning: ACTIVE-o3 lets the MLLM itself learn "where and how to look." Using pure reinforcement learning (GRPO), the model is trained to select, in parallel, up to 3 sub-regions most worth magnifying. A dual-form reward (task reward + heuristic reward) addresses the sparsity of pure task rewards. The method consistently outperforms baselines in small/dense object detection, remote sensing, autonomous driving, and interactive segmentation, while also improving general understanding benchmarks such as RealWorldQA and MME.
AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions: This paper proposes AgentHijack, a benchmark that evaluates the robustness of computer-use agents using 9 categories of configurable everyday environment corruptions. Furthermore, it introduces DA-GRPO to strengthen grounding and an onlooker for behavioral summarization and environment checking, improving the average success rate of UI-TARS-1.5-7B from 18.74% to 22.89%.
Alterbute: Editing Intrinsic Attributes of Objects in Images: Alterbute utilizes VLMs to automatically mine Visual Named Entity (VNE) identity clusters and jointly conditions a diffusion model on identity references, attribute text, background, and masks. This approach provides a unified framework for editing object color, texture, material, and shape while preserving object identity and scene context.
Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds: Through a pilot study, the authors find that "explicitly lifting vision to point clouds and fusing them with 2D patches" is the most effective way to inject 3D information into VLA models. To address 3D data scarcity and domain gaps across different point cloud sources (simulation, sensor, or monocular estimation), Any3D-VLA is proposed. By employing hybrid point cloud training to learn source-agnostic geometric representations, it achieves a 29.2-percentage-point zero-shot improvement over the strongest baseline (62.5% vs. 33.3%) in real-world grasping tasks.
AOEPT: Breaking the Implicit Modality-Reduction Bottleneck in Modality-Missing Prompt Tuning: AOEPT points out that existing missing-modality prompt tuning compresses the inference scope of Multimodal Transformers into visible modality subspaces. It utilizes Modal-Contextualized Prompts (MCPs) distilled from the training set as a retrievable implicit information source for missing modalities, consistently outperforming existing methods across multiple datasets, missing rates, and backbones.
Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination: This paper introduces VisualSwap and VS-Bench to test real visual re-examination capabilities by replacing the image after a VLM claims to "take another look." The study finds that current reasoning-heavy VLMs often follow the inertia of previous text, with only explicit multi-turn user instructions or enhanced visual attention significantly restoring grounding.
AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs: AVI-Bench is an audio-visual benchmark inspired by human cognition. It organizes the evaluation of Omni-MLLMs into three stages: "Perception → Understanding → Reasoning," supplemented by a "Primitive Sensation" (PriSe) extension. Using 14 tasks, 5,864 samples, and 9 metrics, it systematically diagnoses the Audio-Visual Intelligence (AVI) of 28 open-source/closed-source Omni-MLLMs and proposes a four-level AVI taxonomy.
Benchmarking and Enhancing VLM for Compressed Image Understanding: This paper constructs the first large-scale benchmark (11 codecs, 9 VLMs, 1M+ compressed images) to evaluate VLM understanding of compressed images. Performance degradation is decomposed into an irrecoverable "information gap" and a remediable "generalization gap." A lightweight conditional vision-encoder adapter is proposed: it takes codec type and compression level as conditional embeddings and is trained with distillation, improving VLM performance by 10%–30% across various encoders and bitrates.
Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated: This paper proposes that VLM urban perception benchmarks should possess two key attributes: "reliability-aware" and "negotiated." By utilizing a benchmark comprising 100 Montreal street-view images, 12 community annotators, and 30 measurement dimensions, it reveals that model alignment is positively correlated with annotator consistency and that models exhibit systematic distributional biases compared to humans in subjective evaluation dimensions.
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling: DiNa-LRM is proposed to establish preference learning directly on the noisy latent space of diffusion models. Through noise-calibrated Thurstone likelihood and inference-time multi-noise ensembles, it achieves preference prediction accuracy close to SOTA with significantly lower computational overhead than VLM-based reward models.
Calibrated Multimodal Representation Learning with Missing Modalities: Addressing the practical scenario of "training unified multimodal alignment using partial modality data such as V-T and A-T," this paper derives theoretical upper and lower bounds for "anchor shift" caused by missing modalities using singular value perturbation theory. It proposes CalMRL: a Probabilistic PCA-style generative model performs closed-form EM imputation for missing modalities in the representation space. The observed and imputed representations are jointly fed into the SVD alignment objective of GRAM/PMRL. On the VAST benchmark, the cross-modal average Recall@1 is improved from 44.8 to 54.2 (+9.4).
Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing: This paper extends Randomized Smoothing (RS) from scenarios supporting only single continuous or discrete inputs to hybrid perturbation environments involving "discrete tokens + continuous images." By employing a hybrid Neyman–Pearson analysis, the authors derive a one-dimensional, continuous, and invertible likelihood ratio CDF. This transforms the combinatorial knapsack problem into a solvable root-finding problem. It provides the first model-agnostic certificate against "jointly unsafe vision-language" inputs on LLaVA-Guard multimodal safety filtering.
CG-MLLM: Captioning and Generating 3D Content via Multi-modal Large Language Models: CG-MLLM proposes a Mixture-of-Transformer-based multi-modal large language model that combines a pre-trained VLM backbone with a 3D VAE latent space via a dual Transformer architecture consisting of TokenAR (token-level autoregressive) and BlockAR (block-level parallel). It achieves end-to-end high-resolution 3D content generation and 3D captioning within a single MLLM framework for the first time, reaching SOTA among MLLM-based 3D generation methods.
Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Vision-Language Models: The authors propose Circle-RoPE, which maps the 2D coordinates of image tokens onto a circle orthogonal to the text position axis. This forms a conical geometry in which the RoPE distance from each text token to all image tokens is equal (PTD=0), eliminating cross-modal pseudo-positional biases while preserving internal image spatial structure through Alternating Geometric Encoding (AGE).
Conditional Diffusion Sampling: This paper proposes Conditional Diffusion Sampling (CDS): by deriving a class of conditional interpolants, an exact closed-form SDE for unnormalized target distributions is obtained (without needing neural network approximation). Parallel Tempering (PT) is then used to efficiently sample the initial distribution of this SDE—combining PT's global exploration capability with the diffusion process's local refinement ability. Across 8 target distributions and 4 task categories, it outperforms traditional MCMC, training-free MCMC, and neural samplers with fewer density evaluations.
Contextualized Visual Personalization in Vision-Language Models: CoViP reduces the open-ended task of "visual personalization based on user history" to a shared underlying process of "personalized image captioning." By employing RL post-training with verifiable rewards and inference-time Caption-Augmented Generation (CAG), it enables VLMs to "generate human-like grounded descriptions" within interleaved vision-language contexts, complemented by an MCQA diagnostic benchmark designed to exclude textual shortcuts.
Contrastive Spectral Rectification: Test-Time Defense towards Zero-shot Adversarial Robustness of CLIP: The authors discover that adversarial features collapse sharply when middle and high frequencies are gradually removed (unlike clean samples). Based on this, they propose CSR, a test-time defense: it utilizes "spectral consistency" as a gate to detect adversarial samples and then optimizes a rectification perturbation on the input using a contrastive objective—pulling features toward low-pass anchors and pushing them away from original adversarial features—to return the image to the natural manifold. CSR achieves an average improvement of 18.1% against the strong APGD attack across 16 classification benchmarks with minimal inference overhead.
CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception: CVSearch proposes a training-free "Assess-then-Search" cognitive framework: a rapid localization is first performed using a visual expert (SAM 3); if the expert fails, a semantic-guided adaptive patching and bottom-up search are triggered as a fallback. It achieves SOTA in both accuracy and efficiency on high-resolution benchmarks such as V*Bench and HR-Bench.
CyberJurors: A Multi-Agent Simulation Task for E-Commerce Disputes Verdict: The authors formalize the real-world crowdsourced jury mechanism of e-commerce platforms into the EDV (E-commerce Dispute Verdicts) task. They construct VerdictBench, the first multimodal benchmark containing 6,000 cases with ground-truth voting distributions from 17 jurors (text/images/video/multi-round). They propose CyberJurors, which uses a four-phase Individual Verdict Chain-of-Thought (IV-CoT) for fine-grained evidence localization and a Jury Consensus Verdict (JCV) mechanism that incorporates historical precedents via "Stare Decisis" for collective consensus. On VerdictBench, CyberJurors improves accuracy by \(+9.48\%\), \(+9.38\%\), and \(+6.19\%\) compared to the strongest LLM, MLLM, and court simulators, respectively.
DCER: Robust Multimodal Fusion via Dual-Stage Compression and Energy-Based Reconstruction: DCER establishes "intra-modal frequency domain compression + cross-modal bottleneck tokens" as a unified robust fusion pipeline. It utilizes a learned energy function for gradient-descent reconstruction of missing modalities, while treating the final energy value as an intrinsic uncertainty measure, achieving new SOTA results on MOSI/MOSEI/SIMS.
Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models: The authors construct the first multimodal benchmark for MLLM deceptive behavior, MM-DeceptionBench (6 categories, 1,013 real cases), and propose the "Debate with Images" framework. In this framework, two MLLM agents must use visual operations to retrieve evidence from the original image during multi-round debates, after which a judge issues a decision. Compared to MLLM-as-a-judge, this improves agreement with humans (Cohen's kappa) by up to 1.5\(\times\) and accuracy by up to 1.25\(\times\).
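For readers unfamiliar with the agreement metric mentioned above, the following is a minimal sketch of the standard Cohen's kappa between two raters. The label names and the toy data are illustrative assumptions and do not come from MM-DeceptionBench.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy data (hypothetical): human annotations vs. an automatic judge.
human = ["deceptive", "honest", "deceptive", "honest", "honest", "deceptive"]
judge = ["deceptive", "honest", "honest", "honest", "honest", "deceptive"]
kappa = cohens_kappa(human, judge)
```

Here the raters agree on 5 of 6 items while chance agreement is 0.5, so kappa is about 0.67; unlike raw accuracy, the metric discounts agreement that the label frequencies alone would produce.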
Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging: The authors develop a local quadratic theory for weight merging starting from "merge-ready flat basins": merging gain equals curvature-weighted checkpoint variance. Splitting along the principal directions of gradient conflict via PCA maximizes this gain. Based on this, the MERIT pipeline is proposed: PCA-based splitting by dataset conflict, independent fine-tuning with zero communication, and final one-shot token-weighted averaging. It improves the 8-benchmark average of Qwen2.5-VL-3B from 54.3 to 57.0 on 136 Vision-FLAN tasks.
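The final merging step described above can be illustrated with a small sketch of token-weighted checkpoint averaging. Representing a checkpoint as a dict of flat float lists, the parameter names, and the token counts are all assumptions made for illustration; the PCA-based splitting and the fine-tuning itself are not shown.

```python
def token_weighted_average(checkpoints, token_counts):
    """Average checkpoints with weights proportional to their training token counts.

    checkpoints:  list of dicts mapping parameter name -> flat list of floats
    token_counts: number of tokens each checkpoint was fine-tuned on
    """
    if len(checkpoints) != len(token_counts) or not checkpoints:
        raise ValueError("need one token count per checkpoint")
    total = float(sum(token_counts))
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(ckpt[name][i] * n for ckpt, n in zip(checkpoints, token_counts)) / total
            for i in range(size)
        ]
    return merged


# Two hypothetical experts fine-tuned on 3000 and 1000 tokens respectively.
ckpt_a = {"proj.weight": [1.0, 2.0], "proj.bias": [0.0]}
ckpt_b = {"proj.weight": [3.0, 6.0], "proj.bias": [4.0]}
merged = token_weighted_average([ckpt_a, ckpt_b], token_counts=[3000, 1000])
```

Because the merge happens once at the end, the experts never need to exchange gradients or weights during training.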
Deep Pre-Alignment for VLMs: The authors replace the standard "ViT + lightweight projector" visual module in VLMs with a small VLM (perceiver). This allows the depth-intensive task of modality alignment to be completed upstream within the small VLM, ensuring the downstream large LLM does not waste its initial layer depth on alignment. Results show a +1.9 point improvement for a 4B model across 8 multimodal benchmarks and +3.0 points for a 32B model, while reducing language capability forgetting by 32.9% with only a 2–6% drop in inference throughput.
DenseMLLM: Standard Multimodal LLMs for Dense Prediction: The authors integrate dense prediction tasks—such as semantic segmentation, depth estimation, and referring expression segmentation—directly into a standard 4B MLLM (ViT + Projector + LLM) without any task-specific decoders. By introducing "Multi-label Next-Token Prediction" (NTP-M) supervision for vision tokens, the model achieves 54.2 mIoU on ADE20K, 87.6 \(\delta_1\) on DDAD, and 80.7 cIoU on RefCOCO val, while maintaining general VL performance comparable to Qwen3-VL-4B.
Density-Aware Translation of Spurious Correlations in Zero-Shot VLMs: The authors observe that CLIP embeddings exhibit an anisotropic ellipsoidal distribution on the hypersphere, where spurious samples cluster near the mean. They propose DAT: estimating a local density \(D_{y,a}(z)\) using a reference set for each (class, spurious attribute) group, and rescaling the original cosine similarity via \(\tilde s_{y,a}(x)=s_{y,a}(x)/(D_{y,a}(z)+\varepsilon)^{\lambda}\) based on whether a sample resides in the core of the group. This significantly improves worst-group accuracy without fine-tuning, text-side modifications, or test-time spurious attribute labels.
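The rescaling formula above can be written as a few lines of code. This is only a sketch of the formula \(\tilde s_{y,a}(x)=s_{y,a}(x)/(D_{y,a}(z)+\varepsilon)^{\lambda}\) as stated in the summary; the density values, the group names, and the choice \(\lambda=0.5\) are hypothetical, and how \(D_{y,a}(z)\) is estimated from the reference set is not modeled.

```python
def density_rescaled_similarity(similarity, density, lam=1.0, eps=1e-6):
    """Return s / (D + eps) ** lam for one (class, spurious attribute) group."""
    if density < 0:
        raise ValueError("density must be non-negative")
    return similarity / (density + eps) ** lam


# Hypothetical cosine similarities and local densities for two groups.
groups = {
    ("waterbird", "land"): {"s": 0.30, "D": 2.0},
    ("landbird", "land"): {"s": 0.32, "D": 4.0},
}
rescaled = {
    key: density_rescaled_similarity(g["s"], g["D"], lam=0.5, eps=0.0)
    for key, g in groups.items()
}
```

In this toy case the raw similarity favors the second group (0.32 vs. 0.30), while the rescaled score favors the first (about 0.21 vs. 0.16), which shows how a high local density can discount an otherwise larger similarity.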
Detached Skip-Links and \(R\)-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR: Addressing OCR scenarios in MLLMs, the authors apply stop-gradient (Detached Skip-Links) to shallow skip branches within a multi-layer ViT→LLM fusion architecture. Simultaneously, they propose \(R\)-Probe, a reconstruction probe initialized with the "first 1/4 layers of the LLM," to diagnose whether visual tokens effectively deliver fine-grained information to the language model.
Dimension-Free Multimodal Sampling via Preconditioned Annealed Langevin Dynamics: This work provides the first dimension-free non-asymptotic convergence analysis for Preconditioned Annealed Langevin Dynamics (PALD)—reducing the sampling complexity for multimodal distributions from \(\tilde{O}(d/\epsilon^2)\) to \(\tilde{O}(1/\epsilon^2)\), liberating diffusion-based sampling algorithms from the "curse of dimensionality" in high-dimensional settings.
DIVA: Harnessing the Representation Divergence in Unified Multimodal Models for Mutual Reinforcement: DIVA discovers that Unified Multimodal Models (UMM) spontaneously decouple "understanding" and "generation" information flows within intermediate layers. By explicitly factorizing representations into shared and unique components and applying contrastive/CLUB mutual information constraints for "shared alignment + unique decoupling," it simultaneously improves understanding by +7.82% and generation by +8.46% on Show-o/Liquid/Nexus-Gen without architectural modifications.
Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review: As conferences such as AAAI, ICML, and NeurIPS officially incorporate AI-generated reviews into the preliminary review process, this paper presents PaperGuard—the first benchmark to systematically evaluate the vulnerability of multimodal AI reviewers under adversarial manipulation. It unifies black-box prompt injection with white-box gradient attacks (Text GCG; Image PGD/APGD/CW), demonstrating that text-only protection is insufficient (image attacks can inflate scores by \(+14\) points), and proposes a lightweight chunk-level embedding retrieval defense (95% accuracy, zero false positives).
ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation: ECA proposes exemplar-free incremental learning on the "alignment module" (the Q-Former in BLIP-2) of pretrained VLMs. By utilizing Mixture-of-Query to compositionally aggregate task-specific queries per image, expanding parallel adapters on-demand based on a Fisher Information Matrix criterion, and preserving prior knowledge via sparse dictionary replay, the method successfully learns new subjects without catastrophic forgetting in open-ended image-to-text generation tasks where visual topics drift over time.
ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation: ECG-R1 is the first "reasoning-type" medical multimodal large model dedicated to ECG interpretation. Through a suite of protocol-guided instruction data synthesis + decoupled signal/image encoding + interleaved modality dropout training + evidence-driven process reward RL, it improves ECG diagnostic accuracy from the previous SOTA (GEM) of 74.7 to 80.3, while maintaining cross-modal consistency even when a modality is missing.
Explaining Is Harder than Predicting Alone: Evaluating Concept-Based Explanations of MLLMs as ICL Visual Classifiers: The authors utilize a five-level formalization ladder of explanation conditions (classification only → natural language explanation → feature list → IF-THEN knowledge base → DL axioms) and an LLM-as-a-judge pipeline evaluating 9 XAI metrics to conduct 2,080 ICL classification experiments on four SOTA MLLMs. They find that "forcing models to generate more formal concept explanations leads to a monotonic decline in classification accuracy (93.8% → 90.1%)," yet "Local Discriminativeness" is the only explanation quality dimension significantly correlated with accuracy.
FlowNar: Scalable Streaming Narration for Long-Form Videos: FlowNar employs a combination of "segment-end visual KV cache clearing + compressing historical visual information into fixed-length memory tokens via gated linear attention." This allows the streaming video narration model to maintain constant memory and computational overhead, enabling the processing of \(10\times\) longer videos with \(3\times\) throughput. Simultaneously, the introduction of a self-conditioned evaluation protocol reveals that baseline methods are significantly overestimated under real-world deployment scenarios.
Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain: This paper proposes Visual Information Gain (VIG)—a visual dependency metric based on the log-ratio of perplexity between "with image vs. without image (proxied by a blurred image)." VIG quantifies "to what extent a data sample or a specific token depends on the image." Based on this, selective instruction tuning is performed: loss is computed only on high-VIG samples and tokens. This allows LLaVA-1.5-13B to outperform vanilla training across the board using only 21% of effective tokens, while significantly mitigating language bias and hallucinations.
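A minimal sketch of how a perplexity log-ratio of this kind could be computed from per-token log-probabilities follows. The sign convention (blurred-image perplexity in the numerator, so larger values mean stronger image dependence), the per-token variant, and the threshold are assumptions; the paper's exact definition may differ.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("empty sequence")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def sample_vig(logprobs_with_image, logprobs_blurred):
    """Sample-level gain: log(PPL with blurred image / PPL with real image)."""
    return math.log(perplexity(logprobs_blurred) / perplexity(logprobs_with_image))


def token_vig(logprobs_with_image, logprobs_blurred):
    """Token-level gain: how much the real image raises each token's log-prob."""
    return [lp_img - lp_blur
            for lp_img, lp_blur in zip(logprobs_with_image, logprobs_blurred)]


def high_vig_mask(vig_values, threshold):
    """True where the loss would be kept under selective training."""
    return [v > threshold for v in vig_values]


# Hypothetical log-probs of the same 3 answer tokens under both conditions.
with_image = [-0.1, -0.2, -0.3]
blurred = [-0.1, -1.2, -2.3]
gain = sample_vig(with_image, blurred)
mask = high_vig_mask(token_vig(with_image, blurred), threshold=0.5)
```

The first token is predicted equally well without the image, so it is masked out, while the other two tokens depend on the image and are kept.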
FreeRet: MLLMs as Training-Free Retrievers: FreeRet proposes a fully training-free two-stage multimodal retrieval framework: the first stage bypasses the last MLP layer of the MLLM and utilizes controlled generation prompts to extract semantically faithful embeddings for candidate retrieval; the second stage transforms reranking into a multiple-choice question (MCQ) format to circumvent the LLM framing bias. It outperforms retrieval models trained on tens of millions of paired data points on MMEB.
Furina: Fragmented Uncertainty-Driven Refusal Instability Attack: This paper first utilizes multi-metric diagnostics to prove that "LLM safety decisions are not binary thresholds, but exist within a refusal instability band," discovering that this band is characterized by "rising external uncertainty while internal safety signals decrease." Based on this, it proposes Furina—a model-agnostic jailbreak attack that forces inputs into the instability band by breaking malicious intent into fragmented situational narratives, outperforming multiple strong baselines on HarmBench.
FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs: This paper introduces FutureOmni, the first benchmark evaluating MLLMs' ability to forecast future events from audio-visual context (919 videos / 1,034 MCQs). Results show even the strongest model, Gemini 3 Flash, achieves only 64.8% accuracy. The authors propose OFF, a rationale-infused instruction tuning method that significantly enhances both forecasting and generalization for open-source models.
CHARM: Using Multimodal JEPA + Channel Descriptions for Time Series Foundation Embedding: CHARM injects channel text descriptions (e.g., "temperature sensor °C") as an inductive bias into a time series Transformer and trains it using a JEPA objective (latent prediction rather than raw signal reconstruction). The resulting embeddings match specialized models like PatchTST, MOMENT, and Moirai across anomaly detection, classification, and forecasting using simple linear probes, while maintaining strict channel-permutation equivariance.
Hierarchical Synthetic Tabular Data Generation: A Hybrid Top-Down and Bottom-Up Framework: This paper proposes the H-TDBU framework: utilizing LLMs or human-written rules in a top-down path to generate a "logical skeleton" \(\mathcal{S}\), while using lightweight bottom-up generators such as RandomForest, XGBoost, or CTGAN to learn "statistical texture" \(z\). These components are integrated via a conditional generator \(G(z\in\mathcal{Z}\mid\mathcal{S})\) and iteratively refined through a TSTR + XModal feedback loop. On weak multimodal financial benchmarks, it achieves TSTR AUROC superior to pure neural network baselines while maintaining cross-modal consistency.
Hyper-ICL: Attention Calibration with Hyperbolic Anchor Distillation for Multimodal ICL: Hyper-ICL provides a structural prior for LVLM in-context learning by lifting CLIP embeddings into hyperbolic space to form structured "hyperbolic anchors," combined with hierarchy-aware distillation attention. It consistently surpasses traditional demo selection strategies on tasks such as VQA, Captioning, and Caption Editing.
Immuno-VLM: Immunizing Large Vision-Language Models via Generative Semantic Antibodies for Open-World Trustworthiness: This paper ports the "negative selection" principle from biological immune systems to VLMs such as CLIP. It employs an LLM to actively hallucinate a set of "look-alike but non-target" text descriptions as semantic antibodies. A lightweight adapter then pushes visual features away from these antibodies, significantly reducing "high-confidence misclassification" in open-world scenarios without retraining the backbone.
ATHA: Improving CLIP Adaptation on Source-Free Cross-Domain Few-Shot Learning by Breaking Tail Alignment: ATHA proposes an asymmetric alignment paradigm of "aligning head tokens and pushing away tail tokens" for CLIP cross-domain few-shot fine-tuning. By actively pushing semantically sparse patches away from text embeddings, it mitigates overfitting and improves 1-shot average accuracy from 55.92% to 58.35%.
RESTORE: Improving Visual Token Reduction via Distortion Correction for Enhanced MLLM Inference Efficiency: RESTORE highlights two overlooked issues in existing Visual Token Reduction (VTR): "position distortion" and "attention decay." By introducing a distance-aware reverse compensation term for RoPE decay and improving token merging with an anchor selection strategy balancing representativeness and discriminativeness, it enables LLaVA-1.5-7B to approach full-token performance even at 64 tokens (~11% retention).
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression: This work transforms the "regression to the mean" problem of MLLM continuous value regression under long-tailed distributions into a distribution-aware RL problem. Within the GRPO framework, the Concordance Correlation Coefficient (CCC) is utilized as a batch-level reward—evaluating correlation, variance, and mean simultaneously—to explicitly penalize predictive distribution collapse. Across four long-tailed regression tasks using Qwen2.5-VL-3B/7B, the method consistently outperforms SFT, SoftLabel, and various point-wise RL approaches, achieving a significant reduction in MAE, particularly in medium/few-shot regions.
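To make the reward concrete, here is a minimal sketch of the standard Concordance Correlation Coefficient computed over one batch of predictions. Using population (biased) variances and returning 1.0 for a degenerate all-identical batch are assumptions of this sketch, and the GRPO integration is not shown.

```python
def concordance_correlation(preds, targets):
    """CCC = 2*cov / (var_p + var_t + (mean_p - mean_t)**2), population moments."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(preds)
    mean_p = sum(preds) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in preds) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(preds, targets)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        return 1.0  # both sequences are the same constant
    return 2 * cov / denom


# Long-tailed toy targets (hypothetical) with one rare large value.
targets = [1.0, 2.0, 3.0, 4.0, 10.0]
collapsed = [4.0, 4.0, 4.0, 4.0, 4.0]   # "regression to the mean"
spread = [1.5, 2.0, 3.5, 4.0, 9.0]      # follows the tail
reward_collapsed = concordance_correlation(collapsed, targets)
reward_spread = concordance_correlation(spread, targets)
```

Predicting the batch mean everywhere yields a CCC of exactly 0 even though its mean error is zero, whereas predictions that track the tail score about 0.98; this is why a batch-level CCC reward penalizes distribution collapse where a point-wise error would not.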
VEENA: Interpreting and Enhancing Emotional Circuits in Large Vision-Language Models via Cross-Modal Information Flow: VEENA utilizes a steering-vector causal attribution framework to locate emotional circuits in LVLMs. It uncovers a three-stage mechanism: "Adapt (shallow modal alignment) → Aggregate (middle-layer emotion-specific heads) → Execute (deep-layer emotion-general heads + neurons)." Based on this, it introduces training-free inference-time interventions—"Visual Emotion Enhancement" and "Emotional Neuron Augmentation"—to significantly mitigate emotional hallucinations.
Jailbreaking Vision-Language Models Through the Visual Modality: The authors propose four types of jailbreak attacks that bypass frontier VLM security solely through visual inputs (Visual Cipher / Object Replacement / Text Replacement / Visual Analogy Riddle). They systematically demonstrate across six frontier VLMs that "safety alignment on the text side does not automatically transfer to the vision side" and reveal the underlying hierarchical mechanisms through mechanistic analysis.
LBR/LBP: Language Bias in LVLMs — From In-Depth Analysis to Simple and Effective Mitigation: This paper systematically quantifies language bias in LVLM training, finding that both the visual instruction tuning (VIT) and DPO stages cause the text-only likelihood \(\pi(y|x)\) to increase nearly as much as the multimodal likelihood \(\pi(y|x,v)\), which indicates that LVLMs systematically undervalue visual input. The authors propose Language Bias Regularization (using \(|\mathcal{B}|\) to anchor the language path to a reference level during VIT) and Language Bias Penalty (using a sigmoid penalty to actively suppress existing bias during DPO). Without any additional data or auxiliary models, these methods significantly improve performance on 10+ benchmarks and reduce hallucinations.
Large Vision-Language Models Get Lost in Attention: This paper quantitatively diagnoses the residual streams of LVLMs using a geometric information theory framework of "Information Complexity (eRank) + Subspace Support." It finds that Attention primarily performs intra-subspace reconfiguration while FFN injects new semantic dimensions. More surprisingly, replacing learned attention weights with Gaussian noise maintains or even improves performance on most vision tasks, revealing severe mismatch and redundancy in contemporary LVLM visual attention.
Layer-Specific Fine-Tuning for Improved Negation Handling in Medical Vision-Language Models: NAST utilizes causal tracing to calculate the Causal Trace Effect (CTE) of each layer in the CLIP text encoder for negation understanding. These CTEs are then used as layer-wise gradient scaling factors for LoRA fine-tuning, significantly enhancing the semantic sensitivity of medical VLMs in distinguishing between the presence and absence of symptoms, and narrowing the affirmative-negative accuracy gap from 21.6% to 4.2%.
Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data: The authors train a CLIP-style Transformer on a 1D synthetic image-text testbed and find that these models learn "left/right" relations and generalize to unseen object pairs. The mechanism is identified as the cross-term of token and positional embeddings \(EW_{QK}P^T\) inducing a horizontal gradient in the vision encoder's attention logits, breaking symmetry; ablating this term drops spatial discrimination accuracy to chance levels.
Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding: The authors revisit Manga109, the foundational dataset for manga AI research, identifying five categories of dialogue text annotation issues. By combining commercial OCR + dual LLM voting (GPT-5/Gemini 3 Flash) + human verification, they revised approximately 29,000 annotations (19.6% of the total 147,887 text annotations) to release Manga109-v2026, improving end-to-end OCR evaluation H-mean from 48.5 to 62.9 (+14.4 pp).
Measurement Plasticity: Sensor-Level Adaptation for Vision–Language Models: This paper shifts Test-Time Adaptation (TTA) for Vision-Language Models (VLM) from "tuning the model/tokens" to "tuning the camera/photons." By treating the camera's exposure triangle (ISO, shutter speed, aperture) as controllable "physical prompts," it selects multiple physical views based on source domain affinity during the capture stage, followed by entropy filtering and hard voting. Without any gradients or model modifications, this method significantly outperforms digital-only TTA methods under sensor-level distribution shifts.
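The last two steps of this pipeline, entropy filtering followed by hard voting, can be sketched as below. The per-view class probabilities and the number of views kept are hypothetical, and the capture-stage selection of physical views by source-domain affinity is not modeled.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def filter_and_vote(view_probs, keep):
    """Keep the `keep` most confident (lowest-entropy) views, then majority-vote.

    view_probs: one class-probability list per physical view of the same scene
    Returns the index of the winning class.
    """
    if not view_probs or keep < 1:
        raise ValueError("need at least one view and keep >= 1")
    kept = sorted(view_probs, key=entropy)[:keep]
    votes = Counter(max(range(len(p)), key=lambda k: p[k]) for p in kept)
    return votes.most_common(1)[0][0]


# Five hypothetical captures of one scene under different exposure settings.
views = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.34, 0.33, 0.33],
    [0.10, 0.45, 0.45],
    [0.20, 0.70, 0.10],
]
prediction = filter_and_vote(views, keep=3)
```

The two near-uniform views are discarded as uninformative, and the remaining three vote 2 to 1 for class 0. No gradients or model updates are involved, only repeated inference on different captures.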
Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training: Med-Scout defines the systematic failure of medical MLLMs to adhere to image geometric constraints during lesion localization as "geometric blindness." It utilizes three geometric proxy tasks (Hierarchical Scale Localization / Topological Jigsaw / Anomaly Consistency) that do not require expert annotation, combined with Dense Geometric Rewards (DGR) under the GRPO framework for post-training. The authors also release Med-Scout-Bench for quantifying geometric blindness, demonstrating consistent improvements across four backbones and eight medical benchmarks, with open-source models even surpassing GPT-5 / Gemini-3-Flash.
MedSIGHT: Towards Grounded Visual Comprehension in Medical Large Vision-Language Models: MedSIGHT incorporates a "region perceiver" and a set of "modality-grouped regional codebooks" into a medical LVLM. This allows a single generative model to perform diagnostic reasoning while directly generating discrete region codes that are decoded into segmentation masks. Using only 72K instruction samples, it achieves SOTA performance simultaneously on both understanding and segmentation tasks.
Mitigating Manifold Departure: Uncertainty-Aware Subspace Rectification for Trustworthy MLLM Decoding: To address the issue where training-free decoding methods "indiscriminately suppress language priors," pushing hidden states away from the normal decoding manifold (manifold departure) and harming generation, MGAP uses SVD to estimate a low-rank "language prior subspace" from "blind" (image-free) text hidden states. During decoding, it adaptively attenuates only the projection components of the hidden states on this subspace based on "visual conflict degree + prediction uncertainty," achieving stronger hallucination suppression and more stable generation fidelity on POPE and CHAIR.
Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling: This paper reveals and formalizes the "Perceptual Judgment Bias" in MLLM-as-a-Judge, where judgment models tend to reward linguistically fluent responses even when visual evidence conflicts with the textual narrative. By constructing the Perceptually Perturbed Judgment Dataset (PPJD) and employing GRPO-based batch ranking reward training, the authors enable a 7B judge to significantly outperform baselines of the same size across consistency, single-score prediction, and batch ranking protocols using only 3k samples.
MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs: Addressing the issue that MLLMs struggle with fine-grained visual grounding because "multiple visual semantics are entangled" within ViT patch representations, this paper proposes MoDA—a lightweight module. On top of aligned visual features, it uses language instructions to generate a \([0,1]\) channel-level soft mask via cross-attention, using Hadamard multiplication to amplify instruction-relevant feature dimensions and suppress irrelevant ones. It is plug-and-play, requires no changes to MLLM architecture, needs no extra supervision, and achieves consistent gains across 12 benchmarks and 3 MLLM architectures (e.g., +12.0 on MMVP for LLaVA-1.5) with less than 1% additional FLOPs.
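As a rough sketch of the MoDA idea above, the following NumPy code builds a channel-level soft mask with single-head cross-attention (visual tokens as queries, instruction tokens as keys and values) and applies it with a Hadamard product. The projection matrices, shapes, single-head design, and the sigmoid squashing are assumptions made for illustration; the summary only states that the mask lies in \([0,1]\).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def modulate(visual, text, w_q, w_k, w_v):
    """Return (modulated visual features, soft mask), both shaped like `visual`."""
    q = visual @ w_q                      # (N, d) queries from visual tokens
    k = text @ w_k                        # (T, d) keys from instruction tokens
    v = text @ w_v                        # (T, d) values from instruction tokens
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    mask = 1.0 / (1.0 + np.exp(-(attn @ v)))   # sigmoid keeps each channel in (0, 1)
    return visual * mask, mask            # Hadamard (element-wise) product

rng = np.random.default_rng(0)
d = 8
visual = rng.normal(size=(5, d))          # 5 visual tokens
text = rng.normal(size=(3, d))            # 3 instruction tokens
w_q, w_k, w_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
out, mask = modulate(visual, text, w_q, w_k, w_v)
```

Note that a mask bounded by 1 can only amplify a channel relative to the others; in this sketch any absolute rescaling would have to come from later layers.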
Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models: Model-Dowser scores each parameter of an MLLM using a three-factor multiplication of "weight magnitude \(\times\) input activation \(\times\) output Jacobian." By freezing high-score parameters and updating only low-score ones, it enables deep fine-tuning on LLaVA/NVILA that masters downstream tasks while preserving pre-trained knowledge, consistently outperforming SPIDER and ModelTailor on H-score.
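The three-factor score in the Model-Dowser entry above can be sketched for a single linear layer as follows. This is a hedged illustration: reducing the input activation and output Jacobian to one non-negative value per input and output unit, the quantile-based threshold, and the names `importance_scores`, `trainable_mask`, and `freeze_ratio` are all assumptions, not details taken from the paper.

```python
import numpy as np

def importance_scores(weight, input_act, output_jac):
    """Score each weight as |W_ij| * input_act_j * output_jac_i.

    weight: (out, in); input_act: (in,); output_jac: (out,).
    """
    return np.abs(weight) * input_act[None, :] * output_jac[:, None]

def trainable_mask(scores, freeze_ratio):
    """True where a parameter may be updated (low score), False where frozen."""
    threshold = np.quantile(scores, 1.0 - freeze_ratio)
    return scores < threshold

rng = np.random.default_rng(0)
weight = rng.normal(size=(8, 6))
input_act = np.abs(rng.normal(size=6))     # stand-in activation magnitudes
output_jac = np.abs(rng.normal(size=8))    # stand-in Jacobian magnitudes
grad = rng.normal(size=(8, 6))             # stand-in fine-tuning gradient

scores = importance_scores(weight, input_act, output_jac)
mask = trainable_mask(scores, freeze_ratio=0.25)
updated = weight - 0.01 * grad * mask      # frozen entries receive no update
```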
Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives: To address visual forgetting in MLLMs across scenarios, this paper constructs the MSVQA benchmark (covering high-altitude, underwater, low-altitude, and indoor scenarios) and proposes the Unifier framework. By integrating a CSR multi-branch structure with a shared projector (VRE) in vision blocks for parameter isolation, and applying dual-channel KL soft constraints (VCC) to align representations, the method improves VQA scores by 2.70-10.62% and F1 by 3.40-7.69% across 20-step continual learning with single-inference latency.
Neutral-Reference Prompting for Vision-Language Models: This paper re-attributes the Base-New Trade-off (BNT) in efficient VLM transfer to "asymmetric category preferences on novel classes that are left over from pre-training." It proposes NeRP, which uses a semantically neutral text prompt and the "training image mean" as reference inputs to estimate zero-parameter prior shifts for each category on a trained VLM. A Bayesian-style proxy score is then used to perform local flips between confused category pairs, improving novel class accuracy while preserving base class accuracy without modifying model parameters.
Pair2Scene: Learning Local Object Relations for Procedural Scene Generation: Pair2Scene reformulates 3D indoor scene generation from "directly fitting a global joint distribution" to "learning one-to-one local object relations (support + functional) and recursively assembling them via a scene hierarchy tree." Combined with point cloud geometric encoding, Mixture-of-Logistics probability heads, and collision-aware rejection sampling, it enables complex scene generation—increasing object counts from ~4 to ~14 when trained only on 3D-Front—outperforming baselines like ATISS, DiffuScene, and LayoutVLM in FID and user studies.
Referring Multiple Regions with Large Multimodal Models via Contextual Latent Steering: CSteer proposes a training-free latent steering method that constructs "contextual vectors" based on the hidden activation differences between incorrect and correct referring responses. These vectors are injected into early query layers and mid-to-late decoding layers, enabling general LMMs (e.g., Qwen3-VL, InternVL-3.5) to outperform specialized fine-tuned region LMMs on multi-region visual referring tasks.
Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?: Robust-U1 enables unified Multimodal Large Language Models (MLLMs) to first "self-recover" corrupted images at the pixel level and then perform joint reasoning using both original and recovered images, achieving SOTA in robust understanding under both real-world and adversarial degradations.
SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning: SAME explicitly decomposes "catastrophic forgetting" in multimodal continual instruction tuning (MCIT) for MoE-LoRA into two independent sources: router drift and expert drift. It addresses these using spectral-aware subspace constrained updates for the router, Riemannian preconditioning with historical input covariance for experts, and an adaptive task-level freezing mechanism to eliminate redundant updates. SAME consistently outperforms existing SOTA MoE continual learning methods on CoIN, UCIT, and the newly established TriGap long-sequence benchmarks.
ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision: To address the loss of full-screen structure in "sparse grounding" labels commonly used by GUI agents, this paper constructs ScreenParse, a dense screen parsing dataset with 771K screenshots, 21M elements, and 55 classes via an automated Webshot pipeline. The authors further train ScreenVLM (316M parameters) to parse entire screens into ScreenTag structural sequences, outperforming 8B-scale foundation VLMs on dense parsing and sparse grounding benchmarks while reducing latency to \(\sim 1/4\).
Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs: The authors modify the causal attention mask in decoder-only MLLMs by "digging a hole" that allows preceding image tokens to retrospectively attend to subsequent text question tokens. This single-line mask modification requires no extra parameters or training data changes, achieving an average improvement of 6.2 points across 3 LLM backbones and 12 multimodal benchmarks.
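The mask change described in the entry above can be sketched in a few lines of NumPy. The sketch assumes the sequence is laid out as image tokens followed by question tokens (no answer tokens yet) and uses `True` to mean "may attend"; the function name and layout are illustrative, not taken from the paper.

```python
import numpy as np

def modality_mutual_mask(n_image, n_text):
    """Causal mask with one extra block: image rows may see later text columns."""
    n = n_image + n_text
    mask = np.tril(np.ones((n, n), dtype=bool))   # standard causal mask
    mask[:n_image, n_image:] = True               # the "hole": image -> question text
    return mask

mask = modality_mutual_mask(n_image=3, n_text=2)
```

Text rows stay strictly causal and image-to-image attention is unchanged; only the image-to-text block is opened, which is why the change fits in a single line.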
Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models: This paper quantifies vision-text modality interactions using Pointwise Partial Information Decomposition and proposes a Multimodal Interaction Gate. By automatically selecting samples where "visual-unique information dominates" and letting the VLM generate self-captions to feed back into the text side, the method converts unique visual signals into redundant shared signals. This reduces visual hallucinations by 38.3% and improves consistency by 16.8% under blurred or contaminated inputs.
Self-Prophetic Decoding to Unlock Visual Search in LVLMs: SeProD pairs a post-trained LVLM optimized for visual search with its non-finetuned pre-trained version. The pre-trained model acts as a "prophet," generating single-step draft prefixes at each turn, while the post-trained model selectively accepts these prefixes based on a probability threshold. This approach restores single-step foundational capabilities and maintains multi-step reasoning coherence without additional training or extra computational overhead.
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs: SLQ appends a small set of "Shared Latent Queries" \(\mathbf{Q}\) to the end of image/text token sequences, leveraging the causal attention of the MLLM to aggregate global context. By training only a few thousand query parameters while keeping the MLLM frozen, it transforms the model into a retriever. It outperforms full fine-tuning and LoRA on COCO/Flickr30K. The authors also release KARR-Bench to evaluate "implicit knowledge reasoning" capabilities.
Smoothing Slot Attention Iterations and Recurrences: Addressing the long-neglected pain points in Slot Attention for image and video object-centric learning—specifically the "insufficient information in cold-start queries" and the "forced unification of aggregation transformations for first vs. non-first frames"—the authors propose SmoothSA. It utilizes a small self-distilled warm-up module to inject sample information into queries while allowing the first frame to run three iterations and non-first frames to run only one. This approach sets new SOTA results on both image and video OCL benchmarks.
SOLAR: Self-supervised Joint Learning for Symmetric Multimodal Retrieval: SOLAR proposes the first two-stage self-supervised learning framework for "symmetric MM2MM retrieval" (where both query and document are image+text pairs and roles are interchangeable). The first stage learns an "intersection mask" via global-local alignment and QDA adaptive thresholds to decouple shared and unique information between images and text. The second stage utilizes this mask to construct positive and hard negative samples by masking different regions for contrastive learning. The authors also release a manually verified sym-MM2MM benchmark with 214 samples; the final model, with 0.2B parameters and 768-dimensional embeddings, outperforms the strongest 7.75B VLM baseline by 7.08 percentage points.
Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning: To address the KV cache explosion in multi-modal many-shot in-context learning, TASM proposes a training-free framework: using "task vectors" instead of sample-specific attention for scoring (de-biasing), bipartite graph matching for semantic-aware token merging rather than hard pruning (topology preservation), and hierarchical dynamic retrieval triggered by JS divergence (allowing compressed details to be recalled when needed). It reduces VRAM usage by up to 80% while maintaining performance close to full-context.
Text-Conditional JEPA for Learning Semantically Rich Visual Representations: This paper proposes TC-JEPA, which additionally conditions the I-JEPA masked feature predictor on image captions. Through multi-layer sparse cross-attention, patch representations become predictable under text "prompts," thereby learning semantically richer visual representations that are particularly friendly to dense prediction tasks without using contrastive loss.
TGV-KV: Text-Grounded KV Eviction for Vision-Language Models: TGV-KV introduces a triplet of mechanisms—layer-wise budgeting based on text-vision attention, re-ranking visual importance using dominant text tokens, and prioritizing text KV during eviction—to successfully migrate KV eviction strategies from LLMs to VLMs. Under a 5% retention rate, it maintains performance near full KV levels on LLaVA-NeXT and Qwen3-VL, while achieving a 52.6% throughput increase.
The Truth Stays in the Family: Enhancing Contextual Truthfulness via Inherited Heads in Model Lineages: The authors discovered that "attention heads encoding contextual faithfulness" are inherited across LLMs/MLLMs derived from the same base model. They propose TruthProbe—a plug-and-play mechanism using head-level Truth Scores for soft gating. Scores probed from a base LLM can be directly transferred to its fine-tuned LLM and multimodal descendants, simultaneously reducing hallucinations in HaluEval, POPE, and CHAIR.
TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings: The authors construct the TimeSpot benchmark, covering 1,455 real-world ground-level images from 80 countries. It requires VLMs to provide structured nine-field predictions covering both "When" (season, month, minute-level local time, day phase) and "Where" (continent, country, climate zone, environment type, coordinates). Results indicate that even the strongest model, Gemini-2.5-Flash-Thinking, achieves only 77.59% country accuracy and a median geographic distance error of 892.54 km, with minute-level time accuracy below 34%, revealing a significant lack of joint geo-temporal reasoning based on physical cues.
Toward Structural Multimodal Representations: Specialization, Selection, and Sparsification via Mixture-of-Experts: This paper proposes the S3 framework, which decomposes multimodal representations into concept-level experts (Specialization) using MoE, activates relevant experts via task-based routing (Selection), and prunes low-contribution paths by routing scores during inference (Sparsification). Across four MultiBench benchmarks, it reveals an inverted U-curve where "performance peaks at intermediate sparsity," presenting a third paradigm for multimodal representation beyond contrastive learning and InfoMax.
TRAP: Hijacking CoT Reasoning of VLA for Targeted Behavior Attacks via Adversarial Patches: TRAP is the first targeted behavior hijacking attack against reasoning VLAs. It hijacks the VLA's CoT reasoning (bounding boxes/trajectories/subtasks) through a tablecloth-sized physical adversarial patch. This forces the robot to perform an attacker-defined action (e.g., "give a knife to a person") while the user instruction remains "pick up the apple." Across MolmoAct, GraspVLA, and InstructVLA CoT paradigms, it achieves an average ASR of 52.54%. Real-world printed patches on GraspVLA achieve an 86.7% interference success rate and a 33.3% full control rate in occlusion-free deployments.
TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization: TUR-DPO augments the preference logits of DPO with a shaping reward difference based on "semantic + topological structure" and a dynamic instance weight that down-weights pairs by "per-sample uncertainty." This allows the model to explicitly reward structurally sound reasoning and weaken the influence of fragile preference pairs while maintaining the simplicity of RL-free training. It systematically outperforms DPO and IPO on reasoning tasks like GSM8K / MATH / BBH / QA and matches PPO on most tasks.
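To make the two TUR-DPO ingredients concrete, the sketch below starts from the standard DPO loss and adds a shaping term to the logits plus a per-sample weight on the loss. Only the plain DPO part is standard; the additive form of the shaping term, the `1 / (1 + uncertainty)` weighting, and the names `lam`, `shaping_diff`, and `uncertainty` are assumptions chosen for illustration, not the paper's exact formulation.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)

def shaped_weighted_dpo_loss(policy_margin, ref_margin, shaping_diff,
                             uncertainty, beta=0.1, lam=1.0):
    """Mean weighted DPO loss over a batch of preference pairs.

    policy_margin: log pi(y_w|x) - log pi(y_l|x) under the policy
    ref_margin:    the same difference under the frozen reference model
    shaping_diff:  assumed structure-score difference (chosen minus rejected)
    uncertainty:   assumed non-negative per-pair uncertainty estimate
    """
    logits = beta * (policy_margin - ref_margin) + lam * shaping_diff
    weights = 1.0 / (1.0 + uncertainty)   # assumed down-weighting form
    return float(np.mean(-weights * log_sigmoid(logits)))

loss = shaped_weighted_dpo_loss(
    policy_margin=np.array([1.2, -0.3]),
    ref_margin=np.array([0.4, 0.1]),
    shaping_diff=np.array([0.5, -0.2]),
    uncertainty=np.array([0.1, 2.0]),
)
```

With `shaping_diff` and `uncertainty` set to zero the function reduces to plain DPO, so the two additions can be ablated independently.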
Universal Skeleton Understanding: Differentiable Rendering and MLLMs: By rendering skeleton sequences into images, MLLMs are enabled to understand various skeleton data formats—achieving universal skeleton understanding and resolving cross-modal and format heterogeneity issues.
Unveiling Visual Counting Bottlenecks in Vision-Language Models: By decomposing visual counting into three cognitive stages, this study discovers that the root cause of VLM counting failure lies not in visual perception or numerical understanding, but in the symbolic mapping stage where visual representations fail to project to the correct text tokens, reflecting the lack of a unified cross-modal numerical representation space.
V-LynX: Token Interface Alignment for VideoX LLMs: V-LynX discovers the continuous token interface (manifold) within Video LLMs—a geometric prior carved by the visual encoder and projection layer that is compatible with the LLM's internal operation space. By utilizing lightweight LoRA (68.7M parameters) and unpaired unimodal data, V-LynX efficiently integrates new modalities (audio, 3D, high-frame-rate video) into pre-trained Video LLMs, achieving a CIDEr of 145.7 vs. PAVE's 134.5 on AVSD with 46% fewer parameters.
Very Efficient Listwise Multimodal Reranking for Long Documents: ZipRerank simultaneously eliminates the two major bottlenecks of listwise reranking in VLMs—"excessive visual token sequence length" and "sequential token-by-token ranking output in autoregressive decoding." By utilizing query-aware token pruning and single-logit ranking, it reduces LLM inference latency on MMDocIR by an order of magnitude while matching or exceeding the current SOTA, MM-R5.
Vision Language Models Cannot Reason About Physical Transformations: By introducing the ConservationBench benchmark, this paper reveals that 112 VLMs, despite their claimed powerful perception and reasoning capabilities, systematically fail to judge conservation in physical transformations (e.g., constant liquid volume after pouring), relying on textual priors rather than genuine visual understanding.
VisionPulse: Dynamic Visual Sparsification in Multimodal Reasoning: VisionPulse proposes a training-free, step-level dynamic visual token pruning framework. By adaptively adjusting the number of retained tokens based on changing visual dependencies at each decoding step, it maintains inference accuracy while retaining only 5% of visual tokens, reducing inference length by 11.2%.
Visual Persuasion: What Influences the Decisions of Vision-Language Models?: This paper systematically uses image editing models to modify visual attributes (maintaining semantic invariance) and discovers significant visual preferences in VLMs. It proposes three visual prompt optimization methods to reveal these preferences, develops an automatic interpretability pipeline to understand the visual themes driving decisions, and mitigates risks through visual normalization.
VLA-Arena: An Open-Source Framework for Evaluating Vision-Language-Action Models: VLA-Arena proposes a structured VLA benchmark—systematically quantifying difficulty through three orthogonal dimensions: task structure, language command, and visual observation. With 170 tasks, it reveals key deficiencies in generalization, visual perception, and safety of existing VLA models.
VLANeXt: A Recipe for Building Robust VLA Models: This paper systematically explores the design space of VLA models, distilling 12 key design principles from over 500 controlled experiments to construct the efficient and powerful VLANeXt model. It surpasses SOTA on the LIBERO benchmark and validates these design principles through real-world robot tasks.
WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation: WeatherSyn decomposes the weather forecaster's report writing process into a multimodal instruction task of "observe images → list key points → produce draft." The authors first constructed the WSInstruct dataset, covering 31 US cities and 8 types of weather elements. Subsequently, a three-stage fine-tuning process (SFT→RFT→DPO) was applied to Qwen3-VL-8B. The results demonstrate that an 8B open-source model consistently outperforms closed-source models such as GPT-5-Nano and Claude-3.7-Sonnet across multiple evaluation metrics, showing zero-shot generalization to unseen cities.