🧩 Multimodal VLM
🔬 ICLR2026 · 211 paper notes
📌 Same area in other venues: 📷 CVPR2026 (388) · 💬 ACL2026 (83) · 🧪 ICML2026 (89) · 🤖 AAAI2026 (75) · 🧠 NeurIPS2025 (105) · 📹 ICCV2025 (106)
🔥 Top topics: Multimodal/VLM ×113 · LLM ×15 · Alignment/RLHF ×10 · Few-/Zero-Shot Learning ×8 · Layout & Composition ×6
- SR-3D: 3D-Aware Region Prompted Vision Language Model
-
SR-3D directly injects 3D positional encodings derived from depth estimation into the visual tokens of a 2D foundation VLM and employs a dynamic slice region extractor. This allows a single model to handle both single-view images and multi-view videos, supporting precise cross-frame 3D spatial reasoning by simply drawing a box or mask on any single frame, achieving SOTA performance across multiple 2D and 3D benchmarks.
- A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
-
The A-TPT framework is proposed to promote angular diversity by maximizing the minimum pairwise angular distance of normalized text features on the unit hypersphere. This addresses the miscalibration issue caused by overconfident VLM predictions in Test-time Prompt Tuning (TPT), outperforming existing TPT calibration methods on both natural distribution shifts and medical datasets.
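The quantity A-TPT is described as maximizing, the minimum pairwise angular distance between normalized text features, can be sketched as below. This is a minimal illustration rather than the authors' implementation: the function names and the toy 2-D features are assumptions, and a real setup would compute this on prompt-conditioned text embeddings inside a differentiable framework.

```python
import math

def normalize(vector):
    """Project a vector onto the unit hypersphere (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def min_pairwise_angle(features):
    """Smallest angle (in radians) between any two normalized feature vectors."""
    unit = [normalize(f) for f in features]
    smallest = math.pi
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            cos = max(-1.0, min(1.0, cos))  # guard against rounding drift
            smallest = min(smallest, math.acos(cos))
    return smallest

def angular_diversity_loss(features):
    """Hypothetical loss: minimizing it maximizes the minimum pairwise angle."""
    return -min_pairwise_angle(features)

# Toy class text features (illustrative values only).
text_features = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
print(min_pairwise_angle(text_features))
```

Taking the minimum rather than the mean focuses the objective on the two closest class embeddings, which are the ones most likely to be confused.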
- A High Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
-
Addressing the pain points of scarce training data and unreliable evaluation for interleaved image-text generation in unified Large Multimodal Models (LMMs), this paper introduces InterSyn, a large-scale dataset with 1.8 million samples and 3,500 topics featuring automated quality control (SEIR iterative refinement). It also presents SynJudge, an evaluation model providing four-dimensional interpretable scores highly aligned with human judgment (95.4% A@1). Experiments show that fine-tuning with InterSyn significantly improves interleaved generation capabilities with only 25K–50K samples.
- ASCIIEval: Benchmarking Models' Visual Perception in Text Strings via ASCII Art
-
Using human-artist-drawn ASCII art as a carrier, this paper constructs ASCIIEval, a recognition benchmark where content is strictly equivalent in both text and image modalities. It systematically reveals multiple diagnostic findings: LLMs can "see" visual semantics from pure strings, open-source MLLMs face a trade-off between OCR and global visual perception, and current models fail to benefit from "text + image" dual-modality inputs.
- Asynchronous Matching with Dynamic Sampling for Multimodal Dataset Distillation
-
Addressing the "asynchronous optimization rhythms of image and text networks" in image-text dataset distillation, this paper proposes the AMD framework. It decouples the sampling origins of image and text expert trajectories for asynchronous trajectory matching, uses MMD to measure convergence speed differences and dynamically determine the sampling range for each modality, and replaces random initialization with semantic prototype mining. On Flickr30k and COCO, it substantially improves distilled retrieval performance with almost zero extra overhead (e.g., IR@1/@5/@10 gains of 4.5%/9.6%/10.9% under the Flickr30k 200-pair setting).
- AttTok: Marrying Attribute Tokens with Generative Pre-trained Vision-Language Models towards Medical Image Understanding
-
Addressing the challenge where generative medical multimodal large models lose discriminative power by encoding clinical attributes like "mild/severe DR" into nearly identical text tokens, this paper proposes Attribute Tokens (AttTok). By assigning a dedicated special token to each clinical concept and implementing a multimodal embedding book, Attribute-centric Cross-attention (ACC) adapter, and Attribute-centric Matching (ACM) loss, the authors explicitly inject discriminative medical knowledge into the generative paradigm. This approach achieves consistent performance gains across 5 classification benchmarks and 3 VQA benchmarks.
- BaseReward: A Strong Baseline for Multimodal Reward Model
-
Instead of inventing new architectures, this paper deconstructs the process of building a SOTA Multimodal Reward Model (MRM) into six dimensions: paradigm, reward head, regularization, data, backbone/scale, and ensemble. Through systematic ablation, it derives a clear "recipe" and builds BaseReward—a simple yet strong baseline based on Qwen2.5-VL-7B with a two-layer SiLU MLP reward head and selected mixed preference data. It sets new SOTAs on benchmarks like MM-RLHF-Reward Bench and VL-Reward Bench, while offering significantly faster inference than generative reward models.
- Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
-
Addressing the pain points of "poor SFT data quality and lack of complex reasoning data" in fully open-source multimodal large models, this paper utilizes an automated data curation pipeline (HoneyPipe) to clean and enrich approximately 24 million raw image-text pairs into Honey-Data-15M, a high-quality dataset of 15 million samples with dual-layer CoT. By training on this dataset, the Bee-8B model achieves new SOTA among fully open-source MLLMs, matching or even surpassing the semi-open InternVL3.5-8B on several reasoning benchmarks.
- Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation
-
This paper constructs the first large-scale evaluation benchmark for fine-grained image tasks, FG-BMK (1.01 million questions, 280,000 images). It systematically interrogates 12 mainstream LVLMs/VLMs from two perspectives: "human-oriented dialogue" and "machine-oriented features." The study reveals how contrastive training paradigms, modality alignment, perturbation robustness, and hierarchical category reasoning influence fine-grained performance, discovering that LVLMs still significantly lag behind specialized models in fine-grained tasks.
- Bilateral Information-aware Test-time Adaptation for Vision-Language Models
-
To address the issue of Vision-Language Models (VLMs) like CLIP overfitting to atypical features during Test-time Adaptation (TTA) when using only a "fixed ratio of low-entropy samples," this paper proposes BITTA: it simultaneously "learns" core representations from a dynamic ratio of low-entropy samples and "unlearns" atypical features from high-entropy samples. This approach consistently improves the average accuracy of various TTA methods by approximately 1–2 percentage points on corrupted datasets such as CIFAR-10/100-C and ImageNet-C.
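The split between low-entropy samples (to learn from) and high-entropy samples (to unlearn from) can be sketched as follows. This is only an illustration of the selection step: the summary does not say how BITTA sets its dynamic ratio, so `low_ratio` and `high_ratio` are passed in as plain parameters, and all names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def split_by_entropy(batch_probs, low_ratio, high_ratio):
    """Return (low_idx, high_idx): the most and least confident sample indices.

    The two sets can overlap if low_ratio + high_ratio exceeds 1.
    """
    order = sorted(range(len(batch_probs)), key=lambda i: entropy(batch_probs[i]))
    n_low = int(len(order) * low_ratio)
    n_high = int(len(order) * high_ratio)
    low = order[:n_low]
    high = order[len(order) - n_high:] if n_high else []
    return low, high

# Toy softmax outputs for four test samples (illustrative values only).
batch = [
    [0.98, 0.01, 0.01],
    [0.40, 0.30, 0.30],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
]
low_idx, high_idx = split_by_entropy(batch, low_ratio=0.5, high_ratio=0.25)
print(low_idx, high_idx)
```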
- BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
-
BioCAP is proposed to train biological multimodal foundation models by generating Wikipedia-guided synthetic descriptive captions using MLLMs instead of relying solely on species labels. It achieves an average improvement of 8.8% across 10 species classification benchmarks and a 21.3% improvement in text-image retrieval tasks compared to BioCLIP.
- Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems
-
The authors construct Bongard-RWR+, a benchmark containing 5400 Bongard problems, using a VLM pipeline (Pixtral-12B + Flux.1-dev) to automatically generate photorealistic images representing abstract concepts. Systematic evaluation reveals that state-of-the-art (SOTA) VLMs struggle to discern fine-grained visual concepts such as contours, rotation, and angles, with accuracy reaching as low as 19%.
- Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
-
To address the performance plateau of SFT in chart-to-code generation, Multimodal Structured Reinforcement Learning (MSRL) is proposed. By utilizing a dual-layer text+visual reward function and a two-stage RL strategy, it achieves a 6.2% and 9.9% improvement in high-level metrics on ChartMimic and ReachQA respectively, reaching open-source SOTA and matching GPT-4o.
- CAD-Tokenizer: Towards Text-Based CAD Prototyping via Modality-Specific Tokenization
-
A "primitive-level VQ-VAE tokenizer" is designed for CAD sequences to replace default LLM subword tokenization. It compresses sketch-extrusion pairs into discrete tokens aligned with the LLM vocabulary and employs Finite State Automata (FSA) constrained decoding. This approach unifies Text-to-CAD generation and text-driven CAD editing within a single model for the first time.
- Calibrated Information Bottleneck for Trusted Multi-modal Clustering
-
Addressing the heavy reliance of Information Bottleneck (IB) multi-modal clustering on "accurate mutual information estimation" and "clean pseudo-labels," this paper proposes CLIB. By using a parallel multi-head structure with "one main clustering head + multiple modal calibration heads," the model enables mutual error correction between modalities. Combined with a dynamic pseudo-label filtering mechanism based on information redundancy, it simultaneously improves clustering accuracy (77.8% ACC on Caltech-3V) and suppresses over-confidence (halving ECE on multiple datasets).
- Can Vision-Language Models Answer Face to Face Questions in the Real-World?
-
The authors propose QIVD (Qualcomm Interactive Video Dataset), a face-to-face real-time QA benchmark containing 2,900 videos with audio and timestamp annotations. The study reveals that existing VLMs significantly lag behind humans in real-time situated understanding (best model 60% vs. human 87%). The main bottlenecks are identified as referential disambiguation, timing judgment for responses, and situational common sense, though fine-tuning can significantly narrow this gap.
- Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts
-
To address the Straggler Effect during MoE inference caused by uneven token distribution (where the expert with the heaviest load determines overall latency), this paper proposes Capacity-Aware Token Drop (discarding lower-scoring tokens from overloaded experts) and Expanded Drop (rerouting overflow tokens to local low-load experts), achieving a 1.85× speedup and a 0.2% performance improvement on Mixtral-8×7B.
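The Capacity-Aware Token Drop idea, keeping only the highest-scoring tokens on each overloaded expert, can be sketched as below. This is a simplified illustration under assumptions: the tuple layout, the fixed per-expert `capacity`, and the function name are hypothetical, and the Expanded Drop rerouting step is not modeled.

```python
from collections import defaultdict

def capacity_aware_token_drop(assignments, capacity):
    """Keep at most `capacity` highest-scoring tokens per expert.

    assignments: list of (token_id, expert_id, router_score) tuples.
    Returns (kept, dropped) as sorted lists of (token_id, expert_id).
    """
    by_expert = defaultdict(list)
    for token_id, expert_id, score in assignments:
        by_expert[expert_id].append((score, token_id))

    kept, dropped = [], []
    for expert_id, items in by_expert.items():
        items.sort(reverse=True)  # highest router score first
        for rank, (_, token_id) in enumerate(items):
            target = kept if rank < capacity else dropped
            target.append((token_id, expert_id))
    return sorted(kept), sorted(dropped)

# Expert "e0" receives three tokens but may only process two.
routing = [(0, "e0", 0.9), (1, "e0", 0.2), (2, "e0", 0.7), (3, "e1", 0.5)]
kept, dropped = capacity_aware_token_drop(routing, capacity=2)
print(kept, dropped)
```

Capping every expert at the same load bounds the latency of the slowest expert, which is what determines the layer's overall latency.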
- CapRL: Stimulating Dense Image Capabilities via Reinforcement Learning
-
CapRL reformulates subjective image caption quality into a verifiable reward defined as "whether a text-only LLM can correctly answer image-related multiple-choice questions based solely on the caption." Using GRPO, Qwen2.5-VL-3B is trained to generate denser and more accurate captions, further yielding the CapRL-5M dataset. This approach significantly outperforms SFT-based caption data in both multimodal pre-training and Prism caption evaluations.
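The verifiable reward described above can be sketched as the fraction of multiple-choice questions answered correctly from the caption alone. This is an illustrative sketch, not CapRL's code: `caption_reward`, the question dictionary layout, and `keyword_answerer` (a toy stand-in for the text-only LLM) are all assumptions.

```python
def caption_reward(caption, questions, answer_fn):
    """Fraction of multiple-choice questions answered correctly from the caption only."""
    if not questions:
        return 0.0
    correct = sum(
        1
        for q in questions
        if answer_fn(caption, q["question"], q["options"]) == q["answer"]
    )
    return correct / len(questions)

def keyword_answerer(caption, question, options):
    """Toy stand-in for a text-only LLM: pick the first option mentioned in the caption."""
    for label, text in options.items():
        if text.lower() in caption.lower():
            return label
    return None

questions = [
    {"question": "What animal is shown?", "options": {"A": "cat", "B": "dog"}, "answer": "B"},
    {"question": "What color is the ball?", "options": {"A": "red", "B": "blue"}, "answer": "A"},
]

dense_reward = caption_reward("A dog chases a red ball.", questions, keyword_answerer)
vague_reward = caption_reward("An animal plays outside.", questions, keyword_answerer)
print(dense_reward, vague_reward)
```

Because the answerer never sees the image, a caption only earns reward for details it actually states, which pushes the policy toward denser descriptions.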
- CARPRT: Class-Aware Zero-Shot Prompt Reweighting for Vision-Language Model
-
CARPRT points out that existing VLM prompt ensembling methods assign weights to each prompt template that are "shared across all classes," which contradicts the fact that "different prompts have different affinities for different classes." It employs a training-free, purely black-box (score-only) two-stage pipeline to estimate a set of prompt weights for each class individually, consistently outperforming MPE / WPE and even human-selected prompts across 11 zero-shot classification benchmarks.
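The difference between class-shared and class-specific prompt weights can be illustrated with a small scoring sketch. It shows only how per-class weights would enter the ensemble. The summary does not describe how CARPRT estimates them, so the weights here are supplied by hand, and the names and score values are hypothetical.

```python
def ensemble_scores(prompt_scores, weights):
    """Combine per-prompt class scores using a weight for each (prompt, class) pair.

    prompt_scores[p][c]: image-text similarity for prompt template p and class c.
    weights[p][c]: weight of prompt p when scoring class c.
    """
    num_prompts = len(prompt_scores)
    num_classes = len(prompt_scores[0])
    return [
        sum(weights[p][c] * prompt_scores[p][c] for p in range(num_prompts))
        for c in range(num_classes)
    ]

# Two prompt templates, two classes (illustrative similarity scores).
prompt_scores = [[0.8, 0.2], [0.3, 0.6]]

# Shared weights: every class uses the same weight for a given prompt.
shared = [[0.5, 0.5], [0.5, 0.5]]
# Class-aware weights: class 0 trusts prompt 0, class 1 trusts prompt 1.
class_aware = [[1.0, 0.0], [0.0, 1.0]]

shared_result = ensemble_scores(prompt_scores, shared)
class_aware_result = ensemble_scores(prompt_scores, class_aware)
print(shared_result, class_aware_result)
```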
- Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
-
This paper proposes SD-RPN (Self-Distilled Region Proposal Network), which transforms noisy and blurred intermediate-layer attention maps in MLLMs into high-quality pseudo-labels via denoising and selective labeling. These labels are used to train a small RPN attached to a frozen backbone, allowing the model to predict Regions of Interest (RoI) in a single partial forward pass. Trained on only 10K QA pairs, it achieves over 10% absolute accuracy gains on unseen benchmarks such as TextVQA, DocVQA, and V-Star.
- ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation
-
ChartGalaxy constructs a million-scale infographic dataset by inducing chart types, visual variants, and layout templates from real-world designs, then programmatically synthesizing high-quality infographics with table supervision. It significantly enhances LVLM capabilities in infographic Q&A, code generation, and example-driven chart generation.
- CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing
-
The authors construct CityLens, the largest urban socioeconomic sensing benchmark to date (covering 17 cities, 6 domains, and 11 prediction tasks). It evaluates 17 LVLMs across three paradigms—direct prediction, normalized estimation, and feature-based regression—to infer socioeconomic indicators from satellite and street-view images. The results reveal that general LVLMs still lag behind domain-specific contrastive learning methods in most tasks.
- CitySeeker: How Do VLMs Explore Embodied Urban Navigation with Implicit Human Needs?
-
CitySeeker constructs the first embodied urban navigation benchmark for "implicit human needs" (8 cities, 6,440 real street-view trajectories, 7 categories of needs). Utilizing a unified ReAct-style navigation framework to evaluate 27 VLMs, it is found that even the strongest model achieves a task completion rate of only 21.1%, significantly lagging behind humans. Three types of human-cognition-inspired strategies—Backtracking, Spatial Cognition Enrichment, and Memory-Based Retrieval (BCR)—are proposed, pushing performance to 26.9%.
- CLIP-FMoE: Scalable CLIP via Fused Mixture-of-Experts with Enforced Specialization
-
CLIP-FMoE scales CLIP using a "Fused MoE" pipeline where specialized experts are pre-trained via two-level semantic clustering and subsequently frozen while training the router. By employing a Fusion Gate to fuse the pre-trained MLP with domain-specific experts per-token, the model enhances image-text retrieval and long-text understanding while preserving the original CLIP's zero-shot classification capabilities.
- Closing the Modality Gap Aligns Group-Wise Semantics
-
This paper demonstrates that the modality gap in CLIP is irrelevant for instance-level tasks (retrieval) but severely harms group-level tasks (clustering). It proposes a new objective function consisting of an Align True Pairs loss plus a Centroid Uniformity loss, which reduces the gap nearly to zero in bi-modal and tri-modal settings and significantly improves clustering V-Measure (+10–17 points) while maintaining retrieval performance.
- CogMoE: Signal-Quality–Guided Multimodal MoE for Cognitive Load Prediction
-
CogMoE reframes cognitive load prediction from multimodal physiological signals (EEG/ECG/EDA/Gaze) from "modality-based fusion" to "quality-based fusion." It first cleans noise, missing segments, and misalignments using wavelet synchronization and cross-modal recovery. Then, it utilizes three experts specialized in clean, noisy, and recovered signals, respectively, with adaptive routing via a quality-aware gate. Combined with the CORTEX multi-objective loss, it improves performance on CL-Drive and ADABase by up to 13 percentage points over strong baselines.
- CoMem: Compositional Concept-Graph Memory for Vision-Language Adaptation
-
CoMem treats "compositional structures" (graphs of concepts + relations) as the unit for memory and rehearsal in continual learning. Rather than storing raw images, it synthesizes replay samples in the feature space conditioned on subgraphs. Combined with compositional consistency constraints and teacher entropy-gated distillation to suppress drift, it achieves higher retention and lower forgetting across cross-domain retrieval, structured concept learning, and continual VQA.
- Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping
-
AttWarp is proposed as a plug-and-play test-time image warping method that utilizes the cross-modal attention maps of an MLLM to perform rectangular grid resampling. By expanding high-attention regions and compressing low-attention ones, it consistently improves accuracy, enhances compositional reasoning, and reduces hallucinations across 5 benchmarks and 4 MLLMs.
- Context Tokens are Anchors: Understanding the Repeat Curse in dMLLMs from an Information Flow Perspective
-
This paper discovers that diffusion-based Multimodal Large Language Models (dMLLMs) suffer from severe text repetition (Repeat Curse) when using cache acceleration. From an information flow perspective, the root cause is identified as the disruption of "context anchor token" information flow and the failure of deep-layer information entropy to converge. Based on this, the training-free CoTA (Contextual Attention Enhancement + Entropy-guided Voting) is proposed to eliminate repetition.
- ContextNav: Towards Agentic Multimodal In-Context Learning
-
ContextNav transforms the task of "selecting and cleaning examples for multimodal in-context learning" into an MLLM-driven closed-loop agentic workflow. It first performs resource-aware embedding and candidate retrieval, followed by agent-based reasoning to eliminate semantic and structural noise. Finally, an Operational Grammar Graph (OGG) constrains the tool-calling sequence, while downstream ICL feedback continuously optimizes the strategy. Across 8 datasets, it improves the average ICL gain from the previous SOTA's 7.6% to 16.8%.
- Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach
-
The authors propose the Emotion Statement Judgment (ESJ) task and the INSETS automatic labeling pipeline, reframing visual emotion evaluation from "open-ended classification" to "statement truth judgment." They construct the MVEI benchmark (3,086 samples, 424 emotion labels, across four cognitive dimensions). Systematic evaluation of 19 MLLMs reveals that even GPT-4o trails human accuracy (91.6%) by 13.3%.
- DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
-
DAVE trains a specialized VLM vision encoder for document and web images. It employs modified pixel-level MAE for self-supervised learning on 20 million unlabeled document/web images, followed by autoregressive supervised pre-training on a small amount of high-quality data. By utilizing "multi-decoder weight merging + ensemble with SigLIP2," the encoder captures both structural-spatial information and general semantics, outperforming SigLIP2 by approximately 10.5% / 5% on document recognition, web localization, and Mind2Web Agent tasks.
- DaVinci: Reinforcing Visual-Structural Syntax in MLLMs for Generalized Scientific Diagram Parsing
-
DaVinci trains a 7B MLLM using a two-stage framework consisting of "SFT for learning visual primitives + GRPO for learning structural relationships." By translating scientific diagrams into compilable TikZ code using the self-constructed TikZ30K dataset (standardizing drawing order + injecting comments) and a hybrid reward system that extracts error-free signals from vectorized representations, DaVinci surpasses closed-source models like GPT-5 and Claude-Sonnet-4 in both compilation rate and visual fidelity.
- DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning
-
DecAlign decouples multimodal features into two streams: "modality-specific heterogeneous features" and "cross-modal shared homogeneous features." It aligns the heterogeneous part using prototype-guided multi-marginal optimal transport and the homogeneous part through latent space distribution matching with MMD. It consistently outperforms 13 SOTA methods across four sentiment analysis benchmarks.
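The MMD term used for distribution matching can be sketched with a standard (biased) squared-MMD estimator and an RBF kernel. This is a textbook formulation, not DecAlign's code: the kernel choice, the bandwidth, and the toy features are assumptions.

```python
import math

def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy between two sample sets."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Toy latent features from two modalities (illustrative values only).
modality_a = [[0.0, 0.0], [1.0, 0.0]]
modality_b = [[5.0, 5.0], [6.0, 5.0]]

print(mmd_squared(modality_a, modality_a))  # identical sets give zero discrepancy
print(mmd_squared(modality_a, modality_b))  # shifted sets give a positive value
```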
- Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading
-
This paper proposes a new task of decoding open-ended information-seeking goals from eye-tracking trajectories during reading. Using the OneStop eye-tracking dataset (360 subjects, 486 questions, 162 paragraphs), the authors develop discriminative and generative multimodal models. RoBERTEye-Fixations achieves 49.3% accuracy in a 3-choice target selection task (random baseline 33%) and 70.9% across different critical spans. DalEye-Llama/GPT also significantly outperforms non-eye-movement baselines in goal reconstruction.
- Decoupling Primitive with Experts: Dynamic Feature Alignment for Compositional Zero-Shot Learning
-
To address the "same primitive has different semantics in different compositions" pain point in Compositional Zero-Shot Learning (CZSL), this paper proposes EVA—using Mixture-of-Experts (MoE) adapters to decouple primitives into multiple semantic variants for learning, and then employing semantic variant alignment to select the variant that best matches the image for fine-grained cross-modal matching. SOTA results are achieved on three benchmarks in both closed-world and open-world settings.
- DeepEyesV2: Toward Agentic Multimodal Model
-
DeepEyesV2 aims to truly weave "external tool calling" into the inference process of multimodal models. It allows models to autonomously decide when to write Python code or initiate a web search within a single inference trajectory, backfilling tool outputs for further reasoning. The authors found that pure RL fails to learn stable tool calling; thus, they proposed a "cold-start SFT + reinforcement learning" two-stage training approach. This resulted in consistent improvements across perception, mathematical reasoning, and search benchmarks (e.g., MathVerse +7.1; MMSearch 63.7%, significantly surpassing the 53.8% of specialized search models).
- Delving into Spectral Clustering with Vision-Language Representations
-
This paper advances spectral clustering from an image-only unimodal paradigm to a multimodal one: it uses "positive nouns" from the CLIP text side to anchor a Neural Tangent Kernel (NTK), making the affinity between two images a product of "visual proximity × semantic overlap." This naturally strengthens the block-diagonal structure. Furthermore, a regularized affinity diffusion mechanism is used to adaptively integrate affinity matrices from multiple prompts, significantly outperforming previous SOTA on 16 benchmarks (e.g., 98.3% ACC on STL-10, 84.9% ACC on ImageNet-Dogs).
- DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs
-
This paper systematically reveals that the common intuition "intuitively similar training data is more helpful" is unreliable for multimodal LLMs. It proposes DataProphet, a training-free metric that uses the product of multimodal perplexity, cross-domain similarity, and problem diversity to predict, before training, how a set of supervised datasets will rank by impact on a target benchmark (reaching a Kendall's τ of 86%). This guides data selection, outperforming training-based SOTA methods and even approaching oracle performance.
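The ranking idea can be sketched as a product score per candidate dataset, checked against observed gains with Kendall's τ. This is a sketch under assumptions: how DataProphet actually computes each factor is not given here, so the three inputs are plain numbers, and every value below is made up for illustration.

```python
from itertools import combinations

def product_score(perplexity, similarity, diversity):
    """Hypothetical training-free score: the product of the three factors."""
    return perplexity * similarity * diversity

def kendall_tau(xs, ys):
    """Kendall's tau-a between two equally long score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / pairs

# (perplexity, similarity, diversity) per candidate dataset: illustrative values.
factors = [(2.0, 0.5, 1.0), (1.5, 0.8, 1.0), (3.0, 0.1, 1.0), (1.0, 0.9, 2.0)]
predicted = [product_score(*f) for f in factors]

# Made-up benchmark gains observed after training on each dataset.
observed_gain = [0.5, 1.0, 0.2, 0.9]

tau = kendall_tau(predicted, observed_gain)
print(tau)
```

A τ of 1 means the predicted and observed rankings agree on every pair of datasets, and each swapped pair lowers the score.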
- Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification
-
The authors propose EUQ (Evidential Uncertainty Quantification), which decomposes the epistemic uncertainty of LVLMs into Conflict (CF) (internal contradiction) and Ignorance (IG) (lack of information) based on Dempster-Shafer evidence theory. Without training and using only a single forward pass, EUQ detects four types of misbehaviors: hallucination, jailbreaking, adversarial attacks, and OOD failures, achieving an average relative AUROC improvement of 10.4%/7.5% over the best baselines.
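The two Dempster-Shafer quantities named above can be illustrated with the textbook rule of combination: conflict is the mass assigned to contradictory pairs of evidence, and ignorance is the mass left on the whole frame. This is a generic sketch, not EUQ itself: how EUQ derives mass functions from an LVLM and its exact CF/IG definitions are not covered here, and the two evidence sources are invented.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps frozenset hypotheses to masses that sum to 1.
    Returns (combined_masses, conflict).
    """
    raw = {}
    conflict = 0.0
    for set_a, mass_a in m1.items():
        for set_b, mass_b in m2.items():
            intersection = set_a & set_b
            if intersection:
                raw[intersection] = raw.get(intersection, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b  # contradictory evidence
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; combination is undefined")
    combined = {k: v / (1.0 - conflict) for k, v in raw.items()}
    return combined, conflict

YES, NO = frozenset({"yes"}), frozenset({"no"})
FRAME = YES | NO  # mass on the full frame encodes "don't know"

source_1 = {YES: 0.7, FRAME: 0.3}
source_2 = {NO: 0.6, FRAME: 0.4}

combined, conflict = dempster_combine(source_1, source_2)
ignorance = combined[FRAME]
print(conflict, ignorance)
```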
- Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
-
To address the residual "modality gap" when CLIP uses InfoNCE for vision-language alignment, this paper proposes CS-Aligner. Beyond maximizing mutual information, it introduces Cauchy-Schwarz (CS) divergence to bridge the feature distributions of images and text. This approach compensates for InfoNCE’s limitation of only aligning paired samples while neglecting the overall distribution, and naturally resolves the internal conflict between alignment and uniformity in InfoNCE. It significantly outperforms alignment methods such as Eclipse, Long-CLIP, and LLM2CLIP in text-to-image (FID) and image-text retrieval tasks.
- DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
-
DualToken decouples the naturally conflicting goals of "semantics for understanding" and "pixels for generation" along the shallow and deep structures of ViT. By learning reconstruction in shallow layers for a pixel codebook and semantics in deep layers for a semantic codebook, a single tokenizer achieves 0.25 rFID and 82.0% zero-shot accuracy simultaneously, enabling a pure autoregressive MLLM to excel at both image understanding and synthesis.
- Efficient Discriminative Joint Encoders for Large Scale Vision-Language Re-ranking
-
The paper proposes EDJE (Efficient Discriminative Joint Encoder), which achieves high-throughput inference at 50k image-text pairs/second by offlining visual feature extraction and using lightweight attention adapters to compress visual tokens. It matches the performance of existing joint encoders on Flickr (zero-shot) and COCO (fine-tuned) while requiring only 49kB of storage per image.
- EgoHandICL: Egocentric 3D Hand Reconstruction with In-Context Learning
-
This is the first work to introduce the In-Context Learning (ICL) paradigm to 3D hand reconstruction. Through VLM-guided template retrieval, a multi-modal ICL tokenizer, and an MAE-driven reconstruction pipeline, it significantly outperforms SOTA methods on the ARCTIC and EgoExo4D benchmarks.
- Endowing GPT-4 with a Humanoid Body: Building the Bridge Between Off-the-Shelf VLMs and the Physical World
-
The BiBo framework utilizes a two-level structure consisting of an "Instruction Compiler + Motion Diffusion Executor," allowing off-the-shelf VLMs like GPT-4 to control humanoid agents for complex physical scene interactions without any fine-tuning, achieving a single-task success rate of 90.2%.
- Enhanced Continual Learning of Vision-Language Models with Model Fusion
-
This paper proposes the Continual Decoupling-Unifying (ConDU) framework, introducing model fusion into VLM continual learning for the first time. By maintaining a unified model combined with task triggers for iterative decoupling-unifying operations, it outperforms SOTA by an average of 2% on MTIL benchmarks while simultaneously enhancing zero-shot capabilities.
- Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning
-
This work introduces the GEOPERCEIVE benchmark (geometric perception evaluation based on unambiguous DSL) and the GEODPO framework (translator-guided reinforcement learning). It enables VLMs to maintain natural language output while utilizing an NL→DSL translator to calculate fine-grained reward signals, significantly enhancing geometric primitive perception and downstream reasoning capabilities.
- Enhancing Multi-Image Understanding through Delimiter Token Scaling
-
By scaling the hidden states of image delimiter tokens in vision-language models, the ability to isolate information between images is enhanced. This achieves performance gains across multi-image (Mantis/MuirBench/MIRB/QBench2) and multi-document/multi-table (TQABench/MultiNews/WCEP-10) benchmarks without any additional training or inference costs.
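The core operation, multiplying the hidden states at image-delimiter positions by a constant, can be sketched as below. This is a simplified illustration: which layers are affected and what scale is used are not stated here, so `scale`, the token ids, and the function name are assumptions.

```python
def scale_delimiter_states(hidden_states, token_ids, delimiter_ids, scale):
    """Return a copy of hidden_states with delimiter-token vectors multiplied by `scale`.

    hidden_states: one vector (list of floats) per token position.
    token_ids: the token id at each position.
    delimiter_ids: set of token ids that mark image boundaries.
    """
    return [
        [value * scale for value in state] if token in delimiter_ids else list(state)
        for state, token in zip(hidden_states, token_ids)
    ]

# Token 99 plays the role of an image delimiter (hypothetical id).
hidden = [[1.0, 2.0], [0.5, 0.5], [3.0, -1.0]]
tokens = [10, 99, 11]
scaled = scale_delimiter_states(hidden, tokens, delimiter_ids={99}, scale=2.0)
print(scaled)
```

The operation touches only a handful of positions and adds no parameters, consistent with the claim of no additional training or inference cost.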
- ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models
-
ERGO introduces a suite of RL rewards designed for efficiency (region-verification reward + box adjustment reward), enabling LVLMs to perform "reasoning-driven perception" on low-resolution coarse images. Even when target objects are downsampled until they are indiscernible, the model uses contextual cues to locate and re-encode the correct region. On the V* benchmark, ERGO outperforms Qwen2.5-VL-7B by 4.7 points while using only 23% of the visual tokens and achieving a 3× speedup in inference.
- Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models
-
A training-free two-stage VLM framework is proposed that uses an Error Notebook to record corrected reasoning trajectories combined with RAG for test-time adaptation. On specification-driven part retrieval tasks for 3D CAD assemblies, GPT-4o accuracy improves from 41.7% to 65.1% (+23.4 points), with a further 4.5% gain via a Grammar Constraint verifier.
- EventFlash: Towards Efficient MLLMs for Event-Based Vision
-
EventFlash leverages the inherent spatiotemporal sparsity of event streams by designing two token sparsification modules: Adaptive Time Window Aggregation and Sparsity-Driven Guided Attention. These modules increase inference throughput by 12.4× and extend the processable event bin capacity from 5 (in EventGPT) to 1000.
- Exploring Cross-Modal Flows for Few-Shot Learning
-
This work reformulates image-to-text alignment from the "one-step adjustment" characteristic of existing PEFT methods into a "multi-step iterative correction" via Flow Matching. By employing a plug-and-play velocity field, it gradually aligns entangled cross-modal distributions on difficult datasets, significantly improving few-shot classification.
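The contrast between a one-step adjustment and multi-step iterative correction can be sketched with Euler integration of a velocity field. This is a toy illustration, not the paper's model: the learned velocity network is replaced by a hand-written constant field along a straight path, and all names and feature values are hypothetical.

```python
def euler_integrate(x0, velocity_fn, num_steps):
    """Move x0 along a velocity field using `num_steps` Euler steps over t in [0, 1]."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        velocity = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x

def make_straight_line_field(source, target):
    """Toy stand-in for a learned velocity network: constant velocity toward the target."""
    direction = [b - a for a, b in zip(source, target)]
    return lambda x, t: direction

# Illustrative image feature and the text feature it should be aligned to.
image_feature = [0.2, -0.4, 1.0]
text_feature = [1.0, 0.0, 0.5]

field = make_straight_line_field(image_feature, text_feature)
aligned = euler_integrate(image_feature, field, num_steps=10)
print(aligned)
```

With a learned, state-dependent field the intermediate steps matter: each step re-evaluates the velocity at the current position instead of committing to a single correction.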
- Fed-Duet: Dual Expert-Orchestrated Framework for Continual Federated Vision-Language Learning
-
Fed-Duet decouples VLM adaptation in federated continual learning into two complementary pathways: "semantic experts (prompts) + parameter experts (adapters)". It utilizes a server-side knowledge orchestrator for adaptive distribution of shared semantic experts and client-side cross-attention gating to fuse local/shared experts. Combined with routing consistency and expert stability losses, the framework mitigates forgetting while preserving cross-modal alignment in non-IID and streaming task scenarios.
- Figma2Code: Automating Multimodal Design to Code in the Wild
-
This paper introduces Figma2Code, a novel task and dataset that advances design-to-code from unimodal image-based "screenshot-to-code" to a realistic multimodal scenario incorporating Figma metadata, design assets, and screenshots. It provides an evaluation framework measuring visual fidelity, layout responsiveness, and code maintainability, revealing a core contradiction where current MLLMs struggle to balance visual fidelity with code quality.
- FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
-
FLARE applies deep vision-language fusion throughout the entire VLM workflow: it guides vision with text during encoding, dynamically aggregates vision based on text context during decoding, bridges the modality spaces with dual reconstruction losses, and feeds training with "text-first" data synthesis. This enables a 3B model to outperform Cambrian-1 8B and Florence-VL 8B using only 630 visual tokens.
- Flatness-Guided Test-Time Adaptation for Vision-Language Models
-
This paper proposes the Flatness-Guided Adaptation (FGA) framework: the training phase employs sharpness-aware prompt tuning to locate flat minima, while the testing phase avoids updating any parameters. Instead, it uses a "perturb-score-filter" approach to enhance samples, aligning the flat minima of selected test loss landscapes with the training flat minima. This significantly improves CLIP's out-of-distribution generalization with zero backpropagation and low memory overhead.
- FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows
-
FlowBind replaces fixed Gaussian priors with a learnable shared latent anchor + per-modality invertible flows, factorizing the multimodal joint distribution into a set of independent modality-to-anchor flows. Trained end-to-end with only a single flow matching loss, it enables arbitrary translation between text, images, and audio with 6× fewer parameters and 10× faster training.
- From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
-
This paper introduces NEO, a family of native (monolithic) VLMs built from first principles. It integrates visual encoding, cross-modal alignment, and reasoning into a single decoder-only backbone using unified "native primitives." By leveraging Native-RoPE (which decouples T/H/W), mixed image-text attention, and a reusable pre-Buffer, NEO significantly narrows the gap between native VLMs and top-tier modular VLMs of the same scale using only 390M image-text samples.
- GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation
-
GeoBench utilizes the formal engine TrustGeoGen to generate 1021 verifiable synthetic geometry problems. Based on the van Hiele cognitive model, geometric reasoning is decomposed into four levels and six tasks ("Visual Perception → Goal Planning → Theorem Application → Self-reflection & Backtracking"), shifting VLM evaluation from "final answer only" to "diagnosing specific bottlenecks."
- GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
-
Without modifying the VLM architecture or introducing point cloud modalities, this work augments 2D vision-language models with 3D indoor scene understanding capabilities using only a visual prompting paradigm consisting of "videos + reconstructed Bird’s Eye Views (BEV) + cross-frame consistent object ID markers." It achieves SOTA performance in both zero-shot and fine-tuning settings.
- GranViT: A Fine-Grained Vision Model For Autoregressive Multimodal Large Language Models
-
By constructing the Gran-29M dataset with 29.51 million images and 183 million region-level annotations, and pre-training a vision encoder using bidirectional "Bbox→Caption / Caption→Bbox" autoregressive tasks along with local self-distillation, GranViT enables a ViT to possess fine-grained local perception capabilities aligned with the LLM semantic space for the first time, achieving new SOTAs in visual grounding and OCR understanding.
- Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
-
GAR (Grasp Any Region) is proposed to extract high-fidelity local features while maintaining global context via RoI-aligned feature replay. It achieves precise single-region description, multi-region interaction modeling, and complex reasoning, with the 1B model outperforming InternVL3-78B.
- Grounding-IQA: Grounding Multimodal Language Models for Image Quality Assessment
-
This work combines spatial localization (referring + grounding) with image quality assessment and constructs the GIQA-160K dataset to train multimodal LLMs to generate quality descriptions with bounding boxes and to answer spatial VQA. The resulting models significantly outperform general MLLMs in fine-grained quality perception.
- Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
-
The paper proposes Guided Query Refinement (GQR): an approach that uses scores from a lightweight text retriever as guidance signals at test-time to iteratively refine the query embeddings of a visual retriever (ColPali series) via gradient descent. This allows ColPali models to approach or even exceed the retrieval quality of significantly larger models while maintaining a small representation footprint, achieving up to 14× speedup and 54× memory savings.
- GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning
-
This paper decomposes Rule-based Reinforcement Fine-Tuning (RFT/GRPO) into "reward function, prediction format, KL penalty, and training configuration" for systematic controlled ablation. By introducing an Adversarial KL Factor that adaptively suppresses reward over-optimization, it surpasses methods utilizing tens of millions of SFT samples on the ScreenSpot benchmark with only 5.2K samples.
- How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images
-
This study systematically diagnoses the root cause of poor medical MLLM performance in zero-shot medical VQA as insufficient visual grounding—where model attention systematically deviates from clinically relevant regions. To address this, the authors propose VGRefine, a training-free inference-time attention correction method that achieves SOTA results across 8 imaging modalities and 110K+ samples in 6 benchmarks.
- HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
-
This paper proposes HSSBench, the first large-scale multimodal benchmark focused on Humanities and Social Sciences (HSS), covering 6 main categories and 45 subcategories with 13,152 multiple-choice questions in the six official UN languages. Constructed through an "Expert + Multi-agent" collaborative pipeline, HSSBench demonstrates that HSS tasks remain a significant challenge for current mainstream MLLMs (with accuracies generally below 60%).
- Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
-
Human-MME is the first comprehensive MLLM evaluation benchmark specifically for "human-centric scenarios." Utilizing a five-step auto-annotation pipeline followed by expert manual verification, the authors constructed a dataset covering 43 sub-scenarios and 8 progressive dimensions—from "fine-grained perception" to "high-dimensional causal reasoning." Comprising nearly 20,000 real image-text QA pairs, the benchmark systematically exposes shortcomings in fine-grained human grounding and high-order reasoning across 20 SOTA MLLMs.
- Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering
-
This paper systematically reveals the impact of Human Uncertainty (HU) on Supervised Fine-Tuning (SFT) in VQA, demonstrating that high-HU samples are ineffective or even harmful. It proposes the HaDola framework, a four-stage pipeline of "Discriminate-Self Annotate-Error Trigger-Training." Using only 5% seed annotations, HaDola matches or exceeds strong baselines fine-tuned on 100% of the data in both accuracy and calibration.
- HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes
-
HumanPCR constructs a hierarchical evaluation suite for human-centric visual scenes. It diagnoses model weaknesses across three levels—Perception, Comprehension, and Reasoning—covering human details, social behaviors, temporal processes, and multi-evidence video reasoning. The study finds that the most significant bottleneck for current models is not "seeing more frames," but rather the inability to proactively seek critical visual evidence not explicitly stated in the prompt.
- ICYM2I: The Illusion of Multimodal Informativeness under Missingness
-
This paper reveals an overlooked problem in multimodal learning: distribution shifts caused by modality missingness lead to severe biases in modality valuation. The proposed ICYM2I framework corrects biases in both training and evaluation via dual Inverse Probability Weighting (IPW), achieving unbiased estimation of modality predictive utility and information-theoretic value under the missing-at-random (MAR) assumption.
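The inverse probability weighting idea can be illustrated with a minimal sketch. It assumes each sample comes with a known (or separately estimated) probability that the modality is observed. The function name `ipw_mean` and the toy numbers are hypothetical and not taken from the paper, and the sketch covers only a single evaluation-side correction, not the paper's dual-IPW procedure.

```python
def ipw_mean(values, observed, propensities):
    """Estimate the population mean of `values` from observed samples only.

    Each observed sample is weighted by 1 / P(observed), so samples that are
    rarely observed count for more (a Horvitz-Thompson style estimator).
    """
    if not (len(values) == len(observed) == len(propensities)):
        raise ValueError("inputs must have the same length")
    total = 0.0
    for value, seen, p in zip(values, observed, propensities):
        if seen:
            if p <= 0.0:
                raise ValueError("observed sample with non-positive propensity")
            total += value / p
    return total / len(values)


# Toy data (hypothetical): per-sample correctness with a true mean of 0.5.
scores = [1.0, 1.0, 0.0, 0.0]
# The modality is always present for the first two samples, but present only
# half of the time for the last two; one of them happens to be missing here.
propensities = [1.0, 1.0, 0.5, 0.5]
observed = [True, True, True, False]

seen_scores = [s for s, o in zip(scores, observed) if o]
naive = sum(seen_scores) / len(seen_scores)          # ignores missingness
corrected = ipw_mean(scores, observed, propensities)  # reweights by 1 / p
```

In this toy case the naive average over observed samples overstates the score, while the reweighted estimate recovers the full-population value.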
- Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization
-
MISP-DPO extends multimodal DPO from "one positive, one negative" to "one positive, multiple negatives": it utilizes Sparse Autoencoders (SAE) in CLIP space to extract interpretable visual bias factors for selecting semantically diverse negative images, then employs a Plackett-Luce objective with importance sampling for efficient training, significantly reducing hallucinations in VLMs.
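A minimal sketch of a "one positive, multiple negatives" preference loss follows. It implements only the top-choice term of a Plackett-Luce model over scalar reward values; the function name `multi_negative_pl_loss` is hypothetical, the rewards are assumed to be precomputed (for example, DPO-style implicit rewards), and the paper's importance sampling and negative selection are not reproduced here.

```python
import math


def multi_negative_pl_loss(pos_reward, neg_rewards):
    """Negative log-probability that the positive sample is ranked first.

    P(positive first) = exp(r_pos) / (exp(r_pos) + sum_k exp(r_neg_k)).
    """
    scores = [pos_reward] + list(neg_rewards)
    m = max(scores)  # subtract the max before exponentiating, for stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - pos_reward


# With a single negative this reduces to the usual pairwise form,
# -log(sigmoid(r_pos - r_neg)).
pairwise = multi_negative_pl_loss(1.0, [0.0])
# Adding more negatives can only increase the loss for the same positive.
three_negatives = multi_negative_pl_loss(1.0, [0.0, 0.5, -0.2])
```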
- IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs
-
IndicVisionBench is the first large-scale culture-multilingual VLM evaluation benchmark focusing on the Indian subcontinent. Covering English + 10 Indic languages across three multimodal tasks (VQA / OCR / MMT) with 5K images and over 37K QA pairs, it systematically reveals the significant performance gaps of current VLMs in culturally diverse contexts.
- InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
-
InSight-o3 introduces O3-BENCH to evaluate the capability of models to find details while reasoning within high-information-density images. It utilizes a two-agent framework comprising a vReasoner and a vSearcher to train generalized visual search as a plug-and-play component, significantly enhancing multimodal foundation models such as GPT-5-mini and Gemini-2.5-Flash.
- InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
-
InternSVG introduces a "triad" of resources—dataset SAgoge, benchmark SArena, and model InternSVG—to unify SVG understanding, editing, and generation tasks within a single MLLM. By utilizing specialized SVG tokens, subword-mean initialization, and two-stage curriculum training, it outperforms both open-source and closed-source models across self-built and existing benchmarks.
- Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders
-
By systematically masking individual vision encoders in multi-encoder MLLMs, this paper reveals that the "more encoders the better" assumption is a fallacy. It proposes two metrics, CUR and IG, to quantify the marginal contribution and redundancy of each encoder, proving that most tasks can maintain over 90% performance with only 1–2 encoders while significantly reducing training and inference costs.
- IWR-Bench: Can LVLMs Reconstruct Interactive Webpage from a User Interaction Video?
-
This paper introduces IWR-Bench, the first benchmark for Large Vision-Language Models (LVLMs) to reconstruct interactive webpages from "user interaction videos + complete static assets." Using an agent-as-a-judge protocol to evaluate both functional correctness and visual fidelity, experiments on 28 models reveal that even the strongest model scores only 36.35, with functionality scores (IFS 24.39%) significantly lagging behind visual scores (VFS 64.25%).
- K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge
-
This paper proposes the K-Sort Eval framework, which uses posterior correction and dynamic matching strategies to let VLMs reliably and efficiently replace humans in preference evaluation of visual generation models. It typically requires fewer than 90 model runs to reach results consistent with the human Arena.
- Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation
-
KALEIDOSCOPE, built through global open-science collaboration, manually collects 20,911 real-world multiple-choice exam questions across 18 languages and 14 subjects (55% requiring visual context). It establishes the largest "in-language" multilingual multimodal VLM benchmark to date, revealing systematic deficiencies in current VLMs regarding low-resource languages, multimodal reasoning, and STEM subjects.
- KeepLoRA: Continual Learning with Residual Gradient Adaptation
-
An analysis of the singular value decomposition (SVD) of pre-trained model weights shows that general knowledge is encoded in the principal subspace, while domain-specific knowledge is encoded in the residual subspace. The proposed KeepLoRA method constrains LoRA updates for new tasks to the residual subspace and initializes them with gradient information to maintain plasticity, balancing forward stability, backward stability, and plasticity in continual learning.
- Knowledge Exchange with Confidence: Cost-Effective LLM Integration for Reliable and Efficient Visual Question Answering
-
A well-calibrated small VQA model outputs reliable confidence scores to route questions into three tiers: high (VQA answers directly), medium (LLM acts as a "consultant" using candidate answers), or low (LLM acts as a "teacher" via full delegation). This significantly cuts expensive LLM calls while maintaining or even improving accuracy.
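The three-tier routing can be sketched as a simple threshold rule. The threshold values, the tier labels, and the function name `route_question` are assumptions chosen for illustration; the paper's actual thresholds and prompts are not given in this summary.

```python
def route_question(confidence, high=0.8, low=0.4):
    """Map a calibrated VQA confidence in [0, 1] to one of three tiers.

    `high` and `low` are hypothetical thresholds, not values from the paper.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if not low <= high:
        raise ValueError("low threshold must not exceed high threshold")
    if confidence >= high:
        return "vqa_direct"      # small model answers on its own
    if confidence >= low:
        return "llm_consultant"  # LLM picks among the VQA candidate answers
    return "llm_teacher"         # question is fully delegated to the LLM


# Only the medium and low tiers trigger an LLM call.
confidences = [0.95, 0.85, 0.6, 0.2]
tiers = [route_question(c) for c in confidences]
llm_calls = sum(t != "vqa_direct" for t in tiers)
```

The saving comes from the first tier: every question routed to `vqa_direct` skips the LLM entirely, so the routing is only as trustworthy as the small model's calibration.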
- Label-Free Mitigation of Spurious Correlations in VLMs using Sparse Autoencoders
-
DIAL utilizes a pre-trained Sparse Autoencoder (SAE) to decompose CLIP image embeddings into interpretable monosemantic feature directions. It identifies subspaces encoding spurious attributes in a zero-shot manner and removes them from affected samples via orthogonal projection, requiring no training, additional data, class labels, or spurious feature labels.
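The orthogonal-projection step can be sketched in plain Python. This is an illustration under assumptions: the spurious directions are taken as given (in the paper they come from SAE features), the helper names are hypothetical, and real CLIP embeddings would be handled with an array library rather than lists.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(directions, eps=1e-12):
    """Gram-Schmidt: build an orthonormal basis for span(directions)."""
    basis = []
    for d in directions:
        v = list(d)
        for b in basis:
            coeff = dot(v, b)
            v = [x - coeff * y for x, y in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > eps:  # skip directions already inside the span
            basis.append([x / norm for x in v])
    return basis


def project_out(embedding, directions):
    """Remove the component of `embedding` that lies in span(directions)."""
    result = list(embedding)
    for b in orthonormalize(directions):
        coeff = dot(result, b)
        result = [x - coeff * y for x, y in zip(result, b)]
    return result


# Toy example: the spurious subspace is the plane spanned by two
# non-orthogonal directions; only the third coordinate should survive.
spurious = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
cleaned = project_out([3.0, 4.0, 5.0], spurious)
```

After projection the cleaned embedding has zero inner product with every spurious direction, which is the property that makes the removal label-free once the directions are known.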
- Language-Instructed Vision Embeddings for Controllable and Generalizable Perception
-
LIVE directly injects natural language instructions into a vision encoder, enabling the same image to generate different task-centric visual embeddings based on different questions. By training on image-question-answer triplets generated by LLMs, this lightweight vision encoder significantly outperforms static visual representations on MMVP, GQA, and cross-dataset instruction retrieval.
- Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation
-
Lavida-O utilizes a single Masked Diffusion Model (MDM) to simultaneously bridge image understanding, object grounding, image editing, and 1024px high-definition text-to-image generation. By employing an "Elastic Mixture-of-Transformers" (Elastic-MoT) architecture, it efficiently integrates an 8B understanding branch with a 2.4B lightweight generation branch. Furthermore, it introduces planning and self-reflection mechanisms that allow understanding capabilities to enhance generation quality, outperforming Qwen2.5-VL and FluxKontext across RefCOCO, GenEval, and ImgEdit.
- LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR
-
Legato feeds full-page (or even multi-page) printed sheet music images directly into a frozen Llama vision encoder combined with a scratch-trained ABC decoder. It performs end-to-end transcription into concise ABC notation. Leveraging 214,000 synthetic data samples, it is the first large-scale pre-trained OMR model capable of recognizing full-page/multi-page typeset music and outputting ABC. On highly realistic datasets, it reduces TEDn and OMR-NED by absolute margins of 68% and 47.6%, respectively.
- LiveWeb-IE: A Benchmark For Online Web Information Extraction
-
This paper proposes LiveWeb-IE, the first benchmark for online Web Information Extraction (WIE), which covers various data types including text, images, and hyperlinks. It also introduces the Visual Grounding Scraper (VGS) framework, which achieves robust information extraction on dynamic web pages by simulating human cognitive processes—visually scanning to locate regions, precisely pinpointing elements, and generating XPaths.
- LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding
-
LLaVA-4D encodes "3D position + 1D time" into dynamic-aware 4D coordinates as spatiotemporal prompts. It decouples visual features into spatial and temporal components before fusing them with these prompts via cross-attention, enabling multimodal models to simultaneously understand static backgrounds and dynamic objects for the first time.
- LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models
-
LLaVA-FA is an efficient compression method for large multimodal models that performs joint low-rank and quantized weight approximation in the frequency domain. By exploiting the decorrelation and conjugate symmetry properties of the Fourier transform, it achieves a more compact and accurate weight representation. Combined with PolarQuant (polar coordinate quantization) and ODC (Optional Diagonal Calibration), the method outperforms existing efficient multimodal models across multiple benchmarks with minimal activated parameters and computational cost.
- Long-tailed Test-Time Adaptation for Vision-Language Models
-
This paper presents the first systematic study of Test-Time Adaptation (TTA) for Vision-Language Models (VLMs) in long-tailed test streams. It proposes L-TTA, which utilizes Synergistic Prototypes, learnable Rebalancing Shortcuts, and Balanced Entropy Minimization (BEM) to simultaneously address insufficient tail-class semantics, amplified cross-modal bias, and the head-class bias of entropy minimization. It improves both Accuracy and Macro-F1 across OOD, cross-domain, and noisy long-tailed benchmarks.
- Manzano: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
-
Manzano utilizes a hybrid tokenizer consisting of a shared vision encoder and two lightweight adapters (continuous tokens for understanding, discrete tokens for generation). This allows a unified autoregressive LLM to learn both understanding and generation within the same semantic space, while delegating pixel rendering to an external diffusion decoder. This approach nearly eliminates task conflict between understanding and generation and validates scalability from 300M to 30B parameters.
- MaskInversion: Localized Embeddings via Optimization of Explainability Maps
-
MaskInversion fine-tunes no weights. At test time, it treats the alignment between a frozen CLIP's explainability map and a query mask as an optimization objective and iteratively optimizes a single token, yielding a localized embedding for any image region that can directly replace the [CLS] token.
- Massively Multimodal Foundation Models: A Framework for Capturing Interactions with Specialized Mixture-of-Experts
-
This paper proposes the MERGE framework, which decomposes multimodal interactions into "Redundant/Unique/Synergistic (RUS)" signals across time lags using directed information. These signals then guide Mixture-of-Experts (MoE) routing—directing similar modalities to the same experts, unique modalities to different experts, and synergistic modalities to specialized cross-modal experts—significantly enhancing performance and producing interpretable expert specialization in "massively multimodal" scenarios involving dozens of heterogeneous inputs like sensors, imaging, and text.
- MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
-
MCIF is the first human-annotated crosslingual multimodal instruction-following benchmark that covers three modalities (speech/video/text), four languages (EN/DE/IT/ZH), and both long/short contexts, with full parallel alignment across all dimensions. Derived from ACL scientific talk videos, evaluations of 23 mainstream models reveal significant gaps in current MLLMs regarding long-context summarization, joint speech-video understanding, and fine-grained QA.
- Memory-Free Continual Learning with Null Space Adaptation for Zero-Shot Vision-Language Models
-
NuSA-CL extracts the "low-energy null space" of CLIP's current weights via SVD and strictly constrains the low-rank updates of each new task within this null space. After training, updates are merged back into the backbone, achieving continual learning of new tasks with nearly zero loss in original zero-shot capabilities, while maintaining zero storage overhead, zero parameter growth, and zero extra model components.
- MergeTune: Continued Fine-Tuning of Vision-Language Models
-
MERGETUNE defines the recovery of pre-trained knowledge in an already fine-tuned CLIP/VLM as a "continued fine-tuning" problem. By using Linear Mode Connectivity (LMC) constraints to further optimize previously trained parameters, the final model is positioned closer to both the zero-shot CLIP and the downstream fine-tuned model, improving base-novel, cross-dataset, domain generalization, and ID-OOD robustness without adding inference parameters.
- Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering
-
This paper proposes MAPD (Meta-Adaptive Prompt Distillation), a prompt distillation method based on MAML meta-learning. It distills soft prompts from task-related image features through an attention mapper, enabling LMMs to adapt to new VQA tasks with only a few gradient steps at test time, surpassing ICL performance by 21.2%.
- MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites
-
MetaCaptioner proposes CapFlow, a multi-agent pipeline using open-source models to generate high-quality long captions across image and video domains. Through rigorous rejection sampling, a 4.1M training dataset was constructed to fine-tune an 8B multimodal model into a generalist visual captioner that approaches the description quality of commercial models while maintaining strong downstream capabilities.
- MIAM: Modality Imbalance-Aware Masking for Multimodal Ecological Applications
-
The authors formalize the masking strategy as a probability distribution on a unit hypercube and propose MIAM—a hybrid product-beta distribution featuring full support and corner prioritization. It dynamically increases the masking probability for dominant modalities based on relative performance and learning speed, providing a unified mechanism to resolve robustness to missing data, modality imbalance, and fine-grained contribution analysis in multimodal ecological data.
- MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning
-
MMDuet2 reframes the decision of "when to speak" in streaming video as a pure text multi-turn dialogue. In each user turn, 1–2 frames are provided, and the assistant autonomously decides whether to output a response or "NO REPLY." Trained with a multi-turn GRPO strategy and a PAUC-centered reward that eliminates the need for precise response timestamp annotations, a 3B Video MLLM provides fast and accurate proactive responses on ProactiveVideoQA.
- MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models
-
MME-Emotion constructs the largest emotional intelligence benchmark for multimodal large language models to date, comprising 6,500 video segments, 8 emotion tasks, and 27 scenarios, and provides a label-free multi-agent evaluation suite (unifying recognition, reasoning, and CoT scores). An evaluation of 20 frontier MLLMs shows that current emotional intelligence is far from satisfactory: the strongest model, Gemini-2.5-Pro, achieves a recognition score of only 39.3%.
- MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
-
MME-Unify proposes a comprehensive evaluation benchmark for Unified Multimodal Large Language Models (U-MLLMs), placing understanding, generation, and hybrid tasks that require "understanding-then-generation" within a single reproducible scoring framework. The findings reveal that even the top-performing U-MLLMs achieve an overall score of only approximately 50, with significant weaknesses remaining in complex instruction following and multi-step visual state maintenance.
- MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
-
Six 3D vision researchers spent 300+ hours manually crafting 1,000 multi-image spatial reasoning multiple-choice questions from 120,000 real images to form MMSI-Bench. Among 37 mainstream MLLMs, the strongest open-source model scores only 30%, GPT-5 reaches 41.9%, while humans achieve 97%. It also provides an automated error diagnosis pipeline leveraging manual reasoning annotations.
- MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
-
This paper proposes MMTok, a multimodal vision token selection framework based on the Maximum Coverage Problem. It leverages both Text-to-Visual and Visual-to-Visual coverage information to select the most informative subset of vision tokens. In a training-free setting, it significantly outperforms single-modal baselines and even surpasses methods requiring fine-tuning.
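Token selection by maximum coverage is commonly approximated with a greedy rule, sketched below. This is not the paper's implementation: it assumes a single precomputed, non-negative similarity matrix between candidate vision tokens and the targets to be covered, and the name `greedy_coverage_select` is hypothetical.

```python
def greedy_coverage_select(sim, k):
    """Greedily pick up to k candidate tokens that maximize coverage.

    sim[i][j] is the similarity of candidate token i to target j.
    Coverage of a set S is sum_j max_{i in S} sim[i][j].
    Returns the selected candidate indices in the order they were chosen.
    """
    n = len(sim)
    if n == 0 or k <= 0:
        return []
    m = len(sim[0])
    covered = [0.0] * m  # best similarity reached so far for each target
    selected = []
    for _ in range(min(k, n)):
        best_i, best_gain = None, -1.0
        for i in range(n):
            if i in selected:
                continue
            # Marginal gain: how much token i improves each target's coverage.
            gain = sum(max(sim[i][j] - covered[j], 0.0) for j in range(m))
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        covered = [max(covered[j], sim[best_i][j]) for j in range(m)]
    return selected


# Toy matrix: tokens 0 and 1 are near-duplicates, token 2 covers a new target.
sim = [
    [0.9, 0.2, 0.0],
    [0.8, 0.1, 0.0],
    [0.0, 0.1, 0.7],
]
chosen = greedy_coverage_select(sim, k=2)
```

Ranking tokens by total similarity alone would keep the two near-duplicates (0 and 1), whereas the coverage objective prefers token 2 because it adds information the first pick does not already provide.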
- Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?
-
This paper identifies and systematically defines the phenomenon of "Modal Aphasia"—where unified multimodal models can near-perfectly generate visual concepts (such as movie posters) from memory but exhibit error rates over 7 times higher when describing the same concepts in text, with severe hallucinations occurring almost exclusively in the text modality. Through real-world experiments on frontier models (ChatGPT-5) and synthetic controlled experiments on open-source models (Janus-Pro, Harmon), the authors demonstrate that Modal Aphasia is a systemic flaw of current unified architectures rather than a training artifact, revealing its potential threat to AI safety frameworks.
- Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds
-
Addressing the asymmetric alignment problem where "text possesses hierarchical features while images have only one," this paper constructs hierarchical feature trees for both modalities. These trees are embedded into hyperbolic manifolds with different curvatures, and alignment is achieved via an intermediate manifold derived from KL divergence. This approach significantly outperforms strong baselines in taxonomic open-set recognition.
- MoRA: Missing Modality Low-Rank Adaptation for Visual Recognition
-
MoRA utilizes a set of "modality-shared + modality-specific" low-rank parameters to enable vision and text encoders to maintain cross-modal alignment while independently adapting to downstream tasks during fine-tuning. This approach significantly outperforms prompt-based methods in missing modality scenarios with zero additional inference overhead.
- Mordal: Automated Pretrained Model Selection for Vision Language Models
-
Mordal automates the process of selecting the optimal vision encoder and LLM for a VLM. It employs representation similarity clustering to reduce candidate volume and combines early stopping with scaling law prediction to minimize individual evaluation costs. It identifies optimal combinations with 8.9–11.6× less GPU time than grid search.
- MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs
-
MotionSight proposes a training-free video visual prompting method that uses "Visual Spotlights" to amplify object motion and "Synthetic Motion Blur" to amplify camera motion. By decoupling these two types of signals and feeding them into off-the-shelf MLLMs, it significantly improves fine-grained motion understanding. Furthermore, it distills the first large-scale fine-grained motion dataset, MotionVid-QA (40K videos / 87K QA), to train MotionChat.
- Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional
-
This work quantifies intra-modality and inter-modality dependencies across 23 VQA benchmarks through a large-scale empirical study. It reveals that many benchmarks designed to eliminate text bias have inadvertently introduced image bias and proposes a multi-dimensional characterization framework for multi-modal datasets.
- Multimodal Aligned Semantic Knowledge for Unpaired Image-text Matching
-
MASK uses pre-trained word vectors as a bridge to align each word to a "prototypical region representation." It leverages the semantic structure of word vectors to reconstruct visual prototypes for out-of-distribution (OOD) words and employs a prototype consistency contrastive loss to compress intra-class variance. This approach significantly outperforms existing knowledge-based methods in "unpaired image-text matching" without relying on in-domain paired data.
- Multimodal Classification via Total Correlation Maximization
-
This paper analyzes the modality competition problem in multimodal classification from an information-theoretic perspective. It proposes the TCMax loss function, which maximizes the Total Correlation (TC) between multimodal features and labels. By simultaneously addressing joint learning, unimodal learning, and cross-modal alignment, it outperforms SOTA on several audio-visual and image-text classification benchmarks.
- Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis
-
This paper proposes PDS (Prototype-Guided Data Synthesis), the first training-free multimodal dataset distillation framework. It utilizes the CLIP-aligned embedding space for modality-specific clustering, obtains cross-modal image-text prototype pairs through Hungarian matching, and synthesizes distilled images from image prototypes using an unCLIP decoder. At a minimal scale of 100 pairs, PDS outperforms optimization-based methods with zero training cost and achieves SOTA cross-architecture generalization.
- Multimodal Dataset Distillation via Phased Teacher Models
-
Addressing the phenomenon in multimodal dataset distillation where the "teacher is only useful in the first 20–30% of the training phase and the trajectory becomes unstable later," this paper proposes PTM-ST. By using Phased Teacher Models + Shortcut Interpolated Trajectories, the distillation is decomposed into multiple sub-tasks with stabilized gradient directions, significantly outperforming SOTA on Flickr30k/COCO image-text retrieval (Flickr30k average +9.53%, max +13.5%).
- Multimodal Policy Internalization for Conversational Agents
-
The authors propose "Multimodal Policy Internalization (MPI)," a new task for internalizing lengthy and complex multimodal policies (decision rules, tool-use protocols, and even demonstration images) from in-context prompts into model parameters. Using the three-stage training framework TriMPI (Visual Masked Continued Pre-training + CoT-SFT + RL with PolicyRollout), models reach high compliance without being given the policy at inference time, with an absolute gain of up to 70.7% over the CoT-SFT baseline.
- Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
-
This work extends Automated Prompt Optimization (APO) from a text-only space to a multimodal space for the first time. It proposes the MPO framework, featuring alignment-preserving joint exploration (driven by unified semantic gradients for text and non-text updates with Generation/Edit/Mix operators) and prior-inherited Bayesian UCB candidate selection (using parent performance to warm-start child Beta priors). MPO achieves an average accuracy of 65.1% across 10 datasets (image, video, molecule), outperforming the leading text-based APO baseline, ProTeGi (60.0%).
- Naming to Learn: Class Incremental Learning for Vision-Language Model with Unlabeled Data
-
N2L situates Class Incremental Learning (CIL) in a more realistic setting—where each new task provides only class names and unlabeled images. It utilizes CLIP zero-shot for initial pseudo-labeling, followed by pseudo-label refinement through dimensionality reduction, dual-level sample weighting, and recursively solvable ridge regression. This approach allows unlabeled incremental training to approximate joint-training performance while maintaining robustness against noise and forgetting.
- NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
-
The authors replace autoregressive (AR) modeling with Discrete Flow Matching (DFM) as a unified modeling paradigm to develop NExT-OMNI, the first fully DFM-based open-source omnimodal foundation model. A single encoder provides unified representations that support understanding, generation, and cross-modal retrieval across text, image, video, and audio.
- ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?
-
This work constructs ODI-Bench, the first benchmark for systematically evaluating the Omnidirectional Image (ODI) understanding capabilities of MLLMs (2,000 real-world panoramas, 4,254 QAs, 10 fine-grained tasks, dual closed+open formats). Evaluating 20 mainstream models reveals that current MLLMs perform only slightly better than random guessing in immersive spatial understanding. A training-free Chain-of-Thought framework, Omni-CoT, is proposed, improving the overall scores of models like o3 by 6–8 percentage points.
- Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception
-
Addressing the symbiotic problem of "the more detailed the description, the more hallucinations" in omni-modal language models, this paper proposes an agentic "detective" data pipeline (Omni-Detective) that calls various tools to automatically produce high-detail, low-hallucination audio-visual captions. Through two-stage curriculum training, the authors develop Audio-Captioner and Omni-Captioner, and design a cloze-style benchmark, Omni-Cloze. The models achieve open-source SOTA on multiple benchmarks including VDC, MMAU, and Omni-Cloze, rivaling Gemini 2.5 Pro.
- Omni-Weather: A Unified Multimodal Model for Weather Radar Understanding and Generation
-
Omni-Weather is the first foundation model to unify "meteorological generation" (radar nowcasting, satellite-to-radar inversion) and "meteorological understanding" (diagnostic reports for radar images/sequences) within a single multimodal backbone. By employing a shared self-attention mechanism and modality-specific encoders, it expresses various tasks in a unified sequence-to-sequence format. Accompanied by a Chain-of-Thought (CoT) dataset for meteorological causal reasoning, the model enables generation tasks to "think while drawing," outperforming specialized SOTA models in both task categories and demonstrating that generation and understanding can mutually benefit each other.
- OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
-
OmniVideoBench is a high-quality benchmark specifically designed to evaluate "audio-visual collaborative reasoning." Using 628 real-world videos (up to 30 minutes), the authors constructed 1,000 multiple-choice questions with atomic-level reasoning chain annotations through human question generation, dual-model filtering, and manual refinement. Results indicate that even the strongest Gemini-3.0-Pro achieves only 61.8% accuracy—far below the human level of 82.69%—while open-source models perform near chance levels.
- OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
-
OmniVinci employs three architectural improvements for "visual-audio alignment" (OmniAlignNet semantic alignment, temporal grouping, and Constrained Rotary Time Embedding) alongside a data pipeline capable of synthesizing 24 million dialogues. Trained on only 0.2T tokens, this open-source omni-modal LLM understands video, audio, speech, and text simultaneously, outperforming Qwen2.5-Omni (while using only 1/6 of its training tokens) across multiple cross-modal, audio, and visual benchmarks.
- On Discriminative vs. Generative Classifiers: Rethinking MLLMs for Action Understanding
-
The authors revisit the mainstream practice of treating MLLMs as generative classifiers that autoregressively output action labels. They identify semantic overlap caused by shared subwords in action labels as the root cause of low accuracy. Consequently, they transform MLLMs into discriminative classifiers using a learnable [CLS] token and introduce generative modeling as an auxiliary regularization. The proposed GAD (Generation-Assisted Discriminative) framework achieves higher accuracy across 5 datasets and 4 types of temporal action understanding tasks, with up to 3× inference acceleration.
- On the Generalization Capacities of MLLMs for Spatial Intelligence
-
This paper reveals a fundamental flaw in RGB-only spatial reasoning MLLMs: the focal length-depth ambiguity caused by ignoring camera intrinsics. It proposes the Camera-Aware MLLM (CA-MLLM) framework, which utilizes dense camera ray embeddings, camera-aware data augmentation, and geometric prior distillation to improve F1 scores from 39.1% to 52.1% on spatial localization tasks requiring cross-camera generalization.
- One Patch Doesn't Fit All: Adaptive Patching for Native-Resolution Multimodal Large Language Models
-
The authors discover that MLLMs claiming "any-resolution" support are actually highly sensitive to resolution, rooted in the use of fixed patch sizes in ViTs. They propose AdaPatch, which adaptively selects patch sizes per image based on resolution and information density. By using pseudo-inverse resizing, they convert pre-trained fixed-patch models into any-patch models without training, simultaneously improving accuracy, stability, and inference speed at high resolutions by reducing token counts.
- OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging
-
This paper establishes the first benchmark for Multimodal Large Language Model (MLLM) merging with clearly defined capability and modality dimensions. It proposes OptMerge, which utilizes SVD low-rank denoising and robust task vector optimization to merge multiple expert MLLMs into a unified model without data, achieving an average gain of 2.48% and even surpassing mixture-of-data training.
- ORION: Decoupling and Alignment for Unified Autoregressive Understanding and Generation
-
ORION identifies a semantic-structural representation conflict in "monolithic autoregressive" unified MLLMs when learning understanding and generation simultaneously (understanding requires semantic separability, while generation requires low-level reconstructability, creating a "tug-of-war" in shared representations). By employing a non-linear visual head for decoupling and a representation consistency distillation loss for alignment, combined with a three-stage progressive training strategy, a pure monolithic autoregressive backbone achieves performance comparable to or exceeding more complex unified models without any task-specific parameters.
- P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark
-
P2P decomposes the paper-to-poster generation process into three agents—figure understanding, content organization, and HTML layout orchestration—each equipped with self-checking loops. It introduces the P2PINSTRUCT dataset and the P2PEVAL dual-perspective benchmark to evaluate generated posters based on both objective content fidelity and subjective overall quality.
- Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs
-
PaDT treats the patch features of the query image itself as "decodable tokens" (Visual Reference Tokens, VRTs) and inserts them into the autoregressive output of the MLLM. This allows the MLLM to represent detected objects using the image patches themselves rather than textual coordinates. A lightweight decoder then converts these VRTs into boxes, masks, and scores. This approach achieves SOTA across four task categories: detection, referring expression comprehension (REC), referring expression segmentation (RES), and referring image captioning (RIC). Notably, the 3B model outperforms the 78B InternVL3 on RefCOCO REC.
- Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
-
The authors discover that "function words" (high-frequency words with minimal semantic content like the, a, of) are the vulnerability points of vision-language models under cross-modal adversarial attacks. They propose Function-word De-Attention (FDA): calculating a parallel "function-word-to-image" cross-attention as a distraction within the fusion encoder, and differentially subtracting it from the original attention. This reduces the average attack success rate for retrieval tasks by 18/13/53% and visual grounding by approximately 90%, with almost no loss in clean performance (<1%).
- PCLR: Progressively Compressed LoRA for Multimodal Continual Instruction Tuning
-
PCLR decomposes a LoRA adapter into "rank-level atomic experts" to form an extremely fine-grained MoE (LRP). Drawing inspiration from human memory consolidation during sleep, it designs a "Compression–Integration–Learning" (CIL) pipeline: compression prunes redundant rank experts to free up capacity, integration uses distillation to recover knowledge from pruned ranks, and learning uses the freed capacity to accommodate new tasks. Combined with a progressive compression schedule, the model keeps learning new tasks while holding memory growth near the level of "non-expansion" methods. On the CoIN benchmark, it reduces forgetting from 37.29 with plain LoRA to 3.39.
- pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models
-
pFedMMA inserts a "down-projection — shared projection — up-projection" multi-modal adapter into the top layers of CLIP's image/text encoders. In federated learning, each client trains all parameters locally but only uploads and aggregates the shared projection used for cross-modal alignment. This achieves the optimal trade-off between strong personalization and strong generalization (to unseen classes/domains) across 11 datasets.
- PHyCLIP: \(\ell_1\)-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
-
PHyCLIP replaces the "single hyperbolic space" used for image-text embeddings with an \(\ell_1\)-product metric space of \(k\) hyperbolic factors. This allows "is-a" hierarchies within concept families to emerge spontaneously within individual hyperbolic factors, while cross-family compositions (e.g., "dog + car") are captured by the additive geometry of \(\ell_1\) summation, analogous to Boolean algebra. This approach outperforms CLIP / MERU / HyCoCLIP across zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks.
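As a hedged reading of the \(\ell_1\)-product construction (the notation \(x^{(i)}\) for the coordinates of the \(i\)-th factor is an assumption, not taken from the paper), the distance between two embeddings is the sum of the per-factor hyperbolic distances:

\[
d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{k} d_{\mathbb{H}}\bigl(x^{(i)}, y^{(i)}\bigr)
\]

Here \(d_{\mathbb{H}}\) denotes the geodesic distance within a single hyperbolic factor; this note does not state which hyperbolic model the paper uses.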
- PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing
-
PhysLLM utilizes a frozen CNN (PhysNet) as an rPPG backbone and employs "Dual-domain Stabilization + Multi-scale Visual Aggregation + Text Prototype Guidance" to translate signal and visual features into tokens decodable by an LLM. Combined with physiological cue prompts generated by LLaVA and statistical descriptors, the LLM estimates heart rates from facial videos, achieving SOTA results across four datasets and cross-domain tests.
- PI-CCA: Prompt-Invariant CCA Certificates for Replay-Free Continual Multimodal Learning
-
PI-CCA redefines "forgetting" in Vision-Language Models (VLMs) as the drift of image-text alignment geometry. It employs a compact "CCA certificate" (top-k canonical correlation spectra + subspace sketches) as an invariant to constrain LoRA fine-tuning under replay-free, constant memory settings. By averaging across prompt perturbations, it achieves prompt invariance, reaching SOTA performance among replay-free methods on MTIL, X-TAIL, VLCL, and ConStruct-VL benchmarks.
- Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding
-
Addressing the performance collapse of multimodal models when a modality is missing during inference, the authors discover that "modality preference" can be quantified in the frequency domain. They propose the Frequency Ratio Metric (FRM) and a plug-and-play, nearly zero-parameter Multimodal Weight Allocation Module (MWAM). By weighting the neglected modalities during training to balance the optimization, various CNN/ViT backbones become more robust under various missing modality combinations.
- Post-hoc Probabilistic Vision-Language Models
-
A training-free post-hoc uncertainty estimation method is proposed, applying Laplace approximation to the final layers of VLMs such as CLIP/SigLIP. By analytically deriving the uncertainty of cosine similarity, it achieves performance significantly superior to baselines in uncertainty quantification and active learning.
- Preserve and Sculpt: Manifold-Aligned Fine-tuning of Vision-Language Models for Few-Shot Learning
-
This paper treats the CLIP feature space as a "semantic manifold." During few-shot fine-tuning, it constrains the intrinsic geometry of the manifold using Gram matrix alignment to prevent destruction (Preserve) while simultaneously enhancing separability by pulling intra-class samples closer and pushing inter-class samples apart via multimodal query-support matching (Sculpt). This approach improves the few-shot classification SOTA by approximately 1-2.5 percentage points across 11 datasets.
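The "Preserve" idea can be illustrated with a minimal sketch, assuming the constraint is a mean squared difference between the Gram matrices of pretrained and fine-tuned features. The function names and the exact loss form are hypothetical; the paper's formulation may differ.

```python
def gram(features):
    """Gram matrix of pairwise inner products between feature vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in features] for u in features]


def gram_alignment_loss(pretrained_feats, finetuned_feats):
    """Mean squared difference between two Gram matrices (assumed loss form).

    Both inputs are lists of equal-length feature vectors for the same samples.
    """
    g0 = gram(pretrained_feats)
    g1 = gram(finetuned_feats)
    n = len(g0)
    total = sum((g1[i][j] - g0[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)
```

Because a Gram matrix depends only on inner products, a pure rotation of the feature space leaves this loss at zero, whereas collapsing two distinct samples onto the same vector does not.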
- PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
-
This work constructs PRISMM-Bench, the first benchmark of multimodal inconsistencies in scientific papers grounded in real-world reviewer annotations. By mining 384 cross-modal inconsistencies from 18,009 ICLR open reviews, it designs Identifier/Remedy/Matching tasks and proposes a JSON-structured debiased answer representation. Evaluation of 21 top-tier LMMs shows a peak performance of only 53.9%, systematically exposing severe deficiencies in current models regarding cross-modal reasoning in scientific documents.
- Procedural Mistake Detection via Action Effect Modeling
-
This paper proposes a dual-branch multimodal supervised action effect modeling framework. It combines a visual branch (extracting object states and spatial relationship features) with a text branch (utilizing GPT-4o generated scene graphs) to distill external supervision signals into learnable effect tokens, achieving SOTA mistake detection performance in egocentric procedural videos.
- Prompt-Robust Vision-Language Models via Meta-Finetuning
-
Addressing the fragility where CLIP-like vision-language models exhibit significant performance fluctuations with different prompt wordings, this paper proposes Promise. It treats various semantically equivalent prompt templates as "tasks" within a meta-learning framework. By employing an inner-outer bi-level loop, adaptive prompt weighting, and token-level learning rates, the model learns a set of prompt tokens insensitive to wording. It reduces prompt sensitivity while enhancing generalization across 15 benchmarks (achieving a +1.13 gain in the harmonic mean for base-to-new on 11 datasets).
- PSP: Prompt-Guided Self-Training Sampling Policy for Active Prompt Learning
-
PSP models the "sample selection" process as a reinforcement learning problem. It utilizes Soft Actor-Critic (SAC) with a "real-pseudo hybrid reward" derived from prompt learning feedback to optimize the sampling policy end-to-end. Simultaneously, it leverages a teacher CLIP model to assign reliable pseudo-labels to unselected samples, ensuring selected samples effectively serve the optimization of prompt templates. This approach improves average accuracy across seven downstream datasets from 74.36% to 76.87%.
- QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining
-
This paper identifies two major defects in CLIP vision encoders: "mesoscopic bias" and "interpolation bias." It proposes QLIP—a "drop-in" modification that replaces uniform grid patching with content-adaptive quadtree patching and uses a small MLP to re-interpolate position encodings. Without retraining the vision encoder or the LLM, it improves LLaVA-1.5 performance on the fine-grained VQA benchmark V* by up to 13.6%.
- RAG4DMC: Retrieval-Augmented Generation for Data-Level Modality Completion
-
RAG4DMC introduces Retrieval-Augmented Generation (RAG) to "data-level missing modality completion" for the first time. By constructing a dual knowledge base from "internal complete samples + external public datasets," the method applies cross-modal mapping, cluster filtering, and orthogonal alignment for purification. It utilizes two-stage multimodal fusion retrieval to retrieve the most relevant examples, guiding a generative model to produce and select the best candidate to fill missing images or text. Training downstream image-text retrieval and image captioning models on the completed data yields gains of up to +5.0.
- RAR: Reversing Visual Attention Re-Sinking for Unlocking Potential in Multimodal Large Language Models
-
This paper discovers that the final layers of MLLMs are often inferior to intermediate layers ("sub-optimal output layers") and traces the root cause to "visual attention re-sinking." Text-only supervision causes visual token attention gradients to become sparse, forcing late-stage attention to retreat to low-semantic background tokens. The proposed parameter-free SADS framework retains all visual heads and minimal sink heads (including one shared head) during inference, outperforming standard SFT on 20 benchmarks with a 10.3% speedup.
- RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
-
RAVENEA is the first benchmark constructed to evaluate multimodal retrieval-augmented cultural understanding. It consists of 1,868 instances and 11,396 human-ranked Wikipedia documents covering 11 categories across 8 countries. Evaluations of 7 multimodal retrievers and 17 VLMs demonstrate that culture-aware RAG improves performance by an average of 6% on cVQA and 11% on cIC.
- Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models
-
Drawing inspiration from the "dual-stream hypothesis" of human vision, this paper dissects VLM visual processing into "what" (object identification) and "where" (localization) pathways. By using logit lens to translate image patches into text tokens, the authors discover that the vision encoder follows a two-stage Gestalt-like process: "attribute recognition followed by object disambiguation." Furthermore, after theoretically deriving the geometric structure of spatial relations in 2D RoPE, they propose an instruction-agnostic token compression algorithm to increase speed and a RoPE scaling technique to enhance spatial reasoning.
- Reconstruction Alignment Improves Unified Multimodal Models
-
RECA treats the visual understanding embeddings of a Unified Multimodal Model (UMM) as dense "visual prompts." By re-aligning the understanding and generation branches through post-training with unlabeled image reconstruction, it significantly improves 1.5B UMMs on GenEval, DPGBench, and image editing benchmarks without requiring extra captions, GPT-4o distillation, or reinforcement learning.
- Rethinking Causal Mask Attention for Vision-Language Inference
-
This paper re-evaluates the rationality of decoder-only VLMs inheriting the causal mask from LLMs. It finds that allowing visual tokens to "see" future visual/textual context during the prefill stage improves performance in multi-image, visual relationship, and text-dense QA tasks. The authors propose a lightweight future-aware attention that compresses future attention into prefix positions, retaining most benefits while maintaining causal decoding and low latency.
- Reversible Primitive–Composition Alignment for Continual Vision–Language Learning
-
Addressing the overlooked phenomenon in VLM sequential adaptation where "primitive recognition remains while compositional ability degrades," this paper proposes COMPO-REALIGN—a lightweight alignment head. It utilizes a Cayley orthogonal reversible composer to synthesize composition embeddings from primitive embeddings, treats text and synthetic compositions as dual positive samples for images via a multi-positive InfoNCE, and clips gradients using a spectral trust region when alignment sensitivity inflates. It improves the strongest baseline by +2.4 R@1 and reduces forgetting by approximately 40% in compositional DIL and multi-domain MTIL retrieval tasks.
- Revisit Visual Prompt Tuning: The Expressiveness of Prompt Experts
-
This work reveals the limitations of VPT from a Mixture-of-Experts (MoE) perspective—prompt experts are input-independent constant functions with limited expressiveness. VAPT is proposed to make prompt experts input-adaptive through token-wise projectors and shared feature projectors, achieving superior performance with fewer parameters and providing theoretical guarantees for optimal sample efficiency.
- Revisiting Confidence Calibration for Misclassification Detection in VLMs
-
This work demonstrates that standard confidence calibration limits the misclassification detection capability of VLMs even under perfect calibration. It introduces MisD-oriented reliability curves, a differentiable surrogate loss, and a lightweight posterior meta network to learn instance-wise temperature coefficients, effectively separating correct predictions from incorrect ones.
- Revisiting Multimodal Positional Encoding in Vision-Language Models
-
This paper systematically deconstructs the two pillars of multimodal RoPE—"position design" and "frequency allocation"—distilling three criteria: position consistency, full-spectrum utilization, and preserving text priors. Based on these, the authors propose the architecture-free spatial-reset position design along with two frequency allocation variants, MHRoPE and MRoPE-I, which consistently outperform existing RoPE schemes across 20+ benchmarks in image, video, and grounding tasks.
- RL Makes MLLMs See Better Than SFT
-
This paper systematically compares the different impacts of SFT and RL (represented by DPO) on Multimodal Large Language Models (MLLMs) and their vision encoders. It finds that DPO not only performs better on vision-intensive VQA tasks but also reshapes the vision encoder to be more fine-grained and capable of localization. Based on this, it proposes PIVOT, an extremely low-cost recipe for vision encoder evolution.
- RLAP-CLIP: Continual Multimodal Learning with Prototype Adaptation and Difficulty-Aware Routing
-
RLAP-CLIP addresses class-incremental multimodal continual learning for CLIP by replacing simple mean category prototypes with reinforcement learning-based weighted optimization. It utilizes vision-text dual-modal prompts and difficulty-aware MoE routing to process samples of varying complexity, consistently outperforming methods like PROOF and C-CLIP across eight classification datasets.
- ScaleCap: Scalable Image Captioning via Dual-Modality Debiasing
-
ScaleCap employs two complementary modules—"Heuristic Question Answering" and "Contrastive Sentence Scoring"—to rectify descriptive biases in open-source LVLMs. The former recovers omitted object details through iterative questioning, while the latter removes hallucinated sentences caused by language priors via offline contrastive decoding. The system scales in precision and detail with increased inference budget. Pre-training on 450,000 images annotated by ScaleCap demonstrates consistent performance gains across 11 benchmarks.
- Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models
-
This paper introduces DECEPTIONDECODED: a large-scale multimodal news benchmark anchored in credible news contexts that explicitly models misleading creator intent. Using 12,000 image-text samples, it diagnoses VLM vulnerabilities to content that is "surface-consistent but intentionally misleading" and demonstrates that fine-tuning on such data improves general multimodal misinformation detection.
- Seeing What's Not There: Negation Understanding Needs More Than Training
-
Addressing the persistent issue where CLIP-like vision-language models fail to understand "negation," this paper proposes a completely training-free zero-shot method. By using rules to extract negated concepts from sentences, it subtracts this semantic portion in the text embedding space via projection and adds back an anchor bias. This improves vanilla CLIP's performance on NegBench MCQ from 25.5% to 67.0%, outperforming models specifically fine-tuned on negation datasets.
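The projection step can be sketched as follows, assuming embeddings are plain float vectors. The `anchor_emb` and `anchor_weight` parameters are hypothetical stand-ins for the "anchor bias", whose exact form this note does not specify.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Raises ZeroDivisionError on a zero vector, e.g. when the text
    # embedding is exactly parallel to the negated concept.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def remove_negated_concept(text_emb, negated_emb, anchor_emb=None, anchor_weight=0.0):
    """Project the negated concept's direction out of a text embedding.

    anchor_emb / anchor_weight are illustrative assumptions, not the paper's API.
    """
    n_hat = normalize(negated_emb)
    coeff = dot(text_emb, n_hat)
    # Subtract the component of the text embedding along the negated concept.
    out = [t - coeff * n for t, n in zip(text_emb, n_hat)]
    if anchor_emb is not None:
        out = [o + anchor_weight * a for o, a in zip(out, anchor_emb)]
    return normalize(out)
```

When no anchor term is added, the returned embedding is orthogonal to the negated concept, so cosine similarity with images dominated by that concept drops.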
- Self-Aug: Query and Entropy Adaptive Decoding for Large Vision-Language Models
-
This paper proposes Self-Aug, a training-free decoding strategy that utilizes Self-Augmented Selection (SAS) Prompting to allow LVLMs to dynamically select visual augmentations aligned with query semantics using their own knowledge. It also introduces the Sparsity-Adaptive Thresholding (SAT) algorithm, which utilizes the full entropy information of the output distribution to dynamically adjust the candidate vocabulary size. Self-Aug consistently outperforms existing contrastive decoding methods across 5 LVLMs and 7 benchmarks.
- Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking
-
This work proposes the EvoQuality framework, which generates pseudo-ranking labels through pairwise majority voting combined with GRPO self-iterative optimization. This allows VLMs to autonomously improve image quality perception without human annotation, achieving a 31.8% PLCC improvement in zero-shot performance and surpassing supervised SOTA on 5 out of 7 IQA benchmarks.
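A minimal sketch of pairwise majority voting, assuming the VLM is sampled several times per image pair and each sample votes "A" (first image is better) or "B". The vote encoding and the choice to drop ties are illustrative assumptions; the GRPO stage is not shown.

```python
from collections import Counter


def majority_vote(votes):
    """Return the majority label among 'A'/'B' votes, or None on a tie."""
    counts = Counter(votes)
    a, b = counts.get("A", 0), counts.get("B", 0)
    if a == b:
        return None
    return "A" if a > b else "B"


def pseudo_rank_labels(pair_votes):
    """Map {(img_i, img_j): [votes]} to {(img_i, img_j): preferred image}.

    Tied pairs are dropped because they carry no usable ranking signal.
    """
    labels = {}
    for (i, j), votes in pair_votes.items():
        winner = majority_vote(votes)
        if winner is not None:
            labels[(i, j)] = i if winner == "A" else j
    return labels
```

The resulting pseudo-labels can then serve as ranking targets in place of human annotations.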
- Shuffle-R1: Efficient RL Framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle
-
The Shuffle-R1 framework is proposed to address two key efficiency bottlenecks in RL training: Advantage Collapsing and Rollout Silencing. By implementing Pairwise Trajectory Sampling (selecting high-contrast trajectory pairs) and Advantage-based Batch Shuffle (reallocating training batches by advantage values), it achieves a 22% improvement over the baseline on Geo3K and surpasses GPT-4o on MathVerse.
- SigLIP-HD by Fine-to-Coarse Supervision
-
SigLIP-HD utilizes a frozen SigLIP 2 to generate fine-grained teacher features from multi-scale images, supervising an architecturally identical student model to learn clearer visual tokens from \(512^2\) images alone. This enhances the OCR, chart, and detail perception of MLLMs without increasing inference costs.
- Simulation to Rules: A Dual-VLM Framework for Formal Visual Planning
-
VLMFP utilizes a SimVLM, proficient in visual-spatial understanding and action simulation, to supervise a GenVLM adept at PDDL generation. It automatically converts visual planning tasks from images into PDDL problem/domain files solvable by formal planners, significantly outperforming direct VLM planning and feedback-free PDDL generation baselines in grid-world and 3D tasks.
- SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence
-
SpaCE-10 constructs a benchmark for compositional spatial intelligence in MLLMs by decomposing spatial capabilities in real indoor scenes into 10 atomic capabilities, which are then recomposed into 8 categories of QA tasks. Utilizing 811 real-world scenes and 5k+ high-quality QA pairs, it reveals significant weaknesses in current models regarding multi-view integration, counting, inverse reasoning, and situational perspective understanding.
- Sparsity Forcing: Reinforcing Token Sparsity of MLLMs
-
This work proposes Sparsity Forcing, an RL post-training framework based on GRPO. It adopts an MLLM with top-\(p\) sparse attention as the policy model and the original MLLM as the reference model. Through multi-budget rollout exploration of varying token retention thresholds \(p\), it performs group-relative optimization using a joint reward of efficiency (token reduction rate) and performance (answer correctness). This framework enhances the token reduction rate of Qwen2/2.5-VL from 20% to 75% with minimal accuracy loss, achieving a 3× memory reduction and 3.3× decoding acceleration.
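Top-\(p\) token retention and a joint reward can be sketched as below. This is an illustration under stated assumptions: `efficiency_weight` and the additive combination of correctness and token reduction are hypothetical, since this note does not give the paper's exact reward.

```python
def top_p_keep(attn_weights, p):
    """Indices of the smallest set of tokens whose normalized attention mass reaches p."""
    total = sum(attn_weights)
    # Visit tokens from highest to lowest attention weight.
    order = sorted(range(len(attn_weights)), key=lambda i: attn_weights[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += attn_weights[i] / total
        if mass >= p:
            break
    return sorted(kept)


def joint_reward(is_correct, n_kept, n_total, efficiency_weight=0.5):
    """Correctness plus a weighted token reduction rate (assumed additive form)."""
    reduction_rate = 1.0 - n_kept / n_total
    return float(is_correct) + efficiency_weight * reduction_rate
```

Lower values of `p` keep fewer tokens and raise the efficiency term, so rollouts at several budgets expose the trade-off that the group-relative optimization then ranks.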
- SpatialViz-Bench: A Cognitive Science-Driven Benchmark for Diagnosing the Spatial Visualization Capabilities of MLLMs
-
To address the gap in existing multimodal benchmarks that only evaluate "visible information" rather than "internal mental rotation/folding/perspective" of objects, this paper decomposes spatial visualization into 4 sub-capabilities × 12 tasks based on cognitive science. It uses Python+FreeCAD to procedurally generate 1,180 infinitely expandable and contamination-free questions. Evaluation across 27 MLLMs reveals that the strongest model, Gemini-2.5-pro, achieves only 44.66% (Human: 82.46%), and the use of CoT in open-source models leads to performance degradation.
- SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start
-
SPECS redesigns the cold start phase of multimodal large models before entering RLVR: it first constructs preference pairs that distinguish only the output paradigm through self-distillation, then uses DPO+SFT loss for format pre-alignment, and finally applies GRPO for deep reasoning. This approach achieves better generalization, training stability, and multimodal reasoning performance than traditional SFT cold starts.
- SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery
-
This work proposes SpectralGCD, which represents images as semantic mixtures (cross-modal similarity vectors) over a CLIP concept dictionary. It automatically selects task-relevant concepts through spectral filtering and maintains semantic quality via bi-directional knowledge distillation. It achieves multimodal SOTA across six benchmarks with computational costs comparable to unimodal methods.
- StreamingVLM: Real-Time Understanding for Infinite Video Streams
-
StreamingVLM utilizes a unified framework of "overlapping short segments during training and reusing compact KV cache during inference," enabling 7B-class VLMs to maintain low latency, long-range memory, and second-level real-time commentary capabilities over hours of video streams.
- Supporting Multimodal Intermediate Fusion with Informatic Constraint and Distribution Coherence
-
This paper re-analyzes the disparity between multimodal Intermediate Fusion (IF) and Late Fusion (LF) from the perspective of generalization error. It proposes IID, which employs informatic constraints to ensure the linear target mapping of IF meets theoretical conditions, and utilizes Wasserstein distribution coherence with RIP dimensionality reduction to mitigate inter-modal distribution misalignment. The approach achieves consistent performance gains in vision-language classification, scene recognition, and multimodal knowledge graph link prediction.
- TableDART: Dynamic Adaptive Multi-Modal Routing for Table Understanding
-
TableDART dynamically selects the optimal processing path (Text-only / Image-only / Fusion) for each query-table pair via an MLP gating network with only 2.59M parameters. By reusing frozen unimodal expert models and introducing an LLM Agent for cross-modal fusion, it outperforms the strongest MLLM baseline, HIPPO, by an average of 4.02% across 7 table understanding benchmarks while reducing latency by 24.5%.
- TABLET: A Large-Scale Dataset for Robust Visual Table Understanding
-
TABLET reorganizes 14 seed datasets for table understanding into 4 million visual table instruction samples. It prioritizes retrieving original table screenshots from actual web pages or documents, enabling VLMs to learn layouts, colors, merged cells, and image cues from real-world tables beyond synthetic renderings.
- Talking Points: Describing and Localizing Pixels
-
This paper introduces TalkingPoints, which utilizes a Point Descriptor to describe individual pixels or keypoints in an image using coarse-to-fine natural language, and a Point Localizer to regress pixel coordinates from descriptions. It evaluates and trains "whether a point is clearly described" based on localization accuracy.
- Teaching VLMs to Admit Uncertainty in OCR from Lossy Visual Inputs
-
Addressing the hallucination issue where VLMs "fluently fabricate typos without warning" on blurry/degraded documents, this paper teaches models to frame uncertain segments with <C>...</C> tags during transcription. By employing a "pseudo-label cold start + multi-objective reward GRPO" training strategy, the model achieves a word-level F1 of 0.685 for uncertainty labels on the self-built Blur-OCR benchmark without sacrificing transcription accuracy.
- The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss
-
This paper identifies that the widely adopted Pre-Norm architecture in MLLMs causes a severe norm discrepancy between high-norm visual tokens and low-norm text tokens, leading to slow visual token updates ("representation inertia") and cross-modal attention collapse. The authors insert a carefully initialized LayerNorm after the visual projector to force norm alignment, coupled with Global Weight Compensation to resolve subsequent vanishing gradients. On LLaVA-1.5, this improves not only multimodal benchmarks but also text-only MMLU performance.
- Thinking as Society: Multi-Social-Agent Self-Distillation for Multimodal Misinformation Detection
-
This work utilizes a group of "social user" MLLM agents to perform truthfulness judgments on multimodal content from different stances. Their collective feedback is distilled into high-quality "Social Chain-of-Thought" (SCoT) preference data. Using a preference optimization algorithm, SCPO, which employs "social misjudgment degree" as a verifiable weight, the collective reasoning capabilities are internalized into a single 7B Qwen2-VL. This model outperforms larger open-source models and specialized multi-agent frameworks on MFC-Bench / MMFakeBench, even approaching or exceeding GPT-4o and Claude.
- Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
-
This paper proposes Puffin, which treats "camera parameters" as a language within a Large Multimodal Model (LMM). By utilizing a shared "thinking with camera" Chain-of-Thought (CoT), it simultaneously performs camera understanding (estimating roll/pitch/FoV from images) and camera-controllable generation (generating images from specific viewpoints), outperforming specialized models in both domains.
- Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders
-
Nar-KFC compresses long video inputs into "query-relevant and diverse keyframes + non-keyframe narratives inserted in real-time order," significantly enhancing performance in various long-video question-answering and open-ended generation tasks without retraining the MLLM.
- Turning Internal Gap into Self-Improvement: Promoting the Generation-Understanding Unification in MLLMs
-
This paper first validates that the generation branch of unified MLLMs is often weaker than the understanding branch using a "non-unification score" system. It then transforms this internal gap into a self-improvement signal without external reward models: the understanding branch filters generation candidates to construct SFT/DPO data, and curriculum replay is used to further exploit hard samples, simultaneously improving generation quality, understanding discriminative ability, and generation-understanding consistency.
- U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning
-
The authors systematically ablate the design space of MLLM embedding learning, revealing that bidirectional attention + mean pooling outperforms the mainstream last token approach, and learnable temperature is a significantly undervalued key factor. Based on these findings, the U-MARVEL three-stage framework (progressive transition → filtered hard negatives → reranking distillation) is constructed. It achieves a 63.2% Avg on M-BEIR, substantially surpassing existing SOTA as a single model, and leads in zero-shot transfer for CIR and T2V.
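A minimal sketch of two of the ablated choices, pooling strategy and temperature, using plain Python lists as stand-ins for hidden states. The function names and toy values are assumptions made for illustration. In actual training the temperature would be a learnable parameter, whereas here it is an ordinary argument.

```python
import math

def mean_pool(hidden_states, mask):
    """Average the token vectors whose mask entry is 1."""
    kept = [h for h, m in zip(hidden_states, mask) if m]
    dim = len(hidden_states[0])
    return [sum(h[i] for h in kept) / len(kept) for i in range(dim)]

def last_token_pool(hidden_states, mask):
    """Return the vector of the last non-padding token."""
    last = max(i for i, m in enumerate(mask) if m)
    return hidden_states[last]

def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def info_nce(query, candidates, positive_index, temperature):
    """Contrastive loss: cross-entropy over cosine similarities / temperature."""
    q = normalize(query)
    logits = [
        sum(a * b for a, b in zip(q, normalize(c))) / temperature
        for c in candidates
    ]
    top = max(logits)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[positive_index]

# Toy sequence: two real tokens followed by one padding token.
hidden = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled_mean = mean_pool(hidden, mask)
pooled_last = last_token_pool(hidden, mask)

# The same query/candidate pair scored at two temperatures.
loss_soft = info_nce([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0, temperature=1.0)
loss_sharp = info_nce([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0, temperature=0.1)
```

Mean pooling draws on every non-padding token, while last-token pooling depends on a single position. The temperature rescales the logits, so it directly controls how strongly the loss separates the positive from the negatives.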
- Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models
-
To address the optimization dilemma in unified multimodal models where "enhancing generation degrades understanding and vice versa," this paper proposes the Reason-Reflect-Refine (R3) framework. It reformulates single-step image generation into a multi-step chain process of "Reason → Generate → Reflect → Refine," making generation inherently dependent on the model's understanding capabilities. Combined with Tree-structured Reinforcement Learning, the method significantly improves both generation (GenEval++ 0.371 → 0.689) and understanding (ITA 60.6 → 73.4) on BAGEL.
- Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs
-
The paper proposes Uni-DPO, which introduces a unified adjustment of preference pair weights through quality-aware weighting (prioritizing pairs with high score differences), performance-aware weighting (utilizing focal loss to focus on underfit samples), and a calibrated NLL loss component. Uni-DPO consistently outperforms DPO and SimPO on text understanding and mathematical reasoning benchmarks. Notably, Gemma-2-9B achieves 67.1% on Arena-Hard, surpassing Claude 3 Opus (60.4%).
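The two weighting ideas can be sketched on top of the standard DPO loss. The specific functional forms below are assumptions chosen for illustration, not the paper's formulas: a capped linear function of the score gap for `quality_weight`, and a focal factor \((1-\sigma(m))^\gamma\) for `focal_weight`. The calibrated NLL term is omitted.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_margin(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO implicit-reward margin between chosen and rejected."""
    return beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))

def quality_weight(score_chosen, score_rejected, scale=1.0):
    """Hypothetical quality-aware weight: grows with the score gap, capped at 1."""
    gap = max(score_chosen - score_rejected, 0.0)
    return min(gap / scale, 1.0)

def focal_weight(margin, gamma=2.0):
    """Focal-style weight: larger when the pair is underfit (low margin)."""
    p_correct = math.exp(log_sigmoid(margin))
    return (1.0 - p_correct) ** gamma

def weighted_dpo_loss(margin, score_chosen, score_rejected, gamma=2.0, scale=1.0):
    """DPO loss scaled by both (assumed) weights."""
    weight = quality_weight(score_chosen, score_rejected, scale) * focal_weight(margin, gamma)
    return -weight * log_sigmoid(margin)

# Toy pairs: one the policy still gets wrong, one it already separates well.
underfit = dpo_margin(-12.0, -11.0, -11.5, -11.5, beta=1.0)
well_fit = dpo_margin(-10.0, -14.0, -11.5, -11.5, beta=1.0)
loss_underfit = weighted_dpo_loss(underfit, 9.0, 3.0, scale=10.0)
loss_well_fit = weighted_dpo_loss(well_fit, 9.0, 3.0, scale=10.0)
```

With identical quality scores, the underfit pair contributes far more to the loss than the well-fit pair, which is the behavior the focal-style term is meant to produce.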
- UniF2ace: A Unified Fine-grained Face Understanding and Generation Model
-
UniF2ace is the first Unified Multimodal Model (UMM) to unify facial "understanding" (VQA / description) and "generation" (text-to-face) within a single framework. It enhances fine-grained generation fidelity through a D3Diff loss that unifies masked generation with discrete score matching. To combat "attribute forgetting," it employs a grouped token-level + sequence-level MoE architecture to reinject semantic and identity features. Additionally, it introduces the UniF2aceD-1M dataset containing 130K image-text pairs and 1M VQA samples. At the 1.8B scale, its Desc-GPT and VQA-score outperform models of similar magnitude by 7.1% and 6.6%, respectively.
- Unified Vision-Language Modeling via Concept Space Alignment
-
This paper proposes v-Sonar, which aligns vision encoders post-hoc to the Sonar text embedding space. This alignment enables the Large Concept Model (LCM), trained solely on Sonar space, to process visual inputs zero-shot. Further extension via instruction tuning yields v-LCM, which outperforms existing VLMs in 61 out of 62 languages.
- UniHM: Unified Dexterous Hand Manipulation with Vision Language Model
-
UniHM is proposed as the first unified language-conditioned dexterous hand manipulation framework. It maps heterogeneous robotic hands to a shared discrete space via a morphology-agnostic VQ codebook, combines a VLM for instruction-driven sequence generation, and ensures physical feasibility through physics-guided dynamic optimization.
- UniLIP: Revamping CLIP to Unify Multimodal Understanding, Generation, and Editing
-
UniLIP utilizes "two-stage + self-distillation" training to transform CLIP, originally proficient only in understanding, into a unified visual encoder capable of high-fidelity pixel reconstruction while preserving semantic integrity. Coupled with a "multimodal hidden states + query embeddings" dual-condition architecture bridging MLLMs and Diffusion models, the 1B/3B small models outperform larger unified models such as BAGEL (7B) and UniWorld-V1 (12B) on GenEval (0.90), WISE (0.63), and ImgEdit (3.94).
- Unlocking the Power of Co-Occurrence in CLIP: A DualPrompt-Driven Method for Training-Free Zero-Shot Multi-Label Classification
-
This paper discovers that rewriting CLIP's discriminative prompts (containing only the target label) into "relational prompts" with co-occurring labels introduces co-occurrence information to improve multi-label recognition, but also causes CLIP to overfit co-occurrences and produce object hallucinations. Consequently, the authors model co-occurrence as a mediator using causal inference and derive a training-free calibration formula—simply adding the prediction scores of the discriminative and relational prompt paths (DualPrompt)—outperforming existing SOTA on MS-COCO and VG-256.
- UrbanFeel: A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective
-
UrbanFeel establishes a multimodal large model evaluation benchmark for urban street views. Using 11 tasks and 14.3K visual question-answering samples, it simultaneously examines static scene recognition, long-term temporal change understanding, and subjective perception consistency across dimensions like safety, aesthetics, wealth, and liveliness. The study finds that while current MLLMs approach human levels in single-frame subjective judgment, they significantly lag in cross-temporal sorting and urban evolution reasoning.
- VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery
-
This paper constructs VaseVQA-3D, the first 3D visual question answering dataset for ancient Greek pottery (664 3D models + 4460 QAs). By utilizing a synthetic pipeline involving "2D image filtering → single-image 3D reconstruction → six-dimensional archaeological semantic cleaning," the authors trained a domain-specific model, VaseVLM. Its 7B-RL version achieves a 12.8% improvement in R@1 and a 6.6% increase in lexical similarity compared to the strongest baseline.
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
-
This paper proposes HiCo, a hierarchical video token compression method that compresses long video context from the clip level to the video level to approximately \(1/50\) of its original size (averaging only 16 tokens per frame). Combined with a multi-stage short-to-long training strategy, the LongVid dataset containing 114K long videos, and a more challenging multi-hop NIAH evaluation, VideoChat-Flash achieves better performance than GPT-4o / Gemini-1.5-Pro on both short and long video benchmarks at the 7B scale, while achieving 99.1% accuracy on the 10,000-frame NIAH test.
- ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Models
-
ViPER reformulates the enhancement of fine-grained visual perception in VLMs as a coarse-to-fine two-stage task. It utilizes a closed-loop framework where the model "generates its own data and learns from it." By utilizing a diffusion model to reconstruct images from textual descriptions as a critic for the VLM, combined with two-stage reinforcement learning, Qwen2.5-VL evolves stronger perception capabilities without relying on external distillation or cold-start data (achieving up to +6.0% in fine-grained perception).
- VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models
-
VisCodex utilizes "task vectors" to arithmetically merge a powerful code LLM into the language backbone of a Vision-Language Model (VLM), while keeping the vision encoder and projection layer frozen. Combined with a self-constructed 598k multimodal coding dataset (MCD) for supervised fine-tuning, the MLLM retains vision understanding while gaining strong coding capabilities. It achieves open-source SOTA on UI-to-code and chart-to-code tasks, approaching GPT-4o performance.
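A minimal sketch of task-vector merging, using dictionaries of float lists as toy stand-ins for model weights. The parameter names, the `vlm_weight` and `code_weight` coefficients, and the prefix-based freezing rule are assumptions for illustration; the paper's exact merging formula may differ.

```python
def merge_language_backbone(vlm, base_llm, code_llm,
                            vlm_weight=1.0, code_weight=0.5,
                            frozen_prefixes=("vision_encoder.", "projector.")):
    """Merge a code LLM into a VLM's language backbone via task arithmetic.

    A task vector is (fine-tuned weights - base weights). The merged backbone
    is base + vlm_weight * tau_vlm + code_weight * tau_code. Parameters under
    the frozen prefixes are copied from the VLM unchanged.
    """
    merged = {}
    for name, values in vlm.items():
        if name.startswith(frozen_prefixes) or name not in base_llm:
            merged[name] = list(values)  # vision encoder / projector stay as-is
            continue
        base, code = base_llm[name], code_llm[name]
        merged[name] = [
            b + vlm_weight * (v - b) + code_weight * (c - b)
            for v, b, c in zip(values, base, code)
        ]
    return merged

# Toy weights (hypothetical parameter names).
base_llm = {"llm.w": [1.0, 1.0]}
code_llm = {"llm.w": [1.5, 0.5]}
vlm = {
    "vision_encoder.w": [0.2, 0.3],
    "projector.w": [0.1],
    "llm.w": [1.2, 1.0],
}
merged = merge_language_backbone(vlm, base_llm, code_llm)
```

Only the language-backbone parameter changes; the vision encoder and projector entries are passed through untouched, mirroring the frozen components described in the summary.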
- Vision-Zero: Scalable VLM Self-Evolution via Multi-Agent Self-Play
-
Vision-Zero brings the "Who is the Spy" game into the visual world: Citizens receive real images and Spies receive blank ones, so VLMs generate their own training data through multi-role adversarial play. By alternating Self-Play and RLVR optimization (Iterative-SPO), Qwen2.5-VL-7B outperforms SOTA models trained on expensive human-annotated data in reasoning, chart, and vision-centric tasks while using zero human annotation.
- Vision Language Models are Biased
-
This paper proposes the VLMBias counterfactual visual evaluation framework, which systematically modifies iconic visual elements in animals, logos, flags, chessboards, game boards, optical illusions, and patterned grids. It finds that mainstream VLMs achieve an average accuracy of only 17.05% on objective counting tasks, with 75.70% of responses reverting to commonsense priors rather than visual evidence.
- VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
-
This paper proposes VisJudge-Bench, the first comprehensive benchmark for evaluating the aesthetics and quality of data visualizations (comprising 3,090 samples across 32 chart types). It further introduces the VisJudge model, which reduces MAE by 23.9% compared to GPT-5 and improves consistency with human experts by 60.5%.
- Visual Compositional Tuning
-
COMPACT transforms visual instruction tuning samples from "single-visual-ability QA" into "natural combinations of multiple atomic visual abilities," matching or even slightly exceeding the average performance of full visual instruction tuning using only 10% of the LLaVA-665K data volume.
- Visual Jigsaw Post-Training Improves MLLMs
-
The classic "shuffle and sort" jigsaw task is integrated into the reinforcement learning post-training phase of MLLMs. Without changing the architecture, adding generation modules, or requiring any human annotation, this self-supervised verifiable reward approach significantly enhances fine-grained perception, temporal understanding, and spatial reasoning across image, video, and 3D modalities.
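What makes the jigsaw task attractive for RL is that its reward can be checked mechanically. A minimal sketch, assuming a per-position accuracy reward and zero reward for invalid permutations; the function names and the exact reward shape are illustrative assumptions, not the paper's definition.

```python
import random

def make_puzzle(pieces, seed):
    """Shuffle pieces; return them with the ground-truth original positions."""
    rng = random.Random(seed)
    order = list(range(len(pieces)))
    rng.shuffle(order)
    shuffled = [pieces[i] for i in order]
    # shuffled[k] originally sat at position order[k]
    return shuffled, order

def jigsaw_reward(predicted_order, true_order):
    """Fraction of pieces placed correctly; 0 if the answer is not a permutation."""
    if sorted(predicted_order) != sorted(true_order):
        return 0.0
    correct = sum(p == t for p, t in zip(predicted_order, true_order))
    return correct / len(true_order)

def restore(shuffled, order):
    """Undo the shuffle using a predicted (or true) order."""
    restored = [None] * len(shuffled)
    for k, position in enumerate(order):
        restored[position] = shuffled[k]
    return restored

pieces = ["top-left", "top-right", "bottom-left", "bottom-right"]
shuffled, truth = make_puzzle(pieces, seed=0)
```

Because the ground truth comes from the shuffle itself, no human annotation is needed, and the same scheme applies to image patches, video clips, or 3D views.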
- Visual Prompt-Agnostic Evolution
-
The authors propose Prompt-Agnostic Evolution (PAE), which accelerates VPT convergence (average 1.41× speedup) and improves accuracy by 1–3% across 25 datasets through frequency-aware task initialization (MPA) and cross-layer prompt correlation via the Koopman-Lyapunov discrete dynamical system (KLD). It is plug-and-play for various VPT variants with zero inference overhead.
- Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
-
Addressing the issues of missing points, misalignment, and hallucinations in vision-intensive chart parsing, this paper proposes the "Visual Self-Refine" (VSR) paradigm. The model first outputs pixel-level coordinates, visualizes these marks back onto the image for iterative error correction, and finally uses verified coordinates as "finger anchors" to parse numerical values. A 3B model outperforms Gemini-2.5-Pro on the self-constructed, high-difficulty ChartP-Bench.
- Visual Symbolic Mechanisms: Emergent Symbol Processing in Vision Language Models
-
This paper discovers an emergent three-stage symbolic processing mechanism (ID retrieval → ID selection → feature retrieval) within VLMs. The mechanism uses content-independent spatial position indices (position IDs) to solve the visual binding problem, and the paper demonstrates that binding errors can be traced directly to failures of this mechanism.
- VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
-
VL-JEPA replaces the traditional autoregressive token generation of VLMs with non-autoregressive prediction of semantic embeddings for target text. Under identical training settings, it uses fewer parameters and converges faster than token-space VLMs, while naturally supporting selective decoding for classification, retrieval, VQA, and online video scenarios.
- VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
-
The authors construct VLSU—a multimodal safety benchmark containing 8,187 real image-text pairs, covering 15 harm categories and 17 safety combination patterns. The study systematically reveals a fundamental defect in mainstream VLMs: while maintaining 90%+ accuracy in unimodal safety scenarios, performance plummets to 20–55% in "jointly harmful" scenarios where individual modalities are safe but their combination requires cross-modal joint reasoning.
- WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM
-
WAVE projects text, audio, silent video, and synchronized audio-visual streams into a unified semantic space based on Qwen2.5-Omni. By employing "dual-audio encoders + hierarchical full-layer feature fusion + joint multimodal multi-task training," it achieves any-to-any retrieval and instruction-dependent prompt-aware embeddings, reaching SOTA on the MMEB-v2 video track.
- WebDS: An End-to-End Benchmark for Web-based Data Science
-
The authors propose WebDS, the first end-to-end web-based data science benchmark (870 tasks, 29 websites, 10 domains). The strongest current agent (BrowserUse + GPT-4o) completes only 15% of tasks, compared with 90% for humans, revealing a significant performance gap in real-world data science workflows.
- WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent
-
WebWatcher is a "deep research" web agent capable of joint reasoning across text and image modalities. It utilizes automatically synthesized high-quality tool-calling trajectories for SFT cold-starts, followed by GRPO reinforcement learning to refine decision-making. It also introduces the BrowseComp-VL benchmark requiring cross-modal retrieval, outperforming prompt-based workflows and existing open-source multimodal agents on challenging leaderboards like HLE, LiveVQA, and MMSearch.
- When MLLMs Meet Compression Distortion: A Coding Paradigm Tailored to MLLMs
-
The authors systematically analyze the impact of image compression distortion on Multimodal Large Language Models (MLLMs), identifying "cross-level features" as the most vulnerable. Consequently, they propose CoTAM, an image codec tailored for MLLMs. It utilizes shallow CLIP attention for semantic rate allocation at the encoder and preserves multi-level information via a reconstruction prior, adapters, and multi-level losses at the decoder, saving up to 35.99% bitrate while maintaining downstream performance.
- Why Keep Your Doubts to Yourself? Trading Visual Uncertainties among Vision-Language Models
-
This paper proposes Agora, which reconstructs the collaboration among multiple heterogeneous VLMs into an "uncertainty trading market." It decomposes epistemic uncertainty into three-dimensional tradable assets (perceptual, semantic, and reasoning). Agents sell uncertainty to the most capable and cost-effective experts according to economic rules of "minimizing total system cost." A market broker extended from Thompson Sampling selects the initial agent. Agora achieves significant performance gains (e.g., +8.5% on MMMU) while reducing costs by more than 3x across five multimodal benchmarks.
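For readers unfamiliar with the broker's starting point, here is a sketch of plain Beta-Bernoulli Thompson Sampling used to pick an agent. This is the textbook algorithm only; Agora's extension of it, the cost terms, and the uncertainty trading are not modeled, and the success rates below are made-up values.

```python
import random

def thompson_select(successes, failures, rng):
    """Sample a success rate per agent from its Beta posterior; pick the best."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

def simulate(true_rates, rounds, seed=0):
    """Run Thompson Sampling against agents with fixed (hidden) success rates."""
    rng = random.Random(seed)
    n = len(true_rates)
    successes, failures, picks = [0] * n, [0] * n, [0] * n
    for _ in range(rounds):
        agent = thompson_select(successes, failures, rng)
        picks[agent] += 1
        if rng.random() < true_rates[agent]:
            successes[agent] += 1
        else:
            failures[agent] += 1
    return picks

# Hypothetical agents: the second one is clearly the most reliable.
picks = simulate([0.3, 0.9, 0.5], rounds=500)
print(picks)
```

The sampler explores all agents early on and then concentrates its choices on the one with the best observed record, which is why it is a natural basis for selecting an initial agent.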
- Why Reinforcement Fine-Tuning Preserves Prior Knowledge Better: A Data Perspective
-
This work systematically investigates the impact of SFT and RFT on prior knowledge through jigsaw tasks, revealing that RFT's ability to avoid catastrophic forgetting stems from its data distribution rather than algorithmic differences—data sampled by RFT naturally aligns with the base model's probability landscape, resulting in minimal interference.
- WorldSense: Evaluating Real-World Omnimodal Understanding for Multimodal LLMs
-
WorldSense is the first real-world omnimodal video understanding benchmark that mandates audio-visual synergy. It comprises 1,662 synchronized audio-visual segments and 3,172 multiple-choice questions, each designed such that "it cannot be answered correctly if either audio or video is removed." Results show that even the strongest Gemini 2.5 Pro achieves only 65.1% accuracy, while most open-source omnimodal models perform close to random guessing.
- XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
-
XModBench is the first "tri-modal fully balanced" multiple-choice benchmark, comprising 61,000 questions that ask the same semantic content across Audio/Image/Text modalities and 6 "context \(\to\) candidate" directions. It specifically diagnoses whether Omni-modal Large Language Models (OLLMs) achieve modality-agnostic reasoning or rely on surface-level features—concluding that even the strongest Gemini 2.5 Pro falls far short of the standard.
- Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition
-
This paper proposes DA-HOI, a zero-shot HOI detection framework that completely decouples object detection from interaction recognition. It leverages the VQA capabilities of MLLMs to replace traditional CLIP features for interaction recognition. The core contributions include Deterministic Generation (reaching 31.50 mAP training-free), Spatial-Aware Pooling (SAP, introducing spatial priors and cross-attention), and single-pass Deterministic Matching (DM, reducing \(M\) forward passes to one). It outperforms state-of-the-art (SOTA) methods across four zero-shot settings on HICO-DET and allows for plug-and-play switching of any detector after training.