🎵 Audio & Speech

📷 CVPR2026 · 17 paper notes

BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models: This paper proposes BabyVLM-V2, a framework that constructs three formats of pretraining data (768K image pairs + 181K video pairs + 63K interleaved sequences) from the SAYCam longitudinal egocentric corpus, designs the DevCV Toolbox (10 developmental cognitive tasks) grounded in the NIH Baby Toolbox®, and demonstrates that a compact model trained from scratch surpasses GPT-4o on selected mathematical tasks — representing the first systematic exploration of Artificial Developmental Intelligence (ADI).
Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning: This paper proposes Refine, an ensemble active learning method that employs a two-stage strategy—progressive filtering (iteratively refining the unlabeled pool via multiple strategies) and coverage-based selection (selecting high-value, diverse samples from the refined pool)—to consistently outperform individual AL strategies and existing ensemble methods without requiring prior knowledge of the optimal strategy.
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models: This paper proposes MMHNet, a Multimodal Hierarchical Network built on non-causal Mamba-2 that generalizes across length: trained on short clips (8 seconds), it generates high-quality, well-aligned audio for long videos (5+ minutes). MMHNet substantially outperforms existing methods on the UnAV100 and LongVale benchmarks.
GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization: This paper proposes GEM-TFL, a two-stage classification-regression framework that narrows the gap between weak and full supervision for temporal forgery localization. It introduces three core modules: EM-based decomposition of binary labels into multi-dimensional latent attributes, training-free temporal consistency refinement (TCR), and graph diffusion proposal refinement (GPR). The method improves average mAP by 4–8% on weakly supervised temporal forgery localization benchmarks.
Omni-MMSI: Toward Identity-Attributed Social Interaction Understanding: This paper introduces the Omni-MMSI task—understanding multi-person social interactions from raw audio-visual inputs (rather than pre-processed oracle social cues)—and proposes Omni-MMSI-R, a reference-guided pipeline that achieves accurate social interaction understanding via tool-generated identity-attributed social cues combined with chain-of-thought reasoning.
OmniRet: Efficient and High-Fidelity Omni Modality Retrieval: This paper proposes OmniRet, the first unified retrieval model supporting composed queries across text, vision, and audio modalities. It introduces a Shared Media Resampler to improve computational efficiency and Attention Sliced Wasserstein Pooling (ASWP) to preserve fine-grained information, achieving state-of-the-art performance on 12 out of 13 retrieval tasks.
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text: This paper proposes the Universal Holistic Audio Generation (UniHAGen) task and the OmniSonic framework, which employs a TriAttn-DiT architecture with triple cross-attention and MoE gating to simultaneously generate on-screen environmental sound, off-screen environmental sound, and human speech within a unified audio synthesis pipeline, achieving comprehensive state-of-the-art performance on the newly constructed UniHAGen-Bench.
SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval: This paper proposes SAVE, a speech-aware video representation learning method that introduces a dedicated speech branch (Whisper ASR + CLIP text encoder) and a soft-ALBEF visual-audio early alignment strategy, achieving comprehensive state-of-the-art performance across five video-text retrieval benchmarks.
Semantic Audio-Visual Navigation in Continuous Environments: This paper introduces the SAVN-CE task, extending semantic audio-visual navigation to continuous 3D environments, and proposes MAGNet (Memory-Augmented Goal description Network). By fusing historical context and ego-motion cues, MAGNet achieves robust goal inference after target sounds cease, yielding absolute success rate improvements of up to 12.1%.
Solution for 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition Challenge using Divergence-Based Multimodal Fusion: For the Ambivalence/Hesitancy (A/H) recognition task of the 10th ABAW competition, this paper proposes a divergence-based multimodal fusion strategy that explicitly models cross-modal conflict by computing pairwise absolute differences among embeddings from three modalities — visual (AU), audio (Wav2Vec 2.0), and text (BERT) — achieving a Macro F1 of 0.6808 on the BAH dataset, substantially surpassing the baseline of 0.2827.
Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach: This work is the first to incorporate behavioral description embeddings extracted by a VLM (Qwen3-VL-4B-Instruct) as an independent third modality, combining them with GRADA facial encodings and WavLM audio features via two fusion strategies—DCMMOE and RAAV—achieving a continuous VA estimation CCC of 0.658 (dev) / 0.62 (test) on Aff-Wild2, demonstrating the value of VLM behavioral semantics for continuous emotion recognition.
Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach: This paper proposes a multimodal approach combining facial visual features, VLM-based behavioral description embeddings, and audio features for continuous valence-arousal (VA) estimation. Two fusion strategies—DCMMOE and RAAV—are explored, achieving competitive results on the Aff-Wild2 dataset.
Tri-Subspaces Disentanglement for Multimodal Sentiment Analysis: This paper proposes the TSD framework, which explicitly decomposes multimodal features into three complementary subspaces—globally shared, pairwise shared, and modality-private—and adaptively integrates these three levels of information via a subspace-aware cross-attention (SACA) fusion module, achieving state-of-the-art performance on CMU-MOSI and CMU-MOSEI.
UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark: This paper proposes UniM, the first unified any-to-any interleaved multimodal benchmark (31K samples, 7 modalities, 30 domains), accompanied by a three-dimensional evaluation suite and an agentic baseline UniMA based on traceable evidence reasoning, revealing critical deficiencies of existing MLLMs under the interleaved multimodal paradigm.
Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods: Through systematic data-centric experiments, this paper demonstrates that audio pre-training performance is primarily driven by label/supervision quality rather than model design. It proposes the Unified Tag System (UTS), which unifies speech, music, and environmental sound under a high-granularity vocabulary of 800–3k tags. Models trained with UTS surpass AudioSet baselines on out-of-domain tasks such as speaker verification (VoxCeleb2) and music (MusicCaps) using 5× less data.
ViDscribe: Multimodal AI for Customizing Audio Description and Question Answering in Online Videos: This paper presents ViDscribe, a web platform integrating AI-generated audio descriptions (with 6 user-customizable options) and a conversational visual question answering interface. A longitudinal field study with 8 blind and low-vision (BLV) users demonstrates that customized audio descriptions significantly improve effectiveness, enjoyment, and immersion.
ViDscribe: Multimodal AI for Customizing Audio Description and Question Answering in Online Videos: ViDscribe is a web-based platform that leverages a multimodal large language model (Gemini 3 Pro) to provide customizable AI-generated audio descriptions (AD) and interactive visual question answering (VQA) for blind and low-vision (BLV) users. Supporting arbitrary YouTube videos, the system is validated through a one-week longitudinal user study, which demonstrates that customized AD outperforms default AD in terms of effectiveness, enjoyment, and immersion.
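
As a rough illustration of the divergence-based fusion idea in the Ambivalence/Hesitancy entry above, the sketch below computes pairwise absolute differences among three modality embeddings. It is not the authors' implementation. The function name `divergence_fusion`, the assumption that all embeddings already share one dimension, and the choice to concatenate the raw embeddings with the difference vectors are illustrative assumptions that the summary does not confirm.

```python
from itertools import combinations


def divergence_fusion(embeddings):
    """Fuse modality embeddings with their pairwise absolute differences.

    `embeddings` maps a modality name to a list of floats. All vectors are
    assumed to share one dimension (e.g. after a projection layer); that
    projection and the final concatenation are assumptions of this sketch.
    """
    names = list(embeddings)  # dict insertion order fixes the layout
    dim = len(embeddings[names[0]])
    if any(len(embeddings[name]) != dim for name in names):
        raise ValueError("all modality embeddings must share one dimension")

    fused = []
    # 1) Keep the raw per-modality embeddings.
    for name in names:
        fused.extend(embeddings[name])
    # 2) Append |a - b| for every unordered pair of modalities, so that
    #    cross-modal disagreement becomes an explicit feature.
    for a, b in combinations(names, 2):
        fused.extend(abs(x - y) for x, y in zip(embeddings[a], embeddings[b]))
    return fused


# Toy 2-dimensional embeddings standing in for AU, Wav2Vec 2.0, and BERT features.
example = {
    "visual": [1.0, 2.0],
    "audio": [0.0, 2.0],
    "text": [3.0, -1.0],
}
fused_example = divergence_fusion(example)
```

With three modalities of dimension d, this layout yields a 6d-dimensional vector: 3d for the embeddings and 3d for the three pairwise difference vectors. A downstream classifier would consume that vector.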