🎵 Audio & Speech
📷 CVPR2026 · 22 paper notes
📌 Same area in other venues: 🔬 ICLR2026 (80) · 💬 ACL2026 (72) · 🧪 ICML2026 (36) · 🤖 AAAI2026 (31) · 🧠 NeurIPS2025 (47) · 📹 ICCV2025 (11)
🔥 Top topics: Speech & Audio ×14 · Multimodal/VLM ×3 · Alignment/RLHF ×2 · Diffusion Models ×2
- AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding
-
This paper introduces AMUSE—an audio-visual benchmark for "multi-speaker, dialogue-dense" scenarios (6 agentic tasks × Zero-shot/Guided/Agentic evaluation modes), revealing systematic weaknesses in mainstream MLLMs like GPT-4o and Qwen3-Omni regarding "who is speaking, when, and cross-scene causality." It also proposes the RAFT alignment framework (Reflective Reward + Selective Reasoning Adaptation), which improves the accuracy of open-source models on this benchmark by up to 39.52% (relative) using minimal annotations.
- AudioStory: Generating Long-Form Narrative Audio with Large Language Models
-
AudioStory integrates LLM narrative reasoning with a DiT diffusion audio generator into an end-to-end framework. The LLM first decomposes complex instructions into timestamped sub-events, then generates short audio segments sequentially to form long-form narrative audio. Decoupled bridging via "semantic tokens + residual tokens" ensures intra-segment alignment and cross-segment coherence, enabling stable generation of multi-scene audio stories up to 150 seconds.
- BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
-
BabyVLM-V2 builds three formats of pretraining data (768K image pairs + 181K video pairs + 63K interleaved sequences) from SAYCam, a longitudinal corpus recorded from an infant's first-person perspective. It also designs the DevCV Toolbox (10 developmental cognitive tasks) based on the NIH Baby Toolbox®. A compact model trained from scratch surpasses GPT-4o on certain mathematical tasks, marking the first systematic exploration of Artificial Developmental Intelligence (ADI).
- Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning
-
The authors propose Refine, an ensemble active learning method that consistently outperforms individual AL strategies and existing ensemble methods. It employs a two-stage strategy: progressive filtering (iterative refinement of the unlabeled pool using multiple strategies) followed by coverage selection (selecting high-value diverse samples from the refined pool) without requiring prior knowledge of the optimal strategy.
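The two-stage idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy scoring functions, the `keep_fraction` parameter, and the greedy farthest-point rule used for coverage are all hypothetical stand-ins.

```python
import math

def progressive_filter(pool, strategies, keep_fraction=0.75):
    """Shrink the pool by applying each strategy in turn, keeping its top-scored share."""
    remaining = list(pool)
    for score in strategies:
        keep = max(1, math.ceil(len(remaining) * keep_fraction))
        remaining = sorted(remaining, key=score, reverse=True)[:keep]
    return remaining

def coverage_select(candidates, budget):
    """Greedy farthest-point selection: repeatedly add the point farthest from those chosen."""
    if not candidates:
        return []
    chosen = [candidates[0]]
    while len(chosen) < min(budget, len(candidates)):
        best = max(
            (c for c in candidates if c not in chosen),
            key=lambda c: min(math.dist(c, s) for s in chosen),
        )
        chosen.append(best)
    return chosen

def toy_uncertainty(p):
    # Hypothetical strategy: points near the diagonal count as more uncertain.
    return -abs(p[0] - p[1])

def toy_value(p):
    # Hypothetical strategy: a larger coordinate sum counts as more valuable.
    return p[0] + p[1]

# Hypothetical 2-D feature points standing in for unlabeled samples.
pool = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0),
        (0.5, 0.5), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]

refined = progressive_filter(pool, [toy_uncertainty, toy_value])
batch = coverage_select(refined, budget=2)
```

No single strategy has to be picked in advance: each one only narrows the pool, and the final batch is chosen for diversity within what survives.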
- Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
-
This paper proposes MMHNet, a multi-modal hierarchical network built on non-causal Mamba-2. It generalizes across length: trained on short segments (8s), it generates high-quality, aligned audio for long videos (5+ minutes), significantly outperforming existing methods on the UnAV100 and LongVale benchmarks.
- EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
-
Addressing the issues of "visual dominance, inability to understand text instructions, and lack of fine-grained editing" in existing video-to-audio models, this paper proposes the EchoFoley task (using symbolic "sound event" representations + three levels of control granularity) along with a densely annotated benchmark of 6k samples. It designs EchoVidia, a training-free agentic framework (using slow-fast thinking + an action pool), which improves controllability by approximately 40.7% and perceptual quality by 12.5% over the strongest baseline.
- FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
-
FoleyDirector attaches a pluggable adapter to a pre-trained DiT-based V2A generator (MMAudio), utilizing "director's script"-style per-second Structured Temporal Scripts (STS) to supplement visual cues and realize precise temporal control over sound occurrence. By employing dual-stream parallel rendering for on-screen/off-screen sounds, it raises the control F1 score on DirectorBench from 0.2451 to 0.4819 with almost no degradation in original audio quality.
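To make the idea of a per-second script and an F1-style control score concrete, here is a small sketch. The script layout, the event names, and the scoring rule are assumptions for illustration only; the summary does not specify how DirectorBench computes its F1.

```python
# Hypothetical per-second script: second index -> sound events expected in that second.
script = {0: {"footsteps"}, 1: {"footsteps", "door"}, 2: set(), 3: {"glass_break"}}

# Hypothetical events detected in a generated clip, in the same layout.
detected = {0: {"footsteps"}, 1: {"door"}, 2: {"door"}, 3: {"glass_break"}}

def per_second_f1(script, detected):
    """F1 over (second, event) pairs: an event counts only in the second it was scripted."""
    tp = fp = fn = 0
    for sec in sorted(set(script) | set(detected)):
        want = script.get(sec, set())
        got = detected.get(sec, set())
        tp += len(want & got)   # scripted and present
        fp += len(got - want)   # present but not scripted
        fn += len(want - got)   # scripted but missing
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

f1 = per_second_f1(script, detected)
```

In this toy clip one scripted event is missing and one unscripted event appears, so precision and recall are both 3/4.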
- GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization
-
GEM-TFL bridges the gap between weak and full supervision with a two-stage classification-regression framework. It incorporates three modules: EM-based decomposition of binary labels into multi-dimensional latent attributes, training-free Temporal Consistency Refinement, and Graph-diffusion Proposal Refinement. Together they yield a 4-8% average mAP improvement in weakly supervised temporal forgery localization.
- Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization
-
VisioSonic feeds dual-stream conditioning of "CLIP low-frame-rate semantics + Synchformer high-frame-rate temporal" features into a video-text-audio co-attention Diffusion Transformer, trained with rectified flow matching, to generate audio for silent videos. It further maximizes semantic and temporal alignment using STAR-DPO, a fully automated preference optimization requiring no human annotation. With only 151M trainable parameters (the fewest among similar works), it achieves the strongest distribution matching and audio-visual synchronization on VGGSound.
- Hierarchical Codec Diffusion for Video-to-Speech Generation
-
HiCoDiT reframes "silent video to speech" generation as a masked diffusion task that proceeds layer-by-layer along the RVQ discrete token hierarchy. Lower-level tokens handle content and timbre under lip-motion and identity guidance, while higher-level tokens manage prosody via dual-scale AdaLN modulation of expressions. This approach achieves leading performance in naturalness, intelligibility, and lip-sync on LRS2/LRS3 through zero-shot cross-dataset evaluation.
- How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?
-
This paper proposes the first scalable framework to generate data in bulk using text-to-X generative models for training Sound Source Localization (SSL) models. It demonstrates that purely synthetic data can match the performance of real data, and replacing noisy real "intermediate frames" with synthetic images can "purify" the training set. Hybrid training involving real and synthetic data achieves new SOTA results across three tasks: single-source localization, audio-visual segmentation, and interactive localization.
- InfinityHuman: Towards Long-Term Audio-Driven Human Animation
-
InfinityHuman proposes a coarse-to-fine framework that "generates motion at low resolution first, then refines via pose guidance." By utilizing pose sequences—which are decoupled from appearance and naturally resistant to temporal degradation—alongside a first-frame visual anchor, the method combats identity drift and color shifts in long videos. It further introduces hand-specific reward feedback learning to correct hand distortions. The model achieves SOTA performance on EMTD/HDTF datasets in terms of image quality, identity preservation, hand accuracy, and lip-sync for long-term audio-driven full-body animation.
- Omni-MMSI: Toward Identity-Attributed Social Interaction Understanding
-
This paper proposes the Omni-MMSI task: understanding multi-person social interactions from raw audio-visual inputs rather than pre-processed oracle social cues. It also designs Omni-MMSI-R, a reference-guided pipeline that achieves accurate social interaction understanding by combining tool-generated, identity-attributed social cues with Chain-of-Thought (CoT) reasoning.
- OmniRet: Efficient and High-Fidelity Omni Modality Retrieval
-
This paper proposes OmniRet, the first unified retrieval model supporting text-vision-audio tri-modal composed queries. It improves computational efficiency via a Shared Media Resampler and introduces Attention Sliced Wasserstein Pooling (ASWP) to preserve fine-grained information, achieving leading results on 12 out of 13 retrieval tasks.
- OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
-
This paper proposes the Universal Holistic Audio Generation (UniHAGen) task and the OmniSonic framework. Utilizing a TriAttn-DiT architecture with tri-way cross-attention and a MoE gating mechanism, it achieves the unified synthesis of on-screen/off-screen ambient sounds and human speech for the first time, significantly outperforming SOTA models on the newly constructed UniHAGen-Bench.
- PAVAS: Physics-Aware Video-to-Audio Synthesis
-
PAVAS explicitly injects two physical quantities, "object-level mass + velocity," into a latent diffusion Video-to-Audio (V2A) framework. It utilizes a VLM to estimate mass and combines segmentation with dynamic 3D reconstruction to estimate velocity. These physical cues are fed into a Diffusion Transformer via a Phy-Adapter with zero-initialized residuals, ensuring that generated sound intensity and decay align with physical dynamics. On the self-constructed VGG-Impact benchmark, it reduces the physics consistency metric (APCC-∆) from over 0.5 to 0.378.
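The role of a zero-initialized residual can be shown with a toy adapter. This is a sketch of the general technique, not the paper's Phy-Adapter: the class name, the dimensions, and the single linear projection are assumptions. The point is that at initialization the adapter leaves the pretrained features unchanged, so training starts from the original model's behavior.

```python
class ZeroInitResidualAdapter:
    """Adds a linear projection of physical cues to hidden features.

    The projection weights start at zero, so the residual is zero until trained.
    """

    def __init__(self, cue_dim, hidden_dim):
        self.weight = [[0.0] * cue_dim for _ in range(hidden_dim)]

    def __call__(self, hidden, cues):
        delta = [sum(w * c for w, c in zip(row, cues)) for row in self.weight]
        return [h + d for h, d in zip(hidden, delta)]

adapter = ZeroInitResidualAdapter(cue_dim=2, hidden_dim=3)
hidden = [0.5, -1.0, 2.0]   # hypothetical hidden features of the generator
cues = [3.2, 1.5]           # hypothetical (mass, velocity) estimates

out_at_init = adapter(hidden, cues)   # identical to `hidden`

# Simulate one learned weight: only now do the cues influence the features.
adapter.weight[0][0] = 0.1
out_after = adapter(hidden, cues)
```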
- Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
-
PEAV (Perception Encoder Audiovisual) is a family of unified "audio-visual-text" contrastive encoders proposed by Meta. It uses a two-stage synthetic caption data engine to generate high-quality captions across three categories (audio, visual, audiovisual) for O(100M) audiovisual pairs. By employing up to ten sets of cross-modal contrastive losses to align audio, video, and text in a single space, it sets new SOTA results across four zero-shot categories: sound, music, speech, and video (e.g., AudioCaps T→A R@1 improved from 35.4 to 45.8, VGGSound classification from 36.0 to 47.1). Furthermore, it makes "speech→transcript" retrieval work effectively for the first time, jumping from near 0 to 85.6.
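Summing contrastive losses over several modality pairs can be sketched as follows. The summary does not say which contrastive formulation is used, so this assumes a symmetric softmax-based InfoNCE loss; the embeddings, the temperature, and the choice of three pairs are hypothetical.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(a_batch, b_batch, temperature=0.07):
    """Symmetric InfoNCE: the i-th item of a_batch should match the i-th item of b_batch."""
    a = [normalize(v) for v in a_batch]
    b = [normalize(v) for v in b_batch]
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]

    def cross_entropy_diag(rows):
        # Mean cross-entropy with the diagonal entry as the correct class.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

# Hypothetical 2-D embeddings for a batch of two clips.
audio = [[1.0, 0.0], [0.0, 1.0]]
video = [[0.9, 0.1], [0.2, 0.8]]
text = [[1.0, 0.1], [0.1, 1.0]]

# One loss term per modality pair; more pairs would simply extend this list.
pairs = [(audio, video), (audio, text), (video, text)]
total_loss = sum(info_nce(a, b) for a, b in pairs)

# Mismatched pairing for comparison: the loss should be clearly higher.
shuffled_loss = info_nce(audio, video[::-1])
```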
- SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
-
This paper proposes SAVE, which learns speech-aware video representations by adding a dedicated speech branch (Whisper ASR + CLIP text encoder) and a soft-ALBEF vision-audio early alignment strategy, significantly outperforming the SOTA on five video-text retrieval benchmarks.
- TAPE: Task-Adaptive Prototype Evolution in Audio-Language Models for Fully Few-shot Class-incremental Audio Classification
-
Addressing the "Fully Few-shot Class-incremental Audio Classification" (FFCAC) task where both base and incremental stages have extremely few samples, TAPE avoids fine-tuning the text branch of CLAP. Instead, it freezes the audio encoder and learns a linear Task-Adapter to project audio into an orthogonal reference point space to resist forgetting. During inference, it dynamically updates class prototypes using low-entropy query samples to combat overfitting. The method improves average accuracy from 54.93% to 82.76% across three datasets.
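The inference-time prototype update can be illustrated with a small sketch. Everything specific here is an assumption rather than the paper's procedure: the cosine-similarity classifier, the entropy threshold, the momentum update, and the class names are hypothetical. It only shows the general pattern of letting confident (low-entropy) queries refine prototypes while ambiguous ones are ignored.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def update_prototypes(prototypes, queries, max_entropy=0.3, momentum=0.9, scale=10.0):
    """Move a class prototype toward each query whose prediction entropy is low."""
    updated = {name: list(vec) for name, vec in prototypes.items()}
    names = list(updated)
    for q in queries:
        probs = softmax([scale * cosine(q, updated[n]) for n in names])
        if entropy(probs) > max_entropy:
            continue  # ambiguous query: leave prototypes untouched
        winner = names[probs.index(max(probs))]
        updated[winner] = [momentum * p + (1 - momentum) * x
                           for p, x in zip(updated[winner], q)]
    return updated

# Hypothetical class prototypes and query embeddings.
prototypes = {"dog": [1.0, 0.0], "siren": [0.0, 1.0]}
queries = [[0.9, 0.1],   # confidently "dog": used for the update
           [0.5, 0.5]]   # ambiguous: skipped

new_prototypes = update_prototypes(prototypes, queries)
```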
- Tri-Subspaces Disentanglement for Multimodal Sentiment Analysis
-
TSD explicitly decomposes multimodal features into three complementary subspaces: global shared, pairwise shared, and modality-private. A subspace-aware cross-attention fusion module adaptively integrates the three subspaces, achieving state-of-the-art (SOTA) performance on the CMU-MOSI and CMU-MOSEI datasets.
- UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
-
This paper introduces UniM, the first unified any-to-any interleaved multimodal benchmark (31K samples, 7 modalities, 30 domains), along with a three-dimensional evaluation suite and an agentic baseline UniMA based on traceable reasoning. The study reveals significant deficiencies in existing MLLMs under the interleaved multimodal paradigm.
- Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods
-
This paper demonstrates through systematic data-centric experiments that audio pre-training performance is primarily driven by label/supervision quality rather than model design. It proposes the Unified Tag System (UTS) to unify speech, music, and environmental sounds into a fine-grained vocabulary of 800-3k labels. Models trained with UTS achieve performance surpassing AudioSet baselines on out-of-domain tasks like speech (VoxCeleb2) and music (MusicCaps) using 5 times less data.