🎵 Audio & Speech

📷 CVPR2025 · 19 paper notes

🔥 Top topics: Speech & Audio ×11 · Multimodal/VLM ×4 · Layout & Composition ×2

Contextual AD Narration with Interleaved Multimodal Sequence

This paper proposes Uni-AD, a unified framework that takes interleaved multimodal sequences (video features + text + character bank + context) as input. It aligns features through a visual mapping network, identifies main characters via a character-refinement module, and improves contextual consistency with a contrastive loss, achieving SOTA performance on MAD-eval-Named.

Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

This paper proposes Crab, a unified audio-visual scene understanding model. By constructing the AV-UIE dataset (200K samples) with explicit reasoning processes, it clarifies the collaborative relationships across tasks. Combined with interaction-aware LoRA (multi-head LoRA) designed to learn different audio-visual interaction patterns, Crab outperforms specialized models across multiple tasks.

DistinctAD: Distinctive Audio Description Generation in Contexts

Proposes DistinctAD, which generates distinctive audio descriptions (ADs) in context. Contrastive learning encourages each AD to differ from the preceding and succeeding ADs, which avoids generic, featureless descriptions.
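As a rough illustration of how a contrastive objective can push an AD away from its neighbours, the sketch below computes a generic InfoNCE-style loss. It is not the loss from the paper: the function names, the cosine similarity, the temperature value, and the choice of treating neighbouring ADs as negatives are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (hypothetical setup, not the paper's objective).

    anchor:    embedding of the current clip
    positive:  embedding of the AD that belongs to this clip
    negatives: embeddings of the preceding/succeeding ADs
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Negative log-probability of picking the positive candidate.
    return log_sum - logits[0]
```

The loss is small when the anchor is closer to its own AD than to the neighbouring ADs, and large otherwise.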

DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations

Proposes DualTalk, the first unified framework for multi-turn dual-speaker interactive 3D talking head generation that models both speaker and listener behaviors. It is accompanied by a dual-speaker dialogue dataset containing 50 hours of dialogue and over 1,000 identities.

EMoVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

EMoVA is proposed as the first end-to-end omni-modal LLM that achieves visual understanding, speech recognition, and emotion-controllable speech synthesis simultaneously through a semantic-acoustic decoupled speech tokenizer, outperforming GPT-4o on vision-language benchmarks and achieving a 2.9% WER in speech recognition.
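For readers unfamiliar with the WER figure above, the following minimal sketch computes word error rate as the word-level edit distance divided by the reference length. The function name and the whitespace tokenization are assumptions for this example; benchmark toolkits typically also normalize case and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a 2.9% WER corresponds to roughly three word-level errors per hundred reference words.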

Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model

This work proposes PN-Diffusion, which extracts positive and negative beat conditions from forward-played and backward-played dance videos, respectively. It designs a dual diffusion and reverse process to jointly train a U-Net, improving both the quality of the generated music and its beat consistency with the dance movements. On the AIST++ and TikTok datasets, it improves BCS by 1.80/3.85 and BHS by 4.22/5.90.

HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation

This paper proposes HOP, a heterogeneous topology-based multimodal entanglement method. By using audio as a bridge, it aligns audio-text semantics via a reprogramming module and audio-action rhythm via a spatio-temporal graph network. This achieves more natural and coherent co-speech gesture generation, reaching SOTA on FGD, BC, and diversity metrics.

Improving Sound Source Localization with Joint Slot Attention on Image and Audio

Proposes a joint slot attention mechanism to decompose both images and audio into target/non-target representations, achieving precise sound source localization through cross-modal attention matching and contrastive learning, resulting in SOTA performance of 65.16% AUC and 86.00% cIoU on Flickr-SoundNet.

ImViD: Immersive Volumetric Videos for Enhanced VR Engagement

This work constructs the first immersive volumetric video dataset by capturing 7 indoor/outdoor scenes using a mobile multi-view system with 46 synchronized GoPros. It proposes STG++, which introduces learnable affine color transformations to resolve cross-camera color inconsistency, achieving rendering at 110.47 FPS with 387MB of storage, and integrates HRTF spatial audio.

Learning to Highlight Audio by Watching Movies

A novel task of visually-guided acoustic highlighting is proposed, leveraging well-crafted audiovisual data from movies as free supervision. Through a Transformer-based multimodal framework, VisAH, poorly mixed audio is converted into visually and semantically aligned highlighted audio, significantly outperforming baseline methods across all metrics.

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

This paper proposes LiveCC, which trains video LLMs by densely interleaving ASR transcript words with video frames along the timeline. It constructs the Live-CC-5M pre-training dataset, enabling a 7B model to outperform 72B models (including Qwen2.5-VL-72B) on real-time video commentating tasks.
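The interleaving idea can be sketched as a timestamp merge: frames and ASR words are placed on one timeline so that each word follows the most recent frame. The function name and the `(timestamp, word)` input format are hypothetical; the summary above does not specify how LiveCC represents or tokenizes its sequences.

```python
def interleave_frames_and_words(frame_times, timed_words):
    """Merge frame timestamps and (start_time, word) pairs into one timeline.

    Returns a list of ("frame", t) and ("word", w) items ordered by time.
    When a frame and a word share a timestamp, the frame comes first.
    """
    events = [(t, 0, ("frame", t)) for t in frame_times]
    events += [(t, 1, ("word", w)) for t, w in timed_words]
    # Stable sort keeps the spoken order of words that share a timestamp.
    events.sort(key=lambda event: (event[0], event[1]))
    return [item for _, _, item in events]


sequence = interleave_frames_and_words(
    [0.0, 0.5, 1.0],
    [(0.1, "hello"), (0.6, "world"), (0.7, "again")],
)
```

Here `sequence` places "hello" after the frame at 0.0 s, and "world" and "again" after the frame at 0.5 s.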

Object-aware Sound Source Localization via Audio-Visual Scene Understanding

This paper proposes OA-SSL. During training, an MLLM generates fine-grained descriptions of "\(K\) sounding objects + 1 silent object" for each image as additional supervision anchors. OCA (object-aware contrastive alignment) and ORI (object region isolation) losses are then applied, enabling the model to locate only the objects that are actually sounding, even in complex scenes such as one with multiple guitars where only one is being played.

Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition

Mel-QCD is proposed to decompose Mel spectrograms into three signals: semantic vectors (quantized), energy, and standard deviation (continuous). By predicting these signals from video via a V2X predictor, and combining ControlNet with textual inversion, this approach achieves comprehensive SOTA video-to-audio generation across eight metrics on VGGSound.

Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach

This work proposes a continuous emotion estimation method that integrates three modalities: facial expressions (GRADA+Transformer), behavioral descriptions (Qwen3-VL+Mamba), and audio (WavLM). By employing two fusion strategies—Directed Cross-Modal MoE and Reliability-Aware Audio-Visual—the approach achieves a CCC of 0.6576 (dev) / 0.62 (test) on the Aff-Wild2 dataset.
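The CCC reported above is the concordance correlation coefficient, which penalizes both weak correlation and differences in mean or scale between predictions and labels. A minimal sketch of the standard formula follows; the function name is chosen for this example, and the result is undefined when both sequences are constant and equal.

```python
def concordance_cc(pred, target):
    """Concordance correlation coefficient between two equal-length sequences.

    CCC = 2 * cov(pred, target) / (var(pred) + var(target) + (mean_diff)^2)
    """
    n = len(pred)
    if n == 0 or n != len(target):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)
```

A perfect prediction scores 1, while a prediction that is perfectly correlated but offset by a constant scores below 1.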

Towards Lossless Implicit Neural Representation via Bit Plane Decomposition

Shows that the upper bound on the model capacity required by an implicit neural representation (INR) grows exponentially with bit precision (\(\mathcal{P}(f_\theta) \propto 2^n\)), and proposes bit-plane decomposition: an \(n\)-bit signal is decomposed into \(n\) independent 1-bit planes, each fitted by its own INR. This achieves lossless (BER=0) implicit neural representation of 16-bit images for the first time.
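The decomposition step itself is simple bit arithmetic, sketched below on a flat list of integer samples. This covers only splitting a signal into 1-bit planes and recombining them exactly; fitting an INR to each plane is not shown, and the function names are chosen for this example.

```python
def to_bit_planes(values, n_bits):
    """Split n_bits-wide unsigned integers into n_bits binary planes.

    Plane k holds bit k (0 = least significant) of every value.
    """
    return [[(v >> k) & 1 for v in values] for k in range(n_bits)]


def from_bit_planes(planes):
    """Recombine binary planes into the original integers."""
    n_values = len(planes[0])
    return [
        sum(planes[k][i] << k for k in range(len(planes)))
        for i in range(n_values)
    ]


pixels = [0, 1, 255, 12345, 65535]     # sample 16-bit values
planes = to_bit_planes(pixels, 16)     # 16 planes of 0/1 entries
restored = from_bit_planes(planes)     # exact reconstruction
```

Because each plane is binary, every bit predicted correctly yields an exact reconstruction, which is what a bit error rate of zero means.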

Towards Open-Vocabulary Audio-Visual Event Localization

This work formally defines the Open-Vocabulary Audio-Visual Event Localization (OV-AVEL) task, constructs the OV-AVEBench benchmark containing 24,800 videos across 67 event categories, and proposes two baselines (training-free and fine-tuning) based on ImageBind. Fine-tuning with just a single-layer temporal Transformer achieves an average performance of 57.8%.

UWAV: Uncertainty-Weighted Weakly-Supervised Audio-Visual Video Parsing

The authors propose UWAV, a weakly-supervised audio-visual video parsing framework. By pre-training a temporal-aware module on large-scale annotated data to generate high-quality pseudo-labels, and employing three techniques—uncertainty-weighted soft labels, class-balanced reweighting, and feature mixup—UWAV improves weakly-supervised training performance and achieves state-of-the-art (SOTA) results on the LLP dataset.
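Of the three techniques, feature mixup is the easiest to illustrate: two samples and their soft labels are blended with the same coefficient. The sketch below shows generic mixup with a fixed coefficient `lam`; how UWAV samples the coefficient or selects pairs is not stated in the summary, so those details are left out, and the function name is an assumption.

```python
def mixup(features_a, labels_a, features_b, labels_b, lam):
    """Blend two samples and their soft labels with weight lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed_features = [lam * a + (1 - lam) * b
                      for a, b in zip(features_a, features_b)]
    mixed_labels = [lam * a + (1 - lam) * b
                    for a, b in zip(labels_a, labels_b)]
    return mixed_features, mixed_labels


mixed_x, mixed_y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.25)
```

In practice the coefficient is commonly drawn at random for each pair (for example from a Beta distribution), which is omitted here to keep the example deterministic.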

MultiFoley: Video-Guided Foley Sound Generation with Multimodal Controls

The paper presents MultiFoley, a video-guided Foley sound generation system based on Diffusion Transformers. It supports textual semantic controls and reference audio style controls. By jointly training on video-audio and text-audio datasets, it achieves 48kHz high-quality audio generation, outperforming existing methods with a 90% win rate in human evaluations.

VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation

Proposes VinTAGe, the first audio generation model jointly conditioned on video and text. It balances visual and textual guidance via learnable layer weights and mitigates modality bias through a teacher-student framework, achieving comprehensive state-of-the-art performance on both on-screen and off-screen audio generation (FAD 3.05, MOS 3.36).