
🎵 Audio & Speech

🧪 ICML2025 · 15 paper notes


🔥 Top topics: Speech & Audio ×11 · Dialogue ×2

Aligning Spoken Dialogue Models from User Interactions: This work introduces the first comprehensive preference alignment framework designed for a full-duplex spoken dialogue model (Moshi). By automatically constructing content and temporal preference pairs from over 150k real user voice interactions and performing DPO-LN alignment exclusively on text tokens, this approach achieves an average QA improvement of 3.1% and a safety increase of 6.9%, with human evaluations confirming enhanced multi-turn dialogue quality.
BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models: This work proposes BinauralFlow, a streamable binaural speech synthesis framework based on conditional Flow Matching. A causal U-Net architecture and a continuous inference pipeline let it produce high-fidelity binaural audio in a streaming fashion. In perceptual evaluations it reaches a 42% confusion rate, close to the 50% level at which listeners could not tell synthesized audio from real recordings at all.
Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition: This paper proposes LatentVoiceMix, which performs mixup interpolation in the latent space of the speaker style encoder of the voice conversion model Diff-HierVC to generate synthetic speech data with novel voice characteristics for augmenting ASR training. This approach achieves superior WER improvements on the low-resource language Wolof compared to waveform augmentation, spectrogram augmentation, and standard voice conversion.
Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech: This paper is the first to introduce the speaker identity unlearning task for zero-shot TTS. Its Teacher-Guided Unlearning (TGU) framework injects randomness so that the model "forgets" the voiceprint features of target speakers while still synthesizing high-quality speech for other speakers. The paper also proposes the spk-ZRF metric to quantify unlearning effectiveness.
ETTA: Elucidating the Design Space of Text-to-Audio Models: ETTA systematically explores the design space of text-to-audio (TTA) models (data, architecture, training objectives, and sampling strategies) through large-scale experiments and, based on the findings, builds a TTA model that is state-of-the-art among those trained on public data.
FLAM: Frame-Wise Language-Audio Modeling: Proposes FLAM, a frame-level audio-language contrastive model that achieves precise temporal localization of open-vocabulary sound events through text-dependent logit bias correction and a million-scale synthetic sound event detection (SED) dataset, while maintaining strong performance in global retrieval and zero-shot classification.
IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling: This paper proposes the IMPACT framework, which combines iterative mask-based parallel decoding (MGM) with latent diffusion models (LDMs) for text-to-audio generation in a continuous latent space. It replaces heavy attention layers with a lightweight MLP diffusion head and introduces an unconditional pre-training stage, achieving state-of-the-art (SOTA) FD/FAD metrics on AudioCaps while maintaining an inference speed comparable to the fastest MAGNET-S model.
Long-Form Speech Generation with Spoken Language Models: Proposes SpeechSSM, the first textless spoken language model capable of learning and generating up to 16 minutes of speech in a single decoding session. It leverages the Griffin hybrid SSM architecture to achieve constant-memory decoding and infinite context, and introduces the LibriSpeech-Long evaluation benchmark along with new embedding and LLM-as-a-judge metrics.
MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners: This work proposes MuseControlLite, which introduces Rotary Position Embedding (RoPE) into decoupled cross-attention layers. This enables precise time-varying conditional control for text-to-music generation with only 85M trainable parameters (6.75x fewer than ControlNet), while pioneering unified support for both music attribute control and audio inpainting/outpainting.
NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction: Proposes the Next-Token-Pair Prediction (NTPP) paradigm, the first to model the joint distribution of dual-channel spoken dialogue in a speaker-independent manner with a decoder-only architecture. It achieves more natural turn-taking, lower inference latency, and stronger speaker independence.
OmniAudio: Generating Spatial Audio from 360-Degree Video: This work proposes OmniAudio, the first framework to generate spatial audio in First-order Ambisonics (FOA) format from 360-degree panoramic videos. With a coarse-to-fine self-supervised pre-training paradigm and a dual-branch video encoding architecture, OmniAudio achieves state-of-the-art (SOTA) performance on the self-collected Sphere360 dataset.
One Wave To Explain Them All: A Unifying Perspective On Feature Attribution: Proposes the Wavelet Attribution Method (WAM), which shifts feature attribution from the pixel domain to the wavelet domain, leveraging the spatial-scale locality of wavelet coefficients to provide unified and more structurally informative model explanations for audio, image, and volumetric data.
Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems: This work proposes Sortformer, an encoder-based speaker diarization model that resolves the permutation problem by sorting speakers by arrival time with a Sort Loss, which can replace or supplement the traditional Permutation Invariant Loss (PIL). A sinusoidal kernel function injects the speaker labels into the ASR encoder, which enables multi-speaker ASR training with a standard cross-entropy loss and yields relative error reductions of 30% and 25% on 2-mix and 3-mix LibriSpeechMix, respectively.
Sounding that Object: Interactive Object-Aware Image to Audio Generation: Proposes an interactive object-aware image-to-audio generation model. During training it learns the correspondence between image regions and sounds via multimodal dot-product attention; at test time the attention weights are replaced with SAM segmentation masks, so users can generate the sound of an object by clicking on it in the image.
Teaching Physical Awareness to LLMs through Sounds: Proposes the ACORN framework, which teaches LLMs to understand physical world phenomena from sound. It generates large-scale training data through a physics-based acoustic channel simulator, coupled with an audio encoder that captures both magnitude and phase information.
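
As a small illustration of the latent mixup idea mentioned in the LatentVoiceMix entry above, the sketch below interpolates two speaker-style embeddings with a mixing weight drawn from a Beta distribution, as in standard mixup. It is a generic sketch under stated assumptions, not the paper's implementation: the function names `latent_mixup` and `sample_mixed_style`, the `alpha` parameter, and the toy 4-dimensional vectors are all hypothetical, and no real style encoder or voice conversion model is called.

```python
import random


def latent_mixup(z_a, z_b, lam):
    """Return lam * z_a + (1 - lam) * z_b for two equal-length embeddings."""
    if len(z_a) != len(z_b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * a + (1.0 - lam) * b for a, b in zip(z_a, z_b)]


def sample_mixed_style(z_a, z_b, alpha=1.0, rng=None):
    """Draw lam ~ Beta(alpha, alpha) and mix the two embeddings.

    alpha is a hypothetical hyperparameter; alpha = 1 gives a uniform lam.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    return latent_mixup(z_a, z_b, lam), lam


if __name__ == "__main__":
    # Toy stand-ins for style-encoder outputs of two real speakers.
    speaker_a = [0.9, -0.2, 0.4, 0.0]
    speaker_b = [-0.3, 0.6, 0.1, 0.8]

    mixed, lam = sample_mixed_style(speaker_a, speaker_b, rng=random.Random(0))
    print(f"lam = {lam:.3f}")
    print("mixed style embedding:", [round(v, 3) for v in mixed])
```

In a pipeline of this kind, the mixed embedding would presumably be passed to the voice conversion decoder in place of a real speaker's style vector, giving synthetic voices that lie between the two source speakers.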