🎵 Audio & Speech

💬 ACL2025 · 46 paper notes

🔥 Top topics: Speech & Audio ×41 · Dialogue ×6 · Few-/Zero-Shot Learning ×4 · Translation ×4 · LLM ×4

Finding A Voice: Exploring the Potential of African American Dialect and Voice Generation for Chatbots

This work presents a systematic study on integrating African American English (AAE) into chatbots across text and speech modalities. It reveals that while text-based AAE hurts the user experience, speech-based chatbots paired with an African American accent are favored by AAE speakers, highlighting the crucial role of modality choice in linguistic personalization.

Acoustic Individual Identification of White-Faced Capuchin Monkeys Using Joint Multi-Species Embeddings

This paper explores utilizing cross-species acoustic pre-trained embeddings from birds and humans to identify individual calls of white-faced capuchin monkeys. It discovers that joint multi-species representations can further enhance identification performance, providing a new transfer learning paradigm for individual identification of wild animals under extreme data scarcity.

Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment

This paper introduces INTP (the Intelligibility Preference Speech Dataset) and extends DPO to multiple TTS architectures. Through preference alignment, the approach significantly improves the intelligibility of zero-shot TTS systems in challenging scenarios (e.g., tongue twisters, repeated words, code-switching, and cross-lingual settings) while demonstrating weak-to-strong generalization.
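
As a rough illustration of the preference-alignment objective mentioned above, the sketch below computes the standard DPO loss for a single preference pair. It is a generic textbook form, not the paper's multi-architecture extension; the function names, the `beta` default, and the use of whole-sequence log-probabilities are assumptions made for this example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trainable policy and
    under a frozen reference model. `beta` (assumed default) scales how
    strongly the policy is pushed away from the reference.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)
```

In an intelligibility setting, the "chosen" sample would plausibly be the more intelligible synthesis and the "rejected" one the less intelligible synthesis for the same text, so the loss falls as the policy assigns relatively more probability to the intelligible output.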

AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration

This paper proposes AI4Reading, a Chinese audiobook interpretation system based on the collaboration of 11 specialized LLM Agents. It automatically generates interpretation scripts through phases of thematic analysis, case expansion, editorial refining, colloquial rewriting, and integration/revision, and then synthesizes audio using TTS. The generated interpretation scripts outperform the professional human interpretation platform, FanDeng Reading, in terms of quality (conciseness, completeness, accuracy, and coherence).

Amplifying Trans and Nonbinary Voices: A Community-Centred Harm Taxonomy for LLMs

This paper adopts a community-centred research methodology, building a specialized harm taxonomy for LLM outputs affecting Trans and Nonbinary (TNB) individuals through deep collaboration with the TNB community, thereby uncovering unique harm categories unaddressed by existing LLM safety evaluations.

ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors

This paper shows through theoretical analysis that the root cause of cross-lingual inconsistency in Multilingual Audio-Text Retrieval (ML-ATR) is error in the training data distribution. It proposes two strategies, 1-to-K Contrastive Learning (KCL) and Audio-English Common-Anchor Contrastive Learning (CACL), to reduce this error, achieving SOTA performance in both recall and consistency.

Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

This paper proposes ADU-Bench, a comprehensive benchmark comprising 4 sub-datasets (general dialogue, professional skills, multilingualism, and ambiguity handling) totaling over 20,000 open-ended audio dialogues. It systematically evaluates 16 Large Audio-Language Models (LALMs) on their audio dialogue understanding capabilities, revealing significant deficiencies in current models regarding mathematical formula understanding, role-playing, multilingual processing, and speech ambiguity resolution.

Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models

This paper uncovers and quantitatively analyzes the Discrete Representation Inconsistency (DRI) issue in neural audio codecs, in which identical audio segments are encoded into different discrete token sequences depending on context. It proposes two constraint methods, slice consistency and perturbation consistency, which improve average consistency by 21-36% and reduce the Word Error Rate (WER) by 3.72% in VALL-E speech generation.
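
To make the inconsistency concrete, here is a minimal sketch of one way to score how well two token sequences for the same audio segment agree: the fraction of positions with identical tokens. This is an illustrative measure only; the paper's exact consistency metric is not given in this note, and the function name and the equal-length requirement are assumptions.

```python
from typing import Sequence


def token_consistency(tokens_in_context: Sequence[int],
                      tokens_in_isolation: Sequence[int]) -> float:
    """Fraction of positions where two encodings of the same segment agree.

    `tokens_in_context` holds the codec tokens for a segment when it is
    encoded as part of a longer utterance; `tokens_in_isolation` holds the
    tokens when the segment is encoded on its own. Both names are
    hypothetical and exist only for this illustration.
    """
    if len(tokens_in_context) != len(tokens_in_isolation):
        raise ValueError("token sequences must have the same length")
    if not tokens_in_context:
        raise ValueError("token sequences must not be empty")
    matches = sum(a == b for a, b in zip(tokens_in_context, tokens_in_isolation))
    return matches / len(tokens_in_context)
```

A score of 1.0 would mean the codec is fully context-independent for that segment; lower scores indicate that the surrounding audio changed the discrete tokens.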

Autoregressive Speech Synthesis without Vector Quantization

MELLE proposes an autoregressive language model for TTS based on continuous mel-spectrogram frames. By utilizing a regression loss, a variational inference sampling module, and a spectrogram flux loss, it directly predicts continuous spectrogram frames, thereby avoiding the fidelity loss and sampling robustness issues caused by vector quantization. This single-stage model achieves speech synthesis quality comparable to human levels.

Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis

This paper proposes Chain-Talker, which achieves interpretable empathetic conversational speech synthesis through a three-stage chain modeling (emotion understanding \(\rightarrow\) semantic understanding \(\rightarrow\) empathetic rendering), and develops CSS-EmCap, an automatic annotation pipeline to generate emotional captions for conversational speech.

ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5

This paper introduces ChildMandarin, a Mandarin speech dataset for young children aged 3-5, containing 397 speakers, 41.25 hours of speech, and covering 22 provincial-level administrative regions in China, along with comprehensive baseline evaluations on ASR and speaker verification tasks.

CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages

This paper proposes CLaMP 3, a unified framework that aligns sheet music, performance signals, and audio recordings with multilingual text in a shared representation space via contrastive learning. This enables cross-modal retrieval without paired training data and generalizes well to unseen languages.

Contextual Biasing with the Knowledgeable External Language Model for End-to-End Speech Recognition

This paper proposes utilizing a Knowledgeable External Language Model (KELM) for contextual biasing. By dynamically fusing external domain knowledge and a bias phrase list during end-to-end speech recognition, it significantly improves the recognition accuracy of rare words and proper nouns.

ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control

ControlSpeech is the first TTS system to achieve simultaneous and independent zero-shot speaker cloning and zero-shot language style control, addressing the many-to-many style control challenge through decoupled representations in discrete codec spaces and a Style-Mixture Semantic Density (SMSD) module.

Dialectal Coverage and Generalization in Arabic Speech Recognition

This study systematically investigates the impact of Arabic dialect coverage on ASR performance. By utilizing multi-dialectal pre-training and joint fine-tuning, the ArTST model is extended to cover speech variants from 17 Arabic countries, while multilingual optimization strategies in code-switching scenarios are additionally explored.

Different Speech Translation Models Encode and Translate Speaker Gender Differently

Through an attention-based probing analysis, this study investigates how speech translation models of different architectures encode speaker gender. It finds that traditional encoder-decoder models preserve gender information well, whereas adapters in modern speech+MT architectures significantly erase gender information, leading to more severe masculine default bias in translation.

Distilling an End-to-End Voice Assistant Without Instruction Training Data

This work proposes DiVA (Distilled Voice Assistant), which performs cross-modal distillation by utilizing the text LLM's responses to transcriptions as self-supervised signals. This approach trains an end-to-end speech LLM without any speech instruction training data. With only 3.5k hours of ASR data, the model generalizes to spoken QA, classification, and translation tasks, outperforming Qwen 2 Audio (which uses over 100x more training compute) with a 72% win rate in user preference tests.

DNCASR: End-to-End Training for Speaker-Attributed ASR

This paper proposes DNCASR, an end-to-end trainable speaker-attributed ASR system that links a neural clustering decoder with an ASR decoder. Trained jointly to generate speaker-attributed transcripts, it achieves a 9.0% relative reduction in cpWER on the AMI meeting dataset.

Does Your Voice Assistant Remember? Analyzing Conversational Context Recall and Utilization in Voice Interaction Models

This paper systematically evaluates how well open-source voice interaction models recall conversation history and introduces the ContextDialog benchmark. It finds that these models are far weaker than text models at recalling past spoken information, a gap that RAG methods also struggle to close.

Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion

This paper proposes DE-detect, an audio-only multi-view late fusion pipeline. By combining the textual features of automatically transcribed lyrics with lyric-related acoustic features extracted by a speech model, it achieves robust detection of AI-generated lyrics, outperforming single-modality methods in both in-domain and out-of-domain scenarios.

Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation

This paper proposes Eta-WavLM, which decomposes WavLM self-supervised speech representations into speaker-dependent and speaker-independent components using a simple linear equation. It generates high-quality speaker-disentangled representations without complex training, comprehensively outperforming existing methods in voice conversion tasks.
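
One plausible reading of the "simple linear equation" is sketched below. The symbols and the exact form are assumptions introduced for illustration; they are not taken from this note and should be checked against the paper.

```latex
\[
  \mathbf{s} \;=\; \mathbf{A}^{\top}\mathbf{d} \;+\; \mathbf{b} \;+\; \boldsymbol{\eta}
\]
```

Here \(\mathbf{s}\) would be a WavLM frame representation, \(\mathbf{d}\) a speaker embedding for the utterance, \(\mathbf{A}\) and \(\mathbf{b}\) a linear map and bias fitted in closed form (for example by least squares), and \(\boldsymbol{\eta}\) the residual that serves as the speaker-independent component. Under this reading, no additional network training is needed beyond solving the linear system.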

GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages

GigaSpeech 2 constructs a large-scale ASR corpus of approximately 30,000 hours for low-resource languages (Thai, Indonesian, and Vietnamese). Through an automated crawling-transcription-refinement pipeline, high-quality pseudo-labels are generated from unlabeled YouTube videos. The trained model reduces WER by 25%-40% compared to Whisper large-v3 while utilizing only 10% of its parameters.

Improving Language and Modality Transfer in Translation by Character-level Modeling

This paper proposes charSONAR, a cross-lingual and cross-modal translation method built on a character-level encoder. The character-level text encoder is obtained via teacher-student training and is then connected to MMS, a CTC ASR model covering 1,000+ languages, using lightweight adapters. It achieves SOTA on text translation in 75 languages and speech translation in 33 languages, with particularly strong performance in zero-resource and low-resource scenarios.

Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models

This work finds that current Omnimodal Large Language Models (OLLMs) perform significantly worse on vision-audio tasks than on vision-text tasks, and attributes the gap primarily to the lack of direct alignment between the vision and audio modalities. It proposes Self-KD (Self-Knowledge Distillation), which enhances vision-audio capabilities by using the OLLM's own vision-text components as a teacher.

It's Not a Walk in the Park! Challenges of Idiom Translation in Speech-to-text Systems

This paper presents the first systematic comparison of the performance of speech-to-text translation (SLT), text machine translation (MT), and large language models (LLMs) on idiom translation tasks. It reveals that the performance of SLT systems deteriorates significantly when handling idioms, tending towards literal translation even in the higher encoder layers, whereas MT and LLMs demonstrate clearly superior capabilities in idiom processing.

Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation

This paper proposes the concept of "unit language," which constructs text-like representations of discrete speech units via n-gram language modeling. It utilizes multi-task learning to guide the training of textless speech-to-speech translation (S2ST) models, while presenting task prompt modeling to alleviate conflicts when utilizing both source-side and target-side unit languages, achieving significant improvements on the VoxPopuli tetralingual dataset.

Mind the Gap! Static and Interactive Evaluations of Large Audio Models

By collecting 7,500 interaction evaluation data points from 484 participants, this paper systematically compares the static benchmarks and interactive evaluation performance of Large Audio Models (LAMs) for the first time. It reveals a significant gap between the two (\(R^2=0.30\)) and uncovers the real-world usage scenarios and user preferences of LAMs.

Mitigating Confounding in Speech-Based Dementia Detection through Weight Masking

To address the gender confounding bias in speech transcript-based dementia detection, this paper proposes Extended Confounding Filter (ECF) and Dual Filter (DF), two weight masking methods that require no additional training modules. By tracking weight updates during fine-tuning, the methods locate and zero out gender-associated parameters, significantly reducing gender gaps in false positive rates and statistical parity while maintaining robust dementia detection performance across various distribution shifts.

MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens

This work proposes MMS-LLaMA, which compresses multimodal speech tokens to only 3.5 per second through three modules: early audio-visual fusion, an AV Q-Former with dynamic query allocation, and a speech rate predictor. It achieves SOTA performance on LRS3 with a 0.72% WER while reducing token usage by 86% and FLOPs by 35.7%.

MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder

MultiMed is introduced as the first multilingual medical ASR dataset (150 hours, 5 languages, 10 recording conditions, 16 accents), along with end-to-end Whisper model baselines spanning small to large scales. This work presents the first systematic study of multilingual medical ASR, comparing monolingual vs. multilingual fine-tuning and AED vs. Hybrid architectures. Key findings reveal that multilingual joint training benefits small models but can lead to performance degradation in larger models.

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

This paper proposes OmniFlatten, an end-to-end full-duplex voice conversation model based on Qwen2-0.5B. By employing a three-stage progressive post-training scheme (modality alignment \(\rightarrow\) half-duplex \(\rightarrow\) full-duplex conversation learning) and a unified flatten operation, it achieves low-latency natural full-duplex voice interaction without modifying the GPT architecture, reducing the turn-taking response time to only \(193\text{ ms}\), well below Moshi's \(553\text{ ms}\).

On the Robust Approximation of ASR Metrics

This paper proposes a label-free approximation method for ASR performance metrics. By utilizing speech-text similarity in a unified multimodal embedding space and proxy metrics from high-quality proxy models, an ensemble regression model is trained to predict WER/CER. The absolute error is maintained within single digits across over 40 models and 14 datasets, outperforming the latest baseline by more than 50%.
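
For context, the reference-based WER that such a method approximates is the word-level edit distance between a hypothesis and its reference transcript, divided by the number of reference words. The sketch below is a plain implementation of that standard definition, not the paper's label-free estimator; the function name and whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Words are obtained by whitespace splitting (an assumption; real
    evaluations usually normalize case and punctuation first).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because this computation needs a ground-truth transcript for every utterance, it cannot be applied to unlabeled audio, which is the situation a label-free approximation targets.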

Predicting Turn-Taking and Backchannel in Human-Machine Conversations Using Linguistic, Acoustic, and Visual Signals

This paper proposes the first end-to-end framework that integrates linguistic, acoustic, and visual signals to predict turn-taking and backchannel behaviors in conversations. It also introduces MM-F2F, a face-to-face conversational dataset of over 210 hours, and reports improvements of 10% in turn-taking F1 and 33% in backchannel F1.

Soundwave: Less is More for Speech-Text Alignment in LLMs

The Soundwave model is proposed to address the representation space gap and sequence length inconsistency between speech and text using efficient training strategies and a novel architecture. With only one-fiftieth of the training data, it outperforms Qwen2-Audio on speech translation and AIR-Bench speech tasks.

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

This paper proposes Spark-TTS, an efficient TTS system based on a novel single-stream speech codec, BiCodec, and the Qwen2.5 LLM. By decoupling speech into low-bitrate semantic tokens and fixed-length global tokens, Spark-TTS achieves zero-shot voice cloning and coarse-to-fine attribute control, reaching SOTA intelligibility on Seed-TTS-eval.

Sparsify: Learning Sparsity for Effective and Efficient Music Performance Question Answering

Sparsify proposes a three-level sparsification strategy (sparse masking + adaptive sparse merging + key-subset selection) for Music Audio-Visual Question Answering (Music AVQA). It achieves SOTA on both the MUSIC-AVQA and MUSIC-AVQA v2.0 benchmarks (81.75%/81.30%), reduces training time by 28.32%, and retains 74% of full-data performance using only 25% of the data.

SpeechIQ: Speech-Agentic Intelligence Quotient Across Cognitive Levels in Voice Understanding by Large Language Models

This paper proposes SpeechIQ, a hierarchical speech understanding evaluation framework based on Bloom's Taxonomy. It assesses speech LLMs at three levels: Remember (WER), Understand (semantic similarity), and Apply (QA accuracy). The study finds that cascaded ASR+LLM systems outperform end-to-end multimodal models at equivalent scales.

SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models

SpeechWeave proposes an end-to-end synthetic speech data generation pipeline that enhances text diversity through keyphrase sampling, performs text normalization at the generation stage (achieving \(97\%\) accuracy), and uses cross-lingual voice cloning to standardize speakers. The generated data is \(10\)-\(48\%\) more diverse than direct LLM prompting and significantly improves downstream TTS model performance.

T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback

This paper proposes three fine-grained AI audio scoring pipelines (event occurrence, event sequence, and acoustic harmony quality) that replace human annotation in constructing a large-scale audio preference dataset, T2A-Feedback (41K prompts, 249K audios). Preference tuning on this dataset strengthens the basic capabilities of TTA models, significantly improving multi-event audio generation quality in both simple (AudioCaps) and complex (T2A-EpicBench) scenarios.

In-the-wild Audio Spatialization with Flexible Text-guided Localization

This paper proposes the TAS (Text-guided Audio Spatialization) framework, which utilizes flexible text prompts (e.g., 3D spatial location descriptions or relative positions between sound sources) to guide a latent diffusion model in converting monaural audio into binaural audio. A SpatialTAS dataset containing 376K samples is constructed. This method outperforms existing approaches on both simulated and real-recorded data, and a spatial semantic consistency evaluation model is developed based on Llama-3.1-8B.

TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis

TCSinger 2 is proposed as a multi-task multilingual zero-shot singing voice synthesis model. Through a blurred boundary encoder, a contrastive learning audio encoder, and a Flow-based customized Transformer (incorporating Cus-MOE), it achieves style transfer and multi-level style control based on singing, speech, or text prompts.

Towards Reliable Large Audio Language Model

This paper presents the first systematic study on the reliability of Large Audio Language Models (LALMs), proposing training-free methods (IDK/MCoT/Task Agent) and a training-based method (LoRA SFT on model-specific IDK datasets). It also designs the Reliability Gain Index (RGI) metric to evaluate improvements in reliability, revealing that "knowing when to say I don't know" is a cross-modal transferable meta-capability.

UniCodec: Unified Audio Codec with Single Domain-Adaptive Codebook

UniCodec proposes a unified audio codec using a single domain-adaptive codebook. Through partitioned domain codebooks and a domain Mixture-of-Experts (MoE) strategy, it achieves outstanding reconstruction and semantic representation performance across three domains: speech, music, and sound.

WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models

This paper proposes WavRAG, the first end-to-end, natively audio-compatible retrieval-augmented generation framework. It achieves unified retrieval over mixed audio-text knowledge bases via WavRetriever and enhances the contextual capabilities of spoken dialogue models using Chain-of-Thought (CoT) reasoning, yielding an approximately 10\(\times\) speedup while maintaining performance comparable to state-of-the-art (SOTA) text RAG.

Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Audio-Language Models

This work proposes the Chat-Audio Attacks (CAA) benchmark, comprising four categories of universal adversarial audio attacks (content, emotional, explicit noise, and implicit noise attacks). Through three evaluation methodologies, it systematically assesses the robustness of six SOTA Large Audio-Language Models (LALMs), finding that GPT-4o performs the best but all models exhibit significant vulnerabilities.

Zero-Shot Text-to-Speech for Vietnamese

To address the lack of high-quality long-audio datasets for Vietnamese zero-shot TTS, the authors construct the 941-hour PhoAudiobook dataset. Systematic experiments on three SOTA zero-shot TTS models (VALL-E, VoiceCraft, and XTTS-v2) show that PhoAudiobook significantly improves model performance. Specifically, XTTS-v2 outperforms the baseline viXTTS across the board on long-sentence synthesis, while VALL-E and VoiceCraft are more robust on short-sentence synthesis.