
🎵 Audio & Speech

💬 ACL2026 · 30 paper notes

Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations

This paper proposes Affectron, a framework that achieves diverse and emotionally aligned nonverbal vocalization (NV) synthesis—such as laughter and sighs—on small-scale open-source disentangled corpora, via two training-time augmentation strategies: emotion-driven Top-K NV matching and emotion-aware Top-K routing. The proposed method substantially outperforms the purely language-pretrained VoiceCraft baseline.
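
Since the strategies are only named, the sketch below shows one plausible reading of emotion-driven Top-K NV matching: retrieve the K nonverbal-vocalization clips whose emotion embeddings best match the target utterance's. The embedding source, similarity measure, and K are assumptions, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def topk_nv_match(target_emotion_emb: torch.Tensor,
                  nv_emotion_embs: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return indices of the K nonverbal-vocalization clips whose emotion
    embeddings are closest (by cosine similarity) to the target utterance's.
    Illustrative only; the paper's matching features and routing differ in detail.

    target_emotion_emb: (dim,)   nv_emotion_embs: (num_clips, dim)
    """
    sims = F.cosine_similarity(nv_emotion_embs, target_emotion_emb.unsqueeze(0), dim=-1)
    return sims.topk(k).indices
```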

Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs

Alexandria constructs a parallel English-Dialectal Arabic multi-turn dialogue dataset covering 13 Arabic countries, 11 social impact domains, and 107K turns. Through a community-driven manual translation and revision process, it provides unprecedented fine-grained training and evaluation resources for dialectal Arabic machine translation, accompanied by a systematic benchmark assessment across 24 LLMs.

An Exploration of Mamba for Speech Self-Supervised Models

This work presents the first comprehensive exploration of the Mamba architecture as a backbone for speech self-supervised learning (SSL), demonstrating that Mamba-based HuBERT outperforms Transformers in long-context ASR, streaming ASR, and causal probing tasks while maintaining linear time complexity.

Anchored Cyclic Generation: A Novel Paradigm for Long-Sequence Symbolic Music Generation

This paper proposes the Anchored Cyclic Generation (ACG) paradigm, which calibrates the generation direction by using confirmed musical content as anchors during autoregressive decoding, effectively mitigating error accumulation in long-sequence symbolic music generation. A hierarchical framework, Hi-ACG, is further constructed to realize global-to-local music generation.

Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

This paper identifies that the perceptual weakness of current AudioLLMs stems from an ASR-centric training paradigm that systematically suppresses paralinguistic and non-linguistic information. It proposes the Unified Audio Schema (UAS), which structures audio information into a three-dimensional JSON format covering transcription, paralinguistics, and non-linguistic events. The approach achieves a 10.9% improvement in perceptual accuracy on the MMSU benchmark while preserving reasoning capabilities.
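
A minimal sketch of what one record in such a schema could look like: the three top-level dimensions follow the summary, while every field name and value inside them is an illustrative assumption rather than the paper's actual schema.

```python
import json

# Hypothetical perception-aware annotation in the spirit of the Unified Audio Schema.
uas_record = {
    "transcription": "I can't believe we actually won.",
    "paralinguistics": {          # speaker-level cues beyond the words
        "emotion": "joyful",
        "pitch": "rising",
        "speaking_rate": "fast",
    },
    "non_linguistic_events": [    # sounds that are not speech
        {"event": "laughter", "start_s": 2.1, "end_s": 3.0},
        {"event": "applause", "start_s": 3.2, "end_s": 5.5},
    ],
}

print(json.dumps(uas_record, indent=2))
```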

Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models

This paper proposes AHD (Anchor-based History-stable Decoding), a training-free plug-and-play dynamic decoding strategy that identifies cross-block stable tokens in diffusion LLMs by tracing historical trajectories via dynamic anchors, enabling early unlocking. On BBH, AHD reduces decoding steps by 80% while improving performance by 3.67%.
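
A minimal sketch of the history-stability idea, under the assumption that "stable" means a position's argmax prediction has not changed over the last few denoising steps; the paper's anchor selection and unlocking rule are more involved than this.

```python
import torch

def stable_token_mask(history_logits: torch.Tensor, patience: int = 3) -> torch.Tensor:
    """Mark positions whose argmax prediction has stayed unchanged for `patience`
    consecutive denoising steps. Positions flagged True could be unlocked
    (finalized) ahead of their block, which is where the step savings come from.

    history_logits: (steps, seq_len, vocab) logits recorded over past steps.
    Returns a boolean mask of shape (seq_len,).
    """
    preds = history_logits.argmax(dim=-1)        # (steps, seq_len)
    recent = preds[-patience:]                    # last `patience` steps
    if recent.shape[0] < patience:
        return torch.zeros(preds.shape[1], dtype=torch.bool)
    return (recent == recent[-1]).all(dim=0)      # unchanged over the window
```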

Computational Narrative Understanding for Expressive Text-to-Speech

This paper extracts character direct speech from fiction audiobooks to construct a large-scale expressive speech dataset, LibriQuote (5.3K hours of quotations + 12.7K hours of narration), annotating speaking style with speech verb and adverb pseudo-labels derived from narrative context. Experiments demonstrate that fine-tuning a flow-matching model simultaneously improves expressiveness and intelligibility, and that LibriQuote-test constitutes a challenging benchmark for expressive TTS.

DIA-HARM: Dialectal Disparities in Harmful Content Detection Across 50 English Dialects

This paper introduces DIA-HARM, the first benchmark for evaluating the robustness of misinformation detectors across 50 English dialects. It reveals that human-authored dialectal content causes detection performance drops of 1.4–3.6% F1, that fine-tuned Transformers substantially outperform zero-shot LLMs (96.6% vs. 78.3%), and that some models suffer catastrophic degradation exceeding 33% on mixed-content inputs.

Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models

This paper employs layer-wise oracle intervention experiments to reveal a structured redundancy hierarchy in the speech token representations of large speech language models (LSLMs)—whereby shallow layers encode essential acoustic details while deep layers are highly redundant—and proposes Affinity Pooling, a training-free similarity-based token merging mechanism that reduces FLOPs by 27.48% while maintaining competitive accuracy.
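
The merging operation can be illustrated in a few lines of tensor code. The sketch below greedily averages adjacent deep-layer token representations whose cosine similarity exceeds a threshold, which is one simple reading of similarity-based merging rather than the exact Affinity Pooling operator.

```python
import torch
import torch.nn.functional as F

def affinity_pool(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Merge adjacent speech-token representations whose cosine similarity
    exceeds `threshold`. A simple pairwise running average is used for brevity;
    the paper's operator may group and weight tokens differently.

    tokens: (seq_len, dim) hidden states from a redundant (deep) layer.
    """
    merged = [tokens[0]]
    for t in tokens[1:]:
        sim = F.cosine_similarity(merged[-1], t, dim=0)
        if sim > threshold:
            merged[-1] = (merged[-1] + t) / 2    # fold into the previous group
        else:
            merged.append(t)
    return torch.stack(merged)                    # (reduced_len, dim)
```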

SEPT: Semantically Expanded Prompt Tuning for Audio-Language Models

SEPT leverages LLMs to generate semantic neighbors for each category and introduces a margin-constrained semantic expansion loss to regularize the prompt embedding space, substantially alleviating the Base-New Tradeoff (BNT) in prompt tuning for audio-language models (ALMs). It also establishes the first systematic evaluation benchmark for prompt generalization in ALMs.
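
The margin-constrained loss is only named, not defined, in this note, so the sketch below is a guess at one simple instantiation: a hinge on cosine distance that only penalizes a class embedding for drifting more than a margin away from its LLM-generated semantic neighbors.

```python
import torch
import torch.nn.functional as F

def semantic_expansion_loss(class_emb: torch.Tensor,
                            neighbor_embs: torch.Tensor,
                            margin: float = 0.2) -> torch.Tensor:
    """Hedged sketch of a margin-constrained regularizer: pull the tuned-prompt
    class embedding toward its semantic neighbors, penalizing only when the
    cosine distance exceeds `margin`. Not the paper's exact formulation.

    class_emb: (dim,)   neighbor_embs: (k, dim)
    """
    sims = F.cosine_similarity(neighbor_embs, class_emb.unsqueeze(0), dim=-1)
    return F.relu((1.0 - sims) - margin).mean()   # hinge on cosine distance
```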

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

This paper presents HalluAudio, the first large-scale cross-domain (speech/environmental sound/music) benchmark for hallucination detection in large audio-language models (LALMs), comprising 5,000+ human-verified QA pairs and a systematic adversarial prompt design. It evaluates mainstream LALMs across multiple dimensions (accuracy, hallucination rate, Yes-No bias, rejection rate, and error type), revealing significant deficiencies in acoustic grounding, temporal reasoning, and music attribute understanding.

Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages

This paper presents a phoneme-level ASR analysis of two extremely phonologically complex, low-resource endangered East Caucasian languages (Archi and Rutul), finding that phoneme recognition accuracy follows an S-shaped learning curve with respect to training frequency, and that many errors attributed to phonological complexity are in fact primarily caused by data scarcity.

How Hypocritical Is Your LLM Judge? Listener–Speaker Asymmetries in the Pragmatic Competence of Large Language Models

This paper systematically compares 14 LLMs as pragmatic listeners (judging pragmatic appropriateness) and pragmatic speakers (generating pragmatically appropriate language) across three tasks—false presuppositions, antipresuppositions, and deductive reasoning—revealing pervasive listener–speaker asymmetries: most models perform substantially better as judges than as generators, and item-level analysis shows that correct judgments do not reliably predict successful generation.

Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering

This paper introduces Jamendo-MT-QA, a multi-track comparative music question answering benchmark comprising 36,519 QA pairs across 12,173 track pairs. It is the first systematic evaluation of audio-language models on cross-track comparative reasoning, revealing significant deficiencies in sentence-level comparative generation among existing models.

Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective

This paper proposes CmIR (Causal modality Invariant Representation learning), which leverages causal inference theory to explicitly disentangle each modality into a causal invariant representation and an environment-specific spurious representation. Through an elegant objective combining invariance constraints, mutual information constraints, and reconstruction constraints, the framework ensures that invariant representations maintain stable predictive relationships across environments. CmIR achieves state-of-the-art performance on multimodal sentiment, humor, and sarcasm detection, with particularly strong results under OOD and noisy conditions.
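
As a concrete stand-in for "stable predictive relationships across environments", the sketch below uses a V-REx-style variance-of-risks penalty and combines it with the other named terms; the specific penalty form, weights, and term definitions are assumptions rather than CmIR's reported objective.

```python
import torch

def invariance_penalty(env_losses: torch.Tensor) -> torch.Tensor:
    """Variance of per-environment risks: small when the invariant representation
    predicts equally well in every environment (a V-REx-style stand-in, not
    necessarily CmIR's exact constraint). env_losses: (num_envs,)"""
    return ((env_losses - env_losses.mean()) ** 2).mean()

def cmir_style_objective(task_loss, env_losses, mi_loss, recon_loss,
                         lambdas=(1.0, 1.0, 1.0)):
    """Combine the task loss with the three constraints named in the summary;
    the weighting scheme here is an illustrative assumption."""
    a, b, c = lambdas
    return task_loss + a * invariance_penalty(env_losses) + b * mi_loss + c * recon_loss
```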

MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus

This paper introduces MCGA, the first large-scale (119 hours, 22,000 samples) fully copyright-cleared audio corpus for classical Chinese literature, spanning five major literary genres (Fu, Shi, Wen, Ci, Qu) and six speech tasks (ASR/S2TT/SEC/SQA/SU/SR). An evaluation of 10 multimodal large models reveals substantial deficiencies in current models' ability to understand classical literary speech.

Multimodal In-Context Learning for ASR of Low-Resource Languages

This paper systematically investigates whether multimodal in-context learning (MICL) enables speech LLMs to handle unseen endangered languages, and proposes a MICL-based hypothesis selection system that combines the complementary strengths of acoustic models and speech LLMs, achieving substantial ASR improvements across three endangered languages.

Music Audio-Visual Question Answering Requires Specialized Multimodal Designs

As the first comprehensive survey of the Music Audio-Visual Question Answering (Music AVQA) field, this paper systematically analyzes dataset evolution and method design, demonstrating that specialized input processing, spatiotemporal architectural design, and music domain knowledge are essential for this task, and that general-purpose multimodal models are insufficient to address the unique challenges of music performance understanding.

MSU-Bench: Musical Score Understanding Benchmark

MSU-Bench is the first human-annotated benchmark for complete musical score understanding, comprising 1,800 generative QA pairs from 150 pieces across four difficulty levels. Evaluation reveals that LLMs/VLMs suffer from severe localization errors and hallucination, while text-based ABC notation input substantially mitigates these issues.

Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition

This paper proposes Pseudo2Real, a parameter-space correction method that computes a "correction vector" as the weight difference between a real-label model and a pseudo-label model trained on a source domain, then applies this vector to a pseudo-label fine-tuned model on the target domain to rectify systematic pseudo-label bias. The method achieves up to 35% relative WER reduction across ten African accents in AfriSpeech-200.
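
Because the note spells the operation out, the correction reduces to state_dict arithmetic. The sketch assumes all four checkpoints share an architecture; the scaling factor `alpha` is an added illustrative knob, not necessarily part of the paper's recipe.

```python
import torch

def apply_correction_vector(theta_target_pseudo: dict,
                            theta_source_real: dict,
                            theta_source_pseudo: dict,
                            alpha: float = 1.0) -> dict:
    """Parameter-space correction in the task-arithmetic spirit described above:
    the correction vector is the weight difference between a real-label and a
    pseudo-label model on the source domain, added to the target-domain
    pseudo-label model. All arguments are state_dicts with identical keys."""
    return {
        k: theta_target_pseudo[k] + alpha * (theta_source_real[k] - theta_source_pseudo[k])
        for k in theta_target_pseudo
    }
```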

Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification

This paper proposes the R2ScP framework, which shifts the missing-modality paradigm in AVQA from conventional generative completion to retrieval-based recovery. By combining cross-modal retrieval with a context-aware adaptive purification mechanism to eliminate retrieval noise, R2ScP achieves substantial performance gains in modality-incomplete settings.

Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction

This paper proposes an instruction-referencing-based defense against prompt injection attacks. Rather than suppressing the LLM's instruction-following capability, the method instructs the model to reference the executed instruction within its response, and then removes responses unrelated to the original instruction via label filtering, reducing the attack success rate to near 0% in several settings.
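
A rough sketch of the reference-then-filter step, with an assumed "Executed instruction:" output convention and a string-similarity threshold standing in for whatever matching criterion the paper actually uses.

```python
from difflib import SequenceMatcher

def keep_response(original_instruction: str, model_output: str,
                  threshold: float = 0.6) -> bool:
    """Keep a response only if the instruction the model says it executed is
    close enough to the user's original instruction. The marker string and the
    similarity measure are illustrative assumptions."""
    marker = "Executed instruction:"
    if marker not in model_output:
        return False                                  # model referenced nothing
    tail = model_output.split(marker, 1)[1].strip()
    if not tail:
        return False
    referenced = tail.splitlines()[0].strip()
    similarity = SequenceMatcher(None, original_instruction.lower(),
                                 referenced.lower()).ratio()
    return similarity >= threshold
```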

SpeakerSleuth: Can Large Audio-Language Models Judge Speaker Consistency across Multi-turn Dialogues?

SpeakerSleuth introduces the first benchmark (1,818 instances) for evaluating LALMs' ability to judge speaker consistency in multi-turn dialogues. A systematic evaluation of 12 LALMs and 6 embedding methods reveals that models struggle to detect and localize acoustic inconsistencies and exhibit a severe text-over-acoustics modality bias, yet perform relatively well on comparison and ranking tasks involving acoustic variants.

Splits! Flexible Sociocultural Linguistic Investigation at Scale

This paper proposes a methodology for constructing a sociolinguistic "sandbox," building Splits!—a 9.7 million post dataset from Reddit partitioned along two axes (demographic group × discussion topic) across 6 groups and 89 topics—and designing a two-stage filtering pipeline based on lift and triviality to efficiently identify non-trivial, research-worthy sociocultural linguistic phenomena from 23,000 LLM-generated candidate hypotheses.
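
Lift here presumably follows its standard definition, how much more frequent a term is in one group-by-topic partition than in the corpus overall; the companion triviality filter is not reproduced. A minimal version:

```python
from collections import Counter

def lift(term: str, partition_tokens: list[str], corpus_tokens: list[str]) -> float:
    """Standard lift ratio: P(term | partition) / P(term | corpus). Shown as a
    generic illustration of the first-stage filter; the paper's exact statistic
    may differ."""
    p_partition = Counter(partition_tokens)[term] / max(len(partition_tokens), 1)
    p_corpus = Counter(corpus_tokens)[term] / max(len(corpus_tokens), 1)
    return p_partition / p_corpus if p_corpus > 0 else float("inf")
```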

Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions

To address the inability of voice assistants to distinguish third-party interruptions (TPI) from primary-user speech, this work proposes TPI-Train, a dataset of 88K training instances, along with the TPI-Bench evaluation framework. A speaker-aware hard negative mining strategy is introduced to eliminate semantic shortcut learning, enabling models to rely genuinely on acoustic cues for interruption detection.
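
One plausible reading of speaker-aware hard negative mining is pairing each example with another that shares its transcript but not its speaker, so transcript-only shortcuts cannot separate primary-user speech from interruptions; the field names and pairing rule below are hypothetical.

```python
def mine_speaker_aware_negatives(samples: list[dict]) -> list[tuple[dict, dict]]:
    """For each sample {'text', 'speaker', 'audio'}, pick a negative with the
    same text but a different speaker. Illustrative assumption about how the
    hard negatives are formed, not the paper's actual mining strategy."""
    by_text: dict[str, list[dict]] = {}
    for s in samples:
        by_text.setdefault(s["text"], []).append(s)
    pairs = []
    for s in samples:
        candidates = [c for c in by_text[s["text"]] if c["speaker"] != s["speaker"]]
        if candidates:
            pairs.append((s, candidates[0]))
    return pairs
```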

StressTest: Can YOUR Speech LM Handle the Stress?

This paper proposes StressTest, a benchmark for evaluating the ability of speech language models (SLMs) to understand the meaning conveyed by sentence stress. Evaluations reveal that existing models are nearly incapable of inferring speaker intent from stress patterns. A synthetic data pipeline, Stress-17k, is introduced, and the resulting fine-tuned model, StresSLM, substantially outperforms state-of-the-art models on both stress detection and stress reasoning tasks.

TellWhisper: Tell Whisper Who Speaks When

This paper proposes TellWhisper, which jointly encodes speaker identity and temporal information into the speech encoder's self-attention via a time-speaker-aware rotary position encoding (TS-RoPE), coupled with a hyperbolic speaker diarization model (Hyper-SD), to achieve joint modeling of "who speaks what when" and attain state-of-the-art performance on multi-speaker ASR tasks.

Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models

This paper proposes TCD, a training-free inference-time decoding method that contrasts logits from the original audio path against a temporally blurred slow path, combined with stability-guided blur window selection and uncertainty-based gating, to help unified audio-language models better exploit transient acoustic cues. Consistent improvements are demonstrated on MMAU and AIR-Bench.
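
The contrast step resembles standard contrastive decoding applied across the two audio paths; the combination below is that standard form, with the paper's window selection and gating omitted and `alpha` treated as an assumed hyperparameter.

```python
import torch

def temporal_contrastive_logits(logits_orig: torch.Tensor,
                                logits_blur: torch.Tensor,
                                alpha: float = 0.5) -> torch.Tensor:
    """Contrast original-audio logits against logits from a temporally blurred
    copy, amplifying whatever the blur destroys (transient cues). A generic
    contrastive-decoding combination, not necessarily TCD's exact rule."""
    return (1 + alpha) * logits_orig - alpha * logits_blur
```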

Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training

This paper proposes the FCaps large-scale dataset (47k hours of speech, 19M fine-grained annotations) and the CLSP contrastive learning model. Through an end-to-end annotation pipeline and fine-grained multi-granular contrastive supervision, it presents the first speech-text alignment model capable of uniformly representing both global and fine-grained speaking styles.
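
One way to realize multi-granular contrastive supervision is to apply a symmetric InfoNCE loss both to whole utterances and to finer-grained spans; the sketch shows the per-granularity loss, with the actual granularities and weighting left as assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(speech_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired speech/text embeddings. Applying
    it at the utterance level and again at finer spans is one plausible reading
    of 'fine-grained multi-granular contrastive supervision'.

    speech_emb, text_emb: (batch, dim), assumed L2-normalized.
    """
    logits = speech_emb @ text_emb.T / temperature
    targets = torch.arange(logits.shape[0], device=speech_emb.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```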

When Misinformation Speaks and Converses: Rethinking Fact-Checking in Audio Platforms

This position paper argues that misinformation on audio platforms is fundamentally distinct from textual misinformation in two dimensions: it is simultaneously spoken (conveying persuasion through prosody, pacing, and emotion) and conversational (unfolding across multiple turns, speakers, and episodes). Existing text-centric fact-checking pipelines cannot adequately handle these properties, and verification frameworks must be redesigned around the intrinsic characteristics of audio.