
🎵 Audio & Speech

💬 ACL2026 · 72 paper notes

📌 Same area in other venues: 📷 CVPR2026 (22) · 🔬 ICLR2026 (80) · 🧪 ICML2026 (36) · 🤖 AAAI2026 (31) · 🧠 NeurIPS2025 (47) · 📹 ICCV2025 (11)

🔥 Top topics: Speech & Audio ×53 · Dialogue ×9 · LLM ×5 · Multimodal/VLM ×5 · Alignment/RLHF ×4

Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations

This paper proposes the Affectron framework, which implements two train-time augmentation strategies—Emotion-Driven Top-K NV Matching and Emotion-Aware Top-K Routing—on small-scale open-source decoupled corpora. It achieves diverse and emotionally aligned synthesis of nonverbal vocalizations (NVs, e.g., laughter, sighs), significantly surpassing the VoiceCraft baseline based on pure linguistic pre-training.

An Exploration of Mamba for Speech Self-Supervised Models

This work presents the first comprehensive exploration of Mamba as a foundation model for speech self-supervised learning (SSL). It finds that Mamba-based HuBERT outperforms Transformers in long-context ASR, streaming ASR, and causal probing tasks while maintaining linear time complexity.

Analyzing Reasoning Shifts in Audio Deepfake Detection under Adversarial Attacks: The Reasoning Tax versus Shield Bifurcation

This paper designs a "three-dimensional forensic auditing" framework (Acoustic Perception / Cognitive Coherence / Cognitive Dissonance) for Audio Language Models (ALMs) performing deepfake detection with reasoning chains. It finds that CoT reasoning is not a universal enhancement—it acts as a "Reasoning Shield" for models with strong acoustic perception (Qwen2-Audio), but becomes a "Reasoning Tax" for those with weak perception (Gemma-3n, Phi-4). Furthermore, when a model is compromised, high cognitive dissonance can serve as a "silent alarm" to alert human auditors.

Anchored Cyclic Generation: A Novel Paradigm for Long-Sequence Symbolic Music Generation

This paper proposes the Anchored Cyclic Generation (ACG) paradigm, which alleviates error accumulation in long-sequence symbolic music generation by using confirmed musical content as anchors to calibrate the generation direction during the autoregressive process. A hierarchical framework, Hi-ACG, is constructed to achieve music generation from global structure to local details.

[b] = [d] − [t] + [p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic

This work systematically demonstrates the existence of linear phonological feature vectors within the representation spaces of self-supervised speech models (S3M). These vectors satisfy word2vec-style vector arithmetic relationships, and their scaling correlates continuously with acoustic measurements.
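The arithmetic in the title can be illustrated with a toy example. The sketch below uses hand-made 3-dimensional vectors whose axes are hypothetical (voicing, labial place, alveolar place); they are not representations extracted from any S3M, and the `analogy` helper is an illustrative name rather than anything from the paper.

```python
import math

# Toy, hand-made "phone embeddings" (hypothetical; real S3M features are
# high-dimensional and learned). Axes: [voicing, labial, alveolar].
PHONES = {
    "p": [0.0, 1.0, 0.0],
    "b": [1.0, 1.0, 0.0],
    "t": [0.0, 0.0, 1.0],
    "d": [1.0, 0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the phone whose vector is closest to vec(a) - vec(b) + vec(c)."""
    query = [x - y + z for x, y, z in zip(PHONES[a], PHONES[b], PHONES[c])]
    return max(PHONES, key=lambda ph: cosine(query, PHONES[ph]))

# [d] - [t] isolates the voicing direction; adding [p] lands on [b].
nearest = analogy("d", "t", "p")
```

In this toy space, subtracting [t] from [d] leaves only the voicing component, so adding it to [p] yields the vector for [b] exactly.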

Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

This paper reveals that the perception weaknesses of current AudioLLMs stem from ASR-centric training patterns, which systematically suppress paralinguistic and non-linguistic information. It proposes the Unified Audio Schema (UAS), which structures audio information into a JSON format across three dimensions: transcription, paralinguistics, and non-linguistic events. UAS achieves a 10.9% improvement in perception accuracy on the MMSU benchmark while maintaining reasoning capabilities.

Beyond Transcripts: A Renewed Perspective on Audio Chaptering

This paper systematically reconstructs the long-form audio chaptering task: advancing evaluation from transcript-dependent text space to transcript-invariant temporal space, and demonstrating that AudioSeg, utilizing direct audio representations, significantly outperforms text-based segmentation and existing MLLM solutions on YTSeg.

Closing the Modality Reasoning Gap for Speech Large Language Models

This paper introduces TARS (Trajectory Alignment for Reasoning in Speech), a reinforcement learning-based framework that aligns speech-conditioned reasoning trajectories with text-conditioned trajectories through two dense signals: representation alignment and behavior alignment. It achieves SOTA performance in 7B-scale models, with the Modality Recovery Rate (MRR) approaching or even exceeding 100%.

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

This paper proposes SwanBench-Speech, which systematically evaluates long-form speech generation using 1,101 samples across 17 real-world downstream scenarios and 7 automatic evaluation dimensions. The study concludes that while current models approach usability in content accuracy, they still significantly lag behind real recordings in reverb consistency, long-range prosody, and expressive hierarchy.

Computational Narrative Understanding for Expressive Text-to-Speech

This paper extracts character direct quotes from audiobook fiction to construct LibriQuote, a large-scale expressive speech dataset (5.3K hours of quotes + 12.7K hours of narration). It annotates speaking styles using speech verbs and adverbs as pseudo-labels. Experiments demonstrate that fine-tuning flow-matching models improves both expressiveness and intelligibility, and LibriQuote-test serves as a challenging benchmark for expressive TTS.

ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

This paper proposes ControlAudio, a unified progressive diffusion modeling framework. Through a three-stage progressive training strategy (TTA pre-training → timing control fine-tuning → joint timing and intelligible speech training) and progressive guided sampling, it achieves text-guided, timing-precise, and intelligible speech generation within a single diffusion model. It significantly outperforms existing methods in timing accuracy and speech clarity.

Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech

To address the challenge of aligning ambiguous pronunciations in LLM-based TTS (e.g., the Japanese word "辛い" can be read as both karai and tsurai), the authors propose TKTO. This method first estimates the importance weight \(w_t\) for each token using two contrastive KTO models trained with swapped labels. It then decomposes the utterance-level value function of KTO into token-level components and aggregates them with weights. This achieves both "no paired data required" and "automatic targeting of objective tokens," increasing Japanese pronunciation accuracy from 0.668 to 0.958 (+39%) and reducing CER by 54%.

Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking

HEALTHDIAL constructs a dataset comprising 6,000 multi-parallel health information-seeking dialogues in 4 official WHO languages and 163 hours of real user speech. It establishes a multilingual spoken RAG benchmark based on ASR, TTS, retrieval, knowledge filtering, and user studies.

Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models

This paper reveals the structured redundancy hierarchy of speech token representations in Large Speech Language Models (LSLMs) through layer-wise oracle intervention experiments—where shallow layers encode essential acoustic details while deep layers are extremely redundant—and proposes Affinity Pooling, a training-free similarity-based token merging mechanism that reduces FLOPs by 27.48% while maintaining competitive accuracy.
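A rough sketch of what training-free, similarity-based token merging can look like. The summary does not specify the paper's actual merging rule, so everything here is an assumption: the function name `merge_similar_tokens`, the greedy grouping of adjacent tokens, the mean as the merged representation, and the 0.9 threshold.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def merge_similar_tokens(tokens, threshold=0.9):
    """Greedily merge runs of adjacent token vectors (assumed rule).

    A token joins the current group when its cosine similarity to the
    group's running mean is at least `threshold`; each group is then
    replaced by its mean, shortening the sequence.
    """
    merged, group = [], []
    for tok in tokens:
        if group and cosine(mean(group), tok) >= threshold:
            group.append(tok)
        else:
            if group:
                merged.append(mean(group))
            group = [tok]
    if group:
        merged.append(mean(group))
    return merged

# Four token vectors forming two near-duplicate runs collapse to two.
pooled = merge_similar_tokens([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 1.0]])
```

Because later layers then process fewer tokens, a scheme of this kind lowers compute without any retraining, which is the property the summary attributes to Affinity Pooling.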

DRInQ: Evaluating Conversational Implicature with Controlled Context Variation

DRInQ constructs a conversational implicature evaluation set by fixing question surface forms and systematically varying contexts. It discovers that while LLMs can generate plausible pragmatic scenarios, they often over-interpret context during inference, resulting in lower consistency compared to human judgment.

DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition

DuIVRS-2 transforms the modular telephone IVR system used for Baidu Maps' large-scale POI attribute acquisition into an LLM-driven end-to-end dialogue system. Through FSM-guided data augmentation, selective generation, and co-evaluator iterative learning, it achieves 83.9% TSR, 130ms average response latency, and a capacity of 0.4M calls per day in production.

Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care

This paper proposes an explainable speech analysis framework for clinical mental health assistance. It combines perceptually understandable acoustic and linguistic features with XGBoost, statistical testing, SHAP, and LIME to identify stable behavioral vocal cues across multiple datasets (stress, depression, anxiety, ADHD), rather than pursuing black-box end-to-end diagnosis.

FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations

FC-TTS utilizes disentangled speech representations from FACodec as conditioning sources. Through two-stage spectrogram generation, a VQ-VAE style encoder, and a conditional consistency loss, it separates timbre and speaking style—originally entangled within a single reference in zero-shot TTS—into two independently controllable inputs.

FIGMA: Towards Fine-Grained Music Retrieval

To address the issue where CLAP-like music retrieval models "only utilize the first 40–50 tokens of a caption and collapse long descriptions into a bag-of-words," FIGMA introduces a frame-token level fine-grained contrastive loss (multi-view contrast) alongside the standard global contrastive loss. It also includes the FGMCaps dataset featuring 380k pairs with music theory annotations, enabling the model to retrieve music based on precise attributes like tempo, key, chords, and time signature, achieving a maximum relative improvement of 73.3%.

From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation

This paper introduces S2ST-Omni 2, which replaces flat language labels in multilingual speech translation with structured typological priors. These priors are injected across three levels: representation, acoustic modulation, and LLM decoding. The approach achieves improvements in BLEU, ASR-BLEU, COMET, and BLASER 2.0 on CVSS-C, with significant gains for low-resource languages and those with high typological divergence.

From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

To be added after in-depth reading.

Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner

The authors propose Full-Duplex-Bench-v2, where a GPT-Realtime-powered Examiner interacts with full-duplex models in real-time via WebRTC across four task categories (Daily/Correction/Entity/Safety) and two pacing modes (Fast/Slow). Evaluation scores cover turn-taking, instruction-following, and task-specific dimensions. Findings reveal that performance for GPT-Realtime, Moshi, and Freeze-Omni degrades as dialogues progress, with open-source models performing particularly poorly on correction and entity tracking.

SEPT: Semantically Expanded Prompt Tuning for Audio-Language Models

SEPT significantly alleviates the Base-New Tradeoff (BNT) problem in Audio-Language Model (ALM) prompt tuning by leveraging LLMs to generate semantic neighbors and designing a margin-constrained semantic expansion loss to regularize the prompt embedding space. It establishes the first systematic evaluation benchmark for ALM prompt generalization.

Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages

This paper conducts a phoneme-level ASR analysis of two typologically extreme, low-resource endangered East Caucasian languages (Archi and Rutul). It finds that phoneme recognition accuracy follows an S-shaped learning curve relative to training frequency, suggesting that many errors attributed to phonological complexity actually stem from data scarcity.

HCFD: A Benchmark for Audio Deepfake Detection in Healthcare

This paper proposes HCFD, a task for codec-based fake speech detection in healthcare scenarios. It constructs HCFK, the first codec-based fake speech dataset containing various clinical pathological conditions (Depression, Alzheimer's, Dysarthria), and proposes the PHOENIX-Mamba framework—which models multi-modal forgery evidence prototypes in hyperbolic space to achieve 97.04% accuracy in English depression detection.

How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

This paper employs three phonology probing tasks (rhyme / G2P / syllable count) to demonstrate that BPE-style subword tokenization is both "too coarse" to capture local phonology and "misaligned" in its boundaries to capture prosodic structures. The authors propose the STAD metric and a lightweight IPA-augmented fine-tuning method, enabling Llama3.1-8B to achieve comprehensive improvements across all three tasks while experiencing only a 1.1% and 0.9% drop in GSM8K and MMLU performance, respectively.

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

ImmersiveTTS utilizes a dual-stream MM-DiT to simultaneously model transcript content and environmental descriptions, stabilized by dual-teacher representation alignment with WavLM and ATST-Frame, enhancing speech naturalness, intelligibility, and speech-environment fusion quality in background-noise TTS.

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

This paper constructs the first multi-Indic language CodecFake detection benchmark, ICF, and proposes SATYAM—a hyperbolic audio large language model. By aligning semantic and paralinguistic representations using Bhattacharyya distance in hyperbolic space and subsequently aligning them with prompts, the model achieves a 98.32% detection accuracy with only 3.75M trainable parameters.

Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering

This paper constructs Jamendo-MT-QA, a multi-track comparative music QA benchmark containing 36,519 comparative QA pairs (covering 12,173 track pairs), to systematically evaluate the cross-track comparative reasoning capabilities of audio-language models for the first time, revealing significant deficiencies of existing models in sentence-level comparison generation.

LLM-MC-Affect: LLM-Based Monte Carlo Modeling of Affective Trajectories and Latent Ambiguity for Interpersonal Dynamic Insight

This paper proposes LLM-MC-Affect, which transforms affective states in dialogue from single-point labels into latent distributions approximated by stochastic LLM decoding. It further utilizes mean, variance, cross-correlation, and slope metrics to analyze affective synchrony and dominance in teacher-student dialogues.

MARQUIS: A Three-Stage Pipeline for Video Retrieval-Augmented Generation

MARQUIS decomposes multi-video retrieval-augmented article generation into a three-stage pipeline: "Query Decomposition and Reranking Retrieval—Calibrated Structured Evidence Extraction—Article Generation with Citations." It also employs an RLM controller for iterative evidence management. On MAGMaR2026, it improved retrieval nDCG@10 from 0.195 to 0.759, while the Iter-QA-Base variant achieved a human rating of 3.83 on the generation side.

MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus

This paper constructs MCGA, the first large-scale (119 hours, 22,000 samples) fully copyrighted audio corpus for classical Chinese literature. It covers five major genres (Fu, Poetry, Prose, Ci, and Qu) and six speech tasks (ASR/S2TT/SEC/SQA/SU/SR). Evaluation of 10 multimodal large language models reveals significant deficiencies in current models regarding the understanding of classical Chinese literary audio.

Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs

The authors propose a multilingual disfluency correction pipeline: first using MuRIL for token-level fluent/disfluent labeling, then feeding the "original transcript + token labels" into Llama-3.2-3B / Qwen2.5-3B for instruction fine-tuning. The key innovation is an anti-disfluency contrastive loss that explicitly penalizes the probability of generating disfluent tokens (\(-\log(1-\sum_v w_v P_\theta(v))\)). On real Hindi/Bengali/Marathi ASR data, this approach achieves +1.97 BLEU over the non-contrastive baseline and +8.54 BLEU over mBART, with the 3B models matching or exceeding GPT-4o in most settings.
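The penalty term quoted above, \(-\log(1-\sum_v w_v P_\theta(v))\), can be evaluated for a single decoding step as follows. This is a minimal sketch: the function name, the dictionary representation of the distribution, and the `eps` clamp are assumptions, and how the paper aggregates the term over time steps or combines it with the main training loss is not stated in the summary.

```python
import math

def anti_disfluency_loss(probs, weights, eps=1e-8):
    """Compute -log(1 - sum_v w_v * P(v)) for one decoding step.

    probs:   token -> model probability P_theta(v) at this step
    weights: token -> disfluency weight w_v (tokens not listed count as 0)
    eps:     clamp that avoids log(0) when all mass is on disfluent tokens
    """
    disfluent_mass = sum(weights.get(tok, 0.0) * p for tok, p in probs.items())
    return -math.log(max(1.0 - disfluent_mass, eps))

# Half of the probability mass sits on the filler "um", so the penalty is log(2).
step_loss = anti_disfluency_loss({"um": 0.5, "the": 0.5}, {"um": 1.0})
```

The term is zero when no probability falls on weighted disfluent tokens and grows without bound as that mass approaches one, which is what makes it an explicit penalty on generating them.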

MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

This paper proposes MTR-DuplexBench, a comprehensive benchmark for evaluating Full-Duplex Speech Language Models (FD-SLMs) in multi-round scenarios. By introducing an innovative turn segmentation method to address blurred turn boundaries and context inconsistency, the framework evaluates four dimensions: conversational features, conversation quality, instruction following, and safety. Experiments reveal that existing FD-SLMs suffer from continuous performance degradation during multi-round interactions.

Multimodal In-Context Learning for ASR of Low-Resource Languages

This study systematically investigates whether Multimodal In-Context Learning (MICL) enables speech LLMs to learn unseen endangered languages. It proposes an MICL-based hypothesis selection system that combines the complementary strengths of acoustic models and speech LLMs, significantly improving ASR performance across three endangered languages.

Music Audio-Visual Question Answering Requires Specialized Multimodal Designs

As the first comprehensive survey in the Music Audio-Visual Question Answering (Music AVQA) field, this paper systematically analyzes dataset evolution and methodological designs. It demonstrates that specialized input processing, spatio-temporal architectural design, and music domain knowledge are essential for this task, as general multimodal models are insufficient for the unique challenges of musical performance.

MSU-Bench: Musical Score Understanding Benchmark

MSU-Bench is the first human-annotated benchmark for full musical score understanding, comprising 1,800 generative QA pairs from 150 works across four difficulty levels. Evaluations reveal significant deficiencies in LLMs/VLMs regarding score localization and hallucinations, while text input via ABC notation significantly mitigates these issues.

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

This paper proposes OEA (Omni-Embed-Audio), which leverages multimodal LLMs as a unified encoder to construct a retrieval-oriented audio-text embedding space. It introduces the User-Intent Queries (UIQ) benchmark and hard negative discrimination metrics (HNSR/TFR). The study finds that the LLM backbone significantly outperforms the CLAP series in T2T retrieval (+22%) and hard negative discrimination (+4.3%p HNSR@10).

Phun-Bench: Evaluating LLMs on Phonological Understanding in Chinese

To be added after further reading.

PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding

PlanRAG-Audio reformulates long-form audio understanding as a process of "planning which modalities and time segments to query, then retrieving evidence from a structured audio database." This reduces the LLM input for a 60-minute audio from approximately 115k tokens to about 1k tokens, while significantly improving performance in speaker counting, event ordering, and speaker-constrained QA.

Privacy-preserving Prosody Representation Learning

This paper proposes a self-supervised prosody encoder using glottal source as input, which reduces identity leakage through F0 speaker normalization and adversarial speaker loss. It outperforms raw prosody and HuBERT baselines in phrase boundary detection, syllable prominence, and pitch reconstruction, while reducing VoxCeleb1 speaker identification accuracy from 0.64 (HuBERT) to 0.14.

Protecting Bystander Privacy via Selective Hearing in Audio LLMs

This paper proposes the first bystander privacy benchmark, SH-Bench, and a Bystander Privacy Fine-Tuning (BPFT) method to evaluate and enhance the capability of audio LLMs to focus solely on the primary speaker and refuse to leak bystander information in multi-speaker environments. After BPFT, the SE metric improves by 16% compared to Gemini 2.5 Pro.

Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition

This paper proposes Pseudo2Real, a parameter space correction method that calculates a "correction vector" by computing the weight difference between a real-label model and a pseudo-label model in the source domain. This vector is then applied to a target domain model fine-tuned on pseudo-labels to rectify systematic pseudo-label biases, achieving up to a 35% relative Word Error Rate (WER) reduction across ten African accents in the AfriSpeech-200 dataset.
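The parameter-space correction described above can be sketched with plain dictionaries standing in for model weights. The function names and the `scale` factor are illustrative assumptions; the summary only states that the source-domain weight difference is applied to the target-domain pseudo-label model.

```python
def correction_vector(real_src, pseudo_src):
    """Weight difference between the real-label and pseudo-label source models."""
    return {
        name: [r - p for r, p in zip(real_src[name], pseudo_src[name])]
        for name in real_src
    }

def apply_correction(pseudo_tgt, correction, scale=1.0):
    """Add the (optionally scaled) correction to the target pseudo-label model.

    `scale` is a hypothetical knob; scale=1.0 applies the raw difference.
    """
    return {
        name: [w + scale * c for w, c in zip(pseudo_tgt[name], correction[name])]
        for name in pseudo_tgt
    }

# Toy "models" with a single two-element parameter tensor.
real_src = {"w": [1.0, 2.0]}     # source model trained on real labels
pseudo_src = {"w": [0.5, 2.5]}   # source model trained on pseudo-labels
pseudo_tgt = {"w": [3.0, 3.0]}   # target model fine-tuned on pseudo-labels

corrected = apply_correction(pseudo_tgt, correction_vector(real_src, pseudo_src))
```

The underlying idea is that whatever systematic bias pseudo-labels introduce in the source domain is assumed to carry over to the target domain, so the same weight offset can be reused there.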

Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests

This paper utilizes color-grid reference games to examine whether VLMs can translate internal uncertainty into appropriate clarification requests, finding that even in controlled tasks, Qwen2.5-VL and GPT-5-mini still exhibit interaction gaps such as overconfidence, unstable clarification behavior, and low-quality clarification questions.

RespiraMFM: A Multimodal Foundation Model for Respiratory Disease Recognition via Contrastive Audio-Language Alignment

RespiraMFM addresses the challenge where "non-linguistic acoustic biomarkers like coughing/wheezing are difficult to align with symptom text" by proposing a two-stage decoupling architecture: first, use contrastive learning to explicitly anchor audio embeddings into the LLM text semantic space, then freeze this aligner for instruction fine-tuning classification. This improves supervised AUROC by 9.15% and zero-shot performance by 20.98% across five respiratory diseases and nine tasks.

ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis

ReStyle-TTS enables zero-shot TTS to break free from the style of the reference audio through decoupled text/reference guidance, continuously scalable Style LoRAs, orthogonal LoRA fusion, and timbre consistency optimization. This allows for relative adjustments of pitch, energy, and emotion while maintaining text intelligibility and speaker timbre.

Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification

This paper proposes the R2ScP framework, which shifts the missing modality handling paradigm in AVQA from traditional generative completion to retrieval-based recovery. By employing cross-modal retrieval and a context-aware adaptive purification mechanism to eliminate retrieval noise, it significantly improves question-answering performance in scenarios with incomplete modalities.

RTCFake: Speech Deepfake Detection in Real-Time Communication

RTCFake constructs a ~600-hour speech deepfake detection dataset targeting real-world Real-Time Communication (RTC) platforms and proposes Phoneme-guided Consistency Learning (PCL). This method reduces the average EER of XLSR+AASIST from 7.33% (mixed training) to 5.81% across offline, online, cross-platform, and unseen noise scenarios.

S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models

S2S-Arena proposes a benchmark to evaluate S2S models directly in the speech modality. Using a four-level paralinguistic interaction protocol, 1,243 speech samples, and 1,001 pairwise comparisons, it reveals significant performance gaps in current systems regarding complex tone, emotion, speaking style, and expressive control.

SDiaReward: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness

SDiaReward constructs a pairwise preference dataset and ESDR-Bench for multi-turn spoken dialogue, training an end-to-end speech reward model. This allows evaluation to transcend text semantics, simultaneously assessing the modality gap (prosody/emotion) and the colloquialness gap (natural spoken style).

SegTune: Structured and Fine-Grained Control for Song Generation

The authors propose SegTune, a song generation framework based on Diffusion Transformer. It achieves fine-grained temporal control over song structure and musical attributes through hierarchical text conditions (global + segment-level prompts) and an LLM-based duration predictor.

Semi-Supervised Diseased Detection from Speech Dialogues with Multi-Level Data Modeling

This paper proposes an audio-only semi-supervised learning framework that jointly models pathological speech features at session, clip, and frame levels within clinical dialogues. By utilizing an EMA teacher-student network to dynamically generate high-quality pseudo-labels, the framework achieves 90% of fully supervised performance in depression and Alzheimer's detection using only 11 labeled samples.

SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation

This paper proposes Script-Normalized WER (SN-WER), a training-free evaluation method that decouples script mismatch errors from true recognition errors in multi-script ASR evaluation by transliterating both reference and hypothesis texts into a uniform canonical script before calculating WER.
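A minimal sketch of the idea: normalize both strings to one script, then compute ordinary word-level WER. The `TOY_TRANSLIT` lookup table is a hypothetical stand-in for a real transliteration system, and the function names are illustrative; the paper's actual canonical script and tooling are not given in the summary.

```python
# Hypothetical word-level lookup standing in for a real transliterator.
TOY_TRANSLIT = {"नमस्ते": "namaste", "दुनिया": "duniya"}

def to_canonical(text):
    """Map each word to the canonical script; unknown words pass through."""
    return " ".join(TOY_TRANSLIT.get(word, word) for word in text.split())

def wer(reference, hypothesis):
    """Word error rate via word-level edit distance (non-empty reference assumed)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

def sn_wer(reference, hypothesis):
    """WER computed after both sides are moved to the same script."""
    return wer(to_canonical(reference), to_canonical(hypothesis))

reference = "नमस्ते दुनिया"      # Devanagari reference
hypothesis = "namaste duniya"   # same words, Latin-script ASR output

raw_score = wer(reference, hypothesis)
normalized_score = sn_wer(reference, hypothesis)
```

Here plain WER counts both words as errors even though the recognition is correct, while the script-normalized score is zero, which is the kind of script mismatch the metric is meant to factor out.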

SpeakerSleuth: Can Large Audio-Language Models Judge Speaker Consistency across Multi-turn Dialogues?

SpeakerSleuth constructs the first benchmark (1,818 instances) to evaluate the ability of LALMs to judge speaker consistency in multi-turn dialogues. Systematic evaluations of 12 LALMs and 6 embedding methods reveal that models struggle to detect and localize acoustic inconsistencies and exhibit severe text-over-audio modality bias, though they perform better at comparing and ranking acoustic variants.

Speculative End-Turn Detector for Efficient Speech Chatbot Assistant

The paper constructs the first public end-turn detection dataset, OpenETD, and proposes SpeculativeETD. This approach utilizes an on-device GRU to continuously monitor speaking/non-speaking states, invoking a server-side Wav2Vec2 to distinguish between Gaps and Pauses only when a 200 ms silence is encountered. On real speech, it achieves real-time turn-taking performance close to large models while reducing FLOPs by 38x and maintaining sub-millisecond on-device latency.
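The two-tier control flow described above can be sketched as a simple loop. Both models are replaced by stand-ins here: the per-frame booleans play the role of the on-device GRU output, and `server_classify` is a hypothetical callable in place of the server-side Wav2Vec2. The 20 ms frame size is also an assumption; only the 200 ms trigger comes from the summary.

```python
def run_speculative_etd(frame_is_speech, server_classify,
                        frame_ms=20, silence_trigger_ms=200):
    """Invoke the expensive classifier only after enough continuous silence.

    frame_is_speech: iterable of booleans from the lightweight on-device model
    server_classify: callable(frame_index) -> "gap" or "pause" (stand-in)
    Returns a list of (frame_index, label) server decisions.
    """
    decisions = []
    silence_ms = 0
    triggered = False  # ensures one server call per silent stretch
    for idx, is_speech in enumerate(frame_is_speech):
        if is_speech:
            silence_ms = 0
            triggered = False
            continue
        silence_ms += frame_ms
        if silence_ms >= silence_trigger_ms and not triggered:
            triggered = True
            decisions.append((idx, server_classify(idx)))
    return decisions

# 100 ms of silence (no server call), then 240 ms of silence (one call).
frames = [True] * 5 + [False] * 5 + [True] * 3 + [False] * 12
server_calls = []

def fake_server(idx):
    server_calls.append(idx)
    return "gap"

decisions = run_speculative_etd(frames, fake_server)
```

Short hesitations never reach the server in this arrangement, so the large model runs only on the silences that could plausibly end a turn.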

Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Speech-Hands is proposed as a learnable speech agent framework. By generating explicit action tokens (<internal>/<external>/<rewrite>) at inference time to decide whether to trust internal perception or external ASR hypotheses, it achieves an average WER reduction of 12.1% across 7 benchmarks on the OpenASR leaderboard and reaches 77.37% accuracy on Audio QA.

SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation

This paper extends speech quality assessment from "assigning a score" to "interpretable speech judging" by constructing the SpeechEval dataset, which contains 32,207 multi-lingual audios and 128,754 annotations. By utilizing CoT instruction tuning and GRPO training, SQ-LLM was developed, outperforming existing speech LLMs and expert models across four task categories: quality scoring, pair-wise comparison, improvement suggestions, and deepfake detection.

Standard-to-Dialect Transfer Trends Differ across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects

This paper systematically compares three transfer paths—text, speech, and ASR cascade—using German-Bavarian intent classification and German-Swiss German topic classification. The study finds that optimal solutions for standard language do not necessarily apply to dialects: while text models excel on standard German, speech models are generally more robust on dialectal input.

Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions

Addressing the inability of voice assistants to distinguish Third-Party Interruptions (TPI) from primary user speech, this work proposes the TPI-Train dataset with 88K training instances and the TPI-Bench evaluation framework. Through a speaker-aware hard negative mining strategy, semantic shortcut learning is eliminated, forcing the model to rely on acoustic cues for interruption detection.

StressTest: Can YOUR Speech LM Handle the Stress?

The authors propose the StressTest benchmark to evaluate the ability of Speech Language Models (SLMs) to understand the meaning of sentence stress. Findings indicate that existing models struggle to reason about speaker intent based on stress patterns. StresSLM, trained via the Stress-17k synthetic data pipeline, significantly outperforms frontier models on stress detection and reasoning tasks.

Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

This paper discovers that Spoken Language Models (SLMs) fail to maintain initially specified speaking styles (emotion, accent, volume, speaking rate) during multi-turn dialogues, a phenomenon termed "Style Amnesia." Through attention analysis, the study reveals the cause (attention decay) and proposes an explicit recall process as a mitigation method.

TellWhisper: Tell Whisper Who Speaks When

This paper proposes TellWhisper, which achieves joint modeling of "who spoke what and when" by designing Time-Speaker Aware Rotary Positional Embedding (TS-RoPE) to unify speaker identity and temporal information within the self-attention of the speech encoder. Coupled with a Hyperbolic space Speaker Diarization model (Hyper-SD), it achieves state-of-the-art performance in multi-speaker ASR tasks.

Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models

TCD is proposed as a training-free inference-time decoding method: by contrasting the logit differences between original audio and a temporally blurred slow-path view, combined with stability-guided blurring windows and uncertainty gating, unified audio-language models better utilize transient acoustic cues. It achieves consistent improvements on MMAU and AIR-Bench.
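One way to picture the contrast step is the generic contrastive-decoding form sketched below. The exact TCD formulation is not given in the summary, so the `(1 + alpha) * original - alpha * blurred` combination, the entropy-based gate, and all parameter values are assumptions borrowed from the general contrastive-decoding pattern rather than the paper's method.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def contrastive_logits(original, blurred, alpha=1.0, entropy_gate=0.5):
    """Amplify what the original audio view supports beyond the blurred view.

    The contrast is applied only when the original distribution is
    uncertain (entropy >= entropy_gate); confident steps are left untouched.
    """
    if entropy(softmax(original)) < entropy_gate:
        return list(original)
    return [(1.0 + alpha) * o - alpha * b for o, b in zip(original, blurred)]

# Uncertain step: token 1 depends on fine temporal detail lost in the blur.
adjusted = contrastive_logits([2.0, 1.9, 0.0], [2.0, 1.0, 0.0])
best_token = max(range(len(adjusted)), key=adjusted.__getitem__)
```

In this toy step the blurred view loses support for token 1, so the contrast promotes it over token 0, illustrating how evidence that depends on transient cues could be emphasized.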

Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training

This paper proposes the FCaps large-scale dataset (47k hours of speech, 19M fine-grained annotations) and the CLSP contrastive learning model. Through an end-to-end annotation pipeline and fine-grained multi-granular contrastive supervision, it realizes the first speech-text alignment model capable of unifying global and fine-grained speaking style representations.

UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

UniSonate utilizes a unified Instruction-Content representation, dynamic SFX token injection, and multi-stage curriculum learning to integrate Text-to-Speech (TTS), Text-to-Music (TTM), and Text-to-Audio (TTA) into a single flow-matching MM-DiT. It achieves performance comparable to or exceeding specialized models in TTS and TTM while maintaining functional sound effect generation capabilities in TTA.

UniSRM: A Unified Speech Reward Model for Fine-Grained Speech Evaluation

This paper proposes UniSRM, a unified speech reward model. Through two-stage training (SFT + GRPO) and a Reasoning Consistency Reward (RCR) mechanism, it supports multi-dimensional, interpretable speech evaluation ranging from utterance-level quality to dialogue-level coherence, significantly outperforming existing methods across multiple evaluation tasks.

UniVocal: Unified Speech-Singing Code-Mixed Synthesis

UniVocal achieves SOTA performance on the newly constructed SCSBench benchmark by utilizing fine-grained cent tokens and two-stage curriculum learning, enabling the model to automatically infer speech/singing switching points from raw text semantics without explicit labels.

VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models

This paper discovers that end-to-end Omni-modal Large Language Models (OLLMs) tend to miscopy slide text as speech content when performing SlideASR. It proposes VAPO, which utilizes a "Look-then-Listen" structured reasoning chain and multi-objective reinforcement learning to transform slide text into semantic anchors for speech recognition rather than sources of interference.

VoxMind: An End-to-End Agentic Spoken Dialogue System

VoxMind is proposed as a unified framework that endows end-to-end speech dialogue models with agentic capabilities. By implementing a "Think-before-Speak" mechanism for explicit reasoning and a multi-agent dynamic tool management architecture to decouple inference latency from tool scale, the task completion rate is improved from a 34.88% baseline to 74.57%, surpassing Gemini-2.5-Pro.

When Misinformation Speaks and Converses: Rethinking Fact-Checking in Audio Platforms

This is a position paper arguing that misinformation on audio platforms is fundamentally different from textual misinformation. It possesses both spoken characteristics (prosody, pacing, emotion) and conversational traits (multi-turn, multi-speaker, cross-episode). Existing text-centric fact-checking pipelines cannot process these effectively, necessitating the redesign of verification frameworks around audio-specific attributes.

XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection

The XLSR-MamBo framework is proposed to systematically explore four topological designs and various SSM variants (Mamba2, Hydra, GDN) within hybrid Mamba-Attention architectures for audio deepfake detection. Among these, MamBo-3-Hydra utilizes the native bidirectional modeling of Hydra to achieve competitive performance across multiple benchmarks, while increasing backbone depth effectively mitigates the performance instability observed in shallow models.

ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

This paper proposes ZipVoice-Dialog, the first flow-matching-based non-autoregressive (NAR) zero-shot dialogue speech generation model. Through two simple designs—a curriculum learning strategy and speaker turn embeddings—the model resolves issues of speech unintelligibility and turn confusion when flow matching is directly applied to dialogue scenarios. Additionally, the first large-scale open-source dialogue speech dataset, OpenDialog (6.8k hours), is released.