🎵 Audio & Speech
🤖 AAAI2026 · 31 paper notes
📌 Same area in other venues: 📷 CVPR2026 (22) · 🔬 ICLR2026 (80) · 💬 ACL2026 (72) · 🧪 ICML2026 (36) · 🧠 NeurIPS2025 (47) · 📹 ICCV2025 (11)
🔥 Top topics: Speech & Audio ×13 · Sentiment Analysis ×7 · Multimodal/VLM ×5 · Dialogue ×2 · Diffusion Models ×2
- A Mind Cannot Be Smeared Across Time
This paper formally proves that whether a machine possesses consciousness depends not only on what is computed, but also on when it is computed. Systems executing strictly sequentially fail to satisfy the temporal co-instantiation condition required for the unity of consciousness; consequently, pure software consciousness on strictly sequential hardware is impossible.
- DeepDebater: A Superpersuasive Autonomous Policy Debating System
This paper presents DeepDebater, the first autonomous multi-agent system capable of participating in and winning a complete American-style policy debate (eight speeches plus cross-examination). The system employs a hierarchical agent workflow to construct affirmative (Advantage) and negative (DA+CP+Kritik) arguments, leverages over 3 million evidence cards from OpenDebateEvidence for retrieval-augmented generation, and integrates GPT-4o TTS speech synthesis with EchoMimic digital avatar animation for end-to-end presentation. Expert evaluations show DeepDebater significantly outperforms human-authored cases across all metrics (Quality: 4.32 vs. 3.65), achieving an 85% win rate in simulated rounds.
- AHAMask: Reliable Task Specification for Large Audio Language Models without Instructions
Applying binary masks (AHAMask) over the attention heads in the Transformer backbone of a Large Audio Language Model (LALM) reliably triggers specific acoustic task functionalities without any textual instructions, and it reveals the existence of "acoustic functional pathways" within LALMs.
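The head-gating idea can be pictured with a minimal sketch in plain Python. It is not the paper's implementation: the function name `apply_head_mask`, the toy head outputs, and the mask values are all assumptions made for illustration, and how the masks are obtained is not shown.

```python
from typing import List

Vector = List[float]


def apply_head_mask(head_outputs: List[Vector], mask: List[int]) -> Vector:
    """Concatenate per-head outputs after gating each head with a 0/1 mask.

    head_outputs: one output vector per attention head (hypothetical toy values).
    mask: one binary entry per head; 0 silences the head, 1 keeps it.
    """
    if len(head_outputs) != len(mask):
        raise ValueError("mask needs exactly one entry per head")
    if any(m not in (0, 1) for m in mask):
        raise ValueError("mask entries must be 0 or 1")
    gated: Vector = []
    for output, keep in zip(head_outputs, mask):
        # Multiplying by 0 removes the head's contribution entirely.
        gated.extend(value * keep for value in output)
    return gated


# Three toy heads with two dimensions each; the mask keeps heads 0 and 2.
heads = [[0.5, -1.0], [2.0, 3.0], [1.5, 0.25]]
masked = apply_head_mask(heads, [1, 0, 1])
```

In this picture a different binary mask would select a different subset of heads, which is one way to read the "functional pathway" framing.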
- Aligning Generative Music AI with Human Preferences: Methods and Challenges
This survey/position paper systematically reviews three technical approaches to preference alignment in music generation: MusicRL (large-scale RLHF with ~300K preference pairs), DiffRhythm+ (multi-preference DPO for diffusion models), and Text2midi-InferAlign (inference-time tree search achieving +29.4% CLAP). It also analyzes in depth the alignment challenges unique to the music domain (multi-scale temporal coherence, harmonic consistency, cultural subjectivity, and the evaluation paradox) and proposes a future research roadmap.
- CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation
This paper introduces CCFQA, the first cross-lingual and cross-modal factuality benchmark, covering 8 languages with 14,400 fully parallel speech-text factual QA samples. It supports four task settings (QA/XQA/SQA/XSQA) and systematically reveals factual inconsistencies in existing MLLMs under language and modality switching. The paper also proposes LLM-SQA, which uses English as a bridge language and needs only 5-shot examples to transfer spoken QA across languages, attaining an F1 of 51.4 on XSQA and surpassing GPT-4o-mini-Audio (45.7).
- Characterizing AI Manipulation Risks in Brazilian YouTube Climate Discourse
Through a psycholinguistic framework, this work analyzes 226,775 Brazilian YouTube climate change videos and 2,756,165 comments, revealing that emotional and moral rhetoric significantly drives user engagement. It further demonstrates that fine-tuned LLMs can automatically generate high-engagement climate denial comments, warning of the potential risks of generative AI in public opinion manipulation.
- Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation
This paper proposes the Cross-Space Synergy (CSS) framework, which simultaneously addresses two major challenges in multimodal emotion recognition in conversation (MERC)—insufficient fusion expressiveness and multi-objective gradient conflicts—via Synergistic Polynomial Fusion (SPF) in the representation space and a Pareto Gradient Modulator (PGM) in the gradient space.
- DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization
This paper proposes DeformTrace, which introduces a deformable dynamic receptive field mechanism and relay token scheme into state space models, combining Transformer-level global modeling with SSM-level efficient inference to achieve state-of-the-art accuracy and substantial efficiency gains in temporal forgery localization.
- Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation
This paper proposes Diff-V2M, a hierarchical conditional diffusion Transformer framework for video-to-music generation that integrates affective, semantic, and rhythmic features via explicit rhythmic modeling (low-resolution ODF) and a hierarchical cross-attention mechanism, achieving state-of-the-art performance on both in-domain and out-of-domain datasets.
- DIFFA: Large Language Diffusion Models Can Listen and Understand
This paper proposes DIFFA, the first large audio-language model built upon a diffusion language model, which combines a frozen LLaDA-8B backbone with a lightweight dual-adapter architecture and a two-stage training pipeline. Using only 960 hours of ASR data and 127 hours of synthetic instruction data, DIFFA achieves competitive performance against autoregressive baselines on MMSU, MMAU, and VoiceBench.
- Do LLMs Feel? Teaching Emotion Recognition with Prompts, Retrieval, and Curriculum Learning
This paper proposes PRC-Emo, a framework that integrates explicit/implicit emotion prompting, a dedicated retrieval database, and curriculum learning to enhance LLM performance on Emotion Recognition in Conversation (ERC), achieving state-of-the-art results on the IEMOCAP and MELD benchmarks.
- DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling
This paper proposes DualSpeechLM, a framework that extracts high-level semantic tokens via an understanding-driven speech tokenizer (USTokenizer) as LLM input and uses acoustic tokens as output, jointly optimizing speech understanding and generation within a single end-to-end framework.
- End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering
This paper proposes CLSR, an end-to-end contrastive language-speech retriever that converts acoustic representations into text-like representations before aligning them with text, enabling efficient extraction of question-relevant segments from long-form audio to support RAG-based spoken question answering for downstream LALMs.
- Factor(U,T): Controlling Untrusted AI by Monitoring their Plans
This paper investigates the security of the Factor(U,T) protocol, in which an untrusted AI performs task decomposition and a trusted AI handles execution. The study finds that monitoring decomposed plans yields an AUROC of only 0.52 (near random), whereas monitoring concrete code implementations achieves 0.96—malicious intent is difficult to detect at the abstract planning level but becomes exposed in concrete implementations. The key conclusion is that "structural prevention (trusted decomposer) is superior to post-hoc monitoring."
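To put the two AUROC figures in context: AUROC can be read as the probability that a randomly chosen attack receives a higher suspicion score than a randomly chosen benign sample, so 0.5 means the monitor carries no ranking information. The sketch below computes it by direct pairwise comparison. The scores and labels are hypothetical and are not data from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical suspicion scores from a monitor; label 1 marks an attack.
labels = [1, 1, 0, 0]
separating_monitor = auroc([0.9, 0.8, 0.2, 0.1], labels)     # attacks always score higher
uninformative_monitor = auroc([0.5, 0.5, 0.5, 0.5], labels)  # every pair is a tie
```

The pairwise loop is quadratic in the number of samples, which is fine for a small illustration; a rank-based formulation would be the usual choice at scale.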
- GOMPSNR: Reflourish the Signal-to-Noise Ratio Metric for Audio Generation Tasks
This paper reconstructs the SNR metric by introducing omnidirectional phase derivatives to replace instantaneous phase, proposes GOMPSNR as a more reliable audio quality evaluation metric, and derives a family of novel loss functions that significantly improve neural vocoder performance.
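For orientation only, here is the textbook waveform SNR that such a metric takes as its starting point. This sketch does not reproduce GOMPSNR or its phase-derivative formulation. The function name `snr_db` and the sample values are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def snr_db(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Conventional SNR in decibels: 10 * log10(signal power / error power)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_power == 0.0:
        raise ValueError("reference signal is silent; SNR is undefined")
    if noise_power == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_power / noise_power)


# Toy waveform and a reconstruction that is off by 0.1 at every sample.
clean = [1.0, -1.0, 1.0, -1.0]
generated = [0.9, -1.1, 1.1, -0.9]
quality = snr_db(clean, generated)  # close to 20 dB
```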
- Hearing More with Less: Multi-Modal Retrieval-and-Selection Augmented Conversational LLM-Based ASR
MARS is a multimodal retrieval-and-selection approach that identifies the most relevant historical context for conversational LLM-ASR, rather than relying on a fixed number of preceding utterances or the entire history. It achieves state-of-the-art performance with only 1.5K hours of training data, surpassing TEA-ASLP trained on 179K hours.
- HPSU: A Benchmark for Human-Level Perception in Real-World Spoken Speech Understanding
This paper proposes the HPSU benchmark, comprising 20,000+ expert-annotated Chinese and English samples across 16 tasks, to systematically evaluate Speech LLMs' deep perception and reasoning capabilities in real-world spoken language scenarios. The best-performing model (Gemini 2.5 Pro, 62.6%) still falls far short of human performance (87.3%).
- HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios
This paper proposes HQ-SVC, a framework that leverages a disentangled audio codec (FACodec) to jointly extract content and speaker features, integrates an Enhanced Voice Adaptor (EVA) to fuse acoustic features such as pitch and energy, and employs a progressive synthesis pipeline combining DDSP and a diffusion model. Trained on a single RTX 3090 with fewer than 80 hours of singing data, HQ-SVC achieves zero-shot singing voice conversion quality surpassing large-scale training baselines, and additionally supports speech super-resolution.
- Improving Multimodal Sentiment Analysis via Modality Optimization and Dynamic Primary Modality Selection
This paper proposes the MODS framework, which eliminates redundancy in non-linguistic modalities via Graph-based Dynamic Compression (GDC), and introduces a sample-level Dynamic Primary Modality Selector (MSelector) together with Primary-modality-Centric Cross-Attention (PCCA) to enable adaptive dominant modality selection on a per-sample basis for multimodal sentiment analysis (MSA).
- Let the Model Learn to Feel: Mode-Guided Tonality Injection for Symbolic Music Emotion Recognition
The MoGE diagnostic strategy shows that MIDIBERT fails to encode mode–emotion associations. The proposed MoFi injection framework uses FiLM conditioning to inject major/minor priors into Layer 1 of MIDIBERT (identified as the weakest layer in terms of emotional information). This achieves 75.2% accuracy (+11.8%) on EMOPIA and 59.1% (+11.8%) on VGMIDI, with F1 improvements of 12.3% and 15.5%, respectively.
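FiLM (feature-wise linear modulation) scales and shifts each hidden feature using parameters derived from a conditioning signal, here the major/minor mode. The sketch below shows only that general mechanism. The lookup table `FILM_PARAMS`, its values, and the three-dimensional hidden state are hypothetical; in a trained model the scale and shift would be learned, and this is not MoFi's actual code.

```python
from typing import Dict, List, Tuple

# Hypothetical per-mode (gamma, beta) pairs standing in for learned parameters.
FILM_PARAMS: Dict[str, Tuple[List[float], List[float]]] = {
    "major": ([1.5, 1.0, 0.5], [0.5, 0.0, -0.5]),
    "minor": ([0.5, 1.0, 1.5], [-0.5, 0.0, 0.5]),
}


def film(hidden: List[float], mode: str) -> List[float]:
    """Apply feature-wise modulation: gamma(mode) * hidden + beta(mode)."""
    gamma, beta = FILM_PARAMS[mode]
    if len(hidden) != len(gamma):
        raise ValueError("hidden size must match the FiLM parameter size")
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]


# The same hidden state is modulated differently depending on the mode prior.
hidden_state = [2.0, 4.0, 6.0]
major_out = film(hidden_state, "major")  # [3.5, 4.0, 2.5]
minor_out = film(hidden_state, "minor")  # [0.5, 4.0, 9.5]
```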
- Listening Between the Frames: Bridging Temporal Gaps in Large Audio-Language Models
This paper proposes TimeAudio, which equips large audio-language models (LALMs) with precise temporal grounding and end-to-end long audio understanding capabilities through three key modules: Temporal Markers, Absolute Time-aware Encoding (ATE), and Segment-level Token Merging (SEM). The paper also introduces the FTAR dataset for instruction fine-tuning on fine-grained temporal reasoning.
- MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement
This paper proposes MF-Speech, a framework that employs multi-objective optimization to disentangle speech signals into three high-purity, independent factor representations—content, timbre, and emotion—and subsequently leverages dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN) to achieve fine-grained, compositional control in speech generation, significantly outperforming existing methods on multi-factor compositional speech generation tasks (WER=4.67%, SECS=0.5685).
- Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment
This paper proposes the HIA framework, which employs an Interactive Attention Module to enable bidirectional information exchange across phoneme, word, and utterance granularities. Combined with a residual hierarchical structure to mitigate feature forgetting, HIA achieves state-of-the-art results on the speechocean762 dataset across all granularities and aspects.
- PaSE: Prototype-aligned Calibration and Shapley-based Equilibrium for Multimodal Sentiment Analysis
PaSE is a framework that explicitly addresses modality competition in multimodal sentiment analysis through a two-stage optimization strategy combining prototype-guided calibration alignment (via Entropic Optimal Transport) and Shapley-value-based gradient modulation.
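As background on the Shapley side, the sketch below computes exact Shapley values for three modalities by averaging each modality's marginal contribution over all orderings. The accuracy table is invented for illustration, and how PaSE turns such contributions into gradient modulation is not shown here.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values: mean marginal contribution over all player orderings."""
    totals = {player: 0.0 for player in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for player in order:
            extended = coalition | {player}
            totals[player] += value(extended) - value(coalition)
            coalition = extended
    return {player: total / len(orderings) for player, total in totals.items()}


# Hypothetical accuracy of a model restricted to each subset of modalities.
ACCURACY: Dict[FrozenSet[str], float] = {
    frozenset(): 0.0,
    frozenset({"text"}): 0.6,
    frozenset({"audio"}): 0.4,
    frozenset({"video"}): 0.3,
    frozenset({"text", "audio"}): 0.7,
    frozenset({"text", "video"}): 0.65,
    frozenset({"audio", "video"}): 0.5,
    frozenset({"text", "audio", "video"}): 0.75,
}

contributions = shapley_values(["text", "audio", "video"], ACCURACY.__getitem__)
```

The values sum to the full-coalition score minus the empty-coalition score, which is what makes them a natural way to apportion credit among competing modalities. Enumerating all orderings is only practical for a handful of players.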
- PSA-MF: Personality-Sentiment Aligned Multi-Level Fusion for Multimodal Sentiment Analysis
This paper is the first to introduce a pre-trained personality model into Multimodal Sentiment Analysis (MSA) for extracting personalized sentiment features. Through personality-sentiment contrastive alignment and a progressive multi-level fusion architecture (pre-fusion → cross-modal interaction → enhanced fusion), the proposed PSA-MF achieves state-of-the-art performance on CMU-MOSI and CMU-MOSEI.
- REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation
This paper proposes REINA (Regularized Entropy INformation Adaptation), a loss function grounded in mutual information theory that efficiently converts a non-streaming speech translation model into a streaming simultaneous speech translation model. REINA achieves state-of-the-art streaming translation performance across multiple language directions and introduces a new streaming efficiency metric, NoSE.
- Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding
This paper proposes VARSTok, the first fully dynamic variable-frame-rate speech tokenizer, which achieves adaptive token allocation via temporal-aware density peak clustering and implicit duration coding, surpassing fixed-frame-rate baselines while using fewer tokens.
- SpikCommander: A High-Performance Spiking Transformer with Multi-View Learning for Efficient Speech Command Recognition
This paper proposes SpikCommander, a fully spike-driven Transformer architecture that jointly enhances temporal and channel feature modeling via Multi-view Spike Temporal-Aware Self-Attention (MSTASA) and Spike Context Refinement MLP (SCR-MLP), surpassing state-of-the-art SNN methods on SHD/SSC/GSC benchmarks with fewer parameters.
- TEXT: Text-Routed Sparse Mixture of Experts for Multimodal Sentiment Analysis with Explanation Enhancement and Temporal Alignment
This paper proposes TEXT, a model that leverages MLLMs to generate natural language explanations for audio and video modalities to enhance modal representations, designs a lightweight temporal alignment module combining the strengths of Mamba and temporal cross-attention, and employs text-routed sparse mixture of experts for cross-modal fusion. TEXT comprehensively outperforms prior SOTA methods and large models such as GPT-4o across four MSA benchmarks.
- Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning
Authentic-Dubber simulates the director-actor interaction workflow in real-world dubbing by constructing a multimodal reference footage library, employing an emotion-similarity-based retrieval-augmented strategy, and adopting a progressive graph-based speech generation approach. The method significantly improves the emotional expressiveness of automatic movie dubbing, achieving state-of-the-art emotion accuracy and MOS scores on the V2C-Animation dataset.
- USE: A Unified Model for Universal Sound Separation and Extraction
This paper proposes USE, a unified framework that employs an EDA network to infer the number of sound sources and generate acoustic cues for sound separation (SS), and a multimodal fusion network to interpret user-provided text/video/label cues for target sound extraction (TSE). Joint training with cross-task alignment enables mutual reinforcement between the two tasks, yielding a +1.4 dB SDR gain on SS and 86% matching accuracy on TSE.