🎵 Audio & Speech
🧠 NeurIPS 2025 · 48 paper notes
- A Controllable Examination for Long-Context Language Models
  - This paper proposes LongBioBench, which uses synthetically generated fictional biographies as both needles and haystacks to construct a long-context LLM evaluation framework satisfying three core principles: seamless context, controllable settings, and reliable evaluation. Evaluating 18 models, the benchmark reveals that current LCLMs exhibit substantial deficiencies in reasoning and trustworthiness despite adequate retrieval performance.
- A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings
  - This paper introduces TiALD (Tigrinya Abusive Language Detection), the first large-scale multi-task benchmark dataset for the low-resource Tigrinya language. It comprises 13,717 YouTube comments jointly annotated for three tasks (abusive language detection, sentiment analysis, and topic classification) and demonstrates that a compact fine-tuned model (TiRoBERTa, 125M parameters) consistently outperforms frontier LLMs such as GPT-4o and Claude 3.7 Sonnet across all tasks.
- A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity
  - TRIANGLE replaces pairwise cosine similarity with the area of the triangle spanned by the video, audio, and text embedding vectors in high-dimensional space, achieving joint alignment of all three modalities (sketched below). The method surpasses the state of the art by up to 9 Recall@1 points on video-text retrieval and related tasks.
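
    A minimal sketch of the geometric idea, assuming unit-normalized embeddings; the scoring convention is illustrative, not the paper's exact formulation:

    ```python
    import numpy as np

    def triangle_area(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
        """Area of the triangle spanned by three embeddings, in any dimension.

        With edge vectors u = b - a and v = c - a, the Gram determinant gives
        area = 0.5 * sqrt(|u|^2 * |v|^2 - (u . v)^2).
        """
        u, v = b - a, c - a
        gram = np.dot(u, u) * np.dot(v, v) - np.dot(u, v) ** 2
        return 0.5 * np.sqrt(max(gram, 0.0))

    # Smaller area means the three modality embeddings sit closer together,
    # i.e., better joint alignment; -area can serve as a retrieval score.
    rng = np.random.default_rng(0)
    video, audio, text = (x / np.linalg.norm(x) for x in rng.standard_normal((3, 512)))
    print(triangle_area(video, audio, text))
    ```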
- Accelerate Creation of Product Claims Using Generative AI
  - This paper develops the Claim Advisor platform, leveraging LLM in-context learning and LoRA fine-tuning to accelerate the search, generation, refinement, and ranking of product claims for consumer goods. By emulating the MaxDiff research methodology, a fine-tuned Phi-3 14B model outperforms GPT-4o on claim ranking using only 1 in-context example versus GPT-4o's 100, and after three iterative rounds, 100% of generated claims achieve a "highly appealing" rating.
- AdaptDel: Adaptable Deletion Rate Randomized Smoothing for Certified Robustness
  - AdaptDel extends the fixed deletion rate used in randomized smoothing for discrete sequences to an adaptable deletion rate that varies according to input properties such as sequence length (see the sketch below). The paper provides a theoretical soundness proof for certification under variable rates, and experiments on NLP sequence classification tasks demonstrate improvements in certified region cardinality of up to 30 orders of magnitude.
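
    A schematic of input-adaptive deletion smoothing; the rate policy and toy classifier are invented for illustration, and the paper's certification machinery is not shown:

    ```python
    import random
    from collections import Counter

    def adaptive_delete(tokens, rate_fn):
        """Randomized deletion with an input-adaptive rate: each token is
        dropped independently with a probability derived from properties of
        the input itself (here, its length)."""
        rate = rate_fn(len(tokens))
        return [t for t in tokens if random.random() > rate]

    def toy_classifier(tokens):  # stand-in for the base model
        return "pos" if tokens.count("good") > tokens.count("bad") else "neg"

    rate_fn = lambda n: min(0.9, 0.1 + 0.01 * n)  # longer input -> higher rate
    text = "a good movie with a good cast but bad pacing".split()
    votes = Counter(toy_classifier(adaptive_delete(text, rate_fn)) for _ in range(1000))
    print(votes.most_common(1)[0])  # the smoothed (majority-vote) prediction
    ```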
- Associative Syntax and Maximal Repetitions Reveal Context-Dependent Complexity in Fruit Bat Communication
  - This paper proposes an unsupervised approach for inferring discrete units, grammar types, and temporal structure from fruit bat vocalizations, and introduces Maximal Repetitions (MRs, sketched below) to animal communication research for the first time, finding that communicative complexity is significantly higher in conflict contexts than in affiliative ones.
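
    For intuition, a brute-force finder of maximal tandem repetitions in a discrete unit sequence; this is a simplified stand-in, and the paper's exact MR definition may differ:

    ```python
    def maximal_repetitions(seq, min_period=1):
        """Find maximal tandem repetitions: substrings w^k (k >= 2) that
        cannot be extended left or right while keeping the same period."""
        n, found = len(seq), set()
        for p in range(min_period, n // 2 + 1):           # candidate period length
            i = 0
            while i + 2 * p <= n:
                if seq[i:i + p] == seq[i + p:i + 2 * p]:  # two adjacent copies
                    start, end = i, i + 2 * p
                    while end < n and seq[end] == seq[end - p]:
                        end += 1                          # extend right
                    while start > 0 and seq[start - 1] == seq[start - 1 + p]:
                        start -= 1                        # extend left
                    found.add((start, end, p))
                    i = end - p + 1                       # skip past this run
                else:
                    i += 1
        return sorted(found)

    # Toy unit sequence: one period-2 repetition ('abab') and one period-1 run ('ccc').
    print(maximal_repetitions(list("xababccc")))          # [(1, 5, 2), (5, 8, 1)]
    ```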
- AudSemThinker: Enhancing Audio-Language Models through Reasoning over Semantics of Sound
  - AudSemThinker introduces a structured semantic reasoning framework for audio-language models by defining 9 categories of sound semantic descriptors (who/what/how/when/where, etc.). Built on Qwen2.5-Omni-7B and trained via SFT + GRPO (with verifiable rewards and length constraints), the model produces three-stage outputs in the format `<think><semantic_elements><answer>` (a parsing sketch follows), achieving 66.70% on the MMAU benchmark and surpassing Audio-Reasoner (61.71%) and Qwen2.5-Omni (65.60%).
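
    A small parser for the three-stage output format, assuming the tag names above with matching closing tags (the sample text is invented):

    ```python
    import re

    STAGES = ("think", "semantic_elements", "answer")

    def parse_stages(text: str) -> dict:
        """Extract each stage's span from a tagged model output."""
        out = {}
        for tag in STAGES:
            m = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
            out[tag] = m.group(1).strip() if m else None
        return out

    sample = ("<think>Footsteps, then a creaking door.</think>"
              "<semantic_elements>who: person; what: walking; where: hallway</semantic_elements>"
              "<answer>Someone walks down a hallway and opens a door.</answer>")
    print(parse_stages(sample))
    ```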
- Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents
  - Meta proposes WAGIBench, a multimodal goal inference benchmark for assistive wearable agents, comprising 3,477 egocentric recordings (29 hours) from 348 participants across four modalities (visual, audio, digital, and longitudinal). On multiple-choice questions, human accuracy reaches 93% versus 84% for the best VLM; under generative evaluation, models produce relevant goals only 55% of the time, exposing a substantial gap between current VLMs and real-world wearable deployment.
- BNMusic: Blending Environmental Noises into Personalized Music
  - This paper proposes BNMusic, a two-stage framework that blends environmental noises into personalized generated music. Stage 1 generates rhythm-aligned music via mel-spectrogram outpainting and inpainting; Stage 2 adaptively amplifies the music signal based on auditory masking theory to reduce noise perception. The approach requires no additional training and significantly outperforms baselines on EPIC-SOUNDS and ESC-50.
- Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation
  - This paper proposes RecBench, a comprehensive evaluation framework that systematically compares 17 LLMs against 10 conventional DLRMs across 5 domain-specific datasets. Results show that LLM-based recommenders achieve up to 5% AUC improvement on CTR tasks and up to 170% NDCG@10 improvement on sequential recommendation, yet incur 10–1000× slower inference. Conventional DLRMs augmented with LLM semantic embeddings (LLM-for-RS) attain approximately 95% of LLM performance at 20× higher throughput, making this paradigm the most industrially viable solution at present.
- Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models
  - Data-Juicer 2.0 is a cloud-scale multimodal data processing system for foundation models, featuring 150+ operators spanning text, image, video, and audio. It supports adaptive distributed execution (Ray/MaxCompute), efficiently processes TB-scale data on 10,000+ CPU cores, and has been widely adopted in products such as Alibaba Cloud PAI.
- DeepASA: An Object-Oriented Multi-Purpose Network for Auditory Scene Analysis
  - This paper proposes DeepASA, an object-oriented multi-task unified architecture that simultaneously performs multi-channel source separation (MIMO), dereverberation, sound event detection (SED), audio classification, and direction-of-arrival estimation (DoAE) within a single model. Via object-oriented processing and a chain-of-inference mechanism, it achieves state-of-the-art performance on multiple spatial audio benchmarks.
- E-BATS: Efficient Backpropagation-Free Test-Time Adaptation for Speech Foundation Models
  - This paper proposes E-BATS, the first backpropagation-free test-time adaptation framework for speech foundation models. Through lightweight prompt adaptation, multi-scale loss functions, and a test-time EMA mechanism, E-BATS achieves 2.0×–6.4× GPU memory savings while maintaining competitive accuracy.
- E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis
  - E2E-VGuard is a proactive defense framework against voice cloning threats in LLM-based end-to-end speech synthesis. It disrupts timbre via encoder ensemble perturbation, interferes with pronunciation recognition via adversarial attacks on ASR systems, and ensures imperceptibility through a psychoacoustic model. Effectiveness is validated across 19 TTS models and 7 ASR systems.
- Echoes of Humanity: Exploring the Perceived Humanness of AI Music
  - Through a randomized controlled crossover trial (RCCT) and mixed-methods content analysis, this paper systematically investigates listeners' ability to distinguish AI-generated music (AIM) from human-created music. Listeners perform at chance level under random pairing, but accuracy rises significantly to 66% under similar pairing. Vocal, sound, and technical cues are identified as the key factors enabling successful discrimination.
- Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space
  - This paper proposes SLED, which encodes speech waveforms into sequences of continuous latent representations and performs autoregressive modeling in the continuous space via an energy distance objective (sketched below). This avoids the information loss from discretization and the complex hierarchical architectures required by RVQ, while enabling efficient zero-shot and streaming speech synthesis.
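
    A minimal empirical energy-distance estimator of the kind such an objective builds on; this biased plug-in estimate is illustrative, not the paper's exact training loss:

    ```python
    import torch

    def energy_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        """Plug-in estimate of ED^2(X, Y) = 2 E||x - y|| - E||x - x'|| - E||y - y'||
        for sample sets x: (n, d) and y: (m, d). A strictly proper objective for
        matching continuous distributions, so no discretization is needed."""
        return (2 * torch.cdist(x, y).mean()
                - torch.cdist(x, x).mean()
                - torch.cdist(y, y).mean())

    # Sanity check: two batches drawn from the same Gaussian score near zero.
    a, b = torch.randn(256, 16), torch.randn(256, 16)
    print(energy_distance(a, b).item())
    ```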
- Ethics Statements in AI Music Papers: The Effective and the Ineffective
  - A systematic review of the current state of ethics statement usage in AI music research papers, finding that the vast majority of ethics statements are not used effectively, and proposing actionable recommendations for both conferences and researchers.
- EuroSpeech: A Multilingual Speech Corpus
  - This paper presents a scalable, open-source pipeline for automatically constructing the EuroSpeech dataset from recordings of 22 European parliaments, yielding 61K hours of high-quality aligned speech-text data across 22 languages, 19 of which exceed 1K hours. Fine-tuning Whisper on this data reduces average WER by 41.8%.
- From Generation to Attribution: Music AI Agent Architectures for the Post-Streaming Era
  - This paper proposes a content-based Music AI Agent architecture that decomposes music into fine-grained Block components and constructs an Attribution Layer, embedding copyright attribution directly into the AI music creation pipeline to establish a fair AI media platform for the post-streaming era.
- Generating Physically Sound Designs from Text and a Set of Physical Constraints
  - This paper proposes TIDES, a framework that combines the visual guidance of pretrained text-image models (CLIP) with a differentiable finite-element physics simulator. Starting from a text description and a set of physical constraints, TIDES jointly optimizes a visual similarity loss and a structural compliance loss to generate load-bearing structural designs that satisfy both engineering performance requirements and text-specified visual characteristics. The method is validated through 3D-printed three-point bending experiments.
- Inductive Transfer Learning for Graph-Based Recommenders
  - This paper proposes NBF-Rec, a graph-based recommendation model built upon the Neural Bellman-Ford Network, which supports inductive transfer learning across datasets with completely disjoint users and items, enabling zero-shot cross-domain recommendation and lightweight fine-tuning adaptation.
- Instance-Specific Test-Time Training for Speech Editing in the Wild
  - This paper proposes an instance-specific test-time training (TTT) method for in-the-wild speech editing. Prior to inference, the model is fine-tuned at the instance level, using direct supervision from the acoustic features of unedited regions and indirect supervision over edited regions via duration constraints and a phoneme prediction auxiliary loss. The approach mitigates bandwidth discontinuity at editing boundaries, supports precise speaking-rate control through mask length adjustment, and surpasses existing systems on both objective and subjective metrics on an in-the-wild benchmark.
- Latent Space Factorization in LoRA
  - This paper proposes FVAE-LoRA, which incorporates a VAE with dual latent spaces into the LoRA framework. Through a novel ELBO objective, it explicitly factorizes task-relevant features (\(\mathbf{z}_1\)) from residual information (\(\mathbf{z}_2\)), consistently outperforming standard LoRA across text, image, and audio tasks.
- LeVo: High-Quality Song Generation with Multi-Preference Alignment
  - LeVo is a language-model-based song generation framework that jointly models mixed tokens and dual-track tokens, reconciling vocal-accompaniment harmony with audio quality, and introduces a DPO-based multi-preference alignment method to enhance musicality and instruction-following capability. LeVo outperforms all academic baselines and approaches the performance of industrial systems.
- LUMIA: A Handheld Vision-to-Music System for Real-Time, Embodied Composition
  - This paper presents LUMIA, a handheld camera-shaped device that analyzes captured frames via GPT-4 Vision to generate structured prompts, which are then fed to Stable Audio to synthesize loopable music segments, enabling a real-time, embodied improvisation workflow from visual input to music.
- MEGADance: Mixture-of-Experts Architecture for Genre-Aware 3D Dance Generation
  - This paper proposes MEGADance, the first music-driven 3D dance generation method based on a Mixture-of-Experts (MoE) architecture. It decouples choreographic consistency into "dance universality" (Universal Expert) and "style specificity" (Specialized Expert), combined with FSQ quantization and a Mamba-Transformer hybrid backbone, achieving state-of-the-art dance quality and strong style controllability.
- Merlin L48 Spectrogram Dataset
  - This paper introduces the L48 dataset, a fine-grained spectrogram multi-label classification benchmark derived from real-world bird recordings that naturally exhibits the Single Positive Multi-Label (SPML) setting. The dataset exposes critical shortcomings of existing SPML methods under realistic conditions, and the paper proposes an intra-recording consistency regularization scheme to improve performance.
- Mixed Monotonicity Reachability Analysis of Neural ODE: A Trade-Off Between Tightness and Efficiency
  - This paper applies continuous-time mixed monotonicity techniques to the reachability analysis of Neural ODEs. By embedding Neural ODE dynamics into a mixed monotone system, it exploits the geometric simplicity of interval boxes to achieve efficient over-approximation, providing a controllable trade-off between tightness and computational efficiency.
- MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
  - MoME integrates sparse MoE into the Matryoshka representation learning framework for LLM-based audio-visual speech recognition. Through a shared router, it enables cross-granularity knowledge transfer, supporting elastic inference at multiple compression rates under a single set of model weights, while achieving state-of-the-art performance on AVSR/ASR/VSR.
- Multi-head Temporal Latent Attention
  - MTLA extends MLA's low-rank latent compression along the feature dimension by introducing a hyper-network that dynamically merges temporally adjacent KV vectors (sketched below), achieving dual-axis compression of the KV cache across both feature and temporal dimensions. Combined with a stride-aware causal mask to ensure training–inference consistency, MTLA achieves a 4.29× speedup and 6.58× memory reduction on speech translation and related tasks, with quality on par with or slightly exceeding standard MHA.
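
    A schematic of the temporal-merging step, with a fixed weight vector standing in for the hyper-network's dynamically predicted coefficients:

    ```python
    import torch

    def temporal_merge(kv: torch.Tensor, stride: int, weights: torch.Tensor) -> torch.Tensor:
        """Merge each group of `stride` adjacent latent KV vectors into one.

        kv:      (batch, seq, dim) latent KV sequence (already feature-compressed)
        weights: (stride,) merge coefficients; MTLA predicts these per position
                 with a hyper-network (fixed here for illustration).
        """
        b, t, d = kv.shape
        pad = (-t) % stride
        if pad:  # right-pad so the sequence length divides evenly
            kv = torch.cat([kv, kv.new_zeros(b, pad, d)], dim=1)
        kv = kv.view(b, -1, stride, d)
        return (kv * weights.view(1, 1, -1, 1)).sum(dim=2)

    cache = torch.randn(2, 10, 64)
    merged = temporal_merge(cache, stride=2, weights=torch.tensor([0.6, 0.4]))
    print(merged.shape)  # torch.Size([2, 5, 64]): 2x temporal compression
    ```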
- Node-Based Editing for Multimodal Generation of Text, Audio, Image, and Video
  - This paper proposes a node-graph-based story editing system that allows creators to iteratively generate, edit, and compare multimodal content (text, audio, image, and video) through natural language and node-level operations, supporting both linear and branching narrative structures.
- Perceptually Aligning Representations of Music via Noise-Augmented Autoencoders
  - This paper demonstrates that applying noise augmentation to latent variables during autoencoder training, combined with a perceptual loss, induces a "perceptual hierarchy" in the encoding space: the most perceptually salient musical features (e.g., pitch) are encoded in the coarsest latent structures, while secondary features (e.g., timbral details) are encoded in finer ones (see the sketch below). This alignment improves music surprisal estimation under latent diffusion decoding and enhances EEG brain response prediction.
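
    A minimal training-step sketch of the mechanism, with plain MSE standing in for the paper's perceptual loss and linear layers standing in for real audio models:

    ```python
    import torch

    def noisy_latent_step(encoder, decoder, x, sigma=0.1):
        """Corrupt the latent, decode, and penalize reconstruction error.
        Robustness to latent noise pushes perceptually dominant features
        into the coarse, noise-tolerant directions of the latent space."""
        z = encoder(x)
        z_noisy = z + sigma * torch.randn_like(z)  # noise augmentation on latents
        x_hat = decoder(z_noisy)
        return torch.nn.functional.mse_loss(x_hat, x)

    enc, dec = torch.nn.Linear(128, 32), torch.nn.Linear(32, 128)
    loss = noisy_latent_step(enc, dec, torch.randn(4, 128))
    loss.backward()  # train encoder and decoder end to end as usual
    ```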
- Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
  - This work systematically compares language model architectures via controlled synthetic pretraining tasks, and finds that the Canon layer, a lightweight component performing weighted summation over neighboring tokens (sketched below), significantly enhances core capabilities including reasoning depth (2–4×), reasoning breadth, and knowledge capacity, enabling NoPE to match RoPE and GLA to rival Mamba2/GDN.
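
    A minimal sketch of the core operation; the weights here are fixed for illustration, whereas the actual layer learns them and is typically applied residually:

    ```python
    import torch

    def canon_layer(h: torch.Tensor, k: int = 3) -> torch.Tensor:
        """Causal weighted sum over each position and its k-1 preceding tokens.

        h: (batch, seq, dim) hidden states.
        """
        w = torch.softmax(torch.randn(k), dim=0)  # stand-in for learned weights
        out = torch.zeros_like(h)
        for j in range(k):  # shift-and-add: token i accumulates w[j] * h[i - j]
            out[:, j:, :] += w[j] * h[:, : h.size(1) - j, :]
        return out

    x = torch.randn(2, 8, 16)
    print(canon_layer(x).shape)  # (2, 8, 16); residual connection omitted
    ```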
- Resounding Acoustic Fields with Reciprocity
  - Leveraging the reciprocity principle of acoustic wave propagation, this paper proposes Versa (ELE data augmentation plus self-supervised learning), which generates physically valid virtual training samples by swapping emitter and receiver roles (see the sketch below), substantially improving acoustic field estimation performance under sparse emitter configurations.
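
    The augmentation itself reduces to swapping roles in each training pair; a schematic with invented field names:

    ```python
    from dataclasses import dataclass, replace

    @dataclass
    class RIRSample:
        emitter_pos: tuple      # (x, y, z) source position
        receiver_pos: tuple     # (x, y, z) microphone position
        impulse_response: list  # sampled room impulse response

    def reciprocal_augment(s: RIRSample) -> RIRSample:
        """Acoustic reciprocity: exchanging source and receiver leaves the
        impulse response unchanged, so every measured pair yields a second,
        physically valid training sample for free."""
        return replace(s, emitter_pos=s.receiver_pos, receiver_pos=s.emitter_pos)

    s = RIRSample((0.0, 0.0, 1.5), (3.0, 2.0, 1.2), [1.0, 0.42, 0.13])
    print(reciprocal_augment(s))
    ```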
- SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers
  - This paper proposes SAND-Math, a fully automated synthetic mathematics question generation pipeline that requires no seed dataset. Its Difficulty Hiking step systematically increases problem difficulty; augmenting the LIMO baseline with as few as 500 generated problems yields a 4.39 percentage-point improvement on AIME25.
- Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI Models in Sound Localization
  - Through six controlled audio-visual conditions and human psychophysical experiments, this work systematically shows that existing AI sound source localization (SSL) models suffer from severe visual bias, degrading to near-random performance under audio-visual conflict. It proposes EchoPin, a neuroscience-inspired model combining HRTF filtering, ERB cochleagram representation, and stereo audio, which substantially outperforms prior methods on the newly constructed AudioCOCO dataset and, without any human behavioral supervision, exhibits a human-like horizontal-over-vertical localization accuracy asymmetry.
- Segment-Factorized Full-Song Generation on Symbolic Piano Music
  - This paper proposes the Segmented Full-Song (SFS) model, which decomposes a song into segments and autoregressively generates each segment by attending selectively to structurally relevant context. SFS achieves faster and more structurally coherent full-song piano generation compared to existing methods, while supporting interactive human-AI co-creation.
- Sensorium Arc: AI Agent System for Oceanic Data Exploration and Interactive Eco-Art
  - This paper presents Sensorium Arc, a multimodal interactive AI agent system that personifies the ocean as a poetic "narrator" figure. Leveraging a multi-agent RAG architecture, the system integrates NASA ocean science data with eco-aesthetic texts, enabling users to explore complex marine environmental data through natural conversation while dynamically generating scientific visualizations and artistic audiovisual feedback, realizing a shift from passive data observation to active ecological dialogue.
- SHAP Meets Tensor Networks: Provably Tractable Explanations with Parallelism
  - This paper presents the first provably exact SHAP computation framework for Tensor Networks (TNs), proves that SHAP under the Tensor Train (TT) structure is parallelizable in polylogarithmic time (NC² complexity), and reveals via reductions that width rather than depth is the fundamental bottleneck for SHAP computation in binarized neural networks.
- SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation
  - This paper proposes SimulMEGA, a framework combining prefix training with a Mixture-of-Experts (MoE) refinement module to achieve unsupervised read/write policy learning. A 500M-parameter model incurs less than 7% BLEU degradation at 1.5-second latency on simultaneous speech translation across 6 languages, and the approach extends to streaming TTS.
- Slimmable NAM: Neural Amp Models with Adjustable Runtime Computational Cost
  - This paper applies the Slimmable Networks paradigm to the Neural Amp Modeler (NAM) by randomly pruning WaveNet layer widths during training (sketched below), enabling dynamic adjustment of network size at inference time without additional training cost, so musicians can balance audio fidelity and computational expense in real time.
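
    A sketch of width-slimmable training on a toy convolutional stack; real NAM layers are dilated WaveNet blocks with gating, and the widths and layer count here are illustrative:

    ```python
    import random
    import torch

    def slimmable_forward(x, conv_layers, width_mult):
        """Run the stack using only the first `width_mult` fraction of each
        layer's output channels, slicing weights to match the active widths."""
        for conv in conv_layers:
            c = max(1, int(conv.out_channels * width_mult))
            w, b = conv.weight[:c, : x.size(1)], conv.bias[:c]
            x = torch.tanh(torch.nn.functional.conv1d(x, w, b, padding=conv.padding[0]))
        return x

    layers = [torch.nn.Conv1d(1 if i == 0 else 16, 16, 3, padding=1) for i in range(4)]
    audio = torch.randn(1, 1, 2048)

    # Training samples a random width per step; at inference the musician
    # picks any trained width to trade fidelity against compute.
    for _ in range(3):
        out = slimmable_forward(audio, layers, random.choice([0.25, 0.5, 1.0]))
    print(out.shape)
    ```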
- Sound Logical Explanations for Mean Aggregation Graph Neural Networks
  - For GNNs using mean aggregation (MAGNN, i.e., mean-GNNs with non-negative weights), this work precisely characterizes the class of monotone logical rules that can serve as sound explanations, constructs a restricted fragment of first-order logic to explain arbitrary MAGNN predictions, and empirically demonstrates that restricting to non-negative weights does not significantly hurt performance while enabling effective extraction of sound rules.
- Target Speaker Extraction Through Comparing Noisy Positive and Negative Audio Enrollments
  - This paper proposes a novel enrollment strategy that encodes target speaker characteristics by contrasting noisy positive enrollments (segments where the target speaker is active) against negative enrollments (segments where the target speaker is silent), achieving state-of-the-art performance on monaural noisy-enrollment target speaker extraction with SI-SNRi exceeding the previous best method by over 2.1 dB.
- AVRobustBench: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time
  - This paper proposes AVRobustBench, the first benchmark to systematically evaluate the test-time robustness of audio-visual models under co-occurring, correlated dual-modality corruptions, comprising 4 datasets × 75 corruption types, and introduces AV2C, a test-time adaptation (TTA) method based on low-entropy sample selection (sketched below).
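
    A sketch of entropy-based test-time sample selection of the kind AV2C builds on; the keep fraction and the exact criterion are assumptions here:

    ```python
    import torch

    def low_entropy_mask(logits: torch.Tensor, keep_frac: float = 0.3) -> torch.Tensor:
        """Keep only the most confident (lowest predictive entropy) test
        samples, so noisy, corrupted inputs do not drive the adaptation."""
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)  # (batch,)
        k = max(1, int(keep_frac * logits.size(0)))
        return entropy <= entropy.kthvalue(k).values

    logits = torch.randn(16, 10)  # stand-in for fused audio-visual predictions
    mask = low_entropy_mask(logits)
    selected = logits[mask]       # only these samples enter the TTA loss
    print(mask.sum().item(), "of", len(mask), "samples kept")
    ```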
- The Impact of Scaling Training Data on Adversarial Robustness
  - A systematic evaluation of 36 state-of-the-art vision models under 6 categories of black-box attacks reveals that attack success rate (ASR) decreases logarithmically with training data volume and model scale; however, data quality and model scale are more critical than data volume alone.
- Unifying Symbolic Music Arrangement: Track-Aware Reconstruction and Structured Tokenization
  - This paper proposes a unified symbolic music arrangement framework that employs a segment-level self-supervised reconstruction objective (decoupling content from instrument style) and a novel multi-track tokenization scheme, REMI-z, enabling a single pretrained model to handle diverse arrangement tasks including orchestral arrangement, piano reduction, and drum arrangement, while surpassing task-specific state-of-the-art methods on all three benchmarks.
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
  - VITA-1.5 proposes a carefully designed three-stage progressive training strategy that incrementally integrates visual and speech capabilities into an LLM, achieving end-to-end vision-speech real-time interaction without relying on standalone ASR/TTS modules, while attaining state-of-the-art performance among open-source models on image, video, and speech benchmarks.
- WhAM: Towards A Translative Model of Sperm Whale Vocalization
  - This paper proposes WhAM (Whale Acoustics Model), the first Transformer-based generative model for sperm whale codas, achieving acoustic translation, synthetic generation, and downstream classification by fine-tuning VampNet.