🎵 Audio & Speech

🧪 ICML2026 · 7 paper notes

📌 Same area in other venues: 💬 ACL2026 (35) · 📷 CVPR2026 (15) · 🔬 ICLR2026 (32) · 🤖 AAAI2026 (31) · 🧠 NeurIPS2025 (49) · 📹 ICCV2025 (11)

🔥 Top topics: Speech & Audio ×5

Alethia: A Foundational Encoder for Voice Deepfakes

Alethia introduces a dual-branch pretraining paradigm that combines bottleneck-style masked embedding prediction with Flow-Matching spectrogram generation, and uses it to train the first foundational encoder for voice deepfake detection, localization, and attribution. It significantly outperforms general-purpose speech foundation models (SFMs) such as Wav2vec2, HuBERT, and WavLM across 5 tasks and 56 datasets, and shows strong zero-shot robustness to unseen singing deepfakes and real-world perturbations.

MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

MECAT constructs 20k multi-perspective, fine-grained audio captions and 100k open-ended QA pairs with a pipeline that combines multiple expert models with chain-of-thought (CoT) reasoning from a large model. It also proposes the DATE metric, the harmonic mean of semantic similarity and cross-sample discriminability, which for the first time stably separates generic audio model outputs from detail-accurate ones.
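To make the harmonic-mean combination concrete, here is a minimal sketch. It assumes both component scores are already normalized to [0, 1]; how MECAT actually computes semantic similarity and cross-sample discriminability is not described here, so `date_score` and the example values are illustrative only.

```python
def date_score(similarity: float, discriminability: float) -> float:
    """Harmonic mean of two scores in [0, 1] (illustrative, not the official metric code)."""
    for name, value in (("similarity", similarity), ("discriminability", discriminability)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    if similarity + discriminability == 0.0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2.0 * similarity * discriminability / (similarity + discriminability)


# Hypothetical scores: a generic caption matches many clips (low discriminability),
# while a detailed caption is balanced on both axes.
generic = date_score(0.9, 0.1)
detailed = date_score(0.7, 0.7)
print(generic < detailed)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a caption cannot score well by being semantically plausible but interchangeable across samples.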

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

MedMosaic uses a synthetic pipeline to construct a medical audio QA benchmark (46,701 QA pairs, 10 question types) covering physiological sounds and real/synthetic clinical dialogues. It systematically evaluates 13 audio/multimodal models and finds that even Gemini-2.5-Pro achieves only about 68.1% weighted accuracy, exposing fundamental limitations of contemporary large audio-language models (LALMs) in medical audio reasoning.
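For readers unfamiliar with the term, a weighted accuracy over question types can be computed as in the sketch below. It assumes weights proportional to the number of QA pairs per type; the benchmark's actual weighting scheme is not stated here, and the type names and counts are invented for illustration.

```python
from typing import Dict, Tuple


def weighted_accuracy(per_type: Dict[str, Tuple[int, int]]) -> float:
    """per_type maps a question type to (num_correct, num_questions)."""
    total_questions = sum(n for _, n in per_type.values())
    if total_questions == 0:
        raise ValueError("no questions to score")
    # Weighting each type's accuracy by its share of questions is equivalent
    # to pooling all answers: sum(correct) / sum(questions).
    total_correct = sum(c for c, _ in per_type.values())
    return total_correct / total_questions


# Hypothetical counts, not taken from the benchmark.
example = {"type_a": (80, 100), "type_b": (150, 300)}
score = weighted_accuracy(example)  # 230 correct out of 400
print(score)
```

Under this assumption, large question types dominate the headline number, which can differ noticeably from an unweighted mean of per-type accuracies.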

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

MoshiRAG introduces a special ⟨ret⟩ trigger token into Moshi, a full-duplex speech model, so that the model can asynchronously call an LLM/search backend for reference documents while it keeps speaking. By exploiting the natural "keyword delay" (the interval between the onset of speech and the appearance of the first keyword), it fully hides retrieval latencies of under 2 seconds. This raises the factuality of the speech model to the level of GPT-4o Audio on LlamaQ/WebQ/TriviaQA/HaluEval while preserving full-duplex real-time interaction.
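The latency-masking idea can be sketched with ordinary asynchronous code: retrieval starts when the trigger fires and runs concurrently with the spoken lead-in, so it is hidden whenever it finishes before the keyword is due. This is a toy illustration under stated assumptions, not MoshiRAG's implementation; `retrieve`, `answer`, and the sleep-based timings are hypothetical stand-ins.

```python
import asyncio
from typing import Tuple


async def retrieve(query: str, latency_s: float) -> str:
    """Stand-in for an LLM/search backend call (hypothetical)."""
    await asyncio.sleep(latency_s)
    return f"reference for {query!r}"


async def answer(query: str, keyword_delay_s: float, retrieval_latency_s: float) -> Tuple[str, bool]:
    # Start retrieval at the point where the trigger token would be emitted.
    task = asyncio.create_task(retrieve(query, retrieval_latency_s))
    # The model keeps speaking the lead-in of its reply during the keyword delay.
    await asyncio.sleep(keyword_delay_s)
    # If retrieval already finished, the listener never notices the lookup.
    masked = task.done()
    reference = await task
    return reference, masked


fast = asyncio.run(answer("capital of France", keyword_delay_s=0.1, retrieval_latency_s=0.01))
slow = asyncio.run(answer("capital of France", keyword_delay_s=0.01, retrieval_latency_s=0.2))
print(fast[1], slow[1])
```

In this sketch the reply only stalls when retrieval takes longer than the keyword delay, which mirrors the condition under which the summary says latency is masked.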

NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating

NAACA implements a real-time salience detector inspired by cortical oscillations as a 2D oscillatory wave field, its Oscillatory Working Memory (OWM), which serves as a training-free attention gate for Audio Language Models (ALMs) on long audio. Only truly salient windows are fed into the ALM, boosting AP on XD-Violence from 53.5% to 70.6% while reducing ALM invocations by about 40%.
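The gating pattern itself is simple to sketch: score each window, and call the expensive model only on windows that pass a threshold. The example below swaps in a mean-absolute-amplitude score as a placeholder; it is not the oscillatory wave field, and `gate_and_describe`, `fake_alm`, and the threshold are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence

Window = Sequence[float]


def mean_abs_energy(window: Window) -> float:
    """Placeholder salience score; NOT the paper's oscillatory working memory."""
    return sum(abs(x) for x in window) / len(window) if window else 0.0


def gate_and_describe(
    windows: Sequence[Window],
    alm: Callable[[Window], str],
    salience: Callable[[Window], float] = mean_abs_energy,
    threshold: float = 0.5,
) -> Dict[int, str]:
    """Invoke the ALM only on windows whose salience reaches the threshold."""
    outputs: Dict[int, str] = {}
    for index, window in enumerate(windows):
        if salience(window) >= threshold:
            outputs[index] = alm(window)
    return outputs


calls: List[int] = []


def fake_alm(window: Window) -> str:
    """Hypothetical stand-in for an Audio Language Model call."""
    calls.append(len(window))
    return f"described {len(window)} samples"


windows = [[0.0, 0.1], [0.9, -0.8], [0.05, 0.0], [0.7, 0.6]]
outputs = gate_and_describe(windows, fake_alm)
print(sorted(outputs), len(calls))
```

Here two of the four toy windows pass the gate, so the stand-in model is called twice instead of four times; the savings on real data depend entirely on the salience detector.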

Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration

Polyphonia extends zero-shot timbre transfer from single-track to dense multi-track mixtures: using the Ideal Ratio Mask (IRM) from blind source separation as an external acoustic prior, it performs "source interpolation + acoustic modulation" in the pre-softmax attention logits, enabling the target stem's (e.g., vocals) spectrum to be replaced by a new timbre (e.g., violin) while strictly preserving the background accompaniment. Compared to SOTA, it improves target alignment by 15.5%.
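One generic way to inject an external prior before the softmax is to add a log-mask bias to the attention logits. The sketch below illustrates only that general idea; it is not the paper's "source interpolation + acoustic modulation" formulation, and `calibrated_attention`, `strength`, and the per-key mask values are assumptions.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def calibrated_attention(
    logits: Sequence[float],
    mask: Sequence[float],
    strength: float = 1.0,
    eps: float = 1e-6,
) -> List[float]:
    """Bias pre-softmax logits with a per-key prior in [0, 1] (e.g. a ratio mask)."""
    if len(logits) != len(mask):
        raise ValueError("logits and mask must have the same length")
    # Keys with a mask near 0 receive a large negative bias and are suppressed;
    # keys with a mask near 1 are left almost unchanged.
    biased = [l + strength * math.log(m + eps) for l, m in zip(logits, mask)]
    return softmax(biased)


# Hypothetical example: three equally scored keys, with decreasing prior support.
weights = calibrated_attention([1.0, 1.0, 1.0], mask=[1.0, 0.5, 0.0])
print(weights[0] > weights[1] > weights[2])
```

Setting `strength` to 0 recovers ordinary softmax attention, so the parameter controls how strongly the acoustic prior overrides the model's own attention pattern in this sketch.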

Probing Cross-modal Information Hubs in Audio-Visual LLMs

By combining causal tracing and a unimodal-dominant framework, the authors reveal the existence of hidden "cross-modal sink tokens" in audio-visual LLMs, where the vast majority of cross-modal information is concentrated. Based on this, they propose a training-free attention amplification strategy that significantly alleviates object hallucinations.