
🎵 Audio & Speech

🧪 ICML2026 · 36 paper notes

📌 Same area in other venues: 📷 CVPR2026 (22) · 🔬 ICLR2026 (80) · 💬 ACL2026 (72) · 🤖 AAAI2026 (30) · 🧠 NeurIPS2025 (47) · 📹 ICCV2025 (11)

🔥 Top topics: Speech & Audio ×16 · Multimodal/VLM ×4 · Few-/Zero-Shot Learning ×4 · Reasoning ×3 · Layout & Composition ×2

A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation: This paper proposes Hive, a universal sound separation dataset constructed via single-event purification and semantically consistent mixing. Using approximately 2.4k hours of high-purity source audio, it enables AudioSep and FlowSep to approach or even exceed the performance of systems trained on million-hour datasets across multiple separation metrics.
Alethia: A Foundational Encoder for Voice Deepfakes: Alethia proposes a dual-branch pretraining paradigm that combines bottleneck masked embedding prediction with flow-matching spectrogram generation, yielding the first foundational encoder built specifically for voice deepfake detection, localization, and attribution. It significantly outperforms general speech foundation models (SFMs) such as Wav2vec2, HuBERT, and WavLM across 56 datasets in 5 task categories and exhibits strong zero-shot robustness against unseen singing voice deepfakes and real-world perturbations.
Algorithmic Recourse of In-Context Learning for Tabular Data: This paper presents the first systematic study of the algorithmic recourse problem in tabular in-context learning (ICL). It proves that the dynamic decision rules induced by ICL can still yield well-defined recourse and proposes ASR-ICL, which uses adaptive subspace zeroth-order optimization to generate low-cost, sparse, and actionable counterfactual modifications for black-box ICL models.
An Exterior Method for Nonnegative Matrix Factorization: This paper proposes eNMF, which transforms NMF from "always staying inside the nonnegative orthogonal cone" to "approximating the nonnegative cone from the exterior of the rotation equivalence class of the unconstrained SVD optimal solution, followed by feasibility attainment and descent." It reaches lower reconstruction errors faster than 9 classes of NMF baselines on synthetic, text, audio, image, and recommendation data.
Attend to Anything: Foundation Model for Unified Human Attention Modeling: AAM unifies image, video, and audio-visual saliency prediction into a single attention foundation model featuring text conditioning, hyperbolic hierarchical constraints, and Fokker-Planck temporal dynamics. It consistently outperforms specialized models across 16 benchmarks and improves video inference speed to approximately 111 FPS.
Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models: Existing respiratory acoustic foundation models (FMs) have been evaluated almost exclusively on cough classification. This paper presents the first systematic evaluation of FMs on continuous regression tasks (passively estimating age, BMI, and disease probability from cough audio). Using a multi-model multi-target benchmark protocol consisting of 5 FMs × 6 targets × 3 datasets with frozen encoders and three types of regression heads, the study reveals findings obscured by classification-based evaluations, including the "data scale × head capacity" tradeoff, the advantages of generative pre-training, and strongly asymmetric cross-dataset transfer.
CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction: Addressing the lack of unified evaluation for modern music generation models that simultaneously process "text + lyrics + reference audio," this paper establishes a complete ecosystem: 110k pseudo-labeled CMI-Pref-Pseudo, 4,027 human-labeled CMI-Pref, a unified CMI-RewardBench, and a family of ~30M parameter reward models (CMI-RM) capable of handling all modality combinations in a single architecture. The authors demonstrate high correlation with human judgment and enable "inference-time scaling" via top-k filtering.
Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox: The authors construct VoxParadox, a benchmark of 2,000 Multiple Choice Questions (MCQs) designed with intentional contradictions between "what the text says" and "what the audio sounds like." They demonstrate that current Audio LLMs almost exclusively "read but do not listen" in paralinguistic tasks. By introducing PCLM, a lightweight module that adaptively mixes intermediate audio encoder features based on the prompt, combined with DPO, they improve Audio Flamingo 3's performance on VoxParadox from 17.40% to 65.20%.
Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability: This paper formulates the "continuation probability of a pre-trained Large Audio Language Model (LALM) on ground-truth speech tokens" as an objective style consistency metric named MCLP. By employing a gated hybrid reward of MCLP+CER through GRPO on the newly constructed WenetSpeech-RP-TTS dataset, the subjective MOS of role-play TTS is improved from 1.86 to 3.58.
Few-Shot Synthetic Accented Speech for ASR Fine-Tuning: What Helps and When?: Fine-tuning ASR with accented speech synthesized via few-shot TTS, the authors decompose the question of "why it works." They find that the gains primarily stem from phoneme-level perturbation augmentation—random phoneme replacement captures most of the benefits, while LLM-generated "target accent phoneme editing" or even oracle ground-truth phonemes/prosody offer only marginal improvements. Furthermore, while synthetic data significantly reduces training variance when real data is scarce, a fixed quota of synthetic data eventually dilutes real data; the real-to-synthetic ratio itself is the critical factor.
Focus Then Listen: An Empirical Study of Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models: This paper proposes Focus-Then-Listen (FTL), a plug-and-play audio enhancer that leaves the parameters of the Large Audio Language Model (LALM) untouched. It decomposes the input waveform into speech and non-speech tracks, uses an LLM router to decide "which category to listen to" based on the user instruction, and then applies a modality-aware fusion block to produce task-adaptive enhanced audio for the LALM, improving perception and reasoning performance under various noise conditions.
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration: To address the chronic issues of "modality dominance" and "spurious modality coupling" in centralized multimodal fusion, GCL reformulates multimodal learning as a governed two-stage four-agent collaborative protocol. The first stage uses Routing/Auditing agents to decide which cross-modal communications are permitted per sample based on marginal prediction gain. The second stage uses Public-Factor/Aggregation agents to decouple shared semantics from private specializations before aggregation. This approach achieves SOTA on MOSI, MOSEI, and MIntRec.
JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments: Based on Qwen2.5-Omni, JAEGER adapts an end-to-end 3D audio-visual large model using LoRA. By integrating RGB-D depth positional encoding + First-Order Ambisonics (FOA) dual-track audio + a newly proposed Neural Intensity Vector, it extends traditional AV-LLMs from "2D RGB + Monophonic" to "3D Geometry + Multi-channel Spatial Audio," and releases the SpatialSceneQA simulation benchmark with 61k samples.
MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks: MECAT constructs 20k multi-perspective fine-grained audio captions and 100k open-ended QA pairs using a "multi-expert models + CoT LLM reasoning" pipeline. It proposes the DATE metric (harmonic mean of semantic similarity and cross-sample discriminability), achieving the first stable differentiation between generic and detailed audio model outputs.
MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models: MoshiRAG incorporates a special \(\langle\text{ret}\rangle\) trigger token into the Moshi full-duplex speech model, allowing the model to asynchronously invoke an LLM or search engine backend while speaking. By exploiting the natural "keyword delay" between the start of an utterance and the appearance of critical keywords, it preserves full-duplex interactivity while hiding retrieval latencies of up to 2 seconds. This enables the model to achieve factuality on par with GPT-4o Audio across LlamaQ, WebQ, TriviaQA, and HaluEval.
MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety: MultiBreak utilizes an iterative framework of "active learning + uncertainty-guided rewriting" to expand a multi-turn jailbreak dataset to 10,389 conversations and 2,665 independent harmful intents. With a diversity score of 0.942, it significantly outperforms previous works and raises the attack success rate (ASR) on DeepSeek-R1-7B / GPT-4.1-mini by 54% / 34.6%, respectively, compared to the next best dataset.
Multimodal Fact-Level Attribution for Verifiable Reasoning: MURGAT is the first benchmark to evaluate the ability of MLLMs to provide "precise modality + timestamp citations at a factual granularity" in multimodal reasoning outputs. It employs a three-step evaluation protocol (Verifiable Claim Identification → Atomic Fact Decomposition → Attribution Quality) and an automated evaluator, MURGAT-SCORE, which achieves high human alignment (Pearson 0.84). The study reveals that strong models often produce hallucinated citations even when answers are correct, and that enhanced reasoning often comes at the cost of verifiable citations.
Multimodal Fusion via Self-Consistent Task-Gradient Fields: SCFAE reformulates the multimodal fusion block as a "Self-Consistent Field" (SCF) composed of "task loss + reconstruction loss." By partitioning each modality's features into "shared/specific" subspaces and cyclically replacing shared components across modalities, it ensures that task gradients backpropagate cleanly to each encoder. Consequently, it achieves higher robustness than strongly coupled or heavily regularized fusion methods across three scenarios: unequal input lengths, modal conflict, and missing modalities.
Multiple Choice Learning of Low-Rank Adapters for Language Modeling: This paper introduces LoRA-MCL, which integrates the "Winner-Takes-All" training paradigm of Multiple Choice Learning into LoRA fine-tuning. By treating \(K\) sets of low-rank adapters as \(K\) competing hypotheses and updating only the most suitable adapter for each training sample, a single base model can generate multiple diverse and plausible texts covering different modes of the conditional distribution in a single forward pass. It achieves a new Pareto frontier for quality-diversity in audio/image captioning and machine translation.
MusicDET: Zero-Shot AI-Generated Music Detection: MusicDET redefines "AI-Generated Music (AIGM) detection" as a zero-shot problem trained exclusively on real music. By employing sub-band decomposition, intra-band normalizing flows, and a global normalizing flow to learn the probability distribution of real music energy spectrograms, the model utilizes likelihood values as "authenticity scores." It reduces the average EER from ~17% to 4.51% (zero-shot) and 0.89% (with class-conditional priors) under cross-generator evaluation on FakeMusicCaps / SONICS.
NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating: This work utilizes a 2D oscillatory wave field (OWM) inspired by cortical oscillations for real-time saliency detection. Serving as a "training-free attention gate" for Audio Language Models (ALMs) on long audio, it feeds only truly salient windows into the ALM. This increases the AP on XD-Violence from 53.5% to 70.6% while reducing approximately 40% of ALM calls.
PhaLar: Phasors for Learned Musical Audio Representations: PhaLar achieves a 70% relative improvement over SOTA on musical stem retrieval tasks by projecting audio features onto the complex plane and leveraging phase equivariance—encoding temporal alignment as phase rotations via FFT. It uses only 44% of the competitor's parameters and achieves 7× training acceleration; it represents a fundamental paradigm shift from "phase invariance" to "phase equivariance."
Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration: Polyphonia extends zero-shot timbre transfer from single-track to dense multitrack mixes. By utilizing the Ideal Ratio Mask (IRM) obtained through Blind Source Separation (BSS) as an external acoustic prior, it performs "source interpolation + acoustic modulation" within the pre-softmax attention logits. This allows the spectrum of the target part (e.g., vocals) to be replaced by a new timbre (e.g., violin) while strictly preserving the background accompaniment, achieving a 15.5% improvement in target alignment over the previous SOTA.
Position: Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning: This is a position paper: the authors argue that current text embedding research focuses excessively on "surface semantics" (morphology / syntax / topical similarity) while systematically ignoring "implicit semantics" such as pragmatics, stance, and social context. Empirical evidence from 7 implicit semantic datasets shows that even SOTA embeddings offer only marginal improvements over Bag-of-Tokens, advocating for implicit semantics as a first-class modeling objective in embedding research.
Position: Towards Responsible Evaluation for Text-to-Speech: This is a position paper proposing that TTS evaluation should evolve from "technical metrics only" to a three-layer hierarchical Responsible Evaluation framework—Fidelity & Accuracy, Comparability & Standardization, and Governance-Fairness-Security. It systematically diagnoses the failure modes of current metrics such as WER, SIM, MOS, and RTF, and provides 13 actionable recommendations.
Probing Cross-modal Information Hubs in Audio-Visual LLMs: The authors employ a causal tracing and unimodal-dominance framework to reveal hidden hubs in Audio-Visual LLMs called "cross-modal sink tokens," where the majority of cross-modal information is condensed. Based on this, a training-free attention amplification strategy is proposed to significantly mitigate object hallucinations.
Probing Token Spaces under Generator Shift in AI-Generated Music Detection: This paper elevates the token space (choice of tokenizer), which was previously treated as a "preprocessing detail" in AI music detection, to a primary experimental variable. By fixing the downstream classifier (CoMoE) and swapping only the input tokens, and conducting "source-restricted" evaluations on the newly constructed MoM-open (where only one fake generator is seen during training while others are used for testing), the study demonstrates that different token spaces exhibit massive robustness gaps under generator shift (e.g., X-Codec tokens achieving 89.0% AUC vs. EnCodec tokens at 58.6% on Fake-Udio).
SafeSearch: Automated Red-Teaming of LLM-Based Search Agents: This paper introduces SafeSearch, a fully automated, sandboxed, and scalable red-teaming framework that evaluates search agent safety by injecting a single LLM-generated unreliable webpage into real search results. Through systematic evaluation of 17 LLMs across 3 agent scaffolds using 300 test cases, the study finds a peak attack success rate (ASR) of 90.5% and demonstrates that common reminder-based defenses are largely ineffective.
Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment: During VQ-VAE speech codec training, the decoder is fed both the "quantized tokens" and "pre-quantization continuous latents." A lightweight feature alignment loss forces the decoder's internal features from the former path to align with those of the latter. This significantly enhances reconstruction fidelity with zero inference overhead and allows the codebook size to be reduced fourfold without performance loss.
Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech: The authors train a Top-k Sparse Autoencoder (SAE) on the residual stream of the semantic backbone of an LLM-based TTS (IndexTTS2). By using "sentence-level activation rate differences," they identify a small set of sparse latent features strongly correlated with target emotions. During inference, they perform bidirectional emotion induction and suppression by intervening only on these features without modifying backbone parameters. This approach outperforms global mean-difference steering and existing TTS baselines.
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization: This paper discovers that waveform gradients in Audio Language Model (ALM) jailbreak optimization are highly concentrated in a few tokens. It proposes TAGO, which updates only the waveform segments corresponding to the top-\(\zeta\) high-energy tokens at each step. On Qwen3-Omni, retaining only 25% of tokens maintains an 86% LLM-judge jailbreak success rate (vs. 87% for full tokens).
The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning: This paper proposes FLAIR, a framework in which Full-Duplex Spoken Dialogue Models (SDLM) spend the steps normally used to emit <SIL> placeholders while "listening to the user" on continuous latent reasoning instead. By employing an ELBO training objective and a non-causal "global expert" to provide the posterior, the causal LLM learns to "think while listening" through a sequence of embedding vectors, significantly improving QA quality without introducing any inference latency.
Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer: SwanSphere proposes a two-stage streaming architecture consisting of a "Causal AR Language Model + Local DiT (LocDiT)." It generates four-channel First-Order Ambisonics (FOA) spatial audio from panoramic videos or text. Combined with SVAC physics-aware contrastive learning and multi-objective ODPO, it reduces first-chunk latency to 0.21s while outperforming cascade and end-to-end baselines in FD, KL, and angular error.
Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition: This paper treats the decision-making of multimodal large models as an information decomposition from input to output. Using Partial Information Decomposition (PID), the mutual information of VL/omni-modal model predictions is decomposed into four terms: "Vision-unique / Text-unique / Redundant / Synergistic." It discovers that the synergistic term is the best indicator of predictive vision sensitivity and that omni-modal models suffer from a "visual hegemony" synergy bottleneck. Finally, sample-level scores derived from PID are used to guide LoRA reweighted fine-tuning, achieving consistent improvements of 1–2 percentage points on MMStar, MMBench, and POPE.
Two-Dimensional Quantization for Geometry-Aware Audio Coding: The authors replace scalar quantizers in neural audio codecs with Q2D2, a geometric quantizer using "paired channels + structured 2D grids." By replacing learnable codebooks with fixed hexagonal, rectangular, or rhombic lattices, they achieve speech reconstruction quality that matches or surpasses RVQ, VQ, and FSQ using a single quantizer at extremely low token rates.
VocSim: A Training-Free Benchmark for Zero-Shot Content Identity Recognition for Single-Source Audio: VocSim is a training-free benchmark covering 125k single-source audio clips that diagnoses the intrinsic geometry of frozen audio foundation models through label-agnostic PCA whitening—revealing severe generalization defects in current models on low-resource cross-lingual speech.
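
Three of the summaries above describe quantities simple enough to sketch directly: the harmonic-mean form of the DATE metric (MECAT), a mean log-probability over ground-truth tokens (MCLP), and the Winner-Takes-All choice among \(K\) adapters (LoRA-MCL). The Python below is a minimal illustration under stated assumptions, not the authors' implementations: all function names are hypothetical, and how each paper computes its inputs (similarity scores, token log-probabilities, per-adapter losses) or normalizes its outputs is not given in these notes.

```python
from statistics import fmean


def harmonic_mean(similarity: float, discriminability: float) -> float:
    """Harmonic mean of two scores; 0.0 if either score is not positive.

    Assumption: both inputs are already on a comparable non-negative scale.
    """
    if similarity <= 0 or discriminability <= 0:
        return 0.0
    return 2 * similarity * discriminability / (similarity + discriminability)


def mean_continuation_log_prob(token_log_probs: list[float]) -> float:
    """Average log-probability a model assigns to the ground-truth tokens.

    Assumption: the caller supplies one log-probability per speech token.
    """
    return fmean(token_log_probs)


def winner_takes_all(per_adapter_losses: list[float]) -> tuple[int, float]:
    """Return (index, loss) of the adapter with the lowest loss on one sample.

    In a Winner-Takes-All scheme, only this adapter would receive the update.
    """
    winner = min(range(len(per_adapter_losses)),
                 key=per_adapter_losses.__getitem__)
    return winner, per_adapter_losses[winner]


if __name__ == "__main__":
    print(harmonic_mean(0.8, 0.4))
    print(mean_continuation_log_prob([-1.0, -2.0, -3.0]))
    print(winner_takes_all([0.9, 0.2, 0.5]))
```

The harmonic mean penalizes an output that scores well on only one of the two axes, which fits the summary's point about separating generic from detailed captions. The Winner-Takes-All helper covers only the selection step; any relaxation the paper may use to keep the losing adapters training is outside this sketch.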