🎵 Audio & Speech

🔬 ICLR2026 · 32 paper notes

AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer: This paper proposes AC-Foley, a reference-audio-guided video-to-audio synthesis framework that achieves fine-grained timbre control, timbre transfer, and zero-shot sound effect generation via two-stage training (acoustic feature learning + temporal adaptation) and multimodal conditional flow matching, significantly outperforming existing methods in audio quality and acoustic fidelity.
AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations: This paper proposes AutoFigure — the first agent framework based on a "Reasoned Rendering" paradigm — which automatically generates publication-ready scientific illustrations from long scientific texts by decoupling structural layout planning and aesthetic rendering into two stages. It is accompanied by FigureBench, the first large-scale benchmark (3,300 pairs) for systematic evaluation, with 66.7% of generated results deemed usable in camera-ready submissions by the original authors.
Discovering and Steering Interpretable Concepts in Large Generative Music Models: This work presents the first application of Sparse Autoencoders (SAEs) to the audio/music domain, extracting interpretable musical concept features from the residual stream of the autoregressive music generation model MusicGen, and leveraging these features for controllable generation (steering).
Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation: This paper proposes Dynamic Parameter Memory (DPM), a mechanism that encodes speech information sentence-by-sentence into the parameter space of a temporary LoRA module during inference, enabling speech large language models (SLLMs) with limited context windows to process arbitrarily long conversational audio. The approach achieves state-of-the-art performance on IEMOCAP and MELD.
EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models: This paper proposes EchoMind, the first multi-level interrelated benchmark for empathetic dialogue, which systematically evaluates Speech Language Models' ability to perceive non-verbal acoustic cues and generate empathetic responses through a cognitive pipeline of Understanding → Reasoning → Conversation.
Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention: This paper proposes Dolphin, a model that maps lip movements to discrete semantic tokens via a dual-path lightweight video encoder (DP-LipCoder), and introduces a Global-Local Attention (GLA) separator. Dolphin surpasses state-of-the-art methods on three benchmarks while reducing parameters by 50%+, MACs by 2.4×, and GPU inference latency by 6×.
EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning: This work is the first to reformulate Speech Emotion Recognition (SER) as a deep reasoning problem, leveraging a prosody-enhanced backbone model combined with GRPO-PTR (Progressive Trustworthy Reasoning reward) reinforcement learning to generate explainable emotion reasoning grounded in acoustic evidence.
FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates: This paper proposes FlexiCodec, a neural audio codec with a dynamic frame-rate merging strategy guided by ASR features, achieving high-quality speech coding at ultra-low frame rates of 3–12.5 Hz while maintaining superior retention of semantic information.
Improving Black-Box Generative Attacks via Generator Semantic Consistency: By analyzing semantic degradation in intermediate-layer features of perturbation generators, this paper proposes a Mean Teacher-based semantic structure-aware framework that performs self-feature distillation at early generator layers to preserve semantic consistency, thereby enhancing the transferability of adversarial examples across models, domains, and tasks.
Incentive-Aligned Multi-Source LLM Summaries: This paper introduces the Truthful Text Summarization (TTS) framework, which incorporates a multi-task peer prediction mechanism from game theory into LLM multi-source summarization pipelines. The approach constructs evaluation claim sets via leave-one-out cross-referencing, extracts each source's stance on individual claims, scores source reliability using informative agreement, filters unreliable sources, and regenerates the summary. The framework is theoretically proven to make truthful reporting a utility-maximizing strategy, and empirically demonstrates robustness against prompt injection, misinformation sources, and coordinated attacks.
Knowing When to Quit: Probabilistic Early Exits for Speech Separation: This paper proposes PRESS (Probabilistic Early-exit for Speech Separation) and the PRESS-Net architecture. By jointly modeling clean speech signals and error variance within a probabilistic framework, PRESS derives an interpretable early-exit criterion based on signal-to-noise ratio (SNR), enabling fine-grained dynamic computation scaling for speech separation networks while maintaining performance competitive with static SOTA models.
Latent Speech-Text Transformer: This paper proposes the Latent Speech-Text Transformer (LST), which aggregates discrete speech tokens into higher-level "latent speech patches" as autoregressive units (analogous to BLT's treatment of bytes), aligning the sequence modeling granularity of speech and text (reducing the speech-to-text length ratio from about 20:1 to roughly 1:1). LST achieves a +6.5% absolute improvement on Speech HellaSwag, with gains that continue to grow from 420M to 7B parameters, while reducing ASR/TTS inference computation.
LogicReward: Incentivizing LLM Reasoning via Step-Wise Logical Supervision: This paper proposes LogicReward, a reward function that employs the Isabelle theorem prover for step-wise logical correctness verification. Combined with Autoformalization with Soft Unification to reduce natural language ambiguity, the trained 8B model surpasses GPT-4o by 11.6% and o4-mini by 2% on NLI and logical reasoning tasks.
MAPSS: Manifold-Based Assessment of Perceptual Source Separation: This paper proposes two complementary metrics—Perceptual Separation (PS) and Perceptual Match (PM)—that embed self-supervised encoded representations onto a low-dimensional manifold via diffusion maps, achieving for the first time a functional decoupling of leakage and self-distortion in source separation evaluation. Compared against 18 mainstream metrics, the proposed measures rank first or second in correlation with subjective listening scores in nearly all experimental conditions.
MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark: This paper introduces MMSU (5,000 audio QA items across 47 tasks), the first benchmark to systematically incorporate linguistic theory into spoken language understanding and reasoning evaluation. Evaluating 22 SpeechLLMs, it reveals significant gaps in phonological perception and complex reasoning among existing models.
PACE: Pretrained Audio Continual Learning: This paper presents the first systematic benchmark for audio continual learning (CL), identifies a fundamental upstream–downstream mismatch in pretrained audio models caused by the dominance of low-level spectral features, and proposes PACE—comprising improved first-session adaptation, adaptive subspace-orthogonal PEFT, and boundary-aware perturbation regularization—achieving substantial improvements over SOTA across 6 audio CL benchmarks.
Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition: This paper proposes USR 2.0, which replaces autoregressive pseudo-label generation with CTC-driven teacher forcing, enabling attention pseudo-labels to be produced in a single forward pass. The approach achieves nearly 2× training speedup, enhances out-of-distribution robustness via joint CTC-attention prediction, and establishes state-of-the-art results on LRS3/LRS2/WildVSR across all three tasks (ASR/VSR/AVSR) within a single unified model.
Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering: This paper proposes QSTar, a framework that embeds query guidance throughout the entire processing pipeline and introduces a three-dimensional Spatial-Temporal-Frequency Interaction module (leveraging spectral features to distinguish timbres), achieving significant performance gains on Music Audio-Visual Question Answering (Music AVQA).
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments: This paper presents RedTeamCUA, the first red-teaming framework for computer-use agents (CUAs) in hybrid Web-OS environments, along with RTC-Bench comprising 864 test cases. The framework systematically evaluates the vulnerability of 9+ frontier CUAs to indirect prompt injection attacks, finding that all evaluated CUAs are exploitable (peak attack success rate, ASR, of 83%). Notably, more capable models pose greater risks: the large gap between attempt rate (AR) and ASR implies that improvements in model capability will directly translate into higher attack success rates.
Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion: This paper proposes a Speech-guided Machine Translation (SMT) framework that synthesizes source-language speech via TTS and jointly feeds it with text into an MLLM for translation. A self-evolution mechanism automatically selects beneficial synthetic speech samples for continual training. The approach achieves state-of-the-art performance on Multi30K, surpassing all MMT methods, and attains average SOTA across 108 translation directions on FLORES-200 with only 9B parameters.
Scaling Speech Tokenizers with Diffusion Autoencoders: This paper proposes SiTok (Speech Diffusion Tokenizer), which employs a diffusion autoencoder to jointly train the encoder–quantizer–decoder in a single stage (rather than two stages), incorporates CTC-based semantic regularization to ensure discrete tokens retain linguistic information, and scales to 1.6B parameters trained on 22 million hours of speech data. SiTok achieves strong performance at an extremely low token rate (12.5 Hz / 200 bps), attaining 3.34% WER (reconstruction) and 4.95% WER (LLM ASR) simultaneously.
SiNGER: A Clearer Voice Distills Vision Transformers Further: This paper proposes SiNGER (Singular Nullspace-Guided Energy Reallocation), a framework that suppresses high-norm artifacts in ViT features by applying perturbations along the left-nullspace directions of teacher features, thereby preserving informative signals. Combined with lightweight LoRA adapters, SiNGER achieves state-of-the-art performance across multiple downstream tasks while producing cleaner and more interpretable representations.
SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables: This paper presents SPARTA, an end-to-end framework for automatically constructing large-scale table-text multi-hop QA benchmarks. By leveraging a reference fact database, provenance-based refinement, and realistic structural constraints to generate high-quality nested SQL queries, SPARTA reduces the F1 of state-of-the-art models by over 30 points.
Statistical Guarantees for Offline Domain Randomization: This paper formalizes offline domain randomization (ODR) as a maximum likelihood estimation problem over a parameterized family of simulators. Under mild regularity and identifiability assumptions, it establishes weak consistency (convergence in probability); with an additional uniform Lipschitz continuity assumption, strong consistency (almost sure convergence) is further proved. These results provide the first theoretical foundation for the empirical success of ODR in sim-to-real transfer.
Stitch: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models: Stitch enables "thinking while speaking" in spoken language models (SLMs) by interleaving silent reasoning tokens with speech tokens in chunks, exploiting idle compute during audio playback for reasoning. Stitch-S achieves first-chunk latency identical to the no-reasoning baseline while improving math reasoning accuracy by approximately 15 percentage points.
SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation: This paper proposes SyncTrack, a unified architecture comprising track-shared modules (dual cross-track attention for rhythmic synchronization) and track-specific modules (learnable instrument priors for timbre preservation). It also introduces three new rhythmic consistency evaluation metrics (IRS/CBS/CBD) and achieves substantial improvements in multi-track music generation quality (FAD: 6.55→1.26, subjective MOS: 3.42 vs. 1.57).
The Devil behind the Mask: An Emergent Safety Vulnerability of Diffusion LLMs: This paper presents the first systematic investigation of inherent safety vulnerabilities in diffusion large language models (dLLMs) arising from their bidirectional modeling and parallel decoding mechanisms. It introduces the DiJA jailbreak attack framework, which achieves near-100% attack success rates on multiple aligned dLLMs via interleaved mask-text prompts.
Toward Complex-Valued Neural Networks for Waveform Generation: This paper proposes ComVo, the first iSTFT vocoder to employ complex-valued neural networks (CVNNs) in both the generator and discriminator. It stabilizes training via a phase quantization layer and introduces a block-matrix computation scheme that reduces training time by 25%, achieving synthesis quality superior to real-valued baselines such as Vocos on LibriTTS.
TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization: This paper proposes TripleSumm, which dynamically adjusts modality importance at the frame level via a Multi-scale Temporal block (hierarchical sliding-window attention) and a Cross-modal Fusion block (a fusion token that adaptively weights visual/text/audio). The authors also release MoSu, the first large-scale triple-modality video summarization dataset (52,678 videos), and TripleSumm achieves SOTA on 4 benchmarks.
VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation: This paper proposes VowelPrompt, which extracts vowel-level prosodic descriptors (pitch/energy/duration) grounded in phonetic evidence, converts them into natural language to augment LLM emotion recognition prompts, and employs a two-stage SFT+GRPO training pipeline. The method consistently outperforms state-of-the-art approaches under zero-shot, fine-tuning, cross-domain, and cross-lingual conditions, while generating interpretable emotion reasoning.
When and Where to Reset Matters for Long-Term Test-Time Adaptation: This paper proposes ASR, an adaptive selective reset scheme that uses prediction concentration \(\mathcal{C}_t\) to dynamically determine when to reset (avoiding the suboptimality of fixed-period resets), and employs a progressive layer selection strategy from output to input layers to determine where to reset (preserving valuable adaptation knowledge). Combined with importance-aware regularization for recovering critical knowledge in reset layers and on-the-fly adaptation adjustment, ASR achieves a 44.12% improvement over the prior SOTA on CCC-Hard.
When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment: This paper identifies that Attack Success Rates (ASR) in LLM jailbreak benchmarks are artificially inflated by semantically irrelevant style patterns (e.g., "create a list of"), a phenomenon observed in nearly all of the 36 evaluated LLMs. Superficial style alignment fine-tuning further exacerbates this risk. The paper proposes SafeStyle, a defense that mitigates this risk via style-augmented safety training data.
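
Illustrative sketch for the Stitch entry above: the note describes interleaving silent reasoning tokens with speech tokens in chunks. The snippet below is only a toy illustration of that interleaving pattern on plain token lists, not the authors' implementation. The function names, the chunk sizes, and the `reasoning_first` flag are all hypothetical; starting with a speech chunk is shown as one plausible way the first audible chunk could arrive as early as in a no-reasoning baseline.

```python
from itertools import zip_longest
from typing import List, Sequence


def chunk(tokens: Sequence[str], size: int) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def interleave_chunks(
    reasoning: Sequence[str],
    speech: Sequence[str],
    reasoning_chunk: int,   # hypothetical chunk size for silent reasoning tokens
    speech_chunk: int,      # hypothetical chunk size for speech tokens
    reasoning_first: bool = True,
) -> List[str]:
    """Alternate reasoning chunks and speech chunks in one output sequence.

    If one stream runs out of chunks first, the remaining chunks of the
    other stream are appended in order.
    """
    r_chunks = chunk(reasoning, reasoning_chunk)
    s_chunks = chunk(speech, speech_chunk)
    first, second = (r_chunks, s_chunks) if reasoning_first else (s_chunks, r_chunks)

    out: List[str] = []
    for a, b in zip_longest(first, second, fillvalue=[]):
        out.extend(a)
        out.extend(b)
    return out


if __name__ == "__main__":
    reasoning = ["r1", "r2", "r3", "r4"]
    speech = ["s1", "s2", "s3"]
    # Speech-first ordering: the first emitted chunk is audible speech.
    print(interleave_chunks(reasoning, speech, 2, 2, reasoning_first=False))
```

In a real spoken language model the chunks would be produced incrementally by the model rather than split from finished lists; the sketch only shows the ordering of the two token streams.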