🎵 Audio & Speech
🔬 ICLR2026 · 80 paper notes
📌 Same area in other venues: 📷 CVPR2026 (22) · 💬 ACL2026 (72) · 🧪 ICML2026 (36) · 🤖 AAAI2026 (31) · 🧠 NeurIPS2025 (47) · 📹 ICCV2025 (11)
🔥 Top topics: Speech & Audio ×59 · Reasoning ×10 · Alignment/RLHF ×5 · Adversarial Robustness ×5 · Diffusion Models ×5
- AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer
-
AC-Foley is proposed as a reference-audio-guided video-to-audio synthesis framework. Through two-stage training (acoustic feature learning and temporal adaptation) and multimodal conditional flow matching, it achieves fine-grained timbre control, timbre transfer, and zero-shot sound effect generation, significantly outperforming existing methods in audio quality and acoustic fidelity.
- AlignSep: Temporally-Aligned Video-Queried Sound Separation with Flow Matching
-
AlignSep shifts "Video-Queried Sound Separation (VQSS)" from the mainstream time-frequency masking discriminative paradigm to a flow matching-based generative paradigm. By employing a temporally-aligned vector field estimator implemented with "temporal concatenation + non-cross-attention Transformer," it enforces frame-by-frame synchronization between audio and video. This allows for clean extraction of on-screen target sounds in difficult scenarios with intra-class interference and overlapping tracks, achieving a temporal alignment score \(T_{A\text{-}V}\) of 95.76% on the self-constructed VGGSound-Hard benchmark.
- AudioX: A Unified Framework for Anything-to-Audio Generation
-
AudioX utilizes a unified framework based on a Diffusion Transformer (DiT), integrated with a lightweight "Multimodal Adaptive Fusion (MAF)" module and a self-constructed 7-million-sample multimodal dataset, IF-caps. This allows a single set of model weights to generate high-fidelity sound effects and music from arbitrary combinations of text, video, and audio, significantly outperforming specialized models in fine-grained instruction following.
- Aurelius: Relation Aware Text-to-Audio Generation At Scale
-
Aurelius constructs two large-scale decoupled corpora (AudioEventSet with 110 categories of audio events + AudioRelSet with 100 types of relations) and a text-audio pair generation strategy. This pushes "relation-aware text-to-audio generation" from small-scale exploration to a scalable research level. The authors systematically benchmark 9 mainstream TTA models, revealing that they almost entirely fail at modeling multi-event relations (with relation accuracy generally <10%).
- Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?
-
This paper redefines "Automatic Stage Lighting Control (ASLC)" from the long-standing paradigm of "music classification → table lookup" to a generative task. It proposes Skip-BART, an end-to-end model that takes music audio as input and autoregressively generates hue and value for lighting frame-by-frame. A novel skip connection explicitly aligns music and lighting frames. Supported by a self-built dataset, pre-training, and transfer learning, the model outperforms rule-based methods across quantitative metrics and a 38-person subjective evaluation, showing no significant difference from professional lighting designers (p=0.72).
- AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization
-
Addressing the issues of spurious associations and hallucinations in multimodal large language models (MLLMs) during emotion reasoning, this work proposes the EmoReAlM evaluation benchmark and the AVEm-DPO preference optimization method. By constructing targeted preference pairs and text prior regularization, it achieves a zero-shot relative performance gain of 6-19% on DFEW, RAVDESS, and EMER.
- AVEX: What Matters for Animal Vocalization Encoding
-
This is a large-scale empirical study: the authors systematically disassemble "what matters most in training a generalizable bioacoustic encoder." The conclusion is that a two-stage recipe—self-supervised pre-training on a mixture of diverse bioacoustic and general audio data, followed by supervised post-training—is the most effective for both in-distribution and out-of-distribution performance. This approach achieves new SOTA across 26 datasets and four task categories.
- Beyond Instance-Level Alignment: Dual-Level Optimal Transport for Audio-Text Retrieval
-
DART introduces a "feature-level" alignment layer beyond traditional "instance-level" audio-text alignment—treating each embedding channel as a distribution and employing Unbalanced Wasserstein distance to pair audio and text channels. Guided by a "Reliability-Aware Margin" based on variance, kurtosis, and cross-modal correlation to favor stable semantic channels, DART achieves SOTA retrieval performance under mini-batch, label-scarce, and noisy-label conditions.
- Bridging Piano Transcription and Rendering via Disentangled Score Content and Style
-
This paper unifies the inverse tasks of Expressive Performance Rendering (EPR, score-to-performance) and Automatic Piano Transcription (APT, performance-to-score) into a single Transformer Seq2Seq framework. By disentangling "note-level score content" and "global performance style" to achieve bidirectional modeling, and training an additional diffusion model to recommend appropriate styles directly from the score, the rendering is made both controllable and automated.
- Can Speech LLMs Think while Listening?
-
This paper inserts text Chain-of-Thought (CoT) into the text monologue stream of a multi-stream speech LLM (Moshi), enabling reasoning in the text space and improving accuracy by an average of 2.4x. It further proposes a "question completeness" metric based on KL divergence, allowing the model to "think while listening" and initiate reasoning before the user finishes speaking. Combined with DPO preference fine-tuning, this reduces additional reasoning latency by approximately 70% without sacrificing accuracy.
- Closing the Gap Between Text and Speech Understanding in LLMs
-
This paper deconstructs the phenomenon where "speech-adapted LLMs underperform their original text versions in language understanding" into two quantifiable causes: forgetting and cross-modal misalignment. Accordingly, it proposes SALAD—aligning the model on natural speech using cross-modal distillation, followed by active selection driven by misalignment signals to supplement a tiny fraction of synthetic speech. Using an order of magnitude less speech data than competitors, 3B/7B models achieve performance approaching the strongest open-source models across six broad-domain knowledge and reasoning benchmarks.
- Confident and Adaptive Generative Speech Recognition via Risk Control
-
Addressing the issue where fixed \(N\) in "LLM-based Generative Error Correction (GER) for ASR N-best hypotheses" either wastes computation or introduces noise, this paper adaptively determines the number of hypotheses per utterance based on ASR confidence scores. By employing the Learn then Test (LTT) risk control framework, it establishes a high-probability upper bound for "relative optimal performance degradation," reducing the average number of hypotheses by up to 52% across three datasets while maintaining or even improving correction performance.
- Continuous Audio Language Models
-
The authors propose CALM (Continuous Audio Language Models), enabling autoregressive Transformers to directly predict audio frame-by-frame in the continuous latent space of a VAE. By replacing the diffusion head with a "consistency model sampling head" for single-step generation, the model bypasses the hard trade-off between audio quality and computational cost inherent in discrete RVQ tokens. It achieves higher fidelity and faster inference across both speech and music, supporting the release of Pocket TTS, a 100M-parameter model capable of running faster than real-time on a laptop CPU.
- CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition
-
Addressing the issue of massive performance disparities among languages in multilingual speech recognition, this paper identifies that group DRO fails on CTC loss (due to CTC loss variance with audio length and acoustic characteristics, making it incomparable across groups). The authors propose CTC-DRO: using "duration-matched batch sampling" to flatten loss disparities caused by length, and "smooth weighted updates" to prevent weights from being monopolized by a single high-loss group. On five language sets of ML-SUPERB 2.0, it reduces the worst-case language error rate by up to 47.1% and the average error rate by up to 32.9%.
- Data-Centric Lessons To Improve Speech-Language Pretraining
-
This paper systematically migrates mature "data-centric" methodologies from the language/vision domains to speech-language pretraining. Through controlled ablations, it addresses three questions: how to chunk raw audio, how to generate synthetic data, and how to sample interleaved data. These insights are distilled into a 3.8B SpeechLM (SpeLangy), which outperforms models three times its size by 10.2% on Spoken Question-Answering (SQA).
- Discovering and Steering Interpretable Concepts in Large Generative Music Models
-
This work represents the first application of Sparse Autoencoders (SAE) to the audio/music domain, extracting interpretable musical concept features from the residual stream of the autoregressive music generation model MusicGen and leveraging these features for steerable generation.
- DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations
-
DrVoice reduces the speech frame rate entering the LLM from the mainstream 12.5Hz to 5Hz. By employing a dual-resolution scheme—"compression via grouping for understanding and a specialized refinement head at the original frame rate for generation"—this approach saves nearly 50% of training GPU hours and mitigates the frame rate mismatch between speech and text tokens. The 7B open-source model achieves SOTA performance across OpenAudioBench, VoiceBench, UltraEval-Audio, and Big Bench Audio.
- EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models
-
EchoMind is proposed as the first interrelated multi-level benchmark for empathetic dialogue. It systematically evaluates the ability of Speech Language Models (SLMs) to perceive non-verbal acoustic cues and generate empathetic responses through a cognitive workflow of "Understanding → Reasoning → Dialogue."
- Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention
-
This paper proposes the Dolphin model, which maps lip movements into discrete semantic tokens using a dual-path lightweight video encoder (DP-LipCoder) and designs a Global-Local Attention (GLA) separator. It surpasses SOTA on three benchmarks while reducing parameters by 50%+, MACs by 2.4×, and accelerating GPU inference by 6×.
- EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning
-
This work reframes Speech Emotion Recognition (SER) as a deep reasoning problem for the first time, utilizing a prosody-enhanced base model combined with GRPO-PTR (Progressive Trustworthy Reasoning) reinforcement learning to generate explainable emotional reasoning grounded in acoustic evidence.
- FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates
-
FlexiCodec is a high-quality speech codec for ultra-low frame rates (3–12.5Hz). It uses an ASR-feature-guided dynamic frame merging strategy to lower the frame rate while retaining semantic information.
- FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions
-
FlexiVoice utilizes an LLM backbone to simultaneously process text, style instructions, and timbre reference speech. Through a three-stage progressive post-training process comprising "DPO → Decoupling GRPO → Instruction GRPO," it specifically addresses the challenge of style-timbre-content entanglement, enabling zero-shot TTS to accurately follow natural language style instructions while maintaining stable timbre cloning.
- Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation
-
The paper proposes Flow2GAN, a two-stage training framework. It first utilizes improved Flow Matching to learn generative capabilities, then applies GAN fine-tuning to achieve few-step (1/2/4 steps) high-fidelity audio generation, incorporating a multi-resolution network architecture to process Fourier coefficients at different time-frequency resolutions.
- From Natural Alignment to Conditional Controllability in Multimodal Dialogue
-
This paper proposes MM-DIA, a large-scale expressive multimodal dialogue dataset (360 hours, 54.7k dialogues) automatically constructed from movies and TV series, along with the MM-DIA-BENCH benchmark. It formalizes "Controllable Multimodal Dialogue Generation (MDG)" as a unified conditional generation problem covering explicit prompt control and implicit cross-modal control tasks.
- From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training
-
Addressing the fundamental mismatch in end-to-end speech dialogue models that use the same autoregressive objective for both text and audio, TtT unifies Autoregressive (AR) text generation with Non-Autoregressive (NAR) discrete diffusion for audio within a single Transformer. Leveraging the "arbitrary-order AR" property of absorbing state diffusion, it establishes a unified training objective and introduces three training strategies to eliminate the training-inference gap, enabling a 3B model to outperform same-scale and even some 7B baselines across Audio-QA, ASR, AAC, and S2S.
- Gogo: Group-wise Granularity-ordered Codec for Stable and Efficient Speech Generation
-
This paper proposes Gogo—a speech codec that packs several consecutive frames into "groups" and orders tokens within each group from "coarse to fine." Coarse tokens encode high-level semantics, while fine tokens gradually restore acoustic details. Building on this, the authors develop GogoSpeech, a two-stage speech language model (constructing the skeleton first, then adding details), and a GRPO-trained token allocator (dynamically allocating budgets based on group complexity). It achieves SOTA reconstruction quality at an ultra-low token rate of 47 Hz and demonstrates improved stability and efficiency in long-form zero-shot TTS.
- Hierarchical Semantic-Acoustic Modeling via Semi-Discrete Residual Representations for Expressive End-to-End Speech Synthesis
-
VoxCPM employs a differentiable FSQ semi-discrete bottleneck to naturally decouple "semantic-prosody planning" from "fine-grained acoustic rendering" within a single end-to-end model. TSLM generates a stable semantic skeleton, RALM compensates for acoustic residuals, and LocDiT predicts high-fidelity speech latents. Trained on 1 million hours of data, the 0.5B model achieves SOTA performance for open-source zero-shot TTS and operates entirely without reliance on external discrete speech tokenizers.
- Human Behavior Atlas: Benchmarking Unified Psychological and Social Behavior Understanding
-
This paper constructs Human Behavior Atlas, the first large-scale unified multimodal benchmark (101K+ samples) covering four dimensions: emotional, cognitive, pathological, and social processes. It validates the benchmark's effectiveness in multi-task training and transfer learning by training three OmniSapiens-7B model variants.
- Improving Black-Box Generative Attacks via Generator Semantic Consistency
-
By analyzing the semantic degradation phenomenon in the intermediate layer features of generators, this paper proposes a Mean Teacher-based semantic structure-aware framework. It performs self-feature distillation in the early layers of the generator to maintain semantic consistency, thereby enhancing the transferability of adversarial examples across models, domains, and tasks.
- Incentive-Aligned Multi-Source LLM Summaries
-
This work introduces the multi-task peer prediction mechanism from game theory into the LLM multi-source summarization pipeline, proposing the Truthful Text Summarization (TTS) framework. By constructing leave-one-out (LOO) cross-evaluation sets, extracting source stances on claims, and scoring reliability through informative agreement to filter unreliable sources before re-summarizing, the authors theoretically prove that "truthful reporting is the utility-maximizing strategy." Experiments demonstrate effective defense against prompt injection, fake information sources, and collaborative attacks.
- Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards
-
Addressing the phenomenon where "thinking more leads to worse performance" (test-time inverse scaling) in Audio LLMs, this paper employs GRPO online reinforcement learning with a multi-faceted reward suite that rewards the reasoning process itself (consistency, structural patterns, logical depth, domain knowledge, and overthinking penalty). This transforms reasoning from a burden into a gain, achieving SOTA on MMAU and MMSU while outperforming GPT-4o Audio and Gemini 2.5 Pro.
- JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models
-
JALMBench constructs the first large-scale, unified jailbreak evaluation benchmark for Large Audio Language Models (LALM)—comprising 245,000 audio samples, 1,000+ hours, 12 models, 8 attacks, and 5 defenses—systematically revealing security vulnerabilities of LALMs in the audio modality and their correlation with encoding architectures.
- Knowing When to Quit: Probabilistic Early Exits for Speech Separation Networks
-
This paper proposes PRESS: a probabilistic model using a "signal + error variance" framework to estimate interpretable predicted SNR distributions for each early-exit point in speech separation networks. It decides when to stop computation during inference based on the "confidence in reaching the target SNR," achieving dynamic computational scaling without sacrificing reconstruction quality.
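The stopping rule described above can be sketched as follows. This is a minimal illustration, not the PRESS implementation: it assumes each exit reports a Gaussian predictive distribution over SNR (mean and standard deviation in dB), and the names `prob_reaching_target`, `choose_exit`, and `min_confidence` are hypothetical.

```python
import math


def prob_reaching_target(mean_snr_db, std_snr_db, target_snr_db):
    """P(SNR >= target) under an assumed Gaussian predictive distribution."""
    z = (target_snr_db - mean_snr_db) / std_snr_db
    # Survival function of the standard normal, written with math.erf.
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


def choose_exit(predictions, target_snr_db, min_confidence=0.9):
    """Return the index of the first exit that is confident enough.

    predictions: list of (mean_snr_db, std_snr_db), one per exit point,
    ordered from shallowest to deepest. Falls back to the deepest exit.
    """
    for index, (mean, std) in enumerate(predictions):
        if prob_reaching_target(mean, std, target_snr_db) >= min_confidence:
            return index
    return len(predictions) - 1


# Toy per-exit predictions (made-up numbers, not results from the paper).
exits = [(8.0, 2.0), (11.0, 1.5), (14.0, 1.0)]
print(choose_exit(exits, target_snr_db=10.0))
```

A lower `min_confidence` stops earlier and saves computation; a higher value pushes more inputs to deeper exits.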
- Latent Speech-Text Transformer
-
The authors propose Latent Speech-Text Transformer (LST), which aggregates discrete speech tokens into higher-level "latent speech patches" as autoregressive units (similar to how BLT processes bytes). This aligns the sequence modeling granularity of speech and text (reducing from 20× to ~1:1), achieving a +6.5% absolute gain on Speech HellaSwag with sustained gains from 420M to 7B parameters, while simultaneously reducing ASR/TTS inference computational costs.
- Learnable Fractional Superlets with a Spectro-Temporal Emotion Encoder for Speech Emotion Recognition
-
The classic "super-resolution wavelet" (superlet) is transformed into a fully differentiable, end-to-end learnable time-frequency front-end called LFST. It allows the frequency grid, the number of cycles per band, and the fractional mixture weights to be learned from data. Paired with a lightweight STEE encoder, it achieves SOTA results on three speech emotion datasets with minimal parameter counts.
- MambaVoiceCloning: Efficient and Expressive Text-to-Speech via State-Space Modeling and Diffusion Control
-
MVC transforms the entire conditioning path (text/tempo/prosody) of diffusion TTS into pure SSM (Mamba) at inference, eliminating all attention and explicit recurrences. By utilizing a lightweight training-only aligner and maintaining a fixed StyleTTS2 decoder/vocoder, it achieves modest but statistically significant quality gains over StyleTTS2, VITS, and hybrid Mamba-Attention models, while compressing the encoder to 21M parameters and increasing throughput by 1.6×.
- MAPSS: Manifold-Based Assessment of Perceptual Source Separation
-
This paper proposes two complementary metrics, Perceptual Separation (PS) and Perceptual Match (PM). By utilizing diffusion maps to embed self-supervised representations into low-dimensional manifolds, it achieves the first functional decoupling of leakage and self-distortion in source separation. Compared with 18 mainstream metrics, it almost consistently ranks first or second in terms of correlation with subjective scores.
- Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models
-
This paper reveals the prevalent "zero audio-contribution" phenomenon in Large Audio Language Models (LALMs)—where models answer correctly even when the audio is replaced with silence. It proposes a data filtering method based on "audio contribution" and a two-stage post-training paradigm (Weak-to-Strong / Mixed-to-Strong). Combined with the 570k-sample AudioMCQ dataset, it achieves SOTA results on four major audio understanding benchmarks.
- MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
-
The authors propose MMSU (5,000 audio QAs, 47 tasks), the first speech understanding and reasoning benchmark to systematically integrate linguistic theories. Evaluating 22 SpeechLLMs reveals that existing models still exhibit significant performance gaps in phonological perception and complex reasoning.
- Music Flamingo: Scaling Music Understanding in Audio Language Models
-
By constructing a 5-million-scale multi-cultural, full-length, hierarchically-annotated music dataset (MF-Skills + MF-Think) and applying a "SFT → CoT Cold Start → GRPO Reinforcement Learning" training recipe onto an enhanced Audio Flamingo 3 backbone, Music Flamingo elevates audio language models from "identifying surface attributes" to "performing hierarchical, theory-aware music reasoning like a trained musician," achieving new SOTA results on 12+ music understanding and reasoning benchmarks.
- OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models
-
This paper proposes SAGE, a geometry-aware binaural audio encoder, and OWL, a spatial audio large language model. By aligning acoustic features with 3D geometry using Room Impulse Response (RIR) and panoramic depth maps during training—and utilizing only audio during inference—combined with "spatial-anchored Chain-of-Thought (CoT) + curriculum learning," the model achieves clock-face-level azimuth estimation and interpretable multi-step spatial reasoning. It significantly outperforms BAT in DoA error and spatial QA.
- PACE: Pretrained Audio Continual Learning
-
This work systematically constructs the first audio continual learning benchmark and reveals an upstream-downstream mismatch caused by the dominance of low-level time-frequency features in pretrained audio models. The proposed PACE method (Improved First-Session Adaptation + Adaptive Subspace Orthogonal PEFT + Boundary-aware Perturbation) significantly outperforms prior SOTA methods across 6 audio CL benchmarks.
- ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-Aware Speech-to-Speech Interaction
-
The ParaS2S framework is proposed, comprising ParaS2SBench—a benchmark for evaluating paralinguistic awareness (emotion/sarcasm/age/gender) in Speech-to-Speech (S2S) interaction—and ParaS2SAlign, a GRPO-based RL alignment framework. It enables S2S models to learn the ability to adjust responses according to speaking styles with minimal annotated data.
- Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition
-
This paper proposes USR 2.0, which replaces autoregressive pseudo-label generation with CTC-driven teacher forcing. Attention pseudo-labels are generated in a single forward pass, increasing training speed by nearly 2×. By synergizing CTC and attention predictions, it enhances out-of-distribution robustness and achieves SOTA results for unified ASR/VSR/AVSR on LRS3, LRS2, and WildVSR.
- Physics-Informed Audio-Geometry-Grid Representation Learning for Universal Sound Source Localization
-
This paper proposes AGG-RL, which projects "Audio-Geometry Representations" and "Grid Representations" into a shared latent space and generates spatial spectra via inner product similarity. Combined with two physical prior components (Learnable Non-uniform DFT and Relative Microphone Position Encoding), it achieves universal sound source localization without retraining for any array geometry or DOA grid, significantly outperforming existing methods on unseen arrays.
- PrismAudio: Decomposed Chain-of-Thought and Multi-dimensional Rewards for Video-to-Audio Generation
-
PrismAudio decomposes video-to-audio (V2A) generation into four specialized Chains-of-Thought (CoT): semantic, temporal, aesthetic, and spatial. Each CoT is paired with a corresponding reward function, and the model is optimized via multi-dimensional reinforcement learning using efficient Fast-GRPO. It achieves SOTA across four perceptual dimensions on VGGSound and the self-built AudioCanvas with fewer parameters and faster inference.
- Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering
-
The QSTar framework is proposed, which embeds Query Guidance throughout the entire processing pipeline and introduces a spatial-temporal-frequency interaction module (specifically utilizing spectral features to distinguish timbre), significantly enhancing Music Audio-Visual Question Answering (Music AVQA) performance.
- RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
-
The study constructs RedTeamCUA, the first red teaming framework for CUA in hybrid Web-OS environments, along with RTC-Bench containing 864 test cases. It systematically evaluates the vulnerability of 9+ frontier CUAs to indirect prompt injection, finding that all CUAs are attackable (highest ASR 83%). Furthermore, more capable models prove more dangerous—the fact that the Attempt Rate (AR) is significantly higher than the Attack Success Rate (ASR) implies that improvements in model capabilities will directly translate into higher attack success rates.
- Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion
-
The authors propose the Speech-guided Machine Translation (SMT) framework, which utilizes TTS to synthesize speech from source text as a joint input with text for MLLM translation. A self-evolution mechanism automatically filters beneficial synthesized speech samples for continuous training. The method achieves SOTA on Multi30K, surpassing all MMT methods, and reaches average SOTA across 108 translation directions in FLORES-200 with only 9B parameters.
- Scaling Speech Tokenizers with Diffusion Autoencoders
-
This paper proposes SiTok (Speech Diffusion Tokenizer), which utilizes a diffusion autoencoder to jointly train the encoder-quantizer-decoder (as a single-stage process). By incorporating CTC semantic regularization, it ensures that discrete tokens retain linguistic information. Scaled to 1.6B parameters and 22 million hours of speech data, SiTok achieves strong performance with a 3.34% WER (reconstruction) and 4.95% WER (LLM ASR) at an extremely low token rate (12.5Hz / 200bps).
- SiNGER: A Clearer Voice Distills Vision Transformers Further
-
The SiNGER (Singular Nullspace-Guided Energy Reallocation) framework is proposed to suppress high-norm artifacts in ViTs by imposing perturbations in the nullspace direction of teacher features while preserving information signals. Combined with lightweight LoRA adapters for efficient distillation, it achieves SOTA performance across multiple downstream tasks and generates clearer, more interpretable representations.
- SmartDJ: Declarative Audio Editing with Audio Language Model
-
SmartDJ proposes a "declarative audio editing" paradigm — where users specify only the desired outcome (e.g., "transform this recording into a sunny forest"). An Audio Language Model (ALM) acts as a planner to decompose high-level instructions into a sequence of atomic editing steps, which are then incrementally executed by a Stereo Latent Diffusion Model (LDM). It significantly outperforms previous audio editing methods in perceptual quality, spatial realism, and semantic alignment.
- Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences
-
This paper addresses the neglected task of converting spoken mathematical formulas/sentences into LaTeX. It constructs the first large-scale open-source dataset (English and Russian, 66k manual annotations + 571k synthetic audio) and systematically compares "ASR Post-Correction" and "End-to-End Audio LLM" approaches. Notably, SALMONN reduces the Character Error Rate (CER) from MathSpeech's 64% to 17.5% on the self-built S2L-equations benchmark.
- Speech World Model: Causal State–Action Planning with Explicit Reasoning for Speech
-
This paper proposes the Speech World Model (SWM), which decomposes speech understanding into four modules: World Model Activation, Theory of Mind, Speech Act, and Pragmatic Intent. These modules infer states through a Causal Directed Acyclic Graph (DAG). The resulting structured states serve as explicit prompts for instruction-tuned Speech Language Models (SLMs), achieving speech reasoning performance comparable to Gemini 2.5 Pro at a low cost of only 20 GPU-hours.
- SpeechJudge: Towards Human-Level Judgment for Speech Naturalness
-
To fill the missing puzzle piece of "large-scale human preference corpora for naturalness" in speech synthesis, this paper introduces a comprehensive toolkit: a dataset (99K preference pairs), a benchmark (1000 high-consistency samples), and a reward model. By utilizing a "SFT cold-start + GRPO reinforcement" two-stage strategy, Qwen2.5-Omni-7B is trained into a generative reward model, SpeechJudge-GRM. It achieves a 77.2% accuracy rate (79.4% with inference-time voting) in judging speech naturalness, significantly outperforming the classic Bradley-Terry reward model (72.7%).
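For readers unfamiliar with the two scoring schemes compared above, here is a small sketch. It shows the standard Bradley-Terry preference probability and a plain majority vote over repeated judgments. It is an illustration only: the helper names are hypothetical, and the actual voting procedure used for SpeechJudge-GRM may differ.

```python
import math
from collections import Counter


def bradley_terry_prob(reward_a, reward_b):
    """Bradley-Terry model: P(A preferred over B) = sigmoid(r_A - r_B)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def majority_vote(judgments):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(judgments).most_common(1)[0][0]


# A scalar reward model compares two utterances through their scores.
print(bradley_terry_prob(1.2, 0.4))

# A generative judge can be sampled several times and its verdicts pooled.
print(majority_vote(["A", "B", "A", "A", "B"]))
```

Pooling several sampled verdicts trades extra inference cost for a more stable decision, which is consistent with the accuracy gain reported with inference-time voting.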
- SpeechOp: Inference-Time Task Composition for Generative Speech Processing
-
SpeechOp transforms a pretrained TTS diffusion model into a "universal speech processor." By using a single multi-task latent diffusion model, it simultaneously handles synthesis, enhancement, and separation. Crucially, it introduces the TC-CFG guidance strategy, allowing independently learned capabilities to be freely combined at inference time (e.g., using ASR-generated transcripts to guide enhancement), achieving SOTA in content fidelity for speech enhancement (WER reduced by 46% relative to HiFi-GAN-2).
- StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
-
Addressing the fragility of semantic speech tokenizers where bit-level noise causes dramatic token sequence jumps, StableToken introduces the Voting-LFQ architecture with "multi-branch quantization + differentiable bit-level majority voting" alongside "noise-aware consensus training." This reduces the Unit Edit Distance under noise from 26.17% to 10.17% (a 60%+ relative reduction) and significantly boosts the robustness of downstream SpeechLLMs in ASR/SER/TTS tasks.
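The intuition behind bit-level voting can be shown with a hard, inference-time majority vote. This is only a sketch under simplifying assumptions: the paper describes a differentiable vote inside the Voting-LFQ quantizer, whereas the code below uses plain 0/1 lists, and `majority_vote_bits` and `bits_to_token` are hypothetical helper names.

```python
def majority_vote_bits(branch_codes):
    """Take the majority value at each bit position across branches.

    branch_codes: list of equal-length lists of 0/1 values, one per branch.
    Use an odd number of branches so that no position can tie.
    """
    n_branches = len(branch_codes)
    return [1 if 2 * sum(bits) > n_branches else 0 for bits in zip(*branch_codes)]


def bits_to_token(bits):
    """Pack a most-significant-bit-first bit list into an integer token id."""
    token = 0
    for bit in bits:
        token = (token << 1) | bit
    return token


# Three branches quantize the same frame; noise flips a different bit in two of them.
branches = [
    [1, 0, 1, 1],  # clean
    [1, 1, 1, 1],  # bit 1 flipped
    [1, 0, 0, 1],  # bit 2 flipped
]
voted = majority_vote_bits(branches)
print(voted, bits_to_token(voted))
```

Because a single flipped bit changes the token id entirely, correcting isolated flips before packing is what keeps the token sequence stable.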
- STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
-
This paper introduces the concept of "Audio 4D Intelligence" (physical reasoning of sound source dynamics in 1D time + 3D space) and constructs the STAR-Bench benchmark. Using a dual pipeline of procedural synthesis and four-stage human annotation, 2,353 questions were generated to specifically test fine-grained auditory cues that are "difficult to describe in text." Evaluation of 19 large audio models reveals that even the strongest model, Gemini 2.5 Pro, achieves only 49.6% average accuracy, far below the human level of approximately 79%.
- Steering Autoregressive Music Generation with Recursive Feature Machines
-
This paper proposes MusicRFM, which utilizes Recursive Feature Machines (RFM) to extract "concept directions" corresponding to music theory concepts such as notes, chords, and scales from the hidden activations of MusicGen. During inference, these directions are directly injected into the residual stream to steer generation in real-time—requiring no retraining or step-wise optimization. The method increases the target note hit rate from 0.23 to 0.82 while maintaining text alignment (CLAP) with only a marginal drop (approx. 0.02 compared to baseline).
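Injecting a concept direction into a hidden state amounts to an additive shift. The sketch below assumes the direction has already been extracted and shows only the injection step; `steer` and `strength` are hypothetical names, plain Python lists stand in for activation tensors, and this is not MusicRFM's code.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("concept direction must be non-zero")
    return [x / norm for x in vector]


def steer(hidden_state, concept_direction, strength):
    """Return hidden_state + strength * unit(concept_direction)."""
    unit = normalize(concept_direction)
    return [h + strength * u for h, u in zip(hidden_state, unit)]


# Toy 2-D example: push the activation along the second axis.
print(steer([1.0, 0.0], [0.0, 2.0], strength=0.5))
```

The `strength` value controls the trade-off the summary mentions: larger shifts hit the target concept more often but risk drifting from the text prompt.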
- Stitch: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models
-
Stitch is proposed to enable "thinking while talking" in Spoken Language Models (SLMs) by interleaving silent reasoning tokens with speech tokens in chunks. It leverages the idle computation time during audio playback to perform reasoning. Stitch-S achieves first-frame latency identical to non-reasoning baselines while improving mathematical reasoning accuracy by approximately 15 percentage points.
- SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization
-
This paper decomposes the gradient of contrastive learning into "pulling force" and "pushing force". It discovers that the component of the negative sample pushing force perpendicular to the pulling force contains rich information but is uncontrolled, leading to optimization trajectory drift. Therefore, it proposes Support Vector Regularization (SVR): constructing a text support vector shifted towards the positive sample and using a semantic radius \(R\) to adaptively suppress this perpendicular component. This improves InfoNCE / SigLIP on audio-text retrieval and zero-shot classification without adding any inference overhead.
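The parallel/perpendicular split described above is an ordinary vector projection, sketched below. This shows only the decomposition, not SVR itself; how the support vector and the semantic radius \(R\) act on the perpendicular part is not reproduced here, and the names `decompose`, `push`, and `pull` are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def decompose(push, pull):
    """Split `push` into components parallel and perpendicular to `pull`."""
    scale = dot(push, pull) / dot(pull, pull)
    parallel = [scale * p for p in pull]
    perpendicular = [a - b for a, b in zip(push, parallel)]
    return parallel, perpendicular


pull = [1.0, 0.0]   # toy stand-in for the positive-sample pulling force
push = [3.0, 4.0]   # toy stand-in for the negative-sample pushing force
parallel, perpendicular = decompose(push, pull)
```

By construction the perpendicular part has zero inner product with the pulling force, so it cannot be offset by the pull; this is the component the summary describes as uncontrolled.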
- SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation
-
SyncTrack is proposed with a unified architecture featuring track-shared modules (dual cross-track attention to ensure rhythmic synchronization) and track-specific modules (learnable instrument priors to preserve timbre differences). Together with three new rhythmic consistency evaluation metrics (IRS/CBS/CBD), it significantly improves multi-track music generation quality (FAD reduced from 6.55 to 1.26, subjective MOS 3.42 vs. 1.57).
- TangoFlux: Super-Fast and Faithful Text-to-Audio Generation with Flow Matching and CLAP-Ranked Preference Optimization
-
TangoFlux utilizes a 515M parameter rectified flow matching model to generate 30-second 44.1kHz audio in just 3.7 seconds on an A40. It proposes CRPO—using CLAP as a proxy reward to generate self-improving preference pairs online—enabling this compact model to achieve SOTA across objective and subjective text-to-audio metrics.
- TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
-
This paper proposes TASTE (Text-Aligned Speech Tokenization and Embedding), which aligns speech tokens with text transcriptions through a cross-attention mechanism. This achieves high-quality speech reconstruction at an extremely low bitrate (~150 bps) and makes joint text-speech modeling direct and efficient. The 1.3B parameter TASLM outperforms 7B pre-trained SLMs.
- The Devil behind the Mask: An Emergent Safety Vulnerability of Diffusion LLMs
-
This paper is the first to systematically reveal inherent safety vulnerabilities in diffusion large language models (dLLMs) caused by bidirectional modeling and parallel decoding mechanisms. It proposes the DiJA jailbreak attack framework, which achieves nearly 100% attack success rates on multiple aligned dLLMs through interleaved mask-text prompts.
- Token-based Audio Inpainting via Discrete Diffusion
-
This paper proposes AIDD, which compresses audio into discrete token sequences using a pretrained tokenizer (WavTokenizer) and then performs absorbing state discrete diffusion within this token space to fill missing segments. With training improvements including span masking and derivative smoothing regularization, AIDD achieves higher stability and lower distortion on medium-to-long gaps (150–750 ms) in MusicNet/MAESTRO compared to strong diffusion baselines (like CQT-Diff+), while being smaller and faster.
- Toward Complex-Valued Neural Networks for Waveform Generation
-
ComVo is proposed as the first iSTFT vocoder utilizing Complex-Valued Neural Networks (CVNN) in both the generator and discriminator. It stabilizes training via a Phase Quantization layer and introduces a block matrix calculation scheme to reduce training time by 25%, achieving synthesis quality superior to real-valued baselines such as Vocos on LibriTTS.
- Towards True Speech-to-Speech Models Without Text Guidance
-
This paper proposes a true speech-to-speech large model: starting from a pre-trained text LLM (Qwen3-8B), it utilizes "modality-based layer split" and "frozen pre-training" to directly understand and generate speech without relying on intermediate text. It achieves SOTA performance in speech QA while almost completely recovering from the text capability degradation commonly seen when extending to new modalities.
- TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization
-
TripleSumm is proposed to achieve dynamic frame-level modality importance adjustment using Multi-Scale Temporal blocks (hierarchical sliding window attention) and Cross-Modal Fusion blocks (adaptive weighting of vision/text/audio via a fusion token). The authors also release MoSu, the first large-scale triple-modality video summarization dataset (52,678 videos), achieving SOTA results across 4 benchmarks.
- TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
-
Addressing the issue where modern TTS systems approach human quality and traditional MOS/objective metrics fail, this paper proposes TTSDS2—an unsupervised objective metric that factorizes speech into four perceptual factors and uses 2-Wasserstein distance to measure "how close the synthetic distribution is to the real one and how far it is from noise." It is the only metric among 16 candidates to achieve a Spearman correlation >0.5 (average 0.67) across all domains and subjective scores. The authors also release 11,000 subjective ratings, a leakage-proof multilingual reconstruction pipeline, and a benchmark covering 14 languages.
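To make the "close to real, far from noise" comparison concrete, the sketch below computes the 2-Wasserstein distance between two one-dimensional empirical distributions with equal sample counts, where it reduces to the root-mean-square difference of sorted samples. The toy feature values and the name `wasserstein2_1d` are assumptions for illustration; the factor features and the final score aggregation used by TTSDS2 are not reproduced here.

```python
import math


def wasserstein2_1d(xs, ys):
    """2-Wasserstein distance between two equal-size 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    pairs = zip(sorted(xs), sorted(ys))
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(xs))


# Toy one-dimensional feature values (purely illustrative numbers).
real = [0.0, 1.0, 2.0, 3.0]
synthetic = [0.5, 1.5, 2.5, 3.5]
noise = [10.0, 11.0, 12.0, 13.0]

d_real = wasserstein2_1d(synthetic, real)
d_noise = wasserstein2_1d(synthetic, noise)
```

A system whose features sit near the real distribution and far from the noise distribution yields a small `d_real` and a large `d_noise`, as in this toy example.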
- TVTSyn: Content-Synchronized Time-Varying Timbre for Streaming Voice Conversion and Anonymization
-
To address the representation mismatch in streaming voice conversion—where "content varies frame-by-frame while speaker identity is injected as a static global vector"—TVTSyn utilizes a retrievable Global Timbre Memory (GTM) to expand static timbre into multiple timbre facets. Frame-level content performs attention-based retrieval, followed by gating and spherical interpolation to generate content-synchronized Time-Varying Timbre (TVT). Combined with a factorized VQ bottleneck to remove residual speaker info, TVTSyn achieves superior naturalness, speaker transfer, and anonymization effects compared to existing streaming baselines, with a GPU latency of <80ms.
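The spherical interpolation step can be sketched with the standard slerp formula for unit vectors. This is a generic illustration under assumed inputs (two unit-norm timbre vectors that are not antipodal, and a mixing weight `t` standing in for a gate value); it is not the paper's retrieval or gating code.

```python
import math


def slerp(a, b, t):
    """Spherical interpolation between unit vectors a and b, t in [0, 1].

    Assumes a and b are unit-norm and not antipodal.
    """
    cos_theta = sum(x * y for x, y in zip(a, b))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    theta = math.acos(cos_theta)
    if theta < 1e-6:
        return list(a)  # vectors nearly identical; nothing to interpolate
    weight_a = math.sin((1.0 - t) * theta) / math.sin(theta)
    weight_b = math.sin(t * theta) / math.sin(theta)
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


facet_a = [1.0, 0.0]
facet_b = [0.0, 1.0]
mixed = slerp(facet_a, facet_b, 0.5)
mixed_norm = math.sqrt(sum(x * x for x in mixed))
```

Unlike linear interpolation, the result stays on the unit sphere, so the interpolated timbre vector keeps the same norm as its endpoints.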
- UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
-
UALM utilizes a single autoregressive language model to unify audio understanding, text-to-audio generation, and multimodal reasoning. It first demonstrates that a pure LM directly predicting audio tokens can match the generation quality of diffusion models (UALM-Gen), then integrates all three capabilities into one model via data mixing and modal alignment (UALM), and finally empowers the model with generative multimodal reasoning through interleaved "text+audio" Chain-of-Thought for planning, self-evaluation, and revision (UALM-Reason).
- UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice
-
UniSS discretizes speech into three types of tokens (speaker, linguistic, and semantic) and integrates them directly into a pre-trained text LLM (Qwen2.5-1.5B). By employing a single-stage autoregressive model with a "Listen-Translate-Speak" cross-modal Chain-of-Thought (CoT) prompt, it transfers the LLM's inherent text translation capabilities to the speech domain. This approach achieves accurate translation while preserving the original speaker's timbre, emotion, and duration. The authors also release UniST, a 44.8k-hour Chinese-English expressive S2ST dataset.
- Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification
-
The authors point out that audio self-supervised models (SSL) rely on expensive fine-tuning to achieve SOTA on AudioSet because lightweight linear probing performs poorly. The root cause is not the embedding quality, but a "global pooling bottleneck": the [cls]-token compresses scattered, local sound events into a single vector, losing critical information. They propose protobin (Binarized Prototype Probe), which uses a set of on-the-fly binarized, class-agnostic prototypes to perform per-class, multi-vector aggregation on the full token map. By simply adding a single prototype layer, it significantly outperforms linear and attentive probes, re-establishing probing as an efficient and reliable paradigm for evaluating audio SSL.
- VibeVoice: Expressive Podcast Generation with Next-Token Diffusion
-
VibeVoice utilizes an ultra-low frame rate (7.5 Hz) continuous speech tokenizer to compress long audio into extremely short sequences. It then employs an LLM in a "next-token diffusion" framework to predict acoustic latents segment-by-segment, enabling zero-shot synthesis of podcasts up to 90 minutes with up to 4 speakers, including natural turn-taking and non-lexical details like breathing and lip-smacking.
- VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation
-
VowelPrompt is proposed to extract vowel-level prosodic descriptors (pitch/energy/duration) based on phonetic evidence, converting them into natural language to enhance LLM prompts for emotion recognition. Combined with a two-stage SFT+GRPO training strategy, it consistently outperforms SOTA under zero-shot, fine-tuned, cross-domain, and cross-lingual conditions while generating interpretable emotional reasoning.
- WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables
-
WearVox utilizes AI glasses to collect 3,842 segments of egocentric, multichannel audio from real-world wearable scenarios, covering five task categories: Search QA, Closed-book QA, Side-speech Rejection, Tool Use, and Speech Translation. Systematic evaluation of mainstream Speech Large Language Models (SLLMs) reveals that real-time model accuracy ranges only from 29% to 59% and degrades severely under outdoor noise. A multichannel SLLM case study demonstrates that spatial audio cues significantly enhance noise robustness and device-directed speech discrimination.
- When and Where to Reset Matters for Long-Term Test-Time Adaptation
-
This paper proposes ASR, an adaptive selective reset scheme that dynamically determines when to reset via prediction concentration \(\mathcal{C}_t\) (avoiding the suboptimality of fixed cycles) and where to reset via a progressive layer selection strategy from the output layer to the input layer (preserving valuable adaptation knowledge). Combined with importance-aware regularization to recover key knowledge lost to resets and on-the-fly adjustments, it achieves a 44.12% improvement over the SOTA on CCC-Hard.
- When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
-
This paper finds that Attack Success Rates (ASR) in LLM jailbreak benchmarks are artificially inflated by semantically irrelevant style patterns (e.g., "create a list"). This phenomenon is observed in nearly all of the 36 LLMs evaluated, and superficial style alignment fine-tuning further exacerbates the risk. The authors propose SafeStyle, which mitigates these risks using style-augmented safety training data.
- YuE: Scaling Open Foundation Models for Long-Form Music Generation
-
YuE scales the LLaMA2 architecture to trillions of tokens to train the first open-source "lyrics-to-song" foundation model. By employing dual-token track decoupling (separate prediction of vocals/accompaniment), structural progressive conditioning (interleaving lyrics and audio by segments), and re-designed music in-context learning, it generates songs up to 5 minutes long with aligned lyrics and vivid vocals, matching or exceeding certain commercial systems (e.g., Udio, Tiangong) in musicality.