ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling¶
Conference: ACL 2026
arXiv: 2510.08878
Code: Project Page
Area: Audio Generation
Keywords: Text-to-Audio, Temporal Control, Intelligible Speech, Progressive Diffusion, DiT
TL;DR¶
This paper proposes ControlAudio, a unified progressive diffusion modeling framework that delivers three capabilities in a single diffusion model: text-guided generation, precise temporal control, and intelligible speech synthesis. It does so via three-stage progressive training (TTA pretraining → temporal-control fine-tuning → joint temporal + intelligible-speech training) and progressive guidance sampling, significantly outperforming existing methods in temporal precision and speech intelligibility.
Background & Motivation¶
Background: Text-to-audio (TTA) generation has achieved significant progress through large-scale diffusion models. Recent research has begun exploring fine-grained control: one line of work achieves precise temporal control (e.g., "bird chirping, 2-5 seconds"), while another enables intelligible speech generation (audio containing clearly distinguishable speech content).
Limitations of Prior Work: (1) Due to scarce large-scale annotated data (datasets containing both temporal annotations and speech transcriptions are extremely limited), controllable TTA remains constrained in performance even after scaling; (2) No prior work has achieved both temporal control and intelligible speech generation within a unified framework; (3) Adding fine-grained control signals often sacrifices generation quality under pure text conditions (catastrophic forgetting); (4) Natural language descriptions of complex multi-event scenarios contain ambiguities.
Key Challenge: Controllable TTA requires handling condition signals at multiple granularities (text → temporal → phoneme), but training data at different granularities varies drastically in scale (millions of text-audio pairs vs. tens of thousands of temporal annotations), making direct joint training ineffective.
Goal: Unify text-guided generation, temporal indication, and intelligible speech capabilities within a single diffusion model without sacrificing individual capabilities.
Key Insight: Model controllable TTA as a multi-task learning problem through progressive diffusion modeling—adopting coarse-to-fine progressive strategies across data construction, model training, and guidance sampling.
Core Idea: Progressive modeling naturally aligns with the hierarchical nature of control granularity (text → temporal → phoneme) and the coarse-to-fine characteristics of diffusion sampling—emphasizing coarse-grained temporal structure in early diffusion stages and introducing fine-grained phoneme content in later stages.
Method¶
Overall Architecture¶
Three-stage progressive training: Stage 1 pretrains DiT on large-scale text-audio pairs → Stage 2 fine-tunes on temporal annotation data (switching between text/text+temporal conditions to avoid forgetting) → Stage 3 jointly trains on full multi-source data (unfreezing text encoder for joint optimization). During inference, progressive guidance sampling is employed: early diffusion steps use temporal condition guidance, while later steps introduce phoneme conditions.
Key Designs¶
- Unified Semantic Modeling of Structured Prompts:
  - Function: Encode text, temporal, and phoneme condition signals with a single text encoder
  - Mechanism: A structured prompt format uses special tokens to separate event descriptions from precise start/end times (e.g., `<event>bird chirping<start>2.0<end>5.0`), eliminating natural-language ambiguity. Temporal windows naturally define speech duration, and phoneme tokens are added to the text encoder's vocabulary for unified encoding (a serialization sketch follows this list)
  - Design Motivation: In natural-language descriptions of complex scenarios, a phrase like "from...to" could mean either a pitch change or a time range; the structured format removes such ambiguity and is extensible
- Progressive Diffusion Training:
  - Function: Progressively acquire finer-grained control while preserving previously learned capabilities
  - Mechanism: Stage 1 pretrains with text-only conditions to establish a high-fidelity TTA prior. Stage 2 fine-tunes on temporal data while randomly switching between pure-text and text+temporal conditions to avoid catastrophic forgetting. Stage 3 unfreezes the text encoder for joint optimization, switching among text, text+temporal, and text+temporal+phoneme conditions (a condition-switching sketch follows this list)
  - Design Motivation: Training with all conditions at once degrades performance because data scales and task complexity are imbalanced; the progressive strategy lets the model build coarse-to-fine control step by step
- Progressive Guidance Sampling:
  - Function: Guide diffusion sampling hierarchically by condition granularity during inference
  - Mechanism: Diffusion sampling is inherently coarse-to-fine: early steps generate large-scale structure, later steps synthesize fine details. Progressive guidance therefore emphasizes temporal conditions early (fixing event time windows) and introduces phoneme conditions later (fixing speech content), matching the sampler's natural behavior (a sampling sketch follows this list)
  - Design Motivation: A fixed guidance signal cannot match the needs of different diffusion stages; progressive guidance aligns condition granularity with sampling granularity
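To make the structured prompt format concrete, here is a minimal Python sketch of how multi-event conditions might be serialized. Only the `<event>`/`<start>`/`<end>` tokens follow the paper's example; the `build_structured_prompt` helper, the `<phoneme>` token, and the event-dict fields are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch of structured prompt serialization.
# Only <event>/<start>/<end> follow the paper's example; the <phoneme>
# token and this helper are assumptions for demonstration.

def build_structured_prompt(events: list[dict]) -> str:
    """Serialize (description, start, end[, phonemes]) events into one prompt."""
    parts = []
    for ev in events:
        s = f"<event>{ev['text']}<start>{ev['start']:.1f}<end>{ev['end']:.1f}"
        if "phonemes" in ev:  # intelligible-speech events also carry content
            s += "<phoneme>" + " ".join(ev["phonemes"])
        parts.append(s)
    return "".join(parts)

# Two overlapping events: a sound effect and an intelligible utterance.
prompt = build_structured_prompt([
    {"text": "bird chirping", "start": 2.0, "end": 5.0},
    {"text": "a man speaking", "start": 0.0, "end": 4.0,
     "phonemes": ["HH", "EH", "L", "OW"]},
])
print(prompt)
```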
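The stage-wise condition switching can be sketched as below. The uniform mixing probabilities are an assumption, since the paper states only that conditions are randomly alternated to prevent forgetting.

```python
import random

def pick_condition(batch: dict, stage: int) -> str:
    """Per-step condition switching across the three training stages.

    Uniform switching probabilities are an assumption; the paper specifies
    only that conditions are randomly alternated to avoid forgetting.
    """
    if stage == 1:  # Stage 1, TTA pretraining: text-only conditions
        return batch["text"]
    if stage == 2:  # Stage 2, temporal fine-tuning: mix text and text+temporal
        return random.choice([batch["text"], batch["text_temporal"]])
    # Stage 3: all granularities (text encoder unfrozen and jointly trained)
    return random.choice([batch["text"],
                          batch["text_temporal"],
                          batch["text_temporal_phoneme"]])
```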
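And a classifier-free-guidance-style sketch of progressive guidance sampling. The hard switch point, the guidance weight, and the `step_fn` scheduler hook are illustrative assumptions rather than the paper's exact schedule.

```python
import torch

@torch.no_grad()
def progressive_guidance_sample(model, step_fn, x, timesteps,
                                c_text, c_temporal, c_phoneme,
                                w=3.0, switch=0.5):
    """Coarse-to-fine guidance: temporal condition early, phoneme condition late.

    `model(x, t, cond)` predicts noise; `step_fn(x, eps, t)` is the scheduler
    update. The hard switch at `switch` and the weight `w` are assumptions.
    """
    n = len(timesteps)
    for i, t in enumerate(timesteps):          # ordered from high to low noise
        frac = i / max(n - 1, 1)               # 0 at the start, 1 at the end
        eps_text = model(x, t, c_text)         # coarse text-only prediction
        cond = c_temporal if frac < switch else c_phoneme
        eps_cond = model(x, t, cond)
        eps = eps_text + w * (eps_cond - eps_text)  # CFG-style extrapolation
        x = step_fn(x, eps, t)
    return x
```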
Loss & Training¶
Standard conditional diffusion training objective (noise prediction). Data construction: annotated data is extracted from AudioSet-SL, with speech segments transcribed by Gemini 2.5 Pro; synthetic data is built from LibriTTS-R by combining single- and multi-speaker scenarios and mixing them with non-speech backgrounds (SNR 2-10 dB), yielding 171,246 complex audio scenes (a mixing sketch follows).
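A minimal NumPy sketch of SNR-controlled mixing for the synthetic data; only the 2-10 dB range comes from the paper, while the helper name, length trimming, and epsilon are assumptions.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, background: np.ndarray,
               snr_db: float) -> np.ndarray:
    """Scale the background so speech sits at the target SNR, then mix.

    Resampling, loudness normalization, and clipping handling are omitted;
    only the 2-10 dB SNR range is taken from the paper.
    """
    n = min(len(speech), len(background))
    speech, background = speech[:n], background[:n]
    p_speech = np.mean(speech ** 2)
    p_bg = np.mean(background ** 2) + 1e-12    # avoid division by zero
    scale = np.sqrt(p_speech / (p_bg * 10 ** (snr_db / 10)))
    return speech + scale * background

# Example: mix 1 s of 16 kHz audio at a random SNR in the paper's range.
mixed = mix_at_snr(np.random.randn(16000), np.random.randn(16000),
                   snr_db=float(np.random.uniform(2.0, 10.0)))
```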
Key Experimental Results¶
Main Results¶
Temporal Control Evaluation on AudioCondition Test Set
| Method | Eb ↑ | At ↑ | FAD ↓ | CLAP ↑ | Temporal(subjective) ↑ | OVL(subjective) ↑ |
|---|---|---|---|---|---|---|
| Ground Truth | 43.37 | 67.53 | - | 0.377 | 4.52 | 4.48 |
| Stable Audio | 11.28 | 51.67 | 1.93 | 0.318 | 1.94 | 3.44 |
| PicoAudio | 29.96 | 57.70 | 3.43 | 0.296 | 2.70 | 2.44 |
| ControlAudio | 38.50 | 67.87 | 0.98 | 0.347 | 4.01 | 3.74 |
Ablation Study¶
Training Strategy Ablation
| Config | At ↑ | FAD ↓ | Speech WER ↓ |
|---|---|---|---|
| Stage 1 only | baseline | baseline | no speech capability |
| + Stage 2 (temporal) | significant improvement | slight decrease | no speech capability |
| + Stage 3 (temporal+speech) | best | best | best |
| No progressive guidance (fixed) | decrease | increase | increase |
Key Findings¶
- ControlAudio's temporal precision matches Ground Truth (At 67.87 vs. 67.53), far exceeding other methods
- Stage 3 joint training not only unlocks speech capability but further improves temporal precision—benefiting from richer time-content alignment signals in temporal-annotated speech data
- Unfreezing text encoder for joint optimization is critical—enabling condition encoding and generation backbone to synergistically adapt to complex multi-objective tasks
- Progressive guidance sampling significantly outperforms fixed guidance—alignment of condition granularity with sampling granularity improves generation quality
- CoT LLM planning can transform free text into structured prompts, expanding practical use cases
Highlights & Insights¶
- Progressive design permeates data → training → inference, forming a consistent coarse-to-fine paradigm
- Structured prompts + phoneme extension enable handling three condition types with a single text encoder, avoiding multi-module complexity
- Stage 3 joint training counterintuitively improves temporal precision, a case of positive transfer in multi-task learning
Limitations & Future Work¶
- SNR range (2-10 dB) of synthetic data may not cover all real-world scenarios
- Speaker identity control for speech generation remains unexplored
- Long audio generation beyond 10 seconds not evaluated
- Relies on external LLMs to convert free text into structured prompts
Related Work & Insights¶
- vs PicoAudio/AudioComposer: These achieve temporal control but cannot generate intelligible speech; ControlAudio is the first to unify both
- vs VoiceLDM/VoiceDiT: These focus on speech synthesis but lack temporal control over general audio events
- vs progressive modeling in video generation: ControlAudio is the first to bring progressive modeling to controllable TTA
Rating¶
- Novelty: ⭐⭐⭐⭐ Novel framework design combining progressive diffusion modeling + unified semantic encoding
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive subjective and objective evaluations, thorough ablation studies
- Writing Quality: ⭐⭐⭐⭐ Systematic and clear method description, well-articulated progressive design motivation
- Value: ⭐⭐⭐⭐ First to unify temporal control and intelligible speech generation, advancing controllable audio generation