T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback
Conference: ACL 2025
arXiv: 2505.10561
Code: https://T2Afeedback.github.io
Area: Audio & Speech
Keywords: Text-to-Audio, AI Feedback, Preference Tuning, Multi-event Audio, Fine-grained Evaluation
TL;DR
Proposes three fine-grained AI scoring pipelines (event occurrence, event sequence, and acoustic and harmonic quality) that replace human annotation in building a large-scale audio preference dataset, T2A-Feedback (41K prompts, 249K audios). Preference tuning on this data strengthens the basic capabilities of TTA models and improves multi-event audio generation in both simple (AudioCaps) and complex (T2A-EpicBench) scenarios.
Background & Motivation
Background: Text-to-Audio (TTA) generation models can produce diverse audio but perform poorly in complex multi-event scenarios: they omit described events, break the stated event order, or fail to blend multiple events harmoniously.
Limitations of Prior Work: (a) Existing evaluation metrics like CLAP only assess global audio-text alignment, failing to evaluate event occurrence, sequence, and harmony in a fine-grained manner; (b) Human annotation of audio preference data is extremely costly and unscalable; (c) There is a lack of evaluation benchmarks specifically targeting complex multi-event or narrative scenarios.
Key Challenge: Advanced TTA applications (e.g., narrative audio, video dubbing) require precise multi-event control, yet the "basic capabilities" of models (event inclusion, sequence, harmony) remain insufficient, necessitating targeted enhancements for each basic capability.
Goal: To replace human annotation with automated AI feedback, constructing fine-grained preference data and evaluation metrics for each of the three basic capabilities of TTA models.
Key Insight: Deconstructing TTA preference tuning into three independently evaluable dimensions (event occurrence, event sequence, acoustic harmony) and designing dedicated AI scoring pipelines for each dimension.
Core Idea: Three-dimensional fine-grained AI scoring \(\rightarrow\) large-scale preference dataset \(\rightarrow\) preference tuning to enhance basic capabilities.
Method
Overall Architecture
(1) Design three AI scoring pipelines: Event Occurrence Score (EOS), Event Sequence Score (ESS), and Acoustic & Harmonic Quality Score (AHQ); (2) Use these three pipelines to score generated audio at scale, constructing the T2A-Feedback preference dataset (41K prompts, 249K audios); (3) Enhance existing TTA models using preference tuning (a DPO variant). Additionally, the study constructs T2A-EpicBench to evaluate complex scenarios.
Key Designs
- Event Occurrence Score (EOS):
  - Function: Verifies whether each event described in the text occurs in the audio.
  - Mechanism: Decomposes the prompt into independent event descriptions, calculates a CLAP-based semantic matching score between each event and the audio separately, and treats low-scoring events as missing.
  - Design Motivation: Global matching via CLAP cannot distinguish audio that contains all events from audio that contains only some of them.
- Event Sequence Score (ESS):
  - Function: Verifies whether the sequence of events in the audio aligns with the text description.
  - Mechanism: Uses an audio event detection model to estimate the onset and offset times of each event, then compares them with the event sequence in the text.
  - Design Motivation: Multi-event audio must not only contain all events but also organize them in the correct order.
- Acoustic & Harmonic Quality (AHQ):
  - Function: Evaluates the overall acoustic quality and the harmony among multiple events in the audio.
  - Mechanism: Manually annotates the acoustic and harmonic quality of a subset of audios and trains an automatic predictor.
  - Design Motivation: Even if all events exist in the correct order, acoustically unharmonious transitions (e.g., abrupt switching, noise interference) represent low quality.
- T2A-EpicBench Evaluation Benchmark:
  - Function: Evaluates advanced capabilities of TTA models under long-format, multi-event, and narrative settings.
  - Mechanism: Constructs a test set comprising long fantasy, narrative, and story descriptions, which is significantly more challenging than the simple descriptions in AudioCaps.
  - Design Motivation: Existing benchmarks (like AudioCaps) have overly simplistic descriptions, which are insufficient to evaluate complex applications.
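The EOS and ESS mechanisms described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `scorer` callable (standing in for a CLAP-style audio-text similarity), the `threshold` value, the pairwise-order formulation of ESS, and the `onsets` dictionary (standing in for the output of an audio event detector) are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer type: maps (event text, audio id) to a similarity score.
# In practice this would wrap a CLAP-style audio-text model.
Scorer = Callable[[str, str], float]


def event_occurrence_score(
    events: List[str], audio: str, scorer: Scorer, threshold: float = 0.5
) -> Tuple[float, List[str]]:
    """Return (fraction of events judged present, list of missing events).

    An event counts as missing when its similarity falls below `threshold`
    (an assumed cutoff, not a value taken from the paper).
    """
    if not events:
        return 0.0, []
    missing = [e for e in events if scorer(e, audio) < threshold]
    return (len(events) - len(missing)) / len(events), missing


def event_sequence_score(events: List[str], onsets: Dict[str, float]) -> float:
    """Fraction of event pairs whose detected onset order matches the text order.

    `events` is in text order; `onsets` maps an event to its detected onset
    time in seconds. Pairs with an undetected event count as misordered.
    """
    pairs = [(i, j) for i in range(len(events)) for j in range(i + 1, len(events))]
    if not pairs:
        return 1.0
    correct = 0
    for i, j in pairs:
        first, second = onsets.get(events[i]), onsets.get(events[j])
        if first is not None and second is not None and first < second:
            correct += 1
    return correct / len(pairs)


# Toy usage with hand-written similarities and onsets (illustrative values only).
events = ["a dog barks", "a door slams", "rain falls"]
toy_similarity = {"a dog barks": 0.8, "a door slams": 0.2, "rain falls": 0.7}
eos, missing = event_occurrence_score(
    events, "clip.wav", lambda event, audio: toy_similarity[event]
)
ess = event_sequence_score(
    events, {"a dog barks": 0.5, "a door slams": 3.0, "rain falls": 1.2}
)
print(eos, missing, ess)
```

Scoring each event separately is what lets the pipeline report which event is missing, something a single global similarity score cannot do.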
Loss & Training
- Preference tuning based on a DPO variant.
- The three-dimensional scores are used individually or jointly to construct preference pairs.
- Base Model: Make-an-Audio 2 (diffusion-based method).
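The summary only states that a DPO variant is used, so the following is a hedged sketch of the general recipe rather than the paper's exact objective. It assumes a single combined score per audio, an illustrative `margin` for filtering near-ties, and the standard DPO loss on log-probabilities. For a diffusion model such as the base model here, the log-probabilities would have to be replaced by a tractable surrogate, which is not shown.

```python
import math
from typing import Dict, List, Tuple


def build_preference_pairs(
    scores: Dict[str, float], margin: float = 0.1
) -> List[Tuple[str, str]]:
    """Turn per-audio scores for ONE prompt into (preferred, rejected) pairs.

    `scores` maps an audio id to a combined score (for example, a mix of
    EOS, ESS, and AHQ; the combination rule is an assumption). Pairs whose
    score gap is below `margin` are skipped as too ambiguous.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    pairs = []
    for i, (winner, s_win) in enumerate(ranked):
        for loser, s_lose in ranked[i + 1:]:
            if s_win - s_lose >= margin:
                pairs.append((winner, loser))
    return pairs


def dpo_loss(
    logp_win: float,
    logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * margin).

    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose),
    where `ref_*` come from the frozen reference model.
    """
    z = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy usage with made-up scores and log-probabilities.
pairs = build_preference_pairs({"a": 0.9, "b": 0.85, "c": 0.3})
print(pairs)
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls below log 2 once the tuned model favors the preferred audio more strongly than the reference model does, which is the signal that drives preference tuning.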
Key Experimental Results
Comparison with Existing Evaluation Metrics (Correlation with Human Preferences)
| Metric | Correlation with Human Preference | Description |
|---|---|---|
| CLAP (Global Matching) | Medium | Unable to evaluate at a fine-grained level |
| FAD/IS (Distributional Metrics) | Low | Does not evaluate individual samples |
| EOS (Event Occurrence) | High | Fine-grained event verification |
| ESS (Event Sequence) | High | Sequence verification |
| AHQ (Acoustic Harmony) | High | Quality prediction |
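A comparison like the one in this table can be computed by correlating each metric's scores with human ratings over the same set of audios. The summary does not say which correlation statistic the paper uses; the sketch below assumes Spearman rank correlation and uses made-up numbers purely for illustration.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy usage: metric scores vs. human ratings for five audios (made-up values).
metric_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
human_ratings = [1, 2, 2, 4, 5]
print(spearman(metric_scores, human_ratings))
```

A rank-based statistic is a natural choice here because metric scores and human ratings live on different scales and only their ordering needs to agree.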
Preference Tuning Performance
| Setting | AudioCaps (Simple) | T2A-EpicBench (Complex) |
|---|---|---|
| Make-an-Audio 2 (Baseline) | Baseline | Baseline |
| + T2A-Feedback Tuning | Significant Gain | Significant Gain |
Key Findings
- All three AI scoring pipelines correlate more strongly with human preferences than CLAP does, validating the necessity of fine-grained evaluation.
- Preference tuning is effective in both simple and complex scenarios: strengthening basic capabilities yields an "emergent" improvement in advanced performance.
- T2A-EpicBench reveals severe deficiencies of current models in narrative audio, with most models failing almost completely on long-form descriptions.
- The three dimensions can be used independently or jointly for tuning, with each dimension providing complementary benefits.
Highlights & Insights
- The strategy of "enhancing basic capabilities \(\rightarrow\) emerging advanced performance" is compelling: complex scenarios improve without specialized training on them, since tuning on the three basic dimensions is enough to produce the gain.
- The three-dimensional fine-grained scoring is far more informative than CLAP's one-dimensional global scoring, establishing a new standard for TTA evaluation.
- T2A-Feedback (249K audios) is presented as the first large-scale TTA preference dataset, filling a critical gap in the field.
- T2A-EpicBench pushes TTA evaluation into the "narrative/multi-event" era.
- The methodology of replacing human annotation with AI feedback is transferable to other generative domains (e.g., fine-grained evaluation in text-to-video generation).
Limitations & Future Work
- The accuracy of the AI scoring pipelines themselves limits preference data quality, especially since AHQ training data is limited.
- Preference tuning is only validated on Make-an-Audio 2, leaving its effectiveness on other TTA models unconfirmed.
- The evaluation of T2A-EpicBench still heavily relies on automatic metrics, whereas human evaluation remains more reliable for long audios.
- The Event Sequence Score depends on the accuracy of the audio event detection model, meaning detection errors propagate to the final score.
Related Work & Insights
- vs Tango2: Tango2 uses CLAP for global preference ranking, whereas T2A-Feedback utilizes three-dimensional fine-grained scoring, which offers far richer information.
- vs FlashAudio: FlashAudio optimizes inference speed, while T2A-Feedback optimizes generation quality; the two directions are complementary.
- vs RLHF/DPO in LLM: Porting mature preference tuning methods from the LLM domain to TTA represents a successful cross-domain methodological transfer.
Rating
- Novelty: ⭐⭐⭐⭐ The three-dimensional AI scoring pipeline is novel and practical; the large-scale preference dataset is highly valuable.
- Experimental Thoroughness: ⭐⭐⭐⭐ Validated via score correlation, preference tuning, and a new benchmark, though it is only evaluated on a single base model.
- Writing Quality: ⭐⭐⭐⭐ The three-dimensional deconstruction is clear and the motivation is well-justified.
- Value: ⭐⭐⭐⭐⭐ The triple contribution of the dataset, evaluation metrics, and the benchmark significantly advances the TTA field.