AudioX: A Unified Framework for Anything-to-Audio Generation
Conference: ICLR2026
OpenReview: https://openreview.net/forum?id=qjJWxK3yWo
Code: https://zeyuet.github.io/AudioX/
Area: Audio Generation / Multimodal Diffusion
Keywords: Anything-to-Audio Generation, Multimodal Fusion, Diffusion Transformer, Instruction Following, Data Construction
TL;DR
AudioX is a unified framework built on a Diffusion Transformer (DiT), combined with a lightweight Multimodal Adaptive Fusion (MAF) module and IF-caps, a self-constructed multimodal dataset of 7 million samples. A single set of model weights generates high-fidelity sound effects and music from arbitrary combinations of text, video, and audio, and clearly outperforms specialized models in fine-grained instruction following.
Background & Motivation
Background: Sound effect and music generation has advanced rapidly with generative models and has proven its practical value in film, gaming, and social media. However, the mainstream approach remains "one model per task": text-to-audio and video-to-audio are handled by separate models, and each model is usually locked into either sound effects or music.
Limitations of Prior Work: A few attempts at unification exist, but they generally lack flexible support for "arbitrary modality combinations" and exhibit weak instruction-following capabilities (e.g., failing to handle sequence or counting controls like "footsteps followed by a door closing").
Key Challenge: The authors attribute the root cause to data—high-quality multimodal data for training unified models is extremely scarce. Most existing datasets are task-specific, providing supervision for only a single condition (either text-audio pairs or video-music pairs), which is neither multimodal nor compositional.
Goal: To address this through two sub-problems: (1) Designing a unified modeling framework that accommodates text/video/audio conditions and fuses them cleanly; (2) Generating large-scale, fine-grained, and compositional multimodal supervised data.
Key Insight: Transformers excel at cross-modal alignment, while diffusion models (especially DiT) surpass auto-regressive next-token prediction in audio fidelity. By combining these, DiT serves as the generation backbone, supplemented by a fusion module on the conditioning side to extract useful signals and suppress cross-modal noise.
Core Idea: A "Unified Backbone + Adaptive Fusion + High-quality Multimodal Data" trio—using a single DiT weight set to cover anything-to-audio, a MAF module to resolve multimodal interference, and a two-stage labeling pipeline for mass data construction.
Method
Overall Architecture
The input to AudioX is an arbitrary subset of video \(X_v\), text \(X_t\), and audio \(X_a\) (missing modalities are padded with zeros or default text), and the output is a high-fidelity audio/music waveform aligned with the conditions. The pipeline has three stages. First, a dedicated encoder per modality, followed by temporal modeling, produces the modality embeddings \(H_v, H_t, H_a\). Second, the MAF module fuses these adaptively into a unified condition embedding \(H_c\). Third, \(H_c\) and the diffusion timestep \(t\) guide the DiT backbone via cross-attention as it denoises in the latent space. Training supervision comes from the self-built IF-caps dataset (7 million samples), which is generated offline by a two-stage labeling pipeline.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Input<br/>Video / Text / Audio<br/>Arbitrary Combination"] --> C["Multimodal Encoding<br/>& Temporal Modeling"]
B["IF-caps Two-stage<br/>Data Construction"] -->|Training Supervision| E["DiT Backbone<br/>Unified Generation"]
C --> D["MAF Multimodal<br/>Adaptive Fusion"]
D -->|"Unified Embedding Hc"| E
E --> F["Audio / Music Output"]
Key Designs
1. IF-caps Two-stage Data Construction: Solving Data Scarcity with Model Collaboration
The bottleneck for unified models is data: existing datasets often have coarse labels or cover a single modality. The authors designed a labeling pipeline for video datasets. In the first stage, a strong multimodal LLM (Gemini 2.5 Pro) processes 10-second clips and produces a global caption plus structured fields (audio tags such as "event + count", music tags such as "genre + instrument"). In the second stage, to keep costs manageable, the open-source Qwen2-Audio takes the initial labels and the raw audio and expands them into multiple caption variants. The result is fine-grained labels for 1.3M video-audio clips and 5.7M video-music clips (7M in total), with the expensive model setting label quality and the cheaper model providing scale.
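The two-stage flow can be sketched as a small data pipeline. This is only an illustration of the control flow described above: the `ClipAnnotation` fields, the function names, and the two `fake_*` stand-ins are hypothetical, and no real Gemini or Qwen2-Audio call is shown. The stand-in for stage 2 also ignores the raw audio, which the actual pipeline is said to use.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ClipAnnotation:
    """Hypothetical record for one 10-second clip (field names are illustrative)."""
    clip_id: str
    global_caption: str
    audio_tags: List[str] = field(default_factory=list)   # "event + count"
    music_tags: List[str] = field(default_factory=list)   # "genre + instrument"
    augmented_captions: List[str] = field(default_factory=list)


def annotate(clip_id: str,
             stage1: Callable[[str], ClipAnnotation],
             stage2: Callable[[ClipAnnotation], List[str]]) -> ClipAnnotation:
    """Stage 1: the strong, expensive model writes the seed annotation once.
    Stage 2: a cheaper model expands it into several caption variants."""
    record = stage1(clip_id)
    record.augmented_captions = stage2(record)
    return record


# Stand-ins for the two models (assumptions, not real API calls).
def fake_stage1(clip_id: str) -> ClipAnnotation:
    return ClipAnnotation(clip_id, "Footsteps, then a door closes.",
                          audio_tags=["footsteps x1", "door close x1"])


def fake_stage2(record: ClipAnnotation) -> List[str]:
    # Strip the " xN" count suffix and chain the events in order.
    events = " followed by ".join(t.rsplit(" x", 1)[0] for t in record.audio_tags)
    return [record.global_caption, f"The sound of {events}."]


example = annotate("clip_0001", fake_stage1, fake_stage2)
print(example.augmented_captions)
```

Passing the two stages as callables keeps the expensive and cheap models swappable, which mirrors the quality/cost split the pipeline relies on.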
2. Multimodal Encoding and Temporal Modeling: Aligning Heterogeneous Conditions
Text, video, and audio are naturally heterogeneous, so AudioX uses a dedicated encoder for each. Video is encoded by CLIP-ViT-B/32 for frame-level semantics (5 fps) and by Synchformer for synchronization features (25 fps); text is encoded by T5-base; audio is encoded by an audio autoencoder. Video and audio features then pass through a temporal Transformer to capture dynamics, and all modalities are projected into a common embedding space to give \(H_v, H_t, H_a\). Missing modalities are padded with zeros or with generic natural-language templates, which is what lets one framework accept any input subset.
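The missing-modality handling can be illustrated with a minimal sketch. The frame rates come from the text; everything else is an assumption made for the example: the 10-second clip length is borrowed from the labeling pipeline, and `DEFAULT_TEXT`, `fill_missing`, the embedding width `dim`, and `audio_steps` are hypothetical.

```python
from typing import List, Optional, Tuple

CLIP_SECONDS = 10   # assumption: same 10 s clip length as the labeling pipeline
CLIP_FPS = 5        # frame rate fed to CLIP-ViT-B/32 (from the text)
SYNC_FPS = 25       # frame rate fed to Synchformer (from the text)
DEFAULT_TEXT = "Generate audio for the given input."  # hypothetical template

Sequence = List[List[float]]


def zero_sequence(steps: int, dim: int) -> Sequence:
    """All-zero placeholder used when a video or audio condition is absent."""
    return [[0.0] * dim for _ in range(steps)]


def fill_missing(video: Optional[Sequence],
                 text: Optional[str],
                 audio: Optional[Sequence],
                 dim: int = 4,
                 audio_steps: int = 8) -> Tuple[Sequence, str, Sequence]:
    """Return a complete (video, text, audio) triple for any input subset."""
    if video is None:
        video = zero_sequence(CLIP_SECONDS * CLIP_FPS, dim)
    if text is None:
        text = DEFAULT_TEXT
    if audio is None:
        audio = zero_sequence(audio_steps, dim)
    return video, text, audio


# Text-only request (T2A / T2M): the video and audio slots are zero-filled.
v, t, a = fill_missing(video=None,
                       text="footsteps followed by a door closing",
                       audio=None)
print(len(v), len(a), t)
print("Frames sampled for Synchformer per clip:", CLIP_SECONDS * SYNC_FPS)
```

Because every request is expanded to the same triple, the downstream fusion and backbone never need task-specific branches.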
3. MAF Multimodal Adaptive Fusion: Gating + Expert Queries + Self-Attention
This is the core module designed to prevent modality "clashing." MAF first passes the initial embeddings through a gate that filters and re-weights signals. Next, a set of learnable queries (divided into modality-specific "experts") aggregates evidence from the different data streams via cross-attention. Finally, self-attention integrates context, and the refined information is written back through residual connections.
MAF is lightweight, accounting for only 60M of the total 2.4B parameters (1.1B trainable). In the ablation, removing MAF entirely degrades IS and FAD the most, and removing only the gate or only the queries causes smaller drops, which supports its role in reducing interference and improving instruction following.
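The gate, expert-query, and self-attention sequence can be sketched in plain Python. This is a toy reading of the description above, not the paper's implementation: the per-token sigmoid gate, single-head attention without learned projections, the placement of the residuals, and all names (`maf_sketch`, `gate_w`, `experts`) are assumptions.

```python
import math
import random
from typing import Dict, List

Vec = List[float]
Seq = List[Vec]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: Vec) -> Vec:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: Seq, keys: Seq) -> Seq:
    """Single-head scaled dot-product attention; values are the keys themselves."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(wi * k[d] for wi, k in zip(w, keys)) for d in range(len(q))])
    return out


def add(a: Seq, b: Seq) -> Seq:
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]


def gate(tokens: Seq, w: Vec) -> Seq:
    """Scale each token by sigmoid(w . token), a simple filter / re-weighting."""
    return [[x / (1.0 + math.exp(-dot(w, tok))) for x in tok] for tok in tokens]


def maf_sketch(embeds: Dict[str, Seq], experts: Dict[str, Seq],
               gate_w: Dict[str, Vec]) -> Seq:
    # 1) Gate every modality stream.
    gated = {m: gate(embeds[m], gate_w[m]) for m in embeds}
    all_tokens = [tok for m in gated for tok in gated[m]]
    # 2) Each modality's expert queries cross-attend to all gated tokens.
    refined: Seq = []
    for m in experts:
        refined.extend(add(experts[m], attend(experts[m], all_tokens)))
    # 3) Self-attention over the refined queries, added back as a residual.
    return add(refined, attend(refined, refined))


random.seed(0)
DIM = 4


def rand_seq(n: int) -> Seq:
    return [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(n)]


embeds = {"video": rand_seq(5), "text": rand_seq(3), "audio": rand_seq(2)}
experts = {m: rand_seq(2) for m in embeds}      # two queries per modality
gate_w = {m: rand_seq(1)[0] for m in embeds}
h_c = maf_sketch(embeds, experts, gate_w)
print(len(h_c), len(h_c[0]))
```

In this sketch the fused condition has a fixed length (the total number of queries) regardless of how long each input stream is, which is one reason query-based fusion is convenient in front of a cross-attention backbone.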
4. Unified Generation with DiT Backbone: Single Weights for All Tasks
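The DiT backbone (24 layers, initialized from Stable Audio Open) performs denoising in the latent space. Ground-truth audio \(A\) is encoded to \(z = E(A)\), and noise is added via a Markov chain \(q(z_t|z_{t-1}) = \mathcal{N}(z_t; \sqrt{1-\beta_t}\,z_{t-1}, \beta_t I)\). The network \(\epsilon_\theta\) predicts the noise at each step conditioned on \(z_t\), \(t\), and \(H_c\), and is trained with the noise-prediction MSE:
\[
\mathcal{L} = \mathbb{E}_{z,\,\epsilon \sim \mathcal{N}(0, I),\,t}\left[\lVert \epsilon - \epsilon_\theta(z_t, t, H_c) \rVert_2^2\right]
\]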
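The per-step Gaussian transition composes into the well-known closed form \(z_t = \sqrt{\bar\alpha_t}\,z + \sqrt{1-\bar\alpha_t}\,\epsilon\) with \(\bar\alpha_t = \prod_{s \le t}(1-\beta_s)\), which is what makes training on a random timestep cheap. The sketch below shows that closed form and a noise-prediction MSE on toy vectors. The linear beta schedule, the 1000-step count, and the dummy `zero_model` are assumptions for illustration; the source does not state the schedule AudioX uses.

```python
import math
import random
from typing import Callable, List

Vec = List[float]


def linear_betas(steps: int, start: float = 1e-4, end: float = 0.02) -> Vec:
    """Linear beta schedule (an assumed choice, not taken from the paper)."""
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]


def alpha_bars(betas: Vec) -> Vec:
    """Cumulative product of (1 - beta_s) for all s up to t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(z0: Vec, t: int, abar: Vec, eps: Vec) -> Vec:
    """Closed form of the chain: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    a = abar[t]
    return [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]


def noise_mse(eps_model: Callable[[Vec, int, Vec], Vec],
              z0: Vec, t: int, abar: Vec, h_c: Vec,
              rng: random.Random) -> float:
    """One Monte Carlo sample of the noise-prediction MSE for a fixed timestep."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = q_sample(z0, t, abar, eps)
    pred = eps_model(z_t, t, h_c)
    return sum((e - p) ** 2 for e, p in zip(eps, pred)) / len(eps)


def zero_model(z_t: Vec, t: int, h_c: Vec) -> Vec:
    """Dummy predictor standing in for the DiT; it always predicts zero noise."""
    return [0.0] * len(z_t)


abar = alpha_bars(linear_betas(1000))
rng = random.Random(0)
latent = [0.5, -1.0, 0.25, 2.0]          # toy stand-in for z = E(A)
loss = noise_mse(zero_model, latent, t=500, abar=abar, h_c=[0.0], rng=rng)
print(f"abar[0]={abar[0]:.4f}, abar[-1]={abar[-1]:.2e}, loss={loss:.3f}")
```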
By unifying all tasks (T2A, V2A, TV2A, T2M, V2M, TV2M, audio inpainting, music continuation) as conditional denoising given \(H_c\), the same weights cover the full spectrum of tasks.
Loss & Training
The objective is the noise prediction MSE mentioned above. Optimization uses AdamW (LR 1e-5, weight decay 0.001) with exponential ramp-up/decay and weight EMA. Training took approximately 4k GPU hours on NVIDIA H800 clusters with a batch size of 48. Inference uses 250 steps and a classifier-free guidance scale of 7.0.
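For reference, a guidance scale of 7.0 is typically applied with the standard classifier-free guidance combination, sketched below. The source states only the scale; the formula is the common formulation and `cfg_combine` is a hypothetical name. How the "unconditional" branch is built when several modalities are present is not specified here.

```python
from typing import List


def cfg_combine(eps_uncond: List[float], eps_cond: List[float],
                scale: float = 7.0) -> List[float]:
    """Standard classifier-free guidance: push the unconditional prediction
    toward (and past) the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


guided = cfg_combine([0.0, 1.0], [1.0, 1.0], scale=7.0)
print(guided)
```

With `scale=1.0` the result equals the conditional prediction, and with `scale=0.0` it equals the unconditional one; values above 1 trade diversity for closer adherence to the conditions. Each sampling step needs both predictions, so 250 steps with guidance usually means two backbone evaluations per step unless they are batched together.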
Key Experimental Results
Main Results
A single AudioX model vs. specialized SOTA models (excerpt from Table 1; higher IS is better, lower FAD and FD are better):
| Dataset | Task | Method | IS ↑ | FAD ↓ | FD ↓ |
|---|---|---|---|---|---|
| VGGSound | T2A | MMAudio | 17.83 | 2.50 | 11.52 |
| VGGSound | T2A | AudioX | 19.58 | 1.33 | 9.01 |
| MusicCaps | T2M | TangoMusic | 2.86 | 1.88 | 15.00 |
| MusicCaps | T2M | AudioX | 3.55 | 1.53 | 9.76 |
| AudioCaps | T2A | Tango 2 | 10.37 | 3.20 | 12.22 |
| AudioCaps | T2A | AudioX | 12.48 | 1.59 | 11.51 |
AudioX achieves SOTA in T2A and T2M. In V2A, it is competitive with MMAudio on VGGSound and out-of-distribution AVVP, demonstrating strong generalization.
Instruction following (Table 2; the four T2A-bench accuracy columns are higher-is-better, the AudioTime Ordering error is lower-is-better):
| Method | Cat-acc ↑ | Cnt-acc ↑ | Ord-acc ↑ | TS-acc ↑ | AudioTime Ordering ↓ |
|---|---|---|---|---|---|
| Stable Audio Open | 31.20 | 9.80 | 6.00 | 21.80 | 0.98 |
| Make-An-Audio2 | 32.40 | 4.00 | 19.80 | 18.80 | 0.76 |
| MMAudio | 26.60 | 4.80 | 2.40 | 21.40 | 0.98 |
| AudioX | 34.20 | 12.40 | 23.60 | 28.20 | 0.34 |
AudioX leads on every fine-grained control dimension and cuts the AudioTime Ordering error from 0.76 (the best baseline, Make-An-Audio2) to 0.34.
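To make the size of the gap concrete, the margins over the best baseline in each column can be computed directly from the rows of Table 2. The values below are copied from the table; the shortened column keys and the helper name are introduced only for this example.

```python
# Rows copied from Table 2; keys are shortened column names.
TABLE2 = {
    "Stable Audio Open": {"Cat": 31.20, "Cnt": 9.80, "Ord": 6.00, "TS": 21.80, "AT-Ordering": 0.98},
    "Make-An-Audio2":    {"Cat": 32.40, "Cnt": 4.00, "Ord": 19.80, "TS": 18.80, "AT-Ordering": 0.76},
    "MMAudio":           {"Cat": 26.60, "Cnt": 4.80, "Ord": 2.40, "TS": 21.40, "AT-Ordering": 0.98},
    "AudioX":            {"Cat": 34.20, "Cnt": 12.40, "Ord": 23.60, "TS": 28.20, "AT-Ordering": 0.34},
}
LOWER_IS_BETTER = {"AT-Ordering"}


def margin_over_best_baseline(table, ours="AudioX"):
    """For each metric, find the strongest baseline and the improvement over it.
    Margins are reported as positive numbers when `ours` is better."""
    margins = {}
    for metric, our_value in table[ours].items():
        baselines = {name: row[metric] for name, row in table.items() if name != ours}
        pick = min if metric in LOWER_IS_BETTER else max
        best_name = pick(baselines, key=baselines.get)
        diff = our_value - baselines[best_name]
        if metric in LOWER_IS_BETTER:
            diff = -diff
        margins[metric] = (best_name, round(diff, 2))
    return margins


margins = margin_over_best_baseline(TABLE2)
for metric, (name, gain) in margins.items():
    print(f"{metric}: +{gain} over {name}")
```

The accuracy margins are in percentage points. The strongest baseline differs by column, which is worth keeping in mind when reading "leads on every dimension": no single baseline is second-best everywhere.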
Ablation Study
Data construction strategy (Table 3, progressive text supervision quality):
| Caption Source | Cat-acc ↑ | T2A IS ↑ | V2A FAD ↓ |
|---|---|---|---|
| Labels (Original labels) | 17.35 | 7.59 | 1.81 |
| AudioSetCaps | 27.85 | 10.08 | 1.33 |
| QwenCap (Qwen only) | 24.60 | 9.74 | 1.67 |
| GeminiCap (Gemini initial) | 28.05 | 10.81 | 1.31 |
| GeminiCap-aug (Full pipeline) | 28.91 | 10.93 | 1.15 |
MAF Architecture (Table 4):
| Configuration | IS ↑ | FAD ↓ | Ordering ↓ |
|---|---|---|---|
| w/o MAF | 10.70 | 2.67 | 0.912 |
| w/o Gate | 11.66 | 2.00 | 0.876 |
| w/o Query | 11.72 | 2.08 | 0.912 |
| Full MAF | 11.84 | 1.98 | 0.888 |
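Key Findings
- MAF matters most as a whole: removing it entirely causes the largest degradation (IS from 11.84 to 10.70, FAD from 1.98 to 2.67). Removing only the gate or only the expert queries costs less, and on the Ordering metric the variant without the gate (0.876) is marginally better than full MAF (0.888), so the evidence for each individual component is weaker than the evidence for the module.
- Cross-modal Regularization Effect: Improving text supervision quality benefits V2A as well as T2A (V2A FAD drops from 1.81 with the original labels to 1.15 with the full pipeline). Higher-quality text annotation is therefore a practical lever for building more robust multimodal models.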
- High fidelity does not equal strong instruction following: Some SOTAs like Tango 2 have high fidelity but moderate control, indicating these are independent dimensions.
Highlights & Insights
- "Data + Architecture" Duo: The logic is clear—attribute model failure to data scarcity and interference, then solve them with IF-caps and MAF.
- MAF as a Transferable Fusion Paradigm: The "Gating → Expert Query → Self-attention → Residual" flow is applicable to any scenario feeding heterogeneous conditions into a single backbone.
- Cross-modal Regularization Insight: Using text quality as a lever for non-text tasks (V2A) is an insightful departure from the intuition that video tasks only require more video data.
- New Benchmark T2A-bench: Addresses the gap in evaluating fine-grained control (category, count, order, timestamps).
Limitations & Future Work
- Dependency on External LLMs: The quality of IF-caps is capped by Gemini and Qwen; systematic biases or hallucinations might be inherited.
- Parity in Some Tasks: Performance in V2A/TV2A is on par with MMAudio rather than clearly ahead; the advantage of the unified model lies primarily in text-related and instruction-following tasks.
- High Cost: 2.4B parameters and 250 diffusion steps limit real-time or edge deployment.
- Future Directions: Exploring efficient sampling (distillation), self-labeling loops, and finer temporal alignment.
Related Work & Insights
- vs. Specialists (AudioGen/MusicGen/Tango): Specialists lock into single conditions; AudioX covers all combinations with one weight set and outperforms them on T2A/T2M.
- vs. Video-to-Audio Specialists (MMAudio/Diff-Foley): While these specialize in V2A, AudioX matches their performance while unlocking music generation and continuation.
- vs. Prior Unified Efforts: Previous works had weak flexibility and instruction following. AudioX bridges this gap with compositional data (IF-caps) and adaptive fusion (MAF).
Rating
- Novelty: ⭐⭐⭐⭐ The MAF module and cross-modal regularization findings are innovative.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 6 tasks, multiple datasets, dual benchmarks, user studies, and complete ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear logic, though some internal details of MAF (such as its dimensions) are only given in the appendix.
- Value: ⭐⭐⭐⭐⭐ A unified framework + 7M open-source samples + new benchmarks represent a solid infrastructure for the community.