AudioX: A Unified Framework for Anything-to-Audio Generation

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=qjJWxK3yWo
Code: https://zeyuet.github.io/AudioX/
Area: Audio Generation / Multimodal Diffusion
Keywords: Anything-to-Audio Generation, Multimodal Fusion, Diffusion Transformer, Instruction Following, Data Construction

TL;DR

AudioX is a unified framework built on a Diffusion Transformer (DiT), a lightweight "Multimodal Adaptive Fusion (MAF)" module, and IF-caps, a self-constructed multimodal dataset of 7 million samples. A single set of model weights generates high-fidelity sound effects and music from arbitrary combinations of text, video, and audio, and outperforms specialized models on fine-grained instruction following.

Background & Motivation

Background: Sound-effect and music generation has advanced rapidly with generative models and has shown practical value in film, gaming, and social media. However, the mainstream approach remains "one model per task": text-to-audio and video-to-audio are handled by separate models, each often locked into either sound effects or music.

Limitations of Prior Work: A few attempts at unification exist, but they generally lack flexible support for "arbitrary modality combinations" and exhibit weak instruction-following capabilities (e.g., failing to handle sequence or counting controls like "footsteps followed by a door closing").

Key Challenge: The authors attribute the root cause to data—high-quality multimodal data for training unified models is extremely scarce. Most existing datasets are task-specific, providing supervision for only a single condition (either text-audio pairs or video-music pairs), which is neither multimodal nor compositional.

Goal: To address this through two sub-problems: (1) Designing a unified modeling framework that accommodates text/video/audio conditions and fuses them cleanly; (2) Generating large-scale, fine-grained, and compositional multimodal supervised data.

Key Insight: Transformers excel at cross-modal alignment, while diffusion models (especially DiT) surpass auto-regressive next-token prediction in audio fidelity. By combining these, DiT serves as the generation backbone, supplemented by a fusion module on the conditioning side to extract useful signals and suppress cross-modal noise.

Core Idea: A "Unified Backbone + Adaptive Fusion + High-quality Multimodal Data" trio—using a single DiT weight set to cover anything-to-audio, a MAF module to resolve multimodal interference, and a two-stage labeling pipeline for mass data construction.

Method

Overall Architecture

The input to AudioX is an arbitrary subset of video \(X_v\), text \(X_t\), and audio \(X_a\); missing modalities are padded with zeros or default text. The output is a high-fidelity audio or music waveform aligned with the conditions. The pipeline has three stages. First, modality-specific encoders with temporal modeling produce the embeddings \(H_v, H_t, H_a\). Second, the MAF module adaptively fuses them into a unified condition embedding \(H_c\). Third, \(H_c\) and the diffusion timestep \(t\) guide the DiT backbone via cross-attention as it denoises in the latent space. Training supervision comes from the self-built IF-caps dataset (7 million samples), generated offline by a two-stage labeling pipeline.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Input<br/>Video / Text / Audio<br/>Arbitrary Combination"] --> C["Multimodal Encoding<br/>& Temporal Modeling"]
    B["IF-caps Two-stage<br/>Data Construction"] -->|Training Supervision| E["DiT Backbone<br/>Unified Generation"]
    C --> D["MAF Multimodal<br/>Adaptive Fusion"]
    D -->|"Unified Embedding Hc"| E
    E --> F["Audio / Music Output"]

Key Designs

1. IF-caps Two-stage Data Construction: Solving Data Scarcity with Model Collaboration

The bottleneck for unified models is data: existing datasets often have coarse labels or cover a single modality. The authors design a two-stage labeling pipeline for video datasets. In the first stage, a strong multimodal LLM (Gemini 2.5 Pro) processes 10-second clips and produces a global caption plus structured fields (audio tags such as "event + count", music tags such as "genre + instrument"). In the second stage, to control cost, the open-source Qwen2-Audio generates additional caption variants from the initial labels and the raw audio. The result is fine-grained labels for 1.3M video-audio clips and 5.7M video-music clips (7M in total), with the stronger model setting label quality and the cheaper model supplying scale.
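As a rough illustration of what one labeled clip might look like after both stages, the sketch below models the structured fields as Python dataclasses. The class and field names (`AudioTag`, `MusicTag`, `IFCapsRecord`, `expanded_captions`) and the example values are hypothetical and are not taken from the released dataset; only the field semantics (global caption, "event + count" audio tags, "genre + instrument" music tags, stage-two caption variants) follow the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AudioTag:
    event: str  # e.g. "footsteps"
    count: int  # how many times the event occurs in the clip


@dataclass
class MusicTag:
    genre: str
    instruments: List[str]


@dataclass
class IFCapsRecord:
    """Hypothetical layout of one labeled clip; not the official schema."""
    clip_id: str
    global_caption: str  # stage 1: multimodal LLM output
    audio_tags: List[AudioTag] = field(default_factory=list)
    music_tag: Optional[MusicTag] = None
    expanded_captions: List[str] = field(default_factory=list)  # stage 2


def add_stage_two_captions(record: IFCapsRecord, captions: List[str]) -> IFCapsRecord:
    """Attach non-empty caption variants produced by the second stage."""
    record.expanded_captions.extend(c.strip() for c in captions if c.strip())
    return record


record = IFCapsRecord(
    clip_id="example-000",
    global_caption="Footsteps on gravel, then a door closes.",
    audio_tags=[AudioTag("footsteps", 4), AudioTag("door closing", 1)],
)
add_stage_two_captions(record, ["Four footsteps followed by one door slam.", "  "])

# Split sizes reported in the text, in millions of clips.
total_clips_millions = 1.3 + 5.7
```

Keeping the stage-one fields separate from the stage-two caption list mirrors the split between the expensive labeling pass and the cheaper expansion pass.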

2. Multimodal Encoding and Temporal Modeling: Aligning Heterogeneous Conditions

Text, video, and audio are heterogeneous by nature, so AudioX uses a dedicated encoder for each. Video uses CLIP-ViT-B/32 for frame-level semantics (5 fps) and Synchformer for synchronization features (25 fps); text uses T5-base; audio uses an audio autoencoder. Video and audio features pass through a temporal Transformer to capture dynamics, and the modality embeddings are then projected into a common embedding space as \(H_v, H_t, H_a\). Missing modalities are padded with zeros or generic natural-language templates, which is what lets a single framework accept any input combination.
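The padding rule for missing modalities can be sketched as follows. This is a minimal illustration, not the authors' code: the 10-second conditioning window, the feature width `dim`, the placeholder prompt `DEFAULT_PROMPT`, and the function names are all assumptions. Only the two frame rates and the "zeros or default text" rule come from the text, and the audio condition is omitted for brevity.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

CLIP_FPS = 5        # frame-level semantic features (from the text)
SYNC_FPS = 25       # synchronization features (from the text)
CLIP_SECONDS = 10   # assumption: 10-second conditioning window
DEFAULT_PROMPT = "Generate general audio."  # hypothetical template

Features = List[List[float]]

# Conditions each task supplies; anything absent is padded.
TASK_INPUTS: Dict[str, FrozenSet[str]] = {
    "T2A": frozenset({"text"}),
    "V2A": frozenset({"video"}),
    "TV2A": frozenset({"text", "video"}),
    "T2M": frozenset({"text"}),
    "V2M": frozenset({"video"}),
    "TV2M": frozenset({"text", "video"}),
}


def zero_features(steps: int, dim: int) -> Features:
    """All-zero placeholder for a missing feature stream."""
    return [[0.0] * dim for _ in range(steps)]


def prepare_conditions(
    task: str,
    video: Optional[Features] = None,
    text: Optional[str] = None,
    dim: int = 4,
) -> Tuple[Features, str]:
    """Return (video features, prompt), padding whichever the task omits."""
    needed = TASK_INPUTS[task]
    if "video" in needed and video is None:
        raise ValueError(f"{task} requires video features")
    if "text" in needed and not text:
        raise ValueError(f"{task} requires a text prompt")
    video_out = video if video is not None and "video" in needed \
        else zero_features(CLIP_FPS * CLIP_SECONDS, dim)
    text_out = text if text and "text" in needed else DEFAULT_PROMPT
    return video_out, text_out


video_feats, prompt = prepare_conditions("T2A", text="rain on a tin roof")
```

Under the assumed 10-second window, the two video streams would contribute 5 × 10 = 50 and 25 × 10 = 250 feature steps respectively.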

3. MAF Multimodal Adaptive Fusion: Gating + Expert Queries + Self-Attention

This is the core module, designed to prevent modalities from "clashing." MAF first passes the initial embeddings through a gate that filters and re-weights signals. Next, a set of learnable queries (divided into modality-specific "experts") aggregates evidence from the different data streams via cross-attention. Finally, self-attention integrates context, and the refined information is written back through residual connections:

\[\tilde{H}_v, \tilde{H}_t, \tilde{H}_a = \mathrm{MAF}(H_v, H_t, H_a), \qquad H_c = \mathrm{Concat}\left(\tilde{H}_v, \tilde{H}_t, \tilde{H}_a\right).\]

MAF is lightweight, accounting for only 60M of the total 2.4B parameters (1.1B trainable). In the ablation (Table 4), removing MAF entirely causes the largest drop (IS 11.84 → 10.70, FAD 1.98 → 2.67), while removing only the gate or only the queries causes smaller drops in IS and FAD. This supports its role in reducing cross-modal interference and improving instruction following.
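The gate → expert-query → self-attention → residual flow described above can be sketched in NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: attention is single-head with no learned projections, the gate is a sigmoid of a linear map, the number of queries per expert (`N_QUERIES`) is arbitrary, the write-back lets each token attend to the refined queries, and \(H_c\) is concatenated along the sequence axis. None of these details are confirmed by the summary.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8          # toy embedding width
N_QUERIES = 4  # learnable queries per modality "expert" (assumed count)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention (no learned projections)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def maf(h, queries, w_gate):
    # 1) Gate: element-wise sigmoid filter that re-weights each token.
    gated = {m: x * sigmoid(x @ w_gate[m]) for m, x in h.items()}
    # 2) Expert queries: each modality's queries cross-attend to all streams.
    tokens = np.concatenate([gated[m] for m in h], axis=0)
    experts = [attention(queries[m], tokens, tokens) for m in h]
    # 3) Self-attention over all expert queries integrates context.
    pooled = np.concatenate(experts, axis=0)
    refined = attention(pooled, pooled, pooled)
    # 4) Residual write-back (assumed form): tokens read from refined queries.
    return {m: gated[m] + attention(gated[m], refined, refined) for m in h}


h = {
    "video": rng.normal(size=(50, D)),
    "text": rng.normal(size=(6, D)),
    "audio": rng.normal(size=(10, D)),
}
queries = {m: rng.normal(size=(N_QUERIES, D)) for m in h}
w_gate = {m: rng.normal(size=(D, D)) * 0.1 for m in h}

out = maf(h, queries, w_gate)
# Unified condition embedding: concatenate along the sequence axis (assumed).
h_c = np.concatenate([out["video"], out["text"], out["audio"]], axis=0)
```

Each modality keeps its own sequence length, so the fused embedding has 50 + 6 + 10 = 66 tokens in this toy setup. For scale, the reported 60M MAF parameters are 2.5% of the 2.4B total.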

4. Unified Generation with DiT Backbone: Single Weights for All Tasks

The DiT backbone (24 layers, initialized from Stable Audio Open) performs denoising in the latent space. Ground truth audio \(A\) is encoded to \(z = E(A)\), and noise is added via a Markov chain \(q(z_t|z_{t-1}) = \mathcal{N}(z_t; \sqrt{1-\beta_t}\,z_{t-1}, \beta_t I)\). The network \(\epsilon_\theta\) predicts noise at each step conditioned on \(z_t, t, H_c\):

\[\min_\theta \ \mathbb{E}_{t, z_t, \epsilon}\ \left\| \epsilon - \epsilon_\theta(z_t, t, H_c) \right\|_2^2.\]

By unifying all tasks (T2A, V2A, TV2A, T2M, V2M, TV2M, audio inpainting, music continuation) as conditional denoising given \(H_c\), the same weights cover the full spectrum of tasks.
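The forward noising step and the noise-prediction objective above can be illustrated with a small pure-Python sketch. It relies on the standard closed form \(z_t = \sqrt{\bar\alpha_t}\,z_0 + \sqrt{1-\bar\alpha_t}\,\epsilon\) with \(\bar\alpha_t = \prod_{s \le t}(1-\beta_s)\), which follows from the Markov chain given earlier. The number of steps `T`, the linear beta schedule, and the dummy predictor are assumptions for illustration; the summary does not state the schedule AudioX uses.

```python
import math
import random
from typing import Callable, List, Sequence

random.seed(0)
T = 1000  # assumed number of diffusion steps
# Assumed linear beta schedule from 1e-4 to 0.02.
BETAS = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# ALPHA_BAR[t] is the running product of (1 - beta_s) for s <= t.
ALPHA_BAR: List[float] = []
_running = 1.0
for _beta in BETAS:
    _running *= 1.0 - _beta
    ALPHA_BAR.append(_running)


def add_noise(z0: Sequence[float], t: int, eps: Sequence[float]) -> List[float]:
    """Sample z_t directly from z_0 using the closed-form marginal."""
    a = ALPHA_BAR[t]
    return [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]


def noise_mse(eps: Sequence[float], eps_pred: Sequence[float]) -> float:
    """Mean squared error; equals the squared L2 norm up to a 1/len factor."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


def training_loss(
    z0: Sequence[float],
    h_c: object,
    eps_theta: Callable[[List[float], int, object], List[float]],
) -> float:
    """One Monte Carlo sample of the objective for a single latent."""
    t = random.randrange(T)
    eps = [random.gauss(0.0, 1.0) for _ in z0]
    z_t = add_noise(z0, t, eps)
    return noise_mse(eps, eps_theta(z_t, t, h_c))


# Dummy predictor that always outputs zeros, standing in for the DiT.
loss = training_loss([0.5, -0.5, 0.1], None, lambda z, t, h: [0.0] * len(z))
```

Because every task only changes what goes into `h_c`, the same loss function serves all of them, which is the point of the unified formulation.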

Loss & Training

The objective is the noise prediction MSE mentioned above. Optimization uses AdamW (LR 1e-5, weight decay 0.001) with exponential ramp-up/decay and weight EMA. Training took approximately 4k GPU hours on NVIDIA H800 clusters with a batch size of 48. Inference uses 250 steps and a classifier-free guidance scale of 7.0.
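For readers unfamiliar with classifier-free guidance, the sketch below shows the commonly used combination rule, in which the conditional and unconditional noise predictions are extrapolated by the guidance scale. The summary only reports the scale (7.0) and step count (250); whether AudioX uses exactly this formulation is an assumption, and `cfg_combine` is a hypothetical helper name.

```python
from typing import List, Sequence

GUIDANCE_SCALE = 7.0  # from the text
NUM_STEPS = 250       # from the text


def cfg_combine(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    scale: float = GUIDANCE_SCALE,
) -> List[float]:
    """Common CFG rule: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


guided = cfg_combine([0.0, 0.5], [1.0, 0.5])

# If conditional and unconditional passes run separately (an assumption;
# they can also be batched), each sample needs two network calls per step.
network_calls = 2 * NUM_STEPS
```

A scale of 1.0 recovers the purely conditional prediction, and larger values push the sample further toward the condition at the cost of diversity.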

Key Experimental Results

Main Results

A single AudioX model vs. specialized SOTA models (Table 1 excerpt; higher IS is better, lower FAD and FD are better):

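| Dataset | Task | Method | IS ↑ | FAD ↓ | FD ↓ |
| --- | --- | --- | --- | --- | --- |
| VGGSound | T2A | MMAudio | 17.83 | 2.50 | 11.52 |
| VGGSound | T2A | AudioX | 19.58 | 1.33 | 9.01 |
| MusicCaps | T2M | TangoMusic | 2.86 | 1.88 | 15.00 |
| MusicCaps | T2M | AudioX | 3.55 | 1.53 | 9.76 |
| AudioCaps | T2A | Tango 2 | 10.37 | 3.20 | 12.22 |
| AudioCaps | T2A | AudioX | 12.48 | 1.59 | 11.51 |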

AudioX achieves SOTA in T2A and T2M. In V2A, it is competitive with MMAudio on VGGSound and out-of-distribution AVVP, demonstrating strong generalization.

Instruction Following (Table 2; T2A-bench accuracies, higher is better; AudioTime Ordering error, lower is better):

| Method | Cat-acc ↑ | Cnt-acc ↑ | Ord-acc ↑ | TS-acc ↑ | AudioTime Ordering ↓ |
| --- | --- | --- | --- | --- | --- |
| Stable Audio Open | 31.20 | 9.80 | 6.00 | 21.80 | 0.98 |
| Make-An-Audio2 | 32.40 | 4.00 | 19.80 | 18.80 | 0.76 |
| MMAudio | 26.60 | 4.80 | 2.40 | 21.40 | 0.98 |
| AudioX | 34.20 | 12.40 | 23.60 | 28.20 | 0.34 |

AudioX leads on every fine-grained control dimension and cuts the AudioTime Ordering error from 0.76 (the best baseline, Make-An-Audio2) to 0.34.
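The "leads on every dimension" reading of Table 2 can be checked mechanically. The short script below copies the table values as given above and picks the best method per column, respecting the direction of each metric; the variable and function names are illustrative only.

```python
from typing import Dict, Tuple

COLUMNS = ("Cat-acc", "Cnt-acc", "Ord-acc", "TS-acc", "AudioTime Ordering")
HIGHER_IS_BETTER = (True, True, True, True, False)

# Values copied from Table 2 above.
TABLE2: Dict[str, Tuple[float, ...]] = {
    "Stable Audio Open": (31.20, 9.80, 6.00, 21.80, 0.98),
    "Make-An-Audio2": (32.40, 4.00, 19.80, 18.80, 0.76),
    "MMAudio": (26.60, 4.80, 2.40, 21.40, 0.98),
    "AudioX": (34.20, 12.40, 23.60, 28.20, 0.34),
}


def best_method(col: int) -> str:
    """Return the method with the best score in the given column."""
    pick = max if HIGHER_IS_BETTER[col] else min
    return pick(TABLE2, key=lambda name: TABLE2[name][col])


winners = {name: best_method(i) for i, name in enumerate(COLUMNS)}

# Relative reduction in ordering error versus the best baseline.
best_baseline_ordering = min(v[4] for k, v in TABLE2.items() if k != "AudioX")
ordering_reduction = (best_baseline_ordering - TABLE2["AudioX"][4]) / best_baseline_ordering
```

With these numbers the ordering error falls by roughly 55% relative to the best baseline, while the absolute accuracies remain low across all methods (for example, 12.40 on counting).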

Ablation Study

Data construction strategy (Table 3, progressive text supervision quality):

| Caption Source | Cat-acc ↑ | T2A IS ↑ | V2A FAD ↓ |
| --- | --- | --- | --- |
| Labels (Original labels) | 17.35 | 7.59 | 1.81 |
| AudioSetCaps | 27.85 | 10.08 | 1.33 |
| QwenCap (Qwen only) | 24.60 | 9.74 | 1.67 |
| GeminiCap (Gemini initial) | 28.05 | 10.81 | 1.31 |
| GeminiCap-aug (Full pipeline) | 28.91 | 10.93 | 1.15 |

MAF Architecture (Table 4):

| Configuration | IS ↑ | FAD ↓ | Ordering ↓ |
| --- | --- | --- | --- |
| w/o MAF | 10.70 | 2.67 | 0.912 |
| w/o Gate | 11.66 | 2.00 | 0.876 |
| w/o Query | 11.72 | 2.08 | 0.912 |
| Full MAF | 11.84 | 1.98 | 0.888 |
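To make the size of each ablation effect explicit, the snippet below computes the difference between every ablated configuration and the full MAF, using the Table 4 values exactly as listed. The names are illustrative; no numbers beyond the table are introduced.

```python
from typing import Dict, Tuple

METRICS = ("IS", "FAD", "Ordering")

# Values copied from Table 4 above.
ABLATION: Dict[str, Tuple[float, float, float]] = {
    "w/o MAF": (10.70, 2.67, 0.912),
    "w/o Gate": (11.66, 2.00, 0.876),
    "w/o Query": (11.72, 2.08, 0.912),
    "Full MAF": (11.84, 1.98, 0.888),
}

full = ABLATION["Full MAF"]

# Delta = ablated value minus full-model value, per metric.
deltas = {
    name: {m: round(v - f, 3) for m, v, f in zip(METRICS, values, full)}
    for name, values in ABLATION.items()
    if name != "Full MAF"
}

for name, d in deltas.items():
    print(f"{name:10s} " + "  ".join(f"{m} {d[m]:+.3f}" for m in METRICS))
```

In these figures the full model has the best IS and FAD, and removing MAF entirely is by far the largest change. On Ordering, the "w/o Gate" row (0.876) is slightly lower than the full model (0.888), so the gate's benefit shows up in IS and FAD rather than in that metric.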

Key Findings

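  • MAF matters as a whole: removing it entirely causes the largest degradation (IS 11.84 → 10.70, FAD 1.98 → 2.67). Removing only the gate or only the expert queries costs less but still lowers IS and raises FAD, so both components contribute.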
  • Cross-modal Regularization Effect: Improving text supervision quality benefits not only T2A but also V2A (V2A FAD drops from 1.81 with the original labels to 1.15 with the full pipeline). Investing in high-quality text data is therefore a practical strategy for building robust multimodal models.
  • High fidelity does not equal strong instruction following: Some SOTAs like Tango 2 have high fidelity but moderate control, indicating these are independent dimensions.

Highlights & Insights

  • "Data + Architecture" Duo: The logic is clear—attribute model failure to data scarcity and interference, then solve them with IF-caps and MAF.
  • MAF as a Transferable Fusion Paradigm: The "Gating → Expert Query → Self-attention → Residual" flow is applicable to any scenario feeding heterogeneous conditions into a single backbone.
  • Cross-modal Regularization Insight: Using text quality as a lever for non-text tasks (V2A) is an insightful departure from the intuition that video tasks only require more video data.
  • New Benchmark T2A-bench: Addresses the gap in evaluating fine-grained control (category, count, order, timestamps).

Limitations & Future Work

  • Dependency on External LLMs: The quality of IF-caps is capped by Gemini and Qwen; systematic biases or hallucinations might be inherited.
  • Parity in Some Tasks: Performance in V2A/TV2A is "at par" with MMAudio rather than dominant; the unified advantage lies primarily in text-related and instruction-following tasks.
  • High Cost: 2.4B parameters and 250 diffusion steps limit real-time or edge deployment.
  • Future Directions: Exploring efficient sampling (distillation), self-labeling loops, and finer temporal alignment.

Comparison with Related Work

  • vs. Specialists (AudioGen/MusicGen/Tango): Specialists lock into single conditions; AudioX covers all combinations with one weight set and outperforms them on T2A/T2M.
  • vs. Video-to-Audio Specialists (MMAudio/Diff-Foley): While these specialize in V2A, AudioX matches their performance while unlocking music generation and continuation.
  • vs. Prior Unified Efforts: Previous works had weak flexibility and instruction following. AudioX bridges this gap with compositional data (IF-caps) and adaptive fusion (MAF).

Rating

  • Novelty: ⭐⭐⭐⭐ The MAF module and cross-modal regularization findings are innovative.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 6 tasks, multiple datasets, dual benchmarks, user studies, and complete ablations.
  • Writing Quality: ⭐⭐⭐⭐ Clear logic, though some MAF internal dimensions require the appendix.
  • Value: ⭐⭐⭐⭐⭐ A unified framework + 7M open-source samples + new benchmarks represent a solid infrastructure for the community.