In-the-wild Audio Spatialization with Flexible Text-guided Localization
Conference: ACL 2025
arXiv: 2506.00927
Code: GitHub
Area: Audio & Speech
Keywords: audio spatialization, binaural audio, text-guided, latent diffusion, spatial reasoning
TL;DR
This paper proposes TAS (Text-guided Audio Spatialization), a framework in which flexible text prompts (e.g., 3D spatial location descriptions or relative positions between sound sources) guide a latent diffusion model that converts monaural audio into binaural audio. The authors construct SpatialTAS, a dataset of 376K samples. TAS outperforms existing approaches on both simulated and real-recorded data, and a spatial semantic consistency evaluation model is built on Llama-3.1-8B.
Background & Motivation
Background: Audio spatialization maps monaural audio to binaural audio, providing spatial perception for VR/AR and embodied AI. Existing methods mainly rely on visual frame guidance.
Limitations of Prior Work: (a) Visual-guided methods are limited by the camera field of view (FOV) and cannot handle sound sources outside the visual field; (b) they lack flexible interactive control, meaning users cannot selectively specify the spatial locations of specific sound sources; (c) high-quality, large-scale stereo data is scarce.
Key Challenge: Complex multi-object interactive environments require flexible and controllable spatialization methods, but existing methods either depend on complete visual frames or lack selective control.
Goal: To flexibly control audio spatialization using text descriptions (instead of visual frames), supporting 3D position specifications and descriptions of relative relationships between sound sources.
Key Insight: Modeling the binaural difference (left channel minus right channel) instead of the complete binaural audio reduces the modeling difficulty; a latent diffusion model generates this difference in the latent space of its mel-spectrogram.
Core Idea: To use text descriptions of sound source spatial positions and train a latent diffusion model to learn channel differences, achieving flexible and controllable audio spatialization.
Method
Overall Architecture
Given the input monaural audio \(A_{\text{mono}} = A_l + A_r\) and text spatial prompts \(T_{\text{prompts}}\), the model learns the channel difference \(A_{lr} = A_l - A_r\). During inference, the binaural audio is reconstructed using \(\hat{A}_l = (A_{\text{mono}} + A_{lr})/2\) and \(\hat{A}_r = (A_{\text{mono}} - A_{lr})/2\).
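The sum/difference relationship above can be checked numerically. The following is a minimal sketch, assuming NumPy and random toy signals in place of real audio; the function names are illustrative and not taken from the paper's code. In the actual model the difference signal would be generated rather than computed from ground truth.

```python
import numpy as np

def split_mono_and_diff(left, right):
    """Return (A_mono, A_lr) using the definitions in the text."""
    mono = left + right   # A_mono = A_l + A_r
    diff = left - right   # A_lr   = A_l - A_r
    return mono, diff

def reconstruct_binaural(mono, diff):
    """Invert the sum/difference pair back into left and right channels."""
    left_hat = (mono + diff) / 2.0
    right_hat = (mono - diff) / 2.0
    return left_hat, right_hat

# Toy signals standing in for one second of audio (hypothetical data).
rng = np.random.default_rng(0)
left = rng.standard_normal(16000)
right = rng.standard_normal(16000)

mono, diff = split_mono_and_diff(left, right)
left_hat, right_hat = reconstruct_binaural(mono, diff)

# With the exact difference signal, both channels are recovered.
print(np.allclose(left_hat, left), np.allclose(right_hat, right))
```

Because the mono input is given, the only quantity the model has to produce is the difference signal; any error in it is split evenly between the two reconstructed channels.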
Key Designs
- Channel Difference Learning + Latent Diffusion:
  - Function: Instead of directly generating binaural audio, the model learns a latent representation of the difference between the left and right channels.
  - Mechanism: A VAE encodes the mel-spectrogram of \(A_{lr}\) into a latent space \(\rightarrow\) a conditional diffusion model denoises conditioned on text + audio embeddings \(\rightarrow\) the VAE decodes and a HiFi-GAN vocoder reconstructs the waveform.
  - Design Motivation: Learning channel differences is simpler than learning full binaural audio, and the latent space is computationally more efficient than the waveform space.
- Text-Spatial Consistency Enhancement:
  - Function: Fine-tune the text encoder to discriminate spatial orientation, using flipped-channel audio (\(A_{rl} = A_r - A_l\)) as negative samples.
  - Mechanism: A classifier \(P\) determines whether the audio difference matches the text description, and a BCE loss \(\mathcal{L}_{loc}\) trains the text encoder to capture spatial orientation information.
  - Design Motivation: Pre-trained text encoders (such as FLAN-T5) lack spatial-audio alignment training, and channel flipping provides a simple yet effective contrastive signal.
- LLM Spatial Understanding Evaluation:
  - Function: Fine-tune Llama-3.1-8B as a spatial audio reasoning evaluator to assess the spatial semantic correctness of the generated binaural audio.
  - Mechanism: Feed ground-truth and generated binaural audio into the evaluation model to answer spatial questions. A smaller gap in prediction accuracy between the two indicates higher spatial fidelity.
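The channel-flipping idea from the Text-Spatial Consistency Enhancement design can be illustrated as follows. This is a rough sketch under stated assumptions: `toy_match_prob` is a hypothetical hand-written stand-in for the classifier \(P\) (the paper's classifier is learned and drives the text encoder), the prompt is reduced to a "left"/"right" label, and no training is performed; only the construction of the positive/negative pair and the BCE loss follow the description above.

```python
import numpy as np

def make_flip_pair(left, right):
    """Positive: A_lr = A_l - A_r. Negative: A_rl = A_r - A_l (channels flipped)."""
    positive = left - right
    negative = right - left
    return positive, negative

def toy_match_prob(mono, diff, prompt_side):
    """Hypothetical stand-in for classifier P (not the paper's model).

    sum(mono * diff) equals left energy minus right energy, so its sign
    indicates which channel is louder.
    """
    balance = np.sum(mono * diff) / (np.sum(mono * mono) + 1e-8)
    sign = 1.0 if prompt_side == "left" else -1.0
    return 1.0 / (1.0 + np.exp(-10.0 * sign * balance))

def bce_loss(probs, labels, eps=1e-7):
    """Binary cross-entropy over predicted match probabilities."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=float)
    return float(-np.mean(labels * np.log(probs)
                          + (1.0 - labels) * np.log(1.0 - probs)))

# Toy source that is louder in the left channel (hypothetical data).
rng = np.random.default_rng(0)
source = rng.standard_normal(8000)
left, right = 1.0 * source, 0.3 * source
mono = left + right

pos, neg = make_flip_pair(left, right)
p_pos = toy_match_prob(mono, pos, "left")   # prompt says "on the left"
p_neg = toy_match_prob(mono, neg, "left")   # flipped audio, same prompt
loss = bce_loss([p_pos, p_neg], [1.0, 0.0])
print(p_pos > 0.5, p_neg < 0.5)
```

The flipped sample reuses the same audio content and the same prompt, so the only thing that separates the positive from the negative is the left/right orientation, which is the property the text encoder is meant to capture.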
Loss & Training
The total loss is the sum of the diffusion noise prediction loss \(\mathcal{L}_\theta\) and the spatial consistency BCE loss \(\mathcal{L}_{loc}\). Classifier-Free Guidance is used with scale \(\gamma=2.5\).
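As a small illustration of these two ingredients, the sketch below adds the two loss terms and applies classifier-free guidance to a pair of noise predictions. It assumes the commonly used guidance form \(\hat{\epsilon} = \epsilon_{\text{uncond}} + \gamma(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})\) and an unweighted sum of the losses; this summary does not spell out either detail, so both are assumptions, as are the function names.

```python
import numpy as np

def total_loss(diffusion_loss, loc_loss):
    """Unweighted sum of L_theta and L_loc (equal weighting is an assumption)."""
    return diffusion_loss + loc_loss

def cfg_noise(eps_uncond, eps_cond, gamma=2.5):
    """Classifier-free guidance in its commonly used form (assumed here).

    gamma = 0 returns the unconditional prediction, gamma = 1 the conditional
    one, and gamma > 1 extrapolates further toward the condition.
    """
    return eps_uncond + gamma * (eps_cond - eps_uncond)

# Toy noise predictions standing in for the denoiser outputs (hypothetical).
eps_uncond = np.zeros(4)
eps_cond = np.ones(4)
guided = cfg_noise(eps_uncond, eps_cond)  # uses gamma = 2.5
print(guided)
```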
Key Experimental Results
Main Results (SpatialTAS Test Set)
| Method | FD↓ | FAD↓ | DOA↓ | DE↓ | Direction↓ | Distance↓ |
|---|---|---|---|---|---|---|
| Mono-Mono | 9.03 | 3.67 | 19.66 | 18.12 | 12.79 | 15.33 |
| PseudoBinaural | 7.23 | 2.81 | 6.39 | 4.00 | 10.36 | 12.91 |
| TAS (ours) | 4.93 | 1.44 | 3.07 | 2.45 | 6.99 | 8.16 |
Ablation Study
| Configuration | FD↓ | DOA↓ | Direction↓ |
|---|---|---|---|
| Full model | 4.93 | 3.07 | 6.99 |
| w/o text | 6.77 | 5.87 | 9.25 |
| w/o Flipper | 5.08 | 4.14 | 8.63 |
Key Findings
- Text guidance is clearly better than generation without text: omitting the text condition degrades all metrics (e.g., FD 4.93 → 6.77, DOA 3.07 → 5.87, Direction 6.99 → 9.25).
- The flipped-channel enhancement mainly benefits the spatial perception metrics: without it, DOA rises from 3.07 to 4.14 and Direction from 6.99 to 8.63, while FD changes only slightly (4.93 to 5.08).
- The model generalizes well to real-recorded data (FAIR-Play, YouTube-Binaural), approaching or outperforming visual-guided methods on traditional metrics such as STFT and ENV.
- Relative position descriptions (e.g., "A is on the left of B") are more difficult to learn than absolute ones but are more practical.
Highlights & Insights
- Replacing visual frames with text as the guidance condition for spatialization is a notable change of direction: it is free from FOV limitations and supports selective control.
- Channel difference modeling is simple and elegant, avoiding the complexity of directly modeling full binaural audio.
- Using LLMs for spatial audio evaluation is a novel paradigm that fills a gap left by traditional audio quality metrics, which do not assess spatial correctness.
Limitations & Future Work
- The training data is simulated (SpatialSoundQA); reverberation and noise in real-world environments might not be fully covered.
- Text descriptions still need to be manually provided by users or generated from visual frames using GPT-4o, meaning a fully end-to-end workflow is not yet realized.
- There is a 10-second audio length limitation, leaving long-duration audio scenarios uncovered.
- Only English text descriptions are evaluated.
Related Work & Insights
- vs Visual-Guided Methods (e.g., Mono2Binaural): TAS is free from FOV limitations but requires text descriptions as an additional input.
- vs Li et al. (2024b): This was the first text-guided method, but its annotations covered only the small-scale FAIR-Play dataset, whereas TAS constructs a large-scale dataset of 376K samples.
- vs Waveform-space Diffusion: Latent diffusion is computationally more efficient and produces better generation quality.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of text-guided spatialization and channel-difference latent diffusion is highly novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on both simulated and real data, using both generation and understanding metrics, along with thorough ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Clear descriptions of methods and intuitive illustrations.
- Value: ⭐⭐⭐⭐ Practical application value for VR/AR and embodied AI audio systems.