In-the-wild Audio Spatialization with Flexible Text-guided Localization

Conference: ACL 2025
arXiv: 2506.00927
Code: GitHub
Area: Audio & Speech
Keywords: audio spatialization, binaural audio, text-guided, latent diffusion, spatial reasoning

TL;DR

This paper proposes TAS (Text-guided Audio Spatialization), a framework in which flexible text prompts (e.g., 3D spatial location descriptions or relative positions between sound sources) guide a latent diffusion model that converts monaural audio into binaural audio. The authors construct SpatialTAS, a dataset of 376K samples. TAS outperforms existing approaches on both simulated and real-recorded data, and the authors also fine-tune Llama-3.1-8B into an evaluation model for spatial semantic consistency.

Background & Motivation

Background: Audio spatialization maps monaural audio to binaural audio, providing spatial perception for VR/AR and embodied AI. Existing methods mainly rely on visual frame guidance.

Limitations of Prior Work: (a) Visual-guided methods are limited by the camera field of view (FOV) and cannot handle sound sources outside the visual field; (b) they lack flexible interactive control, meaning users cannot selectively specify the spatial locations of specific sound sources; (c) high-quality, large-scale stereo data is scarce.

Key Challenge: Complex multi-object interactive environments require flexible and controllable spatialization methods, but existing methods either depend on complete visual frames or lack selective control.

Goal: To flexibly control audio spatialization using text descriptions (instead of visual frames), supporting 3D position specifications and descriptions of relative relationships between sound sources.

Key Insight: Modeling the binaural difference (left channel minus right channel) instead of the complete binaural audio reduces the modeling difficulty; generation is performed by a latent diffusion model operating on a latent encoding of the difference's mel-spectrogram.

Core Idea: To use text descriptions of sound source spatial positions and train a latent diffusion model to learn channel differences, achieving flexible and controllable audio spatialization.

Method

Overall Architecture

Given the input monaural audio \(A_{\text{mono}} = A_l + A_r\) and text spatial prompts \(T_{\text{prompts}}\), the model learns the channel difference \(A_{lr} = A_l - A_r\). During inference, the binaural audio is reconstructed using \(\hat{A}_l = (A_{\text{mono}} + A_{lr})/2\) and \(\hat{A}_r = (A_{\text{mono}} - A_{lr})/2\).
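The decomposition above is exactly invertible, which a few lines of code can confirm. The following is a minimal sketch using plain Python lists as stand-ins for waveforms; the function names `split_mono_and_difference` and `reconstruct_binaural` are illustrative and do not come from the paper's code.

```python
import random

def split_mono_and_difference(left, right):
    """Return (A_mono, A_lr) with A_mono = A_l + A_r and A_lr = A_l - A_r."""
    mono = [l + r for l, r in zip(left, right)]
    diff = [l - r for l, r in zip(left, right)]
    return mono, diff

def reconstruct_binaural(mono, diff):
    """Invert the split: A_l = (A_mono + A_lr) / 2, A_r = (A_mono - A_lr) / 2."""
    left = [(m + d) / 2.0 for m, d in zip(mono, diff)]
    right = [(m - d) / 2.0 for m, d in zip(mono, diff)]
    return left, right

# Toy stereo signal (random samples standing in for real audio).
random.seed(0)
left = [random.uniform(-1.0, 1.0) for _ in range(8)]
right = [0.3 * random.uniform(-1.0, 1.0) for _ in range(8)]

mono, diff = split_mono_and_difference(left, right)
left_hat, right_hat = reconstruct_binaural(mono, diff)

# Largest sample-wise deviation between the original and rebuilt channels.
max_error = max(abs(a - b) for a, b in zip(left + right, left_hat + right_hat))
print(f"max reconstruction error: {max_error:.2e}")
```

At inference time the ground-truth `diff` is not available; the model's predicted difference would take its place in `reconstruct_binaural`, so any error in the prediction is split evenly between the two output channels.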

Key Designs

  1. Channel Difference Learning + Latent Diffusion:

    • Function: Instead of directly generating binaural audio, the model learns a latent representation of the difference between the left and right channels.
    • Mechanism: A VAE encodes the mel-spectrogram of \(A_{lr}\) into a latent space \(\rightarrow\) a conditional diffusion model denoises conditioned on text + audio embeddings \(\rightarrow\) the VAE decodes and a HiFi-GAN vocoder reconstructs the waveform.
    • Design Motivation: Learning channel differences is simpler than learning full binaural audio, and the latent space is computationally more efficient than the waveform space.
  2. Text-Spatial Consistency Enhancement:

    • Function: Fine-tunes the text encoder so that it can discriminate spatial orientation, using channel-flipped audio (\(A_{rl} = A_r - A_l\)) as negative samples.
    • Mechanism: A classifier \(P\) determines whether the audio difference matches the text description, and a BCE loss \(\mathcal{L}_{loc}\) trains the text encoder to capture spatial orientation information.
    • Design Motivation: Pre-trained text encoders (such as FLAN-T5) lack spatial-audio alignment training, and channel flipping provides simple yet effective contrastive signals.
  3. LLM Spatial Understanding Evaluation:

    • Function: Fine-tune Llama-3.1-8B as a spatial audio reasoning evaluator to assess the spatial semantic correctness of the generated binaural audio.
    • Mechanism: Feed ground-truth and generated binaural audio into the evaluation model to answer spatial questions. A smaller gap in prediction accuracy indicates higher spatial fidelity.
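The channel-flipping idea from the Text-Spatial Consistency Enhancement design can be sketched as follows. This is only an illustration under stated assumptions: `toy_match_probability` is a hypothetical hand-written stand-in for the classifier \(P\) (the paper trains a network and fine-tunes the text encoder instead), and it assumes the prompt contains the keyword "left" or "right". It relies on the identity \(\sum A_{\text{mono}} A_{lr} = \sum (A_l^2 - A_r^2)\), so negating the difference flips the sign of the cue.

```python
import math

def flip_difference(diff):
    """A_rl = A_r - A_l is the negated A_lr."""
    return [-d for d in diff]

def make_contrastive_pairs(diff, prompt):
    """Label 1 for the true difference, 0 for the channel-flipped negative."""
    return [(diff, prompt, 1.0), (flip_difference(diff), prompt, 0.0)]

def toy_match_probability(mono, diff, prompt, sharpness=5.0):
    """Hypothetical stand-in for classifier P (NOT the paper's model).

    sum(mono * diff) equals E_left - E_right, so its sign tells which
    channel is louder; this is compared with a 'left'/'right' keyword.
    """
    balance = sum(m * d for m, d in zip(mono, diff))
    total = sum(m * m for m in mono) + sum(d * d for d in diff)
    normalized = 2.0 * balance / total if total > 0 else 0.0
    side = 1.0 if "left" in prompt.lower() else -1.0
    return 1.0 / (1.0 + math.exp(-sharpness * side * normalized))

def bce_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over (probability, label) pairs."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)

# Source louder in the left channel; the right channel is an attenuated copy.
left = [0.8, -0.6, 0.4, -0.2]
right = [0.25 * x for x in left]
mono = [l + r for l, r in zip(left, right)]
diff = [l - r for l, r in zip(left, right)]

pairs = make_contrastive_pairs(diff, "The dog barks on the left.")
probs = [toy_match_probability(mono, d, text) for d, text, _ in pairs]
labels = [label for _, _, label in pairs]
loss_loc = bce_loss(probs, labels)
print(probs, loss_loc)
```

In the actual method the gradient of this loss would flow into the text encoder; here the scorer has no trainable parameters and serves only to show how one recording yields a positive and a negative example for the same prompt.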

Loss & Training

The total loss is the sum of the diffusion noise-prediction loss and the spatial-consistency BCE loss, \(\mathcal{L} = \mathcal{L}_\theta + \mathcal{L}_{loc}\). Classifier-free guidance is used with scale \(\gamma = 2.5\).
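For readers unfamiliar with classifier-free guidance, the sketch below shows one common way the guidance scale enters sampling: the conditional and unconditional noise predictions are combined by linear extrapolation. This is an assumption about the convention; this summary does not state which formulation the paper uses (some implementations parameterize the scale as \(1 + w\) instead), and `classifier_free_guidance` is an illustrative name.

```python
import math

def classifier_free_guidance(eps_uncond, eps_cond, gamma=2.5):
    """Combine two noise predictions: eps = eps_uncond + gamma * (eps_cond - eps_uncond).

    gamma = 0 ignores the condition, gamma = 1 is plain conditional
    sampling, and gamma > 1 pushes further toward the condition.
    """
    return [u + gamma * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions for a 3-element latent.
eps_uncond = [0.0, 0.2, -0.4]
eps_cond = [1.0, 0.4, -0.2]

guided = classifier_free_guidance(eps_uncond, eps_cond, gamma=2.5)
print(guided)
```

With a scale above 1, the guided prediction moves past the conditional one along the direction that separates it from the unconditional prediction, which strengthens adherence to the text and audio conditions at some cost in diversity.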

Key Experimental Results

Main Results (SpatialTAS Test Set)

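| Method | FD↓ | FAD↓ | DOA↓ | DE↓ | Direction↓ | Distance↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Mono-Mono | 9.03 | 3.67 | 19.66 | 18.12 | 12.79 | 15.33 |
| PseudoBinaural | 7.23 | 2.81 | 6.39 | 4.00 | 10.36 | 12.91 |
| TAS (ours) | 4.93 | 1.44 | 3.07 | 2.45 | 6.99 | 8.16 |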

Ablation Study

| Configuration | FD↓ | DOA↓ | Direction↓ |
| --- | --- | --- | --- |
| Full model | 4.93 | 3.07 | 6.99 |
| w/o text | 6.77 | 5.87 | 9.25 |
| w/o Flipper | 5.08 | 4.14 | 8.63 |

Key Findings

  • Text guidance clearly outperforms unconditional generation: omitting text degrades every metric reported in the ablation (FD 4.93 to 6.77, DOA 3.07 to 5.87, Direction 6.99 to 9.25).
  • The channel-flipping enhancement mainly benefits the spatial perception metrics: removing it raises DOA from 3.07 to 4.14 and Direction from 6.99 to 8.63, while FD changes only slightly (4.93 to 5.08).
  • The model generalizes well to real-recorded data (FAIR-Play, YouTube-Binaural), approaching or outperforming visual-guided methods on traditional metrics such as STFT and ENV.
  • Relative position descriptions (e.g., "A is on the left of B") are more difficult to learn than absolute ones but are more practical.
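To put the ablation numbers in proportion, the relative degradation of each variant against the full model can be computed directly from the ablation table. The snippet below is a small arithmetic sketch; the dictionary simply transcribes the table, and `relative_increase` is an illustrative helper name.

```python
# Values transcribed from the ablation table (lower is better for all three).
ablation = {
    "Full model": {"FD": 4.93, "DOA": 3.07, "Direction": 6.99},
    "w/o text": {"FD": 6.77, "DOA": 5.87, "Direction": 9.25},
    "w/o Flipper": {"FD": 5.08, "DOA": 4.14, "Direction": 8.63},
}

def relative_increase(baseline, variant):
    """Percentage increase of each metric relative to the baseline."""
    return {
        name: 100.0 * (variant[name] - baseline[name]) / baseline[name]
        for name in baseline
    }

full = ablation["Full model"]
changes = {
    config: relative_increase(full, metrics)
    for config, metrics in ablation.items()
    if config != "Full model"
}

for config, deltas in changes.items():
    summary = ", ".join(f"{name} +{value:.1f}%" for name, value in deltas.items())
    print(f"{config}: {summary}")
```

Read this way, removing text hurts DOA the most in relative terms, while removing channel flipping leaves FD nearly unchanged and mostly affects the two spatial metrics.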

Highlights & Insights

  • Replacing visual frames with text as the guidance condition for spatialization is a notable change of direction: it removes the FOV limitation and supports selective control of individual sound sources.
  • Channel difference modeling is simple and elegant, avoiding the complexity of directly modeling full binaural audio.
  • Using an LLM for spatial audio evaluation is a novel approach that fills a gap left by traditional audio quality metrics, which do not assess spatial correctness.

Limitations & Future Work

  • The training data is simulated (SpatialSoundQA); reverberation and noise in real-world environments might not be fully covered.
  • Text descriptions still need to be manually provided by users or generated from visual frames using GPT-4o, meaning a fully end-to-end workflow is not yet realized.
  • Audio length is limited to 10 seconds, so long-duration scenarios are not covered.
  • Only English text descriptions are evaluated.

Comparison with Related Work

  • vs Visual-Guided Methods (e.g., Mono2Binaural): Free from FOV limitations but requires additional text descriptions as inputs.
  • vs Li et al. (2024b): This was the first text-guided method but was only annotated on the small-scale FAIR-Play, whereas TAS constructs a large-scale dataset of 376K samples.
  • vs Waveform-space Diffusion: Latent diffusion is computationally more efficient and produces better generation quality.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of text-guided spatialization and channel-difference latent diffusion is highly novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on both simulated and real data, using both generation and understanding metrics, along with thorough ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Clear descriptions of methods and intuitive illustrations.
  • Value: ⭐⭐⭐⭐ Practical application value for VR/AR and embodied AI audio systems.