Few-shot Acoustic Synthesis with Multimodal Flow Matching

Conference: CVPR 2026
arXiv: 2603.19176
Code: Project Page
Area: Image Generation (Audio Generation / Acoustic Synthesis)
Keywords: flow matching, room impulse response, few-shot acoustic synthesis, diffusion transformer, multimodal conditioning, joint embedding

TL;DR

This paper proposes FLAC, the first flow matching-based few-shot room impulse response (RIR) generation framework, capable of synthesizing spatially consistent acoustic responses in unseen scenes from a single recording. It further introduces AGREE, a joint embedding for geometry–acoustic consistency evaluation.

Background & Motivation

Importance of room acoustic modeling: Immersive virtual environments require spatial consistency between sound and space. Room impulse responses (RIRs) characterize sound propagation between a source and a receiver and are essential for spatial audio rendering.

Limitations of neural acoustic fields: Existing neural acoustic field methods (e.g., NeRAF, AV-GS) achieve spatially continuous rendering within a single scene but require dense recordings and per-scene training, preventing generalization to new environments.

Insufficiency of few-shot methods: Few-shot approaches such as FewShotRIR, MAGIC, and xRIR require 8–20 reference recordings and produce deterministic predictions, ignoring the inherent uncertainty of acoustic responses under sparse observations.

Drawbacks of deterministic modeling: With limited scene information, a single source–receiver configuration may correspond to multiple plausible RIRs (e.g., carpet versus hardwood floors yield significantly different acoustics); deterministic methods cannot capture this ambiguity.

Potential of flow matching for audio generation: Flow matching, as an efficient alternative to diffusion models, has demonstrated strong performance in text-to-speech and music generation, yet has not been applied to explicit RIR synthesis.

Lack of geometry-consistency evaluation: Conventional acoustic metrics (T60, C50, EDT) measure perceptual quality only and lack the means to assess the geometric consistency of generated RIRs with the scene.

Method

Overall Architecture

FLAC is a conditional latent generative model comprising three core modules:

  1. VAE encoder: Compresses RIR waveforms into a compact latent representation \(\mathbf{z}_0\) with a bottleneck dimension of 32.
  2. Multimodal conditioner: Fuses acoustic (reference RIRs), spatial (source position), and geometric (panoramic depth map) conditioning.
  3. Diffusion Transformer (DiT): Trained with a flow matching objective to generate RIR latent representations from noise.

Training employs rectified flow matching with linear interpolation between data and noise: \(\mathbf{z}_t = (1-t)\mathbf{z}_0 + t\boldsymbol{\epsilon}\), with the model predicting the velocity field \(\mathbf{v}_t = \boldsymbol{\epsilon} - \mathbf{z}_0\). At inference, the ODE is solved in reverse from Gaussian noise to produce RIRs.
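
For concreteness, here is a minimal PyTorch-style sketch of this objective and of Euler integration of the reverse ODE; `dit`, `z0`, and `cond` are illustrative stand-ins, not the paper's code.

```python
import torch

# Minimal rectified-flow training step. `dit(z_t, t, cond)` stands in for the
# conditioned velocity predictor; module names and shapes are illustrative.
def flow_matching_step(dit, z0, cond):
    eps = torch.randn_like(z0)                      # noise endpoint of the path
    t = torch.rand(z0.shape[0], device=z0.device)   # uniform here; the paper uses logit-normal sampling
    t_ = t.view(-1, *([1] * (z0.dim() - 1)))        # broadcast t over the latent dims
    z_t = (1 - t_) * z0 + t_ * eps                  # linear interpolation between data and noise
    v_target = eps - z0                             # constant velocity along the straight path
    v_pred = dit(z_t, t, cond)
    return torch.mean((v_pred - v_target) ** 2)     # L_RFM

# Euler integration of the reverse ODE: start from pure noise at t = 1 and
# step toward the data distribution at t = 0.
@torch.no_grad()
def sample(dit, cond, shape, steps=50, device="cpu"):
    z = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i * dt, device=device)
        z = z - dt * dit(z, t, cond)                # dz/dt = v_t, integrated downward in t
    return z                                        # decode with the VAE to recover the RIR waveform
```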

Key Designs

  • Timestep sampling strategy: Timesteps are drawn as \(\alpha \sim \mathcal{N}(-1.2, 4)\) mapped through a sigmoid, concentrating training on intermediate noise levels (\(t \approx 0.7\)–\(0.8\)) to improve training efficiency (see the first sketch after this list).
  • Multimodal conditioning injection:
    • Acoustic conditioning: \(K\) reference RIRs encoded by ResNet-18 into 512-dimensional embeddings.
    • Spatial conditioning: Source position coordinates encoded via sinusoidal positional encoding followed by a linear projection.
    • Geometric conditioning: Panoramic depth maps are back-projected to 3D coordinates via the equirectangular mapping (sketched after this list), reflection maps are computed, and the result is encoded by a fine-tuned DINOv3 ViT-S/16.
  • DiT architecture: 12-layer Transformer with 8-head attention and hidden dimension 256. Target pose and timestep are injected via AdaLN; multimodal context is fused via cross-attention. RoPE positional encoding is used.
  • Classifier-free guidance: Conditions are randomly dropped during training; at inference, the guidance weight \(\omega\) controls conditioning strength (see the guided-velocity sketch after this list).
  • AGREE joint embedding: A CLIP-style dual encoder aligning RIRs and scene geometry in a shared latent space, enabling zero-shot cross-modal retrieval.
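
Two of these designs are easy to make concrete. The sketch below assumes \(\mathcal{N}(-1.2, 4)\) means mean −1.2 and variance 4, and assumes a hypothetical `dit(z, t, cond)` velocity predictor with a learned null condition; neither detail is confirmed by the paper.

```python
import torch

# Logit-normal timestep sampling: alpha ~ N(-1.2, 4) (read here as mean -1.2,
# variance 4) pushed through a sigmoid, biasing training toward the
# intermediate noise levels described above.
def sample_timesteps(batch_size: int, device="cpu") -> torch.Tensor:
    alpha = -1.2 + 2.0 * torch.randn(batch_size, device=device)  # std = sqrt(4)
    return torch.sigmoid(alpha)

# Classifier-free guidance at inference: blend conditional and unconditional
# velocity predictions; omega > 1 strengthens the conditioning signal.
def guided_velocity(dit, z, t, cond, null_cond, omega: float):
    v_cond = dit(z, t, cond)         # velocity under the multimodal conditioning
    v_uncond = dit(z, t, null_cond)  # velocity with conditions dropped
    return v_uncond + omega * (v_cond - v_uncond)
```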

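The depth back-projection in the geometric-conditioning step can be sketched as follows, assuming a standard equirectangular convention (azimuth across image width, elevation across height); the paper's exact axis conventions and the reflection-map computation are not reproduced here.

```python
import torch

# Back-project a panoramic (equirectangular) depth map to 3D points.
# Assumed convention: longitude spans [-pi, pi] across the width,
# latitude spans [pi/2, -pi/2] from top to bottom.
def equirect_depth_to_points(depth: torch.Tensor) -> torch.Tensor:  # depth: (H, W), meters
    H, W = depth.shape
    lon = (torch.arange(W) + 0.5) / W * 2 * torch.pi - torch.pi  # azimuth per column
    lat = torch.pi / 2 - (torch.arange(H) + 0.5) / H * torch.pi  # elevation per row
    lat, lon = torch.meshgrid(lat, lon, indexing="ij")           # both (H, W)
    x = depth * torch.cos(lat) * torch.sin(lon)
    y = depth * torch.sin(lat)
    z = depth * torch.cos(lat) * torch.cos(lon)
    return torch.stack([x, y, z], dim=-1)                        # (H, W, 3) camera-frame points
```
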
Loss & Training

  • Flow matching loss: \(\mathcal{L}_{\text{RFM}} = \mathbb{E}[\|u(\mathbf{z}_t, t, \boldsymbol{\tau}) - \mathbf{v}_t\|^2]\)
  • VAE training loss: Multi-resolution STFT loss \(\mathcal{L}_{\text{MR}}\) (spectral convergence + energy decay; the spectral-convergence term is sketched below), adversarial hinge loss \(\mathcal{L}_{\text{adv}}\) and feature-matching loss \(\mathcal{L}_{\text{feat}}\) against an Encodec-style multi-scale STFT discriminator, and a KL divergence term \(\mathcal{L}_{\text{KL}}\).
  • AGREE contrastive loss: A CLIP-style objective that maximizes similarity for matched RIR–geometry pairs and minimizes it for unmatched pairs (sketched below, after the STFT-loss example).
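
As a reference for the spectral-convergence term of \(\mathcal{L}_{\text{MR}}\), here is a minimal sketch; the FFT resolutions are illustrative assumptions, and the energy-decay, adversarial, and KL terms are omitted.

```python
import torch

# Spectral-convergence term of a multi-resolution STFT loss, averaged over
# several FFT resolutions (the resolutions here are illustrative, not the paper's).
def mr_stft_sc_loss(x, x_hat, fft_sizes=(512, 1024, 2048)):  # x, x_hat: (B, T) waveforms
    loss = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4
        win = torch.hann_window(n_fft, device=x.device)
        S = torch.stft(x, n_fft, hop, window=win, return_complex=True).abs()
        S_hat = torch.stft(x_hat, n_fft, hop, window=win, return_complex=True).abs()
        # Spectral convergence: relative Frobenius error between magnitude spectrograms.
        loss = loss + torch.norm(S - S_hat, p="fro") / (torch.norm(S, p="fro") + 1e-8)
    return loss / len(fft_sizes)
```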

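A CLIP-style symmetric contrastive objective of the kind AGREE describes typically looks like the sketch below; the temperature value and normalization details are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

# Symmetric contrastive loss over a batch of paired RIR and geometry embeddings.
# Matched pairs sit on the diagonal of the similarity matrix.
def agree_contrastive_loss(rir_emb, geo_emb, temperature=0.07):
    rir_emb = F.normalize(rir_emb, dim=-1)           # (B, D) RIR embeddings
    geo_emb = F.normalize(geo_emb, dim=-1)           # (B, D) geometry embeddings
    logits = rir_emb @ geo_emb.t() / temperature     # (B, B) pairwise similarities
    targets = torch.arange(logits.shape[0], device=logits.device)
    # Cross-entropy in both retrieval directions: RIR->geometry and geometry->RIR.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```
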
Key Experimental Results

Datasets & Setup

  • AcousticRooms (AR): 260 rooms with 300k+ RIRs at 22,050 Hz, simulated via the wave equation; 243 seen / 17 unseen rooms.
  • Hearing-Anything-Anywhere (HAA): 4 real rooms for sim-to-real transfer evaluation.
  • Training conducted on a single H100 GPU with AdamW optimizer, learning rate \(5 \times 10^{-5}\), batch size 64, BF16 precision.

Main Results

Few-shot generation on unseen scenes (AcousticRooms):

| Method | K | T60 (%) ↓ | C50 (dB) ↓ | EDT (ms) ↓ | R@5 (%) ↑ |
|--------|---|-----------|------------|------------|-----------|
| xRIR   | 8 | 9.98      | 1.354      | 49.40      | 2.00      |
| FLAC   | 8 | 8.60      | 0.970      | 37.13      | 19.38     |
| xRIR   | 1 | 14.47     | 1.961      | 74.45      | 1.36      |
| FLAC   | 1 | 9.95      | 1.046      | 40.04      | 18.92     |

Sim-to-real transfer (HAA):

| Method    | K  | T60 (%) ↓ | C50 (dB) ↓ | EDT (ms) ↓ |
|-----------|----|-----------|------------|------------|
| Diff-RIR† | 12 | 3.74      | 2.067      | 88.09      |
| FLAC      | 8  | 3.10      | 2.167      | 84.52      |
| FLAC      | 1  | 3.45      | 2.170      | 90.02      |

Ablation Study

  • Conditioning modality ablation: Geometry-only conditioning yields better C50 and EDT (early reflections determined by nearby surfaces); acoustic-only conditioning yields better T60 (global reverberation is difficult to infer from local geometry); combining both achieves the best performance.
  • Geometric encoder: Fine-tuned DINOv3 ViT-S/16 outperforms both training from scratch and frozen variants, as well as xRIR's ViT.
  • DiT conditioning strategy: AdaLN + Cross-Attention significantly outperforms In-Context and pure Cross-Attention approaches.
  • Acoustic encoder: A frozen VAE encoder generalizes slightly better across rooms than ResNet-18, at higher computational cost.

Key Findings

  • FLAC with 1-shot surpasses all 8-shot baselines; 93.01% of participants (n=46) in subjective listening tests preferred FLAC.
  • Uncertainty analysis: Low-frequency bands exhibit higher sample variance and longer durations, consistent with room acoustics theory: low-frequency responses are dominated by sparse boundary modes, while high frequencies stabilize above the Schröder frequency (see the sketch after this list).
  • Intra-condition diversity is only 4.5% of cross-condition diversity (1.03 vs. 22.96), indicating that the model introduces meaningful stochasticity while maintaining contextual consistency.
  • The deterministic variant (fixed noise) shows significantly degraded performance (+6% T60, +10% C50, −40% R@5), confirming that stochasticity is essential for few-shot acoustic synthesis.
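
For context on this argument, the Schröder crossover frequency scales as \(f_s \approx 2000\sqrt{T_{60}/V}\) (with \(T_{60}\) in seconds and volume \(V\) in m³); the sketch below computes it for an illustrative room.

```python
import math

# Schroeder (crossover) frequency: above it, modal overlap is dense enough that
# the room response behaves statistically; below it, sparse modes dominate.
def schroeder_frequency(t60_s: float, volume_m3: float) -> float:
    return 2000.0 * math.sqrt(t60_s / volume_m3)

# Example: a 100 m^3 room with T60 = 0.5 s crosses over near 141 Hz.
print(round(schroeder_frequency(0.5, 100.0)))  # -> 141
```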

Highlights & Insights

  • Pioneering contribution: First application of flow matching to explicit RIR synthesis, framing few-shot acoustic synthesis as a probabilistic generative problem.
  • Exceptional data efficiency: 1-shot performance surpasses the previous 8-shot SOTA, reducing required recordings by 8×.
  • AGREE evaluation framework: Proposes a CLIP-style acoustic–geometry joint embedding that fills the gap in geometry-consistency evaluation and enables zero-shot cross-modal retrieval.
  • Physically grounded uncertainty modeling: Higher uncertainty at low frequencies and convergence at high frequencies align with room acoustics Schröder frequency theory.
  • Practical applicability: Trained on a single H100; inference requires only one step to obtain high-quality results; the few-shot method adapts to new scenes in minutes.

Limitations & Future Work

  • Inaccurate area classification: This paper genuinely belongs to audio/acoustic synthesis; its classification under image generation is inappropriate.
  • Limited real-scene generalization: Geometric annotations in the HAA dataset are simplified (e.g., tables modeled as planes), and the VAE is not fine-tuned on real recordings, limiting sim-to-real transfer.
  • Single sample rate constraint: The current model supports only 22,050 Hz; high-fidelity applications require higher sample rates.
  • Elevated FDG metric: The generated distribution still diverges from the real distribution in AGREE space, particularly on real data.
  • Scarcity of real data: The absence of large-scale, diverse real audio-visual datasets constrains VAE and overall model performance in real-world scenarios.
  • Monaural limitation: Only monaural omnidirectional RIRs are handled; binaural and multichannel scenarios are not addressed.

Related Work

  • Neural acoustic fields: Per-scene training methods such as NeRAF and AV-GS achieve spatially continuous rendering but do not generalize to new scenes.
  • Few-shot acoustic synthesis: FewShotRIR (20 samples) → MAGIC (semantic augmentation) → xRIR (8 samples + depth maps); all are deterministic methods.
  • Audio diffusion and flow matching: Diffusion models have succeeded in speech/music generation; flow matching improves efficiency; this paper is the first to introduce it to RIR synthesis.
  • Joint embedding models: CLIP → audio-visual/audio-text embeddings, but standard audio embeddings are unsuitable for RIRs; AGREE is the first to align RIRs with scene geometry.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (First application of flow matching to RIR synthesis; novel probabilistic modeling perspective; pioneering AGREE evaluation framework)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Two datasets, multiple baselines, comprehensive ablations, uncertainty analysis, subjective listening tests, cross-modal retrieval validation)
  • Writing Quality: ⭐⭐⭐⭐ (Clear structure, rich figures, adequate physical intuition; notation is somewhat dense in places)
  • Value: ⭐⭐⭐⭐ (Opens a new direction for few-shot acoustic synthesis with exceptional practical data efficiency, though the field is relatively niche)