# Few-shot Acoustic Synthesis with Multimodal Flow Matching
Conference: CVPR2026
arXiv: 2603.19176
Code: Project Page
Area: Image Generation (Audio Generation / Acoustic Synthesis)
Keywords: flow matching, room impulse response, few-shot acoustic synthesis, diffusion transformer, multimodal conditioning, joint embedding
## TL;DR
This paper proposes FLAC, the first flow matching-based few-shot room impulse response (RIR) generation framework, capable of synthesizing spatially consistent acoustic responses in unseen scenes from a single recording. It further introduces AGREE, a joint embedding for geometry–acoustic consistency evaluation.
## Background & Motivation
Importance of room acoustic modeling: Immersive virtual environments require spatial consistency between sound and space. Room impulse responses (RIRs) characterize sound propagation between a source and a receiver and are essential for spatial audio rendering.
Limitations of neural acoustic fields: Existing neural acoustic field methods (e.g., NeRAF, AV-GS) achieve spatially continuous rendering within a single scene but require dense recordings and per-scene training, preventing generalization to new environments.
Insufficiency of few-shot methods: Few-shot approaches such as FewShotRIR, MAGIC, and xRIR require 8–20 reference recordings and produce deterministic predictions, ignoring the inherent uncertainty of acoustic responses under sparse observations.
Drawbacks of deterministic modeling: With limited scene information, a single source–receiver configuration may correspond to multiple plausible RIRs (e.g., carpet versus hardwood floors yield significantly different acoustics); deterministic methods cannot capture this ambiguity.
Potential of flow matching for audio generation: Flow matching, as an efficient alternative to diffusion models, has demonstrated strong performance in text-to-speech and music generation, yet has not been applied to explicit RIR synthesis.
Lack of geometry-consistency evaluation: Conventional acoustic metrics (T60, C50, EDT) measure perceptual quality only and lack the means to assess the geometric consistency of generated RIRs with the scene.
## Method
### Overall Architecture
FLAC is a conditional latent generative model comprising three core modules:
- VAE encoder: Compresses RIR waveforms into a compact latent representation \(\mathbf{z}_0\) with a bottleneck dimension of 32.
- Multimodal conditioner: Fuses acoustic (reference RIRs), spatial (source position), and geometric (panoramic depth map) conditioning.
- Diffusion Transformer (DiT): Trained with a flow matching objective to generate RIR latent representations from noise.
Training employs rectified flow matching with linear interpolation between data and noise: \(\mathbf{z}_t = (1-t)\mathbf{z}_0 + t\boldsymbol{\epsilon}\), with the model predicting the velocity field \(\mathbf{v}_t = \boldsymbol{\epsilon} - \mathbf{z}_0\). At inference, the ODE is solved in reverse from Gaussian noise to produce RIRs.
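The rectified-flow recipe above fits in a few lines. The sketch below is illustrative, not the paper's code: a toy single-data-point "oracle" velocity field stands in for the trained conditional DiT, and conditioning is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pair(z0):
    """Rectified-flow training pair: z_t = (1 - t) z0 + t eps, target v = eps - z0."""
    t = rng.uniform()                      # the paper uses sigmoid-mapped sampling
    eps = rng.standard_normal(z0.shape)
    z_t = (1.0 - t) * z0 + t * eps
    return z_t, t, eps - z0

def euler_sample(velocity, shape, n_steps=100):
    """Integrate dz/dt = v(z, t) from t = 1 (noise) down to t = 0 (data)."""
    z = rng.standard_normal(shape)
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        z = z + (t_next - t_cur) * velocity(z, t_cur)   # t_next < t_cur
    return z

# For a single data point z0*, the exact velocity field is v(z, t) = (z - z0*) / t,
# since z_t = z0* + t (eps - z0*). Euler integration then recovers z0* from any
# Gaussian noise draw.
z0_star = np.full(4, 2.0)
sampled = euler_sample(lambda z, t: (z - z0_star) / t, (4,))
```

In the real model, `velocity` is the DiT conditioned on reference RIRs, source position, and geometry.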
### Key Designs
- Timestep sampling strategy: Timesteps are obtained by drawing \(\alpha \sim \mathcal{N}(-1.2, 4)\) and mapping through a sigmoid, concentrating training on intermediate noise levels (\(t \approx 0.7\)–\(0.8\)) to improve efficiency.
- Multimodal conditioning injection:
- Acoustic conditioning: \(K\) reference RIRs encoded by ResNet-18 into 512-dimensional embeddings.
- Spatial conditioning: Source position coordinates encoded via sinusoidal positional encoding followed by a linear projection.
- Geometric conditioning: Panoramic depth maps converted to 3D coordinates via equirectangular projection, reflection maps computed, and encoded by a fine-tuned DINOv3 ViT-S/16.
- DiT architecture: 12-layer Transformer with 8-head attention and hidden dimension 256. Target pose and timestep are injected via AdaLN; multimodal context is fused via cross-attention. RoPE positional encoding is used.
- Classifier-free guidance: Conditions are randomly dropped during training; guidance weight \(\omega\) controls conditioning strength at inference.
- AGREE joint embedding: A CLIP-style dual encoder aligning RIRs and scene geometry in a shared latent space, enabling zero-shot cross-modal retrieval.
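Two of these designs are easy to make concrete: sigmoid-mapped (logit-normal) timestep sampling and the standard classifier-free-guidance combination at inference. The sketch below is a minimal illustration; reading the reported variance 4 as `scale=2.0`, and the exact guidance convention, are assumptions rather than the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_timesteps(n, loc=-1.2, scale=2.0):
    """Logit-normal timesteps: t = sigmoid(alpha), alpha ~ N(loc, scale^2).
    scale=2.0 assumes the reported variance of 4; conventions may differ."""
    alpha = rng.normal(loc, scale, size=n)
    return 1.0 / (1.0 + np.exp(-alpha))

def guided_velocity(v_cond, v_uncond, omega):
    """Classifier-free guidance: extrapolate away from the unconditional
    prediction by guidance weight omega (omega = 1 recovers v_cond)."""
    return v_uncond + omega * (v_cond - v_uncond)

t = sample_timesteps(10_000)   # all values strictly inside (0, 1)
```

During training, the conditions are randomly dropped so the same network can produce both `v_cond` and `v_uncond` at inference.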
### Loss & Training
- Flow matching loss: \(\mathcal{L}_{\text{RFM}} = \mathbb{E}[\|u(\mathbf{z}_t, t, \boldsymbol{\tau}) - \mathbf{v}_t\|^2]\)
- VAE training loss: multi-resolution STFT loss \(\mathcal{L}_{\text{MR}}\) (spectral convergence + energy decay), adversarial hinge loss \(\mathcal{L}_{\text{adv}}\), feature matching loss \(\mathcal{L}_{\text{feat}}\) (Encodec-style multi-scale STFT discriminator), and KL divergence \(\mathcal{L}_{\text{KL}}\)
- AGREE contrastive loss: Maximizes similarity for matched pairs and minimizes similarity for unmatched pairs.
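The AGREE contrastive objective is CLIP-style, so it can be sketched as a symmetric InfoNCE over a batch of matched (RIR, geometry) embedding pairs. This is a minimal NumPy illustration; the temperature value and normalization details are assumptions, not taken from the paper.

```python
import numpy as np

def agree_contrastive_loss(rir_emb, geo_emb, tau=0.07):
    """Symmetric InfoNCE over matched (RIR, geometry) embedding pairs;
    row i of each (B, D) matrix is one matched pair."""
    # L2-normalize so dot products are cosine similarities.
    r = rir_emb / np.linalg.norm(rir_emb, axis=1, keepdims=True)
    g = geo_emb / np.linalg.norm(geo_emb, axis=1, keepdims=True)
    logits = (r @ g.T) / tau              # (B, B) similarity matrix

    def xent(lg):
        # Cross-entropy with targets on the diagonal (numerically stabilized).
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Both retrieval directions: RIR -> geometry and geometry -> RIR.
    return 0.5 * (xent(logits) + xent(logits.T))
```

The same cosine-similarity matrix also supports zero-shot cross-modal retrieval, which is how a recall metric such as the R@5 reported in the results can be computed in AGREE space.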
## Key Experimental Results
### Datasets & Setup
- AcousticRooms (AR): 260 rooms with 300k+ RIRs at 22,050 Hz, simulated via the wave equation; 243 seen / 17 unseen rooms.
- Hearing-Anything-Anywhere (HAA): 4 real rooms for sim-to-real transfer evaluation.
- Training conducted on a single H100 GPU with AdamW optimizer, learning rate \(5 \times 10^{-5}\), batch size 64, BF16 precision.
### Main Results
Few-shot generation on unseen scenes (AcousticRooms), with \(K\) reference recordings:
| Method | K | T60 (%) ↓ | C50 (dB) ↓ | EDT (ms) ↓ | R@5 (%) ↑ |
|---|---|---|---|---|---|
| xRIR | 8 | 9.98 | 1.354 | 49.40 | 2.00 |
| FLAC | 8 | 8.60 | 0.970 | 37.13 | 19.38 |
| xRIR | 1 | 14.47 | 1.961 | 74.45 | 1.36 |
| FLAC | 1 | 9.95 | 1.046 | 40.04 | 18.92 |
Sim-to-real transfer (HAA):
| Method | K | T60 (%) ↓ | C50 (dB) ↓ | EDT (ms) ↓ |
|---|---|---|---|---|
| Diff-RIR† | 12 | 3.74 | 2.067 | 88.09 |
| FLAC | 8 | 3.10 | 2.167 | 84.52 |
| FLAC | 1 | 3.45 | 2.170 | 90.02 |
### Ablation Study
- Conditioning modality ablation: Geometry-only conditioning yields better C50 and EDT (early reflections determined by nearby surfaces); acoustic-only conditioning yields better T60 (global reverberation is difficult to infer from local geometry); combining both achieves the best performance.
- Geometric encoder: Fine-tuned DINOv3 ViT-S/16 outperforms both training from scratch and frozen variants, as well as xRIR's ViT.
- DiT conditioning strategy: AdaLN + Cross-Attention significantly outperforms In-Context and pure Cross-Attention approaches.
- Acoustic encoder: A frozen VAE encoder generalizes slightly better across rooms than ResNet-18, at higher computational cost.
## Key Findings
- FLAC with 1-shot surpasses all 8-shot baselines; 93.01% of participants (n=46) in subjective listening tests preferred FLAC.
- Uncertainty analysis: Low-frequency bands exhibit higher sample variance and longer durations, consistent with room acoustics theory — low-frequency responses are dominated by sparse boundary modes, while high frequencies stabilize above the Schröder frequency.
- Intra-condition diversity ratio is 4.5% (1.03 vs. 22.96), indicating that the model introduces meaningful stochasticity while maintaining contextual consistency.
- The deterministic variant (fixed noise) shows significantly degraded performance (+6% T60, +10% C50, −40% R@5), confirming that stochasticity is essential for few-shot acoustic synthesis.
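One plausible way to read the diversity-ratio finding is as mean pairwise distance among samples that share a condition, divided by mean pairwise distance over all samples. The toy sketch below illustrates that reading only; the synthetic numbers and the metric definition are assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pairwise_dist(x):
    """Mean Euclidean distance over all unordered pairs of rows of x."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    iu = np.triu_indices(len(x), k=1)
    return d[iu].mean()

# Toy stand-in: 5 conditions, 8 generated samples each. Per-condition means sit
# far apart while per-condition spread is small, mimicking "meaningful
# stochasticity with contextual consistency".
means = rng.normal(0.0, 10.0, size=(5, 1, 16))
samples = means + rng.normal(0.0, 0.5, size=(5, 8, 16))

intra = np.mean([mean_pairwise_dist(s) for s in samples])   # same condition
inter = mean_pairwise_dist(samples.reshape(-1, 16))         # pooled
ratio = intra / inter                                       # small = consistent
```

A small ratio says samples for one condition cluster tightly relative to the overall spread, which is the qualitative behavior the 4.5% figure reports.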
## Highlights & Insights
- Pioneering contribution: First application of flow matching to explicit RIR synthesis, framing few-shot acoustic synthesis as a probabilistic generative problem.
- Exceptional data efficiency: 1-shot performance surpasses the previous 8-shot SOTA, reducing required recordings by 8×.
- AGREE evaluation framework: Proposes a CLIP-style acoustic–geometry joint embedding that fills the gap in geometry-consistency evaluation and enables zero-shot cross-modal retrieval.
- Physically grounded uncertainty modeling: Higher uncertainty at low frequencies and convergence at high frequencies align with room acoustics Schröder frequency theory.
- Practical applicability: Trained on a single H100; inference requires only one step to obtain high-quality results; the few-shot method adapts to new scenes in minutes.
## Limitations & Future Work
- Inaccurate area classification: This paper genuinely belongs to audio/acoustic synthesis; its classification under image generation is inappropriate.
- Limited real-scene generalization: Geometric annotations in the HAA dataset are simplified (e.g., tables modeled as planes), and the VAE is not fine-tuned on real recordings, limiting sim-to-real transfer.
- Single sample rate constraint: The current model supports only 22,050 Hz; high-fidelity applications require higher sample rates.
- Elevated FDG metric: The generated distribution still diverges from the real distribution in AGREE space, particularly on real data.
- Scarcity of real data: The absence of large-scale, diverse real audio-visual datasets constrains VAE and overall model performance in real-world scenarios.
- Monaural limitation: Only monaural omnidirectional RIRs are handled; binaural and multichannel scenarios are not addressed.
## Related Work & Insights
- Neural acoustic fields: Per-scene training methods such as NeRAF and AV-GS achieve spatially continuous rendering but are non-generalizable.
- Few-shot acoustic synthesis: FewShotRIR (20 samples) → MAGIC (semantic augmentation) → xRIR (8 samples + depth maps); all are deterministic methods.
- Audio diffusion and flow matching: Diffusion models have succeeded in speech/music generation; flow matching improves efficiency; this paper is the first to introduce it to RIR synthesis.
- Joint embedding models: CLIP → audio-visual/audio-text embeddings, but standard audio embeddings are unsuitable for RIRs; AGREE is the first to align RIRs with scene geometry.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ (First application of flow matching to RIR synthesis; novel probabilistic modeling perspective; pioneering AGREE evaluation framework)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Two datasets, multiple baselines, comprehensive ablations, uncertainty analysis, subjective listening tests, cross-modal retrieval validation)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, rich figures, adequate physical intuition; notation is somewhat dense in places)
- Value: ⭐⭐⭐⭐ (Opens a new direction for few-shot acoustic synthesis with exceptional practical data efficiency, though the field is relatively niche)