How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Objects¶
Conference: ICCV 2025 · arXiv: 2508.02905 · Code: Project Page · Area: Audio & Speech · Keywords: Room Impulse Response, Material Control, Audio-Visual Learning, RIR Generation, Acoustic Simulation
TL;DR¶
This paper proposes a material-controlled acoustic profile generation task (M-CAPA): given audio-visual observations of an indoor scene and a user-defined target material configuration, the model generates a target room impulse response (RIR) that reflects the material changes. A companion dataset, Acoustic Wonderland, is also introduced.
Background & Motivation¶
Sound propagation is significantly influenced by room geometry and object/surface materials — in the same room, wooden walls and concrete walls produce distinctly different reverberation characteristics. Accurate RIR modeling is critical for AR/VR, gaming, and architectural acoustic design.
Limitations of existing RIR prediction methods:
- Physics-based simulation (e.g., ray tracing) requires detailed 3D meshes and material annotations, which are costly to obtain and difficult to scale.
- Data-driven methods mostly predict RIRs from images, audio, or room dimensions, but typically ignore material properties, simplifying rooms to rectangular boxes or inferring materials implicitly from RGB alone.
- The few methods that consider materials [AV-RIR, Listen2Scene] either require dense sampling and 3D reconstruction, or use fixed semantic class-to-material mappings (e.g., "all walls = brick"), without the flexibility to modify material configurations at inference time.
The paper proposes a new task: given the original audio-visual observations \((V, A_S)\) of a scene and user-specified target material mask \(\mathcal{M}_T\), generate a new RIR \(A_T\). Users can dynamically adjust materials at inference time (e.g., replacing the floor with carpet or walls with glass) without physically renovating the room.
Method¶
Overall Architecture¶
The M-CAPA model consists of three components: a multimodal scene encoder \(f^E\) (encoding audio-visual features) → a target material encoder \(f^M\) (encoding the new material configuration) → a conditional RIR generator \(f^T\) (fusing both to generate the target RIR).
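A minimal PyTorch sketch of this dataflow is given below. Module names, layer widths, the pooled embedding size, and the upsampling of the two output heads are assumptions for illustration; only the scene encoder → material encoder → conditional generator structure and the final \(W_T \odot A_S + B_T\) combination follow the paper.

```python
# Minimal sketch of the M-CAPA dataflow (hypothetical layer sizes and module names).
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_encoder(in_ch: int, width: int = 32) -> nn.Sequential:
    """Stand-in for the four-layer convolutional UNet encoders f^V, f^G, f^A, f^M."""
    layers, ch = [], in_ch
    for i in range(4):
        out_ch = width * 2**i
        layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1), nn.ReLU()]
        ch = out_ch
    layers.append(nn.AdaptiveAvgPool2d(8))   # pool all modalities to a common grid
    return nn.Sequential(*layers)

class MCapaSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.f_v = conv_encoder(3)   # RGB image V_n
        self.f_g = conv_encoder(1)   # semantic segmentation G_n
        self.f_a = conv_encoder(2)   # binaural source-RIR spectrogram A_S (2 x F x T)
        self.f_m = conv_encoder(1)   # target material mask M_T (here fed as one float channel)
        self.fuse = nn.Conv2d(4 * 256, 256, kernel_size=1)   # fusion layer
        self.head_w = nn.Conv2d(256, 2, kernel_size=1)       # -> weighting mask W_T
        self.head_b = nn.Conv2d(256, 2, kernel_size=1)       # -> material residual B_T

    def forward(self, V, G, A_S, M_T):
        e_m = torch.cat([self.f_v(V), self.f_g(G), self.f_a(A_S)], dim=1)  # scene embedding
        e_t = self.f_m(M_T)                                                # material embedding
        h = self.fuse(torch.cat([e_m, e_t], dim=1))
        # the paper uses a transposed-conv decoder with skip connections from f^A;
        # here the two heads are simply upsampled to the source spectrogram resolution
        W_T = F.interpolate(self.head_w(h), size=A_S.shape[-2:])   # per-bin weighting
        B_T = F.interpolate(self.head_b(h), size=A_S.shape[-2:])   # additive residual
        return W_T * A_S + B_T                                     # A_T_hat

# usage with dummy inputs
model = MCapaSketch()
A_T_hat = model(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256),
                torch.randn(1, 2, 257, 128), torch.randn(1, 1, 256, 256))
```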
Key Designs¶
- **Multimodal Scene Encoder**
- Visual encoder \(f^V\): a four-layer convolutional UNet encoder processing 256×256 RGB images \(V_n\), outputting visual embedding \(e_v\).
- Semantic encoder \(f^G\): same architecture processing semantic segmentation mask \(G_n\), outputting semantic embedding \(e_g\).
- Acoustic encoder \(f^A\): a four-layer convolutional UNet encoder processing binaural spectrogram \(A_S \in \mathbb{R}^{2 \times F \times T}\) (via STFT), outputting acoustic embedding \(e_a\).
- The three embeddings are concatenated into a multimodal embedding \(e_m = [e_v; e_g; e_a]\).
Design insight: only a 90° FoV RGB image is needed, because the echo response itself already captures the acoustic information of the entire room (including areas outside the field of view).
- **Target Material Encoder**: The target material mask \(\mathcal{M}_T \in \mathbb{R}^{H \times W}\) (where each pixel is a material category index) is mapped to an embedding \(e_t\) via a convolutional encoder. Users can generate \(\mathcal{M}_T\) simply by selecting objects on the semantic segmentation map and assigning material categories (see the sketch after this list).
- **Conditional RIR Generator (Core Contribution)**: A fusion layer \(\mathcal{F}\) merges \(e_m\) and \(e_t\), and a four-layer transposed convolutional decoder (with skip connections from \(f^A\)) outputs two tensors:
- Weighting mask \(W_T \in \mathbb{R}^{2 \times F \times T}\): controls which frequency/time bins in the source RIR should be enhanced or suppressed.
- Material residual \(B_T \in \mathbb{R}^{2 \times F \times T}\): introduces new reverberation patterns absent in the source RIR.
Final generation: \(\hat{A}_T = W_T \odot A_S + B_T\)
Key motivation: conventional masking methods can only adjust the magnitude of existing frequency-time bins, but new materials may introduce entirely new reverberation in previously silent bins. The residual term \(B_T\) addresses this limitation. Ablation studies confirm that \(B_T\) contributes significantly to RTE and CTE metrics.
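A minimal numpy sketch of the user-facing material edit and of this mask-plus-residual generation step; all label names, material indices, and array values are hypothetical, and only the formula \(\hat{A}_T = W_T \odot A_S + B_T\) is taken from the paper.

```python
import numpy as np

# (a) Building a target material mask M_T: pick objects on the semantic segmentation
#     G_n and assign material class indices (labels and indices below are made up).
H, W = 256, 256
G_n = np.random.randint(0, 5, size=(H, W))   # semantic labels, e.g. 0=wall, 1=floor, ...
M_T = np.zeros((H, W), dtype=np.int64)       # 0 = keep the original material
M_T[G_n == 1] = 7                            # floor -> material class 7 (say, carpet)
M_T[G_n == 0] = 3                            # walls -> material class 3 (say, glass)

# (b) Mask-plus-residual generation: W_T rescales existing bins, while B_T can add
#     energy to bins that are silent in the source RIR (pure masking cannot).
F_bins, T_bins = 257, 128
A_S = np.abs(np.random.randn(2, F_bins, T_bins))          # source-RIR spectrogram
A_S[:, :, 100:] = 0.0                                      # late bins silent in source
W_T = np.random.rand(2, F_bins, T_bins)                    # predicted weighting mask
B_T = 0.05 * np.abs(np.random.randn(2, F_bins, T_bins))    # predicted residual

A_T_mask_only = W_T * A_S        # masking alone leaves the silent bins at zero
A_T_hat = W_T * A_S + B_T        # the residual introduces new late reverberation
print(A_T_mask_only[:, :, 100:].max(), A_T_hat[:, :, 100:].max())
```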
Loss & Training¶
- L2 loss + L1 loss: capture spectral detail errors.
- Energy decay loss \(L_D\): aligns the temporal energy decay curves of predicted and ground-truth RIRs, improving reverberation quality.
- Loss weights: \(\lambda_1 = \lambda_2 = 0.5\), \(\lambda_3 = 5 \times 10^{-3}\)
- Adam optimizer, learning rate \(10^{-3}\), batch size 64, single-GPU training.
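The weights above suggest a weighted sum of the three terms. Below is a minimal PyTorch sketch of an energy-decay loss of this kind, assuming the standard Schroeder backward-integration decay curve computed on RIR waveforms; the paper's exact definition of \(L_D\) may differ.

```python
import torch

def energy_decay_curve(rir: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Schroeder backward integration: remaining energy over time, in dB (0 dB at t=0)."""
    energy = rir.pow(2)
    edc = torch.flip(torch.cumsum(torch.flip(energy, dims=[-1]), dim=-1), dims=[-1])
    edc = edc / (edc[..., :1] + eps)          # normalize by total energy
    return 10.0 * torch.log10(edc + eps)

def decay_loss(pred_rir: torch.Tensor, gt_rir: torch.Tensor) -> torch.Tensor:
    """L1 distance between predicted and ground-truth energy decay curves."""
    return torch.mean(torch.abs(energy_decay_curve(pred_rir) - energy_decay_curve(gt_rir)))

# usage: batch of binaural RIR waveforms, shape (batch, 2, n_samples)
loss_d = decay_loss(torch.randn(4, 2, 16000), torch.randn(4, 2, 16000))
```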
Key Experimental Results¶
Main Results¶
RIR Generation Performance on Unseen Scenes (Du_u test set; L1 and STFT errors ×10⁻²)
| Method | Input | L1↓ | STFT↓ | RTE(ms)↓ | CTE(dB)↓ |
|---|---|---|---|---|---|
| Direct Mapping | \(A_S\) | 7.47 | 7.10 | 119.7 | 12.78 |
| Image2Reverb | \(V\) | 14.13 | 7.59 | 223.4 | 19.15 |
| FAST-RIR++ | \(A_S\) | 14.81 | 28.39 | 231.8 | 16.83 |
| Material Aware | \(V\) | 8.91 | 11.29 | 98.06 | 11.75 |
| AV-RIR | \(A_S\)+\(V\) | 7.59 | 7.17 | 99.10 | 11.35 |
| M-CAPA (ours) | \(A_S\)+\(V\) | 5.27 | 3.87 | 91.44 | 8.44 |
M-CAPA substantially outperforms all baselines and prior state-of-the-art methods across all metrics. Even the vision-only variant of M-CAPA (L1=6.06) surpasses AV-RIR, which uses both audio and visual inputs.
Ablation Study¶
Component Ablation (Du_u test set)
| Configuration | L1↓ | STFT↓ | RTE(ms)↓ | CTE(dB)↓ |
|---|---|---|---|---|
| M-CAPA (full model) | 5.27 | 3.87 | 91.44 | 8.44 |
| w/o \(\mathcal{M}_T\) | 5.61 | 4.06 | 109.46 | 9.19 |
| w/o \(B_T\) (masking only) | 5.75 | 4.93 | 105.19 | 10.83 |
| With inferred \(G_n\) | 5.63 | 3.99 | 97.63 | 9.10 |
| Changed materials only | 5.47 | 4.00 | 96.36 | 9.04 |
Removing the residual term \(B_T\) leads to a 2.39 dB degradation in CTE, confirming the argument that pure masking is insufficient.
Key Findings¶
- Prediction error is lowest when material changes cover 50%–70% of the area (corresponding to large surfaces such as walls and floors); small-area changes (e.g., a single chair) are paradoxically harder to predict.
- Difficulty varies considerably across material types: fabric and acoustic tiles are easier to predict, while steel and wood are harder, likely due to their complex frequency-dependent reflection and absorption characteristics.
- Real-scene user study: 5 participants achieved 61.1% accuracy (random baseline: 33%) in identifying target materials from speech convolved with predicted RIRs, validating the model's generalization to real scenes (see the auralization sketch after this list).
- The model is extremely lightweight: only 10.56M parameters, 17.98 GFLOPs, and 114ms inference time, compared to AV-RIR's 390.66M parameters.
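A minimal sketch of how such listening samples can be produced by convolving dry (anechoic) speech with a predicted binaural RIR; the file names and the use of soundfile/scipy are assumptions for illustration, not details from the paper.

```python
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

speech, sr = sf.read("dry_speech.wav")        # mono anechoic speech (placeholder file)
rir, _ = sf.read("predicted_rir.wav")         # shape (n_samples, 2), binaural RIR

left = fftconvolve(speech, rir[:, 0])         # convolve each ear separately
right = fftconvolve(speech, rir[:, 1])
out = np.stack([left, right], axis=1)
out /= np.abs(out).max() + 1e-8               # peak-normalize to avoid clipping
sf.write("reverberant_speech.wav", out, sr)
```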
Highlights & Insights¶
- This is the first RIR generation method that enables arbitrary modification of material configurations at inference time, filling a gap in interactive editing for acoustic simulation.
- The generation formulation \(W_T \odot A_S + B_T\) is concise and elegant, decoupling the adjustment of existing reverberation from the introduction of new reverberation.
- The Acoustic Wonderland dataset (1.68 million data points, 2673 material configurations) provides the community with a new benchmark for systematically evaluating material-acoustic relationships.
- The user interaction design is intuitive: materials are assigned by selecting objects on a semantic segmentation mask, requiring no pixel-level annotation.
Limitations & Future Work¶
- Prediction of acoustic effects from material changes on highly irregular shapes (e.g., domes, complex columns) remains challenging.
- The model cannot generalize to unseen material categories at inference time (12 fixed material classes).
- Evaluation is limited to simulated data; the real-scene user study is small in scale (2 scenes, 5 participants).
- Robustness to noise is limited: performance degrades when the source RIR is noisy; noisy training could address this in future work.
- The frequency-dependent characteristics of materials (varying absorption rates across frequency bands) could be more explicitly modeled.
Related Work & Insights¶
- AV-RIR [CVPR 2024] is the most direct comparison method, but it uses fixed semantic-to-material mappings and retrieves late reverberation from the training set.
- Image2Reverb [ICCV 2021] was the first to generate complete RIRs from RGB and depth, but disregards material information.
- This task has broad application potential: acoustic previewing for interior design, immersive VR/AR experiences, and studio material planning.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (First material-controlled RIR generation task with a complete companion dataset and method)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Multiple baselines, ablations across multiple splits, detailed component analysis, and user study)
- Writing Quality: ⭐⭐⭐⭐ (Clear problem formulation and well-structured method description)
- Value: ⭐⭐⭐⭐ (Direct applicability to acoustic rendering in AR/VR and indoor acoustic design)