How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Objects¶
Conference: ICCV 2025 · arXiv: 2508.02905 · Code: Project Page · Area: Audio & Speech · Keywords: Room Impulse Response, Material Control, Audio-Visual Learning, RIR Generation, Acoustic Simulation
TL;DR¶
This paper proposes a material-controlled acoustic profile generation task (M-CAPA): given audio-visual observations of an indoor scene and a user-defined target material configuration, the model generates a target room impulse response (RIR) that reflects the material changes. A companion dataset, Acoustic Wonderland, is also introduced.
Background & Motivation¶
Sound propagation is significantly influenced by room geometry and object/surface materials — in the same room, wooden walls and concrete walls produce distinctly different reverberation characteristics. Accurate RIR modeling is critical for AR/VR, gaming, and architectural acoustic design.
Limitations of existing RIR prediction methods:
- Physics-based simulation (e.g., ray tracing) requires detailed 3D meshes and material annotations, which are costly to obtain and difficult to scale.
- Data-driven methods mostly predict RIRs from images, audio, or room dimensions, but typically ignore material properties, simplifying rooms to rectangular boxes or inferring materials implicitly from RGB alone.
- The few methods that consider materials [AV-RIR, Listen2Scene] either require dense sampling and 3D reconstruction, or use fixed semantic class-to-material mappings (e.g., "all walls = brick"), without the flexibility to modify material configurations at inference time.
The paper proposes a new task: given the original audio-visual observations \((V, A_S)\) of a scene and user-specified target material mask \(\mathcal{M}_T\), generate a new RIR \(A_T\). Users can dynamically adjust materials at inference time (e.g., replacing the floor with carpet or walls with glass) without physically renovating the room.
Method¶
Overall Architecture¶
The M-CAPA model consists of three components: a multimodal scene encoder \(f^E\) (encoding audio-visual features) → a target material encoder \(f^M\) (encoding the new material configuration) → a conditional RIR generator \(f^T\) (fusing both to generate the target RIR).
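A minimal PyTorch sketch of this dataflow is given below. Module names, layer widths, the pooled embedding size, and the upsampling of the two output heads are assumptions for illustration; only the scene encoder → material encoder → conditional generator structure and the final \(W_T \odot A_S + B_T\) combination follow the paper.

```python
# Minimal sketch of the M-CAPA dataflow (hypothetical layer sizes and module names).
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_encoder(in_ch: int, width: int = 32) -> nn.Sequential:
    """Stand-in for the four-layer convolutional UNet encoders f^V, f^G, f^A, f^M."""
    layers, ch = [], in_ch
    for i in range(4):
        out_ch = width * 2**i
        layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1), nn.ReLU()]
        ch = out_ch
    layers.append(nn.AdaptiveAvgPool2d(8))   # pool all modalities to a common grid
    return nn.Sequential(*layers)

class MCapaSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.f_v = conv_encoder(3)   # RGB image V_n
        self.f_g = conv_encoder(1)   # semantic segmentation G_n
        self.f_a = conv_encoder(2)   # binaural source-RIR spectrogram A_S (2 x F x T)
        self.f_m = conv_encoder(1)   # target material mask M_T (here fed as one float channel)
        self.fuse = nn.Conv2d(4 * 256, 256, kernel_size=1)   # fusion layer
        self.head_w = nn.Conv2d(256, 2, kernel_size=1)       # -> weighting mask W_T
        self.head_b = nn.Conv2d(256, 2, kernel_size=1)       # -> material residual B_T

    def forward(self, V, G, A_S, M_T):
        e_m = torch.cat([self.f_v(V), self.f_g(G), self.f_a(A_S)], dim=1)  # scene embedding
        e_t = self.f_m(M_T)                                                # material embedding
        h = self.fuse(torch.cat([e_m, e_t], dim=1))
        # the paper uses a transposed-conv decoder with skip connections from f^A;
        # here the two heads are simply upsampled to the source spectrogram resolution
        W_T = F.interpolate(self.head_w(h), size=A_S.shape[-2:])   # per-bin weighting
        B_T = F.interpolate(self.head_b(h), size=A_S.shape[-2:])   # additive residual
        return W_T * A_S + B_T                                     # A_T_hat

# usage with dummy inputs
model = MCapaSketch()
A_T_hat = model(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256),
                torch.randn(1, 2, 257, 128), torch.randn(1, 1, 256, 256))
```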
Key Designs¶
- **Multimodal Scene Encoder**
- Visual encoder \(f^V\): a four-layer convolutional UNet encoder processing 256×256 RGB images \(V_n\), outputting visual embedding \(e_v\).
- Semantic encoder \(f^G\): same architecture processing semantic segmentation mask \(G_n\), outputting semantic embedding \(e_g\).
- Acoustic encoder \(f^A\): a four-layer convolutional UNet encoder processing binaural spectrogram \(A_S \in \mathbb{R}^{2 \times F \times T}\) (via STFT), outputting acoustic embedding \(e_a\).
- The three embeddings are concatenated into a multimodal embedding \(e_m = [e_v; e_g; e_a]\).
Design insight: only a 90° FoV RGB image is needed, because the echo response itself already captures the acoustic information of the entire room (including areas outside the field of view).
- **Target Material Encoder**: The target material mask \(\mathcal{M}_T \in \mathbb{R}^{H \times W}\) (where each pixel is a material category index) is mapped to an embedding \(e_t\) via a convolutional encoder. Users can generate \(\mathcal{M}_T\) simply by selecting objects on the semantic segmentation map and assigning material categories (see the sketch after this list).
- **Conditional RIR Generator (Core Contribution)**: A fusion layer \(\mathcal{F}\) merges \(e_m\) and \(e_t\), and a four-layer transposed convolutional decoder (with skip connections from \(f^A\)) outputs two tensors:
- Weighting mask \(W_T \in \mathbb{R}^{2 \times F \times T}\): controls which frequency/time bins in the source RIR should be enhanced or suppressed.
- Material residual \(B_T \in \mathbb{R}^{2 \times F \times T}\): introduces new reverberation patterns absent in the source RIR.
Final generation: \(\hat{A}_T = W_T \odot A_S + B_T\)
Key motivation: conventional masking methods can only adjust the magnitude of existing frequency-time bins, but new materials may introduce entirely new reverberation in previously silent bins. The residual term \(B_T\) addresses this limitation. Ablation studies confirm that \(B_T\) contributes significantly to RTE and CTE metrics.
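A minimal numpy sketch of the user-facing material edit and of this mask-plus-residual generation step; all label names, material indices, and array values are hypothetical, and only the formula \(\hat{A}_T = W_T \odot A_S + B_T\) is taken from the paper.

```python
import numpy as np

# (a) Building a target material mask M_T: pick objects on the semantic segmentation
#     G_n and assign material class indices (labels and indices below are made up).
H, W = 256, 256
G_n = np.random.randint(0, 5, size=(H, W))   # semantic labels, e.g. 0=wall, 1=floor, ...
M_T = np.zeros((H, W), dtype=np.int64)       # 0 = keep the original material
M_T[G_n == 1] = 7                            # floor -> material class 7 (say, carpet)
M_T[G_n == 0] = 3                            # walls -> material class 3 (say, glass)

# (b) Mask-plus-residual generation: W_T rescales existing bins, while B_T can add
#     energy to bins that are silent in the source RIR (pure masking cannot).
F_bins, T_bins = 257, 128
A_S = np.abs(np.random.randn(2, F_bins, T_bins))          # source-RIR spectrogram
A_S[:, :, 100:] = 0.0                                      # late bins silent in source
W_T = np.random.rand(2, F_bins, T_bins)                    # predicted weighting mask
B_T = 0.05 * np.abs(np.random.randn(2, F_bins, T_bins))    # predicted residual

A_T_mask_only = W_T * A_S        # masking alone leaves the silent bins at zero
A_T_hat = W_T * A_S + B_T        # the residual introduces new late reverberation
print(A_T_mask_only[:, :, 100:].max(), A_T_hat[:, :, 100:].max())
```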
Loss & Training¶
- L2 loss + L1 loss: capture spectral detail errors.
- Energy decay loss \(L_D\): aligns the temporal energy decay curves of predicted and ground-truth RIRs, improving reverberation quality.
- Loss weights: \(\lambda_1 = \lambda_2 = 0.5\), \(\lambda_3 = 5 \times 10^{-3}\)
- Adam optimizer, learning rate \(10^{-3}\), batch size 64, single-GPU training.
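The weights above suggest a weighted sum of the three terms. Below is a minimal PyTorch sketch of an energy-decay loss of this kind, assuming the standard Schroeder backward-integration decay curve computed on RIR waveforms; the paper's exact definition of \(L_D\) may differ.

```python
import torch

def energy_decay_curve(rir: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Schroeder backward integration: remaining energy over time, in dB (0 dB at t=0)."""
    energy = rir.pow(2)
    edc = torch.flip(torch.cumsum(torch.flip(energy, dims=[-1]), dim=-1), dims=[-1])
    edc = edc / (edc[..., :1] + eps)          # normalize by total energy
    return 10.0 * torch.log10(edc + eps)

def decay_loss(pred_rir: torch.Tensor, gt_rir: torch.Tensor) -> torch.Tensor:
    """L1 distance between predicted and ground-truth energy decay curves."""
    return torch.mean(torch.abs(energy_decay_curve(pred_rir) - energy_decay_curve(gt_rir)))

# usage: batch of binaural RIR waveforms, shape (batch, 2, n_samples)
loss_d = decay_loss(torch.randn(4, 2, 16000), torch.randn(4, 2, 16000))
```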
Key Experimental Results¶
Main Results¶
RIR Generation Performance on Unseen Scenes (Du_u test set; L1 and STFT errors ×10⁻²)
| Method | Input | L1↓ | STFT↓ | RTE(ms)↓ | CTE(dB)↓ |
|---|---|---|---|---|---|
| Direct Mapping | \(A_S\) | 7.47 | 7.10 | 119.7 | 12.78 |
| Image2Reverb | \(V\) | 14.13 | 7.59 | 223.4 | 19.15 |
| FAST-RIR++ | \(A_S\) | 14.81 | 28.39 | 231.8 | 16.83 |
| Material Aware | \(V\) | 8.91 | 11.29 | 98.06 | 11.75 |
| AV-RIR | \(A_S\)+\(V\) | 7.59 | 7.17 | 99.10 | 11.35 |
| M-CAPA (ours) | \(A_S\)+\(V\) | 5.27 | 3.87 | 91.44 | 8.44 |
M-CAPA substantially outperforms all baselines and prior state-of-the-art methods across all metrics. Even the vision-only variant of M-CAPA (L1=6.06) surpasses AV-RIR, which uses both audio and visual inputs.
Ablation Study¶
Component Ablation (Du_u test set)
| Configuration | L1↓ | STFT↓ | RTE(ms)↓ | CTE(dB)↓ |
|---|---|---|---|---|
| M-CAPA (full model) | 5.27 | 3.87 | 91.44 | 8.44 |
| w/o \(\mathcal{M}_T\) | 5.61 | 4.06 | 109.46 | 9.19 |
| w/o \(B_T\) (masking only) | 5.75 | 4.93 | 105.19 | 10.83 |
| With inferred \(G_n\) | 5.63 | 3.99 | 97.63 | 9.10 |
| Changed materials only | 5.47 | 4.00 | 96.36 | 9.04 |
Removing the residual term \(B_T\) leads to a 2.39 dB degradation in CTE, confirming the argument that pure masking is insufficient.
Key Findings¶
- Prediction error is lowest when material changes cover 50%–70% of the area (corresponding to large surfaces such as walls and floors); small-area changes (e.g., a single chair) are paradoxically harder to predict.
- Difficulty varies considerably across material types: fabric and acoustic tiles are easier to predict, while steel and wood are harder, likely due to their complex frequency-dependent reflection and absorption characteristics.
- Real-scene user study: 5 participants achieved 61.1% accuracy (random baseline: 33%) in identifying target materials from speech convolved with predicted RIRs, validating the model's generalization to real scenes (see the auralization sketch after this list).
- The model is extremely lightweight: only 10.56M parameters, 17.98 GFLOPs, and 114ms inference time, compared to AV-RIR's 390.66M parameters.
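A minimal sketch of how such listening samples can be produced by convolving dry (anechoic) speech with a predicted binaural RIR; the file names and the use of soundfile/scipy are assumptions for illustration, not details from the paper.

```python
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

speech, sr = sf.read("dry_speech.wav")        # mono anechoic speech (placeholder file)
rir, _ = sf.read("predicted_rir.wav")         # shape (n_samples, 2), binaural RIR

left = fftconvolve(speech, rir[:, 0])         # convolve each ear separately
right = fftconvolve(speech, rir[:, 1])
out = np.stack([left, right], axis=1)
out /= np.abs(out).max() + 1e-8               # peak-normalize to avoid clipping
sf.write("reverberant_speech.wav", out, sr)
```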
Highlights & Insights¶
- This is the first RIR generation method that enables arbitrary modification of material configurations at inference time, filling a gap in interactive editing for acoustic simulation.
- The generation formulation \(W_T \odot A_S + B_T\) is concise and elegant, decoupling the adjustment of existing reverberation from the introduction of new reverberation.
- The Acoustic Wonderland dataset (1.68 million data points, 2673 material configurations) provides the community with a new benchmark for systematically evaluating material-acoustic relationships.
- The user interaction design is intuitive: materials are assigned by selecting objects on a semantic segmentation mask, requiring no pixel-level annotation.
Limitations & Future Work¶
- Prediction of acoustic effects from material changes on highly irregular shapes (e.g., domes, complex columns) remains challenging.
- The model cannot generalize to unseen material categories at inference time (12 fixed material classes).
- Evaluation is limited to simulated data; the real-scene user study is small in scale (2 scenes, 5 participants).
- Robustness to noise is limited: performance degrades when the source RIR is noisy; noisy training could address this in future work.
- The frequency-dependent characteristics of materials (varying absorption rates across frequency bands) could be more explicitly modeled.
Related Work & Insights¶
- AV-RIR [CVPR 2024] is the most direct comparison method, but it uses fixed semantic-to-material mappings and retrieves late reverberation from the training set.
- Image2Reverb [ICCV 2021] was the first to generate complete RIRs from RGB and depth, but disregards material information.
- This task has broad application potential: acoustic previewing for interior design, immersive VR/AR experiences, and studio material planning.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (First material-controlled RIR generation task with a complete companion dataset and method)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Multiple baselines, ablations across multiple splits, detailed component analysis, and user study)
- Writing Quality: ⭐⭐⭐⭐ (Clear problem formulation and well-structured method description)
- Value: ⭐⭐⭐⭐ (Direct applicability to acoustic rendering in AR/VR and indoor acoustic design)