Modeling and Driving Human Body Soundfields through Acoustic Primitives

Conference: ECCV 2024
arXiv: 2407.13083
Code: Yes (https://wikichao.github.io/Acoustic-Primitives/)
Area: Human Body Understanding
Keywords: Soundfield Modeling, Acoustic Primitives, Spatial Audio, Human Pose, AR/VR

TL;DR

This paper proposes a framework for modeling and rendering 3D human body soundfields based on Acoustic Primitives: multiple low-order spherical harmonic soundfields attached to the joints of the human skeleton. While keeping audio quality comparable to the state of the art (SOTA), it runs about 15x faster and can render sound in the near field.

Background & Motivation

3D human body visual modeling (e.g., MetaHumans, Codec Avatars) has reached highly realistic levels, but acoustic modeling lags far behind. For immersive AR/VR experiences, precise audio-visual synchronization is crucial, yet few studies have successfully rendered spatial audio for virtual humans.

The pioneering work SoundingBodies [Xu et al.] developed a full-body soundfield rendering system based on body pose and headset microphones, but it exhibits major limitations:

Inability to render in the near field: Because it uses a single high-order (17th-order) spherical harmonic representation, sound can only be rendered outside a sphere of approximately 2 m diameter enclosing the body; results inside that sphere are chaotic.

Low computational efficiency: It requires predicting the raw audio signals of 345 microphones and then converting them into spherical harmonic coefficients via traditional DSP processing, preventing real-time operation.

Large parameter size and instability: The 17th-order spherical harmonics require a massive number of coefficients, making the estimation process unstable.

The core idea is borrowed from volumetric primitives in visual neural rendering (e.g., 3D Gaussians). Transferred to the acoustic domain, it replaces a single high-order soundfield with multiple small, low-order soundfield primitives attached to the joints of the human body.
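The size argument can be made concrete with a small calculation. The sketch below assumes the standard count of \((N+1)^2\) spherical harmonic coefficients for order \(N\) (per frequency bin) and uses the 12-primitive, 2nd-order configuration reported in the experiments; the function name is illustrative.

```python
def num_sh_coeffs(order: int) -> int:
    """Number of spherical harmonic coefficients up to and including `order`."""
    return (order + 1) ** 2


single_field = num_sh_coeffs(17)    # one 17th-order soundfield
primitives = 12 * num_sh_coeffs(2)  # 12 primitives, each 2nd-order

print(single_field, primitives)     # 324 108
```

Under this counting, the primitive representation needs about a third of the coefficients of the single high-order field.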

Method

Overall Architecture

The system consists of two primary stages:

Acoustic Primitive Learning: Given headset microphone audio and 3D body pose sequences, a neural network predicts the spherical harmonic coefficients, weights, and position offsets of a set of acoustic primitives.
Soundfield Rendering: Differentiable rendering is performed using spherical wave functions, superimposing the contributions of all primitives at the target 3D position to obtain the final audio.

Component	Function	Output
Pose Encoder	Temporal convolutions to extract spatiotemporal features of body pose	Pose features \(f_p\)
Audio Encoder	ResNet+LSTM to extract audio features (including delay compensation)	Audio features \(f_a\)
Audiovisual Fusion Module	ResNet+Attention to fuse both features	Fused features \(f_{ap}\)
Soundfield Decoder	U-Net style decoding to output spherical harmonic coefficients of all primitives	\(K\) sets of spherical harmonic coefficients \(\{\mathcal{S}_i\}\)
Offset MLP	Predicts position offsets of primitives relative to joints	Offsets \(\Delta(x,y,z)\)
Weight MLP	Predicts time-varying weights of primitives	Weights \(W\)

Key Designs

1. Definition of Acoustic Primitives

Each acoustic primitive is a small spherical soundfield attached to a human joint, described by low-order (2nd-order) spherical harmonic coefficients. Its sound pressure can be precisely calculated via spherical wave functions:

\[\mathbf{w}(r,\theta,\varphi) = \sum_{n=0}^{N}\sum_{m=-n}^{n} \tilde{c}_{nm} \cdot \frac{h_n(kr)}{h_n(kr_{ref})} \cdot Y_{nm}(\theta,\varphi)\]

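Here \(\tilde{c}_{nm}\) are the primitive's spherical harmonic coefficients, \(h_n\) is the spherical Hankel function of order \(n\), \(Y_{nm}\) are the spherical harmonics, \(k\) is the wavenumber, \(r_{ref}\) is the reference radius at which the coefficients are defined, and \((r,\theta,\varphi)\) is the listener position relative to the primitive. The entire human body soundfield is a linear superposition of the sounds generated by all \(K\) primitives at the listener's position.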
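The rendering equation and the superposition over primitives can be sketched for a single frequency as follows. This is a minimal illustration, not the paper's implementation: it assumes real spherical harmonics, the spherical Hankel function of the first kind, a hypothetical reference radius `r_ref = 0.1` m, and that each primitive's weight simply scales its contribution. All function and field names are made up for the example.

```python
import cmath
import math


def sph_hankel1(n, x):
    """Spherical Hankel function of the first kind, via upward recurrence."""
    h_prev = -1j * cmath.exp(1j * x) / x             # h_0
    if n == 0:
        return h_prev
    h_curr = -cmath.exp(1j * x) * (x + 1j) / x ** 2  # h_1
    for m in range(1, n):
        h_prev, h_curr = h_curr, (2 * m + 1) / x * h_curr - h_prev
    return h_curr


def real_sh(direction):
    """Real spherical harmonics up to order 2 for a unit vector, keyed by (n, m)."""
    x, y, z = direction
    c1 = math.sqrt(3 / (4 * math.pi))
    c2 = 0.5 * math.sqrt(15 / math.pi)
    return {
        (0, 0): 0.5 * math.sqrt(1 / math.pi),
        (1, -1): c1 * y, (1, 0): c1 * z, (1, 1): c1 * x,
        (2, -2): c2 * x * y, (2, -1): c2 * y * z,
        (2, 0): 0.25 * math.sqrt(5 / math.pi) * (3 * z * z - 1),
        (2, 1): c2 * x * z, (2, 2): 0.5 * c2 * (x * x - y * y),
    }


def render_pressure(primitives, listener, k, r_ref=0.1):
    """Sum the complex pressure contributed by every primitive at `listener`.

    Each primitive is a dict with 'position' (x, y, z), 'weight', and
    'coeffs' mapping (n, m) -> coefficient. The listener must not coincide
    with a primitive position (r = 0).
    """
    total = 0j
    for prim in primitives:
        rel = [l - p for l, p in zip(listener, prim["position"])]
        r = math.sqrt(sum(v * v for v in rel))
        sh = real_sh([v / r for v in rel])
        field = sum(
            c * sph_hankel1(n, k * r) / sph_hankel1(n, k * r_ref) * sh[(n, m)]
            for (n, m), c in prim["coeffs"].items()
        )
        total += prim["weight"] * field
    return total


# Usage: one omnidirectional primitive near the head, listener 0.5 m away.
k = 2 * math.pi * 1000 / 343.0  # wavenumber of a 1 kHz tone in air
head = {"position": (0.0, 0.0, 1.7), "weight": 1.0, "coeffs": {(0, 0): 1.0}}
p = render_pressure([head], listener=(0.5, 0.0, 1.7), k=k)
```

Because each primitive is evaluated relative to its own position, the listener can be placed close to the body as long as it stays outside the small reference sphere of each primitive.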

2. Delay Compensation

Sounds generated by different body parts (e.g., feet, hands) require different amounts of time to reach the headset microphone. The method estimates the time delay of each primitive using pose features via an MLP, and applies time-warping to the audio signal to compensate for it.
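To illustrate what the compensation does, the sketch below advances a signal by a fractional number of samples using linear interpolation. It is only an illustration: the paper predicts the delay from pose features with an MLP, whereas here the delay is derived from a source-to-microphone distance, and the 343 m/s speed of sound and the function names are assumptions.

```python
import math


def propagation_delay_samples(distance_m, sample_rate=48000, speed_of_sound=343.0):
    """Travel time over `distance_m`, expressed in samples."""
    return distance_m / speed_of_sound * sample_rate


def advance_signal(signal, delay_samples):
    """Shift `signal` earlier by a fractional delay (linear interpolation, zero padding)."""
    out = []
    for t in range(len(signal)):
        pos = t + delay_samples
        i = math.floor(pos)
        frac = pos - i
        a = signal[i] if 0 <= i < len(signal) else 0.0
        b = signal[i + 1] if 0 <= i + 1 < len(signal) else 0.0
        out.append((1 - frac) * a + frac * b)
    return out


# A source 1.715 m from the headset microphone arrives about 240 samples late at 48 kHz.
delay = propagation_delay_samples(1.715)
aligned = advance_signal([0.0, 1.0, 2.0, 3.0], 1.5)  # [1.5, 2.5, 1.5, 0.0]
```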

3. Primitive Position Offset

Joint positions do not precisely correspond to sound source positions (e.g., wrist joint vs. the position of finger snaps). Therefore, a bounded offset (constrained within a 20cm range by tanh) is learned to enable the primitive positions to more accurately reflect the true sound sources.
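A tanh-bounded offset can be sketched as below. The raw offset values stand in for the output of the Offset MLP, and the sketch assumes the 20 cm bound applies independently to each axis, which the text does not specify.

```python
import math

MAX_OFFSET_M = 0.20  # 20 cm bound from the text; assumed here to apply per axis


def primitive_position(joint_xyz, raw_offset_xyz):
    """Joint position plus a tanh-bounded offset (raw values stand in for MLP outputs)."""
    return tuple(
        j + MAX_OFFSET_M * math.tanh(o) for j, o in zip(joint_xyz, raw_offset_xyz)
    )


# Even very large raw outputs cannot move the primitive more than 20 cm per axis.
wrist = (0.3, 1.0, 0.1)
print(primitive_position(wrist, (100.0, -100.0, 0.0)))
```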

4. Primitive Weights

The importance of each primitive varies at different moments (e.g., the hand primitive is active during finger snapping while other primitives remain silent). This time-varying nature is explicitly modeled using weights activated by a softmax function.
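The softmax activation can be illustrated with a few hand-picked logits. The values below are made up; in the method they would come from the Weight MLP at each time step.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for (head, left hand, right hand, foot) during a finger snap.
weights = softmax([0.1, 0.2, 4.0, -1.0])
loudest = weights.index(max(weights))  # index 2: the right-hand primitive dominates
```

Since the weights sum to one, raising one primitive's weight necessarily suppresses the others at that moment.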

Loss & Training

The total loss function consists of four terms:

\[\mathcal{L}_{total} = \lambda_{amp}\mathcal{L}_{amp} + \lambda_{ri}\mathcal{L}_{ri} + \lambda_{s\ell1}\mathcal{L}_{s\ell1} + \lambda_{cts}\mathcal{L}_{cts}\]

Loss Term	Function	Weight
\(\mathcal{L}_{amp}\)	Multi-scale STFT magnitude spectrum loss (window sizes of 2048/1024/512/256)	7
\(\mathcal{L}_{ri}\)	STFT real and imaginary component loss	3
\(\mathcal{L}_{s\ell1}\)	Shift-L1 loss to reduce spatial alignment error	0.5
\(\mathcal{L}_{cts}\)	Cross-entropy loss to supervise primitive weights with segment-level labels	1
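The magnitude term and the weighted sum can be sketched in plain Python. This is an assumption-laden illustration rather than the training code: it uses a naive DFT (far too slow for the 2048-sample windows in the table), a Hann window, a hop of a quarter window, and an L1 distance between magnitudes, none of which are specified in this summary. Only the four loss weights are taken from the table.

```python
import cmath
import math


def stft_magnitude(signal, win, hop):
    """Magnitude spectrogram via a naive DFT of Hann-windowed frames."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / win) for n in range(win)]
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = [signal[start + n] * hann[n] for n in range(win)]
        frames.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * f * n / win)
                    for n in range(win)))
            for f in range(win // 2 + 1)
        ])
    return frames


def multiscale_mag_loss(pred, target, windows):
    """Mean absolute magnitude difference, averaged over the window sizes.

    Both signals must be at least as long as the largest window.
    """
    total = 0.0
    for win in windows:
        p = stft_magnitude(pred, win, win // 4)
        t = stft_magnitude(target, win, win // 4)
        diffs = [abs(a - b) for fp, ft in zip(p, t) for a, b in zip(fp, ft)]
        total += sum(diffs) / len(diffs)
    return total / len(windows)


LOSS_WEIGHTS = {"amp": 7.0, "ri": 3.0, "sl1": 0.5, "cts": 1.0}  # from the table


def total_loss(terms):
    """Weighted sum of the individual loss terms, keyed like LOSS_WEIGHTS."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in terms.items())


# Tiny windows keep the naive DFT fast enough for a demonstration.
tone = [math.sin(0.3 * n) for n in range(64)]
silence = [0.0] * 64
amp = multiscale_mag_loss(tone, silence, windows=(8, 16))
```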

Training details: batch size = 1 per GPU, 20 target microphones are randomly selected per forward pass, using the AdamW optimizer (lr = 0.0002), training for 100 epochs, taking about 55 hours on 4×A100 GPUs.

Key Experimental Results

Main Results

Evaluated on the public SoundingBodies dataset (recorded in an anechoic chamber with a 345-microphone spherical array and multiple human subjects).

Method	Inference Time (per 1 s of audio)↓	Non-speech SDR↑	Non-speech Mag Error↓	Non-speech Phase Error↓	Speech SDR↑	Speech Mag Error↓	Speech Phase Error↓
SoundingBodies	3.56s	3.052	0.832	0.314	9.635	0.701	0.464
Ours (12 primitives, 2nd-order)	0.24s	3.597	0.883	0.323	8.448	0.943	0.417

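Core conclusion: roughly 15x faster inference, higher non-speech SDR (3.597 vs. 3.052), and speech quality that is close but slightly lower (SDR 8.448 vs. 9.635). Magnitude errors are somewhat higher than SoundingBodies in both categories, while the speech phase error is lower.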
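The acceleration figure follows directly from the two inference times in the table:

\[\text{speedup} = \frac{3.56\ \text{s}}{0.24\ \text{s}} \approx 14.8\]

which rounds to the quoted 15x.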

Ablation Study

Effects of different numbers of primitives and spherical harmonic orders:

Number of Primitives K	Order N	Non-speech SDR↑	Speech SDR↑
5	Order 0	2.009	4.775
5	Order 2	3.552	7.981
9	Order 2	3.569	8.200
12	Order 2	3.597	8.448
12 (No offset)	Order 2	3.528	7.730

Key Findings

Both the number of primitives and the order are positively correlated with performance: twelve 2nd-order primitives give the best results. Speech benefits most from more primitives and higher orders.
The position offset mechanism is effective: Removing the offset lowers non-speech SDR from 3.597 to 3.528 and speech SDR from 8.448 to 7.730, which is below the 9-primitive configuration (and, for speech, below the 5-primitive one). The explanation given is that, without offsets, redundant primitives collapse into the same location.
The performance gain is largest from 0th to 1st order: 0th-order can only model omnidirectional soundfields, whereas 1st-order and above can capture sound directionality.
Correct assignment of sound sources: Visualization shows that speech energy is primarily assigned to head primitives, handclaps are assigned to hand primitives, and directional patterns align with the head orientation.

Highlights & Insights

Elegant visual-acoustic analogy: Transferring the concept of 3D Gaussian Splatting to the acoustic domain, analogizing acoustic primitives to volumetric primitives, represents an impressive case of cross-modal methodological migration.
Differentiable acoustic renderer: The differentiable implementation of spherical wave functions enables end-to-end training, where the rendering equations provide a strong physical prior.
Implicit sound source separation: The system implicitly learns to decompose mixed audio signals into different body-part sources without requiring explicit sound source separation annotations.
Real-time rendering capability: Processing 1 second of 48 kHz audio takes 0.24 seconds, a real-time factor of about 0.24, which lays the foundation for AR/VR applications.

Limitations & Future Work

Limited spherical harmonic order: The PyTorch implementation only supports spherical wave functions up to 2nd order; higher orders could potentially improve accuracy.
Requirement for large-scale microphone array training data: Anechoic chamber recordings using 345 microphones require expensive equipment, making it difficult to scale.
Unmodeled indoor reverberation: Training and testing in an anechoic chamber prevent the model from handling reverberation effects in real-world indoor environments.
Support for single-person scenarios only: Soundfield modeling for multi-person interaction scenarios remains unexplored.
Poor magnitude matching for body slaps: The highly variable radiation patterns present a challenge for low-order primitives.

Relationship with NeRF/3DGS: High methodological similarity. Both replace global representations with distributions of primitives and replace explicit solving with differentiable rendering.
Complementarity with SoundingBodies: SoundingBodies pursues high fidelity at the cost of efficiency, while this work pursues efficiency while maintaining quality.
Potential Extensions: Can be combined with text/audio-driven body motion generation models to achieve a fully integrated audio-visual-motion virtual human.

Rating

Dimension	Score (1-5)	Evaluation
Novelty	4.5	Novel concept of acoustic primitives and an elegant cross-modal analogy
Technical Depth	4	Coherent integration of physical acoustics and deep learning, with a well-designed differentiable renderer
Experimental Thoroughness	3.5	Sufficient ablation studies, but only tested on one dataset, lacking real-world environment testing
Writing Quality	4	Clear motivation, and the analogy to visual rendering aids comprehension
Value	4	Direct application value for spatial audio rendering in AR/VR