3D Face Reconstruction From Radar Images

Conference: CVPR 2025
arXiv: 2412.02403
Code: None
Area: Human Understanding / 3D Face Reconstruction / Radar Imaging
Keywords: mmWave radar, 3DMM, Analysis-by-Synthesis, Learnable Renderer, Face Reconstruction

TL;DR

The paper reports the first 3D face reconstruction from millimeter-wave radar images. A synthetic dataset generated with a physical radar renderer is used to train a CNN encoder that estimates BFM parameters, and a learned differentiable radar renderer turns this encoder into a model-based autoencoder. The best configuration reaches a mean vertex-to-vertex error of 2.56 mm on synthetic data and allows unsupervised parameter optimization during inference.

Background & Motivation

Background: 3D face reconstruction has matured significantly in the RGB image domain. Methods based on 3DMMs (such as BFM and FLAME) can reconstruct faces with high precision from a single RGB image. Concurrently, as a non-optical sensor, millimeter-wave (mmWave) radar has been applied in scenarios like human pose estimation and security scanning due to its ability to penetrate fabric and operate independently of illumination.

Limitations of Prior Work: Optical sensors have inherent limitations in scenarios such as sleep laboratories—they require illumination, cannot penetrate blankets or pillows, and require patients to be exposed to cameras. Although radar can solve these issues, the resolution of radar images is low (spatial resolution of approx. \(4\text{ mm} \times 4\text{ mm} \times 11\text{ mm}\)) and its imaging characteristics are completely different from RGB (relying on surface normal reflections, with not all facial regions being visible), making it impossible to directly transfer RGB reconstruction methods to the radar domain.

Key Challenge: The radar domain lacks a differentiable renderer. Model-based autoencoder approaches in the RGB domain (such as the Analysis-by-Synthesis framework by Tewari et al.) rely on differentiable rendering. However, physical radar renderers (based on ray tracing + back-projection) are both non-differentiable and extremely slow (approx. 2 minutes per image), rendering them unsuitable for direct training.

Goal:

  • How can 3DMM parameters be estimated from radar images (amplitude image / depth image) for face reconstruction?
  • How can the Analysis-by-Synthesis training paradigm be implemented without a differentiable radar renderer?
  • How can the domain gap between synthetic and real radar images be bridged?

Key Insight: Approximate the physical renderer by training a neural network as a "differentiable radar renderer," thereby transferring the mature model-based autoencoder framework from the RGB domain to the radar domain.

Core Idea: Replace the physical renderer with a learned differentiable radar renderer to construct a model-based autoencoder, enabling 3D face reconstruction from radar images to be optimized in an unsupervised manner using Analysis-by-Synthesis.

Method

Overall Architecture

The input is a radar amplitude image (or depth image, or a combination of both), and the output comprises the shape parameters \(\alpha \in \mathbb{R}^{10}\), expression parameters \(\gamma \in \mathbb{R}^{7}\), and pose parameters (translation \(t \in \mathbb{R}^3\), rotation \(T \in \mathbb{R}^3\)) of BFM 2019. The overall pipeline is divided into two stages:

  1. Encoder (Fully Supervised Training): Three CNNs predict shape, expression, and pose parameters respectively, trained using an \(L_2\) parameter loss.
  2. Autoencoder (Analysis-by-Synthesis): Combines the pre-trained encoder with a learned differentiable renderer (decoder) to jointly optimize parameter loss and image reconstruction loss.
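
As a reminder of what these parameters control, a linear 3DMM such as BFM maps them to mesh vertices roughly as follows. This is the standard 3DMM formulation written with assumed symbol names (\(\bar{v}\), \(S\), \(E\), \(R\), \(N\)); the exact pose convention used in the paper is not stated here, so the order of rotation and translation is an assumption.

\[
v(\alpha, \gamma) = \bar{v} + S\alpha + E\gamma, \qquad x_i = R(T)\, v_i + t, \quad i = 1, \dots, N
\]

Here \(\bar{v} \in \mathbb{R}^{3N}\) is the mean face with \(N\) vertices, \(S \in \mathbb{R}^{3N \times 10}\) and \(E \in \mathbb{R}^{3N \times 7}\) are the truncated shape and expression bases, \(v_i \in \mathbb{R}^3\) is the \(i\)-th vertex of \(v\), and \(R(T)\) is the rotation matrix built from the three rotation parameters.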

Key Designs

  1. Physical Radar Renderer and Synthetic Dataset

    • Function: Generate 10,000 synthetic radar images for training.
    • Mechanism: Based on the face12 mask of BFM 2019, a face mesh is generated by Gaussian sampling of the shape vector \(\alpha \sim \mathcal{N}(0,1)\) and expression vector \(\gamma \sim \mathcal{N}(0,1)\). Then, the physical radar renderer by Schüssler et al. (ray-tracing + back-projection algorithm) is used to generate the corresponding radar amplitude image. Poses include \(\pm 5\) cm translation, \(\pm 5^\circ\) yaw, and \(\pm 10^\circ\) pitch/roll. Rendering parameters (material factor, antenna size) are also randomly sampled to increase diversity.
    • Design Motivation: Real radar data is extremely scarce (the paper only has data from 4 individuals), so training relies heavily on synthetic data. Although the physical renderer is slow (about 2 minutes per image), the dataset can be generated offline in batches.
  2. Encoder Architecture (Three-Network Parallelism)

    • Function: Predict 3DMM parameters from radar images.
    • Mechanism: Inspired by the design of Chang et al., two ResNet-50 networks are used to predict shape \(\alpha\) and expression \(\gamma\) respectively, while one AlexNet predicts pose \((t, T)\). The output of ResNet is scaled to the \([-3, 3]\) range using a tanh layer (covering 99.8% of the sampled values). During training, the dynamic range \([-15, -30]\) dB is randomly sampled for data augmentation, and is fixed to \(-20\) dB during evaluation.
    • Design Motivation: Shape and expression are high-dimensional semantic features requiring larger networks (ResNet-50), while pose represents low-dimensional geometric quantities well-suited for a lightweight AlexNet. The tanh scaling constrains the output range to prevent the prediction of unreasonable 3DMM parameters.
  3. Learnable Radar Renderer (Inverse ResNet-50)

    • Function: Generate radar images from 3DMM parameters, substituting for the non-differentiable physical renderer.
    • Mechanism: Rearrange the layers of ResNet-50 in reverse to serve as a decoder, appending a fully connected layer at the front to map parameters to the first convolutional layer. The inputs are shape, expression, pose, and rendering parameters, and the outputs are amplitude/depth images. It is pre-trained independently on synthetic data, frozen, and then utilized as the decoder of the autoencoder.
    • Design Motivation: The physical renderer is non-differentiable and slow, whereas a trained neural network renderer offers: (a) differentiability, which enables Analysis-by-Synthesis; (b) a speedup of over 2000\(\times\) (58 ms vs. 2 minutes per image); (c) when used as a frozen decoder, a constraint on the structure of the encoder's latent space.
  4. Model-based Autoencoder

    • Function: Integrate the encoder and decoder to provide additional supervision via image reconstruction loss.
    • Mechanism: Connect the pre-trained encoder and the frozen decoder in series. During training, the decoder weights are fixed (maintaining the latent space structure), and only the encoder weights are updated. The total loss is defined as \(L_{train} = L_{image} + \lambda \cdot L_{params}\), where \(L_{image}\) is the \(L_2\) loss between the output and input images, \(L_{params}\) is the \(L_2\) loss between the predicted parameters and GT parameters, and \(\lambda = 1\).
    • Design Motivation: Training the encoder solely with parameter loss provides point-wise supervision. Adding the image reconstruction loss imposes a global consistency constraint in pixel space, serving as a regularizer. Crucially, during inference, the encoder and decoder can be frozen to optimize latent variables (parameters) solely by backpropagating the image loss, achieving unsupervised test-time optimization.
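
The parameter sampling and the tanh output scaling described in the designs above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and uniform sampling of the pose within the quoted limits (and applying the translation limit per axis) is an assumption, since only the ranges are given.

```python
import math
import random

N_SHAPE, N_EXPR = 10, 7  # parameter dimensions stated in the text


def sample_face_parameters(rng):
    """Draw one synthetic parameter set using the ranges quoted above."""
    return {
        # Shape and expression coefficients are standard-normal samples.
        "shape": [rng.gauss(0.0, 1.0) for _ in range(N_SHAPE)],
        "expression": [rng.gauss(0.0, 1.0) for _ in range(N_EXPR)],
        # ASSUMPTION: uniform sampling, limits applied per axis.
        "translation_cm": [rng.uniform(-5.0, 5.0) for _ in range(3)],
        "yaw_deg": rng.uniform(-5.0, 5.0),
        "pitch_deg": rng.uniform(-10.0, 10.0),
        "roll_deg": rng.uniform(-10.0, 10.0),
    }


def scale_output(raw, limit=3.0):
    """Map an unbounded network output into [-limit, limit] with tanh."""
    return limit * math.tanh(raw)


rng = random.Random(0)
params = sample_face_parameters(rng)
scaled = [scale_output(x) for x in (-100.0, 0.0, 100.0)]
```

Because tanh saturates, even very large raw outputs stay inside the \([-3, 3]\) interval, which is what keeps the predicted 3DMM coefficients within the range seen during sampling.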

Loss & Training

  • Encoder Training: \(L_2\) parameter loss, Adam optimizer, learning rate linearly decayed from 0.01 to 0.001 (first 150 epochs), 200 epochs in total, batch size 50.
  • Decoder Training: \(L_2\) image reconstruction loss, learning rate fixed at 0.001, 300 epochs, batch size 150.
  • Autoencoder Training: \(L_{train} = L_{image} + L_{params}\), decoder weights frozen, learning rate 0.001, 300 epochs, batch size 50.
  • Inference Optimization: Freeze encoder + decoder, optimize only latent variables using \(L_{image}\).
  • Hardware: Trained on Nvidia A100; autoencoder takes approximately 4 hours.
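
The combined training loss and the inference-time optimization can be illustrated with a toy stand-in. In this sketch the frozen decoder is replaced by a small fixed linear map so that the gradient can be written by hand; the real method backpropagates through the learned neural renderer instead. All names (`render`, `refine_latent`, `WEIGHTS`) and the step size are hypothetical.

```python
def render(weights, latent):
    """Toy frozen decoder: a fixed linear map from parameters to pixels."""
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]


def squared_error(a, b):
    """Sum of squared differences between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_loss(rendered, image, predicted, ground_truth, lam=1.0):
    """L_train = L_image + lam * L_params, with lam = 1 as in the text."""
    return squared_error(rendered, image) + lam * squared_error(predicted, ground_truth)


def image_loss_grad(weights, latent, target):
    """Gradient of the image loss w.r.t. the latent: 2 * W^T (W z - target)."""
    residual = [p - t for p, t in zip(render(weights, latent), target)]
    return [
        2.0 * sum(weights[i][j] * residual[i] for i in range(len(weights)))
        for j in range(len(latent))
    ]


def refine_latent(weights, initial, target, lr=0.05, steps=200):
    """Gradient descent on the latent only; the decoder weights never change."""
    latent = list(initial)
    for _ in range(steps):
        grad = image_loss_grad(weights, latent, target)
        latent = [z - lr * g for z, g in zip(latent, grad)]
    return latent


WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3-pixel "renderer"
observed = render(WEIGHTS, [0.5, -1.0])          # stands in for the input image
encoder_guess = [0.0, 0.0]                       # stands in for the encoder output
refined = refine_latent(WEIGHTS, encoder_guess, observed)
```

No ground-truth parameters appear in `refine_latent`: only the observed image drives the update, which is what makes the inference-time step unsupervised.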

Key Experimental Results

Main Results

| Method | Input Type | \(L_2\) Shape \(\downarrow\) | \(L_2\) Expr \(\downarrow\) | \(L_2\) Total \(\downarrow\) | Vertex Dist. (mm) \(\downarrow\) |
| --- | --- | --- | --- | --- | --- |
| Baseline (Mean) | - | 0.991 | 1.000 | 0.808 | 4.42 |
| Baseline (Random) | - | 2.012 | 2.059 | 1.648 | 6.28 |
| Encoder | Amplitude | 0.840 | 0.921 | 0.655 | 3.47 |
| Encoder | Depth | 0.740 | 0.910 | 0.608 | 2.77 |
| Encoder | Amp.+Depth | 0.735 | 0.885 | 0.598 | 2.82 |
| Autoencoder | Amplitude | 0.770 | 0.815 | 0.594 | 3.29 |
| Autoencoder | Depth | 0.634 | 0.850 | 0.544 | 2.56 |
| Autoencoder | Amp.+Depth | 0.640 | 0.820 | 0.539 | 2.61 |
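
The vertex distance column can be read as a mean Euclidean distance between corresponding mesh vertices. A minimal sketch of such a metric is given below; whether the paper applies any alignment or vertex masking before averaging is not stated here, so this plain per-vertex average is an assumption, and the function name is hypothetical.

```python
import math


def mean_vertex_distance(predicted, reference):
    """Average Euclidean distance between corresponding vertices (same units as input)."""
    if len(predicted) != len(reference):
        raise ValueError("meshes must have the same number of vertices")
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)


# Two-vertex toy meshes in millimeters: the vertices are off by 3 mm and 4 mm.
predicted_mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference_mesh = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error_mm = mean_vertex_distance(predicted_mesh, reference_mesh)  # 3.5
```

Since BFM meshes share a fixed topology, predicted and ground-truth vertices correspond one-to-one, so no nearest-neighbor search is needed for a metric of this kind.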

Ablation Study

| Configuration | Vertex Dist. (mm) | Description |
| --- | --- | --- |
| Autoencoder + Depth | 2.56 | Best configuration |
| Autoencoder + Amp.+Depth | 2.61 | Dual-modality slightly worse than depth-only, likely due to amplitude noise |
| Encoder + Depth | 2.77 | Without autoencoder regularization, 0.21 mm worse |
| Encoder + Amp.+Depth | 2.82 | Encoder dual-modality |
| Autoencoder + Amplitude | 3.29 | Insufficient information from amplitude alone |
| Encoder + Amplitude | 3.47 | Worst valid configuration |

Key Findings

  • Autoencoder consistently outperforms Encoder: The autoencoder version demonstrates lower \(L_2\) errors in both shape and expression parameter predictions than the pure encoder version. The extra image reconstruction loss functions as a regularizer.
  • Depth images significantly outperform amplitude images: With the autoencoder, switching from amplitude to depth input decreases the vertex distance from 3.29 mm to 2.56 mm (a 22% improvement). This is attributed to depth images having a more uniform distribution of facial pixel values, whereas amplitude images show high-value peaks in small areas.
  • Pose affects shape/expression prediction: The diagonal of the cosine similarity is more distinct under neutral pose, indicating that pose variations increase the difficulty of shape/expression estimation.
  • Shape recognition is feasible on real data: Cosine similarity analysis shows that shape parameters for the same individual retain high similarity in real images (potentially useful for face recognition).
  • Expression recognition on real data remains challenging: Expression parameters lack a distinct clustering structure on real data, with only identity-specific models being capable of identifying expressions.
  • The learnable renderer is over 2000\(\times\) faster than the physical renderer (58 ms vs. 2 min).
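
The percentages and the speedup quoted in these findings follow directly from the reported numbers. The short check below recomputes them; the helper name is hypothetical, and the inputs are the values from the results table and the rendering times given in the text.

```python
def relative_improvement(baseline, improved):
    """Fractional reduction of an error metric relative to the baseline."""
    return (baseline - improved) / baseline


# Amplitude -> depth input, vertex distance in mm.
autoencoder_gain = relative_improvement(3.29, 2.56)
encoder_gain = relative_improvement(3.47, 2.77)

# Physical renderer (2 minutes) vs. learned renderer (58 ms), both in seconds.
speedup = (2 * 60) / 0.058

print(f"autoencoder: {autoencoder_gain:.1%}, encoder: {encoder_gain:.1%}")
print(f"speedup: {speedup:.0f}x")
```

The encoder-only models show a similar relative gain from depth input, so the advantage of depth images does not depend on the autoencoder stage.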

Highlights & Insights

  • First study utilizing radar images for 3D face reconstruction: The work opens up a new sensing modality for scenarios such as sleep monitoring, dark environments, and occluded faces. It addresses a new problem rather than incrementally improving existing RGB methods.
  • Bypassing non-differentiable renderers with a learned renderer: Physical radar renderers based on ray tracing are inherently non-differentiable. Instead of deriving differentiable formulations (which is extremely difficult), the authors train a neural network to approximate it. This is a practical approach—any non-differentiable rendering/simulation process can be "wrapped" as a differentiable module this way. It is transferable to other physics-based yet non-differentiable simulation scenarios.
  • Test-time optimization via image loss: Freezing all weights during inference and optimizing latent parameters solely via backpropagated image reconstruction loss is an elegant form of unsupervised self-adaptation. This essentially leverages reconstruction consistency as a self-supervised signal.
  • Rendering parameters predicted as targets: The encoder predicts not only 3DMM parameters but also rendering parameters like dynamic range, material factors, and antenna sizes, teaching the network to understand the imaging process itself.

Limitations & Future Work

  • Extreme scarcity of real data: The dataset contains only 4 male Europeans with 5 expressions each, which severely lacks diversity. This makes it impossible to fully validate the synthetic-to-real generalization ability. Large-scale real radar face datasets are needed in the future.
  • Synthetic-to-real domain gap: Significant pattern and scale differences exist between synthetic and real images, resulting in a pronounced degradation of shape/expression reconstruction quality on real data.
  • Failure in expression recognition: The model fails to effectively recognize expression changes on real data, where only identity-specific models can do so, which severely restricts practicality in applications like sleep monitoring.
  • Limited 3DMM parameter space: Relying on only 10-dimensional shape and 7-dimensional expression parameters covers roughly 85% of shape variance and 76% of expression variance, losing a substantial amount of detailed information.
  • Future directions: (1) Utilize domain adaptation/domain randomization to narrow the synthetic-to-real gap; (2) Extend to the FLAME model for a richer expression space; (3) Introduce self-supervised contrastive learning to leverage unlabeled real data; (4) Integrate multi-view radar arrays to enhance reconstruction completeness.

Related Work

  • vs. Tewari et al. (MoFA): Both utilize a model-based autoencoder for Analysis-by-Synthesis face reconstruction, but MoFA relies on a differentiable OpenGL renderer. Here, because there is no differentiable renderer in the radar domain, one is learned. The core contribution is not the autoencoder framework itself, but the transfer of this framework to the radar domain.
  • vs. Xie et al.: Xie et al. present the first work on radar-based 3D face reconstruction, relying on landmark detection + FLAME fitting (a two-step pipeline). This work instead regresses 3DMM parameters directly (end-to-end), and its autoencoder framework supports unsupervised optimization.
  • vs. Chen et al. (ImmFusion): Performs full-body reconstruction combining radar + RGB. Here, pure radar is used without requiring RGB assistance, focusing specifically on the face rather than the full body.

Rating

  • Novelty: ⭐⭐⭐⭐ For the first time, 3D face reconstruction is extended from RGB to radar images, and the approach of using a learned renderer to bypass non-differentiable physical rendering is creative.
  • Experimental Thoroughness: ⭐⭐⭐ Synthetic experiments are comprehensive (multiple input modalities, encoder vs. autoencoder comparison), but real data relies only on 4 subjects and lacks diversity.
  • Writing Quality: ⭐⭐⭐⭐ Well-structured with clear logical flow from motivation to method and high-quality charts.
  • Value: ⭐⭐⭐ A pioneering work of inspiring significance, though still far from practical application due to data limitations and real-world performance.