DiffPortrait360: Consistent Portrait Diffusion for 360° View Synthesis
Conference: CVPR 2025
arXiv: 2503.15667
Code: https://freedomgu.github.io/DiffPortrait360
Area: 3D Vision
Keywords: 360-degree Head Generation, Diffusion Models, Novel View Synthesis, Portrait Reconstruction, Style Generalization
TL;DR
Presents the first method capable of generating consistent 360° full-head views from a single portrait. It combines a dual-appearance control module, a back-view generation ControlNet, and a continuous view sequence training strategy. It supports real humans, stylized characters, and anthropomorphic characters, and its outputs can be fitted to a high-quality NeRF for real-time free-viewpoint rendering.
Background & Motivation
Generating 360° full-head views from a single portrait is crucial for immersive telepresence, personalized avatars, and content creation. Existing methods face three major challenges: (1) GAN-based methods (PanoHead, SphereHead) only work on realistic human faces and cannot handle stylized characters; (2) diffusion-based methods (DiffPortrait3D) can only generate near-frontal views and show poor multi-view consistency; (3) general 3D generation methods (Zero123, Unique3D) lack domain-specific knowledge of human heads, which often leads to severe artifacts. The goal of this paper is to build a "style-agnostic" 360° head generation framework that ensures both global appearance consistency and local viewpoint continuity.
Method
Overall Architecture
Built upon DiffPortrait3D, the framework uses a frozen, pre-trained Latent Diffusion Model (LDM) as the rendering backbone and introduces three trainable auxiliary modules: a dual-appearance reference module \(\mathcal{R}\) (extracting front/back appearance information), a camera control module \(\mathcal{C}\) (ControlNet injecting camera poses rendered by a 3D-aware GAN), and a view consistency module \(\mathcal{V}\) (temporal cross-attention ensuring inter-frame continuity). During inference, a dedicated ControlNet \(\mathcal{F}\) is first used to generate the back view.
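The data flow described above can be sketched as a small pipeline skeleton. This is only an illustration of the ordering of the stages (back view first, then conditioned rendering); the function names, the string placeholders standing in for images, and the evenly spaced azimuths are all assumptions made for this sketch and do not reflect the authors' code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneratedView:
    azimuth_deg: float  # camera azimuth this view was rendered for
    conditioning: str   # human-readable record of what conditioned it


def generate_back_view(front_image: str) -> str:
    """Hypothetical stand-in for ControlNet F: front portrait -> back view."""
    return f"back_of({front_image})"


def render_view(front_image: str, back_image: str, azimuth_deg: float) -> GeneratedView:
    """Hypothetical stand-in for the frozen LDM conditioned by R, C, and V."""
    conditioning = (
        f"R=[{front_image}, {back_image}] "  # dual-appearance references
        f"C={azimuth_deg:.0f}deg "           # camera control signal
        f"V=neighbouring views"              # view-consistency attention
    )
    return GeneratedView(azimuth_deg, conditioning)


def synthesize_orbit(front_image: str, num_views: int) -> List[GeneratedView]:
    """Generate the back view once, then render each camera on a 360-degree orbit."""
    back_image = generate_back_view(front_image)
    azimuths = [i * 360.0 / num_views for i in range(num_views)]
    return [render_view(front_image, back_image, a) for a in azimuths]


views = synthesize_orbit("portrait.png", num_views=4)
for view in views:
    print(view.azimuth_deg, view.conditioning)
```

The point of the sketch is the dependency order: every rendered view receives both the input portrait and the generated back view, so the back-view stage has to run before any novel view is produced.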
Key Designs
- Dual Appearance Module:
  - Function: Resolves the "double-face" artifacts and information leakage that arise when a single front-facing reference image provides too little appearance information for large viewpoint changes.
  - Mechanism: During training, both a front-facing reference image \(I_{\text{ref}}\) and a back-facing reference image \(I_{\text{back}}\) are provided as input; the pair is chosen so that the two views overlap as little as possible. A ReferenceNet extracts the appearance features of both images, allowing the diffusion network to determine automatically which image to rely on more at each camera viewpoint.
  - Design Motivation: In single-reference schemes, when generating the back of the head, the network erroneously leaks front-facing features to the back, resulting in a second face. The dual-appearance scheme provides complete 360° appearance coverage, which removes this ambiguity.
- Back-view Generation Network (ControlNet \(\mathcal{F}\)):
  - Function: Automatically generates a plausible back view from a front-facing portrait during inference.
  - Mechanism: Based on the ControlNet architecture, it takes the front image \(I_{\text{ref}}\) as input and generates the back image \(I_{\text{back}}\), ensuring consistent style, reasonable head shape, and matching hairstyle. A key innovation is the inclusion of 1,000 stylized front/back view pairs (generated by Unique3D) in the training data, which mitigates the bias toward photorealistic data.
  - Design Motivation: The dual-appearance module requires a back-view image during inference, which is not available in practice. Training solely on real data would make the network produce a realistic back view even for a stylized input (domain bias).
- Continuous View Sequence Training Strategy:
  - Function: Enhances local continuity and smooth transitions between viewpoints.
  - Mechanism: Instead of training the temporal Transformer on randomly sampled sparse views, the method uses a 3D-aware GAN (PanoHead) to generate continuously sampled viewpoint sequences (8 consecutive views), so that the pre-trained motion priors are fully leveraged.
  - Design Motivation: Random viewpoint training does not let the temporal module learn smooth transitions, which results in flickering and jitter during inference. Continuous sequence training, even with limited data, significantly improves consistency, so that the generated views can be successfully fitted to a NeRF.
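The difference between random sparse sampling and continuous sequence sampling can be made concrete with a small sketch. It assumes a hypothetical orbit of 60 evenly spaced azimuths and compares the largest angular jump inside a training clip of 8 views under both strategies; the orbit size, helper names, and the use of azimuth only are illustrative assumptions, not details taken from the paper.

```python
import random
from typing import List


def camera_azimuths(num_cameras: int) -> List[float]:
    """Evenly spaced azimuths (degrees) on a full 360-degree orbit."""
    return [i * 360.0 / num_cameras for i in range(num_cameras)]


def sample_random_views(azimuths: List[float], k: int, rng: random.Random) -> List[float]:
    """Sparse strategy: k views drawn anywhere on the orbit, sorted by angle."""
    return sorted(rng.sample(azimuths, k))


def sample_continuous_views(azimuths: List[float], k: int, rng: random.Random) -> List[float]:
    """Continuous strategy: k neighbouring views starting at a random camera."""
    start = rng.randrange(len(azimuths))
    return [azimuths[(start + i) % len(azimuths)] for i in range(k)]


def max_step_deg(views: List[float]) -> float:
    """Largest angular jump between consecutive views in a clip."""
    def gap(a: float, b: float) -> float:
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)  # handle wrap-around at 360 degrees
    return max(gap(a, b) for a, b in zip(views, views[1:]))


rng = random.Random(0)
orbit = camera_azimuths(60)  # hypothetical 60-camera orbit, 6 degrees apart
random_clip = sample_random_views(orbit, 8, rng)
continuous_clip = sample_continuous_views(orbit, 8, rng)
print("random clip max step:", max_step_deg(random_clip))
print("continuous clip max step:", max_step_deg(continuous_clip))
```

In the continuous clip, every step equals the camera spacing, so a temporal module always sees small, regular viewpoint changes. A random clip generally contains much larger jumps, which gives the module little signal about smooth transitions.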
Loss & Training
- Standard LDM denoising loss: \(L_{ldm} = \mathbb{E}[\|\epsilon - \epsilon_\theta(z_t, t)\|_2^2]\)
- Initialization with 3D-aware noise, obtained through fine-tuning-based inversion with StyleGAN-based 360° head generators
- Mixed training data: RenderMe360 (500 subjects, 60 camera views) + PanoHead/SphereHead synthesized continuous views (600 identities)
- Generation speed of 5.6 seconds per view
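The denoising loss above can be illustrated with a toy Monte Carlo estimate in plain Python. The sketch assumes the usual DDPM-style forward process \(z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\, \epsilon\); the symbol \(\bar\alpha_t\), the fixed noise level, the three-dimensional "latent", and both toy predictors are assumptions introduced for illustration, and the conditioning inputs of the real model are omitted.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def add_noise(z0: Vector, eps: Vector, alpha_bar_t: float) -> Vector:
    """Forward process: mix the clean latent with Gaussian noise."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * z + noise * e for z, e in zip(z0, eps)]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared L2 distance, the quantity inside the expectation."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def estimate_ldm_loss(predict_eps: Callable[[Vector, float], Vector],
                      z0: Vector, alpha_bar_t: float,
                      num_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of E[||eps - eps_theta(z_t, t)||^2] at one noise level."""
    total = 0.0
    for _ in range(num_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in z0]
        z_t = add_noise(z0, eps, alpha_bar_t)
        total += squared_l2(eps, predict_eps(z_t, alpha_bar_t))
    return total / num_samples


Z0 = [0.5, -1.0, 2.0]  # toy "latent" (hypothetical values)
ALPHA_BAR = 0.7        # toy noise level (hypothetical value)


def oracle_predictor(z_t: Vector, alpha_bar_t: float) -> Vector:
    """Cheats by knowing Z0, so it recovers the noise exactly."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [(zt - signal * z) / noise for zt, z in zip(z_t, Z0)]


def zero_predictor(z_t: Vector, alpha_bar_t: float) -> Vector:
    """Always predicts zero noise; its loss is E[||eps||^2]."""
    return [0.0] * len(z_t)


rng = random.Random(0)
oracle_loss = estimate_ldm_loss(oracle_predictor, Z0, ALPHA_BAR, 1000, rng)
zero_loss = estimate_ldm_loss(zero_predictor, Z0, ALPHA_BAR, 1000, rng)
print("oracle loss:", oracle_loss)
print("zero-predictor loss:", zero_loss)
```

The oracle's loss is zero up to floating-point error, while the zero predictor's loss approaches the latent dimension (3 here), since each noise component has unit variance. A trained network sits between these two extremes.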
Key Experimental Results
Main Results
| Method | Front PSNR↑ | Front CSIM↑ | Front FID↓ | Back PSNR↑ | Back FID↓ |
|---|---|---|---|---|---|
| PanoHead + PTI | 28.35 | 0.471 | 98.93 | 28.39 | 169.52 |
| SphereHead + PTI | 28.62 | 0.556 | 69.41 | 28.63 | 106.21 |
| DiffPortrait3D* | 28.96 | 0.709 | 49.02 | 28.47 | 91.37 |
| Ours | 29.44 | 0.746 | 35.34 | 30.92 | 39.40 |
Ablation Study
| Configuration | Effect | Description |
|---|---|---|
| Without dual-appearance control | Inconsistent accessories and hair color appear at viewpoints far from the reference | Insufficient information from a single reference image |
| With dual-appearance control | Good consistency in texture and 3D shape | Complementary front and back appearances |
| Training back-view network solely on real data | Generates realistic back views for stylized inputs | Data bias |
| + Stylized data augmentation | Correctly maintains input style | 1000 pairs are sufficient to correct bias |
| Without 3D-aware noise | Complete failure of visual alignment | Missing camera control signals |
| Without continuous sequence training | Some viewpoints align, but textures are messy after NeRF fitting | Multi-view inconsistency |
| Full method | Consistent viewpoints + fittable NeRF | Supports downstream 3D reconstruction |
Key Findings
- Back-view FID drops from 91.37 (DiffPortrait3D) to 39.40 (-57%), close to the front-view FID of 35.34, indicating that the dual-appearance module combined with back-view generation effectively addresses full-head consistency.
- GAN-based methods completely fail on stylized portraits; general 3D methods (Zero123/Unique3D) cannot handle the domain-specific nuances of human heads.
- Only 1,000 stylized front/back view pairs are needed to significantly reduce the domain bias of back-view generation.
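The percentage quoted above can be reproduced directly from the Main Results table. The short calculation below uses only FID values from that table; the helper name `relative_change_percent` is introduced here for illustration.

```python
def relative_change_percent(baseline: float, ours: float) -> float:
    """Signed relative change of `ours` with respect to `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# FID values from the Main Results table (DiffPortrait3D* vs. Ours).
back_fid_change = relative_change_percent(91.37, 39.40)
front_fid_change = relative_change_percent(49.02, 35.34)

# Gap between back-view and front-view FID for each method.
baseline_gap = 91.37 - 49.02
ours_gap = 39.40 - 35.34

print(f"back FID change:  {back_fid_change:.1f}%")
print(f"front FID change: {front_fid_change:.1f}%")
print(f"back/front FID gap: baseline {baseline_gap:.2f}, ours {ours_gap:.2f}")
```

The back-view change rounds to the -57% reported above. The back/front gap is the more telling number: it shrinks from roughly 42 FID points for the baseline to roughly 4 for the proposed method.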
Highlights & Insights
- First style-agnostic 360° full-head generation method: works across various styles, including real humans, cartoons, and anthropomorphic animal characters.
- Ingenious design of the dual-appearance module: the network learns to weight its dependence on the front and back references according to the viewpoint, rather than following hard-coded rules.
- The generated views can be fitted directly to a NeRF for real-time rendering, which gives the method clear practical value.
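The idea of viewpoint-dependent weighting between the two references can be illustrated with a toy attention computation. In this sketch a single query attends over the concatenated front and back reference tokens, and the attention mass on each group is summed. The 2D key vectors and the query built from the azimuth are hand-constructed assumptions chosen to make the effect visible; in the actual model such weights would be learned, and nothing here reproduces its architecture.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reference_attention_mass(query: Vector,
                             front_keys: List[Vector],
                             back_keys: List[Vector]) -> Tuple[float, float]:
    """Scaled dot-product attention over both references; returns (front, back) mass."""
    keys = front_keys + back_keys
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    n_front = len(front_keys)
    return sum(weights[:n_front]), sum(weights[n_front:])


# Hand-made toy keys: front tokens point along +x, back tokens along -x.
FRONT_KEYS = [[1.0, 0.0], [0.9, 0.1]]
BACK_KEYS = [[-1.0, 0.0], [-0.9, 0.1]]


def view_query(azimuth_deg: float) -> Vector:
    """Toy query: unit vector pointing in the viewing direction."""
    rad = math.radians(azimuth_deg)
    return [math.cos(rad), math.sin(rad)]


for azimuth in (0, 90, 180):
    front_mass, back_mass = reference_attention_mass(
        view_query(azimuth), FRONT_KEYS, BACK_KEYS)
    print(f"azimuth {azimuth:3d}: front={front_mass:.2f} back={back_mass:.2f}")
```

With these toy vectors, a frontal query places most of its attention on the front reference and a rear query on the back reference, without any explicit rule that switches between them.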
Limitations & Future Work
- The generation speed of 5.6 seconds per view remains slow, which limits interactive applications.
- The generalization of the back-view generation network relies on the diversity of the stylized augmentation data.
- NeRF is currently used as the downstream 3D representation; combining with 3DGS could be considered to achieve faster rendering.
- There is still room for improvement regarding consistency under extreme expressions and occluded scenes.
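To put the per-view speed in perspective, the following back-of-the-envelope calculation estimates how long a full orbit would take. It assumes views are generated one after another at the reported 5.6 seconds each, with no batching or parallelism; the orbit sizes are hypothetical examples.

```python
SECONDS_PER_VIEW = 5.6  # per-view generation time reported in the summary


def orbit_time_seconds(num_views: int) -> float:
    """Total time for an orbit, assuming strictly sequential generation."""
    return num_views * SECONDS_PER_VIEW


def views_per_second() -> float:
    """Throughput implied by the per-view time."""
    return 1.0 / SECONDS_PER_VIEW


for num_views in (8, 32, 64):  # hypothetical orbit sizes
    print(f"{num_views:2d} views: {orbit_time_seconds(num_views):.1f} s")
print(f"throughput: {views_per_second():.2f} views/s")
```

Under these assumptions the throughput is well below one view per second, which is why the summary treats view generation as an offline step and relies on the fitted NeRF for real-time rendering.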
Related Work & Insights
- Key improvements over DiffPortrait3D: dual-appearance + back-view generation + continuous sequence training, all of which are indispensable.
- The fundamental difference from GAN-based methods like PanoHead/SphereHead lies in the open-domain generalization capability of diffusion models.
- Insight: The two-stage paradigm of "diffusion model generation + downstream 3D fitting" can be generalized to other targets such as full human bodies and hands.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of dual-appearance control and back-view generation is innovative, but the overall framework is an incremental improvement.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comparisons with multiple methods are thorough and ablations are clear, but quantitative metrics for viewpoint consistency are lacking.
- Writing Quality: ⭐⭐⭐⭐ Complete structure with effective illustrations, though the method section is slightly verbose.
- Value: ⭐⭐⭐⭐ Clear application value in stylized 3D portrait generation, though constrained to the head-specific domain.