Novel View Synthesis with Pixel-Space Diffusion Models

Conference: CVPR 2025
arXiv: 2411.07765
Code: Project Page
Area: 3D Vision
Keywords: Novel View Synthesis, Pixel-space Diffusion, Dual U-Net, Single-view Augmentation, Homographic Transformation

TL;DR

VIVID achieves end-to-end novel view synthesis using the EDM2 pixel-space diffusion model. By employing a dual U-Net encoder-decoder with cross-attention to transfer geometric information, a simple camera pose embedding (instead of complex geometric encodings), and single-view data augmentation based on homography, it achieves an FID of 2.89 (51% lower than GenWarp) and a PSNR of 17.36 (+29%) on RealEstate10K.

Background & Motivation

Background: Novel view synthesis (NVS) methods are categorized into 3D representation-based (NeRF/3DGS, requiring multi-view training) and generative-based (diffusion models, enabling single-view inference). Existing diffusion methods mostly operate in latent space (e.g., Zero-1-to-3, GeoGPT).
Limitations of Prior Work: (1) Latent-space diffusion introduces reconstruction loss via the VAE encoder-decoder, leading to severe detail loss, particularly under large view changes; (2) Existing methods require complex geometric encodings (epipolar features, depth warping), increasing engineering complexity; (3) Training relies solely on multi-view video data, leading to poor generalization on out-of-domain scenes.
Key Challenge: Latent space is efficient but suffers from reconstruction bottlenecks; pixel space provides high fidelity but is difficult to train (high resolution and slow convergence).
Goal: Perform NVS directly in pixel space to bypass the latent-space reconstruction loss, while reducing complexity through simple pose embeddings and single-view augmentation.
Key Insight: Advances in the EDM2 architecture dramatically improve the training efficiency of pixel-space diffusion, making pixel-space NVS feasible.
Core Idea: Dual U-Net + cross-attention geometric transfer + pose embedding + homographic rotation augmentation.

Method

Overall Architecture

Source view image \(\rightarrow\) encoder U-Net extracts features \(\rightarrow\) joint self-attention + cross-attention transfers source-view features to the target view \(\rightarrow\) decoder U-Net denoises to generate the target view image. The target pose embedding (flattened extrinsics + intrinsics) is injected as conditioning. The design is cascaded: a low-resolution base model followed by a super-resolution model.

Key Designs

Simple Pose Embedding
- Function: Inject camera pose information into the diffusion process.
- Mechanism: Directly flatten the extrinsic matrix, focal length, and principal point, normalize the result (\(\mu=0, \sigma=1\)), and use it as the embedding. Ablation studies show that the simple pose embedding (FID 3.00) performs close to the more complex pose+epipolar encoding (FID 2.87).
- Design Motivation: Complex geometric encodings (epipolar features, depth warping) increase engineering complexity with marginal gains. Cross-attention is already capable of learning implicit geometric correspondences.
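The flatten-and-normalize step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the 3x4 `[R|t]` extrinsic layout, the resulting 16-dimensional vector, and the use of per-dimension statistics computed over the training poses are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def flatten_pose(extrinsic, focal, principal):
    """Concatenate a 3x4 extrinsic [R|t], focal length (fx, fy) and
    principal point (cx, cy) into one 16-dim vector (assumed layout)."""
    extrinsic = np.asarray(extrinsic, dtype=np.float64)
    if extrinsic.shape != (3, 4):
        raise ValueError("expected a 3x4 extrinsic matrix")
    focal = np.asarray(focal, dtype=np.float64)
    principal = np.asarray(principal, dtype=np.float64)
    return np.concatenate([extrinsic.ravel(), focal, principal])

def fit_normalizer(raw_poses, eps=1e-8):
    """Per-dimension mean and standard deviation over a set of raw pose vectors."""
    raw = np.stack(raw_poses)
    return raw.mean(axis=0), raw.std(axis=0) + eps

def normalize_pose(raw, mean, std):
    """Shift and scale a raw pose vector to roughly zero mean, unit variance."""
    return (raw - mean) / std
```

In a real model the normalized vector would presumably pass through a small learned projection before conditioning the network; the summary does not say how, so that part is omitted here.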
Cross-Attention Geometric Transfer
- Function: Transfer geometric and appearance information from the source view to the target view.
- Mechanism: Joint self-attention + cross-attention, where queries come from the target view while keys/values come from the joint features of both source and target views.
- Design Motivation: More flexible than warping operations. Warping fails in occluded regions, whereas attention can "borrow" information from visible areas.
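A single-head version of this attention pattern can be sketched as below: queries are computed from target-view tokens only, while keys and values are computed from the concatenation of source and target tokens. The projection matrices `w_q`, `w_k`, `w_v`, the token shapes, and the single-head simplification are assumptions for illustration; the actual layer layout in VIVID is not specified in this summary.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(target_tokens, source_tokens, w_q, w_k, w_v):
    """Target queries attend over the joint (source + target) token set.

    target_tokens: (N_t, c), source_tokens: (N_s, c), projections: (c, d).
    Returns the attended features (N_t, d) and the weights (N_t, N_s + N_t).
    """
    q = target_tokens @ w_q
    joint = np.concatenate([source_tokens, target_tokens], axis=0)
    k = joint @ w_k
    v = joint @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights
```

Because every target token can place weight on any source token, a region occluded in one view can still draw on whatever visible source content scores highest, which is the flexibility the design motivation refers to.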
Single-View Homographic Augmentation
- Function: Simulate multi-view data augmentation using single-view images.
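- Mechanism: Apply a random rotational homography \(H_{rot} = K_{dst} R_{dst} R_{src}^{-1} K_{src}^{-1}\) to single-view images to generate source-target image pairs, where \(K\) and \(R\) denote the camera intrinsics and rotations of the source and destination views. A 10% mixing ratio performed best in the reported ablation.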
- Design Motivation: The training data (RealEstate10K) primarily consists of indoor house tour videos, yielding poor generalization to out-of-domain environments (e.g., outdoor natural scenes). Single-view augmentation introduces appearance diversity from out-of-domain images.
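The rotational homography above can be sketched in NumPy as follows. This is an illustrative sketch rather than the paper's pipeline: it assumes pinhole intrinsics, world-to-camera rotation matrices, a yaw-only rotation as the random perturbation, and nearest-neighbour resampling. All function names are hypothetical.

```python
import numpy as np

def intrinsics(fx, fy, cx, cy):
    """3x3 pinhole intrinsic matrix K."""
    return np.array([[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]])

def yaw_rotation(theta):
    """Rotation about the camera y-axis by theta radians (illustrative choice)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rotation_homography(K_src, R_src, K_dst, R_dst):
    """H_rot = K_dst R_dst R_src^{-1} K_src^{-1}; R^{-1} = R^T for rotations."""
    return K_dst @ R_dst @ R_src.T @ np.linalg.inv(K_src)

def warp_points(H, pts):
    """Map (N, 2) pixel coordinates through H with homogeneous division."""
    pts = np.asarray(pts, dtype=np.float64)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def warp_image(image, H, out_shape):
    """Nearest-neighbour inverse warp: each target pixel looks up its source pixel."""
    h, w = out_shape
    ys, xs = np.mgrid[0:h, 0:w]
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # (3, h*w)
    src = np.linalg.inv(H) @ dst
    valid = src[2] > 1e-9  # keep only rays in front of the source camera
    sx = np.full(h * w, -1, dtype=np.int64)
    sy = np.full(h * w, -1, dtype=np.int64)
    sx[valid] = np.rint(src[0, valid] / src[2, valid]).astype(np.int64)
    sy[valid] = np.rint(src[1, valid] / src[2, valid]).astype(np.int64)
    valid &= (sx >= 0) & (sx < image.shape[1]) & (sy >= 0) & (sy < image.shape[0])
    out = np.zeros((h * w,) + image.shape[2:], dtype=image.dtype)
    out[valid] = image[sy[valid], sx[valid]]  # pixels with no source stay zero
    return out.reshape((h, w) + image.shape[2:])
```

A training pair would then be the original image as the source view and `warp_image(image, H, image.shape[:2])` as the target view, with the sampled rotation supplying the relative pose. Because the camera centre does not move, the warp is exact for any scene depth, which is also why it cannot model translation.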

Loss & Training

Standard EDM2 diffusion loss. Classifier-free guidance (CFG) scale of 1.5 (in-domain) / 2.0 (out-of-domain). Cascaded two-stage training: base model + super-resolution model.
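For reference, the CFG scales quoted above are typically applied by extrapolating from the unconditional prediction toward the conditional one. The sketch below uses that common formulation; whether VIVID uses exactly this form, and what it treats as the "unconditional" input, are not stated in this summary, so treat both as assumptions.

```python
import numpy as np

def cfg_combine(cond_pred, uncond_pred, scale):
    """Common classifier-free guidance rule: uncond + scale * (cond - uncond).

    scale = 1.0 returns the conditional prediction unchanged;
    scale > 1.0 pushes the result further along the conditional direction.
    """
    cond_pred = np.asarray(cond_pred, dtype=np.float64)
    uncond_pred = np.asarray(uncond_pred, dtype=np.float64)
    return uncond_pred + scale * (cond_pred - uncond_pred)
```

With the values reported here, the denoiser output would be combined with `scale=1.5` for in-domain inputs and `scale=2.0` for out-of-domain inputs at each sampling step.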

Key Experimental Results

Main Results

Method	Mid-range FID↓	Mid-range PSNR↑	Long-range FID↓	Long-range PSNR↑
GeoGPT	6.43	14.06	7.22	13.13
GenWarp	5.91	13.43	7.38	12.10
PhotoNVS	7.12	13.32	9.22	12.05
VIVID	2.89	17.36	3.89	15.21

Ablation Study

Geometric Encoding	FID↓	PSNR↑	Description
No Encoding	5.75	13.39	Baseline
Epipolar Encoding	4.14	17.43	Geometry helps
Pose Embedding	3.00	21.11	Simpler is better
Pose + Epipolar	2.87	21.15	Marginal gain

Key Findings

Mid-range FID is 51% lower than GenWarp (5.91 \(\rightarrow\) 2.89) and mid-range PSNR is 29% higher (13.43 \(\rightarrow\) 17.36), pointing to a substantial fidelity advantage for pixel space.
The performance gap between simple pose embedding (FID 3.00) and pose+epipolar encoding (FID 2.87) is minimal, suggesting that complex geometric encodings are unnecessary.
Mixing in 10% single-view augmentation reduces out-of-domain FID from 36.14 to 31.98 (-11.5%), but excessive augmentation (25%) degrades performance.
Out-of-domain generalization remains the main bottleneck (in-domain FID 2.89 vs. out-of-domain >30).
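The headline percentages can be reproduced from the Main Results table with a one-line relative-change formula. The snippet below is only an arithmetic check on the numbers quoted in this summary; the helper name is hypothetical.

```python
def relative_change(baseline, value):
    """Signed percentage change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline

# GenWarp -> VIVID, taken from the Main Results table above.
mid_fid = relative_change(5.91, 2.89)     # about -51.1 (lower is better)
mid_psnr = relative_change(13.43, 17.36)  # about +29.3 (higher is better)
long_fid = relative_change(7.38, 3.89)    # about -47.3
long_psnr = relative_change(12.10, 15.21) # about +25.7
```

The long-range gains are somewhat smaller than the mid-range ones, but the direction of the comparison is the same at both distances.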

Highlights & Insights

The "Simple Pose is Enough" Discovery: Challenges the assumption that NVS must use complex geometric encoding; attention mechanisms can learn geometric correspondence implicitly.
Empirical Comparison of Pixel Space vs. Latent Space: The first systematic study to compare them in NVS, demonstrating that pixel space has an inherent advantage in fidelity.
Clever use of Homographic Augmentation: Although only handling rotations (not translations), it is sufficient to introduce valuable out-of-domain diversity.

Limitations & Future Work

Homographic augmentation only models rotations and lacks translation-based depth changes.
Pixel-space diffusion requires more computational resources compared to latent-space methods.
Out-of-domain generalization still incurs significant performance drops (FID >30).
RealEstate10K consists mainly of indoor scenes; outdoor scenes require more training data.

vs. GenWarp: GenWarp relies on warping operations for geometric alignment, which fail in occluded regions. VIVID replaces warping with attention, making it more robust.
vs. GeoGPT: GeoGPT is a latent-space method (FID 6.43), whereas pixel-space VIVID reaches an FID of 2.89. The fidelity gap is attributed mainly to VAE reconstruction loss.

Rating

Novelty: ⭐⭐⭐⭐ The combination of pixel-space NVS and simplified pose embedding is innovative.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multiple distances + out-of-domain + detailed ablation + multiple metrics.
Writing Quality: ⭐⭐⭐⭐ Clear.
Value: ⭐⭐⭐⭐ Offers a new architectural alternative for NVS.