MALD-NeRF: Taming Latent Diffusion Model for Neural Radiance Field Inpainting

Conference: ECCV 2024
arXiv: 2404.09995
Code: Project Page
Area: 3D Vision
Keywords: NeRF Inpainting, Latent Diffusion Model, Adversarial Training, 3D Consistency, Object Removal

TL;DR

MALD-NeRF inpaints NeRF scenes by combining masked adversarial training with a scene-customized latent diffusion model. This addresses two weaknesses of diffusion-based inpainting: multi-view inconsistency and the texture shift introduced by the latent autoencoder.

Background & Motivation

Using 2D latent diffusion models for NeRF inpainting faces two core challenges: (1) The diffusion model's inpainting results are inconsistent across views, so supervising the NeRF with pixel-level losses (\(L_1\)/\(L_2\)) averages the conflicting targets and yields a foggy, blurry inpainted region; (2) The autoencoder error of latent diffusion models causes a texture shift between the inpainted pixels and the original pixels, producing a conspicuous seam at the inpainting boundary. Perceptual loss (LPIPS) mitigates these issues to some extent but does not fundamentally solve them.

Method

Overall Architecture

MALD-NeRF contains three core components: (1) Masked Adversarial Training, which replaces pixel-level losses as the supervision for the inpainted region; (2) Scene-Customized Diffusion, which reduces the generation diversity of the diffusion model through LoRA fine-tuning; (3) Iterative Dataset Update, which gradually reduces the diffusion noise level to propagate 3D-consistent information.

Key Designs

Masked Adversarial Training: Adversarial training uses patches of the diffusion-inpainted images as "real" samples and patches rendered by the NeRF as "generated" samples. The key innovation lies in the mask design: for both real and generated images, only the pixels within the inpainting mask are retained, while the area outside the mask is filled with black. This hides the boundary between inpainted and non-inpainted pixels from the discriminator, so it cannot use the texture shift to tell the two apart. In addition, a discriminator feature matching loss provides finer-grained supervision.
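The masking step can be illustrated with a minimal sketch. It uses plain nested lists as toy grayscale patches; the names `mask_patch` and `feature_matching_loss` are hypothetical, and the L1 form of the feature matching term is an assumption, since this summary does not state the exact formulation. A real implementation would operate on image tensors and discriminator activations.

```python
def mask_patch(patch, mask):
    """Keep pixels where mask == 1 and fill everything else with black (0.0)."""
    return [
        [pixel if keep else 0.0 for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(patch, mask)
    ]


def feature_matching_loss(real_feats, fake_feats):
    """Mean absolute feature difference, averaged over layers (assumed L1 form)."""
    per_layer = [
        sum(abs(r - f) for r, f in zip(real, fake)) / len(real)
        for real, fake in zip(real_feats, fake_feats)
    ]
    return sum(per_layer) / len(per_layer)


# Toy 2x3 patches: the diffusion output acts as "real", the NeRF render as "generated".
inpainted_patch = [[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]]
rendered_patch = [[0.1, 0.8, 0.2], [0.3, 0.5, 0.9]]
inpaint_mask = [[0, 1, 1], [0, 0, 1]]

# Both discriminator inputs are masked the same way, so pixels outside the
# inpainting mask (and therefore the boundary seam) are never visible to it.
real_input = mask_patch(inpainted_patch, inpaint_mask)
fake_input = mask_patch(rendered_patch, inpaint_mask)
```

Because the same mask is applied to both inputs, any difference the discriminator can detect must come from inside the inpainted region.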

Scene-Customized Diffusion: The diffusion model is fine-tuned with LoRA for each scene, learning scene-specific text tokens. Fine-tuning uses a self-supervised inpainting loss with random rectangular masks, and the loss inside the object removal mask is set to zero so that the object to be removed is not learned. After fine-tuning, the diffusion model's outputs are substantially more consistent across views.
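As a rough illustration of the fine-tuning objective, the sketch below samples a random rectangular training mask and computes a loss that ignores pixels inside the object removal mask. The function names are hypothetical and a plain MSE stands in for the diffusion training loss, which is an assumption made only to keep the example self-contained.

```python
import random


def random_rect_mask(height, width, rng):
    """Sample a random axis-aligned rectangle: 1 inside, 0 outside."""
    top = rng.randrange(height)
    left = rng.randrange(width)
    bottom = rng.randrange(top + 1, height + 1)  # exclusive bound, at least one row
    right = rng.randrange(left + 1, width + 1)   # exclusive bound, at least one column
    return [
        [1 if top <= y < bottom and left <= x < right else 0 for x in range(width)]
        for y in range(height)
    ]


def masked_mse(pred, target, removal_mask):
    """MSE over pixels outside the removal mask; pixels inside contribute zero."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, removal_mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m == 0:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


rng = random.Random(0)
train_mask = random_rect_mask(4, 6, rng)  # self-supervised inpainting mask

# The second pixel lies inside the removal mask, so its large error is ignored.
loss = masked_mse([[1.0, 5.0]], [[0.0, 0.0]], [[0, 1]])
```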

Iterative Dataset Update and Noise Scheduling: The inpainted images are regenerated every \(U\) iterations by running partial DDIM sampling that starts from the current NeRF rendering. The noise timestep decreases as training progresses: \(t = t_{max} - (t_{max} - t_{min}) \cdot \sqrt{k/K}\), where \(k\) is the current training iteration and \(K\) is the total number of iterations. Early updates therefore use heavy noise and later ones light noise, giving coarse-to-fine propagation of 3D-consistent information.
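The schedule is easy to check numerically. The sketch below implements the formula above; the concrete values of `t_max`, `t_min`, `K`, and `U` are hypothetical and chosen only for illustration, and the helper names are not from the paper.

```python
import math


def noise_timestep(k, K, t_min, t_max):
    """t = t_max - (t_max - t_min) * sqrt(k / K), with progress clamped to [0, 1]."""
    progress = min(max(k / K, 0.0), 1.0)
    return t_max - (t_max - t_min) * math.sqrt(progress)


def is_update_step(k, U):
    """True on iterations where the inpainted images would be regenerated."""
    return k % U == 0


# Hypothetical settings, not taken from the paper.
T_MAX, T_MIN, K, U = 980, 20, 10000, 1000

update_schedule = [
    (k, noise_timestep(k, K, T_MIN, T_MAX))
    for k in range(0, K + 1)
    if is_update_step(k, U)
]
```

Because of the square root, the timestep falls quickly at first: a quarter of the way through training it has already covered half of the range between `t_max` and `t_min`.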

Loss & Training

Reconstructed Region: \(L^r = \lambda_{pix}L_{pix} + \lambda_{inter}L_{inter} + \lambda_{distort}L_{distort} + \lambda_{decay}L_{decay}\)
Inpainted Region: \(L^m = -\lambda_{adv}L_{adv} + \lambda_{fm}L_{fm}\) + regularization terms
Discriminator: \(L^D = L_{adv} + \lambda_{GP}L_{GP}\) (R1 regularization)
The inpainted region does not use pixel/perceptual losses at all.
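The three objectives above can be written as small helper functions. This is only a sketch of how the terms combine: the function names and all numeric values are hypothetical, the individual loss terms are passed in as already-computed scalars, and the unspecified regularization terms of \(L^m\) are collapsed into a single argument.

```python
def reconstructed_region_loss(terms, weights):
    """L^r: weighted sum of the pix, inter, distort, and decay terms."""
    names = ("pix", "inter", "distort", "decay")
    return sum(weights[name] * terms[name] for name in names)


def inpainted_region_loss(l_adv, l_fm, lam_adv, lam_fm, regularization=0.0):
    """L^m: adversarial and feature matching terms only, no pixel or perceptual loss."""
    return -lam_adv * l_adv + lam_fm * l_fm + regularization


def discriminator_loss(l_adv, l_gp, lam_gp):
    """L^D: adversarial term plus the weighted gradient penalty."""
    return l_adv + lam_gp * l_gp


# Hypothetical scalar values, used only to show how the terms combine.
l_r = reconstructed_region_loss(
    {"pix": 1.0, "inter": 1.0, "distort": 1.0, "decay": 1.0},
    {"pix": 1.0, "inter": 0.5, "distort": 0.25, "decay": 0.25},
)
l_m = inpainted_region_loss(l_adv=2.0, l_fm=0.5, lam_adv=1.0, lam_fm=4.0)
l_d = discriminator_loss(l_adv=2.0, l_gp=0.25, lam_gp=4.0)
```

The adversarial term enters \(L^m\) and \(L^D\) with opposite signs, which reflects that the NeRF and the discriminator are optimized against each other.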

Key Experimental Results

Quantitative Comparison on SPIn-NeRF Dataset

Method	LPIPS↓	M-LPIPS↓	FID↓	KID↓
SPIn-NeRF	0.5356	0.4019	219.80	0.0616
SPIn-NeRF (LDM)	0.5568	0.4284	227.87	0.0558
Inpaint3D	0.5437	0.4374	271.66	0.0964
InpaintNeRF360	0.4694	0.3672	222.12	0.0544
MALD-NeRF	0.4345	0.3344	183.25	0.0397
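To put the FID column in relative terms, the following sketch computes the fractional reduction MALD-NeRF achieves over each baseline, using only the numbers from the table above. The helper name `relative_reduction` is hypothetical.

```python
# FID values copied from the comparison table above (lower is better).
fid = {
    "SPIn-NeRF": 219.80,
    "SPIn-NeRF (LDM)": 227.87,
    "Inpaint3D": 271.66,
    "InpaintNeRF360": 222.12,
    "MALD-NeRF": 183.25,
}


def relative_reduction(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


reductions = {
    name: relative_reduction(value, fid["MALD-NeRF"])
    for name, value in fid.items()
    if name != "MALD-NeRF"
}

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1%} lower FID with MALD-NeRF")
```

For example, the reduction relative to SPIn-NeRF is roughly 17%.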

Ablation Study

Settings	LPIPS↓	FID↓	KID↓
w/o Adversarial + L1 Reconstruction	0.6623	305.60	0.1177
w/o Adversarial + LPIPS	0.4231	192.86	0.0447
Ours + L1 Reconstruction	0.5106	256.82	0.0827
Ours + LPIPS	0.4130	185.79	0.0419
Ours - Scene Customization	0.4894	224.29	0.0596
MALD-NeRF (Full)	0.4345	183.25	0.0397

Key Findings

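Compared with SPIn-NeRF, FID drops from 219.80 to 183.25 and KID from 0.0616 to 0.0397; MALD-NeRF also achieves the best LPIPS and M-LPIPS among the compared methods.
Pixel-level L1 loss is detrimental to the inpainting task (FID 305.60 without adversarial training, and 256.82 when added to the full method). LPIPS supervision yields slightly lower LPIPS (0.4231 and 0.4130 vs. 0.4345) but worse FID and KID than the full model, so adversarial supervision gives the best generation quality.
Scene customization reduces the generation of out-of-context objects and improves 3D consistency; removing it raises FID from 183.25 to 224.29.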
Masked adversarial training is the key design to eliminate the texture shift.

Highlights & Insights

Deep problem analysis: the paper clearly identifies the limitations of pixel-level and perceptual losses in NeRF inpainting.
Masked adversarial training is a clever idea: by hiding everything outside the mask, it prevents the discriminator from exploiting the texture shift at the inpainting boundary.
The combination of scene customization, iterative updates, and noise scheduling forms a complete scheme for 3D consistency enhancement.

Limitations & Future Work

The method relies on an internally pre-trained diffusion model, so public reproduction may require substituting an open-source model.
Adversarial training can be unstable.
Geometric reconstruction of large occluded regions remains challenging.

The proposed masked adversarial training design is conceptually similar to AmbientGAN but serves a different goal. The strategies for "taming" diffusion models in 3D tasks offer useful guidance for future work.

Rating

Novelty: ⭐⭐⭐⭐⭐
Usability: ⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐⭐