DATENeRF: Depth-Aware Text-based Editing of NeRFs¶

Conference: ECCV 2024
arXiv: 2404.04526
Code: https://datenerf.github.io/DATENeRF/ (Project Page)
Area: 3D Vision
Keywords: NeRF editing, diffusion models, depth guidance, text-driven 3D editing, multi-view consistency

TL;DR¶

Leverages scene depth reconstructed by NeRF to guide text-based 2D image editing (via depth-conditioned ControlNet + projection inpainting scheme), achieving multi-view consistent, high-quality NeRF scene editing.

Background & Motivation¶

Background: NeRF reconstructs and renders 3D scenes with high quality, but its implicit representation does not explicitly decouple geometry from appearance, which makes editing difficult. Meanwhile, 2D diffusion models (such as Stable Diffusion) have demonstrated powerful capabilities in text-guided image editing.

Limitations of Prior Work: Applying 2D diffusion models to NeRF editing runs into the multi-view consistency problem: editing each 2D image independently produces inconsistent results. The existing state-of-the-art method Instruct-NeRF2NeRF (IN2N) adopts an "iterative dataset update" strategy, but because the individual edits are random and mutually inconsistent, the final results suffer from geometric errors, blurry textures, and poor text alignment.

Key Challenge: The independence of 2D editing vs. the multi-view consistency requirements of 3D scenes. Relying on NeRF optimization to bridge the inconsistency is an indirect mechanism, which performs particularly poorly on high-frequency texture details.

Key Insight: The scene geometry (depth information) reconstructed by NeRF inherently provides a natural bridge to unify 2D editing. Rough alignment is ensured via depth-conditioned ControlNet, and then edited content is directly propagated through a depth-based pixel reprojection scheme.

Core Idea: Simultaneously constrain the geometric consistency (ControlNet depth condition) and appearance consistency (projection inpainting) of edits using NeRF depth information, realizing an efficient pipeline of "consistent editing first, NeRF optimization later."

Method¶

Overall Architecture¶

Inputs: Reconstructed NeRF scene (with posed images) + editing masks per view + text prompts. The pipeline consists of three steps: (1) 3D-consistent region segmentation to generate masks; (2) masked editing via depth-conditioned ControlNet + a projection inpainting scheme to generate multi-view consistent edited images; (3) optimization of NeRF using the edited images.

Key Designs¶

3D-Consistent Region Segmentation: The initial 2D segmentation masks are back-projected into a 3D point cloud using NeRF depth, aggregated and voted on in 3D space, and then re-projected back to 2D views to generate occlusion-aware, view-consistent, and accurate masks. Guided filtering is further applied to smooth the mask boundaries.

Design Motivation: Mask generation utilizing segmentation models independently on different views leads to inconsistencies, resulting in misaligned edited regions.
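The back-project, vote, and re-project steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera with shared intrinsics `K`, 4x4 camera-to-world poses, z-depth maps, and a simple voxel-grid majority vote (the source only says masks are "aggregated and voted on in 3D space"). The names `backproject` and `vote_masks` and the parameters `voxel` and `thresh` are hypothetical, and the guided-filter smoothing is omitted.

```python
import numpy as np

def backproject(depth, K, c2w):
    """Lift every pixel of one view to a world-space point using z-depth."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # v = row index, u = column index
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    cam = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)
    return cam @ c2w[:3, :3].T + c2w[:3, 3]  # shape (h*w, 3)

def vote_masks(depths, masks, K, c2ws, voxel=0.05, thresh=0.5):
    """Majority-vote per-view masks in a shared voxel grid (assumed scheme)."""
    votes, counts, keys_per_view = {}, {}, []
    for depth, mask, c2w in zip(depths, masks, c2ws):
        pts = backproject(depth, K, c2w)
        keys = [tuple(k) for k in np.floor(pts / voxel).astype(np.int64)]
        keys_per_view.append(keys)
        for key, m in zip(keys, mask.reshape(-1)):
            votes[key] = votes.get(key, 0) + int(m)
            counts[key] = counts.get(key, 0) + 1
    refined = []
    for keys, mask in zip(keys_per_view, masks):
        # Each pixel looks up the voxel of its own visible surface point,
        # so the refined mask is occlusion-aware by construction.
        frac = np.array([votes[k] / counts[k] for k in keys])
        refined.append((frac > thresh).reshape(mask.shape))
    return refined
```

Because each pixel queries the voxel that its own rendered depth lands in, a surface hidden in one view never contributes to that view's mask, which is one simple way to obtain the occlusion awareness mentioned above.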

Depth-Conditioned ControlNet Editing: Converts depth maps rendered from NeRF into disparity maps, which serve as condition signals for ControlNet. Combined with Blended Diffusion, text-guided inpainting is performed within the masked region:

\(I_k^e = \text{Blended-Diffusion}(\text{ControlNet}(I_k, D_k), M_k)\)

Design Motivation: Unlike IN2N which uses the original image as the conditioning signal, the depth condition allows the model to generate content with significantly different appearances from the input image (e.g., turning a bear into a zebra) while maintaining geometric alignment.
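As a small illustration of the depth-to-disparity conversion mentioned above, the sketch below inverts a rendered depth map and rescales it to [0, 1]. The per-image min-max normalization and the function name `depth_to_disparity` are assumptions made for this example; the source only states that NeRF depth is converted to disparity before being passed to ControlNet.

```python
import numpy as np

def depth_to_disparity(depth, eps=1e-6):
    """Convert a depth map to disparity (1 / depth), rescaled to [0, 1]."""
    disparity = 1.0 / np.maximum(depth, eps)  # guard against zero depth
    lo, hi = disparity.min(), disparity.max()
    # Assumed normalization: per-image min-max, so near surfaces map to 1.
    return (disparity - lo) / max(hi - lo, eps)
```

Near surfaces end up bright and far surfaces dark, which is the convention depth-conditioned models are typically trained on.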

Projection Inpainting: Given an edited reference view \(I_{\text{ref}}^e\), pixels from the edited reference are re-projected to other views utilizing NeRF depth:

\(I_k^p = R_{\text{ref} \to k}(I_{\text{ref}}^e), \quad M_k^{\text{vis}} = R_{\text{ref} \to k}(M_{\text{ref}})\)
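A depth-based reprojection operator such as \(R_{\text{ref} \to k}\) could look roughly like the sketch below. It is an illustration under assumptions rather than the paper's code: it assumes a shared pinhole intrinsic matrix `K`, 4x4 camera-to-world poses, z-depth maps of equal resolution, nearest-pixel splatting, and a relative depth tolerance `tol` for the visibility test. All function and parameter names are hypothetical.

```python
import numpy as np

def reproject(img_ref, mask_ref, depth_ref, K, c2w_ref, c2w_k, depth_k, tol=0.05):
    """Forward-warp masked reference pixels into view k.

    Returns (I_k^p, M_k^vis): the warped image and the mask of pixels it covers.
    """
    h, w = depth_ref.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(np.float64)
    # Reference pixels -> reference camera space -> world space.
    cam_ref = (np.linalg.inv(K) @ pix.T).T * depth_ref.reshape(-1, 1)
    world = cam_ref @ c2w_ref[:3, :3].T + c2w_ref[:3, 3]
    # World space -> camera k -> pixel coordinates in view k.
    w2c_k = np.linalg.inv(c2w_k)
    cam_k = world @ w2c_k[:3, :3].T + w2c_k[:3, 3]
    z = cam_k[:, 2]
    proj = (K @ cam_k.T).T
    ok = (z > 1e-6) & mask_ref.reshape(-1).astype(bool)
    zs = np.where(z > 1e-6, z, 1.0)  # avoid dividing by non-positive depth
    uk = np.rint(proj[:, 0] / zs).astype(int)
    vk = np.rint(proj[:, 1] / zs).astype(int)
    ok &= (uk >= 0) & (uk < w) & (vk >= 0) & (vk < h)
    idx = np.flatnonzero(ok)
    uk, vk, zk = uk[idx], vk[idx], z[idx]
    # Visibility: the warped point must agree with the depth rendered in view k.
    visible = np.abs(zk - depth_k[vk, uk]) < tol * depth_k[vk, uk]
    idx, uk, vk = idx[visible], uk[visible], vk[visible]
    img_p = np.zeros_like(img_ref)
    mask_vis = np.zeros((h, w), dtype=bool)
    # Nearest-pixel splat; if two points collide, the last write wins.
    img_p[vk, uk] = img_ref.reshape(-1, img_ref.shape[-1])[idx]
    mask_vis[vk, uk] = True
    return img_p, mask_vis
```

The depth comparison is what makes the warp occlusion-aware: reference pixels that land behind a closer surface in view k are dropped from \(M_k^{\text{vis}}\) and left for the diffusion model to fill.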

However, directly using the re-projected pixels degrades quality because of geometric errors and stretching during resampling. To address this, a hybrid inpainting scheme is proposed: the re-projected pixels are preserved during the first \(N=5\) diffusion denoising steps (constraining overall appearance), and the remaining steps inpaint the full masked region (allowing the diffusion model to repair occluded regions and reconstruction artifacts).

\(I_k^e = \text{Blended-Diffusion}(\text{ControlNet}(I_k^p, D_k), M_k^p)\)

Where \(M_k^p = M_k \cdot (1 - M_k^{\text{vis}})\) is the occluded region to be inpainted.

Design Motivation: \(N=0\) (no projection) yields poor appearance consistency; \(N=20\) (full projection) suffers from severe cumulative error. The hybrid scheme with \(N=5\) balances both consistency and visual quality.
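The mask switching behind the hybrid scheme can be sketched as below. This is only an illustration of the schedule: the diffusion model itself is left out, the helper names are hypothetical, and the compositing step omits the noising of the known image to the current timestep that a real Blended Diffusion loop would apply.

```python
import numpy as np

def inpaint_mask_for_step(step, mask_k, mask_vis, n_proj=5):
    """Region the diffusion model may overwrite at a given denoising step."""
    if step < n_proj:
        # Early steps: keep re-projected pixels fixed and inpaint only
        # M_k^p = M_k * (1 - M_k^vis).
        return mask_k & ~mask_vis
    # Later steps: inpaint the whole edit mask M_k.
    return mask_k

def blend(generated, known, mask):
    """Compositing step: generated content inside the mask, known content outside."""
    m = mask[..., None].astype(generated.dtype)
    return m * generated + (1.0 - m) * known
```

With `n_proj=0` the function always returns the full mask (no projection), and with `n_proj` equal to the total number of denoising steps the re-projected pixels are never revised, matching the two extremes discussed above.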

View Ordering Heuristic: When choosing the sequence of reprojections, priority is given to the next view with the highest overlap with the current view, maximizing the coverage of re-projected pixels.
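A greedy version of this ordering heuristic might look like the sketch below. The `overlap` matrix is a hypothetical input (for example, the fraction of masked pixels of view `i` that are visible in view `j`); how the paper measures overlap is not specified here, and ties are broken by lowest view index as an arbitrary choice.

```python
def order_views(overlap, start=0):
    """Greedy ordering: from the current view, move to the unvisited view
    with the highest overlap, until every view has been visited."""
    order = [start]
    remaining = set(range(len(overlap))) - {start}
    while remaining:
        cur = order[-1]
        # sorted() makes tie-breaking deterministic (lowest index wins).
        nxt = max(sorted(remaining), key=lambda j: overlap[cur][j])
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

For instance, with `overlap = [[1.0, 0.2, 0.9], [0.2, 1.0, 0.5], [0.9, 0.5, 1.0]]` and `start=0`, view 2 is visited before view 1 because it shares more content with the reference view.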

Loss & Training¶

After projection inpainting is completed, the NeRF is directly optimized using the edited images (initialized from the original NeRF). The first 1000 iterations use all edited images for training (\(L_1\) + LPIPS loss).
Afterward, the pipeline switches to the iterative dataset update strategy of IN2N, but with a narrower noise-level range (0.5-0.8 vs. 0.02-0.98 in IN2N), so that these updates only enhance detail.
Training takes a total of 4000 iterations, completing in approximately 20 minutes on an NVIDIA A100 GPU.
Image Resolution: NeRF training uses 512×512 images, which are upsampled to 1024×1024 during generation to improve ControlNet performance.
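The two-phase schedule described above can be summarized in a small helper. This is a sketch only: the function name, the returned dictionary layout, and the uniform sampling of the noise level are assumptions, while the iteration counts and the 0.5-0.8 range come from the text.

```python
import random

def training_phase(iteration, warmup_iters=1000, total_iters=4000, rng=random):
    """Describe what the optimizer does at a given NeRF training iteration."""
    if not 0 <= iteration < total_iters:
        raise ValueError("iteration outside the training schedule")
    if iteration < warmup_iters:
        # Phase 1: train on all projection-inpainted images (L1 + LPIPS loss).
        return {"phase": "all_edited_images", "noise_level": None}
    # Phase 2: IN2N-style iterative dataset update with a restricted noise
    # range; uniform sampling is an assumption of this sketch.
    return {"phase": "iterative_dataset_update", "noise_level": rng.uniform(0.5, 0.8)}
```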

Key Experimental Results¶

Main Results¶

Method	Image Editing Model	Projection Inpainting	CLIP Text-Image Direction ↑	CLIP Consistency ↑
IN2N	InstructPix2Pix	✗	0.1407	0.6349
IN2N	ControlNet	✗	0.1330	0.6799
ViCA-NeRF	InstructPix2Pix	✗	0.1683	0.6981
Ours	InstructPix2Pix	✓	0.1618	0.6910
Ours	ControlNet	✗	0.1772	0.6879
Ours (Full)	ControlNet	✓	0.1866	0.7069

Quantitative evaluation on 24 different editing scenes. The full method outperforms all compared baselines in both text alignment and view consistency.
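For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed from precomputed CLIP embeddings. It assumes the table follows the directional definitions used in IN2N (cosine similarity between embedding changes); that correspondence is an assumption, the function names are hypothetical, and computing the embeddings themselves is out of scope.

```python
import numpy as np

def _cos(a, b, eps=1e-12):
    a, b = np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def clip_direction_similarity(img_orig, img_edit, txt_orig, txt_edit):
    """Does the change between image embeddings follow the change between
    the source and target caption embeddings?"""
    return _cos(np.subtract(img_edit, img_orig), np.subtract(txt_edit, txt_orig))

def clip_direction_consistency(orig_i, orig_j, edit_i, edit_j):
    """Do two adjacent views change in the same direction in CLIP space?"""
    return _cos(np.subtract(edit_i, orig_i), np.subtract(edit_j, orig_j))
```

Both scores lie in [-1, 1], with higher values indicating better text alignment and better cross-view agreement respectively.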

Ablation Study¶

Configuration	Effect	Explanation
N=0 (No projection, ControlNet only)	Good text alignment but poor consistency	Coarse geometric alignment but large appearance discrepancies
N=5 (Hybrid projection inpainting)	Best balance	Retains visual quality while enhancing consistency
N=20 (Full projection)	Severe degradation when moving away from the reference frame	Accumulation of geometric errors and sampling issues
Ours w/o projection (IN2N strategy + ControlNet)	Better than IN2N but worse than the full method	Clearer textures but consistency is still insufficient

Key Findings¶

Convergence Speed: DATENeRF visibly converges in 87 image edits + 400 iterations, whereas IN2N requires 300 edits + 3000 iterations to achieve similar quality. Editing consistent images dramatically accelerates NeRF optimization convergence.
Generality of Projection Inpainting: Even with InstructPix2Pix as the editing model, incorporating projection inpainting improves performance.
High-Frequency Textures: The proposed method can generate clear high-frequency textures such as stripes (zebra) and grids (chessboard), where IN2N and ViCA-NeRF suffer from severe blurriness.
Extensibility: ControlNet can also be conditioned on Canny edges, and the pipeline supports 3D object insertion (via intermediate TSDF geometry).

Highlights & Insights¶

The core insight is extremely simple and powerful: NeRF geometry itself acts as a bridge to unify 2D editing, avoiding the need to design complex 3D-aware diffusion models.
The hybrid inpainting scheme is well designed: switching the inpainted mask region partway through the diffusion denoising process balances consistency and quality.
In contrast to IN2N's strategy of slowly introducing inconsistent edits, this work proposes a "one-time generation of consistent edits" paradigm, which converges with far fewer edits and iterations (87 edits + 400 iterations vs. 300 edits + 3000 iterations for IN2N).
The introduction of ControlNet not only improves consistency but also enhances the controllability of editing (supporting depth/edges/object insertion).

Limitations & Future Work¶

Cannot produce major geometric changes, since edits are constrained by the NeRF depth.
ControlNet may fail to generate content that faithfully follows the depth condition in the peripheral areas of large-scale, complex scenes.
Does not model view-dependent effects (like specular highlights).
Realistic editing of human faces carries ethical risks (deepfakes).
Exploratory directions: Combining 3D Gaussian Splatting as a replacement for NeRF; introducing stronger appearance conditions (such as texture mapping).

Comparison with ViCA-NeRF: ViCA-NeRF blends projected encodings in latent space, which yields blurrier results; this work projects directly in pixel space.
The iterative strategy of Instruct-NeRF2NeRF essentially uses NeRF as a "consistency decoder," but this is ineffective for high-frequency content.
Inspiration: The depth/geometry-guided strategy can be extended to view consistency problems in scenarios like 3D Gaussian Splatting and video editing.

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of depth-guided ControlNet + hybrid projection inpainting is novel and intuitive.
Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative results on 24 editing scenes + multiple ablations + convergence speed analysis + extension experiments.
Writing Quality: ⭐⭐⭐⭐⭐ Clear illustrations, step-by-step comparative analysis of N=0/5/20, and precisely stated motivation.
Value: ⭐⭐⭐⭐ Highly practical: an edit completes in about 20 minutes, and the approach provides an important reference paradigm for subsequent 3D editing methods.