Ctrl-D: Controllable Dynamic 3D Scene Editing with Personalized 2D Diffusion
Conference: CVPR 2025
arXiv: 2412.01792
Code: Project Page
Area: 3D Vision / Dynamic Scene Editing
Keywords: Dynamic 3D Editing, Deformable Gaussian, InstructPix2Pix, Personalized Diffusion, Scene Editing
TL;DR
Ctrl-D fine-tunes InstructPix2Pix (IP2P) on a single edited reference image so that the model "learns" that specific edit. Combined with a two-stage optimization of deformable 3D Gaussians, this achieves controllable and consistent dynamic 3D scene editing.
Background & Motivation
- Dynamic 3D scene editing is a key requirement for VR/AR, data augmentation, and content creation, but existing methods suffer from inconsistent editing and poor controllability.
- Prior works such as Instruct 4D-to-4D (IN4D) rely on pre-trained diffusion models (e.g., vanilla InstructPix2Pix, IP2P) and are limited by the capabilities of their editing backbones, which prevents precise local editing.
- Tracking edited regions is far more difficult in dynamic scenes than in static ones, and traditional methods relying on noise discrepancies to determine editing regions are unstable.
- Key Insight: Reduce the complex dynamic scene editing task to a 2D image editing problem: the user edits a single image, and the edit is then propagated to the entire dynamic scene.
Method
Overall Architecture
A three-stage pipeline: (1) edit a single reference image using any 2D editing tool; (2) fine-tune the IP2P model on the pre- and post-edit image pair to obtain a personalized editor; (3) run a two-stage optimization of the deformable 3D Gaussian scene.
Key Designs
- Personalized InstructPix2Pix Fine-tuning:
  - Function: Enables IP2P to "learn" a specific edit from a single edited reference image.
  - Mechanism: Uses GPT-4V to generate the text instruction \(C_T^{\star}\) and introduces a special token `<V>` to enhance specificity; incorporates a Prior Preservation Loss (inspired by DreamBooth) to maintain the model's generalization capability.
  - Design Motivation: The editing capability of vanilla IP2P is limited by its training data distribution. Personalized fine-tuning lets the model learn the editing region and style directly from the reference image, without explicit tracking of editing regions.
  - Fine-tuning Loss: \(\mathcal{L}_{\text{finetune}} = \mathbb{E}[\|\epsilon - \epsilon_\theta(z_t, t, I, C_T)\|_2^2] + \lambda \mathbb{E}[\|\epsilon - \epsilon_\theta(z_t^{\star}, t, I_d, C_T^{\star})\|_2^2]\)
  - Data Augmentation: Augments the source and edited images via affine transformations (rotation, translation, shearing) to prevent overfitting during single-image fine-tuning.
- Two-stage Dynamic Gaussian Optimization:
  - Function: Progressively edits the pre-trained dynamic 3D Gaussian scene.
  - Mechanism: Stage 1 optimizes only the canonical space and performs Gaussian densification, with the deformation field frozen; Stage 2 jointly optimizes the deformation field and the 3D Gaussians, using an edited image buffer to accelerate convergence.
  - Design Motivation: Stage 1 first establishes a rough geometry of the edited region; the global optimization in Stage 2 then enforces temporal consistency.
- Edited Image Buffer:
  - Function: Accelerates the editing process and enhances temporal consistency.
  - Mechanism: At each iteration, unedited frames are randomly selected, edited via the personalized IP2P, and added to the buffer; only the images in the buffer are used to train the 3D Gaussians and the deformation field (warm-up phase).
  - Design Motivation: Avoids starting from the original frames every time; reusing existing edited results accelerates convergence.
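As a rough illustration of the fine-tuning loss listed under Personalized InstructPix2Pix Fine-tuning, the sketch below evaluates the two squared-error terms on toy vectors and combines them with the weight λ. It is a minimal sketch in plain Python, not the authors' implementation: `Sample`, `squared_l2`, `finetune_loss`, and `toy_predictor` are hypothetical names, the images and instructions are placeholder strings, and the value of λ is an assumption because this summary does not state it. The summary also does not spell out which of the two terms plays the prior-preservation role, so the code simply mirrors the formula term by term.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One training sample for a single term of the loss (all fields are toy stand-ins)."""
    eps: List[float]   # noise that was added to the latent
    z_t: List[float]   # noisy latent at timestep t
    image: str         # conditioning image, here just a placeholder id
    instruction: str   # text instruction, here just a placeholder string


# Hypothetical signature for the noise predictor eps_theta(z_t, t, image, instruction).
NoisePredictor = Callable[[List[float], int, str, str], List[float]]


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared L2 distance between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def finetune_loss(predict_noise: NoisePredictor, first: Sample, second: Sample,
                  t: int, lam: float = 1.0) -> float:
    """Single-sample estimate of: term(first) + lam * term(second)."""
    term_first = squared_l2(
        first.eps, predict_noise(first.z_t, t, first.image, first.instruction))
    term_second = squared_l2(
        second.eps, predict_noise(second.z_t, t, second.image, second.instruction))
    return term_first + lam * term_second


def toy_predictor(z_t: List[float], t: int, image: str, instruction: str) -> List[float]:
    """Toy stand-in for the diffusion model: ignores the conditioning entirely."""
    return [0.5 * v for v in z_t]


first = Sample(eps=[1.0, 0.0], z_t=[2.0, 2.0], image="I", instruction="C_T")
second = Sample(eps=[0.0, 0.0], z_t=[2.0, 0.0], image="I_d", instruction="C_T_star")
loss = finetune_loss(toy_predictor, first, second, t=10, lam=0.5)
print(loss)
```

In a real setup the expectation would be approximated by averaging this quantity over sampled timesteps, noise draws, and augmented image pairs.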
Loss & Training
- Total scene optimization loss: \(\mathcal{L} = (1-\lambda_d)\mathcal{L}_1 + \lambda_d \mathcal{L}_{\text{D-SSIM}} + \lambda_t \mathcal{L}_{\text{temp}}\)
- Parameter settings: \(\lambda_d = 0.2\), \(\lambda_t = 0.001\)
- Monocular scenes are modeled using [Yang et al.], with Stage 1 set as the first 300 iterations; multi-camera scenes utilize [Wu et al.], with Stage 1 set as the first 100 iterations.
- An image is edited every 50 iterations.
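The schedule and loss weights above can be sketched as a small simulation. This is a hedged illustration rather than the authors' training loop: `trainable_groups`, `total_loss`, and `simulate` are hypothetical names, edited frames are placeholder strings instead of images produced by the personalized IP2P, iterations are assumed to be zero-indexed, and the sketch assumes the buffer is filled from iteration 0 onward with every training frame drawn from it.

```python
import random
from typing import Dict, List, Set, Tuple

STAGE1_ITERS = {"monocular": 300, "multi_camera": 100}  # Stage 1 lengths from the text
EDIT_EVERY = 50                                         # one image edited every 50 iterations
LAMBDA_D, LAMBDA_T = 0.2, 0.001                         # loss weights from the text


def trainable_groups(iteration: int, scene_type: str) -> Set[str]:
    """Stage 1 trains only the canonical Gaussians; Stage 2 also trains the deformation field."""
    if iteration < STAGE1_ITERS[scene_type]:
        return {"gaussians"}
    return {"gaussians", "deformation"}


def total_loss(l1: float, d_ssim: float, temporal: float) -> float:
    """(1 - lambda_d) * L1 + lambda_d * L_D-SSIM + lambda_t * L_temp."""
    return (1 - LAMBDA_D) * l1 + LAMBDA_D * d_ssim + LAMBDA_T * temporal


def simulate(num_frames: int, num_iters: int, scene_type: str,
             seed: int = 0) -> Tuple[Dict[int, str], List[Tuple[int, int, Set[str]]]]:
    """Return the final buffer and a log of (iteration, training frame, trainable groups)."""
    rng = random.Random(seed)
    unedited = list(range(num_frames))
    buffer: Dict[int, str] = {}
    log: List[Tuple[int, int, Set[str]]] = []
    for it in range(num_iters):
        if it % EDIT_EVERY == 0 and unedited:
            # Pick a random unedited frame and "edit" it (placeholder for the IP2P call).
            frame = unedited.pop(rng.randrange(len(unedited)))
            buffer[frame] = f"edited_frame_{frame}"
        # Supervision comes only from frames that are already in the buffer.
        train_frame = rng.choice(sorted(buffer))
        log.append((it, train_frame, trainable_groups(it, scene_type)))
    return buffer, log


buffer, log = simulate(num_frames=10, num_iters=400, scene_type="monocular")
print(len(buffer), log[299][2], log[300][2])
```

Editing at iteration 0 guarantees the buffer is never empty when a training frame is drawn; a real implementation would replace the placeholder string with the edited image and compute `total_loss` from rendered and edited frames.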
Key Experimental Results
Main Results
| Scene | Method | CLIP Score↑ | Consistency↑ | Time↓ |
|---|---|---|---|---|
| Portrait | Ctrl-D | 27.75 | 0.953 | 60 min |
| Portrait | IN4D | 27.38 | 0.933 | 2 hours |
| Cat | Ctrl-D | 31.81 | 0.968 | 60 min |
| Cat | IN4D | 31.72 | 0.964 | 2 hours |
| Steak | Ctrl-D | 28.52 | 0.988 | 40 min |
| Steak | IN4D | 28.23 | 0.983 | 2 hours |
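To make the margins in the table easier to read, the following sketch computes per-scene differences between Ctrl-D and IN4D. The numbers are copied from the table, with "2 hours" written as 120 minutes; `RESULTS` and `compare` are names introduced here for illustration only.

```python
from typing import Dict, Tuple

# scene -> method -> (CLIP score, consistency, time in minutes)
RESULTS: Dict[str, Dict[str, Tuple[float, float, float]]] = {
    "Portrait": {"Ctrl-D": (27.75, 0.953, 60), "IN4D": (27.38, 0.933, 120)},
    "Cat": {"Ctrl-D": (31.81, 0.968, 60), "IN4D": (31.72, 0.964, 120)},
    "Steak": {"Ctrl-D": (28.52, 0.988, 40), "IN4D": (28.23, 0.983, 120)},
}


def compare(results: Dict[str, Dict[str, Tuple[float, float, float]]]
            ) -> Dict[str, Tuple[float, float, float]]:
    """Per scene: (CLIP gain, consistency gain, speed-up factor) of Ctrl-D over IN4D."""
    summary = {}
    for scene, by_method in results.items():
        ours, baseline = by_method["Ctrl-D"], by_method["IN4D"]
        summary[scene] = (
            round(ours[0] - baseline[0], 2),   # CLIP score difference
            round(ours[1] - baseline[1], 3),   # consistency difference
            baseline[2] / ours[2],             # how many times faster Ctrl-D is
        )
    return summary


for scene, (clip_gain, cons_gain, speedup) in compare(RESULTS).items():
    print(f"{scene}: CLIP +{clip_gain:.2f}, consistency +{cons_gain:.3f}, {speedup:.1f}x faster")
```

The quality gains are small in absolute terms, while the runtime difference is a factor of 2 to 3 according to these figures.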
Ablation Study
| Configuration | Effect | Explanation |
|---|---|---|
| w/o Data Augmentation | Blurry, temporally inconsistent | Overfitting during IP2P fine-tuning causes unstable editing |
| w/ Data Augmentation | High quality, good consistency | Affine transformations effectively prevent overfitting |
| w/o Edited Buffer | Still close to vanilla scene after 1000 steps | Inefficient training due to random frame selection |
| w/ Edited Buffer | Successfully edited after 1000 steps | Focuses on edited frames to accelerate convergence |
Key Findings
- The reported total time (fine-tuning + optimization) is 40 to 60 minutes per scene, versus about 2 hours for IN4D, i.e. at most half.
- Editing capabilities can generalize across domains: IP2P fine-tuned on cat images can be applied to portraits and full-body scenes.
- Supports multiple 2D editing modalities, including text-driven, image-driven, and style transfer.
Highlights & Insights
- Simplifying complex 4D editing into a 2D editing task drastically lowers the barrier to dynamic scene editing.
- Personalized IP2P can directly learn the editing region from the reference image, bypassing the difficult target tracking in dynamic scenes.
- Cross-domain generalization of editing capabilities suggests that personalized fine-tuning learns general "editing skills" rather than overfitting to a single scene.
Limitations & Future Work
- When the reconstruction quality of the dynamic 3D Gaussians is poor (e.g., motion-blurred hands), the edited results will also be blurry.
- Adding complex content to empty regions (e.g., adding a bag to a dog) still suffers from multi-view consistency issues.
- Future work could employ stronger reconstruction backbones and more powerful foundation diffusion models.
Related Work & Insights
- InstructPix2Pix → Fundamental editing capabilities; DreamBooth → Prior Preservation concept
- Deformable 3DGS → Dynamic scene representation
- Instruct-NeRF2NeRF → Inspiration source for the Iterative Dataset Update strategy
Rating
- Novelty: ⭐⭐⭐⭐ The paradigm of simplifying dynamic editing into 2D editing is novel, and the personalized fine-tuning strategy is practical.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive qualitative and quantitative comparisons, thorough ablation studies, and convincing cross-domain generalization experiments.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with detailed pipeline descriptions.
- Value: ⭐⭐⭐⭐ Highly practical, lowering the threshold for dynamic scene editing.