DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors

Conference: ECCV 2024
arXiv: 2407.16260
Code: Based on threestudio
Area: 3D Vision
Keywords: Text-to-3D, NeRF Disentanglement, Diffusion Models, Score Distillation Sampling, 3D Editing

TL;DR

DreamDissector disentangles a text-to-3D NeRF containing interacting objects into independent textured meshes, using a Neural Category Field and Deep Concept Mining to enable object-level 3D editing.

Background & Motivation

Background: Text-to-3D generation has progressed significantly through Score Distillation Sampling (SDS), which enables generating 3D NeRF scenes from text descriptions. However, the multi-object scenes produced by existing methods are holistic and cannot be separated into individual objects.

Limitations of Prior Work:
  • Existing methods either generate inseparable holistic scenes or produce independent objects that lack spatial interaction.
  • CompoNeRF/Comp3D require 3D bounding box inputs and can only handle simple spatial arrangements (e.g., placing a cabinet next to a table), failing on complex interactions (e.g., an astronaut riding a kangaroo).
  • Objects in the 3D scene cannot be manipulated independently; there is no counterpart to the "layers" of 2D image editing.

Key Challenge: Multi-object text-to-3D requires interactive relationships between objects, whereas editing demands independent representations of objects; balancing both is highly challenging.

Goal: To automatically disentangle a generated NeRF with multi-object interactions into independent object meshes while preserving their interactive relationships and appearances.

Key Insight: Instead of directly generating independent objects, the scene is generated as a whole and then "disassembled"—the density field is decomposed by learning the category probability distribution of every point in space.

Core Idea: Decomposing the NeRF density field using a probability distribution for disentanglement, and resolving the concept gap challenge via a personalized diffusion model.

Method

Overall Architecture

Input multi-object interactive NeRF \(\to\) Render multi-view images for Deep Concept Mining to personalize the diffusion model \(\to\) Train a Neural Category Field with CSDS to disentangle the NeRF into sub-NeRFs \(\to\) Convert to DMTet for geometry and texture refinement \(\to\) Export independent textured meshes. Two stages: Disentanglement + Refinement.

Key Designs

1. Neural Category Field (NeCF)

  • Function: Learns a category probability distribution for each point in 3D space to decompose the density field of the original NeRF into multiple sub-NeRFs.
  • Mechanism:
    • Decomposes density into a probability-weighted form: \(\sigma = \sum_{k=1}^{K} \frac{\sigma_k}{\sigma} \sigma\), where the ratios \(\frac{\sigma_k}{\sigma}\) are non-negative and sum to 1, i.e. they lie on the probability simplex.
    • Models category probability using an MLP + softmax: \(\mathbf{p}_i^k = \frac{\exp(f_k/T)}{\sum_{j=1}^{K} \exp(f_j/T)}\), where the temperature \(T=0.05\) makes the output approximate a one-hot vector.
    • Rendering of the \(k\)-th category object: \(C(\mathbf{r})^k = \sum_i \alpha_i^k (1-\exp(-\mathbf{p}_i^k \sigma_i \delta_i)) \mathbf{c}_i\).
    • The original density and color networks are frozen and not trained; only the category field network is learned.
  • Design Motivation:
    • Training only a lightweight category field network is more efficient than training auxiliary density and color fields.
    • Freezing the original network ensures that the recombined sub-NeRFs are exactly mathematically identical to the original NeRF, preventing any appearance loss.
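
The NeCF mechanism above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the category logits stand in for the MLP outputs \(f_k\), and the weight \(\alpha_i^k\) in the rendering formula is read here as the accumulated transmittance of the \(k\)-th sub-NeRF, which the summary does not define explicitly.

```python
import math

def category_probs(logits, temperature=0.05):
    """Temperature-scaled softmax over the K category logits of one sample point."""
    scaled = [f / temperature for f in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def split_density(sigma, logits, temperature=0.05):
    """Per-category densities sigma_k = p_k * sigma; they sum back to sigma."""
    return [p * sigma for p in category_probs(logits, temperature)]

def render_category(sigmas, deltas, colors, logits_per_sample, k, temperature=0.05):
    """Composite category k along one ray, reusing the frozen densities and colors.

    Assumption: transmittance is accumulated from the k-th sub-NeRF's own opacity.
    """
    transmittance = 1.0
    out = [0.0, 0.0, 0.0]
    for sigma, delta, color, logits in zip(sigmas, deltas, colors, logits_per_sample):
        p_k = category_probs(logits, temperature)[k]
        alpha = 1.0 - math.exp(-p_k * sigma * delta)  # opacity of category k
        weight = transmittance * alpha
        out = [o + weight * c for o, c in zip(out, color)]
        transmittance *= 1.0 - alpha
    return out
```

Because the probabilities sum to 1 at every point, `sum(split_density(sigma, logits))` recovers `sigma` up to floating-point error, which is the property that lets the recombined sub-NeRFs reproduce the original density field.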

2. Category Score Distillation Sampling (CSDS) + Deep Concept Mining (DCM)

  • Function: Trains the NeCF using multiple category-specific SDS losses, resolving the "concept gap" issue in diffusion models via DCM.
  • Mechanism:
    • Naive approach: For each category \(k\), apply SDS with the category text \(y_k\): \(\nabla_\theta L_{SDS}(\phi,\theta)_k = \mathbb{E}_{t,\epsilon}[w(t)(\epsilon_\phi(x_t; y_k, t) - \epsilon) \frac{\partial x}{\partial \theta}]\).
    • Concept gap issue: The prompt "a chimpanzee looking through a telescope" generates a hand-held telescope, while "a telescope" generates a tripod-mounted telescope—the two occupy different regions in the latent space of the diffusion model.
    • DCM solution: Personalizes and fine-tunes the diffusion model and text embeddings using the masked regions of the rendered multi-view images.
    • Masked diffusion loss: \(L_{mine}(\phi, y_k) = \mathbb{E}_{t,\epsilon}[||\epsilon_\phi(x_t; y_k, t) \odot M_k - \epsilon \odot M_k||_2^2]\).
    • Two-stage training: Stage 1 fine-tunes only the text embeddings (400 steps, lr=\(5\times10^{-4}\)); Stage 2 fine-tunes the model backbone jointly with the text embeddings (100 steps, lr=\(2\times10^{-6}\)).
    • Masks are obtained via Grounded-SAM.
  • Design Motivation: The concept gap causes misalignment of object regions during disentanglement. DCM resolves this by personalizing the diffusion model to understand the actual appearance of specific objects within the scene.
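
The masked diffusion loss above restricts the usual noise-prediction error to the pixels of one object. The sketch below shows that computation for a single noise sample on flattened arrays; it is an illustrative assumption rather than the paper's code. The function name is hypothetical, and the expectation over \(t\) and \(\epsilon\) would in practice be approximated by averaging this quantity over sampled timesteps and noise draws.

```python
def masked_diffusion_loss(eps_pred, eps_true, mask):
    """Squared L2 norm of the masked noise residual for one sample.

    eps_pred: predicted noise, flattened
    eps_true: noise that was actually added, flattened
    mask:     1 inside the object region M_k, 0 elsewhere, flattened
    """
    if not (len(eps_pred) == len(eps_true) == len(mask)):
        raise ValueError("all inputs must have the same length")
    return sum((m * (p - t)) ** 2 for p, t, m in zip(eps_pred, eps_true, mask))

# Errors outside the mask do not contribute to the loss.
inside_only = masked_diffusion_loss([1.0, 5.0], [0.0, 0.0], [1, 0])
everywhere = masked_diffusion_loss([1.0, 5.0], [0.0, 0.0], [1, 1])
```

Zeroing the residual outside \(M_k\) means the personalized concept is fitted only to the object's own pixels, which is consistent with the ablation observation that removing the mask lets features from other objects leak in.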

3. Refinement Stage

  • Function: Converts the disentangled sub-NeRFs into DMTet, repairing artifacts and enhancing geometry and texture quality.
  • Mechanism:
    • Uses isosurface extraction to convert sub-NeRFs to DMTet.
    • Guides DMTet refinement using the DCM-fine-tuned diffusion model (5000 steps).
    • Further fine-tunes colors using the original Stable Diffusion (1000 steps) to avoid oversaturation caused by DCM overfitting.
    • Uses "unrealistic, low quality, shadow" as negative prompts.
    • Introduces an interpenetration loss to prevent mesh intersection during object replacement: \(\mathcal{L}_{interpenetration} = \sum_i \max(\epsilon - (\mathbf{v}_i - \mathbf{v}_i') \cdot \mathbf{n}_i', 0)\).
  • Design Motivation: "Black hole" artifacts appear in previously occluded contact areas after disentanglement, requiring refinement for proper occlusion handling and geometry completion.
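
The interpenetration loss above is a hinge on a signed distance along a surface normal. The sketch below is one possible reading, under assumptions that the summary does not spell out: \(\mathbf{v}_i\) is taken to be a vertex of the object being edited, \(\mathbf{v}_i'\) its matched point on the other mesh, and \(\mathbf{n}_i'\) the outward unit normal at that point. The function name and the default margin value are hypothetical.

```python
def interpenetration_loss(vertices, anchors, anchor_normals, margin=0.01):
    """Sum of hinge penalties max(margin - (v - v') . n', 0).

    vertices:       points on the edited mesh (v_i)
    anchors:        matched points on the other mesh (v_i'), assumed precomputed
    anchor_normals: outward unit normals at the anchors (n_i')
    margin:         the epsilon in the formula; the value here is illustrative
    """
    total = 0.0
    for v, a, n in zip(vertices, anchors, anchor_normals):
        # Signed distance of v from the other surface along its outward normal.
        signed = sum((vi - ai) * ni for vi, ai, ni in zip(v, a, n))
        total += max(margin - signed, 0.0)
    return total

up = [0.0, 0.0, 1.0]
outside = interpenetration_loss([[0.0, 0.0, 0.5]], [[0.0, 0.0, 0.0]], [up])
inside = interpenetration_loss([[0.0, 0.0, -0.1]], [[0.0, 0.0, 0.0]], [up])
```

A vertex that sits at least `margin` outside the other surface contributes nothing, while a vertex behind the surface is penalized in proportion to how deep it has sunk.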

Loss & Training

  • NeCF Training: CSDS loss (guided by the DCM-personalized diffusion model), 1000 steps, batch=1, taking around 3 minutes.
  • DCM Training: Two-stage masked diffusion loss, taking around 6 minutes (A100).
  • DMTet Refinement: SDS loss + interpenetration loss, 5000 steps + 1000 steps of color fine-tuning.

Key Experimental Results

Main Results (Quantitative Evaluation via CLIP Score)

| Method | CLIP-B-16 | CLIP-B-32 | CLIP-L-14 |
| --- | --- | --- | --- |
| Negative Prompting | 0.299 | 0.296 | 0.247 |
| Composition | 0.281 | 0.278 | 0.234 |
| DreamDissector (Ours) | 0.316 | 0.311 | 0.270 |
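
For readers unfamiliar with the metric: a CLIP score is commonly computed as the cosine similarity between the CLIP embedding of a rendered image and the CLIP embedding of its text prompt. The summary does not describe the exact evaluation protocol, so the sketch below only illustrates that general definition; the function name is hypothetical and the embeddings are assumed to come from a CLIP model run elsewhere.

```python
import math

def cosine_similarity(image_embedding, text_embedding):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(image_embedding, text_embedding))
    norm_a = math.sqrt(sum(a * a for a in image_embedding))
    norm_b = math.sqrt(sum(b * b for b in text_embedding))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

Under this definition, higher values mean the render of a disentangled object matches its category prompt more closely.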

Ablation Study (DCM Component Analysis)

| Configuration | Effect | Explanation |
| --- | --- | --- |
| Full DCM | Successfully extracts independent concepts | Generated "baby bunny" contains no pancake elements |
| w/o Masked attention loss | Concept separation fails | Generated images still contain features from other objects |
| w/o Stage 1 training | Concept separation fails | Text embeddings are under-optimized |
| w/o Stage 2 training | Concept separation fails | Backbone is not fine-tuned, leading to insufficient concept understanding |

Key Findings

  • Vanilla CSDS fails completely when the concept gap is large (e.g., a frog on a water lily \(\to\) incorrect segmentation).
  • SA3D fails in heavily occluded scenarios (e.g., an octopus playing the piano), whereas DCM handles them successfully.
  • When applied to refinement, DCM successfully repairs "black hole" artifacts on contact surfaces, whereas original SD fails (generating unrelated content).
  • The overall pipeline is highly efficient: DCM takes ~6 minutes, NeCF takes ~3 minutes, with refinement being the main time consumer.

Highlights & Insights

  • Novel Problem Definition: For the first time, the text-to-3D NeRF disentanglement problem is systematically formulated, filling the gap between multi-object 3D generation and editing.
  • Elegant NeCF Design: By decomposing the density field via probability, disentanglement is achieved by training only a lightweight network, guaranteeing that the recombined scene exactly matches the original.
  • Discovery & Resolution of the Concept Gap: Provides an in-depth analysis of the latent space inconsistency between holistic and partial prompts in diffusion models; the masked fine-tuning strategy of DCM is simple yet effective.
  • Rich Application Scenarios: Highly practical, supporting object-level texture editing, object replacement, and geometric editing.

Limitations & Future Work

  • DCM relies on Grounded-SAM to provide initial masks, showing dependency on segmentation quality.
  • Heavy topological changes during object replacement remain challenging (as SDS struggles to drastically alter DMTet topology).
  • Currently, NeRF is used as the input, and integration with newer representations like 3D Gaussian Splatting has not yet been explored.
  • The disentanglement granularity is limited to semantic category levels; finer parts-level disentanglement remains a path for future work.

Comparison with Related Work

  • vs CompoNeRF/Comp3D: These methods require 3D bounding boxes and can only generate simple spatial arrangements, whereas DreamDissector starts from a complete, interactive scene and disentangles it, handling complex interactions like riding or hugging.
  • vs SA3D: SA3D performs 3D segmentation based on Segment Anything but fails under heavy occlusion. DCM provides stronger semantic guidance via a personalized diffusion model.
  • vs Break-a-Scene: The design of DCM is inspired by the 2D concept extraction paradigm, which this work extends to concept mining in 3D scenes.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First to propose the NeRF disentanglement problem, with innovations in both NeCF and DCM.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Solid qualitative, quantitative, ablation, and application demonstrations, though lacking large-scale quantitative evaluations.
  • Writing Quality: ⭐⭐⭐⭐ Clear problem motivation, with deep and intuitive analysis of the concept gap.
  • Value: ⭐⭐⭐⭐⭐ Bridges the gap between text-to-3D generation and object-level editing, possessing high application potential.