
AnchorDS: Anchoring Dynamic Sources for Semantically Consistent Text-to-3D Generation

Conference: AAAI 2026
arXiv: 2511.11692
Code: https://github.com/viridityzhu/AnchorDS
Area: 3D Vision
Keywords: Text-to-3D, Score Distillation Sampling, Diffusion Models, 3DGS, Dynamic Source Distribution, Semantic Consistency

TL;DR

This paper identifies a critical yet overlooked issue in SDS: the source distribution is dynamically evolving rather than static. AnchorDS is proposed to anchor the source distribution by feeding the current rendered image as an image condition into a dual-conditioned diffusion model, thereby resolving semantic over-smoothing and multi-view inconsistency in SDS. The method comprehensively outperforms SDS, VSD, and SDS-Bridge on T3Bench.

Background & Motivation

Background: Optimization-based text-to-3D methods distill gradients from pretrained 2D diffusion models via SDS to train NeRF/3DGS, enabling 3D content generation without 3D data.

Limitations of Prior Work: SDS suffers from two well-known failure modes — (1) semantic over-smoothing: object-specific semantic features degrade into blurry, uniform representations (e.g., swans and lake surfaces blending together); (2) multi-view inconsistency: geometry and appearance are incoherent across viewpoints (e.g., the multi-head/Janus problem).

Key Challenge: Through mathematical analysis, the authors identify the underlying cause — the CFG gradient in SDS can be decomposed into a "target term \(m_1\)" (pushing toward the text-conditioned distribution) and a "variance term \(m_2\)" (pushing away from the source distribution). The source distribution is approximated by the unconditional prior \(p(z_t; t, \emptyset)\), which entirely ignores the dynamic evolution of the 3D model during optimization. As a result, \(\hat{z}_{t \to 0}^{\text{source}}\) encodes neither the semantics of the current 3D state nor the existing geometric structure, causing the gradient direction to be inconsistent with the actual 3D state.
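The algebra behind this decomposition can be checked numerically. Below is a toy numpy sketch (all tensors are random stand-ins, and the exact grouping into \(m_1\) and \(m_2\) is my reading of the paper's description, not its verbatim derivation) showing that the CFG-weighted SDS residual splits into a target term pushing toward the text-conditioned prediction and a variance term pushing away from the unconditional source estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.normal(size=4)                      # noise actually added
eps_text = eps + rng.normal(scale=0.1, size=4)    # text-conditioned prediction
eps_uncond = eps + rng.normal(scale=0.1, size=4)  # unconditional prior
s = 7.5                                       # CFG guidance scale

# Standard SDS residual with CFG guidance:
g_cfg = eps_uncond + s * (eps_text - eps_uncond) - eps

# One algebraically equivalent grouping:
m1 = s * (eps_text - eps)           # "target term": toward the text distribution
m2 = (s - 1) * (eps_uncond - eps)   # "variance term": away from the source estimate
assert np.allclose(g_cfg, m1 - m2)
```

Because m2 depends only on the static unconditional prior, it carries no information about the evolving 3D state, which is the gap AnchorDS targets.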

Key Insight: SDS is reinterpreted as a dynamic editing process — each step is a progressive edit conditioned on the current 3D state, and the source distribution should co-evolve with the 3D model.

Core Idea: The current rendered image is used as an image condition for the diffusion model, replacing the unconditional prior to estimate the source distribution, thereby achieving state-anchored gradient guidance.

Method

Overall Architecture

The core modification of AnchorDS lies in the SDS gradient computation: the estimate of the source distribution is switched from the unconditional noise prediction \(\hat{\epsilon}_\phi(z_t; t, \emptyset)\) to an image-conditioned noise prediction \(\hat{\epsilon}_\phi(z_t; t, \emptyset, I^{(\tau)})\), where \(I^{(\tau)}\) is the image rendered at the current optimization step \(\tau\).

The new guidance gradient is: \(g_t^{(\tau)} = \hat{\epsilon}_\phi(z_t; t, y) - \hat{\epsilon}_\phi(z_t; t, \emptyset, I^{(\tau)})\)

The target term remains unchanged (text-conditioned), while the source term now encodes the structural and semantic information of the current 3D state.

Overall pipeline: render the current 3D model at each step → encode to latent and add noise → query the diffusion model twice (text-conditioned for target prediction, image-conditioned for source prediction) → backpropagate the difference gradient to update 3D parameters.
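The pipeline above can be sketched end-to-end with toy stand-ins. Everything here (renderer, encoder, noise predictor, step size) is an illustrative placeholder, not the paper's actual code; the point is the shape of the loop, not the numbers:

```python
import numpy as np

rng = np.random.default_rng(1)

def render(theta):                  # stand-in: current 3D model -> image
    return np.tanh(theta)

def encode(img):                    # stand-in: image -> latent
    return img * 0.5

def eps_pred(z_t, t, text=None, img_cond=None):
    # Toy noise prediction; a real model would be a diffusion U-Net.
    base = z_t * 0.1
    if text is not None:
        base = base + 0.05              # text conditioning shifts the prediction
    if img_cond is not None:
        base = base - 0.02 * img_cond   # image condition anchors the source
    return base

theta = rng.normal(size=8)          # 3D parameters (e.g., Gaussians)
img = render(theta)
z0 = encode(img)
t = 500
z_t = z0 + rng.normal(size=8)       # simplified forward noising

# AnchorDS guidance: text-conditioned target minus image-conditioned source.
g = eps_pred(z_t, t, text="prompt") - eps_pred(z_t, t, img_cond=img)

# Backprop through encode(render(.)) -- chain rule written out for the toy maps.
dz_dtheta = 0.5 * (1 - np.tanh(theta) ** 2)
grad_theta = g * dz_dtheta
theta = theta - 0.01 * grad_theta   # one optimization step
```

In the real method the two `eps_pred` calls can run in parallel, which is why the runtime matches standard SDS.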

Key Designs

  1. Dynamic Source Distribution Anchoring (Core):

    • Function: Replace the unconditional prior with an image-conditioned diffusion model to estimate the source distribution.
    • Mechanism: Using IP-Adapter or ControlNet, the current rendered image \(I^{(\tau)}\) is fed as an image condition into the diffusion model. The image condition does not constrain the output content but serves as a contextual anchor that guides generation while preserving the structural information of the current 3D state.
    • Design Motivation: Pretrained image-conditioned diffusion models inherently possess image inversion capability — the model's intrinsic image-to-latent mapping is directly leveraged for accurate source anchoring without additional inversion steps. Only one extra U-Net forward pass is required (parallelizable with the original pass), keeping runtime identical to standard SDS.
  2. Pseudo-Source Reconstruction and Quality Assessment:

    • Function: Explicitly reconstruct the source image and provide a quantitative metric for source distribution estimation quality.
    • Mechanism: The one-step denoised latent \(\hat{z}_{t \to 0}^{\text{anchored}}\) is recovered from the image-conditioned noise prediction and decoded to a reconstructed image; the L2 distance to the original rendered image is computed: \(\mathcal{L}_{\text{rec}} = \| \varepsilon(\hat{z}_{t \to 0}^{\text{anchored}}) - I^{(\tau)} \|_2^2\)
    • Design Motivation: This reconstruction loss serves both as a quality metric and as the foundation for two complementary enhancement strategies.
  3. Filtering Strategy:

    • Function: A threshold \(\gamma\) is applied based on \(\mathcal{L}_{\text{rec}}\) to discard source estimates with excessively large reconstruction errors.
    • Mechanism: When \(\mathcal{L}_{\text{rec}} > \gamma\), the AnchorDS loss for that step is set to zero, skipping unreliable gradient updates.
    • Design Motivation: A simple yet effective mechanism to filter out anomalous predictions caused by domain shift in the image condition, improving training stability.
  4. Fine-tuning Strategy:

    • Function: Lightweight fine-tuning of a single layer in the IP-Adapter to bridge the domain gap between real image distributions and rendered image distributions.
    • Mechanism: Gradient updates from \(\mathcal{L}_{\text{rec}}\) are used to update the parameters of one layer in the image adapter, exposing it to rendered-domain data. The overhead is minimal (training time increases from ~25 min to ~30 min).
    • Design Motivation: Pretrained 2D models are trained on real images and exhibit a distributional bias when processing synthetic rendered images.
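The reconstruction check and filtering strategy (Designs 2 and 3 above) can be sketched as follows. The one-step denoising formula is the standard DDPM estimate of \(z_0\) from a noise prediction; `decode()` and the threshold value are illustrative stand-ins, not the authors' settings:

```python
import numpy as np

def one_step_denoise(z_t, eps_hat, alpha_bar_t):
    # Standard DDPM one-step estimate of z_0 from the noise prediction.
    return (z_t - np.sqrt(1 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

def decode(z):                      # stand-in for the VAE decoder
    return 2.0 * z

rng = np.random.default_rng(2)
rendered = rng.normal(size=16)      # I^(tau), the current rendering
alpha_bar_t = 0.7
z0 = rendered / 2.0                 # toy "encode" consistent with decode()
z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1 - alpha_bar_t) * rng.normal(size=16)
eps_hat = rng.normal(size=16)       # image-conditioned noise prediction

# Pseudo-source reconstruction and its L2 error against the rendering:
z0_hat = one_step_denoise(z_t, eps_hat, alpha_bar_t)
l_rec = np.mean((decode(z0_hat) - rendered) ** 2)

# Filtering: skip the gradient update when reconstruction is unreliable.
gamma = 1.0                         # threshold (hyperparameter; value arbitrary here)
use_step = l_rec <= gamma
```

The same `l_rec` signal doubles as the fine-tuning loss for the adapter layer in Design 4.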

Loss & Training

  • AnchorDS gradient: \(\nabla_\Theta \mathcal{L}_{\text{AnchorDS}} = w(t) \cdot g_t^{(\tau)} \cdot \frac{\partial z_t}{\partial \Theta}\)
  • Source reconstruction loss: \(\mathcal{L}_{\text{rec}} = \| \varepsilon(\hat{z}_{t \to 0}^{\text{anchored}}) - I^{(\tau)} \|_2^2\)
  • Filtering and Fine-tuning are complementary strategies that can be applied independently (the main results pair Finetune with IP-Adapter and Filter with ControlNet).
  • Default image conditioners: IP-Adapter (SD 1.5) or ControlNet (SD 2.1)
  • 3D representations: supports 3DGS (GaussianDreamer) and NeRF
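In an autograd framework, a gradient of the form \(w(t) \cdot g_t^{(\tau)} \cdot \frac{\partial z_t}{\partial \Theta}\) is typically realized with the standard SDS trick of injecting \(w(t) \cdot g\) as the upstream gradient of the latent, since \(g\) comes from a frozen diffusion model and carries no graph of its own. A minimal PyTorch sketch with toy tensors (not the authors' implementation):

```python
import torch

theta = torch.randn(4, requires_grad=True)  # 3D parameters
z_t = theta * 2.0                           # stand-in for render -> encode -> noise
eps_text = torch.randn(4)                   # text-conditioned prediction (frozen model)
eps_img = torch.randn(4)                    # image-conditioned source (frozen model)
w_t = 0.8                                   # timestep weighting w(t)

g = (eps_text - eps_img).detach()
# Passing w(t)*g as the upstream gradient of z_t implements
# dL/dtheta = w(t) * g * dz_t/dtheta without ever materializing L.
z_t.backward(gradient=w_t * g)
assert torch.allclose(theta.grad, w_t * g * 2.0)
```

This is why the method needs no extra networks: only the conditioning input of one U-Net call changes.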

Key Experimental Results

Main Results

T3Bench benchmark (300 prompts, covering single object / single object with surroundings / multiple objects):

Method                              All↑    Single↑   Surr↑   Multi↑
SDS (DreamFusion)                   20.5    24.9      19.3    17.3
SDS (GaussianDreamer)               29.7    42.3      26.1    20.6
AnchorDS (IP-Adapter) + Finetune    33.3    45.3      29.0    25.7
AnchorDS (ControlNet) + Filter      33.2    46.1      29.4    24.0

Human evaluation (912 participants):

Method                       CLIP↑   3D Consistency Q1↓   Text Alignment Q2↓   Visual Quality Q3↓
VSD (SD 2.1)                 0.352   1.84                 1.85                 1.79
Ours (ControlNet, SD 2.1)    0.369   1.16                 1.15                 1.21
VSD (SD 1.5)                 0.281   1.99                 2.00                 2.08
SDS-Bridge (SD 1.5)          0.233   2.38                 2.35                 2.29
Ours (IP-Adapter, SD 1.5)    0.334   1.63                 1.66                 1.63

Ablation Study

Configuration            All↑    Notes
SDS baseline             29.7    Baseline
AnchorDS (IP-Adapter)    30.7    Source anchoring only, +1.0
+ Filter                 32.8    Filter unstable predictions, +3.1
+ Finetune               33.3    Fine-tune adapter, +3.6

Key Findings

  • Source anchoring alone is effective (+1.0); Filter and Finetune each provide additional gains.
  • Largest improvement on multi-object scenes: the Multi category improves from 20.6 to 25.7 (+24.8%), as source anchoring effectively prevents semantic mixing between different objects.
  • Method is insensitive to the choice of image conditioner: both IP-Adapter and ControlNet are effective, demonstrating the generality of the approach.
  • VSD, despite more refined distribution modeling, still produces unnatural colors and structural artifacts due to neglect of source dynamics.
  • SDS-Bridge's hand-crafted negative prompts introduce new biases (e.g., material and texture artifacts), limiting flexibility.

Highlights & Insights

  • Precise problem analysis with mathematical grounding: Decomposing the SDS gradient into \(m_1\) and \(m_2\), and further expanding it into a pseudo-editing formulation (Eq. 8), clearly exposes the information loss caused by the unconditional source estimate. This analytical framework is far more convincing than intuitive explanations alone.
  • Extremely lightweight method: The core modification amounts to replacing the unconditional branch in CFG with an image-conditioned branch — a change on the order of a single line of code, requiring no additional networks or training data. The Filter/Finetune strategies are equally simple.
  • The "dynamic source distribution" perspective may generalize to other distillation scenarios: e.g., SDS variants in 2D editing and video generation — any iterative optimization setting using SDS could benefit from analogous source anchoring.

Limitations & Future Work

  • Still reliant on the SDS paradigm: inherits SDS's inherently slow optimization (requiring thousands of steps), making it unsuitable for real-time applications.
  • Upper-bounded by image conditioner capability: if IP-Adapter/ControlNet exhibits poor inversion capability for certain rendering styles, source estimation quality degrades (which motivates the Filter strategy).
  • No comparison with feed-forward 3D generation methods: the T3Bench evaluation only compares against SDS variants, lacking comparison with 3D-native generative models.
  • vs. VSD (ProlificDreamer): VSD trains a LoRA-based particle distribution model to approximate the source distribution, incurring high computational overhead (4 models). AnchorDS directly conditions on the rendered image with zero additional models and achieves superior performance.
  • vs. SDS-Bridge: SDS-Bridge corrects source bias via hand-crafted negative prompts describing the 3D state, introducing new biases. AnchorDS lets the model directly "see" the current rendering, introducing no additional bias.
  • vs. DDS: DDS also uses a reference image but requires a paired reference prompt and serves a different purpose. AnchorDS acquires its image condition automatically (the current rendering).

Rating

  • Novelty: ⭐⭐⭐⭐ In-depth analysis of the source distribution problem in SDS; method is elegant and concise.
  • Experimental Thoroughness: ⭐⭐⭐⭐ T3Bench + human evaluation + ablation + multi-baseline comparison.
  • Writing Quality: ⭐⭐⭐⭐⭐ Analysis and derivations are clear, figures are intuitive, and the logical chain is complete.
  • Value: ⭐⭐⭐⭐ A significant improvement within the SDS framework; the method is simple and reproducible.