Boosting Generative Image Modeling via Joint Image-Feature Synthesis¶
Conference: NeurIPS 2025 (Spotlight) | arXiv: 2504.16064 | Code: GitHub | Area: Image Generation | Keywords: Joint Image-Feature Generation, Diffusion Model, DINOv2, Representation Guidance, DiT
TL;DR¶
This paper proposes ReDi (Representation Diffusion), a framework that jointly models VAE image latents and DINOv2 semantic features within a diffusion model — both are simultaneously denoised from pure noise within a single diffusion process. With minimal modifications to the DiT architecture, ReDi achieves a 23× training convergence speedup and state-of-the-art FID, while unlocking a novel Representation Guidance inference strategy.
Background & Motivation¶
Background: Latent diffusion models (LDMs) are the dominant approach for high-quality image generation, while self-supervised representation learning methods (e.g., DINOv2) excel at semantic understanding. Although each paradigm has its strengths, they have long remained separate — LDM internal features lack semantic grounding, while DINOv2 has no generative capability.
Limitations of Prior Work: REPA (Yu et al., 2025) first demonstrated that aligning the internal representations of a diffusion model with DINOv2 can simultaneously improve generation quality and training efficiency. However, REPA requires additional distillation losses (contrastive/MSE) to align intermediate features, resulting in a complex training objective.
Key Challenge: How can one elegantly achieve both high-quality image generation and semantic representation learning within a single model, without introducing complex distillation mechanisms?
Goal: To propose a more direct alternative to REPA — rather than aligning representations, the diffusion model is trained to jointly generate both images and semantic features.
Key Insight: DINOv2 semantic features are treated as a "second modality" on par with VAE latents, jointly denoised within the same diffusion process.
Core Idea: Instead of indirectly aligning the internal representations of a diffusion model, the model is trained to directly learn to generate semantic features — joint modeling naturally forces the model to integrate low-level visual and high-level semantic information during generation.
Method¶
Overall Architecture¶
Given an input image \(I\), both the VAE latent \(\mathbf{x}_0 = \mathcal{E}_x(I) \in \mathbb{R}^{L \times C_x}\) and the DINOv2 feature \(\mathbf{z}_0 = \mathcal{E}_z(I) \in \mathbb{R}^{L \times C_z}\) are extracted simultaneously. The same forward diffusion noise schedule is applied to both, and a single Transformer is used for joint denoising. The training objective is a simple joint denoising loss; at inference time, both image and semantic features are generated simultaneously from pure noise.
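A minimal sketch of this dual-encoding step, assuming PyTorch with diffusers' `AutoencoderKL` and the official DINOv2 torch.hub model. The input resolutions are chosen here so both encoders yield \(L = 256\) tokens (32×32 latents patchified 2×2 for DiT; 16×16 DINOv2 patches at 224px); exact preprocessing and normalization in the paper may differ:

```python
import torch
from diffusers import AutoencoderKL

# Image encoder: SD-VAE-FT-EMA, (B,3,256,256) -> (B,4,32,32) latents.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").eval()
# Semantic encoder: DINOv2-B with registers (patch size 14).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14_reg").eval()

@torch.no_grad()
def encode_image(img256: torch.Tensor, img224: torch.Tensor):
    """img256: (B,3,256,256) for the VAE; img224: (B,3,224,224) for DINOv2,
    so both streams produce L = 256 tokens after patchification."""
    x0 = vae.encode(img256).latent_dist.sample() * vae.config.scaling_factor
    z0 = dino.forward_features(img224)["x_norm_patchtokens"]  # (B, 256, 768)
    return x0, z0
```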
Key Designs¶
- Joint Forward-Reverse Diffusion Process:
    - Function: Independent noise is added to both the image latent and the semantic features using the same noise schedule, followed by joint denoising (see the training-step sketch after this list).
    - Mechanism: The forward process is defined as \(\mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon}_x\) and \(\mathbf{z}_t = \sqrt{\bar\alpha_t}\mathbf{z}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon}_z\). The model simultaneously predicts two sets of noise, \(\boldsymbol{\epsilon}_\theta^x\) and \(\boldsymbol{\epsilon}_\theta^z\). The joint loss is \(\mathcal{L} = \|\boldsymbol{\epsilon}_\theta^x - \boldsymbol{\epsilon}_x\|^2 + \lambda_z\|\boldsymbol{\epsilon}_\theta^z - \boldsymbol{\epsilon}_z\|^2\), with default \(\lambda_z=1\).
    - Design Motivation: Forcing the model to learn the joint distribution of image details and semantic structure, where each modality provides complementary information to the other, naturally improves generation quality.
- Token Fusion Strategy (Merged vs. Separate):
    - Function: Two approaches for feeding VAE tokens and semantic tokens into the Transformer.
    - Mechanism: The Merged approach projects both token sets via separate linear embeddings and adds them channel-wise, \(\mathbf{h}_t = \mathbf{x}_t\mathbf{W}_{emb}^x + \mathbf{z}_t\mathbf{W}_{emb}^z\), preserving the original token count \(L\). The Separate approach concatenates the two sets along the sequence dimension, yielding \(2L\) tokens. Merged is the default for computational efficiency.
    - Design Motivation: Merged enables tight early-stage interaction between the two information streams at no extra cost; Separate offers greater expressive capacity at the price of a doubled sequence length.
- PCA-Reduced Semantic Representations + Representation Guidance:
    - Function: DINOv2's 768-dimensional features are reduced to 8 dimensions via PCA to balance compute and capacity; at inference, the semantic branch guides image generation (see the guidance sketch after this list).
    - Mechanism: PCA dimensionality reduction addresses the capacity imbalance caused by \(C_z \gg C_x\). Representation Guidance, analogous to CFG, is formulated as \(\hat{\boldsymbol{\epsilon}}_\theta = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) + w_r(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{z}_t, t) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t))\). During training, \(\mathbf{z}_t\) is randomly dropped with probability \(p_{drop}\) so the model learns to denoise both with and without semantic conditioning.
    - Design Motivation: PCA prevents the high-dimensional semantic features from consuming disproportionate model capacity; Representation Guidance leverages the model's own learned semantics to steer generation without requiring any additional model.
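The list above maps to a short training-step sketch. This is an illustrative reconstruction from the paper's equations, not the authors' code; `model` is an assumed stand-in that applies the Merged fusion internally and returns both noise predictions:

```python
import torch
import torch.nn.functional as F

def redi_training_step(model, x0, z0, alpha_bar, lambda_z=1.0, p_drop=0.2):
    """One joint denoising step. x0: (B, L, C_x) VAE latent tokens;
    z0: (B, L, C_z) PCA-reduced DINOv2 tokens; alpha_bar: (T,) cumulative
    noise schedule. `model` is an assumed interface returning both noises."""
    B = x0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=x0.device)
    a = alpha_bar[t].view(B, 1, 1)

    # Same schedule, independent noise for each modality.
    eps_x, eps_z = torch.randn_like(x0), torch.randn_like(z0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps_x
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps_z

    # Drop the semantic branch with probability p_drop so the model also
    # learns image-only denoising (used later by Representation Guidance).
    keep = torch.rand(B, device=x0.device) >= p_drop            # (B,)
    z_t = torch.where(keep.view(B, 1, 1), z_t, torch.zeros_like(z_t))

    # Merged fusion (h_t = x_t W_x + z_t W_z) happens inside `model`,
    # which denoises the fused L tokens and predicts both noises.
    pred_x, pred_z = model(x_t, z_t, t)
    loss_x = F.mse_loss(pred_x, eps_x)
    # Semantic loss is disabled for samples whose z_t was dropped.
    loss_z = F.mse_loss(pred_z[keep], eps_z[keep]) if keep.any() else x0.new_zeros(())
    return loss_x + lambda_z * loss_z
```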
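Representation Guidance at a single sampling step can likewise be sketched. The zeroed semantic tokens stand in for the dropped branch learned during training, and the guidance scale `w_r` is an assumed hyperparameter:

```python
import torch

def representation_guided_eps(model, x_t, z_t, t, w_r=1.5):
    """Push the image-noise prediction toward the semantically conditioned
    direction, CFG-style, using the model's own semantic branch."""
    eps_joint, _ = model(x_t, z_t, t)                    # conditioned on semantics
    eps_image, _ = model(x_t, torch.zeros_like(z_t), t)  # semantic branch dropped
    return eps_image + w_r * (eps_joint - eps_image)
```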
Loss & Training¶
The joint denoising loss is \(\mathcal{L}_{joint} = \|\boldsymbol{\epsilon}_\theta^x - \boldsymbol{\epsilon}_x\|^2 + \lambda_z\|\boldsymbol{\epsilon}_\theta^z - \boldsymbol{\epsilon}_z\|^2\) (\(\lambda_z=1\)). During training, \(\mathbf{z}_t\) is zeroed out and the semantic loss is disabled with probability \(p_{drop}=0.2\). Images are encoded using SD-VAE-FT-EMA (\(32\times32\times4\)), and semantics are extracted via DINOv2-B+Registers. The PCA projection matrix is precomputed on 76,800 randomly sampled ImageNet images.
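A minimal sketch of the PCA precompute, assuming scikit-learn; pooling patch tokens across the 76,800 sampled images into one feature matrix is an assumption about the exact procedure:

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pca_projection(dino_features: np.ndarray, dim: int = 8) -> PCA:
    """dino_features: (N_tokens, 768) DINOv2 patch tokens gathered from
    ~76,800 ImageNet images. At training time, z0 = pca.transform(tokens);
    pca.inverse_transform recovers approximate full-dimensional features."""
    pca = PCA(n_components=dim)
    pca.fit(dino_features)
    return pca
```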
Key Experimental Results¶
Main Results¶
| Model | Method | Training budget | FID↓ | Notes |
|---|---|---|---|---|
| DiT-XL/2 | Baseline | 7M | 9.6 | Original DiT convergence |
| DiT-XL/2 | REPA | 400K | 12.3 | Distillation alignment |
| DiT-XL/2 | ReDi | 400K | 8.7 | Joint modeling, surpasses 7M-step baseline |
| SiT-XL/2 | Baseline | 7M | 8.3 | Original SiT convergence |
| SiT-XL/2 | REPA | 4M | 5.9 | Needs 4M iterations to reach this FID |
| SiT-XL/2 | ReDi | 700K | 5.6 | Better FID with roughly 6× fewer iterations |
| SiT-XL/2+CFG | ReDi | 350 epochs | 1.72 | State-of-the-art class-conditional FID |
| SiT-XL/2+CFG | ReDi | 800 epochs | 1.61 | Best reported result |
Ablation Study¶
| Configuration | FID↓ | Notes |
|---|---|---|
| Merged tokens (default) | 8.7 | Efficient and effective |
| Separate tokens | 8.2 | Stronger but doubles computation |
| No PCA (768-dim) | Degraded | Capacity imbalance |
| PCA to 8-dim (default) | 8.7 | Optimal balance |
| \(\lambda_z=0\) (no semantic loss) | ~DiT baseline | Semantic branch is necessary |
| ReDi + REPA | 3.3 (4M iters) | Two methods are complementary |
Key Findings¶
- ReDi accelerates convergence of DiT-XL/2 and SiT-XL/2 by approximately 23×.
- Compared to REPA, ReDi converges 6× faster with better FID.
- ReDi and REPA are complementary — combined, they reach FID 3.6 at 1M steps (REPA alone requires 4M steps to reach 5.9).
- Representation Guidance improves generation quality without relying on any external classifier.
Highlights & Insights¶
- An elegantly designed Spotlight contribution — substantial gains with minimal architectural modification.
- Joint modeling vs. distillation alignment: directly modeling the joint image-feature distribution proves more effective than indirectly aligning internal representations.
- Representation Guidance is a fully self-contained inference strategy, complementary to CFG.
- The finding that the two methods are complementary suggests that joint modeling and alignment capture different aspects of information.
Limitations & Future Work¶
- PCA dimensionality reduction is linear and may discard nonlinear semantic structure.
- The semantic encoder (DINOv2) is frozen; joint end-to-end fine-tuning may yield further improvements.
- Validation is limited to ImageNet 256×256; higher resolutions and text-to-image settings remain unexplored.
- The Separate tokens approach doubles computation, motivating the need for more efficient attention mechanisms.
Related Work & Insights¶
- vs. REPA: REPA aligns intermediate features via distillation, requiring additional losses and yielding weaker results; ReDi directly models the joint distribution in a simpler and more effective manner.
- vs. MT-Diffusion: MT-Diffusion also introduces CLIP representations but does not quantify their impact on generation; ReDi systematically evaluates the benefits of joint modeling.
- vs. VideoJAM: The Representation Guidance design is analogous to, and inspired by, the motion-guidance strategy employed in VideoJAM.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Joint image-feature diffusion is a genuinely new paradigm; Representation Guidance is an original contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-scale models, multiple frameworks, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Method is clearly described with well-formalized mathematical notation.
- Value: ⭐⭐⭐⭐⭐ Spotlight recognition is well-deserved; opens a new direction for representation-aware generative modeling.