D2C: Accelerating Diffusion Model Training under Minimal Budgets via Condensation

Conference: CVPR 2026
arXiv: 2507.05914
Code: None
Area: Image Generation / Efficient Training / Dataset Condensation
Keywords: Dataset Condensation, Diffusion Training, Difficulty Score, Interval Sampling, REPA

TL;DR

This work introduces dataset condensation to diffusion model training for the first time, proposing D2C, a two-stage Select-then-Attach framework. Using only 0.8% of ImageNet, it reaches an FID of 4.3 in 40K steps, converging 100x faster than REPA and 233x faster than vanilla SiT.

Background & Motivation

Background: Diffusion models have achieved remarkable results in image generation, but training costs are extremely high. For instance, SiT-XL/2 requires 7 million training steps on the full ImageNet (1.28 million images) to converge, consuming hundreds of GPU days. Recently, researchers have improved training efficiency from the model side (architectural improvements, attention optimization, representation alignment like REPA), but the possibility of directly reducing the training set scale from the data side remains largely unexplored.

Limitations of Prior Work: Dataset Condensation (DC) is well-studied for discriminative models (e.g., SRe2L, RDED). However, applying these methods directly to diffusion model training leads to complete collapse, with FIDs as high as 80-166.

Key Challenge: Discriminative DC methods optimize for category-discriminative features rather than the pixel distribution of real images. The generated synthetic images lack sufficient structural and semantic fidelity to support the training requirements of generative models.

Goal: Can a carefully designed dataset condensation strategy reduce training data to 0.8%-8% of the original size while maintaining or even exceeding the generation quality of diffusion models trained on the full dataset?

Key Insight: The authors observe that diffusion models themselves can encode sample learning difficulty via denoising loss. Based on this, they propose using the diffusion model as a "scorer" to select the most informative training subset based on difficulty, then attaching rich semantic and visual priors to the selected samples.

Core Idea: Select a subset using interval sampling based on diffusion difficulty scores (Select), then enhance each sample with T5 text embeddings and DINOv2 visual features (Attach) to achieve efficient diffusion training under extreme data compression.

Method

Overall Architecture

D2C is a two-stage framework that takes a full training set (e.g., ImageNet 1.28M) and outputs a minimal enhanced subset (e.g., 10K or 50K images with attached semantic/visual metadata):

  • Select Stage: A pre-trained diffusion model calculates a "diffusion difficulty score" for each sample. Within each class, samples are sorted by difficulty, and a compact subset is obtained via uniform interval sampling with a fixed interval \(k\), balancing diversity and learnability.
  • Attach Stage: For each selected sample, two types of information are pre-computed and attached: (1) DC-Embedding: Textual semantic embeddings generated by a T5 text encoder; (2) Visual Representation: Patch-level visual features extracted by DINOv2. This information is stored on disk and loaded directly during training.
  • Training Stage: The diffusion model is trained on the compressed, enhanced dataset using the standard denoising loss combined with a REPA-style visual alignment loss. (An end-to-end sketch of the pipeline follows this list.)
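
To make the two-stage data flow concrete, here is a minimal, hypothetical sketch in Python. The paper releases no code, so every name here (`scorer`, `t5_encode`, `dinov2`, `class_name`) is a stand-in; `diffusion_difficulty` and `interval_sample_per_class` are sketched under Key Designs below.

```python
def d2c_condense(full_dataset, scorer, t5_encode, dinov2, k):
    """Hypothetical D2C condensation pipeline: Select, then Attach.

    full_dataset : iterable of (image_tensor, class_id) pairs
    scorer       : pre-trained diffusion model, used only for scoring
    t5_encode    : frozen T5 text encoder (class name -> token embeddings)
    dinov2       : frozen DINOv2 encoder (image -> patch features)
    k            : interval-sampling stride (e.g., 96 for a 10K subset)
    """
    # Select: score every sample, then interval-sample within each class.
    scored = [(x, y, diffusion_difficulty(scorer, x[None], y).item())
              for x, y in full_dataset]
    subset = interval_sample_per_class(scored, k)

    # Attach: precompute conditioning signals once; store alongside the data.
    condensed = []
    for x, y in subset:
        condensed.append({
            "image": x,
            "label": y,
            "text_emb": t5_encode(class_name(y)),  # input to the DC-Embedding
            "vis_emb": dinov2(x[None]),            # patch-level alignment targets
        })
    return condensed
```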

Key Designs

  1. Diffusion Difficulty Score:
     • Function: Quantifies the learning difficulty of each sample for the diffusion model, used for sorting and selection.
     • Mechanism: Defined as \(s_{diff}(x) = -\mathbb{E}_{\epsilon,t}[\|\epsilon - \epsilon_\theta(x_t, t, c)\|^2]\), the negative average denoising loss of a pre-trained model. Under a uniform class prior, Bayes' rule gives \(p_\theta(c|x) \propto p_\theta(x|c)\), so the denoising loss directly reflects the confidence that a sample belongs to the target category: higher loss indicates more complex or ambiguous samples that are harder to learn. (A minimal scoring sketch follows this item.)
     • Design Motivation: Experiments show that both the easiest (Min) and hardest (Max) samples are suboptimal. Min samples (clean, simple backgrounds) lack diversity, while Max samples (cluttered, ambiguous) are difficult to optimize. Medium-difficulty samples have the smallest bias in distribution matching, motivating a sampling strategy that covers multiple difficulty levels.
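
A minimal sketch of the scoring step, estimating the expectation with a small Monte Carlo average. It assumes an ε-prediction model and a linear β schedule, both illustrative assumptions rather than the authors' exact setup:

```python
import torch

@torch.no_grad()
def diffusion_difficulty(model, x, c, n_draws=8, T=1000):
    """Estimate s_diff(x) = -E_{eps,t} ||eps - eps_theta(x_t, t, c)||^2.
    model(x_t, t, c) is assumed to predict the injected noise; the linear
    beta schedule below is assumed purely for illustration."""
    betas = torch.linspace(1e-4, 0.02, T, device=x.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    per_draw = []
    for _ in range(n_draws):
        t = torch.randint(0, T, (x.shape[0],), device=x.device)
        eps = torch.randn_like(x)
        a = alpha_bar[t].view(-1, 1, 1, 1)
        x_t = a.sqrt() * x + (1.0 - a).sqrt() * eps      # forward noising
        pred = model(x_t, t, c)
        per_draw.append(((eps - pred) ** 2).mean(dim=(1, 2, 3)))
    return -torch.stack(per_draw).mean(dim=0)            # higher = easier
```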

  2. Interval Sampling:
     • Function: Performs uniform interval sampling on the sorted samples to balance coverage of easy and hard samples.
     • Mechanism: Within each class \(y\), samples are sorted by \(s_{diff}\) in ascending order, and every \(k\)-th sample is selected: \(\mathcal{D}_{IS} = \bigcup_{y=1}^{C}\{x^{(i)} \in \mathcal{D}_y \mid i \in \{0, k, 2k, \ldots\}\}\). The optimal \(k\) is proportional to the data budget: \(k=96\) for a 10K subset and \(k=16\) for 50K. (A short sketch follows this item.)
     • Design Motivation: Unlike selecting only easy or medium samples, interval sampling naturally covers the full spectrum of difficulty, avoiding the biases of extreme sampling. It outperforms geometric/feature-based methods such as K-Center or Herding, suggesting that for diffusion training, distribution coverage along the difficulty dimension matters more than geometric coverage in feature space.
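
The selection rule itself is only a few lines; this sketch follows the formula above (sort ascending by \(s_{diff}\) within each class, keep indices \(0, k, 2k, \ldots\)):

```python
from collections import defaultdict

def interval_sample_per_class(scored, k):
    """scored: list of (image, class_id, s_diff) triples."""
    by_class = defaultdict(list)
    for x, y, s in scored:
        by_class[y].append((s, x))
    subset = []
    for y, items in by_class.items():
        items.sort(key=lambda p: p[0])                # ascending s_diff
        subset.extend((x, y) for s, x in items[::k])  # indices 0, k, 2k, ...
    return subset
```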

  3. Dual-Condition Embedding (DC-Embedding) and Visual Information Injection:
     • Function: Attaches rich conditioning signals to each sample to compensate for the lack of semantic and visual information in minimal datasets.
     • Mechanism: DC-Embedding processes the T5 text embedding \(t_c\) of each class name with a 1D convolution and fuses it with a learnable class embedding \(e_c\) through a residual MLP: \(y_{text} = \text{MLP}(\tilde{t}_c) + \tilde{t}_c + e_c\). Visual information is provided by DINOv2 patch features \(y_{vis} = f_{vis}(x) \in \mathbb{R}^{N \times d}\), with the first \(h\) tokens used as alignment targets. (A module sketch follows this item.)
     • Design Motivation: Pure class embeddings struggle to learn rich semantic relationships from scratch, whereas T5 embeddings naturally encode semantic hierarchies (e.g., similar dog breeds cluster in embedding space). Fusing them with learnable embeddings preserves the pre-trained semantics while keeping training flexible. DINOv2 features add instance-level spatial priors that capture intra-class variation.
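
A minimal PyTorch reading of the DC-Embedding formula. The T5 width (512), the model width (1152, as in DiT/SiT-XL), and mean-pooling the convolved tokens before the residual MLP are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class DCEmbedding(nn.Module):
    """y_text = MLP(t~_c) + t~_c + e_c, per the formula above."""

    def __init__(self, num_classes, t5_dim=512, d_model=1152):
        super().__init__()
        self.conv = nn.Conv1d(t5_dim, d_model, kernel_size=3, padding=1)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.SiLU(),
            nn.Linear(d_model, d_model),
        )
        self.class_emb = nn.Embedding(num_classes, d_model)  # learnable e_c

    def forward(self, t5_tokens, class_ids):
        # t5_tokens: (B, L, t5_dim) frozen T5 embeddings of the class name
        t = self.conv(t5_tokens.transpose(1, 2)).transpose(1, 2)   # 1D conv
        t = t.mean(dim=1)                     # pool tokens -> (B, d_model)
        return self.mlp(t) + t + self.class_emb(class_ids)  # residual fusion
```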

Loss & Training

The total loss is a weighted sum of the denoising loss and the semantic alignment loss:

\[\mathcal{L}_{total} = \mathcal{L}_{diff} + \lambda \mathcal{L}_{proj}\]
  • \(\mathcal{L}_{diff}\): Standard denoising loss conditioned on the DC-Embedding.
  • \(\mathcal{L}_{proj}\): REPA-style visual alignment loss, mapping intermediate diffusion features to DINOv2 features via a projection head using cosine similarity.
  • \(\lambda = 0.5\) is the default weight.
  • Training: Adam with a learning rate of 1e-4; a 10K subset trains in only 7.4 hours on 8×A800/4090 GPUs. (A minimal loss sketch follows.)
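
A minimal sketch of the combined objective, assuming the batch already carries the precomputed DINOv2 targets (cut to the first \(h\) tokens) and that a hypothetical `proj_head` maps intermediate diffusion features into the DINOv2 space:

```python
import torch.nn.functional as F

def d2c_loss(eps, eps_pred, hidden, y_vis, proj_head, lam=0.5):
    """L_total = L_diff + lambda * L_proj.

    eps, eps_pred : target and predicted noise for the denoising loss
    hidden        : intermediate diffusion features, (B, N, d_h)
    y_vis         : DINOv2 patch targets, already cut to (B, h, d)
    """
    l_diff = F.mse_loss(eps_pred, eps)
    h = y_vis.shape[1]
    z = proj_head(hidden[:, :h])          # project the first h tokens
    l_proj = 1.0 - F.cosine_similarity(z, y_vis, dim=-1).mean()
    return l_diff + lam * l_proj          # lambda = 0.5 by default
```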

Key Experimental Results

Main Results

ImageNet 256² Acceleration Comparison (SiT-XL/2, CFG=1.5):

| Method | Data Size | Steps | gFID-50K↓ |
|---|---|---|---|
| Vanilla SiT-XL/2 | 1.28M | 7M | 8.3 |
| + REPA | 1.28M | 4M | 5.9 |
| + REPA-E | 1.28M | 235K | 5.9 |
| + REG | 1.28M | 200K | 5.0 |
| Ours (D2C) | 10K (0.8%) | 40K | 4.3 |
| Ours (D2C) | 50K (4%) | 180K | 2.78 |

ImageNet 256² Comparison of DC Methods (DiT-L/2, 0.8% 10K, 100K steps):

| Method | gFID↓ | sFID↓ | IS↑ | Precision↑ |
|---|---|---|---|---|
| RDED | 166.2 | 60.1 | 10.8 | 0.09 |
| SRe2L | 104.2 | 20.2 | 14.1 | 0.20 |
| Ours (D2C) | 4.2 | 11.0 | 283.6 | 0.72 |

Ablation Study

Contribution of Select and Attach Components (DiT-L/2, 10K, gFID-10K):

| Select | DC-Embedding | Visual Embedding | gFID↓ |
|:---:|:---:|:---:|---|
|  |  |  | 37.07 |
|  |  |  | 8.79 |
|  |  |  | 14.96 |
|  |  |  | 10.37 |
|  |  |  | 9.01 |
| ✓ | ✓ | ✓ | 7.62 |

Key Findings

  • Discriminative DC methods (SRe2L, RDED) fail completely in diffusion training (FID > 80), proving that generation tasks require specialized DC strategies.
  • The optimal \(k\) for interval sampling is roughly proportional to the data budget (\(k=96\) for 10K, \(k=16\) for 50K).
  • Even with a weak scorer trained from scratch (baseline gFID 11.5), D2C achieves a gFID of 4.9, far exceeding random selection (37.07).
  • Using the Attach stage alone (without Select) achieves a gFID of 5.6, outperforming REPA (5.9).

Highlights & Insights

  • 233x Training Acceleration: Cutting training from hundreds of GPU days to under 10 hours on 8 GPUs while improving quality (FID 4.3 vs. 8.3) is extraordinary. This significantly lowers the barrier to entry for large-scale generative model research.
  • Bayesian Derivation of Difficulty: Establishing a theoretical link between denoising loss and learning difficulty via \(p_\theta(c|x) \propto p_\theta(x|c)\) provides a principled metric for data selection, avoiding reliance on external classifiers.
  • Decoupled Two-Stage Design: Separating "what to select" from "how to enhance" allows each module to be improved independently. Both Select and Attach are effective on their own, and their combination provides further gains.

Limitations & Future Work

  • The Select stage depends on a pre-trained diffusion model as a scorer; cold-start scenarios require additional pre-training overhead.
  • Primarily validated on Class-to-Image (C2I) settings; applicability to Text-to-Image (T2I) at scale (e.g., Stable Diffusion) remains to be verified.
  • The interval \(k\) must be manually selected based on the data budget, though empirical rules of thumb exist.

Comparison with Related Work

  • vs. REPA (model-side acceleration): REPA accelerates training but still requires the full 1.28M dataset for 4M steps. D2C achieves better results (FID 4.3 vs. 5.9) 100x faster using only 0.8% of the data.
  • vs. SRe2L/RDED (Discriminative DC): These fail in diffusion training because they preserve categorical features but destroy pixel-level distribution structure.
  • Data Pruning: Unlike recent pruning works that focus on selection and reweighting, D2C incorporates information enhancement via the Attach stage.

Rating

  • Novelty: ⭐⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐⭐