
Diffusion Curriculum: Synthetic-to-Real Data Curriculum via Image-Guided Diffusion

Conference: ICCV 2025 arXiv: 2410.13674 Code: tianyi-lab/DisCL Area: Object Detection / Data Augmentation Keywords: Diffusion Models, Curriculum Learning, Synthetic Data, Long-Tail Classification, Low-Quality Data

TL;DR

This paper leverages the image guidance strength of diffusion models to generate a continuous synthetic-to-real spectrum of data, and proposes a Diffusion Curriculum Learning (DisCL) strategy that adaptively selects synthetic data at optimal guidance levels across different training stages, effectively addressing long-tail classification and low-quality data learning challenges.

Background & Motivation

Deep learning in real-world scenarios frequently faces dual challenges of data quality and quantity: images captured by wildlife cameras, traffic surveillance systems, and similar devices often suffer from poor illumination, motion blur, and occlusion; class imbalance further degrades model performance on tail classes.

Limitations of existing solutions:

  • Traditional data augmentation (flipping, cropping, etc.) produces samples with limited diversity
  • Although text-guided diffusion models can generate high-quality and diverse data, text-only control cannot guarantee similarity between synthetic and original images, causing out-of-distribution data to harm model performance
  • Existing methods (e.g., ALIA, LDMLR) still fall short in controlling the synthetic-to-real distribution gap

Core insight: The image guidance strength \(\lambda\) in diffusion models naturally provides a continuous interpolation spectrum from synthetic to real. Low \(\lambda\) produces diverse, prototypical, and simple images, while high \(\lambda\) produces images closely resembling the original but potentially inheriting its defects.

Method

Overall Architecture

DisCL consists of two stages:

  1. Synthetic-to-Real Data Generation: identifies hard samples and generates a full synthetic-to-real interpolation spectrum using image guidance levels \(\lambda \in [0,1)\)
  2. Generative Curriculum Learning: selects synthetic data at appropriate guidance levels for each training stage based on training dynamics
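The two stages above can be sketched in a few lines; this is an illustrative outline only, and all names and data structures are assumptions rather than the authors' code (a real Stage 1 would invoke an image-guided diffusion model):

```python
# Illustrative sketch of the two-stage DisCL pipeline; names and data
# structures are assumptions, not the authors' code.

def build_spectrum(hard_samples, levels):
    """Stage 1: one synthetic pool per image-guidance level lambda.

    A real implementation would run image-guided diffusion here; we
    record (sample, lambda) placeholders instead.
    """
    return {lam: [(s, lam) for s in hard_samples] for lam in levels}

def train_with_curriculum(spectrum, schedule):
    """Stage 2: at each training stage, draw data at the scheduled lambda."""
    history = []
    for lam in schedule:
        pool = spectrum[lam]              # synthetic data at this guidance level
        history.append((lam, len(pool)))  # stand-in for an actual training step
    return history

# Diverse-to-specific schedule: lambda increases over training stages.
spectrum = build_spectrum(["img_a", "img_b"], levels=[0.0, 0.5, 0.9])
log = train_with_curriculum(spectrum, schedule=[0.0, 0.5, 0.9])
```

The key design point this illustrates is that the synthetic spectrum is generated once up front, so the curriculum only ever switches which pool it samples from.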

Key Designs

  1. Synthetic-to-Real Interpolation via Image Guidance Control:

    • Built upon Stable Diffusion XL; the denoising process starts from step \(t(\lambda) = \lfloor(1-\lambda)T\rfloor\), where \(T\) is the total number of diffusion steps and the text prompt is applied via classifier-free guidance
    • Noise initialization formula: \(z_{t(\lambda)} = \sqrt{\tilde{\alpha}_{t(\lambda)}} z_{real} + \sqrt{1-\tilde{\alpha}_{t(\lambda)}} \epsilon\)
    • \(\lambda = 0\) corresponds to pure text guidance (most diverse but furthest from original); \(\lambda \rightarrow 1\) corresponds to near-original images (most similar but least diverse)
    • CLIPScore thresholding filters low-fidelity synthetic images where target objects are absent or occluded
  2. Non-Adaptive Curriculum for Long-Tail Classification (Diverse-to-Specific):

    • Core mechanism: Given data scarcity in tail classes, low-guidance-level data is used first to increase diversity and quantity, then guidance level is progressively increased to adapt the model to the real distribution
    • Early training employs \(\lambda \rightarrow 0\) synthetic data (prototypical features, high diversity)
    • Later training gradually transitions to \(\lambda \rightarrow 1\) data (closer to real distribution)
    • The synthetic-to-real gap is bridged incrementally to avoid abrupt distribution shifts
  3. Adaptive Curriculum for Low-Quality Data (Adaptive):

    • Hard samples are identified by the real-class probability of a pretrained classifier (lower probability indicates greater difficulty)
    • At each epoch, the guidance level \(\lambda\) for the next round is adaptively selected based on "learning progress" (improvement in real-class confidence on a validation subset)
    • The \(\lambda\) that maximizes progress at the current training stage is selected, ensuring the most informative data is used at each phase
    • This avoids negative transfer caused by premature introduction of out-of-distribution data in non-adaptive curricula for low-quality settings
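The guidance-to-timestep mapping and noise initialization in item 1 can be reproduced directly. A minimal NumPy sketch, assuming \(T = 1000\) total steps and a standard linear \(\beta\) schedule (both assumptions not specified in this summary):

```python
import numpy as np

T = 1000  # total diffusion steps (assumed; not specified above)

def start_step(lam, T=T):
    """t(lambda) = floor((1 - lambda) * T).

    lam = 0 -> start from pure noise (text guidance only);
    lam -> 1 -> start near the original image.
    """
    return int(np.floor((1.0 - lam) * T))

def noise_init(z_real, alpha_bar, lam, rng):
    """z_{t(lam)} = sqrt(abar_t) * z_real + sqrt(1 - abar_t) * eps."""
    t = start_step(lam)
    a = alpha_bar[t - 1] if t > 0 else 1.0  # t = 0: no noise added
    eps = rng.standard_normal(z_real.shape)
    return np.sqrt(a) * z_real + np.sqrt(1.0 - a) * eps

# A linear beta schedule gives the cumulative product alpha_bar.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
```

Note how the two extremes fall out of the formula: at \(\lambda = 0\) the latent is fully re-noised (pure text guidance), while as \(\lambda \rightarrow 1\) almost no noise is added and the output stays close to the original.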

Loss & Training

  • Hard sample identification: frequency-based for long-tail tasks; classifier confidence-based for low-quality tasks
  • Synthetic data scale: 3–4× the original tail-class data yields optimal results for long-tail tasks
  • Training backbones: ResNet-10 for ImageNet-LT; CLIP ViT-B/16 and ViT-L/14 for iWildCam
  • DDIM is used as the noise scheduler
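The adaptive curriculum above can be reduced to a small selection loop. In this simplified stand-in, "learning progress" is modeled as a confidence gain on a validation subset, and `probe_confidence` is a hypothetical callable, not the authors' API:

```python
# Simplified sketch of adaptive guidance-level selection: pick the lambda
# whose validation probe shows the largest confidence gain over the
# previous epoch. `probe_confidence` is a hypothetical stand-in.

def select_lambda(levels, probe_confidence, prev_conf):
    best_lam, best_gain = None, float("-inf")
    for lam in levels:
        gain = probe_confidence(lam) - prev_conf  # "learning progress" at lam
        if gain > best_gain:
            best_lam, best_gain = lam, gain
    return best_lam

# Usage: with these toy confidences, lambda = 0.5 maximizes progress.
probe = {0.0: 0.41, 0.5: 0.63, 0.9: 0.52}.get
chosen = select_lambda([0.0, 0.5, 0.9], probe, prev_conf=0.30)
```

A non-adaptive diverse-to-specific schedule would replace this selection with a fixed monotonically increasing sequence of \(\lambda\) values.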

Key Experimental Results

Main Results: ImageNet-LT Long-Tail Classification

| Method | Curriculum | Many | Medium | Few | Overall |
|---|---|---|---|---|---|
| CE baseline | N/A | 57.70% | 26.60% | 4.40% | 35.80% |
| CE + CUDA | N/A | 57.49% | 28.16% | 6.58% | 36.30% |
| CE + LDMLR | N/A | 57.20% | 29.20% | 7.30% | 37.20% |
| BS + CUDA | N/A | 51.16% | 37.35% | 19.28% | 40.03% |
| CE + DisCL | Diverse-to-Specific | 56.78% | 30.73% | 23.64% | 39.82% |
| BS + DisCL | Diverse-to-Specific | 52.68% | 37.68% | 21.36% | 41.33% |

DisCL improves tail-class accuracy from 4.40% to 23.64% (+19.24%) and overall accuracy by 4.02%.

Main Results: iWildCam Low-Quality Data Classification

| Method | Curriculum | OOD F1 | ID F1 |
|---|---|---|---|
| FLYP | N/A | 35.5 | 52.2 |
| FLYP + ALIA | N/A | 36.9 | 52.6 |
| FLYP + DisCL | Adaptive | 38.2 | 54.3 |
| FLYP + DisCL + WE | Adaptive | 38.7 | 54.6 |

OOD and ID F1 scores improve by 2.7% and 2.1%, respectively.

Ablation Study

| Configuration | Few-class (IN-LT) | Note |
|---|---|---|
| Text-only Guidance (\(\lambda=0\)) | 17.90% | Text guidance only; diverse but large distributional gap |
| All-Level Guidance | 19.17% | All levels mixed; no curriculum strategy |
| DisCL Specific-to-Diverse | 18.36% | Reverse curriculum; prone to overfitting the real distribution |
| DisCL Adaptive | 16.78% | Adaptive strategy unsuitable here due to the small validation set in the long-tail setting |
| DisCL Diverse-to-Specific | 23.64% | Optimal strategy; progressively bridges the distribution gap |

| Configuration | OOD F1 (iWildCam) | Note |
|---|---|---|
| Easy-to-Hard (non-adaptive) | 35.2 | Fixed schedule from low to high guidance |
| Random | 35.9 | Random guidance level selection |
| Adaptive | 38.2 | Selected adaptively based on learning progress |

Key Findings

  • Synthetic data scaling experiments show that 3–4× tail-class synthetic data is the optimal trade-off; beyond this, head-class accuracy slightly decreases
  • The Diverse-to-Specific and Adaptive strategies are best suited for long-tail and low-quality scenarios, respectively, demonstrating that curriculum strategies must be matched to task characteristics
  • CLIPScore thresholding has a significant impact in the low-quality data setting (high variance in synthetic image quality) but a minor impact on ImageNet (Stable Diffusion produces consistently high-quality outputs for ImageNet categories)

Highlights & Insights

  • The idea of controlling synthetic-to-real interpolation via image guidance level is highly generalizable, elevating diffusion models from a "data augmentation tool" to a "controllable curriculum learning engine"
  • The design of two curriculum strategies (non-adaptive vs. adaptive) reflects a deep understanding of the fundamental differences between task types
  • The experimental design is systematic and comprehensive, covering four datasets: ImageNet-LT, CIFAR100-LT, iNaturalist2018, and iWildCam

Limitations & Future Work

  • Synthetic data quality is constrained by the capabilities of the diffusion model and CLIP alignment
  • Text prompts are based solely on class names, without leveraging image caption information
  • Discrepancies in object position and scale between synthetic and real images may amplify the distributional gap
  • The method is validated only on classification tasks; more complex detection and segmentation tasks remain unexplored

Comparison with Related Work

  • CUDA is among the earliest works to combine curriculum learning with data augmentation for long-tail learning, but relies solely on engineered augmentations
  • LDMLR trains a dedicated long-tail sampler using diffusion models, whereas DisCL directly leverages image guidance from a pretrained diffusion model
  • Compared to contrastive learning augmentation methods (e.g., DoCL), DisCL employs generative models rather than dynamic selection of real samples

Rating

  • Novelty: ⭐⭐⭐⭐ — The cross-cutting innovation of image guidance level × curriculum learning is concise and effective
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four datasets, extensive ablations, and validation across multiple loss functions
  • Writing Quality: ⭐⭐⭐⭐ — Clear motivation, rigorous formal expression, and interpretable illustrations
  • Value: ⭐⭐⭐⭐ — Provides a general framework for leveraging diffusion models for data augmentation with broad applicability