
Difficulty Controlled Diffusion Model for Synthesizing Effective Training Data

Conference: AAAI 2026
arXiv: 2411.18109
Code: https://github.com/komejisatori/Difficulty-Aware-Synthesis
Area: Image Generation / Data Synthesis
Keywords: Diffusion Models, Difficulty-Controllable Generation, Training Data Synthesis, Curriculum Learning, Difficulty Encoder

TL;DR

A difficulty encoder (an MLP taking the class label and a difficulty score as input) is incorporated into Stable Diffusion, while LoRA fine-tuning handles domain alignment. This decouples the objectives of "domain alignment" and "difficulty control," making the learning difficulty of synthesized data controllable. Using only 10% additional synthetic data, the proposed method surpasses the best results of Real-Fake while saving 63.4 GPU hours.

Background & Motivation

Background: Synthesizing training data with diffusion models has become a mainstream approach to data augmentation; methods such as Real-Fake achieve domain alignment through fine-tuning.

Limitations of Prior Work: Fine-tuned models tend to capture only the dominant features of the target dataset, producing samples that skew heavily toward low difficulty (i.e., "easy samples"). Yet medium-difficulty samples are the most valuable for training (Table 1: medium-difficulty samples yield +0.8%, easy samples only +0.2%, and extremely hard samples actually cause −0.4%), whereas such samples account for only about 1% of the data generated by Real-Fake.

Key Challenge: Without fine-tuning, domain alignment fails; with fine-tuning, only easy samples are generated. A fundamental trade-off exists between domain alignment and difficulty diversity.

Goal: To maintain domain alignment while controlling the learning difficulty of generated samples.

Key Insight: Learning difficulty is treated as an explicit conditioning signal injected into the diffusion model, with a dedicated difficulty encoder used to decouple domain alignment from difficulty modeling.

Core Idea: The difficulty encoder learns a mapping from difficulty scores to difficulty features, while LoRA handles domain alignment — the two components have clearly separated responsibilities and do not interfere with each other.

Method

Overall Architecture

  • Inputs: Images from the target dataset, class labels, difficulty scores computed by a pretrained classifier, and text prompts.
  • Fine-tuning: A difficulty encoder is introduced into Stable Diffusion and the model is fine-tuned with LoRA.
  • Generation: Difficulty scores are sampled from a specified distribution \(s \sim \mathcal{N}(\mu, \sigma)\) to guide synthesis.

Key Designs

  1. Difficulty Score Definition:

    • Function: Defines sample difficulty using the prediction confidence of a pretrained classifier.
    • Mechanism: \(s = 1 - c\), where \(c\) is the softmax probability assigned to the ground-truth class. Higher \(s\) indicates greater difficulty.
    • Design Motivation: The definition is straightforward, architecture-agnostic, and directly related to the downstream task.
  2. Difficulty Encoder \(\mathcal{E}_d\):

    • Function: Maps difficulty scores to latent embeddings that control generation.
    • Mechanism: An MLP receives the concatenation of the class label and difficulty score, \(\bm{h_i} = \mathcal{E}_d([y_i] \oplus [s_i])\); the output embedding is concatenated with the CLIP text embedding and fed into the cross-attention layers of the U-Net. Class information is incorporated because the same difficulty score corresponds to different visual characteristics across different categories.
    • Design Motivation: Class conditioning is necessary since the difficulty factors for "garbage truck" (e.g., cluttered scenes) differ fundamentally from those for "golf ball."
  3. Decoupled Training Strategy:

    • Function: LoRA fine-tunes the U-Net for domain alignment, while the difficulty encoder is trained from scratch for difficulty control.
    • Mechanism: The loss function follows the standard denoising objective \(\mathcal{L} = \mathbb{E}[\|\epsilon - \epsilon_{(\theta,\delta)}(z_t, t, \tau)\|_2^2]\), where the conditioning signal is \(\tau = \mathcal{E}_{\text{text}}(p) \oplus \bm{h}\).
    • Design Motivation: The learning objectives of LoRA and the difficulty encoder are naturally disjoint — LoRA learns the domain distribution while the encoder learns the difficulty-score-to-difficulty-feature mapping.
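The score definition and encoder above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the hidden width, embedding dimension, one-hot class encoding, and random (untrained) weights are all assumptions; in the actual method the encoder is trained jointly with LoRA inside Stable Diffusion and the text embedding comes from CLIP.

```python
import numpy as np

def difficulty_score(logits, label):
    """s = 1 - c, where c is the softmax confidence on the ground-truth class."""
    z = logits - logits.max()                  # stabilized softmax
    p = np.exp(z) / np.exp(z).sum()
    return 1.0 - p[label]

class DifficultyEncoder:
    """Toy MLP: maps [one-hot class ⊕ difficulty score] to an embedding h."""
    def __init__(self, num_classes, embed_dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        d_in = num_classes + 1                 # class one-hot plus scalar score
        self.W1 = rng.normal(0.0, 0.02, (d_in, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.02, (hidden, embed_dim))
        self.b2 = np.zeros(embed_dim)
        self.num_classes = num_classes

    def __call__(self, label, score):
        x = np.concatenate([np.eye(self.num_classes)[label], [score]])
        h = np.maximum(x @ self.W1 + self.b1, 0.0)  # ReLU hidden layer
        return h @ self.W2 + self.b2

num_classes, embed_dim = 10, 16
enc = DifficultyEncoder(num_classes, embed_dim)
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
s = difficulty_score(logits, label=0)          # confidence-based difficulty
h = enc(label=0, score=s)                      # difficulty embedding

# Conditioning: the difficulty embedding is appended to the text embedding
# sequence (stand-in zeros here in place of CLIP tokens) before cross-attention.
text_emb = np.zeros((5, embed_dim))
tau = np.vstack([text_emb, h[None, :]])
```

The appended row of `tau` plays the role of \(\bm{h}\) in \(\tau = \mathcal{E}_{\text{text}}(p) \oplus \bm{h}\): the U-Net's cross-attention layers can then attend to the difficulty signal alongside the prompt tokens.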

Generation Strategy

During inference, difficulty scores are sampled from \(s \sim \mathcal{N}(\mu=0.5, \sigma=0.1)\). Diverse text prompts generated by BLIP-2 are employed (simple templates are used during training; complex prompts are used during generation to increase diversity).
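As a minimal sketch, the inference-time sampling step might look as follows; clipping to \([0, 1]\) is an assumption on my part, since difficulty scores are defined from classifier confidences in that range:

```python
import numpy as np

# Sample per-image difficulty scores for generation: s ~ N(mu=0.5, sigma=0.1).
rng = np.random.default_rng(42)
scores = rng.normal(loc=0.5, scale=0.1, size=10_000)
scores = np.clip(scores, 0.0, 1.0)  # assumption: keep scores in the valid range
```

Each sampled score would then be passed through the difficulty encoder to condition one generated image, concentrating the synthetic data around medium difficulty.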

Key Experimental Results

Main Results

ResNet-50 classification accuracy on ImageNet (Table 2):

| Method    | Synthetic Data Ratio | Top-1 Acc (%) | GPU Hours |
|-----------|----------------------|---------------|-----------|
| Real only | 0%                   | 78.21         | 0         |
| Real-Fake | 100% (best)          | 78.73         | 158.5     |
| Ours      | 10%                  | 78.74         | 15.9      |
| Ours      | 25%                  | 78.76         | 39.6      |

Results on multiple datasets: +1.2% on CUB, with consistent improvements over Real-Fake on Cars as well.

Ablation Study

| Config (\(\mu\), \(\sigma\))  | Imagenette Acc (%) | Note                        |
|-------------------------------|--------------------|-----------------------------|
| \(\mu=0.5\), \(\sigma=0.1\)   | 96.4               | Best config                 |
| \(\mu=0.3\), \(\sigma=0.1\)   | 95.8               | Skewed easy                 |
| \(\mu=0.7\), \(\sigma=0.1\)   | 96.0               | Skewed hard                 |
| \(\mu=0.9\), \(\sigma=0.1\)   | 95.2               | Too hard, performance drops |

Validation across model architectures (Table 4):

| Model      | Real only | + Real-Fake | + Ours |
|------------|-----------|-------------|--------|
| ResNet-50  | 95.0      | 95.4        | 96.4   |
| ResNet-101 | 95.6      | 95.8        | 96.8   |
| ViT-Small  | 82.6      | 84.8        | 86.0   |

Key Findings

  • Medium-difficulty samples are most valuable: \(\mu\)=0.5 yields the best performance; both overly easy and overly hard samples are suboptimal.
  • Cross-architecture generalization: Difficulty scores computed with ResNet-50 effectively improve ViT-Small training, demonstrating that difficulty features are not architecture-specific.
  • 10% data surpasses SOTA: Only 10% synthetic data (15.9 GPU hours) matches the best result of Real-Fake using 100% data (158.5 GPU hours).
  • Difficulty factor visualization: Generating samples conditioned solely on difficulty scores (without text prompts) reveals category-specific difficulty factors (e.g., for garbage trucks: cluttered scenes = hard, clean backgrounds = easy).

Highlights & Insights

  • Elegant decoupled design — LoRA handles domain alignment while the encoder manages difficulty control, with each component fulfilling a distinct role. This "one condition, one module" decoupling paradigm is generalizable to any multi-conditional generation task.
  • Strong practical value — Only 10% additional synthetic data is needed to surpass the prior SOTA, substantially reducing synthesis cost. The difficulty encoder introduces only an 8% latency overhead.
  • Valuable by-product — The difficulty factor visualization capability can serve as a dataset analysis tool, providing insight into what makes samples difficult.

Limitations & Future Work

  • Classifier-dependent difficulty definition: Different classifiers may yield different difficulty distributions; robustness across classifiers remains to be validated.
  • Evaluated only on classification: Extension to more complex downstream tasks such as detection and segmentation has not been explored.
  • Single scalar difficulty dimension: In practice, difficulty arises from multiple sources (occlusion, lighting, inter-class similarity); a single scalar may be insufficient.
  • Future directions: A multi-dimensional difficulty vector could be introduced to control distinct difficulty factors independently; active learning could be combined to enable the classifier to dynamically specify the most needed difficulty range during training.
Comparison & Positioning

  • vs. Real-Fake: Real-Fake focuses solely on domain alignment and generates predominantly easy samples; the proposed method adds difficulty control and achieves with 10% of the data what Real-Fake requires 100% to accomplish.
  • vs. Curriculum Learning: Traditional curriculum learning reorders samples at training time; this work controls difficulty at the data generation stage, an intervention further upstream in the pipeline.
  • Broader implication: The paradigm of "controlling data properties on the generation side" can be extended to instruction tuning for VLMs, specifically to controlling the difficulty and type of generated VQA samples.

Rating

  • Novelty: ⭐⭐⭐⭐ — The difficulty encoder + LoRA decoupling design is concise and effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-dataset, multi-architecture validation with complete ablations and visualizations.
  • Writing Quality: ⭐⭐⭐⭐⭐ — The motivation chain is clearly articulated (trade-off → decoupling → validation), with excellent figures and tables.
  • Value: ⭐⭐⭐⭐ — Offers direct practical value to the data synthesis community.