Difficulty Controlled Diffusion Model for Synthesizing Effective Training Data¶
- Conference: AAAI 2026
- arXiv: 2411.18109
- Code: https://github.com/komejisatori/Difficulty-Aware-Synthesis
- Area: Image Generation / Data Synthesis
- Keywords: Diffusion Models, Difficulty-Controllable Generation, Training Data Synthesis, Curriculum Learning, Difficulty Encoder
TL;DR¶
A difficulty encoder (an MLP taking the class label and a difficulty score as input) is incorporated into Stable Diffusion, and LoRA fine-tuning decouples the objectives of "domain alignment" and "difficulty control," enabling control over the learning difficulty of synthesized data. With only 10% additional synthetic data, the proposed method surpasses Real-Fake's best result while saving 63.4 GPU hours.
Background & Motivation¶
Background: Synthesizing training data with diffusion models has become a mainstream approach to data augmentation; methods such as Real-Fake achieve domain alignment through fine-tuning.
Limitations of Prior Work: Fine-tuned models tend to capture only the dominant features of the target dataset, producing samples that skew heavily toward low difficulty (i.e., "easy samples"). Yet medium-difficulty samples are most valuable for training (Table 1: medium-difficulty samples yield +0.8%, easy samples only +0.2%, and extremely hard samples actually cause −0.4%), whereas medium-difficulty samples account for only approximately 1% of data generated by Real-Fake.
Key Challenge: Without fine-tuning, domain alignment fails; with fine-tuning, only easy samples are generated. A fundamental trade-off exists between domain alignment and difficulty diversity.
Goal: To maintain domain alignment while controlling the learning difficulty of generated samples.
Key Insight: Learning difficulty is treated as an explicit conditioning signal injected into the diffusion model, with a dedicated difficulty encoder used to decouple domain alignment from difficulty modeling.
Core Idea: The difficulty encoder learns a mapping from difficulty scores to difficulty features, while LoRA handles domain alignment — the two components have clearly separated responsibilities and do not interfere with each other.
Method¶
Overall Architecture¶
- Inputs: Images from the target dataset, class labels, difficulty scores computed by a pretrained classifier, and text prompts.
- Fine-tuning: A difficulty encoder is introduced into Stable Diffusion and the model is fine-tuned with LoRA.
- Generation: Difficulty scores are sampled from a specified distribution \(s \sim \mathcal{N}(\mu, \sigma)\) to guide synthesis.
Key Designs¶
- Difficulty Score Definition:
  - Function: Defines sample difficulty using the prediction confidence of a pretrained classifier.
  - Mechanism: \(s = 1 - c\), where \(c\) is the softmax probability assigned to the ground-truth class. Higher \(s\) indicates greater difficulty.
  - Design Motivation: The definition is straightforward, architecture-agnostic, and directly related to the downstream task.
- Difficulty Encoder \(\mathcal{E}_d\):
  - Function: Maps difficulty scores to latent embeddings that control generation.
  - Mechanism: An MLP receives the concatenation of the class label and difficulty score, \(\bm{h_i} = \mathcal{E}_d([y_i] \oplus [s_i])\); the output embedding is concatenated with the CLIP text embedding and fed into the cross-attention layers of the U-Net. Class information is incorporated because the same difficulty score corresponds to different visual characteristics across categories.
  - Design Motivation: Class conditioning is necessary since the difficulty factors for "garbage truck" (e.g., cluttered scenes) differ fundamentally from those for "golf ball."
- Decoupled Training Strategy:
  - Function: LoRA fine-tunes the U-Net for domain alignment, while the difficulty encoder is trained from scratch for difficulty control.
  - Mechanism: The loss function follows the standard denoising objective \(\mathcal{L} = \mathbb{E}[\|\epsilon - \epsilon_{(\theta,\delta)}(z_t, t, \tau)\|_2^2]\), where the conditioning signal is \(\tau = \mathcal{E}_{text}(p) \oplus \bm{h}\).
  - Design Motivation: The learning objectives of LoRA and the difficulty encoder are naturally disjoint — LoRA learns the domain distribution, while the encoder learns the mapping from difficulty scores to difficulty features.
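The score definition and the encoder's conditioning path can be sketched in a few lines. This is a minimal illustrative stand-in, not the paper's implementation: the MLP widths, the one-hot class encoding, and the 77×768 CLIP token shape are assumptions chosen for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)

def difficulty_score(logits, true_class):
    # s = 1 - c, where c is the softmax probability of the ground-truth class
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return 1.0 - probs[true_class]

class DifficultyEncoder:
    # Toy stand-in for E_d: an MLP over [one-hot class label ; difficulty score]
    def __init__(self, num_classes, hidden_dim, embed_dim):
        in_dim = num_classes + 1  # class one-hot plus scalar score
        self.num_classes = num_classes
        self.w1 = rng.standard_normal((in_dim, hidden_dim)) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, embed_dim)) * 0.02
        self.b2 = np.zeros(embed_dim)

    def __call__(self, class_id, score):
        onehot = np.zeros(self.num_classes)
        onehot[class_id] = 1.0
        x = np.concatenate([onehot, [score]])
        hidden = np.maximum(x @ self.w1 + self.b1, 0.0)  # ReLU
        return hidden @ self.w2 + self.b2

# Difficulty of a sample under a 10-class classifier with uniform logits:
s = difficulty_score(np.zeros(10), true_class=3)  # c = 0.1, so s = 0.9

# Conditioning tau: concatenate h with the text embedding along the token axis
text_emb = rng.standard_normal((77, 768))  # stand-in for CLIP text tokens
enc = DifficultyEncoder(num_classes=1000, hidden_dim=256, embed_dim=768)
h = enc(class_id=3, score=s)
tau = np.concatenate([text_emb, h[None, :]], axis=0)  # shape (78, 768)
```

The final concatenation mirrors \(\tau = \mathcal{E}_{text}(p) \oplus \bm{h}\): the difficulty embedding is appended as an extra "token" that the U-Net's cross-attention can attend to alongside the text.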
Generation Strategy¶
During inference, difficulty scores are sampled from \(s \sim \mathcal{N}(\mu=0.5, \sigma=0.1)\). Diverse text prompts generated by BLIP-2 are employed (simple templates are used during training; complex prompts are used during generation to increase diversity).
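The inference-time sampling of target difficulties is simple to sketch. Clipping the Gaussian draws to the valid score range \([0, 1]\) is our assumption; the summary only states the distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_difficulty_scores(n, mu=0.5, sigma=0.1):
    # s ~ N(mu, sigma), one target difficulty per image to synthesize;
    # clip to [0, 1] since s = 1 - c is a probability-derived score (assumed)
    return np.clip(rng.normal(mu, sigma, size=n), 0.0, 1.0)

scores = sample_difficulty_scores(10_000)
```

With \(\mu = 0.5\), \(\sigma = 0.1\), nearly all mass falls in \([0.2, 0.8]\), concentrating generation on the medium-difficulty band that Table 1 identifies as most valuable.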
Key Experimental Results¶
Main Results¶
ResNet-50 classification accuracy on ImageNet (Table 2):
| Method | Synthetic Data Ratio | Top-1 Acc | GPU Hours |
|---|---|---|---|
| Real only | 0% | 78.21 | 0 |
| Real-Fake | 100% (best) | 78.73 | 158.5 |
| Ours | 10% | 78.74 | 15.9 |
| Ours | 25% | 78.76 | 39.6 |
Results on multiple datasets: +1.2% on CUB, with consistent improvements over Real-Fake on Cars as well.
Ablation Study¶
| Config (\(\mu\), \(\sigma\)) | Imagenette Acc | Note |
|---|---|---|
| \(\mu\)=0.5, \(\sigma\)=0.1 | 96.4 | Best config |
| \(\mu\)=0.3, \(\sigma\)=0.1 | 95.8 | Skewed easy |
| \(\mu\)=0.7, \(\sigma\)=0.1 | 96.0 | Skewed hard |
| \(\mu\)=0.9, \(\sigma\)=0.1 | 95.2 | Too hard, performance drops |
Validation across model architectures (Table 4):
| Model | Real only | + Real-Fake | + Ours |
|---|---|---|---|
| ResNet-50 | 95.0 | 95.4 | 96.4 |
| ResNet-101 | 95.6 | 95.8 | 96.8 |
| ViT-Small | 82.6 | 84.8 | 86.0 |
Key Findings¶
- Medium-difficulty samples are most valuable: \(\mu\)=0.5 yields the best performance; both overly easy and overly hard samples are suboptimal.
- Cross-architecture generalization: Difficulty scores computed with ResNet-50 effectively improve ViT-Small training, demonstrating that difficulty features are not architecture-specific.
- 10% data surpasses SOTA: Only 10% synthetic data (15.9 GPU hours) matches the best result of Real-Fake using 100% data (158.5 GPU hours).
- Difficulty factor visualization: Generating samples conditioned solely on difficulty scores (without text prompts) reveals category-specific difficulty factors (e.g., for garbage trucks: cluttered scenes = hard, clean backgrounds = easy).
Highlights & Insights¶
- Elegant decoupled design — LoRA handles domain alignment while the encoder manages difficulty control, with each component fulfilling a distinct role. This "one condition, one module" decoupling paradigm is generalizable to any multi-conditional generation task.
- Strong practical value — Only 10% additional synthetic data is needed to surpass the prior SOTA, substantially reducing synthesis cost. The difficulty encoder introduces only an 8% latency overhead.
- Valuable by-product — The difficulty factor visualization capability can serve as a dataset analysis tool, providing insight into what makes samples difficult.
Limitations & Future Work¶
- Classifier-dependent difficulty definition: Different classifiers may yield different difficulty distributions; robustness across classifiers remains to be validated.
- Evaluated only on classification: Extension to more complex downstream tasks such as detection and segmentation has not been explored.
- Single scalar difficulty dimension: In practice, difficulty arises from multiple sources (occlusion, lighting, inter-class similarity); a single scalar may be insufficient.
- Future directions: A multi-dimensional difficulty vector could be introduced to control distinct difficulty factors independently; active learning could be combined to enable the classifier to dynamically specify the most needed difficulty range during training.
Related Work & Insights¶
- vs. Real-Fake: Real-Fake focuses solely on domain alignment and generates predominantly easy samples; the proposed method adds difficulty control and achieves with 10% of the data what Real-Fake requires 100% to accomplish.
- vs. Curriculum Learning: Traditional curriculum learning reorders samples at training time; this work controls difficulty at the data generation stage — an intervention further upstream in the pipeline.
- Broader implication: The paradigm of "controlling data properties on the generation side" can be extended to instruction tuning for VLMs — specifically, to controlling the difficulty and type of generated VQA samples.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The difficulty encoder + LoRA decoupling design is concise and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-dataset, multi-architecture validation with complete ablations and visualizations.
- Writing Quality: ⭐⭐⭐⭐⭐ — The motivation chain is clearly articulated (trade-off → decoupling → validation), with excellent figures and tables.
- Value: ⭐⭐⭐⭐ — Offers direct practical value to the data synthesis community.