Difficulty Controlled Diffusion Model for Synthesizing Effective Training Data¶
- Conference: AAAI 2026
- arXiv: 2411.18109
- Code: https://github.com/komejisatori/Difficulty-Aware-Synthesis
- Area: Image Generation / Data Synthesis
- Keywords: Diffusion Models, Difficulty-Controllable Generation, Training Data Synthesis, Curriculum Learning, Difficulty Encoder
TL;DR¶
A difficulty encoder (an MLP taking the class label and a difficulty score as input) is incorporated into Stable Diffusion, and LoRA fine-tuning decouples the objectives of "domain alignment" and "difficulty control," enabling control over the learning difficulty of synthesized data. With only 10% additional synthetic data, the proposed method surpasses Real-Fake's best result while saving 63.4 GPU hours.
Background & Motivation¶
Background: Synthesizing training data with diffusion models has become a mainstream approach to data augmentation; methods such as Real-Fake achieve domain alignment through fine-tuning.
Limitations of Prior Work: Fine-tuned models tend to capture only the dominant features of the target dataset, producing samples that skew heavily toward low difficulty (i.e., "easy samples"). Yet medium-difficulty samples are most valuable for training (Table 1: medium-difficulty samples yield +0.8%, easy samples only +0.2%, and extremely hard samples actually cause −0.4%), whereas medium-difficulty samples account for only approximately 1% of data generated by Real-Fake.
Key Challenge: Without fine-tuning, domain alignment fails; with fine-tuning, only easy samples are generated. A fundamental trade-off exists between domain alignment and difficulty diversity.
Goal: To maintain domain alignment while controlling the learning difficulty of generated samples.
Key Insight: Learning difficulty is treated as an explicit conditioning signal injected into the diffusion model, with a dedicated difficulty encoder used to decouple domain alignment from difficulty modeling.
Core Idea: The difficulty encoder learns a mapping from difficulty scores to difficulty features, while LoRA handles domain alignment — the two components have clearly separated responsibilities and do not interfere with each other.
Method¶
Overall Architecture¶
- Inputs: Images from the target dataset, class labels, difficulty scores computed by a pretrained classifier, and text prompts.
- Fine-tuning: A difficulty encoder is introduced into Stable Diffusion and the model is fine-tuned with LoRA.
- Generation: Difficulty scores are sampled from a specified distribution \(s \sim \mathcal{N}(\mu, \sigma)\) to guide synthesis.
Key Designs¶
- Difficulty Score Definition:
  - Function: Defines sample difficulty using the prediction confidence of a pretrained classifier.
  - Mechanism: \(s = 1 - c\), where \(c\) is the softmax probability assigned to the ground-truth class. Higher \(s\) indicates greater difficulty.
  - Design Motivation: The definition is straightforward, architecture-agnostic, and directly related to the downstream task.
- Difficulty Encoder \(\mathcal{E}_d\):
  - Function: Maps difficulty scores to latent embeddings that control generation.
  - Mechanism: An MLP receives the concatenation of the class label and difficulty score, \(\bm{h_i} = \mathcal{E}_d([y_i] \oplus [s_i])\); the output embedding is concatenated with the CLIP text embedding and fed into the cross-attention layers of the U-Net. Class information is incorporated because the same difficulty score corresponds to different visual characteristics across categories.
  - Design Motivation: Class conditioning is necessary since the difficulty factors for "garbage truck" (e.g., cluttered scenes) differ fundamentally from those for "golf ball."
- Decoupled Training Strategy:
  - Function: LoRA fine-tunes the U-Net for domain alignment, while the difficulty encoder is trained from scratch for difficulty control.
  - Mechanism: The loss function follows the standard denoising objective \(\mathcal{L} = \mathbb{E}[\|\epsilon - \epsilon_{(\theta,\delta)}(z_t, t, \tau)\|_2^2]\), where the conditioning signal is \(\tau = \mathcal{E}_{text}(p) \oplus \bm{h}\).
  - Design Motivation: The learning objectives of LoRA and the difficulty encoder are naturally disjoint — LoRA learns the domain distribution, while the encoder learns the mapping from difficulty scores to difficulty features.
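The score definition and the encoder's conditioning path can be sketched in a few lines. This is a minimal illustrative stand-in, not the paper's implementation: the MLP widths, the one-hot class encoding, and the 77×768 CLIP token shape are assumptions chosen for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)

def difficulty_score(logits, true_class):
    # s = 1 - c, where c is the softmax probability of the ground-truth class
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return 1.0 - probs[true_class]

class DifficultyEncoder:
    # Toy stand-in for E_d: an MLP over [one-hot class label ; difficulty score]
    def __init__(self, num_classes, hidden_dim, embed_dim):
        in_dim = num_classes + 1  # class one-hot plus scalar score
        self.num_classes = num_classes
        self.w1 = rng.standard_normal((in_dim, hidden_dim)) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, embed_dim)) * 0.02
        self.b2 = np.zeros(embed_dim)

    def __call__(self, class_id, score):
        onehot = np.zeros(self.num_classes)
        onehot[class_id] = 1.0
        x = np.concatenate([onehot, [score]])
        hidden = np.maximum(x @ self.w1 + self.b1, 0.0)  # ReLU
        return hidden @ self.w2 + self.b2

# Difficulty of a sample under a 10-class classifier with uniform logits:
s = difficulty_score(np.zeros(10), true_class=3)  # c = 0.1, so s = 0.9

# Conditioning tau: concatenate h with the text embedding along the token axis
text_emb = rng.standard_normal((77, 768))  # stand-in for CLIP text tokens
enc = DifficultyEncoder(num_classes=1000, hidden_dim=256, embed_dim=768)
h = enc(class_id=3, score=s)
tau = np.concatenate([text_emb, h[None, :]], axis=0)  # shape (78, 768)
```

The final concatenation mirrors \(\tau = \mathcal{E}_{text}(p) \oplus \bm{h}\): the difficulty embedding is appended as an extra "token" that the U-Net's cross-attention can attend to alongside the text.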
Generation Strategy¶
During inference, difficulty scores are sampled from \(s \sim \mathcal{N}(\mu=0.5, \sigma=0.1)\). Diverse text prompts generated by BLIP-2 are employed (simple templates are used during training; complex prompts are used during generation to increase diversity).
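The inference-time sampling of target difficulties is simple to sketch. Clipping the Gaussian draws to the valid score range \([0, 1]\) is our assumption; the summary only states the distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_difficulty_scores(n, mu=0.5, sigma=0.1):
    # s ~ N(mu, sigma), one target difficulty per image to synthesize;
    # clip to [0, 1] since s = 1 - c is a probability-derived score (assumed)
    return np.clip(rng.normal(mu, sigma, size=n), 0.0, 1.0)

scores = sample_difficulty_scores(10_000)
```

With \(\mu = 0.5\), \(\sigma = 0.1\), nearly all mass falls in \([0.2, 0.8]\), concentrating generation on the medium-difficulty band that Table 1 identifies as most valuable.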
Key Experimental Results¶
Main Results¶
ResNet-50 classification accuracy on ImageNet (Table 2):
| Method | Synthetic Data Ratio | Top-1 Acc | GPU Hours |
|---|---|---|---|
| Real only | 0% | 78.21 | 0 |
| Real-Fake | 100% (best) | 78.73 | 158.5 |
| Ours | 10% | 78.74 | 15.9 |
| Ours | 25% | 78.76 | 39.6 |
Results on multiple datasets: +1.2% on CUB, with consistent improvements over Real-Fake on Cars as well.
Ablation Study¶
| Config (\(\mu\), \(\sigma\)) | Imagenette Acc | Note |
|---|---|---|
| \(\mu\)=0.5, \(\sigma\)=0.1 | 96.4 | Best config |
| \(\mu\)=0.3, \(\sigma\)=0.1 | 95.8 | Skewed easy |
| \(\mu\)=0.7, \(\sigma\)=0.1 | 96.0 | Skewed hard |
| \(\mu\)=0.9, \(\sigma\)=0.1 | 95.2 | Too hard, performance drops |
Validation across model architectures (Table 4):
| Model | Real only | + Real-Fake | + Ours |
|---|---|---|---|
| ResNet-50 | 95.0 | 95.4 | 96.4 |
| ResNet-101 | 95.6 | 95.8 | 96.8 |
| ViT-Small | 82.6 | 84.8 | 86.0 |
Key Findings¶
- Medium-difficulty samples are most valuable: \(\mu\)=0.5 yields the best performance; both overly easy and overly hard samples are suboptimal.
- Cross-architecture generalization: Difficulty scores computed with ResNet-50 effectively improve ViT-Small training, demonstrating that difficulty features are not architecture-specific.
- 10% data surpasses SOTA: Only 10% synthetic data (15.9 GPU hours) matches the best result of Real-Fake using 100% data (158.5 GPU hours).
- Difficulty factor visualization: Generating samples conditioned solely on difficulty scores (without text prompts) reveals category-specific difficulty factors (e.g., for garbage trucks: cluttered scenes = hard, clean backgrounds = easy).
Highlights & Insights¶
- Elegant decoupled design — LoRA handles domain alignment while the encoder manages difficulty control, with each component fulfilling a distinct role. This "one condition, one module" decoupling paradigm is generalizable to any multi-conditional generation task.
- Strong practical value — Only 10% additional synthetic data is needed to surpass the prior SOTA, substantially reducing synthesis cost. The difficulty encoder introduces only an 8% latency overhead.
- Valuable by-product — The difficulty factor visualization capability can serve as a dataset analysis tool, providing insight into what makes samples difficult.
Limitations & Future Work¶
- Classifier-dependent difficulty definition: Different classifiers may yield different difficulty distributions; robustness across classifiers remains to be validated.
- Evaluated only on classification: Extension to more complex downstream tasks such as detection and segmentation has not been explored.
- Single scalar difficulty dimension: In practice, difficulty arises from multiple sources (occlusion, lighting, inter-class similarity); a single scalar may be insufficient.
- Future directions: A multi-dimensional difficulty vector could be introduced to control distinct difficulty factors independently; active learning could be combined to enable the classifier to dynamically specify the most needed difficulty range during training.
Related Work & Insights¶
- vs. Real-Fake: Real-Fake focuses solely on domain alignment and generates predominantly easy samples; the proposed method adds difficulty control and achieves with 10% of the data what Real-Fake requires 100% to accomplish.
- vs. Curriculum Learning: Traditional curriculum learning reorders samples at training time; this work controls difficulty at the data generation stage — an intervention further upstream in the pipeline.
- Broader implication: The paradigm of "controlling data properties on the generation side" can be extended to instruction tuning for VLMs — specifically, to controlling the difficulty and type of generated VQA samples.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The difficulty encoder + LoRA decoupling design is concise and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-dataset, multi-architecture validation with complete ablations and visualizations.
- Writing Quality: ⭐⭐⭐⭐⭐ — The motivation chain is clearly articulated (trade-off → decoupling → validation), with excellent figures and tables.
- Value: ⭐⭐⭐⭐ — Offers direct practical value to the data synthesis community.