lbGen: Low-Biased General Annotated Dataset Generation
Conference: CVPR 2025
arXiv: 2412.10831
Code: https://github.com/vvvvvjdy/lbGen
Area: Image Generation
Keywords: dataset bias, synthetic dataset, diffusion model, CLIP, bi-level semantic alignment, quality assurance, transfer learning
TL;DR
lbGen fine-tunes Stable Diffusion with a bi-level semantic alignment loss (a dataset-level adversarial term plus a per-image cosine-similarity term) and a quality assurance loss. Using only category names as input, the fine-tuned model generates a low-biased general annotated dataset. Backbones pretrained on this dataset outperform those pretrained on real ImageNet data by 1.7 to 2.1 percentage points in average transfer accuracy.
Background & Motivation
Background: Pretraining backbone networks on general annotated datasets (such as ImageNet) is a foundational step for various vision tasks. Recent advances in diffusion models have enabled the direct synthesis of annotated image data.
Limitations of Prior Work: (1) Manually collected datasets like ImageNet exhibit implicit data biases (e.g., fixed backgrounds, styles, and object locations for specific categories). Backbone networks capture these non-transferable shortcut features during pretraining, leading to degraded cross-domain/cross-category generalization. (2) Existing synthetic datasets (e.g., GenRobust, RealFake) mainly mimic the ImageNet distribution without considering bias reduction. (3) Re-collecting low-biased data manually is extremely expensive and inevitably introduces new biases.
Key Challenge: High accuracy on the ImageNet validation set does not equate to strong generalization capability — bias forces the model to rely on shortcut features instead of transferable semantic features.
Key Insight: Leveraging the low-biased semantic space defined by CLIP, this work fine-tunes a diffusion model via reinforcement learning to directly generate low-biased images that align with the semantic distribution, without utilizing any external biased images.
Method
Overall Architecture
The method fine-tunes Stable Diffusion 1.5 with LoRA and takes only the 1000 category names of ImageNet-1K as input. The training framework consists of two modules, a bi-level semantic alignment module (core) and a quality assurance module (auxiliary), both optimized under a reinforcement learning paradigm.
Key Designs
1. Entire Dataset Alignment
- Function: Aligns the CLIP feature distribution of all generated images with the overall semantic distribution of the 1000 text categories.
- Mechanism: A Linear-ReLU-Linear discriminator \(\mathcal{D}_\phi\) is trained adversarially. The text feature \(f_{c_j}\) of a randomly selected category \(c_j\) that differs from the category of the current image serves as the positive sample, and the generated image feature \(f_{im_i}\) serves as the negative sample:
\[\mathcal{L}_{en} = \log(\mathcal{D}_\phi(f_{c_j})) + \log(1 - \mathcal{D}_\phi(f_{im_i}))\]
- Design Motivation: By avoiding matching text features of the same class, the entire synthetic dataset distribution is encouraged to approach the global distribution of the semantic space rather than performing category-level alignment.
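To make the adversarial term concrete, the following is a minimal sketch of how \(\mathcal{L}_{en}\) could be evaluated for one image. It is not the authors' implementation: the final sigmoid on the discriminator, the helper names (`linear`, `discriminator`, `sample_other_category`, `entire_dataset_loss`), the `eps` guard, and the random toy vectors standing in for CLIP features are all assumptions made for illustration.

```python
import math
import random


def linear(x, weight, bias):
    """Dense layer; `weight` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def discriminator(x, params):
    """Linear-ReLU-Linear head; the final sigmoid is an assumption of this sketch."""
    hidden = [max(0.0, h) for h in linear(x, params["w1"], params["b1"])]
    logit = linear(hidden, params["w2"], params["b2"])[0]
    return 1.0 / (1.0 + math.exp(-logit))


def sample_other_category(current, num_categories, rng):
    """Uniformly pick a category index j with j != current."""
    j = rng.randrange(num_categories - 1)
    return j if j < current else j + 1


def entire_dataset_loss(text_feat, image_feat, params, eps=1e-8):
    """L_en = log D(f_cj) + log(1 - D(f_im_i)); eps guards against log(0)."""
    d_text = discriminator(text_feat, params)
    d_image = discriminator(image_feat, params)
    return math.log(d_text + eps) + math.log(1.0 - d_image + eps)


# Toy data: random vectors stand in for CLIP text and image features.
rng = random.Random(0)
dim, hidden_dim, num_categories = 4, 8, 10
params = {
    "w1": [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden_dim)],
    "b1": [0.0] * hidden_dim,
    "w2": [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]],
    "b2": [0.0],
}
text_features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_categories)]
image_feature = [rng.gauss(0, 1) for _ in range(dim)]

i = 3  # category of the current generated image
j = sample_other_category(i, num_categories, rng)  # positive sample: a different category
loss_en = entire_dataset_loss(text_features[j], image_feature, params)
print(f"image category {i}, sampled text category {j}, L_en = {loss_en:.4f}")
```

In a standard adversarial setup the discriminator would be updated to increase this value while the generator is updated to decrease it. The summary above does not spell out that split, so treat it as the conventional reading, not a confirmed detail.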
2. Individual Image Alignment
- Function: Ensures each generated image precisely matches the semantic description of its corresponding category.
- Mechanism: A simple prompt "photo of \(c_i\)" serves as the low-biased semantic description, and the cosine similarity between the CLIP image feature \(f_{im_i}\) and the CLIP text feature \(f_{p_{c_i}}\) of that prompt is maximized by minimizing:
\[\mathcal{L}_{in} = 1 - \frac{f_{im_i} \cdot f_{p_{c_i}}}{\|f_{im_i}\| \cdot \|f_{p_{c_i}}\|}\]
- Design Motivation: While global alignment guarantees distribution consistency, it cannot precisely control the specific category of each image, which necessitates individual-level constraints.
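The per-image term is a plain cosine distance, which the short sketch below evaluates on hand-picked vectors. The function names `build_prompt` and `individual_alignment_loss` are hypothetical, and the toy vectors stand in for CLIP features that a real pipeline would obtain from the CLIP encoders.

```python
import math


def build_prompt(category_name):
    """Low-biased description used for category c_i: 'photo of c_i'."""
    return f"photo of {category_name}"


def individual_alignment_loss(image_feat, text_feat):
    """L_in = 1 - cosine similarity between image and prompt features."""
    dot = sum(a * b for a, b in zip(image_feat, text_feat))
    norm_image = math.sqrt(sum(a * a for a in image_feat))
    norm_text = math.sqrt(sum(b * b for b in text_feat))
    return 1.0 - dot / (norm_image * norm_text)


prompt = build_prompt("goldfish")

# Parallel features give a loss near 0, orthogonal ones 1, opposite ones 2.
aligned = individual_alignment_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = individual_alignment_loss([1.0, 0.0], [0.0, 1.0])
opposite = individual_alignment_loss([1.0, 0.0], [-1.0, 0.0])
print(prompt, aligned, orthogonal, opposite)
```

Because cosine similarity lies in \([-1, 1]\), the loss lies in \([0, 2]\) and does not depend on feature magnitude, only on direction.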
3. Quality Assurance
- Function: Prevents image quality degradation caused by the semantic alignment training.
- Mechanism: Converts the score \(Q(im_i)\) (ranging from [1, 5]) of the Q-ALIGN image quality assessment model into a loss:
\[\mathcal{L}_q = 1 - \frac{Q(im_i)}{5}\]
- Design Motivation: Relying solely on semantic constraints leads to style or quality degradation; the quality assurance loss establishes a baseline for image fidelity.
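As a quick worked example of the quality term, the sketch below maps a few scores to losses. The function name `quality_loss` and the range check are illustrative additions; the score itself would come from Q-ALIGN, which is not called here.

```python
def quality_loss(score, max_score=5.0, min_score=1.0):
    """L_q = 1 - Q / 5 for a quality score Q in [1, 5]."""
    if not min_score <= score <= max_score:
        raise ValueError(f"score {score} is outside [{min_score}, {max_score}]")
    return 1.0 - score / max_score


# Worst, middle, and best possible scores.
losses = {score: quality_loss(score) for score in (1.0, 3.0, 5.0)}
for score, loss in losses.items():
    print(f"Q = {score:.1f} -> L_q = {loss:.2f}")
```

Since the score never drops below 1, the loss ranges over \([0, 0.8]\) and not the full unit interval.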
Loss & Training
The training objective combines the bi-level alignment loss \(\mathcal{L}_{bi} = \mathcal{L}_{en} + \mathcal{L}_{in}\) with the quality assurance loss \(\mathcal{L}_q\). The model is trained using a reinforcement learning paradigm, and gradients are computed at only 5 of the 50 denoising steps to save GPU memory.
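The memory-saving trick can be pictured as a schedule that flags which denoising steps keep gradients. The sketch below is only an illustration: the summary does not say how the 5 steps are chosen, so uniform random selection is an assumption, and the names `select_gradient_steps` and `denoising_schedule` are hypothetical.

```python
import random


def select_gradient_steps(total_steps=50, grad_steps=5, seed=None):
    """Pick which denoising steps keep gradients (uniform sampling is assumed)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_steps), grad_steps))


def denoising_schedule(total_steps=50, grad_steps=5, seed=None):
    """Return (step index, track_gradient) pairs for one sampling trajectory."""
    chosen = set(select_gradient_steps(total_steps, grad_steps, seed))
    return [(step, step in chosen) for step in range(total_steps)]


schedule = denoising_schedule(seed=0)
tracked = [step for step, track in schedule if track]
print(f"gradients kept at steps {tracked}; {len(schedule) - len(tracked)} steps run without them")
```

In a deep learning framework, the unflagged steps would typically run with gradient tracking disabled, so that activations are stored for only a tenth of the trajectory.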
Key Experimental Results
Main Results: Average Top-1 Accuracy on Eight Transfer Learning Datasets
| Backbone | Pretraining Data | IN-val | 8 Datasets Avg. |
|---|---|---|---|
| ResNet50 | IN-Real | 76.2 | 71.5 |
| ResNet50 | IN-RealFake | 69.8 | 71.8 |
| ResNet50 | IN-lbGen | 46.1 | 73.2 |
| ViT-S | IN-Real | 78.7 | 72.3 |
| ViT-S | IN-RealFake | 72.3 | 70.8 |
| ViT-S | IN-lbGen | 46.3 | 74.4 |
Key Findings: Although backbones pretrained on lbGen reach only about 46% IN-val accuracy, their average transfer accuracy exceeds that of IN-Real pretraining by 1.7 points (ResNet50) and 2.1 points (ViT-S). This suggests that ImageNet validation accuracy is not a reliable indicator of generalization capability.
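The gains quoted above follow directly from the table. The short script below recomputes them, using only the numbers reported in the table; the dictionary layout and the helper name `delta_vs_real` are illustrative.

```python
# (IN-val top-1, 8-dataset average top-1), copied from the table above.
results = {
    ("ResNet50", "IN-Real"): (76.2, 71.5),
    ("ResNet50", "IN-RealFake"): (69.8, 71.8),
    ("ResNet50", "IN-lbGen"): (46.1, 73.2),
    ("ViT-S", "IN-Real"): (78.7, 72.3),
    ("ViT-S", "IN-RealFake"): (72.3, 70.8),
    ("ViT-S", "IN-lbGen"): (46.3, 74.4),
}


def delta_vs_real(backbone, data):
    """Differences (IN-val, transfer average) relative to IN-Real pretraining."""
    in_val, transfer = results[(backbone, data)]
    real_in_val, real_transfer = results[(backbone, "IN-Real")]
    return round(in_val - real_in_val, 1), round(transfer - real_transfer, 1)


for backbone in ("ResNet50", "ViT-S"):
    d_val, d_transfer = delta_vs_real(backbone, "IN-lbGen")
    print(f"{backbone}: IN-val {d_val:+.1f}, transfer average {d_transfer:+.1f}")
```

For both backbones the IN-val accuracy drops by roughly 30 points while the transfer average rises, which is the gap the findings paragraph refers to.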
Visual Perception Tasks (COCO Detection / ADE20K Segmentation)
| Pretraining Data | COCO AP^box (0.2×) | ADE20K mIoU (0.2×) |
|---|---|---|
| IN-Real | 29.14 | 32.10 |
| IN-lbGen | 30.68 (+1.54) | 33.57 (+1.47) |
The advantage of lbGen is largest when only 20% of the downstream training data is used (the 0.2× setting reported in the table).
Bias Metric Experiments
| Backbone | Pretraining Data | TI↓ (Texture Bias) | CB_avg↑ (Context) | BG_Gap↓ (Background) |
|---|---|---|---|---|
| ResNet50 | IN-Real | 60.9 | 60.0 | 6.8 |
| ResNet50 | IN-lbGen | 56.1 | 64.7 | 6.4 |
| ViT-S | IN-Real | 67.0 | 61.8 | 6.7 |
| ViT-S | IN-lbGen | 57.2 | 66.0 | 6.1 |
On all three bias metrics and for both backbones, models pretrained on IN-lbGen score better than those pretrained on real data: lower TI, higher CB_avg, and a smaller BG_Gap.
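Because the three metrics point in different directions (two are lower-is-better, one is higher-is-better), a small check helps confirm the claim row by row. The script uses only the table values; the data layout and the helper name `improvement` are illustrative.

```python
# Each entry: preferred direction, then (IN-Real, IN-lbGen) per backbone, from the table above.
bias_table = {
    "TI": {"better": "lower", "ResNet50": (60.9, 56.1), "ViT-S": (67.0, 57.2)},
    "CB_avg": {"better": "higher", "ResNet50": (60.0, 64.7), "ViT-S": (61.8, 66.0)},
    "BG_Gap": {"better": "lower", "ResNet50": (6.8, 6.4), "ViT-S": (6.7, 6.1)},
}


def improvement(metric, backbone):
    """Return (lbGen minus real, whether the change goes in the preferred direction)."""
    real, lbgen = bias_table[metric][backbone]
    delta = round(lbgen - real, 1)
    improved = delta < 0 if bias_table[metric]["better"] == "lower" else delta > 0
    return delta, improved


for metric in bias_table:
    for backbone in ("ResNet50", "ViT-S"):
        delta, improved = improvement(metric, backbone)
        print(f"{metric:7s} {backbone:9s} {delta:+.1f} improved={improved}")
```

The largest single change is the texture-bias drop for ViT-S (67.0 to 57.2), while the background gap shrinks by well under one point for both backbones.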
Key Findings
- Data bias is quantifiable: High IN-val accuracy \(\neq\) high generalization; bias is the root cause.
- Greater benefits in few-shot scenarios: The less downstream data available, the more prominent the advantages of lbGen become (Figure 3).
- Efficacy of semantic space alignment: The text semantic space of CLIP indeed provides a low-biased representation anchor.
Highlights & Insights
- First to directly generate a low-biased dataset: Diverging from the traditional "collect then de-bias" paradigm, this work directly addresses the bias problem from the generation end.
- Zero-image training: The diffusion model is fine-tuned using only 1000 category names, without introducing any external biased images.
- Counter-intuitive finding: Pretraining on synthetic data that yields only 46% IN-val accuracy transfers better than pretraining on real data that yields 76% IN-val accuracy.
- Lightweight: Training costs are kept highly manageable through LoRA fine-tuning and a 5-step gradient strategy.
Limitations & Future Work
- The low IN-val accuracy (about 46%) calls for caution when using lbGen-pretrained backbones in in-domain, ImageNet-like scenarios.
- Validated only on 1K categories; the scaling behavior on larger category sets (e.g., 21K) remains to be verified.
- Relies heavily on the quality of CLIP's semantic space — CLIP itself may exhibit inherent biases.
- The quality assurance module utilizes the scoring model Q-ALIGN, which might introduce implicit quality-preference bias.
- Evaluated only on ResNet50 and ViT-S; whether the advantages persist for larger models (e.g., ViT-L) remains unknown.
Related Work & Insights
- RealFake (Yuan et al.): Synthesizes data after learning the ImageNet distribution but does not mitigate bias \(\rightarrow\) essentially replicating the biases.
- GenRobust (Bansal et al.): Fine-tunes a diffusion model on ImageNet and uses carefully designed prompts \(\rightarrow\) still constrained by the original distribution.
- CLIP alignment: Using CLIP's multimodal alignment capability as a "de-biasing" tool is a paradigm worth promoting, and it could extend to other scenarios that require low-bias representations (e.g., fairness, domain adaptation).
Rating
⭐⭐⭐⭐ — High novelty in perspective (addressing data bias from the generation end for the first time), with comprehensive experiments covering bias metrics; the counter-intuitive results are convincing. However, the assumption of relying on CLIP's semantic space requires deeper theoretical support.