
lbGen: Low-Biased General Annotated Dataset Generation¶

Conference: CVPR 2025
arXiv: 2412.10831
Code: https://github.com/vvvvvjdy/lbGen
Area: Image Generation
Keywords: dataset bias, synthetic dataset, diffusion model, CLIP, bi-level semantic alignment, quality assurance, transfer learning

TL;DR¶

lbGen fine-tunes Stable Diffusion with a bi-level semantic alignment loss (a global adversarial term plus a per-image cosine-similarity term) and a quality assurance loss. Using only category names as input, it generates a low-biased general annotated dataset. Backbones pretrained on this dataset outperform those pretrained on real ImageNet data by 1.7 to 2.1 percentage points in average transfer accuracy across eight downstream datasets.

Background & Motivation¶

Background: Pretraining backbone networks on general annotated datasets (such as ImageNet) is a foundational step for various vision tasks. Recent advances in diffusion models have enabled the direct synthesis of annotated image data.

Limitations of Prior Work: (1) Manually collected datasets like ImageNet exhibit implicit data biases (e.g., fixed backgrounds, styles, and object locations for specific categories). Backbone networks capture these non-transferable shortcut features during pretraining, leading to degraded cross-domain/cross-category generalization. (2) Existing synthetic datasets (e.g., GenRobust, RealFake) mainly mimic the ImageNet distribution without considering bias reduction. (3) Re-collecting low-biased data manually is extremely expensive and inevitably introduces new biases.

Key Challenge: High accuracy on the ImageNet validation set does not equate to strong generalization capability — bias forces the model to rely on shortcut features instead of transferable semantic features.

Key Insight: Leveraging the low-biased semantic space defined by CLIP, this work fine-tunes a diffusion model via reinforcement learning to directly generate low-biased images that align with the semantic distribution, without utilizing any external biased images.

Method¶

Overall Architecture¶

The method fine-tunes Stable Diffusion 1.5 with LoRA and takes only the 1000 category names of ImageNet-1K as input. Training combines two modules, a bi-level semantic alignment module (core) and a quality assurance module (auxiliary), optimized under a reinforcement learning paradigm.

Key Designs¶

1. Entire Dataset Alignment
   - Function: Aligns the CLIP feature distribution of all generated images with the overall semantic distribution of the 1000 text categories.
   - Mechanism: Using a Linear-ReLU-Linear discriminator $\mathcal{D}_\phi$, adversarial learning is performed by randomly selecting text features of categories different from the current image as positive samples, and the generated image features as negative samples:

     $$\mathcal{L}_{en} = \log(\mathcal{D}_\phi(f_{c_j})) + \log(1 - \mathcal{D}_\phi(f_{im_i}))$$
   - Design Motivation: By avoiding matching text features of the same class, the entire synthetic dataset distribution is encouraged to approach the global distribution of the semantic space rather than performing category-level alignment.

2. Individual Image Alignment
   - Function: Ensures each generated image precisely matches the semantic description of its corresponding category.
   - Mechanism: Using a simple "photo of $c_i$" as the low-biased semantic description, the CLIP image-text cosine similarity is maximized:

     $$\mathcal{L}_{in} = 1 - \frac{f_{im_i} \cdot f_{p_{c_i}}}{\|f_{im_i}\| \cdot \|f_{p_{c_i}}\|}$$
   - Design Motivation: While global alignment guarantees distribution consistency, it cannot precisely control the specific category of each image, which necessitates individual-level constraints.

3. Quality Assurance
   - Function: Prevents image quality degradation caused by the semantic alignment training.
   - Mechanism: Converts the score $Q(im_i)$ (ranging from [1, 5]) of the Q-ALIGN image quality assessment model into a loss:

     $$\mathcal{L}_q = 1 - \frac{Q(im_i)}{5}$$
   - Design Motivation: Relying solely on semantic constraints leads to style or quality degradation; the quality assurance loss establishes a baseline for image fidelity.
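The three loss terms above can be sketched numerically in plain Python. This is a minimal illustration, not the authors' implementation: the class `TinyDiscriminator`, the helper names, the sigmoid on the discriminator output (needed so the logarithms are defined), and the hand-written feature vectors standing in for CLIP embeddings are all assumptions made for this example.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


class TinyDiscriminator:
    """Linear-ReLU-Linear scorer; the sigmoid output is an assumption of this sketch."""

    def __init__(self, dim, hidden, rng):
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden)]
        self.b2 = 0.0

    def __call__(self, x):
        hidden = [max(0.0, dot(row, x) + b) for row, b in zip(self.w1, self.b1)]
        logit = dot(self.w2, hidden) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))  # value in (0, 1)


def entire_dataset_loss(disc, text_feat_other_class, image_feat):
    # L_en = log D(f_cj) + log(1 - D(f_im_i)), where c_j differs from the image's class
    return math.log(disc(text_feat_other_class)) + math.log(1.0 - disc(image_feat))


def individual_loss(image_feat, prompt_feat):
    # L_in = 1 - cosine similarity between the image and its "photo of c_i" prompt
    return 1.0 - dot(image_feat, prompt_feat) / (norm(image_feat) * norm(prompt_feat))


def quality_loss(quality_score):
    # L_q = 1 - Q / 5, with Q in [1, 5]
    return 1.0 - quality_score / 5.0


# Stand-in feature vectors (real CLIP embeddings would be much higher-dimensional).
rng = random.Random(0)
disc = TinyDiscriminator(dim=4, hidden=8, rng=rng)
f_text_other = [0.5, -1.0, 0.25, 2.0]   # text feature of a different category
f_prompt_own = [1.0, 0.1, -0.4, 0.8]    # "photo of c_i" feature for the image's category
f_image = [1.0, 0.0, -0.5, 0.75]        # generated image feature

l_en = entire_dataset_loss(disc, f_text_other, f_image)
l_in = individual_loss(f_image, f_prompt_own)
l_q = quality_loss(4.0)
```

In a typical adversarial setup the discriminator would be updated to increase `l_en` while the generator is updated to decrease it; that reading is an assumption here, since the summary only gives the formula.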

Loss & Training¶

\[\mathcal{L} = \mathcal{L}_{bi} + \lambda_1 \mathcal{L}_q\]

where $\mathcal{L}_{bi} = \mathcal{L}_{en} + \mathcal{L}_{in}$ is the bi-level alignment loss and $\lambda_1$ weights the quality assurance term. The model is trained using a reinforcement learning paradigm; to save GPU memory, gradients are computed at only 5 of the 50 denoising steps.
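As a rough sketch of the objective and the 5-of-50 gradient strategy, the snippet below combines the loss terms and picks which denoising steps would keep gradients. The summary does not say how the 5 steps are chosen or what value $\lambda_1$ takes, so the uniform random selection, the function names, and the example weight are hypothetical.

```python
import random


def total_loss(l_en, l_in, l_q, lambda_1):
    # L = L_bi + lambda_1 * L_q, with L_bi = L_en + L_in
    l_bi = l_en + l_in
    return l_bi + lambda_1 * l_q


def pick_grad_steps(num_steps=50, num_grad_steps=5, rng=None):
    """Choose the steps that keep gradients (uniform sampling is an assumption)."""
    rng = rng or random.Random()
    return sorted(rng.sample(range(num_steps), num_grad_steps))


def denoising_plan(num_steps, grad_steps):
    """Return (step, needs_grad) pairs; a real loop would run the other steps without gradients."""
    grad_steps = set(grad_steps)
    return [(t, t in grad_steps) for t in range(num_steps)]


steps = pick_grad_steps(rng=random.Random(0))
plan = denoising_plan(50, steps)
example = total_loss(l_en=0.5, l_in=0.25, l_q=0.5, lambda_1=0.5)  # hypothetical values
```

Only the steps flagged in `plan` would store activations for backpropagation, which is where the memory saving comes from.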

Key Experimental Results¶

Main Results — Average Top-1 Accuracy on Eight Transfer Learning Datasets¶

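| Backbone | Pretraining Data | IN-val Top-1 (%) | 8 Datasets Avg. Top-1 (%) |
|---|---|---|---|
| ResNet50 | IN-Real | 76.2 | 71.5 |
| ResNet50 | IN-RealFake | 69.8 | 71.8 |
| ResNet50 | IN-lbGen | 46.1 | 73.2 |
| ViT-S | IN-Real | 78.7 | 72.3 |
| ViT-S | IN-RealFake | 72.3 | 70.8 |
| ViT-S | IN-lbGen | 46.3 | 74.4 |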

Key Findings: Backbones pretrained on lbGen reach only about 46% IN-val accuracy, yet their average transfer accuracy is the highest for each backbone (73.2 vs. 71.5 for ResNet50 and 74.4 vs. 72.3 for ViT-S, relative to IN-Real). ImageNet validation accuracy is therefore not a reliable indicator of generalization capability.
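The gains quoted in the TL;DR follow directly from the table. The short check below recomputes them; the dictionary layout and function name are just for illustration, and the numbers are copied from the table above.

```python
# (IN-val accuracy, 8-dataset average accuracy), copied from the table above
RESULTS = {
    ("ResNet50", "IN-Real"): (76.2, 71.5),
    ("ResNet50", "IN-RealFake"): (69.8, 71.8),
    ("ResNet50", "IN-lbGen"): (46.1, 73.2),
    ("ViT-S", "IN-Real"): (78.7, 72.3),
    ("ViT-S", "IN-RealFake"): (72.3, 70.8),
    ("ViT-S", "IN-lbGen"): (46.3, 74.4),
}


def gain_over_real(backbone, data, column):
    """Difference to the IN-Real row; column 0 is IN-val, column 1 is the transfer average."""
    delta = RESULTS[(backbone, data)][column] - RESULTS[(backbone, "IN-Real")][column]
    return round(delta, 1)


for backbone in ("ResNet50", "ViT-S"):
    in_val = gain_over_real(backbone, "IN-lbGen", 0)
    transfer = gain_over_real(backbone, "IN-lbGen", 1)
    print(f"{backbone}: IN-val {in_val:+.1f}, transfer avg {transfer:+.1f}")
```

The transfer gap has the opposite sign to the IN-val gap for both backbones, which is the counter-intuitive pattern the section describes.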

Visual Perception Tasks (COCO Detection / ADE20K Segmentation)¶

| Pretraining Data | COCO $AP^{box}$ (0.2×) | ADE20K mIoU (0.2×) |
|---|---|---|
| IN-Real | 29.14 | 32.10 |
| IN-lbGen | 30.68 (+1.54) | 33.57 (+1.47) |

The 0.2× columns correspond to using 20% of the downstream data, the setting in which lbGen shows its largest advantage.

Bias Metric Experiments¶

| Backbone | Pretraining Data | TI↓ (Texture Bias) | CB_avg↑ (Context) | BG_Gap↓ (Background) |
|---|---|---|---|---|
| ResNet50 | IN-Real | 60.9 | 60.0 | 6.8 |
| ResNet50 | IN-lbGen | 56.1 | 64.7 | 6.4 |
| ViT-S | IN-Real | 67.0 | 61.8 | 6.7 |
| ViT-S | IN-lbGen | 57.2 | 66.0 | 6.1 |

On all three bias metrics and for both backbones, models pretrained on IN-lbGen improve on those pretrained on real data: lower TI, higher CB_avg, and a smaller BG_Gap.
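Because the arrows mean that lower is better for some metrics and higher is better for others, a direction-aware comparison can help when reading the table. The sketch below is only an illustration with hypothetical names; the values are copied from the table above.

```python
# Preferred direction per metric, taken from the arrows in the table above
DIRECTIONS = {"TI": "lower", "CB_avg": "higher", "BG_Gap": "lower"}

SCORES = {
    "ResNet50": {
        "IN-Real": {"TI": 60.9, "CB_avg": 60.0, "BG_Gap": 6.8},
        "IN-lbGen": {"TI": 56.1, "CB_avg": 64.7, "BG_Gap": 6.4},
    },
    "ViT-S": {
        "IN-Real": {"TI": 67.0, "CB_avg": 61.8, "BG_Gap": 6.7},
        "IN-lbGen": {"TI": 57.2, "CB_avg": 66.0, "BG_Gap": 6.1},
    },
}


def improvement(backbone, metric):
    """Positive result means IN-lbGen is better than IN-Real in the metric's preferred direction."""
    delta = SCORES[backbone]["IN-lbGen"][metric] - SCORES[backbone]["IN-Real"][metric]
    if DIRECTIONS[metric] == "lower":
        delta = -delta
    return round(delta, 1)


improvements = {
    (backbone, metric): improvement(backbone, metric)
    for backbone in SCORES
    for metric in DIRECTIONS
}
```

Every entry in `improvements` is positive, matching the statement that lbGen is better on all three metrics for both backbones.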

Key Findings¶

- Data bias is quantifiable: High IN-val accuracy does not imply high generalization, and the authors attribute the gap to dataset bias.
- Greater benefits in few-shot scenarios: The less downstream data is available, the larger the advantage of lbGen becomes (Figure 3).
- Efficacy of semantic space alignment: The results support using CLIP's text semantic space as a low-biased representation anchor.

Highlights & Insights¶

- First to directly generate a low-biased dataset: Diverging from the traditional "collect then de-bias" paradigm, this work addresses the bias problem at generation time.
- Zero-image training: The diffusion model is fine-tuned using only 1000 category names, without introducing any external biased images.
- Counter-intuitive finding: Synthetic data that yields 46% IN-val accuracy transfers better than real data that yields 76% IN-val accuracy.
- Lightweight: Training cost stays manageable thanks to LoRA fine-tuning and computing gradients at only 5 denoising steps.

Limitations & Future Work¶

- The very low IN-val accuracy (46%) calls for caution when applying the dataset to in-domain scenarios.
- Validated only on 1K categories; scaling to larger label sets (e.g., 21K categories) remains to be verified.
- Relies heavily on the quality of CLIP's semantic space, and CLIP itself may exhibit inherent biases.
- The quality assurance module uses the Q-ALIGN scoring model, which might introduce an implicit quality-preference bias.
- Evaluated only on ResNet50 and ViT-S; whether the advantages persist for larger models (e.g., ViT-L) remains unknown.

Related Work & Insights¶

- RealFake (Yuan et al.): Synthesizes data after learning the ImageNet distribution but does not mitigate bias, so it essentially replicates the existing biases.
- GenRobust (Bansal et al.): Fine-tunes a diffusion model on ImageNet and uses carefully designed prompts, so it is still constrained by the original distribution.
- CLIP alignment: Using CLIP's multimodal alignment capability as a "de-biasing" tool is a promising paradigm that may extend to other scenarios requiring low-bias representations (e.g., fairness, domain adaptation).

Rating¶

⭐⭐⭐⭐ — High novelty in perspective (addressing data bias from the generation end for the first time), with comprehensive experiments covering bias metrics; the counter-intuitive results are convincing. However, the assumption of relying on CLIP's semantic space requires deeper theoretical support.