
AugGen: Synthetic Augmentation using Diffusion Models Can Improve Recognition

Conference: NeurIPS 2025
arXiv: 2503.11544
Code: https://parsa-ra.github.io/auggen/ (project page)
Area: Image Generation / Synthetic Data Augmentation
Keywords: synthetic data augmentation, class-conditional diffusion models, face recognition, class mixing, self-contained framework

TL;DR

This paper proposes AugGen, a self-contained synthetic data augmentation method that trains a class-conditional diffusion model on the target dataset, generates new "mixed-class" samples by interpolating class conditioning vectors across different identities, and uses the resulting augmented data to improve discriminative model training. AugGen achieves 1–12% performance gains on face recognition benchmarks without relying on any external data or auxiliary models.

Background & Motivation

The reliance of machine learning on large-scale datasets raises significant privacy and ethical concerns, particularly in sensitive domains such as face recognition (FR). Synthetic data generation has emerged as an alternative, but existing methods (e.g., DCFace, IDiff-Face) depend heavily on external datasets and large pretrained models, increasing both complexity and resource requirements. Moreover, the external data used by these methods may itself carry privacy concerns (e.g., MS1Mv2).

AugGen is motivated by the question: can a generative model act as an "augmenter" of the real training data rather than a replacement for it, using only the original training set plus an intelligent sampling strategy? This is especially valuable in data-limited scenarios, where it can help bridge the performance gap between small-scale and large-scale training using only limited real data.

Two key motivations underpin this work: (1) synthetic datasets generated by diffusion models frequently leak training data, offering no clear advantage over real data when used as a direct substitute; and (2) responsibly curated FR datasets are scarce and difficult to collect, motivating the use of generative models as augmentation tools rather than data surrogates.

Method

Overall Architecture

The AugGen pipeline consists of four steps: (a) jointly train a class-conditional generative model \(G\) and a discriminative model \(M_{\text{orig}}\) on the labeled dataset \(D_{\text{orig}}\); (b) regenerate samples \(D_{\text{repro}}\) using the original conditioning vectors to simulate the training distribution; (c) identify optimal mixing weights \((\alpha^*, \beta^*)\) via grid search for generating new classes; and (d) generate augmented data \(D_{\text{aug}}\), combine it with the original data, and retrain the discriminative model.
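
A compact sketch of this flow is given below. It is a hypothetical outline only: every stage is injected as a placeholder callable (`train_generator`, `train_recognizer`, `reproduce`, `search_mix_weights`, `generate_mixed`), none of which come from the authors' released code.

```python
def auggen_pipeline(train_generator, train_recognizer, reproduce,
                    search_mix_weights, generate_mixed,
                    D_orig, num_classes):
    """Hypothetical driver for steps (a)-(d); all callables are placeholders."""
    # (a) Train the class-conditional generator (EDM/EDM2) and the
    #     baseline recognizer (IR50 + AdaFace) on the same labeled data.
    G = train_generator(D_orig, num_classes)
    M_orig = train_recognizer(D_orig)
    # (b) Resample from the original one-hot conditions to approximate
    #     the training distribution (D_repro).
    D_repro = reproduce(G, num_classes)
    # (c) Grid-search the mixing weights with the frozen recognizer,
    #     comparing mixed-class samples against the reproduced ones.
    alpha, beta = search_mix_weights(G, M_orig, D_repro)
    # (d) Generate mixed-class data, merge with the real data, retrain.
    D_aug = generate_mixed(G, alpha, beta, num_classes)
    return train_recognizer(D_orig + D_aug)
```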

Key Designs

  1. Class-conditional mixing sampling strategy: Given one-hot conditioning vectors \(\mathbf{c}^i, \mathbf{c}^j\) for two classes \(i\) and \(j\), a new conditioning vector is constructed as \(\mathbf{c}^* = \alpha \mathbf{c}^i + \beta \mathbf{c}^j\). Sampling from the trained diffusion model conditioned on \(\mathbf{c}^*\) yields new samples that blend characteristics of both source classes. The choice of \(\alpha, \beta\) is critical: values that are too small yield samples indistinguishable from the source classes, while excessively large values push samples outside the valid distribution. The design goal is to generate "hard negatives", i.e., new classes that are close to the source classes in feature space yet distinguishable, forcing the discriminative model to learn more compact intra-class representations and greater inter-class separation. (Both this mixing step and the weight search of the next item are sketched in code after this list.)

  2. Similarity-based mixing weight search: A composite metric \(m^{\text{total}} = m_s^{\text{total}} + m_d^{\text{total}}\) is designed, where \(m_d\) measures the feature difference between the new class and its source classes and \(m_s\) measures the intra-class consistency across multiple samples drawn from the same \(\mathbf{c}^*\) (higher is better for both; the \(\text{total}\) superscripts denote aggregation over sampled class pairs). Grid search over \(\alpha, \beta \in [0.1, 1.1]\) maximizes this metric. The optimal values are \((0.7, 0.7)\) for CASIA-WebFace and \((0.8, 0.8)\) for WebFace160K. The search is computationally efficient, requiring fewer than 2 GPU-days on a single RTX 3090 Ti.

  3. Self-contained design principle: The entire pipeline uses only the target dataset, introducing no external data, pretrained generators, or auxiliary classifiers. The generative model is trained from scratch using the EDM/EDM2 framework, while the discriminative model uses IR50/IR101 with AdaFace; both share the same training data. This ensures privacy compliance and reproducibility.
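
The following is a minimal NumPy sketch of designs 1 and 2 for a single class pair. It assumes a hypothetical helper `sample_feats(c)` that draws a few images from the trained diffusion model conditioned on c and returns their embeddings from the frozen recognizer; the helper and the exact metric normalization are assumptions, not the paper's released implementation.

```python
import itertools
import numpy as np

def l2_normalize(x, eps=1e-9):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def mixed_condition(i, j, alpha, beta, num_classes):
    # c* = alpha * c^i + beta * c^j for distinct classes i != j;
    # alpha and beta need not sum to 1 (the grid runs up to 1.1).
    c = np.zeros(num_classes, dtype=np.float32)
    c[i], c[j] = alpha, beta
    return c

def mixing_score(feats_star, feats_i, feats_j):
    """Composite metric m_s + m_d for one class pair.

    feats_*: (n, d) embeddings from the frozen recognizer.
    m_s: mean pairwise cosine similarity within the mixed class
         (intra-class consistency, higher is better).
    m_d: one minus the mean cosine similarity between the mixed class
         and its two source classes (separation, higher is better).
    """
    z = l2_normalize(feats_star)
    sim = z @ z.T
    n = len(z)
    m_s = (sim.sum() - n) / (n * (n - 1))   # off-diagonal mean; needs n >= 2
    src = l2_normalize(np.concatenate([feats_i, feats_j]))
    m_d = 1.0 - float((z @ src.T).mean())
    return m_s + m_d

def search_weights(sample_feats, i, j, num_classes):
    """Grid search over alpha, beta in [0.1, 1.1] for one class pair."""
    grid = np.linspace(0.1, 1.1, 11)
    f_i = sample_feats(mixed_condition(i, j, 1.0, 0.0, num_classes))
    f_j = sample_feats(mixed_condition(i, j, 0.0, 1.0, num_classes))
    best_score, best_ab = -np.inf, None
    for a, b in itertools.product(grid, grid):
        f_star = sample_feats(mixed_condition(i, j, a, b, num_classes))
        score = mixing_score(f_star, f_i, f_j)
        if score > best_score:
            best_score, best_ab = score, (a, b)
    return best_ab
```

Since the paper reports a single optimum per dataset ((0.7, 0.7) for CASIA-WebFace, (0.8, 0.8) for WebFace160K), the \(\text{total}\) metric presumably aggregates these per-pair scores over many sampled class pairs before selecting the argmax.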

Loss & Training

  • The generative model adopts the EDM/EDM2 diffusion objective from Karras et al. (denoising score matching), supporting both latent-space and pixel-space variants.
  • The discriminative model employs AdaFace margin loss, an improved variant of ArcFace.
  • Standard data augmentation (illumination transforms, cropping, low-resolution simulation) is applied to all models.
  • Results are reported with mean and standard deviation across multiple random seeds to ensure reliability.
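
As a reference point for the first bullet, here is a simplified pixel-space version of the EDM objective. The preconditioned denoiser interface and the hyperparameters (sigma_data = 0.5, log-normal noise-level sampling) follow the EDM defaults from Karras et al. and may differ from the paper's exact configuration.

```python
import torch

def edm_loss(denoiser, x, cond, sigma_data=0.5, p_mean=-1.2, p_std=1.2):
    """Simplified EDM denoising objective (Karras et al., 2022).

    `denoiser(x_noisy, sigma, cond)` stands for the preconditioned
    network D_theta that predicts the clean image.
    """
    # Log-normal sampling of the noise level sigma, as in EDM.
    rnd = torch.randn(x.shape[0], device=x.device)
    sigma = (p_mean + p_std * rnd).exp().view(-1, 1, 1, 1)
    # Loss weighting lambda(sigma) = (sigma^2 + sigma_data^2) / (sigma * sigma_data)^2.
    weight = (sigma ** 2 + sigma_data ** 2) / (sigma * sigma_data) ** 2
    noised = x + torch.randn_like(x) * sigma
    denoised = denoiser(noised, sigma, cond)
    return (weight * (denoised - x) ** 2).mean()
```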

Key Experimental Results

Main Results

Using CASIA-WebFace as the base dataset with an IR50 backbone, models are evaluated across 8 FR benchmarks (selected metrics shown below):

| Method | Aux. Data | IJB-B 1e-6 | IJB-C 1e-6 | TinyFace R1 | Avg-H |
| --- | --- | --- | --- | --- | --- |
| CASIA-WebFace (real) | None | 1.02±0.26 | 0.73±0.19 | 58.12±0.31 | 94.21 |
| CASIA-WebFace, IR101† | None | 0.74±0.31 | 0.38±0.13 | 59.64±0.49 | 94.84 |
| DCFace (synthetic) | Yes | 22.48±4.35 | 35.27±10.78 | 45.94 | 91.56 |
| IDiff-Face (synthetic) | Yes | 26.84±2.03 | 41.75±1.04 | 45.98 | 84.68 |
| AugGen \(D_{\text{aug}}\) (mixed) | None | 2.61±0.91 | 4.36±1.41 | 59.82 | 94.66 |

† trained with the larger IR101 backbone; all other rows use IR50.

On WebFace160K, AugGen surpasses training with pure real data on a larger network (IR101) under strict IJB-B/C thresholds:

| Method | IJB-B 1e-6 | IJB-B 1e-5 | IJB-C 1e-6 | IJB-C 1e-5 |
| --- | --- | --- | --- | --- |
| WebFace160K, IR50 | 32.13±1.87 | 72.18±0.18 | 70.37±0.75 | 78.81±0.32 |
| WebFace160K, IR101† | 34.84±0.49 | 74.10±0.24 | 72.56±0.02 | 81.26±0.14 |
| AugGen + WebFace160K | 36.62±0.77 | 78.32±0.33 | 78.58±0.15 | 85.02±0.15 |

Ablation Study

Matched real-data experiments show that adding 600K AugGen samples yields performance gains comparable to adding approximately 110K real samples (equivalent to scaling the real training set to about 1.69× its original size), demonstrating the data efficiency of synthetic augmentation.

Key Findings

  • AugGen is the only method that consistently outperforms the baseline across all benchmarks—DCFace and IDiff-Face actually degrade performance on IJB-B/C at low FAR thresholds when used in mixed training.
  • The gains from synthetic augmentation frequently exceed those from architectural upgrades (IR50 + AugGen > IR101 with real data only).
  • Existing generation-quality metrics such as Fréchet distance (FD) and kernel distance (KD) correlate poorly with downstream discriminative performance, highlighting the need for better proxy metrics.
  • AugGen's mixed-class samples occupy positions in feature space that are close to, yet distinguishable from, the source identities, increasing the discriminative difficulty of training.

Highlights & Insights

  • Counterintuitive finding: Contrary to the prevailing trend of large-scale pretraining followed by fine-tuning, this work demonstrates that small generative models trained solely on limited real data can substantially improve downstream discriminative tasks.
  • Privacy compliance: The fully self-contained design avoids privacy concerns associated with external data, carrying significant implications for responsible AI development.
  • The mixed conditioning vector approach can be viewed as an extension of Mixup into the conditioning space of generative models, producing semantically coherent new samples rather than naive pixel-level interpolations.
  • The finding that synthetic augmentation outperforms architectural upgrades underscores the central role of data quality and diversity in discriminative tasks.

Limitations & Future Work

  • Validation is limited to face recognition; generalization to broader image classification or fine-grained recognition tasks remains unexplored.
  • The mixing weight search, though efficient, still requires non-trivial computation and a pre-trained discriminative model.
  • Only two-class mixing (\(\alpha \mathbf{c}^i + \beta \mathbf{c}^j\)) is explored; mixing among more classes or continuous interpolation may yield further gains (a sketch of the multi-class form follows this list).
  • Generation quality is bounded by the scale and diversity of the original dataset—diffusion model quality may degrade when training data is extremely scarce.
  • The applicability of text-conditioned diffusion models (e.g., Stable Diffusion) within this framework is not explored.
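
On the multi-class point above, the generalization is straightforward to write down. The sketch below is a hypothetical extension of the two-class mix, not something explored in the paper:

```python
import numpy as np

def mixed_condition_k(class_ids, weights, num_classes):
    # c* = sum_k w_k * c^{i_k}; the paper only studies k = 2,
    # with (alpha, beta) found by grid search.
    c = np.zeros(num_classes, dtype=np.float32)
    c[np.asarray(class_ids)] = np.asarray(weights, dtype=np.float32)
    return c
```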

Comparison with Prior Work

  • DCFace / IDiff-Face: These methods rely on external data and models; AugGen demonstrates that such dependency is unnecessary.
  • Mixup: AugGen's class-mixing strategy is a natural extension of Mixup into the conditioning space of generative models.
  • DigiFace1M: A 3D-rendering-based approach; AugGen shows that learned generative models are viable even without 3D priors.
  • Core insight: Generative models should not be positioned as competitors to real data, but rather as collaborative augmentation tools that work in concert with it.

Rating

⭐⭐⭐⭐ — The approach is conceptually simple and practically motivated (self-contained, no external dependencies), with thorough experimental validation across 8 FR benchmarks. Although the core idea is straightforward, it represents a meaningful paradigm shift in synthetic data augmentation. The primary limitation is that applicability is currently restricted to face recognition.