SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data¶
Conference: ICCV 2025 arXiv: 2410.09865 Code: Available Area: Facial Expression Recognition / Synthetic Data Keywords: Facial expression recognition, synthetic data, diffusion models, facial action units, label calibration
TL;DR¶
This paper proposes SynFER, a diffusion-model-based facial expression synthesis framework that achieves fine-grained expression generation via dual control signals — text descriptions and Facial Action Units (FAUs) — and introduces a FERAnno label calibrator to ensure annotation reliability. The effectiveness of synthetic data for FER is validated across four learning paradigms: self-supervised, supervised, zero-shot, and few-shot learning.
Background & Motivation¶
Facial Expression Recognition (FER) suffers from a severe data scarcity problem. Existing FER datasets are far smaller than general-purpose visual datasets (AffectNet ~280K vs. ImageNet 1.4M), and are plagued by high annotation subjectivity, low-quality images, and elevated label error rates. These deficiencies hinder the development of foundation models for FER.
Directly leveraging diffusion models to generate FER data faces two key challenges: (1) generative model training data lacks diverse expressions, making it difficult to capture subtle expressive semantics; and (2) expressions are abstract and subjective concepts that cannot be directly annotated like depth maps or segmentation masks.
Method¶
Overall Architecture¶
A three-stage pipeline: 1. Preparation: Construct the FEText dataset and generate FAU labels and text descriptions. 2. Generation: FAU-controlled, semantically guided diffusion model generates high-fidelity expression images. 3. Annotation: FERAnno label calibrator automatically generates reliable labels, validated via ensemble voting.
Key Designs¶
- FEText Dataset Construction:
- The first facial expression image-text pair dataset, integrating FFHQ, CelebA-HQ, AffectNet, and SFEW.
- Contains 400K curated image-text pairs.
- A super-resolution model is applied to unify the resolution of low-resolution images (AffectNet, SFEW).
- ShareGPT-4V, a multimodal large language model, is used to generate detailed expression description texts.
- Carefully designed prompts guide the model to produce precise, emotionally reflective descriptions.
- FAU-Controlled Expression Generation:
- Text descriptions provide high-level semantic control but lack fine-grained facial muscle movement information.
- Facial Action Units (FAUs) are introduced as explicit control signals, with each AU corresponding to a specific facial muscle movement.
- Following IP-Adapter, a decoupled cross-attention module integrates FAU embeddings.
- FAU labels are annotated using the OpenGraphAU model.
- The diffusion model parameters are frozen; only the AU adapter (MLP) is trained.
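The adapter design above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the AU count (41, following OpenGraphAU), token count, and dimensions are assumptions, and the decoupled attention follows the IP-Adapter pattern of attending to text tokens and AU tokens separately, then summing.

```python
import torch
import torch.nn as nn

class AUAdapter(nn.Module):
    """Trainable MLP mapping a vector of AU activations to context tokens
    for the frozen U-Net (dimensions are illustrative)."""
    def __init__(self, num_aus=41, ctx_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Sequential(
            nn.Linear(num_aus, ctx_dim * num_tokens),
            nn.GELU(),
            nn.Linear(ctx_dim * num_tokens, ctx_dim * num_tokens),
        )

    def forward(self, au):  # au: (B, num_aus)
        out = self.proj(au)
        return out.view(au.shape[0], self.num_tokens, -1)  # (B, num_tokens, ctx_dim)

class DecoupledCrossAttention(nn.Module):
    """IP-Adapter-style decoupled cross-attention: the query attends to
    text tokens and AU tokens in separate attention ops; outputs are summed."""
    def __init__(self, dim=320, ctx_dim=768, heads=8):
        super().__init__()
        self.attn_text = nn.MultiheadAttention(dim, heads, kdim=ctx_dim,
                                               vdim=ctx_dim, batch_first=True)
        self.attn_au = nn.MultiheadAttention(dim, heads, kdim=ctx_dim,
                                             vdim=ctx_dim, batch_first=True)
        self.scale = 1.0  # strength of the AU branch

    def forward(self, x, text_ctx, au_ctx):
        out_text, _ = self.attn_text(x, text_ctx, text_ctx)
        out_au, _ = self.attn_au(x, au_ctx, au_ctx)
        return out_text + self.scale * out_au
```

Only `AUAdapter` (and the new AU-attention projections) would be trained; the base diffusion weights stay frozen, which keeps the adapter cheap relative to full fine-tuning.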
- Semantic Guidance:
- Addresses FER label imbalance and inter-expression ambiguity (e.g., disgust vs. anger).
- Layout initialization: images are randomly selected from FEText and inverted into initial noise to preserve natural facial structure.
- Semantic guidance: during the late denoising stage, gradients from an external FER classifier are used to update text embeddings.
- Update rule: \(c_{t-1}^{\text{text}} = c_t^{\text{text}} - \lambda_g \frac{\nabla_{c_t^{\text{text}}} \mathcal{L}_g}{\|\nabla_{c_t^{\text{text}}} \mathcal{L}_g\|_2}\), a normalized gradient-descent step on the classifier loss.
- where \(\mathcal{L}_g = -\log h(f(\hat{x}_0))_y\) is the cross-entropy of the external FER classifier on the denoised estimate \(\hat{x}_0\) for target class \(y\).
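One guidance step can be sketched as follows. This is a minimal reconstruction under assumptions: `denoise_fn` stands in for the (differentiable) map from the text embedding to the denoised estimate \(\hat{x}_0\), `classifier` for the external FER model, and the step is written as gradient descent on \(\mathcal{L}_g\) with a normalized gradient.

```python
import torch
import torch.nn.functional as F

def semantic_guidance_step(text_emb, denoise_fn, classifier, target, lambda_g=0.1):
    """One semantic-guidance update (sketch): nudge the text embedding along
    the normalized gradient of the FER classifier's cross-entropy on the
    current denoised estimate, steering generation toward `target`."""
    emb = text_emb.detach().requires_grad_(True)
    logits = classifier(denoise_fn(emb))        # classifier on x0_hat(emb)
    loss = F.cross_entropy(logits, target)      # L_g = -log p(target | x0_hat)
    (grad,) = torch.autograd.grad(loss, emb)
    # descend the loss; normalizing the gradient decouples step size from scale
    return (emb - lambda_g * grad / grad.norm(p=2)).detach()
```

In the paper this is applied only during the late denoising stage, when \(\hat{x}_0\) is already a reasonable face and the classifier gradient is meaningful.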
- FERAnno Label Calibrator:
- A diffusion-model-based pseudo-label generator that leverages U-Net intermediate features and cross-attention maps.
- Image inversion (\(t=1\), preserving maximum facial detail) → feature extraction → dual-branch encoder fusion.
- Multi-scale feature maps capture global generative information; cross-attention maps provide class-discriminative information.
- A bidirectional cross-attention block fuses the two feature types → a linear layer outputs expression class probabilities.
- Ensemble voting with external FER models: when predictions are inconsistent, they are replaced with the ensemble prediction average.
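The voting step is simple enough to sketch directly. A minimal version, assuming each model outputs an (N, C) array of class probabilities (function and variable names are illustrative, not the paper's):

```python
import numpy as np

def calibrate_labels(feranno_probs, external_probs_list):
    """Ensemble-voting calibration (sketch): keep the label when all models
    agree on the argmax; otherwise fall back to the argmax of the averaged
    predicted distributions."""
    all_probs = [feranno_probs] + list(external_probs_list)
    preds = [p.argmax(axis=1) for p in all_probs]
    agree = np.all([p == preds[0] for p in preds], axis=0)  # (N,) agreement mask
    avg_pred = np.mean(all_probs, axis=0).argmax(axis=1)    # ensemble average label
    return np.where(agree, preds[0], avg_pred)
```

This keeps confident, consistent labels untouched while resolving disagreements with the ensemble consensus, which is what makes the synthetic labels reliable enough for downstream training.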
Loss & Training¶
- Diffusion model training: standard diffusion loss \(\min_\theta \mathbb{E}\|\epsilon - \epsilon_\theta(x_t, c, t)\|_2^2\)
- AU adapter training: diffusion model is frozen; only the MLP mapping FAU → embeddings is trained.
- FERAnno training: dual-branch architecture utilizing intermediate diffusion model features and attention maps.
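The training objective above is the standard epsilon-prediction diffusion loss. A generic sketch (not the paper's exact code; the noise schedule and model signature are assumptions):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0, cond, alphas_cumprod):
    """Standard diffusion training step (sketch): sample a timestep and noise,
    form the noised sample x_t, and regress the model's noise prediction
    onto the true noise with MSE."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps  # forward-process sample
    return F.mse_loss(eps_model(x_t, cond, t), eps)
```

For the AU-adapter stage, only the adapter parameters would receive gradients from this loss; the U-Net parameters \(\theta\) stay frozen.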
Key Experimental Results¶
Main Results¶
Self-supervised pre-training + linear probing (ResNet-50):
| SSL Method | Pre-training Data | Scale | RAF-DB | AffectNet | SFEW |
|---|---|---|---|---|---|
| MoCo v3 | AffectNet | 0.2M | 79.05 | 51.03 | 49.34 |
| MoCo v3 | SynFER | 1.0M | 81.17 (+2.12) | 55.56 (+4.53) | 50.78 (+1.44) |
| MoCo v3 | Both | 1.2M | 81.68 (+2.63) | 57.84 (+6.81) | 51.26 (+1.92) |
Supervised learning improvements:
| Method | RAF-DB | AffectNet |
|---|---|---|
| POSTER++ | 91.59 | 67.49 |
| POSTER++ + SynFER | 91.95 | 69.04 |
| APViT | 91.78 | 66.94 |
| APViT + SynFER | 92.05 | 67.26 |
| FERAnno (standalone) | 92.56 | 70.38 |
Training on synthetic data only: 67.23% on AffectNet (equal data volume) → 69.84% (5× data volume).
Ablation Study¶
Contribution of each component to generation quality and downstream performance:
| Method | FER Acc. | AU Acc. | RAF-DB | AffectNet |
|---|---|---|---|---|
| SD baseline | 20.06% | 87.72% | 89.42 | 65.36 |
| + FEText | 34.62% | 88.91% | 90.54 | 66.62 |
| + FEText + FAUs | 48.74% | 92.37% | 91.68 | 67.68 |
| + FEText + FAUs + SG | 55.14% | 93.31% | 91.95 | 68.13 |
Generation quality comparison (FID↓):
| Method | FID | FER Acc. | User Preference (EA) |
|---|---|---|---|
| Stable Diffusion | 88.40 | 20.06% | 2.86% |
| FineFace | 74.61 | 38.05% | 5.73% |
| SynFER | 16.32 | 55.14% | 59.64% |
Key Findings¶
- FAU control significantly improves expression accuracy: FER accuracy increases from 34.62% to 48.74%, rendering ambiguous expressions (e.g., fear vs. surprise) more distinguishable.
- Semantic guidance provides further improvement: an additional 6.4-point gain in expression accuracy (48.74% → 55.14%) on top of FAU control.
- Self-supervised learning benefits the most: synthetic data yields substantially larger gains for SSL than for supervised learning, as the latter demands stricter distributional alignment.
- FERAnno as a standalone strong classifier: 92.56% on RAF-DB and 70.38% on AffectNet, surpassing all prior state-of-the-art FER models.
- Data scaling is effective: particularly when combined with Real-Fake techniques for distribution alignment, supervised learning also benefits from larger volumes of synthetic data.
Highlights & Insights¶
- End-to-end synthetic data pipeline: a closed loop spanning dataset construction (FEText) → generation control (FAU + SG) → label calibration (FERAnno) → downstream validation.
- Dual role of FERAnno: it functions both as a label calibration tool and as a standalone state-of-the-art FER classifier, demonstrating that diffusion model internal features are highly informative for expression understanding.
- Validation across four learning paradigms: comprehensive evaluation over SSL, supervised, zero-shot, and few-shot settings provides strong empirical evidence.
- Addressing the FER data scale bottleneck: provides a scalable data generation solution for training FER foundation models.
Limitations & Future Work¶
- Gains from synthetic data in supervised learning remain modest (under 2 points), with distributional alignment still being the primary bottleneck.
- Coverage is limited to basic expression categories (7 classes); compound expressions and micro-expressions require finer-grained control.
- Text descriptions in FEText rely on ShareGPT-4V, which may introduce descriptive bias.
- The accuracy of the FAU detection model (OpenGraphAU) inherently limits the precision of the control signal.
Related Work & Insights¶
- The approach of leveraging diffusion model intermediate features for classification in FERAnno is extensible to other visual understanding tasks.
- The FAU control mechanism can be generalized to other generation tasks requiring fine-grained facial control, such as talking head synthesis.
- The finding that SSL benefits more than supervised learning from synthetic data offers a useful reference for synthetic data applications in other domains.
Rating¶
- Novelty: ⭐⭐⭐⭐ First complete diffusion-based FER synthesis pipeline; FAU and semantic guidance design is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four learning paradigms × six datasets × extensive ablations — extremely comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Pipeline described clearly with rich figures and tables.
- Value: ⭐⭐⭐⭐ Provides a scalable data generation solution for the FER community.