SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data¶
Conference: ICCV 2025 arXiv: 2410.09865 Code: Available Area: Facial Expression Recognition / Synthetic Data Keywords: Facial expression recognition, synthetic data, diffusion models, facial action units, label calibration
TL;DR¶
This paper proposes SynFER, a diffusion-model-based facial expression synthesis framework that achieves fine-grained expression generation via dual control signals — text descriptions and Facial Action Units (FAUs) — and introduces a FERAnno label calibrator to ensure annotation reliability. The effectiveness of synthetic data for FER is validated across four learning paradigms: self-supervised, supervised, zero-shot, and few-shot learning.
Background & Motivation¶
Facial Expression Recognition (FER) suffers from a severe data scarcity problem. Existing FER datasets are far smaller than general-purpose visual datasets (AffectNet ~280K vs. ImageNet 1.4M), and are plagued by high annotation subjectivity, low-quality images, and elevated label error rates. These deficiencies hinder the development of foundation models for FER.
Directly leveraging diffusion models to generate FER data faces two key challenges: (1) generative model training data lacks diverse expressions, making it difficult to capture subtle expressive semantics; and (2) expressions are abstract and subjective concepts that cannot be directly annotated like depth maps or segmentation masks.
Method¶
Overall Architecture¶
A three-stage pipeline: 1. Preparation: Construct the FEText dataset and generate FAU labels and text descriptions. 2. Generation: FAU-controlled, semantically guided diffusion model generates high-fidelity expression images. 3. Annotation: FERAnno label calibrator automatically generates reliable labels, validated via ensemble voting.
Key Designs¶
- FEText Dataset Construction:
- The first facial expression image-text pair dataset, integrating FFHQ, CelebA-HQ, AffectNet, and SFEW.
- Contains 400K curated image-text pairs.
- A super-resolution model is applied to unify the resolution of low-resolution images (AffectNet, SFEW).
- ShareGPT-4V, a multimodal large language model, is used to generate detailed expression description texts.
- Carefully designed prompts guide the model to produce precise, emotionally reflective descriptions.
- FAU-Controlled Expression Generation:
- Text descriptions provide high-level semantic control but lack fine-grained facial muscle movement information.
- Facial Action Units (FAUs) are introduced as explicit control signals, with each AU corresponding to a specific facial muscle movement.
- Following IP-Adapter, a decoupled cross-attention module integrates FAU embeddings.
- FAU labels are annotated using the OpenGraphAU model.
- The diffusion model parameters are frozen; only the AU adapter (MLP) is trained.
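The adapter design above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the AU count (41, following OpenGraphAU), token count, and dimensions are assumptions, and the decoupled attention follows the IP-Adapter pattern of attending to text tokens and AU tokens separately, then summing.

```python
import torch
import torch.nn as nn

class AUAdapter(nn.Module):
    """Trainable MLP mapping a vector of AU activations to context tokens
    for the frozen U-Net (dimensions are illustrative)."""
    def __init__(self, num_aus=41, ctx_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Sequential(
            nn.Linear(num_aus, ctx_dim * num_tokens),
            nn.GELU(),
            nn.Linear(ctx_dim * num_tokens, ctx_dim * num_tokens),
        )

    def forward(self, au):  # au: (B, num_aus)
        out = self.proj(au)
        return out.view(au.shape[0], self.num_tokens, -1)  # (B, num_tokens, ctx_dim)

class DecoupledCrossAttention(nn.Module):
    """IP-Adapter-style decoupled cross-attention: the query attends to
    text tokens and AU tokens in separate attention ops; outputs are summed."""
    def __init__(self, dim=320, ctx_dim=768, heads=8):
        super().__init__()
        self.attn_text = nn.MultiheadAttention(dim, heads, kdim=ctx_dim,
                                               vdim=ctx_dim, batch_first=True)
        self.attn_au = nn.MultiheadAttention(dim, heads, kdim=ctx_dim,
                                             vdim=ctx_dim, batch_first=True)
        self.scale = 1.0  # strength of the AU branch

    def forward(self, x, text_ctx, au_ctx):
        out_text, _ = self.attn_text(x, text_ctx, text_ctx)
        out_au, _ = self.attn_au(x, au_ctx, au_ctx)
        return out_text + self.scale * out_au
```

Only `AUAdapter` (and the new AU-attention projections) would be trained; the base diffusion weights stay frozen, which keeps the adapter cheap relative to full fine-tuning.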
- Semantic Guidance:
- Addresses FER label imbalance and inter-expression ambiguity (e.g., disgust vs. anger).
- Layout initialization: images are randomly selected from FEText and inverted into initial noise to preserve natural facial structure.
- Semantic guidance: during the late denoising stage, gradients from an external FER classifier are used to update text embeddings.
- Update rule: \(c_{t-1}^{\text{text}} = c_t^{\text{text}} - \lambda_g \frac{\nabla_{c_t^{\text{text}}} \mathcal{L}_g}{\|\nabla_{c_t^{\text{text}}} \mathcal{L}_g\|_2}\), a normalized gradient-descent step on the classifier loss.
- where \(\mathcal{L}_g = -\log h(f(\hat{x}_0))_y\) is the cross-entropy of the external FER classifier on the denoised estimate \(\hat{x}_0\) for target class \(y\).
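One guidance step can be sketched as follows. This is a minimal reconstruction under assumptions: `denoise_fn` stands in for the (differentiable) map from the text embedding to the denoised estimate \(\hat{x}_0\), `classifier` for the external FER model, and the step is written as gradient descent on \(\mathcal{L}_g\) with a normalized gradient.

```python
import torch
import torch.nn.functional as F

def semantic_guidance_step(text_emb, denoise_fn, classifier, target, lambda_g=0.1):
    """One semantic-guidance update (sketch): nudge the text embedding along
    the normalized gradient of the FER classifier's cross-entropy on the
    current denoised estimate, steering generation toward `target`."""
    emb = text_emb.detach().requires_grad_(True)
    logits = classifier(denoise_fn(emb))        # classifier on x0_hat(emb)
    loss = F.cross_entropy(logits, target)      # L_g = -log p(target | x0_hat)
    (grad,) = torch.autograd.grad(loss, emb)
    # descend the loss; normalizing the gradient decouples step size from scale
    return (emb - lambda_g * grad / grad.norm(p=2)).detach()
```

In the paper this is applied only during the late denoising stage, when \(\hat{x}_0\) is already a reasonable face and the classifier gradient is meaningful.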
- FERAnno Label Calibrator:
- A diffusion-model-based pseudo-label generator that leverages U-Net intermediate features and cross-attention maps.
- Image inversion (\(t=1\), preserving maximum facial detail) → feature extraction → dual-branch encoder fusion.
- Multi-scale feature maps capture global generative information; cross-attention maps provide class-discriminative information.
- A bidirectional cross-attention block fuses the two feature types → a linear layer outputs expression class probabilities.
- Ensemble voting with external FER models: when predictions are inconsistent, they are replaced with the ensemble prediction average.
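The voting step is simple enough to sketch directly. A minimal version, assuming each model outputs an (N, C) array of class probabilities (function and variable names are illustrative, not the paper's):

```python
import numpy as np

def calibrate_labels(feranno_probs, external_probs_list):
    """Ensemble-voting calibration (sketch): keep the label when all models
    agree on the argmax; otherwise fall back to the argmax of the averaged
    predicted distributions."""
    all_probs = [feranno_probs] + list(external_probs_list)
    preds = [p.argmax(axis=1) for p in all_probs]
    agree = np.all([p == preds[0] for p in preds], axis=0)  # (N,) agreement mask
    avg_pred = np.mean(all_probs, axis=0).argmax(axis=1)    # ensemble average label
    return np.where(agree, preds[0], avg_pred)
```

This keeps confident, consistent labels untouched while resolving disagreements with the ensemble consensus, which is what makes the synthetic labels reliable enough for downstream training.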
Loss & Training¶
- Diffusion model training: standard diffusion loss \(\min_\theta \mathbb{E}\|\epsilon - \epsilon_\theta(x_t, c, t)\|_2^2\)
- AU adapter training: diffusion model is frozen; only the MLP mapping FAU → embeddings is trained.
- FERAnno training: dual-branch architecture utilizing intermediate diffusion model features and attention maps.
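The training objective above is the standard epsilon-prediction diffusion loss. A generic sketch (not the paper's exact code; the noise schedule and model signature are assumptions):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0, cond, alphas_cumprod):
    """Standard diffusion training step (sketch): sample a timestep and noise,
    form the noised sample x_t, and regress the model's noise prediction
    onto the true noise with MSE."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps  # forward-process sample
    return F.mse_loss(eps_model(x_t, cond, t), eps)
```

For the AU-adapter stage, only the adapter parameters would receive gradients from this loss; the U-Net parameters \(\theta\) stay frozen.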
Key Experimental Results¶
Main Results¶
Self-supervised pre-training + linear probing (ResNet-50):
| SSL Method | Pre-training Data | Scale | RAF-DB | AffectNet | SFEW |
|---|---|---|---|---|---|
| MoCo v3 | AffectNet | 0.2M | 79.05 | 51.03 | 49.34 |
| MoCo v3 | SynFER | 1.0M | 81.17 (+2.12) | 55.56 (+4.53) | 50.78 (+1.44) |
| MoCo v3 | Both | 1.2M | 81.68 (+2.63) | 57.84 (+6.81) | 51.26 (+1.92) |
Supervised learning improvements:
| Method | RAF-DB | AffectNet |
|---|---|---|
| POSTER++ | 91.59 | 67.49 |
| POSTER++ + SynFER | 91.95 | 69.04 |
| APViT | 91.78 | 66.94 |
| APViT + SynFER | 92.05 | 67.26 |
| FERAnno (standalone) | 92.56 | 70.38 |
Training on synthetic data only: 67.23% on AffectNet (equal data volume) → 69.84% (5× data volume).
Ablation Study¶
Contribution of each component to generation quality and downstream performance:
| Method | FER Acc. | AU Acc. | RAF-DB | AffectNet |
|---|---|---|---|---|
| SD baseline | 20.06% | 87.72% | 89.42 | 65.36 |
| + FEText | 34.62% | 88.91% | 90.54 | 66.62 |
| + FEText + FAUs | 48.74% | 92.37% | 91.68 | 67.68 |
| + FEText + FAUs + SG | 55.14% | 93.31% | 91.95 | 68.13 |
Generation quality comparison (FID↓):
| Method | FID | FER Acc. | User Preference (EA) |
|---|---|---|---|
| Stable Diffusion | 88.40 | 20.06% | 2.86% |
| FineFace | 74.61 | 38.05% | 5.73% |
| SynFER | 16.32 | 55.14% | 59.64% |
Key Findings¶
- FAU control significantly improves expression accuracy: FER accuracy increases from 34.62% to 48.74%, rendering ambiguous expressions (e.g., fear vs. surprise) more distinguishable.
- Semantic guidance provides further improvement: an additional 6.4-point gain in expression accuracy (48.74% → 55.14%) on top of FAU control.
- Self-supervised learning benefits the most: synthetic data yields substantially larger gains for SSL than for supervised learning, as the latter demands stricter distributional alignment.
- FERAnno as a standalone strong classifier: 92.56% on RAF-DB and 70.38% on AffectNet, surpassing all prior state-of-the-art FER models.
- Data scaling is effective: particularly when combined with Real-Fake techniques for distribution alignment, supervised learning also benefits from larger volumes of synthetic data.
Highlights & Insights¶
- End-to-end synthetic data pipeline: a closed loop spanning dataset construction (FEText) → generation control (FAU + SG) → label calibration (FERAnno) → downstream validation.
- Dual role of FERAnno: it functions both as a label calibration tool and as a standalone state-of-the-art FER classifier, demonstrating that diffusion model internal features are highly informative for expression understanding.
- Validation across four learning paradigms: comprehensive evaluation over SSL, supervised, zero-shot, and few-shot settings provides strong empirical evidence.
- Addressing the FER data scale bottleneck: provides a scalable data generation solution for training FER foundation models.
Limitations & Future Work¶
- Gains from synthetic data in supervised learning remain modest (under 2 points), with distributional alignment still being the primary bottleneck.
- Coverage is limited to basic expression categories (7 classes); compound expressions and micro-expressions require finer-grained control.
- Text descriptions in FEText rely on ShareGPT-4V, which may introduce descriptive bias.
- The accuracy of the FAU detection model (OpenGraphAU) inherently limits the precision of the control signal.
Related Work & Insights¶
- The approach of leveraging diffusion model intermediate features for classification in FERAnno is extensible to other visual understanding tasks.
- The FAU control mechanism can be generalized to other generation tasks requiring fine-grained facial control, such as talking head synthesis.
- The finding that SSL benefits more than supervised learning from synthetic data offers a useful reference for synthetic data applications in other domains.
Rating¶
- Novelty: ⭐⭐⭐⭐ First complete diffusion-based FER synthesis pipeline; FAU and semantic guidance design is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four learning paradigms × six datasets × extensive ablations — extremely comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Pipeline described clearly with rich figures and tables.
- Value: ⭐⭐⭐⭐ Provides a scalable data generation solution for the FER community.