FDeID-Toolbox: Face De-Identification Toolbox¶
Conference: CVPR 2026 arXiv: 2603.13121 Authors: Hui Wei, Hao Yu, Guoying Zhao (University of Oulu) Code: infraface/FDeID-Toolbox Area: Image Generation Keywords: face de-identification, privacy preservation, toolbox, benchmark, evaluation protocol
TL;DR¶
This paper presents FDeID-Toolbox, a modular face de-identification toolbox that uniformly integrates 16 de-identification methods (spanning four categories: naive, generative, adversarial, and K-Same), 6 benchmark datasets, and a systematic evaluation protocol covering three dimensions: privacy protection, attribute preservation, and visual quality. It addresses the field's persistent problems of fragmented implementations, inconsistent evaluation protocols, and incomparable results.
Background & Motivation¶
Face De-Identification (FDeID) aims to remove personally identifiable information from facial images while retaining task-relevant attributes such as age, gender, and expression. This is critical for privacy-preserving computer vision, yet the field faces three fundamental challenges:
- Fragmented implementations: Methods are developed in isolation using different frameworks, preprocessing pipelines, and data formats, making reproduction and comparison difficult.
- Inconsistent evaluation protocols: FDeID spans multiple downstream tasks (age estimation, gender recognition, expression analysis, etc.) and requires evaluation across privacy protection, attribute preservation, and visual quality dimensions; existing codebases lack a unified standard.
- Incomparable results: Different papers use different data splits, evaluation models, and metric definitions, rendering fair cross-method performance comparisons infeasible.
The core motivation is to construct a unified, modular, and extensible toolbox that enables researchers to fairly compare FDeID methods under fully controlled, consistent conditions, thereby advancing reproducible research.
Method¶
Overall Architecture¶
FDeID-Toolbox adopts a modular design comprising four core components:
- Standardized data loaders: Unified interfaces for 6 mainstream benchmark datasets (LFW, AgeDB, AffectNet, CelebA-HQ, FairFace, PURE).
- Unified method implementations: 16 de-identification methods ranging from classical approaches to state-of-the-art generative models.
- Flexible inference pipeline: driven by YAML configuration files, with CLI parameters able to override any configuration value.
- Systematic evaluation protocol: Covering privacy, attribute preservation, and quality evaluation dimensions.
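The YAML-plus-CLI-override pattern described above can be sketched as follows. This is a minimal illustration, not the toolbox's actual code; the config keys (`method`, `dataset`, `kernel_size`) and the dotted `key=value` override syntax are assumptions.

```python
# Hypothetical sketch of config-dict merging with dotted CLI overrides.
# Keys and override syntax are illustrative, not the toolbox's schema.
import copy


def apply_overrides(config: dict, overrides: list) -> dict:
    """Return a copy of `config` with `a.b.c=value` overrides applied."""
    result = copy.deepcopy(config)          # keep the base config intact
    for item in overrides:
        key, _, value = item.partition("=")  # values stay strings here
        node = result
        *parents, leaf = key.split(".")
        for p in parents:
            node = node.setdefault(p, {})    # walk/create nested dicts
        node[leaf] = value
    return result


base = {"method": {"name": "blur", "kernel_size": 15}, "dataset": "lfw"}
merged = apply_overrides(base, ["method.name=pixelate", "dataset=celeba_hq"])
```

In practice the base dict would come from `yaml.safe_load` and the override list from the command line; the point is that a single config dictionary fully specifies an experiment.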
BaseDeIdentifier Abstract Base Class¶
All methods inherit from a unified BaseDeIdentifier abstract base class providing a standardized interface:
- `process_frame(frame, face_bbox)`: applies de-identification to the facial region of a single frame.
- `process_batch(frames, face_bboxes)`: native batch processing support.
- `get_name()` / `get_config()`: method metadata.
- Factory function `get_deidentifier(config)` for automatic instantiation via a configuration dictionary.
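A minimal sketch of this interface and factory, assuming numpy frames and `(x1, y1, x2, y2)` bounding boxes; the class internals and registry here are illustrative, not the toolbox's actual code.

```python
# Sketch of the BaseDeIdentifier interface plus factory (illustrative).
from abc import ABC, abstractmethod

import numpy as np


class BaseDeIdentifier(ABC):
    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def process_frame(self, frame: np.ndarray, face_bbox: tuple) -> np.ndarray:
        """De-identify the face region of a single frame."""

    def process_batch(self, frames, face_bboxes):
        # Default batch support: loop over frames; subclasses may vectorize.
        return [self.process_frame(f, b) for f, b in zip(frames, face_bboxes)]

    def get_name(self) -> str:
        return self.config.get("name", type(self).__name__)

    def get_config(self) -> dict:
        return dict(self.config)


class BlackMask(BaseDeIdentifier):
    def process_frame(self, frame, face_bbox):
        x1, y1, x2, y2 = face_bbox
        out = frame.copy()
        out[y1:y2, x1:x2] = 0   # overlay a black mask on the face region
        return out


_REGISTRY = {"mask": BlackMask}   # hypothetical identifier-to-class map


def get_deidentifier(config: dict) -> BaseDeIdentifier:
    """Factory: instantiate a method from a configuration dictionary."""
    return _REGISTRY[config["name"]](config)


deid = get_deidentifier({"name": "mask"})
frame = np.full((8, 8, 3), 255, dtype=np.uint8)
out = deid.process_frame(frame, (2, 2, 6, 6))
```

With a registry like this, switching methods is a one-key config change, which is what makes large-scale automated sweeps over all 16 methods practical.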
16 De-Identification Methods (Four Categories)¶
| Category | Method | Identifier | Mechanism |
|---|---|---|---|
| Naive | Gaussian Blur | blur | Gaussian blurring of the face region |
| Naive | Pixelation | pixelate | Pixelation (mosaic) of the face region |
| Naive | Black Mask | mask | Black mask overlaid on the face |
| Generative | CIAGAN | ciagan | Conditional identity anonymization GAN |
| Generative | AMT-GAN | amtgan | Adversarial makeup transfer GAN |
| Generative | Adv-Makeup | advmakeup | Adversarial makeup generation |
| Generative | WeakenDiff | weakendiff | Diffusion model-based identity weakening |
| Generative | DeID-rPPG | deid_rppg | De-identification preserving rPPG signals |
| Generative | G2Face | g2face | Generative face replacement |
| Adversarial | PGD | pgd | Projected Gradient Descent adversarial perturbation |
| Adversarial | MI-FGSM | mifgsm | Momentum Iterative FGSM |
| Adversarial | TI-DIM | tidim | Translation-Invariant Diverse Input Method |
| Adversarial | TI-PIM | tipim | Translation-Invariant Patch Input Method |
| Adversarial | Chameleon | chameleon | Natural adversarial perturbation |
| K-Same | k-Same-Average | average | k-nearest neighbor face averaging |
| K-Same | k-Same-Select | select | k-nearest neighbor selection and replacement |
| K-Same | k-Same-Furthest | furthest | k-furthest neighbor replacement |
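As one concrete example from the K-Same family, k-Same-Average replaces each face with the mean of its k nearest neighbors, so any published image is shared by at least k identities. The sketch below works directly in flattened pixel/feature vectors; the toolbox's actual implementation may differ (e.g. clustering in a learned feature space).

```python
# Illustrative k-Same-Average in vector space (not the toolbox's code).
import numpy as np


def k_same_average(faces: np.ndarray, k: int) -> np.ndarray:
    """Replace each face vector by the mean of its k nearest neighbors
    (self included), following the k-anonymity idea behind k-Same."""
    diff = faces[:, None, :] - faces[None, :, :]
    dist = (diff ** 2).sum(-1)            # pairwise squared distances
    out = np.empty_like(faces, dtype=float)
    for i in range(len(faces)):
        nn = np.argsort(dist[i])[:k]      # k closest faces, self included
        out[i] = faces[nn].mean(axis=0)
    return out


# Two tight clusters of two "faces" each; with k=2 every output is a
# cluster average, so the two faces in a cluster become indistinguishable.
faces = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
anon = k_same_average(faces, k=2)
```

k-Same-Select and k-Same-Furthest vary only the neighbor-selection step (sampling from the k nearest, or taking the k furthest to push the output away from the original identity).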
Three-Dimensional Evaluation Framework¶
Privacy Protection Dimension:
- Verification Accuracy: face verification accuracy (lower is better, indicating successful identity concealment)
- TAR@FAR: true acceptance rate at a fixed false acceptance rate
- PSR (Privacy Success Rate): rate of successful privacy protection
- Evaluation models: ArcFace, CosFace, AdaFace
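One common way to operationalize PSR, sketched below, is the fraction of genuine pairs that the recognizer no longer matches after de-identification; the cosine-similarity setup and threshold are assumptions, not necessarily the paper's exact protocol.

```python
# Hedged sketch of a Privacy Success Rate computation over embedding
# pairs. Threshold and similarity choice are illustrative assumptions.
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(-1)          # row-wise cosine similarity


def privacy_success_rate(orig_emb, deid_emb, threshold=0.5):
    """Fraction of pairs the recognizer fails to match after de-id."""
    sims = cosine_sim(orig_emb, deid_emb)
    return float((sims < threshold).mean())


rng = np.random.default_rng(0)
orig = rng.normal(size=(4, 8))
psr_same = privacy_success_rate(orig, orig)   # unchanged images: PSR = 0
psr_rand = privacy_success_rate(orig, rng.normal(size=(4, 8)))
```

In the toolbox the embeddings would come from ArcFace/CosFace/AdaFace rather than random vectors, with the threshold calibrated on the original verification pairs.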
Attribute Preservation Dimension:
- Age: MAE (Mean Absolute Error)
- Gender: classification accuracy
- Expression: classification accuracy
- Facial landmarks: NME (Normalized Mean Error)
- Ethnicity: classification accuracy
- rPPG: heart-rate MAE and RMSE
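Of these, landmark NME is the least self-explanatory; a common definition (mean point-to-point error normalized by inter-ocular distance) can be sketched as follows. The normalization choice and eye-landmark indices are assumptions.

```python
# Sketch of landmark NME with inter-ocular normalization (a common
# convention; the paper's exact normalizer may differ).
import numpy as np


def landmark_nme(pred: np.ndarray, gt: np.ndarray,
                 left_eye: int = 0, right_eye: int = 1) -> float:
    """Mean per-landmark L2 error divided by inter-ocular distance."""
    errors = np.linalg.norm(pred - gt, axis=-1)            # per landmark
    inter_ocular = np.linalg.norm(gt[left_eye] - gt[right_eye])
    return float(errors.mean() / inter_ocular)


gt = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 5.0]])       # 3 landmarks
pred = gt + np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
nme = landmark_nme(pred, gt)   # (1 + 1 + 0) / 3 / 10 ≈ 0.0667
```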
Visual Quality Dimension:
- Reference-based metrics: PSNR, SSIM, LPIPS
- Reference-free distribution metric: FID
- Reference-free quality metric: NIQE
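As a pointer to how the reference-based side works, PSNR is simple enough to compute directly (SSIM, LPIPS, and FID need considerably more machinery and are normally taken from libraries):

```python
# Minimal PSNR sketch for 8-bit images.
import numpy as np


def psnr(ref: np.ndarray, img: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference and an image."""
    mse = np.mean((ref.astype(float) - img.astype(float)) ** 2)
    if mse == 0:
        return float("inf")     # identical images
    return float(10 * np.log10(max_val ** 2 / mse))


ref = np.zeros((4, 4), dtype=np.uint8)
img = np.full((4, 4), 16, dtype=np.uint8)   # uniform error of 16 levels
val = psnr(ref, img)                        # 20*log10(255/16) ≈ 24.05 dB
```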
Key Experimental Results¶
Table 1: Privacy Protection Evaluation (LFW Dataset)¶
| Method | Category | ArcFace Acc↓ | CosFace Acc↓ | AdaFace Acc↓ | PSR↑ |
|---|---|---|---|---|---|
| Original | - | 99.8 | 99.7 | 99.8 | 0.0 |
| Blur | Naive | 56.2 | 58.1 | 55.8 | 87.4 |
| Pixelate | Naive | 62.4 | 63.7 | 61.9 | 79.3 |
| Mask | Naive | 50.1 | 50.3 | 50.0 | 99.6 |
| CIAGAN | Generative | 53.8 | 55.2 | 54.1 | 91.2 |
| AMT-GAN | Generative | 58.7 | 60.3 | 57.9 | 82.6 |
| WeakenDiff | Generative | 51.4 | 52.8 | 51.1 | 96.8 |
| G2Face | Generative | 52.1 | 53.5 | 51.8 | 95.3 |
| PGD | Adversarial | 67.3 | 69.1 | 66.8 | 64.5 |
| Chameleon | Adversarial | 61.8 | 63.4 | 60.9 | 76.2 |
| k-Same-Avg | K-Same | 55.4 | 57.2 | 54.9 | 89.1 |
| k-Same-Furthest | K-Same | 53.1 | 54.8 | 52.7 | 93.5 |
Generative methods (WeakenDiff, G2Face) come close to Black Mask in privacy protection while, unlike Black Mask, not destroying all visual information. Adversarial methods offer comparatively weaker privacy protection.
Table 2: Visual Quality vs. Attribute Preservation Trade-off (AgeDB / CelebA-HQ)¶
| Method | FID↓ | SSIM↑ | LPIPS↓ | Age MAE↓ | Gender Acc↑ | Landmark NME↓ |
|---|---|---|---|---|---|---|
| Blur | 142.5 | 0.71 | 0.38 | 8.2 | 78.4 | 12.3 |
| Pixelate | 156.8 | 0.65 | 0.42 | 9.1 | 75.6 | 14.7 |
| Mask | 198.3 | 0.52 | 0.56 | 15.3 | 62.1 | N/A |
| CIAGAN | 78.4 | 0.82 | 0.21 | 4.5 | 89.3 | 5.8 |
| AMT-GAN | 85.2 | 0.79 | 0.24 | 5.1 | 87.8 | 6.2 |
| WeakenDiff | 62.1 | 0.86 | 0.17 | 3.8 | 91.5 | 4.6 |
| G2Face | 58.7 | 0.88 | 0.15 | 3.4 | 92.1 | 4.2 |
| PGD | 45.2 | 0.93 | 0.08 | 2.1 | 95.8 | 2.8 |
| Chameleon | 52.3 | 0.91 | 0.11 | 2.6 | 94.2 | 3.1 |
| k-Same-Avg | 95.6 | 0.76 | 0.28 | 5.8 | 84.7 | 7.4 |
| k-Same-Furthest | 108.3 | 0.73 | 0.31 | 6.5 | 82.3 | 8.1 |
Key findings:
- Privacy–attribute preservation trade-off: adversarial methods (PGD, Chameleon) best preserve attributes but provide the weakest privacy protection; Black Mask offers the strongest privacy protection but completely eliminates attribute information.
- Generative methods achieve the best balance: WeakenDiff and G2Face attain the best trade-off between privacy protection (PSR > 95%) and attribute preservation (Age MAE < 4, Gender Acc > 91%).
- Naive and K-Same methods exhibit inferior visual quality (FID > 95) with moderate attribute preservation.
- Value of unified evaluation: only under consistent conditions do the true strengths and weaknesses of each method category become apparent; advantages claimed by some methods in their original papers no longer hold under unified evaluation.
Highlights & Insights¶
- Unified abstract interface: the `BaseDeIdentifier` base class is cleanly designed (`process_frame` + `process_batch`); integrating a new method requires only ~30 lines of code, making the toolbox highly extensible.
- YAML configuration-driven: all experiments are fully specified through a single YAML file, with CLI overrides for any parameter, ensuring complete reproducibility.
- First unified three-dimensional evaluation: Privacy protection, attribute preservation, and visual quality have never previously been systematically compared within a single framework using consistent evaluation models and data splits.
- Broad coverage: 16 methods spanning four categories, 6 datasets covering diverse downstream tasks, and 8 evaluation dimensions (privacy + 5 attributes + rPPG + quality) constitute the most comprehensive FDeID benchmark to date.
- Lightweight dependencies: Pure PyTorch implementation with no complex C++ extensions or conflicting frameworks, lowering the barrier to adoption.
- Factory pattern design: Methods can be switched via a configuration dictionary, facilitating large-scale automated experimentation.
Limitations & Future Work¶
- Technical report nature: As a toolbox paper, methodological innovation is limited; the core contribution lies in engineering integration and standardization.
- Incomplete method uploads: The GitHub repository indicates that K-Same and certain generative methods are still being uploaded, and completeness remains to be verified.
- Limited dataset scale: Datasets such as LFW are relatively small with limited facial diversity; generalization to large-scale real-world scenarios requires further validation.
- No video-level evaluation: Although the interface supports frame-by-frame processing, temporal consistency evaluation for video de-identification is absent.
- No formal privacy guarantees: Evaluation relies solely on empirical metrics (face verification) without theoretical privacy guarantees such as differential privacy.
- Limited diffusion model methods: Only WeakenDiff is diffusion model-based; numerous recently proposed diffusion-based de-identification methods have not been incorporated.
Related Work & Insights¶
- Traditional de-identification: Naive methods such as Gaussian Blur, Pixelation, and Black Mask are simple and effective but offer poor attribute preservation, serving as long-standing baselines.
- K-Same family: k-Same-Pixel, k-Same-Select, and k-Same-Furthest aggregate faces in feature space based on k-anonymity principles, providing theoretical privacy guarantees but limited visual quality.
- Generative methods: CIAGAN (conditional identity anonymization GAN), AMT-GAN (adversarial makeup transfer), DeepPrivacy (GAN inpainting), and FALCO (attention-guided conditional generation) offer high visual quality but involve complex training.
- Adversarial perturbation methods: PGD, MI-FGSM, TI-DIM, and others add imperceptible pixel-level perturbations to deceive recognition models; they induce minimal visual change but their privacy protection is model-dependent, limiting generalizability.
- Diffusion model methods: WeakenDiff, RiDDLE, and others leverage the generative capacity of diffusion models to achieve high-quality de-identification and represent a current research focus.
- FDeID-Toolbox positioning: Rather than proposing new methods, this work unifies the implementations and evaluations of existing methods, filling the gap left by the absence of a standardized benchmark in the field.
Rating¶
- Novelty: ⭐⭐⭐ — A toolbox/benchmark paper with limited methodological innovation, though the unified evaluation framework design constitutes a clear contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ — 16 methods × 6 datasets × three-dimensional evaluation with strong diversity across method categories.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, detailed description of modular design, and abundant code examples.
- Value: ⭐⭐⭐⭐ — Provides the FDeID field with a much-needed standardized evaluation platform; has strong potential to become the standard benchmark tool for this research direction.