FDeID-Toolbox: Face De-Identification Toolbox¶

Conference: CVPR 2025
arXiv: 2603.13121
Code: Yes (Technical Report accompanied by codebase link)
Area: Diffusion Models
Keywords: Face De-Identification, Privacy Protection, Evaluation Toolbox, Reproducibility, Benchmark

TL;DR¶

This paper proposes FDeID-Toolbox, a comprehensive toolbox oriented towards face de-identification (FDeID) research. By unifying four core components—data loading, method implementation, inference pipeline, and evaluation protocols—through a modular architecture, it addresses the long-standing pain points of fragmented implementations, inconsistent evaluation standards, and incomparable results in this field.

Background & Motivation¶

Background: Face de-identification (FDeID) aims to remove personally identifiable information from face images while preserving task-relevant utility attributes such as age, gender, and expression. This is crucial in privacy-preserving computer vision, particularly for data sharing and public dataset release.

Limitations of Prior Work: There are three core problems in the FDeID field. First, implementation fragmentation—different methods use various codebases and distinct data preprocessing pipelines, making unified execution and comparison difficult. Second, inconsistent evaluation protocols—different papers employ disparate privacy metrics (e.g., identification rate, similarity), utility metrics (e.g., age estimation error, expression classification accuracy), and quality metrics (e.g., FID, SSIM), rendering direct comparison of results impossible. Third, complexity of the task itself—FDeID spans multiple downstream applications (age estimation, gender recognition, expression analysis, etc.) and requires simultaneous evaluation across three dimensions: privacy protection, attribute preservation, and visual quality.

Key Challenge: FDeID methods range widely from classical pixel-level operations (blurring, pixelation) to recent generative models (GANs, diffusion models), and the evaluation dimensions are highly complex, making it extremely difficult for a single researcher to fairly replicate and compare all methods.

Goal: To build a standardized infrastructure for FDeID research, enabling fair comparisons among different methods under completely consistent conditions.

Key Insight: Borrow the design philosophy of successful vision toolboxes such as Detectron2 and MMDetection to build a standardized toolbox dedicated to the niche field of FDeID.

Core Idea: Design a modular four-component architecture (data loaders + method implementations + inference pipelines + evaluation protocols) that covers FDeID methods from classical to state-of-the-art, so that methods can be compared under the same conditions with minimal setup ("one-click fair comparison").

Method¶

Overall Architecture¶

FDeID-Toolbox adopts a modular design comprising four independently extensible core components. The inputs are face images and their attribute annotations, which undergo standardized preprocessing before being fed into the selected FDeID method. The method outputs de-identified images, which are then evaluated quantitatively along three dimensions (privacy, utility, and quality) through a unified evaluation protocol. The pipeline is automated end to end: users only specify the method and the evaluation options in a configuration file.
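A minimal sketch of such a config-driven flow, under stated assumptions: the configuration keys, the `METHODS` and `METRICS` registries, and `run_pipeline` are hypothetical names chosen for illustration and are not taken from the toolbox. A trivial mean-fill stands in for a real FDeID method, and mean squared error stands in for a real quality metric.

```python
import numpy as np

# Hypothetical configuration; the keys are illustrative, not the toolbox's schema.
CONFIG = {"method": "mean_fill", "method_args": {}, "metrics": ["mse"]}


def mean_fill(image: np.ndarray) -> np.ndarray:
    """Stand-in de-identifier: replace every pixel with the per-channel mean."""
    img = image.astype(np.float64)
    mean = img.mean(axis=(0, 1), keepdims=True)
    return np.broadcast_to(mean, img.shape).copy()


def mse(original: np.ndarray, output: np.ndarray) -> float:
    """Stand-in metric: mean squared error between input and output."""
    return float(np.mean((original.astype(np.float64) - output) ** 2))


# Name-to-callable lookups let the config select stages by string.
METHODS = {"mean_fill": mean_fill}
METRICS = {"mse": mse}


def run_pipeline(images, config):
    """Apply the configured method to every image, then average each metric."""
    method = METHODS[config["method"]]
    args = config.get("method_args", {})
    outputs = [method(img, **args) for img in images]
    report = {
        name: float(np.mean([METRICS[name](i, o) for i, o in zip(images, outputs)]))
        for name in config["metrics"]
    }
    return outputs, report


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    faces = [rng.random((32, 32, 3)) for _ in range(4)]  # random arrays, not real faces
    _, report = run_pipeline(faces, CONFIG)
    print(report)
```

Selecting stages by name keeps the comparison conditions fixed: swapping the method changes one config entry, while loading and evaluation stay untouched.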

Key Designs¶

Standardized Data Loaders:
- Function: Provide unified loading and preprocessing interfaces for mainstream FDeID benchmark datasets.
- Mechanism: Implement standardized loading, alignment, and cropping pipelines for different datasets (e.g., CelebA, LFW, FFHQ) so that all methods run under identical input conditions. Attribute annotations are converted to a unified format.
- Design Motivation: Eliminate unfair comparisons caused by differences in data preprocessing. Previously, different papers might use different face detectors and alignment methods, leading to incomparable results even on the same dataset.

Unified Method Implementations:
- Function: Implement FDeID methods ranging from classical to SOTA under a unified interface.
- Mechanism: Implement classical blur/pixelation-based methods, GAN-based methods (e.g., DeepPrivacy, CIAGAN), and recent diffusion model-based methods. All methods share the same input/output interfaces and configuration system, and a new method can be integrated by inheriting from the base class.
- Design Motivation: Solve the fragmentation issue by unifying methods scattered across different GitHub repositories, framework versions, and dependency environments into a single codebase.

Systematic Evaluation Protocols:
- Function: Provide standardized evaluations across three dimensions: privacy protection, attribute preservation, and visual quality.
- Mechanism:
  - Privacy: use multiple face recognition models (ArcFace, CosFace, etc.) to compute identity similarity and identification rates before and after de-identification.
  - Utility: measure how well de-identified images preserve performance on downstream tasks such as age estimation, gender classification, and expression recognition.
  - Quality: use metrics such as FID, SSIM, and LPIPS to evaluate the visual quality of generated images.
- Design Motivation: Previous papers selectively reported favorable evaluation metrics; the toolbox unifies the evaluation standards to make results directly comparable.
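The shared-interface idea in the method component can be sketched as follows. This is an illustration under assumptions, not the toolbox's actual API: `BaseDeidentifier`, `register`, `REGISTRY`, and `build` are hypothetical names, and a box blur is used in place of Gaussian blur to keep the example dependency-free.

```python
from abc import ABC, abstractmethod

import numpy as np

REGISTRY = {}  # hypothetical name-to-class lookup


def register(name):
    """Class decorator that records a method under a string name."""
    def wrap(cls):
        REGISTRY[name] = cls
        return cls
    return wrap


class BaseDeidentifier(ABC):
    """Hypothetical base class: maps an HxW or HxWxC image to one of the same shape."""

    @abstractmethod
    def __call__(self, image: np.ndarray) -> np.ndarray:
        ...


@register("pixelate")
class Pixelate(BaseDeidentifier):
    def __init__(self, block: int = 8):
        self.block = block

    def __call__(self, image):
        out = image.astype(np.float64).copy()
        h, w = out.shape[:2]
        b = self.block
        for y in range(0, h, b):
            for x in range(0, w, b):
                tile = out[y:y + b, x:x + b]  # view into `out`
                tile[...] = tile.mean(axis=(0, 1), keepdims=True)
        return out


@register("box_blur")
class BoxBlur(BaseDeidentifier):
    def __init__(self, radius: int = 2):
        self.radius = radius

    def __call__(self, image):
        img = image.astype(np.float64)
        r = self.radius
        h, w = img.shape[:2]
        # Pad only the two spatial axes, replicating edge pixels.
        pad = [(r, r), (r, r)] + [(0, 0)] * (img.ndim - 2)
        padded = np.pad(img, pad, mode="edge")
        out = np.zeros_like(img)
        # Sum every shifted copy inside the (2r+1) x (2r+1) window, then average.
        for dy in range(2 * r + 1):
            for dx in range(2 * r + 1):
                out += padded[dy:dy + h, dx:dx + w]
        return out / (2 * r + 1) ** 2


def build(name, **kwargs):
    """Instantiate a registered method from its name and keyword arguments."""
    return REGISTRY[name](**kwargs)
```

Under this design, adding a method means writing one subclass and one `@register` line; a GAN or diffusion method would wrap its model behind the same `__call__` signature.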

Loss & Training¶

The toolbox itself does not introduce new training strategies but faithfully reproduces the training configurations from the original papers of each method, providing unified training and inference scripts.

Key Experimental Results¶

Main Results¶

Method	Type	Identity Protection Rate↑	Age Preservation MAE↓	Gender Accuracy↑	FID↓
Gaussian Blur	Classical	High	Large	Low	High
Pixelation	Classical	High	Large	Low	High
DeepPrivacy	GAN	Mid-High	Mid	Mid-High	Mid
CIAGAN	GAN	High	Small	High	Mid-Low
Diffusion Methods	Diffusion	High	Small	High	Low

(Note: Exact numerical values were not available when this summary was written, so the table reports qualitative trends only. The trends reflect the core comparative findings of the toolbox.)
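For readers unfamiliar with the utility columns in the table above, the two statistics behind them can be sketched as below. The function names and the toy inputs are illustrative assumptions; in practice the predictions would come from an age estimator and a gender classifier run on the de-identified images.

```python
import numpy as np


def age_mae(predicted_ages, true_ages) -> float:
    """Mean absolute error in years between predicted and annotated ages (lower is better)."""
    predicted = np.asarray(predicted_ages, dtype=np.float64)
    truth = np.asarray(true_ages, dtype=np.float64)
    return float(np.mean(np.abs(predicted - truth)))


def accuracy(predicted_labels, true_labels) -> float:
    """Fraction of labels predicted correctly (higher is better)."""
    return float(np.mean(np.asarray(predicted_labels) == np.asarray(true_labels)))


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    print(age_mae([28, 45, 50], [30, 40, 50]))
    print(accuracy([0, 1, 0, 0], [0, 1, 1, 0]))
```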

Analysis of Different Evaluation Dimensions¶

Evaluation Dimension	Classical Methods (Blur/Pixelation)	GAN Methods	Diffusion Methods
Privacy Protection	Strong (severely damages facial appearance)	Medium-Strong	Strong
Attribute Preservation	Weak (large loss of attribute details)	Medium-Strong	Strong
Visual Quality	Poor (obvious artifacts)	Medium	Good
Overall Performance	Good privacy but poor utility	Medium balance	Optimal balance

Key Findings¶

Classical methods are effective in privacy protection but at the cost of severe attribute and quality loss—blurring and pixelation destroy almost all usable facial attribute information.
Generative model-based methods perform significantly better on the privacy-utility balance than classical methods, with diffusion model-based methods generally outperforming GAN-based methods.
Under a unified evaluation, the advantages claimed by some methods in their original papers may not hold true—this precisely demonstrates the value of standardized evaluation.
Evaluating privacy with a single recognition model is insufficient, as different recognition models can yield different conclusions.
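The last finding suggests reporting privacy per recognition model rather than as one number. A minimal sketch, assuming each recognizer is a function that maps a batch of images to embedding vectors: the random linear projections below are stand-ins for real models such as ArcFace or CosFace, and the threshold of 0.5 is an arbitrary illustrative value.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (N, D) embedding matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)


def protection_rate(emb_original, emb_deid, threshold: float) -> float:
    """Fraction of pairs whose similarity falls below the match threshold."""
    return float(np.mean(cosine_similarity(emb_original, emb_deid) < threshold))


def evaluate_privacy(originals, deidentified, embedders, threshold: float = 0.5):
    """Compute the protection rate separately for every recognizer."""
    return {
        name: protection_rate(embed(originals), embed(deidentified), threshold)
        for name, embed in embedders.items()
    }


def make_linear_embedder(seed: int, in_dim: int = 64, out_dim: int = 16):
    """Stand-in recognizer: a fixed random projection (not a real face model)."""
    weights = np.random.default_rng(seed).normal(size=(in_dim, out_dim))
    return lambda x: x @ weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    originals = rng.normal(size=(200, 64))  # flattened stand-in "images"
    deidentified = originals + rng.normal(scale=1.5, size=originals.shape)
    embedders = {"model_a": make_linear_embedder(1), "model_b": make_linear_embedder(2)}
    print(evaluate_privacy(originals, deidentified, embedders))
```

Keeping the per-model rates in a dictionary makes disagreement between recognizers visible instead of hiding it inside an average.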

Highlights & Insights¶

The core value of the toolbox lies in "fair comparison"—integrating fragmented research into a unified framework to make conclusions more reliable. While such contributions are not algorithmic innovations, they are crucial for the development of the field.
The three-dimensional evaluation design captures the core trade-off of the FDeID task: privacy, utility, and quality pull against each other and are hard to maximize at once, and the toolbox helps quantify this trade-off.
The modular design keeps the cost of integrating new methods low, which makes the toolbox a promising candidate for a standard benchmark in this domain.
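One way a reader might make the privacy/utility/quality trade-off concrete is to keep only the methods that no other method beats on every dimension (a Pareto front). This is an illustrative analysis, not a feature the summary attributes to the toolbox; the function name, the method labels, and the scores are all made up, and every score is assumed to be oriented so that higher is better.

```python
def pareto_front(scores):
    """Return the names of methods not dominated by any other method.

    `scores` maps a method name to a tuple of higher-is-better values,
    for example (privacy, utility, quality).
    """
    front = []
    for name, values in scores.items():
        dominated = any(
            # `other` dominates if it is at least as good everywhere
            # and strictly better somewhere.
            all(o >= v for o, v in zip(other, values))
            and any(o > v for o, v in zip(other, values))
            for other_name, other in scores.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    # Made-up scores for three hypothetical methods; not results from the paper.
    toy_scores = {
        "A": (0.9, 0.2, 0.3),
        "B": (0.7, 0.8, 0.6),
        "C": (0.6, 0.7, 0.5),
    }
    print(pareto_front(toy_scores))
```

Metrics where lower is better, such as FID or age MAE, would need to be negated or otherwise inverted before being passed in.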

Limitations & Future Work¶

Limitations acknowledged by the authors: The currently covered methods may not be fully comprehensive, and some of the latest methods have not yet been integrated.
Limitations identified by the reviewer: As a toolbox work, the technical contribution is relatively limited, focusing more on engineering integration than on algorithmic innovation.
The evaluation datasets may suffer from bias—mainstream datasets are mostly composed of Western faces, lacking generalization evaluation across different demographics.
Temporal consistency evaluation for video scenarios is not yet covered.
Future work can be extended to broader scenarios such as full-body de-identification and multimodal privacy protection.

Related Work & Insights¶

vs Detectron2 / MMDetection: Borrows the modular design philosophy of mature vision toolboxes to provide standardized infrastructure for a specific subfield.
vs Individual FDeID Methods: The toolbox does not compete with individual methods but instead provides a platform for fair comparison.
For practitioners who must choose an FDeID method for a privacy-protection application, the toolbox can serve as a reference for method selection.

Rating¶

Novelty: ⭐⭐⭐ Technical report in nature, no new algorithm/model, major contribution is engineering integration.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Systematic comparison across multiple dimensions and methods provides valuable references.
Writing Quality: ⭐⭐⭐⭐⭐ Clear structure and well-articulated problem motivation.
Value: ⭐⭐⭐⭐ Contributes to the standardization of the FDeID subfield, but the target audience is relatively narrow.