
CLIP-FTI: Fine-Grained Face Template Inversion via CLIP-Driven Attribute Conditioning

Conference: AAAI 2026 | arXiv: 2512.15433 | Code: N/A | Area: Human Understanding | Keywords: Face Template Inversion, CLIP, StyleGAN, Adversarial Attack, Cross-Model Transferability

TL;DR

This paper presents the first approach to leverage CLIP-extracted fine-grained facial semantic attribute embeddings for Face Template Inversion (FTI). A cross-modal feature interaction network fuses leaked templates with attribute embeddings and projects them into the StyleGAN latent space, synthesizing identity-consistent face images with richer attribute details. The method surpasses the state of the art in recognition accuracy, attribute similarity, and cross-model attack transferability.

Background & Motivation

Face templates (i.e., deep feature embeddings) stored in face recognition (FR) systems pose serious security and privacy risks upon leakage: attackers can reconstruct realistic face images via Face Template Inversion (FTI), exposing soft biometrics (e.g., age, gender) and enabling impersonation attacks. Existing methods primarily map a single leaked template to the StyleGAN latent space, but the reconstructed images tend to be over-smoothed in facial component attributes (eyes, nose, mouth), lack fine-grained detail, and exhibit limited cross-model attack transferability.

The core insight of this paper is to exploit CLIP's semantic alignment capability to introduce additional attribute-level semantic information into template inversion. CLIP maps images and text into a shared semantic space rich in facial attribute descriptions (eye shape, nose bridge, lip fullness, etc.). Fusing these attribute embeddings with the leaked template compensates for the fine-grained information lost during template-only inversion.

Method

Overall Architecture

CLIP-FTI consists of two stages:

Training Stage: Assumes access to face images and their corresponding recognition templates (extracted by a surrogate FR model) to (i) extract CLIP semantic attribute embeddings, and (ii) learn two mapping modules — the TAA Adapter and the Fusion-Latent Projector.

Attack Stage (inference): Given only a leaked template \(t\), the TAA Adapter predicts attribute embeddings \(\hat{s}\), which together with a noise vector \(z\) are passed through the fusion mapping network to generate a StyleGAN latent code \(\hat{w} \in \mathcal{W}\). The frozen StyleGAN3 generator then synthesizes the reconstructed face \(\hat{I} = G(\hat{w})\).
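
As a rough illustration, the attack-stage forward pass can be sketched in a few lines of PyTorch. The module and function names below are placeholders for the trained components described above, not the authors' released code:

```python
import torch

@torch.no_grad()
def invert_template(t, taa_adapter, fusion_projector, generator, z_dim=512):
    """Reconstruct a face image from a leaked template t of shape (B, template_dim)."""
    s_hat = taa_adapter(t)                               # predicted CLIP attribute embedding
    z = torch.randn(t.size(0), z_dim, device=t.device)   # stochastic noise branch
    w_hat = fusion_projector(z, t, s_hat)                 # latent code in StyleGAN W space
    return generator(w_hat)                               # frozen StyleGAN3 synthesis
```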

Key Designs

1. Facial Feature Attribute Prompt Matching

The face is partitioned into multiple regions (eyes, nose, mouth, etc.), each with a predefined set of textual descriptions. These descriptions are encoded via the CLIP text encoder to obtain features \(v_i\), while the image encoder extracts image features \(I_{\text{feat}}\). Cosine similarity is used to select the best-matching description per region:

\[k_{\text{region}} = \arg\max_i \text{sim}(I_{\text{feat}}, v_i)\]

The best-matching text features across regions are concatenated into a full semantic representation \(s\) that captures attribute information from different facial regions.
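
A minimal sketch of this prompt-matching step, assuming the OpenAI `clip` package and an illustrative prompt vocabulary (the paper's actual region prompts are not public):

```python
import torch
import clip  # OpenAI CLIP; any implementation with encode_image/encode_text works

# Illustrative region prompt sets, not the paper's vocabulary.
REGION_PROMPTS = {
    "eyes":  ["a face with narrow eyes", "a face with round eyes", "a face with almond-shaped eyes"],
    "nose":  ["a face with a high nose bridge", "a face with a flat nose bridge"],
    "mouth": ["a face with thin lips", "a face with full lips"],
}

@torch.no_grad()
def extract_attribute_embedding(image, model, preprocess, device="cuda"):
    """Select the best-matching prompt per region and concatenate the text features into s."""
    img_feat = model.encode_image(preprocess(image).unsqueeze(0).to(device))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    region_feats = []
    for region, prompts in REGION_PROMPTS.items():
        tokens = clip.tokenize(prompts).to(device)
        txt_feats = model.encode_text(tokens)
        txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
        k = (img_feat @ txt_feats.T).argmax(dim=-1)   # best-matching description per region
        region_feats.append(txt_feats[k])              # keep its text feature
    return torch.cat(region_feats, dim=-1)             # full semantic representation s
```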

2. Template-to-Attribute Alignment (TAA) Adapter

A lightweight MLP (two fully connected layers + ReLU) predicts CLIP attribute embeddings \(\hat{s}\) from the leaked template \(t\). The training loss combines MSE and cosine alignment:

\[\mathcal{L}_{\text{sem}} = 0.7 \|s - \hat{s}\|_2^2 + 0.3(1 - \cos(s, \hat{s}))\]

Optimized with Adam for 20 epochs. During the attack stage, only the TAA Adapter is required — no original images are needed.
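
A plausible PyTorch sketch of the adapter and its alignment loss; layer widths and the attribute dimension are chosen for illustration only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAAAdapter(nn.Module):
    """Lightweight MLP mapping a leaked template t to a CLIP attribute embedding s_hat.
    Dimensions are illustrative (e.g. a 512-d template, 3 regions x 512-d CLIP features)."""
    def __init__(self, template_dim=512, hidden_dim=1024, attr_dim=512 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(template_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, attr_dim),
        )

    def forward(self, t):
        return self.net(t)

def semantic_alignment_loss(s, s_hat):
    """L_sem = 0.7 * MSE + 0.3 * (1 - cosine similarity), following the formula above."""
    mse = F.mse_loss(s_hat, s)
    cos = F.cosine_similarity(s_hat, s, dim=-1).mean()
    return 0.7 * mse + 0.3 * (1.0 - cos)
```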

3. Fusion-Latent Projector (\(M_{\text{FLP}}\))

A multi-branch feedforward network with three input branches: noise \(n\) (providing stochastic variation), template projection \(t'\), and semantic projection \(s'\) (divided into region-level tokens). The core component is a multi-head attention fusion mechanism where the identity template serves as query and region attribute tokens serve as keys/values:

\[\tilde{s} = \text{MHA}(Q = t', K = [s'_1, \ldots, s'_R], V = [s'_1, \ldots, s'_R])\]

This enables the network to automatically learn which attributes are most important for identity recovery. The three branches are concatenated and passed through an MLP + LeakyReLU to produce \(\hat{w} \in \mathcal{W}\).
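
A hedged PyTorch sketch of this fusion module; embedding sizes, the number of regions, and the MLP width are assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

class FusionLatentProjector(nn.Module):
    """Sketch of M_FLP: fuses noise n, projected template t', and region attribute tokens s'
    via multi-head attention (template as query, region tokens as keys/values)."""
    def __init__(self, template_dim=512, attr_dim=512, num_regions=3,
                 noise_dim=512, embed_dim=512, w_dim=512, num_heads=8):
        super().__init__()
        self.num_regions = num_regions
        self.template_proj = nn.Linear(template_dim, embed_dim)   # t -> t'
        self.attr_proj = nn.Linear(attr_dim, embed_dim)           # each region token -> s'_r
        self.mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(noise_dim + 2 * embed_dim, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, w_dim),                               # latent code w_hat in W space
        )

    def forward(self, n, t, s):
        t_proj = self.template_proj(t).unsqueeze(1)                          # (B, 1, E) query
        s_tokens = self.attr_proj(s.view(s.size(0), self.num_regions, -1))   # (B, R, E) keys/values
        s_fused, _ = self.mha(t_proj, s_tokens, s_tokens)                    # attended attribute summary
        fused = torch.cat([n, t_proj.squeeze(1), s_fused.squeeze(1)], dim=-1)
        return self.head(fused)
```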

Loss & Training

Latent Distribution Alignment (WGAN): A Wasserstein GAN with a 3-layer MLP discriminator \(C\) aligns generated latent codes with the StyleGAN prior.

Reconstruction-Guided Refinement: \(\mathcal{L}_{\text{rec}} = \mathcal{L}_{\text{pix}} + \mathcal{L}_{\text{id}} + \mathcal{L}_{\text{attr}} + \mathcal{L}_{\text{lpips}}\) (all weights set to 1.0).

Total objective: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{WGAN}} + \mathcal{L}_{\text{rec}}\).

Optimized with Adam (lr=0.1) + StepLR on a single RTX 3090; StyleGAN3 generates images at 1024×1024.
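
Putting the pieces together, a sketch of the generator-side training objective. The critic update, data loading, and the exact form of the pixel term are assumptions; the helper callables (critic, fr_model, clip_attr_fn, lpips_fn) are hypothetical stand-ins for the components described above:

```python
import torch
import torch.nn.functional as F

def total_loss(w_hat, I_hat, I, critic, fr_model, clip_attr_fn, lpips_fn):
    """Sketch of L_total = L_WGAN + L_rec with all reconstruction weights set to 1.0.
    critic: 3-layer MLP Wasserstein discriminator on latent codes,
    fr_model: identity template extractor, clip_attr_fn: CLIP attribute embedding extractor,
    lpips_fn: perceptual distance (e.g. the lpips package)."""
    # WGAN generator term: align generated latents with the StyleGAN latent prior.
    l_wgan = -critic(w_hat).mean()

    # Reconstruction-guided refinement: pixel (L1 assumed), identity, attribute, perceptual.
    l_pix = F.l1_loss(I_hat, I)
    l_id = 1.0 - F.cosine_similarity(fr_model(I_hat), fr_model(I), dim=-1).mean()
    l_attr = F.mse_loss(clip_attr_fn(I_hat), clip_attr_fn(I))
    l_lpips = lpips_fn(I_hat, I).mean()

    return l_wgan + l_pix + l_id + l_attr + l_lpips
```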

Key Experimental Results

Main Results

Table 2: Type-I/II TAR (%) — Identity Verification

| Setting (\(F_{\text{db}}/F_{\text{loss}}\)) | Dataset | Type / FAR | Otroshi et al. | CLIP-FTI |
|---|---|---|---|---|
| ArcFace/ElasticFace | LFW | Type-I @ FAR=0.1% | 95.01 | 99.37 |
| ArcFace/ElasticFace | LFW | Type-II @ FAR=0.1% | 46.55 | 81.74 |
| ArcFace/ElasticFace | CelebA-HQ | FAR=0.1% | 89.79 | 95.35 |
| ArcFace/ElasticFace | AgeDB | FAR=0.1% | 79.82 | 90.02 |

Table 3: Perceptual and Attribute Quality

| Metric (Dataset) | Otroshi et al. | CLIP-FTI |
|---|---|---|
| MS-SSIM ↑ (LFW) | 0.2428 | 0.2527 |
| LPIPS ↓ (LFW) | 0.5534 | 0.5419 |
| FAMSE ↓ (LFW) | 0.0503 | 0.0451 |
| FAMSE ↓ (AgeDB) | 0.0473 | 0.0437 |

Ablation Study

Architecture Component Ablation (LFW, FAR=0.1%)

| Variant | Type-I | Type-II | FAMSE ↓ |
|---|---|---|---|
| Full CLIP-FTI | 99.37 | 81.74 | 0.0451 |
| w/o AttrEmb | 95.10 | 46.55 | 0.0503 |
| w/o MHA | 95.53 | 47.12 | 0.0501 |

Loss Term Ablation: Removing \(\mathcal{L}_{\text{lpips}}\) causes the largest degradation, with Type-II TAR dropping sharply from 72.29 to 44.53.

Key Findings

  • The improvement in Type-II TAR is particularly striking (+35 pp), indicating that CLIP attribute conditioning substantially enhances cross-image identity consistency.
  • Cross-Model Transferability (Table 4): CLIP-FTI outperforms the baseline in 28 out of 30 cross-architecture scenarios, with the largest gains on lightweight models (HRNet: 51.63→65.23).
  • CLIP semantic conditioning does not rely on architectural similarity between the surrogate and target models.

Highlights & Insights

  1. First to introduce auxiliary information beyond the template: Breaks the paradigm of relying solely on the leaked template by leveraging CLIP semantic embeddings to supplement attribute details.
  2. Elegant attention-based fusion: The MHA design using identity template as query and region attribute tokens as key/value automatically learns attribute importance.
  3. Single forward pass inference: Unlike search-based methods requiring hundreds of iterations, the approach is efficient and practically applicable.
  4. Security implications: From an attack perspective, the work reveals the severe privacy risks associated with face template leakage.

Limitations & Future Work

  1. TAA prediction quality is constrained by the coverage of the CLIP attribute prompt set; more fine-grained prompts may yield further improvements.
  2. The method is bounded by StyleGAN3's generative capacity, potentially limiting performance under extreme poses or occlusions.
  3. Evaluation is currently limited to 1024×1024; scalability to higher resolutions remains unexplored.
  4. The current setup assumes direct injection of reconstructed images; physical-world attack scenarios present additional challenges.

Related Work

  • Arc2Face: Uses ArcFace embeddings for diffusion-based face synthesis — conceptually related but targeting a different objective.
  • StyleCLIP / StyleGAN-NADA: CLIP-guided GAN editing; this paper draws inspiration from the CLIP+GAN paradigm.
  • Insights: The framework is extensible to security analyses of other biometric templates; the CLIP attribute conditioning paradigm is also applicable to controllable face generation.

Rating

  • Novelty: ⭐⭐⭐⭐ — First to introduce CLIP attribute embeddings into FTI, establishing a new attack paradigm.
  • Technical Depth: ⭐⭐⭐⭐ — Complete technical stack combining TAA, MHA fusion, and WGAN alignment.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — 3 datasets × 5 FR models × 30 cross-architecture scenarios.
  • Writing Quality: ⭐⭐⭐⭐ — Clear problem formulation with rigorous formalization of the attack scenario.