Efficient, Validation-Free Intrinsic Quality Estimation for Large-Scale Face Recognition Datasets

Conference: ICML 2026
arXiv: 2605.29720
Code: None
Area: Face Recognition / Dataset Quality Estimation / Representation Learning Diagnosis
Keywords: Intrinsic Quality, Effective Rank, Neighborhood Consistency, Face Recognition Datasets, Validation-Free Evaluation

TL;DR

The authors propose Intrinsic Quality (IQ): after extracting embeddings via a proxy model, this metric performs a weighted fusion of "neighborhood label consistency (Consis)" and "normalized spectral entropy effective rank \(\tilde{r}_{\mathrm{ent}}\)." IQ provides a "trainability" score for million-scale face recognition datasets without requiring full training or clean validation sets. On WebFace4/12/42M and noise-injected settings, it achieves a Spearman correlation of 1.0 with the downstream MFR-ALL validation accuracy rankings.

Background & Motivation

Background: Modern face recognition (FR) training relies heavily on million-scale weakly supervised web data (MS-Celeb-1M, VGGFace2, WebFace260M/42M). Combined with angular margin classification losses like ArcFace, performance is strongly coupled with data scale, shifting the research paradigm from "model-centric" to "data-centric."

Limitations of Prior Work: Traditionally, there are only two ways to decide whether a dataset variant is worth the cost of full-scale training: run a full training session and check downstream validation accuracy, or rely on a clean held-out validation set. The former often consumes thousands of GPU hours, while the latter is often unavailable due to privacy or licensing restrictions. Furthermore, datasets produced by automated cleaning pipelines, such as WebFace, still contain residual noise, identity merges/splits, and long-tail distributions. Training-time denoising methods (Co-Mining, Global-Local GCN, etc.) mitigate these issues, but their effect can only be judged after a full training run.

Key Challenge: There is a critical confounder in weakly supervised web data: global spectral complexity (effective rank) increases under both "benign data scaling" and "label contamination." Consequently, a global metric alone (such as RankMe) cannot distinguish between a "more diverse" dataset and a "dirtier" one. A diagnostic signal is needed to decouple these two sources.

Goal: To provide a "trainability" proxy metric that can rank candidate FR datasets without full training, clean validation sets, or dataset-specific hyperparameter tuning, and to verify its correlation and ranking consistency with downstream MFR-ALL validation accuracy.

Key Insight: The authors observe that local signals (label consistency within k-NN neighborhoods) and global signals (effective rank of the embedding covariance spectrum) respond differently to "data scaling" versus "noise injection." Under clean data scaling, neighborhood consistency remains stable while the spectrum expands; under noise injection, the spectrum still expands, but neighborhood consistency collapses. Together, they form a complementary two-dimensional plane that geometrically separates these two regimes.

Core Idea: Use a weighted sum of local Consis and the global normalized effective rank as a dataset-level intrinsic quality score, where Consis acts as a correction term that suppresses the "false complexity" caused by noise.

Method

Overall Architecture

The input is a face training set \(\mathcal{D}=\{(x_i,y_i)\}_{i=1}^N\) with (potentially noisy) identity labels; the output is a scalar IQ score used to rank candidate dataset variants. The pipeline has four steps: (1) train a lightweight proxy FR model \(f_\theta\) on \(\mathcal{D}\) using ArcFace and extract \(\ell_2\)-normalized \(d\)-dimensional embeddings; (2) perform identity-stratified sampling to obtain a subset \(\widetilde{\mathcal{D}}\) (approx. 1000 identities \(\times\) 10 images/identity \(\approx\) 10k samples, with intra-identity near-duplicates removed), so that the diagnostic cost stays at the 10k-sample scale even when the source dataset has millions of images; (3) compute two complementary signals on \(\widetilde{\mathcal{D}}\): the neighborhood label consistency rate Consis via cosine k-NN, and the normalized spectral entropy effective rank \(\tilde{r}_{\mathrm{ent}}\) of the \(d\times d\) embedding covariance; (4) combine the two signals with fixed weights \(\alpha=0.2, \beta=0.8\) to obtain IQ for candidate ranking, then check its correlation (Spearman, Pearson, Kendall \(\tau\)) against MFR-ALL validation accuracy after full training.
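Step (2) of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `stratified_subset`, the fixed seed, and the rule of skipping identities with fewer than `per_id` images are assumptions made here, and the near-duplicate removal step is omitted.

```python
import random
from collections import defaultdict

def stratified_subset(labels, n_ids=1000, per_id=10, seed=0):
    """Pick up to n_ids identities and per_id images from each (indices returned)."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for idx, y in enumerate(labels):
        by_id[y].append(idx)
    # Assumption: identities with too few images are skipped entirely.
    eligible = sorted(y for y, idxs in by_id.items() if len(idxs) >= per_id)
    chosen_ids = rng.sample(eligible, min(n_ids, len(eligible)))
    subset = []
    for y in chosen_ids:
        subset.extend(rng.sample(by_id[y], per_id))
    return sorted(subset)

# Toy data: 50 identities with 20 images each.
labels = [i // 20 for i in range(1000)]
subset = stratified_subset(labels, n_ids=5, per_id=10)
print(len(subset))  # 50
```

Sampling a fixed number of images per identity keeps the subset size independent of the source dataset size, which is what lets the later k-NN and covariance computations run on roughly 10k points.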

Key Designs

  1. Neighbor-Consistency (Local Label Consistency):

    • Function: Characterizes "local semantic cohesion" via the label consistency rate of k-NN neighborhoods, acting as a probe for label noise and identity merges/splits.
    • Mechanism: For each sampled embedding \(e_i\), the \(k\) nearest neighbors (excluding itself) are retrieved using cosine similarity. The proportion of neighbors sharing the same label \(y_i\) is calculated as \(c_i = \frac{1}{k}\sum_{j\in \mathcal{N}_k(i)}\mathbf{1}\{y_j=y_i\}\). The final \(\bar c\) is the mean across the subset; \(k=10\) is the default.
    • Design Motivation: In weakly supervised settings, identity label flips or merges/splits directly destroy label homogeneity within neighborhoods. In contrast, clean data scaling rarely disperses these "locally compact clusters." Thus, \(\bar c\) is sensitive to "contamination" but largely insensitive to "scale," which is exactly the distinction the global spectral signal cannot make.
  2. Normalized Effective Rank \(\tilde{r}_{\mathrm{ent}}\) (Global Subspace Complexity):

    • Function: Characterizes "how many dimensions the embeddings span" via spectral entropy, reflecting data diversity and representational richness.
    • Mechanism: After mean-subtraction of subset embeddings, the covariance \(C=\frac{1}{n}\tilde E^\top \tilde E\) is computed to obtain eigenvalues \(\{\lambda_\ell\}\), which are normalized into probabilities \(p_\ell\). The spectral entropy effective rank is first calculated as \(r_{\mathrm{ent}}=\exp\left(-\sum_\ell p_\ell\log p_\ell\right)\) following Roy & Vetterli; then, logarithmic normalization \(\tilde{r}_{\mathrm{ent}}=\log r_{\mathrm{ent}} / \log Q\) (where \(Q=\min(n,d)\)) is applied to ensure comparability across different \((n,d)\) and to compress the near-saturation regions.
    • Design Motivation: "Benign diversity" from data scaling causes the spectrum to spread from a few primary directions to many, leading to a monotonic increase in \(\tilde{r}_{\mathrm{ent}}\). However, noise also injects "false variance" that flattens the spectrum. Used alone, the metric cannot tell these two causes apart, which is why it must be paired with Consis.
  3. Convex Combination Fusion (IQ):

    • Function: Merges local and global signals into a single scalar score for cross-variant ranking.
    • Mechanism: \(\mathrm{IQ}=\alpha\cdot\bar c+\beta\cdot \tilde{r}_{\mathrm{ent}}\), where \(\alpha+\beta=1\). The paper fixes \(\alpha=0.2, \beta=0.8\) throughout all experiments, never tuning per dataset/noise rate/proxy.
    • Design Motivation: Under clean scaling regimes, Consis is near saturation and has a small dynamic range, so \(\tilde{r}_{\mathrm{ent}}\) is weighted higher to capture subspace expansion. Under contamination regimes, Consis provides a downward pull to prevent \(\tilde{r}_{\mathrm{ent}}\) from overestimating quality due to "false complexity." Section 5.4 sweeps \(\beta\) and reports a wide high-correlation region rather than a sharp peak.
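The three designs above can be put together in a short NumPy sketch. It follows the formulas as summarized here (\(c_i\), \(r_{\mathrm{ent}}\), \(\tilde{r}_{\mathrm{ent}}\), and the convex combination), but the function names, the dense similarity matrix, and the synthetic Gaussian clusters are assumptions for illustration only. In particular, the toy flips labels on fixed embeddings, so it shows only the Consis response; in the paper the proxy is retrained on the noisy data, which is what changes the spectrum.

```python
import numpy as np

def neighbor_consistency(E, y, k=10):
    """Mean share of each sample's k cosine nearest neighbours with the same label."""
    S = E @ E.T                       # cosine similarity (rows are unit-norm)
    np.fill_diagonal(S, -np.inf)      # exclude the sample itself
    nn = np.argpartition(-S, k, axis=1)[:, :k]   # indices of the k most similar
    return float((y[nn] == y[:, None]).mean())

def normalized_effective_rank(E):
    """log(r_ent) / log(Q), with r_ent = exp(spectral entropy) and Q = min(n, d)."""
    n, d = E.shape
    Ec = E - E.mean(axis=0, keepdims=True)
    lam = np.clip(np.linalg.eigvalsh(Ec.T @ Ec / n), 0.0, None)
    p = lam / lam.sum()
    p = p[p > 0]                      # treat 0 * log 0 as 0
    entropy = -(p * np.log(p)).sum()  # equals log(r_ent)
    return float(entropy / np.log(min(n, d)))

def intrinsic_quality(E, y, k=10, alpha=0.2, beta=0.8):
    return alpha * neighbor_consistency(E, y, k) + beta * normalized_effective_rank(E)

# Synthetic stand-in for proxy embeddings: 20 identities x 15 samples, d = 32.
rng = np.random.default_rng(0)
n_ids, per_id, d = 20, 15, 32
y = np.repeat(np.arange(n_ids), per_id)
E = rng.normal(size=(n_ids, d))[y] + 0.05 * rng.normal(size=(n_ids * per_id, d))
E /= np.linalg.norm(E, axis=1, keepdims=True)    # l2-normalize

# Flip roughly 40% of the labels to random identities (closed-set noise).
y_noisy = y.copy()
flip = rng.random(len(y)) < 0.4
y_noisy[flip] = rng.integers(0, n_ids, size=int(flip.sum()))

c_clean = neighbor_consistency(E, y)
c_noisy = neighbor_consistency(E, y_noisy)
r = normalized_effective_rank(E)
print(f"Consis clean={c_clean:.3f} noisy={c_noisy:.3f} r_ent_norm={r:.3f}")
print(f"IQ clean={intrinsic_quality(E, y):.3f} noisy={intrinsic_quality(E, y_noisy):.3f}")
```

The dense \(n\times n\) similarity matrix is workable at the 10k-sample scale described above; a larger subset would likely call for an approximate nearest-neighbour index instead.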

Loss & Training

The proxy model \(f_\theta\) is trained directly on \(\mathcal{D}\) using standard ArcFace (ResNet-50 or ResNet-100, \(d=1024\)). ResNet-100 is used for primary trend analysis. IQ itself contains no learnable parameters and is a post-hoc statistic of the embedding geometry.

Key Experimental Results

Main Results: Clean Scaling + Noise Injection

Under clean scaling (WebFace 4M → 12M → 42M), IQ increases alongside downstream MFR-ALL performance. Upon injecting closed-set label flips into WebFace12M at rates of {2%, 5%, 10%, 20%, 40%}, downstream accuracy monotonically decreases. While \(\tilde{r}_{\mathrm{ent}}\) is pushed higher by noise, Consis collapses significantly, allowing IQ to track the downstream ranking.

| Dataset | Noise | Acc (MFR-ALL, %) | \(\tilde{r}_{\mathrm{ent}}\) | Consis | IQ |
|---|---|---|---|---|---|
| WebFace4M | 0% | 90.36 | 0.882 | 0.980 | 0.902 |
| WebFace12M | 0% | 94.37 | 0.916 | 0.987 | 0.930 |
| WebFace42M | 0% | 96.26 | 0.964 | 0.986 | 0.968 |
| WebFace12M | 5% | 94.21 | 0.927 | 0.897 | 0.921 |
| WebFace12M | 20% | 90.76 | 0.959 | 0.676 | 0.903 |
| WebFace12M | 40% | 72.01 | 0.994 | 0.401 | 0.875 |
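As a quick arithmetic check, the IQ column can be recomputed from the other two columns with \(\mathrm{IQ}=0.2\cdot\bar c+0.8\cdot\tilde{r}_{\mathrm{ent}}\). The sketch below hard-codes the six rows of the table; the variable names are illustrative.

```python
ALPHA, BETA = 0.2, 0.8

rows = [
    # (dataset, noise, r_ent_norm, consis, reported_iq)
    ("WebFace4M",  "0%",  0.882, 0.980, 0.902),
    ("WebFace12M", "0%",  0.916, 0.987, 0.930),
    ("WebFace42M", "0%",  0.964, 0.986, 0.968),
    ("WebFace12M", "5%",  0.927, 0.897, 0.921),
    ("WebFace12M", "20%", 0.959, 0.676, 0.903),
    ("WebFace12M", "40%", 0.994, 0.401, 0.875),
]

recomputed = {}
for name, noise, r_ent, consis, reported in rows:
    iq = ALPHA * consis + BETA * r_ent
    recomputed[(name, noise)] = iq
    print(f"{name:<11}{noise:>4}  recomputed={iq:.4f}  reported={reported:.3f}")
```

Every recomputed value agrees with the reported IQ to within 0.001. The small gaps (for example 0.9024 versus 0.903 at 20% noise) are consistent with the inputs having been rounded to three decimals.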

Comparison with External Validation-Free Baselines (Scaling + Noise Union)

| Metric | Spearman | Pearson | Kendall \(\tau\) |
|---|---|---|---|
| RankMe | 0.418 | 0.752 | 0.300 |
| ER-only (\(\tilde{r}_{\mathrm{ent}}\)) | 0.286 | 0.398 | 0.190 |
| Consis-only (\(\bar c\)) | 0.607 | 0.491 | 0.429 |
| IQ (Ours) | 1.000 | 0.891 | 1.000 |
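To make the rank statistics concrete, the sketch below computes Spearman and Kendall \(\tau\) by hand on the six settings listed in the main results table. It is an illustration under two assumptions: there are no ties (true for these values), and only those six rows are used. The paper's coefficients are presumably computed over a larger set of settings, so the ER-only number printed here is not expected to match the 0.286 reported above.

```python
def ranks(values):
    """1-based ascending ranks; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def spearman(x, y):
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)   # +1 concordant, -1 discordant
    return score / (n * (n - 1) / 2)

# Order: 4M, 12M, 42M, 12M+5%, 12M+20%, 12M+40%
acc   = [90.36, 94.37, 96.26, 94.21, 90.76, 72.01]
iq    = [0.902, 0.930, 0.968, 0.921, 0.903, 0.875]
r_ent = [0.882, 0.916, 0.964, 0.927, 0.959, 0.994]

print("IQ      spearman:", spearman(acc, iq), "kendall:", kendall_tau(acc, iq))  # 1.0 and 1.0
print("ER-only spearman:", round(spearman(acc, r_ent), 3))
```

With so few points, a perfect rank correlation only requires that six values fall in the same order, which is worth keeping in mind when reading the 1.000 entries.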

Key Findings

  • Spectral complexity is ambiguous: At 40% noise on WebFace12M, \(\tilde{r}_{\mathrm{ent}}=0.994\) is the highest in the table, yet downstream accuracy is only 72.01%. This confirms that "noise pushes up effective rank," explaining why RankMe and ER-only fail.
  • Robustness to \(\beta\) and Sampling: A \(\beta\) sensitivity sweep shows that Spearman/Pearson stay close to their peak over a wide range of \(\beta\), suggesting the fixed weights were not over-fitted. Stability tests from 2k to 100k samples show that \(\tilde{r}_{\mathrm{ent}}\) and Consis stabilize once the subset reaches about 10k samples, so the estimation cost stays bounded.
  • Proxy Architectures: Absolute values shift between ResNet-50 and ResNet-100 proxies, but relative rankings across datasets remain consistent. This indicates IQ captures intrinsic dataset structure rather than proxy architecture artifacts.
  • Subset Ranking: In sorting experiments for WebFace12M / HighVar / LowVar subsets, IQ maintained the downstream accuracy ranking (HighVar 94.45 > 12M 94.37 > LowVar 93.04; IQ 0.932 > 0.930 > 0.913), supporting its use in "rank-then-train" scenarios.
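The \(\beta\)-sensitivity finding can be illustrated on the six settings from the main results table. This is a rough sketch and not a reproduction of the paper's Section 5.4 sweep: it uses only those six rows, an arbitrary grid of \(\beta\) values in steps of 0.1, and a tie-free Spearman helper written here for self-containment.

```python
# (r_ent_norm, consis, accuracy) for 4M, 12M, 42M, 12M+5%, 12M+20%, 12M+40%
rows = [
    (0.882, 0.980, 90.36),
    (0.916, 0.987, 94.37),
    (0.964, 0.986, 96.26),
    (0.927, 0.897, 94.21),
    (0.959, 0.676, 90.76),
    (0.994, 0.401, 72.01),
]

def ranks(values):
    """1-based ascending ranks; ties are broken by position (assumed rare)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def spearman(x, y):
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

acc = [row[2] for row in rows]
sweep = []
for t in range(11):
    beta = t / 10                       # alpha = 1 - beta (convex combination)
    iq = [(1 - beta) * c + beta * r for r, c, _ in rows]
    sweep.append((beta, spearman(iq, acc)))

for beta, rho in sweep:
    print(f"beta={beta:.1f}  spearman={rho:+.3f}")
```

The endpoints \(\beta=0\) and \(\beta=1\) correspond to Consis-only and ER-only scoring, so the printout shows how the ranking agreement changes between the two single-signal extremes.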

Highlights & Insights

  • The observation that "global spectral complexity increases in both data scaling and noise injection regimes" is very clean. It explains why single-spectrum metrics (RankMe/Effective Rank) fail on weakly supervised data. Decoupling this with k-NN local consistency provides two distinct geometric trajectories in the \((\tilde{r}_{\mathrm{ent}}, \mathrm{Consis})\) plane.
  • The entire metric uses fixed weights (\(\alpha=0.2, \beta=0.8\)) without per-dataset tuning. The existence of a wide high-correlation plateau for \(\beta\) makes it more credible for real-world data iteration.
  • By using lightweight proxies and 10k identity-stratified subsets, the authors reduce the cost of million-scale data evaluation significantly. Since IQ is a post-hoc statistic, this "diagnosis-training decoupling" can be transferred to other weakly supervised large-scale fields (e.g., retrieval, re-ID) where "local label homogeneity + global subspace expansion" axes coexist.
  • The per-sample \(c_i\) distribution (shifting from a near-saturated peak to a long-tail distribution under noise) provides a natural debugging perspective, which can guide automated cleaning.

Limitations & Future Work

  • Proxy Dependency: Extreme proxy weakness or strong domain shifts may distort IQ. The paper does not define a minimum capability threshold for the proxy model.
  • Artificial Noise Models: Experiments primarily used uniform closed-set label flips, which do not fully cover realistic web data issues like identity merge/split, near-duplicate clusters, structured confusion between visually similar identities, and long-tail biases.
  • Evaluation Breadth: Downstream testing was limited to the MFR-ALL benchmark. The definition of "trainability" is tied to a specific training+evaluation protocol, and cross-benchmark/cross-architecture generalization remains a hypothesis.
  • Statistical Robustness: The Spearman/Kendall \(\tau\) of exactly 1.000 in the main correlation table should be read with caution. It may simply reflect the small number of discrete settings being compared, since a perfect rank correlation is easy to reach with only a handful of points. More mixed-regime points would be needed to assess how robust these figures are.

Related Work & Insights

  • vs RankMe (Garrido et al., 2023): RankMe is also a validation-free effective rank metric but focuses only on the global spectrum. Experiments show RankMe's Spearman (0.418) lags behind IQ (1.000) because RankMe is "fooled" by the false complexity of noise.
  • vs Sample-level Quality (SER-FIQ / MagFace): Those methods provide per-image recognizability scores for "picking good images." IQ provides dataset-level trainability scores for "picking good datasets." The granularities are complementary.
  • vs Robust Training (Co-Mining / Global-Local GCN): These methods mitigate noise during training but still require full runs to evaluate dataset variants. IQ moves judgment to the pre-training phase, acting as a diagnostic pre-filtering module.
  • vs Transferability Metrics (LEEP / TransRate): While similar in spirit (using low-cost signals to predict downstream results), LEEP/TransRate measure source-target task transferability, whereas IQ measures the trainability of dataset variants within the same task while explicitly handling weakly supervised label noise.

Rating

  • Novelty: ⭐⭐⭐⭐ Explicitly identifying and decoupling the "spectral complexity increase" confounder using k-NN consistency is insightful.
  • Experimental Thoroughness: ⭐⭐⭐ Good coverage of scaling, 6-level noise injection, proxy robustness, and stability, though limited by a single downstream benchmark and idealized noise.
  • Writing Quality: ⭐⭐⭐⭐ The link from motivation to hypothesis to fusion is very logical. Each design choice is well-argued, and the emphasis on the confounder helps the reader stay on track.
  • Value: ⭐⭐⭐⭐ Provides a lightweight diagnostic tool for cost-sensitive million-scale FR engineering. The "diagnose-before-train" paradigm is directly applicable to FR pipelines and can be extrapolated to other large-scale weakly supervised domains.