Identifying and Understanding Cross-Class Features in Adversarial Training
Conference: ICML 2025
arXiv: 2506.05032
Code: PKU-ML/Cross-Class-Features-AT
Area: Adversarial Training / AI Safety
Keywords: Adversarial Training, Cross-Class Features, Robust Overfitting, Knowledge Distillation, Feature Attribution
TL;DR
From the perspective of class-level feature attribution, this work reveals how "cross-class features" in adversarial training (AT) are first learned and then forgotten, offering a unified explanation for both robust overfitting and the advantages of soft-label training.
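Background & Motivation
Adversarial training (AT) is one of the most effective methods for making deep networks robust to adversarial attacks. Its core is a min-max optimization:
\[\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \max_{\|\delta\|_p \le \epsilon} \ell\big(f_\theta(x+\delta),\, y\big) \Big]\]
where \(\theta\) denotes the model parameters, \(\delta\) is a perturbation bounded by \(\epsilon\) in \(\ell_p\) norm, and \(\ell\) is the classification loss.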
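The inner maximization of the min-max objective is usually approximated with projected gradient descent (PGD). The following is a minimal NumPy sketch on a toy linear softmax classifier, not the paper's training code: the model, the random data, the step size, and the function names are all illustrative assumptions, and the usual clipping of images to a valid pixel range is omitted.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(W, x, y):
    return -np.log(softmax(W @ x)[y])

def pgd_linf(W, x, y, eps, step, n_steps):
    """Search for delta with ||delta||_inf <= eps that increases the loss."""
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        p = softmax(W @ (x + delta))
        p[y] -= 1.0                      # d(loss)/d(logits) = p - onehot(y)
        grad_x = W.T @ p                 # chain rule through the linear map
        # signed ascent step, then projection back onto the l_inf ball
        delta = np.clip(delta + step * np.sign(grad_x), -eps, eps)
    return delta

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))              # hypothetical 3-class, 5-feature model
x = rng.normal(size=5)
y = 0
eps = 8 / 255

delta = pgd_linf(W, x, y, eps=eps, step=eps / 4, n_steps=10)
clean_loss = cross_entropy(W, x, y)
adv_loss = cross_entropy(W, x + delta, y)
print(f"clean loss {clean_loss:.4f} -> adversarial loss {adv_loss:.4f}")
```

The outer minimization then updates the model parameters on `x + delta` instead of `x`.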
However, two poorly understood phenomena exist in AT:
Robust Overfitting: The model reaches its optimal test robust accuracy in the middle of training, after which the test robust accuracy gradually declines while the training robust error continues to decrease, creating a huge generalization gap.
Advantages of Soft Labels: Replacing one-hot labels with soft labels, for example those produced by knowledge distillation, can significantly improve AT performance (e.g., from 41% to 48% on CIFAR-10), yet the underlying reason remains unclear.
Existing explanations draw on data-level loss, label noise, and similar factors, but none offers a unified perspective. This work is the first to propose a unified hypothesis from the perspective of class-level feature attribution.
Method
Core Concepts: Cross-Class Features vs. Class-Specific Features
- Cross-class Features: Features shared by multiple classes, such as the "wheel" feature shared by cars and trucks in CIFAR-10.
- Class-specific Features: Features belonging exclusively to a single class, such as "frog eyes" for frogs.
Feature Attribution Metric
Let the classifier be \(f(\cdot) = Wg(\cdot)\), where \(g\) is the feature extractor and \(W \in \mathbb{R}^{K \times n}\) is the final linear layer. For a sample \(x\) and the \(i\)-th class, the attribution vector is defined as:
\[A_i(x) = g(x) \odot W[i] = \big(g(x)_1 W[i,1], \ldots, g(x)_n W[i,n]\big)\]
Each component \(g(x)_j W[i,j]\) represents the contribution of the \(j\)-th feature to the logit of the \(i\)-th class.
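Taking the attribution vector to be the elementwise product \(g(x) \odot W[i]\) (the form quoted in the Limitations section), a small NumPy sketch shows that its components add up to the class logit. The sizes and random values below are placeholders, and `g_x` merely stands in for the output of a real feature extractor.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 3, 4                        # hypothetical sizes: 3 classes, 4 features
W = rng.normal(size=(K, n))        # final linear layer (no bias, as in f = W g)
g_x = rng.random(n)                # stand-in for the penultimate features g(x)

def attribution(g_x, W, i):
    """Per-feature contribution g(x)_j * W[i, j] to the class-i logit."""
    return g_x * W[i]

A_0 = attribution(g_x, W, 0)
logits = W @ g_x
print(A_0, A_0.sum(), logits[0])   # the components sum to the class-0 logit
```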
Cross-Class Feature Correlation Matrix
A cross-class feature attribution correlation matrix \(C\) is constructed: entry \(C[i,j]\) is the cosine similarity between the feature attribution vectors of class \(i\) and class \(j\).
High \(C[i,j]\) values indicate that class \(i\) and class \(j\) share more features.
Numerical Metric CAS (Class Attribution Similarity)
CAS condenses the correlation matrix into a single number that reflects how much the model relies on cross-class features. Only positively correlated terms are counted.
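The pipeline from attribution vectors to CAS can be sketched as follows. The exact formulas are not reproduced in this summary, so the sketch rests on three stated assumptions: the class-level attribution is the mean of \(g(x) \odot W[i]\) over samples labeled \(i\), \(C[i,j]\) is the cosine similarity of these class-level vectors, and CAS is the sum of the positive off-diagonal entries of \(C\). All function names are hypothetical.

```python
import numpy as np

def class_attributions(feats, labels, W):
    """Row i: mean of g(x) * W[i] over samples with label i (assumed aggregation)."""
    K = W.shape[0]
    return np.stack([feats[labels == i].mean(axis=0) * W[i] for i in range(K)])

def correlation_matrix(A):
    """Pairwise cosine similarity between the rows of A."""
    unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    return unit @ unit.T

def cas(C):
    """Sum of positive off-diagonal entries of C (assumed definition)."""
    off_diag = C[~np.eye(C.shape[0], dtype=bool)]
    return np.clip(off_diag, 0.0, None).sum()

# Toy data: features 0-2 are exclusive to classes 0-2; feature 3 is shared by classes 0 and 1.
feats = np.array([[1.0, 0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0, 0.0]])
labels = np.array([0, 1, 2])

W_shared = np.array([[1.0, 0.0, 0.0, 1.0],     # classes 0 and 1 both use feature 3
                     [0.0, 1.0, 0.0, 1.0],
                     [0.0, 0.0, 1.0, 0.0]])
W_specific = W_shared.copy()
W_specific[:, 3] = 0.0                          # shared feature dropped

cas_shared = cas(correlation_matrix(class_attributions(feats, labels, W_shared)))
cas_specific = cas(correlation_matrix(class_attributions(feats, labels, W_specific)))
print(cas_shared, cas_specific)
```

In this toy setup the classifier that uses the shared feature has a positive CAS, while the one restricted to exclusive features has a CAS of zero, which mirrors the contrast the metric is meant to capture.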
Main Hypothesis: Two-Stage Dynamics of AT
- Initial Phase: The model simultaneously learns class-specific and cross-class features, both of which jointly reduce the robust loss.
- Later Phase: When the robust loss decreases to a certain extent, cross-class features hinder further decrease in loss because they produce positive logits on non-target classes. The model starts to abandon cross-class features and shifts to depending solely on class-specific features \(\rightarrow\) leading to robust overfitting.
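One way to see why a positive logit on a non-target class holds the loss up, assuming the standard softmax cross-entropy loss on logits \(z\) with true class \(y\) (this summary does not state the exact loss used in the analysis):

\[\ell(z, y) = -z_y + \log \sum_{k} e^{z_k}, \qquad \frac{\partial \ell}{\partial z_k} = p_k = \frac{e^{z_k}}{\sum_{m} e^{z_m}} > 0 \;\; (k \neq y), \qquad \frac{\partial \ell}{\partial z_y} = p_y - 1 < 0.\]

Every unit of logit that a shared feature adds to a non-target class \(k\) therefore raises the loss at rate \(p_k\), whereas a class-specific feature only raises \(z_y\), which always lowers the loss.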
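Theoretical Analysis: Synthetic Data Model
Consider a three-class classification task where each class has an exclusive feature \(x_{E,i}\) and a cross-class feature \(x_{C,j}\). The analysis uses a linear model that places weight \(w_1\) on the exclusive features and weight \(w_2\) on the cross-class features.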
Theorem 1 (Cross-class features are more sensitive to robust loss): There exists a threshold \(\epsilon_0 \in (0, \mu/2)\) such that when \(\epsilon > \epsilon_0\), the optimal weights in AT satisfy \(w_2 = 0\) (abandoning cross-class features), whereas for any \(\epsilon \in (0, \mu/2)\), \(w_1 > 0\) always holds (retaining class-specific features).
Theorem 2 (Cross-class features help robust classification): Within the range \(w_2 \in [0, w_1]\), increasing \(w_2\) monotonically increases the model's correct classification probability under adversarial attacks.
Theorem 3 (Soft labels retain cross-class features): The threshold for AT with label smoothing satisfies \(\epsilon_1 > \epsilon_0\), and for \(\epsilon \in (0, \epsilon_1)\), \(w_2^{\text{LS}}(\epsilon) > w_2^*(\epsilon)\), indicating that soft labels retain more cross-class features.
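The incentive behind Theorem 3 can be illustrated with plain cross-entropy. This is only a sketch of the loss behavior, not the theorem's setting: it assumes the smoothing convention \(q = (1-\alpha)\,\text{onehot}(y) + \alpha/K\), and the two candidate predictions are made-up numbers. Under one-hot targets the prediction that pushes non-target probabilities toward zero wins, while under smoothed targets the prediction that leaves some mass on other classes has the lower loss.

```python
import numpy as np

def smooth_labels(y, K, alpha):
    """Label smoothing: (1 - alpha) on the true class plus alpha / K on every class."""
    q = np.full(K, alpha / K)
    q[y] += 1.0 - alpha
    return q

def cross_entropy(q, p):
    return -(q * np.log(p)).sum()

K, y, alpha = 3, 0, 0.3
onehot = smooth_labels(y, K, 0.0)                # [1.0, 0.0, 0.0]
soft = smooth_labels(y, K, alpha)                # [0.8, 0.1, 0.1]

confident = np.array([0.98, 0.01, 0.01])         # almost no mass on non-target classes
shared = np.array([0.80, 0.10, 0.10])            # some mass left on non-target classes

for name, q in [("one-hot", onehot), ("smoothed", soft)]:
    print(name, cross_entropy(q, confident), cross_entropy(q, shared))
```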
Key Experimental Results
CAS and Robust Accuracy at Different AT Stages on CIFAR-10 (PreActResNet-18)
| Stage | Epoch | Robust Accuracy (RA) | CAS |
|---|---|---|---|
| Underfitting | 70 | 42.6% | 18.2 |
| Best | 108 | 47.8% | 25.6 |
| Overfitting | 200 | 42.5% | 9.0 |
\(\rightarrow\) CAS at the best checkpoint is much higher than at the overfitted checkpoint, validating the positive correlation between cross-class features and robust generalization.
\(\Delta\)CAS (Best - Last) under Different Perturbation Strengths \(\epsilon\)
| \(\epsilon\) | \(\Delta\)CAS | Overfitting Degree |
|---|---|---|
| 2/255 | 4.1 | Slight |
| 4/255 | 8.9 | Moderate |
| 6/255 | 13.8 | Severe |
| 8/255 | 16.6 | Very Severe |
\(\rightarrow\) Larger \(\epsilon\) leads to more forgotten cross-class features, which corresponds to more severe robust overfitting.
CAS Changes under Extremely Large \(\epsilon\)
| \(\epsilon\) | Epoch 10 CAS/RA | Best CAS/RA | Last CAS/RA |
|---|---|---|---|
| 8/255 | 16.7/36.9% | 25.6/47.8% | 9.0/42.5% |
| 12/255 | 15.6/29.8% | 18.9/38.7% | 8.7/34.1% |
| 16/255 | 14.4/23.8% | 17.5/31.3% | 8.4/28.1% |
\(\rightarrow\) Under extremely large \(\epsilon\), fewer cross-class features are learned even in the initial stage (lower Epoch 10 and Best CAS), so there is less to forget: the Best-to-Last CAS gap shrinks and robust overfitting is, counterintuitively, alleviated.
Comparison with Knowledge Distillation AT
| Method | Stage | RA | CAS |
|---|---|---|---|
| AT+KD | Best | 48.1% | 25.7 |
| AT+KD | Last | 46.2% | 24.1 |
\(\rightarrow\) KD maintains high CAS throughout the training process, with the CAS gap dropping from 16.6 to 1.6, and robust overfitting is significantly mitigated.
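The gaps quoted above follow directly from the reported numbers. A short Python check, using the standard AT row at \(\epsilon = 8/255\) and the AT+KD rows from the tables in this section (the dictionary layout is just an illustrative choice):

```python
# (CAS, robust accuracy in %) at the best and last checkpoints, copied from the tables
results = {
    "AT":    {"best": (25.6, 47.8), "last": (9.0, 42.5)},
    "AT+KD": {"best": (25.7, 48.1), "last": (24.1, 46.2)},
}

gaps = {}
for method, r in results.items():
    cas_gap = round(r["best"][0] - r["last"][0], 1)   # Best CAS minus Last CAS
    ra_drop = round(r["best"][1] - r["last"][1], 1)   # robust accuracy lost after the best epoch
    gaps[method] = (cas_gap, ra_drop)
    print(f"{method}: CAS gap {cas_gap}, RA drop {ra_drop} points")
```

With knowledge distillation both the CAS gap and the drop in robust accuracy are several times smaller than for standard AT.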
Cross-Dataset and Cross-Architecture Verification
- CIFAR-100: Best CAS=569, Last CAS=352
- TinyImageNet: Best CAS=1548, Last CAS=998
- \(\ell_2\)-AT: Best CAS=22.1, Last CAS=10.7
- DeiT-Ti (Transformer): Best CAS=25.4, Last CAS=16.6
\(\rightarrow\) The conclusion is consistent across all setups: the utilization of cross-class features is strongly positively correlated with robust generalization.
Highlights & Insights
- Novel Perspective: Provides the first unified explanation for both robust overfitting and the benefits of soft labels in AT from the perspective of "cross-class features".
- Intuitive Quantitative Metric (CAS): Concisely and effectively quantifies the model's reliance on cross-class features based on feature attribution of the final linear layer.
- Theory Matched by Experiments: Three theorems on synthetic data characterize the sensitivity and utility of cross-class features, and the experimental results are consistent with their predictions.
- Saliency Map Visualization: Visually demonstrates via Grad-CAM that the best checkpoint focuses on global features (wheels + car body), whereas the overfitted checkpoint focuses only on local exclusive features (curved roof), providing strong explanatory power.
- Broad Coverage: Validates the hypothesis across \(\ell_\infty\)/\(\ell_2\) norms, CNN/Transformer architectures, multiple datasets, and Fast AT (FAT) scenarios.
Limitations & Future Work
- CAS Dependency on Linear Layer Assumption: The attribution vector \(A_i(x) = g(x) \odot W[i]\) is only applicable to architectures where the final layer is linear, and cannot be directly generalized to more complex classification heads.
- Theory Limited to Synthetic Models: The theoretical analysis on the three-class linear model is relatively simplified and still has a gap with the actual training dynamics of deep networks.
- No New Defense Method Proposed: The paper focuses primarily on analysis and understanding, without designing new adversarial training algorithms based on the cross-class feature hypothesis.
- Uncertain Causality: While the correlation between CAS and robust accuracy is confirmed, whether it represents a direct causal relationship still requires more rigorous validation.
- Insufficient Large-Scale Dataset Validation: Experiments are mainly conducted on CIFAR-10/100 and TinyImageNet, with no validation at ImageNet scale.
Related Work & Insights
- Ilyas et al., 2019: Proposed the framework of robust vs. non-robust features. Building on it, this paper further distinguishes between cross-class and class-specific robust features.
- Rice et al., 2020: The first to systematically study robust overfitting. This work provides a new feature-level explanation for it.
- Chen et al., 2021 (ARD): A representative work that improves AT using knowledge distillation. This paper explains its success from the perspective of cross-class features.
- Insight: The forgetting mechanism of cross-class features suggests that robust overfitting could be mitigated by explicitly encouraging the learning of cross-class features (e.g., through feature regularization or cross-class contrastive learning).
Rating
- Novelty: ⭐⭐⭐⭐ — Novel perspective of cross-class features, providing a unified explanation for two major phenomena.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive validation across multiple datasets, multiple architectures, and multiple norms.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, combining theory and experiments tightly.
- Value: ⭐⭐⭐⭐ — Provides deep understanding of AT but does not directly yield new methods.