# FRET: Feature Redundancy Elimination for Test Time Adaptation
Conference: ICCV 2025 · arXiv: 2505.10641 · Code: GitHub · Area: AI Safety · Keywords: Test-time adaptation, feature redundancy elimination, distribution shift, graph convolutional network, contrastive learning
## TL;DR
This paper proposes Feature Redundancy Elimination (FRET) as a novel perspective for test-time adaptation (TTA), observing that embedding feature redundancy increases significantly under distribution shift. Two methods are designed: S-FRET (direct minimization of the redundancy score) and G-FRET (GCN-based attention-redundancy decomposition with bi-level optimization). G-FRET achieves state-of-the-art performance across multiple architectures and datasets.
## Background & Motivation
Deep neural networks perform well under the i.i.d. assumption, but frequently encounter distribution shift in real-world deployment. TTA requires access only to a pre-trained model and unlabeled test data, making it particularly suitable for privacy-sensitive scenarios.
Taxonomy of existing methods:
- BN calibration methods: Replace source-domain BN statistics with target-domain statistics
- Pseudo-label methods: Select reliable pseudo-labels via thresholding or entropy filtering
- Consistency training methods: Maintain prediction stability under input perturbations
- Clustering methods: Reduce prediction uncertainty in the target domain via clustering
Core observation: On CIFAR10-C with ResNet-18, the authors find that as distribution shift intensifies, the redundancy of second-order feature relationship graphs (covariance matrices) increases markedly—a redder covariance heatmap indicates higher inter-feature correlation. Quantitative analysis shows that the redundancy score \(R_e = \|\tilde{Z}^T\tilde{Z} - I_d\|_1\) positively correlates with corruption severity.
Key Challenge: Existing TTA methods overlook the increase in feature redundancy caused by distribution shift, whereas redundant features directly undermine the model's ability to adapt to new data.
Key Insight: Directly eliminate embedding feature redundancy at test time, addressing TTA from a novel redundancy-elimination perspective.
## Method
### Overall Architecture
The FRET framework operates at two levels:
1. S-FRET: Directly adopts the redundancy score \(R_e\) as the optimization objective; simple and efficient.
2. G-FRET: Introduces a GCN to decompose feature relationships into an attention component and a redundancy component, eliminating redundancy and enhancing discriminability at both the representation layer and the prediction layer.
### Key Designs
**Feature Redundancy Score**

- Function: Quantifies the degree of redundancy in embedding features.
- Mechanism: The embedding matrix \(Z\) is column-normalized to obtain \(\tilde{Z}\); the redundancy score is then computed as \(R_e = \|\tilde{Z}^T\tilde{Z} - I_d\|_1\). Ideally, non-redundant features should yield a covariance matrix close to the identity.
- Design Motivation: Off-diagonal entries of the covariance matrix capture linear correlations between features; minimizing these entries eliminates redundancy.
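Below is a minimal PyTorch sketch of the score, assuming "column-normalized" means unit \(\ell_2\) norm per feature dimension; `redundancy_score` is a hypothetical helper, not the authors' reference code. Note that this quantity doubles as the S-FRET loss defined later.

```python
import torch

def redundancy_score(z: torch.Tensor) -> torch.Tensor:
    """R_e = ||Z~^T Z~ - I_d||_1 for an (n, d) embedding batch z.

    Sketch: column normalization is assumed to mean unit L2 norm
    per feature dimension.
    """
    z_tilde = z / (z.norm(dim=0, keepdim=True) + 1e-8)  # unit-norm columns
    corr = z_tilde.T @ z_tilde                          # (d, d) feature correlations
    eye = torch.eye(z.shape[1], device=z.device)
    return (corr - eye).abs().sum()                     # entry-wise L1 norm
```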
**Attention-Redundancy Decomposition**

- Function: Decomposes the feature relationship graph into a useful attention component and a redundancy component to be eliminated.
- Mechanism: A second-order feature relationship graph \(G_F = Z^TZ\) is constructed, then decomposed via a mask matrix \(M_M = I_d\) into an attention graph \(G_A = G_F \odot M_M\) (retaining only the diagonal) and a redundancy graph \(G_R = G_F - G_A\). GCN propagation over each graph then yields attention and redundancy representations and predictions, where \(D_A\) and \(D_R\) denote the degree matrices of \(G_A\) and \(G_R\): \(R_A = Z D_A^{-1/2} G_A D_A^{-1/2},\ P_A = R_A \theta^h\) and \(R_R = Z D_R^{-1/2} G_R D_R^{-1/2},\ P_R = R_R \theta^h\).
- Design Motivation: S-FRET, which directly minimizes the redundancy score, is label-distribution-agnostic and thus cannot handle label shift. By fusing data information with feature relationship information through the GCN, G-FRET can address both covariate shift and label shift.
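A sketch of the decomposition and propagation under the equations above; the absolute-value degree (guarding against negative off-diagonal weights) is an assumption, and `theta_h` stands in for the classification head weights \(\theta^h\).

```python
import torch

def decompose_and_propagate(z: torch.Tensor, theta_h: torch.Tensor):
    """Split G_F = Z^T Z into attention (diagonal) and redundancy
    (off-diagonal) graphs, then propagate z over each (sketch)."""
    d = z.shape[1]
    eye = torch.eye(d, device=z.device)
    g_f = z.T @ z              # (d, d) second-order feature relationship graph
    g_a = g_f * eye            # attention graph: mask M_M = I_d keeps the diagonal
    g_r = g_f - g_a            # redundancy graph: off-diagonal entries

    def sym_norm(g: torch.Tensor) -> torch.Tensor:
        # D^{-1/2} G D^{-1/2}; abs() guards against negative weights (assumption)
        deg = g.abs().sum(dim=1).clamp_min(1e-8)
        inv_sqrt = deg.pow(-0.5)
        return inv_sqrt[:, None] * g * inv_sqrt[None, :]

    r_a = z @ sym_norm(g_a)    # attention representation R_A, shape (n, d)
    r_r = z @ sym_norm(g_r)    # redundancy representation R_R
    return r_a, r_r, r_a @ theta_h, r_r @ theta_h  # ..., P_A, P_R
```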
**Representation-Layer Redundancy Elimination**

- Function: Uses contrastive learning to make attention representations class-discriminative while pushing them away from redundancy representations.
- Mechanism: A contrastive loss \(\mathcal{L}_R\) is defined, where the positive sample for attention representation \(R_{A_i}\) is its corresponding class centroid \(c_o\), and the negatives are the other class centroids \(\{c_j\}\) and the redundancy representation \(R_{R_i}\): \(\mathcal{L}_R = -\sum_{i=1}^{n_t} \log \frac{\exp(\text{sim}(R_{A_i}, c_o))}{\sum_{j=1}^{C} \exp(\text{sim}(R_{A_i}, c_j)) + \exp(\text{sim}(R_{A_i}, R_{R_i}))}\). Class centroids are computed via pseudo-label clustering.
- Design Motivation: Redundancy elimination alone is insufficient; the discriminability of useful features must also be enhanced to handle label shift.
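A sketch of \(\mathcal{L}_R\), assuming \(\text{sim}\) is cosine similarity and folding the denominator into a cross-entropy over \(C+1\) logits; the temperature `tau` is an assumption not shown in the formula.

```python
import torch
import torch.nn.functional as F

def representation_loss(r_a, r_r, centroids, pseudo_labels, tau=1.0):
    """L_R sketch: positive = own pseudo-label centroid; negatives =
    other centroids plus the sample's own redundancy representation."""
    r_a = F.normalize(r_a, dim=1)                 # sim(.,.) assumed cosine
    r_r = F.normalize(r_r, dim=1)
    c = F.normalize(centroids, dim=1)             # (C, d) pseudo-label centroids
    sim_c = r_a @ c.T / tau                       # (n, C) similarity to centroids
    sim_r = (r_a * r_r).sum(dim=1, keepdim=True) / tau  # (n, 1) sim to own R_R
    logits = torch.cat([sim_c, sim_r], dim=1)     # redundancy rep as extra negative
    return F.cross_entropy(logits, pseudo_labels, reduction="sum")
```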
**Prediction-Layer Redundancy Elimination**

- Function: Increases the confidence of attention predictions while suppressing redundancy predictions at the prediction layer.
- Mechanism: Combines entropy minimization with negative learning: \(\mathcal{L}_P = -\sum_{i=1}^{N} \sigma(P_{A_i}) \log \sigma(P_{A_i}) - \sum_{i=1}^{N} \sigma(P_{R_i}) \log\big(1 - \sigma(P_{R_i})\big)\). The first term minimizes the entropy of attention predictions (sharpening them); the second penalizes confident redundancy predictions via negative learning.
- Design Motivation: Bi-level optimization (representation layer + prediction layer) is more effective than operating at a single layer.
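A sketch of \(\mathcal{L}_P\) as written above, taking \(\sigma\) to be the softmax over classes (an assumption):

```python
import torch

def prediction_loss(p_a: torch.Tensor, p_r: torch.Tensor) -> torch.Tensor:
    """L_P sketch: entropy minimization on attention logits p_a plus
    negative learning on redundancy logits p_r, both shaped (n, C)."""
    s_a = torch.softmax(p_a, dim=1)
    s_r = torch.softmax(p_r, dim=1)
    entropy = -(s_a * torch.log(s_a + 1e-8)).sum()        # sharpen attention preds
    negative = -(s_r * torch.log(1 - s_r + 1e-8)).sum()   # push redundancy preds down
    return entropy + negative
```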
### Loss & Training
- S-FRET loss: \(\mathcal{L}_{SFRET} = \|\tilde{Z}^T\tilde{Z} - I_d\|_1\)
- G-FRET total loss: \(\mathcal{L}_{GFRET} = \mathcal{L}_R + \lambda \mathcal{L}_P\)
- Online adaptation: Upon receiving test data, predictions are generated using the model parameters from the previous step, followed by a single gradient descent update.
- Only BN layer parameters are updated; all other parameters remain frozen.
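A sketch of the online loop implied by the two bullets above; `collect_bn_params` and `gfret_loss` are hypothetical helpers (the latter standing in for \(\mathcal{L}_R + \lambda \mathcal{L}_P\)).

```python
import torch

def collect_bn_params(model: torch.nn.Module):
    """Freeze everything, then re-enable only BN affine parameters
    (assumption: 'BN layer parameters' means weight/bias, TENT-style)."""
    for p in model.parameters():
        p.requires_grad_(False)
    params = []
    for m in model.modules():
        if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d)):
            m.weight.requires_grad_(True)
            m.bias.requires_grad_(True)
            params += [m.weight, m.bias]
    return params

def adapt_step(model, batch, optimizer, gfret_loss):
    """One online step: predict with previous-step parameters,
    then a single gradient-descent update."""
    outputs = model(batch)          # predictions before this step's update
    loss = gfret_loss(outputs)      # hypothetical: computes L_R + lambda * L_P
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return outputs.detach()

# usage (assumed): optimizer = torch.optim.SGD(collect_bn_params(model), lr=1e-3)
```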
## Key Experimental Results
### Main Results (Domain Generalization TTA: PACS + OfficeHome)
| Method | Backbone | PACS Avg (%) | OfficeHome Avg (%) |
|---|---|---|---|
| Source | ResNet-18 | 81.84 | 62.01 |
| BN | ResNet-18 | 82.66 | 62.03 |
| TENT | ResNet-18 | 85.60 | 63.24 |
| TSD | ResNet-18 | 88.13 | 62.55 |
| TEA | ResNet-18 | 87.98 | 63.06 |
| TIPI | ResNet-18 | 87.23 | 63.29 |
| G-FRET | ResNet-18 | 88.51 | 63.81 |
| TSD | ResNet-50 | 89.97 | 68.74 |
| TEA | ResNet-50 | 88.72 | 68.95 |
| G-FRET | ResNet-50 | 91.28 | 69.96 |
### Ablation Study
| Configuration | PACS Avg (%) | Note |
|---|---|---|
| S-FRET (redundancy minimization only) | 86.20 | Simple and effective, but vulnerable to label shift |
| G-FRET w/o \(\mathcal{L}_R\) | 87.53 | Without representation-layer contrastive learning |
| G-FRET w/o \(\mathcal{L}_P\) | 87.89 | Without prediction-layer negative learning |
| G-FRET (full) | 88.51 | Bi-level optimization yields best results |
| \(\lambda = 0.1\) | 87.92 | Balancing parameter too small |
| \(\lambda = 1.0\) | 88.51 | Optimal balance |
| \(\lambda = 10\) | 88.05 | Prediction-layer loss weight too large |
## Key Findings
- Feature redundancy positively correlates with the degree of distribution shift; this observation holds across multiple architectures and datasets.
- Despite its simplicity, S-FRET is already effective under covariate shift.
- Through attention-redundancy decomposition and bi-level optimization, G-FRET substantially outperforms S-FRET under label shift.
- Feature visualizations of G-FRET outputs demonstrate markedly reduced redundancy and enhanced discriminability.
## Highlights & Insights
- Novel perspective: This is the first work to introduce feature redundancy elimination into TTA, offering a direction orthogonal to BN calibration, pseudo-labeling, and consistency training.
- Progressive method design: S-FRET is concise and elegant (a single formula); G-FRET builds upon it by incrementally incorporating GCN, contrastive learning, and negative learning, with clear logical progression.
- Effective use of GCN: Graph propagation in the GCN is leveraged to model inter-feature relationships, enabling explicit separation of attention and redundancy relationships at the feature level.
- Handling label shift: By incorporating class-centroid-aware contrastive learning, G-FRET compensates for the limitation of pure redundancy minimization approaches.
## Limitations & Future Work
- G-FRET introduces GCN and contrastive learning, increasing computational overhead at test time (graph construction and propagation are required per batch).
- The mask matrix \(M_M\) is fixed as the identity matrix, which may not be optimal for all scenarios.
- Performance gains under extreme distribution shift (e.g., corruption level 5) are limited.
- Class centroid computation relies on pseudo-label quality and may become unstable under high noise.
- Combinations with other TTA methods remain unexplored.
## Related Work & Insights
- The approach bears theoretical connections to the SOFT method in feature selection (sharing the mask matrix concept).
- The redundancy elimination idea may extend to other settings such as continual learning and domain adaptation.
- The bi-level framework of contrastive learning combined with negative learning may be applicable to other tasks requiring feature de-redundancy.
- The paper provides a new diagnostic tool for TTA method design (the redundancy score curve).
## Rating
- Novelty: ⭐⭐⭐⭐ The feature redundancy perspective is a novel contribution to TTA, though the individual technical components (GCN, contrastive learning) are not new in themselves.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers multiple architectures (ResNet-18/50/ViT), multiple datasets (PACS/OfficeHome/CIFAR-C), and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Motivation is illustrated intuitively, and the method is described clearly, though the mathematical notation is somewhat dense.
- Value: ⭐⭐⭐⭐ Provides a practical new TTA method and novel understanding of feature redundancy, offering meaningful inspiration for subsequent research.