# FRET: Feature Redundancy Elimination for Test Time Adaptation
Conference: ICCV 2025 · arXiv: 2505.10641 · Code: GitHub · Area: AI Safety · Keywords: Test-time adaptation, feature redundancy elimination, distribution shift, graph convolutional network, contrastive learning
## TL;DR
This paper proposes Feature Redundancy Elimination (FRET) as a novel perspective for test-time adaptation (TTA), observing that embedding feature redundancy increases significantly under distribution shift. Two methods are designed: S-FRET (direct minimization of the redundancy score) and G-FRET (GCN-based attention-redundancy decomposition with bi-level optimization). G-FRET achieves state-of-the-art performance across multiple architectures and datasets.
## Background & Motivation
Deep neural networks perform well under the i.i.d. assumption, but frequently encounter distribution shift in real-world deployment. TTA requires access only to a pre-trained model and unlabeled test data, making it particularly suitable for privacy-sensitive scenarios.
Taxonomy of existing methods:
- BN calibration methods: Replace source-domain BN statistics with target-domain statistics
- Pseudo-label methods: Select reliable pseudo-labels via thresholding or entropy filtering
- Consistency training methods: Maintain prediction stability under input perturbations
- Clustering methods: Reduce prediction uncertainty in the target domain via clustering
Core observation: On CIFAR10-C with ResNet-18, the authors find that as distribution shift intensifies, the redundancy of second-order feature relationship graphs (covariance matrices) increases markedly—a redder covariance heatmap indicates higher inter-feature correlation. Quantitative analysis shows that the redundancy score \(R_e = \|\tilde{Z}^T\tilde{Z} - I_d\|_1\) positively correlates with corruption severity.
Key Challenge: Existing TTA methods overlook the increase in feature redundancy caused by distribution shift, whereas redundant features directly undermine the model's ability to adapt to new data.
Key Insight: Directly eliminate embedding feature redundancy at test time, addressing TTA from a novel redundancy-elimination perspective.
## Method
### Overall Architecture
The FRET framework operates at two levels:
1. S-FRET: Directly adopts the redundancy score \(R_e\) as the optimization objective; simple and efficient.
2. G-FRET: Introduces a GCN to decompose feature relationships into an attention component and a redundancy component, eliminating redundancy and enhancing discriminability at both the representation layer and the prediction layer.
### Key Designs
**Feature Redundancy Score**

- Function: Quantifies the degree of redundancy in embedding features.
- Mechanism: The embedding matrix \(Z\) is column-normalized to obtain \(\tilde{Z}\); the redundancy score is then computed as \(R_e = \|\tilde{Z}^T\tilde{Z} - I_d\|_1\). Ideally, non-redundant features should yield a covariance matrix close to the identity.
- Design Motivation: Off-diagonal entries of the covariance matrix capture linear correlations between features; minimizing these entries eliminates redundancy.
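Below is a minimal PyTorch sketch of the score, assuming "column-normalized" means unit \(\ell_2\) norm per feature dimension; `redundancy_score` is a hypothetical helper, not the authors' reference code. Note that this quantity doubles as the S-FRET loss defined later.

```python
import torch

def redundancy_score(z: torch.Tensor) -> torch.Tensor:
    """R_e = ||Z~^T Z~ - I_d||_1 for an (n, d) embedding batch z.

    Sketch: column normalization is assumed to mean unit L2 norm
    per feature dimension.
    """
    z_tilde = z / (z.norm(dim=0, keepdim=True) + 1e-8)  # unit-norm columns
    corr = z_tilde.T @ z_tilde                          # (d, d) feature correlations
    eye = torch.eye(z.shape[1], device=z.device)
    return (corr - eye).abs().sum()                     # entry-wise L1 norm
```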
**Attention-Redundancy Decomposition**

- Function: Decomposes the feature relationship graph into a useful attention component and a redundancy component to be eliminated.
- Mechanism: A second-order feature relationship graph \(G_F = Z^TZ\) is constructed, then decomposed via a mask matrix \(M_M = I_d\) into an attention graph \(G_A = G_F \odot M_M\) (retaining only the diagonal) and a redundancy graph \(G_R = G_F - G_A\). GCN propagation over each graph then yields attention and redundancy representations and predictions, where \(D_A\) and \(D_R\) denote the degree matrices of \(G_A\) and \(G_R\): \(R_A = Z D_A^{-1/2} G_A D_A^{-1/2},\ P_A = R_A \theta^h\) and \(R_R = Z D_R^{-1/2} G_R D_R^{-1/2},\ P_R = R_R \theta^h\).
- Design Motivation: S-FRET, which directly minimizes the redundancy score, is label-distribution-agnostic and thus cannot handle label shift. By fusing data information with feature relationship information through the GCN, G-FRET can address both covariate shift and label shift.
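A sketch of the decomposition and propagation under the equations above; the absolute-value degree (guarding against negative off-diagonal weights) is an assumption, and `theta_h` stands in for the classification head weights \(\theta^h\).

```python
import torch

def decompose_and_propagate(z: torch.Tensor, theta_h: torch.Tensor):
    """Split G_F = Z^T Z into attention (diagonal) and redundancy
    (off-diagonal) graphs, then propagate z over each (sketch)."""
    d = z.shape[1]
    eye = torch.eye(d, device=z.device)
    g_f = z.T @ z              # (d, d) second-order feature relationship graph
    g_a = g_f * eye            # attention graph: mask M_M = I_d keeps the diagonal
    g_r = g_f - g_a            # redundancy graph: off-diagonal entries

    def sym_norm(g: torch.Tensor) -> torch.Tensor:
        # D^{-1/2} G D^{-1/2}; abs() guards against negative weights (assumption)
        deg = g.abs().sum(dim=1).clamp_min(1e-8)
        inv_sqrt = deg.pow(-0.5)
        return inv_sqrt[:, None] * g * inv_sqrt[None, :]

    r_a = z @ sym_norm(g_a)    # attention representation R_A, shape (n, d)
    r_r = z @ sym_norm(g_r)    # redundancy representation R_R
    return r_a, r_r, r_a @ theta_h, r_r @ theta_h  # ..., P_A, P_R
```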
**Representation-Layer Redundancy Elimination**

- Function: Uses contrastive learning to make attention representations class-discriminative while pushing them away from redundancy representations.
- Mechanism: A contrastive loss \(\mathcal{L}_R\) is defined, where the positive sample for attention representation \(R_{A_i}\) is its corresponding class centroid \(c_o\), and the negatives are the other class centroids \(\{c_j\}\) and the redundancy representation \(R_{R_i}\): \(\mathcal{L}_R = -\sum_{i=1}^{n_t} \log \frac{\exp(\text{sim}(R_{A_i}, c_o))}{\sum_{j=1}^{C} \exp(\text{sim}(R_{A_i}, c_j)) + \exp(\text{sim}(R_{A_i}, R_{R_i}))}\). Class centroids are computed via pseudo-label clustering.
- Design Motivation: Redundancy elimination alone is insufficient; the discriminability of useful features must also be enhanced to handle label shift.
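A sketch of \(\mathcal{L}_R\), assuming \(\text{sim}\) is cosine similarity and folding the denominator into a cross-entropy over \(C+1\) logits; the temperature `tau` is an assumption not shown in the formula.

```python
import torch
import torch.nn.functional as F

def representation_loss(r_a, r_r, centroids, pseudo_labels, tau=1.0):
    """L_R sketch: positive = own pseudo-label centroid; negatives =
    other centroids plus the sample's own redundancy representation."""
    r_a = F.normalize(r_a, dim=1)                 # sim(.,.) assumed cosine
    r_r = F.normalize(r_r, dim=1)
    c = F.normalize(centroids, dim=1)             # (C, d) pseudo-label centroids
    sim_c = r_a @ c.T / tau                       # (n, C) similarity to centroids
    sim_r = (r_a * r_r).sum(dim=1, keepdim=True) / tau  # (n, 1) sim to own R_R
    logits = torch.cat([sim_c, sim_r], dim=1)     # redundancy rep as extra negative
    return F.cross_entropy(logits, pseudo_labels, reduction="sum")
```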
**Prediction-Layer Redundancy Elimination**

- Function: Increases the confidence of attention predictions while suppressing redundancy predictions at the prediction layer.
- Mechanism: Combines entropy minimization with negative learning: \(\mathcal{L}_P = -\sum_{i=1}^{N} \sigma(P_{A_i}) \log \sigma(P_{A_i}) - \sum_{i=1}^{N} \sigma(P_{R_i}) \log\big(1 - \sigma(P_{R_i})\big)\). The first term minimizes the entropy of attention predictions (sharpening them); the second penalizes confident redundancy predictions via negative learning.
- Design Motivation: Bi-level optimization (representation layer + prediction layer) is more effective than operating at a single layer.
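A sketch of \(\mathcal{L}_P\) as written above, taking \(\sigma\) to be the softmax over classes (an assumption):

```python
import torch

def prediction_loss(p_a: torch.Tensor, p_r: torch.Tensor) -> torch.Tensor:
    """L_P sketch: entropy minimization on attention logits p_a plus
    negative learning on redundancy logits p_r, both shaped (n, C)."""
    s_a = torch.softmax(p_a, dim=1)
    s_r = torch.softmax(p_r, dim=1)
    entropy = -(s_a * torch.log(s_a + 1e-8)).sum()        # sharpen attention preds
    negative = -(s_r * torch.log(1 - s_r + 1e-8)).sum()   # push redundancy preds down
    return entropy + negative
```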
### Loss & Training
- S-FRET loss: \(\mathcal{L}_{SFRET} = \|\tilde{Z}^T\tilde{Z} - I_d\|_1\)
- G-FRET total loss: \(\mathcal{L}_{GFRET} = \mathcal{L}_R + \lambda \mathcal{L}_P\)
- Online adaptation: Upon receiving test data, predictions are generated using the model parameters from the previous step, followed by a single gradient descent update.
- Only BN layer parameters are updated; all other parameters remain frozen.
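A sketch of the online loop implied by the two bullets above; `collect_bn_params` and `gfret_loss` are hypothetical helpers (the latter standing in for \(\mathcal{L}_R + \lambda \mathcal{L}_P\)).

```python
import torch

def collect_bn_params(model: torch.nn.Module):
    """Freeze everything, then re-enable only BN affine parameters
    (assumption: 'BN layer parameters' means weight/bias, TENT-style)."""
    for p in model.parameters():
        p.requires_grad_(False)
    params = []
    for m in model.modules():
        if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d)):
            m.weight.requires_grad_(True)
            m.bias.requires_grad_(True)
            params += [m.weight, m.bias]
    return params

def adapt_step(model, batch, optimizer, gfret_loss):
    """One online step: predict with previous-step parameters,
    then a single gradient-descent update."""
    outputs = model(batch)          # predictions before this step's update
    loss = gfret_loss(outputs)      # hypothetical: computes L_R + lambda * L_P
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return outputs.detach()

# usage (assumed): optimizer = torch.optim.SGD(collect_bn_params(model), lr=1e-3)
```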
## Key Experimental Results
### Main Results (Domain Generalization TTA: PACS + OfficeHome)
| Method | Backbone | PACS Avg (%) | OfficeHome Avg (%) |
|---|---|---|---|
| Source | ResNet-18 | 81.84 | 62.01 |
| BN | ResNet-18 | 82.66 | 62.03 |
| TENT | ResNet-18 | 85.60 | 63.24 |
| TSD | ResNet-18 | 88.13 | 62.55 |
| TEA | ResNet-18 | 87.98 | 63.06 |
| TIPI | ResNet-18 | 87.23 | 63.29 |
| G-FRET | ResNet-18 | 88.51 | 63.81 |
| TSD | ResNet-50 | 89.97 | 68.74 |
| TEA | ResNet-50 | 88.72 | 68.95 |
| G-FRET | ResNet-50 | 91.28 | 69.96 |
### Ablation Study
| Configuration | PACS Avg (%) | Note |
|---|---|---|
| S-FRET (redundancy minimization only) | 86.20 | Simple and effective, but vulnerable to label shift |
| G-FRET w/o \(\mathcal{L}_R\) | 87.53 | Without representation-layer contrastive learning |
| G-FRET w/o \(\mathcal{L}_P\) | 87.89 | Without prediction-layer negative learning |
| G-FRET (full) | 88.51 | Bi-level optimization yields best results |
| \(\lambda = 0.1\) | 87.92 | Balancing parameter too small |
| \(\lambda = 1.0\) | 88.51 | Optimal balance |
| \(\lambda = 10\) | 88.05 | Prediction-layer loss weight too large |
## Key Findings
- Feature redundancy positively correlates with the degree of distribution shift; this observation holds across multiple architectures and datasets.
- Despite its simplicity, S-FRET is already effective under covariate shift.
- Through attention-redundancy decomposition and bi-level optimization, G-FRET substantially outperforms S-FRET under label shift.
- Feature visualizations of G-FRET outputs demonstrate markedly reduced redundancy and enhanced discriminability.
## Highlights & Insights
- Novel perspective: This is the first work to introduce feature redundancy elimination into TTA, offering a direction orthogonal to BN calibration, pseudo-labeling, and consistency training.
- Progressive method design: S-FRET is concise and elegant (a single formula); G-FRET builds upon it by incrementally incorporating GCN, contrastive learning, and negative learning, with clear logical progression.
- Effective use of GCN: Graph propagation in the GCN is leveraged to model inter-feature relationships, enabling explicit separation of attention and redundancy relationships at the feature level.
- Handling label shift: By incorporating class-centroid-aware contrastive learning, G-FRET compensates for the limitation of pure redundancy minimization approaches.
## Limitations & Future Work
- G-FRET introduces GCN and contrastive learning, increasing computational overhead at test time (graph construction and propagation are required per batch).
- The mask matrix \(M_M\) is fixed as the identity matrix, which may not be optimal for all scenarios.
- Performance gains under extreme distribution shift (e.g., corruption level 5) are limited.
- Class centroid computation relies on pseudo-label quality and may become unstable under high noise.
- Combinations with other TTA methods remain unexplored.
## Related Work & Insights
- The approach bears theoretical connections to the SOFT method in feature selection (sharing the mask matrix concept).
- The redundancy elimination idea may extend to other settings such as continual learning and domain adaptation.
- The bi-level framework of contrastive learning combined with negative learning may be applicable to other tasks requiring feature de-redundancy.
- The paper provides a new diagnostic tool for TTA method design (the redundancy score curve).
## Rating
- Novelty: ⭐⭐⭐⭐ The feature redundancy perspective is a novel contribution to TTA, though the individual technical components (GCN, contrastive learning) are not new in themselves.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers multiple architectures (ResNet-18/50/ViT), multiple datasets (PACS/OfficeHome/CIFAR-C), and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Motivation is illustrated intuitively, and the method is described clearly, though the mathematical notation is somewhat dense.
- Value: ⭐⭐⭐⭐ Provides a practical new TTA method and novel understanding of feature redundancy, offering meaningful inspiration for subsequent research.