Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time

Conference: ECCV 2024
arXiv: 2410.01083
Code: https://github.com/ca-joe-yang/discard-in-subsampling
Area: Image Segmentation / Image Classification
Keywords: Test-Time Augmentation, Subsampling Layers, Activation Map Search, Attention Aggregation, Semantic Segmentation

TL;DR

This work identifies that subsampling layers in deep networks discard a significant volume of useful activations during the default forward pass, and proposes a search-and-aggregate framework to leverage these discarded activation maps at test-time to improve classification and segmentation performance, complementing traditional TTA methods orthogonally.

Background & Motivation

Background: Test-Time Augmentation (TTA) is a widely used technique that improves model performance by applying multiple augmentations (such as random cropping, flipping, and rotation) to the input image and aggregating the predictions. However, existing TTA methods operate entirely in the image space, so every augmented view costs an additional forward pass (e.g., 144 forward passes on ImageNet).

Limitations of Prior Work: Almost all deep networks contain subsampling layers (such as strided convolutions or pooling), which discard most activations when reducing spatial dimensions. For instance, a 2D convolution with stride=2 keeps only one of every four spatial positions, discarding \(\frac{3}{4}\) of the spatial activations. The default implementation always selects \(s=0\) (even indices), completely ignoring the activations at \(s=1\) (odd indices). These discarded activations contain useful information about the input image.
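
To make the 3/4 figure concrete, here is a minimal pure-Python sketch of stride-2 subsampling on a toy 4x4 activation map. The helper `subsample` and the use of a 2D `(row, col)` offset are illustrative assumptions, not the paper's code; the toy map simply stands in for the dense output of a stride-1 convolution.

```python
def subsample(feature_map, stride, offset):
    """Keep every `stride`-th row and column, starting at `offset` = (row, col)."""
    row_off, col_off = offset
    return [row[col_off::stride] for row in feature_map[row_off::stride]]


# Toy 4x4 dense activation map, standing in for the output of a stride-1 convolution.
dense = [[4 * r + c for c in range(4)] for r in range(4)]

# The four possible phases of a stride-2 subsampling; (0, 0) is the default choice.
phases = {(i, j): subsample(dense, 2, (i, j)) for i in range(2) for j in range(2)}

default = phases[(0, 0)]
kept = sum(len(row) for row in default)
total = sum(len(row) for row in dense)
discarded_fraction = 1 - kept / total  # share of activations the default pass never uses

print("default phase:", default)
print("discarded fraction:", discarded_fraction)
```

Taken together, the four phases cover every position of the dense map exactly once, which is why the non-default phases can be read as alternative views of the same input.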

Key Challenge: (a) How to determine which discarded activations are useful? (b) How to aggregate these activations efficiently? Naively enumerating all possible combinations of selection indices yields an exponential search space.

Core Idea: Selecting different subsampling indices is formulated as a form of "feature-space augmentation". Useful activation maps are searched via a greedy algorithm and aggregated using an attention mechanism, achieving test-time improvements that are orthogonal to image-space TTA.

Method

Overall Architecture

Consider a pre-trained network containing \(L\) subsampling layers, where layer \(l\) has a subsampling factor of \(R^{(l)}\). A selection vector is defined as \(\mathbf{s} = (s_1, s_2, ..., s_L)\), where \(s_l \in \{0, ..., R^{(l)}-1\}\). The default forward pass uses \(\mathbf{s} = \mathbf{0}\). The method consists of two core steps: (1) a greedy search to find a set of useful selection indices \(\hat{\mathcal{S}}\); (2) an attention module to aggregate the corresponding feature maps for prediction.
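
As a rough illustration of the selection-vector notation, the sketch below enumerates every \(\mathbf{s}\) for a small network and shows how quickly the number of candidates grows. The factor values and the helper name `all_selection_vectors` are hypothetical and chosen only for the example.

```python
from itertools import product
from math import prod


def all_selection_vectors(factors):
    """Enumerate every s = (s_1, ..., s_L) with s_l in {0, ..., R_l - 1}."""
    return list(product(*(range(r) for r in factors)))


factors = [2, 2, 2]  # hypothetical R^(l) for L = 3 subsampling layers
vectors = all_selection_vectors(factors)
default = tuple(0 for _ in factors)  # the standard forward pass, s = 0

# Size of the search space as layers are added, with R = 4 per layer (hypothetical).
sizes = [prod([4] * depth) for depth in range(1, 6)]

print(len(vectors), vectors[0] == default)
print(sizes)
```

The number of selection vectors is the product of the per-layer factors, so exhaustive enumeration is only practical for very shallow networks; this is the motivation for a budgeted search.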

Key Designs

Attention Aggregation Module: Given the \(B_{ours}\) selected features \(\mathcal{F} = \{\mathbf{f}_\mathbf{s} | \mathbf{s} \in \hat{\mathcal{S}}\}\), a multi-head self-attention layer is used to learn the relative importance of the features:

\[A_{learned}(\mathcal{F}) = \frac{1}{B_{ours}} \sum_{\mathbf{s} \in \hat{\mathcal{S}}} \left(\mathbf{f}_\mathbf{s} + \text{MLP}\left(\sum_{\mathbf{s}' \in \hat{\mathcal{S}}} W_{\mathbf{s}\mathbf{s}'}\mathbf{v}_{\mathbf{s}'}\right)\right)\]

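where the attention weight is \(W_{\mathbf{s}\mathbf{s}'} = \frac{\exp(\mathbf{q}_\mathbf{s}^\top \mathbf{k}_{\mathbf{s}'})}{\sum_{\mathbf{s}''} \exp(\mathbf{q}_\mathbf{s}^\top \mathbf{k}_{\mathbf{s}''})}\), and \(\mathbf{q}_\mathbf{s}\), \(\mathbf{k}_\mathbf{s}\), \(\mathbf{v}_\mathbf{s}\) are the query, key, and value computed from \(\mathbf{f}_\mathbf{s}\). Design Motivation: Attention acts as a set operator, so the module can be trained once and used under any test-time budget without retraining. A learning-free counterpart is also provided, entropy-based weighting: \(w_\mathbf{s} = \frac{1}{Z}(1 - \frac{H(C_\phi(\mathbf{f}_\mathbf{s}))}{\log K})\), where \(H\) is the entropy of the prediction \(C_\phi(\mathbf{f}_\mathbf{s})\), \(K\) is the number of classes, and \(Z\) normalizes the weights to sum to one. Low-entropy (high-confidence) features are therefore assigned higher weights.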
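
The two aggregation rules above can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not the paper's module: it uses a single head, identity query/key/value projections (so \(\mathbf{q} = \mathbf{k} = \mathbf{v} = \mathbf{f}\)), and an identity placeholder for the MLP. The function names `attention_aggregate` and `entropy_weights` are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_aggregate(features, mlp=lambda v: v):
    """Mean over s of f_s + MLP(sum_s' W_ss' v_s'), with q = k = v = f (assumption)."""
    n, dim = len(features), len(features[0])
    out = [0.0] * dim
    for f_s in features:
        weights = softmax([dot(f_s, f_t) for f_t in features])  # W_{s s'}
        mixed = [sum(w * f_t[d] for w, f_t in zip(weights, features)) for d in range(dim)]
        update = mlp(mixed)
        for d in range(dim):
            out[d] += (f_s[d] + update[d]) / n
    return out


def entropy_weights(prob_vectors):
    """w_s = (1 - H(p_s) / log K) / Z for class-probability vectors p_s."""
    k = len(prob_vectors[0])
    raw = []
    for p in prob_vectors:
        h = -sum(x * math.log(x) for x in p if x > 0)
        raw.append(max(0.0, 1 - h / math.log(k)))
    z = sum(raw)
    if z < 1e-12:  # every prediction is (near) uniform: fall back to equal weights
        return [1 / len(raw)] * len(raw)
    return [r / z for r in raw]


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
agg_all = attention_aggregate(features)      # budget of 3 features
agg_two = attention_aggregate(features[:2])  # same function, smaller budget
weights = entropy_weights([[0.9, 0.1], [0.6, 0.4], [0.5, 0.5]])
print(agg_all, agg_two, weights)
```

Because the function takes a list of any length and does not depend on the order of its elements, the same aggregation can be reused for different budgets, which is the set-operator property the text relies on.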

Spatial Alignment: Activation maps generated by different selection indices \(\mathbf{s}\) are spatially offset. Alignment is performed by calculating the relative displacement:

\[\Delta = \sum_{l=1}^{L} s_l \prod_{l'=1}^{l-1} R^{(l')}\]

After upsampling the activation maps to the input resolution, they are translated by \(\Delta\) so that they are spatially aligned, which avoids the performance degradation caused by spatial mismatch. In the ablation, alignment improves top-1 accuracy by 0.36 points (79.52 to 79.88).
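
As a sanity check on how selection indices turn into an input-space shift, the toy below chains plain strided slicing in 1D and compares the index of the first surviving sample with an accumulated displacement. It deliberately ignores convolutions, padding, and the second spatial axis, so it is a simplified setting rather than the paper's alignment code; `subsample_1d` and `input_displacement` are hypothetical helpers. In this toy, the offset chosen at layer \(l\) ends up scaled by the product of the factors of the layers before it.

```python
def subsample_1d(signal, factor, offset):
    """Keep every `factor`-th sample, starting at `offset`."""
    return signal[offset::factor]


def input_displacement(selection, factors):
    """Shift, in input samples, induced by a selection vector (toy 1D setting)."""
    shift, scale = 0, 1
    for s_l, r_l in zip(selection, factors):
        shift += s_l * scale  # this layer's offset, measured in input samples
        scale *= r_l          # later layers operate on a coarser grid
    return shift


signal = list(range(32))  # values equal their own input index
factors = [2, 2, 2]       # hypothetical R^(l)

checks = []
for selection in [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1)]:
    out = signal
    for s_l, r_l in zip(selection, factors):
        out = subsample_1d(out, r_l, s_l)
    # out[0] is the input index of the first sample that survives all layers
    checks.append((selection, out[0], input_displacement(selection, factors)))

for selection, first_index, predicted in checks:
    print(selection, first_index, predicted)
```

Each printed row pairs the observed first index with the predicted displacement; the two agree for every selection vector tried here.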

Greedy Search: Since the size of the search space \(|\mathcal{S}|\) grows exponentially with the number of subsampling layers, a top-down greedy search is adopted. Starting from the default state \(\mathbf{0}\), the algorithm maintains a priority queue and, at each step, expands the neighbors of the current best state, ranking candidates by a scoring criterion. The learning-based criterion uses the inverse of the attention weight as the priority, so that high-weight states are expanded first, while the learning-free criterion uses prediction entropy, so that low-entropy (high-confidence) states are expanded first.
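
A minimal sketch of such a budgeted best-first search is given below, using Python's `heapq` as the priority queue. Several details are assumptions made for illustration: a neighbor is taken to be a selection vector that differs in exactly one layer, lower scores are treated as more promising (as with entropy), and the toy `score` function is a stand-in for whatever criterion is actually used.

```python
import heapq


def neighbors(state, factors):
    """Selection vectors that differ from `state` in exactly one layer (assumption)."""
    for layer, r in enumerate(factors):
        for value in range(r):
            if value != state[layer]:
                yield state[:layer] + (value,) + state[layer + 1:]


def greedy_search(factors, score, budget):
    """Return up to `budget` selection vectors, expanding the best-scored state first."""
    start = tuple(0 for _ in factors)  # default forward pass
    frontier = [(score(start), start)]
    seen = {start}
    selected = []
    while frontier and len(selected) < budget:
        _, state = heapq.heappop(frontier)  # lowest score = most promising
        selected.append(state)
        for nxt in neighbors(state, factors):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (score(nxt), nxt))
    return selected


def toy_score(state):
    # Placeholder criterion; a real one would need the network's output for `state`.
    return sum(state)


selected = greedy_search([2, 2, 2], toy_score, budget=4)
print(selected)
```

The search touches only the states it pops plus their immediate neighbors, so the cost scales with the budget rather than with the full size of the search space.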

Loss & Training

The aggregation module is trained on the frozen pre-trained model using a 20K training subset of ImageNet.
\(B_{ours}=30\) is fixed during training, while any budget can be used during testing (due to the set-operation nature of the attention module).
Optimization uses AdamW with a cosine-annealing scheduler and a learning rate of \(10^{-6}\).
The search spaces of the first and last layers are omitted to balance performance and latency.

Key Experimental Results

Main Results (ImageNet Image Classification, \(B_{total}=150\))

Method Combination	ResNet18	ResNet50	MobileNetV2	InceptionV3
GPS (w/o Ours)	70.69	76.87	72.58	72.02
GPS + Ours	70.74	76.87	72.58	72.02
ClassTTA (w/o Ours)	66.40	73.56	67.81	70.34
ClassTTA + Ours	70.37	76.65	71.63	72.00
AugTTA (w/o Ours)	70.28	76.47	72.46	71.98
AugTTA + Ours	70.74	76.89	72.58	72.24

Average Gain: GPS +0.32%, ClassTTA +2.01%, AugTTA +0.19%
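
To make the per-architecture differences in the table explicit, the sketch below recomputes the gain of each "+ Ours" row over its baseline for the four architectures shown. The means it prints cover only these four columns, so they need not coincide with the averages quoted above if those were taken over a larger set of architectures (an assumption; the source does not say).

```python
architectures = ["ResNet18", "ResNet50", "MobileNetV2", "InceptionV3"]

# Top-1 accuracies copied from the table above.
baseline = {
    "GPS": [70.69, 76.87, 72.58, 72.02],
    "ClassTTA": [66.40, 73.56, 67.81, 70.34],
    "AugTTA": [70.28, 76.47, 72.46, 71.98],
}
with_ours = {
    "GPS": [70.74, 76.87, 72.58, 72.02],
    "ClassTTA": [70.37, 76.65, 71.63, 72.00],
    "AugTTA": [70.74, 76.89, 72.58, 72.24],
}

# Per-architecture gain in accuracy points, rounded to the table's precision.
gains = {
    method: [round(b - a, 2) for a, b in zip(baseline[method], with_ours[method])]
    for method in baseline
}
# Mean gain over the four displayed architectures only.
mean_gain = {method: sum(g) / len(g) for method, g in gains.items()}

for method in baseline:
    print(method, dict(zip(architectures, gains[method])), round(mean_gain[method], 4))
```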

Semantic Segmentation (Cityscapes & ADE20K, mIoU)

Architecture	Dataset	Baseline	+Ours(\(B=4\))	+Ours(\(B=10\))
ResNet50-DeepLab	Cityscapes	79.60	79.72	79.73
MiT-SegFormer	Cityscapes	76.54	77.01	77.05
ResNet50-DeepLab	ADE20K	42.72	42.77	42.81
MiT-SegFormer	ADE20K	37.41	37.68	37.67

Ablation Study

Ablation Item	Configuration	Top-1 Acc(%)	Description
Aggregation Method	Average	79.38	Simple average
	Entropy-weighted	79.44	Learning-free scheme
	Ours w/o Align	79.52	Attention without alignment
	Ours (w/ Align)	79.88	Full scheme
Search Criterion	Random	79.80	Random expansion
	\(\Delta\) (Displacement)	79.72	Sorted by offset
	Entropy	79.86	Learning-free criterion
	Learned (Ours)	79.88	Attention criterion
Search Space	{1,...,L}	79.64	All layers, latency 53.46
	{2,...,L-1}	79.88	Excluding first and last, latency 21.21

Key Findings

Discarded activations are indeed useful: Utilizing discarded activation maps consistently improves model performance across 9 different architectures.
Orthogonal and complementary to TTA: Incorporating this method alongside existing TTA yields additional performance gains because this method operates in the feature space rather than the image space.
Performance gains saturate around \(B_{ours} \approx 15\): Benefits are realized without requiring a large test-time computational budget.
Train once, use under any budget: Leveraging the set-operation nature of attention, a single training run generalizes to different computational budgets at test-time.

Highlights & Insights

Novel perspective: This is the first work to design a test-time method from the perspective of "information discarding" in subsampling layers, presenting a completely new path beyond image-space TTA.
Deep theoretical insight: Decomposing a strided convolution into a stride-1 convolution followed by subsampling gives a single formulation that covers all networks containing subsampling operations.
Elegant spatial alignment design: Resolves the spatial mismatch problem across different selection indices.
Strong generalizability: Shown to be effective across both CNNs and ViTs.

Limitations & Future Work

Increased computational overhead at test-time (as it requires multiple forward passes) makes it less suitable for real-time application scenarios.
Performance gains are relatively modest (approx. 0.2-2% on average for classification, and roughly 0.05-0.5 mIoU in the segmentation results above), showing diminishing returns on already strong models.
The greedy search heuristic might fail to find the globally optimal combinations of activation maps.
The possibility of exploiting multiple selection indices during training to enhance the base model's capacity was not explored.
The formulation is inapplicable to architectures that lack explicit subsampling layers (such as pure MLPs).

Related Work & Insights

GPS [Molchanov et al.] and ClassTTA/AugTTA [Shanmugam et al.] serve as the key image-space TTA baselines.
Anti-aliased CNN [Zhang 2019] mitigates the aliasing introduced by downsampling through blur pooling, a training-time remedy in contrast to the test-time approach taken here.
Inspiration from this work: Can randomized selection indices be used as a form of architectural data augmentation during model training?
The proposed method could be extended to other domains requiring spatial-temporal downsampling, such as video understanding and 3D vision.

Rating

Novelty: ⭐⭐⭐⭐⭐ A fresh perspective, mining exploitable test-time information from within the network architecture.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across 9 architectures, 2 tasks, multiple datasets, and supported by detailed ablation studies.
Writing Quality: ⭐⭐⭐⭐ Clear problem definition, rigorous math derivation, and intuitive illustrations.
Value: ⭐⭐⭐⭐ A new direction orthogonal to TTA, albeit with limited absolute gain.