L-SWAG: Layer-Sample Wise Activation with Gradients information for Zero-Shot NAS on Vision Transformers

Conference: CVPR 2025
arXiv: 2505.07300
Code: None
Area: Interpretability
Keywords: Neural Architecture Search, Zero-Shot NAS, Vision Transformer, Proxy Metric, Metric Combination

TL;DR

This paper proposes the L-SWAG metric, which characterizes the trainability and expressiveness of CNN and ViT networks through the product of layer-wise gradient variance and the cardinality of activation patterns. It further designs the LIBRA-NAS algorithm to combine complementary proxy metrics, achieving SOTA-level zero-shot NAS performance across ViT search spaces and 14 tasks.

Background & Motivation

Background: Zero-shot NAS uses zero-cost (ZC) proxy metrics to estimate the quality of a network architecture without training it, which makes the search both fast and interpretable.
Limitations of Prior Work: Existing SOTA proxy metrics are primarily limited to convolutional search spaces (such as NAS-Bench-201) and perform poorly on Vision Transformer search spaces, sometimes even underperforming simple parameter-count metrics.
Key Challenge: Existing metrics either only consider gradients (trainability) or only consider activation patterns (expressiveness); a single dimension is insufficient to comprehensively characterize networks. Moreover, most metrics treat all layers equally, ignoring the differences in gradient statistics across different layers.
Goal: To design a general proxy metric suitable for both CNN and ViT search spaces, and to develop an intelligent metric combination method.
Key Insight: (1) Theoretically analyze the ZiCO metric to prove that the gradient mean should be discarded, keeping only the variance; (2) Empirically discover that the contribution of gradient statistics varies significantly across different layers; (3) Combine expressiveness metrics to compensate for the limitations of pure gradient metrics on ViTs.
Core Idea: Layer-wise gradient variance (trainability) \(\times\) Layer-wise activation pattern cardinality (expressiveness) = L-SWAG.

Method

Overall Architecture

Input batch + randomly initialized DNN \(\to\) Extract layer-wise gradient statistics (only variance, discarding mean) \(\to\) Select the most informative layer interval \(\to\) Compute trainability score \(\Lambda^{\hat{L}}\) \(\to\) Compute layer-wise SWAP expressiveness score \(\Psi_{\mathcal{N},\theta}^{\hat{L}}\) \(\to\) Multiply both to obtain L-SWAG \(\to\) Use LIBRA-NAS to combine multiple metrics.
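As a rough formalization of the pipeline above, and assuming the trainability term keeps ZiCO's log-sum form with the gradient mean replaced by 1 (the exact notation in the paper may differ), the score can be written as:

\[
\text{L-SWAG} = \Lambda^{\hat{L}} \times \Psi_{\mathcal{N},\theta}^{\hat{L}}, \qquad
\Lambda^{\hat{L}} = \sum_{l=\hat{l}}^{\hat{L}} \log\left( \sum_{w \in \theta_l} \frac{1}{\sqrt{\text{Var}_i\left(|\nabla_w \mathcal{L}(x_i, y_i)|\right)}} \right)
\]

Here \(\hat{l}\) and \(\hat{L}\) bound the selected layer interval. The symbols \(\theta_l\) (weights of layer \(l\)) and \(i\) (sample index) are introduced only for illustration.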

Key Designs

Layer-wise Gradient Variance Metric (\(\Lambda^{\hat{L}}\)):
- Function: Measures the trainability of the network within the selected layer interval.
- Mechanism: Computes the sample-wise variance \(\text{Var}(|\nabla_w \mathcal{L}|)\) of the gradient of each weight in layer \(l\), sums the reciprocal standard deviations \(1/\sigma\) within the layer, takes the logarithm, and adds up the per-layer values over the selected interval. Key improvements: (1) Theorem 1 proves that the gradient mean \(\mu\) in ZiCO should be discarded and replaced with the constant 1; (2) an analysis of the layer-wise gradient statistics of 1000 random networks reveals that only the statistics in a specific layer interval (from \(\hat{l}\) to \(\hat{L}\)) are informative, so only these layers enter the calculation.
- Design Motivation: The \(\mu/\sigma\) ratio in ZiCO is theoretically invalid (Theorem 1 proves that the contribution of \(\mu\) is canceled out by the learning rate), and treating all layers with equal weight is suboptimal.
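The trainability term described above could be sketched as follows. This is a minimal NumPy illustration, not the authors' code: it assumes per-sample gradients have already been collected into one array per layer, and the function name `trainability_score`, its arguments, and the `eps` filter for weights with zero variance are hypothetical choices made for the example.

```python
import numpy as np

def trainability_score(per_sample_grads, layer_start, layer_end, eps=1e-12):
    """Sum over the selected layers of log(sum_w 1 / std_w).

    per_sample_grads: list (one entry per layer) of arrays shaped
        (num_samples, num_weights) holding per-sample gradients.
    layer_start, layer_end: inclusive bounds of the selected layer interval.
    """
    score = 0.0
    for grads in per_sample_grads[layer_start:layer_end + 1]:
        # Sample-wise standard deviation of |gradient| for every weight.
        std = np.sqrt(np.var(np.abs(grads), axis=0))
        # Assumption: weights whose gradient never varies are skipped
        # to avoid dividing by zero.
        std = std[std > eps]
        if std.size:
            score += float(np.log(np.sum(1.0 / std)))
    return score

# Toy usage: three layers, only the last two are inside the interval.
layers = [
    np.array([[100.0], [200.0]]),
    np.array([[1.0, 0.0], [-3.0, 4.0]]),
    np.array([[1.0], [3.0]]),
]
score = trainability_score(layers, layer_start=1, layer_end=2)
```

Restricting the loop to a slice of the layer list is what distinguishes this from an all-layer sum; how the interval bounds are chosen is left outside the sketch.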
Layer-wise Activation Pattern Expressiveness (\(\Psi_{\mathcal{N},\theta}^{\hat{L}}\)):
- Function: Measures the number of linear regions of the network over the input space, reflecting expressiveness.
- Mechanism: Defines Sample-Wise Activation Patterns (SWAP): the post-activation values of each layer are binarized to obtain a set of activation patterns, and the cardinality of this set serves as the expressiveness score. The paper extends this method from ReLU to GeLU networks for the first time, making it applicable to ViTs.
- Design Motivation: Pure gradient metrics fail in ViT search spaces because the expressiveness difference in ViTs is a crucial distinguishing factor of architecture quality.
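A small sketch of the pattern-counting idea follows. It assumes one reading of SWAP in which each unit contributes a binary pattern across the samples of the batch and the score is the number of distinct patterns; the helper names `gelu` and `swap_cardinality` are hypothetical, and the sign threshold used for GeLU outputs is an illustrative choice rather than the paper's exact rule.

```python
import numpy as np

def gelu(x):
    # Standard tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swap_cardinality(layer_activations):
    """Count distinct sample-wise activation patterns.

    layer_activations: list of arrays shaped (num_samples, num_units)
        holding post-activation values of the selected layers.
    """
    # Binarize: 1 where the post-activation value is positive, else 0.
    bits = np.concatenate(
        [(a > 0).astype(np.uint8) for a in layer_activations], axis=1
    )
    # Each row of bits.T is one unit's pattern over all samples.
    return int(np.unique(bits.T, axis=0).shape[0])

# Toy usage: 2 samples, 3 units in one layer and 1 unit in another.
pre = np.array([[1.0, -1.0, 1.0], [-1.0, 1.0, -1.0]])
cardinality = swap_cardinality([gelu(pre), gelu(np.array([[2.0], [3.0]]))])
```

Because GeLU is positive for positive inputs and non-positive otherwise, thresholding its output at zero yields the same patterns as ReLU would on the same pre-activations, which is why the same counting routine can serve both.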
LIBRA-NAS Metric Combination Algorithm:
- Function: Intelligently combines multiple proxy metrics to obtain higher correlation than any single metric.
- Mechanism: A three-step selection: (1) Selects the metric \(z_{\text{best}}\) with the highest correlation; (2) Selects the most complementary metric (lowest conditional mutual information) via information gain; (3) Selects the metric with a bias closest to the validation accuracy distribution for bias realignment. Ultimately, a combination of three metrics replaces a single metric for NAS search.
- Design Motivation: Different search spaces may favor different types of metrics, and a single metric cannot adapt to all scenarios.
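The three-step selection can be illustrated with the simplified sketch below. Several parts are assumptions made for the example and are not taken from the paper: correlation is measured with Spearman's rank correlation, complementarity is approximated by a histogram estimate of the plain (not conditional) mutual information between two proxies, and "bias" is taken to be a proxy's rank correlation with an architecture feature such as parameter count. All function names are hypothetical, and how the three selected metrics are finally combined is not shown.

```python
import numpy as np

def rank(x):
    # Ordinal ranks; assumes no ties, which is enough for this sketch.
    return np.argsort(np.argsort(x)).astype(float)

def spearman(a, b):
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

def mutual_information(a, b, bins=4):
    # Histogram estimate of I(a; b) in nats using equal-width bins.
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (pa @ pb)[mask])))

def libra_select(scores, val_acc, feature, bins=4):
    """Pick three proxy names from `scores` (needs at least three entries).

    scores: dict mapping proxy name -> array of scores per architecture.
    val_acc: validation accuracy per architecture.
    feature: architecture feature used to define bias (assumption).
    """
    names = list(scores)
    # Step 1: proxy with the highest correlation to validation accuracy.
    z1 = max(names, key=lambda n: spearman(scores[n], val_acc))
    rest = [n for n in names if n != z1]
    # Step 2: proxy sharing the least information with z1.
    z2 = min(rest, key=lambda n: mutual_information(scores[z1], scores[n], bins))
    rest = [n for n in rest if n != z2]
    # Step 3: proxy whose bias is closest to the bias of validation accuracy.
    target = spearman(val_acc, feature)
    z3 = min(rest, key=lambda n: abs(spearman(scores[n], feature) - target))
    return z1, z2, z3
```

With small sample sizes the histogram estimate in step 2 becomes coarse, so the bin count is worth treating as a tunable parameter.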

Loss & Training

Training-free (zero-shot); L-SWAG can be computed with only one forward and one backward propagation. LIBRA-NAS integrated into NAS search discovers an architecture with a 17.0% test error rate on ImageNet1k in 0.1 GPU days. The newly constructed ViT evaluation benchmark contains 2000 trained ViT models, covering the Autoformer search space on CIFAR-10, CIFAR-100, and ImageNet16-120 with three training strategies (AE, Jigsaw, Normal).

Key Experimental Results

Main Results

| Metric | ViT (average \(\rho\) over 6 tasks) | NAS-Bench-201 (average \(\rho\)) | TransNAS-Bench-101 (average \(\rho\)) |
| --- | --- | --- | --- |
| #Params | 0.45 | 0.58 | 0.35 |
| ZiCO | 0.12 | 0.72 | 0.41 |
| NWOT | 0.38 | 0.65 | 0.28 |
| L-SWAG | 0.62 | 0.74 | 0.55 |
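For readers who want to reproduce this kind of number, the sketch below shows how a rank correlation between proxy scores and trained accuracies can be computed. It assumes \(\rho\) denotes Spearman's rank correlation, which the summary does not state explicitly; the helper names and the toy data are hypothetical.

```python
import numpy as np

def average_ranks(x):
    # Ranks starting at 1; tied values receive the mean of their ranks.
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_rho(proxy_scores, accuracies):
    # Pearson correlation of the two rank vectors.
    ra = average_ranks(proxy_scores)
    rb = average_ranks(accuracies)
    return float(np.corrcoef(ra, rb)[0, 1])

# Toy usage: four architectures with made-up scores and accuracies.
rho = spearman_rho([0.1, 0.4, 0.3, 0.9], [70.0, 72.5, 74.0, 80.1])
```

A value near 1 means the proxy orders architectures almost exactly as their trained accuracies do, while a value near 0 means the ordering is essentially uninformative.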

Ablation Study

| Configuration | ViT average \(\rho\) | Description |
| --- | --- | --- |
| L-SWAG (full) | 0.62 | \(\Lambda \times \Psi\) |
| Only \(\Lambda\) (Trainability) | 0.48 | Expressiveness contribution +0.14 |
| Only \(\Psi\) (Expressiveness) | 0.41 | Trainability contribution +0.21 |
| ZiCO (with \(\mu\)) | 0.12 | Discarding \(\mu\) yields substantial improvement |
| All layers (Non-layer-wise) | 0.51 | Layer selection contribution +0.11 |

Key Findings

Existing proxy metrics universally degrade on ViT search spaces, with most performing worse than the parameter count.
Discarding the gradient mean \(\mu\) is validated both theoretically and experimentally (Theorem 1 proves that the contribution of \(\mu\) is canceled out by the learning rate \(\eta\), and the correct upper bound only contains \(\sigma^2\) and \(((M\eta-1)\mu)^2\) terms).
The layer-wise selection strategy significantly improves metric quality and computational efficiency by focusing on information-dense layers.
The combination (multiplication) of trainability and expressiveness is crucial for ViTs — using either dimension alone is insufficient.
The architecture found by LIBRA-NAS within 0.1 GPU days (17.0% error rate) outperforms evolutionary and gradient-based NAS methods.
Ablation of the LIBRA three-step selection strategy: selecting \(z_2\) via min IG consistently outperforms max IG, random selection, and category-based selection; selecting \(z_3\) via bias matching outperforms bias minimization and random selection.
The SWAP expressiveness metric is successfully extended from ReLU to GeLU networks (via binarization approximation), making it applicable to ViTs.

Highlights & Insights

Theory-driven metric design: Theorem 1 rigorously proves the redundancy of the gradient mean term in ZiCO — the contribution of \(\mu\) in the training loss upper bound is canceled out by the choice of learning rate \(\eta\), and only the \(\sigma^2\) term is genuinely related to trainability.
Practical value of layer-wise analysis: The visualization of layer-wise gradient statistics of 1000 networks intuitively shows "which layers are important", which is heuristic yet highly effective.
Generality of LIBRA: The metric combination framework is independent of specific metrics and can integrate new proxy metrics at any time.
Search space construction: A newly constructed evaluation benchmark of 2000 trained ViT models (covering 6 tasks) fills the gap in rigorous correlation analysis within ViT search spaces.

Limitations & Future Work

The threshold for layer selection needs to be analyzed beforehand for each search space, rather than being fully automated.
On certain search spaces (such as NAS-Bench-201), the advantage of L-SWAG relative to ZiCO is marginal.
Only classification tasks were validated; it has not been extended to tasks like detection and segmentation.
In the future, more automated layer selection strategies and support for more ViT variants can be explored.
The extension of the SWAP expressiveness metric from ReLU to GeLU is based on binarization approximation, and its applicability to other activation functions (e.g., SiLU/Swish) has not been verified.
The mutual information estimation in the three-step selection strategy of LIBRA-NAS may not be accurate enough under small sample sizes.

Comparison with Prior Work

The paper also corrects a mathematical error in the proof of Theorem 3.1 of the original ZiCO paper (the summation over \(i\) is dropped between the fourth and fifth lines, and the \(1/2\) factor is not applied to all terms) and provides the corrected upper bound on the training loss.

vs ZiCO: ZiCO uses the \(\mu/\sigma\) ratio, which lacks a solid theoretical basis (Theorem 1) and fails on ViTs (\(\rho\) is only 0.12); L-SWAG uses only \(1/\sigma\) along with an expressiveness term, achieving a \(\rho\) of 0.62 on ViTs.
vs AZ-NAS: Although AZ-NAS includes ViT evaluation, it only evaluates within NAS search (where the performance gap in the search space is small); L-SWAG provides a rigorous correlation analysis of 2000 trained networks.
vs Te-NAS: Te-NAS relies on neural tangent kernel (NTK) theory, whose assumptions do not hold for modern DNNs, whereas L-SWAG is based on more reliable gradient variance theory.
vs NWOT: NWOT achieves a \(\rho\) of only 0.38 on ViTs, whereas L-SWAG is significantly superior with a \(\rho\) of 0.62.

Implementation Details

Evaluated on NB201, NB301, TransNAS-Bench-101, and the self-built Autoformer ViT search space, using 1×NVIDIA A100 GPU.

Rating

- Novelty: ⭐⭐⭐⭐ Novel combination of theoretical proof, layer-wise analysis, and ViT extension
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 14 tasks, 2000 trained ViT evaluations, and detailed ablations
- Writing Quality: ⭐⭐⭐⭐ Clear theoretical derivations and rich experimental charts
- Value: ⭐⭐⭐⭐ Fills the gap in zero-shot NAS for ViTs