L-SWAG: Layer-Sample Wise Activation with Gradients information for Zero-Shot NAS on Vision Transformers¶
Conference: CVPR 2025
arXiv: 2505.07300
Code: None
Area: Interpretability
Keywords: Neural Architecture Search, Zero-Shot NAS, Vision Transformer, Proxy Metric, Metric Combination
TL;DR¶
This paper proposes the L-SWAG metric, which characterizes the trainability and expressiveness of CNN and ViT networks through the product of layer-wise gradient variance and the cardinality of activation patterns. It further designs the LIBRA-NAS algorithm to combine complementary proxy metrics, achieving SOTA-level zero-shot NAS performance across ViT search spaces and 14 tasks.
Background & Motivation¶
- Background: Zero-shot NAS uses zero-cost (ZC) proxy metrics to estimate architecture quality without training any model, which makes it both fast and interpretable.
- Limitations of Prior Work: Existing SOTA proxy metrics are primarily limited to convolutional search spaces (such as NAS-Bench-201) and perform poorly on Vision Transformer search spaces, sometimes even underperforming simple parameter-count metrics.
- Key Challenge: Existing metrics either only consider gradients (trainability) or only consider activation patterns (expressiveness); a single dimension is insufficient to comprehensively characterize networks. Moreover, most metrics treat all layers equally, ignoring the differences in gradient statistics across different layers.
- Goal: To design a general proxy metric suitable for both CNN and ViT search spaces, and to develop an intelligent metric combination method.
- Key Insight: (1) Theoretically analyze the ZiCO metric to prove that the gradient mean should be discarded, keeping only the variance; (2) Empirically discover that the contribution of gradient statistics varies significantly across different layers; (3) Combine expressiveness metrics to compensate for the limitations of pure gradient metrics on ViTs.
- Core Idea: Layer-wise gradient variance (trainability) \(\times\) Layer-wise activation pattern cardinality (expressiveness) = L-SWAG.
Method¶
Overall Architecture¶
Input batch + randomly initialized DNN \(\to\) Extract layer-wise gradient statistics (only variance, discarding mean) \(\to\) Select the most informative layer interval \(\to\) Compute trainability score \(\Lambda^{\hat{L}}\) \(\to\) Compute layer-wise SWAP expressiveness score \(\Psi_{\mathcal{N},\theta}^{\hat{L}}\) \(\to\) Multiply both to obtain L-SWAG \(\to\) Use LIBRA-NAS to combine multiple metrics.
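Written out, the pipeline above amounts to the following schematic formula. The per-weight sum inside the logarithm and the use of the standard deviation (rather than the variance) are assumptions carried over from ZiCO's formulation and from the "\(1/\sigma\)" description in this note; the paper gives the exact definition.

\[
\Lambda^{\hat{L}} = \sum_{l=\hat{l}}^{\hat{L}} \log\left(\sum_{w \in \theta_l} \frac{1}{\sqrt{\text{Var}(|\nabla_w \mathcal{L}|)}}\right),
\qquad
\text{L-SWAG} = \Lambda^{\hat{L}} \times \Psi_{\mathcal{N},\theta}^{\hat{L}}
\]

Here \(\theta_l\) denotes the weights of layer \(l\) (a symbol introduced for this sketch), and the variance is taken across samples.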
Key Designs¶
- Layer-wise Gradient Variance Metric (\(\Lambda^{\hat{L}}\)):
    - Function: Measures the trainability of the network within the selected layer interval.
    - Mechanism: Computes the sample-wise variance \(\text{Var}(|\nabla_w \mathcal{L}|)\) of gradients for each layer \(l\), then takes the reciprocal and performs logarithmic summation. Key improvements: (1) Theoretically proves (Theorem 1) that the gradient mean \(\mu\) in ZiCO should be discarded and replaced with a constant 1; (2) Analyzes the layer-wise gradient statistics of 1000 random networks and reveals that only statistics in specific layer intervals (from \(\hat{l}\) to \(\hat{L}\)) are meaningful, thus only selecting these layers for calculation.
    - Design Motivation: The \(\mu/\sigma\) ratio in ZiCO is theoretically invalid (Theorem 1 proves that the contribution of \(\mu\) is canceled out by the learning rate), and treating all layers with equal weight is suboptimal.
- Layer-wise Activation Pattern Expressiveness (\(\Psi_{\mathcal{N},\theta}^{\hat{L}}\)):
    - Function: Measures the number of linear regions of the network over the input space, reflecting expressiveness.
    - Mechanism: Defines Sample-Wise Activation Patterns (SWAP): the post-activation values of each layer are binarized to obtain a set of activation patterns, and the cardinality of that set serves as the expressiveness score. This method is extended from ReLU to GeLU networks for the first time, making it applicable to ViTs.
    - Design Motivation: Pure gradient metrics fail in ViT search spaces because the expressiveness difference in ViTs is a crucial distinguishing factor of architecture quality.
- LIBRA-NAS Metric Combination Algorithm:
    - Function: Intelligently combines multiple proxy metrics to obtain higher correlation than any single metric.
    - Mechanism: A three-step selection: (1) Selects the metric \(z_{\text{best}}\) with the highest correlation; (2) Selects the most complementary metric (lowest conditional mutual information) via information gain; (3) Selects the metric with a bias closest to the validation accuracy distribution for bias realignment. Ultimately, a combination of three metrics replaces a single metric for NAS search.
    - Design Motivation: Different search spaces may favor different types of metrics, and a single metric cannot adapt to all scenarios.
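The first two designs can be illustrated with a minimal, framework-free sketch. It assumes the per-sample gradient magnitudes and post-activation values have already been extracted from the network. The function names, the data layout, the `eps` stabilizer, and the rule "one pattern per unit across the batch, binarized at zero" are assumptions made for illustration, not the authors' implementation.

```python
import math
from statistics import pvariance


def trainability(grad_mags, start, end, eps=1e-12):
    """Lambda: sum over layers start..end of log(sum over weights of 1/std).

    grad_mags[l][w] is a list of |dL/dw| values for weight w of layer l,
    one value per sample. eps (an assumption) guards against zero variance.
    """
    score = 0.0
    for layer in grad_mags[start:end + 1]:
        inv_std = sum(1.0 / math.sqrt(pvariance(g) + eps) for g in layer)
        score += math.log(inv_std)
    return score


def expressiveness(acts, start, end):
    """Psi: number of distinct binarized sample-wise activation patterns.

    acts[l][s][u] is the post-activation value of unit u in layer l for
    sample s. Each unit contributes one pattern: its sign across the batch.
    Binarizing at zero is assumed to cover both ReLU and GeLU outputs.
    """
    patterns = set()
    for layer in acts[start:end + 1]:
        n_units = len(layer[0])
        for u in range(n_units):
            patterns.add(tuple(1 if sample[u] > 0 else 0 for sample in layer))
    return len(patterns)


def l_swag(grad_mags, acts, start, end):
    """Product of the trainability and expressiveness terms."""
    return trainability(grad_mags, start, end) * expressiveness(acts, start, end)


# Toy inputs (made up): 2 layers, 2 samples.
toy_grads = [
    [[1.0, 3.0]],              # layer 0: one weight, variance 1
    [[0.0, 2.0], [0.0, 2.0]],  # layer 1: two weights, variance 1 each
]
toy_acts = [
    [[1.0, -1.0], [-0.5, -2.0]],  # layer 0: units give patterns (1,0), (0,0)
    [[0.3, 0.2], [-0.1, -0.4]],   # layer 1: both units give pattern (1,0)
]
toy_score = l_swag(toy_grads, toy_acts, 0, 1)
```

Restricting `start` and `end` to a sub-range of layers is where the layer-interval selection would enter; how that interval is chosen is specific to each search space and is not modeled here.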
Loss & Training¶
Training-free (zero-shot); L-SWAG can be computed with only one forward and one backward propagation. LIBRA-NAS integrated into NAS search discovers an architecture with a 17.0% test error rate on ImageNet1k in 0.1 GPU days. The newly constructed ViT evaluation benchmark contains 2000 trained ViT models, covering the Autoformer search space on CIFAR-10, CIFAR-100, and ImageNet16-120 with three training strategies (AE, Jigsaw, Normal).
Key Experimental Results¶
Main Results¶
| Metric | ViT (Average \(\rho\) on 6 Tasks) | NAS-Bench-201 (Average \(\rho\)) | TransNasBench (Average \(\rho\)) |
|---|---|---|---|
| #Params | 0.45 | 0.58 | 0.35 |
| ZiCO | 0.12 | 0.72 | 0.41 |
| NWOT | 0.38 | 0.65 | 0.28 |
| L-SWAG | 0.62 | 0.74 | 0.55 |
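The \(\rho\) values in this table are correlations between proxy scores and trained accuracies. Assuming \(\rho\) denotes Spearman's rank correlation (the usual choice for zero-cost proxy benchmarks, but an assumption here), it could be computed as in the sketch below. The helper names are illustrative.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(proxy_scores, accuracies):
    """Spearman rho = Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(proxy_scores), average_ranks(accuracies))
```

A proxy that orders architectures exactly as their trained accuracies do gets \(\rho = 1\), and one that reverses the order gets \(\rho = -1\).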
Ablation Study¶
| Configuration | ViT Average \(\rho\) | Description |
|---|---|---|
| L-SWAG (full) | 0.62 | \(\Lambda \times \Psi\) |
| Only \(\Lambda\) (Trainability) | 0.48 | Expressiveness contribution +0.14 |
| Only \(\Psi\) (Expressiveness) | 0.41 | Trainability contribution +0.21 |
| ZiCO (with \(\mu\)) | 0.12 | Discarding \(\mu\) yields substantial improvement |
| All layers (Non-layer-wise) | 0.51 | Layer selection contribution +0.11 |
Key Findings¶
- Existing proxy metrics universally degrade on ViT search spaces, with most performing worse than the parameter count.
- Discarding the gradient mean \(\mu\) is validated both theoretically and experimentally (Theorem 1 proves that the contribution of \(\mu\) is canceled out by the learning rate \(\eta\), and the correct upper bound only contains \(\sigma^2\) and \(((M\eta-1)\mu)^2\) terms).
- The layer-wise selection strategy significantly improves metric quality and computational efficiency by focusing on information-dense layers.
- The combination (multiplication) of trainability and expressiveness is crucial for ViTs — using either dimension alone is insufficient.
- The architecture found by LIBRA-NAS within 0.1 GPU days (17.0% error rate) outperforms evolutionary and gradient-based NAS methods.
- Ablation of the LIBRA three-step selection strategy: selecting \(z_2\) via min IG consistently outperforms max IG, random selection, and category-based selection; selecting \(z_3\) via bias matching outperforms bias minimization and random selection.
- The SWAP expressiveness metric is successfully extended from ReLU to GeLU networks (via binarization approximation), making it applicable to ViTs.
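The three-step LIBRA selection mentioned above can be outlined as follows. This is only a skeleton: the correlation, information-gain, and bias estimators are passed in as callables because this note does not specify them, and the `toy_*` functions are placeholders for demonstration, not the paper's estimators. All names are hypothetical.

```python
def libra_select(metrics, accuracies, corr, info_gain, bias):
    """Pick (z_best, z_2, z_3) from a dict of metric name -> list of scores.

    corr(scores, accuracies)            -> higher is better (step 1)
    info_gain(scores, best_scores, acc) -> lower is chosen (step 2, min IG)
    bias(values)                        -> scalar compared with bias(acc) (step 3)
    """
    if len(metrics) < 3:
        raise ValueError("need at least three candidate metrics")
    names = list(metrics)
    # Step 1: metric with the highest correlation to accuracy.
    z_best = max(names, key=lambda n: corr(metrics[n], accuracies))
    rest = [n for n in names if n != z_best]
    # Step 2: metric with the lowest information gain relative to z_best.
    z_2 = min(rest, key=lambda n: info_gain(metrics[n], metrics[z_best], accuracies))
    rest = [n for n in rest if n != z_2]
    # Step 3: metric whose bias is closest to that of the accuracies.
    target = bias(accuracies)
    z_3 = min(rest, key=lambda n: abs(bias(metrics[n]) - target))
    return z_best, z_2, z_3


# Placeholder estimators (assumptions for the demo only).
def toy_corr(s, a):
    # Count of concordantly ordered pairs.
    return sum((s[i] - s[j]) * (a[i] - a[j]) > 0
               for i in range(len(s)) for j in range(i))


def toy_info_gain(s, best, acc):
    # Stand-in: agreement with the already selected metric.
    return toy_corr(s, best)


def toy_bias(values):
    return sum(values) / len(values)


toy_metrics = {
    "a": [1, 2, 3, 4],
    "b": [1, 2, 4, 3],
    "c": [4, 3, 2, 1],
    "d": [2, 2, 2, 3],
}
toy_acc = [0.1, 0.2, 0.3, 0.4]
selected = libra_select(toy_metrics, toy_acc, toy_corr, toy_info_gain, toy_bias)
```

Keeping the estimators pluggable mirrors the point that the framework is independent of the specific metrics; how the three selected metrics are then merged into one ranking is not covered by this sketch.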
Highlights & Insights¶
- Theory-driven metric design: Theorem 1 rigorously proves the redundancy of the gradient mean term in ZiCO — the contribution of \(\mu\) in the training loss upper bound is canceled out by the choice of learning rate \(\eta\), and only the \(\sigma^2\) term is genuinely related to trainability.
- Practical value of layer-wise analysis: The visualization of layer-wise gradient statistics of 1000 networks intuitively shows "which layers are important", which is heuristic yet highly effective.
- Generality of LIBRA: The metric combination framework is independent of specific metrics and can integrate new proxy metrics at any time.
- Search space construction: A newly constructed evaluation benchmark of 2000 trained ViT models (covering 6 tasks) fills the gap in rigorous correlation analysis within ViT search spaces.
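The cancellation argument behind Theorem 1 can be made concrete with a schematic form of the bound. Only the two terms \(\sigma^2\) and \(((M\eta-1)\mu)^2\) come from this note; the non-negative coefficients \(a\) and \(b\) are placeholders, not the paper's constants, and \(M\) is as defined in the theorem.

\[
\mathcal{L}_{\text{train}} \le a\,\sigma^2 + b\,\big((M\eta-1)\mu\big)^2
\quad\xrightarrow{\;\eta = 1/M\;}\quad
\mathcal{L}_{\text{train}} \le a\,\sigma^2
\]

Under this reading, a suitable learning rate removes the \(\mu\) term entirely, which leaves \(\sigma^2\) as the only quantity that an architecture-level proxy needs to track.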
Limitations & Future Work¶
- The threshold for layer selection needs to be analyzed beforehand for each search space, rather than being fully automated.
- On certain search spaces (such as NAS-Bench-201), the advantage of L-SWAG relative to ZiCO is marginal.
- Only classification tasks were validated; it has not been extended to tasks like detection and segmentation.
- In the future, more automated layer selection strategies and support for more ViT variants can be explored.
- The extension of the SWAP expressiveness metric from ReLU to GeLU is based on binarization approximation, and its applicability to other activation functions (e.g., SiLU/Swish) has not been verified.
- The mutual information estimation in the three-step selection strategy of LIBRA-NAS may not be accurate enough under small sample sizes.
- The paper also corrects a mathematical error in the proof of Theorem 3.1 of the original ZiCO paper (the summation over \(i\) is dropped between the fourth and fifth lines, and the \(1/2\) factor is not applied to all terms) and provides the corrected upper bound on the training loss.
Related Work & Insights¶
- vs ZiCO: ZiCO uses the \(\mu/\sigma\) ratio, which lacks a solid theoretical basis (Theorem 1) and fails on ViTs (\(\rho\) is only 0.12); L-SWAG uses only \(1/\sigma\) along with an expressiveness term, achieving a \(\rho\) of 0.62 on ViTs.
- vs AZ-NAS: Although AZ-NAS includes ViT evaluation, it only evaluates within NAS search (where the performance gap in the search space is small); L-SWAG provides a rigorous correlation analysis of 2000 trained networks.
- vs Te-NAS: Te-NAS relies on neural tangent kernel (NTK) theory, whose assumptions do not hold for modern DNNs, whereas L-SWAG is based on more reliable gradient variance theory.
- vs NWOT: NWOT achieves a \(\rho\) of only 0.38 on ViTs, whereas L-SWAG is significantly superior with a \(\rho\) of 0.62.
Implementation Details¶
Evaluated on NB201, NB301, TransNasBench-101, and the self-built Autoformer ViT search space. Uses bert-base-uncased as the embedding model, 1×NVIDIA A100 GPU.
Rating¶
- Novelty: ⭐⭐⭐⭐ Novel combination of theoretical proof, layer-wise analysis, and ViT extension
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 14 tasks, 2000 trained ViT evaluations, and detailed ablations
- Writing Quality: ⭐⭐⭐⭐ Clear theoretical derivations and rich experimental charts
- Value: ⭐⭐⭐⭐ Fills the gap in zero-shot NAS for ViTs