ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training¶
Conference: CVPR 2026 arXiv: 2603.13115 Code: None Area: Other Keywords: sparse training, SAM, zeroth-order optimization, gradient variance, flat minima
TL;DR¶
This paper proposes ZO-SAM, which replaces the backpropagation in SAM's perturbation step with zeroth-order gradient estimation, reducing SAM's computational overhead from two backward passes to one. This makes SAM practical for sparse training for the first time, achieving consistent improvements of 0.38%–2.54% over all mainstream sparse training methods on CIFAR-10/100 and ImageNet-1K.
Background & Motivation¶
Background: Sparse neural networks significantly reduce parameter count and computational cost by retaining only a small fraction of active weights. Mainstream approaches are categorized into static methods (LTH, SNIP, GraSP) and dynamic methods (SET, DSR, RigL, MEST).
Limitations of Prior Work:
- Gradient signals in sparse training are noisy and chaotic — after extensive pruning, the remaining parameters bear a disproportionate burden, and gradient variance increases sharply with sparsity.
- High sparsity leads to narrower and steeper loss landscapes, resulting in inefficient and meandering optimization trajectories.
- SAM can guide models toward flat minima to mitigate these issues, but its dual-backpropagation overhead directly contradicts the computational efficiency goals of sparse training.
Key Challenge: The generalization benefit of SAM versus its doubled computational cost is a fundamental tension, particularly acute in sparse training, which is inherently motivated by computational savings.
Key Insight: The perturbation step of SAM does not require high gradient accuracy — it only needs to determine the perturbation direction — making it amenable to coarse zeroth-order approximation instead of exact gradients.
Core Idea: Apply random gradient estimation (RGE) in SAM's perturbation step while retaining the first-order exact gradient in the update step, reducing the number of backward passes from two to one.
Method¶
Overall Architecture¶
ZO-SAM preserves SAM's two-step structure but modifies the first step:
- Perturbation step (zeroth-order): Estimates the gradient direction via RGE without backpropagation: \(\epsilon = \rho \frac{\hat{\nabla}\mathcal{L}(\theta)}{\|\hat{\nabla}\mathcal{L}(\theta)\|}\)
- Update step (first-order): Updates parameters using exact first-order gradients computed at the perturbed point: \(\theta \leftarrow \theta - \eta \nabla\mathcal{L}(\theta^*(\epsilon))\)
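To make the two-step structure concrete, below is a minimal PyTorch-style sketch of one ZO-SAM training step. The function names (`rge_estimate`, `zo_sam_step`) and the default values for `rho`, `delta`, and `m` are illustrative assumptions, not the authors' released code (no code is released); the sparsity mask is ignored here and handled in a later sketch.

```python
import torch

@torch.no_grad()
def rge_estimate(model, loss_fn, x, y, delta=1e-3, m=5):
    """Two-point random gradient estimate (RGE): 2m forward passes, no backprop."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = [torch.zeros_like(p) for p in params]
    for _ in range(m):
        u = [torch.randn_like(p) for p in params]       # random Gaussian direction
        for p, ui in zip(params, u):                     # evaluate L(theta + delta * u)
            p.add_(delta * ui)
        loss_plus = loss_fn(model(x), y)
        for p, ui in zip(params, u):                     # evaluate L(theta - delta * u)
            p.sub_(2 * delta * ui)
        loss_minus = loss_fn(model(x), y)
        for p, ui in zip(params, u):                     # restore theta
            p.add_(delta * ui)
        coeff = (loss_plus - loss_minus) / (2 * delta * m)
        for g, ui in zip(grads, u):                      # accumulate the directional estimate
            g.add_(coeff * ui)
    return grads

def zo_sam_step(model, loss_fn, x, y, opt, rho=0.05, delta=1e-3, m=5):
    params = [p for p in model.parameters() if p.requires_grad]
    # 1) Perturbation step (zeroth-order): epsilon = rho * g_hat / ||g_hat||
    g_hat = rge_estimate(model, loss_fn, x, y, delta, m)
    norm = torch.sqrt(sum((g * g).sum() for g in g_hat)) + 1e-12
    eps = [rho * g / norm for g in g_hat]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                                    # move to theta* = theta + epsilon
    # 2) Update step (first-order): exact backprop gradient at the perturbed point
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                                    # restore theta before applying the update
    opt.step()                                           # theta <- theta - eta * grad L(theta*)
    return loss.item()
```

Only the perturbation step changes relative to vanilla SAM: the first backward pass is replaced by `2m` cheap forward evaluations, while the parameter update still uses an exact gradient computed at the perturbed point.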
Key Designs¶
- Random Gradient Estimation (RGE) as a Substitute for Backpropagation:
- Function: Estimates the gradient direction via forward passes during the perturbation step, eliminating the first backward pass.
- Core formula: \(\hat{\nabla}\mathcal{L}(\theta) = \frac{1}{m}\sum_{i=1}^m \frac{\mathcal{L}(\theta + \delta u_i) - \mathcal{L}(\theta - \delta u_i)}{2\delta} u_i\) where \(u_i \sim \mathcal{N}(0, I)\), \(\delta\) is a small step size, and \(m \ll d\) is the number of samples.
- Design Motivation: The perturbation step only requires identifying an approximate worst-case direction; exact gradients are unnecessary. RGE requires only \(2m\) forward passes (with small \(m\)), far less than a full backward pass. The random directional sampling also provides smoother landscape exploration.
- Why RGE over coordinate-wise gradient estimation (CGE): CGE requires \(d\) function evaluations, where \(d\) is the parameter dimension (on the order of millions for modern networks), making it infeasible, whereas RGE's \(2m\) evaluations are independent of \(d\).
- Retention of First-Order Exact Update:
- Function: Computes exact gradients via standard backpropagation at the perturbed parameter point \(\theta^*(\epsilon) = \theta + \epsilon\).
- Design Motivation: The parameter update step demands high-precision gradients to ensure training stability and convergence; approximation is not acceptable here.
- Compatibility with Sparse Training Methods:
- Function: ZO-SAM serves as a drop-in optimizer replacement for SGD and can be combined with any sparse training method.
- Validated in combination with 7 methods: LTH, SNIP, GraSP (static) + SET, DSR, RigL, MEST (dynamic).
- Design Motivation: As a general optimization framework, ZO-SAM does not alter the sparse structure search logic.
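As a rough illustration of the drop-in claim, the sketch below wraps `zo_sam_step` from the earlier sketch in a toy sparse-training loop. The random masks, the tiny model, and the choice to simply re-zero pruned weights after every step are illustrative assumptions standing in for SNIP/RigL-style mask logic; the paper's actual integration with the mask-update schedules of SET/DSR/RigL/MEST is not reproduced here.

```python
import torch
import torch.nn as nn

def make_masks(model, sparsity=0.9):
    """Random binary masks at the given sparsity (a stand-in for SNIP/RigL mask selection)."""
    return [(torch.rand_like(p) > sparsity).float() for p in model.parameters()]

def enforce_sparsity(model, masks):
    """Zero out pruned weights so the sparse pattern survives each update."""
    with torch.no_grad():
        for p, m in zip(model.parameters(), masks):
            p.mul_(m)

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
masks = make_masks(model, sparsity=0.9)
enforce_sparsity(model, masks)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
    # zo_sam_step(...) is defined in the sketch under "Overall Architecture" above.
    loss = zo_sam_step(model, loss_fn, x, y, opt)   # ZO-SAM as a drop-in optimizer step
    enforce_sparsity(model, masks)                  # keep pruned weights at zero
    # A dynamic method (SET/DSR/RigL/MEST) would periodically prune/regrow `masks` here;
    # ZO-SAM leaves that mask-update logic untouched.
```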
Loss & Training¶
Standard classification loss (cross-entropy) is used; ZO-SAM modifies only the optimizer. The hyperparameter \(\rho\) (neighborhood size) inherits SAM's default value, while \(\delta\) (zeroth-order step size) and \(m\) (number of samples) are newly introduced hyperparameters.
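For reference, a hypothetical hyperparameter configuration is sketched below: \(\rho = 0.05\) is the widely used SAM default, while the values shown for \(\delta\) and \(m\) are illustrative placeholders, not the paper's reported settings.

```python
# Illustrative ZO-SAM hyperparameters (placeholder values, not the paper's settings).
zo_sam_config = {
    "rho": 0.05,    # SAM neighborhood size; 0.05 is the commonly used SAM default
    "delta": 1e-3,  # zeroth-order finite-difference step size (hypothetical value)
    "m": 5,         # number of RGE direction samples, m << d (hypothetical value)
}
```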
Key Experimental Results¶
Main Results — ResNet-32 on CIFAR-10/100, Top-1 Accuracy (%) at 90% and 98% Sparsity¶
| Method | CIFAR-10 90% | CIFAR-10 98% | CIFAR-100 90% | CIFAR-100 98% |
|---|---|---|---|---|
| RigL | 93.07 | 89.00 | 70.34 | 64.07 |
| RigL+ZO-SAM | 93.66(+0.59) | 90.61(+1.61) | 72.88(+2.54) | 65.17(+1.10) |
| MEST | 92.56 | 89.22 | 70.44 | 64.59 |
| MEST+ZO-SAM | 93.50(+0.94) | 91.53(+2.31) | 72.20(+1.76) | 66.01(+1.42) |
Transformer on ImageNet-1K¶
| Model | Sparsity | Method | Accuracy (%) | Gain |
|---|---|---|---|---|
| DeiT-Small | 70% | RigL | 77.99 | - |
| DeiT-Small | 70% | RigL+ZO-SAM | 79.16 | +1.17 |
| DeiT-Tiny | 50% | SViTE | 70.18 | - |
| DeiT-Tiny | 50% | SNIP+ZO-SAM | 71.32 | +1.14 |
Convergence Speed Comparison¶
| Method | Epochs to reach 90% accuracy (CIFAR-10, 90% sparsity) |
|---|---|
| SGD | 104 |
| ESAM | 75 |
| LookSAM(k=5) | 79 |
| GSAM | 84 |
| ZO-SAM | 70 |
Key Findings¶
- Higher sparsity yields greater gains from ZO-SAM: Improvements are most pronounced at 98% sparsity (MEST+ZO-SAM achieves +2.31% on CIFAR-10), as gradient variance issues are more severe at higher sparsity levels.
- ZO-SAM transforms the loss landscape from a narrow, deep basin to a wide, flat basin, as confirmed by visualization.
- Gradient variance is substantially reduced: at 90% sparsity, ZO-SAM's gradient variance is approximately one-third that of SGD.
- Convergence is substantially faster than SGD (70 vs. 104 epochs to reach 90% accuracy, about 34 epochs fewer) and comparable to efficient SAM variants such as ESAM.
- Effectiveness generalizes to Transformers (DeiT), not limited to CNNs.
- ZO-SAM demonstrates greater robustness on CIFAR-10-C distribution shift benchmarks.
Highlights & Insights¶
- Precise diagnosis of the core problem in sparse training: Rather than broadly attributing difficulty to sparsity, the paper pinpoints "high gradient variance" as the specific root cause and addresses it directly.
- The hybrid zeroth-order/first-order strategy is elegant: The perturbation step does not require gradient precision (a coarse zeroth-order direction suffices), while the update step demands exact gradients (first-order retained). This selective allocation of computational resources is a principled design insight worth emulating.
- Plug-and-play generality: Consistent improvements across 7 sparse training methods × 3 sparsity levels × 2 datasets, without modifying the underlying sparse methods.
- First practical deployment of SAM in sparse training: Prior to this work, SAM's doubled computational overhead rendered it impractical for sparse training; ZO-SAM removes this barrier.
Limitations & Future Work¶
- The approximation noise introduced by RGE may accumulate in very high-dimensional settings; further validation on large-scale models is needed.
- The selection of the zeroth-order sample count \(m\) lacks an adaptive strategy and currently requires manual tuning.
- Evaluation is limited to unstructured (element-wise) sparsity; structured sparsity (e.g., channel pruning) is not addressed.
- Full ImageNet-1K experiments cover only DeiT-Tiny/Small; the behavior of larger models (e.g., ViT-Large) remains unknown.
- Combinations with more efficient SAM variants (ESAM, LookSAM) are unexplored — whether a "ZO-ESAM" could yield further speedups is an open question.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of zeroth-order estimation with SAM is conceptually straightforward, yet the design decision to apply it selectively to the perturbation step reflects careful reasoning.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage across 7 methods × 3 sparsity levels × multiple datasets and architectures.
- Writing Quality: ⭐⭐⭐⭐ Motivation analysis (gradient variance, loss landscape visualization) is well executed.
- Value: ⭐⭐⭐⭐ Makes SAM genuinely usable in sparse training, with high practical engineering value.