Sparse Training from Random Initialization: Aligning Lottery Ticket Masks using Weight Symmetry¶

Meta Information¶

Conference: ICML 2025
arXiv: 2505.05143
Code: GitHub (Public)
Area: Others
Keywords: Lottery Ticket Hypothesis, Weight Symmetry, Permutation Matching, Sparse Training, Linear Mode Connectivity

TL;DR¶

This work explains, from the perspective of weight symmetry, why Lottery Ticket Hypothesis (LTH) masks fail to transfer to new initializations. It proposes permutation matching to align an LTH mask with the optimization basin of a new initialization, so that the mask can be used for sparse training from that initialization.

Background & Motivation¶

Background¶

The Lottery Ticket Hypothesis (LTH) states that dense networks contain sparse subnetworks that, when trained from the original initialization, can match the performance of the dense network.

Limitations of Prior Work¶

Core problem: an LTH mask is tied to the initialization from which it was found and cannot be transferred to a new random initialization.

Key Challenge¶

Weight symmetry: neural networks are permutation invariant. Reordering the neurons within a layer, together with the matching reordering of the next layer's incoming weights, does not change the function the network computes.
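The symmetry described above can be checked numerically. The following is a minimal sketch on a hypothetical two-layer ReLU MLP (the layer sizes and variable names are illustrative assumptions, not taken from the paper): swapping two hidden neurons changes the weight matrices but not the output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer MLP: x -> relu(W1 x + b1) -> W2 h + b2
d_in, d_hidden, d_out = 4, 6, 3
W1 = rng.normal(size=(d_hidden, d_in))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))
b2 = rng.normal(size=d_out)

def mlp(x, W1, b1, W2, b2):
    h = np.maximum(W1 @ x + b1, 0.0)  # ReLU hidden layer
    return W2 @ h + b2

# Swap hidden neurons 0 and 1; all other neurons keep their position.
perm = np.array([1, 0, 2, 3, 4, 5])
W1_p, b1_p = W1[perm], b1[perm]  # reorder rows (outgoing side of layer 1)
W2_p = W2[:, perm]               # reorder columns (incoming side of layer 2)

x = rng.normal(size=d_in)
same_function = np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1_p, b1_p, W2_p, b2))
weights_differ = not np.allclose(W1, W1_p)
```

Because the two parameter sets differ while the function is identical, a mask defined over one ordering of neurons does not automatically fit another ordering.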

Core Idea¶

Hypothesis: mask transfer fails because the optimization basin of the LTH mask is not aligned with that of the new initialization.

Method¶

1. Core Hypothesis¶

Models trained from different random initializations converge to the same basin (modulo permutation); therefore, LTH masks require corresponding permutations to align with new initializations.

2. Permutation Matching¶

Activation matching (Ainsworth et al., 2023) is used to find the per-layer permutation $\pi_l$: $$\pi_l = \arg\min_\pi \|Z_l^B - \pi Z_l^A\|_F^2 = \arg\max_\pi \langle \pi, Z_l^B (Z_l^A)^\top \rangle_F$$ where $Z_l^A$ and $Z_l^B$ are the layer-$l$ activations of models A and B. This is a linear assignment problem, which is solved via the Hungarian algorithm.
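The equivalence between the two objectives follows from expanding the squared Frobenius norm: the terms $\|Z_l^B\|_F^2$ and $\|\pi Z_l^A\|_F^2$ do not depend on $\pi$, so minimizing the distance is the same as maximizing the inner product. A minimal sketch of the per-layer assignment is given here, assuming activations are stored as `(num_units, num_samples)` arrays and using SciPy's `linear_sum_assignment` as the assignment solver. The function name `match_activations` and the synthetic data are illustrative, not the authors' code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_activations(Z_a, Z_b):
    """Return perm such that Z_a[perm] best matches Z_b, unit by unit.

    Z_a, Z_b: arrays of shape (num_units, num_samples) for one layer.
    """
    similarity = Z_b @ Z_a.T  # entry (i, j): <unit i of B, unit j of A>
    _, col_ind = linear_sum_assignment(similarity, maximize=True)
    return col_ind            # col_ind[i] = unit of A assigned to unit i of B

# Synthetic check: B's activations are a shuffled, slightly noisy copy of A's.
rng = np.random.default_rng(0)
Z_a = rng.normal(size=(8, 256))
true_perm = rng.permutation(8)
Z_b = Z_a[true_perm] + 0.01 * rng.normal(size=(8, 256))

perm = match_activations(Z_a, Z_b)
recovered = np.array_equal(perm, true_perm)
```

In a real network the activations of two independently trained models are only approximately matched, so the recovered permutation is an estimate rather than an exact inverse.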

3. Mask Alignment¶

Train models A and B to convergence.
Find the permutation $\pi$ using activation matching to align $\pi(w_A^{t=T})$ with $w_B^{t=T}$.
Permute the LTH mask $m_A$ to $\pi(m_A)$.
Run sparse training with the permuted mask $\pi(m_A)$, starting from $w_B^{t=k}$ (the rewind point of model B).
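The mask permutation in the steps above can be sketched for a hypothetical two-layer MLP. Several parts are stand-ins and should be read as assumptions: the one-shot magnitude mask replaces the iteratively pruned LTH mask, the random `perm` replaces the output of activation matching, and the random weights replace trained and rewound checkpoints.

```python
import numpy as np

def magnitude_mask(W, keep_fraction):
    """One-shot magnitude mask; a stand-in for the iteratively pruned LTH mask."""
    threshold = np.quantile(np.abs(W), 1.0 - keep_fraction)
    return (np.abs(W) >= threshold).astype(W.dtype)

def permute_hidden(A1, A2, perm):
    """Reorder hidden units: rows of the first layer, columns of the second."""
    return A1[perm], A2[:, perm]

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 5, 8, 3

# Stand-ins for model A's trained weights and model B's rewind-point weights.
W1_a, W2_a = rng.normal(size=(d_hidden, d_in)), rng.normal(size=(d_out, d_hidden))
W1_b, W2_b = rng.normal(size=(d_hidden, d_in)), rng.normal(size=(d_out, d_hidden))

# Mask found on model A (keeps 25% of the weights in each layer).
m1_a, m2_a = magnitude_mask(W1_a, 0.25), magnitude_mask(W2_a, 0.25)

# Stand-in for the permutation returned by activation matching (A aligned to B).
perm = rng.permutation(d_hidden)

# Permute the mask exactly as the weights of A would be permuted.
m1_p, m2_p = permute_hidden(m1_a, m2_a, perm)

# Sparse starting point for training model B with the aligned mask.
W1_b_sparse, W2_b_sparse = W1_b * m1_p, W2_b * m2_p
```

The mask must be permuted with the same row and column reordering as the weights. Otherwise the kept connections no longer correspond to the same neurons.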

4. Variance Collapse Repair¶

The REPAIR method is used to correct the variance collapse of activation statistics in interpolated networks, so that linear mode connectivity can be validated without this artifact.
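Variance collapse and its correction can be illustrated on a single linear layer. This sketch assumes that the correction amounts to resetting each unit's mean and standard deviation to the interpolation of the two endpoint networks' statistics. The helper names and the random weights are hypothetical, and applying the correction inside a full network is not shown.

```python
import numpy as np

def interpolated_stats(Z_a, Z_b, alpha):
    """Per-unit target statistics: interpolation of the endpoint statistics."""
    mean = (1 - alpha) * Z_a.mean(axis=1) + alpha * Z_b.mean(axis=1)
    std = (1 - alpha) * Z_a.std(axis=1) + alpha * Z_b.std(axis=1)
    return mean, std

def repair(Z_mid, target_mean, target_std, eps=1e-8):
    """Standardize each unit, then rescale and shift to the target statistics."""
    mu = Z_mid.mean(axis=1, keepdims=True)
    sd = Z_mid.std(axis=1, keepdims=True)
    return (Z_mid - mu) / (sd + eps) * target_std[:, None] + target_mean[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 2000))   # inputs: (features, samples)
W_a = rng.normal(size=(32, 16))   # stand-in for endpoint network A
W_b = rng.normal(size=(32, 16))   # stand-in for endpoint network B
alpha = 0.5

Z_a, Z_b = W_a @ X, W_b @ X
Z_mid = ((1 - alpha) * W_a + alpha * W_b) @ X  # weight-interpolated layer

target_mean, target_std = interpolated_stats(Z_a, Z_b, alpha)
collapsed = bool(np.all(Z_mid.std(axis=1) < target_std))  # variance collapse
Z_rep = repair(Z_mid, target_mean, target_std)
```

In this toy setting every unit of the interpolated layer has a smaller standard deviation than the interpolated target, and the rescaling restores it.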

Experiments¶

Main Results¶

Model/Dataset	Sparsity	Naive vs. Permuted
ResNet20/CIFAR-10	90%	permuted is significantly better than naive
ResNet20/CIFAR-100	90%	Consistent advantage
VGG11/CIFAR-10	90%	permuted is close to LTH
ResNet50/ImageNet	95%	permuted outperforms naive by about 2%

Width Effect¶

Wider models lead to more accurate permutation matching, so the gap between the permuted and LTH solutions narrows progressively as the width increases from 1 to 16.

Diversity Analysis (Table 1)¶

Method	Test Accuracy	Ensemble Accuracy	Disagreement	KL	JS
LTH	91.15%	91.43%	0.035	0.038	0.011
Permuted	89.38%	91.75%	0.107	0.273	0.091

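The Permuted method exhibits substantially higher functional diversity than LTH (disagreement 0.107 vs. 0.035; KL 0.273 vs. 0.038; JS 0.091 vs. 0.011). Its ensemble accuracy is also higher (91.75% vs. 91.43%), even though its single-model test accuracy is lower (89.38% vs. 91.15%).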
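The diversity metrics in the table can be computed from two models' predicted class probabilities. This is a minimal sketch under assumed definitions (disagreement as the fraction of differing argmax predictions, KL and JS averaged over samples, ensembling by averaging probabilities). The paper's exact averaging over model pairs may differ, and the tiny arrays are made-up illustration data.

```python
import numpy as np

def disagreement(p, q):
    """Fraction of samples on which the two models predict different classes."""
    return float(np.mean(p.argmax(axis=1) != q.argmax(axis=1)))

def kl_divergence(p, q, eps=1e-12):
    """Mean KL(p || q) over samples; rows of p and q are class probabilities."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def js_divergence(p, q):
    """Mean Jensen-Shannon divergence over samples (symmetric, at most log 2)."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Made-up predictions of two models on two samples with two classes.
p = np.array([[0.9, 0.1], [0.2, 0.8]])
q = np.array([[0.6, 0.4], [0.7, 0.3]])
labels = np.array([0, 1])

dis = disagreement(p, q)
kl = kl_divergence(p, q)
js = js_divergence(p, q)

# Ensemble by averaging probabilities, then taking the argmax.
ensemble_acc = float(np.mean((0.5 * (p + q)).argmax(axis=1) == labels))
```

In this toy case the second model is wrong on the second sample, yet the averaged prediction is correct on both, which is the mechanism by which diverse members can improve an ensemble.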

Highlights & Insights¶

Novel Insight: Understanding the non-transferability of LTH masks from the perspective of weight symmetry.
Corrects the conclusion of Paul et al. (2023) that LTH and dense solutions do not reside in the same basin: the two are connected once variance collapse is accounted for.
The permuted models exhibit much higher functional diversity than LTH, which is beneficial for ensembling.
The method is simple, requiring only standard activation matching and mask permutation.

Limitations & Future Work¶

Requires training two dense models to obtain the permutation mapping, which increases computational cost.
Exact permutation matching across the whole network is an NP-hard problem, so approximate solutions are used in practice; the greedy solutions are not precise enough on ImageNet.
Current hardware cannot efficiently exploit unstructured sparsity.
Model pruning may introduce algorithmic bias (Hooker et al., 2020).

Rating¶

⭐⭐⭐⭐ Explains the limitations of LTH from a symmetry perspective with strong insight. The experiments comprehensively demonstrate the effectiveness and diversity advantages of mask alignment.