Sparse Training from Random Initialization: Aligning Lottery Ticket Masks using Weight Symmetry

Meta Information

  • Conference: ICML 2025
  • arXiv: 2505.05143
  • Code: GitHub (Public)
  • Area: Others
  • Keywords: Lottery Ticket Hypothesis, Weight Symmetry, Permutation Matching, Sparse Training, Linear Mode Connectivity

TL;DR

This work explains why Lottery Ticket Hypothesis (LTH) masks cannot transfer to new initializations from the perspective of weight symmetry, and proposes to achieve sparse training by aligning LTH masks with the optimization basins of new initializations via permutation matching.

Background & Motivation

Background

Lottery Ticket Hypothesis (LTH): Dense networks contain sparse subnetworks that, when trained with the original initialization, can match the performance of the dense network.

Limitations of Prior Work

Core Problem: LTH masks are tied to the initialization from which they were found and cannot be transferred to a new random initialization.

Key Challenge

Weight Symmetry: Neural networks exhibit permutation invariance: swapping two neurons in the same layer, together with their incoming and outgoing weights, does not change the function the network computes.
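The permutation invariance can be checked numerically. The following is a minimal sketch on a toy two-layer MLP; the layer sizes, the ReLU activation, and all variable names are assumptions chosen for illustration and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer MLP (illustrative sizes): x -> relu(W1 @ x + b1) -> W2 @ h + b2
d_in, d_hidden, d_out = 4, 6, 3
W1 = rng.normal(size=(d_hidden, d_in))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))
b2 = rng.normal(size=d_out)


def mlp(x, W1, b1, W2, b2):
    h = np.maximum(W1 @ x + b1, 0.0)  # hidden activations
    return W2 @ h + b2


# Reorder the hidden units: rows of W1 (and b1) and the matching columns of W2.
perm = rng.permutation(d_hidden)
W1_p, b1_p = W1[perm], b1[perm]
W2_p = W2[:, perm]

x = rng.normal(size=d_in)
out_original = mlp(x, W1, b1, W2, b2)
out_permuted = mlp(x, W1_p, b1_p, W2_p, b2)

# The two parameter sets differ, but they compute the same function.
same_function = np.allclose(out_original, out_permuted)
print(same_function)
```

The weights `W1_p`, `W2_p` sit at a different point in parameter space than `W1`, `W2`, which is why a mask found for one point need not fit an equivalent but permuted solution.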

Core Idea

Hypothesis: Mask transfer fails because the optimization basin of the LTH mask is not aligned with that of the new initialization.

Method

1. Core Hypothesis

Models trained from different random initializations converge to the same basin (modulo permutation); therefore, LTH masks require corresponding permutations to align with new initializations.

2. Permutation Matching

Activation matching (Ainsworth et al., 2023) is used to find the permutation mapping \(\pi\). For each layer \(l\), with \(Z_l^A\) and \(Z_l^B\) denoting the activations of models A and B at that layer:

\[\pi_l = \arg\min_\pi \|Z_l^B - \pi Z_l^A\|_F^2 = \arg\max_\pi \langle \pi, Z_l^B (Z_l^A)^\top \rangle_F\]

The resulting linear assignment problem is solved via the Hungarian algorithm.
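For a single layer, the matching step can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `match_activations`, the `(units, samples)` layout, and the synthetic data are assumptions. It relies on SciPy's `linear_sum_assignment` as the assignment solver. Maximizing the inner product is equivalent to minimizing the distance because \(\|Z^B - \pi Z^A\|_F^2 = \|Z^B\|_F^2 + \|Z^A\|_F^2 - 2\langle \pi, Z^B (Z^A)^\top \rangle_F\), and the first two terms do not depend on \(\pi\).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_activations(Z_a, Z_b):
    """Return perm such that Z_a[perm] best matches Z_b, unit by unit.

    Z_a, Z_b: arrays of shape (n_units, n_samples) holding one layer's
    activations for models A and B on the same inputs (assumed layout).
    """
    # similarity[i, j] = <Z_b[i], Z_a[j]>, i.e. the matrix Z^B (Z^A)^T
    similarity = Z_b @ Z_a.T
    # Linear assignment: choose one unit of A per unit of B, maximizing
    # the total similarity.
    _, col_ind = linear_sum_assignment(similarity, maximize=True)
    return col_ind


# Synthetic check: B's units are a shuffled, slightly noisy copy of A's.
rng = np.random.default_rng(0)
n_units, n_samples = 8, 256
Z_a = rng.normal(size=(n_units, n_samples))
true_perm = rng.permutation(n_units)
Z_b = Z_a[true_perm] + 0.01 * rng.normal(size=(n_units, n_samples))

recovered_perm = match_activations(Z_a, Z_b)
```

In this synthetic setting `recovered_perm` equals `true_perm`; with real networks the match is only approximate, since independently trained models do not have exactly corresponding units.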

3. Mask Alignment

  • Train models A and B to convergence.
  • Find the permutation \(\pi\) using activation matching to align \(\pi(w_A^{t=T})\) with \(w_B^{t=T}\).
  • Permute the LTH mask \(m_A\) to \(\pi(m_A)\).
  • Run sparse training with the permuted mask \(\pi(m_A)\), starting from \(w_B^{t=k}\) (the rewind point).
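The mask permutation in the steps above can be sketched for a two-layer MLP as follows. Everything here is a stand-in chosen for illustration: the weights are random rather than trained, the mask is random rather than an actual LTH mask, and `perm` is drawn at random where the method would use the permutation found by activation matching. The helper `permute_hidden` is a hypothetical name.

```python
import numpy as np


def permute_hidden(W1, W2, perm):
    """Apply a hidden-unit permutation to a two-layer MLP's weights or masks.

    W1: (d_hidden, d_in), rows are reordered.
    W2: (d_out, d_hidden), columns are reordered to match.
    """
    return W1[perm], W2[:, perm]


rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 5, 8, 3

# Stand-ins for model A's weights and its mask m_A (about 10% of weights kept).
W1_a = rng.normal(size=(d_hidden, d_in))
W2_a = rng.normal(size=(d_out, d_hidden))
m1_a = (rng.random(W1_a.shape) < 0.1).astype(float)
m2_a = (rng.random(W2_a.shape) < 0.1).astype(float)

# Stand-in for the permutation pi (random here, matched in the method).
perm = rng.permutation(d_hidden)

# Permute the mask exactly as the weights would be permuted.
m1_p, m2_p = permute_hidden(m1_a, m2_a, perm)
W1_p, W2_p = permute_hidden(W1_a, W2_a, perm)

# Consistency check: masking then permuting equals permuting then masking.
mw1, mw2 = permute_hidden(m1_a * W1_a, m2_a * W2_a, perm)
consistent = np.array_equal(mw1, m1_p * W1_p) and np.array_equal(mw2, m2_p * W2_p)

# Final step: apply the permuted mask to model B's rewound weights (stand-in).
W1_b_rewind = rng.normal(size=W1_a.shape)
sparse_W1_b = m1_p * W1_b_rewind
```

Permuting the mask only reorders its entries, so the sparsity level is unchanged; what changes is which connections of model B are kept.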

4. Variance Collapse Repair

REPAIR is applied to correct the variance collapse of activation statistics in interpolated networks, so that linear mode connectivity can be evaluated reliably.
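The idea behind the repair step can be illustrated on a single linear layer. This is a simplified sketch under stated assumptions, not the implementation used in the paper: it rescales each unit's pre-activations in the interpolated network so that their mean and standard deviation match the interpolation of the endpoint statistics. The function name `repair_preactivations`, the data, and the layer sizes are hypothetical.

```python
import numpy as np


def repair_preactivations(z_interp, z_a, z_b, alpha=0.5, eps=1e-8):
    """Rescale per-unit statistics of an interpolated network's pre-activations.

    z_interp, z_a, z_b: arrays of shape (n_samples, n_units) for the
    interpolated network and the two endpoint networks (assumed layout).
    """
    target_mean = (1 - alpha) * z_a.mean(axis=0) + alpha * z_b.mean(axis=0)
    target_std = (1 - alpha) * z_a.std(axis=0) + alpha * z_b.std(axis=0)
    z_norm = (z_interp - z_interp.mean(axis=0)) / (z_interp.std(axis=0) + eps)
    return z_norm * target_std + target_mean


rng = np.random.default_rng(0)
n_samples, d_in, n_units = 2000, 16, 8
X = rng.normal(size=(n_samples, d_in))
W_a = rng.normal(size=(d_in, n_units))  # stand-in for endpoint A
W_b = rng.normal(size=(d_in, n_units))  # stand-in for endpoint B

z_a, z_b = X @ W_a, X @ W_b
z_mid = X @ (0.5 * (W_a + W_b))  # midpoint of the linear interpolation

# Variance collapse: the midpoint's per-unit std falls below the endpoint average.
endpoint_std = 0.5 * (z_a.std(axis=0) + z_b.std(axis=0))
collapsed = bool(np.all(z_mid.std(axis=0) < endpoint_std))

z_repaired = repair_preactivations(z_mid, z_a, z_b, alpha=0.5)
```

In a deep network the correction would be applied at every layer, since the collapse compounds with depth; the sketch covers only one layer.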

Experiments

Main Results

| Model/Dataset | Sparsity | Naive vs. Permuted Gap |
|---|---|---|
| ResNet20/CIFAR-10 | 90% | Permuted is significantly better than naive |
| ResNet20/CIFAR-100 | 90% | Consistent advantage |
| VGG11/CIFAR-10 | 90% | Permuted is close to LTH |
| ResNet50/ImageNet | 95% | Permuted outperforms naive by about 2% |

Width Effect

Wider models allow more accurate permutation matching, so the gap between the permuted and LTH solutions narrows progressively as the width increases from 1 to 16.

Diversity Analysis (Table 1)

| Method | Test Accuracy | Ensemble Accuracy | Disagreement | KL | JS |
|---|---|---|---|---|---|
| LTH | 91.15% | 91.43% | 0.035 | 0.038 | 0.011 |
| Permuted | 89.38% | 91.75% | 0.107 | 0.273 | 0.091 |

Permuted models have lower individual test accuracy than LTH (89.38% vs. 91.15%) but markedly higher functional diversity (disagreement 0.107 vs. 0.035), which in turn yields slightly better ensemble accuracy (91.75% vs. 91.43%).
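The diversity metrics in Table 1 can be computed from the predicted class probabilities of two models. The sketch below uses common definitions (disagreement as the fraction of differing predicted labels, KL and Jensen-Shannon divergences averaged over examples); whether the paper uses exactly these definitions is an assumption, and the function names and toy numbers are illustrative only.

```python
import numpy as np


def disagreement(p, q):
    """Fraction of examples on which the two models predict different labels."""
    return float(np.mean(p.argmax(axis=1) != q.argmax(axis=1)))


def kl_divergence(p, q, eps=1e-12):
    """Mean KL(p || q) over examples; rows of p and q are class probabilities."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))


def js_divergence(p, q):
    """Mean Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def ensemble_accuracy(prob_list, labels):
    """Accuracy of the ensemble formed by averaging predicted probabilities."""
    mean_probs = np.mean(prob_list, axis=0)
    return float(np.mean(mean_probs.argmax(axis=1) == labels))


# Toy predictions of two models on three examples with two classes.
p = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
q = np.array([[0.8, 0.2], [0.7, 0.3], [0.6, 0.4]])
labels = np.array([0, 1, 0])

dis = disagreement(p, q)
kl = kl_divergence(p, q)
js = js_divergence(p, q)
ens = ensemble_accuracy([p, q], labels)
```

In the toy data the second model misclassifies the second example on its own, yet the averaged prediction is correct there, which illustrates how models that disagree can still combine into a stronger ensemble.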

Highlights & Insights

  • Novel insight: explains the non-transferability of LTH masks from the perspective of weight symmetry.
  • Corrects the conclusion of Paul et al. (2023) that LTH and dense solutions do not reside in the same basin: they are connected once variance collapse is accounted for.
  • The permuted models exhibit much higher functional diversity than LTH, which is beneficial for ensembling.
  • The method is simple, requiring only standard activation matching and mask permutation.

Limitations & Future Work

  • Requires training two dense models to obtain the permutation mapping, which increases computational cost.
  • Exact permutation matching across all layers is an NP-hard problem; the greedy, approximate solutions used in practice are not precise enough on ImageNet.
  • Current hardware cannot efficiently exploit unstructured sparsity.
  • Model pruning may introduce algorithmic bias (Hooker et al., 2020).

Rating

⭐⭐⭐⭐ Explains the limitations of LTH from a symmetry perspective with strong insight. The experiments comprehensively demonstrate the effectiveness and diversity advantages of mask alignment.