Random Initialization of Gated Sparse Adapters (RIGSA)¶

Conference: ICML 2025
arXiv: 2511.01794
Code: -
Area: Parameter-Efficient Fine-Tuning / Sparse Adaptation
Keywords: sparse fine-tuning, PEFT, lottery ticket hypothesis, catastrophic forgetting, LoRA

TL;DR¶

Proposes RIGSA, a sparse fine-tuning method that combines randomly initialized full-rank adapters, ReZero gating, and iterative magnitude pruning. It retains source-task performance better than QLoRA while still learning the new task.

Background & Motivation¶

Limitations of Prior Work¶

Fine-tuning LLMs on new tasks causes catastrophic forgetting: performance on previously learned (source) tasks degrades.

Background¶

LoRA achieves parameter-efficient fine-tuning via a low-rank constraint, but this constraint can degrade performance on complex tasks.

Key Challenge¶

Sparse fine-tuning offers an alternative that bypasses rank constraints, potentially enabling more expressive adaptation.

Core Idea¶

The lottery ticket hypothesis (LTH) suggests that dense networks contain sparse subnetworks that can match the performance of the full network.

Additional Notes¶

Existing sparse fine-tuning methods (such as LT-SFT) initialize the weight delta matrix \(\Delta W\) to zero, so they cannot benefit from "lucky" initializations.

Method¶

Core Idea¶

Learning \(W = W_0 + \alpha \Delta W\), where:
- \(W_0\): Frozen pre-trained weights
- \(\Delta W\): Randomly initialized full-rank weight delta matrix
- \(\alpha\): ReZero-style learnable gating parameter, initialized to \(10^{-6}\)

Key Designs: Initializing \(\alpha\) near zero makes training start at approximately the pre-trained weights \(W_0\). The random initialization of \(\Delta W\) lets the effective weights move away from \(W_0\) as \(\alpha\) grows, and weight decay pulls them back toward that region.
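A minimal sketch of the gated parameterization, in plain Python so it runs without dependencies. The Gaussian initialization, its scale of 0.02, and the helper names `init_delta` and `effective_weight` are assumptions for illustration; the description above only states that \(\Delta W\) is random and that \(\alpha\) starts at \(10^{-6}\).

```python
import random


def init_delta(rows, cols, scale=0.02, seed=0):
    """Random full-rank delta matrix (Gaussian init is an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def effective_weight(w0, delta, alpha):
    """Compute W = W0 + alpha * Delta W element-wise."""
    return [
        [w + alpha * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


# Toy frozen "pre-trained" weights (hypothetical values).
w0 = [[1.0, -2.0], [0.5, 3.0]]
delta = init_delta(2, 2)
alpha = 1e-6  # ReZero-style gate, initialized near zero

w = effective_weight(w0, delta, alpha)

# Largest deviation of the effective weights from W0 at initialization.
max_shift = max(
    abs(a - b) for row_w, row_w0 in zip(w, w0) for a, b in zip(row_w, row_w0)
)
print(max_shift)
```

With the gate at \(10^{-6}\), the effective weights are numerically almost identical to \(W_0\) at the start of training, even though \(\Delta W\) itself is not zero.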

Iterative Magnitude Pruning (IMP)¶

Train \(W_0 + \alpha \Delta W\) for one epoch
Pruning: Among parameters that have not changed sign, keep the top 80% with the largest magnitude, reset the rest to their initial values, and freeze them.
Repeat 5 times → final sparsity of approximately 3.46%
Train with the final sparse mask to obtain the winning ticket
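One pruning round from the steps above might look like the following sketch over a flat list of adapter values. The function name `prune_step` and the toy numbers are hypothetical, `keep_fraction` is left as an explicit argument, and the fraction is taken over the sign-preserving candidates, which is one reading of the description rather than a confirmed detail.

```python
def prune_step(current, initial, mask, keep_fraction):
    """One magnitude-pruning round with sign preservation.

    current: trained adapter values
    initial: values at random initialization
    mask:    True where a parameter is still trainable
    Returns (new_values, new_mask).
    """
    # Candidates: still trainable and same sign as at initialization.
    candidates = [
        i for i, trainable in enumerate(mask)
        if trainable and current[i] * initial[i] > 0
    ]
    n_keep = int(len(candidates) * keep_fraction)
    # Keep the candidates with the largest magnitude.
    ranked = sorted(candidates, key=lambda i: abs(current[i]), reverse=True)
    keep = set(ranked[:n_keep])
    # Everything else is reset to its initial value and frozen.
    new_values = [current[i] if i in keep else initial[i] for i in range(len(current))]
    new_mask = [i in keep for i in range(len(current))]
    return new_values, new_mask


initial = [0.5, -0.4, 0.3, -0.2, 0.1]
current = [0.9, 0.4, 0.05, -0.6, 0.2]  # index 1 flipped sign during training
mask = [True] * 5

new_values, new_mask = prune_step(current, initial, mask, keep_fraction=0.5)
print(new_values)
print(new_mask)
```

In this toy run the sign-flipped parameter is excluded before ranking, so it is reset and frozen regardless of its magnitude. Calling `prune_step` repeatedly, with a training epoch between calls, gives the iterative schedule.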

Differences from Other Methods¶

Method	Initialization	Rank Constraint	Pruning Strategy
LoRA/QLoRA	B initialized to zero	Low-rank	None
LT-SFT	\(\Delta W = 0\)	None	One-shot pruning
RoSA	Gradient accumulation	Joint low-rank + sparse	Gradient-based
RIGSA	Random + ReZero gating	None (full-rank → sparse)	IMP + sign preservation

Key Experimental Results¶

Target Task: Textual MNIST¶

Converts MNIST images into numeric text matrices (each pixel quantized to 0-9) as a pure-text image classification task:
- SmolLM2-1.7B-Instruct zero-shot accuracy is only about 10% (random-guess level)
- Effectively learned after fine-tuning

Main Results¶

Method	Textual MNIST	PIQA	HellaSwag	GSM8k
Baseline (un-finetuned)	~10%	75.4	51.7	43.7
RIGSA (step 1, dense)	99.05%	-	-	40.7↓
RIGSA (step 3, sparse)	98.37%	~75	~52	45.1↑
QLoRA (rank=16)	99.46%	~75	~51	14.18↓↓
Random Mask	~97%	-	-	~43
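The forgetting comparison can be read off the GSM8k column by subtracting the baseline score. The short calculation below uses only the exact values from the table (the approximate Random Mask entry is omitted); the variable names are illustrative.

```python
# GSM8k scores copied from the table above.
baseline_gsm8k = 43.7
gsm8k_after = {
    "RIGSA (step 1, dense)": 40.7,
    "RIGSA (step 3, sparse)": 45.1,
    "QLoRA (rank=16)": 14.18,
}

# Change relative to the un-finetuned baseline; negative means forgetting.
change = {
    name: round(score - baseline_gsm8k, 2)
    for name, score in gsm8k_after.items()
}

for name, delta in change.items():
    print(f"{name}: {delta:+.2f}")
```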

Key Findings¶

Target Task: QLoRA is slightly superior to RIGSA (99.46% vs 98.37%).
Forgetting: On GSM8k, sparse RIGSA shows essentially no forgetting (45.1 vs. a 43.7 baseline), whereas QLoRA at rank 16 drops sharply to 14.18.
Sparse fine-tuning (including random masking) systematically outperforms QLoRA in retaining source task performance.
High-rank QLoRA actually retains source task performance better than low-rank QLoRA, possibly because high-rank adaptation is more "natural".

Highlights & Insights¶

Innovatively applies LTH concepts to LLM adapters, employing ReZero gating to address the instability of random initialization.
Proposes Textual MNIST as a standardized OOD visual-textual task.
Reveals the systematic advantages of sparse fine-tuning in mitigating forgetting.
The method is simple and bypasses the need for complex mask selection strategies.

Textual MNIST Task Design¶

The proposed Textual MNIST serves as a uniquely valuable OOD evaluation benchmark:

Quantizes each pixel of a 28×28 grayscale image into 0-9, forming a pure-text representation.
SmolLM2 zero-shot/5-shot accuracy is around 10% (random-guess level).
The task distribution is completely different from pre-training, clearly measuring the ability to learn new capabilities.
Unlike BigBench's ASCII art approach, using single-character numeric encoding is more suitable for tokenizers.

This design effectively isolates the effect of transfer learning, making it an ideal testbed for evaluating adapter learning capabilities.
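The encoding can be sketched as below. The exact quantization rule is not given here, so uniform binning of 0-255 intensities into ten levels is an assumption, as are the helper names `quantize_pixel` and `image_to_text` and the one-row-per-line layout.

```python
def quantize_pixel(value, levels=10, max_value=255):
    """Map an intensity in [0, max_value] to a single digit 0..levels-1.

    Uniform binning is an assumption, not the paper's confirmed rule.
    """
    return min(levels - 1, value * levels // (max_value + 1))


def image_to_text(image):
    """Render a grayscale image (list of rows) as lines of digit characters."""
    return "\n".join(
        "".join(str(quantize_pixel(p)) for p in row) for row in image
    )


# Tiny 2x3 stand-in for a 28x28 MNIST image.
image = [
    [0, 128, 255],
    [25, 26, 230],
]
text = image_to_text(image)
print(text)
```

Each pixel becomes exactly one character, so a 28×28 image yields 28 lines of 28 digits.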

Limitations & Future Work¶

Experiments are restricted to a single 1.7B-parameter model (SmolLM2-1.7B-Instruct), leaving larger models unverified.
Pruning 80% in each step is aggressive, potentially leading to suboptimal performance.
Validation with only a single target task (Textual MNIST) is insufficient.
Lacks statistical significance analysis over multiple experimental runs.
Lacks direct comparison with stronger sparse fine-tuning baselines like RoSA.
The choice of high weight decay (1.0) lacks theoretical support.

Rating¶

⭐⭐⭐ — The idea is clear and interesting, combining ReZero gating with IMP in a novel way. However, the experimental scale and depth are insufficient to fully validate the advantages of the method.
