Alignment-Enhanced Integration of Connectivity and Spectral Sparsity in Dynamic Sparse Training of LLM
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=jZplmg7Ad9
Code: Provided in supplementary materials (repository not public)
Area: Model Compression / Parameter-Efficient Pre-training
Keywords: Dynamic Sparse Training, Low-rank Decomposition, Connectivity Sparsity, Spectral Sparsity, Cancellation Effect, Alignment Loss
TL;DR
This work is the first to integrate dynamic connectivity sparsity (CHTs) and dynamic low-rank spectral sparsity in a unified sparse pre-training framework. It finds that naively summing the two branches causes an output "cancellation" effect, and it introduces a simple alignment loss to synchronize them. The resulting CHTsL approaches dense performance on LLaMA-60M/130M while retaining only 10%–30% of the parameters.
Background & Motivation
Background: Training LLMs from scratch is extremely memory- and compute-intensive, which has motivated parameter-efficient sparse pre-training. The field has two branches: connectivity sparse training, which enforces sparsity on the weight matrix structure and is represented by Dynamic Sparse Training (DST) methods such as SET, RigL, MEST, CHT, and CHTs; and spectral sparse training, which constrains weight subspaces via low-rank decomposition, as in ReLoRA and GaLore, or CoLA, which stays low-rank throughout training and inference. Both have strengths: DST approaches dense performance with 10% of the parameters, while low-rank methods excel at capturing the global subspace.
Limitations of Prior Work: Almost no work combines these two branches. The only prior attempt, SLTrain, has two major flaws: (1) its sparse branch is static, acting only as a "supplement" to the low-rank component without exploiting dynamic connectivity; (2) it simply sums the sparse and low-rank outputs, with no mechanism for the branches to cooperate.
Key Challenge: The authors observe that when the sparse branch output \(S\) and the low-rank branch output \(L\) are trained together, they often point in opposite directions: one pushes a feature positively while the other pushes it negatively, neutralizing the net effect and wasting expressive capacity. This cancellation effect prevents the naive sum \(S+L\) from carrying information from both branches. It is most damaging in the Q and K matrices of attention, where dot products are highly sensitive to inconsistencies.
Goal: Establish a unified framework that truly merges dynamic connectivity sparsity and dynamic spectral sparsity, quantifying and mitigating the cancellation effect to enable synergy between branches.
Core Idea: Use an alignment loss to pull the outputs of the sparse and low-rank branches into the same direction, combined with activation stabilization for the low-rank branch, transforming "cancellation" into "complementary collaboration."
Method
Overall Architecture
The framework consists of three steps: first, use the OCR metric to identify and quantify the cancellation effect; second, use a training framework composed of alignment loss and activation adjustments to stabilize collaboration; finally, instantiate the framework as CHTsL using state-of-the-art CHTs and low-rank branches. Each layer output is the sum of the branches \(O^{(l)}=S^{(l)}+L^{(l)}\). During the forward pass, branches are calculated in parallel; during the backward pass, the alignment loss pulls them together.
flowchart LR
X[Input x] --> S["Dynamic Connectivity Sparse Branch<br/>CHTs: Connectivity Evolution"]
X --> L["Spectral Sparse Branch<br/>L = B·σ(A·x), σ=SiLU"]
S --> ADD["Element-wise Addition O = S + L"]
L --> ADD
S -.Alignment.-> AL["Alignment Loss<br/>‖S − L‖_F"]
L -.Alignment.-> AL
ADD --> OUT[Layer Output]
AL --> LOSS["Total Loss L = L_task + λ·L_align"]
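The forward pass in the diagram can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions, not the authors' implementation: the sparse branch is modeled as a dense weight matrix multiplied elementwise by a fixed binary mask (the rewiring that CHTs performs over training is omitted), and the helper names `sparse_branch`, `low_rank_branch`, and `hybrid_layer` are hypothetical.

```python
import math

def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sparse_branch(W, mask, x):
    """S = (W * mask) x, with the mask applied elementwise.
    Assumption: a DST method would update `mask` during training; here it is fixed."""
    masked = [[w * m for w, m in zip(w_row, m_row)]
              for w_row, m_row in zip(W, mask)]
    return matvec(masked, x)

def low_rank_branch(A, B, x):
    """L = B . SiLU(A x), with A of shape (r, d_in) and B of shape (d_out, r)."""
    hidden = [silu(h) for h in matvec(A, x)]
    return matvec(B, hidden)

def hybrid_layer(W, mask, A, B, x):
    """Return both branch outputs and their elementwise sum O = S + L."""
    S = sparse_branch(W, mask, x)
    L = low_rank_branch(A, B, x)
    O = [s + l for s, l in zip(S, L)]
    return S, L, O

# Toy sizes (hypothetical): d_in = 3, d_out = 2, rank r = 1.
x = [1.0, -2.0, 0.5]
W = [[0.4, 0.1, -0.3], [0.2, -0.5, 0.7]]
mask = [[1, 0, 0], [0, 0, 1]]   # keeps 2 of 6 connections
A = [[0.5, 0.0, 1.0]]
B = [[1.0], [-1.0]]
S, L, O = hybrid_layer(W, mask, A, B, x)
```

In this toy input the second output coordinate has a positive sparse contribution and a negative low-rank contribution, so the sum is smaller in magnitude than the low-rank term alone. That is the kind of per-element conflict the alignment path in the diagram is meant to reduce.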
Key Designs
1. OCR Metric: Quantifying "Cancellation" into an Observable Number—The authors translate intuition into a metric, defining the Overlap Cancellation Ratio: \(\mathrm{OCR}=\frac{\sum_i \min(|S_i|,|L_i|)\cdot\mathbb{1}\{S_iL_i<0\}}{\sum_i \min(|S_i|,|L_i|)+\varepsilon}\). The numerator calculates the volume of the neutralized overlapping signal where the two branches have opposite signs (\(S_iL_i<0\)), while the denominator represents the total overlapping signal. OCR ranges in \([0,1)\), where higher values indicate more severe cancellation. This metric allows the "conflict" between branches to be tracked per layer and training step.
2. Alignment Loss: Synchronizing Branches for Collaboration. Since cancellation stems from directional conflict, the direct remedy is to penalize the difference between branch outputs. For each layer the authors define \(L^{(l)}_{\text{align}}=\frac{1}{BN}\lVert S^{(l)}-L^{(l)}\rVert_F\), where \(B\) is the batch size (not the low-rank factor \(B^{(l)}\) used below) and \(N\) is the number of elements in the output. The alignment loss summed over layers is \(L_{\text{align}}=\sum_l L^{(l)}_{\text{align}}\). This Frobenius-norm penalty encourages the sparse and low-rank outputs to be consistent, reducing destructive interference. The goal is not to make them identical (which would collapse the model to a single branch) but to reduce conflict so each branch focuses on complementary aspects. The term is weighted by \(\lambda\) in the total objective.
3. Low-rank Activation Stabilization: Ensuring Reliable Output—Low-rank decomposition can become unstable or lose scale control under extreme sparsity. Borrowing from CoLA, the authors insert a mild non-linearity between the two low-rank factor matrices: \(L^{(l)}=B^{(l)}\,\sigma(A^{(l)}x)\), where \(\sigma\) is SiLU. The role of activation here is not to enhance expressiveness but to maintain reasonable scales and prevent numerical divergence, ensuring the low-rank branch works stably alongside the sparse branch.
4. CHTsL Instantiation and Unified Objective. The framework is implemented with CHTs (connectivity evolution inspired by the Cannistraci-Hebbian theory from brain connectomics) as the sparse branch and a stabilized low-rank branch. The total objective is \(L=L_{\text{task}}+\lambda L_{\text{align}}\), where \(\lambda\) sets the alignment strength: 0.5 for LLaMA-60M on OpenWebText and for LLaMA-130M, and 0.3 for LLaMA-60M on C4. Joint optimization under this objective stabilizes training and fosters collaboration between the branches.
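The OCR definition, the per-layer alignment loss, and the total objective from the list above can be written out directly. The sketch below is a plain-Python illustration, not the authors' code. It assumes that each branch output is a batch of equal-length vectors, that \(B\) is the number of vectors in the batch and \(N\) the length of each vector, and that the Frobenius norm is taken over the whole batch. The function names are hypothetical.

```python
import math

def ocr(S, L, eps=1e-8):
    """Overlap Cancellation Ratio for two flat output vectors.
    Numerator: overlap min(|S_i|, |L_i|) where the signs disagree.
    Denominator: total overlap plus eps."""
    overlap = [min(abs(s), abs(l)) for s, l in zip(S, L)]
    cancelled = sum(o for o, s, l in zip(overlap, S, L) if s * l < 0)
    return cancelled / (sum(overlap) + eps)

def align_loss(S_batch, L_batch):
    """(1 / (B * N)) * ||S - L||_F over a batch of output vectors.
    Assumption: B = batch size, N = elements per output vector."""
    B = len(S_batch)
    N = len(S_batch[0])
    sq = sum((s - l) ** 2
             for S, L in zip(S_batch, L_batch)
             for s, l in zip(S, L))
    return math.sqrt(sq) / (B * N)

def total_loss(task_loss, per_layer_align, lam):
    """L = L_task + lambda * sum over layers of the alignment loss."""
    return task_loss + lam * sum(per_layer_align)

# Toy values (made up for illustration only).
S = [1.0, -2.0, 0.5, 3.0]
L = [0.5, 1.0, -0.5, 1.0]
ratio = ocr(S, L)                # two of four coordinates conflict in sign
layer_loss = align_loss([S], [L])
objective = total_loss(2.0, [layer_loss], lam=0.5)
```

In the toy vectors the overlap is 3.0 in total and 1.5 of it sits on sign-conflicting coordinates, so the ratio is about 0.5. Two outputs that agree in sign everywhere give a ratio of 0 even when their magnitudes differ, which matches the point that alignment does not require the branches to be identical.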
Key Experimental Results
Models: LLaMA-60M / 130M; Data: OpenWebText, C4; Budgets: Retain 10%/20%/30% of dense parameters (total sparsity 0.9/0.8/0.7). Sparsity is defined as \(s=1-\#\text{params}/\#\text{params}_{\text{dense}}\). For hybrid methods, \(s_{\text{total}}=1-d_{\text{conn}}-d_{\text{spec}}\), where \(d_{\text{conn}}\) and \(d_{\text{spec}}\) are the fractions of dense parameters kept by the connectivity and spectral branches, so total trainable parameters stay consistent across methods.
Main Results (Validation PPL↓, Excerpt)
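To make the budget formula concrete, the sketch below splits a total sparsity between the two branches and converts the spectral share into a rank. It relies on the standard parameter count of a rank-\(r\) factorization of a \(d_{\text{out}}\times d_{\text{in}}\) matrix, \(r(d_{\text{in}}+d_{\text{out}})\). The split fraction, the 512-wide layer, and the function names are hypothetical and are not taken from the paper.

```python
def split_budget(total_sparsity, conn_fraction):
    """Split the retained density (1 - s_total) between the two branches.
    conn_fraction is the share given to the connectivity branch (hypothetical knob)."""
    density = 1.0 - total_sparsity
    d_conn = density * conn_fraction
    d_spec = density - d_conn
    return d_conn, d_spec

def rank_for_density(d_spec, d_in, d_out):
    """Largest rank r such that r * (d_in + d_out) <= d_spec * d_in * d_out.
    Biases and any other per-layer parameters are ignored in this sketch."""
    return int(d_spec * d_in * d_out // (d_in + d_out))

# Example: total sparsity 0.7, budget split evenly, one 512 x 512 layer.
d_conn, d_spec = split_budget(0.7, conn_fraction=0.5)
rank = rank_for_density(d_spec, d_in=512, d_out=512)
sparse_weights_kept = int(d_conn * 512 * 512)
```

With these made-up numbers each branch gets a density of about 0.15, which corresponds to a rank of 38 for the low-rank factors and roughly 39k retained weights in the sparse branch.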
| Dataset | Method | 60M s=0.9 | 60M s=0.8 | 60M s=0.7 | 130M s=0.9 | 130M s=0.8 | 130M s=0.7 |
|---|---|---|---|---|---|---|---|
| OpenWebText | Dense | 26.56 | — | — | 19.46 | — | — |
| OpenWebText | CHTs | 33.03 | 29.84 | 28.12 | 24.75 | 22.67 | 21.48 |
| OpenWebText | CoLA | 37.58 | 30.87 | 28.53 | 27.07 | 23.24 | 21.61 |
| OpenWebText | SLTrain | 33.90 | 29.83 | 27.86 | 25.33 | 22.81 | 21.25 |
| OpenWebText | CHTsL | 31.77 | 29.11 | 27.40 | 24.07 | 21.87 | 20.65 |
| C4 | Dense | 33.21 | — | — | 24.55 | — | — |
| C4 | CHTs | 40.62 | 37.55 | 35.23 | 31.00 | 28.69 | 27.46 |
| C4 | SLTrain | 41.05 | 37.00 | 34.89 | 31.38 | 28.28 | 26.78 |
| C4 | CHTsL | 39.29 | 35.95 | 34.19 | 30.03 | 27.59 | 26.19 |
CHTsL is the best sparse method in every model/data/sparsity combination shown and comes closest to dense performance (e.g., 20.65 vs. 19.46 for 130M/OpenWebText/s=0.7).
Ablation Study (Comparison of Integration Strategies, PPL↓)
| Model/Data | Total Sparsity | Naive (Sum) | Act (+Activation) | Act+Align (+Alignment) |
|---|---|---|---|---|
| 60M/OpenWebText | 0.9 | 32.64 | 32.21 | 31.77 |
| 60M/C4 | 0.9 | 189.55 | 39.66 | 39.29 |
| 60M/C4 | 0.7 | 591.42 | 34.55 | 34.33 |
| 130M/OpenWebText | 0.9 | 119.35 | 24.45 | 24.07 |
| 130M/C4 | 0.7 | 920.16 | 26.55 | 26.19 |
Wilcoxon signed-rank test: Act+Align vs. Naive and Act+Align vs. Act both give \(p=0.00049\), well below 0.05. Naive summation collapses catastrophically in several settings (PPL in the hundreds). Activation restores stability, and alignment improves results further.
Key Findings
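The reported p-value can be sanity-checked with an exact two-sided Wilcoxon signed-rank computation. The sketch below assumes, as an inference from the setup rather than something stated in the note, that the test was run over 12 paired configurations (2 models × 2 datasets × 3 sparsity levels) and that one method won in every pair. The differences are placeholders, not the paper's numbers, and the function assumes no ties and no zero differences.

```python
def exact_wilcoxon_p(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value.
    Assumes all |diffs| are distinct and non-zero."""
    n = len(diffs)
    # Rank the pairs by absolute difference (rank 1 = smallest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    max_sum = n * (n + 1) // 2
    w = min(w_plus, max_sum - w_plus)
    # counts[t] = number of subsets of ranks {1..n} whose sum is t.
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for t in range(max_sum, r - 1, -1):
            counts[t] += counts[t - r]
    tail = sum(counts[: w + 1])
    return min(1.0, 2.0 * tail / 2 ** n)

# Placeholder differences: 12 pairs, all favoring the same method.
placeholder_diffs = [-(i + 1) * 0.1 for i in range(12)]
p_value = exact_wilcoxon_p(placeholder_diffs)   # equals 2 / 2**12
```

Under these assumptions the p-value is 2/4096 ≈ 0.00049, which is the smallest two-sided value an exact test on 12 pairs can produce. That agrees with the figure quoted above, but it does not confirm how the authors actually ran the test.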
- Cancellation primarily occurs in Q and K: OCR layer-wise curves show alignment loss significantly reduces OCR in Query/Key layers, whereas V/O and FFN are more tolerant due to residual connections. Q and K determine attention weights, making them most sensitive to inconsistencies.
- Sparsity Configuration Sensitivity: At a fixed total sparsity of 0.7, OpenWebText (homogeneous corpus) favors a higher proportion of connectivity sparsity. In contrast, C4 (diverse corpus) benefits from a higher low-rank proportion to adapt to varied linguistic patterns across the weight matrix.
Highlights & Insights
- Translating Vague Intuition into Measurable Metrics: OCR is not just a diagnostic tool; it turns "branch conflict" into an observable, optimizable target.
- Compelling Evidence of Collapse under Extreme Sparsity: The massive PPL spikes (189/591/920) for the Naive method demonstrate that simple addition is "unusable" rather than just "suboptimal," highlighting the necessity of alignment/activation.
- Self-consistent Explanation for Q/K Sensitivity: Explaining the benefit through the sensitivity of attention dot products, supported by OCR curves, provides a complete causal chain.
- Unifying Disconnected Sparsity Paradigms: Effectively making dynamic connectivity and spectral sparsity collaborate "dynamically" for the first time.
Limitations & Future Work
- Scale Constraints: Validated only on LLaMA-60M and LLaMA-130M, and only via perplexity. Testing on larger models (1B+) or downstream tasks is needed to confirm scalability.
- Heuristic Alignment Loss: Directly penalizing \(\lVert S-L\rVert_F\) lacks theoretical grounding for the "optimal degree of alignment," and \(\lambda\) requires per-setting searches (0.3–0.5).
- Search Cost for Sparse Configuration: The allocation between sparse and low-rank parameters requires a systematic grid search, which increases deployment costs.
- Computational Overhead: The additional FLOPs/memory cost of layer-wise alignment loss and activations was not quantitatively analyzed.
Related Work & Insights
- Dynamic Connectivity Sparsity: Evolution from SET (random reconnect) → RigL (gradient-based) → MEST (weight+gradient) → CHT/CHTs (SOTA). CHTs is used as the sparse branch here.
- Spectral Sparsity/Low-rank: Evolution from LoRA (fine-tuning) → ReLoRA/GaLore (pre-training with dense forward pass) → CoLA (full low-rank training/inference). Low-rank and activation designs here draw from CoLA.
- Insight: "Output directional conflict" may be a hidden performance killer when combining complementary modules via simple addition. Defining a cancellation metric and consistency regularization is a generic strategy for shifting modules from competition to collaboration, applicable to MoE or ensemble scenarios.
Rating
- Novelty: ⭐⭐⭐⭐ — First dynamic fusion of connectivity and spectral sparsity; the combination of OCR and alignment loss is elegant and addresses the pain points of SLTrain.
- Experimental Thoroughness: ⭐⭐⭐ — Solid evidence across two models, two datasets, and three sparsity levels, plus statistical tests; however, lacks large-scale and downstream task validation.
- Writing Quality: ⭐⭐⭐⭐ — Clear logic from problem to diagnosis to method; self-consistent explanations for OCR and Q/K behavior.
- Value: ⭐⭐⭐⭐ — Provides a practical recipe for "how to merge the two paradigms" in parameter-efficient sparse pre-training.