Hankel Singular Value Regularization for Highly Compressible State Space Models¶
Conference: NeurIPS 2025 · arXiv: 2510.22951 · Code: GitHub · Area: Model Compression / State Space Models · Keywords: SSM compression, Hankel singular values, balanced truncation, regularization, Long Range Arena
TL;DR¶
By regularizing the Hankel singular value nuclear norm of SSM layers during training to encourage rapid decay, the trained model can be compressed to 10% of its original order via balanced truncation with negligible accuracy loss. A block-diagonal rotation matrix parameterization reduces Gramian computation from \(\mathcal{O}(n^3)\) to \(\mathcal{O}(n^2)\).
Background & Motivation¶
SSMs (e.g., S4/S5/Mamba) excel on long-sequence tasks, but their inference cost scales with the state dimension \(n\). Compression means reducing \(n\), and how far \(n\) can be reduced without losing accuracy is governed by the decay rate of the Hankel singular values.
Core systems-theoretic perspective: The sequence-to-sequence mapping of a linear time-invariant system is characterized by its Hankel operator, whose singular values \(\sigma_1 \ge \cdots \ge \sigma_n\) determine compressibility. The error of balanced truncation from order \(n\) to order \(r\) satisfies the classical bound \(\| G - G_r \|_{\mathcal{H}_\infty} \le 2 \sum_{i=r+1}^{n} \sigma_i\), so the faster the \(\sigma_i\) decay, the cheaper truncation becomes.
Problem: Hankel singular values of standardly trained SSMs decay slowly, causing sharp accuracy degradation upon direct truncation. Prior work (Ezoe et al.) requires retraining after truncation.
Method¶
Block-Diagonal Rotation Matrix Parameterization¶
The matrix \(\mathbf{A}\) is parameterized as a block diagonal of scaled \(2 \times 2\) rotation matrices, \(\mathbf{A} = \operatorname{blkdiag}(\mathbf{A}_1, \ldots, \mathbf{A}_q)\) with \(\mathbf{A}_i = \rho_i \begin{pmatrix} \cos\theta_i & -\sin\theta_i \\ \sin\theta_i & \cos\theta_i \end{pmatrix}\) (a construction sketch follows the list below):
- Stability guarantee: \(\rho_i\) is constrained to \([0, 1)\) via \(\tanh\)
- Universality: Proposition 1 proves this parameterization is dense in SSM space
- Parameter count: Linear in \(n\), identical to the diagonal parameterization of S5
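A minimal JAX sketch of one way to realize this parameterization; the function name `make_A`, the use of \(|\tanh|\) for the magnitude constraint, and the array shapes are illustrative assumptions rather than the paper's code.

```python
import jax.numpy as jnp
from jax.scipy.linalg import block_diag

def make_A(nu, theta):
    """Build the block-diagonal A from unconstrained parameters nu, theta of shape (q,).
    rho = |tanh(nu)| lies in [0, 1), so every 2x2 block is a contraction and the
    discrete-time recurrence is stable."""
    rho = jnp.abs(jnp.tanh(nu))
    c, s = jnp.cos(theta), jnp.sin(theta)
    blocks = rho[:, None, None] * jnp.stack(
        [jnp.stack([c, -s], axis=-1),
         jnp.stack([s,  c], axis=-1)], axis=-2)   # (q, 2, 2) scaled rotations
    return block_diag(*blocks)                    # (n, n) with n = 2q
```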
Efficient Gramian Computation¶
Hankel singular values \(\sigma_i = \sqrt{\lambda_i(\mathbf{P}\mathbf{Q})}\) require the controllability and observability Gramians \(\mathbf{P}\), \(\mathbf{Q}\), obtained by solving the (discrete-time) Lyapunov equations \(\mathbf{A}\mathbf{P}\mathbf{A}^\top - \mathbf{P} + \mathbf{B}\mathbf{B}^\top = \mathbf{0}\) and \(\mathbf{A}^\top\mathbf{Q}\mathbf{A} - \mathbf{Q} + \mathbf{C}^\top\mathbf{C} = \mathbf{0}\).
The standard Bartels–Stewart algorithm costs \(\mathcal{O}(n^3)\). Exploiting the block-diagonal structure, the equations decompose into \(q\) independent \(2 \times 2\) Lyapunov equations and \(q(q-1)/2\) independent \(2 \times 2\) Sylvester equations (\(q = n/2\)), each solvable in closed form and in parallel.
Total complexity is \(\mathcal{O}(q^2) = \mathcal{O}(n^2)\), independent of sequence length \(n_s\).
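A sketch of the blockwise Gramian computation, assuming the discrete-time (Stein) form of the Lyapunov equation, which is consistent with the \(\rho_i \in [0, 1)\) stability constraint. For simplicity it solves all \(q^2\) small systems rather than exploiting the symmetry \(\mathbf{P}_{ji} = \mathbf{P}_{ij}^\top\) (which would halve the work); function names and shapes are assumptions.

```python
import jax.numpy as jnp
from jax import vmap

def controllability_gramian(A_blocks, B_blocks):
    """Controllability Gramian P of a discrete-time system with block-diagonal A.
    A_blocks: (q, 2, 2) scaled-rotation blocks; B_blocks: (q, 2, m).
    The (i, j) 2x2 block of P solves A_i P_ij A_j^T - P_ij + B_i B_j^T = 0,
    which vectorizes (column-major) into the 4x4 linear system
    (A_j kron A_i - I) vec(P_ij) = -vec(B_i B_j^T)."""
    q = A_blocks.shape[0]
    I4 = jnp.eye(4)

    def solve_block(Ai, Aj, BiBjT):
        M = jnp.kron(Aj, Ai) - I4
        x = jnp.linalg.solve(M, -BiBjT.T.reshape(-1))  # column-major vec of -B_i B_j^T
        return x.reshape(2, 2).T                       # undo the vectorization

    BBt = jnp.einsum("iam,jbm->ijab", B_blocks, B_blocks)           # all B_i B_j^T
    P_blocks = vmap(vmap(solve_block, in_axes=(None, 0, 0)),
                    in_axes=(0, None, 0))(A_blocks, A_blocks, BBt)  # (q, q, 2, 2)
    n = 2 * q
    return P_blocks.transpose(0, 2, 1, 3).reshape(n, n)             # assemble full P
```

The observability Gramian \(\mathbf{Q}\) follows the same pattern with \(\mathbf{A}_i^\top, \mathbf{A}_j^\top\) and \(\mathbf{C}_i^\top \mathbf{C}_j\) in place of \(\mathbf{A}_i, \mathbf{A}_j\) and \(\mathbf{B}_i \mathbf{B}_j^\top\).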
Differentiable Regularizer¶
Although individual Hankel singular values are non-differentiable with respect to the system matrices (they can cross as parameters change), Proposition 2 proves that their sum is smooth. The nuclear-norm regularizer is therefore taken as the sum of Hankel singular values, \(\mathcal{R} = \sum_{i=1}^{n} \sigma_i\), per layer.
Added to the training loss, this term pushes the Hankel spectrum's energy into the leading components, i.e., it promotes rapid decay.
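One differentiable way to evaluate this regularizer from the Gramians uses the standard identity that the Hankel singular values equal the singular values of \(\mathbf{L}_Q^\top \mathbf{L}_P\), where \(\mathbf{P} = \mathbf{L}_P \mathbf{L}_P^\top\) and \(\mathbf{Q} = \mathbf{L}_Q \mathbf{L}_Q^\top\) are Cholesky factorizations. The sketch below (including the `eps` jitter) is an assumption about the implementation, not the paper's code.

```python
import jax.numpy as jnp

def hsv_nuclear_norm(P, Q, eps=1e-8):
    """Sum of Hankel singular values sigma_i = sqrt(lambda_i(P Q)).
    With P = Lp Lp^T and Q = Lq Lq^T, the sigma_i are exactly the ordinary
    singular values of Lq^T Lp, so the sum is differentiable via the SVD."""
    n = P.shape[0]
    Lp = jnp.linalg.cholesky(P + eps * jnp.eye(n))  # eps: numerical jitter only
    Lq = jnp.linalg.cholesky(Q + eps * jnp.eye(n))
    return jnp.sum(jnp.linalg.svd(Lq.T @ Lp, compute_uv=False))
```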
Associative Scan Acceleration¶
Using the property \(\mathbf{A}(\mathbf{x}, \boldsymbol{\beta}) \mathbf{A}(\mathbf{y}, \boldsymbol{\gamma}) = \mathbf{A}(\mathbf{x} \odot \mathbf{y}, \boldsymbol{\beta} + \boldsymbol{\gamma})\), a new associative binary operation is defined that avoids explicit block matrix products, enabling the same \(\mathcal{O}(\log(n_s) \cdot n)\) parallel scan as diagonal S5.
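A JAX sketch of how such a combine rule can be plugged into `jax.lax.associative_scan`, assuming the state is stored as \(q\) two-dimensional sub-states and each scan element carries the per-block magnitudes and angles together with the accumulated input term \(\mathbf{B}u_t\); all names are illustrative.

```python
import jax
import jax.numpy as jnp

def rotate_apply(rho, theta, v):
    """Apply the block-diagonal scaled-rotation matrix given by (rho, theta)
    to a state v of shape (..., q, 2) without forming the matrix."""
    c, s = jnp.cos(theta), jnp.sin(theta)
    v1, v2 = v[..., 0], v[..., 1]
    rotated = jnp.stack([c * v1 - s * v2, s * v1 + c * v2], axis=-1)
    return rho[..., None] * rotated

def binary_op(elem_i, elem_j):
    """Associative combine: each element is ((rho, theta), b), the product of
    A-factors accumulated so far plus the accumulated input contribution."""
    (rho_i, th_i), b_i = elem_i
    (rho_j, th_j), b_j = elem_j
    # A(rho_j, th_j) @ A(rho_i, th_i) = A(rho_j * rho_i, th_i + th_j)
    rho, th = rho_j * rho_i, th_i + th_j
    # A_j @ b_i + b_j, again without explicit block matrices
    b = rotate_apply(rho_j, th_j, b_i) + b_j
    return (rho, th), b

def run_scan(rho, theta, Bu):
    """Compute x_t = A x_{t-1} + B u_t for t = 1..L in parallel.
    rho, theta: (q,) per-block magnitude/angle; Bu: (L, q, 2) input terms."""
    L = Bu.shape[0]
    elems = ((jnp.broadcast_to(rho, (L,) + rho.shape),
              jnp.broadcast_to(theta, (L,) + theta.shape)), Bu)
    (_, _), xs = jax.lax.associative_scan(binary_op, elems)
    return xs  # (L, q, 2) hidden states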
Post-Training Compression¶
- Solve for the final Gramians \(\mathbf{P}\), \(\mathbf{Q}\) of each trained layer
- Apply balanced truncation to obtain a reduced system of order \(r\) (see the sketch after this list)
- Layer-adaptive rank allocation: given a total budget \(r_t\), bisection search allocates per-layer ranks \(r^{(\ell)}\) to preserve equal proportions of energy across layers
- Diagonalize the reduced system to restore efficient scanning
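For reference, a sketch of the standard square-root form of balanced truncation applied to one trained layer; whether the paper uses exactly this variant, and the `eps` regularization of the Cholesky factors, are assumptions.

```python
import jax.numpy as jnp

def balanced_truncation(A, B, C, P, Q, r, eps=1e-8):
    """Square-root balanced truncation of (A, B, C) to order r.
    P, Q are the controllability / observability Gramians of the layer."""
    n = A.shape[0]
    Lp = jnp.linalg.cholesky(P + eps * jnp.eye(n))
    Lq = jnp.linalg.cholesky(Q + eps * jnp.eye(n))
    U, s, Vt = jnp.linalg.svd(Lq.T @ Lp)          # s = Hankel singular values
    S_inv_half = jnp.diag(s[:r] ** -0.5)
    T  = Lp @ Vt[:r].T @ S_inv_half               # right projection, (n, r)
    Ti = S_inv_half @ U[:, :r].T @ Lq.T           # left projection, (r, n)
    return Ti @ A @ T, Ti @ B, C @ T              # reduced (A_r, B_r, C_r)
```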
Key Experimental Results¶
Accuracy Under Various Truncation Ratios (LRA tasks and sMNIST)¶
| Method | sCIFAR 60% | 70% | 80% | 90% | sMNIST 60% | 70% | 80% | 90% |
|---|---|---|---|---|---|---|---|---|
| LAST | 62.93 | 36.66 | 17.35 | 11.19 | 95.11 | 89.17 | 62.37 | 27.67 |
| global | 28.91 | 13.62 | 11.12 | 10.47 | 91.67 | 83.32 | 52.52 | 21.94 |
| no reg. | 71.28 | 41.98 | 21.14 | 9.84 | 91.32 | 13.35 | 11.05 | 10.55 |
| HSVR | 81.84 | 81.75 | 81.37 | 51.08 | 99.45 | 99.22 | 98.90 | 86.95 |
| Method | IMDB 60% | 70% | 80% | 90% | PATH-X 60% | 70% | 80% |
|---|---|---|---|---|---|---|---|
| LAST | 88.48 | 85.05 | 80.26 | 57.08 | 50.33 | 49.16 | 49.53 |
| no reg. | 71.45 | 71.04 | 51.32 | 50.00 | 56.09 | 50.39 | 50.16 |
| HSVR | 87.26 | 87.16 | 86.97 | 86.40 | 87.74 | 82.82 | 54.02 |
Key Findings¶
- sMNIST at 80% truncation: HSVR retains 98.90% accuracy, while no-regularization collapses to 11.05% (near random)
- IMDB at 90% truncation: HSVR loses only ~1% (86.40 vs. 87.26), while no-regularization drops to 50%
- PATH-X at 60% truncation: HSVR achieves 87.74%, while all baselines degrade to chance level (~50%)
- Layer-adaptive rank allocation outperforms uniform allocation: the number of important states varies substantially across layers
Highlights & Insights¶
- Elegant bridge between systems theory and deep learning: Compressibility of neural SSM layers is characterized through classical Hankel singular value analysis
- Non-differentiable made differentiable: The sum of Hankel singular values is shown to remain smooth even at points where individual singular values cross and are non-differentiable, an analytically elegant argument that justifies the nuclear-norm regularizer
- Training as compression preparation: A single regularized training run enables post-hoc standard balanced truncation without retraining
- \(\mathcal{O}(n^2)\) Gramian algorithm: A practical efficiency improvement that makes regularization feasible during training
Limitations & Future Work¶
- Not applicable to pretrained models: Regularized training from scratch is required; post-hoc compression of already-trained SSMs is not supported
- Limited to linear time-invariant systems: Time-varying/input-dependent SSMs such as Mamba are not covered
- Only unidirectional scanning is considered; results may differ for bidirectional SSMs (the standard configuration for sCIFAR/IMDB)
- The nuclear norm regularization strength requires manual tuning and exhibits high task-specific sensitivity
- Post-compression diagonalization is numerically unstable when eigenvalues are near-degenerate
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Introduces classical systems theory into SSM compression with a distinctive perspective
- Technical Depth: ⭐⭐⭐⭐⭐ — Contributions span parameterization design, Gramian algorithms, differentiability proofs, and associative scanning
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive coverage of all 5 LRA benchmarks, but lacks practical NLP tasks such as language modeling
- Practicality: ⭐⭐⭐ — The requirement for training from scratch limits real-world applicability
- Overall: ⭐⭐⭐⭐