Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models
Conference: NeurIPS 2025 arXiv: 2509.22284 Code: https://github.com/IBM/expressive-sparse-state-space-model Area: Sequence Modeling
TL;DR
This paper proposes PD-SSM, a structured sparse parameterization for the state transition matrix of state-space models (SSMs). The core idea is to factorize the transition matrix as a product of a column-wise one-hot matrix P and a complex diagonal matrix D (i.e., \(A = PD\)), achieving expressiveness equivalent to unstructured (dense) SSMs while retaining computational efficiency comparable to diagonal SSMs at \(\Theta(LN)\). A single layer suffices to simulate any \(N\)-state finite-state automaton (FSA). The paper provides theoretical guarantees on BIBO stability and optimal state dimensionality, with strong empirical results on FSA simulation, multivariate time-series classification, long-sequence benchmarks, and natural-language state-tracking tasks.
Background & Motivation
- Limited expressiveness of diagonal SSMs: Efficient SSMs such as Mamba employ diagonal transition matrices, which are provably incapable of simulating non-solvable automata, leading to poor performance on algorithmic state-tracking tasks.
- Prohibitive cost of dense matrices: Unstructured transition matrices (as in SD-SSM) can simulate arbitrary FSAs, but their parallel-scan complexity is \(\Theta(LN^3)\), making large-scale training infeasible.
- Insufficiently compact semi-structured methods: DPLR (diagonal-plus-low-rank) approaches enhance expressiveness but require either many stacked layers or exponentially large linear readout layers to simulate complex automata.
- Fundamental tension between efficiency and expressiveness: This is the central challenge in SSM research—how to improve state-tracking capacity without sacrificing computational efficiency.
- Theoretical limitations of Transformers on FSA tasks: Transformers face theoretical bottlenecks in automaton state tracking, motivating hybrid architectures with more expressive SSM components.
- Demand for hybrid Transformer–SSM architectures: Large-scale LLMs increasingly adopt hybrid designs, requiring SSM layers that are both efficient and expressive.
Starting Point
Goal: parameterize an input-dependent transition matrix that is as expressive as an unstructured (dense) one while keeping the parallel-scan cost of a diagonal one. The proposed factorization \(A(u_t) = P(u_t)D(u_t)\), detailed next, is designed to hit exactly this trade-off.
Method
PD Parameterization
The transition matrix is factorized as \(A(u_t) = P(u_t) D(u_t)\):
- P matrix (column-wise one-hot binary matrix): A learnable dictionary \(\{M_k\}_{k=1}^K\) is combined via softmax-derived mixing weights \(s(u_t)\) computed from the input \(u_t\); the weighted sum is then projected column-wise via hardmax to obtain a sparse P.
- D matrix (complex diagonal matrix): Two separate feed-forward networks generate the magnitude (constrained to \((0,1)\) via sigmoid) and the phase (mapped to \((0, 2\pi)\)), ensuring BIBO stability. Both generation steps are sketched in the code below.
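A minimal PyTorch sketch of this generation step, assuming input dimension `d_input`, state dimension `n_state`, and dictionary size `k_dict`; the module and layer names are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class PDTransition(nn.Module):
    def __init__(self, d_input: int, n_state: int, k_dict: int):
        super().__init__()
        # Learnable dictionary {M_k}_{k=1..K} of candidate N x N matrices.
        self.dictionary = nn.Parameter(torch.randn(k_dict, n_state, n_state))
        self.to_scores = nn.Linear(d_input, k_dict)      # mixing weights s(u_t)
        self.to_magnitude = nn.Linear(d_input, n_state)  # |D_ii| in (0, 1)
        self.to_phase = nn.Linear(d_input, n_state)      # arg(D_ii) in (0, 2*pi)

    def forward(self, u_t: torch.Tensor):
        # Soft selection: softmax-weighted sum of dictionary matrices.
        s = torch.softmax(self.to_scores(u_t), dim=-1)           # (B, K)
        mixed = torch.einsum("bk,kij->bij", s, self.dictionary)  # (B, N, N)
        # Column-wise hardmax: one row index per column of the sparse P.
        p_idx = mixed.argmax(dim=-2)                             # (B, N)
        # Complex diagonal D; sigmoid keeps all moduli < 1 (BIBO stability).
        mag = torch.sigmoid(self.to_magnitude(u_t))
        phase = 2 * torch.pi * torch.sigmoid(self.to_phase(u_t))
        d = torch.polar(mag, phase)                              # (B, N) complex
        return p_idx, d  # P stored as column indices, D as its diagonal
```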
Key Algebraic Properties
- The set of PD matrices is closed under matrix multiplication (forming a monoid), so any chain product of such matrices remains a PD matrix.
- The product of two PD matrices can be computed in \(\Theta(N)\) operations via gather–scatter indexing (see the sketch after this list), enabling efficient parallel scans.
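For instance, storing P as an index vector p (where p[j] is the row of the 1 in column j) and D as its diagonal d, the product reduces to two gathers; `pd_compose` is a hypothetical helper name:

```python
import torch

def pd_compose(p2, d2, p1, d1):
    """Return (p, d) with P D = (P2 D2) @ (P1 D1).

    Entrywise, (P2 D2 P1 D1)[i, j] = d2[p1[j]] * d1[j] if i == p2[p1[j]],
    else 0 -- so the product is again a PD matrix, computed with two
    gathers in Theta(N) instead of an N x N matmul.
    """
    p = p2.gather(-1, p1)        # p[j] = p2[p1[j]]
    d = d2.gather(-1, p1) * d1   # d[j] = d2[p1[j]] * d1[j]
    return p, d
```

Because this binary operator is associative, chains of transition matrices can be combined by a standard parallel scan, which is where the overall \(\Theta(LN)\) work comes from.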
Gradient Approximation
Since the gradient of hardmax is almost everywhere zero, a straight-through estimator is employed: softmax serves as the surrogate gradient during backpropagation while hardmax is retained in the forward pass to preserve sparsity.
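In PyTorch this is the standard one-line trick (a sketch; the temperature `tau` is an illustrative knob, not from the paper):

```python
import torch

def hardmax_ste(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    soft = torch.softmax(logits / tau, dim=-2)            # per-column softmax
    idx = soft.argmax(dim=-2, keepdim=True)
    hard = torch.zeros_like(soft).scatter_(-2, idx, 1.0)  # one-hot columns
    # Forward value equals `hard`; gradient equals that of `soft`.
    return hard + soft - soft.detach()
```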
Theoretical Guarantees
- BIBO Stability (Proposition 1): The output stays bounded for any bounded input whenever the moduli of the diagonal entries of D are less than 1.
- Universal Expressiveness (Proposition 2): A single-layer PD-SSM with an \(N\)-dimensional state and an \(N \times N\) linear readout can exactly represent any \(N\)-state FSA (a worked example follows this list).
- Optimality (Proposition 3): Any \(N\)-state FSA requires at least \(N-1\) state dimensions; PD-SSM approaches this lower bound.
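A toy instance of Proposition 2 (my construction, not the paper's): the 2-state parity FSA as a PD-SSM with one-hot states, D fixed to the identity, and P(u) equal to the identity (bit 0) or the swap permutation (bit 1).

```python
import torch

# p_by_symbol[u][j] = row index of the 1 in column j of P(u).
p_by_symbol = {0: torch.tensor([0, 1]), 1: torch.tensor([1, 0])}

def run_parity(bits):
    state = torch.tensor([1.0, 0.0])           # one-hot start state "even"
    for u in bits:
        p = p_by_symbol[u]
        new_state = torch.zeros_like(state)
        new_state[p] = state                   # x_t = P(u_t) x_{t-1}, since D = I
        state = new_state
    return int(state.argmax())                 # linear readout: 0 = even, 1 = odd

assert run_parity([1, 0, 1, 1]) == 1           # three 1s -> odd parity
```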
Key Experimental Results
Main Results
Table 1: Length Generalization Accuracy on FSA Simulation Tasks (%)
| Model | Cycle Nav. | Even Pairs | Mod Arith. | Parity | Avg. |
|---|---|---|---|---|---|
| Transformer | 24.4% | 90.4% | 23.6% | 52.2% | 47.7% |
| Mamba | 48.4% | 100.0% | 33.1% | 54.2% | 58.9% |
| Gated DeltaProduct | 46.3% | 100.0% | 78.4% | 98.0% | 80.7% |
| BD-SLiCE | 99.8% | 85.9% | 54.0% | 95.3% | 83.8% |
| D-DE-SLiCE | 73.3% | 84.8% | 98.4% | 83.8% | 85.1% |
| PD-SSM | 99.5% | 99.7% | 96.2% | 99.9% | 98.8% |
Table 2: UEA Multivariate Time-Series Classification Accuracy (%)
| Model | Worms | SCP1 | SCP2 | Ethanol | Heartbeat | Motor | Avg. |
|---|---|---|---|---|---|---|---|
| Log-NCDE | 85.6% | 83.1% | 53.7% | 34.4% | 75.2% | 53.7% | 64.3% |
| LRU | 87.8% | 82.6% | 51.2% | 21.5% | 78.4% | 48.4% | 61.7% |
| Mamba | 70.9% | 80.7% | 48.2% | 27.9% | 76.2% | 47.7% | 58.6% |
| LinOSS-IM | 95.0% | 87.8% | 58.2% | 29.9% | 75.8% | 60.0% | 67.8% |
| PD-SSM | 90.0% | 80.9% | 56.1% | 34.7% | 80.0% | 60.0% | 67.0% |
Key Findings
- PD-SSM posts the highest average on the FSA suite (98.8%, Table 1) and a competitive 67.0% average on UEA classification (Table 2), within a point of LinOSS-IM, with the best accuracy on Ethanol and Heartbeat and a tie for best on Motor.
Highlights & Insights
- Theoretical elegance: The PD parameterization simultaneously achieves BIBO stability, universal FSA simulation, and optimal state dimensionality—three theoretical guarantees in a single framework.
- Efficiency breakthrough: PD-SSM achieves a 71× speedup over unstructured (dense) SSMs at dimension 5632, with parallel-scan complexity identical to diagonal SSMs at \(\Theta(LN)\).
- Qualitative leap in FSA simulation: The model achieves an average of 98.8% across four standard FSA tasks, significantly outperforming all existing parallelizable SSMs.
- Perfect performance on non-solvable groups: Multiple configurations attain 100% accuracy on \(A_5\) and \(S_5\) groups, in full agreement with theoretical predictions.
- Hybrid architecture validation: Attaching a single PD-SSM layer to a frozen Qwen-2.5 1.5B backbone enables successful generalization on natural-language-encoded FSA state-tracking, demonstrating its potential as a hybrid architecture component.
- Complementarity of P and D: P encodes arbitrary permutations while D compactly encodes cyclic groups; together they are both universal and efficient (see the illustration after this list).
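A minimal illustration of this division of labor (my notation, not the paper's):

```latex
% One complex diagonal entry suffices for the cyclic group Z_n,
% while a state swap requires the one-hot structure of P.
\[
  D = \operatorname{diag}\!\left(e^{2\pi i/n}\right) \ \text{generates } \mathbb{Z}_n,
  \qquad
  P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \ \text{swaps two states.}
\]
```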
Limitations & Future Work
- Overhead in P matrix generation: Although the parallel scan complexity matches that of diagonal SSMs, generating \(P(u_t)\) involves dictionary lookup and hardmax operations, resulting in a practical runtime approximately 7× that of diagonal models.
- Absence of large-scale language modeling validation: Experiments focus primarily on synthetic tasks and time-series classification; the effectiveness of PD-SSM in large-scale LLM pretraining remains unverified.
- Limited theoretical guarantees for the surrogate gradient: The hardmax-to-softmax gradient approximation is heuristic, lacking rigorous convergence analysis.
- Dictionary size \(K\) as a manual hyperparameter: Optimal values differ across tasks (\(K=32\) for FSA, \(K=6\) for LRA), with no automatic selection mechanism.
- Time-varying SSMs underperform time-invariant models on LRA: On the Long Range Arena benchmark, PD-SSM achieves the best result among time-varying models but still trails time-invariant SSMs such as S4 and LRU by approximately 10 points.
Related Work & Insights
- Diagonal SSMs: S4 (Gu et al., 2022), LRU (Orvieto et al., 2023), and Mamba (Gu & Dao, 2023) achieve computational efficiency via diagonal transition matrices.
- DPLR-structured SSMs: DeltaNet (Schlag et al., 2021), GLA (Yang et al., 2024), RWKV-7 (Peng et al., 2025), and DeltaProduct (Siems et al., 2025) augment expressiveness with diagonal-plus-low-rank structures.
- Unstructured SSMs: SD-SSM (Terzić et al., 2025) uses dense matrices for complete FSA simulation at the cost of \(\Theta(LN^3)\) complexity.
- FSA and neural networks: Merrill & Sabharwal (2024) prove that diagonal SSMs cannot simulate non-solvable automata; Sarrof et al. (2024) construct complex diagonal SSMs for solvable automata.
- Hybrid architectures: Griffin (De et al., 2024), Jamba (Lenz et al., 2025), and TransXSSM (Wu et al., 2025) combine Transformers with SSMs.
- Neural CDEs: Log-NCDE (Walker et al., 2024) and LinOSS (Rusch et al., 2025) are differential-equation-based methods for time-series analysis.
Rating
| Dimension | Score | Remarks |
|---|---|---|
| Novelty | 9/10 | PD parameterization is an entirely new matrix structure design, elegantly combining sparsity with algebraic closure |
| Theoretical Depth | 9/10 | Complete proofs of stability, expressiveness, and optimality form a rigorous theoretical framework |
| Experimental Thoroughness | 8/10 | Covers synthetic FSA, time-series classification, LRA, and natural-language tasks, but lacks large-scale LM experiments |
| Writing Quality | 9/10 | Clear exposition with well-organized theory and experiments; high-quality figures and tables |
| Value | 7/10 | Well-suited for scenarios requiring precise state tracking, though large-scale practical utility remains to be demonstrated |
| Overall | 8.5/10 | A high-quality contribution combining theoretical elegance with empirical validation, with significant impact on SSM expressiveness research |