Mamba-3: Improved Sequence Modeling using State Space Principles¶
Conference: ICLR 2026 arXiv: 2603.15569 Code: Available Area: Sequence Modeling Keywords: State Space Models, Mamba, Sequence Modeling, Inference Efficiency, MIMO
TL;DR¶
Mamba-3 proposes three core improvements from an SSM perspective: exponential-trapezoidal discretization, complex-valued state spaces, and a multi-input multi-output (MIMO) formulation. These advances significantly improve model quality and state-tracking capability without increasing decoding latency, pushing the performance–efficiency Pareto frontier forward.
Background & Motivation¶
Test-time compute has become a key driver of LLM performance, with techniques such as chain-of-thought reasoning and iterative refinement placing inference efficiency at the center of model design. Although Transformers remain the industry standard, they are constrained by:

- Quadratic computational complexity: the cost of self-attention grows quadratically with sequence length
- Linear memory requirements: the KV cache grows linearly with sequence length
Sub-quadratic models (SSMs, linear attention) offer constant memory and linear computation, yet suffer from three major shortcomings:
- Limited expressivity: Mamba-2 sacrifices some expressivity for training speed, underperforming Mamba-1
- Lack of state-tracking capability: unable to solve simple tasks such as parity
- Poor hardware efficiency: arithmetic intensity during decoding is only ~2.5 ops/byte, leaving most hardware idle
Method¶
Overall Architecture¶
Mamba-3 introduces three SSM-principled core improvements over Mamba-2, along with several architectural refinements. The overall architecture follows the Llama style, interleaving Mamba-3 blocks and SwiGLU MLP blocks with pre-norm.
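A minimal sketch of this block layout, assuming standard pre-norm residual wiring (the function and argument names below are illustrative, not the released code):

```python
def block(x, norm1, mixer, norm2, mlp):
    # Pre-norm residual block, Llama-style: token mixing followed by channel mixing.
    x = x + mixer(norm1(x))   # Mamba-3 (SISO or MIMO) sequence-mixing layer
    x = x + mlp(norm2(x))     # SwiGLU MLP
    return x
```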
Key Designs¶
1. Exponential-Trapezoidal Discretization¶
Background: Continuous-time SSMs must be discretized into recurrences before training and inference. The discretization used in Mamba-1/2 was heuristic and lacked theoretical justification.
Contributions:

- Formalizes the heuristic discretization of Mamba-1/2 as an "exponential-Euler" method (first-order approximation, error \(O(\Delta_t^2)\))
- Proposes an "exponential-trapezoidal" method (second-order approximation, error \(O(\Delta_t^3)\))
Exponential-trapezoidal recurrence (schematically):

$$\mathbf{h}_t = e^{\Delta_t \mathbf{A}_t}\,\mathbf{h}_{t-1} + \Delta_t\left(\lambda_t\,\mathbf{B}_t x_t + (1-\lambda_t)\,e^{\Delta_t \mathbf{A}_t}\,\mathbf{B}_{t-1} x_{t-1}\right)$$

where \(\lambda_t \in [0,1]\) is a data-dependent scalar. Setting \(\lambda_t=1\) recovers Mamba-2's exponential-Euler method; \(\lambda_t=\frac{1}{2}\) yields the classical trapezoidal rule.
Equivalent convolution perspective: This recurrence is equivalent to applying a width-2 data-dependent convolution to the state input \(\mathbf{B}_t x_t\) before entering the linear recurrence — fundamentally different from the standard short convolution externally applied in Mamba, as it operates inside the recurrence core.
Parallel form: Via the SSD framework, the structured mask \(\mathbf{L}\) corresponding to the new recurrence is the product of a 1-semiseparable matrix and a 2-band matrix (a special 2-semiseparable matrix), supporting efficient parallel matrix-multiplication computation.
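A minimal NumPy sketch of this recurrence for a single SISO head, assuming scalar per-step decay; shapes, names, and the sequential loop are illustrative rather than the paper's fused kernel:

```python
import numpy as np

def exp_trapezoidal_scan(a, B, x, delta, lam):
    """Reference (non-parallel) scan for the exponential-trapezoidal recurrence.

    a:     (T,)   negative decay values A_t (scalar per step here)
    B:     (T, N) input projections
    x:     (T,)   one input channel
    delta: (T,)   step sizes Delta_t
    lam:   (T,)   data-dependent weights lambda_t in [0, 1]
    """
    T, N = B.shape
    h = np.zeros((T, N))
    prev_h = np.zeros(N)
    prev_Bx = np.zeros(N)                      # B_{t-1} x_{t-1}, the second trapezoid node
    for t in range(T):
        alpha = np.exp(delta[t] * a[t])        # exp(Delta_t * A_t)
        Bx = B[t] * x[t]
        # lambda_t = 1 recovers the exponential-Euler update of Mamba-1/2;
        # lambda_t = 1/2 gives the classical trapezoidal rule.
        h[t] = alpha * prev_h + delta[t] * (lam[t] * Bx + (1 - lam[t]) * alpha * prev_Bx)
        prev_h, prev_Bx = h[t], Bx
    return h
```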
2. Complex-Valued SSM¶
Design Motivation: The transition matrix eigenvalues of real-valued SSMs (e.g., Mamba-2) are restricted to real numbers, preventing the representation of "rotational" dynamics — for example, parity can be expressed using a rotation matrix \(\mathbf{R}(\pi x_t)\).
Method: Extends the underlying SSM parameters to complex values. A key equivalence is established:
A complex-valued SSM after discretization is equivalent to a real-valued SSM augmented with data-dependent rotary embeddings (RoPE), schematically

$$\mathbf{h}_t = \alpha_t\,\mathbf{R}_t\,\mathbf{h}_{t-1} + \Delta_t\,\mathbf{B}_t x_t,$$

where \(\alpha_t\) is the real-valued decay and \(\mathbf{R}_t\) is a block-diagonal rotation matrix whose angles are produced by data-dependent projections.
RoPE trick: By cumulatively applying rotation matrices to the B/C projections (analogous to Q/K in attention), complex-valued SSMs can be implemented efficiently with minimal computational overhead. This establishes a theoretical connection between complex-valued SSMs and data-dependent RoPE.
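The parity claim above can be checked with a toy example: a 2-D state rotated by \(\pi x_t\) per token tracks parity exactly, which decay-only (real-eigenvalue) dynamics cannot do. This is an illustration of the rotational-dynamics argument, not the paper's model:

```python
import numpy as np

def rot(theta):
    """2x2 rotation matrix, the real-valued form of a complex eigenvalue e^{i*theta}."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def parity(bits):
    h = np.array([1.0, 0.0])
    for x in bits:
        h = rot(np.pi * x) @ h     # each 1-bit flips the state by a half turn
    return int(h[0] < 0)           # (+1, 0) -> even, (-1, 0) -> odd

assert parity([1, 0, 1, 1]) == 1   # three ones: odd
assert parity([1, 1, 0, 0]) == 0   # two ones: even
```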
3. Multi-Input Multi-Output (MIMO) SSM¶
Design Motivation: The arithmetic intensity of SSM decoding is extremely low. Standard SISO arithmetic intensity is ~2.5 ops/byte, whereas H100 matmul peak is ~295 ops/byte, indicating that SSM decoding is severely memory-bound.
SISO→MIMO conversion:

- Expand \(\mathbf{B}_t \in \mathbb{R}^N \to \mathbb{R}^{N \times R}\)
- Expand \(\mathbf{x}_t \in \mathbb{R}^P \to \mathbb{R}^{P \times R}\)
- The outer product \(\mathbf{B}_t\mathbf{x}_t^\top\) becomes a matrix multiplication (leveraging tensor cores)
Effect: FLOPs increase by \(R\times\) but wall-clock latency remains nearly unchanged (as compute overlaps with memory I/O), raising arithmetic intensity from \(\Theta(1)\) to \(\Theta(R)\).
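A minimal NumPy sketch of a single decode step, contrasting the rank-1 SISO update with the rank-\(R\) MIMO update; the matrix-valued state, the fixed 0.9 decay, and the dimensions are illustrative assumptions:

```python
import numpy as np

N, P, R = 64, 64, 4                 # state size, head dim, MIMO rank (illustrative)
H = np.zeros((N, P))                # matrix-valued state of one head
decay = 0.9                         # stands in for exp(Delta_t * A_t)

# SISO decode step: a rank-1 outer-product update; every step reads/writes the full
# state while doing only O(N*P) FLOPs, hence the low arithmetic intensity.
b = np.random.randn(N)
x = np.random.randn(P)
H_siso = decay * H + np.outer(b, x)

# MIMO decode step: B_t and x_t gain a rank dimension R, so the update becomes a small
# matmul that tensor cores can execute while the state is being read/written anyway.
B = np.random.randn(N, R)
X = np.random.randn(R, P)
H_mimo = decay * H + B @ X          # R x the FLOPs, roughly the same bytes moved
```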
Training: MIMO decomposes into \(R^2\) parallel SISO calls. Because the dominant per-call cost scales roughly linearly with chunk size, shrinking it to \(C_{\text{MIMO}} = \frac{1}{R}C_{\text{SISO}}\) means total FLOPs increase by only \(R\times\) (rather than \(R^2\times\)).
Parameter matching: The additional parameters from MIMO are compensated by reducing MLP width (only a 6.6% reduction in the 1.5B model).
4. Architectural Refinements¶
- BC normalization: RMSNorm added after B/C projections (analogous to QKNorm in Transformers), enabling removal of the post-gate RMSNorm
- B/C bias: Learnable head-specific biases are added, providing a data-independent component (convolution-like behavior); see the sketch after this list
- Removal of short convolution: The combination of exponential-trapezoidal discretization and B/C bias renders the existing short convolution entirely removable
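A hedged sketch of the two B/C refinements above (whether the bias is applied before or after the norm, and the exact shapes, are assumptions here, not the paper's code):

```python
import numpy as np

def rmsnorm(v, eps=1e-6):
    return v / np.sqrt(np.mean(v * v, axis=-1, keepdims=True) + eps)

def project_bc(u, W_B, W_C, bias_B, bias_C):
    """Project hidden states u to B/C, normalize them (QKNorm-style), then add per-head biases."""
    B = rmsnorm(u @ W_B) + bias_B   # bias contributes a data-independent, convolution-like term
    C = rmsnorm(u @ W_C) + bias_C
    return B, C
```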
Loss & Training¶
- Standard language modeling training: 100B FineWeb-Edu tokens, Llama-3.1 tokenizer, 2K context
- Identical training protocol across all scales for fair comparison
- MIMO rank \(R=4\); parameter count matched by reducing MLP width
Key Experimental Results¶
Main Results¶
1.5B-parameter models trained on 100B FineWeb-Edu tokens; the table reports FineWeb-Edu perplexity and average accuracy across 8 downstream tasks:
| Model | FW-Edu ppl↓ | Downstream Avg Acc↑ |
|---|---|---|
| Transformer-1.5B | 10.51 | 55.4 |
| GDN-1.5B | 10.45 | 55.8 |
| Mamba-2-1.5B | 10.47 | 55.7 |
| Mamba-3-SISO-1.5B | 10.35 | 56.4 |
| Mamba-3-MIMO-1.5B | 10.24 | 57.6 |
Mamba-3 SISO improves over the second-best model GDN by +0.6 points; MIMO yields a further +1.2 points, for a total gain of +1.8 points. MIMO also improves PPL by 0.11.
Downstream average accuracy (%) by model scale:

| Model | 180M | 440M | 880M | 1.5B |
|---|---|---|---|---|
| Mamba-2 | 42.9 | 49.6 | 53.4 | 55.7 |
| Mamba-3 SISO | 43.4 | 49.8 | 54.4 | 56.4 |
| Mamba-3 MIMO | 43.5 | 51.0 | 55.3 | 57.6 |
Mamba-3 outperforms Mamba-2 at every model scale.
Ablation Study¶
Component ablation (440M scale):
| Variant | PPL↓ |
|---|---|
| Mamba-3 w/o bias & trapezoidal | 16.68 |
| Mamba-3 w/o bias | 16.49 |
| Mamba-3 (full) | 15.72 |
| Mamba-3 + short conv | 15.85 |
B/C bias and trapezoidal discretization act synergistically, making the short convolution optional — adding convolution actually increases PPL.
State-tracking tasks:
| Model | Parity↑ | Arithmetic (no parens)↑ | Arithmetic (with parens)↑ |
|---|---|---|---|
| Mamba-2 | 0.90 | 47.81 | 0.88 |
| Mamba-3 (w/o RoPE) | 2.27 | 1.49 | 0.72 |
| Mamba-3 (w/ Std. RoPE) | 1.56 | 20.70 | 2.62 |
| Mamba-3 | 100.00 | 98.51 | 87.75 |
| GDN [-1,1] | 100.00 | 99.25 | 93.50 |
Data-dependent RoPE is the key to state tracking: Mamba-2 fails entirely (near random), while Mamba-3 solves Parity near-perfectly.
State size experiment: Mamba-3 MIMO at \(d_{\text{state}}=64\) matches the PPL of Mamba-2 at \(d_{\text{state}}=128\), reaching the same quality with half the state size and therefore roughly half the decoding latency.
Key Findings¶
- Empirical results are better when \(\lambda_t\) in exponential-trapezoidal discretization is not constrained to \(\frac{1}{2}\)
- Under BF16 precision, the Mamba-3 SISO kernel is actually faster than Mamba-2 and GDN (0.156ms vs. 0.203ms vs. 0.257ms)
- MIMO with \(R=4\) increases training time by only ~2×, while decoding latency remains nearly unchanged
- Hybrid models (Mamba-3 + NoPE attention, 5:1 ratio) significantly outperform pure linear models on retrieval tasks
Highlights & Insights¶
- Unification through the SSM perspective: All three improvements arise naturally from the continuous-time SSM viewpoint — conclusions that would be difficult to reach from a linear attention or test-time regression perspective
- Substantial theoretical contribution: The paper provides the first rigorous proof that Mamba-1/2 discretization constitutes an "exponential-Euler" method, and derives a superior trapezoidal generalization
- Complex-valued SSM → RoPE: The established equivalence between complex-valued SSMs and data-dependent RoPE unifies two independently developed research directions
- Practical value of MIMO: Increasing computation without increasing latency perfectly exploits the hardware idle capacity during decoding
Limitations & Future Work¶
- Pure linear models remain noticeably weaker than Transformers on semi-structured/unstructured data extraction (SWDE, FDA)
- The choice of normalization type and placement in hybrid models remains unclear, with competing trade-offs
- Validation is limited to scales ≤1.5B and 100B tokens; large-scale results remain to be confirmed
- Theoretical guidance for selecting the optimal MIMO rank \(R\) is still lacking
- Long-context extrapolation requires additional RMSNorm layers, increasing architectural complexity
Related Work & Insights¶
- Competitive with Gated DeltaNet (GDN) in performance but methodologically distinct (SSM discretization vs. delta rule)
- The RoPE equivalence of complex-valued SSMs may inspire new positional encoding designs in Transformers
- The arithmetic intensity optimization strategy of MIMO can be generalized to other memory-bound computation scenarios
- Already adopted by large-scale hybrid models including NVIDIA Nemotron and Alibaba Qwen3, validating industrial feasibility
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — All three improvements carry theoretical novelty; the complex-valued SSM–RoPE equivalence is particularly elegant
- Technical Depth: ⭐⭐⭐⭐⭐ — Full-stack coverage from continuous-time ODEs to discrete recurrences to efficient kernel implementation
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four scales + synthetic tasks + retrieval tasks + kernel performance benchmarks
- Value: ⭐⭐⭐⭐⭐ — Open-sourced and already adopted by industry