Mamba-3: Improved Sequence Modeling using State Space Principles¶
Conference: ICLR 2026 arXiv: 2603.15569 Code: Available Area: Sequence Modeling Keywords: State Space Models, Mamba, Sequence Modeling, Inference Efficiency, MIMO
TL;DR¶
Mamba-3 proposes three core improvements from an SSM perspective: exponential-trapezoidal discretization, complex-valued state spaces, and a multi-input multi-output (MIMO) formulation. These advances significantly improve model quality and state-tracking capability without increasing decoding latency, pushing the performance–efficiency Pareto frontier forward.
Background & Motivation¶
Test-time compute has become a key driver of LLM performance, with techniques such as chain-of-thought reasoning and iterative refinement placing inference efficiency at the center of model design. Although Transformers remain the industry standard, they are constrained by:

- Quadratic computational complexity: the cost of self-attention grows quadratically with sequence length
- Linear memory requirements: the KV cache grows linearly with sequence length
Sub-quadratic models (SSMs, linear attention) offer constant memory and linear computation, yet suffer from three major shortcomings:
- Limited expressivity: Mamba-2 sacrifices some expressivity for training speed, underperforming Mamba-1
- Lack of state-tracking capability: unable to solve simple tasks such as parity
- Poor hardware efficiency: arithmetic intensity during decoding is only ~2.5 ops/byte, leaving most hardware idle
Method¶
Overall Architecture¶
Mamba-3 introduces three SSM-principled core improvements over Mamba-2, along with several architectural refinements. The overall architecture follows the Llama style, interleaving Mamba-3 blocks and SwiGLU MLP blocks with pre-norm.
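A minimal sketch of this block layout, assuming standard pre-norm residual wiring (the function and argument names below are illustrative, not the released code):

```python
def block(x, norm1, mixer, norm2, mlp):
    # Pre-norm residual block, Llama-style: token mixing followed by channel mixing.
    x = x + mixer(norm1(x))   # Mamba-3 (SISO or MIMO) sequence-mixing layer
    x = x + mlp(norm2(x))     # SwiGLU MLP
    return x
```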
Key Designs¶
1. Exponential-Trapezoidal Discretization¶
Background: Continuous-time SSMs must be discretized into recurrences before training and inference. The discretization used in Mamba-1/2 was heuristic and lacked theoretical justification.
Contributions:

- Formalizes the heuristic discretization of Mamba-1/2 as an "exponential-Euler" method (first-order approximation, error \(O(\Delta_t^2)\))
- Proposes an "exponential-trapezoidal" method (second-order approximation, error \(O(\Delta_t^3)\))
Exponential-trapezoidal recurrence (schematically):

$$\mathbf{h}_t = e^{\Delta_t \mathbf{A}_t}\,\mathbf{h}_{t-1} + \Delta_t\left(\lambda_t\,\mathbf{B}_t x_t + (1-\lambda_t)\,e^{\Delta_t \mathbf{A}_t}\,\mathbf{B}_{t-1} x_{t-1}\right)$$

where \(\lambda_t \in [0,1]\) is a data-dependent scalar. Setting \(\lambda_t=1\) recovers Mamba-2's exponential-Euler method; \(\lambda_t=\frac{1}{2}\) yields the classical trapezoidal rule.
Equivalent convolution perspective: This recurrence is equivalent to applying a width-2 data-dependent convolution to the state input \(\mathbf{B}_t x_t\) before entering the linear recurrence — fundamentally different from the standard short convolution externally applied in Mamba, as it operates inside the recurrence core.
Parallel form: Via the SSD framework, the structured mask \(\mathbf{L}\) corresponding to the new recurrence is the product of a 1-semiseparable matrix and a 2-band matrix (a special 2-semiseparable matrix), supporting efficient parallel matrix-multiplication computation.
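A minimal NumPy sketch of this recurrence for a single SISO head, assuming scalar per-step decay; shapes, names, and the sequential loop are illustrative rather than the paper's fused kernel:

```python
import numpy as np

def exp_trapezoidal_scan(a, B, x, delta, lam):
    """Reference (non-parallel) scan for the exponential-trapezoidal recurrence.

    a:     (T,)   negative decay values A_t (scalar per step here)
    B:     (T, N) input projections
    x:     (T,)   one input channel
    delta: (T,)   step sizes Delta_t
    lam:   (T,)   data-dependent weights lambda_t in [0, 1]
    """
    T, N = B.shape
    h = np.zeros((T, N))
    prev_h = np.zeros(N)
    prev_Bx = np.zeros(N)                      # B_{t-1} x_{t-1}, the second trapezoid node
    for t in range(T):
        alpha = np.exp(delta[t] * a[t])        # exp(Delta_t * A_t)
        Bx = B[t] * x[t]
        # lambda_t = 1 recovers the exponential-Euler update of Mamba-1/2;
        # lambda_t = 1/2 gives the classical trapezoidal rule.
        h[t] = alpha * prev_h + delta[t] * (lam[t] * Bx + (1 - lam[t]) * alpha * prev_Bx)
        prev_h, prev_Bx = h[t], Bx
    return h
```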
2. Complex-Valued SSM¶
Design Motivation: The transition matrix eigenvalues of real-valued SSMs (e.g., Mamba-2) are restricted to real numbers, preventing the representation of "rotational" dynamics — for example, parity can be expressed using a rotation matrix \(\mathbf{R}(\pi x_t)\).
Method: Extends the underlying SSM parameters to complex values. A key equivalence is established:
A complex-valued SSM after discretization is equivalent to a real-valued SSM augmented with data-dependent rotary embeddings (RoPE), schematically

$$\mathbf{h}_t = \alpha_t\,\mathbf{R}_t\,\mathbf{h}_{t-1} + \Delta_t\,\mathbf{B}_t x_t,$$

where \(\alpha_t\) is the real-valued decay and \(\mathbf{R}_t\) is a block-diagonal rotation matrix whose angles are produced by data-dependent projections.
RoPE trick: By cumulatively applying rotation matrices to the B/C projections (analogous to Q/K in attention), complex-valued SSMs can be implemented efficiently with minimal computational overhead. This establishes a theoretical connection between complex-valued SSMs and data-dependent RoPE.
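The parity claim above can be checked with a toy example: a 2-D state rotated by \(\pi x_t\) per token tracks parity exactly, which decay-only (real-eigenvalue) dynamics cannot do. This is an illustration of the rotational-dynamics argument, not the paper's model:

```python
import numpy as np

def rot(theta):
    """2x2 rotation matrix, the real-valued form of a complex eigenvalue e^{i*theta}."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def parity(bits):
    h = np.array([1.0, 0.0])
    for x in bits:
        h = rot(np.pi * x) @ h     # each 1-bit flips the state by a half turn
    return int(h[0] < 0)           # (+1, 0) -> even, (-1, 0) -> odd

assert parity([1, 0, 1, 1]) == 1   # three ones: odd
assert parity([1, 1, 0, 0]) == 0   # two ones: even
```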
3. Multi-Input Multi-Output (MIMO) SSM¶
Design Motivation: The arithmetic intensity of SSM decoding is extremely low. Standard SISO arithmetic intensity is ~2.5 ops/byte, whereas H100 matmul peak is ~295 ops/byte, indicating that SSM decoding is severely memory-bound.
SISO→MIMO conversion:

- Expand \(\mathbf{B}_t \in \mathbb{R}^N \to \mathbb{R}^{N \times R}\)
- Expand \(\mathbf{x}_t \in \mathbb{R}^P \to \mathbb{R}^{P \times R}\)
- The outer product \(\mathbf{B}_t\mathbf{x}_t^\top\) becomes a matrix multiplication (leveraging tensor cores)
Effect: FLOPs increase by \(R\times\) but wall-clock latency remains nearly unchanged (as compute overlaps with memory I/O), raising arithmetic intensity from \(\Theta(1)\) to \(\Theta(R)\).
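A minimal NumPy sketch of a single decode step, contrasting the rank-1 SISO update with the rank-\(R\) MIMO update; the matrix-valued state, the fixed 0.9 decay, and the dimensions are illustrative assumptions:

```python
import numpy as np

N, P, R = 64, 64, 4                 # state size, head dim, MIMO rank (illustrative)
H = np.zeros((N, P))                # matrix-valued state of one head
decay = 0.9                         # stands in for exp(Delta_t * A_t)

# SISO decode step: a rank-1 outer-product update; every step reads/writes the full
# state while doing only O(N*P) FLOPs, hence the low arithmetic intensity.
b = np.random.randn(N)
x = np.random.randn(P)
H_siso = decay * H + np.outer(b, x)

# MIMO decode step: B_t and x_t gain a rank dimension R, so the update becomes a small
# matmul that tensor cores can execute while the state is being read/written anyway.
B = np.random.randn(N, R)
X = np.random.randn(R, P)
H_mimo = decay * H + B @ X          # R x the FLOPs, roughly the same bytes moved
```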
Training: MIMO decomposes into \(R^2\) parallel SISO calls. Because the dominant per-call cost scales roughly linearly with chunk size, shrinking it to \(C_{\text{MIMO}} = \frac{1}{R}C_{\text{SISO}}\) means total FLOPs increase by only \(R\times\) (rather than \(R^2\times\)).
Parameter matching: The additional parameters from MIMO are compensated by reducing MLP width (only a 6.6% reduction in the 1.5B model).
4. Architectural Refinements¶
- BC normalization: RMSNorm added after B/C projections (analogous to QKNorm in Transformers), enabling removal of the post-gate RMSNorm
- B/C bias: Learnable head-specific biases are added, providing a data-independent component (convolution-like behavior); see the sketch after this list
- Removal of short convolution: The combination of exponential-trapezoidal discretization and B/C bias renders the existing short convolution entirely removable
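A hedged sketch of the two B/C refinements above (whether the bias is applied before or after the norm, and the exact shapes, are assumptions here, not the paper's code):

```python
import numpy as np

def rmsnorm(v, eps=1e-6):
    return v / np.sqrt(np.mean(v * v, axis=-1, keepdims=True) + eps)

def project_bc(u, W_B, W_C, bias_B, bias_C):
    """Project hidden states u to B/C, normalize them (QKNorm-style), then add per-head biases."""
    B = rmsnorm(u @ W_B) + bias_B   # bias contributes a data-independent, convolution-like term
    C = rmsnorm(u @ W_C) + bias_C
    return B, C
```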
Loss & Training¶
- Standard language modeling training: 100B FineWeb-Edu tokens, Llama-3.1 tokenizer, 2K context
- Identical training protocol across all scales for fair comparison
- MIMO rank \(R=4\); parameter count matched by reducing MLP width
Key Experimental Results¶
Main Results¶
1.5B-parameter models trained on 100B FineWeb-Edu tokens; the table reports FineWeb-Edu perplexity and average accuracy across 8 downstream tasks:
| Model | FW-Edu ppl↓ | Downstream Avg Acc↑ |
|---|---|---|
| Transformer-1.5B | 10.51 | 55.4 |
| GDN-1.5B | 10.45 | 55.8 |
| Mamba-2-1.5B | 10.47 | 55.7 |
| Mamba-3-SISO-1.5B | 10.35 | 56.4 |
| Mamba-3-MIMO-1.5B | 10.24 | 57.6 |
Mamba-3 SISO improves over the second-best model GDN by +0.6 points; MIMO yields a further +1.2 points, for a total gain of +1.8 points. MIMO also improves PPL by 0.11.
Downstream average accuracy (%) by model scale:

| Model | 180M | 440M | 880M | 1.5B |
|---|---|---|---|---|
| Mamba-2 | 42.9 | 49.6 | 53.4 | 55.7 |
| Mamba-3 SISO | 43.4 | 49.8 | 54.4 | 56.4 |
| Mamba-3 MIMO | 43.5 | 51.0 | 55.3 | 57.6 |
Mamba-3 outperforms Mamba-2 at every model scale.
Ablation Study¶
Component ablation (440M scale):
| Variant | PPL↓ |
|---|---|
| Mamba-3 w/o bias & trapezoidal | 16.68 |
| Mamba-3 w/o bias | 16.49 |
| Mamba-3 (full) | 15.72 |
| Mamba-3 + short conv | 15.85 |
B/C bias and trapezoidal discretization act synergistically, making the short convolution optional — adding convolution actually increases PPL.
State-tracking tasks:
| Model | Parity↑ | Arithmetic (no parens)↑ | Arithmetic (with parens)↑ |
|---|---|---|---|
| Mamba-2 | 0.90 | 47.81 | 0.88 |
| Mamba-3 (w/o RoPE) | 2.27 | 1.49 | 0.72 |
| Mamba-3 (w/ Std. RoPE) | 1.56 | 20.70 | 2.62 |
| Mamba-3 | 100.00 | 98.51 | 87.75 |
| GDN [-1,1] | 100.00 | 99.25 | 93.50 |
Data-dependent RoPE is the key to state tracking: Mamba-2 fails entirely (near random), while Mamba-3 solves Parity near-perfectly.
State size experiment: Mamba-3 MIMO at \(d_{\text{state}}=64\) matches the PPL of Mamba-2 at \(d_{\text{state}}=128\), reaching the same quality with half the state size and therefore roughly half the decoding latency.
Key Findings¶
- Empirical results are better when \(\lambda_t\) in exponential-trapezoidal discretization is not constrained to \(\frac{1}{2}\)
- Under BF16 precision, the Mamba-3 SISO kernel is actually faster than Mamba-2 and GDN (0.156ms vs. 0.203ms vs. 0.257ms)
- MIMO with \(R=4\) increases training time by only ~2×, while decoding latency remains nearly unchanged
- Hybrid models (Mamba-3 + NoPE attention, 5:1 ratio) significantly outperform pure linear models on retrieval tasks
Highlights & Insights¶
- Unification through the SSM perspective: All three improvements arise naturally from the continuous-time SSM viewpoint — conclusions that would be difficult to reach from a linear attention or test-time regression perspective
- Substantial theoretical contribution: The paper provides the first rigorous proof that Mamba-1/2 discretization constitutes an "exponential-Euler" method, and derives a superior trapezoidal generalization
- Complex-valued SSM → RoPE: The established equivalence between complex-valued SSMs and data-dependent RoPE unifies two independently developed research directions
- Practical value of MIMO: Increasing computation without increasing latency perfectly exploits the hardware idle capacity during decoding
Limitations & Future Work¶
- Pure linear models remain noticeably weaker than Transformers on semi-structured/unstructured data extraction (SWDE, FDA)
- The choice of normalization type and placement in hybrid models remains unclear, with competing trade-offs
- Validation is limited to scales ≤1.5B and 100B tokens; large-scale results remain to be confirmed
- Theoretical guidance for selecting the optimal MIMO rank \(R\) is still lacking
- Long-context extrapolation requires additional RMSNorm layers, increasing architectural complexity
Related Work & Insights¶
- Competitive with Gated DeltaNet (GDN) in performance but methodologically distinct (SSM discretization vs. delta rule)
- The RoPE equivalence of complex-valued SSMs may inspire new positional encoding designs in Transformers
- The arithmetic intensity optimization strategy of MIMO can be generalized to other memory-bound computation scenarios
- Already adopted by large-scale hybrid models including NVIDIA Nemotron and Alibaba Qwen3, validating industrial feasibility
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — All three improvements carry theoretical novelty; the complex-valued SSM–RoPE equivalence is particularly elegant
- Technical Depth: ⭐⭐⭐⭐⭐ — Full-stack coverage from continuous-time ODEs to discrete recurrences to efficient kernel implementation
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four scales + synthetic tasks + retrieval tasks + kernel performance benchmarks
- Value: ⭐⭐⭐⭐⭐ — Open-sourced and already adopted by industry