TransMamba: A Sequence-Level Hybrid Transformer-Mamba Language Model

Conference: AAAI 2026 arXiv: 2503.24067 Code: N/A Area: LLM/NLP Keywords: Transformer, Mamba, SSM, Hybrid Architecture, Sequence Modeling

TL;DR

This paper proposes TransMamba, a sequence-level Transformer-Mamba hybrid architecture that dynamically switches between Attention and SSM computation at different token positions via shared QKV/CBx parameters and a Memory Converter, achieving efficiency advantages for both short and long sequences.

Background & Motivation

Background: Transformer (\(O(T^2)\) complexity) remains the dominant architecture for LLMs. Mamba (SSM, \(O(T)\) linear complexity) is more efficient on long sequences but exhibits instability in in-context learning and multi-task generalization. Existing hybrid approaches (Jamba, Zamba, etc.) adopt layer-level interleaving (fixed ratios of Transformer and Mamba layers), but suffer from structural rigidity — they must adhere to specific layer ordering and ratio rules.

Limitations of Prior Work: (a) Transformer trains faster on short contexts while Mamba is more efficient on long contexts — yet layer-level mixing cannot exploit the respective efficiency advantages of both within the same sequence; (b) layer-level mixing ratios are fixed (e.g., 4:1), and deviating from the prescribed rules degrades performance; (c) Mamba2 reveals a mathematical duality between Attention and SSM, and Wang et al. demonstrate via distillation that QKV and CBx parameters are mutually transferable — suggesting that a more principled unification of the two mechanisms is possible.

Key Challenge: A flexible framework is needed that can adaptively apply Attention or SSM at different positions within the same sequence without information loss during transitions.

Key Insight: Exploiting the parameter correspondence between Attention and SSM (Q↔C, K↔B, V↔x) to enable a single set of parameters to support both computation modes.

Core Idea: Shared QKV/CBx parameters + lossless Memory Converter + TransPoint scheduling = flexible sequence-level switching between Attention and SSM.

Method

Overall Architecture

TransMamba is a stacked-layer decoder-only autoregressive model. Each layer contains the full Mamba parameter set (C/B/x/A/Δ), with QKV and CBx parameters shared based on the Attention-SSM parameter correspondence (Q↔C, K↔B, V↔x). Tokens before the TransPoint are processed via Attention; tokens after the TransPoint are processed via SSM.

Key Designs

  1. Shared Parameter Mapping (QKV↔CBx):

    • Function: A single parameter set supports both Attention and SSM computation modes.
    • Mechanism: Prefix tokens (\(h_s = h[:\text{TransPoint}]\)) use the shared parameters to compute \(Q = \delta(h_s W_C)\), \(K = \delta(h_s W_B)\), \(V = \delta(h_s W_x)\) and produce the Attention output \(y_s = \text{softmax}(QK^T)V\). Subsequent tokens (\(h_l = h[\text{TransPoint}:]\)) use the same parameters to compute C/B/x, which together with the Δ and A parameters produce \(y_l\) via the SSM recurrence (a simplified end-to-end sketch follows this list).
    • Design Motivation: The dual form \((L \circ QK^T)V\) vs. \((A^{\times} \circ CB^T)X\) revealed by Mamba2 suggests that the core weights of both mechanisms are unifiable.
  2. Memory Converter (Lossless Information Transfer):

    • Function: Losslessly converts Attention K/V at the TransPoint into the initial SSM hidden state \(h_0\).
    • Mechanism: \(h_0 = \text{MemoryConverter}(K, V)\). By unrolling the SSM hidden state recurrence into matrix form \(h = (A^{\times} \circ B^T)X\), it is theoretically shown that the Attention K/V can perfectly preserve the sequence state information required by SSM.
    • Design Motivation: Without this conversion, the SSM component would lose all contextual information from prefix tokens upon switching — the Memory Converter is the critical enabler of the entire framework.
  3. TransPoint Scheduling Strategy:

    • Function: Determines at which token position each layer switches from Attention to SSM.
    • Mechanism: Each layer is assigned a single TransPoint. The training FLOPs are \(O(P^2N + (T-P)N^2)\) (where \(P\) is the TransPoint position), forming a quadratic function of \(P\) with an optimal solution. Different TransPoint configurations across layers are also explored.
    • Design Motivation: A \(P\) that is too small results in insufficient Attention coverage, while a \(P\) that is too large dilutes the efficiency advantage of SSM. The optimal TransPoint depends on the ratio between sequence length and model dimension.
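
To make the sequence-level switching concrete, below is a minimal PyTorch sketch (not the authors' implementation): a single-head layer whose shared projections serve as Q/K/V before the TransPoint and as C/B/x after it, with a naive memory converter that folds the prefix K/V into the initial SSM state. The class and parameter names are hypothetical, and the per-step decay, Δ discretization, multi-head structure, and normalization of the real model are simplified away.

```python
# Minimal sketch of sequence-level Attention/SSM switching with shared
# projections (illustrative only; single head, simplified SSM, naive converter).
import torch
import torch.nn.functional as F


class HybridLayerSketch(torch.nn.Module):
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        # Shared projections: Q/C, K/B, V/x reuse the same weights.
        self.w_c = torch.nn.Linear(d_model, d_state, bias=False)  # Q or C
        self.w_b = torch.nn.Linear(d_model, d_state, bias=False)  # K or B
        self.w_x = torch.nn.Linear(d_model, d_model, bias=False)  # V or x
        self.log_a = torch.nn.Parameter(torch.zeros(d_state))     # decay (A)
        self.out = torch.nn.Linear(d_model, d_model, bias=False)

    def forward(self, h: torch.Tensor, trans_point: int) -> torch.Tensor:
        # h: (T, d_model); tokens [0, P) use Attention, tokens [P, T) use the SSM.
        P = trans_point
        q, k, v = self.w_c(h), self.w_b(h), self.w_x(h)

        # Causal Attention over the prefix.
        scores = q[:P] @ k[:P].T / k.shape[-1] ** 0.5
        mask = torch.triu(torch.ones(P, P), diagonal=1).bool()
        y_prefix = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v[:P]

        # "Memory converter": fold prefix K/V into an initial SSM state,
        # mirroring the unrolled recurrence h_t = A h_{t-1} + B_t x_t^T
        # (decay factors omitted in this simplified version).
        state = k[:P].T @ v[:P]                      # (d_state, d_model)

        # SSM recurrence over the suffix, seeded with the converted state.
        a = torch.exp(-torch.exp(self.log_a))        # per-channel decay in (0, 1)
        y_suffix = []
        for t in range(P, h.shape[0]):
            state = a[:, None] * state + k[t][:, None] * v[t][None, :]
            y_suffix.append(q[t] @ state)            # y_t = C_t^T h_t
        y_suffix = torch.stack(y_suffix) if y_suffix else v[:0]

        return self.out(torch.cat([y_prefix, y_suffix], dim=0))


layer = HybridLayerSketch(d_model=64, d_state=16)
print(layer(torch.randn(32, 64), trans_point=20).shape)  # torch.Size([32, 64])
```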

Loss & Training

Standard cross-entropy language-modeling loss, plus a reconstruction loss for the Memory Converter. Training FLOPs per layer are \(O(P^2 N + (T-P)N^2)\) (\(P\) denotes the Attention prefix length), which undercuts the \(O(T^2 N)\) cost of a pure Transformer on long sequences.
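
As a quick back-of-the-envelope check (not from the paper; the constant factors \(a\) for the Attention term and \(b\) for the SSM term are hidden by the O-notation), minimizing the per-layer cost over the TransPoint position \(P\) gives

\[
f(P) = a\,P^2 N + b\,(T-P)\,N^2, \qquad f'(P) = 2a\,P N - b\,N^2 = 0 \;\Rightarrow\; P^{\ast} = \frac{b}{2a}\,N,
\]

clipped to \([0, T]\). Under this idealized model the interior optimum scales with the model dimension \(N\) rather than the sequence length \(T\); when \(T\) is not much larger than \(N\), the minimum moves to the boundary \(P = T\) (pure Attention), consistent with the later observation that SSM efficiency dominates only when \(T > N\).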

Key Experimental Results

Main Results (400M model, general tasks)

Model | ARC-E | ARC-C | CoQA | OBQA | PIQA | BoolQ
--- | --- | --- | --- | --- | --- | ---
Transformer-400M | 60.57 | 58.72 | 5.07 | 42.4 | 52.75 | -
Mamba-400M | lower | lower | higher | lower | lower | -
TransMamba-400M | best | best | best | best | best | best
  • Under an equivalent training budget of 83B tokens, TransMamba outperforms both pure Transformer and pure Mamba baselines on most general benchmarks.
  • Notable improvements are observed on PhoneBook (long-range dependency test) and LongBench-v2 (long-context understanding).

Efficiency Results

Configuration | Training FLOPs/layer | Notes
--- | --- | ---
Pure Transformer | \(O(T^2 N)\) | Fast on short sequences but bottlenecked on long ones
Pure Mamba | \(O(T N^2)\) | Efficient on long sequences but inferior to Transformer on short ones
TransMamba (optimal \(P\)) | \(O(P^2 N + (T-P)N^2)\) | Combines the advantages of both
  • Under the setting \(N = 1536\), \(T = 8192\), the optimal TransPoint is \(P \approx 2048\), yielding peak training efficiency.

Ablation Study

Configuration | Performance
--- | ---
Full TransMamba | Best: optimal efficiency and performance
w/o Memory Converter | Significant degradation: the SSM loses prefix context
Fixed TransPoint (uniform across layers) | Suboptimal: abrupt switching incurs a performance loss
Log-distributed TransPoint (cycled every 8 layers) | Best: progressive switching balances diversity and efficiency
Pure Attention (\(P = T\)) | Strong on short sequences but inefficient on long ones
Pure SSM (\(P = 0\)) | Strong on long sequences but insufficient expressiveness on short ones

Key Findings

  • The Memory Converter is a necessary component — removing it causes the SSM portion to collapse in performance, confirming the importance of lossless conversion.
  • Sequence-level mixing is more flexible and efficient than layer-level mixing — short sequences benefit from Attention's global interaction, while long sequences benefit from SSM's linear complexity.
  • The optimal TransPoint position is related to the ratio between model dimension \(N\) and sequence length \(T\) — SSM efficiency dominates when \(T > N\).
  • Assigning different TransPoints to different layers outperforms a uniform assignment — the log-distributed scheme (0/128/256/512/1024/2048/4096/8192, cycled every 8 layers) yields the best results (see the helper sketch after this list).
  • At inference time, a TransPoint strategy different from that used in training can be adopted — the most efficient structure is used during training, while a task-appropriate structure can be applied at inference.
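
As a concrete illustration of the log-distributed schedule mentioned above, here is a small, hypothetical Python helper (the function name and the clamping to the sequence length are assumptions, not the authors' code):

```python
# Hypothetical helper: assign each layer a TransPoint from the log-distributed
# cycle reported in the paper (0, 128, ..., 8192), repeating every 8 layers.
def log_distributed_transpoints(num_layers: int, seq_len: int = 8192) -> list[int]:
    cycle = [0, 128, 256, 512, 1024, 2048, 4096, 8192]
    # Clamp so that a layer's TransPoint never exceeds the current sequence length.
    return [min(cycle[i % len(cycle)], seq_len) for i in range(num_layers)]


print(log_distributed_transpoints(num_layers=16))
# [0, 128, 256, 512, 1024, 2048, 4096, 8192, 0, 128, 256, 512, 1024, 2048, 4096, 8192]
```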

Highlights & Insights

  • Validates the Transformer-Mamba unification at a deeper level: Beyond the mathematical duality, the paper empirically demonstrates that a single parameter set can operate under both computation modes — a step further than the theoretical correspondence established by Mamba2.
  • Theoretical guarantee of the Memory Converter constitutes the key technical contribution: it proves that lossless conversion from K/V to SSM hidden states is feasible, enabling computation mode switching within a sequence.
  • The shared parameter design allows the model to gracefully degrade to different architectures (pure Transformer, pure Mamba, or hybrid) depending on sequence length, providing exceptional flexibility.

Limitations & Future Work

  • Each layer is assigned only a single TransPoint; more complex multi-TransPoint structures may yield further improvements.
  • TransPoint scheduling is currently determined via prior knowledge or search; adaptive learning of TransPoints is worth exploring.
  • The Memory Converter introduces additional computation — while theoretically lossless, a quantitative analysis of practical precision impact is still needed.
  • Validation is limited to language modeling; applicability to other modalities (vision, multimodal) remains unknown.

Comparison with Related Work

  • vs. Jamba/Zamba and other layer-level hybrids: layer-level mixing operates under fixed, inflexible ratios, whereas TransMamba switches adaptively at the sequence level.
  • vs. Mamba2: Mamba2 reveals the dual form but remains an independent architecture, whereas TransMamba realizes a unified framework with actual parameter sharing.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First framework to unify Transformer and Mamba at the sequence level; the contributions of parameter sharing and lossless conversion are theoretically significant.
  • Experimental Thoroughness: ⭐⭐⭐ — Language modeling validation is thorough, but downstream task evaluation is lacking.
  • Writing Quality: ⭐⭐⭐⭐ — Architecture design is clearly presented with complete mathematical derivations.
  • Value: ⭐⭐⭐⭐ — Opens a new paradigm for Transformer-SSM hybrid architectures.