MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining

Conference: CVPR 2025
arXiv: 2410.00871
Code: yunzeliu/MAP
Area: Self-Supervised Learning
Keywords: Mamba-Transformer, hybrid backbone, masked autoregressive pretraining, self-supervised learning, vision backbone

TL;DR

This work proposes Masked Autoregressive Pretraining (MAP), a hierarchical pretraining objective that combines local MAE-style modeling within each row with row-level autoregressive decoding across rows. It is presented as the first strategy to pretrain hybrid Mamba-Transformer vision backbones effectively, and it outperforms MAE-only and AR-only pretraining (84.9 vs. 83.9 and 83.8 top-1 on HybridNet-B).

Background & Motivation

Background: Hybrid Mamba-Transformer networks have recently received widespread attention, as they combine the scalability of Transformers with the efficiency advantages of Mamba in long-sequence modeling.

Limitations of Prior Work:
1. MAE is unsuitable for Mamba: MAE pretraining significantly improves ViT (+1.4) but is almost ineffective for Vim (+0.2).
2. AR is unsuitable for Transformer: AR pretraining is effective for Vim (+1.4), but provides limited improvement for ViT (+0.2).
3. Hybrid architectures require pretraining strategies compatible with both computing paradigms: existing MAE or AR strategies can only fully unleash the potential of one type of module.

Key Challenge: Transformers require bidirectional context modeling (where MAE excels), while Mamba requires sequential continuity modeling (where AR excels). Their optimal pretraining strategies are fundamentally different.

Key Insight: Design a hierarchical pretraining objective where local MAE allows the Transformer to learn local bidirectional features, while global autoregression enables Mamba to learn cross-region contextual relationships.

Method

Overall Architecture

Given an image, random masking is performed first, followed by autoregressive reconstruction on a row-by-row basis:
1. The image is divided into \(M\) rows, with each row containing \(N\) tokens.
2. Within each row, 50% of the tokens are randomly masked.
3. The HybridNet encoder processes the unmasked tokens.
4. The Transformer Decoder performs row-level autoregressive decoding: the reconstruction of the \(i\)-th row depends on all tokens from the previous \(i-1\) rows plus the unmasked tokens of the current row.
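The masking and context rules in the steps above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: the function names, the fixed seed, and the 14 x 14 token grid (a 224-pixel image with 16-pixel patches) are assumptions made for the example.

```python
import random

def row_wise_random_mask(num_rows, tokens_per_row, mask_ratio=0.5, seed=0):
    """Return, for each row, the sorted column indices that are masked."""
    rng = random.Random(seed)
    num_masked = int(tokens_per_row * mask_ratio)
    masks = []
    for _ in range(num_rows):
        # Sample masked columns independently inside every row.
        masked_cols = rng.sample(range(tokens_per_row), num_masked)
        masks.append(sorted(masked_cols))
    return masks

def decoder_context(row, masks, tokens_per_row):
    """Positions (row, col) the decoder may use when reconstructing `row`."""
    # All tokens of the previous rows are available (autoregressive part).
    previous_rows = [(r, c) for r in range(row) for c in range(tokens_per_row)]
    # Only the unmasked tokens of the current row are available (MAE part).
    masked = set(masks[row])
    visible_here = [(row, c) for c in range(tokens_per_row) if c not in masked]
    return previous_rows + visible_here

# Assumed grid: 14 rows of 14 tokens, 50% masking per row.
masks = row_wise_random_mask(num_rows=14, tokens_per_row=14)
context = decoder_context(2, masks, tokens_per_row=14)
print(len(masks[0]), len(context))  # 7 35
```

For the third row (index 2), the context holds the 28 tokens of the two rows before it plus the 7 unmasked tokens of the row itself.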

Key Designs

1. Hybrid Network Architecture HybridNet (MMMTMMMT)
- Function: Uses 3 Mamba layers followed by 1 Transformer layer as a block (MMMT) and repeats it 8 times, for a total of 32 layers.
- Mechanism: The arrangement was selected by training several hybrid layouts from scratch (MMMMMMTT, TTMMMMMM, TMMMTMMM, MMMTMMMT); MMMTMMMT achieved the best accuracy (83.12%).
- Design Motivation: The leading Mamba layers handle sequential feature extraction, while the interspersed Transformer layers refine local features and capture long-range dependencies, so each block balances local feature extraction with contextual modeling.
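The layer count above follows directly from the block pattern. A minimal sketch, where `build_layer_pattern` and the layer names are hypothetical and stand in for real Mamba and Transformer modules:

```python
def build_layer_pattern(block="MMMT", num_blocks=8):
    """Expand a block string into an ordered list of layer types."""
    names = {"M": "mamba", "T": "transformer"}
    return [names[ch] for ch in block * num_blocks]

layers = build_layer_pattern()
print(len(layers), layers.count("mamba"), layers.count("transformer"))  # 32 24 8
```

The other layouts in the comparison (for example `TMMMTMMM`) are written as 8-layer patterns, so they would correspond to a different `block` string and a matching `num_blocks`.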

2. Masked Autoregressive Decoding Strategy
- Function: A Transformer Decoder reconstructs the randomly masked image by row-level autoregression, predicting all masked tokens of a single row at each step.
- Mechanism: The loss function is \(\mathcal{L} = -\sum_{i=1}^{M}\sum_{j \in \mathbf{M}_i} \log p(\mathbf{x}_{ij} \mid \mathbf{x}_{i,j \notin \mathbf{M}_i}, \mathbf{r}_{<i})\), where \(M\) is the number of rows, \(\mathbf{M}_i\) is the set of masked positions in row \(i\), \(\mathbf{x}_{i,j \notin \mathbf{M}_i}\) are the unmasked tokens of row \(i\), and \(\mathbf{r}_{<i}\) denotes all rows before row \(i\). Intra-row prediction follows the MAE style (bidirectional), while inter-row prediction follows the AR style (causal).
- Design Motivation: Rows are chosen as sub-regions because the default scanning order in most Mamba implementations is row-first. The AR order must align with the Mamba scanning order to maximize benefits (experimentally verified: +2.9 when aligned, and only +0.2 when misaligned).
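One way to picture "bidirectional within a row, causal across rows" is as a block-wise attention mask over the flattened token sequence. The sketch below is an assumption about how such a mask could be built, not the decoder implementation from the paper; `row_causal_mask` is a hypothetical helper, and the handling of masked positions inside the current row (for example with mask tokens) is left out.

```python
def row_causal_mask(num_rows, tokens_per_row):
    """allowed[q][k] is True when query token q may attend to key token k."""
    total = num_rows * tokens_per_row
    allowed = []
    for q in range(total):
        q_row = q // tokens_per_row
        # A token sees its own row (bidirectional) and every earlier row (causal).
        allowed.append([(k // tokens_per_row) <= q_row for k in range(total)])
    return allowed

mask = row_causal_mask(num_rows=3, tokens_per_row=2)
for line in mask:
    print("".join("1" if ok else "0" for ok in line))
# 110000
# 110000
# 111100
# 111100
# 111111
# 111111
```

Each 2 x 2 block on the diagonal is fully visible (the MAE-style part), and everything above the block diagonal is hidden (the AR-style part).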

3. Key Findings from Pilot Experiments
- Relationship between AR and Scanning Order: AR pretraining whose prediction order matches the scanning order of Vim yields a +2.9 improvement, whereas a mismatched order yields only +0.2. The work presents this as the first systematic experimental verification of the relationship.
- Masking Ratio: The optimal masking ratio is 20% for AR pretraining (Mamba) and 75% for MAE (Transformer); MAP uses 50% as the compromise point.
- Reconstruction Target: Reconstructing normalized raw pixels with an MSE loss performs best, while a diffusion loss yields no significant improvement.
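The idea of an AR order being "aligned" with the scanning order can be illustrated with a toy check: under a row-first scan every row is finished before the next one starts, so row-level autoregression follows the scan, whereas a column-first scan interleaves the rows. The helpers below are hypothetical and only illustrate the concept; they do not reproduce the pilot experiment.

```python
def scan_order(num_rows, num_cols, mode):
    """Token visiting order as (row, col) pairs for a row-first or column-first scan."""
    if mode == "row":
        return [(r, c) for r in range(num_rows) for c in range(num_cols)]
    if mode == "column":
        return [(r, c) for c in range(num_cols) for r in range(num_rows)]
    raise ValueError(f"unknown scan mode: {mode}")

def rows_are_sequential(order):
    """True if the scan finishes each row before starting the next one."""
    row_sequence = [r for r, _ in order]
    return row_sequence == sorted(row_sequence)

print(rows_are_sequential(scan_order(3, 4, "row")))     # True
print(rows_are_sequential(scan_order(3, 4, "column")))  # False
```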

Loss & Training

Pretraining: AdamW optimizer, 1600 epochs, random cropping used as the sole data augmentation, masking ratio of 50%.
Fine-tuning: Direct fine-tuning for 400 epochs.
Reconstruction Target: MSE loss of normalized raw pixels.
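As a rough illustration of the reconstruction target, the sketch below normalizes each patch by its own mean and variance and averages the squared error over masked patches only. The per-patch normalization, the population variance, the `eps` value, and the function names are assumptions borrowed from common MAE-style practice; the notes above only state that the target is normalized raw pixels with an MSE loss.

```python
import math

def normalize_patch(pixels, eps=1e-6):
    """Normalize one flattened patch to zero mean and (almost) unit variance."""
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [(p - mean) / math.sqrt(var + eps) for p in pixels]

def masked_mse(predictions, patches, masked_indices):
    """Mean squared error against normalized targets, over masked patches only."""
    total = 0.0
    for idx in masked_indices:
        target = normalize_patch(patches[idx])
        pred = predictions[idx]
        total += sum((a - b) ** 2 for a, b in zip(pred, target)) / len(target)
    return total / len(masked_indices)

patches = [[0.0, 1.0, 2.0, 3.0], [5.0, 5.0, 6.0, 6.0]]
perfect = [normalize_patch(p) for p in patches]
zeros = [[0.0] * 4 for _ in patches]
print(masked_mse(perfect, patches, [0, 1]))       # 0.0
print(round(masked_mse(zeros, patches, [1]), 3))  # 1.0
```

Predicting all zeros gives a loss of about 1 because the normalized target has roughly unit variance, which makes losses comparable across bright and dark patches.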

Key Experimental Results

Main Results (ImageNet-1K Classification)

Model	Pretraining	Params	Top-1 Acc (%)
HybridNet-B	None	128M	83.1
HybridNet-B	MAE	128M	83.9
HybridNet-B	AR	128M	83.8
HybridNet-B	CL	128M	83.1
HybridNet-B	MAP	128M	84.9
HybridNet-B (384)	MAP	128M	85.5
HybridNet-L (384)	MAP	443M	86.2
MambaR-B	AR	99M	83.7
MambaR-B	MAP	99M	84.0
ViT-B	MAE	86M	83.6
ViT-B	MAP	86M	83.6
ViT-L	MAE	307M	85.9
ViT-L	MAP	307M	86.1
MambaVision-B	None	97M	84.2
MambaVision-B	MAP	97M	84.9
MambaVision-L	None	241M	85.3
MambaVision-L	MAP	241M	86.4

Ablation Study

Masking Strategy:

Strategy	Accuracy (%)
Scratch	83.1
Random masking	84.9
Sequential masking	84.0
Diagonal masking	83.8

Masking Ratio:

Masking Ratio	Accuracy (%)
0%	83.3
25%	84.5
50%	84.9
75%	84.2

Decoder Strategy:

Strategy	Accuracy (%)
AR decoder	83.7
MAE decoder	84.1
Local MAE	84.2
MAP (ours)	84.9

Downstream Tasks

Task	Backbone	Metric
ADE20K Semantic Segmentation	HybridNet-S + MAP	mIoU 46.9 (vs 45.6 without pretraining)
COCO Detection	HybridNet-Ti + MAP	AP_box 47.3 (vs 45.9 without pretraining)

Key Findings

MAP yields the most significant improvement on hybrid architectures: On HybridNet-B, MAP (+1.8) >> MAE (+0.8) ≈ AR (+0.7) >> CL (0).
MAP is also effective for pure Mamba: On MambaR-B, MAP (84.0) > AR (83.7) > MAE (83.1). The local MAE mechanism in MAP enhances Mamba's local feature modeling.
MAP outperforms MAE on large pure Transformers: On ViT-L, MAP (86.1) > MAE (85.9), whereas the two tie on ViT-B (83.6). The benefit of the autoregressive component thus shows up only at larger scale, which is consistent with the scaling trends reported for autoregressive LLMs.
Further improvement at 384 resolution: HybridNet-B at 384 resolution gains 0.6 over 224 resolution (85.5 vs. 84.9), suggesting that Mamba's long-sequence modeling capability pays off when the token sequence is longer.
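The gains quoted for HybridNet-B can be recomputed from the main results table. The snippet below only does that arithmetic; the dictionary keys are labels chosen for this example.

```python
# Top-1 accuracy of HybridNet-B (128M) under each pretraining strategy,
# copied from the main results table.
top1 = {"None": 83.1, "MAE": 83.9, "AR": 83.8, "CL": 83.1, "MAP": 84.9}

baseline = top1["None"]
gains = {name: round(acc - baseline, 1) for name, acc in top1.items() if name != "None"}
print(gains)  # {'MAE': 0.8, 'AR': 0.7, 'CL': 0.0, 'MAP': 1.8}
```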

Highlights & Insights

In-depth Pilot Study: Systematically analyzed the different effects of MAE/AR/CL on Transformer vs Mamba, verifying for the first time that the AR order must align with the Mamba scanning order.
Elegant Unified Framework: Organically integrates MAE (local bidirectional) and AR (global causal) into a row-level decoding paradigm.
Broad Applicability: MAP applies not only to the custom HybridNet but also improves existing hybrid frameworks like MambaVision.
Optimal 50% Masking Ratio: Balances between MAE's 75% and AR's 20%, naturally derived from the requirements of the hybrid architecture.

Limitations & Future Work

Hybrid architectures still do not outperform pure Transformer + MAE under the same setting (the focus of MAP is on unleashing the potential of hybrid architectures).
The current row-level partitioning is simple; a more fine-grained clustering strategy could theoretically yield better results.
Other modalities such as video and point clouds have not been explored (left as future work by the authors).
Pretraining requires 1600 epochs, representing a relatively large computational overhead.

Related Work

MAE: Highly effective for Transformer pretraining, where a 75% high masking ratio and an asymmetric encoder-decoder are key.
ARM: Performs cluster-based AR pretraining for cross-scanned Mamba, essentially a hybrid of row-first and column-first scanning.
VAR: Proposes the next-scale prediction paradigm, preserving spatial locality.
MAR: Uses AR output as a conditioning signal for diffusion models in generation, which inspired the exploration of diffusion loss in this work.

Rating ⭐⭐⭐⭐

Novelty: ⭐⭐⭐⭐ First systematic study on hybrid Mamba-Transformer pretraining; the MAP paradigm is highly insightful.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Pilot study + main experiments + multiple downstream tasks + exhaustive ablations.
Writing Quality: ⭐⭐⭐⭐ Clearly structured; the pilot study naturally motivates the method design with coherent logic.
Value: ⭐⭐⭐⭐ Provides a general methodology and best practices for pretraining hybrid architectures.