MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining
Conference: CVPR 2025
arXiv: 2410.00871
Code: yunzeliu/MAP
Area: Self-Supervised Learning
Keywords: Mamba-Transformer, hybrid backbone, masked autoregressive pretraining, self-supervised learning, vision backbone
TL;DR
This work proposes Masked Autoregressive Pretraining (MAP), a hierarchical pretraining objective that combines MAE-style modeling within each row with row-level autoregressive decoding across rows. It is presented as the first strategy to successfully pretrain hybrid Mamba-Transformer vision backbones, and it significantly outperforms MAE-only and AR-only pretraining.
Background & Motivation
Background: Hybrid Mamba-Transformer networks have recently received widespread attention, as they combine the scalability of Transformers with the efficiency advantages of Mamba in long-sequence modeling.
Limitations of Prior Work:
1. MAE is unsuitable for Mamba: MAE pretraining significantly improves ViT (+1.4) but is almost ineffective for Vim (+0.2).
2. AR is unsuitable for Transformer: AR pretraining is effective for Vim (+1.4) but provides limited improvement for ViT (+0.2).
3. Hybrid architectures require pretraining strategies compatible with both computing paradigms: existing MAE or AR strategies can fully unleash the potential of only one type of module.
Key Challenge: Transformers require bidirectional context modeling (where MAE excels), while Mamba requires sequential continuity modeling (where AR excels). Their optimal pretraining strategies are fundamentally different.
Key Insight: Design a hierarchical pretraining objective where local MAE allows the Transformer to learn local bidirectional features, while global autoregression enables Mamba to learn cross-region contextual relationships.
Method
Overall Architecture
Given an image, random masking is performed first, followed by autoregressive reconstruction on a row-by-row basis:
1. The image is divided into \(M\) rows, with each row containing \(N\) tokens.
2. Within each row, 50% of the tokens are randomly masked.
3. The HybridNet encoder processes the unmasked tokens.
4. The Transformer decoder performs row-level autoregressive decoding: the reconstruction of the \(i\)-th row depends on all tokens from the previous \(i-1\) rows plus the unmasked tokens of the current row.
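Steps 1 to 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the 14x14 token grid (a 224-pixel image with 16-pixel patches), the fixed seed, and the function names `mask_rows` and `visible_token_indices` are assumptions introduced here.

```python
import random


def mask_rows(num_rows, tokens_per_row, mask_ratio=0.5, seed=0):
    """Return a num_rows x tokens_per_row grid of booleans; True marks a masked token.

    Every row is masked independently, so each row keeps some visible tokens.
    """
    rng = random.Random(seed)
    num_masked = int(round(tokens_per_row * mask_ratio))
    grid = []
    for _ in range(num_rows):
        masked_cols = set(rng.sample(range(tokens_per_row), num_masked))
        grid.append([col in masked_cols for col in range(tokens_per_row)])
    return grid


def visible_token_indices(grid):
    """Flatten the grid row-first and list the token indices the encoder would see."""
    tokens_per_row = len(grid[0])
    return [
        r * tokens_per_row + c
        for r, row in enumerate(grid)
        for c, is_masked in enumerate(row)
        if not is_masked
    ]


# Assumed grid size: 14 rows of 14 tokens, 50% masked per row.
grid = mask_rows(num_rows=14, tokens_per_row=14, mask_ratio=0.5)
visible = visible_token_indices(grid)
print(len(visible))  # 98 visible tokens out of 196
```

Masking per row, instead of over the whole image at once, guarantees that each row still has unmasked tokens to condition on when that row is reconstructed.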
Key Designs
1. Hybrid Network Architecture: HybridNet (MMMTMMMT)
   - Function: Uses 3 Mamba layers followed by 1 Transformer layer as a block and repeats the block 8 times, for a total of 32 layers.
   - Mechanism: Several hybrid arrangements are trained from scratch and compared (MMMMMMTT, TTMMMMMM, TMMMTMMM, MMMTMMMT); MMMTMMMT achieves the best accuracy (83.12%).
   - Design Motivation: The leading Mamba layers handle sequence feature extraction, while the interspersed Transformer layers enhance both local feature modeling and long-range dependencies, so the network balances local feature extraction against contextual modeling.
2. Masked Autoregressive Decoding Strategy
   - Function: A Transformer decoder reconstructs the randomly masked image with row-level autoregression, predicting all masked tokens of one row at each step.
   - Mechanism: The loss function is \(\mathcal{L} = -\sum_{i=1}^{M}\sum_{j \in \mathbf{M}_i} \log p(\mathbf{x}_{ij} \mid \mathbf{x}_{i,j \notin \mathbf{M}_i}, \mathbf{r}_{<i})\). Here \(\mathbf{M}_i\) is read as the set of masked token positions in row \(i\) (distinct from the row count \(M\)), and \(\mathbf{r}_{<i}\) as the rows preceding row \(i\). Intra-row prediction follows the MAE style (bidirectional), while inter-row prediction follows the AR style (causal).
   - Design Motivation: Rows are chosen as sub-regions because the default scanning order in most Mamba implementations is row-first. The AR order must align with the Mamba scanning order to maximize the benefit (experimentally verified: +2.9 when aligned, only +0.2 when misaligned).
3. Key Findings from Pilot Experiments
   - Relationship between AR and Scanning Order: AR pretraining whose order matches the scanning order of Vim yields a +2.9 improvement, whereas a mismatched order yields only +0.2. This is presented as the first systematic experimental verification of this conclusion.
   - Masking Ratio: The optimal masking ratio is 20% for AR pretraining (Mamba) and 75% for MAE (Transformer); MAP uses 50% as the compromise point.
   - Reconstruction Target: Reconstructing normalized raw pixels with an MSE loss performs best, while a diffusion loss yields no significant improvement.
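The "bidirectional within a row, causal across rows" rule from the decoding strategy can be written as an attention-visibility matrix. The sketch below is one possible reading, not the authors' decoder: the function name `row_level_visibility`, the toy 2x2 grid, and the choice to let each token attend to itself are assumptions made for illustration.

```python
def row_level_visibility(masked_grid):
    """Build allowed[q][k]: may query token q attend to key token k?

    Tokens are indexed row-first. A key is visible to a query when it
    (a) lies in an earlier row, (b) is an unmasked token of the query's own
    row, or (c) is the query itself (an assumption of this sketch).
    """
    tokens_per_row = len(masked_grid[0])
    flat_masked = [m for row in masked_grid for m in row]
    total = len(flat_masked)
    allowed = []
    for q in range(total):
        q_row = q // tokens_per_row
        row_rule = []
        for k in range(total):
            k_row = k // tokens_per_row
            earlier_row = k_row < q_row
            same_row_visible = k_row == q_row and not flat_masked[k]
            row_rule.append(earlier_row or same_row_visible or k == q)
        allowed.append(row_rule)
    return allowed


# Toy 2x2 grid; True marks a masked token.
toy_grid = [[False, True],
            [True, False]]
for rule in row_level_visibility(toy_grid):
    print("".join("x" if ok else "." for ok in rule))
# x...
# xx..
# xxxx
# xx.x
```

No token in the first row sees the second row (causal across rows), while the masked token at index 2 sees the unmasked token at index 3 in its own row (bidirectional within a row).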
Loss & Training
- Pretraining: AdamW optimizer, 1600 epochs, random cropping used as the sole data augmentation, masking ratio of 50%.
- Fine-tuning: Direct fine-tuning for 400 epochs.
- Reconstruction Target: MSE loss of normalized raw pixels.
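A reconstruction loss of this kind can be sketched as follows. The per-patch standardization (zero mean, unit variance, with a small `eps`) follows the convention popularized by MAE and is an assumption here, as are the helper names `normalize_patch` and `masked_mse`; the summary only states that normalized raw pixels are regressed with MSE.

```python
from statistics import fmean, pvariance


def normalize_patch(pixels, eps=1e-6):
    """Standardize one patch to zero mean and (near) unit variance."""
    mean = fmean(pixels)
    var = pvariance(pixels, mu=mean)
    return [(p - mean) / (var + eps) ** 0.5 for p in pixels]


def masked_mse(predictions, target_patches, masked):
    """Mean squared error against normalized targets, averaged over masked patches only."""
    per_patch = []
    for pred, target, is_masked in zip(predictions, target_patches, masked):
        if not is_masked:
            continue  # visible patches do not contribute to the loss
        target_norm = normalize_patch(target)
        per_patch.append(fmean([(p - t) ** 2 for p, t in zip(pred, target_norm)]))
    return fmean(per_patch) if per_patch else 0.0


targets = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 2.0, 2.0]]
masked = [True, False]  # only the first patch is masked
perfect = [normalize_patch(targets[0]), [0.0] * 4]
zeros = [[0.0] * 4, [0.0] * 4]
print(masked_mse(perfect, targets, masked))  # 0.0
print(masked_mse(zeros, targets, masked))    # close to 1.0: the normalized target has unit variance
```

Predicting all zeros scores roughly 1.0 because the normalized target has unit variance, which gives a convenient baseline for judging whether a decoder has learned anything.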
Key Experimental Results
Main Results (ImageNet-1K Classification)
| Model | Pretraining | Params | Top-1 Acc (%) |
|---|---|---|---|
| HybridNet-B | None | 128M | 83.1 |
| HybridNet-B | MAE | 128M | 83.9 |
| HybridNet-B | AR | 128M | 83.8 |
| HybridNet-B | CL | 128M | 83.1 |
| HybridNet-B | MAP | 128M | 84.9 |
| HybridNet-B (384) | MAP | 128M | 85.5 |
| HybridNet-L (384) | MAP | 443M | 86.2 |
| MambaR-B | AR | 99M | 83.7 |
| MambaR-B | MAP | 99M | 84.0 |
| ViT-B | MAE | 86M | 83.6 |
| ViT-B | MAP | 86M | 83.6 |
| ViT-L | MAE | 307M | 85.9 |
| ViT-L | MAP | 307M | 86.1 |
| MambaVision-B | None | 97M | 84.2 |
| MambaVision-B | MAP | 97M | 84.9 |
| MambaVision-L | None | 241M | 85.3 |
| MambaVision-L | MAP | 241M | 86.4 |
Ablation Study
Masking Strategy:
| Strategy | Accuracy (%) |
|---|---|
| Scratch | 83.1 |
| Random masking | 84.9 |
| Sequential masking | 84.0 |
| Diagonal masking | 83.8 |
Masking Ratio:
| Ratio | Accuracy (%) |
|---|---|
| 0% | 83.3 |
| 25% | 84.5 |
| 50% | 84.9 |
| 75% | 84.2 |
Decoder Strategy:
| Strategy | Accuracy (%) |
|---|---|
| AR decoder | 83.7 |
| MAE decoder | 84.1 |
| Local MAE | 84.2 |
| MAP (ours) | 84.9 |
Downstream Tasks
| Task | Backbone | Metric |
|---|---|---|
| ADE20K Semantic Segmentation | HybridNet-S + MAP | mIoU 46.9 (vs 45.6 without pretraining) |
| COCO Detection | HybridNet-Ti + MAP | AP_box 47.3 (vs 45.9 without pretraining) |
Key Findings
- MAP yields the most significant improvement on hybrid architectures: On HybridNet-B, MAP (+1.8) >> MAE (+0.8) ≈ AR (+0.7) >> CL (0).
- MAP is also effective for pure Mamba: On MambaR-B, MAP (84.0) > AR (83.7) > MAE (83.1). The local MAE mechanism in MAP enhances Mamba's local feature modeling.
- MAP outperforms MAE on large models: On ViT-L, MAP (86.1) > MAE (85.9), whereas the two tie on ViT-B (83.6). This suggests that the benefit of autoregressive modeling grows with model scale, in line with scaling behavior observed in LLMs.
- Further improvement at 384 resolution: HybridNet-B gains 0.6 points at 384 resolution over 224 resolution (85.5 vs. 84.9), which is consistent with Mamba's long-sequence modeling capability paying off on longer token sequences.
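The gains quoted in these findings can be recomputed from the HybridNet-B rows of the main results table. The snippet below is only a bookkeeping check; the dictionary layout and variable names are illustrative.

```python
# Top-1 accuracy (%) of HybridNet-B at 224 resolution, copied from the main results table.
top1 = {"None": 83.1, "MAE": 83.9, "AR": 83.8, "CL": 83.1, "MAP": 84.9}

baseline = top1["None"]  # training from scratch
gains = {
    name: round(acc - baseline, 1)
    for name, acc in top1.items()
    if name != "None"
}

for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name}: {gain:+.1f}")
# MAP: +1.8
# MAE: +0.8
# AR: +0.7
# CL: +0.0

# Gain from raising the fine-tuning resolution from 224 to 384 with MAP.
resolution_gain = round(85.5 - top1["MAP"], 1)
print(resolution_gain)  # 0.6
```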
Highlights & Insights
- In-depth Pilot Study: Systematically analyzed the different effects of MAE/AR/CL on Transformer vs Mamba, verifying for the first time that the AR order must align with the Mamba scanning order.
- Elegant Unified Framework: Organically integrates MAE (local bidirectional) and AR (global causal) into a row-level decoding paradigm.
- Broad Applicability: MAP applies not only to the custom HybridNet but also improves existing hybrid frameworks like MambaVision.
- Optimal 50% Masking Ratio: Balances between MAE's 75% and AR's 20%, naturally derived from the requirements of the hybrid architecture.
Limitations & Future Work
- Hybrid architectures still do not outperform pure Transformer + MAE under the same setting (the focus of MAP is on unleashing the potential of hybrid architectures).
- The current row-level partitioning is simple; a more fine-grained clustering strategy could theoretically yield better results.
- Other modalities such as video and point clouds have not been explored (left as future work by the authors).
- Pretraining requires 1600 epochs, representing a relatively large computational overhead.
Related Work & Insights
- MAE: Highly effective for Transformer pretraining, where a 75% high masking ratio and an asymmetric encoder-decoder are key.
- ARM: Performs cluster-based AR pretraining for cross-scanned Mamba, essentially a hybrid of row-first and column-first scanning.
- VAR: Proposes the next-scale prediction paradigm, preserving spatial locality.
- MAR: Uses AR output as a conditioning signal for diffusion models in generation, which inspired the exploration of diffusion loss in this work.
Rating ⭐⭐⭐⭐
Novelty: ⭐⭐⭐⭐ First systematic study on hybrid Mamba-Transformer pretraining; the MAP paradigm is highly insightful.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Pilot study + main experiments + multiple downstream tasks + exhaustive ablations.
Writing Quality: ⭐⭐⭐⭐ Clearly structured; the pilot study naturally motivates the method design with coherent logic.
Value: ⭐⭐⭐⭐ Provides a general methodology and best practices for pretraining hybrid architectures.