MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation

Conference: CVPR 2025
arXiv: 2503.20519
Code: Project Page
Area: 3D Vision
Keywords: 3D generation, masked auto-regressive, pyramid VAE, cascaded generation, condition augmentation

TL;DR

This paper proposes a progressive 3D generation framework using a Pyramid VAE paired with Cascaded MAR (MAR-LR → MAR-HR). By utilizing random masking to accommodate the unordered nature of 3D tokens and employing a condition augmentation strategy to mitigate cumulative errors during resolution scaling, the method achieves state-of-the-art (SOTA) performance among open-source approaches.

Background & Motivation

Background: 3D generation methods generally fall into three paradigms: Large Reconstruction Models (LRMs) that reconstruct directly from a single image but lack generative priors; multi-view diffusion combined with reconstruction, which is limited by multi-view consistency; and 3D VAE integrated with diffusion/AR for direct generation, which is currently the most promising direction.

Limitations of Prior Work:

1. Difficulty in scaling token resolution: Existing 3D VAEs and generators struggle to preserve geometric details with a limited number of tokens. Simply increasing the token count makes the Transformer's attention cost grow quadratically, to the point of requiring hundreds of GPUs for training.
2. Unordered nature of 3D data: 3D latent tokens have no inherent sequence, which conflicts with the sequential prediction paradigm of autoregressive models.
3. Vector quantization compression loss: Traditional VQ methods incur significant compression losses on 3D meshes.

Key Challenge: High-quality 3D generation needs a larger token count to represent geometric details, but directly scaling the token count runs into two problems at once: computational cost and training convergence.

Key Insight: Adopt a progressive strategy that first generates low-resolution tokens to capture the global shape, and then refines them into high-resolution tokens with a super-resolution model.

Method

Overall Architecture

MAR-3D consists of two components, a Pyramid VAE and a Cascaded MAR (MAR-LR followed by MAR-HR). Generation proceeds in four stages:

1. Pyramid VAE: Encodes multi-resolution point clouds into multi-scale latent tokens, at either 256 or 1024 tokens.
2. MAR-LR: Generates 256 low-resolution tokens conditioned on image tokens.
3. MAR-HR: Generates 1024 high-resolution tokens conditioned on the low-resolution tokens and the image tokens.
4. Mesh extraction: Marching Cubes extracts the final mesh from the occupancy field.

Key Designs

1. Pyramid VAE
- Function: Downsamples the input point cloud into \(K\) levels (16384/4096/1024 points). Each level interacts with the learnable queries \(\mathbf{S}\) through its own cross-attention layer. Coarse levels capture structural features while fine levels extract geometric details; the per-level outputs are summed and passed through self-attention to obtain the latent tokens \(\mathbf{X}\).
- Mechanism: \(\mathbf{X} = \text{SelfAttn}\left(\sum_{k=1}^{K}\text{CrossAttn}^k(\mathbf{S}, \hat{\mathbf{P}}^k)\right)\). The training objective is a BCE loss (occupancy prediction) plus a KL divergence term (latent regularization).
- Design Motivation: The 1024-token Pyramid VAE achieves higher reconstruction quality than a 2048-token single-level VAE, so the pyramid design compresses geometry into tokens more efficiently.
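
The token aggregation formula above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it uses single-head attention, random matrices standing in for learned key/value projections, random per-point features, and hypothetical sizes (`dim = 32`, 256 queries). Only the structure (one cross-attention per level, a sum, then self-attention) comes from the text.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def pyramid_tokens(queries, level_feats, level_weights):
    """X = SelfAttn(sum_k CrossAttn^k(S, P^k)), with one weight pair per level."""
    fused = np.zeros_like(queries)
    for feats, (w_k, w_v) in zip(level_feats, level_weights):
        # Each level has its own (here: random, hypothetical) key/value projections.
        fused += attention(queries, feats @ w_k, feats @ w_v)
    return attention(fused, fused, fused)

rng = np.random.default_rng(0)
dim, num_tokens = 32, 256  # hypothetical feature width; 256 matches the low-res token count
queries = rng.standard_normal((num_tokens, dim))
# Stand-in per-point features for the three pyramid levels named in the text.
level_feats = [rng.standard_normal((n, dim)) for n in (16384, 4096, 1024)]
level_weights = [
    (rng.standard_normal((dim, dim)) / np.sqrt(dim),
     rng.standard_normal((dim, dim)) / np.sqrt(dim))
    for _ in level_feats
]
latent = pyramid_tokens(queries, level_feats, level_weights)
print(latent.shape)  # (256, 32)
```

Because the queries are shared across levels, the number of output tokens is fixed by the query count rather than by the number of input points, which is what keeps the token budget constant as finer levels are added.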

2. Cascaded MAR
- Function: MAR-LR and MAR-HR share the same architecture (MAE encoder-decoder + MLP denoising network), with the main difference being that MAR-HR additionally accepts low-resolution tokens as input. Random masks (ratio 0.7-1.0) are used during training, and multiple tokens are decoded in parallel in a random order during inference.
- Mechanism: Decomposes the joint distribution into two components: temporal (diffusion) and spatial (autoregressive). Each token is supervised with a diffusion loss instead of cross-entropy because the latent tokens reside in a continuous space. A cosine schedule is applied during inference to control the token count generated at each step (fewer in the initial stages, more in the later stages).
- Design Motivation: 3D latent tokens lack an inherent order; thus, random masking combined with random-order decoding naturally fits this unordered characteristic. The cascaded strategy avoids the convergence difficulties associated with directly training on 1024 tokens.
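
To make the cosine schedule and random-order decoding concrete, here is a small sketch. The exact formula is an assumption: it follows the common masked-generation recipe in which the fraction of tokens still masked after step s is cos(π/2 · s/S). The function names and the choice of 8 steps are hypothetical, not taken from the paper.

```python
import math
import numpy as np

def tokens_per_step(num_tokens, num_steps):
    """Tokens decoded at each step when the masked fraction follows a cosine curve."""
    counts, masked = [], num_tokens
    for step in range(1, num_steps + 1):
        if step < num_steps:
            ratio = math.cos(math.pi / 2 * step / num_steps)
            # Decode at least one token per step; never go below zero masked tokens.
            remaining = max(0, min(math.floor(num_tokens * ratio), masked - 1))
        else:
            remaining = 0  # the final step decodes everything that is left
        counts.append(masked - remaining)
        masked = remaining
    return counts

def random_order_groups(num_tokens, num_steps, rng):
    """Split a random permutation of token indices into per-step decode groups."""
    order = rng.permutation(num_tokens)
    bounds = np.cumsum([0] + tokens_per_step(num_tokens, num_steps))
    return [order[a:b] for a, b in zip(bounds[:-1], bounds[1:])]

counts = tokens_per_step(256, 8)
groups = random_order_groups(256, 8, np.random.default_rng(0))
print(counts)
print([len(g) for g in groups])
```

Under this assumed formula the per-step counts grow from a handful of tokens to several dozen, matching the "fewer first, more later" behaviour described above, and every token index is decoded exactly once.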

3. Condition Augmentation
- Function: During MAR-HR training, Gaussian noise is added to the input low-resolution tokens: \(x_l' = t\epsilon + (1-t)x_l\), where \(t \sim \mathcal{U}(0.4, 0.6)\). During inference, \(t\) is fixed to 0.5.
- Mechanism: During training, MAR-HR receives "clean" low-resolution tokens encoded by the VAE. However, during inference, it receives "noisy" tokens generated by MAR-LR, resulting in a train-test gap. Adding noise narrows this gap.
- Design Motivation: Inspired by cascaded diffusion models, this strategy mitigates cumulative errors. Ablation studies show that without condition augmentation, the F-Score drops from 0.944 to 0.902.
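
The noise-mixing rule \(x_l' = t\epsilon + (1-t)x_l\) is simple enough to write out directly. In this sketch the function name, the `training` flag, and the token shape (256 tokens with a hypothetical 64 channels) are illustrative assumptions; the sampling range and the inference value of \(t\) are the ones stated above.

```python
import numpy as np

def augment_condition(x_low, rng, training=True):
    """Mix Gaussian noise into low-resolution tokens: x' = t * eps + (1 - t) * x."""
    # t ~ U(0.4, 0.6) during training, fixed to 0.5 at inference (values from the text).
    t = rng.uniform(0.4, 0.6) if training else 0.5
    eps = rng.standard_normal(x_low.shape)
    return t * eps + (1.0 - t) * x_low, t

rng = np.random.default_rng(0)
x_low = rng.standard_normal((256, 64))  # hypothetical latent shape
x_train, t_train = augment_condition(x_low, rng, training=True)
x_infer, t_infer = augment_condition(x_low, rng, training=False)
print(x_train.shape, round(t_train, 3), t_infer)
```

Since MAR-HR only ever sees perturbed low-resolution tokens, imperfections in MAR-LR samples at inference look closer to what the model encountered during training.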

Loss & Training

VAE Loss: \(\mathcal{L}_{\text{vae}} = \text{BCE}(\hat{\mathcal{O}}, \mathcal{D}(\gamma(x), \mathbf{X})) + \lambda_{\text{kl}}\mathcal{L}_{\text{kl}}\)
MAR Loss: Diffusion loss \(\mathcal{L} = \mathbb{E}\left[\left\| \epsilon - \epsilon_\theta(x^t \mid t, z) \right\|^2\right]\)
Two-stage Training Data: Initially trained for 200 epochs on 260K Objaverse meshes, and then fine-tuned for 100 epochs on 60K high-quality meshes.
Rotation Augmentation: Renders 56 conditional views (8 base \(\times\) 7 random rotations) for each mesh, synchronously rotating the 3D mesh to ensure image-latent consistency.
CFG: Conditional features are randomly dropped with a probability of 0.1, using a linear CFG schedule \(\omega_s = s \cdot \lambda_{cfg} / S\) during inference.
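
The classifier-free guidance settings can be sketched as below. The schedule \(\omega_s = s \cdot \lambda_{cfg} / S\) and the drop probability 0.1 come from the text. The way the conditional and unconditional predictions are combined is an assumption (the common form \(\epsilon_u + \omega(\epsilon_c - \epsilon_u)\)), as are the function names, the 1-indexed step convention, and the example value \(\lambda_{cfg} = 3\).

```python
import numpy as np

def cfg_scale(step, num_steps, lambda_cfg):
    """Linear guidance schedule: omega_s = s * lambda_cfg / S (step assumed 1-indexed)."""
    return step * lambda_cfg / num_steps

def guided_noise(eps_cond, eps_uncond, omega):
    """Assumed combination rule: eps_u + omega * (eps_c - eps_u)."""
    return eps_uncond + omega * (eps_cond - eps_uncond)

def drop_condition(cond, null_cond, rng, p_drop=0.1):
    """Training-time dropout of the conditioning features with probability p_drop."""
    return null_cond if rng.random() < p_drop else cond

num_steps, lambda_cfg = 8, 3.0  # hypothetical values for illustration
scales = [cfg_scale(s, num_steps, lambda_cfg) for s in range(1, num_steps + 1)]
print(scales)

rng = np.random.default_rng(0)
eps_c, eps_u = rng.standard_normal(4), rng.standard_normal(4)
print(guided_noise(eps_c, eps_u, scales[-1]))
```

With a linear schedule the early decoding steps are only weakly guided, and guidance reaches its full strength \(\lambda_{cfg}\) at the last step.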

Key Experimental Results

Main Results

Method	GSO F-Score↑	GSO CD↓	GSO NC↑	OmniObj F-Score↑	OmniObj CD↓	OmniObj NC↑
LGM	0.745	0.813	0.685	0.738	0.821	0.677
CraftsMan	0.776	0.785	0.687	0.771	0.798	0.675
TripoSR	0.834	0.644	0.727	0.825	0.621	0.731
InstantMesh	0.923	0.415	0.780	0.918	0.427	0.779
MAR-3D	0.944	0.351	0.835	0.931	0.364	0.826

On GSO, CD is reduced by 15.4% relative to InstantMesh (0.415 to 0.351); on OmniObj the reduction is 14.8% (0.427 to 0.364).
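
The relative CD reduction against InstantMesh can be recomputed from the table with a few lines; the helper name is illustrative, and the numbers are copied from the InstantMesh and MAR-3D rows above.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0

# (InstantMesh CD, MAR-3D CD) per benchmark, taken from the main results table.
cd = {"GSO": (0.415, 0.351), "OmniObj": (0.427, 0.364)}
for name, (instantmesh, mar3d) in cd.items():
    print(f"{name}: {relative_reduction(instantmesh, mar3d):.1f}%")
# GSO: 15.4%
# OmniObj: 14.8%
```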

Ablation Study

Configuration	F-Score↑	CD↓	NC↑
w/o Pyramid VAE	0.928	0.397	0.807
w/o condition augmentation	0.902	0.435	0.789
w/o MAR-HR	0.921	0.411	0.794
w/o rotation augmentation	0.934	0.369	0.821
Full Model	0.944	0.351	0.835

Key Findings

Pyramid VAE is highly efficient and effective: The 1024-token Pyramid VAE outperforms the 2048-token single-level VAE in reconstruction quality, saving half the token budget while keeping more geometric details.
Condition augmentation is crucial: Removing it degrades CD from 0.351 to 0.435 (+24%), the largest drop among all ablated components.
Comparison between MAR and DiT: MAR already outperforms DiT under the same budget of 256 tokens. When attempting direct training with 1024 tokens, both struggle to converge; however, MAR successfully scales up using condition augmentation, while the DiT version still shows noticeable noise.
Progressive advantage of autoregressive decoding: The cosine schedule (generating fewer tokens initially, and more later) outperforms the uniform schedule. This aligns with intuition as initial tokens are harder to predict.
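
As a quick check of the ablation ranking, the relative CD increase of each variant over the full model can be computed from the ablation table. The variable names are illustrative; the values are copied from the table.

```python
full_cd = 0.351
ablation_cd = {
    "w/o Pyramid VAE": 0.397,
    "w/o condition augmentation": 0.435,
    "w/o MAR-HR": 0.411,
    "w/o rotation augmentation": 0.369,
}
# Relative CD increase (in percent) when each component is removed.
increase = {name: (cd - full_cd) / full_cd * 100.0 for name, cd in ablation_cd.items()}
ranked = sorted(increase, key=increase.get, reverse=True)
for name in ranked:
    print(f"{name}: +{increase[name]:.1f}%")
```

Removing condition augmentation comes out on top at roughly +24%, followed by removing MAR-HR, the Pyramid VAE, and rotation augmentation.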

Highlights & Insights

Elegant decomposition strategy: Decomposing the joint distribution of 3D generation into temporal (per-token diffusion) and spatial (autoregressive) dimensions makes it more scalable than pure DiT.
Condition augmentation is a simple yet critical technique: Simply injecting noise effectively mitigates the cumulative error problem of the cascaded model.
Well-crafted Pyramid VAE design: Multi-resolution cross-attention with shared queries preserves details while keeping the token count constrained.
In-the-wild generalization capability: Qualitative results demonstrate robust handling of complex topologies, such as holes and thin structures.

Limitations & Future Work

Relies on CLIP + DINOv2 to extract image features, so conditioning quality is bounded by how much 3D structure these 2D encoders capture.
Only generates geometry (mesh) without addressing texture or material generation.
Training data is sourced from Objaverse, leaving room for improvement regarding generalization to real-world scanned objects.
The two-stage cascaded inference increases latency.

3DShape2VecSet: The pioneering work that encodes 3D meshes into shape latents and uses a diffusion model for generation.
CLAY: A large-scale 3D diffusion model trained with hundreds of GPUs on a high token count; the progressive strategy proposed in this work serves as a more resource-efficient alternative.
MAR (2D): Combines autoregressive modeling with diffusion for 2D image generation, which this work extends into the 3D domain.
MaskGIT: The pioneer of the parallel prediction paradigm with random ordering, whose core concept is inherited by this work.

Rating ⭐⭐⭐⭐

Novelty: ⭐⭐⭐⭐ The combined design of Pyramid VAE + Cascaded MAR + condition augmentation is highly instructive.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparisons against various methods alongside detailed ablations of the VAE and generator.
Writing Quality: ⭐⭐⭐⭐ Clear structure with well-aligned problems, methods, and experiments.
Value: ⭐⭐⭐⭐ SOTA performance among open-source methods, bearing direct practical value.