Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots
Conference: ICML 2025
arXiv: 2505.20288
Code: https://github.com/HiDream-ai/himar
Area: Diffusion Models / Image Generation
Keywords: Autoregressive Models, Hierarchical Generation, Masked Autoregressive, Global Context, Diffusion Transformer Head
TL;DR
This work proposes Hi-MAR, which introduces low-resolution tokens as intermediate pivots in masked autoregressive image generation, giving a coarse-to-fine hierarchical generation process. It also replaces the MLP diffusion head with a Diffusion Transformer Head to model inter-token dependencies. On ImageNet 256, Hi-MAR-B improves FID over MAR-B by 0.38 (2.31 to 1.93) at lower computational cost.
Background & Motivation
Background: Autoregressive (AR) models are gaining ground in visual generation. Masked autoregressive models, represented by MAR, avoid the information loss of discrete quantization by using continuous-valued tokens trained with a diffusion loss.
Limitations of Prior Work: Models like MAR only perform autoregressive modeling on a single-scale dense token sequence, lacking global contextual information, which is particularly detrimental to the prediction of early tokens. In addition, MAR uses an MLP-based diffusion head to process each token independently, ignoring spatial dependencies among tokens, which can result in artifacts such as abnormal bright spots.
Key Challenge: Single-scale autoregression conflates global structure construction with local detail refinement, which runs counter to the human cognitive process of "global first, local second".
Goal: (a) How can global structural information be introduced in autoregressive modeling? (b) How can inter-token dependencies be modeled within the diffusion head?
Key Insight: First capture the global structure using a small number of low-resolution tokens, then use it as a condition to guide the generation of high-resolution dense tokens.
Core Idea: Hierarchical masked autoregression using low-resolution tokens as "pivots" + a diffusion head that replaces MLP with a Transformer.
Method
Overall Architecture
Hi-MAR is a two-stage hierarchical masked autoregressive model. The input image is encoded into continuous token sequences at both low (128×128) and high (256×256) resolution. The first stage performs masked autoregressive modeling on the low-resolution tokens and outputs condition tokens, rather than visual tokens directly, that reflect the global structure. The second stage concatenates these condition tokens with the masked high-resolution tokens and feeds them into the same Transformer for refined generation.
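The two-stage data flow described above can be followed through array shapes alone. This is a minimal sketch under stated assumptions, not the authors' implementation: `shared_transformer` is a hypothetical stand-in for the real network, and the token counts (64 low-resolution, 256 high-resolution) and the width of 16 are illustrative values that the summary does not specify.

```python
import numpy as np

def shared_transformer(tokens: np.ndarray, scale_id: int) -> np.ndarray:
    """Hypothetical stand-in for the shared Transformer: mixes tokens, keeps the shape."""
    context = tokens.mean(axis=0, keepdims=True)  # crude global mixing across tokens
    return np.tanh(tokens + context + scale_id)

def hi_mar_condition_tokens(low_tokens, high_tokens, high_mask, mask_token):
    """Return (low-res condition tokens, high-res condition tokens)."""
    # Stage 1: low-resolution tokens -> condition tokens Z^s carrying global structure.
    z_low = shared_transformer(low_tokens, scale_id=0)
    # Stage 2: masked high-resolution positions are replaced by a mask token.
    high_in = np.where(high_mask[:, None], mask_token, high_tokens)
    # Concatenate Z^s with the masked sequence and run the same network again.
    joint = np.concatenate([z_low, high_in], axis=0)
    z_joint = shared_transformer(joint, scale_id=1)
    return z_low, z_joint[z_low.shape[0]:]

rng = np.random.default_rng(0)
low = rng.normal(size=(64, 16))    # assumed: 64 low-resolution tokens
high = rng.normal(size=(256, 16))  # assumed: 256 high-resolution tokens
mask = rng.random(256) < 0.8       # True marks a masked position
z_low, z_high = hi_mar_condition_tokens(low, high, mask, mask_token=np.zeros(16))
print(z_low.shape, z_high.shape)
```

In the real model the high-resolution condition tokens would then condition a diffusion head that produces the actual visual tokens; that step is omitted here.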
Key Designs
- Hierarchical Masked Autoregressive Transformer (Hi-MAR Transformer):
  - Function: Two-stage modeling, from coarse to fine.
  - Mechanism: The first stage performs MAR on low-resolution tokens with bidirectional attention and outputs condition tokens \(Z^s\). The second stage concatenates \(Z^s\) with the masked high-resolution tokens and passes them through the Transformer again to generate dense condition tokens.
  - Design Motivation: Guiding the second stage directly with low-resolution visual tokens (instead of condition tokens) causes a training-inference mismatch: training uses ground-truth low-resolution tokens, while inference uses predicted (noisy) ones. Using the condition tokens output by the Transformer mitigates this issue.
- Scale-aware Transformer Block:
  - Function: Enables the shared Transformer to perceive which scale is currently being processed.
  - Mechanism: Scale information is encoded with sinusoidal embeddings and passed through an MLP to produce a scale vector \(v\). The adaLN-Zero operation then uses \(v\) to modulate the scale/shift parameters of LayerNorm and the scaling parameters of the residual connections: \(z_{a} = z^i + \gamma_1 \cdot \text{Attention}(\alpha_1 \cdot \text{LN}(z^i) + \beta_1)\), where \(\alpha_1\) and \(\beta_1\) are the LayerNorm scale and shift and \(\gamma_1\) scales the residual branch.
  - Design Motivation: A single shared Transformer processes tokens at two different scales; without an explicit scale signal, the generated images become blurry.
- Diffusion Transformer Head:
  - Function: Replaces the MLP-based diffusion head to model dependencies among all tokens during masked token prediction.
  - Mechanism: In the second stage, Transformer blocks with self-attention are used as the diffusion head. The input consists of representations of all (masked + unmasked) condition tokens modulated by adaLN, rather than processing only masked tokens independently with an MLP.
  - Design Motivation: The MLP head processes each token independently, losing global spatial structure information of the image. The Transformer head captures inter-token interactions through self-attention.
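The scale-aware modulation formula \(z_{a} = z^i + \gamma_1 \cdot \text{Attention}(\alpha_1 \cdot \text{LN}(z^i) + \beta_1)\) can be illustrated with a small NumPy sketch. It is a simplified illustration rather than the paper's code. The attention is a single head with no learned projections, the MLP that produces the scale vector is omitted so the sinusoidal embedding is used directly, and `w_mod` / `b_mod` are hypothetical names for the projection that yields \(\alpha_1, \beta_1, \gamma_1\).

```python
import numpy as np

def sinusoidal_embedding(scale_id: int, dim: int) -> np.ndarray:
    """Sinusoidal encoding of the scale index (dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = scale_id * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def layer_norm(z: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """LayerNorm without learned affine parameters (adaLN supplies them)."""
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def self_attention(h: np.ndarray) -> np.ndarray:
    """Toy single-head self-attention without learned projections."""
    scores = h @ h.T / np.sqrt(h.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ h

def scale_aware_attention(z, scale_id, w_mod, b_mod):
    """z_a = z + gamma1 * Attention(alpha1 * LN(z) + beta1)."""
    v = sinusoidal_embedding(scale_id, z.shape[-1])  # scale vector (MLP omitted)
    alpha1, beta1, gamma1 = np.split(v @ w_mod + b_mod, 3)
    return z + gamma1 * self_attention(alpha1 * layer_norm(z) + beta1)

dim = 8
z = np.random.default_rng(0).normal(size=(4, dim))
# "Zero" initialisation: with a zeroed projection, gamma1 = 0 and the block is the identity.
out = scale_aware_attention(z, scale_id=1, w_mod=np.zeros((dim, 3 * dim)), b_mod=np.zeros(3 * dim))
print(np.allclose(out, z))  # True
```

Starting from the identity is the usual reason for zero-initialising this projection: the modulated branch contributes nothing until training moves \(\gamma_1\) away from zero.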
Loss & Training
- In the first stage, the masking ratio is randomly sampled from \([0.7, 1.0]\) (same as MAR).
- The second stage utilizes the cosine masking strategy of MaskGIT.
- Both stages employ the standard diffusion denoising loss: \(\mathcal{L}(z_i, x_i) = \mathbb{E}_{\epsilon,t}[\|\epsilon - \epsilon_\theta(x_i^t|t, z_i)\|^2]\).
- During inference, the first stage takes 32 steps, while the second stage takes only 4 steps (due to the stronger modeling capability of the Transformer head and the global structure already provided by the first stage).
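The masking strategies listed above can be sketched in a few lines. This is an illustration under assumptions, not the released training code. Uniform sampling in \([0.7, 1.0]\) is assumed for the first stage, the cosine function \(\cos(\pi r / 2)\) follows the MaskGIT formulation, and the count of 256 high-resolution tokens is a hypothetical value chosen for the example.

```python
import math
import random

def stage1_mask_ratio(rng: random.Random) -> float:
    """First stage: masking ratio drawn from [0.7, 1.0] (uniform is an assumption)."""
    return rng.uniform(0.7, 1.0)

def stage2_mask_ratio(rng: random.Random) -> float:
    """Second stage: MaskGIT-style cosine ratio, cos(pi/2 * r) with r ~ U[0, 1)."""
    return math.cos(0.5 * math.pi * rng.random())

def cosine_unmask_schedule(num_tokens: int, num_steps: int) -> list[int]:
    """Number of tokens still masked after each inference step."""
    remaining = []
    for step in range(1, num_steps + 1):
        ratio = math.cos(0.5 * math.pi * step / num_steps)
        remaining.append(math.floor(num_tokens * ratio))
    remaining[-1] = 0  # every token must be generated by the final step
    return remaining

rng = random.Random(0)
print(stage1_mask_ratio(rng), stage2_mask_ratio(rng))
# Assumed 256 high-resolution tokens, 4 second-stage steps as reported above.
print(cosine_unmask_schedule(256, 4))  # [236, 181, 97, 0]
```

Under these assumptions the four second-stage steps would reveal 20, 55, 84, and 97 tokens respectively, so most tokens are produced late, once more context is available.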
Key Experimental Results
Main Results
| Dataset | Model | FID (w/ CFG) ↓ | IS ↑ | Precision | Recall |
|---|---|---|---|---|---|
| ImageNet 256 | MAR-B | 2.31 | 281.7 | 0.82 | 0.57 |
| ImageNet 256 | Hi-MAR-B | 1.93 | 293.0 | 0.81 | 0.59 |
| ImageNet 256 | MAR-H | 1.55 | 303.7 | 0.81 | 0.62 |
| ImageNet 256 | Hi-MAR-H | 1.52 | 322.78 | 0.80 | 0.63 |
| MS-COCO 256 | MAR | 6.36 | - | - | - |
| MS-COCO 256 | Hi-MAR-S | 4.77 | - | - | - |
Ablation Study
| Configuration | FID ↓ | Description |
|---|---|---|
| MAR-B baseline | 2.31 | Basic single-scale MAR |
| + Hierarchical (guided by visual tokens) | 2.28 | Training-inference mismatch, almost no improvement |
| + Hierarchical (guided by condition tokens) | 2.07 | 0.24 lower than the baseline |
| + Diff Transformer Head (Stage 2) | 1.98 | Further decreased by 0.09 |
| + Scale vector (Full Hi-MAR) | 1.93 | Final optimal model |
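The per-component gains in the ablation can be checked with simple arithmetic on the FID values in the table. The short script below assumes the rows are cumulative, with the visual-token variant treated as an alternative to the condition-token row rather than a step in the chain; that reading is an interpretation of the table, not something the summary states explicitly.

```python
# FID values copied from the ablation table (cumulative configurations only).
ablation_fid = [
    ("MAR-B baseline", 2.31),
    ("+ hierarchical, condition tokens", 2.07),
    ("+ Diffusion Transformer Head (stage 2)", 1.98),
    ("+ scale vector (full Hi-MAR)", 1.93),
]

def fid_gains(rows):
    """FID reduction contributed by each added component."""
    return [
        (name, round(prev_fid - fid, 2))
        for (_, prev_fid), (name, fid) in zip(rows, rows[1:])
    ]

gains = fid_gains(ablation_fid)
total = round(ablation_fid[0][1] - ablation_fid[-1][1], 2)
for name, gain in gains:
    print(f"{name}: -{gain:.2f} FID")
print(f"total: -{total:.2f} FID")
```

The three steps contribute 0.24, 0.09, and 0.05, which sum to the 0.38 overall improvement of Hi-MAR-B over MAR-B.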
Key Findings
- Replacing visual tokens with condition tokens for guidance is the most crucial design, accounting for 0.24 of the 0.38 total FID improvement.
- The Diffusion Transformer Head is only effective in the second stage; replacing the MLP head in the first stage yields no significant gain.
- Hi-MAR is faster at inference: the second stage needs only 4 steps to reach near-saturated quality, and the total computational cost is only 54% of MAR's.
Highlights & Insights
- Clever Decoupling of Hierarchical Generation: Generating global structures first (low-resolution tokens) and refining details later (dense tokens) aligns with human perception while reducing computational cost.
- Training-Inference Consistency Design: Guiding the second stage with condition tokens output by the Transformer instead of direct visual tokens effectively bypasses the inconsistency between ground truth and predictions.
- Transferable Concept: The design of the Diffusion Transformer Head (using self-attention instead of MLP to model inter-token dependencies) can be transferred to other tasks requiring per-token prediction.
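The difference between a per-token MLP head and a self-attention head can be seen in a toy experiment: perturb the condition of one token and check whether the outputs for the other tokens move. This is only an illustration of inter-token dependency. Both heads below are hypothetical simplifications with no timestep, noise input, or denoising loop, and they are not the paper's diffusion heads.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
W_MLP = rng.normal(size=(DIM, DIM))  # hypothetical weights of a one-layer per-token head

def mlp_head(cond: np.ndarray) -> np.ndarray:
    """Per-token head: every row is transformed independently of the others."""
    return np.tanh(cond @ W_MLP)

def attention_head(cond: np.ndarray) -> np.ndarray:
    """Single self-attention layer: every output row mixes all input rows."""
    scores = cond @ cond.T / np.sqrt(cond.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ cond

cond = rng.normal(size=(5, DIM))  # condition tokens for 5 positions
perturbed = cond.copy()
perturbed[0] += 1.0               # change only token 0

# Do the outputs for tokens 1..4 react to the change in token 0?
mlp_changed = not np.allclose(mlp_head(cond)[1:], mlp_head(perturbed)[1:])
attn_changed = not np.allclose(attention_head(cond)[1:], attention_head(perturbed)[1:])
print(mlp_changed, attn_changed)  # False True
```

The MLP outputs for the untouched tokens stay identical, while the attention outputs shift, which is the property the Diffusion Transformer Head relies on to keep neighbouring tokens spatially consistent.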
Limitations & Future Work
- Only a two-level hierarchical structure is validated; the effects of more levels (e.g., 3-4 levels) remain unexplored.
- The impact of resolution choice for low-resolution tokens (128 vs 64 vs 32) is not analyzed in depth.
- Text-to-image generation is only validated on MS-COCO, lacking experiments on large-scale T2I datasets (such as LAION).
Rating
- Novelty: ⭐⭐⭐⭐ Combined innovation of hierarchical MAR and Transformer diffusion head.
- Experimental Thoroughness: ⭐⭐⭐⭐ ImageNet + MS-COCO + thorough ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear structure and intuitive diagrams.
- Value: ⭐⭐⭐⭐ An effective direction of improvement for autoregressive image generation.