
Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots

Conference: ICML 2025
arXiv: 2505.20288
Code: https://github.com/HiDream-ai/himar
Area: Diffusion Models / Image Generation
Keywords: Autoregressive Models, Hierarchical Generation, Masked Autoregressive, Global Context, Diffusion Transformer Head

TL;DR

This work proposes Hi-MAR, which introduces low-resolution tokens as intermediate pivots in masked autoregressive image generation to establish a coarse-to-fine hierarchical generation process. It also strengthens inter-token dependency modeling with a Diffusion Transformer Head. On ImageNet 256×256, Hi-MAR-B improves FID over MAR-B by 0.38 (2.31 → 1.93) at lower computational cost.

Background & Motivation

Background: Autoregressive (AR) models are gradually emerging in visual generation. Masked autoregressive models, represented by MAR, avoid the information loss caused by discretization through continuous-valued tokens and diffusion loss.

Limitations of Prior Work: Models like MAR only perform autoregressive modeling on a single-scale dense token sequence, lacking global contextual information, which is particularly detrimental to the prediction of early tokens. In addition, MAR uses an MLP-based diffusion head to process each token independently, ignoring spatial dependencies among tokens, which can result in artifacts such as abnormal bright spots.

Key Challenge: Single-scale autoregression conflates global structure construction with local detail refinement, which runs counter to the "global first, local second" order of human perception.

Goal: (a) How can global structural information be introduced in autoregressive modeling? (b) How can inter-token dependencies be modeled within the diffusion head?

Key Insight: First capture the global structure using a small number of low-resolution tokens, and then use this as a condition to guide the generation of high-resolution dense tokens.

Core Idea: Hierarchical masked autoregression using low-resolution tokens as "pivots" + a diffusion head that replaces MLP with a Transformer.

Method

Overall Architecture

Hi-MAR is a two-stage hierarchical masked autoregressive model. The input image is encoded at two resolutions, low (128×128) and high (256×256), into two continuous token sequences. The first stage performs masked autoregressive modeling on the low-resolution tokens and outputs condition tokens (rather than the visual tokens themselves) that reflect the global structure. The second stage concatenates these condition tokens with the masked high-resolution tokens and feeds them into the same Transformer for refined generation.
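To make the "small number of pivot tokens" concrete, the sketch below counts tokens per stage. It assumes a tokenizer with a total spatial downsampling factor of 16 and one token per latent position; neither value is stated in this summary, so `vae_stride` and `patch_size` are illustrative parameters only.

```python
def num_tokens(image_size: int, vae_stride: int = 16, patch_size: int = 1) -> int:
    """Tokens produced for a square image.

    vae_stride and patch_size are assumed values, not taken from the paper summary.
    """
    side = image_size // (vae_stride * patch_size)
    return side * side


low = num_tokens(128)     # pivot sequence modeled in stage 1
high = num_tokens(256)    # dense sequence modeled in stage 2
stage2_len = low + high   # stage 2 input: condition tokens concatenated with dense tokens

print(low, high, stage2_len)
```

Under these assumptions the pivot sequence is a quarter the length of the dense sequence, so the first stage works on a much shorter sequence than single-scale MAR would.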

Key Designs

  1. Hierarchical Masked Autoregressive Transformer (Hi-MAR Transformer):

    • Function: Two-stage modeling, from coarse to fine
    • Mechanism: The first stage performs MAR on low-resolution tokens with bidirectional attention to output condition tokens \(Z^s\). The second stage concatenates \(Z^s\) with high-resolution masked tokens and passes them through the Transformer again to generate dense condition tokens.
    • Design Motivation: Guiding the second stage directly with low-resolution visual tokens (instead of condition tokens) causes a training-inference mismatch: training uses ground-truth low-resolution tokens, whereas inference uses predicted (noisy) ones. Using the condition tokens output by the Transformer instead of visual tokens mitigates this issue.
  2. Scale-aware Transformer Block:

    • Function: Enables the shared Transformer to perceive which scale is currently being processed.
    • Mechanism: Scale information is encoded using sinusoidal embeddings to generate a scale vector \(v\) via an MLP. Subsequently, the adaLN-Zero operation is employed to modulate the scale/shift parameters of LayerNorm and the scaling parameters of residual connections: \(z_{a} = z^i + \gamma_1 \cdot \text{Attention}(\alpha_1 \cdot \text{LN}(z^i) + \beta_1)\).
    • Design Motivation: A single shared Transformer processes tokens at two different scales; without explicit scale guidance, the generated images become blurry.
  3. Diffusion Transformer Head:

    • Function: Replaces the MLP-based diffusion head to model dependencies among all tokens during masked token prediction.
    • Mechanism: In the second stage, Transformer blocks with self-attention are used as the diffusion head. The input consists of representations of all (masked + unmasked) condition tokens modulated by adaLN, rather than processing only masked tokens independently with an MLP.
    • Design Motivation: The MLP head processes each token independently, losing global spatial structure information of the image. The Transformer head captures inter-token interactions through self-attention.
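The scale-aware modulation in design 2 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the `tanh` MLP, the projection-free single-head attention, the `1 + ...` offset on \(\alpha_1\) (a DiT-style convention), and all class and function names are assumptions made for the example.

```python
import numpy as np


def sinusoidal_embedding(scale_id: int, dim: int) -> np.ndarray:
    """Standard sin/cos embedding of a scalar index (dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = scale_id * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def layer_norm(z: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)


def self_attention(h: np.ndarray) -> np.ndarray:
    """Single-head attention without learned projections (kept minimal on purpose)."""
    scores = h @ h.T / np.sqrt(h.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ h


class ScaleAwareAttention:
    """Hypothetical block computing z + gamma * Attention(alpha * LN(z) + beta)."""

    def __init__(self, dim: int, rng: np.random.Generator):
        self.dim = dim
        self.w1 = rng.normal(0.0, 0.02, (dim, dim))
        # Zero-initialized modulation projection: the "Zero" in adaLN-Zero.
        self.w2 = np.zeros((dim, 3 * dim))

    def __call__(self, z: np.ndarray, scale_id: int) -> np.ndarray:
        emb = sinusoidal_embedding(scale_id, self.dim)
        v = np.tanh(emb @ self.w1)                   # scale vector v (assumed one-layer MLP)
        d_alpha, beta, gamma = np.split(v @ self.w2, 3)
        alpha = 1.0 + d_alpha                        # assumed DiT-style offset
        h = alpha * layer_norm(z) + beta
        return z + gamma * self_attention(h)


rng = np.random.default_rng(0)
block = ScaleAwareAttention(dim=16, rng=rng)
tokens = rng.normal(size=(5, 16))
out = block(tokens, scale_id=0)   # equals `tokens` at init because gamma is zero
```

With the zero initialization, the block starts as an identity mapping for both scales. Once `w2` is trained away from zero, the same weights produce different modulations for `scale_id=0` and `scale_id=1`, which is how a shared Transformer can distinguish the two stages.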

Loss & Training

  • In the first stage, the masking ratio is randomly sampled from \([0.7, 1.0]\) (same as MAR).
  • The second stage utilizes the cosine masking strategy of MaskGIT.
  • Both stages employ the standard diffusion denoising loss: \(\mathcal{L}(z_i, x_i) = \mathbb{E}_{\epsilon,t}[\|\epsilon - \epsilon_\theta(x_i^t|t, z_i)\|^2]\).
  • During inference, the first stage takes 32 steps, while the second stage takes only 4 steps (due to the stronger modeling capability of the Transformer head and the global structure already provided by the first stage).
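As a rough illustration of what a 4-step second stage implies, the sketch below computes how many tokens are revealed per step under a MaskGIT-style cosine schedule. It assumes 256 dense tokens and that the cosine shape also governs the inference-time unmasking order; the function name and both assumptions are illustrative and not confirmed by this summary.

```python
import math


def cosine_unmask_schedule(num_tokens: int, steps: int) -> list[int]:
    """Tokens newly predicted at each step when the masked count follows cos(pi/2 * t/steps)."""
    revealed = []
    still_masked = num_tokens
    for t in range(1, steps + 1):
        if t < steps:
            target = math.floor(num_tokens * math.cos(math.pi / 2 * t / steps))
            target = min(target, still_masked - 1)   # reveal at least one token per step
        else:
            target = 0                               # the final step reveals everything left
        revealed.append(still_masked - target)
        still_masked = target
    return revealed


schedule = cosine_unmask_schedule(num_tokens=256, steps=4)
print(schedule)  # [20, 55, 84, 97]
```

The schedule reveals few tokens early and many late, so most dense tokens are predicted once a good deal of context is already in place.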

Key Experimental Results

Main Results

| Dataset | Model | FID (w/ CFG) ↓ | IS ↑ | Precision | Recall |
|---|---|---|---|---|---|
| ImageNet 256 | MAR-B | 2.31 | 281.7 | 0.82 | 0.57 |
| ImageNet 256 | Hi-MAR-B | 1.93 | 293.0 | 0.81 | 0.59 |
| ImageNet 256 | MAR-H | 1.55 | 303.7 | 0.81 | 0.62 |
| ImageNet 256 | Hi-MAR-H | 1.52 | 322.78 | 0.80 | 0.63 |
| MS-COCO 256 | MAR | 6.36 | - | - | - |
| MS-COCO 256 | Hi-MAR-S | 4.77 | - | - | - |

Ablation Study

| Configuration | FID ↓ | Description |
|---|---|---|
| MAR-B baseline | 2.31 | Basic single-scale MAR |
| + Hierarchical (guided by visual tokens) | 2.28 | Training-inference mismatch, almost no improvement |
| + Hierarchical (guided by condition tokens) | 2.07 | Significant gain of 0.24 over the baseline |
| + Diff Transformer Head (Stage 2) | 1.98 | Further decrease of 0.09 |
| + Scale vector (Full Hi-MAR) | 1.93 | Final, best-performing model |
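The gains quoted in the ablation can be checked directly from the FID column. The short script below recomputes each configuration's improvement over the MAR-B baseline; the variable names are illustrative, and the numbers are copied from the table above.

```python
ablation = [
    ("MAR-B baseline", 2.31),
    ("+ Hierarchical (guided by visual tokens)", 2.28),
    ("+ Hierarchical (guided by condition tokens)", 2.07),
    ("+ Diff Transformer Head (Stage 2)", 1.98),
    ("+ Scale vector (Full Hi-MAR)", 1.93),
]

baseline_fid = ablation[0][1]
# Improvement of each configuration relative to the baseline (lower FID is better).
gain_vs_baseline = {name: round(baseline_fid - fid, 2) for name, fid in ablation}

for name, gain in gain_vs_baseline.items():
    print(f"{name}: {gain:.2f}")
```

This makes explicit that the 0.24 figure for condition-token guidance is measured against the baseline (2.31 → 2.07), and that the full model's total gain of 0.38 matches the headline number.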

Key Findings

  • Replacing visual tokens with condition tokens for guidance is the most important design choice, lowering FID by 0.24 relative to the baseline (2.31 → 2.07).
  • The Diffusion Transformer Head is only effective in the second stage; replacing the MLP head in the first stage yields no significant gain.
  • Hi-MAR is also cheaper at inference: the second stage reaches near-saturated quality in only 4 steps, and the total computational cost is only 54% of MAR's.

Highlights & Insights

  • Clever Decoupling of Hierarchical Generation: Generating global structures first (low-resolution tokens) and refining details later (dense tokens) aligns with human perception while reducing computational cost.
  • Training-Inference Consistency Design: Guiding the second stage with condition tokens output by the Transformer instead of direct visual tokens effectively bypasses the inconsistency between ground truth and predictions.
  • Transferable Concept: The design of the Diffusion Transformer Head (using self-attention instead of MLP to model inter-token dependencies) can be transferred to other tasks requiring per-token prediction.

Limitations & Future Work

  • Only a two-level hierarchical structure is validated; the effects of more levels (e.g., 3-4 levels) remain unexplored.
  • The impact of resolution choice for low-resolution tokens (128 vs 64 vs 32) is not analyzed in depth.
  • Text-to-image generation is only validated on MS-COCO, lacking experiments on large-scale T2I datasets (such as LAION).

Rating

  • Novelty: ⭐⭐⭐⭐ Combined innovation of hierarchical MAR and Transformer diffusion head.
  • Experimental Thoroughness: ⭐⭐⭐⭐ ImageNet + MS-COCO + thorough ablations.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure and intuitive diagrams.
  • Value: ⭐⭐⭐⭐ An effective direction of improvement for autoregressive image generation.