
HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation¶

Conference: CVPR 2025
arXiv: 2506.04421
Code: None
Area: Image Generation / Autoregressive Generation
Keywords: Autoregressive Image Generation, Next-Scale Prediction, Masked Prediction, Markov Process, Efficient Attention, Loss Reweighting

TL;DR¶

HMAR reformulates the next-scale prediction of VAR into a Markov process (relying only on the cumulative reconstruction of the previous scale rather than all prior scales). It introduces multi-step masked generation within each scale to eliminate the conditional independence assumption. Coupled with a customized IO-aware block-sparse attention kernel, HMAR matches or exceeds VAR/DiT quality on ImageNet while achieving 2.5× training acceleration and a 3× reduction in inference memory.

Background & Motivation¶

Background: Visual Auto-Regressive modeling (VAR) bridges the gap between autoregressive models and diffusion models in speed and quality through the next-scale prediction paradigm. VAR decomposes an image into K multi-resolution scales, generating a higher-resolution scale at each step conditioned on the tokens of all preceding scales.

Limitations of Prior Work: (1) Quality Bottleneck: VAR samples all tokens of a scale in a single parallel step, implicitly assuming that tokens within the same scale are conditionally independent, which leads to "over-smoothing" and cross-scale error accumulation; (2) Efficiency Issues: Conditioning on all prior scales makes the sequence length grow super-linearly (5.84× longer than next-token prediction at 256×256 resolution), and FlashAttention does not natively support VAR's block-causal attention pattern; (3) Lack of Flexibility: The number of inference steps is fixed at training time, so changing it requires retraining.

Key Challenge: VAR leaves room for improvement on three fronts: the conditional independence assumption degrades quality, long sequences hinder efficiency, and the fixed step count restricts flexibility.

Key Insight: The cumulative reconstruction \(\tilde{x}_{1:k}\) in VAR's encoding already encapsulates all information from the first \(k\) scales (analogous to a Laplacian pyramid), so next-scale prediction can be reformulated as a Markov process \(p(r_k | \tilde{x}_{1:k-1})\). This yields a block-diagonal attention pattern (5× sparser than block-causal) and makes it possible to add MaskGIT-like multi-step masked generation within each scale to model intra-scale token dependencies.
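
The difference between the two attention patterns can be illustrated with a minimal sketch. It assumes a hypothetical toy schedule of three scales with side lengths 1, 2, and 4 (not the paper's schedule) and uses plain Python lists rather than a GPU kernel; the helper names are illustrative only. The ratio it prints depends on the schedule, so it is not expected to reproduce the 5× figure quoted above.

```python
def scale_ids(side_lengths):
    """Scale index of every token in the flattened multi-scale sequence."""
    ids = []
    for k, side in enumerate(side_lengths):
        ids.extend([k] * (side * side))
    return ids


def block_causal_mask(side_lengths):
    """VAR-style pattern: a token attends to its own scale and all earlier ones."""
    ids = scale_ids(side_lengths)
    return [[kj <= ki for kj in ids] for ki in ids]


def block_diagonal_mask(side_lengths):
    """Markovian pattern: a token attends only within its own scale block."""
    ids = scale_ids(side_lengths)
    return [[kj == ki for kj in ids] for ki in ids]


def allowed_pairs(mask):
    """Number of (query, key) pairs the mask lets through."""
    return sum(sum(row) for row in mask)


toy_sides = (1, 2, 4)  # hypothetical schedule, chosen only to keep the masks small
causal = block_causal_mask(toy_sides)
diagonal = block_diagonal_mask(toy_sides)
print("block-causal pairs:  ", allowed_pairs(causal))
print("block-diagonal pairs:", allowed_pairs(diagonal))
```

Every pair allowed by the block-diagonal mask is also allowed by the block-causal one, so the Markovian pattern can only remove attention work, never add it.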

Method¶

Overall Architecture¶

HMAR consists of two sub-modules: (1) Markovian Next-Scale Prediction Module: modifies VAR's full-history conditioning to rely solely on the cumulative reconstruction of the previous scale, accelerated during training using an IO-aware block-diagonal attention kernel; (2) Intra-Scale Masked Refinement Module: employs multi-step masked generation within each scale to eliminate the conditional independence assumption, offering a controllable trade-off between quality and speed. The two modules are trained sequentially in stages.

Key Designs¶

Markovian Next-Scale Prediction:
- Function: Reduces the sequence length of next-scale prediction from super-linear to linear, enabling block-diagonal sparse attention.
- Mechanism: Leverages the properties of VQ-VAE residual encoding, where the cumulative reconstruction \(\tilde{x}_{1:k} = \sum_{j=1}^{k} \tilde{x}_j\) contains all information from the first \(k\) scales. Consequently, \(p(r_k | r_1,...,r_{k-1}) = p(r_k | \tilde{x}_{1:k-1})\), reformulating generation as a Markov process. In practice, an interpolation function scales \(\tilde{x}_{1:k-1}\) to \(H_{k-1} \times W_{k-1}\) to act as the condition, converting the attention pattern from block-causal to block-diagonal (achieving a 5× sparsity improvement).
- Design Motivation: Attention analysis (Fig. 9) shows that most attention in VAR is concentrated on the immediately preceding scale, which supports the Markov assumption. Eliminating the need for a KV cache during inference directly reduces memory usage by 3×.
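
The Laplacian-pyramid analogy behind the cumulative reconstruction can be sketched on a 1D signal. This is a simplified illustration, not the tokenizer itself: it assumes average pooling for downsampling, nearest-neighbour upsampling, and no vector quantization, and all function names are hypothetical. The point it shows is that each residual \(r_k\) is computed from the input and the running reconstruction alone, so the running reconstruction summarizes the history.

```python
def downsample(signal, size):
    """Average-pool a 1D signal into `size` equal bins."""
    width = len(signal) // size
    return [sum(signal[i * width:(i + 1) * width]) / width for i in range(size)]


def upsample(values, length):
    """Nearest-neighbour upsampling of `values` to `length` samples."""
    repeat = length // len(values)
    return [v for v in values for _ in range(repeat)]


def encode(signal, sizes):
    """Return per-scale residuals r_k and the cumulative reconstructions."""
    length = len(signal)
    recon = [0.0] * length
    residuals, cumulative = [], []
    for size in sizes:
        # r_k sees the history only through the running reconstruction.
        leftover = [s - c for s, c in zip(signal, recon)]
        r_k = downsample(leftover, size)
        recon = [c + u for c, u in zip(recon, upsample(r_k, length))]
        residuals.append(r_k)
        cumulative.append(list(recon))
    return residuals, cumulative


signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
sizes = (1, 2, 4, 8)  # toy schedule; the last scale is full resolution
residuals, cumulative = encode(signal, sizes)

# The cumulative reconstruction is the sum of the upsampled per-scale residuals.
summed = [sum(column) for column in zip(*(upsample(r, len(signal)) for r in residuals))]
print(summed)
```

Because the last toy scale has full resolution and nothing is quantized, the summed residuals recover the input up to floating-point error; with a real quantizer the reconstruction would only be approximate.
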
Hierarchical Multi-Step Masked Generation:
- Function: Models intra-scale token dependencies to eliminate the conditional independence assumption of VAR.
- Mechanism: At each scale \(k\), an initial next-scale prediction yields \(r_k^0\) (equivalent to VAR's single-step result). The output is then refined iteratively over \(M_k\) masked generation steps: each step randomly masks a portion of the tokens and re-predicts them from the unmasked tokens and the cumulative reconstruction of the previous scale. With \(M_k=0\) the procedure reduces to VAR; with \(M_k=H_k \times W_k\) it reduces to next-token prediction. During the fine-tuning phase, the mask ratio is sampled uniformly as \(\gamma \sim \mathcal{U}(0,1)\). During inference, multi-step refinement at coarser scales improves FID, while multi-step refinement at finer scales enhances perceptual quality.
- Design Motivation: VAR's parallel generation assumes conditional independence among tokens of the same scale, leading to over-smoothing and error accumulation in practice (Fig. 17). Masked generation offers a controllable trade-off between quality and speed.
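
The refinement loop for a single scale can be sketched as follows. This is a minimal illustration under stated assumptions: `predict` is a hypothetical stand-in for the model's mask prediction head, the fixed `mask_ratio` at inference is an assumption (the text above only specifies the training-time ratio), and `toy_predict` exists purely so the example runs.

```python
import random


def refine_scale(initial_tokens, condition, predict, num_mask_steps, rng, mask_ratio=0.5):
    """Refine one scale with `num_mask_steps` rounds of mask-and-repredict.

    `initial_tokens` plays the role of r_k^0 and `condition` the role of the
    previous scale's cumulative reconstruction. With num_mask_steps == 0 the
    initial single-step prediction is returned unchanged.
    """
    tokens = list(initial_tokens)
    for _ in range(num_mask_steps):
        num_masked = max(1, round(mask_ratio * len(tokens)))
        masked = set(rng.sample(range(len(tokens)), num_masked))
        # Unmasked tokens stay fixed and serve as context for the re-prediction.
        visible = {i: t for i, t in enumerate(tokens) if i not in masked}
        predictions = predict(visible, sorted(masked), condition)
        for position in masked:
            tokens[position] = predictions[position]
    return tokens


def toy_predict(visible, masked_positions, condition):
    """Hypothetical predictor: copies the condition value at each masked position."""
    return {p: condition[p] for p in masked_positions}


rng = random.Random(0)
initial = [7, 7, 7, 7]
condition = [1, 2, 3, 4]
print(refine_scale(initial, condition, toy_predict, 0, rng))  # no refinement
print(refine_scale(initial, condition, toy_predict, 3, rng))  # three mask steps
```

Passing the number of mask steps as an argument mirrors the flexibility described above: the step count is chosen at inference time rather than baked into the loop.
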
Multi-Scale Loss Reweighting:
- Function: Balances the training contributions of different resolution scales.
- Mechanism: VAR averages the loss uniformly over tokens, so the finest scale (\(16 \times 16 = 256\) tokens) contributes 256 times more than the coarsest scale (a single token). HMAR introduces scale weights \(w(k)\) such that \(\sum w(k) = 1\). Experiments show that the learning difficulty of each scale approximately follows a log-normal distribution (Fig. 12), so a log-normal weighting function is adopted as \(w(k)\) to align the allocation of model capacity with the distribution of learning difficulty.
- Design Motivation: Early errors at coarser scales accumulate and propagate to all subsequent scales (Fig. 17). Furthermore, the number of tokens contributed by different scales varies drastically, making uniform weighting suboptimal.
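
A normalized log-normal weighting over scale indices can be sketched as below. The parameters `mu` and `sigma` are hypothetical placeholders (the values used in the paper are not given here), and applying the density to the integer scale index \(k\) is likewise an assumption about how \(w(k)\) is parameterized.

```python
import math


def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal distribution at x > 0."""
    exponent = -((math.log(x) - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (x * sigma * math.sqrt(2 * math.pi))


def scale_weights(num_scales, mu, sigma):
    """Weights w(1..K) proportional to the log-normal density, normalized to sum to 1."""
    raw = [lognormal_pdf(k, mu, sigma) for k in range(1, num_scales + 1)]
    total = sum(raw)
    return [value / total for value in raw]


def reweighted_loss(per_scale_mean_losses, weights):
    """Combine per-scale mean losses with the scale weights."""
    return sum(w * loss for w, loss in zip(weights, per_scale_mean_losses))


# Hypothetical parameters, chosen only for illustration.
weights = scale_weights(num_scales=10, mu=1.0, sigma=0.75)
print([round(w, 3) for w in weights])

# Under uniform per-token averaging, a 16x16 scale outweighs a 1x1 scale 256 to 1.
print((16 * 16) // (1 * 1))
```

Computing a mean loss per scale first and weighting afterwards decouples a scale's influence from its token count, which is the imbalance the reweighting targets.
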

Loss & Training¶

Phase 1 (Next-Scale): Cross-entropy loss with IO-aware window attention + log-normal loss reweighting.
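Phase 2 (Masked Refinement): A mask prediction head is added and fine-tuned using \(\mathcal{L}_{mask} = \sum_k \mathcal{L}(\gamma r_k | \bar{\gamma} r_k)\), i.e., at each scale \(k\) the masked tokens \(\gamma r_k\) are predicted from the unmasked tokens \(\bar{\gamma} r_k\).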
Utilizes a pre-trained multi-scale VQ-VAE tokenizer from VAR.
\(K=10\) scales (\(1 \times 1\) to \(16 \times 16\)), consistent with VAR.
Uses top-k and top-p sampling during inference, defaulting to 14 steps (10 next-scale steps plus 4 masked refinement steps in total).
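
The sampling setup can be sketched in plain Python. This shows generic top-k and top-p (nucleus) filtering, not the paper's implementation; the probabilities, thresholds, and the per-scale mask-step schedule are hypothetical values chosen for illustration.

```python
import random


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top-k tokens, truncate to the top-p nucleus, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order[:top_k]:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


def sample_token(probs, top_k, top_p, rng):
    """Draw one token index from the filtered distribution."""
    filtered = filter_top_k_top_p(probs, top_k, top_p)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]


def total_steps(mask_steps_per_scale):
    """One next-scale step per scale plus the extra mask steps."""
    return len(mask_steps_per_scale) + sum(mask_steps_per_scale)


probs = [0.5, 0.3, 0.15, 0.05]
print(filter_top_k_top_p(probs, top_k=3, top_p=0.7))
print(sample_token(probs, top_k=3, top_p=0.7, rng=random.Random(0)))

# Hypothetical placement of 4 mask steps over 10 scales; only the total is from the text.
print(total_steps([1, 1, 1, 1, 0, 0, 0, 0, 0, 0]))
```

Any schedule whose mask steps sum to 4 gives the same 14-step budget, so where the extra steps are placed is a free choice at inference time.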

Key Experimental Results¶

Main Results¶

ImageNet 256×256 (classifier-free guidance scale not specified):

Method	Type	FID↓	IS↑	Params	Steps
DiT-XL/2	Diffusion	2.27	278.2	675M	250
VAR-d16	AR	3.36	277.8	310M	10
VAR-d24	AR	2.15	312.4	1.0B	10
VAR-d30	AR	1.95	303.6	2.0B	10
HMAR-d16	Hybrid	3.01	288.6	465M	14
HMAR-d24	Hybrid	2.10	324.3	1.3B	14
HMAR-d30	Hybrid	1.95	334.5	2.4B	14

ImageNet 512×512:

Method	FID↓	IS↑	Params
DiT-XL/2	3.04	240.8	675M
VAR-d36	2.63	303.2	-
HMAR-d24	Matches or exceeds	Higher IS	-

Efficiency Comparison¶

Metric	HMAR vs VAR
Training Speed	2.5× Faster
Inference Speed	1.75× Faster
Inference Memory	3× Lower
Attention Computation	10× Faster (IO-aware kernel)

Ablation Study¶

Loss weighting strategy (d16, 256×256):

Weighting Strategy	FID↓	IS↑
Uniform (VAR)	3.36	277.8
Linear	~3.2	~280
Log-Normal	3.01	288.6

Key Findings¶

HMAR-d30 matches VAR-d30 on ImageNet 256×256 with an FID of 1.95 while improving IS from 303.6 to 334.5 (+30.9), a substantial gain in perceptual image quality.
The Markovian reformulation increases training sequence sparsity by 5× (at 256×256 resolution), and the IO-aware kernel accelerates attention computation by 10×.
Inference requires no KV cache, resulting in a 3× memory reduction, which makes large-scale models and high-resolution inference highly feasible.
The number of mask steps can be flexibly adjusted during inference: more steps at coarser scales improve global structure (FID↓), while more steps at finer scales enhance detail (IS↑).
Log-normal loss weighting reduces FID by approximately 0.35 and increases IS by around 11 compared to uniform weighting.
HMAR can be applied zero-shot to inpainting, outpainting, and class-conditional editing, capabilities that VAR lacks.

Highlights & Insights¶

Elegant Markov Equivalence Derivation: Leverages the mathematical properties of VQ-VAE residual quantization to prove that \(p(r_k|r_{<k}) = p(r_k|\tilde{x}_{1:k-1})\), drawing an analogy to Laplacian/Gaussian pyramids. This formulation is theoretically concise and practically efficient.
Simultaneous Improvements in Quality, Efficiency, and Flexibility: While these three aspects typically present a trade-off, HMAR wins on all fronts through the Markovian reformulation and masked generation, representing a rare Pareto improvement.
Customizable Sampling Schedule: The number of mask steps can be independently adjusted at different scales without retraining, providing extreme flexibility for the trade-off between quality and speed.
Custom IO-aware GPU Kernel: The engineering contribution is equally vital—the Triton-implemented block-sparse attention kernel translates the paper's theoretical sparsity advantages into concrete practical performance gains.

Limitations & Future Work¶

The parameter count is approximately 20-50% larger than that of the corresponding VAR model (465M vs 310M for d16, 1.3B vs 1.0B for d24, 2.4B vs 2.0B for d30) due to the addition of the mask prediction head.
Validation is currently limited to class-conditional generation on ImageNet, with a lack of experiments on text-to-image generation.
Although the Markov assumption is validated by attention analysis, it might lose information in extreme edge cases (e.g., when distant preceding scales contain critical global structures).
The two-stage training process (next-scale training followed by mask fine-tuning) increases the complexity of the training pipeline.

HMAR is a hierarchical integration of VAR and MaskGIT—VAR provides inter-scale causality, while MaskGIT provides intra-scale non-causal refinement.
Comparison with HART (another VAR variant): HART employs continuous-value diffusion for intra-scale refinement, whereas HMAR utilizes discrete masked generation, which is much more efficient.
Implications for Video VAR: The Markovian technique can be applied to the temporal dimension of video frames to reduce the memory overhead of long video generation.
The loss reweighting strategy can be generalized to any multi-scale or multi-stage generative models.

Rating¶

⭐⭐⭐⭐⭐ (5/5)

Novelty ⭐⭐⭐⭐⭐: Markov equivalence derivation + masked refinement + IO-aware kernel—the three contributions are tightly coupled and coherent.
Experimental Thoroughness ⭐⭐⭐⭐⭐: Evaluated across quality, efficiency, and flexibility; comprehensive ablation studies; compared against various baselines like VAR, DiT, and MaskGIT.
Writing Quality ⭐⭐⭐⭐⭐: Concise theoretical derivation, clear presentation of experiments, and overall excellent structure.
Value ⭐⭐⭐⭐⭐: Faster, more memory-efficient, and more flexible than VAR without compromising quality, serving as a drop-in upgrade solution.