# AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution
Conference: CVPR 2026 | arXiv: 2603.00589 | Code: None | Area: Image Generation

Keywords: Visual Autoregression, Image Super-Resolution, Spatial Consistency, Hierarchical Consistency, Next-Scale Prediction
## TL;DR
AlignVAR addresses two consistency failures of visual autoregressive (VAR) models in image super-resolution (ISR): spatially incoherent reconstructions caused by locally biased attention, and cross-scale error accumulation induced by residual supervision. The proposed framework introduces Spatial Consistency Autoregression (SCA) and Hierarchical Consistency Constraint (HCC) to jointly resolve both issues, achieving reconstruction quality superior to diffusion-based methods while delivering over 10× faster inference.
## Background & Motivation
In the ISR field, GAN-based methods suffer from training instability and tend to produce artifacts, while diffusion-based methods yield high quality but incur prohibitive iterative denoising costs (e.g., StableSR requires 200 steps and 15.32 seconds per image). Visual autoregression (VAR) instead adopts a next-scale prediction strategy, enabling coarse-to-fine reconstruction in a small fixed number of steps without a long denoising loop; this structure aligns naturally with the hierarchical nature of ISR. The predecessor work VARSR gave a preliminary demonstration of this paradigm's feasibility.
However, VARSR also exposed two fundamental failure modes of the VAR paradigm in ISR:
Spatial inconsistency (locally biased attention): Self-attention weights in VAR models concentrate almost entirely on neighboring regions, preventing long-range structural interactions and leading to texture discontinuities and structural distortions.
Hierarchical inconsistency (cross-scale error accumulation): Residual supervision constrains only the incremental prediction at each scale; minor deviations at coarse scales propagate and amplify through the sequential conditional probability \(p(r_k | r_{1:k-1})\), causing color shifts and structural misalignment.
Both problems share one root cause: the absence of explicit consistency constraints within scales (the spatial dimension) and across scales (the hierarchical dimension). AlignVAR addresses this by imposing consistency along both dimensions simultaneously.
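For reference, the propagation path described above follows directly from the standard next-scale autoregressive factorization (notation as in the paper):

\[
p(r_1, r_2, \ldots, r_K) = \prod_{k=1}^{K} p(r_k \mid r_{1:k-1}),
\]

so a deviation in any coarse-scale \(r_k\) conditions every subsequent factor, which is exactly the accumulation path that an explicit consistency constraint must interrupt.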
## Method
### Overall Architecture
AlignVAR builds upon the next-scale prediction architecture of a VQ-VAE plus an autoregressive Transformer, introducing two complementary modules:

- SCA (Spatial Consistency Autoregression): intra-scale; reweights attention via adaptive masking to alleviate local bias.
- HCC (Hierarchical Consistency Constraint): inter-scale; replaces pure residual supervision with full-scale reconstruction supervision to suppress error accumulation.
### Key Designs
- Spatial Consistency Autoregression (SCA):
    - Function: introduces structure-aware spatial modulation into the autoregressive prediction at each scale to break the local bias of attention.
    - Mechanism (see the sketch after this list):
        - Extract a structural guidance map from the low-resolution input \(I_{LR}\): \(s = |\text{Laplacian}(I_{LR})|\), downsampled to each scale's resolution and normalized to obtain \(\bar{s}_k\).
        - A lightweight MLP mask generator predicts the spatial modulation field: \(m_k = \sigma(\mathcal{M}_\phi([r_k, \bar{s}_k]))\).
        - Token gating produces the structure-aware reweighted tokens: \(\tilde{r}_k = (1 + m_k) \odot r_k\).
    - Design Motivation: the Laplacian operator is sensitive to second-order structural changes, effectively highlighting edges and texture regions. The mask assigns higher weights to structurally reliable regions, guiding the model to propagate information preferentially along trustworthy structural paths, thereby extending the effective receptive field and strengthening long-range dependencies.
- Hierarchical Consistency Constraint (HCC):
    - Function: supervises the accumulated full-scale reconstruction at every scale, rather than only the residual increment.
    - Mechanism (see the sketch after this list):
        - Construct the full-scale ground truth \(u_{\text{gt}}^k = \mathcal{Q}(\text{Down}(z, S_k))\): the VAE encoding \(z\) of the HR image, downsampled to each scale and quantized.
        - Accumulate the prediction: \(\hat{u}_{\text{pred}}^k = \hat{u}_{\text{pred}}^{k-1} + \hat{r}_{\text{pred}}^k\).
        - HCC loss: \(\mathcal{L}_{\text{HCC}} = \sum_{k=1}^{K} \|\hat{u}_{\text{pred}}^k - u_{\text{gt}}^k\|_2^2\).
    - Design Motivation: pure residual cross-entropy loss cannot perceive accumulated deviations; small errors at coarse scales accumulate into large biases through successive stages. HCC directly compares the global reconstruction state against the ground truth at each scale, enabling the model to correct deviations before they propagate further.
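Below is a minimal PyTorch sketch of both mechanisms. The tensor layout (continuous token maps \(r_k\) of shape `(B, C, H_k, W_k)`), the exact `MaskGenerator` architecture, and the assumption that per-scale residuals are already upsampled to one shared latent resolution before accumulation are illustrative choices, not the authors' implementation (no code has been released).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 3x3 Laplacian kernel for the structural guidance map s = |Laplacian(I_LR)|.
LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def structural_guidance(lr_gray: torch.Tensor, size) -> torch.Tensor:
    """s_bar_k: |Laplacian(I_LR)| resized to a scale's resolution and
    min-max normalized to [0, 1]. lr_gray: (B, 1, H, W) grayscale LR image."""
    s = F.conv2d(lr_gray, LAPLACIAN, padding=1).abs()
    s = F.interpolate(s, size=size, mode="bilinear", align_corners=False)
    s_min = s.amin(dim=(2, 3), keepdim=True)
    s_max = s.amax(dim=(2, 3), keepdim=True)
    return (s - s_min) / (s_max - s_min + 1e-6)

class MaskGenerator(nn.Module):
    """Lightweight per-token MLP M_phi: m_k = sigmoid(M_phi([r_k, s_bar_k]))."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim + 1, hidden),
                                 nn.GELU(),
                                 nn.Linear(hidden, 1))

    def forward(self, r_k: torch.Tensor, s_bar_k: torch.Tensor) -> torch.Tensor:
        b, _, h, w = r_k.shape
        x = torch.cat([r_k, s_bar_k], dim=1)   # (B, C+1, H, W)
        x = x.flatten(2).transpose(1, 2)       # (B, H*W, C+1): one row per token
        m = torch.sigmoid(self.mlp(x))         # (B, H*W, 1) in (0, 1)
        return m.transpose(1, 2).view(b, 1, h, w)

def sca_gate(r_k: torch.Tensor, m_k: torch.Tensor) -> torch.Tensor:
    """Token gating: r_tilde_k = (1 + m_k) * r_k."""
    return (1.0 + m_k) * r_k

def hcc_loss(residual_preds, u_gts) -> torch.Tensor:
    """L_HCC = sum_k ||u_pred^k - u_gt^k||^2 with u_pred^k = u_pred^{k-1} + r_pred^k.
    Assumes all maps share one spatial resolution (residuals pre-upsampled)."""
    u_pred = torch.zeros_like(u_gts[0])
    loss = u_pred.new_zeros(())
    for r_k, u_gt in zip(residual_preds, u_gts):
        u_pred = u_pred + r_k                  # cumulative full-scale reconstruction
        loss = loss + F.mse_loss(u_pred, u_gt) # per-scale penalty on the full state
    return loss

# Shape smoke test on dummy tensors.
if __name__ == "__main__":
    B, C, H, W = 2, 16, 8, 8
    s_bar = structural_guidance(torch.rand(B, 1, 32, 32), (H, W))
    r_k = torch.randn(B, C, H, W)
    r_tilde = sca_gate(r_k, MaskGenerator(C)(r_k, s_bar))
    print(r_tilde.shape, hcc_loss([r_k, r_k], [r_k, 2 * r_k]).item())
```

Note the `1 +` in the gate: gradients keep flowing through every token, while structurally salient regions (high \(m_k\)) are amplified rather than the rest being zeroed out.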
### Loss & Training
The overall training objective is a weighted sum of the cross-entropy loss and the HCC loss:

\[
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda \, \mathcal{L}_{\text{HCC}}
\]
- Teacher-forcing training is employed, conditioned on reweighted ground-truth tokens \(\tilde{r}_{\text{gt}}^{1:k-1}\).
- \(\lambda = 1.0\) (verified as the optimal balance by ablation study).
- Optimizer: AdamW, batch size 32, learning rate \(5 \times 10^{-5}\) (cosine annealing), trained for 100 epochs.
- Training data: LSDIR + first 10K images from FFHQ, with degradation via the Real-ESRGAN pipeline.
- Hardware: 8× NVIDIA H100 GPUs.
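As a hedged sketch, this recipe could be wired together as follows; the `model` interface (returning codebook logits, ground-truth indices, per-scale residual predictions, and full-scale targets \(u_{\text{gt}}^k\) under teacher forcing) and the `loader` are hypothetical placeholders, while the hyperparameters are the ones reported above.

```python
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS = 100        # paper's training length
LAMBDA_HCC = 1.0    # ablation-verified balance between CE and HCC terms

def train_alignvar(model, loader):
    """loader yields (lr_img, hr_img) pairs (batch size 32, Real-ESRGAN
    degradations); model runs teacher forcing on gated ground-truth tokens."""
    opt = AdamW(model.parameters(), lr=5e-5)
    sched = CosineAnnealingLR(opt, T_max=EPOCHS)
    for _ in range(EPOCHS):
        for lr_img, hr_img in loader:
            # Hypothetical interface: logits over the codebook, GT indices,
            # per-scale residual predictions, and full-scale targets u_gt^k.
            logits, gt_idx, res_preds, u_gts = model(lr_img, hr_img)
            ce = F.cross_entropy(logits.flatten(0, -2), gt_idx.flatten())
            u_pred = torch.zeros_like(u_gts[0])
            hcc = logits.new_zeros(())
            for r_k, u_gt in zip(res_preds, u_gts):
                u_pred = u_pred + r_k
                hcc = hcc + F.mse_loss(u_pred, u_gt)
            loss = ce + LAMBDA_HCC * hcc   # L_total = L_CE + lambda * L_HCC
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()                       # cosine annealing, stepped per epoch
```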
## Key Experimental Results
### Main Results (Table 1: Synthetic + Real-World Benchmarks)
| Method | Type | DIV2K LPIPS↓ | DIV2K FID↓ | DIV2K MANIQA↑ | DIV2K CLIPIQA↑ |
|---|---|---|---|---|---|
| BSRGAN | GAN | 0.3511 | 50.99 | 0.3547 | 0.5253 |
| Real-ESRGAN | GAN | 0.3267 | 44.34 | 0.3756 | 0.5205 |
| StableSR | Diffusion | 0.3228 | 28.32 | 0.4173 | 0.6752 |
| DiffBIR | Diffusion | 0.3638 | 34.55 | 0.4598 | 0.6731 |
| VARSR | VAR | 0.2985 | 28.64 | 0.4137 | 0.6312 |
| AlignVAR | VAR | 0.2955 | 25.71 | 0.4665 | 0.6754 |
AlignVAR achieves the lowest FID (25.71) and best LPIPS (0.2955) on DIV2K-Val, while also attaining top perceptual quality scores on MANIQA and CLIPIQA.
### Efficiency Comparison (Table 2)
| Method | Parameters | Inference Steps | Inference Time |
|---|---|---|---|
| StableSR | 1409.1M | 200 | 15.32s |
| DiffBIR | 1900.4M | 20 | 5.03s |
| PASD | 1716.7M | 50 | 5.94s |
| VARSR | 1102.9M | 10 | 0.52s |
| AlignVAR | 1056.5M | 10 | 0.43s |
AlignVAR is 13.8× faster than PASD and 11.7× faster than DiffBIR, and it cuts inference time by roughly 17% relative to VARSR (0.43s vs. 0.52s) while using fewer parameters than every compared method.
### Ablation Study
SCA Ablation (Table 3):
| Configuration | RealSR MANIQA↑ | RealSR MUSIQ↑ |
|---|---|---|
| w/o SCA | 0.4351 | 66.74 |
| Random Input | 0.4435 | 67.21 |
| Structural Guidance (ours) | 0.4553 | 68.53 |
HCC Ablation (Table 4):
| Configuration | RealSR PSNR↑ | RealSR MANIQA↑ |
|---|---|---|
| w/o HCC | 25.85 | 0.4431 |
| w/ HCC | 26.11 | 0.4553 |
### Key Findings
- Removing SCA slightly improves fidelity metrics but noticeably degrades perceptual quality, indicating that structural guidance is critical for visual coherence.
- Applying HCC supervision in latent space outperforms pixel-space supervision — latent representations are more compact and gradients are more direct.
- The loss balance coefficient \(\lambda = 1.0\) yields the best perceptual quality; larger \(\lambda\) biases the model toward fidelity at the expense of perceptual quality.
## Highlights & Insights
- Precise problem diagnosis: Through attention distribution visualization and perturbation injection experiments, the paper clearly localizes two fundamental failure modes of VAR in ISR.
- Lightweight and efficient: The mask generator in SCA is a lightweight MLP, and HCC introduces only an L2 loss computation — adding virtually no inference overhead.
- 10× speed advantage: 0.43 seconds versus 5+ seconds for diffusion-based methods, which is highly significant for practical deployment.
- Better performance with fewer parameters: 1056.5M vs. 1900.4M (DiffBIR), demonstrating the efficiency potential of the VAR paradigm for ISR.
## Limitations & Future Work
- Fidelity metrics (PSNR/SSIM) do not reach state-of-the-art levels; recovery remains bottlenecked when high-frequency details in the LR image are severely lost.
- The mask relies on hand-crafted Laplacian design; learnable structural detection could be explored.
- Only 4× super-resolution (128→512) is evaluated; more extreme upscaling factors (e.g., 8× or 16×) remain unverified.
- The VQ-VAE discretization may limit the reconstruction ceiling; comparisons with continuous latent space methods are absent.
## Related Work & Insights
- VARSR: The direct predecessor of this work, first applying VAR to ISR but exposing consistency issues.
- VAR (next-scale prediction): Distinguished from next-token prediction, avoiding the destruction of spatial structure caused by sequence flattening.
- StableSR / DiffBIR: Representative diffusion-based methods with high quality but slow inference.
- Insight: Locally biased attention and error accumulation are general problems for autoregressive models; the ideas behind SCA and HCC are transferable to other hierarchical generation tasks such as video generation and 3D reconstruction.
## Rating
- Novelty: ⭐⭐⭐⭐ Proposes targeted solutions for specific failure modes of VAR in ISR, with in-depth problem diagnosis.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers synthetic and real-world benchmarks, complete ablations, and efficiency comparisons, though user studies are absent.
- Writing Quality: ⭐⭐⭐⭐ Motivation analysis is clear, visualizations are rich, and the problem-solution correspondence is explicit.
- Value: ⭐⭐⭐⭐ Advances the practical applicability of VAR for ISR; the 10× speed advantage carries significant engineering value.