Eliminating VAE for Fast and High-Resolution Generative Detail Restoration

- Conference: ICLR 2026
- arXiv: 2602.10630
- Code: None (improvement upon GenDR)
- Area: Image Generation
- Keywords: VAE elimination, super-resolution, pixel-space diffusion, adversarial distillation, pixel-shuffle

TL;DR

By replacing the VAE encoder and decoder with ×8 pixel-unshuffle and pixel-shuffle operations, this work converts latent-space diffusion super-resolution (GenDR) into pixel-space super-resolution (GenDR-Pix). Combined with multi-stage adversarial distillation and a PadCFG inference strategy, the method achieves a 2.8× speedup and roughly 60% memory reduction over GenDR with negligible visual degradation. With CFG disabled, it restores a 4K image in under 1 second within a 6 GB VRAM budget, for the first time.

Background & Motivation

Diffusion models have achieved breakthroughs in real-world super-resolution (Real-World SR), yet they face two major bottlenecks: slow inference and high memory consumption. Existing acceleration approaches (e.g., step distillation) reduce diffusion steps to one, but memory constraints still limit the maximum processable resolution, requiring tile-by-tile processing for high-resolution images.

Key Finding: VAE is the bottleneck. Profiling analysis reveals:
- For a 1024² input, VAE latency is 2.6× that of the UNet (191 ms vs. 73 ms)
- VAE memory is 1.3× that of the UNet (3.5 GB vs. 2.7 GB)
- For a 2880² input, VAE latency is 1.89× that of the UNet (3228 ms vs. 1710 ms)
- At 4096², the VAE directly causes OOM
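
The ratios above can be re-derived from the quoted timings. The short check below also reports the VAE share of the combined cost, which assumes (hypothetically) that VAE and UNet are the only two components being counted; the dictionary names are illustrative.

```python
# Profiling figures quoted above, stored as (VAE, UNet) pairs per input size.
latency_ms = {"1024^2": (191, 73), "2880^2": (3228, 1710)}
memory_gb = {"1024^2": (3.5, 2.7)}

def ratio(vae: float, unet: float) -> float:
    """How many times larger the VAE cost is than the UNet cost."""
    return vae / unet

def vae_share(vae: float, unet: float) -> float:
    """VAE fraction of the combined VAE + UNet cost (assumes nothing else counts)."""
    return vae / (vae + unet)

latency_ratios = {res: ratio(*pair) for res, pair in latency_ms.items()}
memory_ratios = {res: ratio(*pair) for res, pair in memory_gb.items()}
latency_shares = {res: vae_share(*pair) for res, pair in latency_ms.items()}

for res in latency_ms:
    print(f"{res}: VAE latency = {latency_ratios[res]:.2f}x UNet, "
          f"{100 * latency_shares[res]:.0f}% of VAE+UNet time")
for res, r in memory_ratios.items():
    print(f"{res}: VAE memory = {r:.2f}x UNet")
```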

Fidelity bottleneck: Even the 16-channel VAE achieves reconstruction PSNR below 35 dB, with lossy compression discarding high-frequency details.

Key Challenge: Existing methods (e.g., AdcSR) only simplify the VAE while still operating in latent space; the generation–reconstruction conflict introduced by simplification cannot be fundamentally resolved (e.g., removing decoder attention leads to texture loss).

Core Idea: pixel-(un)shuffle performs spatial scale transformations analogous to the VAE and can fully replace it, converting latent-space diffusion to pixel space. However, ×8 pixel-shuffle introduces checkerboard/repetitive-pattern artifacts, and no suitable discriminator exists for this setting.
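
To make the analogy concrete, here is a minimal NumPy sketch of ×8 pixel-unshuffle and pixel-shuffle. It follows the usual channel ordering of these operations and is not taken from the paper's code; the function names are illustrative. Unlike a VAE, the pair is an exact, parameter-free rearrangement, so the round trip loses nothing.

```python
import numpy as np

def pixel_unshuffle(x: np.ndarray, r: int = 8) -> np.ndarray:
    """(C, H, W) -> (C*r*r, H/r, W/r): fold each r x r pixel block into channels."""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0, "H and W must be divisible by r"
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)          # (C, r, r, H/r, W/r)
    return x.reshape(c * r * r, h // r, w // r)

def pixel_shuffle(z: np.ndarray, r: int = 8) -> np.ndarray:
    """(C*r*r, h, w) -> (C, h*r, w*r): exact inverse of pixel_unshuffle."""
    crr, h, w = z.shape
    assert crr % (r * r) == 0, "channel count must be divisible by r*r"
    c = crr // (r * r)
    z = z.reshape(c, r, r, h, w)
    z = z.transpose(0, 3, 1, 4, 2)          # (C, h, r, w, r)
    return z.reshape(c, h * r, w * r)

rng = np.random.default_rng(0)
image = rng.standard_normal((3, 64, 64))    # small stand-in for an RGB image
features = pixel_unshuffle(image)           # same 8x spatial reduction as the VAE
restored = pixel_shuffle(features)

print(features.shape)                       # (192, 8, 8)
print(np.array_equal(restored, image))      # True
```

An RGB image therefore becomes a 192-channel tensor (3 × 8 × 8) at 1/8 resolution. How the UNet's input and output layers are adapted to that channel count is not covered in this summary.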

Method

Overall Architecture

GenDR-Pix = GenDR − VAE + ×8 pixel-(un)shuffle + multi-stage adversarial distillation + PadCFG. The VAE encoder and decoder are progressively removed across two stages.

Key Designs

Stage I: Removing the Encoder
- Replace the VAE encoder with a ×8 pixel-unshuffle layer
- Use GenDR's UNet as a feature extractor for the discriminator
- Generator loss: $\mathcal{L}_{\mathcal{G}_1} = \|z_{\text{tea}} - z_{\text{stu}}\|_1 + \lambda_1 \cdot \text{softplus}(-\mathcal{D}(z_{\text{stu}}))$
- Output model GenDR-Adc: low-quality pixels → high-quality latent
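
The Stage I generator loss can be sketched in NumPy as follows. This is a hedged illustration only: the discriminator is represented by precomputed logits, the L1 term is averaged rather than summed, and the weight 0.05 is the value quoted for Stage II later in these notes, so reusing it here is an assumption.

```python
import numpy as np

def softplus(t: np.ndarray) -> np.ndarray:
    """Numerically stable softplus, log(1 + exp(t))."""
    return np.logaddexp(0.0, t)

def stage1_generator_loss(z_tea, z_stu, d_logits, lam1=0.05):
    """L1 distillation to the teacher latent plus a non-saturating adversarial term.

    d_logits stands in for D(z_stu); a real discriminator would compute it.
    """
    distill = np.abs(z_tea - z_stu).mean()   # mean-reduced L1 (an assumption)
    adv = softplus(-d_logits).mean()         # small when D scores the student as real
    return float(distill + lam1 * adv)

rng = np.random.default_rng(0)
z_tea = rng.standard_normal((4, 64, 64))     # hypothetical latent shape

# Student matches the teacher exactly; only the adversarial term remains.
loss_fooled = stage1_generator_loss(z_tea, z_tea, d_logits=np.full(16, 20.0))
loss_unsure = stage1_generator_loss(z_tea, z_tea, d_logits=np.zeros(16))
print(loss_fooled, loss_unsure)
```

When the discriminator confidently scores the student output as real, the adversarial term vanishes; with logits at zero it contributes `lam1 * ln 2`.
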
Stage II: Removing the Decoder (Core Challenge)
- Replace the decoder with a ×8 pixel-shuffle layer
- Problem 1 – Repetitive pattern artifacts: a single improper weight in ×8 upscaling propagates identical artifacts across all 8×8 patches
- Solution – Masked Fourier Space (MFS) Loss:
  - Artifacts manifest as periodic highlights in the frequency domain, aligned with the pixel-shuffle scale factor
  - A band-reject filter mask $\mathcal{M}$ is designed to penalize anomalous amplitudes: $\mathcal{L}_{\mathcal{F}} = \|\mathcal{M} \cdot (|\mathcal{F}\{y_{\text{stu}}\}| - |\mathcal{F}\{y_{\text{tea}}\}|)\|_1$
- Problem 2 – No suitable discriminator: latent-space discriminators cannot process pixel-shuffled features
- Solution – Use Stage I model as discriminator: GenDR-Adc ($\mathcal{G}_1$) naturally encodes inputs via pixel-unshuffle
- Problem 3 – Discriminator collapse: fixed pixel-unshuffle patterns cause the discriminator to focus on discrete distributional representations
- Solution – Random Padding (RandPad) augmentation: randomly sample $p_h, p_w \in \{0,\dots,7\}$ and pad both the SR and HQ images: $\text{randpad}(y) = \text{pad}(y, [p_h, 8-p_h, p_w, 8-p_w])$. Because the total padding per axis is always 8, the padded image stays divisible by 8 while the 8×8 unshuffle grid shifts by $(p_h, p_w)$. This encourages the discriminator to extract continuous representations, preventing pattern collapse
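
A minimal NumPy sketch of RandPad is given below. The padding order (top, bottom, left, right), the reflect padding mode, and the choice to share one offset pair between the SR and HQ images are all assumptions; the summary does not specify them.

```python
import numpy as np

def sample_offsets(rng: np.random.Generator, block: int = 8) -> tuple[int, int]:
    """Draw p_h, p_w uniformly from {0, ..., block - 1}."""
    return int(rng.integers(0, block)), int(rng.integers(0, block))

def randpad(y: np.ndarray, p_h: int, p_w: int, block: int = 8,
            mode: str = "reflect") -> np.ndarray:
    """Pad (C, H, W) by [p_h, block - p_h] rows and [p_w, block - p_w] columns.

    The total padding per axis is always `block`, so the result is still
    divisible by `block`, but the block grid is shifted by (p_h, p_w).
    """
    return np.pad(y, ((0, 0), (p_h, block - p_h), (p_w, block - p_w)), mode=mode)

rng = np.random.default_rng(0)
y_sr = rng.standard_normal((3, 64, 64))      # stand-in for a super-resolved image
y_hq = rng.standard_normal((3, 64, 64))      # stand-in for a high-quality target

p_h, p_w = sample_offsets(rng)
sr_pad = randpad(y_sr, p_h, p_w)             # same offsets for both (an assumption)
hq_pad = randpad(y_hq, p_h, p_w)
print(sr_pad.shape, (p_h, p_w))
```

The original content sits at offset `(p_h, p_w)` inside the padded image, so each training step presents the discriminator's pixel-unshuffle with a different block alignment.
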
PadCFG: Classifier-Free Guidance in Pixel Space
- Applying CFG directly in pixel space exacerbates artifacts
- Self-ensemble (multiple paddings → fusion) is integrated with CFG: $\bar{y} = \omega \times \mathcal{G}_2(\text{pad}(x,[4,4,4,4]), c_{\text{pos}}) + (1-\omega) \times \mathcal{G}_2(\text{pad}(x,[3,5,3,5]), c_{\text{neg}})$
- Requires only 2 forward passes (same as standard CFG) while incorporating the ensemble effect of different paddings
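
The PadCFG formula above can be sketched as follows. Several details are assumptions rather than facts from the paper: the generator is treated as resolution-preserving, each output is cropped back to the unpadded region before blending, reflect padding is used, and `toy_generator` and the guidance scale `omega=1.5` are hypothetical stand-ins.

```python
import numpy as np

def pad_hw(x, pads, mode="reflect"):
    """Pad (C, H, W) with pads = (top, bottom, left, right)."""
    top, bottom, left, right = pads
    return np.pad(x, ((0, 0), (top, bottom), (left, right)), mode=mode)

def crop_hw(y, pads):
    """Undo pad_hw on an output of the same spatial size as the padded input."""
    top, bottom, left, right = pads
    h, w = y.shape[1:]
    return y[:, top:h - bottom, left:w - right]

def pad_cfg(generator, x, c_pos, c_neg, omega,
            pads_pos=(4, 4, 4, 4), pads_neg=(3, 5, 3, 5)):
    """Blend positive and negative branches that see differently padded inputs."""
    y_pos = crop_hw(generator(pad_hw(x, pads_pos), c_pos), pads_pos)
    y_neg = crop_hw(generator(pad_hw(x, pads_neg), c_neg), pads_neg)
    return omega * y_pos + (1.0 - omega) * y_neg   # two forward passes in total

def toy_generator(x, c):
    """Hypothetical stand-in for G_2: adds a scalar 'condition' to every pixel."""
    return x + c

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 64, 64))
y_bar = pad_cfg(toy_generator, x, c_pos=1.0, c_neg=0.0, omega=1.5)
print(y_bar.shape)
```

The toy generator only verifies the plumbing (padding, cropping, blending). With a real pixel-space model, the two paddings place the 8×8 grid at different offsets, which is where the self-ensemble effect would come from.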

Loss & Training

Stage II total generator loss: $$\mathcal{L}_{\mathcal{G}_2} = \|y_{\text{tea}} - y_{\text{stu}}\|_1 + \lambda_1 \cdot \text{softplus}(-\mathcal{G}_1(y_{\text{stu}})) + \lambda_2 \mathcal{L}_{\mathcal{P}} + \lambda_3 \mathcal{L}_{\mathcal{F}}$$

Loss weights: $\lambda_1 = 0.05$, $\lambda_2 = 1$, $\lambda_3 = 0.1$. Training uses AdamW with a learning rate of 1e-5 in BFloat16 on 8×A100 GPUs with DeepSpeed ZeRO2.
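
The Stage II objective, including the Masked Fourier Space term, can be sketched as below. The exact mask design is not given in this summary, so `periodic_mask` is a hypothetical stand-in that selects the non-DC frequency bins where an 8-pixel-periodic pattern concentrates its energy. The perceptual term $\mathcal{L}_{\mathcal{P}}$ and the $\mathcal{G}_1$ logits are passed in as precomputed values, and norms are mean-reduced.

```python
import numpy as np

def softplus(t):
    """Numerically stable softplus, log(1 + exp(t))."""
    return np.logaddexp(0.0, t)

def periodic_mask(h: int, w: int, r: int = 8) -> np.ndarray:
    """Hypothetical mask: 1 at bins (k*h/r, l*w/r), excluding DC, else 0."""
    assert h % r == 0 and w % r == 0
    fy = np.arange(h) % (h // r) == 0
    fx = np.arange(w) % (w // r) == 0
    mask = np.outer(fy, fx).astype(float)
    mask[0, 0] = 0.0                         # DC is the image mean, not an artifact
    return mask

def mfs_loss(y_stu, y_tea, mask):
    """Mean of |M * (|F{y_stu}| - |F{y_tea}|)| over all bins."""
    amp_stu = np.abs(np.fft.fft2(y_stu, norm="ortho"))
    amp_tea = np.abs(np.fft.fft2(y_tea, norm="ortho"))
    return float(np.abs(mask * (amp_stu - amp_tea)).mean())

def stage2_generator_loss(y_tea, y_stu, g1_logits, perceptual, mask,
                          lams=(0.05, 1.0, 0.1)):
    lam1, lam2, lam3 = lams
    l1 = np.abs(y_tea - y_stu).mean()
    adv = softplus(-g1_logits).mean()        # G_1 acts as the discriminator
    return float(l1 + lam1 * adv + lam2 * perceptual
                 + lam3 * mfs_loss(y_stu, y_tea, mask))

rng = np.random.default_rng(0)
tile = rng.standard_normal((8, 8))
tile -= tile.mean()                          # zero-mean 8x8 pattern
artifact = np.tile(tile, (8, 8))[None]       # (1, 64, 64), repeats every 8 pixels
mask = periodic_mask(64, 64)

# All energy of the repeated pattern falls on the masked bins.
amp = np.abs(np.fft.fft2(artifact, norm="ortho"))
off_mask_energy = float(((1.0 - mask) * amp).max())

y_tea = rng.standard_normal((1, 64, 64))
y_stu = y_tea + 0.1 * artifact
loss_f = mfs_loss(y_stu, y_tea, mask)
total = stage2_generator_loss(y_tea, y_stu, g1_logits=np.zeros(4),
                              perceptual=0.0, mask=mask)
print(off_mask_energy, loss_f, total)
```

The demo illustrates why a frequency-domain penalty suits this artifact: a pattern repeated across every 8×8 patch occupies only a sparse grid of frequency bins, so a mask over that grid targets it without penalizing the rest of the spectrum.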

Key Experimental Results

Main Results

ImageNet-Test Quantitative Comparison (×4 SR, 512² input):

| Method | Params | MACs | PSNR↑ | SSIM↑ | CLIPIQA↑ | MUSIQ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| StableSR-50 | 1410M | 79940G | 26.00 | 0.7317 | 0.5768 | 64.54 |
| DiffBIR-50 | 1717M | 24234G | 25.45 | 0.6651 | 0.7486 | 73.04 |
| OSEDiff-1 | 1775M | 2265G | 24.82 | 0.7017 | 0.6778 | 71.74 |
| GenDR-1 | 933M | 1637G | 24.14 | 0.6878 | 0.7395 | 74.68 |
| GenDR-Pix | 866M | 344/744G | 25.49 | 0.7286 | 0.7168 | 72.85 |

Efficiency Comparison (4K output, A100):

| Model | VAE Status | Time | Memory | MUSIQ |
| --- | --- | --- | --- | --- |
| GenDR | Full VAE | 4.92s | 20.75 GB | 70.96 |
| GenDR-Adc | Encoder removed | 2.69s (−45%) | 17.75 GB (−14%) | 70.44 |
| GenDR-Pix | VAE removed | 1.75s (−64%) | 8.01 GB (−61%) | 70.23 |
| GenDR-Pix⋆ | No CFG | 0.87s (−82%) | 5.03 GB (−76%) | 68.64 |
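
The percentages in the efficiency comparison, and the headline 2.8× speedup, follow directly from the reported times and memory figures. A quick check (variable names are illustrative):

```python
# (time in seconds, memory in GB) for 4K output on an A100, as reported above.
rows = {
    "GenDR": (4.92, 20.75),
    "GenDR-Adc": (2.69, 17.75),
    "GenDR-Pix": (1.75, 8.01),
    "GenDR-Pix (no CFG)": (0.87, 5.03),
}

base_time, base_mem = rows["GenDR"]
summary = {}
for name, (time_s, mem_gb) in rows.items():
    summary[name] = {
        "speedup": base_time / time_s,
        "time_saved_pct": 100.0 * (1.0 - time_s / base_time),
        "mem_saved_pct": 100.0 * (1.0 - mem_gb / base_mem),
    }

for name, s in summary.items():
    print(f"{name:20s} {s['speedup']:.2f}x faster, "
          f"-{s['time_saved_pct']:.0f}% time, -{s['mem_saved_pct']:.0f}% memory")
```

The 2.8× and roughly 60% figures correspond to GenDR-Pix with CFG enabled; the sub-second, roughly 5 GB result is the no-CFG variant, which trades about 1.6 MUSIQ points for the extra speed.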

Ablation Study

Discriminator Design:

| Discriminator | RandPad | MUSIQ | CLIPIQA |
| --- | --- | --- | --- |
| None | − | 60.95 | 0.4924 |
| GenDR | − | 61.07 | 0.5457 |
| GenDR-Adc | ✗ | 63.87 | 0.5764 |
| GenDR-Adc | ✓ | 63.89 | 0.5937 |

MFS Loss: Removing the frequency-domain loss degrades both no-reference metrics: NIQE (lower is better) rises from 4.11 to 4.84, and LIQE (higher is better) drops from 3.21 to 2.91.

PadCFG: Using padding (4,4) for the positive branch and (3,5) for the negative branch improves MUSIQ by 0.45 and CLIPIQA by 0.0061 while keeping the same latency as vanilla CFG.

Key Findings

- The speedup from VAE removal substantially exceeds that of architectural pruning approaches
- Artifacts from ×8 pixel-shuffle can be effectively suppressed via frequency-domain loss and random padding
- GenDR-Pix is the only model capable of directly super-resolving to 8K without tiling
- User studies show that GenDR-Pix achieves perceptual quality comparable to the original GenDR

Highlights & Insights

- The finding that VAE is simultaneously an efficiency and fidelity bottleneck for one-step diffusion SR has broad implications
- The analogy between pixel-(un)shuffle and VAE is elegant and motivates a novel acceleration direction
- The multi-stage distillation strategy of using the previous-stage model as the discriminator for the next stage is a clever design choice
- RandPad's solution to discriminator collapse is simple yet effective, inspired by E-LPIPS
- The practical performance of 1-second 4K SR with 6 GB VRAM has significant industrial deployment value

Limitations & Future Work

- Validation is currently limited to GenDR (SD2.1 UNet); generalization to newer architectures such as SDXL and Flux remains unexplored
- Only ×4 SR is supported; adapting to other scale factors requires adjusting pixel-shuffle parameters
- The padding parameter selection in PadCFG is largely empirical and lacks theoretical justification
- Stage II training still requires the Stage I model as a discriminator, resulting in a relatively complex training pipeline
- For extreme degradations (e.g., severe compression artifacts), the performance gap with GenDR may widen

Related Work & Insights

- AdcSR: replaces only the encoder with ×2 pixel-unshuffle; this work extends the approach to ×8, fully eliminating the VAE
- PixelFlow: explores pixel-space diffusion
- GenDR / SiD / VSD: foundational one-step diffusion SR methods
- E-LPIPS: perceptual loss with random transformations, which inspired the RandPad design
- Insight: similar VAE elimination strategies may be applicable to other diffusion-based applications that rely on VAEs, such as image editing and inpainting

Rating

- Novelty: ⭐⭐⭐⭐ The direction of fully eliminating the VAE is novel; however, the core mechanism (pixel-shuffle substitution) is relatively intuitive, with the main contribution lying in overcoming engineering challenges
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage across multiple datasets, metrics, detailed ablations, user studies, and efficiency analyses
- Writing Quality: ⭐⭐⭐⭐ The roadmap-style presentation is clear and intuitive, though the density of equations is high
- Value: ⭐⭐⭐⭐⭐ Significant practical acceleration (2.8× speedup / 60% memory reduction); 4K in 1 second has clear industrial application value