InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis¶
Conference: ICCV 2025 arXiv: 2509.10441 Code: GitHub Area: Image Generation Keywords: Arbitrary-resolution generation, latent diffusion models, VAE decoder replacement, implicit neural positional encoding, efficient 4K generation
TL;DR¶
This paper proposes InfGen, a "second-generation" paradigm that replaces the VAE decoder with a Transformer-based generator, decoding fixed-size latents into images at arbitrary resolution in a single forward pass—without modifying or retraining the diffusion model. It reduces 4K image generation to under 10 seconds, achieving over 10× speedup compared to the fastest existing method, UltraPixel.
Background & Motivation¶
- Demand for arbitrary-resolution image generation: Diverse devices (smartphones, 4K displays, etc.) require consistent visual experience across resolutions.
- Bottlenecks of existing methods:
- Computational cost of diffusion models scales quadratically with resolution; generating 4K images incurs latency exceeding 100 seconds.
- Training-free methods (e.g., ScaleCrafter, FouriScale) achieve super-resolution by modifying the inference process (dilated convolutions, etc.), but are tightly coupled to specific network architectures and generalize poorly.
- Methods such as Inf-DiT redesign attention mechanisms but remain slow (255 seconds for 2K image generation).
- UltraPixel requires fine-tuning the diffusion model itself.
- Key insights:
- The generative model has already completed content generation (Stage 1); the Stage 2 decoder requires only a single forward pass.
- Enhancing the decoder's capacity to achieve high-resolution image generation is a more efficient pathway.
- The mapping from latent to high-resolution image is intrinsically a super-resolution task, but requires generative capability to compensate for information loss.
Method¶
Overall Architecture¶
InfGen operates as a "second-generation" model:

1. Stage 1: A diffusion model generates a fixed-size content latent \(z\) (e.g., \(4 \times 64 \times 64\)).
2. Stage 2: InfGen decodes \(z\) into an image at arbitrary resolution \((h, w)\).
This paradigm yields two key advantages: (1) high inference speed—avoiding multi-step denoising on high-resolution latents; and (2) plug-and-play compatibility—applicable to any diffusion model sharing the same latent space.
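The two-stage flow can be sketched as follows. Both functions are hypothetical stand-ins (the real Stage 1 is a diffusion sampler and the real Stage 2 is InfGen's Transformer decoder); only the shape contract, a fixed-size latent in and an arbitrary-size image out in one forward pass, is illustrated:

```python
import numpy as np

def stage1_diffusion_sample(rng):
    """Stand-in for the Stage-1 diffusion model: emits a fixed-size
    content latent z of shape (4, 64, 64), as in the paper's example."""
    return rng.standard_normal((4, 64, 64))

def stage2_infgen_decode(z, h, w):
    """Stand-in for InfGen's Stage-2 decoder: one forward pass from the
    fixed-size latent to a (3, h, w) image. Real InfGen uses a Transformer
    latent generator; nearest-neighbor resize here only shows the shapes."""
    c, lh, lw = z.shape
    rows = (np.arange(h) * lh) // h
    cols = (np.arange(w) * lw) // w
    resized = z[:, rows][:, :, cols]          # (4, h, w)
    return np.tanh(resized[:3])               # pretend RGB in [-1, 1]

rng = np.random.default_rng(0)
z = stage1_diffusion_sample(rng)              # fixed 4x64x64 latent
img = stage2_infgen_decode(z, 720, 1280)      # any target size, single pass
print(img.shape)  # (3, 720, 1280)
```

Note that the latent size never changes: only the decoder sees the target resolution, which is why the diffusion model needs no retraining.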
Arbitrary-Resolution Decoder Architecture¶
Built upon the conventional VAE structure, a Transformer-based latent generator is introduced:

- The latent variable \(z\) serves as keys and values.
- Mask tokens corresponding to the target image size \((h, w)\), with shape \((\lceil h/8 \rceil, \lceil w/8 \rceil)\), serve as queries.
- Across multiple Transformer blocks, mask tokens acquire information from latent keys via cross-attention.
- The resulting mask tokens are passed to an upsampling decoder to produce the final image at arbitrary resolution.
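The core query/key-value arrangement can be sketched with a single-head cross-attention step (a minimal sketch: one head, no residuals, norms, or MLP, and an assumed feature width `d=32`, whereas InfGen stacks multiple full Transformer blocks):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, kv, wq, wk, wv):
    """Mask tokens (queries) attend to latent tokens (keys/values)."""
    q, k, v = queries @ wq, kv @ wk, kv @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

d = 32
rng = np.random.default_rng(0)
h, w = 128, 192                                     # target image size
n_mask = -(-h // 8) * -(-w // 8)                    # ceil(h/8)*ceil(w/8) mask tokens
mask_tokens = rng.standard_normal((n_mask, d))      # queries: count depends on (h, w)
latent_tokens = rng.standard_normal((64 * 64, d))   # keys/values: fixed 64x64 latent
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = cross_attention(mask_tokens, latent_tokens, wq, wk, wv)
print(out.shape)  # (384, 32): one enriched token per 8x8 output patch
```

The key design point: the number of queries scales with the target resolution while the key/value set stays fixed, so the same latent can be decoded at any size.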
Implicit Neural Positional Encoding (INPE)¶
INPE addresses the spatial alignment between fixed-size latents and dynamically sized mask tokens:
1. Coordinate normalization: Mask token and latent token coordinates are mapped to a unified scale: \((\hat{x}^m, \hat{y}^m) = (x^m/W^m, y^m/H^m)\)
2. Spherical projection: Normalized 2D coordinates are transformed into 3D Cartesian coordinates on a unit sphere, leveraging spherical geometry to capture complex spatial relationships.
3. Fourier features + neural network mapping: High-frequency Fourier features enhance the model's ability to capture fine patterns: \(\gamma(x,y,z) = [\cos(B[x,y,z]^T), \sin(B[x,y,z]^T)]\) These features are then passed through an implicit neural network to produce dynamic positional encodings.
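The three steps can be sketched end to end. The longitude/latitude sphere mapping and the Gaussian Fourier matrix \(B\) are assumptions for illustration (the paper's exact projection and the implicit MLP on top are omitted):

```python
import numpy as np

def inpe_features(h_tokens, w_tokens, num_freqs=16, rng=None):
    """Sketch of INPE: normalize -> unit sphere -> Fourier features.
    The lon/lat mapping and random B are assumptions; the implicit
    neural network that follows is omitted."""
    if rng is None:
        rng = np.random.default_rng(0)
    # 1) normalize token coordinates to a unified [0, 1) scale
    ys, xs = np.meshgrid(np.arange(h_tokens), np.arange(w_tokens), indexing="ij")
    xh, yh = xs / w_tokens, ys / h_tokens
    # 2) project normalized 2D coords onto the unit sphere (assumed lon/lat map)
    theta, phi = 2 * np.pi * xh, np.pi * yh
    pts = np.stack([np.sin(phi) * np.cos(theta),
                    np.sin(phi) * np.sin(theta),
                    np.cos(phi)], axis=-1)              # (H, W, 3), unit norm
    # 3) Fourier features gamma(p) = [cos(Bp), sin(Bp)] with Gaussian B
    B = rng.standard_normal((num_freqs, 3))
    proj = pts @ B.T
    return pts, np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

pts, feats = inpe_features(16, 24)
print(pts.shape, feats.shape)  # (16, 24, 3) (16, 24, 32)
```

Because both mask tokens and latent tokens are normalized to the same scale before encoding, tokens from grids of different sizes land on comparable positions on the sphere.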
Loss & Training¶
The training objective is \(\mathcal{L} = \ell_1 + \lambda_P \mathcal{L}_P + \lambda_G \mathcal{L}_G\), where \(\ell_1\) denotes the L1 reconstruction loss, \(\mathcal{L}_P\) is the LPIPS perceptual loss, and \(\mathcal{L}_G\) is the adversarial loss from a PatchGAN discriminator. Both \(\lambda_P\) and \(\lambda_G\) are set to 0.1.
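A minimal sketch of the composite objective, with `lpips_term` and `adv_term` as hypothetical scalar placeholders (real LPIPS and PatchGAN losses require their own networks):

```python
import numpy as np

def infgen_loss(pred, target, lpips_term, adv_term, lam_p=0.1, lam_g=0.1):
    """L = L1 reconstruction + 0.1 * LPIPS + 0.1 * adversarial.
    lpips_term / adv_term stand in for the outputs of the LPIPS
    network and the PatchGAN generator loss."""
    l1 = np.abs(pred - target).mean()
    return l1 + lam_p * lpips_term + lam_g * adv_term

rng = np.random.default_rng(0)
pred, target = rng.random((3, 64, 64)), rng.random((3, 64, 64))
loss = infgen_loss(pred, target, lpips_term=0.5, adv_term=0.8)
print(float(loss) > 0)  # True
```

The small 0.1 weights keep L1 reconstruction dominant while the perceptual and adversarial terms supply the generative detail the latent alone cannot encode.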
Training-Free Resolution Extrapolation¶
Ultra-high resolutions (e.g., 4K) beyond the training resolution are achieved through an iterative procedure: \(L_n = \text{Encoder}(I_{n-1}), \quad I_n = \text{InfGen}(L_n, k_n^s)\), where \(k_n^s\) is the upscaling factor at step \(n\).
The final resolution is: \(R_f = 512 \cdot \prod_{i=1}^n s_i^h \times 512 \cdot \prod_{i=1}^n s_i^w\)
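The resolution arithmetic, using the 2x-per-iteration factor described later in this note:

```python
import math

def final_resolution(base=512, scales=((2, 2), (2, 2))):
    """R_f = 512 * prod(s_i^h) x 512 * prod(s_i^w); each iteration
    re-encodes the previous output and decodes at scale (s_i^h, s_i^w)."""
    h = base * math.prod(s[0] for s in scales)
    w = base * math.prod(s[1] for s in scales)
    return h, w

print(final_resolution())                       # (2048, 2048) after 2 iterations
print(final_resolution(scales=((2, 2),) * 3))   # (4096, 4096): ~4K after 3
```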
Key Experimental Results¶
Main Results: Image Tokenizer Reconstruction Quality¶
| Method | Input→Output Resolution | ImageNet rFID↓ | PSNR↑ | SSIM↑ |
|---|---|---|---|---|
| VQGAN | 256²→256² | 1.19 | 23.38 | 0.762 |
| SD-VAE | 256²→256² | 0.74 | 25.68 | 0.820 |
| SDXL-VAE | 256²→256² | 0.68 | 26.04 | 0.834 |
| InfGen | 256²→256² | 1.07 | 24.61 | 0.798 |
| InfGen | 512²→512² | 0.61 | 27.92 | 0.867 |
| SD-VAE | 256²→512² | 1.43 | 24.14 | 0.759 |
| InfGen | 256²→512² | 1.15 | 22.86 | 0.728 |
On the cross-resolution task (256²→512²), InfGen achieves a notably better rFID than SD-VAE (1.15 vs. 1.43), though its PSNR and SSIM are slightly lower, consistent with a generative rather than purely reconstructive decoder.
High-Resolution Enhancement over Diffusion Models¶
| Method | 512² FIDp↓ | 1024² FIDp↓ | 2048² FIDp↓ | 3072² FIDp↓ |
|---|---|---|---|---|
| DiT-XL/2 | 44.17 | 61.52 | 64.87 | 77.84 |
| InfGen+DiT | 39.81 (↓9.9%) | 41.75 (↓32%) | 56.21 (↓13.4%) | 45.94 (↓41%) |
| SD1.5 | 21.58 | 55.30 | - | - |
| InfGen+SD1.5 | 16.92 (↓21%) | 41.12 (↓26%) | - | - |
| FiTv2 | 42.04 | 66.95 | - | 79.30 |
| InfGen+FiTv2 | 38.77 (↓7.8%) | 61.56 (↓8.1%) | - | 45.72 (↓42%) |
Key finding: InfGen yields consistent and significant improvements across all resolution and model combinations, with gains reaching up to 42% at 3072² resolution.
Comparison with SOTA High-Resolution Generation Methods¶
| Method | 1024² FIDp↓ | 2048² FIDp↓ | 1024² Latency (s) | 2048² Latency (s) |
|---|---|---|---|---|
| ScaleCrafter | 55.36 | 144.61 | 7 | 97 |
| Inf-DiT | 48.48 | 142.05 | 50 | 255 |
| UltraPixel | 48.37 | 127.26 | 11 | 20 |
| InfGen+SD1.5 | 44.85 | 139.14 | 2.9+0.4 | 2.9+1.9 |
| InfGen+SDXL | 35.14 | 96.41 | 5.4+0.4 | 5.4+1.9 |
InfGen+SDXL leads by a wide margin in FID while remaining fast: generating a 2K image takes about 7.3 seconds in total, roughly 2.7× faster than UltraPixel and about 35× faster than Inf-DiT.
Training Setup¶
- Training data: 5 million images with resolution >1024² from LAION-Aesthetic, plus a subset with resolution >2048².
- Two-stage training: Stage 1: 512²→1024² (batch=32, 500k iterations); Stage 2: 512²→2048² (batch=8, 100k iterations).
- Hardware: 8×A100, trained for 15 days.
- Optimizer: AdamW with cosine learning rate decay from 2e-4 to 1e-5.
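The stated schedule can be sketched as a standard cosine decay (the 2e-4 and 1e-5 endpoints come from the setup above; the absence of warmup and any per-stage restart are assumptions):

```python
import math

def cosine_lr(step, total_steps, lr_max=2e-4, lr_min=1e-5):
    """Cosine decay from lr_max to lr_min over total_steps."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

print(cosine_lr(0, 500_000))        # 0.0002 at the start
print(cosine_lr(500_000, 500_000))  # 1e-05 at the end
```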
Highlights & Insights¶
- Paradigm shift: The problem of arbitrary-resolution generation is reframed from "modifying the diffusion model" to "enhancing the decoder," substantially reducing complexity and computational cost.
- Plug-and-play design: InfGen leaves the VAE encoder unchanged, enabling it to serve as a direct upgrade for existing diffusion models including SD1.5, SDXL, DiT, and SiT.
- Implicit Neural Positional Encoding: Spherical projection combined with Fourier features elegantly resolves the spatial alignment challenge between fixed latents and dynamic target resolutions.
- Iterative extrapolation strategy: Generation capability is extended to arbitrarily high resolutions without retraining, with each iteration applying a 2× upscaling factor.
Limitations & Future Work¶
- InfGen performs slightly below the original VAE on same-resolution reconstruction at 256² (rFID 1.07 vs. 0.74), reflecting the greater complexity of its task.
- The iterative extrapolation pipeline introduces additional encoding–decoding cycles, each accumulating information loss.
- Adversarial training may cause generated content to deviate from the original semantics.
- The current model is trained only on SDXL's VAE latent space; transferring to other latent spaces requires retraining.
Related Work & Insights¶
- Relation to super-resolution methods: InfGen is fundamentally a conditional generative super-resolution approach, conditioned on latents rather than low-resolution images.
- Relation to VAE improvements: Rather than improving the VAE itself, InfGen adds a generative capability layer on top of it.
- Insight: The second-generation paradigm decouples "resolution" from the diffusion model's burden, eliminating the need to address high-resolution or arbitrary-resolution generation within the diffusion stage itself.
Rating ⭐⭐⭐⭐¶
The approach is conceptually clear, methodologically concise, and practically efficient. The plug-and-play design philosophy and 10× speedup confer strong practical value. The Implicit Neural Positional Encoding is an elegant and well-motivated design.