Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models

Conference: CVPR 2025
arXiv: 2503.18352
Code: https://github.com/zhang0jhon/diffusion-4k
Area: Image Generation / Ultra-High Resolution
Keywords: 4K Image Generation, Wavelet Fine-tuning, Partitioned VAE, Diffusion Models, Aesthetic-4K Benchmark

TL;DR

This paper proposes the Diffusion-4K framework, which consists of the Aesthetic-4K benchmark dataset, GLCM Score/Compression Ratio evaluation metrics, and a wavelet-based fine-tuning method. It enables large-scale latent diffusion models (LDMs) such as SD3-2B and Flux-12B to directly generate high-quality 4096×4096 images with rich texture details.

Background & Motivation

Background: Mainstream latent diffusion models (SD3, Flux, etc.) primarily focus on training and generation at 1024×1024 resolution, leaving direct 4K image synthesis largely unexplored. Although PixArt-Σ and Sana achieve 4K-level generation, they mainly focus on efficiency, ignoring the inherent high-frequency details and rich textures characteristic of 4K images.

Limitations of Prior Work: (1) There is a lack of publicly available 4K image synthesis datasets and benchmarks; (2) Conventional evaluation metrics (FID, CLIP Score) are computed at low resolutions, failing to evaluate local detail quality in 4K images; (3) Direct deployment of standard \(F=8\) VAE models at 4096×4096 leads to Out-of-Memory (OOM) issues.

Key Challenge: Going from 1024×1024 to 4096×4096 multiplies the pixel count by 16, so compute and memory grow quadratically with the side length. Meanwhile, standard training objectives (noise/velocity prediction) have no explicit term for high-frequency details, so 4K outputs often come out "large yet blurry".

Goal: (1) Establish a complete benchmark for 4K image synthesis; (2) Propose a generalized fine-tuning method to enable various LDMs to generate highly detailed images at 4K resolution.

Key Insight: Wavelet transform can decompose a signal into low-frequency approximations and high-frequency details. Applying wavelet transform to velocity/noise prediction targets allows high-frequency and low-frequency components to simultaneously participate in loss computation, thereby explicitly enhancing focus on details.

Core Idea: Combine three components: the Wavelet-based Latent Fine-tuning (WLF) method, a Partitioned VAE (\(F=16\)) that resolves the OOM issue, and the Aesthetic-4K dataset with dedicated evaluation metrics.

Method

Overall Architecture

The framework collects high-quality 4K images to construct the Aesthetic-4K dataset and uses GPT-4o to generate precise captions. A Partitioned VAE is used to compress images into an \(F=16\) latent space. In this latent space, pre-trained diffusion models (SD3/Flux) are fine-tuned using a wavelet transform loss while freezing the VAE and text encoders.

Key Designs

Partitioned VAE:
- Function: Resolve the OOM issue of the \(F=8\) VAE at 4096×4096 resolution.
- Mechanism: Apply a dilation rate of 2 in the first convolutional layer of the VAE encoder to obtain an additional 2× downsampling (\(F=8 \rightarrow F=16\)). In the last layer of the decoder, the feature map is partitioned, each partition is upsampled by 2×, the same convolution is applied to each partition, and the outputs are reassembled into the full image. Pre-trained VAE parameters are fully reused without retraining.
- Design Motivation: Maintain latent space consistency with pre-trained LDMs to prevent distribution shifts, while reducing GPU memory consumption to a feasible range.
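The effect of moving from \(F=8\) to \(F=16\) on latent size can be checked with simple arithmetic. The sketch below is only a shape calculation, not the Partitioned VAE itself: the function name `latent_grid`, the 16 latent channels, and the 2×2 patching into transformer tokens are assumptions chosen for illustration, and the numbers say nothing about the VAE's activation memory.

```python
def latent_grid(image_side, vae_factor, latent_channels=16, patch=2):
    """Return (latent_side, latent_elements, tokens) for a square image.

    latent_channels and patch are illustrative assumptions, not values
    taken from the summary above.
    """
    if image_side % (vae_factor * patch) != 0:
        raise ValueError("image_side must be divisible by vae_factor * patch")
    latent_side = image_side // vae_factor              # spatial size of the latent
    latent_elements = latent_channels * latent_side ** 2
    tokens = (latent_side // patch) ** 2                # sequence length after patching
    return latent_side, latent_elements, tokens


for factor in (8, 16):
    side, elements, tokens = latent_grid(4096, factor)
    print(f"F={factor}: latent {side}x{side}, {elements} elements, {tokens} tokens")
```

Under these assumptions, doubling the downsampling factor cuts both the latent element count and the token count by a factor of 4.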
Wavelet-based Latent Fine-tuning (WLF):
- Function: Explicitly enhance focus on high-frequency details in the fine-tuning objective.
- Mechanism: Apply the Discrete Wavelet Transform (DWT) with Haar wavelets to both the model's prediction (velocity/noise) and the ground-truth target, decomposing each into four sub-bands: LL (low-frequency approximation) and LH/HL/HH (high-frequency details). The loss function is formulated as \(\mathcal{L}_{WLF} = \mathbb{E}[w_t \|f(v_\Theta(z_t,t)) - f(\epsilon - x_0)\|^2]\), where \(f(\cdot)\) denotes the DWT, \(\epsilon - x_0\) is the velocity target, and \(w_t\) is a timestep-dependent weight.
- Design Motivation: Standard MSE loss treats all frequencies equally, which dilutes the relative contribution of high-frequency components in 4K scenarios. DWT forces the model to simultaneously optimize low-frequency structure and high-frequency textures.
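The WLF mechanism described above can be sketched on a single 2D channel in plain Python. This is a minimal illustration, not the authors' implementation: `haar_dwt2` and `wavelet_mse` are hypothetical names, the transform is a single-level orthonormal Haar DWT, the weight \(w_t\) is reduced to a scalar argument, and the LH/HL labels follow one common convention (other libraries swap them).

```python
def haar_dwt2(x):
    """Single-level orthonormal 2D Haar DWT of a 2D list (even height/width).

    Returns the sub-bands (LL, LH, HL, HH), each half the input size.
    """
    height, width = len(x), len(x[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    bands = ([], [], [], [])  # LL, LH, HL, HH
    for i in range(0, height, 2):
        rows = ([], [], [], [])
        for j in range(0, width, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            rows[0].append((a + b + c + d) / 2)  # LL: local average
            rows[1].append((a + b - c - d) / 2)  # LH: top row minus bottom row
            rows[2].append((a - b + c - d) / 2)  # HL: left column minus right column
            rows[3].append((a - b - c + d) / 2)  # HH: diagonal detail
        for band, row in zip(bands, rows):
            band.append(row)
    return bands


def wavelet_mse(pred, target, weight=1.0):
    """Weighted mean squared error between the Haar coefficients of pred and target."""
    total, count = 0.0, 0
    for p_band, t_band in zip(haar_dwt2(pred), haar_dwt2(target)):
        for p_row, t_row in zip(p_band, t_band):
            for p, t in zip(p_row, t_row):
                total += (p - t) ** 2
                count += 1
    return weight * total / count


# Toy example: velocity target eps - x0 for one 2x2 latent channel.
x0 = [[0.0, 1.0], [2.0, 3.0]]
eps = [[0.5, 0.5], [0.5, 0.5]]
target = [[e - x for e, x in zip(e_row, x_row)] for e_row, x_row in zip(eps, x0)]
pred = [[0.0, 0.0], [0.0, 0.0]]  # stand-in for the model output v_Theta(z_t, t)
print(wavelet_mse(pred, target))
```

In a real training loop the transform would be applied per latent channel on batched tensors; the nested lists here only keep the example dependency-free.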
Aesthetic-4K Benchmark:
- Function: Provide training data, an evaluation set, and metrics for 4K image synthesis.
- Mechanism: The training set consists of 12,015 high-quality 4K images with GPT-4o captions. The evaluation set consists of 2,781 images from LAION-Aesthetics. Two new metrics are introduced: the GLCM Score (entropy of the Gray-Level Co-occurrence Matrix, measuring texture richness) and the Compression Ratio (JPEG compression ratio, measuring detail preservation). Their alignment with human perception (SRCC of 0.75 and 0.53, respectively) far exceeds that of MUSIQ (0.36) and MANIQA (0.20).
- Design Motivation: Fill the gap in 4K image synthesis evaluation and provide detail quality metrics centered on human perception.
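To make the idea of GLCM entropy concrete, here is a small sketch in plain Python. The exact definition of the paper's GLCM Score (offsets, quantization, patching) is not given in this summary, so the function name `glcm_entropy`, the 8 gray levels, and the single horizontal-neighbor offset are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def glcm_entropy(gray, levels=8, max_value=255):
    """Shannon entropy (in bits) of a horizontal-neighbor co-occurrence matrix.

    gray is a 2D list of integer intensities in [0, max_value].
    """
    # Quantize intensities into a small number of gray levels.
    quantized = [[v * levels // (max_value + 1) for v in row] for row in gray]
    # Count how often each (left, right) level pair occurs.
    counts = Counter()
    for row in quantized:
        for left, right in zip(row, row[1:]):
            counts[(left, right)] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("image must be at least two pixels wide")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


flat = [[10] * 4 for _ in range(4)]          # no texture at all
striped = [[0, 255, 0] for _ in range(4)]    # alternating dark and bright pixels
print(glcm_entropy(flat), glcm_entropy(striped))
```

A flat image concentrates all mass in one co-occurrence cell and gets zero entropy, while textured images spread mass over more cells and score higher, which matches the "higher is richer texture" reading of the metric.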

Loss & Training

The WLF loss is based on MSE in the wavelet transform domain. Training utilizes the AdamW optimizer (learning rate = 1e-6) with mixed-precision training. SD3-2B is trained on 2 A800-80G GPUs, and Flux-12B is trained on 8 A100-80G GPUs.

Key Experimental Results

Main Results

Aesthetic-Eval@2048 Evaluation:

| Model | FID↓ | CLIPScore↑ | Aesthetics↑ |
|---|---|---|---|
| SD3-F16 (baseline) | 43.82 | 31.50 | 5.91 |
| SD3-F16-WLF (Ours) | 40.18 | 34.04 | 5.96 |
| Flux-F16 (baseline) | 50.57 | - | - |

Detail Quality Metrics

| Model | GLCM Score↑ | Compression Ratio↓ |
|---|---|---|
| SD3-F16 | Baseline | Baseline |
| SD3-F16-WLF | Higher | Lower (more details) |

Ablation Study

| Configuration | Effect | Explanation |
|---|---|---|
| Standard Fine-tuning (w/o WLF) | FID/CLIP baseline | 4K images tend to be blurry |
| + WLF Wavelet Loss | Decreased FID, increased CLIP | Details significantly enhanced |
| Partitioned VAE (\(F=16\)) | rFID=1.40, PSNR=28.82 | Close to original VAE reconstruction quality |

Key Findings

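On Aesthetic-Eval@2048, SD3-F16-WLF improves over the SD3-F16 baseline on FID (43.82 → 40.18), CLIPScore (31.50 → 34.04), and Aesthetics (5.91 → 5.96), and also scores better on the detail metrics.
Partitioned VAE maintains latent space consistency without retraining, with reconstruction quality (rFID=1.40, PSNR=28.82) close to that of the original VAE.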
GLCM Score and Compression Ratio align better with human perception of details compared to existing NR-IQA metrics.
Large models (Flux-12B) show a more pronounced advantage in 4K generation, validating the scalability of DiTs at ultra-high resolutions.
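The human-alignment figures quoted for the metrics are Spearman rank correlation coefficients (SRCC) between metric scores and human preference. As a reminder of what that number measures, here is a minimal sketch for the tie-free case; the function names and the toy inputs are hypothetical and are not data from the paper.

```python
def ranks(values):
    """Rank positions (1 = smallest) for a sequence without ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for equal-length sequences without ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("ties are not handled in this sketch")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Toy values only: metric scores vs. human rankings for four images.
metric_scores = [0.2, 0.5, 0.4, 0.9]
human_scores = [1, 2, 3, 4]
print(spearman(metric_scores, human_scores))
```

An SRCC of 1 means the metric orders images exactly as humans do, 0 means no monotonic relationship, and -1 means the order is reversed.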

Highlights & Insights

Generality of Wavelet Fine-tuning: WLF only modifies the loss function, so in principle it applies to any LDM. It is validated here on SD3 and Flux and should carry over to UNet architectures as well, making it a lightweight yet effective approach.
Partitioned VAE Cleverly Solves OOM: The strategy of only modifying the first and last convolutional layers avoids retraining the VAE, making it highly practical.
Human Perception-Centric Evaluation: GLCM Score and Compression Ratio fill the gap in 4K evaluation.

Limitations & Future Work

The training set consists of only 12K images, indicating a limited data scale.
The Partitioned VAE compresses more aggressively (\(F=16\)) than the standard \(F=8\) VAE, which may result in the loss of some fine-grained information.
The framework has only been validated on text-to-image tasks; downstream tasks like image editing and video generation remain unexplored.
Although evaluation metrics align better with human perception, their correlation coefficients still have room for improvement.

vs PixArt-Σ / Sana: These models prioritize 4K efficiency but ignore detail quality, whereas this work focuses on detail enhancement.
vs Stable Cascade: Uses multi-stage diffusion to progressively scale resolution, which may accumulate errors; conversely, this work directly performs end-to-end 4K generation.

Rating

Novelty: ⭐⭐⭐⭐ A comprehensive system featuring wavelet fine-tuning, partitioned VAE, and a 4K benchmark.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-model validation, with thorough analysis of the new metrics' alignment with human perception.
Writing Quality: ⭐⭐⭐⭐ Systematic and complete, with clear motivations for each component.
Value: ⭐⭐⭐⭐ The 4K benchmark and WLF method bring practical value to the community.