UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset

Conference: NeurIPS 2025 · arXiv: 2510.20661 · Code: available (per the paper) · Area: Ultra-High-Resolution Image Generation / Diffusion Models · Keywords: Ultra-High-Resolution, Dataset, Frequency-Aware, Detail Generation, Post-Training
TL;DR

This work constructs UltraHR-100K, a large-scale dataset of 100K ultra-high-resolution images with rich annotations, and proposes Frequency-Aware Post-Training (FAPT), which combines Detail-Oriented Timestep Sampling (DOTS) with DFT-based Soft-Weighted Frequency Regularization (SWFR), enabling pretrained T2I models to generate fine-grained details at ultra-high resolutions.
Background & Motivation
Text-to-image (T2I) diffusion models perform well at 1024×1024 resolution but suffer from significant quality degradation and structural artifacts when directly scaled to ultra-high resolutions (UHR, e.g., 4K). Existing approaches fall into two categories:
Training-free methods (DemoFusion, HiFlow, etc.): Achieve UHR generation by modifying inference strategies, but tend to produce over-smoothed results with unrealistic details and long inference times.
Training-based methods (PixArt-σ, SANA, etc.): Focus primarily on training efficiency while neglecting detail generation quality.
Two core challenges remain unresolved: (1) the lack of large-scale, high-quality open-source UHR T2I datasets — the existing Aesthetic-4K contains only ~10K images without rigorous filtering; and (2) the absence of training strategies targeting fine-grained UHR detail synthesis — current models exhibit strong semantic planning capacity but inadequate high-frequency detail generation at UHR scales.
Method

Overall Architecture
This work comprises two components:

1. UltraHR-100K Dataset: 100K carefully curated UHR images with rich textual annotations.
2. Frequency-Aware Post-Training (FAPT): A two-stage training strategy: Stage 1 fine-tunes on UltraHR-100K to enhance semantic planning; Stage 2 applies DOTS + SWFR to focus on high-frequency detail learning.
Key Designs
- UltraHR-100K Dataset Construction:
    - Data Collection: A Scrapy-based crawler collects approximately 400K high-resolution images (minimum 3840×2160).
    - Preliminary Filtering: Laplacian variance (sharpness) + Sobel operator (edge density) to remove blurry and texture-free images.
    - Three-Dimensional Fine Filtering:
        - Detail Richness: GLCM (Gray-Level Co-occurrence Matrix)-based contrast, entropy, and correlation; top 50% retained (\(S_G\)).
        - Content Complexity: Shannon entropy measuring pixel intensity diversity; top 50% retained (\(S_E\)).
        - Aesthetic Quality: LAION Aesthetic Predictor scoring; top 50% retained (\(S_A\)).
    - Final Dataset: the intersection of the three subsets, UltraHR-100K = \(S_G \cap S_E \cap S_A\), ensuring every image simultaneously meets all three standards.
    - Annotation: Gemini 2.0 generates detailed long captions covering global summaries and fine-grained descriptions, with annotation length substantially exceeding that of Aesthetic-4K.
    - Final scale: 104,117 images, with an average resolution of 5119×3648 (W×H).
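The filtering pipeline above can be sketched as follows. The operators (variance of the Laplacian for sharpness, Sobel gradients for edge density) are standard, but the edge threshold, the median-based top-50% cut, and all helper names are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

# Standard 3x3 discrete Laplacian and Sobel kernels.
LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

def conv2d_valid(img, kernel):
    """Naive 'valid'-mode 2D convolution; fine for a small sketch."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def sharpness(img):
    # Variance of the Laplacian response: low values indicate blur.
    return conv2d_valid(img, LAPLACIAN).var()

def edge_density(img, thresh=0.1):
    # Fraction of pixels whose Sobel gradient magnitude exceeds a
    # (hypothetical) threshold: low values indicate texture-free images.
    gx = conv2d_valid(img, SOBEL_X)
    gy = conv2d_valid(img, SOBEL_X.T)
    return (np.hypot(gx, gy) > thresh).mean()

def top_half_intersection(scores):
    """Given {name: per-image score array}, keep only image indices in
    the top 50% of *every* score, mirroring S_G ∩ S_E ∩ S_A."""
    keep = None
    for s in scores.values():
        sel = set(np.flatnonzero(s >= np.median(s)))
        keep = sel if keep is None else keep & sel
    return sorted(int(i) for i in keep)
```

The intersection design means the final dataset size is not fixed in advance; here it is whatever survives all three median cuts simultaneously.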
- Detail-Oriented Timestep Sampling (DOTS):
    - Observation: Early denoising steps primarily reconstruct low-frequency structure, while later steps progressively synthesize high-frequency details.
    - Denoising timesteps are sampled from a Beta(\(\alpha, \beta\)) distribution with \(\alpha=2, \beta=4\), biasing sampling toward later denoising steps (near \(t=0\)).
    - Effect: Guides the model to focus on the high-frequency, detail-related denoising steps during post-training.
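A minimal sketch of DOTS-style timestep sampling, assuming draws from Beta(2, 4) on [0, 1] are mapped onto a 1000-step integer grid (the grid size and the mapping are assumptions of this sketch, not the paper's exact discretization):

```python
import numpy as np

def sample_dots_timesteps(n, alpha=2.0, beta=4.0, t_max=1000, seed=None):
    """Sample n denoising timesteps from a Beta(alpha, beta) distribution.

    With alpha=2, beta=4 the distribution's mean is alpha/(alpha+beta) = 1/3,
    so mass is skewed toward small t (late denoising steps), which is
    where high-frequency detail emerges.
    """
    rng = np.random.default_rng(seed)
    u = rng.beta(alpha, beta, size=n)        # u in (0, 1)
    return (u * (t_max - 1)).astype(int)     # map to {0, ..., t_max - 1}
```

Raising \(\alpha\) shifts mass toward larger t (earlier, structure-focused steps), which matches the ablation finding that an overly large \(\alpha\) weakens detail learning.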
- Soft-Weighted Frequency Regularization (SWFR):
    - A 2D DFT is applied to both predictions and targets, and weighted constraints are imposed in the frequency domain.
    - Frequency soft-weighting function: \(w(r) = 1 + \lambda \cdot \frac{\exp(\gamma r)-1}{\exp(\gamma)-1}\), where \(r \in [0,1]\) is the normalized frequency distance from the spectrum center.
    - \(\lambda\) and \(\gamma\) control the intensity and steepness of the high-frequency emphasis.
    - Frequency loss: \(\mathcal{L}_{\text{freq}} = \mathbb{E}\left[\|w(r)\cdot\hat{x} - w(r)\cdot\hat{y}\|^2\right]\), where \(\hat{x}\) and \(\hat{y}\) denote the DFTs of the prediction and target.
    - Total loss: \(\mathcal{L} = \mathcal{L}_{\text{diff}} + \lambda_{\text{freq}} \cdot \mathcal{L}_{\text{freq}}\)
    - Compared to DWT-based approaches (as used in Diffusion4K), this provides finer, continuous frequency separation.
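The weighting function and frequency loss can be sketched with NumPy's FFT. The radial normalization and the \(\lambda, \gamma\) values below are illustrative, and the paper's exact DFT conventions may differ:

```python
import numpy as np

def soft_freq_weight(shape, lam=1.0, gamma=4.0):
    """w(r) = 1 + lam * (exp(gamma*r) - 1) / (exp(gamma) - 1).

    r is the distance from the spectrum center (DC term), normalized
    to [0, 1]; lam and gamma here are illustrative values.
    """
    h, w = shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    r = np.hypot(*np.meshgrid(fy, fx, indexing="ij"))
    r = r / r.max()
    return 1.0 + lam * (np.exp(gamma * r) - 1.0) / (np.exp(gamma) - 1.0)

def swfr_loss(pred, target, lam=1.0, gamma=4.0):
    """Soft-weighted MSE between the centered 2D DFTs of pred and target."""
    fp = np.fft.fftshift(np.fft.fft2(pred))
    ft = np.fft.fftshift(np.fft.fft2(target))
    w = soft_freq_weight(pred.shape, lam, gamma)
    return np.mean(np.abs(w * fp - w * ft) ** 2)
```

Because \(w(r)\) grows monotonically with frequency, an error concentrated in high-frequency bands incurs a larger loss than a low-frequency error of equal spectral energy, which is exactly the intended emphasis on fine detail.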
Loss & Training

- Two-stage training: Stage 1 fine-tunes for 4K steps with Logit-Normal timestep sampling; Stage 2 trains for 8K steps with DOTS + SWFR.
- CAMEWrapper optimizer, constant learning rate \(1\times10^{-4}\), mixed-precision training, batch size 24.
- Built on the SANA model; trained on 4× H20 GPUs.
- Evaluation benchmark: self-constructed UltraHR-eval4K (2,000 images at 4096×4096).
Key Experimental Results

Main Results (UltraHR-eval4K, 4096×4096)
| Method | FID↓ | FID_patch↓ | IS↑ | IS_patch↑ | CLIP↑ | FG-CLIP↑ |
|---|---|---|---|---|---|---|
| FLUX + BSRGAN | 37.65 | 43.14 | 11.77 | 5.39 | 31.45 | 28.02 |
| I-Max(FLUX) | 37.67 | 37.84 | 11.99 | 4.39 | 31.49 | 27.78 |
| HiFlow(FLUX) | 35.89 | 38.33 | 11.77 | 4.62 | 31.52 | 27.75 |
| PixArt-σ | 33.17 | 32.20 | 12.21 | 5.39 | 31.78 | 28.65 |
| SANA | 37.07 | 38.80 | 11.78 | 5.65 | 31.70 | 28.60 |
| Diffusion4K | 39.86 | 38.52 | 10.83 | 3.24 | 31.41 | 26.48 |
| Ours (UltraHR-100K) | 34.00 | 20.93 | 12.50 | 5.02 | 31.85 | 28.65 |
| Ours (+FAPT) | 31.75 | 15.80 | 13.00 | 5.10 | 31.82 | 28.68 |
Ablation Study
| Model | DOTS | SWFR | Data | FID↓ | FID_patch↓ | CLIP↑ |
|---|---|---|---|---|---|---|
| LoRA | × | × | Full | 35.07 | 35.02 | 31.80 |
| A (full fine-tuning baseline) | × | × | Full | 33.99 | 20.93 | 31.85 |
| B | ✓ | × | Full | 32.57 | 19.95 | 31.79 |
| C | ✓ | ✓ | 15K | 32.75 | 18.42 | 31.81 |
| D (full method) | ✓ | ✓ | Full | 31.74 | 15.79 | 31.82 |
Key Findings
- FID_patch improvement is the most pronounced (38.80→15.80), demonstrating the method's advantage in fine-grained detail generation.
- In user studies, the proposed method achieves a 70% overall preference rate and a 78% detail quality preference rate, substantially outperforming competing methods.
- Data scale is critical: the full 100K dataset yields significantly better results than the 15K subset.
- Beta(\(\alpha=2, \beta=4\)) is optimal for DOTS; excessively large \(\alpha\) weakens detail learning, while excessively small \(\alpha\) harms semantic consistency.
- SWFR contributes most significantly to FID_patch improvement (19.95→15.79), validating the effectiveness of high-frequency regularization.
- The proposed method also achieves state-of-the-art results on the public Aesthetic-Eval@4096 benchmark, demonstrating generalizability.
Highlights & Insights

- Three-dimensional intersection filtering is simple yet effective: requiring every image to pass the detail-richness, content-complexity, and aesthetic-quality filters simultaneously enforces a high quality floor for the dataset.
- DOTS exploits the frequency characteristics of the denoising process: it directly targets the empirical property that early steps generate structure while later steps generate details.
- The choice of DFT over DWT is well-motivated: DFT provides a continuous spectrum that enables finer-grained control compared to the coarse discrete decomposition of DWT.
- The post-training paradigm is broadly applicable: It demonstrates that significant UHR detail capability can be unlocked through lightweight post-training without retraining from scratch.
- The contribution of a 100K-scale UHR dataset is itself of considerable value and is expected to advance the broader UHR generation community.
Limitations & Future Work
- Frequency-aware post-training slightly reduces text-image alignment (marginal CLIP score decrease).
- The dataset contains relatively few portrait images, leaving room for improvement in UHR portrait generation.
- Training and evaluation are conducted solely on SANA; generalization to other mainstream models such as FLUX and SD3 remains unverified.
- Limited compute: experiments use only 4× H20 GPUs, constraining the scale of training and ablations.
Related Work & Insights
- Compared to Diffusion4K (DWT-based frequency decomposition), the proposed DFT + soft-weighting approach substantially outperforms across all metrics.
- Compared to PixArt-σ (efficient token compression) and SANA (efficient 4K pipeline), the proposed method prioritizes detail quality over computational efficiency.
- Insight: UHR generation requires not only efficient architectures but also high-quality data and detail-oriented training strategies.
Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐