Rethinking Image Super-Resolution from Training Data Perspectives
Conference: ECCV 2024
arXiv: 2409.00768
Code: https://github.com/gohtanii/DiverSeg-dataset
Area: Image Restoration
Keywords: Super-Resolution, Training Data, Dataset Construction, Image Quality Assessment, Object Diversity
TL;DR
This paper rethinks image super-resolution (SR) from the perspective of training data. It proposes an automated data evaluation pipeline to construct the DiverSeg dataset, which features low-resolution but high-quality and object-diverse images. The authors demonstrate that SR models trained on this dataset can outperform those trained on traditional high-resolution datasets (such as DF2K and LSDIR).
Background & Motivation
The field of image super-resolution (SR) has achieved significant progress over the past decade, yet research has primarily focused on updating network architectures. From a training data perspective, conventional methods rely on high-resolution datasets like DIV2K and Flickr2K (collectively referred to as DF2K). Recently, LSDIR further expanded the data scale to 84,991 high-resolution images.
Existing datasets are constructed around two core criteria:
- Resolution and Quality: Demanding high resolution (HD/2K/4K) and manually excluding images with compression artifacts.
- Diversity: Containing various scenes, illuminations, and textures.
Key Challenge: Collecting uncompressed, high-resolution images is both difficult and expensive, making it hard to scale datasets up. For instance, ImageNet contains 1.28 million images but includes low-quality and JPEG-compressed images.
Core Problem: What criteria does training data actually require? Is high resolution truly necessary?
Key Findings: Three factors positively influence SR performance: (i) low compression artifacts, (ii) high intra-image diversity (more objects), and (iii) large-scale datasets. Low-resolution images that satisfy these conditions can even outperform high-resolution data during SR training.
Method
Overall Architecture
An automated image evaluation pipeline is proposed to filter and construct the SR training dataset DiverSeg from large-scale, low-resolution datasets (ImageNet, Places365, and PASS). The pipeline consists of two stages: Source Selection and Object-based Filtering.
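The two stages can be read as a small filter chain: first drop whole sources whose estimated quality is too low, then drop individual images with too few object regions. The following is a minimal sketch under stated assumptions, not the authors' implementation: `Source`, `select_sources`, `filter_by_objects`, and the `min_quality=90.0` cutoff are hypothetical names and values introduced here, the image ids and region counts are made up, and treating the threshold as inclusive (`R(x) >= theta`) is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Source:
    name: str
    estimated_quality: float       # dataset-level quality estimate (stage 1)
    region_counts: Dict[str, int]  # image id -> number of regions R(x)


def select_sources(sources: List[Source], min_quality: float) -> List[Source]:
    """Stage 1: keep whole datasets whose estimated quality is high enough."""
    return [s for s in sources if s.estimated_quality >= min_quality]


def filter_by_objects(source: Source, theta: int) -> List[str]:
    """Stage 2: keep images whose region count R(x) is at least theta (assumed inclusive)."""
    return [img for img, r in source.region_counts.items() if r >= theta]


# Quality values are the estimates quoted in this summary; everything else
# (image ids, region counts, the 90.0 cutoff) is illustrative only.
sources = [
    Source("ImageNet", 95.5, {"a.jpg": 152, "b.jpg": 37, "c.jpg": 100}),
    Source("Places365", 75.0, {"d.jpg": 210}),
    Source("PASS", 99.8, {"e.jpg": 99, "f.jpg": 180}),
]

kept = {
    s.name: filter_by_objects(s, theta=100)
    for s in select_sources(sources, min_quality=90.0)
}
print(kept)
```

In this toy run Places365 is rejected at the source level even though its single image has many regions, which mirrors the ordering of the two stages: quality is judged per dataset, diversity per image.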
Key Designs
- Source Selection: Blockiness Distribution-based Quality Estimation
    - Mechanism: Screen high-quality data sources by estimating the JPEG compression quality of each dataset.
    - Calculate the blockiness value \(B(x)\) for each image using a blockiness metric, and obtain the dataset-level blockiness distribution \(p_{X,q}(b)\) through kernel density estimation.
    - Blockiness is quantified using variations in subband DCT coefficients:
      \[B(x) = \sum_{i=1}^{P}\sum_{j=1}^{P}\left|\frac{\bar{V}_{crop}(i,j) - \bar{V}(i,j)}{\bar{V}(i,j)}\right|\]
    - Compare the blockiness distribution of the target dataset \(X\) with the baseline distributions of a reference dataset \(Z\) (DF2K) at each quality level \(q \in S\) using KL divergence, and estimate the dataset quality as a softmax-weighted average of the quality levels:
      \[\hat{q}_X = \sum_{q \in S} q \, \frac{\exp(-D_{KL}(p_{X,1.0} \,\|\, p_{Z,q}))}{\sum_{q' \in S} \exp(-D_{KL}(p_{X,1.0} \,\|\, p_{Z,q'}))}\]
    - Results: ImageNet quality is estimated at 95.5%, Places365 at 75.0% (filtered out), and PASS at 99.8%.
    - Design Motivation: Traditional methods require manual, image-by-image quality checks, whereas this method automatically estimates the quality of an entire dataset from its statistical distribution, avoiding manual assessment.
- Object-based Filtering: Image Diversity Screening
    - Core Hypothesis: Images containing more object regions are more effective for SR training.
    - Two filtering methods:
        - Segmentation-based filtering: Uses SAM (ViT-H) to count the segmentation masks \(R(x)\) in each image. With a threshold of \(\theta = 100\), about 260K ImageNet images are retained.
        - Detection-based filtering: Uses Detic (ViT-B) to count the detected objects \(R(x)\) in each image. With a threshold of \(\theta = 18\), about 260K images are likewise retained.
    - Design Motivation: During manual quality evaluation, human annotators tend to favor detail-rich images, which implicitly filters out images with few objects. This method makes that implicit preference explicit.
- DiverSeg Dataset
    - DiverSeg-I: 259K images filtered from ImageNet.
    - DiverSeg-P: 267K images filtered from PASS.
    - DiverSeg-IP: A combination of both, totaling 527K images.
    - Characteristics: Low resolution (averaging 233K pixels vs. 2.8M for DF2K) but high quality (low blockiness) and high diversity (averaging 146 segmentation masks vs. 103 for DF2K).
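The Source Selection formulas above can be illustrated with a short, dependency-free sketch. It is an assumption-laden toy, not the paper's code: the function names, the grid, the bandwidth, and all sample values are invented for illustration, and `blockiness` takes \(\bar{V}_{crop}\) and \(\bar{V}\) as precomputed \(P \times P\) inputs because this summary does not define how they are obtained from the DCT subbands.

```python
import math
from typing import Dict, List, Sequence


def blockiness(v_crop: Sequence[Sequence[float]], v: Sequence[Sequence[float]]) -> float:
    """B(x): sum of |(V_crop(i,j) - V(i,j)) / V(i,j)| over the P x P entries."""
    return sum(
        abs((vc - vv) / vv)
        for row_c, row in zip(v_crop, v)
        for vc, vv in zip(row_c, row)
    )


def kde(samples: Sequence[float], grid: Sequence[float], bandwidth: float) -> List[float]:
    """Gaussian kernel density estimate on a grid, normalised to sum to 1."""
    dens = [
        sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in samples)
        for g in grid
    ]
    total = sum(dens)
    return [d / total for d in dens]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """Discrete KL(p || q); eps guards against log(0) in empty tails."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def estimate_quality(target: Sequence[float],
                     reference_by_q: Dict[float, Sequence[float]],
                     grid: Sequence[float],
                     bandwidth: float) -> float:
    """Softmax over negative KL divergences, used to average the quality levels."""
    p = kde(target, grid, bandwidth)
    neg_kl = {
        q: -kl_divergence(p, kde(ref, grid, bandwidth))
        for q, ref in reference_by_q.items()
    }
    shift = max(neg_kl.values())  # subtract the max for numerical stability
    weights = {q: math.exp(v - shift) for q, v in neg_kl.items()}
    z = sum(weights.values())
    return sum(q * w / z for q, w in weights.items())


# Toy inputs (all values made up).
b = blockiness(v_crop=[[1.0, 5.0], [1.5, 5.0]], v=[[2.0, 4.0], [1.0, 5.0]])

grid = [i * 0.05 for i in range(81)]  # blockiness axis from 0.0 to 4.0
reference_by_q = {
    70.0: [3.0, 3.2, 2.8, 3.1],
    80.0: [2.0, 2.1, 1.9, 2.05],
    90.0: [1.0, 1.1, 0.9, 1.05],
}
target = [1.0, 1.1, 0.9, 1.05]  # matches the q=90 reference exactly
est = estimate_quality(target, reference_by_q, grid, bandwidth=0.2)
print(b, est)
```

Because the toy target distribution coincides with the reference at quality 90, its KL divergence to that level is zero and the estimate lands just below 90; the other levels contribute only through their exponentially small softmax weights.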
Loss & Training
The models are trained using the original configurations from their respective papers (MSRResNet, EDSR, RCAN, SwinIR, and HAT), with the only difference being the training dataset. Standard \(L_1\) or \(L_2\) losses are utilized. The core objective is to validate the impact of dataset quality rather than to modify the training strategy.
Key Experimental Results
Main Results
Comparison of \(\times 4\) SR performance (PSNR in dB / SSIM). The paper evaluates on 5 benchmark datasets; the table below lists four of them:
| Model | Training Data | Set5 | BSD100 | Urban100 | Manga109 |
|---|---|---|---|---|---|
| SwinIR | DF2K | 32.92/0.9044 | 27.92/0.7489 | 27.45/0.8254 | 32.03/0.9260 |
| SwinIR | LSDIR | 32.86/0.9036 | 27.92/0.7492 | 27.79/0.8331 | 31.98/0.9262 |
| SwinIR | DiverSeg-I | 32.97/0.9053 | 27.98/0.7508 | 27.83/0.8336 | 32.34/0.9283 |
| HAT | DF2K | 33.03/0.9056 | 27.99/0.7514 | 27.93/0.8365 | 32.44/0.9292 |
| HAT | LSDIR | 32.93/0.9053 | 28.01/0.7525 | 28.45/0.8469 | 32.57/0.9306 |
| HAT | DiverSeg-I | 33.15/0.9071 | 28.07/0.7542 | 28.51/0.8477 | 32.90/0.9325 |
| RCAN | DF2K | 32.50/0.8990 | 27.75/0.7421 | 26.73/0.8058 | 31.17/0.9165 |
| RCAN | DiverSeg-I | 32.70/0.9012 | 27.81/0.7443 | 27.03/0.8116 | 31.58/0.9210 |
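To make the size of the improvement easier to read, the PSNR gains of DiverSeg-I over DF2K can be computed directly from the table. This is a small arithmetic sketch: the numbers are copied from the rows above, while the variable names (`psnr`, `gains`) are just illustrative.

```python
# (Set5, BSD100, Urban100, Manga109) PSNR in dB, copied from the table above.
psnr = {
    ("SwinIR", "DF2K"): (32.92, 27.92, 27.45, 32.03),
    ("SwinIR", "DiverSeg-I"): (32.97, 27.98, 27.83, 32.34),
    ("HAT", "DF2K"): (33.03, 27.99, 27.93, 32.44),
    ("HAT", "DiverSeg-I"): (33.15, 28.07, 28.51, 32.90),
    ("RCAN", "DF2K"): (32.50, 27.75, 26.73, 31.17),
    ("RCAN", "DiverSeg-I"): (32.70, 27.81, 27.03, 31.58),
}
benchmarks = ("Set5", "BSD100", "Urban100", "Manga109")

# Per-model, per-benchmark PSNR gain of DiverSeg-I over DF2K, rounded to 0.01 dB.
gains = {}
for model in ("SwinIR", "HAT", "RCAN"):
    base = psnr[(model, "DF2K")]
    ours = psnr[(model, "DiverSeg-I")]
    gains[model] = {b: round(o - d, 2) for b, o, d in zip(benchmarks, ours, base)}

for model, g in gains.items():
    print(model, g)
```

For the rows shown, the gains on Urban100 and Manga109 range from 0.30 to 0.58 dB, while those on Set5 and BSD100 stay at or below 0.20 dB.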
Ablation Study
| Configuration | Urban100 PSNR (qualitative) | Description |
|---|---|---|
| Full ImageNet (1.28M) | Lower | Contains a large number of low-quality compressed images |
| ImageNet Filtered (260K) | Gain | Performance improves after removing low-quality images |
| DiverSeg-I (260K, \(\theta=100\)) | Best | Dual filtering of high quality + high diversity |
| Places365 | Worst | Quality is only 75%, containing significant compression artifacts |
| PASS (1.44M) | Good | Quality is 99.8%, but diversity is insufficient |
| DiverSeg-P (267K) | Outperforms DF2K | Diversity improved after filtering from PASS |
| Threshold \(\theta=0\) (No filtering) | Baseline | Equivalent to training on the full dataset |
| Threshold \(\theta=100\) | Best | Sweet spot for object diversity filtering |
Key Findings
- High resolution is not required: High-quality datasets with low resolution (~233K pixels) can outperform high-resolution datasets (DF2K ~2.8M pixels).
- Compression artifacts are harmful: The low quality (75.0%) of Places365 leads to the worst SR performance, demonstrating the negative impact of compression artifacts on SR training.
- Object diversity is crucial: More objects in an image \(\rightarrow\) more textures and edges \(\rightarrow\) better SR performance.
- Scaling effect: Under equal quality, larger quantities of images generally bring better performance.
- Architecture-independent gains: DiverSeg-I outperforms DF2K across all 5 SR models, covering both CNN and Transformer architectures.
Highlights & Insights
- Reversing traditional perceptions: The paper shows that high-resolution images are not a necessity for SR training, which substantially lowers the barrier to constructing SR datasets.
- Automated pipeline: A fully automated dataset construction workflow that eliminates time-consuming manual quality assessments.
- Blockiness-based quality estimation: KL divergence is used to compare blockiness distributions and estimate quality at the dataset level, avoiding manual image-by-image inspection.
- High generalizability: The method is applicable to all tested SR models (3 CNNs + 2 Transformers), without relying on specific architectures.
- Significant practical value: The pipeline enables automated screening of SR training data from other large-scale image datasets, at the cost of running the segmentation or detection model over every image.
Limitations & Future Work
- The thresholds for object filtering (\(\theta = 100\) for SAM, \(\theta = 18\) for Detic) are set manually, without an automated selection strategy.
- Validated only on \(\times 4\) SR, lacking coverage of other scales such as \(\times 2\) and \(\times 8\).
- The evaluation models (SAM and Detic) themselves incur high computational overhead, requiring substantial resources to process million-scale datasets.
- The impact of semantic category distributions (e.g., natural scenes vs. urban scenes) on SR has not been analyzed.
- The interaction effects with data augmentation strategies have not been investigated.
Related Work & Insights
- Provides a deeper understanding of the role of ImageNet pre-training in SR (e.g., HAT uses ImageNet pre-training, but unfiltered ImageNet might not be the optimal SR data).
- The blockiness-based quality estimation method can be generalized to dataset construction for other low-level vision tasks.
- The "object diversity" finding implies that SR models may benefit more from rich local texture patterns than from global high resolution.
Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐