Enhancing Dataset Distillation via Non-Critical Region Refinement

Conference: CVPR 2025
arXiv: 2503.18267
Code: https://github.com/tmtuan1307/NRR-DD
Area: Model Compression
Keywords: Dataset Distillation, Non-Critical Region Optimization, Class Activation Mapping, Soft Label Compression, Distance Representation

TL;DR

This paper proposes NRR-DD, a three-stage framework that (1) uses CAM to select low-confidence patches for initializing synthetic images, (2) freezes critical regions and optimizes only non-critical regions to improve information density, and (3) replaces each 1000-dimensional soft label with 2 distance values, a 500x storage compression. It achieves 46.1% accuracy on ImageNet-1K at IPC=10 (25.7 percentage points above RDED) and reduces soft label storage from 120GB to 0.2GB.

Background & Motivation

Background: Large-scale dataset distillation methods can be categorized into two paradigms: those focusing on class-general features (e.g., SRe2L, which learns commonalities but loses details) and those focusing on instance-specific features (e.g., RDED, which selects real patches but lacks class-general commonalities).

Limitations of Prior Work: (1) Each paradigm is biased toward one feature type, so synthetic images struggle to capture class-general and instance-specific features at the same time. (2) The storage overhead of large-scale soft labels is prohibitively high: ImageNet-1K with IPC=200 requires 120GB to store 1000-dimensional soft labels.

Key Challenge: The real patches selected by RDED already contain rich instance features, but the non-critical regions (e.g., background) are underutilized. Meanwhile, storing 1000-dimensional soft labels serves as a bottleneck for large-scale dataset distillation deployment.

Goal: To enhance the learning of class-general features while preserving the instance-specificity of real patches, and to resolve the storage bottleneck of soft labels.

Key Insight: Utilize CAM to identify critical and non-critical regions, freeze the critical regions (preserving instance features), optimize non-critical regions (injecting class-general features), and replace full-dimensional soft labels with 2 distance values.

Core Idea: Freeze high-CAM regions to preserve instance features, optimize low-CAM regions to inject class-general knowledge, and compress soft labels by 500x using a distance-based representation.

Method

Overall Architecture

The framework consists of three stages: (1) In the CIDD stage, CAM is used to select patches with the lowest confidence (rather than the highest) to compose synthetic images. (2) In the NRR stage, high-activation regions of CAM are frozen, and gradient descent optimization (using cross-entropy and BN regularization) is applied only to the non-critical regions, enabling the non-critical regions of low-confidence patches to learn class-general features. (3) In the DBR stage, a teacher model calculates the cross-entropy distance between the soft label of each synthetic image and the original/augmented one-hot labels, saving only 2 distance values instead of the 1000-dimensional vector.

Key Designs

CAM-based Low-Confidence Initialization (CIDD):
- Function: Select patch combinations with the most potential for optimization to initialize synthetic images.
- Mechanism: CAM is used to identify highly activated regions in the original image and extract the top-t patches. Unlike RDED, which keeps the highest-confidence patches, this method keeps the lowest-confidence ones. Then \(\beta\) low-confidence patches are stitched together to form a synthetic image. Low-confidence patches contain discriminative features that the teacher model is less certain about, which leaves the most room for optimization in the subsequent NRR stage.
- Design Motivation: Highly confident patches are already well-recognized by the model and leave little room for optimization. Low-confidence patches contain more non-critical regions that can be refined.
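
The selection rule described for CIDD can be illustrated with a small NumPy sketch. It assumes each candidate patch already has a CAM activation score and a teacher confidence; the function names `select_low_confidence_patches` and `stitch_grid`, the toy scores, and the square-grid stitching are all hypothetical choices for illustration, not the paper's implementation.

```python
import numpy as np

def select_low_confidence_patches(patches, cam_scores, confidences, t, beta):
    """Keep the t patches with the highest CAM activation, then return the
    beta of those that the teacher is least confident about."""
    patches = np.asarray(patches)
    cam_scores = np.asarray(cam_scores, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    # Step 1: indices of the top-t patches by CAM activation (descending).
    top_t = np.argsort(-cam_scores, kind="stable")[:t]
    # Step 2: among those, order by teacher confidence (ascending).
    order = np.argsort(confidences[top_t], kind="stable")
    chosen = top_t[order[:beta]]
    return chosen, patches[chosen]

def stitch_grid(patches):
    """Tile k*k patches of shape (h, w, c) into one (k*h, k*w, c) image."""
    n = len(patches)
    k = int(round(np.sqrt(n)))
    if k * k != n:
        raise ValueError("this sketch only stitches a square number of patches")
    rows = [np.concatenate(list(patches[r * k:(r + 1) * k]), axis=1)
            for r in range(k)]
    return np.concatenate(rows, axis=0)

# Toy data: six candidate 16x16 RGB patches with made-up scores.
rng = np.random.default_rng(0)
patches = rng.random((6, 16, 16, 3))
cam_scores = np.array([0.9, 0.2, 0.8, 0.7, 0.1, 0.6])
confidences = np.array([0.95, 0.10, 0.40, 0.30, 0.05, 0.20])

chosen, selected = select_low_confidence_patches(
    patches, cam_scores, confidences, t=5, beta=4)
image = stitch_grid(selected)
print(chosen.tolist())  # [1, 5, 3, 2]
print(image.shape)      # (32, 32, 3)
```

Patch 4 has the lowest confidence overall but is dropped in step 1 because its CAM activation is also the lowest, while patch 0 survives step 1 and is dropped in step 2 because the teacher is already confident about it.
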
Non-Critical Region Refinement (NRR):
- Function: Inject class-general features while preserving instance-specific features.
- Mechanism: A non-critical region mask \(M = \max\{0, \epsilon - C\}\) is generated using CAM. Gradient updates are applied only to the pixels covered by the mask: \(\tilde{x} = \tilde{x} - M \times \eta \nabla_{\tilde{x}} \mathcal{L}_C\). The loss function includes cross-entropy (urging low-confidence samples to move toward high-confidence space) and BN statistics regularization (aligning with the running statistics of the original model).
- Design Motivation: Critical regions already possess instance-specific features that should not be corrupted. Non-critical regions (e.g., backgrounds) can be treated as a "blank canvas," where optimization can inject general features beneficial for identifying the target class.
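
The masked update above can be sketched in a few lines of NumPy. In the method, the gradient comes from \(\mathcal{L}_C\) evaluated through the teacher; the sketch below substitutes a toy quadratic loss so that it runs without a model. The CAM values, \(\epsilon = 0.5\), the step size, and the helper names are illustrative assumptions.

```python
import numpy as np

def non_critical_mask(cam, eps):
    """M = max(0, eps - C): zero wherever the CAM value reaches eps."""
    return np.maximum(0.0, eps - cam)

def masked_step(x, grad, mask, lr):
    """One update x <- x - M * lr * grad, applied elementwise."""
    return x - mask * lr * grad

# Toy 3x3 CAM: the top-left corner is the "critical" object region.
cam = np.array([[0.9, 0.8, 0.1],
                [0.7, 0.2, 0.0],
                [0.1, 0.0, 0.0]])
eps = 0.5
mask = non_critical_mask(cam, eps)

x = np.ones_like(cam)        # stand-in for the synthetic image
target = np.zeros_like(cam)  # toy optimum, NOT the real objective
for _ in range(100):
    grad = 2.0 * (x - target)  # gradient of the toy loss sum((x - target)**2)
    x = masked_step(x, grad, mask, lr=0.5)

critical = cam >= eps
print(x[critical])             # unchanged: all ones
print(x[~critical].max() < 1e-6)
```

With the formula as written, the mask is soft rather than binary: pixels with lower CAM values receive larger effective step sizes, and pixels at or above \(\epsilon\) receive none.
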
Distance-Based Representation for Soft Label Compression (DBR):
- Function: Compress 1000-dimensional soft labels into 2 scalars.
- Mechanism: The teacher computes two cross-entropy distances for each synthetic image: \(d_{org}\), between its soft label and the original one-hot label, and \(d_{aug}\), between its soft label and the augmented one-hot label. During student training, a distance matching loss replaces direct Knowledge Distillation (KD): the student's own distances to the same one-hot labels are pushed toward the teacher's. Only 2 float values need to be stored per sample.
- Design Motivation: Storing full-dimensional soft labels for ImageNet-1K at IPC=200 requires 120GB, which is impractical. Storing 2 distance values can recover 60-71% of the soft label performance while requiring only 0.2GB of storage.
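
A minimal sketch of the DBR idea follows, under two stated assumptions: the distance matching loss is taken to be a squared error between student and teacher distances (the exact form is not given here), and the "augmented" one-hot label is simply passed in as a second class index chosen arbitrarily for the example. All function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def one_hot(index, num_classes):
    v = np.zeros(num_classes)
    v[index] = 1.0
    return v

def cross_entropy(target, pred, tiny=1e-12):
    """CE(target, pred) = -sum_k target_k * log(pred_k)."""
    return float(-np.sum(target * np.log(pred + tiny)))

def dbr_encode(probs, y_org, y_aug):
    """Reduce a probability vector to its two distances (d_org, d_aug)."""
    n = probs.shape[0]
    return (cross_entropy(one_hot(y_org, n), probs),
            cross_entropy(one_hot(y_aug, n), probs))

def dbr_loss(student_probs, y_org, y_aug, d_org, d_aug):
    """ASSUMED squared-error form of the distance matching loss."""
    s_org, s_aug = dbr_encode(student_probs, y_org, y_aug)
    return (s_org - d_org) ** 2 + (s_aug - d_aug) ** 2

num_classes, y_org, y_aug = 1000, 3, 7
teacher_logits = np.zeros(num_classes)
teacher_logits[y_org] = 5.0  # toy teacher: confident in the original class
teacher_logits[y_aug] = 2.0
teacher = softmax(teacher_logits)

# Only these two floats are stored instead of the 1000-dim soft label.
d_org, d_aug = dbr_encode(teacher, y_org, y_aug)

uniform_student = np.full(num_classes, 1.0 / num_classes)
print(dbr_loss(teacher, y_org, y_aug, d_org, d_aug))          # 0.0
print(dbr_loss(uniform_student, y_org, y_aug, d_org, d_aug) > 0)
print(teacher.size // 2)                                      # 500
```

The last line shows where the 500x figure comes from: 1000 stored values per label are replaced by 2.
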

Loss & Training

During the NRR stage: \(\mathcal{L}_C = \mathcal{L}_{ce} + \alpha_{bn}\mathcal{L}_{bn}\). During the student training stage: \(\mathcal{L}_S = \mathcal{L}_{sce} + \alpha_{dbr}\mathcal{L}_{dbr}\).
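
To make the NRR objective concrete, the sketch below composes \(\mathcal{L}_C\) from a given cross-entropy value and a BN statistics term. The specific form of \(\mathcal{L}_{bn}\) used here (L2 gaps between batch and running mean/variance, summed over layers) is an assumption borrowed from common model-inversion practice, and the value of \(\alpha_{bn}\) is arbitrary.

```python
import numpy as np

def bn_statistics_loss(features, running_mean, running_var):
    """L2 gap between the batch statistics of `features`, shaped
    (N, C, H, W), and the running statistics of one BN layer."""
    batch_mean = features.mean(axis=(0, 2, 3))
    batch_var = features.var(axis=(0, 2, 3))
    return float(np.linalg.norm(batch_mean - running_mean)
                 + np.linalg.norm(batch_var - running_var))

def refinement_loss(ce, bn_losses, alpha_bn):
    """L_C = L_ce + alpha_bn * L_bn, with L_bn summed over BN layers."""
    return ce + alpha_bn * sum(bn_losses)

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 5, 5))  # toy activations, 4 channels
running_mean = features.mean(axis=(0, 2, 3))
running_var = features.var(axis=(0, 2, 3))

# Features that match the stored statistics incur no penalty.
matched = bn_statistics_loss(features, running_mean, running_var)
# Shifting every activation by 1 moves each of the 4 channel means by 1,
# so the mean gap has L2 norm sqrt(4) = 2 while the variances are unchanged.
shifted = bn_statistics_loss(features + 1.0, running_mean, running_var)

total = refinement_loss(ce=1.5, bn_losses=[shifted], alpha_bn=0.1)
print(matched, round(shifted, 6), round(total, 6))
```
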

Key Experimental Results

Main Results

| Dataset | IPC | NRR-DD | RDED | SRe2L | Gain vs. RDED (pp) |
|---|---|---|---|---|---|
| CIFAR-10 | 10 | 72.2% | 50.2% | 29.3% | +22.0 |
| CIFAR-100 | 10 | 62.7% | 48.1% | 27.0% | +14.6 |
| Tiny-ImageNet | 50 | 61.2% | 47.6% | 41.1% | +13.6 |
| ImageNet-1K | 10 | 46.1% | 20.4% | 21.3% | +25.7 |
| ImageNet-1K | 50 | 60.2% | 38.4% | 46.8% | +21.8 |

Soft Label Compression

| Method | ImageNet-1K accuracy (IPC=50) | Storage | Description |
|---|---|---|---|
| Full Soft Labels | 60.2% | 120GB | Baseline |
| DBR (2 Distances) | 45.1% | 0.2GB | 500x compression, recovers 60% performance |
| One-hot | 32.4% | ~0 | Information loss is too high |

Key Findings

NRR-DD outperforms RDED and SRe2L on every reported dataset and IPC setting, and the gap is largest on ImageNet-1K (+25.7 percentage points over RDED at IPC=10).
Selecting low-confidence patches yields better results than selecting high-confidence ones because it leaves a larger optimization space: a counter-intuitive strategy of "the worse, the more potential."
Non-critical region refinement is the core innovation. Freezing critical regions and optimizing only the remaining pixels delivers large accuracy gains, which suggests that backgrounds and other non-critical regions still carry information useful for classification.
Even at a 500x compression ratio, the DBR 2-distance scheme retains a substantial amount of soft label information.

Highlights & Insights

Counter-Intuitive Design of "Freezing Subject, Optimizing Background": Traditional approaches focus on how to better extract features of the main subject. In contrast, NRR-DD discovers that backgrounds and non-critical regions also carry rich discriminative information for classes.
Deep Reason Behind Low-Confidence Selection: Low-confidence patches are not noise. They contain discriminative features, but in atypical modes. After NRR optimization, they become "good samples with unique perspectives."
Elegance of DBR Soft Label Compression: Compressing 1000-dimensional information into 2 distance values essentially retains the most crucial knowledge distillation signal: "how far the soft label is from the one-hot label."

Limitations & Future Work

Although DBR achieves a 500x compression, the performance drop is still significant (60% recovery rate). Refining distance representations could further close the gap.
The quality of CAM depends heavily on the pre-trained model. If the pre-trained model performs poorly, the partition into critical and non-critical regions will be inaccurate.
The number of optimization iterations in the NRR stage requires careful tuning.

Comparison with Related Work

vs RDED: RDED directly selects high-confidence patches without optimization, while NRR-DD selects low-confidence ones and optimizes their non-critical regions, raising accuracy from 50.2% to 72.2% on CIFAR-10 at IPC=10.
vs SRe2L series: SRe2L focuses on class-general features but loses instance information. NRR-DD balances both by preserving critical regions and optimizing backgrounds.
vs EDF: EDF improves discriminative regions from a gradient perspective. NRR-DD freezes or optimizes different regions from a pixel perspective, offering a complementary strategy.

Rating

Novelty: ⭐⭐⭐⭐⭐ The idea of non-critical region refinement is unique; the combination of low-confidence selection and background optimization is highly novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage from CIFAR to ImageNet-1K, backed by exhaustive soft label compression experiments.
Writing Quality: ⭐⭐⭐⭐ Clearly structured stages and in-depth motivation analysis.
Value: ⭐⭐⭐⭐⭐ Massive performance gains coupled with practical storage compression, driving significant progress in the DD community.