Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios
Conference: CVPR 2025
arXiv: 2410.17193
Code: https://github.com/NUS-HPC-AI-Lab/EDF
Area: Model Compression
Keywords: Dataset Distillation, Discriminative Features, Grad-CAM, Complex Scenarios, Common Pattern Dropout
TL;DR
Proposes the EDF method to address the performance degradation of dataset distillation in complex scenarios (ImageNet subsets). It introduces Common Pattern Dropout (discarding parameter gradients of low-loss common patterns in trajectory matching) and Discriminative Area Enhancement (utilizing Grad-CAM to scale up gradients of discriminative regions), achieving lossless compression on datasets such as ImageMeow/ImageYellow with only 23% of the data.
Background & Motivation
Background: Dataset distillation has achieved nearly lossless compression on simple datasets (CIFAR, MNIST), but its performance drops sharply in complex scenarios (ImageNet and its subsets).
Limitations of Prior Work: Grad-CAM analysis reveals that discriminative regions occupy most pixels in simple datasets but only a small area in complex scenarios, where non-discriminative features (background, common colors) dominate the learning process. In addition, the low-loss supervision signals in trajectory matching contain common patterns, which dilute the discriminative information.
Key Challenge: The distillation process indiscriminately matches all parameter gradients, causing synthetic images to be dominated by common patterns (e.g., background textures) while diluting discriminative features (e.g., object details).
Goal: To emphasize discriminative features and suppress common patterns within the trajectory matching framework, thereby recovering the performance of dataset distillation in complex scenarios.
Key Insight: Strengthen the learning of discriminative features from two sides at once: the parameter space (discarding gradients of common patterns) and the pixel space (amplifying gradients of discriminative regions).
Core Idea: Discard low-loss common-pattern gradients in the parameter space and scale up the gradients of discriminative regions via Grad-CAM weighting in the pixel space, a two-pronged approach that intensifies the learning of discriminative features during distillation.
Method
Overall Architecture
EDF builds on trajectory matching (e.g., DATM) and integrates two modules into the optimization of the synthetic data: Common Pattern Dropout (CPD) filters out common patterns in the parameter space, while Discriminative Area Enhancement (DAE) enhances discriminative regions in the pixel space. The two are complementary: CPD removes interference (subtractive), whereas DAE amplifies the signal (additive).
Key Designs
- Common Pattern Dropout (CPD):
    - Function: Filters out common-pattern signals in trajectory matching, working in the parameter space.
    - Mechanism: Decomposes the trajectory matching loss into parameter-wise losses \(L = \{l_1, l_2, \ldots, l_P\}\), sorts them in ascending order, and discards the gradients of the \(\lfloor \alpha \cdot P \rfloor\) parameters with the lowest loss. Low-loss parameters are taken to correspond to common patterns that are already fully learned (e.g., background), so dropping them ensures that only the gradients of high-loss (discriminative) features are backpropagated to the synthetic images. The reported optimal dropout ratio \(\alpha\) is 12.5-25% for a small number of images per class (IPC) and 37.5-50% for large IPC.
    - Design Motivation: Low-loss parameters contain easily learnable common patterns whose gradients tend to dilute discriminative signals; discarding them focuses the optimization on truly distinctive features.
- Discriminative Area Enhancement (DAE):
    - Function: Amplifies the gradients of discriminative regions in the pixel space.
    - Mechanism: Periodically computes the Grad-CAM activation map \(M\) of the synthetic images and defines a pixel-level gradient weight function \(\mathcal{F}(M, \beta)\): pixels whose activation is below the mean of \(M\) keep weight 1 (unchanged), while pixels above the mean receive weight \(\beta + M_{h,w}\) (amplified). The gradient of the synthetic image is then rescaled as \((\nabla D_{syn})_{edf} = \nabla D_{syn} \odot \mathcal{F}(M, \beta)\). The threshold is the dynamic mean of the activation map rather than a fixed value, and \(\beta \in [1, 2]\) is reported as optimal.
    - Design Motivation: In complex scenarios, discriminative regions are small but have high information density. Scaling up the gradients of these regions focuses the optimization of synthetic images on key details.
- Comp-DD Benchmark:
    - Function: Standardizes the evaluation of dataset distillation in complex scenarios.
    - Mechanism: Constructs 16 subsets of ImageNet-1K (8 easy, 8 hard), one easy and one hard subset for each of eight categories (Bird, Car, Dog, Fish, Snake, Insect, Round, and Music), using the Grad-CAM activation area percentage as the complexity score.
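The CPD mechanism described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `common_pattern_dropout` and `masked_matching_loss` are hypothetical, the parameter-wise losses are toy numbers, and the loss normalization used by trajectory matching is left out. In an autodiff framework, zeroing a parameter's loss term in this way also removes its gradient contribution to the synthetic images.

```python
import math


def common_pattern_dropout(param_losses, alpha):
    """Return a 0/1 mask that drops the floor(alpha * P) lowest-loss parameters."""
    num_params = len(param_losses)
    num_dropped = math.floor(alpha * num_params)
    # Parameter indices sorted by ascending parameter-wise loss.
    order = sorted(range(num_params), key=lambda i: param_losses[i])
    mask = [1.0] * num_params
    for i in order[:num_dropped]:
        mask[i] = 0.0  # treated as an already-learned common pattern
    return mask


def masked_matching_loss(param_losses, alpha):
    """Sum only the parameter-wise losses that survive the dropout mask."""
    mask = common_pattern_dropout(param_losses, alpha)
    return sum(m * l for m, l in zip(mask, param_losses))


# Toy example: P = 8 parameter-wise losses, alpha = 25% -> 2 parameters dropped.
param_losses = [0.02, 0.90, 0.05, 0.40, 0.01, 0.70, 0.30, 0.03]
mask = common_pattern_dropout(param_losses, alpha=0.25)
kept_loss = masked_matching_loss(param_losses, alpha=0.25)
print(mask)
print(kept_loss)
```

Here the two smallest losses (0.01 and 0.02) are masked out, so only the remaining six terms would drive the update of the synthetic data.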
Loss & Training
Training uses the trajectory matching loss of DATM, with CPD discarding the gradients of low-loss parameters and DAE rescaling the pixel gradients. The Grad-CAM activation map is refreshed every 50 iterations for small IPC and every 200 iterations for large IPC.
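The pixel-space rescaling and the periodic Grad-CAM refresh might look roughly like the sketch below. It is written in plain Python on a toy 2x2 map and only illustrates the weighting rule; the function names (`dae_weights`, `rescale_gradient`, `should_refresh_cam`) are hypothetical, the activation map is assumed to be already computed and normalized, and leaving pixels exactly at the mean unchanged is an assumption, since the summary only specifies the cases below and above the mean.

```python
def dae_weights(activation_map, beta):
    """Pixel-wise weights: 1 at or below the mean activation, beta + M above it."""
    flat = [a for row in activation_map for a in row]
    mean_activation = sum(flat) / len(flat)  # dynamic threshold, recomputed per map
    # Assumption: pixels exactly at the mean are left unchanged.
    return [[beta + a if a > mean_activation else 1.0 for a in row]
            for row in activation_map]


def rescale_gradient(grad, weights):
    """Element-wise product of the image gradient and the weight map."""
    return [[g * w for g, w in zip(g_row, w_row)]
            for g_row, w_row in zip(grad, weights)]


def should_refresh_cam(iteration, update_every):
    """True on iterations where the activation map would be recomputed."""
    return iteration % update_every == 0


activation_map = [[0.0, 0.25], [0.75, 1.0]]  # toy Grad-CAM map, mean = 0.5
grad = [[2.0, -4.0], [4.0, -2.0]]            # toy gradient of one synthetic image
weights = dae_weights(activation_map, beta=1.0)
scaled_grad = rescale_gradient(grad, weights)
refresh_steps = [i for i in range(201) if should_refresh_cam(i, update_every=50)]
print(weights)
print(scaled_grad)
print(refresh_steps)
```

Only the two pixels above the mean are amplified; the others keep their original gradient. With `update_every=200` the same schedule helper would model the large-IPC setting.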
Key Experimental Results
Main Results
| Dataset | IPC=10 | IPC=50 | Gain vs. DATM (IPC=10 / IPC=50) |
|---|---|---|---|
| ImageWoof | 41.8% | 60.8% | +2.6%/+3.0% |
| ImageMeow | 52.6% | 55.0% | +3.7%/+2.1% |
| ImageYellow | 68.2% | 75.8% | +3.1%/+3.4% |
| ImageSquawk | 65.4% | 77.2% | +3.2%/+2.8% |
| CIFAR-10 | - | 77.3% | +1.2% |
| Tiny-ImageNet | 32.5% | 41.1% | +1.4% |
Lossless compression: At IPC=300, ImageMeow reaches 65.2%, matching full-dataset performance while using only 23% of the data.
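The 23% figure is consistent with a simple ratio, sketched below. The per-class count of 1,300 training images is an assumption (the usual maximum for an ImageNet-1K class), not a number given in this summary.

```python
ipc = 300                 # synthetic images per class, from the text above
images_per_class = 1300   # assumption: typical ImageNet-1K training images per class

ratio = ipc / images_per_class
percent = round(100 * ratio)
print(f"{percent}% of the original data")
```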
Ablation Study
| Configuration | ImageWoof/Meow/Yellow IPC=10 |
|---|---|
| Baseline (DATM) | 39.2 / 48.9 / 65.1 |
| +DAE only | 40.3 / 49.5 / 66.2 |
| +CPD only | 41.1 / 51.2 / 67.5 |
| +Both (EDF) | 41.8 / 52.6 / 68.2 |
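Measured against the DATM baseline, CPD alone contributes more (+1.9 to +2.4) than DAE alone (+0.6 to +1.1), and combining them gives the largest gain (+2.6 to +3.7). The combined gain exceeds the sum of the individual gains only on ImageMeow (+3.7 vs. +2.9), so the two modules are better described as complementary than as uniformly synergistic.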
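The per-configuration gains can be recomputed directly from the ablation table. The short script below does only that arithmetic; the accuracy values are copied from the table, and the variable names are illustrative.

```python
# IPC=10 accuracies for ImageWoof / ImageMeow / ImageYellow, copied from the table.
ablation = {
    "Baseline (DATM)": (39.2, 48.9, 65.1),
    "+DAE only": (40.3, 49.5, 66.2),
    "+CPD only": (41.1, 51.2, 67.5),
    "+Both (EDF)": (41.8, 52.6, 68.2),
}

baseline = ablation["Baseline (DATM)"]
gains = {
    name: tuple(round(acc - base, 1) for acc, base in zip(accs, baseline))
    for name, accs in ablation.items()
    if name != "Baseline (DATM)"
}

# Sum of the two single-module gains, for comparison with the combined gain.
summed = tuple(
    round(d + c, 1) for d, c in zip(gains["+DAE only"], gains["+CPD only"])
)

for name, delta in gains.items():
    print(name, delta)
print("DAE + CPD summed:", summed)
```

Comparing `gains["+Both (EDF)"]` with `summed` shows on which datasets the combination beats the sum of its parts.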
Key Findings
- CPD is the primary contributor, indicating that common pattern filtering in the parameter space is more critical than enhancement in the pixel space.
- The dynamic mean threshold consistently outperforms fixed thresholds (0.2/0.5/0.8) because the activation maps continuously change during training.
- Performance gains are also achieved on simple datasets (CIFAR-10 +1.2%), demonstrating that the common pattern issue is not restricted to complex scenarios.
- A CPD ratio above 75% becomes detrimental, suggesting that some common patterns remain necessary for learning.
Highlights & Insights
- Explaining Distillation Failure from a Grad-CAM Perspective: Identifies the small proportion of discriminative regions in complex scenarios as the root cause of distillation degradation, a perspective relevant to dataset distillation (DD) research in general.
- Simplicity and Effectiveness of Parameter Space Filtering: CPD only requires sorting losses and discarding the lowest fraction, introducing zero extra parameters and working out-of-the-box.
- Contribution of the Comp-DD Benchmark: Provides a standardized evaluation tool for research on DD in complex scenarios.
Limitations & Future Work
- The CPD dropout ratio \(\alpha\) has to be tuned per IPC; there is no adaptive mechanism for setting it.
- The computation of Grad-CAM increases training overhead, particularly under frequent updates.
- Verified only on trajectory matching frameworks; integration with distribution-matching-based methods remains unexplored.
Related Work & Insights
- vs DATM: EDF integrates CPD+DAE on top of DATM, improving accuracy by roughly 2 to 4 percentage points (+2.1 to +3.7 in the main results table) on the ImageNet subsets. The modules are orthogonal to the base method and plug-and-play.
- vs NCFM/CCFS: While these methods improve DD from the perspective of distribution matching or data selection, EDF enhances it via feature distinctiveness, offering complementary strategies.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of Grad-CAM analysis and parameter-level dropout is unique, and the insight is valuable.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely detailed ablations across multiple datasets and the Comp-DD benchmark.
- Writing Quality: ⭐⭐⭐⭐ The analysis of the problem (via the Grad-CAM perspective) is highly persuasive.
- Value: ⭐⭐⭐⭐ A plug-and-play DD enhancement module with strong practicality.