What Makes a Good Dataset for Knowledge Distillation?

Conference: CVPR 2025
arXiv: 2411.12817
Code: https://github.com/osu-cvl/good-kd-dataset
Area: Model Compression / Knowledge Distillation
Keywords: Knowledge Distillation, Proxy Datasets, Synthetic Data, Decision Boundary, Data Characterization Analysis

TL;DR

This work systematically investigates the fundamental question of "what data is effective" in knowledge distillation (KD). It finds that even non-natural synthetic images generated by OpenGL shaders can perform KD effectively. It also concludes that a good distillation dataset should meet several criteria: uniform distribution of teacher predictions, sufficient coverage of the decision space, high data diversity, and containment of decision boundary information.

Background & Motivation

Background: Knowledge Distillation (KD) is a mainstream method for model compression, where the soft label outputs of a teacher network provide richer learning signals for a student network compared to hard labels. Standard KD assumes that the teacher's original training data is available.

Limitations of Prior Work: In many practical scenarios, the teacher's original training data is unavailable: in continual learning, data streams arrive sequentially and cannot be revisited; large companies release model weights but do not disclose the training data (e.g., CLIP, DINOv2, GPT). Practitioners are forced to use proxy data instead, yet there is little systematic research on what makes such proxy data effective. Prior work assumes that out-of-distribution (OOD) data is unsuitable for KD, but is this actually true?

Key Challenge: Intuitively, it is assumed that only in-distribution (ID) real-world data is suitable for distillation, but this assumption hinders KD applications in data-unavailable scenarios. The core question is: what exactly is distillation "transferring"? If the mechanism of transfer is understood, more flexible data selection strategies might be discovered.

Goal: To answer the core question of "what characteristics make a dataset suitable for knowledge distillation?" and demonstrate that even the most unconventional synthetic images can achieve effective distillation.

Key Insight: KD can be viewed as "function matching", where the role of the dataset is to sample the teacher's decision function densely enough to reconstruct it. From this perspective, whether the data is in-distribution or even real is secondary; what matters is whether the sampling is sufficient.

Core Idea: KD is a problem of sufficient sampling. A good distillation dataset needs to uniformly cover the decision regions of all teacher classes, rather than strictly share the distribution of the original training data.

Method

Overall Architecture

The study adopts the standard KD framework (Hinton et al., 2015) to systematically evaluate how well various alternative datasets work for distillation. A ResNet50 teacher is trained on each of 6 teacher datasets (CIFAR10/100, Tiny ImageNet, FGVC-Aircraft, Pets, EuroSAT), and its knowledge is then distilled into student networks using 12 different datasets, including real in-distribution (ID) and out-of-distribution (OOD) datasets and three types of synthetic data (OpenGL shaders, Leaves, and noise images).
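
Of the three synthetic sources, the noise images are the simplest to reproduce: they are described as Gaussian noise sampled from the mean and variance of the original dataset. The following is a minimal sketch of that idea, not the authors' code. The function name `gaussian_noise_images`, the use of per-channel statistics, and the clipping to [0, 1] are all assumptions made for illustration.

```python
import numpy as np


def gaussian_noise_images(reference, n_samples, seed=0):
    """Sample noise images that match the channel statistics of `reference`.

    reference: float array of shape (N, H, W, C) with values in [0, 1].
    Assumption: statistics are taken per channel; the summary only says
    "mean and variance of the original dataset".
    """
    rng = np.random.default_rng(seed)
    mean = reference.mean(axis=(0, 1, 2))  # shape (C,)
    std = reference.std(axis=(0, 1, 2))    # shape (C,)
    shape = (n_samples,) + reference.shape[1:]
    # loc and scale broadcast over the trailing channel axis.
    noise = rng.normal(loc=mean, scale=std, size=shape)
    # Assumption: keep pixel values in the valid image range.
    return np.clip(noise, 0.0, 1.0)


# Toy stand-in for a real dataset: 256 random 8x8 RGB "images".
_rng = np.random.default_rng(1)
reference = np.clip(0.5 + 0.1 * _rng.standard_normal((256, 8, 8, 3)), 0.0, 1.0)
noise_images = gaussian_noise_images(reference, n_samples=1000)
```

Clipping slightly distorts the statistics when the reference mean sits close to 0 or 1, so a real pipeline might prefer to sample in normalized space instead.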

Key Designs

  1. Multi-Source Data Systematic Evaluation:

    • Function: Comprehensive evaluation of the distillation performance of various proxy datasets under different teacher/student combinations.
    • Mechanism: Datasets are classified into three categories: Real In-Distribution (ID), Real Out-of-Distribution (OOD), and non-natural synthetic. ImageNet is split into ID/OOD subsets based on class overlap with each teacher dataset. Synthetic data includes: OpenGL shader images (rendered by 1089 TwiGL shader programs with rich textures and structures), Leaves images (combinations of simple random shapes), and Gaussian noise sampled based on the mean and variance of the original dataset. All synthetic data are filtered through the teacher network, retaining 50K samples whose teacher predictions are uniformly distributed across all classes.
    • Design Motivation: Starting from the most extreme case (completely non-natural synthetic images) makes it possible to isolate the minimum requirements a distillation dataset must satisfy.
  2. Analysis of Distillation Success Factors:

    • Function: Extract the key properties of a good distillation dataset.
    • Mechanism: (1) Relative Entropy of Class Prediction Histogram: Measure how close the teacher's argmax prediction distribution over the dataset is to a uniform distribution, using a relative entropy score normalized so that 1.0 means perfectly uniform and values near 0 mean the predictions collapse onto a few classes. Good distillation datasets score close to 1.0, while poor ones score near 0. For instance, OpenGL achieves 0.939 on the CIFAR10 teacher, whereas FGVCA (the worst real dataset) scores only 0.116. (2) Data Diversity: OpenGL > Leaves > Noise, where richer textures and structures lead to better information transfer. (3) Decision Boundary Information: Temperature-scaled soft labels (\(\tau\) > 1) outperform hard labels because they carry inter-class relation information, which is especially vital for OOD data.
    • Design Motivation: Rather than just proving "it works", the goal is to understand "why it works" and propose actionable guidelines.
  3. Decision Boundary Adversarial Attack Strategy:

    • Function: Force data samples toward the teacher's decision boundary to enhance distillation performance.
    • Mechanism: Apply adversarial perturbations to a data sample \(x_j\) to push it across the decision boundary to a target class \(t\) (\(\mathcal{F}_T(x_j) \neq t\)), generating a pair of "pre-success" and "post-success" adversarial samples located on opposite sides of the decision boundary. An additional step is applied to push the sample pair deeper into their respective categories to increase sampling coverage. Bold Driver adaptive step sizes are used to improve attack efficiency. Ultimately, each original sample can yield 4 adversarial samples (a pair near the boundary + a pair deep within each category).
    • Design Motivation: Analysis shows that information near the decision boundary is critical for KD (similar to support vectors in SVMs). For datasets that otherwise distill poorly (such as FGVCA used to distill a CIFAR10 teacher), adversarial attacks raise accuracy from 11.39% to 88.14%.
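
The boundary-pair idea in the third design can be sketched on a toy teacher. The example below is only an illustration under strong assumptions: the teacher is a hypothetical 3-class linear softmax model in 2D (so the gradient is analytic), the attack is plain targeted gradient ascent on \(\log p_t(x)\), and the Bold Driver rule is reduced to "grow the step by 10% after an improving step, halve it otherwise". All function names and constants are invented here and are not taken from the paper's code.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def predict(W, b, x):
    return int(np.argmax(W @ x + b))


def log_prob(W, b, x, cls):
    return float(np.log(softmax(W @ x + b)[cls]))


def grad_log_prob(W, b, x, cls):
    # Gradient of log p_cls(x) w.r.t. x for logits W @ x + b.
    return W[cls] - softmax(W @ x + b) @ W


def boundary_pair(W, b, x, target, step=0.05, max_iter=1000):
    """Return (pre, post): the last sample before and the first after
    the prediction flips to `target`."""
    if predict(W, b, x) == target:
        raise ValueError("x is already classified as the target class")
    cur, cur_obj = x.astype(float), log_prob(W, b, x, target)
    for _ in range(max_iter):
        cand = cur + step * grad_log_prob(W, b, cur, target)
        cand_obj = log_prob(W, b, cand, target)
        if cand_obj <= cur_obj:      # Bold Driver: reject and shrink
            step *= 0.5
            continue
        if predict(W, b, cand) == target:
            return cur, cand         # opposite sides of the boundary
        cur, cur_obj, step = cand, cand_obj, step * 1.1  # accept and grow
    raise RuntimeError("attack did not cross the decision boundary")


def push_deeper(W, b, x, cls, steps=10, step=0.05):
    """Move x further into the region of class `cls` with fixed-size steps."""
    for _ in range(steps):
        x = x + step * grad_log_prob(W, b, x, cls)
    return x


def four_samples(W, b, x, target):
    pre, post = boundary_pair(W, b, x, target)
    deep_pre = push_deeper(W, b, pre, predict(W, b, pre))
    deep_post = push_deeper(W, b, post, target)
    return pre, post, deep_pre, deep_post


# Hypothetical toy teacher: 3 classes, 2 input features.
W = np.array([[2.0, 0.0], [-1.0, 1.5], [-1.0, -1.5]])
b = np.zeros(3)
pre, post, deep_pre, deep_post = four_samples(W, b, np.array([1.0, 0.2]), target=1)
```

With a real network the gradient would come from automatic differentiation, and the perturbation would typically be constrained to valid image ranges; both are omitted here to keep the sketch self-contained.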

Loss & Training

The standard KD loss is the KL divergence \(\mathcal{L}(p_T \| p_S) = \sum_{i \in \mathcal{C}} [p_T(i)\log p_T(i) - p_T(i)\log p_S(i)]\), where \(p_T\) and \(p_S\) are the teacher and student class probabilities softened by a temperature parameter \(\tau\). For general datasets, \(\tau=2\) is used, while \(\tau=20\) is used for fine-grained and synthetic data. All experiments use mixup augmentation.
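
As a concrete reading of the loss above, here is a small NumPy sketch that computes the KL divergence between temperature-softened teacher and student distributions. It assumes that both sets of logits are divided by the same \(\tau\) before the softmax and that the loss is averaged over the batch; the names `log_softmax` and `kd_loss` are illustrative. Some implementations additionally multiply the loss by \(\tau^2\); that factor is left out here because the text does not mention it.

```python
import numpy as np


def log_softmax(logits, tau=1.0):
    """Numerically stable log-softmax of logits / tau along the last axis."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def kd_loss(teacher_logits, student_logits, tau=2.0):
    """Mean over the batch of KL(p_T || p_S) with temperature tau."""
    log_p_t = log_softmax(teacher_logits, tau)
    log_p_s = log_softmax(student_logits, tau)
    p_t = np.exp(log_p_t)
    # sum_i [ p_T(i) log p_T(i) - p_T(i) log p_S(i) ]
    kl = (p_t * (log_p_t - log_p_s)).sum(axis=-1)
    return float(kl.mean())


teacher_logits = np.array([[2.0, 1.0, 0.1], [0.3, 2.5, -1.0]])
student_logits = np.array([[0.5, 0.2, 0.3], [1.0, 1.0, 1.0]])
loss_tau2 = kd_loss(teacher_logits, student_logits, tau=2.0)
loss_tau20 = kd_loss(teacher_logits, student_logits, tau=20.0)
```

A larger \(\tau\) flattens both distributions, so the raw KL value shrinks while the relative ordering of non-argmax classes carries more of the signal.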

Key Experimental Results

Main Results

| Teacher → Student (accuracy, %) | Original Data | ImageNet-OOD | OpenGL | Leaves | Noise |
|---|---|---|---|---|---|
| CIFAR10→RN18 | 95.98 | 94.69 | 94.02 | 92.08 | 69.24 |
| CIFAR100→RN18 | 78.35 | 76.66 | 73.27 | 66.02 | 22.09 |
| Tiny-IN→RN18 | 67.14 | 59.44 | 56.89 | 28.03 | 5.37 |
| EuroSAT→RN18 | 98.60 | 98.55 | 98.45 | 98.05 | 54.55 |
| Pets→RN18 | 86.80 | 85.01 | 72.59 | 42.19 | 3.43 |
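
To make the size of each gap explicit, the short script below recomputes the difference to the original-data baseline for every row of the results above. The numbers are copied from the table; the dictionary layout and the helper name `gaps_to_original` are only illustrative.

```python
ORIGINAL = "Original Data"

# Student accuracy (%) copied from the main results table.
results = {
    "CIFAR10":  {ORIGINAL: 95.98, "ImageNet-OOD": 94.69, "OpenGL": 94.02, "Leaves": 92.08, "Noise": 69.24},
    "CIFAR100": {ORIGINAL: 78.35, "ImageNet-OOD": 76.66, "OpenGL": 73.27, "Leaves": 66.02, "Noise": 22.09},
    "Tiny-IN":  {ORIGINAL: 67.14, "ImageNet-OOD": 59.44, "OpenGL": 56.89, "Leaves": 28.03, "Noise": 5.37},
    "EuroSAT":  {ORIGINAL: 98.60, "ImageNet-OOD": 98.55, "OpenGL": 98.45, "Leaves": 98.05, "Noise": 54.55},
    "Pets":     {ORIGINAL: 86.80, "ImageNet-OOD": 85.01, "OpenGL": 72.59, "Leaves": 42.19, "Noise": 3.43},
}


def gaps_to_original(table):
    """Accuracy drop (percentage points) relative to the original data."""
    return {
        teacher: {
            name: round(row[ORIGINAL] - acc, 2)
            for name, acc in row.items()
            if name != ORIGINAL
        }
        for teacher, row in table.items()
    }


gaps = gaps_to_original(results)
for teacher, row in gaps.items():
    print(teacher, row)
```

The OpenGL gap ranges from 0.15 points on EuroSAT to 14.21 points on Pets, which is the fine-grained case where synthetic data struggles most.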

Ablation Study

| Experiment Setting | CIFAR10 | OpenGL | Description |
|---|---|---|---|
| Standard KD | 95.98 | 94.02 | Baseline |
| One-Hot Labels | 96.09 | 91.68 | Soft labels are more crucial for OOD data |
| 20K Long-tailed Sampling | 91.85 | 88.93 | Non-uniform sampling significantly degrades performance |
| 20K Uniform Sampling + mixup | 95.07 | 92.32 | Uniform + mixup partially compensates |
| Extreme Augmentation + mixup | 94.04 | 93.95 | Strong augmentation turns real data into OOD |
| No Augmentation, No mixup | 87.33 | 31.84 | Augmentation is crucial for synthetic data |
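
The contrast between long-tailed and uniform sampling depends on how samples are selected with respect to the teacher's predictions. A minimal sketch of class-balanced selection follows, assuming the teacher's argmax predictions are already available as an integer array. The function name `balanced_subset` and the fixed per-class quota are assumptions for illustration; the paper's actual filtering procedure may differ.

```python
import numpy as np


def balanced_subset(teacher_preds, n_classes, per_class, seed=0):
    """Return sorted indices keeping at most `per_class` samples
    for each class predicted by the teacher."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in range(n_classes):
        idx = np.flatnonzero(teacher_preds == c)
        rng.shuffle(idx)  # random choice among the candidates of class c
        keep.extend(idx[:per_class].tolist())
    return np.array(sorted(keep), dtype=int)


# Toy long-tailed prediction histogram over 5 classes.
teacher_preds = np.repeat(np.arange(5), [500, 200, 100, 50, 20])
selected = balanced_subset(teacher_preds, n_classes=5, per_class=20)
counts = np.bincount(teacher_preds[selected], minlength=5)
```

If some class has fewer candidates than the quota, the subset stays unbalanced for that class, which is one reason a proxy dataset needs to reach every teacher class in the first place.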

Key Findings

  • Data does not need to be real or in-distribution: OpenGL shader images score only about 2 percentage points below the original data on CIFAR10 (94.02 vs 95.98) and are nearly identical on EuroSAT (98.45 vs 98.60).
  • Uniform sampling of decision space is core: The entropy of the teacher's prediction histogram is strongly positively correlated with distillation success. In the ablation, 20K long-tailed sampling trails 20K uniform sampling with mixup by roughly 3 percentage points (91.85 vs 95.07 on CIFAR10, 88.93 vs 92.32 on OpenGL).
  • Data augmentation is crucial for synthetic data: OpenGL without augmentation achieves only 31.84% (barely working), whereas extreme augmentation with mixup reaches 93.95%.
  • Dramatic improvements via adversarial attacks: Applying adversarial attacks to the worst-performing dataset (FGVCA for distilling a C10 teacher) improves accuracy drastically from 11.39% to 88.14%, a gain of 76.8 percentage points.
  • The more complex the teacher architecture (ViT-S > ConvNeXt-T > RN50), the more "patience" (longer training) is required to distill via synthetic data.
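
The uniformity criterion from the findings above can be checked with a few lines of code. The sketch below assumes the score is the Shannon entropy of the teacher's argmax histogram divided by \(\log C\), which matches the description "1.0 for uniform, near 0 for collapsed predictions"; the paper's exact definition may differ, and the function name is illustrative.

```python
import numpy as np


def normalized_histogram_entropy(teacher_preds, n_classes):
    """Entropy of the predicted-class histogram, scaled to [0, 1].

    Assumption: normalization by log(n_classes), so that a perfectly
    uniform histogram scores 1.0 and a single-class histogram scores 0.0.
    """
    counts = np.bincount(teacher_preds, minlength=n_classes).astype(float)
    p = counts / counts.sum()
    nz = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(nz * np.log(nz)).sum() / np.log(n_classes))


uniform_preds = np.repeat(np.arange(10), 100)      # every class equally often
collapsed_preds = np.zeros(1000, dtype=int)        # teacher always predicts class 0
skewed_preds = np.repeat(np.arange(10), [910] + [10] * 9)

score_uniform = normalized_histogram_entropy(uniform_preds, 10)
score_collapsed = normalized_histogram_entropy(collapsed_preds, 10)
score_skewed = normalized_histogram_entropy(skewed_preds, 10)
```

A proxy dataset can be screened this way with a single forward pass of the teacher before any distillation run is launched.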

Highlights & Insights

  • Analogy to Signal Sampling Theory: The paper draws an analogy between KD and the Nyquist sampling theorem: the dataset serves to sample the teacher's decision function, and an insufficient sampling rate (under-sampling certain classes) leads to "aliasing" (distillation failure). The intuition is elegant and easy to grasp.
  • 2D GAP Feature Visualization: Visualization on an MNIST teacher intuitively demonstrates why OpenGL is more suitable for distillation than CIFAR10: OpenGL images are distributed across the decision regions of all teacher classes, while CIFAR10 only covers a few classes.
  • Enhancing Distillation Data via Adversarial Attacks: Utilizing adversarial samples as "probes" for decision boundaries to enhance KD is significantly simpler than data-free knowledge distillation (DFKD) approaches (eliminating the need for a generator network) and boasts a clear computational advantage of 2 hours versus 48 hours.

Limitations & Future Work

  • The experimental scale is relatively small (up to Tiny ImageNet with 200 classes). Effectiveness for distilling teachers at ImageNet-1K scale or larger has not been validated.
  • When the number of classes increases or fine-grained differences are introduced, synthetic data performance drops (Pets: 72.59 vs 86.80) because synthetic images have difficulty covering fine-grained decision spaces.
  • The computational overhead of the adversarial attack strategy scales linearly with the number of classes, which might not be cost-effective for large-class scenarios.
  • Future work could explore using generative models (such as Stable Diffusion) to synthesize targeted distillation data.

Related Work & Insights

  • vs DFKD Methods (CMI, Spaceship): DFKD uses generator networks to synthesize data for distillation, which suffers from mode collapse and high computational expenses. This work shows that simple, non-optimized synthetic images can achieve similar performance in only 1/24 of the time.
  • vs Beyer et al. (Function Matching): They proposed that KD is function matching and found that OOD data performs poorly. This work further clarifies that OOD data performs poorly due to non-uniform sampling rather than being intrinsically unusable.
  • vs Single Image Distillation (Asano & Saeed): Distilling from random crops of a single large image is highly novel, but yields only 69.34% on CIFAR100, which is inferior to OpenGL's 73.86%.

Rating

  • Novelty: ⭐⭐⭐⭐ Systematically answers an important open question; the findings on synthetic data distillation are highly inspiring.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extensive ablation and analysis experiments with careful control of variables.
  • Writing Quality: ⭐⭐⭐⭐ The question-driven narrative style is highly engaging, and the analysis deepens layer by layer.
  • Value: ⭐⭐⭐⭐ The proposed criteria for selecting distillation datasets offer direct practical guidance.