EA-KD: Entropy-based Adaptive Knowledge Distillation

Conference: ICCV 2025
arXiv: 2311.13621
Code: https://github.com/cpsu00/EA-KD
Area: Object Detection
Keywords: Knowledge Distillation, Entropy, Adaptive Weighting, Plug-and-Play, Sample Importance

TL;DR

This paper proposes EA-KD, a plug-and-play knowledge distillation method based on information entropy. It dynamically reweights distillation losses by combining the entropy values of teacher and student outputs, prioritizing learning from high-entropy (high-information) samples. EA-KD consistently improves multiple KD frameworks across image classification, object detection, and LLM distillation tasks with negligible computational overhead.

Background & Motivation

Background: Knowledge distillation (KD) achieves model compression by training a small student model to mimic a large teacher model. Mainstream approaches are divided into logit-based methods (aligning softened probability distributions) and feature-based methods (aligning intermediate layer features). Recent advances include DKD, MLD, CTKD, and related variants.

Limitations of Prior Work:

  • Drawback of uniform distillation: Most KD methods treat all samples equally, ignoring differences in learning value across samples. Intuitively, this is analogous to a teacher delivering the same depth of instruction on every topic rather than focusing on the critical knowledge points.
  • Inherent bias of KLD: The standard KLD loss naturally assigns larger loss values to low-entropy (simple, high-confidence) samples: when the student is initialized near a uniform distribution, KLD \(\approx \log C\) for low-entropy teacher outputs, while KLD \(\approx 0\) for high-entropy outputs. Training is therefore dominated by simple samples, obscuring knowledge transfer from high-value ones.
  • Value of high-entropy samples: Experiments show that high-entropy samples reside near class decision boundaries in t-SNE visualizations and exhibit larger teacher–student accuracy gaps; it is precisely these difficult samples that carry the most critical knowledge for learning.
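The KLD bias is easy to verify numerically: for a student near the uniform distribution, \(D_{\text{KL}}(p^{\mathcal{T}} \,\|\, p^{\mathcal{S}}) = \log C - H(p^{\mathcal{T}})\), so confident (low-entropy) teacher outputs dominate the loss. A minimal check (class count and probabilities chosen purely for illustration):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats; skips zero-probability terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

C = 10
uniform_student = [1.0 / C] * C  # freshly initialized student ~ uniform

# Low-entropy teacher output (confident, "easy" sample).
confident_teacher = [0.991] + [0.001] * (C - 1)
# High-entropy teacher output ("hard" sample near a decision boundary).
uncertain_teacher = [1.0 / C] * C

print(kl_divergence(confident_teacher, uniform_student))  # ~2.23, close to log(10) ~ 2.30
print(kl_divergence(uncertain_teacher, uniform_student))  # 0.0
```

The confident sample contributes a loss near \(\log C\), the uncertain one contributes almost nothing, which is exactly the inverse of the prioritization EA-KD argues for.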

Key Challenge: The distillation process should prioritize difficult and information-rich samples, yet the mathematical properties of current loss functions produce the opposite tendency.

Key Insight: Information-theoretic entropy is used as a measure of sample learning value, combining both the teacher's stable assessment and the student's dynamic evolution to adaptively adjust the distillation weight for each sample.

Core Idea: An adaptive reweighting factor \(w_{\text{EA}}\) based on teacher and student entropy that can be plugged into any KD framework in a drop-in fashion.

Method

Overall Architecture

EA-KD does not alter the architecture or loss function form of any KD method; it only multiplies a sample-level entropy-based weight \(w_{\text{EA},n}\) into the loss computation. Any KD framework, logit-based or feature-based, can be directly integrated: \(L_{\text{EA-KD}} = \sum_{n} w_{\text{EA},n} \cdot L_{\text{KD},n}\).
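As a sketch (not the authors' code), the drop-in integration amounts to replacing a batch-mean reduction with a weighted sum over unreduced per-sample losses; the function and argument names here are hypothetical:

```python
def ea_kd_loss(per_sample_kd_loss, ea_weights):
    """L_EA-KD = sum_n w_EA,n * L_KD,n (sample-level reweighting).

    per_sample_kd_loss: unreduced KD losses, one per sample in the batch.
    ea_weights: entropy-based weights w_EA,n for the same samples.
    """
    assert len(per_sample_kd_loss) == len(ea_weights)
    return sum(w * l for w, l in zip(ea_weights, per_sample_kd_loss))
```

In a PyTorch pipeline the same idea is a one-line change: compute the framework's KD loss with `reduction='none'`, multiply by the weight vector, then reduce.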

Key Designs

  1. Entropy-based sample value quantification:

    • Function: Quantify the learning value of each sample via entropy.
    • Mechanism: Compute the temperature-softened entropy \(H_n = -\sum_{i=1}^C p_{n,i}(T') \log(p_{n,i}(T'))\) separately for teacher and student outputs, where \(T'\) is a temperature parameter dedicated to entropy computation, distinct from the KD temperature \(T\) (e.g., \(T'=3\) on CIFAR-100).
    • Design Motivation: High entropy indicates high uncertainty and high information content, corresponding to difficult samples near class boundaries. The introduction of \(T'\) allows entropy to more smoothly reflect the value gradient across different samples.
  2. Dual-perspective reweighting factor \(w_{\text{EA}}\):

    • Function: Fuse the teacher's stable assessment with the student's dynamic learning state.
    • Mechanism:
      • Base term \(w_{\text{base},n} = H_n^{\mathcal{T}}\): the teacher's assessment of sample value.
      • Interaction term \(w_{\text{interact},n} = \frac{H_n^{\mathcal{T}} \cdot H_n^{\mathcal{S}}}{H_{\text{ub}}}\): normalized product of teacher and student entropies.
      • Final weight \(w_{\text{EA},n} = \frac{w_{\text{base},n} + w_{\text{interact},n}}{2}\), which can be rewritten as \(\frac{1}{2}H_n^{\mathcal{T}}\!\left(1 + \frac{H_n^{\mathcal{S}}}{H_{\text{ub}}}\right)\).
    • Design Motivation:
      • Using \(H^{\mathcal{T}}\) alone yields weights that remain fixed throughout training, failing to capture the student's learning progress (experiments show that the variance of \(H^{\mathcal{S}}\) among high-\(H^{\mathcal{T}}\) samples increases with epoch).
      • The interaction term moderately reduces the weight of samples the student has already mastered (low \(H^{\mathcal{S}}\)), while maintaining high weights for samples still being learned.
      • When \(H^{\mathcal{T}}\) is low (teacher regards the sample as simple), the weight remains low regardless of the student's state.
  3. Integration with existing KD frameworks:

    • Function: Serve as a plug-and-play module for any KD method.
    • Mechanism: Multiply the original per-sample distillation loss \(L_{\text{KD},n}\) directly by \(w_{\text{EA},n}\). For frameworks such as MLD+LS, the KD loss weight should be moderately reduced to avoid excessive penalization.
    • Design Motivation: EA-KD is complementary to methods like DKD — DKD prevents NCKD from being overshadowed by TCKD at the class level, while EA-KD prevents high-value samples from being overshadowed by simple samples at the sample level.
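The entropy and weight computations described above can be sketched in a few lines of pure Python (function and variable names are mine; \(T'=3\) follows the paper's CIFAR-100 setting):

```python
import math

def softmax(logits, temperature):
    # Temperature-scaled softmax; subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy H = -sum_i p_i log p_i (in nats).
    return -sum(p * math.log(p) for p in probs if p > 0)

def ea_weight(teacher_logits, student_logits, t_prime=3.0):
    """w_EA = (w_base + w_interact) / 2 = 0.5 * H_T * (1 + H_S / H_ub)."""
    num_classes = len(teacher_logits)
    h_ub = math.log(num_classes)  # entropy upper bound, log C
    h_teacher = entropy(softmax(teacher_logits, t_prime))
    h_student = entropy(softmax(student_logits, t_prime))
    w_base = h_teacher
    w_interact = h_teacher * h_student / h_ub
    return 0.5 * (w_base + w_interact)
```

By construction \(w_{\text{EA}} \in [0, \log C]\): a confident teacher (low \(H^{\mathcal{T}}\)) keeps the weight low regardless of the student's state, while a student that has mastered a sample (low \(H^{\mathcal{S}}\)) at most halves the weight.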

Loss & Training

  • Apart from the entropy temperature \(T'\), no additional hyperparameters are introduced; the original KD framework's training settings are inherited.
  • Entropy temperature: \(T' = 3\) (CIFAR-100 / Tiny-ImageNet), \(T' = 2\) (ImageNet / transformer teacher).
  • Weight range: \(w_{\text{EA}} \in [0, H_{\text{ub}}]\), where \(H_{\text{ub}} = \log C\).

Key Experimental Results

Main Results

CIFAR-100 (7 teacher–student pairs, 7 KD methods):

| Method | Average Gain Δ |
| --- | --- |
| EA-KD (vs KD) | +1.48% |
| EA-CTKD (vs CTKD) | +0.63% |
| EA-DKD (vs DKD) | +0.47% |
| EA-MLD (vs MLD) | +0.38% |
| EA-ReviewKD (vs ReviewKD) | +0.38% |
| EA-FCFD (vs FCFD) | +0.42% |

All EA variants consistently improve their corresponding baselines, with an average gain of +0.56%.

Cross-dataset / cross-task results:

| Task | Dataset | Baseline | EA Variant | Gain |
| --- | --- | --- | --- | --- |
| Image Classification | Tiny-ImageNet | 56.00 (KD) | 59.39 (EA-KD) | +3.39% |
| Image Classification | ImageNet | 71.03 (KD) | 71.79 (EA-KD) | +0.76% |
| Object Detection | MS-COCO (R101→R18) | 33.97 AP (KD) | 34.78 AP (EA-KD) | +0.81 AP |
| Object Detection | MS-COCO (R50→MV2) | 30.13 AP (KD) | 31.81 AP (EA-KD) | +1.68 AP |
| LLM Distillation | 5 datasets avg. | 17.63 (KD) | 18.34 (EA-KD) | +0.71 |

Ablation Study

Reweighting factor comparison (ResNet32×4 → ResNet8×4, CIFAR-100):

| Weight | KD | MLD | MLD+LS | FCFD |
| --- | --- | --- | --- | --- |
| None (baseline) | 73.33 | 77.08 | 78.28 | 76.62 |
| \(w_{\text{base}}\) (\(H^{\mathcal{T}}\) only) | 75.14 | 77.47 | 78.30 | 77.50 |
| \(w_{\text{interact}}\) (\(H^{\mathcal{T}} \cdot H^{\mathcal{S}}\)) | 74.76 | 77.45 | 78.20 | 77.42 |
| \(w_{\text{EA}}\) | 75.46 | 77.65 | 78.38 | 77.44 |

  • Both \(w_{\text{base}}\) and \(w_{\text{interact}}\) are individually effective; the combined \(w_{\text{EA}}\) achieves the best performance on almost all frameworks.
  • Inverse weighting (assigning lower weights to high-entropy samples) leads to a notable performance drop (73.33 → 72.73), validating the core hypothesis.
  • EA-DKD vs. DKD: the loss landscape is smoother, and the variance of sensitivity to the \(\beta\) hyperparameter decreases from 0.31 to 0.10.
  • Training time overhead is negligible (Figure 1 data).

Personal Reflections

  • Highlights: An extremely simple and general plug-and-play approach with clear theoretical analysis (the inherent bias of KLD toward low-entropy samples); the consistent improvements across tasks and frameworks are impressive.
  • Limitations: Gains over already strong baselines are modest (+0.16% ~ +0.38%); the validity of sequence-level reweighting in LLM distillation settings requires further investigation.
  • Insights: Using information entropy as a measure of sample importance is a concise and effective idea that could generalize to other sample weighting scenarios such as active learning and curriculum learning.

Rating

  • Novelty: TBD
  • Experimental Thoroughness: TBD
  • Writing Quality: TBD
  • Value: TBD