Condensed Data Expansion Using Model Inversion for Knowledge Distillation¶
Conference: AAAI 2026 · arXiv: 2408.13850 · Code: N/A · Area: Model Compression · Keywords: Knowledge Distillation, Dataset Condensation, Model Inversion, Feature Alignment, Data-Free Distillation
TL;DR¶
This paper proposes using condensed datasets as prototypes to guide the model inversion (MI) process. A feature-alignment discriminator enforces distributional consistency between synthesized data and condensed samples, thereby expanding the condensed dataset for knowledge distillation. The method achieves up to 11.4% improvement over standard MI-based distillation on CIFAR/ImageNet.
Background & Motivation¶
State of the Field¶
Dataset condensation compresses large-scale datasets into a small number of synthetic samples (1–50 per class), but applying these samples directly to knowledge distillation (KD) yields limited performance. Model inversion (MI) can generate synthetic data from a pretrained teacher model but lacks guidance from the real data distribution.
Limitations of Prior Work¶
Condensed samples contain limited information, making them ineffective for KD when used directly.
Root Cause¶
Data generated by MI may deviate from the real distribution (out-of-distribution), creating a domain gap with the condensed data; this gap is why naively mixing condensed and MI-generated samples fails to improve performance.
Solution Direction¶
Key Challenge: Condensed data carries distributional information but is too scarce, while MI can generate large volumes of data that may drift from the real distribution. How can the two be made complementary?
Key Insight: Use condensed samples as "prototypes" to guide MI toward generating synthetic data with a consistent distribution, achieved via a conditional discriminator that aligns the semantic features of generated data with those of condensed samples in the teacher's feature space.
Core Idea: Condensed-sample-guided model inversion: a conditional discriminator aligns the feature distributions of synthetic and condensed data at the penultimate layer of the teacher model.
Method¶
Overall Architecture¶
Given a pretrained teacher model and a small condensed dataset, a generator produces synthetic samples via the MI objective while a feature discriminator learns to distinguish synthetic from condensed features; the generator is trained to fool the discriminator. The student is then distilled on a mixed dataset of synthetic and condensed samples.
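Since the paper releases no code, the sketch below shows one plausible PyTorch implementation of the class-conditional feature discriminator described above. The module name, hidden width, and label-embedding conditioning are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConditionalFeatureDiscriminator(nn.Module):
    """Separates condensed (real) from MI-synthesized (fake) penultimate-layer
    features, conditioned on the class label (hypothetical sketch)."""

    def __init__(self, feat_dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.label_embed = nn.Embedding(num_classes, feat_dim)
        self.net = nn.Sequential(
            nn.Linear(feat_dim * 2, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),  # single real-vs-fake logit
        )

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Concatenate the feature vector with its class embedding so the
        # discriminator can enforce per-class (not just global) alignment.
        cond = self.label_embed(labels)
        return self.net(torch.cat([feats, cond], dim=1))
```

Conditioning on the label is what lets the discriminator penalize intra-class distribution mismatch rather than only the global feature distribution.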
Key Designs¶
- Feature Alignment Mechanism:
- Function: Aligns MI-generated data with condensed data in the teacher's feature space.
- Mechanism: \(\min_G \mathcal{L}_G = \mathcal{L}_{MI} + \mathcal{L}_{FA}\); the discriminator operates on penultimate-layer features to distinguish condensed from synthetic samples (see the training sketch after this list).
- Conditional Discriminator: Performs both real/fake and class-label discrimination, preventing purely global alignment that ignores intra-class consistency.
- Design Motivation: The penultimate layer encodes semantic information (rather than low-level structural features of earlier layers); semantic alignment ensures correct class correspondence in generated data.
- Compatibility with Arbitrary MI Methods:
- Pluggable into different MI baselines such as Fast, CMI, and PRE-DFKD, requiring only the addition of the feature-alignment loss on top of the original MI objective.
- Consistent improvements across all baselines validate the generality of the approach.
- Integrated Training-Distillation Pipeline:
- Each epoch: new synthetic batches are generated via guided MI and added to the dataset, followed by random sampling from the mixed set for KD.
- This avoids a two-stage strategy of generating all data before distillation, enabling iterative quality improvement.
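The generator and discriminator updates implied by \(\mathcal{L}_G = \mathcal{L}_{MI} + \mathcal{L}_{FA}\) might look as follows. This is a hedged sketch: the exact GAN loss form, the weighting `lambda_fa`, the teacher's `return_features=True` API, and the `mi_loss_fn` signature are all assumptions.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, teacher, x_cond, y_cond, x_syn, y_syn):
    """D learns to label condensed features as real and synthetic features
    as fake (assumed BCE formulation)."""
    with torch.no_grad():  # the teacher is frozen; we only need its features
        _, f_cond = teacher(x_cond, return_features=True)  # penultimate features
        _, f_syn = teacher(x_syn, return_features=True)
    real = D(f_cond, y_cond)
    fake = D(f_syn, y_syn)
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
            + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

def generator_loss(G, teacher, D, z, labels, mi_loss_fn, lambda_fa=1.0):
    """L_G = L_MI + lambda * L_FA: the baseline MI objective (Fast / CMI /
    PRE-DFKD) plus a feature-alignment term that tries to fool D."""
    x_syn = G(z, labels)
    logits, feats = teacher(x_syn, return_features=True)
    l_mi = mi_loss_fn(logits, feats, labels)  # hypothetical baseline-MI hook
    fooled = D(feats, labels)
    l_fa = F.binary_cross_entropy_with_logits(fooled, torch.ones_like(fooled))
    return l_mi + lambda_fa * l_fa, x_syn.detach()
```

Each epoch would then alternate these two updates, append the new `x_syn` batches to the growing synthetic pool, and sample mixed (synthetic + condensed) batches for the KD step sketched in the next subsection.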
Loss & Training¶
\(\theta_S^* = \arg\min_{\theta_S} \mathbb{E}_{\hat{x}}[D_{KL}(\hat{y}_S || \hat{y}_T)] + \mathbb{E}_x[D_{KL}(y_S || y_T)]\), where \(\hat{x}\) denotes synthetic samples and \(x\) condensed samples. Differentiable data augmentation is applied to prevent discriminator overfitting.
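A minimal sketch of the distillation step over the two data streams; the temperature `T=4` and equal stream weighting are assumptions, and `F.kl_div` is used in the standard KD direction, with the teacher distribution as the target:

```python
import torch
import torch.nn.functional as F

def distill_loss(student, teacher, x_syn, x_cond, T=4.0):
    """Sum of KL terms over synthetic (x_hat) and condensed (x) samples,
    mirroring the objective above (temperature scaling assumed)."""
    total = 0.0
    for x in (x_syn, x_cond):
        with torch.no_grad():
            t_logits = teacher(x)  # frozen teacher
        s_logits = student(x)
        total = total + F.kl_div(
            F.log_softmax(s_logits / T, dim=1),
            F.softmax(t_logits / T, dim=1),
            reduction="batchmean",
        ) * (T * T)  # standard T^2 rescaling to keep gradient magnitudes stable
    return total
```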
Key Experimental Results¶
Main Results¶
| Method | CIFAR-100 R34→MBv2 | ImageNet-200 R34→MBv2 |
|---|---|---|
| Fast (MI) | 54.62% | 35.31% |
| Fast + CS (simple mix) | 56.57% | 40.68% |
| Fast* (Ours) | 63.29% | 43.08% |
| CMI | 61.90% | 35.55% |
| CMI* (Ours) | 70.21% | 45.83% |
The largest gains appear on heterogeneous teacher–student pairs (R34→MBv2), with CMI* improving by 8.31 percentage points on CIFAR-100.
Ablation Study¶
- Even a single condensed sample per class yields measurable improvement.
- The conditional discriminator (class-aware) outperforms the unconditional variant.
- All condensation methods (DSA, DM, MTT) are effective, with MTT performing best.
- t-SNE visualization confirms that guided synthesis produces feature distributions more consistent with real data.
Key Findings¶
- Naively mixing condensed and MI-generated data yields only marginal gains; the domain gap is the critical bottleneck.
- Heterogeneous model pairs (with large architectural differences) benefit the most, as MI alone performs worst in these settings.
- Even one condensed sample per class provides effective guidance for MI.
Highlights & Insights¶
- The idea of using condensed data as prototypes to guide MI is highly intuitive—it effectively provides a reference signal for an otherwise unconstrained generation process.
- The plug-and-play design compatible with arbitrary MI methods offers strong practical utility.
- t-SNE visualizations clearly illustrate the distributional deviation of unguided MI-generated data in the feature space.
Limitations & Future Work¶
- The discriminator may overfit when condensed samples are extremely scarce.
- Validation is limited to classification tasks; extension to detection and segmentation remains unexplored.
- Comparisons with the latest large-scale condensation methods are absent.
Related Work & Insights¶
- vs. CMI/Fast/PRE-DFKD: The proposed method serves as a plug-and-play enhancement for these baselines.
- vs. Few-Shot KD: FSKD and NetGraft use a small number of real samples; the proposed condensation+MI expansion yields superior results.
- vs. DeepInversion: DeepInversion uses batch normalization statistics as guidance, whereas this method uses condensed data—providing richer distributional information.
Rating¶
- Novelty: ⭐⭐⭐⭐ Novel combination of dataset condensation and model inversion
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, multiple MI baselines, diverse teacher–student pairs, and comprehensive ablations
- Writing Quality: ⭐⭐⭐⭐ Clear motivation and strong visualizations
- Value: ⭐⭐⭐⭐ A practical and broadly applicable KD enhancement