# Heavy Labels Out! Dataset Distillation with Label Space Lightening
- Conference: ICCV 2025
- arXiv: 2408.08201
- Code: Coming soon
- Area: Model Compression / Dataset Distillation
- Keywords: dataset distillation, soft label compression, CLIP, LoRA, label space lightening
## TL;DR
This paper proposes the HeLlO framework, which constructs a lightweight image-label projector using a CLIP pretrained model and LoRA-like low-rank knowledge transfer, reducing soft label storage in dataset distillation to 0.003% of the original while maintaining or surpassing SOTA performance.
## Background & Motivation
Dataset distillation aims to compress large-scale training sets into extremely small synthetic sets. Current SOTA methods (SRe2L, G_VBSM, RDED) significantly reduce image count but heavily rely on large volumes of soft labels generated by pretrained teacher models to preserve performance.
Core Problem: soft label storage overhead is enormous, potentially comparable to the original dataset. For example:

- ImageNet-1K, IPC=1: images are only ~15 MB, but soft labels exceed 572 MB (38×)
- ImageNet-1K, IPC=200: soft labels reach 110 GB, comparable to the original dataset
- Root cause: each data augmentation generates an independent \(C\)-dimensional soft label (\(C\) = number of classes), for a total of \(K\) (iterations) × \(N_s\) (samples) × \(C\) stored values
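As a rough sanity check (the exact augmentation count is not given here; assuming \(K = 300\) iterations and 2-byte fp16 label entries), the IPC=1 figure works out as:

$$
K \times N_s \times C \times 2\,\mathrm{B} = 300 \times 1000 \times 1000 \times 2\,\mathrm{B} = 6 \times 10^{8}\,\mathrm{B} \approx 572\,\mathrm{MiB}.
$$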
This reveals a neglected bottleneck in current distillation methods: images are distilled, but labels are not.
## Method

### Overall Architecture
The HeLlO framework replaces massively stored offline soft labels with a lightweight online projector:

1. Construct a projector from a CLIP image encoder and a linear transformation
2. Initialize the linear transformation with class text embeddings (zero storage cost)
3. Fine-tune the projector toward the target distribution via LoRA-like low-rank matrices
4. Optionally update the synthetic images to reduce projector error
5. Generate soft labels online during downstream training
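A minimal sketch (not the authors' code) of such an online label projector: a frozen CLIP image encoder, a linear head initialized from normalized class-text embeddings, and a LoRA-style low-rank residual. For brevity the adapter is placed only on the head (the paper also applies LoRA inside the encoder), the ViT-B/32 backbone is an arbitrary choice, and the openai/CLIP package is assumed.

```python
# pip install git+https://github.com/openai/CLIP
import torch
import torch.nn as nn
import clip


class OnlineLabelProjector(nn.Module):
    def __init__(self, class_names, rank=8, device="cpu"):
        super().__init__()
        self.encoder, _ = clip.load("ViT-B/32", device=device)  # backbone choice is illustrative
        for p in self.encoder.parameters():
            p.requires_grad_(False)  # backbone stays frozen in this sketch

        # Text-embedding initialization: W = (v_T)^T, one row per class,
        # built from a fixed prompt template -> zero extra storage.
        prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
        with torch.no_grad():
            txt = self.encoder.encode_text(prompts).float()
            txt = txt / txt.norm(dim=-1, keepdim=True)           # (C, d)
        self.register_buffer("w_text", txt)                      # zero-shot head, not trained

        # LoRA-like low-rank residual: Delta_W = A @ B with r << d, k.
        c, d = txt.shape
        self.lora_A = nn.Parameter(torch.zeros(c, rank))
        self.lora_B = nn.Parameter(torch.randn(rank, d) * 0.01)

    def forward(self, images):
        # images: a CLIP-preprocessed batch (B, 3, 224, 224)
        with torch.no_grad():
            feat = self.encoder.encode_image(images).float()
            feat = feat / feat.norm(dim=-1, keepdim=True)         # (B, d)
        w = self.w_text + self.lora_A @ self.lora_B                # (C, d)
        return feat @ w.t()                                        # (B, C) soft logits, produced online
```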
### Key Designs
- Text Embedding-Based Projector Initialization:
    - Leverages CLIP's vision-language alignment by initializing the linear transformation \(W = (v_T)^\top\) with the normalized text embeddings of each class description
    - Mathematical equivalence: text-embedding initialization is equivalent to CLIP's pretrained zero-shot classifier (Proposition 1)
    - No additional storage required (text descriptions are generated from fixed prompt templates)
    - Design Motivation: provides a strong starting point, allowing the projector to adapt further from the pretrained zero-shot capability
- LoRA-Like Low-Rank Knowledge Transfer:
    - Decomposes the weight increment as \(\Delta\theta = A \cdot B\), where \(A \in \mathbb{R}^{d \times r}\), \(B \in \mathbb{R}^{r \times k}\), \(r \ll d, k\)
    - Applies LoRA to both the convolutional layers of the CLIP image encoder and the linear transformation component (with different ranks)
    - Training objective combines multi-weak-teacher knowledge distillation and cross-entropy: \(\mathcal{L}(\mathcal{D};\theta) = \mathrm{MSE}(f_\theta(X), Y') + \lambda\, \mathrm{CE}(f_\theta(X), Y)\) (a minimal loss sketch follows after this list)
    - Weak teachers are drawn from different stages (9 checkpoints) of the ResNet-18 training trajectory
    - Design Motivation: minimizes fine-tuning cost while closing the gap between the pretrained and target distributions
- Synthetic Dataset Initialization and Update:
    - Initialization follows RDED: selects the most representative image patches based on teacher-model difficulty scores and concatenates them
    - Additional image update step: minimizes the discrepancy between the original-resolution and downsampled-then-upsampled versions of each patch in CLIP feature space: \(\mathcal{G}(\mathcal{E}_I, p) = \mathrm{MSE}(\mathcal{E}_I(p), \mathcal{E}_I(\hat{p}))\)
    - Design Motivation: since a surrogate projector is used instead of the original teacher, image updates are needed to reduce information loss on the projector
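A minimal sketch of the projector fine-tuning objective above. How the nine weak ResNet-18 checkpoints are combined is not spelled out here; averaging their soft predictions, and taking the MSE on softmax outputs rather than raw logits, are plausible readings rather than confirmed details. `projector` is a module like the one sketched earlier.

```python
import torch
import torch.nn.functional as F


def projector_loss(projector, weak_teachers, images, labels, lam=1.0):
    """L(D; theta) = MSE(f_theta(X), Y') + lambda * CE(f_theta(X), Y)."""
    logits = projector(images)                                    # f_theta(X), shape (B, C)
    with torch.no_grad():
        # Y': soft labels from weak teachers drawn along the ResNet-18 training trajectory
        # (preprocessing differences between CLIP and the teachers are ignored here).
        soft = torch.stack(
            [t(images).softmax(dim=-1) for t in weak_teachers]
        ).mean(dim=0)
    mse = F.mse_loss(logits.softmax(dim=-1), soft)                # distillation term
    ce = F.cross_entropy(logits, labels)                          # hard-label term
    return mse + lam * ce
```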
### Loss & Training
Student parameter update during downstream training:

$$
\phi^e = \phi^{e-1} - \alpha \nabla_\phi \big(\mathrm{MSE}(f_\phi(\mathcal{A}(X_s)), Y^*) + \beta\, \mathrm{CE}(f_\phi(\mathcal{A}(X_s)), Y_s)\big)
$$

- \(Y^*\) is generated online by the projector (not pre-stored)
- Configuration: rank = 8/64 for ImageNet-100, rank = 8/128 for ImageNet-1K
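A minimal sketch (not the authors' code) of one such step, with hypothetical `student`, `projector`, and `augment` objects; applying the MSE to softmax outputs is again an assumption.

```python
import torch
import torch.nn.functional as F


def student_step(student, projector, optimizer, x_syn, y_syn, augment, beta=1.0):
    x_aug = augment(x_syn)                                    # A(X_s): augmented synthetic batch
    with torch.no_grad():
        y_star = projector(x_aug).softmax(dim=-1)              # Y*, generated on the fly (never stored)
    logits = student(x_aug)
    loss = F.mse_loss(logits.softmax(dim=-1), y_star) + beta * F.cross_entropy(logits, y_syn)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```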
## Key Experimental Results

### Main Results (ResNet-18 Top-1 Accuracy, %)
| Dataset | IPC | SRe2L | G_VBSM | RDED | HeLlO | Label Storage Ratio |
|---|---|---|---|---|---|---|
| IN-100 | 1 | 3.0 | - | 8.1 | 12.5 (+4.4) | 0.1× |
| IN-100 | 10 | 9.5 | - | 36.0 | 48.9 (+12.9) | 0.01× |
| IN-100 | 50 | 27.0 | - | 61.6 | 69.4 (+7.8) | 0.002× |
| IN-1K | 1 | 0.1 | 1.7 | 6.6 | 12.9 (+6.3) | 1e-4× |
| IN-1K | 10 | 21.3 | 31.4 | 42.0 | 43.7 (+1.7) | 1e-5× |
| IN-1K | 50 | 46.8 | 51.8 | 56.5 | 52.2 (-4.3) | 3e-6× |
Label-generation model size: RDED's teacher has 10.7M parameters vs. only 0.8M stored parameters for HeLlO (0.07×)
### Ablation Study
Cross-architecture generalization (IN-1K, IPC=10):
| Architecture | RDED | HeLlO | Gain |
|---|---|---|---|
| ShuffleNet-V2 | 23.3 | 26.5 | +3.2 |
| MobileNet-V2 | 34.4 | 38.1 | +3.7 |
| EfficientNet-B0 | 42.8 | 44.4 | +1.6 |
| Swin-V2-Tiny | 17.8 | 29.5 | +11.7 |
| VGG-11 | 22.7 | 24.2 | +1.5 |
Incremental component ablation (IN-1K, IPC=10):
| Configuration | Acc. | #Params |
|---|---|---|
| CLIP linear probe | 28.2 | 1.0M |
| + Multi-Weak-Teacher | 30.1 (+1.9) | 1.0M |
| + LoRA Knowledge Transfer | 43.5 (+13.4) | 1.5M |
| + Text-Embedding Init | 43.6 (+0.1) | 0.8M (↓0.7M) |
| + Image Update | 43.7 (+0.1) | 0.8M |
### Key Findings
- LoRA knowledge transfer is the most critical component: contributes a large gain of +13.4%, effectively adapting pretrained embeddings to the target distribution
- Text embedding initialization serves a dual role: despite only +0.1% accuracy improvement, it reduces storage by 0.7M parameters (no need to store the initial linear transformation weights)
- HeLlO shows the greatest advantage at small IPC and on Transformer architectures: exceeds RDED by 12.9% on IN-100 IPC=10 and by 11.7% on Swin-V2-Tiny
- Limitations at large scale: HeLlO falls 4.3% below RDED at IN-1K IPC=50, indicating insufficient projector accuracy in very large label spaces
## Highlights & Insights
- Precise problem formulation: the first work to focus on the overlooked "label bloat" problem in dataset distillation, noting that images are distilled but labels are not
- Clever use of CLIP's vision-language alignment: text embedding initialization achieves a strong starting point with zero additional storage cost
- Extreme compression ratio: 0.003% of original label storage suffices for comparable performance
- LoRA applied in a new setting: extends LoRA from LLM fine-tuning to projector construction in dataset distillation
- Cross-architecture generalization: the particularly notable advantage on Transformer architectures warrants attention
## Limitations & Future Work
- Performance still falls short of RDED in large-scale, large-IPC settings (IN-1K IPC=50), indicating limited projector accuracy
- Relies on a CLIP pretrained model; effectiveness may be limited in domains poorly covered by CLIP (e.g., medical imaging)
- The selection of weak teachers (checkpoint stages in the training trajectory) is a hyperparameter requiring adjustment for different IPC values
- The gain from the image update step is marginal (+0.1%), and its cost-effectiveness warrants evaluation
- Downstream training incurs additional online computation overhead from invoking the CLIP encoder for inference
## Related Work & Insights
- The patch selection and concatenation strategy from RDED provides an effective foundation for image distillation
- The low-rank decomposition idea from LoRA shows strong potential for application in knowledge compression scenarios
- Leveraging the vision-language alignment of pretrained models for label space reconstruction is a promising direction
## Rating
- Novelty: ⭐⭐⭐⭐ First to address soft label storage; the solution is elegant, though the core components are combinations of existing techniques
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dataset, multi-architecture, complete ablations, but lacks validation in more large-scale settings
- Writing Quality: ⭐⭐⭐⭐ Problem and method are clearly articulated, with rigorous mathematical derivations
- Value: ⭐⭐⭐⭐ Addresses a practical bottleneck in dataset distillation with significant implications for large-scale distillation