# Heavy Labels Out! Dataset Distillation with Label Space Lightening
- Conference: ICCV 2025
- arXiv: 2408.08201
- Code: Coming soon
- Area: Model Compression / Dataset Distillation
- Keywords: dataset distillation, soft label compression, CLIP, LoRA, label space lightening
## TL;DR
This paper proposes the HeLlO framework, which constructs a lightweight image-label projector using a CLIP pretrained model and LoRA-like low-rank knowledge transfer, reducing soft label storage in dataset distillation to 0.003% of the original while maintaining or surpassing SOTA performance.
## Background & Motivation
Dataset distillation aims to compress large-scale training sets into extremely small synthetic sets. Current SOTA methods (SRe2L, G_VBSM, RDED) significantly reduce image count but heavily rely on large volumes of soft labels generated by pretrained teacher models to preserve performance.
Core Problem: soft label storage overhead is enormous, potentially comparable to the original dataset. For example:

- ImageNet-1K, IPC=1: images are only ~15 MB, but soft labels exceed 572 MB (38×)
- ImageNet-1K, IPC=200: soft labels reach 110 GB, comparable to the original dataset
- Root cause: each data augmentation generates an independent \(C\)-dimensional soft label (\(C\) = number of classes), for a total of \(K\) (iterations) × \(N_s\) (samples) × \(C\) stored values
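As a rough sanity check (the exact augmentation count is not given here; assuming \(K = 300\) iterations and 2-byte fp16 label entries), the IPC=1 figure works out as:

$$
K \times N_s \times C \times 2\,\mathrm{B} = 300 \times 1000 \times 1000 \times 2\,\mathrm{B} = 6 \times 10^{8}\,\mathrm{B} \approx 572\,\mathrm{MiB}.
$$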
This reveals a neglected bottleneck in current distillation methods: images are distilled, but labels are not.
## Method

### Overall Architecture
The HeLlO framework replaces massively stored offline soft labels with a lightweight online projector:

1. Construct a projector from a CLIP image encoder and a linear transformation
2. Initialize the linear transformation with class text embeddings (zero storage cost)
3. Fine-tune the projector toward the target distribution via LoRA-like low-rank matrices
4. Optionally update the synthetic images to reduce projector error
5. Generate soft labels online during downstream training
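A minimal sketch (not the authors' code) of such an online label projector: a frozen CLIP image encoder, a linear head initialized from normalized class-text embeddings, and a LoRA-style low-rank residual. For brevity the adapter is placed only on the head (the paper also applies LoRA inside the encoder), the ViT-B/32 backbone is an arbitrary choice, and the openai/CLIP package is assumed.

```python
# pip install git+https://github.com/openai/CLIP
import torch
import torch.nn as nn
import clip


class OnlineLabelProjector(nn.Module):
    def __init__(self, class_names, rank=8, device="cpu"):
        super().__init__()
        self.encoder, _ = clip.load("ViT-B/32", device=device)  # backbone choice is illustrative
        for p in self.encoder.parameters():
            p.requires_grad_(False)  # backbone stays frozen in this sketch

        # Text-embedding initialization: W = (v_T)^T, one row per class,
        # built from a fixed prompt template -> zero extra storage.
        prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
        with torch.no_grad():
            txt = self.encoder.encode_text(prompts).float()
            txt = txt / txt.norm(dim=-1, keepdim=True)           # (C, d)
        self.register_buffer("w_text", txt)                      # zero-shot head, not trained

        # LoRA-like low-rank residual: Delta_W = A @ B with r << d, k.
        c, d = txt.shape
        self.lora_A = nn.Parameter(torch.zeros(c, rank))
        self.lora_B = nn.Parameter(torch.randn(rank, d) * 0.01)

    def forward(self, images):
        # images: a CLIP-preprocessed batch (B, 3, 224, 224)
        with torch.no_grad():
            feat = self.encoder.encode_image(images).float()
            feat = feat / feat.norm(dim=-1, keepdim=True)         # (B, d)
        w = self.w_text + self.lora_A @ self.lora_B                # (C, d)
        return feat @ w.t()                                        # (B, C) soft logits, produced online
```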
### Key Designs
- Text Embedding-Based Projector Initialization:
    - Leverages CLIP's vision-language alignment by initializing the linear transformation \(W = (v_T)^\top\) with the normalized text embeddings of each class description
    - Mathematical equivalence: text-embedding initialization is equivalent to CLIP's pretrained zero-shot classifier (Proposition 1)
    - No additional storage required (text descriptions are generated from fixed prompt templates)
    - Design Motivation: provides a strong starting point, allowing the projector to adapt further from the pretrained zero-shot capability
- LoRA-Like Low-Rank Knowledge Transfer:
    - Decomposes the weight increment as \(\Delta\theta = A \cdot B\), where \(A \in \mathbb{R}^{d \times r}\), \(B \in \mathbb{R}^{r \times k}\), \(r \ll d, k\)
    - Applies LoRA to both the convolutional layers of the CLIP image encoder and the linear transformation component (with different ranks)
    - Training objective combines multi-weak-teacher knowledge distillation and cross-entropy: \(\mathcal{L}(\mathcal{D};\theta) = \mathrm{MSE}(f_\theta(X), Y') + \lambda\, \mathrm{CE}(f_\theta(X), Y)\) (a minimal loss sketch follows after this list)
    - Weak teachers are drawn from different stages (9 checkpoints) of the ResNet-18 training trajectory
    - Design Motivation: minimizes fine-tuning cost while closing the gap between the pretrained and target distributions
- Synthetic Dataset Initialization and Update:
    - Initialization follows RDED: selects the most representative image patches based on teacher-model difficulty scores and concatenates them
    - Additional image update step: minimizes the discrepancy between the original-resolution and downsampled-then-upsampled versions of each patch in CLIP feature space: \(\mathcal{G}(\mathcal{E}_I, p) = \mathrm{MSE}(\mathcal{E}_I(p), \mathcal{E}_I(\hat{p}))\)
    - Design Motivation: since a surrogate projector is used instead of the original teacher, image updates are needed to reduce information loss on the projector
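A minimal sketch of the projector fine-tuning objective above. How the nine weak ResNet-18 checkpoints are combined is not spelled out here; averaging their soft predictions, and taking the MSE on softmax outputs rather than raw logits, are plausible readings rather than confirmed details. `projector` is a module like the one sketched earlier.

```python
import torch
import torch.nn.functional as F


def projector_loss(projector, weak_teachers, images, labels, lam=1.0):
    """L(D; theta) = MSE(f_theta(X), Y') + lambda * CE(f_theta(X), Y)."""
    logits = projector(images)                                    # f_theta(X), shape (B, C)
    with torch.no_grad():
        # Y': soft labels from weak teachers drawn along the ResNet-18 training trajectory
        # (preprocessing differences between CLIP and the teachers are ignored here).
        soft = torch.stack(
            [t(images).softmax(dim=-1) for t in weak_teachers]
        ).mean(dim=0)
    mse = F.mse_loss(logits.softmax(dim=-1), soft)                # distillation term
    ce = F.cross_entropy(logits, labels)                          # hard-label term
    return mse + lam * ce
```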
### Loss & Training
Student parameter update during downstream training:

$$
\phi^e = \phi^{e-1} - \alpha \nabla_\phi \big(\mathrm{MSE}(f_\phi(\mathcal{A}(X_s)), Y^*) + \beta\, \mathrm{CE}(f_\phi(\mathcal{A}(X_s)), Y_s)\big)
$$

- \(Y^*\) is generated online by the projector (not pre-stored)
- Configuration: rank = 8/64 for ImageNet-100, rank = 8/128 for ImageNet-1K
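A minimal sketch (not the authors' code) of one such step, with hypothetical `student`, `projector`, and `augment` objects; applying the MSE to softmax outputs is again an assumption.

```python
import torch
import torch.nn.functional as F


def student_step(student, projector, optimizer, x_syn, y_syn, augment, beta=1.0):
    x_aug = augment(x_syn)                                    # A(X_s): augmented synthetic batch
    with torch.no_grad():
        y_star = projector(x_aug).softmax(dim=-1)              # Y*, generated on the fly (never stored)
    logits = student(x_aug)
    loss = F.mse_loss(logits.softmax(dim=-1), y_star) + beta * F.cross_entropy(logits, y_syn)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```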
## Key Experimental Results

### Main Results (ResNet-18 Top-1 Accuracy, %)
| Dataset | IPC | SRe2L | G_VBSM | RDED | HeLlO | Label Storage Ratio |
|---|---|---|---|---|---|---|
| IN-100 | 1 | 3.0 | - | 8.1 | 12.5 (+4.4) | 0.1× |
| IN-100 | 10 | 9.5 | - | 36.0 | 48.9 (+12.9) | 0.01× |
| IN-100 | 50 | 27.0 | - | 61.6 | 69.4 (+7.8) | 0.002× |
| IN-1K | 1 | 0.1 | 1.7 | 6.6 | 12.9 (+6.3) | 1e-4× |
| IN-1K | 10 | 21.3 | 31.4 | 42.0 | 43.7 (+1.7) | 1e-5× |
| IN-1K | 50 | 46.8 | 51.8 | 56.5 | 52.2 (-4.3) | 3e-6× |
Label-generation model size: RDED's teacher has 10.7M parameters vs. only 0.8M stored parameters for HeLlO (0.07×)
### Ablation Study
Cross-architecture generalization (IN-1K, IPC=10):
| Architecture | RDED | HeLlO | Gain |
|---|---|---|---|
| ShuffleNet-V2 | 23.3 | 26.5 | +3.2 |
| MobileNet-V2 | 34.4 | 38.1 | +3.7 |
| EfficientNet-B0 | 42.8 | 44.4 | +1.6 |
| Swin-V2-Tiny | 17.8 | 29.5 | +11.7 |
| VGG-11 | 22.7 | 24.2 | +1.5 |
Incremental component ablation (IN-1K, IPC=10):
| Configuration | Acc. | #Params |
|---|---|---|
| CLIP linear probe | 28.2 | 1.0M |
| + Multi-Weak-Teacher | 30.1 (+1.9) | 1.0M |
| + LoRA Knowledge Transfer | 43.5 (+13.4) | 1.5M |
| + Text-Embedding Init | 43.6 (+0.1) | 0.8M (↓0.7M) |
| + Image Update | 43.7 (+0.1) | 0.8M |
### Key Findings
- LoRA knowledge transfer is the most critical component: contributes a large gain of +13.4%, effectively adapting pretrained embeddings to the target distribution
- Text embedding initialization serves a dual role: despite only +0.1% accuracy improvement, it reduces storage by 0.7M parameters (no need to store the initial linear transformation weights)
- HeLlO shows the greatest advantage at small IPC and on Transformer architectures: exceeds RDED by 12.9% on IN-100 IPC=10 and by 11.7% on Swin-V2-Tiny
- Limitations at large scale: HeLlO falls 4.3% below RDED at IN-1K IPC=50, indicating insufficient projector accuracy in very large label spaces
## Highlights & Insights
- Precise problem formulation: the first work to focus on the overlooked "label bloat" problem in dataset distillation, noting that images are distilled but labels are not
- Clever use of CLIP's vision-language alignment: text embedding initialization achieves a strong starting point with zero additional storage cost
- Extreme compression ratio: 0.003% of original label storage suffices for comparable performance
- LoRA applied in a new setting: extends LoRA from LLM fine-tuning to projector construction in dataset distillation
- Cross-architecture generalization: the particularly notable advantage on Transformer architectures warrants attention
## Limitations & Future Work
- Performance still falls short of RDED in large-scale, large-IPC settings (IN-1K IPC=50), indicating limited projector accuracy
- Relies on a CLIP pretrained model; effectiveness may be limited in domains poorly covered by CLIP (e.g., medical imaging)
- The selection of weak teachers (checkpoint stages in the training trajectory) is a hyperparameter requiring adjustment for different IPC values
- The gain from the image update step is marginal (+0.1%), and its cost-effectiveness warrants evaluation
- Downstream training incurs additional online computation overhead from invoking the CLIP encoder for inference
## Related Work & Insights
- The patch selection and concatenation strategy from RDED provides an effective foundation for image distillation
- The low-rank decomposition idea from LoRA shows strong potential for application in knowledge compression scenarios
- Leveraging the vision-language alignment of pretrained models for label space reconstruction is a promising direction
## Rating
- Novelty: ⭐⭐⭐⭐ First to address soft label storage; the solution is elegant, though the core components are combinations of existing techniques
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dataset, multi-architecture, complete ablations, but lacks validation in more large-scale settings
- Writing Quality: ⭐⭐⭐⭐ Problem and method are clearly articulated, with rigorous mathematical derivations
- Value: ⭐⭐⭐⭐ Addresses a practical bottleneck in dataset distillation with significant implications for large-scale distillation