
Heavy Labels Out! Dataset Distillation with Label Space Lightening

Conference: ICCV 2025 · arXiv: 2408.08201 · Code: Coming soon · Area: Model Compression / Dataset Distillation · Keywords: dataset distillation, soft label compression, CLIP, LoRA, label space lightening

TL;DR

This paper proposes the HeLlO framework, which builds a lightweight online image-to-label projector from a pretrained CLIP model and LoRA-like low-rank knowledge transfer, reducing soft-label storage in dataset distillation to about 0.003% of the original while matching or surpassing SOTA performance.

Background & Motivation

Dataset distillation aims to compress large-scale training sets into extremely small synthetic sets. Current SOTA methods (SRe2L, G_VBSM, RDED) significantly reduce image count but heavily rely on large volumes of soft labels generated by pretrained teacher models to preserve performance.

Core Problem: soft label storage overhead is enormous, potentially comparable to the original dataset. For example:

  • ImageNet-1K, IPC=1: images are only ~15 MB, but soft labels exceed 572 MB (38×)
  • ImageNet-1K, IPC=200: soft labels reach 110 GB, comparable to the original dataset
  • Root cause: each data augmentation generates an independent \(C\)-dimensional soft label (\(C\) = number of classes), so the total label count is \(K\) (augmentation iterations) × \(N_s\) (synthetic samples) × \(C\)
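
A back-of-the-envelope sketch of why labels dominate; the fp32 width and the iteration count K below are illustrative assumptions, not figures taken from the paper:

```python
def soft_label_storage_bytes(num_iters_k, num_samples_ns, num_classes_c, bytes_per_value=4):
    """Total soft-label storage = K (augmentation iterations) x N_s (samples) x C (classes) x bytes."""
    return num_iters_k * num_samples_ns * num_classes_c * bytes_per_value


# ImageNet-1K, IPC = 1: N_s = 1000 synthetic images, C = 1000 classes.
# With, say, K = 150 stored augmentation rounds in fp32, labels alone take ~600 MB,
# on the same order as the ~572 MB reported above, versus ~15 MB of distilled images.
print(soft_label_storage_bytes(150, 1000, 1000) / 1e6, "MB")
```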

This reveals a neglected bottleneck in current distillation methods: images are distilled, but labels are not.

Method

Overall Architecture

The HeLlO framework replaces massively stored offline soft labels with a lightweight online projector:

  1. Construct a projector from a CLIP image encoder plus a linear transformation
  2. Initialize the linear transformation with class text embeddings (zero storage cost)
  3. Fine-tune the projector toward the target distribution via LoRA-like low-rank matrices
  4. Optionally update the synthetic images to reduce projector error
  5. Generate soft labels online during downstream training
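
A minimal PyTorch-style sketch of this pipeline, assuming the CLIP image encoder is exposed as an `nn.Module` that returns one feature vector per image; the class/parameter names, shapes, and the exact placement of the LoRA residual are illustrative assumptions, not the paper's reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OnlineLabelProjector(nn.Module):
    """Sketch of the online projector: a frozen CLIP image encoder plus a linear head
    whose weights start as the class text embeddings and are adapted by a LoRA-style
    low-rank residual. (The paper also inserts LoRA into the encoder's convolutional
    layers; that part is omitted here for brevity.)"""

    def __init__(self, clip_image_encoder: nn.Module, class_text_embeddings: torch.Tensor,
                 rank: int = 8):
        super().__init__()
        self.encoder = clip_image_encoder                         # pretrained, kept frozen
        for p in self.encoder.parameters():
            p.requires_grad_(False)

        # W = (v_T)^T: one L2-normalized text embedding per class -> zero extra storage.
        w0 = F.normalize(class_text_embeddings.detach(), dim=-1)  # [C, d]
        self.register_buffer("w0", w0)

        # LoRA-style residual dW = A @ B with r << d, C; A starts at zero, so the
        # initial projector is exactly CLIP's zero-shot classifier (Proposition 1).
        num_classes, dim = w0.shape
        self.lora_a = nn.Parameter(torch.zeros(num_classes, rank))
        self.lora_b = nn.Parameter(torch.randn(rank, dim) * 0.01)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = F.normalize(self.encoder(images), dim=-1)         # [B, d] image features
        w = self.w0 + self.lora_a @ self.lora_b                   # adapted linear head [C, d]
        return feats @ w.t()                                      # online soft labels (logits) [B, C]
```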

Key Designs

  1. Text Embedding-Based Projector Initialization:

    • Leverages CLIP's vision-language alignment by initializing the linear transformation \(W = (v_T)^T\) with normalized text embeddings of each class description
    • Mathematical equivalence: text embedding initialization is equivalent to pretrained zero-shot classification (Proposition 1)
    • No additional storage required (text descriptions are generated from fixed prompt templates)
    • Design Motivation: provides a strong starting point, allowing the projector to further adapt from pretrained zero-shot capability
  2. LoRA-Like Low-Rank Knowledge Transfer:

    • Decomposes the weight increment as \(\Delta\theta = A \cdot B\), where \(A \in \mathbb{R}^{d \times r}\), \(B \in \mathbb{R}^{r \times k}\), \(r \ll d, k\)
    • Applies LoRA to both the convolutional layers of the CLIP image encoder and the linear transformation component (with different ranks)
    • Training objective combines multi-weak-teacher knowledge distillation and cross-entropy: \(\mathcal{L}(\mathcal{D};\theta) = MSE(f_\theta(X), Y') + \lambda CE(f_\theta(X), Y)\)
    • Weak teachers are drawn from different stages (9 checkpoints) of the ResNet-18 training trajectory
    • Design Motivation: minimizes fine-tuning cost while closing the gap between the pretrained and target distributions (a combined sketch of this fine-tuning step and the image update follows this list)
  3. Synthetic Dataset Initialization and Update:

    • Initialization follows RDED: selects the most representative image patches based on teacher model difficulty scores and concatenates them
    • Additional image update step: minimizes the discrepancy between original-resolution and downsampled-then-upsampled versions in CLIP feature space: \(\mathcal{G}(\mathcal{E}_I, p) = MSE(\mathcal{E}_I(p), \mathcal{E}_I(\hat{p}))\)
    • Design Motivation: since a surrogate projector is used instead of the original teacher, image updates are needed to reduce information loss on the projector
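
A hedged sketch of the fine-tuning step from item 2 and the image update from item 3, assuming the optimizer holds only the low-rank (LoRA) parameters, that weak-teacher outputs are averaged into \(Y'\), and that the MSE term compares softmax probabilities; these details are assumptions for illustration:

```python
import torch
import torch.nn.functional as F


def projector_finetune_step(projector, weak_teachers, images, hard_labels, optimizer, lam=1.0):
    """One LoRA update: L(D; theta) = MSE(f_theta(X), Y') + lambda * CE(f_theta(X), Y).
    `weak_teachers` are frozen checkpoints from one ResNet-18 training trajectory;
    averaging their soft labels into Y' is an assumption made for this sketch."""
    optimizer.zero_grad()
    logits = projector(images)
    with torch.no_grad():
        teacher_probs = torch.stack(
            [t(images).softmax(dim=-1) for t in weak_teachers]).mean(dim=0)   # Y'
    loss = F.mse_loss(logits.softmax(dim=-1), teacher_probs) \
        + lam * F.cross_entropy(logits, hard_labels)
    loss.backward()
    optimizer.step()
    return loss.item()


def image_update_loss(clip_encoder, patches, down_size=112):
    """G(E_I, p) = MSE(E_I(p), E_I(p_hat)), where p_hat is p downsampled and then
    upsampled back to the original resolution; the target size and interpolation
    mode are assumptions."""
    h, w = patches.shape[-2:]
    down = F.interpolate(patches, size=(down_size, down_size), mode="bilinear", align_corners=False)
    p_hat = F.interpolate(down, size=(h, w), mode="bilinear", align_corners=False)
    return F.mse_loss(clip_encoder(patches), clip_encoder(p_hat))
```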

Loss & Training

During downstream training, the student parameters \(\phi\) are updated at epoch \(e\) as \(\phi^e = \phi^{e-1} - \alpha \nabla_\phi (MSE(f_\phi(\mathcal{A}(X_s)), Y^*) + \beta CE(f_\phi(\mathcal{A}(X_s)), Y_s))\), where \(\mathcal{A}\) is the data augmentation and \(X_s, Y_s\) are the synthetic images and their hard labels.

  • \(Y^*\) is generated online by the projector (not pre-stored)
  • LoRA rank configuration: ImageNet-100 uses rank = 8/64; ImageNet-1K uses rank = 8/128
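
A minimal sketch of one downstream update following the formula above; the augmentation, optimizer, and the choice to compare softmax outputs in the MSE term are assumptions:

```python
import torch
import torch.nn.functional as F


def student_train_step(student, projector, augment, x_syn, y_syn, optimizer, beta=1.0):
    """phi <- phi - alpha * grad( MSE(f_phi(A(X_s)), Y*) + beta * CE(f_phi(A(X_s)), Y_s) )."""
    optimizer.zero_grad()
    x_aug = augment(x_syn)                         # A(X_s): fresh augmentation each step
    with torch.no_grad():
        y_star = projector(x_aug).softmax(dim=-1)  # Y*: soft labels produced online, never stored
    logits = student(x_aug)
    loss = F.mse_loss(logits.softmax(dim=-1), y_star) + beta * F.cross_entropy(logits, y_syn)
    loss.backward()
    optimizer.step()
    return loss.item()
```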

Key Experimental Results

Main Results (ResNet-18 Top-1 Accuracy, %)

| Dataset | IPC | SRe2L | G_VBSM | RDED | HeLlO | Label Storage Ratio |
|---------|-----|-------|--------|------|-------|---------------------|
| IN-100 | 1 | 3.0 | - | 8.1 | 12.5 (+4.4) | 0.1× |
| IN-100 | 10 | 9.5 | - | 36.0 | 48.9 (+12.9) | 0.01× |
| IN-100 | 50 | 27.0 | - | 61.6 | 69.4 (+7.8) | 0.002× |
| IN-1K | 1 | 0.1 | 1.7 | 6.6 | 12.9 (+6.3) | 1e-4× |
| IN-1K | 10 | 21.3 | 31.4 | 42.0 | 43.7 (+1.7) | 1e-5× |
| IN-1K | 50 | 46.8 | 51.8 | 56.5 | 52.2 (-) | 3e-6× |

Values in parentheses are HeLlO's gains over RDED.

Teacher model parameter count: RDED 10.7M vs. HeLlO only 0.8M (0.07×)

Ablation & Generalization Analysis

Cross-architecture generalization (IN-1K, IPC=10):

| Architecture | RDED | HeLlO | Gain |
|--------------|------|-------|------|
| ShuffleNet-V2 | 23.3 | 26.5 | +3.2 |
| MobileNet-V2 | 34.4 | 38.1 | +3.7 |
| EfficientNet-B0 | 42.8 | 44.4 | +1.6 |
| Swin-V2-Tiny | 17.8 | 29.5 | +11.7 |
| VGG-11 | 22.7 | 24.2 | +1.5 |

Incremental component ablation (IN-1K, IPC=10):

| Configuration | Acc. (%) | #Params |
|---------------|----------|---------|
| Linear probe on CLIP | 28.2 | 1.0M |
| + Multi-Weak-Teacher | 30.1 (+1.9) | 1.0M |
| + LoRA Knowledge Transfer | 43.5 (+13.4) | 1.5M |
| + Text-Embedding Init | 43.6 (+0.1) | 0.8M (↓0.7M) |
| + Image Update | 43.7 (+0.1) | 0.8M |

Key Findings

  • LoRA knowledge transfer is the most critical component: contributes a large gain of +13.4%, effectively adapting pretrained embeddings to the target distribution
  • Text embedding initialization serves a dual role: despite only +0.1% accuracy improvement, it reduces storage by 0.7M parameters (no need to store the initial linear transformation weights)
  • HeLlO shows the greatest advantage at small IPC and on Transformer architectures: exceeds RDED by 12.9% on IN-100 IPC=10 and by 11.7% on Swin-V2-Tiny
  • Limitations at large scale: HeLlO falls 4.3% below RDED at IN-1K IPC=50, indicating insufficient projector accuracy in very large label spaces

Highlights & Insights

  • Precise problem formulation: the first work to focus on the overlooked "label bloat" problem in dataset distillation, noting that images are distilled but labels are not
  • Clever use of CLIP's vision-language alignment: text embedding initialization achieves a strong starting point with zero additional storage cost
  • Extreme compression ratio: 0.003% of original label storage suffices for comparable performance
  • LoRA applied in a new setting: extends LoRA from LLM fine-tuning to projector construction in dataset distillation
  • Cross-architecture generalization: the particularly notable advantage on Transformer architectures warrants attention

Limitations & Future Work

  • Performance still falls short of RDED in large-scale, large-IPC settings (IN-1K IPC=50), indicating limited projector accuracy
  • Relies on a CLIP pretrained model; effectiveness may be limited in domains poorly covered by CLIP (e.g., medical imaging)
  • The selection of weak teachers (checkpoint stages in the training trajectory) is a hyperparameter requiring adjustment for different IPC values
  • The gain from the image update step is marginal (+0.1%), and its cost-effectiveness warrants evaluation
  • Downstream training incurs additional online computation overhead from invoking the CLIP encoder for inference
  • The patch selection and concatenation strategy from RDED provides an effective foundation for image distillation
  • The low-rank decomposition idea from LoRA shows strong potential for application in knowledge compression scenarios
  • Leveraging the vision-language alignment of pretrained models for label space reconstruction is a promising direction

Rating

  • Novelty: ⭐⭐⭐⭐ First to address soft label storage; the solution is elegant, though the core components are combinations of existing techniques
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dataset, multi-architecture, complete ablations, but lacks validation in more large-scale settings
  • Writing Quality: ⭐⭐⭐⭐ Problem and method are clearly articulated, with rigorous mathematical derivations
  • Value: ⭐⭐⭐⭐ Addresses a practical bottleneck in dataset distillation with significant implications for large-scale distillation