Leveraging Hierarchical Feature Sharing for Efficient Dataset Condensation¶

Conference: ECCV 2024
arXiv: 2310.07506
Code: None (no code link provided in the paper)
Area: Model Compression / Dataset Distillation
Keywords: Dataset Distillation, Data Parameterization, Hierarchical Feature Sharing, Data Redundancy Pruning, Storage Efficiency

TL;DR¶

This paper proposes a Hierarchical Memory Network (HMN) that stores the synthetic data for dataset distillation in a three-tier structure (dataset-level, class-level, and instance-level memory). Hierarchical feature sharing improves storage efficiency, and instance-level pruning removes the remaining redundancy. Trained with only a low-GPU-memory batch-based loss, HMN matches or surpasses the baseline methods in nearly every reported setting.

Background & Motivation¶

Core Goal of Dataset Distillation¶

Dataset Condensation (DC) aims to synthesize a tiny dataset \(\mathcal{S}\) (\(|\mathcal{S}| \ll |\mathcal{D}|\)) from a large-scale real dataset \(\mathcal{D}\), such that a model trained on \(\mathcal{S}\) can achieve performance close to one trained on \(\mathcal{D}\). As dataset scales grow, DC demonstrates significant value in scenarios such as continual learning, neural architecture search, and federated learning.

Rise and Limitations of Data Parameterization¶

Recent works have proposed data parameterization, which avoids storing synthetic images directly. Instead, these methods distill the data into a parameterized data container \(f_\theta\) and improve storage efficiency further by sharing features across images. For example:
- IDC: Improves storage efficiency by downsampling images.
- HaBa/LinBa: Share common information among images based on matrix factorization.

However, these methods overlook a key fact: feature sharing among images has a hierarchical structure. In a classification taxonomy, two images of cats share "cat-class" features, while images of cats and dogs share higher-level "animal-class" features. Existing factorization methods are flat and fail to capture such hierarchical sharing relationships.

Issue of Redundant Data¶

The authors also found significant data redundancy in existing distilled datasets. Measuring data difficulty/importance with the Area Under the Margin (AUM) metric revealed that, in the CIFAR10 10 IPC dataset generated by HaBa, at least 10% of the data could be pruned without affecting accuracy. However, direct pruning is difficult because weights are coupled in existing data containers: images generated from the same basis vector may vary drastically in difficulty, so naively pruning basis vectors risks deleting valuable data.

Design Motivation¶

The design motivation of HMN directly corresponds to two findings: (1) the hierarchical structure naturally aligns with the hierarchical feature sharing of the taxonomy; (2) the independence of instance-level memories makes pruning straightforward and natural.

Method¶

Overall Architecture¶

HMN is a parameterized data container that outputs a synthetic image \(\mathcal{S}_i\) given an image index \(i\). It consists of a three-tier memory structure and auxiliary networks, with the parameter count of all components accounted for within the storage budget.
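The budget accounting can be made concrete with a small calculation. The sketch below assumes the storage budget is the number of floats that the plain images would occupy at a given IPC; every size passed in (`shared_params`, `class_mem_size`, `instance_mem_size`) is a hypothetical value chosen for illustration, not a figure from the paper.

```python
def instance_memory_count(ipc, num_classes, image_shape,
                          shared_params, class_mem_size, instance_mem_size):
    """Number of instance-level memories that fit in an IPC-equivalent budget."""
    channels, height, width = image_shape
    # Budget: floats needed to store `ipc` raw images per class.
    budget = ipc * num_classes * channels * height * width
    # Costs that do not grow with the number of generated images
    # (dataset-level memory, networks, one class-level memory per class).
    fixed = shared_params + num_classes * class_mem_size
    remaining = budget - fixed
    if remaining < 0:
        raise ValueError("shared components alone exceed the budget")
    return remaining // instance_mem_size


# CIFAR10-like setting at 1 IPC: budget = 1 * 10 * 3 * 32 * 32 = 30720 floats.
n_images = instance_memory_count(
    ipc=1, num_classes=10, image_shape=(3, 32, 32),
    shared_params=20000,     # hypothetical: dataset memory + auxiliary networks
    class_mem_size=72,       # hypothetical
    instance_mem_size=100,   # hypothetical
)
print(n_images)  # 100 instance memories, i.e. 10 generated images per class
```

Under these made-up sizes, a budget worth 10 raw images yields 100 generated images, which is the kind of trade-off the shared tiers are meant to enable.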

Key Designs¶

1. Three-Tier Hierarchical Memory Network¶

Function: Stores distilled data using three tiers of learnable parameter tensors, corresponding to different granularities of feature sharing.

Mechanism:

Dataset-level memory \(m^{(\mathcal{D})}\): Globally shared, storing common information of all images (such as low-level textures, color distributions, etc.), shared across all classes.
Class-level memory \(m_c^{(C)}\): Class-shared, storing class-specific features (such as morphological features of "cats"), with the quantity equal to the number of classes.
Instance-level memory \(m_{c,i}^{(I)}\): Unique to each image, storing information that distinguishes individuals, with the quantity determining the number of generated images.

The generation process for the \(i\)-th image of class \(c\):

\[x_{c,i} = D([f_c(m^{(\mathcal{D})}) \oplus m_c^{(C)} \oplus m_{c,i}^{(I)}])\]

where \(f_c\) is a class-specific feature extractor (extracting class-related features from the shared memory), \(D\) is a unified decoder (converting the concatenated memories into an image), and \(\oplus\) denotes the concatenation operation.
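The generation rule can be sketched in a few lines of plain Python. This is a toy stand-in, not the authors' implementation: `ToyHMN`, its dimensions, and the use of random linear maps in place of the networks \(f_c\) and \(D\) are all assumptions made for illustration.

```python
import random


def matvec(weights, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def rand_vec(n, rng):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


class ToyHMN:
    """Toy container: linear maps stand in for the networks f_c and D."""

    def __init__(self, num_classes, per_class, seed=0,
                 d_dim=8, f_dim=4, c_dim=4, i_dim=4, out_dim=12):
        rng = random.Random(seed)
        self.dataset_mem = rand_vec(d_dim, rng)  # shared by all images
        self.class_mem = [rand_vec(c_dim, rng) for _ in range(num_classes)]
        self.instance_mem = {(c, i): rand_vec(i_dim, rng)
                             for c in range(num_classes)
                             for i in range(per_class)}
        # f_c: one extractor per class; D: a single decoder for all classes.
        self.extractors = [[rand_vec(d_dim, rng) for _ in range(f_dim)]
                           for _ in range(num_classes)]
        self.decoder = [rand_vec(f_dim + c_dim + i_dim, rng)
                        for _ in range(out_dim)]

    def generate(self, c, i):
        """x_{c,i} = D([f_c(m_D) concat m_c concat m_{c,i}])."""
        shared = matvec(self.extractors[c], self.dataset_mem)
        latent = shared + self.class_mem[c] + self.instance_mem[(c, i)]
        return matvec(self.decoder, latent)

    def prune(self, c, i):
        """Drop one image by deleting only its instance-level memory."""
        del self.instance_mem[(c, i)]


hmn = ToyHMN(num_classes=3, per_class=2)
before = hmn.generate(0, 1)
hmn.prune(0, 0)
after = hmn.generate(0, 1)  # unaffected by pruning image (0, 0)
```

Because each image owns exactly one instance-level vector, deleting image `(0, 0)` leaves the output for image `(0, 1)` bit-for-bit unchanged in this sketch.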

Design Motivation: The hierarchical structure naturally aligns with the classification taxonomy: the dataset tier corresponds to "general visual features," the class tier to "class-discriminative features," and the instance tier to "individual variations." This separation improves storage efficiency (much of the information is stored once at the higher tiers and shared) and also provides a clean interface for subsequent pruning.

2. Over-Budget Distillation + Double-Ended Pruning¶

Function: First distills using a capacity exceeding the storage budget by \(p\%\), and then prunes the instance-level memories to return within the budget while boosting performance.

Mechanism: Measures the learning difficulty of each synthetic image using the AUM metric, and then performs "double-ended pruning":

\[M^{(t)}(\mathbf{x}, y) = z_y^{(t)}(\mathbf{x}) - \max_{i \neq y} z_i^{(t)}(\mathbf{x})\]

\[\text{AUM}(\mathbf{x}, y) = \frac{1}{T}\sum_{t=1}^{T} M^{(t)}(\mathbf{x}, y)\]

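Here \(z_i^{(t)}(\mathbf{x})\) is the logit for class \(i\) at training epoch \(t\), so a low AUM marks a hard sample and a high AUM an easy one. With \(k\) samples to prune in total, the optimal hard-sample pruning ratio \(\beta\) is determined via grid search:
- Prune the \(\lfloor\beta k\rfloor\) lowest-AUM (hardest) samples.
- Prune the \(k - \lfloor\beta k\rfloor\) highest-AUM (easiest) samples.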

The pruning quantity is kept balanced across each class.
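The AUM formulas and the double-ended rule translate directly into code. The following is a minimal sketch under stated assumptions: the function names, the dictionary of per-sample AUM scores, and the example values are hypothetical, and in practice the rule would be applied separately within each class to keep the pruning balanced.

```python
import math


def margin(logits, y):
    """Assigned-class logit minus the largest other logit."""
    largest_other = max(z for j, z in enumerate(logits) if j != y)
    return logits[y] - largest_other


def aum(logit_history, y):
    """Average margin over the recorded training epochs."""
    return sum(margin(logits, y) for logits in logit_history) / len(logit_history)


def double_end_prune(aum_by_id, k, beta):
    """Return the ids kept after pruning k samples from both ends of the AUM ranking."""
    if k > len(aum_by_id):
        raise ValueError("cannot prune more samples than exist")
    ranked = sorted(aum_by_id, key=aum_by_id.get)  # ascending AUM: hardest first
    n_hard = math.floor(beta * k)                  # lowest-AUM samples to drop
    n_easy = k - n_hard                            # highest-AUM samples to drop
    dropped = set(ranked[:n_hard]) | set(ranked[len(ranked) - n_easy:])
    return [sample_id for sample_id in aum_by_id if sample_id not in dropped]


# One sample of class 0 observed over two epochs: margins 1.0 and 2.0.
example_aum = aum([[2.0, 1.0, 0.0], [3.0, 0.5, 1.0]], y=0)

# Five samples from one class; prune k=2 with beta=0.5 (one hard, one easy).
scores = {"a": -0.5, "b": 0.2, "c": 1.0, "d": 2.5, "e": 4.0}
kept = double_end_prune(scores, k=2, beta=0.5)
print(example_aum, kept)  # 1.5 ['b', 'c', 'd']
```

Setting `beta=0` prunes only the easiest samples and `beta=1` only the hardest, so the grid search over \(\beta\) sweeps between those two extremes.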

Design Motivation: Following the idea of CCS, the authors treat both excessively easy and excessively hard data as detrimental to training. Very easy data is redundant (the classifier masters it quickly), while very hard data may be noise or outliers. Because HMN's instance-level memories are independent, pruning an image only requires removing its instance-level memory and leaves every other image unchanged. Factorization methods such as HaBa/LinBa cannot do this cleanly, because each basis vector contributes to several images.

3. Training Optimization¶

Function: Optimizes HMN using a low-GPU-memory batch-based loss.

Mechanism: Adopts gradient matching as the training loss:

\[\min_{\mathcal{S}} \mathbb{E}_{\theta_0 \sim P_{\theta_0}} \left[\sum_{t=0}^{T-1} d(\nabla_\theta \mathcal{L}(\theta_t, \mathcal{S}), \nabla_\theta \mathcal{L}(\theta_t, \mathcal{D}))\right]\]

which minimizes the distance between the gradients generated by synthetic data and real data on the model. Compared to trajectory-based losses (such as MTT), gradient matching requires substantially lower memory.
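To make the objective tangible, here is a toy version on a linear model with squared loss. Several choices are assumptions for illustration only: the distance \(d\) is taken to be one minus cosine similarity, the parameter snapshots \(\theta_t\) are supplied as a fixed list rather than produced by training, and the expectation over initializations is omitted.

```python
import math


def grad(theta, data):
    """Gradient of the mean of 0.5 * (theta . x - y)^2 over (x, y) pairs."""
    g = [0.0] * len(theta)
    for x, y in data:
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xi in enumerate(x):
            g[j] += err * xi / len(data)
    return g


def cosine_distance(g1, g2):
    """1 - cosine similarity; the small epsilon guards against zero norms."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    return 1.0 - dot / (norm1 * norm2 + 1e-12)


def matching_loss(theta_steps, synthetic, real):
    """Sum over parameter snapshots of the distance between the two gradients."""
    return sum(cosine_distance(grad(theta, synthetic), grad(theta, real))
               for theta in theta_steps)


real = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]
synthetic = [([1.0, 1.0], 3.0)]
thetas = [[0.0, 0.0], [0.5, 0.5]]

loss_mismatch = matching_loss(thetas, synthetic, real)
loss_identical = matching_loss(thetas, real, real)
```

The loss is (numerically) zero when the synthetic set reproduces the real gradients and positive otherwise; distillation would update the synthetic data to drive it down. Only one batch of each kind is held per step, which is where the memory saving over unrolled trajectories comes from.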

Design Motivation: Although trajectory-based losses typically yield better results, their GPU memory consumption is extremely high (LinBa even requires CPU offloading), which severely limits their practicality. HMN's hierarchical architecture achieves excellent performance even when combined with a simple batch-based loss, highlighting the importance of architectural design.

Loss & Training¶

Uses the gradient matching variant of IDC as the training loss.
Sets the over-budget rate to 10% (higher pruning rates significantly degrade accuracy).
Repeats data distillation 3 times for all experiments, repeating training 10 times after each distillation to report the mean and standard deviation.
Uses ConvNet (3-layer convolution + pooling) as the distillation and evaluation network.

Key Experimental Results¶

Main Results¶

Comparison of Methods on CIFAR10 (Test Accuracy %)

Method	Container Type	1 IPC	10 IPC	50 IPC
DC	Image	28.3	44.9	53.9
DSA	Image	28.8	52.1	60.6
DM	Image	26.0	48.9	63.0
MTT*	Image	46.3	65.3	71.6
IDC	Parameterized	50.0	67.5	74.5
HaBa*	Parameterized	48.3	69.9	74.0
LinBa*	Parameterized	66.4	71.2	73.6
HMN (Ours)	Parameterized	65.7	73.7	76.9

Note: Methods marked with * use a high-memory trajectory-based loss, while HMN uses a low-memory batch-based loss.

Performance Across Multiple Datasets

Dataset	IPC	HMN	Best Baseline	Gain/Gap
CIFAR100	1	36.3	34.0 (LinBa)	+2.3
CIFAR100	10	45.4	42.9 (LinBa)	+2.5
SVHN	1	87.4	87.3 (LinBa)	+0.1
SVHN	10	90.0	89.1 (LinBa)	+0.9
Tiny-ImageNet	1	19.4	16.0	+3.4
Tiny-ImageNet	10	24.4	23.2	+1.2
ImageNet-10	1	64.6	60.4	+4.2

Ablation Study¶

Comparison of Data Containers under Same Loss (CIFAR10, gradient matching)

Data Container	1 IPC	10 IPC	50 IPC
Original Image	36.7	58.3	69.5
IDC	50.0	67.5	74.5
HaBa	48.5	61.8	72.4
LinBa	62.0	67.8	70.7
HMN	65.7	73.7	76.9

Under fair comparison, HMN's advantage is even more pronounced: it outperforms the second-best container by 3.7, 5.9, and 2.4 percentage points at 1, 10, and 50 IPC, respectively.
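The margins over the runner-up can be recomputed from the ablation table. The snippet below only transcribes the table's numbers; the helper name `gap_to_runner_up` is introduced here for illustration.

```python
# Test accuracy (%) at 1 / 10 / 50 IPC, copied from the ablation table.
results = {
    "Original Image": (36.7, 58.3, 69.5),
    "IDC": (50.0, 67.5, 74.5),
    "HaBa": (48.5, 61.8, 72.4),
    "LinBa": (62.0, 67.8, 70.7),
    "HMN": (65.7, 73.7, 76.9),
}


def gap_to_runner_up(table, ours="HMN", ipcs=(1, 10, 50)):
    """For each IPC column, find the best other container and the gap to it."""
    gaps = []
    for col, ipc in enumerate(ipcs):
        name, best = max(
            ((n, acc[col]) for n, acc in table.items() if n != ours),
            key=lambda pair: pair[1],
        )
        gaps.append((ipc, name, round(table[ours][col] - best, 1)))
    return gaps


gaps = gap_to_runner_up(results)
print(gaps)  # [(1, 'LinBa', 3.7), (10, 'LinBa', 5.9), (50, 'IDC', 2.4)]
```

Note that the runner-up changes with the budget: LinBa is second at 1 and 10 IPC, while IDC is second at 50 IPC.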

Cross-Architecture Generalizability (CIFAR10, ConvNet Distillation → Other Architecture Evaluation)

Architecture	HMN (1/10/50)	IDC (1/10/50)	HaBa (1/10/50)
ConvNet	65.7/73.7/76.9	50.0/67.5/74.5	48.3/69.9/74.0
VGG16	58.5/64.3/70.2	28.7/43.1/57.9	34.1/53.8/61.1
ResNet18	56.8/62.9/69.1	32.3/45.1/58.4	36.0/49.0/60.4
DenseNet121	50.7/56.9/65.1	24.3/38.5/50.5	34.6/49.3/57.8

GPU Memory Comparison (CIFAR10)

IPC	HaBa (MTT)	LinBa (BPTT)	HMN (GM)
1	3368M	OOM	2680M
10	11148M	OOM	4540M
50	48276M	OOM	10426M
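The relative saving is easier to read as a ratio. The snippet below divides HMN's figure by HaBa's for each row of the table, keeping the table's unit ("M") as reported; LinBa is omitted because every entry is OOM.

```python
# (HaBa with MTT, HMN with gradient matching), values as listed in the table.
gpu_memory = {1: (3368, 2680), 10: (11148, 4540), 50: (48276, 10426)}

# HMN's memory as a percentage of HaBa's, per IPC setting.
ratios = {ipc: round(100 * hmn / haba, 1)
          for ipc, (haba, hmn) in gpu_memory.items()}

for ipc, pct in ratios.items():
    print(f"IPC {ipc}: HMN uses {pct}% of HaBa's GPU memory")
```

The ratio falls from roughly 80% at 1 IPC to about 41% at 10 IPC and about 22% at 50 IPC, so the saving grows with the storage budget.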

Key Findings¶

HMN matches or outperforms HaBa and LinBa (which use high-memory trajectory-based losses) while using only a low-memory batch-based loss; the one exception in the CIFAR10 table is 1 IPC, where LinBa leads by 0.7 points. This suggests that the container architecture matters at least as much as the training loss.
Under fair comparison (with the identical loss), HMN leads the second-best container by 2.4 to 5.9 percentage points.
HMN transfers across architectures far better than the other data parameterization methods (e.g., it outperforms IDC by nearly 30 percentage points on VGG16 at 1 IPC).
The GPU memory requirement is roughly 22% to 80% of HaBa's, while LinBa runs out of memory (OOM) at every IPC setting.
The size of instance-level memory needs to be carefully balanced: if it is too small, single-image information is insufficient; if it is too large, the number of generated images becomes too small, compromising diversity.
A 10% over-budget pruning rate is the optimal choice; higher pruning rates significantly degrade accuracy.

Highlights & Insights¶

Natural Alignment of Hierarchical Concepts: Directly encoding the hierarchical structure of the taxonomy into the architecture design of the data container is an intuitive and highly effective insight.
Restraint in Design: The authors experimented with more complex designs (such as class-specific decoders and additional intermediate networks) but found that a simpler design yielded better results, as overly complex parameters led to overfitting. This embodies the principle of "simplicity is key" in dataset distillation.
The "Over-distill-then-prune" Paradigm: Analogous to model over-parameterization followed by pruning, applying this concept to dataset distillation is highly novel.
High Practicality: Training time drops from 14 days for LinBa to 15 hours (on a 2080Ti), with a dramatic reduction in memory requirements.

Limitations & Future Work¶

Sole Use of Batch-Based Loss: Although HMN + batch loss proves highly effective, integration with trajectory-based losses could yield further improvements (unattempted due to GPU memory constraints).
Relatively Simple Pruning Strategy: The current double-ended pruning via AUM requires additional training to compute AUM values, and the optimal pruning parameter \(\beta\) relies on grid searching.
Insufficient Evaluation on Large-Scale Datasets: The largest benchmarks evaluated are Tiny-ImageNet and ImageNet-10; full-scale ImageNet remains unverified.
Fixed Decoder Design: The unified decoder applies the same decoding pathway to all classes, potentially limiting the expression of inter-class variance.
Reliance on a Single Distillation Network: The main experiments distill and evaluate with ConvNet; the cross-architecture experiments show a notable drop in accuracy on other architectures.

Relationship with HaBa/LinBa: These two works implement flat feature sharing through matrix factorization, whereas HMN extends this to hierarchical sharing.
Relationship with Coreset Selection: The pruning strategy adopts the double-ended pruning concept from CCS, applying data importance measurements to synthetic data.
Insights: The concept of hierarchical feature sharing can be generalized to other scenarios requiring compact data representations (such as data communication compression in federated learning and memory buffer designs in continual learning).

Rating¶

Novelty: ⭐⭐⭐⭐ — Both the Hierarchical Memory Network and the "over-budget distillation then pruning" paradigm are novel contributions, though the core idea (hierarchical sharing) is a natural generalization.
Experimental Thoroughness: ⭐⭐⭐⭐ — Experiments across multiple datasets, architectures, memory usage comparisons, and ablation studies are thorough, but verification on large-scale datasets is lacking.
Writing Quality: ⭐⭐⭐⭐ — The motivation and methodology are clearly articulated, and the charts are intuitive, although the table format is slightly cramped.
Value: ⭐⭐⭐⭐ — Significantly reduces the computational and memory overhead of dataset distillation, offering high practical value.