Condensing Action Segmentation Datasets via Generative Network Inversion¶
Conference: CVPR 2025
Institution: National University of Singapore (NUS)
Subject: Dataset Distillation / Temporal Action Segmentation
Background & Motivation¶
Background¶
Temporal action segmentation assigns each frame of a long video to its action category and is an important task in video understanding. Training high-performance segmentation models, however, requires large amounts of long video data with dense frame-by-frame annotations, which incurs substantial storage and transmission costs.
Limitations of Prior Work:
- Massive Data Scale: Action segmentation datasets typically contain long continuous videos. For example, the Breakfast dataset contains approximately 28GB of feature data, and 50Salads about 4.5GB.
- Extremely High Annotation Cost: An action label is required for every frame of every video, which is significantly more expensive than image classification annotation.
- Privacy and Distribution Issues: In certain fields (such as healthcare and industry), the original video data may contain sensitive information and cannot be freely distributed.
- Limitations of Existing Distillation Methods: Image-level dataset distillation methods (e.g., Dataset Distillation) are difficult to apply directly to temporal data, since action segmentation involves long-sequence temporal structure and action transition patterns.
- Temporal Consistency: Distilled data must preserve the temporal structure of actions; simple frame-level sampling destroys action continuity and transition patterns.
The key insight of this paper is that generative models can be utilized to learn data distributions, and network inversion can then be employed to find compact latent representations that represent the entire dataset.
Method¶
Overall Architecture¶
The proposed method consists of two core phases: (1) training a TCA generative model to learn the data distribution; (2) optimizing the latent codes through network inversion to find compact representations that can replace the original dataset.
TCA Generative Model¶
TCA (Temporally Coherent Action) is a generative model specifically designed for action segmentation data, capable of generating temporally coherent action sequences.
Model Architecture:
The generator \(G\) takes the latent code \(z\), the action label sequence \(a\), and the conditional information \(c\) as input, and generates the corresponding feature sequence \(\hat{x}\):

\[ \hat{x} = G(z, a, c) \]
Temporal Consistency Design:
TCA ensures that the generated action sequences are temporally coherent through the following mechanisms:
- The action label sequence \(a\) provides frame-level action guidance
- The conditional information \(c\) encodes the global action transition patterns
- Temporal convolutions are used inside the generator to maintain local temporal continuity
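To make the interface above concrete, here is a minimal sketch of a conditional generator that emits one feature vector per frame label and then applies a temporal convolution. It is an illustration only, not the TCA architecture: `toy_generator`, `class_embed`, the additive conditioning, and the smoothing kernel are all assumptions introduced for this example.

```python
import random

def temporal_conv(seq, kernel):
    """1D temporal convolution over a list of feature vectors, with edge clamping."""
    half = len(kernel) // 2
    T = len(seq)
    out = []
    for t in range(T):
        acc = [0.0] * len(seq[0])
        for k, w in enumerate(kernel):
            # Clamp the index so border frames reuse the first/last frame.
            src = min(max(t + k - half, 0), T - 1)
            for d, v in enumerate(seq[src]):
                acc[d] += w * v
        out.append(acc)
    return out

def toy_generator(z, a, c, class_embed, kernel=(0.25, 0.5, 0.25)):
    """Hypothetical G(z, a, c): returns one feature vector per frame label in `a`."""
    # Assumed conditioning: class embedding (from a) + latent code z + condition c.
    frames = [
        [e + zi + ci for e, zi, ci in zip(class_embed[label], z, c)]
        for label in a
    ]
    # The temporal convolution blends neighbouring frames at action boundaries.
    return temporal_conv(frames, kernel)

rng = random.Random(0)
dim = 4
class_embed = {k: [rng.gauss(0, 1) for _ in range(dim)] for k in range(3)}
z = [rng.gauss(0, 1) for _ in range(dim)]
c = [0.0] * dim
a = [0, 0, 0, 1, 1, 1, 2, 2]  # frame-level action labels
x_hat = toy_generator(z, a, c, class_embed)
```

Because the kernel sums to one, frames in the interior of a segment are left unchanged, while frames next to a label change become a weighted mix of the two actions. That is one simple way local temporal continuity can arise.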
Network Inversion and Latent Code Optimization¶
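The core step is to optimize the latent code so that the decoded output reconstructs the original data:

\[ z^* = \arg\min_{z} \; \mathcal{L}_{\text{rec}}\big(D(z),\, x\big) \]

where \(D\) is the decoder, \(x\) is the original data, and \(\mathcal{L}_{\text{rec}}\) denotes a reconstruction loss whose exact form is not given in this summary. The optimized latent code \(z^*\) is much more compact than the original data: each sample is stored as a single low-dimensional vector.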
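The inversion step can be sketched as gradient descent on the latent code while the decoder stays fixed. The example below is a toy under stated assumptions: the decoder is a small fixed linear map `W` rather than a neural network, and the loss is squared error, which this summary does not confirm as the paper's choice.

```python
def decode(W, z):
    """Frozen toy decoder D(z) = W z (a linear stand-in for the real network)."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in W]

def invert(W, x, latent_dim, steps=200, lr=0.1):
    """Gradient descent on z to minimise ||D(z) - x||^2; W is never updated."""
    z = [0.0] * latent_dim
    for _ in range(steps):
        residual = [p - t for p, t in zip(decode(W, z), x)]
        # Gradient of ||W z - x||^2 with respect to z is 2 W^T (W z - x).
        grad = [
            2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(latent_dim)
        ]
        z = [zj - lr * g for zj, g in zip(z, grad)]
    return z

# 4-dimensional "features" generated from a 2-dimensional latent code.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
z_true = [0.7, -1.3]
x = decode(W, z_true)           # the "original data" sample
z_star = invert(W, x, latent_dim=2)
```

Only `z_star` (two numbers here) needs to be stored; the four-dimensional sample is recovered on demand as `decode(W, z_star)`.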
Diverse Sequence Sampling¶
To ensure that the distilled dataset covers the diversity of the original data, this paper designs a diversity selection strategy based on edit distance and farthest point sampling:
| Strategy | Objective | Method |
|---|---|---|
| Edit Distance | Measure structural differences between action sequences | Calculate the edit distance between action label sequences |
| Farthest Point Sampling | Select the most scattered samples in the latent space | Greedily select the point furthest from the already selected samples |
| Class Balancing | Ensure sufficient coverage for each action category | Allocate sampling quotas proportionally according to action categories |
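The first two rows of the table can be sketched together as follows. This is a minimal illustration, assuming farthest point sampling is run directly over edit distances between label sequences; how the paper combines the two (and the class-balancing step) is not specified in this summary, and the action names are made up for the example.

```python
def edit_distance(s, t):
    """Levenshtein distance between two action label sequences."""
    prev = list(range(len(t) + 1))
    for i, si in enumerate(s, start=1):
        curr = [i]
        for j, tj in enumerate(t, start=1):
            cost = 0 if si == tj else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def farthest_point_sampling(items, k, dist, start=0):
    """Greedily pick up to k indices, each farthest from the already chosen set."""
    selected = [start]
    # min_dist[i] = distance from item i to its nearest selected item.
    min_dist = [dist(items[start], x) for x in items]
    while len(selected) < min(k, len(items)):
        nxt = max(range(len(items)), key=lambda i: min_dist[i])
        if min_dist[nxt] == 0:
            break  # only duplicates of selected items remain
        selected.append(nxt)
        min_dist = [min(m, dist(items[nxt], x)) for m, x in zip(min_dist, items)]
    return selected

sequences = [
    ["pour_milk", "stir", "drink"],
    ["pour_milk", "stir", "drink"],
    ["pour_milk", "drink"],
    ["crack_egg", "fry", "add_salt", "serve"],
    ["crack_egg", "fry", "serve"],
]
picked = farthest_point_sampling(sequences, k=3, dist=edit_distance)
print(picked)  # [0, 3, 2]
```

Starting from sequence 0, the sampler first jumps to the structurally most different sequence (index 3) and skips the exact duplicate at index 1, which is the behaviour a diversity-oriented selection is meant to have.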
Training After Distillation¶
An action segmentation model is then trained on the distilled compact dataset as follows:
- Randomly sample from the stored latent codes
- Decode into feature sequences through the frozen generator
- Train the segmentation model using the decoded features and corresponding labels
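The three steps above can be sketched as a data loop that decodes training pairs on the fly. Everything named here is hypothetical: `frozen_generator` is a trivial stand-in for the trained generator, and the `store` layout (a latent code paired with its label sequence) is an assumption about how the distilled data might be kept.

```python
import random

def frozen_generator(z, labels):
    """Hypothetical stand-in for the frozen generator: one feature vector per frame."""
    return [[zi + float(label) for zi in z] for label in labels]

def distilled_batches(store, generator, num_steps, seed=0):
    """Yield (features, labels) pairs decoded on the fly from stored latent codes."""
    rng = random.Random(seed)
    for _ in range(num_steps):
        z, labels = rng.choice(store)      # 1. sample a stored latent code
        features = generator(z, labels)    # 2. decode with the frozen generator
        yield features, labels             # 3. pass the pair to the segmentation trainer

# Distilled dataset: (latent code, frame-level label sequence) per sample.
store = [
    ([0.1, -0.2], [0, 0, 1, 1, 2]),
    ([0.5, 0.3], [2, 2, 2, 0]),
]
batches = list(distilled_batches(store, frozen_generator, num_steps=4))
```

Decoding at training time means the full feature sequences never need to be written to disk; only the latent codes, labels, and generator weights are kept.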
Key Experimental Results¶
Compression Ratio¶
| Dataset | Original Size | Distilled Size | Compression Ratio |
|---|---|---|---|
| Breakfast | 28GB | 44MB | 636× |
| 50Salads | 4.5GB | 3.9MB | ~1150× |
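The ratios in the table follow from dividing the original size by the distilled size. The short check below assumes decimal units (1GB = 1000MB) and sizes of 28GB vs. 44MB for Breakfast and 4.5GB vs. 3.9MB for 50Salads.

```python
def compression_ratio(original_bytes, distilled_bytes):
    """Ratio of original storage to distilled storage."""
    return original_bytes / distilled_bytes

GB, MB = 10**9, 10**6  # decimal units assumed

breakfast = compression_ratio(28 * GB, 44 * MB)
salads = compression_ratio(4.5 * GB, 3.9 * MB)
print(round(breakfast), round(salads))  # 636 1154
```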
Segmentation Performance¶
| Dataset | Training Data | MS-TCN Accuracy |
|---|---|---|
| Breakfast | Distilled Data | 55.5% |
| Breakfast | Full Data | 67.2% |
| 50Salads | Distilled Data | 74.4% |
| 50Salads | Full Data | 80.6% |
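Training on the distilled data costs 11.7 points of accuracy on Breakfast (55.5% vs. 67.2%) and 6.2 points on 50Salads (74.4% vs. 80.6%). In other words, it retains roughly 83% and 92% of full-data accuracy, respectively, at compression ratios of 636× and about 1150×.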
Ablation Study¶
- The temporal consistency design of the TCA generative model is critical; generic generative models produce unnatural jumps at action transitions in the generated sequences.
- The diversity sampling strategy significantly outperforms random sampling, confirming the importance of covering the diversity of the data distribution.
- The dimension of the latent code controls a trade-off between accuracy and compression ratio.
Summary & Outlook¶
This paper is the first to apply dataset distillation to temporal action segmentation, achieving extreme data compression through the TCA generative model and network inversion. It reaches a 636× compression ratio on Breakfast and approximately 1150× on 50Salads while retaining respectable segmentation performance (55.5% and 74.4% MS-TCN accuracy, respectively). The method opens a new direction for data-efficient video understanding research and could be extended to other sequential tasks and larger-scale datasets in the future.