Condensing Action Segmentation Datasets via Generative Network Inversion¶

Conference: CVPR 2025
Institution: National University of Singapore (NUS)
Subject: Dataset Distillation / Temporal Action Segmentation

Background & Motivation¶

Background¶

Temporal action segmentation assigns an action category to every frame of a long video and is a core task in video understanding. Training high-performing segmentation models, however, requires large amounts of long video data with dense frame-by-frame annotations, which makes the datasets expensive to store and transmit.

Limitations of Prior Work:

Massive Data Scale: Action segmentation datasets consist of long, continuous videos. For example, the Breakfast dataset takes up approximately 28GB of feature data, and the 50Salads dataset about 4.5GB.

Extremely High Annotation Cost: Action category annotations are required for every frame of the video, which is significantly more expensive than image classification annotations.

Privacy and Distribution Issues: In certain fields (such as healthcare and industry), original video data may contain sensitive information and cannot be freely distributed.

Limitations of Existing Distillation Methods: Image-level dataset distillation methods (e.g., Dataset Distillation) are difficult to apply directly to temporal data, since action segmentation involves long-sequence temporal structures and action transition patterns.

Temporal Consistency: Distilled data must maintain the temporal structure of actions; simple frame-level sampling destroys action continuity and transition patterns.

The key insight of this paper is that generative models can be utilized to learn data distributions, and network inversion can then be employed to find compact latent representations that represent the entire dataset.

Method¶

Overall Architecture¶

The proposed method consists of two core phases: (1) training a TCA generative model to learn the data distribution; (2) optimizing the latent codes through network inversion to find compact representations that can replace the original dataset.

TCA Generative Model¶

TCA (Temporally Coherent Action) is a generative model specifically designed for action segmentation data, capable of generating temporally coherent action sequences.

Model Architecture:

The generator \(G\) takes the latent code \(z\), the action label sequence \(a\), and the conditional information \(c\) as input, and generates the corresponding feature sequence:

\[x_{\text{gen}} = G(z, a, c)\]

Temporal Consistency Design:

TCA ensures that the generated action sequences are temporally coherent through the following mechanisms:

The action label sequence \(a\) provides frame-level action guidance
The conditional information \(c\) encodes the global action transition patterns
Temporal convolutions are used inside the generator to maintain local temporal continuity
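
To make the role of the temporal convolution concrete, here is a minimal sketch of a conditional generator. It is not the TCA architecture: `toy_generator`, the additive way the inputs are combined, and the fixed smoothing `kernel` are all assumptions chosen for illustration. It only shows how convolving along the time axis turns a hard label change into a gradual feature transition.

```python
def toy_generator(z, actions, c, kernel=(0.25, 0.5, 0.25)):
    """Toy stand-in for G(z, a, c); returns one feature vector per frame.

    z       : list[float], sequence-level latent code
    actions : list[int], frame-level action labels (the sequence a)
    c       : list[float], global conditioning vector, same length as z
    The additive combination below is an illustrative assumption.
    """
    dim = len(z)
    # Per-frame input: latent code plus condition, shifted by the frame's action id.
    frames = [[z[d] + c[d] + float(a) for d in range(dim)] for a in actions]

    # 1D convolution along the time axis with replicate padding at both ends.
    half = len(kernel) // 2
    num_frames = len(frames)
    out = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for k, w in enumerate(kernel):
                idx = min(max(t + k - half, 0), num_frames - 1)
                acc += w * frames[idx][d]
            row.append(acc)
        out.append(row)
    return out


# Four frames of action 0 followed by four frames of action 2.
actions = [0] * 4 + [2] * 4
x_gen = toy_generator([0.1, -0.3], actions, [0.0, 0.5])
```

In this toy setup the first feature channel moves through intermediate values around the label change instead of jumping directly from the action-0 level to the action-2 level.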

Network Inversion and Latent Code Optimization¶

The core step is to optimize the latent code to find representations that can reconstruct the original data:

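\[z^* = \arg\min_{z} \| G(z, a, c) - x \|^2\]

where \(G\) is the trained generator with its weights held fixed, \(x\) is the original feature sequence, and \(a\) and \(c\) are the label sequence and conditional information defined above. The optimized latent code \(z^*\) is far more compact than the original data: each sample is stored as a single low-dimensional vector.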
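
The inversion objective can be illustrated with a small sketch. The frozen network here is replaced by a toy linear map `W` (an assumption, as are the function names, step count, and learning rate); the real model is nonlinear and also conditioned on \(a\) and \(c\). The point is only that gradient descent updates the latent code while the network weights stay fixed.

```python
def decode(z, W):
    """Toy frozen network: x_hat = W z, with W given as a list of rows."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in W]


def invert(x, W, steps=500, lr=0.05):
    """Minimize ||W z - x||^2 over z by gradient descent; W is never updated."""
    z = [0.0] * len(W[0])
    for _ in range(steps):
        residual = [xh - xi for xh, xi in zip(decode(z, W), x)]
        # Gradient of the squared error with respect to z: 2 * W^T (W z - x)
        grad = [
            2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(z))
        ]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z


# 4-dimensional "features" generated from a 2-dimensional latent code.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
z_true = [0.5, -1.0]
x = decode(z_true, W)
z_star = invert(x, W)
```

Storing `z_star` (2 numbers) instead of `x` (4 numbers) is the toy analogue of the compression described above; the saving grows with the ratio between feature size and latent size.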

Diverse Sequence Sampling¶

To ensure that the distilled dataset covers the diversity of the original data, this paper designs a diversity selection strategy based on edit distance and farthest point sampling:

Strategy	Objective	Method
Edit Distance	Measure structural differences between action sequences	Calculate the edit distance between action label sequences
Farthest Point Sampling	Select the most scattered samples in the latent space	Greedily select the point furthest from the already selected samples
Class Balancing	Ensure sufficient coverage for each action category	Allocate sampling quotas proportionally according to action categories
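
The first two rows of the table can be sketched together. The example below assumes that farthest point sampling is run directly on edit distances between action label sequences; whether the paper measures distance this way or in latent space is not specified here, and the function names and sample sequences are hypothetical. Class balancing is left out for brevity.

```python
def edit_distance(a, b):
    """Levenshtein distance between two action label sequences."""
    prev = list(range(len(b) + 1))
    for i, ai in enumerate(a, start=1):
        curr = [i]
        for j, bj in enumerate(b, start=1):
            cost = 0 if ai == bj else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def farthest_point_sampling(seqs, k, start=0):
    """Greedily pick up to k indices, each farthest from the already selected set."""
    selected = [start]
    # Distance from every sequence to its nearest selected sequence.
    min_dist = [edit_distance(s, seqs[start]) for s in seqs]
    while len(selected) < min(k, len(seqs)):
        nxt = max(range(len(seqs)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [
            min(d, edit_distance(s, seqs[nxt])) for d, s in zip(min_dist, seqs)
        ]
    return selected


seqs = [
    ["pour", "stir", "drink"],
    ["pour", "stir", "drink"],          # exact duplicate of the first
    ["pour", "drink"],                  # one deletion away
    ["crack", "fry", "plate", "eat"],   # entirely different
]
picked = farthest_point_sampling(seqs, k=2)
print(picked)  # [0, 3]
```

With a budget of two, the duplicate and the near-duplicate are skipped in favor of the structurally most different sequence, which is the behavior the diversity strategy is after.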

Training After Distillation¶

An action segmentation model is then trained on the distilled dataset as follows:

Randomly sample from the stored latent codes
Decode into feature sequences through the frozen generator
Train the segmentation model using the decoded features and corresponding labels
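
The three steps above can be sketched as a small data pipeline. Everything named here is hypothetical: `decode` is a toy placeholder for the frozen generator, and the storage layout (one latent code plus its label sequence per video) is an assumption about what the distilled dataset contains.

```python
import random


def decode(z, labels):
    """Placeholder for the frozen generator: one toy feature vector per frame."""
    return [[zi + float(a) for zi in z] for a in labels]


def distilled_batches(store, num_batches, batch_size, seed=0):
    """Yield batches of (features, labels) decoded on the fly from latent codes."""
    rng = random.Random(seed)
    for _ in range(num_batches):
        # Step 1: randomly sample stored latent codes.
        batch = rng.sample(store, k=min(batch_size, len(store)))
        # Step 2: decode each code into a feature sequence.
        yield [(decode(z, labels), labels) for z, labels in batch]


# Assumed distilled dataset: (latent code, frame-level label sequence) pairs.
store = [
    ([0.1, 0.2], [0, 0, 1, 1]),
    ([0.3, -0.1], [2, 2, 2]),
    ([-0.4, 0.0], [1, 3]),
]

for batch in distilled_batches(store, num_batches=2, batch_size=2):
    for features, labels in batch:
        # Step 3: a real pipeline would run one training step of the
        # segmentation model on (features, labels) here.
        assert len(features) == len(labels)
```

Because features are regenerated from the codes on demand, only the latent codes, the labels, and the generator weights need to be stored or shared.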

Key Experimental Results¶

Compression Ratio¶

Dataset	Original Size	Distilled Size	Compression Ratio
Breakfast	28GB	44MB	636×
50Salads	4.5GB	3.9MB	~1150×
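
As a quick check of the table, the ratios can be recomputed from the sizes. The sketch assumes decimal units (1 GB = 1000 MB); that convention reproduces the reported 636× for Breakfast, whereas 1 GB = 1024 MB would give about 652×.

```python
def compression_ratio(original_gb, distilled_mb, mb_per_gb=1000):
    """Ratio of original to distilled size, with both expressed in MB."""
    return original_gb * mb_per_gb / distilled_mb


breakfast = compression_ratio(28, 44)
salads = compression_ratio(4.5, 3.9)
print(f"Breakfast: {breakfast:.0f}x, 50Salads: {salads:.0f}x")
# Breakfast: 636x, 50Salads: 1154x
```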

Segmentation Performance¶

Dataset	Method	MS-TCN Accuracy
Breakfast	Distilled Data	55.5%
Breakfast	Full Data	67.2%
50Salads	Distilled Data	74.4%
50Salads	Full Data	80.6%

The distilled data costs 11.7 accuracy points on Breakfast and 6.2 points on 50Salads, which means it retains roughly 83% and 92% of full-data accuracy respectively. That is a strong result given compression ratios of 636× and about 1150×.
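
The retained fraction of full-data accuracy follows directly from the table. A short sketch of the arithmetic (variable names are illustrative):

```python
# (distilled accuracy, full-data accuracy) in percent, from the table above
results = {"Breakfast": (55.5, 67.2), "50Salads": (74.4, 80.6)}

retention = {name: distilled / full for name, (distilled, full) in results.items()}
for name, r in retention.items():
    print(f"{name}: {r:.1%} of full-data accuracy retained")
# Breakfast: 82.6% of full-data accuracy retained
# 50Salads: 92.3% of full-data accuracy retained
```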

Ablation Study¶

The temporal consistency design of the TCA generative model is critical: replacing it with a generic generative model produces unnatural jumps at action transitions in the generated sequences.
The diversity sampling strategy significantly outperforms random sampling, which confirms that the distilled set needs to cover the diversity of the original data distribution.
The dimension of the latent code sets a trade-off between segmentation accuracy and compression ratio.

Summary & Outlook¶

This paper is the first to apply dataset distillation to temporal action segmentation, achieving extreme data compression by combining the TCA generative model with network inversion. It reaches a 636× compression ratio on Breakfast and approximately 1150× on 50Salads while retaining most of the full-data segmentation accuracy. The method opens a new direction for data-efficient video understanding and could be extended to other sequential tasks and larger-scale datasets in the future.