FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition

Conference: ECCV 2024
arXiv: 2409.01448
Code: Available
Area: Video Understanding / Semi-Supervised Learning
Keywords: Fine-Grained Action Recognition, Pseudo-Labelling, Temporal Alignment, Semi-Supervised Learning, Metric Learning

TL;DR

This paper proposes the FinePseudo framework, which utilizes metric learning based on temporal alignability to improve pseudo-label quality. It represents the first systematic approach to semi-supervised fine-grained action recognition, significantly outperforming existing methods on four fine-grained datasets.

Background & Motivation

Fine-Grained Action Recognition (FGAR) is crucial in practical applications such as sports analysis, surgical videos, and AR/VR. Unlike coarse-grained actions (e.g., "playing guitar" vs. "throwing javelin"), differences between fine-grained actions (e.g., different types of diving) are manifested only in subtle variations of action phases—for example, among the three phases of takeoff, flight, and entry, a difference in only the entry posture can change the action category.

However, annotating fine-grained actions is extremely expensive, requiring experts to repeatedly watch videos to make accurate annotations. This makes semi-supervised learning a natural choice for FGAR. Nevertheless, existing semi-supervised video methods face two core challenges:

Scene Bias Dependency: Existing methods, including augmentation strategies such as token-mix and CutMix, rely primarily on scene context to distinguish actions. Because fine-grained actions typically occur in identical scenes (e.g., all dives take place at a diving platform), these strategies fail.

Insufficient Temporal Granularity: Video-level self-supervised methods (e.g., TimeBal) learn representations that lack frame-level phase information, which is critical for FGAR.

The authors conducted a key preliminary experiment: after extracting embeddings using a frame-level video encoder, they compared the capability of different distance metrics in distinguishing intra-class vs. inter-class fine-grained action pairs. The results indicated that:

  • Cosine distance after temporal pooling loses temporal details.
  • Frame-by-frame cosine distance cannot handle variations in the duration of different action phases.
  • Dynamic Time Warping (DTW) alignment distance allows phase-to-phase comparisons, significantly improving discriminative capability.

This finding had not previously been explored in the label-limited setting of FGAR.
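To make the contrast between the three metrics concrete, here is a minimal pure-Python sketch on toy data. The one-hot "phase" embeddings and all function names are illustrative assumptions, not the paper's encoder or code; the DTW shown is the classic (non-differentiable) recursion.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def pooled_distance(u, v):
    """Cosine distance after averaging frames over time (frame order is discarded)."""
    mean = lambda seq: [sum(col) / len(seq) for col in zip(*seq)]
    return cosine_distance(mean(u), mean(v))

def framewise_distance(u, v):
    """Mean cosine distance between frames at the same index (needs len(u) == len(v))."""
    return sum(cosine_distance(a, b) for a, b in zip(u, v)) / len(u)

def dtw_distance(u, v):
    """Classic DTW accumulated cost over a cosine-distance cost matrix."""
    n, m = len(u), len(v)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = cosine_distance(u[i - 1], v[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]

# Hypothetical one-hot phase embeddings: A = takeoff, B = flight, C = entry.
A, B, C = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
u = [A, A, B, C]  # long takeoff
v = [A, B, B, C]  # same phase order as u, but a long flight
w = [C, B, A, A]  # same frames as u in reversed phase order

print(pooled_distance(u, w))     # close to 0: pooling cannot see the reversed order
print(framewise_distance(u, v))  # 0.25: penalized only because phase durations differ
print(dtw_distance(u, v))        # 0.0: warping absorbs the duration difference
print(dtw_distance(u, w))        # large: the phase order genuinely differs
```

In this toy case, pooling treats a reversed sequence as identical, frame-wise comparison penalizes a harmless duration shift, and only the alignment distance separates the two situations.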

Method

Overall Architecture

FinePseudo is a co-training framework based on pseudo-labeling, consisting of two branches:

  • Action Encoder \(f_E\): Learns video-level high-level semantic features (action classification).
  • Alignability Encoder \(f_A\): A frame-level video encoder (VTN) that learns low-level intra-frame representations based on action phases.

The training workflow consists of three stages: (1) self-supervised pre-training on unlabeled data; (2) alignability-verification metric learning on labeled data; and (3) collaborative pseudo-label self-training.

Key Designs

  1. Alignability-Verification Metric Learning

Core Hypothesis: Fine-grained videos of the same class are more "alignable" than those of different classes.

For a video pair \(U, V\), frame-level embeddings \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^{T \times F}\) are extracted via \(f_A\) to construct a cost matrix \(\mathbb{C}(i,j) = h(\mathbf{u}(i), \mathbf{v}(j))\) (cosine distance). Then, softDTW is used to compute the differentiable alignment distance:

$D(\mathbf{u}, \mathbf{v}) = \mathbb{C}(i,j) + \gamma\text{-smooth-min}(\Pi_{\text{cost}}(i,j))$
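As a rough illustration, the widely used soft-DTW recursion replaces the hard minimum over the three predecessor cells with a smooth minimum, \(-\gamma \log \sum \exp(-x/\gamma)\). The sketch below assumes that standard form; the paper's exact formulation may differ, and the function names are hypothetical.

```python
import math

def soft_min(values, gamma):
    """Smooth minimum -gamma * log(sum(exp(-x / gamma))), computed stably."""
    finite = [x for x in values if x != float("inf")]
    if not finite:
        return float("inf")
    lo = min(finite)
    return lo - gamma * math.log(sum(math.exp(-(x - lo) / gamma) for x in finite))

def soft_dtw(cost, gamma=0.1):
    """Soft-DTW alignment value for a T_u x T_v cost matrix (list of lists)."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + soft_min(
                (acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]), gamma
            )
    return acc[n][m]

# Toy cost matrix with a zero-cost diagonal path.
toy_cost = [[0.0, 1.0], [1.0, 0.0]]
print(soft_dtw(toy_cost, gamma=0.01))  # approaches the hard DTW value of 0
print(soft_dtw(toy_cost, gamma=1.0))   # smoother, and can drop below the hard DTW value
```

Because the smooth minimum never exceeds the hard minimum, the soft-DTW value is a lower bound on classic DTW and can be negative; smaller \(\gamma\) brings it closer to the hard alignment cost.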

Based on this alignment distance, triplet loss is applied for metric learning:

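$\mathcal{L}_{AT} = \sum_{i=1}^{N} \left[D(\mathbf{v}^{(i)}, \mathbf{v}^{(j)}) - D(\mathbf{v}^{(i)}, \mathbf{v}^{(k)}) + m\right]_+$

where \(\mathbf{v}^{(i)}\) is the anchor, \(\mathbf{v}^{(j)}\) is a positive from the same class, \(\mathbf{v}^{(k)}\) is a negative from a different class, \(m\) is the margin, and \([\cdot]_+ = \max(0, \cdot)\) is the hinge. Negatives are selected by hard-negative mining.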
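A minimal sketch of a hinge triplet loss with hard-negative mining over a precomputed pairwise distance matrix follows. The mining rule (closest different-class sample per anchor), the margin value, and the toy distances are assumptions for illustration, not details confirmed by the paper.

```python
def triplet_loss_hard_negative(dist, labels, margin=0.5):
    """Sum of hinge triplet terms using each anchor's closest different-class sample."""
    total = 0.0
    for i, yi in enumerate(labels):
        negatives = [dist[i][k] for k, yk in enumerate(labels) if yk != yi]
        if not negatives:
            continue  # no negative available for this anchor
        hardest_negative = min(negatives)
        for j, yj in enumerate(labels):
            if j != i and yj == yi:
                total += max(0.0, dist[i][j] - hardest_negative + margin)
    return total

# Toy alignment distances between four videos from two classes.
labels = [0, 0, 1, 1]
dist = [
    [0.0, 0.2, 0.9, 0.4],
    [0.2, 0.0, 0.8, 0.7],
    [0.9, 0.8, 0.0, 0.3],
    [0.4, 0.7, 0.3, 0.0],
]
print(triplet_loss_hard_negative(dist, labels))  # about 0.7
```

Only anchors 0 and 3 contribute here, because their closest negative (distance 0.4) sits within the margin of their positive.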

  2. Learnable Alignability Score

The alignment distance \(D\) is mapped to \([0,1]\) by a learnable non-linear scaling function \(f_S\) followed by a sigmoid \(\varsigma\):

$S(\mathbf{u}, \mathbf{v}) = \varsigma(f_S(D(\mathbf{u}, \mathbf{v})))$

This score function is trained using binary cross-entropy, where \(y_A \in \{0, 1\}\) indicates whether the pair belongs to the same class:

$\mathcal{L}_{Score} = -[y_A \log(S) + (1-y_A)\log(1-S)]$

The overall optimization objective for alignability is \(\mathcal{L}_{AV} = \mathcal{L}_{AT} + \omega \mathcal{L}_{Score}\). This score provides better class discriminative power compared to the raw DTW distance.
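The score and its loss can be sketched as follows. The form of \(f_S\) is not given in this summary, so the one-hidden-unit tanh network below, its parameter values, and the averaging of \(\mathcal{L}_{Score}\) over pairs are all placeholder assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def alignability_score(d, w1=-1.0, b1=1.0, w2=4.0, b2=0.0):
    """S = sigmoid(f_S(d)); f_S is a hypothetical one-hidden-unit tanh network."""
    return sigmoid(w2 * math.tanh(w1 * d + b1) + b2)

def bce(score, y, eps=1e-12):
    """Binary cross-entropy for one pair; y = 1 for same-class, 0 otherwise."""
    score = min(max(score, eps), 1.0 - eps)
    return -(y * math.log(score) + (1 - y) * math.log(1.0 - score))

def alignability_loss(triplet_term, scores, targets, omega=1.0):
    """L_AV = L_AT + omega * L_Score, with L_Score averaged over pairs (assumption)."""
    score_term = sum(bce(s, y) for s, y in zip(scores, targets)) / len(scores)
    return triplet_term + omega * score_term

close = alignability_score(0.2)  # small alignment distance gives a high score
far = alignability_score(3.0)    # large alignment distance gives a low score
print(close, far)
print(alignability_loss(0.7, [close, far], [1, 0]))
```

With these placeholder parameters the score decreases monotonically with the alignment distance; in training, the parameters of \(f_S\) would be learned jointly with the encoder.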

  3. Collaborative Pseudo-Labeling

For each unlabeled video \(U\), predictions are obtained from both encoders:

  • \(\mathbf{p}_E\): Output directly by the classification head of \(f_E\).
  • \(\mathbf{p}_A\): Produced by a non-parametric classifier. The average alignability score \(\bar{S}_c\) between \(U\)'s embedding and the labeled samples of each class \(c\) is computed, then passed through a temperature-scaled softmax:

$\mathbf{p}_A(c) = \frac{\exp(\bar{S}_c / \tau)}{\sum_j \exp(\bar{S}_j / \tau)}$

The final prediction is \(\mathbf{p} = \mathbf{p}_A + \mathbf{p}_E\). Samples exceeding a confidence threshold \(\theta\) are assigned pseudo-labels and added to the training set. The two branches provide complementary information (video-level vs. alignability) to iteratively update the pseudo-labels.
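The fusion and thresholding steps can be sketched as below. Function names, the toy scores, and the threshold are hypothetical. Since the summary states \(\mathbf{p} = \mathbf{p}_A + \mathbf{p}_E\), the sketch thresholds the unnormalized sum, whose entries lie in \([0, 2]\); whether the original implementation renormalizes is not stated here. The sketch also assumes every class has at least one labeled sample.

```python
import math

def softmax(xs, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    hi = max(xs)
    exps = [math.exp((x - hi) / tau) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def alignability_prediction(scores, labels, num_classes, tau=0.1):
    """p_A: softmax over the mean alignability score to labeled samples of each class."""
    means = []
    for c in range(num_classes):
        class_scores = [s for s, y in zip(scores, labels) if y == c]
        means.append(sum(class_scores) / len(class_scores))
    return softmax(means, tau)

def collaborative_pseudo_label(p_e, p_a, theta):
    """Return the fused class index if its confidence exceeds theta, else None."""
    fused = [a + e for a, e in zip(p_a, p_e)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best if fused[best] > theta else None

# Alignability scores of one unlabeled video against five labeled videos.
labeled_classes = [0, 0, 1, 1, 2]
p_a = alignability_prediction([0.9, 0.8, 0.2, 0.1, 0.3], labeled_classes, num_classes=3)
confident = collaborative_pseudo_label([0.6, 0.3, 0.1], p_a, theta=1.5)

p_a_flat = alignability_prediction([0.5] * 5, labeled_classes, num_classes=3)
rejected = collaborative_pseudo_label([0.4, 0.35, 0.25], p_a_flat, theta=1.5)
print(confident, rejected)  # 0 None
```

The second video is left unlabeled because neither branch is confident, which is the behavior the threshold \(\theta\) is meant to enforce.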

Loss & Training

  • \(f_A\) is first pre-trained with GITDL in a self-supervised manner to learn inter-frame dynamics, and then trained with \(\mathcal{L}_{AV}\) on labeled data.
  • \(f_E\) is trained using the standard cross-entropy loss \(\mathcal{L}_{CE}\).
  • The self-training phase proceeds iteratively: generating collaborative pseudo-labels \(\rightarrow\) expanding the labeled set \(\rightarrow\) retraining both encoders.
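The iterative self-training phase follows a generic pattern that can be sketched independently of the encoders. In the sketch below, `train` and `predict` are hypothetical callables standing in for retraining both encoders and producing a fused prediction with a confidence; the nearest-mean toy model exists only to make the loop runnable.

```python
def self_train(labeled, unlabeled, train, predict, theta, rounds=3):
    """Iteratively move confidently predicted samples from `unlabeled` into `labeled`."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        remaining = []
        for x in unlabeled:
            label, confidence = predict(model, x)
            if confidence > theta:
                labeled.append((x, label))  # accept the pseudo-label
            else:
                remaining.append(x)
        if len(remaining) == len(unlabeled):
            break  # nothing new was labeled, so further rounds would not change anything
        unlabeled = remaining
    return labeled, unlabeled

# Toy stand-ins: 1-D points, a nearest-class-mean "model", and a distance-ratio confidence.
def train(labeled):
    sums = {}
    for x, y in labeled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def predict(means, x):
    ranked = sorted(means, key=lambda y: abs(x - means[y]))
    near, far = abs(x - means[ranked[0]]), abs(x - means[ranked[1]])
    return ranked[0], 1.0 - near / (near + far)

final_labeled, still_unlabeled = self_train(
    [(0.0, 0), (10.0, 1)], [1.0, 9.0, 5.0], train, predict, theta=0.8
)
print(final_labeled)     # the two seed samples plus (1.0, 0) and (9.0, 1)
print(still_unlabeled)   # [5.0] stays unlabeled: it is equidistant from both classes
```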

Key Experimental Results

Main Results

Across four fine-grained datasets using the R2plus1D-18 backbone:

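| Dataset | Label Ratio | FinePseudo Acc (%) | TimeBal (Prev. SOTA) Acc (%) | Gain |
|---|---|---|---|---|
| Diving48 | 5% | 20.9 | 15.8 | +5.1 |
| Diving48 | 10% | 37.6 | 33.7 | +3.9 |
| Diving48 | 20% | 60.4 | 56.3 | +4.1 |
| FineGym99 | 5% | 49.2 | 44.4 | +4.8 |
| FineGym99 | 10% | 69.9 | 65.9 | +4.0 |
| FineGym288 | 5% | 41.7 | 37.3 | +4.4 |
| FineDiving | 5% | 28.4 | 25.1 | +3.3 |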
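The Gain column is the absolute difference between the two accuracy columns. A short check, using only the numbers from the table above, reproduces it and summarizes its range:

```python
# (dataset, label ratio, FinePseudo accuracy, TimeBal accuracy), copied from the table.
rows = [
    ("Diving48", "5%", 20.9, 15.8),
    ("Diving48", "10%", 37.6, 33.7),
    ("Diving48", "20%", 60.4, 56.3),
    ("FineGym99", "5%", 49.2, 44.4),
    ("FineGym99", "10%", 69.9, 65.9),
    ("FineGym288", "5%", 41.7, 37.3),
    ("FineDiving", "5%", 28.4, 25.1),
]

# Round to one decimal to avoid floating-point noise in the subtraction.
gains = [round(fine_pseudo - time_bal, 1) for _, _, fine_pseudo, time_bal in rows]
print(gains)                               # [5.1, 3.9, 4.1, 4.8, 4.0, 4.4, 3.3]
print(min(gains), max(gains))              # 3.3 5.1
print(round(sum(gains) / len(gains), 2))   # 4.23
```

The gains in these seven settings range from 3.3 to 5.1 percentage points, with a mean of about 4.2.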

Using AIM-ViTB (with CLIP initialization) on Diving48:

| Method | 5% | 10% | 20% |
|---|---|---|---|
| SVFormer | 38.00 | 56.02 | 76.20 |
| TimeBal | 38.12 | 55.80 | 76.01 |
| FinePseudo | 43.02 | 60.79 | 80.02 |

Ablation Study

| Configuration | 10% Acc | 20% Acc | Description |
|---|---|---|---|
| Only \(f_E\) + PL | 33.40 | 54.00 | Baseline pseudo-labeling |
| Only \(f_A\) | 32.82 | 51.05 | Alignment encoder alone is insufficient |
| \(f_E\) + \(\mathcal{L}_{AT}\) only | 33.73 | 55.67 | Contribution of triplet loss |
| \(f_E\) + \(\mathcal{L}_{Score}\) only | 36.11 | 59.32 | Score loss contributes more |
| W/o SSL pre-training | 35.23 | 58.64 | SSL pre-training is helpful |
| Full FinePseudo | 37.64 | 60.40 | Joint collaboration of all components is optimal |

Key Findings

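  • FinePseudo consistently outperforms the previous state of the art (TimeBal) by roughly 3 to 5 percentage points in absolute accuracy across all four fine-grained datasets (gains range from +3.3 to +5.1 with the R2plus1D-18 backbone).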
  • It also exhibits competitive or slightly improved performance on coarse-grained datasets (K400, SSv2).
  • Under the open-world setting (where unlabeled data contains out-of-distribution classes), the non-parametric classifier effectively filters out unknown classes, demonstrating robustness.
  • The alignability score loss \(\mathcal{L}_{Score}\) contributes more than the triplet loss \(\mathcal{L}_{AT}\) when used individually, but their combination achieves the best performance.

Highlights & Insights

  • Elegant Core Insight: Extends "alignability" from traditional intra-class alignment assumptions to an inter-class discriminative metric specifically designed for the semi-supervised setting.
  • Strong Complementarity: Video-level semantic prediction and frame-level alignability prediction provide orthogonal sources of information.
  • Generality: The method is applicable to both fine-grained and coarse-grained action recognition.
  • Open-World Robustness: The non-parametric classifier is naturally suited for handling out-of-distribution categories.

Limitations & Future Work

  • The computational complexity of softDTW is \(O(T^2)\) per video pair, which may raise efficiency issues for long videos.
  • It relies on SSL pre-trained weights (TCLR/Kinetics400), making it sensitive to pre-training quality.
  • The alignability assumption may be less effective for unstructured actions (e.g., daily activities without explicit phases).
  • The impact of class imbalance in unlabeled data remains unexplored.
  • Unlike alignment methods like LAV (Learning by Aligning Videos), the "alignability verification" in this work does not assume that videos are necessarily alignable; instead, it learns to assess alignability difficulty.
  • The alignability score can serve as a general video similarity metric, with potential applications in retrieval, quality assessment, and other tasks.
  • The collaborative pseudo-labeling idea can be extended to other semi-supervised scenarios requiring complementary sources of information.

Rating

  • Novelty: ⭐⭐⭐⭐ (Introducing alignability into semi-supervised FGAR is an original and effective insight.)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Evaluation on 4 fine-grained and 2 coarse-grained datasets, comprehensive ablation, and a novel open-world setting.)
  • Writing Quality: ⭐⭐⭐⭐ (The motivation is clearly grounded in the preliminary experiments, and the argument flows logically.)
  • Value: ⭐⭐⭐⭐ (First systematic study on semi-supervised FGAR, with practical application value.)