
CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation

Conference: ICCV 2025
arXiv: 2507.03539
Code: https://github.com/elenabbbuenob/CLOT
Area: Video Understanding / Action Segmentation
Keywords: Unsupervised Action Segmentation, Optimal Transport, Closed-Loop Learning, Encoder-Decoder, Sliced Wasserstein Distance

TL;DR

This paper proposes Closed Loop Optimal Transport (CLOT), a framework that jointly solves three OT problems through a three-level cyclic feature learning pipeline (frame embeddings → segment embeddings → cross-attention refined frame embeddings), establishing an explicit feedback loop between frame-level and segment-level representations to substantially improve boundary detection and clustering quality in unsupervised action segmentation.

Background & Motivation

Unsupervised action segmentation aims to annotate video frames with action categories without supervision, with important applications in sports, surveillance, and robotics. Existing methods fall into two categories:

Classical pipeline methods (CTE, VTE, etc.): learn frame representations first and then cluster, lacking feedback between representation learning and clustering.

OT-based methods (TOT, ASOT, etc.): leverage optimal transport to jointly learn action representations and pseudo-labels, enabling feedback via self-training.

Among these, ASOT achieves the best performance by making no assumption on action order and obtaining temporally consistent segmentations between frames and action labels via Gromov-Wasserstein OT. However, ASOT suffers from two issues: (a) an implicit segment-length prior that hinders detection of short-duration actions; and (b) the absence of explicit feedback between frame-level and segment-level representations, causing learned clusters to misalign with true segment boundaries.

On the other hand, HVQ improves short-action detection through hierarchical vector quantization, but its fixed codebook lacks the feedback mechanism of OT-based methods and generalizes poorly. CLOT is designed to combine the strengths of both: preserving the feedback capability of OT while strengthening segment-level consistency.

Method

Overall Architecture

CLOT comprises a three-level architecture (see Fig. 2 in the original paper):

- Level 1: MLP encoder + feature dispatching mechanism → frame embeddings \(F\) + pseudo-labels \(\mathbf{T}\) (via OT_1)
- Level 2: Parallel decoder → segment embeddings \(S\) + pseudo-labels \(\mathbf{T}_S\) (via OT_2)
- Level 3: Frame-segment cross-attention → refined frame embeddings \(F_R\) + pseudo-labels \(\mathbf{T}_R\) (via OT_3)

The three levels form a closed loop: frames → segments → refined frames, each level imposing OT constraints to generate pseudo-labels, enabling multi-granularity cyclic optimization.

Key Designs

  1. Feature Dispatching Encoder: An MLP maps input frame features \(X \in \mathbb{R}^{N \times D}\) to frame embeddings \(F \in \mathbb{R}^{N \times d}\). The feature dispatching mechanism is based on a learnable similarity function \(\phi(A,F) = \sigma(\beta + \alpha \cdot \frac{A \cdot F}{\|A\|\|F\|})\), where \(A\) denotes learnable action cluster embeddings and \(\alpha, \beta\) are learnable scale and shift parameters. Each frame is softly assigned to the most relevant clusters via these attention weights, and its representation is updated as \(f_i' = f_i + \frac{1}{K}\sum_{k=1}^{K} \phi(A_k, f_i) \cdot A_k\). This enables frame embeddings to dynamically adapt to the currently learned cluster structure, yielding more organized representations.
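The dispatching update can be sketched in a few lines of NumPy. The shapes are toy sizes, and \(\alpha\), \(\beta\), and the embeddings are fixed random stand-ins (in CLOT they are learned); this only illustrates the soft assign-and-update mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, K = 8, 16, 4            # frames, embedding dim, clusters (toy sizes)
F = rng.normal(size=(N, d))   # frame embeddings (stand-in for MLP output)
A = rng.normal(size=(K, d))   # action cluster embeddings (learnable in CLOT)
alpha, beta = 5.0, 0.0        # learnable scale/shift, fixed for this sketch

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# phi(A_k, f_i) = sigmoid(beta + alpha * cosine(A_k, f_i))
cos = (F @ A.T) / (np.linalg.norm(F, axis=1, keepdims=True)
                   * np.linalg.norm(A, axis=1)[None, :])
phi = sigmoid(beta + alpha * cos)          # (N, K) soft assignment weights

# dispatched update: f_i' = f_i + (1/K) * sum_k phi(A_k, f_i) * A_k
F_prime = F + (phi @ A) / K
```

Because the update is residual and weighted by \(\phi\), frames near a cluster are pulled toward it while distant frames are barely moved, which is what makes this a soft-clustering-guided representation update rather than a hard assignment.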

  2. Parallel Decoder: Adopts a query-based attention mechanism inspired by DETR, using learnable queries \(Q \in \mathbb{R}^{K' \times d_{dec}}\) as segment prototypes. Frame embeddings are decoded into segment embeddings \(S \in \mathbb{R}^{K' \times d}\) (\(K' \leq K\)) via multi-head cross-attention and self-attention. Unlike autoregressive decoders, the parallel decoder predicts all segments simultaneously, avoiding error accumulation.
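A minimal single-head sketch of the parallel decoding idea follows; the real decoder stacks multi-head cross- and self-attention with learned projections, so the dimensions, weights, and single attention step here are illustrative stand-ins only:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, K_seg = 8, 16, 3           # frames, dim, number of segment queries K'
F = rng.normal(size=(N, d))      # frame embeddings from the encoder
Q = rng.normal(size=(K_seg, d))  # learnable segment queries (DETR-style)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Single-head cross-attention: every query attends over all frames at once,
# so all K' segment embeddings are predicted in parallel (no autoregression).
attn = softmax(Q @ F.T / np.sqrt(d))  # (K', N) attention over frames
S = attn @ F                          # (K', d) segment embeddings
```

The key property is that `S` depends only on the queries and the full frame sequence, not on previously decoded segments, which is why errors cannot accumulate across decoding steps.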

  3. Cross-Attention Refinement: Injects segment-level structural information back into frame embeddings: \(F_R = F + \text{softmax}(\frac{FS^\top}{\tau \cdot \sqrt{d}})S\). This is the key to closing the loop—allowing frame representations to be structurally adjusted according to segment embeddings, ensuring that frame-level details are aligned with the temporal segmentation process.
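The refinement step can be written out directly from the formula; the temperature \(\tau\) and the embeddings below are arbitrary stand-ins chosen for the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, K_seg = 8, 16, 3
F = rng.normal(size=(N, d))      # frame embeddings
S = rng.normal(size=(K_seg, d))  # segment embeddings from the decoder
tau = 0.1                        # temperature (assumed value)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# F_R = F + softmax(F S^T / (tau * sqrt(d))) S : residual cross-attention
# that injects segment-level structure back into every frame embedding.
W = softmax(F @ S.T / (tau * np.sqrt(d)))  # (N, K') frame-to-segment weights
F_R = F + W @ S
```

The residual form means refinement can only add segment context on top of the original frame detail, so frame-level information is preserved while being nudged toward the segment structure.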

  4. Sliced Wasserstein Distance: Introduced as a complement to cosine distance in the cost matrix. SWD projects high-dimensional distributions onto one-dimensional subspaces via random projections: \(\text{SWD}_p(x_i, a_j) = (\frac{1}{M}\sum_{m=1}^M d(R_{\theta_m\#}x_i, R_{\theta_m\#}a_j))^{1/p}\). The cost matrix is defined as \(\mathbf{C}_{ij}^{sw} = 1 + \text{SWD}(x_i, a_j) - \mathbf{C}_{i,j}^k\), combining the SWD with the visual cost. Using \(p=1\) yields a closed-form solution, ensuring computational efficiency.
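A sketch of SWD for \(p=1\) between two equally sized point sets, using the sorting-based closed form of the 1-D Wasserstein distance on each random slice; the number of projections `M`, the point sets, and the seed are illustrative choices, not values from the paper:

```python
import numpy as np

def sliced_wasserstein(X, Y, M=64, seed=0):
    """SWD_1 between two equally sized point sets X, Y in R^d.

    Projects both sets onto M random unit directions; for p = 1 the 1-D
    Wasserstein distance has a closed form: mean |sorted(Xp) - sorted(Yp)|.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    theta = rng.normal(size=(M, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # unit directions
    Xp = np.sort(X @ theta.T, axis=0)  # (n, M) sorted 1-D projections
    Yp = np.sort(Y @ theta.T, axis=0)
    return np.abs(Xp - Yp).mean()      # average over samples and slices

rng = np.random.default_rng(3)
X = rng.normal(size=(32, 16))
Y = rng.normal(loc=2.0, size=(32, 16))  # shifted copy of the distribution
d_same = sliced_wasserstein(X, X)       # ~0: identical sets
d_diff = sliced_wasserstein(X, Y)       # > 0: shifted distribution
```

Each slice only requires a sort, so the whole computation is \(O(M\,n\log n)\), which is the efficiency argument behind preferring SWD over full high-dimensional OT in the cost matrix.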

Loss & Training

The OT formulation is unbalanced and fuses two sub-problems, a Kantorovich OT (KOT) term and a Gromov-Wasserstein (GW) term: \(\min_\mathbf{T} \alpha \mathcal{F}_{GW} + (1-\alpha) \mathcal{F}_{KOT} + \lambda \, KL(\mathbf{T}^\top \mathbf{1}_n \| \nu)\), where the KL penalty relaxes the column-marginal constraint and thus allows flexible, non-uniform label assignment. The training objective is the sum of cross-entropy losses across all three levels: \(\mathcal{L}_{train} = \mathcal{L}(\mathbf{T}, \mathbf{P}) + \mathcal{L}(\mathbf{T}_S, \mathbf{P}_S) + \mathcal{L}(\mathbf{T}_R, \mathbf{P}_R)\), where \(\mathbf{P}_{ij} = \text{softmax}(FA^\top / \tau)_{ij}\).
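Each per-level term can be sketched as a soft cross-entropy between OT pseudo-labels and the softmax predictions. The one-hot pseudo-labels below are faked stand-ins for an actual OT solution (solving the unbalanced fused OT problem is out of scope for this sketch), and \(\tau\) is an assumed value:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, K = 8, 16, 4
F = rng.normal(size=(N, d))   # frame embeddings
A = rng.normal(size=(K, d))   # action cluster embeddings
tau = 0.1                     # temperature (assumed value)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

P = softmax(F @ A.T / tau)                 # (N, K) predictions

# Stand-in pseudo-labels: in CLOT these come from solving the OT problem;
# a random hard assignment is used here purely for illustration.
T = np.eye(K)[rng.integers(0, K, size=N)]  # (N, K) one-hot transport rows

# Cross-entropy L(T, P) averaged over frames; the full objective sums this
# term over the three levels (frames, segments, refined frames).
loss = -(T * np.log(P + 1e-12)).sum(axis=1).mean()
```

Because each level contributes the same loss form against its own pseudo-labels, gradients from all three granularities flow back through the shared encoder, which is what closes the loop during training.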

Key Experimental Results

Main Results (Activity-level, Hungarian Matching)

| Dataset | Metric | ASOT | HVQ | CLOT | Gain |
|---|---|---|---|---|---|
| Breakfast | MoF | 56.1 | 54.4 | 60.1 | +4.0 |
| Breakfast | F1 | 38.3 | 39.7 | 40.1 | +0.4 |
| YTI | MoF | 52.9 | 50.3 | 54.4 | +1.5 |
| 50Salads(Eval) | F1 | 53.6 | - | 63.2 | +9.6 |
| 50Salads(Eval) | mIoU | 30.1 | - | 38.8 | +8.7 |
| DA | F1 | 68.0 | - | 72.6 | +4.6 |

Ablation Study

| Configuration | Breakfast MoF | 50Salads(Eval) F1 | DA F1 | Note |
|---|---|---|---|---|
| CLOT | 60.1 | 63.2 | 72.6 | Full model |
| w/o SWD | 59.8 | 52.7 | 62.3 | Removing SWD causes a large F1 drop |
| w/o FD | 51.9 | 51.9 | 68.3 | Removing feature dispatching drops Breakfast MoF by 8.2 points |
| w/o Decoder | 58.0 | 52.7 | 68.4 | No segment-level OT; F1 drops noticeably |
| w/o Refinement | 59.7 | 52.6 | 68.2 | No cross-attention refinement; closed loop broken |

Key Findings

  • Closed-loop refinement is the most critical component: removing either the decoder or the cross-attention refinement drops 50Salads(Eval) F1 from 63.2 to roughly 52.7, indicating that segment-level feedback is essential for boundary detection.
  • Feature dispatching contributes most on Breakfast (MoF drops 8.2), where the diversity of activity categories and temporal complexity make structured representations especially important.
  • SWD yields significant gains on 50Salads and DA (removing it drops F1 by 10.5 and 10.3 points, respectively), confirming that it provides a more robust distance measure in high-dimensional embedding spaces.
  • Under video-level evaluation, CLOT achieves MoF 66.3 on Breakfast (vs. 63.3 for ASOT) and F1 69.7 on 50Salads(Eval) (vs. 58.9 for ASOT), with even more pronounced improvements.

Highlights & Insights

  • The closed-loop "frames → segments → refined frames" three-level cycle is the core contribution, enabling representations at different granularities to mutually reinforce each other rather than propagating information unidirectionally.
  • The feature dispatching mechanism realizes "soft-clustering-guided representation updates," offering greater flexibility compared to hard clustering.
  • Using Sliced Wasserstein Distance in place of cosine distance for constructing the OT cost matrix exploits SWD's advantage in preserving the geometric structure of distributions.
  • The parallel decoder design avoids the error accumulation inherent in autoregressive decoders.

Limitations & Future Work

  • The number of action categories \(K\) must be specified in advance, which may be difficult to determine in practice.
  • As with ASOT, computing OT incurs non-trivial overhead, potentially limiting scalability to very long videos.
  • Video-level and activity-level results are not always consistent, suggesting room for improvement in cross-video generalization.
  • The strategy for selecting the decoder query count \(K'\) is not sufficiently discussed in the codebase.
Relation to Prior Work

  • Extends the unbalanced GW-OT framework of ASOT by adding segment-level OT and refinement OT.
  • The parallel decoder is inspired by DETR-style parallel decoding from the action anticipation literature (e.g., FUTR).
  • SWD has been validated in point cloud processing and color transfer; this work is the first to apply it to OT cost matrices for action segmentation.
  • Unlike TOT and UFSA, CLOT requires no prior on action order, making it more general.

Rating

  • Novelty: ⭐⭐⭐⭐ The multi-level cyclic closed-loop OT design is novel, and the orchestration of three OT problems is well-motivated
  • Experimental Thoroughness: ⭐⭐⭐⭐ Four benchmarks, two evaluation protocols, and detailed ablations
  • Writing Quality: ⭐⭐⭐⭐ Clear structure; Fig. 2 conveys rich architectural information
  • Value: ⭐⭐⭐⭐ Meaningful advancement in unsupervised action segmentation, achieving state-of-the-art on most datasets