CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation¶
Conference: ICCV 2025 arXiv: 2507.03539 Code: https://github.com/elenabbbuenob/CLOT Area: Video Understanding / Action Segmentation Keywords: Unsupervised Action Segmentation, Optimal Transport, Closed-Loop Learning, Encoder-Decoder, Sliced Wasserstein Distance
TL;DR¶
This paper proposes Closed Loop Optimal Transport (CLOT), a framework that jointly solves three OT problems through a three-level cyclic feature-learning pipeline (frame embeddings → segment embeddings → cross-attention-refined frame embeddings). The resulting explicit feedback loop between frame-level and segment-level representations substantially improves boundary detection and clustering quality in unsupervised action segmentation.
Background & Motivation¶
Unsupervised action segmentation aims to annotate video frames with action categories without supervision, with important applications in sports, surveillance, and robotics. Existing methods fall into two categories:
Classical pipeline methods (CTE, VTE, etc.): learn frame representations first and then cluster, lacking feedback between representation learning and clustering.
OT-based methods (TOT, ASOT, etc.): leverage optimal transport to jointly learn action representations and pseudo-labels, enabling feedback via self-training.
Among these, ASOT achieves the best performance by making no assumption on action order and obtaining temporally consistent segmentations between frames and action labels via Gromov-Wasserstein OT. However, ASOT suffers from two issues: (a) an implicit segment-length prior that hinders detection of short-duration actions; and (b) the absence of explicit feedback between frame-level and segment-level representations, causing learned clusters to misalign with true segment boundaries.
On the other hand, HVQ improves short-action detection through hierarchical vector quantization, but its fixed codebook lacks the feedback mechanism of OT-based methods and generalizes poorly. CLOT is designed to combine the strengths of both: preserving the feedback capability of OT while strengthening segment-level consistency.
Method¶
Overall Architecture¶
CLOT comprises a three-level architecture (see Fig. 2 in the original paper): - Level 1: MLP encoder + feature dispatching mechanism → frame embeddings \(F\) + pseudo-labels \(\mathbf{T}\) (via OT_1) - Level 2: Parallel decoder → segment embeddings \(S\) + pseudo-labels \(\mathbf{T}_S\) (via OT_2) - Level 3: Frame-segment cross-attention → refined frame embeddings \(F_R\) + pseudo-labels \(\mathbf{T}_R\) (via OT_3)
The three levels form a closed loop: frames → segments → refined frames, each level imposing OT constraints to generate pseudo-labels, enabling multi-granularity cyclic optimization.
Key Designs¶
- Feature Dispatching Encoder: An MLP maps input frame features \(X \in \mathbb{R}^{N \times D}\) to frame embeddings \(F \in \mathbb{R}^{N \times d}\). The feature dispatching mechanism is based on a learnable similarity function \(\phi(A,F) = \sigma(\beta + \alpha \cdot \frac{A \cdot F}{\|A\|\|F\|})\), where \(A\) denotes learnable action cluster embeddings. Each frame is assigned to the most relevant clusters via attention weights, and its representation is updated as \(f_i' = f_i + \frac{1}{K}\sum_{k=1}^{K} \phi(A_k, f_i) \cdot A_k\). This lets frame embeddings dynamically adapt to the currently learned cluster structure, yielding more organized representations.
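A minimal numpy sketch of this dispatching update (the dimensions, \(\alpha\), \(\beta\), and random features are illustrative stand-ins, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N, d, K = 8, 16, 5            # frames, embed dim, action clusters (toy sizes)
F = rng.normal(size=(N, d))   # frame embeddings from the MLP encoder
A = rng.normal(size=(K, d))   # learnable action-cluster embeddings
alpha, beta = 1.0, 0.0        # learnable scale/shift in the paper; fixed here

# phi(A_k, f_i) = sigmoid(beta + alpha * cosine(A_k, f_i)) for all pairs (i, k)
cos = (F / np.linalg.norm(F, axis=1, keepdims=True)) @ \
      (A / np.linalg.norm(A, axis=1, keepdims=True)).T   # (N, K)
phi = sigmoid(beta + alpha * cos)

# f_i' = f_i + (1/K) * sum_k phi(A_k, f_i) * A_k, vectorized over all frames
F_prime = F + (phi @ A) / K
```

Each frame moves toward the cluster embeddings it is most similar to, so the update is a soft, differentiable form of cluster assignment.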
- Parallel Decoder: Adopts a query-based attention mechanism inspired by DETR, using learnable queries \(Q \in \mathbb{R}^{K' \times d_{dec}}\) as segment prototypes. Frame embeddings are decoded into segment embeddings \(S \in \mathbb{R}^{K' \times d}\) (\(K' \leq K\)) via multi-head cross-attention and self-attention. Unlike autoregressive decoders, the parallel decoder predicts all segments simultaneously, avoiding error accumulation.
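The parallel read-out can be sketched with a single attention head (the paper uses multi-head cross- and self-attention; this single-head simplification is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

N, d, Kq = 20, 16, 4          # frames, embed dim, number of queries K'
F = rng.normal(size=(N, d))   # frame embeddings from the encoder
Q = rng.normal(size=(Kq, d))  # learnable segment-prototype queries

# Every query attends over all frames in one shot, so the K' segment
# embeddings are produced in parallel -- no autoregressive steps.
attn = softmax(Q @ F.T / np.sqrt(d))  # (K', N), rows sum to 1
S = attn @ F                          # (K', d) segment embeddings
```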
- Cross-Attention Refinement: Injects segment-level structural information back into frame embeddings: \(F_R = F + \text{softmax}(\frac{FS^\top}{\tau \cdot \sqrt{d}})S\). This step is what closes the loop: frame representations are structurally adjusted according to the segment embeddings, keeping frame-level details aligned with the temporal segmentation process.
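A numpy sketch of the refinement step (the temperature \(\tau\) here is an assumed value, and the inputs are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

N, d, Kq = 20, 16, 4
F = rng.normal(size=(N, d))   # frame embeddings
S = rng.normal(size=(Kq, d))  # segment embeddings from the parallel decoder
tau = 0.1                     # temperature (assumed value)

# F_R = F + softmax(F S^T / (tau * sqrt(d))) S: a residual cross-attention
# that writes segment-level structure back into every frame embedding.
F_R = F + softmax(F @ S.T / (tau * np.sqrt(d))) @ S
```

The residual form means refinement can only add segment context on top of the frame embedding, never overwrite it, which keeps the frame-level detail intact.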
- Sliced Wasserstein Distance: Introduced as a complement to cosine distance in the cost matrix. SWD projects high-dimensional distributions onto one-dimensional subspaces via random projections: \(\text{SWD}_p(x_i, a_j) = (\frac{1}{M}\sum_{m=1}^M d(R_{\theta_m\#}x_i, R_{\theta_m\#}a_j))^{1/p}\). The cost matrix is defined as \(\mathbf{C}^{sw}_{ij} = 1 + \text{SWD}(x_i, a_j) - \mathbf{C}^{k}_{ij}\), combining the SWD with the visual cost \(\mathbf{C}^{k}\). Using \(p=1\) yields a closed-form solution, ensuring computational efficiency.
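A sketch of SWD between two empirical distributions with uniform weights, using the \(p=1\) closed form (1-D Wasserstein reduces to sorting); the paper's exact projection scheme and inputs may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def swd(X, Y, n_proj=64, rng=rng):
    """Sliced Wasserstein-1 distance between empirical distributions
    X (n, d) and Y (n, d) with uniform weights. For p = 1, each 1-D
    Wasserstein distance is the mean |sorted(proj X) - sorted(proj Y)|."""
    d = X.shape[1]
    theta = rng.normal(size=(d, n_proj))
    theta /= np.linalg.norm(theta, axis=0, keepdims=True)  # unit directions
    Xp, Yp = X @ theta, Y @ theta                          # (n, n_proj)
    Xp.sort(axis=0)
    Yp.sort(axis=0)                                        # 1-D OT = sorting
    return np.abs(Xp - Yp).mean()

X = rng.normal(size=(100, 16))
d_self = swd(X, X)          # identical samples -> zero distance
d_shift = swd(X, X + 2.0)   # a shifted copy -> positive distance
```

Sorting makes each 1-D sub-problem O(n log n), which is why the \(p=1\) slicing is cheap compared to solving a full high-dimensional OT.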
Loss & Training¶
The OT formulation is unbalanced, integrating two sub-problems, KOT and GW: \(\min_\mathbf{T} \alpha \mathcal{F}_{GW} + (1-\alpha) \mathcal{F}_{KOT} - \lambda KL(\mathbf{T}^\top \mathbf{1}_n \| \nu)\), where the KL divergence penalty allows flexible non-uniform label assignment. The training objective is the sum of cross-entropy losses across all three levels: \(\mathcal{L}_{train} = \mathcal{L}(\mathbf{T}, \mathbf{P}) + \mathcal{L}(\mathbf{T}_S, \mathbf{P}_S) + \mathcal{L}(\mathbf{T}_R, \mathbf{P}_R)\), where \(P_{ij} = \text{softmax}(FA^\top / \tau)_{ij}\).
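One of the three cross-entropy terms can be sketched as follows (the pseudo-labels here are a degenerate stand-in; in CLOT, \(\mathbf{T}\), \(\mathbf{T}_S\), \(\mathbf{T}_R\) come from solving OT_1–OT_3):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ce(T, P, eps=1e-12):
    """Cross-entropy between OT pseudo-labels T and predictions P,
    both row-stochastic over the K action classes."""
    return -(T * np.log(P + eps)).sum(axis=1).mean()

N, d, K = 10, 16, 4
tau = 0.1
F = rng.normal(size=(N, d))   # frame embeddings
A = rng.normal(size=(K, d))   # action-cluster embeddings
P = softmax(F @ A.T / tau)    # P_ij = softmax(F A^T / tau)_ij

# Stand-in pseudo-labels (normally the OT transport plan, row-normalized):
T = P.copy()
loss = ce(T, P)               # one term; L_train sums this over all 3 levels
```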
Key Experimental Results¶
Main Results (Activity-level, Hungarian Matching)¶
| Dataset | Metric | ASOT | HVQ | CLOT | Gain |
|---|---|---|---|---|---|
| Breakfast | MoF | 56.1 | 54.4 | 60.1 | +4.0 |
| Breakfast | F1 | 38.3 | 39.7 | 40.1 | +0.4 |
| YTI | MoF | 52.9 | 50.3 | 54.4 | +1.5 |
| 50Salads(Eval) | F1 | 53.6 | - | 63.2 | +9.6 |
| 50Salads(Eval) | mIoU | 30.1 | - | 38.8 | +8.7 |
| DA | F1 | 68.0 | - | 72.6 | +4.6 |
Ablation Study¶
| Configuration | Breakfast MoF | 50Salads(Eval) F1 | DA F1 | Note |
|---|---|---|---|---|
| CLOT | 60.1 | 63.2 | 72.6 | Full model |
| w/o SWD | 59.8 | 52.7 | 62.3 | Removing SWD causes large F1 drop |
| w/o FD | 51.9 | 51.9 | 68.3 | Removing feature dispatching drops Breakfast MoF by 8.2 points |
| w/o Decoder | 58.0 | 52.7 | 68.4 | No segment-level OT, F1 drops noticeably |
| w/o Refinement | 59.7 | 52.6 | 68.2 | No cross-attention refinement, closed loop broken |
Key Findings¶
- Closed-loop refinement is the most critical component: removing the Decoder or the Refinement drops 50Salads F1 from 63.2 to 52.7 and 52.6, respectively, indicating that segment-level feedback is essential for boundary detection.
- Feature dispatching contributes most on Breakfast (MoF drops 8.2 points), where the diversity of activity categories and temporal complexity make structured representations especially important.
- SWD yields significant gains on 50Salads and DA (F1 drops of 10.5 and 10.3, respectively), validating its more robust distance measurement in high-dimensional spaces.
- Under video-level evaluation, CLOT achieves MoF 66.3 on Breakfast (vs. 63.3 for ASOT) and F1 69.7 on 50Salads(Eval) (vs. 58.9 for ASOT), with even more pronounced improvements.
Highlights & Insights¶
- The closed-loop "frames → segments → refined frames" three-level cycle is the core contribution, enabling representations at different granularities to reinforce one another rather than propagating information unidirectionally.
- The feature dispatching mechanism realizes "soft-clustering-guided representation updates," offering greater flexibility compared to hard clustering.
- Using Sliced Wasserstein Distance in place of cosine distance for constructing the OT cost matrix exploits SWD's advantage in preserving the geometric structure of distributions.
- The parallel decoder design avoids the error accumulation inherent in autoregressive decoders.
Limitations & Future Work¶
- The number of action categories \(K\) must be specified in advance, which may be difficult to determine in practice.
- As with ASOT, computing OT incurs non-trivial overhead, potentially limiting scalability to very long videos.
- Video-level and activity-level results are not always consistent, suggesting room for improvement in cross-video generalization.
- The strategy for selecting the decoder query count \(K'\) is not sufficiently discussed in the codebase.
Related Work & Insights¶
- Extends the unbalanced GW-OT framework of ASOT by adding segment-level OT and refinement OT.
- The parallel decoder is inspired by DETR-style designs from the action prediction literature (e.g., FUTR).
- SWD has been validated in point cloud processing and color transfer; this work is the first to apply it to OT cost matrices for action segmentation.
- Unlike TOT and UFSA, CLOT requires no prior on action order, making it more general.
Rating¶
- Novelty: ⭐⭐⭐⭐ The multi-level cyclic closed-loop OT design is novel, and the orchestration of three OT problems is well-motivated
- Experimental Thoroughness: ⭐⭐⭐⭐ Four benchmarks, two evaluation protocols, and detailed ablations
- Writing Quality: ⭐⭐⭐⭐ Clear structure; Fig. 2 conveys rich architectural information
- Value: ⭐⭐⭐⭐ Meaningful advancement in unsupervised action segmentation, achieving state-of-the-art on most datasets