Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection¶
Conference: CVPR 2026 arXiv: 2603.24030 Code: N/A Area: Video Understanding Keywords: open-vocabulary temporal action detection, chain-of-thought prompting, action phase decomposition, cross-modal alignment, knowledge transfer
TL;DR¶
This paper proposes the Phase-wise Decomposition and Alignment (PDA) framework, which leverages the chain-of-thought (CoT) reasoning capability of LLMs to decompose action labels into start–middle–end phase descriptions. Through text-guided foreground filtering and adaptive phase-wise alignment, PDA achieves fine-grained action pattern transfer, attaining an Avg mAP of 46.9 on THUMOS14 OV-TAD and surpassing the previous SOTA Ti-FAD (41.2).
Background & Motivation¶
Background: Open-vocabulary temporal action detection (OV-TAD) requires localizing and classifying unseen action categories, with knowledge transfer from seen categories as the central challenge.
Limitations of Prior Work: Existing methods perform only label-level global text–visual alignment, making it difficult to capture fine-grained temporal patterns shared across different actions. For instance, "LongJump" and "PoleVault" exhibit low label-level similarity, yet their run-up and take-off phases are visually highly similar.
Key Challenge: Label-level semantic alignment fails to discover transferable visual patterns across categories, limiting generalization to unseen classes.
Goal: To extract and transfer shared phase-level visual priors across different actions for improved open-vocabulary generalization.
Key Insight: Simulating human cognition—understanding an action as a sequential unfolding (initiation → execution → completion)—by exploiting the CoT capability of LLMs to automatically decompose actions into multiple phases.
Core Idea: Decompose action labels into phase descriptions → perform text–visual alignment independently for each phase → adaptively aggregate the alignment results across phases.
Method¶
Overall Architecture¶
Three core modules: CSD (CoT-Prompted Semantic Decomposition) → TIF (Text-Infused Foreground Filtering) → APA (Adaptive Phase-wise Alignment). Input videos are processed by a visual encoder for feature extraction; action labels are decomposed by GPT-4o into four phase descriptions—start, middle, end, and global—with each phase undergoing independent visual–text matching followed by adaptive aggregation.
Key Designs¶
- CoT-Prompted Semantic Decomposition (CSD): GPT-4o's CoT reasoning is employed to decompose each action label into four phase descriptions: {start, middle, end, global}. For example, "LongJump" yields: start = "accelerating along the runway," middle = "planting and taking off," end = "landing in the sandpit." The CLIP text encoder extracts phase embeddings \(t_c^p = \Phi_{txt}(s_c^p)\). Design Motivation: Label-level semantics cannot express phase patterns shared across categories, whereas phase decomposition naturally exposes these transferable knowledge structures.
- Text-Infused Foreground Filtering (TIF): For each phase \(p\), the cosine similarity between the phase text embedding and video features is computed; a max-then-Softmax operation produces a phase-level foreground confidence score \(S_{fg}^p\), which is binarized to filter phase-relevant video segments: \(F_v^p = \hat{S}_{fg}^p \cdot F_v\). Design Motivation: Naively partitioning video into uniform segments cannot handle real-world scenarios with multiple actions and variable durations; semantic phase-aware segment selection is required.
- Adaptive Phase-wise Alignment (APA): Cross-attention fusion is performed independently for each phase, \(\bar{F}_v^p = \text{CrossAttn}(F_v^p, F_t^p)\), followed by classification score computation \(C_{cls}^p = \bar{F}_v^p \cdot F_t^{p\top}\). Adaptive aggregation uses a Sigmoid network to predict per-phase weights \(\omega_p\), yielding the final classification \(C_{cls} = \sum_{p} \omega_p \cdot C_{cls}^p\). Design Motivation: The discriminability of each phase varies across actions—some actions are identifiable from the beginning while others require the ending—making adaptive weighting preferable to simple averaging.
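As a concrete illustration, the TIF step can be sketched in NumPy. Shapes and names here are assumptions, and the max-then-Softmax confidence step is collapsed into a direct cosine-similarity score; the mean-similarity threshold follows the paper's technical details. This is a sketch, not the authors' implementation:

```python
import numpy as np

def tif_filter(F_v: np.ndarray, t_p: np.ndarray):
    """Text-Infused Foreground Filtering (sketch).

    F_v: (T, D) per-snippet video features from the visual encoder.
    t_p: (D,)  CLIP embedding of one phase description.
    Returns the masked features F_v^p and the binary foreground mask.
    """
    # cosine similarity between the phase text and each video snippet
    sim = (F_v @ t_p) / (
        np.linalg.norm(F_v, axis=1) * np.linalg.norm(t_p) + 1e-8
    )
    # binarize with the mean similarity as threshold (as stated in the paper)
    mask = (sim >= sim.mean()).astype(F_v.dtype)
    # keep only phase-relevant snippets: F_v^p = S_fg-hat^p * F_v
    return mask[:, None] * F_v, mask
```

Because the threshold is the per-video mean, at least one snippet always survives for each phase, so downstream cross-attention never receives an all-zero input.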
Loss & Training¶
- Total loss: \(\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{fg} + \mathcal{L}_{loc}\), comprising classification (cross-entropy), foreground awareness, and DIoU localization losses.
- At inference, the same LLM-based phase decomposition is applied to test categories, and SoftNMS is used for redundancy suppression.
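The DIoU term adapts Distance-IoU to 1-D temporal segments. A minimal NumPy sketch, assuming segments are (start, end) pairs and using the standard 1-D analogue (IoU minus a center-distance penalty normalized by the enclosing span); illustrative, not the authors' code:

```python
import numpy as np

def diou_loss_1d(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """1-D DIoU localization loss for temporal segments (sketch).

    pred, gt: (N, 2) arrays of (start, end) times.
    """
    inter = np.clip(
        np.minimum(pred[:, 1], gt[:, 1]) - np.maximum(pred[:, 0], gt[:, 0]),
        0.0, None,
    )
    union = (pred[:, 1] - pred[:, 0]) + (gt[:, 1] - gt[:, 0]) - inter
    iou = inter / np.clip(union, 1e-8, None)
    # squared center distance, normalized by the enclosing span (DIoU penalty)
    c_pred = 0.5 * (pred[:, 0] + pred[:, 1])
    c_gt = 0.5 * (gt[:, 0] + gt[:, 1])
    enclose = np.maximum(pred[:, 1], gt[:, 1]) - np.minimum(pred[:, 0], gt[:, 0])
    penalty = (c_pred - c_gt) ** 2 / np.clip(enclose ** 2, 1e-8, None)
    return 1.0 - iou + penalty  # 0 for a perfect match, larger when far apart
```

Unlike a plain IoU loss, the distance penalty still provides a gradient signal when a predicted segment is entirely disjoint from the ground truth.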
Key Experimental Results¶
Main Results (THUMOS14, 50% Seen / 50% Unseen)¶
| Method | 0.3 | 0.5 | 0.7 | Avg mAP |
|---|---|---|---|---|
| Ti-FAD (NeurIPS'24) | 57.0 | 43.3 | 21.2 | 41.2 |
| STOV (WACV'25) | 56.3 | 34.4 | 11.3 | 34.0 |
| PDA (Ours) | 65.4 | 49.7 | 24.3 | 46.9 |
ActivityNet v1.3¶
| Method | 0.5 | 0.75 | Avg mAP |
|---|---|---|---|
| Ti-FAD | 50.6 | 32.2 | 32.0 |
| PDA (Ours) | 53.1 | 35.3 | 34.6 |
Ablation Study¶
| Configuration | Avg mAP | Notes |
|---|---|---|
| Global alignment baseline | ~41.2 | Label-level alignment only |
| + CSD | Improved | Phase decomposition exposes transferable patterns |
| + CSD + TIF | Further improved | Adaptive foreground filtering vs. static temporal partitioning |
| + CSD + TIF + APA | 46.9 | Adaptive weighting outperforms average aggregation |
Key Findings¶
- On the THUMOS14 50/50 split, PDA achieves a 5.7-point improvement in Avg mAP over the strongest baseline, Ti-FAD.
- In the cross-category transfer case of LongJump → PoleVault, phase decomposition enables the model to identify shared "run-up acceleration" and "take-off" patterns, substantially improving detection performance on unseen categories.
- Adaptive aggregation demonstrates greater flexibility compared to simple averaging.
Highlights & Insights¶
- CoT reasoning is extended from NLP to action understanding: rather than mere text augmentation, this constitutes structured temporal decomposition that directly corresponds to the cognitive process of action understanding.
- Phase decomposition naturally exposes transferable cross-category knowledge that label-level methods cannot access.
- TIF's text-guided foreground filtering outperforms static temporal partitioning and handles multi-action and variable-duration scenarios.
- Evaluation under the 75/25 split confirms the method's robustness across different seen/unseen ratios.
- On THUMOS14, mAP at IoU@0.5 improves from 43.3 to 49.7 (+6.4 points), indicating that fine-grained alignment also enhances localization precision.
Limitations & Future Work¶
- The method relies on GPT-4o for phase decomposition, incurring high cost, and decomposition quality is bounded by the LLM's action knowledge.
- The fixed decomposition into three temporal phases (start/middle/end, plus a global description) may lack flexibility for certain action types (e.g., cyclic actions).
- Adaptive determination of the number of phases remains unexplored.
- The quality of phase description encoding by the CLIP text encoder may become a bottleneck.
- Validation on larger-scale video datasets (e.g., Kinetics) has not been conducted.
- CoT decomposition quality may vary considerably across different LLMs.
Related Work & Insights¶
- Distinction from DeTAL and Ti-FAD: These methods employ global alignment or simple text augmentation, whereas the proposed framework achieves fine-grained knowledge transfer through structured phase decomposition.
- The application of CoT prompting to visual tasks is an emerging direction; this work demonstrates its potential for temporal understanding.
- The design of adaptive phase weights is generalizable to other tasks requiring multi-granularity alignment.
- Under the 75/25 split, Avg mAP improves from 42.9 (Ti-FAD) to 47.3, demonstrating generalization across different ratios.
Technical Details¶
- GPT-4o Prompt: "Decompose the action of ⟨Action⟩ into coherent three phases based on the natural temporal progression"
- Phase text template: 'a video of people's motion that [Description]'
- Cross-attention fusion: \(\bar{F}_v^p = \text{Softmax}(\frac{Q(F_v^p)K(F_t^p)^\top}{\sqrt{D}})V(F_t^p)\)
- Adaptive weights: \(\omega_p = \text{Sigmoid}(W_p(F_v^p))\), allowing different phases to carry different importance across actions
- Localization branch: Concatenation of all phase visual features → MLP projection → foreground-aware head + regression head
- Inference: Test categories are similarly decomposed into phases by LLM; SoftNMS is applied for redundancy removal
- 75/25 split results: THUMOS14 Avg mAP 47.3 (vs. Ti-FAD 42.9); ActivityNet Avg mAP 36.6 (vs. DeTAL 25.5)
- Training objective: \(\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{fg} + \mathcal{L}_{loc}\), combining classification, foreground awareness, and DIoU localization
- Phase set: \(\mathcal{P} = \{start, middle, end, glob\}\), comprising 4 phases
- Foreground binarization threshold: The mean similarity across all temporal positions serves as the binarization threshold
- Compatible visual encoders: Compatible with standard visual encoders such as CLIP ViT-B/16
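The cross-attention and adaptive-weight formulas above can be combined into a small NumPy sketch for one phase. Identity Q/K/V projections and a random matrix standing in for the learned weight network \(W_p\) are simplifying assumptions, not the released code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def apa_phase_scores(F_v_p: np.ndarray, F_t_p: np.ndarray, W_p: np.ndarray):
    """One phase of Adaptive Phase-wise Alignment (sketch).

    F_v_p: (T, D) phase-filtered video features.
    F_t_p: (C, D) phase text embeddings for C categories.
    W_p:   (D, 1) stand-in for the learned phase-weight network.
    Returns (C_cls^p, omega_p).
    """
    D = F_v_p.shape[-1]
    # cross-attention fusion with identity Q/K/V projections (simplification)
    attn = softmax(F_v_p @ F_t_p.T / np.sqrt(D), axis=-1)
    F_bar = attn @ F_t_p                      # (T, D) fused visual features
    C_p = F_bar @ F_t_p.T                     # (T, C) per-phase class scores
    # adaptive weight omega_p = Sigmoid(W_p(F_v^p)), pooled to one scalar
    omega_p = float((1.0 / (1.0 + np.exp(-(F_v_p @ W_p)))).mean())
    return C_p, omega_p
```

The final classification then accumulates \(C_{cls} = \sum_p \omega_p \cdot C_{cls}^p\) over the four phases.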
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of CoT prompting and phase decomposition is novel and cognitively well-motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on two benchmarks (THUMOS14 and ActivityNet) under multiple split settings.
- Writing Quality: ⭐⭐⭐⭐ Motivation figures are intuitive, though the paper is notation-heavy.
- Value: ⭐⭐⭐⭐ Achieves significant gains on OV-TAD, though the task's application scope is relatively narrow.