Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding¶
Conference: AAAI 2026 arXiv: 2511.20073 Code: https://github.com/zhao-jinghan/TSS-unfolding Area: LLM Pretraining Keywords: Procedural video understanding, state grounding, hierarchical learning, progressive pretraining, video representation
TL;DR¶
This paper proposes a Task-Step-State (TSS) three-level semantic framework that introduces "state" as a visual grounding layer within the conventional task-step hierarchy, and designs a progressive pretraining strategy following a U-shaped path (Task→Step→State→Step→Task) to unfold the TSS hierarchy stage by stage. The approach achieves comprehensive state-of-the-art performance on task recognition, step recognition, and step forecasting tasks on the COIN and CrossTask datasets.
Background & Motivation¶
Understanding and executing goal-directed procedural activities (e.g., operational steps in instructional videos) is a core capability for intelligent agents. Existing methods learn procedural video representations by aligning visual content with textual descriptions at the task/step level—for instance, using task names from wikiHow (e.g., "Make Orange Juice") and step descriptions (e.g., "Cut the orange") as supervision signals.
However, this two-level approach has critical limitations:

- Task and step descriptions are highly abstract, making it difficult to form robust alignments with specific, observable details in visual data.
- There is a substantial semantic gap between abstract instructions such as "Cut the orange" and the raw pixels that actually depict the action.
- During visual-text alignment training, this gap prevents the model from grounding abstract procedures in actually visible content.
Key Challenge: The semantic gap between abstract text and concrete visual observations. The authors' key insight is to introduce "state"—textual snapshots of object configurations (e.g., "the orange is no longer whole; the flesh is exposed")—as a semantic grounding layer that anchors abstract procedures to what the model can actually observe.
From a logical perspective, states constitute the skeleton of any procedural task: a task is a macro-level transition from an initial state to a final state, and steps are the actions that drive intermediate state transitions. This forms the three-level Task-Step-State framework.
Method¶
Overall Architecture¶
The method comprises two core contributions:

1. TSS Framework Construction: An LLM generates pre-, mid-, and post-state descriptions for each step, forming a three-level semantic structure.
2. Progressive Pretraining Strategy: Stage-by-stage training along a U-shaped path: Task→Step→State→Step→Task.
Key Designs¶
- Task-Step-State Framework Construction:
  - Function: Augments the conventional two-level task-step structure with a "state" layer, associating three types of states with each step \(s_{i,j}\):
    - Pre-state \(c_{i,j}^b\): Object configuration before the action begins.
    - Mid-state \(c_{i,j}^m\): Object configuration during the action.
    - Post-state \(c_{i,j}^a\): Object configuration after the action completes.
  - Mechanism: GPT-4o-mini with chain-of-thought (CoT) prompting automatically generates state texts from wikiHow task-plus-step descriptions, constructing a rich three-level knowledge base.
  - Scale: 1,053 tasks, 10,588 steps, and 31,764 state descriptions.
  - Design Motivation: State descriptions capture observable object configurations (e.g., shape, attributes, spatial relations), whose semantic distance to visual content is far smaller than that of abstract step descriptions.
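The state-generation step can be sketched as prompt construction for the LLM (the paper uses GPT-4o-mini with CoT prompting). The prompt wording and the `build_state_prompt` helper below are illustrative assumptions, not the authors' actual prompts:

```python
# Hypothetical sketch of state-text generation for one wikiHow step.
# The prompt wording is an assumption; only the pre/mid/post-state structure
# comes from the paper.

def build_state_prompt(task: str, step: str) -> str:
    """Build a chain-of-thought prompt asking for pre/mid/post states of a step."""
    return (
        f'Task: "{task}"\n'
        f'Step: "{step}"\n'
        "Think step by step about the objects involved in this step.\n"
        "Describe the observable object configurations (shape, attributes,\n"
        "spatial relations) at three moments:\n"
        "1. Pre-state: before the action begins.\n"
        "2. Mid-state: while the action is happening.\n"
        "3. Post-state: after the action completes.\n"
    )

prompt = build_state_prompt("Make Orange Juice", "Cut the orange")
# Each (task, step) pair yields three state texts, consistent with the
# reported scale: 10,588 steps -> ~31,764 state descriptions.
```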
- Pseudo-Label Generation:
  - Function: Generates training supervision signals through video-text feature alignment.
  - Mechanism:
    - A frozen S3D visual encoder and a Sentence-BERT text encoder extract video and text features, respectively.
    - Text features are aggregated and clustered (10,038 step nodes; approximately 9,000–10,000 state nodes).
    - Video clips are matched to text nodes via cosine similarity, with the top-3 nodes selected as multi-label (multi-hot) pseudo-labels.
    - Five pseudo-label types: TaskVNM, StepVNM, StateVNM, StepNRL (Node Relation Learning), and StepTCL (Task Context Learning).
  - Design Motivation: Leverages large-scale weak supervision from pretrained encoders, eliminating the need for manual temporal annotations.
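The top-3 matching rule above can be sketched in a few lines; feature dimensions and node counts below are toy assumptions, not the paper's values:

```python
# Minimal sketch of the pseudo-labeling rule: match a clip feature to its
# top-3 text nodes by cosine similarity and emit a multi-hot target vector.
import numpy as np

def top3_pseudo_labels(clip_feat: np.ndarray, node_feats: np.ndarray) -> np.ndarray:
    """Return a multi-hot vector with 1s at the 3 most similar text nodes."""
    clip = clip_feat / np.linalg.norm(clip_feat)
    nodes = node_feats / np.linalg.norm(node_feats, axis=1, keepdims=True)
    sims = nodes @ clip                  # cosine similarity to every node
    top3 = np.argsort(sims)[-3:]         # indices of the 3 best matches
    target = np.zeros(len(node_feats))
    target[top3] = 1.0                   # multi-hot pseudo-label
    return target

rng = np.random.default_rng(0)
labels = top3_pseudo_labels(rng.normal(size=64), rng.normal(size=(100, 64)))
assert labels.sum() == 3
```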
- Progressive Pretraining Strategy:
  - Function: Trains the model in stages along a specific path, with each stage focusing on one semantic level.
  - Optimal path: Task→Step→State→Step→Task (Path-5/6).
  - Model architecture: Frozen S3D visual encoder + trainable bottleneck adapter (512→128→512) + randomly initialized task head.
  - Knowledge transfer mechanism: Upon completion of each stage, the adapter weights are retained and passed to the next stage, while the task head is discarded and re-initialized.
  - Design Motivation: The strategy first performs top-down analysis (Task→Step→State) and then bottom-up synthesis (State→Step→Task), forming a complete analysis-synthesis cycle. Key findings include:
    - Jumping directly from State→Task (Path-4) yields poor results, confirming that the step level serves as a necessary intermediate bridge.
    - Joint training (Mix_Train) underperforms progressive training, as it fails to capture the causal relationships between hierarchy levels.
    - The final Step→Task back-traversal (Path-6 vs. Path-5) yields only marginal gains.
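The weight hand-off across stages can be sketched schematically. The `train_stage` body below is a placeholder and all names are illustrative, not the authors' code; only the stage order and the "keep adapter, re-initialize head" rule come from the paper:

```python
# Schematic sketch of the U-shaped schedule (Path-5): the bottleneck adapter
# persists across stages while the task head is re-initialized per stage.

STAGES = ["task", "step", "state", "step", "task"]  # the U-shaped path

def init_adapter():
    return {"shape": "512->128->512", "weights": "random_init"}

def init_head(level: str):
    # Fresh classification head for this level; discarded after the stage.
    return {"level": level, "weights": "random_init"}

def train_stage(adapter, head, level):
    # Placeholder for one pretraining stage on pseudo-labels of this level.
    adapter["weights"] = f"tuned_after_{level}"  # adapter keeps accumulating
    return adapter

adapter = init_adapter()
for level in STAGES:
    head = init_head(level)              # re-initialized every stage
    adapter = train_stage(adapter, head, level)
# After the loop, the adapter carries knowledge from all five stages.
```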
Loss & Training¶
- Loss: multi-label BCEWithLogitsLoss over the pseudo-label classes.
- Adam optimizer, lr = 1e-4, batch size = 256.
- Training data: 4.1 million video clips (from an 85K-video subset of HowTo100M).
- Approximately 90 seconds per epoch in the Step/State stages, or 30 seconds in the Task stage; 1,500 epochs in total.
- Trained on 8×H200 GPUs.
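A minimal numpy sketch of the objective, assuming the standard numerically stable BCE-with-logits formulation (equivalent to `torch.nn.BCEWithLogitsLoss` averaged over classes); the dimensions are toy values:

```python
# Hedged sketch of the multi-label objective over multi-hot pseudo-labels.
import numpy as np

def bce_with_logits(logits: np.ndarray, targets: np.ndarray) -> float:
    """Numerically stable binary cross-entropy computed on raw logits."""
    # max(x, 0) - x*t + log(1 + exp(-|x|)) is the stable BCE-with-logits form.
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float(loss.mean())

logits = np.array([3.0, -2.0, 0.5])    # raw scores for 3 pseudo-label classes
targets = np.array([1.0, 0.0, 1.0])    # multi-hot pseudo-labels
loss = bce_with_logits(logits, targets)
```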
Key Experimental Results¶
Main Results¶
Comparison with SOTA (Path-5, COIN dataset):
| Method | TR(MLP) | SR(MLP) | SF(MLP) | TR(Trans) | SR(Trans) | SF(Trans) |
|---|---|---|---|---|---|---|
| No pretrain | 2.09 | 1.37 | 0.84 | 78.31 | 39.23 | 35.43 |
| Paprika (SOTA) | 81.54 | 42.39 | 34.10 | 82.83 | 41.19 | 38.93 |
| Ours (Path-5) | 83.78 | 44.54 | 38.07 | 83.11 | 42.42 | 40.40 |
| Gain vs. Paprika | +2.24 | +2.15 | +3.97 | +0.28 | +1.23 | +1.47 |
CrossTask dataset:
| Method | TR(MLP) | SR(MLP) | SF(MLP) | TR(Trans) | SR(Trans) | SF(Trans) |
|---|---|---|---|---|---|---|
| Paprika | 89.65 | 56.21 | 55.77 | 90.27 | 55.57 | 55.67 |
| Ours (Path-5) | 89.44 | 57.92 | 57.13 | 89.44 | 57.08 | 56.50 |
Ablation Study¶
| Configuration | COIN TR(MLP) | COIN SR(MLP) | COIN SF(MLP) | Notes |
|---|---|---|---|---|
| Path-1 (Task only) | 73.31 | 34.18 | 23.67 | Task-level pretraining only |
| Path-2 (Task→Step) | 82.45 | 43.06 | 36.04 | Adding step level improves results |
| Path-3 (→State) | 80.73 | 41.28 | 34.35 | Direct transition to State degrades performance |
| Path-4 (→State→Task) | 77.74 | 37.51 | 24.84 | State→Task jump is too large |
| Path-5 (→State→Step) | 83.78 | 44.54 | 38.07 | U-shaped back-traversal is optimal |
| Path-6 (→Step→Task) | 83.30 | 44.04 | 36.94 | Additional Task back-traversal yields marginal gains |
| Mix_Train (joint) | 77.74 | 38.43 | 29.79 | Joint training underperforms progressive training |
| Fusion-AvgPool | 83.11 | 44.35 | 36.88 | Feature fusion is effective but weaker than progressive training |
Key Findings¶
- The state level is the key driver: The best result without states (Path-2) achieves 43.06 SR; the optimal result with states reaches 44.54 SR.
- Progressive training outperforms joint training: Mix_Train SR (38.43) is substantially lower than Path-5 (44.54), highlighting the importance of causal relationships between hierarchy levels.
- The step level is a necessary intermediate bridge: Path-4 (direct State→Task jump) suffers a sharp performance drop, while Path-5 (State→Step intermediate transition) achieves the best results.
- The U-shaped path simulates analysis-synthesis: Top-down analysis proceeds to the most concrete state level, followed by bottom-up synthesis back to abstract steps.
- Gains are more pronounced with a simple MLP downstream head (+3.97 SF), indicating genuinely higher-quality pretrained representations.
- Feature fusion (AvgPool/Concat) is also effective, validating the complementary nature of state information.
Highlights & Insights¶
- Conceptual innovation of "state" as a visual grounding layer: Abstract procedural knowledge is grounded in visual data through observable object configurations.
- Systematic exploration of progressive pretraining paths: Rather than arbitrarily choosing a training order, the paper systematically validates six distinct paths through ablation experiments.
- Efficient low-cost design: The visual encoder is frozen, and only a bottleneck adapter (512→128→512) is trained, achieving high parameter efficiency.
- The LLM-based state description generation approach is practical and scalable.
- The Path-4 vs. Path-5 comparison precisely reveals the existence of the semantic gap and the bridging role of the step level.
Limitations & Future Work¶
- Reliance on the S3D encoder and wikiHow corpus may limit applicability to broader video types.
- LLM-generated state descriptions may lack accuracy, particularly for abstract or complex operations.
- Evaluation is conducted only on COIN and CrossTask, both of which are relatively limited in scale.
- The fixed 9.6-second video segmentation may not suit every step, as some steps are much shorter or longer.
- Direct visual prediction from video to state is not explored; state information is utilized only indirectly through text-visual alignment.
Related Work & Insights¶
- The concept of "state" bridges procedural learning and object-centric video understanding.
- Progressive/curriculum learning ideas are particularly effective for hierarchically structured knowledge.
- The paradigm of using LLMs as knowledge augmentation tools (generating state descriptions) is generalizable to other tasks requiring intermediate semantic layers.
- The analysis-synthesis (U-shaped) learning path offers new insights for designing pretraining strategies over multi-level knowledge structures.
- The adapter fine-tuning strategy keeps the computational cost of large-scale pretraining manageable.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Both the TSS three-level framework and the U-shaped progressive training path are original contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Systematic ablation over 6 paths + fusion strategy comparisons + SOTA benchmarking.
- Writing Quality: ⭐⭐⭐⭐⭐ — Tight logical chain; the ablation analysis is highly instructive.
- Value: ⭐⭐⭐⭐ — The state-grounding idea has broad applicability, though dataset scale limits overall impact.