EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding¶
Conference: ECCV 2024
arXiv: 2406.08877
Code: GitHub
Area: Human Understanding
Keywords: Egocentric Vision, Full-Body Action Understanding, Fitness Dataset, Cross-view, Explainable Action Assessment
TL;DR¶
This paper proposes the EgoExo-Fitness dataset, which contains synchronized egocentric and exocentric fitness videos. It provides two-level temporal boundary annotations and innovative explainable action assessment labels (technical keypoint verification, natural language commentary, and quality scoring) and establishes five benchmark tasks.
Background & Motivation¶
Imagine wearing smart glasses while working out, where a virtual coach can tell you what you did, when you did it, and how well you performed. Achieving this vision requires Egocentric Full-Body Action Understanding (EgoFBAU) capabilities, but existing research has three main gaps:
Single-View Datasets: Existing full-body action datasets (e.g., NTU-RGB+D, FineGym, FineDiving) are predominantly captured from exocentric fixed cameras, which limits their application in flexible scenarios.
Limited Scenarios in Egocentric Datasets: Existing egocentric datasets (e.g., Ego4D, EPIC-KITCHENS) mainly focus on tabletop manipulation and daily interactions rather than full-body action understanding.
Lack of Explainable Assessment Annotations: Existing action quality assessment datasets only provide scores or rankings, making it impossible to directly explore the interpretability of the evaluation (i.e., why a certain score was given).
EgoExo-Fitness fills these gaps by simultaneously providing ego/exo videos and rich annotations.
Method¶
Overall Architecture¶
EgoExo-Fitness is a dataset and benchmark contribution with three main parts:
1. Multi-view recording system design and data collection.
2. Multi-level annotation system.
3. Five benchmark tasks and experimental analysis.
Key Designs¶
1. Recording System¶
Egocentric (Ego): A headset carrying three action cameras:
- Ego-M: A GoPro capturing the straight-ahead view.
- Ego-L / Ego-R: Two Insta360 Go3 cameras capturing the bottom-left and bottom-right views, respectively, to show more of the body.
Exocentric (Exo): Three fixed cameras placed in front (Exo-M), front-left (Exo-L), and front-right (Exo-R) of the participant.
All cameras are manually synchronized via visible timing events.
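A minimal sketch of what event-based synchronization amounts to, assuming the same visible timing event has been located in each camera's recording. The camera names come from the description above; the timestamps and the helper `to_reference_time` are hypothetical and not taken from the paper.

```python
# Hypothetical times (seconds) at which the same visible timing event
# appears in each camera's own recording; the values are made up.
event_time = {
    "Ego-M": 2.40, "Ego-L": 1.15, "Ego-R": 1.90,
    "Exo-M": 0.50, "Exo-L": 3.05, "Exo-R": 0.80,
}

def to_reference_time(camera: str, t: float, reference: str = "Exo-M") -> float:
    """Map a timestamp on `camera`'s timeline onto the reference camera's timeline."""
    offset = event_time[camera] - event_time[reference]
    return t - offset

# After alignment, the timing event falls on the same instant in every view.
aligned = {cam: to_reference_time(cam, t) for cam, t in event_time.items()}
```

A constant offset per camera only holds if all cameras record at a stable frame rate, which is an assumption of this sketch.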
2. Data Collection¶
- 12 fitness action categories: Targeting the chest, abdomen, waist, hips, and whole body (e.g., kneeling push-ups, push-ups, sit-ups, high knees, and jumping jacks).
- 86 action sequences: Randomly combining 3-6 different actions to enrich temporal diversity.
- Natural collection: Participants only received text instructions and completed actions naturally, with each action repeated at least 4 times.
- Scale: 1276 cross-view sequence videos, 6131 single actions, totaling approximately 32 hours.
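The sequence construction described above (3-6 different actions drawn at random from 12 categories) can be sketched as follows. The labels are placeholders because only some category names are listed here, and the sampling procedure itself is an assumption rather than the authors' script.

```python
import random

# Placeholder labels: the text names only some of the 12 categories.
ACTIONS = [f"action_{i:02d}" for i in range(1, 13)]

def sample_sequence(rng: random.Random) -> list[str]:
    """Draw 3 to 6 distinct actions in random order."""
    length = rng.randint(3, 6)           # both ends inclusive
    return rng.sample(ACTIONS, length)   # sampling without replacement

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
sequences = [sample_sequence(rng) for _ in range(86)]
```

This sketch does not check that the 86 sequences differ from one another; whether the dataset enforces that is not stated here.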
3. Annotation System (Core Innovation)¶
Two-Level Temporal Boundary Annotations:
- Level 1: Locating the start and end times of each individual action within an action sequence video.
- Level 2: Dividing each single-action video into three substeps: Getting ready, Executing, and Relaxing.
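One possible in-memory representation of the two annotation levels is sketched below. The class and field names are hypothetical, and the consistency check assumes that the three substeps are contiguous and cover the whole action span, which is not confirmed by the text.

```python
from dataclasses import dataclass, field

SUBSTEPS = ("Getting ready", "Executing", "Relaxing")

@dataclass
class Substep:                      # Level 2 boundary
    name: str
    start: float                    # seconds
    end: float

@dataclass
class ActionSegment:                # Level 1 boundary
    label: str
    start: float
    end: float
    substeps: list[Substep] = field(default_factory=list)

    def is_consistent(self) -> bool:
        """Substeps must be in order, contiguous, and span start..end."""
        if tuple(s.name for s in self.substeps) != SUBSTEPS:
            return False
        cursor = self.start
        for step in self.substeps:
            if step.start != cursor or step.end < step.start:
                return False
            cursor = step.end
        return cursor == self.end

# Made-up timestamps for illustration.
push_up = ActionSegment(
    label="push-ups", start=12.0, end=30.0,
    substeps=[Substep("Getting ready", 12.0, 14.5),
              Substep("Executing", 14.5, 27.0),
              Substep("Relaxing", 27.0, 30.0)],
)
```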
Explainable Action Assessment Annotations (three progressive layers):
(1) Technical Keypoint Verification:
- Text guidance is provided for each type of action.
- An LLM splits the guidance into several technical keypoints.
- Annotators verify, keypoint by keypoint, whether the performed action satisfies each one (True/False).
(2) Natural Language Commentary:
- Annotators write paragraph-style reviews based on the keypoint verification results.
- Each review covers what was performed well and gives suggestions for improvement.
(3) Action Quality Scoring: A subjective score of 1-5, with each action annotated by at least 2 database experts.
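The three annotation layers can be pictured as one record per action, as in the sketch below. The class, the keypoint wording, and the comment text are invented for illustration, and averaging the annotators' scores is an assumption about how multiple scores might be combined.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Assessment:
    action: str
    keypoints: dict[str, bool]   # technical keypoint -> verified True/False
    comment: str                 # natural language commentary
    scores: list[int]            # one 1-5 score per annotator

    def __post_init__(self):
        if len(self.scores) < 2:
            raise ValueError("expected scores from at least 2 annotators")
        if any(not 1 <= s <= 5 for s in self.scores):
            raise ValueError("scores must lie in 1..5")

    def failed_keypoints(self) -> list[str]:
        return [k for k, ok in self.keypoints.items() if not ok]

    def mean_score(self) -> float:
        return mean(self.scores)

# Invented example; the keypoint texts are not from the dataset.
example = Assessment(
    action="push-ups",
    keypoints={"Keep the body in a straight line": True,
               "Lower the chest close to the floor": False},
    comment="Good body alignment; try to lower the chest further.",
    scores=[3, 4],
)
```

Keeping the failed keypoints next to the score is what makes the record explainable: the score can be traced back to the specific keypoints that were not met.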
4. Five Benchmark Tasks¶
- Action Classification: Predicting fitness action categories from single action videos.
- Action Localization: Temporal action detection.
- Cross-View Sequence Verification (CVSV): Verifying whether two videos from different views perform the same action sequence (newly proposed).
- Cross-View Skill Assessment: Action quality assessment across views.
- Guidance-based Execution Verification (GEV): Determining whether the action execution satisfies the given technical keypoints (newly proposed).
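Sequence verification is commonly framed as comparing two video embeddings and thresholding their distance. The sketch below illustrates that framing for a cross-view pair; it is not the CAT baseline, and the embeddings, the threshold value, and the function names are assumptions.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0 means the embeddings point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def same_sequence(ego_emb: list[float], exo_emb: list[float],
                  threshold: float = 0.5) -> bool:
    """Decide whether an ego video and an exo video show the same action sequence."""
    return cosine_distance(ego_emb, exo_emb) <= threshold
```

In practice the embeddings would come from a video encoder, and the threshold would be tuned on validation pairs or avoided entirely by reporting a threshold-free metric such as AUC.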
Loss & Training¶
As a dataset paper, the focus is on experimental benchmarks rather than specific model designs. Baseline models include pre-trained models such as I3D, TimeSformer, and EgoVLP, as well as CAT, a specialized sequence verification model. In the tables below, TSF denotes TimeSformer, K400/K600 denote Kinetics-400/600, and EE4D denotes Ego-Exo4D.
Key Experimental Results¶
Main Results — Action Classification¶
| Training Data | Model | Pre-training | Exo Test↑ | Ego Test↑ |
|---|---|---|---|---|
| Exo | I3D | K400 | 0.9194 | 0.0927 |
| Ego | I3D | K400 | 0.1025 | 0.7469 |
| Ego&Exo | I3D | K400 | 0.8963 | 0.7266 |
| Exo | TSF | K600 | 0.9274 | 0.0836 |
| Ego | EgoVLP | Ego4D | 0.0887 | 0.7977 |
| Ego | TSF | EE4D | 0.1601 | 0.8000 |
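To make the viewpoint gap in the table explicit, the short calculation below subtracts each single-view model's cross-view accuracy from its in-view accuracy. The numbers are copied from the table; the helper `view_gap` is only an illustrative name.

```python
# (training view, model, pre-training) -> (Exo test acc, Ego test acc),
# copied from the single-view rows of the table above.
results = {
    ("Exo", "I3D", "K400"): (0.9194, 0.0927),
    ("Ego", "I3D", "K400"): (0.1025, 0.7469),
    ("Exo", "TSF", "K600"): (0.9274, 0.0836),
    ("Ego", "EgoVLP", "Ego4D"): (0.0887, 0.7977),
    ("Ego", "TSF", "EE4D"): (0.1601, 0.8000),
}

def view_gap(train_view: str, exo_acc: float, ego_acc: float) -> float:
    """In-view accuracy minus cross-view accuracy."""
    if train_view == "Exo":
        return exo_acc - ego_acc
    return ego_acc - exo_acc

gaps = {key: round(view_gap(key[0], *acc), 4) for key, acc in results.items()}
for (view, model, pretrain), gap in gaps.items():
    print(f"{view:>3} | {model:<6} | {pretrain:<5} | gap = {gap:.4f}")
```

Every single-view model loses more than 0.6 accuracy when tested on the other view.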
Cross-View Sequence Verification¶
| Training Data | Ego-Ego AUC↑ | Exo-Exo AUC↑ | Exo-Ego AUC↑ |
|---|---|---|---|
| Exo-Exo | 0.532 | 0.800 | 0.577 |
| Ego-Ego | 0.803 | 0.487 | 0.480 |
| Exo-Ego | 0.761 | 0.813 | 0.744 |
| All | 0.751 | 0.814 | 0.743 |
Cross-view retrieval performance: Ego \(\rightarrow\) Exo Rank-1 is only 0.296, and mAP is only 0.228, which is far lower than same-view retrieval.
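For readers unfamiliar with the AUC metric in the table above, here is a small pure-Python sketch of the standard definition: the probability that a randomly chosen matching pair receives a higher similarity score than a randomly chosen non-matching pair. The scores and labels are toy values, not results from the paper.

```python
def auc(scores: list[float], labels: list[int]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy similarity scores for four video pairs; label 1 = same action sequence.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)  # 3 of 4 positive/negative comparisons are won
```

An AUC of 0.5 corresponds to chance, which is why values such as 0.480 or 0.487 for mismatched training and test views indicate essentially no transfer.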
Ablation Study¶
Impact of Pre-training on Views: Kinetics pre-training achieves the best performance on Exo (0.9274), while Ego-Exo4D pre-training performs best on Ego (0.8000), aligning with the view of the pre-training data.
Ego Training Data Proportion Experiment: Gradually reducing the ego training data (\(100\% \rightarrow 70\% \rightarrow 30\% \rightarrow 0\%\)) leads to a continuous decline in all metrics, indicating that cross-view learning under limited ego data is a significant challenge.
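A sketch of how such a proportion experiment could be set up is given below. Shuffling once and taking prefixes makes the smaller subsets nested inside the larger ones; that choice, the function name, and the sample IDs are assumptions, not details from the paper.

```python
import random

def ego_subsets(ego_ids, ratios=(1.0, 0.7, 0.3, 0.0), seed=0):
    """Return one ego training subset per ratio, each a prefix of one shuffled order."""
    order = list(ego_ids)
    random.Random(seed).shuffle(order)   # fixed seed for repeatability
    return {r: order[: round(len(order) * r)] for r in ratios}

# Hypothetical ego sample IDs.
subsets = ego_subsets([f"ego_{i:03d}" for i in range(10)])
sizes = {r: len(ids) for r, ids in subsets.items()}
```

Nested subsets keep the comparison across proportions controlled, since every smaller training set contains only samples already present in the larger ones.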
Key Findings¶
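- Huge Viewpoint Gap: Models trained on a single view largely fail on the other view. Ego-trained models reach only about 0.09-0.16 accuracy on the exo test set, and exo-trained models only about 0.08-0.09 on the ego test set.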
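- Mixed Training is Not Always Effective: Training with mixed ego and exo data does not guarantee improvements and may even degrade performance on specific views. With I3D, joint Ego&Exo training scores 0.8963 on exo and 0.7266 on ego, both slightly below the corresponding single-view models (0.9194 and 0.7469).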
- Ego is More Challenging than Exo: Single-view models achieve lower in-view accuracy on ego videos (0.7469-0.8000) than on exo videos (0.9194-0.9274), because action patterns in egocentric views look more alike and offer fewer discriminative cues.
- Cross-View Sequence Verification is Highly Challenging: The AUC for ego-exo pairs (0.744) is significantly lower than for same-view pairs (ego-ego 0.803, exo-exo 0.814).
- Dilemma of Limited Ego Data: Reducing the proportion of ego data leads to continuous performance degradation, whereas collecting egocentric data is much harder than exocentric data in practice.
Highlights & Insights¶
- First Systematic Annotation of Explainable Action Assessment: The three-level annotation system (technical keypoint verification + natural language commentary + quality scoring) opens up a new direction for explainable action assessment.
- First Proposal of Cross-View Sequence Verification: Extending traditional SV to cross-view scenarios closely aligns with the practical demands of smart wearable devices.
- Unique Downward-Looking Ego Camera Design: In addition to the straight-ahead view, two cameras pointing bottom-left and bottom-right are used to capture more body details, compensating for the lack of body visibility in the standard front-facing egocentric camera.
- Action Sequence Design: Each video segment contains 3-6 different actions, naturally supporting action localization and sequence verification tasks.
Limitations & Future Work¶
- The dataset scale (32 hours) is small compared to Ego-Exo4D (more than 1,000 hours).
- It only covers 12 types of fitness actions, offering limited movement diversity.
- Manual camera synchronization might introduce slight temporal deviations.
- Benchmark experiments mainly utilize existing models, without proposing methods specifically tailored for cross-view challenges.
- Future work could explore multimodal evaluation frameworks that combine technical keypoint verification with large language models.
Related Work & Insights¶
- Ego-Exo4D: A concurrent large-scale ego-exo dataset, but EgoExo-Fitness focuses specifically on fitness scenarios and offers unique keypoint verification annotations.
- FLAG3D, FineDiving: Exocentric-only full-body action datasets lacking egocentric perspectives.
- Ego4D: A large-scale egocentric dataset, but it rarely involves full-body action understanding.
- CAT (Sequence Verification Model): Exhibits a significant performance drop in cross-view settings, indicating a need for new cross-view temporal modeling methods.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The first ego-exo fitness dataset with explainable action assessment annotations.
- Technical Depth: ⭐⭐⭐ — Primarily a dataset contribution with limited methodological innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 5 tasks with multi-dimensional analysis and exhaustive view impact studies.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with rich statistical visualizations.