EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT¶
Conference: NeurIPS 2025 arXiv: 2510.23569 Code: GitHub Area: Robotics Keywords: Egocentric video, chain-of-thought reasoning, hand-object interaction, reinforcement fine-tuning, spatio-temporal grounding
TL;DR¶
This paper proposes EgoThinker, which builds EgoRe-5M, a large-scale egocentric video reasoning dataset with causal CoT annotations and hand-object grounding labels, and trains on it with a two-stage paradigm (SFT followed by GRPO-based reinforcement fine-tuning). The result is an MLLM with robust egocentric reasoning, hand-object grounding, and temporal localization, achieving state-of-the-art performance across multiple egocentric benchmarks.
Background & Motivation¶
- Background: MLLMs excel at third-person visual reasoning but lack embodied cognitive understanding from an egocentric perspective.
- Limitations of Prior Work: Existing egocentric datasets (e.g., Ego4D) lack explicit reasoning chains, temporal span annotations, and fine-grained hand-object grounding data.
- Key Challenge: Egocentric reasoning requires inferring the unseen camera-wearer's intent and actions, rather than merely recognizing visible events.
- Goal: Equip MLLMs with comprehensive capabilities for egocentric reasoning, precise hand-object grounding, and long-range temporal understanding.
- Key Insight: Construct large-scale causal CoT-annotated data and adopt a two-stage training paradigm (SFT to establish foundations + RFT to reinforce grounding).
- Core Idea: Mine egocentric data at scale from web videos such as HowTo100M, construct causal reasoning QA pairs, and apply GRPO to reinforce spatio-temporal grounding.
Method¶
Overall Architecture¶
EgoRe-5M dataset → SFT stage (causal CoT reasoning capability) → RFT stage (GRPO-based reinforcement of hand-object and temporal grounding).
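To make this pipeline concrete, below is a minimal sketch of the two-stage schedule; the stage fields and the stubbed training loop are illustrative assumptions, not the paper's released code:

```python
# Minimal sketch of the EgoThinker two-stage schedule. Stage fields and the
# train_stage stub are illustrative assumptions, not the released code.

STAGES = [
    {
        "name": "SFT",
        "data": "EgoRe-5M causal-CoT QA pairs",
        "objective": "cross-entropy over CoT + answer tokens",
    },
    {
        "name": "RFT",
        "data": "spatio-temporal grounding subset (hand-object boxes, time spans)",
        "objective": "GRPO with temporal-IoU / bounding-box matching rewards",
    },
]

def train_stage(model, stage):
    # Placeholder: run the optimization loop for this stage.
    print(f"[{stage['name']}] data: {stage['data']} | objective: {stage['objective']}")
    return model

def train(model="InternVL-backbone"):
    for stage in STAGES:
        model = train_stage(model, stage)
    return model

if __name__ == "__main__":
    train()
```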
Key Designs¶
- EgoRe-5M Dataset: Constructs 5 million QA pairs from 13M egocentric video clips, including causal CoT annotations over multi-minute segments and dense hand-object grounding labels. Egocentric videos are mined from HowTo100M (30M initial clips) via a multi-stage filtering pipeline, leveraging temporally aligned annotations from HTM-AA and Howto-Interlink7M.
- Two-Stage Training Paradigm:
- SFT stage: Establishes foundational egocentric understanding and reasoning on EgoRe-5M by learning from causal CoT annotations.
- RFT stage: Applies GRPO on spatio-temporal grounding data to reinforce precise localization, using IoU and bounding box matching as rewards.
- Spatio-Temporal CoT Annotations: Annotations encode complete causal chains (why an action is performed → how it is executed → what comes next), enabling the model to simulate human egocentric causal reasoning and planning.
- Hand-Object Interaction Data: Dedicated dense hand-object interaction grounding data is constructed, annotating hand positions, grasped objects, and interaction types (an illustrative record format is sketched after this list).
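Since these notes do not give a record schema, a hypothetical EgoRe-5M-style sample may help make the annotation types concrete; every field name below is an assumption for illustration, not the released format:

```python
# Hypothetical EgoRe-5M-style record (all field names are illustrative).
sample = {
    "video": "howto100m/abc123.mp4",       # clip mined from web video
    "span": [12.0, 95.5],                  # temporal segment in seconds
    "question": "Why does the camera wearer pick up the whisk?",
    "cot": [                               # causal chain: why -> how -> what next
        "The wearer has just cracked eggs into a bowl (intent: beat the eggs).",
        "They grasp the whisk with the right hand and stir in circular motions.",
        "Next, they will likely pour the mixture into the heated pan.",
    ],
    "answer": "To beat the eggs before cooking them.",
    "grounding": {                         # dense hand-object interaction labels
        "t": 15.2,                                                   # annotated frame (s)
        "hand": {"side": "right", "bbox": [0.52, 0.40, 0.68, 0.61]}, # normalized xyxy
        "object": {"label": "whisk", "bbox": [0.55, 0.33, 0.70, 0.55]},
        "interaction": "grasp",
    },
}
```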
Loss & Training¶
- SFT: Standard cross-entropy loss
- RFT: GRPO with temporal-IoU and bounding-box matching rewards (a minimal reward sketch follows this list)
- Backbone: InternVL series vision-language models
- Data diversity: Covers temporal spans ranging from a few seconds to several minutes
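To illustrate the RFT objective, here is a minimal sketch of the temporal-IoU and bounding-box rewards together with GRPO's group-relative advantage normalization; the function names, group size, and exact reward shaping are assumptions, not the paper's implementation:

```python
import numpy as np

def temporal_iou(pred, gt):
    """IoU between predicted and ground-truth [start, end] spans (seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def bbox_iou(pred, gt):
    """IoU between predicted and ground-truth [x1, y1, x2, y2] boxes."""
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    ix = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    iy = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = ix * iy
    union = area(pred) + area(gt) - inter
    return inter / union if union > 0 else 0.0

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled response's reward is normalized
    by the mean/std of its group, so no learned value network is needed."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

# Example: 4 sampled responses for one temporal-grounding query.
gt_span = [12.0, 95.5]
preds = [[10.0, 90.0], [30.0, 60.0], [12.5, 96.0], [0.0, 5.0]]
rewards = [temporal_iou(p, gt_span) for p in preds]
print(grpo_advantages(rewards))  # higher-IoU responses get positive advantage
```

Because advantages are computed within each sampled group, GRPO avoids a separate value model, which keeps reinforcement fine-tuning on a large VLM comparatively lightweight.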
Key Experimental Results¶
| Benchmark | Result |
|---|---|
| EgoSchema | SOTA, with a significant improvement over the previous best |
| Ego4D NLQ | SOTA, with a significant gain |
| Multiple egocentric QA benchmarks | SOTA |
Key Findings¶
- Causal CoT annotations are critical for complex egocentric reasoning.
- GRPO reinforcement training significantly improves spatio-temporal grounding precision.
- Mining egocentric data from web videos is a scalable and effective strategy.
EgoRe-5M Dataset Composition¶
| Source | Initial Clips | After Filtering | QA Pairs |
|---|---|---|---|
| HowTo100M | 30M | ~6M | ~3M |
| HTM-AA | 2M | ~1.5M | ~1M |
| Howto-Interlink7M | 7M | ~3M | ~1M |
| Total | 39M | ~10.5M | ~5M |
Ablation Study¶
| Configuration | EgoSchema | Ego4D NLQ | Hand-Object Grounding |
|---|---|---|---|
| SFT only | Good | Moderate | Poor |
| SFT + RFT | SOTA | SOTA | Best |
| RFT only (w/o SFT) | Poor | Poor | Moderate |
Highlights & Insights¶
- The first egocentric MLLM to combine causal reasoning with precise hand-object grounding in a single model.
- The EgoRe-5M data construction pipeline is reusable for other egocentric tasks.
- The two-stage training paradigm (SFT → RFT) is validated as effective in the egocentric domain.
Limitations & Future Work¶
- The data mining pipeline may introduce non-egocentric video noise, and filtering quality directly affects downstream performance.
- Evaluation is limited to existing egocentric benchmarks; generalization to real wearable device scenarios (e.g., AR glasses) remains to be verified.
- Integration with robotic manipulation tasks—an important application of egocentric understanding—has not been explored.
- CoT annotations rely on LLM generation, and quality may vary, particularly for complex causal chains.
- Hand-object grounding accuracy may degrade under severe occlusion.
- Inference latency and efficiency for real-time deployment have not been investigated, though wearable devices impose strict latency requirements.
- The GRPO reward design (IoU/bounding box matching) may not cover all grounding scenarios sufficiently.
- HowTo100M primarily consists of instructional videos, potentially under-representing everyday life scenarios.
Related Work & Insights¶
- vs. EgoVLP: EgoVLP performs vision-language pre-training but lacks causal reasoning chains.
- vs. VideoChat-R1: VideoChat-R1 applies general-purpose video RL fine-tuning, whereas EgoThinker targets egocentric scenarios specifically.
- vs. EgoPlan-Bench: a benchmark that evaluates egocentric planning capability but does not optimize the model itself.
- vs. InternVL: A general-purpose VLM that lacks embodied understanding from an egocentric perspective.
Additional Discussion¶
- The core innovation lies in extending egocentric understanding from single-skill recognition to a multi-dimensional framework that combines causal reasoning, hand-object grounding, and temporal localization.
- The experimental design covers diverse scenarios and baselines, with consistent gains across benchmarks.
- The modular design of the method facilitates extension to related tasks and new datasets.
- Open-sourcing the code and data provides significant value for community reproduction and follow-up research.
- Compared to concurrent works, this paper demonstrates advantages in the depth of problem formulation and the comprehensiveness of experimental analysis.
- The paper follows a clear logical structure, forming a complete loop from problem definition to method design to experimental validation.
- The training overhead of the two-stage paradigm appears reasonable; as noted under Limitations, inference-time efficiency remains to be quantified.
- Future work may consider fusion with additional modalities (e.g., audio, 3D point clouds).
- Validating the scalability of the method on larger data and models is an important future direction.
- Combining the method with end-to-end reinforcement learning optimization is a worthwhile direction to explore.
- Cross-domain transfer is worth investigating, as the generality of the method requires further validation.
- Lightweight variants of the method for edge computing and mobile deployment scenarios merit further study.
Rating¶
- Novelty: ⭐⭐⭐⭐ Large-scale egocentric causal reasoning dataset construction is a significant contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Validated with SOTA results across multiple benchmarks.
- Writing Quality: ⭐⭐⭐⭐ Clear framework presentation with thorough description of data construction.
- Value: ⭐⭐⭐⭐⭐ Directly valuable for wearable assistants and embodied AI.