# EgoM2P: Egocentric Multimodal Multitask Pretraining
Conference: ICCV 2025 · arXiv: 2506.07886 · Code: Project Page · Area: 3D Vision
Keywords: Egocentric Vision, Multimodal Pretraining, Multitask Learning, Masked Modeling, Gaze Prediction, Camera Tracking, Depth Estimation, Video Generation
## TL;DR
EgoM2P is the first large-scale multimodal multitask model for egocentric 4D understanding. It unifies four modalities — RGB video, depth, gaze, and camera trajectory — within a temporally-aware masked modeling framework, matching or surpassing task-specific models on multiple downstream tasks while being an order of magnitude faster.
## Background & Motivation
### Key Challenges
Background: Multimodal signals in egocentric vision (RGB, depth, camera pose, gaze, etc.) are critical for augmented reality, robotics, and human-computer interaction. However, building large-scale egocentric multimodal multitask models faces unique challenges:
- Data Heterogeneity: Modality coverage varies greatly across devices and scenarios — some datasets include gaze but no depth, others include depth but no gaze — resulting in severe missing-modality issues.
- Infeasibility of Pseudo-Labels: Generating pseudo-labels for modalities such as gaze and head-mounted camera trajectories is often impractical, making standard supervised learning difficult to scale.
- Spatiotemporal Complexity: Dynamic camera motion and complex spatiotemporal structures in egocentric video pose additional challenges for existing multimodal foundation models.
- Limitations of Prior Work: Third-person video models are ill-suited for the egocentric perspective; existing egocentric foundation models focus solely on video question answering, neglecting body-related modalities and 3D/4D reconstruction capabilities.
- Limitations of Image-Level Models: Multimodal foundation models such as 4M operate on single images and lack temporal consistency guarantees.
### Paper Goals
Goal: Train a single generalist model that unifies RGB video, depth, gaze, and camera trajectory, handles the missing-modality problem of heterogeneous egocentric data without resorting to pseudo-labels, and matches or surpasses task-specific models on egocentric 4D understanding and generation tasks.
## Method
### Overall Architecture
EgoM2P is built on the T5-Base architecture with an encoder-decoder design. The pipeline consists of three stages: data curation → modality tokenization → masked pretraining.
### Key Design 1: Data Curation Pipeline
Eight heterogeneous datasets are integrated (EgoExo4D, HoloAssist, HOT3D, ARCTIC, TACO, H2O, EgoGen, etc.):
- Segmentation: Clips are uniformly cut into \(T\)-frame segments and encoded as high-quality mp4 files.
- Annotation: RollingDepth is used to generate depth pseudo-labels; EgoGen provides ~30 hours of synthetic data; missing gaze annotations are not pseudo-labeled.
- Normalization: Unified to 30 FPS; depth is represented as inverse depth and normalized per sequence; gaze is projected to 2D; camera trajectories are standardized to OpenCV camera-to-world convention with the first frame as reference.
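To make the normalization step concrete, here is a minimal sketch under stated assumptions: the paper specifies per-sequence inverse-depth normalization and first-frame-relative OpenCV camera-to-world poses, but the function names and the min-max scaling to [0, 1] below are illustrative choices, not the authors' code.

```python
import numpy as np

def normalize_inverse_depth(depth: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Per-sequence inverse-depth normalization for a (T, H, W) depth clip.
    The min-max scaling to [0, 1] is an assumed choice, not stated in the paper."""
    inv = 1.0 / np.clip(depth, eps, None)  # inverse depth de-emphasizes far-range noise
    lo, hi = inv.min(), inv.max()          # statistics over the whole sequence
    return (inv - lo) / max(hi - lo, eps)

def make_relative_c2w(c2w: np.ndarray) -> np.ndarray:
    """Re-express (T, 4, 4) OpenCV camera-to-world poses relative to the first
    frame, so every trajectory starts at the identity pose."""
    return np.einsum("ij,tjk->tik", np.linalg.inv(c2w[0]), c2w)
```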
### Key Design 2: Temporal Tokenizers
- Video Modalities (RGB/Depth): Cosmos Tokenizer with \(4\times\) temporal and \(8\times\) spatial compression.
- Gaze/Camera Trajectory: Transformer-based VQ-VAE
    - N-dimensional convolution with \(2\times\) temporal downsampling + N-dimensional positional encoding + 12-layer Transformer
    - Cosine-similarity vector quantization with modality-specific codebooks
    - Training loss: reconstruction + codebook + commitment losses
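The quantization step above can be sketched as follows; this is an illustrative implementation of cosine-similarity vector quantization with straight-through gradients, where the codebook size, embedding width, and commitment weight are assumptions rather than the paper's hyperparameters. The reconstruction term of the loss is computed on the decoder output, outside this module.

```python
import torch
import torch.nn.functional as F

class CosineVQ(torch.nn.Module):
    """Cosine-similarity vector quantizer with a modality-specific codebook."""
    def __init__(self, codebook_size: int = 1024, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = torch.nn.Embedding(codebook_size, dim)
        self.beta = beta  # commitment loss weight (assumed value)

    def forward(self, z: torch.Tensor):        # z: (B, T', dim) encoder outputs
        z_n = F.normalize(z, dim=-1)
        cb_n = F.normalize(self.codebook.weight, dim=-1)
        idx = (z_n @ cb_n.t()).argmax(dim=-1)  # nearest code by cosine similarity
        q = self.codebook(idx)                 # (B, T', dim) quantized vectors
        # Codebook loss pulls codes toward encodings; commitment loss does the reverse.
        vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()               # straight-through estimator
        return q, idx, vq_loss
```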
### Key Design 3: Multimodal Masked Pretraining
Four key adaptations are made relative to the image-level masked modeling of 4M:
- Token Count Expansion: From 256 visible tokens in 4M to 2048, accommodating 5000+ tokens per video.
- Dataset Balancing: Token counts are heavily imbalanced (~4 billion total training tokens vs. ~13 million gaze tokens) → sampling proportional to dataset size, combined with uniform sampling over modalities.
- Missing Modality Handling: Placeholder tokens with masking, rather than pseudo-labels.
- Temporal Embeddings: 1D sine-cosine temporal encoding for gaze/camera tokens and 3D sine-cosine encoding for video tokens, plus learnable modality-type embeddings (see the sketch after this list).
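A simplified sketch of the last two adaptations, assuming a reserved MISSING_ID placeholder and standard sine-cosine embeddings; none of these names come from the released code.

```python
import math
import torch

def sincos_1d(frame_pos: torch.Tensor, dim: int) -> torch.Tensor:
    """1D sine-cosine temporal encoding over frame indices (gaze/camera tokens);
    video tokens would use the analogous 3D variant over (t, h, w)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = frame_pos.float()[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)  # (N, dim)

MISSING_ID = 0  # hypothetical reserved id standing in for an absent modality

def embed_stream(token_ids, frame_pos, tok_emb, type_emb):
    """Sum token, temporal, and modality-type embeddings for one modality stream.
    Placeholder tokens of missing modalities are excluded from the loss rather
    than being pseudo-labeled."""
    x = tok_emb(token_ids) + sincos_1d(frame_pos, tok_emb.embedding_dim) + type_emb
    loss_mask = token_ids != MISSING_ID  # no supervision on placeholder positions
    return x, loss_mask
```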
### Inference
After pretraining, any-to-any modality prediction is achieved by choosing which token streams are visible and which are masked; all masked tokens are decoded in parallel.
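Concretely, the any-to-any interface reduces to deciding which streams carry real tokens and which are fully masked. The call pattern below is a sketch: MASK_ID, build_query, and the token lengths are illustrative assumptions, not the released API.

```python
import torch

MASK_ID = 1  # hypothetical id of the learned [MASK] token

def build_query(observed: dict, targets: list, lengths: dict) -> dict:
    """Assemble an any-to-any query: observed modalities contribute their real
    tokens; each requested target becomes a fully masked stream that the model
    fills in with one parallel decoding pass."""
    query = dict(observed)
    for m in targets:
        query[m] = torch.full((lengths[m],), MASK_ID, dtype=torch.long)
    return query

# Example: predict depth and gaze from RGB alone.
query = build_query(
    observed={"rgb": torch.randint(2, 4096, (2048,))},  # visible RGB tokens
    targets=["depth", "gaze"],
    lengths={"depth": 2048, "gaze": 32},                # illustrative lengths
)
```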
## Key Experimental Results
### Supported Downstream Tasks
- Gaze Estimation: Predicting 2D gaze points.
- Egocentric Camera Tracking: 6DoF trajectory estimation.
- Monocular Depth Estimation: Estimating depth from RGB video.
- Conditional Video Generation: Generating video conditioned on multimodal inputs.
### Main Results
- Matches or surpasses task-specific models on all four tasks.
- Inference is an order of magnitude faster than running the corresponding task-specific models.
- A single model covers both understanding and generation.
### Training Scale
| Item | Value |
|---|---|
| Number of Datasets | 8 (real + synthetic) |
| Total Training Tokens | ~4 billion |
| Gaze Tokens | ~13 million |
| Video Tokens / Sample | >5000 |
| Visible Token Count | 2048 |
### Key Findings
- EgoGen synthetic data yields notable improvements for depth estimation.
- Sampling proportional to dataset size combined with uniform modality sampling is the most stable strategy.
- Missing modalities can be effectively predicted without pseudo-labels.
- Temporal tokenization design significantly outperforms image-level processing.
## Highlights & Insights
- Elegant Handling of Missing Modalities: Masked placeholder tokens naturally accommodate missing modalities, offering a pragmatic solution to the heterogeneity of egocentric data.
- Unified Tokenizer Architecture: Adding new modalities requires only training a new tokenizer and codebook, conferring strong extensibility.
- Unified Understanding and Generation: A single model supports both perception and generation tasks.
- Speed Advantage: Parallel inference is an order of magnitude faster than running multiple specialized models, which is critical for real-time AR/VR applications.
- Data Engineering: The complete pipeline for integrating eight heterogeneous datasets constitutes a practical engineering contribution.
## Limitations & Future Work
- Gaze data is far less abundant than other modalities (13 million vs. 4 billion tokens), potentially limiting gaze prediction accuracy.
- Only four modalities are covered; audio, touch, and hand pose are not included.
- Video resolution is constrained by the compression capacity of the Cosmos Tokenizer.
- Detailed per-task numerical comparisons against specialized models are deferred to the supplementary material, making the headline claims harder to verify from the main paper alone.
- The datasets are biased toward indoor and manipulation scenarios; generalization to outdoor settings has not been thoroughly validated.
## Related Work & Insights
- Image-Level Multimodal Foundation Models: 4M (masked modeling) → EgoM2P extends this paradigm to temporal video.
- Video Foundation Models: VideoMAE, Cosmos → primarily third-person perspective.
- Egocentric Foundation Models: EgoVLP/v2 (video-language QA) → lack body-related modalities and 3D capabilities.
- Multimodal Binding: ImageBind → aligns embedding spaces only.
## Rating
- Novelty: 8/10 — First multimodal multitask model targeting egocentric 4D understanding.
- Technical Depth: 7/10 — Extension of 4M is well-motivated but not revolutionary.
- Experimental Thoroughness: 6/10 — Detailed numbers are partially deferred to supplementary material.
- Practical Value: 9/10 — Single-model multitask design with fast inference has direct value for AR/VR.
- Overall: 7.5/10