Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance¶
Metadata¶
- Conference: ICCV 2025
- arXiv: 2505.14346
- Code: GitHub
- Area: 3D Vision / Inertial Localization
- Keywords: inertial localization, IMU, point cloud, egocentric view, multimodal alignment, action recognition
TL;DR¶
The EAIL framework leverages egocentric action cues embedded in head-mounted IMU signals and employs hierarchical multimodal alignment (vision-language guidance) to learn associations between actions and environmental structures, enabling accurate inertial localization in 3D point clouds while simultaneously supporting action recognition.
Background & Motivation¶
Inertial localization (tracking human position via IMU) faces two primary challenges:
- Trajectory Drift: Sensor noise causes measurement errors to accumulate over time, eventually resulting in significant drift.
- Complexity of Human Actions: Wearable IMUs capture not only displacement-inducing motions (walking/stopping) but also actions that produce no positional change (e.g., head movement while cooking), complicating IMU signal processing.
Core Insight: Certain actions are strongly correlated with spatial environment structures (e.g., washing dishes occurs near the sink, bending to check the oven occurs in front of the oven). These actions can serve as spatial anchors to compensate for localization drift.
Limitations of prior work:
- Velocity-integration methods (RoNIN, IMUNet) exhibit rapidly growing errors over long sequences.
- NILoc predicts positions directly but requires scene-specific training, lacking cross-scene generalization.
- Existing datasets and methods focus primarily on walking scenarios, neglecting the diversity of complex human actions.
Method¶
Overall Architecture¶
EAIL adopts a two-stage design:
Stage 1: Short-term Action–Location Alignment
- Trains an IMU encoder and a point cloud encoder.
- Performs alignment via four-modality contrastive learning (image, text, IMU, point cloud).
- Leverages pretrained vision-language models to guide training.
- Images and text are only required during training and are not needed at inference.
Stage 2: Sequential Motion Localization
- Freezes the Stage 1 encoders for feature extraction.
- A temporal reasoning module and a spatial reasoning module jointly predict trajectories.
- Includes a position-aware action recognition module.
Key Design 1: Four-Modality Contrastive Learning¶
For each 1-second time segment, four synchronized modality inputs are extracted:
- Egocentric image \(\mathbf{I}_t\) (encoded via CLIP ViT-Base)
- Action description \(\mathbf{L}_t\) (encoded via CLIP Text Transformer)
- IMU signal \(\mathbf{M}_t\) (encoded via ResNet18-1D at an 800 Hz sampling rate)
- Local point cloud \(\mathbf{P}_t\) (encoded via PointNet++ over a 1 m² region)
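As a concrete picture, here is a minimal PyTorch-style sketch of producing the four per-segment embeddings in a shared space. The stand-in IMU and point-cloud encoders, the 6-axis IMU layout, the 512-dimensional embedding, and the projection heads are illustrative assumptions rather than the paper's exact architecture; the image/text embeddings are assumed to come from frozen CLIP encoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoders(nn.Module):
    """Stand-in encoders for one 1-second segment (illustrative only).

    In the paper the image/text branches are frozen CLIP encoders, the IMU
    branch is a 1D ResNet-18, and the point cloud branch is PointNet++;
    simple stand-ins are used here to keep the sketch self-contained.
    """
    def __init__(self, dim=512):
        super().__init__()
        self.imu_encoder = nn.Sequential(            # stand-in for ResNet18-1D
            nn.Conv1d(6, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, dim))
        self.pcd_encoder = nn.Sequential(            # stand-in for PointNet++
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, dim))
        # Projection heads mapping IMU / point cloud features into the shared space.
        self.proj_imu = nn.Linear(dim, dim)
        self.proj_pcd = nn.Linear(dim, dim)

    def forward(self, imu, pcd, img_emb, txt_emb):
        # imu: (B, 6, 800) -- one second of 6-axis IMU data at 800 Hz
        # pcd: (B, N, 3)   -- local point cloud around the current position
        # img_emb, txt_emb: (B, dim) -- frozen CLIP image / text embeddings
        z_m = F.normalize(self.proj_imu(self.imu_encoder(imu)), dim=-1)
        z_p = F.normalize(self.proj_pcd(self.pcd_encoder(pcd).mean(dim=1)), dim=-1)
        z_i = F.normalize(img_emb.float(), dim=-1)
        z_l = F.normalize(txt_emb.float(), dim=-1)
        return z_m, z_p, z_i, z_l  # all L2-normalized, ready for contrastive alignment
```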
Contrastive loss: a weighted combination of pairwise contrastive terms that aligns the four modalities in a shared embedding space.
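A plausible form, assuming one symmetric InfoNCE term per modality pair anchored on the IMU and point-cloud embeddings (the assignment of each weight \(\alpha, \beta, \theta, \delta, \gamma\) to a specific pair is an assumption, not taken from the paper):

\[
\mathcal{L}_{\text{align}} = \alpha\,\ell(\mathbf{M}_t,\mathbf{P}_t) + \beta\,\ell(\mathbf{M}_t,\mathbf{I}_t) + \theta\,\ell(\mathbf{M}_t,\mathbf{L}_t) + \delta\,\ell(\mathbf{P}_t,\mathbf{I}_t) + \gamma\,\ell(\mathbf{P}_t,\mathbf{L}_t),
\]

where each pairwise term \(\ell(\mathbf{X},\mathbf{Y})\) would be a symmetric InfoNCE loss over a batch of \(B\) segments with temperature \(\tau\):

\[
\ell(\mathbf{X},\mathbf{Y}) = -\frac{1}{2B}\sum_{i=1}^{B}\left[\log\frac{\exp(x_i^{\top} y_i/\tau)}{\sum_{j}\exp(x_i^{\top} y_j/\tau)} + \log\frac{\exp(y_i^{\top} x_i/\tau)}{\sum_{j}\exp(y_i^{\top} x_j/\tau)}\right].
\]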
Hyperparameters: \(\alpha=0.1\); \(\beta, \theta, \delta, \gamma = 1\).
Key Design 2: Spatiotemporal Reasoning¶
Given a \(T=10\)-second IMU sequence and a global point cloud (uniformly divided into \(S=400\) segments):
- Correspondence Heatmap Generation: Computes similarity between IMU features and point cloud features; high-scoring regions indicate likely locations of the motion.
- Temporal Reasoning Module: A 3D convolutional network processes heatmap sequences and IMU features to reason over the temporal dimension.
- Spatial Reasoning Module: A dilated 3D convolutional network reasons over the two spatial dimensions of the heatmaps.
- Trajectory Prediction: Formulated as an \(S\)-class classification problem with cross-entropy loss.
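A minimal sketch of how Stage 2 could be wired, assuming the global point cloud is laid out as a 20×20 grid of segments (\(S = 400\)) and the frozen Stage 1 encoders already provide per-second IMU features and per-segment point cloud features; the channel widths, kernel sizes, and the omission of the direct IMU-feature input to the temporal module are simplifications of these notes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalLocalizer(nn.Module):
    """Sketch of Stage 2: correspondence heatmaps + temporal/spatial reasoning."""
    def __init__(self, grid=20):
        super().__init__()
        self.grid = grid  # S = grid * grid point cloud segments (20 x 20 = 400)
        # Temporal reasoning: 3D convolutions over (time, height, width).
        self.temporal = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 16, kernel_size=3, padding=1), nn.ReLU())
        # Spatial reasoning: dilated convolutions enlarge the spatial receptive field.
        self.spatial = nn.Sequential(
            nn.Conv3d(16, 16, kernel_size=(1, 3, 3), padding=(0, 2, 2), dilation=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(16, 1, kernel_size=1))

    def forward(self, imu_feats, pcd_feats):
        # imu_feats: (B, T, D) -- one frozen Stage-1 feature per 1-second segment (T = 10)
        # pcd_feats: (S, D)    -- one frozen Stage-1 feature per point cloud segment
        B, T, _ = imu_feats.shape
        # Correspondence heatmap: cosine similarity of each IMU feature with every segment.
        sim = F.normalize(imu_feats, dim=-1) @ F.normalize(pcd_feats, dim=-1).t()  # (B, T, S)
        heat = sim.view(B, 1, T, self.grid, self.grid)                             # (B, 1, T, H, W)
        logits = self.spatial(self.temporal(heat)).view(B, T, -1)                  # (B, T, S)
        return logits

# Training treats each time step as an S-way classification over segments, e.g.:
#   loss = F.cross_entropy(logits.flatten(0, 1), gt_segment_ids.flatten())
```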
Key Design 3: Position-aware Action Recognition¶
The predicted position probability distribution serves as spatial attention to weight point cloud features, which are then fused with IMU features and mapped to action categories via an MLP. The intuition is that knowing a person is near the sink facilitates recognition of dish-washing.
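A small sketch of this head; the feature dimension, fusion by concatenation, and the two-layer MLP are assumptions of these notes:

```python
import torch
import torch.nn as nn

class PositionAwareActionHead(nn.Module):
    """Sketch of the position-aware action recognition head."""
    def __init__(self, dim=512, num_actions=35):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_actions))

    def forward(self, pos_logits, pcd_feats, imu_feat):
        # pos_logits: (B, S) -- localization logits for the current time step
        # pcd_feats:  (S, D) -- per-segment point cloud features
        # imu_feat:   (B, D) -- IMU feature for the current time step
        attn = pos_logits.softmax(dim=-1)       # position distribution as spatial attention
        env_feat = attn @ pcd_feats             # (B, D) attended environment context
        return self.mlp(torch.cat([env_feat, imu_feat], dim=-1))  # (B, num_actions)
```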
Key Experimental Results¶
Main Results: Inertial Localization Accuracy¶
All values are percentages; the 0.2m/0.4m/0.6m columns report localization accuracy within the given distance threshold.

| Method | Type | Seen 0.2m | Seen 0.4m | Seen 0.6m | Seen RS | Unseen 0.2m | Unseen 0.4m | Unseen 0.6m | Unseen RS |
|---|---|---|---|---|---|---|---|---|---|
| RoNIN | Velocity Integration | 4.86 | 12.77 | 20.65 | / | 3.96 | 9.52 | 15.65 | / |
| IMUNet | Velocity Integration | 3.74 | 10.15 | 17.23 | / | 3.40 | 9.17 | 14.63 | / |
| NILoc+ | Direct Prediction | 17.03 | 41.31 | 74.15 | 88.17 | 13.32 | 37.85 | 69.21 | 84.08 |
| EAIL (Ours) | Direct Prediction | 43.86 | 70.15 | 89.60 | 96.01 | 26.86 | 65.97 | 90.79 | 89.55 |
Key Gains: 0.2 m accuracy improves from 17.03% to 43.86% (+26.83 points) in seen scenes, and from 13.32% to 26.86% (+13.54 points) in unseen scenes.
Ablation Study¶
| Ablation | Seen 0.2m | Seen 0.6m | Seen RS | Unseen 0.2m | Unseen 0.6m |
|---|---|---|---|---|---|
| w/o vision-language guidance | 39.75 | 87.56 | 95.70 | 20.41 | 89.83 |
| w/o action loss | 41.92 | 87.75 | 95.51 | 25.37 | 89.28 |
| w/o spatial reasoning | 38.68 | 87.78 | 95.05 | 25.03 | 83.54 |
| w/o temporal reasoning | 41.44 | 87.96 | 95.78 | 26.41 | 89.13 |
| Full model | 43.86 | 89.60 | 96.01 | 26.86 | 90.79 |
Action Recognition Results¶
| Method | Seen Top-1 (%) | Seen Top-5 (%) | Unseen Top-1 (%) | Unseen Top-5 (%) |
|---|---|---|---|---|
| DeepConvLSTM | 15.20 | 43.27 | 12.47 | 36.86 |
| IMU2CLIP | 18.96 | 50.43 | 12.27 | 37.04 |
| EAIL (Ours) | 21.48 | 53.62 | 15.03 | 43.34 |
Key Findings¶
- Direct localization substantially outperforms velocity integration: the error of velocity-integration methods exceeds EAIL's after only about 30 seconds, and their drift becomes severe over long sequences.
- Strong resistance to trajectory drift: EAIL's localization error does not grow over time (Fig. 4), whereas velocity-integration methods exhibit linear growth.
- Vision-language guidance is effective: Even without images and text at inference, the multimodal alignment from Stage 1 yields significant improvements.
- Action supervision benefits localization: Explicit action classification loss helps the model better align motion patterns with the environment.
- Spatial attention enhances action recognition: using IMU features alone performs roughly on par with IMU2CLIP, while adding position-aware spatial attention raises top-1 accuracy by more than 2.5 points.
Highlights & Insights¶
- Actions as Anchors: The paper elegantly exploits action–environment correlations as natural anchors, replacing conventional external signals such as GPS, WiFi, or Bluetooth.
- Training–Inference Modality Decoupling: Vision and language are leveraged during training, while only IMU and point clouds are required at inference, balancing performance and privacy.
- Mutually Beneficial Dual Outputs: Localization and action recognition are mutually reinforcing—predicted positions facilitate action recognition, and action supervision improves localization.
- Heatmap Interpretability: Stage 1 heatmaps intuitively reveal action–location associations (e.g., dish-washing highlights the sink region), while Stage 2 heatmaps converge to a single peak.
Limitations & Future Work¶
- The approach requires a pre-available 3D point cloud of the environment and is unsuitable for frequently changing scenes.
- When a person remains stationary for extended periods, the absence of action cues makes localization difficult.
- Only head-mounted IMU has been validated; other wearing positions (wrist, ankle) would require model adaptation.
- Top-1 action recognition accuracy remains low (21.48%), with substantial ambiguity among the 35 action categories.
Related Work & Insights¶
- RoNIN: A classical method for learning velocity prediction from IMU; velocity integration leads to drift.
- NILoc: Predicts positions directly but requires scene-specific training; this paper introduces point clouds to achieve cross-scene generalization.
- IMU2CLIP: Uses CLIP to guide IMU feature learning; this paper further incorporates point clouds and a localization task.
- EgoExo4D: Provides rich egocentric multimodal data including 564 hours of cooking activities.
Rating¶
⭐⭐⭐⭐ — Novel and practical problem formulation, insightful "actions as anchors" perspective, elegant multimodal alignment design, and substantial improvement in localization accuracy.