Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance¶
Metadata¶
- Conference: ICCV 2025
- arXiv: 2505.14346
- Code: GitHub
- Area: 3D Vision / Inertial Localization
- Keywords: inertial localization, IMU, point cloud, egocentric view, multimodal alignment, action recognition
TL;DR¶
The EAIL framework leverages egocentric action cues embedded in head-mounted IMU signals and employs hierarchical multimodal alignment (vision-language guidance) to learn associations between actions and environmental structures, enabling accurate inertial localization in 3D point clouds while simultaneously supporting action recognition.
Background & Motivation¶
Inertial localization (tracking human position via IMU) faces two primary challenges:
- Trajectory Drift: Sensor noise causes measurement errors to accumulate over time, eventually resulting in significant drift.
- Complexity of Human Actions: Wearable IMUs capture not only displacement-inducing motions (walking/stopping) but also actions that produce no positional change (e.g., head movement while cooking), complicating IMU signal processing.
Core Insight: Certain actions are strongly correlated with spatial environment structures (e.g., washing dishes occurs near the sink, bending to check the oven occurs in front of the oven). These actions can serve as spatial anchors to compensate for localization drift.
Limitations of prior work:
- Velocity-integration methods (RoNIN, IMUNet) exhibit rapidly growing errors over long sequences.
- NILoc predicts positions directly but requires scene-specific training, lacking cross-scene generalization.
- Existing datasets and methods focus primarily on walking scenarios, neglecting the diversity of complex human actions.
Method¶
Overall Architecture¶
EAIL adopts a two-stage design:
Stage 1: Short-term Action–Location Alignment
- Trains an IMU encoder and a point cloud encoder.
- Performs alignment via four-modality contrastive learning (image, text, IMU, point cloud).
- Leverages pretrained vision-language models to guide training.
- Images and text are only required during training and are not needed at inference.
Stage 2: Sequential Motion Localization
- Freezes the Stage 1 encoders for feature extraction.
- A temporal reasoning module and a spatial reasoning module jointly predict trajectories.
- Includes a position-aware action recognition module.
Key Design 1: Four-Modality Contrastive Learning¶
For each 1-second time segment, four synchronized modality inputs are extracted:
- Egocentric image \(\mathbf{I}_t\) (encoded via CLIP ViT-Base)
- Action description \(\mathbf{L}_t\) (encoded via CLIP Text Transformer)
- IMU signal \(\mathbf{M}_t\) (encoded via ResNet18-1D at an 800 Hz sampling rate)
- Local point cloud \(\mathbf{P}_t\) (encoded via PointNet++ over a 1 m² region)
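As a concrete picture, here is a minimal PyTorch-style sketch of producing the four per-segment embeddings in a shared space. The stand-in IMU and point-cloud encoders, the 6-axis IMU layout, the 512-dimensional embedding, and the projection heads are illustrative assumptions rather than the paper's exact architecture; the image/text embeddings are assumed to come from frozen CLIP encoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoders(nn.Module):
    """Stand-in encoders for one 1-second segment (illustrative only).

    In the paper the image/text branches are frozen CLIP encoders, the IMU
    branch is a 1D ResNet-18, and the point cloud branch is PointNet++;
    simple stand-ins are used here to keep the sketch self-contained.
    """
    def __init__(self, dim=512):
        super().__init__()
        self.imu_encoder = nn.Sequential(            # stand-in for ResNet18-1D
            nn.Conv1d(6, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, dim))
        self.pcd_encoder = nn.Sequential(            # stand-in for PointNet++
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, dim))
        # Projection heads mapping IMU / point cloud features into the shared space.
        self.proj_imu = nn.Linear(dim, dim)
        self.proj_pcd = nn.Linear(dim, dim)

    def forward(self, imu, pcd, img_emb, txt_emb):
        # imu: (B, 6, 800) -- one second of 6-axis IMU data at 800 Hz
        # pcd: (B, N, 3)   -- local point cloud around the current position
        # img_emb, txt_emb: (B, dim) -- frozen CLIP image / text embeddings
        z_m = F.normalize(self.proj_imu(self.imu_encoder(imu)), dim=-1)
        z_p = F.normalize(self.proj_pcd(self.pcd_encoder(pcd).mean(dim=1)), dim=-1)
        z_i = F.normalize(img_emb.float(), dim=-1)
        z_l = F.normalize(txt_emb.float(), dim=-1)
        return z_m, z_p, z_i, z_l  # all L2-normalized, ready for contrastive alignment
```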
Contrastive loss: a weighted combination of pairwise contrastive terms that aligns the four modalities in a shared embedding space.
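A plausible form, assuming one symmetric InfoNCE term per modality pair anchored on the IMU and point-cloud embeddings (the assignment of each weight \(\alpha, \beta, \theta, \delta, \gamma\) to a specific pair is an assumption, not taken from the paper):

\[
\mathcal{L}_{\text{align}} = \alpha\,\ell(\mathbf{M}_t,\mathbf{P}_t) + \beta\,\ell(\mathbf{M}_t,\mathbf{I}_t) + \theta\,\ell(\mathbf{M}_t,\mathbf{L}_t) + \delta\,\ell(\mathbf{P}_t,\mathbf{I}_t) + \gamma\,\ell(\mathbf{P}_t,\mathbf{L}_t),
\]

where each pairwise term \(\ell(\mathbf{X},\mathbf{Y})\) would be a symmetric InfoNCE loss over a batch of \(B\) segments with temperature \(\tau\):

\[
\ell(\mathbf{X},\mathbf{Y}) = -\frac{1}{2B}\sum_{i=1}^{B}\left[\log\frac{\exp(x_i^{\top} y_i/\tau)}{\sum_{j}\exp(x_i^{\top} y_j/\tau)} + \log\frac{\exp(y_i^{\top} x_i/\tau)}{\sum_{j}\exp(y_i^{\top} x_j/\tau)}\right].
\]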
Hyperparameters: \(\alpha=0.1\); \(\beta, \theta, \delta, \gamma = 1\).
Key Design 2: Spatiotemporal Reasoning¶
Given a \(T=10\)-second IMU sequence and a global point cloud (uniformly divided into \(S=400\) segments):
- Correspondence Heatmap Generation: Computes similarity between IMU features and point cloud features; high-scoring regions indicate likely locations of the motion.
- Temporal Reasoning Module: A 3D convolutional network processes heatmap sequences and IMU features to reason over the temporal dimension.
- Spatial Reasoning Module: A dilated 3D convolutional network reasons over the two spatial dimensions of the heatmaps.
- Trajectory Prediction: Formulated as an \(S\)-class classification problem with cross-entropy loss.
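A minimal sketch of how Stage 2 could be wired, assuming the global point cloud is laid out as a 20×20 grid of segments (\(S = 400\)) and the frozen Stage 1 encoders already provide per-second IMU features and per-segment point cloud features; the channel widths, kernel sizes, and the omission of the direct IMU-feature input to the temporal module are simplifications of these notes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalLocalizer(nn.Module):
    """Sketch of Stage 2: correspondence heatmaps + temporal/spatial reasoning."""
    def __init__(self, grid=20):
        super().__init__()
        self.grid = grid  # S = grid * grid point cloud segments (20 x 20 = 400)
        # Temporal reasoning: 3D convolutions over (time, height, width).
        self.temporal = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 16, kernel_size=3, padding=1), nn.ReLU())
        # Spatial reasoning: dilated convolutions enlarge the spatial receptive field.
        self.spatial = nn.Sequential(
            nn.Conv3d(16, 16, kernel_size=(1, 3, 3), padding=(0, 2, 2), dilation=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(16, 1, kernel_size=1))

    def forward(self, imu_feats, pcd_feats):
        # imu_feats: (B, T, D) -- one frozen Stage-1 feature per 1-second segment (T = 10)
        # pcd_feats: (S, D)    -- one frozen Stage-1 feature per point cloud segment
        B, T, _ = imu_feats.shape
        # Correspondence heatmap: cosine similarity of each IMU feature with every segment.
        sim = F.normalize(imu_feats, dim=-1) @ F.normalize(pcd_feats, dim=-1).t()  # (B, T, S)
        heat = sim.view(B, 1, T, self.grid, self.grid)                             # (B, 1, T, H, W)
        logits = self.spatial(self.temporal(heat)).view(B, T, -1)                  # (B, T, S)
        return logits

# Training treats each time step as an S-way classification over segments, e.g.:
#   loss = F.cross_entropy(logits.flatten(0, 1), gt_segment_ids.flatten())
```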
Key Design 3: Position-aware Action Recognition¶
The predicted position probability distribution serves as spatial attention to weight point cloud features, which are then fused with IMU features and mapped to action categories via an MLP. The intuition is that knowing a person is near the sink facilitates recognition of dish-washing.
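A small sketch of this head; the feature dimension, fusion by concatenation, and the two-layer MLP are assumptions of these notes:

```python
import torch
import torch.nn as nn

class PositionAwareActionHead(nn.Module):
    """Sketch of the position-aware action recognition head."""
    def __init__(self, dim=512, num_actions=35):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_actions))

    def forward(self, pos_logits, pcd_feats, imu_feat):
        # pos_logits: (B, S) -- localization logits for the current time step
        # pcd_feats:  (S, D) -- per-segment point cloud features
        # imu_feat:   (B, D) -- IMU feature for the current time step
        attn = pos_logits.softmax(dim=-1)       # position distribution as spatial attention
        env_feat = attn @ pcd_feats             # (B, D) attended environment context
        return self.mlp(torch.cat([env_feat, imu_feat], dim=-1))  # (B, num_actions)
```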
Key Experimental Results¶
Main Results: Inertial Localization Accuracy¶
All values are percentages; the 0.2m/0.4m/0.6m columns report localization accuracy within the given distance threshold.

| Method | Type | Seen 0.2m | Seen 0.4m | Seen 0.6m | Seen RS | Unseen 0.2m | Unseen 0.4m | Unseen 0.6m | Unseen RS |
|---|---|---|---|---|---|---|---|---|---|
| RoNIN | Velocity Integration | 4.86 | 12.77 | 20.65 | / | 3.96 | 9.52 | 15.65 | / |
| IMUNet | Velocity Integration | 3.74 | 10.15 | 17.23 | / | 3.40 | 9.17 | 14.63 | / |
| NILoc+ | Direct Prediction | 17.03 | 41.31 | 74.15 | 88.17 | 13.32 | 37.85 | 69.21 | 84.08 |
| EAIL (Ours) | Direct Prediction | 43.86 | 70.15 | 89.60 | 96.01 | 26.86 | 65.97 | 90.79 | 89.55 |
Key Gains: 0.2 m accuracy improves from 17.03% to 43.86% (+26.83 points) in seen scenes, and from 13.32% to 26.86% (+13.54 points) in unseen scenes.
Ablation Study¶
| Ablation | Seen 0.2m | Seen 0.6m | Seen RS | Unseen 0.2m | Unseen 0.6m |
|---|---|---|---|---|---|
| w/o vision-language guidance | 39.75 | 87.56 | 95.70 | 20.41 | 89.83 |
| w/o action loss | 41.92 | 87.75 | 95.51 | 25.37 | 89.28 |
| w/o spatial reasoning | 38.68 | 87.78 | 95.05 | 25.03 | 83.54 |
| w/o temporal reasoning | 41.44 | 87.96 | 95.78 | 26.41 | 89.13 |
| Full model | 43.86 | 89.60 | 96.01 | 26.86 | 90.79 |
Action Recognition Results¶
| Method | Seen Top-1 (%) | Seen Top-5 (%) | Unseen Top-1 (%) | Unseen Top-5 (%) |
|---|---|---|---|---|
| DeepConvLSTM | 15.20 | 43.27 | 12.47 | 36.86 |
| IMU2CLIP | 18.96 | 50.43 | 12.27 | 37.04 |
| EAIL (Ours) | 21.48 | 53.62 | 15.03 | 43.34 |
Key Findings¶
- Direct localization substantially outperforms velocity integration: the error of velocity-integration methods exceeds EAIL's after only about 30 seconds, and their drift becomes severe over long sequences.
- Strong resistance to trajectory drift: EAIL's localization error does not grow over time (Fig. 4), whereas velocity-integration methods exhibit linear growth.
- Vision-language guidance is effective: Even without images and text at inference, the multimodal alignment from Stage 1 yields significant improvements.
- Action supervision benefits localization: Explicit action classification loss helps the model better align motion patterns with the environment.
- Spatial attention enhances action recognition: using IMU features alone performs roughly on par with IMU2CLIP, while adding position-aware spatial attention raises top-1 accuracy by more than 2.5 points.
Highlights & Insights¶
- Actions as Anchors: The paper elegantly exploits action–environment correlations as natural anchors, replacing conventional external signals such as GPS, WiFi, or Bluetooth.
- Training–Inference Modality Decoupling: Vision and language are leveraged during training, while only IMU and point clouds are required at inference, balancing performance and privacy.
- Mutually Beneficial Dual Outputs: Localization and action recognition are mutually reinforcing—predicted positions facilitate action recognition, and action supervision improves localization.
- Heatmap Interpretability: Stage 1 heatmaps intuitively reveal action–location associations (e.g., dish-washing highlights the sink region), while Stage 2 heatmaps converge to a single peak.
Limitations & Future Work¶
- The approach requires a pre-available 3D point cloud of the environment and is unsuitable for frequently changing scenes.
- When a person remains stationary for extended periods, the absence of action cues makes localization difficult.
- Only head-mounted IMU has been validated; other wearing positions (wrist, ankle) would require model adaptation.
- Top-1 action recognition accuracy remains low (21.48%), with substantial ambiguity among the 35 action categories.
Related Work & Insights¶
- RoNIN: A classical method for learning velocity prediction from IMU; velocity integration leads to drift.
- NILoc: Predicts positions directly but requires scene-specific training; this paper introduces point clouds to achieve cross-scene generalization.
- IMU2CLIP: Uses CLIP to guide IMU feature learning; this paper further incorporates point clouds and a localization task.
- EgoExo4D: Provides rich egocentric multimodal data including 564 hours of cooking activities.
Rating¶
⭐⭐⭐⭐ — Novel and practical problem formulation, insightful "actions as anchors" perspective, elegant multimodal alignment design, and substantial improvement in localization accuracy.