Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span¶
Conference: NeurIPS 2025 | arXiv: 2511.18470 | Code: Available | Area: 3D Vision | Keywords: Egocentric Vision, Gaze Prediction, 3D Visual Span, SLAM, Voxel Prediction
TL;DR¶
This paper proposes EgoSpanLift, a method that lifts egocentric 2D gaze predictions into 3D space, constructing multi-level volumetric visual span representations. Combined with a 3D U-Net and a causal Transformer, the framework forecasts future 3D regions of visual attention.
Background & Motivation¶
People accomplish daily activities through continuous perception and interaction with their environment, with visual perception serving as the foundation guiding human behavior — "look before you act." Existing egocentric understanding research focuses primarily on action prediction and contact-based interaction, leaving prediction of visual perception itself largely unexplored.
Limitations of existing 2D gaze prediction:
Ambiguous definition in dynamic scenes: 2D gaze prediction must model the user's self-motion and attention simultaneously, yet both are naturally expressed in 3D space.
Information loss from 2D projection: Gaze direction and self-motion inherently point to specific locations in 3D space, not arbitrary regions on a 2D image plane.
Inability to predict beyond the current field of view: Users may turn their heads to attend to regions outside the current frame.
This paper introduces the novel problem of egocentric 3D visual span forecasting: predicting where in the 3D environment a user's visual attention will be directed in the future.
Method¶
Overall Architecture¶
The system consists of three components:

1. EgoSpanLift: Lifts 2D gaze from the image plane into volumetric regions in the 3D scene.
2. Prediction Network: A 3D U-Net (spatial encoder) combined with a causal Transformer (temporal modeling) for future visual span prediction.
3. Benchmark: A 364.6K-sample benchmark curated from raw egocentric multi-sensor data.
Key Designs¶
1. EgoSpanLift — Lifting Gaze from 2D to 3D¶
Inputs: Semi-dense 3D keypoints \(\mathcal{P}\) from SLAM (each containing position \(\mathbf{p}_i \in \mathbb{R}^3\), confidence \(\sigma_i\), and observation timestamp \(t_i\)), and localization information \(\mathcal{E}\) (SE(3) transformation matrices \(\mathbf{E}_t\)).
Observation-based keypoint selection:
Triple filtering: temporal window filtering (retaining only recently observed keypoints), spatial filtering (within a \(D=3.2\)m cube), and statistical outlier filtering (removing invalid points while preserving dynamic objects such as hands and moving people).
Gaze-based keypoint classification: After transforming keypoints into the local coordinate frame (denoting the transformed point \(\tilde{\mathbf{p}}_i\)), a 3D gaze cone determines whether each point falls within the visual span:

\[
\arccos\!\left(\frac{\tilde{\mathbf{p}}_i \cdot \mathbf{g}_t}{\lVert \tilde{\mathbf{p}}_i \rVert \, \lVert \mathbf{g}_t \rVert}\right) \le \theta,
\]

where \(\theta\) is the eccentricity angle threshold and \(\mathbf{g}_t\) is the gaze direction.
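A minimal NumPy sketch of the selection and classification steps, assuming \(\mathbf{E}_t\) maps world to local coordinates and the filter cube is centered on the user; the function name, the 2 s time window, and the omission of the outlier filter are illustrative choices, not the authors' implementation:

```python
import numpy as np

def select_and_classify(points, timestamps, t_now, E_t, gaze_dir,
                        theta_deg, time_window=2.0, cube_side=3.2):
    """Sketch of keypoint selection and gaze-cone classification.

    points:      (N, 3) SLAM keypoint positions in world coordinates
    timestamps:  (N,) last-observation time of each keypoint (seconds)
    t_now:       current timestamp (seconds)
    E_t:         (4, 4) world-to-local SE(3) transform (assumption)
    gaze_dir:    (3,) unit gaze direction g_t in the local frame
    theta_deg:   eccentricity threshold of the visual-span level (degrees)
    Returns the local-frame keypoints whose angle to g_t is within theta_deg.
    (The statistical outlier filter is omitted for brevity.)
    """
    # Temporal filtering: keep only recently observed keypoints.
    pts = points[(t_now - timestamps) <= time_window]

    # Transform to the local coordinate frame.
    homo = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    local = (E_t @ homo.T).T[:, :3]

    # Spatial filtering: keep points inside a cube of side D around the user.
    local = local[np.all(np.abs(local) <= cube_side / 2.0, axis=1)]

    # Gaze-cone test: angle between each point's direction and the gaze direction.
    dirs = local / (np.linalg.norm(local, axis=1, keepdims=True) + 1e-8)
    angles = np.degrees(np.arccos(np.clip(dirs @ gaze_dir, -1.0, 1.0)))
    return local[angles <= theta_deg]
```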
2. Multi-Level Volumetric Region Localization¶
Inspired by the visual science literature, four hierarchical levels of visual span are defined:
| Level | Eccentricity \(\theta\) | Meaning |
|---|---|---|
| Foveal | 2° | Region targeted by conventional 2D gaze |
| Central | 8° | Compensates for sparse coverage of semi-dense keypoints |
| Near Peripheral | 30° | Broader peripheral perception range |
| Orientation | 55° | Field of view centered on head orientation |
Classified keypoints are voxelized into a 3D grid (resolution \(R\), side length \(D\)), with binary occupancy computed as:

\[
V_l(v) = \mathbb{1}\!\left[\exists\, \mathbf{p}_i \in \mathcal{P}_l : \mathbf{p}_i \in v\right],
\]

i.e., a voxel \(v\) at level \(l\) is occupied if at least one keypoint classified into that level (\(\mathcal{P}_l\)) falls inside it.
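A matching occupancy-grid sketch of the voxelization step; the resolution `R = 64` is an illustrative value, not taken from the paper:

```python
import numpy as np

def voxelize(points_local, R=64, D=3.2):
    """Binary occupancy grid over a cube of side D centered at the user.

    points_local: (M, 3) classified keypoints in the local frame
    R:            grid resolution per axis (illustrative value)
    D:            cube side length in meters
    Returns an (R, R, R) uint8 grid: 1 if any keypoint falls in the voxel.
    """
    grid = np.zeros((R, R, R), dtype=np.uint8)
    # Map the cube [-D/2, D/2) onto voxel indices [0, R).
    idx = np.floor((points_local + D / 2.0) / D * R).astype(int)
    valid = np.all((idx >= 0) & (idx < R), axis=1)
    i, j, k = idx[valid].T
    grid[i, j, k] = 1
    return grid
```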
3. Prediction Network¶
Autoregressive Encoder:
- Input: \(T_p \times (4+1) \times R \times R \times R\) voxel grids (4 visual span levels + full scene)
- A 3D U-Net encoder compresses the spatial dimensions (reduction factor \(R\)), yielding \(T_p \times C\) temporal features
- A global embedding is appended as a prediction token, forming a \((T_p+1) \times C\) feature sequence
- A causal Transformer learns temporal dependencies, ensuring information flows toward the final global embedding
Decoder:
- The output embedding is upsampled via the U-Net decoder
- Residual connections from intermediate encoder features are incorporated
- Sigmoid activation produces a \(4 \times R \times R \times R\) soft occupancy map in \([0,1]\)
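A compact PyTorch sketch of this encoder/decoder pipeline; channel widths, layer counts, and the fixed \(R=32\) are illustrative assumptions, and the U-Net skip connections are omitted for brevity:

```python
import torch
import torch.nn as nn

class SpanForecaster(nn.Module):
    """Minimal sketch of the 3D U-Net + causal Transformer forecaster (illustrative sizes)."""

    def __init__(self, in_ch=5, out_ch=4, dim=256, n_layers=4, n_heads=8):
        super().__init__()
        # Encoder: stride-2 3D convolutions compress the R^3 grid to a per-frame feature.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv3d(64, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool3d(1),                         # -> (B*T, dim, 1, 1, 1)
        )
        # Learned global embedding appended as the prediction token.
        self.pred_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)
        # Decoder: upsample the prediction token back to a 32^3 soft occupancy map.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(dim, 64, 4, stride=4), nn.GELU(),   # 1 -> 4
            nn.ConvTranspose3d(64, 32, 4, stride=4), nn.GELU(),    # 4 -> 16
            nn.ConvTranspose3d(32, out_ch, 2, stride=2),           # 16 -> 32
        )

    def forward(self, voxels):
        """voxels: (B, T_p, 5, 32, 32, 32) past visual-span + scene grids."""
        B, T = voxels.shape[:2]
        feats = self.encoder(voxels.flatten(0, 1)).view(B, T, -1)          # (B, T, dim)
        seq = torch.cat([feats, self.pred_token.expand(B, -1, -1)], dim=1)  # (B, T+1, dim)
        # Causal mask: each position, including the final prediction token,
        # only attends to earlier frames.
        mask = nn.Transformer.generate_square_subsequent_mask(T + 1).to(voxels.device)
        out = self.temporal(seq, mask=mask)
        token = out[:, -1].view(B, -1, 1, 1, 1)                             # global embedding
        return torch.sigmoid(self.decoder(token))                          # (B, 4, 32, 32, 32)
```

For example, `SpanForecaster()(torch.rand(2, 8, 5, 32, 32, 32))` yields a `(2, 4, 32, 32, 32)` soft occupancy map, one channel per visual-span level.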
Loss & Training¶
Since visual spans occupy only a tiny fraction of the total space (foveal span < 1%), standard cross-entropy fails to learn meaningful signals. Dice Loss is adopted:

\[
\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_{v} \hat{y}_v\, y_v + \epsilon}{\sum_{v} \hat{y}_v + \sum_{v} y_v + \epsilon},
\]

where \(\hat{y}_v \in [0,1]\) is the predicted soft occupancy of voxel \(v\), \(y_v \in \{0,1\}\) the ground-truth occupancy, and \(\epsilon\) a small smoothing constant.
Joint multi-level training outperforms single-task training, with particularly significant improvements in foveal span prediction.
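A possible implementation of this objective, under the assumption that the Dice score is computed per visual-span level and then averaged (consistent with the joint multi-level training described above):

```python
import torch

def multi_level_dice_loss(pred, target, eps=1.0):
    """Soft Dice loss per visual-span level, averaged over levels.

    pred:   (B, 4, R, R, R) sigmoid outputs in [0, 1]
    target: (B, 4, R, R, R) binary ground-truth occupancy
    """
    dims = (0, 2, 3, 4)                        # reduce over batch and spatial dims
    inter = (pred * target).sum(dims)
    denom = pred.sum(dims) + target.sum(dims)
    dice = (2 * inter + eps) / (denom + eps)   # one score per level
    return 1 - dice.mean()                     # joint multi-level objective
```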
Key Experimental Results¶
Main Results¶
FoVS-Aria test set (daily activities, 23.2K samples):
| Method | Orientation IoU | Peripheral IoU | Central IoU | Foveal IoU |
|---|---|---|---|---|
| CSTS + EgoSpanLift | — | 0.457 | 0.234 | 0.139 |
| EgoChoir | 0.496 | 0.430 | 0.261 | 0.199 |
| Ours (full) | 0.584 | 0.489 | 0.351 | 0.284 |
FoVS-EgoExo test set (skilled activities, 341.4K samples):
| Method | Orientation IoU | Peripheral IoU | Central IoU | Foveal IoU |
|---|---|---|---|---|
| CSTS + EgoSpanLift | — | 0.498 | 0.287 | 0.156 |
| EgoChoir | 0.329 | 0.285 | 0.198 | 0.127 |
| Ours (full) | 0.523 | 0.511 | 0.421 | 0.369 |
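For reference, the voxel IoU reported in these tables can be computed as in the sketch below; the 0.5 binarization threshold is an assumption, not a value stated in the paper:

```python
import numpy as np

def voxel_iou(pred, target, thresh=0.5):
    """IoU between a thresholded soft prediction and binary ground truth.

    pred:   (R, R, R) soft occupancy in [0, 1]
    target: (R, R, R) binary ground-truth occupancy
    """
    p = pred >= thresh
    t = target.astype(bool)
    inter = np.logical_and(p, t).sum()
    union = np.logical_or(p, t).sum()
    return inter / union if union > 0 else 1.0
```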
3D foveal localization error (distance distribution):
| Method | Min (cm) | Mean (cm) | Max (cm) |
|---|---|---|---|
| CSTS | 59.71 | 73.79 | 87.68 |
| Ours | 19.04 | 34.85 | 51.23 |
Ablation Study¶
| Configuration | Orientation IoU | Central IoU | Foveal IoU |
|---|---|---|---|
| w/o prior spans | 0.342 | 0.107 | 0.059 |
| BCE loss | 0.573 | 0.284 | 0.206 |
| Single-task training | 0.583 | 0.335 | 0.249 |
| w/o global embedding | 0.560 | 0.324 | 0.262 |
| Full model | 0.584 | 0.351 | 0.284 |
Key Findings¶
- The proposed method substantially outperforms all baselines across all levels, exceeding prior work by 50%+ on foveal span prediction.
- Prior visual span information is critical — removing it causes a dramatic performance drop.
- Joint multi-level training significantly outperforms single-task training, exploiting mutual cues between peripheral and foveal attention.
- Back-projecting 3D predictions to 2D (without any 2D-specific training) matches the performance of dedicated 2D gaze methods (a projection sketch follows this list).
- Inference latency is only 71.2 ms, satisfying real-time requirements.
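A hedged sketch of how such a 3D-to-2D back-projection might look, assuming a pinhole camera with intrinsics `K`, a local-to-camera SE(3) transform, and a 0.5 occupancy threshold; names and values are illustrative, not the paper's exact protocol:

```python
import numpy as np

def project_to_2d(pred, K, E_cam, R=32, D=3.2, H=480, W=640, thresh=0.5):
    """Project predicted occupied voxel centers into the image as a 2D hit map.

    pred:  (R, R, R) soft occupancy for one visual-span level (local frame)
    K:     (3, 3) camera intrinsics
    E_cam: (4, 4) local-frame-to-camera SE(3) transform
    """
    heat = np.zeros((H, W), dtype=np.float32)
    idx = np.argwhere(pred >= thresh)                 # occupied voxel indices
    if len(idx) == 0:
        return heat
    centers = (idx + 0.5) / R * D - D / 2.0           # voxel centers in the local frame
    homo = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)
    cam = (E_cam @ homo.T).T[:, :3]
    cam = cam[cam[:, 2] > 0]                          # keep points in front of the camera
    uv = (K @ cam.T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    heat[uv[ok, 1], uv[ok, 0]] = 1.0
    return heat
```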
Highlights & Insights¶
- Novel problem formulation: The first work to formally define egocentric 3D visual span forecasting.
- Visual science grounding: The multi-level visual span hierarchy is derived from established visual science taxonomy.
- Practical efficiency: Built on SLAM keypoints (not dense reconstruction), with low latency (71.2 ms), making it suitable for AR/VR applications.
- Bidirectional validation: Effectiveness in both 2D→3D and 3D→2D directions demonstrates the superiority of 3D modeling.
- Large-scale benchmark: A 364.6K-sample evaluation platform is curated and released.
Limitations & Future Work¶
- Performance depends on the quality of semi-dense SLAM keypoints, which may be insufficient in certain scenes.
- Absolute foveal span IoU remains moderate (~0.28–0.37), leaving room for improvement in fine-grained prediction.
- Large-scale dynamic scenes (e.g., soccer, basketball) are excluded, limiting generalizability.
- Voxelization is the primary computational bottleneck (45 ms) and requires optimization for more constrained devices.
- Non-visual modalities (auditory, proprioceptive) are not incorporated, potentially missing important cues.
Related Work & Insights¶
- CSTS: Current state-of-the-art 2D gaze prediction method based on multimodal contrastive spatiotemporal fusion.
- EgoChoir: Predicts 3D interaction hotspots from synthetic geometry and motion; serves as the strongest baseline.
- Ego-Exo4D / Aria Everyday Activities: Sources of training and evaluation data.
- Insight: Lifting 2D tasks into 3D is a general and effective strategy that may extend to other egocentric prediction tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Defines a novel problem and proposes a systematic solution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Two datasets (364.6K samples), multiple baselines, comprehensive ablations, and 2D back-projection validation.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with natural integration of visual science background.
- Value: ⭐⭐⭐⭐ Direct applicability to AR/VR and assistive technologies.