
Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span

Conference: NeurIPS 2025 | arXiv: 2511.18470 | Code: Available | Area: 3D Vision
Keywords: Egocentric Vision, Gaze Prediction, 3D Visual Span, SLAM, Voxel Prediction

TL;DR

This paper proposes EgoSpanLift, a method that lifts egocentric 2D gaze predictions into 3D space, constructing multi-level volumetric visual span representations. Combined with a 3D U-Net and a causal Transformer, the framework forecasts future 3D regions of visual attention.

Background & Motivation

People accomplish daily activities through continuous perception and interaction with their environment, with visual perception serving as the foundation guiding human behavior — "look before you act." Existing egocentric understanding research focuses primarily on action prediction and contact-based interaction, leaving prediction of visual perception itself largely unexplored.

Limitations of existing 2D gaze prediction:

Ambiguous definition in dynamic scenes: 2D gaze prediction must model the user's self-motion and shifting attention simultaneously, yet both are naturally expressed in 3D space rather than on the image plane.

Information loss from 2D projection: Gaze direction and self-motion inherently point to specific locations in 3D space, not arbitrary regions on a 2D image plane.

Inability to predict beyond the current field of view: Users may turn their heads to attend to regions outside the current frame.

This paper introduces the novel problem of egocentric 3D visual span forecasting: predicting where in the 3D environment a user's visual attention will be directed in the future.

Method

Overall Architecture

The system consists of three components:

  1. EgoSpanLift: lifts 2D gaze from the image plane into volumetric regions of the 3D scene.
  2. Prediction network: a 3D U-Net (spatial encoder) combined with a causal Transformer (temporal modeling) forecasts future visual spans.
  3. Benchmark: a 364.6K-sample benchmark curated from raw egocentric multi-sensor data.

Key Designs

1. EgoSpanLift — Lifting Gaze from 2D to 3D

Inputs: Semi-dense 3D keypoints \(\mathcal{P}\) from SLAM (each containing position \(\mathbf{p}_i \in \mathbb{R}^3\), confidence \(\sigma_i\), and observation timestamp \(t_i\)), and localization information \(\mathcal{E}\) (SE(3) transformation matrices \(\mathbf{E}_t\)).

Observation-based keypoint selection:

\[\mathcal{P}_t = \{p_i \in \mathcal{P} \mid t_i = t, \|\mathbf{p}_i - \mathbf{t}_t\|_1 < D/2, \mathcal{I}_f(p_i; \mathcal{P}_t) = 1\}\]

Triple filtering is applied: a temporal window filter (retaining only recently observed keypoints), a spatial filter (keeping points within a \(D=3.2\) m cube centered on the device), and a statistical outlier filter (removing invalid points while preserving dynamic objects such as hands and moving people).
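A minimal sketch of this selection step, assuming numpy arrays for keypoint positions and timestamps; the window length, neighbor count, and function names are illustrative rather than the paper's exact configuration:

```python
import numpy as np

def select_keypoints(points, timestamps, t, device_pos,
                     window_s=2.0, D=3.2, k=16, std_ratio=2.0):
    """Temporal, spatial, and statistical-outlier filtering of SLAM keypoints.

    points      : (N, 3) keypoint positions in world coordinates
    timestamps  : (N,) last observation time of each keypoint (seconds)
    t           : current timestamp (seconds)
    device_pos  : (3,) device translation at time t
    window_s, k, std_ratio : assumed values for illustration
    """
    # 1) Temporal filtering: keep only recently observed keypoints.
    recent = (timestamps <= t) & (timestamps > t - window_s)
    # 2) Spatial filtering: keep points inside a D-sized cube centered on the device.
    in_cube = np.all(np.abs(points - device_pos) < D / 2, axis=1)
    cand = points[recent & in_cube]
    if len(cand) < k + 1:
        return cand
    # 3) Statistical outlier removal: drop points whose mean distance to their
    #    k nearest neighbors deviates strongly from the population statistics.
    d = np.linalg.norm(cand[:, None, :] - cand[None, :, :], axis=-1)
    knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)
    keep = knn_mean < knn_mean.mean() + std_ratio * knn_mean.std()
    return cand[keep]
```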

Gaze-based keypoint classification: After transforming keypoints into the local coordinate frame, a 3D gaze cone determines whether each point falls within the visual span:

\[Q_t^{\theta, \mathbf{g}_t} = \left\{p_i \in \mathcal{P}_t \mid \frac{\langle \mathbf{E}_t^{-1}\mathbf{p}_i, \mathbf{g}_t \rangle}{\|\mathbf{E}_t^{-1}\mathbf{p}_i\| \|\mathbf{g}_t\|} > \cos\theta \right\}\]

where \(\theta\) is the eccentricity angle threshold and \(\mathbf{g}_t\) is the gaze direction.
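A compact sketch of this cone test, assuming \(\mathbf{E}_t\) is given as a 4×4 device-to-world SE(3) matrix and the gaze direction is expressed in the device frame (the function name and argument layout are illustrative):

```python
import numpy as np

def classify_by_gaze_cone(points_world, E_t, gaze_dir, theta_deg):
    """Return a boolean mask of keypoints inside the 3D gaze cone.

    points_world : (N, 3) filtered keypoints P_t in world coordinates
    E_t          : (4, 4) device-to-world pose; its inverse maps world -> device
    gaze_dir     : (3,) gaze direction g_t in the device frame
    theta_deg    : eccentricity threshold in degrees (e.g. 30 for near peripheral)
    """
    # Transform keypoints into the local (device) coordinate frame.
    E_inv = np.linalg.inv(E_t)
    p_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    p_local = (E_inv @ p_h.T).T[:, :3]
    # Cosine of the angle between each local point direction and the gaze direction.
    cos_angle = (p_local @ gaze_dir) / (
        np.linalg.norm(p_local, axis=1) * np.linalg.norm(gaze_dir) + 1e-8
    )
    return cos_angle > np.cos(np.deg2rad(theta_deg))
```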

2. Multi-Level Volumetric Region Localization

Inspired by the visual science literature, four hierarchical levels of visual span are defined:

| Level | Eccentricity \(\theta\) | Meaning |
| --- | --- | --- |
| Foveal | | Region targeted by conventional 2D gaze |
| Central | | Compensates for sparse coverage of semi-dense keypoints |
| Near peripheral | 30° | Broader peripheral perception range |
| Orientation | 55° | Field of view centered on head orientation |

Classified keypoints are voxelized into a 3D grid (resolution \(R\), side length \(D\)), with binary occupancy computed as:

\[V_{[t_b,t_e]}^{\theta,\mathbf{g}}(i,j,k) = \mathcal{I}\left(\left|\left\{\mathbf{p} \in \textstyle\bigcup_{t \in [t_b,t_e]} Q_t^{\theta,\mathbf{g}_t} \;\middle|\; \mathbf{p} \text{ falls in voxel } (i,j,k)\right\}\right| > 0\right)\]
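A simple voxelization sketch for one span level, assuming an \(R=32\) grid and classified points already accumulated in the local frame (the resolution and function name are assumptions for illustration):

```python
import numpy as np

def voxelize_span(points_local, D=3.2, R=32):
    """Binary occupancy grid for one visual-span level.

    points_local : (N, 3) classified keypoints in the local frame, accumulated
                   over the observation window [t_b, t_e]
    D            : side length of the voxelized cube (metres)
    R            : grid resolution per axis (assumed value)
    """
    grid = np.zeros((R, R, R), dtype=np.uint8)
    if len(points_local) == 0:
        return grid
    # Map coordinates from [-D/2, D/2) to voxel indices [0, R).
    idx = np.floor((points_local + D / 2) / D * R).astype(int)
    # Keep only points that land inside the grid.
    valid = np.all((idx >= 0) & (idx < R), axis=1)
    i, j, k = idx[valid].T
    grid[i, j, k] = 1
    return grid
```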

3. Prediction Network

Autoregressive Encoder:

  • Input: \(T_p \times (4+1) \times R \times R \times R\) voxel grids (4 visual span levels + the full scene).
  • A 3D U-Net encoder compresses the spatial dimensions (reduction factor \(R\)), yielding \(T_p \times C\) temporal features.
  • A global embedding is appended as a prediction token, forming a \((T_p+1) \times C\) feature sequence.
  • A causal Transformer learns temporal dependencies, ensuring information flows only forward, toward the final global embedding.

Decoder:

  • The output embedding is upsampled via the U-Net decoder.
  • Residual connections from intermediate encoder features are incorporated.
  • A sigmoid activation produces a \(4 \times R \times R \times R\) soft occupancy map in \([0,1]\).
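The PyTorch sketch below mirrors this encoder/decoder layout under assumed sizes (\(R=32\), 64-dimensional temporal features) and omits the U-Net residual skip connections for brevity; all layer names, channel widths, and hyperparameters are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpanForecaster(nn.Module):
    """Simplified 3D encoder + causal Transformer + 3D decoder (illustrative only)."""

    def __init__(self, in_ch=5, out_ch=4, feat=64, n_layers=4, n_heads=8):
        super().__init__()
        # Spatial encoder: strided 3D convolutions compress a 32^3 grid to a vector.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),   # 16^3
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),      # 8^3
            nn.Conv3d(32, feat, 3, stride=2, padding=1), nn.ReLU(),    # 4^3
            nn.AdaptiveAvgPool3d(1),                                    # 1^3
        )
        # Global embedding appended to the sequence as the prediction token.
        self.pred_token = nn.Parameter(torch.randn(1, 1, feat) * 0.02)
        layer = nn.TransformerEncoderLayer(feat, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)
        # Decoder: transposed 3D convolutions upsample the token back to 32^3.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(feat, 32, 4, stride=4), nn.ReLU(),      # 4^3
            nn.ConvTranspose3d(32, 16, 4, stride=4), nn.ReLU(),        # 16^3
            nn.ConvTranspose3d(16, out_ch, 2, stride=2),                # 32^3
        )

    def forward(self, voxels):                       # voxels: (B, T, 5, 32, 32, 32)
        B, T = voxels.shape[:2]
        feats = self.encoder(voxels.flatten(0, 1))   # (B*T, C, 1, 1, 1)
        feats = feats.view(B, T, -1)                 # (B, T, C)
        seq = torch.cat([feats, self.pred_token.expand(B, -1, -1)], dim=1)
        # Causal mask: information flows only forward, into the final token.
        mask = torch.triu(
            torch.full((T + 1, T + 1), float("-inf"), device=seq.device), diagonal=1
        )
        seq = self.temporal(seq, mask=mask)
        pred = seq[:, -1].view(B, -1, 1, 1, 1)       # prediction token
        return torch.sigmoid(self.decoder(pred))     # (B, 4, 32, 32, 32) occupancy
```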

Loss & Training

Since visual spans occupy only a tiny fraction of the total space (foveal span < 1%), standard cross-entropy fails to learn meaningful signals. Dice Loss is adopted:

\[\mathcal{L} = 1 - \frac{2\sum_{i,j,k} \tilde{Y}_{ijk}\, Y_{ijk}}{\sum_{i,j,k} \tilde{Y}_{ijk} + \sum_{i,j,k} Y_{ijk} + 1}\]

Joint multi-level training outperforms single-task training, with particularly significant improvements in foveal span prediction.
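A minimal PyTorch version of this Dice loss, computed jointly over the four span levels; the smoothing constant is taken from the \(+1\) term in the formula above, and the function name is illustrative:

```python
import torch

def dice_loss(pred, target, eps=1.0):
    """Soft Dice loss over the four visual-span levels.

    pred, target : (B, 4, R, R, R) soft predictions in [0, 1] and binary targets.
    eps          : smoothing term matching the +1 in the formula above.
    """
    dims = (2, 3, 4)                             # sum over the voxel grid
    inter = (pred * target).sum(dim=dims)
    denom = pred.sum(dim=dims) + target.sum(dim=dims)
    dice = (2 * inter) / (denom + eps)
    return (1 - dice).mean()                     # average over batch and levels
```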

Key Experimental Results

Main Results

FoVS-Aria test set (daily activities, 23.2K samples):

| Method | Orientation IoU | Peripheral IoU | Central IoU | Foveal IoU |
| --- | --- | --- | --- | --- |
| CSTS + EgoSpanLift | — | 0.457 | 0.234 | 0.139 |
| EgoChoir | 0.496 | 0.430 | 0.261 | 0.199 |
| Ours (full) | 0.584 | 0.489 | 0.351 | 0.284 |

FoVS-EgoExo test set (skilled activities, 341.4K samples):

| Method | Orientation IoU | Peripheral IoU | Central IoU | Foveal IoU |
| --- | --- | --- | --- | --- |
| CSTS + EgoSpanLift | — | 0.498 | 0.287 | 0.156 |
| EgoChoir | 0.329 | 0.285 | 0.198 | 0.127 |
| Ours (full) | 0.523 | 0.511 | 0.421 | 0.369 |

3D foveal localization error (distance distribution):

| Method | Min (cm) | Mean (cm) | Max (cm) |
| --- | --- | --- | --- |
| CSTS | 59.71 | 73.79 | 87.68 |
| Ours | 19.04 | 34.85 | 51.23 |

Ablation Study

| Configuration | Orientation IoU | Central IoU | Foveal IoU |
| --- | --- | --- | --- |
| w/o prior spans | 0.342 | 0.107 | 0.059 |
| BCE loss | 0.573 | 0.284 | 0.206 |
| Single-task training | 0.583 | 0.335 | 0.249 |
| w/o global embedding | 0.560 | 0.324 | 0.262 |
| Full model | 0.584 | 0.351 | 0.284 |

Key Findings

  1. The proposed method substantially outperforms all baselines across all levels, exceeding prior work by 50%+ on foveal span prediction.
  2. Prior visual span information is critical — removing it causes a dramatic performance drop.
  3. Joint multi-level training significantly outperforms single-task training, exploiting mutual cues between peripheral and foveal attention.
  4. Back-projecting 3D predictions to 2D (without any 2D-specific training) matches the performance of dedicated 2D gaze methods.
  5. Inference latency is only 71.2 ms, satisfying real-time requirements.

Highlights & Insights

  • Novel problem formulation: The first work to formally define egocentric 3D visual span forecasting.
  • Visual science grounding: The multi-level visual span hierarchy is derived from established visual science taxonomy.
  • Practical efficiency: Built on SLAM keypoints (not dense reconstruction), with low latency (71 ms), making it suitable for AR/VR applications.
  • Bidirectional validation: Effectiveness in both 2D→3D and 3D→2D directions demonstrates the superiority of 3D modeling.
  • Large-scale benchmark: A 364.6K-sample evaluation platform is curated and released.

Limitations & Future Work

  1. Performance depends on the quality of semi-dense SLAM keypoints, which may be insufficient in certain scenes.
  2. Absolute foveal span IoU remains moderate (~0.28–0.37), leaving room for improvement in fine-grained prediction.
  3. Large-scale dynamic scenes (e.g., soccer, basketball) are excluded, limiting generalizability.
  4. Voxelization is the primary computational bottleneck (45 ms) and requires optimization for more constrained devices.
  5. Non-visual modalities (auditory, proprioceptive) are not incorporated, potentially missing important cues.
Notes

  • CSTS: Current state-of-the-art 2D gaze prediction method based on multimodal contrastive spatiotemporal fusion.
  • EgoChoir: Predicts 3D interaction hotspots from synthetic geometry and motion; serves as the strongest baseline.
  • Ego-Exo4D / Aria Everyday Activities: Sources of training and evaluation data.
  • Insight: Lifting 2D tasks into 3D is a general and effective strategy that may extend to other egocentric prediction tasks.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Defines a novel problem and proposes a systematic solution.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Two datasets (364.6K samples), multiple baselines, comprehensive ablations, and 2D back-projection validation.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with natural integration of visual science background.
  • Value: ⭐⭐⭐⭐ Direct applicability to AR/VR and assistive technologies.