3DGazeNet: Generalizing 3D Gaze Estimation with Weak-Supervision from Synthetic Views

Conference: ECCV 2024
arXiv: 2212.02997
Code: https://github.com/Vagver/3DGazeNet
Area: Human Understanding
Keywords: Gaze Estimation, 3D Eye Mesh Regression, Weakly Supervised, Multi-view Consistency, Synthetic Views

TL;DR

3DGazeNet reformulates gaze estimation as dense 3D eye mesh regression and trains with weak supervision from two sources: pseudo-labels extracted automatically from large-scale in-the-wild face images, and multi-view pairs synthesized with HeadGAN. It reports up to 30% improvement over SOTA in cross-domain scenarios.

Background & Motivation

Gaze estimation has wide applications in human-computer interaction, VR/AR, and psychological analysis, but the core limitation of prior work lies in the difficulty of cross-domain generalization. Existing gaze datasets are mostly collected in controlled environments, which cover limited facial diversity, head poses, and environments, leading to severe model degradation on unseen domains. Mainstream methods rely on domain adaptation, which requires target domain samples or even annotations, preventing them from being directly deployed as plug-and-play solutions in new scenarios. Meanwhile, traditional methods model gaze estimation as a regression of sparse parameters (angles/vectors), making them susceptible to single-point prediction errors.

Core Problem

How to train a generalizable gaze estimation model that can be directly deployed to any scenario without domain adaptation? Two key sub-problems are: (1) how to leverage massive unlabeled in-the-wild facial data to enhance training diversity; (2) how to design a robust representation and training framework to cope with noise in the pseudo-labels.

Method

Overall Architecture

Given a face image, the left-eye, right-eye, and full-face regions are cropped, resized to 128×128, concatenated along the channel dimension, and fed into a ResNet-18 feature extractor. Two fully connected heads then regress (a) the 3D vertex coordinates of both eye meshes (2×481 vertices) and (b) a 3D gaze vector. The final gaze direction is the average of the two outputs, i.e., the gaze implied by the mesh and the directly regressed vector. The training data consists of three parts: a gaze dataset with GT annotations, a large-scale in-the-wild facial dataset with pseudo-annotations (ITWG, 255K images), and multi-view pairs generated by HeadGAN (ITWG-MV).
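
As a rough illustration of how the two outputs could be fused, the sketch below derives a gaze direction from each eye mesh and averages it with the regressed vector. Several details are assumptions made only for this example and are not taken from the paper: the iris vertex index set `IRIS_IDX`, the use of the vertex mean as eyeball centre, and averaging by summing unit vectors and renormalizing.

```python
import numpy as np

N_VERTS = 481  # vertices per eye mesh, as stated in the notes

# Hypothetical iris vertex indices; the real template's indexing is not given here.
IRIS_IDX = np.arange(32)


def unit(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def gaze_from_mesh(verts):
    """Gaze as the direction from eyeball centre to iris centre.

    Assumption: eyeball centre = mean of all vertices,
    iris centre = mean of the IRIS_IDX vertices.
    """
    center = verts.mean(axis=0)
    iris_center = verts[IRIS_IDX].mean(axis=0)
    return unit(iris_center - center)


def fuse_gaze(mesh_left, mesh_right, gaze_vec):
    """Average the mesh-derived gaze (both eyes) with the regressed vector."""
    g_mesh = unit(gaze_from_mesh(mesh_left) + gaze_from_mesh(mesh_right))
    return unit(g_mesh + unit(gaze_vec))
```

Here `mesh_left` and `mesh_right` are expected to be arrays of shape `(481, 3)` and `gaze_vec` a 3-vector; the result is a unit-length gaze direction.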

Key Designs

Unified 3D Eye Representation: A rigid spherical 3D eye template (481 vertices, 928 triangular faces) is defined and automatically fitted to any facial image via iris localization and 3D face alignment. For datasets with GT, the template is rotated by the gaze label and aligned to the iris. For in-the-wild data, RetinaFace is utilized for 3D face reconstruction + 2D iris detection, lifting the 2D iris position to 3D to determine the eye orientation, thereby generating pseudo-GT. This design allows any sparse representation (such as iris boundaries) to be obtained by indexing the dense mesh.
Joint Mesh+Vector Dual-Head Regression: Unlike directly regressing angles or sparse landmarks, the model concurrently predicts the dense 3D eye mesh and the 3D gaze vector. Ablation studies show that the joint M+V objective outperforms using V or M alone—the dense mesh provides robustness against sparse prediction errors, while the vector provides precise label supervision.
Multi-View Consistency Supervision: For each in-the-wild face, HeadGAN synthesizes views with different head poses while keeping the relative gaze direction unchanged, forming image pairs. The transformation matrix \(P\) between the two views is computed via 3D face reconstruction, and the model's predictions for the two views are required to agree after applying \(P\). This weak supervision signal counteracts the noise in the pseudo-labels and allows training without any real gaze annotations.
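
The multi-view constraint can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the transformation \(P\) can be split into a rotation `R_ab` and a translation `t_ab` mapping view A into view B, and the helper `rotation_y` exists only to build a toy example.

```python
import numpy as np


def rotation_y(angle_rad):
    """Rotation matrix about the y axis (toy stand-in for a head-pose change)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def angular_error_deg(a, b):
    """Angle in degrees between two 3D vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return np.degrees(np.arccos(np.clip(a @ b, -1.0, 1.0)))


def mv_gaze_consistency(gaze_a, gaze_b, R_ab):
    """Angular gap between view-B gaze and view-A gaze rotated into view B."""
    return angular_error_deg(R_ab @ gaze_a, gaze_b)


def mv_vertex_consistency(verts_a, verts_b, R_ab, t_ab):
    """Mean L1 gap between view-B vertices and transformed view-A vertices."""
    moved = verts_a @ R_ab.T + t_ab
    return np.abs(moved - verts_b).mean()
```

Both quantities are zero when the two predictions agree exactly under the known transformation, so minimizing them pushes the network toward predictions that are geometrically consistent across the synthesized views.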

Loss & Training

Vertex Loss \(\mathcal{L}_{vert}\): L1 distance between coordinate predictions and GT (pseudo) 3D eye vertices.
Edge Loss \(\mathcal{L}_{edge}\): L2 distance of edge lengths based on a fixed triangulation, maintaining mesh topology.
Gaze Loss \(\mathcal{L}_{gaze}\): Angular error between predicted and GT gaze vectors.
Multi-View Vertex Consistency Loss \(\mathcal{L}_{MV,vertex}\): L1 distance between vertex predictions of two views after transformation.
Multi-View Gaze Consistency Loss \(\mathcal{L}_{MV,gaze}\): Angular error of gaze vectors between two views after rotation.
Total Loss: \(\mathcal{L} = \lambda_{GT}\mathcal{L}_{GT} + \lambda_{PGT}\mathcal{L}_{PGT} + \lambda_{MV}\mathcal{L}_{MV}\), with all three weights set to 1.
Hyperparameters: the vertex, edge, and gaze terms are weighted by \(\lambda_v=0.1\), \(\lambda_e=0.01\), and \(\lambda_g=1\), respectively.
Training: Adam optimizer, batch size 128, learning rate warmed up from 1e-6 to 1e-4 over the first 3 epochs and divided by 10 at epochs 60 and 80, 100 epochs in total, on a single V100 GPU.
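
The single-view loss terms listed above can be sketched in NumPy as below. Two points are assumptions of this example rather than facts from the paper: the edge term is interpreted as the mean squared difference of edge lengths, and the weights \(\lambda_v, \lambda_e, \lambda_g\) are assumed to combine the vertex, edge, and gaze terms linearly inside each supervised group (\(\mathcal{L}_{GT}\) or \(\mathcal{L}_{PGT}\)).

```python
import numpy as np

# Term weights as listed in the notes.
LAMBDA_V, LAMBDA_E, LAMBDA_G = 0.1, 0.01, 1.0


def vertex_loss(pred, target):
    """Mean L1 distance between predicted and target vertices, shape (N, 3)."""
    return np.abs(pred - target).mean()


def edge_loss(pred, target, faces):
    """Mean squared difference of edge lengths over a fixed triangulation."""
    edges = np.concatenate([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
    edges = np.unique(np.sort(edges, axis=1), axis=0)  # drop shared duplicates

    def lengths(v):
        return np.linalg.norm(v[edges[:, 0]] - v[edges[:, 1]], axis=1)

    return np.mean((lengths(pred) - lengths(target)) ** 2)


def gaze_loss(pred, target):
    """Angle in radians between predicted and target gaze vectors."""
    cos = pred @ target / (np.linalg.norm(pred) * np.linalg.norm(target))
    return np.arccos(np.clip(cos, -1.0, 1.0))


def supervised_loss(pred_verts, gt_verts, faces, pred_gaze, gt_gaze):
    """Assumed weighted sum of the three terms for one supervised group."""
    return (LAMBDA_V * vertex_loss(pred_verts, gt_verts)
            + LAMBDA_E * edge_loss(pred_verts, gt_verts, faces)
            + LAMBDA_G * gaze_loss(pred_gaze, gt_gaze))
```

Note that the edge term is invariant to a rigid translation of the mesh, so it constrains shape only, while the vertex term anchors absolute position.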

Key Experimental Results

Dataset/Setting	Metric (°)	3DGazeNet	Prev. SOTA	Gain
Cross-domain (pseudo-labels only): ITWG-MV→G360	gaze err	18.1	22.5 ([41])	~20%
Cross-domain (pseudo-labels only): AVA→G360	gaze err	22.4	29.0 ([41])	~23%
Cross-domain (+GT): GC+ITWG-MV→G360	gaze err	17.6	-	-
Cross-domain (+GT): EXG+ITWG-MV→G360	gaze err	15.4	-	-
In-domain: MPII (M+V)	gaze err	4.0	4.04 (GazeTR)	On par
In-domain: G360 (M+V)	gaze err	9.6	10.1 ([41])	~5%
In-domain: GC (M+V)	gaze err	3.1	3.3 (ETH-XGaze)	~6%
In-domain: EXG (M+V)	gaze err	4.2	4.5 (ETH-XGaze)	~7%
vs SOTA generalization: EXG+ITWG-MV→MPII	gaze err	6.0	6.7 (CDG)	~10%
vs SOTA generalization: EXG+ITWG-MV→GC	gaze err	7.8	8.2 (RAT)	~5%
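
The Gain column appears to be the relative reduction in angular error with respect to the previous SOTA. The short check below assumes that definition, gain = (baseline − ours) / baseline, and recomputes it for the rows that have a baseline value; the row names are shortened labels chosen for this example.

```python
def relative_gain(ours_deg, baseline_deg):
    """Relative error reduction in percent (lower error is better)."""
    return 100.0 * (baseline_deg - ours_deg) / baseline_deg


# (ours, previous SOTA) in degrees, copied from the table.
rows = {
    "ITWG-MV -> G360": (18.1, 22.5),
    "AVA -> G360": (22.4, 29.0),
    "G360 in-domain": (9.6, 10.1),
    "GC in-domain": (3.1, 3.3),
    "EXG in-domain": (4.2, 4.5),
    "EXG+ITWG-MV -> MPII": (6.0, 6.7),
    "EXG+ITWG-MV -> GC": (7.8, 8.2),
}

for name, (ours, base) in rows.items():
    print(f"{name}: {relative_gain(ours, base):.1f}%")
```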

Ablation Study

M+V vs V vs M: The joint objective M+V outperforms using V or M alone across all datasets (e.g., MPII: 4.0 vs 4.1 vs 4.2, G360: 9.6 vs 9.8 vs 9.8).
Pseudo-labels vs Multi-view Consistency: Multi-view consistency alone, without topological constraints, performs very poorly (47.4°); pseudo-labels alone do better (23.1°), and combining the two yields the best result (18.1°).
ITWG Head Pose Distribution: As the head-pose range of the training data widens (5°→20°→40°→90°), performance at large angles improves steadily, with the full-range ITWG-MV achieving the best results.
Role of ITWG in In-domain Experiments: ITWG helps on G360, which requires diverse head poses (e.g., G360+ITWG-MV reaches 9.3°, versus 9.6° for in-domain G360 training alone), but brings limited in-domain gains on EXG/MPII, whose controlled capture conditions are already well covered by their own training data.

Highlights & Insights

Reformulating gaze estimation as dense 3D mesh regression is a highly clever idea. The dense representation is inherently robust to sparse prediction errors, and any low-dimensional representation can be retrieved via indexing.
The pseudo-label generation pipeline requires no gaze annotations and automatically generates labels relying solely on 3D face alignment and 2D iris detection, making it possible to leverage massive in-the-wild face datasets.
Multi-view consistency regularizes pseudo-label noise using geometric constraints from HeadGAN-synthesized views, presenting an elegant weakly supervised strategy.
The model architecture is highly minimalist (ResNet-18 + two FC layers), showing that the core contributions are in the data and training framework rather than model design.

Limitations & Future Work

Pseudo-labels are relatively weak along the pitch axis, potentially due to the limited vertical gaze variations in the data or the bias of the pseudo-annotating pipeline itself.
The accuracy of pseudo-annotations depends on the performance of 3D face alignment and 2D iris detection, showing uncertainty under extreme occlusions or low resolutions.
The method cannot handle cases where the face is not visible (e.g., the subject is facing away from the camera).
The quality of synthetic views generated by HeadGAN is limited; utilizing more advanced face reenactment methods might yield further improvements.
The spherical eye template ignores the kappa angle (the offset between the optical and visual axes), which caps the achievable accuracy in personalized scenarios.
The potential of stronger backbones such as Transformers has not been explored.

Comparison with Related Work

vs Kothari et al. [41] (CVPR 2021): The most related work, which also learns gaze from in-the-wild data using 3D scene geometric constraints. The key advantages of 3DGazeNet are: (a) dense 3D mesh representation instead of sparse parameters; (b) HeadGAN-based multi-view consistency instead of mutual gaze constraints in social scenes; (c) larger-scale ITWG dataset. In cross-domain settings, 3DGazeNet outperforms it by 20-30%.
vs RUDA/CRGA/PureGaze (Domain Adaptation Methods): The first stage (no target domain knowledge) of these methods is used for a fair comparison. 3DGazeNet comprehensively outperforms them after using ITWG-MV, because competing methods cannot effectively utilize pseudo-labels, while the multi-view consistency framework of 3DGazeNet can regularize pseudo-label noise.
vs Eyeball Model Methods (DPG, Wood et al.): Traditional parametric eyeball model fitting is limited by model construction difficulty and in-the-wild fitting precision. 3DGazeNet performs better under the end-to-end learning paradigm (on Columbia dataset: 5.6° vs 7.1°/7.5°).

Insights & Connections

The idea of substituting sparse parameters with dense representations can be transferred to other pose estimation tasks (e.g., hand or human body pose), suggesting that dense regression + weak supervision can be a universally effective strategy in annotation-scarce scenarios.
The weakly supervised paradigm of synthetic multi-views + consistency constraints is generalizable: it can be applied to any task whose predictions should change in a known way under a known geometric transformation between views.
The pseudo-label pipeline of this method can serve as a reference for the automatic construction of large-scale gaze estimation datasets.

Rating

Novelty: ⭐⭐⭐⭐ Reformulating gaze estimation as dense 3D mesh regression is a novel idea, and the multi-view weakly supervised framework is cleverly designed.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage of cross-domain, in-domain, ablation studies, and applications (gaze redirection), with a rich set of baseline comparisons.
Writing Quality: ⭐⭐⭐⭐ The paper is well-structured and complies with ECCV standards, with detailed supplementary materials.
Value: ⭐⭐⭐⭐ Provides a plug-and-play generalizable gaze estimation solution, which holds direct value for practical application scenarios.