OmniGaze: Reward-inspired Generalizable Gaze Estimation in the Wild¶
Conference: NeurIPS 2025 arXiv: 2510.13660 Code: GitHub Area: Human Understanding Keywords: Gaze estimation, semi-supervised learning, pseudo-labels, reward model, cross-domain generalization
TL;DR¶
This paper proposes OmniGaze, a semi-supervised 3D gaze estimation framework that employs a reward model — fusing visual embeddings, MLLM-generated semantic gaze descriptions, and geometric direction vectors — to assess pseudo-label quality. Trained on 1.4 million unlabeled face images, OmniGaze achieves state-of-the-art performance under both within-domain and cross-domain settings across 5 datasets, and demonstrates zero-shot generalization on 4 unseen datasets.
Background & Motivation¶
3D gaze estimation aims to directly predict gaze direction from facial images and serves as a foundational technology for VR, human-computer interaction, and medical diagnosis. Existing methods face two core challenges:
Scarcity and limited diversity of annotated data: Existing annotated datasets (e.g., ETH-XGaze, Gaze360) simplify real-world scenarios through predefined assumptions (e.g., panoramic cameras, multi-view photography), yet remain limited in scale and scene diversity, causing significant performance degradation on unseen domains.
Generalization methods constrained by annotation coverage: Cross-domain generalization approaches (domain adaptation, domain generalization) are effective but fundamentally bounded by the coverage of annotated source-domain data, whereas large quantities of unlabeled face images can be readily obtained through web crawling or generative models.
Gap in semi-supervised frameworks for gaze estimation: Weakly supervised methods require annotations from gaze-related domains; self-supervised pretext tasks (e.g., image reconstruction, gaze redirection) exhibit weak semantic relevance to gaze estimation and utilize unlabeled data inefficiently. Critically, existing SSL pseudo-label selection strategies (e.g., FixMatch's fixed threshold, FlexMatch's class-aware threshold) are designed for classification tasks and cannot be directly applied to continuous regression targets such as gaze direction.
This gap motivates the design of OmniGaze: a semi-supervised framework tailored for continuous regression tasks that employs a reward model to assess pseudo-label quality while leveraging large-scale unlabeled data to improve generalization.
Method¶
Overall Architecture¶
OmniGaze follows a standard three-stage semi-supervised training pipeline: (1) the teacher model is first trained on labeled data with full supervision; (2) the teacher model generates pseudo-labels for unlabeled data, and a reward model filters out low-quality pseudo-labels; (3) the student model is jointly trained on the labeled data combined with the filtered pseudo-labeled data. The core innovations lie in the reward model design and multimodal cue fusion.
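The three stages can be sketched as a minimal skeleton; the callables passed in (`train_supervised`, `pseudo_label`, `reward_score`, `train_student`) are hypothetical stand-ins for the real networks and training routines, not names from the paper:

```python
def run_pipeline(labeled, unlabeled, train_supervised, pseudo_label,
                 reward_score, train_student, tau=0.5):
    """Toy skeleton of the three-stage semi-supervised pipeline."""
    # Stage 1: teacher trained on labeled data only.
    teacher = train_supervised(labeled)
    # Stage 2: teacher pseudo-labels the unlabeled pool ...
    pseudo = [(x, pseudo_label(teacher, x)) for x in unlabeled]
    # ... and only pseudo-labels whose reward-model confidence
    # clears the threshold tau are kept.
    kept = [(x, y, r) for (x, y) in pseudo
            if (r := reward_score(x, y)) >= tau]
    # Stage 3: student trained jointly on labeled + filtered pseudo data.
    return train_student(labeled, kept)
```

With toy stand-ins, a confidence of 0.2 falls below the 0.5 threshold and the sample is dropped, while 0.8 and 0.9 survive into the student's training set.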
Key Designs¶
- Reward model with multimodal cue fusion: The reward model \(h_G\) assesses pseudo-label reliability from three perspectives:
  - Visual cues: A CLIP visual encoder extracts visual embeddings \(\boldsymbol{f}_k^v\) from input face images.
  - Semantic cues: An MLLM (InstructBLIP) is queried with "where is this person looking in 3D space?" to obtain a natural language description, which is encoded via the CLIP text encoder into \(\boldsymbol{f}_k^t\).
  - Geometric cues: The yaw/pitch angles of the pseudo-label are converted into a 3D direction vector.

  Visual and semantic cues are fused via cross-attention into a semantics-aware gaze representation: \(\hat{\boldsymbol{f}}_k^v = \text{AvgPool}(\text{LN}(\text{CrossAttn}(\boldsymbol{f}_k^v, \boldsymbol{f}_k^t)))\), where visual features serve as queries and text features as keys/values, enabling the reward model to dynamically attend to relevant semantic cues when interpreting visual features.
- Geometric representation via spherical-to-Cartesian conversion: Pseudo-labels \((\theta_k, \psi_k)\) (yaw, pitch) are converted into unit 3D direction vectors: \(\boldsymbol{v}_k = [\cos(\psi_k)\cdot\sin(\theta_k),\ \sin(\psi_k),\ \cos(\psi_k)\cdot\cos(\theta_k)]\)
  This representation is more expressive and continuous than raw angles, facilitating precise alignment with semantics-aware representations. The reward model predicts a confidence score via cross-attention followed by an MLP: \(\hat{r}_k = \text{Sigmoid}(\text{MLP}(\text{CrossAttn}(\hat{\boldsymbol{f}}_k^v, \boldsymbol{v}_k)))\)
- Label scorer and final confidence: The initial confidence \(\hat{r}_k\) from the reward model and the cosine similarity between the student model's prediction and the pseudo-label are concatenated and fed into a lightweight MLP to produce the final confidence: \(r_k = \text{Sigmoid}(\text{MLP}([\hat{r}_k;\ \text{sim}(\hat{y}_k, y_k)]))\)
This design encourages mutual verification between the reward model and the student model, improving assessment robustness.
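The spherical-to-Cartesian conversion above is simple enough to state directly in code; a minimal sketch (the function name is ours, angles assumed in radians) using the unit-vector form with \(\sin(\psi)\) as the vertical component:

```python
import math

def gaze_to_vector(yaw, pitch):
    """Convert a (yaw, pitch) gaze pseudo-label into a unit 3D
    direction vector: [cos(p)*sin(y), sin(p), cos(p)*cos(y)]."""
    return (math.cos(pitch) * math.sin(yaw),
            math.sin(pitch),
            math.cos(pitch) * math.cos(yaw))
```

A sanity check: zero yaw and zero pitch yield the forward axis `(0, 0, 1)`, and the output has unit norm for any angle pair, which is what makes this representation directly comparable via cosine similarity.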
Loss & Training¶
- Supervised loss (labeled data): Angular L2 loss \(\mathcal{L}^s = \frac{1}{N_L}\sum_i \|\hat{y}_i^l - y_i^l\|_2\)
- Reward model loss: Binary cross-entropy where ground-truth labels are marked as reliable (\(c_k=1\)) and pseudo-labels as unreliable (\(c_k=0\)), training the reward model to discriminate between them: \(\mathcal{L}^g = -\sum_k \big(c_k\log(r_k) + (1-c_k)\log(1-r_k)\big)\)
- Unsupervised loss (pseudo-labeled data): Angular loss with threshold filtering and confidence weighting: \(\mathcal{L}^u = \sum_j \mathbb{1}[r_j \geq \tau] \cdot r_j \cdot \|h_S(x_j^u) - y_j^u\|_2\)
  With threshold \(\tau=0.5\): pseudo-labels below the threshold are discarded, while high-confidence pseudo-labels receive larger weights.
- Periodic pseudo-label update: Every \(K=10\) epochs, the student model weights are transferred to the teacher model and pseudo-labels are regenerated, preventing overfitting to noisy labels at early stages.
- Overall training objective: The average of \(\mathcal{L}^s\) and \(\mathcal{L}^u\).
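The filter-and-weight behaviour of \(\mathcal{L}^u\) can be sketched in a few lines; this is a minimal illustration with plain (yaw, pitch) pairs, not the paper's actual angular-error implementation:

```python
import math

def unsupervised_loss(preds, pseudo, conf, tau=0.5):
    """Confidence-weighted pseudo-label loss, following the formula
    above: samples with r_j < tau are masked out (the indicator term),
    the rest contribute their L2 error weighted by r_j."""
    total = 0.0
    for p, y, r in zip(preds, pseudo, conf):
        if r < tau:          # 1[r_j >= tau] = 0: discard this sample
            continue
        err = math.hypot(p[0] - y[0], p[1] - y[1])  # L2 over angles
        total += r * err     # higher confidence -> larger weight
    return total
```

With `conf = [0.4, 0.8]`, the first sample is dropped by the threshold and only the second contributes, scaled by its confidence of 0.8.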
Key Experimental Results¶
Main Results — Within-domain Gaze Estimation (Angular Error °↓)¶
| Method | MPIIFaceGaze | EyeDiap | RT-Gene | Gaze360 | IVGaze |
|---|---|---|---|---|---|
| GazeTR (ICPR'22) | 4.00 | 5.17 | 6.55 | 10.62 | 7.33 |
| AGE-Net (ICIP'24) | 3.61 | 4.78 | - | - | - |
| 3DGazeNet (ECCV'24) | 4.00 | - | - | 9.60 | - |
| OmniGaze | 2.97 | 4.07 | 5.40 | 9.12 | 6.72 |
| Δ error vs. best prior | -0.64 | -0.71 | -1.15 | -0.48 | -0.61 |
Ablation Study — Core Components (Zero-shot Setting)¶
| Configuration | MPIIGaze | EyeDiap | RT-Gene | Notes |
|---|---|---|---|---|
| Labeled data only | 6.17 | 8.45 | 20.73 | Baseline |
| + Unlabeled data (no filtering) | 4.97 | 5.42 | 13.75 | Scale alone yields significant gains |
| + Reward model filtering | 3.44 | 4.31 | 9.09 | Filtering further improves by 1.5°+ |
Ablation Study — Reward Model Components¶
| Component | MPIIGaze | EyeDiap | RT-Gene |
|---|---|---|---|
| Semantic gaze description only | 3.69 | 4.78 | 10.23 |
| 3D direction vector only | 4.03 | 4.93 | 10.56 |
| Both combined (full) | 3.44 | 4.31 | 9.09 |
Key Findings¶
- Data scale and quality filtering act synergistically: The three-step progression from labeled-only → adding unlabeled data → adding reward filtering yields significant improvements at each step.
- Semantic descriptions contribute more than geometric vectors: Semantic gaze descriptions as a standalone component outperform 3D direction vectors, indicating that MLLM scene understanding provides valuable priors for regression tasks.
- Zero-shot performance surpasses within-domain SOTA: OmniGaze-Base's zero-shot performance (3.44° on MPIIFaceGaze) even outperforms prior SOTA requiring within-domain training (3DGazeNet at 4.00°).
- Scalable with model size: Performance consistently improves from ViT-S → ViT-B → ViT-L.
Highlights & Insights¶
- Using MLLMs as semantic priors for regression tasks is the paper's most elegant design: rather than directly applying MLLMs to gaze estimation, the approach leverages MLLM scene understanding to assist pseudo-label quality assessment — an indirect yet effective form of multimodal knowledge transfer.
- Filling the methodological gap of SSL for continuous regression: Existing SSL pseudo-label selection strategies are almost exclusively designed for classification; the reward model with confidence scoring provides a general framework for regression tasks.
- Potential as a scalable data engine: OmniGaze can generate reliable gaze annotations for arbitrary face images, with performance continuously improving as more unlabeled data is incorporated.
Limitations & Future Work¶
- Reward model training relies on a certain amount of labeled data to initialize the teacher model, making it inapplicable in fully unsupervised settings.
- MLLM inference (InstructBLIP description generation) introduces additional data preprocessing costs, presenting efficiency bottlenecks at the million-image scale.
- The binary classification objective of the reward model (ground truth vs. pseudo-label) may be an oversimplification — ground-truth labels can also be noisy.
- Validation is limited to ViT architectures; other backbones such as CNNs remain unexplored.
Related Work & Insights¶
- SemiReward (reward scoring in SSL): This work extends the SemiReward concept from classification to regression, with the key distinction of incorporating multimodal cues and geometric representations.
- Implications for other regression SSL tasks: Continuous regression tasks such as depth estimation, pose estimation, and optical flow estimation may similarly benefit from reward-driven pseudo-label filtering.
- MLLMs as auxiliary priors: MLLMs need not make direct predictions; their semantic understanding capabilities can serve as "meta-knowledge" to assist training in other tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐ The trimodal fusion of reward model, MLLM semantic descriptions, and geometric representation is novel; the SSL framework for regression tasks represents a methodological contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five datasets × three settings (within-domain/cross-domain/zero-shot) with extensive ablations across three model scales.
- Writing Quality: ⭐⭐⭐⭐ Method description is clear with well-articulated correspondence between challenges and proposed solutions.
- Value: ⭐⭐⭐⭐ The scalable data engine paradigm offers strong inspirational value for other visual regression tasks.