Gaze Target Detection Based on Head-Local-Global Coordination

Conference: ECCV 2024
Code: None
Area: Others
Keywords: Gaze Target Detection, Head-Local-Global Coordination, Field of View, Multi-view Fusion, Gaze Prediction

TL;DR

The paper proposes a gaze target detection method that coordinates three views: head, local, and global. It introduces a field-of-view (FOV)-based local view together with global-local position and representation consistency mechanisms, and reports improved gaze target prediction accuracy as a result.

Background & Motivation

Background: Gaze target detection is an important task in computer vision, aiming to predict the location or object that a person in an image is looking at. This task is widely used in human-computer interaction, social behavior analysis, and pedestrian intention understanding in autonomous driving. Existing methods typically adopt a two-stage strategy: (a) estimating the gaze direction from the head region, and (b) identifying salient targets in the global image, then combining both to determine the gaze target.
Limitations of Prior Work:
- Over-reliance on Global Views: Prior methods rely mainly on head crops and the full image. The full image contains a large amount of gaze-irrelevant background, which introduces noise, especially in complex scenes with multiple candidate targets.
- Lack of Local Context: The scale gap between the head view and the global view is large. The head crop is too small and the full image too large, so no mid-scale context is available to help the model understand the objects lying along the gaze direction.
- Insufficient Information Integration Across Views: Simple concatenation or attention-based fusion of head and global features fails to effectively utilize spatial positional relationships.
Key Challenge: The global view is redundant and unfocused, whereas the head view is too limited. An intermediate view is needed to bridge the information gap between the two.
Goal: To design a more effective multi-view fusion scheme by introducing a local view to supplement the missing mid-scale information between head and global views, and designing corresponding view coordination mechanisms to integrate features from all three views.
Key Insight: Introducing a local view based on the human field of view (FOV) as the third view. The FOV local view focuses on the image region within a certain range in front of the gaze direction, which is more targeted than the global view and provides richer context than the head view.
Core Idea: By constructing a head-local-global tri-view framework and designing position and representation consistency mechanisms for multi-view coordination, more accurate gaze target prediction is achieved.

Method

Overall Architecture

The method consists of four core modules:
1. View Construction: Extracting the head view, FOV local view, and global view from the input image.
2. Feature Extraction: Extracting visual features from each of the three views independently.
3. View Coordination: Integrating tri-view features through global-local position and representation consistency mechanisms.
4. Target Prediction: Predicting the gaze target position (in the form of a heatmap) based on the fused features.
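The heatmap output of the target prediction step can be illustrated with a small sketch. It assumes a Gaussian-shaped target heatmap and argmax decoding, both common in gaze target detection but not confirmed by these notes; `gaussian_heatmap`, `decode_peak`, `size`, and `sigma` are hypothetical names and values.

```python
import numpy as np

def gaussian_heatmap(cx, cy, size=64, sigma=3.0):
    """Build a size x size heatmap with a Gaussian bump at pixel (cx, cy).

    Assumption: the supervision target is a Gaussian; the source only says
    the prediction takes the form of a heatmap.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def decode_peak(heatmap):
    """Return the heatmap's argmax as normalized (x, y) coordinates in [0, 1)."""
    iy, ix = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    h, w = heatmap.shape
    return float(ix) / w, float(iy) / h

# Example: a target at pixel (40, 20) on a 64 x 64 grid.
hm = gaussian_heatmap(40, 20)
peak_xy = decode_peak(hm)
```

Decoding with a plain argmax limits precision to the heatmap resolution; a soft-argmax or local refinement would be an alternative design choice.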

Key Designs

FOV-based Local View Construction:
- Function: Cropping a local region in front of the gaze direction from the global image to serve as the local view.
- Mechanism: First, a rough gaze direction is estimated from the head region. Then, centered on the head, a specific-sized region is cropped from the global image along the gaze direction. The size and orientation of this region are determined by the estimated gaze direction and the human field of view (FOV).
- Design Motivation: The gaze target is highly likely to lie within a limited area in front of the gaze direction. The FOV local view narrows the target search space while providing mid-scale context between the head and global scales. The design mimics human gaze behavior: first determine the rough direction, then localize the target precisely within the local area.
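The cropping step can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it assumes the FOV is modeled as a circular sector with its apex at the head center, and it crops the axis-aligned box that encloses that sector. The names `fov_crop_box`, `fov_deg`, and `reach` are hypothetical.

```python
import math

def fov_crop_box(head_xy, gaze_angle, fov_deg, reach, img_w, img_h):
    """Axis-aligned box enclosing an assumed sector-shaped FOV region.

    head_xy    : (x, y) head center in pixels
    gaze_angle : rough 2D gaze direction in radians (image coords, y down)
    fov_deg    : full angular width of the sector in degrees (hypothetical)
    reach      : sector radius in pixels (hypothetical)
    """
    hx, hy = head_xy
    half = math.radians(fov_deg) / 2.0
    steps = 32
    # The sector's extent is spanned by its apex plus points along its arc.
    xs, ys = [hx], [hy]
    for i in range(steps + 1):
        a = gaze_angle - half + (2.0 * half) * i / steps
        xs.append(hx + reach * math.cos(a))
        ys.append(hy + reach * math.sin(a))
    # Clip the enclosing box to the image bounds.
    x0 = max(0.0, min(xs))
    y0 = max(0.0, min(ys))
    x1 = min(float(img_w), max(xs))
    y1 = min(float(img_h), max(ys))
    return x0, y0, x1, y1

# Example: person at (100, 100) looking right with a 90-degree FOV.
box = fov_crop_box((100.0, 100.0), 0.0, 90.0, 50.0, 640, 480)
```

A larger `fov_deg` or `reach` makes the crop approach the full image, while smaller values risk excluding the target, which matches the trade-off noted in the FOV size ablation.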
Global-Local Position Consistency:
- Function: Establishing spatial position correspondence between the global view and the local view.
- Mechanism: Since the local view is cropped from the global view, a clear spatial correspondence exists between them. This module maps the target position detected in the local view back to the global coordinate system via positional encoding and coordinate transformation, ensuring spatial consistency of the prediction results.
- Design Motivation: Avoiding contradictory position predictions between the local and global views, ensuring that information from both scales can complement each other.
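Because the local view is a crop of the global image, the coordinate transformation is a simple scale and shift. The sketch below covers only that part (not the positional encoding) and assumes the crop is an axis-aligned box resized to a fixed local resolution; `local_to_global`, `global_to_local`, `crop_box`, and `local_size` are hypothetical names.

```python
def local_to_global(pt_local, crop_box, local_size):
    """Map a point from local-view pixels to global-image pixels.

    crop_box   : (x0, y0, x1, y1) of the crop in global pixels
    local_size : (width, height) of the resized local view
    """
    x0, y0, x1, y1 = crop_box
    lw, lh = local_size
    sx = (x1 - x0) / lw  # global pixels per local pixel, x axis
    sy = (y1 - y0) / lh  # global pixels per local pixel, y axis
    return x0 + pt_local[0] * sx, y0 + pt_local[1] * sy

def global_to_local(pt_global, crop_box, local_size):
    """Inverse mapping: global-image pixels to local-view pixels."""
    x0, y0, x1, y1 = crop_box
    lw, lh = local_size
    return ((pt_global[0] - x0) * lw / (x1 - x0),
            (pt_global[1] - y0) * lh / (y1 - y0))

# Example: the center of a 64 x 64 local view cropped from (100, 50)-(300, 250).
g = local_to_global((32.0, 32.0), (100.0, 50.0, 300.0, 250.0), (64, 64))
back = global_to_local(g, (100.0, 50.0, 300.0, 250.0), (64, 64))
```

A position consistency term could then compare the mapped local prediction with the global prediction in the same coordinate system.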
Global-Local Representation Consistency:
- Function: Fusing the representation of the three views at the feature level to ensure semantic consistency.
- Mechanism: A cross-view attention mechanism lets features from the head, local, and global views interact. Specifically, the head feature serves as the query and attends to the local-view and global-view features separately, extracting information related to the gaze direction. In addition, a consistency constraint on the local and global features encourages them to represent the same region consistently.
- Design Motivation: The three views observe the same scene from different scales and perspectives. It is necessary to unify their representations in the feature space to effectively leverage complementary information from each view.
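The head-as-query attention and the consistency constraint can be sketched with plain NumPy. This is an illustration under simplifying assumptions, not the paper's architecture: it uses single-head scaled dot-product attention without learned projections, random features in place of backbone outputs, and an MSE consistency term; every name and shape below is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, tokens):
    """Single-head scaled dot-product attention.

    query  : (d,) head feature used as the query
    tokens : (n, d) features of one view, used as keys and values
    Returns the attended feature (d,) and the attention weights (n,).
    """
    d = query.shape[-1]
    scores = tokens @ query / np.sqrt(d)
    weights = softmax(scores)
    return weights @ tokens, weights

def consistency_mse(local_feat, global_feat_roi):
    """MSE between local features and global features of the same region."""
    return float(np.mean((local_feat - global_feat_roi) ** 2))

rng = np.random.default_rng(0)
head = rng.normal(size=16)                 # head-view feature
local_tokens = rng.normal(size=(49, 16))   # e.g. a 7 x 7 local feature map
global_tokens = rng.normal(size=(196, 16)) # e.g. a 14 x 14 global feature map

head_from_local, w_local = cross_attend(head, local_tokens)
head_from_global, w_global = cross_attend(head, global_tokens)
fused = np.concatenate([head, head_from_local, head_from_global])
```

In a trained model the query, keys, and values would pass through learned projections, and the consistency term would compare local features with the global features pooled from the matching crop region.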

Loss & Training

Gaze Heatmap Loss: Binary cross-entropy loss is used to supervise the predicted gaze target heatmap.
Gaze Direction Regression Loss: L2 loss is used to supervise the estimated gaze direction angle.
Consistency Loss: Contrastive learning or MSE constraints are applied to regularize the consistency of global-local representations.
In-frame / Out-of-frame Classification Loss: Binary classification loss is used to determine whether the gaze target is within the image frame.
Multi-task Joint Training: Simultaneously optimizing position prediction, direction estimation, and in-frame/out-of-frame judgment.
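The joint objective can be sketched as a weighted sum of the four terms listed above. This is a hedged illustration: the weights `w` are hypothetical, the consistency term uses the MSE option, and the notes do not state how the paper balances the terms.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))))

def total_loss(heat_pred, heat_gt, dir_pred, dir_gt,
               local_feat, global_feat, inout_pred, inout_gt,
               w=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of heatmap, direction, consistency, and in/out losses.

    w holds hypothetical loss weights; the real values are not given here.
    """
    l_heat = bce(heat_pred, heat_gt)                          # heatmap BCE
    l_dir = float(np.mean((dir_pred - dir_gt) ** 2))          # direction L2
    l_cons = float(np.mean((local_feat - global_feat) ** 2))  # MSE consistency
    l_io = bce(inout_pred, inout_gt)                          # in/out-of-frame BCE
    return w[0] * l_heat + w[1] * l_dir + w[2] * l_cons + w[3] * l_io

# Example inputs: a 4 x 4 heatmap with a single positive cell.
gt = np.zeros((4, 4))
gt[1, 2] = 1.0
d = np.array([1.0, 0.0])
f = np.ones(8)
good = total_loss(gt, gt, d, d, f, f, np.array([1.0]), np.array([1.0]))
bad = total_loss(1.0 - gt, gt, -d, d, f, 2 * f, np.array([0.0]), np.array([1.0]))
```

Perfect predictions drive every term toward zero, while errors in any head raise the total, which is what lets the three tasks be optimized jointly.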

Key Experimental Results

Main Results

Dataset	Metric	Ours	Prev. SOTA	Gain
GazeFollow	AUC	SOTA	Previous best method	Significant improvement
GazeFollow	Avg. Dist.	SOTA (Lower)	Previous best method	Significant reduction
GazeFollow	Min. Dist.	SOTA (Lower)	Previous best method	Significant reduction
VideoAttentionTarget	AUC	SOTA	Previous best method	Significant improvement
VideoAttentionTarget	Avg. Dist.	SOTA (Lower)	Previous best method	Significant reduction
ChildPlay	AUC	SOTA	Previous best method	Improvement
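For readers unfamiliar with the distance metrics in the table, the sketch below shows one way they are commonly computed on GazeFollow, where each test image has several annotated gaze points. It assumes Avg. Dist. is the distance from the prediction to the mean annotation and Min. Dist. is the distance to the closest annotation, in normalized image coordinates; these notes do not confirm the paper's exact protocol, and `gaze_distances` is a hypothetical name.

```python
import math

def gaze_distances(pred_xy, annotations):
    """Return (avg_dist, min_dist) for one prediction.

    pred_xy     : predicted gaze point, normalized to [0, 1]
    annotations : list of annotated gaze points, normalized to [0, 1]
    """
    n = len(annotations)
    mean_x = sum(a[0] for a in annotations) / n
    mean_y = sum(a[1] for a in annotations) / n
    # Distance to the mean of all annotations (assumed Avg. Dist. convention).
    avg_dist = math.hypot(pred_xy[0] - mean_x, pred_xy[1] - mean_y)
    # Distance to the closest single annotation (assumed Min. Dist. convention).
    min_dist = min(math.hypot(pred_xy[0] - a[0], pred_xy[1] - a[1])
                   for a in annotations)
    return avg_dist, min_dist

# Example: the prediction coincides with one of two annotations.
avg_d, min_d = gaze_distances((0.5, 0.5), [(0.5, 0.5), (0.9, 0.5)])
```

Lower is better for both distances, whereas AUC is higher-is-better.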

Ablation Study

Configuration	Key Metric	Description
Global + Head only (No local view)	Lower AUC	Baseline of the traditional two-view scheme
With local view (No coordination mechanism)	AUC improvement	Local view itself contributes to performance
With position consistency	Further AUC improvement	Spatial correspondence helps precise localization
With representation consistency	Additional AUC improvement	Feature fusion at the semantic level is critical
Full model (All components)	Highest AUC	Three components cooperate complementarily
Different FOV sizes	Optimal value exists	Too large degenerates into the global view; too small fails to cover the target

Key Findings

The introduction of the local view brings systematic and consistent performance improvements to gaze target detection.
The design of the FOV-based local view is more effective than random cropping or fixed-size cropping.
The proposed method has good scalability and can act as a plug-and-play module to enhance existing gaze target detection methods.
State-of-the-art performance is achieved across multiple benchmark datasets.
Both components of the view coordination mechanism (position consistency and representation consistency) make independent contributions.

Highlights & Insights

Elegant Design Intuition: The introduction of the FOV local view mimics the human gaze mechanism—first estimating a rough direction from the head, then searching for targets within the field of view, which aligns well with cognitive intuition.
Scalability: The paper shows that the framework can be applied to enhance existing gaze target detection methods, which supports its generalizability.
Complementary Multi-scale Information: The head view provides face orientation information, the local view provides target candidate regions, and the global view provides scene context—forming effective information complementarity.
Dual Consistency of Position and Representation: Consistency is pursued not only in spatial position but also in feature representation, presenting a comprehensive fusion strategy.

Limitations & Future Work

The construction of the FOV local view depends on initial gaze direction estimation; if the initial estimation has a large deviation, the local view may fail to cover the actual gaze target.
When multiple people are present in the scene, the local views of different people may overlap, increasing processing complexity.
For cases where the gaze target is at the image edge or outside the image, the effectiveness of the local view may be reduced.
Exploring temporal information (such as gaze trajectories across consecutive frames in videos) to further improve prediction accuracy is a promising direction.
Combining depth information could help resolve depth ambiguities of gaze targets in 3D space.

Related Work & Insights

GazeFollow / VideoAttentionTarget: Classic datasets and baseline methods for gaze target detection.
Detecting Attended Visual Targets in Video: Pioneering work in video gaze target detection.
Gaze360: Dataset and method for 3D gaze estimation in physically unconstrained settings.
Where are they looking / Who is looking at who: Gaze analysis methods in social scenarios.
Insight: Combining multi-scale view designs with human visual cognitive mechanisms is an effective methodology.

Rating

Novelty: ⭐⭐⭐ Introducing the FOV local view and tri-view coordination mechanism represents an entirely new framework design.
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks, comprehensive ablation studies, and scalability validation.
Writing Quality: ⭐⭐⭐⭐ Clear motivation and natural design logic.
Value: ⭐⭐⭐⭐ Proposes an extensible framework that practically advances the field of gaze target detection.