Toward Gaze Target Detection in Young Autistic Children

Conference: AAAI 2026 · arXiv: 2511.11244 · Code: ShijianDeng/AGT · Area: Signal & Communication · Keywords: Gaze Target Detection, Autism Spectrum Disorder, Class Imbalance, Multimodal Large Language Models, Coarse-to-Fine Framework

TL;DR

To address the severe class imbalance in gaze target detection for autistic children (face-directed gaze accounts for only 6.6% of samples), this paper proposes the Socially Aware Coarse-to-Fine (SACF) framework. A fine-tuned Qwen2.5-VL serves as a social-context-aware gate that routes each input to either a socially aware or a socially agnostic expert model. Evaluated on the newly introduced AGT dataset, the framework substantially improves face-gaze detection: with the Sharingan backbone, Face L2 drops by 13.9% and F1 rises from 0.753 to 0.761.

Background & Motivation

  • Autism Spectrum Disorder (ASD) affects approximately 1 in 31 eight-year-old children. A hallmark characteristic is atypical social attention, particularly difficulties in initiating and responding to joint attention.
  • Joint attention assessment is a cornerstone of early ASD diagnosis and intervention, yet it relies on highly trained professionals and is labor-intensive and difficult to scale.
  • Existing gaze target detection research focuses almost exclusively on neurotypical adults and children; such models exhibit substantially degraded performance on autistic children.
  • Root cause: autistic children direct their gaze toward faces far less frequently than neurotypical peers, resulting in severe class imbalance in the data (only 6.6% face-directed gaze). Standard models tend to predict non-social targets and miss clinically critical social interaction moments.
  • No autism-specific gaze target detection dataset exists.

Core Problem

Given the extreme scarcity of face-directed gaze (6.6%) in autistic children's gaze data, how can gaze targets be accurately detected—especially without missing clinically critical face-gaze events? This represents a joint challenge of domain adaptation and class imbalance.

Method

Overall Architecture

An input image \(I\) and the child's head bounding box \(B_\text{head}\) are passed to the Social Context Awareness (SCA) module (fine-tuned Qwen2.5-VL-7B), which estimates a social context score \(s\) (the probability that the child is looking at a face). A threshold gate routes the input: if \(s\) is high (social scene), the input is background-blurred and forwarded to the Socially Aware Gaze Expert; otherwise, it is sent to the Socially Agnostic Gaze Expert. Each expert outputs a gaze heatmap; argmax yields the predicted gaze point. A spatial check then determines whether the predicted point falls within an adult face bounding box, producing the final semantic label (Face / Not Face).
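To make the routing concrete, here is a minimal sketch of this inference flow. All component names (sca_module, expert_aware, expert_agnostic, blur_background) are illustrative stand-ins, the 0.5 gate threshold is an assumed value, and which regions stay sharp after background blurring is likewise assumed (blur_background itself is sketched under Key Designs below).

```python
import numpy as np

FACE_THRESHOLD = 0.5  # assumed gate threshold on the social context score s

def detect_gaze_target(image, head_box, adult_face_boxes,
                       sca_module, expert_aware, expert_agnostic):
    # Coarse stage: the MLLM gate estimates s = P(child is looking at a face).
    s = sca_module(image, head_box)

    # Route to the matching fine-grained expert.
    if s >= FACE_THRESHOLD:
        # Social scene: blur the background, keep plausible targets sharp.
        heatmap = expert_aware(blur_background(image, adult_face_boxes), head_box)
    else:
        heatmap = expert_agnostic(image, head_box)

    # Argmax over the heatmap yields the predicted gaze point
    # (assuming the heatmap lives in image-pixel coordinates).
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)

    # Spatial check: label is Face iff the point falls inside an adult face box.
    is_face = any(x0 <= x <= x1 and y0 <= y <= y1
                  for (x0, y0, x1, y1) in adult_face_boxes)
    return (x, y), ("Face" if is_face else "Not Face")
```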

Key Designs

  1. Autism Gaze Target (AGT) Dataset:

    • The first gaze target detection dataset for autistic children, sourced from 59 ethically approved videos recorded during CSBS-DP assessments.
    • 16,582 annotated frames: 9,874 training / 3,344 validation / 3,364 test.
    • Annotations include child head bounding boxes, adult face bounding boxes, gaze target points, and semantic labels (object / face / person-non-face / no object).
    • Class distribution: Face 6.6% (1,088 frames), Not Face 93.4% (15,494 frames).
    • Inter-annotator agreement: Cohen's Kappa = 0.757 (substantial agreement).
  2. Social Context Awareness (SCA) Module:

    • Qwen2.5-VL-7B-Instruct is fine-tuned as a binary classifier to estimate the probability of the child looking at a face.
    • A threshold converts the score into a coarse label: Face or Not Face.
    • SCA performance: Face recall 65.53%, Not-Face recall 98.10%, Face F1 = 0.673.
    • Leverages the large-scale pretraining of MLLMs to understand scene-level social context (a scoring sketch follows this list).
  3. Two-Pathway Gated Experts:

    • Socially Aware Gaze Expert (\(\text{Ex}_\text{aware}\)): Trained on augmented data in which, with the target known at training time, irrelevant background regions are strongly Gaussian-blurred; this guides the model to focus on plausible targets (e.g., faces) and avoids extreme failures. It is optimized specifically for face-directed gaze scenarios (an augmentation sketch follows this list).
    • Socially Agnostic Gaze Expert (\(\text{Ex}_\text{agnostic}\)): Trained on the original unmodified data; maximizes performance for high-frequency non-social and ambiguous scenes without constraints imposed by face-class requirements.
    • Both experts are built on the GazeLLE architecture (frozen DINOv2 encoder + lightweight Transformer decoder) or the Sharingan architecture.
  4. Gating and Inference:

    • The SCA output determines which expert processes the input.
    • Final semantic classification is based on a spatial check: whether the predicted gaze point falls within an adult face bounding box.
    • When the SCA gate is correct (~96% of the time), the framework fully exploits the complementary strengths of both experts.
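The summary does not specify how the fine-tuned MLLM is queried for a probability. One common recipe for turning an instruction-tuned VLM into a scorer is to renormalize the next-token logits over the two answer words; below is a sketch under that assumption, using the public Qwen2.5-VL-7B-Instruct checkpoint (the prompt wording, head-box encoding, and "Yes"/"No" answer tokens are all assumptions, not the paper's exact setup).

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # the paper fine-tunes this model on AGT
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

def social_context_score(image_path, head_box):
    # Hypothetical prompt; the paper's actual prompt is not given in the summary.
    question = (
        f"The child's head is at box {head_box}. "
        "Is the child looking at a person's face? Answer Yes or No."
    )
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": question},
    ]}]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    images, videos = process_vision_info(messages)
    inputs = processor(text=[text], images=images, videos=videos,
                       padding=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]  # first answer-token logits
    yes_id = processor.tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = processor.tokenizer("No", add_special_tokens=False).input_ids[0]
    # Social context score s: P("Yes") renormalized over the two answer tokens.
    return torch.softmax(next_logits[[yes_id, no_id]], dim=0)[0].item()
```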
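The background-blur augmentation for the socially aware expert can be sketched as follows; the box format (x0, y0, x1, y1), kernel size, and blur strength are assumptions rather than the paper's reported settings.

```python
import cv2

def blur_background(image, keep_boxes, ksize=51, sigma=25):
    """Strongly Gaussian-blur everything outside the kept regions.

    At training time the gaze target is known, so the regions worth keeping
    sharp (e.g., adult faces) can be enumerated; everything else is blurred
    to push the model toward plausible social targets.
    """
    blurred = cv2.GaussianBlur(image, (ksize, ksize), sigma)
    out = blurred.copy()
    for (x0, y0, x1, y1) in keep_boxes:
        out[y0:y1, x0:x1] = image[y0:y1, x0:x1]  # restore sharp pixels in-box
    return out
```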

Loss & Training

  • Gaze heatmap loss: pixel-level Binary Cross-Entropy (BCE) between the predicted heatmap and a ground-truth heatmap formed by placing a Gaussian at the annotated gaze point (a sketch follows this list).
  • SCA module: Qwen2.5-VL-7B fine-tuned on the AGT training set for binary classification (face-looking / not-face-looking).
  • Socially Aware Expert: trained with BCE loss on background-blurred augmented data.
  • Socially Agnostic Expert: trained with BCE loss on the original data.
  • Hardware: NVIDIA A6000 GPU.
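A minimal PyTorch sketch of this supervision; the Gaussian width sigma is an assumed value, and the expert is assumed to output heatmap logits.

```python
import torch
import torch.nn.functional as F

def gaussian_heatmap(h, w, cx, cy, sigma=3.0):
    # Ground-truth heatmap: a Gaussian centered on the annotated gaze point.
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def heatmap_loss(pred_logits, gaze_xy, sigma=3.0):
    # Pixel-level BCE between the predicted heatmap and the Gaussian target.
    h, w = pred_logits.shape[-2:]
    target = gaussian_heatmap(h, w, gaze_xy[0], gaze_xy[1], sigma)
    return F.binary_cross_entropy_with_logits(pred_logits, target)
```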

Key Experimental Results

| Method | L2 ↓ | L2_obj ↓ | L2_face ↓ | L2_pnf ↓ | Precision ↑ | Recall ↑ | F1 ↑ |
|---|---|---|---|---|---|---|---|
| Sharingan (original) | 0.0615 | 0.0615 | 0.0647 | 0.0561 | 0.4377 | 0.8010 | 0.5660 |
| Sharingan-AGT | 0.0486 | 0.0451 | 0.0949 | 0.0595 | 0.7744 | 0.7330 | 0.7531 |
| Sharingan-SACF (Ours) | 0.0480 | 0.0453 | 0.0817 | 0.0616 | 0.7647 | 0.7573 | 0.7610 |
| GazeLLE (original) | 0.0670 | 0.0630 | 0.1092 | 0.1041 | 0.3868 | 0.6553 | 0.4865 |
| GazeLLE-AGT | 0.0460 | 0.0405 | 0.1130 | 0.0804 | 0.6984 | 0.6408 | 0.6684 |
| GazeLLE-SACF (Ours) | 0.0453 | 0.0405 | 0.1019 | 0.0804 | 0.7041 | 0.6699 | 0.6866 |

(Lower L2 is better; "pnf" = person-non-face.)
  • Upper bound (oracle gate): Sharingan-SACF reaches F1 = 0.9786, L2_face = 0.0378; GazeLLE-SACF reaches F1 = 0.9903, L2_face = 0.0307.

Ablation Study

  • Neurotypical models fail on autism data: The original Sharingan/GazeLLE models, trained on the neurotypical ChildPlay dataset, achieve only F1 = 0.566/0.487 on AGT, systematically over-predicting face-directed gaze in autistic children.
  • Substantial improvement in Face L2: SACF reduces face gaze error for Sharingan from 0.0949 to 0.0817 (−13.9%) and for GazeLLE from 0.1130 to 0.1019 (−9.8%); the arithmetic is spelled out after this list.
  • Clear expert specialization: When GT = Face, the Socially Aware Expert achieves L2 = 0.0303, roughly three times lower than the Agnostic Expert's 0.0898; conversely, when a Not-Face sample is misrouted through the face pathway, the Agnostic Expert's error is 3.4× lower.
  • Gate quality is the key bottleneck: The oracle gate raises F1 from 0.761 to 0.979, indicating substantial headroom for SCA improvement; stronger future MLLMs can directly boost overall system performance.
  • Clinical significance: A 10–14% reduction in face localization error corresponds to the predicted point moving 4–6 pixels closer to the true face region at 224×224 resolution.
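For reference, the quoted Face L2 reductions follow directly from the results table:

\[
\frac{0.0949 - 0.0817}{0.0949} \approx 13.9\%,
\qquad
\frac{0.1130 - 0.1019}{0.1130} \approx 9.8\%.
\]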

Highlights & Insights

  • The first systematic study of gaze target detection in autistic children, together with the new AGT dataset (16,582 frames), filling an important gap in the field.
  • The SACF "divide-and-conquer" framework is elegantly designed: an MLLM performs coarse social scene classification while expert models handle fine-grained localization, effectively addressing extreme class imbalance.
  • The upper bound analysis clearly identifies the system bottleneck (gate quality) and implies that performance will naturally improve as MLLMs advance.
  • The problem formulation carries strong clinical value, bridging AI research and early autism intervention.

Limitations & Future Work

  • The SCA module (Qwen2.5-VL-7B) achieves only 65.53% Face recall, representing the primary system bottleneck.
  • The framework involves multiple models (one MLLM plus two experts), resulting in relatively high inference cost.
  • The AGT dataset is derived from a single assessment context (CSBS-DP); generalization to other naturalistic settings remains to be validated.
  • Only binary Face / Not Face classification is modeled; finer-grained target semantics (e.g., specific toys or individuals) are not addressed.
  • Temporal information (gaze trajectories across video frames) is unexplored, which may be more valuable for joint attention detection.
Comparison with Alternative Approaches

  • vs. GazeLLE / Sharingan (trained on ChildPlay): These models are trained on neurotypical data and remain biased toward high-frequency social gaze distributions; applied directly to autistic children, they achieve only F1 = 0.49–0.57. Fine-tuning on AGT raises F1 to 0.67–0.75, and SACF yields further gains.
  • vs. direct fine-tuning (GazeLLE-AGT): Direct fine-tuning on AGT improves overall L2 but may worsen Face L2 due to class imbalance (GazeLLE: 0.1130 vs. original 0.1092). SACF resolves this tension through expert routing.
  • vs. using Qwen2.5-VL directly for gaze detection: MLLMs excel at scene understanding but are not suited for precise spatial localization. SACF restricts the MLLM to a coarse-grained routing role while delegating precise localization to dedicated gaze models.

Broader Implications

  • The design pattern of using an MLLM as a "semantic scene router" is generalizable to other vision tasks with severe class imbalance.

  • The two-stage paradigm of coarse semantic classification followed by fine-grained expert localization has analogous applicability in medical imaging and similar domains.
  • AI-assisted autism assessment tools can be extended toward automatic joint attention scoring and quantitative tracking of intervention outcomes.
  • The AGT dataset can serve as a benchmark for studying domain transfer and extreme class imbalance.

Rating

  • Novelty: ⭐⭐⭐⭐ First to define and address the autism gaze detection problem; the MLLM-routed expert framework is creative, though technically it combines existing components.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive multi-baseline comparisons; insightful upper bound analysis; lacks cross-setting generalization experiments.
  • Writing Quality: ⭐⭐⭐⭐⭐ Motivation is clear, problem definition is precise, and clinical significance is thoroughly articulated—an exemplary paper for socially impactful research.
  • Value: ⭐⭐⭐⭐ High social impact; lays the foundation for AI-assisted autism assessment; the technical approach also offers reference value for other imbalanced detection tasks.