Toward Gaze Target Detection in Young Autistic Children¶
Conference: AAAI 2026 | arXiv: 2511.11244 | Code: ShijianDeng/AGT | Area: Signal & Communication | Keywords: Gaze Target Detection, Autism Spectrum Disorder, Class Imbalance, Multimodal Large Language Models, Coarse-to-Fine Framework
TL;DR¶
To address the severe class imbalance in gaze target detection for autistic children—where face-directed gaze accounts for only 6.6% of samples—this paper proposes the Socially Aware Coarse-to-Fine (SACF) framework. A fine-tuned Qwen2.5-VL serves as a social-context-aware gate that routes inputs to either a socially aware or a socially agnostic expert model. Evaluated on the newly introduced AGT dataset, the framework substantially improves face gaze detection performance (Face L2 reduced by 13.9% on Sharingan; F1 improved from 0.753 to 0.761).
Background & Motivation¶
- Autism Spectrum Disorder (ASD) affects approximately 1 in 31 eight-year-old children. A hallmark characteristic is atypical social attention, particularly difficulties in initiating and responding to joint attention.
- Joint attention assessment is a cornerstone of early ASD diagnosis and intervention, yet it relies on highly trained professionals and is labor-intensive and difficult to scale.
- Existing gaze target detection research focuses almost exclusively on neurotypical adults and children; such models exhibit substantially degraded performance on autistic children.
- Root cause: autistic children direct their gaze toward faces far less frequently than neurotypical peers, resulting in severe class imbalance in the data (only 6.6% face-directed gaze). Standard models tend to predict non-social targets and miss clinically critical social interaction moments.
- No autism-specific gaze target detection dataset exists.
Core Problem¶
Given the extreme scarcity of face-directed gaze (6.6%) in autistic children's gaze data, how can gaze targets be accurately detected—especially without missing clinically critical face-gaze events? This represents a joint challenge of domain adaptation and class imbalance.
Method¶
Overall Architecture¶
An input image \(I\) and the child's head bounding box \(B_\text{head}\) are passed to the Social Context Awareness (SCA) module (fine-tuned Qwen2.5-VL-7B), which estimates a social context score \(s\) (the probability that the child is looking at a face). A threshold gate routes the input: if \(s\) is high (social scene), the input is background-blurred and forwarded to the Socially Aware Gaze Expert; otherwise, it is sent to the Socially Agnostic Gaze Expert. Each expert outputs a gaze heatmap; argmax yields the predicted gaze point. A spatial check then determines whether the predicted point falls within an adult face bounding box, producing the final semantic label (Face / Not Face).
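The routing logic above can be sketched in a few lines. This is a minimal illustration of the described control flow, not the authors' implementation; all callables (`sca_score`, `blur_background`, `expert_aware`, `expert_agnostic`) and the gate threshold are hypothetical placeholders.

```python
import numpy as np

THRESHOLD = 0.5  # assumed gate threshold, not specified here

def sacf_predict(image, head_box, face_boxes,
                 sca_score, blur_background, expert_aware, expert_agnostic):
    """Gate on the social-context score, run one expert, and label the gaze point."""
    s = sca_score(image, head_box)                 # P(child is looking at a face)
    if s >= THRESHOLD:                             # social scene -> aware expert
        heatmap = expert_aware(blur_background(image), head_box)
    else:                                          # otherwise -> agnostic expert
        heatmap = expert_agnostic(image, head_box)
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)  # gaze point
    on_face = any(x0 <= x <= x1 and y0 <= y <= y1              # spatial check
                  for (x0, y0, x1, y1) in face_boxes)
    return (int(x), int(y)), ("Face" if on_face else "Not Face")
```

Note that the semantic label comes purely from the spatial check against adult face boxes, so the experts themselves never need to classify.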
Key Designs¶
- Autism Gaze Target (AGT) Dataset:
- The first gaze target detection dataset for autistic children, sourced from 59 ethically approved videos recorded during CSBS-DP assessments.
- 16,582 annotated frames: 9,874 training / 3,344 validation / 3,364 test.
- Annotations include child head bounding boxes, adult face bounding boxes, gaze target points, and semantic labels (object / face / person-non-face / no object).
- Class distribution: Face 6.6% (1,088 frames), Not Face 93.4% (15,494 frames).
- Inter-annotator agreement: Cohen's Kappa = 0.757 (substantial agreement).
- Social Context Awareness (SCA) Module:
- Qwen2.5-VL-7B-Instruct is fine-tuned as a binary classifier to estimate the probability of the child looking at a face.
- A threshold converts the score into a coarse label: Face or Not Face.
- SCA performance: Face recall 65.53%, Not-Face recall 98.10%, Face F1 = 0.673.
- Leverages the large-scale pretraining of MLLMs to understand scene-level social context.
- Two-Pathway Gated Experts:
- Socially Aware Gaze Expert (\(\text{Ex}_\text{aware}\)): Trained on augmented data in which irrelevant background regions are strongly Gaussian-blurred when the target is known, guiding the model to focus on plausible targets (e.g., faces) and avoid extreme failures; optimized specifically for face-directed gaze scenarios.
- Socially Agnostic Gaze Expert (\(\text{Ex}_\text{agnostic}\)): Trained on the original unmodified data; maximizes performance for high-frequency non-social and ambiguous scenes without constraints imposed by face-class requirements.
- Both experts are built on the GazeLLE architecture (frozen DINOv2 encoder + lightweight Transformer decoder) or the Sharingan architecture.
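The background-blur augmentation for the aware expert can be sketched as follows: blur the whole frame heavily, then paste the sharp target region back in. The `sigma` value and the box-based masking are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_background(image, target_box, sigma=15.0):
    """Return `image` Gaussian-blurred everywhere outside `target_box` = (x0, y0, x1, y1)."""
    blurred = gaussian_filter(image.astype(float), sigma=sigma)  # blur the full frame
    x0, y0, x1, y1 = target_box
    blurred[y0:y1, x0:x1] = image[y0:y1, x0:x1]                  # keep the target sharp
    return blurred
```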
- Gating and Inference:
- The SCA output determines which expert processes the input.
- Final semantic classification is based on a spatial check: whether the predicted gaze point falls within an adult face bounding box.
- When the SCA gate is correct (~96% of the time), the framework fully exploits the complementary strengths of both experts.
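As a back-of-envelope consistency check (not from the paper's code), the ~96% routing accuracy follows directly from the reported SCA per-class recalls weighted by the AGT class priors:

```python
# Overall gate accuracy implied by the reported SCA recalls (Face 65.53%,
# Not Face 98.10%) and the AGT class distribution (Face 6.6%, Not Face 93.4%).
face_prior, not_face_prior = 0.066, 0.934
face_recall, not_face_recall = 0.6553, 0.9810
gate_accuracy = face_prior * face_recall + not_face_prior * not_face_recall
# gate_accuracy ≈ 0.96, matching the ~96% figure above
```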
Loss & Training¶
- Gaze heatmap loss: pixel-level Binary Cross-Entropy (BCE) between the predicted heatmap and the Gaussian-blurred ground-truth heatmap.
- SCA module: Qwen2.5-VL-7B fine-tuned on the AGT training set for binary classification (face-looking / not-face-looking).
- Socially Aware Expert: trained with BCE loss on background-blurred augmented data.
- Socially Agnostic Expert: trained with BCE loss on the original data.
- Hardware: NVIDIA A6000 GPU.
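A minimal numpy sketch of this objective: build a Gaussian ground-truth heatmap centered on the annotated gaze point and score a predicted heatmap with pixel-wise BCE. The Gaussian `sigma` is an assumed value, not the paper's setting.

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=3.0):
    """Ground-truth heatmap: a 2D Gaussian bump at the gaze point (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def bce_heatmap_loss(pred, target, eps=1e-7):
    """Mean pixel-wise binary cross-entropy between two heatmaps in [0, 1]."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(target * np.log(pred)
                           + (1.0 - target) * np.log(1.0 - pred))))
```

Both experts share this loss; they differ only in whether their training images were background-blurred.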
Key Experimental Results¶
| Method | L2 | L2_obj | L2_face | L2_pnf | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| Sharingan (original) | 0.0615 | 0.0615 | 0.0647 | 0.0561 | 0.4377 | 0.8010 | 0.5660 |
| Sharingan-AGT | 0.0486 | 0.0451 | 0.0949 | 0.0595 | 0.7744 | 0.7330 | 0.7531 |
| Sharingan-SACF (Ours) | 0.0480 | 0.0453 | 0.0817 | 0.0616 | 0.7647 | 0.7573 | 0.7610 |
| GazeLLE (original) | 0.0670 | 0.0630 | 0.1092 | 0.1041 | 0.3868 | 0.6553 | 0.4865 |
| GazeLLE-AGT | 0.0460 | 0.0405 | 0.1130 | 0.0804 | 0.6984 | 0.6408 | 0.6684 |
| GazeLLE-SACF (Ours) | 0.0453 | 0.0405 | 0.1019 | 0.0804 | 0.7041 | 0.6699 | 0.6866 |
- Upper bound (oracle gate): Sharingan-SACF reaches F1 = 0.9786, L2_face = 0.0378; GazeLLE-SACF reaches F1 = 0.9903, L2_face = 0.0307.
Ablation Study¶
- Neurotypical models fail on autism data: The original Sharingan/GazeLLE models, trained on the neurotypical Childplay dataset, achieve only F1 = 0.566/0.487 on AGT, systematically overestimating face-directed gaze in autistic children.
- Substantial improvement in Face L2: SACF reduces face gaze error for Sharingan from 0.0949 to 0.0817 (−13.9%) and for GazeLLE from 0.1130 to 0.1019 (−9.8%).
- Clear expert specialization: When the ground truth is Face, the Socially Aware Expert achieves L2 = 0.0303, roughly three times better than the Agnostic Expert (0.0898); conversely, when a Not-Face frame is misrouted to the aware pathway, the Agnostic Expert is 3.4× better.
- Gate quality is the key bottleneck: The oracle gate raises F1 from 0.761 to 0.979, indicating substantial headroom for SCA improvement; stronger future MLLMs can directly boost overall system performance.
- Clinical significance: A 10–14% reduction in face localization error corresponds to the predicted point moving 4–6 pixels closer to the true face region at 224×224 resolution.
Highlights & Insights¶
- The first systematic study of gaze target detection in autistic children, along with the first AGT dataset (16,582 frames), filling an important gap in the field.
- The SACF "divide-and-conquer" framework is elegantly designed: an MLLM performs coarse social scene classification while expert models handle fine-grained localization, effectively addressing extreme class imbalance.
- The upper bound analysis clearly identifies the system bottleneck (gate quality) and implies that performance will naturally improve as MLLMs advance.
- The problem formulation carries strong clinical value, bridging AI research and early autism intervention.
Limitations & Future Work¶
- The SCA module (Qwen2.5-VL-7B) achieves only 65.53% Face recall, representing the primary system bottleneck.
- The framework involves multiple models (one MLLM plus two experts), resulting in relatively high inference cost.
- The AGT dataset is derived from a single assessment context (CSBS-DP); generalization to other naturalistic settings remains to be validated.
- Only binary Face / Not Face classification is modeled; finer-grained target semantics (e.g., specific toys or individuals) are not addressed.
- Temporal information (gaze trajectories across video frames) is unexplored, which may be more valuable for joint attention detection.
Related Work & Insights¶
- vs. GazeLLE / Sharingan (trained on Childplay): These models are trained on neurotypical data and remain biased toward high-frequency social gaze distributions; applied directly to autistic children, they achieve only F1 = 0.49–0.57. Fine-tuning on AGT raises F1 to 0.67–0.75, and SACF yields further gains.
- vs. direct fine-tuning (GazeLLE-AGT): Direct fine-tuning on AGT improves overall L2 but may worsen Face L2 due to class imbalance (GazeLLE: 0.1130 vs. original 0.1092). SACF resolves this tension through expert routing.
- vs. using Qwen2.5-VL directly for gaze detection: MLLMs excel at scene understanding but are not suited for precise spatial localization. SACF restricts the MLLM to a coarse-grained routing role while delegating precise localization to dedicated gaze models.
- The design pattern of using an MLLM as a "semantic scene router" is generalizable to other vision tasks with severe class imbalance.
- The two-stage paradigm of coarse semantic classification followed by fine-grained expert localization has analogous applicability in medical imaging and similar domains.
- AI-assisted autism assessment tools can be extended toward automatic joint attention scoring and quantitative tracking of intervention outcomes.
- The AGT dataset can serve as a benchmark for studying domain transfer and extreme class imbalance.
Rating¶
- Novelty: ⭐⭐⭐⭐ First to define and address the autism gaze detection problem; the MLLM-routed expert framework is creative, though technically it combines existing components.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive multi-baseline comparisons; insightful upper bound analysis; lacks cross-setting generalization experiments.
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation is clear, problem definition is precise, and clinical significance is thoroughly articulated—an exemplary paper for socially impactful research.
- Value: ⭐⭐⭐⭐ High social impact; lays the foundation for AI-assisted autism assessment; the technical approach also offers reference value for other imbalanced detection tasks.