GazeInterpreter: Parsing Eye Gaze to Generate Eye-Body-Coordinated Narrations¶
- Conference: AAAI 2026
- arXiv: 2511.16245
- Code: https://github.com/EvergreenChang/GazeInterpreter
- Area: LLM Evaluation
- Keywords: Eye Gaze Analysis, Human Behavior Understanding, Large Language Models, Multimodal Fusion, Motion Generation
TL;DR¶
This paper proposes GazeInterpreter, an LLM-based hierarchical framework that converts raw gaze signals into textual narrations via a symbolic gaze parser, integrates them with body motion narrations to produce eye-body-coordinated descriptions, and iteratively refines outputs through a self-correction loop, yielding significant improvements on downstream tasks including text-driven motion generation, action prediction, and behavior summarization.
Background & Motivation¶
Comprehensively interpreting human behavior is a central challenge in human-centric perceptual AI. Existing work, however, focuses almost exclusively on body behavior interpretation, largely neglecting eye gaze and its coordination with body movement:
Gaze as a direct window into intent: When reaching for a cup, the eyes typically fixate on the target before or during arm movement, directly revealing latent intention.
Strong eye-body coupling: Extensive prior research has established strong correlations between eye gaze and head, torso, and whole-body motion.
Limitations of prior work: Methods such as MotionGPT and MotionLLM project body motion or video into language space to generate descriptions, but entirely omit gaze information.
Core challenge: How can low-level continuous numerical gaze sensor data be reliably converted into high-level structured semantic representations? Feeding raw numerical values directly to an LLM risks factual hallucination and signal disconnection.
Core solution: Raw gaze data is first abstracted into an intermediate vocabulary of symbolic gaze events, providing a reliable semantic grounding, before multi-level fusion is performed via LLM.
Method¶
Overall Architecture¶
GazeInterpreter adopts a three-phase hierarchical coarse-to-fine architecture:
- Phase 1: Parse raw gaze signals → textual narration (symbolic parsing + LLM generation)
- Phase 2: Integrate gaze narration with body motion atomic narration → eye-body-coordinated narration
- Phase 3: Self-correction loop → multi-dimensional iterative refinement
Key Designs¶
1. Symbolic Gaze Parser¶
A deterministic module that converts continuous gaze signals into discrete symbolic event sequences.
Processing steps:

- Given the raw gaze signal \(S_i^g \in \mathbb{R}^{N_g \times 2}\) (yaw, pitch sequence), compute the instantaneous angular velocity \(\omega_j = \frac{\sqrt{(y_j - y_{j-1})^2 + (p_j - p_{j-1})^2}}{t_j - t_{j-1}}\).
- Apply the I-VT (Identification by Velocity Threshold) algorithm with dual thresholds (\(v_{\text{low}}=30°/s\), \(v_{\text{high}}=100°/s\)) to classify the signal into three event primitives: Fixation, Saccade, and SmoothPursuit.
- Each event encapsulates not only its category but also quantitative attributes such as duration, amplitude, and peak velocity, together with corresponding qualitative descriptors.
Design Motivation: Abstracting noisy high-dimensional signals into compact, machine-readable symbolic representations avoids hallucination issues that arise when LLMs process raw numerical data directly.
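A minimal sketch of such a dual-threshold parser is shown below; the function and field names are illustrative assumptions, not taken from the released code.

```python
import numpy as np

V_LOW, V_HIGH = 30.0, 100.0  # I-VT dual velocity thresholds from the paper (deg/s)

def parse_gaze_events(t, yaw, pitch):
    """Hypothetical I-VT-style parser: turn a timestamped (yaw, pitch) gaze stream
    into symbolic events. Velocity below V_LOW -> Fixation, above V_HIGH -> Saccade,
    in between -> SmoothPursuit; consecutive samples with the same label are merged
    into one event carrying quantitative attributes."""
    t, yaw, pitch = map(np.asarray, (t, yaw, pitch))
    # Instantaneous angular velocity omega_j between neighbouring samples (deg/s).
    omega = np.sqrt(np.diff(yaw) ** 2 + np.diff(pitch) ** 2) / np.diff(t)
    labels = np.where(omega < V_LOW, "Fixation",
             np.where(omega > V_HIGH, "Saccade", "SmoothPursuit"))

    events, start = [], 0
    for j in range(1, len(labels) + 1):
        if j == len(labels) or labels[j] != labels[start]:
            events.append({
                "type": str(labels[start]),
                "duration_s": float(t[j] - t[start]),
                "amplitude_deg": float(np.hypot(yaw[j] - yaw[start], pitch[j] - pitch[start])),
                "peak_velocity_dps": float(omega[start:j].max()),
            })
            start = j
    return events
```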
2. Symbolic-to-Text Synthesizer¶
An LLM (Gemini-2.5-Flash) translates the symbolic event sequence \(E_i\) into a coherent textual narration \(T_i^g\). The core idea is to reframe the LLM's task from "high-risk numerical inference" to "low-risk factual translation"—rendering symbolically grounded, verifiable behavioral descriptions into fluent natural language. Carefully designed few-shot prompts are employed.
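The paper's prompts are not reproduced here; a hypothetical few-shot prompt in the same spirit, consuming the event dictionaries from the parser sketch above, might look like this:

```python
def build_gaze_narration_prompt(events):
    """Hypothetical few-shot prompt: the LLM only translates the symbolic events it
    is given into prose, rather than inferring behaviour from raw numbers."""
    example = (
        "Events: Fixation(1.2 s) -> Saccade(14 deg, peak 210 deg/s) -> Fixation(0.8 s)\n"
        "Narration: The user holds their gaze steady, then shifts it sharply to a new "
        "point of interest and settles there.\n"
    )
    event_str = " -> ".join(
        f"{e['type']}(duration {e['duration_s']:.1f} s, "
        f"amplitude {e['amplitude_deg']:.1f} deg, peak {e['peak_velocity_dps']:.0f} deg/s)"
        for e in events
    )
    return (
        "Translate the following symbolic gaze events into a fluent narration. "
        "Describe only what the events state; do not invent numbers.\n\n"
        + example + "\nEvents: " + event_str + "\nNarration:"
    )
```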
3. Eye-Body Motion Integration (Phase 2)¶
Historical context: A sliding observation window (\(W=2\)) aggregates historical context \(\mathcal{H}_i\), comprising (i) previously inferred integrated narrations and (ii) feedback from the preceding self-correction round.
Integrated narration generation: A structured prompt template is constructed: \(\Pi_{\text{integ}}(i) = [\texttt{CTX}:\mathcal{H}_i;\ \texttt{GAZE}:T_i^g;\ \texttt{MOTION}:S_i^m]\)
The LLM performs reasoning over the structured input (rather than mere summarization), for example inferring "the user is carefully scanning the ground while walking" by associating a gaze shift with the concurrent walking motion.
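A hedged sketch of how this structured template could be rendered as an actual prompt string (the instruction wording is an assumption, not the paper's template):

```python
W = 2  # sliding observation window size from the paper

def build_integration_prompt(history, gaze_narration, motion_narration):
    """Hypothetical rendering of Pi_integ(i) = [CTX: H_i; GAZE: T_i^g; MOTION: ...].
    `history` holds the last W integrated narrations plus any feedback from the
    preceding self-correction round."""
    ctx = "\n".join(history[-W:]) if history else "(none)"
    return (
        "Given the recent behavioural context, the current gaze narration, and the "
        "current body-motion narration, reason about how eye and body are coordinated "
        "and write one integrated narration.\n\n"
        f"CTX:\n{ctx}\n\nGAZE:\n{gaze_narration}\n\nMOTION:\n{motion_narration}\n\n"
        "Integrated narration:"
    )
```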
4. Self-Correction Loop (Phase 3)¶
Multi-dimensional evaluation and iterative refinement are performed through the collaboration of \(\text{LLM}_{\text{eval}}\) and \(\text{LLM}_{\text{refine}}\):
Evaluation dimensions (each scored 1–5):
| Type | Dimension | High Score | Low Score |
|---|---|---|---|
| Gaze narration | Continuity | Natural, fluid gaze transitions | Abrupt or illogical event descriptions |
| Integrated narration | Modal matching | Cross-modal mutually supportive integration | Modal disconnection, redundancy, or contradiction |
| Integrated narration | Temporal consistency | Clear temporal logical progression | Absence of identifiable temporal structure |
| Integrated narration | Completeness | All key elements fully covered | Missing critical information or behavioral events |
The loop iterates up to \(K_{\text{max}}=3\) times until all scores \(\geq \tau=4.5\) or the maximum iteration count is reached.
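The control flow of the loop is simple; a minimal sketch is given below, where `llm_eval` and `llm_refine` are placeholders standing in for the paper's evaluation and refinement prompts:

```python
K_MAX, TAU = 3, 4.5
DIMENSIONS = ("continuity", "modal_matching", "temporal_consistency", "completeness")

def self_correct(narration, llm_eval, llm_refine):
    """Refine a narration until every quality dimension scores at least TAU (on a
    1-5 scale) or K_MAX rounds are exhausted. `llm_eval` is assumed to return a
    (scores, feedback) pair; `llm_refine` rewrites the narration given the feedback."""
    for _ in range(K_MAX):
        scores, feedback = llm_eval(narration, DIMENSIONS)
        if all(scores[d] >= TAU for d in DIMENSIONS):
            break
        narration = llm_refine(narration, feedback)
    return narration
```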
Loss & Training¶
GazeInterpreter requires no conventional model training; it leverages the few-shot in-context learning capability of a pretrained LLM (Gemini-2.5-Flash). Key hyperparameters:

- I-VT thresholds: \(v_{\text{low}}=30°/s\), \(v_{\text{high}}=100°/s\)
- Sliding window size: \(W=2\)
- Self-correction maximum iterations: \(K_{\text{max}}=3\), score threshold \(\tau=4.5\)
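Putting the pieces together, the whole inference-only pipeline amounts to chaining the sketches above; the `llm` callable is a placeholder for a Gemini-2.5-Flash API call and is an assumption, not the released implementation:

```python
def gaze_interpreter(t, yaw, pitch, motion_narration, history, llm, llm_eval, llm_refine):
    """End-to-end sketch of the three phases, reusing the hypothetical helpers above."""
    events = parse_gaze_events(t, yaw, pitch)                   # Phase 1: symbolic parsing
    gaze_narration = llm(build_gaze_narration_prompt(events))   # Phase 1: symbol-to-text
    integrated = llm(build_integration_prompt(                  # Phase 2: eye-body integration
        history, gaze_narration, motion_narration))
    return self_correct(integrated, llm_eval, llm_refine)       # Phase 3: self-correction
```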
Key Experimental Results¶
Main Results¶
Text-driven motion generation is evaluated on the large-scale Nymeria benchmark with fixed MotionGPT weights, comparing different textual inputs:
| Scene Type | Method | MM Dist↓ | FID↓ | Top-1↑ | Top-3↑ | MM↑ |
|---|---|---|---|---|---|---|
| Low-level | MotionGPT | 6.748 | 7.458 | 0.052 | 0.187 | 3.469 |
| Low-level | +GazeInterpreter | 6.406 | 6.801 | 0.102 | 0.214 | 3.727 |
| High-level | MotionGPT | 7.133 | 8.804 | 0.054 | 0.162 | 3.223 |
| High-level | +GazeInterpreter | 6.862 | 8.134 | 0.062 | 0.193 | 3.864 |
| All | MotionGPT | 6.941 | 8.131 | 0.053 | 0.175 | 3.346 |
| All | +GazeInterpreter | 6.634 | 7.468 | 0.082 | 0.204 | 3.796 |
Downstream tasks:
| Task | Method | Cosine Sim↑ | BERT F1↑ | ROUGE-L↑ | Action F1↑ |
|---|---|---|---|---|---|
| Action Prediction | Nymeria | 0.459 | 0.868 | 0.202 | 0.226 |
| Action Prediction | GazeInterpreter | 0.506 | 0.879 | 0.231 | 0.248 |
| Behavior Summarization | Nymeria | 0.480 | 0.836 | 0.197 | 0.150 |
| Behavior Summarization | GazeInterpreter | 0.537 | 0.860 | 0.575 | 0.229 |
Ablation Study¶
| Configuration | MM Dist↓ | FID↓ | Top-1↑ | Note |
|---|---|---|---|---|
| w/o hierarchical structure | 8.135 | 9.124 | 0.059 | Largest performance drop, validating the centrality of hierarchical integration |
| w/o symbolic parser | 7.642 | 7.893 | 0.061 | Direct use of raw signals causes degradation |
| w/o self-correction | 7.425 | 7.831 | 0.063 | Absence of iterative refinement reduces quality |
| Full GazeInterpreter | 6.634 | 7.468 | 0.082 | All modules enabled |
Incremental analysis of self-correction quality dimensions:
| Continuity | Matching | Temporal | Completeness | Top-1↑ | FID↓ |
|---|---|---|---|---|---|
| | | | | 0.063 | 7.831 |
| ✓ | | | | 0.069 | 7.722 |
| ✓ | ✓ | | | 0.072 | 7.644 |
| ✓ | ✓ | ✓ | | 0.074 | 7.573 |
| ✓ | ✓ | ✓ | ✓ | 0.082 | 7.468 |
The introduction of each evaluation dimension yields cumulative performance gains.
Key Findings¶
- Gaze information is critical for motion generation: Solely enriching text descriptions with gaze information—without modifying the generative model—yields significant FID improvement (8.131→7.468).
- Low-level scenes benefit more: The fine-grained intent information provided by gaze is particularly beneficial for precise atomic motion generation.
- Eye-body-coordinated narrations are more predictive than human annotations: In action prediction, GazeInterpreter narrations achieve higher Action F1 than the manually annotated Nymeria data.
- The symbolic intermediate layer is essential: Having the LLM process raw numerical signals directly leads to substantial degradation.
- Sliding window \(W=2\) is optimal: Enlarging the window yields diminishing marginal returns and introduces redundant noise.
Highlights & Insights¶
- Opens a new research direction: This is the first work to systematically integrate eye gaze parsing with body motion narration, revealing the substantial potential of gaze for behavior understanding.
- The numerical → symbolic → textual decomposition strategy is notably elegant, mitigating hallucination risks that arise when LLMs process sensor values directly.
- The multi-dimensional evaluation framework of the self-correction loop is transferable to other generative tasks.
- A training-free, inference-only framework: Built on LLM few-shot prompting and multi-stage reasoning, requiring no expensive task-specific model training.
- Consistent advantages are demonstrated across three tasks: motion generation, action prediction, and behavior summarization.
Limitations & Future Work¶
- Validated on a single dataset (Nymeria): Currently the only publicly available dataset containing both gaze and motion annotations, limiting generalizability.
- High inference cost: Three-stage LLM reasoning combined with the self-correction loop requires multiple LLM calls.
- Reliance on predefined thresholds: The velocity thresholds of the I-VT classifier must be set manually and may require adjustment across different scenarios.
- Lack of end-to-end joint optimization: The symbolic parsing, narration generation, and integration stages are entirely decoupled.
- Joint exploitation of egocentric image/video signals and gaze remains unexplored.
Related Work & Insights¶
- Fundamental distinction from MotionGPT/MotionLLM: This work addresses not only body motion but also the coordination between eye gaze and body.
- The I-VT algorithm has been widely used in classical gaze analysis; this work is the first to combine it with LLMs.
- The self-correction loop conceptually parallels the self-revision mechanism in Constitutional AI.
- Implications for embodied intelligence: When inferring human intent, gaze signals may reveal goals earlier and more directly than limb movements.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Pioneering integration of gaze parsing with LLM-based behavior understanding, opening a new direction
- Experimental Thoroughness: ⭐⭐⭐⭐ — Motion generation + downstream tasks + complete ablation, but limited to a single dataset
- Writing Quality: ⭐⭐⭐⭐ — Framework description is clear; motivation is thoroughly articulated
- Value: ⭐⭐⭐⭐ — Reveals the substantial potential of gaze in behavior understanding with long-term impact