Few-Shot Personalized Scanpath Prediction

Conference: CVPR 2025
arXiv: 2504.05499
Code: https://github.com/cvlab-stonybrook/few-shot-scanpath
Area: Video Understanding
Keywords: Scanpath Prediction, Few-Shot Learning, Personalization, Subject Embedding, Eye Tracking

TL;DR

This paper proposes the Few-Shot Personalized Scanpath Prediction (FS-PSP) task and the Subject-Embedding Network (SE-Net). By decoupling subject-embedding learning from scanpath prediction, the model can adapt to a new user using gaze data from only 1-10 images. In the 10-shot setting it outperforms the runner-up by 5.9% to 7.9% (relative) on the ScanMatch metric across three datasets (OSIE, COCO-FreeView, COCO-Search18). Adaptation takes only about 3.6 seconds and requires no fine-tuning.

Background & Motivation

Background: Scanpath prediction aims to predict the sequence of fixation points when a human views an image, including spatial and temporal information. Personalized scanpath prediction (PSP) further requires predicting unique attention patterns for specific individuals, as differences in personal cultural background, memory, and experience affect gaze behavior. Existing PSP methods such as ISP and EyeFormer achieve personalization by assigning a learnable embedding vector to each training user.

Limitations of Prior Work: Existing PSP methods require a large amount of data to train each user's embedding. ISP suffers a severe performance drop with only 10 support samples; EyeFormer requires at least 50 scanpaths to obtain stable personalized embeddings. More fundamentally, these methods jointly learn the subject embedding as a "byproduct" of scanpath prediction. Consequently, adapting to a new user requires fine-tuning to relearn the embedding, which is both time-consuming and prone to overfitting.

Key Challenge: Personalization requires enough data to characterize an individual's attention patterns, but in practical applications it is impractical to have every new user record eye-tracking data in a lab for an extended period. The model must therefore capture individual gaze characteristics rapidly from extremely limited data (1-10 images).

Goal: Given scanpaths of 1-10 images from a new user, how can personalized scanpaths be predicted for them instantly without model fine-tuning?

Key Insight: The authors propose a decoupling strategy that separates "learning what personalized attention features are" from "predicting scanpaths given these features". First, a network dedicated to extracting subject embeddings (SE-Net) is trained; then a scanpath predictor conditioned on these embeddings is trained. A new user only needs a single forward pass through SE-Net to obtain an embedding for prediction. This is conceptually similar to Prototypical Networks, which extract prototype representations from a few support samples.

Core Idea: Decouple personalized feature extraction and scanpath prediction using a dedicated subject embedding network, allowing support for new users with a single forward pass to obtain usable personalized embeddings without any fine-tuning.

Method

Overall Architecture

A two-stage training scheme is adopted, with both models frozen during inference. Training stage: (1) Train SE-Net on the base dataset to extract embeddings of known users (using classification + contrastive loss); (2) Train ISP-SENet (a conditional scanpath predictor) using the embeddings generated by SE-Net. Inference stage: (1) Extract and average the embeddings from a new user's \(n\)-shot support set using SE-Net; (2) ISP-SENet predicts the scanpaths on a new image conditioned on this averaged embedding.
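The inference stage described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `se_net` and `predictor` are hypothetical callables standing in for the frozen SE-Net and ISP-SENet, and embeddings are plain Python lists rather than tensors.

```python
from typing import Any, Callable, List, Sequence, Tuple

Embedding = List[float]


def average_embeddings(embeddings: Sequence[Embedding]) -> Embedding:
    """Element-wise mean of the per-sample subject embeddings (the user prototype)."""
    if not embeddings:
        raise ValueError("support set must contain at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must share one dimensionality")
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


def adapt_and_predict(
    se_net: Callable[[Any, Any], Embedding],   # hypothetical: (image, scanpath) -> embedding
    predictor: Callable[[Any, Embedding], Any],  # hypothetical: (image, embedding) -> scanpath
    support_set: Sequence[Tuple[Any, Any]],      # n (image, scanpath) pairs from the new user
    query_image: Any,
) -> Any:
    """n-shot adaptation: forward passes only, no gradient step on either model."""
    # Step 1: one SE-Net forward pass per support sample.
    per_sample = [se_net(image, scanpath) for image, scanpath in support_set]
    # Step 2: average the n embeddings into a single subject embedding.
    prototype = average_embeddings(per_sample)
    # Step 3: condition the frozen predictor on the averaged embedding.
    return predictor(query_image, prototype)
```

Because adaptation is just the loop in step 1 plus an average, its cost grows linearly with the number of support samples and involves no optimizer state.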

Key Designs

SE-Net (Subject-Embedding Network):
- Function: Extract an embedding vector reflecting individual attention characteristics from a single image-scanpath pair.
- Mechanism: Feature extraction is split into three levels. (1) Image + Scanpath Semantic Features: Encode the image using ResNet + Deformable Attention to obtain \(F_I\), and encode the scanpath to obtain \(F_S\). (2) Context-Scanpath Encoder (CSE): Encode the task description via RoBERTa into a task embedding \(t\), fuse it with the image features via Self-Attention to form a context \(C\), and encode \(C\) jointly with the scanpath features (with duration and position embeddings added). The context portion \(C\) of the output is then discarded, leaving the updated scanpath features \(F_S'\) (this removes scene-specific biases). (3) User-Scanpath Decoder (USD): Initialize a subject token \(e\) and extract individual features from \(F_S'\) via Cross-Attention: \(e = \text{ReLU}(\text{Linear}(e + \text{CrossAttn}(e, F_S')))\).
- Design Motivation: Directly learning embeddings (like the lookup table in ISP) cannot generalize to new users. SE-Net extracts embeddings in an input-driven manner, permitting it to exploit prior experiences of known users; discarding context \(C\) prevents the embedding from over-encoding scene content rather than personal traits.
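The USD update \(e = \text{ReLU}(\text{Linear}(e + \text{CrossAttn}(e, F_S')))\) can be traced with a small pure-Python sketch. It makes several simplifying assumptions that are not from the paper: a single attention head, identity query/key/value projections, and list-based vectors. The function names are illustrative only.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def cross_attention(query: Vector, tokens: Matrix) -> Vector:
    """Single-head scaled dot-product attention of one query over `tokens`.

    Assumption: identity Q/K/V projections, to keep the sketch short.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(len(query))]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Affine map: one output per row of `weight`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def update_subject_token(e: Vector, scanpath_tokens: Matrix, weight: Matrix, bias: Vector) -> Vector:
    """e <- ReLU(Linear(e + CrossAttn(e, F_S')))."""
    attended = cross_attention(e, scanpath_tokens)       # the subject token queries F_S'
    residual = [a + b for a, b in zip(e, attended)]      # residual connection
    return relu(linear(residual, weight, bias))
```

Only the scanpath tokens \(F_S'\) are passed as keys and values here, mirroring the point that the context \(C\) has already been dropped before the subject token is read out.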

SE-Net Training: Classification + Contrastive Loss:
- Function: Ensure that the embeddings can distinguish different users and that different scanpath embeddings of the same user are close to each other.
- Mechanism: Construct triplets \((d, d_+, d_-)\), where \(d_+\) belongs to the same user as \(d\) and \(d_-\) belongs to a different user. The training loss is defined as \(\mathcal{L}_{cls}(d) + \mathcal{L}_{cls}(d_+) + \mathcal{L}_{cls}(d_-) + \mathcal{L}_{contrast}\). The classification loss trains the embedding to predict user IDs, whereas the contrastive loss (a triplet loss with margin \(m\), where \(f(\cdot)\) denotes the SE-Net embedding) pulls embeddings of the same user closer together than embeddings of different users: \(\mathcal{L}_{contrast} = \max(\|f(d)-f(d_+)\|^2 - \|f(d)-f(d_-)\|^2 + m, 0)\).
- Design Motivation: The classification loss provides strong gradient signals to make embeddings discriminative; the contrastive loss ensures a well-behaved geometry of the embedding space (intra-class compactness and inter-class dispersion), facilitating prototype aggregation in few-shot scenarios.
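As a worked illustration of the combined objective, the sketch below evaluates the loss for one triplet. The triplet term follows the formula in the notes; the classification term is assumed here to be softmax cross-entropy over user IDs, which the notes do not specify, and the four terms are summed with equal weight as written above.

```python
import math
from typing import Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin: float) -> float:
    """max(||f(d) - f(d+)||^2 - ||f(d) - f(d-)||^2 + m, 0)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy for one sample (assumed form of the classification loss)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def se_net_loss(
    logits: Tuple[Sequence[float], Sequence[float], Sequence[float]],      # user-ID logits for d, d+, d-
    user_ids: Tuple[int, int, int],                                        # true user IDs for d, d+, d-
    embeddings: Tuple[Sequence[float], Sequence[float], Sequence[float]],  # f(d), f(d+), f(d-)
    margin: float,
) -> float:
    """L_cls(d) + L_cls(d+) + L_cls(d-) + L_contrast for a single triplet."""
    classification = sum(cross_entropy(l, u) for l, u in zip(logits, user_ids))
    anchor, positive, negative = embeddings
    return classification + triplet_loss(anchor, positive, negative, margin)
```

The triplet term is zero once the negative is farther from the anchor than the positive by at least the margin (in squared distance), so well-separated triplets contribute only through the classification terms.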

ISP-SENet (Conditional Scanpath Predictor):
- Function: Predict personalized scanpaths conditioned on the subject embeddings generated by SE-Net.
- Mechanism: ISP-SENet builds on the Gazeformer-ISP architecture and replaces the original fixed lookup-table embeddings with the outputs of SE-Net. During training, SE-Net is frozen and only generates embeddings for the base users; only the predictor is updated. During inference, both networks are frozen.
- Design Motivation: The decoupled design enables the predictor to focus entirely on "how to predict the scanpath given personal features" rather than simultaneously learning "what the personal features are".

Loss & Training

SE-Net training: classification loss + triplet contrastive loss, trained for 25 epochs. ISP-SENet training: first supervised training, then fine-tuning via SCST (Self-Critical Sequence Training) reinforcement learning, for 10 epochs each. During \(n\)-shot inference, the embeddings of the \(n\) support samples are averaged (inspired by Prototypical Networks). Fixation durations are discretized into 10 bins (instead of continuous values) to reduce noise.
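The duration discretization can be illustrated as follows. The notes only state that durations are mapped to 10 bins; the uniform bin edges and the 1000 ms cap below are assumptions made for this sketch, and `duration_to_bin` is a hypothetical helper name.

```python
def duration_to_bin(duration_ms: float, max_duration_ms: float = 1000.0, num_bins: int = 10) -> int:
    """Map a fixation duration to one of `num_bins` uniform bins.

    Assumptions (not from the paper): bins are equally wide over
    [0, max_duration_ms], and longer durations fall into the last bin.
    """
    if max_duration_ms <= 0 or num_bins <= 0:
        raise ValueError("max_duration_ms and num_bins must be positive")
    clipped = min(max(duration_ms, 0.0), max_duration_ms)
    index = int(clipped * num_bins / max_duration_ms)
    return min(index, num_bins - 1)  # keep the upper edge inside the last bin
```

Under these assumed edges, fixations of 200 ms and 203 ms land in the same bin, so the model sees them as the same duration token.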

Key Experimental Results

Main Results

Dataset	n-shot	ISP-SENet SM↑	Runner-up SM	Relative Gain
OSIE	1	0.368	0.354	+3.9%
OSIE	10	0.375	0.354	+5.9%
COCO-FreeView	10	0.367	0.340	+7.9%
COCO-Search18	10	0.482	0.449	+7.3%

Comparison of adaptation time: ISP-SENet requires only 3.62 seconds (forward pass), while Gazeformer-ISP requires 267 seconds (fine-tuning).
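The relative gains and the adaptation speedup follow from simple arithmetic on the numbers reported above. The snippet recomputes them for the 10-shot rows from the rounded table values, so the last digit may differ slightly from figures computed on unrounded scores; the helper names are illustrative.

```python
def relative_gain_percent(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


def speedup(baseline_seconds: float, ours_seconds: float) -> float:
    """How many times faster `ours` is than `baseline`."""
    return baseline_seconds / ours_seconds


# (ISP-SENet SM, runner-up SM) for the 10-shot rows of the table above.
ten_shot_rows = {
    "OSIE": (0.375, 0.354),
    "COCO-FreeView": (0.367, 0.340),
    "COCO-Search18": (0.482, 0.449),
}

for dataset, (ours, runner_up) in ten_shot_rows.items():
    print(f"{dataset}: +{relative_gain_percent(ours, runner_up):.1f}%")

# Fine-tuning (267 s) versus a forward pass (3.62 s).
print(f"adaptation speedup: {speedup(267.0, 3.62):.1f}x")
```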

Ablation Study

Configuration	SM (OSIE)	Description
ISP-SENet-Seen (seen users during training)	0.390	Upper Bound
ISP-SENet-Unseen (10-shot)	0.375	Close to upper bound
w/o CSE module	Significant degradation	Task awareness is crucial
w/o contrastive loss	Reduced embedding discriminativeness	Geometric structure is necessary

Key Findings

ISP-SENet's 1-shot performance is close to its 10-shot performance (SM 0.368 vs. 0.375 on OSIE), indicating that SE-Net can extract effective personalized features from extremely limited data.
Baseline methods tend to overfit to the image content of the support set during fine-tuning rather than learning the actual attention patterns; ISP-SENet effectively avoids this issue by discarding context features.
The scanpath accuracy metric indicates that ISP-SENet is most capable of distinguishing predictions between different users (35.57 vs. 31.99), showing that personalization is indeed captured.
The performance of ISP-SENet-Unseen is close to that of ISP-SENet-Seen (SM 0.375 vs. 0.390 on OSIE), indicating that few-shot adaptation to an unseen user costs only a small amount of quality relative to training on that user's data.

Highlights & Insights

Decoupling subject embedding learning from path prediction is the core innovation: It reduces new-user adaptation to a single forward pass (3.6 seconds), which is over 70 times faster than fine-tuning-based methods (267 seconds). The paradigm is similar to the "learning to learn" concept in meta-learning but with a simpler implementation.
Discarding context features to eliminate scene bias is a clever design: Although image context is introduced in CSE to help understand scanning behaviors, discarding the context part in final embeddings ensures the embedding specifically encodes "how this person looks" rather than "what is in this image".
Discretizing fixation duration into 10 bins is a highly practical tip: Tiny differences in duration (e.g., 200ms vs. 203ms) hold no significance for personalization; binning reduces noise and the parameter scale. This approach can be extended to other tasks involving continuous temporal signals.

Limitations & Future Work

The number of users in public datasets is limited (10-15 subjects); scaling to a larger population requires broader studies.
The classification head of SE-Net only distinguishes base users during training, so its generalization ability toward "never-before-seen attention patterns" depends largely on the diversity of base users.
The model is validated only on free-viewing and visual search tasks, leaving other gaze scenarios like reading or driving unexplored.
The selection of margin \(m\) in the contrastive loss requires adjustment depending on the task type (e.g., a small margin for free-viewing, and a large margin for search tasks).
Prototype aggregation via simple averaging may not be the optimal strategy; attention-weighted aggregation or graph-based aggregation might yield better results.

Comparison with Related Work

vs ISP: ISP learns a fixed embedding vector (lookup table) for each user, requiring fine-tuning for new users. ISP-SENet dynamically generates embeddings through SE-Net, supporting zero-fine-tuning adaptation.
vs EyeFormer: EyeFormer employs reinforcement learning + a viewer encoder for PSP, but it requires 50+ scanpaths to stabilize. ISP-SENet works with as little as a single support sample (1-shot).
vs Prototypical Networks: The "extract embedding \(\rightarrow\) average \(\rightarrow\) conditional prediction" workflow of ISP-SENet shares key similarities with Prototypical Networks. However, the input for embedding extraction consists of image-scanpath pairs rather than simple images, necessitating the decoupling of scene information from personal traits.

Rating

Novelty: ⭐⭐⭐⭐ Proposes the FS-PSP task for the first time, with a highly reasonable and effective decoupled design.
Experimental Thoroughness: ⭐⭐⭐⭐ Thorough evaluations across three datasets, three n-shot settings, and multiple baselines.
Writing Quality: ⭐⭐⭐⭐ Clear problem formulation, with detailed descriptions of the methodology.
Value: ⭐⭐⭐⭐ Directly useful for bringing personalized attention prediction into real-world applications (e.g., recommendation systems, advertising, auxiliary diagnosis).