
Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents

Conference: NeurIPS 2025 (Spotlight)
arXiv: 2510.22443
Code: https://github.com/facebookresearch/WAGIBench/
Area: Audio & Speech
Keywords: goal inference, wearable agents, multimodal benchmark, egocentric video, VLM evaluation

TL;DR

Meta proposes WAGIBench, a multimodal goal inference benchmark for assistive wearable agents, comprising 3,477 egocentric recordings (29 hours) from 348 participants across four modalities — visual, audio, digital, and longitudinal. Human accuracy reaches 93% versus the best VLM at 84% (MCQ); under generative evaluation, models produce relevant goals only 55% of the time, exposing a substantial gap between current VLMs and real-world wearable deployment.

Background & Motivation

Assistive wearable agents (e.g., AI assistants on smart glasses) have attracted considerable attention, with representative use cases including mobile digital agents, memory augmentation, and visual assistance for the visually impaired. A fundamental bottleneck shared by all such systems is that users must explicitly state their intentions (e.g., "Where are my keys?"), imposing high interaction costs. If an agent could proactively infer user goals from passive behavioral cues — what the user is looking at, what is being said, what is on the phone screen, and past habits — interaction friction could be substantially reduced.

Existing egocentric datasets (e.g., Ego4D) exhibit several critical limitations: (1) annotations are typically re-generated by LLMs from narration text, lacking authentic ground-truth goals; (2) modalities are limited, generally restricted to video and audio, without digital context (calendars, search history, etc.) or longitudinal history; and (3) scenarios lack sufficient ecological validity to elicit situations genuinely requiring agent assistance.

Core Problem

Given multimodal passive observations from a user — egocentric video, ambient audio, mobile app state, and historical behavioral records — can an agent automatically infer the digital action the user intends to perform (search, shopping, reminders, etc.)? This goal inference problem is the key link between passive perception and proactive assistance for wearable agents. The paper's primary contribution lies not in proposing a new method, but in constructing a high-quality, multimodal, ground-truth benchmark to measure progress on this problem.

Method

Overall Architecture

WAGIBench comprises three main components:

  1. Dataset construction: Multimodal egocentric data are collected via scripted interactions, ensuring every recording has an unambiguous reference goal.
  2. Evaluation task design: Two evaluation paradigms are employed — discriminative (MCQ) and generative (LLM Judge).
  3. Meta-evaluation: Automatic metrics are validated against human judgments.

Input: Egocentric video + audio transcription + digital app state (complete state across seven apps: Calendar, Messaging, Notes, Search, Videos, Maps, Music) + longitudinal history (past behavioral records from the same user). Output: Predicted digital action (e.g., {type: "search", query: "how to file taxes"}).
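To make this input/output contract concrete, the sketch below shows how a single example might be represented in Python. The class and field names (WagiExample, digital_context, history_captions, etc.) are hypothetical illustrations, not the released WAGIBench schema.

```python
# Minimal, hypothetical sketch of one benchmark example; field names are
# illustrative only and do not reflect the official dataset format.
from dataclasses import dataclass


@dataclass
class WagiExample:
    video_path: str              # egocentric video recorded with the glasses
    audio_transcript: str        # ambient speech, e.g. transcribed with Whisper
    digital_context: dict        # full state of the seven apps (mostly irrelevant noise)
    history_captions: list       # Socratic descriptions of the 5 support videos
    relevant_modalities: set     # e.g. {"visual", "audio"} -> the S_VA subset
    reference_goal: dict         # ground-truth digital action to be inferred


example = WagiExample(
    video_path="recordings/kitchen_0421.mp4",
    audio_transcript="I really need to sort out my taxes this weekend.",
    digital_context={"Calendar": {}, "Messaging": {}, "Notes": {}, "Search": {},
                     "Videos": {}, "Maps": {}, "Music": {}},
    history_captions=["The user browses a recipe video on their phone.", "..."],
    relevant_modalities={"visual", "audio"},
    reference_goal={"type": "search", "query": "how to file taxes"},
)
```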

Key Designs

  1. Scripted data collection for ground-truth quality: Unlike prior approaches that re-annotate narration text with LLMs, WAGIBench first designs 165 scripted scenarios (spanning kitchen, office, outdoor, and gym environments across various app contexts), each with an explicit reference goal. The 348 participants recorded using Meta Aria glasses, with each script captured approximately 21 times by an average of 6 participants. Three annotators performed quality review (agreement rate > 0.5, temporal window IoU > 0.7), retaining approximately 80% of recordings. Variable parameters within scripts (e.g., user-chosen recycled items) further increase goal diversity.

  2. Four-modality context design with signal-noise control: Each data point is annotated with which modalities are "relevant" to goal inference, yielding subsets — \(S_V\) (visual only), \(S_{VA}\) (visual + audio), \(S_{VD}\) (visual + digital), and \(S_{VL}\) (visual + longitudinal). Digital context is generated by an LLM (Llama3.3-70B) from persona and scenario cues, producing complete seven-app states in which the vast majority of information constitutes irrelevant noise. Longitudinal history is presented as a "history bank" of five support videos from the same participant, at most one of which shares a script with the current scenario (positive support); the remainder serve as distractors. Longitudinal videos are represented via a Socratic approach — captions are generated independently by Qwen2.5-72B and InternVL-78B, then merged by an LLM with inconsistencies removed.

  3. Dual-paradigm evaluation with LLM Judge meta-evaluation:

    • MCQ: Each sample yields one "similar distractor" MCQ and one "dissimilar distractor" MCQ (7K questions total). Distractors are selected via Sentence-BERT embeddings: similar distractors are sampled from the 95th–99th similarity percentile, dissimilar distractors from the 0th–80th percentile, with a greedy strategy ensuring inter-distractor diversity (see the sketch after this list).
    • Generative: VLMs generate structured digital actions (selecting a type from a predefined template and filling in parameters), scored by an LLM Judge (GPT-4.1) on a three-level scale (0 / 0.5 / 1.0: irrelevant / marginally relevant / highly relevant).
    • Meta-evaluation: On a high-quality subset of 586 examples, pairwise agreement between various Judge variants (reference goal, script cues, Socratic descriptions, and combinations) and human assessments is compared. The "reference goal + script cues" LLM Judge achieves 76.8% pairwise agreement with humans, indistinguishable from human-human agreement (75.2%).
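As referenced in the MCQ bullet above, the percentile-banded distractor selection with greedy diversity can be sketched roughly as follows. This is illustrative only: the Sentence-BERT checkpoint (all-MiniLM-L6-v2), the pick_distractors function, the tiny goal pool, and the exact percentile and diversity handling are assumptions, not the paper's implementation.

```python
# Illustrative sketch of percentile-banded distractor selection with greedy
# diversity; not the official WAGIBench pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # any SBERT-style encoder

# Tiny illustrative pool; in the benchmark the pool would be the reference
# goals of all other recordings.
goal_pool = [
    "search: best hiking trails near me",
    "reminder: call the accountant on Monday",
    "search: tax filing deadline this year",
    "note: buy oat milk and coffee filters",
    "search: how are capital gains taxed",
]


def pick_distractors(target_goal, candidates, band=(0.95, 0.99), k=3):
    """Pick up to k distractors whose similarity to the target falls inside the
    given percentile band, greedily preferring candidates far from the
    distractors already chosen (inter-distractor diversity)."""
    embs = encoder.encode([target_goal] + candidates, normalize_embeddings=True)
    target, cands = embs[0], embs[1:]
    sims = cands @ target                                   # cosine similarities
    lo, hi = np.quantile(sims, band[0]), np.quantile(sims, band[1])
    pool = [i for i, s in enumerate(sims) if lo <= s <= hi]

    chosen = []
    while pool and len(chosen) < k:
        if not chosen:
            best = max(pool, key=lambda i: sims[i])         # seed with most similar
        else:
            picked = cands[chosen]
            best = min(pool, key=lambda i: float(np.max(picked @ cands[i])))
        chosen.append(best)
        pool.remove(best)
    return [candidates[i] for i in chosen]


# "Similar" MCQs draw from the 95th-99th percentile, "dissimilar" from the 0th-80th.
similar = pick_distractors("search: how to file taxes", goal_pool, band=(0.95, 0.99))
dissimilar = pick_distractors("search: how to file taxes", goal_pool, band=(0.00, 0.80))
```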

Loss & Training

This work involves no model training — it is a purely evaluative benchmark. Evaluated models include Llama-3.2-11B, Qwen2.5-VL-3B/7B/72B, InternVL2.5-MPO-2B/8B/78B, and GPT-4.1. Audio is uniformly transcribed with Whisper-base; videos are uniformly sampled at 32 frames (except Llama, which supports only a single frame).
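A rough sketch of this shared evaluation-time preprocessing (32 uniformly sampled frames plus Whisper-base transcription) is below. OpenCV and openai-whisper are stand-in libraries chosen for illustration; the paper does not specify the exact extraction tooling, and the file path is hypothetical.

```python
# Sketch of the shared preprocessing: uniform 32-frame sampling plus
# Whisper-base transcription. Library choices (OpenCV, openai-whisper)
# are stand-ins, not necessarily what the benchmark code uses.
import cv2
import numpy as np
import whisper


def sample_frames(video_path, num_frames=32):
    """Uniformly sample num_frames RGB frames across the whole recording."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


video = "recordings/kitchen_0421.mp4"            # hypothetical path
frames = sample_frames(video, num_frames=32)     # visual input to the VLM
asr = whisper.load_model("base")                 # Whisper-base, as in the paper
transcript = asr.transcribe(video)["text"]       # ffmpeg extracts the audio track
```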

Key Experimental Results

Main Results: Discriminative (MCQ) and Generative Evaluation

Model          Parameters   MCQ (Full)   Generative (Full)
Llama-3.2      11B          0.4311       0.3197
InternVL-2B    2B           0.4422       0.2134
InternVL-8B    8B           0.6741       0.3503
InternVL-78B   78B          0.8680       0.4866
Qwen-3B        3B           0.7153       0.2468
Qwen-7B        7B           0.7754       0.3999
Qwen-72B       72B          0.8755       0.4980
GPT-4.1        –            0.8774       0.5498
Human          –            0.93 (similar) / 0.97 (dissimilar)   –

LLM Judge Meta-Evaluation (Pairwise Agreement with Humans)

Judge Variant      Agreement with Humans
SBERT Similarity   59.5%
Socratic           63.0%
Snap-MCQ           67.8%
Reference          ~73%
Cues + Reference   76.8%
Human–Human        75.2%
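As a rough illustration of the metric reported in this table, pairwise agreement can be computed as the fraction of response pairs on which the judge and a human rater induce the same ordering under the 0 / 0.5 / 1.0 scale. Whether WAGIBench uses exactly this formulation is not stated here, so the definition and the scores below are assumptions for illustration.

```python
# Illustrative only: one common way to compute pairwise agreement between an
# LLM judge and human ratings on the 0 / 0.5 / 1.0 scale. The paper's exact
# protocol may differ.
from itertools import combinations


def pairwise_agreement(judge_scores, human_scores):
    """Fraction of example pairs on which the judge and the human induce the
    same ordering (ties included)."""
    agree = total = 0
    for i, j in combinations(range(len(judge_scores)), 2):
        judge_order = (judge_scores[i] > judge_scores[j]) - (judge_scores[i] < judge_scores[j])
        human_order = (human_scores[i] > human_scores[j]) - (human_scores[i] < human_scores[j])
        agree += int(judge_order == human_order)
        total += 1
    return agree / total if total else 0.0


judge = [1.0, 0.5, 0.0, 1.0, 0.5]   # hypothetical judge scores
human = [1.0, 0.5, 0.5, 1.0, 0.0]   # hypothetical human scores
print(f"pairwise agreement: {pairwise_agreement(judge, human):.1%}")
```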

Ablation Study

  • Modality ablation: On the \(S_{VA}\) subset, adding audio (V→VA) yields the largest gain — up to 35% on MCQ and 30% on generative evaluation. Digital and longitudinal modalities contribute smaller improvements, primarily due to low signal-to-noise ratio.
  • High-signal modality validation: Controlled subsets containing only relevant information — \(D^*\) (relevant app sub-states only) and \(L^*\) (positive support history only) — show that \(VD^*\) outperforms \(VD\) by up to 12% and \(VL^*\) outperforms \(VL\) by up to 5.6%.
  • Model scale effect: Performance is strongly positively correlated with parameter count. Large models (≥72B) are better at filtering irrelevant information from noisy modalities; small and medium models can even be harmed by full multimodal input.
  • Full-modality input (VADL): Large models successfully disentangle relevant features from mixed-modality input, whereas small and medium models exhibit modality interference.

Highlights & Insights

  • Elegant data collection paradigm: The scripted approach simultaneously achieves ecological validity and clean ground-truth goals, elegantly addressing the core challenge of egocentric goal inference dataset construction.
  • Comprehensive four-modality coverage with explicit control: Beyond collecting four modalities — a first in this field — the benchmark carefully annotates which modalities are relevant for each instance, enabling principled modality ablation studies.
  • Rigorous LLM Judge meta-evaluation: Rather than simply deploying an LLM Judge, the paper compares multiple Judge variants against human assessments, finding that the "reference + cues" variant matches human-human agreement, providing a reliable automatic evaluation protocol for the community.
  • Clear identification of key challenges: The signal-to-noise problem in digital/longitudinal modalities and the model-scale constraint (wearable devices require small models, yet the performance gap is substantial) constitute well-defined directions for future research.

Limitations & Future Work

  • Human validation covers only visual + audio modalities: Digital and longitudinal contexts are too complex for human annotators to process effectively with current tools, leaving the human baseline incomplete.
  • Only user-initiated interactions are considered: A truly proactive assistive system must also determine when to intervene, requiring large numbers of negative examples (scenarios requiring no assistance), which the current dataset does not include.
  • Limited longitudinal history modeling: The benchmark captures only "repeated habit" types of longitudinal cues; richer longitudinal signals such as user preferences (e.g., vegetarian diet) and environmental state (e.g., home tidiness) are not represented.
  • Synthetic digital context: Although privacy-preserving, synthesized app states may differ in distribution from real-world usage patterns.
  • Ecological validity of scripted collection: Despite efforts to naturalize the scripts, participants are executing assigned tasks, which remains distinct from fully naturalistic behavior.

Comparison with Related Benchmarks

Dimension         WAGIBench (Ours)                   PARSE-Ego4D                        MM-Ego / EgoLife
Task              Goal inference                     Goal inference                     Agent policy
Modalities        V + A + D + L (longitudinal)       V or A (single)                    V + A (longitudinal)
Annotation        Scripted ground truth              LLM re-annotation from narration   LLM re-annotation from narration/captions
Digital context   ✓ (seven apps)                     ✗                                  ✗
Participants      348                                10,133 (Ego4D videos)              629 / 6
Evaluation        MCQ + LLM Judge (meta-evaluated)   NLL / ROUGE-L                      MCQ

Compared with PARSE-Ego4D, WAGIBench's core advantages lie in multimodal coverage and scripted ground-truth (as opposed to LLM re-annotation). Compared with MM-Ego/EgoLife, WAGIBench focuses on goal inference rather than policy execution, and is the first to incorporate a digital context modality.

Broader implications:

  • Multimodal signal-to-noise problem: The signal-to-noise challenges identified in digital and longitudinal modalities have implications for all VLM applications involving long contexts and multi-source information — naively concatenating all available information is insufficient; models must learn to selectively ignore irrelevant content.
  • On-device deployment gap: The large performance disparity between small models (≤3B) and large models (≥72B) highlights efficient on-device inference as an urgent open problem; model distillation and targeted fine-tuning are promising directions.
  • Proactive inference vs. reactive response: This work establishes a new paradigm shifting from "user queries → system responds" to "system observes → system proactively assists," closely aligned with the broader proactive agent research direction.

Rating

  • Novelty: ⭐⭐⭐⭐ — First four-modality wearable goal inference benchmark with a clear problem formulation, though no new model is proposed.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Seven model families, detailed modality ablations, rigorous human meta-evaluation, and rich qualitative analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Clear structure, polished figures, and an exceptionally detailed appendix including complete prompt templates.
  • Value: ⭐⭐⭐⭐ — NeurIPS Spotlight; establishes a standard benchmark for goal inference in wearable agents.