ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

Conference: ICLR 2026
Project page: https://enact-embodied-cognition.github.io
Code: https://github.com/enact-embodied-cognition
Area: robotics
Keywords: embodied cognition, world model, VLM evaluation, egocentric perception, POMDP

TL;DR

ENACT formalizes embodied cognition evaluation as world-modeling VQA based on first-person interaction—revealing significant gaps and anthropomorphic biases in current top-tier VLMs compared to humans through forward/inverse sequence reshuffling tasks.

Background & Motivation

Background: Embodied cognition theory posits that intelligence originates from sensorimotor interaction rather than passive observation. Recently, VLMs (GPT-5, Gemini 2.5, Claude, etc.) have demonstrated impressive interactive capabilities through large-scale non-embodied training, making the question of whether "VLMs possess embodied cognition" a critical scientific inquiry.

Limitations of Prior Work: Existing efforts either focus on spatial perception in static scenes, evaluate only language planning capabilities, or examine simple object-to-object interactions. They lack a unified evaluation framework that tightly couples first-person perception with long-horizon embodied interaction. Subjective classification systems (e.g., Yang et al., 2025) struggle to provide reproducible objective measurements.

Key Challenge: While VLMs excel at static visual understanding, their multi-step, causal, and partially observable embodied world-modeling capabilities have not been rigorously quantified. The lack of evaluation tools limits the understanding of the boundaries of VLM embodied capabilities.

Goal: Construct a scalable, objective, and image-generation-decoupled embodied cognition benchmark to systematically measure forward/inverse world-modeling capabilities under a unified framework.

Core Idea: Transform embodied cognition evaluation into POMDP-based sequence reshuffling VQA—Forward World Modeling (reordering shuffled observations given a sequence of actions) and Inverse World Modeling (reordering shuffled actions given a sequence of observations). Actions are represented as scene graph differences, which strips interference from low-level image synthesis while implicitly requiring the model to demonstrate affordance recognition, action-effect reasoning, embodied perception, and long-term memory.

Method

Overall Architecture

flowchart TD
    A["Robot Manipulation Trajectory\nBEHAVIOR Simulator"] --> B["Keyframe Extraction\nNon-empty Scene Graph Diffs"]
    B --> C["Keyframe Trajectory Sampling\nLength L ∈ 3..10\nCombinatorial C(M,L) Expansion"]
    C --> D1["Forward World Modeling QA\nGiven Action Seq + Initial Obs\nReorder Shuffled Obs Images"]
    C --> D2["Inverse World Modeling QA\nGiven Ordered Obs Seq\nReorder Shuffled Actions"]
    D1 --> E["ENACT Benchmark\n8972 QA Pairs\n29 Home Activities"]
    D2 --> E
    E --> F["VLM Evaluation\nOnline Validator\nTask Acc + Pairwise Acc"]

Key Designs

1. Sequence Reshuffling VQA Formalized by POMDP: Decoupled from Image Generation

ENACT defines world modeling on a POMDP \((S, O, A)\), where the state space \(S\) is a scene graph, the observation space \(O \subset \mathbb{R}^{H \times W \times 3}\) is the robot's first-person RGB view, and the action space \(A\) is the scene graph difference between consecutive states, \(a_t = \delta(s_t, s_{t+1})\), so that \(a_0, \ldots, a_{L-2}\) connect the \(L\) states \(s_0, \ldots, s_{L-1}\). Evaluation is formalized as two permutation inference tasks:

  • Forward: Given \(o_0\), an ordered action sequence \((a_0,\ldots,a_{L-2})\), and a shuffled set of observations \(O'\), the model outputs a permutation \(\sigma \in \text{Sym}([L-1])\) such that \((o'_{\sigma(1)}, \ldots, o'_{\sigma(L-1)}) = (o_1, \ldots, o_{L-1})\).
  • Inverse: Given the ordered observation sequence \((o_0, o_1, \ldots, o_{L-1})\) and a shuffled set of actions \(A'\), the model outputs a permutation \(\tau \in \text{Sym}([L-1])\) such that \((a'_{\tau(1)}, \ldots, a'_{\tau(L-1)}) = (a_0, \ldots, a_{L-2})\), aligning each action with the observed transition it explains.

This design decouples long-horizon interactive visual reasoning from high-fidelity video prediction, ensuring evaluation signals are clean and reproducible while implicitly examining affordance recognition, contact reasoning, and spatial memory under partial observability.
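The two building blocks above, a scene graph difference and a permutation over shuffled items, can be sketched in a few lines. This is a minimal illustration rather than the benchmark's code: the triple encoding of a scene graph, the `Action` container, and the 0-based `sigma` convention are all assumptions made for the example.

```python
from typing import FrozenSet, List, NamedTuple, Sequence, Tuple

# Hypothetical encoding: a scene graph is a set of (predicate, subject, object) triples.
Triple = Tuple[str, str, str]
SceneGraph = FrozenSet[Triple]


class Action(NamedTuple):
    """Scene graph difference between two consecutive states."""
    added: FrozenSet[Triple]
    removed: FrozenSet[Triple]


def scene_graph_diff(before: SceneGraph, after: SceneGraph) -> Action:
    """Relations that appear in `after` and relations that disappear from `before`."""
    return Action(added=after - before, removed=before - after)


def apply_permutation(shuffled: Sequence[str], sigma: Sequence[int]) -> List[str]:
    """Reorder shuffled items; sigma[i] is the 0-based index of the item placed at step i."""
    if sorted(sigma) != list(range(len(shuffled))):
        raise ValueError("sigma must be a permutation of the shuffled indices")
    return [shuffled[i] for i in sigma]


# One action: the cup leaves the table and is grasped by the robot.
s0: SceneGraph = frozenset({("ontop", "cup", "table")})
s1: SceneGraph = frozenset({("grasped", "cup", "robot")})
a0 = scene_graph_diff(s0, s1)

# Forward task: the model must recover the true order of shuffled observations.
ordered = ["o1", "o2", "o3"]
shuffled = ["o3", "o1", "o2"]
sigma = [1, 2, 0]
recovered = apply_permutation(shuffled, sigma)
```

Here `recovered` equals `ordered`, so the predicted permutation would count as an exact match. The inverse task has the same shape, with shuffled actions in place of shuffled observations.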

2. Scalable Keyframe Trajectory Synthesis: Combinatorial Data Explosion

In raw robot trajectories (30Hz), most frames carry no semantic change. ENACT extracts keyframes \(K = \{t_1 < \cdots < t_M\}\) by detecting non-empty scene graph differences, and filters near-duplicate frames using cosine similarity over predicate-level change signatures \(c_j\). Trajectories of length \(L\) are then sampled from the \(M\) keyframes. Since \(L\) is small relative to \(M\) (in practice \(L \leq 10, M \gtrsim 30\)), a single trajectory can yield up to \(\binom{M}{L}\) different candidates; for example, \(\binom{30}{10} = 30{,}045{,}015\). This moves data scaling from "number of trajectories" to "number of combinations," so that in principle millions of QA pairs can be generated from a single trajectory.
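The keyframe step and the combinatorial count can be illustrated with a short sketch. It assumes each frame's state is available as a set of relation labels and omits the cosine-similarity near-duplicate filter; `extract_keyframes` is a hypothetical helper name, not part of the ENACT code.

```python
import math
from itertools import combinations
from typing import FrozenSet, List, Sequence


def extract_keyframes(states: Sequence[FrozenSet[str]]) -> List[int]:
    """Indices t whose scene graph differs from the previous frame (non-empty diff)."""
    return [t for t in range(1, len(states)) if states[t] != states[t - 1]]


# Toy trajectory: six frames, three of which change the scene graph.
states = [
    frozenset({"a"}),
    frozenset({"a"}),
    frozenset({"a", "b"}),
    frozenset({"a", "b"}),
    frozenset({"b"}),
    frozenset({"b", "c"}),
]
keyframes = extract_keyframes(states)

# Every ordered subset of L keyframes is a candidate trajectory.
candidates = list(combinations(keyframes, 2))

# Upper bound on candidates for M = 30 keyframes and L = 10.
upper_bound = math.comb(30, 10)
```

In this toy case `keyframes` is `[2, 4, 5]` and there are three length-2 candidates, while `upper_bound` evaluates to 30045015, which is where the "millions from one trajectory" figure comes from.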

3. Multi-Granularity Metrics and Online Validator

ENACT defines two levels of metrics. Task Accuracy (TA) requires the whole predicted ordering to be accepted: \(\text{TA} = \frac{1}{|D|}\sum_{x \in D} \mathbf{1}[\text{accepted}(x)]\). Pairwise Accuracy (PA) gives partial credit for locally correct ordering: \(\text{PA} = \frac{\sum_x \#\text{Correct Adjacent Pairs}_x}{\sum_x (L_x - 1)}\), where \(L_x - 1\) is the number of adjacent pairs in a trajectory of length \(L_x\). Because several permutations can satisfy the same constraints, an online validator accepts any permutation consistent with the inputs, so that questions with multiple valid answers are not scored as errors.
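A minimal sketch of the two metrics follows, under stated assumptions: acceptance is approximated by exact match with a single reference ordering (the online validator that accepts alternative valid orderings is not modeled), an adjacent pair counts as correct when the same two items are adjacent in the same order in the reference, and PA is normalized by the number of adjacent pairs in each sequence.

```python
from typing import List, Sequence, Tuple

Example = Tuple[List[int], List[int]]  # (predicted ordering, reference ordering)


def correct_adjacent_pairs(pred: Sequence[int], ref: Sequence[int]) -> int:
    """Count adjacent pairs in `pred` that are also adjacent, in the same order, in `ref`."""
    ref_pairs = set(zip(ref, ref[1:]))
    return sum(1 for pair in zip(pred, pred[1:]) if pair in ref_pairs)


def task_accuracy(examples: Sequence[Example]) -> float:
    """Fraction of examples whose predicted ordering matches the reference exactly."""
    return sum(pred == ref for pred, ref in examples) / len(examples)


def pairwise_accuracy(examples: Sequence[Example]) -> float:
    """Correct adjacent pairs divided by the total number of adjacent pairs."""
    correct = sum(correct_adjacent_pairs(pred, ref) for pred, ref in examples)
    total = sum(len(ref) - 1 for _, ref in examples)
    return correct / total


examples: List[Example] = [
    ([0, 1, 2, 3], [0, 1, 2, 3]),  # fully correct: 3 of 3 pairs
    ([0, 2, 1, 3], [0, 1, 2, 3]),  # no adjacent pair preserved: 0 of 3
    ([0, 1, 3, 2], [0, 1, 2, 3]),  # only the first pair preserved: 1 of 3
]
ta = task_accuracy(examples)
pa = pairwise_accuracy(examples)
```

On these three toy examples TA is 1/3 and PA is 4/9, which shows how PA still rewards the partially correct third prediction that TA scores as a failure.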

4. Fine-grained Error Analysis Framework: Five Error Categories

By converting predicted permutations into their corresponding action sequences and performing Venn diagram analysis against ground truth scene graph differences, ENACT labels each atomic state change as Correct or as one of five error types: Omission, Hallucination, Entity Substitution, Polarity Inversion, and Predicate Substitution. This semantic-level classification is more diagnostic than permutation-level comparison because it points to the kind of state change the model got wrong.
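One way such a classification could be implemented is sketched below. The definitions are assumptions for illustration, since the summary does not spell them out: an atomic change is a (polarity, predicate, entities) record, a predicted change that differs from an unmatched ground-truth change in exactly one field is labeled by that field, remaining predicted changes are hallucinations, and remaining ground-truth changes are omissions. The greedy pairing is a simplification, not the paper's procedure.

```python
from collections import Counter
from typing import NamedTuple, Set, Tuple


class Change(NamedTuple):
    polarity: str               # "+" for an added relation, "-" for a removed one
    predicate: str
    entities: Tuple[str, ...]


# Which error label a single differing field maps to (assumed definitions).
FIELD_TO_LABEL = {
    "polarity": "Polarity Inversion",
    "predicate": "Predicate Substitution",
    "entities": "Entity Substitution",
}


def classify(pred: Set[Change], gt: Set[Change]) -> Counter:
    """Greedily label predicted changes against ground-truth changes."""
    counts = Counter(Correct=len(pred & gt))
    missing = sorted(gt - pred)  # sorted so the pairing is deterministic
    for p in sorted(pred - gt):
        label = "Hallucination"
        for g in missing:
            diffs = [f for f in Change._fields if getattr(p, f) != getattr(g, f)]
            if len(diffs) == 1:
                label = FIELD_TO_LABEL[diffs[0]]
                missing.remove(g)  # each ground-truth change is matched at most once
                break
        counts[label] += 1
    counts["Omission"] += len(missing)
    return counts


gt = {
    Change("+", "inside", ("cup", "fridge")),
    Change("-", "ontop", ("cup", "table")),
    Change("+", "open", ("fridge",)),
}
pred = {
    Change("+", "inside", ("cup", "fridge")),   # correct
    Change("+", "ontop", ("cup", "table")),     # wrong polarity
    Change("+", "open", ("cabinet",)),          # wrong entity
    Change("+", "toggled_on", ("stove",)),      # not in ground truth
}
result = classify(pred, gt)
```

For this toy input the counts are one each for Correct, Polarity Inversion, Entity Substitution, and Hallucination, with no Omission.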

Key Experimental Results

Main Results (Pairwise Accuracy, Select Horizons)

| Model | Forward L=3 | Forward L=6 | Forward L=10 | Inverse L=3 | Inverse L=6 | Inverse L=10 |
|---|---|---|---|---|---|---|
| Human | 93.62 | 93.87 | 95.13 | 92.05 | 94.25 | 96.29 |
| GPT-5 | 84.62 | 64.18 | 46.93 | 86.28 | 68.78 | 55.33 |
| GPT-5 mini | 87.50 | 63.41 | 44.11 | 85.05 | 67.67 | 50.02 |
| Gemini 2.5 Pro | 86.10 | 60.80 | 36.98 | 87.94 | 70.03 | 56.62 |
| InternVL3.5-241B | 75.79 | 45.85 | 25.24 | 82.26 | 53.38 | 30.56 |
| Qwen2.5-VL-72B | 78.15 | 41.92 | 25.07 | 77.80 | 48.19 | 36.27 |
| Claude Sonnet 4 | 65.65 | 30.52 | 20.16 | 73.25 | 43.07 | 28.49 |

Ablation Study (Image Fidelity vs Camera Config, GPT-5 mini, Pairwise Acc Change Δ)

| Configuration Variant | Significance (p) | Impact Δ | Description |
|---|---|---|---|
| Path Tracing | p≥0.2 | Small | Rendering fidelity does not affect performance |
| Real Images (GPT-image-1 conversion) | p≥0.2 | Small | Minimal sim-to-real gap |
| FOV 60/80/Fisheye | p≤0.01 | Significant drop | VLM biased toward human eye intrinsics |
| Camera Height +0.5m (Forward) | p<0.05 | Δ=−0.13 | Non-standard height significantly impairs performance |
| Right vs Left Hand (Confusion Rate) | | Right 4.67% vs Left 9.38% | Right hand significantly outperforms left |

Key Findings

  • Inverse tasks outperform forward tasks for most models and horizons (in the table above, GPT-5 mini and Qwen2.5-VL-72B at \(L=3\) are exceptions), suggesting linguistic retrospective reasoning is stronger than visual proactive simulation.
  • Accuracy decreases monotonically as trajectory length increases. At \(L = 8\) to \(10\), most models' Task Accuracy approaches zero while humans remain stable at \(>93\%\).
  • GPT-5 and Gemini 2.5 Pro only approach human-level performance at \(L=3\); the gap widens rapidly in long horizons.
  • Primary error types: In forward tasks, hallucinations (43.9%) and omissions (37.1%) together account for about 81% of errors; in inverse tasks the two are roughly equal, at about 41.8% each.
  • Cosmos-Reason1-7B (trained on embodied data) is more stable than same-sized models at \(L>5\).
  • Sim-to-real results are highly consistent, validating minimal simulation gaps.

Highlights & Insights

  • Elegant yet comprehensive task design: The "narrow" form of sequence reshuffling VQA implicitly demands core embodied capabilities like affordance recognition, action-effect reasoning, and partially observable long-term memory while avoiding video synthesis noise.
  • Truly scalable data pipeline: Combinatorial keyframe sampling allows a single trajectory to generate millions of QA pairs, providing a foundation for large-scale embodied cognition research.
  • Stark contrast with humans: Humans maintain ~94% accuracy across all horizons, while the strongest VLM (GPT-5) drops to 47% on the forward task at \(L=10\), revealing a gap much larger than previously assumed.
  • Quantified anthropomorphic bias: Reveals a deep-seated bias in training data. VLMs' default world-view is tightly coupled with human perspectives, making it difficult to generalize to non-human robotic viewpoints.
  • Semantic error framework: Directs future improvements by showing that the main issue is not misidentifying specific changes, but omitting real state changes or hallucinating ones that do not exist.

Limitations & Future Work

  • Relies solely on simulation (BEHAVIOR); despite the small sim-to-real gap, the diversity of real-world trajectories remains limited.
  • Action representation via scene graph differences depends on ground truth states from the simulator, which is difficult to replicate on physical robot platforms.
  • Evaluation scale (8972 QA) is smaller than common LLM benchmarks and covers only 29 home activities; scene diversity needs expansion.
  • Does not yet address cross-modal execution (language → action) or actual robot control, measuring only "understanding."

Related Work

  • vs EmbodiedScan / ScanQA (Static Scene VQA): ENACT introduces temporal action chains and partial observability, upgrading from static spatial understanding to dynamic causal reasoning.
  • vs Aurora-Bench: Aurora-Bench focuses on short-horizon general video world modeling; ENACT focuses on long-horizon robot manipulation with explicit action semantics.
  • vs BEHAVIOR Challenge: ENACT reuses BEHAVIOR trajectory data but transforms it into an evaluation-oriented rather than training-oriented benchmark.
  • vs Cosmos-Reason1 (Embodied VLMs): Results highlight the value of embodied data training for long-horizon stability, providing quantitative evidence for future training data design.

Rating

  • Novelty: ⭐⭐⭐⭐ Fusion of world modeling, POMDP, and sequence reshuffling into a unified framework with clear concepts and formalization.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated 30 models, 8972 QAs, multi-dimensional bias analysis (perspective/FOV/handedness), and sim-to-real comparison.
  • Writing Quality: ⭐⭐⭐⭐⭐ Highly structured, balancing mathematical and intuitive explanations with excellent readability.
  • Value: ⭐⭐⭐⭐⭐ Fills the gap in long-horizon embodied cognition evaluation with scalable data, providing essential infrastructure for VLM research.