ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction¶

Conference: ICLR 2026
Project page: https://enact-embodied-cognition.github.io
Code: https://github.com/enact-embodied-cognition
Area: robotics
Keywords: embodied cognition, world model, VLM evaluation, egocentric perception, POMDP

TL;DR¶

ENACT formalizes embodied cognition evaluation as world-modeling VQA over first-person (egocentric) interaction. Using forward and inverse sequence reshuffling tasks, it reveals a large gap between current top-tier VLMs and humans, along with anthropomorphic biases in the models.

Background & Motivation¶

Background: Embodied cognition theory posits that intelligence originates from sensorimotor interaction rather than passive observation. Recently, VLMs (GPT-5, Gemini 2.5, Claude, etc.) have demonstrated impressive interactive capabilities through large-scale non-embodied training, making the question of whether "VLMs possess embodied cognition" a critical scientific inquiry.

Limitations of Prior Work: Existing efforts either focus on spatial perception in static scenes, evaluate only language planning capabilities, or examine simple object-to-object interactions. They lack a unified evaluation framework that tightly couples first-person perception with long-horizon embodied interaction. Subjective classification systems (e.g., Yang et al., 2025) struggle to provide reproducible objective measurements.

Key Challenge: While VLMs excel at static visual understanding, their multi-step, causal, and partially observable embodied world-modeling capabilities have not been rigorously quantified. The lack of evaluation tools limits the understanding of the boundaries of VLM embodied capabilities.

Goal: Construct a scalable, objective, and image-generation-decoupled embodied cognition benchmark to systematically measure forward/inverse world-modeling capabilities under a unified framework.

Core Idea: Recast embodied cognition evaluation as POMDP-based sequence reshuffling VQA with two tasks: Forward World Modeling (reorder shuffled observations given an ordered action sequence) and Inverse World Modeling (reorder shuffled actions given an ordered observation sequence). Actions are represented as scene graph differences, which removes the confound of low-level image synthesis while still implicitly requiring affordance recognition, action-effect reasoning, embodied perception, and long-term memory.

Method¶

Overall Architecture¶

flowchart TD
    A[Robot Manipulation Trajectory\nBEHAVIOR Simulator] --> B[Keyframe Extraction\nNon-empty Scene Graph Diffs]
    B --> C[Keyframe Trajectory Sampling\nLength L∈3..10\nCombinatorial C(M,L) Expansion]
    C --> D1[Forward World Modeling QA\nGiven Action Seq + Initial Obs\nReorder Shuffled Obs Images]
    C --> D2[Inverse World Modeling QA\nGiven Ordered Obs Seq\nReorder Shuffled Actions]
    D1 --> E[ENACT Benchmark\n8972 QA Pairs\n29 Home Activities]
    D2 --> E
    E --> F[VLM Evaluation\nOnline Validator\nTask Acc + Pairwise Acc]

Key Designs¶

1. Sequence Reshuffling VQA Formalized by POMDP: Decoupled from Image Generation

ENACT defines world modeling on a POMDP \((S, O, A)\), where the state space \(S\) consists of scene graphs, the observation space \(O \subset \mathbb{R}^{H \times W \times 3}\) is the robot's first-person RGB view, and each action in \(A\) is the scene graph difference between consecutive states, \(a_t = \delta(s_t, s_{t+1})\), so that \(a_0\) leads from \(o_0\) to \(o_1\). Evaluation is formalized as two permutation inference tasks:

Forward: Given \(o_0\), an ordered action sequence \((a_0,\ldots,a_{L-2})\), and a shuffled set of observations \(O'\), the model outputs a permutation \(\sigma \in \text{Sym}([L-1])\) such that \((o'_{\sigma(1)}, \ldots, o'_{\sigma(L-1)}) = (o_1, \ldots, o_{L-1})\).
Inverse: Given \(o_0\) and an ordered observation sequence \((o_1,\ldots,o_{L-1})\), along with a shuffled set of actions \(A'\), the model outputs a permutation \(\tau\) to align actions with observational progress.

This design decouples long-horizon interactive visual reasoning from high-fidelity video prediction, ensuring evaluation signals are clean and reproducible while implicitly examining affordance recognition, contact reasoning, and spatial memory under partial observability.
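The forward task can be illustrated with a minimal Python sketch. It is not the benchmark's code: the function names (`make_forward_item`, `apply_permutation`, `is_exact_match`), the dictionary layout, and the use of strings as stand-ins for images are all assumptions made for illustration. It shows how a shuffled set \(O'\) is built from an ordered trajectory and how a 1-indexed permutation \(\sigma\) is checked against the original order.

```python
import random

def make_forward_item(observations, actions, seed=0):
    """Build one forward item from an ordered trajectory (hypothetical layout).

    observations: [o_0, ..., o_{L-1}], any hashable stand-ins for images
    actions:      [a_0, ..., a_{L-2}], shown to the model in order
    """
    assert len(actions) == len(observations) - 1
    future = observations[1:]                  # o_1 .. o_{L-1}
    order = list(range(len(future)))
    random.Random(seed).shuffle(order)         # hidden shuffle
    shuffled = [future[i] for i in order]      # the set O' shown to the model
    return {"o0": observations[0], "actions": actions,
            "shuffled": shuffled, "answer": future}

def apply_permutation(shuffled, sigma):
    """sigma is 1-indexed: output position k takes shuffled[sigma[k] - 1]."""
    return [shuffled[i - 1] for i in sigma]

def is_exact_match(item, sigma):
    """True if sigma is a valid permutation that restores o_1 .. o_{L-1}."""
    n = len(item["shuffled"])
    if sorted(sigma) != list(range(1, n + 1)):
        return False                           # not a permutation of 1..n
    return apply_permutation(item["shuffled"], sigma) == item["answer"]

# Usage with placeholder observations and actions (L = 4).
item = make_forward_item(
    ["img0", "img1", "img2", "img3"],
    ["open fridge", "pick up apple", "close fridge"],
)
sigma = [item["shuffled"].index(o) + 1 for o in item["answer"]]
print(is_exact_match(item, sigma))
```

The inverse task has the same shape with the roles swapped: the observations stay ordered and the permutation is applied to the shuffled action list.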

2. Scalable Keyframe Trajectory Synthesis: Combinatorial Data Explosion

In raw robot trajectories (30Hz), most frames carry no semantic change. ENACT extracts keyframes \(K = \{t_1 < \cdots < t_M\}\) by detecting non-empty scene graph differences, and filters near-duplicate frames using cosine similarity over predicate-level change signatures \(c_j\). Trajectories of length \(L\) are then sampled from the \(M\) keyframes. Since \(L \ll M\) (in practice \(L \leq 10, M \gtrsim 30\)), a single trajectory can yield up to \(\binom{M}{L}\) different candidates. Data scaling therefore depends on the number of combinations rather than the number of trajectories, so that in principle a single trajectory can produce millions of QA pairs.
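A small Python sketch of the keyframe idea follows. It assumes, for illustration only, that a scene graph is a set of `(predicate, subject, object)` triples, that frame 0 is always kept, and that each frame is compared against the last kept keyframe; the near-duplicate cosine-similarity filter is omitted. The helper names are hypothetical.

```python
from itertools import combinations
from math import comb

def scene_graph_diff(prev, curr):
    """Return (added, removed) triples between two scene graphs (sets)."""
    return curr - prev, prev - curr

def extract_keyframes(states):
    """Indices of frames whose scene graph differs from the last keyframe."""
    keyframes = [0]                            # assumption: frame 0 is kept
    for t in range(1, len(states)):
        added, removed = scene_graph_diff(states[keyframes[-1]], states[t])
        if added or removed:                   # non-empty difference
            keyframes.append(t)
    return keyframes

def sample_trajectories(keyframes, length):
    """Lazily enumerate time-ordered keyframe subsequences of a given length."""
    return combinations(keyframes, length)

# Toy trajectory: only frames 0, 2 and 4 introduce a change.
states = [
    {("inside", "apple", "fridge")},
    {("inside", "apple", "fridge")},
    {("holding", "robot", "apple")},
    {("holding", "robot", "apple")},
    {("ontop", "apple", "table")},
]
keys = extract_keyframes(states)
print(keys, list(sample_trajectories(keys, 2)))

# Upper bound on candidates for M = 30 keyframes and L = 10.
print(comb(30, 10))  # 30045015
```

Because `itertools.combinations` preserves input order, every sampled subsequence stays in temporal order, which is what a trajectory needs.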

3. Multi-Granularity Metrics and Online Validator

ENACT uses two metrics. Task Accuracy (TA) requires the full ordering to be accepted: \(\text{TA} = \frac{1}{|D|}\sum_{x \in D} \mathbf{1}[\text{accepted}(x)]\). Pairwise Accuracy (PA) gives partial credit for locally correct ordering: \(\text{PA} = \frac{\sum_x \#\text{Correct Adjacent Pairs}_x}{\sum_x (L_x - 1)}\), where a trajectory of length \(L_x\) has \(L_x - 1\) adjacent pairs. Because several permutations can satisfy the same constraints, an online validator accepts any permutation consistent with the inputs, so that questions with multiple valid solutions are not scored as errors.
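The two metrics can be sketched in Python under stated assumptions. Here each sequence is a list of state signatures that starts with the shared initial step, a pair counts as correct when both of its positions match the reference, and PA is normalized by the number of adjacent pairs (\(L - 1\) for a length-\(L\) trajectory). Comparing state signatures rather than image identities is a simplified stand-in for the online validator: two candidates with the same underlying state are treated as interchangeable. None of these names or choices come from the ENACT code.

```python
def pairwise_correct(pred, gold):
    """Count adjacent positions (k, k+1) where pred matches gold at both."""
    return sum(
        1
        for k in range(len(gold) - 1)
        if pred[k] == gold[k] and pred[k + 1] == gold[k + 1]
    )

def evaluate(items):
    """items: list of (pred, gold) pairs of equal-length signature sequences."""
    accepted = sum(1 for pred, gold in items if pred == gold)
    pairs_ok = sum(pairwise_correct(pred, gold) for pred, gold in items)
    pairs_total = sum(len(gold) - 1 for _, gold in items)
    return {"TA": accepted / len(items), "PA": pairs_ok / pairs_total}

# Two toy questions of length 4: one fully correct, one with the tail swapped.
gold = ["s0", "s1", "s2", "s3"]
items = [
    (["s0", "s1", "s2", "s3"], gold),
    (["s0", "s1", "s3", "s2"], gold),
]
print(evaluate(items))
```

In this toy case TA is 0.5 because only one ordering is exact, while PA is 4/6 because the second answer still gets its first adjacent pair right.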

4. Fine-grained Error Analysis Framework: Correct Plus Five Error Types

By converting predicted permutations into the corresponding action sequences and comparing them with the ground-truth scene graph differences via Venn-diagram (set overlap) analysis, ENACT labels each atomic state change as Correct or as one of five error types: Omission, Hallucination, Entity Substitution, Polarity Inversion, and Predicate Substitution. This semantic-level classification is more diagnostic than permutation-level comparison because it points to the kind of state change the model got wrong.
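One way to picture this classification is the Python sketch below. The `Change` representation (polarity, predicate, entities) and the greedy matching rule are assumptions for illustration, not the paper's procedure: a predicted change that differs from an unmatched ground-truth change in exactly one field is labeled as the corresponding substitution or inversion, any other unmatched prediction is a Hallucination, and any leftover ground-truth change is an Omission.

```python
from collections import Counter
from typing import NamedTuple, Tuple

class Change(NamedTuple):
    polarity: str                 # "+" becomes true, "-" becomes false
    predicate: str
    entities: Tuple[str, ...]

def classify_changes(predicted, gold):
    """Label atomic changes by set overlap plus single-field mismatches."""
    predicted, gold = set(predicted), set(gold)
    counts = Counter(Correct=len(predicted & gold))
    unmatched_gold = sorted(gold - predicted)
    for p in sorted(predicted - gold):
        label = "Hallucination"
        for g in unmatched_gold:
            same = (p.polarity == g.polarity,
                    p.predicate == g.predicate,
                    p.entities == g.entities)
            if same == (False, True, True):
                label = "Polarity Inversion"
            elif same == (True, False, True):
                label = "Predicate Substitution"
            elif same == (True, True, False):
                label = "Entity Substitution"
            else:
                continue          # differs in more than one field
            unmatched_gold.remove(g)
            break
        counts[label] += 1
    counts["Omission"] = len(unmatched_gold)
    return counts

gold = [
    Change("+", "inside", ("apple", "fridge")),
    Change("-", "open", ("fridge",)),
    Change("+", "holding", ("robot", "cup")),
]
pred = [
    Change("+", "inside", ("apple", "fridge")),   # matches ground truth
    Change("+", "open", ("fridge",)),             # wrong polarity
    Change("+", "ontop", ("pan", "stove")),       # no counterpart
]
print(classify_changes(pred, gold))
```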

Key Experimental Results¶

Main Results (Pairwise Accuracy, Select Horizons)¶

| Model | Forward L=3 | Forward L=6 | Forward L=10 | Inverse L=3 | Inverse L=6 | Inverse L=10 |
|---|---|---|---|---|---|---|
| Human | 93.62 | 93.87 | 95.13 | 92.05 | 94.25 | 96.29 |
| GPT-5 | 84.62 | 64.18 | 46.93 | 86.28 | 68.78 | 55.33 |
| GPT-5 mini | 87.50 | 63.41 | 44.11 | 85.05 | 67.67 | 50.02 |
| Gemini 2.5 Pro | 86.10 | 60.80 | 36.98 | 87.94 | 70.03 | 56.62 |
| InternVL3.5-241B | 75.79 | 45.85 | 25.24 | 82.26 | 53.38 | 30.56 |
| Qwen2.5-VL-72B | 78.15 | 41.92 | 25.07 | 77.80 | 48.19 | 36.27 |
| Claude Sonnet 4 | 65.65 | 30.52 | 20.16 | 73.25 | 43.07 | 28.49 |
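The widening gap to human performance can be read directly off the table. The short Python sketch below copies three forward rows from the table above and subtracts each model's score from the human score at the same horizon; the variable and function names are illustrative only.

```python
# Forward pairwise accuracy (%) at L = 3, 6, 10, copied from the table above.
forward = {
    "Human": [93.62, 93.87, 95.13],
    "GPT-5": [84.62, 64.18, 46.93],
    "Gemini 2.5 Pro": [86.10, 60.80, 36.98],
}

def gap_to_human(scores, model):
    """Human minus model accuracy at each horizon, in percentage points."""
    return [round(h - m, 2) for h, m in zip(scores["Human"], scores[model])]

for model in ("GPT-5", "Gemini 2.5 Pro"):
    print(model, gap_to_human(forward, model))
```

For GPT-5 the forward gap grows from about 9 points at \(L=3\) to about 48 points at \(L=10\).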

Ablation Study (Image Fidelity vs Camera Config, GPT-5 mini, Pairwise Acc Change Δ)¶

| Configuration Variant | Significance (p) | Impact Δ | Description |
|---|---|---|---|
| Path Tracing | p≥0.2 | Small | Rendering fidelity does not affect performance |
| Real Images (GPT-image-1 conversion) | p≥0.2 | Small | Minimal sim-to-real gap |
| FOV 60/80/Fisheye | p≤0.01 | Significant Drop | VLM biased toward human eye intrinsics |
| Camera Height +0.5m (Forward) | p<0.05 | Δ=−0.13 | Non-standard height significantly impairs performance |
| Right vs Left Hand (Confusion Rate) | n/a | Right 4.67% vs Left 9.38% | Right hand significantly outperforms left |

Key Findings¶

Inverse tasks outperform forward tasks for most models and horizons, and for every listed model at \(L=6\) and \(L=10\); in the table above, a few \(L=3\) entries (GPT-5 mini, Qwen2.5-VL-72B) are exceptions. This suggests that language-based retrospective reasoning is stronger than forward visual simulation.
Accuracy decreases monotonically as trajectory length increases. At \(L = 8\) to \(10\), most models' Task Accuracy approaches zero while humans remain stable at \(>93\%\).
GPT-5 and Gemini 2.5 Pro only approach human-level performance at \(L=3\); the gap widens rapidly in long horizons.
Primary error types: In forward tasks, hallucinations (43.9%) and omissions (37.1%) together account for about 81% of errors; in inverse tasks, the two are roughly equal at about 41.8% each.
Cosmos-Reason1-7B (trained on embodied data) is more stable than same-sized models at \(L>5\).
Sim-to-real results are highly consistent, validating minimal simulation gaps.

Highlights & Insights¶

Elegant yet comprehensive task design: The "narrow" form of sequence reshuffling VQA implicitly demands core embodied capabilities like affordance recognition, action-effect reasoning, and partially observable long-term memory while avoiding video synthesis noise.
Truly scalable data pipeline: Combinatorial keyframe sampling allows a single trajectory to generate millions of QA pairs, providing a foundation for large-scale embodied cognition research.
Stark human contrast: Humans maintain roughly 94% pairwise accuracy across all horizons, while the strongest VLM on the forward task (GPT-5) drops to 47% at \(L=10\), a gap much larger than previously assumed.
Quantified anthropomorphic bias: Reveals a deep-seated bias in training data. VLMs' default world-view is tightly coupled with human perspectives, making it difficult to generalize to non-human robotic viewpoints.
Semantic error framework: Directs future improvements by showing that the main issue is not misidentifying specific changes, but omitting real state changes or hallucinating ones that never occurred.

Limitations & Future Work¶

Relies solely on simulation (BEHAVIOR); despite small sim-to-real gaps, the diversity of real-world trajectories remains limited.
Action representation via scene graph differences depends on ground truth states from the simulator, which is difficult to replicate on physical robot platforms.
Evaluation scale (8972 QA) is smaller than common LLM benchmarks and covers only 29 home activities; scene diversity needs expansion.
Does not yet address cross-modal execution (language → action) or actual robot control, measuring only "understanding."

Comparison with Related Work¶

vs EmbodiedScan / ScanQA (Static Scene VQA): ENACT introduces temporal action chains and partial observability, upgrading from static spatial understanding to dynamic causal reasoning.
vs Aurora-Bench: Aurora-Bench focuses on short-horizon general video world modeling; ENACT focuses on long-horizon robot manipulation with explicit action semantics.
vs BEHAVIOR Challenge: ENACT reuses BEHAVIOR trajectory data but transforms it into an evaluation-oriented rather than training-oriented benchmark.
vs Cosmos-Reason1 (Embodied VLMs): Results highlight the value of embodied data training for long-horizon stability, providing quantitative evidence for future training data design.

Rating¶

Novelty: ⭐⭐⭐⭐ Fusion of world modeling, POMDP, and sequence reshuffling into a unified framework with clear concepts and formalization.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated 30 models, 8972 QAs, multi-dimensional bias analysis (perspective/FOV/handedness), and sim-to-real comparison.
Writing Quality: ⭐⭐⭐⭐⭐ Highly structured, balancing mathematical and intuitive explanations with excellent readability.
Value: ⭐⭐⭐⭐⭐ Fills the gap in long-horizon embodied cognition evaluation with scalable data, providing essential infrastructure for VLM research.