LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents

Conference: NeurIPS 2025 · arXiv: 2505.22634 · Code: Available (project page)
Area: Embodied AI / Scientific Experimentation / Simulation Platform
Keywords: embodied agent, laboratory simulation, chemical reaction, hierarchical benchmark, imitation learning

TL;DR

This paper proposes LabUtopia — a high-fidelity simulation and hierarchical benchmark suite for scientific laboratory environments. It comprises the LabSim simulator with chemical reaction modeling, LabScene for procedural laboratory scene generation, and LabBench, a five-level benchmark spanning atomic operations to long-horizon mobile manipulation. Experiments on the suite reveal that existing imitation learning methods hit significant bottlenecks on long-horizon experimental workflows and object generalization.

Background & Motivation

Background: Self-driving laboratories (SDLs) can accelerate scientific discovery, but existing systems are constrained by predefined protocols and hardware dependencies, lacking a general-purpose platform for training and evaluating embodied agents.

Limitations of Prior Work: Mainstream simulators (AI2-THOR, OmniGibson, ManiSkill3) focus on household or industrial environments, cannot model chemical reaction dynamics (color changes, product formation, etc.), and lack laboratory-grade assets and evaluation protocols.

Key Challenge: Laboratory manipulation is substantially more complex than household manipulation — it involves perception of physicochemical changes, multi-step long-horizon planning, and precise instrument control, all of which demand dedicated simulation and benchmarking infrastructure.

Key Insight: Construct an integrated simulation–scene–benchmark platform to fill the gap in embodied AI for scientific experimentation.

Core Idea: High-fidelity chemical reaction simulation + procedural laboratory scene generation + five-level hierarchical benchmark = a complete testbed for scientific embodied AI.

Method

Overall Architecture

LabUtopia consists of three core components: (1) LabSim — an Isaac Sim-based simulation engine extended with chemical reaction modeling; (2) LabScene — a library of laboratory scene and instrument assets with procedural scene generation; and (3) LabBench — a five-level hierarchical benchmark comprising 30+ tasks.

Key Designs

  1. LabSim Chemical Reaction Engine:

    • A structured database of 200 common chemical substances is constructed from PubChem, encoding attributes such as color, molar mass, and pH.
    • Given reactants, GPT-4o is used to reason about the chemical reaction process (products, color changes, etc.), and the physical states and visual attributes of substances are dynamically updated within the simulation.
    • Supports multi-physics interactions involving rigid bodies, deformable bodies, and fluids (GPU-accelerated PBD).
    • Design Motivation: Visual changes from chemical reactions (color, phase transitions) constitute critical perceptual signals for laboratory agents.
  2. LabScene Procedural Scene Generation:

    • Approximately 100 expert-validated laboratory scenes, ~60 categories of equipment, and ~80 categories of glassware/plasticware.
    • A hybrid layout strategy combining grid-based random sampling and constraint-aware depth-first search (handling collision, boundary, and instrument-specific constraints).
    • Layout scoring accounts for proximity to edges, inter-object spacing, and orientation alignment.
    • Design Motivation: Diverse scenes are critical for agent generalization, yet laboratory scenes are scarce.
  3. LabBench Five-Level Task Hierarchy:

    • Level 1 (Atomic Operations): Single-step actions such as grasp, pour, stir, press, and place.
    • Level 2 (Short-Horizon Composite): 2–3 step action sequences (e.g., open container + pour reagent).
    • Level 3 (Generalization Testing): Evaluation on unseen object shapes, materials, and scenes.
    • Level 4 (Long-Horizon Manipulation): Multi-step experimental workflows (e.g., instrument cleaning procedures) requiring high-level planning.
    • Level 5 (Mobile Manipulation): Combined navigation and manipulation tasks.
  4. Automated Trajectory Collection: Atomic actions are executed via finite state machines and RMPflow controllers; task-level controllers compose atomic actions; navigation employs A* with occupancy grid maps.
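The reaction-engine design in item 1 can be sketched as follows. This is a minimal illustration under stated assumptions — the `Substance`/`react`/`visual_state` names and the toy rule table are hypothetical stand-ins, not LabUtopia's actual API; in the paper the reasoning step is performed by GPT-4o against a PubChem-derived database rather than a hand-written rule:

```python
# Sketch of a reaction engine that updates visual attributes after a
# reaction. All names here are illustrative, not LabUtopia's real API.
from dataclasses import dataclass

@dataclass
class Substance:
    name: str
    color: str          # rendered color of the substance
    molar_mass: float   # g/mol
    ph: float

# Toy stand-in for the PubChem-derived database of ~200 substances.
DATABASE = {
    "CuSO4(aq)": Substance("CuSO4(aq)", "blue", 159.6, 4.0),
    "NaOH(aq)":  Substance("NaOH(aq)", "colorless", 40.0, 13.0),
    "Cu(OH)2":   Substance("Cu(OH)2", "pale-blue precipitate", 97.6, 7.0),
}

def react(reactants):
    """Stand-in for the LLM reasoning step: map reactant names to product
    names. A real system would query the model and validate its answer
    against the substance database."""
    rules = {frozenset(["CuSO4(aq)", "NaOH(aq)"]): ["Cu(OH)2"]}
    return rules.get(frozenset(reactants), list(reactants))

def visual_state(reactants):
    """Resolve the post-reaction visual attributes the simulator renders."""
    products = react(reactants)
    return [(DATABASE[p].name, DATABASE[p].color) for p in products]

print(visual_state(["CuSO4(aq)", "NaOH(aq)"]))
```

The key design point is the separation of concerns: the database holds ground-truth attributes, the reasoning step only names products, and the simulator re-reads attributes from the database — so an LLM hallucination can at worst name a wrong product, not invent physically inconsistent properties.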
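The navigation scheme in item 4 — A* over an occupancy grid — can be sketched in a few lines. The grid contents and helper names below are illustrative, not taken from the LabUtopia codebase:

```python
# Minimal A* over a binary occupancy grid (0 = free, 1 = occupied).
import heapq

def astar(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    def h(p):  # Manhattan-distance heuristic (admissible on a 4-connected grid)
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_set = [(h(start), 0, start, None)]  # (f, g, cell, parent)
    came_from, g_cost = {}, {start: 0}
    while open_set:
        _, g, cur, parent = heapq.heappop(open_set)
        if cur in came_from:          # already expanded with a better cost
            continue
        came_from[cur] = parent
        if cur == goal:               # walk parents back to rebuild the path
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        r, c = cur
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < g_cost.get((nr, nc), float("inf")):
                    g_cost[(nr, nc)] = ng
                    heapq.heappush(open_set, (ng + h((nr, nc)), ng, (nr, nc), cur))
    return None  # goal unreachable

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))  # shortest path around the wall, 7 cells
```

A production planner would add grid inflation around obstacles and smooth the resulting path before handing waypoints to the base controller.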

Key Experimental Results

Level 1–2: ACT vs. Diffusion Policy

| Task | ACT | DP |
| --- | --- | --- |
| Stir (Level 1) | 86.7% | 95.0% |
| Pick (Level 1) | 75.0% | 86.7% |
| Pour Liquid (Level 2) | 67.5% | 50.0% |
| Heater Beaker (Level 2) | 86.7% | 25.0% |
| Stir w/ GlassRod (Level 2) | 55.0% | 10.0% |

Level 3: Generalization Testing (ID / OOD)

| Task | π₀ (ID/OOD) | ACT (ID/OOD) | DP (ID/OOD) |
| --- | --- | --- | --- |
| Pick | 83.3 / 85.8 | 81.7 / 71.7 | 53.3 / 41.7 |
| Heater Beaker | 89.1 / 86.7 | 86.7 / 80.0 | 21.6 / 8.3 |
| Pour Liquid | 40.0 / 38.3 | 75.0 / 65.0 | 46.6 / 31.6 |

Level 4: Long-Horizon Task (Clean Beaker, 7 steps)

  • ACT: SP=14.0%; cumulative step success falls from A1=99.3% to A2=51.9% ... A7=1.6% — errors compound across steps, causing a steep decline in the later stages.
  • DP fails almost completely in later steps.
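The collapse above follows from simple arithmetic. Under the simplifying assumption that step successes are independent, the probability of completing an n-step task is the product of per-step success rates — so even a policy that is strong at every individual step fails most long-horizon rollouts:

```python
# Back-of-the-envelope illustration of cumulative error: under an
# independence assumption, chained success is the product of the
# per-step success rates.
def chained_success(per_step_rates):
    p = 1.0
    for r in per_step_rates:
        p *= r
    return p

# Even 90% reliability at every step completes fewer than half of
# all 7-step rollouts.
print(round(chained_success([0.90] * 7), 3))  # 0.9**7 ≈ 0.478
```

The observed drops are steeper than this independence model predicts (A2 already falls to 51.9%), suggesting that errors also propagate across steps rather than occurring independently.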

Extreme Shape Generalization Test

  • ACT achieves only 1.7% on Pick OOD and 0.0% on Pour Liquid OOD — both evaluated models essentially fail to manipulate objects of unseen sizes.

Key Findings

  • DP tends to "stall": Short prediction horizons lead to hovering without action (e.g., button-pressing tasks).
  • Long-horizon tasks are the primary bottleneck: 7-step task success rate reaches only 1.6%; cumulative error is the core challenge.
  • Shape generalization is near zero: After joint training on objects of different sizes, OOD success rates approach 0%.
  • Fine-tuned π₀ (pretrained VLA) shows robust OOD performance but no decisive advantage: It generalizes well to visual variation but does not consistently outperform models trained from scratch.

Highlights & Insights

  • Fills an important gap: The first embodied AI simulation platform supporting chemical reaction modeling, specifically targeting scientific experimentation.
  • Well-designed hierarchical benchmark: Five progressively challenging levels from atomic operations to long-horizon mobile manipulation systematically expose capability bottlenecks.
  • Large asset scale: 200+ scene and instrument assets are expert-validated, supporting large-scale training.
  • Reveals critical bottlenecks: Long-horizon cumulative error and near-zero shape generalization are identified as fundamental weaknesses of current imitation learning approaches.

Limitations & Future Work

  • The chemical reaction engine relies on GPT-4o for reasoning, which may introduce chemical knowledge errors.
  • Validation is conducted in simulation only; sim-to-real transfer has not been tested.
  • Level 5 (mobile manipulation) results are not fully reported.
  • Fluid simulation is based on PBD rather than SPH/MPM, limiting chemical fidelity.
  • Scene generation relies primarily on heuristic layout strategies rather than learned methods.
Comparison with Related Platforms

  • vs. OmniGibson/ManiSkill3: These platforms do not support chemical reaction modeling, lack laboratory assets, and provide no hierarchical evaluation.
  • vs. RLBench: RLBench offers high simulation quality but limited scene diversity and no navigation tasks.
  • vs. ClevrSkills: A compositional reasoning benchmark without chemical or scientific experimentation scenarios.

The work has direct implications for the advancement of scientific embodied AI and automated laboratory systems.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First platform combining chemical reaction simulation with a hierarchical scientific experimentation benchmark.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Three models evaluated across five task levels with thorough bottleneck analysis.
  • Writing Quality: ⭐⭐⭐⭐ — System design is clearly presented with detailed component descriptions.
  • Value: ⭐⭐⭐⭐⭐ — Fills a critical infrastructure gap for embodied AI in scientific experimentation.