LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents¶
- Conference: NeurIPS 2025
- arXiv: 2505.22634
- Code: Available (project page)
- Area: Embodied AI / Scientific Experimentation / Simulation Platform
- Keywords: embodied agent, laboratory simulation, chemical reaction, hierarchical benchmark, imitation learning
TL;DR¶
This paper proposes LabUtopia — a high-fidelity simulation and hierarchical benchmark suite for scientific laboratory environments. It comprises the LabSim simulator with chemical reaction modeling, LabScene for procedural laboratory scene generation, and LabBench, a five-level benchmark spanning atomic operations to long-horizon mobile manipulation. The suite reveals significant bottlenecks in existing imitation learning methods with respect to long-horizon experimental workflows and object generalization.
Background & Motivation¶
Background: Self-driving laboratories (SDLs) can accelerate scientific discovery, but existing systems are constrained by predefined protocols and hardware dependencies, lacking a general-purpose platform for training and evaluating embodied agents.
Limitations of Prior Work: Mainstream simulators (AI2-THOR, OmniGibson, ManiSkill3) focus on household or industrial environments, cannot model chemical reaction dynamics (color changes, product formation, etc.), and lack laboratory-grade assets and evaluation protocols.
Key Challenge: Laboratory manipulation is substantially more complex than household manipulation — it involves perception of physicochemical changes, multi-step long-horizon planning, and precise instrument control, all of which demand dedicated simulation and benchmarking infrastructure.
Key Insight: Construct an integrated simulation–scene–benchmark platform to fill the gap in embodied AI for scientific experimentation.
Core Idea: High-fidelity chemical reaction simulation + procedural laboratory scene generation + five-level hierarchical benchmark = a complete testbed for scientific embodied AI.
Method¶
Overall Architecture¶
LabUtopia consists of three core components: (1) LabSim — an Isaac Sim-based simulation engine extended with chemical reaction modeling; (2) LabScene — a library of laboratory scene and instrument assets with procedural scene generation; and (3) LabBench — a five-level hierarchical benchmark comprising 30+ tasks.
Key Designs¶
- LabSim Chemical Reaction Engine (see the reaction-reasoning sketch after this list):
  - A structured database of 200 common chemical substances is constructed from PubChem, encoding attributes such as color, molar mass, and pH.
  - Given the reactants, GPT-4o is used to reason about the chemical reaction process (products, color changes, etc.), and the physical states and visual attributes of substances are dynamically updated within the simulation.
  - Supports multi-physics interactions involving rigid bodies, deformable bodies, and fluids (GPU-accelerated PBD).
  - Design Motivation: Visual changes from chemical reactions (color, phase transitions) constitute critical perceptual signals for laboratory agents.
- LabScene Procedural Scene Generation (see the layout sketch after this list):
  - Approximately 100 expert-validated laboratory scenes, ~60 categories of equipment, and ~80 categories of glassware/plasticware.
  - A hybrid layout strategy combining grid-based random sampling and constraint-aware depth-first search (handling collision, boundary, and instrument-specific constraints).
  - Layout scoring accounts for proximity to edges, inter-object spacing, and orientation alignment.
  - Design Motivation: Diverse scenes are critical for agent generalization, yet laboratory scenes are scarce.
- LabBench Five-Level Task Hierarchy:
  - Level 1 (Atomic Operations): Single-step actions such as grasp, pour, stir, press, and place.
  - Level 2 (Short-Horizon Composite): 2–3 step action sequences (e.g., open container + pour reagent).
  - Level 3 (Generalization Testing): Evaluation on unseen object shapes, materials, and scenes.
  - Level 4 (Long-Horizon Manipulation): Multi-step experimental workflows (e.g., instrument cleaning procedures) requiring high-level planning.
  - Level 5 (Mobile Manipulation): Combined navigation and manipulation tasks.
- Automated Trajectory Collection: Atomic actions are executed via finite state machines and RMPflow controllers; task-level controllers compose atomic actions; navigation employs A* with occupancy grid maps (see the FSM sketch after this list).
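For concreteness, here is a minimal sketch of how the reaction-reasoning loop described above could be wired up, assuming a hypothetical `Substance` record, a JSON-mode GPT-4o call through the `openai` client, and placeholder simulator hooks (`set_color`, `set_label`); none of these names are from the paper, and the actual LabSim prompt, schema, and update API may differ.

```python
# Hedged sketch: LLM-driven reaction reasoning in the spirit of LabSim.
# `Substance`, the prompt wording, and the simulator hooks are illustrative
# assumptions, not the paper's actual API.
import json
from dataclasses import dataclass
from openai import OpenAI

@dataclass
class Substance:
    name: str          # e.g. "copper(II) sulfate solution"
    color: str         # rendering color (named color or RGBA string)
    molar_mass: float  # g/mol, pulled from a PubChem-derived database
    ph: float

client = OpenAI()

def reason_reaction(reactants: list[Substance]) -> dict:
    """Ask GPT-4o for products and visual changes when mixing the reactants."""
    prompt = (
        "You are a chemistry assistant. Given the reactants below, return JSON "
        "with keys 'products' (list of names), 'color' (resulting mixture color), "
        "and 'notes' (e.g. precipitate or gas formation).\n"
        + "\n".join(f"- {s.name} (pH {s.ph})" for s in reactants)
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def apply_reaction(sim_liquid, reactants: list[Substance]) -> None:
    """Update the simulated liquid's visual state from the reasoned outcome."""
    outcome = reason_reaction(reactants)
    sim_liquid.set_color(outcome["color"])        # assumed simulator hook
    sim_liquid.set_label(", ".join(outcome["products"]))
```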
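A rough sketch of the hybrid layout idea (random sampling over a grid, rejection of placements that violate collision or boundary constraints, and a layout score favoring edge clearance and spacing) might look like the following. The grid resolution, circular footprints, and scoring weights are illustrative assumptions, and the paper's constraint-aware depth-first search is simplified here to rejection sampling.

```python
# Hedged sketch of constraint-aware tabletop layout, in the spirit of LabScene.
# All parameters and weights are illustrative.
import random
from dataclasses import dataclass

@dataclass
class Placement:
    name: str
    x: float
    y: float
    radius: float  # simple circular footprint

def violates_constraints(p, placed, table_w, table_h):
    # Boundary constraint: keep the whole footprint on the table.
    if not (p.radius <= p.x <= table_w - p.radius and p.radius <= p.y <= table_h - p.radius):
        return True
    # Collision constraint: no overlapping footprints.
    return any(((p.x - q.x) ** 2 + (p.y - q.y) ** 2) ** 0.5 < p.radius + q.radius
               for q in placed)

def score_layout(placed, table_w, table_h):
    # Reward clearance from table edges and spacing between objects (weights assumed).
    edge = sum(min(p.x, table_w - p.x, p.y, table_h - p.y) for p in placed)
    spacing = sum(((a.x - b.x) ** 2 + (a.y - b.y) ** 2) ** 0.5
                  for i, a in enumerate(placed) for b in placed[i + 1:])
    return 1.0 * edge + 0.5 * spacing

def sample_layout(objects, table_w=1.2, table_h=0.8, grid=0.05, tries=200):
    xs = [grid * i for i in range(int(table_w / grid))]
    ys = [grid * j for j in range(int(table_h / grid))]
    placed = []
    for name, radius in objects:
        for _ in range(tries):
            cand = Placement(name, random.choice(xs), random.choice(ys), radius)
            if not violates_constraints(cand, placed, table_w, table_h):
                placed.append(cand)
                break
    return placed, score_layout(placed, table_w, table_h)

# Example: place a beaker, a flask, and a hot plate on a 1.2 m x 0.8 m bench.
layout, score = sample_layout([("beaker", 0.05), ("flask", 0.06), ("hot_plate", 0.12)])
```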
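The trajectory-collection pipeline composes atomic actions through finite state machines; a minimal sketch of such a task-level FSM is shown below, where the atomic actions are placeholder callables standing in for RMPflow-driven controllers and their success checks (the A*/navigation side is not modeled here).

```python
# Hedged sketch of a task-level FSM that sequences atomic-action controllers,
# in the spirit of LabUtopia's automated trajectory collection. The action
# callables are placeholders, not the actual controllers.
from typing import Callable

class TaskFSM:
    def __init__(self, steps: list[tuple[str, Callable[[], bool]]]):
        self.steps = steps  # ordered (state_name, atomic_action) pairs
        self.index = 0

    def tick(self) -> str:
        """Run the current atomic action; advance on success, report failure otherwise."""
        if self.index >= len(self.steps):
            return "done"
        name, action = self.steps[self.index]
        if action():  # each atomic action reports success/failure
            self.index += 1
            return f"finished {name}"
        return f"failed {name}"

# Example composition for a pour-liquid style task (placeholder atomic actions):
fsm = TaskFSM([
    ("grasp_beaker", lambda: True),
    ("pour_into_flask", lambda: True),
    ("place_beaker", lambda: True),
])
while (status := fsm.tick()) != "done":
    if status.startswith("failed"):
        break
```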
Key Experimental Results¶
Level 1–2: ACT vs. Diffusion Policy¶
| Task | ACT | DP |
|---|---|---|
| Stir (Level 1) | 86.7% | 95.0% |
| Pick (Level 1) | 75.0% | 86.7% |
| Pour Liquid (Level 2) | 67.5% | 50.0% |
| Heater Beaker (Level 2) | 86.7% | 25.0% |
| Stir w/ GlassRod (Level 2) | 55.0% | 10.0% |
Level 3: Generalization Testing (ID / OOD)¶
| Task | π₀ | ACT | DP |
|---|---|---|---|
| Pick | 83.3/85.8 | 81.7/71.7 | 53.3/41.7 |
| Heater Beaker | 89.1/86.7 | 86.7/80.0 | 21.6/8.3 |
| Pour Liquid | 40.0/38.3 | 75.0/65.0 | 46.6/31.6 |
Level 4: Long-Horizon Task (Clean Beaker, 7 steps)¶
- ACT: SP=14.0%, A1=99.3%, A2=51.9% ... A7=1.6% — cumulative errors cause a steep decline in later steps.
- DP fails almost completely in later steps.
Extreme Shape Generalization Test¶
- ACT Pick OOD: 1.7%, Pour Liquid OOD: 0.0% — the evaluated models essentially fail to manipulate objects of unseen sizes.
Key Findings¶
- DP tends to "stall": Short prediction horizons lead to hovering without action (e.g., button-pressing tasks).
- Long-horizon tasks are the primary bottleneck: 7-step task success rate reaches only 1.6%; cumulative error is the core challenge.
- Shape generalization is near zero: After joint training on objects of different sizes, OOD success rates approach 0%.
- Fine-tuned π₀ (pretrained VLA) shows robust OOD performance but no decisive advantage: It generalizes well to visual variation but does not consistently outperform models trained from scratch.
Highlights & Insights¶
- Fills an important gap: The first embodied AI simulation platform supporting chemical reaction modeling, specifically targeting scientific experimentation.
- Well-designed hierarchical benchmark: Five progressively challenging levels from atomic operations to long-horizon mobile manipulation systematically expose capability bottlenecks.
- Large asset scale: 200+ scene and instrument assets are expert-validated, supporting large-scale training.
- Reveals critical bottlenecks: Long-horizon cumulative error and near-zero shape generalization are identified as fundamental weaknesses of current imitation learning approaches.
Limitations & Future Work¶
- The chemical reaction engine relies on GPT-4o for reasoning, which may introduce chemical knowledge errors.
- Validation is conducted in simulation only; sim-to-real transfer has not been tested.
- Level 5 (mobile manipulation) results are not fully reported.
- Fluid simulation is based on PBD rather than SPH/MPM, limiting chemical fidelity.
- Scene generation relies primarily on heuristic layout strategies rather than learned methods.
Related Work & Insights¶
- vs. OmniGibson/ManiSkill3: These platforms do not support chemical reaction modeling, lack laboratory assets, and provide no hierarchical evaluation.
- vs. RLBench: RLBench offers high simulation quality but limited scene diversity and no navigation tasks.
- vs. ClevrSkills: A compositional reasoning benchmark without chemical or scientific experimentation scenarios.
- The work has direct implications for the advancement of scientific embodied AI and automated laboratory systems.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First platform combining chemical reaction simulation with a hierarchical scientific experimentation benchmark.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Three models evaluated across five task levels with thorough bottleneck analysis.
- Writing Quality: ⭐⭐⭐⭐ — System design is clearly presented with detailed component descriptions.
- Value: ⭐⭐⭐⭐⭐ — Fills a critical infrastructure gap for embodied AI in scientific experimentation.