LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents

Conference: NeurIPS 2025 · arXiv: 2505.22634 · Code: Available (project page)
Area: Embodied AI / Scientific Experimentation / Simulation Platform
Keywords: embodied agent, laboratory simulation, chemical reaction, hierarchical benchmark, imitation learning

TL;DR

This paper proposes LabUtopia — a high-fidelity simulation and hierarchical benchmark suite for scientific laboratory environments. It comprises the LabSim simulator with chemical reaction modeling, LabScene for procedural laboratory scene generation, and LabBench, a five-level benchmark spanning atomic operations to long-horizon mobile manipulation. Experiments on the suite reveal that existing imitation learning methods hit significant bottlenecks on long-horizon experimental workflows and object generalization.

Background & Motivation

Background: Self-driving laboratories (SDLs) can accelerate scientific discovery, but existing systems are constrained by predefined protocols and hardware dependencies, lacking a general-purpose platform for training and evaluating embodied agents.

Limitations of Prior Work: Mainstream simulators (AI2-THOR, OmniGibson, ManiSkill3) focus on household or industrial environments, cannot model chemical reaction dynamics (color changes, product formation, etc.), and lack laboratory-grade assets and evaluation protocols.

Key Challenge: Laboratory manipulation is substantially more complex than household manipulation — it involves perception of physicochemical changes, multi-step long-horizon planning, and precise instrument control, all of which demand dedicated simulation and benchmarking infrastructure.

Key Insight: Construct an integrated simulation–scene–benchmark platform to fill the gap in embodied AI for scientific experimentation.

Core Idea: High-fidelity chemical reaction simulation + procedural laboratory scene generation + five-level hierarchical benchmark = a complete testbed for scientific embodied AI.

Method

Overall Architecture

LabUtopia consists of three core components: (1) LabSim — an Isaac Sim-based simulation engine extended with chemical reaction modeling; (2) LabScene — a library of laboratory scene and instrument assets with procedural scene generation; and (3) LabBench — a five-level hierarchical benchmark comprising 30+ tasks.

Key Designs

  1. LabSim Chemical Reaction Engine:

    • A structured database of 200 common chemical substances is constructed from PubChem, encoding attributes such as color, molar mass, and pH.
    • Given reactants, GPT-4o is used to reason about the chemical reaction process (products, color changes, etc.), and the physical states and visual attributes of substances are dynamically updated within the simulation.
    • Supports multi-physics interactions involving rigid bodies, deformable bodies, and fluids (GPU-accelerated PBD).
    • Design Motivation: Visual changes from chemical reactions (color, phase transitions) constitute critical perceptual signals for laboratory agents.
  2. LabScene Procedural Scene Generation:

    • Approximately 100 expert-validated laboratory scenes, ~60 categories of equipment, and ~80 categories of glassware/plasticware.
    • A hybrid layout strategy combining grid-based random sampling and constraint-aware depth-first search (handling collision, boundary, and instrument-specific constraints).
    • Layout scoring accounts for proximity to edges, inter-object spacing, and orientation alignment.
    • Design Motivation: Diverse scenes are critical for agent generalization, yet laboratory scenes are scarce.
  3. LabBench Five-Level Task Hierarchy:

    • Level 1 (Atomic Operations): Single-step actions such as grasp, pour, stir, press, and place.
    • Level 2 (Short-Horizon Composite): 2–3 step action sequences (e.g., open container + pour reagent).
    • Level 3 (Generalization Testing): Evaluation on unseen object shapes, materials, and scenes.
    • Level 4 (Long-Horizon Manipulation): Multi-step experimental workflows (e.g., instrument cleaning procedures) requiring high-level planning.
    • Level 5 (Mobile Manipulation): Combined navigation and manipulation tasks.
  4. Automated Trajectory Collection: Atomic actions are executed via finite state machines and RMPflow controllers; task-level controllers compose atomic actions; navigation employs A* with occupancy grid maps.
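The reaction-engine design in item 1 can be sketched as follows. This is a minimal illustration under stated assumptions — the `Substance`/`react`/`visual_state` names and the toy rule table are hypothetical stand-ins, not LabUtopia's actual API; in the paper the reasoning step is performed by GPT-4o against a PubChem-derived database rather than a hand-written rule:

```python
# Sketch of a reaction engine that updates visual attributes after a
# reaction. All names here are illustrative, not LabUtopia's real API.
from dataclasses import dataclass

@dataclass
class Substance:
    name: str
    color: str          # rendered color of the substance
    molar_mass: float   # g/mol
    ph: float

# Toy stand-in for the PubChem-derived database of ~200 substances.
DATABASE = {
    "CuSO4(aq)": Substance("CuSO4(aq)", "blue", 159.6, 4.0),
    "NaOH(aq)":  Substance("NaOH(aq)", "colorless", 40.0, 13.0),
    "Cu(OH)2":   Substance("Cu(OH)2", "pale-blue precipitate", 97.6, 7.0),
}

def react(reactants):
    """Stand-in for the LLM reasoning step: map reactant names to product
    names. A real system would query the model and validate its answer
    against the substance database."""
    rules = {frozenset(["CuSO4(aq)", "NaOH(aq)"]): ["Cu(OH)2"]}
    return rules.get(frozenset(reactants), list(reactants))

def visual_state(reactants):
    """Resolve the post-reaction visual attributes the simulator renders."""
    products = react(reactants)
    return [(DATABASE[p].name, DATABASE[p].color) for p in products]

print(visual_state(["CuSO4(aq)", "NaOH(aq)"]))
```

The key design point is the separation of concerns: the database holds ground-truth attributes, the reasoning step only names products, and the simulator re-reads attributes from the database — so an LLM hallucination can at worst name a wrong product, not invent physically inconsistent properties.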
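The navigation scheme in item 4 — A* over an occupancy grid — can be sketched in a few lines. The grid contents and helper names below are illustrative, not taken from the LabUtopia codebase:

```python
# Minimal A* over a binary occupancy grid (0 = free, 1 = occupied).
import heapq

def astar(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    def h(p):  # Manhattan-distance heuristic (admissible on a 4-connected grid)
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_set = [(h(start), 0, start, None)]  # (f, g, cell, parent)
    came_from, g_cost = {}, {start: 0}
    while open_set:
        _, g, cur, parent = heapq.heappop(open_set)
        if cur in came_from:          # already expanded with a better cost
            continue
        came_from[cur] = parent
        if cur == goal:               # walk parents back to rebuild the path
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        r, c = cur
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < g_cost.get((nr, nc), float("inf")):
                    g_cost[(nr, nc)] = ng
                    heapq.heappush(open_set, (ng + h((nr, nc)), ng, (nr, nc), cur))
    return None  # goal unreachable

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))  # shortest path around the wall, 7 cells
```

A production planner would add grid inflation around obstacles and smooth the resulting path before handing waypoints to the base controller.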

Key Experimental Results

Level 1–2: ACT vs. Diffusion Policy

| Task | ACT | DP |
| --- | --- | --- |
| Stir (Level 1) | 86.7% | 95.0% |
| Pick (Level 1) | 75.0% | 86.7% |
| Pour Liquid (Level 2) | 67.5% | 50.0% |
| Heater Beaker (Level 2) | 86.7% | 25.0% |
| Stir w/ GlassRod (Level 2) | 55.0% | 10.0% |

Level 3: Generalization Testing (ID / OOD)

| Task | π₀ (ID/OOD) | ACT (ID/OOD) | DP (ID/OOD) |
| --- | --- | --- | --- |
| Pick | 83.3 / 85.8 | 81.7 / 71.7 | 53.3 / 41.7 |
| Heater Beaker | 89.1 / 86.7 | 86.7 / 80.0 | 21.6 / 8.3 |
| Pour Liquid | 40.0 / 38.3 | 75.0 / 65.0 | 46.6 / 31.6 |

Level 4: Long-Horizon Task (Clean Beaker, 7 steps)

  • ACT: SP=14.0%; cumulative step success falls from A1=99.3% to A2=51.9% ... A7=1.6% — errors compound across steps, causing a steep decline in the later stages.
  • DP fails almost completely in later steps.
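The collapse above follows from simple arithmetic. Under the simplifying assumption that step successes are independent, the probability of completing an n-step task is the product of per-step success rates — so even a policy that is strong at every individual step fails most long-horizon rollouts:

```python
# Back-of-the-envelope illustration of cumulative error: under an
# independence assumption, chained success is the product of the
# per-step success rates.
def chained_success(per_step_rates):
    p = 1.0
    for r in per_step_rates:
        p *= r
    return p

# Even 90% reliability at every step completes fewer than half of
# all 7-step rollouts.
print(round(chained_success([0.90] * 7), 3))  # 0.9**7 ≈ 0.478
```

The observed drops are steeper than this independence model predicts (A2 already falls to 51.9%), suggesting that errors also propagate across steps rather than occurring independently.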

Extreme Shape Generalization Test

  • ACT achieves only 1.7% on Pick OOD and 0.0% on Pour Liquid OOD — both evaluated models essentially fail to manipulate objects of unseen sizes.

Key Findings

  • DP tends to "stall": Short prediction horizons lead to hovering without action (e.g., button-pressing tasks).
  • Long-horizon tasks are the primary bottleneck: 7-step task success rate reaches only 1.6%; cumulative error is the core challenge.
  • Shape generalization is near zero: After joint training on objects of different sizes, OOD success rates approach 0%.
  • Fine-tuned π₀ (pretrained VLA) shows robust OOD performance but no decisive advantage: It generalizes well to visual variation but does not consistently outperform models trained from scratch.

Highlights & Insights

  • Fills an important gap: The first embodied AI simulation platform supporting chemical reaction modeling, specifically targeting scientific experimentation.
  • Well-designed hierarchical benchmark: Five progressively challenging levels from atomic operations to long-horizon mobile manipulation systematically expose capability bottlenecks.
  • Large asset scale: 200+ scene and instrument assets are expert-validated, supporting large-scale training.
  • Reveals critical bottlenecks: Long-horizon cumulative error and near-zero shape generalization are identified as fundamental weaknesses of current imitation learning approaches.

Limitations & Future Work

  • The chemical reaction engine relies on GPT-4o for reasoning, which may introduce chemical knowledge errors.
  • Validation is conducted in simulation only; sim-to-real transfer has not been tested.
  • Level 5 (mobile manipulation) results are not fully reported.
  • Fluid simulation is based on PBD rather than SPH/MPM, limiting chemical fidelity.
  • Scene generation relies primarily on heuristic layout strategies rather than learned methods.
Comparison with Related Platforms

  • vs. OmniGibson/ManiSkill3: These platforms do not support chemical reaction modeling, lack laboratory assets, and provide no hierarchical evaluation.
  • vs. RLBench: RLBench offers high simulation quality but limited scene diversity and no navigation tasks.
  • vs. ClevrSkills: A compositional reasoning benchmark without chemical or scientific experimentation scenarios.

The work has direct implications for the advancement of scientific embodied AI and automated laboratory systems.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First platform combining chemical reaction simulation with a hierarchical scientific experimentation benchmark.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Three models evaluated across five task levels with thorough bottleneck analysis.
  • Writing Quality: ⭐⭐⭐⭐ — System design is clearly presented with detailed component descriptions.
  • Value: ⭐⭐⭐⭐⭐ — Fills a critical infrastructure gap for embodied AI in scientific experimentation.