MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning¶
Conference: NeurIPS 2025 arXiv: 2509.22281 Code: Available Area: Robotics / 3D Scene Generation Keywords: Tabletop scene generation, spatial reasoning chain, LLM scene generation, DPO, robotic manipulation
TL;DR¶
This paper proposes the MesaTask framework, which decomposes task descriptions into a Spatial Reasoning Chain — object reasoning → spatial relationship reasoning → scene graph construction → 3D layout — and combines a 10K+ manually annotated dataset with Direct Preference Optimization (DPO) to generate physically plausible, task-aligned tabletop manipulation scenes.
Background & Motivation¶
Background: Robotic manipulation requires diverse tabletop scenes for policy training, yet traditional approaches rely on hand-designed or purely random layouts, making it difficult to achieve both diversity and physical plausibility.
Limitations of Prior Work: Existing LLM-based scene generation methods (e.g., LayoutGPT) exhibit limited zero-shot capability and struggle to model complex inter-object relationships such as stacking and containment. Image reconstruction methods are severely affected by occlusion.
Key Challenge: A substantial gap exists between high-level task descriptions and concrete 3D layouts — how does "prepare a dinner" translate into precise 3D positions and orientations of tableware and food?
Key Insight: The Spatial Reasoning Chain decomposes the problem into a CoT pipeline: object reasoning → attribute description → spatial relationships → scene graph → 3D coordinates.
Core Idea: SFT injects spatial reasoning capability, and DPO eliminates collisions and task misalignment.
Method¶
Overall Architecture¶
(1) MesaTask-10K dataset construction (T2I → depth estimation → 3D retrieval → manual refinement → physics simulation); (2) Spatial Reasoning Chain training data construction; (3) LLM SFT + DPO training.
Key Designs¶
- MesaTask-10K Dataset
- Function: Constructs 10,700 manually annotated tabletop scenes.
- Mechanism: LLM generates scene descriptions → FLUX generates reference images → depth estimation and detection yield coarse layouts → manual refinement in Blender (10–20 min/scene) → IsaacSim physics simulation eliminates collisions.
- Design Motivation: Covers 6 tabletop categories (office, dining, kitchen, etc.), 12,000+ 3D assets (including articulated objects), and 200+ object categories.
- Spatial Reasoning Chain
- Function: Structures the task-to-scene process as a reasoning chain.
- Mechanism: Task description → object list reasoning (what objects are needed) → spatial relationship reasoning (who is on/inside/beside whom) → scene graph construction (nodes + edges) → 3D coordinate generation.
- Design Motivation: Direct prediction of 3D coordinates is overly challenging; step-by-step reasoning reduces complexity.
- SFT + DPO Training
- Function: SFT injects fundamental spatial reasoning capability; DPO eliminates collisions and misalignment.
- Mechanism: The SFT stage trains on reasoning chain data; the DPO stage constructs preferred/rejected pairs (collision-free vs. collision, task-aligned vs. misaligned).
- Design Motivation: Residual collisions and task drift persist after SFT; DPO effectively corrects these issues.
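The task → objects → relations → scene graph → coordinates pipeline above can be sketched with minimal data structures. This is an illustrative toy, not the paper's actual schema: the class names, relation vocabulary, and the trivial `layout` heuristic (stack "on" children at the parent's top surface, offset "beside" children along x) are assumptions for demonstration.

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    name: str            # object category, e.g. "plate"
    size: tuple          # (width, depth, height) bounding box in metres

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (child, relation, parent)

    def add(self, node):
        self.nodes[node.name] = node

    def relate(self, child, relation, parent):
        self.edges.append((child, relation, parent))

# Stages 1-3 for "prepare a dinner": object reasoning -> spatial
# relations -> scene graph.
g = SceneGraph()
for name, size in [("plate", (0.25, 0.25, 0.02)),
                   ("fork", (0.03, 0.18, 0.01)),
                   ("steak", (0.15, 0.10, 0.03))]:
    g.add(SceneNode(name, size))
g.relate("steak", "on", "plate")
g.relate("fork", "beside", "plate")

# Stage 4: scene graph -> coarse 3D layout, table origin at (0, 0, 0).
def layout(graph):
    pos = {"plate": (0.0, 0.0, 0.0)}            # anchor object
    for child, rel, parent in graph.edges:
        px, py, pz = pos[parent]
        if rel == "on":      # child sits on the parent's top surface
            pos[child] = (px, py, pz + graph.nodes[parent].size[2])
        elif rel == "beside":
            pos[child] = (px + graph.nodes[parent].size[0], py, pz)
    return pos

coords = layout(g)
```

The point of the decomposition is visible even in this toy: each stage's output (object list, relation triples, graph, coordinates) is a small, checkable structure, so the LLM never has to predict raw 3D coordinates in one shot.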
Loss & Training¶
SFT stage: standard language modeling loss. DPO stage: \(\mathcal{L}_{DPO} = -\log\sigma(\beta(\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}))\), where \(y_w\) is the preferred completion (collision-free, task-aligned) and \(y_l\) the rejected one.
Key Experimental Results¶
Main Results¶
| Method | FID↓ | Task Alignment↑ | Physical Plausibility↑ | Layout Quality↑ |
|---|---|---|---|---|
| LayoutGPT (zero-shot) | 185.3 | 3.2/5 | 2.8/5 | 2.5/5 |
| LLPlace (SFT) | 142.7 | 3.8/5 | 3.5/5 | 3.2/5 |
| MesaTask (SFT) | 98.5 | 4.2/5 | 4.0/5 | 3.8/5 |
| MesaTask (SFT+DPO) | 87.3 | 4.5/5 | 4.3/5 | 4.1/5 |
Ablation Study¶
| Configuration | Collision Rate↓ | Task Alignment↑ |
|---|---|---|
| SFT only | 12.3% | 4.2/5 |
| + DPO (collision pairs) | 4.1% | 4.2/5 |
| + DPO (task pairs) | 11.8% | 4.5/5 |
| + DPO (both) | 3.8% | 4.5/5 |
Key Findings¶
- DPO reduces the collision rate from 12.3% to 3.8% while simultaneously improving task alignment.
- Generation quality for complex relationships (stacking, containment) is markedly superior to zero-shot methods.
- In user studies, MesaTask achieves the highest scores across all evaluation dimensions.
Highlights & Insights¶
- Spatial Reasoning Chain: Decomposes the large gap between abstract task descriptions and 3D coordinates into learnable steps. This structured reasoning paradigm is transferable to other 3D generation tasks.
- Dataset Contribution: The 10K+ manually annotated scenes, encompassing complex relationships such as stacking and containment, fill a critical data gap in the field.
- DPO for 3D: This work represents the first application of DPO to physical collision elimination, yielding significant improvements.
Limitations & Future Work¶
- Manual annotation is costly (10–20 min/scene), making large-scale expansion difficult.
- Despite the substantial size (12K+), the 3D asset library still has coverage gaps.
- Validation is conducted solely in simulation; real-robot deployment requires cross-domain transfer.
Related Work & Insights¶
- vs. LayoutGPT: Limited zero-shot capability and difficulty modeling complex relationships; MesaTask addresses these via data-driven SFT.
- vs. SetItUp: Fixed object sets limit diversity; MesaTask covers 200+ categories with 12K+ assets.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First formal treatment of task-to-scene generation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ FID + VLM evaluation + user study.
- Writing Quality: ⭐⭐⭐⭐⭐ Dataset, method, and evaluation are systematically complete.
- Value: ⭐⭐⭐⭐⭐ Foundational infrastructure for robotic manipulation scene generation.