Action-guided Generation of 3D Functionality Segmentation Data¶
Conference: CVPR 2026 | arXiv: 2511.23230 | Code: Project Page | Area: 3D Vision / Embodied AI | Keywords: 3D functionality segmentation, synthetic data generation, action descriptions, LLM retrieval, scene layout
TL;DR¶
This paper presents SynthFun3D, the first method for automatically generating 3D functionality segmentation training data from action descriptions. By leveraging metadata-driven 3D object retrieval and scene layout generation, it produces precise part-level interaction masks without manual annotation. Training on combined synthetic and real data yields +2.2 mAP / +6.3 mAR / +5.7 mIoU over training on real data alone on the SceneFun3D benchmark.
Background & Motivation¶
Task Definition: 3D functionality segmentation — given a natural language action description (e.g., "open the second drawer of the nightstand"), segment the interactable part in a 3D scene (e.g., the drawer handle). This is a key perception task for embodied AI.
Core Limitation: Annotated data is extremely scarce. The only publicly available dataset, SceneFun3D, contains only 230 scenes with 3,041 functional masks, and collection and annotation costs are prohibitively high (estimated at $25K per 230 scenes).
Key Challenge: Deep learning models require large-scale training data, yet fine-grained 3D functional masks are practically impossible to annotate at scale. While synthetic data has proven effective in other perception tasks, no targeted data generation approach has been proposed for 3D functionality segmentation.
Core Idea: Starting from action descriptions, the method uses LLMs to reason about scene composition, retrieves 3D assets with part-level annotations, and automatically generates scene layouts and accurate functional masks satisfying spatial-semantic constraints.
Method¶
Overall Architecture¶
Action description → LLM parsing (target object + functional part + room layout) → Non-target object retrieval from Objaverse → Target object retrieval from PartNet-Mobility (metadata-driven) → DFS scene layout → Multi-view rendering + material augmentation → RGB frames + functional masks.
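To make the flow concrete, here is a minimal end-to-end sketch of such a pipeline. Every helper function here (`parse_action`, `retrieve_objaverse`, `retrieve_partnet_mobility`, `solve_layout_dfs`, `render_views`) is a hypothetical stand-in for the stages listed above, not the authors' actual API.

```python
# Illustrative end-to-end sketch of the generation pipeline described above.
# All helpers are hypothetical stand-ins, not the authors' code.

def generate_sample(action_description: str):
    # 1. LLM parses the action into target object, functional part, room type,
    #    and spatial constraints.
    parsed = parse_action(action_description)
    # e.g. {"target": "nightstand", "part": "drawer handle",
    #       "room": "bedroom", "constraints": [...]}

    # 2. Retrieve context furniture from Objaverse and the articulated target
    #    object (with part-level annotations) from PartNet-Mobility.
    context_objects = retrieve_objaverse(parsed["room"])
    target_object = retrieve_partnet_mobility(parsed)

    # 3. DFS over candidate placements until all layout constraints are satisfied;
    #    one feasible layout is picked at random.
    layout = solve_layout_dfs([target_object] + context_objects, parsed["constraints"])

    # 4. Render multi-view RGB frames with randomized materials, together with the
    #    mask of the functional part referenced by the action.
    frames, masks = render_views(layout, mask_part=parsed["part"],
                                 randomize_materials=True)
    return frames, masks
```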
Key Designs¶
- Metadata-driven Mask Retrieval (Core Contribution; a minimal retrieval sketch follows after this list):
  - Text-to-asset retrieval: Uses PerceptionEncoder for text-image similarity retrieval, retaining all candidates above a threshold.
  - Requirement filtering: An LLM infers functional part requirements from the action description (e.g., "open the third drawer" → handle count ≥ 3) and filters out non-qualifying candidates.
  - Functional part spatial arrangement: Computes 3D centroids of functional parts in candidate objects, projects them to 2D, and uses an LLM to verify whether the spatial arrangement matches the semantic constraints (e.g., "top-left drawer" requires a grid arrangement).
  - Hierarchical metadata: Leverages PartNet-Mobility's hierarchical part structure to enrich labels (e.g., "handle" → "door handle" vs. "drawer handle") and resolve ambiguity.
  - Design Motivation: Action descriptions implicitly impose highly specific structural requirements on objects (e.g., "left door" implies a horizontally arranged double-door configuration), which simple text retrieval cannot handle.
- Scene Layout Optimization:
  - An LLM generates layout constraint clauses (e.g., a proximity relation between the nightstand and the bed).
  - A DFS algorithm searches for layouts satisfying all constraints.
  - A feasible solution is randomly selected.
  - Design Motivation: Spatial relations in action descriptions (e.g., "cabinet near the window") are critical for training data validity.
- Material-augmented Rendering: 200 material types (metallic/matte/plastic/glass, etc.) are randomly generated and applied to walls and target objects, expanding data diversity at near-zero cost.
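The retrieval sketch referenced above: a rough illustration of the three-stage metadata-driven filter (text-image similarity, LLM requirement filtering, spatial-arrangement checking). The helpers `encode_text`, `llm_requirements`, `llm_check_arrangement`, and the asset attributes are assumptions for illustration, not the paper's interfaces.

```python
import numpy as np

# Hypothetical sketch of the three-stage metadata-driven retrieval described above.
# All helpers and asset fields are illustrative stand-ins.

def retrieve_target(action: str, assets: list, sim_threshold: float = 0.3):
    # Stage 1: keep all assets whose rendered thumbnail embedding is similar enough
    # to the action text (the paper uses PerceptionEncoder for this similarity).
    query = encode_text(action)
    candidates = [a for a in assets
                  if float(query @ a.image_embedding) >= sim_threshold]

    # Stage 2: an LLM infers structural requirements, e.g. "open the third drawer"
    # implies at least three drawer handles; drop assets that cannot satisfy them.
    req = llm_requirements(action)   # e.g. {"part": "drawer handle", "min_count": 3}
    candidates = [a for a in candidates
                  if len(a.parts_labeled(req["part"])) >= req["min_count"]]

    # Stage 3: project part centroids to 2D and let an LLM verify the arrangement,
    # e.g. "top-left drawer" requires a grid of drawers rather than a vertical stack.
    valid = []
    for a in candidates:
        centroids_3d = np.stack([p.centroid for p in a.parts_labeled(req["part"])])
        centroids_2d = centroids_3d[:, :2]   # front view: drop depth, assuming z is depth
        if llm_check_arrangement(action, centroids_2d.tolist()):
            valid.append(a)
    return valid
```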
Loss & Training¶
- SynthFun3D is a data generation pipeline and does not involve loss functions.
- Downstream validation: Gemma3-4B (LoRA) is fine-tuned to learn grounding from action descriptions to functional parts.
- Integrated into the Fun3DU pipeline: Gemma3 grounding → SAM segmentation → 2D mask lifting to 3D.
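The 2D-to-3D lifting step can be pictured as projecting scene points into each view and accumulating per-point mask votes. The sketch below is a generic formulation under assumed per-frame intrinsics and extrinsics, not necessarily Fun3DU's exact implementation.

```python
import numpy as np

# Generic sketch of lifting per-frame 2D masks onto a 3D point cloud by projecting
# points into each camera and accumulating votes. The per-frame dict layout and the
# voting rule are illustrative assumptions, not Fun3DU's code.

def lift_masks_to_3d(points, frames, vote_ratio=0.5):
    """points: (N, 3) scene point cloud in world coordinates.
    frames: dicts with 'mask' (H x W bool), 'K' (3x3 intrinsics),
            'world_to_cam' (4x4 extrinsics)."""
    n = len(points)
    votes = np.zeros(n)
    seen = np.zeros(n)
    homog = np.concatenate([points, np.ones((n, 1))], axis=1)  # (N, 4)

    for f in frames:
        cam = (f["world_to_cam"] @ homog.T).T[:, :3]   # points in camera space
        z = cam[:, 2]
        in_front = z > 1e-6

        uvw = (f["K"] @ cam.T).T                       # perspective projection
        uv = np.full((n, 2), -1, dtype=int)
        uv[in_front] = np.round(uvw[in_front, :2] / z[in_front, None]).astype(int)

        h, w = f["mask"].shape
        visible = (in_front & (uv[:, 0] >= 0) & (uv[:, 0] < w)
                            & (uv[:, 1] >= 0) & (uv[:, 1] < h))
        seen[visible] += 1
        idx = np.where(visible)[0]
        votes[idx] += f["mask"][uv[idx, 1], uv[idx, 0]]

    # A point is assigned to the functional part if enough of the views seeing it agree.
    return (seen > 0) & (votes / np.maximum(seen, 1) >= vote_ratio)
```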
Key Experimental Results¶
Main Results¶
| Training Data | mAP | AP50 | AP25 | mAR | mIoU | P-acc |
|---|---|---|---|---|---|---|
| Zero-shot | 0 | 0 | 0 | 8.4 | 0.07 | 0.003 |
| R (Real only) | 0.31 | 0.67 | 1.12 | 20.22 | 1.18 | 0.170 |
| S (Synthetic only) | 0.43 | 0.90 | 1.57 | 18.29 | 1.23 | 0.118 |
| S + A (Synthetic + Augmented) | 0.38 | 1.35 | 3.60 | 18.49 | 2.25 | 0.176 |
| R + S | 1.17 | 2.92 | 7.42 | 26.20 | 4.40 | 0.320 |
| R + S + A | 2.56 | 5.17 | 12.81 | 26.54 | 6.91 | 0.384 |
Ablation Study¶
| Configuration | Key Finding | Notes |
|---|---|---|
| Synthetic only vs. Real only | mIoU: 1.23 vs. 1.18 | Synthetic data can substitute for real data |
| Material augmentation effect | 2.25 vs. 1.23 | +83% mIoU |
| Importance of mixed training | 4.40 vs. 2.25 (S+A) | Real data compensates for domain gap |
| Full data combination | 6.91 | Optimal: diversity is key |
| Per-category analysis | Furniture: large gain; Window: limited gain | Affected by asset library coverage |
Key Findings¶
- Synthetic-only data matches real-data performance (1.23 vs. 1.18 mIoU).
- Mixed synthetic + real training is critical: significantly outperforms either alone.
- Material augmentation contributes substantial gains at near-zero cost (+83% mIoU).
- Synthetic data costs ~$1/scene vs. ~$109/scene for real data — a 100× reduction.
- Point accuracy more than doubles (0.170 → 0.384), indicating that synthetic data helps VLMs learn more precise grounding.
Highlights & Insights¶
- First data generation approach for functionality segmentation: fills a gap in this sub-field.
- Elegant metadata-driven retrieval: three-stage filtering (text similarity → requirement filtering → spatial arrangement) ensures retrieved objects precisely match the implicit requirements of action descriptions.
- "Correct spatial relations matter more than visual realism" is a key finding, suggesting that functional understanding depends more on structure than appearance.
- Highly cost-efficient: $1/scene vs. $109/scene.
Limitations & Future Work¶
- Relies on the PartNet-Mobility asset library (~2K objects / 46 categories), limiting coverage.
- Categories such as windows suffer from insufficient frequency due to occasional layout strategy failures.
- The current approach generates 2D multi-view images rather than direct 3D functional masks.
- Material augmentation is relatively simple; more advanced style transfer could further reduce the domain gap.
- Overall task performance remains low (best mIoU of 6.91 vs. GT upper bound of 29.26), highlighting the inherent difficulty of the task.
Related Work & Insights¶
- Builds on Holodeck's LLM-driven scene layout approach, extending it with functionality constraints.
- Key distinction from 3D scene synthesis methods (PhyScene, SceneFactor): focuses on precise annotation at the functional part level.
- As generative models for 3D articulated objects (CAGE, ArtFormer) mature, asset library coverage will naturally improve.
Rating¶
- Novelty: ⭐⭐⭐⭐ First synthetic data generation approach targeting functionality segmentation, though primarily a combination of existing techniques.
- Experimental Thoroughness: ⭐⭐⭐⭐ Detailed data combination comparisons with per-category analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear pipeline description and precise problem formulation.
- Value: ⭐⭐⭐⭐ Provides a scalable solution to the data bottleneck in embodied AI.