MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene Generation
Conference: NeurIPS 2025 arXiv: 2510.04057 Code: None Area: Others Keywords: 3D asset retrieval, scene awareness, graph neural networks, SE(3) equivariance, multimodal fusion
TL;DR
MetaFind is a scene-aware tri-modal (text + image + point cloud) 3D asset retrieval framework that encodes scene layout information via an SE(3)-equivariant spatial-semantic graph neural network (ESSGNN), enabling iterative asset retrieval with style consistency and spatial coherence for metaverse scene generation.
Background & Motivation
Metaverse and virtual scene generation require retrieving appropriate objects from large-scale 3D asset libraries to assemble scenes. Existing methods suffer from three core issues:
Ignoring scene context: Existing 3D retrieval methods (e.g., ULIP, OpenShape) focus primarily on object-level geometric features without considering spatial relationships, semantic consistency, or stylistic coherence.
Lack of a standardized retrieval paradigm: Unlike the well-established dual-encoder (DPR) architecture in NLP, 3D asset retrieval lacks a dedicated, standardized framework.
Single-modality query limitation: Most methods support only a single modality (3D→3D or text→3D) and cannot handle multimodal combination queries.
Furthermore, when an object is placed into a scene, its positional relationships with existing objects, functional compatibility, and overall aesthetic harmony must be considered — aspects that purely object-level retrieval cannot address.
Method
Overall Architecture
MetaFind adopts a dual-tower retrieval architecture: a query encoder (supporting arbitrary modality combinations with optional layout information) and a gallery encoder (precomputing embeddings for all 3D assets). The framework builds on the ULIP-2 backbone for tri-modal alignment and injects scene spatial context via the ESSGNN module.
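The dual-tower setup can be sketched as precomputed gallery embeddings matched against an encoded query by cosine similarity. The names below (`cosine_topk`, the toy `gallery`) are illustrative stand-ins, not the paper's code:

```python
import numpy as np

def cosine_topk(query_emb, gallery, k=5):
    """Return indices of the k gallery assets most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                            # cosine similarity per asset
    return np.argsort(-scores)[:k].tolist()

# Toy gallery: 4 "assets" already embedded in a shared 3-d space.
gallery = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.7, 0.7, 0.0],
                    [0.0, 0.0, 1.0]])
query = np.array([0.9, 0.1, 0.0])
top = cosine_topk(query, gallery, k=2)       # most similar assets first
```

Because the gallery side is precomputed and frozen, only the query encoder needs to run at retrieval time, which is what makes layout-conditioned re-encoding of the query cheap enough for iterative use.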
Key Designs
- ESSGNN Equivariant Spatial-Semantic Graph Encoder: The core contribution. The scene is modeled as a graph \(G = (\mathcal{V}, \mathcal{E})\), where nodes represent existing objects in the scene (with 3D coordinates \(x_i\) and text features \(t_i\)), and edges include physical relation edges (e.g., "cup on table") and semantic relation edges (functional association descriptions generated by an LLM and encoded via CLIP). Message passing follows a modified EGCL structure:
- Feature update: \(h_i^{l+1} = h_i^l + \sum_{j \in \mathcal{N}(i)} f_h(d_{ij}^l, h_i^l, h_j^l, e_{ij})\)
- Coordinate update: \(x_i^{l+1} = x_i^l + \sum_{j \in \mathcal{N}(i)} (x_i^l - x_j^l) \cdot f_x(d_{ij}^l, h_i^{l+1}, h_j^{l+1}, e_{ij})\)
This yields SE(3) equivariance: node features are invariant to scene rotations and translations, and coordinate updates transform consistently with them, so the retrieval embedding does not depend on the scene's coordinate frame.
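A minimal NumPy sketch of the EGCL-style updates above, with `f_h` and `f_x` replaced by simple closed forms (stand-ins for the paper's learned MLPs) so the SE(3) behaviour is easy to verify:

```python
import numpy as np

def egcl_step(h, x, edges):
    """One message-passing step (illustrative, not the authors' implementation).
    h: (n, d) node features; x: (n, 3) coordinates; edges: directed (i, j) pairs."""
    h_new, x_new = h.copy(), x.copy()
    for i, j in edges:
        d_ij = np.linalg.norm(x[i] - x[j])    # invariant pairwise distance
        h_new[i] += np.tanh(h[i] + h[j] + d_ij)   # stand-in for f_h
        w = np.tanh(d_ij)                         # stand-in for f_x
        x_new[i] += (x[i] - x[j]) * w             # equivariant coordinate update
    return h_new, x_new
```

Because messages depend on coordinates only through pairwise distances, rotating and translating `x` leaves `h_new` unchanged, while `x_new` is rotated and translated by the same transform.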
- Modality-Aware Fusion Strategy: Supports any subset of text, image, and point cloud as query input. During training, 30% random modality dropout (masking) is applied to simulate missing-modality conditions, using mask embeddings rather than zero-padding to prevent model degradation. Fusion options include mean pooling, MLP, gated fusion, and Transformer.
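The mask-embedding idea can be sketched as follows, using mean-pool fusion (one of the listed options). The random mask vectors here stand in for learned parameters; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4

# Per-modality mask embeddings (random stand-ins for learned parameters).
# A dropped or missing modality contributes its mask embedding rather than
# zeros, so the fused vector stays in-distribution.
MASK = {m: rng.normal(size=DIM) for m in ("text", "image", "pc")}

def fuse(features, drop_p=0.3, train=True):
    """Mean-pool fusion over text/image/point-cloud features.
    features: dict mapping modality name to vector (missing keys allowed)."""
    vecs = []
    for m in ("text", "image", "pc"):
        f = features.get(m)
        dropped = train and rng.random() < drop_p   # 30% modality dropout
        vecs.append(MASK[m] if f is None or dropped else f)
    return np.mean(vecs, axis=0)

# Inference with only a text feature: image/pc fall back to mask embeddings.
out = fuse({"text": np.ones(DIM)}, train=False)
```

Substituting zeros instead of `MASK[m]` shrinks the fused vector's norm whenever modalities are missing, which is the degradation the mask embeddings avoid.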
- Iterative Scene Composition: At inference time, objects are retrieved and placed one at a time (Algorithm 1); after each placement, the layout embedding is recomputed so that subsequent retrievals are aware of the updated spatial context. Region-decomposition-based parallel retrieval is also supported for improved efficiency.
Loss & Training
A two-stage training strategy is employed:
Stage 1 (Cross-modal Alignment Pretraining): Trained on the Objaverse-LVIS dataset (48K 3D assets) using contrastive learning to align the multimodal embedding space: \[\mathcal{L}_{pre} = -\log \frac{\exp(\text{sim}(f_{query}(Q), f_{gallery}(A))/\tau)}{\sum_{A'} \exp(\text{sim}(f_{query}(Q), f_{gallery}(A'))/\tau)}\]
Stage 2 (Layout-Aware Fine-tuning): Fine-tuned on the ProcTHOR-10K dataset, incorporating the ESSGNN encoder and a bidirectional contrastive loss: \[\mathcal{L}_{layout} = \frac{1}{2}(\mathcal{L}_{layout}^{q2g} + \mathcal{L}_{layout}^{g2q})\] The gallery encoder is frozen; only the query-side fusion layers and the ESSGNN module are updated. A 30% scene dropout is applied to ensure generalization to layout-free inputs.
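The Stage 1 objective is a standard InfoNCE loss. A minimal NumPy sketch, assuming in-batch negatives (the paper's exact negative sampling may differ):

```python
import numpy as np

def info_nce(q_emb, g_emb, tau=0.07):
    """InfoNCE over in-batch negatives.
    q_emb, g_emb: (n, d) L2-normalized embeddings; row i is a positive pair."""
    sims = (q_emb @ g_emb.T) / tau                 # (n, n) similarity matrix
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # -log softmax on the diagonal
```

The bidirectional Stage 2 loss averages this quantity computed over rows (query-to-gallery) and over columns (gallery-to-query) of the same similarity matrix.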
Key Experimental Results
Object-Level Retrieval Performance (Objaverse-LVIS, R@1/R@5 %)
| Method | Text Only | Image Only | PC Only | T+I | T+PC | I+PC | T+I+PC |
|---|---|---|---|---|---|---|---|
| ULIP | 0.1/0.9 | 0.1/1.3 | 97.9/99.4 | 0/0.3 | 33.9/58.0 | 22.6/41.6 | 6.4/15.9 |
| OpenShape | 0.6/1.7 | 0.3/1.1 | 98.4/99.7 | 0/0.5 | 35.1/61.4 | 25.0/44.3 | 7.0/17.2 |
| OmniBind (Full) | 5.3/11.7 | 2.3/3.5 | 99.0/99.7 | 0.5/1.2 | 37.5/60.8 | 27.5/46.4 | 11.9/23.4 |
| MetaFind w/o ESSGNN | 13.8/23.1 | 11.7/19.2 | 75.1/78.0 | 17.2/21.8 | 44.5/71.3 | 45.8/73.1 | 51.7/76.5 |
| MetaFind w/ ESSGNN | 11.3/21.5 | 10.5/15.9 | 63.2/66.5 | 15.9/20.3 | 41.2/68.8 | 42.0/70.4 | 48.2/74.9 |
Scene-Level Quality Evaluation (1–5 scale)
| Method | Aesthetics (GPT/Human) | Color & Material (GPT/Human) | Scene Consistency (GPT/Human) | Geometric Realism (GPT/Human) |
|---|---|---|---|---|
| ULIP | 2.91/3.02 | 2.84/2.97 | 2.76/2.89 | 2.70/2.81 |
| OpenShape | 3.14/3.28 | 3.08/3.19 | 3.01/3.11 | 2.95/3.06 |
| MetaFind w/o ESSGNN | 3.42/3.55 | 3.31/3.41 | 3.26/3.33 | 3.22/3.30 |
| MetaFind w/ ESSGNN | 4.13/4.25 | 4.04/4.17 | 4.10/4.21 | 4.06/4.18 |
Ablation Study (Text Only)
| Variant | R@1 (%) | Aesthetics (GPT) | Scene Consistency (GPT) |
|---|---|---|---|
| Full (bidirectional + iterative + ESSGNN) | 11.4 | 4.1 | 4.2 |
| w/o layout context | 13.5 | 3.4 | 3.3 |
| w/ GAT replacing ESSGNN | 11.0 | 3.4 | 3.7 |
| Fusion = Mean | 9.4 | 3.2 | 3.5 |
| Modality Dropout = 50% | 13.2 | 3.1 | 3.2 |
| Zero-padding for missing modalities | 10.5 | 3.1 | 3.1 |
Key Findings
- ESSGNN substantially improves scene quality: Although incorporating ESSGNN causes a slight decrease in object-level R@1 (due to feature attribution shift), scene-level scores improve by approximately 0.7–0.9 points (out of 5), demonstrating the value of spatial context encoding.
- GAT is sensitive to coordinate normalization: Standard GAT lacks translation invariance and produces unstable embeddings under large-scale or unnormalized coordinate systems; the SE(3) equivariance of ESSGNN effectively addresses this issue.
- 30% modality dropout is optimal: Values below this threshold lead to overfitting on full-modality inputs, while higher values introduce training instability.
- Iterative retrieval outperforms one-shot retrieval: Placing objects one at a time and updating the scene graph significantly improves spatial coherence.
Highlights & Insights
- This work is the first to introduce scene-aware layout encoding into 3D asset retrieval, advancing the paradigm from object-level to scene-level retrieval.
- The SE(3)-equivariant design of ESSGNN elegantly resolves coordinate system inconsistencies encountered in open-world environments.
- The dual edge types (physical relations plus LLM-generated functional relations) enrich the expressive power of the graph.
- The iterative scene composition strategy allows retrieval results to dynamically adapt as the scene evolves, analogous to how humans furnish a room incrementally.
Limitations & Future Work
- Asset descriptions rely on GPT-4o generation, which may introduce linguistic bias and hallucinations.
- Evaluation is currently limited to indoor single-room scenes; generalization to open-world settings remains unverified.
- The object-level accuracy degradation introduced by ESSGNN is not fully resolved (the authors suggest maintaining dual fusion heads).
- The computational overhead of iterative retrieval grows linearly with the number of objects in the scene.
- End-to-end comparisons with dedicated scene generation methods (e.g., LayoutGPT, CTRL-Room) are absent.
Related Work & Insights
- The equivariant graph neural network design is inspired by EGNN from drug discovery (Satorras et al., 2021) and extended to incorporate semantic edge features.
- The I-Design scene generation pipeline (Celen et al., 2024) is integrated for downstream evaluation.
- Insight: In 3D scene understanding, joint modeling of spatial and semantic relationships is key to improving retrieval quality, and equivariance ensures robust generalization of representations.
Rating
- Novelty: ⭐⭐⭐⭐ (ESSGNN and the scene-aware retrieval framework are genuinely novel)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Comprehensive multi-dimensional evaluation with thorough ablation studies)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, though some details are overly verbose)
- Value: ⭐⭐⭐⭐ (Advances 3D retrieval from object-level to scene-level)