MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene Generation

Conference: NeurIPS 2025 arXiv: 2510.04057 Code: None Area: Others Keywords: 3D asset retrieval, scene awareness, graph neural networks, SE(3) equivariance, multimodal fusion

TL;DR

MetaFind is a scene-aware tri-modal (text + image + point cloud) 3D asset retrieval framework that encodes scene layout information via an SE(3)-equivariant spatial-semantic graph neural network (ESSGNN), enabling iterative asset retrieval with style consistency and spatial coherence for metaverse scene generation.

Background & Motivation

Metaverse and virtual scene generation require retrieving appropriate objects from large-scale 3D asset libraries to assemble scenes. Existing methods suffer from three core issues:

Ignoring scene context: Existing 3D retrieval methods (e.g., ULIP, OpenShape) focus primarily on object-level geometric features without considering spatial relationships, semantic consistency, or stylistic coherence.

Lack of a standardized retrieval paradigm: Unlike the well-established dual-encoder (DPR) architecture in NLP, 3D asset retrieval lacks a dedicated, standardized framework.

Single-modality query limitation: Most methods support only a single query modality (3D→3D or text→3D) and cannot handle combined multimodal queries.

Furthermore, when an object is placed into a scene, its positional relationships with existing objects, functional compatibility, and overall aesthetic harmony must be considered — aspects that purely object-level retrieval cannot address.

Method

Overall Architecture

MetaFind adopts a dual-tower retrieval architecture: a query encoder (supporting arbitrary modality combinations with optional layout information) and a gallery encoder (precomputing embeddings for all 3D assets). The framework builds on the ULIP-2 backbone for tri-modal alignment and injects scene spatial context via the ESSGNN module.
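To make the dual-tower pattern concrete, here is a minimal retrieval sketch in PyTorch. It is illustrative only: `query_encoder` and `gallery_encoder` are stand-ins for the ULIP-2-based towers, and the gallery side is precomputed once so that each query reduces to a cosine-similarity top-k lookup.

```python
import torch
import torch.nn.functional as F

# Minimal dual-tower retrieval sketch (illustrative, not the authors' code).
# `gallery_encoder` / `query_encoder` stand in for the ULIP-2-based towers.

@torch.no_grad()
def build_gallery_index(gallery_encoder, assets):
    """Precompute and L2-normalize one embedding per 3D asset."""
    embs = torch.stack([gallery_encoder(a) for a in assets])  # (N, d)
    return F.normalize(embs, dim=-1)

@torch.no_grad()
def retrieve(query_encoder, query, gallery_embs, k=5):
    """Encode the (possibly multimodal) query and return top-k asset indices."""
    q = F.normalize(query_encoder(query), dim=-1)  # (d,)
    scores = gallery_embs @ q                      # cosine similarities, (N,)
    return scores.topk(k).indices
```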

Key Designs

  1. ESSGNN Equivariant Spatial-Semantic Graph Encoder: The core contribution. The scene is modeled as a graph \(G = (\mathcal{V}, \mathcal{E})\), where nodes represent the objects already in the scene (each with 3D coordinates \(x_i\) and text features \(t_i\)), and edges include physical relation edges (e.g., "cup on table") and semantic relation edges (functional associations generated by an LLM and encoded via CLIP). Message passing follows a modified EGCL structure, with \(d_{ij}^l = \|x_i^l - x_j^l\|\) the pairwise distance (a layer sketch in code follows this list):

    • Feature update: \(h_i^{l+1} = h_i^l + \sum_{j \in \mathcal{N}(i)} f_h(d_{ij}^l, h_i^l, h_j^l, e_{ij})\)
    • Coordinate update: \(x_i^{l+1} = x_i^l + \sum_{j \in \mathcal{N}(i)} (x_i^l - x_j^l) \cdot f_x(d_{ij}^l, h_i^{l+1}, h_j^{l+1}, e_{ij})\)

This guarantees full SE(3) equivariance: rotating or translating the scene transforms the coordinates consistently while leaving the feature embeddings unchanged.

  2. Modality-Aware Fusion Strategy: Supports any subset of text, image, and point cloud as the query. During training, 30% random modality dropout (masking) is applied to simulate missing modalities, using learned mask embeddings rather than zero-padding to prevent model degradation (see the fusion sketch after this list). Fusion options include mean pooling, MLP, gated fusion, and Transformer.

  3. Iterative Scene Composition: At inference time, objects are retrieved and placed one at a time (Algorithm 1); after each placement, the layout embedding is recomputed so that subsequent retrievals see the updated spatial context (a loop sketch follows below). Region-decomposition-based parallel retrieval is also supported for improved efficiency.
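A minimal PyTorch sketch of one EGCL-style layer implementing the two update rules above, assuming \(f_h\) and \(f_x\) are small MLPs and `edge_attr` carries the physical/semantic edge features \(e_{ij}\). This is a generic EGNN-style layer in the spirit of the paper, not the authors' code.

```python
import torch
import torch.nn as nn

class EGCLayer(nn.Module):
    """One EGCL-style message-passing layer following the two update rules above.

    A sketch under assumptions: f_h and f_x are small MLPs, and the graph is
    given as an edge list with per-edge features (physical + semantic relations).
    """
    def __init__(self, h_dim, e_dim, hidden=64):
        super().__init__()
        # f_h: message network over (d_ij, h_i, h_j, e_ij)
        self.f_h = nn.Sequential(
            nn.Linear(1 + 2 * h_dim + e_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, h_dim))
        # f_x: scalar weight for the relative-coordinate update
        self.f_x = nn.Sequential(
            nn.Linear(1 + 2 * h_dim + e_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, 1))

    def forward(self, h, x, edge_index, edge_attr):
        src, dst = edge_index                  # edges j -> i
        rel = x[dst] - x[src]                  # (E, 3): x_i - x_j
        d = rel.norm(dim=-1, keepdim=True)     # invariant distances d_ij
        feats = torch.cat([d, h[dst], h[src], edge_attr], dim=-1)
        m = self.f_h(feats)                    # messages for the feature update
        h_new = h + torch.zeros_like(h).index_add_(0, dst, m)
        # Coordinate update uses the *updated* features, as in the equations above.
        feats2 = torch.cat([d, h_new[dst], h_new[src], edge_attr], dim=-1)
        w = self.f_x(feats2)                   # (E, 1) scalar edge weights
        x_new = x + torch.zeros_like(x).index_add_(0, dst, rel * w)
        return h_new, x_new
```

Because \(h\) only ever sees distances and edge features, it is invariant to rotations and translations, while the coordinate update is built from relative vectors and hence transforms equivariantly.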
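For the fusion strategy, here is a sketch of mean-pooling fusion with learned mask embeddings and 30% modality dropout. The module name and shapes are assumptions; the point is that a learned [MASK] vector keeps the pooled statistics well-behaved where zero-padding would drag the mean toward zero.

```python
import torch
import torch.nn as nn

class MaskedFusion(nn.Module):
    """Mean-pooling fusion with learned mask embeddings (a sketch, not the paper's code).

    Each modality feature is a (dim,) vector or None. Missing or dropped
    modalities are replaced by a learned per-modality [MASK] vector.
    """
    def __init__(self, dim, n_modalities=3, p_drop=0.3):
        super().__init__()
        self.mask_emb = nn.Parameter(torch.randn(n_modalities, dim) * 0.02)
        self.p_drop = p_drop

    def forward(self, feats):  # feats: list of Tensor-or-None, one per modality
        fused = []
        for m, f in enumerate(feats):
            dropped = self.training and torch.rand(()) < self.p_drop
            if f is None or dropped:
                fused.append(self.mask_emb[m])  # learned mask, not zero-padding
            else:
                fused.append(f)
        return torch.stack(fused).mean(dim=0)   # the mean-pooling fusion option
```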
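And a high-level sketch of the iterative composition loop, mirroring Algorithm 1 only in spirit. It reuses `retrieve` from the dual-tower sketch above; `with_asset` is a hypothetical placement step, and all names are illustrative.

```python
def compose_scene(object_requests, encoders, scene_graph, gallery_embs, k=1):
    """Iterative scene composition sketch: retrieve and place one object at a time."""
    query_encoder, layout_encoder = encoders
    placed = []
    for request in object_requests:
        layout = layout_encoder(scene_graph)        # re-encode current layout (ESSGNN)
        idx = retrieve(query_encoder, (request, layout), gallery_embs, k=k)
        scene_graph = scene_graph.with_asset(idx)   # hypothetical: add the placed asset
        placed.append(idx)                          # next retrieval sees the new layout
    return placed
```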

Loss & Training

A two-stage training strategy is employed:

Stage 1 (Cross-modal Alignment Pretraining): Trained on the Objaverse-LVIS dataset (48K 3D assets) using contrastive learning to align the multimodal embedding space:

\[
\mathcal{L}_{pre} = -\log \frac{\exp(\text{sim}(f_{query}(Q), f_{gallery}(A))/\tau)}{\sum_{A'} \exp(\text{sim}(f_{query}(Q), f_{gallery}(A'))/\tau)}
\]

Stage 2 (Layout-Aware Fine-tuning): Fine-tuned on the ProcTHOR-10K dataset, incorporating the ESSGNN encoder and a bidirectional contrastive loss:

\[
\mathcal{L}_{layout} = \frac{1}{2}\left(\mathcal{L}_{layout}^{q2g} + \mathcal{L}_{layout}^{g2q}\right)
\]

The gallery encoder is frozen; only the query-side fusion layers and the ESSGNN module are updated. A 30% scene dropout is applied to preserve generalization to layout-free inputs.
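As a reference point, the bidirectional loss can be written as a symmetric InfoNCE over a batch of matched query/gallery pairs; the Stage 1 loss \(\mathcal{L}_{pre}\) is just the query-to-gallery direction alone. A minimal sketch (the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def bidirectional_info_nce(q, g, tau=0.07):
    """Symmetric contrastive loss: mean of query->gallery and gallery->query InfoNCE.

    q, g: (B, d) batches of matched query/gallery embeddings (row i matches row i).
    """
    q = F.normalize(q, dim=-1)
    g = F.normalize(g, dim=-1)
    logits = q @ g.t() / tau                         # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    loss_q2g = F.cross_entropy(logits, targets)      # L^{q2g}
    loss_g2q = F.cross_entropy(logits.t(), targets)  # L^{g2q}
    return 0.5 * (loss_q2g + loss_g2q)
```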

Key Experimental Results

Object-Level Retrieval Performance (Objaverse-LVIS, R@1/R@5 %)

| Method | Text Only | Image Only | PC Only | T+I | T+PC | I+PC | T+I+PC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ULIP | 0.1/0.9 | 0.1/1.3 | 97.9/99.4 | 0.0/0.3 | 33.9/58.0 | 22.6/41.6 | 6.4/15.9 |
| OpenShape | 0.6/1.7 | 0.3/1.1 | 98.4/99.7 | 0.0/0.5 | 35.1/61.4 | 25.0/44.3 | 7.0/17.2 |
| OmniBind (Full) | 5.3/11.7 | 2.3/3.5 | 99.0/99.7 | 0.5/1.2 | 37.5/60.8 | 27.5/46.4 | 11.9/23.4 |
| MetaFind w/o ESSGNN | 13.8/23.1 | 11.7/19.2 | 75.1/78.0 | 17.2/21.8 | 44.5/71.3 | 45.8/73.1 | 51.7/76.5 |
| MetaFind w/ ESSGNN | 11.3/21.5 | 10.5/15.9 | 63.2/66.5 | 15.9/20.3 | 41.2/68.8 | 42.0/70.4 | 48.2/74.9 |

Scene-Level Quality Evaluation (1–5 scale)

| Method | Aesthetics (GPT/Human) | Color & Material (GPT/Human) | Scene Consistency (GPT/Human) | Geometric Realism (GPT/Human) |
| --- | --- | --- | --- | --- |
| ULIP | 2.91/3.02 | 2.84/2.97 | 2.76/2.89 | 2.70/2.81 |
| OpenShape | 3.14/3.28 | 3.08/3.19 | 3.01/3.11 | 2.95/3.06 |
| MetaFind w/o ESSGNN | 3.42/3.55 | 3.31/3.41 | 3.26/3.33 | 3.22/3.30 |
| MetaFind w/ ESSGNN | 4.13/4.25 | 4.04/4.17 | 4.10/4.21 | 4.06/4.18 |

Ablation Study (Text Only)

| Variant | R@1 (%) | Aesthetics (GPT) | Scene Consistency (GPT) |
| --- | --- | --- | --- |
| Full (bidirectional + iterative + ESSGNN) | 11.4 | 4.1 | 4.2 |
| w/o layout context | 13.5 | 3.4 | 3.3 |
| w/ GAT replacing ESSGNN | 11.0 | 3.4 | 3.7 |
| Fusion = Mean | 9.4 | 3.2 | 3.5 |
| Modality Dropout = 50% | 13.2 | 3.1 | 3.2 |
| Zero-padding for missing modalities | 10.5 | 3.1 | 3.1 |

Key Findings

  1. ESSGNN substantially improves scene quality: Although incorporating ESSGNN causes a slight decrease in object-level R@1 (due to feature attribution shift), scene-level scores improve by approximately 0.7–0.9 points (out of 5), demonstrating the value of spatial context encoding.
  2. GAT is sensitive to coordinate normalization: Standard GAT lacks translation invariance and produces unstable embeddings under large-scale or unnormalized coordinate systems; the SE(3) equivariance of ESSGNN effectively addresses this issue.
  3. 30% modality dropout is optimal: Values below this threshold lead to overfitting on full-modality inputs, while higher values introduce training instability.
  4. Iterative retrieval outperforms one-shot retrieval: Placing objects one at a time and updating the scene graph significantly improves spatial coherence.

Highlights & Insights

  • This work is the first to introduce scene-aware layout encoding into 3D asset retrieval, advancing the paradigm from object-level to scene-level retrieval.
  • The SE(3)-equivariant design of ESSGNN elegantly resolves coordinate system inconsistencies encountered in open-world environments.
  • The dual edge design (physical relations plus LLM-generated functional relations) enriches the expressive power of the graph.
  • The iterative scene composition strategy allows retrieval results to dynamically adapt as the scene evolves, analogous to how humans furnish a room incrementally.

Limitations & Future Work

  • Asset descriptions rely on GPT-4o generation, which may introduce linguistic bias and hallucinations.
  • Evaluation is currently limited to indoor single-room scenes; generalization to open-world settings remains unverified.
  • The object-level accuracy degradation introduced by ESSGNN is not fully resolved (the authors suggest maintaining dual fusion heads).
  • The computational overhead of iterative retrieval grows linearly with the number of objects in the scene.
  • End-to-end comparisons with dedicated scene generation methods (e.g., LayoutGPT, CTRL-Room) are absent.

Related Work & Notes

  • The equivariant graph neural network design is inspired by EGNN (Satorras et al., 2021), originally demonstrated on molecular tasks, and is extended here to incorporate semantic edge features.
  • The I-Design scene generation pipeline (Celen et al., 2024) is integrated for downstream evaluation.
  • Insight: In 3D scene understanding, jointly modeling spatial and semantic relationships is key to retrieval quality, and equivariance ensures representations generalize robustly across coordinate frames.

Rating

  • Novelty: ⭐⭐⭐⭐ (ESSGNN and the scene-aware retrieval framework are genuinely novel)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Comprehensive multi-dimensional evaluation with thorough ablation studies)
  • Writing Quality: ⭐⭐⭐⭐ (Clear structure, though some details are overly verbose)
  • Value: ⭐⭐⭐⭐ (Advances 3D retrieval from object-level to scene-level)