MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene Generation

Conference: NeurIPS 2025 arXiv: 2510.04057 Code: None Area: Others Keywords: 3D asset retrieval, scene awareness, graph neural networks, SE(3) equivariance, multimodal fusion

TL;DR

MetaFind is a scene-aware tri-modal (text + image + point cloud) 3D asset retrieval framework that encodes scene layout information via an SE(3)-equivariant spatial-semantic graph neural network (ESSGNN), enabling iterative asset retrieval with style consistency and spatial coherence for metaverse scene generation.

Background & Motivation

Metaverse and virtual scene generation require retrieving appropriate objects from large-scale 3D asset libraries to assemble scenes. Existing methods suffer from three core issues:

Ignoring scene context: Existing 3D retrieval methods (e.g., ULIP, OpenShape) focus primarily on object-level geometric features without considering spatial relationships, semantic consistency, or stylistic coherence.

Lack of a standardized retrieval paradigm: Unlike the well-established dual-encoder (DPR) architecture in NLP, 3D asset retrieval lacks a dedicated, standardized framework.

Single-modality query limitation: Most methods support only a single query modality (3D→3D or text→3D) and cannot handle combined multimodal queries.

Furthermore, when an object is placed into a scene, its positional relationships with existing objects, functional compatibility, and overall aesthetic harmony must be considered — aspects that purely object-level retrieval cannot address.

Method

Overall Architecture

MetaFind adopts a dual-tower retrieval architecture: a query encoder (supporting arbitrary modality combinations with optional layout information) and a gallery encoder (precomputing embeddings for all 3D assets). The framework builds on the ULIP-2 backbone for tri-modal alignment and injects scene spatial context via the ESSGNN module.
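To make the dual-tower pattern concrete, here is a minimal retrieval sketch in PyTorch. It is illustrative only: `query_encoder` and `gallery_encoder` are stand-ins for the ULIP-2-based towers, and the gallery side is precomputed once so that each query reduces to a cosine-similarity top-k lookup.

```python
import torch
import torch.nn.functional as F

# Minimal dual-tower retrieval sketch (illustrative, not the authors' code).
# `gallery_encoder` / `query_encoder` stand in for the ULIP-2-based towers.

@torch.no_grad()
def build_gallery_index(gallery_encoder, assets):
    """Precompute and L2-normalize one embedding per 3D asset."""
    embs = torch.stack([gallery_encoder(a) for a in assets])  # (N, d)
    return F.normalize(embs, dim=-1)

@torch.no_grad()
def retrieve(query_encoder, query, gallery_embs, k=5):
    """Encode the (possibly multimodal) query and return top-k asset indices."""
    q = F.normalize(query_encoder(query), dim=-1)  # (d,)
    scores = gallery_embs @ q                      # cosine similarities, (N,)
    return scores.topk(k).indices
```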

Key Designs

  1. ESSGNN Equivariant Spatial-Semantic Graph Encoder: The core contribution. The scene is modeled as a graph \(G = (\mathcal{V}, \mathcal{E})\), where nodes represent the objects already in the scene (each with 3D coordinates \(x_i\) and text features \(t_i\)), and edges include physical relation edges (e.g., "cup on table") and semantic relation edges (functional associations generated by an LLM and encoded via CLIP). Message passing follows a modified EGCL structure, with \(d_{ij}^l = \|x_i^l - x_j^l\|\) the pairwise distance (a layer sketch in code follows this list):

    • Feature update: \(h_i^{l+1} = h_i^l + \sum_{j \in \mathcal{N}(i)} f_h(d_{ij}^l, h_i^l, h_j^l, e_{ij})\)
    • Coordinate update: \(x_i^{l+1} = x_i^l + \sum_{j \in \mathcal{N}(i)} (x_i^l - x_j^l) \cdot f_x(d_{ij}^l, h_i^{l+1}, h_j^{l+1}, e_{ij})\)

This guarantees full SE(3) equivariance: rotating or translating the scene transforms the coordinates consistently while leaving the feature embeddings unchanged.

  2. Modality-Aware Fusion Strategy: Supports any subset of text, image, and point cloud as the query. During training, 30% random modality dropout (masking) is applied to simulate missing modalities, using learned mask embeddings rather than zero-padding to prevent model degradation (see the fusion sketch after this list). Fusion options include mean pooling, MLP, gated fusion, and Transformer.

  3. Iterative Scene Composition: At inference time, objects are retrieved and placed one at a time (Algorithm 1); after each placement, the layout embedding is recomputed so that subsequent retrievals see the updated spatial context (a loop sketch follows below). Region-decomposition-based parallel retrieval is also supported for improved efficiency.
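A minimal PyTorch sketch of one EGCL-style layer implementing the two update rules above, assuming \(f_h\) and \(f_x\) are small MLPs and `edge_attr` carries the physical/semantic edge features \(e_{ij}\). This is a generic EGNN-style layer in the spirit of the paper, not the authors' code.

```python
import torch
import torch.nn as nn

class EGCLayer(nn.Module):
    """One EGCL-style message-passing layer following the two update rules above.

    A sketch under assumptions: f_h and f_x are small MLPs, and the graph is
    given as an edge list with per-edge features (physical + semantic relations).
    """
    def __init__(self, h_dim, e_dim, hidden=64):
        super().__init__()
        # f_h: message network over (d_ij, h_i, h_j, e_ij)
        self.f_h = nn.Sequential(
            nn.Linear(1 + 2 * h_dim + e_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, h_dim))
        # f_x: scalar weight for the relative-coordinate update
        self.f_x = nn.Sequential(
            nn.Linear(1 + 2 * h_dim + e_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, 1))

    def forward(self, h, x, edge_index, edge_attr):
        src, dst = edge_index                  # edges j -> i
        rel = x[dst] - x[src]                  # (E, 3): x_i - x_j
        d = rel.norm(dim=-1, keepdim=True)     # invariant distances d_ij
        feats = torch.cat([d, h[dst], h[src], edge_attr], dim=-1)
        m = self.f_h(feats)                    # messages for the feature update
        h_new = h + torch.zeros_like(h).index_add_(0, dst, m)
        # Coordinate update uses the *updated* features, as in the equations above.
        feats2 = torch.cat([d, h_new[dst], h_new[src], edge_attr], dim=-1)
        w = self.f_x(feats2)                   # (E, 1) scalar edge weights
        x_new = x + torch.zeros_like(x).index_add_(0, dst, rel * w)
        return h_new, x_new
```

Because \(h\) only ever sees distances and edge features, it is invariant to rotations and translations, while the coordinate update is built from relative vectors and hence transforms equivariantly.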
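For the fusion strategy, here is a sketch of mean-pooling fusion with learned mask embeddings and 30% modality dropout. The module name and shapes are assumptions; the point is that a learned [MASK] vector keeps the pooled statistics well-behaved where zero-padding would drag the mean toward zero.

```python
import torch
import torch.nn as nn

class MaskedFusion(nn.Module):
    """Mean-pooling fusion with learned mask embeddings (a sketch, not the paper's code).

    Each modality feature is a (dim,) vector or None. Missing or dropped
    modalities are replaced by a learned per-modality [MASK] vector.
    """
    def __init__(self, dim, n_modalities=3, p_drop=0.3):
        super().__init__()
        self.mask_emb = nn.Parameter(torch.randn(n_modalities, dim) * 0.02)
        self.p_drop = p_drop

    def forward(self, feats):  # feats: list of Tensor-or-None, one per modality
        fused = []
        for m, f in enumerate(feats):
            dropped = self.training and torch.rand(()) < self.p_drop
            if f is None or dropped:
                fused.append(self.mask_emb[m])  # learned mask, not zero-padding
            else:
                fused.append(f)
        return torch.stack(fused).mean(dim=0)   # the mean-pooling fusion option
```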
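And a high-level sketch of the iterative composition loop, mirroring Algorithm 1 only in spirit. It reuses `retrieve` from the dual-tower sketch above; `with_asset` is a hypothetical placement step, and all names are illustrative.

```python
def compose_scene(object_requests, encoders, scene_graph, gallery_embs, k=1):
    """Iterative scene composition sketch: retrieve and place one object at a time."""
    query_encoder, layout_encoder = encoders
    placed = []
    for request in object_requests:
        layout = layout_encoder(scene_graph)        # re-encode current layout (ESSGNN)
        idx = retrieve(query_encoder, (request, layout), gallery_embs, k=k)
        scene_graph = scene_graph.with_asset(idx)   # hypothetical: add the placed asset
        placed.append(idx)                          # next retrieval sees the new layout
    return placed
```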

Loss & Training

A two-stage training strategy is employed:

Stage 1 (Cross-modal Alignment Pretraining): Trained on the Objaverse-LVIS dataset (48K 3D assets) using contrastive learning to align the multimodal embedding space:

\[
\mathcal{L}_{pre} = -\log \frac{\exp(\text{sim}(f_{query}(Q), f_{gallery}(A))/\tau)}{\sum_{A'} \exp(\text{sim}(f_{query}(Q), f_{gallery}(A'))/\tau)}
\]

Stage 2 (Layout-Aware Fine-tuning): Fine-tuned on the ProcTHOR-10K dataset, incorporating the ESSGNN encoder and a bidirectional contrastive loss:

\[
\mathcal{L}_{layout} = \frac{1}{2}\left(\mathcal{L}_{layout}^{q2g} + \mathcal{L}_{layout}^{g2q}\right)
\]

The gallery encoder is frozen; only the query-side fusion layers and the ESSGNN module are updated. A 30% scene dropout is applied to preserve generalization to layout-free inputs.
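As a reference point, the bidirectional loss can be written as a symmetric InfoNCE over a batch of matched query/gallery pairs; the Stage 1 loss \(\mathcal{L}_{pre}\) is just the query-to-gallery direction alone. A minimal sketch (the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def bidirectional_info_nce(q, g, tau=0.07):
    """Symmetric contrastive loss: mean of query->gallery and gallery->query InfoNCE.

    q, g: (B, d) batches of matched query/gallery embeddings (row i matches row i).
    """
    q = F.normalize(q, dim=-1)
    g = F.normalize(g, dim=-1)
    logits = q @ g.t() / tau                         # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    loss_q2g = F.cross_entropy(logits, targets)      # L^{q2g}
    loss_g2q = F.cross_entropy(logits.t(), targets)  # L^{g2q}
    return 0.5 * (loss_q2g + loss_g2q)
```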

Key Experimental Results

Object-Level Retrieval Performance (Objaverse-LVIS, R@1/R@5 %)

| Method | Text Only | Image Only | PC Only | T+I | T+PC | I+PC | T+I+PC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ULIP | 0.1/0.9 | 0.1/1.3 | 97.9/99.4 | 0.0/0.3 | 33.9/58.0 | 22.6/41.6 | 6.4/15.9 |
| OpenShape | 0.6/1.7 | 0.3/1.1 | 98.4/99.7 | 0.0/0.5 | 35.1/61.4 | 25.0/44.3 | 7.0/17.2 |
| OmniBind (Full) | 5.3/11.7 | 2.3/3.5 | 99.0/99.7 | 0.5/1.2 | 37.5/60.8 | 27.5/46.4 | 11.9/23.4 |
| MetaFind w/o ESSGNN | 13.8/23.1 | 11.7/19.2 | 75.1/78.0 | 17.2/21.8 | 44.5/71.3 | 45.8/73.1 | 51.7/76.5 |
| MetaFind w/ ESSGNN | 11.3/21.5 | 10.5/15.9 | 63.2/66.5 | 15.9/20.3 | 41.2/68.8 | 42.0/70.4 | 48.2/74.9 |

Scene-Level Quality Evaluation (1–5 scale)

| Method | Aesthetics (GPT/Human) | Color & Material (GPT/Human) | Scene Consistency (GPT/Human) | Geometric Realism (GPT/Human) |
| --- | --- | --- | --- | --- |
| ULIP | 2.91/3.02 | 2.84/2.97 | 2.76/2.89 | 2.70/2.81 |
| OpenShape | 3.14/3.28 | 3.08/3.19 | 3.01/3.11 | 2.95/3.06 |
| MetaFind w/o ESSGNN | 3.42/3.55 | 3.31/3.41 | 3.26/3.33 | 3.22/3.30 |
| MetaFind w/ ESSGNN | 4.13/4.25 | 4.04/4.17 | 4.10/4.21 | 4.06/4.18 |

Ablation Study (Text Only)

| Variant | R@1 (%) | Aesthetics (GPT) | Scene Consistency (GPT) |
| --- | --- | --- | --- |
| Full (bidirectional + iterative + ESSGNN) | 11.4 | 4.1 | 4.2 |
| w/o layout context | 13.5 | 3.4 | 3.3 |
| w/ GAT replacing ESSGNN | 11.0 | 3.4 | 3.7 |
| Fusion = Mean | 9.4 | 3.2 | 3.5 |
| Modality Dropout = 50% | 13.2 | 3.1 | 3.2 |
| Zero-padding for missing modalities | 10.5 | 3.1 | 3.1 |

Key Findings

  1. ESSGNN substantially improves scene quality: Although incorporating ESSGNN causes a slight decrease in object-level R@1 (due to feature attribution shift), scene-level scores improve by approximately 0.7–0.9 points (out of 5), demonstrating the value of spatial context encoding.
  2. GAT is sensitive to coordinate normalization: Standard GAT lacks translation invariance and produces unstable embeddings under large-scale or unnormalized coordinate systems; the SE(3) equivariance of ESSGNN effectively addresses this issue.
  3. 30% modality dropout is optimal: Values below this threshold lead to overfitting on full-modality inputs, while higher values introduce training instability.
  4. Iterative retrieval outperforms one-shot retrieval: Placing objects one at a time and updating the scene graph significantly improves spatial coherence.

Highlights & Insights

  • This work is the first to introduce scene-aware layout encoding into 3D asset retrieval, advancing the paradigm from object-level to scene-level retrieval.
  • The SE(3)-equivariant design of ESSGNN elegantly resolves coordinate system inconsistencies encountered in open-world environments.
  • The dual edge design (physical relations plus LLM-generated functional relations) enriches the expressive power of the graph.
  • The iterative scene composition strategy allows retrieval results to dynamically adapt as the scene evolves, analogous to how humans furnish a room incrementally.

Limitations & Future Work

  • Asset descriptions rely on GPT-4o generation, which may introduce linguistic bias and hallucinations.
  • Evaluation is currently limited to indoor single-room scenes; generalization to open-world settings remains unverified.
  • The object-level accuracy degradation introduced by ESSGNN is not fully resolved (the authors suggest maintaining dual fusion heads).
  • The computational overhead of iterative retrieval grows linearly with the number of objects in the scene.
  • End-to-end comparisons with dedicated scene generation methods (e.g., LayoutGPT, CTRL-Room) are absent.

Related Work & Notes

  • The equivariant graph neural network design is inspired by EGNN (Satorras et al., 2021), originally demonstrated on molecular tasks, and is extended here to incorporate semantic edge features.
  • The I-Design scene generation pipeline (Celen et al., 2024) is integrated for downstream evaluation.
  • Insight: In 3D scene understanding, jointly modeling spatial and semantic relationships is key to retrieval quality, and equivariance ensures representations generalize robustly across coordinate frames.

Rating

  • Novelty: ⭐⭐⭐⭐ (ESSGNN and the scene-aware retrieval framework are genuinely novel)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Comprehensive multi-dimensional evaluation with thorough ablation studies)
  • Writing Quality: ⭐⭐⭐⭐ (Clear structure, though some details are overly verbose)
  • Value: ⭐⭐⭐⭐ (Advances 3D retrieval from object-level to scene-level)