Learning 3D Scene Analogies with Neural Contextual Scene Maps

Conference: ICCV 2025 | arXiv: 2503.15897 | Code: https://82magnolia.github.io/3d_scene_analogies/ | Area: 3D Vision | Keywords: 3D scene analogy, scene mapping, descriptor field, contrastive learning, coarse-to-fine

TL;DR

This paper introduces the 3D scene analogy task and proposes neural contextual scene maps to establish dense 3D mappings between scene regions sharing similar semantic context, enabling downstream applications such as trajectory transfer and object placement transfer.

Background & Motivation

Understanding 3D scene context is critical for robots to execute tasks and transfer knowledge across environments. Existing methods focus primarily on point-level or object-level representations, making it difficult to capture holistic relationships between scene regions. Humans rely on analogical reasoning to transfer experience from known scenes to new environments, yet enabling machines to perform such mappings remains highly non-trivial.

The core challenges are threefold: (1) the mapping must be smooth and spatially consistent, attending not only to object positions but also to their surrounding context; (2) dense scene-level ground-truth annotations are unavailable; (3) the method must be robust to appearance variation. Conventional feature matching approaches (e.g., DINOv2 keypoints) are computationally expensive and fail to capture fine-grained semantic relationships, while scene graph matching methods reduce objects to sparse nodes, losing geometric granularity.

The 3D scene analogy task proposed in this paper requires finding dense mappings between contextually corresponding regions across scenes—covering not only points near object surfaces but also open-space regions—a regime that existing methods cannot address.

Method

Overall Architecture

Given a pair of 3D scenes, the method selects a region of interest (RoI) from the target scene, identifies a corresponding region with similar context in the reference scene, and establishes a dense mapping \(F(\cdot): \text{conv}(S_{\text{tgt}}) \rightarrow \text{conv}(S_{\text{ref}})\). The pipeline consists of three stages: (1) constructing scene representations from sparse keypoints; (2) building contextual descriptor fields; (3) coarse-to-fine mapping estimation.
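To make the decomposition concrete, below is a minimal runnable sketch of the map's structure in Python; the class name, shapes, and field layout are illustrative assumptions rather than the paper's API (the affine part comes from the coarse stage and the displacement from the fine stage, both described under Key Designs).

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

@dataclass
class SceneMap:
    """Hypothetical container for F(x) = A x + b + d_w(x), mapping conv(S_tgt) -> conv(S_ref)."""
    A: np.ndarray                                     # (3, 3) affine part (coarse stage)
    b: np.ndarray                                     # (3,)   translation (coarse stage)
    displacement: Callable[[np.ndarray], np.ndarray]  # (M, 3) -> (M, 3) local offsets (fine stage)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x: (M, 3) points inside the target RoI's convex hull.
        return x @ self.A.T + self.b + self.displacement(x)
```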

Key Designs

  1. Context Descriptor Fields: For an arbitrary query point \(\mathbf{q}\) in the scene, keypoints within radius \(r\) are aggregated and encoded via a Transformer encoder. Each point's token is formed by concatenating a distance embedding \(d_\theta(\|\mathbf{q}-\mathbf{p}\|_2)\) and a semantic embedding \(s_\phi(\texttt{label}(\mathbf{p}))\). The output of a learnable [CLS] token serves as the final feature vector (dimension \(d=256\)). The descriptor field can distinguish fine-grained contextual differences—for instance, producing high similarity peaks only at chair armrests adjacent to table corners.

  2. Contrastive Training: Descriptor fields are trained on procedurally generated scene triplets. Positive pairs are created by swapping objects for others with the same semantic label (pose preserved); negatives are generated by adding pose noise. Training uses the InfoNCE loss:

\[\mathcal{L} = \sum_{\mathbf{q},\mathbf{q}^+} -\log \frac{\exp\left(D_\Phi(\mathbf{q};S,r)^\top D_\Phi(\mathbf{q}^+;S^+,r)/\tau\right)}{\sum_{\tilde{S}\in\mathcal{S}} \exp\left(D_\Phi(\mathbf{q};S,r)^\top D_\Phi(\mathbf{q}^+;\tilde{S},r)/\tau\right)}\]

where the temperature is \(\tau=0.2\). Because the triplets are generated procedurally, training requires no dense correspondence annotations. A minimal sketch of the descriptor encoder and this objective follows.
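The sketch below renders the descriptor field and loss in PyTorch under stated assumptions: the layer widths, head count, radius-query interface, and the split of the 256-d token between distance and semantic embeddings are all illustrative guesses, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptorField(nn.Module):
    """Minimal sketch of D_Phi(q; S, r): encode keypoints near q into a 256-d [CLS] feature."""

    def __init__(self, num_labels: int, d: int = 256, n_layers: int = 4):
        super().__init__()
        # Distance and semantic embeddings are concatenated into one token per keypoint;
        # the d//2 + d//2 split is an assumption.
        self.dist_embed = nn.Sequential(nn.Linear(1, d), nn.ReLU(), nn.Linear(d, d // 2))
        self.sem_embed = nn.Embedding(num_labels, d // 2)
        self.cls = nn.Parameter(torch.zeros(1, 1, d))  # learnable [CLS] token
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, query, points, labels):
        # query: (B, 3); points: (B, N, 3) keypoints within radius r of each query;
        # labels: (B, N) integer semantic labels of those keypoints.
        dist = (points - query[:, None]).norm(dim=-1, keepdim=True)           # (B, N, 1)
        tok = torch.cat([self.dist_embed(dist), self.sem_embed(labels)], -1)  # (B, N, d)
        tok = torch.cat([self.cls.expand(tok.size(0), -1, -1), tok], dim=1)   # prepend [CLS]
        return F.normalize(self.encoder(tok)[:, 0], dim=-1)                   # (B, d)

def info_nce(desc_q, desc_pos, desc_negs, tau: float = 0.2):
    # desc_q: (B, d) descriptors of q in S; desc_pos: (B, d) descriptors of q+ in S+;
    # desc_negs: (B, K, d) descriptors of q+ in the pose-perturbed negative scenes.
    pos = (desc_q * desc_pos).sum(-1, keepdim=True) / tau      # (B, 1)
    neg = torch.einsum('bd,bkd->bk', desc_q, desc_negs) / tau  # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)  # -log softmax with the positive at index 0
```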

  3. Coarse-to-Fine Mapping Estimation: Scene mapping is decomposed into an affine transformation plus a local displacement: \(F(\mathbf{x}) := \mathbf{A}\mathbf{x} + \mathbf{b} + d_w(\mathbf{x};P_{\text{RoI}})\).

    • Coarse Estimation: A pool of candidate affine mappings is generated from combinations of object pairs (\(N_{\text{ortho}}=16\) rotations/reflections); the \(K_{\text{coarse}}=5\) best mappings are selected by descriptor field alignment cost and refined via gradient optimization.
    • Fine Estimation: Local displacement weights \(w_k\) are learned via Thin Plate Spline radial basis functions, with a regularization term (\(\lambda=0.5\)) encouraging smoothness; see the sketch after this list.
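A small NumPy sketch of the fine-stage displacement follows, under two assumptions flagged in the comments: the 3D polyharmonic kernel \(\varphi(r)=r\) stands in for the unspecified TPS basis, and a simple quadratic penalty stands in for the \(\lambda\)-weighted smoothness term (the weights themselves are fit by the gradient-based refinement described in the next section).

```python
import numpy as np

def tps_displacement(x, ctrl, w):
    """Evaluate d_w(x) = sum_k w_k * phi(||x - c_k||) over RoI control points.

    x: (M, 3) query points; ctrl: (K, 3) control points from the RoI;
    w: (K, 3) displacement weights. phi(r) = r (the 3D polyharmonic
    kernel) is an assumption, not the paper's stated choice.
    """
    r = np.linalg.norm(x[:, None] - ctrl[None], axis=-1)  # (M, K) pairwise distances
    return r @ w                                          # (M, 3) local offsets

def smoothness_penalty(w, lam=0.5):
    # Quadratic stand-in for the lambda-weighted smoothness regularizer.
    return lam * float(np.sum(w ** 2))
```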

Loss & Training

  • Training uses only the InfoNCE contrastive objective (the loss given above).
  • The mapping estimation stage employs gradient-based optimization on descriptor field alignment cost (Adam, lr=1e-3).
  • 10,000 training triplets are generated from 3D-FRONT; 4,498 triplets from ARKitScenes.
  • If the minimum alignment cost exceeds \(\rho_{\text{valid}}=1.5\), the pair is flagged as unmappable; a sketch of this refinement-and-rejection loop follows.
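Below is a hedged sketch of the refinement stage, assuming the alignment cost is exposed as a differentiable function of the map parameters; the step count and the exact parameterization are illustrative guesses.

```python
import torch

def refine_mapping(params, alignment_cost, steps: int = 200, rho_valid: float = 1.5):
    """Refine map parameters with Adam (lr=1e-3) and reject pairs above rho_valid.

    params: list of tensors with requires_grad=True (e.g., affine A, b, TPS weights w).
    alignment_cost: callable returning a scalar descriptor-field alignment cost.
    """
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        cost = alignment_cost(params)
        cost.backward()
        opt.step()
    final = alignment_cost(params).item()
    return (params if final <= rho_valid else None), final  # None flags an unmappable pair
```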

Key Experimental Results

Main Results

| Method | PCP@0.25 | PCP@0.50 | Bi-PCP@0.25 | Bi-PCP@0.50 | Chamfer@0.15 | Chamfer@0.20 |
| --- | --- | --- | --- | --- | --- | --- |
| Scene Graph Matching | 0.26 | 0.42 | 0.29 | 0.47 | 0.32 | 0.48 |
| Multi-view Semantic Corresp. | 0.10 | 0.20 | 0.14 | 0.21 | 0.62 | 0.86 |
| Visual Feature Field | 0.50 | 0.66 | 0.52 | 0.61 | 0.81 | 0.86 |
| 3D Point Feature Field | 0.56 | 0.71 | 0.60 | 0.68 | 0.86 | 0.89 |
| Ours | 0.76 | 0.90 | 0.92 | 0.94 | 0.97 | 0.99 |

Comparison on procedurally generated 3D-FRONT scene pairs; the proposed method outperforms all baselines on every metric.
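For orientation, PCP@\(\delta\) here presumably measures the fraction of mapped points landing within \(\delta\) of their ground-truth correspondences, with Bi-PCP averaging both mapping directions; that reading is an assumption, so the sketch below is illustrative only.

```python
import numpy as np

def pcp(mapped, gt, delta):
    # mapped, gt: (N, 3) predicted vs. ground-truth corresponding points.
    return float(np.mean(np.linalg.norm(mapped - gt, axis=-1) <= delta))
```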

Ablation Study

| Method | Bi-PCP@0.25 | Bi-PCP@0.50 | Chamfer@0.15 | Chamfer@0.20 |
| --- | --- | --- | --- | --- |
| Ours w/ CLIP Emb. | 0.77 | 0.81 | 0.91 | 0.97 |
| Ours w/ Sentence Emb. | 0.78 | 0.82 | 0.92 | 0.97 |
| Ours w/o Local Displacement | 0.83 | 0.89 | 0.77 | 0.85 |
| Ours (full) | 0.90 | 0.92 | 0.94 | 0.96 |

Averaged results on 3D-FRONT manual and procedural scene pairs. Removing local displacement leads to a notable performance drop, validating the necessity of the coarse-to-fine strategy.

Key Findings

  • Descriptor fields can effectively incorporate foundation model features (CLIP/sentence embeddings) and operate without explicit semantic labels.
  • The method is robust across Sim2Real and Real2Sim transfers; descriptor fields trained solely on 3D-FRONT generalize to cross-domain mappings.
  • Runtime for affine mapping and local displacement estimation is only 0.67s and 0.57s, respectively.

Highlights & Insights

  • Novel Task Definition: Contextual understanding between scenes is formalized as a dense mapping problem, representing a natural extension of instance-level matching.
  • Lightweight Input: Only 50 sparse keypoints per object plus semantic labels are required, conferring robustness to noisy inputs.
  • Open-Space Reasoning: Correspondences are established not only on object surfaces but also in empty spatial regions—critical for applications such as trajectory transfer.
  • Strong Practicality: Long-trajectory transfer is combined with A* path planning for collision avoidance; object placement transfer supports AR/VR collaboration scenarios.

Limitations & Future Work

  • The current method outputs a single mapping and cannot handle multimodal or symmetric cases (e.g., multiple valid mappings for four chairs around a table).
  • Affine initialization may fail when object positions are swapped (e.g., toilet and bathtub exchanging locations).
  • Training still requires semantic/instance labels and object poses for generating contrastive learning samples.
  • The correctness criterion is defined based on semantic and local geometric similarity; different downstream tasks may require alternative evaluation standards.
  • Looking beyond this paper, the step from instance-level semantic correspondence (2D/3D matching) to scene-level dense mapping opens a broad direction for follow-up work.
  • The contrastive training strategy removes the need for dense annotations and could be adopted in other scene understanding tasks.
  • The negative-sample generation strategy is complementary to scene synthesis methods such as LEGO-Net.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (novel task definition with an effective solution)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (synthetic and real scenes, complete ablations, but lacking user studies)
  • Writing Quality: ⭐⭐⭐⭐⭐ (clear structure, high-quality figures)
  • Value: ⭐⭐⭐⭐ (clear application scenarios in robotics, AR/VR)