Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models
Conference: ICCV 2025 arXiv: 2503.19914 Area: 3D Vision Keywords: Object spatial relationships, diffusion models, 3D scene layout, score-based models, multi-object scene generation Authors: Sangwon Baik, Hyeonwoo Kim, Hanbyul Joo (Seoul National University & RLWRLD)
TL;DR
This paper proposes learning 3D object-object spatial relationships (OOR) from synthetic images generated by pre-trained 2D diffusion models. A 3D lifting pipeline is introduced to construct a paired dataset, upon which a text-conditioned score-based diffusion model is trained to model the distribution of relative poses and scales between object pairs. The framework is further extended to multi-object scene layout generation and scene editing.
Background & Motivation
Objects in real-world scenes exhibit specific spatial and functional placement patterns. Chairs are arranged around tables, cups rest on tables rather than chairs, and pizza cutters slice at particular angles — these intuitive yet diverse relationships are defined as Object-Object Relationships (OOR), describing the relative pose and scale between pairs of objects. Understanding and generating such natural layouts is critical for applications in content creation, VR/AR, and robotic manipulation.
Limitations of Prior Work:
- Manual annotation / controlled capture: OOR diversity is extremely high, with combinatorially explosive category pairs, making human annotation prohibitively costly.
- Indoor 3D datasets (ScanNet, 3D-FRONT, HyperSim, etc.): Cover only a limited set of predefined categories and cannot generalize to open-vocabulary settings.
- Real Internet images: Cluttered scenes make it difficult to extract precise 3D spatial relationships from 2D images.
- LLM-based methods (SceneTeller, SMC): Lack direct access to real 3D data and cannot provide fine-grained spatial control.
Core Insight: Images generated by 2D diffusion models naturally embed plausible object spatial relationship cues — the tilt angle of a teapot pouring tea, the pose of a knife cutting an apple — all of which encode rich 3D priors. This property can be leveraged to efficiently construct a diverse 3D OOR dataset.
Method
Overall Architecture (Three Stages)
- OOR Formalization: Define the representation space of relative pose and scale for object pairs.
- 3D OOR Dataset Generation: Construct data from synthetic 2D images via a 3D lifting pipeline.
- OOR Diffusion Model: Train a score-based diffusion model to learn the OOR distribution.
3.1 OOR Formal Representation
One object in each pair is designated as the base object and the other as the target. An OOR sample is defined as:
- R ∈ SO(3): Rotation of the target relative to the base.
- t ∈ ℝ³: Relative translation.
- s_target ∈ ℝ³₊: Per-axis (anisotropic) scale of the target object, recovering its true aspect ratio from the scale-normalized canonical space.
- s_base ∈ ℝ³₊: Scaling factor of the base object.
Each object instance is defined in its own canonical space (bounding box center at origin, y-axis up, z-axis forward). A scale-normalized canonical space (3D bounding box normalized to a unit cube) is introduced to handle intra-class variation in aspect ratios.
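As a minimal sketch of this parameterization (all names are hypothetical, and the "rest on top" placement convention is an illustrative assumption, not the paper's), an OOR sample can be applied to a unit-cube-normalized target bounding box as follows:

```python
import numpy as np

def oor_to_world(R, t, s_target, corners_norm):
    """Apply an OOR sample to the target's unit-cube-normalized bbox corners.

    R and t place the target in the base object's canonical frame; s_target
    undoes the unit-cube normalization (names are illustrative).
    """
    return (corners_norm * s_target) @ R.T + t

# The 8 bbox corners in the scale-normalized (unit-cube) canonical space.
corners = np.array([[x, y, z] for x in (-0.5, 0.5)
                              for y in (-0.5, 0.5)
                              for z in (-0.5, 0.5)])

R = np.eye(3)                                  # no relative rotation
s_target = np.array([0.2, 0.3, 0.2])           # e.g. a cup
s_base = np.array([1.5, 0.8, 0.9])             # e.g. a table
# Rest the cup on the table top: half the table height + half the cup height.
t = np.array([0.0, s_base[1] / 2 + s_target[1] / 2, 0.0])

world = oor_to_world(R, t, s_target, corners)
print(world.min(axis=0))   # cup bottom sits at y = 0.4, the table's top face
```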
3.2 3D OOR Data Generation Pipeline
This is the most engineering-intensive and critical component of the method, addressing the scarcity of 3D OOR data.
Step 1: High-Quality 2D OOR Image Synthesis
- FLUX.1-dev text-to-image model is used to generate images containing OOR cues.
- Prompting strategy: appending "white background" to ensure complete object visibility; incorporating shape and texture descriptions to align with template meshes; adjusting viewpoints to handle category pairs with large scale differences (e.g., table–teacup).
- An image-to-video model (SV3D) is additionally used for diversity augmentation, treating each frame as an independent 2D sample.
Step 2: Pseudo Multi-View Generation and SfM
- SV3D generates orbital multi-view images.
- VGGSfM reconstructs 3D point clouds; samples with failed reconstruction are discarded.
- Output: 3D point clouds with 2D keypoint correspondences.
Step 3: Mesh Registration for Pose and Scale Extraction
- Grounding DINO detects the two objects and SAM2 propagates their masks across video frames, separating the base and target point clouds.
- Semantic feature extraction: 768-dimensional semantic features are extracted from 2D views, reduced to 15 dimensions via PCA, and aggregated per 3D point by averaging.
- Cosine similarity establishes correspondences between template meshes and point clouds.
- Procrustes analysis + RANSAC estimates rigid-body transformations; ICP refines the result.
- Multiple candidate template meshes are evaluated, with the best match selected via DINO features.
- Unreliable samples are automatically filtered out.
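The Procrustes step above can be sketched with the closed-form Umeyama similarity alignment on point correspondences; RANSAC and the ICP refinement are omitted here, and the paper's exact estimator may differ:

```python
import numpy as np

def umeyama(src, dst):
    """Similarity transform (s, R, t) aligning src -> dst (Umeyama, 1991).

    A simplified stand-in for the Procrustes registration step; outlier
    rejection (RANSAC) and ICP refinement are left out for brevity.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    X, Y = src - mu_s, dst - mu_d
    cov = Y.T @ X / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                     # enforce a proper rotation
    R = U @ S @ Vt
    var_s = (X ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

# Recover a known transform from noiseless correspondences.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
dst = 2.0 * src @ R_true.T + np.array([1.0, -0.5, 0.2])
s, R, t = umeyama(src, dst)
print(round(s, 3))   # recovers the scale factor 2.0
```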
3.3 OOR Diffusion Model
The framework is built upon the score-based model of GenPose.
Training: Model \(\Psi_\theta\) learns the noise score function of the OOR distribution. Conditioning inputs include text context \(c\), base category \(B\), and target category \(T\), all encoded by a pre-trained T5 encoder. The model is trained using the Denoising Score Matching (DSM) objective.
Inference: Starting from pure Gaussian noise, OOR samples are generated via the reverse process of the Probability Flow ODE.
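A toy sketch of the DSM objective and PF-ODE sampling, with an analytic 1-D Gaussian score standing in for the learned network \(\Psi_\theta\) (the variance-exploding noise schedule and scalar setting are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(t):                       # VE noise schedule (illustrative choice)
    return t

def dsm_loss(score_fn, x0, t):
    """Denoising score matching: the target score of the perturbation
    kernel N(x_t; x0, sigma(t)^2) is -(x_t - x0) / sigma(t)^2."""
    eps = rng.normal(size=x0.shape)
    xt = x0 + sigma(t) * eps
    target = -(xt - x0) / sigma(t) ** 2
    return np.mean((score_fn(xt, t) - target) ** 2)

# Stand-in for the trained model: the exact score of data ~ N(0, 1)
# perturbed by sigma(t), i.e. x_t ~ N(0, 1 + sigma(t)^2).
def oracle_score(x, t):
    return -x / (1.0 + sigma(t) ** 2)

def pf_ode_sample(score_fn, n, t_max=10.0, steps=1000):
    """Euler integration of the probability-flow ODE
    dx/dt = -0.5 * d(sigma^2)/dt * score(x, t), from t_max down to ~0."""
    x = rng.normal(size=n) * np.sqrt(1 + sigma(t_max) ** 2)
    ts = np.linspace(t_max, 1e-3, steps)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        dsig2 = sigma(t1) ** 2 - sigma(t0) ** 2   # ~ d(sigma^2), negative
        x = x - 0.5 * dsig2 * score_fn(x, t0)
    return x

x0 = rng.normal(size=1000)
loss = dsm_loss(oracle_score, x0, t=0.5)   # irreducible DSM variance
                                           # remains even at the optimum
samples = pf_ode_sample(oracle_score, 20000)
print(round(float(samples.std()), 2))      # close to 1.0, the data std
```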
Text Context Augmentation (LLM-driven):
- Paraphrase diversification: Varying verbs and sentence structures while preserving semantics (e.g., "pouring tea" → "filling the cup with tea").
- Category substitution: Objects with similar shapes share OOR distributions (e.g., "teapot → kettle," "teacup → coffee cup").
- Final coverage: 475 contexts, 188 object categories, 23,750 text prompts.
3.4 Extension to Multi-Object OOR
The scene is represented as a connected DAG (directed acyclic graph) with a single root node. Each node corresponds to an object, and each edge corresponds to a pairwise OOR.
Two Key Challenges and Solutions:
- Collision: Non-adjacent objects may overlap → a collision loss \(C(\Phi)\) penalizes AABB overlaps.
- Inconsistency: The same object may have its pose determined via multiple graph paths (e.g., a keyboard inferred from both the monitor and the mouse) → an inconsistency loss \(I(\Phi)\) minimizes the variance of OOR estimates across paths.
The reverse Probability Flow ODE is modified by augmenting the learned score with gradients of the two losses, replacing \(\Psi_\theta(\Phi, t)\) with \(\Psi_\theta(\Phi, t) - \lambda_1 \nabla_\Phi C(\Phi) - \lambda_2 \nabla_\Phi I(\Phi)\), with weights \(\lambda_1 = \min(100/t,\, 10^4)\) and \(\lambda_2 = \min(100/t^2,\, 10^5)\), applied from \(t = 0.5\) onward.
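The AABB overlap penalty can be sketched as follows (a simplified stand-in using overlap volume; the paper's exact form of \(C(\Phi)\) may differ):

```python
import numpy as np

def aabb_overlap_volume(c1, h1, c2, h2):
    """Overlap volume of two AABBs given centers c and half-extents h."""
    overlap = np.minimum(c1 + h1, c2 + h2) - np.maximum(c1 - h1, c2 - h2)
    return float(np.prod(np.clip(overlap, 0.0, None)))

def collision_loss(centers, half_extents, non_adjacent_pairs):
    """Sum of pairwise AABB overlap volumes over non-adjacent object pairs,
    a simplified stand-in for the collision term C(Phi) described above."""
    return sum(aabb_overlap_volume(centers[i], half_extents[i],
                                   centers[j], half_extents[j])
               for i, j in non_adjacent_pairs)

centers = np.array([[0.0, 0, 0], [0.5, 0, 0], [3.0, 0, 0]])
halves  = np.array([[0.5, 0.5, 0.5]] * 3)
print(collision_loss(centers, halves, [(0, 1), (0, 2), (1, 2)]))
# boxes 0 and 1 overlap by 0.5 * 1 * 1 = 0.5; box 2 is clear of both
```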
Key Experimental Results
Pairwise OOR Generation (150 scenes, 30 category pairs, 92 user study participants)
| Metric | SMC | SceneTeller | Ours |
|---|---|---|---|
| CLIP Score ↑ | 28.54 | 29.06 | 29.11 |
| VQA Score ↑ | 0.61 | 0.68 | 0.69 |
| VLM Score ↑ | 49.83 | 64.67 | 75.67 |
| User Study (%) ↑ | 22.21 | 23.77 | 54.02 |
- SMC produces reasonable translations but frequently misestimates rotation entirely.
- SceneTeller benefits from LLM in-context learning for approximate positional reasoning, but lacks grounding in fine-grained 3D data.
- The proposed method is particularly strong on functional relationships (e.g., "pouring tea," "cutting").
Multi-Object OOR Generation (20 scenes, 3–5 objects, 81 user study participants)
| Metric | GraphDreamer | Ours |
|---|---|---|
| VLM Score ↑ | 2.50 | 97.50 |
| User Study (%) ↑ | 11.88 | 88.12 |
GraphDreamer frequently fails to capture OOR (e.g., "knife cutting apple") and even loses objects (e.g., mouse, salt shaker). The proposed method stably generates multi-object scenes by composing pairwise OOR knowledge.
Application Validation
3D Scene Editing: Score function gradients drive optimization (50 steps, \(\eta = 0.01\), \(\lambda_1 = 0.01\)):
- Denoising noisy scenes into plausible layouts.
- Switching scene semantics (e.g., teapot "placed beside cup" → "pouring into cup").
- Adding new objects to existing scenes with newly specified relationships.
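The editing loop above can be sketched as plain gradient ascent on the log-density via the score function, here with an analytic Gaussian score standing in for the learned model (the target translation and scale are illustrative assumptions):

```python
import numpy as np

# Illustrative stand-in for the learned OOR score: the score of a
# Gaussian centered on a "plausible" relative translation target_t.
target_t = np.array([0.0, 0.55, 0.0])    # e.g. a cup resting on a table

def score(x, sigma=0.2):
    return -(x - target_t) / sigma ** 2

def edit_layout(x0, eta=0.01, steps=50):
    """Gradient ascent on log-density via the score function, mirroring
    the 50-step, eta = 0.01 editing optimization described above."""
    x = x0.copy()
    for _ in range(steps):
        x = x + eta * score(x)           # step toward higher density
    return x

noisy = np.array([0.3, 0.9, -0.2])       # perturbed initial placement
edited = edit_layout(noisy)
print(np.round(edited, 3))               # pulled back to a plausible layout
```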
Human Motion Synthesis: Combining the VPoser body pose prior with contact constraints, the method generates coherent motion sequences from an initial human–object interaction state (e.g., a person grasping a teapot to pour tea into a cup). OOR sequences are produced through optimization; contact constraints ensure that the distances of initial human–object contact pairs remain fixed throughout the sequence.
Highlights & Insights
- Novel Task Definition: The first formal definition of OOR as a concept and parameter space, filling a gap in the formalization of 3D relational modeling.
- Leveraging Implicit 3D Knowledge in 2D Diffusion Models: Object placement priors are embedded in generated images, requiring no real 3D annotations.
- Score-Based Models Capture Multimodal Distributions: A given OOR context may admit multiple valid configurations (e.g., a teapot pouring from different directions); diffusion models naturally represent this multimodality.
- DAG + Inference-Time Losses for Multi-Object Extension: Pairwise OORs are composed into multi-object scenes purely through inference-time constraints, without retraining.
- Flexible Application of Score Functions: Score gradients directly drive scene editing optimization, demonstrating the natural advantage of score-based models for downstream tasks.
- LLM-Driven Data Augmentation: Simultaneous expansion along semantic and category dimensions, with 475 contexts providing broad coverage.
Limitations & Future Work
- Lifting quality depends on the geometric consistency of SV3D pseudo-multi-view images; inconsistencies lead to elevated registration failure rates.
- Only static spatial relationships are modeled; dynamic processes (e.g., rising liquid level during pouring) are not considered.
- Collision detection relies on AABB approximations, which may misclassify non-convex objects.
- The pipeline is lengthy (text-to-image → video → SfM → segmentation → features → registration → filtering), incurring significant computational overhead.
- The diversity of functional relationships is bounded by the generative capacity of the underlying 2D diffusion model.
- Evaluation metrics (e.g., VLM Score) are subjective; standardized quantitative benchmarks are lacking.
Related Work & Insights
- Object Spatial Relationship Learning: Robot placement tasks and language-conditioned spatial reasoning are largely restricted to predefined categories.
- 2D Diffusion Prior Extraction: CHORUS (human–object interaction) and ComA (human–object relationships) are related, though object pose estimation is substantially more challenging than human body estimation.
- Score-Based Diffusion Models: GenPose (6D pose) and DAViD (dynamic HOI) serve as the direct technical foundations.
- 3D Scene Generation: GraphDreamer (text-to-3D) and SceneTeller (LLM-based layout) both fall short of fine-grained 3D spatial control.
Rating
- Novelty: ★★★★★ — Novel task definition and a creative "2D diffusion → 3D relationships" paradigm.
- Technical Depth: ★★★★☆ — Pipeline design is sound and complete, though it relies heavily on modular composition.
- Experimental Thoroughness: ★★★★☆ — Multi-metric evaluation and user studies are convincing, but ablation analysis of individual module contributions is lacking.
- Practicality: ★★★★☆ — Scene editing and motion synthesis demonstrate strong application potential.
- Overall: 8.5/10