Object-Centric Representation Learning for Enhanced 3D Semantic Scene Graph Prediction¶
Conference: NeurIPS 2025 · arXiv: 2510.04714 · Code: https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes · Area: 3D Vision / Scene Understanding · Keywords: 3D semantic scene graph, object-centric representation, contrastive pre-training, GNN, relation prediction
TL;DR¶
Through empirical analysis, this paper identifies object feature discriminability as the critical bottleneck in 3D scene graph predicate prediction (object misclassification is associated with 92%+ of predicate errors). It proposes an object encoder contrastively pre-trained independently of the scene graph task (3D-2D-text tri-modal alignment), a geometry-regularized relation encoder, and a bidirectional edge-gated GNN, achieving new SOTA on 3DSSG with Object R@1 of 59.53% and Predicate R@50 of 91.40%.
Background & Motivation¶
Background: 3D semantic scene graphs (3D-SSG) represent 3D scenes as directed graphs with nodes (objects) and edges (relations), serving as a key representation for robot navigation and AR/VR interaction. Methods such as SGPN, SGFN, and VL-SAT have progressively advanced performance.
Limitations of Prior Work: (a) Existing methods over-rely on GNNs for relational reasoning while overlooking the insufficient discriminability of object representations; VL-SAT's object embeddings are not discriminative enough, leading to low-confidence predictions and frequent misclassifications. (b) Relation feature encoding relies solely on geometric descriptors (centroid differences, bounding-box differences, etc.), ignoring object semantic features. (c) GNN edge processing is symmetric, whereas real-world relations (e.g., "A standing on B") are directionally asymmetric.
Key Challenge: Object misclassification cascades into predicate prediction errors. An analysis of VL-SAT reveals that only 8% of predicate errors occur when both subject and object are correctly classified; error rates surge once objects are misclassified. Using ground-truth object labels pushes predicate R@50 to around 94%, confirming that the bottleneck lies in object encoding rather than relational reasoning.
Goal: (a) Improve the discriminability of object features to indirectly enhance all downstream metrics; (b) Integrate semantic and geometric information to improve relation encoding; (c) Introduce directionality modeling to capture asymmetric relations.
Key Insight: A probabilistic formalization, \(P(e_{ij}|z_i, z_j) = \sum_{o'_i, o'_j} P(e_{ij}|o'_i, o'_j) P(o'_i|z_i) P(o'_j|z_j)\), shows that the sharper the object posteriors (i.e., the more discriminative the object features), the more accurate the predicate prediction.
Core Idea: Independent contrastive pre-training enables the object encoder to produce highly discriminative embeddings → reduces object classification entropy → automatically improves predicate and triplet prediction through probabilistic propagation.
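To make the propagation argument concrete, here is a toy numeric sketch (illustrative only; the two object classes, two predicates, and the likelihood values are made up, not taken from the paper) showing how a flatter object posterior dilutes the marginalized predicate distribution:

```python
import numpy as np

# Hypothetical predicate likelihoods P(e | o_i, o_j) for two object classes
# (0 = chair, 1 = table) and two predicates (0 = "standing on", 1 = "attached to").
P_e_given_oo = np.array([
    [[0.10, 0.90],   # chair, chair
     [0.90, 0.10]],  # chair, table  ("chair standing on table" is likely)
    [[0.20, 0.80],   # table, chair
     [0.50, 0.50]],  # table, table
])

def predicate_posterior(p_oi, p_oj):
    """P(e | z_i, z_j) = sum_{o_i, o_j} P(e | o_i, o_j) P(o_i | z_i) P(o_j | z_j)."""
    return np.einsum("i,j,ije->e", p_oi, p_oj, P_e_given_oo)

sharp_i, sharp_j = np.array([0.95, 0.05]), np.array([0.05, 0.95])  # low-entropy posteriors
flat_i,  flat_j  = np.array([0.55, 0.45]), np.array([0.45, 0.55])  # high-entropy posteriors

print(predicate_posterior(sharp_i, sharp_j))  # confident: ~[0.84, 0.16]
print(predicate_posterior(flat_i,  flat_j))   # diluted:   ~[0.46, 0.54]
```

With sharp object posteriors the correct predicate clearly dominates; with ambiguous posteriors the predicate distribution drifts toward the competing class, mirroring the paper's finding that object misclassification drives most predicate errors.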
Method¶
Overall Architecture¶
The framework proceeds in two stages: (1) Pre-training stage: contrastive learning trains the object feature encoder via 3D point cloud ↔ multi-view 2D image ↔ CLIP text description alignment, independently of the scene graph task; (2) Scene graph prediction stage: the object encoder is frozen, and the relation feature encoder (object-pair features + geometric descriptors + a Local Spatial Enhancement (LSE) auxiliary task) and a GNN with Global Spatial Enhancement (GSE) and Bidirectional Edge Gating (BEG) are trained jointly.
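A minimal skeleton of the two-stage flow (module shapes and names below are placeholders for illustration, not the released code's API):

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the object encoder, relation encoder, and GNN head.
obj_encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 512))
rel_encoder = nn.Linear(512 * 2 + 11, 256)   # object-pair features + 11-dim geometry
gnn_head    = nn.Linear(256, 26)             # 26 predicate classes on 3DSSG

# Stage 1 (pre-training) would optimize obj_encoder alone with the cross-modal
# contrastive loss, with no scene-graph supervision.

# Stage 2: freeze the pre-trained object encoder; only the relation encoder
# and GNN receive gradients from the scene-graph losses.
obj_encoder.eval()
for p in obj_encoder.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(
    list(rel_encoder.parameters()) + list(gnn_head.parameters()), lr=1e-4)
```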
Key Designs¶
- Discriminative Object Feature Encoder (Contrastive Pre-training):
- Function: Pre-trains an encoder that produces highly discriminative object embeddings, independent of the downstream task.
- Mechanism: The input is the 3D point cloud of an object instance; after a T-Net affine transformation for pose invariance, features \(z^t\) are extracted. Two contrastive objectives are used: (a) visual contrast \(\mathcal{L}^{visual}\), pulling \(z^t\) toward multi-view 2D CLIP features of same-class objects and pushing it away from those of other classes; (b) text contrast \(\mathcal{L}^{text}\), aligning \(z^t\) with the CLIP text feature of "A point cloud of {object}". Training follows a supervised contrastive scheme in which all same-class objects serve as positives (a minimal loss sketch follows after this list).
- Design Motivation: Object encoders in methods such as VL-SAT are jointly trained with scene graph objectives, resulting in insufficiently discriminative object features. Independent pre-training decouples the two objectives and lets the object encoder focus on classification accuracy. Experiments confirm that plugging this pre-trained encoder into existing frameworks (SGFN/VL-SAT) consistently improves all metrics.
- Relation Feature Encoder + LSE:
- Function: Fuses semantic features of object pairs with geometric descriptors to construct relation edge features.
- Mechanism: \(z^e_{ij} = f_{\theta_r}(\text{CAT}(g_{obj}(z^t_i), g_{obj}(z^t_j), g_{geo}(g_{ij})))\), where \(g_{ij} \in \mathbb{R}^{11}\) includes centroid differences, standard-deviation differences, bounding-box differences, the volume ratio, and the longest-edge ratio (a descriptor and encoder sketch follows after this list).
- LSE (Local Spatial Enhancement): An auxiliary task that reconstructs the original geometric descriptors from the relation features (L1 loss), forcing the relation representation to retain geometric information and mitigating the information imbalance between high-dimensional object features and the low-dimensional geometric descriptor.
- Design Motivation: Prior methods such as SGFN and VL-SAT use only geometric descriptors as edge features, ignoring object semantics; SGPN uses the entire scene point cloud, introducing excessive background noise.
- GNN (GSE + BEG):
- GSE (Global Spatial Enhancement): Uses the pairwise Euclidean distance matrix \(D\) between objects as an attention bias, \(\alpha_{ij} = \text{softmax}(\frac{q_i^T k_j}{\sqrt{d_k}} + w_{ij}^{(h)})\) with \(w_{ij}^{(h)} = W^{(h)} D_{ij}\), so that spatially proximate objects receive stronger attention.
- BEG (Bidirectional Edge Gating): Separates each node's edges into outgoing edges (as subject) and incoming edges (as object), aggregates them separately, then concatenates and gates the result. When updating edges, the reverse edge \(z^e_{ji}\) is modulated by a gate scalar \(\beta_{ij} = \text{gate}(z^e_{ij})\), reflecting that information flow for "A standing on B" and "B supporting A" should differ.
- Design Motivation: Standard GNNs treat edges symmetrically, whereas relations in 3D scenes are inherently directional (a sketch of GSE and BEG follows after this list).
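A minimal PyTorch-style sketch of the supervised cross-modal contrastive objective for the object encoder (assumptions: pre-extracted CLIP features, a temperature of 0.07, one modality per call; an illustration of the idea, not the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

def supervised_cross_modal_contrastive(z_3d, z_clip, labels, tau=0.07):
    """Supervised contrastive loss that pulls each 3D object embedding toward
    CLIP features (image or text) of the same class and pushes it away from
    other classes. Shapes: z_3d, z_clip -> (N, d); labels -> (N,)."""
    z_3d = F.normalize(z_3d, dim=-1)
    z_clip = F.normalize(z_clip, dim=-1)
    logits = z_3d @ z_clip.t() / tau                             # (N, N) similarities / temperature
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()   # same-class positives
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average log-probability over all positives for each anchor.
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()

# Toy usage with random features (d = 512, as with CLIP ViT-B/32). The full
# objective applies this form to both multi-view image features (L^visual)
# and text features (L^text).
z_pts = torch.randn(8, 512)        # 3D object embeddings z^t
z_txt = torch.randn(8, 512)        # CLIP features of "A point cloud of {object}"
y = torch.randint(0, 4, (8,))      # object class labels
print(supervised_cross_modal_contrastive(z_pts, z_txt, y))
```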
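A sketch of the 11-dimensional geometric descriptor \(g_{ij}\) and the relation-feature construction with the LSE reconstruction head (the exact ordering of descriptor terms and the MLP sizes are assumptions for illustration):

```python
import torch
import torch.nn as nn

def geometric_descriptor(pts_i, pts_j):
    """11-dim g_ij from two object point clouds of shape (M, 3) and (K, 3):
    centroid difference (3) + per-axis std difference (3) + bbox-extent
    difference (3) + log volume ratio (1) + log longest-edge ratio (1)."""
    c_i, c_j = pts_i.mean(0), pts_j.mean(0)
    s_i, s_j = pts_i.std(0), pts_j.std(0)
    ext_i = pts_i.max(0).values - pts_i.min(0).values
    ext_j = pts_j.max(0).values - pts_j.min(0).values
    vol_ratio = torch.log(ext_i.prod() / ext_j.prod()).unsqueeze(0)
    len_ratio = torch.log(ext_i.max() / ext_j.max()).unsqueeze(0)
    return torch.cat([c_i - c_j, s_i - s_j, ext_i - ext_j, vol_ratio, len_ratio])

class RelationEncoder(nn.Module):
    """z^e_ij = f(CAT(g_obj(z_i), g_obj(z_j), g_geo(g_ij))), plus an LSE head
    that reconstructs g_ij (trained with an L1 loss against the true descriptor)."""
    def __init__(self, d_obj=512, d_geo=11, d_edge=256):
        super().__init__()
        self.g_obj = nn.Linear(d_obj, 128)      # projects each object feature
        self.g_geo = nn.Linear(d_geo, 64)       # projects the geometric descriptor
        self.f_rel = nn.Sequential(nn.Linear(128 + 128 + 64, d_edge), nn.ReLU())
        self.lse_head = nn.Linear(d_edge, d_geo)

    def forward(self, z_i, z_j, g_ij):
        z_e = self.f_rel(torch.cat([self.g_obj(z_i), self.g_obj(z_j),
                                    self.g_geo(g_ij)], dim=-1))
        return z_e, self.lse_head(z_e)
```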
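A compact sketch of the distance-biased attention (GSE) and the bidirectional edge gate (BEG); the scalar bias weight per head and the sigmoid gate parameterization are plausible readings for illustration, not confirmed details:

```python
import torch
import torch.nn as nn

class GSEAttention(nn.Module):
    """Single-head self-attention with a learned bias on pairwise distances,
    so spatially close objects attend to each other more strongly."""
    def __init__(self, d=256):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.w_dist = nn.Parameter(torch.tensor(-1.0))  # bias weight on distance

    def forward(self, x, dist):                      # x: (N, d), dist: (N, N)
        scores = self.q(x) @ self.k(x).t() / x.shape[-1] ** 0.5
        alpha = torch.softmax(scores + self.w_dist * dist, dim=-1)
        return alpha @ self.v(x)

class BidirectionalEdgeGate(nn.Module):
    """Gates the reverse edge feature z_ji by a scalar computed from z_ij, so
    'A standing on B' and 'B supporting A' exchange different information."""
    def __init__(self, d_edge=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(d_edge, 1), nn.Sigmoid())

    def forward(self, z_ij, z_ji):                   # both: (E, d_edge)
        beta = self.gate(z_ij)                       # (E, 1) gate per directed edge
        return torch.cat([z_ij, beta * z_ji], dim=-1)
```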
Loss & Training¶
- Pre-training: \(\mathcal{L}_{pretrain} = 0.001 \mathcal{L}_{reg} + \mathcal{L}_{cross}\) (affine regularization + cross-modal contrastive)
- Scene graph: \(\mathcal{L}_{sg} = \lambda_{obj} \mathcal{L}_{obj} + \lambda_{rel} \mathcal{L}_{rel} + \lambda_{lse} \mathcal{L}_{lse}\)
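For concreteness, the scene-graph objective as a weighted sum (the λ values and the choice of multi-label BCE for predicates are placeholders/assumptions, not the paper's reported settings):

```python
import torch
import torch.nn.functional as F

# Dummy predictions and targets just to make the weighting concrete
# (160 object classes and 26 predicate classes, as on 3DSSG).
obj_logits, obj_gt = torch.randn(4, 160), torch.randint(0, 160, (4,))
rel_logits, rel_gt = torch.randn(6, 26), torch.randint(0, 2, (6, 26)).float()
geo_pred, geo_gt = torch.randn(6, 11), torch.randn(6, 11)

loss_obj = F.cross_entropy(obj_logits, obj_gt)                      # object classification
loss_rel = F.binary_cross_entropy_with_logits(rel_logits, rel_gt)   # multi-label predicates (assumed)
loss_lse = F.l1_loss(geo_pred, geo_gt)                              # LSE geometric reconstruction

lam_obj, lam_rel, lam_lse = 1.0, 1.0, 1.0                           # placeholder weights
loss_sg = lam_obj * loss_obj + lam_rel * loss_rel + lam_lse * loss_lse
print(loss_sg)
```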
Key Experimental Results¶
Main Results (3DSSG, 1553 scenes, 160 object classes, 26 predicate classes)¶
| Method | Object R@1 | Object R@5 | Pred R@1 | Pred R@50 | Triplet R@100 |
|---|---|---|---|---|---|
| SGPN | 49.46 | 73.99 | 86.92 | 85.38 | 88.59 |
| SGFN | 53.36 | 76.88 | 89.00 | 88.59 | 91.14 |
| VL-SAT | 55.93 | 78.06 | 89.81 | 89.35 | 92.20 |
| Ours | 59.53 | 81.20 | 91.27 | 91.40 | 93.80 |
Ablation Study¶
| Configuration | Obj R@1 | Pred R@50 | Triplet R@100 |
|---|---|---|---|
| Baseline (SGFN-style) | 53.36 | 88.59 | 91.14 |
| + Contrastive pre-trained encoder | 59.53 (+6.17) | — | — |
| + LSE | — | +1–2% | — |
| + GSE + BEG | — | +1–2% | — |
| Full model | 59.53 | 91.40 | 93.80 |
Plug-in validation: Inserting the pre-trained encoder into SGFN raises Obj R@1 from 53.36 to ~57%; inserting it into VL-SAT yields a similar gain of 2–3%.
Key Findings¶
- Empirical evidence that object discriminability is the core bottleneck: Object classification entropy \(H(o|z)\) is nearly monotonically correlated with predicate error rate — even when the Top-1 prediction is correct, high-entropy objects still cause more predicate errors.
- 92% of predicate errors are associated with object misclassification: Only 8% of predicate errors occur when both subject and object are correctly identified.
- The independently pre-trained object encoder is plug-and-play: Replacing only the object encoder in existing frameworks consistently improves all metrics without modifying other components.
- LSE auxiliary task is effective: Forcing the relation representation to retain geometric information improves predicate R@50 by ~1%.
- BEG captures directional asymmetry: Gains are especially pronounced for directional predicates such as "standing on" and "hanging from."
Highlights & Insights¶
- The probabilistic argument linking object discriminability to relation prediction is elegant: The formulation \(P(e_{ij}|z_i,z_j) = \sum_{o'_i, o'_j} P(e_{ij}|o'_i,o'_j) P(o'_i|z_i) P(o'_j|z_j)\) formally characterizes the mechanism by which better object embeddings → sharper posteriors → reduced predicate confusion.
- Decoupled pre-training design: Pre-training the object encoder independently of the scene graph objective prevents the two objectives from competing. This principle generalizes — upstream encoders in any pipeline may benefit from independent pre-training followed by freezing.
- Design philosophy of LSE: Rather than relying on naive concatenation alone (where the low-dimensional geometric descriptor risks being overwhelmed by high-dimensional object features), an auxiliary reconstruction task compels the relation encoder to internalize the geometric information.
Limitations & Future Work¶
- Requires multi-view RGB data from 3RScan for 2D–3D alignment, imposing a strong data dependency.
- Restricted to a closed vocabulary of 160 object classes — generalization to open-vocabulary 3D scene graphs remains unexplored.
- Contrastive pre-training requires CLIP feature extraction and multi-view image processing, increasing computational overhead.
- The 3DSSG dataset is relatively small (1553 scenes); generalization to larger-scale scenes is unknown.
Related Work & Insights¶
- vs. VL-SAT: VL-SAT employs vision-language pre-training but jointly optimizes within the scene graph task, yielding less discriminative object features than independent pre-training. The proposed method outperforms VL-SAT by +3.6% on Object R@1.
- vs. SGFN: A basic GNN baseline with no contrastive pre-training and no directionality modeling. The proposed method improves Object R@1 by +6.17%.
- The insight that object encoder discriminability may be a latent bottleneck in predicate prediction applies equally to 2D scene graph generation.
Rating¶
- Novelty: ⭐⭐⭐⭐ Bottleneck diagnosis + probabilistic formalization + multimodal contrastive pre-training + bidirectional gated GNN; component designs are well-motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐ Standard 3DSSG benchmark + multi-component ablation + plug-in validation.
- Writing Quality: ⭐⭐⭐⭐ In-depth problem analysis, clear probabilistic formalization, and intuitive comparisons in Figure 1.
- Value: ⭐⭐⭐⭐ Effective improvement for 3D scene understanding; the insight that object discriminability drives relation prediction is broadly applicable.