Open-Vocabulary Octree-Graph for 3D Scene Understanding

Conference: ICCV 2025 | arXiv: 2411.16253 | Code: GitHub | Area: 3D Vision | Keywords: Open-vocabulary, 3D scene understanding, octree, scene graph, semantic segmentation

TL;DR

This paper proposes Octree-Graph, a novel scene representation that combines adaptive octrees with a graph structure. Through Chronological Group-based Segment Merging (CGSM) and Instance Feature Aggregation (IFA), it produces accurately segmented semantic objects and enables efficient open-vocabulary 3D scene understanding.

Background & Motivation

Open-vocabulary 3D scene understanding is critical for embodied agents. Existing methods leverage pretrained VLMs for object segmentation and project results onto point clouds to construct 3D maps, but suffer from two core issues:

Inefficient spatial representation: Mainstream methods build 3D maps directly on point clouds, which are unordered sets of discrete coordinates that require large storage and lack explicit occupancy information and spatial connectivity, hindering downstream tasks such as path planning and text-based retrieval.

Inaccurate semantic segmentation: Existing methods overlook the inherent imprecision of foundation models in segmentation and feature extraction, leading to degraded 3D object segmentation and semantic quality.

Method

Overall Architecture

Given an RGB-D sequence and camera poses, the pipeline consists of four steps:

  1. Extract 2D segmentation proposals and semantic features using VLMs
  2. Merge segments into instances via CGSM
  3. Aggregate features for each instance via IFA
  4. Construct the Octree-Graph for downstream applications
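A minimal end-to-end sketch of this pipeline (every helper below is a hypothetical stand-in for the stage it names, not the authors' API):

```python
def build_octree_graph(rgbd_frames, camera_poses):
    # Step 1: 2D proposals + per-segment semantic features from pretrained VLMs
    segments = extract_segments_and_features(rgbd_frames)      # hypothetical helper

    # Step 2: merge 2D segments into 3D instances via CGSM (sketched below)
    instances = cgsm(segments, camera_poses)

    # Step 3: fuse each instance's multi-view features via IFA (sketched below)
    for inst in instances:
        inst.feature = ifa_aggregate(inst.view_features, inst.neighbor_means)

    # Step 4: wrap instances into the nodes and edges of the Octree-Graph
    return construct_graph(instances)                          # hypothetical helper
```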

Chronological Group-based Segment Merging (CGSM)

Existing segment merging strategies are either frame-level, which is susceptible to noise, or global-level, which introduces redundant computation. The key designs of CGSM are:

  • Chronological grouping: Frames are divided into groups in temporal order with interval \(I\), preserving spatiotemporal details of adjacent frames while avoiding global interference.
  • Semantics-guided under-segmentation filtering: For under-segmented segments \(\mathcal{S}_m\) that may contain distinct objects, the variance of semantic features among internal segments is computed; those exceeding threshold \(\tau_u\) are filtered out.
  • Dynamic threshold decay: The overall similarity is \(\phi = \phi_{\text{geo}}^{\text{iou}} + \phi_{\text{geo}}^{\text{ior}} + \phi_{\text{sem}}^{v} + \phi_{\text{sem}}^{c}\), and threshold \(\theta_i\) decays linearly to merge partially observed segments with low spatial overlap.
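The merging loop under these designs might look like the sketch below (extract_segments, semantic_variance, and the four similarity helpers are hypothetical stand-ins, and all numeric defaults are illustrative rather than the paper's values):

```python
import numpy as np

def overall_similarity(seg, inst):
    # phi = IoU + IoR (geometric) + visual + caption similarity (semantic);
    # each helper stands in for the corresponding term in the paper.
    return iou(seg, inst) + ior(seg, inst) + sem_vis(seg, inst) + sem_cap(seg, inst)

def cgsm(frames, interval=200, theta_hi=2.0, theta_lo=1.0, tau_u=0.1):
    """Chronological Group-based Segment Merging (illustrative sketch)."""
    groups = [frames[i:i + interval] for i in range(0, len(frames), interval)]
    instances = []
    for gi, group in enumerate(groups):
        # dynamic threshold decay: theta falls linearly over groups so that
        # partially observed segments with low spatial overlap can still merge
        t = gi / max(len(groups) - 1, 1)
        theta = theta_hi - (theta_hi - theta_lo) * t
        for seg in extract_segments(group):
            # semantics-guided filter: drop proposals whose internal feature
            # variance suggests they cover several distinct objects
            if semantic_variance(seg) > tau_u:
                continue
            sims = [overall_similarity(seg, inst) for inst in instances]
            if sims and max(sims) > theta:
                instances[int(np.argmax(sims))].merge(seg)   # fuse into best match
            else:
                instances.append(seg)                        # start a new instance
    return instances
```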

Instance Feature Aggregation (IFA)

Semantic features are fused for each instance, jointly considering representativeness and discriminability:

\[a_{i,j}^{v} = \cos(\mathbf{f}_{i,j}^{v}, \bar{\mathbf{f}}_{i}^{v}) - \sum_{\mathcal{O}_k \in \mathcal{N}_i} \cos(\mathbf{f}_{i,j}^{v}, \bar{\mathbf{f}}_{k}^{v})\]

where the scores \(a_{i,j}^{v}\) are normalized into aggregation weights via softmax. The intuition is that features closer to their own cluster center and farther from neighboring instances receive higher aggregation weights.
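A self-contained NumPy sketch of this aggregation (the shapes and the 1e-8 guard are my own choices; feats holds one instance's per-view features and neighbor_means the mean features \(\bar{\mathbf{f}}_{k}^{v}\) of its neighboring instances):

```python
import numpy as np

def ifa_aggregate(feats, neighbor_means):
    """feats: (n_views, d) array; neighbor_means: list of (d,) arrays."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    own_mean = feats.mean(axis=0)                 # cluster center, \bar{f}_i^v
    # a_{i,j}: close to own center, far from every neighboring instance
    scores = np.array([
        cos(f, own_mean) - sum(cos(f, m) for m in neighbor_means)
        for f in feats
    ])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over the views
    return weights @ feats                        # fused instance feature
```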

Adaptive Octree

Conventional octrees use cubic voxels, requiring very deep trees to approximate objects with large aspect ratios (e.g., walls). The adaptive octree adjusts the size of each node according to object shape:

\[\mathbf{d}_l = (\mathbf{b}_{\max} - \mathbf{b}_{\min}) / 2^l\]

where \(\mathbf{b}_{\max}\) and \(\mathbf{b}_{\min}\) are the corner coordinates of the object bounding box, and \(l\) is the octree depth.
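In code this is a one-liner; the wall example below is my own illustration of why non-cubic cells help:

```python
import numpy as np

def node_size(b_min, b_max, depth):
    """Per-axis cell size d_l at octree depth l for the adaptive octree."""
    return (np.asarray(b_max, float) - np.asarray(b_min, float)) / 2 ** depth

# A 4.0 x 3.0 x 0.2 m wall at depth 2 yields 1.0 x 0.75 x 0.05 m cells:
# the cells inherit the wall's aspect ratio, so no depth is wasted on
# shrinking cubes down to the wall's thin dimension.
print(node_size([0, 0, 0], [4.0, 3.0, 0.2], 2))   # -> [1.   0.75 0.05]
```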

Octree-Graph Construction

  • Node \(\mathbf{N}_i\): contains semantics \(n_i^s\), center \(n_i^c\), and adaptive octree \(n_i^o\)
  • Edge \(\mathbf{E}_{i,j}\): contains semantic relation \(e_{i,j}^s\), spatial distance \(e_{i,j}^d\), and 3D direction vector \(\mathbf{e}_{i,j}^v\)
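One plausible way to realize these records as plain data classes (field names and types are my guesses from the notation above, not the authors' code):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GraphNode:                  # N_i
    semantics: np.ndarray         # n_i^s: open-vocabulary semantic feature
    center: np.ndarray            # n_i^c: 3D centroid of the instance
    octree: object                # n_i^o: the instance's adaptive octree

@dataclass
class GraphEdge:                  # E_{i,j}
    relation: str                 # e_{i,j}^s: semantic relation, e.g. "on"
    distance: float               # e_{i,j}^d: distance between node centers
    direction: np.ndarray         # e_{i,j}^v: 3D direction from N_i to N_j
```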

Key Experimental Results

Main Results — 3D Semantic Segmentation

| Method        | Replica mIoU↑ | ScanNet mIoU↑ | ScanNet mAcc↑ |
| ------------- | ------------- | ------------- | ------------- |
| ConceptFusion | 0.10          | 0.08          | 0.15          |
| ConceptGraph  | 0.18          | 0.16          | 0.28          |
| HOV-SG        | 0.231         | 0.222         | 0.431         |
| Ours          | 0.320         | 0.393         | 0.601         |

The proposed method significantly outperforms all baselines on both Replica and ScanNet, with absolute gains of 17.1 points in mIoU and 17.0 points in mAcc over HOV-SG on ScanNet.

3D Instance Segmentation (ScanNet200)

| Method           | AP↑  | AP50↑ | AP25↑ |
| ---------------- | ---- | ----- | ----- |
| MaskClustering   | 12.0 | 23.3  | 30.1  |
| Ours (zero-shot) | 14.3 | 25.8  | 33.6  |

Ablation Study — Merging Strategy

| Merging Strategy | mIoU↑ | mAcc↑ |
| ---------------- | ----- | ----- |
| Frame-level      | 0.323 | 0.519 |
| Global-level     | 0.286 | 0.476 |
| CGSM (I=200)     | 0.356 | 0.574 |

Path Planning

| Method | SR(1.0 m) %↑ | SR(0.5 m) %↑ | SR(0.25 m) %↑ |
| ------ | ------------ | ------------ | ------------- |
| HOV-SG | 55.25        | 46.75        | 32.16         |
| Ours   | 97.88        | 96.88        | 96.38         |

The proposed method substantially outperforms HOV-SG in path planning success rate, as Octree-Graph supports navigation to arbitrary free-space regions.

Highlights & Insights

  1. The adaptive octree elegantly addresses the inefficiency of conventional octrees in representing objects with large aspect ratios.
  2. CGSM achieves a balance between exploiting local temporal details and avoiding global redundancy through its grouping strategy.
  3. IFA performs weighted aggregation by accounting for inter-instance discriminability, yielding greater robustness than simple averaging.
  4. Storage is drastically reduced: all adaptive octrees together occupy only 42 KB, compared to 6.8 MB for point clouds.

Limitations & Future Work

  • The method depends on the segmentation quality of 2D foundation models, and challenges remain for small objects and occluded scenes.
  • The octree depth \(L_{\max}=4\) may be insufficient to represent highly fine-grained geometric details.
  • Path planning may still incorrectly mark certain free regions as occupied due to octree discretization.

Related Work

  • ConceptGraph, HOV-SG: 3D scene graph methods
  • OpenScene, ConceptFusion: point/mesh-level 3D maps
  • OVIR-3D, MaskClustering: instance-level 3D maps
  • PlenOctrees, OctreeOcc: octree structures for rendering/semantics

Rating

  • Novelty: ⭐⭐⭐⭐ (the hybrid representation of adaptive octrees and graph structure is highly novel)
  • Technical Depth: ⭐⭐⭐⭐ (CGSM and IFA are elegantly designed)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (4 tasks across 4 datasets)
  • Practical Value: ⭐⭐⭐⭐⭐ (highly valuable for embodied agents)