# OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding
Conference: AAAI 2026
arXiv: 2408.11030
Project Page: https://youjunzhao.github.io/OpenScan/
Area: 3D Scene Understanding / Open-Vocabulary
Keywords: Open-Vocabulary 3D, Attribute Understanding, 3D Scene Segmentation, Benchmark, Knowledge Graph
## TL;DR
This paper proposes the Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) task and the accompanying OpenScan benchmark, which extend 3D scene understanding beyond object categories to eight linguistic attribute dimensions and reveal critical deficiencies of existing OV-3D methods in understanding abstract object attributes.
## Background & Motivation
### State of the Field
Open-vocabulary 3D scene understanding (OV-3D) aims to localize and classify novel objects beyond training categories. Leveraging vision-language models (VLMs) such as CLIP, OV-3D has achieved remarkable progress in object-category-level recognition. Representative methods including OpenMask3D, SAI3D, MaskClustering, and Open3DIS demonstrate strong performance on ScanNet200.
### Limitations of Prior Work
Existing methods and benchmarks (ScanNet, ScanNet200) focus exclusively on object-category-level open-vocabulary problems. However, understanding object-related attributes (e.g., affordance, material, properties) is equally critical for AI systems. For instance, a robot needs to understand "something you can sit on" (functional attribute) rather than merely "chair" (category label).
### Root Cause
The absence of large-scale 3D scene attribute annotation benchmarks prevents systematic evaluation of OV-3D models' generalization ability in object attribute understanding. Existing benchmarks contain only object category annotations, with no attribute annotations.
### Paper Goals
To construct a comprehensive evaluation benchmark beyond object categories, assessing OV-3D models' ability to understand abstract object attributes across multiple linguistic dimensions.
### Starting Point
The GOV-3D (Generalized Open-Vocabulary 3D Scene Understanding) task is introduced, extending queries from object categories to abstract object-related attributes. The OpenScan benchmark is built upon ScanNet200, acquiring attribute annotations through a combination of knowledge graphs and human annotation.
## Core Idea
Object category recognition is merely the tip of the iceberg in 3D scene understanding. Truly open-vocabulary understanding should encompass abstract concepts spanning multiple linguistic dimensions, including affordance, material, and properties.
## Method
### Overall Architecture
The OpenScan benchmark construction pipeline:

1. Knowledge Graph Association: ConceptNet is used to establish associations between the 200 object categories in ScanNet200 and various attributes.
2. Human Annotation: Visual attributes (e.g., material) are annotated manually.
3. Attribute Categorization: Attributes are organized into eight linguistic dimensions.
4. Attribute Verification: Human verification ensures semantic consistency.
5. Query Generation: Text queries are generated with object names concealed.
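Read as code, the pipeline is a chain of per-category annotation and rewriting steps. The minimal Python sketch below is hypothetical: the helper names, the toy knowledge-graph entries, and the raw relation strings (which the benchmark verbalizes into natural phrases, see Key Design 3) are illustrative assumptions, not the authors' code.

```python
# Hypothetical end-to-end sketch of the five-stage construction pipeline.

def query_knowledge_graph(category):
    # Stage 1 stand-in: ConceptNet edges as (relation, attribute, weight).
    toy_kg = {"chair": [("CapableOf", "sit", 2.0), ("MadeOf", "wood", 1.0)]}
    return toy_kg.get(category, [])

def annotate_visual_attrs(category):
    # Stage 2 stand-in: manually annotated visual attributes (material).
    return [("MadeOf", "plastic", 1.0)] if category == "bottle" else []

def build_openscan(categories):
    annotations = {}
    for cat in categories:
        edges = query_knowledge_graph(cat) + annotate_visual_attrs(cat)
        # Stages 3-4 (grouping into eight dimensions, human verification)
        # are manual in the paper and omitted here; Stage 5 conceals the
        # object name behind the placeholder "this term".
        annotations[cat] = [f"this term {rel} {attr}" for rel, attr, _ in edges]
    return annotations

print(build_openscan(["chair", "bottle"]))
# {'chair': ['this term CapableOf sit', 'this term MadeOf wood'],
#  'bottle': ['this term MadeOf plastic']}
```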
### Key Design 1: Eight-Dimensional Linguistic Attribute System
Function: Object attributes are organized into eight representative linguistic dimensions.
Specific Dimensions:

- Affordance: Functional use of an object, e.g., "sit" for a chair.
- Property: Object characteristics, e.g., "soft" for a pillow.
- Type: Category membership, e.g., a phone is a "communication device."
- Manner: Mode of use, e.g., a hat is "worn on a head."
- Synonym: Near-synonymous substitution, e.g., "image" for a picture.
- Requirement: Necessary conditions, e.g., a bicycle requires "balance to ride."
- Element: Constituent parts, e.g., a bicycle has "two wheels."
- Material: Material type, e.g., "plastic" for a bottle.
Design Motivation: These eight dimensions span from commonsense knowledge (affordance, requirement) to visual knowledge (material), enabling comprehensive evaluation of a model's deep understanding of objects.
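For reference, the taxonomy and its running examples fit in a simple mapping; the dictionary below is purely illustrative and not an artifact released with the benchmark.

```python
# The eight OpenScan attribute dimensions, each with the paper's
# (object, attribute) running example.
ATTRIBUTE_DIMENSIONS = {
    "affordance":  ("chair",   "sit"),
    "property":    ("pillow",  "soft"),
    "type":        ("phone",   "communication device"),
    "manner":      ("hat",     "worn on a head"),
    "synonym":     ("picture", "image"),
    "requirement": ("bicycle", "balance to ride"),
    "element":     ("bicycle", "two wheels"),
    "material":    ("bottle",  "plastic"),
}
```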
### Key Design 2: Knowledge-Graph-Driven Annotation Generation
Function: ConceptNet is leveraged to automatically generate object–attribute association annotations.
Mechanism: For each object category \(c_i\) in ScanNet200, relevant edges are queried from the knowledge graph \(\mathcal{G} = (\mathcal{V}, \mathcal{E})\):

\[
\{e\}_i = \{(v_m, r, w, v_n) \in \mathcal{E} \mid v_m = c_i\}
\]
For each relation \(r\), only the attribute with the highest weight \(w\) is retained, so that each object keeps its single most representative attribute per dimension.
Design Motivation: Knowledge graphs provide a structured and scalable source of commonsense knowledge, enabling attribute annotation generation for a large number of objects at low cost.
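ConceptNet exposes a public REST API, so the per-relation max-weight filtering is easy to sketch. The version below assumes the standard api.conceptnet.io response format (an "edges" list whose items carry start/end/rel nodes and a weight); it sketches the idea only, whereas the paper's pipeline adds human verification on top.

```python
# Sketch: query ConceptNet and keep the highest-weight edge per relation.
import requests

def top_attribute_per_relation(category: str) -> dict:
    resp = requests.get(f"http://api.conceptnet.io/c/en/{category}")
    best = {}  # relation label -> (attribute, weight)
    for edge in resp.json().get("edges", []):
        # Keep only edges that start at the queried category.
        if edge["start"].get("term") != f"/c/en/{category}":
            continue
        rel = edge["rel"]["label"]
        attr, weight = edge["end"]["label"], edge["weight"]
        if rel not in best or weight > best[rel][1]:
            best[rel] = (attr, weight)
    return best

# e.g. top_attribute_per_relation("chair") might yield
# {'CapableOf': ('sit', <w>), 'AtLocation': (..., <w>), ...}
```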
### Key Design 3: Query Template Design
Function: Text queries with concealed object names are generated to evaluate the GOV-3D task.
Mechanism: The object category \(v_m\) in the query is replaced with the placeholder \(t\) = "this term," and the relation and attribute are concatenated:

\[
q = \text{Concatenate}(t, r, v_n)
\]

For example: "this term is made of wood."
Design Motivation: By excluding object names from queries, the model is forced to reason through attributes to localize objects, rather than relying on simple name matching.
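A toy version of the template, with \(t\) fixed to "this term"; the relation-to-phrase wordings below are assumptions for readability, as the paper's exact verbalizations may differ.

```python
# Name-concealed query generation: q = Concatenate(t, r, v_n).
# The relation phrasings below are illustrative assumptions.
REL_PHRASES = {
    "MadeOf": "is made of",
    "UsedFor": "is used for",
}

def make_query(relation: str, attribute: str, t: str = "this term") -> str:
    return f"{t} {REL_PHRASES.get(relation, relation)} {attribute}"

print(make_query("MadeOf", "wood"))      # "this term is made of wood"
print(make_query("UsedFor", "sitting"))  # "this term is used for sitting"
```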
### Loss & Training
The benchmark itself does not involve training losses. Evaluation metrics follow standard OV-3D protocols: AP/AP50/AP25 for instance segmentation, and mIoU/mAcc for semantic segmentation.
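For completeness, a minimal sketch of the semantic-segmentation metrics over per-point labels; this follows the generic protocol, not any particular benchmark script.

```python
# Minimal mIoU / mAcc over per-point class predictions.
import numpy as np

def miou_macc(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    ious, accs = [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        if tp + fn == 0:  # class absent from the ground truth
            continue
        ious.append(tp / (tp + fp + fn))
        accs.append(tp / (tp + fn))
    return float(np.mean(ious)), float(np.mean(accs))
```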
## Key Experimental Results
### Main Results: 3D Instance Segmentation
| Method | Affordance | Property | Synonym | Material | Mean | ScanNet200 |
|---|---|---|---|---|---|---|
| OpenMask3D | 7.2 | 7.5 | 16.9 | 18.8 | 9.9 | 15.4 |
| SAI3D | 5.3 | 5.8 | 10.0 | 11.3 | 7.7 | 12.7 |
| MaskClustering | 6.2 | 7.0 | 16.2 | 12.1 | 8.1 | 12.0 |
| Open3DIS | 11.9 | 12.8 | 26.7 | 28.3 | 15.8 | 23.7 |
(AP metric; Mean is computed over all eight attribute dimensions, of which only four are shown here. All methods perform significantly worse on OpenScan than on ScanNet200.)
### 3D Semantic Segmentation
| Method | OpenScan mIoU | OpenScan mAcc | ScanNet mIoU |
|---|---|---|---|
| OpenScene | 0.45 | 1.87 | 47.5 |
| PLA | 0.01 | 2.37 | 66.6 |
| RegionPLC | 0.07 | 2.36 | 68.7 |
Semantic segmentation methods fail almost completely on OpenScan (mIoU below 1%), indicating a severe lack of generalization from object categories to attributes.
### Ablation Study: Effect of Pre-training Vocabulary Size
Increasing the training vocabulary size (\(S = 10 \to 170\)) yields no significant improvement across most attribute dimensions, with only a marginal gain on the material dimension. This demonstrates that simply expanding the number of training categories cannot resolve the attribute understanding problem.
### Key Findings
- All OV-3D models perform substantially worse on OpenScan than on ScanNet200, confirming that GOV-3D is a more challenging task.
- Synonym and Material achieve relatively high performance: the former because synonyms remain semantically close to object categories, the latter because CLIP can recognize visual material patterns.
- Affordance and Property are the most challenging: they require commonsense reasoning, which is not covered by CLIP's pre-training objective.
- Using query templates (with relational descriptions) improves AP by approximately 0.2–6 points compared to using bare attribute words.
## Highlights & Insights
- Forward-Looking Problem Formulation: Extending OV-3D from categories to attributes is a natural yet previously overlooked research direction.
- Systematic Benchmark Design: The eight-dimensional attribute system provides comprehensive coverage; the hybrid strategy of knowledge graphs combined with human annotation balances efficiency and quality.
- Fundamental Limitations Revealed Experimentally: The results demonstrate that simply expanding the training vocabulary cannot fix attribute understanding, pointing toward the need for deeper methodological changes.
- Substantial Scale: 153,644 attribute annotations, 341 attributes, with an average of 3.15 attribute annotations per object.
## Limitations & Future Work
- Commonsense attribute annotations rely on ConceptNet, whose coverage and quality may be limited.
- Only material is annotated among visual attributes; other visual dimensions such as color and shape are not covered.
- The benchmark is constructed solely on ScanNet200, restricting scene types to indoor environments.
- No solution is proposed for the GOV-3D task; the paper only exposes the problem.
- The selection criteria for the eight attribute dimensions are insufficiently justified, with potential omissions or overlaps.
- Evaluation assumes the target object referenced by the query exists in the scene, which may be inconsistent with the GOV-3D task's claimed requirement to determine object presence.
## Related Work & Insights
- OpenScene (Peng et al. 2023): Supports arbitrary text queries for zero-shot 3D semantic segmentation, but lacks quantitative evaluation on attribute dimensions.
- SceneFun3D (Delitzas et al. 2024): Provides functional annotations for robot interaction scenes, but focuses solely on the affordance dimension.
- MMScan (Lyu et al. 2024): A visual attribute understanding benchmark that lacks commonsense attributes.
- Insights: The deficiencies of vision-language models in commonsense reasoning suggest that image-text alignment alone is insufficient; incorporating structured knowledge or multi-step reasoning may be necessary.
## Rating
⭐⭐⭐⭐ (4/5)
Strengths: The problem formulation is valuable, the benchmark construction is rigorous and comprehensive, and the experimental findings offer important guidance for future research.
Weaknesses: No solution is proposed, and certain annotation design choices lack sufficient justification.