ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail
Conference: ICCV 2025 arXiv: 2503.17044 Code: Coming soon Area: 3D Vision Keywords: 3D scene understanding, dense annotation, multi-granularity captioning, joint object-part generation, 3D Gaussian
TL;DR
This paper proposes ExCap3D, a method that generates multi-granularity captions for objects in 3D indoor scenes, producing an object-level and a part-level description for each detected object. Part-to-object information sharing and semantic/textual consistency losses keep the two levels accurate and coherent. On a newly constructed dataset of 190K captions, CIDEr improves over the prior SOTA by 17% at the object level and 124% at the part level.
Background & Motivation
3D indoor scene understanding is fundamental to AR/VR and robotics applications. Natural language descriptions can encode complex information and support more natural human–scene interaction. However, existing 3D annotation methods suffer from critical limitations:
- Single-granularity captioning: Methods such as Scan2Cap and ScanQA describe objects at only one level of detail, focusing primarily on spatial relationships between objects.
- Lack of part-level detail: These methods cannot describe the appearance, material, or functional properties of individual object parts.
- Different applications require different granularities: Robot navigation may only require "recliner," whereas an assistive AI needs "a recliner with a high backrest, wooden armrests, a soft seat cushion, and an adjustable footrest."
Core contribution: The paper introduces the task of Expressive 3D Captioning—generating, for each detected object, an object-level caption (semantic category and appearance) and a part-level caption (material, color, and function of each part).
Method
Overall Architecture (Fig. 3)
- 3D instance segmentation: Mask3D is used to detect objects in the scene.
- Joint caption generation: Two independent captioning heads generate object-level and part-level descriptions.
- Consistency constraints: Semantic and textual consistency losses enforce coherence between the two captioning levels.
3D Instance Segmentation
Given a 3D scene mesh \(M=(\mathcal{V}, \mathcal{F})\), the input is voxelized and fed into Mask3D:
- A 3D sparse convolutional UNet encoder extracts dense features \(F \in \mathbb{R}^{N_{vox} \times D}\).
- A parallel mask module iteratively refines query vectors \(Q \in \mathbb{R}^{N_q \times D_q}\) via transformer encoder layers.
- Output: refined instance-aware queries \(Q_r\) and instance masks \(I \in \{0,1\}^{N_q \times N_{vox}}\).
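As intuition for the mask output, here is a minimal PyTorch-style sketch of a dot-product mask head in the spirit of Mask3D, where each refined query is matched against the dense voxel features; the shapes, thresholding, and function name are illustrative, not the paper's exact implementation.

```python
import torch

def predict_instance_masks(queries, voxel_feats, threshold=0.5):
    """Hypothetical mask head: match each refined query against per-voxel
    features with a dot product, then threshold the sigmoid scores.

    queries:     (N_q, D)     refined instance-aware queries Q_r
    voxel_feats: (N_vox, D)   dense UNet features F
    returns:     (N_q, N_vox) binary instance masks I
    """
    logits = queries @ voxel_feats.T           # (N_q, N_vox) similarity scores
    return torch.sigmoid(logits) > threshold   # binarized instance masks

# toy usage with random features
Q_r = torch.randn(100, 128)      # N_q = 100 instance queries
F = torch.randn(50_000, 128)     # N_vox = 50k voxels
I = predict_instance_masks(Q_r, F)
print(I.shape)                   # torch.Size([100, 50000])
```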
Joint Captioning and Information Sharing
Two transformer language models \(\Psi_{obj}\) and \(\Psi_{part}\) perform autoregressive token prediction, conditioned on two information sources:
1. Caption-aware queries: Refined queries are linearly projected into caption-initialization tokens \(Q_c\) that seed each captioner.
2. Segment-level context features: Cross-attention layers attend to 3D features corresponding to each object. To reduce computational complexity, features are aggregated within pre-computed segments on mesh faces, yielding \(S_o \in \mathbb{R}^{n_{s,o} \times D_{caption}}\).
Part-to-object information sharing (key design): Part-level captions are generated first; the final-layer hidden states of \(\Psi_{part}\) are linearly projected to \(H_{part,o}\) and concatenated with the object captioner's segment context features \(S_{obj,o}\) (see the sketch below).
Object-level captioning is then conditioned on \(Q_{c,o}\) and \([H_{part,o}; S_{obj,o}]\).
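A minimal sketch of this sharing step, assuming the concatenated tensor serves as the cross-attention memory of the object captioner; the module name, dimensions, and concatenation along the token axis are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PartToObjectSharing(nn.Module):
    """Project the part captioner's final hidden states and concatenate them
    with the object captioner's segment-level context features."""

    def __init__(self, d_part, d_caption):
        super().__init__()
        self.proj = nn.Linear(d_part, d_caption)   # linear projection of part hidden states

    def forward(self, part_hidden, segment_feats):
        # part_hidden:   (n_part_tokens, d_part)    final-layer states of Psi_part
        # segment_feats: (n_segments, d_caption)    segment context S_obj,o
        h_part = self.proj(part_hidden)                    # -> (n_part_tokens, d_caption)
        return torch.cat([h_part, segment_feats], dim=0)   # memory [H_part,o ; S_obj,o]

# toy usage: the returned memory is what the object captioner cross-attends to
share = PartToObjectSharing(d_part=768, d_caption=256)
memory = share(torch.randn(32, 768), torch.randn(20, 256))
print(memory.shape)  # torch.Size([52, 256])
```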
Semantic and Textual Consistency Losses
Semantic consistency: Hidden states from both levels are projected to a low-dimensional space and classified into \(N_{sem}\) fine-grained categories; a symmetric cross-entropy loss, in which \(SG\) denotes the stop-gradient operator, constrains the two predictions to agree.
Textual consistency: Hidden state sequences are aggregated into single vectors, and the distance between the two captioning levels' vectors is minimized (see the sketch below).
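A sketch of what the two consistency terms could look like, assuming mean-pooled hidden states, one linear classification head per level, and a cosine distance for the textual term; these are illustrative choices, with `.detach()` standing in for the stop-gradient \(SG\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def semantic_consistency_loss(h_obj, h_part, head_obj, head_part):
    """Symmetric cross-entropy between the fine-grained class predictions of
    the two captioning levels; detach() acts as the stop-gradient SG."""
    logits_obj = head_obj(h_obj.mean(dim=0, keepdim=True))     # (1, N_sem)
    logits_part = head_part(h_part.mean(dim=0, keepdim=True))  # (1, N_sem)
    return (F.cross_entropy(logits_part, logits_obj.detach().softmax(-1)) +
            F.cross_entropy(logits_obj, logits_part.detach().softmax(-1)))

def textual_consistency_loss(h_obj, h_part):
    """Aggregate each hidden-state sequence into a single vector and penalize
    the distance between the two levels (cosine distance used here)."""
    v_obj, v_part = h_obj.mean(dim=0), h_part.mean(dim=0)
    return 1.0 - F.cosine_similarity(v_obj, v_part, dim=0)

# toy usage: 12 object tokens, 40 part tokens, hidden size 256, N_sem = 500 (illustrative)
head_o, head_p = nn.Linear(256, 500), nn.Linear(256, 500)
h_o, h_p = torch.randn(12, 256), torch.randn(40, 256)
print(semantic_consistency_loss(h_o, h_p, head_o, head_p).item(),
      textual_consistency_loss(h_o, h_p).item())
```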
Total Loss
The captioning and consistency losses are combined into a weighted sum with \(w_1=1,\ w_2=w_3=0.1\).
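A plausible form of this weighted sum, assuming \(w_1\) scales the autoregressive captioning losses and \(w_2, w_3\) the semantic and textual consistency losses (the exact grouping is an assumption):

\[
\mathcal{L}_{total} = w_1\left(\mathcal{L}_{cap}^{obj} + \mathcal{L}_{cap}^{part}\right) + w_2\,\mathcal{L}_{sem} + w_3\,\mathcal{L}_{text},
\qquad w_1 = 1,\ w_2 = w_3 = 0.1.
\]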
ExCap3D Dataset
Construction Pipeline
The dataset is built on 947 scenes from ScanNet++ and covers 34K objects, using an automated pipeline:
- Object-level: Project 3D GT instance annotations onto DSLR images → crop object regions → generate multi-view descriptions with a VLM (LLaVA 1.6-7B) → aggregate with an LLM (Llama 3.1-8B); a minimal sketch of this VLM→LLM flow follows the list.
- Part-level: Generate part pseudo-masks with MaskClustering + SAM → describe each part with a VLM → aggregate into a unified part-level caption with an LLM.
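The per-object flow of both branches reduces to "describe each visible crop with a VLM, then merge the texts with an LLM". The sketch below captures that pattern with stand-in callables; none of the helper names, prompts, or models are the authors' actual code.

```python
def caption_object(crops, vlm, llm_aggregate):
    """Describe each visible crop of one object with a VLM, then merge the
    per-view descriptions into a single caption with an LLM."""
    per_view = [vlm(crop) for crop in crops if crop is not None]  # skip views where the object is hidden
    return llm_aggregate(per_view)

# toy usage with placeholder callables (the paper uses LLaVA 1.6-7B and Llama 3.1-8B)
crops = ["dslr_view_01_crop", None, "dslr_view_07_crop"]
vlm = lambda crop: f"a chair-like object seen in {crop}"
llm_aggregate = lambda texts: "; ".join(texts)
print(caption_object(crops, vlm, llm_aggregate))
```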
Dataset Statistics (Table 1)
| Dataset | # Captions | # Categories | # Objects | Granularity |
|---|---|---|---|---|
| Scan2Cap | 46k | 265 | 9.9k | Scene + Object |
| ScanQA | 35k | 370 | 9.5k | Scene + Object |
| ExCap3D | 190k | 2k | 34.7k | Object + Part |
The dataset contains more than four times as many captions as the largest existing dataset, with category coverage reaching 2,000 classes.
Key Experimental Results
Main Results: Comparison with SOTA (Table 2)
Object-level captioning:
| Method | CIDEr↑ | ROUGE↑ | METEOR↑ |
|---|---|---|---|
| D3Net | 6.7 | 5.4 | 6.7 |
| Vote2Cap-DETR | 13.3 | 12.9 | 17.2 |
| PQ3D | 27.9 | 11.6 | 12.5 |
| ExCap3D | 32.7 | 16.6 | 17.9 |
Part-level captioning:
| Method | CIDEr↑ | ROUGE↑ | METEOR↑ |
|---|---|---|---|
| D3Net | 10.5 | 7.9 | 7.9 |
| Vote2Cap-DETR | 13.3 | 20.7 | 22.7 |
| PQ3D | 14.4 | 16.3 | 15.6 |
| ExCap3D | 32.3 | 21.7 | 20.8 |
CIDEr improvements: +17% at the object level and +124% at the part level (both vs. PQ3D). The substantial gain at the part level demonstrates that existing methods are fundamentally ill-equipped to handle fine-grained captioning.
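These percentages follow directly from the CIDEr columns of the tables above:

\[
\frac{32.7 - 27.9}{27.9} \approx 17\%, \qquad \frac{32.3 - 14.4}{14.4} \approx 124\%.
\]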
Ablation Study (Table 3)
| Method | Object CIDEr | Part CIDEr |
|---|---|---|
| Independent models (baseline) | 29.8 | 18.7 |
| + Semantic consistency | 30.2 | 24.8 |
| + Textual consistency | 32.2 | 19.6 |
| + Part→Object information sharing | 34.8 | 25.4 |
| Full model | 32.7 | 32.3 |
Key findings:
- Semantic consistency primarily benefits part-level captioning (18.7→24.8), while textual consistency primarily benefits object-level captioning (29.8→32.2), demonstrating complementary effects.
- Part-to-object information sharing improves both levels, with a particularly large gain at the object level (29.8→34.8).
- When all components are combined, the part-level improvement is most pronounced (18.7→32.3, +73%).
Comparison of Information Sharing Directions (Table 4)
| Direction | Object CIDEr | Part CIDEr |
|---|---|---|
| Object→Part | 32.8 | 15.6 |
| Part→Object | 32.7 | 32.3 |
Modeling objects as the sum of their parts is substantially superior to treating parts as derived from the object—object-level captions do not contain fine-grained part information and thus cannot effectively guide part caption generation.
Context Feature Ablation (Table 5)
| Method | Object CIDEr | Part CIDEr |
|---|---|---|
| Without context features | 33.7 | 27.0 |
| With context features | 32.7 | 32.3 |
Segment-level context features are critical for part-level captioning (27.0→32.3), as describing low-level part details requires finer-grained features.
Highlights & Insights
- "Object as the sum of its parts" modeling philosophy: The information flow of first describing parts and then synthesizing object-level descriptions yields a clear advantage, consistent with the "bottom-up" object perception pattern in human cognition.
- Complementarity of consistency losses: Semantic consistency ensures that both levels refer to the same semantic entity, while textual consistency ensures overlap in descriptive content—each constraining consistency from a distinct dimension.
- Scalability of the VLM pipeline: Leveraging VLM + LLM to automatically generate 190K high-quality annotations avoids the bottleneck of manual annotation, and the approach can be extended to other 3D datasets.
- End-to-end learning outperforms a disjoint pipeline (Table 6): Even when using the same VLM, end-to-end trained captioning quality substantially surpasses a two-stage detect-then-describe approach.
Limitations & Future Work
- Two independent captioning heads share information via cross-attention, which may limit the thoroughness of information transfer.
- The sparse convolutional backbone operates at a voxel resolution of approximately 2 cm, constraining captioning ability for small or thin objects.
- Part pseudo-masks derived from MaskClustering + SAM are lower quality than manual annotations and may introduce noise.
- The dataset is built on ScanNet++; generalization to other 3D scanning formats remains to be validated.
Related Work & Insights
- Unlike Cap3D (Luo et al., 2023), which captions isolated 3D objects, ExCap3D jointly detects and generates multi-granularity captions within complete 3D scenes.
- The multi-granularity captioning paradigm can be combined with 3D vision–language alignment methods (e.g., 3D-VISTA) to enable finer-grained embodied understanding.
- The part-to-object information sharing paradigm is generalizable to other hierarchical generation tasks, such as scene graph generation.
- The ExCap3D dataset can serve as foundational training data supporting fine-grained instruction following in 3D embodied AI.
Rating ⭐⭐⭐⭐
- Novelty ★★★★☆: The definition of the multi-granularity captioning task and the information sharing mechanism are original.
- Experimental Thoroughness ★★★★☆: Ablation studies are comprehensive, with clear attribution of each component's contribution.
- Writing Quality ★★★★☆: Figures and tables are clear; the method is described systematically.
- Value ★★★★☆: The 190K dataset is highly valuable, though it relies on the high-resolution DSLR imagery of ScanNet++.