Lifting Unlabeled Internet-level Data for 3D Scene Understanding
Conference: CVPR 2026 | arXiv: 2604.01907 | Code: Project Page | Area: 3D Vision | Keywords: 3D scene understanding, internet videos, automated data engine, vision-language navigation, spatial reasoning
TL;DR
This paper presents SceneVerse++, an automated data engine that turns unlabeled internet videos into 3D scene understanding training data, yielding 6,687 scenes from 8,217 source videos. It demonstrates the feasibility of leveraging internet-scale data to advance 3D scene understanding across three tasks: 3D object detection (F1@.25 +20.6), spatial VQA (+14.9%), and vision-language navigation (+14% SR).
Background & Motivation
3D scene understanding is a fundamental capability for both human cognition and embodied intelligence, spanning geometric perception (depth estimation, object detection), semantic understanding (segmentation, visual grounding), and high-level reasoning (spatial QA, navigation). Progress in this area via deep learning relies heavily on large-scale annotated real-world 3D datasets.
Key Challenge: Unlike 2D images that can be easily collected and annotated from the web, 3D scene data acquisition and annotation is prohibitively expensive — requiring specialized hardware (RGB-D/LiDAR), 3D mesh reconstruction, and dense manual semantic annotation. Since ScanNet, the field has seen virtually no order-of-magnitude growth in 3D data scale, while the internet hosts an abundance of unlabeled video data that naturally captures the 3D world.
Key Insight: The paper proposes an automated data engine that converts unlabeled internet videos into training data for 3D scene understanding. Rather than naively chaining existing sub-modules (reconstruction, segmentation, semantic annotation), the paper systematically analyzes the bottlenecks of automated data generation and provides guidelines for scaling end-to-end models across tasks of different perceptual granularity. Core Idea: Through a carefully designed data engine, internet videos can serve as a viable path to bridging the scarcity of annotated 3D data and improving end-to-end model capabilities.
Method
Overall Architecture
Starting from internet videos, the pipeline consists of three stages: (1) video filtering and structure-from-motion (SfM) to obtain camera poses and sparse 3D geometry; (2) a modular reconstruction and segmentation pipeline to produce dense 3D reconstructions and instance annotations; (3) task-specific data generation for downstream tasks (detection/segmentation, spatial VQA, VLN). The final dataset comprises 6,687 scenes sourced from 8,217 videos, including images, camera poses, dense reconstructions, instance segmentation, and high-level reasoning annotations.
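The second stage's depth-prior construction can be illustrated with a minimal sketch: sparse SfM points are projected into a camera view to form a sparse metric depth map that a monocular depth model could then densify. All function and variable names here are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sparse_depth_prior(points_world, K, T_wc, height, width):
    """Project sparse SfM points into one view to build a sparse depth map.

    points_world: (N, 3) 3D points in world coordinates.
    K: (3, 3) pinhole intrinsics.
    T_wc: (4, 4) world-to-camera extrinsic.
    Returns an (H, W) depth map, 0 where no point projects.
    """
    n = points_world.shape[0]
    pts_h = np.hstack([points_world, np.ones((n, 1))])  # homogeneous coordinates
    pts_cam = (T_wc @ pts_h.T).T[:, :3]                 # world -> camera frame
    z = pts_cam[:, 2]
    valid = z > 1e-6                                    # keep points in front of the camera
    uv = (K @ pts_cam[valid].T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    depth = np.zeros((height, width))
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for ui, vi, zi in zip(u[inside], v[inside], z[valid][inside]):
        # keep the nearest point when several land on the same pixel
        if depth[vi, ui] == 0 or zi < depth[vi, ui]:
            depth[vi, ui] = zi
    return depth
```

The resulting sparse map would serve only as a prior; dense metric depth and TSDF fusion (as in the paper's pipeline) operate downstream of it.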
Key Designs
- Video Filtering and SfM Reconstruction Pipeline:
  - Function: Extract high-quality camera poses and sparse 3D point clouds from raw internet videos.
  - Mechanism: TransNetV2 shot detection → filtering of low-quality/outdoor/portrait content → disparity-based keyframe selection (rather than uniform sampling) → dense pixel matching + global bundle adjustment → spatial coverage and SfM quality checks.
  - Design Motivation: Internet videos contain abundant irrelevant content; disparity-based frame selection ensures triangulation quality. Optimized pseudo-trajectory pixels are introduced to improve memory efficiency for long videos.
- Dense Reconstruction and Instance Segmentation Pipeline:
  - Function: Produce complete 3D meshes and instance-level annotations from sparse SfM outputs.
  - Mechanism: For reconstruction, sparse SfM points are projected onto image planes to obtain sparse depth priors; PriorDA then predicts dense metric depth, and TSDF fusion generates watertight meshes. For segmentation, CropFormer produces per-frame segmentation masks, which are aggregated into 3D space via inter-frame view consensus and spatial consistency; VLMs then generate text descriptions and semantic labels.
  - Design Motivation: Neural rendering methods produce high-quality results but are too slow for per-scene optimization; end-to-end reconstruction methods are fast but suffer from memory constraints and geometric distortions on long videos. The metric depth + SfM approach strikes a balance between quality and efficiency (averaging 71 seconds for reconstruction and 96 seconds for segmentation per scene).
- Task-Specific Data Generation:
  - Function: Transform 3D scenes into task-specific training data.
  - Mechanism: 3D detection/segmentation directly uses the reconstruction and instance annotations. Spatial VQA generates templated QA pairs (632K) via 3D scene graphs. VLN converts free-exploration room-tour trajectories into R2R-style navigation data through a three-stage pipeline (trajectory preprocessing → action encoding → instruction generation).
  - Design Motivation: The core challenge for VLN is bridging the gap between the irregular motion in room-tour videos and the goal-directed shortest paths in the R2R benchmark, necessitating dedicated trajectory refinement and action encoding.
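As a rough illustration of templated QA generation over a 3D scene graph, the sketch below derives a "relative distance" question from object categories and centroids. The graph schema and question template are hypothetical, not the paper's actual ones.

```python
# Hypothetical scene graph: object id -> (category, 3D centroid)
scene = {
    0: ("chair", (1.0, 0.0, 0.0)),
    1: ("table", (1.2, 0.1, 0.0)),
    2: ("sofa", (4.0, 2.0, 0.0)),
}

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def relative_distance_qa(scene, anchor_id):
    """Generate one templated 'relative distance' QA pair from a scene graph."""
    anchor_cat, anchor_pos = scene[anchor_id]
    others = {i: v for i, v in scene.items() if i != anchor_id}
    # answer = category of the object whose centroid is nearest to the anchor
    nearest = min(others, key=lambda i: euclidean(others[i][1], anchor_pos))
    question = f"Which object is closest to the {anchor_cat}?"
    answer = others[nearest][0]
    return question, answer

q, a = relative_distance_qa(scene, 0)
```

Iterating such templates over all object pairs and relation types is one plausible way a scene-graph-based engine could reach the reported 632K QA pairs.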
Loss & Training
- 3D Detection: SpatialLM (MLLM-based) is pretrained on SceneVerse++ and fine-tuned on ScanNet.
- 3D Segmentation: Mask3D is pretrained on SceneVerse++ and fine-tuned on ScanNet.
- Spatial VQA: Qwen2.5-VL-3B/7B is fine-tuned using LoRA on 202K training samples.
- VLN: LLaVA-Video serves as the base model, pretrained on SceneVerse++ and then fine-tuned on R2R.
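The LoRA recipe used for the spatial VQA models injects trainable low-rank updates into frozen linear layers. Below is a minimal self-contained sketch of the mechanism in plain PyTorch, not the paper's training code (which would use a PEFT library on Qwen2.5-VL's attention projections).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # low-rank path adds only r * (in + out) trainable parameters
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
```

Because B starts at zero, the adapted layer initially reproduces the frozen base layer exactly; only the small A/B matrices receive gradients during fine-tuning.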
Key Experimental Results
Main Results
| Dataset/Task | Metric | Ours (SceneVerse++) | Prev. SOTA | Gain |
|---|---|---|---|---|
| ScanNet 3D Detection | F1@.25 (pretrain+finetune) | 58.6 | 38.0 (SpatialLM orig.) | +20.6 |
| ScanNet 3D Detection | F1@.25 (zero-shot) | 30.9 | 29.0 (SpatialLM) | +1.9 |
| ARKitScenes 3D Detection | F1@.25 (zero-shot) | 35.8 | 35.1 (SpatialLM) | +0.7 |
| ScanNet 3D Segmentation | AP25 (pretrain+finetune) | 38.5 | 36.1 (from scratch) | +2.4 |
| VSI-Bench VQA (3B) | Avg Accuracy | 42.8 (SV++ zero-shot) | 27.9 (baseline) | +14.9 |
| VSI-Bench VQA (7B) | Avg Accuracy | 46.4 (SV++ zero-shot) | 36.6 (baseline) | +9.8 |
| R2R VLN | SR (pretrain+finetune) | 0.228 | 0.088 (R2R only) | +0.14 |
Ablation Study
| Configuration | Key Metric | Notes |
|---|---|---|
| Full SceneVerse++ pretrain + R2R finetune | SR 0.228 | Optimal strategy |
| Joint training (R2R + SV++) | SR 0.188 | Direct mixing underperforms pretrain-then-finetune |
| w/o Trajectory Refinement (w/o TR) | SR 0.036 → 0.177 (ft) | Raw trajectories are low-quality; refinement is critical |
| w/o Instruction Enhancement (w/o IE) | SR 0.022 → 0.074 (ft) | Language diversity has a large impact on performance |
| SV++ zero-shot VQA (ARKit subset) | 48.0 (3B) | Approaches annotated SN/SN++ training (49.0) |
Key Findings
- On 3D detection, the real-world distribution prior from SceneVerse++ pretraining leads to large fine-tuning gains (F1@.25 from 38.0 to 58.6).
- On 3D segmentation, Mask3D's reliance on pipeline-specific graph cut results makes it sensitive to domain transfer; SceneVerse++ zero-shot performance is limited, though fine-tuning still yields improvements.
- On spatial VQA, SceneVerse++ yields the largest improvements on general spatial knowledge (relative distance, relative direction) but performs weaker on domain-specific knowledge (object count, room size), reflecting the domain gap.
- For VLN, trajectory refinement and instruction enhancement are both critical data quality factors; raw internet videos cannot be used directly.
- A clear overfitting inflection point exists: all evaluation metrics improve in early training, after which in-domain metrics continue to rise while out-of-domain metrics plateau or decline.
Highlights & Insights
- The paper systematically analyzes the full pipeline from internet videos to 3D scene understanding, rather than naively assembling sub-modules.
- Three representative tasks spanning low-level perception (detection/segmentation) to high-level reasoning (VQA/VLN) provide comprehensive validation.
- The dataset scale is substantial: 6,687 scenes surpassing ARKitScenes, with an average of 49 objects and 21 categories per scene.
- The in-depth discussion of model scalability is valuable: models that depend on pre-computed segmentation (Mask3D) are harder to scale than those operating directly on raw modalities (SpatialLM).
Limitations & Future Work
- The pipeline depends on multiple sub-modules (SfM, depth estimation, segmentation, VLM annotation), and errors from each module cascade through the pipeline.
- Video filtering still requires a small amount of human annotation (<10 seconds/scene) to ensure data quality.
- The 3D segmentation task demonstrates how domain-specific biases can limit model scalability, motivating the need for more robust model architectures.
- Internet videos are predominantly indoor room-tour style, resulting in limited coverage of outdoor or dynamic scenes.
- Sub-modules in the automated data generation pipeline are mostly trained on small-scale, task-specific benchmarks, limiting their generalization capability.
Related Work & Insights
- vs. ScanNet/ScanNet++: High-quality manually collected 3D datasets, but scale is limited (~1.5k scenes for ScanNet); SceneVerse++ acquires 6.7k scenes from the internet via automation at a larger scale, with a trade-off in quality.
- vs. RoomTour3D/NaVILA: Also leverage internet videos, but are restricted to the single task of navigation; SceneVerse++ covers detection, segmentation, VQA, and VLN comprehensively.
- vs. Miao et al.: Uses 2D single-view datasets with estimated depth to generate 3D annotations, but is constrained to existing 2D datasets and only supports single-frame-level processing.
- Insight: Sub-module development should target "supporting robust in-the-wild 3D understanding," evaluating not only task-specific performance but also the contribution to automated data generation pipelines.
Rating
- Novelty: ⭐⭐⭐⭐ The systematic use of internet videos for comprehensive 3D scene understanding is innovative, and the bottleneck analysis is insightful.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three tasks, multiple training strategy comparisons, detailed ablations, and training dynamics analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, in-depth discussion, and honest analysis of the data engine's limitations.
- Value: ⭐⭐⭐⭐⭐ Provides a systematic roadmap and practical guidelines for scaling data in 3D scene understanding.