3EED: Ground Everything Everywhere in 3D

Conference: NeurIPS 2025
arXiv: 2511.01755
Code: https://github.com/worldbench/3EED
Area: Autonomous Driving
Keywords: 3D visual grounding, multi-platform, multimodal, outdoor scenes, cross-platform transfer

TL;DR

This paper introduces 3EED — the first large-scale multi-platform (vehicle, drone, quadruped robot), multimodal (LiDAR + RGB) outdoor 3D visual grounding benchmark, containing over 128K objects and 22K language descriptions, making it 10× larger than existing outdoor datasets. A baseline method incorporating cross-platform alignment, multi-scale sampling, and scale-adaptive fusion is also proposed, revealing substantial performance gaps in cross-platform 3D grounding.

Background & Motivation

3D visual grounding requires models to localize target objects in 3D scenes based on natural language, a core capability for embodied intelligence (navigation, interaction, situational awareness). Existing benchmarks focus almost exclusively on small-scale indoor RGB-D scenes (ScanRefer, Nr3D, etc.) with objects limited to furniture categories, which falls short of real-world demands. The few available outdoor datasets (Talk2Car, KITTI360Pose, etc.) are restricted to a single platform (vehicle-mounted LiDAR), are small in scale, and lack cross-platform diversity. In practice, different embodied agents (autonomous vehicles, drones, quadrupeds) differ substantially in sensor configuration, viewpoint geometry, and point cloud density, making a unified outdoor multi-platform grounding benchmark urgently needed.

Core Problem

  1. Lack of multi-platform outdoor 3D grounding datasets: Existing datasets are either limited to indoor environments or to a single vehicle-mounted platform, precluding evaluation of cross-platform generalization.
  2. Cross-platform domain gap: Different platforms exhibit large variations in viewpoint (top-down/horizontal/upward), LiDAR density, and object scale distribution; indoor methods fail catastrophically when directly transferred to outdoor multi-platform settings.
  3. Annotation efficiency: Large-scale 3D object annotation and language description generation are costly, necessitating efficient semi-automatic pipelines.

Method

Overall Architecture

The 3EED work comprises two main components: dataset construction and baseline method design.

For the dataset, synchronized LiDAR and RGB data are collected from Waymo (vehicle) and M3ED (drone + quadruped). High-quality 3D bounding box annotations are obtained via a three-stage pipeline of multi-detector fusion + tracking + human verification, followed by structured language description generation using Qwen2-VL-72B and human-verified filtering.

For the baseline, the method builds on BUTD-DETR: PointNet++ encodes LiDAR point clouds, a frozen RoBERTa encodes language, and a Transformer decoder predicts 3D boxes. Three additional modules are introduced: Cross-Platform Alignment (CPA), Multi-Scale Sampling (MSS), and Scale-Adaptive Fusion (SAF).

Key Designs

  1. Data Annotation Pipeline:

    • 3D Bounding Box Annotation: For vehicles, official Waymo annotations are used directly. For drones/quadrupeds, pseudo-labels are generated by multiple detectors (PV-RCNN, CenterPoint, etc.) → KDE fusion + 3D multi-object tracking (CTRL) for completion → Tokenize-Anything projection onto RGB for category verification → human refinement. The entire pipeline limits manual effort to approximately 100 seconds per frame.
    • Language Descriptions: 3D boxes are projected onto RGB images; five-slot structured prompts (category/state/location/orientation/spatial relation) are fed into Qwen2-VL-72B to generate descriptions. Platform-agnostic paraphrase rules are applied to unify terminology, followed by human verification and revision by five annotators. All descriptions are written from the observer's perspective (camera viewpoint) to ensure cross-platform consistency.
  2. Cross-Platform Alignment (CPA): Prior to feature extraction, each scene is rotated to align with the gravity direction (eliminating roll/pitch), with additional height-offset normalization applied to drone data. All platforms are placed into a unified gravity-aligned coordinate system, ensuring that spatial relations such as "above/below/behind" are encoded consistently across platforms. This is a one-time geometric normalization step that requires no modification to the network architecture, yet allows the backbone to focus its capacity on object/content feature learning rather than pose correction.
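A minimal numpy sketch of the gravity-alignment idea, assuming roll/pitch angles are available (e.g. from an IMU); the function name, the Ry·Rx composition, and the scalar height offset are illustrative, not the paper's exact implementation:

```python
import numpy as np

def gravity_align(points, roll, pitch, height_offset=0.0):
    """Rotate a point cloud so the z-axis matches gravity, undoing the
    platform's roll/pitch, then subtract a height offset (e.g. drone
    flight altitude). points: (N, 3) array; angles in radians."""
    cr, sr = np.cos(-roll), np.sin(-roll)
    cp, sp = np.cos(-pitch), np.sin(-pitch)
    # Inverse roll (rotation about x), then inverse pitch (rotation about y).
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    aligned = points @ (Ry @ Rx).T
    aligned[:, 2] -= height_offset
    return aligned
```

Because this is a one-time preprocessing transform, it slots in before the backbone with no architectural change.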

  3. Multi-Scale Sampling (MSS): Each PointNet++ layer queries neighborhoods using multiple radii (from 0.6 m to 4.8 m), simultaneously preserving fine local details for nearby objects and broad context for distant sparse objects. This avoids the failure modes of single-radius schemes: small radii leave distant objects with no neighboring points, while large radii over-smooth nearby objects. This directly addresses the LiDAR sparsification problem with increasing distance.
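The multi-radius query can be sketched as below; this is a brute-force stand-in for PointNet++'s CUDA ball query, and `multi_radius_group` plus its padding convention are illustrative:

```python
import numpy as np

def multi_radius_group(points, centers, radii=(0.6, 1.2, 2.4, 4.8), k=16):
    """For each query center, gather up to k neighbor indices within each
    radius. Small radii keep fine detail near the sensor; large radii keep
    distant, sparse objects from ending up with empty neighborhoods.
    Returns a dict radius -> (M, k) index array, padded with the first
    in-range point (PointNet++-style) or -1 if the ball is empty."""
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=-1)  # (M, N)
    groups = {}
    for r in radii:
        idx = np.full((len(centers), k), -1, dtype=int)
        for i, row in enumerate(d):
            in_ball = np.flatnonzero(row <= r)[:k]
            if in_ball.size:
                idx[i, :in_ball.size] = in_ball
                idx[i, in_ball.size:] = in_ball[0]
        groups[r] = idx
    return groups
```

With a single small radius, a distant sparse object's row would be all -1 (no features); with a single large radius, nearby objects would be over-smoothed, which is exactly the failure mode MSS avoids.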

  4. Scale-Adaptive Fusion (SAF): Features computed at all radii are fed into a lightweight MLP to generate per-point dynamic weights, fusing multi-scale features into a single embedding that adaptively emphasizes the radius scale most informative for local geometry. This prevents "wrong-scale" decisions and stabilizes predictions under large density variations across platforms. The parameter and latency overhead is minimal.
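A sketch of the fusion idea, assuming a two-layer MLP scores each scale per point; the weight matrices `W1`/`W2` and the ReLU/softmax choices are assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scale_adaptive_fuse(feats, W1, W2):
    """feats: (N, S, C) per-point features at S radius scales.
    A tiny MLP scores each scale per point; softmax turns the scores
    into per-point weights that blend the S scales into one (N, C)
    embedding, emphasizing the most informative radius locally."""
    h = np.maximum(feats @ W1, 0.0)   # (N, S, H) hidden layer, ReLU
    scores = (h @ W2).squeeze(-1)     # (N, S) one score per scale
    w = softmax(scores, axis=1)       # per-point scale weights, sum to 1
    return (w[:, :, None] * feats).sum(axis=1), w
```

Since the MLP acts per point on already-computed features, the added parameters and latency are indeed minimal.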

Loss & Training

  • Hungarian matching assigns predicted boxes to ground-truth boxes (similar to DETR)
  • Loss combination: box regression L1 + 3D GIoU + token-level classification loss + symmetric contrastive alignment loss (bidirectional query-to-token and token-to-query)
  • Objectness supervision: focal loss, with the 4 nearest points to each ground-truth center designated as positive samples
  • Point clouds are uniformly downsampled to 16,384 points; PointNet++ is trained from scratch; RoBERTa is frozen
  • Learning rate: \(1 \times 10^{-3}\) for the visual encoder, \(1 \times 10^{-4}\) for others; 100 epochs training on 2× RTX 4090
  • Multi-target grounding setting: each target is associated with an independent positive map; Hungarian matching performs one-to-one assignment; training for 200 epochs
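The one-to-one assignment above can be sketched with a brute-force matcher over an L1 cost (DETR-style code typically calls `scipy.optimize.linear_sum_assignment` instead; the cost here omits the GIoU and classification terms for brevity, and `hungarian_match` is an illustrative name):

```python
import numpy as np
from itertools import permutations

def hungarian_match(cost):
    """Exact one-to-one assignment by brute force, fine for the handful
    of targets per scene. cost: (P, G) matrix of pred-to-GT costs.
    Returns [(pred_idx, gt_idx), ...] and the total matched cost."""
    P, G = cost.shape
    best, best_perm = float("inf"), None
    for perm in permutations(range(P), G):  # each GT gets a distinct pred
        c = sum(cost[p, g] for g, p in enumerate(perm))
        if c < best:
            best, best_perm = c, perm
    return [(p, g) for g, p in enumerate(best_perm)], best

def l1_cost(pred_boxes, gt_boxes):
    """Pairwise L1 distance between box parameter vectors."""
    return np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
```

The matched pairs then receive the L1 + GIoU regression losses, while unmatched queries are supervised only by the objectness/classification terms.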

Key Experimental Results

Cross-Platform Grounding (Multi-Platform Joint Training, Acc@25%)

Platform    Metric   Ours    BUTD-DETR   WildRefer   Gain
Vehicle     Acc@25   63.84   –           –           –
Drone       Acc@25   53.45   –           –           –
Quadruped   Acc@25   53.31   –           –           –
Average     Acc@25   –       –           –           +12.29

Cross-Platform Zero-Shot Transfer (Trained on Vehicle Only)

Test Platform           Metric   BUTD-DETR   Ours
Vehicle (in-domain)     Acc@25   52.38       high
Drone (zero-shot)       Acc@25   1.54        substantial gain
Quadruped (zero-shot)   Acc@25   10.18       substantial gain

Multi-Target Grounding (Vehicle Platform)

Method      Acc@25             mIoU
BUTD-DETR   25.40              47.88
Ours        substantial gain   substantial gain

Ablation Study

  • CPA contributes most: Removing CPA drops Vehicle Acc@25 from 80.86 to 71.76 (−9.10), confirming its role as the key component for cross-platform alignment
  • MSS ranks second: Removing MSS drops Vehicle Acc@25 to 75.65 (−5.21), addressing the sparse point cloud problem at long range
  • SAF provides complementary gains: Removing SAF drops Quadruped Acc@25 from 53.31 to 51.98 (−1.33), stabilizing predictions under density variations
  • Scene complexity effect: On Quadruped, Acc@25 drops sharply from 71.23 (1–3 targets) to 30.75 (7–9 targets)
  • Platform characteristics: Drone is the most challenging (only 102 points/object vs. 462 for Vehicle) and has the highest scene density (8.05 objects/scene)

Highlights & Insights

  • First unified multi-platform outdoor 3D grounding benchmark: Covering three fundamentally different embodied viewpoints (vehicle/drone/quadruped), this represents an important infrastructure contribution to the field
  • Elegant annotation pipeline design: The cascaded scheme of multi-detector fusion + tracking + category verification + human refinement balances efficiency and quality; the VLM generation + human filtering approach for language annotation is reusable
  • CPA is simple yet highly effective: A single coordinate-system rotation alignment yields a +9.10 Acc@25 improvement, demonstrating that geometric normalization is a critically underappreciated preprocessing step in outdoor 3D tasks
  • Reveals the substantial challenge of cross-platform grounding: BUTD-DETR's Acc@25 plummets from 52.38 to 1.54 when transferred from Vehicle to Drone, representing near-complete failure, which clearly motivates future research directions

Limitations & Future Work

  • Only two object categories (Vehicle and Pedestrian) are covered, omitting many outdoor object types (traffic signs, cones, etc.)
  • Static scenes: Temporal dynamic modeling and conversational interaction are not addressed
  • LiDAR-only baseline: The proposed baseline uses only LiDAR point clouds without exploiting RGB image information for multimodal fusion
  • Assumption of accurate descriptions: Ambiguous, contradictory, or noisy text inputs are not considered
  • Limited sensor coverage: Only LiDAR + RGB are used; modalities such as thermal imaging and event cameras are not explored
  • Potential extensions: Integrating RGB visual features (CLIP/DINOv2) for multimodal grounding; introducing temporal reasoning; expanding to open-vocabulary categories

Comparison with Related Work

  • vs ScanRefer/Nr3D: Indoor small-scale scenes, dense RGB-D, furniture categories → 3EED provides outdoor large-scale, sparse LiDAR, multi-platform settings, fundamentally broadening the scope of 3D grounding
  • vs Talk2Car/Talk2LiDAR: Single vehicle platform, small scale (a few thousand expressions) → 3EED covers multiple platforms with 22K expressions and 128K objects, an order of magnitude larger
  • vs WildRefer: The most closely related work, also addressing outdoor multi-platform grounding, but 3EED is more comprehensive in data scale, annotation quality, and cross-platform evaluation protocol, and proposes a platform-aware baseline

Takeaways

  • The CPA cross-platform alignment idea can be generalized to other cross-domain 3D perception tasks (e.g., domain-adaptive 3D detection); simple geometric normalization is often more effective than complex network designs
  • The VLM + human verification annotation pipeline is a general-purpose paradigm applicable to the efficient construction of other 3D language datasets
  • Multi-platform grounding is fundamentally a domain generalization problem and can benefit from domain generalization techniques (e.g., style transfer, meta-learning)

Rating

  • Novelty: ⭐⭐⭐⭐ The first multi-platform outdoor 3D grounding benchmark is an important contribution, though the technical novelty of the baseline method is relatively limited
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ The evaluation protocol is comprehensive (in-domain/cross-domain/multi-target/joint training), ablations are thorough, and dataset statistics are detailed
  • Writing Quality: ⭐⭐⭐⭐ The paper is clearly structured with rich figures and tables, though the overall content is lengthy (with a substantial appendix)
  • Value: ⭐⭐⭐⭐ The dataset makes an important contribution to the field and reveals key challenges in cross-platform grounding, though methodological inspiration is limited