3EED: Ground Everything Everywhere in 3D¶
Conference: NeurIPS 2025
arXiv: 2511.01755
Code: https://github.com/worldbench/3EED
Area: Autonomous Driving
Keywords: 3D visual grounding, multi-platform, multimodal, outdoor scenes, cross-platform transfer
TL;DR¶
This paper introduces 3EED, the first large-scale multi-platform (vehicle, drone, quadruped robot), multimodal (LiDAR + RGB) outdoor 3D visual grounding benchmark, containing over 128K objects and 22K language descriptions, roughly 10× larger than existing outdoor datasets. The authors also propose a baseline that incorporates cross-platform alignment, multi-scale sampling, and scale-adaptive fusion, and their experiments reveal substantial performance gaps in cross-platform 3D grounding.
Background & Motivation¶
3D visual grounding requires models to localize target objects in 3D scenes from natural language, a core capability for embodied intelligence (navigation, interaction, situational awareness). Existing benchmarks focus almost exclusively on small-scale indoor RGB-D scenes (ScanRefer, Nr3D, etc.) with objects limited to furniture categories, which falls short of real-world demands. The few available outdoor datasets (Talk2Car, KITTI360Pose, etc.) are restricted to a single platform (vehicle-mounted LiDAR), small in scale, and lack cross-platform diversity. In practice, different embodied agents (autonomous vehicles, drones, quadrupeds) differ substantially in sensor configuration, viewpoint geometry, and point cloud density, making a unified outdoor multi-platform grounding benchmark urgently needed.
Core Problem¶
- Lack of multi-platform outdoor 3D grounding datasets: Existing datasets are either limited to indoor environments or to a single vehicle-mounted platform, precluding evaluation of cross-platform generalization.
- Cross-platform domain gap: Different platforms exhibit large variations in viewpoint (top-down/horizontal/upward), LiDAR density, and object scale distribution; indoor methods fail catastrophically when directly transferred to outdoor multi-platform settings.
- Annotation efficiency: Large-scale 3D object annotation and language description generation are costly, necessitating efficient semi-automatic pipelines.
Method¶
Overall Architecture¶
The 3EED work comprises two main components: dataset construction and baseline method design.
For the dataset, synchronized LiDAR and RGB data are collected from Waymo (vehicle) and M3ED (drone + quadruped). High-quality 3D bounding box annotations are obtained via a three-stage pipeline of multi-detector fusion + tracking + human verification, followed by structured language description generation using Qwen2-VL-72B and human-verified filtering.
For the baseline, the method builds on BUTD-DETR: PointNet++ encodes LiDAR point clouds, a frozen RoBERTa encodes language, and a Transformer decoder predicts 3D boxes. Three additional modules are introduced: Cross-Platform Alignment (CPA), Multi-Scale Sampling (MSS), and Scale-Adaptive Fusion (SAF).
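For orientation, here is a minimal skeleton of how these pieces compose, assuming a Hugging Face-style text encoder (with a `last_hidden_state` output) and a PointNet++-style backbone that emits decoder-width point features; module names, dimensions, and the 7-parameter box head are illustrative, not the released implementation:

```python
import torch
import torch.nn as nn

class Baseline3EED(nn.Module):
    """Skeleton of the BUTD-DETR-style baseline (names/sizes illustrative)."""

    def __init__(self, point_encoder, text_encoder, text_dim=768, dim=288,
                 num_queries=256):
        super().__init__()
        self.point_encoder = point_encoder             # PointNet++ with MSS + SAF
        self.text_encoder = text_encoder               # RoBERTa, kept frozen
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        self.text_proj = nn.Linear(text_dim, dim)      # match decoder width
        self.queries = nn.Embedding(num_queries, dim)  # learned object queries
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.box_head = nn.Linear(dim, 7)              # center, size, heading (assumed)

    def forward(self, points, token_ids, attention_mask):
        pts = self.point_encoder(points)               # (B, Np, dim) point features
        txt = self.text_proj(
            self.text_encoder(token_ids, attention_mask).last_hidden_state
        )                                              # (B, Nt, dim) frozen text features
        memory = torch.cat([pts, txt], dim=1)          # joint visual-language context
        q = self.queries.weight.unsqueeze(0).expand(points.size(0), -1, -1)
        return self.box_head(self.decoder(q, memory))  # (B, Q, 7) box parameters
```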
Key Designs¶
- Data Annotation Pipeline:
  - 3D Bounding Box Annotation: For vehicles, official Waymo annotations are used directly. For drones/quadrupeds, pseudo-labels are generated by multiple detectors (PV-RCNN, CenterPoint, etc.) → KDE fusion + 3D multi-object tracking (CTRL) for completion → Tokenize-Anything projection onto RGB for category verification → human refinement. The entire pipeline limits manual effort to approximately 100 seconds per frame.
  - Language Descriptions: 3D boxes are projected onto RGB images; five-slot structured prompts (category/state/location/orientation/spatial relation) are fed into Qwen2-VL-72B to generate descriptions (a prompt-assembly sketch follows this list). Platform-agnostic paraphrase rules unify terminology, followed by human verification and revision by five annotators. All descriptions are written from the observer's (camera) perspective to ensure cross-platform consistency.
- Cross-Platform Alignment (CPA): Prior to feature extraction, each scene is rotated into a gravity-aligned coordinate frame (eliminating roll/pitch), with an additional height-offset normalization for drone data. All platforms thus share a unified coordinate system, so spatial relations such as "above/below/behind" are encoded consistently across platforms. This one-time geometric normalization requires no change to the network architecture, yet lets the backbone spend its capacity on object/content feature learning rather than pose correction (a minimal sketch follows this list).
- Multi-Scale Sampling (MSS): Each PointNet++ layer queries neighborhoods at multiple radii (0.6 m to 4.8 m), preserving fine local detail for nearby objects and broad context for distant, sparse ones. This avoids the failure modes of single-radius schemes, where small radii leave distant objects with no neighboring points and large radii over-smooth nearby objects, and directly addresses LiDAR sparsification with increasing distance (see the grouping sketch after this list).
- Scale-Adaptive Fusion (SAF): Features from all radii are fed into a lightweight MLP that produces per-point dynamic weights, fusing the multi-scale features into a single embedding that emphasizes whichever radius is most informative for the local geometry. This prevents "wrong-scale" decisions and stabilizes predictions under the large density variations across platforms, at minimal parameter and latency overhead (a module sketch follows this list).
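To make the five-slot prompting concrete, a small sketch is below; the slot names come from the paper, while the template wording and `build_prompt` are hypothetical:

```python
SLOTS = ("category", "state", "location", "orientation", "spatial relation")

def build_prompt(slots):
    """Assemble a five-slot structured prompt for the VLM (template assumed).

    slots: dict mapping each slot name to a short phrase derived from the
    projected 3D box and scene context.
    """
    filled = "; ".join(f"{name}: {slots[name]}" for name in SLOTS)
    return (
        "Describe the highlighted object from the observer's viewpoint, "
        f"using these attributes: {filled}. "
        "Return one concise referring expression."
    )
```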
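A minimal sketch of the gravity alignment in CPA, assuming per-frame roll/pitch are available (e.g., from the platform IMU); sign conventions and the source of the height offset are platform-dependent assumptions here:

```python
import numpy as np

def gravity_align(points, roll, pitch, height_offset=0.0):
    """Rotate a point cloud so its z-axis matches the gravity direction.

    points:        (N, 3) XYZ coordinates in the sensor frame
    roll, pitch:   platform attitude in radians (e.g., from an IMU)
    height_offset: vertical shift, used here to normalize drone altitude
    """
    cr, sr = np.cos(-roll), np.sin(-roll)
    cp, sp = np.cos(-pitch), np.sin(-pitch)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])  # undo roll (about x)
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])  # undo pitch (about y)
    aligned = points @ (Ry @ Rx).T          # row vectors: apply Rx, then Ry
    aligned[:, 2] -= height_offset          # drone height normalization
    return aligned
```

Because this is a one-time preprocessing step, it composes with any backbone unchanged, which is exactly why the ablation gain reported below is notable.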
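The multi-radius grouping behind MSS can be sketched as follows. The 0.6 m and 4.8 m endpoints are stated in the paper; the intermediate radii, the neighbor count `k`, and the brute-force distance search are illustrative simplifications of a PointNet++-style ball query:

```python
import torch

def multi_scale_group(xyz, feats, centers, radii=(0.6, 1.2, 2.4, 4.8), k=32):
    """Gather neighborhood features at several radii (MSG-style sketch).

    xyz:     (N, 3) point coordinates
    feats:   (N, C) per-point features
    centers: (M, 3) query/centroid coordinates
    Returns one (M, k, C) tensor per radius.
    """
    d = torch.cdist(centers, xyz)                    # (M, N) pairwise distances
    grouped = []
    for r in radii:
        masked = d.masked_fill(d > r, float("inf"))  # exclude points outside the ball
        idx = masked.topk(k, largest=False).indices  # k nearest inside the radius
        # Note: in sparse neighborhoods (< k points in the ball) this sketch can
        # still pick outside points; real implementations duplicate the nearest one.
        grouped.append(feats[idx])                   # (M, k, C)
    return grouped
```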
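And a sketch of the SAF gate: a small MLP that produces per-point softmax weights over the sampling scales (layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class ScaleAdaptiveFusion(nn.Module):
    """Fuse per-scale features with per-point weights from a lightweight MLP."""

    def __init__(self, dim, num_scales):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim * num_scales, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, num_scales),   # one logit per sampling radius
        )

    def forward(self, scale_feats):
        # scale_feats: (M, S, C) — one C-dim feature per point and per scale
        m, s, c = scale_feats.shape
        w = self.gate(scale_feats.reshape(m, s * c)).softmax(dim=-1)  # (M, S)
        return (w.unsqueeze(-1) * scale_feats).sum(dim=1)             # (M, C)
```

Since the gate conditions on the per-scale features themselves, the fusion can shift toward larger radii exactly where the cloud thins out, which is the stabilizing effect described above.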
Loss & Training¶
- Hungarian matching assigns predicted boxes to ground-truth boxes one-to-one, as in DETR (a minimal sketch follows this list)
- Loss combination: box regression L1 + 3D GIoU + token-level classification loss + symmetric contrastive alignment loss (bidirectional query-to-token and token-to-query)
- Objectness supervision: focal loss, with the 4 nearest points to each ground-truth center designated as positive samples
- Point clouds are uniformly downsampled to 16,384 points; PointNet++ is trained from scratch; RoBERTa is frozen
- Learning rate: \(1 \times 10^{-3}\) for the visual encoder, \(1 \times 10^{-4}\) for all other parameters; trained for 100 epochs on 2× RTX 4090 GPUs
- Multi-target grounding setting: each target is associated with an independent positive map, Hungarian matching performs one-to-one assignment, and training runs for 200 epochs
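A minimal sketch of the DETR-style assignment and box regression described above; only the L1 term enters the matching cost here, and `giou_3d` is a hypothetical helper standing in for the 3D GIoU term:

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_box_loss(pred_boxes, gt_boxes, l1_w=1.0, giou_w=1.0):
    """One-to-one matching plus box regression (sketch; the classification,
    objectness focal, and contrastive alignment terms are omitted).

    pred_boxes: (Q, 6) boxes as (center_xyz, size_whl)
    gt_boxes:   (G, 6) ground truth in the same parameterization
    """
    cost = torch.cdist(pred_boxes, gt_boxes, p=1)  # (Q, G) L1 matching cost
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    l1 = (pred_boxes[rows] - gt_boxes[cols]).abs().mean()  # regression loss
    # giou = 1.0 - giou_3d(pred_boxes[rows], gt_boxes[cols]).mean()  # hypothetical
    return l1_w * l1  # + giou_w * giou in the full objective
```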
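For the symmetric contrastive alignment term, a common InfoNCE-style formulation (the paper's exact normalization and temperature handling may differ) is

\[
\mathcal{L}_{\text{align}} = -\frac{1}{2|M|} \sum_{(i,j) \in M} \left[ \log \frac{\exp(q_i^\top t_j / \tau)}{\sum_k \exp(q_i^\top t_k / \tau)} + \log \frac{\exp(q_i^\top t_j / \tau)}{\sum_k \exp(q_k^\top t_j / \tau)} \right]
\]

where \(q_i\) are decoder query embeddings, \(t_j\) language token embeddings, \(M\) the set of matched query-token pairs, and \(\tau\) a temperature; the two terms give the bidirectional query-to-token and token-to-query directions.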
Key Experimental Results¶
Cross-Platform Grounding (Multi-Platform Joint Training, Acc@25%)¶
| Platform | Metric | Ours | BUTD-DETR | WildRefer | Gain |
|---|---|---|---|---|---|
| Vehicle | Acc@25 | 63.84 | — | — | — |
| Drone | Acc@25 | 53.45 | — | — | — |
| Quadruped | Acc@25 | 53.31 | — | — | — |
| Average | Acc@25 | — | — | — | +12.29 |
Cross-Platform Zero-Shot Transfer (Trained on Vehicle Only)¶
| Test Platform | Metric | BUTD-DETR | Ours |
|---|---|---|---|
| Vehicle (in-domain) | Acc@25 | 52.38 | High |
| Drone (zero-shot) | Acc@25 | 1.54 | Substantial gain |
| Quadruped (zero-shot) | Acc@25 | 10.18 | Substantial gain |
Multi-Target Grounding (Vehicle Platform)¶
| Method | Acc@25 | mIoU |
|---|---|---|
| BUTD-DETR | 25.40 | 47.88 |
| Ours | Substantial gain | Substantial gain |
Ablation Study¶
- CPA contributes most: Removing CPA drops Vehicle Acc@25 from 80.86 to 71.76 (−9.10), confirming its role as the key component for cross-platform alignment
- MSS ranks second: Removing MSS drops Vehicle Acc@25 to 75.65 (−5.21), addressing the sparse point cloud problem at long range
- SAF provides complementary gains: Removing SAF drops Quadruped Acc@25 from 53.31 to 51.98 (−1.33), stabilizing predictions under density variations
- Scene complexity effect: On Quadruped, Acc@25 drops sharply from 71.23 (1–3 targets) to 30.75 (7–9 targets)
- Platform characteristics: Drone is the most challenging (only 102 points/object vs. 462 for Vehicle) and has the highest scene density (8.05 objects/scene)
Highlights & Insights¶
- First unified multi-platform outdoor 3D grounding benchmark: Covering three fundamentally different embodied viewpoints (vehicle/drone/quadruped), this represents an important infrastructure contribution to the field
- Elegant annotation pipeline design: The cascaded scheme of multi-detector fusion + tracking + category verification + human refinement balances efficiency and quality; the VLM generation + human filtering approach for language annotation is reusable
- CPA is simple yet highly effective: A single coordinate-system rotation alignment yields a +9.10 Acc@25 improvement, demonstrating that geometric normalization is a critically underappreciated preprocessing step in outdoor 3D tasks
- Reveals the substantial challenge of cross-platform grounding: BUTD-DETR's Acc@25 plummets from 52.38 to 1.54 when transferred from Vehicle to Drone, representing near-complete failure, which clearly motivates future research directions
Limitations & Future Work¶
- Only two object categories (Vehicle and Pedestrian) are covered, omitting many outdoor object types (traffic signs, cones, etc.)
- Static scenes: Temporal dynamic modeling and conversational interaction are not addressed
- LiDAR-only baseline: The proposed baseline uses only LiDAR point clouds without exploiting RGB image information for multimodal fusion
- Assumption of accurate descriptions: Ambiguous, contradictory, or noisy text inputs are not considered
- Limited sensor coverage: Only LiDAR + RGB are used; modalities such as thermal imaging and event cameras are not explored
- Potential extensions: Integrating RGB visual features (CLIP/DINOv2) for multimodal grounding; introducing temporal reasoning; expanding to open-vocabulary categories
Related Work & Insights¶
- vs ScanRefer/Nr3D: Indoor small-scale scenes, dense RGB-D, furniture categories → 3EED provides outdoor large-scale, sparse LiDAR, multi-platform settings, fundamentally broadening the scope of 3D grounding
- vs Talk2Car/Talk2LiDAR: Single vehicle platform, small scale (a few thousand expressions) → 3EED covers multiple platforms with 22K expressions and 128K objects, an order of magnitude larger
- vs WildRefer: The most closely related work, also addressing outdoor multi-platform grounding, but 3EED is more comprehensive in data scale, annotation quality, and cross-platform evaluation protocol, and proposes a platform-aware baseline
Takeaways¶
- The CPA cross-platform alignment idea can be generalized to other cross-domain 3D perception tasks (e.g., domain-adaptive 3D detection); simple geometric normalization is often more effective than complex network designs
- The VLM + human verification annotation pipeline is a general-purpose paradigm applicable to the efficient construction of other 3D language datasets
- Multi-platform grounding is fundamentally a domain generalization problem and can benefit from domain generalization techniques (e.g., style transfer, meta-learning)
Rating¶
- Novelty: ⭐⭐⭐⭐ The first multi-platform outdoor 3D grounding benchmark is an important contribution, though the technical novelty of the baseline method is relatively limited
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ The evaluation protocol is comprehensive (in-domain/cross-domain/multi-target/joint training), ablations are thorough, and dataset statistics are detailed
- Writing Quality: ⭐⭐⭐⭐ The paper is clearly structured with rich figures and tables, though the overall content is lengthy (with a substantial appendix)
- Value: ⭐⭐⭐⭐ The dataset makes an important contribution to the field and reveals key challenges in cross-platform grounding, though methodological inspiration is limited