WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images¶
Conference: ICCV 2025 · arXiv: 2503.08407 · Code: Coming soon · Area: 3D Vision · Keywords: 3D segmentation, feed-forward, SAM2, global alignment, real-time interaction
TL;DR¶
This paper proposes WildSeg3D, the first feed-forward 3D segmentation model that requires no scene-specific training. It addresses multi-view pointmap alignment errors via Dynamic Global Alignment (DGA) and achieves real-time interactive 3D segmentation through Multi-view Group Mapping (MGM), outperforming the current state of the art in accuracy while being 40× faster.
Background & Motivation¶
Interactive 3D segmentation (segmenting 3D objects from 2D images) has broad applications in VR/AR, real-time interactive systems, and automatic annotation.
Common bottlenecks of existing methods:
NeRF-based methods (SA3D, SANeRF-HQ): integrate SAM for 3D segmentation, but NeRF requires extensive scene-specific training time.
3DGS-based methods (SAGA, Gaussian Grouping, Feature3DGS): faster than NeRF, yet still require training to construct Gaussian feature fields.
Shared limitation: all methods rely on scene-specific training to acquire accurate 3D priors, severely hindering real-time applicability.
Key challenge: Feed-forward approaches (e.g., DUSt3R/MASt3R) can bypass scene-specific training, but 3D alignment errors across multi-view pointmaps accumulate, causing target objects to be confused with the background and degrading segmentation accuracy.
Method¶
Overall Architecture¶
A three-stage pipeline:

1. 2D mask preprocessing (offline): SAM2 panoptic segmentation → multi-view tracking → mask caching.
2. Dynamic Global Alignment (DGA): dynamic weight adjustment → optimized multi-view pointmap alignment.
3. Multi-view Group Mapping (MGM, real-time): user prompt → retrieve mask cache → map to 3D space.
Dynamic Global Alignment (DGA)¶
Standard global alignment (as in MASt3R) treats all pixels equally, so large background variation and target occlusion cause misalignment. DGA introduces three innovations (illustrated after the list):

1. Soft mask + confidence aggregation: fuses SAM2 masks with pointmap confidence scores.
2. Dynamic adjustment function: assigns greater weight to hard-to-match points (confidence ≈ 0.5), where \(\alpha\) takes a positive value \(\alpha_p\) for matched points and a negative value \(-\alpha_n\) for unmatched points.
3. Optimized alignment loss: the global-alignment objective is re-weighted by these per-point weights, so that well-matched target points dominate the optimization.
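The paper's exact formulas are not reproduced in this note; as a stand-in, here is one plausible instantiation consistent with the description above. The notation (\(w_i\) for the weight, \(c_i \in [0,1]\) for the pointmap confidence, \(m_i\) for the soft-mask value of point \(i\)) is assumed, and the true functional form may differ:

\[
w_i = m_i \,\exp\!\big(\alpha\,(1 - |2c_i - 1|)\big),
\qquad
\alpha =
\begin{cases}
\alpha_p & \text{if point } i \text{ is matched},\\
-\alpha_n & \text{if point } i \text{ is unmatched}.
\end{cases}
\]

The exponent peaks at \(c_i = 0.5\), so ambiguous matched points are up-weighted while ambiguous unmatched points are suppressed. The optimized loss would then be a re-weighted global-alignment energy in the DUSt3R/MASt3R style,

\[
\mathcal{L}_{\mathrm{DGA}} = \sum_{e \in \mathcal{E}} \sum_{i} w_{e,i}\, \big\lVert \chi_{e,i} - \sigma_e P_e x_{e,i} \big\rVert,
\]

with \(\chi\) the globally aligned pointmaps and \(\sigma_e\), \(P_e\) the per-view scale and pose being optimized.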
Multi-view Group Mapping (MGM)¶
At run time, the user provides a prompt in one view; the system retrieves that object's masks for all views from the mask cache, applies the transformations estimated by DGA to map them into the aligned 3D space, and returns the result within 5–20 ms.
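A minimal sketch of this lookup, assuming a dict-based cache and per-view similarity transforms from DGA; the function name, cache layout, and transform format are hypothetical stand-ins (the official code is not yet released), not the authors' actual API:

```python
import numpy as np

def mgm_query(click_xy, view_id, mask_cache, pointmaps, poses, scales):
    """Map a single-view click to a 3D segmentation via cached masks.

    mask_cache[obj_id][view]  -> HxW boolean mask (built offline by SAM2)
    pointmaps[view]           -> HxWx3 per-pixel 3D points (e.g., from MASt3R)
    poses[view]               -> 4x4 rigid transform estimated by DGA
    scales[view]              -> per-view scale estimated by DGA
    """
    x, y = click_xy
    # 1. Resolve the clicked pixel to a tracked object id.
    obj_id = next(oid for oid, views in mask_cache.items() if views[view_id][y, x])
    # 2. Lift that object's cached mask in every view into the aligned frame.
    points = []
    for v, mask in mask_cache[obj_id].items():
        local = pointmaps[v][mask]                     # (N, 3) view-frame points
        R, t = poses[v][:3, :3], poses[v][:3, 3]
        points.append(scales[v] * (local @ R.T) + t)   # similarity transform
    return obj_id, np.concatenate(points, axis=0)      # fused 3D segment
```

Because the masks are precomputed and the transforms are fixed after DGA, the query reduces to dictionary lookups plus a few matrix multiplies, which is what makes millisecond responses plausible.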
Mask Caching Mechanism¶
Offline processing leverages SAM2's video tracking capability (a sketch follows the list):

1. Perform panoptic segmentation on a single view.
2. SAM2 tracking propagates the object masks to all other views.
3. Results are stored in an offline cache and queried directly at run time.
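A hedged sketch of this caching step, written against the public SAM2 video-predictor API (facebookresearch/sam2). Treating the multi-view image sequence as a "video", the seed-prompt format, and the config/checkpoint paths are all assumptions for illustration, not the paper's released code:

```python
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

def build_mask_cache(frames_dir, seed_prompts):
    """seed_prompts: {obj_id: (points, labels)} on view 0 (panoptic seeds)."""
    cache = {}  # cache[obj_id][view_idx] -> HxW boolean mask
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        state = predictor.init_state(video_path=frames_dir)
        for obj_id, (points, labels) in seed_prompts.items():
            predictor.add_new_points_or_box(
                state, frame_idx=0, obj_id=obj_id, points=points, labels=labels
            )
        # SAM2 tracking propagates each seed mask through all views.
        for view_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
            for i, obj_id in enumerate(obj_ids):
                mask = (mask_logits[i] > 0.0).squeeze(0).cpu().numpy()
                cache.setdefault(obj_id, {})[view_idx] = mask
    return cache  # queried directly at run time (see the MGM sketch above)
```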
Experiments¶
Main Results on NVOS Dataset¶
| Method | Scene Training | mIoU (%) | mAcc (%) | Total Time |
|---|---|---|---|---|
| NVOS | Yes | 70.1 | 92.0 | - |
| SA3D | Yes | 90.3 | 98.2 | 780s |
| SAGA | Yes | 90.9 | 98.3 | 2280s |
| OmniSeg3D | Yes | 91.7 | 98.4 | 8220s |
| FlashSplat | Yes | 91.8 | 98.6 | 1500s |
| WildSeg3D | No | 94.1 | 99.0 | 30s |
WildSeg3D requires no scene-specific training yet surpasses all training-based methods in accuracy while achieving more than 40× speedup.
Efficiency Comparison¶
| Metric | SA3D | SAGA | OmniSeg3D | WildSeg3D |
|---|---|---|---|---|
| Scene reconstruction time | 780s | 2280s | 8220s | <30s |
| Interaction response time | Seconds | Seconds | Seconds | 5–20ms |
Ablation Study¶
| Ablation | Key Findings |
|---|---|
| Standard alignment vs. DGA | DGA significantly improves target object alignment and reduces background interference |
| Effect of soft mask | Soft masks focus attention on target regions, suppressing negative effects of background features on alignment |
| Dynamic adjustment function | Assigns higher weights to ambiguous points near the decision boundary (confidence ≈ 0.5) |
| Mask caching | Offline preprocessing eliminates online segmentation during interaction, reducing response time to milliseconds |
Key Findings¶
- Surpassing training-based methods without training: with sufficiently accurate alignment, the feed-forward pipeline matches or exceeds methods that depend on scene-specific training.
- 40×+ acceleration: total time drops to about 30s, versus 1500–8220s for SAGA, OmniSeg3D, and FlashSplat (780s for SA3D), with 5–20ms interactive response.
- Robust generalization: operates across diverse scenes without adaptation.
Highlights & Insights¶
- Feed-forward paradigm shift: completely eliminates scene-specific training, making 3D segmentation practically deployable.
- Core idea of DGA: improving alignment quality by dynamically attending to "hard" points — low-confidence points typically correspond to object boundaries or occluded regions.
- Offline–online decoupling: SAM2 segmentation is performed and cached offline; at runtime, only 3D mapping lookup is required, enabling millisecond-level response.
Limitations & Future Work¶
- Relies on the quality of pointmap prediction from MASt3R; alignment may fail in texture-less regions.
- The quality of SAM2 panoptic segmentation directly affects final results.
- Alignment accuracy may degrade under sparse viewpoint settings.
Related Work & Insights¶
- 3D reconstruction: NeRF, 3DGS, DUSt3R, MASt3R
- Interactive 3D segmentation: SA3D, SAGA, Feature3DGS, FlashSplat
- Foundation models: SAM, SAM2
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First feed-forward 3D segmentation model; a paradigm-level innovation
- Technical depth: ⭐⭐⭐⭐ — The DGA dynamic adjustment function is theoretically grounded
- Experimental thoroughness: ⭐⭐⭐⭐ — Comprehensive speed/accuracy comparisons with clear ablations
- Value: ⭐⭐⭐⭐⭐ — 30s reconstruction + millisecond-level interaction; genuinely deployable