SAM4D: Segment Anything in Camera and LiDAR Streams¶
Conference: ICCV 2025 arXiv: 2506.21547 Code: SAM4D-Project.github.io Area: Autonomous Driving Keywords: Multimodal Segmentation, Foundation Model, Camera-LiDAR Fusion, Temporal Segmentation, SAM
TL;DR¶
This paper presents SAM4D, the first promptable multimodal segmentation foundation model for camera and LiDAR streams. It introduces Unified Multimodal Positional Encoding (UMPE) to enable cross-modal prompting and interaction, Motion-aware Cross-Modal Attention (MCMA) for temporal consistency, and constructs the Waymo-4DSeg dataset containing 300K+ masklets, demonstrating strong capabilities in cross-modal segmentation and data annotation.
Background & Motivation¶
Problem Definition¶
In autonomous driving, cameras and LiDAR complement each other's limitations (e.g., low-light conditions, depth accuracy), making robust multimodal perception essential. Existing segmentation models are restricted to a single modality (image or point cloud) and typically operate on individual frames, failing to exploit cross-modal spatial consistency and temporal continuity.
Limitations of Prior Work¶
SAM/SAM2: Designed solely for image/video segmentation; no support for LiDAR or other sensor modalities.
LiDAR segmentation methods (SAL, PointSAM): Build SAM-like models directly on point clouds but are limited to a single modality.
Projection-based methods (CLIP2Scene, etc.): Project 2D segmentation results into 3D, but are constrained by sensor viewpoint discrepancies and synchronization issues.
Multimodal perception methods (BEVFusion, etc.): Produce 3D predictions only, lacking cross-modal interaction and unified 2D-3D segmentation.
Frame-level LiDAR segmentation: Does not leverage LiDAR's precise depth for temporal feature association.
Core Problem¶
A unified multimodal temporal segmentation framework is needed that can:

- Simultaneously generate segmentation masks in both camera and LiDAR modalities
- Support cross-modal prompting (e.g., using image clicks to guide LiDAR segmentation)
- Maintain temporal consistency over long sequences
- Substantially reduce annotation costs for multimodal data
Method¶
Overall Architecture¶
SAM4D extends SAM2 to the multimodal domain, consisting of four main components:

1. Multimodal Feature Extraction: Image encoder (Hiera) + LiDAR encoder (MinkUNet)
2. Unified Multimodal Positional Encoding (UMPE): Aligns image and LiDAR features in a shared 3D space
3. Motion-aware Cross-Modal Attention (MCMA): Cross-modal fusion with ego-motion compensated temporal attention
4. Mask Decoder: Simultaneously outputs 2D and 3D segmentation masks
Key Designs¶
1. Unified Multimodal Positional Encoding (UMPE)¶
- Function: Aligns the positional representations of image patch tokens and LiDAR voxel tokens in a shared 3D space.
- Mechanism:
UMPE consists of two complementary components: (i) modality-specific positional priors; (ii) shared 3D spatial representations.
Image Positional Encoding:

- 2D sinusoidal positional encoding preserves image plane structure: \(\mathcal{P}_{\text{img\_sin}} = \text{SinPE2D}(u, v)\)
- Pixels are lifted to 3D space via depth estimation (analogous to Lift-Splat-Shoot): \(\mathbf{x}_{\text{img}} = T_c^l K^{-1} [u \cdot D(u,v), v \cdot D(u,v), D(u,v), 1]^T\)
- An MLP encodes the 3D position: \(\mathcal{P}_{\text{img\_mlp}} = \text{MLP}(\mathbf{x}_{\text{img}})\)
LiDAR Positional Encoding:

- 3D sinusoidal positional encoding: \(\mathcal{P}_{\text{LiDAR\_sin}} = \text{SinPE3D}(x, y, z)\)
- The shared MLP encodes the 3D position: \(\mathcal{P}_{\text{LiDAR\_mlp}} = \text{MLP}(\mathbf{x}_{\text{LiDAR}})\)
The final positional encodings \(\mathcal{P}_{\text{img}}\) and \(\mathcal{P}_{\text{LiDAR}}\) each consist of both components, ensuring cross-modal positional alignment.
- Design Motivation:
- The two-stage encoding captures both modality-specific characteristics (2D structure for images, 3D structure for LiDAR) and cross-modal alignment (shared 3D MLP).
- The shared 3D MLP renders image and LiDAR features comparable in the same space, enabling cross-modal prompting.
- Sparse prompts (points, bounding boxes) use the same two-stage encoding, ensuring prompt-feature spatial consistency.
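The two-component encoding above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the dimensions, the `d_sin` split, and the per-axis sinusoidal handling (which here yields different prior widths for the 2-axis image branch and the 3-axis LiDAR branch) are assumptions for clarity. The key design point it demonstrates is the single `shared_mlp` applied to both image-lifted and native LiDAR 3D coordinates.

```python
import math
import torch
import torch.nn as nn

def sin_pe(coords: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal PE per coordinate axis, concatenated: 2 axes (u, v) for
    image tokens, 3 axes (x, y, z) for LiDAR tokens."""
    freqs = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    angles = coords.unsqueeze(-1) * freqs                 # (..., n_axes, dim/2)
    return torch.cat([angles.sin(), angles.cos()], -1).flatten(-2)

class UMPESketch(nn.Module):
    """Sketch of UMPE: a modality-specific sinusoidal prior concatenated with
    a shared 3D MLP. Sharing the MLP across modalities is what renders image
    and LiDAR positional encodings comparable in one space."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.shared_mlp = nn.Sequential(nn.Linear(3, d_model), nn.ReLU(),
                                        nn.Linear(d_model, d_model))

    def lift_pixels(self, uv, depth, K_inv, T_cam_to_lidar):
        # x_img = T_c^l K^{-1} [u*D, v*D, D]^T, then a homogeneous transform
        # into the LiDAR/ego frame (depth D comes from a depth estimator).
        u, v = uv.unbind(-1)
        cam = torch.stack([u * depth, v * depth, depth], -1) @ K_inv.T
        homo = torch.cat([cam, torch.ones_like(cam[..., :1])], -1)
        return (homo @ T_cam_to_lidar.T)[..., :3]

    def encode_image(self, uv, depth, K_inv, T_cl, d_sin=128):
        p_sin = sin_pe(uv, d_sin)                         # 2D image-plane prior
        p_mlp = self.shared_mlp(self.lift_pixels(uv, depth, K_inv, T_cl))
        return torch.cat([p_sin, p_mlp], -1)

    def encode_lidar(self, xyz, d_sin=128):
        return torch.cat([sin_pe(xyz, d_sin), self.shared_mlp(xyz)], -1)
```

Sparse point prompts would pass through the same two-stage path, which is what keeps prompt and feature encodings spatially consistent.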
2. Motion-aware Cross-Modal Attention (MCMA)¶
- Function: Integrates cross-modal feature fusion with ego-motion-compensated temporal memory attention.
- Mechanism:
Three-stage attention pipeline:
Step 1 — Intra-modal Self-Attention: \(\mathcal{F}'_{\text{img}} = \text{SelfAttn}(\mathcal{F}_{\text{img}} + \mathcal{P}_{\text{img}})\) and \(\mathcal{F}'_{\text{LiDAR}} = \text{SelfAttn}(\mathcal{F}_{\text{LiDAR}} + \mathcal{P}_{\text{LiDAR}})\)

Step 2 — Cross-modal Cross-Attention: \(\mathcal{F}''_{\text{img}} = \text{CrossAttn}(\mathcal{F}'_{\text{img}}, \mathcal{F}'_{\text{LiDAR}} + \mathcal{P}_{\text{LiDAR}})\) and \(\mathcal{F}''_{\text{LiDAR}} = \text{CrossAttn}(\mathcal{F}'_{\text{LiDAR}}, \mathcal{F}'_{\text{img}} + \mathcal{P}_{\text{img}})\)

Step 3 — Ego-motion-compensated Temporal Memory Attention: Historical frame features and positions are transformed via ego-motion to align with the current frame's coordinate system: \(\mathcal{M}_{\text{img}}^{t \leftarrow t'} = \mathcal{M}_{\text{img}}^{t'} + \Phi_{\text{img}}(T_{t \leftarrow t'}(\mathbf{x}_{\text{img}}))\) and \(\mathcal{M}_{\text{LiDAR}}^{t \leftarrow t'} = \mathcal{M}_{\text{LiDAR}}^{t'} + \Phi_{\text{LiDAR}}(T_{t \leftarrow t'}(\mathbf{x}_{\text{LiDAR}}))\)

Current frame features are then fused with the aligned memory features via cross-attention: \(\mathcal{F}_{\text{img}}^{\text{final}} = \text{CrossAttn}(\mathcal{F}''_{\text{img}}, (\mathcal{M}_{\text{img}}^{t \leftarrow t'}, \mathcal{O}_{\text{img}}^{t'}))\)
where \(T_{t \leftarrow t'} \in SE(3)\) is derived from vehicle odometry.
- Design Motivation:
- Unlike SAM2, which only accounts for short-range motion, MCMA explicitly incorporates ego-motion compensation to handle large-scale scene changes in autonomous driving.
- The ego-motion transformation spatially aligns historical frame features with the current frame, preventing feature misalignment caused by vehicle movement.
- The memory bank uses a FIFO queue storing \(N\) non-prompt frames and \(M\) prompt frames separately, ensuring critical frames are retained.
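The three-stage pipeline can be sketched with standard multi-head attention. This is an illustrative skeleton under simplifying assumptions (single batch of dense tokens, one memory frame, plain `nn.MultiheadAttention` without the residual/norm structure a real block would have; `pos_mlp` stands in for \(\Phi\)); only the image memory path of Step 3 is shown. The point it makes concrete is warping historical token positions with \(T_{t \leftarrow t'} \in SE(3)\) before re-encoding them, so memory and current tokens share one coordinate frame.

```python
import torch
import torch.nn as nn

class MCMASketch(nn.Module):
    """Sketch of the three-stage MCMA pipeline. Shapes: (batch, tokens, dim)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        mha = lambda: nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_img, self.self_lid = mha(), mha()
        self.cross_img, self.cross_lid = mha(), mha()
        self.mem_img = mha()

    def forward(self, f_img, p_img, f_lid, p_lid, mem_feats, mem_xyz, T_ego, pos_mlp):
        # Step 1: intra-modal self-attention on features + positional encodings.
        q = f_img + p_img
        f_img, _ = self.self_img(q, q, q)
        q = f_lid + p_lid
        f_lid, _ = self.self_lid(q, q, q)
        # Step 2: cross-modal cross-attention — queries from one modality,
        # keys/values from the other (with its positional encoding).
        f_img2, _ = self.cross_img(f_img, f_lid + p_lid, f_lid + p_lid)
        f_lid2, _ = self.cross_lid(f_lid, f_img + p_img, f_img + p_img)
        # Step 3: ego-motion-compensated memory attention. Warp stored 3D token
        # positions into the current frame with T_{t<-t'} (from odometry), then
        # re-encode them, so memory keys are spatially aligned with the query.
        homo = torch.cat([mem_xyz, torch.ones_like(mem_xyz[..., :1])], -1)
        warped = (homo @ T_ego.T)[..., :3]
        mem = mem_feats + pos_mlp(warped)
        f_img3, _ = self.mem_img(f_img2, mem, mem)
        return f_img3, f_lid2
```

Without the Step-3 warp, a fast-moving ego vehicle would leave memory features pointing at stale 3D locations, which is exactly the misalignment the ablation (LiDAR NMP 746 → 582) measures.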
3. Multimodal Automatic Data Engine¶
- Function: Automatically generates high-quality camera-LiDAR aligned pseudo-labels to construct the Waymo-4DSeg dataset.
- Mechanism:
Three-step pipeline:

1. VFM-driven video masklet generation: Grounding-DINO detects objects in keyframes and SAM segments them; SAM2 then propagates the masks to intermediate frames.
2. 4D voxel reconstruction: LiDAR frames and 3D bounding boxes are used to build a 4D voxel representation, establishing a pixel-to-voxel mapping table.
3. Cross-modal label fusion: Video masklets are projected onto voxels via the mapping table; DBSCAN filters noise; overlapping masklets across views are merged; labels are finally transferred to LiDAR frames.
Result: cross-modal IoU of 0.56.
- Design Motivation:
- No existing dataset simultaneously supports 2D and 3D segmentation while guaranteeing temporal instance consistency.
- The strong zero-shot capabilities of VFMs (SAM, Grounding-DINO) enable automatic generation of high-quality labels.
- 4D reconstruction serves as an intermediate bridge connecting 2D image labels and 3D point cloud labels.
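The label-fusion step (step 3) can be illustrated with a small voting sketch. Everything here is an assumption for illustration: the function name, the `min_votes` threshold, and majority voting as the merge rule are not specified by the paper, which uses DBSCAN and cross-view merging on top of the pixel-to-voxel mapping. The sketch only shows the core idea of transferring 2D masklet IDs to voxels through a precomputed mapping while suppressing sparsely supported labels.

```python
import numpy as np

def fuse_labels_to_voxels(pixel_labels, pixel_to_voxel, num_voxels, min_votes=3):
    """Hypothetical sketch of cross-modal label fusion: each mapped pixel casts
    a vote for its masklet ID (0 = background) at its target voxel; a voxel
    keeps the majority label if it has enough support, else stays unlabeled."""
    voxel_labels = np.zeros(num_voxels, dtype=np.int64)
    for vox in np.unique(pixel_to_voxel):
        votes = pixel_labels[pixel_to_voxel == vox]
        votes = votes[votes > 0]                  # ignore background pixels
        if votes.size == 0:
            continue
        ids, counts = np.unique(votes, return_counts=True)
        if counts.max() >= min_votes:             # drop weakly supported labels
            voxel_labels[vox] = ids[counts.argmax()]
    return voxel_labels
```

A thresholded vote like this is one simple way to keep projection noise (the viewpoint and synchronization issues noted earlier) from contaminating the 3D labels before they are transferred back to LiDAR frames.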
Loss & Training¶
- Identical loss functions are applied to image and LiDAR predictions to enforce cross-modal consistency.
- Training simulates an interactive prompting process (analogous to SAM2's strategy).
- The model is trained for 36 epochs on 16 A100 GPUs, processing at most 6 objects per iteration.
- The image encoder uses Hiera-S with SA-V pretraining; the LiDAR encoder uses Mink-34.
- Image resolution is 768×768; LiDAR voxel size is 0.15m.
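The "identical loss on both modalities" idea can be sketched as follows. The BCE + Dice mix is an assumption borrowed from the SAM family's common practice, not a detail confirmed by the paper; the function names are illustrative. What the sketch shows is the structural point: one loss definition, applied unchanged to 2D pixel masks and 3D point masks, so neither modality is privileged during training.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, targets, eps=1.0):
    """Soft Dice loss over flattened per-object mask logits."""
    probs = logits.sigmoid().flatten(1)
    targets = targets.flatten(1)
    inter = (probs * targets).sum(-1)
    return (1 - (2 * inter + eps) / (probs.sum(-1) + targets.sum(-1) + eps)).mean()

def multimodal_mask_loss(img_logits, img_gt, lid_logits, lid_gt):
    """The same loss term applied to image masks (B, H, W) and to LiDAR
    point/voxel masks (B, N) — enforcing cross-modal consistency by symmetry."""
    def per_modality(logits, gt):
        return F.binary_cross_entropy_with_logits(logits, gt) + dice_loss(logits, gt)
    return per_modality(img_logits, img_gt) + per_modality(lid_logits, lid_gt)
```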
Key Experimental Results¶
Main Results¶
Cross-modal frame-level segmentation (Image-Prioritized Prompting):
| Prompt Type | Image mIoU↑ | LiDAR mIoU↑ |
|---|---|---|
| 1-click | 68.0% | 42.3% |
| 3-click | 73.6% | 53.1% |
| Bounding Box | 74.7% | 47.0% |
Semi-supervised streaming segmentation (first-frame prompt → sequence propagation):
| Prompt Type | Image mIoU↑ | Image J&F↑ | Image NMP↓ | LiDAR mIoU↑ | LiDAR NMP↓ |
|---|---|---|---|---|---|
| 1-click | 61.4% | 72.2 | 398 | 50.1% | 784 |
| 3-click | 65.6% | 76.3 | 327 | 52.8% | 711 |
| 5-click | 67.1% | 77.7 | 315 | 52.6% | 702 |
| GT mask | 69.8% | 80.1 | 280 | 55.7% | 582 |
Cross-dataset generalization (nuScenes, semi-supervised streaming segmentation):
| Setting | Image mIoU↑ | J&F↑ | LiDAR mIoU↑ |
|---|---|---|---|
| Zero-shot | 58.4% | 65.8 | 25.9% |
| Fine-tuned | 67.5% | 75.4 | 44.8% |
Ablation Study¶
Input modality ablation:
| Configuration | Image mIoU↑ | Image J&F↑ | Image NMP↓ | LiDAR mIoU↑ | LiDAR NMP↓ |
|---|---|---|---|---|---|
| SAM2 + Projection | 68.2% | 79.7 | 383 | 32.0% | - |
| SAM4D-Camera Only | 68.6% | 80.4 | 301 | - | - |
| SAM4D-LiDAR Only | - | - | - | 47.0% | 799 |
| SAM4D (Full) | 69.8% | 80.1 | 280 | 55.7% | 582 |
Ego-motion compensation ablation:
| Configuration | Image mIoU↑ | LiDAR mIoU↑ | LiDAR NMP↓ |
|---|---|---|---|
| Without Ego-motion | 69.7% | 52.2% | 746 |
| With Ego-motion | 69.8% | 55.7% | 582 |
Key Findings¶
- Cross-modal prompting is effective: Prompting on images yields 53.1% LiDAR mIoU, demonstrating that UMPE successfully achieves cross-modal alignment.
- Multimodal fusion substantially improves LiDAR: LiDAR mIoU improves from 47.0% (single-modal) to 55.7% (multimodal), a gain of 8.7 percentage points, indicating that image semantic information provides significant benefit to point cloud segmentation.
- SAM2 + Projection achieves only 32.0% LiDAR mIoU: Confirming that simple projection cannot address cross-modal segmentation; deep fusion is required.
- Ego-motion compensation primarily benefits LiDAR: LiDAR NMP decreases from 746 to 582 (−22%), suggesting that LiDAR's sparsity makes it more sensitive to spatial alignment.
- Reasonable zero-shot performance on nuScenes: Image mIoU of 58.4% demonstrates a degree of generalization capability.
- Data engine efficiency: An average of 300 masklets are generated per clip with a cross-modal IoU of 0.56, far exceeding the throughput of manual annotation.
Highlights & Insights¶
- First unified promptable segmentation model for camera + LiDAR: Fills the gap in multimodal segmentation foundation models.
- Innovation in cross-modal prompting: Prompts from one modality guide segmentation in the other, substantially improving annotation efficiency.
- Elegant design of UMPE: Lifts image features into 3D space via depth estimation, enabling both modalities to interact in a shared space.
- Engineering value of the data engine: Combining VFMs, 4D reconstruction, and cross-modal fusion, it constructs large-scale high-quality pseudo-labels.
- Waymo-4DSeg dataset: 300K+ masklets covering vehicles, pedestrians, buildings, and other categories, with object sizes ranging from 10 voxels to 200K voxels.
Limitations & Future Work¶
- Dependence on depth estimation quality: The image-to-3D lifting in UMPE relies on depth estimation; depth errors degrade cross-modal alignment accuracy.
- Remaining gap in LiDAR performance: LiDAR mIoU of 55.7% is substantially lower than image mIoU of 69.8%, indicating that stronger feature extraction for point clouds is still needed.
- Cross-modal IoU of only 0.56: Label quality from the data engine still has room for improvement.
- Training limited to Waymo: Although nuScenes generalization is evaluated, training data diversity is limited.
- Downstream task impact not evaluated: The transferability of SAM4D annotations to tasks such as 3D detection or trajectory prediction has not been verified.
Related Work & Insights¶
- vs. SAM2: SAM2 supports video segmentation only; SAM4D extends to camera + LiDAR multimodal streams.
- vs. PointSAM/SAL: These methods build promptable segmentation solely on point clouds; SAM4D unifies 2D and 3D.
- vs. BEVFusion et al.: Multimodal perception methods produce 3D predictions; SAM4D simultaneously outputs 2D and 3D segmentation masks.
- The Lift-Splat-Shoot paradigm is adopted within UMPE to lift 2D features into 3D space.
- The data engine design (VFM → 4D reconstruction → cross-modal fusion) offers a new paradigm for autonomous driving data bootstrapping.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The first promptable segmentation foundation model unifying camera and LiDAR streams; both the task formulation and model design represent significant advances.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple evaluation settings with thorough ablations, though LiDAR segmentation baseline comparisons are limited.
- Writing Quality: ⭐⭐⭐⭐ — The three contributions (task, model, data) are presented with a clear and well-organized structure.
- Value: ⭐⭐⭐⭐⭐ — Substantial potential impact on multimodal data annotation efficiency; represents an important extension of the SAM family to autonomous driving.