4D Contrastive Superflows are Dense 3D Representation Learners

Conference: ECCV 2024
arXiv: 2407.06190
Code: https://github.com/Xiangxu-0103/SuperFlow
Area: 3D Vision / Self-Supervised Pre-training / Autonomous Driving
Keywords: LiDAR semantic segmentation, 3D representation learning, contrastive learning, spatiotemporal consistency, cross-sensor distillation

TL;DR

SuperFlow builds 4D pre-training objectives from consecutive LiDAR-camera pairs through three modules: view consistency alignment, dense-sparse consistency regularization, and flow-based spatiotemporal contrastive learning. It outperforms prior image-to-LiDAR pre-training methods across 11 heterogeneous LiDAR datasets.

Background & Motivation

Training LiDAR-based 3D perception models for autonomous driving relies heavily on large-scale manual annotations, which are significantly more expensive than 2D labels. Representation learning (pre-training) is a key way to reduce this cost. Image-to-LiDAR distillation methods, represented by SLidR, show promising results by transferring knowledge from pre-trained 2D backbones to 3D backbones. However, existing methods have two blind spots: (1) they neglect the temporal nature of LiDAR data, treating each frame as an independent snapshot and discarding the motion and semantic consistency shared by consecutive scans; (2) they are sensitive to point cloud density, which varies significantly between near and far regions of a LiDAR scan and undermines the consistency of learned features. In addition, existing superpixel generation approaches suffer from a "self-conflict" issue, where objects of the same category are incorrectly treated as negative samples across different viewpoints or even within the same viewpoint.

Core Problem

How can spatiotemporal information in LiDAR sequences be fully utilized to enhance 3D pre-training performance? Specifically, three sub-problems need to be resolved simultaneously: (1) How to eliminate semantic conflicts of superpixels across different views? (2) How to maintain feature consistency under varying point cloud densities? (3) How to extract meaningful temporal cues from consecutive frames to enhance representation learning?

Method

Overall Architecture

The inputs consist of consecutive LiDAR-camera pairs \(\{(P_t, I_t), (P_{t+\Delta t}, I_{t+\Delta t}), (P_{t-\Delta t}, I_{t-\Delta t})\}\). The 2D branch utilizes a frozen DINOv2 to extract image features, while the 3D branch uses MinkUNet to extract point cloud features. Point-to-pixel correspondences are established via LiDAR-camera calibration matrices, followed by contrastive learning based on superpixel/superpoint grouping. The overall goal is to distill semantic knowledge from the 2D network into the 3D network, while utilizing temporal consistency to enhance representation quality.
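
The point-to-pixel correspondence step can be illustrated with a minimal NumPy sketch. It assumes a pinhole camera with intrinsics `K` and a 4x4 LiDAR-to-camera extrinsic `T_cam_from_lidar`; the function name and argument layout are hypothetical and are not taken from the SuperFlow codebase.

```python
import numpy as np

def project_points_to_image(points_lidar, T_cam_from_lidar, K, image_hw):
    """Project LiDAR points (N, 3) into an image.

    Returns pixel coordinates (N, 2) and a boolean mask of points that lie
    in front of the camera and inside the image bounds.
    """
    n = points_lidar.shape[0]
    # Homogeneous coordinates so the 4x4 extrinsic applies in one matmul.
    pts_h = np.concatenate([points_lidar, np.ones((n, 1))], axis=1)  # (N, 4)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]                  # (N, 3)

    depth = pts_cam[:, 2]
    in_front = depth > 1e-6
    # Avoid dividing by zero or negative depth; such points are masked out anyway.
    safe_depth = np.where(in_front, depth, 1.0)

    uvw = (K @ pts_cam.T).T                    # (N, 3)
    uv = uvw[:, :2] / safe_depth[:, None]      # perspective division

    h, w = image_hw
    in_image = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, in_front & in_image
```

Points that pass the mask can then look up the superpixel label at their (rounded) pixel location, which is how a superpixel induces a matching superpoint in this kind of pipeline.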

Key Designs

  1. View Consistency Alignment (VC): Existing methods (superpixels generated by SLIC or VFMs) suffer from three types of "self-conflict": the same object is treated as different instances across views, objects of the same class in the same view are treated as negative samples, and objects of the same class across views are treated as negative samples. SuperFlow addresses this by fine-tuning the classification head of VFMs using CLIP's text encoder to generate semantic-level (rather than instance-level) superpixels, thereby unifying the superpixel labels of the same category across all camera views. This serves as a simple yet effective plug-and-play module.

  2. Dense-Sparse Consistency Regularization (D2S): Multiple LiDAR sweeps within a temporal window are concatenated onto the coordinate system of the current keyframe through coordinate transformations, forming a dense point cloud \(P_d\). The dense and sparse point clouds are respectively fed into a weight-shared 3D network to extract features, which are then average-pooled based on superpoint grouping to obtain two sets of superpoint features \(Q_d\) and \(Q_t\). The D2S loss constrains the consistency (cosine similarity) between the two, forcing the model to learn features that are invariant to density variations.

  3. Flow-based Contrastive Learning (FCL): This module has two sub-objectives. Spatial contrastive learning (\(\mathcal{L}_{sc}\)) is the standard image-to-LiDAR superpixel-superpoint contrast (InfoNCE) within each timestamp. Temporal contrastive learning (\(\mathcal{L}_{tc}\)) contrasts superpoint features across timestamps so that the same category keeps consistent semantics across frames. Together they extend distillation from single frames to joint spatiotemporal distillation.
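
The D2S regularization described in item 2 can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes per-point features and integer superpoint ids are already available for both the dense and the sparse cloud, and it assumes the consistency term takes the form `1 - cosine similarity` (the summary only states that cosine similarity is constrained). The function names are hypothetical.

```python
import numpy as np

def superpoint_pool(point_feats, superpoint_ids, num_superpoints):
    """Average-pool per-point features (N, C) into superpoint features (S, C).

    Also returns a mask marking superpoints that contain at least one point.
    """
    c = point_feats.shape[1]
    sums = np.zeros((num_superpoints, c))
    # Unbuffered scatter-add: rows sharing a superpoint id are accumulated.
    np.add.at(sums, superpoint_ids, point_feats)
    counts = np.bincount(superpoint_ids, minlength=num_superpoints)
    pooled = sums / np.maximum(counts, 1)[:, None]
    return pooled, counts > 0

def d2s_loss(q_dense, q_sparse, valid, eps=1e-8):
    """Mean (1 - cosine similarity) over superpoints valid in both clouds.

    The (1 - cos) form is an assumption made for this sketch.
    """
    if not valid.any():
        return 0.0
    qd = q_dense / (np.linalg.norm(q_dense, axis=1, keepdims=True) + eps)
    qs = q_sparse / (np.linalg.norm(q_sparse, axis=1, keepdims=True) + eps)
    cos = np.sum(qd * qs, axis=1)
    return float(np.mean(1.0 - cos[valid]))
```

In use, `superpoint_pool` would be called once on the features of the dense cloud \(P_d\) and once on those of the sparse keyframe, and the loss would be evaluated on superpoints that are non-empty in both (the logical AND of the two masks).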

Loss & Training

  • Total loss = \(\mathcal{L}_{sc}\) (spatial contrast) + \(\mathcal{L}_{tc}\) (temporal contrast) + \(\mathcal{L}_{d2s}\) (dense-sparse consistency)
  • Both spatial and temporal contrastive learning use the InfoNCE loss, where the temperature \(\tau\) scales the similarity logits: a smaller \(\tau\) sharpens the softmax over positive and negative pairs.
  • FCL takes 3 consecutive frames (current frame \(\pm \Delta t\)), and D2S concatenates 2-3 sweeps.
  • Pre-training: nuScenes 600 scenes, 8 GPUs, 50 epochs, AdamW + OneCycle, lr = 0.01.
  • Downstream fine-tuning: 4 GPUs, 100 epochs, lr = 0.001.
  • 2D backbone: DINOv2 (ViT-S/B/L, frozen); 3D backbone: MinkUNet (trainable).
  • Superpixels are generated by OpenSeeD, whose last layer is fine-tuned with the CLIP text encoder.
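
The loss composition listed above can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions: row `i` of `queries` and row `i` of `keys` form the positive pair and all other rows act as negatives, and the default `tau=0.07` is a placeholder value that is not given in this summary. Function names are hypothetical.

```python
import numpy as np

def info_nce(queries, keys, tau=0.07):
    """InfoNCE over L2-normalized features of shape (S, C).

    queries[i] / keys[i] is the positive pair; all other keys are negatives.
    tau=0.07 is an assumed placeholder, not a value reported by the paper.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = (q @ k.T) / tau                                # (S, S) similarities
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def total_loss(l_sc, l_tc, l_d2s):
    """Unweighted sum of the three terms, as listed in the summary."""
    return l_sc + l_tc + l_d2s
```

Under these assumptions, \(\mathcal{L}_{sc}\) would pair superpoint features with superpixel features of the same timestamp, while \(\mathcal{L}_{tc}\) would pair superpoint features of the current frame with those of a neighboring frame.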

Key Experimental Results

| Dataset | Setup | SuperFlow (ViT-B) | Seal (ViT-B) | Gain |
|---|---|---|---|---|
| nuScenes | Linear Probing | SOTA | - | Comprehensive Outperformance |
| nuScenes | 1% Fine-tune | SOTA | - | Significant Improvement |
| SemanticKITTI | 1% Fine-tune | SOTA | - | Strong Cross-domain Generalization |
| Waymo Open | 1% Fine-tune | SOTA | - | Strong Cross-domain Generalization |
  • Outperforms prior art, including PPKT, SLidR, and Seal, across 11 heterogeneous LiDAR datasets.
  • Cross-domain generalization experiments (Table 2): Achieves SOTA on all 14 tasks across 7 different LiDAR datasets.
  • OOD robustness (Table 3, Robo3D benchmark): The pre-trained SuperFlow model demonstrates stronger robustness under 8 corruption scenarios.
  • Scaling up the 2D backbone (ViT-S \(\rightarrow\) ViT-L) yields continuous performance gains, suggesting a favorable scaling trend.

Ablation Study

  • FCL contributes the most (Table 6): FCL brings approximately a 2% mIoU improvement, D2S brings about 1% mIoU, and VC shows a slight improvement. The combination of all three achieves the best performance.
  • Number of sweeps (Table 4): 2-3 sweeps is optimal; too many sweeps introduce projection misalignment due to the motion of dynamic objects.
  • Number of temporal frames (Table 7): 3 frames perform better than 2 frames, which perform better than a single frame; a shorter temporal span ensures better consistency, whereas an excessively long timespan introduces uncertainties.
  • 3D network capacity (Table 5): MinkUNet-34/50 yields better results, whereas MinkUNet-101 experiences a drop (potentially due to optimization difficulty with large parameter sizes).
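
The sweep concatenation examined in the sweep-count ablation can be sketched as below. This is a minimal NumPy illustration that assumes each sweep comes with a 4x4 LiDAR-to-world pose; the function name and pose convention are hypothetical. It compensates only for ego-motion, so points on moving objects stay misaligned, which is consistent with the observation that too many sweeps hurt.

```python
import numpy as np

def aggregate_sweeps(sweeps, poses_world_from_lidar, key_index):
    """Concatenate sweeps into the keyframe's LiDAR coordinate system.

    sweeps: list of (N_i, 3) arrays, each in its own LiDAR frame.
    poses_world_from_lidar: list of 4x4 poses, one per sweep (assumed given).
    key_index: index of the keyframe sweep that defines the target frame.
    """
    T_key_from_world = np.linalg.inv(poses_world_from_lidar[key_index])
    merged = []
    for pts, T_world_from_lidar in zip(sweeps, poses_world_from_lidar):
        # Chain sweep -> world -> keyframe; this is ego-motion compensation only.
        T_key_from_sweep = T_key_from_world @ T_world_from_lidar
        pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
        merged.append((T_key_from_sweep @ pts_h.T).T[:, :3])
    return np.concatenate(merged, axis=0)
```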

Highlights & Insights

  • Unified spatiotemporal pre-training: This work is the first to systematically introduce 4D temporal information into the image-to-LiDAR distillation framework, serving as a natural and effective extension of the SLidR \(\rightarrow\) ST-SLidR \(\rightarrow\) Seal trajectory.
  • Elegant design of view consistency alignment: Fine-tuning the last layer of the VFM segmentation head with a CLIP text encoder resolves cross-view self-conflict effectively and at very low cost. It is a good example of using language priors to unify visual semantics.
  • Intuitive D2S regularization: Concatenating multi-frame LiDAR sweeps into a dense point cloud and enforcing consistency constraints with sparse point clouds is straightforward but highly effective, leveraging the natural temporal scanning characteristics of LiDAR.
  • Comprehensive evaluation on 11 datasets: The experiments cover a variety of scenarios (real/synthetic, clear/adverse weather, different sensors), validating the generalization capability of the method.
  • Preliminary findings on scaling behavior: Scaling up the 2D backbone delivers consistent improvements, and a larger 3D backbone helps up to MinkUNet-34/50 (MinkUNet-101 drops), providing early empirical evidence for the development of 3D foundation models.

Limitations & Future Work

  • Temporal conflict of dynamic objects: Changes in the appearance and scale of dynamic objects across frames can produce inconsistent cross-frame superpixels, which are then mistakenly treated as negative samples (acknowledged in Fig. 12).
  • Asynchronous LiDAR-Camera frequencies: Mismatches in operational frequencies lead to systematic projection errors, which are more pronounced when concatenating multiple sweeps for dense point clouds, limiting further extension of the D2S module.
  • Limited to LiDAR semantic segmentation: Validation has not been conducted on other downstream tasks such as 3D detection or occupancy network prediction, leaving the universality of the pre-trained representations to be explored.
  • Dependence on VFM and CLIP: The quality of superpixels depends on the segmentation quality of OpenSeeD, and CLIP fine-tuning introduces extra dependencies.
  • Computational overhead: Multi-frame inputs combined with multi-way contrastive learning increase pre-training overhead, and the paper does not provide an efficiency comparison with the baselines.

Related Work & Insights

  • vs SLidR (CVPR'22): SuperFlow adds a temporal dimension (FCL) and density robustness (D2S) on top of SLidR, moving from 3D distillation to 4D distillation. SLidR performs image-to-LiDAR contrast on single frames only, and SuperFlow outperforms it across the board.
  • vs Seal (NeurIPS'23): Seal introduces VFMs to generate semantic superpixels but remains restricted to single frames. SuperFlow resolves the remaining self-conflict in Seal using CLIP for view consistency alignment, and introduces two new spatiotemporal modules: D2S and FCL.
  • vs BEVContrast (3DV'24) / TARL (CVPR'23): These methods leverage temporal information but operate only in a single modality (LiDAR) and lack cross-sensor distillation. SuperFlow combines multi-modal distillation with temporal consistency, yielding superior performance.
  • Potential new ideas: Can scene flow be used to explicitly model temporal correspondences of dynamic objects to solve the "temporal conflict" issue? Or can view-consistent superpixels be generated in a self-supervised manner without relying on VFMs?

Rating

  • Novelty: ⭐⭐⭐⭐ Systematically introducing 4D temporal information into image-to-LiDAR distillation is an effective innovation, though the three sub-modules individually are not entirely brand new.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive experiments spanning 11 datasets, linear probing, fine-tuning, cross-domain, robustness, ablations, and visualizations.
  • Writing Quality: ⭐⭐⭐⭐ Well-structured with reasonable motivations and intuitive illustrations, though certain technical details require consulting the appendix.
  • Value: ⭐⭐⭐⭐ Establishes a new SOTA baseline for LiDAR pre-training with inspiring scaling findings, albeit limited to segmentation tasks.