ECCV 2024 Autonomous Driving Semi-supervised learning LiDAR semantic segmentation peer-representation consistency contrastive learning Gaussian Mixture Model

ItTakesTwo: Leveraging Peer Representations for Semi-supervised LiDAR Semantic Segmentation

Conference: ECCV 2024
arXiv: 2407.07171
Code: https://github.com/yyliu01/IT2
Area: Autonomous Driving / 3D Semantic Segmentation
Keywords: Semi-supervised learning, LiDAR semantic segmentation, peer-representation consistency, contrastive learning, Gaussian Mixture Model

TL;DR

The proposed IT2 framework significantly improves semi-supervised LiDAR semantic segmentation by leveraging consistency learning between peer representations (range image + voxel grid) of LiDAR data as a novel form of perturbation, and introducing cross-distribution contrastive learning based on Gaussian Mixture Models (GMMs).

Background & Motivation

Background: Outdoor LiDAR semantic segmentation is a foundational task in autonomous driving. Current methods extract features by converting point clouds into different representations (range image, voxel grid, BEV, etc.), but heavily rely on large-scale annotated data. Labeling LiDAR point clouds is extremely time-consuming and expensive, driving the development of semi-supervised learning (SSL) methods.

Limitations of Prior Work:

Limitations of Consistency Learning: Existing SSL methods (e.g., LaserMix, Mean Teacher) perform consistency learning on a single LiDAR representation and rely on a narrow set of perturbations (data augmentation, network perturbation), which limits their generalization. Each representation also has its own weakness: the range view loses spatial detail in projection (e.g., IoU on the Pole class is only 52.02), whereas the voxel grid is less accurate in distant, sparse regions (e.g., IoU on the Sidewalk class is only 69.50).

Bottlenecks of Contrastive Learning: Existing contrastive learning approaches randomly sample positive and negative embedding pairs from limited mini-batches, failing to comprehensively capture the full distribution of the embedding space. Under multi-representation scenarios, noisy predictions from one representation can also propagate to another through contrastive learning, leading to confirmation bias.

Key Challenge: The same 3D point cloud should share the same semantics across different representations—the same point in the range image and voxel grid must have the identical "terrain" label—yet their prediction confidence and error patterns are drastically different. This complementarity is entirely ignored by existing single-representation SSL methods.

Key Insight: The differences between different LiDAR representations can be treated as a natural form of "perturbation." Unlike traditional data augmentation perturbations, peer-representation perturbations intrinsically represent different ways of observing the same semantic information, providing much more effective consistency learning signals.

Core Idea: Achieve consistency learning through cross-peer pseudo-label supervision (range predictions supervise voxels, and vice versa), while modeling the cross-representation embedding distribution using a GMM to sample highly informative samples for contrastive learning.

Method

Overall Architecture

IT2 simultaneously trains two networks: a range network (FIDNet/ResNet34) processing range images, and a voxel network (Cylinder3D) processing voxel grids. The same point cloud is converted into both representations and fed into their respective networks. On unlabeled data, predictions from one representation are mapped into the other's representation space via representation conversion, serving as pseudo-labels for cross-supervision. Meanwhile, the embedding spaces of both networks are modeled using GMMs for cross-distribution contrastive learning.

Key Designs

Peer-Representation Consistency: The core mechanism is to project predictions from the voxel network into the range image space (and vice versa), where they act as pseudo-labels to supervise the peer network. Concretely, this is implemented with bidirectional projection functions between the representations:

\(\tilde{\mathbf{y}}^r(\omega^r) = \text{argmax}\ \Psi_{v \to r}^{(\omega^r)}(\hat{\mathbf{y}}^v), \quad \tilde{\mathbf{y}}^v(\omega^v) = \text{argmax}\ \Psi_{r \to v}^{(\omega^v)}(\hat{\mathbf{y}}^r)\)

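where \(\Psi_{v \to r} = \psi_{p \to r} \circ \psi_{v \to p}\) composes a voxel-to-point mapping with a point-to-range mapping, so the 3D points serve as the intermediate space between the two representations (\(\Psi_{r \to v}\) is built the same way in the opposite direction). The total loss is the sum of the labeled and unlabeled losses of both representations: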

\(\ell_{\text{IT2}} = \ell_{\text{range}} + \ell_{\text{voxel}}\)
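The voxel-to-range projection described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name, the array layouts, the `ignore_index` value, and the "nearest point wins" rule for pixels hit by several points are all assumptions made for the example.

```python
import numpy as np

def voxel_to_range_pseudo_labels(voxel_probs, point_voxel_idx, point_pixel_idx,
                                 point_depth, image_shape, ignore_index=255):
    """Project voxel predictions onto the range image through the 3D points.

    voxel_probs:     (V, C) class probabilities per voxel (assumed layout).
    point_voxel_idx: (N,)   index of the voxel that contains each point.
    point_pixel_idx: (N, 2) (row, col) of each point in the range image.
    point_depth:     (N,)   range of each point, used to resolve collisions.
    """
    height, width = image_shape

    # psi_{v->p}: every point inherits the prediction of its voxel.
    point_probs = voxel_probs[point_voxel_idx]                 # (N, C)

    # psi_{p->r}: keep one point per pixel, the one closest to the sensor
    # (an assumption of this sketch).
    flat = point_pixel_idx[:, 0] * width + point_pixel_idx[:, 1]
    order = np.lexsort((point_depth, flat))   # sort by pixel, then by depth
    flat_sorted = flat[order]
    first = np.ones(len(order), dtype=bool)
    first[1:] = flat_sorted[1:] != flat_sorted[:-1]
    winners = order[first]

    # argmax after projection; pixels without any point stay at ignore_index.
    pseudo = np.full(height * width, ignore_index, dtype=np.int64)
    pseudo[flat[winners]] = point_probs[winners].argmax(axis=1)
    return pseudo.reshape(height, width)
```

The range-to-voxel direction would follow the same pattern with the roles of the two index arrays swapped; pixels left at `ignore_index` would be excluded from the unlabeled loss.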

Design Motivation: Different representations make complementary errors on the same scene: the range view is weak on thin, tall objects (Pole), whereas the voxel grid is weak on distant, sparsely sampled surfaces (Sidewalk). Cross-supervision exploits this complementarity to mitigate the confirmation bias inherent in single-representation pseudo-labeling.

Cross-Distribution Contrastive Learning: Instead of the conventional approach of sampling embeddings from a mini-batch, IT2 utilizes a class-level Gaussian Mixture Model (GMM) to model the joint embedding space of both representations. The parameters of the GMM are optimized via the EM algorithm, with the pseudo-label confidence integrated as a weight factor to suppress the influence of lower-quality labels:

\(\mathbf{P}_{\Gamma^y}(\mathbf{z}|y) = \sum_{m=1}^{M} \boldsymbol{\pi}_m^y \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_m^y, \boldsymbol{\Sigma}_m^y)\)
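One EM iteration with confidence weighting, as described above, might look like the sketch below. It is only an illustration under stated assumptions: diagonal covariances, mixing weights held fixed at \(1/M\), and confidence applied as a multiplicative weight on the responsibilities. The summary does not specify these details, and the function name is hypothetical.

```python
import numpy as np

def weighted_em_step(z, conf, mu, var, eps=1e-6):
    """One EM iteration for the GMM of a single class.

    z:    (N, D) embeddings assigned to this class.
    conf: (N,)   pseudo-label confidence in [0, 1], used as a sample weight.
    mu:   (M, D) component means.
    var:  (M, D) diagonal variances (assumption of this sketch).
    Mixing weights are fixed to 1/M, so they cancel in the E-step.
    """
    # E-step: responsibilities from the Gaussian log-densities
    # (the shared -D/2 * log(2*pi) constant cancels after normalization).
    diff = z[:, None, :] - mu[None, :, :]                          # (N, M, D)
    log_pdf = -0.5 * (np.sum(diff ** 2 / var[None], axis=2)
                      + np.sum(np.log(var), axis=1)[None])         # (N, M)
    log_pdf -= log_pdf.max(axis=1, keepdims=True)                  # stability
    resp = np.exp(log_pdf)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: low-confidence embeddings contribute less to the statistics.
    w = resp * conf[:, None]                                       # (N, M)
    w_sum = w.sum(axis=0) + eps                                    # (M,)
    new_mu = (w.T @ z) / w_sum[:, None]
    new_diff = z[:, None, :] - new_mu[None, :, :]
    new_var = np.einsum("nm,nmd->md", w, new_diff ** 2) / w_sum[:, None] + eps
    return new_mu, new_var
```

An embedding with confidence 0 has no effect on the updated means and variances, which is the intended suppression of low-quality pseudo-labels.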

Each category uses \(M=5\) Gaussian components. During training, virtual positive and negative prototype samples are sampled directly from the GMM distribution rather than being randomly selected from actual training samples:

\(\ell_{\text{cross}} = \sum_{(\mathbf{z},y) \in \mathcal{A}} \sum_{\mathbf{s} \in \mathcal{S}^y} -\log \frac{\exp(\mathbf{z} \cdot \mathbf{s} / \tau)}{\exp(\mathbf{z} \cdot \mathbf{s} / \tau) + \sum_{\mathbf{s}^- \in \bar{\mathcal{S}}^y} \exp(\mathbf{z} \cdot \mathbf{s}^- / \tau)}\)

Design Motivation: GMMs can capture the complete distribution characteristics (multi-modality, covariance structure) of the embedding space. Virtual prototypes sampled this way are more representative and informative than random samples, and they do not participate in backpropagation, thereby avoiding the propagation of cross-representation noise.
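To make the sampling and the loss concrete, the sketch below draws virtual prototypes from a class GMM and evaluates \(\ell_{\text{cross}}\) for a single anchor. The function names, the diagonal covariance, and the unit-norm normalization of the prototypes are assumptions of this example rather than details confirmed by the paper. Because the prototypes are plain samples, they would be treated as constants during backpropagation.

```python
import numpy as np

def sample_prototypes(mu, var, n, rng):
    """Draw n virtual prototypes from a GMM with uniform mixing weights.

    mu, var: (M, D) component means and diagonal variances (assumed form).
    """
    num_components, dim = mu.shape
    comp = rng.integers(0, num_components, size=n)     # uniform pi_m = 1/M
    s = mu[comp] + np.sqrt(var[comp]) * rng.standard_normal((n, dim))
    # Unit-norm prototypes are an assumption, so that z . s acts as a cosine.
    return s / np.linalg.norm(s, axis=1, keepdims=True)

def cross_contrastive_loss(z, pos, neg, tau=0.1):
    """l_cross for one anchor z (D,), positives (P, D) and negatives (Q, D)."""
    pos_logit = pos @ z / tau                          # (P,)
    neg_logit = neg @ z / tau                          # (Q,)
    # Stable log-sum-exp over the negatives, shared by every positive.
    m = neg_logit.max()
    lse_neg = m + np.log(np.exp(neg_logit - m).sum())
    # -log(e^p / (e^p + sum_n e^n)) = logaddexp(p, lse_neg) - p
    return float(np.sum(np.logaddexp(pos_logit, lse_neg) - pos_logit))
```

In this formulation the positives would be sampled from the GMM of the anchor's class and the negatives from the GMMs of all other classes; the loss approaches zero when the anchor aligns with its positives and is far from every negative.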

Representation-Specific Augmentation: Different representations are suited to different augmentation strategies—range images use multi-boxes CutMix, while voxels employ single-inclination LaserMix. This differs from past approaches which applied identical augmentations across all representations.

Design Motivation: The range image is a 2D projection, suited for 2D spatial cutting augmentations; the voxel representation is a 3D structure, suited for 3D augmentations based on laser beam inclinations. Tailored augmentation strategies yield better generalization.
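The two augmentations can be illustrated with the following sketch. It is a simplified reading of the description above, not the authors' code: the box-size range, the interpretation of "single-inclination" as one inclination threshold that splits two scans, and all function names are assumptions. The range image is assumed to be at least 2 pixels in each dimension.

```python
import numpy as np

def multi_box_cutmix(img_a, img_b, lab_a, lab_b, n_boxes, rng):
    """Paste n_boxes random rectangles from scan B into scan A (range view).

    img_*: (C, H, W) range images, lab_*: (H, W) label maps.
    """
    height, width = lab_a.shape
    img, lab = img_a.copy(), lab_a.copy()
    for _ in range(n_boxes):
        h = rng.integers(1, height // 2 + 1)
        w = rng.integers(1, width // 2 + 1)
        r = rng.integers(0, height - h + 1)
        c = rng.integers(0, width - w + 1)
        # Copy image and labels together so they stay aligned.
        img[..., r:r + h, c:c + w] = img_b[..., r:r + h, c:c + w]
        lab[r:r + h, c:c + w] = lab_b[r:r + h, c:c + w]
    return img, lab

def single_inclination_lasermix(pts_a, pts_b, lab_a, lab_b, threshold_deg):
    """Mix two scans: A below the inclination threshold, B at or above it.

    pts_*: (N, 3) xyz coordinates, lab_*: (N,) point labels.
    """
    def inclination(p):
        return np.degrees(np.arctan2(p[:, 2], np.linalg.norm(p[:, :2], axis=1)))

    keep_a = inclination(pts_a) < threshold_deg
    keep_b = inclination(pts_b) >= threshold_deg
    points = np.concatenate([pts_a[keep_a], pts_b[keep_b]])
    labels = np.concatenate([lab_a[keep_a], lab_b[keep_b]])
    return points, labels
```

On unlabeled scans the label arguments would be the cross-representation pseudo-labels, mixed in the same way as the inputs.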

Loss & Training

Total training loss: \(\mathcal{L} = \ell_{\text{IT2}} + \ell_{\text{cross}}\), optimized end-to-end.
Segmentation loss \(\ell\) consists of Cross-Entropy + Lovász-Softmax.
Contrastive learning temperature coefficient \(\tau = 0.1\).
The embedding projector is a 3-layer MLP yielding a 64-dimensional output.
The number of GMM components is \(M=5\), with uniform weights \(\pi_m^y = 1/M\).

Key Experimental Results

Main Results (Uniform Sampling)

nuScenes Dataset (mIoU%):

Method	Representation	1%	10%	20%	50%
LaserMix	Range	49.5	68.2	70.6	73.0
IT2 (Ours)	Range	56.5	71.3	73.4	74.0
LaserMix	Voxel	55.3	69.9	71.8	73.2
IT2 (Ours)	Voxel	57.5	72.1	73.6	74.1

SemanticKITTI + ScribbleKITTI (mIoU%):

Method	Representation	KITTI 1%	KITTI 10%	Scribble 1%	Scribble 10%
LaserMix	Range	43.4	58.8	38.3	54.4
IT2	Range	51.9	60.3	46.6	57.1
LaserMix	Voxel	50.6	60.0	44.2	53.7
IT2	Voxel	52.0	61.4	47.9	56.7

Relative to LaserMix with 1% labels, IT2 gains +7.0 mIoU (range) and +2.2 mIoU (voxel) on nuScenes, and +8.3 mIoU (range) on ScribbleKITTI.

Ablation Study

Contribution of Each Component (nuScenes + ScribbleKITTI, Range Representation):

IT2 Architecture	Contrastive Learning	Augmentation	nuScenes 10%	ScribbleKITTI 10%	Notes
✗	✗	✗	60.8	50.0	CPS Baseline
✓	✗	✗	67.3 (+6.5)	53.2 (+3.2)	Peer-representation consistency
✓	✓	✗	70.3 (+9.5)	54.9 (+4.9)	+ Cross-distribution contrastive learning
✓	✓	✓	71.3 (+10.5)	57.1 (+7.1)	+ Representation-specific augmentation

All three components contribute. Moving from the CPS baseline to the IT2 architecture gives the largest gain (+6.5 on nuScenes, +3.2 on ScribbleKITTI); contrastive learning adds a further +3.0 and +1.7, and representation-specific augmentation adds another +1.0 and +2.2, respectively.
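The per-component gains can be recomputed from the ablation table with a few lines of Python. The numbers are copied from the table above; the variable and function names are only illustrative.

```python
# mIoU (%) of the range network at 10% labels, copied from the ablation table.
rows = [
    ("CPS baseline", 60.8, 50.0),
    ("+ peer-representation consistency", 67.3, 53.2),
    ("+ cross-distribution contrastive learning", 70.3, 54.9),
    ("+ representation-specific augmentation", 71.3, 57.1),
]

def incremental_gains(column):
    """Gain of each row over the previous row, rounded to one decimal."""
    values = [row[column] for row in rows]
    return [round(b - a, 1) for a, b in zip(values, values[1:])]

nuscenes_gains = incremental_gains(1)   # column 1: nuScenes 10%
scribble_gains = incremental_gains(2)   # column 2: ScribbleKITTI 10%

for (name, _, _), g_n, g_s in zip(rows[1:], nuscenes_gains, scribble_gains):
    print(f"{name}: nuScenes +{g_n}, ScribbleKITTI +{g_s}")
```

The cumulative figures in the table (+10.5 and +7.1) are the sums of these incremental gains.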

Key Findings

Comparison of Contrastive Learning: Compared to ContrasSeg (state-of-the-art pixel-level contrastive learning), the GMM-based method improves mIoU by about 1% on nuScenes 10% (range: 71.3 vs 70.3, voxel: 72.1 vs 71.2), demonstrating that distribution sampling outperforms random sampling.
Temperature Coefficient: \(\tau=0.10\) is the optimal value (on nuScenes 10%); values too large or too small degrade performance.
Scalability to Multiple Representations: Integrating BEV (PolarNet) as a third representation yields further improvements (+6.3% when paired with range, +7.8% when paired with voxel), suggesting that the method extends to other combinations of representations.
Partial/Significant Sampling: On SemanticKITTI partial 5%, it outperforms GPC by 3.6 points (43.8 vs 40.2), and on significant 10% it beats lim3D by 2.2 points.
Augmentation Tactics: The combination of multi CutMix for range and 1-inc LaserMix for voxel is optimal, showing a +1.3% boost for the range representation compared to using LaserMix uniformly.

Highlights & Insights

Excellent core insight: Different LiDAR representations are distinct "observations" of the same 3D scene, which naturally fits the cluster assumption behind consistency learning and provides a more fundamental form of perturbation than hand-designed data augmentation.
Modeling the embedding distribution with GMM before sampling is an elegant design: it not only covers the full distribution but also mitigates the negative impact of noisy pseudo-labels via confidence weighting.
Cross-representation pseudo-label supervision cleverly capitalizes on the complementary strengths of different representations, diminishing the confirmation bias inherent to single-representation SSL.
High degree of framework versatility: It extends seamlessly to any combination of LiDAR representations (range+voxel+BEV) and consistently yields substantial improvements.

Limitations & Future Work

Employing two representations requires training two independent networks, which doubles the training computational overhead.
Projection mappings between representations heavily rely on precise point cloud coordinates; sensor noise might degrade the quality of cross-representation labels.
GMM assumes a Gaussian mixture distribution for the embedding space, which might not be flexible enough for highly complex categories.
Worth exploring: Incorporating the camera modality as a third representation to achieve unified LiDAR-Camera SSL fusion.
Worth exploring: Extending the concept of peer-representation consistency to other SSL fields (e.g., multimodal SSL in medical imaging).

Related Work & Insights

LaserMix introduced laser beam-mixing augmentations for single representations. IT2 advances beyond this by treating the "asymmetry between representations" itself as a higher-level perturbation.
Contrastive learning in GPC and lim3D is restricted by actual physical samples; IT2 overcomes this bottleneck using virtual GMM prototypes.
CPS (Cross Pseudo Supervision) in 2D image SSL uses two identical networks for cross-supervision. IT2 extends this to cross-supervising between two different representations, naturally introducing superior perturbation diversity.
Inspiration: The intrinsic complementarity of multi-view/multimodal data is a valuable asset for semi-supervised learning, proving more effective than synthetic augmentations.

Rating

Novelty: ⭐⭐⭐⭐ Peer-representation consistency learning and GMM-based contrastive learning are both significant and meaningful innovations.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 3 datasets × 3 sampling strategies × extensive ablation experiments + representation extension + comparison of contrastive learning. Extremely thorough.
Writing Quality: ⭐⭐⭐⭐ Thorough motivation analysis (illustrated well by the complementarity diagram in Fig.1) and complete, clear mathematical formalization.
Value: ⭐⭐⭐⭐ Significantly drives the LiDAR SSL field forward, with a +7% mIoU boost under the nuScenes 1% setting carrying strong practical significance.