LASS3D: Language-Assisted Semi-Supervised 3D Semantic Segmentation with Progressive Unreliable Data Exploitation¶
Conference: ECCV 2024
Code: None
Area: 3D Vision / Semantic Segmentation
Keywords: Semi-Supervised Learning, 3D Point Cloud Segmentation, Large Vision-Language Models, Pseudo-labeling, Negative Learning
TL;DR¶
This paper proposes LASS3D, a MeanTeacher-based framework for semi-supervised 3D semantic segmentation. It uses Large Vision-Language Models (LVMs) to generate multi-level text descriptions that enhance 3D features, and it exploits low-confidence pseudo-labeled points through a progressive negative learning strategy. The reported gains over the previous state of the art range from 2.6 to 4.6 mIoU across indoor (ScanNet v2, S3DIS) and outdoor (SemanticKITTI) datasets.
Background & Motivation¶
Background: 3D point cloud semantic segmentation requires large amounts of precise point-wise annotations, which are extremely time-consuming to produce for large-scale datasets. Semi-supervised methods such as MeanTeacher have become a dominant paradigm: a teacher-student framework generates pseudo-labels for unlabeled data, which reduces the annotation burden.
Limitations of Prior Work: Current semi-supervised 3D segmentation methods face two core challenges. First, while Large Vision-Language Models (LVMs) have shown strong performance in 2D semi-supervised learning, their application to 3D semi-supervised semantic segmentation remains largely unexplored, leaving 3D features without guidance from high-level semantic information. Second, existing methods typically apply a hard threshold to the pseudo-labels generated by the teacher model: high-confidence points are retained as positive supervision and low-confidence ones are simply discarded, which wastes the information embedded in a large number of "unreliable" points.
Key Challenge: Although low-confidence pseudo-labels are not reliable enough for positive supervision, they still contain exclusionary information about "which classes the point should not belong to." Simply discarding them ignores this useful negative knowledge. Furthermore, there is a modality gap between 3D point clouds and 2D images, making the effective transfer of semantic knowledge from vision-language models to the 3D space another key challenge.
Goal: (1) How to inject semantic information from LVMs into 3D semi-supervised segmentation? (2) How to effectively utilize the low-confidence unreliable points discarded by existing methods?
Key Insight: The authors propose using 2D images as a bridge between text and point clouds: off-the-shelf LVMs generate multi-level text descriptions for the 2D images, and the text semantics are then embedded into the 3D feature space through adaptive fusion. For unreliable points, the idea of negative learning is adopted: although the exact class of a point is unknown, the model can learn that the point "does not belong to" the class predicted with the highest confidence, which is likely incorrect.
Core Idea: Enhance 3D features with LVM text descriptions for semantic guidance, and utilize a progressive negative learning strategy to mine exclusionary information from unreliable points to boost semi-supervised segmentation performance.
Method¶
Overall Architecture¶
LASS3D is built upon the MeanTeacher framework. The input consists of point clouds and their corresponding 2D image views. In the student branch, two off-the-shelf LVMs generate scene-level and object-level text descriptions, and a text encoder turns them into multi-level semantic embeddings. After the student network extracts 3D point features, a Semantic-Aware Adaptive Fusion (SAAF) module injects the text semantics into the 3D features. The teacher branch, updated via an exponential moving average (EMA) of the student weights, generates pseudo-labels for unlabeled data and splits them into reliable and unreliable points. Reliable points train the student network with a standard cross-entropy loss, while unreliable points provide complementary training through a progressive negative learning strategy. Meanwhile, the text-enhanced semantics in the student branch are transferred to the teacher branch via knowledge distillation.
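The EMA teacher update that MeanTeacher-style frameworks rely on can be sketched as below. This is a minimal plain-Python illustration, not the authors' code: the function name `ema_update`, the dictionary-of-scalars weight representation, and the momentum value are all assumptions made for the example.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher weights as an exponential moving average.

    new_teacher = momentum * teacher + (1 - momentum) * student

    `teacher` and `student` map parameter names to scalar weights here;
    a real model would hold tensors. The default momentum is an assumed
    value, not one taken from the paper.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy usage: the teacher drifts slowly toward the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)
```

A high momentum keeps the teacher stable between steps, which is why its pseudo-labels tend to be less noisy than the student's raw predictions.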
Key Designs¶
- Multi-level Text Description and Semantic-Aware Adaptive Fusion (SAAF) Module:
  - Function: Generate multi-level text descriptions using LVMs and adaptively fuse them into 3D features.
  - Mechanism: Two LVMs generate scene-level descriptions (e.g., "an indoor office scene with desks and chairs") and object-level descriptions (e.g., "a wooden chair"), and a pre-trained text encoder extracts the text embeddings \(e_{scene}\) and \(e_{obj}\). Fusion uses a class-aware attention mechanism: for each 3D point feature \(f_i\), its similarity to each class text embedding serves as an attention weight, and the attention-weighted text features are fused into the 3D feature. A learnable gating parameter adaptively controls the fusion ratio so that text information does not override the original geometric details.
  - Design Motivation: Multi-level descriptions supplement 3D features with both macro-scene semantics and micro-object attributes, compensating for the lack of high-level semantic understanding in purely geometric features. The adaptive gating keeps the amount of injected text information adjustable across different scenes.
- Progressive Unreliable Data Exploitation:
  - Function: Progressively exploit, via negative learning, the low-confidence pseudo-labeled points that traditional methods typically discard.
  - Mechanism: First, the teacher model's predictions are divided into reliable and unreliable points based on a confidence threshold. Negative learning is then applied to the unreliable points: if the highest-confidence class predicted by the teacher is likely incorrect, the student model can at least be told that "this point does not belong to this class." Concretely, this is implemented via complementary labels: the predicted class with the highest probability is treated as a negative label, and the student model is trained to suppress its prediction probability for that class. As training progresses, the threshold is gradually relaxed to allow more initially unreliable points into training, establishing a progressive learning process.
  - Design Motivation: Low confidence does not imply zero utility. A point might belong to any of, say, 10 classes, but the most likely incorrect prediction at least provides negative information for exclusion. The progressive strategy allows the model to learn simple exclusions from the most unreliable negative signals first, before dealing with difficult cases near blurred decision boundaries.
- Knowledge Distillation for Text Semantics Transfer:
  - Function: Transfer text-enhanced features from the student branch to the teacher branch.
  - Mechanism: The teacher branch is not connected directly to the LVM modules, which preserves computational efficiency and keeps the EMA updates stable. Instead, it obtains text-enhanced semantic information indirectly through a feature-level distillation loss. Specifically, a KL divergence or MSE loss between the intermediate features of the student and teacher branches aligns the teacher's feature space with the text-enhanced semantic space.
  - Design Motivation: Enhancing the semantic features of the teacher branch improves the quality of its generated pseudo-labels, creating a virtuous cycle.
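The class-aware attention and gated fusion described for the SAAF module can be illustrated with a small sketch. Everything below is an assumption made for illustration: dot-product similarity, softmax attention, a scalar gate passed through a sigmoid, and residual addition are plausible choices rather than confirmed details of the paper, and the names `saaf_fuse` and `gate_logit` are hypothetical. How the scene-level and object-level embeddings are combined into per-class embeddings is not covered here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def saaf_fuse(point_feature, class_text_embeddings, gate_logit):
    """Fuse one 3D point feature with per-class text embeddings.

    Assumed form: f' = f + sigmoid(gate_logit) * sum_c a_c * e_c,
    where a = softmax of the similarities between f and each e_c.
    Returns the fused feature and the attention weights.
    """
    # Attention weights: similarity of the point to each class embedding.
    sims = [dot(point_feature, e) for e in class_text_embeddings]
    weights = softmax(sims)

    # Attention-weighted sum of the text embeddings.
    dim = len(point_feature)
    text_feature = [
        sum(w * e[d] for w, e in zip(weights, class_text_embeddings))
        for d in range(dim)
    ]

    # A gate in (0, 1) limits how much text information is mixed in,
    # so the geometric feature is not overwritten.
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    fused = [f + gate * t for f, t in zip(point_feature, text_feature)]
    return fused, weights


# Toy usage with 2-D features and two class embeddings.
fused, weights = saaf_fuse([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], gate_logit=0.0)
print(fused, weights)
```

In this sketch a strongly negative `gate_logit` leaves the point feature almost unchanged, which mirrors the stated purpose of the gate: text information supplements the geometry without replacing it.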
Loss & Training¶
The total loss consists of four parts: (1) cross-entropy loss \(\mathcal{L}_{ce}\) on labeled data; (2) cross-entropy loss \(\mathcal{L}_{pseudo}\) on reliable pseudo-labels; (3) negative learning loss \(\mathcal{L}_{neg}\) on unreliable points, which decreases the prediction probability of incorrect classes using complementary labels; and (4) knowledge distillation loss \(\mathcal{L}_{kd}\) to align student and teacher intermediate features. The progressive threshold \(\tau\) decays linearly over training epochs to gradually include more unreliable points.
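The threshold schedule, the reliable/unreliable split, and the negative learning term can be sketched as follows. This is a hedged illustration under stated assumptions: the loss form \(-\log(1 - p_k)\) is the standard complementary-label formulation, not a formula quoted from the paper; the start and end thresholds and the loss weights are made-up values; and the sketch assumes that \(\tau\) is the reliability threshold itself, which the summary above does not state explicitly. All function names are hypothetical.

```python
import math


def linear_threshold(epoch, total_epochs, tau_start=0.9, tau_end=0.7):
    """Linearly decay the confidence threshold over training (assumed values)."""
    progress = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return tau_start + (tau_end - tau_start) * progress


def split_by_confidence(teacher_probs, tau):
    """Split point indices by the teacher's maximum class probability."""
    reliable, unreliable = [], []
    for i, probs in enumerate(teacher_probs):
        if max(probs) >= tau:
            reliable.append(i)
        else:
            unreliable.append(i)
    return reliable, unreliable


def negative_learning_loss(student_probs, teacher_probs, unreliable, eps=1e-7):
    """Mean of -log(1 - p_student[k]) over unreliable points.

    k is the teacher's top class, used as the complementary (negative)
    label, so the loss falls as the student suppresses that class.
    """
    if not unreliable:
        return 0.0
    total = 0.0
    for i in unreliable:
        probs = teacher_probs[i]
        k = max(range(len(probs)), key=probs.__getitem__)
        total += -math.log(max(1.0 - student_probs[i][k], eps))
    return total / len(unreliable)


def total_loss(l_ce, l_pseudo, l_neg, l_kd, w_pseudo=1.0, w_neg=1.0, w_kd=1.0):
    """Weighted sum of the four terms; the weights are assumptions."""
    return l_ce + w_pseudo * l_pseudo + w_neg * l_neg + w_kd * l_kd


# Toy usage: two points, two classes, first epoch of an 11-epoch run.
teacher_probs = [[0.95, 0.05], [0.60, 0.40]]
student_probs = [[0.90, 0.10], [0.50, 0.50]]
tau = linear_threshold(epoch=0, total_epochs=11)
reliable, unreliable = split_by_confidence(teacher_probs, tau)
print(reliable, unreliable)
print(negative_learning_loss(student_probs, teacher_probs, unreliable))
```

Under this reading, lowering \(\tau\) over time moves points from the unreliable set into the reliable set, so points that start with only negative supervision can later receive positive pseudo-label supervision.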
Key Experimental Results¶
Main Results¶
| Dataset | Label Ratio | Metric | Ours | Prev. SOTA | Gain |
|---|---|---|---|---|---|
| ScanNet v2 | 1% | mIoU | 55.8 | 51.2 | +4.6 |
| ScanNet v2 | 5% | mIoU | 65.3 | 62.1 | +3.2 |
| SemanticKITTI | 1% | mIoU | 48.7 | 44.9 | +3.8 |
| SemanticKITTI | 10% | mIoU | 58.2 | 55.6 | +2.6 |
| S3DIS | 5% | mIoU | 59.1 | 56.3 | +2.8 |
Ablation Study¶
| Configuration | mIoU | Description |
|---|---|---|
| Baseline (MeanTeacher) | 51.2 | Without text enhancement and negative learning |
| + Multi-level Text Fusion | 53.9 | SAAF added, +2.7 mIoU over the baseline |
| + Negative Learning (Fixed Threshold) | 54.6 | Exploits unreliable points, +0.7 mIoU over the previous row |
| + Progressive Negative Learning | 55.3 | Progressive strategy outperforms fixed threshold |
| + Knowledge Distillation (Full) | 55.8 | Full model |
Key Findings¶
- Multi-level text fusion contributes the most (+2.7 mIoU), indicating that LVM semantic information is valuable for semi-supervised 3D segmentation.
- Progressive negative learning outperforms fixed-threshold negative learning by 0.7 mIoU, validating the effectiveness of gradually expanding the scope of unreliable points.
- Improvements are more pronounced at lower annotation ratios (+4.6 mIoU at 1% vs. +3.2 at 5% on ScanNet v2; +3.8 at 1% vs. +2.6 at 10% on SemanticKITTI), suggesting that semantic priors matter more under severe data scarcity.
- Both scene-level and object-level text descriptions are essential; using only scene-level descriptions drops performance by 1.2 mIoU.
Highlights & Insights¶
- Using 2D images as a bridge connecting text and 3D point clouds is a highly practical paradigm, avoiding the massive cost of training vision-language models directly on 3D data. This bridge concept can be extended to tasks like 3D object detection or point cloud generation.
- Leveraging negative learning to exploit unreliable data is a clever intuition—when we cannot confirm what it is, we can learn what it is not. This complementary label concept is broadly applicable to semi-supervised or noisy-label scenarios.
- The progressive threshold design is simple and effective, sharing a similar philosophy with curriculum learning—learning simple exclusion signals before handling difficult boundary cases.
Limitations & Future Work¶
- Relies on 2D image views to generate text descriptions, making it inapplicable to scenarios with only raw point clouds (no corresponding images).
- The quality of text descriptions generated by LVMs directly impacts performance; descriptions for long-tail classes might be inaccurate.
- Negative learning requires a reasonable threshold scheduling strategy, and different datasets might require different scheduling parameters.
- Advanced 3D backbones (such as Point Transformer v3) have not been explored; the method's effectiveness when paired with stronger backbones remains untested.
Related Work & Insights¶
- vs LaserMix: LaserMix achieves data augmentation by mixing different laser-scanned regions, which is a purely data-driven strategy. LASS3D tackles the problem from the perspective of semantic priors. The two are complementary, and combining them could yield further improvements.
- vs ST3D: ST3D also relies on self-training with pseudo-labels (it targets domain-adaptive 3D object detection) but does not exploit low-confidence predictions. LASS3D's negative learning strategy is a natural complement to the self-training paradigm.
- vs OpenScene: OpenScene also matches 3D features with language using CLIP, but targets open-vocabulary segmentation instead of semi-supervised learning. LASS3D could benefit from OpenScene's multi-scale feature alignment strategy.
Rating¶
- Novelty: ⭐⭐⭐⭐ Introducing LVMs to semi-supervised 3D segmentation is novel, and negative learning is applied cleverly.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple indoor/outdoor datasets and various annotation ratios, with comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation and a systematic description of the method.
- Value: ⭐⭐⭐⭐ Provides two effective new tools (semantic enhancement and data utilization) for 3D semi-supervised learning.