Generalized Few-Shot 3D Point Cloud Segmentation with Vision-Language Model

Conference: CVPR 2025
arXiv: 2503.16282
Code: https://github.com/ZhaochongAn/GFS-VL
Area: Multimodal VLM
Keywords: Point Cloud Segmentation, Few-Shot Learning, Vision-Language Models, Pseudo-Labeling, 3D Semantic Understanding

TL;DR

GFS-VL proposes a generalized few-shot 3D point cloud segmentation framework that synergistically fuses dense but noisy pseudo-labels generated by a 3D Vision-Language Model (3D VLM) with precise but sparse few-shot annotations. Through prototype-guided pseudo-label selection, adaptive infilling, and novel-base mix augmentation, it achieves SOTA performance on both existing and newly established challenging benchmarks.

Background & Motivation

Background: Generalized Few-Shot 3D Point Cloud Segmentation (GFS-PCS) requires models to segment both base and novel classes given only a few novel class samples (e.g., 1-shot/5-shot). Existing methods primarily adopt the prototype learning paradigm, where each class is represented by a prototype, and segmentation is performed based on the relationship between prototypes and query points. CAPL enhances prototypes using co-occurrence priors, while GW encodes geometric structures as auxiliary information.

Limitations of Prior Work: The core bottleneck shared by these methods is that few-shot data is too sparse to convey sufficient novel class knowledge. With only 1-5 support samples per class, the learned prototypes and decision boundaries are poorly constrained, so novel class segmentation accuracy falls far below the fully supervised upper bound.

Key Challenge: On one hand, 3D VLMs (e.g., OpenScene, RegionPLC) possess open-world novel class recognition capabilities by aligning 3D and language features, enabling them to generate dense pseudo-labels for novel classes—but these pseudo-labels are highly noisy, and using them directly leads to error accumulation. On the other hand, few-shot annotations are precise but suffer from extremely narrow coverage. The core challenge is how to combine the strengths of both—using dense pseudo-labels to compensate for sparse annotations, while using precise annotations to calibrate noisy pseudo-labels.

Goal: (1) How to effectively utilize pseudo-labels from 3D VLMs while controlling noise? (2) How to re-label the filtered noisy regions? (3) How to fully exploit the valuable few-shot support samples?

Key Insight: Instead of using the 3D VLM as an independent classifier, it is treated as a "noisy annotator." The precise prototypes from few-shot samples are leveraged to "calibrate" the pseudo-labels—filtering out regions inconsistent with the prototypes, infilling missing regions, and augmenting training scenes.

Core Idea: Few-shot support prototypes are used to guide the selection and refinement of 3D VLM pseudo-labels, and support samples are embedded into training scenes via novel-base mix augmentation while preserving context.

Method

Overall Architecture

The framework consists of three steps: ① generating raw predictions (pseudo-labels) containing all categories for the training scenes using a 3D VLM; ② cleaning the pseudo-labels through a pipeline of "prototype-guided pseudo-label selection \(\rightarrow\) adaptive infilling"; ③ embedding the support samples into the training scenes via novel-base mix augmentation. The cleaned pseudo-labels are then merged with the original base class annotations to train a simple segmenter consisting of a backbone and a linear classification head.
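
The merge of cleaned pseudo-labels with base annotations can be pictured with a minimal NumPy sketch. It assumes per-point integer label arrays in which `-1` marks points without a label, and that base ground truth takes priority over pseudo-labels; `merge_labels`, `IGNORE`, and the toy class ids are illustrative, not taken from the GFS-VL code.

```python
import numpy as np

IGNORE = -1  # assumed marker for points without a usable label


def merge_labels(base_gt, novel_pseudo):
    """Combine base ground truth with cleaned novel pseudo-labels, point by point."""
    base_gt = np.asarray(base_gt)
    novel_pseudo = np.asarray(novel_pseudo)
    # Keep the ground truth wherever a base class is annotated;
    # elsewhere fall back to the (possibly still unlabeled) pseudo-label.
    return np.where(base_gt != IGNORE, base_gt, novel_pseudo)


# Toy scene with 6 points: classes 0 and 1 are base, 12 and 13 are novel.
base_gt = np.array([0, 0, -1, -1, 1, -1])
novel_pseudo = np.array([12, -1, 12, -1, -1, 13])
merged = merge_labels(base_gt, novel_pseudo)
print(merged)
```

In this sketch, a point that has neither a base annotation nor a surviving pseudo-label stays at `-1`.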

Key Designs

Prototype-guided Pseudo-label Selection:
- Function: Filters low-quality novel class pseudo-labels in 3D VLM predictions.
- Mechanism: First, the 3D VLM's visual encoder is used to extract the support prototype \(\mathbf{p}^c\) for each novel class from the few-shot support samples (via masked average pooling). Then, for the current training scene, the 3D VLM generates raw predictions \(\hat{\mathbf{Y}}\). For each region predicted as novel class \(c\), a predicted prototype \(\mathbf{u}^c\) is computed in the same manner. If the cosine similarity between the predicted prototype and the support prototype is below a threshold \(\tau\), the pseudo-label for that class is filtered out (labeled as -1); base class predictions are filtered out directly (as ground-truth annotations are available).
- Design Motivation: 3D VLMs might misidentify the background as a novel class or confuse one novel class with another—but accurately predicted regions should possess features highly consistent with the support samples. This prototype consistency-based filtering is simple yet effective.
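
The selection rule above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: per-point features from the 3D VLM encoder are given as an `(N, D)` array, each novel class is handled as one region per scene, and `select_pseudo_labels`, `masked_average_pooling`, and the toy values (including \(\tau = 0.9\)) are hypothetical names and numbers rather than the authors' implementation.

```python
import numpy as np

IGNORE = -1  # label used for filtered-out points


def masked_average_pooling(features, mask):
    """Average the features of the points selected by a boolean mask."""
    return features[mask].mean(axis=0)


def cosine(a, b, eps=1e-8):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def select_pseudo_labels(features, raw_pred, support_protos, tau):
    """Keep a novel-class prediction only if its prototype matches the support prototype.

    features:       (N, D) per-point features
    raw_pred:       (N,) raw 3D VLM predictions (base and novel class ids)
    support_protos: dict {novel class id: (D,) support prototype p^c}
    """
    # Start from "everything filtered"; base-class predictions are never restored.
    selected = np.full_like(raw_pred, IGNORE)
    for c, p_c in support_protos.items():
        mask = raw_pred == c
        if not mask.any():
            continue
        u_c = masked_average_pooling(features, mask)  # predicted prototype u^c
        if cosine(u_c, p_c) >= tau:
            selected[mask] = c
    return selected


features = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [1.0, 0.1], [0.5, 0.5]])
raw_pred = np.array([5, 5, 6, 6, 0])  # 5 and 6 are novel, 0 is a base class
support_protos = {5: np.array([1.0, 0.0]), 6: np.array([0.0, 1.0])}
selected = select_pseudo_labels(features, raw_pred, support_protos, tau=0.9)
print(selected)
```

In the toy scene, the region predicted as class 5 agrees with its support prototype and is kept, while the region predicted as class 6 does not and is set to `-1`, as is the base-class prediction.
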
Adaptive Infilling:
- Function: Re-allocates pseudo-labels to the filtered, unannotated regions.
- Mechanism: The filtered regions may contain "completely incorrect predictions" (requiring the discovery of correct novel classes) and "partially correct predictions" (requiring completion of incomplete masks). The approach is to construct an adaptive prototype set \(\{\mathbf{m}^c\}\): if a novel class exists in the filtered labels of the current scene, the in-scene prototype \(\mathbf{v}^c\) is used; otherwise, the support prototype \(\mathbf{p}^c\) is used. Then, for each unannotated point, the cosine similarity to all novel class prototypes is calculated, and the corresponding class label is assigned if the maximum similarity exceeds a threshold \(\delta\).
- Design Motivation: Completing existing classes using in-scene prototypes (rather than external support prototypes) is more accurate (since features from the same scene are more similar), while retaining the capability to discover completely missing novel classes using support prototypes.
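
A minimal NumPy sketch of the infilling step follows. It assumes the input `labels` already combines base ground truth with the selected pseudo-labels, with `-1` marking unannotated points; `adaptive_infill`, `l2_normalize`, and the toy values (including \(\delta = 0.9\)) are hypothetical and only illustrate the adaptive choice between \(\mathbf{v}^c\) and \(\mathbf{p}^c\).

```python
import numpy as np

IGNORE = -1  # unannotated points


def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def adaptive_infill(features, labels, support_protos, delta):
    """Assign novel classes to unannotated points via an adaptive prototype set {m^c}."""
    classes = sorted(support_protos)
    protos = []
    for c in classes:
        mask = labels == c
        # m^c: in-scene prototype v^c if class c survived selection,
        # otherwise the support prototype p^c.
        protos.append(features[mask].mean(axis=0) if mask.any() else support_protos[c])
    protos = l2_normalize(np.stack(protos))        # (C, D)
    sims = l2_normalize(features) @ protos.T       # (N, C) cosine similarities
    best = sims.argmax(axis=1)
    best_sim = sims.max(axis=1)

    filled = labels.copy()
    fill = (labels == IGNORE) & (best_sim > delta)  # only touch unannotated points
    filled[fill] = np.asarray(classes)[best[fill]]
    return filled


features = np.array(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5], [0.7, 0.7]]
)
labels = np.array([5, 5, -1, -1, 0, -1])  # class 5 survived selection, class 6 did not
support_protos = {5: np.array([1.0, 0.0]), 6: np.array([0.0, 1.0])}
filled = adaptive_infill(features, labels, support_protos, delta=0.9)
print(filled)
```

Here class 6 is absent from the scene labels, so its support prototype is used to discover it, while the ambiguous last point stays unlabeled because its best similarity does not exceed \(\delta\).
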
Novel-Base Mix:
- Function: Directly embeds few-shot support samples into training scenes to increase training signals for novel classes.
- Mechanism: A support sample is randomly selected, and a local region containing the novel class object is cropped (preserving the surrounding context). The cropped region is then "attached" to the boundary of the training scene via corner alignment. Specifically, the four corners of the XY plane are extracted from both the training scene and the cropped region, and a pair of diagonal corners is randomly selected for alignment translation.
- Design Motivation: Unlike traditional 3D data augmentations (such as Mix3D), this method emphasizes context preservation. Directly cropping an object out of its original scene and placing it at an arbitrary position loses spatial context—yet in GFS-PCS, novel classes are often hard-to-detect objects for which contextual clues (e.g., "toilet next to the sink," "books on the desk") are critical for recognition.
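
The corner alignment can be sketched as below. This is one plausible reading of the description, stated as an assumption: a random corner of the scene's XY bounding box is matched with the diagonally opposite corner of the crop's bounding box, so the crop ends up just outside the scene boundary. The function names, the `margin` parameter, and the toy data are hypothetical, and heights (Z) are left unchanged.

```python
import numpy as np


def xy_corners(points):
    """Four corners of the XY bounding box, in counter-clockwise order."""
    x0, y0 = points[:, :2].min(axis=0)
    x1, y1 = points[:, :2].max(axis=0)
    return np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]])


def crop_with_context(points, labels, novel_class, margin):
    """Keep the novel object plus everything within `margin` of its XY box."""
    obj_xy = points[labels == novel_class, :2]
    lo, hi = obj_xy.min(axis=0) - margin, obj_xy.max(axis=0) + margin
    keep = np.all((points[:, :2] >= lo) & (points[:, :2] <= hi), axis=1)
    return points[keep], labels[keep]


def novel_base_mix(scene_pts, scene_lbl, crop_pts, crop_lbl, rng):
    """Attach the crop to the scene boundary by aligning a pair of diagonal corners."""
    k = rng.integers(4)                              # random scene corner
    scene_corner = xy_corners(scene_pts)[k]
    crop_corner = xy_corners(crop_pts)[(k + 2) % 4]  # diagonally opposite crop corner
    shifted = crop_pts.copy()
    shifted[:, :2] += scene_corner - crop_corner     # translate in XY only
    return np.concatenate([scene_pts, shifted]), np.concatenate([scene_lbl, crop_lbl])


rng = np.random.default_rng(0)
scene_pts = rng.uniform(0.0, 4.0, size=(50, 3))
scene_lbl = np.zeros(50, dtype=int)

# Toy support sample: class 7 is the novel object, class 0 is surrounding context.
support_pts = np.array(
    [[0.0, 0.0, 0.0], [1.0, 1.0, 0.5], [1.2, 1.1, 0.6], [3.0, 3.0, 0.0], [1.5, 0.8, 0.2]]
)
support_lbl = np.array([0, 7, 7, 0, 0])

crop_pts, crop_lbl = crop_with_context(support_pts, support_lbl, novel_class=7, margin=0.5)
mixed_pts, mixed_lbl = novel_base_mix(scene_pts, scene_lbl, crop_pts, crop_lbl, rng)
print(mixed_pts.shape, mixed_lbl.shape)
```

Because the crop keeps nearby context points (the third cropped point belongs to class 0), the novel object is inserted together with its surroundings rather than in isolation.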

Loss & Training

A two-stage training scheme is adopted: first, pre-train the backbone and base class classification head on base class data (800 epochs), and then add the novel class classification head to fine-tune for 20 epochs using the cleaned pseudo-labels and embedded support samples. Point Transformer V3 (PTv3) is used as the backbone.
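
Two ingredients of the fine-tuning stage can be sketched in NumPy. Both are assumptions for illustration, not details confirmed by the summary above: that points labeled `-1` are simply excluded from a cross-entropy loss, and that the novel class head is added by appending freshly initialized rows to the pretrained linear head. The names `masked_cross_entropy` and `extend_head` and the `scale` parameter are hypothetical.

```python
import numpy as np


def masked_cross_entropy(logits, labels, ignore_index=-1):
    """Mean cross-entropy over points whose label is not `ignore_index`."""
    valid = labels != ignore_index
    if not valid.any():
        return 0.0
    z = logits[valid]
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(valid.sum()), labels[valid]].mean())


def extend_head(base_w, base_b, num_novel, rng, scale=0.01):
    """Append randomly initialized rows for novel classes to a pretrained linear head."""
    dim = base_w.shape[1]
    w = np.concatenate([base_w, scale * rng.standard_normal((num_novel, dim))])
    b = np.concatenate([base_b, np.zeros(num_novel)])
    return w, b


rng = np.random.default_rng(0)
w, b = extend_head(np.zeros((2, 4)), np.zeros(2), num_novel=3, rng=rng)

logits = np.zeros((4, 3))            # uniform predictions over 3 classes
labels = np.array([0, 2, -1, 1])     # the third point was filtered out
loss = masked_cross_entropy(logits, labels)
print(w.shape, b.shape, loss)
```

With uniform logits over three classes, the loss equals \(\ln 3\) regardless of how many points are ignored, which is a quick sanity check for the masking.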

Key Experimental Results

Main Results (ScanNet200, 5-shot/1-shot)

| Method | 5-shot mIoU-N | 5-shot HM | 1-shot mIoU-N | 1-shot HM |
| --- | --- | --- | --- | --- |
| attMPTI | 4.99 | 8.79 | 3.28 | 6.17 |
| COSeg | 5.21 | 9.54 | 4.03 | 7.42 |
| GW | 8.30 | 14.55 | 6.47 | 11.56 |
| GFS-VL | 17.21 | 25.38 | 13.60 | 20.49 |
| Fully Supervised Upper Bound | 39.32 | 50.02 | 39.32 | 50.02 |
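
For readers unfamiliar with the metrics: mIoU-N is the mean IoU over novel classes, and HM is assumed here to follow the usual generalized few-shot convention, the harmonic mean of base class and novel class mIoU. The short sketch below shows that definition; the input values are hypothetical and are not taken from the table.

```python
def harmonic_mean(miou_base, miou_novel):
    """Harmonic mean of base and novel mIoU; low if either of the two is low."""
    if miou_base + miou_novel == 0:
        return 0.0
    return 2 * miou_base * miou_novel / (miou_base + miou_novel)


# Hypothetical values: strong base performance cannot hide weak novel performance.
hm = harmonic_mean(60.0, 20.0)
print(hm)  # 30.0
```

Because the harmonic mean is dominated by the smaller term, improvements on novel classes move HM much more than equal improvements on base classes.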

Ablation Study (ScanNet200, 1-shot)

| Configuration | mIoU-N | HM | Description |
| --- | --- | --- | --- |
| No Pseudo-labels (Baseline) | 8.29 | 14.19 | Fine-tuning only with support samples |
| + Raw Pseudo-labels | 11.28 | 17.14 | Unselected, contains noise |
| + Pseudo-label Selection | 12.04 | 18.61 | Removes low-quality regions |
| + Adaptive Infilling | 12.79 | 19.37 | Completes missing regions |
| + Novel-Base Mix | 13.60 | 20.49 | Full Model |

Key Findings

- On ScanNet200 (40 novel classes), GFS-VL achieves a novel class mIoU of 17.21 (5-shot), which is more than twice that of the previous SOTA (GW, 8.30).
- The newly introduced ScanNet200 and ScanNet++ benchmarks are significantly more challenging: with 40 and 18 novel classes respectively, they far exceed the 6 novel classes of existing benchmarks and cover a more diverse set of categories.
- The gain from pseudo-label selection alone (\(11.28 \rightarrow 12.04\)) is moderate, but adding adaptive infilling and Novel-Base Mix raises the result further (\(12.04 \rightarrow 13.60\)), showing that infilling missing regions and increasing training signals are equally crucial.
- Raw pseudo-labels already provide a substantial gain (\(8.29 \rightarrow 11.28\)), demonstrating that the knowledge from 3D VLMs is highly valuable despite being noisy.
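
The step-by-step gains quoted above follow directly from the reported tables. The small script below recomputes them; it uses only the mIoU-N values listed in this summary, and the variable names are illustrative.

```python
# 1-shot mIoU-N on ScanNet200, in the order the components are added.
ablation = [
    ("No Pseudo-labels (Baseline)", 8.29),
    ("+ Raw Pseudo-labels", 11.28),
    ("+ Pseudo-label Selection", 12.04),
    ("+ Adaptive Infilling", 12.79),
    ("+ Novel-Base Mix", 13.60),
]

# Gain contributed by each added component over the previous configuration.
gains = [round(b - a, 2) for (_, a), (_, b) in zip(ablation, ablation[1:])]
for (name, _), gain in zip(ablation[1:], gains):
    print(f"{name}: +{gain:.2f}")

# 5-shot mIoU-N of GFS-VL relative to the previous best method (GW).
ratio = round(17.21 / 8.30, 2)
print(f"GFS-VL / GW (5-shot mIoU-N): {ratio}x")
```

Raw pseudo-labels account for the largest single step (+2.99), while selection, infilling, and Novel-Base Mix each add a further 0.75 to 0.81 points.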

Highlights & Insights

- The collaborative paradigm of "dense-noisy + sparse-precise" is the core contribution of this work. Instead of choosing between a 3D VLM or few-shot samples, the two are made complementary: a small amount of precise annotations calibrates a large amount of coarse knowledge. This concept can be transferred to any "weak annotation + strong prior" scenario.
- Context-preserving data augmentation is an often overlooked design detail: most 3D data augmentations mix objects or scenes without regard to their surroundings, whereas context is critical for recognizing novel objects in few-shot scenarios.
- The newly introduced benchmarks are a substantial contribution: existing evaluation settings (with 6 novel classes) are overly simplistic, whereas the benchmarks with 40 and 18 novel classes more closely reflect real-world complexity.

Limitations & Future Work

- The method relies on the quality of the specific 3D VLM (RegionPLC): if the 3D VLM has no prior knowledge of certain classes, its pseudo-labels provide no signal for them.
- The thresholds \(\tau\) and \(\delta\) need to be manually tuned for different datasets.
- The method increases training complexity: it requires running the 3D VLM to generate pseudo-labels and then several cleaning steps, which makes the pipeline cumbersome from an engineering standpoint.
- The 1-shot novel class mIoU is only 13.60, far below the fully supervised upper bound of 39.32, which leaves substantial room for improvement.
- Pseudo-labels could be updated dynamically during fine-tuning (in an iterative self-training manner) instead of being generated once as a preprocessing step.

Comparison with Related Work

- vs. GW: GW enhances prototypes by sharing geometric structures, but the knowledge source remains limited to a few support samples. GFS-VL introduces a 3D VLM as an external knowledge source, substantially increasing the information volume.
- vs. CAPL: CAPL utilizes co-occurrence priors and query context, whereas GFS-VL directly expands the training set using pseudo-labels, which is a more direct approach.
- vs. OpenScene/RegionPLC: These 3D VLMs perform zero-shot segmentation with limited accuracy. GFS-VL uses them as "weak annotators" and combines them with a small amount of precise annotations, achieving better results than either alone.

Rating

- Novelty: ⭐⭐⭐⭐ The idea of synergizing 3D VLM pseudo-labels and few-shot annotations is novel and reasonable, though each sub-module is relatively independent and lacks joint optimization.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 4 benchmarks (including 2 newly configured), multiple baseline comparisons, detailed ablation studies, and visualizations.
- Writing Quality: ⭐⭐⭐⭐ Detailed framework diagrams and well-articulated motivation, though the notation is somewhat heavy.
- Value: ⭐⭐⭐⭐ The new benchmarks will have a long-term impact, and the method is pioneering in the 3D few-shot field.