ProMerge: Prompt and Merge for Unsupervised Instance Segmentation

Conference: ECCV 2024
arXiv: 2409.18961
Code: None
Area: Segmentation / Unsupervised Learning
Keywords: Unsupervised Instance Segmentation, Self-Supervised Features, DINO, Group-and-Merge, Pseudo-Labels

TL;DR

ProMerge uses self-supervised visual features (DINO) for initial patch grouping, followed by strategic merging and background-aware mask pruning, to perform unsupervised instance segmentation. It runs significantly faster than normalized-cut methods, and a detector trained on its pseudo-labels outperforms existing unsupervised SOTA models.

Background & Motivation

Background: Unsupervised instance segmentation aims to segment individual object instances in images without relying on manually annotated data. Recent SOTA methods leverage rich visual features from self-supervised models like DINO, representing each image as a graph over patches and solving normalized-cut problems to generate foreground masks.

Limitations of Prior Work: (1) Normalized-cut-based methods require solving generalized eigenvalue systems, which incurs high computational complexity and slow inference speed; (2) despite good segmentation quality, they fail to meet the demands of real-time or large-scale applications; (3) the memory consumption of graph construction and eigenvalue solving is also prohibitive.

Key Challenge: Self-supervised features provide rich local correspondence information for grouping, but mainstream normalized-cut methods exploit them in a heavyweight manner, so a much lighter grouping strategy is needed.

Goal: Maintain or surpass the segmentation quality of normalized-cut methods while substantially reducing inference time.

Key Insight: Directly perform patch grouping and merging in the DINO feature space, replacing complex graph optimization with simple similarity thresholds and merging rules.

Core Idea: First, perform initial patch grouping (Prompt) using self-supervised features, then eliminate over-segmentation through strategic merging (Merge), followed by background-aware mask pruning to remove false positives. Finally, the predicted masks are used as pseudo-labels to train a standard detector.

Method

Overall Architecture

Input image → DINO extracts patch features → Initial grouping (patch aggregation based on feature similarity) → Strategic merging (eliminating over-segmentation) → Background-aware mask pruning → Output instance masks. Optional: Train Mask R-CNN using the generated masks as pseudo-labels.

Key Designs

Initial Patch Grouping (Prompt):
- Function: Performs initial semantic grouping by leveraging the local correspondence of DINO features.
- Mechanism: Computes the cosine similarity between adjacent patches, and groups highly similar patches together. Initial over-segmented results are obtained through connected component analysis.
- Design Motivation: Since DINO features naturally encode semantic similarity, a simple similarity threshold can yield meaningful initial groupings without the need for complex graph optimization.
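
The grouping step described above can be sketched as follows. This is a minimal illustration of the description in this note, not the authors' implementation: the function name `initial_grouping`, the 4-neighbour adjacency, and the threshold value `0.6` are assumptions, and a toy array stands in for real DINO patch features.

```python
import numpy as np

def initial_grouping(feats, sim_threshold=0.6):
    """Group adjacent patches whose cosine similarity exceeds a threshold.

    feats: (H, W, D) array of patch features (assumed to come from DINO).
    Returns an (H, W) integer label map, one label per connected group.
    """
    h, w, _ = feats.shape
    # L2-normalise so that a dot product equals cosine similarity.
    unit = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    parent = list(range(h * w))  # union-find over flattened patch indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for y in range(h):
        for x in range(w):
            idx = y * w + x
            # Link to the right and bottom neighbours when similar enough.
            if x + 1 < w and unit[y, x] @ unit[y, x + 1] > sim_threshold:
                union(idx, idx + 1)
            if y + 1 < h and unit[y, x] @ unit[y + 1, x] > sim_threshold:
                union(idx, idx + w)

    # Connected components: relabel the union-find roots as 0..K-1.
    roots = np.array([find(i) for i in range(h * w)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels.reshape(h, w)

# Toy input: the left half and right half carry orthogonal features.
toy_feats = np.zeros((4, 4, 2))
toy_feats[:, :2] = [1.0, 0.0]
toy_feats[:, 2:] = [0.0, 1.0]
toy_labels = initial_grouping(toy_feats)
print(toy_labels)
```

On real features the threshold would need tuning, and a low value tends to fuse neighbouring objects while a high value fragments them further.
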
Strategic Merging (Merge):
- Function: Merges over-segmented fragments into complete instance masks.
- Mechanism: Computes the feature similarity between adjacent groups and merges them when the similarity exceeds a threshold. The merging strategy considers spatial adjacency and semantic consistency, proceeding iteratively until no more fragments can be merged.
- Design Motivation: Since initial groupings tend to be over-segmented (e.g., a single object split into multiple parts), the merging step restores complete instance boundaries.
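
One way to picture the iterative merging is the greedy loop below. It is a sketch under stated assumptions rather than the paper's rule set: the name `merge_groups`, the use of mean group features, the "merge the most similar adjacent pair first" order, and the threshold `0.8` are all hypothetical choices made for illustration.

```python
import numpy as np

def merge_groups(labels, feats, merge_threshold=0.8):
    """Greedily merge spatially adjacent groups with similar mean features.

    labels: (H, W) integer label map from an initial grouping.
    feats:  (H, W, D) patch features.
    Returns a new (H, W) label map once no adjacent pair exceeds the threshold.
    """
    labels = labels.copy()
    unit = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    while True:
        # Mean descriptor per group, re-normalised to unit length.
        means = {}
        for g in np.unique(labels):
            m = unit[labels == g].mean(axis=0)
            means[int(g)] = m / (np.linalg.norm(m) + 1e-8)

        # Pairs of distinct groups that touch horizontally or vertically.
        pairs = set()
        for a, b in ((labels[:, :-1], labels[:, 1:]),
                     (labels[:-1, :], labels[1:, :])):
            diff = a != b
            pairs.update(zip(a[diff].tolist(), b[diff].tolist()))

        # Pick the most similar adjacent pair above the threshold, if any.
        best, best_sim = None, merge_threshold
        for ga, gb in pairs:
            sim = float(means[ga] @ means[gb])
            if sim > best_sim:
                best, best_sim = (ga, gb), sim
        if best is None:
            return labels  # nothing left to merge
        labels[labels == best[1]] = best[0]

# Toy input: groups 0 and 1 are near-duplicates, group 2 is different.
frag_labels = np.array([[0, 0, 1, 2],
                        [0, 0, 1, 2]])
frag_feats = np.zeros((2, 4, 2))
frag_feats[:, :2] = [1.0, 0.0]
frag_feats[:, 2] = [0.95, 0.31]
frag_feats[:, 3] = [0.0, 1.0]
merged = merge_groups(frag_labels, frag_feats)
print(merged)
```

Each pass removes one group, so the loop terminates after at most as many iterations as there are initial groups.
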
Background-Aware Mask Pruning:
- Function: Removes false positive masks belonging to the background.
- Mechanism: Leverages the attention map of the [CLS] token in DINO features to estimate foreground probability, discarding masks with low foreground probability.
- Design Motivation: The grouping-and-merging process may generate false foreground masks in background areas. Using DINO's global attention as a foreground prior effectively filters these out.
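
The pruning rule can be illustrated with a few lines of NumPy. The sketch assumes that a per-patch [CLS] attention map is already available as an array, that min-max normalisation is an acceptable stand-in for a foreground probability, and that a mask is kept when its mean score reaches `0.5`; the function names and the cutoff are hypothetical.

```python
import numpy as np

def normalise_attention(cls_attn):
    """Min-max scale a (H, W) attention map to [0, 1] (an assumed proxy
    for per-patch foreground probability)."""
    lo, hi = cls_attn.min(), cls_attn.max()
    return (cls_attn - lo) / (hi - lo + 1e-8)

def prune_background_masks(labels, fg_prob, min_fg=0.5):
    """Return the ids of groups whose mean foreground score is >= min_fg."""
    kept = []
    for g in np.unique(labels):
        if fg_prob[labels == g].mean() >= min_fg:
            kept.append(int(g))
    return kept

# Toy input: group 0 sits on high attention, group 1 on low attention.
mask_labels = np.array([[0, 0, 1],
                        [0, 0, 1]])
attn = np.array([[0.9, 0.8, 0.1],
                 [0.7, 0.9, 0.0]])
kept_ids = prune_background_masks(mask_labels, normalise_attention(attn))
print(kept_ids)
```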

Loss & Training

ProMerge itself does not require training. When training Mask R-CNN with the generated masks, the standard instance segmentation loss is applied.
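
To make the pseudo-label step concrete, the sketch below turns a patch-level label map into class-agnostic annotations with COCO-style `[x, y, width, height]` boxes. The function name `masks_to_pseudo_labels`, the patch size of 16 pixels, and the single category id are assumptions for illustration; the note does not specify the exact annotation format used to train the detector.

```python
import numpy as np

def masks_to_pseudo_labels(labels, keep_ids, patch_size=16):
    """Convert kept groups of a patch-level label map into box annotations.

    labels:     (H, W) integer label map over patches.
    keep_ids:   group ids that survived pruning.
    patch_size: pixels per patch side (assumed value).
    """
    annotations = []
    for g in keep_ids:
        mask = labels == g
        ys, xs = np.nonzero(mask)
        # Tight box around the group, scaled from patch to pixel coordinates.
        x0, y0 = int(xs.min()) * patch_size, int(ys.min()) * patch_size
        x1 = (int(xs.max()) + 1) * patch_size
        y1 = (int(ys.max()) + 1) * patch_size
        annotations.append({
            "bbox": [x0, y0, x1 - x0, y1 - y0],       # [x, y, w, h] in pixels
            "area": int(mask.sum()) * patch_size ** 2,  # mask area in pixels
            "category_id": 1,                           # class-agnostic "object"
        })
    return annotations

pseudo = masks_to_pseudo_labels(np.array([[0, 0, 1],
                                          [0, 0, 1]]), keep_ids=[1])
print(pseudo)
```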

Key Experimental Results

Main Results

| Method | COCO AP | Inference Speed | Type |
| --- | --- | --- | --- |
| Ours (ProMerge) | Competitive | Much faster | Training-free |
| Normalized-cut based | High | Slow | Training-free |
| Ours → Mask R-CNN | Surpasses SOTA | Standard detection speed | Pseudo-label training |

Ablation Study

| Configuration | Segmentation Quality | Description |
| --- | --- | --- |
| Initial grouping only | Over-segmentation | Requires merging |
| + Merging | Significant improvement | Restores complete instances |
| + Background pruning | Further improvement | Removes background false positives |
| Training detector | Optimal | Maximizes the value of pseudo-labels |

Key Findings

- ProMerge runs significantly faster than normalized-cut methods, which makes unsupervised instance segmentation feasible on large-scale datasets.
- A detector trained on ProMerge pseudo-labels outperforms the direct use of normalized-cut methods, suggesting that pseudo-labels from a simpler generator can be at least as suitable for detector training as those from more complex methods.

Highlights & Insights

- A case of simplicity beating complexity: replacing normalized-cut optimization with simple grouping and merging yields substantially faster inference without compromising quality.
- The pseudo-label training pipeline (generate masks, then train a standard detector) serves as a practical deployment solution.
- DINO features are used in a lightweight way, requiring neither graph construction nor eigenvalue decomposition.

Limitations & Future Work

- The merging strategy relies on manual thresholds, which may require tuning across different datasets.
- Adjacent objects with highly similar textures may be over-merged.
- Background pruning depends on the quality of DINO attention.
- Detailed quantitative results need to be supplemented from the full paper.

vs CutLER/TokenCut: These SOTA methods build on normalized cuts and deliver good segmentation quality but are slow at inference; ProMerge achieves comparable quality while running much faster.
vs FreeSOLO: Another line of unsupervised instance segmentation (based on SOLO); ProMerge achieves better performance by training a detector on its pseudo-labels.

Rating

Novelty: ⭐⭐⭐⭐ The grouping-and-merging concept is elegant and effective, serving as an efficient alternative to normalized-cuts.
Experimental Thoroughness: ⭐⭐⭐ Validated across multiple benchmarks, though detailed data still need to be supplemented.
Writing Quality: ⭐⭐⭐ Evaluated based on abstract information.
Value: ⭐⭐⭐⭐ Significantly lowers the inference barrier for unsupervised instance segmentation.