ProMerge: Prompt and Merge for Unsupervised Instance Segmentation¶
Conference: ECCV 2024
arXiv: 2409.18961
Code: None
Area: Segmentation / Unsupervised Learning
Keywords: Unsupervised Instance Segmentation, Self-Supervised Features, DINO, Group-and-Merge, Pseudo-Labels
TL;DR¶
ProMerge uses self-supervised visual features (DINO) to form initial patch groups, then applies strategic merging and background-aware mask pruning to produce instance masks without supervision. It runs significantly faster at inference than normalized-cut methods, and a detector trained on its masks as pseudo-labels outperforms existing unsupervised SOTA models.
Background & Motivation¶
Background: Unsupervised instance segmentation aims to segment individual object instances in images without manually annotated data. Recent SOTA methods leverage rich visual features from self-supervised models such as DINO, represent each image as a graph, and solve normalized cuts to generate foreground masks.
Limitations of Prior Work: (1) Normalized-cut-based methods require solving generalized eigenvalue systems, which incurs high computational complexity and slow inference speed; (2) despite good segmentation quality, they fail to meet the demands of real-time or large-scale applications; (3) the memory consumption of graph construction and eigenvalue solving is also prohibitive.
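For reference, the eigenproblem behind point (1) can be written out. This is the standard normalized-cut formulation of Shi and Malik, not notation taken from the ProMerge paper; the symbols $W$, $D$, $y$, and $N$ are introduced here for illustration only. Let $W \in \mathbb{R}^{N \times N}$ be the affinity matrix over $N$ patches, $D$ its diagonal degree matrix, $\mathrm{cut}(A,B) = \sum_{i \in A, j \in B} W_{ij}$, and $\mathrm{assoc}(A,V) = \sum_{i \in A, j \in V} W_{ij}$:

$$
\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(A,V)} + \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(B,V)},
\qquad
(D - W)\,y = \lambda\, D\, y
$$

The bipartition is read off the eigenvector with the second-smallest eigenvalue. Storing a dense $W$ takes on the order of $N^2$ memory and a dense eigensolver on the order of $N^3$ operations, which is the source of the cost noted in (1) and (3).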
Key Challenge: Self-supervised features provide rich local correspondence information for grouping, but mainstream normalized-cut methods exploit them in a heavyweight manner. A much lighter grouping strategy is needed.
Goal: Maintain or surpass the segmentation quality of normalized-cut methods while substantially reducing inference time.
Key Insight: Directly perform patch grouping and merging in the DINO feature space, replacing complex graph optimization with simple similarity thresholds and merging rules.
Core Idea: First, perform initial patch grouping (Prompt) using self-supervised features, then eliminate over-segmentation through strategic merging (Merge), followed by background-aware mask pruning to remove false positives. Finally, the predicted masks are used as pseudo-labels to train a standard detector.
Method¶
Overall Architecture¶
Input image → DINO extracts patch features → Initial grouping (patch aggregation based on feature similarity) → Strategic merging (eliminating over-segmentation) → Background-aware mask pruning → Output instance masks. Optional: Train Mask R-CNN using the generated masks as pseudo-labels.
Key Designs¶
- Initial Patch Grouping (Prompt):
    - Function: Performs initial semantic grouping by leveraging the local correspondence of DINO features.
    - Mechanism: Computes the cosine similarity between adjacent patches, and groups highly similar patches together. Initial over-segmented results are obtained through connected component analysis.
    - Design Motivation: Since DINO features naturally encode semantic similarity, a simple similarity threshold can yield meaningful initial groupings without the need for complex graph optimization.
- Strategic Merging (Merge):
    - Function: Merges over-segmented fragments into complete instance masks.
    - Mechanism: Computes the feature similarity between adjacent groups and merges them when the similarity exceeds a threshold. The merging strategy considers spatial adjacency and semantic consistency, proceeding iteratively until no more fragments can be merged.
    - Design Motivation: Since initial groupings tend to be over-segmented (e.g., a single object split into multiple parts), the merging step restores complete instance boundaries.
- Background-Aware Mask Pruning:
    - Function: Removes false positive masks belonging to the background.
    - Mechanism: Leverages the attention map of the [CLS] token in DINO features to estimate foreground probability, discarding masks with low foreground probability.
    - Design Motivation: The grouping-and-merging process may generate false foreground masks in background areas. Using DINO's global attention as a foreground prior effectively filters these out.
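The three stages above can be sketched on a toy patch grid. This is a minimal illustration of the mechanisms as summarized in these notes, not the authors' implementation: the function names, the thresholds `tau_group`, `tau_merge`, and `min_fg`, and the hand-made `feats` and `fg_score` arrays are all assumptions. In practice `feats` would be DINO patch features and `fg_score` a foreground map derived from the [CLS] attention.

```python
import numpy as np

def _unit(v):
    # L2-normalise along the last axis so dot products are cosine similarities.
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)

def _find(parent, i):
    # Union-find root lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def group_patches(feats, tau_group):
    """Connected components over 4-neighbour patches with cosine similarity > tau_group."""
    h, w, _ = feats.shape
    f = _unit(feats)
    parent = list(range(h * w))
    for y in range(h):
        for x in range(w):
            for ny, nx in ((y, x + 1), (y + 1, x)):  # right and down neighbours
                if ny < h and nx < w and float(f[y, x] @ f[ny, nx]) > tau_group:
                    parent[_find(parent, y * w + x)] = _find(parent, ny * w + nx)
    roots = np.array([_find(parent, i) for i in range(h * w)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels.reshape(h, w)

def merge_groups(feats, labels, tau_merge):
    """Repeatedly merge the most similar pair of adjacent groups above tau_merge."""
    labels = labels.copy()
    while True:
        means = {int(i): _unit(feats[labels == i].mean(axis=0)) for i in np.unique(labels)}
        pairs = set()
        # Compare each patch with its right and its lower neighbour to find touching groups.
        for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1], labels[1:])):
            touching = a != b
            pairs.update(zip(a[touching].tolist(), b[touching].tolist()))
        best, best_sim = None, tau_merge
        for i, j in pairs:
            sim = float(means[i] @ means[j])
            if sim > best_sim:
                best, best_sim = (i, j), sim
        if best is None:  # no adjacent pair exceeds the threshold: done
            return labels
        labels[labels == best[1]] = best[0]

def prune_background(labels, fg_score, min_fg):
    """Keep only groups whose mean foreground score reaches min_fg (hypothetical rule)."""
    return [labels == i for i in np.unique(labels) if fg_score[labels == i].mean() >= min_fg]

# Toy 4x6 grid: a background ring around one object made of two slightly different parts.
feats = np.zeros((4, 6, 3))
feats[..., 2] = 1.0                  # background feature
feats[1:3, 1:3] = [1.0, 0.0, 0.0]    # object part 1
feats[1:3, 3:5] = [0.9, 0.3, 0.0]    # object part 2 (cosine ~0.95 with part 1)
fg_score = np.zeros((4, 6))
fg_score[1:3, 1:5] = 1.0             # stand-in for a [CLS]-attention foreground map

groups = group_patches(feats, tau_group=0.97)
merged = merge_groups(feats, groups, tau_merge=0.90)
masks = prune_background(merged, fg_score, min_fg=0.5)
print(len(np.unique(groups)), len(np.unique(merged)), len(masks))
```

The strict grouping threshold deliberately splits the object into two fragments, the looser merge threshold rejoins them, and pruning drops the background group. Every step costs time linear in the number of patch adjacencies per pass, with no affinity matrix or eigensolver involved.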
Loss & Training¶
ProMerge itself does not require training. When training Mask R-CNN with the generated masks, the standard instance segmentation loss is applied.
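Before a detector can be trained, each predicted mask has to be turned into an annotation record. The sketch below shows one plausible conversion into a class-agnostic, COCO-style entry. The helper name `mask_to_pseudo_label` and the single category id are assumptions, and the `segmentation` field (polygon or RLE) is omitted for brevity; the paper's actual data pipeline is not described in these notes.

```python
import numpy as np

def mask_to_pseudo_label(mask):
    """Convert a binary pixel mask into a class-agnostic COCO-style annotation dict."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # an empty mask yields no pseudo-label
    x0, x1 = int(xs.min()), int(xs.max())
    y0, y1 = int(ys.min()), int(ys.max())
    return {
        "bbox": [x0, y0, x1 - x0 + 1, y1 - y0 + 1],  # COCO order: x, y, width, height
        "area": int(mask.sum()),
        "category_id": 1,   # assumed single "object" class, since no labels exist
        "iscrowd": 0,
    }

# Usage on a toy 6x8 mask containing one 3x3 object.
toy_mask = np.zeros((6, 8), dtype=bool)
toy_mask[2:5, 1:4] = True
annotation = mask_to_pseudo_label(toy_mask)
print(annotation)
```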
Key Experimental Results¶
Main Results¶
| Method | COCO AP | Inference Speed | Type |
|---|---|---|---|
| Ours (ProMerge) | Competitive | Much Faster | Training-free |
| Normalized-cut based | High | Slow | Training-free |
| Ours → Mask R-CNN | Surpasses SOTA | Standard Detection Speed | Pseudo-label Training |
Ablation Study¶
| Configuration | Segmentation Quality | Description |
|---|---|---|
| Initial grouping only | Over-segmentation | Requires merging |
| + Merging | Significant improvement | Restores complete instances |
| + Background pruning | Further improvement | Removes background false positives |
| Training detector | Optimal | Maximizes the value of pseudo-labels |
Key Findings¶
- ProMerge's inference is significantly faster than that of normalized-cut methods, making unsupervised instance segmentation feasible on large-scale datasets.
- A detector trained on ProMerge's pseudo-labels outperforms existing normalized-cut-based methods, which suggests that pseudo-labels from a simpler generator can be at least as suitable for detector training as those from more complex methods.
Highlights & Insights¶
- A case of simplicity beating complexity: replacing the normalized-cut eigenproblem with simple grouping and merging yields substantially faster inference without compromising quality.
- The pseudo-label training pipeline (generating masks → training a standard detector) serves as a practical deployment solution.
- The utilization of DINO features is highly lightweight, requiring neither graph construction nor eigenvalue decomposition.
Limitations & Future Work¶
- The merging strategy relies on manual thresholds, which may require tuning across different datasets.
- Adjacent objects with highly similar textures may be over-merged.
- Background pruning depends on the quality of DINO attention.
- Detailed quantitative results need to be supplemented from the full paper.
Related Work & Insights¶
- vs CutLER/TokenCut: These are SOTA methods based on normalized-cuts, which yield good segmentation quality but slow inference; ProMerge achieves comparable quality but is much faster.
- vs FreeSOLO: Another line of unsupervised instance segmentation (based on SOLO); ProMerge achieves better performance by training a detector via pseudo-labels.
Rating¶
- Novelty: ⭐⭐⭐⭐ The grouping-and-merging concept is elegant and effective, serving as an efficient alternative to normalized-cuts.
- Experimental Thoroughness: ⭐⭐⭐ Validated across multiple benchmarks, though detailed data still need to be supplemented.
- Writing Quality: ⭐⭐⭐ Evaluated based on abstract information.
- Value: ⭐⭐⭐⭐ Significantly lowers the inference barrier for unsupervised instance segmentation.