Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection

Conference: ICCV 2025
arXiv: 2510.18437
Code: https://github.com/xiaohainku/RISE
Area: Segmentation / Camouflaged Object Detection / Unsupervised
Keywords: camouflaged object detection, unsupervised segmentation, retrieval-augmented, KNN, prototype library

TL;DR

This paper proposes RISE — a retrieval self-augmented unsupervised camouflaged object detection paradigm that constructs foreground/background prototype libraries from the training set itself and leverages KNN retrieval to generate pseudo-labels, substantially outperforming existing unsupervised and prompt-based methods without any annotations.

Background & Motivation

Background: Camouflaged Object Detection (COD) aims to segment target objects from highly similar backgrounds. Mainstream fully supervised methods rely on dense pixel-level annotations, where annotating a single image can take up to one hour. Weakly and semi-supervised approaches reduce annotation burden but still require partial labels.

Limitations of Prior Work: (a) Unsupervised methods (TokenCut, MaskCut, ProMerge, etc.) primarily exploit intra-image feature similarity to separate foreground from background, but camouflaged objects and backgrounds share highly similar features, causing single-image methods to perform poorly; (b) prompt-based methods combining SAM with task-specific prompts still require some form of supervision and offer limited context-specific understanding of COD; (c) methods that generate pseudo-labels via diffusion models or multimodal LLMs (GenSAM, ProMac) require days of computation and significant GPU resources.

Key Challenge: Within a single image, the DINOv2 features of camouflaged objects and backgrounds are nearly indistinguishable (almost overlapping in t-SNE visualizations) — intra-image similarity alone cannot effectively separate them. However, at the dataset level, foreground objects exhibit higher similarity to a foreground prototype library than to a background prototype library.

Goal: To leverage dataset-level contextual information to distinguish camouflaged foregrounds from backgrounds in a fully annotation-free setting.

Key Insight: Mining prototypes directly from the dataset through a coarse-to-fine strategy — first obtaining coarse masks via clustering, then refining prototypes through retrieval — to construct high-quality foreground/background prototype libraries, followed by KNN-based retrieval to classify each feature as foreground or background.

Core Idea: Rather than relying on intra-image similarity, RISE leverages a dataset-level prototype library combined with KNN retrieval to distinguish camouflaged objects from backgrounds, enabling unsupervised COD.

Method

Overall Architecture

RISE operates in two stages: (1) Clustering-then-Retrieval (CR) — spectral clustering is applied to each image to generate coarse masks; foreground/background global features are extracted; high-confidence prototypes are selected via cross-category retrieval and aggregated into a prototype library; (2) Multi-View KNN Retrieval (MVKR) — DINOv2 features are extracted per image; each local feature retrieves the top-K most similar prototypes from the library and votes for foreground or background; multi-view fusion eliminates artifacts; the resulting pseudo-masks are used to train SINet-V2.

Key Designs

  1. Clustering-then-Retrieval (CR) — Prototype Library Construction:

    • Function: Construct a high-quality foreground/background prototype library from unannotated COD data.
    • Mechanism:
      • Spectral clustering for coarse masks: A feature similarity graph \(\mathcal{G}\) is constructed with adjacency matrix \(\mathbf{W}_{i,j} = \max(\cos(\mathbf{F}'_i, \mathbf{F}'_j), 0)\); the normalized Laplacian \(\mathbf{L} = \mathbf{D}^{-1/2}(\mathbf{D}-\mathbf{W})\mathbf{D}^{-1/2}\) is computed; its eigenvectors are clustered into two groups with KMeans, and the cluster with the lower boundary-pixel ratio is taken as foreground.
      • Cross-Category Retrieval: Rather than selecting the foreground prototype most similar to the foreground global feature, the method selects the one least similar to the background global feature: \(\mathbf{P}^f = \arg\min_{\mathbf{s} \in \mathbf{S}_f} \cos(\mathbf{s}, \mathbf{F}^g_b)\). This enhances the discriminability between foreground and background prototypes.
      • Histogram-Adaptive Filtering: The distribution of foreground–background global-feature similarities is computed across all images; low-quality images (those whose foreground and background global features are barely separable) are filtered out, using the histogram peak as the threshold.
    • Design Motivation: Cross-category retrieval is the critical component — intuitively, "least similar to the other class" provides stronger discriminative guarantees than "most similar to own class." Ablation studies attribute a 5–8% improvement to this design. (A minimal sketch of the CR stage appears after this list.)
  2. Multi-View KNN Retrieval (MVKR):

    • Function: Generate high-quality pseudo-masks for each image using the prototype library.
    • Mechanism: For each feature \(\mathbf{F}_{i,j}\), the top-K (\(K=512\)) most similar prototypes are retrieved from the foreground and background libraries, and a majority vote over their labels determines the class assignment. To eliminate artifacts in DINOv2 feature maps, multiple views (flips and rotations) of the same image are generated; each view is retrieved independently and inverse-transformed before the votes are aggregated.
    • Design Motivation: DINOv2 feature maps contain position-dependent artifacts whose locations shift across views. Multi-view fusion eliminates these artifacts without any additional model fine-tuning.
    • Implementation Details: FAISS is used to accelerate retrieval; all images are uniformly resized to \(476 \times 476\) (a multiple of the ViT-L/14 patch size). (A sketch of this stage also follows the list.)
  3. Pseudo-Label Training:

    • Function: Train a standard COD model using the generated pseudo-masks.
    • Mechanism: The generated pseudo-masks are directly used as ground truth to train SINet-V2 following the standard fully supervised training pipeline.
    • Design Motivation: RISE focuses on pseudo-label generation quality; the downstream training component is orthogonal and interchangeable with existing methods.
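
To make the two stages concrete, here is a minimal Python sketch of the CR stage. It is a reconstruction from the formulas above, not the authors' code; the function names, the number of eigenvectors fed to KMeans, and the exact form of the border heuristic are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def coarse_mask_from_features(F, h, w, n_eigvecs=3):
    """Spectral clustering of patch features into a binary coarse mask.

    F: (N, C) L2-normalized patch features (N = h * w), e.g. from DINOv2.
    Returns a boolean (h, w) mask with True = foreground.
    """
    # Affinity W_ij = max(cos(F_i, F_j), 0), as in the adjacency matrix above.
    W = np.clip(F @ F.T, 0.0, None)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-8))
    # Normalized Laplacian L = D^{-1/2} (D - W) D^{-1/2}.
    L = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt
    # The smallest non-trivial eigenvectors carry the partition structure
    # (how many to keep is an assumption here).
    _, eigvecs = np.linalg.eigh(L)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(eigvecs[:, 1:1 + n_eigvecs])
    mask = labels.reshape(h, w).astype(bool)
    # Paper heuristic: the cluster with the lower boundary-pixel ratio is foreground.
    border = np.zeros((h, w), bool)
    border[0, :] = border[-1, :] = border[:, 0] = border[:, -1] = True
    if mask[border].mean() > (~mask)[border].mean():
        mask = ~mask
    return mask

def cross_category_select(S_f, F_g_b):
    """Cross-category retrieval: pick the foreground candidate that is
    LEAST similar to the background global feature F_g_b. All rows of S_f
    and F_g_b are assumed L2-normalized, so dot product = cosine."""
    return S_f[np.argmin(S_f @ F_g_b)]
```

And a sketch of the MVKR stage, again an illustrative reconstruction of the paper's description: FAISS inner-product search over L2-normalized prototypes, per-patch top-K voting, and aggregation across inverse-transformed views. The view representation and voting rule are assumptions.

```python
import faiss
import numpy as np

def mvkr_pseudo_mask(views, proto_lib, proto_is_fg, K=512):
    """Multi-View KNN Retrieval: each patch votes fg/bg via its top-K
    nearest prototypes; votes from all views are averaged.

    views: list of (feats, inverse_fn) pairs; feats is (h*w, C)
           L2-normalized patch features of one flipped/rotated view, and
           inverse_fn maps an (h, w) score map back to the original frame.
    proto_lib: (P, C) L2-normalized fg+bg prototypes.
    proto_is_fg: (P,) boolean labels of the prototypes.
    """
    index = faiss.IndexFlatIP(proto_lib.shape[1])  # inner product = cosine here
    index.add(proto_lib.astype(np.float32))
    fg_votes = None
    for feats, inverse_fn in views:
        _, idx = index.search(feats.astype(np.float32), K)  # (h*w, K) neighbors
        # Per-patch fraction of retrieved prototypes labeled foreground.
        score = proto_is_fg[idx].mean(axis=1)
        h = w = int(np.sqrt(len(feats)))  # assumes a square patch grid
        score = inverse_fn(score.reshape(h, w))  # undo the flip/rotation
        fg_votes = score if fg_votes is None else fg_votes + score
    return (fg_votes / len(views)) > 0.5  # aggregated majority vote
```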

Loss & Training

RISE itself requires no training — it only performs pseudo-label generation. The downstream SINet-V2 follows the standard COD training strategy. The feature extractor is DINOv2-ViT-L14 (frozen).

Key Experimental Results

Main Results

Comparison with unsupervised methods on four COD benchmarks (DINOv2-ViT-L14 feature extractor):

| Method | CHAMELEON \(S_\alpha\) | COD10K \(S_\alpha\) | COD10K \(F^\omega_\beta\) | NC4K \(S_\alpha\) |
|---|---|---|---|---|
| RISE | 0.822 | 0.763 | 0.600 | 0.805 |
| ProMerge | 0.741 | 0.674 | 0.435 | 0.726 |
| TokenCut | 0.708 | 0.637 | 0.370 | 0.697 |
| VoteCut | 0.679 | 0.645 | 0.390 | 0.674 |
| DiffCut | 0.574 | 0.628 | 0.372 | 0.693 |

Comparison with prompt-based methods (with SAM integration):

| Method | CHAMELEON \(S_\alpha\) | COD10K \(S_\alpha\) | COD10K \(F^\omega_\beta\) | NC4K \(S_\alpha\) |
|---|---|---|---|---|
| RISE+SAM | 0.823 | 0.790 | 0.643 | 0.825 |
| WS-SAM* | 0.795 | 0.787 | 0.622 | 0.829 |
| ProMac | 0.786 | 0.774 | 0.609 | 0.812 |
| GenSAM | 0.659 | 0.641 | 0.390 | 0.702 |

Ablation Study

| Configuration | COD10K \(S_\alpha\) | COD10K \(E_\phi\) | COD10K \(F^\omega_\beta\) | COD10K \(M\) ↓ |
|---|---|---|---|---|
| (a) Image-level only (spectral clustering) | 0.641 | 0.662 | 0.414 | 0.169 |
| (b) Without cross-category retrieval | 0.710 | 0.781 | 0.518 | 0.065 |
| (c) Without histogram filtering | 0.744 | 0.822 | 0.575 | 0.055 |
| (d) Without multi-view retrieval | 0.759 | 0.832 | 0.584 | 0.052 |
| (e) Full RISE | 0.763 | 0.840 | 0.600 | 0.049 |

Key Findings

  • Dataset-level information is critical: Moving from image-level-only modeling to full RISE yields over 12% improvement in \(S_\alpha\), demonstrating the decisive advantage of cross-image information over single-image similarity.
  • Cross-category retrieval contributes most: Removing it causes a 5.3% drop in \(S_\alpha\) and 8.2% drop in \(F^\omega_\beta\) on COD10K.
  • RISE surpasses WS-SAM on most metrics despite WS-SAM relying on manually annotated weak-supervision signals, while cutting pseudo-label generation time from days to hours.
  • The method is robust across different DINO variants: DINO-ViT-S16/B16 and DINOv2-S14/B14/L14 all yield effective results.

Highlights & Insights

  • Retrieval self-augmentation paradigm: Rather than relying on external data sources, RISE constructs a prototype library from the dataset itself — a "self-bootstrapping" strategy particularly valuable for COD where annotation costs are extremely high.
  • Counterintuitive cross-category retrieval: Selecting prototypes by "least similar to the opposing class" rather than "most similar to own class" significantly improves discriminability — a trick transferable to any scenario requiring contrastive prototype construction.
  • Multi-view artifact elimination: By exploiting the property that DINOv2 artifact locations shift across views, simple flip/rotation augmentation combined with voting removes artifacts far more efficiently than model fine-tuning.
  • Histogram-adaptive thresholding: Using the peak of the similarity distribution for adaptive filtering eliminates the need for manual threshold selection (sketched below).
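
For illustration, a minimal sketch of such a histogram-peak threshold; the bin count and the filtering direction (discarding images above the peak, i.e. those with poorly separable foreground/background) are assumptions, not details taken from the paper.

```python
import numpy as np

def histogram_peak_threshold(sims, bins=50):
    """Adaptive threshold at the mode of the similarity histogram.

    sims: per-image cosine similarity between the foreground and
    background global features. Returns the center of the peak bin.
    """
    counts, edges = np.histogram(sims, bins=bins)
    peak = int(np.argmax(counts))
    return 0.5 * (edges[peak] + edges[peak + 1])

# Assumed usage: keep images whose foreground/background similarity
# falls below the peak (better-separated pairs).
# keep = sims <= histogram_peak_threshold(sims)
```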

Limitations & Future Work

  • The quality of coarse masks from spectral clustering is a bottleneck — poor initial segmentation degrades prototype quality.
  • The top-K parameter (\(K=512\)) requires tuning (sensitivity analysis provided in Figure 5 of the paper).
  • The current formulation supports only binary classification (foreground/background) and does not support multi-instance detection.
  • Detection of extremely small objects remains challenging, though qualitative results show improvements over baselines.

Comparison with Related Methods

  • vs. TokenCut/VoteCut: These methods rely on Normalized Cut within individual images; the high foreground–background similarity in camouflaged scenes causes them to fail. RISE overcomes this limitation through dataset-level prototypes.
  • vs. ProMac/GenSAM: These methods use diffusion models or multimodal LLMs to generate prompts, requiring days of computation. RISE achieves superior results in only a few hours.
  • vs. Retrieval-Augmented Semantic Segmentation (RASS): RASS relies on external models to generate prototype libraries, whereas RISE mines prototypes from the dataset itself, avoiding out-of-domain bias.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Pioneers the retrieval self-augmentation paradigm in COD; the cross-category retrieval strategy is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers four datasets, eight unsupervised methods, three prompt-based baselines, comprehensive ablations, and sensitivity analysis.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; t-SNE visualizations are highly persuasive.
  • Value: ⭐⭐⭐⭐⭐ Sets a new benchmark for unsupervised COD with ideas generalizable to other fine-grained segmentation tasks.