# Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection
- Conference: ICCV 2025
- arXiv: 2510.18437
- Code: https://github.com/xiaohainku/RISE
- Area: Segmentation / Camouflaged Object Detection / Unsupervised
- Keywords: camouflaged object detection, unsupervised segmentation, retrieval-augmented, KNN, prototype library
## TL;DR
This paper proposes RISE — a retrieval self-augmented unsupervised camouflaged object detection paradigm that constructs foreground/background prototype libraries from the training set itself and leverages KNN retrieval to generate pseudo-labels, substantially outperforming existing unsupervised and prompt-based methods without any annotations.
## Background & Motivation
Background: Camouflaged Object Detection (COD) aims to segment target objects from highly similar backgrounds. Mainstream fully supervised methods rely on dense pixel-level annotations, where annotating a single image can take up to one hour. Weakly and semi-supervised approaches reduce annotation burden but still require partial labels.
Limitations of Prior Work: (a) Unsupervised methods (TokenCut, MaskCut, ProMerge, etc.) primarily exploit intra-image feature similarity to separate foreground from background, but camouflaged objects and backgrounds share highly similar features, causing single-image methods to perform poorly; (b) prompt-based methods combining SAM with task-specific prompts still require some form of supervision and offer limited context-specific understanding of COD; (c) methods that generate pseudo-labels via diffusion models or multimodal LLMs (GenSAM, ProMac) require days of computation and significant GPU resources.
Key Challenge: Within a single image, the DINOv2 features of camouflaged objects and backgrounds are nearly indistinguishable (almost overlapping in t-SNE visualizations) — intra-image similarity alone cannot effectively separate them. However, at the dataset level, foreground objects exhibit higher similarity to a foreground prototype library than to a background prototype library.
Goal: To leverage dataset-level contextual information to distinguish camouflaged foregrounds from backgrounds in a fully annotation-free setting.
Key Insight: Mining prototypes directly from the dataset through a coarse-to-fine strategy — first obtaining coarse masks via clustering, then refining prototypes through retrieval — to construct high-quality foreground/background prototype libraries, followed by KNN-based retrieval to classify each feature as foreground or background.
Core Idea: Rather than relying on intra-image similarity, RISE leverages a dataset-level prototype library combined with KNN retrieval to distinguish camouflaged objects from backgrounds, enabling unsupervised COD.
## Method

### Overall Architecture
RISE operates in two stages: (1) Clustering-then-Retrieval (CR) — spectral clustering is applied to each image to generate coarse masks; foreground/background global features are extracted; high-confidence prototypes are selected via cross-category retrieval and aggregated into a prototype library; (2) Multi-View KNN Retrieval (MVKR) — DINOv2 features are extracted per image; each local feature retrieves the top-K most similar prototypes from the library and votes for foreground or background; multi-view fusion eliminates artifacts; the resulting pseudo-masks are used to train SINet-V2.
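The coarse-mask step of stage 1 can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: `coarse_mask` and its variable names are invented here, the backbone features are assumed to be given as an `(N, D)` array of patch features, a median split of the Fiedler vector stands in for the paper's KMeans clustering of eigenvectors, and the final foreground/background assignment by boundary-pixel ratio is omitted.

```python
import numpy as np

def coarse_mask(feats: np.ndarray) -> np.ndarray:
    """Binary coarse mask over N patch features via spectral clustering.

    feats: (N, D) patch features from a frozen backbone.
    Returns a (N,) array of 0/1 cluster labels.
    """
    # Similarity graph: W_ij = max(cos(F'_i, F'_j), 0)
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    W = np.maximum(f @ f.T, 0.0)
    # Symmetric normalized Laplacian: L = D^{-1/2} (D - W) D^{-1/2}
    d = W.sum(axis=1)
    inv_sqrt_d = 1.0 / np.sqrt(d + 1e-8)
    L = inv_sqrt_d[:, None] * (np.diag(d) - W) * inv_sqrt_d[None, :]
    # Eigenvectors in ascending eigenvalue order; the second one
    # (Fiedler vector) carries the two-way partition.
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    # Median split as a stand-in for binary KMeans on the eigenvectors.
    return (fiedler > np.median(fiedler)).astype(np.uint8)
```

In the paper, the resulting two clusters are then disambiguated by boundary-pixel ratio (the cluster touching the image border less is taken as foreground) before global foreground/background features are pooled.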
### Key Designs
- Clustering-then-Retrieval (CR) — Prototype Library Construction:
- Function: Construct a high-quality foreground/background prototype library from unannotated COD data.
- Mechanism:
- Spectral clustering for coarse masks: A feature similarity graph \(\mathcal{G}\) is constructed with adjacency matrix \(\mathbf{W}_{i,j} = \max(\text{cos}(\mathbf{F}'_i, \mathbf{F}'_j), 0)\); the symmetric normalized Laplacian \(\mathbf{L} = \mathbf{D}^{-1/2}(\mathbf{D}-\mathbf{W})\mathbf{D}^{-1/2}\) is computed; its leading eigenvectors are partitioned with binary KMeans clustering; the cluster with the lower boundary-pixel ratio is assigned as foreground.
- Cross-Category Retrieval: Rather than selecting the foreground prototype most similar to the foreground global feature, the method selects the one least similar to the background global feature: \(\mathbf{P}^f = \arg\min_{\mathbf{s} \in \mathbf{S}_f} \text{cos}(\mathbf{s}, \mathbf{F}^g_b)\). This enhances discriminability between foreground and background prototypes.
- Histogram-Adaptive Filtering: The distribution of foreground–background global feature similarities across all images is computed; images of poor quality are filtered out using the histogram peak as a threshold.
- Design Motivation: Cross-category retrieval is the critical component — intuitively, "least similar to the other class" provides stronger discriminative guarantees than "most similar to own class." Ablation studies confirm a 5–8% improvement attributable to this design.
- Multi-View KNN Retrieval (MVKR):
- Function: Generate high-quality pseudo-masks for each image using the prototype library.
- Mechanism: For each feature \(\mathbf{F}_{i,j}\), the top-K (\(K=512\)) most similar prototypes are retrieved from both foreground and background libraries, and a vote determines the class assignment. To eliminate artifacts in DINOv2 feature maps, multiple views (flips and rotations) of the same image are generated; each view is independently retrieved and inverse-transformed before aggregated voting.
- Design Motivation: DINOv2 feature maps contain position-dependent artifacts whose locations shift across different views. Multi-view fusion eliminates these artifacts without requiring additional model fine-tuning.
- Implementation Details: FAISS is used to accelerate retrieval; all images are uniformly resized to \(476 \times 476\).
- Pseudo-Label Training:
- Function: Train a standard COD model using the generated pseudo-masks.
- Mechanism: The generated pseudo-masks are directly used as ground truth to train SINet-V2 following the standard fully supervised training pipeline.
- Design Motivation: RISE focuses on pseudo-label generation quality; the downstream training component is orthogonal and interchangeable with existing methods.
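The MVKR voting described above can be sketched as follows. This is an illustrative numpy sketch, not the released code: `knn_vote` and `multi_view_mask` are invented names, exact brute-force search replaces the paper's FAISS index, a small `k` replaces the paper's \(K=512\), majority voting over a joint fg+bg library is one plausible reading of the voting rule, and the per-view feature maps are assumed to be already inverse-transformed back to the original orientation.

```python
import numpy as np

def knn_vote(feat: np.ndarray, fg_protos: np.ndarray,
             bg_protos: np.ndarray, k: int = 8) -> float:
    """Fraction of foreground votes among the k most similar prototypes
    (cosine similarity) in the joint fg+bg library."""
    protos = np.vstack([fg_protos, bg_protos])
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sims = protos @ (feat / np.linalg.norm(feat))
    topk = np.argsort(sims)[-k:]                 # indices of k nearest prototypes
    return float((topk < len(fg_protos)).mean()) # share of fg neighbors

def multi_view_mask(feat_maps, fg_protos, bg_protos, k: int = 8) -> np.ndarray:
    """feat_maps: list of (H, W, D) feature maps, one per augmented view,
    each already mapped back to the original orientation. Averaging the
    per-view vote maps suppresses view-dependent artifacts."""
    votes = []
    for fm in feat_maps:
        H, W, D = fm.shape
        flat = fm.reshape(-1, D)
        v = np.array([knn_vote(f, fg_protos, bg_protos, k) for f in flat])
        votes.append(v.reshape(H, W))
    return (np.mean(votes, axis=0) > 0.5).astype(np.uint8)
```

With a real library of ~10^5 prototypes, the brute-force search above is exactly what a FAISS flat inner-product index accelerates; only the index structure changes, not the voting logic.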
### Loss & Training
RISE itself requires no training — it only performs pseudo-label generation. The downstream SINet-V2 follows the standard COD training strategy. The feature extractor is DINOv2-ViT-L14 (frozen).
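The histogram-adaptive filtering used in CR can be sketched as below. This is a hedged illustration: the function names are invented, and the choice to drop images whose foreground-background similarity lies *above* the modal bin (i.e., fg and bg global features too alike to yield reliable prototypes) is an assumption about the filtering direction, which the paper does not spell out in this summary.

```python
import numpy as np

def histogram_threshold(sims: np.ndarray, bins: int = 50) -> float:
    """Adaptive threshold at the peak of the per-image fg-bg
    global-feature similarity histogram (no manual tuning)."""
    counts, edges = np.histogram(sims, bins=bins)
    peak = int(np.argmax(counts))
    return float((edges[peak] + edges[peak + 1]) / 2)  # center of modal bin

def keep_images(sims: np.ndarray) -> np.ndarray:
    """Boolean keep-mask: retain images whose fg-bg similarity falls
    below the histogram peak (assumed filtering direction)."""
    return sims < histogram_threshold(sims)
```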
## Key Experimental Results

### Main Results
Comparison with unsupervised methods on four COD benchmarks (DINOv2-ViT-L14 feature extractor):
| Method | CHAMELEON \(S_\alpha\)↑ | COD10K \(S_\alpha\)↑ | COD10K \(F^\omega_\beta\)↑ | NC4K \(S_\alpha\)↑ |
|---|---|---|---|---|
| RISE | 0.822 | 0.763 | 0.600 | 0.805 |
| ProMerge | 0.741 | 0.674 | 0.435 | 0.726 |
| TokenCut | 0.708 | 0.637 | 0.370 | 0.697 |
| VoteCut | 0.679 | 0.645 | 0.390 | 0.674 |
| DiffCut | 0.574 | 0.628 | 0.372 | 0.693 |
Comparison with prompt-based methods (with SAM integration):
| Method | CHAMELEON \(S_\alpha\)↑ | COD10K \(S_\alpha\)↑ | COD10K \(F^\omega_\beta\)↑ | NC4K \(S_\alpha\)↑ |
|---|---|---|---|---|
| RISE+SAM | 0.823 | 0.790 | 0.643 | 0.825 |
| WS-SAM* | 0.795 | 0.787 | 0.622 | 0.829 |
| ProMac | 0.786 | 0.774 | 0.609 | 0.812 |
| GenSAM | 0.659 | 0.641 | 0.390 | 0.702 |
### Ablation Study
| Configuration | COD10K \(S_\alpha\) | COD10K \(E_\phi\) | COD10K \(F^\omega_\beta\) | COD10K \(M\) |
|---|---|---|---|---|
| (e) Full RISE | 0.763 | 0.840 | 0.600 | 0.049 |
| (a) Image-level only (spectral clustering) | 0.641 | 0.662 | 0.414 | 0.169 |
| (b) Without cross-category retrieval | 0.710 | 0.781 | 0.518 | 0.065 |
| (c) Without histogram filtering | 0.744 | 0.822 | 0.575 | 0.055 |
| (d) Without multi-view retrieval | 0.759 | 0.832 | 0.584 | 0.052 |
### Key Findings
- Dataset-level information is critical: Moving from image-level-only modeling to full RISE raises COD10K \(S_\alpha\) from 0.641 to 0.763, a gain of over 12 points that demonstrates the decisive advantage of cross-image information over single-image similarity.
- Cross-category retrieval contributes most: Removing it causes a 5.3% drop in \(S_\alpha\) and 8.2% drop in \(F^\omega_\beta\) on COD10K.
- RISE+SAM surpasses WS-SAM, which uses manually annotated weak supervision signals, on most metrics (all reported except NC4K \(S_\alpha\)), while reducing pseudo-label generation from days of computation to hours.
- The method is robust across different DINO variants: DINO-ViT-S16/B16 and DINOv2-S14/B14/L14 all yield effective results.
## Highlights & Insights
- Retrieval self-augmentation paradigm: Rather than relying on external data sources, RISE constructs a prototype library from the dataset itself — a "self-bootstrapping" strategy particularly valuable for COD where annotation costs are extremely high.
- Counterintuitive cross-category retrieval: Selecting prototypes by "least similar to the opposing class" rather than "most similar to own class" significantly improves discriminability — a trick transferable to any scenario requiring contrastive prototype construction.
- Multi-view artifact elimination: By exploiting the property that DINOv2 artifact locations shift across views, simple flip/rotation augmentation combined with voting removes artifacts far more efficiently than model fine-tuning.
- Histogram-adaptive thresholding: Using the peak of the similarity distribution for adaptive filtering eliminates the need for manual threshold selection.
## Limitations & Future Work
- The quality of coarse masks from spectral clustering is a bottleneck — poor initial segmentation degrades prototype quality.
- The top-K parameter (\(K=512\)) requires tuning (sensitivity analysis provided in Figure 5 of the paper).
- The current formulation supports only binary classification (foreground/background) and does not support multi-instance detection.
- Detection of extremely small objects remains challenging, though qualitative results show improvements over baselines.
## Related Work & Insights
- vs. TokenCut/VoteCut: These methods rely on Normalized Cut within individual images; the high foreground–background similarity in camouflaged scenes causes failure. RISE overcomes this limitation through dataset-level prototypes.
- vs. ProMac/GenSAM: These methods use diffusion models or multimodal LLMs to generate prompts, requiring days of computation. RISE achieves superior results in only a few hours.
- vs. Retrieval-Augmented Semantic Segmentation (RASS): RASS relies on external models to generate prototype libraries, whereas RISE mines prototypes from the dataset itself, avoiding out-of-domain bias.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ The retrieval self-augmentation paradigm is pioneered in COD; the cross-category retrieval strategy is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers four datasets, eight unsupervised methods, three prompt-based baselines, comprehensive ablations, and sensitivity analysis.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; t-SNE visualizations are highly persuasive.
- Value: ⭐⭐⭐⭐⭐ Sets a new benchmark for unsupervised COD with ideas generalizable to other fine-grained segmentation tasks.