FCL-COD: Weakly Supervised Camouflaged Object Detection with Frequency-aware and Contrastive Learning
Conference: CVPR 2026 · arXiv: 2603.22969 · Code: None · Area: Image Segmentation · Keywords: Camouflaged Object Detection, Weakly Supervised, SAM, Frequency-aware LoRA, Contrastive Learning
TL;DR
This paper proposes FCL-COD, a framework that injects camouflaged scene knowledge into SAM via Frequency-aware Low-Rank Adaptation (FoRA), enhances foreground-background feature separation through Gradient-aware Contrastive Learning (GCL), and refines boundary-sensitive features with Multi-Scale Frequency Attention (MSFA). Under a weakly supervised setting using only bounding box annotations, FCL-COD surpasses fully supervised state-of-the-art methods.
Background & Motivation
Camouflaged Object Detection (COD) requires identifying objects that are highly similar to their backgrounds. Existing approaches face several challenges:
- Fully supervised methods rely on pixel-level annotations, which are costly and may cause models to overlook holistic structural features of targets.
- Weakly supervised methods exhibit a significant performance gap compared to fully supervised counterparts.
- SAM-based methods suffer from four specific failure modes in camouflaged scenes:
  - (a) Non-camouflaged object response — incorrectly detecting irrelevant objects
  - (b) Partial response — detecting only a portion of the target
  - (c) Extreme response — detection regions that are excessively large or small
  - (d) Lack of fine-grained boundary awareness
This paper systematically addresses each of these four failure modes with dedicated solutions.
Method
Overall Architecture
A two-stage framework:
- Stage 1: A triadic teacher-student self-training architecture adapts SAM using FoRA and GCL to generate high-quality pseudo-labels.
- Stage 2: Pseudo-labels are used to train a lightweight PVT-B4 encoder-decoder with embedded MSFA modules for efficient inference.
Key Designs
- Triadic Teacher-Student Self-training:
- Three encoders are maintained: an anchor encoder \(f^a\) (frozen original SAM, preserving pretrained knowledge), a student encoder \(f^s\) (receiving strongly augmented inputs), and a teacher encoder \(f^t\) (receiving weakly augmented inputs, sharing parameters with the student).
- Student-teacher loss: Focal Loss + Dice Loss supervise the student to learn from teacher pseudo-labels.
- Anchor loss: Prevents the student and teacher from drifting too far from pretrained SAM knowledge, suppressing pseudo-label error accumulation.
- Input prompts are bounding boxes derived from GT mask bounding boxes; no pixel-level annotations are used.
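The triadic training step above can be sketched in PyTorch. This is a minimal illustration under stated assumptions: the function and module names are hypothetical, the box-prompt handling and GCL term are omitted, and the anchor loss is realized here as a feature-level MSE against the frozen SAM encoder.

```python
import torch
import torch.nn.functional as F

def triadic_step(anchor_enc, student_enc, head, x_weak, x_strong):
    """One simplified self-training step (hypothetical sketch).

    anchor_enc : frozen original SAM encoder (pretrained knowledge)
    student_enc: trainable adapted encoder; the teacher shares its
                 parameters but sees weakly augmented inputs and
                 produces pseudo-labels without gradients
    head       : mask decoder producing single-channel logits
    """
    with torch.no_grad():
        # Teacher branch: weak augmentation -> pseudo-label.
        pseudo = (head(student_enc(x_weak)).sigmoid() > 0.5).float()
        anchor_feat = anchor_enc(x_weak)  # frozen SAM features

    logits = head(student_enc(x_strong))  # student sees strong augmentation

    # Student-teacher supervision: focal + dice against the pseudo-label.
    bce = F.binary_cross_entropy_with_logits(logits, pseudo, reduction="none")
    focal = ((1 - torch.exp(-bce)) ** 2 * bce).mean()
    prob = logits.sigmoid()
    dice = 1 - (2 * (prob * pseudo).sum() + 1) / (prob.sum() + pseudo.sum() + 1)

    # Anchor loss: keep adapted features close to pretrained SAM features,
    # suppressing pseudo-label error accumulation.
    anchor = F.mse_loss(student_enc(x_weak), anchor_feat)

    # Weights follow the paper's reported λ1=0.5, λ3=20 (GCL term omitted here).
    return dice + 0.5 * anchor + 20.0 * focal
```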
- Frequency-aware Low-Rank Adaptation (FoRA): Addresses non-camouflaged object responses. A cascaded transform is inserted into the low-rank path of standard LoRA, between its encoder \(W_e\) and decoder \(W_d\):
- Spatial enhancement \(\mathcal{S}_{spa}\): Aggregates multi-scale context via 1×1, 3×3, and 5×5 convolutions with residual connections.
- Frequency modulation \(\mathcal{S}_{fre}\): FFT → frequency-domain 3×3 convolution → IFFT, modeling high-frequency texture differences in camouflaged scenes.
- Forward pass: \(h = W_0 x + W_d \mathcal{S}_{fre}(\mathcal{S}_{spa}(W_e x))\)
- Core Idea: Camouflaged targets and backgrounds are highly similar in the spatial domain but exhibit distinguishable subtle texture differences in the frequency domain.
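A minimal PyTorch sketch of the FoRA forward pass follows. The convolutional realization of the projections, the real/imaginary handling of the FFT, and the zero initialization are assumptions for illustration; the paper's exact configuration may differ.

```python
import torch
import torch.nn as nn

class FoRA(nn.Module):
    """Frequency-aware LoRA sketch: a spatial-enhancement and
    frequency-modulation cascade between W_e and W_d."""

    def __init__(self, dim, rank=8):
        super().__init__()
        self.w_e = nn.Conv2d(dim, rank, 1, bias=False)  # LoRA down-projection
        self.w_d = nn.Conv2d(rank, dim, 1, bias=False)  # LoRA up-projection
        nn.init.zeros_(self.w_d.weight)                 # adapter starts as identity
        # S_spa: multi-scale context via 1x1, 3x3, 5x5 convs with residual.
        self.spa = nn.ModuleList(
            [nn.Conv2d(rank, rank, k, padding=k // 2) for k in (1, 3, 5)])
        # S_fre: 3x3 convolution on the stacked real/imaginary spectrum.
        self.fre = nn.Conv2d(2 * rank, 2 * rank, 3, padding=1)

    def s_fre(self, z):
        zf = torch.fft.rfft2(z, norm="ortho")           # FFT
        zf = self.fre(torch.cat([zf.real, zf.imag], 1)) # frequency-domain conv
        real, imag = zf.chunk(2, 1)
        return torch.fft.irfft2(torch.complex(real, imag),
                                s=z.shape[-2:], norm="ortho")  # IFFT

    def forward(self, x, frozen_out):
        z = self.w_e(x)
        z = z + sum(conv(z) for conv in self.spa)       # S_spa with residual
        # h = W0 x + W_d S_fre(S_spa(W_e x)); frozen_out plays the role of W0 x.
        return frozen_out + self.w_d(self.s_fre(z))
```

Because `w_d` is zero-initialized, the adapter initially leaves the frozen SAM output untouched, which is the usual LoRA warm-start behavior.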
- Gradient-aware Contrastive Learning (GCL): Addresses partial and extreme responses. The key innovation lies in the sampling strategy:
- Grad-CAM is applied to teacher feature maps to derive a gradient activation map \(G^t\).
- A gradient-weighted background mask \(\tilde{m}_0 = \hat{m}_0 \odot G^t\) is constructed, focusing on hard background regions likely to be confused with the foreground.
- Masked average pooling constructs foreground instance prototypes and background prototypes for both student and teacher branches.
- Positive pairs: student-teacher representations of the same instance; negatives: other instances + gradient-weighted background prototypes.
- InfoNCE contrastive loss pushes foreground representations away from hard background representations.
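The sampling strategy and InfoNCE objective can be sketched as follows. This is a simplified single-image version under assumptions: the Grad-CAM map is taken as a precomputed input, and one shared background prototype serves as the hard negative.

```python
import torch
import torch.nn.functional as F

def gcl_loss(feat_s, feat_t, fg_masks, bg_mask, grad_map, tau=0.07):
    """Gradient-aware contrastive loss sketch (names are assumptions).

    feat_s, feat_t : student/teacher feature maps, (C, H, W)
    fg_masks       : per-instance foreground masks, (N, H, W)
    bg_mask        : background mask m0, (H, W)
    grad_map       : Grad-CAM activation G^t on teacher features, (H, W)
    """
    def pool(feat, mask):
        # Masked average pooling -> a (C,) prototype vector.
        w = mask.flatten()
        return (feat.flatten(1) * w).sum(1) / (w.sum() + 1e-6)

    # Gradient-weighted background mask: focus on hard, confusable regions.
    bg_w = bg_mask * grad_map
    protos_s = torch.stack([pool(feat_s, m) for m in fg_masks])
    protos_t = torch.stack([pool(feat_t, m) for m in fg_masks])
    bg_proto = pool(feat_t, bg_w)

    # Positives: same instance across branches; negatives: other instances
    # plus the gradient-weighted background prototype.
    keys = F.normalize(torch.cat([protos_t, bg_proto[None]], 0), dim=1)
    q = F.normalize(protos_s, dim=1)
    logits = q @ keys.t() / tau                  # (N, N+1) similarity matrix
    target = torch.arange(len(fg_masks))         # diagonal entries are positives
    return F.cross_entropy(logits, target)
```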
- Multi-Scale Frequency Attention (MSFA): Addresses the lack of fine-grained boundary awareness. Inserted between the encoder and decoder in Stage 2:
- Dual-branch design: spatial branch \(\mathcal{M}_{spa}\) (stacked 3×3 convolutions) + frequency branch \(\mathcal{M}_{fre}\) (FFT → 1×1 convolution → IFFT).
- Tri-domain attention \(\mathcal{T}\): multi-scale features from one domain gate features in the other domain.
- Multi-scale (S/M/L) spatial and frequency features are cross-gated and then concatenated for fusion.
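The dual-branch, cross-gated design can be sketched as below. The pooling-based scale construction, sigmoid gating, and concatenation fusion are assumptions for illustration; input sizes are assumed divisible by the largest scale.

```python
import torch
import torch.nn as nn

class MSFA(nn.Module):
    """Multi-Scale Frequency Attention sketch (details are assumptions)."""

    def __init__(self, dim, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # M_spa: stacked 3x3 convolutions.
        self.spa = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1))
        # M_fre: 1x1 convolution on the stacked real/imaginary spectrum.
        self.fre = nn.Conv2d(2 * dim, 2 * dim, 1)
        self.fuse = nn.Conv2d(2 * dim * len(scales), dim, 1)

    def m_fre(self, x):
        xf = torch.fft.rfft2(x, norm="ortho")           # FFT
        xf = self.fre(torch.cat([xf.real, xf.imag], 1)) # 1x1 conv in frequency
        r, i = xf.chunk(2, 1)
        return torch.fft.irfft2(torch.complex(r, i),
                                s=x.shape[-2:], norm="ortho")  # IFFT

    def forward(self, x):
        outs = []
        for s in self.scales:                           # S/M/L scales via pooling
            xs = nn.functional.avg_pool2d(x, s) if s > 1 else x
            fs, ff = self.spa(xs), self.m_fre(xs)
            # Cross-gating: each domain gates the other's features.
            gated = torch.cat([fs * ff.sigmoid(), ff * fs.sigmoid()], 1)
            outs.append(nn.functional.interpolate(gated, size=x.shape[-2:]))
        return self.fuse(torch.cat(outs, 1)) + x        # concatenate and fuse
```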
Loss & Training
Stage 1 total loss: \(\mathcal{L} = \mathcal{L}_{st}^{dice} + \lambda_1 \mathcal{L}_{anchor} + \lambda_2 \mathcal{L}_{GCL} + \lambda_3 \mathcal{L}_{st}^{focal}\)
Optimal hyperparameters: \(\lambda_1\)=0.50, \(\lambda_2\)=1.00, \(\lambda_3\)=20
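The weighted combination, with the reported optimal weights as defaults, is simply:

```python
def stage1_loss(l_dice, l_anchor, l_gcl, l_focal,
                lam1=0.50, lam2=1.00, lam3=20.0):
    """Stage-1 objective: L = L_st^dice + λ1·L_anchor + λ2·L_GCL + λ3·L_st^focal."""
    return l_dice + lam1 * l_anchor + lam2 * l_gcl + lam3 * l_focal
```

Note the unusually large focal weight (λ3 = 20), which the paper reports as optimal.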
Stage 2 loss: BCE + uncertainty-aware loss with cosine annealing
Training setup: 2×NVIDIA H20 GPUs, PVT-B4 encoder, SGD (lr=1e-3, momentum=0.9), 60 epochs
Key Experimental Results
Main Results
Comparison with fully supervised and weakly supervised methods (FCL-COD uses the SAM-H backbone; "B" denotes box-level supervision):
| Method | Supervision | CAMO-MAE↓ | CAMO-\(S_m\)↑ | COD10K-MAE↓ | COD10K-\(S_m\)↑ |
|---|---|---|---|---|---|
| SARNet | Full | 0.046 | 0.874 | 0.021 | 0.885 |
| CamoFormer-P | Full | 0.046 | 0.872 | 0.023 | 0.869 |
| HitNet | Full | 0.055 | 0.849 | 0.023 | 0.871 |
| SAM-COD | Weak (B) | 0.062 | 0.837 | 0.028 | 0.842 |
| FCL-COD(H) | Weak (B) | 0.050 | 0.862 | 0.022 | 0.878 |
Under the weakly supervised setting, FCL-COD not only substantially outperforms SAM-COD (CAMO MAE reduced from 0.062 to 0.050) but also surpasses multiple fully supervised methods (e.g., ZoomNet, CamoFormer-R).
Results across different SAM scales:
| Backbone | CAMO-MAE↓ | COD10K-MAE↓ | NC4K-MAE↓ |
|---|---|---|---|
| FCL-COD(SAM-B) | 0.060 | 0.027 | 0.041 |
| FCL-COD(SAM-L) | 0.054 | 0.022 | 0.034 |
| FCL-COD(SAM-H) | 0.050 | 0.022 | 0.033 |
Ablation Study
Incremental ablation of component contributions (\(E_m\)↑ on three datasets):
| FoRA | GCL | MSFA | COD-Train \(E_m\) | CHAMELEON \(E_m\) | COD10K \(E_m\) |
|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 0.959 | 0.927 | 0.919 |
| ✓ | ✗ | ✗ | 0.963 | 0.928 | 0.923 |
| ✓ | ✓ | ✗ | 0.969 | 0.947 | 0.926 |
| ✓ | ✓ | ✓ | — | 0.954 | 0.938 |
FoRA improves pseudo-label quality → GCL further strengthens foreground-background separation → MSFA refines boundaries during inference.
FoRA sub-ablation: spatial enhancement and frequency modulation each contribute +0.001–0.002 \(E_m\); combining both yields +0.004. GCL sub-ablation: standard CL contributes +0.005; adding gradient awareness yields an additional +0.001.
Key Findings
- Frequency-domain information is key to distinguishing camouflaged targets: camouflaged scenes are highly similar in the spatial domain, but exploitable texture differences exist in the frequency domain.
- Grad-CAM-guided hard negative mining is more effective than random sampling.
- Multi-scale spatial-frequency cross-gating outperforms single-branch designs.
- The method generalizes to weakly supervised Salient Object Detection (SOD), also outperforming fully supervised methods.
Highlights & Insights
- Highly systematic problem decomposition: The four SAM failure modes in camouflaged scenes (non-camouflaged response / partial response / extreme response / coarse boundaries) correspond respectively to FoRA / GCL / GCL / MSFA, yielding a coherent and principled design.
- Multi-level exploitation of frequency-domain priors: FoRA injects frequency priors during feature adaptation; MSFA leverages frequency branches to refine boundaries during inference, forming a comprehensive frequency-aware system.
- Weakly supervised results surpassing fully supervised methods are compelling, demonstrating that SAM's strong priors combined with proper adaptation can compensate for the lack of dense annotations.
- Engineering soundness of the two-stage design: Stage 1 uses a large SAM model to generate high-quality pseudo-labels; Stage 2 deploys a lightweight model for inference, balancing accuracy and efficiency.
Limitations & Future Work
- During training, bounding box prompts are derived from GT masks; the acquisition of bounding boxes in practical applications warrants further discussion.
- Inference requires two stages (pseudo-label generation + lightweight detector), making the overall pipeline relatively complex.
- Evaluation on the CHAMELEON dataset (only 76 images) may be subject to statistical variance.
- Extensions to video camouflaged object detection or instance-level camouflaged object detection are not discussed.
Related Work & Insights
- The spatial-frequency cascaded design of FoRA can be generalized to other LoRA adaptation tasks requiring fine-grained texture discrimination.
- The gradient-aware hard negative mining strategy offers a useful reference for any contrastive learning scenario requiring hard negatives.
- The paradigm of SAM + lightweight adaptation + pseudo-label training is transferable to other weakly supervised dense prediction tasks.
Rating
- Novelty: ⭐⭐⭐⭐ — FoRA and GCL are well-designed; the systematic use of frequency-domain priors is a notable highlight.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four datasets, detailed component ablations, hyperparameter analysis, qualitative visualizations, and SOD extension.
- Writing Quality: ⭐⭐⭐⭐ — Problem decomposition is clear, though notation is somewhat dense.
- Value: ⭐⭐⭐⭐ — Weakly supervised results surpassing fully supervised methods are impressive and demonstrate practical applicability.