Empowering DINO Representations for Underwater Instance Segmentation via Aligner and Prompter¶
Conference: AAAI 2026 | arXiv: 2511.08334 | Code: ettof/Diveseg | Area: Image Segmentation
Keywords: underwater instance segmentation, DINOv2, domain adaptation, Fourier style transfer, foundation model fine-tuning
TL;DR¶
This paper is the first to introduce DINOv2 into underwater instance segmentation. Through two adaptation modules—AquaStyle Aligner (Fourier frequency-domain style injection) and ObjectPrior Prompter (binary mask prior prompting)—the proposed DiveSeg achieves efficient domain adaptation and substantially outperforms SAM-based methods on the UIIS and USIS10K benchmarks with fewer parameters.
Background & Motivation¶
Underwater Instance Segmentation (UIS) requires simultaneous pixel-level classification and instance-level discrimination, and is a core technique for ocean exploration, ecological monitoring, and underwater robot navigation. Underwater imagery presents unique challenges:
- Light absorption and scattering: Long-wavelength light is absorbed by water, causing a blue-green color shift.
- Forward scattering induces blurring; backscattering reduces visibility.
- Degradation effects are non-uniform and depth-dependent, leading to large appearance variations across instances.
Early CNN-based methods (e.g., WaterMask) are limited by representational capacity. SAM-based methods such as USIS-SAM incorporate visual foundation models but still depend on large-scale annotated underwater data and yield only limited gains. DINOv2, pretrained with self-supervised learning, acquires task-agnostic general features and exhibits strong generalization, which is particularly advantageous in underwater scenarios where annotations are scarce. However, PCA visualizations reveal that directly transferring DINOv2 to underwater tasks yields features heavily affected by background noise, with some targets missed entirely.
Core Problem¶
How to efficiently adapt DINOv2 to underwater scenarios from two complementary levels:
- Scene-level adaptation: Eliminate the misalignment between underwater color shifts and the pretraining domain.
- Object-level adaptation: Generalize the model to underwater targets such as corals, jellyfish, and sea turtles that are rarely present in the LVD-142M pretraining data.
Method¶
Overall Architecture: DiveSeg¶
DiveSeg is built upon a frozen DINOv2 ViT-L backbone combined with a Mask2Former segmentation head, augmented by two core adaptation modules. The ViT layers are evenly divided into four blocks; an AquaStyle Aligner is inserted at the first layer of each block, and an ObjectPrior Prompter is inserted after each block.
AquaStyle Aligner¶
Goal: Eliminate underwater color domain shift at the scene level.
Style Extraction:
- The input image is transformed via the Fourier transform, separating the amplitude component (containing low-level statistics such as color information) from the phase component (containing content/structural information).
- The spatially varying phase is replaced with its mean value, so only the amplitude carries spatial information; the inverse Fourier transform then reconstructs a "style image" that discards object content while preserving underwater color characteristics.
- A multi-layer convolutional network followed by global average pooling encodes the style image into a compact style vector \(p_x\).
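The amplitude-only reconstruction above can be sketched per channel with NumPy; this is a minimal illustration of the described operation, and the paper's exact normalization and implementation details may differ:

```python
import numpy as np

def extract_style_image(img: np.ndarray) -> np.ndarray:
    """Build a 'style image': keep the Fourier amplitude (color statistics),
    replace the spatially varying phase (content) with its mean, per channel."""
    style = np.empty_like(img, dtype=np.float64)
    for c in range(img.shape[2]):
        spec = np.fft.fft2(img[..., c])
        amplitude = np.abs(spec)
        # Fixing the phase to a single mean value removes object structure
        # while preserving the low-level color/illumination statistics.
        mean_phase = np.angle(spec).mean()
        style[..., c] = np.real(np.fft.ifft2(amplitude * np.exp(1j * mean_phase)))
    return style
```

The style image would then be passed through the convolutional encoder and global average pooling (not shown) to obtain the style vector \(p_x\).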
Style Injection:
- Operating as a parallel branch to Multi-head Attention (MHA) in the ViT, a cross-attention mechanism is employed: ViT features serve as queries, while the style vector processed by an MLP serves as keys and values.
- The cross-attention output is added to the original MHA output: \(\omega_1 = \mathrm{MHA}(V_{in}) + \mathrm{CrossAttn}(V_{in}, \mathrm{MLP}(p_x))\)
- A parallel bottleneck MLP is similarly added alongside the Feed-Forward layer for deeper feature fusion.
- All original MHA and FF parameters are frozen; only the injected module parameters are trained.
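The injection branch can be sketched as single-head cross-attention in NumPy. The weight names `Wq`, `Wk`, `Wv` and the use of a single style token are illustrative stand-ins for the paper's multi-head module:

```python
import numpy as np

def cross_attention(q_tokens, style_kv, Wq, Wk, Wv):
    """q_tokens: (N, d) ViT features used as queries.
    style_kv: (m, d) MLP-projected style vector(s) used as keys/values."""
    Q = q_tokens @ Wq
    K = style_kv @ Wk
    V = style_kv @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Parallel branch: omega_1 = mha_out + cross_attention(vit_tokens, mlp(p_x), Wq, Wk, Wv)
```

Only `Wq`, `Wk`, `Wv` (and the MLP) would be trained; the MHA weights stay frozen.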
ObjectPrior Prompter¶
Goal: Provide instance-agnostic foreground priors at the object level to reduce the learning difficulty of instance segmentation.
Multi-scale Encoder: Three convolutional layers with stride-2 downsampling extract a three-scale feature pyramid \(\{f_M^1, f_M^2, f_M^3\}\) at \(1/8\), \(1/16\), and \(1/32\) of the input resolution.
Pseudo-Mask Generation: At each scale, \(1\times1\) convolution followed by Sigmoid produces pseudo-masks \(P_{mask}^k\), supervised by binary foreground masks obtained by merging all instance ground-truth annotations.
Feature Enhancement: The pseudo-mask is element-wise multiplied with the original features to filter foreground regions, then fused via convolution and residual connection: \(f_{MT}^k = \mathrm{Conv}(P_{mask}^k \cdot f_M^k) + f_M^k\)
Prior Injection: Multi-scale enhanced features are flattened and concatenated into \(O_{prompt}\), which interacts with ViT features via cross-attention (\(O_{prompt}\) as keys/values, ViT features as queries); the output is added to the original ViT features before being passed to the decoder.
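One scale of the mask-gating and residual fusion above can be sketched as follows; the 1×1 convolutions are written as channel-wise matrix multiplies, and the weight names `w_mask` and `w_fuse` are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def enhance_scale(f, w_mask, w_fuse):
    """One scale of the ObjectPrior Prompter.
    f: (H, W, C) features; w_mask: (C, 1) 1x1-conv weights for the
    pseudo-mask head; w_fuse: (C, C) weights standing in for the fusion conv."""
    p_mask = sigmoid(f @ w_mask)   # (H, W, 1): foreground probability
    gated = p_mask * f             # filter foreground regions
    return gated @ w_fuse + f      # f_MT = Conv(P_mask * f_M) + f_M
```

The three enhanced scales would then be flattened and concatenated into \(O_{prompt}\) before the cross-attention injection.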
Loss & Training¶
- Backbone: DINOv2 ViT-L (frozen)
- Decoder head: Mask2Former
- Optimizer: AdamW, weight decay 0.05, initial lr 1e-4 with warmup
- 30,000 iterations with lr decayed by \(1/10\) at iterations 23,000 and 27,000
- Losses: classification loss + mask loss (Mask2Former) + BCE + IoU + L1 loss (pseudo-mask)
- Hardware: NVIDIA A100, batch size 8
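The learning-rate recipe above can be sketched as a step-decay schedule; the linear-warmup length is an assumption (not stated in the summary):

```python
def learning_rate(step, base_lr=1e-4, warmup=500, milestones=(23000, 27000)):
    """Step-decay schedule: linear warmup (length is an assumption),
    then the lr is multiplied by 0.1 at each milestone iteration."""
    lr = base_lr * min(1.0, (step + 1) / warmup)
    for m in milestones:
        if step >= m:
            lr *= 0.1
    return lr
```

For example, the schedule gives the base lr for most of training, 1e-5 between iterations 23,000 and 27,000, and 1e-6 thereafter.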
Key Experimental Results¶
UIIS Dataset (7 categories, 3,937 train / 691 test)¶
| Method | Backbone | Params | mAP | AP50 | AP75 |
|---|---|---|---|---|---|
| WaterMask | ResNet-101 | 67M | 27.2 | 43.7 | 29.3 |
| USIS-SAM | ViT-H | 701M | 29.4 | 45.0 | 32.3 |
| DiveSeg | ViT-L | 390M | 35.6 | 52.0 | 38.5 |
Compared with USIS-SAM, DiveSeg delivers relative improvements of +21.1% mAP, +15.6% AP50, and +19.2% AP75, with only 55.6% of the parameters.
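The quoted relative gains follow directly from the table numbers:

```python
def rel_gain(new, old):
    """Relative improvement in percent, rounded to one decimal."""
    return round(100 * (new - old) / old, 1)

# DiveSeg (ViT-L) vs. USIS-SAM (ViT-H) on UIIS
assert rel_gain(35.6, 29.4) == 21.1        # mAP
assert rel_gain(52.0, 45.0) == 15.6        # AP50
assert rel_gain(38.5, 32.3) == 19.2        # AP75
assert round(100 * 390 / 701, 1) == 55.6   # parameter ratio
```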
USIS10K Dataset (class-agnostic / multi-class)¶
| Method | Class-agnostic mAP | Multi-class mAP |
|---|---|---|
| USIS-SAM (ViT-H, 701M) | 59.7 | 43.1 |
| DiveSeg (ViT-L, 390M) | 64.1 | 48.4 |
Ablation Study¶
| Configuration | mAP | AP50 | AP75 |
|---|---|---|---|
| DINOv2 + Mask2Former (baseline) | 30.9 | 44.6 | 32.2 |
| + AquaStyle Aligner | 34.1 | 50.8 | 37.8 |
| + ObjectPrior Prompter | 34.8 | 50.6 | 37.6 |
| Full Model | 35.6 | 52.0 | 38.5 |
Comparison of Adaptation Strategies (alternatives to AquaStyle Aligner)¶
| Strategy | mAP |
|---|---|
| Frozen (no adaptation) | 30.9 |
| Full Fine-tuning | 31.1 |
| LoRA | 31.8 |
| Adapter | 32.7 |
| AquaStyle Aligner | 34.1 |
The poor performance of full fine-tuning is likely attributable to catastrophic forgetting. By explicitly modeling underwater style information, AquaStyle Aligner outperforms generic parameter-efficient fine-tuning strategies.
Highlights & Insights¶
- First introduction of DINOv2 to underwater instance segmentation, demonstrating that self-supervised pretrained foundation models can be efficiently adapted to underwater scenarios.
- Elegant AquaStyle Aligner design: Fourier frequency-domain decomposition captures underwater color characteristics and injects them into the ViT via cross-attention, with clear physical intuition.
- Insightful ObjectPrior Prompter: Decouples instance segmentation into "foreground awareness → instance discrimination," reducing learning difficulty.
- Exceptional parameter efficiency: with 390M parameters (ViT-L), DiveSeg surpasses the 701M-parameter USIS-SAM (ViT-H), while the majority of its parameters remain frozen.
- Qualitative results demonstrate clear advantages in challenging scenarios including fish-school segmentation under shadows, overlapping instance separation, and misclassification correction.
Limitations & Future Work¶
- Validation is limited to only two datasets (UIIS and USIS10K), leaving the diversity of underwater scenarios underexplored.
- ObjectPrior Prompter relies on ground-truth merged binary masks during training but uses predicted pseudo-masks at inference; prediction quality directly affects final performance.
- Only ViT-L is evaluated; scalability to ViT-B (lighter) or ViT-G (stronger) remains unexplored.
- Inference speed and real-time capability are not discussed, leaving applicability to time-critical applications such as underwater robotics uncertain.
- Style extraction relies on globally averaged phase; style variation across different depths and water conditions may be more complex in practice.
Related Work & Insights¶
| Dimension | WaterMask | USIS-SAM | DiveSeg |
|---|---|---|---|
| Foundation Model | CNN (ResNet) | SAM (ViT-H) | DINOv2 (ViT-L) |
| Parameters | 67M | 701M | 390M |
| Pretraining | Supervised | Supervised (mask annotations) | Self-supervised |
| Domain Adaptation | Dedicated modules | LoRA + Adapter | Style injection + prior prompting |
| Core Mechanism | End-to-end learning | Prompt engineering | Two-level adaptation (scene + object) |
The Fourier frequency-domain style transfer approach is generalizable to other domain adaptation settings (e.g., medical imaging, remote sensing), with the core insight that amplitude encodes domain-related low-level statistics. The decoupled adaptation strategy (scene-level + object-level) constitutes a general paradigm for transferring foundation models to domain-specific segmentation tasks. The prior injection mechanism of ObjectPrior Prompter complements SAM-style prompt engineering: the former learns implicit priors, while the latter requires explicit prompt design. In the era of foundation models, combining parameter-efficient fine-tuning with domain knowledge injection may represent the best practice for low-data domain transfer.
Rating¶
- Novelty: 4/5 (first introduction of DINOv2 to UIS; both modules are insightfully designed)
- Experimental Thoroughness: 4/5 (two datasets, thorough ablations, multi-strategy comparison; speed analysis is absent)
- Writing Quality: 4/5 (clear structure, well-motivated)
- Value: 4/5 (establishes a strong baseline for underwater vision; the transfer methodology has broad applicability)