
SDF-Net: Structure-Aware Disentangled Feature Learning for Optical-SAR Ship Re-identification

Conference: CVPR 2026
arXiv: 2603.12588
Code: github.com/cfrfree/SDF-Net
Area: Remote Sensing / Cross-Modal Retrieval
Keywords: Optical-SAR cross-modal, ship re-identification, structure-aware, feature disentanglement, gradient energy

TL;DR

This paper proposes SDF-Net, a physics-guided network for optical-SAR ship re-identification that enforces cross-modal geometric consistency via intermediate-layer gradient energy (SCL) and disentangles shared from modality-specific features at the terminal layer (DFL), fusing them by parameter-free addition. It achieves 60.9% mAP on HOSS-ReID, +3.5% over the prior SOTA TransOSS.

Background & Motivation

Background: Optical and SAR sensors are complementary in maritime surveillance — optical imagery provides high-resolution visual detail while SAR enables all-weather, all-day observation. Cross-modal ship ReID is a fundamental task for fusing these two heterogeneous data sources. Existing methods fall into three categories: implicit attention-based alignment (TransOSS), statistical/generative alignment (CycleGAN), and handcrafted geometric descriptors (HOPC).

Limitations of Prior Work: Optical and SAR imagery exhibit severe nonlinear radiometric distortion (NRD) — passive visible-light reflection versus active microwave backscattering results in completely different textural appearances for the same target. Implicit alignment methods ignore physical priors; generative synthesis introduces artifacts and incurs high cost; person ReID assumptions about deformable bodies do not apply to rigid ship structures.

Key Challenge: Ships are rigid bodies whose geometric structure remains stable across modalities, yet texture is highly modality-dependent. Existing methods attempt to align all features without distinguishing structure from texture.

Goal: To explicitly exploit geometric structure as a physics-grounded "anchor" for cross-modal association, enforcing strict structural consistency while tolerating modality-specific appearance variation.

Key Insight: Gradient energy is extracted from intermediate-layer features — abstract enough to filter low-level noise (e.g., SAR speckle) while preserving spatial topological information. At the terminal layer, shared and modality-specific features are disentangled and fused via additive residuals to maintain discriminability.

Core Idea: Use intermediate-layer gradient energy statistics as scale-invariant structural descriptors to enforce cross-modal geometric consistency, while disentangling and fusing modality-invariant and modality-specific features at the terminal layer.

Method

Overall Architecture

ViT-B/16 dual-head tokenizer encoder → intermediate-layer (Block 6) structure-aware consistency learning (SCL) → terminal-layer disentangled feature learning (DFL) → parameter-free additive residual fusion → identity classification. Training jointly optimizes identity loss, structural consistency loss, and orthogonality constraint loss.

Key Designs

  1. Structure-Aware Consistency Learning (SCL):

    • Function: Extracts cross-modal geometrically invariant structural features from the ViT intermediate layer (Block 6).
    • Mechanism: Computes spatial gradients of intermediate feature maps \(\mathbf{G}_x(h,w) = \mathbf{F}(h,w+1) - \mathbf{F}(h,w-1)\); spatially integrates them into a gradient energy descriptor \(\mathbf{f}_{struct} = \mathbf{e}_x + \mathbf{e}_y \in \mathbb{R}^{B \times C}\), where \(\mathbf{e}_x\) and \(\mathbf{e}_y\) aggregate the energy of \(\mathbf{G}_x\) and \(\mathbf{G}_y\) over the spatial dimensions; applies Instance Normalization to eliminate inter-modal magnitude differences; constructs identity-level cross-modal prototypes and enforces alignment via Euclidean distance constraints (a PyTorch sketch follows this list).
    • Design Motivation: The intermediate layer balances spatial detail and semantic abstraction — shallow layers are corrupted by speckle noise, while deep layers lose spatial information due to global aggregation. The gradient operator naturally acts as a high-pass filter insensitive to SAR multiplicative intensity differences. The macro-level geometric skeleton of rigid ships remains consistent across modalities under near-vertical observation.
  2. Disentangled Feature Learning with Additive Fusion (DFL):

    • Function: Decomposes the terminal representation into modality-invariant shared features and modality-specific features, fused in a parameter-free manner.
    • Mechanism: Two parallel linear projection heads map \(\mathbf{F}^{(L)}\) to \(\mathbf{f}_{sh}\) and \(\mathbf{f}_{sp}\); an orthogonality constraint \(\mathcal{L}_{orth} = \mathbb{E}[|\langle \bar{\mathbf{f}}_{sh}, \bar{\mathbf{f}}_{sp} \rangle|]\) ensures subspace independence; additive fusion \(\mathbf{f}_{fuse} = \mathbf{f}_{sh} + \mathbf{f}_{sp}\) (sketched after this list).
    • Design Motivation: Unlike person ReID which discards modality-specific features, SAR corner-reflector responses and optical paint reflections in ships contain identity-discriminative cues. Additive fusion treats modality-specific features as residual corrections with zero additional parameters.
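
Below is a minimal PyTorch sketch of both modules. The function and class names, the squared-gradient form of the energy, the prototype construction from within-batch means, and the plain linear projections are illustrative assumptions; only the central-difference gradient, Instance Normalization, Euclidean prototype alignment, orthogonality term, and additive fusion are taken from the descriptions above.

```python
import torch

def gradient_energy_descriptor(feat: torch.Tensor) -> torch.Tensor:
    """Gradient energy structural descriptor (SCL), a minimal sketch.

    `feat` is an intermediate feature map of shape (B, C, H, W), e.g. the
    Block-6 patch tokens of a ViT-B/16 folded back onto the spatial grid.
    """
    # Central differences, as in the paper: G_x(h, w) = F(h, w+1) - F(h, w-1)
    g_x = feat[..., :, 2:] - feat[..., :, :-2]   # (B, C, H, W-2)
    g_y = feat[..., 2:, :] - feat[..., :-2, :]   # (B, C, H-2, W)

    # Spatially integrate into per-channel energies, then sum directions:
    # f_struct = e_x + e_y, shape (B, C). The squared-gradient form is an
    # assumption consistent with the term "gradient energy".
    e_x = g_x.pow(2).sum(dim=(-2, -1))
    e_y = g_y.pow(2).sum(dim=(-2, -1))
    f_struct = e_x + e_y

    # Instance Normalization across channels removes inter-modal
    # magnitude differences between optical and SAR descriptors.
    mean = f_struct.mean(dim=1, keepdim=True)
    std = f_struct.std(dim=1, keepdim=True)
    return (f_struct - mean) / (std + 1e-6)

def structural_consistency_loss(f_struct, labels, modality):
    """Euclidean alignment of identity-level cross-modal prototypes.

    `modality` is assumed to be 0 for optical and 1 for SAR samples.
    """
    loss = f_struct.new_zeros(())
    pairs = 0
    for pid in labels.unique():
        opt = f_struct[(labels == pid) & (modality == 0)]
        sar = f_struct[(labels == pid) & (modality == 1)]
        if len(opt) > 0 and len(sar) > 0:
            # Prototype = mean descriptor of each modality for this identity.
            loss = loss + torch.dist(opt.mean(dim=0), sar.mean(dim=0))
            pairs += 1
    return loss / max(pairs, 1)
```

And the corresponding DFL head:

```python
import torch.nn as nn
import torch.nn.functional as F

class DFLHead(nn.Module):
    """Disentangled feature learning with additive fusion, a minimal sketch."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj_sh = nn.Linear(dim, dim)  # modality-invariant branch
        self.proj_sp = nn.Linear(dim, dim)  # modality-specific branch

    def forward(self, f_terminal):
        f_sh = self.proj_sh(f_terminal)
        f_sp = self.proj_sp(f_terminal)

        # L_orth = E[ |<f_sh_bar, f_sp_bar>| ] on L2-normalized features,
        # pushing the two subspaces toward independence.
        l_orth = (F.normalize(f_sh, dim=1) *
                  F.normalize(f_sp, dim=1)).sum(dim=1).abs().mean()

        # Parameter-free additive fusion: modality-specific features act
        # as residual corrections on the shared representation.
        f_fuse = f_sh + f_sp
        return f_fuse, l_orth
```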

Loss & Training

\[\mathcal{L} = \mathcal{L}_{id} + 10.0 \cdot \mathcal{L}_{orth} + 1.0 \cdot \mathcal{L}_{struct}\]
  • \(\mathcal{L}_{id}\): Label-smoothed cross-entropy + weighted triplet loss
  • SGD, weight decay 1e-4, batch size 32 (P=8 identities × K=4 images, strictly 2 optical + 2 SAR per identity)
  • 100 epochs, linear warmup, single RTX 3090
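
A one-line sketch of how the three terms combine, using the weights stated above; `l_id` here stands in for the label-smoothed cross-entropy plus weighted-triplet terms, whose internal weighting is not specified in this summary.

```python
def total_loss(l_id, l_orth, l_struct,
               w_orth: float = 10.0, w_struct: float = 1.0):
    """L = L_id + 10.0 * L_orth + 1.0 * L_struct."""
    return l_id + w_orth * l_orth + w_struct * l_struct
```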

Key Experimental Results

Main Results

| Method | Type | All mAP | All R1 | O→S mAP | S→O mAP |
|---|---|---|---|---|---|
| TransReID | Single-modal ReID | 48.1% | 60.8% | 27.3% | 20.9% |
| D2InterNet | Single-modal ReID | 50.2% | 59.1% | 33.0% | 28.8% |
| DEEN | Cross-modal ReID | 43.8% | 58.5% | 31.3% | 27.4% |
| VersReID | Cross-modal ReID | 49.3% | 59.7% | 25.7% | 27.7% |
| TransOSS | Optical-SAR specific | 57.4% | 65.9% | 48.9% | 38.7% |
| SDF-Net | Optical-SAR specific | 60.9% | 69.9% | 50.0% | 46.6% |

Ablation Study

| SCL | DFL | All mAP | All R1 | O→S mAP | S→O mAP |
|---|---|---|---|---|---|
| ✗ | ✗ | 58.6% | 67.6% | 46.5% | 44.5% |
| ✓ | ✗ | 59.2% | 66.5% | 47.6% | 46.6% |
| ✗ | ✓ | 59.8% | 69.9% | 49.3% | 41.4% |
| ✓ | ✓ | 60.9% | 69.9% | 50.0% | 46.6% |

| Fusion Strategy | All mAP | All R1 | S→O mAP |
|---|---|---|---|
| \(\mathbf{f}_{sp}\) only | 58.7% | 67.6% | 43.3% |
| \(\mathbf{f}_{sh}\) only | 59.2% | 68.8% | 43.1% |
| Concatenation | 59.5% | 69.3% | 44.1% |
| Additive sum | 60.9% | 69.9% | 46.6% |

| Extraction Layer \(B_s\) | All mAP | S→O mAP |
|---|---|---|
| 2 | 59.7% | 46.0% |
| 4 | 60.4% | 45.3% |
| 6 | 60.9% | 46.6% |
| 8 | 58.4% | 45.5% |
| 10 | 58.7% | 44.7% |

Key Findings

  • The most challenging SAR-to-Optical scenario shows the largest improvement (+7.9% mAP), validating the critical role of geometric anchoring in bridging the active/passive modality gap.
  • SCL and DFL contribute complementarily: SCL improves the S→O direction (geometric alignment) while DFL improves Rank-1 (identity discriminability), and their combination yields the best overall performance.
  • Additive fusion outperforms concatenation — zero additional parameters with optimal performance, confirming the effectiveness of treating modality-specific features as residual corrections.
  • Block 6 is the optimal structural extraction layer — too shallow (Block 2) suffers from speckle noise contamination, while too deep (Block 8+) collapses spatial information.

Highlights & Insights

  • Physics-guided design philosophy — leveraging rigid-body geometric invariance as a prior rather than purely data-driven alignment, yielding +7.9% mAP in the hardest SAR→Optical scenario.
  • The combination of gradient energy and Instance Normalization to construct scale-invariant structural descriptors is elegant and generalizable to other cross-modal matching tasks.
  • The parameter-free additive fusion design is concise and effective, avoiding the computational overhead and artifact issues associated with generative methods.

Limitations & Future Work

  • Validation is limited to a single dataset (HOSS-ReID) with only 1,063 training images, a relatively small scale.
  • The method assumes near-vertical observation — 3D SAR distortions (layover/foreshortening) at extreme incidence angles are not addressed.
  • In very low-resolution SAR imagery, structural contours may be entirely overwhelmed by speckle.
  • Multi-scale structural extraction is not explored; only a single layer (Block 6) is used.

Related Work Comparison

  • TransOSS (ICCV 2025): ViT-based optical-SAR baseline (57.4% mAP vs. our 60.9%); lacks explicit physical constraints.
  • HOPC (classical remote sensing): handcrafted local geometric descriptors operating at the pixel level; this work lifts the concept into the intermediate latent space.
  • Hi-CMD (VI-ReID): discards modality-specific features; this paper shows that for rigid-body targets such information should be retained as residual corrections.

Rating

  • ⭐⭐⭐⭐ Novelty: Physics-guided gradient energy structural features combined with disentangled residual fusion; theoretical motivation is well-grounded.
  • ⭐⭐⭐⭐ Experimental Thoroughness: Three-protocol evaluation with three-dimensional ablation (modules / fusion strategies / extraction layers) and comprehensive baselines.
  • ⭐⭐⭐⭐ Writing Quality: The correspondence between physical motivation and method design is explicit, with a complete logical chain.
  • ⭐⭐⭐⭐ Value: Practically valuable for cross-modal remote sensing retrieval; the gradient energy concept is transferable to other tasks.