SDF-Net: Structure-Aware Disentangled Feature Learning for Optical–SAR Ship Re-Identification

Conference: CVPR2025
arXiv: 2603.12588
Code: GitHub
Area: Cross-Modal Retrieval / Remote Sensing
Keywords: Optical-SAR, Ship Re-identification, Structure Consistency, Feature Disentanglement, Gradient Energy

TL;DR

SDF-Net exploits the physical prior that ships are rigid bodies. It extracts scale-invariant gradient energy statistics from an intermediate ViT layer as cross-modal geometric anchors. At the terminal layer, features are disentangled into modality-invariant shared features and modality-specific features, which are then fused by parameter-free additive residual fusion. The method achieves state-of-the-art (SOTA) performance on optical-SAR ship re-identification.

Background & Motivation

Task Definition: Cross-modal ship re-identification (ReID) aims to associate the same ship identity across optical and SAR images, which is a core task in maritime surveillance.

Key Challenge: Severe non-linear radiation distortion (NRD) exists between optical (passive reflection) and SAR (active microwave scattering) modalities. Since the texture appearance is highly modality-dependent, direct appearance alignment is unreliable.

Limitations of Prior Work:

  • Methods based on statistical distribution alignment do not exploit physical priors.
  • Generative methods (e.g., CycleGAN) are computationally expensive and may introduce hallucinated artifacts.
  • Pedestrian ReID methods (e.g., Hi-CMD, DEEN) are designed for non-rigid body deformations and are unsuitable for rigid ships.

Key Insight: Ships are rigid bodies whose geometric structures (contours, aspect ratios, spatial layouts) remain stable across modalities, whereas textures are modality-dependent. Intermediate-layer representations abstract away low-level noise while retaining spatial topology, which makes them a well-suited place to extract structural information.

Method

Overall Architecture (Based on ViT-B/16)

(a) Input Stage: Cross-Modal Dual-Head Tokenizer

  • Optical and SAR images are mapped to a unified \(C\)-dimensional space via independent linear projection heads.
  • Design Motivation: Neutralize low-level sensor discrepancies so that the shared self-attention is not dominated by modality-specific intensity biases.
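
The dual-head idea above can be sketched in NumPy as two separate patch-projection matrices that write into the same token space. This is only an illustration, not the authors' code: the class name `DualHeadTokenizer`, the single-channel SAR input, the random initialization, and reading 256×128 as height×width are all assumptions.

```python
import numpy as np

PATCH = 16  # ViT-B/16 patch size


def patchify(img, patch=PATCH):
    """Split a (C_in, H, W) image into (N, C_in * patch * patch) flattened patches."""
    c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0
    x = img.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C_in, p, p)
    return x.reshape((h // patch) * (w // patch), c * patch * patch)


class DualHeadTokenizer:
    """Hypothetical tokenizer: one linear projection head per modality."""

    def __init__(self, dim, opt_channels=3, sar_channels=1, seed=0):
        rng = np.random.default_rng(seed)
        self.heads = {}
        # Assumption: optical is 3-channel, SAR is 1-channel.
        for name, ch in (("optical", opt_channels), ("sar", sar_channels)):
            in_dim = ch * PATCH * PATCH
            weight = rng.normal(0.0, 0.02, (in_dim, dim))
            self.heads[name] = (weight, np.zeros(dim))

    def __call__(self, img, modality):
        weight, bias = self.heads[modality]
        # Both modalities end up in the same dim-dimensional token space.
        return patchify(img) @ weight + bias


tok = DualHeadTokenizer(dim=768)
opt_tokens = tok(np.random.rand(3, 256, 128), "optical")
sar_tokens = tok(np.random.rand(1, 256, 128), "sar")
print(opt_tokens.shape, sar_tokens.shape)  # (128, 768) (128, 768)
```

Only the projection differs per modality; every layer after the tokenizer would see tokens of identical shape regardless of the sensor.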

(b) Intermediate Stage: Structure-Aware Consistency Learning (SCL)

  • Intermediate Layer Gradient Energy Extraction: Feature maps \(\mathbf{F}^{(B_s)}\) are extracted from layer \(B_s = 6\) (of 12).
  • Spatial Gradient Computation: Central differences are taken in both the horizontal and vertical directions, \(\mathbf{G}_x(h,w) = \mathbf{F}(h,w+1) - \mathbf{F}(h,w-1)\) and \(\mathbf{G}_y(h,w) = \mathbf{F}(h+1,w) - \mathbf{F}(h-1,w)\).
  • Spatial Integration: \(\mathbf{e}_x = \frac{1}{H'W'}\sum_{h,w}|\mathbf{G}_x(h,w)|\) (with \(\mathbf{e}_y\) defined analogously), compressing the gradient fields into a channel-wise structural descriptor \(\mathbf{f}_{\text{struct}}\).
  • Instance Normalization: \(\hat{\mathbf{f}}_{\text{struct}} = \text{IN}(\mathbf{f}_{\text{struct}})\), eliminating absolute magnitude discrepancies between modalities.
  • Prototype-Level Consistency Loss: \(\mathcal{L}_{\text{struct}} = \frac{1}{|\mathcal{I}|}\sum_i \|\mathbf{c}_i^o - \mathbf{c}_i^s\|_2^2\), where \(\mathbf{c}_i^o\) and \(\mathbf{c}_i^s\) are the optical and SAR structural prototypes of identity \(i\) and \(\mathcal{I}\) is the set of identities; the loss aligns the two prototypes of the same identity.
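
The SCL steps above can be sketched in NumPy as follows. Several details are assumptions made for illustration rather than facts from the paper: the descriptor is formed by concatenating \(\mathbf{e}_x\) and \(\mathbf{e}_y\), border positions are dropped, instance normalization is applied across the entries of each descriptor, and prototypes are per-identity batch means.

```python
import numpy as np


def gradient_energy(feat):
    """feat: (C, H, W) intermediate feature map -> (2C,) structural descriptor."""
    gx = feat[:, :, 2:] - feat[:, :, :-2]  # F(h, w+1) - F(h, w-1), interior columns
    gy = feat[:, 2:, :] - feat[:, :-2, :]  # F(h+1, w) - F(h-1, w), interior rows
    ex = np.abs(gx).mean(axis=(1, 2))      # spatial integration per channel
    ey = np.abs(gy).mean(axis=(1, 2))
    return np.concatenate([ex, ey])        # assumption: concatenate both directions


def instance_norm(f, eps=1e-6):
    """Zero-mean, unit-variance normalization of a single descriptor."""
    return (f - f.mean()) / np.sqrt(f.var() + eps)


def struct_loss(desc, ids, is_sar):
    """Mean squared L2 distance between optical and SAR prototypes per identity.

    desc: (B, D) normalized descriptors, ids: (B,) identities, is_sar: (B,) bool.
    """
    losses = []
    for pid in np.unique(ids):
        opt = desc[(ids == pid) & ~is_sar]
        sar = desc[(ids == pid) & is_sar]
        if len(opt) and len(sar):  # identity must appear in both modalities
            diff = opt.mean(axis=0) - sar.mean(axis=0)
            losses.append(np.sum(diff ** 2))
    return float(np.mean(losses)) if losses else 0.0


rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 16, 8))  # toy (C, H', W') map; ViT-B would use C = 768
d1 = instance_norm(gradient_energy(feat))
d2 = instance_norm(gradient_energy(3.0 * feat))  # same structure, 3x magnitude
print(np.allclose(d1, d2, atol=1e-3))

batch = np.stack([instance_norm(gradient_energy(rng.normal(size=(16, 16, 8))))
                  for _ in range(8)])
ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
is_sar = np.array([False, False, True, True] * 2)
print(struct_loss(batch, ids, is_sar))
```

The first check illustrates why the normalization step matters: rescaling the feature map rescales the raw gradient energy, but the normalized descriptor is left essentially unchanged.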

(c) Terminal Stage: Disentangled Feature Learning (DFL)

  • Two parallel linear projection heads decompose the terminal representation into shared identity features \(\mathbf{f}_{\text{sh}}\) and modality-specific features \(\mathbf{f}_{\text{sp}}\).
  • Orthogonal Constraint: \(\mathcal{L}_{\text{orth}} = \mathbb{E}[|\langle\bar{\mathbf{f}}_{\text{sh}}, \bar{\mathbf{f}}_{\text{sp}}\rangle|]\), ensuring the independence of the two subspaces.
  • Additive Residual Fusion: \(\mathbf{f}_{\text{fuse}} = \mathbf{f}_{\text{sh}} + \mathbf{f}_{\text{sp}}\), a parameter-free fusion in which modality-specific features serve as residual supplements.
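
A minimal NumPy sketch of the DFL stage is given below. It assumes that the bar in \(\bar{\mathbf{f}}\) denotes L2 normalization and that the two heads are plain bias-free linear maps; the function names and the toy dimensions are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Row-wise L2 normalization."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def orth_loss(f_sh, f_sp):
    """Batch mean of |<f_sh, f_sp>| computed on L2-normalized features."""
    cos = np.sum(l2_normalize(f_sh) * l2_normalize(f_sp), axis=1)
    return float(np.mean(np.abs(cos)))


def disentangle_and_fuse(f, w_sh, w_sp):
    """Project terminal features with two parallel heads, then fuse by addition."""
    f_sh = f @ w_sh        # shared (modality-invariant) identity features
    f_sp = f @ w_sp        # modality-specific features
    return f_sh, f_sp, f_sh + f_sp  # fusion adds no parameters


rng = np.random.default_rng(0)
f = rng.normal(size=(4, 32))  # toy terminal features (batch of 4)
w_sh = rng.normal(size=(32, 32))
w_sp = rng.normal(size=(32, 32))
f_sh, f_sp, f_fuse = disentangle_and_fuse(f, w_sh, w_sp)
print(f_fuse.shape, orth_loss(f_sh, f_sp))
```

Because the loss is an absolute cosine, it lies in [0, 1] and reaches 0 only when each shared vector is orthogonal to its modality-specific counterpart.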

Joint Optimization: \(\mathcal{L} = \mathcal{L}_{\text{id}} + \lambda_{\text{orth}}\mathcal{L}_{\text{orth}} + \lambda_{\text{struct}}\mathcal{L}_{\text{struct}}\)

  • \(\mathcal{L}_{\text{id}}\) includes label-smoothed cross-entropy and weighted triplet loss.
  • Hyperparameter Settings: \(\lambda_{\text{orth}} = 10.0\), \(\lambda_{\text{struct}} = 1.0\).
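
As a rough illustration of how the objective combines, the sketch below implements a standard label-smoothed cross-entropy and the weighted sum with the stated \(\lambda\) values. The smoothing factor 0.1, the plain sum of cross-entropy and triplet terms inside \(\mathcal{L}_{\text{id}}\), and the placeholder loss values are assumptions; the weighted triplet loss itself is not implemented here.

```python
import numpy as np


def label_smoothed_ce(logits, labels, eps=0.1):
    """Cross-entropy against targets with (1 - eps) on the true class plus eps / K on all classes."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    k = logits.shape[1]
    q = np.full_like(log_p, eps / k)
    q[np.arange(len(labels)), labels] += 1.0 - eps
    return float(-(q * log_p).sum(axis=1).mean())


def total_loss(l_id, l_orth, l_struct, lam_orth=10.0, lam_struct=1.0):
    """L = L_id + lam_orth * L_orth + lam_struct * L_struct."""
    return l_id + lam_orth * l_orth + lam_struct * l_struct


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))  # toy batch: 4 samples, 8 identities
labels = np.array([0, 3, 5, 7])
l_ce = label_smoothed_ce(logits, labels)
l_triplet = 0.3                   # placeholder value for the weighted triplet term
print(total_loss(l_ce + l_triplet, l_orth=0.05, l_struct=0.2))
```

With \(\lambda_{\text{orth}} = 10.0\), even a small residual correlation between the shared and specific subspaces contributes noticeably to the total.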

Implementation Details

  • ViT-B/16 backbone initialized with TransOSS pre-trained weights.
  • Input size 256×128, with random horizontal flipping, cropping, and erasing augmentation.
  • Strict cross-modal P×K sampling: 32 images per batch = 8 identities × 4 images (2 optical + 2 SAR).
  • SGD optimizer, lr=5e-4 with linear warmup, for 100 epochs.
  • Trained on a single RTX 3090 GPU, with PyTorch 2.2.2 and CUDA 11.8.
  • Hyperparameters: \(\lambda_{\text{orth}}=10.0\), \(\lambda_{\text{struct}}=1.0\), SCL layer \(B_s=6\).
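
The strict cross-modal P×K sampling described in the list above (8 identities, each with 2 optical and 2 SAR images) could look like the sketch below. The function name, the `(index, identity, modality)` sample layout, and sampling with replacement when an identity has too few images are illustrative assumptions.

```python
import random
from collections import defaultdict


def cross_modal_pk_batch(samples, p=8, k_per_modality=2, rng=random):
    """Draw p identities, each with k optical and k SAR image indices.

    samples: list of (index, identity, modality), modality in {"optical", "sar"}.
    """
    by_id = defaultdict(lambda: {"optical": [], "sar": []})
    for idx, pid, modality in samples:
        by_id[pid][modality].append(idx)
    # Only identities observed in both modalities can be drawn.
    eligible = [pid for pid, m in by_id.items() if m["optical"] and m["sar"]]
    if len(eligible) < p:
        raise ValueError("not enough identities with both modalities")
    batch = []
    for pid in rng.sample(eligible, p):
        for modality in ("optical", "sar"):
            pool = by_id[pid][modality]
            if len(pool) >= k_per_modality:
                picks = rng.sample(pool, k_per_modality)
            else:
                # Assumption: reuse images when an identity has too few.
                picks = rng.choices(pool, k=k_per_modality)
            batch.extend(picks)
    return batch


# Toy dataset: 10 identities with 3 optical and 2 SAR images each.
samples = []
for pid in range(10):
    for _ in range(3):
        samples.append((len(samples), pid, "optical"))
    for _ in range(2):
        samples.append((len(samples), pid, "sar"))

batch = cross_modal_pk_batch(samples, rng=random.Random(0))
print(len(batch))  # 32
```

Guaranteeing both modalities for every sampled identity is what lets a per-identity optical/SAR prototype pair be formed inside each batch.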

Key Experimental Results

SOTA Comparison on HOSS-ReID Benchmark (mAP / Rank-1, %)

| Method | Type | All mAP | All R1 | O→S mAP | O→S R1 | S→O mAP | S→O R1 |
|---|---|---|---|---|---|---|---|
| TransReID (ICCV21) | Single-Modal ReID | 48.1 | 60.8 | 27.3 | 18.5 | 20.9 | 11.9 |
| VersReID (TPAMI24) | Cross-Modal ReID | 49.3 | 59.7 | 25.7 | 13.8 | 27.7 | 17.9 |
| DEEN (CVPR23) | Cross-Modal ReID | 43.8 | 58.5 | 31.3 | 21.5 | 27.4 | 22.4 |
| TransOSS (ICCV25) | RS-Specific | 57.4 | 65.9 | 48.9 | 33.8 | 38.7 | 29.9 |
| SDF-Net | RS-Specific | 60.9 | 69.9 | 50.0 | 35.4 | 46.6 | 38.8 |

  • All mAP +3.5 points, Rank-1 +4.0 points (vs. TransOSS).
  • SAR→Optical mAP +7.9 points (38.7→46.6), R1 +8.9 points (29.9→38.8); this is the largest improvement, validating the efficacy of structural anchors on the SAR side.
  • Pedestrian ReID methods (e.g., DEEN, VersReID) are unsuitable for optical-SAR scenarios and lag significantly behind the remote-sensing-specific methods, particularly on the O→S task.
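
The reported gains over TransOSS can be re-derived from the table with a few lines of Python. The numbers are copied from the table above; the dictionary keys are just labels chosen for this check.

```python
# Scores (%) copied from the HOSS-ReID comparison table.
transoss = {"All mAP": 57.4, "All R1": 65.9, "O->S mAP": 48.9,
            "O->S R1": 33.8, "S->O mAP": 38.7, "S->O R1": 29.9}
sdf_net = {"All mAP": 60.9, "All R1": 69.9, "O->S mAP": 50.0,
           "O->S R1": 35.4, "S->O mAP": 46.6, "S->O R1": 38.8}

# Absolute gain in percentage points for each metric.
gains = {k: round(sdf_net[k] - transoss[k], 1) for k in transoss}
for metric, gain in gains.items():
    print(f"{metric}: +{gain}")
```

The S→O gains (7.9 and 8.9 points) are several times larger than the O→S gains (1.1 and 1.6 points), which is the asymmetry the bullets above point to.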

Ablation Study

| SCL | DFL | All mAP | All R1 | O→S mAP | S→O mAP |
|---|---|---|---|---|---|
|  |  | 58.6 | 67.6 | 46.5 | 44.5 |
| ✓ |  | 59.2 | 66.5 | 47.6 | 46.6 |
|  | ✓ | 59.8 | 69.9 | 49.3 | 41.4 |
| ✓ | ✓ | 60.9 | 69.9 | 50.0 | 46.6 |

  • SCL primarily contributes to structural alignment in SAR→Optical (mAP 44.5→46.6), though using it alone slightly decreases All R1 (67.6→66.5), which is attributed to overly strong alignment constraints.
  • DFL primarily improves discriminative accuracy (R1 67.6→69.9), but when used alone, the S→O mAP drops from 44.5 to 41.4, indicating that disentanglement without structural anchors is unreliable.
  • Combining both yields the best performance: SCL provides cross-modal geometric anchors, on top of which DFL refines identification capability.

Highlights & Insights

  • Physical Prior-Driven Network Design: Systematically embeds the physical knowledge of "ships are rigid bodies" into various stages of feature learning.
  • Innovative Utilization of Intermediate Layer Gradient Energy: Avoids raw pixels (noise) and high-level features (overly abstract) to capture structural topology in intermediate layers.
  • Physical Explanation of Instance Normalization: Not merely a technical trick, but formulated from the physical mechanisms of SAR microwave scattering vs. optical reflection, mapping heterogeneous magnitudes to a unified unit-variance manifold.
  • Simplicity of Additive Residual Fusion: A parameter-free fusion strategy where modality-specific features are treated as residual supplements rather than discarded as noise.
  • Open-source Code

Limitations & Future Work

  • Validated only on a single dataset, HOSS-ReID (1,063 training images, 769 testing images); the scale is limited, and generalization remains to be verified.
  • The rigid-body assumption might not fully hold in certain scenarios, such as ships carrying deformable cargo.
  • The choice of the intermediate layer \(B_s=6\) is empirical, and hyperparameters may require tuning for backbones with different depths.
  • Training requires a strict cross-modal P×K sampling strategy, so every sampled training identity must have both optical and SAR images available.
  • Although additive fusion is simple, it may be less flexible than adaptive gating in extreme scenarios.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of physical priors, intermediate-layer gradient energy, and disentangled fusion is novel in the SAR ReID field.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three ablation dimensions (modules, fusion strategies, layer choices) plus comprehensive SOTA comparisons.
  • Writing Quality: ⭐⭐⭐⭐ In-depth discussion of physical motivations and complete mathematical derivations.
  • Value: ⭐⭐⭐⭐ Provides an effective, physically driven paradigm for cross-modal retrieval in remote sensing.