SDF-Net: Structure-Aware Disentangled Feature Learning for Optical–SAR Ship Re-Identification
Conference: CVPR2025
arXiv: 2603.12588
Code: GitHub
Area: Cross-Modal Retrieval / Remote Sensing
Keywords: Optical-SAR, Ship Re-identification, Structure Consistency, Feature Disentanglement, Gradient Energy
TL;DR
SDF-Net exploits the physical prior that ships are rigid bodies. It extracts scale-invariant gradient energy statistics from an intermediate ViT layer to serve as cross-modal geometric anchors. At the terminal layer, features are disentangled into modality-invariant shared features and modality-specific features, which are then fused by parameter-free addition, with the modality-specific part acting as a residual. The method achieves state-of-the-art (SOTA) performance on optical-SAR ship re-identification.
Background & Motivation
Task Definition: Cross-modal ship re-identification (ReID) aims to associate the same ship identity across optical and SAR images, which is a core task in maritime surveillance.
Key Challenge: Severe non-linear radiation distortion (NRD) exists between optical (passive reflection) and SAR (active microwave scattering) modalities. Since the texture appearance is highly modality-dependent, direct appearance alignment is unreliable.
Limitations of Prior Work:
- Methods based on statistical distribution alignment do not exploit physical priors.
- Generative methods (e.g., CycleGAN) are computationally expensive and may introduce hallucinated artifacts.
- Pedestrian ReID methods (e.g., Hi-CMD, DEEN) are designed for non-rigid body deformations and are unsuitable for rigid ships.
Key Insight: Ships are rigid bodies whose geometric structures (contours, aspect ratios, spatial layouts) remain stable across modalities, whereas textures are modality-dependent. Intermediate network layers abstract away low-level noise while retaining spatial topology, which makes them a natural place to extract structural information.
Method
Overall Architecture (Based on ViT-B/16)
(a) Input Stage: Cross-Modal Dual-Head Tokenizer
- Optical and SAR images are mapped to a unified \(C\)-dimensional space via independent linear projection heads.
- Design Motivation: To neutralize low-level sensor discrepancies, preventing the shared self-attention from being dominated by modality-specific intensity biases.
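The dual-head idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `DualHeadTokenizer`, the single-channel SAR input, the random initialization, and the use of NumPy instead of PyTorch are all assumptions made for illustration. Only the patch size of 16, the 256×128 input, and the idea of one independent linear projection per modality come from the summary above.

```python
import numpy as np

PATCH = 16        # ViT-B/16 patch size
EMBED_DIM = 768   # ViT-B hidden width, playing the role of C


def patchify(img, patch=PATCH):
    """Split an (H, W, ch) image into flattened non-overlapping patches."""
    h, w, ch = img.shape
    gh, gw = h // patch, w // patch
    x = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, ch)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, ch)
    return x.reshape(gh * gw, patch * patch * ch)


class DualHeadTokenizer:
    """Hypothetical tokenizer: one linear projection head per modality."""

    def __init__(self, opt_channels=3, sar_channels=1, seed=0):
        # Channel counts are assumptions; the summary does not state them.
        rng = np.random.default_rng(seed)
        self.heads = {}
        for name, ch in (("optical", opt_channels), ("sar", sar_channels)):
            in_dim = PATCH * PATCH * ch
            weight = rng.normal(0.0, 0.02, size=(in_dim, EMBED_DIM))
            self.heads[name] = (weight, np.zeros(EMBED_DIM))

    def __call__(self, img, modality):
        # Each modality uses its own weights, but both land in the same
        # EMBED_DIM-dimensional token space shared by later layers.
        weight, bias = self.heads[modality]
        return patchify(img) @ weight + bias


tokenizer = DualHeadTokenizer()
optical_tokens = tokenizer(np.zeros((256, 128, 3)), "optical")
sar_tokens = tokenizer(np.zeros((256, 128, 1)), "sar")
print(optical_tokens.shape, sar_tokens.shape)
```

Both calls yield 128 tokens (a 16×8 patch grid) of the same width, so the shared transformer blocks never see raw sensor-specific intensities.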
(b) Intermediate Stage: Structure-Aware Consistency Learning (SCL)
- Intermediate Layer Gradient Energy Extraction: Feature maps \(\mathbf{F}^{(B_s)}\) are extracted from layer \(B_s=6\) (of 12 layers).
- Spatial Gradient Computation: Central differences are taken in both directions, \(\mathbf{G}_x(h,w) = \mathbf{F}(h,w+1) - \mathbf{F}(h,w-1)\) horizontally and, analogously, \(\mathbf{G}_y(h,w) = \mathbf{F}(h+1,w) - \mathbf{F}(h-1,w)\) vertically.
- Spatial Integration: \(\mathbf{e}_x = \frac{1}{H'W'}\sum_{h,w}|\mathbf{G}_x(h,w)|\) (with \(\mathbf{e}_y\) defined analogously), compressing the gradient fields into a channel-wise structural descriptor \(\mathbf{f}_{\text{struct}}\).
- Instance Normalization: \(\hat{\mathbf{f}}_{\text{struct}} = \text{IN}(\mathbf{f}_{\text{struct}})\), eliminating absolute magnitude discrepancies between modalities.
- Prototype-Level Consistency Loss: \(\mathcal{L}_{\text{struct}} = \frac{1}{|\mathcal{I}|}\sum_i \|\mathbf{c}_i^o - \mathbf{c}_i^s\|_2^2\), aligning the optical prototype \(\mathbf{c}_i^o\) and the SAR prototype \(\mathbf{c}_i^s\) of the same identity \(i \in \mathcal{I}\).
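The SCL steps can be sketched in NumPy as follows. Several details are assumptions, since the summary does not pin them down: the descriptor is formed by concatenating the horizontal and vertical energies, the average runs over interior positions only (no boundary padding), instance normalization is applied across the entries of each descriptor, and prototypes are per-identity means within a batch. All function names are hypothetical.

```python
import numpy as np


def gradient_energy(feat):
    """feat: (H', W', C) token grid taken from the intermediate layer.

    Returns a (2C,) descriptor holding the mean absolute central
    difference per channel, horizontally then vertically.
    """
    gx = feat[:, 2:, :] - feat[:, :-2, :]   # F(h, w+1) - F(h, w-1)
    gy = feat[2:, :, :] - feat[:-2, :, :]   # F(h+1, w) - F(h-1, w)
    ex = np.abs(gx).mean(axis=(0, 1))
    ey = np.abs(gy).mean(axis=(0, 1))
    return np.concatenate([ex, ey])         # concatenation is an assumption


def instance_norm(v, eps=1e-5):
    """Zero-mean, unit-variance normalization of a single descriptor."""
    return (v - v.mean()) / np.sqrt(v.var() + eps)


def struct_loss(desc, ids, is_sar):
    """Mean squared L2 distance between optical and SAR prototypes.

    desc: (N, D) normalized descriptors, ids: (N,) identity labels,
    is_sar: (N,) boolean modality flags.
    """
    losses = []
    for pid in np.unique(ids):
        opt = desc[(ids == pid) & ~is_sar]
        sar = desc[(ids == pid) & is_sar]
        if len(opt) and len(sar):           # skip identities seen in one modality
            diff = opt.mean(axis=0) - sar.mean(axis=0)
            losses.append(np.sum(diff ** 2))
    return float(np.mean(losses)) if losses else 0.0


rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 8, 32))         # toy 16x8 token grid, 32 channels
base = instance_norm(gradient_energy(feat))
scaled = instance_norm(gradient_energy(3.0 * feat))
print(np.abs(base - scaled).max())          # small: global gain is normalized away
```

The last lines illustrate why normalization matters here: multiplying the whole feature map by a constant gain scales the raw energies but leaves the normalized descriptor almost unchanged.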
(c) Terminal Stage: Disentangled Feature Learning (DFL)
- Two parallel linear projection heads decompose the terminal representation into shared identity features \(\mathbf{f}_{\text{sh}}\) and modality-specific features \(\mathbf{f}_{\text{sp}}\).
- Orthogonal Constraint: \(\mathcal{L}_{\text{orth}} = \mathbb{E}[|\langle\bar{\mathbf{f}}_{\text{sh}}, \bar{\mathbf{f}}_{\text{sp}}\rangle|]\), ensuring the independence of the two subspaces.
- Additive Residual Fusion: \(\mathbf{f}_{\text{fuse}} = \mathbf{f}_{\text{sh}} + \mathbf{f}_{\text{sp}}\), parameter-free, where modality-specific features serve as residual supplements.
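A minimal NumPy sketch of the DFL stage is given below. It assumes that the bar in \(\bar{\mathbf{f}}\) denotes L2 normalization, so the inner product is a cosine similarity, and that the two heads are bias-free matrices; neither detail is confirmed by the summary, and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Row-wise L2 normalization."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def disentangle(f, w_sh, w_sp):
    """Two parallel linear heads applied to the terminal features f: (N, D)."""
    return f @ w_sh, f @ w_sp


def orth_loss(f_sh, f_sp):
    """Batch mean of |cos(f_sh, f_sp)|; zero when the pairs are orthogonal."""
    cos = np.sum(l2_normalize(f_sh) * l2_normalize(f_sp), axis=1)
    return float(np.mean(np.abs(cos)))


def fuse(f_sh, f_sp):
    """Parameter-free additive fusion: the specific part acts as a residual."""
    return f_sh + f_sp


rng = np.random.default_rng(0)
f = rng.normal(size=(4, 16))                # toy batch of terminal features
w_sh = rng.normal(size=(16, 8))
w_sp = rng.normal(size=(16, 8))
f_sh, f_sp = disentangle(f, w_sh, w_sp)
print(orth_loss(f_sh, f_sp), fuse(f_sh, f_sp).shape)
```

Because fusion is a plain sum, both heads must output the same dimensionality, and the orthogonality term is the only thing that keeps the two branches from learning the same content.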
Joint Optimization: \(\mathcal{L} = \mathcal{L}_{\text{id}} + \lambda_{\text{orth}}\mathcal{L}_{\text{orth}} + \lambda_{\text{struct}}\mathcal{L}_{\text{struct}}\)
- \(\mathcal{L}_{\text{id}}\) combines label-smoothed cross-entropy and weighted triplet loss.
- Hyperparameter Settings: \(\lambda_{\text{orth}} = 10.0\), \(\lambda_{\text{struct}} = 1.0\).
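The weighting can be made concrete with a small NumPy sketch. The smoothing factor of 0.1 and the uniform smoothing convention are assumptions (the summary gives neither), the weighted triplet term is left out for brevity, and the function names are hypothetical. Only the two \(\lambda\) values come from the text above.

```python
import numpy as np


def label_smoothed_ce(logits, labels, smoothing=0.1):
    """Cross-entropy against targets (1 - s) * one_hot + s / K."""
    n, k = logits.shape
    z = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    target = np.full((n, k), smoothing / k)
    target[np.arange(n), labels] += 1.0 - smoothing
    return float(-(target * log_p).sum(axis=1).mean())


def total_loss(l_id, l_orth, l_struct, lam_orth=10.0, lam_struct=1.0):
    """L = L_id + lam_orth * L_orth + lam_struct * L_struct."""
    return l_id + lam_orth * l_orth + lam_struct * l_struct


logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
l_id = label_smoothed_ce(logits, labels)    # triplet term omitted in this sketch
print(total_loss(l_id, l_orth=0.05, l_struct=0.2))
```

With \(\lambda_{\text{orth}} = 10.0\), an orthogonality residual of 0.05 contributes 0.5 to the objective, so even a small overlap between the shared and specific branches is penalized on the same order as the structural term.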
Implementation Details
- ViT-B/16 backbone initialized with TransOSS pre-trained weights.
- Input size 256×128, with random horizontal flipping, cropping, and erasing augmentation.
- Strict cross-modal P×K sampling: 32 images per batch = 8 identities × 4 images (2 optical + 2 SAR).
- SGD optimizer, lr=5e-4 with linear warmup, for 100 epochs.
- Trained on a single RTX 3090 GPU, with PyTorch 2.2.2 and CUDA 11.8.
- Hyperparameters: \(\lambda_{\text{orth}}=10.0\), \(\lambda_{\text{struct}}=1.0\), SCL layer \(B_s=6\).
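The strict cross-modal P×K sampling described above can be sketched with the standard library alone. The function name, the `(index, identity, modality)` tuple layout, and the fallback to sampling with replacement when an identity has too few images are assumptions for illustration; the summary specifies only 8 identities per batch with 2 optical and 2 SAR images each.

```python
import random
from collections import defaultdict


def cross_modal_pk_batches(samples, p=8, k_per_modality=2, seed=0):
    """Yield batches of p identities with k_per_modality images per modality.

    samples: iterable of (index, identity, modality) tuples, where
    modality is "optical" or "sar".
    """
    rng = random.Random(seed)
    by_id = defaultdict(lambda: {"optical": [], "sar": []})
    for idx, pid, modality in samples:
        by_id[pid][modality].append(idx)

    # Only identities observed in both modalities can fill a strict batch.
    eligible = [pid for pid, m in by_id.items() if m["optical"] and m["sar"]]
    rng.shuffle(eligible)

    for start in range(0, len(eligible) - p + 1, p):
        batch = []
        for pid in eligible[start:start + p]:
            for modality in ("optical", "sar"):
                pool = by_id[pid][modality]
                if len(pool) >= k_per_modality:
                    picks = rng.sample(pool, k_per_modality)
                else:
                    # Assumption: repeat images when an identity has too few.
                    picks = rng.choices(pool, k=k_per_modality)
                batch.extend(picks)
        yield batch


# Toy dataset: 16 identities, each with 3 optical images and 1 SAR image.
samples, idx = [], 0
for pid in range(16):
    for modality, count in (("optical", 3), ("sar", 1)):
        for _ in range(count):
            samples.append((idx, pid, modality))
            idx += 1

batches = list(cross_modal_pk_batches(samples))
print(len(batches), len(batches[0]))
```

Each batch holds 8 × (2 + 2) = 32 indices, matching the batch size listed above, and every identity in a batch contributes both modalities so that per-identity optical and SAR prototypes can be formed.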
Key Experimental Results
SOTA Comparison on HOSS-ReID Benchmark (mAP / Rank-1, %)
| Method | Type | All mAP | All R1 | O→S mAP | O→S R1 | S→O mAP | S→O R1 |
|---|---|---|---|---|---|---|---|
| TransReID (ICCV21) | Single-Modal ReID | 48.1 | 60.8 | 27.3 | 18.5 | 20.9 | 11.9 |
| VersReID (TPAMI24) | Cross-Modal ReID | 49.3 | 59.7 | 25.7 | 13.8 | 27.7 | 17.9 |
| DEEN (CVPR23) | Cross-Modal ReID | 43.8 | 58.5 | 31.3 | 21.5 | 27.4 | 22.4 |
| TransOSS (ICCV25) | RS-Specific | 57.4 | 65.9 | 48.9 | 33.8 | 38.7 | 29.9 |
| SDF-Net | RS-Specific | 60.9 | 69.9 | 50.0 | 35.4 | 46.6 | 38.8 |
- Versus TransOSS, All mAP improves by 3.5 percentage points (57.4→60.9) and All Rank-1 by 4.0 points (65.9→69.9).
- S→O (SAR→Optical) shows the largest gain: mAP +7.9 points (38.7→46.6) and R1 +8.9 points (29.9→38.8), which supports the efficacy of structural anchors on the SAR side.
- Pedestrian ReID methods (e.g., DEEN, VersReID) transfer poorly to the optical-SAR setting and lag well behind TransOSS and SDF-Net, particularly on the O→S task.
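The quoted gains can be recomputed directly from the TransOSS and SDF-Net rows of the table. The snippet below is plain arithmetic on those numbers; the dictionary keys are shorthand labels chosen for this sketch.

```python
# Rows copied from the SOTA comparison table (values in %).
transoss = {"All mAP": 57.4, "All R1": 65.9,
            "O->S mAP": 48.9, "O->S R1": 33.8,
            "S->O mAP": 38.7, "S->O R1": 29.9}
sdf_net = {"All mAP": 60.9, "All R1": 69.9,
           "O->S mAP": 50.0, "O->S R1": 35.4,
           "S->O mAP": 46.6, "S->O R1": 38.8}

# Absolute gains in percentage points, rounded to one decimal place.
gains = {key: round(sdf_net[key] - transoss[key], 1) for key in transoss}
for key, value in gains.items():
    print(f"{key}: +{value}")
```

The differences are absolute percentage points, not relative improvements, and the S→O direction accounts for the largest share of the overall gain.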
Ablation Study
| SCL | DFL | All mAP | All R1 | O→S mAP | S→O mAP |
|---|---|---|---|---|---|
| ✗ | ✗ | 58.6 | 67.6 | 46.5 | 44.5 |
| ✓ | ✗ | 59.2 | 66.5 | 47.6 | 46.6 |
| ✗ | ✓ | 59.8 | 69.9 | 49.3 | 41.4 |
| ✓ | ✓ | 60.9 | 69.9 | 50.0 | 46.6 |
- SCL primarily contributes to structural alignment in S→O (mAP 44.5→46.6), though used alone it slightly lowers All R1 (67.6→66.5), which is attributed to overly strong alignment constraints.
- DFL primarily improves discriminative accuracy (All R1 67.6→69.9), but used alone it drops S→O mAP from 44.5 to 41.4, suggesting that disentanglement without structural anchors is unreliable.
- Combining both yields the best results: SCL provides cross-modal geometric anchors, on top of which DFL refines identification capability.
Highlights & Insights
- Physical Prior-Driven Network Design: Systematically embeds the physical knowledge of "ships are rigid bodies" into various stages of feature learning.
- Innovative Utilization of Intermediate Layer Gradient Energy: Captures structural topology at an intermediate layer, avoiding both raw pixels (too noisy) and high-level features (too abstract).
- Physical Explanation of Instance Normalization: Not merely a technical trick, but formulated from the physical mechanisms of SAR microwave scattering vs. optical reflection, mapping heterogeneous magnitudes to a unified unit-variance manifold.
- Simplicity of Additive Residual Fusion: A parameter-free fusion strategy where modality-specific features are treated as residual supplements rather than discarded as noise.
- Open-Source Code: The implementation is publicly available on GitHub.
Limitations & Future Work
- Validated only on a single dataset, HOSS-ReID (1,063 training images, 769 testing images); the scale is limited, and generalization remains to be verified.
- The rigid-body assumption might not fully hold in certain scenarios, such as ships carrying deformable cargo.
- The choice of the intermediate layer \(B_s=6\) is empirical, and hyperparameters may require tuning for backbones with different depths.
- Training requires a strict cross-modal P×K sampling strategy, which in turn requires each training identity to have both optical and SAR images in the dataset.
- Although additive fusion is simple, it may be less flexible than adaptive gating in extreme scenarios.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of physical priors, intermediate-layer gradient energy, and disentangled fusion is novel in the SAR ReID field.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three ablation dimensions (modules, fusion strategies, layer choices) plus comprehensive SOTA comparisons.
- Writing Quality: ⭐⭐⭐⭐ In-depth discussion of physical motivations and complete mathematical derivations.
- Value: ⭐⭐⭐⭐ Provides an effective, physically driven paradigm for cross-modal retrieval in remote sensing.