MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification

Conference: CVPR 2026
arXiv: 2512.03404
Code: Coming soon
Area: Image Generation / Cross-Modal Retrieval
Keywords: Cross-modal ReID, Optical-SAR, Ship recognition, Diffusion bridge model, Modality alignment

TL;DR

The paper proposes the MOS framework for optical-SAR cross-modal ship re-identification. It comprises two core modules: (1) MCRL, which reduces the modality gap during training via SAR image denoising and a category-level modality alignment loss; and (2) CDGF, which generates pseudo-SAR samples from optical images using a Brownian bridge diffusion model at inference time and fuses the resulting features. On the HOSS ReID dataset, MOS raises Rank-1 (R1) accuracy in the SAR→Optical direction by 16.4 points over TransOSS (29.9 → 46.3).

Background & Motivation

Background: Ship ReID is critical for maritime surveillance and management. SAR sensors enable all-weather, all-day imaging but suffer from severe speckle noise. Optical-SAR cross-modal ReID is highly challenging because of the large modality gap, and only two pioneering works (TransOSS, SMART-Ship) have addressed it so far.

Limitations of Prior Work: (a) The fundamentally different imaging mechanisms of optical and SAR sensors lead to severe feature misalignment; (b) inherent SAR speckle noise significantly disrupts feature extraction; (c) models tend to favor intra-modal matching over correct cross-modal matching, as modality discrepancy dominates identity discrepancy.

Key Challenge: A conflict between closing the modality gap and preserving identity discriminability — both objectives must be satisfied simultaneously.

Goal: Mitigate the optical-SAR modality gap from two complementary perspectives: the training stage and the inference stage.

Key Insight: Two observations drive the design: SAR noise concentrates in low-pixel-value regions, and modality distribution alignment can be decomposed into two independent components, mean and variance.

Core Idea: Apply SAR denoising and category-level Wasserstein alignment during training, and perform diffusion-bridge-based cross-modal generation with feature fusion during inference.

Method

Overall Architecture

The dataset is defined as \(\mathcal{D} = \{(I_i, y_i, m_i)\}\) with \(m_i \in \{opt, sar\}\). The MCRL module learns modality-invariant representations during training, while the CDGF module generates cross-modal samples and fuses features at inference time.

Key Designs

SAR Image Denoising:
- Function: Remove inherent speckle noise from SAR images.
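- Mechanism: Sort all pixel values in ascending order, truncate the lowest \(\alpha\%\) (treated as noise), and renormalize the remaining values to \([0, 255]\): \(\hat{p}_k = \frac{255(p_k - p_{min})}{p_{max} - p_{min} + \epsilon}\), where \(p_{min}\) and \(p_{max}\) are the minimum and maximum of the remaining values and \(\epsilon\) is a small constant that avoids division by zero.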
- Design Motivation: Noise is empirically observed to concentrate in low-pixel-value regions, making simple truncation an effective remedy.
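
The truncate-and-rescale step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `denoise_sar`, the default `alpha=10.0`, and the reading of "truncate" as clamping low pixels up to the \(\alpha\)-th percentile are all assumptions made here.

```python
import numpy as np

def denoise_sar(image, alpha=10.0, eps=1e-6):
    """Percentile truncation followed by min-max rescaling to [0, 255]."""
    pixels = image.astype(np.float64)
    # Threshold below which the lowest alpha percent of pixel values fall.
    p_min = np.percentile(pixels, alpha)
    p_max = pixels.max()
    # Assumption: "truncate" means clamping low pixels up to the threshold.
    clipped = np.clip(pixels, p_min, p_max)
    # Rescale following the formula in the text.
    return 255.0 * (clipped - p_min) / (p_max - p_min + eps)

# Synthetic stand-in for a SAR chip (random values, not real data).
rng = np.random.default_rng(0)
sar = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
out = denoise_sar(sar, alpha=10.0)
```

With this reading, every pixel at or below the threshold maps to 0 and the brightest pixel maps to (just under) 255.
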
Category-level Modality Alignment Loss (CMAL):
- Function: Align the optical and SAR feature distributions within each identity class.
- Mechanism: For each identity \(c\), compute the class-wise optical and SAR centroids \(\mu_{opt}^c, \mu_{sar}^c\) and variances \(\text{var}_{opt}^c, \text{var}_{sar}^c\), then minimize \(\mathcal{L}_{CMAL} = \frac{1}{|C|}\sum_{c\in C}(\|\mu_{opt}^c - \mu_{sar}^c\|_2^2 + \|\text{var}_{opt}^c - \text{var}_{sar}^c\|_2^2)\)
- Theoretical Derivation: Under a diagonal covariance approximation, this serves as a tractable surrogate for the Wasserstein-2 distance. The mean term pulls class centers together, while the variance term aligns intra-class dispersion.
- Total Training Loss: \(\mathcal{L} = \lambda_{id}\mathcal{L}_{ID} + \lambda_{tri}\mathcal{L}_{Triplet} + \lambda_{cmal}\mathcal{L}_{CMAL}\)
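
A minimal NumPy sketch of \(\mathcal{L}_{CMAL}\) as written above may help make the per-class statistics concrete. The function name `cmal_loss`, the boolean `is_sar` mask, and the choice to skip identities that lack one modality in the batch are assumptions for illustration; a training implementation would use a differentiable framework instead.

```python
import numpy as np

def cmal_loss(features, labels, is_sar):
    """Average over identities of squared mean and variance gaps between modalities."""
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    is_sar = np.asarray(is_sar, dtype=bool)
    terms = []
    for c in np.unique(labels):
        opt = features[(labels == c) & ~is_sar]
        sar = features[(labels == c) & is_sar]
        # Assumption: identities seen in only one modality are skipped.
        if len(opt) == 0 or len(sar) == 0:
            continue
        mean_gap = np.sum((opt.mean(axis=0) - sar.mean(axis=0)) ** 2)
        var_gap = np.sum((opt.var(axis=0) - sar.var(axis=0)) ** 2)
        terms.append(mean_gap + var_gap)
    return float(np.mean(terms)) if terms else 0.0

# Toy batch: one identity, two optical and two SAR features in 2-D.
feats = [[0.0, 0.0], [2.0, 2.0], [1.0, 1.0], [3.0, 3.0]]
ids = [0, 0, 0, 0]
sar_mask = [False, False, True, True]
loss = cmal_loss(feats, ids, sar_mask)
```

In the toy batch the two modalities have equal per-dimension variance and centroids one unit apart in each dimension, so only the mean term contributes.
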
Cross-modal Data Generation and Feature Fusion (CDGF):
- Function: Generate pseudo-SAR samples at inference time to assist retrieval.
- Mechanism: A Brownian Bridge Diffusion Model (BBDM) is trained with forward process \(q(x_t|x_0,y) = \mathcal{N}(x_t; (1-m_t)x_0 + m_t y, \delta_t I)\), where \(x_0\) denotes the SAR latent feature, \(y\) the optical feature, \(m_t\) the time-dependent interpolation weight, and \(\delta_t\) the noise variance at step \(t\). The reverse process learns to denoise, enabling SAR generation from optical inputs. At inference time, \(K\) pseudo-SAR samples are generated for each optical image \(i\) and fused with its optical feature using weight \(\tau\): \(f_{fused}^i = \frac{(1-\tau)f_{opt}^i + \tau(\frac{1}{K}\sum_{k=1}^K f_{pseudo}^{i,k})}{\|(1-\tau)f_{opt}^i + \tau(\frac{1}{K}\sum_{k=1}^K f_{pseudo}^{i,k})\|_2}\)
- Design Motivation: Aligning feature spaces alone is insufficient; generating a "view from the other modality" enriches cross-modal representations.
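
The fusion rule above is a convex blend followed by L2 normalization, which the following sketch spells out. The name `fuse_features` and the value `tau=0.5` are placeholders chosen here; the summary does not state the value of \(\tau\) or \(K\).

```python
import numpy as np

def fuse_features(f_opt, f_pseudo, tau=0.5):
    """Blend an optical feature with the mean of K pseudo-SAR features, then L2-normalize.

    f_opt has shape (D,); f_pseudo has shape (K, D).
    """
    mixed = (1.0 - tau) * f_opt + tau * f_pseudo.mean(axis=0)
    return mixed / np.linalg.norm(mixed)

# Toy example with D = 2 and K = 2 (values are illustrative only).
f_opt = np.array([1.0, 0.0])
f_pseudo = np.array([[0.0, 1.0], [0.0, 1.0]])
fused = fuse_features(f_opt, f_pseudo, tau=0.5)
```

Because the result is unit-norm, cosine similarity against gallery features reduces to a dot product.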

Loss & Training

Backbone: ViT (following the TransOSS baseline)
\(\lambda_{id} = \lambda_{tri} = 1\)
The BBDM is trained independently for use during inference-time generation.
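
To illustrate what the BBDM forward process looks like during its separate training, here is a small sampling sketch of \(q(x_t|x_0,y)\). The schedule \(m_t = t/T\) and \(\delta_t = 2s(m_t - m_t^2)\) is the one commonly used for Brownian bridge diffusion and is an assumption here, as are the function name and the scale parameter `s`; the summary does not state which schedule MOS uses.

```python
import numpy as np

def bridge_forward_sample(x0, y, t, T, rng, s=1.0):
    """Draw x_t ~ N((1 - m_t) * x0 + m_t * y, delta_t * I)."""
    m_t = t / T                           # assumed linear interpolation weight
    delta_t = 2.0 * s * (m_t - m_t ** 2)  # assumed variance; zero at t = 0 and t = T
    noise = rng.standard_normal(x0.shape)
    return (1.0 - m_t) * x0 + m_t * y + np.sqrt(delta_t) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))  # stand-in for a SAR latent
y = rng.standard_normal((4, 8))   # stand-in for the paired optical latent
x_mid = bridge_forward_sample(x0, y, t=500, T=1000, rng=rng)
```

Under this schedule the process is pinned at both ends: it starts exactly at the SAR latent and ends exactly at the optical latent, with the most noise at the midpoint.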

Key Experimental Results

Main Results on HOSS ReID

Method	Type	ALL2ALL mAP/R1 (%)	O→SAR mAP/R1 (%)	SAR→O mAP/R1 (%)
TransReID	Single-modal ReID	48.1/60.8	27.3/18.5	20.9/11.9
DEEN	Cross-modal ReID	43.8/58.5	31.3/21.5	27.4/22.4
VersReID	Cross-modal ReID	49.3/59.7	25.7/13.8	27.7/17.9
TransOSS	Optical-SAR	57.4/65.9	48.9/33.8	38.7/29.9
MOS (Ours)	Optical-SAR	60.4/68.8	51.4/40.0	48.7/46.3

Ablation Study

Configuration	ALL2ALL R1 (%)	O→SAR R1 (%)	SAR→O R1 (%)	Notes
Baseline TransOSS	65.9	33.8	29.9	No augmentation
+ SAR Denoising	66.5	35.4	32.8	Denoising is effective
+ CMAL	67.6	38.5	40.3	Core training-stage alignment
+ CDGF	68.8	40.0	46.3	Generation fusion yields further gains

Key Findings

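The SAR→Optical direction yields the largest improvement (+16.4 R1 points over TransOSS, 29.9 → 46.3); the ablation splits this gain across SAR denoising, CMAL, and CDGF, the last of which supplies pseudo-SAR features for the optical images.
CMAL is the central training-stage component: added on top of SAR denoising, it raises SAR→O R1 from 32.8 to 40.3 (+7.5).
CDGF contributes a further +6.0 SAR→O R1 points (40.3 → 46.3) through inference-time augmentation.
SAR denoising, though simple, improves R1 under all three protocols (65.9 → 66.5, 33.8 → 35.4, 29.9 → 32.8), which suggests low-pixel truncation is effective against speckle noise.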
The performance advantage over general cross-modal methods (CM-NAS, LbA, etc.) confirms that the optical-SAR domain requires task-specific designs.
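
The per-component gains can be recomputed directly from the SAR→O R1 column of the ablation table, reading its rows as cumulative (each "+" row adds one component to the previous row). The snippet below is only arithmetic on the reported numbers.

```python
# SAR→O R1 values copied from the ablation table, in cumulative order.
ablation = [
    ("Baseline TransOSS", 29.9),
    ("+ SAR Denoising", 32.8),
    ("+ CMAL", 40.3),
    ("+ CDGF", 46.3),
]

# Gain of each row over the row before it, rounded to one decimal.
gains = {
    name: round(r1 - prev, 1)
    for (_, prev), (name, r1) in zip(ablation, ablation[1:])
}
total_gain = round(ablation[-1][1] - ablation[0][1], 1)

print(gains)       # {'+ SAR Denoising': 2.9, '+ CMAL': 7.5, '+ CDGF': 6.0}
print(total_gain)  # 16.4
```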

Highlights & Insights

Diagonal approximation of Wasserstein alignment: The full \(W_2\) matrix square-root computation is simplified to per-dimension mean and variance alignment, achieving high computational efficiency without sacrificing effectiveness. This simplification is transferable to general cross-domain alignment scenarios.
Training–inference two-stage synergy: MCRL establishes a shared feature space during training, and CDGF further bridges the modality gap at inference time; the two modules are complementary.
Brownian bridge diffusion for cross-modal translation: The endpoint-conditioned nature of BBDM naturally suits cross-modal mapping tasks.
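
For reference, the closed form behind the diagonal simplification can be written out. This is a standard result for Gaussian distributions, stated here under the assumption that class-wise features are modeled as Gaussians; the symbols \(\mu_k\), \(\Sigma_k\), and \(\sigma_k\) are introduced only for this sketch.

\[
W_2^2\big(\mathcal{N}(\mu_1,\Sigma_1),\,\mathcal{N}(\mu_2,\Sigma_2)\big)
= \|\mu_1-\mu_2\|_2^2
+ \operatorname{Tr}\!\Big(\Sigma_1+\Sigma_2-2\big(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\big)^{1/2}\Big)
\]

With diagonal covariances \(\Sigma_k = \operatorname{diag}(\sigma_{k,1}^2,\dots,\sigma_{k,D}^2)\), the matrix square roots act element-wise and the trace collapses to a sum over dimensions:

\[
W_2^2
= \|\mu_1-\mu_2\|_2^2 + \sum_{d=1}^{D}\big(\sigma_{1,d}^2+\sigma_{2,d}^2-2\,\sigma_{1,d}\,\sigma_{2,d}\big)
= \|\mu_1-\mu_2\|_2^2 + \|\sigma_1-\sigma_2\|_2^2
\]

The exact diagonal form compares standard deviations, whereas \(\mathcal{L}_{CMAL}\) as summarized above compares variances, which is consistent with describing it as a surrogate rather than the distance itself.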

Limitations & Future Work

The HOSS dataset is relatively small; generalizability to large-scale data remains to be validated.
The denoising strategy is overly simplistic (pixel-value truncation); more advanced SAR denoising methods may yield larger gains.
CDGF incurs inference overhead: each optical image needs \(K\) pseudo-SAR samples, and each sample requires multiple diffusion sampling steps.
Multi-scale feature fusion and hard sample mining are not discussed.

vs. TransOSS: MOS augments TransOSS with dedicated modality alignment and cross-modal generation modules.
vs. face/pedestrian ReID methods: The poor performance of general cross-modal methods in the optical-SAR domain highlights the need for domain-specific designs.
vs. GAN-based translation methods: BBDM offers greater training stability and generates more diverse samples compared to CycleGAN and similar approaches.

Rating

Novelty: ⭐⭐⭐ The Wasserstein approximation and BBDM fusion are creative, though the individual components are relatively independent.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-protocol evaluation with detailed ablation studies.
Writing Quality: ⭐⭐⭐⭐ Theoretical derivations and experimental design are clearly presented.
Value: ⭐⭐⭐ The application domain is narrow, but the contribution to optical-SAR ReID is concrete.