
MDReID: Modality-Decoupled Learning for Any-to-Any Multi-Modal Object Re-Identification

Conference: NeurIPS 2025 · arXiv: 2510.23301 · Code: GitHub · Area: Multi-Modal VLM · Keywords: Multi-modal ReID, modality decoupling, cross-modal retrieval, any-to-any matching, metric learning

TL;DR

This paper proposes MDReID, a framework that decouples modality features into modality-shared and modality-specific components, enabling object re-identification under arbitrary modality combinations (any-to-any ReID) and substantially outperforming existing methods in both modality-matched and modality-mismatched scenarios.

Background & Motivation

Background: Multi-modal object re-identification (ReID) leverages complementary spectral information from RGB, NIR, TIR, and other modalities to improve recognition robustness in complex scenes.

Limitations of Prior Work: Existing methods (e.g., TOP-ReID, EDITOR) assume strict modality alignment between query and gallery sets; however, in real deployments, camera types and environments vary, leading to modality inconsistencies.

Key Challenge: When modalities are missing, reconstructing absent modality representations from available ones is an ill-posed problem, as unpredictable modality-specific information leads to suboptimal learning.

Goal: Design a flexible framework that supports retrieval under arbitrary query–gallery modality combinations, covering both modality-matched and modality-mismatched scenarios.

Key Insight: Decompose modality information into predictable, transferable shared features and unpredictable specific features, and handle each separately.

Core Idea: Introduce learnable modality-shared and modality-specific tokens into a ViT to explicitly decouple representations, and reinforce the decoupling with an orthogonality loss and a knowledge discrepancy loss.

Method

Overall Architecture

MDReID is built on a Vision Transformer (ViT) backbone and comprises two core components:

  • Modality Decoupled Learning (MDL): splits each modality's representation into modality-shared and modality-specific parts.
  • Modality-aware Metric Learning (MML): further enhances the feature decoupling through metric learning.

Key Designs

  1. Modality Decoupled Learning (MDL):

    • Function: Extracts shared and specific features for each modality.
    • Design Motivation: Shared features support cross-modal retrieval (modality-mismatched scenario), while specific features retain modality-exclusive discriminative information (modality-matched scenario).
    • Mechanism: Two learnable tokens, \(I_{sp}^M\) (modality-specific) and \(I_{sh}^M\) (modality-shared), are prepended to each modality's patch embedding sequence in the ViT; after ViT encoding, the decoupled features are read off these tokens. A unified feature vector is constructed as \(v_{full} = [I_{sp}^R, I_{sp}^N, I_{sp}^T, I_{sh}^R, I_{sh}^N, I_{sh}^T]\), where missing modalities are filled with zero vectors and a binary availability mask records which modalities are present (see the first sketch after this list).
    • Novelty: Unlike TOP-ReID, which attempts to reconstruct missing modality representations, MDReID avoids the ill-posed reconstruction problem entirely.
  2. Similarity Computation:

    • Modality-specific similarity \(Sim_{sp}\): Compares specific features of the same modality only, with availability masks handling missing cases.
    • Modality-shared similarity \(Sim_{sh}\): Computes a similarity matrix over all available shared-feature pairs.
    • Total similarity: \(Sim_{total}(v_q, v_g) = (Sim_{sp} + Sim_{sh}) / 2\) (see the second sketch after this list).
  3. Representation Orthogonality Loss (ROL):

    • Function: Promotes channel-level aggregation of modality-shared features and enforces orthogonality between shared and specific features.
    • Mechanism: An ideal \(6 \times 6\) target similarity matrix \(A\) is defined, in which the specific-feature block is the identity matrix (mutual orthogonality), the shared-feature block is all ones (mutual consistency), and the cross-group blocks are all zeros (orthogonality between the two groups). The loss minimizes the squared error between the actual similarity matrix and the target: \(L_{ROL} = \sum_{i,j} (V_{sim}(i,j) - A(i,j))^2\) (implemented in the loss sketch after this list).
  4. Knowledge Discrepancy Loss (KDL):

    • Function: Ensures that the combination of shared and specific features is more discriminative than either component alone.
    • Mechanism: Following the triplet-loss paradigm, the combined features are required to yield smaller maximum positive-pair distances and larger minimum negative-pair distances: \(L_{KDL} = \|D_p - 0\|_1 + \|D_n - 1\|_1\), where \(D_p\) is the maximum positive-pair distance and \(D_n\) the minimum negative-pair distance of the combined features (see the loss sketch after this list).
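Below are a few hedged sketches of the designs above, assuming a PyTorch implementation. First, the MDL token mechanism and the unified feature vector (item 1). All names (`MDLTokens`, `build_v_full`) and initialization details are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

MODALITIES = ["R", "N", "T"]  # RGB, NIR, TIR

class MDLTokens(nn.Module):
    """Sketch: learnable modality-specific (I_sp) and modality-shared (I_sh)
    tokens, prepended to each modality's patch sequence before ViT encoding."""

    def __init__(self, dim: int = 768):
        super().__init__()
        # One (I_sp, I_sh) token pair per modality.
        self.sp = nn.ParameterDict({m: nn.Parameter(torch.zeros(1, 1, dim)) for m in MODALITIES})
        self.sh = nn.ParameterDict({m: nn.Parameter(torch.zeros(1, 1, dim)) for m in MODALITIES})
        for p in [*self.sp.values(), *self.sh.values()]:
            nn.init.trunc_normal_(p, std=0.02)  # init scheme is an assumption

    def prepend(self, patches: torch.Tensor, modality: str) -> torch.Tensor:
        """patches: (B, N, dim) patch embeddings of one modality."""
        B = patches.size(0)
        sp = self.sp[modality].expand(B, -1, -1)
        sh = self.sh[modality].expand(B, -1, -1)
        return torch.cat([sp, sh, patches], dim=1)  # (B, N + 2, dim)


def build_v_full(feats: dict, dim: int, batch: int):
    """Assemble v_full = [I_sp^R, I_sp^N, I_sp^T, I_sh^R, I_sh^N, I_sh^T].
    Missing modalities are zero-filled; a binary mask records availability."""
    parts, mask = [], []
    for group in ("sp", "sh"):
        for m in MODALITIES:
            f = feats.get((group, m))  # (B, dim) encoded token, or None if absent
            parts.append(f if f is not None else torch.zeros(batch, dim))
            mask.append(0.0 if f is None else 1.0)
    v_full = torch.stack(parts, dim=1)  # (B, 6, dim)
    avail = torch.tensor(mask)          # (6,) binary availability mask
    return v_full, avail
```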
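Next, the availability-masked similarity computation (item 2). The use of cosine similarity and the mean over valid pairs are assumptions; the paper only specifies the masking and the final average \(Sim_{total} = (Sim_{sp} + Sim_{sh})/2\).

```python
import torch
import torch.nn.functional as F

def masked_similarity(vq, mq, vg, mg):
    """Sim_total between one query and one gallery sample.

    vq, vg: (6, dim) rows ordered [sp_R, sp_N, sp_T, sh_R, sh_N, sh_T];
    mq, mg: (6,) binary availability masks.
    """
    vq = F.normalize(vq, dim=-1)
    vg = F.normalize(vg, dim=-1)

    # Sim_sp: compare specific features of the SAME modality only (rows 0-2);
    # a modality counts only if both query and gallery actually have it.
    sp_pair = mq[:3] * mg[:3]
    sp_sims = (vq[:3] * vg[:3]).sum(-1)
    sim_sp = (sp_sims * sp_pair).sum() / sp_pair.sum().clamp(min=1)

    # Sim_sh: average over ALL available shared-feature pairs (rows 3-5).
    pair_mask = mq[3:, None] * mg[None, 3:]  # (3, 3) valid query-gallery pairs
    sh_sims = vq[3:] @ vg[3:].T              # (3, 3) pairwise similarities
    sim_sh = (sh_sims * pair_mask).sum() / pair_mask.sum().clamp(min=1)

    return (sim_sp + sim_sh) / 2             # Sim_total
```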
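Finally, the two MML losses (items 3 and 4). The construction of \(D_p\)/\(D_n\) as batch-hard distances normalized to \([0, 1]\) is my reading of the triplet-loss description, not a confirmed implementation detail; the sketch also assumes identity-balanced (PK) sampling so every sample has an in-batch positive.

```python
import torch
import torch.nn.functional as F

def rol_loss(v_full: torch.Tensor) -> torch.Tensor:
    """Representation Orthogonality Loss.

    v_full: (B, 6, dim), rows [sp_R, sp_N, sp_T, sh_R, sh_N, sh_T].
    Target A: specific block = identity (mutually orthogonal), shared block =
    all ones (mutually consistent), cross-group blocks = zeros.
    """
    v = F.normalize(v_full, dim=-1)
    v_sim = v @ v.transpose(1, 2)              # (B, 6, 6) actual similarities

    A = torch.zeros(6, 6, device=v.device)
    A[:3, :3] = torch.eye(3, device=v.device)  # specific: orthogonal
    A[3:, 3:] = 1.0                            # shared: consistent
    return ((v_sim - A) ** 2).sum(dim=(1, 2)).mean()

def kdl_loss(feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Knowledge Discrepancy Loss: push the hardest positive distance D_p of
    the combined features toward 0 and the hardest negative distance D_n
    toward 1. Cosine distance scaled into [0, 1] is an assumption."""
    f = F.normalize(feats, dim=-1)
    dist = (1.0 - f @ f.T) / 2.0               # (B, B) distances in [0, 1]
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    pos, neg = same & ~eye, ~same

    d_p = dist.masked_fill(~pos, float("-inf")).amax(dim=1)  # hardest positive
    d_n = dist.masked_fill(~neg, float("inf")).amin(dim=1)   # hardest negative
    # L_KDL = ||D_p - 0||_1 + ||D_n - 1||_1
    return d_p.abs().mean() + (d_n - 1.0).abs().mean()
```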

Loss & Training

The total loss is \(L = L_{ce} + L_{tri} + L_{MML}\), where \(L_{MML} = w_1 \times L_{ROL} + w_2 \times L_{KDL}\), with \(w_1 = 1.5\) and \(w_2 = 5.25\).

Training uses the Adam optimizer with a batch size of 64, a base learning rate of \(3.5 \times 10^{-4}\), a ViT fine-tuning learning rate of \(5 \times 10^{-6}\), and runs for 50 epochs. The backbone is the CLIP-Base visual encoder.
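A minimal sketch of the optimizer setup implied by these hyperparameters, assuming a PyTorch implementation; `backbone` and `head` are hypothetical stand-ins for the CLIP-Base visual encoder and the remaining MDL/MML modules.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: the real backbone is the CLIP-Base visual encoder,
# and `head` bundles the MDL tokens and classifiers.
backbone = nn.Linear(512, 768)  # placeholder for the ViT backbone
head = nn.Linear(768, 100)      # placeholder classifier head

w1, w2 = 1.5, 5.25  # L_MML weights reported in the paper

# Two parameter groups: 5e-6 for ViT fine-tuning, 3.5e-4 as the base LR.
optimizer = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 5e-6},
    {"params": head.parameters(), "lr": 3.5e-4},
])
# Training then runs for 50 epochs with batch size 64, optimizing
# L = L_ce + L_tri + w1 * L_ROL + w2 * L_KDL.
```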

Key Experimental Results

Main Results

Modality-matched scenario (RNT-to-RNT):

| Method | RGBNT201 mAP | RGBNT201 R-1 | RGBNT100 mAP | RGBNT100 R-1 | MSVR310 mAP | MSVR310 R-1 |
|---|---|---|---|---|---|---|
| TOP-ReID | 72.3 | 76.6 | 81.2 | 96.4 | 35.9 | 44.6 |
| EDITOR | - | - | 82.1 | 96.4 | 39.0 | 49.3 |
| MDReID | 82.1 | 85.2 | 85.3 | 95.6 | 51.0 | 68.9 |

Modality-mismatched scenario (average over 4 settings):

| Method | RGBNT201 Avg mAP | RGBNT100 Avg mAP | MSVR310 Avg mAP |
|---|---|---|---|
| TOP-ReID | 18.2 | 26.8 | 11.2 |
| EDITOR | 8.5 | 11.9 | 2.5 |
| MDReID | 21.6 | 38.6 | 22.1 |

Ablation Study

| Config | MDL | \(L_{ROL}\) | \(L_{KDL}\) | mAP | R-1 |
|---|---|---|---|---|---|
| 1 (single classifier) | | | | 27.8 | 27.1 |
| 2 (MDL only) | ✓ | | | 39.4 | 38.2 |
| 3 (+ROL) | ✓ | ✓ | | 41.2 | 40.8 |
| 5 (full model) | ✓ | ✓ | ✓ | 43.2 | 42.3 |

Key Findings

  • MDL contributes the most: replacing the single classifier with modality-decoupled learning improves mAP from 27.8% to 39.4% (+11.6).
  • ROL and KDL each provide an additional ~2% gain, validating the effectiveness of decoupling constraints.
  • In modality-mismatched scenarios, MDReID surpasses TOP-ReID by 10.9% mAP on MSVR310, demonstrating strong robustness.

Highlights & Insights

  • Clear problem formulation: The paper is the first to systematically define the image-level any-to-any multi-modal ReID problem.
  • Elegant decoupling design: Feature decoupling is achieved naturally via shared/specific tokens combined with ViT self-attention, avoiding complex reconstruction modules.
  • Flexible similarity computation: Availability-mask-based similarity calculation adaptively handles arbitrary missing modalities.
  • Intuitive ROL target matrix: The \(6\times6\) ideal similarity matrix has a clear rationale—specific features are mutually orthogonal, shared features are mutually consistent, and the two groups are orthogonal to each other.

Limitations & Future Work

  • Validation is limited to three modalities (RGB/NIR/TIR); performance on additional modalities (e.g., depth, event cameras) remains unexplored.
  • Training assumes all modalities are available; modality-missing scenarios at training time are not investigated.
  • The degree of decoupling between shared and specific features is sensitive to the loss hyperparameters (\(w_1\), \(w_2\)).
  • Scalability to open-set ReID or large-scale datasets is not discussed.

Related Work

  • TOP-ReID: Aggregates multi-spectral features via cyclic token permutation, but is constrained by the modality-alignment assumption.
  • RLE: Identifies cross-spectral feature prediction as an ill-posed problem, motivating the decoupling approach in this work.
  • CLIP backbone: Leverages the strong representational capacity of pre-trained vision–language models to enhance feature quality.
  • Insights: The modality decoupling paradigm is broadly transferable to other multi-modal tasks, such as vision–language and vision–audio fusion.

Rating

  • Novelty: ⭐⭐⭐⭐ The decoupling idea itself is not novel, but its application and realization in any-to-any ReID is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, multiple modality-matched/mismatched/missing scenarios, and detailed ablations.
  • Writing Quality: ⭐⭐⭐⭐ Well-structured, with complete formulations and intuitive illustrations.
  • Value: ⭐⭐⭐⭐ Addresses a practically important deployment challenge in multi-modal ReID with significant performance gains.