Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification¶

Conference: ECCV 2024
arXiv: 2401.06825
Code: GitHub
Area: Human Understanding
Keywords: Unsupervised Person Re-Identification, Visible-Infrared Cross-Modal, Multi-Memory Matching, Pseudo Label, Clustering

TL;DR¶

A Multi-Memory Matching (MMM) framework is proposed for unsupervised visible-infrared person re-identification. It establishes reliable cross-modal correspondences through three modules: Cross-Modal Clustering (CMC), Multi-Memory Learning and Matching (MMLM), and Soft Cluster-level Alignment loss (SCA), achieving a Rank-1 accuracy of 61.6% on SYSU-MM01 and 89.7% on RegDB.

Background & Motivation¶

Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims to retrieve pedestrians across visible and infrared modalities without relying on annotations, which is a key technology for achieving 24-hour intelligent surveillance. Existing methods generate pseudo-labels and establish cross-modal correspondences based on clustering, but suffer from the following core problems:

Unreliable cross-modal correspondences: The authors introduce the Adjusted Rand Index (ARI) to measure correspondence quality and find that, although existing methods score well on retrieval metrics, their cross-modal correspondences are poor (low ARI values).

Insufficient single-memory representation: Existing methods represent an identity using a single memory (i.e., a single cluster center), which fails to capture fine-grained variations such as multi-view and multi-pose representations of individuals, leading to highly noisy cross-modal matching.

Paradox phenomenon: Noisy correspondences further confuse different pedestrians who share overlapping attributes. This can raise intra-class feature similarity and improve the reported metrics, yet the ability to retrieve the exact identity remains limited.

The core insight of this paper: Compared to a single memory, multiple memories can more completely represent the diverse features of an identity (e.g., front, back views), thereby establishing more reliable cross-modal correspondences.

Method¶

Overall Architecture¶

MMM employs ResNet50 (pre-trained on ImageNet) as the shared backbone network to extract 2048-dimensional features. The overall pipeline is: (1) The CMC module generates pseudo-labels; (2) The MMLM module establishes cross-modal correspondences through multi-memory matching; (3) The SCA loss reduces the modality gap and mitigates the impact of noisy pseudo-labels.

Key Designs¶

Cross-Modal Clustering (CMC): The foundational module for generating pseudo-labels.
- The DBSCAN algorithm is used to cluster visible samples, infrared samples, and their mixed samples separately: \(Y^t = DBSCAN(F^t)\).
- Unlike existing methods, it performs not only intra-modality clustering (\(t=v\) or \(t=r\)) but also joint inter-modality clustering (\(t=\{v,r\}\)), indirectly establishing cross-modal correspondences.
- Three memories are computed for each cluster: visible memory \(C_{V^p}\), infrared memory \(C_{R^p}\), and mixed memory \(C_{VR^p}\).
- Optimization is based on the ClusterNCE contrastive loss: \(L_{CMC} = L_V + L_R + L_{VR}\).
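
As a rough illustration of the CMC objective, the sketch below builds one memory per pseudo-label and evaluates a ClusterNCE-style contrastive loss for a single query feature. It is a minimal pure-Python sketch, not the authors' implementation. The function names (`l2_normalize`, `cluster_memories`, `cluster_nce`), the use of L2-normalised mean features as memories, and the convention that label -1 marks unclustered samples are all assumptions. Pseudo-labels are taken as given; the DBSCAN step itself is not reproduced.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cluster_memories(features, labels):
    """One memory per pseudo-label: the L2-normalised mean feature.

    Samples labelled -1 are skipped (assumed convention for unclustered outliers).
    """
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y == -1:
            continue
        acc = sums.setdefault(y, [0.0] * len(f))
        for d, x in enumerate(f):
            acc[d] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: l2_normalize([s / counts[y] for s in acc]) for y, acc in sums.items()}

def cluster_nce(query, label, memories, tau=0.05):
    """-log softmax of the query's similarity to its own memory, with temperature tau."""
    q = l2_normalize(query)
    logits = {y: sum(a * b for a, b in zip(q, c)) / tau for y, c in memories.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_denominator - logits[label]

# Toy data: two clusters plus one outlier.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [5.0, 5.0]]
labels = [0, 0, 1, 1, -1]
memories = cluster_memories(features, labels)
print(cluster_nce([1.0, 0.0], 0, memories))  # small loss: query agrees with its memory
print(cluster_nce([1.0, 0.0], 1, memories))  # large loss: query disagrees with its memory
```

Applying `cluster_memories` to the visible, infrared, and mixed feature sets in turn would give the three memory banks, and summing the three resulting losses corresponds to \(L_{CMC} = L_V + L_R + L_{VR}\).
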
Multi-Memory Learning and Matching (MMLM): The core innovation to establish reliable cross-modal correspondences.
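- Multi-memory learning: Each cluster is further divided into \(n\) sub-clusters with K-Means, and the center of each sub-cluster serves as a memory. K-Means minimizes the distance between every sample and the center of the sub-cluster it is assigned to: \(\min_{F_{C_{V_i^p}}} \sum_{i=1}^{n} \sum_{f^v \in F_{C_{V_i^p}}} \|f^v - K_{C_{V_i^p}}\|_2^2\), where \(F_{C_{V_i^p}}\) denotes the visible features assigned to the \(i\)-th sub-cluster and \(K_{C_{V_i^p}}\) its center.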
- For example, Memory 1 captures frontal features, while Memory 2 captures back-view features, representing the individual more comprehensively.
- Multi-memory matching: Cross-modal matching is formulated as a weighted bipartite graph matching problem. Each entry of the cost matrix is the sum, over the visible memories of one cluster, of the distance to the nearest infrared memory of the other cluster: \(M(K_{C_{V^p}}, K_{C_{R^{p'}}}) = \sum_{i=1}^{n} \min_{j \in \{1,...,n\}} \|K_{C_{V_i^p}} - K_{C_{R_j^{p'}}}\|_2\)
- The Hungarian algorithm is used to solve for the optimal matching \(Q\) to transfer infrared pseudo-labels to the visible domain: \(Y^v := QY^r\).
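
The two MMLM steps can be sketched as follows, under stated assumptions. `kmeans_memories` is a plain K-Means with a farthest-point initialisation (the initialisation is an assumption, chosen only to keep the example deterministic). `memory_cost` follows the nearest-neighbour cost described above. `match_clusters` finds the minimum-cost assignment by brute force over permutations, which is swapped in for the Hungarian algorithm: it reaches the same optimum on small inputs but scales factorially, and it assumes there are at least as many infrared clusters as visible ones. All function names are hypothetical.

```python
import math
from itertools import permutations

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans_memories(features, n, iters=20):
    """Split one cluster into n sub-clusters and return the n sub-cluster centres."""
    # Farthest-point initialisation (assumption, for determinism).
    centres = [list(features[0])]
    while len(centres) < n:
        centres.append(list(max(features, key=lambda f: min(dist(f, c) for c in centres))))
    for _ in range(iters):
        groups = [[] for _ in range(n)]
        for f in features:
            nearest = min(range(n), key=lambda i: dist(f, centres[i]))
            groups[nearest].append(f)
        # Keep the previous centre if a sub-cluster ends up empty.
        centres = [mean(g) if g else c for g, c in zip(groups, centres)]
    return centres

def memory_cost(vis_memories, ir_memories):
    """Sum over visible memories of the distance to the nearest infrared memory."""
    return sum(min(dist(kv, kr) for kr in ir_memories) for kv in vis_memories)

def match_clusters(vis_clusters, ir_clusters):
    """Return match[p] = index of the infrared cluster assigned to visible cluster p."""
    cost = [[memory_cost(v, r) for r in ir_clusters] for v in vis_clusters]
    best = min(
        permutations(range(len(ir_clusters)), len(vis_clusters)),
        key=lambda perm: sum(cost[p][q] for p, q in enumerate(perm)),
    )
    return list(best)

# Each cluster is represented by its list of memories (n = 2 here).
visible = [[[0.0, 0.0], [0.0, 1.0]], [[10.0, 10.0], [10.0, 11.0]]]
infrared = [[[10.2, 10.0], [10.0, 11.1]], [[0.1, 0.0], [0.0, 1.2]]]
print(match_clusters(visible, infrared))
```

The resulting assignment plays the role of \(Q\): each visible cluster takes the pseudo-label of the infrared cluster it is matched to.
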
Soft Cluster-level Alignment Loss (SCA): Mitigates the impact of noisy pseudo-labels and narrows the modality gap.
- Confidence estimation: A two-component Gaussian Mixture Model (GMM) is used to model the loss distribution, calculating the label confidence \(W^v\) for each sample via posterior probability.
- Confidence-weighted memory update: Memories are updated with confidence weights to reduce the influence of noisy samples: \(C_{V^p} := \frac{1}{N_p} \sum_i f(V_i^p) W_{V_i^p}\).
- Intra-modality alignment (Intra): Aligns samples of the same ID to their confidence-weighted cluster centers: \(L_{Intra} = \sum_p \sum_{f^v} \|f^v - C_{V^p}\|_2^2 + \sum_p \sum_{f^r} \|f^r - C_{R^p}\|_2^2\).
- Inter-modality alignment (Inter): MMD² (Maximum Mean Discrepancy) is used to measure the difference in feature distributions of the same ID across the two modalities. Minimizing this difference achieves a soft many-to-many alignment: \(L_{Inter} = \frac{1}{P} \sum_p \frac{1}{2}[D(F_p^v, sg(F_p^r)) + D(F_p^r, sg(F_p^v))]\)
- A stop-gradient operation \(sg(\cdot)\) is applied to one side of each term to prevent the two modalities from collapsing onto each other.
- Total SCA loss: \(L_{SCA} = \lambda_{Intra} L_{Intra} + \lambda_{Inter} L_{Inter}\).
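
The building blocks of SCA can be illustrated with a small pure-Python sketch. It is written under several assumptions that the text above does not fix: the GMM is fitted with a hand-written EM loop on scalar losses, the lower-mean component is treated as the "clean" one (the DivideMix convention), and MMD² uses a Gaussian kernel with a hypothetical bandwidth `sigma` and the biased estimator. The stop-gradient only affects backpropagation, so it has no counterpart in this forward-only example. All function names are illustrative.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posteriors(x, mu, var, pi):
    """Posterior probability of each of the two components given loss x."""
    p = [pi[k] * gaussian_pdf(x, mu[k], var[k]) for k in range(2)]
    total = sum(p)
    return [pk / total for pk in p] if total > 0 else [0.5, 0.5]

def label_confidence(losses, iters=50, min_var=1e-6):
    """Fit a 1-D two-component GMM with EM; return P(low-loss component | loss)."""
    mean_loss = sum(losses) / len(losses)
    spread = max(sum((x - mean_loss) ** 2 for x in losses) / len(losses), min_var)
    mu, var, pi = [min(losses), max(losses)], [spread, spread], [0.5, 0.5]
    for _ in range(iters):
        resp = [posteriors(x, mu, var, pi) for x in losses]  # E-step
        for k in range(2):                                    # M-step
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            mu[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, losses)) / nk, min_var)
            pi[k] = nk / len(losses)
    clean = 0 if mu[0] <= mu[1] else 1  # assumption: low-loss component = clean labels
    return [posteriors(x, mu, var, pi)[clean] for x in losses]

def weighted_memory(features, weights):
    """Confidence-weighted memory: (1 / N_p) * sum_i f_i * w_i."""
    n = len(features)
    return [sum(f[d] * w for f, w in zip(features, weights)) / n for d in range(len(features[0]))]

def rbf(a, b, sigma):
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / (2 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two feature sets (Gaussian kernel)."""
    kxx = sum(rbf(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

# Four low-loss samples (likely clean) and three high-loss samples (likely noisy).
confidence = label_confidence([0.10, 0.12, 0.09, 0.11, 2.0, 2.1, 1.9])
print([round(c, 3) for c in confidence])
```

In a training loop, the confidences would weight the memory update for each cluster, and `mmd2` would be evaluated per identity on the visible and infrared features that share a pseudo-label.
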

Loss & Training¶

Total loss: \(L_{overall} = L_{CMC} + L_{SCA}\)

Backbone: ResNet50, pre-trained on ImageNet.
Trained for 80 epochs. At each step, 8 IDs are sampled, with 4 visible + 4 infrared images selected per ID.
Image size is 288×144, with random flipping and cropping augmentation.
SGD optimizer with momentum=0.9, weight decay=5e-4.
Intra loss is added from the 1st epoch, and Inter loss is added from the 15th epoch.
Temperature coefficient \(\tau=0.05\), DBSCAN parameters eps=0.6, min_samples=4.
Optimal hyperparameters: \(n=4\) (number of memories), \(\lambda_{Intra}=0.5\), \(\lambda_{Inter}=0.05\).
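
The loss schedule described above can be summarised in a few lines. This is a sketch under two assumptions: epochs are counted from 1, and "added from the N-th epoch" means the term is active for every epoch greater than or equal to N. The function name `overall_loss` and its parameter names are illustrative.

```python
def overall_loss(l_cmc, l_intra, l_inter, epoch,
                 lam_intra=0.5, lam_inter=0.05,
                 intra_start=1, inter_start=15):
    """Combine the loss terms, enabling Intra and Inter at their start epochs."""
    total = l_cmc
    if epoch >= intra_start:
        total += lam_intra * l_intra
    if epoch >= inter_start:
        total += lam_inter * l_inter
    return total

# Same raw loss values, evaluated before and after the Inter term is switched on.
print(overall_loss(1.0, 2.0, 4.0, epoch=14))
print(overall_loss(1.0, 2.0, 4.0, epoch=15))
```
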

Key Experimental Results¶

Main Results¶

SYSU-MM01 All Search & RegDB Visible2Thermal

Method	Type	SYSU R-1	SYSU mAP	RegDB R-1	RegDB mAP
ADCA	USL	45.5	42.7	67.2	64.1
ADCA+MMM	USL	49.7	44.7	77.8	70.9
GUR*	USL	61.0	57.0	73.9	70.2
PCLHD	USL	64.4	58.7	84.3	80.7
MMM	USL	61.6	57.9	89.7	80.5
MMM+PCLHD	USL	65.9	61.8	89.6	83.7
DPIS	Semi	58.4	55.6	62.3	53.2
AGW	Sup.	47.5	47.7	70.1	66.4

In the unsupervised setting, MMM surpasses several semi-supervised and supervised methods (e.g., DPIS and AGW in the table above). On RegDB, it improves over GUR by 15.8 points in Rank-1 (89.7 vs. 73.9) and by 10.3 points in mAP (80.5 vs. 70.2).

Ablation Study¶

Configuration	All Search R-1	All Search mAP	Indoor Search R-1	Indoor Search mAP
Baseline (CMC only)	51.74	49.81	56.34	64.46
+ MMLM	55.15	52.21	58.76	65.47
+ MMLM + Intra	58.48	55.05	62.19	68.09
+ MMLM + Inter	57.26	53.81	60.26	66.66
+ MMLM + Intra + Inter	61.56	57.92	64.37	70.40
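
To make the contribution of each component explicit, the short script below recomputes the gains from the first two metric columns of the ablation table. The numbers are copied from the table; the dictionary layout and the helper name `gain` are illustrative.

```python
# (Rank-1, mAP) from the first two metric columns of the ablation table.
ablation = {
    "Baseline (CMC only)": (51.74, 49.81),
    "+ MMLM": (55.15, 52.21),
    "+ MMLM + Intra": (58.48, 55.05),
    "+ MMLM + Inter": (57.26, 53.81),
    "+ MMLM + Intra + Inter": (61.56, 57.92),
}

def gain(config, reference):
    """Difference in (Rank-1, mAP) between two configurations, in points."""
    return tuple(round(a - b, 2) for a, b in zip(ablation[config], ablation[reference]))

print(gain("+ MMLM", "Baseline (CMC only)"))                  # (3.41, 2.4)
print(gain("+ MMLM + Intra + Inter", "+ MMLM"))               # (6.41, 5.71)
print(gain("+ MMLM + Intra + Inter", "Baseline (CMC only)"))  # (9.82, 8.11)
```

On top of MMLM, Intra alone adds 3.33 Rank-1 points and Inter alone adds 2.11, while using both adds 6.41, which is more than the 5.44 obtained by summing the two individual gains.
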

Key Findings¶

The MMLM module improves Rank-1 by 3.41 points over the baseline (51.74 to 55.15), supporting the claim that multiple memories establish cross-modal correspondences more effectively than a single memory.
Intra and Inter losses complement each other. Adding the complete SCA loss on top of MMLM raises Rank-1 by 6.41 points and mAP by 5.71 points; together, MMLM and SCA improve on the baseline by 9.82 points in Rank-1 and 8.11 points in mAP.
The number of memories \(n=4\) is optimal, indicating that too few memories cannot express diversity, while too many introduce noise.
Visualization analysis shows that with MMM, the mean intra-class distance decreases and the mean inter-class distance increases, making the feature distribution more discriminative.
The ARI metric indicates that the reliability of cross-modal correspondences of MMM is significantly superior to methods like GUR.
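
For reference, the Adjusted Rand Index compares two labelings of the same samples and is invariant to how the cluster IDs are named. The sketch below is a standard pair-counting implementation in pure Python; how exactly the authors apply ARI to cross-modal pseudo-labels is not detailed here, so the toy inputs are only an assumption about usage.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI: 1.0 for identical partitions, about 0 for random ones."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0
    cells = Counter(zip(labels_true, labels_pred))  # contingency table entries
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Same grouping with renamed cluster IDs, then a grouping that splits every identity.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the score depends only on which samples are grouped together, it can stay low even when retrieval metrics look good, which is the gap the paper highlights.
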

Highlights & Insights¶

Discovery of an important paradox: Existing USL-VI-ReID methods achieve good retrieval results even though the cross-modal correspondences they rely on are highly unreliable.
Concept of multi-memory: Splitting a single cluster center into multiple sub-cluster centers to describe identity diversity in a more fine-grained manner, which is an effective improvement over the Cluster-Contrast paradigm.
GMM confidence estimation: Utilizing loss distribution modeling to soften the impact of noisy pseudo-labels, which is more elegant than hard threshold filtering.
Method versatility: MMM can serve as a plug-and-play module to enhance other methods (e.g., ADCA+MMM, MMM+PCLHD), validating the broad applicability of the framework.

Limitations & Future Work¶

The authors admit there is still a gap compared to supervised methods (e.g., DEEN Rank-1 74.7% vs MMM 61.6%), mainly limited by the lack of cross-modal data annotation.
The number of sub-clusters for multi-memory, \(n\), needs to be set manually, and different datasets/identities may require different values.
Results are sensitive to the DBSCAN clustering parameters, which require careful tuning.
Leveraging vision-language pre-trained models such as CLIP to introduce semantic priors could be explored to enhance the quality of cross-modal matching.
Computing Hungarian matching for multi-memory may become a bottleneck in large-scale scenarios.

Related Work & Insights¶

Cluster-Contrast: A contrastive learning method using a single cluster center; MMM is a generalization of its memory representation.
PGM: Models cross-modal matching as bipartite graph matching, which inspired the matching strategy of MMLM.
DivideMix: Uses GMM to model loss distributions to handle noisy labels, which inspired the confidence estimation of SCA.
Insights: Multi-memory representation and soft alignment strategies can be generalized to other unsupervised cross-domain/cross-modal matching tasks.

Rating¶

Novelty: ⭐⭐⭐⭐ The multi-memory matching idea is novel, and the ARI metric reveals the blind spots of existing methods.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on two standard datasets and compared against supervised, semi-supervised, and unsupervised methods, with comprehensive ablation studies.
Writing Quality: ⭐⭐⭐⭐ Deep problem analysis, and the observation of the paradox phenomenon is highly valuable.
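Value: ⭐⭐⭐⭐ Achieves state-of-the-art or competitive results in unsupervised VI-ReID (best RegDB Rank-1 among the listed methods, while PCLHD is higher on SYSU-MM01), and the framework can serve as a plug-and-play module compatible with other methods.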