Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning¶
Conference: ICCV 2025 arXiv: 2507.12942 Code: GitHub Area: Human Understanding Keywords: Visible-Infrared Person Re-Identification, Weakly Supervised Learning, Cross-Modal Matching, Heterogeneous Experts, Collaborative Consistency Learning
TL;DR¶
This paper proposes the first weakly supervised paradigm for visible-infrared person re-identification (VIReID), which relies solely on intra-modality identity annotations (without cross-modal correspondence labels). A heterogeneous expert collaborative consistency learning framework is introduced to establish cross-modal identity correspondences, achieving performance close to fully supervised methods.
Background & Motivation¶
Visible-infrared person re-identification (VIReID) is a core task in intelligent surveillance, requiring the matching of the same pedestrian across visible and infrared images. Existing methods face annotation challenges at three levels:
Fully supervised methods: Require accurate cross-modal identity correspondence annotations. However, visible and infrared cameras typically operate during different periods (day vs. night), making it costly and difficult to establish direct cross-modal pedestrian correspondences.
Semi-supervised methods: Utilize single-modality annotations combined with unlabeled data from the other modality. However, pseudo-label noise in the unlabeled modality severely degrades performance.
Unsupervised methods: Generate pseudo-labels via clustering without any annotations. Cross-modal pseudo-label quality is poor, and noise accumulation limits the performance ceiling.
The proposed weakly supervised setting: Each modality has intra-modality identity annotations (i.e., identity labels are available within the visible modality and within the infrared modality separately), but the cross-modal identity correspondences remain unknown. This setting offers two key advantages:
1. Intra-modality annotation is relatively easy (identity recognition is straightforward under the same camera and spectral characteristics).
2. It avoids the cumulative pseudo-label noise inherent in unsupervised methods.
Method¶
Overall Architecture¶
The method consists of two stages:
1. Heterogeneous Expert Learning (HEL): Identity classification experts are trained independently within each modality.
2. Collaborative Consistency Learning (CCL): The experts perform cross-modal identity prediction to establish correspondences and enable collaborative training.
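Stage 1 optimizes each expert with an intra-modality identity loss (the cross-entropy term \(\mathcal{L}_{id}^{exp}\) defined under HEL below). A minimal numpy sketch of that term, not the authors' code, with function names chosen here for illustration:

```python
import numpy as np

def expert_ce(probs, labels):
    """Mean cross-entropy for one modality's expert.
    probs: (n, C) softmax outputs; labels: (n,) integer identity ids."""
    n = probs.shape[0]
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

def intra_modality_loss(probs_v, y_v, probs_r, y_r):
    # Sum over modalities t in {v, r}, each term averaged over its n^t samples,
    # matching L_id^exp = -sum_t (1/n^t) sum_i y_i^t log p_i^t.
    return expert_ce(probs_v, y_v) + expert_ce(probs_r, y_r)
```

Each expert is scored only on samples from its own modality; the two terms are simply summed, so gradients never mix modalities in stage 1.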
Key Designs¶
- Heterogeneous Expert Learning (HEL):
- Function: Independently trains an identity classification expert for each modality.
- Mechanism: ResNet-50 is adopted as the backbone. The early layers are modality-specific (non-shared), while subsequent convolutional layers share parameters. Separate classifiers \(\boldsymbol{W}^v\) and \(\boldsymbol{W}^r\) are built for the visible and infrared modalities, respectively. Training employs a joint loss of cross-entropy and weighted regularized triplet loss: \(\mathcal{L}_{phase1} = \mathcal{L}_{id}^{exp} + \lambda_1 \mathcal{L}_{wrt}^{intr}\) where the cross-entropy loss is: \(\mathcal{L}_{id}^{exp} = -\sum_{t \in \{v,r\}} \frac{1}{n^t} \sum_i \boldsymbol{y}_i^t \log \boldsymbol{p}_i^t\)
- Design Motivation: The two experts are trained on different modalities and thus focus on different identity-relevant cues, hence the term "heterogeneous experts." Each expert attains strong intra-modality discriminability, providing the foundation for subsequent cross-modal prediction.
- Cross-Modal Relation Establishment (CRE):
- Function: Fuses cross-modal predictions from both experts to establish reliable cross-modal identity correspondences.
- Mechanism: Each expert predicts identities of samples from the other modality. A Count Priority Selection strategy yields decision matrices \(\boldsymbol{M}^{t \to \bar{t}}\). Cross-modal correspondences are categorized into three types:
- Consistent matches \(\boldsymbol{M}_c\): Both experts agree on the identity correspondence — most reliable. \(\boldsymbol{M}_c = \boldsymbol{M}^{t \to \bar{t}} \odot (\boldsymbol{M}^{\bar{t} \to t})^T\)
- Single matches \(\boldsymbol{M}_s\): Only one expert provides a correspondence prediction while the other is indecisive.
- Contradictory matches \(\boldsymbol{M}_w\): The two experts give conflicting correspondence predictions. \(\boldsymbol{M}_w = \boldsymbol{M}^{v \to r} + (\boldsymbol{M}^{r \to v})^T - 2\boldsymbol{M}_c - \boldsymbol{M}_s\)
- Design Motivation: A single expert may produce inaccurate cross-modal predictions, but mutual agreement between two experts yields high confidence. Categorizing correspondences into three types enables differentiated training strategies according to reliability level.
- Collaborative Consistency Learning (CCL):
- Function: Uses cross-modal correspondences to constrain the encoder to learn modality-invariant features, while progressively enhancing experts' cross-modal discriminability.
- Mechanism comprises two components:
- Cross-Modal Consistency Learning (CMCL):
- For consistent/single-match samples, a strong constraint (cross-modal cross-entropy) is applied: \(\mathcal{L}_{id}^{stro}\)
- For contradictory-match samples, a weak constraint (excluding impossible identities) is applied: \(\mathcal{L}_{id}^{weak} = -\frac{1}{n_w^v} \sum_i \boldsymbol{m}_i \log(1 - \boldsymbol{W}^c(\boldsymbol{f}_i^v) + \epsilon)\)
- Collaborative Learning of Asymmetric Experts (CLAE):
- Identity prototype features are maintained for each modality: \(\mathcal{P}_i^t \leftarrow \lambda \mathcal{P}_i^t + (1-\lambda) \bar{\boldsymbol{f}}_i^t\)
- A collaborative consistency loss encourages the experts to produce consistent predictions for cross-modal positive pairs: \(\mathcal{L}_{homo}^v = \frac{1}{n^c \times C^v} \sum_i \| \boldsymbol{p}_i^{v \to v} - \boldsymbol{p}_i^{r \to v} \|_2^2\)
- Information entropy is used to adaptively modulate constraint strength: the more uncertain the prediction, the stronger the collaborative constraint.
- Design Motivation: CMCL drives the encoder to learn cross-modal consistent features, while CLAE progressively improves experts' cross-modal prediction capability. The two components reinforce each other in a positive feedback loop. The weak constraint design prevents contradictory labels from harming training.
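The three match categories in CRE can be illustrated on toy decision matrices. This is a hedged sketch, not the authors' code: the "abstain" convention (an all-zero row means an expert is indecisive for that identity) is an assumed reading of "single matches."

```python
import numpy as np

# M_vr[i, j] = 1 if visible identity i is mapped to infrared identity j
# (prediction direction v -> r); M_rv is the reverse direction (r -> v).
M_vr = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 0]])   # shape (C_v, C_r); visible id 2 abstains
M_rv = np.array([[1, 0, 0],
                 [0, 0, 1],
                 [0, 1, 0]])   # shape (C_r, C_v)

# Consistent matches: both directions agree (elementwise product).
M_c = M_vr * M_rv.T

# Single matches: one direction votes while the other expert abstains
# entirely for the identity involved.
vis_decided = (M_vr.sum(axis=1, keepdims=True) > 0)   # (C_v, 1), per visible id
ir_decided = (M_rv.sum(axis=1) > 0)                   # (C_r,),  per infrared id
M_s = M_vr * (~ir_decided) + M_rv.T * (~vis_decided)

# Contradictory matches: the remainder, per the paper's identity
# M_w = M_vr + M_rv^T - 2*M_c - M_s.
M_w = M_vr + M_rv.T - 2 * M_c - M_s
```

On this toy input, the pair (v0, r0) is consistent, (v2, r1) is a single match (the visible expert abstained for v2), and the disagreement around v1 lands in \(\boldsymbol{M}_w\).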
Loss & Training¶
The total loss for the collaborative consistency learning stage: \(\mathcal{L}_{phase2} = \mathcal{L}_{id}^{exp} + \mathcal{L}_{id}^{stro} + \mathcal{L}_{homo} + \lambda_1 \mathcal{L}_{wrt}^{cros} + \lambda_2 \mathcal{L}_{id}^{weak}\)
Training hyperparameters:
- Momentum update coefficient \(\lambda = 0.8\); \(\lambda_1 = \lambda_2 = 0.25\)
- Initial learning rate: \(3 \times 10^{-4}\) for the encoder, \(6 \times 10^{-4}\) for experts and shared classifiers
- CCL stage: 120 epochs with 10 warmup epochs
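Two pieces of this stage are simple enough to pin down numerically: the momentum (EMA) prototype update with \(\lambda = 0.8\) and the weighted loss assembly with \(\lambda_1 = \lambda_2 = 0.25\). A minimal sketch with illustrative function names (not the authors' code):

```python
import numpy as np

def update_prototype(proto, feat_mean, lam=0.8):
    # EMA update of an identity prototype: P <- lam * P + (1 - lam) * f_bar
    return lam * proto + (1.0 - lam) * feat_mean

def phase2_loss(l_id_exp, l_id_stro, l_homo, l_wrt_cros, l_id_weak,
                lam1=0.25, lam2=0.25):
    # Total stage-2 objective with the paper's weights lam1 = lam2 = 0.25.
    return l_id_exp + l_id_stro + l_homo + lam1 * l_wrt_cros + lam2 * l_id_weak
```

With \(\lambda = 0.8\), a prototype retains 80% of its previous value per update, so correspondences shift slowly and pseudo-label noise is smoothed over iterations.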
Key Experimental Results¶
Main Results¶
SYSU-MM01 (All Search mode):
| Method | Type | Rank-1 | mAP | Gain |
|---|---|---|---|---|
| DPIS (ICCV'23) | Semi-supervised | 58.4 | 55.6 | - |
| GUR (ICCV'23) | Unsupervised | 63.5 | 61.6 | - |
| DEEN (CVPR'23) | Fully supervised | 74.7 | 71.8 | - |
| Ours | Weakly supervised | 70.4 | 66.6 | +12.0/+11.0 vs DPIS |
LLCM (VIS to IR mode):
| Method | Type | Rank-1 | mAP | Gain |
|---|---|---|---|---|
| OTLA (ECCV'22) | Semi-supervised | 44.2 | 48.2 | - |
| PGM (CVPR'23) | Unsupervised | 44.9 | 49.0 | - |
| DEEN (CVPR'23) | Fully supervised | 62.5 | 65.8 | - |
| Ours | Weakly supervised | 55.3 | 58.7 | +11.1/+10.5 vs OTLA |
Ablation Study¶
Contribution of each module (SYSU-MM01 All Search):
| Configuration | Rank-1 | mAP | Note |
|---|---|---|---|
| Baseline (HEL only) | 47.8 | 47.2 | Intra-modality training only |
| B + CMCL w/o CRE | 66.7 | 62.8 | CMCL with consistent predictions only |
| B + CRE + CMCL | 68.3 | 64.5 | Adding relation fusion |
| B + CRE + CMCL + CLAE | 70.4 | 66.6 | Full model |
Key Findings¶
- Intra-modality training alone (Baseline) already achieves non-trivial cross-modal retrieval (47.8% Rank-1), indicating that shared-parameter layers capture preliminary modality-invariant features.
- CMCL contributes the largest gain (+18.9% Rank-1), confirming that establishing cross-modal correspondences is the core contribution.
- Relation fusion via CRE further improves Rank-1 by 1.6%, demonstrating that fusing predictions from two experts outperforms relying on a single expert.
- CLAE provides an additional 2.1% Rank-1 gain, showing that experts continuously improve cross-modal discriminability during training.
- The weakly supervised performance (70.4%) approaches that of the fully supervised method DEEN (74.7%), with only a 4.3% gap, while substantially reducing annotation cost.
Highlights & Insights¶
- The weakly supervised paradigm is elegantly designed: it avoids the expensive annotations of fully supervised methods while providing more reliable intra-modality supervision signals than semi-supervised or unsupervised approaches.
- Hierarchical treatment of three correspondence types is the core innovation: consistent, single, and contradictory matches are handled with strong, moderate, and weak (exclusion-based) constraints, respectively, mitigating the harmful effects of noisy labels.
- Expert collaborative learning forms a positive feedback loop: better correspondences → better features → more accurate expert predictions → better correspondences.
- Entropy-based adaptive weighting realizes the intuition that "greater uncertainty warrants stronger constraint."
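The "greater uncertainty, stronger constraint" intuition maps naturally onto a normalized-entropy weight. The exact modulation in the paper may differ; this is an assumed form for illustration:

```python
import numpy as np

def entropy_weight(p, eps=1e-12):
    # Normalized Shannon entropy in [0, 1]: a uniform prediction gives 1
    # (maximal uncertainty -> strongest collaborative constraint),
    # a one-hot prediction gives ~0 (confident -> weakest constraint).
    C = p.shape[-1]
    H = -(p * np.log(p + eps)).sum(-1)
    return H / np.log(C)
```

Scaling the collaborative consistency loss by such a weight concentrates the cross-expert agreement pressure on exactly the samples whose identity is still ambiguous.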
Limitations & Future Work¶
- The heterogeneous experts employ simple classifiers; more sophisticated expert architectures (e.g., Mixture of Experts) could be explored.
- Correspondence establishment is a discrete, one-shot process; end-to-end soft alignment approaches (e.g., continuous relaxation via optimal transport) are worth investigating.
- Validation is conducted only on SYSU-MM01 and LLCM; generalizability requires further verification on additional datasets.
- The weakly supervised setting assumes complete intra-modality identity annotations, which may still incur substantial annotation cost in extreme scenarios.
Related Work & Insights¶
- The optimal transport strategy in OTLA (ECCV 2022) is complementary to the proposed CRE and could be integrated in future work.
- Contrastive learning frameworks (e.g., InfoNCE) may replace cross-entropy loss for more robust cross-modal alignment.
- The weakly supervised paradigm has potential for generalization to other cross-modal matching tasks (e.g., RGB-depth, RGB-text person re-identification).
- The expert collaboration mechanism can draw inspiration from the momentum teacher paradigm in self-supervised learning.
Rating¶
- Novelty: ⭐⭐⭐⭐ Pioneers the weakly supervised VIReID paradigm; the three-category correspondence design is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ablation study; thorough comparison across fully supervised, semi-supervised, and unsupervised paradigms.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation and method description are clear; mathematical derivations are rigorous.
- Value: ⭐⭐⭐⭐ Significantly reduces annotation cost in practical deployment while maintaining competitive performance.