Multi-modal Multi-platform Person Re-Identification: Benchmark and Method¶
Conference: ICCV 2025 · arXiv: 2503.17096 · Code: GitHub · Area: Video Understanding · Keywords: Person Re-Identification, Multi-modal, Multi-platform, Prompt Learning, CLIP
TL;DR¶
This paper presents MP-ReID, the first multi-modal multi-platform person re-identification benchmark, covering three modalities (RGB, infrared, thermal) and two platforms (ground and UAV). It also proposes Uni-Prompt ReID, a unified prompt learning framework that leverages modality-aware, platform-aware, and visual-enhanced prompts to substantially improve ReID performance under complex real-world conditions.
Background & Motivation¶
Conventional ReID research has been largely confined to single-modality (RGB) + fixed-camera settings, which are inadequate for the increasingly heterogeneous sensor deployments found in real urban environments. Consider a 24/7 urban pedestrian surveillance system comprising:
- Ground RGB cameras: daytime scenarios
- Infrared/thermal sensors: nighttime or adverse lighting
- Unmanned Aerial Vehicles (UAVs): dynamic tracking with flexible viewpoints
Such a multi-modal + multi-platform configuration introduces three compounding challenges: (1) modality gap (appearance discrepancy among RGB, infrared, and thermal imaging), (2) platform gap (viewpoint and resolution differences between ground-level and aerial capture), and (3) the extreme difficulty when both gaps are present simultaneously.
Limitations of existing datasets:
- Cross-modal datasets (SYSU-MM01, LLCM) cover only RGB + infrared and use only ground cameras.
- UAV datasets (AG-ReID) cover only the RGB modality.
- No existing dataset simultaneously addresses multiple modalities and multiple platforms.
This critical gap motivates the construction of MP-ReID and the design of a corresponding unified learning framework.
Method¶
Overall Architecture¶
Uni-Prompt ReID is built upon the CLIP vision-language model and is fine-tuned through carefully designed multi-part textual prompts. The framework consists of three categories of learnable prompts and a visual-enhanced network that injects image features into the text prompt space.
Key Designs¶
- MP-ReID Dataset Construction
The dataset spans 3 modalities × 2 platforms:
- Ground RGB: 6 Hikvision full-color cameras, 1920×1080
- Ground infrared: 6 cameras in infrared night-vision mode
- UAV RGB: DJI Mavic 3T, 3840×2160
- UAV thermal: DJI Mavic 3T thermal camera, 640×512
Dataset scale: 1,930 identities, 136,156 annotated bounding boxes, 14 cameras, and over 13 hours of total video footage. UAV data was collected at three altitudes (5 m / 7 m / 10 m) with pitch angles ranging from 30° to 80°. All data underwent facial mosaicking and original footage deletion to protect privacy.
- Uni-Prompt Multi-Part Textual Prompts
The textual prompt is formed by concatenating three parts:
\(t_i(a) = X_1(a) \cdots X_M(a) \; P_1(a) \cdots P_R(a) \; M_1(a) \cdots M_B(a), \text{person}_i\)
- Specific ReID Prompt (\(X\)): encodes individual-specific information (identity level)
- Platform-Aware Prompt (\(P\)): incorporates platform-specific context (ground vs. aerial)
- Modality-Aware Prompt (\(M\)): captures modality-specific details (RGB vs. infrared vs. thermal)
- Visual-Enhanced Network
A lightweight neural network \(g_\theta(\cdot)\) maps the image feature \(a\) to context vectors:
\(\sigma = (\sigma_X, \sigma_P, \sigma_M) = g_\theta(a)\)
These are added to the corresponding prompts: \(S_m(a) = [S]_m + \sigma_S\)
Intuition: visual features of infrared images inherently contain modality cues, which can guide the modality-aware prompts to specialize toward the infrared domain.
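The prompt assembly and visual enhancement above can be sketched in a few lines of NumPy. All dimensions, the linear form of \(g_\theta\), and the placeholder class-token embedding are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): token dim D, M identity tokens,
# R platform tokens, B modality tokens, image-feature dim F.
D, M, R, B, F = 8, 4, 4, 4, 16

# Learnable prompt token embeddings for the three parts.
X  = rng.normal(size=(M, D))  # Specific ReID Prompt tokens X_1..X_M
P  = rng.normal(size=(R, D))  # Platform-Aware Prompt tokens P_1..P_R
Mo = rng.normal(size=(B, D))  # Modality-Aware Prompt tokens M_1..M_B

# g_theta: a lightweight map from the image feature a to one context
# vector per prompt part (here sketched as one linear layer per part).
W_X, W_P, W_M = (rng.normal(size=(D, F)) * 0.01 for _ in range(3))

def enhanced_prompt(a):
    sigma_X, sigma_P, sigma_M = W_X @ a, W_P @ a, W_M @ a
    # S_m(a) = [S]_m + sigma_S: the same offset is added to every token
    # of its part, conditioning the prompt on the input image.
    X_a, P_a, Mo_a = X + sigma_X, P + sigma_P, Mo + sigma_M
    person_i = np.zeros((1, D))  # placeholder for the "person_i" class token
    # t_i(a) = X_1..X_M  P_1..P_R  M_1..M_B  person_i
    return np.concatenate([X_a, P_a, Mo_a, person_i], axis=0)

a = rng.normal(size=F)  # image feature of one sample
t = enhanced_prompt(a)
print(t.shape)          # (M + R + B + 1, D)
```

In the full framework this token sequence would be fed to CLIP's text encoder; the sketch only shows how the three parts and the visual context vectors compose.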
Loss & Training¶
Two-stage training:
- Stage 1: Freeze modality and platform prompts; learn the Specific ReID Prompt using CLIP-ReID's \(\mathcal{L}_{i2t} + \mathcal{L}_{t2i}\).
- Stage 2: Freeze the ReID Prompt; learn the modality-aware and platform-aware prompts using modality-level and platform-level contrastive losses.
Each loss is an InfoNCE-form contrastive term that aligns features to modality and platform labels, respectively. Data augmentation includes random erasing (\(p=0.5\)), random horizontal flipping, and random cropping.
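Stage 2's modality-level and platform-level objectives are described as InfoNCE-form contrastive losses over modality/platform labels. A minimal NumPy sketch of such a supervised InfoNCE term follows; the temperature value and the toy batch are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def supervised_info_nce(feats, labels, tau=0.07):
    """InfoNCE-form loss: for each anchor, batch samples sharing its label
    (e.g. the same modality or platform) are positives; all other
    samples act as negatives."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T / tau                       # cosine similarity / temperature
    n = len(labels)
    eye = np.eye(n, dtype=bool)
    # log-sum-exp over all non-self samples (self excluded via -inf)
    lse = np.logaddexp.reduce(np.where(eye, -np.inf, sim), axis=1)
    log_prob = sim - lse[:, None]             # log-softmax per anchor
    pos = (labels[:, None] == labels[None, :]) & ~eye
    has_pos = pos.any(axis=1)
    per_anchor = -(log_prob * pos).sum(axis=1)[has_pos] / pos.sum(axis=1)[has_pos]
    return per_anchor.mean()

# Toy batch: two samples per "modality". When same-modality features are
# aligned, the loss is near zero; with mismatched grouping it is large.
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
aligned_loss    = supervised_info_nce(feats, np.array([0, 0, 1, 1]))
mismatched_loss = supervised_info_nce(feats, np.array([0, 1, 0, 1]))
print(aligned_loss, mismatched_loss)  # near 0 vs. large
```

Applying this term once with modality labels and once with platform labels gives the two Stage 2 objectives described above.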
Key Experimental Results¶
Main Results¶
Average results across three MP-ReID benchmark settings
| Method | Cross-Platform Rank-1 | Cross-Modal Rank-1 | Cross-Modal+Platform Rank-1 | Avg. Rank-1 | Avg. mAP |
|---|---|---|---|---|---|
| CAJ | 40.36 | 45.34 | 10.62 | 32.11 | 21.51 |
| CAJ+ | 47.60 | 58.16 | 21.51 | 42.42 | 30.61 |
| AGW | 53.68 | 51.88 | 19.21 | 41.59 | 30.56 |
| DEEN | 60.05 | 69.59 | 27.59 | 52.41 | 39.33 |
| OTLA-ReID | 73.24 | 68.12 | 29.31 | 56.89 | 43.03 |
| Uni-Prompt | 78.77 | 72.26 | 43.16 | 64.73 | 58.45 |
Compared with the strongest baseline (OTLA-ReID), average Rank-1 improves by +7.87% and average mAP by +15.42%. The gain is most pronounced in the hardest cross-modal + cross-platform setting (+13.85% Rank-1).
Ablation Study¶
| Configuration | Cross-Platform R1 | Cross-Modal R1 | Cross-Modal+Platform R1 | Avg. R1 | Avg. mAP |
|---|---|---|---|---|---|
| Base (ReID Prompt) | 77.01 | 61.11 | 28.40 | 55.51 | 47.98 |
| +Modality-Aware | 77.18 | 67.34 | 31.57 | 58.70 | 51.67 |
| +Platform-Aware | 78.62 | 70.31 | 40.66 | 63.20 | 57.48 |
| +Visual-Enhanced (Full) | 78.77 | 72.26 | 43.16 | 64.73 | 58.45 |
Key Findings¶
- Cross-modal + cross-platform is the most challenging setting: existing methods degrade drastically (CAJ achieves only 10.62% Rank-1), while Uni-Prompt reaches 43.16%, demonstrating the necessity of dedicated design for jointly handling both gaps.
- Modality-aware prompts are most effective in the cross-modal setting (+6.23% Rank-1) with minimal impact on the cross-platform setting.
- Platform-aware prompts contribute most substantially in the cross-modal + cross-platform setting (+9.09% Rank-1), constituting the key component for the hardest scenario.
- The visual-enhanced network yields marginal but consistent gains across all settings, with a 2.50% improvement in the cross-modal + cross-platform setting, confirming that visual cues provide auxiliary guidance for prompt learning.
- Existing baseline methods perform acceptably only in single-gap settings where ground RGB data is available, and degrade significantly once UAV capture and multi-modal inputs are jointly introduced.
Highlights & Insights¶
- The first multi-modal multi-platform ReID benchmark fills a critical gap—1,930 identities, 14 cameras, 3 modalities, and 2 platforms—combining both scale and diversity.
- The unified prompt learning framework elegantly decomposes modality and platform information into separate learnable prompts, avoiding complex feature fusion networks.
- The two-stage training strategy (learning identity prompts first, then modality/platform prompts) resembles curriculum learning, ensuring the model first establishes an identity concept before learning cross-domain alignment.
- Privacy protection measures are comprehensive: facial mosaicking, deletion of raw footage, ethics committee approval, and public notification.
Limitations & Future Work¶
- Dataset scale is constrained by the high cost of multi-modal multi-platform collection (1,930 identities vs. 4,101 in MSMT17).
- Evaluation is conducted on a single dataset, leaving transferability to other benchmarks unverified.
- Wearable-device platforms and event camera modalities are not addressed; the authors encourage future extensions in these directions.
- The visual-enhanced network design is relatively simple (lightweight linear mapping); more sophisticated adapters may yield further improvements.
- The low resolution of the UAV thermal camera (640×512) degrades YOLOX tracking performance, necessitating substantial manual annotation effort.
Related Work & Insights¶
- CLIP-ReID and DAPrompt serve as the foundational prompt learning baselines; CoCoOp inspires the design of visually-conditioned prompts.
- Cross-modal datasets such as SYSU-MM01 and LLCM are limited to RGB + infrared with ground-only cameras; AG-ReID covers aerial imagery but is restricted to the RGB modality.
- The multi-platform design of MP-ReID has direct application value for smart city and public safety scenarios.
Rating¶
- Novelty: ⭐⭐⭐⭐ The multi-modal multi-platform dataset design constitutes the primary contribution; the method represents an incremental extension of existing prompt learning frameworks.
- Experimental Thoroughness: ⭐⭐⭐⭐ 12 experimental settings + detailed ablations + 10-run averaging; the evaluation protocol is rigorous.
- Writing Quality: ⭐⭐⭐⭐ The dataset description is thorough and the methodological exposition is clear.
- Value: ⭐⭐⭐⭐⭐ The dataset and benchmark make a significant contribution to the person re-identification community.