Multi-modal Multi-platform Person Re-Identification: Benchmark and Method

Conference: ICCV 2025 · arXiv: 2503.17096 · Code: GitHub · Area: Video Understanding · Keywords: Person Re-Identification, Multi-modal, Multi-platform, Prompt Learning, CLIP

TL;DR

This paper presents MP-ReID, the first multi-modal multi-platform person re-identification benchmark, covering three modalities (RGB, infrared, thermal) and two platforms (ground and UAV). It also introduces Uni-Prompt ReID, a unified prompt learning framework that leverages modality-aware, platform-aware, and visual-enhanced prompts to substantially improve ReID performance under complex real-world conditions.

Background & Motivation

Conventional ReID research has been largely confined to single-modality (RGB) + fixed-camera settings, which are inadequate for the increasingly heterogeneous sensor deployments found in real urban environments. Consider a 24/7 urban pedestrian surveillance system comprising:

  • Ground RGB cameras: daytime scenarios
  • Infrared/thermal sensors: nighttime or adverse lighting
  • Unmanned Aerial Vehicles (UAVs): dynamic tracking with flexible viewpoints

Such a multi-modal + multi-platform configuration introduces three compounding challenges: (1) the modality gap (appearance discrepancies among RGB, infrared, and thermal imaging), (2) the platform gap (viewpoint and resolution differences between ground-level and aerial capture), and (3) the compounded difficulty when both gaps occur simultaneously.

Limitations of existing datasets:

  • Cross-modal datasets (SYSU-MM01, LLCM) cover only RGB + infrared and use only ground cameras.
  • UAV datasets (AG-ReID) cover only the RGB modality.
  • No existing dataset simultaneously addresses multiple modalities and multiple platforms.

This critical gap motivates the construction of MP-ReID and the design of a corresponding unified learning framework.

Method

Overall Architecture

Uni-Prompt ReID is built upon the CLIP vision-language model and adapts it through carefully designed multi-part textual prompts. The framework consists of three categories of learnable prompts plus a visual-enhanced network that injects image features into the text prompt space.

Key Designs

  1. MP-ReID Dataset Construction

The dataset spans 3 modalities × 2 platforms:

  • Ground RGB: 6 Hikvision 1920×1080 full-color cameras
  • Ground infrared: 6 cameras in infrared night-vision mode
  • UAV RGB: DJI Mavic 3T, 3840×2160
  • UAV thermal: DJI Mavic 3T thermal camera, 640×512

Dataset scale: 1,930 identities, 136,156 annotated bounding boxes, 14 cameras, and over 13 hours of total video footage. UAV data was collected at three altitudes (5 m / 7 m / 10 m) with pitch angles ranging from 30° to 80°. All data underwent facial mosaicking and original footage deletion to protect privacy.

  2. Uni-Prompt Multi-Part Textual Prompts

The textual prompt is formed by concatenating three parts (a code sketch of this assembly follows the list below):

\[t_i(a) = X_1(a) \cdots X_M(a)\; P_1(a) \cdots P_R(a)\; M_1(a) \cdots M_B(a)\; \text{person}_i\]

  • Specific ReID Prompt (\(X\)): encodes individual-specific information (identity level)
  • Platform-Aware Prompt (\(P\)): incorporates platform-specific context (ground vs. aerial)
  • Modality-Aware Prompt (\(M\)): captures modality-specific details (RGB vs. infrared vs. thermal)
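
As a concrete illustration, here is a minimal PyTorch sketch of how such a multi-part prompt could be assembled. The class name `MultiPartPrompt`, the token counts, and the initialization scale are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class MultiPartPrompt(nn.Module):
    """Illustrative sketch: concatenate specific (X), platform-aware (P),
    and modality-aware (M) learnable tokens with the embedding of
    "person_i", mirroring t_i(a) = X_1..X_M P_1..P_R M_1..M_B person_i."""

    def __init__(self, dim=512, n_x=4, n_p=4, n_m=4):  # token counts assumed
        super().__init__()
        # One learnable token bank per prompt part; in the paper the X
        # tokens are identity-specific, so a full implementation would
        # hold one X bank per identity.
        self.x_tokens = nn.Parameter(torch.randn(n_x, dim) * 0.02)  # X_1..X_M
        self.p_tokens = nn.Parameter(torch.randn(n_p, dim) * 0.02)  # P_1..P_R
        self.m_tokens = nn.Parameter(torch.randn(n_m, dim) * 0.02)  # M_1..M_B

    def forward(self, person_embedding):
        # person_embedding: (dim,) token embedding for "person_i"
        return torch.cat(
            [self.x_tokens, self.p_tokens, self.m_tokens,
             person_embedding.unsqueeze(0)],
            dim=0,
        )  # (n_x + n_p + n_m + 1, dim), fed to the frozen CLIP text encoder
```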

  3. Visual-Enhanced Network

A lightweight neural network \(g_\theta(\cdot)\) maps the image feature \(a\) to context vectors:

\(\sigma = (\sigma_X, \sigma_P, \sigma_M) = g_\theta(a)\)

These context vectors are added element-wise to the corresponding prompt tokens: \(S_m(a) = [S]_m + \sigma_S\), where \(S \in \{X, P, M\}\) and \(m\) indexes the tokens within each part.

Intuition: visual features of infrared images inherently contain modality cues, which can guide the modality-aware prompts to specialize toward the infrared domain.
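
A minimal sketch of this conditioning, assuming \(g_\theta\) is a single linear projection split into three chunks (the paper only calls it a lightweight network; the exact architecture here is an assumption):

```python
import torch
import torch.nn as nn

class VisualEnhancedNet(nn.Module):
    """Sketch of g_theta: project an image feature a into three context
    vectors (sigma_X, sigma_P, sigma_M), one per prompt part."""

    def __init__(self, feat_dim=512, prompt_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, 3 * prompt_dim)

    def forward(self, a):
        sigma_x, sigma_p, sigma_m = self.proj(a).chunk(3, dim=-1)
        return sigma_x, sigma_p, sigma_m

g = VisualEnhancedNet()
a = torch.randn(512)                    # image feature from the CLIP encoder
sigma_x, sigma_p, sigma_m = g(a)
x_tokens = torch.randn(4, 512)          # stand-in for the learnable X tokens
x_tokens_a = x_tokens + sigma_x         # X_m(a) = [X]_m + sigma_X, broadcast
```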

Loss & Training

Two-stage training:

  • Stage 1: Freeze modality and platform prompts; learn the Specific ReID Prompt using CLIP-ReID's \(\mathcal{L}_{i2t} + \mathcal{L}_{t2i}\).
  • Stage 2: Freeze the ReID Prompt; learn the remaining prompts using modality-level and platform-level contrastive losses:
\[\mathcal{L}_{\text{Uni-Prompt}} = \mathcal{L}_{mi2t} + \mathcal{L}_{mt2i} + \mathcal{L}_{pi2t} + \mathcal{L}_{pt2i}\]

Each term is a contrastive loss (InfoNCE form) aligning features to modality and platform labels respectively. Data augmentation includes random erasing (\(p=0.5\)), random horizontal flipping, and random cropping.
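
The modality-level and platform-level terms can be sketched as one shared InfoNCE routine applied to two label sets. The formulation below (multi-positive text-to-image averaging, temperature 0.07) is an assumed reading of the losses, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(img_feats, txt_feats, labels, tau=0.07):
    """One aligned pair of terms (L_*i2t + L_*t2i): img_feats (B, D)
    against one prompt feature per class in txt_feats (C, D); labels (B,)
    holds each image's modality or platform index."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / tau                         # (B, C)
    l_i2t = F.cross_entropy(logits, labels)              # image -> text
    # text -> image: average log-probability over each class's positives
    log_p = F.log_softmax(logits.t(), dim=-1)            # (C, B)
    mask = F.one_hot(labels, txt.size(0)).t().float()    # (C, B)
    present = mask.sum(1) > 0                            # classes in batch
    l_t2i = -((log_p * mask).sum(1)[present] / mask.sum(1)[present]).mean()
    return l_i2t + l_t2i

def uni_prompt_loss(img_feats, modality_txt, platform_txt, m_labels, p_labels):
    # L_Uni-Prompt = L_mi2t + L_mt2i + L_pi2t + L_pt2i
    return (symmetric_infonce(img_feats, modality_txt, m_labels)
            + symmetric_infonce(img_feats, platform_txt, p_labels))
```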

Key Experimental Results

Main Results

Average results across three MP-ReID benchmark settings

| Method | Cross-Platform Rank-1 | Cross-Modal Rank-1 | Cross-Modal+Platform Rank-1 | Avg. Rank-1 | Avg. mAP |
|---|---|---|---|---|---|
| CAJ | 40.36 | 45.34 | 10.62 | 32.11 | 21.51 |
| CAJ+ | 47.60 | 58.16 | 21.51 | 42.42 | 30.61 |
| AGW | 53.68 | 51.88 | 19.21 | 41.59 | 30.56 |
| DEEN | 60.05 | 69.59 | 27.59 | 52.41 | 39.33 |
| OTLA-ReID | 73.24 | 68.12 | 29.31 | 56.89 | 43.03 |
| Uni-Prompt | 78.77 | 72.26 | 43.16 | 64.73 | 58.45 |

Relative to the strongest baseline (OTLA-ReID), Uni-Prompt improves average Rank-1 by +7.87% and average mAP by +15.42%. The gain is most pronounced in the hardest cross-modal + cross-platform setting (+13.85% Rank-1).

Ablation Study

| Configuration | Cross-Platform R1 | Cross-Modal R1 | Cross-Modal+Platform R1 | Avg. R1 | Avg. mAP |
|---|---|---|---|---|---|
| Base (ReID Prompt) | 77.01 | 61.11 | 28.40 | 55.51 | 47.98 |
| +Modality-Aware | 77.18 | 67.34 | 31.57 | 58.70 | 51.67 |
| +Platform-Aware | 78.62 | 70.31 | 40.66 | 63.20 | 57.48 |
| +Visual-Enhanced (Full) | 78.77 | 72.26 | 43.16 | 64.73 | 58.45 |

Key Findings

  • Cross-modal + cross-platform is the most challenging setting: existing methods degrade drastically (CAJ achieves only 10.62% Rank-1), while Uni-Prompt reaches 43.16%, demonstrating the necessity of dedicated design for jointly handling both gaps.
  • Modality-aware prompts are most effective in the cross-modal setting (+6.23% Rank-1) with minimal impact on the cross-platform setting.
  • Platform-aware prompts contribute most substantially in the cross-modal + cross-platform setting (+9.09% Rank-1), constituting the key component for the hardest scenario.
  • The visual-enhanced network yields marginal but consistent gains across all settings, with a 2.50% improvement in the cross-modal + cross-platform setting, confirming that visual cues provide auxiliary guidance for prompt learning.
  • Existing baseline methods perform acceptably only in single-gap settings where ground RGB data is available, and degrade significantly once UAV capture and multi-modal inputs are jointly introduced.

Highlights & Insights

  • The first multi-modal multi-platform ReID benchmark fills a critical gap—1,930 identities, 14 cameras, 3 modalities, and 2 platforms—combining both scale and diversity.
  • The unified prompt learning framework elegantly decomposes modality and platform information into separate learnable prompts, avoiding complex feature fusion networks.
  • The two-stage training strategy (learning identity prompts first, then modality/platform prompts) resembles curriculum learning, ensuring the model first establishes an identity concept before learning cross-domain alignment.
  • Privacy protection measures are comprehensive: facial mosaicking, deletion of raw footage, ethics committee approval, and public notification.

Limitations & Future Work

  • Dataset scale is constrained by the high cost of multi-modal multi-platform collection (1,930 identities vs. 4,101 in MSMT17).
  • Evaluation is conducted on a single dataset, leaving transferability to other benchmarks unverified.
  • Wearable-device platforms and event camera modalities are not addressed; the authors encourage future extensions in these directions.
  • The visual-enhanced network design is relatively simple (lightweight linear mapping); more sophisticated adapters may yield further improvements.
  • The low resolution of the UAV thermal camera (640×512) degrades YOLOX tracking performance, necessitating substantial manual annotation effort.
Related Work & Context

  • CLIP-ReID and DAPrompt serve as the foundational prompt learning baselines; CoCoOp inspires the design of visually-conditioned prompts.
  • Cross-modal datasets such as SYSU-MM01 and LLCM are limited to RGB + infrared with ground-only cameras; AG-ReID covers aerial imagery but is restricted to the RGB modality.
  • The multi-platform design of MP-ReID has direct application value for smart city and public safety scenarios.

Rating

  • Novelty: ⭐⭐⭐⭐ The multi-modal multi-platform dataset design constitutes the primary contribution; the method represents an incremental extension of existing prompt learning frameworks.
  • Experimental Thoroughness: ⭐⭐⭐⭐ 12 experimental settings + detailed ablations + 10-run averaging; the evaluation protocol is rigorous.
  • Writing Quality: ⭐⭐⭐⭐ The dataset description is thorough and the methodological exposition is clear.
  • Value: ⭐⭐⭐⭐⭐ The dataset and benchmark make a significant contribution to the person re-identification community.