# Hierarchical Prompt Learning for Image- and Text-Based Person Re-Identification
- Conference: AAAI 2026
- arXiv: 2511.13575
- Code: https://github.com/LH-Z-Ac/HPL-AAAI26
- Area: Autonomous Driving
- Keywords: Person Re-Identification, Prompt Learning, Cross-Modal Alignment, CLIP, Unified Retrieval Framework
## TL;DR
This paper proposes HPL, a unified framework that decouples I2I and T2I tasks via a Task-Routed Transformer (dual classification tokens), and employs hierarchical prompt learning (identity-level + instance-level pseudo-text tokens) combined with cross-modal prompt regularization, achieving simultaneous state-of-the-art performance on both image-to-image and text-to-image person re-identification within a single model for the first time.
## Background & Motivation

### Two Primary Tasks in Person Re-Identification
Person re-identification (ReID) is divided into two categories based on query modality:

- I2I (Image-to-Image): Given a query image, retrieve images of the same person from a large gallery. The core challenge is extracting identity-discriminative features robust to viewpoint and background variations.
- T2I (Text-to-Image): Given a natural language description, retrieve matching pedestrian images. The core challenge is precise cross-modal semantic alignment.
### The Dilemma of Joint Training
Existing methods typically treat I2I and T2I as independent tasks. However, in practical applications, image and text queries may coexist, necessitating a unified framework capable of handling both tasks simultaneously.
Directly joint-training the two tasks within a single model, however, leads to performance degradation (as shown in Fig. 1(a) of the paper). The authors attribute this to semantic conflicts:

- I2I focuses on identity-level semantics (clothing, gender, and other cross-view consistent features).
- T2I additionally relies on instance-level attributes (e.g., "holding a table" — details described in text but ignored in I2I).
- These inconsistent supervision signals produce conflicting optimization directions, causing mutual interference between the two tasks.
### Inspiration from Prompt Learning
Works such as CLIP-ReID have demonstrated the success of prompt learning in I2I ReID — injecting identity-level semantics significantly improves performance. However, direct extension to T2I is non-trivial: T2I additionally requires fine-grained instance-level semantics (actions, gestures, carried objects, etc.) that vary across samples and viewpoints.
Core Idea: Design a hierarchical prompt structure — "A photo of [id-tokens] and [inst-tokens] person" — where identity-level tokens provide a stable identity anchor, while instance-level tokens are dynamically generated by modality-specific inversion networks to capture sample-specific attributes.
## Method

### Overall Architecture
The framework consists of three core modules:

1. Task-Routed Transformer (TRT): dual classification tokens for task-specific encoding.
2. Hierarchical Prompt Learning (HPL): hierarchical prompt construction and cross-modal alignment.
3. Cross-Modal Prompt Regularization (CMPR): alignment of the instance-level prompts across modalities.
Training proceeds in two stages: prompt construction (Stage I) → representation learning (Stage II).
### Key Designs

#### 1. Task-Routed Transformer (TRT)
An additional classification token is introduced into the CLIP visual encoder, forming a dual-token design:

- \(v_{t2i}^i\): the original classification token, optimized for T2I cross-modal alignment.
- \(v_{i2i}^i\): the newly added classification token, optimized for I2I identity discrimination.
Visual feature extraction: both classification tokens are prepended to the patch embeddings and processed jointly by the shared CLIP visual encoder, yielding two task-specific global features.

Multi-objective supervision strategy: \(v_{t2i}\) is optimized by cross-modal similarity distribution matching and cross-modal identity classification losses, while \(v_{i2i}\) is optimized by identity classification and triplet ranking losses.
Design Motivation: In ViT, classification tokens naturally aggregate contextual information via self-attention, and the aggregated semantics are guided by task objectives. The dual-token design achieves task decoupling with minimal overhead (adding only one token), enabling the shared backbone to serve both tasks simultaneously.
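A minimal sketch of the dual-token routing under these assumptions: a generic ViT-style encoder (not CLIP's exact block), the extra token prepended next to the original [CLS], and each token read out for its own task. Module names and dimensions are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class TaskRoutedEncoder(nn.Module):
    """Sketch of a ViT-style encoder with two classification tokens:
    one routed to the T2I objective, one to the I2I objective."""
    def __init__(self, dim=768, depth=12, heads=12, num_patches=196):
        super().__init__()
        self.cls_t2i = nn.Parameter(torch.zeros(1, 1, dim))  # original CLIP [CLS] token
        self.cls_i2i = nn.Parameter(torch.zeros(1, 1, dim))  # newly added I2I token
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, patch_tokens):                # (B, num_patches, dim) patch embeddings
        B = patch_tokens.size(0)
        cls = torch.cat([self.cls_t2i, self.cls_i2i], dim=1).expand(B, -1, -1)
        x = torch.cat([cls, patch_tokens], dim=1) + self.pos
        x = self.blocks(x)
        return x[:, 0], x[:, 1]                     # v_t2i, v_i2i task-specific features
```

Both tokens attend to the same patch sequence, so the backbone is shared; only the readouts and their losses differ.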
#### 2. Hierarchical Prompt Learning (HPL)

Hierarchical Prompt Construction: the template "A photo of [id-tokens] and [inst-tokens] person" is designed as follows:

- [id-tokens]: a fixed number of learnable tokens encoding identity-level semantics.
- [inst-tokens]: pseudo-text tokens dynamically generated by modality-specific inversion networks.
Inversion networks \(\mathcal{I}_v\) and \(\mathcal{I}_t\), each consisting of 4 Transformer blocks, generate the pseudo-text tokens \(P_i^v\) and \(P_i^t\) from the visual and textual features of sample \(i\), respectively. The generated pseudo-tokens are inserted into the template:

- \(T_i^v\): "A photo of [id-tokens] and \(P_i^v\) person."
- \(T_i^t\): "A photo of [id-tokens] and \(P_i^t\) person."
An inversion consistency loss ensures the pseudo-prompts retain source-modality semantics.
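A rough sketch of one modality-specific inversion network, assuming it conditions a set of learnable query tokens on the global feature; only the 4-block depth comes from the paper, while the shapes, query design, and module names are assumptions.

```python
import torch
import torch.nn as nn

class InversionNetwork(nn.Module):
    """Sketch of I_v / I_t: map a global visual or textual feature to
    a fixed number of pseudo-text tokens with 4 Transformer blocks."""
    def __init__(self, feat_dim=512, token_dim=512, num_tokens=4, depth=4, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_tokens, token_dim) * 0.02)
        self.proj = nn.Linear(feat_dim, token_dim)
        layer = nn.TransformerEncoderLayer(token_dim, heads, token_dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, feat):                        # feat: (B, feat_dim)
        cond = self.proj(feat).unsqueeze(1)         # (B, 1, token_dim) conditioning token
        q = self.queries.expand(feat.size(0), -1, -1)
        x = self.blocks(torch.cat([cond, q], dim=1))
        return x[:, 1:]                             # (B, num_tokens, token_dim) pseudo-tokens
```

The resulting tokens would replace [inst-tokens] in the prompt embedding sequence before it is passed to the CLIP text encoder.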
Hierarchical Prompt Alignment:

- The T2I task uses the complete hierarchical prompts (identity + instance) and is aligned via the ILPA loss.
- The I2I task uses simplified identity-level prompts and is aligned via a cross-modal identity classification loss.
Design Motivation: HPL offers three key advantages: (1) instance-level pseudo-tokens capture fine-grained attributes beyond categorical identity; (2) bidirectional cross-modal alignment bridges the modality gap; (3) identity and instance prompts are concatenated and jointly optimized, balancing identity consistency and instance specificity.
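These notes do not spell out the ILPA loss; purely as an illustration of what such an alignment term can look like, below is a generic symmetric InfoNCE between task features and encoded prompt features (the paper's actual formulation may differ).

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(img_feats, prompt_feats, temperature=0.07):
    """Illustrative alignment loss between task features and encoded prompts
    (a generic symmetric InfoNCE, not necessarily the paper's ILPA loss)."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(prompt_feats, dim=-1)
    logits = img @ txt.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```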
#### 3. Cross-Modal Prompt Regularization (CMPR)

Instance-level prompts \(P_i^v\) and \(P_i^t\) may encode modality-specific biases. CMPR directly aligns the two in the prompt token space.
Design Motivation: This ensures that pseudo-prompts generated independently from images and text are semantically consistent, reducing cross-modal semantic drift and improving text-guided visual retrieval.
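The exact regularization form is not given in these notes; here is a minimal sketch, assuming CMPR penalizes the distance between the image-derived and text-derived pseudo-token sequences.

```python
import torch.nn.functional as F

def cmpr_loss(p_v, p_t):
    """Cross-modal prompt regularization sketch: pull the vision-derived and
    text-derived instance-level pseudo-tokens together in the prompt token space.
    p_v, p_t: (B, num_tokens, token_dim). The paper's actual distance may differ
    (e.g., cosine similarity instead of MSE)."""
    return F.mse_loss(F.normalize(p_v, dim=-1), F.normalize(p_t, dim=-1))
```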
### Loss & Training
Stage I (Prompt Construction, 10 epochs): encoders are frozen; only the inversion networks and learnable prompts are updated.
Stage II (Representation Learning, 60 epochs): the full model is trained with the combined objective, where the loss weights are set to \(\lambda_1 = 0.4\) and \(\lambda_2 = 0.06\).
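A schematic of the two-stage parameter routing described above; only the freeze/unfreeze split, epoch counts, and loss weights come from the notes, while the optimizer choice and all function names are assumptions.

```python
import torch

def configure_stage(clip_model, inversion_nets, prompt_tokens, stage):
    """Stage I (10 epochs): CLIP encoders frozen; only inversion networks and
    learnable prompt tokens are trained. Stage II (60 epochs): the full model is
    trained with the combined objective (loss weights 0.4 and 0.06 per the notes)."""
    for p in clip_model.parameters():
        p.requires_grad = (stage == 2)
    for p in inversion_nets.parameters():
        p.requires_grad = True
    prompt_tokens.requires_grad_(True)
    trainable = [p for p in list(clip_model.parameters())
                 + list(inversion_nets.parameters()) + [prompt_tokens]
                 if p.requires_grad]
    return torch.optim.Adam(trainable)
```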
## Key Experimental Results

### Main Results
T2I ReID Performance:
| Dataset | Metric | HPL (Ours) | Propot (MM'24) | UMSA (AAAI'24) |
|---|---|---|---|---|
| CUHK-PEDES | Rank-1 | 76.28 | 74.89 | 74.25 |
| CUHK-PEDES | mAP | 70.90 | 67.12 | 66.15 |
| ICFG-PEDES | Rank-1 | 66.61 | 65.12 | 65.62 |
| RSTPReID | Rank-1 | 64.00 | 61.87 | 63.40 |
| RSTPReID | mAP | 53.13 | 47.82 | 49.28 |
I2I ReID Performance:
| Dataset | Metric | HPL (Ours) | CLIP-ReID (AAAI'23) | TransReID (ICCV'21) |
|---|---|---|---|---|
| Market1501 | Rank-1 | 95.99 | 95.50 | 95.20 |
| Market1501 | mAP | 89.82 | 89.60 | 88.90 |
| MSMT17 | Rank-1 | 91.04 | 88.70 | 85.30 |
| MSMT17 | mAP | 79.01 | 73.40 | 67.40 |
| DukeMTMC | Rank-1 | 90.35 | 90.00 | 90.70 |
### Ablation Study
Contribution of each module (CUHK-PEDES + Market1501):
| TRT | HPL | CMPR | T2I Rank-1 | T2I mAP | I2I Rank-1 | I2I mAP |
|---|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 74.22 | 70.45 | 94.50 | 86.91 |
| ✓ | ✗ | ✗ | 75.27 (+1.05) | 70.80 | 95.36 (+0.86) | 88.98 (+2.07) |
| ✓ | ✓ | ✗ | 75.60 (+0.33) | 70.88 | 95.57 (+0.21) | 89.72 (+0.74) |
| ✓ | ✓ | ✓ | 76.28 (+0.68) | 70.89 | 95.99 (+0.42) | 89.82 (+0.10) |
Ablation on instance-level alignment:
| \(\mathcal{L}_{tgps}\) | \(\mathcal{L}_{vgps}\) | T2I Rank-1 | I2I mAP |
|---|---|---|---|
| ✗ | ✗ | 75.27 | 88.98 |
| ✓ | ✗ | 75.58 | 89.35 |
| ✗ | ✓ | 75.60 | 89.59 |
| ✓ | ✓ | 75.60 | 89.72 |
### Key Findings
- The dual-token design in TRT contributes the most (I2I mAP +2.07%), validating the necessity of task decoupling.
- HPL alone yields limited gains; it achieves maximum effect when combined with CMPR (which provides +0.68% T2I Rank-1).
- Vision-guided prompts (\(\mathcal{L}_{vgps}\)) contribute more to I2I, while text-guided prompts (\(\mathcal{L}_{tgps}\)) contribute more to T2I.
- Grad-CAM visualizations show that I2I tokens attend to cross-view consistent features such as clothing and body shape, while T2I tokens attend to specific items described in text (e.g., phones, satchels), confirming that task-aware attention decoupling indeed occurs.
## Highlights & Insights
- Practical significance of the unified framework: For the first time, a single model achieves simultaneous state-of-the-art performance on both I2I and T2I, eliminating the overhead of deploying two separate models.
- Elegance of the dual-token design: Task routing is achieved by adding a single classification token — a minimal modification with significant effect.
- Sound hierarchical prompt design: Identity-level tokens provide a stable anchor while instance-level tokens enable fine-grained adaptation, together covering the core requirements of ReID.
- Cross-modal regularization as the key binding element: CMPR elevates HPL from "two parallel tasks" to a "collaboratively unified framework."
## Limitations & Future Work
- Joint training requires both I2I and T2I datasets simultaneously, incurring non-trivial data preparation costs.
- The inversion networks employ 4-layer Transformers, which may need to be streamlined for lightweight deployment scenarios.
- Augmenting training data with generative methods (e.g., using LLMs to synthesize additional textual descriptions) remains unexplored.
- Performance on larger backbones (e.g., ViT-L/14) has not been verified.
- Cross-dataset generalization (e.g., training on CUHK and testing on RSTPReID) is not discussed.
## Related Work & Insights
- Relationship to CLIP-ReID: HPL directly extends its identity-level prompt scheme by adding dynamic instance-level prompts — a natural and effective improvement.
- Relationship to GET: The prompt inversion idea is borrowed to translate visual features into pseudo-text prompts.
- Inspiration: The dual classification token design is generalizable to other multi-task learning scenarios, such as simultaneous detection and segmentation.
## Rating
- Novelty: ⭐⭐⭐⭐ — Individual components have precedents, but the combination and unified framework design are novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Six datasets, detailed ablations, and comprehensive visualization analyses.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, rich figures and tables, well-articulated motivation.
- Value: ⭐⭐⭐⭐ — The unified framework reduces deployment costs, though paired datasets are required.