Hierarchical Prompt Learning for Image- and Text-Based Person Re-Identification

  • Conference: AAAI 2026
  • arXiv: 2511.13575
  • Code: https://github.com/LH-Z-Ac/HPL-AAAI26
  • Area: Autonomous Driving
  • Keywords: Person Re-Identification, Prompt Learning, Cross-Modal Alignment, CLIP, Unified Retrieval Framework

TL;DR

This paper proposes HPL, a unified framework that handles both image-to-image (I2I) and text-to-image (T2I) person re-identification in a single model. It decouples the two tasks with a Task-Routed Transformer (dual classification tokens) and employs hierarchical prompt learning (identity-level plus instance-level pseudo-text tokens) together with cross-modal prompt regularization, achieving simultaneous state-of-the-art performance on both tasks for the first time.

Background & Motivation

Two Primary Tasks in Person Re-Identification

Person re-identification (ReID) is divided into two categories based on query modality:

  • I2I (Image-to-Image): Given a query image, retrieve images of the same person from a large gallery. The core challenge is extracting identity-discriminative features robust to viewpoint and background variations.
  • T2I (Text-to-Image): Given a natural language description, retrieve matching pedestrian images. The core challenge is precise cross-modal semantic alignment.

The Dilemma of Joint Training

Existing methods typically treat I2I and T2I as independent tasks. However, in practical applications, image and text queries may coexist, necessitating a unified framework capable of handling both tasks simultaneously.

Directly joint-training the two tasks within a single model, however, leads to performance degradation (as shown in Fig. 1(a) of the paper). The authors attribute this to semantic conflicts:

  • I2I focuses on identity-level semantics (clothing, gender, and other cross-view consistent features).
  • T2I additionally relies on instance-level attributes (e.g., "holding a table"), details that are described in text but ignored by I2I.
  • These inconsistent supervision signals produce conflicting optimization directions, causing mutual interference between the two tasks.

Inspiration from Prompt Learning

Works such as CLIP-ReID have demonstrated the success of prompt learning in I2I ReID — injecting identity-level semantics significantly improves performance. However, direct extension to T2I is non-trivial: T2I additionally requires fine-grained instance-level semantics (actions, gestures, carried objects, etc.) that vary across samples and viewpoints.

Core Idea: Design a hierarchical prompt structure — "A photo of [id-tokens] and [inst-tokens] person" — where identity-level tokens provide a stable identity anchor, while instance-level tokens are dynamically generated by modality-specific inversion networks to capture sample-specific attributes.

Method

Overall Architecture

The framework consists of three core modules:

  1. Task-Routed Transformer (TRT): dual classification tokens for task-specific encoding.
  2. Hierarchical Prompt Learning (HPL): hierarchical prompt generation and alignment.
  3. Cross-Modal Prompt Regularization (CMPR): alignment of image- and text-derived instance prompts in the prompt token space.

Training proceeds in two stages: prompt construction (Stage I) → representation learning (Stage II).

Key Designs

1. Task-Routed Transformer (TRT)

An additional classification token is introduced into the CLIP visual encoder, forming a dual-token design:

  • \(v_{t2i}^i\): the original classification token, optimized for T2I cross-modal alignment.
  • \(v_{i2i}^i\): the newly added classification token, optimized for I2I identity discrimination.

Visual feature extraction:

\[F_i^v = [v_{t2i}^i, v_{i2i}^i, v_1^i, \ldots, v_N^i] = \mathcal{M}^v(V_i)\]

Multi-objective supervision strategy:

\[\mathcal{L}_{base} = \mathcal{L}_{sdm} + \mathcal{L}_{id}^{t2i} + \mathcal{L}_{tri} + \mathcal{L}_{id}^{i2i}\]

where \(v_{t2i}\) is optimized by cross-modal similarity distribution matching and cross-modal identity classification losses, while \(v_{i2i}\) is optimized by identity classification and triplet ranking losses.

Design Motivation: In ViT, classification tokens naturally aggregate contextual information via self-attention, and the aggregated semantics are guided by task objectives. The dual-token design achieves task decoupling with minimal overhead (adding only one token), enabling the shared backbone to serve both tasks simultaneously.
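
To make the routing concrete, below is a minimal PyTorch sketch of the dual-token idea. The class name, the generic `backbone` interface, and the initialization details are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class TaskRoutedViT(nn.Module):
    """Sketch of TRT: a second classification token is prepended to a
    CLIP-style ViT so each task reads out its own token after shared
    self-attention. Interface and init details are assumptions."""

    def __init__(self, backbone: nn.Module, embed_dim: int = 768):
        super().__init__()
        self.backbone = backbone  # shared ViT blocks, (B, S, D) -> (B, S, D)
        self.cls_t2i = nn.Parameter(torch.zeros(1, 1, embed_dim))  # original CLS
        self.cls_i2i = nn.Parameter(torch.zeros(1, 1, embed_dim))  # added CLS
        nn.init.trunc_normal_(self.cls_t2i, std=0.02)
        nn.init.trunc_normal_(self.cls_i2i, std=0.02)

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (B, N, D) patch embeddings with positional encoding
        B = patch_tokens.size(0)
        tokens = torch.cat([
            self.cls_t2i.expand(B, -1, -1),  # -> v_t2i after encoding
            self.cls_i2i.expand(B, -1, -1),  # -> v_i2i after encoding
            patch_tokens,
        ], dim=1)                            # (B, N + 2, D)
        out = self.backbone(tokens)
        return out[:, 0], out[:, 1]          # (v_t2i, v_i2i)
```

During training, the first readout would feed the SDM and cross-modal identity losses and the second the identity and triplet losses, matching the supervision split described above.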

2. Hierarchical Prompt Learning (HPL)

Hierarchical Prompt Construction: The template "A photo of [id-tokens] and [inst-tokens] person" is composed as follows:

  • [id-tokens]: a fixed number of learnable tokens encoding identity-level semantics.
  • [inst-tokens]: pseudo-text tokens dynamically generated by modality-specific inversion networks.

Inversion networks generate pseudo-text tokens from visual/textual features:

\[P_i^t = \mathcal{I}_t(F_i^t), \quad P_i^v = \mathcal{I}_v(F_i^v)\]

where \(\mathcal{I}_v\) and \(\mathcal{I}_t\) each consist of 4 Transformer blocks. The generated pseudo-tokens are inserted into the template:

  • \(T_i^v\): "A photo of [id-tokens] and \(P_i^v\) person."
  • \(T_i^t\): "A photo of [id-tokens] and \(P_i^t\) person."
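
A sketch of one modality-specific inversion network follows, keeping the stated 4-block depth. The learnable-query pooling scheme, token count, and head/width choices are assumptions for illustration.

```python
import torch
import torch.nn as nn

class InversionNetwork(nn.Module):
    """Maps encoder token features to a fixed number of pseudo-text
    tokens via 4 Transformer blocks (per the paper); the query-based
    pooling here is an illustrative design choice."""

    def __init__(self, dim: int = 512, n_inst: int = 4, n_layers: int = 4):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)
        self.queries = nn.Parameter(torch.randn(1, n_inst, dim) * 0.02)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, L, D) token features F_i^v or F_i^t from an encoder
        B = feats.size(0)
        x = torch.cat([self.queries.expand(B, -1, -1), feats], dim=1)
        x = self.blocks(x)                    # queries attend to features
        return x[:, : self.queries.size(1)]   # P_i^v or P_i^t: (B, n_inst, D)
```

The returned pseudo-tokens are then spliced into the embedded template between "[id-tokens] and" and "person" before being passed through the text encoder.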

An inversion consistency loss ensures the pseudo-prompts retain source-modality semantics:

\[\mathcal{L}_{ic} = \frac{1}{|B|}\sum_{i \in B}\|\tilde{v}_{eos}^i - v_{t2i}^i\|_2^2 + \frac{1}{|B_{t2i}|}\sum_{i \in B_{t2i}}\|\tilde{t}_{eos}^i - t_{eos}^i\|_2^2\]
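
As a minimal sketch, the consistency term could be computed as below; the argument names are hypothetical, and `mse_loss` additionally averages over the feature dimension, so it matches the equation only up to a constant scale.

```python
import torch.nn.functional as F

def inversion_consistency_loss(v_eos_pred, v_t2i, t_eos_pred, t_eos):
    # The EOS feature of each assembled pseudo-prompt should reconstruct
    # the feature it was inverted from (image term over B, text term
    # over B_t2i); batch averaging is folded into mse_loss.
    return F.mse_loss(v_eos_pred, v_t2i) + F.mse_loss(t_eos_pred, t_eos)
```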

Hierarchical Prompt Alignment:

  • The T2I task uses complete hierarchical prompts (identity + instance) and is aligned via the ILPA loss, the sum of a text-guided term \(\mathcal{L}_{tgps}\) and a vision-guided term \(\mathcal{L}_{vgps}\):

\[\mathcal{L}_{ILPA} = \mathcal{L}_{tgps} + \mathcal{L}_{vgps}\]

  • The I2I task uses simplified identity-level prompts and is aligned via a cross-modal identity classification loss:

\[\mathcal{L}_{cic} = -\sum_{i \in B}\log \frac{\exp[\text{sim}(v_{i2i}^i, \tilde{r}_{eos}^{y_i})]}{\sum_{j=1}^{N_{id}}\exp[\text{sim}(v_{i2i}^i, \tilde{r}_{eos}^j)]}\]
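
A hedged sketch of \(\mathcal{L}_{cic}\): cosine similarity stands in for sim(), the temperature is an added assumption (the formula writes sim() without one), and cross-entropy averages where the formula sums over the batch.

```python
import torch
import torch.nn.functional as F

def cic_loss(v_i2i: torch.Tensor, id_prompt_feats: torch.Tensor,
             labels: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # v_i2i: (B, D) I2I visual tokens; id_prompt_feats: (N_id, D) EOS
    # features of the identity-level prompts; labels: (B,) identity ids.
    v = F.normalize(v_i2i, dim=-1)
    r = F.normalize(id_prompt_feats, dim=-1)
    logits = v @ r.t() / tau          # (B, N_id) similarity logits
    return F.cross_entropy(logits, labels)
```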

Design Motivation: HPL offers three key advantages: (1) instance-level pseudo-tokens capture fine-grained attributes beyond categorical identity; (2) bidirectional cross-modal alignment bridges the modality gap; (3) identity and instance prompts are concatenated and jointly optimized, balancing identity consistency and instance specificity.

3. Cross-Modal Prompt Regularization (CMPR)

Instance-level prompts \(P_i^v\) and \(P_i^t\) may encode modality-specific biases. CMPR directly aligns the two in the prompt token space:

\[\mathcal{L}_{CMPR} = \frac{1}{|B_{t2i}|}\sum_{i \in B_{t2i}}\|P_i^t - P_i^v\|_F^2\]

Design Motivation: This ensures that pseudo-prompts generated independently from images and text are semantically consistent, reducing cross-modal semantic drift and improving text-guided visual retrieval.
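
The regularizer is a one-liner in practice; here is a sketch under the shape convention used above:

```python
import torch

def cmpr_loss(P_t: torch.Tensor, P_v: torch.Tensor) -> torch.Tensor:
    # P_t, P_v: (B, n_inst, D) text- and image-derived pseudo-token
    # matrices; squared Frobenius distance per sample, mean over B_t2i.
    return (P_t - P_v).pow(2).sum(dim=(1, 2)).mean()
```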

Loss & Training

Stage I (Prompt Construction, 10 epochs):

\[\mathcal{L}_{construct} = \mathcal{L}_{t2i} + \mathcal{L}_{i2t} + \mathcal{L}_{ic}\]

Encoders are frozen; only inversion networks and learnable prompts are updated.

Stage II (Representation Learning, 60 epochs):

\[\mathcal{L}_{total} = \mathcal{L}_{base} + \mathcal{L}_{cic} + \lambda_1 \mathcal{L}_{ILPA} + \lambda_2 \mathcal{L}_{CMPR}\]

where \(\lambda_1 = 0.4\) and \(\lambda_2 = 0.06\).
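
Putting the schedule together, below is a skeleton of the two-stage loop. The loss callables, the `encoder_parameters()` helper, and the optimizer settings are placeholders, not the paper's API.

```python
import torch

def two_stage_train(model, loader, losses, lam1=0.4, lam2=0.06):
    # Stage I (10 epochs): encoders frozen; only inversion networks and
    # learnable prompt tokens receive gradients.
    for p in model.encoder_parameters():      # hypothetical helper
        p.requires_grad_(False)
    opt = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
    for _ in range(10):
        for batch in loader:
            loss = (losses["t2i"](batch) + losses["i2t"](batch)
                    + losses["ic"](batch))
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage II (60 epochs): the full model trains under the total objective.
    for p in model.parameters():
        p.requires_grad_(True)
    opt = torch.optim.Adam(model.parameters())
    for _ in range(60):
        for batch in loader:
            loss = (losses["base"](batch) + losses["cic"](batch)
                    + lam1 * losses["ilpa"](batch)
                    + lam2 * losses["cmpr"](batch))
            opt.zero_grad()
            loss.backward()
            opt.step()
```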

Key Experimental Results

Main Results

T2I ReID Performance:

Dataset      Metric  HPL (Ours)  Propot (MM'24)  UMSA (AAAI'24)
CUHK-PEDES   Rank-1  76.28       74.89           74.25
CUHK-PEDES   mAP     70.90       67.12           66.15
ICFG-PEDES   Rank-1  66.61       65.12           65.62
RSTPReID     Rank-1  64.00       61.87           63.40
RSTPReID     mAP     53.13       47.82           49.28

I2I ReID Performance:

Dataset     Metric  HPL (Ours)  CLIP-ReID (AAAI'23)  TransReID (ICCV'21)
Market1501  Rank-1  95.99       95.50                95.20
Market1501  mAP     89.82       89.60                88.90
MSMT17      Rank-1  91.04       88.70                85.30
MSMT17      mAP     79.01       73.40                67.40
DukeMTMC    Rank-1  90.35       90.00                90.70

Ablation Study

Contribution of each module (CUHK-PEDES + Market1501):

TRT  HPL  CMPR  T2I Rank-1     T2I mAP  I2I Rank-1     I2I mAP
 –    –    –    74.22          70.45    94.50          86.91
 ✓    –    –    75.27 (+1.05)  70.80    95.36 (+0.86)  88.98 (+2.07)
 ✓    ✓    –    75.60 (+0.33)  70.88    95.57 (+0.21)  89.72 (+0.74)
 ✓    ✓    ✓    76.28 (+0.68)  70.89    95.99 (+0.42)  89.82 (+0.10)

Ablation on instance-level alignment:

\(\mathcal{L}_{tgps}\)  \(\mathcal{L}_{vgps}\)  T2I Rank-1  I2I mAP
 –                       –                      75.27       88.98
 ✓                       –                      75.58       89.35
 –                       ✓                      75.60       89.59
 ✓                       ✓                      75.60       89.72

Key Findings

  1. The dual-token design in TRT contributes the most (I2I mAP +2.07%), validating the necessity of task decoupling.
  2. HPL alone yields limited gains; it achieves maximum effect when combined with CMPR (which provides +0.68% T2I Rank-1).
  3. Vision-guided prompts (\(\mathcal{L}_{vgps}\)) contribute more to I2I, while text-guided prompts (\(\mathcal{L}_{tgps}\)) contribute more to T2I.
  4. Grad-CAM visualizations show that I2I tokens attend to cross-view consistent features such as clothing and body shape, while T2I tokens attend to specific items described in text (e.g., phones, satchels), confirming that task-aware attention decoupling indeed occurs.

Highlights & Insights

  1. Practical significance of the unified framework: For the first time, a single model achieves simultaneous state-of-the-art performance on both I2I and T2I, eliminating the overhead of deploying two separate models.
  2. Elegance of the dual-token design: Task routing is achieved by adding a single classification token — a minimal modification with significant effect.
  3. Sound hierarchical prompt design: Identity-level tokens provide a stable anchor while instance-level tokens enable fine-grained adaptation, together covering the core requirements of ReID.
  4. Cross-modal regularization as the key binding element: CMPR elevates HPL from "two parallel tasks" to a "collaboratively unified framework."

Limitations & Future Work

  1. Joint training requires both I2I and T2I datasets simultaneously, incurring non-trivial data preparation costs.
  2. The inversion networks employ 4-layer Transformers, which may need to be streamlined for lightweight deployment scenarios.
  3. Generative augmentation of training data (e.g., using LLMs to synthesize additional descriptions) remains unexplored.
  4. Performance on larger backbones (e.g., ViT-L/14) has not been verified.
  5. Cross-dataset generalization (e.g., training on CUHK and testing on RSTPReID) is not discussed.

Connections to Related Work

  • Relationship to CLIP-ReID: HPL directly extends its identity-level prompt scheme by adding dynamic instance-level prompts, a natural and effective improvement.
  • Relationship to GET: The prompt inversion idea is borrowed to translate visual features into pseudo-text prompts.
  • Inspiration: The dual classification token design is generalizable to other multi-task learning scenarios, such as simultaneous detection and segmentation.

Rating

  • Novelty: ⭐⭐⭐⭐ — Individual components have precedents, but the combination and unified framework design are novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Six datasets, detailed ablations, and comprehensive visualization analyses.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure, rich figures and tables, well-articulated motivation.
  • Value: ⭐⭐⭐⭐ — The unified framework reduces deployment costs, though paired datasets are required.