LLaVA-ReID: Selective Multi-Image Questioner for Interactive Person Re-Identification

Conference: ICML 2025
arXiv: 2504.10174
Area: Human Understanding

TL;DR

This paper defines a new task, interactive person re-identification (Inter-ReID), builds the multi-turn dialogue dataset Interactive-PEDES, and proposes LLaVA-ReID, a large multimodal question-generation model that uses selective multi-image context and look-ahead supervision to progressively refine the description of the target person through iterative dialogue.

Background & Motivation

Traditional text-based person re-identification (T-ReID) assumes that a witness's description is complete and given all at once. In real-world scenarios, however, witness descriptions are often partial or vague, so this assumption rarely holds.

Inspired by Sherlock Holmes-style questioning, where a detective gradually obtains more details from a witness through targeted questions, this paper proposes an interactive person re-identification framework that iteratively refines the initial description through multi-turn dialogues to identify the target person more accurately.

Method

Task Definition

Interactive person re-identification is a multi-turn dialogue and retrieval process:

The witness provides an initial description $T$.
In each round $t$, the system generates a question $Q_t$ to guide the witness in recalling more details.
The witness provides an answer $A_t$, and the dialogue context $\mathcal{D}_t = \{T, (Q_1, A_1), \ldots, (Q_t, A_t)\}$ is used to retrieve the target person.
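
The three steps above can be read as a simple loop. The following is a minimal sketch, not the paper's implementation: `run_interaction`, `toy_ask`, `toy_answer`, and `toy_retrieve` are hypothetical names, and the toy retriever merely pretends that the target (id 0) moves up one position per answered question.

```python
from typing import Callable, List, Tuple

Dialogue = List[Tuple[str, str]]  # list of (Q_t, A_t) pairs


def run_interaction(
    initial_description: str,
    ask: Callable[[str, Dialogue], str],
    answer: Callable[[str], str],
    retrieve: Callable[[str, Dialogue], List[int]],
    rounds: int,
) -> Tuple[Dialogue, List[List[int]]]:
    """Run `rounds` question/answer turns and retrieve after each one."""
    dialogue: Dialogue = []
    # Round 0: retrieval from the initial description T alone.
    rankings = [retrieve(initial_description, dialogue)]
    for _ in range(rounds):
        question = ask(initial_description, dialogue)  # Q_t
        reply = answer(question)                       # A_t
        dialogue.append((question, reply))             # context D_t gains (Q_t, A_t)
        rankings.append(retrieve(initial_description, dialogue))
    return dialogue, rankings


# Hypothetical stand-ins for the questioner, the witness, and the retriever.
def toy_ask(description: str, dialogue: Dialogue) -> str:
    return f"question {len(dialogue) + 1}?"


def toy_answer(question: str) -> str:
    return "answer to " + question


def toy_retrieve(description: str, dialogue: Dialogue) -> List[int]:
    # Gallery ids 0..3; the target (id 0) climbs one place per dialogue turn.
    position = max(0, 3 - len(dialogue))
    ranking = [1, 2, 3]
    ranking.insert(position, 0)
    return ranking


dialogue, rankings = run_interaction(
    "a man in a dark jacket", toy_ask, toy_answer, toy_retrieve, rounds=3
)
```

In this toy run, `rankings` holds one gallery ordering per round, starting with the ordering obtained from the initial description alone.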

Interactive-PEDES Dataset

The dataset contains 54,749 images and 13,051 individuals, with an average of 9 dialogue turns per image. The construction consists of three steps:

Coarse-to-Fine Description Generation: GPT-4o generates a coarse-grained initial description (simulating the witness's first impression) and a fine-grained description (simulating the witness's latent memory).
Sub-description Decomposition: Fine-grained descriptions are decomposed into non-overlapping sub-descriptions, with each focusing on a unique attribute.
Dialogue Generation: Three types of questions are generated: descriptive questions (50%), yes/no questions (40%), and multiple-choice questions (10%).

Interactive ReID Framework

The framework comprises three components:

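Retriever: A CLIP-based dual-stream network that encodes the dialogue description and the person images into a shared cross-modal space. With $z_t$ the dialogue embedding at round $t$, $f_i$ the embedding of gallery image $I_i$, and $m$ the number of gallery images: $$p(I_i \mid \mathcal{D}_t) = \frac{\exp(\text{sim}(z_t, f_i))}{\sum_{j=1}^{m} \exp(\text{sim}(z_t, f_j))}$$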
Questioner: LLaVA-ReID, which generates discriminative questions based on visual and textual contexts.
Answerer: An LLM based on Qwen2.5-7B-Instruct that simulates witness answers.
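
The retriever's probability above is a softmax over similarities between the dialogue embedding and each gallery embedding. A minimal sketch follows, assuming cosine similarity for $\text{sim}$ and adding a `temperature` parameter that does not appear in the formula (the default of 1.0 recovers it); the function names and the toy vectors are illustrative only.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (assumed choice of sim)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieval_probs(
    z_t: Sequence[float],
    gallery: Sequence[Sequence[float]],
    temperature: float = 1.0,  # assumption: not part of the formula in the text
) -> List[float]:
    """Softmax over sim(z_t, f_j) for every gallery embedding f_j."""
    sims = [cosine(z_t, f) / temperature for f in gallery]
    shift = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-D embeddings: the first gallery item is closest to the dialogue embedding.
z_t = [1.0, 0.0]
gallery = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
probs = retrieval_probs(z_t, gallery)
```

Sorting the gallery by `probs` in descending order gives the retrieval ranking used in each round.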

LLaVA-ReID: Selective Multi-Image Questioner

Selective Visual Context

Prior methods pick candidate images directly by top-k retrieval or k-means clustering, which overlooks fine-grained differences among the candidates. LLaVA-ReID instead designs a hard pass selection model:

Obtain top-k candidates using the retriever.
Feed the candidate image embeddings and the dialogue embedding into a shallow Transformer encoder: $\mathbf{v} = \phi_s(f_c; z_t)$.
Predict selection weights via a linear layer: $\mathbf{w} = \text{Softmax}(\phi_h(\mathbf{v}))$.
Send the top-c candidates with the highest selection weights to the LMM.

During training, a Gumbel-top-k relaxation is used to achieve a differentiable random sampling strategy.
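
The sampling side of this step can be illustrated with the standard Gumbel-top-k trick: adding independent Gumbel(0, 1) noise to the selection logits and keeping the largest entries draws distinct indices without replacement. The sketch below covers only that sampling step and the deterministic top-c choice; the differentiable relaxation needed for backpropagation is not shown, and the function names and logits are hypothetical.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Selection weights w = Softmax(logits)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_c(scores: Sequence[float], c: int) -> List[int]:
    """Indices of the c largest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:c]


def gumbel_top_c(logits: Sequence[float], c: int, rng: random.Random) -> List[int]:
    """Sample c distinct indices by perturbing the logits with Gumbel(0, 1) noise."""
    perturbed = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # keep u > 0 so log(u) is defined
        perturbed.append(logit - math.log(-math.log(u)))
    return top_c(perturbed, c)


# Hypothetical selection logits for k = 5 retrieved candidates, keeping c = 2.
logits = [0.1, 2.0, -1.0, 1.5, 0.3]
weights = softmax(logits)
inference_pick = top_c(weights, 2)                           # deterministic: [1, 3]
training_pick = gumbel_top_c(logits, 2, random.Random(0))    # stochastic sample
```

Candidates with larger logits are sampled more often, but lower-scored ones still appear occasionally, which is what lets training explore alternative visual contexts.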

Look-Ahead Supervision (Looking-Forward)

Different questions yield varying information gains depending on the retrieval state. The paper proposes a one-step look-ahead strategy to dynamically select the most informative question:

\[Q_t^* = \underset{Q_i \in (\mathcal{S} \setminus \mathcal{Q}_{pre}^{t-1})}{\arg\max} \text{rank}(I_{gt}, \{T, A_1, \ldots, A_{t-1}, A_t^*\})\]

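Here $\mathcal{S}$ is the pool of candidate questions, $\mathcal{Q}_{pre}^{t-1}$ is the set of questions already asked, and $I_{gt}$ is the ground-truth image. The question whose answer most improves the target person's retrieval rank is chosen as the supervision signal for the current round, and the questioner is trained with an NLL loss: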

\[\mathcal{L}_{\text{NLL}} = -\log p(Q_t^* | \mathcal{C}_{t-1}, \mathcal{D}_{t-1})\]
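
A minimal sketch of the one-step look-ahead, under stated assumptions: each candidate question already has a simulated answer, rank position 1 is treated as best (so "improving the rank" means picking the smallest position), and `toy_rank` with its `gain` table is an invented stand-in for running the retriever. None of these names come from the paper.

```python
import math
from typing import Callable, Dict, List, Sequence


def look_ahead_select(
    candidates: Dict[str, str],                  # candidate question -> simulated answer
    asked: Sequence[str],                        # questions used in earlier rounds
    context: List[str],                          # T and the answers collected so far
    rank_of_target: Callable[[List[str]], int],  # rank position of the target, 1 = best
) -> str:
    """Return the unasked question whose answer gives the target the best rank."""
    best_question, best_rank = None, None
    for question, reply in candidates.items():
        if question in asked:
            continue
        rank = rank_of_target(context + [reply])  # look one step ahead
        if best_rank is None or rank < best_rank:
            best_question, best_rank = question, rank
    if best_question is None:
        raise ValueError("no unasked candidate questions remain")
    return best_question


def nll(prob_of_target_question: float) -> float:
    """Negative log-likelihood of the selected question under the questioner."""
    return -math.log(prob_of_target_question)


# Toy retriever: every answer lowers the target's rank position by a fixed amount.
gain = {"a red backpack": 12, "no": 2, "white sneakers": 7}


def toy_rank(texts: List[str]) -> int:
    return max(1, 20 - sum(gain.get(t, 0) for t in texts))


candidates = {
    "What color is the bag?": "a red backpack",
    "Is the person wearing a hat?": "no",
    "What shoes is the person wearing?": "white sneakers",
}
context = ["a man in a dark jacket"]
first = look_ahead_select(candidates, [], context, toy_rank)
second = look_ahead_select(candidates, [first], context + [candidates[first]], toy_rank)
loss = nll(0.25)  # loss if the questioner assigns probability 0.25 to the chosen question
```

Because only one step is evaluated per round, the cost grows with the number of remaining candidate questions rather than with the number of possible question orderings.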

Key Experimental Results

Interactive-PEDES Main Results

Method	R3@1	R5@1	R5@5	BRI ↓
Initial	35.86	35.86	55.17	-
SimIRV	50.45	61.27	82.00	1.024
ChatIR	57.85	63.86	83.81	0.935
PlugIR	60.34	65.44	85.33	0.849
LLaVA-ReID	63.96	73.20	90.62	0.719

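After 5 rounds of interaction, R@1 improves by 37.34 percentage points over the initial description (73.20 vs 35.86) and exceeds PlugIR by 7.76 points (73.20 vs 65.44). BRI, where lower is better, falls from 0.849 for PlugIR to 0.719.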
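
The gains quoted above are absolute differences between table entries, that is, percentage points rather than relative percentages. The short check below recomputes them from the table values; the `results` dictionary and the `gain` helper are illustrative names, and `R5@1` is read as R@1 after 5 rounds.

```python
# R@k values copied from the main results table.
results = {
    "Initial": {"R3@1": 35.86, "R5@1": 35.86, "R5@5": 55.17},
    "PlugIR": {"R3@1": 60.34, "R5@1": 65.44, "R5@5": 85.33},
    "LLaVA-ReID": {"R3@1": 63.96, "R5@1": 73.20, "R5@5": 90.62},
}


def gain(metric: str, method: str, baseline: str) -> float:
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(results[method][metric] - results[baseline][metric], 2)


gain_over_initial = gain("R5@1", "LLaVA-ReID", "Initial")  # 37.34
gain_over_plugir = gain("R5@1", "LLaVA-ReID", "PlugIR")    # 7.76
```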

Transfer to Traditional T-ReID

Integrating LLaVA-ReID into existing T-ReID methods raises R@1 on CUHK-PEDES from 73.38 to 78.51 for IRRA and from 75.94 to 79.39 for RDE, demonstrating the transferability of the method.

Ablation Study

Removing selective visual context: R@1 decreases by approximately 3 points.
Removing look-ahead supervision: R@1 drops from 73.20 to approximately 68.
The number of candidates $c=4$ is the optimal choice.

Highlights & Insights

First to define the Inter-ReID task: Extending static text-based person re-identification to interactive dialogue retrieval.
Well-crafted dataset: Interactive-PEDES contains three types of questions, simulating real-world questioning scenarios.
Innovative look-ahead strategy: Dynamically selecting questions with the maximum information gain, avoiding the combinatorial explosion of question permutations.
Strong transferability: Serves as a plug-and-play module to boost the performance of existing T-ReID frameworks.

Limitations & Future Work

The witness simulation uses an LLM instead of real human evaluation, which may lead to a gap with real-world scenarios.
Image sources in the dataset are relatively limited (CUHK-PEDES and ICFG-PEDES).
Forward inference of large multimodal models is required for each round, potentially limiting real-time performance.
The look-ahead strategy requires pre-computing retrieval ranks for all candidate questions, incurring high training overhead.
The system's robustness to inaccurate or contradictory witness answers is not discussed.

Rating

⭐⭐⭐⭐ (4/5)

The paper demonstrates strong innovation in task definition, dataset construction, and method design. Upgrading person re-identification from static retrieval to interactive dialogue is a natural and valuable direction, with convincing experimental results.