Generalizable Object Re-Identification via Visual In-Context Prompting¶
Conference: ICCV 2025 arXiv: 2508.21222 Code: https://github.com/Hzzone/VICP Area: Multimodal / Vision-Language Model Keywords: Object Re-Identification, Generalizable ReID, Visual In-Context Prompting, LLM-Guided, Visual Foundation Model
TL;DR¶
VICP proposes a generalizable object re-identification framework in which an LLM infers identity-discriminative rules from a small set of positive/negative image pairs and converts them into dynamic visual prompts injected into a frozen visual foundation model (DINOv2), enabling zero-parameter-update generalization to unseen object categories.
Background & Motivation¶
- Traditional ReID methods train category-specific models (pedestrian, vehicle), resulting in poor generalizability; every new category requires expensive annotation and retraining.
- Self-supervised learning (DINO, MoCo, etc.) reduces annotation requirements but learns semantic consistency rather than identity-sensitive features (e.g., stitching texture on a backpack, sole pattern on a shoe), yielding suboptimal ReID performance.
- Core Problem: How to build a ReID model that generalizes to arbitrary object categories without category-specific training?
- Key Insights:
- Visual Foundation Models (VFMs) possess strong visual priors, but their general-purpose features lack the fine-grained identity discriminability required for ReID.
- Large Language Models (LLMs) excel at in-context learning—inferring task rules from a handful of examples.
- Unifying both: LLM infers identity-discriminative rules → generates visual prompts → VFM extracts identity-sensitive features.
Method¶
Overall Architecture¶
The VICP framework consists of two main modules:

1. In-Context Visual Prompt Generation: processes a small set of positive/negative pairs, uses a frozen LLM to infer identity rules, and generates visual prompts.
2. Generalizable Object ReID: injects the visual prompts into a frozen ViT (DINOv2) to dynamically modulate self-attention toward identity-sensitive features.
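A rough end-to-end sketch of the two stages, in PyTorch. Every helper name here is hypothetical; the Key Designs sketches below expand each stage.

```python
import torch

# Rough two-stage VICP pipeline; `build_context_sequence`, `prompt_generator`,
# and `forward_with_prompts` are hypothetical helpers (expanded in the
# Key Designs sketches below), not the authors' code.
@torch.no_grad()
def reid_features(support_pairs, query_images, dinov2, qformer, llm, prompt_generator):
    # Stage 1: infer identity rules from a few labeled pairs -> task prompts P_task.
    pair_tokens, pair_labels = build_context_sequence(support_pairs, dinov2, qformer)
    p_task = prompt_generator(pair_tokens, pair_labels, llm)        # (1, M, d_vision)

    # Stage 2: inject P_task into the frozen ViT and extract identity-sensitive features.
    return torch.stack([forward_with_prompts(dinov2, x, p_task) for x in query_images])
```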
Key Designs¶
- In-Context Visual Prompt Generation (sketched in code below):
- Input: support set \(\mathcal{S} = \{(\boldsymbol{x}_i, \boldsymbol{x}_j, y_{ij})\}\) (positive/negative pairs).
- Each image is encoded by DINOv2; a Q-Former (Query-based Connector, inspired by BLIP-2) compresses each image into \(N\) latent tokens.
- For each pair, the compressed tokens of both images are concatenated with label embeddings: \(\mathbf{T}_{ij} = [\mathbf{I}_i; \mathbf{I}_j; \mathbf{L}_{ij}]\).
- \(K\) pairs form the complete context sequence \(\mathbf{T}_{\text{ctx}}\).
- A frozen LLM (LLaMA) processes the sequence autoregressively; loss is computed only on label tokens (ICL Loss).
- \(M\) learnable visual prompt tokens \(\mathbf{P}_{\text{learn}}\) are appended at the end of the sequence.
- LLM outputs are mapped by a Visual Head (two-layer MLP) to visual prompts \(\mathbf{P}_{\text{task}} \in \mathbb{R}^{M \times d_{\text{vision}}}\).
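A minimal sketch of this generation step, assuming a HuggingFace-style frozen LLaMA interface (`inputs_embeds` / `last_hidden_state`); the hidden sizes and the number of prompt tokens `n_prompts` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    """Builds the context sequence T_ctx and maps LLM outputs to visual prompts P_task."""

    def __init__(self, d_llm=4096, d_vision=384, n_prompts=8):
        super().__init__()
        self.label_embed = nn.Embedding(2, d_llm)                 # positive / negative label tokens
        self.prompt_tokens = nn.Parameter(torch.randn(n_prompts, d_llm) * 0.02)
        self.visual_head = nn.Sequential(                         # two-layer MLP -> P_task
            nn.Linear(d_llm, d_llm), nn.GELU(), nn.Linear(d_llm, d_vision))

    def forward(self, pair_tokens, pair_labels, llm):
        # pair_tokens: (K, 2, N, d_llm) Q-Former latents for both images of each pair
        # pair_labels: (K,) long tensor, 1 = same identity, 0 = different identity
        labels = self.label_embed(pair_labels).unsqueeze(1)                  # (K, 1, d_llm)
        ctx = torch.cat([pair_tokens[:, 0], pair_tokens[:, 1], labels], 1)   # T_ij per pair
        ctx = ctx.reshape(1, -1, ctx.shape[-1])                              # T_ctx: one long sequence
        seq = torch.cat([ctx, self.prompt_tokens.unsqueeze(0)], dim=1)       # append M learnable tokens
        hidden = llm(inputs_embeds=seq).last_hidden_state                    # frozen LLM forward
        return self.visual_head(hidden[:, -self.prompt_tokens.shape[0]:])    # (1, M, d_vision)
```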
- Prompt Injection into VFM (sketched in code below):
- \(\mathbf{P}_{\text{task}}\) is concatenated to the input token sequence of each ViT layer.
- Through self-attention, prompt tokens interact with spatial features, dynamically amplifying identity-sensitive regions (logos, textures) and suppressing irrelevant regions (background, lighting variations).
- ViT parameters are fully frozen; only prompts modulate the feature space.
- At inference time, prompts can be cached and reused—only one prompt generation is needed per category for all query-gallery comparisons.
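A sketch of the deep, VPT-style injection described above, written against a timm-like ViT layout; the attribute names (`patch_embed`, `cls_token`, `pos_embed`, `blocks`, `norm`) are assumptions about the backbone API rather than the authors' implementation.

```python
import torch

@torch.no_grad()
def forward_with_prompts(vit, image, p_task):
    """Extract a ReID feature from a frozen ViT with task prompts injected at every layer."""
    x = vit.patch_embed(image)                                 # (B, L, d) patch tokens
    cls = vit.cls_token.expand(x.shape[0], -1, -1)
    x = torch.cat([cls, x], dim=1) + vit.pos_embed             # standard ViT tokenization
    m = p_task.shape[1]
    for blk in vit.blocks:                                     # frozen transformer blocks
        x = torch.cat([p_task.expand(x.shape[0], -1, -1), x], dim=1)
        x = blk(x)                                             # prompts attend to spatial tokens
        x = x[:, m:]                                           # drop prompts; re-inject at next layer
    return vit.norm(x)[:, 0]                                   # CLS embedding as the ReID feature
```

Since \(\mathbf{P}_{\text{task}}\) depends only on the support pairs, it can be generated once per category and cached for all subsequent query-gallery comparisons.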
- Loss Function Design (sketched in code below):
- ReID Loss (Triplet Loss): \(\mathcal{L}_{\text{ID}} = \sum \max(0, \alpha - \text{sim}(\phi(\boldsymbol{x}_a), \phi(\boldsymbol{x}_p)) + \text{sim}(\phi(\boldsymbol{x}_a), \phi(\boldsymbol{x}_n)))\)
- Triplet loss is preferred over ArcFace/contrastive loss because it penalizes only margin-violating samples, imposing softer updates that preserve the pretrained model's semantic priors.
- Patch Alignment Loss (OT Distance): \(\mathcal{L}_{\text{align}}\), measuring patch-level feature matching quality via optimal transport distance, aligning positive pairs and separating negative pairs.
- ICL Loss: Supervises only label token predictions, preserving the LLM's pretrained semantic knowledge.
- Total loss: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{ID}} + \lambda_{\text{ICL}} \mathcal{L}_{\text{ICL}} + \lambda_{\text{align}} \mathcal{L}_{\text{align}}\)
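A hedged sketch of the objective. The triplet term follows the formula above; the Sinkhorn-style patch alignment and its negative-pair margin are illustrative stand-ins, since this note does not spell out the exact OT formulation or the \(\lambda\) values.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.1):
    # Cosine-similarity triplet loss: only margin-violating triplets receive gradients.
    sim_ap = F.cosine_similarity(anchor, positive, dim=-1)
    sim_an = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(margin - sim_ap + sim_an).mean()

def patch_alignment_loss(patches_a, patches_b, same_identity, n_iters=20, eps=0.05):
    # Entropic-OT (Sinkhorn) distance between two sets of patch features;
    # pull positive pairs together, push negative pairs apart (illustrative form).
    cost = 1.0 - F.normalize(patches_a, dim=-1) @ F.normalize(patches_b, dim=-1).T
    K = torch.exp(-cost / eps)
    u = torch.full((cost.shape[0],), 1.0 / cost.shape[0], device=cost.device)
    v = torch.full((cost.shape[1],), 1.0 / cost.shape[1], device=cost.device)
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):                       # Sinkhorn iterations
        a = u / (K @ b)
        b = v / (K.T @ a)
    plan = torch.diag(a) @ K @ torch.diag(b)       # approximate transport plan
    ot_dist = (plan * cost).sum()
    # Margin of 0.5 for negative pairs is a hypothetical choice for this sketch.
    return ot_dist if same_identity else F.relu(0.5 - ot_dist)

def total_loss(l_id, l_icl, l_align, lam_icl=1.0, lam_align=1.0):
    # L_total = L_ID + lambda_ICL * L_ICL + lambda_align * L_align (weights assumed).
    return l_id + lam_icl * l_icl + lam_align * l_align
```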
Loss & Training¶
- Backbone: DINOv2 ViT-small.
- Trained on 2× H100 GPUs; learning rate \(10^{-4}\); batch size 256.
- Image resolution \(224 \times 224\); data augmentation: horizontal flip only.
- 10 training epochs; 64 positive/negative pairs randomly sampled per batch.
- Q-Former generates 32 visual tokens.
- Triplet loss margin \(\alpha = 0.1\).
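For reference, the reported hyperparameters collected into a single config; the optimizer and schedule are not stated in this note and are therefore omitted.

```python
# Training setup as reported above (backbone identifier is an assumption
# matching DINOv2 ViT-small; all other values are taken from the note).
train_config = {
    "backbone": "dinov2_vits14",        # DINOv2 ViT-small, kept frozen
    "gpus": "2x H100",
    "learning_rate": 1e-4,
    "batch_size": 256,
    "image_size": (224, 224),
    "augmentation": ["horizontal_flip"],
    "epochs": 10,
    "pairs_per_batch": 64,              # positive/negative support pairs sampled per batch
    "qformer_tokens": 32,
    "triplet_margin": 0.1,
}
```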
Key Experimental Results¶
Main Results¶
PetFace Dataset:
| Method | AUC↑ | ACC↑ | mAP↑ | Top-1↑ |
|---|---|---|---|---|
| CLIP | 71.5 | 64.6 | 7.1 | 4.4 |
| DINOv2 | 71.6 | 65.9 | 6.5 | 5.1 |
| Triplet+ | 92.5 | 85.6 | 49.8 | 47.7 |
| VICP (Ours) | 93.5 | 86.0 | 51.2 | 49.7 |
| Supervised (upper bound) | 95.5 | 89.3 | 57.7 | 56.3 |
ShopID10K Dataset:
| Method | mAP↑ | Rank-1↑ | Rank-5↑ |
|---|---|---|---|
| CLIP | 37.1 | 48.6 | 72.1 |
| DINOv2 | — | — | — |
| Triplet+ | — | — | — |
| VICP (Ours) | ~4% mAP over Triplet+ | — | — |
| Supervised (upper bound) | 62.6 | 71.2 | 89.8 |
Ablation Study¶
Comparison of Loss Functions (PetFace):
| Method | AUC↑ | mAP↑ |
|---|---|---|
| ArcFace | 89.1 | 46.6 |
| AdaFace | 89.3 | 46.9 |
| SCL (Contrastive) | 91.1 | 46.3 |
| Triplet | 91.7 | 48.2 |
| Triplet+ (few-shot) | 92.5 | 49.8 |
| VICP | 93.5 | 51.2 |
Triplet loss outperforms ArcFace/AdaFace/SCL because those objectives keep pushing the loss down on every sample, which can disrupt the generalization capacity of the pretrained representation, whereas triplet loss updates only margin-violating samples.
Key Findings¶
- DINOv2 applied directly to ReID performs poorly (mAP only 6.5%), confirming that semantic features do not equate to identity features.
- Fine-tuning with Triplet loss yields substantial gains (mAP 48.2%), demonstrating that explicit identity-discriminative optimization is necessary for ReID.
- VICP surpasses fine-tuned Triplet+ without any parameter updates, validating the effectiveness of LLM-driven visual prompting.
- Strong performance across all unseen categories (pets, products, vehicles) confirms cross-category generalization capability.
- The ShopID10K dataset exposes significant challenges in real-world scenarios (lighting, occlusion, background variation).
Highlights & Insights¶
- Clear problem formulation: The paper is the first to systematically define the task of "generalizable object re-identification" targeting arbitrary rather than specific categories.
- The LLM→visual prompt pipeline is elegant: the LLM "reasons" from a few positive/negative pairs about which features matter, then visual prompts guide the VFM to focus accordingly—analogous to the human cognitive process of "learning rules from examples."
- Q-Former compression of visual tokens effectively controls the computational overhead imposed on the LLM.
- Prompt caching and reuse reduces inference to a single prompt generation per category, making deployment compatible with standard ReID pipelines.
- ShopID10K fills an important gap in the field as a new benchmark dataset.
- The rationale for choosing Triplet loss over more aggressive metric learning objectives (preserving pretrained priors) is insightful.
Limitations & Future Work¶
- The framework assumes that object categories are known at inference time (requiring an upstream detector), and does not handle cross-category ambiguity.
- The LLM processes visual tokens rather than raw images, potentially limiting its semantic reasoning capacity.
- Only DINOv2 ViT-small is used; larger backbone models may yield further improvements.
- The quality and representativeness of few-shot pairs are critical to prompt generation, yet the optimal support set selection strategy is not thoroughly investigated.
- Integration with more recent VLMs (e.g., GPT-4V) may represent a more powerful direction.
Related Work & Insights¶
- BLIP-2: Source of inspiration for the Q-Former design, using learnable queries to compress visual tokens.
- Visual Prompt Tuning (VPT): Methodological foundation for injecting prompt tokens into each ViT layer.
- MegaDescriptor / PetFace: Category-specific ReID baselines; the proposed method surpasses them in generality.
- In-Context Learning (GPT series): Core idea transferred from NLP to the visual domain.
- Insight: LLMs can not only understand text but also infer task-specific rules from sequences of visual tokens.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First to apply LLM-driven in-context learning to generalizable ReID; the approach is conceptually original.
- Experimental Thoroughness: ⭐⭐⭐⭐ — 7 datasets, diverse baselines, comprehensive ablations, and a new dataset of independent value.
- Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear, method description is detailed, and figures are informative.
- Value: ⭐⭐⭐⭐ — Introduces a new task, a new dataset, and a new method, making a meaningful contribution to the ReID community.