Generalizable Object Re-Identification via Visual In-Context Prompting¶
Conference: ICCV 2025 arXiv: 2508.21222 Code: https://github.com/Hzzone/VICP Area: Multimodal / Vision-Language Model Keywords: Object Re-Identification, Generalizable ReID, Visual In-Context Prompting, LLM-Guided, Visual Foundation Model
TL;DR¶
VICP proposes a generalizable object re-identification framework in which an LLM infers identity-discriminative rules from a small set of positive/negative image pairs and converts them into dynamic visual prompts injected into a frozen visual foundation model (DINOv2), enabling zero-parameter-update generalization to unseen object categories.
Background & Motivation¶
- Traditional ReID methods train category-specific models (pedestrian, vehicle), resulting in poor generalizability; every new category requires expensive annotation and retraining.
- Self-supervised learning (DINO, MoCo, etc.) reduces annotation requirements but learns semantic consistency rather than identity-sensitive features (e.g., stitching texture on a backpack, sole pattern on a shoe), yielding suboptimal ReID performance.
- Core Problem: How to build a ReID model that generalizes to arbitrary object categories without category-specific training?
- Key Insights:
- Visual Foundation Models (VFMs) possess strong visual priors, but their general-purpose features lack the fine-grained identity discriminability required for ReID.
- Large Language Models (LLMs) excel at in-context learning—inferring task rules from a handful of examples.
- Unifying both: LLM infers identity-discriminative rules → generates visual prompts → VFM extracts identity-sensitive features.
Method¶
Overall Architecture¶
The VICP framework consists of two main modules:

1. In-Context Visual Prompt Generation: processes a small set of positive/negative pairs, uses a frozen LLM to infer identity rules, and generates visual prompts.
2. Generalizable Object ReID: injects the visual prompts into a frozen ViT (DINOv2) to dynamically modulate self-attention toward identity-sensitive features.
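A rough end-to-end sketch of the two stages, in PyTorch. Every helper name here is hypothetical; the Key Designs sketches below expand each stage.

```python
import torch

# Rough two-stage VICP pipeline; `build_context_sequence`, `prompt_generator`,
# and `forward_with_prompts` are hypothetical helpers (expanded in the
# Key Designs sketches below), not the authors' code.
@torch.no_grad()
def reid_features(support_pairs, query_images, dinov2, qformer, llm, prompt_generator):
    # Stage 1: infer identity rules from a few labeled pairs -> task prompts P_task.
    pair_tokens, pair_labels = build_context_sequence(support_pairs, dinov2, qformer)
    p_task = prompt_generator(pair_tokens, pair_labels, llm)        # (1, M, d_vision)

    # Stage 2: inject P_task into the frozen ViT and extract identity-sensitive features.
    return torch.stack([forward_with_prompts(dinov2, x, p_task) for x in query_images])
```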
Key Designs¶
- In-Context Visual Prompt Generation (sketched in code below):
- Input: support set \(\mathcal{S} = \{(\boldsymbol{x}_i, \boldsymbol{x}_j, y_{ij})\}\) (positive/negative pairs).
- Each image is encoded by DINOv2; a Q-Former (Query-based Connector, inspired by BLIP-2) compresses each image into \(N\) latent tokens.
- For each pair, the compressed tokens of both images are concatenated with label embeddings: \(\mathbf{T}_{ij} = [\mathbf{I}_i; \mathbf{I}_j; \mathbf{L}_{ij}]\).
- \(K\) pairs form the complete context sequence \(\mathbf{T}_{\text{ctx}}\).
- A frozen LLM (LLaMA) processes the sequence autoregressively; loss is computed only on label tokens (ICL Loss).
- \(M\) learnable visual prompt tokens \(\mathbf{P}_{\text{learn}}\) are appended at the end of the sequence.
- LLM outputs are mapped by a Visual Head (two-layer MLP) to visual prompts \(\mathbf{P}_{\text{task}} \in \mathbb{R}^{M \times d_{\text{vision}}}\).
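A minimal sketch of this generation step, assuming a HuggingFace-style frozen LLaMA interface (`inputs_embeds` / `last_hidden_state`); the hidden sizes and the number of prompt tokens `n_prompts` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    """Builds the context sequence T_ctx and maps LLM outputs to visual prompts P_task."""

    def __init__(self, d_llm=4096, d_vision=384, n_prompts=8):
        super().__init__()
        self.label_embed = nn.Embedding(2, d_llm)                 # positive / negative label tokens
        self.prompt_tokens = nn.Parameter(torch.randn(n_prompts, d_llm) * 0.02)
        self.visual_head = nn.Sequential(                         # two-layer MLP -> P_task
            nn.Linear(d_llm, d_llm), nn.GELU(), nn.Linear(d_llm, d_vision))

    def forward(self, pair_tokens, pair_labels, llm):
        # pair_tokens: (K, 2, N, d_llm) Q-Former latents for both images of each pair
        # pair_labels: (K,) long tensor, 1 = same identity, 0 = different identity
        labels = self.label_embed(pair_labels).unsqueeze(1)                  # (K, 1, d_llm)
        ctx = torch.cat([pair_tokens[:, 0], pair_tokens[:, 1], labels], 1)   # T_ij per pair
        ctx = ctx.reshape(1, -1, ctx.shape[-1])                              # T_ctx: one long sequence
        seq = torch.cat([ctx, self.prompt_tokens.unsqueeze(0)], dim=1)       # append M learnable tokens
        hidden = llm(inputs_embeds=seq).last_hidden_state                    # frozen LLM forward
        return self.visual_head(hidden[:, -self.prompt_tokens.shape[0]:])    # (1, M, d_vision)
```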
- Prompt Injection into VFM (sketched in code below):
- \(\mathbf{P}_{\text{task}}\) is concatenated to the input token sequence of each ViT layer.
- Through self-attention, prompt tokens interact with spatial features, dynamically amplifying identity-sensitive regions (logos, textures) and suppressing irrelevant regions (background, lighting variations).
- ViT parameters are fully frozen; only prompts modulate the feature space.
- At inference time, prompts can be cached and reused—only one prompt generation is needed per category for all query-gallery comparisons.
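A sketch of the deep, VPT-style injection described above, written against a timm-like ViT layout; the attribute names (`patch_embed`, `cls_token`, `pos_embed`, `blocks`, `norm`) are assumptions about the backbone API rather than the authors' implementation.

```python
import torch

@torch.no_grad()
def forward_with_prompts(vit, image, p_task):
    """Extract a ReID feature from a frozen ViT with task prompts injected at every layer."""
    x = vit.patch_embed(image)                                 # (B, L, d) patch tokens
    cls = vit.cls_token.expand(x.shape[0], -1, -1)
    x = torch.cat([cls, x], dim=1) + vit.pos_embed             # standard ViT tokenization
    m = p_task.shape[1]
    for blk in vit.blocks:                                     # frozen transformer blocks
        x = torch.cat([p_task.expand(x.shape[0], -1, -1), x], dim=1)
        x = blk(x)                                             # prompts attend to spatial tokens
        x = x[:, m:]                                           # drop prompts; re-inject at next layer
    return vit.norm(x)[:, 0]                                   # CLS embedding as the ReID feature
```

Since \(\mathbf{P}_{\text{task}}\) depends only on the support pairs, it can be generated once per category and cached for all subsequent query-gallery comparisons.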
- Loss Function Design (sketched in code below):
- ReID Loss (Triplet Loss): \(\mathcal{L}_{\text{ID}} = \sum \max(0, \alpha - \text{sim}(\phi(\boldsymbol{x}_a), \phi(\boldsymbol{x}_p)) + \text{sim}(\phi(\boldsymbol{x}_a), \phi(\boldsymbol{x}_n)))\)
- Triplet loss is preferred over ArcFace/contrastive loss because it penalizes only margin-violating samples, imposing softer updates that preserve the pretrained model's semantic priors.
- Patch Alignment Loss (OT Distance): \(\mathcal{L}_{\text{align}}\), measuring patch-level feature matching quality via optimal transport distance, aligning positive pairs and separating negative pairs.
- ICL Loss: Supervises only label token predictions, preserving the LLM's pretrained semantic knowledge.
- Total loss: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{ID}} + \lambda_{\text{ICL}} \mathcal{L}_{\text{ICL}} + \lambda_{\text{align}} \mathcal{L}_{\text{align}}\)
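A hedged sketch of the objective. The triplet term follows the formula above; the Sinkhorn-style patch alignment and its negative-pair margin are illustrative stand-ins, since this note does not spell out the exact OT formulation or the \(\lambda\) values.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.1):
    # Cosine-similarity triplet loss: only margin-violating triplets receive gradients.
    sim_ap = F.cosine_similarity(anchor, positive, dim=-1)
    sim_an = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(margin - sim_ap + sim_an).mean()

def patch_alignment_loss(patches_a, patches_b, same_identity, n_iters=20, eps=0.05):
    # Entropic-OT (Sinkhorn) distance between two sets of patch features;
    # pull positive pairs together, push negative pairs apart (illustrative form).
    cost = 1.0 - F.normalize(patches_a, dim=-1) @ F.normalize(patches_b, dim=-1).T
    K = torch.exp(-cost / eps)
    u = torch.full((cost.shape[0],), 1.0 / cost.shape[0], device=cost.device)
    v = torch.full((cost.shape[1],), 1.0 / cost.shape[1], device=cost.device)
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):                       # Sinkhorn iterations
        a = u / (K @ b)
        b = v / (K.T @ a)
    plan = torch.diag(a) @ K @ torch.diag(b)       # approximate transport plan
    ot_dist = (plan * cost).sum()
    # Margin of 0.5 for negative pairs is a hypothetical choice for this sketch.
    return ot_dist if same_identity else F.relu(0.5 - ot_dist)

def total_loss(l_id, l_icl, l_align, lam_icl=1.0, lam_align=1.0):
    # L_total = L_ID + lambda_ICL * L_ICL + lambda_align * L_align (weights assumed).
    return l_id + lam_icl * l_icl + lam_align * l_align
```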
Loss & Training¶
- Backbone: DINOv2 ViT-small.
- Trained on 2× H100 GPUs; learning rate \(10^{-4}\); batch size 256.
- Image resolution \(224 \times 224\); data augmentation: horizontal flip only.
- 10 training epochs; 64 positive/negative pairs randomly sampled per batch.
- Q-Former generates 32 visual tokens.
- Triplet loss margin \(\alpha = 0.1\).
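For reference, the reported hyperparameters collected into a single config; the optimizer and schedule are not stated in this note and are therefore omitted.

```python
# Training setup as reported above (backbone identifier is an assumption
# matching DINOv2 ViT-small; all other values are taken from the note).
train_config = {
    "backbone": "dinov2_vits14",        # DINOv2 ViT-small, kept frozen
    "gpus": "2x H100",
    "learning_rate": 1e-4,
    "batch_size": 256,
    "image_size": (224, 224),
    "augmentation": ["horizontal_flip"],
    "epochs": 10,
    "pairs_per_batch": 64,              # positive/negative support pairs sampled per batch
    "qformer_tokens": 32,
    "triplet_margin": 0.1,
}
```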
Key Experimental Results¶
Main Results¶
PetFace Dataset:
| Method | AUC↑ | ACC↑ | mAP↑ | Top-1↑ |
|---|---|---|---|---|
| CLIP | 71.5 | 64.6 | 7.1 | 4.4 |
| DINOv2 | 71.6 | 65.9 | 6.5 | 5.1 |
| Triplet+ | 92.5 | 85.6 | 49.8 | 47.7 |
| VICP (Ours) | 93.5 | 86.0 | 51.2 | 49.7 |
| Supervised (upper bound) | 95.5 | 89.3 | 57.7 | 56.3 |
ShopID10K Dataset:
| Method | mAP↑ | Rank-1↑ | Rank-5↑ |
|---|---|---|---|
| CLIP | 37.1 | 48.6 | 72.1 |
| DINOv2 | — | — | — |
| Triplet+ | — | — | — |
| VICP (Ours) | ~4% mAP over Triplet+ | — | — |
| Supervised (upper bound) | 62.6 | 71.2 | 89.8 |
Ablation Study¶
Comparison of Loss Functions (PetFace):
| Method | AUC↑ | mAP↑ |
|---|---|---|
| ArcFace | 89.1 | 46.6 |
| AdaFace | 89.3 | 46.9 |
| SCL (Contrastive) | 91.1 | 46.3 |
| Triplet | 91.7 | 48.2 |
| Triplet+ (few-shot) | 92.5 | 49.8 |
| VICP | 93.5 | 51.2 |
Triplet loss outperforms ArcFace/AdaFace/SCL because those objectives keep pushing the loss down on every sample, which can disrupt the generalization capacity of the pretrained representation, whereas triplet loss updates only margin-violating samples.
Key Findings¶
- DINOv2 applied directly to ReID performs poorly (mAP only 6.5%), confirming that semantic features do not equate to identity features.
- Fine-tuning with Triplet loss yields substantial gains (mAP 48.2%), demonstrating that explicit identity-discriminative optimization is necessary for ReID.
- VICP surpasses fine-tuned Triplet+ without any parameter updates, validating the effectiveness of LLM-driven visual prompting.
- Strong performance across all unseen categories (pets, products, vehicles) confirms cross-category generalization capability.
- The ShopID10K dataset exposes significant challenges in real-world scenarios (lighting, occlusion, background variation).
Highlights & Insights¶
- Clear problem formulation: The paper is the first to systematically define the task of "generalizable object re-identification" targeting arbitrary rather than specific categories.
- The LLM→visual prompt pipeline is elegant: the LLM "reasons" from a few positive/negative pairs about which features matter, then visual prompts guide the VFM to focus accordingly—analogous to the human cognitive process of "learning rules from examples."
- Q-Former compression of visual tokens effectively controls the computational overhead imposed on the LLM.
- Prompt caching and reuse reduces inference to a single prompt generation per category, making deployment compatible with standard ReID pipelines.
- ShopID10K fills an important gap in the field as a new benchmark dataset.
- The rationale for choosing Triplet loss over more aggressive metric learning objectives (preserving pretrained priors) is insightful.
Limitations & Future Work¶
- The framework assumes that object categories are known at inference time (requiring an upstream detector), and does not handle cross-category ambiguity.
- The LLM processes visual tokens rather than raw images, potentially limiting its semantic reasoning capacity.
- Only DINOv2 ViT-small is used; larger backbone models may yield further improvements.
- The quality and representativeness of few-shot pairs are critical to prompt generation, yet the optimal support set selection strategy is not thoroughly investigated.
- Integration with more recent VLMs (e.g., GPT-4V) may represent a more powerful direction.
Related Work & Insights¶
- BLIP-2: Source of inspiration for the Q-Former design, using learnable queries to compress visual tokens.
- Visual Prompt Tuning (VPT): Methodological foundation for injecting prompt tokens into each ViT layer.
- MegaDescriptor / PetFace: Category-specific ReID baselines; the proposed method surpasses them in generality.
- In-Context Learning (GPT series): Core idea transferred from NLP to the visual domain.
- Insight: LLMs can not only understand text but also infer task-specific rules from sequences of visual tokens.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First to apply LLM-driven in-context learning to generalizable ReID; the approach is conceptually original.
- Experimental Thoroughness: ⭐⭐⭐⭐ — 7 datasets, diverse baselines, comprehensive ablations, and a new dataset of independent value.
- Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear, method description is detailed, and figures are informative.
- Value: ⭐⭐⭐⭐ — Introduces a new task, a new dataset, and a new method, making a meaningful contribution to the ReID community.