Training-Free Personalization via Retrieval and Reasoning on Fingerprints¶
Meta Info¶
- Conference: ICCV 2025
- arXiv: 2503.18623
- Code: Project Page
- Area: Multimodal VLM
- Keywords: VLM Personalization, Training-Free, Retrieval-Reasoning, Concept Fingerprints, Cross-Modal Verification
TL;DR¶
This work proposes R2P, the first training-free VLM personalization method. It leverages the inherent world knowledge of VLMs to extract concept "fingerprint" attributes, achieving personal concept recognition through a retrieval-reasoning paradigm and cross-modal attribute verification, without requiring any fine-tuning or large-scale pre-training.
Background & Motivation¶
VLM personalization aims to enable models to understand user-specific concepts (e.g., "my keys", "what is Fluffy doing"). However, existing methods suffer from severe limitations:
- Training Dependency: Methods like MyVLM and Yo'LLaVA require fine-tuning for each new concept, which means collecting multiple reference samples and numerous negative samples.
- Large-Scale Pre-training: Methods like RAP require expensive large-scale instruction tuning on synthesized personalization data.
- Poor Scalability: Adding new concepts requires retraining, leading to high time and computational costs.
Key Insight: VLMs have already been exposed to almost all semantic concepts through training on web-scale data. Can we leverage the existing internal knowledge of VLMs to achieve personalization without any training?
Method¶
Overall Architecture¶
R2P consists of two phases:
- Phase I (Personal Database Creation): Extracts "fingerprint" attributes for each concept.
- Phase II (Inference): Recognizes personal concepts through a retrieval-reasoning pipeline.
Phase I: Concept Fingerprint Extraction¶
Given a reference image \(I_i\), a concept name \(c_i\), and a semantic category \(g_i\), a VLM is prompted to extract the unique fingerprint attributes \(A_i\) and a description \(d_i\) of the concept.
Fingerprint attributes uniquely identify a concept (e.g., "orange and white fur pattern" or "bell on collar" to tell different cats apart). CLIP then encodes the reference image and the description into embeddings, which are stored for retrieval.
Phase II: Retrieval-Reasoning Inference¶
1. Multimodal Concept Retrieval: Compute the visual and textual similarity between the query image and every concept in the database, then select the top-\(K\) candidate concepts \(\mathcal{C}^K\).
2. Attribute-Focused CoT Reasoning: Prompt the VLM to focus on fingerprint attributes to infer the best-matching candidate concept \(\tilde{c}\) in the query image, while extracting the shared attribute set \(A_{q,i}\) between each candidate concept and the query image.
3. Cross-Modal Attribute Verification: Because VLMs may hallucinate, the CoT result is verified. An attribute-based cross-modal similarity is computed for each candidate, and the best-scoring candidate is denoted \(\tilde{c}_a\).
If \(\tilde{c} = \tilde{c}_a\) (the CoT reasoning result agrees with the attribute verification), the verification passes and \(\tilde{c}\) is accepted as the prediction.
4. Pairwise Reasoning (triggered only when verification fails): Compare each candidate concept with the query image one at a time, computing a matching probability for each pair, and select the concept with the highest probability as the final prediction.
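The control flow of the four steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the retrieval score is taken to be an equal-weight sum of two cosine similarities (the actual weighting is not given in this summary), the verification score is assumed to be the cosine similarity between the query image embedding and an embedding of the shared attributes, and `cot_reason`, `embed_attributes`, and `pairwise_prob` are hypothetical callables standing in for VLM and CLIP calls.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_top_k(query_emb: Vector, db: Dict[str, Tuple[Vector, Vector]], k: int) -> List[str]:
    """Step 1. db maps concept name -> (reference image emb, description emb).
    Assumption: visual and textual similarities are simply added."""
    scores = {
        name: cosine(query_emb, img_emb) + cosine(query_emb, txt_emb)
        for name, (img_emb, txt_emb) in db.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def recognize(
    query_emb: Vector,
    db: Dict[str, Tuple[Vector, Vector]],
    k: int,
    cot_reason: Callable[[List[str]], Tuple[str, Dict[str, List[str]]]],
    embed_attributes: Callable[[List[str]], Vector],
    pairwise_prob: Callable[[str], float],
) -> Tuple[str, str]:
    """Return (predicted concept, stage that produced it)."""
    candidates = retrieve_top_k(query_emb, db, k)

    # Step 2: the VLM picks a best match and lists attributes shared with each candidate.
    c_cot, shared = cot_reason(candidates)

    # Step 3: pick the candidate whose shared attributes best match the query image.
    c_attr = max(
        candidates,
        key=lambda c: cosine(query_emb, embed_attributes(shared.get(c, []))),
    )
    if c_cot == c_attr:
        return c_cot, "verified"

    # Step 4: fallback, run only when CoT and attribute verification disagree.
    return max(candidates, key=pairwise_prob), "pairwise"
```

The more expensive pairwise comparison runs only on disagreement, so under this sketch easy queries finish after a single CoT call plus an embedding check.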
Experiments¶
Identification and Description Task Results¶
| Method | Training? | MyVLM Wtd↑ | Yo'LLaVA Wtd↑ | Yo'LLaVA Recall↑ | PerVA Wtd↑ | PerVA Recall↑ |
|---|---|---|---|---|---|---|
| MyVLM | ✓ | 93.8 | - | 0.1 | 62.2 | 12.3 |
| Yo'LLaVA | ✓ | 96.4 | 89.7 | 64.8 | 72.0 | 38.2 |
| RAP | ✓ | 96.6 | 91.8 | 81.6 | 89.0 | 64.1 |
| MiniCPM-o + prompt | ✗ | 96.1 | 89.4 | 78.5 | 88.5 | 65.7 |
| R2P (Ours) | ✗ | 97.4 | 94.0 | 87.1 | 91.8 | 72.5 |
R2P achieves the best result in every column of the table while being training-free. On PerVA, it outperforms the training-based method RAP by 2.8 points in Wtd (91.8 vs. 89.0) and 8.4 points in Recall (72.5 vs. 64.1).
PerVA Dataset¶
The authors introduce a new personalization benchmark:
- 329 personal concepts spanning 21 everyday object categories.
- Emphasizes visual ambiguity: multiple instances from the same category possess high visual similarity.
- Contains challenges such as deformation, illumination variations, and different states.
Ablation Study¶
| Component | Yo'LLaVA Pos.Acc | Yo'LLaVA Neg.Acc | Yo'LLaVA Wtd |
|---|---|---|---|
| Visual Retrieval Only | 89.6 | 85.2 | 87.4 |
| + Text Retrieval | 91.8 | 88.7 | 90.3 |
| + CoT Reasoning | 95.2 | 90.4 | 92.8 |
| + Attribute Verification | 95.8 | 91.6 | 93.7 |
| + Pairwise Reasoning | 96.1 | 91.9 | 94.0 |
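As a sanity check on the ablation table, the Wtd column appears to equal the unweighted mean of positive and negative accuracy, up to rounding to one decimal. That relationship is an observation about the reported numbers, not a definition taken from the paper; the helper names below are hypothetical.

```python
# Rows copied from the ablation table: (Pos.Acc, Neg.Acc, reported Wtd).
ABLATION = {
    "Visual Retrieval Only": (89.6, 85.2, 87.4),
    "+ Text Retrieval": (91.8, 88.7, 90.3),
    "+ CoT Reasoning": (95.2, 90.4, 92.8),
    "+ Attribute Verification": (95.8, 91.6, 93.7),
    "+ Pairwise Reasoning": (96.1, 91.9, 94.0),
}


def mean_accuracy(pos: float, neg: float) -> float:
    """Unweighted mean of positive and negative accuracy (assumed form of Wtd)."""
    return (pos + neg) / 2


def rows_consistent(table, tol: float = 0.05 + 1e-9) -> dict:
    """For each row, check the reported Wtd is within rounding distance of the mean."""
    return {
        name: abs(mean_accuracy(pos, neg) - wtd) <= tol
        for name, (pos, neg, wtd) in table.items()
    }


def total_gain(table) -> float:
    """Wtd difference between the full pipeline and visual-only retrieval."""
    return table["+ Pairwise Reasoning"][2] - table["Visual Retrieval Only"][2]


if __name__ == "__main__":
    print(rows_consistent(ABLATION))
    print(f"Total Wtd gain: {total_gain(ABLATION):.1f} points")
```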
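Each component improves on the previous configuration, lifting Yo'LLaVA Wtd from 87.4 (visual retrieval only) to 94.0 (full pipeline). CoT reasoning gives the largest single gain (+2.5 Wtd), while cross-modal attribute verification and pairwise reasoning add smaller gains by catching VLM hallucinations.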
Key Findings¶
- The training-free R2P outperforms the training-based baselines (MyVLM, Yo'LLaVA, RAP), suggesting that the world knowledge embedded in VLMs is sufficient to support personalization.
- Attribute-focused CoT reasoning contributes most to positive accuracy (91.8 to 95.2 on Yo'LLaVA), while cross-modal verification mainly improves negative accuracy (90.4 to 91.6).
- Multimodal retrieval combining visual and textual modalities is more robust than single-modality retrieval.
- When concepts are highly visually similar, the fine-grained discriminative capability of fingerprint attributes is critical.
Highlights & Insights¶
- Pioneering Training-Free VLM Personalization: Challenges the assumption that training is necessary for personalization.
- Concept Fingerprints: Reformulates individual recognition as an attribute-matching problem, which makes each prediction interpretable through the attributes it matched.
- Hierarchical Reasoning Pipeline: The pipeline of retrieval \(\rightarrow\) CoT \(\rightarrow\) verification \(\rightarrow\) pairwise reasoning handles queries of varying difficulty, escalating to costlier steps only when needed.
- The PerVA dataset fills the gap in personalization evaluation under visually ambiguous scenarios.
Limitations & Future Work¶
- Relies on the quality of fingerprint attributes generated by the VLM; VLM hallucinations can lead to incorrect fingerprints.
- The inference pipeline involves multiple VLM calls, resulting in higher latency.
- It assumes that users provide accurate semantic category information.
- Currently, it only supports a single reference image.
Related Work¶
- VLM Personalization: MyVLM (concept vectors), Yo'LLaVA (concept tokens), RAP (retrieval + instruction tuning)
- Attribute-Based Recognition: Zero-shot learning, compositional recognition
- Retrieval-Augmented Reasoning: RAG, ICL
Rating¶
- Novelty: ★★★★★ — Pioneering work in the training-free personalization paradigm.
- Value: ★★★★★ — New concepts can be added without training, making it highly practical and deployment-friendly.
- Experimental Thoroughness: ★★★★☆ — Three datasets + exhaustive ablations + new dataset.
- Writing Quality: ★★★★★ — Extremely clear description of methodology, with well-motivated steps for every part of the pipeline.