MMPB: It's Time for Multi-Modal Personalization
Conference: NeurIPS 2025 · arXiv: 2509.22820 · Code: https://aidaslab.github.io/MMPB (project page) · Area: Recommender Systems · Keywords: VLM, Personalization, Benchmark, Visual Question Answering, Cold-start
TL;DR
This paper introduces MMPB, the first benchmark for evaluating VLM personalization, comprising 111 personalizable concepts, over 10k image-text QA pairs, and 15 task types. An evaluation of 23 VLMs shows that even the strongest model, GPT-4o, performs poorly on personalization, exposing critical limitations in preference reasoning and visual-cue utilization, as well as a conflict between safety alignment and personalization.
Background & Motivation
Background: VLMs (e.g., GPT-4o, LLaVA) are widely used for general visual question answering, but follow a one-size-fits-all paradigm—responding identically to all users without adapting to individual identity, preferences, or history.
Limitations of Prior Work: (a) Existing VQA benchmarks focus solely on general knowledge (commonsense, science, etc.) and do not evaluate personalization capabilities; (b) Prior personalization works (e.g., MyVLM, Yo'LLaVA) are small-scale (29–95 concepts), unsystematic, and exclude preference reasoning; (c) A unified evaluation framework and cold-start setting are absent.
Key Challenge: Strong performance on general tasks does not imply effectiveness in personalized scenarios. Personalization requires models to understand user-specific visual concepts and preferences—capabilities not covered by general-purpose training.
Goal: To establish a comprehensive and systematic evaluation benchmark for VLM personalization.
Key Insight: Defining four core attributes of personalization (Awareness, Appropriateness, Coherency, Persistency) and designing corresponding task types and evaluation protocols.
Core Idea: To reveal the true state and primary bottlenecks of VLM personalization capabilities through systematic benchmarking.
Method
Overall Architecture
MMPB evaluation proceeds in three stages:

1. Concept Injection: personalizable concepts are injected into the VLM via reference images or textual descriptions.
2. Multi-turn Dialogue: concept retention is tested through general conversation.
3. Personalized Query: the model is tested on new images to assess whether the injected concepts can be applied.
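The three stages above can be sketched as a minimal evaluation loop. The `VLM.chat` interface, message format, and field names below are illustrative assumptions for this summary, not the paper's actual harness.

```python
class VLM:
    """Stand-in for any chat-style vision-language model (hypothetical interface)."""
    def chat(self, messages):
        # messages: list of {"role": ..., "text": ..., "images": [...]}
        raise NotImplementedError

def evaluate_concept(model, concept, filler_turns, queries):
    history = []
    # Stage 1: concept injection via reference images or a textual description
    # (cold-start setting: only 2 reference images are provided).
    history.append({"role": "user",
                    "text": f"This is {concept['name']}. {concept['description']}",
                    "images": concept["reference_images"][:2]})
    history.append({"role": "assistant", "text": model.chat(history)})
    # Stage 2: multi-turn filler dialogue to test retention (0 or 10 turns).
    for turn in filler_turns:
        history.append({"role": "user", "text": turn, "images": []})
        history.append({"role": "assistant", "text": model.chat(history)})
    # Stage 3: personalized queries on new images, scored as 4-option MCQ.
    correct = 0
    for q in queries:
        history_q = history + [{"role": "user", "text": q["question"],
                                "images": [q["image"]]}]
        answer = model.chat(history_q)
        correct += answer.strip().upper().startswith(q["gold_option"])
    return correct / len(queries)
```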
Key Designs
- Concept Taxonomy:
- 111 concepts across 4 categories: persons, animals, objects, and characters.
- Each concept includes 5 reference images and 4-level textual descriptions (simple / moderate / detailed / extended).
- Person-category concepts are additionally equipped with preference information: 5 major domains × 6 sub-domains = 30 preference sub-domains.
- Task Types:
- Awareness: Whether a concept is recognized in positive images, distinguishing single-entity vs. multi-entity settings.
- Appropriateness: Whether irrelevant concepts are correctly suppressed in negative images; animal categories further distinguish same-species vs. different-species.
- Coherency: Whether responses are generated consistently with the concept (4-option MCQ).
- Persistency: Concept retention tested through multi-turn dialogue.
- The first 3 task types × 5 concept scenarios = 15 evaluation tasks; Persistency is assessed via the multi-turn dialogue setting rather than as a separate task.
- Quality Control:
- To prevent concept-only solvability: at least one distractor is plausible for the concept but inconsistent with the image.
- To prevent image-only solvability: at least one distractor is plausible for the image but inconsistent with the concept.
- Options are shuffled to mitigate positional bias.
- Each question is annotated by at least 3 annotators and retained only upon majority agreement.
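These quality-control checks can be sketched as a filter over candidate questions. All field names (`fits_concept`, `annotator_answers`, etc.) are hypothetical placeholders, not the paper's actual data schema.

```python
import random
from collections import Counter

def passes_quality_control(question, rng=random.Random(0)):
    """Return shuffled options if the question passes all checks, else None."""
    distractors = question["distractors"]
    # Anti-concept-only: at least one distractor is plausible for the
    # concept but inconsistent with the image.
    has_concept_trap = any(d["fits_concept"] and not d["fits_image"]
                           for d in distractors)
    # Anti-image-only: at least one distractor is plausible for the
    # image but inconsistent with the concept.
    has_image_trap = any(d["fits_image"] and not d["fits_concept"]
                         for d in distractors)
    # At least 3 annotators, retained only upon majority agreement.
    votes = question["annotator_answers"]
    answer, count = Counter(votes).most_common(1)[0]
    majority_ok = (len(votes) >= 3 and count > len(votes) / 2
                   and answer == question["gold"])
    if not (has_concept_trap and has_image_trap and majority_ok):
        return None
    # Shuffle options to mitigate positional bias.
    options = [question["gold"]] + [d["text"] for d in distractors]
    rng.shuffle(options)
    return options
```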
Evaluation Protocol
- Cold-start setting: only moderate-level textual descriptions or 2 reference images are provided.
- Two dialogue settings: 0-turn and 10-turn.
- Evaluation metric: overall accuracy.
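Since all tasks are MCQs, scoring reduces to accuracy over per-question correctness; the record structure below (`task`/`correct` fields) is an assumption for illustration.

```python
from collections import defaultdict

def overall_accuracy(results):
    """results: list of {"task": str, "correct": bool} records.

    Returns overall accuracy plus a per-task-type breakdown.
    """
    per_task = defaultdict(list)
    for r in results:
        per_task[r["task"]].append(r["correct"])
    breakdown = {task: sum(v) / len(v) for task, v in per_task.items()}
    overall = sum(r["correct"] for r in results) / len(results)
    return overall, breakdown
```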
Key Experimental Results
Main Results — Personalization Evaluation of 23 VLMs
| Model | Awareness | Appropriateness | Coherency | Overall |
|---|---|---|---|---|
| GPT-4o | Upper-mid | Good | Poor | ~60% |
| Claude-Sonnet | Mid | Good | Poor | ~55% |
| InternVL2.5-78B | Mid | Mid | Mid | ~50% |
| LLaVA-NeXT | Low | Low | Low | ~40% |
Ablation Study
| Configuration | Key Finding |
|---|---|
| Text vs. image injection | 1 image ≈ 3 textual keywords, indicating models struggle to leverage visual cues |
| 0-turn vs. 10-turn dialogue | Performance degrades significantly after 10 turns; intermediate concepts are prone to forgetting |
| Simple vs. detailed description | More detailed descriptions do not consistently improve performance; models may be adversely affected by long contexts |
Key Findings
- Even GPT-4o struggles on preference reasoning tasks, which require abductive reasoning capabilities.
- Safety alignment in closed-source models impedes personalization: responses involving persons are frequently refused.
- VLMs fail to effectively leverage visual cues for personalization: the small performance gap between image- and text-based injection suggests shallow visual understanding.
- Mid-sequence forgetting occurs in multi-turn dialogue: concepts injected at intermediate positions are most susceptible to being forgotten.
- Personalization bias: models exhibit significantly weaker personalization for certain concept types (e.g., persons) compared to others.
Highlights & Insights
- First systematic VLM personalization benchmark: fills a critical gap in existing benchmarks, with a scale (111 concepts, 10k+ QA pairs, 15 task types) far surpassing prior work.
- Reveals the conflict between safety alignment and personalization: closed-source models refuse person-related personalization for safety reasons, representing an important policy-capability trade-off.
- Four-level textual description design: elegantly enables future research to explore optimal personalization strategies at varying levels of granularity.
- Rigorous quality control: the anti-concept-only and anti-image-only design ensures the benchmark genuinely measures multimodal personalization reasoning.
Limitations & Future Work
- Static concept assumption: temporal changes in concepts (e.g., evolving user appearance or preference drift) are not considered.
- Fixed cold-start setting: only 2 reference images / moderate-level descriptions are tested; the effect of larger injection volumes remains unexplored.
- MCQ-only format: despite the possibility of converting to open-ended evaluation, the current benchmark is limited to multiple-choice questions.
- Future directions: incorporating concept temporality; exploring post-hoc fine-tuning using all 5 reference images; extending to additional modalities.
Related Work & Insights
- vs. MyVLM/Yo'LLaVA: These works contain only 29–40 concepts and exclude preference reasoning; MMPB comprehensively surpasses them.
- vs. MC-LLaVA: Covers 95 concepts but lacks systematic evaluation, preference modeling, and multi-turn testing.
- vs. general VQA benchmarks: ScienceQA, MMBench, and similar benchmarks do not evaluate personalization.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First comprehensive VLM personalization benchmark, filling a significant gap.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 23 models + 15 task types + multi-dimensional analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, rigorous formalization, in-depth analysis.
- Value: ⭐⭐⭐⭐⭐ Significant contribution to advancing VLM personalization research.