When Identities Collapse: A Stress-Test Benchmark for Multi-Subject Personalization¶
Conference: CVPR 2026 arXiv: 2603.26078 Code: None Area: Image Generation Keywords: multi-subject personalization, identity collapse, Subject Collapse Rate, DINOv2, stress test
TL;DR¶
This paper exposes the "identity collapse" bottleneck in multi-subject personalization: three SOTA models (MOSAIC, XVerse, PSR) already reach 49–63% SCR at just 2 subjects, surging to 96–98% at 10 subjects. The paper proposes the DINOv2-based Subject Collapse Rate (SCR) metric to replace the inadequate CLIP-I, and constructs a systematic benchmark covering 2–10 subjects × 3 scene types.
Background & Motivation¶
- Background: Multi-subject personalization requires diffusion models to simultaneously generate multiple user-specified individuals (persons/animals) in a single image. SOTA methods such as MOSAIC, XVerse, and PSR claim to support multi-subject generation.
- Limitations of Prior Work: (1) As the number of subjects increases, generated individuals "collapse" into a shared appearance, losing their distinct identity features; (2) the standard metric CLIP-I is insensitive to identity collapse—it maintains high values even when all subjects converge to the same person; (3) there is no systematic stress-test benchmark for multi-subject generation.
- Key Challenge: A substantial gap exists between models' claimed capabilities ("supports multi-subject") and their actual performance, a gap concealed by the lack of appropriate metrics.
- Goal: (1) Design a metric that accurately quantifies identity collapse; (2) construct a systematic multi-subject stress-test benchmark; (3) quantify the true capability boundaries of existing SOTA methods.
- Key Insight: CLIP is a language–vision alignment model that excels at semantic matching but is insensitive to fine-grained structural differences (e.g., facial distinctions between two individuals); DINOv2 is a self-supervised visual model that captures finer structural correspondences.
- Core Idea: Replace CLIP with DINOv2 for identity similarity measurement and define a thresholded SCR metric to quantify per-subject collapse rate.
Method¶
Overall Architecture¶
Construct a diverse subject pool (XVerse + COSMISC datasets) → design 75 prompts (5 subject-count levels × 3 scene types × 5 prompts) → generate 225 images per model using 3 SOTA models → evaluate with DINOv2 Score and SCR → compare against CLIP-I/CLIP-T to analyze metric validity.
Key Designs¶
- Subject Collapse Rate (SCR)
  - Function: Quantifies identity preservation failure at the individual subject level.
  - Mechanism: For each subject, compute the DINOv2 cosine similarity between the generated image and that subject's reference image; if the similarity falls below a threshold \(\tau\), the subject is deemed "collapsed": $$\text{SCR}_{@\tau} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\cos\!\left(\text{DINOv2}(I_{\text{gen}}),\ \text{DINOv2}(I_{\text{ref}}^{(i)})\right) < \tau\right], \quad \tau \in \{0.4, 0.5, 0.6\}$$
  - Design Motivation: CLIP-I computes image-level mean similarity, which is inflated by subjects whose identity is preserved and thereby masks collapsed subjects. SCR evaluates each subject independently so that every collapse is counted.
- DINOv2 Score as a Replacement for CLIP-I
  - Function: Provides a more accurate measurement of identity preservation.
  - Mechanism: $$\text{DINOv2 Score} = \frac{1}{N}\sum_{i=1}^{N} \cos\!\left(\text{DINOv2}(I_{\text{gen}}),\ \text{DINOv2}(I_{\text{ref}}^{(i)})\right)$$ using DINOv2 (a self-supervised ViT) in place of CLIP (a language-supervised ViT).
  - Design Motivation: DINOv2's self-supervised training objective makes it more sensitive to part-level correspondences and structural geometry, enabling it to capture facial detail differences. CLIP-I assigns high scores even when all subjects collapse into a single identity.
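A hedged sketch of the score computation, with feature extraction from an actual DINOv2 backbone omitted (we feed precomputed embeddings; the toy vectors below only stand in for real [CLS] features):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def dinov2_score(gen_feat, ref_feats):
    """Mean cosine similarity between the generated image's embedding
    and each of the N reference subjects' embeddings."""
    return float(np.mean([cosine(gen_feat, r) for r in ref_feats]))

# Toy 2-D embeddings: one subject matches perfectly, one is orthogonal
gen = [1.0, 0.0]
refs = [[1.0, 0.0], [0.0, 1.0]]
print(dinov2_score(gen, refs))  # 0.5
```

Note that this averaged score can still be pulled up by well-preserved subjects, which is exactly why the paper pairs it with the per-subject SCR.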
- Multi-Dimensional Benchmark Design
  - Function: Systematically tests multi-subject capability under varying conditions.
  - Mechanism: 5 subject counts (2/4/6/8/10) × 3 scene types (neutral side-by-side / occlusion / interaction) × 5 prompts per condition × 3 random seeds = 225 evaluation images per model.
  - Design Motivation: Existing evaluations test only 2-subject, non-occluded scenarios, which cannot expose scalability bottlenecks or failures in complex scenes.
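The evaluation grid can be enumerated directly; a sketch (variable names are ours) confirming the 225-image count per model:

```python
from itertools import product

subject_counts = [2, 4, 6, 8, 10]                           # 5 levels
scene_types = ["side-by-side", "occlusion", "interaction"]  # 3 types
prompt_ids = range(5)                                       # 5 prompts per condition
seeds = range(3)                                            # 3 random seeds

# One entry per generated image for a single model
grid = list(product(subject_counts, scene_types, prompt_ids, seeds))
print(len(grid))  # 225 images per model (75 unique prompts x 3 seeds)
```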
Loss & Training¶
This paper presents a benchmark study and does not involve model training. Evaluation uses the official inference configurations of MOSAIC, XVerse, and PSR.
Key Experimental Results¶
Main Results¶
| # Subjects | MOSAIC SCR↓ | XVerse SCR↓ | PSR SCR↓ | MOSAIC DINOv2↑ | XVerse DINOv2↑ | PSR DINOv2↑ |
|---|---|---|---|---|---|---|
| 2 | 48.9% | 58.9% | 63.3% | 0.425 | 0.355 | 0.325 |
| 4 | 81.7% | 80.0% | 85.6% | 0.235 | 0.211 | 0.189 |
| 6 | 90.7% | 93.0% | 94.1% | 0.164 | 0.142 | 0.136 |
| 8 | 94.7% | 94.4% | 95.0% | 0.126 | 0.123 | 0.117 |
| 10 | 96.0% | 96.4% | 97.8% | 0.110 | 0.104 | 0.101 |
Ablation Study¶
| Metric Comparison | Change from 2 → 10 Subjects | Notes |
|---|---|---|
| CLIP-T (MOSAIC) | 0.261 → 0.300 | Counter-intuitive increase—unable to detect collapse |
| CLIP-I (MOSAIC) | 0.695 → 0.504 | 27% decrease, yet retains a seemingly reasonable high value |
| DINOv2 (MOSAIC) | 0.425 → 0.110 | 74% decrease, accurately reflecting collapse |
| SCR (MOSAIC) | 48.9% → 96.0% | Most intuitive quantification of collapse |
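The percentage drops quoted in the table follow from simple relative change; a quick check (helper name ours):

```python
def relative_drop(start, end):
    """Relative decrease from start to end, as a fraction of start."""
    return (start - end) / start

print(round(relative_drop(0.695, 0.504), 2))  # CLIP-I (MOSAIC): 0.27, a 27% drop
print(round(relative_drop(0.425, 0.110), 2))  # DINOv2 (MOSAIC): 0.74, a 74% drop
```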
Key Findings¶
- CLIP-T is completely invalid: CLIP-T scores increase as subject count grows (PSR: 0.274→0.309), because models produce more "generic" images rather than personalized ones.
- Severe collapse already occurs at 2 subjects: MOSAIC achieves the best performance, yet its SCR still reaches 48.9%—nearly half of all subject identities are lost.
- Collapse worsens steeply with scale: SCR rises by over 20 percentage points from 2 to 4 subjects for every model (over 30 for MOSAIC), and all models reach at least 80% SCR at 4 subjects and beyond.
- MOSAIC is relatively the strongest: It achieves the highest DINOv2 Score across all subject counts, but absolute performance remains poor.
Highlights & Insights¶
- An "emperor's new clothes" revelation: Multi-subject personalization has been treated as a largely solved problem, yet SCR demonstrates that even SOTA methods can barely handle 2 subjects—a finding that may redefine research priorities in this field.
- A cautionary tale in metric design: The failure of CLIP-I in multi-subject scenarios is a textbook case of how a widely adopted metric can completely mislead evaluation under specific conditions.
- General insight that DINOv2 > CLIP: For tasks requiring fine-grained visual discrimination, self-supervised visual features may be more reliable than language-aligned features.
Limitations & Future Work¶
- Only three models are evaluated, providing limited coverage (e.g., IP-Adapter and PhotoMaker are not tested).
- The study focuses primarily on persons and animals; collapse patterns for rigid objects (e.g., furniture, vehicles) may differ.
- Scene types are controlled via prompts rather than 3D layouts, so occlusion severity depends on model interpretation.
- DINOv2 Score still performs image-level comparison; the ideal approach would apply instance segmentation prior to per-subject evaluation.
- The paper diagnoses the problem without providing solutions—how to improve attention mechanisms to prevent collapse remains an open question for future work.
Related Work & Insights¶
- vs. CLIP-I evaluation standard: The finding that CLIP-I fails in multi-subject settings may have implications for all personalization work that uses CLIP-I—reassessment with DINOv2 may be warranted.
- vs. DreamBooth/Textual Inversion: Single-subject personalization methods do not encounter the collapse issue during evaluation, but SCR would be equally applicable when they are extended to multi-subject settings.
- vs. MOSAIC: As the strongest baseline (SCR = 48.9% at 2 subjects), MOSAIC's attention separation design partially mitigates collapse but falls far short of a complete solution.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First systematic exposure of multi-subject identity collapse with a targeted metric.
- Experimental Thoroughness: ⭐⭐⭐⭐ 3 models × 5 subject counts × 3 scenes × 3 seeds, though model variety is limited.
- Writing Quality: ⭐⭐⭐⭐⭐ Problem formulation is clear and data visualization is intuitive.
- Value: ⭐⭐⭐⭐⭐ Likely to reshape evaluation standards and research directions in multi-subject personalization.