When Identities Collapse: A Stress-Test Benchmark for Multi-Subject Personalization

Conference: CVPR 2026 | arXiv: 2603.26078 | Code: None | Area: Image Generation | Keywords: multi-subject personalization, identity collapse, Subject Collapse Rate, DINOv2, stress test

TL;DR

This paper exposes the "identity collapse" bottleneck in multi-subject personalization: even the best of three SOTA models (MOSAIC, XVerse, PSR) already reaches ~49% SCR at just 2 subjects, and all three climb to ~96-98% at 10 subjects. The paper proposes the DINOv2-based Subject Collapse Rate (SCR) metric to replace the insensitive CLIP-I, and constructs a systematic stress-test benchmark covering 2–10 subjects × 3 scene types.

Background & Motivation

  1. Background: Multi-subject personalization requires diffusion models to simultaneously generate multiple user-specified individuals (persons/animals) in a single image. SOTA methods such as MOSAIC, XVerse, and PSR claim to support multi-subject generation.
  2. Limitations of Prior Work: (1) As the number of subjects increases, generated individuals "collapse" into a shared appearance, losing their distinct identity features; (2) the standard metric CLIP-I is insensitive to identity collapse—it maintains high values even when all subjects converge to the same person; (3) there is no systematic stress-test benchmark for multi-subject generation.
  3. Key Challenge: A substantial gap exists between models' claimed capabilities ("supports multi-subject") and their actual performance, a gap concealed by the lack of appropriate metrics.
  4. Goal: (1) Design a metric that accurately quantifies identity collapse; (2) construct a systematic multi-subject stress-test benchmark; (3) quantify the true capability boundaries of existing SOTA methods.
  5. Key Insight: CLIP is a language–vision alignment model that excels at semantic matching but is insensitive to fine-grained structural differences (e.g., facial distinctions between two individuals); DINOv2 is a self-supervised visual model that captures finer structural correspondences.
  6. Core Idea: Replace CLIP with DINOv2 for identity similarity measurement and define a thresholded SCR metric to quantify per-subject collapse rate.

Method

Overall Architecture

Construct a diverse subject pool (XVerse + COSMISC datasets) → design 75 prompts (5 subject-count levels × 3 scene types × 5 prompts) → generate 225 images per model using 3 SOTA models → evaluate with DINOv2 Score and SCR → compare against CLIP-I/CLIP-T to analyze metric validity.
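
For concreteness, the evaluation grid can be enumerated as below. This is a minimal sketch of the layout described above, not the authors' code; the seed values and string identifiers are placeholders.

```python
from itertools import product

# Benchmark grid: 5 subject-count levels x 3 scene types x 5 prompts = 75 prompts,
# each rendered with 3 random seeds -> 225 evaluation images per model.
SUBJECT_COUNTS = [2, 4, 6, 8, 10]
SCENE_TYPES = ["neutral side-by-side", "occlusion", "interaction"]
PROMPT_IDS = range(5)
SEEDS = [0, 1, 2]  # placeholder values; the paper does not list its seeds

grid = list(product(SUBJECT_COUNTS, SCENE_TYPES, PROMPT_IDS, SEEDS))
assert len(grid) == 225  # images per model; x3 models = 675 images in total
```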

Key Designs

  1. Subject Collapse Rate (SCR)

     • Function: Quantifies identity-preservation failure at the level of individual subjects.
     • Mechanism: For each subject, compute the DINOv2 cosine similarity between the generated image and that subject's reference image; if the similarity falls below a threshold \(\tau\), the subject is deemed "collapsed":

     $$\text{SCR}_{@\tau} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\cos\!\left(\text{DINOv2}(I_{\text{gen}}),\ \text{DINOv2}(I_{\text{ref}}^{(i)})\right) < \tau\right], \qquad \tau \in \{0.4, 0.5, 0.6\}$$

     • Design Motivation: CLIP-I computes an image-level mean similarity, which is inflated by subjects whose identity is preserved and thereby masks collapsed ones. SCR evaluates each subject independently so that every collapse is counted (see the sketch after this list).

  2. DINOv2 Score as a Replacement for CLIP-I

     • Function: Provides a more accurate measurement of identity preservation.
     • Mechanism: Use DINOv2 (a self-supervised ViT) in place of CLIP (a language-trained ViT):

     $$\text{DINOv2 Score} = \frac{1}{N}\sum_{i=1}^{N} \cos\!\left(\text{DINOv2}(I_{\text{gen}}),\ \text{DINOv2}(I_{\text{ref}}^{(i)})\right)$$

     • Design Motivation: DINOv2's self-supervised training objective makes it more sensitive to part-level correspondences and structural geometry, enabling it to capture facial-detail differences. CLIP-I assigns high scores even when all subjects collapse into a single identity.

  3. Multi-Dimensional Benchmark Design

     • Function: Systematically tests multi-subject capability under varying conditions.
     • Mechanism: 5 subject counts (2/4/6/8/10) × 3 scene types (neutral side-by-side / occlusion / interaction) × 5 prompts per condition × 3 random seeds = 225 evaluation images per model.
     • Design Motivation: Existing evaluations test only 2-subject, non-occluded scenarios, which cannot expose scalability bottlenecks or failures in complex scenes.
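
Below is a minimal sketch of how DINOv2 Score and SCR@τ could be computed, assuming DINOv2 features from Hugging Face transformers (facebook/dinov2-base) with CLS-token pooling; the paper does not specify the model variant or the pooling scheme, so treat both as assumptions rather than the authors' exact setup.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumption: DINOv2-base via Hugging Face; the paper's exact variant is unspecified.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def dinov2_embed(image: Image.Image) -> torch.Tensor:
    """Embed one image; CLS-token pooling is an assumption, not the paper's spec."""
    inputs = processor(images=image, return_tensors="pt")
    return model(**inputs).last_hidden_state[:, 0].squeeze(0)

def evaluate(gen: Image.Image, refs: list[Image.Image], tau: float = 0.5):
    """Return (DINOv2 Score, SCR@tau) for one generated image and N references."""
    e_gen = dinov2_embed(gen)
    sims = torch.stack([
        F.cosine_similarity(e_gen, dinov2_embed(ref), dim=0) for ref in refs
    ])
    dinov2_score = sims.mean().item()          # mean cosine similarity over subjects
    scr = (sims < tau).float().mean().item()   # fraction of subjects below threshold
    return dinov2_score, scr
```

With τ ∈ {0.4, 0.5, 0.6} as in the paper, the same per-subject similarities yield both metrics; only the thresholded count exposes individual collapses that the mean smooths over.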

Loss & Training

This paper presents a benchmark study and does not involve model training. Evaluation uses the official inference configurations of MOSAIC, XVerse, and PSR.

Key Experimental Results

Main Results

| # Subjects | MOSAIC SCR ↓ | XVerse SCR ↓ | PSR SCR ↓ | MOSAIC DINOv2 ↑ | XVerse DINOv2 ↑ | PSR DINOv2 ↑ |
|---|---|---|---|---|---|---|
| 2 | 48.9% | 58.9% | 63.3% | 0.425 | 0.355 | 0.325 |
| 4 | 81.7% | 80.0% | 85.6% | 0.235 | 0.211 | 0.189 |
| 6 | 90.7% | 93.0% | 94.1% | 0.164 | 0.142 | 0.136 |
| 8 | 94.7% | 94.4% | 95.0% | 0.126 | 0.123 | 0.117 |
| 10 | 96.0% | 96.4% | 97.8% | 0.110 | 0.104 | 0.101 |

Ablation Study

| Metric (MOSAIC) | Change, 2 → 10 Subjects | Notes |
|---|---|---|
| CLIP-T | 0.261 → 0.300 | Counter-intuitive increase; unable to detect collapse |
| CLIP-I | 0.695 → 0.504 | 27% decrease, yet retains a seemingly reasonable high value |
| DINOv2 | 0.425 → 0.110 | 74% decrease, accurately reflecting collapse |
| SCR | 48.9% → 96.0% | Most direct quantification of collapse |
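
To make the masking effect concrete, consider an illustrative toy example (the similarity values are invented for illustration, not taken from the paper): an averaged score can look acceptable while half the subjects have collapsed.

```python
# Invented per-subject similarities for 4 subjects: two preserved, two collapsed.
sims = [0.72, 0.18, 0.21, 0.69]

mean_score = sum(sims) / len(sims)                  # 0.45 -> looks "reasonable"
scr_at_04 = sum(s < 0.4 for s in sims) / len(sims)  # 0.50 -> half collapsed

print(f"mean similarity: {mean_score:.2f}, SCR@0.4: {scr_at_04:.0%}")
```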

Key Findings

  • CLIP-T is completely invalid: CLIP-T scores increase as subject count grows (PSR: 0.274→0.309), because models produce more "generic" images rather than personalized ones.
  • Severe collapse already occurs at 2 subjects: MOSAIC achieves the best performance, yet its SCR still reaches 48.9%—nearly half of all subject identities are lost.
  • Collapse worsens sharply with scale: SCR jumps by more than 30 percentage points from 2 to 4 subjects, and all three models reach 80%+ SCR from 4 subjects onward.
  • MOSAIC is relatively the strongest: It achieves the highest DINOv2 Score across all subject counts, but absolute performance remains poor.

Highlights & Insights

  • An "emperor's new clothes" revelation: Multi-subject personalization has been treated as a largely solved problem, yet SCR demonstrates that even SOTA methods can barely handle 2 subjects—a finding that may redefine research priorities in this field.
  • A cautionary tale in metric design: The failure of CLIP-I in multi-subject scenarios is a textbook case of how a widely adopted metric can completely mislead evaluation under specific conditions.
  • General insight that DINOv2 > CLIP: For tasks requiring fine-grained visual discrimination, self-supervised visual features may be more reliable than language-aligned features.

Limitations & Future Work

  • Only three models are evaluated, providing limited coverage (e.g., IP-Adapter and PhotoMaker are not tested).
  • The study focuses primarily on persons and animals; collapse patterns for rigid objects (e.g., furniture, vehicles) may differ.
  • Scene types are controlled via prompts rather than 3D layouts, so occlusion severity depends on model interpretation.
  • DINOv2 Score still performs image-level comparison; the ideal approach would apply instance segmentation prior to per-subject evaluation.
  • The paper diagnoses the problem without providing solutions—how to improve attention mechanisms to prevent collapse remains an open question for future work.

Comparison with Prior Work

  • vs. CLIP-I evaluation standard: The finding that CLIP-I fails in multi-subject settings has implications for all personalization work that reports it; reassessment with DINOv2-based metrics may be warranted.
  • vs. DreamBooth/Textual Inversion: Single-subject personalization methods do not face the collapse issue during evaluation, but SCR applies equally once they are extended to multi-subject settings.
  • vs. MOSAIC: As the strongest baseline (SCR = 48.9% at 2 subjects), MOSAIC's attention-separation design partially mitigates collapse but falls far short of a complete solution.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First systematic exposure of multi-subject identity collapse with a targeted metric.
  • Experimental Thoroughness: ⭐⭐⭐⭐ 3 models × 5 subject counts × 3 scenes × 5 prompts × 3 seeds, though model variety is limited.
  • Writing Quality: ⭐⭐⭐⭐⭐ Problem formulation is clear and data visualization is intuitive.
  • Value: ⭐⭐⭐⭐⭐ Likely to reshape evaluation standards and research directions in multi-subject personalization.