ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization

Conference: CVPR2026 arXiv: 2602.19575 Code: To be confirmed Area: Image Generation Keywords: personalized diffusion models, concept disentanglement, residual token optimization, Textual Inversion, LoRA, contrastive learning

TL;DR

This paper proposes ConceptPrism, which introduces image-level residual tokens and cross-image repulsion losses to automatically disentangle shared target concepts from image-specific residual information in personalized T2I diffusion models, achieving state-of-the-art performance on DreamBench across all three metrics: CLIP-T, DINO, and CLIP-I.

Background & Motivation

  1. Concept entanglement in personalized T2I: Methods such as Textual Inversion and DreamBooth learn concept tokens from a small number of images, but the learned tokens inevitably conflate the target concept (e.g., the appearance of a specific dog) with image-specific information (e.g., background, pose, and lighting).
  2. Concrete consequences of entanglement: When generating new scenes, residual information "leaks" into the output—for instance, indoor background elements from training images may appear in "a [V] dog on the beach"—degrading text alignment and reducing generation diversity.
  3. Limitations of existing disentanglement methods: Break-A-Scene requires segmentation mask annotations; Custom Diffusion only indirectly mitigates entanglement by constraining the fine-tuned parameters; Cones requires manual specification of concept-corresponding layers—all relying on additional supervision or hand-crafted priors.
  4. Cross-image contrast as a disentanglement signal: Different images of the same concept share target information while each carrying unique residual information; cross-image contrastive learning can naturally separate shared from image-specific components without any extra annotation.
  5. Information allocation in token space: When multiple tokens are learned without explicit constraints, all tokens redundantly encode the same information; a mechanism is needed to ensure different tokens serve distinct roles.

Core Problem

How can a clean concept representation be learned from a small number of reference images—without additional annotations—such that it retains only the shared target concept while stripping away image-specific residual information (background, pose, lighting, etc.)?

Method

Overall Architecture

ConceptPrism defines two types of learnable tokens: a single shared target token \(t_{target}\) (encoding the cross-image shared concept) and per-image residual tokens \(t_{residual}^{(i)}\) (absorbing the image-specific information of the \(i\)-th image). Joint optimization via a reconstruction loss and a repulsion loss enables automatic concept disentanglement.

Token Definition and Initialization

  • Target token \(t_{target}\): Randomly initialized and shared across all images, responsible for learning a clean representation of the target concept. Random initialization creates an "information vacuum," which is filled with cross-image shared concept information driven by the reconstruction loss.
  • Residual tokens \(\{t_{residual}^{(i)}\}_{i=1}^N\): One token per reference image, initialized with the CLIP embedding of a descriptive caption for that image. Captions are automatically generated by BLIP-2 (e.g., "a photo of a dog sitting on a couch"), providing rich image-level initial information.
  • Asymmetric initialization is key: The target token learns shared signals from scratch, while residual tokens start from image captions and discard the shared components; the two are complementary (a minimal initialization sketch follows this list).
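
A minimal PyTorch sketch of this asymmetric initialization, assuming a CLIP ViT-L/14 text encoder from transformers and BLIP-2 captions generated beforehand. The function name, the 0.02 init scale, and the mean-pooling of the caption embedding are illustrative assumptions, not the authors' code.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

def init_tokens(captions, model_name="openai/clip-vit-large-patch14", emb_dim=768):
    """Asymmetric init: random target token, caption-initialized residual tokens."""
    tokenizer = CLIPTokenizer.from_pretrained(model_name)
    text_encoder = CLIPTextModel.from_pretrained(model_name)

    # Target token: random init ("information vacuum"), shared across all images.
    t_target = torch.nn.Parameter(torch.randn(emb_dim) * 0.02)

    # Residual tokens: one per reference image, each initialized from the pooled
    # CLIP embedding of that image's auto-generated caption (e.g. from BLIP-2).
    residual_tokens = []
    for caption in captions:
        ids = tokenizer(caption, return_tensors="pt").input_ids
        with torch.no_grad():
            emb = text_encoder(ids).last_hidden_state.mean(dim=1).squeeze(0)
        residual_tokens.append(torch.nn.Parameter(emb.clone()))

    return t_target, torch.nn.ParameterList(residual_tokens)
```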

Reconstruction Loss \(\mathcal{L}_{recon}\)

The conditioning "[\(t_{target}\)] with [\(t_{residual}^{(i)}\)]" should reconstruct the \(i\)-th reference image \(x^{(i)}\):

\[\mathcal{L}_{recon} = \mathbb{E}_{i, t, \epsilon} \left[ \| \epsilon - \epsilon_\theta(z_t^{(i)}, c_{target+residual}^{(i)}) \|^2 \right]\]

where \(z_t^{(i)}\) is the noised \(i\)-th image and \(c_{target+residual}^{(i)}\) is the text conditioning incorporating both token types. This loss ensures that the target and residual tokens together fully encode the image information.
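
As a rough illustration, \(\mathcal{L}_{recon}\) can be written as a standard noise-prediction loss against a diffusers-style U-Net and scheduler. Here `unet`, `scheduler`, and the conditioning tensor are assumed to be set up elsewhere, and all names are hypothetical rather than taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def recon_loss(unet, scheduler, latents_i, cond_target_plus_residual_i):
    """|| eps - eps_theta(z_t^(i), c_{target+residual}^(i)) ||^2 for image i."""
    noise = torch.randn_like(latents_i)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (latents_i.shape[0],), device=latents_i.device)
    z_t = scheduler.add_noise(latents_i, noise, timesteps)  # noised i-th image
    noise_pred = unet(
        z_t, timesteps,
        encoder_hidden_states=cond_target_plus_residual_i).sample
    return F.mse_loss(noise_pred, noise)
```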

Repulsion Loss \(\mathcal{L}_{excl}\) (Core Contribution)

This loss forces residual tokens to discard shared concept information and retain only image-specific content. The intuition: if \(t_{residual}^{(i)}\) still encodes shared concept information, conditioning on it while denoising a different image \(x^{(j)}\) (\(j \neq i\)) will push the prediction away from the unconditional output; conversely, if the residual token carries no shared information, it should have no effect on other images and its prediction should match the unconditional one.

\[\mathcal{L}_{excl} = \mathbb{E}_{i, j \neq i, t, \epsilon} \left[ \| \epsilon_\theta(z_t^{(j)}, c_{residual}^{(i)}) - \epsilon_\theta(z_t^{(j)}, \varnothing) \|^2 \right]\]
  • \(c_{residual}^{(i)}\) uses only the residual token of image \(i\) as conditioning.
  • \(\varnothing\) denotes the unconditional (empty text) baseline.
  • \(j \neq i\) is critical: cross-image noise samples ensure that what is measured is "concept information leakage" rather than "image-specific information matching."
  • Minimizing this loss is equivalent to minimizing \(\text{KL}(p(x|c_{residual}^{(i)}) \| p(x))\), driving the residual token's conditional distribution toward the unconditional distribution (a code sketch follows this list).
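
Under the same assumptions as the reconstruction sketch above, the repulsion term pairs image \(i\)'s residual conditioning with noised latents of a different image \(j\) and pulls its prediction toward the unconditional (empty-prompt) prediction. Treating the unconditional branch as a fixed target is a simplification of mine, not something the paper specifies.

```python
import torch
import torch.nn.functional as F

def repulsion_loss(unet, scheduler, latents_j, cond_residual_i, cond_uncond):
    """Penalize any effect of image i's residual token on a different image j."""
    noise = torch.randn_like(latents_j)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (latents_j.shape[0],), device=latents_j.device)
    z_t = scheduler.add_noise(latents_j, noise, timesteps)  # noised j-th image, j != i

    pred_residual = unet(z_t, timesteps, encoder_hidden_states=cond_residual_i).sample
    with torch.no_grad():  # unconditional baseline, treated as a fixed target here
        pred_uncond = unet(z_t, timesteps, encoder_hidden_states=cond_uncond).sample
    return F.mse_loss(pred_residual, pred_uncond)
```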

Total Loss

\[\mathcal{L}_{total} = \mathcal{L}_{recon} + \lambda \mathcal{L}_{excl}\]

Two-Stage Optimization

  1. Token optimization stage (200 steps): The U-Net parameters are frozen; only the embeddings of \(t_{target}\) and \(\{t_{residual}^{(i)}\}\) are optimized. This stage rapidly learns a coarse-grained concept representation.
  2. LoRA fine-tuning stage (120 steps): LoRA is applied to the attention layers of the U-Net, jointly fine-tuning the LoRA parameters and token embeddings. LoRA provides model-level fine-grained adaptation to enhance concept fidelity (a schedule sketch follows below).
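
A condensed sketch of this schedule, assuming PEFT-style LoRA injection via diffusers' `add_adapter`. The step counts and \(\lambda = 0.5\) follow the paper; the learning rates, LoRA rank, target-module names, and the `compute_total_loss` helper (built from the earlier sketches) are assumptions.

```python
import torch
from peft import LoraConfig

def compute_total_loss():
    # L_recon + lambda * L_excl, reusing the sketches above; the latents and
    # conditioning tensors are assumed to be sampled elsewhere in the loop.
    return (recon_loss(unet, scheduler, latents_i, cond_target_plus_residual_i)
            + 0.5 * repulsion_loss(unet, scheduler, latents_j, cond_residual_i, cond_uncond))

# Stage 1 (200 steps): freeze the U-Net, optimize only the token embeddings.
unet.requires_grad_(False)
opt = torch.optim.AdamW([t_target, *residual_tokens], lr=5e-4)  # lr is illustrative
for _ in range(200):
    opt.zero_grad()
    compute_total_loss().backward()
    opt.step()

# Stage 2 (120 steps): add LoRA to the attention layers, train LoRA + tokens jointly.
unet.add_adapter(LoraConfig(r=4, target_modules=["to_q", "to_k", "to_v", "to_out.0"]))
lora_params = [p for p in unet.parameters() if p.requires_grad]
opt = torch.optim.AdamW([t_target, *residual_tokens, *lora_params], lr=1e-4)
for _ in range(120):
    opt.zero_grad()
    compute_total_loss().backward()
    opt.step()
```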

Inference

Only \(t_{target}\) is used at inference time (all residual tokens are discarded), combined with arbitrary text prompts to generate new images. Since \(t_{target}\) is disentangled, generated results contain only the target concept without any residual information leakage.
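
A sketch of inference with only the disentangled target token, assuming a Stable Diffusion v1.5 pipeline and a hypothetical pseudo-word `<target>`; loading the stage-2 LoRA weights into the U-Net is omitted for brevity.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

# Register the learned target embedding under a placeholder token; the residual
# tokens are simply discarded at inference time.
pipe.tokenizer.add_tokens("<target>")
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))
token_id = pipe.tokenizer.convert_tokens_to_ids("<target>")
pipe.text_encoder.get_input_embeddings().weight.data[token_id] = \
    t_target.data.to(dtype=torch.float16, device="cuda")

image = pipe("a photo of a <target> dog on the beach").images[0]
image.save("target_on_beach.png")
```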

Key Experimental Results

Dataset and Setup

  • DreamBench: 30 subjects, 4–6 reference images per subject, 25 text prompts.
  • Concept types: object (specific objects), style (artistic styles), pose (body poses), etc.
  • Evaluation metrics: CLIP-T (text alignment), DINO (subject fidelity), CLIP-I (image similarity).
  • Baselines: Textual Inversion, DreamBooth, Custom Diffusion, Break-A-Scene, SVDiff, ELITE, Cones, P+.

Main Results

| Method | CLIP-T↑ | DINO↑ | CLIP-I↑ |
| --- | --- | --- | --- |
| Textual Inversion | 0.321 | 0.154 | 0.305 |
| DreamBooth | 0.340 | 0.189 | 0.332 |
| Custom Diffusion | 0.338 | 0.183 | 0.328 |
| Break-A-Scene | 0.335 | 0.178 | 0.322 |
| SVDiff | 0.331 | 0.171 | 0.319 |
| P+ | 0.342 | 0.192 | 0.341 |
| ConceptPrism | 0.357 | 0.210 | 0.353 |

ConceptPrism achieves the best performance across all three metrics. The highest CLIP-T indicates superior text alignment (the repulsion loss effectively reduces interference from residual information on text following); the highest DINO indicates superior concept fidelity (the target token precisely encodes the shared concept).

Analysis by Concept Type

| Concept Type | CLIP-T↑ | DINO↑ |
| --- | --- | --- |
| Object | 0.361 | 0.223 |
| Style | 0.349 | 0.185 |
| Pose | 0.352 | 0.198 |

ConceptPrism is effective across all three concept types—object, style, and pose—demonstrating that the disentanglement mechanism is general and not limited to specific concept categories.

Ablation Study

  • Removing \(\mathcal{L}_{excl}\): CLIP-T drops by 0.020 and DINO by 0.018, degenerating to standard multi-token learning where target and residual tokens encode redundant information.
  • \(j = i\) (non-cross repulsion): Performance degrades substantially, as noise from the same image is naturally correlated with its residual token, making it impossible to separate shared from image-specific information.
  • Removing residual tokens (target only): CLIP-T drops by 0.015; the target token is forced to encode all information, compromising concept purity.
  • Removing descriptive caption initialization: DINO drops by 0.012; randomly initialized residual tokens learn more slowly and fail to fully absorb residual information.
  • Removing the LoRA stage: DINO drops by 0.025; token optimization alone cannot capture fine-grained concept details.
  • Sensitivity to \(\lambda\): \(\lambda = 0.5\) is optimal; too small yields insufficient repulsion, while too large over-suppresses residual tokens and degrades reconstruction quality.

Qualitative Analysis

  • Visualizations show that ConceptPrism preserves precise characteristics of the target concept (e.g., a dog's coat color and breed features) in novel scenes while fully adhering to the text prompt.
  • In side-by-side comparisons, DreamBooth and Custom Diffusion leak indoor background elements from the training images into "beach" scenes, whereas ConceptPrism does not.
  • When residual tokens are used alone for generation, they produce blurry images unrelated to the target concept, validating the effectiveness of the repulsion loss.

Highlights & Insights

  • Elegant design of the repulsion loss: Cross-image comparison (\(j \neq i\)) forces residual tokens to discard shared information; this is theoretically equivalent to minimizing KL divergence, with clear motivation and concise implementation.
  • No additional annotations required: The method requires no segmentation masks, concept labels, or manual specification—learning disentanglement entirely from natural cross-image contrast, making it more practical than Break-A-Scene and Cones.
  • Sophisticated initialization strategy: The asymmetric design of random initialization for target tokens and descriptive caption initialization for residual tokens leverages the "information vacuum" principle to naturally guide information flow without complex optimization strategies.
  • Applicable to multiple concept types: Effective for objects, styles, and poses alike; the disentanglement mechanism is general rather than domain-specific.
  • Lightweight and efficient: 200-step token optimization plus 120-step LoRA fine-tuning totaling only 320 steps, far fewer than the full fine-tuning of DreamBooth.
  • Clear theoretical grounding: The repulsion loss is derived from KL divergence and further simplified to noise prediction matching, with a complete derivation.

Limitations & Future Work

  • Experiments are conducted only on Stable Diffusion v1.5; effectiveness on more recent architectures such as SDXL and SD3 has not been verified.
  • The repulsion loss requires at least two reference images (\(j \neq i\)); in the single-image setting it reduces to training without any repulsion term, limiting disentanglement capability.
  • The number of residual tokens is tied one-to-one to the number of reference images; token optimization overhead grows with the number of reference images.
  • Descriptive captions are automatically generated by BLIP-2, and their quality affects residual token initialization; captions for complex scenes (e.g., multiple overlapping objects) may be inaccurate.
  • The potential value of residual tokens themselves is unexplored—residual information (e.g., background style) could theoretically be leveraged independently, but the paper discards it at inference time.
  • No comparison is made against training-free personalization methods such as IP-Adapter, which offer significant efficiency advantages.

Comparison with Related Methods

  • vs. Textual Inversion: Textual Inversion encodes all information in a single token, making concept-residual disentanglement impossible; ConceptPrism's multi-token design with repulsion explicitly separates the two.
  • vs. DreamBooth: DreamBooth fully fine-tunes the U-Net for concept learning, achieving high fidelity but severe entanglement; ConceptPrism achieves a better balance between fidelity and disentanglement via LoRA and the repulsion loss.
  • vs. Custom Diffusion: Custom Diffusion only fine-tunes the K/V matrices of cross-attention to indirectly reduce entanglement—a parameter-restriction approach rather than explicit disentanglement; ConceptPrism directly optimizes a disentanglement objective via the repulsion loss.
  • vs. Break-A-Scene: Break-A-Scene requires segmentation mask annotations to separate foreground/background concepts, constituting supervised disentanglement; ConceptPrism requires no annotations and achieves self-supervised disentanglement through cross-image contrast.
  • vs. Cones: Cones requires manual specification of U-Net layers (at the neuron level) corresponding to each concept, relying on human priors; ConceptPrism's token-level disentanglement is more natural and fully automatic.

Rating

  • Novelty: ⭐⭐⭐⭐ — The residual token + repulsion loss disentanglement mechanism is the core contribution; the cross-image contrastive design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive comparison on DreamBench with multiple concept types and complete ablations, though limited to SD1.5.
  • Writing Quality: ⭐⭐⭐⭐ — The derivation from KL divergence to noise matching is clear and the figures are intuitive.
  • Value: ⭐⭐⭐⭐ — Addresses a core pain point in personalized T2I; highly practical and easy to integrate as a lightweight solution.