# Magic Insert: Style-Aware Drag-and-Drop

## Basic Information
- Conference: ICCV 2025
- arXiv: 2407.02489
- Project Page: MagicInsert.github.io
- Area: Other
- Keywords: style-aware personalization, object insertion, drag-and-drop editing, LoRA, Bootstrap Domain Adaptation, diffusion models
## TL;DR
This paper proposes Magic Insert, the first method to formally define and address the "style-aware drag-and-drop" problem: inserting a subject from an arbitrary style into a target image of a different style, such that the subject automatically adapts to the target style while being composited in a physically plausible manner. The core components are style-aware personalization (LoRA plus learned token embeddings for the subject, with IP-Adapter style injection at inference) and Bootstrap Domain Adaptation (adapting a real-image-trained insertion model to the stylized-image domain).
## Background & Motivation
- Problem Definition: Given a subject image \(x_s\) and a target image \(x_t\) (potentially of entirely different styles), generate \(\hat{x}_t\) such that: (1) the subject is inserted in a semantically consistent and physically plausible manner (including occlusion, shadows, and reflections); and (2) the inserted subject adopts the style of the target image while preserving its own identity and core attributes.
- Formalization: Learn a (stochastic) mapping \(h: \mathcal{I}_s \times \mathcal{I}_t \rightarrow \mathcal{I}_t\) producing \(\hat{x}_t = h(x_s, x_t)\), where \(\hat{x}_t \sim p(\hat{x}_t \mid x_t, x_s)\).
- Limitations of Prior Work:
- Pure inpainting pipelines (DreamBooth + StyleDrop + inpainting): computationally expensive and ineffective.
- Existing style-learning methods are fast but struggle to accurately capture fine-grained subject identity.
- Existing insertion models (e.g., ObjectDrop) are trained exclusively on real images and fail to generalize to stylized images.
- Naive copy-paste inpainting suffers from background corruption, incomplete insertion, and low output quality.
## Method

### Overall Architecture
Magic Insert decomposes the task into two sub-problems: (1) style-aware personalization—generating a style-matched subject; and (2) style-consistent insertion—realistically compositing the stylized subject into the target image.
### Style-Aware Personalization
**Step 1: Personalized Fine-Tuning**
LoRA weights \(\Delta_\theta\) and two text token embeddings \(e_1, e_2\) are jointly optimized:
- Two learned embeddings (rather than one) are used to achieve a better balance between subject fidelity and editability.
- LoRA learns identity in weight space; text embeddings reinforce identity representation in embedding space.
- Training configuration: 600 iterations, batch size 1, UNet lr=1e-5, text encoder lr=1e-3.
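To make the joint optimization concrete, here is a minimal training sketch under stated assumptions: it uses SD 1.5 via diffusers and peft for brevity (the paper builds on SDXL, which adds a second text encoder), the LoRA rank, token names, and data loader are illustrative, and the loss is the standard epsilon-prediction objective assumed from standard diffusion fine-tuning. This is a sketch, not the authors' code.

```python
# Illustrative sketch only: SD 1.5 stands in for the paper's SDXL backbone,
# and the data loader is a stub. Hyperparameters follow the section above.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDPMScheduler
from peft import LoraConfig

model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id)
unet, text_encoder, tokenizer, vae = pipe.unet, pipe.text_encoder, pipe.tokenizer, pipe.vae
noise_sched = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Register two new identity tokens (e1, e2) and locate their embedding rows.
tokenizer.add_tokens(["<s1>", "<s2>"])
text_encoder.resize_token_embeddings(len(tokenizer))
new_ids = tokenizer.convert_tokens_to_ids(["<s1>", "<s2>"])

# Freeze everything except the LoRA weights and the token-embedding table.
vae.requires_grad_(False)
unet.requires_grad_(False)
text_encoder.requires_grad_(False)
emb = text_encoder.get_input_embeddings()
emb.weight.requires_grad_(True)

# LoRA adapters (Delta_theta) on the UNet attention projections.
unet.add_adapter(LoraConfig(r=16, lora_alpha=16,
                            target_modules=["to_q", "to_k", "to_v", "to_out.0"]))

opt = torch.optim.AdamW([
    {"params": [p for p in unet.parameters() if p.requires_grad], "lr": 1e-5},
    {"params": [emb.weight], "lr": 1e-3},
])

ids = tokenizer("a <s1> <s2>", padding="max_length", truncation=True,
                max_length=tokenizer.model_max_length, return_tensors="pt").input_ids

def next_subject_image():
    # Stub loader: one 512x512 subject image scaled to [-1, 1].
    return torch.rand(1, 3, 512, 512) * 2 - 1

for step in range(600):                              # 600 iterations, batch size 1
    latents = vae.encode(next_subject_image()).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_sched.config.num_train_timesteps, (1,))
    noisy = noise_sched.add_noise(latents, noise, t)
    cond = text_encoder(ids)[0]
    pred = unet(noisy, t, encoder_hidden_states=cond).sample
    F.mse_loss(pred, noise).backward()               # standard epsilon-prediction loss
    mask = torch.zeros_like(emb.weight.grad)
    mask[new_ids] = 1.0
    emb.weight.grad.mul_(mask)                       # update only the two new token rows
    opt.step(); opt.zero_grad()
```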
**Step 2: Style-Injected Inference**
- A frozen CLIP encoder extracts the target image style embedding \(e_t = \text{CLIP}(x_t)\).
- A frozen IP-Adapter injects \(e_t\) into the upsampling blocks of the personalized model:
- Injection is applied only to the upsampling layers near the middle blocks (following InstantStyle), without explicit content/style embedding disentanglement.
- Core Idea: The combination of adapter injection with a personalized model is a previously unexplored direction in the literature.
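In diffusers terms, this inference step looks roughly like the sketch below. It is a hedged approximation rather than the authors' code: all paths, checkpoint names, and the prompt are placeholders, and the per-block scale dict mirrors the InstantStyle-style configuration exposed by `set_ip_adapter_scale`.

```python
# Rough sketch of Step 2 with public diffusers APIs; all paths, checkpoint
# names, and the prompt are placeholders rather than the authors' artifacts.
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import load_image
from safetensors.torch import load_file

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/subject_lora")       # Step 1 LoRA weights

# Load the learned identity token into both SDXL text encoders
# (one token shown for brevity; the method trains two).
state = load_file("path/to/learned_embeds.safetensors")
pipe.load_textual_inversion(state["clip_l"], token="<s1>",
                            text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)
pipe.load_textual_inversion(state["clip_g"], token="<s1>",
                            text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)

# Frozen CLIP image encoder + IP-Adapter provide the style conditioning e_t.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models",
                     weight_name="ip-adapter_sdxl.bin")

# Inject style only into the upsampling block near the middle (InstantStyle-style).
pipe.set_ip_adapter_scale({"up": {"block_0": [0.0, 1.0, 0.0]}})

style_image = load_image("target_scene.png")         # target image x_t supplies the style
image = pipe("a <s1>", ip_adapter_image=style_image,
             num_inference_steps=30).images[0]
image.save("stylized_subject.png")
```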
### Bootstrap Domain Adaptation
Problem: Existing subject insertion models (ObjectDrop) are trained on real images \(\mathcal{D}_r\) and cannot handle stylized images \(\mathcal{D}_s\).
Mechanism:

1. Apply the real-image-trained insertion model \(g_\theta\) to perform subject removal on stylized data \(\mathcal{S} \sim \mathcal{D}_s\).
2. Filter out failed outputs, retaining the successful results \(\mathcal{S}' \subseteq \mathcal{S}\).
3. Retrain the model on the filtered data (see the code sketch after the findings below).
- Key finding: Diffusion models trained on real data partially generalize to the stylized domain—limited but non-trivial.
- Approximately 50 samples and a single bootstrap iteration suffice to yield substantial improvement.
- After bootstrapping, the model correctly synthesizes shadows and reflections; without it, these are absent or appear as artifacts.
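The recipe itself is model-agnostic. Below is a minimal generic sketch of the loop, with all domain-specific pieces (`apply_model`, `is_success`, `finetune`) left as caller-supplied stand-ins rather than the paper's actual components.

```python
# Generic Bootstrap Domain Adaptation loop; the three callables are
# hypothetical stand-ins for the paper's insertion model and filtering step.
from typing import Callable, Iterable, List, Tuple, TypeVar

Image = TypeVar("Image")
Model = TypeVar("Model")

def bootstrap_adapt(
    model: Model,
    stylized: Iterable[Image],
    apply_model: Callable[[Model, Image], Image],   # e.g. subject removal on one image
    is_success: Callable[[Image, Image], bool],     # drop failed outputs
    finetune: Callable[[Model, List[Tuple[Image, Image]]], Model],
    rounds: int = 1,                                # one round (~50 kept pairs) sufficed
) -> Model:
    """Adapt a real-domain model to a target domain using its own filtered outputs."""
    for _ in range(rounds):
        pairs = [(x, apply_model(model, x)) for x in stylized]   # S -> candidate outputs
        kept = [(x, y) for x, y in pairs if is_success(x, y)]    # S' subset of S
        model = finetune(model, kept)               # retrain on the successes
    return model
```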
### Insertion Pipeline
- The stylized subject is segmented and, with any cast shadow removed, copy-pasted into the target image.
- The bootstrap-adapted insertion model is then applied to this naive composite to generate contextual cues such as shadows and reflections.
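Put together, the compositing step reduces to a paste followed by one pass of the adapted model. A minimal sketch follows; the segmentation mask and `insertion_model` are assumed inputs, not part of the paper's released code.

```python
# Minimal compositing sketch; `insertion_model` is a stand-in for the
# bootstrap-adapted insertion network.
from typing import Callable
from PIL import Image

def insert_subject(
    subject: Image.Image,                 # segmented, stylized subject (RGB)
    mask: Image.Image,                    # subject segmentation mask ("L" mode)
    target: Image.Image,                  # target scene x_t
    position: tuple[int, int],            # top-left paste coordinates
    insertion_model: Callable[[Image.Image], Image.Image],
) -> Image.Image:
    composite = target.copy()
    composite.paste(subject, position, mask)   # naive copy-paste
    return insertion_model(composite)          # adds shadows, reflections, occlusion cues
```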
## Key Experimental Results

### Subject Fidelity Comparison (SubjectPlop Dataset)
| Method | DINO↑ | CLIP-I↑ | CLIP-T Simple↑ | CLIP-T Detailed↑ | Overall Mean↑ |
|---|---|---|---|---|---|
| StyleAlign Prompt | 0.223 | 0.743 | 0.266 | 0.299 | 0.383 |
| StyleAlign ControlNet | 0.414 | 0.808 | 0.289 | 0.294 | 0.451 |
| InstantStyle Prompt | 0.231 | 0.778 | 0.283 | 0.300 | 0.398 |
| InstantStyle ControlNet | 0.446 | 0.806 | 0.281 | 0.283 | 0.454 |
| Ours | 0.295 | 0.829 | 0.276 | 0.293 | 0.423 |
| Ours ControlNet | 0.514 | 0.869 | 0.289 | 0.308 | 0.495 |
Magic Insert + ControlNet achieves the best subject fidelity on DINO, CLIP-I, CLIP-T Detailed, and overall mean, and ties for the top CLIP-T Simple score.
### Style Fidelity and Human Preference
| Method | CLIP-I↑ | CSD↑ | CLIP-T↑ | ImageReward↑ |
|---|---|---|---|---|
| StyleAlign ControlNet | 0.575 | 0.188 | 0.274 | -0.518 |
| InstantStyle ControlNet | 0.588 | 0.334 | 0.279 | -0.276 |
| Ours | 0.560 | 0.243 | 0.268 | -0.211 |
| Ours ControlNet | 0.575 | 0.294 | 0.274 | -0.147 |
While InstantStyle achieves marginally higher scores on certain style metrics, its outputs are frequently blurry and lose subject detail. The proposed method demonstrates a clear advantage on ImageReward, which correlates strongly with human preference.
### User Study (60 participants, 1,200 evaluations)
| Comparison | User Preference for Ours |
|---|---|
| Ours vs. StyleAlign ControlNet | 85% |
| Ours vs. InstantStyle ControlNet | 80% |
The overwhelming user preference validates the effectiveness of the proposed method.
## Highlights & Insights
- Value of Problem Formalization: This work is the first to clearly define the "style-aware drag-and-drop" problem and introduces the SubjectPlop evaluation dataset (20 backgrounds × 35 subjects = 700 pairs), establishing a foundation for future research.
- Key Design Decision Against Direct Inpainting: The pipeline first generates a high-quality stylized subject, then composites it via an insertion model—a divide-and-conquer strategy that outperforms end-to-end inpainting.
- Generality of Bootstrap Domain Adaptation: This idea is not limited to insertion tasks; it can be applied to any scenario where a real-domain model must be adapted to a target domain by leveraging the model's partial generalization capacity.
- Complementary Combination of LoRA + Textual Inversion + IP-Adapter: The three components operate in weight space, embedding space, and adapter space, respectively, providing complementary and controllable contributions.
- LLM-Guided Interaction: GPT-4o is used to automatically suggest subject poses and environmental interactions, demonstrating the potential of LLM-integrated editing applications.
## Limitations & Future Work
- Each subject requires independent LoRA fine-tuning (~600 steps), precluding real-time use.
- A trade-off exists between editability and fidelity—longer training improves fidelity but reduces editability.
- The method is built on SDXL; generation quality is bounded by the base model.
- Bootstrap Domain Adaptation is validated only with a single iteration on ~50 samples; the effects of larger-scale or multi-step bootstrapping remain unexplored.
- Without ControlNet, pose control is limited.
- The SubjectPlop dataset is AI-generated and does not include real photographic subjects.
## Related Work & Insights
- DreamBooth / Textual Inversion: Foundational methods for subject personalization; this work extends them with a style dimension.
- IP-Adapter / InstantStyle: Key techniques for adapter-based style injection.
- ObjectDrop: An insertion model trained on real-world counterfactual data; this work extends its applicability via bootstrapping.
- ZipLoRA: An alternative approach for merging style and subject LoRAs.
- Insight: Bootstrap Domain Adaptation, as a lightweight domain adaptation strategy, merits exploration in other vision tasks.
## Rating
- Novelty: ⭐⭐⭐⭐ (Novel problem formulation; Bootstrap Domain Adaptation is an original contribution)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Comprehensive metrics; large-scale user study with decisive results)
- Writing Quality: ⭐⭐⭐⭐ (Clear problem formalization; rich illustrations)
- Value: ⭐⭐⭐⭐ (Opens a new direction in style-aware editing; dataset facilitates future research)