
Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping

Conference: ICCV 2025 | arXiv: 2509.04582 | Area: Image Generation / Image Editing
Keywords: drag-based editing, image inpainting, bidirectional warping, real-time preview, pixel-space deformation

TL;DR

This paper proposes Inpaint4Drag, which decomposes drag-based image editing into two stages: pixel-space bidirectional warping and image inpainting. Inspired by elastic object deformation, the proposed bidirectional warping algorithm enables real-time preview (0.01s) and efficient generation (0.3s), achieving a 600× speedup over existing methods while serving as a universal adapter for arbitrary inpainting models.

Background & Motivation

  • Background: Methods such as DragGAN and DragDiffusion enable intuitive image manipulation via mouse-drag interactions.
  • Limitations of prior work: Current approaches share three fundamental limitations:
      • Insufficient precision: Latent-space manipulation operates at a resolution downsampled from 512×512 to 32×32, so control points lose significant spatial accuracy.
      • Poor interactivity: The generation process provides no immediate visual feedback, forcing users into repeated trial-and-error cycles.
      • Limited capability: General text-to-image models handle newly exposed regions poorly (e.g., head rotation, or mouth opening that reveals large occluded areas).
  • Core idea: Drag-based editing can be fundamentally decomposed into two steps: geometric transformation (warping) + content generation (inpainting). Pixel-space deformation estimation provides precise control and real-time preview, while specialized inpainting models outperform general generative models at filling newly exposed regions.

Method

Overall Architecture

Inpaint4Drag consists of three modules:

  1. Region specification and boundary refinement (optional, SAM-based)
  2. Bidirectional warping algorithm (real-time pixel-space deformation)
  3. Inpainting model integration (standard inpainting-format input)

Module 1: Region Specification and SAM Boundary Refinement

The user draws a mask to specify the deformable region; an optional SAM refinement module improves boundary quality.

Problem: Direct SAM application may produce disconnected regions or capture unintended objects.

Solution: Dilated and eroded versions of the user mask constrain the SAM prediction: \(M = (M_{\text{pred}} \cap M_{\text{dilated}}) \cup M_{\text{eroded}}\)

where \(M_{\text{dilated}}\) and \(M_{\text{eroded}}\) are obtained by dilating and eroding the user mask with a kernel of radius \(r_1=10\), respectively.
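A minimal sketch of this refinement with OpenCV morphology, assuming binary uint8 masks; `sam_pred` stands in for the raw SAM prediction, which is not produced here:

```python
import cv2
import numpy as np

def refine_mask(user_mask: np.ndarray, sam_pred: np.ndarray, r1: int = 10) -> np.ndarray:
    """Constrain a raw SAM prediction to the user's intent:
    M = (M_pred ∩ M_dilated) ∪ M_eroded, with a disk of radius r1 (paper: r1 = 10)."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * r1 + 1, 2 * r1 + 1))
    dilated = cv2.dilate(user_mask, kernel)  # upper bound: SAM cannot grow beyond this
    eroded = cv2.erode(user_mask, kernel)    # lower bound: the mask core is always kept
    return cv2.bitwise_or(cv2.bitwise_and(sam_pred, dilated), eroded)
```

The intersection caps how far the prediction may expand past the user's stroke, while the union guarantees the mask core is never lost to an under-segmenting SAM output.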

Module 2: Bidirectional Warping Algorithm

Inspired by elastic object deformation, the image region is treated as a deformable material. The algorithm consists of four steps:

Step 1 – Contour extraction and control point association: \(\mathcal{C} = \text{findContours}(M)\). Each contour responds only to the control points located within it.

Step 2 – Forward warping (defines the target region boundary and establishes the initial mapping): \(p_t = p + \sum_{i=1}^{N_\mathcal{C}} w_i(t_i - h_i)\), where \(h_i\) and \(t_i\) are the handle and target control points associated with the contour.

Weights are computed via inverse-distance weighting: \(w_i = \frac{1/(\|p - h_i\| + \epsilon)}{\sum_{j=1}^{N_\mathcal{C}} 1/(\|p - h_j\| + \epsilon)}\)
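A NumPy sketch of this forward pass, vectorized over all contour pixels; it assumes every handle belongs to the current contour and that `handles`/`targets` are matched (N, 2) arrays of control points:

```python
import numpy as np

def forward_warp(points: np.ndarray, handles: np.ndarray,
                 targets: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """Move each source point by an inverse-distance-weighted blend of the
    handle-to-target displacements: p_t = p + sum_i w_i (t_i - h_i)."""
    # (P, N) pairwise distances from every point to every handle
    dist = np.linalg.norm(points[:, None, :] - handles[None, :, :], axis=-1)
    w = 1.0 / (dist + eps)
    w /= w.sum(axis=1, keepdims=True)  # normalize the IDW weights per point
    return points + w @ (targets - handles)
```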

Step 3 – Inverse mapping (fills sampling gaps left by forward warping): \(p_s = p_t + \sum_{i=1}^{N_n} w_i(p_i^{\text{src}} - p_i^{\text{tgt}})\)

Using \(N_n=4\) nearest-neighbor reference points, local-neighborhood-based inverse mapping ensures complete pixel coverage.
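The inverse pass can be sketched with a KD-tree over the forward-warped reference pairs; `ref_src`/`ref_tgt` are assumed to be the source and target positions recorded during the forward pass:

```python
import numpy as np
from scipy.spatial import cKDTree

def inverse_map(target_pixels: np.ndarray, ref_src: np.ndarray,
                ref_tgt: np.ndarray, k: int = 4, eps: float = 1e-4) -> np.ndarray:
    """For each pixel in the target region, estimate its source location from
    the k nearest reference pairs: p_s = p_t + sum_i w_i (p_i_src - p_i_tgt)."""
    dist, idx = cKDTree(ref_tgt).query(target_pixels, k=k)  # (T, k) neighbors
    w = 1.0 / (dist + eps)
    w /= w.sum(axis=1, keepdims=True)
    back = ref_src[idx] - ref_tgt[idx]  # (T, k, 2) displacements back to the source
    return target_pixels + np.einsum('tk,tkd->td', w, back)
```

Because every target pixel is assigned a source location, the stretched region is densely sampled rather than left with the scatter holes a forward-only warp produces.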

Step 4 – Generating the warped result and inpainting mask: \(I_{\text{warped}}(p_t) = I(p_s)\) for all valid pairs, and \(M_{\text{inpaint}} = \text{dilate}(M_{\text{temp}} \cup \partial M_{\text{warped}}, K_2)\)

The inpainting mask comprises: (1) unmapped regions exposed by the deformation (\(M_{\text{temp}}\)); (2) a narrow buffer band along the warp boundary (\(\partial M_{\text{warped}}\)). The dilation kernel \(K_2\) has radius \(r_2=5\).
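A hedged sketch of how steps 2–4 could compose the final outputs; coordinates are integer (x, y) pairs, `warped_mask` plays the role of \(M_{\text{warped}}\), and \(M_{\text{temp}}\) is recovered from coverage bookkeeping:

```python
import cv2
import numpy as np

def compose_outputs(image, tgt_px, src_px, warped_mask, r2=5):
    """Scatter source colors to target positions, then build the inpainting
    mask from unmapped holes (M_temp) plus a band along the warp boundary."""
    h, w = image.shape[:2]
    warped = image.copy()
    covered = np.zeros((h, w), np.uint8)
    tx, ty = np.clip(tgt_px[:, 0], 0, w - 1), np.clip(tgt_px[:, 1], 0, h - 1)
    sx, sy = np.clip(src_px[:, 0], 0, w - 1), np.clip(src_px[:, 1], 0, h - 1)
    warped[ty, tx] = image[sy, sx]  # I_warped(p_t) = I(p_s) for all valid pairs
    covered[ty, tx] = 255
    holes = cv2.bitwise_and(warped_mask, cv2.bitwise_not(covered))  # M_temp
    edge = cv2.morphologyEx(warped_mask, cv2.MORPH_GRADIENT,        # ∂M_warped
                            np.ones((3, 3), np.uint8))
    k2 = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * r2 + 1, 2 * r2 + 1))
    return warped, cv2.dilate(cv2.bitwise_or(holes, edge), k2)      # K_2 dilation
```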

Module 3: Inpainting Model Integration

\(I_{\text{warped}}\) and \(M_{\text{inpaint}}\) are fed to the model as standard inpainting inputs: \(I_{\text{edit}} = \text{Inpaint}(I_{\text{warped}}, M_{\text{inpaint}})\)

The SD 1.5 inpainting checkpoint is used, together with the Tiny AutoEncoder for Stable Diffusion (TAESD), LCM-LoRA (reducing sampling to 8 steps), an empty text prompt (eliminating classifier-free-guidance computation), and cached null-prompt embeddings.
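This recipe maps naturally onto the Hugging Face diffusers API; a minimal sketch, with the common public checkpoint IDs assumed rather than confirmed by the paper:

```python
import torch
from diffusers import AutoencoderTiny, LCMScheduler, StableDiffusionInpaintPipeline

# SD 1.5 inpainting checkpoint plus the paper's speed-ups: TAESD as the VAE,
# LCM-LoRA for 8-step sampling, and an empty prompt with CFG disabled.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
pipe.vae = AutoencoderTiny.from_pretrained(
    "madebyollin/taesd", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

edited = pipe(
    prompt="",                # empty prompt; its embedding could also be precomputed and cached
    image=warped_image,       # I_warped from the bidirectional warp
    mask_image=inpaint_mask,  # M_inpaint covering holes and the boundary band
    num_inference_steps=8,
    guidance_scale=1.0,       # guidance_scale <= 1 skips the CFG forward pass in diffusers
).images[0]
```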

Key design: A real-time preview is provided prior to inpainting — users may adjust masks and control points until satisfied before executing inpainting.

Key Experimental Results

Main Results: Comparison with Drag-Based Editing Methods

| Method | DragBench-S MD↓ | DragBench-S LPIPS↓ | DragBench-D MD↓ | DragBench-D LPIPS↓ | VRAM (GB)↓ | Time (s)↓ |
| --- | --- | --- | --- | --- | --- | --- |
| DragDiffusion | 7.0 | 18.0 | 6.7 | 10.2 | 11.6 | 177.7 |
| DiffEditor | 23.6 | 17.6 | 22.1 | 10.9 | 6.6 | 43.1 |
| SDE-Drag | 7.5 | 11.4 | 8.1 | 14.9 | 6.9 | 126.1 |
| FastDrag | 4.1 | 24.1 | 5.1 | 13.5 | 5.0 | 4.2 |
| Inpaint4Drag | 3.6 | 11.4 | 3.9 | 9.1 | 2.7 | 0.3 |

(MD and LPIPS values ×100)

Key findings:

  • Highest precision: lowest MD on both benchmarks (3.6 / 3.9), the best drag accuracy.
  • Fastest: 14× faster than FastDrag and roughly 600× faster than DragDiffusion.
  • Lowest memory: requires only 2.7 GB of VRAM.
  • Best image consistency: LPIPS is best or competitive on both benchmarks.

Per-Stage Latency Breakdown

| Stage | Time |
| --- | --- |
| SAM boundary refinement | 0.02s |
| Bidirectional warping preview | 0.01s |
| SD inpainting generation | 0.29s |

Ablation Study: Unidirectional vs. Bidirectional Warping

Qualitative comparisons show:

  • Unidirectional (forward-only) warping: visible sampling artifacts in stretched regions, with unmapped gaps at target positions.
  • Bidirectional warping: by first determining the target contour via forward warping and then filling gaps through per-pixel inverse mapping, it produces smooth, artifact-free transformations.

Mask Refinement Module Effectiveness

Qualitative results demonstrate the improvement at each stage of the pipeline: raw user mask → raw SAM prediction (which may include disconnected regions or unintended objects) → refined result (preserving user intent with improved boundary accuracy).

Highlights & Insights

  1. Paradigm innovation: Reframes drag-based editing from "guiding generation in latent space" to "pixel-space warping + inpainting," achieving a clean separation between geometric transformation and content generation.
  2. Physics-inspired design: Treating image regions as elastic materials is intuitive and aligns naturally with users' physical-world intuition.
  3. Universal adapter: By outputting standard inpainting format, the method integrates seamlessly with any inpainting model and automatically inherits future advances in inpainting technology.
  4. Real-time interaction: The 0.01s warp preview allows users to visualize results before executing the costly inpainting step, substantially improving the editing experience.
  5. Extreme efficiency: At 2.7 GB VRAM and 0.3s inference, drag-based editing becomes practically deployable on consumer-grade GPUs for the first time.

Limitations & Future Work

  • The bidirectional warping relies on inverse-distance interpolation, which may produce unnatural results for complex non-rigid deformations such as facial expression changes.
  • Generation quality depends on the downstream inpainting model; when exposed regions are large, quality is bounded by the inpainting model's capability.
  • DragBench's deformable regions were re-annotated (differing from the original editable region concept), which may introduce evaluation inconsistencies.
  • The method is not applicable to drag editing scenarios requiring texture or style changes rather than geometric deformation.

Comparison & Insights

  • Fundamental distinction from DragGAN/DragDiffusion: Prior methods perform iterative optimization in latent space, whereas this work applies an analytic transformation in pixel space.
  • Comparison with FastDrag: FastDrag also adopts a "stretching" intuition but operates sequentially in latent space, lacking pixel-level precision and a dedicated inpainting stage.
  • Broader insight: Many generative tasks can gain substantially in efficiency and precision by being decomposed into "deterministic geometric transformation + localized generation."

Rating ⭐⭐⭐⭐

The method is elegant and concise, directly addressing the core pain points of precision and efficiency in drag-based image editing. The 600× speedup and real-time preview confer strong practical value. Reframing drag editing as an inpainting problem is both novel and effective.