DragAnything: Motion Control for Anything using Entity Representation
Conference: ECCV 2024
arXiv: 2403.07420
Code: https://github.com/showlab/DragAnything
Area: Video Generation
Keywords: Controllable Video Generation, Motion Control, Entity Representation, Diffusion Models, Trajectory Guidance
TL;DR
This paper proposes DragAnything, which uses the latent-space features of diffusion models as entity representations to achieve entity-level motion control. It addresses a weakness of existing trajectory-driven methods, which drag pixels but cannot precisely control the motion of the target object. On VIPSeg, DragAnything reports better FVD, FID, and ObjMC than DragNUWA, and in a user study it wins the motion-control vote by 26 percentage points (63% vs. 37%).
Background & Motivation
Background: Video generation has advanced significantly (e.g., Imagen Video, SVD, Sora), but controllable video generation has moved more slowly. Among the available control signals, trajectory-based motion control has become a mainstream direction because the interaction is user-friendly: users only need to draw lines. Representative works include DragNUWA, which encodes sparse trajectories into a dense optical flow space, and MotionCtrl, which encodes trajectory coordinates into a vector map.
Limitations of Prior Work: An overlooked core problem is: Can a single pixel point truly represent the target entity to be controlled? The answer is no. The paper draws two key insights from the behavior of existing methods:
Insight 1: Trajectory points cannot represent entities. In the paper's motivating example, when a pixel point on a star in a starry sky is dragged, the model cannot tell whether the user wants to move that star or the entire night sky. Similarly, dragging a point on a cloud results in camera movement rather than cloud movement. A single point therefore cannot carry the semantic identity of an entity.
Insight 2: Pixels closer to the drag point undergo larger motion. In videos generated by DragNUWA, the amount of pixel motion decreases as the distance from the drag point increases. The user, however, expects the entire object to move as a unified entity along the trajectory, not to deform unevenly at the pixel level.
Key Challenge: Existing trajectory-based methods essentially perform "pixel dragging" rather than "entity control"—they fail to establish a semantic association between the "dragged points" and the "entities to be controlled." Two questions must be addressed: (1) How to represent entities? (2) How to drag only the selected entities?
Key Insight: Utilizing the latent space features of the diffusion model itself to represent entities. Prior works (e.g., AnyDoor using DINOv2 features, VideoSwap and DIFT using diffusion model features) have demonstrated the effectiveness of latent features in representing objects.
Core Idea: Extracting latent features corresponding to the target entity's region from the diffusion model as the entity representation, and achieving entity-level motion control by manipulating the spatial positions of this representation.
Method
Overall Architecture
DragAnything is built on the Stable Video Diffusion (SVD) architecture. Its core components include:
- Denoising Diffusion Model: A 3D U-Net that performs spatial-temporal denoising.
- Entity Semantic Representation Extraction: Extracts entity embeddings from diffusion features.
- 2D Gaussian Representation: Provides spatial location guidance and center-focused attention.
- Control Encoder: Inspired by ControlNet, it encodes both representations and injects them into the decoder of the denoising U-Net.
Key Designs
1. Entity Representation Extraction
Function: Extracts semantic embeddings of each entity from the initial frame to represent target objects instead of simple coordinate points.
Mechanism: Utilizing a diffusion forward process (noise addition) and a single-step denoising U-Net inference, the method extracts latent features corresponding to the entity mask to serve as the open-domain embeddings for that entity.
Specific Steps:
1. Given an initial frame image \(\mathbf{I} \in \mathbb{R}^{H \times W \times 3}\) and an entity mask \(\mathbf{M}\).
2. Obtain the noisy latent variables \(\boldsymbol{x}_t\) through the diffusion forward process.
3. Perform a one-step inference using the denoising U-Net to extract diffusion features: \(\mathcal{F} = \epsilon_\theta(\boldsymbol{x}_t, t)\), where \(\mathcal{F} \in \mathbb{R}^{H \times W \times C}\).
4. Index the features of the corresponding region in \(\mathcal{F}\) using the entity mask coordinates.
5. Apply average pooling to the extracted features to obtain entity embeddings \(\{e_1, e_2, ..., e_k\}\).
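Steps 4 and 5 amount to a masked average over the feature map. The following is a minimal, dependency-free sketch of that pooling only. The function name `masked_average_pool` and the toy feature values are illustrative assumptions; in the actual pipeline the features would come from the denoising U-Net, and tensor operations would replace the nested loops.

```python
def masked_average_pool(features, mask):
    """Average the C-dim feature vectors at pixels where mask is truthy.

    features: H x W x C nested lists (stand-in for U-Net features).
    mask:     H x W nested lists of 0/1 marking the entity region.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        raise ValueError("entity mask is empty")
    return [s / count for s in total]


# Toy 2 x 2 feature map with C = 2; the mask selects the left column.
features = [[[1.0, 2.0], [9.0, 9.0]],
            [[3.0, 4.0], [9.0, 9.0]]]
mask = [[1, 0],
        [1, 0]]
entity_embedding = masked_average_pool(features, mask)
print(entity_embedding)
```

Running one such pooling per entity mask yields the set of embeddings, one vector per entity.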
Trajectory Association:
1. Initialize a zero matrix \(\mathbf{E} \in \mathbb{R}^{H \times W \times C}\).
2. During training, extract the center coordinates using the entity mask, and obtain the trajectory sequence using Co-Tracker.
3. In each frame, insert the entity embeddings into the corresponding positions of the trajectory points to obtain the entity representation \(\{\hat{\mathbf{E}}_i\}_{i=1}^L\).
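The trajectory association can be illustrated with a small sketch that builds one zero-initialized map per frame and writes each entity's embedding at its trajectory point. The function name `build_entity_maps` is hypothetical, and writing the embedding to a single pixel is a simplifying assumption; the source only says the embeddings are inserted at the positions of the trajectory points.

```python
def build_entity_maps(height, width, embeddings, trajectories):
    """Return L maps of shape H x W x C, zero except at trajectory points.

    embeddings:   one C-dim list per entity.
    trajectories: per entity, a list of (row, col) positions, one per frame.
    """
    channels = len(embeddings[0])
    num_frames = len(trajectories[0])
    maps = []
    for frame in range(num_frames):
        # Step 1: start from an all-zero H x W x C map.
        grid = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
        for emb, path in zip(embeddings, trajectories):
            row, col = path[frame]
            # Assumption: points that leave the frame are simply skipped.
            if 0 <= row < height and 0 <= col < width:
                grid[row][col] = list(emb)
        maps.append(grid)
    return maps


# One entity with a 2-dim embedding moving right across a 2 x 3 frame.
entity_maps = build_entity_maps(
    height=2, width=3,
    embeddings=[[0.5, -1.0]],
    trajectories=[[(0, 0), (0, 1), (0, 2)]],
)
print(len(entity_maps), entity_maps[1][0][1])
```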
Design Motivation: The intermediate features of diffusion models naturally contain rich semantic information. One-step inference is fast and yields high-quality representations, outperforming multi-step inference. The entity representation is open-domain and can represent any object (including the background).
2. 2D Gaussian Representation
Function: Provides spatial location guidance and forces the model to focus on the central region of the entity.
Mechanism: Generates a 2D Gaussian distribution map centered on the entity's coordinate center, with the inscribed circle's radius as a parameter. The center pixel carries a high weight, which decays towards the edges.
Specific Implementation:
1. Compute the center \((x, y)\) and radius \(r\) of the inscribed circle from the entity mask.
2. Generate 2D Gaussian distribution maps \(\{\mathbf{G}_i\}_{i=1}^L\) at the trajectory point locations for each frame.
3. Process them through the encoder and then merge them with the entity representations.
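Step 2 can be sketched as an isotropic Gaussian heatmap centered on the trajectory point. The source states only that the inscribed circle's radius parameterizes the Gaussian, so the mapping `sigma = radius / 2` below is an assumption, as is the function name `gaussian_map`.

```python
import math


def gaussian_map(height, width, center, radius):
    """H x W map that peaks at 1.0 on `center` and decays with distance.

    Assumption: sigma = radius / 2 (the source does not give this mapping).
    """
    cy, cx = center
    sigma = max(radius / 2.0, 1e-6)  # guard against a zero radius
    return [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


# A 5 x 5 map for an entity centered at (2, 2) with inscribed radius 2.
heat = gaussian_map(5, 5, center=(2, 2), radius=2.0)
for row in heat:
    print(" ".join(f"{v:.2f}" for v in row))
```

Generating one such map per frame, at that frame's trajectory point, gives the sequence of Gaussian maps that is passed to the encoder.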
Design Motivation: Pixels closer to the entity center are usually more important (containing more information and fewer occlusions). The 2D Gaussian naturally achieves center-focusing through spatial weight allocation, compensating for the uniformity of the entity representation.
3. Control Signal Encoding and Injection
Function: Encodes both representations into a latent space compatible with SVD.
Encoder Structure: Composed of 4 convolution blocks (each block has 2 convolutional layers + SiLU activation), progressively downsampling by a factor of 2 to reach 1/8 resolution.
Signal Injection:
\[\{\mathbf{R}_i\}_{i=1}^L = \mathcal{E}(\{\hat{\mathbf{E}}_i\}_{i=1}^L) + \mathcal{E}(\{\mathbf{G}_i\}_{i=1}^L) + \{\mathbf{Z}_i\}_{i=1}^L\]
where \(\mathbf{Z}_i\) represents the noisy video frame latents. The encoded features are processed by the 3D U-Net encoder to obtain four different resolutions of features, which are then added to the corresponding layers of the denoising U-Net.
Loss & Training
Loss Function: MSE loss with mask constraints:
\[\mathcal{L}_\theta = \sum_{i=1}^L \mathbf{M} \| \epsilon - \epsilon_\theta(\boldsymbol{x}_{t,i}, \mathcal{E}_\theta(\hat{\mathbf{E}}_i), \mathcal{E}_\theta(\mathbf{G}_i)) \|_2^2\]
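The effect of the mask constraint can be seen in a toy version of the loss for a single frame and a single channel. This is a sketch under simplifying assumptions: the function name `masked_mse` is hypothetical, the mask is applied per pixel, and the noise and prediction values are made up for illustration.

```python
def masked_mse(noise, predicted, mask):
    """Sum of squared errors over pixels where mask is 1 (H x W nested lists)."""
    loss = 0.0
    for n_row, p_row, m_row in zip(noise, predicted, mask):
        for n, p, m in zip(n_row, p_row, m_row):
            loss += m * (n - p) ** 2
    return loss


noise = [[1.0, 0.0],
         [0.0, 2.0]]
predicted = [[0.0, 0.0],
             [5.0, 2.0]]
entity_mask = [[1, 0],
               [0, 1]]
full_mask = [[1, 1],
             [1, 1]]

# The large error at pixel (1, 0) lies outside the entity and is ignored.
masked_loss = masked_mse(noise, predicted, entity_mask)
unmasked_loss = masked_mse(noise, predicted, full_mask)
print(masked_loss, unmasked_loss)
```

With the entity mask, only errors inside the entity region contribute, so gradients do not penalize the background prediction.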
Design Motivation of Mask Constraint: The optimization objective focuses strictly on controlling the motion of the entity, imposing no constraints on the background or other objects to prevent degrading the overall generation quality.
Training Data: Trained on the VIPSeg video segmentation dataset (which provides entity-level annotations), with Co-Tracker tracking the entity center coordinates to obtain trajectories.
Training Settings: Initialized from SVD weights and trained with the AdamW optimizer at a learning rate of 1e-5 for 100k steps on NVIDIA A100 GPU hardware, generating 25-frame videos at 320×576 resolution.
Inference Interaction: The user clicks to select the control region using SAM \(\rightarrow\) drags within the region to draw a trajectory \(\rightarrow\) generates the video.
Key Experimental Results
Main Results
VIPSeg Validation Set (256×256):
| Method | Base Architecture | ObjMC↓ | FVD↓ | FID↓ |
|---|---|---|---|---|
| DragNUWA | SVD | 324.6 | 519.3 | 39.8 |
| DragAnything | SVD | 305.7 | 494.8 | 33.5 |
| Gain | - | -18.9 | -24.5 | -6.3 |
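The absolute gains in the table are easier to compare across metrics when expressed as relative reductions. The short calculation below derives them from the table values only; the helper name `relative_reduction` is an illustrative choice.

```python
# (DragNUWA, DragAnything) values copied from the table above; lower is better.
results = {
    "ObjMC": (324.6, 305.7),
    "FVD": (519.3, 494.8),
    "FID": (39.8, 33.5),
}


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


relative = {name: relative_reduction(b, o) for name, (b, o) in results.items()}
for name, pct in relative.items():
    print(f"{name}: {pct:.1f}% lower than DragNUWA")
```

This works out to roughly 5.8% for ObjMC, 4.7% for FVD, and 15.8% for FID, so the largest relative gain is in FID.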
User Study:
| Evaluation Metric | DragAnything Win Rate | DragNUWA Win Rate | DragAnything Margin (percentage points) |
|---|---|---|---|
| Motion Control | 63% | 37% | +26 |
| Video Quality | 56% | 44% | +12 |
Ablation Study
Contributions of Entity Representation and 2D Gaussian Representation:
| Entity Rep. | Gaussian Rep. | ObjMC↓ | FVD↓ | FID↓ |
|---|---|---|---|---|
| ✗ | ✗ | 410.7 | 496.3 | 34.2 |
| ✓ | ✗ | 318.4 | 494.5 | 34.1 |
| ✗ | ✓ | 339.3 | 495.3 | 34.0 |
| ✓ | ✓ | 305.7 | 494.8 | 33.5 |
Key Observations:
- Entity Rep. contributes the most to ObjMC: -92.3 (\(410.7 \rightarrow 318.4\))
- Gaussian Rep. improvement: -71.4 (\(410.7 \rightarrow 339.3\))
- The combination of both exhibits a complementary effect: -105.0 (\(410.7 \rightarrow 305.7\))
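The ObjMC deltas quoted above can be re-derived from the ablation table with a few lines of arithmetic. The sketch uses only the table values; the variable names are illustrative.

```python
# ObjMC from the ablation table, keyed by (entity_rep, gaussian_rep).
objmc = {
    (False, False): 410.7,
    (True, False): 318.4,
    (False, True): 339.3,
    (True, True): 305.7,
}

baseline = objmc[(False, False)]
entity_gain = baseline - objmc[(True, False)]
gaussian_gain = baseline - objmc[(False, True)]
combined_gain = baseline - objmc[(True, True)]

print(f"entity only:   {entity_gain:.1f}")
print(f"gaussian only: {gaussian_gain:.1f}")
print(f"both:          {combined_gain:.1f}")
print(f"sum of individual gains: {entity_gain + gaussian_gain:.1f}")
```

The combined gain (105.0) exceeds either individual gain but is well below their sum (163.7), which suggests the two representations overlap substantially in what they contribute to ObjMC while still helping each other.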
Ablation on Loss Mask:
| Loss Mask | ObjMC↓ | FVD↓ | FID↓ |
|---|---|---|---|
| ✗ | 311.1 | 500.2 | 34.3 |
| ✓ | 305.7 | 494.8 | 33.5 |
Key Findings
- Entity Representation is the core of motion control: Bringing a 92.3 improvement on ObjMC, demonstrating that diffusion features indeed encode sufficient entity semantic information.
- The two representations are complementary: Entity Rep. provides semantic identity, while Gaussian Rep. provides spatial localization and center focus, yielding the best performance when combined.
- Importance of Loss Mask for precise control: Restricting the loss to the target region improves all three metrics (ObjMC 311.1 → 305.7, FVD 500.2 → 494.8, FID 34.3 → 33.5) and avoids interfering with the generation quality of other areas.
- DragAnything consistently outperforms DragNUWA across all metrics: FVD (temporal consistency), FID (visual quality), and ObjMC (motion accuracy).
- Supports diverse motion control: Foreground, background, simultaneous foreground + background control, and camera motion control.
Highlights & Insights
- Profound Insight: Using a toy experiment tracking pixel motion via Co-Tracker, the paper intuitively reveals the fundamental issue of "pixel dragging \(\neq\) entity control."
- Generality of Entity Representation: The open-domain embeddings can represent any object—not only instance-level objects, but also backgrounds (clouds, starry sky) and camera motion.
- Leveraging the Diffusion Model's Own Features: The method avoids introducing extra encoders (e.g., CLIP, DINOv2) and instead leverages the pre-existing semantic capabilities of the diffusion U-Net, which is computationally efficient and semantically aligned.
- Thorough Paradigm Comparison: The paper systematically compares five paradigms: point representation, trajectory maps, 2D Gaussians, box representations, and entity representations.
- Extremely User-Friendly Interaction: During inference, users only need to select with SAM and draw trajectories, making the entire pipeline intuitive with no professional knowledge required.
Limitations & Future Work
- Weak performance under extreme motion control: As shown in the bad cases in Figure 10, controlling very large motions may lead to visual distortion.
- Dependence on first-frame entity masks: The method requires SAM or manual mask annotations to extract the entity representation; completely mask-free scenarios require additional processing.
- Trajectory sources depend on Co-Tracker: The quality of trajectory labels in training data is limited by the tracker's accuracy.
- Resolution limits: Currently evaluated at 256×256; performance and efficiency under higher resolutions need validation.
- Multi-entity interaction scenarios: In scenarios where multiple entities overlap or physically interact (e.g., collisions), the physical plausibility of motion control may be insufficient.
- Exploring mask-free approaches: Integrating text instructions for entity selection (e.g., "move the dog to the right") could further enhance user experience.
Related Work & Insights
- DragNUWA / MotionCtrl: Pioneers in trajectory-controlled video generation, but limited to pixel-level manipulation.
- SVD (Stable Video Diffusion): The base model for DragAnything.
- ControlNet: A classic architecture for injecting control signals into denoising U-Nets, which inspired the encoder design in DragAnything.
- DIFT / VideoSwap: Proved that diffusion features can serve as effective object representations.
- SAM: Provides interactive segmentation support for entity selection.
- Co-Tracker: A point tracker providing motion trajectory annotations required for training.
- Insights: The intermediate representations of diffusion models are themselves excellent semantic descriptors. Fully utilizing the intrinsic capabilities of existing models is more efficient than introducing auxiliary modules.
Rating
- Novelty: ⭐⭐⭐⭐⭐ A profound paradigm shift from "pixel dragging" to "entity control," with an elegant design for entity representation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Includes automated metrics + user study + ablation + visualization of multiple control modes, which is quite comprehensive; however, comparisons are chiefly limited to DragNUWA (as MotionCtrl did not open-source its SVD version).
- Writing Quality: ⭐⭐⭐⭐ The motivation illustrated via the toy experiment is intuitive and compelling, and the method description is clear.
- Value: ⭐⭐⭐⭐⭐ Proposes a crucial paradigm improvement in controllable video generation; the entity-level control mechanism will likely have a significant impact on future works.