StyleMaster: Stylize Your Video with Artistic Generation and Translation
Conference: CVPR 2025
arXiv: 2412.07744
Code: Project Page
Area: Image Generation / Video Stylization
Keywords: Video Stylization, Style Transfer, Diffusion Models, Contrastive Learning, Texture Features
TL;DR
StyleMaster performs both stylized video generation and video style transfer with consistent style and preserved content. It combines four components: local texture selection based on prompt-patch similarity, contrastive global style extraction trained on pairs generated via model illusion, a motion adapter, and a grayscale Tile ControlNet.
Background & Motivation
Video stylization requires generating a video in, or transferring a video into, the style of a given reference image. Existing methods face three core challenges: (1) Loss of local textures: current approaches focus on global style but ignore local texture details such as brushstrokes; (2) Content leakage: directly using CLIP global embeddings or all patch features introduces reference image content into the generated outputs; (3) Insufficient style-content decoupling: global representations struggle to balance content leakage prevention and style accuracy.
Additionally, existing style datasets (like Style30K) cannot guarantee style consistency within style groups, which harms contrastive learning. For example, one image in a group might belong to the real-world domain while another belongs to the anime domain. The collection and grouping process is also highly labor-intensive.
The core insight of StyleMaster is that the style extraction stage is crucial. It requires simultaneously considering global and local information, and leveraging model illusion to automatically generate paired data with absolute style consistency.
Method
Overall Architecture
Based on a DiT video generation model, StyleMaster comprises four core components: (1) contrastive dataset construction and global style projection based on model illusion; (2) local texture selection based on prompt-patch similarity; (3) a motion adapter to enhance temporal quality and style intensity; and (4) a grayscale Tile ControlNet for precise content control in video style transfer.
Key Design 1: Model Illusion Contrastive Dataset and Global Style Extraction
- Function: Generate paired data with absolute style consistency to train a module that projects CLIP image embeddings into pure style representations.
- Mechanism: Leverage the model illusion properties of VisualAnagrams: during the T2I model sampling process, noisy images are copied and transformed (rotated/flipped) to form parallel denoising processes, each guided by a different prompt. The resulting paired images are pixel permutations of each other, so the content changes while the style is preserved. Randomly combining entries from two lists (objects and style descriptions) yields an effectively unlimited supply of pairs. An MLP projection layer \(F_{\text{global}} = \text{MLP}(F_i)\) is trained with a triplet loss to transform CLIP image embeddings \(F_i\) into global style descriptions.
- Design Motivation: Manually collected datasets like Style30K cannot guarantee style consistency. In contrast, pairs generated by model illusion differ only by pixel rearrangements, ensuring absolute style consistency within groups. CLIP itself is kept frozen rather than fine-tuned, and only the projection layer is trained, to preserve CLIP's generalization capability.
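The parallel-view sampling described above can be illustrated with a minimal sketch. It assumes the usual Visual Anagrams recipe of averaging per-view noise estimates after undoing each view; `toy_denoiser`, the prompts, and the tensor shapes are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

def identity(x):
    return x

def flip180(x):
    """Rotate the last two (spatial) axes by 180 degrees; this view is its own inverse."""
    return x[..., ::-1, ::-1]

def combined_noise_estimate(x_t, prompts, views, inverse_views, denoiser):
    """Average per-view noise estimates after mapping each back to the base orientation."""
    estimates = []
    for prompt, view, inverse in zip(prompts, views, inverse_views):
        eps = denoiser(view(x_t), prompt)  # each view is guided by its own prompt
        estimates.append(inverse(eps))     # undo the view before averaging
    return np.mean(estimates, axis=0)

def toy_denoiser(x, prompt):
    # Hypothetical stand-in for a real T2I noise predictor (assumption, for illustration only).
    return 0.1 * x + 0.01 * len(prompt)

rng = np.random.default_rng(0)
x_t = rng.normal(size=(3, 8, 8))  # toy noisy image (channels, height, width)
prompts = ["a cat, oil painting", "a ship, oil painting"]  # same style, different objects
eps = combined_noise_estimate(
    x_t, prompts, [identity, flip180], [identity, flip180], toy_denoiser
)

# Once sampling finishes, the image and its transformed view form one pair.
# Here x_t merely stands in for the final sample: the two views share exactly
# the same pixel values and differ only in their arrangement.
image_a, image_b = x_t, flip180(x_t)
```

Because the second image is a pure pixel rearrangement of the first, low-level statistics such as color and stroke texture are shared by construction, which is what makes such pairs usable as anchor/positive samples.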
Key Design 2: Local Texture Selection Based on Prompt-Patch Similarity
- Function: Select patches carrying texture information but no content from CLIP patch features to prevent content leakage.
- Mechanism: Compute the similarity between CLIP patch features \(F_p\) and text features \(F_{\text{text}}\), selecting the \(k=15\) patches with the lowest similarity as texture features (since patches least related to text content are more likely to carry style textures). A Q-Former structure aggregates the selected patch features, which are then concatenated with global styles and injected into the model via dual cross-attention: \(F_{\text{out}} = \text{TCA}(F_{\text{in}}, F_{\text{text}}) + \text{SCA}(F_{\text{in}}, F_{\text{style}})\).
- Design Motivation: Directly using all 256 patch features causes severe content leakage: the UMT Score is only 0.771 with all patches, versus 2.331 with similarity-based selection (2.329 for the full model). Random dropping can mitigate this but is less effective than similarity-based selection.
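The selection step can be sketched as follows. This is a minimal illustration that assumes cosine similarity against a single pooled text vector; the feature dimension and random inputs are hypothetical, and the Q-Former aggregation and cross-attention injection are omitted.

```python
import numpy as np

def select_texture_patches(patch_feats, text_feat, k=15):
    """Return indices and features of the k patches least similar to the prompt."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sim = p @ t                    # cosine similarity, one value per patch
    idx = np.argsort(sim)[:k]      # ascending sort: lowest similarity first
    return idx, patch_feats[idx]

rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(256, 64))  # 256 CLIP patches; dim 64 is a toy choice
text_feat = rng.normal(size=64)           # pooled prompt feature (assumed shape)
idx, texture = select_texture_patches(patch_feats, text_feat)
```

Keeping only the lowest-similarity patches discards the regions that best match the described objects, which is where content leakage would otherwise come from.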
Key Design 3: Motion Adapter and Grayscale Tile ControlNet
- Function: The motion adapter enhances video temporal quality and style intensity; the grayscale Tile ControlNet provides precise content control.
- Mechanism: Train a LoRA \(\widetilde{W} = W + \alpha \cdot A_t^{W,\text{down}} \cdot A_t^{W,\text{up}}\) on the temporal attention weights \(W_Q, W_K, W_V\). During training, set \(\alpha=1\) to generate static videos. During inference, set \(\alpha=-0.3\). This negative value not only increases the dynamic range but also pushes the generation results away from the real domain, enhancing the stylization degree. For video style transfer, a grayscale tile image is used as the ControlNet conditioning input, removing RGB color information to avoid interfering with style injection.
- Design Motivation: Training the style modules on images alone leads to temporal flickering and limited motion in generated videos. RGB tile conditions carry the source colors, which interferes with style injection.
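A small numerical sketch of the two ideas above. The matrix shapes, the LoRA rank, and the BT.601 luminance weights used for grayscale conversion are assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def merged_weight(W, A_down, A_up, alpha):
    """Apply the LoRA update W + alpha * A_down @ A_up to a temporal attention weight."""
    return W + alpha * (A_down @ A_up)

def to_grayscale(rgb):
    """Collapse an (H, W, 3) RGB tile to (H, W) luminance (BT.601 weights, assumed)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb @ weights

rng = np.random.default_rng(0)
d, r = 16, 4                       # toy hidden size and LoRA rank (hypothetical)
W = rng.normal(size=(d, d))        # stands in for one of W_Q, W_K, W_V
A_down = rng.normal(size=(d, r))
A_up = rng.normal(size=(r, d))

W_train = merged_weight(W, A_down, A_up, alpha=1.0)    # training: learn the "static" direction
W_infer = merged_weight(W, A_down, A_up, alpha=-0.3)   # inference: step away from it

tile = rng.uniform(size=(32, 32, 3))   # toy RGB tile in [0, 1]
gray_tile = to_grayscale(tile)         # color removed before conditioning the ControlNet
```

With a negative scale, the inference-time weight moves in the opposite direction of the learned "static video" update, which matches the intuition that the result becomes more dynamic.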
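Loss & Training
- Global projection training: triplet loss \(\mathcal{L} = \sum_{n=1}^{N}[\|f(F_{i,n}^{\text{anc}}) - f(F_{i,n}^{\text{pos}})\| - \|f(F_{i,n}^{\text{anc}}) - f(F_{i,n}^{\text{neg}})\| + \alpha]\), where \(f\) is the MLP projection and \(\alpha\) here denotes the triplet margin (distinct from the LoRA scale \(\alpha\) of the motion adapter)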
- Style module training: standard diffusion denoising loss
- Grayscale Tile ControlNet training: standard diffusion denoising loss
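The triplet objective for the global projection can be sketched as below. The sketch clamps each term at zero, as in the standard hinge form of the triplet loss; whether the paper applies this clamp, as well as the two-layer MLP and the margin value, are assumptions.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """Two-layer projection f(.) applied on top of frozen CLIP image embeddings."""
    h = np.maximum(x @ W1 + b1, 0.0)   # ReLU hidden layer
    return h @ W2 + b2

def triplet_loss(f_anc, f_pos, f_neg, margin=0.2):
    """Sum over triplets of max(||a - p|| - ||a - n|| + margin, 0)."""
    d_pos = np.linalg.norm(f_anc - f_pos, axis=1)
    d_neg = np.linalg.norm(f_anc - f_neg, axis=1)
    return np.sum(np.maximum(d_pos - d_neg + margin, 0.0))

rng = np.random.default_rng(0)
dim, hidden, out = 32, 64, 16          # toy sizes (hypothetical)
params = (
    rng.normal(size=(dim, hidden)) * 0.1, np.zeros(hidden),
    rng.normal(size=(hidden, out)) * 0.1, np.zeros(out),
)
anc, pos, neg = (rng.normal(size=(8, dim)) for _ in range(3))
loss = triplet_loss(mlp(anc, *params), mlp(pos, *params), mlp(neg, *params))
```

In this setup the anchor and positive would be the two images of one model-illusion pair, and the negative an image rendered in a different style.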
Key Experimental Results
Main Results 1: Image Style Transfer
| Method | CSD-Score↑ | ArtFID↓ | FID↓ | LPIPS↓ |
|---|---|---|---|---|
| StyleID (CVPR'24) | 0.40 | 38.57 | 23.91 | 0.55 |
| InstantStyle | 0.32 | 42.48 | 24.59 | 0.67 |
| CSGO | 0.35 | 41.42 | 25.71 | 0.56 |
| StyleMaster | 0.45 | 36.89 | 22.11 | 0.61 |
Main Results 2: Video Stylization Generation
| Method | CLIP-Text↑ | UMT-Score↑ | CSD-Score↑ | MotionSmooth↑ |
|---|---|---|---|---|
| VideoComposer | 0.057 | -2.268 | 0.680 | 0.975 |
| StyleCrafter | 0.294 | 1.994 | 0.448 | 0.973 |
| StyleMaster | 0.305 | 2.329 | 0.463 | 0.994 |
Ablation Study: Style Extraction Module Design
| Configuration | Global (GP) | Texture (Selection) | UMT Score | CSD Score |
|---|---|---|---|---|
| B1: CLIP Embedding | ✗ | - | 0.892 | 0.561 |
| B2: Global Projection | ✓ | - | 2.337 | 0.443 |
| B3: All Patches | - | No selection | 0.771 | 0.534 |
| B5: Similarity-based Selection | - | ✓ | 2.331 | 0.452 |
| B6: Global + Local | ✓ | ✓ | 2.329 | 0.463 |
Key Findings
- Achieves the best ArtFID (36.89 vs. 38.57 for the next-best StyleID), a metric that jointly accounts for style and content. It also leads on CSD-Score and FID, while StyleID and CSGO retain lower LPIPS.
- Global projection increases the UMT Score from 0.892 to 2.337, effectively preventing content leakage.
- Setting the motion adapter to \(\alpha=-0.3\) achieves the optimal balance among visual quality, dynamic range, and style similarity.
- VideoComposer achieves the highest CSD-Score (0.680), but only because it directly copies content from the reference image; as a result, its UMT-Score is negative (-2.268).
Highlights & Insights
- Paired Data Generation via Model Illusion: Ingeniously utilizes the illusion properties of T2I models to generate an effectively unlimited amount of absolutely style-consistent paired data without manual collection or grouping. This data generation strategy is widely applicable to other style-related studies.
- Retreating to Advance with the Motion Adapter: The adapter is trained to produce "static" videos, then applied with a negative scale at inference to push the output toward more motion, which also implicitly enhances the degree of stylization.
- Selective Texture Preservation: An elegant selection strategy based on prompt-patch similarity successfully balances texture preservation and content leakage.
Limitations & Future Work
- The current method only processes the static style of images and does not involve dynamic style (e.g., particle effects, motion characteristics).
- Style extraction still relies on a reference image; future work could explore extracting and transferring dynamic styles from reference videos.
- The grayscale Tile ControlNet loses some color information, which might affect content preservation in certain scenarios.
Related Work & Insights
- IP-Adapter: An image adapter scheme that fails to decouple style and content.
- StyleTokenizer: Fine-tunes CLIP on Style30K using contrastive learning, but suffers from poor style consistency in the dataset.
- CSGO: Uses B-LoRA to generate triplet datasets, but can only extract global representations.
- Still-Moving: Proposes the concept of a motion adapter; this work builds on it and finds that negative scales also strengthen stylization.
- Insight: The "flaws" (illusion properties) of generative models can be turned into valuable features.
Rating
⭐⭐⭐⭐: Multiple innovations (model illusion data generation, prompt-patch selection, negative motion adapter) are seamlessly integrated, with each design validated via ablation studies. It outperforms existing methods on most reported metrics in both image and video stylization tasks.