Style-Editor: Text-driven Object-Centric Style Editing
Conference: CVPR 2025
arXiv: 2408.08461
Code: None
Area: Image Segmentation
Keywords: Text-driven style editing, object-level editing, CLIP guidance, background preservation, patch selection
TL;DR
This paper proposes Style-Editor, which utilizes patch-level directional loss and adaptive background preservation loss in the CLIP space to achieve precise style editing of target objects using only text descriptions, without requiring segmentation masks or reference images.
Background & Motivation
Text-driven image style editing holds significant value for creative industries such as advertising, film, and gaming. Existing methods can be categorized into GAN-based methods (e.g., StyleGAN-NADA, CLIPstyler) and diffusion-based methods (e.g., Instruct Pix2Pix, Plug-and-Play). However, these methods share four core limitations:
- Whole-image editing issue: Traditional directional loss applies style changes to the entire image, failing to distinguish the foreground object from the background.
- Semantic distortion issue: Although diffusion models are highly capable, they often alter the content structure of the target object, resulting in low fidelity.
- Mask dependency issue: Achieving object-level editing typically requires additional segmentation masks, which increases user operation complexity.
- Background contamination issue: Even when style editing is only aimed at the foreground, background regions are prone to unwanted style transfer.
The core idea of this paper is to utilize the zero-shot classification ability of the CLIP model to automatically locate the object region corresponding to the text, and combine this with carefully designed patch-level loss functions to achieve precise object-level style editing without requiring segmentation masks.
Method
Overall Architecture
The pipeline of Style-Editor consists of the following core modules: a style editing network (StyleNet, based on the U-Net architecture) that receives the source image and generates the stylized image; a Pre-fixed Region Selection (PRS) module that roughly localizes foreground regions in initial iterations; a Text-Matching Patch Selection (TMPS) module that uses the CLIP encoder to precisely select patches matching the source text; and finally, end-to-end optimization via a weighted combination of four loss functions (PCD loss, ABP loss, content loss, and TV loss).
Key Designs
- Text-Matching Patch Selection (TMPS) + Pre-fixed Region Selection (PRS):
  - Function: Automatically localizes the object region in the image that corresponds to the text description, without requiring segmentation masks.
  - Mechanism: PRS first divides the source image into a uniform grid, generates patches at three scales for each grid cell, selects the patches that match the source text through TMPS, and builds a rough foreground mask \(M^{fg}\) via a voting mechanism. The core of TMPS is a two-stage selection: it first computes the cosine similarity between each patch feature and the text feature and keeps the Top-M patches, then computes their average feature vector \(f_{avg}\) for a second screening (similarity to \(f_{avg}\) > 0.8 and ranked in the top K/2).
  - Design Motivation: CLIP's cross-modal alignment capability replaces a traditional segmentation network, and the coarse localization from PRS improves the efficiency and accuracy of the subsequent TMPS.
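The two-stage selection described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `tmps_select`, the parameters `top_m` and `top_k`, and the choice to run the second stage only on the stage-1 survivors are all hypothetical, and the feature vectors stand in for CLIP embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def tmps_select(patch_feats, text_feat, top_m, top_k, sim_thresh=0.8):
    """Return indices of patches kept by a two-stage selection (sketch).

    Stage 1: keep the top_m patches most similar to the text feature.
    Stage 2: average the stage-1 features, then keep patches that rank in
    the top top_k // 2 by similarity to that average AND exceed sim_thresh.
    Assumption: stage 2 only considers the stage-1 survivors.
    """
    # Stage 1: rank every patch by similarity to the text embedding.
    by_text = sorted(
        range(len(patch_feats)),
        key=lambda i: cosine(patch_feats[i], text_feat),
        reverse=True,
    )
    stage1 = by_text[:top_m]

    # Average feature of the stage-1 patches (f_avg in the text).
    dim = len(text_feat)
    f_avg = [
        sum(patch_feats[i][d] for i in stage1) / len(stage1)
        for d in range(dim)
    ]

    # Stage 2: rank survivors by similarity to f_avg, then apply both filters.
    by_avg = sorted(
        stage1,
        key=lambda i: cosine(patch_feats[i], f_avg),
        reverse=True,
    )
    ranked = by_avg[: top_k // 2]
    return [i for i in ranked if cosine(patch_feats[i], f_avg) > sim_thresh]
```

In a real pipeline the inputs would be CLIP image embeddings of the cropped patches and the CLIP text embedding of the source text; the toy 2-D vectors used for testing only exercise the ranking logic.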
- Patch-level Co-directional loss (PCD Loss):
  - Function: Guides the direction of style transformation of the foreground object in the CLIP feature space while maintaining semantic consistency.
  - Mechanism: The PCD loss comprises two sub-losses. The patch directional loss \(\mathcal{L}_{dir}\) ensures that the direction of change for each patch in the CLIP embedding space matches the text direction (measured by cosine similarity). The patch distribution consistency loss \(\mathcal{L}_{con}\) uses Jensen-Shannon divergence to align the CLIP feature distributions of patches in the source image and the stylized image. The target text is generated from a combination of the source text and style text using a central word selection technique.
  - Design Motivation: Traditional directional loss focuses only on vector directions and neglects semantic information, which can lead to semantic collapse and information distortion among patches. The distribution consistency constraint prevents this degradation, ensuring that the edited region maintains a feature distribution consistent with the source image.
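The two sub-losses can be illustrated with a small, dependency-free sketch. The directional term below follows the common CLIP directional-loss form (one minus the cosine between the image-feature change and the text-feature change). The consistency term is an assumption: the summary does not say how feature distributions are formed, so this sketch turns each patch feature into a distribution with a softmax before computing the Jensen-Shannon divergence. All function names are hypothetical.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def directional_loss(src_feats, sty_feats, src_text, tgt_text):
    """Mean over patches of 1 - cos(image direction, text direction)."""
    text_dir = _sub(tgt_text, src_text)
    terms = [
        1.0 - _cosine(_sub(sty, src), text_dir)
        for src, sty in zip(src_feats, sty_feats)
    ]
    return sum(terms) / len(terms)


def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    z = sum(exps)
    return [e / z for e in exps]


def _kl(p, q):
    # Terms with p_i == 0 contribute nothing to the KL sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * _kl(p, mid) + 0.5 * _kl(q, mid)


def consistency_loss(src_feats, sty_feats):
    """Mean JS divergence between softmax-normalized patch features.

    Assumption: softmax normalization is a stand-in for however the
    paper builds its patch feature distributions.
    """
    terms = [
        js_divergence(softmax(src), softmax(sty))
        for src, sty in zip(src_feats, sty_feats)
    ]
    return sum(terms) / len(terms)
```

With the natural logarithm the JS divergence is bounded by ln 2, so the consistency term stays on a fixed scale regardless of feature magnitude, which makes a single fixed weight such as \(\lambda_{con}\) easier to reason about.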
- Adaptive Background Preservation loss (ABP Loss):
  - Function: Keeps the original style and structure of the background region unaffected by editing.
  - Mechanism: In each iteration, the foreground mask \(M^{fg*}\) is dynamically updated using the patches selected by TMPS (via cumulative OR operations), and the background mask is defined as \(M^{bg*} = 1 - M^{fg*}\). MS-SSIM and L1 losses are applied to the background region to constrain the background of the stylized image to match the original image.
  - Design Motivation: Foreground localization is a dynamic process of gradual refinement, meaning the background mask also needs to be adaptively updated rather than using a fixed, static mask.
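The mask bookkeeping can be sketched as follows. This is an illustration only: patches are assumed to be axis-aligned boxes given as `(top, left, height, width)`, images are single-channel nested lists, and only the L1 part of the background term is shown (the MS-SSIM part is omitted). The function names are hypothetical.

```python
def update_foreground(mask_fg, selected_boxes):
    """OR the newly selected patch boxes into the running foreground mask.

    mask_fg is a 2-D list of 0/1 values; a new mask is returned so the
    caller can keep the previous iteration's mask if needed.
    """
    rows, cols = len(mask_fg), len(mask_fg[0])
    new_mask = [row[:] for row in mask_fg]
    for top, left, height, width in selected_boxes:
        for r in range(top, min(top + height, rows)):
            for c in range(left, min(left + width, cols)):
                new_mask[r][c] = 1  # cumulative OR: pixels are never cleared
    return new_mask


def background_mask(mask_fg):
    """M_bg = 1 - M_fg, element-wise."""
    return [[1 - v for v in row] for row in mask_fg]


def masked_l1(source, stylized, mask_bg):
    """Mean absolute difference over pixels where mask_bg == 1."""
    total, count = 0.0, 0
    for src_row, sty_row, m_row in zip(source, stylized, mask_bg):
        for s, t, m in zip(src_row, sty_row, m_row):
            if m:
                total += abs(s - t)
                count += 1
    return total / count if count else 0.0
```

Because the update is a cumulative OR, the foreground can only grow across iterations, so the background region that the loss constrains shrinks monotonically as localization is refined.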
Loss & Training
Total loss function: \(\mathcal{L}_{total} = \mathcal{L}_{pcd} + \lambda_{abp}\mathcal{L}_{abp} + \lambda_c\mathcal{L}_c + \lambda_{tv}\mathcal{L}_{tv}\)
Where \(\mathcal{L}_{pcd} = \lambda_{dir}\mathcal{L}_{dir} + \lambda_{con}\mathcal{L}_{con}\), \(\lambda_{dir} = 1.5 \times 10^4\), \(\lambda_{con} = 3 \times 10^4\), \(\lambda_{abp} = 3 \times 10^4\), \(\lambda_c = 4 \times 10^2\), and \(\lambda_{tv} = 2 \times 10^{-3}\).
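As a quick sanity check on how the weights combine, the total loss can be written as a small function. The weights are the ones listed above; the function name `total_loss` and the dictionary layout are hypothetical, and the five inputs are assumed to be already-computed scalar loss values.

```python
# Loss weights as reported in the text above.
LAMBDA = {"dir": 1.5e4, "con": 3e4, "abp": 3e4, "c": 4e2, "tv": 2e-3}


def total_loss(l_dir, l_con, l_abp, l_c, l_tv, lam=LAMBDA):
    """Weighted sum: L_pcd + lam_abp * L_abp + lam_c * L_c + lam_tv * L_tv."""
    # The PCD term carries its own internal weights and enters unscaled.
    l_pcd = lam["dir"] * l_dir + lam["con"] * l_con
    return l_pcd + lam["abp"] * l_abp + lam["c"] * l_c + lam["tv"] * l_tv
```

The weights span seven orders of magnitude, so the raw magnitudes of the individual terms matter: a TV loss of 1 contributes 0.002, while a directional loss of 1 contributes 15,000.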
Training details: The network is optimized with Adam at an initial learning rate of \(5 \times 10^{-4}\) for 200 iterations in total (the first 20 form the PRS phase), and the learning rate is halved after 100 iterations. Each source image is optimized independently, which takes approximately 45 seconds per image on an A6000 GPU. The ViT-B/32 CLIP model is used with an input resolution of 512×512. Content loss uses the conv4_2 and conv5_2 features of VGG-19.
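The schedule can be made concrete with two helper functions. This sketch assumes 0-indexed iterations and reads "halved after 100 iterations" as a single step decay applied from iteration 100 onward; the names `learning_rate` and `phase` and the label "main" for the post-PRS iterations are hypothetical.

```python
TOTAL_ITERS = 200   # total optimization steps per image
PRS_ITERS = 20      # initial coarse-localization phase
HALVE_AFTER = 100   # learning rate is halved from this iteration on
BASE_LR = 5e-4


def learning_rate(iteration, base_lr=BASE_LR, halve_after=HALVE_AFTER):
    """Step schedule: base_lr for the first half, base_lr / 2 afterwards."""
    return base_lr if iteration < halve_after else base_lr * 0.5


def phase(iteration, prs_iters=PRS_ITERS):
    """Which localization phase a (0-indexed) iteration falls in."""
    return "PRS" if iteration < prs_iters else "main"
```

Under these assumptions, 20 of the 200 iterations are spent on coarse localization and the remaining 180 on the main optimization, with the lower learning rate covering the last 100.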
Key Experimental Results
Main Results
| Method | SimF↑ | ConF↓ | L1B↓ | SSIMB↑ | PSNRB↑ |
|---|---|---|---|---|---|
| Text2LIVE | 0.32 | 4.13 | 0.14 | 0.87 | 24.69 |
| CLIPstyler | 0.28 | 5.16 | 0.66 | 0.51 | 13.20 |
| Instruct Pix2Pix | 0.22 | 7.42 | 0.44 | 0.62 | 17.25 |
| Null-text Inv. | 0.20 | 4.22 | 0.16 | 0.74 | 23.48 |
| Plug and Play | 0.23 | 6.51 | 0.33 | 0.63 | 18.26 |
| LEDITS++ | 0.22 | 6.81 | 0.18 | 0.74 | 21.66 |
| Style-Editor | 0.33 | 3.75 | 0.10 | 0.90 | 27.65 |
The evaluation is based on the MSCOCO 2017 dataset (16 images × 10 style texts = 160 stylized images), using GT segmentation masks to separate foreground/background for assessment.
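The background metrics can be illustrated with a short sketch that restricts the comparison to pixels outside the ground-truth foreground mask. The exact definitions used in the paper are not given in this summary, so the following is an assumption: L1B is taken as the mean absolute error and PSNRB as the standard PSNR, both over background pixels only, with pixel values in [0, 1]. SSIMB and the foreground metrics are not covered, and the function name is hypothetical.

```python
import math


def background_l1_psnr(source, output, fg_mask, max_val=1.0):
    """Return (mean L1, PSNR in dB) over pixels where fg_mask == 0.

    source, output: 2-D lists of floats in [0, max_val].
    fg_mask: 2-D list of 0/1 values, 1 marking the edited object.
    """
    diffs = [
        s - o
        for src_row, out_row, m_row in zip(source, output, fg_mask)
        for s, o, m in zip(src_row, out_row, m_row)
        if not m
    ]
    if not diffs:
        raise ValueError("mask leaves no background pixels to evaluate")
    l1 = sum(abs(d) for d in diffs) / len(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    # Identical backgrounds give zero error, i.e. unbounded PSNR.
    psnr = math.inf if mse == 0 else 10 * math.log10(max_val ** 2 / mse)
    return l1, psnr
```

A uniform background error of 0.1 on a [0, 1] scale gives an L1 of 0.1 and a PSNR of 20 dB, which helps calibrate the table: the gap between roughly 13 dB and 27 dB corresponds to a large difference in visible background drift.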
Ablation Study
| Configuration | SimF↑ | ConF↓ | L1B↓ | PSNRB↑ | Description |
|---|---|---|---|---|---|
| (a) baseline | 0.29 | 4.31 | 0.60 | 14.16 | Random patch + no module |
| (b) +Ldir | 0.32 | 4.72 | 0.49 | 16.02 | Directional loss is effective |
| (c) +Ldir+Lcon | 0.33 | 4.62 | 0.48 | 16.16 | Distribution consistency preserves details |
| (d) +Ldir+Labp | 0.32 | 4.16 | 0.10 | 27.28 | Background preservation is significantly improved |
| (e) All | 0.33 | 3.75 | 0.10 | 27.65 | Optimal |
Key Findings
- ABP loss contributes the most to background preservation: adding it to configuration (c) to obtain (e) reduces L1B from 0.48 to 0.10 and raises PSNRB from 16.16 to 27.65, an improvement of over 11 dB.
- The distribution consistency loss \(\mathcal{L}_{con}\) within the PCD loss effectively prevents the loss of object details (e.g., chair shadows, hat shapes).
- In comparison with 11 alternative methods, Style-Editor achieves the best results on both foreground style matching and background preservation.
- Compared to mask-based generative models (such as Blended Diffusion), Style-Editor achieves better object structure preservation without requiring mask inputs.
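The background-preservation gains quoted above can be recomputed directly from the ablation table. The snippet below only copies the table's L1B and PSNRB values for configurations (b) through (e); the dictionary layout and the function name `abp_gain` are hypothetical.

```python
# L1B and PSNRB values copied from the ablation table.
ABLATION = {
    "b": {"L1B": 0.49, "PSNRB": 16.02},  # +Ldir
    "c": {"L1B": 0.48, "PSNRB": 16.16},  # +Ldir+Lcon
    "d": {"L1B": 0.10, "PSNRB": 27.28},  # +Ldir+Labp
    "e": {"L1B": 0.10, "PSNRB": 27.65},  # all losses
}


def abp_gain(without_abp, with_abp, table=ABLATION):
    """Change in background metrics when the ABP loss is added."""
    before, after = table[without_abp], table[with_abp]
    return {
        "L1B_drop": round(before["L1B"] - after["L1B"], 2),
        "PSNRB_gain_dB": round(after["PSNRB"] - before["PSNRB"], 2),
    }
```

Both pairings that differ only by the ABP loss, (c) to (e) and (b) to (d), show a PSNRB gain above 11 dB (11.49 dB and 11.26 dB respectively), so the effect does not depend on whether the consistency loss is present.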
Highlights & Insights
- Zero-mask design: Leverages CLIP's zero-shot capability to replace segmentation networks for object localization, reducing user interaction complexity.
- Two-stage localization (PRS+TMPS): The coarse-to-fine strategy balances efficiency and accuracy.
- Per-image optimization paradigm: Takes ~45 seconds per image, making it suitable for on-demand editing scenarios.
- Distribution consistency constraint is a crucial improvement over traditional directional losses, upgrading from "direction-only" to "direction + distribution".
Limitations & Future Work
- Object localization relies entirely on CLIP, which may fail for small objects or complex scenes that are difficult for CLIP to identify.
- The per-image optimization paradigm limits its potential for real-time applications.
- It only supports style editing at the texture/color levels and cannot alter geometric structures.
- Future work can explore integration with segmentation foundation models such as SAM.
Related Work & Insights
- vs CLIPstyler: CLIPstyler performs global style editing on the entire image, whereas Style-Editor achieves precise region selection via TMPS.
- vs Text2LIVE: Text2LIVE uses a layered representation but may edit non-target regions, offering inferior style fidelity compared to this work.
- vs Diffusion-based methods: Diffusion models tend to alter object content structures, yielding significantly worse ConF metrics compared to this work.
Rating
- Novelty: ⭐⭐⭐⭐ The TMPS/PRS zero-mask localization and the distribution consistency constraint in the PCD loss together constitute a significant innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comparison with 11 methods is thorough, with complete ablation and comprehensive evaluation metrics covering both foreground and background.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, complete pseudocode provided, and rich illustrations.
- Value: ⭐⭐⭐⭐ Mask-free object-level style editing solution with high industrial application potential.