EmoEdit: Evoking Emotions through Image Manipulation
Conference: CVPR 2025
arXiv: 2405.12661
Code: None
Area: Diffusion Models
Keywords: Affective Image Manipulation, Emotion Adapter, Diffusion Model Editing, Content-Aware, Visual Emotion Analysis
TL;DR
This paper proposes EmoEdit, presented as the first image manipulation framework that evokes a specified emotion by modifying image content rather than only adjusting color or style. It constructs the EmoEditSet dataset with 40,120 image pairs, designs a plug-and-play Emotion Adapter, and balances structural preservation against emotional evocation better than the compared baselines.
Background & Motivation
Background: Affective Image Manipulation (AIM) aims to modify user-provided images to evoke specific emotions. Existing methods mainly alter colors and styles, such as CLVA and AIF, which convert real images into artistic styles. While diffusion models excel in image editing, they lack knowledge concerning emotional manipulation.
Limitations of Prior Work: (1) Color and style adjustments fail to evoke precise and deep emotional changes; psychological research indicates that visual content, rather than color alone, is the key emotional stimulus. (2) Existing AIM methods are mostly limited to binary emotion classification (positive/negative), which lacks granularity. (3) DALL-E 3 can convey emotions but fails to preserve the source image structure, whereas InstructPix2Pix (IP2P) preserves structure but lacks emotional expression, which shows that emotional evocation and structural preservation pull in opposite directions.
Key Challenge: Emotional evocation requires meaningful content modification (e.g., adding butterflies to convey contentment), but excessive modification compromises the source image structure. It is necessary to find a method to automatically select "appropriate and context-matched" emotional semantic modifications.
Goal: (1) Address the lack of large-scale AIM datasets by automatically constructing high-quality paired emotional data. (2) Endow diffusion models with emotion awareness. (3) Enable automatic selection of appropriate content modifications from an emotion word alone, without requiring specific editing instructions.
Key Insight: Drawing from the psychological insight that "visual content acts as the emotional stimulus," clustering is performed on EmoSet to construct "emotion factor trees" for eight emotions. Each emotion corresponds to multiple semantic representations (e.g., "contentment" \(\rightarrow\) books and flowers, rainbows, butterflies). An Emotion Adapter is then trained in a data-driven manner to learn context-based semantic selection.
Core Idea: Constructing an emotion factor tree and a large-scale dataset to teach diffusion models "what kind of content modifications evoke which emotions," achieving content-aware editing driven solely by emotion words.
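To make the "emotion factor tree" idea concrete, here is a minimal sketch of how a small slice of such a tree might be represented. The emotions, the four category names, and the example factors are taken from this note; which category each factor is filed under, and the helper `candidate_factors`, are assumptions made purely for illustration.

```python
# Hypothetical slice of an emotion factor tree as a nested dict.
# Emotion names, category names, and example factors come from this note;
# the assignment of each factor to a category is an illustrative assumption.
EMOTION_FACTOR_TREE = {
    "contentment": {
        "objects": ["books and flowers", "butterflies"],
        "scenes": ["rainbows"],
        "actions": [],
        "expressions": [],
    },
    "fear": {
        "objects": ["ghosts"],
        "scenes": ["darkness", "storms"],
        "actions": [],
        "expressions": [],
    },
}


def candidate_factors(tree, emotion):
    """Flatten every factor listed under one emotion, in category order."""
    return [factor for factors in tree[emotion].values() for factor in factors]


print(candidate_factors(EMOTION_FACTOR_TREE, "contentment"))
# ['books and flowers', 'butterflies', 'rainbows']
```

Each emotion maps to several candidate factors, which is why a later component has to choose among them based on the input image.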
Method
Overall Architecture
EmoEdit consists of two main steps. (1) EmoEditSet dataset construction: extract emotion factor trees by clustering EmoSet, generate source-target image pairs with IP2P, and filter them with four automatic metrics followed by manual review. (2) Emotion Adapter training: design an emotion adapter based on the Q-Former architecture and train it with a combination of diffusion loss and instruction loss, so that it can be plugged into various diffusion models. During inference, only the input image and the target emotion word are required.
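The automatic part of the filtering step can be sketched as a simple predicate. The thresholds below are the ones this note reports for EmoEditSet (CLIP image similarity in 0.75-0.9, CLIP text similarity above 0.25, emotion score above 0.3). The aesthetic threshold is not stated here, so `min_aesthetic` is a hypothetical parameter, and the `Candidate` class, its field names, and the inclusive range bounds are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Precomputed scores for one generated target image (hypothetical container)."""
    clip_image_sim: float   # CLIP similarity between source and target image
    clip_text_sim: float    # CLIP text similarity, as named in this note
    emotion_score: float    # score for the target emotion
    aesthetic_score: float  # aesthetic predictor output


def passes_filter(c: Candidate, min_aesthetic: float) -> bool:
    """Apply the four automatic checks; manual review would follow separately."""
    return (
        0.75 <= c.clip_image_sim <= 0.9      # changed enough, but not unrecognizable
        and c.clip_text_sim > 0.25           # edit matches the instruction
        and c.emotion_score > 0.3            # target emotion is actually evoked
        and c.aesthetic_score >= min_aesthetic  # threshold value is an assumption
    )


candidates = [
    Candidate(0.82, 0.31, 0.45, 5.6),  # passes every check
    Candidate(0.95, 0.31, 0.45, 5.6),  # too similar to the source image
    Candidate(0.82, 0.20, 0.45, 5.6),  # weak match to the instruction text
]
kept = [c for c in candidates if passes_filter(c, min_aesthetic=5.0)]
print(len(kept))  # 1
```

The upper bound on image similarity is what rejects near-identical outputs, while the lower bound rejects edits that destroy the source structure.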
Key Designs
- Emotion Factor Tree and EmoEditSet Dataset Construction:
    - Function: Provide large-scale, semantically diverse paired data for affective manipulation.
    - Mechanism: Cluster EmoSet images with CLIP semantic embeddings for each of eight emotions (amusement, awe, contentment, excitement, anger, disgust, fear, sadness) to extract representative visual factors. Use GPT-4V to generate a content summary for each cluster and categorize the summaries into objects, scenes, actions, and expressions, which yields a hierarchical "emotion factor tree." Then collect 15,531 source images (from MagicBrush, MA5K, and Unsplash) and run IP2P with emotion factors as instructions to generate target candidates. Filter the candidates with four automatic metrics, namely CLIP image similarity (0.75-0.9), CLIP text similarity (>0.25), emotion score (>0.3), and aesthetic score, followed by manual review. This yields 40,120 pairs, an average of about 2.6 emotional directions per source image.
    - Design Motivation: The lack of large-scale AIM data is a fundamental bottleneck. Obtaining emotion factors through clustering instead of manual labeling, combined with automatic generation and multi-dimensional filtering, allows high-quality data to be constructed at scale.
- Emotion Adapter:
    - Function: Endow diffusion models with emotion awareness so that they automatically select the emotional semantic representation that best fits the input image.
    - Mechanism: Built on the Q-Former architecture. Learnable queries \(q\) serve as an "emotion dictionary," with the target emotion \(e_t\) and the input image \(e_i\) acting as indices. The self-attention layer first selects relevant semantics based on the target emotion: \(A_s = \text{softmax}(\frac{[q;e_t]W_q^s([q;e_t]W_k^s)^T}{\sqrt{d_k}})[q;e_t]W_v^s\). Then, the cross-attention layer integrates image information to select the best-matching representation: \(A_c = \text{softmax}(\frac{A_s W_q^c(e_i W_k^c)^T}{\sqrt{d_k}})e_i W_v^c\). After multiple iterations, the emotional embedding \(c_e\) is output and injected as a condition into the denoising process of IP2P.
    - Design Motivation: Each emotion corresponds to multiple semantic representations (e.g., "fear" can be ghosts, darkness, or storms), which need to be selected dynamically based on the input image content. The query mechanism of Q-Former is naturally suited to this kind of conditional retrieval from a dictionary.
- Instruction Loss:
    - Function: Capture semantic changes in emotional data pairs and prevent the model from relying solely on pixel-level color adjustments.
    - Mechanism: In addition to the standard diffusion loss \(\mathcal{L}_{LDM}\), an instruction loss is incorporated: \(\mathcal{L}_{ins} = \frac{1}{M}\|c_e - \mathcal{E}_{txt}(t_{ins})\|_2^2\), with \(M\) a normalizing constant. It aligns the emotional embedding \(c_e\) output by the Emotion Adapter with the text encoding \(\mathcal{E}_{txt}(t_{ins})\) of the corresponding content instruction in the dataset (e.g., "add colorful butterflies"). The total loss is \(\mathcal{L} = \mathcal{L}_{LDM} + \mathcal{L}_{ins}\).
    - Design Motivation: When trained solely with diffusion loss, the model tends toward color and texture adjustments (the pixel-level optimum), which produces color artifacts without meaningful content modifications. Instruction loss enforces content changes through semantic-level supervision.
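The two attention formulas for the Emotion Adapter can be traced in a minimal, dependency-free sketch. It follows the equations as written in this note (all rows of \(A_s\) feed the cross-attention) and deliberately omits residual connections, normalization, feed-forward layers, multiple heads, and stacking. All shapes, the random weights, and the helper names (`attention`, `adapter_block`, `rand_matrix`) are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x m) by an (m x p) matrix given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(x_q, x_kv, w_q, w_k, w_v):
    """softmax((x_q W_q)(x_kv W_k)^T / sqrt(d_k)) (x_kv W_v)."""
    q, k, v = matmul(x_q, w_q), matmul(x_kv, w_k), matmul(x_kv, w_v)
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)


def adapter_block(queries, e_t, e_i, weights):
    """One self-attention over [q; e_t], then cross-attention onto image tokens e_i."""
    seq = queries + e_t                           # row-wise concatenation [q; e_t]
    a_s = attention(seq, seq, *weights["self"])   # A_s in the note
    return attention(a_s, e_i, *weights["cross"])  # A_c in the note


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 4                                   # toy embedding width (assumption)
queries = rand_matrix(3, d, rng)        # 3 learnable queries
e_t = rand_matrix(1, d, rng)            # 1 target-emotion token
e_i = rand_matrix(5, d, rng)            # 5 image tokens
weights = {name: [rand_matrix(d, d, rng) for _ in range(3)] for name in ("self", "cross")}

c_e = adapter_block(queries, e_t, e_i, weights)
print(len(c_e), len(c_e[0]))  # 4 4
```

The self-attention step lets the emotion token reweight the queries, and the cross-attention step then mixes in image tokens, which mirrors the "select by emotion, then select by image" reading of the dictionary analogy.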
Loss & Training
During training, the IP2P parameters are frozen, and only the Emotion Adapter is trained. Diffusion loss is used to ensure pixel-level fidelity, while instruction loss ensures semantic-level emotional expression. During inference, the image guidance scale can be adjusted to control the balance between emotional intensity and structural preservation.
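The combined objective, \(\mathcal{L} = \mathcal{L}_{LDM} + \mathcal{L}_{ins}\) with \(\mathcal{L}_{ins}\) a scaled squared distance between the adapter output \(c_e\) and the text encoding of the content instruction, can be sketched as follows. Taking \(M\) to be the number of embedding elements is an assumption, and the diffusion loss is passed in as a precomputed number because computing it requires the frozen IP2P model.

```python
def instruction_loss(c_e, text_emb):
    """Squared L2 distance between two embeddings, divided by M.

    M is taken to be the total number of elements (an assumption),
    which makes this the ordinary mean squared error.
    """
    pred = [v for row in c_e for v in row]
    target = [v for row in text_emb for v in row]
    if len(pred) != len(target):
        raise ValueError("embeddings must have the same shape")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(ldm_loss, c_e, text_emb):
    """L = L_LDM + L_ins, with equal weighting as written in this note."""
    return ldm_loss + instruction_loss(c_e, text_emb)


c_e = [[1.0, 2.0], [3.0, 4.0]]        # toy adapter output
text_emb = [[1.0, 2.0], [3.0, 2.0]]   # toy encoding of the content instruction
print(instruction_loss(c_e, text_emb))  # 1.0
print(total_loss(0.5, c_e, text_emb))   # 1.5
```

Because only the adapter is trainable, gradients from both terms would update the adapter alone: the diffusion term pulls toward the target image, and the instruction term pulls \(c_e\) toward a concrete content edit.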
Key Experimental Results
Main Results
Evaluated on 405 test images (8 emotional directions, 3,240 pairs in total):
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | CLIP-I↑ | Emo-A↑ | Emo-S↑ |
|---|---|---|---|---|---|---|
| SDEdit | 15.43 | 0.415 | 0.459 | 0.638 | 38.21% | 0.221 |
| PnP | 14.41 | 0.436 | 0.381 | 0.851 | 23.83% | 0.095 |
| ControlNet | 11.98 | 0.292 | 0.603 | 0.686 | 36.33% | 0.213 |
| CLVA | 12.61 | 0.397 | 0.479 | 0.757 | 14.04% | 0.017 |
| AIF | 14.05 | 0.537 | 0.493 | 0.828 | 12.74% | 0.004 |
| EmoEdit | 16.62 | 0.571 | 0.289 | 0.828 | 50.09% | 0.335 |
Human Evaluation
| Method | Structure Preservation↑ | Emotion Fidelity↑ | Overall Balance↑ |
|---|---|---|---|
| SDEdit | 11.71% | 10.85% | 5.07% |
| BlipDiff | 15.12% | 8.35% | 4.88% |
| EmoEdit | 70.12% | 75.73% | 89.12% |
Key Findings
- EmoEdit performs best on all three pixel-level metrics (PSNR 16.62, SSIM 0.571, LPIPS 0.289), and its emotion accuracy (50.09%) is about 12 percentage points above the next best method (SDEdit, 38.21%). On CLIP-I it ties AIF at 0.828 and trails PnP (0.851).
- The emotion gain score Emo-S (0.335) is about 52% higher than that of the next best method, SDEdit (0.221), indicating more effective emotional modifications.
- Ablation studies confirm that the Emotion Adapter is indispensable (the image barely changes without it), instruction loss ensures semantic clarity, and diffusion loss guarantees structural preservation.
- The Emotion Adapter can be directly plugged into other models like ControlNet and BlipDiff to enhance their emotional capabilities.
- It can be extended to stylized image generation (combined with Composable Diffusion), preserving artistic styles while evoking emotions.
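The relative figures quoted in these findings can be re-derived from the tables above with a few lines of arithmetic. This is only a sanity check on the reported numbers, and the helper name `relative_gain` is made up for the example.

```python
def relative_gain(ours, baseline):
    """Fractional improvement of `ours` over `baseline`."""
    return (ours - baseline) / baseline


emo_s_gain = relative_gain(0.335, 0.221)  # EmoEdit vs. SDEdit on Emo-S
emo_a_gap = 50.09 - 38.21                 # Emo-A gap in percentage points
num_pairs = 405 * 8                       # test images x emotional directions

print(f"{emo_s_gain:.1%}", f"{emo_a_gap:.2f}", num_pairs)
# 51.6% 11.88 3240
```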
Highlights & Insights
- Pioneering Introduction of Content Modification for Affective Manipulation: Going beyond color/style adjustments, a systematic "emotion factor tree" is constructed based on psychological principles. This data construction paradigm can be extended to editing other abstract attributes.
- Plug-and-play Design of Emotion Adapter: Once trained, it can be directly integrated into any IP2P or Stable Diffusion-based editing/generation model. This modular emotional enhancement paradigm holds broad application value.
- EmoEditSet Dataset Contribution: The dataset contains 40,120 pairs with emotional directions and content instructions, serving as a foundational benchmark for the AIM field.
Limitations & Future Work
- It only supports Mikels' 8 emotional categories, whereas real-world emotions are far more complex and fine-grained.
- The emotion factor tree highly depends on the coverage of EmoSet, which may introduce bias.
- CLIP image similarity is not sufficiently precise as a proxy metric for structural preservation.
- Dataset construction relies on IP2P generation, constrained by IP2P's own editing capabilities and quality limits.
- Users cannot specify concrete modification details; the edit relies entirely on the model's automatic selection, which sometimes picks semantics that do not suit the image.
Related Work & Insights
- vs CLVA / AIF: These are style-transfer-based AIM methods that only modify colors and styles, resulting in limited emotional effectiveness (Emo-S < 0.02). EmoEdit achieves stronger emotional evocation through content modification.
- vs InstructPix2Pix: IP2P performs edits based on specific instructions but does not comprehend emotions; EmoEdit's Emotion Adapter injects emotional knowledge into IP2P, allowing it to function using only emotion words.
- vs SDEdit: SDEdit shows some emotional understanding (Emo-A 38.21%) but disrupts image structure (PSNR 15.43, SSIM 0.415, LPIPS 0.459). EmoEdit achieves stronger emotional effects (Emo-A 50.09%) while preserving structure better (PSNR 16.62, SSIM 0.571, LPIPS 0.289).
Rating
- Novelty: ⭐⭐⭐⭐ Content-aware AIM from a psychological perspective is a new direction, and the emotion factor tree + Adapter design is creative.
- Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative + qualitative + user studies + ablations + cross-model transfer; comprehensive, though lacking large-scale comparisons.
- Writing Quality: ⭐⭐⭐⭐ Clear storyline, intuitive charts, and detailed description of the dataset construction process.
- Value: ⭐⭐⭐⭐ Opens up a new direction for content-aware AIM; the dataset and Adapter hold value for the community.