Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception¶

Conference: CVPR 2025
arXiv: 2603.11556
Code: To be confirmed
Area: Image Editing / Aesthetic Enhancement
Keywords: Image Aesthetic Enhancement, Diffusion Models, Multimodal Perception, Weakly Supervised Learning, ControlNet

TL;DR¶

This paper proposes DIAE, which uses a Multimodal Aesthetic Perception (MAP) module to translate vague aesthetic instructions into multimodal control signals (HSV/contour maps + text). It constructs an "imperfectly-paired" dataset, IIAEData, and trains with a dual-branch supervision strategy for weakly-supervised aesthetic enhancement, reporting the best LAION and MLLM aesthetic scores among the compared methods.

Background & Motivation¶

Background: Image Aesthetic Enhancement (IAE) requires models to possess aesthetic perception and creative editing capabilities—identifying defects in color, composition, and lighting, and improving them.

Limitations of Prior Work: (1) Text encoders in diffusion models struggle to comprehend abstract aesthetic instructions; (2) There is a lack of "perfectly-paired" training data (characterized by identical content but different aesthetics).

Key Challenge: Aesthetics represent high-level human perception, which is difficult to convey to generative models purely through text.

Goal: To enable diffusion models to understand and execute aesthetic enhancement despite the lack of perfectly-paired training data.

Key Insight: Utilizing visual modalities to assist text in conveying aesthetics, combined with weakly-supervised learning using "imperfectly-paired" images.

Core Idea: Aesthetic perception = text + visual representation; weak supervision = dual-branch training.

Method¶

Overall Architecture¶

Three parts: IIAEData dataset (LLaVA matching + UNIAA evaluation) -> MAP Multimodal Aesthetic Perception (HSV + contour + text control signals) -> Dual-branch supervision (semantic preservation for content + aesthetic learning for quality).

Key Designs¶

IIAEData Dataset
- Function: To construct a weakly-paired training set with identical semantic content but different aesthetics.
- Mechanism: Filtering high/low MOS from AVA/TAD66K/KonIQ/FLICKR -> generating captions with LLaVA -> semantic matching -> UNIAA aesthetic evaluation.
- Scale: 47.5K (45K for training + 1.5K for testing).
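
The pairing step can be pictured roughly as follows. This is a minimal sketch under loud assumptions: captions are taken as already generated (by LLaVA in the paper), token-overlap Jaccard similarity stands in for the semantic matching, and the `Photo` record, the `pair_images` helper, and the `min_sim` and `mos_gap` thresholds are hypothetical names and values that do not come from the paper. The UNIAA evaluation step is omitted.

```python
from dataclasses import dataclass


@dataclass
class Photo:
    name: str
    caption: str  # assumed to come from a captioner such as LLaVA
    mos: float    # mean opinion score from the source dataset


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for semantic matching."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def pair_images(photos, min_sim=0.5, mos_gap=2.0):
    """Pair each low-MOS photo with its most similar, clearly higher-MOS photo."""
    pairs = []
    for low in photos:
        best, best_sim = None, min_sim
        for high in photos:
            if high.mos - low.mos < mos_gap:
                continue  # the reference must be clearly more aesthetic
            sim = jaccard(low.caption, high.caption)
            if sim >= best_sim:
                best, best_sim = high, sim
        if best is not None:
            pairs.append((low.name, best.name))
    return pairs


# Toy data (hypothetical): two shots of similar content plus one unrelated image.
photos = [
    Photo("a.jpg", "a red boat on a calm lake", 3.1),
    Photo("b.jpg", "a red boat on a lake at sunset", 7.4),
    Photo("c.jpg", "a dog running on grass", 7.9),
]
pairs = pair_images(photos)
print(pairs)
```

The resulting pairs share content only approximately, which is what makes them "imperfectly-paired": the reference differs in aesthetics but is not a pixel-aligned edit of the input.
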
Multimodal Aesthetic Perception (MAP)
- Function: To translate vague aesthetic text instructions into comprehensible multimodal control signals.
- Mechanism: Utilizing HSV maps + text for color, and contour maps + text for structure. CNNs extract visual features, CLIP encodes text, and ControlNet injects these into the UNet.
- Design Motivation: HSV maps align closely with human color perception, while contour maps emphasize spatial arrangement.
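
To make the two visual control signals concrete, here is a minimal NumPy sketch of how an HSV map and a contour map could be derived from an RGB image. The function names are illustrative, and the finite-difference edge detector with a fixed threshold is an assumption standing in for whatever contour extractor the paper uses. The CNN encoders, CLIP text encoding, and ControlNet injection are not shown.

```python
import numpy as np


def rgb_to_hsv(rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) float RGB image in [0, 1] to HSV in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = rgb.max(axis=-1)
    c = v - rgb.min(axis=-1)                 # chroma
    s = np.where(v > 0, c / np.where(v > 0, v, 1.0), 0.0)
    safe_c = np.where(c > 0, c, 1.0)         # avoid division by zero
    h = np.where(v == r, ((g - b) / safe_c) % 6.0,
        np.where(v == g, (b - r) / safe_c + 2.0,
                         (r - g) / safe_c + 4.0))
    h = np.where(c > 0, h / 6.0, 0.0)        # hue is undefined for grays; use 0
    return np.stack([h, s, v], axis=-1)


def contour_map(gray: np.ndarray, thresh: float = 0.1) -> np.ndarray:
    """Binary edge map from the finite-difference gradient magnitude."""
    gy, gx = np.gradient(gray)
    return (np.hypot(gx, gy) > thresh).astype(np.float32)


# Toy image: left half pure red, right half dark blue.
img = np.zeros((4, 4, 3), dtype=np.float32)
img[:, :2, 0] = 1.0
img[:, 2:, 2] = 0.5

hsv = rgb_to_hsv(img)             # color control signal
edges = contour_map(hsv[..., 2])  # structure control signal from the V channel
```

In this toy case the contour map fires only on the two columns next to the red/blue boundary, while the HSV map carries the hue and brightness difference between the two halves.
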
Dual-Branch Supervision
- Function: To resolve the issue where fully-supervised learning cannot be directly applied to imperfectly-paired data.
- Mechanism: The parameter t_s divides the denoising process into two phases. In the early phase, input image supervision is used to preserve semantics; throughout the entire process, reference image supervision is applied to learn aesthetics.
- Total Loss: L = L_ref + lambda * L_inp, with t_s defaulting to 900.
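
The gated loss can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: "early phase" is read as timesteps t >= t_s (the noisiest steps, which denoising visits first), both branches are written as plain mean-squared errors against generic targets derived from the reference and input images, and `lam=1.0` is a placeholder because the summary does not give the value of lambda.

```python
import numpy as np


def dual_branch_loss(pred, target_ref, target_inp, t, t_s=900, lam=1.0):
    """L = L_ref + lam * L_inp, with L_inp active only in the early phase.

    Assumptions: the early phase is t >= t_s, both terms are MSE,
    and lam=1.0 is a placeholder value.
    """
    l_ref = np.mean((pred - target_ref) ** 2)  # aesthetic branch, applied at every t
    l_inp = np.mean((pred - target_inp) ** 2)  # semantic branch, gated by t_s
    gate = 1.0 if t >= t_s else 0.0
    return float(l_ref + lam * gate * l_inp)


# Toy tensors (hypothetical values) to show the gating.
pred = np.array([0.0, 0.0])
target_ref = np.array([1.0, 1.0])
target_inp = np.array([2.0, 2.0])

early = dual_branch_loss(pred, target_ref, target_inp, t=950)  # both branches
late = dual_branch_loss(pred, target_ref, target_inp, t=500)   # reference branch only
```

Under this reading, with t_s = 900 the input-image term touches only the top tenth of a 1000-step schedule, where coarse layout is decided, and the reference-image term shapes everything else.
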

Loss & Training¶

DIAE is built on Stable Diffusion v1.5 and trained with AdamW (learning rate 1e-5) on 4 x A800 GPUs for 100K iterations.

Key Experimental Results¶

Main Results¶

Method	LAION(256)	LAION(512)	MLLM(256)	MLLM(512)	CLIP-I(256)	CLIP-I(512)
Original	4.962	5.123	3.243	3.300	1.000	1.000
InstructPix2Pix	4.991	5.396	3.264	3.325	0.764	0.690
DOODL	5.102	5.140	3.255	3.297	0.775	0.703
DIAE	5.324	6.012	3.339	3.662	0.772	0.784

Ablation Study¶

Configuration	LAION	MLLM	CLIP-I
DIAE (w/o visual)	5.250	3.343	0.623
DIAE (w/o text)	5.428	3.410	0.792
DIAE (Full)	5.668	3.501	0.778

Key Findings¶

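At 512 resolution, DIAE raises the LAION score by 17.4% (5.123 -> 6.012) and the MLLM score by 11.0% (3.300 -> 3.662) relative to the original images.
Content consistency drops sharply when the visual modality is removed (CLIP-I falls from 0.778 to 0.623).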
Performance improvement is more pronounced for images with low MOS.
DIAE does not arbitrarily add or delete content.
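
The reported percentages can be reproduced from the tables with a few lines of arithmetic. The sketch below assumes the gains are measured relative to the "Original" row. The `relative_gain` helper is illustrative.

```python
def relative_gain(new: float, base: float) -> float:
    """Percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


# Values copied from the 512-resolution columns of the main results table.
laion_gain = relative_gain(6.012, 5.123)
mllm_gain = relative_gain(3.662, 3.300)
# CLIP-I change when the visual modality is removed (ablation table).
clip_i_change = relative_gain(0.623, 0.778)

print(f"LAION: {laion_gain:+.1f}%  MLLM: {mllm_gain:+.1f}%  CLIP-I: {clip_i_change:+.1f}%")
# prints: LAION: +17.4%  MLLM: +11.0%  CLIP-I: -19.9%
```
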

Highlights & Insights¶

Aesthetics are decomposed into two dimensions, color and structure, represented through a multimodal approach combining vision and text.
The "imperfectly-paired" + weakly-supervised paradigm cleverly bypasses high dataset construction costs.
The dual-branch supervision, controlled by t_s, trades off semantic preservation against aesthetic enhancement.
It can be integrated with MLLMs to achieve end-to-end processing.

Limitations & Future Work¶

Portrait scenarios are not addressed.
The approach is based on SD v1.5.
The parameter t_s requires tuning.

Related Work & Insights¶

The "perfectly-paired" data construction of InstructPix2Pix and the "imperfectly-paired" approach of this paper are complementary, sitting at opposite ends of the supervision spectrum.
Using aesthetic scores for classifier guidance, as in DOODL/RAHF, is an alternative path, but it steers sampling rather than changing the model itself.
This work extends ControlNet's idea of injecting structural conditions into multimodal aesthetic control (HSV + contours + text).
The content-style decoupling framework of StyleDiffusion inspired the design of the dual-branch supervision in this work.
Advancements in aesthetic MLLMs like Q-ALIGN enable the automatic generation of aesthetic evaluation text, serving as inputs for the MAP module in DIAE.

Rating¶

Novelty: ⭐⭐⭐⭐ Creative problem formulation and multimodal aesthetic perception.
Experimental Thoroughness: ⭐⭐⭐ Relatively few baseline methods compared.
Writing Quality: ⭐⭐⭐ Overall clear and structured.
Value: ⭐⭐⭐⭐ High practical demand for image aesthetic enhancement.