# HAM: A Training-Free Style Transfer Approach via Heterogeneous Attention Modulation for Diffusion Models
Conference: CVPR 2026 | arXiv: 2603.24043 | Code: None | Area: Diffusion Models / Image Generation | Keywords: Style Transfer, Attention Modulation, Training-Free, Diffusion Models, Identity Preservation
## TL;DR
This paper proposes HAM, a training-free style transfer method that achieves high-quality stylization without sacrificing content identity. HAM applies heterogeneous modulation (GAR + LAT) to self-attention and cross-attention layers of diffusion models, complemented by style-injected noise initialization (SINI), attaining state-of-the-art performance across multiple metrics.
## Background & Motivation
- Background: Diffusion model-based style transfer methods fall into two main categories: fine-tuning methods (training style control modules via LoRA/ControlNet) and training-free methods (manipulating attention features at inference time). Fine-tuning is computationally expensive and lacks robustness; training-free methods such as StyleID and DiffArtist achieve stylization by injecting the style image's keys and values into self-attention layers.
- Limitations of Prior Work: Existing training-free methods rely solely on self-attention manipulation to simultaneously inject style and preserve content. However, the Q/K/V projections in self-attention jointly encode spatial positional relationships and semantic representations, so a single channel cannot balance style expression against content preservation, often yielding insufficient stylization or content distortion.
- Key Challenge: Because the Q/K/V projections inherently couple spatial structure with semantic content, injecting style through self-attention inevitably corrupts content identity, trapping existing methods in a style–content trade-off.
- Goal: Simultaneously capture complex style references and preserve content identity (structure, texture, text, etc.) in a training-free setting.
- Key Insight: Decouple style injection and content protection across different attention mechanisms: use self-attention for global style–content fusion control and cross-attention for precise local style transplantation and content preservation, thereby realizing heterogeneous modulation.
- Core Idea: By applying heterogeneous attention modulation strategies to self-attention (GAR, for global fusion) and cross-attention (LAT, for local transplantation), HAM separates style injection and content protection into distinct attention channels.
## Method

### Overall Architecture
HAM consists of three core modules: Global Attention Regulation (GAR), Local Attention Transplantation (LAT), and Style-Injected Noise Initialization (SINI). The system employs three parallel diffusion model branches: a content teacher (processing the content image), a style teacher (processing the style reference), and a student generator (producing the stylized output). SINI first constructs an initial noise that fuses style and content information. During denoising, GAR operates on self-attention layers for macro-level style–content fusion, while LAT operates on cross-attention layers for precise style/content control. The method is compatible with both SD2.1 (DDIM-based) and SD3.5 (DiT-based) architectures.
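The three-branch control flow described above can be sketched as a minimal, runnable skeleton. The toy `denoise_step` stands in for one step of a frozen diffusion model, and the averaged initialization stands in for SINI; all names and shapes here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def denoise_step(z, t):
    """Toy stand-in for one denoising step of a frozen diffusion model."""
    return 0.9 * z  # placeholder dynamics, not a real sampler

def ham_inference(z_content_T, z_style_T, num_steps=50):
    # SINI (sketched here as simple averaging; the paper uses
    # AdaIN fusion plus a weighted content residual).
    z_student = 0.5 * (z_content_T + z_style_T)
    z_c, z_s = z_content_T, z_style_T
    for t in range(num_steps):
        # In the real method, GAR modulates self-attention and LAT modulates
        # cross-attention inside each student step, using features produced
        # by the two teacher branches at the same timestep.
        z_c = denoise_step(z_c, t)              # content teacher branch
        z_s = denoise_step(z_s, t)              # style teacher branch
        z_student = denoise_step(z_student, t)  # student generator branch
    return z_student
```

Because all three branches are frozen and only their intermediate features interact, no gradients or training loops appear anywhere in the pipeline.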
### Key Designs
- Global Attention Regulation (GAR):
- Function: Achieves macro-level style introduction and content preservation in self-attention layers.
- Mechanism: AdaIN is first applied to align and fuse the self-attention projections of the content teacher \((Q^c, K^c, V^c)\) and the style teacher \((Q^s, K^s, V^s)\), producing composite projections \((Q^{cs}, K^{cs}, V^{cs})\): the content features are normalized and then rescaled with the mean and standard deviation of the style features. The composite projections are then blended with the student generator's own self-attention projections via hyperparameter \(\alpha\): \(\hat{Q} = \alpha \cdot Q^m + (1-\alpha) \cdot Q^{cs}\) (and analogously for \(K\) and \(V\)), keeping the statistical distribution of the self-attention projections aligned with the main branch throughout generation.
- Design Motivation: Directly replacing K/V in self-attention severely disrupts content structure. AdaIN-based fusion followed by weighted blending introduces style statistics while preserving the spatial-semantic structure of the main branch. The optimal performance is achieved at \(\alpha=0.75\), which favors retaining main-branch information.
- Local Attention Transplantation (LAT):
- Function: Precisely controls style injection and content protection in cross-attention layers.
- Mechanism: The cross-attention K/V from the style teacher are directly transplanted into the student generator's cross-attention, replacing the original K/V for localized style injection. To prevent content identity degradation, the query is protected by blending the content teacher's \(Q^c_{cross}\) with the main branch's \(Q^m_{cross}\) via hyperparameter \(\beta\): \(\hat{Q}^m_{cross} = \beta \cdot Q^m_{cross} + (1-\beta) \cdot Q^c_{cross}\).
- Design Motivation: Unlike prior methods that operate exclusively in self-attention, LAT leverages cross-attention for style transplantation. Since cross-attention was originally designed for text-conditioned injection and does not carry spatial structural information, replacing K/V in this channel does not disrupt content structure as self-attention manipulation does. The optimal balance is achieved at \(\beta=0.25\).
- Style-Injected Noise Initialization (SINI):
- Function: Incorporates style and content information at the diffusion starting point, providing a better initialization for generation.
- Mechanism: AdaIN is applied to fuse the content initial noise \(z^c_T\) and the style initial noise \(z^s_T\) into a stylized noise. A content residual term (the difference between the original content noise and the fused noise) is then added with weight \(\gamma\): \(z^m_T = \gamma \cdot (z^c_T - \text{AdaIN}(z^c_T, z^s_T)) + \text{AdaIN}(z^c_T, z^s_T)\).
- Design Motivation: Using content noise alone cannot incorporate style; pure AdaIN fusion loses content identity. The dual-component design allows the initial noise to carry both style statistics and content structural information via residual connection. The optimal value is \(\gamma=0.5\).
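Under the formulas above, the three modulation operations each reduce to a few lines. The NumPy sketch below assumes channel statistics are taken over the last axis; tensor shapes, function names, and the AdaIN axis convention are illustrative assumptions, not the authors' code.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Align content features to the per-row mean/std of style features."""
    c_mu, c_std = content.mean(axis=-1, keepdims=True), content.std(axis=-1, keepdims=True)
    s_mu, s_std = style.mean(axis=-1, keepdims=True), style.std(axis=-1, keepdims=True)
    return s_std * (content - c_mu) / (c_std + eps) + s_mu

def gar_blend(p_main, p_content, p_style, alpha=0.75):
    """GAR for one self-attention projection (applied to Q, K, and V alike):
    AdaIN-fuse the teacher projections, then blend with the student's own."""
    p_cs = adain(p_content, p_style)            # composite content-style projection
    return alpha * p_main + (1 - alpha) * p_cs  # \hat{Q} = a*Q^m + (1-a)*Q^cs

def lat_query(q_main, q_content, beta=0.25):
    """LAT query protection: blend the student's cross-attention query with
    the content teacher's. (The style teacher's cross-attention K/V are
    transplanted wholesale, so they need no blending function.)"""
    return beta * q_main + (1 - beta) * q_content

def sini(z_content, z_style, gamma=0.5):
    """SINI: AdaIN-fused noise plus a gamma-weighted content residual."""
    z_stylized = adain(z_content, z_style)
    return gamma * (z_content - z_stylized) + z_stylized
```

Note the limiting cases, which match the ablation intuition: \(\alpha = 1\) ignores the style statistics entirely, while \(\gamma = 1\) makes SINI collapse back to the pure content noise.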
### Loss & Training
The method requires no training; all operations are performed at inference time. SD2.1 uses 50-step DDIM denoising at a resolution of 512×512. The three hyperparameters \(\alpha=0.75\), \(\beta=0.25\), and \(\gamma=0.5\) are determined via ablation studies.
## Key Experimental Results

### Main Results
| Method | ArtFID↓ | LPIPS↓ | DINO↑ | CLIP-I↑ | CLIP-T↑ | DC↑ | CC↑ |
|---|---|---|---|---|---|---|---|
| StyleID (CVPR'24) | 15.161 | 0.635 | 0.544 | 0.619 | 0.213 | 1.873 | 1.964 |
| DiffArtist (MM'25) | 16.174 | 0.520 | 0.629 | 0.626 | 0.220 | 1.987 | 1.984 |
| AttDistillation (CVPR'25) | 16.170 | 0.629 | 0.541 | 0.615 | 0.219 | 1.878 | 1.969 |
| HAM (Ours) | 15.151 | 0.479 | 0.728 | 0.682 | 0.223 | 2.113 | 2.057 |
### Ablation Study
| Configuration | DINO↑ | CLIP-I↑ | CLIP-T↑ | DC↑ | CC↑ |
|---|---|---|---|---|---|
| Baseline (no modules) | 0.609 | 0.626 | 0.220 | 1.963 | 1.984 |
| +GAR | 0.618 | 0.626 | 0.231 | 1.993 | 2.002 |
| +LAT | 0.712 | 0.696 | 0.193 | 2.042 | 2.023 |
| +GAR+LAT | 0.746 | 0.696 | 0.202 | 2.099 | 2.040 |
| +GAR+LAT+SINI (Full) | 0.728 | 0.682 | 0.223 | 2.113 | 2.057 |
### Key Findings
- LAT contributes most to content preservation (DINO: 0.609 → 0.712) and is the core module for identity retention.
- GAR primarily enhances style strength (CLIP-T: 0.220 → 0.231) while marginally improving content metrics.
- SINI improves color diversity and style richness; in combination with the other two modules, it achieves the best overall DC/CC scores.
- The hyperparameter \(\alpha=0.75\) favors retaining main-branch information; values that are too low cause severe content degradation.
## Highlights & Insights
- Assigning style injection and content protection to different types of attention mechanisms (heterogeneous modulation) is an elegant idea: leveraging cross-attention's inherent cross-modal processing capability for style injection avoids the structural damage caused by self-attention manipulation.
- The AdaIN-plus-residual noise initialization resolves the style–content balance at the initial-noise stage and proves more effective than simple noise replacement or blending.
- Compatibility with both SD2.1 and SD3.5 architectures (DDIM vs. DiT) demonstrates the generality of the approach.
## Limitations & Future Work
- The method still has limitations in transferring highly abstract or surrealist styles.
- Running three diffusion model branches simultaneously (content teacher, style teacher, student) incurs substantial computational overhead (approximately 16 seconds per image on SD2.1).
- The three hyperparameters require manual tuning; different styles may require different settings.
- The ArtFID advantage over StyleID is marginal (15.151 vs. 15.161), leaving room for improvement in quantitative evaluation.
## Related Work & Insights
- vs. StyleID: StyleID injects style K/V only in self-attention, causing content distortion. HAM shifts style injection to cross-attention, significantly reducing content corruption.
- vs. DiffArtist: DiffArtist performs well on content preservation (LPIPS) but achieves weaker stylization than HAM. HAM's LAT module achieves a better balance through its query protection mechanism.
- The heterogeneous attention modulation paradigm is transferable to other domains such as image editing and video stylization.
## Rating
- Novelty: ⭐⭐⭐⭐ The heterogeneous modulation concept is novel, though individual components such as AdaIN fusion are relatively standard.
- Experimental Thoroughness: ⭐⭐⭐⭐ Ablation studies are detailed and cover all modules and hyperparameters, though user studies are absent.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with complete formula derivations, though some descriptions are redundant.
- Value: ⭐⭐⭐⭐ A practical training-free style transfer solution with strong cross-architecture compatibility.