
TAUE: Training-free Noise Transplant and Cultivation Diffusion Model

Basic Information

  • Conference: CVPR 2026
  • arXiv: 2511.02580
  • Code: Not released
  • Area: Image Generation
  • Keywords: Layered image generation, diffusion models, training-free, latent space transplantation, cross-layer attention

TL;DR

TAUE proposes a training-free layered image generation framework that "transplants" intermediate denoising latents into the initial noise of a new generation process, combined with cross-layer attention sharing, to achieve consistent three-layer generation of foreground, background, and composite images — matching or surpassing fine-tuning-based methods.

Background & Motivation

Text-to-image diffusion models (e.g., SDXL) can generate high-quality images, but their outputs are always single-layer flat images where foreground and background are inseparable. In professional design, animation, and advertising, the lack of layer-wise control is a critical bottleneck, forcing practitioners to manually segment and retouch results.

Existing layered generation methods fall into two categories:

Fine-tuning methods (LayerDiffuse, ART, etc.): Jointly denoise multiple layers using mask or alpha-channel autoencoders, but rely on large-scale proprietary datasets; high training cost and data inaccessibility limit reproducibility.

Training-free methods (Alfie, etc.): Can only generate isolated foregrounds without corresponding backgrounds — a partial solution at best.

Core Problem: How can foreground, background, and composite images be generated simultaneously — without fine-tuning or additional data — while maintaining spatial and semantic consistency across all three layers?

Method

Overall Architecture

TAUE is built on a Latent Diffusion Model (LDM) and operates in three stages:

  1. Foreground Generation: Generates a foreground object \(I_{\text{fg}}\) against a uniform background, while extracting intermediate latents \(L_{\text{fg}}\).
  2. Composite Generation: Transplants \(L_{\text{fg}}\) into new initial noise to generate the composite scene \(I_{\text{all}}\), while extracting background latents \(L_{\text{bg}}\).
  3. Background Generation: Transplants \(L_{\text{bg}}\) into the background region to generate the standalone background \(I_{\text{bg}}\).

Each stage is conditioned on a separate text prompt: \(T_{\text{fg}}\), \(T_{\text{bg}}\), and \(T_{\text{all}}\).
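The three-stage flow can be sketched as a minimal driver. Everything here is a hypothetical stand-in: `generate` abstracts a full denoising run that returns the final image plus the latent cached at `t_crop`, and the function names are illustrative, not the paper's API.

```python
def taue_pipeline(generate, T_fg, T_bg, T_all):
    """Sketch of TAUE's three stages; `generate(prompt, transplant)` is a
    stand-in for one full denoising run returning (image, cached latent)."""
    # Stage 1: foreground against a uniform green background; cache L_fg.
    I_fg, L_fg = generate(prompt=T_fg, transplant=None)
    # Stage 2: composite scene, seeded with the foreground latent; cache L_bg.
    I_all, L_bg = generate(prompt=(T_fg, T_bg, T_all), transplant=L_fg)
    # Stage 3: standalone background, seeded with the background latent.
    I_bg, _ = generate(prompt=T_bg, transplant=L_bg)
    return I_fg, I_bg, I_all
```

The key data dependency is visible in the signature: each later stage consumes the intermediate latent cached by the previous one, never the decoded image.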

Foreground Generation and Green Background Injection

Drawing inspiration from TKG-DM, TAUE injects a green background latent vector \(C_{\text{gb}}=[0,1,1,0]\) into the initial noise in latent space, so that the foreground object is generated against a uniform background:

\[z_{\text{fg},T} = (1-M) \odot z_T + M \odot \left((1-\alpha) z_T + \alpha C_{\text{gb}}\right)\]

where \(\alpha\) controls the blending strength of the background color and \(M\) is a spatial mask. The resulting foreground image \(I_{\text{fg}}\) has a clean green background, facilitating subsequent layer separation.
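A minimal numpy sketch of the injection equation above; the function name and array shapes are assumptions, with `M = 1` marking the background region where the green latent is blended in:

```python
import numpy as np

def inject_green_background(z_T, M, alpha=0.5):
    """Blend the constant green latent C_gb into the masked region of z_T.

    z_T:   initial noise latent, shape (4, H, W)
    M:     spatial mask, 1 where the green background is injected, shape (H, W)
    alpha: blending strength of the background colour
    """
    # C_gb = [0, 1, 1, 0], the latent-space colour offset used in the paper
    C_gb = np.array([0.0, 1.0, 1.0, 0.0]).reshape(4, 1, 1)
    M = M[None, :, :]  # broadcast the mask over the 4 latent channels
    return (1 - M) * z_T + M * ((1 - alpha) * z_T + alpha * C_gb)
```

With `alpha = 0` the noise is untouched; with `alpha = 1` the masked region is replaced outright by the green latent.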

Layout Specification via Probabilistic Masks

Conventional methods use Gaussian or rectangular masks to localize foreground regions, but these produce artifacts at mask boundaries. TAUE redefines \(M\) as a probabilistic layout mask, decoupling object generation from mask edges via spatially weighted sampling.

Given bounding box center \((o_x, o_y)\), width \(w\), and height \(h\), a radially symmetric Gaussian distribution is defined as:

\[P(x,y) = \exp\left(-\frac{1}{2\sigma^2}\left[\left(\frac{x-o_x}{w/2}\right)^2 + \left(\frac{y-o_y}{h/2}\right)^2\right]\right)\]

After scaling \(P(x,y)\) to \([p_{\min}, p_{\max}]\), a binary mask is produced by comparison with a random matrix \(R(x,y)\):

\[M(x,y) = \begin{cases} 1 & \text{if } R(x,y) > P(x,y) \\ 0 & \text{otherwise} \end{cases}\]

This probabilistic mask allows smooth transitions at boundaries, eliminates mask-contour artifacts, and supports flexible position and scale control.
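The two formulas above can be sketched as follows; this is an illustrative reconstruction, assuming \(R(x,y)\) is drawn uniformly from \([0,1)\) and that \(P\) is min-max rescaled into \([p_{\min}, p_{\max}]\):

```python
import numpy as np

def probabilistic_layout_mask(H, W, ox, oy, w, h, sigma=1.0,
                              p_min=0.0, p_max=1.0, rng=None):
    """Sample the binary layout mask M from the radial Gaussian P(x, y).

    M = 1 where a uniform random draw exceeds P, i.e. far from the
    object's bounding-box centre (the background region).
    """
    rng = np.random.default_rng() if rng is None else rng
    y, x = np.mgrid[0:H, 0:W]
    P = np.exp(-0.5 / sigma**2 * (((x - ox) / (w / 2)) ** 2
                                  + ((y - oy) / (h / 2)) ** 2))
    # Rescale P into [p_min, p_max], then threshold against uniform noise.
    P = p_min + (p_max - p_min) * (P - P.min()) / (P.max() - P.min())
    R = rng.uniform(size=(H, W))
    return (R > P).astype(np.float32)
```

Because the threshold is stochastic, the mask boundary dissolves into a gradual dropout of background pixels around the box, rather than a hard contour.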

Intermediate Latent Extraction

Intermediate latents are cached at a specific timestep \(t_{\text{crop}}\) during denoising:

\[L_{\text{fg}} = z_{\text{fg}, t_{\text{crop}}} \in \mathbb{R}^{4 \times H/8 \times W/8}\]

where \(t_{\text{crop}} = \lfloor T \cdot (1 - r_{\text{crop}}) \rfloor\), with default \(r_{\text{crop}}=0.5\) (the midpoint of the denoising process). This latent encodes the geometric and semantic structure of the foreground object, serving as a "seed" to be transplanted in subsequent stages.
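The caching step can be sketched as a small helper plus a toy denoising loop; the loop body and the `step` callback are hypothetical stand-ins for a real sampler iteration:

```python
import numpy as np

def crop_timestep(T, r_crop=0.5):
    """Timestep at which the intermediate latent is cached: T * (1 - r_crop)."""
    return int(T * (1 - r_crop))

def denoise_and_cache(z, T=50, r_crop=0.5, step=lambda z, t: z * 0.98):
    """Toy denoising loop t = T..1 that caches the latent at t_crop (= L_fg)."""
    t_crop, cached = crop_timestep(T, r_crop), None
    for t in range(T, 0, -1):
        z = step(z, t)
        if t == t_crop:
            cached = z.copy()   # the transplant "seed" for the next stage
    return z, cached
```

With the defaults (T = 50, r_crop = 0.5) the latent is cached at step 25, the midpoint of denoising.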

Object Region Mask

The composite generation stage first requires localizing the object region. TAUE combines two complementary signals:

  1. Latent channel activations: Due to the green latent injection, channels \(c=1\) and \(c=2\) exhibit high activations in background regions and low activations in object regions.
  2. Cross-attention maps: The attention map of the foreground prompt \(T_{\text{fg}}\) highlights semantically relevant spatial regions.

A smoothed activation map is constructed and a binary object mask is defined as:

\[v_{\text{gb}}(x,y) = \mathcal{G}_\sigma(L_{\text{fg}}^{(1)} + L_{\text{fg}}^{(2)})\]
\[m_{\text{obj}}(x,y) = \mathbf{1}\left[v_{\text{gb}}(x,y) < \tau_{\text{bg}} \land A_{\text{fg}}(x,y) > \tau_A\right]\]

This joint criterion retains only spatial locations that simultaneously satisfy "not dominated by the green background" and "strongly attended to by foreground text tokens," ensuring precise spatial localization.
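A sketch of the joint criterion, assuming `scipy.ndimage.gaussian_filter` as the smoother \(\mathcal{G}_\sigma\); threshold values and shapes are illustrative, not the paper's settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def object_mask(L_fg, A_fg, tau_bg=0.5, tau_A=0.3, sigma=2.0):
    """Binary object mask from green-channel activations and attention.

    L_fg: cached foreground latent, shape (4, H, W)
    A_fg: cross-attention map of the foreground prompt, shape (H, W)
    """
    # Channels 1 and 2 respond strongly to the injected green background,
    # so low smoothed activation marks candidate object locations.
    v_gb = gaussian_filter(L_fg[1] + L_fg[2], sigma=sigma)
    # Keep pixels that are both non-green and attended by foreground tokens.
    return ((v_gb < tau_bg) & (A_fg > tau_A)).astype(np.float32)
```

Either signal alone over-segments: channel activations catch shadows and reflections, while attention maps are diffuse; the conjunction keeps only their agreement.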

Cross-Attention Sharing

To enforce semantic consistency between foreground and background, TAUE modulates cross-attention layers using the object mask — the foreground prompt \(T_{\text{fg}}\) is applied only within the object region, while the background prompt \(T_{\text{bg}}\) governs the remaining regions:

\[A_{\text{mix}} = m_{\text{obj}} \odot A_{\text{fg}} + (1 - m_{\text{obj}}) \odot A_{\text{bg}}\]

The mask is broadcast across \(d\) attention channels, ensuring foreground tokens dominate inside the object region and background tokens dominate elsewhere. This mechanism achieves inter-layer semantic propagation without introducing any additional parameters.
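The mixing rule is a single masked blend; a minimal sketch with assumed `(d, H, W)` attention-map shapes:

```python
import numpy as np

def mix_cross_attention(A_fg, A_bg, m_obj):
    """Foreground attention inside the object mask, background outside.

    A_fg, A_bg: attention maps, shape (d, H, W); m_obj: binary mask, (H, W).
    """
    m = m_obj[None]                      # broadcast over d attention channels
    return m * A_fg + (1 - m) * A_bg
```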

Noise Transplant and Cultivation

The core operation of composite generation transplants the foreground latent into new initial noise, applying a Laplacian high-pass filter to enhance spatial detail:

\[z_{\text{all},T} = m_{\text{obj}} \odot \left(f(L_{\text{fg}}) + \lambda \, n_{t_{\text{crop}}}\right) + (1 - m_{\text{obj}}) \odot z_T\]

where \(f(\cdot)\) is the high-pass filter and \(\lambda\) controls noise intensity. During denoising, noise is blended across timesteps:

\[n_t = \begin{cases} m_{\text{obj}} \odot n_{t_{\text{crop}}} + (1 - m_{\text{obj}}) \odot n_t & \text{if } t_{\text{crop}} \leq t \\ n_t & \text{otherwise} \end{cases}\]

This two-phase scheme fixes the foreground while allowing the background to evolve freely, ensuring semantic alignment and visual coherence. The composite image \(I_{\text{all}}\) is obtained at the final step, and the intermediate latent \(L_{\text{bg}}\) is extracted at \(t_{\text{crop}}\) and passed to the background generation stage.
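The two equations above can be sketched with `scipy.ndimage.laplace` standing in for the high-pass filter \(f(\cdot)\); function names and shapes are assumptions:

```python
import numpy as np
from scipy.ndimage import laplace

def transplant(L_fg, z_T, m_obj, n_crop, lam=0.1):
    """Build the composite stage's initial latent z_all,T: high-pass-filtered
    foreground latent plus scaled noise inside the object mask, fresh noise
    z_T elsewhere."""
    f_L = np.stack([laplace(c) for c in L_fg])   # per-channel high-pass f(.)
    m = m_obj[None]
    return m * (f_L + lam * n_crop) + (1 - m) * z_T

def blend_noise(n_t, n_crop, m_obj, t, t_crop):
    """Reuse the cached noise inside the object region while t >= t_crop;
    afterwards let the sampler's own noise n_t take over everywhere."""
    if t >= t_crop:
        m = m_obj[None]
        return m * n_crop + (1 - m) * n_t
    return n_t
```

The switch at `t_crop` is what makes the scheme two-phase: the object region is pinned to the transplanted trajectory early on, then released so the final steps can harmonize it with the background.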

Background Generation

This stage mirrors composite generation but inverts the object mask. \(L_{\text{bg}}\) is transplanted into the \((1-m_{\text{obj}})\) region, and the attention masking constraint is released — the background cross-attention \(A_{\text{bg}}\) is applied to all spatial positions, allowing the background prompt to globally refine lighting, color, and contextual harmony.

Key Experimental Results

Experimental Setup

  • Base model: SDXL, resolution \(1024 \times 1024\)
  • Scheduler: EulerDiscrete, 50 denoising steps
  • Guidance scale: 7.5 for foreground generation, 5.0 otherwise
  • Crop ratio: \(r_{\text{crop}} = 0.5\)
  • Evaluation dataset: 1,770 images filtered from MS-COCO (excluding iscrowd=true and very small objects)
  • Prompt generation: Phi-3 is used to generate foreground and background prompts for each image

Main Results

| Method | FID↓ | CLIP-I↑ | CLIP-S↑ | PSNR_fg↑ | PSNR_bg↑ | SSIM_fg↑ | SSIM_bg↑ | LPIPS_fg↓ | LPIPS_bg↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LayerDiffuse (fine-tuned) | 61.46 | 0.653 | 0.312 | 14.78 | 32.76 | 0.828 | 0.957 | 0.323 | 0.039 |
| Alfie+inpainting (training-free) | 85.93 | 0.644 | 0.302 | 15.32 | 27.45 | 0.778 | 0.947 | 0.254 | 0.019 |
| TAUE (training-free) | 60.53 | 0.646 | 0.323 | 20.46 | 25.86 | 0.901 | 0.895 | 0.137 | 0.106 |
| TAUE + Layout (training-free) | 55.59 | 0.655 | 0.329 | 23.82 | 23.55 | 0.969 | 0.863 | 0.045 | 0.138 |

Key Findings:

  • TAUE surpasses the fine-tuning method LayerDiffuse on FID and CLIP-S, indicating superior visual fidelity and text alignment.
  • Foreground reconstruction quality (PSNR/SSIM/LPIPS) leads across the board, validating that latent transplantation effectively preserves object detail.
  • Background reconstruction is slightly lower than LayerDiffuse and Alfie, as those methods reuse unmasked background pixels (artificially inflating scores), whereas TAUE denoises the background entirely from scratch.
  • Incorporating layout control further improves FID (55.59 vs. 60.53) and foreground fidelity (PSNR 23.82 vs. 20.46).

Ablation Study

| Method | FID↓ | CLIP-I↑ | CLIP-S↑ | PSNR_fg↑ | PSNR_bg↑ | SSIM_fg↑ | SSIM_bg↑ | LPIPS_fg↓ | LPIPS_bg↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 50% + high-pass filter (default) | 55.59 | 0.655 | 0.329 | 23.82 | 23.55 | 0.969 | 0.863 | 0.045 | 0.138 |
| 50% w/o high-pass filter | 55.79 | 0.654 | 0.328 | 23.92 | 23.59 | 0.970 | 0.862 | 0.045 | 0.139 |
| 75% (late extraction) | 56.48 | 0.653 | 0.328 | 24.33 | 25.02 | 0.974 | 0.904 | 0.041 | 0.091 |
| 25% (early extraction) | 55.70 | 0.640 | 0.321 | 21.12 | 19.70 | 0.953 | 0.750 | 0.059 | 0.284 |

Key Findings:

  • Laplacian high-pass filter: Removing it yields marginally better reconstruction metrics but degrades perceptual quality (blurred edges, occasional object ghosting); the high-pass filter preserves high-frequency cues in the transplanted latents.
  • Crop ratio 25% (too early): Insufficient foreground structure is captured, frequently producing incorrect object shapes and lower text alignment.
  • Crop ratio 50% (default): Achieves the best balance between structural preservation and generative flexibility.
  • Crop ratio 75% (too late): Achieves the highest reconstruction scores but overfits the foreground, often causing objects to appear floating or inconsistent with the scene.

Capability Comparison

The paper compares LayerDiffuse, ART, Alfie, and TAUE along five axes: whether fine-tuning is required, background generation, multi-object generation, semantic harmonization, and layout control. Per the paper, LayerDiffuse and ART require fine-tuning, Alfie produces only isolated foregrounds without backgrounds, and TAUE is the only method that provides all four capabilities without any fine-tuning.

Applications

  1. Layout and scale control: User-defined bounding boxes specify foreground position and size, guiding latent transplantation and denoising.
  2. Decoupled multi-object generation: Latents are transplanted to multiple spatial locations, enabling simultaneous generation of multiple semantically independent objects in a single denoising pass, avoiding attribute entanglement (e.g., color/shape misassignment).
  3. Background replacement: The foreground latent is held fixed while a new background is independently synthesized, preserving foreground appearance and layout consistency; transplant coordinates can be adjusted for cross-background repositioning.

Highlights & Insights

  • The first fully training-free complete layered image generation framework that jointly outputs foreground, background, and composite layers.
  • The concept of "latent transplantation" is novel and intuitive — intermediate denoising states serve as structural seeds embedded into new generation processes.
  • The cross-layer attention sharing mechanism elegantly achieves inter-layer semantic consistency without any additional parameters.
  • The probabilistic layout mask design gracefully resolves the boundary artifact problem inherent in traditional rectangular masks.
  • TAUE comprehensively outperforms prior training-free methods and surpasses the fine-tuning method LayerDiffuse on multiple metrics.

Limitations & Future Work

  • In scenarios requiring high-fidelity foreground preservation (e.g., exact shape/color/pixel-level structure must be retained), the method may underperform inpainting-based approaches.
  • Background reconstruction quality is slightly lower than methods that can reuse pixels.
  • The foreground–background trade-off (harmonization vs. fidelity) warrants further exploration.
  • The current implementation is based on SDXL; generalization to other diffusion architectures (e.g., DiT, FLUX) has not been validated.
  • The three-stage pipeline incurs additional inference cost (approximately 3× that of a single generation).

Rating

| Dimension | Score |
| --- | --- |
| Novelty | ⭐⭐⭐⭐ |
| Technical Depth | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐ |
| Practical Value | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall Recommendation | ⭐⭐⭐⭐ |