ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

Conference: ECCV 2024
arXiv: 2404.07987
Code: GitHub
Area: Controllable Generation / Segmentation
Keywords: ControlNet, Controllable Generation, Cycle Consistency, Diffusion Models, Reward Fine-tuning

TL;DR

This paper proposes ControlNet++, which explicitly optimizes controllability in conditional generation with a pixel-level cycle consistency loss: a pre-trained discriminative model extracts the condition from each generated image, and the result is aligned with the input condition. To avoid the large GPU memory cost of backpropagating through multi-step sampling, the authors design an efficient single-step denoising reward strategy. Controllability improves markedly under segmentation mask, edge, and depth conditions (e.g., +11.1% segmentation mIoU, from 32.55 to 43.64 on ADE20K).

Background & Motivation

Although text-to-image diffusion models (such as Stable Diffusion) generate images with stunning quality, text alone struggles to describe fine-grained spatial layouts and details. While models like ControlNet and T2I-Adapter introduce image-based conditional controls (segmentation masks, edge maps, depth maps), the controllability remains suboptimal.

Key Findings: There is a significant discrepancy between the images generated by existing ControlNet models and the input conditions. For example:

- Under segmentation mask conditions, the model achieves only 32.55 mIoU (much lower than the 50.7 mIoU of the same evaluation model on real data).
- T2I-Adapter-SDXL consistently produces incorrect forehead wrinkles.
- ControlNet v1.1 introduces a large amount of erroneous details.

Key Challenge: Existing methods only implicitly learn controllability during the latent space denoising process, lacking explicit pixel-level consistency constraints.

Method

Overall Architecture

ControlNet++ models controllable generation as an image-to-image translation task and incorporates the cycle consistency concept:

1. An input condition $c_v$ is fed into the diffusion model to generate an image $x_0'$.
2. A discriminative model $D$ extracts the condition $\hat{c}_v$ from $x_0'$.
3. The consistency loss between $c_v$ and $\hat{c}_v$ is optimized.

Key Designs

Key Design 1: Cycle-Consistency Reward Loss

\[L_{reward} = L(c_v, \hat{c}_v) = L(c_v, D(x_0'))\]

Different loss functions and discriminative models are used for different condition types:

- Segmentation mask: UperNet-R50 with pixel-wise cross-entropy loss.
- Depth map: DPT model with RMSE loss.
- Edge map: Corresponding edge detectors with SSIM/F1 loss.

The total loss is a weighted sum of the diffusion training loss and the reward loss:

\[L_{total} = L_{train} + \lambda \times L_{reward}\]
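
As a rough illustration of how the reward term and the weighted sum fit together, the following pure-Python sketch computes a pixel-wise cross-entropy reward on a toy two-pixel label map and combines it with a diffusion loss. The function names, the toy probabilities, and the values `l_train=0.05` and `lam=0.5` are assumptions made for this example and are not taken from the paper; a real implementation would operate on tensors produced by the reward model.

```python
import math

def pixel_cross_entropy(probs, labels):
    """Mean pixel-wise cross-entropy.

    probs:  per-pixel class probabilities that the reward model D predicts
            from the generated image, shape [num_pixels][num_classes]
    labels: per-pixel class indices taken from the input condition c_v
    """
    eps = 1e-12  # guard against log(0)
    losses = [-math.log(max(p[y], eps)) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

def rmse(pred, target):
    """Root-mean-square error between two flattened depth maps."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def total_loss(l_train, l_reward, lam):
    """L_total = L_train + lambda * L_reward."""
    return l_train + lam * l_reward

# Toy segmentation example: 2 pixels, 3 classes (hypothetical values)
c_v = [0, 2]                                  # input condition
d_of_x0 = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]  # D(x_0') as class probabilities
l_reward = pixel_cross_entropy(d_of_x0, c_v)
l_total = total_loss(l_train=0.05, l_reward=l_reward, lam=0.5)
```

Swapping `pixel_cross_entropy` for `rmse` gives the depth-map variant of the same reward, which is the sense in which the framework is shared across condition types.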

Key Design 2: Efficient Single-Step Reward Strategy

Limitations of Prior Work: Generating an image by multi-step sampling from random noise $x_T$ and then backpropagating the reward loss requires storing gradients for every sampling step. For 50-step inference this takes ~340GB of GPU memory, which makes the approach impractical.
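
A quick back-of-the-envelope check of that figure, assuming memory grows roughly linearly with the number of sampling steps kept for backpropagation (the linearity is an assumption made here, not a statement from the paper):

```python
# Figures quoted in the text above
total_memory_gb = 340   # reported cost of backpropagating through full sampling
num_steps = 50          # sampling steps in that estimate

# Assumption: memory scales about linearly with the number of stored steps
per_step_gb = total_memory_gb / num_steps
print(f"{per_step_gb:.1f} GB per step")  # 6.8 GB per step
```

Under that assumption a single step costs about 6.8GB, which is in line with the roughly 7GB reported for the single-step strategy in the findings further down.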

Mechanism: Instead of sampling from random noise, noise is added to the training image followed by a single-step denoising process:

1. Disturb consistency: Add a small amount of noise to the training image $x_0$ to obtain $x_t'$ (identical to the standard diffusion forward process).
2. Single-step denoising recovery: When the noise is small ($t \le t_{thre}$), directly predict the original image $x_0'$ via single-step sampling.
3. Reward computation: Calculate the reward loss on the denoised $x_0'$.
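
Written out under the standard DDPM parameterization, these steps take the form below. The symbols $\epsilon$ (sampled noise), $\epsilon_\theta$ (the network's noise prediction), and $\bar{\alpha}_t$ (cumulative noise schedule) are not defined in this summary and are introduced here as assumptions; the text prompt input is omitted for brevity.

\[x_t' = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)\]

\[x_0' \approx \frac{x_t' - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t', c_v, t)}{\sqrt{\bar{\alpha}_t}}\]

\[L_{reward} = L(c_v, D(x_0'))\]

The approximation in the second line is only reasonable when $t$ is small, which is why the reward is restricted to $t \le t_{thre}$.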

Final training strategy: Use $L_{train} + \lambda \times L_{reward}$ when $t \le t_{thre}$, and only $L_{train}$ otherwise.
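
The timestep gating can be sketched in a few lines of pure Python. This is a minimal illustration rather than the authors' code: the function names, the scalar "images", and the value `alpha_bar_t=0.9` are hypothetical, and the noise prediction is replaced by the true noise so that the round trip can be checked exactly.

```python
import math

def add_noise(x0, eps, alpha_bar_t):
    """Forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a = alpha_bar_t
    return [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]

def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Single-step estimate of the clean image from a noise prediction."""
    a = alpha_bar_t
    return [(x - math.sqrt(1 - a) * e) / math.sqrt(a) for x, e in zip(x_t, eps_pred)]

def training_loss(l_train, l_reward, t, t_thre, lam):
    """Add the reward term only at small-noise timesteps (t <= t_thre)."""
    if t <= t_thre:
        return l_train + lam * l_reward
    return l_train

# Toy check: with a perfect noise prediction, x0 is recovered in one step
x0 = [0.2, -0.5, 0.9]
eps = [0.1, -0.3, 0.4]
x_t = add_noise(x0, eps, alpha_bar_t=0.9)
x0_rec = predict_x0(x_t, eps, alpha_bar_t=0.9)

# Hypothetical loss values and threshold, for illustration only
loss_small_t = training_loss(l_train=1.0, l_reward=2.0, t=100, t_thre=200, lam=0.5)
loss_large_t = training_loss(l_train=1.0, l_reward=2.0, t=500, t_thre=200, lam=0.5)
```

In practice the noise prediction is imperfect, so `x0_rec` is only a good estimate when little noise was added, which is what the threshold on `t` enforces.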

Core Idea: Adding noise breaks the consistency between the image and the condition. The reward loss forces the model to reconstruct this consistency during denoising, thereby enhancing its ability to follow control conditions during generation.

Loss & Training

Freeze the pre-trained discriminative reward model and the main text-to-image model.
Only update the ControlNet parameters (consistent with original ControlNet training).
The timestep threshold $t_{thre}$ controls when to enable the reward loss (small-noise timesteps).
Using the reward loss alone causes image distortion, so it must be jointly trained with the diffusion loss.

Key Experimental Results

Main Results

Condition Type	Metric	ControlNet	T2I-Adapter	Uni-ControlNet	Ours
Segmentation Mask (ADE20K)	mIoU	32.55	12.61	19.39	43.64
Canny Edge	F1	34.65	23.65	27.32	37.04
HED Edge	SSIM	0.7621	-	0.6910	0.8097
LineArt Edge	SSIM	0.7054	-	-	0.8399
Depth Map	RMSE (Lower is better)	35.90	48.40	40.65	28.32

ControlNet++ achieves the best score on every condition in the table, with the largest gains on segmentation masks, LineArt edges, and depth maps.
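
To make the size of each gain explicit, the short script below recomputes the absolute improvement of ControlNet++ over ControlNet from the numbers in the table. The dictionary layout and the helper name `improvement` are choices made for this example; only the scores themselves come from the table.

```python
# (metric, higher_is_better, ControlNet, ControlNet++) copied from the table above
results = {
    "Segmentation (ADE20K)": ("mIoU", True, 32.55, 43.64),
    "Canny": ("F1", True, 34.65, 37.04),
    "HED": ("SSIM", True, 0.7621, 0.8097),
    "LineArt": ("SSIM", True, 0.7054, 0.8399),
    "Depth": ("RMSE", False, 35.90, 28.32),
}

def improvement(higher_is_better, baseline, ours):
    """Signed gain, so that a positive value always means ControlNet++ is better."""
    return ours - baseline if higher_is_better else baseline - ours

gains = {
    name: round(improvement(hib, base, ours), 4)
    for name, (_, hib, base, ours) in results.items()
}
for name, gain in gains.items():
    print(f"{name}: {gain:+.4f} {results[name][0]}")
```

The segmentation gain comes out at 11.09 mIoU points, which is the "+11.1%" figure quoted in the TL;DR; the percentages there are therefore absolute differences in the metric, not relative changes.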

Image Quality FID Comparison

Condition	ControlNet FID	ControlNet++ FID
Segmentation Mask (ADE20K)	33.28	29.49
Segmentation Mask (COCO)	21.33	19.29
Depth Map	17.76	16.66

FID (lower is better) decreases in all three settings shown, so the gain in controllability does not come at the expense of image quality here.

Ablation Study

Generalization of Reward Strategy Across Timesteps:

Unoptimized Steps $[T, t_{thre}]$	Optimized Steps $[t_{thre}, 1]$	mIoU
ControlNet	ControlNet	32.55
ControlNet	Ours	38.03
Ours	ControlNet	41.46
Ours	Ours	43.64

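Each row indicates which model is used in each timestep range during sampling. Although the reward loss is applied only at small timesteps, its effect generalizes to large ones: using ControlNet++ only in the unoptimized range $[T, t_{thre}]$ already raises mIoU from 32.55 to 41.46.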

Impact of Reward Model Strength:

Reward Model	RM's Own mIoU	Evaluation mIoU
None	-	32.55
DeepLabv3-MBv2	34.02	31.96
FCN-R101	39.91	40.44
UperNet-R50	42.05	43.64

Stronger reward models lead to better controllability, while a weak one can hurt: with DeepLabv3-MBv2 the evaluation mIoU (31.96) falls below the 32.55 baseline obtained without a reward model.

Key Findings

Explicit optimization vastly outperforms implicit learning: Pixel-level cycle consistency is significantly superior to implicit controllability that only relies on the denoising process.
Single-step reward strategy is both effective and efficient: It reduces GPU memory from 340GB to approximately 7GB while maintaining strong performance.
Generated data can boost discriminative models: Training DeepLabv3 with data generated by ControlNet++ achieves +1.19 mIoU compared to training with ControlNet-generated data.
Impact of Text: ControlNet++ still generates images correctly when the text is empty or conflicts with the image condition, whereas ControlNet fails.

Highlights & Insights

Elegant Transfer of CycleGAN Ideas: Brings cycle consistency from image translation to controllable diffusion models.
Engineering Insight for Efficient Reward Strategy: Replaces multi-step sampling with a noise-injection and single-step denoising mechanism, turning an intractable approach into a practical one.
Unified Framework: The same methodology is universally applicable to various conditions such as segmentation masks, edges, and depth.
Generation-Discrimination Closed Loop: The discriminative model acts as a feedback signal to guide the generative model, forming a virtuous cycle.

Limitations & Future Work

Relies heavily on high-quality differentiable discriminative models; some conditions (e.g., skeletons, sketches) still lack robust differentiable extractors.
The intrinsic capability ceiling of the reward model limits the upper bound of controllability.
The timestep threshold $t_{thre}$ needs to be tuned manually.
Validation has only been performed on SD1.5, without extension to stronger base architectures like SDXL or SD3.
There is a slight decrease in CLIP-Score for edge-based conditions.

Related Work & Insights

ControlNet / T2I-Adapter: Direct baselines that this work compares against and improves on; ControlNet++ keeps the ControlNet architecture.
CycleGAN: Inspired the core concept of cycle consistency.
RLHF / ReFL: The reward fine-tuning concept follows Reinforcement Learning from Human Feedback in NLP and reward feedback learning (ReFL) for text-to-image diffusion models.
Insights: Discriminative models as AI feedback are much more efficient than human annotations; cycle consistency can be extended to other conditional generation tasks.

Rating

Novelty: 4/5 - The combination of cycle consistency and the efficient single-step reward strategy is novel and practical.
Experimental Thoroughness: 5/5 - Evaluated across 6 conditions and 5 baselines with comprehensive ablations and generated-data validation.
Writing Quality: 4/5 - Clear motivation and rich visualizations.
Value: 5/5 - Directly addresses the core pain points of controllable generation, achieves significant performance gains, and is fully open-sourced.