All-in-One Slider for Attribute Manipulation in Diffusion Models¶
Conference: CVPR 2026
arXiv: 2508.19195
Code: Available (provided on project page)
Area: Diffusion Models/Image Generation
Keywords: Attribute Manipulation, Sparse Autoencoder, Text Embedding Space, Disentangled Representation, Zero-shot Generalization
TL;DR¶
The proposed All-in-One Slider framework trains a lightweight Attribute Sparse Autoencoder on the intermediate layer embeddings of a text encoder. It decomposes attributes into disentangled directions within a high-dimensional sparse activation space, enabling continuous, fine-grained, and composable control of multiple facial attributes with a single module. It also demonstrates zero-shot continuous manipulation capabilities for unseen attributes (e.g., ethnicity, celebrities).
Background & Motivation¶
Text-to-image (T2I) diffusion models have made significant progress in image generation, but progressive, fine-grained control over specific attributes remains difficult. Traditional prompt engineering (e.g., appending "with a big smile" to a prompt) allows only coarse, non-continuous manipulation and inevitably affects unrelated attributes (e.g., hairstyle, identity).
Existing attribute manipulation methods follow a One-for-One paradigm—training an independent slider module for each attribute (e.g., ConceptSlider, AttributeControl). This leads to three problems: (1) every new attribute requires additional training, so time and compute costs accumulate; (2) parameters become redundant as slider modules pile up; (3) flexibility and scalability are limited, making it difficult to handle a large number of attributes in practice.
The core insight of this paper is "break it down to build it up": drawing on the success of Sparse Autoencoders (SAE) in discovering interpretable semantic units in LLMs, the authors construct a unified, disentangled attribute latent space within the diffusion model's text embedding space, allowing all attributes to share one lightweight module.
Method¶
Overall Architecture¶
The All-in-One Slider consists of two stages:
- Stage 1 (Attribute Disentanglement Training): An Attribute Sparse Autoencoder is trained (unsupervised) on the intermediate-layer embeddings of the SDXL text encoder.
- Stage 2 (Attribute Manipulation): Using the trained SAE, target attribute text is encoded into sparse directions, scaled by a scalar \(\lambda\), and added to the original prompt embedding.
Key Designs¶
- Embedding Extraction:
  - Function: Extracts hidden states from the intermediate layers of the SDXL dual text encoder.
  - Mechanism: Features are extracted from the 11th layer of the first encoder and the 29th layer of the second encoder, then concatenated to form a joint representation \(x \in \mathbb{R}^{d}\).
  - Design Motivation: Intermediate layers retain both semantic information and identity features; shallow layers are too low-level, while deep layers are too abstractly semantic.
- Attribute Sparse Autoencoder (SAE):
  - Function: Maps text embeddings into a high-dimensional sparse latent space (the Attribute Latent Space, ALS) to achieve attribute disentanglement.
  - Mechanism: The encoder uses top-k sparse activation: \(z_{\text{ALS}} = \text{Top-}k(\text{ReLU}(W_{\text{enc}}(x - b_{\text{pre}}) + b_{\text{enc}}))\). The decoder reconstructs the original embedding: \(\hat{x} = W_{\text{dec}} z_{\text{ALS}} + b_{\text{pre}}\). The training objective is a reconstruction loss plus an auxiliary loss (to revive dead neurons): \(\mathcal{L} = \|x - \hat{x}\|_2^2 + \alpha \mathcal{L}_{\text{aux}}\). The auxiliary loss selects the \(k_{\text{aux}}\) least-active neurons to reconstruct the residual \(r = x - \hat{x}\), ensuring broad semantic coverage of the latent space.
  - Design Motivation: Top-k sparsity is key to attribute disentanglement—each attribute activates only a small, unique subset of neurons, naturally disentangling different attributes.
- Attribute Manipulation:
  - Function: Achieves continuous, fine-grained attribute control by modifying text embeddings during inference.
  - Mechanism: The target attribute \(A\) is encoded into a direction in the latent space, scaled by \(\lambda\), and added back onto the original embedding: \(x_{\text{manipulated}} = x + W_{\text{dec}}(\lambda \cdot \text{ENC}(x_A))\). \(\lambda\) controls manipulation intensity (positive for enhancement, negative for reduction).
  - Design Motivation: Because the latent space is compositional, multiple attribute directions can be directly superimposed for combined manipulation; text embeddings of unseen attributes still trigger adaptive composite activations in the trained space. Minimal code sketches of the embedding extraction and of the SAE training and manipulation steps follow this list.
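To ground the Embedding Extraction step, here is a minimal sketch using the standard `diffusers`/`transformers` SDXL checkpoints. The `hidden_states` indexing (treating the embedding output as index 0, so `[11]`/`[29]` correspond to the 11th/29th layers), the 77-token padding, and the choice to keep per-token features rather than pooling are all assumptions, not the authors' released code.

```python
# Sketch: intermediate-layer extraction from SDXL's dual text encoders.
import torch
from transformers import AutoTokenizer, CLIPTextModel, CLIPTextModelWithProjection

repo = "stabilityai/stable-diffusion-xl-base-1.0"
tok1 = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer")
tok2 = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer_2")
enc1 = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
enc2 = CLIPTextModelWithProjection.from_pretrained(repo, subfolder="text_encoder_2")


@torch.no_grad()
def joint_embedding(prompt: str) -> torch.Tensor:
    """Concatenate intermediate-layer features from SDXL's two text encoders."""
    kw = dict(padding="max_length", max_length=77, truncation=True,
              return_tensors="pt")
    h1 = enc1(**tok1(prompt, **kw), output_hidden_states=True)
    h2 = enc2(**tok2(prompt, **kw), output_hidden_states=True)
    x1 = h1.hidden_states[11]   # 11th layer of CLIP ViT-L (768-d)
    x2 = h2.hidden_states[29]   # 29th layer of OpenCLIP bigG (1280-d)
    return torch.cat([x1, x2], dim=-1)   # joint x, 2048-d per token
```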
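And here is a minimal sketch of Stage 1 (the top-k SAE with its reconstruction and auxiliary losses) and Stage 2 (the \(\lambda\)-scaled manipulation). The latent width `n_latents`, `k`, `k_aux`, `alpha`, and the batch-mean reading of "least active" are illustrative assumptions.

```python
# Sketch: top-k Attribute SAE (Stage 1) and embedding manipulation (Stage 2).
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    """Top-k sparse autoencoder over joint text embeddings (sizes assumed)."""

    def __init__(self, d_model: int = 2048, n_latents: int = 16384, k: int = 32):
        super().__init__()
        self.k = k
        self.b_pre = nn.Parameter(torch.zeros(d_model))        # b_pre
        self.enc = nn.Linear(d_model, n_latents)               # W_enc, b_enc
        self.dec = nn.Linear(n_latents, d_model, bias=False)   # W_dec

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # z_ALS = Top-k(ReLU(W_enc(x - b_pre) + b_enc))
        pre = torch.relu(self.enc(x - self.b_pre))
        vals, idx = pre.topk(self.k, dim=-1)
        return torch.zeros_like(pre).scatter_(-1, idx, vals)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        # x_hat = W_dec z_ALS + b_pre
        return self.dec(z) + self.b_pre


def sae_loss(sae: TopKSAE, x: torch.Tensor,
             k_aux: int = 256, alpha: float = 1 / 32) -> torch.Tensor:
    # L = ||x - x_hat||^2 + alpha * L_aux, where L_aux reconstructs the
    # residual r = x - x_hat from the k_aux least-active latents
    # ("least active" read here as lowest mean activation over the batch).
    z = sae.encode(x)
    x_hat = sae.decode(z)
    recon = (x - x_hat).pow(2).sum(-1).mean()

    pre = torch.relu(sae.enc(x - sae.b_pre))   # recomputed for the aux term
    quiet = pre.mean(dim=tuple(range(pre.dim() - 1))) \
               .topk(k_aux, largest=False).indices
    z_aux = torch.zeros_like(pre)
    z_aux[..., quiet] = pre[..., quiet]
    r = (x - x_hat).detach()
    aux = (r - sae.dec(z_aux)).pow(2).sum(-1).mean()
    return recon + alpha * aux


def manipulate_embedding(x: torch.Tensor, x_attr: torch.Tensor,
                         sae: TopKSAE, lam: float) -> torch.Tensor:
    # x_manipulated = x + W_dec(lambda * ENC(x_attr));
    # lambda > 0 enhances, lambda < 0 suppresses the attribute.
    return x + sae.dec(lam * sae.encode(x_attr))
```

Because `manipulate_embedding` is linear in the attribute code, combined manipulation amounts to summing several scaled attribute codes before the single decoder call.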
Loss & Training¶
- Training Data: 52 controllable facial attributes \(\times\) 1000 samples = 52,000 text samples.
- Images are generated with SDXL using 50-step sampling and a classifier-free guidance scale of 7.5.
- Multi-subject scenario extension: introduces an Attention Pooling Aggregator module plus a consistency loss, \(\mathcal{L}_{\text{multi}} = \mathcal{L}_{\text{sae}} + \eta \mathcal{L}_{\text{cons}}\) (a hedged sketch follows this list).
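The summary names the Attention Pooling Aggregator and the consistency loss without spelling them out, so the sketch below is one plausible reading: a learned-query softmax pooling per subject and the weighted-sum objective. Both the pooling form and the \(\eta\) value are assumptions.

```python
# Hedged sketch of the multi-subject extension; forms are assumed.
import torch
import torch.nn as nn


class AttnPoolAggregator(nn.Module):
    """Pools one subject's per-token embeddings into a single vector
    (softmax attention with a learned query -- an assumed form)."""

    def __init__(self, d_model: int = 2048):
        super().__init__()
        self.q = nn.Parameter(torch.zeros(d_model))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (seq, d_model) for one subject span
        w = torch.softmax(tokens @ self.q, dim=0)   # (seq,)
        return w @ tokens                           # (d_model,)


def multi_subject_loss(l_sae: torch.Tensor, l_cons: torch.Tensor,
                       eta: float = 0.1) -> torch.Tensor:
    # L_multi = L_sae + eta * L_cons (eta illustrative; the paper's exact
    # consistency term is not specified in this summary).
    return l_sae + eta * l_cons
```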
Key Experimental Results¶
Main Results¶
| Setting | Method | QS (Old) | IS (Old) | QS (Smile) | IS (Smile) | QS (Makeup) | IS (Makeup) |
|---|---|---|---|---|---|---|---|
| Single Attribute | ConceptSlider | 3.794 | 0.434 | 4.144 | 0.496 | 4.542 | 0.653 |
| Single Attribute | AttControl | 4.039 | 0.601 | 4.395 | 0.695 | 4.268 | 0.604 |
| Single Attribute | Ours | 4.049 | 0.716 | 4.265 | 0.637 | 4.291 | 0.742 |
| Multi-Attribute | ConceptSlider | 4.150 | 0.499 | 3.801 | 0.522 | 4.059 | 0.479 |
| Multi-Attribute | AttControl | 3.667 | 0.376 | 4.056 | 0.635 | 4.248 | 0.515 |
| Multi-Attribute | Ours | 4.212 | 0.688 | 4.428 | 0.628 | 4.297 | 0.635 |
Advantages are more pronounced in multi-attribute scenarios: for the Old+Makeup combination, QS reaches 4.428 vs. 4.056 for the second-best method, a gain of 0.37.
Ablation Study¶
| Configuration | QS (Avg) | IS (Avg) | Description |
|---|---|---|---|
| Layer 8/28 | 4.124 | 0.635 | Shallow, insufficient semantics |
| Layer 9/30 | 4.144 | 0.669 | Sub-optimal |
| Layer 10/24 | 4.185 | 0.718 | Good balance |
| Layer 10/28 | 4.202 | 0.698 | Best overall balance |
(The layer indices in this ablation appear to be 0-indexed; the best 10/28 configuration corresponds to the 11th/29th layers cited in the method.)
\(\lambda\) ablation: \(\lambda=0.15\) results in under-editing, while \(\lambda=0.30\) provides stronger attribute expression at the cost of identity preservation; the "old" attribute is the most deeply entangled with identity features.
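For concreteness, a \(\lambda\) sweep using the hypothetical `joint_embedding` and `manipulate_embedding` helpers from the sketches above might look like this; the values mirror the ablation's range.

```python
# Assumes a trained TopKSAE instance `sae` from the earlier sketch.
x = joint_embedding("a portrait photo of a person")   # prompt embedding
x_attr = joint_embedding("old")                        # attribute text embedding
for lam in (-0.30, -0.15, 0.0, 0.15, 0.30):
    x_edit = manipulate_embedding(x, x_attr, sae, lam)
    # x_edit replaces the prompt embedding fed to SDXL's cross-attention;
    # lam > 0 strengthens the attribute, lam < 0 weakens it, and identity
    # drift grows with |lam|, per the ablation above.
```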
Key Findings¶
- For multi-attribute combined manipulation, this method avoids conflicts and maintains high semantic consistency due to the disentangled latent space.
- Zero-shot generalization to unseen ethnicities (e.g., African, Chinese, Indian) and celebrity identities (Obama, Einstein).
- Transferable to different diffusion backbones such as SD v1.4 and SDXL-Turbo.
- Equally effective when extended to photographic style manipulation (40 styles including B&W, Golden Hour).
Highlights & Insights¶
- Paradigm shift from One-for-One to All-in-One: train once, then reuse the same module for every attribute, greatly lowering the barrier to attribute manipulation.
- The top-k sparsity of the SAE is naturally suited to attribute disentanglement—a clever transfer from LLM interpretability to visual generation.
- Zero-shot compositional generalization suggests that the attribute correspondences learned by text encoders are well-structured.
- Extremely lightweight module (only SAE encoder/decoder weights), meaning no modifications to the base model are required.
Limitations & Future Work¶
- The "old" attribute is deeply entangled with identity features; identity preservation drops noticeably under strong manipulation.
- Currently primarily validated on facial attributes; control effectiveness for full-body or scene-level attributes has not been fully explored.
- Relies on the semantic quality of the pre-trained text encoder; attributes poorly covered by the text encoder may exhibit limited effects.
- Multi-subject scenarios require additional Attention Pooling fine-tuning, so the extension is not zero-cost.
Related Work & Insights¶
- The core difference from ConceptSlider (weight-space editing) and AttributeControl (embedding-space training): the former needs one LoRA per attribute and the latter one supervised pair per attribute, whereas a single unsupervised training run here covers all attributes.
- While SAE applications in diffusion (SAeUron, Diffusion Lens, Unpacking SDXL Turbo) focus on interpretability, this paper is the first to use it for controllable attribute manipulation.
- Insight: Similar sparse disentanglement strategies could potentially be used for motion attribute control in video generation.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First to use K-SAE for unified attribute manipulation; a paradigm innovation from One-for-One to All-in-One.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive experiments covering single/multi-attribute, zero-shot, cross-model, multi-subject, and style extensions with detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation, systematic method description, and intuitive chart design.
- Value: ⭐⭐⭐⭐⭐ High practical value as a lightweight, scalable, and zero-shot generalizing attribute control solution.