SliderSpace: Decomposing the Visual Capabilities of Diffusion Models

Conference: ICCV 2025
arXiv: 2502.01639
Code: https://github.com/rohitgandikota/sliderspace
Area: Diffusion Models
Keywords: Controllable generation with diffusion models, semantic decomposition, LoRA adapters, latent space exploration, unsupervised discovery

TL;DR

SliderSpace applies PCA to CLIP features of images generated by a diffusion model under a given prompt, automatically discovering multiple semantically orthogonal controllable directions. Each direction is trained as a LoRA adapter (slider), enabling concept decomposition, artistic style exploration, and diversity enhancement without any manually specified attributes.

Background & Motivation

Background: Text-to-image diffusion models can generate richly diverse image variants from the same prompt using different random seeds. However, this vast creative potential remains opaque to users—we know the model "can generate different images," yet the underlying structure driving these variations is not understood.

Limitations of Prior Work: (1) Existing control methods (ControlNet, IP-Adapter, Concept Sliders) all require users to pre-specify the attributes to be controlled, which restricts exploratory use. (2) Unlike the low-dimensional latent spaces of GANs, the high-dimensional latent spaces of diffusion models lack interpretable structure, precluding continuous and compositional editing analogous to StyleGAN. (3) Distilled models (e.g., DMD2) are fast but suffer from severe mode collapse.

Key Challenge: Diffusion models encode extraordinarily rich visual knowledge, yet users can only interact with them through the narrow bandwidth of text. Visual variations that cannot be precisely described in text—such as a particular lighting style or a subtle compositional shift—remain inaccessible.

Goal: Given an arbitrary text prompt, automatically discover the principal directions of variation in a diffusion model's knowledge of the corresponding concept, expose them as user-controllable sliders, and allow users to explore the diffusion model's latent space in a manner analogous to GAN latent space traversal.

Key Insight: Since the collection of images generated by a diffusion model with different seeds already encodes different dimensions of the model's understanding of a concept, applying PCA to this distribution naturally reveals the most salient directions of variation.

Core Idea: Apply spectral decomposition (PCA) to CLIP features of a large set of images generated by the diffusion model, use the resulting principal components as training targets, and train a LoRA adapter for each direction. Each adapter serves as a slider controlling an independent semantic dimension of variation.

Method

Overall Architecture

SliderSpace proceeds in three steps: (1) Distribution Sampling—given prompt \(c\), generate approximately 5,000 images with different seeds, extracting the estimated final image \(\tilde{x}_{0,t}\) at each timestep \(t\); (2) Semantic Decomposition—map all \(\tilde{x}_{0,t}\) into CLIP feature space and apply PCA to obtain \(n\) principal components \(\{v_i\}\); (3) Slider Training—train a LoRA adapter \(\mathcal{T}_i\) for each principal component direction \(v_i\) such that its effect in CLIP space aligns with \(v_i\).
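
Below is a minimal sketch of steps (1) and (2) under stated assumptions: random tensors stand in for the sampler's intermediate states and for CLIP image features, and the helper names (`predict_x0`, `pca_directions`) and shapes are illustrative, not the authors' released code.

```python
import torch

def predict_x0(x_t: torch.Tensor, eps_t: torch.Tensor, alpha_bar_t: float) -> torch.Tensor:
    """Extrapolate the clean sample from an intermediate state (standard DDPM identity)."""
    return (x_t - (1 - alpha_bar_t) ** 0.5 * eps_t) / alpha_bar_t ** 0.5

def pca_directions(features: torch.Tensor, n_components: int = 40) -> torch.Tensor:
    """Orthonormal principal directions of mean-centered CLIP features, one per row."""
    centered = features - features.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(centered, q=n_components)  # columns of v are principal axes
    return v.T  # shape: (n_components, feature_dim)

# One intermediate state -> estimated final sample (in a latent-diffusion model this
# would be decoded and CLIP-encoded before decomposition).
x0_est = predict_x0(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64), alpha_bar_t=0.5)

# Stand-ins for CLIP features pooled over ~5,000 seeds and several timesteps.
clip_features = torch.randn(5000 * 4, 768)
V = pca_directions(clip_features)  # V[i] is the training target v_i for slider i
```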

Key Designs

  1. Final Image Extrapolation + CLIP PCA:

    • Function: Extract semantic information from intermediate diffusion states for decomposition.
    • Mechanism: At timestep \(t\), the estimated final image is extracted via \(\tilde{x}_{0,t} = (x_t - \sqrt{1-\bar{\alpha}_t} \epsilon_t) / \sqrt{\bar{\alpha}_t}\), mapped to CLIP space \(\phi(\tilde{x}_{0,t})\), and PCA is applied to the set of CLIP features across all seeds and timesteps: \(V = \text{PCA}(\{\phi(\tilde{x}_{0,t})\}_{j,t})\). Using multiple timesteps rather than only the final image is motivated by the observation that different timesteps capture different levels of semantic variation—early steps govern composition and layout, while later steps govern fine details and texture.
    • Design Motivation: Applying PCA directly to pixels captures only color and positional variation; performing PCA in the high-level semantic space of CLIP yields genuinely semantically meaningful directions of variation.
  2. Semantically Orthogonal LoRA Slider Training:

    • Function: Translate abstract PCA directions into controllable model editors.
    • Mechanism: For each principal component \(v_i\), a LoRA adapter \(\mathcal{T}_i\) is trained. The training objective maximizes the cosine similarity between the CLIP feature change induced by the adapter, \(\Delta\phi_i = \phi(\tilde{x}_{0,t}^i) - \phi(\tilde{x}_{0,t})\), and \(v_i\): \(\mathcal{L}_{\text{sliderspace}} = \sum_{i=1}^n (1 - \cos(\Delta\phi_i, v_i))\). Since PCA directions are mutually orthogonal by construction, the trained sliders govern distinct semantic dimensions. (A minimal loss sketch follows this list.)
    • Design Motivation: The low-rank structure of LoRA keeps each slider's parameter footprint small (enabling composition and stacking), and editing through model weights is more stable than directly manipulating CLIP embeddings, which offers no guarantee of image quality.
  3. Slider Composition and Transfer:

    • Function: Support simultaneous application of multiple sliders and transfer across concepts.
    • Mechanism: Multiple LoRA adapters can be additively combined into model weights with independently adjustable scale factors. Due to semantic orthogonality, the combined effect approximates linear superposition. Furthermore, sliders trained on the "person" concept transfer to related concepts such as "police," "athlete," and even "dog."
    • Design Motivation: The additive composition property of LoRA naturally supports multi-slider coordination; transferability arises from shared internal representations within the diffusion model.
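
To make the training objective in item 2 concrete, here is a minimal sketch of the loss, assuming `phi_base` and `phi_edit` are CLIP embeddings of the base and slider-edited \(\tilde{x}_{0,t}\); the function name and toy tensors are assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def sliderspace_loss(phi_edit: torch.Tensor, phi_base: torch.Tensor, v_i: torch.Tensor) -> torch.Tensor:
    """1 - cos(delta_phi_i, v_i): align the slider's CLIP-space effect with direction v_i."""
    delta_phi = phi_edit - phi_base
    return (1.0 - F.cosine_similarity(delta_phi, v_i.expand_as(delta_phi), dim=-1)).mean()

# Toy batch: in real training these come from the frozen model and the LoRA-adapted model.
phi_base, phi_edit = torch.randn(8, 768), torch.randn(8, 768)
v_i = torch.randn(768)
loss = sliderspace_loss(phi_edit, phi_base, v_i)  # gradients flow into the LoRA only
```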

Loss & Training

The training loss is \(\mathcal{L} = \sum_{i=1}^n (1 - \cos(\Delta\phi_i, v_i))\), using rank-1 LoRA adapters (found to be most efficient under a fixed training budget). The default configuration uses 40 PCA directions. Discovery of 64 sliders completes in approximately 2 hours on a single A100 GPU.
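
For illustration, a rank-1 LoRA layer could look like the sketch below. The paper specifies rank-1 adapters, but this class, its names, and its initialization are assumptions about implementation details, not the authors' code.

```python
import torch
import torch.nn as nn

class Rank1Slider(nn.Module):
    """Frozen linear layer plus a rank-1 LoRA update: W x + scale * B (A x)."""
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base.requires_grad_(False)                           # frozen pretrained weights
        self.A = nn.Parameter(torch.randn(1, base.in_features) * 0.01)  # rank-1 down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, 1))        # up-projection, zero init
        self.scale = 1.0                                                 # the user-facing slider knob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

layer = Rank1Slider(nn.Linear(768, 768))
layer.scale = 0.7                         # turn the knob
y = layer(torch.randn(4, 768))
```

Because each slider is a small additive update, several sliders can be merged into the same base weight as \(W + s_1 B_1 A_1 + s_2 B_2 A_2 + \dots\) with independent scales \(s_i\), which is the composition behavior described in item 3 above.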

Key Experimental Results

Main Results: Concept Decomposition Diversity Evaluation

Image diversity and text alignment are compared between SliderSpace-augmented generation and the base model across 6 concepts; representative results for three concepts are shown below:

| Concept | DreamSim Diversity ↑ | CLIP Score (Text Alignment) |
| --- | --- | --- |
| Mountain (SDXL-DMD) | Low | ~25 |
| Mountain (SliderSpace) | High (~0.5) | ~25 |
| Monster (SDXL-DMD) | Low | ~32 |
| Monster (SliderSpace) | High (~0.55) | ~33 |
| Person (SDXL-DMD) | Low | ~20 |
| Person (SliderSpace) | High (~0.45) | ~20 |

Diversity rises substantially while CLIP scores remain essentially unchanged, indicating that text alignment is preserved.

User study: SliderSpace achieves win rates of 72.4% / 66.0% / 68.1% over SDXL-DMD on diversity / usefulness / creativity, respectively.

Artistic Style Exploration

| Method | FID ↓ (vs. ParrotZone reference distribution) |
| --- | --- |
| Generic Art Prompts | High |
| LLM-generated Art Prompts | Medium |
| Concept Sliders (64 manual) | Moderate |
| SliderSpace (64 automatic) | Lowest |

SliderSpace with only 10 directions matches the FID of 64 manually crafted Concept Sliders.

Ablation Study

| Configuration | FID Performance | Notes |
| --- | --- | --- |
| SliderSpace (full) | Best | CLIP-space PCA + orthogonal training |
| No orthogonality constraint (naive multi-LoRA) | Poor | Many redundant/meaningless directions |
| Pixel-space PCA | Moderate | Captures color/shape variation but not semantics |
| No diversity expansion | Slightly worse | Insufficient diversity in distilled-model training data |

Key Findings

  • Rank-1 LoRA is most efficient: Under a fixed training budget, rank-1 adapters outperform higher-rank variants.
  • Saturation at ~40 PCA directions: Marginal gains diminish beyond 40 directions.
  • Encoder choice: CLIP and DINOv2 perform comparably overall; for face-related concepts, a specialized encoder such as FaceNet discovers finer-grained directions.
  • Timestep selection: Applying sliders across all timesteps alters global structure; skipping the first few steps enables more precise local editing (see the toy gating sketch below).
  • Substantial diversity gains: On COCO-30k, FID improves from 15.52 (SDXL-DMD baseline) to 12.12 with SliderSpace diversity sliders, approaching the 11.72 of the undistilled SDXL.
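
A toy illustration of the timestep-selection heuristic above; the threshold value and helper name are assumptions, not parameters from the paper.

```python
def slider_scale_at(step: int, user_scale: float, skip_first: int = 3) -> float:
    """Keep the slider off for the first few denoising steps to preserve global layout."""
    return 0.0 if step < skip_first else user_scale

print([slider_scale_at(s, 1.0) for s in range(8)])
# [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```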

Highlights & Insights

  • A return to the GAN latent space paradigm: Diffusion models have long lacked the "drag-to-explore" interaction characteristic of GAN latent spaces. SliderSpace elegantly bridges this gap via PCA + LoRA, enabling users to "turn knobs" to explore a diffusion model's visual capabilities in the style of StyleGAN.
  • Self-supervised concept discovery without annotation: No semantic labels are required; meaningful control directions are discovered solely from samples generated by the model itself. This demonstrates that diffusion models internally develop structured concept representations.
  • Addressing mode collapse in distilled models: The diversity sliders discovered by SliderSpace can restore the diversity of distilled models to levels approaching those of the original model, which is highly valuable for practical deployment.

Limitations & Future Work

  • The approach depends on semantic encoders such as CLIP, whose training data biases propagate into the discovered directions.
  • The discovery process requires approximately 2 hours on an A100, which may limit rapid iteration.
  • Discovered directions do not necessarily correspond one-to-one with human-intuitive "natural attributes"; some directions may conflate multiple semantic concepts.
  • Artistic style directions do not necessarily correspond one-to-one with real artists.
  • Future directions include more efficient discovery algorithms, user-guided direction refinement, and cross-model direction transfer.

Comparison with Related Work

  • vs. Concept Sliders (Gandikota et al. 2024): Concept Sliders require users to specify attributes (e.g., "age," "smile"), whereas SliderSpace discovers attributes automatically. The two approaches are complementary: SliderSpace helps users "discover the unknown," while Concept Sliders help users "precisely control the known."
  • vs. NoiseCLR (Dalva & Yanardag 2024): NoiseCLR uses contrastive learning to discover directions in the text embedding space, but the resulting directions may lack semantic interpretability. SliderSpace performs PCA directly in the CLIP visual space, which is more straightforward and guarantees orthogonality of the discovered directions.
  • vs. weights2weights (Dravid et al. 2024): w2w requires training separate LoRAs for each individual before performing PCA, incurring substantial cost. SliderSpace performs PCA directly on the generative distribution, which is considerably more efficient.
  • Broader implication: the method suggests that diffusion models spontaneously develop structured internal representations during training, offering insight into the internal mechanisms of generative models.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Elegantly combines PCA decomposition with LoRA training to realize GAN-style latent space exploration for diffusion models.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three application scenarios, user studies, comprehensive ablations, and multi-model validation.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear exposition, concise mathematical derivations, and high-quality figures.
  • Value: ⭐⭐⭐⭐⭐ Strong practical utility (addressing mode collapse and enhancing controllability) with deep theoretical insight (revealing internal model structure).