TCFG: Tangential Damping Classifier-Free Guidance

Conference: CVPR 2025
arXiv: 2503.18137
Code: None
Area: Diffusion Models
Keywords: Classifier-Free Guidance, Manifold Hypothesis, Singular Value Decomposition, Diffusion Model Sampling, Tangential Component

TL;DR

From the perspective of data manifold geometry, this work utilizes SVD to remove the unaligned tangential component in the unconditional score relative to the conditional score, improving CFG sampling quality with minimal computational overhead. It consistently reduces FID across SD1.5, SDXL, SD3, and DiT.

Background & Motivation

  1. Background: Classifier-Free Guidance (CFG) is the most critical sampling strategy in current text-to-image diffusion models, achieving high-quality conditional generation by combining conditional and unconditional scores.
  2. Limitations of Prior Work: In the CFG guidance formula \(\tilde{s}_\theta = s_\theta(z_t) + \omega(s_\theta(z_t, y) - s_\theta(z_t))\), the unconditional score estimates a generic intermediate manifold for all samples. Its tangential component may not align with the manifold direction of the conditional score, leading to sampling trajectories drifting away from the target manifold, resulting in artifacts like over-exposure and abnormal shapes.
  3. Key Challenge: Since the unconditional score serves all conditions simultaneously, its tangential component is "averaged" and exhibits systematic deviation from the tangent space of the manifold corresponding to a specific condition \(y\).
  4. Goal: To eliminate the manifold misalignment of the unconditional score in CFG without changing model weights and without meaningfully increasing inference cost.
  5. Key Insight: The score function of a diffusion model is theoretically connected to the normal and tangent spaces of the data manifold: components with high singular values correspond to normal directions (well aligned), and components with low singular values correspond to tangential directions (misaligned).
  6. Core Idea: Apply SVD to the conditional and unconditional scores, retain the normal component corresponding to the maximum singular value, and discard the remaining tangential components.

Method

Overall Architecture

TCFG is a plug-and-play sampling strategy that replaces the score combination method in standard CFG. At each sampling step: obtain the conditional score \(s_\theta(z_t, y)\) and unconditional score \(s_\theta(z_t)\) \(\to\) concatenate them into a matrix \(A\) \(\to\) perform SVD \(\to\) project the unconditional score onto the direction corresponding to the largest singular value (normal component) \(\to\) perform CFG using the modified unconditional score.
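The per-step pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `tcfg_guided_score` is hypothetical, the two scores are assumed to be arrays of identical shape that are flattened into the rows of a \(2 \times D\) matrix, and "projecting onto the direction of the largest singular value" is read as an orthogonal projection onto the first right singular vector.

```python
import numpy as np


def tcfg_guided_score(score_uncond, score_cond, omega):
    """Hypothetical helper: one tangentially damped CFG combination.

    Returns (guided_score, damped_unconditional_score), both reshaped
    to the shape of the input scores.
    """
    score_uncond = np.asarray(score_uncond, dtype=np.float64)
    score_cond = np.asarray(score_cond, dtype=np.float64)
    shape = score_uncond.shape

    # Flatten each score to a vector in R^D and stack them as rows: A is (2, D).
    u = score_uncond.reshape(-1)
    c = score_cond.reshape(-1)
    a = np.stack([u, c], axis=0)

    # Thin SVD: vh has shape (2, D); its rows are the right singular vectors,
    # ordered by decreasing singular value.
    _, _, vh = np.linalg.svd(a, full_matrices=False)
    v1 = vh[0]

    # Keep only the component of the unconditional score along v1.
    # The result does not depend on the sign chosen for v1.
    u_damped = (u @ v1) * v1

    # Standard CFG combination, using the damped unconditional score.
    guided = u_damped + omega * (c - u_damped)
    return guided.reshape(shape), u_damped.reshape(shape)
```

In a sampler, a call such as `tcfg_guided_score(eps_uncond, eps_cond, omega)` would stand in for the usual CFG combination at each step. The sketch handles one sample at a time, so a batched sampler would apply it per sample.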

Key Designs

  1. Intermediate Manifold Hypothesis and SVD Analysis

    • Function: Provides the theoretical motivation for the method, together with empirical evidence that the score function has a geometric structure of normal and tangential components.
    • Mechanism: Extends existing theorems (which state that the score approaches the data manifold's normal vector as \(t \to 0\)) by assuming that an intermediate manifold \(\mathcal{M}_t\) exists for all timesteps \(t \in (0,1)\), making the score function an element of its normal space. By collecting 17,000 score samples on SD1.5 and performing SVD, a significant gap in singular values is observed at all timesteps, verifying the existence of the intermediate manifold. It is further found that singular vectors corresponding to high singular values show high cosine similarity between conditional and unconditional scores (aligned normal components), while those corresponding to low singular values show low similarity (unaligned tangential components).
    • Design Motivation: Demonstrates that the misalignment of low-singular-value components is the root cause of artifacts generated by CFG.
  2. Tangential Damping

    • Function: Removes the tangential component in the unconditional score that is unaligned with the conditional score.
    • Mechanism: Stacks the unconditional and conditional scores as the rows of a matrix \(A = [s_\theta(z_t), s_\theta(z_t, y)] \in \mathbb{R}^{2 \times D}\) and performs SVD to obtain singular values \(\sigma_i\) and right singular vectors \(v_i\). Only the direction \(v_1\) of the largest singular value (the normal component) is retained, and the unconditional score is projected onto it: \(\hat{s}_\theta(z_t) = s_\theta(z_t) \cdot V^T \cdot [v_1, 0]\), which keeps the \(v_1\) coefficient of the unconditional score and zeroes the rest. The modified CFG is then applied: \(\tilde{s}_\theta = \hat{s}_\theta(z_t) + \omega(s_\theta(z_t, y) - \hat{s}_\theta(z_t))\). Because the SVD is performed on a \(2 \times D\) matrix only, the computational overhead is negligible.
    • Design Motivation: The normal component is responsible for "pulling towards the manifold" (where conditional and unconditional scores agree), while the tangential component is responsible for "moving along the manifold" (where they disagree). Discarding the tangential component reduces misalignment interference.
  3. Sufficiency of Single-Sample SVD

    • Function: Verifies the practicality of the method by showing that it works without collecting a large number of samples.
    • Mechanism: Compares "SVD over all samples" with "SVD over a single sample pair" in a toy experiment (two moons dataset) and finds the results almost identical. The method can therefore run the SVD on just the two scores of the current sample at each sampling step, with no extra score evaluations and no batch-level statistics.
    • Design Motivation: Ensures that the method does not introduce batch dependencies or extra computational overhead.
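The projection used in Tangential Damping can be written out explicitly. The following is one consistent reading, assuming that scores are row vectors in \(\mathbb{R}^{D}\), that \(v_1, v_2\) are the orthonormal rows of \(V\), and that \(A\) has rank 2. The coefficients \(a_1, a_2\) are notation introduced here for illustration only.

\[
\begin{aligned}
A &= \begin{bmatrix} s_\theta(z_t) \\ s_\theta(z_t, y) \end{bmatrix} = U \Sigma V \in \mathbb{R}^{2 \times D}, \qquad \sigma_1 \ge \sigma_2 \ge 0, \\
s_\theta(z_t) &= a_1 v_1 + a_2 v_2, \qquad a_i = s_\theta(z_t)\, v_i^T, \\
\hat{s}_\theta(z_t) &= s_\theta(z_t)\, V^T \begin{bmatrix} v_1 \\ 0 \end{bmatrix} = a_1 v_1, \\
\tilde{s}_\theta &= \hat{s}_\theta(z_t) + \omega \left( s_\theta(z_t, y) - \hat{s}_\theta(z_t) \right).
\end{aligned}
\]

Under this reading, damping removes exactly the \(a_2 v_2\) term of the unconditional score. The conditional score is left unchanged.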

Loss & Training

TCFG does not require training or fine-tuning; it is purely a test-time modification to the sampling strategy. It uses the original weights of the pre-trained model directly.

Key Experimental Results

Main Results

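| Model | Metric | Original CFG | +TCFG | Gain |
|---|---|---|---|---|
| SD v1.5 | FID↓ | 13.26 | 13.12 | -0.14 |
| SDXL | FID↓ | 13.36 | 12.65 | -0.71 |
| SD v3 | FID↓ | 16.66 | 13.74 | -2.92 |
| DiT (ImageNet) | FID↓ | 32.67 | 29.5 | -3.17 |
| DiT | sFID↓ | 17.92 | 13.27 | -4.65 |
| DiT | Recall↑ | 0.13 | 0.19 | +46% (relative) |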
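The Gain column mixes absolute differences (FID, sFID) with a relative change (Recall). The short check below recomputes both from the values in the table above. The helper names are illustrative only.

```python
# (Original CFG, +TCFG) pairs copied from the results table.
results = {
    ("SD v1.5", "FID"): (13.26, 13.12),
    ("SDXL", "FID"): (13.36, 12.65),
    ("SD v3", "FID"): (16.66, 13.74),
    ("DiT", "FID"): (32.67, 29.5),
    ("DiT", "sFID"): (17.92, 13.27),
    ("DiT", "Recall"): (0.13, 0.19),
}


def absolute_gain(baseline, tcfg):
    """Difference +TCFG minus baseline, rounded to two decimals."""
    return round(tcfg - baseline, 2)


def relative_gain_percent(baseline, tcfg):
    """Change relative to the baseline, in whole percent."""
    return round(100.0 * (tcfg - baseline) / baseline)


for (model, metric), (baseline, tcfg) in results.items():
    print(model, metric,
          absolute_gain(baseline, tcfg),
          f"{relative_gain_percent(baseline, tcfg)}%")
```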

Ablation Study

| Method | FID↓ | Description |
|---|---|---|
| SAG | 13.53 | - |
| SAG + TCFG | 11.48 | Significant effect when combined with SAG |
| PAG | 14.45 | - |
| PAG + TCFG | 11.87 | Compatible with PAG |
| CFG++ | 13.97 | - |
| CFG++ + TCFG | 13.44 | Compatible with CFG++ |

Key Findings

  • The stronger the model (SD3 > SDXL > SD1.5), the larger the FID improvement from TCFG (-2.92, -0.71, and -0.14 respectively). It is hypothesized that stronger models have clearer manifold structure, so discarding the tangential component is more effective.
  • SD3 is based on Rectified Flow instead of standard diffusion, yet TCFG is still applicable, demonstrating the framework-agnostic nature of the method.
  • TCFG significantly mitigates the over-exposure bias issue, which is a well-known pain point of CFG.
  • The CLIP Score remains almost unchanged, showing that removing the tangential component does not impair text alignment.

Highlights & Insights

  • Simple yet Effective: The entire method only requires a single SVD of a \(2 \times D\) matrix per step (where \(D\) is the latent dimension), with zero extra training and zero parameter tuning, while consistently improving FID. This combination of "geometric perspective + minimalist operation" is highly elegant.
  • Bridge between Theory and Practice: The logical chain is complete, progressing from the manifold hypothesis \(\to\) SVD gap observation \(\to\) tangential component misalignment \(\to\) discarding operation.
  • High Generality: Applicable to diffusion models, Rectified Flow, text-guided, and class-guided generation, and can be combined with works like SAG, PAG, and CFG++.

Limitations & Future Work

  • The theory is still based on hypotheses (the existence of the intermediate manifold \(\mathcal{M}_t\)), and rigorous mathematical proof is not yet complete.
  • It retains only the top singular direction \(v_1\) and discards the direction of the smallest singular value; higher-dimensional manifolds might require retaining more directions.
  • Lacks subjective quality evaluation through user studies.
  • Future work could explore adaptively selecting how many singular directions to retain, or dynamically adjusting them based on timesteps.

Comparison with Related Work

  • vs SAG/PAG: SAG uses self-attention maps and PAG uses identity attention maps to enhance CFG, optimizing from the perspective of attention mechanisms. This work optimizes from the perspective of score manifold geometry; the two are orthogonal and can be combined.
  • vs CFG++: CFG++ modifies the mathematical formula of CFG to achieve better sampling, while this work modifies the input scores themselves before feeding them into CFG. The two can also be combined.
  • vs ICG: ICG substitutes empty text with random text embeddings, operating at the conditional representation level. This work operates directly in the score space.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Understands the CFG misalignment issue from a manifold geometry perspective and proposes a minimalist solution; the concept is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple models and frameworks, with detailed FID-CLIP curves and compatibility experiments, though it lacks user studies.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear theoretical derivation and intuitive diagrams, progressing step-by-step from toy experiments to actual models.
  • Value: ⭐⭐⭐⭐⭐ Plug-and-play, zero cost, and widely compatible, possessing high practical value.