
Diffusion Classifiers Understand Compositionality, but Conditions Apply

Conference: NeurIPS 2025 (Datasets & Benchmarks)
arXiv: 2505.17955
Code: https://github.com/eugene6923/Diffusion-Classifiers-Compositionality
Area: Image Generation / Diffusion Models / Compositional Understanding
Keywords: diffusion classifier, compositionality, zero-shot classification, benchmark, timestep weighting

TL;DR

A comprehensive study of zero-shot diffusion classifiers on compositional understanding tasks, covering 3 diffusion models (SD 1.5/2.0/3-m) × 10 datasets × 33 subtasks. The paper introduces Self-Bench, a diagnostic benchmark that eliminates domain gap by using images generated by the diffusion models themselves, and finds that diffusion classifiers do understand compositionality, but performance depends on domain alignment and timestep weighting; hence "conditions apply."

Background & Motivation

Discriminative models such as CLIP frequently fail on compositional understanding tasks—for instance, they cannot distinguish "a red apple on a green table" from "a green apple on a red table," which requires attribute binding and spatial relation reasoning. Contrastive learning models tend to learn shortcut features, are insensitive to word order, and perform poorly on spatial relation and counting tasks.

In contrast, text-to-image (T2I) diffusion models (e.g., the Stable Diffusion family) employ pixel-level supervision during training and can synthesize highly complex compositional scenes, suggesting strong intrinsic compositional understanding. This raises a natural question: Can the compositional capability of generative models be transferred to discriminative tasks?

Zero-shot diffusion classifiers repurpose diffusion models for discrimination: given an image and multiple candidate texts, they identify the best-matching text by comparing the denoising reconstruction errors (i.e., ELBO approximations of the conditional likelihood) under different text conditions. Prior work demonstrated promising results over CLIP on a small number of benchmarks such as Winoground and CLEVR, but with limited evaluation scope and insufficient analysis.

The motivation of this paper is to fill that gap by systematically answering, across large-scale and diverse compositional understanding tasks, the question: under what conditions can diffusion classifiers genuinely understand visual compositionality?

Core Problem & Three Hypotheses

The authors propose three progressive research hypotheses:

  1. Hypothesis 1: The discriminative compositional understanding of diffusion models surpasses that of CLIP → requires validation on larger-scale benchmarks.
  2. Hypothesis 2: Diffusion models can understand (discriminate) images they themselves generated → requires eliminating domain gap to verify.
  3. Hypothesis 3: Domain gap can be mitigated through timestep weighting → requires exploring the relationship between timesteps and domain gap.

Method

Diffusion Classifier Foundation

Given an image \(\mathbf{x}\) with latent representation \(\mathbf{z}\) and candidate texts \(y_1, \ldots, y_K\), the diffusion classifier selects the candidate with the highest conditional likelihood:

\[\tilde{y} = \arg\max_{y_k} \log p(\mathbf{z} \mid y=y_k)\]

The likelihood is approximated via the ELBO, so the argmax above reduces to an argmin over the weighted denoising loss:

\[\mathcal{L}(\mathbf{z}, \mathbf{c}) = \mathbb{E}_{t,\epsilon}[w_t \|\epsilon - \epsilon_\Theta(\mathbf{z}_t, t, \mathbf{c})\|^2]\]

In practice, a fixed set of \(T_s=30\) uniformly spaced timesteps and fixed noise samples are used to reduce Monte Carlo estimation variance; a minimal scoring sketch follows.
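
To make the scoring loop concrete, here is a minimal sketch of the procedure, assuming a generic noise-prediction network `eps_model(z_t, t, c)`, precomputed text conditionings `cond_embs`, and a DDPM-style `alphas_cumprod` schedule; the names are illustrative stand-ins, not the paper's code.

```python
import torch

def diffusion_classify(z, cond_embs, eps_model, alphas_cumprod, n_steps=30):
    """Pick the candidate text whose conditioning yields the lowest denoising error."""
    T = len(alphas_cumprod)
    # Fixed, uniformly spaced timesteps and one fixed noise draw per timestep,
    # shared across all candidates, to reduce Monte Carlo variance.
    timesteps = torch.linspace(0, T - 1, n_steps).long()
    noises = torch.randn(n_steps, *z.shape)
    errors = []
    for c in cond_embs:
        err = torch.tensor(0.0)
        for i, t in enumerate(timesteps):
            a_bar = alphas_cumprod[t]
            z_t = a_bar.sqrt() * z + (1 - a_bar).sqrt() * noises[i]  # forward noising
            eps_hat = eps_model(z_t, t, c)                           # predicted noise
            err = err + torch.mean((noises[i] - eps_hat) ** 2)
        errors.append(err / n_steps)
    # argmin over denoising loss == argmax over the ELBO-approximated likelihood
    return int(torch.stack(errors).argmin())
```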

SD3-m as a Classifier (First Application)

SD3-m is based on the Rectified Flow model and is trained with a conditional flow matching (CFM) loss, which differs from the standard diffusion objective of SD1.5/2.0. The authors reparameterize the CFM objective as a noise prediction loss to enable the same classifier framework:

\[\mathcal{L}_{\mathrm{RF}}(\mathbf{z}, \mathbf{c}) = \mathbb{E}_{t,\epsilon}[w_t \|\epsilon - \epsilon_\Theta(\mathbf{z}_t, t, \mathbf{c})\|^2]\]

The main remaining difference is that SD3's training samples timesteps from a logit-normal distribution, though experiments show that uniform timestep weighting performs better for classification. A reparameterization sketch follows.
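
To illustrate the reparameterization, here is a sketch under the standard rectified-flow setup, where an assumed velocity network `v_model(z_t, t, c)` predicts \(v = \epsilon - \mathbf{z}_0\) along the linear interpolant; recovering \(\epsilon\) from \(v\) lets the classifier loop above be reused unchanged.

```python
import torch

def rf_denoise_error(z0, c, v_model, n_steps=30):
    """Noise-prediction error for a rectified-flow model such as SD3-m."""
    ts = torch.linspace(0.01, 0.99, n_steps)   # uniform timesteps, not logit-normal
    err = torch.tensor(0.0)
    for t in ts:
        eps = torch.randn_like(z0)
        z_t = (1 - t) * z0 + t * eps           # rectified-flow interpolant
        v_hat = v_model(z_t, t, c)             # predicted velocity v = eps - z0
        eps_hat = z_t + (1 - t) * v_hat        # algebra: z_t + (1-t)*v recovers eps
        err = err + torch.mean((eps_hat - eps) ** 2)
    return err / n_steps
```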

Self-Bench: Diagnostic Benchmark Design

Core Idea: Use images generated by the diffusion model itself as the test set, eliminating the distribution gap between real images and the model's training domain, thereby isolating the effect of domain gap on discriminative performance.

Construction pipeline:

  1. Collect prompts: use text prompts from the GenEval benchmark, covering 6 task categories (single object, two objects, color, color attribution, position, counting), with 80 object classes in total.
  2. Generate images: for each prompt, generate 4 images with each of SD1.5/2.0/3-m (guidance scale = 9.0); failed samples are manually filtered.
  3. Construct discrimination tasks: retain the original prompt as the positive sample and construct negative prompts, e.g., replacing "left of" with "right of"/"above"/"below" (see the sketch after this list).
  4. Evaluate: test whether the diffusion classifier can correctly pair the generated image with its original prompt.
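
For the position category, negative construction amounts to swapping the spatial relation in the prompt. A hypothetical sketch (the relation list and function name are mine, not the released code):

```python
RELATIONS = ["left of", "right of", "above", "below"]

def position_negatives(prompt: str) -> list[str]:
    """Build negative prompts by substituting the spatial relation."""
    for rel in RELATIONS:
        if rel in prompt:
            return [prompt.replace(rel, other) for other in RELATIONS if other != rel]
    return []

# "a photo of a dog left of a chair" ->
# ["a photo of a dog right of a chair", "... above ...", "... below ..."]
print(position_negatives("a photo of a dog left of a chair"))
```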

The dataset is divided into two subsets: Full (all generated samples) and Correct (samples unanimously approved by three annotators). SD3-m's Correct rate is substantially higher than SD1.5/2.0 (e.g., for the Position task: SD3-m 113/400 vs. SD1.5 6/400), reflecting the generational improvement in generation quality.

Timestep Weighting

The timestep weight \(w_t\) supports two parameterizations (see the sketch below):

  • Piecewise constant: each timestep independently learns a weight \(v_0, \ldots, v_{T_s-1}\), used to obtain performance upper bounds.
  • Polynomial smoothing: \(w_t = \sum_{i=0}^p a_i t^i\), used to prevent overfitting under low-data settings (only 5% of training data).
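
A sketch of how both parameterizations could be applied to a matrix of per-timestep errors `err` of shape (candidates, timesteps); the normalizations and variable names are my assumptions, not the paper's exact formulation.

```python
import torch

def piecewise_score(err, v):
    """One free weight per timestep (v_0, ..., v_{Ts-1}); the upper-bound probe."""
    w = torch.softmax(v, dim=0)                # keep weights positive and normalized
    return err @ w                             # lower weighted error = better match

def polynomial_score(err, a, ts):
    """Smooth weights w_t = sum_i a_i * t^i to avoid overfitting on ~5% of data."""
    powers = torch.stack([ts ** i for i in range(len(a))], dim=-1)  # (Ts, p+1)
    w = torch.relu(powers @ a)                 # clamp negative weights to zero
    return err @ (w / w.sum())

# Prediction: candidate with the smallest weighted error, e.g.
# pred = int(polynomial_score(err, a, ts).argmin())
```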

Key Experimental Results

Hypothesis 1 Validation: Large-Scale Evaluation on 10 Benchmarks × 33 Subtasks

Evaluation is conducted on 10 compositional understanding benchmarks: Vismin, EQBench, MMVP, CLEVR, Whatsup, Spec, ARO, Sugarcrepe, COLA, and Winoground. Tasks are grouped into four categories: Object, Attribute, Position, and Counting.

Key findings:

  • Diffusion classifiers perform strongest on Position (spatial relation) tasks; SD3-m outperforms CLIP on this category.
  • However, they are clearly inferior to CLIP on Counting tasks.
  • Surprisingly, SD3-m (the strongest generator) achieves lower overall discriminative accuracy (39%) than SD1.5/2.0 (43%).
  • CLIP models still outperform diffusion classifiers on most tasks, contradicting conclusions from prior small-scale experiments.

Hypothesis 2 Validation: Self-Bench In-Domain vs. Cross-Domain

In-domain evaluation (each model evaluates its own generated images):

  • Diffusion classifiers perform well on the Correct subset, demonstrating that they genuinely discriminate based on content rather than simply matching prompts.
  • Generation accuracy and discrimination accuracy are positively correlated (correlation coefficient 0.77): models with stronger generation also discriminate better.
  • SD3-m performs best in in-domain evaluation, consistent with its superior generation capability.

Cross-domain evaluation (each model evaluates images generated by the other models):

  • Performance drops significantly for all models; SD3-m degrades the most, with accuracy dropping by 38% on two-object tasks and by 33–40% on color and spatial tasks.
  • This explains why SD3-m underperforms SD1.5/2.0 on real-image benchmarks: it is not that SD3-m fails to understand compositionality, but that the domain gap between real images and its generation domain is too large.

Hypothesis 3 Validation: Timestep Weighting Mitigates Domain Gap

Single-timestep analysis (a probing sketch follows the list):

  • SD2.0 achieves nonzero classification accuracy across all timesteps.
  • SD3-m yields zero accuracy at more than 50% of timesteps (especially when evaluating SD2.0-generated images), indicating extreme sensitivity to timestep selection.
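
The single-timestep probe can be reproduced from cached per-timestep errors. A sketch, assuming a tensor `errs` of shape (images, candidates, timesteps) and ground-truth labels `y`:

```python
import torch

def per_timestep_accuracy(errs: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Classification accuracy when only a single timestep's error is used."""
    preds = errs.argmin(dim=1)                        # (images, timesteps)
    return (preds == y[:, None]).float().mean(dim=0)  # accuracy per timestep

# Per the paper: SD2.0's curve stays above zero everywhere, while SD3-m's
# collapses on many timesteps, which is what motivates learning the weights.
```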

Low-data timestep reweighting (learning weights from only 5% of the data):

  • Reweighted SD3-m consistently outperforms all baseline models and their reweighted variants across all real-world benchmarks.
  • CLEVR binding task: 63% → 98%.
  • Whatsup (subset A) spatial task: 30% → 42%.
  • SD1.5/2.0 benefit little from reweighting; their uniform weights are already near optimal.

Relationship between domain gap and timestep weighting:

  • Domain gap is quantified as the L2 distance between CLIP image-encoder embeddings of real images and Self-Bench generated images.
  • For SD3-m, a larger domain gap correlates with a greater performance gain from timestep weighting (positive correlation).
  • No such trend is observed for SD1.5/2.0, since most timesteps are already effective for these models, leaving little room for improvement.
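
A sketch of the domain-gap measurement as described (L2 distance in a CLIP image-encoder space); pooling each image set to a mean embedding is my assumption.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

model, preprocess = clip.load("ViT-B/32")

def mean_embedding(paths):
    ims = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    with torch.no_grad():
        feats = model.encode_image(ims)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize features
    return feats.mean(dim=0)

def domain_gap(real_paths, generated_paths):
    """L2 distance between mean CLIP embeddings of two image sets."""
    return (mean_embedding(real_paths) - mean_embedding(generated_paths)).norm().item()
```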

Intuitive Interpretation of Timesteps

Through visualization of the denoising process at different timesteps, the authors offer the following intuition:

  • Very early timesteps (\(t=0.1\)): the noise is too small, so the denoising output is barely influenced by the prompt; no discriminative power.
  • Very late timesteps (\(t=0.96\)): the noise is too large, so the model regenerates the image entirely from the prompt, overriding the original; also no discriminative power.
  • Intermediate timesteps (\(t \in [0.73, 0.93]\)): the model performs meaningful prompt-guided editing while preserving the original image structure; these are the discriminatively informative timesteps.

Highlights & Insights

  • The Self-Bench design is particularly elegant: using a model's own generated images as the test set to eliminate domain gap is a general diagnostic methodology transferable to other generative-discriminative capability studies.
  • The title "conditions apply" precisely captures the core finding: diffusion models do possess compositional understanding, but performance is highly dependent on domain alignment and timestep selection.
  • The counterintuitive dissociation between generation and discrimination is given a principled explanation: SD3-m's weak discriminative performance does not reflect poor compositional understanding but rather a large domain gap with respect to real images.
  • Low-data timestep reweighting is a practical deployment strategy: using only 5% of data substantially mitigates the domain gap problem.
  • The concept of a "discriminative window" over timesteps has theoretical value for understanding the internal representations of diffusion models.

Limitations & Future Work

  • The inference cost of diffusion classifiers far exceeds that of CLIP: computing the likelihood for each candidate text requires multiple forward denoising passes, limiting practical deployment.
  • Self-Bench images are generated by the models themselves and may not fully reflect the visual complexity and diversity of the real world.
  • Larger-scale or newer architectures (e.g., SDXL, FLUX, SD3.5) are not explored; their domain gap and timestep sensitivity may differ substantially.
  • Timestep reweighting requires a small amount of labeled data; a general timestep strategy for purely zero-shot settings remains lacking.
  • Quantifying domain gap via CLIP embeddings introduces a circular dependency, since CLIP is itself one of the discriminators being compared.

Comparison with Related Work

  • vs. CLIP zero-shot: CLIP still outperforms diffusion classifiers on most compositional tasks, but diffusion classifiers hold the advantage in in-domain and spatial relation settings.
  • vs. Diffusion-ITM (Krojer et al.): prior work has limited evaluation scope; this paper extends the evaluation to 10 benchmarks and 33 tasks, reaching more comprehensive, and partly divergent, conclusions.
  • vs. Clark & Jaini's fixed timestep weights: prior work uses a fixed global weight \(w_t=\exp(-7t)\); this paper demonstrates that adapting the weights per model and task is necessary.
  • vs. the Generative AI Paradox (West et al.): that work uses separate models for generation and discrimination; this paper directly probes the generation–discrimination relationship within the same model via diffusion classifiers.

Rating

  • Novelty: ⭐⭐⭐⭐ Self-Bench and the domain gap–timestep weighting correlation analysis are valuable methodological contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 3 models × 10 datasets × 33 tasks at an unprecedented scale, with detailed ablations.
  • Writing Quality: ⭐⭐⭐⭐ The three-hypothesis-then-validation structure is clear and systematic; figures and tables are of high quality.
  • Value: ⭐⭐⭐ Useful reference for understanding the limits of diffusion models' discriminative capability and the mechanism by which domain gap affects discriminative performance.