
Diffusion Probe: Generated Image Result Prediction Using CNN Probes

Conference: CVPR 2026 arXiv: 2602.23783 Code: None Area: Diffusion Models / Image Quality Prediction Keywords: Diffusion Models, Probe, Cross-Attention, Early Quality Prediction, Generation Acceleration

TL;DR

This work discovers that the cross-attention distribution in early denoising steps of diffusion models is highly correlated with final image quality. It proposes Diffusion Probe — a lightweight CNN that predicts generation quality from early attention maps — enabling pre-filtering of low-quality generation trajectories after roughly 20% of the denoising steps (5 of 25 by default), thereby accelerating prompt optimization, seed selection, and GRPO training.

Background & Motivation

State of the Field

Text-to-image (T2I) diffusion models face a fundamental efficiency bottleneck: the quality of a generation is unpredictable until it completes.

Limitations of Prior Work

Users often need multiple attempts (varying prompts or seeds) to obtain a satisfactory result, and each attempt requires a complete denoising pass.

Root Cause

Academic methods such as ICEdit (repeated generation) and Flow-GRPO (multi-candidate ranking) depend on full generation in the same way.

Starting Point

Existing early-prediction methods either incur high computational overhead (ICEdit requires decoding with a 72B VLM) or cannot be automated (PromptCharm relies on manual interpretation of attention maps).

Core finding: Early cross-attention maps in diffusion models contain predictive signals for final image quality — tokens whose attention is dispersed or fragmented correspond to objects that are missing or distorted in the final image.

Method

Overall Architecture

At an early denoising step (default \(t=5\) out of 25 total steps), intermediate cross-attention maps \(\mathcal{A}\) are extracted and, together with timestep embeddings, fed into a lightweight CNN probe network \(E_\theta\), which outputs a scalar quality-score prediction \(\hat{q} = f_\theta(E_\theta(\mathcal{A}, t))\). The probe is trained offline with an MSE regression loss against reward-model scores of fully generated images. The predicted score is then used for early-exit decisions in downstream tasks.
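The forward pass above can be sketched in numpy. This is a minimal toy stand-in, not the paper's implementation: `down_block` uses plain average pooling where the paper uses residual conv blocks, and the linear head over a pooled feature plus a normalized timestep is an assumption about how the timestep embedding enters.

```python
import numpy as np


def down_block(x):
    """Halve spatial resolution by 2x2 average pooling -- a crude stand-in
    for the paper's DownBlock (strided conv + residual layers)."""
    h, w = x.shape[-2] // 2, x.shape[-1] // 2
    return x[..., : 2 * h, : 2 * w].reshape(*x.shape[:-2], h, 2, w, 2).mean(axis=(-3, -1))


def probe_score(attn_maps, t, T=25, w=None, b=0.0):
    """Toy E_theta: a stack of DownBlocks, global pooling, then a linear
    head over the pooled feature and a normalized timestep feature.
    attn_maps: (tokens, H, W) cross-attention extracted at step t."""
    x = attn_maps
    for _ in range(3):
        x = down_block(x)
    feats = np.array([x.mean(), t / T])     # OutputLayer: pool + timestep
    w = np.array([1.0, 0.0]) if w is None else w
    return float(feats @ w + b)             # scalar quality prediction q_hat
```

In practice the head weights would be learned by regressing \(\hat{q}\) onto reward-model scores; here they are fixed only so the sketch runs end to end.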

Key Designs

  1. Empirical Discovery of the Attention–Quality Mapping: A systematic audit of the FLUX model reveals two key phenomena: (a) even during high-noise denoising stages, tokens corresponding to semantically salient objects induce sharp, localized high-attention regions, i.e., early object localization emerges rapidly; (b) when final image quality is poor (missing objects, distortions, semantic inconsistencies), the early attention maps of the corresponding tokens are visibly diffuse and fragmented. These two observations form the empirical basis for using cross-attention to probe generation trajectories.

  2. Lightweight CNN Probe Architecture: The probe consists of multiple DownBlocks (with residual layers) and an OutputLayer (normalization + pooling + convolution). For UNet-based models (e.g., SDXL), attention maps are extracted from the last 10 encoder blocks; for DiT-based models (e.g., FLUX), from 10 consecutive intermediate blocks. The probe is extremely lightweight, modifies no parameters of the base model, and is fully plug-and-play. The training objective is the MSE loss \(\mathcal{L} = \|\hat{q} - q\|_2^2\), where \(q\) is the quality score assigned by a pretrained reward model (e.g., ImageReward) to the fully generated image.

  3. Three Downstream Applications: (a) Prompt Optimization: an LLM is invoked to rewrite the prompt only when the predicted score falls below a threshold \(\tau\), avoiding unnecessary LLM calls; (b) Seed Selection: for a pool of candidate seeds, only \(T_0 \ll T\) denoising steps are run per seed, the probe predicts each seed's quality, and the best seed alone receives a full generation; (c) Efficient Flow-GRPO Training: probe predictions replace full generation to rapidly mine preference pairs \((x^+, x^-)\), significantly accelerating policy convergence. All three applications share the same logic: replace costly full-generation evaluation with cheap early-stage prediction.
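The seed-selection pattern above can be sketched as the loop below. Everything here is hypothetical scaffolding: `denoise_step` and `probe` are placeholder callables standing in for the diffusion model's sampler and the trained CNN probe, and the 8×8 latent is a toy shape.

```python
import numpy as np


def select_seed(prompt, seeds, T=25, T0=5, denoise_step=None, probe=None):
    """Run only T0 << T denoising steps per candidate seed, score each
    partial trajectory with the probe, and return the winner for a
    single full T-step generation (elided here)."""
    scores = {}
    for seed in seeds:
        # Toy latent initialized from the candidate seed.
        state = np.random.default_rng(seed).standard_normal((8, 8))
        for t in range(T0):
            state, attn = denoise_step(state, t)   # partial denoising only
        scores[seed] = probe(attn, T0)             # early quality prediction
    best = max(scores, key=scores.get)
    return best, scores
```

A usage sketch with stub callables: `select_seed("a cat", [0, 1, 2], denoise_step=lambda s, t: (s * 0.9, np.abs(s)), probe=lambda a, t: float(a.mean()))`. The same early-exit scoring serves prompt optimization (rewrite only if the score is below \(\tau\)) and Flow-GRPO pair mining.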

Loss & Training

  • MSE regression loss, with labels from the ImageReward pretrained reward model
  • Training data: 15K prompts (MS-COCO); evaluation: 5K disjoint prompts
  • Low-score samples are oversampled to address class imbalance
  • Default extraction step: \(t=5\) (the 5th of 25 steps); prediction accuracy improves most sharply from \(t=1\) to \(t=5\), with diminishing returns thereafter
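The training recipe above (MSE regression against reward-model labels, with low-score oversampling) can be illustrated with a toy numpy script. The features, labels, oversampling ratio, and learning rate are all invented for illustration; the real probe regresses CNN activations onto ImageReward scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy features standing in for pooled probe activations; labels q stand in
# for ImageReward scores of fully generated images.
X = rng.standard_normal((1000, 4))
q = X @ np.array([0.5, -0.3, 0.2, 0.1]) + 1.0 + 0.05 * rng.standard_normal(1000)

# Oversample low-score samples (here: duplicate the bottom 20% three extra
# times) to counter the imbalance toward high-quality generations.
low = np.where(q < np.quantile(q, 0.2))[0]
idx = np.concatenate([np.arange(len(q)), np.repeat(low, 3)])
Xb, qb = X[idx], q[idx]

# Plain gradient descent on the MSE loss ||q_hat - q||^2.
w, b = np.zeros(4), 0.0
for _ in range(500):
    err = Xb @ w + b - qb
    w -= 0.1 * (Xb.T @ err) / len(qb)
    b -= 0.1 * err.mean()

mse = float(((X @ w + b - q) ** 2).mean())
```

Since the toy labels are linear in the features, the fit recovers them almost exactly; the oversampling step only reweights the loss toward the low-score tail.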

Key Experimental Results

Main Results (Prediction Accuracy, 1024×1024 Resolution)

| Base Model | Steps | SRCC↑ | AUC-ROC↑ | KTC↑ | PCC↑ |
|------------|-------|-------|----------|------|------|
| SDXL       | 5     | 0.73  | 0.86     | 0.57 | 0.72 |
| SDXL       | 10    | 0.76  | 0.89     | 0.61 | 0.75 |
| FLUX       | 5     | 0.76  | 0.88     | 0.60 | 0.75 |
| FLUX       | 10    | 0.79  | 0.91     | 0.64 | 0.78 |
| Qwen-Image | 10    | 0.72  | 0.87     | 0.56 | 0.71 |

Strong consistency is observed across architectures (UNet and DiT); AUC-ROC approaching 0.9 (up to 0.91 for FLUX at 10 steps) indicates strong discriminative power for binary quality classification.
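For reference, three of the table's four metrics can be computed from scratch in a few lines of numpy (Kendall's tau is omitted for brevity; the rank-based SRCC below assumes no tied scores).

```python
import numpy as np


def pcc(a, b):
    """Pearson correlation between predicted and true quality scores."""
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))


def srcc(a, b):
    """Spearman rank correlation = Pearson correlation of the ranks
    (assumes no ties)."""
    rank = lambda x: np.argsort(np.argsort(x)).astype(float)
    return pcc(rank(a), rank(b))


def auc_roc(scores, labels):
    """Probability that a random positive outranks a random negative
    (ties counted as 0.5), for binary good/bad quality labels."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)
```

In practice, `scipy.stats.spearmanr`/`pearsonr`/`kendalltau` and `sklearn.metrics.roc_auc_score` compute the same quantities with proper tie handling.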

Ablation Study (Downstream Task Performance)

| Model | Task                | Method   | CLIP Score↑ | ImageReward↑ | Aesthetic↑ |
|-------|---------------------|----------|-------------|--------------|------------|
| SDXL  | Prompt Optimization | Baseline | 28.31       | 0.71         | 5.13       |
| SDXL  | Prompt Optimization | +Probe   | 30.24       | 0.72         | 5.29       |
| SDXL  | Prompt Optimization | +LLM     | 30.80       | 0.73         | 5.34       |
| FLUX  | Seed Selection      | Random   | 31.37       | 1.02         | 5.67       |
| FLUX  | Seed Selection      | +Probe   | 31.41       | 1.06         | 5.79       |

The probe approaches LLM-level performance on prompt optimization at substantially lower cost; seed selection yields notable improvements in aesthetic scores.

Key Findings

  • Probe prediction accuracy reaches near-peak performance at step 5 (PCC 0.75 vs. peak 0.78), using only 20% of total denoising steps
  • Consistent performance across three base models spanning two architecture families (UNet and DiT) validates the model-agnostic nature of the approach
  • In Flow-GRPO training, the probe increases the proportion of high-quality samples by 2.5×, yielding smoother convergence curves
  • The degree of attention map diffuseness directly corresponds to object rendering failure modes (missing, distorted, or attribute-misaligned objects)

Highlights & Insights

  • This work is the first to bring the probing paradigm from LLM interpretability into diffusion models; "diagnosing generation trajectories with probes" is a genuinely novel perspective
  • The core insight is elegant: concentrated early attention implies correct object rendering; diffuse early attention implies generation failure
  • The probe is fully decoupled from the base model, non-invasive, and applicable to any T2I model with a cross-attention mechanism
  • Application scenarios are diverse and practically relevant (prompt iteration, seed filtering, RL training acceleration)

Limitations & Future Work

  • Prediction has only been validated for the ImageReward metric; predictive capacity for other quality dimensions (e.g., text rendering quality, spatial relationships) remains to be evaluated
  • The probe must be trained separately for each base model; generalization to new models requires re-collecting training data
  • The extraction step \(t\) and block layer are currently selected manually, which may be suboptimal for different models
  • The MSE loss is insensitive to ranking; ranking losses or contrastive learning could be explored
  • For prompts that are extremely simple or highly complex, attention patterns may not carry equivalent predictive power
Positioning vs. Related Work

  • Distinction from DAAM (attention attribution visualization): DAAM performs post-hoc analysis, whereas Diffusion Probe performs prediction
  • Complementary to attention-manipulation methods such as Attend-and-Excite: Attend-and-Excite improves the attention maps, while Diffusion Probe predicts their outcome
  • Compared to ICEdit's early prediction: ICEdit requires decoding with a 72B VLM, whereas the probe requires only a lightweight CNN
  • The probing paradigm is well established in NLP (probes predicting linguistic properties from hidden states); extending it to visual generation is a natural progression
  • The approach can be combined with acceleration methods such as DPCache: early filtering followed by accelerated generation

Rating

  • Novelty: ⭐⭐⭐⭐ First application of probing to diffusion models; the attention–quality correlation is a valuable finding
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated across three models × multiple step counts × three downstream tasks
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clear, visualizations are intuitive, and experimental organization is well-structured
  • Value: ⭐⭐⭐⭐ High practical value, especially for scenarios requiring extensive sampling (RL training, agent-based generation)