Making Training-Free Diffusion Segmentors Scale with the Generative Power¶

Conference: CVPR 2026
arXiv: 2603.06178
Code: Available
Area: Semantic Segmentation
Keywords: Diffusion Models, Training-Free Segmentation, Cross-Attention, Auto Aggregation, Per-Pixel Rescaling, Generative Scaling

TL;DR¶

This paper identifies the fundamental reasons why existing training-free diffusion segmentation methods fail to scale with the generative power of stronger models — namely, two gaps between cross-attention maps and semantic relevance (an aggregation gap and a score imbalance gap). It proposes two techniques, auto aggregation and per-pixel rescaling, forming the GoCA framework, which for the first time enables stronger diffusion models (SDXL, PixArt-Sigma, Flux) to significantly outperform weaker ones in training-free semantic segmentation.

Background & Motivation¶

Background: Text-to-image diffusion models (Stable Diffusion, Flux, etc.) have been explored for discriminative tasks given their powerful image generation capabilities. One line of research focuses on "training-free diffusion segmentation" — directly leveraging cross-attention maps from pretrained diffusion models for semantic segmentation without additional training.

Core Premise and Expectation: Training-free diffusion segmentation methods are grounded in the generative power of diffusion models. Intuitively, stronger generative models should yield better segmentation results — i.e., segmentation performance should scale with generative capability.

Counter-Intuitive Observation: The authors find that existing methods (DiffSegmentor, FTTM, etc.) are almost exclusively validated on Stable Diffusion v1.5/v2.1. When switched to stronger models such as SDXL, PixArt-Sigma, or Flux, segmentation performance does not improve and may even degrade — a finding that fundamentally contradicts the intuition that stronger generation implies better segmentation.

Gap I — Aggregation Gap: Diffusion models contain multi-head, multi-layer cross-attention, with each head/layer producing independent attention maps. Prior methods aggregate these maps using manually specified weights, a process that becomes infeasible for more complex architectures (UNet → DiT → MMDiT).

Gap II — Score Imbalance Gap: Even with a globally aggregated attention map, raw scores do not directly reflect semantic relevance. Two forms of imbalance exist: (a) foreground token scores (e.g., "cat") are substantially higher than background token scores (e.g., "grass"), making direct comparison unreliable; (b) semantic special tokens (e.g., <sos>) exhibit inconsistent score magnitudes across pixels, corrupting per-token normalization.

Core Idea: Replace manual weight tuning with an automated aggregation scheme based on inter-activation correlations within the model, and eliminate interference from semantic special tokens via per-pixel rescaling — bridging both gaps so that training-free segmentation genuinely scales with generative capability.

Method¶

Overall Architecture: GoCA (Generative scaling of Cross-Attention)¶

The framework consists of two modules: (1) Auto Aggregation, which addresses Gap I by automatically assigning aggregation weights across heads and layers; and (2) Per-Pixel Rescaling, which addresses Gap II by eliminating the interference of semantic special tokens on attention scores. These are followed by standard self-attention refinement and argmax-based segmentation.
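
The final two steps are described only as standard. One plausible form, assumed here rather than taken from the paper, propagates the per-token scores through a pixel-to-pixel self-attention matrix and then takes a per-pixel argmax. The function name `refine_and_segment`, the array shapes, and the toy numbers are all hypothetical.

```python
import numpy as np

def refine_and_segment(cross_attn, self_attn):
    """cross_attn: (P, C) score of each class token at each pixel.
    self_attn:  (P, P) pixel-to-pixel attention, rows summing to one.
    Returns a (P,) array with the winning class index per pixel."""
    refined = self_attn @ cross_attn   # each pixel averages scores of pixels it attends to
    return refined.argmax(axis=1)

# Toy example: 3 pixels, 2 classes. Pixel 1 is ambiguous on its own,
# but its self-attention row points mostly at pixel 0.
cross_attn = np.array([[0.90, 0.10],
                       [0.45, 0.55],
                       [0.10, 0.90]])
self_attn = np.array([[0.8, 0.2, 0.0],
                      [0.6, 0.4, 0.0],
                      [0.0, 0.1, 0.9]])

labels_raw = cross_attn.argmax(axis=1)                   # argmax without refinement
labels_refined = refine_and_segment(cross_attn, self_attn)
print(labels_raw.tolist(), labels_refined.tolist())
```

In this toy case the ambiguous pixel flips to the class of the pixel it attends to most, which is the smoothing effect refinement is meant to provide.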

Auto Aggregation¶

Head-wise Aggregation¶

Mechanism: Multi-head attention can be rewritten as a sum of per-head output vectors: \(Output = \sum_n A_n V_n W_n^O\). The contribution of each head can be measured by the dot-product similarity between that head's output vector and the total output.
Formulation: For head \(n\) in layer \(m\), the weight is \(w_{mn} = (A_n V_n W_n^O)_m^\top \cdot Output_m\), normalized to obtain per-pixel head weights \(w_{mn}'\).
Design Motivation: Per-pixel weights, rather than a single global weight, more finely capture the contribution of different heads at different spatial locations.
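
A minimal NumPy sketch of the head-wise weighting for a single layer follows. It assumes the per-head output vectors \(A_n V_n W_n^O\) have already been extracted as an array of shape (heads, pixels, dim). The source does not specify the normalization, so clipping negative similarities and normalizing the weights to sum to one over heads is an assumption, as are the name `headwise_aggregate` and the random stand-in data.

```python
import numpy as np

def headwise_aggregate(head_outputs, head_attn, eps=1e-8):
    """Combine per-head cross-attention maps using per-pixel head weights.

    head_outputs: (N, P, D) per-head output vectors A_n V_n W_n^O
    head_attn:    (N, P, T) per-head cross-attention maps over T tokens
    Returns (weights, aggregated) with shapes (N, P) and (P, T).
    """
    total = head_outputs.sum(axis=0)                      # Output = sum over heads, (P, D)
    raw = np.einsum("npd,pd->np", head_outputs, total)    # dot product per head and pixel
    # Assumed normalization: drop negative similarities, then sum to one over heads.
    pos = np.clip(raw, 0.0, None)
    weights = pos / (pos.sum(axis=0, keepdims=True) + eps)
    aggregated = np.einsum("np,npt->pt", weights, head_attn)
    return weights, aggregated

# Random stand-ins for real activations (hypothetical sizes).
rng = np.random.default_rng(0)
N, P, D, T = 4, 16, 8, 3
head_outputs = rng.normal(size=(N, P, D))
head_attn = rng.random(size=(N, P, T))
weights, agg = headwise_aggregate(head_outputs, head_attn)
print(weights.shape, agg.shape)
```

Because the weights are indexed by pixel, a head that dominates the output at one location can be down-weighted at another, which is the point made in the design motivation.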

Layer-wise Aggregation¶

Mechanism: A pseudo self-attention map \(A_p\) is computed from dense diffusion features as a proxy for global attention; the weight of each layer is then determined by the similarity between that layer's actual self-attention map and this proxy.
Key Assumption: The contribution pattern of cross-attention layers mirrors that of self-attention layers — self-attention similarity is used as a proxy for cross-attention contribution.
Formulation: \(w_m = (A_p')^\top (A_{self}^m)'\), normalized and used to weight-sum across layers. The final global attention map is \(A = \sum_m w_m' A_m\).
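
The layer-wise step can be sketched in the same style. The source does not define how \(A_p\) is built from dense features or what the primes denote, so the sketch assumes a softmax over cosine similarities for the proxy and flatten-plus-L2-normalization for the primed quantities. It also assumes every layer's maps have been resized to a common resolution. All function names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_self_attention(features, temperature=0.1):
    """Proxy map A_p from dense features (P, D).
    The softmax-of-cosine form and the temperature are assumptions."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    return softmax(f @ f.T / temperature, axis=-1)       # (P, P)

def unit(x):
    """Flatten and L2-normalize: our assumed reading of the primed quantities."""
    v = x.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-8)

def layerwise_aggregate(pseudo, self_attn, cross_attn):
    """pseudo: (P, P); self_attn: (M, P, P); cross_attn: (M, P, T)."""
    w = np.array([unit(pseudo) @ unit(a) for a in self_attn])  # similarity per layer
    w = w / w.sum()                                            # weights sum to one
    return w, np.einsum("m,mpt->pt", w, cross_attn)            # A = sum_m w'_m A_m

# Random stand-ins for real activations (hypothetical sizes).
rng = np.random.default_rng(0)
M, P, D, T = 3, 16, 8, 4
features = rng.normal(size=(P, D))
pseudo = pseudo_self_attention(features)
self_attn = softmax(rng.normal(size=(M, P, P)), axis=-1)
self_attn[0] = pseudo                  # make layer 0 agree with the proxy exactly
cross_attn = rng.random(size=(M, P, T))
layer_weights, global_map = layerwise_aggregate(pseudo, self_attn, cross_attn)
print(layer_weights, global_map.shape)
```

In the demo, the layer whose self-attention matches the proxy receives the largest weight, so its cross-attention map contributes most to the global map.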

Per-Pixel Rescaling¶

Problem: Two imbalances exist in the global attention map: (a) large magnitude differences between foreground and background token scores; (b) semantic special tokens (<sos>) dominate scores with inconsistent scales across pixels, causing per-token normalization to fail.
Solution:
Exclude non-content tokens: Retain only content word tokens (e.g., "cat", "grass"), discarding semantic special tokens and stopword tokens.
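Per-pixel normalization to unity: For each pixel \(i\), divide the score of each content token \(q\) by the sum over all content tokens \(q_j\): \(A'(i,q) = \frac{A(i,q)}{\sum_j A(i,q_j)}\). The content-token scores at each pixel then sum to one, which eliminates scale interference from semantic special tokens.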
Per-token re-normalization: Apply min-max normalization to \([0,1]\) across all pixels for each token, enabling reliable cross-token comparison.
Intuition: Semantic special token scores are higher in background regions (since foreground regions are already dominated by content token information); removing this interference substantially improves background attention map quality.
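
The three steps above can be sketched on a toy attention map. The numbers are invented for illustration: the <sos> column is larger in the two background pixels, and the "cat" scores exceed the "grass" scores everywhere, including in the background. The name `per_pixel_rescale` and the `eps` guard are assumptions.

```python
import numpy as np

def per_pixel_rescale(attn, content_idx, eps=1e-8):
    """attn: (P, T) global attention map over all prompt tokens.
    content_idx: column indices of the content tokens to keep.
    Returns a (P, C) map that is comparable across tokens."""
    a = attn[:, content_idx]                        # 1. drop special and stopword tokens
    a = a / (a.sum(axis=1, keepdims=True) + eps)    # 2. each pixel's scores sum to one
    lo = a.min(axis=0, keepdims=True)
    hi = a.max(axis=0, keepdims=True)
    return (a - lo) / (hi - lo + eps)               # 3. min-max per token over pixels

# Columns: <sos>, "cat", "grass". Rows: two cat pixels, then two grass pixels.
attn = np.array([[0.30, 0.60, 0.02],
                 [0.50, 0.40, 0.02],
                 [0.80, 0.08, 0.04],
                 [0.90, 0.04, 0.03]])
content_idx = [1, 2]

naive = attn[:, content_idx].argmax(axis=1)         # raw comparison of content tokens
rescaled = per_pixel_rescale(attn, content_idx).argmax(axis=1)
print(naive.tolist(), rescaled.tolist())
```

On this toy input the raw comparison labels every pixel "cat", while the rescaled map assigns the last two pixels to "grass", mirroring the background improvement described above.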

Key Experimental Results¶

Main Results¶

mIoU comparison on five standard semantic segmentation benchmarks:

Method	Model	VOC	Context	COCO-Obj	Cityscapes	ADE20K
DiffSegmentor	SD v1.5	60.1	27.5	37.9	-	-
FTTM	SD v1.5	48.9	30.0	34.6	12.3	20.3
Vanilla	SD v1.5	44.3	32.3	32.3	11.8	18.0
Vanilla	Flux	55.7	48.4	43.3	25.6	24.5
GoCA	SD v1.5	60.7	40.4	39.2	16.1	22.0
GoCA	SD XL	65.6	42.3	44.3	21.2	23.2
GoCA	PixArt-Σ	63.6	43.2	39.8	22.6	23.8
GoCA	Flux	70.7	51.1	48.1	27.1	29.3

GoCA + Flux achieves 70.7% mIoU on Pascal VOC, outperforming Vanilla SD v1.5 (44.3) by 26.4 points and the strongest prior method in the table (DiffSegmentor, 60.1) by 10.6 points.

Ablation Study¶

Component contributions on Pascal VOC 2012 (mIoU):

Head Agg.	Layer Agg.	Rescaling	SD v1.5	SD XL
Vanilla	Vanilla	Vanilla	44.3	51.1
Vanilla	Manual	Vanilla	51.1	-
Ours	Vanilla	Vanilla	44.8	56.1
Vanilla	Ours	Vanilla	52.1	51.3
Vanilla	Vanilla	Ours	52.6	51.4
Ours	Ours	Ours	60.7	65.6

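Individual gains vary widely by component and model. On SD v1.5, layer aggregation and rescaling each add about 8 points (+7.8 and +8.3), while head aggregation adds only +0.5. On SD XL, head aggregation adds +5.0, while layer aggregation and rescaling add +0.2 and +0.3. Combining all three yields much larger gains (SD v1.5: +16.4, SD XL: +14.5); on SD XL this is well above the sum of the individual gains, which suggests the components are complementary.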
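
The per-component gains can be recomputed directly from the ablation table. The short script below is only an arithmetic check on the numbers listed above; the row labels are shorthand introduced here, and the manual-aggregation row is omitted because it has no SD XL entry.

```python
# mIoU on Pascal VOC 2012, copied from the ablation table: (SD v1.5, SD XL)
ablation = {
    "baseline":       (44.3, 51.1),
    "head only":      (44.8, 56.1),
    "layer only":     (52.1, 51.3),
    "rescaling only": (52.6, 51.4),
    "all three":      (60.7, 65.6),
}

base = ablation["baseline"]
# Gain of each configuration over the all-vanilla baseline, rounded to one decimal.
gains = {
    name: tuple(round(score - b, 1) for score, b in zip(scores, base))
    for name, scores in ablation.items()
    if name != "baseline"
}

for name, (g15, gxl) in gains.items():
    print(f"{name:15s} SD v1.5 {g15:+.1f}   SD XL {gxl:+.1f}")
```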

Generative Integration Experiment¶

S-CFG integration (COCO-30k, CFG=5.0):

Method	FID↓	CLIP↑
CFG	19.27	31.34
S-CFG	19.15	31.35
GoCA + S-CFG	18.82	31.42

Replacing the internal segmentor of S-CFG with GoCA consistently improves generation quality, validating the practical utility of training-free segmentation in generative pipelines.

Key Findings¶

First successful positive scaling from generative to segmentation capability: Manually tuned baselines on SD v1.5 sometimes outperform Vanilla methods on SD XL/PixArt-Sigma; GoCA eliminates this counter-intuitive phenomenon.
Especially pronounced improvement in background regions: Per-pixel rescaling substantially improves attention map quality for background categories such as "grass" and "wall" by removing the adversarial influence of semantic special tokens.
Strong architectural generalizability: GoCA is effective across UNet-based (SD v1.5/XL), DiT-based (PixArt-Sigma), and MMDiT-based (Flux) architectures, whereas manual tuning methods cannot generalize.
Auto layer aggregation matches manual tuning: The proposed layer aggregation (52.1) matches or exceeds manual aggregation (51.1) without any human intervention.
Clear value in generative integration: As an internal component of S-CFG, GoCA yields consistent improvements in both FID and CLIP scores.

Highlights & Insights¶

Novel and important problem formulation: This is the first work to systematically identify and validate the failure of training-free diffusion segmentation to scale with generative capability, providing a clear direction for the field.
Precise gap analysis: The problem is decomposed into an aggregation gap and a score imbalance gap, each with a rigorous formalization and a targeted solution.
Fully training-free: The method introduces no learnable parameters and relies solely on the intrinsic structure of model activations, preserving the purity of the training-free paradigm.
Insightful observation on semantic special tokens: The finding that <sos> scores are higher in background regions (since foreground regions are dominated by content token information) offers theoretical value for understanding the internal mechanisms of diffusion models.
Strong practical utility: GoCA can be directly integrated into generative techniques such as S-CFG to improve text-to-image generation quality.

Limitations & Future Work¶

Limited to semantic segmentation: The method relies on the assumption of semantic relevance in cross-attention maps and has not yet been extended to instance segmentation, panoptic segmentation, depth estimation, or other discriminative tasks.
Dependence on an external model: GPT-4o is required to construct prompts covering all categories, introducing a dependency on an external module.
Sensitivity to prompt design: Different prompt strategies significantly affect results, and cross-method comparisons are confounded by differences in prompt design.
Directions for future work: Extending GoCA to instance and panoptic segmentation; exploring automatic prompt generation without external models; investigating temporal extension to video diffusion models.

Comparison with Related Work¶

vs. DiffSegmentor / FTTM: These methods manually assign aggregation weights for SD v1.5 and cannot generalize to SD XL/Flux or other new architectures. GoCA resolves architectural dependency through auto aggregation and also surpasses DiffSegmentor on SD v1.5 (60.7 vs. 60.1).
vs. DiffCut: DiffCut applies Normalized Cut to dense diffusion features for segmentation; GoCA likewise leverages dense features but uses them to compute a proxy self-attention map for layer-wise aggregation weights — the two approaches are complementary in their use of dense features.
vs. training-based diffusion discriminators (ODISE, VPD, etc.): Training-based methods achieve higher performance by fine-tuning for segmentation tasks but require additional training data and computation. GoCA demonstrates that training-free methods, when correctly scaled, can substantially close this gap.

Rating¶

Novelty: ⭐⭐⭐⭐ First to systematically identify and address the scaling failure of training-free diffusion segmentation
Experimental Thoroughness: ⭐⭐⭐⭐ Four diffusion models, five benchmarks, ablation studies, generative integration, and qualitative analysis
Writing Quality: ⭐⭐⭐⭐ Clear problem motivation, well-structured progressive gap analysis, and intuitive figures
Value: ⭐⭐⭐⭐ Opens the door to scaling in training-free diffusion segmentation with both practical utility and theoretical significance