TRACE: Your Diffusion Model is Secretly an Instance Edge Detector¶
Conference: ICLR 2026 | arXiv: 2503.07982 | Code: Project Page | Area: Instance Segmentation / Panoptic Segmentation | Keywords: Diffusion Models, Instance Edges, Self-Attention, IEP, Unsupervised Segmentation
TL;DR¶
This work identifies an "Instance Emergence Point" (IEP) in the denoising trajectory of text-to-image diffusion models—a timestep at which self-attention exhibits sharp divergence changes at object boundaries. TRACE combines IEP localization, ABDiv edge extraction, and single-step distillation to generate high-quality instance edges with an 81× inference speedup and no instance-level annotations, improving unsupervised instance segmentation by +5.1 AP and, with only tag-level supervision, surpassing point-supervised panoptic segmentation by +1.7 PQ.
Background & Motivation¶
Background: Instance and panoptic segmentation have long relied on dense annotations (masks/boxes/points), which are costly and suffer from inter-annotator inconsistency. Unsupervised approaches (e.g., MaskCut) cluster ViT semantic features, but ViTs are optimized for cross-image semantic similarity rather than intra-image instance separation, frequently merging adjacent objects of the same class or fragmenting a single instance. Weakly supervised methods require at least point annotations to distinguish instances.
Limitations of Prior Work: (1) Unsupervised methods depend on self-supervised ViT features that are insufficient at the instance level—merging neighboring same-class objects is a fundamental failure mode; (2) depth-assisted approaches (CutS3D) fail on adjacent objects at similar depths; (3) tag-level weak supervision already reaches ~99% of fully supervised accuracy in semantic segmentation on VOC, yet bridging from semantic to panoptic segmentation still requires point or box annotations.
Key Challenge: Semantic features excel at "knowing what" but struggle at "telling who from whom"—instance separation requires a fundamentally different signal source.
Goal: To identify an annotation-free, instance-level signal that complements semantic features for instance separation.
Key Insight: Diffusion models progressively evolve from noise → instance structure → semantic content during denoising—at specific timesteps, self-attention transiently but clearly encodes instance boundaries.
Core Idea: The self-attention of diffusion models is a hidden instance edge annotator—sharp divergence changes in attention distributions across boundaries constitute the instance boundary signal.
Method¶
Overall Architecture¶
TRACE consists of three stages: (1) IEP (Instance Emergence Point): the KL divergence between consecutive self-attention maps is computed along the denoising trajectory, and the timestep \(t^*\) corresponding to the divergence peak is identified to obtain instance-aware self-attention \(SA_{\text{inst}}\); (2) ABDiv (Attention Boundary Divergence): relative attention divergence between spatial neighbors is computed on \(SA_{\text{inst}}\) to produce a pseudo-edge map \(E\); (3) Single-step self-distillation: the diffusion backbone is fine-tuned with LoRA and a lightweight edge decoder \(\mathcal{G}_\phi\) is trained, enabling edge prediction in a single forward pass at \(t=0\) with an 81× inference speedup. The generated edges are integrated into downstream segmentation via Background-Guided Propagation.
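As a rough sketch of stage 1, IEP localization amounts to scanning KL divergence between self-attention maps at consecutive sampled timesteps and taking the peak. The NumPy sketch below is illustrative, not the paper's code: the sampling interface (`attn_by_t`), function names, and the row-stochastic `(HW, HW)` attention layout are our assumptions.

```python
import numpy as np

def kl_rows(p, q, eps=1e-8):
    """Mean row-wise KL divergence between two stacks of attention
    distributions (each row of p and q is a probability distribution)."""
    p = p + eps
    q = q + eps
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

def locate_iep(attn_by_t):
    """attn_by_t: {timestep: (HW, HW) row-stochastic self-attention map},
    sampled along the denoising trajectory (e.g. every 100 steps).
    Returns t* = argmax_t KL(SA(X_t_prev) || SA(X_t))."""
    ts = sorted(attn_by_t)  # indexed along the sampling trajectory
    scores = {t: kl_rows(attn_by_t[t_prev], attn_by_t[t])
              for t_prev, t in zip(ts, ts[1:])}
    return max(scores, key=scores.get)
```

In this toy setup, the attention map stays uniform for two steps and then snaps to a peaked structure, so the divergence peak (and hence `locate_iep`'s answer) falls on the transition step.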
Key Designs¶
- Instance Emergence Point (IEP):
- Function: Automatically locates the timestep at which instance structure is most salient during denoising.
- Mechanism: KL divergence between self-attention maps at adjacent timesteps is computed along the denoising trajectory: \(t^* = \arg\max_t D_{\text{KL}}(SA(X_{t_{\text{prev}}}) \| SA(X_t))\). Early in denoising, attention is nearly random; in the middle stage, instance boundaries emerge (divergence peak); later, attention stabilizes into semantic content. A fixed step size of 100 yields stable results.
- Design Motivation: The denoising trajectory transitions from noise → instance structure → semantic content—IEP pinpoints the inflection point of this transition. KL divergence is more sensitive than L2/L1 to subtle differences between probability distributions, and outperforms an L2 criterion by 5.6 AP\(^{mk}\) (9.4 vs. 3.8).
- Attention Boundary Divergence (ABDiv):
- Function: Converts instance-aware self-attention maps into edge maps.
- Mechanism: For each pixel \((i,j)\), the sum of KL divergences between opposing neighbors in the four cardinal directions is computed: \(\text{ABDiv}(SA)_{i,j} = D_{\text{KL}}(SA_{i+1,j} \| SA_{i-1,j}) + D_{\text{KL}}(SA_{i,j+1} \| SA_{i,j-1})\). Neighbors within the same instance have similar attention distributions (low divergence); neighbors across instance boundaries exhibit abrupt divergence spikes.
- Design Motivation: Non-parametric—no training or clustering is required; boundary signals are extracted directly from the geometric properties of attention distributions.
- Single-step Self-distillation Edge Decoder:
- Function: Compresses the multi-step IEP+ABDiv computation into a single inference step.
- Mechanism: The ABDiv pseudo-edge map \(E\) is used as supervision (pixels \(> \mu+\sigma\) as positive, \(< \mu-\sigma\) as negative, intermediate pixels masked out). The diffusion backbone is fine-tuned with LoRA at \(t=0\), and a lightweight decoder \(\mathcal{G}_\phi\) is trained with the loss \(\mathcal{L} = \|I-\hat{I}\|^2 + \text{DiceLoss}(E, \hat{E})\). The reconstruction loss stabilizes training and completes broken edges.
- Design Motivation: IEP+ABDiv requires approximately 3.7 seconds per image; after distillation, inference takes only 45 ms/image (81× speedup), and the resulting edges are more continuous and complete.
Loss & Training¶
Distillation training: DiceLoss (edge prediction) + L2 reconstruction loss, with uncertain pixels excluded. Training is performed exclusively on the COCO training set using LoRA fine-tuning of the diffusion backbone. Inference requires a single forward pass; the default backbone is SD3.5-L.
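The μ±σ pseudo-labeling and the Dice + reconstruction objective can be sketched as below (NumPy, for illustration only; in practice the edge prediction and reconstruction come from the LoRA-tuned backbone and the decoder \(\mathcal{G}_\phi\), and the function names here are ours):

```python
import numpy as np

def pseudo_labels(E):
    """Turn an ABDiv pseudo-edge map E into distillation targets:
    1 above mu+sigma, 0 below mu-sigma, uncertain pixels masked out."""
    mu, sigma = E.mean(), E.std()
    pos = E > mu + sigma
    neg = E < mu - sigma
    mask = pos | neg                 # True where the label is trusted
    return pos.astype(float), mask

def dice_loss(pred, target, mask, eps=1e-6):
    """Soft Dice loss restricted to trusted (unmasked) pixels."""
    p, t = pred[mask], target[mask]
    return 1.0 - (2.0 * np.sum(p * t) + eps) / (np.sum(p) + np.sum(t) + eps)

def distill_loss(pred_edges, E, image, recon):
    """L = ||I - I_hat||^2 + Dice(E, E_hat), mirroring the training
    objective; uncertain pixels contribute nothing to the edge term."""
    target, mask = pseudo_labels(E)
    return np.mean((image - recon) ** 2) + dice_loss(pred_edges, target, mask)
```

A perfect edge prediction with perfect reconstruction drives the loss to zero, which is a quick sanity check that the masking and the two terms compose as intended.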
Key Experimental Results¶
Main Results¶
Unsupervised instance segmentation (AP\(^{mk}\), VOC / COCO 2014 / COCO 2017):
| Method | VOC AP | COCO 2014 AP | COCO 2017 AP |
|---|---|---|---|
| MaskCut* | 5.8 | 3.0 | 2.3 |
| + TRACE | 9.7 | 7.9 | 7.5 |
| ProMerge* | 5.0 | 3.1 | 2.5 |
| + TRACE | 9.4 | 8.2 | 7.8 |
| CutLER* | 11.2 | 8.9 | 8.7 |
| + CutS3D (depth) | - | 10.9 | 10.7 |
| + TRACE | 14.8 | 13.1 | 12.8 |
Weakly supervised panoptic segmentation (PQ, VOC 2012 and COCO):
| Method | Supervision | VOC PQ | COCO PQ |
|---|---|---|---|
| Mask2Former* | Full mask | 73.6 | 51.9 |
| EPLD | Point | 56.6 | 34.2 |
| EPLD (Swin-L) | Point | 68.5 | 41.0 |
| DHR+TRACE | Tag | 56.9 | 32.8 |
| DHR+TRACE (Swin-L) | Tag | 69.8 | 43.1 |
Ablation Study¶
Component ablation (COCO 2014, ProMerge baseline, AP\(^{mk}\)):
| Configuration | AP\(^{mk}\) | Notes |
|---|---|---|
| Baseline | 3.1 | No TRACE |
| + ABDiv (semantic step) | 3.2 | ABDiv at semantic timestep is nearly ineffective |
| + IEP + ABDiv | 4.8 | IEP localizes the correct timestep → effective |
| + IEP + ABDiv + Distillation | 8.2 | Distillation completes broken edges ↑↑ |
Diffusion vs. non-diffusion backbone comparison:
| Backbone | Type | Params | AP\(^{mk}\) |
|---|---|---|---|
| DINOv2-G | Non-diffusion | 1.1B | 2.6 |
| Qwen2.5-VL | Non-diffusion | 72B | 4.1 |
| PixArt-α | Diffusion | 0.6B | 7.1 |
| SD3.5-L | Diffusion | 8.1B | 8.2 |
| FLUX.1 | Diffusion | 12B | 8.3 |
Key Findings¶
- Unique advantage of diffusion models: PixArt-α at 0.6B (AP\(^{mk}\) 7.1) substantially outperforms Qwen2.5-VL at 72B (4.1), confirming that instance edges are a prior unique to generative models.
- Distillation improves quality, not just speed: Inference time drops from 3.7 s to 45 ms, and edges become more continuous and complete.
- Tag supervision surpasses point supervision: DHR+TRACE (tag only) achieves PQ 69.8 on VOC, exceeding EPLD (point supervision) at 68.5.
- Conventional edge detectors are entirely inadequate: Canny achieves only 1.2 AP\(^{mk}\) vs. TRACE's 9.4—because conventional detectors respond to intensity gradients, not instance boundaries.
Highlights & Insights¶
- Discovery of the Instance Emergence Point during denoising: The staged transition of self-attention from noise → instance → semantics is a previously unreported observation.
- Non-parametric edge extraction: ABDiv requires no training or labels, relying purely on the geometric properties of attention distributions.
- Model-agnostic: The optimal timestep identified by IEP is highly consistent across five different diffusion backbones.
- Plug-and-play composability: TRACE integrates seamlessly with MaskCut, CutLER, ProMerge, DHR, and other pipelines.
Limitations & Future Work¶
- The method depends on the self-attention of diffusion models and is inapplicable to non-diffusion architectures, as confirmed experimentally.
- IEP search still requires multi-step forward passes (~3 s/image), though this is no longer needed after distillation.
- Distillation is performed only on SD3.5-L; different backbones may require separate distillation.
- Edge quality on small objects and heavily occluded scenes remains to be evaluated.
- The current method applies only to static images; temporal consistency in video settings has not been explored.
Related Work & Insights¶
- vs. MaskCut/CutLER: These methods rely on DINO feature clustering and cannot separate adjacent same-class objects; TRACE's instance edges directly address this core failure mode.
- vs. CutS3D: CutS3D uses depth estimation to assist instance separation but fails when adjacent objects sit at similar depths; TRACE requires no depth and outperforms it by 2.2 AP\(^{mk}\) on COCO 2014.
- vs. DiffCut/DiffSeg: These methods apply diffusion attention to semantic segmentation using fixed timesteps; TRACE demonstrates that IEP-identified timesteps are substantially more effective than fixed ones.
- Insight: The structural priors encoded in generative models far exceed prior expectations—self-attention not only "knows where to look" but also "knows where the boundaries are."
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The discovery of IEP+ABDiv is highly original; framing diffusion models as instance edge annotators represents a genuinely new perspective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on both unsupervised and weakly supervised tracks, with 10 backbone comparisons, comprehensive ablations, and multi-benchmark validation.
- Writing Quality: ⭐⭐⭐⭐⭐ The narrative is fluent with excellent figures; every design choice is supported by quantitative evidence.
- Value: ⭐⭐⭐⭐⭐ A paradigm-shifting contribution to unsupervised and weakly supervised segmentation; the result of tag supervision surpassing point supervision carries broad implications.