TRACE: Your Diffusion Model is Secretly an Instance Edge Detector¶
Conference: ICLR 2026 | arXiv: 2503.07982 | Code: Project Page | Area: Instance Segmentation / Panoptic Segmentation | Keywords: Diffusion Models, Instance Edges, Self-Attention, IEP, Unsupervised Segmentation
TL;DR¶
This work identifies an "Instance Emergence Point" (IEP) in the denoising trajectory of text-to-image diffusion models—a timestep at which self-attention exhibits sharp divergence changes at object boundaries. TRACE combines IEP localization, ABDiv edge extraction, and single-step distillation to generate high-quality instance edges with an 81× inference speedup and no instance-level annotations, improving unsupervised instance segmentation by +5.1 AP and, with only tag-level supervision, surpassing point-supervised panoptic segmentation by +1.7 PQ.
Background & Motivation¶
Background: Instance and panoptic segmentation have long relied on dense annotations (masks/boxes/points), which are costly and suffer from inter-annotator inconsistency. Unsupervised approaches (e.g., MaskCut) cluster ViT semantic features, but ViTs are optimized for cross-image semantic similarity rather than intra-image instance separation, frequently merging adjacent objects of the same class or fragmenting a single instance. Weakly supervised methods require at least point annotations to distinguish instances.
Limitations of Prior Work: (1) Unsupervised methods depend on self-supervised ViT features that are insufficient at the instance level—merging neighboring same-class objects is a fundamental failure mode; (2) depth-assisted approaches (CutS3D) fail on adjacent objects at similar depths; (3) tag-level weak supervision already reaches ~99% of fully supervised accuracy in semantic segmentation on VOC, yet bridging from semantic to panoptic segmentation still requires point or box annotations.
Key Challenge: Semantic features excel at "knowing what" but struggle at "telling who from whom"—instance separation requires a fundamentally different signal source.
Goal: To identify an annotation-free, instance-level signal that complements semantic features for instance separation.
Key Insight: Diffusion models progressively evolve from noise → instance structure → semantic content during denoising—at specific timesteps, self-attention transiently but clearly encodes instance boundaries.
Core Idea: The self-attention of diffusion models is a hidden instance edge annotator—sharp divergence changes in attention distributions across boundaries constitute the instance boundary signal.
Method¶
Overall Architecture¶
TRACE consists of three stages: (1) IEP (Instance Emergence Point): the KL divergence between consecutive self-attention maps is computed along the denoising trajectory, and the timestep \(t^*\) corresponding to the divergence peak is identified to obtain instance-aware self-attention \(SA_{\text{inst}}\); (2) ABDiv (Attention Boundary Divergence): relative attention divergence between spatial neighbors is computed on \(SA_{\text{inst}}\) to produce a pseudo-edge map \(E\); (3) Single-step self-distillation: the diffusion backbone is fine-tuned with LoRA and a lightweight edge decoder \(\mathcal{G}_\phi\) is trained, enabling edge prediction in a single forward pass at \(t=0\) with an 81× inference speedup. The generated edges are integrated into downstream segmentation via Background-Guided Propagation.
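As a rough sketch of stage 1, IEP localization amounts to scanning KL divergence between self-attention maps at consecutive sampled timesteps and taking the peak. The NumPy sketch below is illustrative, not the paper's code: the sampling interface (`attn_by_t`), function names, and the row-stochastic `(HW, HW)` attention layout are our assumptions.

```python
import numpy as np

def kl_rows(p, q, eps=1e-8):
    """Mean row-wise KL divergence between two stacks of attention
    distributions (each row of p and q is a probability distribution)."""
    p = p + eps
    q = q + eps
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

def locate_iep(attn_by_t):
    """attn_by_t: {timestep: (HW, HW) row-stochastic self-attention map},
    sampled along the denoising trajectory (e.g. every 100 steps).
    Returns t* = argmax_t KL(SA(X_t_prev) || SA(X_t))."""
    ts = sorted(attn_by_t)  # indexed along the sampling trajectory
    scores = {t: kl_rows(attn_by_t[t_prev], attn_by_t[t])
              for t_prev, t in zip(ts, ts[1:])}
    return max(scores, key=scores.get)
```

In this toy setup, the attention map stays uniform for two steps and then snaps to a peaked structure, so the divergence peak (and hence `locate_iep`'s answer) falls on the transition step.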
Key Designs¶
- Instance Emergence Point (IEP):
- Function: Automatically locates the timestep at which instance structure is most salient during denoising.
- Mechanism: KL divergence between self-attention maps at adjacent timesteps is computed along the denoising trajectory: \(t^* = \arg\max_t D_{\text{KL}}(SA(X_{t_{\text{prev}}}) \| SA(X_t))\). Early in denoising, attention is nearly random; in the middle stage, instance boundaries emerge (divergence peak); later, attention stabilizes into semantic content. A fixed step size of 100 yields stable results.
- Design Motivation: The denoising trajectory transitions from noise → instance structure → semantic content—IEP pinpoints the inflection point of this transition. KL divergence is more sensitive than L2/L1 to subtle differences between probability distributions, and outperforms an L2 criterion by 5.6 AP\(^{mk}\) (9.4 vs. 3.8).
- Attention Boundary Divergence (ABDiv):
- Function: Converts instance-aware self-attention maps into edge maps.
- Mechanism: For each pixel \((i,j)\), the sum of KL divergences between opposing neighbors in the four cardinal directions is computed: \(\text{ABDiv}(SA)_{i,j} = D_{\text{KL}}(SA_{i+1,j} \| SA_{i-1,j}) + D_{\text{KL}}(SA_{i,j+1} \| SA_{i,j-1})\). Neighbors within the same instance have similar attention distributions (low divergence); neighbors across instance boundaries exhibit abrupt divergence spikes.
- Design Motivation: Non-parametric—no training or clustering is required; boundary signals are extracted directly from the geometric properties of attention distributions.
- Single-step Self-distillation Edge Decoder:
- Function: Compresses the multi-step IEP+ABDiv computation into a single inference step.
- Mechanism: The ABDiv pseudo-edge map \(E\) is used as supervision (pixels \(> \mu+\sigma\) as positive, \(< \mu-\sigma\) as negative, intermediate pixels masked out). The diffusion backbone is fine-tuned with LoRA at \(t=0\), and a lightweight decoder \(\mathcal{G}_\phi\) is trained with the loss \(\mathcal{L} = \|I-\hat{I}\|^2 + \text{DiceLoss}(E, \hat{E})\). The reconstruction loss stabilizes training and completes broken edges.
- Design Motivation: IEP+ABDiv requires approximately 3.7 seconds per image; after distillation, inference takes only 45 ms/image (81× speedup), and the resulting edges are more continuous and complete.
Loss & Training¶
Distillation training: DiceLoss (edge prediction) + L2 reconstruction loss, with uncertain pixels excluded. Training is performed exclusively on the COCO training set using LoRA fine-tuning of the diffusion backbone. Inference requires a single forward pass; the default backbone is SD3.5-L.
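The μ±σ pseudo-labeling and the Dice + reconstruction objective can be sketched as below (NumPy, for illustration only; in practice the edge prediction and reconstruction come from the LoRA-tuned backbone and the decoder \(\mathcal{G}_\phi\), and the function names here are ours):

```python
import numpy as np

def pseudo_labels(E):
    """Turn an ABDiv pseudo-edge map E into distillation targets:
    1 above mu+sigma, 0 below mu-sigma, uncertain pixels masked out."""
    mu, sigma = E.mean(), E.std()
    pos = E > mu + sigma
    neg = E < mu - sigma
    mask = pos | neg                 # True where the label is trusted
    return pos.astype(float), mask

def dice_loss(pred, target, mask, eps=1e-6):
    """Soft Dice loss restricted to trusted (unmasked) pixels."""
    p, t = pred[mask], target[mask]
    return 1.0 - (2.0 * np.sum(p * t) + eps) / (np.sum(p) + np.sum(t) + eps)

def distill_loss(pred_edges, E, image, recon):
    """L = ||I - I_hat||^2 + Dice(E, E_hat), mirroring the training
    objective; uncertain pixels contribute nothing to the edge term."""
    target, mask = pseudo_labels(E)
    return np.mean((image - recon) ** 2) + dice_loss(pred_edges, target, mask)
```

A perfect edge prediction with perfect reconstruction drives the loss to zero, which is a quick sanity check that the masking and the two terms compose as intended.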
Key Experimental Results¶
Main Results¶
Unsupervised instance segmentation (AP\(^{mk}\), VOC / COCO 2014 / COCO 2017):
| Method | VOC AP | COCO 2014 AP | COCO 2017 AP |
|---|---|---|---|
| MaskCut* | 5.8 | 3.0 | 2.3 |
| + TRACE | 9.7 | 7.9 | 7.5 |
| ProMerge* | 5.0 | 3.1 | 2.5 |
| + TRACE | 9.4 | 8.2 | 7.8 |
| CutLER* | 11.2 | 8.9 | 8.7 |
| + CutS3D (depth) | - | 10.9 | 10.7 |
| + TRACE | 14.8 | 13.1 | 12.8 |
Weakly supervised panoptic segmentation (PQ, VOC 2012 and COCO):
| Method | Supervision | VOC PQ | COCO PQ |
|---|---|---|---|
| Mask2Former* | Full mask | 73.6 | 51.9 |
| EPLD | Point | 56.6 | 34.2 |
| EPLD (Swin-L) | Point | 68.5 | 41.0 |
| DHR+TRACE | Tag | 56.9 | 32.8 |
| DHR+TRACE (Swin-L) | Tag | 69.8 | 43.1 |
Ablation Study¶
Component ablation (COCO 2014, ProMerge baseline, AP\(^{mk}\)):
| Configuration | AP\(^{mk}\) | Notes |
|---|---|---|
| Baseline | 3.1 | No TRACE |
| + ABDiv (semantic step) | 3.2 | ABDiv at semantic timestep is nearly ineffective |
| + IEP + ABDiv | 4.8 | IEP localizes the correct timestep → effective |
| + IEP + ABDiv + Distillation | 8.2 | Distillation completes broken edges ↑↑ |
Diffusion vs. non-diffusion backbone comparison:
| Backbone | Type | Params | AP\(^{mk}\) |
|---|---|---|---|
| DINOv2-G | Non-diffusion | 1.1B | 2.6 |
| Qwen2.5-VL | Non-diffusion | 72B | 4.1 |
| PixArt-α | Diffusion | 0.6B | 7.1 |
| SD3.5-L | Diffusion | 8.1B | 8.2 |
| FLUX.1 | Diffusion | 12B | 8.3 |
Key Findings¶
- Unique advantage of diffusion models: PixArt-α at 0.6B (AP\(^{mk}\) 7.1) substantially outperforms Qwen2.5-VL at 72B (4.1), confirming that instance edges are a prior unique to generative models.
- Distillation improves quality, not just speed: Inference time drops from 3.7 s to 45 ms, and edges become more continuous and complete.
- Tag supervision surpasses point supervision: DHR+TRACE (tag only) achieves PQ 69.8 on VOC, exceeding EPLD (point supervision) at 68.5.
- Conventional edge detectors are entirely inadequate: Canny achieves only 1.2 AP\(^{mk}\) vs. TRACE's 9.4—because conventional detectors respond to intensity gradients, not instance boundaries.
Highlights & Insights¶
- Discovery of the Instance Emergence Point during denoising: The staged transition of self-attention from noise → instance → semantics is a previously unreported observation.
- Non-parametric edge extraction: ABDiv requires no training or labels, relying purely on the geometric properties of attention distributions.
- Model-agnostic: The optimal timestep identified by IEP is highly consistent across five different diffusion backbones.
- Plug-and-play composability: TRACE integrates seamlessly with MaskCut, CutLER, ProMerge, DHR, and other pipelines.
Limitations & Future Work¶
- The method depends on the self-attention of diffusion models and is inapplicable to non-diffusion architectures, as confirmed experimentally.
- IEP search still requires multi-step forward passes (~3 s/image), though this is no longer needed after distillation.
- Distillation is performed only on SD3.5-L; different backbones may require separate distillation.
- Edge quality on small objects and heavily occluded scenes remains to be evaluated.
- The current method applies only to static images; temporal consistency in video settings has not been explored.
Related Work & Insights¶
- vs. MaskCut/CutLER: These methods rely on DINO feature clustering and cannot separate adjacent same-class objects; TRACE's instance edges directly address this core failure mode.
- vs. CutS3D: CutS3D uses depth estimation to assist instance separation but fails when adjacent objects sit at similar depths; TRACE requires no depth and outperforms it by 2.2 AP\(^{mk}\) on COCO 2014.
- vs. DiffCut/DiffSeg: These methods apply diffusion attention to semantic segmentation using fixed timesteps; TRACE demonstrates that IEP-identified timesteps are substantially more effective than fixed ones.
- Insight: The structural priors encoded in generative models far exceed prior expectations—self-attention not only "knows where to look" but also "knows where the boundaries are."
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The discovery of IEP+ABDiv is highly original; framing diffusion models as instance edge annotators represents a genuinely new perspective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on both unsupervised and weakly supervised tracks, with 10 backbone comparisons, comprehensive ablations, and multi-benchmark validation.
- Writing Quality: ⭐⭐⭐⭐⭐ The narrative is fluent with excellent figures; every design choice is supported by quantitative evidence.
- Value: ⭐⭐⭐⭐⭐ A paradigm-shifting contribution to unsupervised and weakly supervised segmentation; the result of tag supervision surpassing point supervision carries broad implications.