SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation¶
Conference: ICCV 2025 arXiv: 2507.05798 Code: None (mentioned on project page) Area: Image Segmentation Keywords: Panoptic Scene Graph Generation, Open-Vocabulary, Diffusion Models, Spatial Relation Reasoning, Graph Transformer
TL;DR¶
This paper proposes SPADE — a spatial-aware denoising network for open-vocabulary panoptic scene graph generation (PSG). It adapts a pretrained diffusion model into a PSG-specific spatial prior extractor via DDIM inversion-guided calibration, and designs a relational graph Transformer to capture both long-range and local context. SPADE substantially outperforms prior state-of-the-art methods in both closed-set and open-set settings, with particularly strong performance on spatial relation prediction.
Background & Motivation¶
Panoptic scene graph generation (PSG) unifies instance segmentation and relation understanding into subject-predicate-object triplets. While VLM-based open-vocabulary methods have made notable progress, a critical and largely overlooked issue remains:
Spatial Reasoning Deficiency in VLMs: Multiple studies have shown that VLMs such as CLIP, BLIP, and GLIP suffer from an inherent weakness in spatial relation understanding — due to the scarcity of spatial descriptions in their training data — making it difficult for these models to determine relations such as "left of / right of / above / below."
Distance Sensitivity: Through systematic experiments, the authors find that when two objects are far apart (center distance > 1/3 image width), VLM-based models exhibit a sharp drop in spatial relation prediction performance (e.g., OpenPSG R@50 drops from 43.7 to 37.1).
Lack of Contextual Reasoning: Existing methods focus primarily on designing visual prompts to extract VLM knowledge, while neglecting spatial and semantic contextual information among relation pairs.
Limitations of Directly Using Diffusion Models: Although diffusion models possess strong spatial compositional capability, their pretrained knowledge is not optimized for the PSG task, making direct application ineffective.
Core Motivation: Can the spatial knowledge of diffusion models be injected into VLMs without compromising their inherent open-world recognition capability?
Method¶
Overall Architecture¶
SPADE is a two-stage approach:

- Stage 1 (Inversion-guided Calibration): adapts a pretrained diffusion model into a PSG-specific denoising network.
- Stage 2 (Spatial-aware Context Reasoning): generates high-quality relation queries via a relational graph Transformer.
Key Designs¶
- Inversion-guided Calibration:
- Inverse Spatial Prior Extraction: Real images are converted into deterministic noise \(z\) via the DDIM inversion process; a teacher diffusion model (BELM) then performs deterministic sampling conditioned on relation prompts "[subject] is [predicate] [object]..." to obtain cross-attention maps \(A_i\) as spatial priors.
- Implicit Text Encoder: Since no textual description is available at inference time, the text encoder is replaced by a CLIP image encoder with an MLP adapter: \(f_i = \epsilon_\phi(x_i, \mathrm{MLP} \circ \mathrm{CLIP_{image}}(x_i))\).
- LoRA Calibration: Only the low-rank matrices of the UNet cross-attention layers are updated, \(\Delta\mathbf{W}_k = \mathbf{B} \times \mathbf{D}\) (\(\mathbf{B} \in \mathbb{R}^{m_{in} \times r}\), \(\mathbf{D} \in \mathbb{R}^{r \times m_{out}}\)), preserving pretrained knowledge.
- Calibration Loss: \(\mathcal{L}_{cal} = \frac{1}{N}\sum_{i=1}^{N}(\lambda\|A_i - A_i'\|_1)\), aligning the cross-attention maps computed on real images with those derived from the inversion process.
- Design Motivation: The DDIM inversion process inherently preserves the spatial structure of the input image; LoRA fine-tuning injects PSG-specific spatial knowledge while maximally retaining the diffusion model's world knowledge.
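The LoRA update and the calibration loss above can be sketched in a few lines; the shapes, the rank `r`, the zero initialisation of `B`, and all variable names are illustrative assumptions, not taken from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
m_in, m_out, r = 64, 64, 4              # illustrative dimensions; r is the LoRA rank

W_k = rng.normal(size=(m_in, m_out))    # frozen pretrained cross-attention key weight
B = np.zeros((m_in, r))                 # LoRA down-projection (trainable; zero init assumed)
D = rng.normal(size=(r, m_out)) * 0.01  # LoRA up-projection (trainable)

def key_projection(x):
    # Calibrated projection: frozen weight plus low-rank update Delta W_k = B @ D.
    return x @ (W_k + B @ D)

def calibration_loss(A, A_prime, lam=1.0):
    # L_cal: lambda-weighted L1 alignment between cross-attention maps computed
    # on real images (A) and those derived from DDIM inversion (A'), averaged
    # over the relation pairs in the batch.
    return lam * float(np.mean(np.abs(A - A_prime)))
```

With `B` initialised to zero, `B @ D` vanishes at the start of calibration, so training begins exactly from the pretrained projection and only gradually injects PSG-specific spatial knowledge.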
- Spatial-aware Relational Graph Transformer (RGT):
- Spatial-Semantic Graph Construction: A graph \(G \in \mathbb{R}^{N \times N}\) is constructed based on spatial distance between instance masks (adjacent = 1) and feature cosine similarity (above threshold = 1).
- Long-range Context Learning (\(\mathrm{RGT_g}\)): Self-attention is computed separately over neighbors \(\mathcal{P}(r)^+\) and non-neighbors \(\mathcal{P}(r)^-\), then fused: \(\mathbf{q}_r \leftarrow \mathbf{q}_r + \mathrm{RGT}(\mathbf{q}_r)_{\mathcal{P}^+} + \mathrm{RGT}(\mathbf{q}_r)_{\mathcal{P}^-}\), followed by MLP-based feature fusion.
- Local Context Learning (\(\mathrm{RGT_l}\)): A GCN aggregates local neighborhood information over graph \(G\): \(\hat{\mathbf{q}}_r = \mathrm{GCN}(G, \mathbf{q}_r'; \mathbf{W}_l)\).
- Relation Query Construction: Relation queries \(\Psi_r\) are constructed by selecting similar object pairs based on cosine distance; an auxiliary loss \(\mathcal{L}_{rqc}\) optimizes selection quality.
- Design Motivation: Modeling only connected object pairs is insufficient — non-adjacent objects may also participate in relations (e.g., "a person far away looking at an airplane"). The dual long-range and local reasoning covers relational contexts at different spatial scales.
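A minimal sketch of the graph construction and the local GCN step, assuming edges come from either the spatial or the semantic criterion (the thresholds and the OR-combination are illustrative guesses, as is the use of mask centers for spatial distance):

```python
import numpy as np

def build_graph(centers, feats, dist_thresh=0.33, sim_thresh=0.5):
    # Spatial-semantic graph G in {0,1}^{N x N}: connect a pair when the
    # instances are spatially close OR their feature cosine similarity
    # exceeds a threshold (the combination rule is an assumption).
    N = len(centers)
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    G = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            close = np.linalg.norm(centers[i] - centers[j]) < dist_thresh
            similar = f[i] @ f[j] > sim_thresh
            G[i, j] = float(close or similar)
    return G

def gcn_layer(G, q, W):
    # One local-context step over G: self-loops, symmetric degree
    # normalisation, linear map W, ReLU.
    A = G + np.eye(len(G))
    d = A.sum(axis=1)
    A_hat = A / np.sqrt(np.outer(d, d))
    return np.maximum(A_hat @ q @ W, 0.0)
```

The long-range branch would run separate self-attention over the neighbor set \(\mathcal{P}(r)^+\) and the non-neighbor set \(\mathcal{P}(r)^-\) of this graph before the GCN aggregates local context.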
- Open-vocabulary Relation Prediction:
- CLIP text encoders encode category/predicate templates, and classification is performed via feature similarity.
- Dual-path prediction fusion: diffusion feature scores \(\mathbf{P}_o\) and CLIP-pooled feature scores \(\mathbf{P}_o'\) are combined via geometric mean: \(\mathbf{P}^o_{\text{final}} = \mathbf{P}_o^\alpha \cdot \mathbf{P}_o'^{(1-\alpha)}\).
- Relation prediction follows the same scheme, using joint masks of subject and object for pooling.
- Design Motivation: Diffusion models excel at spatial reasoning but have limited open-vocabulary capability, while CLIP excels at open-world recognition but is weak in spatial reasoning — complementary fusion exploits the advantages of each.
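The dual-path fusion is just the stated geometric mean; a one-function sketch (argument names are illustrative, \(\alpha=0.34\) is the paper's reported value):

```python
import numpy as np

def fuse_scores(P_diff, P_clip, alpha=0.34):
    # P_final = P_o^alpha * P_o'^(1 - alpha): geometric-mean fusion of the
    # diffusion-branch scores (strong spatial reasoning) and the CLIP-branch
    # scores (strong open-world recognition).
    return P_diff ** alpha * P_clip ** (1.0 - alpha)
```

Setting `alpha=1.0` recovers the pure diffusion branch and `alpha=0.0` the pure CLIP branch, which is the comparison the open-vocabulary ablation reports.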
Loss & Training¶
- Stage 1 (calibration): Only \(\mathcal{L}_{cal}\) (L1 alignment of cross-attention maps); MLP adapter and LoRA parameters are updated.
- Stage 2: \(\mathcal{L} = \mathcal{L}_{\text{rel}} + \lambda_{\text{rqc}}\mathcal{L}_{\text{rqc}} + \lambda_{\text{mask}}\mathcal{L}_{\text{mask}}\)
- Hyperparameters: \(\alpha=0.34\), \(\eta=0.65\), \(\lambda_{rqc}=0.6\), \(\lambda_{mask}=1\)
- Diffusion model and CLIP parameters are frozen during Stage 2.
- Total training: 80 epochs, with learning rate decay at epoch 60.
- Training hardware: 4× A100 GPUs.
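With the reported hyperparameters, the Stage-2 objective reduces to a weighted sum (a trivial sketch; how the individual terms \(\mathcal{L}_{\text{rel}}\), \(\mathcal{L}_{rqc}\), \(\mathcal{L}_{\text{mask}}\) are computed is not specified here):

```python
def stage2_loss(l_rel, l_rqc, l_mask, lam_rqc=0.6, lam_mask=1.0):
    # L = L_rel + lambda_rqc * L_rqc + lambda_mask * L_mask,
    # with lambda_rqc = 0.6 and lambda_mask = 1 as reported.
    return l_rel + lam_rqc * l_rqc + lam_mask * l_mask
```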
Key Experimental Results¶
Main Results¶
| Dataset / Setting | Metric | SPADE | OpenPSG | OvSGTR | Gain |
|---|---|---|---|---|---|
| PSG Closed-set | R/mR@100 | 54.3/51.7 | 49.3/47.5 | 41.4/28.3 | +5.0/+4.2 |
| PSG Open-set (OvR) | R/mR@50 | 26.7/23.3 | 21.2/19.8 | 19.3/12.4 | +5.5/+3.5 |
| PSG Open-set (OvR) | R/mR@100 | 31.8/25.8 | 25.1/21.4 | 22.8/14.0 | +6.7/+4.4 |
| VG Open-set (OvR) | R/mR@100 | 29.9/13.9 | 25.7/12.1 | 26.7/5.7 | +4.2/+1.8 |
| PSG Open-set (OvD+R) | R@50 | 22.7 | 11.4 | 19.1 | +3.6 |
Ablation Study¶
| Configuration | R/mR@50 (PSG Closed-set) | Note |
|---|---|---|
| w/o RGT components | 30.5/26.3 | Baseline |
| + Long-range Neighbor Learning (LCNL) | 35.6/32.2 | Neighbor context yields significant improvement |
| + Long-range Non-neighbor (LCNNL) | 37.1/34.4 | Non-neighbor relations also contribute |
| + Local Context Learning (LCL) | 40.3/35.6 | GCN local reasoning complements long-range |
| + All + \(\mathcal{L}_{rqc}\) | 45.1/41.2 | Auxiliary loss further improves performance |
| Calibration Strategy | OvR R@50 | OvD+R R@50 | Note |
|---|---|---|---|
| No calibration (pretrained UNet directly) | 15.3 | 10.1 | Pretrained knowledge mismatched with PSG |
| No LoRA (full fine-tuning) | 18.8 | 12.7 | Destroys pretrained knowledge |
| No inversion (random noise) | 21.0 | 15.9 | Lacks spatial structure prior |
| Full method | 26.7 | 22.7 | Inversion + LoRA is optimal |
Key Findings¶
- Systematic Analysis of Spatial Relations: SPADE achieves R@50 of 42.3 on distant relations (DR), versus 37.1 for OpenPSG; its near-to-distant performance drop is 4.2, compared with 6.6 for OpenPSG, demonstrating that SPADE effectively improves long-range spatial reasoning.
- Open-vocabulary Module Ablation: Fusing diffusion features with pooled CLIP features outperforms either branch alone (26.7 vs. 12.5 vs. 21.4).
- Inversion Process is Critical: Random noise sampling (21.0) is substantially inferior to deterministic inversion (26.7), as the inversion process preserves the spatial layout of the original image.
Highlights & Insights¶
- Depth of Problem Identification: The paper systematically uncovers the spatial reasoning deficiency of VLMs and quantifies the effect of object distance on performance.
- Creative Use of the Diffusion Inversion Process: DDIM inversion inherently preserves spatial structure — this property is elegantly repurposed as a spatial prior for PSG.
- Dual Long-range and Local Context Reasoning: Separate modeling of neighbor and non-neighbor relations combined with GCN local aggregation comprehensively covers relational contexts at different spatial scales.
- Complementary Diffusion + Discriminative Dual-path Classification: Diffusion models handle spatial reasoning while CLIP handles open-world recognition, with complementary fusion leveraging the strengths of each.
- Necessity of LoRA Calibration: Full fine-tuning underperforms LoRA, validating the importance of preserving pretrained knowledge.
Limitations & Future Work¶
- The two-stage training pipeline is relatively complex, requiring independent UNet calibration before RGT training.
- The computational cost of diffusion model inference is non-trivial; inference speed is not reported.
- Spatial-semantic graph construction relies on fixed thresholds, lacking flexibility.
- Evaluation is limited to two datasets (PSG and VG); broader scene graph generation benchmarks are not explored.
- The relation prompt design ("[subject] is [predicate] [object]") may constrain the types of relations that can be captured.
Related Work & Insights¶
- Compared to VLM-based methods such as OpenPSG and OvSGTR, SPADE is the first to introduce diffusion models to enhance spatial reasoning in PSG.
- The LoRA calibration strategy (fine-tuning only low-rank matrices of cross-attention layers) is generalizable to other downstream tasks that require adapting diffusion models.
- The idea of separately reasoning over neighbors and non-neighbors in the relational graph Transformer can be applied to other structured prediction tasks such as HOI detection.
- Unlike DiffusionSG and related work, SPADE leverages cross-attention maps from the inversion process rather than the generative capability of diffusion models.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Identifies VLM spatial reasoning deficiencies and innovatively exploits the diffusion inversion process to supply spatial priors.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers closed-set, open-set, spatial relation analysis, and multi-dimensional ablations; efficiency analysis is absent.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated, methodology is systematically described, and figures/tables are informative.
- Value: ⭐⭐⭐⭐ Reveals a core limitation of VLMs in spatial reasoning and establishes a new paradigm for diffusion model + VLM integration.