APT: Towards Universal Scene Graph Generation via Plug-in Adaptive Prompt Tuning

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=IZWJhdK2o7
Code: https://github.com/CGCL-codes/APT
Area: Scene Graph Generation / Visual Relationship Detection / Prompt Tuning
Keywords: Scene Graph Generation, Prompt Tuning, Semantic Representation, Open-Vocabulary, Plug-in Module

TL;DR

APT replaces the long-standing "frozen word vector semantic prior" in Scene Graph Generation (SGG) with a set of lightweight learnable prompts. It dynamically modulates static semantic features into representations dependent on visual context. As a plug-and-play module, it can be integrated into any one-stage, two-stage, or open-vocabulary SGG framework, achieving comprehensive performance gains with <0.5M parameters and shorter training times.

Background & Motivation

  • Background: Scene Graph Generation (SGG) involves representing images in an "object-predicate-object" structural graph. For years, the field has been dominated by two paths: two-stage methods that detect objects before predicting relationships (relying on strong detector features but suffering from fragmented context), and one-stage methods that utilize end-to-end joint modeling (suffering from high computational costs and coarse relationship granularity). Both categories share a common habit: using static, fixed semantic embeddings exported from pre-trained language models like GloVe or BERT as semantic priors.
  • Limitations of Prior Work: While these frozen word vectors are effective in NLP, they are poorly matched to SGG, which depends on context sensitivity, fine-grained relationships, and asymmetric subject-object roles. A "person" embedding stays the same whether the person is riding a horse or holding a phone, which makes it hard to separate near-synonymous predicates such as "standing on" and "walking on". The authors use t-SNE visualization to show that static semantic spaces collapse all "person" instances into a single point, whereas the visual feature space naturally clusters by relationship context (riding, walking, holding), indicating a severe mismatch.
  • Key Challenge: The community has been preoccupied with the "one-stage vs. two-stage" architectural debate, overlooking the fact that the real bottleneck lies in the representation paradigm. Simply replacing one frozen model with a stronger one (e.g., GloVe → BERT → CLIP-text) merely enriches the internal sub-structure of the embedding space without addressing the fundamental misalignment between semantics and visual relationship context. Diagnostic tests show that silhouette scores and mutual information improve with stronger models, yet the embeddings remain mismatched with fine-grained SGG requirements.
  • Goal: To move beyond architectural competition by providing a lightweight, universal, and pluggable mechanism that injects adaptive semantics into any SGG framework while maintaining minimal parameter and training overhead.
  • Core Idea: [Paradigm Shift] Use a set of lightweight learnable prompts as "conditional adaptors." Without backpropagating to the pre-trained backbone, these prompts modulate frozen semantic features into dynamic representations that vary with visual context and relationship roles. The authors liken this process to a "modem" in communications, where a prompt \(P\) carries visual context information to modulate the raw semantic signal.

Method

Overall Architecture

The core of APT is a set of lightweight learnable prompts that adapt frozen pre-trained semantic embeddings into context-aware task features. It is designed as a universal plugin: in two-stage methods, it acts separately on the detection and relationship stages; in one-stage methods, it merges into a single relationship prompt; and in open-vocabulary settings, an additional Compositional Generalization Prompter (CGP) is attached. The pre-trained semantic backbone remains frozen, with only the prompt parameters, visual projectors, and lightweight MLP fusion networks being learnable.

flowchart TD
    A[Frozen Semantic Embeddings e_static<br/>GloVe/BERT/CLIP-text] --> F[Fusion Network f_θ]
    P[Learnable Prompts P_d/P_r/P_ur] --> F
    V[Visual Features v] --> Phi[Visual Projector φ] --> F
    F --> E[Adaptive Representation ẽ]
    E --> DET[Detection Head / Relationship Predictor]
    subgraph OV[Open-Vocabulary Branch CGP]
        RCG[Relationship Context Gating] --> BPS[Base Prompt Synthesis] --> FRF[Feature Refinement Fusion]
    end
    A -.-> OV
    V -.-> OV
    OV --> E

Key Designs

1. Unified Plug-in Prompt: "Modulating" Frozen Embeddings into Dynamic Features. The operational principle of APT can be summarized by a general formula—for any semantic concept \(c\), a lightweight learnable prompt \(P(c)\) re-modulates its frozen embedding \(e_{static}(c)\) under the current visual context: \(\tilde{e}(c) = f_\theta\big(A(P(c), e_{static}(c), \phi(v))\big)\), where \(A(\cdot)\) is an aggregation function that pools the prompt sequence into a single vector, \(\phi(v)\) is a projector encoding the visual context, and \(f_\theta\) is a small fusion network generating the final adaptive representation. Crucially, only \(P\), \(\phi\), and \(f_\theta\) are learnable, while the pre-trained semantic backbone remains frozen, ensuring extreme parameter efficiency and avoiding catastrophic forgetting. The authors motivate this from an Information Bottleneck perspective: the goal is for the learned \(\tilde{e}\) to retain maximal object identity information relative to visual context \(v\) and target \(y\), while compressing redundant semantics irrelevant to the current relationship. The objective is formulated as \(\max\ I(\tilde{e}; y) - \beta I(\tilde{e}; e_{static} \mid v, y)\). In specific stages, this differentiates into three types of prompts: two-stage methods use Detection Prompts \(P_d(c)\in\mathbb{R}^{L_d\times D}\) (generating adaptive representations for each object class fed into the detection head) and Relationship Prompts \(P_r(r)\in\mathbb{R}^{L_r\times D}\) (capturing nuances in subject-object interactions for predicate classes). One-stage methods, lacking an independent detection stage, use a single Unified Relationship Prompt \(P_{ur}\) to modulate semantic queries or label embeddings before they enter the transformer decoder for cross-attention with visual features.
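The modulation formula above can be made concrete with a small standard-library Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the dimensions, the mean-pool-then-concatenate form of \(A(\cdot)\), the two-layer tanh fusion network, and all names (`AdaptivePrompt`, `rand_matrix`, `v_riding`, `v_holding`) are hypothetical, and random weights stand in for trained parameters.

```python
import math
import random

random.seed(0)
D, L, D_VIS = 4, 3, 6  # embedding width, prompt length, visual width (all hypothetical)


def rand_matrix(rows, cols, scale=0.5):
    """Random matrix standing in for trained weights."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class AdaptivePrompt:
    """Computes e_tilde(c) = f_theta(A(P(c), e_static(c), phi(v))) for one concept c."""

    def __init__(self):
        self.P = rand_matrix(L, D)          # learnable prompt tokens P(c)
        self.W_phi = rand_matrix(D, D_VIS)  # learnable visual projector phi
        self.W1 = rand_matrix(D, 3 * D)     # learnable fusion network f_theta, layer 1
        self.W2 = rand_matrix(D, D)         # learnable fusion network f_theta, layer 2

    def aggregate(self, e_static, v_proj):
        # A(.): mean-pool the prompt sequence into one vector, then concatenate
        # it with the frozen embedding and the projected visual context (assumed form).
        pooled = [sum(tok[d] for tok in self.P) / L for d in range(D)]
        return pooled + e_static + v_proj

    def __call__(self, e_static, v):
        v_proj = matvec(self.W_phi, v)
        z = self.aggregate(e_static, v_proj)
        hidden = [math.tanh(a) for a in matvec(self.W1, z)]
        return matvec(self.W2, hidden)


e_person = [0.2, -0.1, 0.4, 0.3]           # frozen embedding; never updated
v_riding = [1.0, 0.0, 0.5, 0.0, 0.2, 0.1]  # made-up visual context features
v_holding = [0.0, 1.0, 0.0, 0.4, 0.1, 0.9]

apt = AdaptivePrompt()
out_riding = apt(e_person, v_riding)
out_holding = apt(e_person, v_holding)
print(out_riding != out_holding)  # same frozen input, different adaptive outputs
```

The point of the sketch is only the data flow: the frozen "person" vector is read but never modified, while the output varies with the visual context because \(\phi(v)\) enters the fusion input.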

2. Compositional Generalization Prompter (CGP): Synthesizing Prompts for Unseen Concepts. Open-vocabulary settings require models to generalize to object/predicate combinations not seen during training, where fixed prompts are insufficient. CGP uses a three-module pipeline ("Conditioning-Synthesis-Refinement") to generate adaptive semantics dynamically. First, the Relationship Context Gating (RCG) concatenates visual evidence and initial semantic clues and passes them through an MLP to generate role-aware gating weights \(w_s = \sigma(\text{MLP}_{gate}(\text{Concat}(v_s, e_{static}(s))))\), determining which prompt bases to activate for each entity. Next, Base Prompt Synthesis (BPS) maintains a set of learnable base prompts \(B\in\mathbb{R}^{N\times L_{ov}\times D}\) as a repository of relationship concepts. It forms a weighted combination of these bases using the gating weights, \(P_{cgp}(s)=\sum_{i=1}^{N} w_s^i \cdot B_i\), followed by token pooling over the \(L_{ov}\) prompt tokens with normalization to obtain a compact prompt \(\bar{p}=\text{LayerNorm}(\frac{1}{L_{ov}}\sum_t P_{cgp}(s)_t)\). Because the gating weights are continuous, a finite set of bases can yield a different customized prompt for each entity and context, which is what enables compositional generalization. Finally, Feature Refinement and Fusion (FRF) concatenates the synthesized prompt, frozen semantics, and projected visual features and passes them through a fusion MLP, \(\tilde{e}_{ov}(s)=f_{\theta_{frf}}(\text{Concat}(P_{cgp}(s), e_{static}(s), \phi_v(v_s)))\), to produce context-sensitive representations for relationship reasoning even on unseen concepts. CGP is also a plugin that can enhance standard relationship prompts in both two-stage and one-stage frameworks.
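The three CGP stages can be traced in a short standard-library Python sketch. This is a hedged illustration rather than the released code: single linear layers stand in for \(\text{MLP}_{gate}\) and \(f_{\theta_{frf}}\), the visual feature is assumed to already have width \(D\), the pooled prompt \(\bar{p}\) (not the full token sequence) is what gets fed to the fusion step, and every name and size below is hypothetical.

```python
import math
import random

random.seed(0)
N, L_OV, D = 4, 3, 4  # number of bases, tokens per base, width (all hypothetical)


def rand_matrix(rows, cols, scale=0.5):
    """Random matrix standing in for trained weights."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


B = [rand_matrix(L_OV, D) for _ in range(N)]  # learnable base prompts, shape (N, L_ov, D)
W_gate = rand_matrix(N, 2 * D)                # linear stand-in for MLP_gate
W_phi = rand_matrix(D, D)                     # visual projector phi_v
W_frf = rand_matrix(D, 3 * D)                 # linear stand-in for the fusion MLP


def cgp(v_s, e_static):
    # 1) RCG: one gate per base from visual evidence plus the frozen embedding.
    w = [sigmoid(a) for a in matvec(W_gate, v_s + e_static)]
    # 2) BPS: gate-weighted sum of bases, then mean over tokens and LayerNorm.
    P = [[sum(w[i] * B[i][t][d] for i in range(N)) for d in range(D)]
         for t in range(L_OV)]
    p_bar = layer_norm([sum(P[t][d] for t in range(L_OV)) / L_OV for d in range(D)])
    # 3) FRF: fuse the compact prompt, frozen semantics, and projected visual features.
    fused = matvec(W_frf, p_bar + e_static + matvec(W_phi, v_s))
    return w, p_bar, [math.tanh(a) for a in fused]


# Made-up inputs for an entity whose class was not seen during training.
w_s, p_bar_s, e_ov = cgp(v_s=[0.3, -0.2, 0.8, 0.1], e_static=[0.5, 0.1, -0.4, 0.2])
print(len(w_s), len(p_bar_s), len(e_ov))
```

Nothing in `cgp` is indexed by class identity: an unseen concept only needs a frozen embedding and a visual feature to receive its own mixture of the shared bases.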

3. Multi-regularization Training Objectives: Constraining Prompt Sparsity, Orthogonality, and Drift. To keep prompts flexible yet stable, the total objective adds several regularization terms to the classification loss \(L_{cls}\): Frobenius norm constraints \(\lambda_p\|B\|_F^2 + \lambda_{pd}\|P_d\|_F^2 + \lambda_{pr}\|P_r\|_F^2\) on the bases and on the detection and relationship prompts to prevent overfitting; a distillation term \(\lambda_d\,\mathbb{E}[\|\tilde{e}-e_{static}\|_2^2]\) to prevent adaptive representations from drifting too far from the original semantics; an orthogonality term \(\lambda_{orth}\sum_{i<j}\|B_i^\top B_j\|_F^2\) to force different bases to capture complementary concepts; and a gating entropy term \(-\beta\sum_i w^i\log w^i\) with a KL term \(\gamma\,\text{KL}(w\,\|\,u_{prior})\) to encourage sparsity, diversity, and alignment with prior distributions. Together, these regularizers are meant to ensure that prompts learn discriminative yet non-degenerate semantic modulations within the budget of "<0.5M new parameters."
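A minimal sketch of how these regularizers could be computed is given below in standard-library Python. The coefficient values, the function names, and the treatment of the gating weights as a normalized distribution are assumptions made for illustration; the expectation in the distillation term is replaced by a single sample.

```python
import math


def frobenius_sq(M):
    """Squared Frobenius norm of a list-of-lists matrix."""
    return sum(x * x for row in M for x in row)


def orthogonality_penalty(bases):
    """sum_{i<j} ||B_i^T B_j||_F^2 for bases of shape (L, D)."""
    total = 0.0
    for i in range(len(bases)):
        for j in range(i + 1, len(bases)):
            Bi, Bj = bases[i], bases[j]
            width = len(Bi[0])
            for a in range(width):
                for b in range(width):
                    g = sum(Bi[t][a] * Bj[t][b] for t in range(len(Bi)))
                    total += g * g
    return total


def entropy(w):
    """-sum_i w_i log w_i, with 0 log 0 treated as 0."""
    return -sum(p * math.log(p) for p in w if p > 0.0)


def kl_to_prior(w, prior):
    """KL(w || prior) for two distributions over the bases."""
    return sum(p * math.log(p / q) for p, q in zip(w, prior) if p > 0.0)


def total_loss(l_cls, bases, p_d, p_r, e_tilde, e_static, w, prior, coef):
    reg = coef["p"] * sum(frobenius_sq(B) for B in bases)
    reg += coef["pd"] * frobenius_sq(p_d) + coef["pr"] * frobenius_sq(p_r)
    reg += coef["d"] * sum((a - b) ** 2 for a, b in zip(e_tilde, e_static))
    reg += coef["orth"] * orthogonality_penalty(bases)
    reg += coef["beta"] * entropy(w)  # the -beta * sum(w log w) term
    reg += coef["gamma"] * kl_to_prior(w, prior)
    return l_cls + reg


B1 = [[1.0, 0.0], [0.0, 0.0]]
B2 = [[0.0, 0.0], [0.0, 1.0]]
orth_distinct = orthogonality_penalty([B1, B2])   # bases with no overlap
orth_duplicate = orthogonality_penalty([B1, B1])  # two copies of the same base

coef = {"p": 0.1, "pd": 0.1, "pr": 0.1, "d": 1.0,
        "orth": 1.0, "beta": 0.01, "gamma": 0.01}  # hypothetical values
loss = total_loss(1.0, [B1, B2], [[0.0, 0.0]], [[0.0, 0.0]],
                  [0.2, 0.1], [0.2, 0.1], [0.5, 0.5], [0.5, 0.5], coef)
print(orth_distinct, orth_duplicate, loss)
```

In this toy case the orthogonality term vanishes for the two non-overlapping bases and becomes positive when a base is duplicated, which is the behavior the term is meant to penalize.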

Key Experimental Results

The datasets used are Visual Genome (VG, 150 object classes / 50 predicate classes), Open Images V6, and GQA; only VG is reported here due to space. Three sub-tasks are evaluated: PredCls, SGCls, and SGDet. The metrics are R@K (recall), mR@K (mean recall over predicate classes, which reflects long-tail robustness), and F@K (harmonic mean of R@K and mR@K).
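As a quick illustration of how these metrics relate, the Python sketch below computes recall, mean recall, and their harmonic mean. The per-predicate counts are made up for the example and the function names are hypothetical; the two F@50 checks use the Motif and EGTR baseline R@50 and mR@50 values from the PredCls table.

```python
def recall(per_class):
    """R: hits over ground-truth triplets, pooled across predicate classes."""
    hits = sum(h for h, _ in per_class.values())
    total = sum(n for _, n in per_class.values())
    return hits / total


def mean_recall(per_class):
    """mR: per-class recall averaged over predicate classes."""
    return sum(h / n for h, n in per_class.values()) / len(per_class)


def f_at_k(r, mr):
    """F@K: harmonic mean of R@K and mR@K."""
    return 0.0 if r + mr == 0 else 2 * r * mr / (r + mr)


# Made-up counts: (hits, ground-truth instances) for a head and a tail predicate.
counts = {"on": (90, 100), "riding": (1, 10)}
r_toy = recall(counts)        # dominated by the frequent predicate "on"
mr_toy = mean_recall(counts)  # each predicate class counts equally

f_motif = round(f_at_k(64.6, 15.2), 1)  # Motif baseline, R@50 and mR@50
f_egtr = round(f_at_k(54.1, 35.7), 1)   # EGTR baseline, R@50 and mR@50
print(r_toy, mr_toy, f_motif, f_egtr)
```

The toy counts show why mR@K is the long-tail indicator: pooled recall stays high because "on" dominates, while mean recall drops to 0.5 because the rare predicate "riding" is mostly missed.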

Main Results (VG, PredCls excerpt; "+APT" denotes the baseline with APT integrated)

| Method | R@50/100 | mR@50/100 | F@50/100 |
|---|---|---|---|
| Motif (CVPR'18) | 64.6/66.0 | 15.2/16.2 | 24.6/26.0 |
| Motif+APT | 66.5/68.2 | 17.4/18.1 | 26.4/28.1 |
| PE-Net (CVPR'23) | 65.8/67.6 | 17.7/19.2 | 27.9/29.9 |
| PE-Net+APT | 67.5/69.2 | 19.3/20.5 | 29.7/31.6 |
| EGTR (CVPR'24, one-stage) | 54.1/56.6 | 35.7/38.2 | 43.0/45.6 |
| EGTR+APT | 56.4/58.3 | 37.5/40.1 | 45.2/47.7 |
| LLM4SGG (CVPR'24) | 62.2/64.1 | 36.2/39.1 | 45.7/48.6 |
| LLM4SGG+APT | 65.1/66.9 | 38.1/42.2 | 47.9/50.3 |
| ST-SGG (ICLR'24) | 53.9/57.7 | 28.1/31.5 | 36.9/40.8 |
| ST-SGG+APT | 58.7/62.3 | 31.3/34.6 | 39.9/43.7 |

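Both R@K and mR@K improve for every baseline in this excerpt. The mR@K gains (long-tail predicates) are the larger ones relative to their starting values (for example, Motif mR@50 rises from 15.2 to 17.4 while R@50 rises from 64.6 to 66.5), which suggests that adaptive prompts alleviate the bias of static features toward high-frequency predicates. The simultaneous improvement in F@K shows that the mR gains do not come at the expense of R.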

Open-Vocabulary Results (VG, Novel split excerpt)

| Method | Novel R@50/100 | Novel mR@50/100 | Novel F@50/100 |
|---|---|---|---|
| SDSGG (NeurIPS'24) | 25.4/29.6 | 25.2/31.2 | 25.3/30.4 |
| SDSGG+APT | 26.6/31.1 | 26.7/32.3 | 27.1/32.3 |
| OvSGTR (ECCV'24) | 20.5/23.9 | 13.5/16.2 | 16.3/19.3 |
| OvSGTR+APT | 21.2/25.0 | 14.3/17.2 | 17.1/20.4 |

On unseen classes (Novel), mR@50 reportedly increases by up to +6.0 (the rows excerpted above show +1.5 for SDSGG and +0.8 for OvSGTR), supporting the claim that CGP can unlock compositional knowledge in pre-trained models.

Ablation Study

| Model (based on PE-Net / SDSGG) | Key Observation |
|---|---|
| +D-Prompt only | Slight R@K gain (better object representation), but limited help for relationship reasoning. |
| +R-Prompt only | Significant mR@K gain, directly alleviating predicate bias; the relationship prompt is the core. |
| +Full APT | Best overall metrics; the two prompts collaborate from detection through relationship prediction. |
| CGP: +RCG | Novel metrics improve; visual context conditioning is the first step to generalization. |
| CGP: +RCG+BPS | Novel mR@50 increases by +3.8 over vanilla; synthesizing custom prompts from bases is key. |
| +Full CGP (inc. FRF) | Highest harmonic mean; non-linear fusion in FRF brings balanced improvements. |

Key Findings

  • Efficiency Analysis: APT adds <0.5M parameters (even for LLM4SGG, the overhead is <1.5%), and actually reduces training time per epoch for almost all models (especially for one-stage, with LLM4SGG reduced by 25% and ST-SGG by 11.3%). The authors attribute this to context-aware features being easier to optimize, accelerating convergence.
  • New Pareto Frontier: LLM4SGG+APT achieves +1.49 performance with −25% training time and −4.6% parameters (relative to comparable enhancements), indicating a clear advantage for APT in "performance per unit of compute."
  • IB Validation: Compared to frozen GloVe, APT reduces PCA@90% from 26 to 23 and increases the discrete mutual information proxy from 1.49 to 1.96, which is consistent with the Information Bottleneck explanation of "retaining task-sufficient information while compressing redundant complexity."
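The two diagnostics in the IB validation can be sketched in standard-library Python. How the paper computes them is not given in this summary, so the sketch rests on two assumptions: PCA@90% is read as the smallest number of principal components whose cumulative explained variance reaches 90%, and the discrete mutual information proxy is read as the mutual information between discretized embeddings (for example, cluster ids) and relationship labels. The eigenvalue spectrum and label lists are made up.

```python
import math
from collections import Counter


def pca_at(eigenvalues, threshold=0.9):
    """Smallest k such that the top-k eigenvalues explain >= threshold of the variance."""
    spectrum = sorted(eigenvalues, reverse=True)
    total = sum(spectrum)
    cumulative = 0.0
    for k, lam in enumerate(spectrum, start=1):
        cumulative += lam
        if cumulative / total >= threshold:
            return k
    return len(spectrum)


def discrete_mutual_information(xs, ys):
    """I(X; Y) in nats, estimated from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Made-up covariance spectrum: three components carry 95% of the variance.
k90 = pca_at([5.0, 3.0, 1.5, 0.3, 0.2])

# Made-up cluster ids vs. relationship labels.
clusters = [0, 0, 1, 1]
mi_aligned = discrete_mutual_information(clusters, ["riding", "riding", "holding", "holding"])
mi_unrelated = discrete_mutual_information(clusters, ["riding", "holding", "riding", "holding"])
print(k90, mi_aligned, mi_unrelated)
```

Under this reading, a lower PCA@90% means the representation needs fewer directions to carry most of its variance, and a higher mutual information means its discrete structure tracks the relationship labels more closely.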

Highlights & Insights

  • Convincing Problem Diagnosis: Using a suite of quantitative diagnostics (t-SNE visualization, silhouette scores, participation ratio, PCA@90%, and mutual information proxies), the paper turns the intuition that "frozen semantic priors are a fundamental SGG bottleneck" into quantifiable evidence. This is more insightful than merely proposing a new architecture.
  • Apt "Modem" Analogy: Prompts are not used as simple prefixes but as carriers of visual information to modulate frozen semantic signals. This perspective clarifies the role of prompt tuning in structured visual tasks.
  • True Universal Plugin: The same paradigm covers two-stage, one-stage, and open-vocabulary frameworks, adapting to each architecture's stages through the detection, relationship, and unified relationship prompts (\(P_d\), \(P_r\), \(P_{ur}\)) rather than being forced onto them.
  • Saving Parameters and Time: While plugin-style works often trade overhead for performance, APT actually shortens training time. This counter-intuitive result—that adaptive features accelerate convergence—is a practical selling point.

Limitations & Future Work

  • Limited to VG: Although Open Images V6 and GQA are mentioned, results are primarily reported for VG due to page limits, leaving room for more evidence of cross-dataset universality.
  • Excessive Hyperparameters: The training objective contains a long list of coefficients (\(\lambda_p, \lambda_{pd}, \lambda_{pr}, \lambda_d, \lambda_{orth}, \beta, \gamma, \lambda_w\)). The actual cost and sensitivity of tuning these were not fully discussed.
  • Absolute mR remains low: Even with the gains, the global values for mR@K on long-tail predicates remain low (mostly between 10–20 for SGDet), indicating that the long-tail problem in SGG is far from resolved; APT is a mitigation, not a cure.
  • Dependence on Base Semantic Model Quality: The ceiling of prompt modulation is limited by the internal structure of the frozen semantic backbone. Diagnostics indicate that stronger models are still mismatched, though richer; whether prompts can truly break through this ceiling remains to be explored.

Related Work & Insights

  • vs. Architecture Debate (One-stage / Two-stage): APT takes a stand by not reinventing the wheel at the architectural level, but rather attacking the representation paradigm. This approach of "operating at a different level of abstraction" is instructive for sub-fields stuck in architectural stagnation.
  • vs. Open-Vocabulary SGG (OvSGTR / SDSGG / RAHP): These methods often rely on frozen CLIP for zero-shot alignment, but CLIP features are general rather than tailored for relationship structures. APT’s CGP fills this gap by synthesizing prompts for unseen combinations via base synthesis and context gating.
  • vs. Continuous Prompt Tuning (Lester et al.): Migrating continuous prompts from "language model prefixes" to "feature modulators for multimodal structure prediction" is a specific paradigm for applying prompt tuning across modalities, providing direct inspiration for other structured visual tasks like HOI detection and visual relationship understanding.

Rating

  • Novelty: ⭐⭐⭐⭐ — Relocates the SGG bottleneck from architecture to representation paradigm and provides a unified explanation via prompt modulation and the Information Bottleneck principle.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Covers over ten baselines across two/one-stage and open-vocabulary settings, including efficiency and IB proxy analyses; slightly docked for primarily reporting VG.
  • Writing Quality: ⭐⭐⭐⭐ — Logical progression of motivation, clear diagnostics, and an easy-to-understand "modem" analogy.
  • Value: ⭐⭐⭐⭐ — Plug-and-play, <1.5% parameters, and saves training time; high practical value for the SGG community as a low-cost, high-reward universal enhancement.