APT: Towards Universal Scene Graph Generation via Plug-in Adaptive Prompt Tuning
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=IZWJhdK2o7
Code: https://github.com/CGCL-codes/APT
Area: Scene Graph Generation / Visual Relationship Detection / Prompt Tuning
Keywords: Scene Graph Generation, Prompt Tuning, Semantic Representation, Open-Vocabulary, Plug-in Module
TL;DR
APT replaces the long-standing "frozen word vector semantic prior" in Scene Graph Generation (SGG) with a set of lightweight learnable prompts. It dynamically modulates static semantic features into representations dependent on visual context. As a plug-and-play module, it can be integrated into any one-stage, two-stage, or open-vocabulary SGG framework, achieving comprehensive performance gains with <0.5M parameters and shorter training times.
Background & Motivation
- Background: Scene Graph Generation (SGG) represents an image as a structured graph of "subject-predicate-object" triplets. For years, the field has been dominated by two paths: two-stage methods that detect objects before predicting relationships (relying on strong detector features but suffering from fragmented context), and one-stage methods that model objects and relationships jointly end to end (suffering from high computational cost and coarse relationship granularity). Both categories share a common habit: using static, fixed embeddings taken from pre-trained models such as GloVe or BERT as semantic priors.
- Limitations of Prior Work: While these frozen word vectors are effective in NLP, they are poorly matched to SGG, which depends on context sensitivity, fine-grained relationships, and asymmetric subject-object roles. A "person" embedding remains identical whether the person is riding a horse or holding a phone, which makes it hard to distinguish closely related predicates such as "standing on" and "walking on". The authors use t-SNE visualization to show that the static semantic space collapses all "person" instances into a single point, whereas the visual feature space naturally clusters by relationship context (riding, walking, holding), indicating a severe mismatch.
- Key Challenge: The community has been preoccupied with the "one-stage vs. two-stage" architectural debate, overlooking the fact that the real bottleneck lies in the representation paradigm. Simply replacing one frozen model with a stronger one (e.g., GloVe → BERT → CLIP-text) merely enriches the internal sub-structure of the embedding space without addressing the fundamental misalignment between semantics and visual relationship context. Diagnostic tests show that silhouette scores and mutual information improve with stronger models, yet the embeddings remain mismatched with fine-grained SGG requirements.
- Goal: To move beyond architectural competition by providing a lightweight, universal, and pluggable mechanism that injects adaptive semantics into any SGG framework while maintaining minimal parameter and training overhead.
- Core Idea: [Paradigm Shift] Use a set of lightweight learnable prompts as "conditional adaptors." Without backpropagating to the pre-trained backbone, these prompts modulate frozen semantic features into dynamic representations that vary with visual context and relationship roles. The authors liken this process to a "modem" in communications, where a prompt \(P\) carries visual context information to modulate the raw semantic signal.
Method
Overall Architecture
The core of APT is a set of lightweight learnable prompts that adapt frozen pre-trained semantic embeddings into context-aware task features. It is designed as a universal plugin: in two-stage methods, it acts separately on the detection and relationship stages; in one-stage methods, it merges into a single relationship prompt; and in open-vocabulary settings, an additional Compositional Generalization Prompter (CGP) is attached. The pre-trained semantic backbone remains frozen, with only the prompt parameters, visual projectors, and lightweight MLP fusion networks being learnable.
flowchart TD
A[Frozen Semantic Embeddings e_static<br/>GloVe/BERT/CLIP-text] --> F[Fusion Network f_θ]
P[Learnable Prompts P_d/P_r/P_ur] --> F
V[Visual Features v] --> Phi[Visual Projector φ] --> F
F --> E[Adaptive Representation ẽ]
E --> DET[Detection Head / Relationship Predictor]
subgraph OV[Open-Vocabulary Branch CGP]
RCG[Relationship Context Gating] --> BPS[Base Prompt Synthesis] --> FRF[Feature Refinement Fusion]
end
A -.-> OV
V -.-> OV
OV --> E
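The data flow in the diagram (frozen embedding, learnable prompt, and projected visual feature fused into an adaptive representation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `PromptModulator`, all dimensions, the mean-pool-then-concatenate aggregation, and the two-layer fusion MLP are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for this illustration.
D = 16        # semantic embedding width
L_P = 4       # prompt length in tokens
D_V = 32      # raw visual feature width
HIDDEN = 64   # hidden width of the fusion MLP


def relu(x):
    return np.maximum(x, 0.0)


class PromptModulator:
    """Toy forward pass: e_tilde = f_theta(A(P, e_static, phi(v)))."""

    def __init__(self, num_classes):
        # Learnable: one prompt of L_P tokens per class.
        self.prompts = rng.normal(0, 0.02, size=(num_classes, L_P, D))
        # Learnable: linear visual projector phi (assumed to be linear).
        self.w_phi = rng.normal(0, 0.02, size=(D_V, D))
        # Learnable: two-layer fusion MLP f_theta.
        self.w1 = rng.normal(0, 0.02, size=(3 * D, HIDDEN))
        self.w2 = rng.normal(0, 0.02, size=(HIDDEN, D))

    def __call__(self, class_ids, e_static, v):
        # Aggregation A (assumed): mean-pool each prompt to one vector,
        # then concatenate with the frozen embedding and projected visuals.
        pooled = self.prompts[class_ids].mean(axis=1)        # (B, D)
        visual = v @ self.w_phi                              # (B, D)
        fused_in = np.concatenate([pooled, e_static, visual], axis=1)
        return relu(fused_in @ self.w1) @ self.w2            # (B, D)


# Stand-in for a frozen GloVe/BERT/CLIP-text table; it is never updated.
frozen_table = rng.normal(size=(150, D))
modulator = PromptModulator(num_classes=150)

ids = np.array([3, 3])            # the same class twice
v = rng.normal(size=(2, D_V))     # two different visual contexts
e_tilde = modulator(ids, frozen_table[ids], v)
print(e_tilde.shape)
```

The two rows of `e_tilde` share a class and a frozen embedding but differ because their visual contexts differ, which is the behavior the static prior cannot provide on its own.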
Key Designs
1. Unified Plug-in Prompt: "Modulating" Frozen Embeddings into Dynamic Features. The operational principle of APT can be summarized by a general formula—for any semantic concept \(c\), a lightweight learnable prompt \(P(c)\) re-modulates its frozen embedding \(e_{static}(c)\) under the current visual context: \(\tilde{e}(c) = f_\theta\big(A(P(c), e_{static}(c), \phi(v))\big)\), where \(A(\cdot)\) is an aggregation function that pools the prompt sequence into a single vector, \(\phi(v)\) is a projector encoding the visual context, and \(f_\theta\) is a small fusion network generating the final adaptive representation. Crucially, only \(P\), \(\phi\), and \(f_\theta\) are learnable, while the pre-trained semantic backbone remains frozen, ensuring extreme parameter efficiency and avoiding catastrophic forgetting. The authors motivate this from an Information Bottleneck perspective: the goal is for the learned \(\tilde{e}\) to retain maximal object identity information relative to visual context \(v\) and target \(y\), while compressing redundant semantics irrelevant to the current relationship. The objective is formulated as \(\max\ I(\tilde{e}; y) - \beta I(\tilde{e}; e_{static} \mid v, y)\). In specific stages, this differentiates into three types of prompts: two-stage methods use Detection Prompts \(P_d(c)\in\mathbb{R}^{L_d\times D}\) (generating adaptive representations for each object class fed into the detection head) and Relationship Prompts \(P_r(r)\in\mathbb{R}^{L_r\times D}\) (capturing nuances in subject-object interactions for predicate classes). One-stage methods, lacking an independent detection stage, use a single Unified Relationship Prompt \(P_{ur}\) to modulate semantic queries or label embeddings before they enter the transformer decoder for cross-attention with visual features.
2. Compositional Generalization Prompter (CGP): Synthesizing Prompts for Unseen Concepts. Open-vocabulary settings require models to generalize to object/predicate combinations not seen during training, where fixed prompts are insufficient. CGP uses a three-module "Conditioning-Synthesis-Refinement" pipeline to generate adaptive semantics dynamically. First, Relationship Context Gating (RCG) concatenates visual evidence with the initial semantic cues and passes them through an MLP to generate role-aware gating weights \(w_s = \sigma(\text{MLP}_{gate}(\text{Concat}(v_s, e_{static}(s))))\), determining which prompt bases to activate for each entity. Next, Base Prompt Synthesis (BPS) maintains a set of learnable base prompts \(B\in\mathbb{R}^{N\times L_{ov}\times D}\) as a repository of relationship concepts. It forms a weighted combination of these bases using the gating weights, \(P_{cgp}(s)=\sum_{i=1}^{N} w_s^i \cdot B_i\), followed by token pooling with normalization to obtain a compact prompt \(\bar{p}=\text{LayerNorm}(\frac{1}{L_{ov}}\sum_{t=1}^{L_{ov}} P_{cgp}(s)_t)\). This allows a finite set of bases to generate a near-unlimited variety of customized prompts, enabling compositional generalization. Finally, Feature Refinement and Fusion (FRF) concatenates the synthesized prompt, frozen semantics, and projected visual features and passes them through a fusion MLP, \(\tilde{e}_{ov}(s)=f_{\theta_{frf}}(\text{Concat}(P_{cgp}(s), e_{static}(s), \phi_v(v_s)))\), to produce context-sensitive representations for relationship reasoning even on unseen concepts. CGP is also a plugin that can seamlessly enhance standard relationship prompts in both two-stage and one-stage frameworks.
3. Multi-regularization Training Objectives: Constraining Prompt Sparsity, Orthogonality, and Drift. To keep prompts flexible yet stable, the total objective adds several regularization terms to the classification loss \(L_{cls}\): Frobenius norm constraints \(\lambda_p\|B\|_F^2 + \lambda_{pd}\|P_{det}\|_F^2 + \lambda_{pr}\|P_{rel}\|_F^2\) on bases and prompts to prevent overfitting; a distillation term \(\lambda_d\,\mathbb{E}[\|\tilde{e}-e_{static}\|_2^2]\) to prevent adaptive representations from drifting too far from original semantics; an orthogonality term \(\lambda_{orth}\sum_{i<j}\|B_i^\top B_j\|_F^2\) to force different bases to capture complementary concepts; and gating entropy \(-\beta\sum_i w^i\log w^i\) with a KL term \(\gamma\,\text{KL}(w\,\|\,u_{prior})\) to encourage sparsity, diversity, and alignment with prior distributions. This suite of regularizations ensures that prompts learn discriminative yet non-degenerate semantic modulations under a minimal budget of "<0.5M new parameters."
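The CGP pipeline from item 2 and the orthogonality term from item 3 can be sketched in NumPy as follows. This is a hedged illustration rather than the released code: the sizes are hypothetical, `MLP_gate` and the fusion network are each collapsed to a single linear layer, and the fusion step consumes the pooled prompt \(\bar{p}\), which is one reading of the FRF formula.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L_OV, D, D_V = 8, 4, 16, 32   # hypothetical sizes


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


# Learnable parameters, randomly initialised here instead of trained.
B = rng.normal(0, 0.02, size=(N, L_OV, D))       # base prompts
W_gate = rng.normal(0, 0.02, size=(D_V + D, N))  # MLP_gate as one linear layer
W_phi = rng.normal(0, 0.02, size=(D_V, D))       # visual projector phi_v
W_frf = rng.normal(0, 0.02, size=(3 * D, D))     # fusion net as one linear layer


def cgp(v_s, e_static_s):
    """Return the adaptive embedding and the gating weights for one entity."""
    # RCG: w_s = sigmoid(MLP_gate(Concat(v_s, e_static(s))))
    w_s = sigmoid(np.concatenate([v_s, e_static_s]) @ W_gate)   # (N,)
    # BPS: P_cgp(s) = sum_i w_s^i * B_i
    p_cgp = np.tensordot(w_s, B, axes=1)                        # (L_OV, D)
    # Mean-pool over tokens, then LayerNorm, to get the compact prompt.
    p_bar = layer_norm(p_cgp.mean(axis=0))                      # (D,)
    # FRF: fuse pooled prompt, frozen semantics, projected visual feature.
    fused = np.concatenate([p_bar, e_static_s, v_s @ W_phi])    # (3D,)
    return fused @ W_frf, w_s


def orthogonality_penalty(bases):
    """sum_{i<j} ||B_i^T B_j||_F^2 over all pairs of base prompts."""
    total = 0.0
    for i in range(len(bases)):
        for j in range(i + 1, len(bases)):
            total += np.sum((bases[i].T @ bases[j]) ** 2)
    return total


v_s = rng.normal(size=D_V)          # visual feature of one entity
e_static_s = rng.normal(size=D)     # its frozen semantic embedding
e_ov, w_s = cgp(v_s, e_static_s)
print(e_ov.shape, w_s.shape, orthogonality_penalty(B))
```

Because the gate is a sigmoid, each base is switched on independently rather than competing through a softmax; the entropy and KL terms in the objective are what would push these weights toward a sparse pattern during training.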
Key Experimental Results
The datasets used are Visual Genome (VG, 150 object classes / 50 predicate classes), Open Images V6, and GQA; only VG is reported here due to space. Three sub-tasks are evaluated (PredCls, SGCls, and SGDet) using the metrics R@K, mR@K (long-tail robustness), and F@K (harmonic mean of R@K and mR@K).
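Since F@K is defined as the harmonic mean of R@K and mR@K, it can be recomputed from the other two columns. The helper below is a small illustrative sketch (the function name `f_at_k` is hypothetical); it shows why a high R@K paired with a low mR@K still yields a low F@K.

```python
def f_at_k(recall, mean_recall):
    """Harmonic mean of R@K and mR@K, both given in percent."""
    if recall + mean_recall == 0:
        return 0.0
    return 2 * recall * mean_recall / (recall + mean_recall)


# Motif baseline on VG PredCls at K=50: R@50 = 64.6, mR@50 = 15.2.
print(round(f_at_k(64.6, 15.2), 1))  # 24.6
```

The harmonic mean is dominated by the smaller operand, so F@K rewards methods that raise mR@K without giving up R@K.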
Main Results (VG, PredCls excerpt; +APT denotes the baseline with APT integrated)
| Method | R@50/100 | mR@50/100 | F@50/100 |
|---|---|---|---|
| Motif (CVPR'18) | 64.6/66.0 | 15.2/16.2 | 24.6/26.0 |
| Motif+APT | 66.5/68.2 | 17.4/18.1 | 26.4/28.1 |
| PE-Net (CVPR'23) | 65.8/67.6 | 17.7/19.2 | 27.9/29.9 |
| PE-Net+APT | 67.5/69.2 | 19.3/20.5 | 29.7/31.6 |
| EGTR (CVPR'24, one-stage) | 54.1/56.6 | 35.7/38.2 | 43.0/45.6 |
| EGTR+APT | 56.4/58.3 | 37.5/40.1 | 45.2/47.7 |
| LLM4SGG (CVPR'24) | 62.2/64.1 | 36.2/39.1 | 45.7/48.6 |
| LLM4SGG+APT | 65.1/66.9 | 38.1/42.2 | 47.9/50.3 |
| ST-SGG (ICLR'24) | 53.9/57.7 | 28.1/31.5 | 36.9/40.8 |
| ST-SGG+APT | 58.7/62.3 | 31.3/34.6 | 39.9/43.7 |
Every baseline improves on both R@K and mR@K, and in relative terms the gains are largest on mR@K (long-tail predicates), which suggests that adaptive prompts alleviate the bias of static features toward high-frequency predicates. Because R@K rises as well, the mR@K gains do not come at the expense of R@K, and F@K improves accordingly.
Open-Vocabulary Results (VG, Novel split excerpt)
| Method | Novel R@50/100 | Novel mR@50/100 | Novel F@50/100 |
|---|---|---|---|
| SDSGG (NeurIPS'24) | 25.4/29.6 | 25.2/31.2 | 25.3/30.4 |
| SDSGG+APT | 26.6/31.1 | 26.7/32.3 | 27.1/32.3 |
| OvSGTR (ECCV'24) | 20.5/23.9 | 13.5/16.2 | 16.3/19.3 |
| OvSGTR+APT | 21.2/25.0 | 14.3/17.2 | 17.1/20.4 |
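On unseen classes (Novel), mR@50 is reported to increase by up to +6.0; in the two pairs excerpted above the gains are +1.5 (SDSGG) and +0.8 (OvSGTR). This supports the claim that CGP can unlock compositional knowledge in pre-trained models.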
Ablation Study
| Model (based on PE-Net / SDSGG) | Key Observation |
|---|---|
| +D-Prompt only | Slight R@K gain (better object representation), but limited help for relationship reasoning. |
| +R-Prompt only | Significant mR@K gain, directly alleviating predicate bias; the relationship prompt is the core component. |
| +Full APT | Best overall metrics; the two prompts cooperate from detection through relationship prediction. |
| CGP: +RCG | Novel metrics improve; visual context conditioning is the first step toward generalization. |
| CGP: +RCG+BPS | Novel mR@50 increases by +3.8 over vanilla; synthesizing custom prompts from the bases is key. |
| +Full CGP (incl. FRF) | Highest harmonic mean; the non-linear fusion in FRF brings balanced improvements. |
Key Findings
- Efficiency Analysis: APT adds <0.5M parameters (even for LLM4SGG, the overhead is <1.5%), and actually reduces training time per epoch for almost all models (especially for one-stage, with LLM4SGG reduced by 25% and ST-SGG by 11.3%). The authors attribute this to context-aware features being easier to optimize, accelerating convergence.
- New Pareto Frontier: LLM4SGG+APT achieves a +1.49 performance gain with −25% training time and −4.6% parameters (relative to comparable enhancements), indicating a clear advantage in "performance per unit of compute."
- IB Validation: Compared to frozen GloVe, APT reduces PCA@90% from 26 to 23 and increases the discrete mutual information proxy from 1.49 to 1.96, which is consistent with the Information Bottleneck explanation of "retaining task-sufficient information while compressing redundant complexity."
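To make the PCA@90% diagnostic concrete, the sketch below assumes it means the smallest number of principal components whose cumulative explained variance reaches 90%. That definition, the function name `pca_at`, and the synthetic data are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def pca_at(features, threshold=0.90):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`.

    `features` is an (n_samples, n_dims) array with non-zero variance.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional
    # to the variance captured by each principal component.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    ratio = np.cumsum(var) / var.sum()
    # First index where the cumulative ratio reaches the threshold,
    # converted from a 0-based index to a component count.
    return int(np.searchsorted(ratio, threshold) + 1)


# Synthetic example: 64-dim features that mostly live in a 5-dim subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 64))
features = latent @ mixing + 0.01 * rng.normal(size=(500, 64))
print(pca_at(features))
```

Under this reading, a lower PCA@90% means the same share of variance is captured by fewer directions, which matches the "compressing redundant complexity" interpretation given in the bullet above.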
Highlights & Insights
- Convincing Problem Diagnosis: Using a suite of quantitative diagnostics (t-SNE visualization, silhouette scores, participation ratios, PCA@90%, and mutual information proxies), the paper turns the intuition that "frozen semantic priors are a fundamental SGG bottleneck" into quantifiable evidence. This is more insightful than merely proposing a new architecture.
- Apt "Modem" Analogy: Prompts are not used as simple prefixes but as carriers of visual information to modulate frozen semantic signals. This perspective clarifies the role of prompt tuning in structured visual tasks.
- True Universal Plugin: The same paradigm covers two-stage, one-stage, and open-vocabulary frameworks, adapting to each architecture's stages through the \(P_d\)/\(P_r\)/\(P_{ur}\) prompts rather than being forced onto them.
- Saving Parameters and Time: While plugin-style works often trade overhead for performance, APT actually shortens training time. This counter-intuitive result—that adaptive features accelerate convergence—is a practical selling point.
Limitations & Future Work
- Limited to VG: Although Open Images V6 and GQA are mentioned, results are primarily reported for VG due to page limits, leaving room for more evidence of cross-dataset universality.
- Excessive Hyperparameters: The training objective contains a long list of coefficients (\(\lambda_p, \lambda_{pd}, \lambda_{pr}, \lambda_d, \lambda_{orth}, \beta, \gamma, \lambda_w\)). The actual cost and sensitivity of tuning these were not fully discussed.
- Absolute mR remains low: Even with the gains, the global values for mR@K on long-tail predicates remain low (mostly between 10–20 for SGDet), indicating that the long-tail problem in SGG is far from resolved; APT is a mitigation, not a cure.
- Dependence on Base Semantic Model Quality: The ceiling of prompt modulation is limited by the internal structure of the frozen semantic backbone. Diagnostics indicate that stronger models are still mismatched, though richer; whether prompts can truly break through this ceiling remains to be explored.
Related Work & Insights
- vs. Architecture Debate (One-stage / Two-stage): APT takes a stand by not reinventing the wheel at the architectural level, but rather attacking the representation paradigm. This approach of "operating at a different level of abstraction" is instructive for sub-fields stuck in architectural stagnation.
- vs. Open-Vocabulary SGG (OvSGTR / SDSGG / RAHP): These methods often rely on frozen CLIP for zero-shot alignment, but CLIP features are general rather than tailored for relationship structures. APT’s CGP fills this gap by synthesizing prompts for unseen combinations via base synthesis and context gating.
- vs. Continuous Prompt Tuning (Lester et al.): APT moves continuous prompts from "language model prefixes" to "feature modulators for multimodal structure prediction," a concrete recipe for applying prompt tuning across modalities that offers direct inspiration for other structured visual tasks such as HOI detection and visual relationship understanding.
Rating
- Novelty: ⭐⭐⭐⭐ — Relocates the SGG bottleneck from architecture to representation paradigm and provides a unified explanation via prompt modulation and the Information Bottleneck principle.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers over ten baselines across two/one-stage and open-vocabulary settings, including efficiency and IB proxy analyses; slightly docked for primarily reporting VG.
- Writing Quality: ⭐⭐⭐⭐ — Logical progression of motivation, clear diagnostics, and an easy-to-understand "modem" analogy.
- Value: ⭐⭐⭐⭐ — Plug-and-play, <1.5% parameters, and saves training time; high practical value for the SGG community as a low-cost, high-reward universal enhancement.