PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination¶
Conference: ICCV 2025 · arXiv: 2509.04833 · Code: GitHub · Area: Computer Vision · Keywords: visual grounding, proposal-based, contrastive learning, referring expression, target existence discrimination
TL;DR¶
This paper proposes PropVG, the first end-to-end proposal-based visual grounding framework that eliminates the need for pretrained detectors. It decomposes visual grounding into two stages — foreground proposal generation and contrastive learning-based referring scoring — and introduces a Multi-granularity Target Discrimination (MTD) module that integrates object-level and semantic-level information to determine target existence. PropVG achieves state-of-the-art performance on 10 datasets while running 4× faster than traditional proposal-based methods.
Background & Motivation¶
Evolution of Visual Grounding Tasks:
- Classical VG: REC (bounding box prediction) and RES (mask prediction); each expression refers to exactly one target
- Generalized VG (GVG): extended to zero or multiple targets, requiring a target existence judgment
Tension Between Two Dominant Paradigms:
- Traditional two-stage proposal methods (e.g., MAttNet): rely on a pretrained detector to generate candidate boxes before expression matching. Strength: global foreground perception. Weaknesses: lower performance and slow inference (320 ms/frame)
- Direct referring methods (e.g., TransVG, SimVG): end-to-end prediction of the target location. Strengths: simple and efficient. Weakness: focus solely on the referred target while ignoring other foreground objects, weakening referential understanding and interpretability
Two Key Open Problems:
- (1) How can the global perception advantage of proposals be preserved while eliminating the dependence on pretrained detectors, so the whole pipeline trains end-to-end?
- (2) In generalized settings, how can target existence be determined accurately? Existing methods rely on global or single-granularity predictions, overlooking the complementarity of multi-granularity information
Core Insight: Visual grounding can be naturally decomposed into "which objects are foreground" (detection) and "which one is referred to" (matching). Joint end-to-end training of both steps enables simultaneous global perception and referential understanding.
Method¶
Overall Architecture¶
Input image \(\mathcal{I}\) and text expression \(\mathcal{T}\) → BEiT-3 multimodal encoder → SimFPN multi-scale features → Dual branches: (1) UNet decoder + SegHead for global segmentation \(M_{seg}\); (2) multi-scale deformable decoder + DetHead for foreground proposal generation → CRS module for referring score computation → MTD module for target existence determination.
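The pipeline above can be sketched at the shape level. This is purely illustrative numpy with random stand-ins for the real modules; only the names (\(M_{seg}\), \(Q_{prop}\), \(P_{bbox}\), \(P_{score}\), \(S_{ref}\)) and shapes follow the note, everything else is a placeholder:

```python
import numpy as np

# Toy sizes: N proposal queries, C channels, H×W encoder feature grid
N, C, H, W = 100, 256, 10, 10
rng = np.random.default_rng(0)

# Stand-in for BEiT-3 encoder output (fused visual features)
vis_feat = rng.standard_normal((H, W, C))

# Branch 1: UNet decoder + SegHead -> global segmentation map M_seg in [0, 1]
M_seg = 1 / (1 + np.exp(-(vis_feat @ rng.standard_normal(C))))  # (H, W)

# Branch 2: multi-scale deformable decoder turns N learnable queries into proposals
Q_prop  = rng.standard_normal((N, C))  # proposal queries
P_bbox  = rng.random((N, 4))           # DetHead: foreground boxes
P_score = rng.random((N, 2))           # DetHead: fg/bg confidence

# CRS then scores each proposal against the text; MTD fuses S_ref with M_seg
S_ref = rng.random((N, 1))
print(M_seg.shape, Q_prop.shape, P_bbox.shape, S_ref.shape)
```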
Key Design 1: End-to-End Proposal-Based Framework¶
Foreground Proposal Stage:
- Initialize \(N\) learnable queries \(Q_{init}\)
- Interact with SimFPN multi-scale features via a multi-scale deformable decoder to produce proposal queries \(Q_{prop} \in \mathbb{R}^{N \times C}\)
- DetHead outputs foreground bounding boxes \(P_{bbox} \in \mathbb{R}^{N \times 4}\) and confidence scores \(P_{score} \in \mathbb{R}^{N \times 2}\)
- Hungarian matching assigns queries to all foreground objects (not only the referred target), providing global supervision
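The key point of the matching step is that every ground-truth foreground object, not just the referred one, is assigned its own query. A toy brute-force version of Hungarian matching makes this concrete (the real framework uses a proper Hungarian solver over a detection cost; the plain L1 box cost here is just for illustration):

```python
from itertools import permutations

def match(pred_boxes, gt_boxes):
    """Assign each GT box to a distinct predicted box, minimizing total L1 cost.

    Brute force over permutations -- fine for a toy example; DETR-style
    training uses the Hungarian algorithm (e.g. scipy's
    linear_sum_assignment) with a richer cost (class + box + IoU terms).
    """
    def cost(p, g):
        return sum(abs(a - b) for a, b in zip(p, g))

    best, best_cost = None, float("inf")
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        c = sum(cost(pred_boxes[i], g) for i, g in zip(perm, gt_boxes))
        if c < best_cost:
            best, best_cost = perm, c
    return list(best)  # best[j] = index of the query matched to GT j

# Every foreground object gets a query, not only the referred one
preds = [(0.1, 0.1, 0.3, 0.3), (0.6, 0.6, 0.9, 0.9), (0.4, 0.1, 0.5, 0.2)]
gts   = [(0.62, 0.58, 0.88, 0.9), (0.1, 0.12, 0.28, 0.3)]
print(match(preds, gts))  # -> [1, 0]
```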
Referring Scoring Stage:
- Query Proj. maps \(Q_{prop}\) to \(Q_{prop}'\) for referring classification
- The CRS module computes a referring score for each query
Key distinction from traditional two-stage methods: no external pretrained detector is required; proposals and referring are trained end-to-end within a unified framework.
Key Design 2: Contrastive-based Refer Scoring (CRS)¶
The CRS module combines sentence-level and word-level contrastive learning to assess the relevance of each proposal to the text expression.
Sentence-Level Contrastive Learning: Computes the similarity vector \(S_{sent} \in \mathbb{R}^{N \times 1}\) between query features \(Q_p\) and the global text feature \(f_s\), which is obtained via valid-mask pooling:
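The note drops the pooling equation; valid-mask pooling is presumably a masked mean over the word features (my reconstruction from the description, so the paper's exact form may differ):

```latex
f_s = \frac{\sum_{i=1}^{N_t} m_i \, f_{t,i}}{\sum_{i=1}^{N_t} m_i},
\qquad
m_i = \begin{cases} 1 & \text{token } i \text{ is a valid (non-padding) word} \\ 0 & \text{otherwise} \end{cases}
```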
Word-Level Contrastive Learning: Computes the similarity matrix \(S_{word} \in \mathbb{R}^{N \times N_t}\) between query features and word-level text features \(f_t\).
Adaptive Weighted Fusion: A learnable weight \(w_s\) (derived from \(f_s\) via MLP + sigmoid) dynamically balances the contributions of the two levels:
Similarity is computed using cosine similarity with a learnable temperature parameter \(T\) (initialized to 0.07):
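A minimal numpy sketch of the two-level scoring ties these pieces together: temperature-scaled cosine similarity at both levels, fused by the gate \(w_s\). The max-over-words aggregation and the sigmoid stand-in for the MLP are my assumptions, not confirmed by the paper:

```python
import numpy as np

def cosine_sim(a, b, T=0.07):
    """Temperature-scaled cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a @ b.T) / T

rng = np.random.default_rng(0)
N, N_t, C = 4, 6, 8                  # queries, words, channels (toy sizes)
Q_p = rng.standard_normal((N, C))    # proposal query features
f_t = rng.standard_normal((N_t, C))  # word-level text features
f_s = f_t.mean(0, keepdims=True)     # global text feature (stand-in for valid-mask pooling)

S_sent = cosine_sim(Q_p, f_s)               # (N, 1) sentence-level scores
S_word = cosine_sim(Q_p, f_t)               # (N, N_t) word-level scores
S_word_agg = S_word.max(-1, keepdims=True)  # aggregate over words (assumption: max)

w_s = 1 / (1 + np.exp(-f_s.mean()))             # stand-in for MLP(f_s) + sigmoid
S_ref = w_s * S_sent + (1 - w_s) * S_word_agg   # (N, 1) fused referring score
```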
Key Design 3: Multi-Granularity Target Discrimination (MTD)¶
MTD integrates object-level information (referring scores from the detection branch) and semantic-level information (mask predictions from the segmentation branch) to determine target existence.
Score Prior Cross Attention (SPCA): Injects prior score information into standard attention:
where \(S\) includes \(S_{ref}\) (object-level prior) and \(M_{seg}\) (semantic-level prior).
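The note omits the SPCA equation itself. One plausible reconstruction, treating the prior as an additive log-bias on the attention logits (a common pattern for score- or mask-guided attention; the paper's exact formulation may differ), is:

```latex
\mathrm{SPCA}(Q, K, V; S) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + \log(S + \epsilon)\right) V
```

with \(S \in \{S_{ref}, M_{seg}\}\) broadcast over the attention logits and \(\epsilon\) a small constant for numerical stability.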
Final Target Existence Score: Fuses information from three granularities:
where TAS (TopK Average Score) computes the average score of the top-K pixels in the segmentation prediction, mitigating the influence of high-confidence outliers.
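TAS as described reduces to the mean of the top-K pixel scores, which a single spurious high-confidence pixel cannot dominate (unlike a global max), while staying more target-sensitive than a global mean. A sketch, with K chosen arbitrarily here rather than taken from the paper:

```python
import numpy as np

def topk_average_score(seg_pred, k=100):
    """Mean of the top-k pixel scores in a segmentation prediction."""
    flat = seg_pred.ravel()
    k = min(k, flat.size)
    topk = np.partition(flat, -k)[-k:]  # unordered top-k values
    return float(topk.mean())

seg = np.zeros((8, 8))
seg[0, 0] = 0.99        # one outlier pixel
print(topk_average_score(seg, k=4))  # -> 0.2475 (outlier diluted by the average)
```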
Loss & Training¶
Default settings: \(\lambda_{det}=0.1\), \(\lambda_{exist}=0.2\), \(\lambda_{ref}=1.0\).
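With those weights, the overall objective is presumably a weighted sum of the detection, existence, and referring losses (component names inferred from the subscripts in the note):

```latex
\mathcal{L} = \lambda_{det}\,\mathcal{L}_{det} + \lambda_{exist}\,\mathcal{L}_{exist} + \lambda_{ref}\,\mathcal{L}_{ref},
\qquad \lambda_{det}=0.1,\ \lambda_{exist}=0.2,\ \lambda_{ref}=1.0
```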
Key Experimental Results¶
Evaluation Scope¶
10 datasets: RefCOCO/+/g (REC/RES), gRefCOCO (GREC/GRES), R-RefCOCO/+/g, Ref-ZOM
Main Results (REC — RefCOCO)¶
| Method | Backbone | val | testA | testB |
|---|---|---|---|---|
| MAttNet* (traditional proposal) | ResNet-101 | 76.65 | 81.14 | 69.99 |
| SimVG-DB | BEiT3-ViT-B | 91.47 | 93.65 | 87.94 |
| OneRef | BEiT3-ViT-B | 91.89 | 94.31 | 88.58 |
| PropVG | BEiT3-ViT-B | 92.70 | 95.07 | 89.58 |
PropVG comprehensively outperforms direct referring methods with the same backbone on RefCOCO, achieving a +14% performance gain over MAttNet at 4× faster inference.
Generalized Visual Grounding (GRES — gRefCOCO)¶
| Method | val gIoU | testA gIoU | testB gIoU |
|---|---|---|---|
| HDC (Swin-B) | 68.28 | 72.52 | 63.85 |
| PropVG | 73.29 | 74.43 | 65.87 |
GREC Detection Task¶
| Method | val F1 | val N-acc |
|---|---|---|
| SimVG | 62.1 | 54.7 |
| PropVG | 72.2 | 72.8 |
PropVG surpasses SimVG by +10.1% F1 on GREC and improves "no-target" discrimination accuracy by +18.1%.
Ablation Study¶
| Component | F1score | N-acc. | gIoU |
|---|---|---|---|
| Basic Setting | 63.41 | 64.11 | 65.98 |
| + SimFPN | 65.14 | 68.02 | 66.86 |
| + UNet Decoder | 65.87 | 69.19 | 68.16 |
| + Multi-scale Deformable Decoder | 67.44 | 69.47 | 69.10 |
| + Channel Splitting | 67.98 | 70.44 | 69.59 |
| + Query Proj. (Baseline) | 68.81 | 70.39 | 69.85 |
| + CRS | 70.61 | 74.78 | 71.30 |
| + MTD | 72.20 | 72.83 | 73.29 |
| − Foreground Supervision | 66.83 | 61.06 | 66.37 |
Key Findings¶
- CRS yields +1.8 F1score and +4.4 N-acc.; the adaptive weighted fusion of word-level and sentence-level contrastive learning is critical
- Removing foreground supervision causes a ~2-point F1 drop relative to the baseline (68.81 → 66.83), confirming that the proposal stage's global object perception provides valuable priors for referring scoring
- PropVG achieves an inference speed of only 76 ms/frame, significantly outperforming traditional proposal methods (MAttNet 320 ms, PolyFormer 150 ms, GroundingDINO 120 ms)
- On R-RefCOCO/+/g, rIoU improves by 9.5–11.2%, with the largest gains in multi-target and target-absent scenarios
Highlights & Insights¶
- Reviving the proposal-based paradigm: By eliminating the speed and performance bottlenecks of traditional two-stage methods through end-to-end design, this work demonstrates that the proposal-based approach itself is not outdated — the issue lies in the implementation
- Reformulating VG as "detection + binary classification": The proposal stage generates candidates, and the referring stage performs a binary decision, reducing task complexity
- Elegant multi-granularity discrimination design: MTD fuses detection scores, segmentation predictions, and learnable queries via a multiplicative structure that ensures low confidence in any single pathway suppresses the final score
Limitations & Future Work¶
- Proposal stage training requires foreground object annotations, increasing annotation cost
- Target existence judgment in absent-target scenarios depends on a fixed MTD threshold (0.7), which may require tuning across datasets
- A performance gap remains on certain metrics compared to MLLM-based methods (e.g., GSVA-13B), though PropVG uses far fewer parameters (0.2B vs. 13B+)
- The framework has not been evaluated under open-vocabulary or zero-shot settings
Related Work & Insights¶
- REC/RES: TransVG, LAVT, SeqTR, SimVG, OneRef
- GVG: ReLA, GREC, RefSegformer, HDC
- Proposal-based: MAttNet, NMTree, Ref-NMS
- MLLM-based methods: LISA, GSVA, GLaMM
Rating¶
- Novelty: 4/5 (end-to-end proposal-based framework + multi-granularity target discrimination; architecturally innovative)
- Technical Depth: 4/5 (dual-level contrastive learning in CRS and SPCA mechanism in MTD are well-motivated)
- Experimental Thoroughness: 5/5 (10 datasets, extensive ablations, speed comparisons, parameter count comparison with MLLMs)
- Writing Quality: 4/5 (clear structure, comprehensive figures)