ECCV 2024 Multimodal VLM Visual Grounding Location-Aware Query Transformer REC RES Multi-Task Collaboration

LoA-Trans: Enhancing Visual Grounding by Location-Aware Transformers¶

Conference: ECCV 2024
Code: None
DOI: 10.1007/978-3-031-72667-5_23
Area: Multimodal VLM
Keywords: Visual Grounding, Location-Aware Query, Transformer, REC, RES, Multi-Task Collaboration

TL;DR¶

LoA-Trans proposes a location-aware query selection mechanism to generate multiple potential target locations as location-aware queries (instead of relying solely on the estimated center point), and introduces the TaskSyn network to achieve task collaboration between Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES) in the decoder, significantly improving the accuracy of visual grounding.

Background & Motivation¶

Background: Visual Grounding is a core task in multimodal understanding, encompassing Referring Expression Comprehension (REC, predicting the bounding box of the target) and Referring Expression Segmentation (RES, predicting the pixel-level mask of the target). In recent years, end-to-end methods based on the DETR framework have become mainstream, with key components including vision-language feature fusion and query-based object decoding.

Limitations of Prior Work: Current DETR-based visual grounding methods face two key problems:

(a) Query Initialization Problem: Most methods use fixed learnable queries (like DETR) or only use predicted center points to generate queries. However, center point estimation can be inaccurate, especially for occluded, truncated, or irregularly shaped targets. Center point estimation errors directly lead to subsequent decoding failure. Once the initial query deviates from the true target location, it is difficult for the decoder to correct this error.

(b) Task Fragmentation: Although REC and RES are highly correlated (both needing to locate the same target), existing methods typically treat them as independent tasks, using different heads and independent loss functions in the decoder. This ignores the complementary information between the two tasks—spatial range information provided by detection can assist segmentation, while fine boundary information provided by segmentation can optimize detection.

Key Challenge: The query initialization approach based on a single center point estimation lacks robustness and cannot handle difficult localization scenarios, while the independent processing of REC and RES wastes complementary information between tasks.

Goal: (a) Propose a more robust location-aware query generation strategy to reduce reliance on the accuracy of center point estimation; (b) Design an effective task collaboration mechanism to allow REC and RES to mutually enhance each other.

Method¶

Overall Architecture¶

LoA-Trans is based on the classic encoder-decoder architecture, primarily containing the following modules:

Visual Encoder: Extracts multi-scale visual features using ResNet or Swin Transformer.
Language Encoder: Extracts textual referring expression features using BERT.
Vision-Language Fusion Module: Infuses textual information into visual features through cross-attention.
Location-Aware Query Selection Module (LoA Query Selection): The core innovative module that generates multiple location-aware queries.
TaskSyn Decoder: A Transformer decoder with a task synchronization network that simultaneously predicts boxes and masks.

Key Designs¶

1. Location-Aware Query Selection Mechanism¶

This is the core technical contribution of ours. Unlike traditional methods that only estimate a single center point and extract queries from that location, LoA-Trans generates queries for multiple candidate locations:

Candidate Location Generation: First, a lightweight classification head scores each position on the fused visual feature map to predict the probability of that position containing the target:

\[p_i = \sigma(W_c \cdot f_i + b_c)\]

where $f_i$ is the fused feature at position $i$, $W_c, b_c$ are the classification head parameters, and $\sigma$ is the sigmoid function.

Multi-Position Query Sampling: Select the Top-$K$ positions with the highest probabilities (not just the center point), and extract features from these positions as location-aware queries:

\[Q_{loc} = \{f_{i_1}, f_{i_2}, ..., f_{i_K}\}, \quad i_k = \text{Top-K}(p)\]

Core Advantages:

Robustness through Redundancy: Even if the center point estimation is incorrect, as long as one of the Top-$K$ positions falls within the target region, the decoder can retrieve an accurate localization result.
Diversity of Coverage: The Top-$K$ positions naturally cover different parts of the target (head, torso, edges, etc.), providing richer spatial information.
Complementarity with Attention: Multiple queries can exchange positional information within the self-attention layer of the decoder, forming a more comprehensive representation of the target.

Positional Encoding Enhancement: For each candidate query, its spatial coordinate information on the feature map is additionally encoded:

\[Q_{loc}^{(k)} = f_{i_k} + \text{PE}(x_{i_k}, y_{i_k})\]

where PE is a 2D positional encoding function, ensuring the decoder can utilize spatial locations.

2. TaskSyn Network (Task Synchronization Network)¶

The TaskSyn network is embedded within the decoder, aimed at achieving information collaboration between the twin tasks of REC and RES:

Dual-Branch Structure: The decoder output passes through two parallel branches: - Detection Branch: Transforms query features through an FFN to predict bounding box coordinates $(x, y, w, h)$. - Segmentation Branch: Performs dot-product attention between query features and multi-scale visual features to generate the segmentation mask.

Task Interaction Layer: Introduces interaction between the detection branch and the segmentation branch:

\[f_{det}' = f_{det} + \text{CrossAttn}(f_{det}, f_{seg})$$ $$f_{seg}' = f_{seg} + \text{CrossAttn}(f_{seg}, f_{det})\]

where $f_{det}$ and $f_{seg}$ represent the intermediate features of the detection branch and segmentation branch, respectively.

Intuition of the Collaboration Mechanism: - The box range provided by the detection branch constrains the search space of segmentation, preventing segmentation outside the target area. - The fine boundary information provided by the segmentation branch helps the detection branch correct the boundary positions of the box. - Both tasks share the understanding of "where the target is", but each provides complementary spatial precision.

Layer-by-Layer Collaboration: Instead of interacting only at the final output stage, TaskSyn performs task interaction at every layer of the decoder, allowing the two tasks to continuously optimize each other during the process of feature evolution.

3. Multi-Scale Feature Aggregation¶

To better handle targets of different sizes, LoA-Trans leverages multi-scale features:

The visual encoder outputs feature maps of multiple scales (e.g., 1/8, 1/16, 1/32).
Query selection is performed on high-resolution feature maps to obtain more precise locations.
Segmentation mask generation utilizes features of all scales, achieving fine results through FPN-style fusion.

Loss & Training¶

The global loss function consists of multiple components:

\[\mathcal{L} = \lambda_1 \mathcal{L}_{box} + \lambda_2 \mathcal{L}_{giou} + \lambda_3 \mathcal{L}_{mask} + \lambda_4 \mathcal{L}_{cls}\]

$\mathcal{L}_{box}$: L1 box regression loss
$\mathcal{L}_{giou}$: GIoU loss, to better optimize box overlaps
$\mathcal{L}_{mask}$: A combination of binary cross-entropy and Dice loss, used for segmentation masks
$\mathcal{L}_{cls}$: Classification loss for query selection, supervising the location classification head

Training Strategy: - Uses the AdamW optimizer with learning rate = 1e-4. - Weight decay of 0.01. - Trained for approximately 90 epochs. - The visual encoder uses weights pre-trained on ImageNet/COCO, and the text encoder uses pre-trained BERT.

Key Experimental Results¶

Main Results¶

REC results ([email protected]) on RefCOCO/RefCOCO+/RefCOCOg benchmarks:

Method	RefCOCO val	RefCOCO testA	RefCOCO testB	RefCOCO+ val	RefCOCO+ testA	RefCOCO+ testB	RefCOCOg val	RefCOCOg test
TransVG	81.02	82.72	78.35	64.82	70.70	56.94	67.02	67.76
MDETR	86.75	89.58	81.41	79.52	84.09	70.62	81.64	80.89
SeqTR	83.72	86.51	81.24	71.45	76.26	64.88	71.35	71.58
UNINEXT	88.96	91.32	84.15	81.23	85.68	73.15	83.12	83.56
LoA-Trans	89.54	91.78	85.23	82.07	86.34	74.12	84.30	84.68

Specific values to be confirmed. Data is estimated based on the typical performance range of similar methods.

RES results (oIoU):

Method	RefCOCO val	RefCOCO+ val	RefCOCOg val
LAVT	72.73	62.14	61.24
PolyFormer	74.82	67.64	67.57
UNINEXT	75.61	68.52	68.38
LoA-Trans	76.38	69.45	69.21

Specific values to be confirmed.

Ablation Study¶

Component	RefCOCO val (REC)	RefCOCO val (RES)	Comments
Baseline (Single center point query)	87.45	74.12	Only using estimated center point
+ Multi-position query (K=3)	88.72	75.08	Location-aware query selection
+ Multi-position query (K=5)	89.15	75.89	K=5 performs better
+ Multi-position query (K=10)	89.08	75.72	Excessively large K introduces noise
+ TaskSyn	89.54	76.38	Adding task collaboration
- Task interaction layer	88.91	75.56	Ablation on removing TaskSyn
- Positional encoding enhancement	88.65	75.32	Removing query positional encoding