GOAL: Global-Local Object Alignment Learning¶

Conference: CVPR 2025
arXiv: 2503.17782
Code: GitHub
Area: Information Retrieval
Keywords: Global-Local Alignment, CLIP Fine-tuning, Long-Text Image-Text Retrieval, SAM Segmentation, Token Similarity Learning

TL;DR¶

Proposes the GOAL method, which enhances CLIP's understanding of long text descriptions through two modules: Local Image-Sentence Matching (LISM) and Token Similarity-based Learning (TSL). By introducing local semantic alignment on top of global alignment, it significantly improves image-text retrieval performance.

Background & Motivation¶

Background: CLIP has demonstrated powerful zero-shot transfer capabilities through contrastive learning on large-scale image-text pairs. However, CLIP is trained mainly on short image descriptions (its text encoder accepts at most 77 tokens) that focus on high-level concepts of images.

Limitations of Prior Work:
1. CLIP performs poorly when processing longer, detailed text descriptions because its unified embedding space is optimized for concise descriptions.
2. Existing models only perform global image-text matching, aligning the entire image and the full caption as a single unit, which lacks fine-grained local correspondence.
3. Although Long-CLIP extends the text length, it requires multimodal LLMs to generate synthetic long descriptions, resulting in high data preparation costs.

Key Challenge: CLIP's global contrastive learning paradigm cannot capture the fine-grained correspondence between local image regions and specific description sentences in the text, leading to a substantial loss of detailed information when processing long texts.

Goal: Efficiently fine-tune CLIP so that it understands local semantic details in long texts while maintaining its global alignment capability.

Key Insight: "Global" is defined as the entire image or full text, while "local" is defined as image fragments or individual sentences in the text. Global representations are enhanced by constructing local pseudo-pairs and propagating local attention.

Core Idea: Segment images using SAM and split sentences to construct local image-sentence pseudo-pairs, then propagate local information to global representations through Token-level Similarity-based Learning.

Method¶

Overall Architecture¶

GOAL consists of two core components: (1) LISM (Local Image-Sentence Matching) for extracting local pseudo-pairs from global image-long text pairs; (2) TSL (Token Similarity-based Learning) for efficiently propagating local attention to global Token representations through matched local pairs. The overall objective function combines global contrastive, local contrastive, and TSL losses.

Key Designs¶

LISM (Local Image-Sentence Matching Pipeline):
- Function: Automatically extract the best local image-sentence matching pairs from a pair of image-text data.
- Mechanism: Segment the image into semantic regions using SAM (discarding small regions, <1% of the image), and split the caption into sentences; extract the CLS embedding of each region and each sentence with the CLIP encoders, compute the cosine similarity of every region-sentence combination, and select the local pair \((I_l, T_l)\) with the highest similarity.
- Design Motivation: Leverage SAM's strong segmentation capabilities and CLIP's own matching ability to obtain high-quality local correspondences without additional annotations.
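
The matching step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the region and sentence embeddings have already been produced (by SAM crops and the CLIP encoders), uses made-up toy vectors in their place, and all function names (`split_sentences`, `keep_large_regions`, `best_local_pair`) are hypothetical. The sentence splitter is a naive punctuation-based one chosen for brevity.

```python
import math
import re

def split_sentences(caption):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", caption.strip()) if s]

def keep_large_regions(area_fractions, min_frac=0.01):
    """Indices of regions covering at least min_frac of the image (1% threshold from the text)."""
    return [i for i, frac in enumerate(area_fractions) if frac >= min_frac]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def best_local_pair(region_embs, sentence_embs):
    """Return (region_idx, sentence_idx, similarity) for the highest-scoring pair."""
    best = None
    for i, region in enumerate(region_embs):
        for j, sentence in enumerate(sentence_embs):
            sim = cosine(region, sentence)
            if best is None or sim > best[2]:
                best = (i, j, sim)
    return best

# Toy stand-ins for CLIP CLS embeddings (hypothetical values, not real model output).
sentences = split_sentences("A red car. It is parked near a tree.")
kept = keep_large_regions([0.30, 0.004, 0.12])   # second region is below 1%
region_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.2]]  # embeddings of the two kept regions
sentence_embs = [[0.1, 0.9, 0.1], [0.7, 0.1, 0.6]]
region_idx, sentence_idx, sim = best_local_pair(region_embs, sentence_embs)
```

Only the single best pair is returned, which mirrors the description above; keeping the top few pairs instead would be a small change to `best_local_pair`.
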
TSL (Token Similarity-based Learning):
- Function: Propagate local semantic information to global Token representations.
- Mechanism: For the text side, locate the tokens in the global text corresponding to the local sentence, map them through a projection layer after average pooling, and maximize the similarity with the local CLS embedding; for the visual side, locate the patch tokens corresponding to the region in the global image based on the bounding box of the local image, map them through a projection layer after average pooling, and maximize the similarity with the local CLS embedding.
- Design Motivation: By making a subset of global tokens approximate the CLS representation of the corresponding local part, the encoder is encouraged to focus on key local elements in the image/text.
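
A rough sketch of the TSL computation follows, under several stated assumptions: tokens are plain Python lists rather than tensors, the projection layer is a bias-free linear map, the loss is the MSE mentioned in the training section, and a patch counts as "corresponding to the region" if it overlaps the bounding box at all. The helper names (`mean_pool`, `project`, `tsl_loss`, `patch_indices`) are hypothetical and the numbers are toy values.

```python
def mean_pool(tokens):
    """Average a list of equal-length token vectors into one vector."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def project(vec, weight):
    """Linear projection; `weight` is a list of output rows (no bias, an assumption)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def tsl_loss(global_tokens, indices, weight, local_cls):
    """MSE between the projected, pooled token subset and the local CLS vector."""
    pooled = mean_pool([global_tokens[i] for i in indices])
    projected = project(pooled, weight)
    return sum((p - c) ** 2 for p, c in zip(projected, local_cls)) / len(local_cls)

def patch_indices(box, image_size, patch_size):
    """Row-major indices of patches overlapping a pixel box (x0, y0, x1, y1), x1/y1 exclusive."""
    x0, y0, x1, y1 = box
    per_row = image_size // patch_size
    cols = range(x0 // patch_size, min(per_row, -(-x1 // patch_size)))  # ceil division
    rows = range(y0 // patch_size, min(per_row, -(-y1 // patch_size)))
    return [r * per_row + c for r in rows for c in cols]

# Text side: tokens 1..2 of the global caption belong to the matched sentence.
text_tokens = [[0.0, 0.0], [2.0, 0.0], [4.0, 2.0], [9.0, 9.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
loss_text = tsl_loss(text_tokens, [1, 2], identity, local_cls=[1.0, 1.0])

# Visual side: patches of a 224x224 image (16-pixel patches) covered by the region's box.
covered = patch_indices((16, 0, 48, 16), image_size=224, patch_size=16)
```

The same `tsl_loss` serves both modalities; only the way the token subset is located differs (a contiguous token span for text, a bounding box for image patches).
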
Positional Encoding Interpolation:
- Function: Support processing of long texts exceeding 77 tokens.
- Mechanism: Adopt the positional encoding interpolation technique from Long-CLIP to extend the maximum sequence length of the text encoder.
- Design Motivation: The 77-token limit of the original CLIP cannot handle multi-sentence long descriptions.
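
To make the idea of positional encoding interpolation concrete, here is a generic linear-resampling sketch. It is an illustration only: Long-CLIP's actual scheme may differ in its details (for example in which positions are kept unchanged), and the function name and toy embedding table are hypothetical.

```python
def interpolate_positions(pos_emb, new_len):
    """Linearly resample a list of position vectors to new_len entries."""
    old_len = len(pos_emb)
    if new_len == 1:
        return [list(pos_emb[0])]
    stretched = []
    for k in range(new_len):
        # Map the new index onto the original [0, old_len - 1] axis.
        t = k * (old_len - 1) / (new_len - 1)
        lo = int(t)
        hi = min(lo + 1, old_len - 1)
        frac = t - lo
        stretched.append(
            [(1 - frac) * a + frac * b for a, b in zip(pos_emb[lo], pos_emb[hi])]
        )
    return stretched

# Toy table with 3 positions of dimension 2, stretched to 5 positions.
original = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
longer = interpolate_positions(original, 5)
```

The first and last positions are preserved and new positions are blends of their neighbours, so the pretrained ordering information is reused rather than learned from scratch.
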

Loss & Training¶

Total loss: \(\mathcal{L}_{total} = \lambda_{global}\mathcal{L}_{global} + \lambda_{local}\mathcal{L}_{local} + \lambda_{TSL}\mathcal{L}_{TSL}\)
- Weights: \(\lambda_{global}=1\), \(\lambda_{local}=0.5\), \(\lambda_{TSL}=1\).
- \(\mathcal{L}_{global}\) and \(\mathcal{L}_{local}\) are standard CLIP contrastive losses, computed on the global pairs and the local pseudo-pairs respectively.
- \(\mathcal{L}_{TSL}\) uses an MSE loss to pull the projected token representation toward the corresponding local CLS embedding.
- Training: 10 epochs with a batch size of 16, taking about 1 hour on a single RTX 4090.
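
The weighted sum above can be sketched as follows. This is a minimal pure-Python illustration rather than the training code: `clip_contrastive_loss` is the usual symmetric InfoNCE over a square image-text similarity matrix, the temperature of 0.07 is an assumption (it is not stated in this summary), and the loss values fed to `total_loss` are placeholders.

```python
import math

def clip_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over a square similarity matrix (rows: images, cols: texts)."""
    n = len(sim)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # the matching pair sits on the diagonal
        return total / n

    columns = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy(sim) + cross_entropy(columns))

def total_loss(l_global, l_local, l_tsl, w_global=1.0, w_local=0.5, w_tsl=1.0):
    """Weighted sum with the lambda values listed above as defaults."""
    return w_global * l_global + w_local * l_local + w_tsl * l_tsl

# Placeholder values: an uninformative 2x2 similarity matrix and made-up local/TSL losses.
l_global = clip_contrastive_loss([[0.5, 0.5], [0.5, 0.5]])
combined = total_loss(l_global, l_local=2.0, l_tsl=0.5)
```

With an uninformative similarity matrix the contrastive term equals \(\ln 2\) for a batch of two, which is a convenient sanity check for the implementation.
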

Key Experimental Results¶

Main Results¶

DOCCI Dataset (Original Test Set, ViT-L/14)

Method	T2I R@1	T2I R@5	I2T R@1	I2T R@5
Global fine-tuning	74.00	93.84	73.55	93.94
Local fine-tuning	67.39	90.67	66.33	90.41
w/o TSL	74.75	94.31	74.55	94.37
GOAL	84.37	97.55	82.57	97.37

DCI Dataset (Original Test Set, ViT-L/14)

Method	T2I R@1	T2I R@5	I2T R@1	I2T R@5
Global fine-tuning	65.73	84.24	65.73	86.04
GOAL	76.89	91.05	76.59	91.20
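
For readers unfamiliar with the metric, the R@K columns in the tables above report the percentage of queries whose correct match appears among the top K retrieved items (T2I: text query retrieving images; I2T: image query retrieving texts). A minimal sketch, assuming each query has exactly one ground-truth item stored at the same index; the function name and the toy similarity matrix are hypothetical.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose ground-truth item (same index) ranks in the top k."""
    hits = 0
    for query, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)

# Toy query-by-candidate similarity matrix; the correct match lies on the diagonal.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],  # this query ranks its true match last
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, 1)
r3 = recall_at_k(sim, 3)
```

Transposing the similarity matrix switches between the two retrieval directions.
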

Ablation Study¶

DOCCI Dataset (ViT-B/16)

Setting	Global	Local	TSL	T2I R@1	I2T R@1
Global Only	✓			72.41	72.04
Local Only		✓		65.82	65.73
Global + Local	✓	✓		72.08	71.80
GOAL (Full)	✓	✓	✓	79.47	79.43

Key Findings¶

TSL is the core contribution: Simply adding local contrastive loss (w/o TSL) yields almost no improvement (ViT-L/14) or even a slight decline (ViT-B/16), but adding TSL leads to a significant performance leap (ViT-L/14 DOCCI: T2I R@1 rises from 74.75 to 84.37, i.e. +9.62 points or a 12.87% relative gain).
Local alignment cannot be used in isolation: Using only local contrastive loss results in performance significantly lower than the global baseline due to the loss of global context.
Equally effective on joint Global-Local test sets: GOAL also significantly outperforms baselines on test sets containing both global and local queries (mAP@10 increases by 3-6 percentage points).
Resource-friendly: Training ViT-B/16 can be completed on a single RTX 4090 GPU in approximately 1 hour.
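
The 12.87% figure quoted for TSL is a relative gain rather than a difference in percentage points, which is easy to check against the tables. The small helper below (a hypothetical name, shown only to make the arithmetic explicit) computes both readings from the reported T2I R@1 values.

```python
def gains(before, after):
    """Return (absolute gain in points, relative gain in percent), rounded to 2 decimals."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 2)

# DOCCI, ViT-L/14, T2I R@1: "w/o TSL" vs. full GOAL (values from the main results table).
vit_l = gains(74.75, 84.37)
# DOCCI, ViT-B/16, T2I R@1: "Global + Local" vs. full GOAL (values from the ablation table).
vit_b = gains(72.08, 79.47)
```

For ViT-L/14 this gives 9.62 points absolute and 12.87% relative; for ViT-B/16 it gives 7.39 points absolute and 10.25% relative.
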

Highlights & Insights¶

Simple yet effective: The method design is remarkably simple, substantially boosting CLIP's ability to handle long texts using only two modules: the LISM pipeline and the TSL loss.
No extra annotations: Automatically constructs local pseudo-pairs utilizing SAM and CLIP's inherent capabilities without requiring manual annotations.
Key insight of TSL: Instead of merely adding contrastive learning between global and local views, it forces a subset of global Tokens to "approximate" the corresponding local representation. This Token-level alignment strategy is significantly more effective than CLS-level local contrastive learning.
Long-text evaluation: Evaluates long-text image-text retrieval on three datasets (DOCCI, DCI, Urban1k) and additionally reports results on test sets containing both global and local queries.

Limitations & Future Work¶

The LISM pipeline only selects a single best local matched pair, potentially losing other valuable local correspondences.
It relies heavily on SAM for image segmentation, making the quality of local pairs directly dependent on the segmentation quality.
Validation is limited to only three datasets, with relatively small data scales (e.g., DOCCI has only about 10k samples).
Lacks a systematic comparison with large-scale pre-training methods (e.g., CLOC).
Positional encoding interpolation may have limited efficacy on extremely long texts.

Related Work & Insights¶

CLIP/Long-CLIP: The base models and direct competitors of this work.
CLOC: A large-scale pre-training method that establishes local correspondences via OWL-v2 detectors but requires 2 billion image-text pairs.
SAM: Provides strong zero-shot image segmentation capabilities, serving as a critical dependency for the LISM pipeline.
Insight: The key to local semantic alignment lies not in learning local features independently, but in how to effectively propagate local information into global representations.

Rating¶

Novelty: ⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐
Value: ⭐⭐⭐⭐