FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs

TL;DR

FineLIP enables CLIP models to handle long text descriptions and perform fine-grained visual-textual matching through positional embedding stretching (77 to 248 tokens), an Adaptive Token Refinement Module (ATRM), and a Cross-modal Late Interaction Module (CLIM). It significantly outperforms existing methods such as Long-CLIP and TULIP on long-description retrieval tasks.

Background & Motivation

Two Key Limitations of CLIP:
- Text length limitation: The CLIP text encoder can process at most 77 tokens, preventing it from encoding detailed, rich, long descriptions.
- Global feature alignment: Traditional CLIP only aligns global visual and text features, neglecting local fine-grained information.
Demand for long descriptions: With the advancement of LVLMs (e.g., GPT-4V, LLaVA), detailed image descriptions far exceeding 77 tokens can be generated, containing rich information like color, position, and spatial relations. Existing methods fail to fully utilize these details.
Limitations of Prior Work:
- Long-CLIP: Extends to 248 tokens but only uses global feature alignment, overlooking local details.
- TULIP: Introduces relative positional encodings but is also focused on the global level.
- DreamLIP: Decomposes long descriptions into multiple short descriptions without directly processing long text.
- FILIP/SPARC: Incorporate fine-grained alignment but are only designed for short descriptions.
Key Insight: Long-text encoding and fine-grained alignment must be addressed together; solving only one of them leaves the other limitation in place.

Method

Overall Architecture

FineLIP enhances the pretrained CLIP model through three main steps:
1. Stretching positional embeddings to support long text inputs.
2. Adaptive Token Refinement Module (ATRM) to aggregate visual and textual tokens.
3. Cross-modal Late Interaction Module (CLIM) to achieve token-level fine-grained alignment.

Key Designs

1. Positional Embedding Stretching

Preserves the first 20 positional embeddings (experiments show they are already fully trained during pretraining).
Applies $4\times$ interpolation to the remaining 57 positional embeddings (positions 21 through 77).
Final length: $20 + (77 - 20) \times 4 = 248$ tokens.
Advantage: Preserves the cross-modal alignment capability of the pretrained weights, avoiding training from scratch.
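
The stretching step can be sketched in a few lines. This is a minimal, dependency-free illustration, not the authors' code: the function name `stretch_positional_embeddings`, the toy table, and the choice of plain linear interpolation with clamping at the last row are all assumptions made here for clarity.

```python
def stretch_positional_embeddings(pos_emb, keep=20, factor=4):
    """Keep the first `keep` rows and linearly interpolate the rest by `factor`.

    pos_emb: list of embedding vectors (each a list of floats).
    Returns a new list of length keep + (len(pos_emb) - keep) * factor.
    Linear interpolation is an assumption of this sketch.
    """
    head = [list(row) for row in pos_emb[:keep]]
    tail = pos_emb[keep:]
    last = len(tail) - 1
    stretched = []
    for m in range(len(tail) * factor):
        lo, step = divmod(m, factor)   # source row and offset within it
        hi = min(lo + 1, last)         # clamp at the final source row
        w = step / factor              # interpolation weight toward `hi`
        stretched.append([(1 - w) * a + w * b for a, b in zip(tail[lo], tail[hi])])
    return head + stretched


# Toy table: 77 positions with 2-dimensional embeddings.
toy = [[float(p), float(-p)] for p in range(77)]
long_table = stretch_positional_embeddings(toy)
print(len(long_table))  # 20 + (77 - 20) * 4
```

The first 20 rows are copied unchanged, so short inputs see the positional embeddings they were pretrained with.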

2. Adaptive Token Refinement Module (ATRM)

Design Motivation: Local tokens in the last layer of the Transformer can be ambiguous; direct token-level alignment yields suboptimal results.
Strategy: Aggregation is preferred over selection. Token selection discards the tokens it does not pick, while aggregation lets every input token contribute to the refined set.
Mechanism:
- Input $N$ tokens $\rightarrow$ output $N'$ refined tokens ($N'/N = 0.2$, default aggregation ratio of 20%).
- Transformation matrix $W_{ref} \in \mathbb{R}^{N' \times N}$, learned via a self-attention-like mechanism: $$W_{ref} = \text{SoftMax}\left(\frac{W_q \sigma(X W_k)^T}{\tau}\right)$$
- $\tau$ is a learnable temperature parameter that encourages sparse attention.
Features: Applies aggregation to both visual and textual branches (with independent parameters for each) rather than only refining the visual side.
Preserves global [CLS] and [EOS] tokens, excluding them from aggregation.
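
The aggregation formula can be made concrete with a small sketch that operates on local tokens only (global tokens are assumed to be set aside beforehand). Several details are assumptions of this illustration rather than facts from the paper: `tanh` stands in for the unspecified nonlinearity $\sigma$, the softmax is taken over the $N$ input tokens so that each refined token is a convex combination of inputs, $\tau$ is a fixed number instead of a learned parameter, and the weights are random.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def refine_tokens(x, w_q, w_k, tau=1.0, sigma=math.tanh):
    """x: N x d tokens, w_k: d x h, w_q: N' x h. Returns (W_ref, refined tokens)."""
    keys = [[sigma(v) for v in row] for row in matmul(x, w_k)]    # N x h
    keys_t = [list(col) for col in zip(*keys)]                    # h x N
    logits = matmul(w_q, keys_t)                                  # N' x N
    w_ref = [softmax([v / tau for v in row]) for row in logits]   # each row sums to 1
    return w_ref, matmul(w_ref, x)                                # N' x d


random.seed(0)
n, d, h = 10, 4, 3
n_refined = max(1, round(0.2 * n))  # 20% aggregation ratio
x = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
w_k = [[random.gauss(0, 1) for _ in range(h)] for _ in range(d)]
w_q = [[random.gauss(0, 1) for _ in range(h)] for _ in range(n_refined)]
w_ref, refined = refine_tokens(x, w_q, w_k, tau=0.5)
print(len(refined), len(refined[0]))
```

A smaller `tau` sharpens each row of `w_ref`, which is one way to read the remark that the temperature encourages sparse attention.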

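3. Cross-modal Late Interaction Module (CLIM)

Computes the cosine similarity $S(v'_i, t'_j)$ between each refined visual token $v'_i$ ($i = 1, \dots, P'$) and each refined text token $t'_j$ ($j = 1, \dots, M'$).
Bidirectional MaxSim pooling: $$R(I,T) = \frac{1}{P'}\sum_{i=1}^{P'}\max_j S(v'_i, t'_j) + \frac{1}{M'}\sum_{j=1}^{M'}\max_i S(v'_i, t'_j)$$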
Retains global tokens ([CLS]/[EOS]) for alignment, enabling cross-granularity hybrid alignment.
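
Bidirectional MaxSim can be read as follows: every visual token takes its best-matching text token, every text token takes its best-matching visual token, and the two averages are summed. The sketch below follows that reading; the function names and the toy vectors are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def maxsim_score(visual, text):
    """Bidirectional MaxSim over refined visual and text tokens."""
    sim = [[cosine(v, t) for t in text] for v in visual]          # P' x M'
    i2t = sum(max(row) for row in sim) / len(visual)              # best text per visual token
    t2i = sum(max(col) for col in zip(*sim)) / len(text)          # best visual per text token
    return i2t + t2i


visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
score_full = maxsim_score(visual_tokens, [[1.0, 0.0], [0.0, 2.0]])
score_partial = maxsim_score(visual_tokens, [[1.0, 0.0]])
print(score_full, score_partial)
```

In the partial case the second visual token has no matching text token, so the image-to-text average drops while the text-to-image term stays at its maximum.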

Loss & Training

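Adopts a triplet margin loss (rather than the traditional contrastive loss) with a margin $\alpha = 0.2$:
$$\mathcal{L}_{triplet} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}$$
$$\mathcal{L}_{i2t} = \max(0, R(I_q, T^-) - R(I_q, T^+) + \alpha)$$
$$\mathcal{L}_{t2i} = \max(0, R(I^-, T_q) - R(I^+, T_q) + \alpha)$$
Here $I_q$ and $T_q$ denote the query image and query text, and the superscripts $+$ and $-$ mark matching and non-matching candidates. $\mathcal{L}_{t2i}$ mirrors $\mathcal{L}_{i2t}$ with the roles of image and text swapped.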

This ensures that the similarity of positive sample pairs exceeds that of negative sample pairs by at least a margin of $\alpha$.
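
As a small worked example of the margin behaviour, the sketch below evaluates the loss for one triplet per direction. How negatives are chosen within a batch is not specified above, so the function simply takes the negative scores as inputs; its name and arguments are assumptions of this illustration.

```python
def triplet_loss(pos_score, neg_i2t_score, neg_t2i_score, alpha=0.2):
    """Sum of image-to-text and text-to-image hinge terms for one triplet each.

    pos_score:     R(I_q, T+), which equals R(I+, T_q) for a matched pair
    neg_i2t_score: R(I_q, T-), query image against a non-matching text
    neg_t2i_score: R(I-, T_q), query text against a non-matching image
    """
    loss_i2t = max(0.0, neg_i2t_score - pos_score + alpha)
    loss_t2i = max(0.0, neg_t2i_score - pos_score + alpha)
    return loss_i2t + loss_t2i


# Negative text is well separated (no penalty); negative image is within the margin.
loss = triplet_loss(pos_score=0.9, neg_i2t_score=0.5, neg_t2i_score=0.8)
print(loss)
```

Only the text-to-image term contributes here, because the negative image scores within $\alpha$ of the positive pair.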

Key Experimental Results

Main Results (Urban1k + DOCCI Retrieval)

B/16 Model on Urban1k:

| Method | I2T R@1 | I2T R@5 | T2I R@1 | T2I R@5 |
|---|---|---|---|---|
| Baseline | 0.859 | 0.969 | 0.866 | 0.963 |
| Long-CLIP | 0.789 | - | 0.795 | - |
| TULIP | 0.881 | - | 0.866 | - |
| SPARC | 0.854 | 0.963 | 0.853 | 0.957 |
| LAPS | 0.890 | 0.987 | 0.884 | 0.971 |
| FineLIP | 0.907 | 0.983 | 0.893 | 0.975 |
| FineLIP* | 0.912 | 0.985 | 0.900 | 0.977 |

The improvement is even more significant on the L/14 model, where FineLIP* achieves an I2T R@1 of 0.940.
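
For readers unfamiliar with the R@K columns, the sketch below shows one common way to compute Recall@K from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, located at the same index as the query; the function name and the toy matrix are illustrative and unrelated to the reported numbers.

```python
def recall_at_k(sim, k):
    """sim[q][c] is the similarity of query q to candidate c; the match is c == q."""
    hits = 0
    for q, row in enumerate(sim):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if q in ranked[:k]:
            hits += 1
    return hits / len(sim)


sim = [
    [0.9, 0.1, 0.3],  # query 0: correct candidate ranked first
    [0.2, 0.4, 0.8],  # query 1: correct candidate ranked second
    [0.5, 0.6, 0.7],  # query 2: correct candidate ranked first
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
print(r1, r2)
```

For I2T the queries are images and the candidates are texts; for T2I the matrix is transposed.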

Ablation Study (Tab. 3)

Validates the necessity of the following components:
- A) Positional embedding stretching vs. training from scratch $\rightarrow$ stretching is significantly superior.
- B) ATRM aggregation ratio of 0.2 is optimal.
- C) Dual-branch aggregation (visual + textual) is superior to visual-only aggregation.
- D) Triplet Loss outperforms Contrastive Loss.

Key Findings

FineLIP outperforms Long-CLIP and TULIP across the reported long-description retrieval tasks; on Urban1k with the B/16 model, I2T R@1 rises from 0.789 (Long-CLIP) to 0.907, a gain of about 12 points.
Token refinement in the textual branch contributes significantly: refining only the visual side yields limited gains, while dual-branch refinement yields the largest improvement.
An aggregation ratio of 0.2 ($5\times$ compression) achieves the optimal balance between performance and efficiency.
Triplet Loss is more suitable for fine-grained retrieval scenarios than Contrastive Loss.
In text-to-image generation tasks, the text encoder of FineLIP also demonstrates superior FID and CLIP-Score.
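
To illustrate why a 0.2 aggregation ratio helps efficiency, the arithmetic below compares the number of pairwise similarities before and after refinement. The token counts are hypothetical example values, and rounding up to a whole token is an assumption of this sketch.

```python
import math


def refined_count(n_tokens, ratio=0.2):
    """Number of refined tokens, rounded up (the rounding rule is an assumption)."""
    return max(1, math.ceil(n_tokens * ratio))


n_visual, n_text = 196, 246  # hypothetical local-token counts
pairs_before = n_visual * n_text
pairs_after = refined_count(n_visual) * refined_count(n_text)
print(pairs_before, pairs_after, round(pairs_after / pairs_before, 3))
```

Because both branches are compressed by roughly $5\times$, the late-interaction similarity matrix shrinks by roughly $25\times$.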

Highlights & Insights

Dual-branch token refinement: Unlike methods that only focus on visual tokens (FILIP/SPARC), FineLIP also aggregates textual tokens, which reduces the ambiguity of raw text tokens.
Cross-granularity hybrid alignment: Retains the global [CLS]/[EOS] tokens in the refined set, allowing global-local information to interact within the same framework.
Low overhead, high reward: The parameter count of ATRM is very small (only the $W_q$ and $W_k$ projection matrices), and it reduces the number of subsequent tokens, thus improving efficiency.
Universal enhancement: FineLIP can be plugged into any CLIP variant (B/16, L/14), yielding consistent improvements.

Limitations & Future Work

248-token ceiling: Although extended from 77 to 248, it may still be insufficient for extremely long (e.g., paragraph-level) descriptions.
Training data requirements: Requires long-description datasets (e.g., ShareGPT4V, DOCCI), which are expensive to construct.
Zero-shot classification: The paper mainly evaluates retrieval and generation tasks, and does not report zero-shot classification benchmarks.
Aggregation strategy: The current linear aggregation might lose spatial information in extremely fine-grained scenarios (e.g., small object detection).

Related Work & Insights

Long-CLIP (2024): Pioneer of positional embedding stretching $\rightarrow$ integrated and extended in this work.
FILIP (NeurIPS'21): Token-level similarity alignment $\rightarrow$ performs better after incorporating token aggregation in this work.
ColBERT (Information Retrieval): Late interaction mechanism $\rightarrow$ inspires the MaxSim pooling design of CLIM.
Insights: Improvements in vision-language models rely not only on scaling up, but more importantly, on refining the granularity of alignment and the way information is utilized.

Rating

⭐⭐⭐⭐ — The method is simple and elegant; the combination of dual-branch aggregation and fine-grained alignment is highly practical. The performance improvement in long-description scenarios is significant, consistent, and broadly applicable.