VladVA: Discriminative Fine-tuning of LVLMs¶
Conference: CVPR 2025
arXiv: 2412.04378
Code: None
Area: Multimodal VLM
Keywords: Discriminative fine-tuning, LVLM, Contrastive learning, Image-text retrieval, Compositional understanding
TL;DR¶
VladVA is a framework that turns generative LVLMs (LLaVA) into strong discriminative models through three components: a hybrid short/long caption data strategy, joint training with contrastive and autoregressive losses, and parameter-efficient adaptation with soft prompts and LoRA. On zero-shot image-text retrieval and compositional understanding benchmarks, the 7B model clearly outperforms CLIP-based baselines and matches or exceeds the 18B EVA-CLIP on most metrics.
Background & Motivation¶
Current vision-language models fall into two main paradigms, each with its own shortcomings:
Contrastive VLMs (e.g., CLIP): Strong discriminative capability but limited language understanding. They exhibit "bag of words" behavior: scrambling the word order of a caption barely changes the matching score, which leads to poor performance on compositional understanding (spatial relations, attribute binding). Scaling up models and datasets does not fundamentally resolve this.
Generative LVLMs (e.g., LLaVA): Combine a vision encoder with an LLM and have strong reasoning and fine-grained understanding capabilities. However, their autoregressive training paradigm makes them ill-suited to direct use on discriminative tasks such as image-text retrieval.
Core Problem: Can the advantages of both be combined? The authors find that LVLMs already have zero-shot discriminative capability (a summary token can be extracted with a specific prompt), but its performance is far below CLIP. Prior work such as E5-V suggested that image-text contrastive fine-tuning is harmful; VladVA shows that this conclusion does not hold when the framework is designed carefully.
Method¶
Overall Architecture¶
VladVA adopts a two-tower architecture. On the image side, the image is processed through the complete LVLM (vision encoder + projection layer + LLM) to obtain the image embedding \(\mathbf{f}_v\), using the hidden state of the last token as the summary token. On the text side, the text is processed through the LLM to obtain the text embedding \(\mathbf{f}_t\). The image-text similarity is the cosine similarity between \(\mathbf{f}_v\) and \(\mathbf{f}_t\). Contrastive training is performed on short captions and autoregressive training on long captions; soft prompts and LoRA keep the adaptation parameter-efficient.
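
As a rough illustration of the scoring step described above, the sketch below takes the last token's hidden state as the summary embedding and compares the two towers with cosine similarity. The helper names and the toy hidden states are hypothetical; a real system would obtain the hidden states from the LVLM's final layer.

```python
import math

def last_token_embedding(hidden_states):
    """Return the hidden state of the final token as the summary embedding.

    hidden_states: list of per-token vectors, standing in for the
    last-layer outputs of the LLM (toy data here, not real model outputs).
    """
    return hidden_states[-1]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy hidden states: 3 tokens on the image side, 2 on the text side, 4 dims each.
image_hidden = [[0.1, 0.2, 0.0, 0.3], [0.5, 0.1, 0.2, 0.0], [1.0, 0.0, 1.0, 0.0]]
text_hidden = [[0.3, 0.3, 0.3, 0.3], [2.0, 0.0, 2.0, 0.0]]

f_v = last_token_embedding(image_hidden)  # image embedding
f_t = last_token_embedding(text_hidden)   # text embedding
score = cosine_similarity(f_v, f_t)
```

Because cosine similarity ignores vector length, the two toy embeddings above score as a perfect match even though their norms differ.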
Key Designs¶
- Data Strategy: Collaboration of Short/Long Captions
    - Function: Enables the model to learn coarse-grained and fine-grained image-text matching at the same time.
    - Mechanism: Training data is split by caption length into short captions (<30 tokens, headline-level) and long captions (30-500 tokens, detailed description-level). Short captions are used for contrastive learning, which teaches high-level image-text matching; long captions are used for autoregressive learning, which teaches fine-grained details and compositional relationships. Images that lack one of the two caption types are supplemented with short captions generated by BLIP2 and long captions generated by ShareGPT-4V.
    - Design Motivation: Training directly on long captions with a contrastive loss causes collapse: long captions are so specific that they offer almost no hard negatives, which drives the loss to zero within a few hundred iterations. Splitting the data by length and pairing each part with a different loss resolves this conflict.
- Hybrid Training Loss: Contrastive + Autoregressive
    - Function: Strengthens discriminative capability and language understanding together in a unified framework.
    - Mechanism: The contrastive loss \(\mathcal{L}_c = \frac{1}{b}\sum_{k=1}^{b}\left(-\log\frac{\exp(s_v^{k,k})}{\sum_j\exp(s_v^{k,j})} - \log\frac{\exp(s_t^{k,k})}{\sum_j\exp(s_t^{j,k})}\right)\) aligns the image and text summary tokens on short captions. The autoregressive loss \(\mathcal{L}_{CE} = -\sum_{i=1}^{L}\log p_\theta(u_i \mid \mathbf{x}_v, \mathbf{x}_p^v, \mathbf{x}_{q,<i}^{long})\) is the negative log-likelihood of the long caption, predicted token by token.
    - Design Motivation: The autoregressive loss provides three key advantages: (a) token-by-token prediction is a challenging task that prevents training collapse; (b) the prediction process encourages the summary token to compress more information; (c) it preserves the model's generative capability.
- Parameter-Efficient Adaptation: Soft Prompting + LoRA
    - Function: Low-cost fine-tuning of the LVLM.
    - Mechanism: Learnable vectors replace the token embeddings of the handcrafted prompts and are initialized from those embeddings. Separate soft prompts are used for the image and text modalities. LoRA adapters (rank=16, \(\alpha\)=16) are added to the linear layers of the LLM. Analysis shows that the trained soft prompt decodes into a sentence with virtually unchanged semantics; only the start and end boundary characters are modified.
    - Design Motivation: The critical role of soft prompts is not to alter semantics but to "mark which token should aggregate discriminative information". LoRA supplements the limited representation capacity of soft prompts.
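
The two loss terms described under the hybrid training loss can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses a single similarity matrix (and its transpose) for both retrieval directions, assumes any temperature scaling is already folded into the similarities, and adds the two terms without weighting. All function names are hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(sim):
    """Symmetric contrastive loss over a b x b similarity matrix.

    sim[k][j] is the similarity of image k and short caption j;
    matching pairs sit on the diagonal. (Assumption: one matrix serves
    both directions and is already scaled by any temperature.)
    """
    b = len(sim)
    sim_t = [list(col) for col in zip(*sim)]
    total = 0.0
    for k in range(b):
        total -= log_softmax(sim[k])[k]    # image k against all captions
        total -= log_softmax(sim_t[k])[k]  # caption k against all images
    return total / b

def autoregressive_loss(token_log_probs):
    """Negative log-likelihood summed over the tokens of one long caption.

    token_log_probs[i] stands in for log p(u_i | image, prompt, u_<i).
    """
    return -sum(token_log_probs)

def total_loss(sim, token_log_probs):
    """Unweighted sum of the two terms (assumption: no loss weights)."""
    return contrastive_loss(sim) + autoregressive_loss(token_log_probs)
```

With an uninformative similarity matrix (all entries equal) and batch size 2, the contrastive term equals 2·ln 2, and it shrinks as the diagonal entries grow relative to the off-diagonal ones.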
Behavioral Change Analysis¶
After training, the model exhibits three key behavioral changes: (1) the attention maps between summary tokens and vision tokens become denser—while the generative mode allows the model to "peek step-by-step", the discriminative mode requires it to "read everything at once"; (2) the entropy of the output distribution increases, showing that summary tokens encode richer information; (3) the cumulative variance of the embedding matrix becomes more distributed, reflecting fuller utilization of the embedding space, which corresponds to a higher matrix rank.
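
Two of the diagnostics mentioned above, output entropy and cumulative variance, can be illustrated with a short sketch. It assumes the singular values of the embedding matrix have already been computed elsewhere (for example with an SVD routine); the function names and toy inputs are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cumulative_variance(singular_values):
    """Fraction of total variance captured by the top-k directions, for each k.

    singular_values: singular values of the (centered) embedding matrix,
    assumed to be precomputed. A curve that rises slowly means the variance
    is spread over many directions, i.e. a higher effective rank.
    """
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    curve, running = [], 0.0
    for e in energies:
        running += e
        curve.append(running / total)
    return curve

peaked = [0.97, 0.01, 0.01, 0.01]  # low-entropy output distribution
flat = [0.25, 0.25, 0.25, 0.25]    # high-entropy output distribution

concentrated = cumulative_variance([3.0, 0.1, 0.1, 0.1])  # one dominant direction
spread = cumulative_variance([1.0, 1.0, 1.0, 1.0])        # variance shared evenly
```

In this toy example, the first direction of `concentrated` already explains more than 99% of the variance, while `spread` needs all four directions to reach 100%.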
Loss & Training¶
Total loss = short caption contrastive loss + long caption autoregressive loss, jointly optimized within each batch. Training runs for 7 epochs with a batch size of 1024, using the AdamW optimizer with a learning rate of \(10^{-4}\) and a cosine schedule, on up to 32 A100 GPUs. The training set contains approximately 8.1M samples (OpenImages 4M + CC3M 2.8M + ShareGPT-4V 1.3M).
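
Since each batch mixes the two objectives, every caption has to be routed to one of them by length. The sketch below illustrates that routing with the thresholds stated in the data strategy (<30 tokens short, 30-500 tokens long). Whitespace splitting is only a stand-in for the model's tokenizer, and the handling of captions longer than 500 tokens is an assumption, since the summary does not say what happens to them.

```python
def route_caption(caption, short_max=30, long_max=500):
    """Pick the training objective for a caption based on its length.

    Token counts use whitespace splitting as a stand-in for the real
    tokenizer (assumption). Captions above long_max are flagged rather
    than assigned, because their treatment is not specified.
    """
    n_tokens = len(caption.split())
    if n_tokens < short_max:
        return "contrastive"
    if n_tokens <= long_max:
        return "autoregressive"
    return "out_of_range"

def split_batch(captions):
    """Group captions by the objective they would feed."""
    groups = {"contrastive": [], "autoregressive": [], "out_of_range": []}
    for caption in captions:
        groups[route_caption(caption)].append(caption)
    return groups

batch = ["a dog running on a beach", " ".join(["word"] * 120)]
groups = split_batch(batch)
```

Here the six-word caption lands in the contrastive group and the 120-word caption in the autoregressive group.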
Key Experimental Results¶
Main Results (Zero-shot Image-Text Retrieval R@1)¶
| Method | Parameters | Flickr IR | COCO IR | nocaps IR | Flickr TR | COCO TR |
|---|---|---|---|---|---|---|
| CLIP (ViT-L) | 0.43B | 67.3 | 37.0 | 48.6 | 87.2 | 58.1 |
| EVA-CLIP (18B) | 18B | 83.3 | 55.6 | 69.3 | 95.3 | 72.8 |
| E5-V (8B) | 8.36B | 79.5 | 52.0 | 65.9 | 88.2 | 62.0 |
| VladVA (7B) | 7.06B | 85.0 | 59.0 | 72.3 | 94.3 | 72.9 |
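
For readers unfamiliar with the metric in the table, R@1 is the percentage of queries whose top-ranked candidate is the correct match. The sketch below computes it from a similarity matrix under a simplifying assumption: each query has exactly one correct candidate at the same index. Real benchmarks such as Flickr and COCO pair each image with several captions, so actual evaluation code needs a ground-truth mapping instead. The function names and toy scores are hypothetical.

```python
def recall_at_1(sim):
    """Percentage of queries whose highest-scoring candidate is the match.

    sim[q][c] is the similarity of query q to candidate c; the correct
    candidate for query q is assumed to be candidate q.
    """
    hits = 0
    for q, row in enumerate(sim):
        best = max(range(len(row)), key=lambda c: row[c])
        if best == q:
            hits += 1
    return 100.0 * hits / len(sim)

def transpose(sim):
    """Swap the roles of queries and candidates."""
    return [list(col) for col in zip(*sim)]

# Toy scores: sim[i][j] = similarity of caption i to image j.
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.3, 0.5],
]
image_retrieval_r1 = recall_at_1(sim)            # text query -> image (IR)
text_retrieval_r1 = recall_at_1(transpose(sim))  # image query -> text (TR)
```

In the toy matrix, the third caption retrieves the wrong image, so image retrieval scores two hits out of three while text retrieval gets all three right.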
Compositional Understanding (SugarCrepe)¶
| Category | VladVA | EVA-CLIP(18B) | E5-V(8B) | CLIP(ViT-L) | Gain vs. EVA |
|---|---|---|---|---|---|
| Object Swap | 79.0 | 65.3 | 75.0 | 60.2 | +13.7 |
| Attribute Swap | 82.9 | 76.0 | 70.1 | 62.3 | +6.9 |
| Relation Replace | 86.8 | 76.1 | 85.3 | 65.2 | +10.7 |
| Attribute Add | 95.8 | 85.0 | 83.5 | 71.5 | +10.8 |
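
Compositional benchmarks of this kind are typically scored by checking whether the model ranks the correct caption above a minimally edited hard negative for the same image. The sketch below shows that scoring rule on made-up similarity scores; the function name and the numbers are hypothetical and do not come from the paper.

```python
def compositional_accuracy(score_pairs):
    """Percentage of examples where the true caption beats the hard negative.

    score_pairs: list of (positive_score, negative_score) tuples, where each
    score is the image-text similarity for the same image.
    """
    correct = sum(1 for pos, neg in score_pairs if pos > neg)
    return 100.0 * correct / len(score_pairs)

# Toy scores: (similarity to correct caption, similarity to swapped caption).
pairs = [(0.8, 0.6), (0.5, 0.7), (0.9, 0.1), (0.4, 0.3)]
accuracy = compositional_accuracy(pairs)
```

A pure "bag of words" model would assign nearly identical scores to a caption and its word-swapped negative, which pushes this accuracy toward chance (50%).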
Ablation Study (1M Sample Training)¶
| Configuration | SugarCrepe(Rep/Swp/Add) | Flickr T2I/I2T | Description |
|---|---|---|---|
| Original LLaVA | 81.9/59.8/64.7 | 59.6/65.6 | Baseline without adaptation |
| +soft prompt | 86.4/66.9/89.3 | 76.7/91.7 | Prompt alone is highly effective |
| +LoRA | 87.0/69.8/88.8 | 79.1/91.4 | LoRA has larger capacity |
| +Combining both | 87.1/72.0/88.6 | 79.6/92.9 | Mutually beneficial improvement |
| +AR Loss | 89.5/75.5/89.5 | 80.6/91.8 | AR loss lifts all compositional scores and T2I; I2T dips slightly |
| Data 1M→8.1M | Continuous improvements | No signs of saturation | Large scaling potential |
Key Findings¶
- The 7B VladVA outperforms the 18B EVA-CLIP on image retrieval R@1: 85.0 vs 83.3 on Flickr and 59.0 vs 55.6 on COCO.
- Object Swap shows the largest gain over EVA-CLIP (+13.7 points). Because this category directly probes "bag of words" behavior, the gain indicates that VladVA substantially mitigates a fundamental limitation of the CLIP family.
- Contrastive loss and autoregressive loss play complementary roles: removing the AR loss leads to a severe drop in compositional understanding, while removing the contrastive loss degrades retrieval performance.
- Performance continuously scales from 1M to 8.1M without saturation, indicating strong scaling potential.
- The Qwen2-VL-2B version is also effective (Flickr IR improves from 54.1 to 80.4), showing the robust generalization of the framework.
Highlights & Insights¶
- Overturns a key conclusion from E5-V: Demonstrates that image-text contrastive fine-tuning is not harmful, but rather critical to unlocking the discriminative capability of LVLMs—provided that a reasonable data strategy and autoregressive loss are deployed.
- Insightful analysis of "what makes a good prompt": a high-entropy output distribution leads to a high-rank embedding matrix, resulting in better retrieval performance.
- Explanatory power of attention densification: Elegantly explains why LVLMs require specialized training—the generative mode allows the model to "peek step-by-step", whereas the discriminative mode requires it to "read everything at once".
- Efficiency story of 7B outperforming 18B: Better training strategies, not larger models, are what is needed.
Limitations & Future Work¶
- Main experiments are only based on LLaVA-1.5-7B; the effects on larger LVLMs (13B/70B) have not been verified.
- Text R@1 on Flickr is slightly lower than that of EVA-CLIP(18B) (94.3 vs 95.3), indicating room for optimization on the text side.
- Training cost is non-negligible (up to 32 A100 GPUs), which raises the barrier to full reproduction.
- Integration of E5-V's text-text contrastive loss with the VladVA framework was not explored (mentioned as future work in the paper).
- Every image embedding and every text embedding requires a forward pass through the LVLM, so inference is considerably more expensive than with CLIP.
Related Work & Insights¶
- Direct comparison with E5-V: E5-V relies solely on a text-text contrastive loss, whereas VladVA shows that an image-text contrastive loss combined with the AR loss is superior.
- Comparison with VLM2Vec (which lacks the generative loss and soft prompts): VladVA substantially outperforms it under the same settings.
- The paradigm of "transforming generative models into discriminative ones" could potentially be applied to other generative models (such as diffusion for retrieval).
- The divide-and-conquer strategy utilizing short/long captions is transferable to other vision-language tasks requiring multi-grained supervision.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Extremely novel framework for transforming LVLMs into discriminative models; the hybrid loss and data strategy are meticulously designed.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Highly comprehensive evaluation across retrieval, compositionality, ablations, data scaling, prompt analysis, and attention visualization.
- Writing Quality: ⭐⭐⭐⭐⭐ Thorough motivational analysis, with strong experimental and theoretical support for each design component.
- Value: ⭐⭐⭐⭐⭐ Pioneering contribution—proves the immense potential of LVLMs in discriminative tasks, with far-reaching impacts.