VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions¶
Conference: NeurIPS 2025 arXiv: 2508.02329 Code: None Area: Robotics Keywords: CLIP, fine-grained visual understanding, instruction editing data, hard negatives, long captions
TL;DR¶
This paper proposes the CLIP-IN framework, which mines instruction editing datasets as a natural source of hard negatives and incorporates long captions to enhance CLIP's fine-grained visual understanding. The approach achieves significant improvements on benchmarks such as MMVP without compromising zero-shot performance, and when its encoder is integrated into MLLMs, it substantially reduces visual hallucinations.
Background & Motivation¶
Vision-language models such as CLIP excel at coarse-grained image-text alignment but exhibit notable deficiencies in fine-grained visual understanding:
Coarse-grained alignment: CLIP's contrastive learning tends to capture high-level semantic alignment while neglecting subtle differences.
Short-text limitation: Standard CLIP relies on short textual descriptions, discarding rich semantic details.
Lack of hard negatives: Training data lacks image-text pairs that are highly similar yet semantically distinct.
The paper introduces two core innovations:
- Exploiting image editing instruction datasets as a natural source of hard negatives.
- Incorporating long captions with rotary position embeddings (RoPE) to capture richer semantics.
Method¶
Overall Architecture¶
CLIP-IN comprises two core innovations:
1. Hard-negative contrastive learning based on instruction editing data.
2. Integration of long captions with rotary position embedding.
Key Designs¶
**Instruction Editing Data as Hard Negatives**

- Existing image editing datasets (e.g., InstructPix2Pix) are reused as training data.
- Pre- and post-edit image pairs naturally constitute hard negative samples: visually almost identical, semantically distinct.
- Example: an original image (cat on a red sofa) vs. its edited counterpart (cat on a blue sofa).
- The paired editing instructions provide precise descriptions of the semantic differences, as illustrated in the sketch below.
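The data construction is straightforward. Below is a minimal sketch, not the authors' released code (none is available), of how each edit pair yields two hard-negative pairs; the record field names (`source_image`, `edited_image`, etc.) are illustrative assumptions about the dataset schema.

```python
# Sketch: mining hard negatives from an instruction-editing dataset.
# Field names are assumed, not taken from the paper.
from dataclasses import dataclass

@dataclass
class HardNegativePair:
    image: str        # path to the anchor image
    caption_pos: str  # caption that matches the image
    caption_neg: str  # caption of the minimally edited counterpart

def mine_hard_negatives(records):
    """Each edit yields two pairs: the source image with the edited caption
    as a hard negative, and the edited image with the source caption."""
    pairs = []
    for r in records:
        pairs.append(HardNegativePair(r["source_image"], r["source_caption"], r["edited_caption"]))
        pairs.append(HardNegativePair(r["edited_image"], r["edited_caption"], r["source_caption"]))
    return pairs

# Example record, following the red-sofa/blue-sofa example above:
record = {
    "source_image": "cat_red_sofa.jpg",
    "edited_image": "cat_blue_sofa.jpg",
    "source_caption": "a cat sitting on a red sofa",
    "edited_caption": "a cat sitting on a blue sofa",
}
print(mine_hard_negatives([record]))
```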
**Symmetric Hard-Negative Contrastive Loss**

- The model is trained not only to match correct image-text pairs but also to distinguish the subtle differences introduced by edits.
- The loss is applied symmetrically in both the image→text and text→image directions:

  \(\mathcal{L}_{\text{HN}} = -\log \frac{e^{s(I, T^+)}}{e^{s(I, T^+)} + \sum_k e^{s(I, T_k^-)}} - \log \frac{e^{s(T, I^+)}}{e^{s(T, I^+)} + \sum_k e^{s(T, I_k^-)}}\)

  where \(T_k^-\) are the captions of edited counterparts and \(I_k^-\) the corresponding edited images. A sketch of this loss follows below.
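A minimal PyTorch sketch of this loss, illustrating the formula rather than reproducing the paper's implementation. Embeddings are assumed L2-normalized, and the temperature is an assumed hyperparameter (the formula above omits it).

```python
import torch
import torch.nn.functional as F

def hard_negative_loss(anchor, pos, negs, temperature=0.07):
    """One direction of L_HN: -log( e^{s(a,p)} / (e^{s(a,p)} + sum_k e^{s(a,n_k)}) ).

    anchor, pos: (B, D) L2-normalized embeddings of matched pairs.
    negs:        (B, K, D) K hard-negative embeddings per anchor.
    """
    s_pos = (anchor * pos).sum(-1, keepdim=True) / temperature      # (B, 1)
    s_neg = torch.einsum("bd,bkd->bk", anchor, negs) / temperature  # (B, K)
    logits = torch.cat([s_pos, s_neg], dim=-1)                      # (B, 1+K)
    target = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, target)  # the positive sits at index 0

def symmetric_hn_loss(img, txt, txt_neg, img_neg):
    """Symmetric combination: image->text uses edited captions as negatives,
    text->image uses edited images as negatives."""
    return hard_negative_loss(img, txt, txt_neg) + hard_negative_loss(txt, img, img_neg)
```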
**Long Captions + Rotary Position Embedding (RoPE)**

- Detailed long-form textual descriptions (typically 100–300 tokens) are introduced.
- The standard CLIP text encoder is limited to 77 tokens by its fixed-length learned position embeddings.
- RoPE is applied to extend the usable context length while preserving positional awareness (see the sketch below).
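Below is a minimal sketch of the standard RoPE formulation applied to query/key vectors in an attention layer, showing why it extends naturally past the 77-token limit: the rotation is computed on the fly from the position index rather than looked up in a fixed-size table. The paper's exact integration into the CLIP text encoder may differ.

```python
import torch

def rope(x, base=10000.0):
    """Rotary position embedding for x of shape (batch, seq_len, dim), dim even.
    Rotates each feature pair by a position-dependent angle, so attention
    scores depend on relative token offsets."""
    b, n, d = x.shape
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(-1)           # (n, 1)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angles = pos * freqs                                               # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# The same encoder can consume a 77-token short caption or a 300-token long one:
q = rope(torch.randn(2, 300, 64))  # queries for a 300-token caption
```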
Loss & Training¶
- \(\mathcal{L}_{\text{CLIP}}\): Standard contrastive loss (preserves zero-shot capability).
- \(\mathcal{L}_{\text{HN}}\): Hard-negative contrastive loss (improves fine-grained understanding).
- \(\mathcal{L}_{\text{long}}\): Long-caption alignment loss.

The three terms are combined into a single training objective; a natural form is the weighted sum \(\mathcal{L} = \mathcal{L}_{\text{CLIP}} + \lambda_1 \mathcal{L}_{\text{HN}} + \lambda_2 \mathcal{L}_{\text{long}}\), with the weights as hyperparameters (not specified in this summary). A sketch of this combination follows below.
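A hedged sketch of the combined objective, assuming the weighted-sum form above; `symmetric_hn_loss` is the function from the Key Designs sketch, and the weights and temperature are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def clip_loss(img, txt, temperature=0.07):
    """Standard symmetric InfoNCE over in-batch negatives (embeddings L2-normalized)."""
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def clip_in_objective(img, txt, txt_long, txt_neg, img_neg,
                      lambda_hn=1.0, lambda_long=1.0):
    """Weighted combination of the three terms described above."""
    return (clip_loss(img, txt)                                          # zero-shot term
            + lambda_hn * symmetric_hn_loss(img, txt, txt_neg, img_neg)  # fine-grained term
            + lambda_long * clip_loss(img, txt_long))                    # long-caption term
```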
Key Experimental Results¶
Main Results (Fine-Grained Visual Understanding)¶
| Method | MMVP ↑ | Winoground ↑ | ARO-Relation ↑ | SugarCrepe ↑ | IN-1K Zero-Shot ↑ |
|---|---|---|---|---|---|
| CLIP (ViT-L/14) | 28.5 | 35.2 | 62.8 | 75.3 | 75.5 |
| CLIP + NegCLIP | 32.1 | 38.5 | 68.5 | 78.2 | 74.8 |
| CLIP + SigLIP | 30.8 | 37.1 | 66.2 | 77.5 | 76.2 |
| CLIP + DAC | 35.2 | 40.8 | 70.5 | 80.1 | 74.5 |
| CLIP-IN | 42.8 | 46.5 | 75.8 | 84.5 | 75.8 |
MLLM Integration Results¶
| Visual Encoder | LLaVA-1.5 MMVP ↑ | LLaVA Hallucination Rate (%) ↓ | POPE Acc ↑ | MMBench ↑ |
|---|---|---|---|---|
| CLIP-ViT-L | 32.5 | 45.2 | 83.5 | 64.8 |
| SigLIP | 35.8 | 42.1 | 85.2 | 66.5 |
| CLIP-IN | 45.2 | 32.5 | 88.8 | 68.2 |
Ablation Study¶
| Component | MMVP ↑ | Winoground ↑ | IN-1K ↑ |
|---|---|---|---|
| CLIP-IN (full) | 42.8 | 46.5 | 75.8 |
| w/o hard negatives | 33.5 | 39.2 | 76.1 |
| w/o long captions | 38.2 | 43.1 | 75.5 |
| w/o RoPE | 36.5 | 41.8 | 75.2 |
| Random negatives (non-instruction editing) | 35.8 | 40.5 | 75.5 |
| Short captions only | 37.2 | 42.5 | 76.0 |
Key Findings¶
- CLIP-IN achieves a 14.3-point improvement on MMVP (28.5 → 42.8), demonstrating the substantial benefit of the hard-negative strategy.
- Instruction editing data as hard negatives substantially outperforms random negatives (+7.0 MMVP).
- Notably, zero-shot ImageNet accuracy improves slightly rather than degrading (75.5 → 75.8), indicating that the fine-grained gains do not come at the cost of general-purpose capability.
- When integrated into MLLMs, the visual hallucination rate drops from 45.2% to 32.5%, demonstrating significant practical value.
Highlights & Insights¶
- Data source innovation: Repurposing image editing data as a low-cost source of high-quality hard negatives is a resourceful contribution.
- Preserved generality: Simultaneously improving fine-grained understanding while maintaining zero-shot capability is a particularly noteworthy achievement.
- Downstream value: The marked reduction in MLLM hallucinations directly enhances practical application quality.
- RoPE extension: An elegant solution to CLIP's text length limitation.
Limitations & Future Work¶
- Instruction editing datasets primarily cover visual attribute editing, with insufficient coverage of abstract conceptual differences.
- Long caption generation relies on external models, which may introduce noise.
- Although RoPE extends context length, its effectiveness for very long texts (>1000 tokens) remains unverified.
- Training cost is higher than standard CLIP fine-tuning.
Related Work & Insights¶
- NegCLIP: A pioneer in enhancing CLIP understanding via negative sample augmentation.
- InstructPix2Pix: The image editing dataset used as a data source in this work.
- SigLIP: Google's sigmoid-loss reformulation of image-text contrastive pretraining.
- DAC: Caption-augmented contrastive learning for improved compositional understanding.
Rating¶
| Dimension | Score (1–5) |
|---|---|
| Novelty | 4 |
| Theoretical Depth | 3 |
| Experimental Thoroughness | 5 |
| Writing Quality | 4 |
| Practical Value | 5 |
| Overall Recommendation | 4.5 |