
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval¶

Conference: ACL 2025
arXiv: 2412.14475
Code: To be released (complete dataset, model, and pipeline will be fully open-sourced)
Area: Multimodal VLM / Multimodal Retrieval
Keywords: Multimodal Retrieval, Data Synthesis, Composed Image Retrieval, Contrastive Learning, instruction tuning, Hard Negatives

TL;DR¶

This paper proposes MegaPairs, a data synthesis method that mines matching image pairs from an open-domain image corpus using heterogeneous KNN triplets and uses VLMs/LLMs to generate retrieval instructions, yielding 26 million multimodal training instances. The resulting MMRet model, trained on only 0.5M samples, outperforms MagicLens trained on 36.7M samples (roughly $70\times$ less data), and reports state-of-the-art zero-shot results across 4 CIR benchmarks and the highest overall score on the 36 MMEB datasets.

Background & Motivation¶

Background: The demand for multimodal retrieval is rapidly increasing (e.g., image search, VQA, RAG). Methods based on pre-trained vision-language models (CLIP/ALIGN/SigLIP) have established fundamental capabilities but are limited to text-to-image matching, falling short of handling more complex multimodal tasks.

Limitations of Prior Work: Instruction tuning can enhance multitask capability, but instruction data for multimodal retrieval is extremely scarce. Existing methods (e.g., MagicLens) synthesize data from image pairs co-occurring on the same webpage, which faces four key limitations:
- Poor scalability: Only a small portion of webpages contain multiple images.
- Low quality: Co-occurring images are often irrelevant or highly redundant.
- Insufficient diversity: Relationships between co-occurring images are monotonous (e.g., different angles of the same object).
- Poor accessibility: Large-scale datasets are usually proprietary and private.

Core Idea: Combine an open-domain image corpus (e.g., DataComp), multiple similarity models that mine heterogeneous image pairs, and open-source VLMs/LLMs that generate retrieval instructions, in order to construct large-scale, high-quality, diverse, and publicly available multimodal retrieval training data.

Method¶

Overall Architecture¶

MegaPairs consists of two stages: (1) mining relevant image pairs using three similarity models to discover heterogeneous correlations; (2) generating open-ended retrieval instructions, where a VLM describes the relationship and an LLM refines it into instructions.

Stage 1: Mining Relevant Image Pairs¶

Three similarity models introducing heterogeneous correlations:
Vision-Semantic Correlation (EVA-CLIP Image Encoder): Captures semantically relevant but visually distinct image pairs (e.g., different angles of the same car).
Vision-Pattern Correlation (DINOv2): Captures visually similar but potentially semantically distinct image pairs (e.g., different cars in similar backgrounds).
Caption Correlation (EVA-CLIP Text Encoder): Based on the textual similarity of image captions.
Filtering Strategy: Image pairs with similarity in the range of $(0.8, 0.96)$ are retained to exclude weak correlations and near-duplicates.
Hard Negatives: For each image pair (I_q, I_ti), other images in the retrieval set {I_tj | tj ≠ ti} naturally act as hard negatives, introducing 5 hard negatives for each pair.
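
The filtering and hard-negative steps above can be sketched roughly as follows. This is a minimal illustration for a single similarity model, not the authors' pipeline: the function name `mine_pairs`, the use of precomputed L2-normalised embeddings, and the choice to take the most similar remaining candidates as negatives are all assumptions made here for illustration.

```python
import numpy as np

def mine_pairs(query_emb, corpus_emb, low=0.8, high=0.96, num_neg=5):
    """Return (target_index, hard_negative_indices) tuples for one query image.

    query_emb:  (d,)   L2-normalised embedding of the query image I_q.
    corpus_emb: (n, d) L2-normalised embeddings of the candidate images.
    low, high:  similarity window from the text, (0.8, 0.96).
    num_neg:    hard negatives kept per pair (5 in the text).
    """
    # Cosine similarity reduces to a dot product for normalised vectors.
    sims = corpus_emb @ query_emb
    # Drop weak correlations (<= low) and near-duplicates (>= high).
    keep = np.flatnonzero((sims > low) & (sims < high))
    # Order the retained candidates from most to least similar.
    keep = keep[np.argsort(-sims[keep])]
    pairs = []
    for target in keep:
        # Other retained candidates for the same query act as hard negatives.
        negatives = [int(j) for j in keep if j != target][:num_neg]
        pairs.append((int(target), negatives))
    return pairs
```

Under these assumptions, the function would be run once per similarity model (EVA-CLIP image encoder, DINOv2, EVA-CLIP text encoder over captions), and the resulting pairs pooled.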

Stage 2: Generating Open-Ended Instructions¶

Step 1: A VLM (InternVL2-26B) takes the image pair as input and generates a detailed description D_i of the common concepts and differences between the two images.
Step 2: An LLM (LLaMA3-8B) refines the description into a retrieval instruction T_{q→ti}, generating at least 3 different instructions per pair to increase diversity.
Final Triplet: (I_q, T_{q→ti}, I_ti), where (I_q, T_{q→ti}) is used to retrieve I_ti.
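
The two-step annotation flow can be sketched as below. The model calls are deliberately left as injected callables: `describe` and `refine` are hypothetical stand-ins for the VLM and LLM calls, and the `Triplet` class and `build_triplets` function are names introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Triplet:
    query_image: str   # I_q
    instruction: str   # T_{q->ti}
    target_image: str  # I_ti

def build_triplets(
    query_image: str,
    target_image: str,
    describe: Callable[[str, str], str],      # hypothetical VLM wrapper
    refine: Callable[[str, int], List[str]],  # hypothetical LLM wrapper
    n_instructions: int = 3,
) -> List[Triplet]:
    # Step 1: the VLM describes what the two images share and how they differ.
    description = describe(query_image, target_image)
    # Step 2: the LLM turns that description into several retrieval instructions.
    instructions = refine(description, n_instructions)
    # Each instruction yields one (I_q, T_{q->ti}, I_ti) training triplet.
    return [Triplet(query_image, text, target_image) for text in instructions]
```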

Implementation Scale¶

Image Corpus: A subset of 20 million captioned images from Recap-DataComp-1B.
Synthesized Results: 26,235,105 image pairs, fully open-sourced.

MMRet Model¶

Two architectures are designed:

CLIP-based MMRet (Base/Large):
- A dual-encoder architecture that independently encodes images and texts.
- Multimodal embeddings are fused using score-fusion: e_it = Φ_I(I) + Φ_T(T).
MLLM-based MMRet (based on LLaVA-1.6 Mistral 7B):
- Image tokens are fed directly into the LLM for processing.
- Uses task-specific instructions: <instruct> {task_inst} <query> {q_t} {q_i} [EOS].
- Uses the normalized final hidden state of the [EOS] token as the embedding.
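
As a rough sketch of the two embedding schemes, assuming encoder outputs and LLM hidden states are already available as NumPy arrays. The helper names and the `<image>` placeholder string are assumptions for illustration; the source only specifies the fusion formula, the prompt template, and the use of the normalized [EOS] hidden state.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def fuse_clip(image_emb, text_emb):
    """Score fusion for the CLIP-based model: e_it = Phi_I(I) + Phi_T(T)."""
    return image_emb + text_emb

def build_mllm_prompt(task_inst, query_text, image_placeholder="<image>"):
    """Fill the template <instruct> {task_inst} <query> {q_t} {q_i} [EOS].

    The placeholder marks where image tokens would be spliced in (assumed).
    """
    return f"<instruct> {task_inst} <query> {query_text} {image_placeholder} [EOS]"

def eos_embedding(hidden_states, eos_index):
    """Use the normalised final hidden state at the [EOS] position."""
    return l2_normalize(hidden_states[eos_index])
```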

Training Objectives¶

Standard InfoNCE contrastive loss:

$$\mathcal{L} = -\frac{1}{|\mathcal{Q}|}\sum_{q_i \in \mathcal{Q}} \log \frac{\exp(\mathbf{e}_{q_i} \cdot \mathbf{e}_{c_i^+}/\tau)}{\sum_{c_j \in \mathcal{C}} \exp(\mathbf{e}_{q_i} \cdot \mathbf{e}_{c_j}/\tau)}$$

The temperature parameter $\tau = 0.02$. The query and candidate can be images, texts, or image-text combinations.
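
A minimal NumPy sketch of this loss, assuming precomputed embeddings and a shared candidate pool per batch. The function name `info_nce` and the `positive_idx` argument are illustrative choices, not the authors' training code.

```python
import numpy as np

def info_nce(query_emb, cand_emb, positive_idx, tau=0.02):
    """InfoNCE loss averaged over queries.

    query_emb:    (B, d) query embeddings e_q.
    cand_emb:     (N, d) candidate embeddings e_c (positives and negatives).
    positive_idx: length-B sequence; positive_idx[i] is the row of cand_emb
                  holding the positive candidate for query i.
    tau:          temperature (0.02 in the text).
    """
    logits = query_emb @ cand_emb.T / tau
    # Subtract the row-wise max before exponentiating for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    rows = np.arange(len(positive_idx))
    return float(-log_prob[rows, positive_idx].mean())
```

With hard negatives, the mined negative images would simply be added as extra rows of `cand_emb`.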

Experiments¶

Zero-Shot CIR Performance (Table 1)¶

Method	Backbone	Params	CIRCO mAP@5	CIRR R@1	FashionIQ R@10	GeneCIS Rs@1
MagicLens-L‡	CoCa-L	613M	34.1	33.3	38.0	16.7
IP-CIR	CLIP-G	43.8B†	32.8	39.3	45.7	-
MMRet-Base	CLIP-B	149M	34.3	36.1	31.9	18.0
MMRet-Large	CLIP-L	428M	39.2	38.0	34.6	18.1
MMRet-MLLM	LLaVA-1.6	7.57B	42.2	46.7	35.6	21.1

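Key Findings:
- MMRet-MLLM outperforms the previous SOTA on CIRCO by 8.1 percentage points (42.2 vs 34.1 mAP@5).
- MMRet-Base (149M) outperforms the much larger MagicLens-L (613M) on CIRCO, CIRR, and GeneCIS, though not on FashionIQ (31.9 vs 38.0).
- MMRet-MLLM posts the best CIRCO, CIRR, and GeneCIS scores in the table; IP-CIR remains ahead on FashionIQ (45.7).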

MMEB Zero-Shot Performance (Table 2)¶

Model	Classification	VQA	Retrieval	Grounding	Overall
UniIR	42.1	15.0	60.1†	62.2	42.8
MMRet-MLLM	47.2	18.4	56.5	62.2	44.0

Across the 36 MMEB datasets, MMRet-MLLM achieves the highest overall score (44.0). UniIR scores higher on the retrieval meta-task (60.1 vs 56.5), but that result is not strictly zero-shot: UniIR's training data covers 10 of the 12 MMEB retrieval datasets.

MMEB Fine-Tuning Performance (Table 3)¶

Model	IND	OOD	Overall
VLM2Vec (LLaVA-1.6)	61.0	47.5	55.0
VLM2Vec (Phi-3.5-V)	66.5	52.0	60.1
MMRet-MLLM	68.0	59.1	64.1

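After fine-tuning, MMRet-MLLM improves OOD performance by 11.6 percentage points over the VLM2Vec LLaVA-1.6 baseline (59.1 vs 47.5) and by 7.1 points over VLM2Vec Phi-3.5-V (59.1 vs 52.0), indicating the strong generalization ability endowed by MegaPairs pre-training.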

Data Quality and Scalability (Figure 2)¶

$70\times$ Data Efficiency: Only 0.5M MegaPairs samples are needed to outperform the model trained on 36.7M MagicLens data.
Continuous Scaling: Performance continuously scales up as the data volume grows, with no saturation observed.
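
For reference, the $70\times$ figure appears to be a rounded ratio of the two training-set sizes quoted above:

$$\frac{36.7\,\text{M}}{0.5\,\text{M}} = 73.4 \approx 70\times$$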

Ablation Study¶

Effect of Hard Negatives (Table 4): Using mined hard negatives improves performance on CIRCO by 2.6 percentage points (29.7 → 32.3).
Search Strategy (Table 5): Jointly using the three similarity models yields the best performance (32.3 mAP@5). Individually, using text similarity outperforms using visual similarity.

Highlights & Insights¶

Innovative Data Synthesis Paradigm: Moving from "relying on co-occurring webpage images" to "actively mining from open-domain corpora" removes the main scalability bottleneck of webpage-based synthesis.
Heterogeneous Correlations are Key: The diverse correlations introduced by the three similarity models underpin data quality; relationships mined by a single model are too monotonous.
Extreme Data Efficiency: 0.5M samples beating 36.7M suggests that data quality can matter more than raw quantity, a familiar lesson in the retrieval domain.
Fully Open-Source: The complete suite of the dataset, model, and pipeline is open-sourced, providing a significant contribution to the community.
Strong Performance of Small Models: MMRet-Base (149M) outperforms the larger MagicLens-L (613M) on three of the four CIR benchmarks, suggesting that well-constructed training data can partly compensate for gaps in model scale.

Limitations & Future Work¶

Only three retrievers are used to construct image pairs; more retrieval strategies (e.g., BGE, image-text cross-retrieval) might further enhance diversity.
Although the image sources have been filtered by the DataComp team, they may not be entirely clean.
CLIP-based MMRet underperforms specialized CIR models in domain-specific verticals such as FashionIQ.

Related Work¶

Multimodal Retrieval: Pre-trained models like CLIP, ALIGN, and SigLIP; general multimodal embeddings like UniIR and E5-V.
Instruction Tuning: From LLMs (FLAN, InstructGPT) to embeddings (Su et al., GTE), and to multimodal domains (MagicLens, UniIR).
CIR Methods: Zero-shot or training-based methods such as SEARLE, CIReVL, LDRE, and CompoDiff.

Rating ⭐⭐⭐⭐⭐¶

Novelty: ⭐⭐⭐⭐⭐ The data synthesis paradigm combining heterogeneous KNN triplets and VLM/LLM annotations is novel and effective.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across 4 CIR benchmarks and 36 MMEB datasets, featuring zero-shot/fine-tuning setups, data scaling, and ablation studies.
Value: ⭐⭐⭐⭐⭐ Fully open-sourced, 26 million data points are readily accessible, addressing the pain points of the field.
Writing Quality: ⭐⭐⭐⭐ Clearly described method with rich illustrations.