Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models¶
Conference: CVPR 2025
arXiv: 2408.04594
Code: https://github.com/modelscope/data-juicer/tree/ImgDiff
Area: Multimodal VLM
Keywords: Contrastive Data Synthesis, Image Difference Captioning, Multimodal Large Language Models, Fine-grained Visual Understanding, Data Augmentation
TL;DR¶
Proposes a data synthesis method inspired by contrastive learning that automatically generates pairs of similar images differing in a few objects, together with descriptions of those differences. MLLMs fine-tuned on this data outperform GPT-4V and Gemini on MMVP by about 12 points and gain up to 3.06% on average across 8 general MLLM benchmarks.
Background & Motivation¶
Background: The performance of Multimodal Large Language Models (MLLMs) highly depends on the quality of training data. Current training data for MLLMs mainly consists of image-text pairs in the pre-training phase and VQA instruction-following data in the fine-tuning phase, which still leaves a gap in fine-grained image recognition.
Limitations of Prior Work: (1) Existing VQA datasets focus on single-image understanding, lacking discrimination training on subtle differences between similar images; (2) Vision encoders like CLIP suffer from the "CLIP-blind" issue, failing to distinguish between images that have similar CLIP features but different content; (3) Existing image difference datasets (e.g., Spot-the-Diff, CLEVR-Change) are small in scale, narrow in domain, and their full-image level difference descriptions are not sufficiently precise.
Key Challenge: Improving the fine-grained visual discrimination of MLLMs requires large-scale "similar but different" image pairs. However, manual annotation is extremely costly, and existing synthesis methods (such as InstructPix2Pix) produce difference descriptions that are imprecise and lack spatial focus.
Goal: Design a fully automatic data synthesis pipeline that generates high-quality "object substitution" image pairs and precise region-level difference descriptions.
Key Insight: Draw on the idea of contrastive learning: let the model learn to distinguish fine-grained differences by comparing similar images.
Core Idea: Utilize SDXL + Prompt-to-Prompt to generate similar image pairs, extract different regions through multi-stage filtering, and leverage MLLMs to generate precise region-level difference descriptions, forming high-quality contrastive training data.
Method¶
Overall Architecture¶
A three-step pipeline: (1) Image Pair Generation: using an LLM for object replacement to produce caption pairs, and generating similar image pairs using SDXL + Prompt-to-Prompt; (2) Difference Area Generator: locating different regions via image similarity filtering, FastSAM instance segmentation, image-text matching filtering, and difference detection; (3) Difference Captions Generator: a two-stage process using MLLMs to annotate region content and generate precise difference descriptions.
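The caption-pair step at the start of the pipeline can be illustrated with a small sketch. The paper uses an LLM to choose and replace the object name; the regex substitution below is only a stand-in for that step, and the function name `make_caption_pair` and the example caption are hypothetical.

```python
import re
from typing import Optional, Tuple


def make_caption_pair(caption: str, old_obj: str, new_obj: str) -> Optional[Tuple[str, str]]:
    """Return (original, edited) captions, or None if `old_obj` is absent.

    Stand-in for the LLM-based object replacement: swaps the first
    whole-word occurrence of `old_obj` with `new_obj`.
    """
    # \b keeps "dog" from matching inside words such as "hotdog".
    pattern = re.compile(rf"\b{re.escape(old_obj)}\b", flags=re.IGNORECASE)
    if not pattern.search(caption):
        return None
    edited = pattern.sub(new_obj, caption, count=1)
    return caption, edited


pair = make_caption_pair("A dog sits on a red sofa.", "dog", "cat")
print(pair)
```

Both captions of the pair would then be passed to the image generator, so that the two images share everything except the replaced object.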
Key Designs¶
- Image Pair Generation (Prompt-to-Prompt + SDXL):
    - Function: Generate highly similar image pairs in which only a small number of objects are replaced.
    - Mechanism: (i) Retrieve image captions from caption databases (e.g., MSCOCO or LLaVA pre-training data); (ii) Use an LLM to replace an object name in the caption (e.g., "dog" \(\rightarrow\) "cat"); (iii) Generate an image pair from the original and replaced captions using Stable-Diffusion-XL configured with Prompt-to-Prompt, which shares cross-attention layers so that only the replaced object changes while the rest of the image stays consistent.
    - Design Motivation: Compared with InstructPix2Pix-style methods, the pipeline (1) uses the more advanced SDXL (instead of SD-1.5) to generate more realistic images, and (2) relies on Prompt-to-Prompt, which offers better control over the scope of modification than editing methods. Out of 118K image pairs, multi-stage filtering yields 38,533 highly similar but non-identical pairs.
- Difference Area Generator:
    - Function: Precisely locate the regions (bounding boxes) that contain object differences between the two images of a pair.
    - Mechanism: A four-step cascading filter: (i) Image Similarity Filtering: Compute the cosine similarity of the image pair using CLIP and retain only pairs within a specific threshold range (pairs that are too similar contain no usable difference; pairs that are too dissimilar differ across the whole image); (ii) FastSAM Segmentation: Run FastSAM instance segmentation on each image to obtain all bounding boxes; (iii) Image-Text Matching Filtering: Use BLIP to match cropped sub-images against object names, confirming that each sub-image actually contains the target object; (iv) Difference Detection: Crop both images at the same bounding box position, compute the similarity between the two sub-images using CLIP, retain only regions with significant differences, and apply IoU filtering to remove overlapping boxes.
    - Design Motivation: Object detectors such as YOLO are avoided because they cover a limited set of categories. The segmentation-and-similarity-comparison approach is not restricted to predefined categories and can detect differences in arbitrary objects. Multi-stage filtering ensures the quality of the 117,779 valid bounding boxes.
- Difference Captions Generator:
    - Function: Generate precise textual descriptions for the different regions.
    - Mechanism: Stage-1 (Object Labeling & Filtering): Select the top-5 bounding boxes with the lowest similarity, use LLaVA-NEXT to describe the content of each region, and filter the results via image-text matching and description similarity (CLIP computes the similarity between the two descriptions; excessive similarity implies no real difference). Stage-2 (Difference Captioning): Draw red bounding boxes on the images to mark the different regions, then feed both images, together with the region descriptions and the visual cues (red boxes), into LLaVA-NEXT to generate descriptions of the specific differences within the red-boxed regions.
    - Design Motivation: The key innovation is using region-level difference descriptions instead of full-image ones. Full-image descriptions often omit details or remain too general, while red-box prompting and localized descriptions ensure each difference is precisely captured. Ultimately, 12,688 high-quality "object substitution" samples are selected from the 117,779 bounding boxes.
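The difference-detection and box-selection logic described above (sub-image similarity, IoU filtering, and keeping the lowest-similarity boxes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: plain Python lists stand in for CLIP embeddings of the cropped sub-images, and the function name `select_difference_boxes` and the thresholds `sim_max=0.85` and `iou_max=0.5` are hypothetical values chosen for the example.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_difference_boxes(
    boxes: List[Box],
    feats_a: List[Sequence[float]],  # stand-in embeddings of crops from image A
    feats_b: List[Sequence[float]],  # stand-in embeddings of crops from image B
    sim_max: float = 0.85,           # hypothetical: at or above this, crops count as "same"
    iou_max: float = 0.5,            # hypothetical overlap limit between kept boxes
    top_k: int = 5,
) -> List[Tuple[float, Box]]:
    """Keep up to `top_k` non-overlapping boxes whose crops differ the most."""
    scored = sorted(
        ((cosine(fa, fb), box) for box, fa, fb in zip(boxes, feats_a, feats_b)),
        key=lambda item: item[0],
    )
    kept: List[Tuple[float, Box]] = []
    for sim, box in scored:
        if sim >= sim_max:
            break  # sorted ascending, so all remaining crops are too similar
        if all(iou(box, other) <= iou_max for _, other in kept):
            kept.append((sim, box))
        if len(kept) == top_k:
            break
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (40, 40, 50, 50)]
feats_a = [[1, 0], [1, 0], [1, 0], [1, 0]]
feats_b = [[0, 1], [0.6, 0.8], [1, 0], [0.6, 0.8]]
kept = select_difference_boxes(boxes, feats_a, feats_b)
print([box for _, box in kept])
```

In this toy input, the second box overlaps heavily with the first and is dropped by the IoU check, while the third box has identical crops in both images and is dropped by the similarity check.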
Loss & Training¶
- Mix Img-Diff data into the original MLLM visual instruction tuning data for fine-tuning.
- For LLaVA-1.5 and MGM: Re-tune after mixing.
- For InternVL2: Conduct secondary fine-tuning on top of the original tuning.
- Perform an additional 2 epochs of domain-adaptation fine-tuning on Spot-the-Diff and Image-Edit-Request.
- No extra loss function designed; standard VQA training objectives are used.
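The data-mixing step can be pictured with a short sketch. It assumes each sample is a dictionary and that mixing means concatenating and shuffling the two sets; the sample fields, the function name `mix_datasets`, and the fixed seed are illustrative assumptions, since the summary does not specify the mixing procedure or ratio.

```python
import random
from typing import Dict, List


def mix_datasets(base: List[Dict], img_diff: List[Dict], seed: int = 0) -> List[Dict]:
    """Concatenate the original instruction-tuning data with Img-Diff samples
    and shuffle with a fixed seed so the mix is reproducible."""
    mixed = list(base) + list(img_diff)
    random.Random(seed).shuffle(mixed)
    return mixed


# Hypothetical toy samples standing in for real instruction-tuning records.
base = [{"id": f"vqa-{i}"} for i in range(5)]
diff = [{"id": f"imgdiff-{i}"} for i in range(2)]
mixed = mix_datasets(base, diff, seed=0)
print(len(mixed))
```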
Key Experimental Results¶
Main Results¶
MMVP Benchmark (Image Difference Recognition)¶
| Model | Original Score | +Img-Diff | Gain |
|---|---|---|---|
| LLaVA-1.5-7B | ~18 | ~32 | +14 |
| MGM-7B | ~27 | ~42 | +15 |
| InternVL2-8B | ~40 | ~45 | +5 |
| GPT-4V | ~30 | - | - |
| Gemini | ~30 | - | - |
8 General MLLM Benchmarks (Average Gain \(\Delta\))¶
| Model | VQAv2 | GQA | POPE | MMBench | MM-Vet | SciQA | SEED | Average \(\Delta\) |
|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7B | 78.5 → 79.3 | 62 → 62.8 | 85.9 → 86.4 | 64.3 → 66.1 | 30.5 → 33.2 | 66.8 → 68.2 | 58.6 → 61.7 | +3.06% |
| MGM-7B | 80.4 → 80.7 | 62.6 → 62.7 | 86 → 86.2 | 69.3 → 68.7 | 40.8 → 44.1 | 70.6 → 71.7 | 63.5 → 63.2 | +1.28% |
| InternVL2-8B | 81.8 → 81.8 | 62.6 → 62.6 | 87.7 → 88.0 | 82.5 → 82.7 | 49.2 → 52.6 | 96.5 → 96.6 | 69.5 → 69.9 | +1.01% |
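As a rough check on the Average \(\Delta\) column, the sketch below computes the mean of the per-benchmark relative gains from the before/after scores listed in the table. That formula is an assumption, since the summary does not state how the average is defined. Because only seven benchmark columns appear in the table, the results land close to, but not exactly on, the reported +3.06%, +1.28%, and +1.01%.

```python
from typing import Dict, List, Tuple

# (before, after) scores copied from the table above, in column order:
# VQAv2, GQA, POPE, MMBench, MM-Vet, SciQA, SEED.
rows: Dict[str, List[Tuple[float, float]]] = {
    "LLaVA-1.5-7B": [(78.5, 79.3), (62.0, 62.8), (85.9, 86.4), (64.3, 66.1),
                     (30.5, 33.2), (66.8, 68.2), (58.6, 61.7)],
    "MGM-7B": [(80.4, 80.7), (62.6, 62.7), (86.0, 86.2), (69.3, 68.7),
               (40.8, 44.1), (70.6, 71.7), (63.5, 63.2)],
    "InternVL2-8B": [(81.8, 81.8), (62.6, 62.6), (87.7, 88.0), (82.5, 82.7),
                     (49.2, 52.6), (96.5, 96.6), (69.5, 69.9)],
}


def mean_relative_gain(pairs: List[Tuple[float, float]]) -> float:
    """Mean of (after - before) / before over all benchmarks, in percent."""
    return 100.0 * sum((after - before) / before for before, after in pairs) / len(pairs)


gains = {model: mean_relative_gain(pairs) for model, pairs in rows.items()}
for model, gain in gains.items():
    print(f"{model}: {gain:+.2f}%")
```

Under this assumption, most of each model's average comes from MM-Vet, where the relative gain is far larger than on the other listed benchmarks.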
Spot-the-Diff Benchmark¶
| Model | BLEU | METEOR | CIDEr-D | ROUGE-L |
|---|---|---|---|---|
| MGM-7B | 9.9 | 12.0 | 46.3 | 31.5 |
| MGM-7B + RP | 10.8 | 13.1 | 53.5 | 33.0 |
| VACC (Prev. SOTA) | 9.7 | 12.6 | 41.5 | 32.1 |
Key Findings¶
- Img-Diff brings larger gains to weaker models: LLaVA-1.5-7B gains 3.06%, while InternVL2-8B (which was already trained on plenty of high-quality data) gains only 1.01%, indicating diminishing marginal returns from additional data.
- Quality evaluation results: Human annotation of 1,000 samples reveals that 79.6% of bounding boxes contain valid object differences, 80.1% of region descriptions are accurate, and over 70% of the difference descriptions are completely correct.
- Considerable diversity: The dataset covers 1,203 object categories and 3,680 unique "object replacement pairs".
- Data quantity vs. quality trade-off: Appendix experiments indicate that reasonably scaling up the data quantity can further boost performance, but quality takes precedence over quantity.
- "Object removal" data is also effective: Aside from "object substitution," the "object removal" variant can also bring additional improvements (Appendix Section 16).
Highlights & Insights¶
- The entire data synthesis pipeline is carefully designed: from image pair generation and multi-stage filtering to region-level descriptions, every step incorporates a quality-assurance mechanism.
- Region-level difference descriptions (bounding boxes + local descriptions) are much more precise than global descriptions, resolving the limitation that "a single caption cannot encompass all differences."
- As a general fine-tuning dataset, Img-Diff is "harmless": it enhances difference-recognition ability without hurting general VQA performance, which even improves slightly.
- Outperforming GPT-4V and Gemini is highly convincing: the ~12-point improvement comes purely from data, without altering model architectures.
Limitations & Future Work¶
- The data scale is relatively small (12,688 samples); scaling up while maintaining quality remains a challenge.
- The pipeline depends on the generation quality of SDXL; synthetic images may contain artifacts that affect difference detection.
- Difference descriptions rely heavily on LLaVA-NEXT, which itself has limitations in producing fine-grained descriptions.
- The data focuses only on "object substitution/removal" differences and lacks systematic coverage of attribute changes (such as color, texture, or pose).
- The filtering pipeline is relatively complex, and end-to-end efficiency needs to be improved.
Related Work & Insights¶
- vs InstructPix2Pix: Both utilize Prompt-to-Prompt to generate image pairs, but Img-Diff employs the superior SDXL, multi-stage filtering, and region-level descriptions, yielding significantly higher data quality.
- vs Spot-the-Diff Dataset: Spot-the-Diff consists of real-world captures from stationary cameras, where differences represent natural temporal changes. Img-Diff consists of synthetic object changes, offering broader coverage but potentially deviating from clean real-world distributions.
- vs VIXEN: The first method using MLLMs for image difference captioning, focusing on model design. Img-Diff focuses on data synthesis, and the two are highly complementary.
Rating¶
- Novelty: ★★★★ The concept of using contrastive learning principles for data synthesis is novel, though the fundamental components (SDXL, Prompt-to-Prompt, FastSAM) are existing tools.
- Experimental Thoroughness: ★★★★★ Extremely comprehensive evaluation across 3 difference benchmarks, 8 general benchmarks, and 3 MLLMs, alongside data quality/diversity assessment, ablation, and multiple variants.
- Writing Quality: ★★★★ The pipeline diagram is clear, and every component is explicitly explained, though the division between the main text and the appendix is somewhat heavy.
- Value: ★★★★ Provides a generalizable data augmentation strategy of direct value to improving the fine-grained visual capabilities of MLLMs, with open-sourced code and data.