Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

Conference: CVPR 2025
arXiv: 2408.04594
Code: https://github.com/modelscope/data-juicer/tree/ImgDiff
Area: Multimodal VLM
Keywords: Contrastive Data Synthesis, Image Difference Captioning, Multimodal Large Language Models, Fine-grained Visual Understanding, Data Augmentation

TL;DR

Img-Diff is a data synthesis method inspired by contrastive learning. It automatically generates pairs of similar images that differ in a small number of objects, together with descriptions of those differences. MLLMs fine-tuned on this data surpass GPT-4V and Gemini on MMVP by up to about 12 points and improve by up to 3.06% on average across 8 general MLLM benchmarks.

Background & Motivation

Background: The performance of Multimodal Large Language Models (MLLMs) highly depends on the quality of training data. Current training data for MLLMs mainly consists of image-text pairs in the pre-training phase and VQA instruction-following data in the fine-tuning phase, which still leaves a gap in fine-grained image recognition.

Limitations of Prior Work: (1) Existing VQA datasets focus on single-image understanding, lacking discrimination training on subtle differences between similar images; (2) Vision encoders like CLIP suffer from the "CLIP-blind" issue, failing to distinguish between images that have similar CLIP features but different content; (3) Existing image difference datasets (e.g., Spot-the-Diff, CLEVR-Change) are small in scale, narrow in domain, and their full-image level difference descriptions are not sufficiently precise.

Key Challenge: Improving the fine-grained visual discrimination of MLLMs requires large-scale "similar but different" image pairs. However, manual annotation is extremely costly, and existing synthesis methods (such as InstructPix2Pix) produce difference descriptions that are imprecise and lack spatial focus.

Goal: Design a fully automatic data synthesis pipeline that generates high-quality "object substitution" image pairs and precise region-level difference descriptions.

Key Insight: Borrow the idea of contrastive learning: let the model learn to distinguish fine-grained differences by comparing similar images.

Core Idea: Utilize SDXL + Prompt-to-Prompt to generate similar image pairs, extract different regions through multi-stage filtering, and leverage MLLMs to generate precise region-level difference descriptions, forming high-quality contrastive training data.

Method

Overall Architecture

A three-step pipeline: (1) Image Pair Generation: using an LLM for object replacement to produce caption pairs, and generating similar image pairs using SDXL + Prompt-to-Prompt; (2) Difference Area Generator: locating different regions via image similarity filtering, FastSAM instance segmentation, image-text matching filtering, and difference detection; (3) Difference Captions Generator: a two-stage process using MLLMs to annotate region content and generate precise difference descriptions.
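
The data flow of the three steps can be sketched as follows. This is only an illustrative skeleton: the names `ImagePair`, `DiffSample`, `run_pipeline` and the three callables are hypothetical and stand in for the SDXL + Prompt-to-Prompt generator, the Difference Area Generator, and the Difference Captions Generator; none of them comes from the released code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class ImagePair:
    caption_a: str   # original caption
    caption_b: str   # caption after object replacement
    image_a: object  # placeholder for the first generated image
    image_b: object  # placeholder for the second generated image


@dataclass
class DiffSample:
    pair: ImagePair
    box: Box
    description: str  # region-level difference caption


def run_pipeline(
    captions: Iterable[str],
    make_pair: Callable[[str], ImagePair],
    find_regions: Callable[[ImagePair], List[Box]],
    describe: Callable[[ImagePair, Box], Optional[str]],
) -> List[DiffSample]:
    """Chain the three stages; a stage drops data by returning nothing."""
    samples: List[DiffSample] = []
    for caption in captions:
        pair = make_pair(caption)            # step 1: image pair generation
        for box in find_regions(pair):       # step 2: difference areas
            text = describe(pair, box)       # step 3: difference caption
            if text is not None:             # None means "filtered out"
                samples.append(DiffSample(pair, box, text))
    return samples
```

Passing the stages in as callables keeps the sketch independent of any particular model; each stage can be swapped or stubbed for testing.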

Key Designs

  1. Image Pair Generation (Prompt-to-Prompt + SDXL):

    • Function: Generating highly similar image pairs where only a small number of objects are replaced.
    • Mechanism: (i) Retrieve image captions from caption databases (e.g., MSCOCO or LLaVA pre-training data); (ii) Use an LLM to replace an object name in the caption (e.g., "dog" → "cat"); (iii) Generate an image pair from the original and the modified caption using Stable-Diffusion-XL with Prompt-to-Prompt, which shares cross-attention maps between the two generations so that only the replaced object changes while the rest of the image stays consistent.
    • Design Motivation: Compared with InstructPix2Pix-style pipelines, (1) the more advanced SDXL (instead of SD-1.5) produces more realistic images, and (2) Prompt-to-Prompt gives tighter control over the scope of modification than general editing methods. From 118K generated image pairs, multi-stage filtering retains 38,533 highly similar but non-identical pairs.
  2. Difference Area Generator:

    • Function: Precisely locating the regions (bounding boxes) representing object differences between the image pairs.
    • Mechanism: A four-step cascading filter: (i) Image Similarity Filtering: Compute the CLIP cosine similarity of each image pair and keep only pairs inside a threshold band, since pairs that are too similar contain no usable difference and pairs that are too different are effectively unrelated images; (ii) FastSAM Segmentation: Run FastSAM instance segmentation on each image to obtain candidate bounding boxes; (iii) Image-Text Matching Filtering: Use BLIP image-text matching between each cropped sub-image and the object names to confirm that the crop actually contains the target object; (iv) Difference Detection: Crop both images at the same bounding-box position, compute the CLIP similarity between the two crops, keep only regions with significant differences, and apply IoU filtering to remove overlapping boxes.
    • Design Motivation: Object detectors such as YOLO are avoided because they cover only a limited set of categories. The segmentation-plus-similarity approach is not restricted to predefined categories and can detect differences in arbitrary objects. The multi-stage filtering yields 117,779 valid bounding boxes.
  3. Difference Captions Generator:

    • Function: Generating precise textual descriptions for the different regions.
    • Mechanism: Stage-1 (Object Labeling & Filtering): Select the top-5 bounding boxes with the lowest similarity, utilize LLaVA-NEXT to describe the content of each region, and filter them via image-text matching and description similarity (using CLIP to compute the similarity between both descriptions, where excessive similarity implies no real difference). Stage-2 (Difference Captioning): Draw red bounding boxes on the images to mark different regions, and feed both images along with region descriptions and visual cues (red boxes) into LLaVA-NEXT to generate descriptions of the specific differences within the red-boxed regions.
    • Design Motivation: The key innovation is region-level difference descriptions instead of full-image difference; full-image descriptions often omit details or remain too general, while red-box prompting and localized descriptions ensure each difference is precisely captured. Ultimately, 12,688 high-quality "object substitution" samples are selected from 117,779 bounding boxes.
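
The similarity band, the region-level difference check, the IoU filter, and the top-5 selection described above can be sketched in plain Python. This is a minimal sketch under stated assumptions: feature vectors are plain lists standing in for CLIP embeddings, every threshold value (`low`, `high`, `max_sim`, `max_iou`) is hypothetical, and the greedy overlap removal is one possible reading of "IoU filtering", not necessarily what the authors implemented.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def in_similarity_band(sim: float, low: float = 0.85, high: float = 0.98) -> bool:
    """Keep a pair only if it is similar, but not near-identical (hypothetical band)."""
    return low <= sim <= high


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_diff_boxes(
    boxes: Sequence[Box],
    crop_sims: Sequence[float],
    max_sim: float = 0.85,   # hypothetical: crops more similar than this are dropped
    max_iou: float = 0.5,    # hypothetical overlap limit
    top_k: int = 5,
) -> List[Box]:
    """Return up to top_k non-overlapping boxes, least similar crops first."""
    candidates = sorted(
        ((sim, box) for sim, box in zip(crop_sims, boxes) if sim < max_sim),
        key=lambda item: item[0],
    )
    kept: List[Box] = []
    for _, box in candidates:
        if all(iou(box, other) <= max_iou for other in kept):
            kept.append(box)
        if len(kept) == top_k:
            break
    return kept
```

For example, with boxes `(0, 0, 10, 10)` and `(1, 1, 10, 10)` that overlap heavily, only the one with the lower crop similarity survives, so each difference is described once.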

Loss & Training

  • Mix Img-Diff data into the original MLLM visual instruction tuning data for fine-tuning.
  • For LLaVA-1.5 and MGM: Rerun visual instruction tuning on the mixed data.
  • For InternVL2: Conduct secondary fine-tuning on top of the original tuning.
  • Perform an additional 2 epochs of domain-adaptation fine-tuning on Spot-the-Diff and Image-Edit-Request.
  • No extra loss function is designed; standard VQA training objectives are used.
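
The mixing step can be illustrated with a small helper. This is a sketch only: the function name `mix_instruction_data`, the dict-based sample format, and the `source` tag are assumptions for illustration, and the actual data format depends on each model's training code.

```python
import random
from typing import Dict, List


def mix_instruction_data(
    base: List[Dict], img_diff: List[Dict], seed: int = 0
) -> List[Dict]:
    """Concatenate both datasets, tag each sample's origin, and shuffle."""
    mixed = [dict(sample, source="base") for sample in base]
    mixed += [dict(sample, source="img_diff") for sample in img_diff]
    random.Random(seed).shuffle(mixed)  # fixed seed keeps the order reproducible
    return mixed
```

Tagging the origin makes it easy to check the mixing ratio afterwards or to ablate the added data.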

Key Experimental Results

Main Results

MMVP Benchmark (Image Difference Recognition)

| Model | Original Score | +Img-Diff | Gain |
| --- | --- | --- | --- |
| LLaVA-1.5-7B | ~18 | ~32 | +14 |
| MGM-7B | ~27 | ~42 | +15 |
| InternVL2-8B | ~40 | ~45 | +5 |
| GPT-4V | ~30 | - | - |
| Gemini | ~30 | - | - |

8 General MLLM Benchmarks (Average Gain \(\Delta\))

| Model | VQAv2 | GQA | POPE | MMBench | MM-Vet | SciQA | SEED | Average \(\Delta\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-7B | 78.5→79.3 | 62→62.8 | 85.9→86.4 | 64.3→66.1 | 30.5→33.2 | 66.8→68.2 | 58.6→61.7 | +3.06% |
| MGM-7B | 80.4→80.7 | 62.6→62.7 | 86→86.2 | 69.3→68.7 | 40.8→44.1 | 70.6→71.7 | 63.5→63.2 | +1.28% |
| InternVL2-8B | 81.8→81.8 | 62.6→62.6 | 87.7→88.0 | 82.5→82.7 | 49.2→52.6 | 96.5→96.6 | 69.5→69.9 | +1.01% |
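
One plausible reading of the Average \(\Delta\) column, which is an assumption here and not something the table states, is the mean of the per-benchmark relative gains. The sketch below applies that reading to the LLaVA-1.5-7B row. Only seven benchmark columns are listed, so the result (about 3.1%) is close to, but need not equal, the reported +3.06% over eight benchmarks.

```python
from typing import Dict, Tuple

# (before, after) scores for LLaVA-1.5-7B, copied from the table above.
llava_scores: Dict[str, Tuple[float, float]] = {
    "VQAv2": (78.5, 79.3),
    "GQA": (62.0, 62.8),
    "POPE": (85.9, 86.4),
    "MMBench": (64.3, 66.1),
    "MM-Vet": (30.5, 33.2),
    "SciQA": (66.8, 68.2),
    "SEED": (58.6, 61.7),
}


def mean_relative_gain(scores: Dict[str, Tuple[float, float]]) -> float:
    """Average of (after - before) / before over all benchmarks, in percent."""
    gains = [(after - before) / before * 100.0 for before, after in scores.values()]
    return sum(gains) / len(gains)


print(f"{mean_relative_gain(llava_scores):.2f}%")
```

Under this reading, MM-Vet (30.5→33.2) and SEED (58.6→61.7) contribute most of the average, because relative gains weight low-baseline benchmarks more heavily.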

Spot-the-Diff Benchmark

| Model | BLEU | METEOR | CIDEr-D | ROUGE-L |
| --- | --- | --- | --- | --- |
| MGM-7B | 9.9 | 12.0 | 46.3 | 31.5 |
| MGM-7B + RP | 10.8 | 13.1 | 53.5 | 33.0 |
| VACC (Prev. SOTA) | 9.7 | 12.6 | 41.5 | 32.1 |

Key Findings

  • Img-Diff brings larger gains to weaker models: LLaVA-1.5-7B gains 3.06%, while InternVL2-8B (already trained on plentiful high-quality data) gains only 1.01%, indicating diminishing marginal returns from additional data.
  • Quality evaluation results: Human annotation of 1,000 samples reveals that 79.6% of bounding boxes contain valid object differences, 80.1% of region descriptions are accurate, and over 70% of the difference descriptions are completely correct.
  • Considerable diversity: The dataset covers 1,203 object categories and 3,680 unique "object replacement pairs".
  • Data quantity vs. quality trade-off: Appendix experiments indicate that reasonably scaling up the data quantity can further boost performance, but quality takes precedence over quantity.
  • "Object removal" data is also effective: Aside from "object substitution," the "object removal" variant can also bring additional improvements (Appendix Section 16).

Highlights & Insights

  • Sophisticated design of the entire data synthesis pipeline — from image pair generation and multi-stage filtering to region-level descriptions, every step incorporates quality-assurance mechanisms.
  • Region-level difference descriptions (bounding boxes + local descriptions) are much more precise than global descriptions — resolving the limitation where "a single caption cannot encompass all differences."
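  • As a general fine-tuning dataset, Img-Diff is largely "harmless": it enhances difference-recognition ability while the average score on general benchmarks rises for all three models. The only regressions in the table are small drops for MGM-7B on MMBench (69.3→68.7) and SEED (63.5→63.2).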
  • Outperforming GPT-4V and Gemini is convincing evidence for the data: the roughly 12-point margin on MMVP comes purely from training data, with no change to model architectures.

Limitations & Future Work

  • The data scale is relatively small (12,688 samples); scaling up while maintaining quality remains a challenge.
  • Dependency on the generation quality of SDXL — synthetic images may contain artifacts that affect difference detection.
  • Difference descriptions rely heavily on LLaVA-NEXT, which itself has limitations in producing fine-grained descriptions.
  • Only focuses on "object substitution/removal" differences, lacking a systematic coverage of attribute changes (like color, texture, or pose).
  • The filtering pipeline is relatively complex, and end-to-end efficiency needs to be improved.

Comparison with Related Work

  • vs InstructPix2Pix: Both use Prompt-to-Prompt to generate image pairs, but Img-Diff adds the stronger SDXL backbone, multi-stage filtering, and region-level descriptions, yielding significantly higher data quality.
  • vs Spot-the-Diff dataset: Spot-the-Diff consists of real-world captures from stationary cameras, where differences reflect natural changes over time. Img-Diff consists of synthetic object changes, which gives broader coverage but may deviate from real-world image distributions.
  • vs VIXEN: VIXEN is the first method to use MLLMs for image difference captioning and focuses on model design. Img-Diff focuses on data synthesis, so the two are highly complementary.

Rating

  • Novelty: ⭐⭐⭐⭐ The concept of using contrastive learning principles for data synthesis is novel, though the fundamental components (SDXL, Prompt-to-Prompt, FastSAM) are existing tools.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive evaluation across 3 difference benchmarks, 8 general benchmarks, and 3 MLLMs, alongside data quality/diversity assessment, ablation, and multiple variants.
  • Writing Quality: ⭐⭐⭐⭐ The pipeline diagram is clear and every component is explicitly explained, though a large share of the material is pushed into the appendix.
  • Value: ⭐⭐⭐⭐ Provides a generalizable data augmentation strategy of direct value to improving the fine-grained visual capabilities of MLLMs, with open-sourced code and data.