ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Conference: ECCV 2024
arXiv: 2311.12793
Code: https://ShareGPT4V.github.io
Area: Multi-modal VLM
Keywords: High-quality caption, data augmentation, multi-modal alignment, GPT4-Vision, pre-training data

TL;DR

ShareGPT4V builds a dataset of 1.2M highly descriptive captions: a seed of 100K captions generated by GPT4-Vision, expanded to 1.2M by Share-Captioner, a caption model trained on that seed. Training ShareGPT4V-7B (a model based on the LLaVA architecture) on this data in both the pre-training and SFT stages achieves state-of-the-art performance on 9 out of 11 multi-modal benchmarks, which indicates that caption quality is a key bottleneck for multi-modal alignment in LMMs.

Background & Motivation

Background: Current large multi-modal models (LMMs) follow a two-stage paradigm, alignment pre-training followed by supervised fine-tuning (SFT), and have made significant progress in multi-modal understanding.

Limitations of Prior Work: Captions in mainstream image-text datasets are generally short and focus only on salient objects (e.g., COCO-Caption averages only 52 characters), which leads to the compression and loss of a significant amount of rich, fine-grained visual semantic information.

Key Challenge: The visual modality naturally contains rich information (world knowledge, object attributes, spatial relations, aesthetic evaluations, etc.), but existing captions carry far too little of it to support effective modality alignment. Although LLaVA-Instruct uses GPT-4 to write its data, GPT-4 never "sees" the image; it relies on human annotations and imagination, which inevitably leads to hallucinations.

Goal: Obtain high-quality image descriptions at scale and at low cost, and use them to improve modality alignment across all training stages of LMMs.

Key Insight: First use GPT4-Vision to directly generate 100K high-quality captions (averaging 942 characters) from images, and then train a caption model (Share-Captioner) to scale this to 1.2M.

Core Idea: High-quality captions are a highly effective lever for multi-modal alignment. Even with the architecture unchanged, replacing just 3.5% of the SFT data with high-quality captions can significantly boost performance.

Method

Overall Architecture

Pipeline: Multi-source images \(\rightarrow\) Data-specific prompt design \(\rightarrow\) GPT4-Vision generation of 100K seed captions \(\rightarrow\) Share-Captioner training \(\rightarrow\) Scaling to 1.2M captions \(\rightarrow\) Used in the pre-training and SFT stages of LMMs

The model architecture follows LLaVA-1.5: CLIP-Large (336×336) \(\rightarrow\) 2-layer MLP Projector \(\rightarrow\) Vicuna-v1.5 (7B)
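As a rough sanity check on what this stack feeds into the LLM, the sketch below counts the visual tokens per image and the size of the projector. The patch size (14) and the hidden widths (1024 for CLIP-Large, 4096 for Vicuna-7B) are assumptions taken from the public CLIP ViT-L/14-336 and Vicuna-7B configurations; they are not stated in this summary.

```python
# Assumed values (not stated in the summary): patch size 14 and hidden sizes
# 1024 / 4096, from the public CLIP ViT-L/14-336 and Vicuna-7B configs.
IMAGE_SIZE = 336
PATCH_SIZE = 14
VISION_DIM = 1024
LLM_DIM = 4096


def num_visual_tokens(image_size: int = IMAGE_SIZE, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch tokens the vision encoder emits for one square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    per_side = image_size // patch_size
    return per_side * per_side


def mlp_projector_params(in_dim: int = VISION_DIM, out_dim: int = LLM_DIM) -> int:
    """Parameter count of a 2-layer MLP: Linear(in, out) -> GELU -> Linear(out, out)."""
    first = in_dim * out_dim + out_dim      # weights + bias of the first layer
    second = out_dim * out_dim + out_dim    # weights + bias of the second layer
    return first + second


if __name__ == "__main__":
    print(num_visual_tokens())     # 576
    print(mlp_projector_params())  # 20979712
```

Under these assumptions each image becomes 576 tokens, and the projector holds roughly 21M parameters, a small fraction of the 7B LLM.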

Key Designs

Data-Specific Prompt Engineering:
- Function: Designs dedicated prompts for different data sources to guide GPT4-Vision to generate highly context-relevant descriptions.
- Mechanism: Basic prompt (object attributes, appearance, spatial relationship) + source-specific prompt (e.g., landmark images \(\rightarrow\) geographical location and name, celebrity images \(\rightarrow\) identity information) + aesthetic evaluation prompt.
- Design Motivation: Images from different sources focus on different aspects; general prompts cannot capture domain-specific knowledge (e.g., the Eiffel Tower should not be described merely as a "tall iron tower").
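The three-part prompt composition described above can be sketched as follows. Every prompt string and the `build_prompt` helper are hypothetical placeholders for illustration; they are not the prompts used by the ShareGPT4V authors.

```python
# All prompt strings below are illustrative placeholders, not the prompts
# used by the ShareGPT4V authors.
BASE_PROMPT = (
    "Describe the image in detail, covering object attributes, appearance, "
    "and spatial relationships."
)
SOURCE_PROMPTS = {
    "landmark": "Name the landmark and state its geographical location.",
    "celebrity": "Identify the person shown if they are a public figure.",
}
AESTHETIC_PROMPT = "Finish with a brief aesthetic evaluation of the image."


def build_prompt(source: str) -> str:
    """Compose base + source-specific (if one exists) + aesthetic prompt parts."""
    parts = [BASE_PROMPT]
    if source in SOURCE_PROMPTS:
        parts.append(SOURCE_PROMPTS[source])
    parts.append(AESTHETIC_PROMPT)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_prompt("landmark"))
```

Keeping the source-specific part in a lookup table means a new data source only needs one new entry, while sources without an entry fall back to the general prompt.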

Share-Captioner (Caption Model Training and Data Expansion):
- Function: Fine-tunes a caption model using 100K GPT4-Vision captions to replace the expensive GPT4-Vision for large-scale caption generation.
- Mechanism: After fine-tuning on 100K diverse high-quality captions, Share-Captioner can generate high-quality captions with a unified instruction without needing source-specific prompts.
- Design Motivation: GPT4-Vision is too expensive to run at this scale, so a locally hosted caption model is needed for expansion. Human evaluation indicates that Share-Captioner is on par with GPT4-Vision (38.2% vs. 35.3% preference, 26.5% tie).
- Expansion Scale: 1.2M images, costing approximately 44 A100 GPU days.
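To put the expansion cost in perspective, the arithmetic below converts the quoted figures into GPU time per caption. Both inputs are the approximate values from the text, so the result is only a ballpark estimate.

```python
GPU_DAYS = 44            # from the text: about 44 A100 GPU days
NUM_IMAGES = 1_200_000   # from the text: 1.2M images
SECONDS_PER_DAY = 24 * 60 * 60


def gpu_seconds_per_caption(gpu_days: float, num_images: int) -> float:
    """Average A100 seconds spent per generated caption."""
    return gpu_days * SECONDS_PER_DAY / num_images


if __name__ == "__main__":
    print(f"{gpu_seconds_per_caption(GPU_DAYS, NUM_IMAGES):.2f} s per caption")
```

This works out to roughly 3 A100 seconds per caption.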

Unfreezing Strategy for Vision Encoder in Pre-training:
- Function: Simultaneously fine-tunes the vision encoder, projector, and LLM during the pre-training stage.
- Mechanism: Since the caption quality is high enough, unfreezing the latter half of the vision encoder (starting from block 12) allows the encoder to learn to generate visual embeddings that correspond to the details in the captions, achieving bidirectional alignment.
- Design Motivation: Prior LMM works do not fine-tune the vision encoder during pre-training, as low-quality captions may degrade visual knowledge. However, unfreezing under high-quality caption scenarios is highly beneficial.
- Key Findings: Unfreezing the latter half (starting from block 12) yields the best performance, improving the MME perception score (MME-P) by 52.2 points and MMBench by 2.2 percentage points over a fully frozen encoder.
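The partial unfreezing rule can be written as a per-block trainable mask. This is a minimal sketch that assumes zero-based block indices and a 24-block encoder (consistent with "24" meaning fully frozen in the ablation); `trainable_mask` is a hypothetical helper, not part of any released code.

```python
from typing import List

NUM_BLOCKS = 24  # assumed CLIP-Large depth; "24" means fully frozen in the ablation


def trainable_mask(unfreeze_from: int, num_blocks: int = NUM_BLOCKS) -> List[bool]:
    """Return True for each block that receives gradient updates.

    unfreeze_from == num_blocks -> every block frozen
    unfreeze_from == 0          -> every block trainable
    """
    if not 0 <= unfreeze_from <= num_blocks:
        raise ValueError("unfreeze_from must lie in [0, num_blocks]")
    return [index >= unfreeze_from for index in range(num_blocks)]


if __name__ == "__main__":
    mask = trainable_mask(12)
    print(sum(mask), "of", len(mask), "blocks trainable")
```

In a PyTorch-style setup, such a mask would typically be applied by setting `requires_grad` on the parameters of each block.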

Loss & Training

Pre-training Stage: Uses ShareGPT4V-PT (1.2M captions) with learning rate \(2 \times 10^{-5}\), batch size 256, and approximately 4700 steps. The latter half of the vision encoder, the projector, and the LLM are fine-tuned simultaneously.
SFT Stage: Uses LLaVA-1.5's 665K SFT data, but replaces 23K detailed descriptions with ShareGPT4V high-quality captions. The vision encoder is frozen; the projector and LLM are fine-tuned with learning rate \(2 \times 10^{-5}\), batch size 128, and approximately 5200 steps.
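The step counts and the 3.5% replacement ratio quoted in this summary follow from the dataset and batch sizes. The check below assumes a single pass over each dataset and uses the rounded sample counts from the text.

```python
import math


def optimizer_steps(num_samples: int, batch_size: int) -> int:
    """Steps for one pass over the data, counting a final partial batch."""
    return math.ceil(num_samples / batch_size)


# Assumption: one epoch per stage, with the rounded sizes quoted in the text.
pretrain_steps = optimizer_steps(1_200_000, 256)  # 4688, i.e. about 4700
sft_steps = optimizer_steps(665_000, 128)         # 5196, i.e. about 5200
replaced_share = 23_000 / 665_000                 # about 0.0346, i.e. about 3.5%

if __name__ == "__main__":
    print(pretrain_steps, sft_steps, f"{replaced_share:.1%}")
```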

Key Experimental Results

Main Results

Benchmark	Metric	ShareGPT4V-7B	LLaVA-1.5-13B	Qwen-VL-Chat-7B	Gain over best listed baseline
MME-P	Perception Score	1567.4	1531.3	1487.5	+36.1 vs 13B
MME-C	Cognitive Score	376.4	295.4	360.7	+15.7 vs Qwen
MMBench	Accuracy	68.8%	67.7%	60.6%	+1.1%
SEED-I	Accuracy	69.7%	68.2%	58.2%	+1.5%
MM-Vet	GPT Score	37.6	35.4	-	+2.2
LLaVA-Wild	GPT Score	72.6	70.7	-	+1.9
VQA-v2	Accuracy	80.6%	80.0%	78.2%	+0.6%
VizWiz	Accuracy	57.2%	53.6%	38.9%	+3.6%

The 7B model outperforms all competitors, including 13B models and Qwen-VL-Chat (which uses 1.4 billion training samples), on 9 out of 11 benchmarks.
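The last column of the results table can be reproduced from the other columns: each gain is the ShareGPT4V-7B score minus the best of the listed baselines. The sketch below recomputes it from the table values; the helper name is hypothetical.

```python
from typing import Optional

# Rows copied from the table above: (ShareGPT4V-7B, LLaVA-1.5-13B, Qwen-VL-Chat-7B).
# None marks a score the table does not list.
RESULTS = {
    "MME-P": (1567.4, 1531.3, 1487.5),
    "MME-C": (376.4, 295.4, 360.7),
    "MMBench": (68.8, 67.7, 60.6),
    "SEED-I": (69.7, 68.2, 58.2),
    "MM-Vet": (37.6, 35.4, None),
    "LLaVA-Wild": (72.6, 70.7, None),
    "VQA-v2": (80.6, 80.0, 78.2),
    "VizWiz": (57.2, 53.6, 38.9),
}


def gain_over_best_baseline(ours: float, *baselines: Optional[float]) -> float:
    """Difference between our score and the strongest available baseline."""
    available = [score for score in baselines if score is not None]
    return round(ours - max(available), 1)


GAINS = {name: gain_over_best_baseline(*scores) for name, scores in RESULTS.items()}

if __name__ == "__main__":
    for name, gain in GAINS.items():
        print(f"{name}: +{gain}")
```

Note that the strongest baseline differs by row: Qwen-VL-Chat-7B leads on MME-C, while LLaVA-1.5-13B leads on the other listed benchmarks.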

Ablation Study

Configuration (Pre-training/SFT using ShareGPT4V)	MME-P	MMBench	SEED-I	Description
✗/✗ (baseline LLaVA-1.5)	1510.7	64.3%	66.2%	Baseline
✗/✓	1542.1	66.8%	66.7%	SFT only, +31.4 MME-P over baseline
✓/✗	1557.2	67.4%	68.5%	Pre-training only, +46.5 MME-P over baseline
✓/✓	1567.4	68.8%	69.7%	Both stages, best (+56.7 MME-P over baseline)
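The per-stage contributions in this ablation can be read off as differences from the baseline MME-P score. The sketch below computes them from the table values; the dictionary layout and helper name are illustrative choices.

```python
# MME-P scores from the ablation table, keyed by
# (ShareGPT4V used in pre-training, ShareGPT4V used in SFT).
MME_P = {
    (False, False): 1510.7,
    (False, True): 1542.1,
    (True, False): 1557.2,
    (True, True): 1567.4,
}
BASELINE = MME_P[(False, False)]


def delta(pretrain: bool, sft: bool) -> float:
    """MME-P improvement of a configuration over the LLaVA-1.5 baseline."""
    return round(MME_P[(pretrain, sft)] - BASELINE, 1)


sft_only = delta(False, True)                       # 31.4
pretrain_only = delta(True, False)                  # 46.5
both = delta(True, True)                            # 56.7
sum_of_parts = round(sft_only + pretrain_only, 1)   # 77.9

if __name__ == "__main__":
    print(sft_only, pretrain_only, both, sum_of_parts)
```

The combined gain (56.7) is smaller than the sum of the two individual gains (77.9), so the two stages appear to overlap partly in what they contribute.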

Vision Encoder Unfrozen Starting Layer	VRAM Usage	MME-P	MMBench	SEED-I
24 (Fully Frozen)	49.6 GB	1515.2	66.6%	68.1%
12 (Half Unfrozen)	56.7 GB	1567.4	68.8%	69.7%
0 (Fully Unfrozen)	63.6 GB	1545.7	68.5%	69.2%
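The trade-off in this table is accuracy against memory. A small sketch, using only the table values, compares each setting with the fully frozen encoder; the data layout and function name are illustrative.

```python
# Values copied from the table above, keyed by the first unfrozen block.
UNFREEZE_ABLATION = {
    24: {"vram_gb": 49.6, "mme_p": 1515.2},  # fully frozen
    12: {"vram_gb": 56.7, "mme_p": 1567.4},  # latter half unfrozen
    0: {"vram_gb": 63.6, "mme_p": 1545.7},   # fully unfrozen
}
FROZEN = UNFREEZE_ABLATION[24]


def versus_frozen(start_block: int) -> tuple:
    """Return (extra VRAM in GB, MME-P gain) relative to the frozen encoder."""
    row = UNFREEZE_ABLATION[start_block]
    extra_vram = round(row["vram_gb"] - FROZEN["vram_gb"], 1)
    gain = round(row["mme_p"] - FROZEN["mme_p"], 1)
    return extra_vram, gain


best_start = max(UNFREEZE_ABLATION, key=lambda k: UNFREEZE_ABLATION[k]["mme_p"])

if __name__ == "__main__":
    print(best_start, versus_frozen(12), versus_frozen(0))
```

On these numbers, unfreezing from block 12 gains 52.2 MME-P for 7.1 GB of extra memory, while unfreezing everything costs 14.0 GB extra and gains only 30.5.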

Key Findings

- Replacing only 3.5% of the SFT data with high-quality captions brings consistent improvements across multiple LMMs (LLaVA: MME +222.8, LLaVA-1.5-13B: MME +22.0, Qwen-VL-Chat: MME +22.3).
- The quality of captions during the pre-training stage has a larger impact: using the same images, the ShareGPT4V-PT captions improve MME by 18.2 points more than BLIP captions.
- High-quality captions allow modality alignment to be achieved at a relatively small data scale: significant improvements are observed with 100K samples, and performance gradually saturates beyond 1000K.
- Unfreezing the latter half of the vision encoder, rather than all of it or none of it, is the optimal strategy.

Highlights & Insights

- Data Quality > Model Architecture: The core message of this paper is that, with a simple architecture and a small number of parameters, high-quality data alone can outperform larger models. The lesson applies well beyond this particular model.
- Distillation Concept of Share-Captioner: Training a substitute caption model on 100K GPT4-Vision captions is a classic case of distilling a strong model into a cheaper one and then using the cheaper model to scale up the data.
- Source-Specific Prompt Design: Tailoring prompts to each image source is an effective way to improve caption coverage.
- Vision Encoder Unfreezing Strategy: When high-quality data is available, unfreezing the latter half of the encoder is a transferable training trick.

Limitations & Future Work

- Caption generation relies on GPT4-Vision, and the cost of the initial 100K captions remains high.
- The approach is verified only at the 7B scale; scaling behavior with larger models or larger data regimes is not explored.
- The quality of Share-Captioner is bounded by GPT4-Vision, so it cannot surpass the teacher model.
- Although the dataset is diverse, it is still dominated by natural images, with limited coverage of scenarios such as documents and charts.
- The interaction between captions and other types of training data (e.g., grounding, OCR) is not explored.

Comparison with Related Work

- vs LLaVA-Instruct: LLaVA lets GPT-4 "imagine" image content to generate captions, which inevitably leads to hallucinations; ShareGPT4V has GPT4-Vision directly inspect and describe images, achieving higher accuracy.
- vs LaCLIP/VeCLIP: These works rewrite or merge captions using LLMs to enhance CLIP training, but are limited by the original caption quality and by LLM hallucinations; ShareGPT4V produces high-quality descriptions directly from the images.
- vs BLIP-LCS: BLIP-generated captions average 54 characters, while ShareGPT4V captions average 942 characters, a large gap in the amount of information per caption.

Rating

Novelty: ⭐⭐⭐☆☆ The method itself is not complicated (using a strong model to generate good data), but the insight that "data quality determines the upper bound of modality alignment" is valuable.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across 11 benchmarks, with systematic and detailed ablation studies (training stages, data volume, encoder unfreezing depth, caption quality comparisons).
Writing Quality: ⭐⭐⭐⭐☆ Clear logic, rich charts, and a complete storyline.
Value: ⭐⭐⭐⭐⭐ The ShareGPT4V dataset has been adopted as pre-training data by several subsequent LMM works, giving it substantial practical impact.