MM-IFEngine: Towards Multimodal Instruction Following

  • Conference: ICCV 2025
  • arXiv: 2504.07957
  • Area: Multimodal VLM
  • Keywords: Instruction Following, MLLM, SFT, DPO, Benchmark, Constrained Generation, Multimodal Evaluation

TL;DR

This paper proposes the MM-IFEngine pipeline, which systematically generates high-quality image–instruction pair data (in both SFT and DPO variants) and constructs the MM-IFEval benchmark, achieving significant improvements in multimodal instruction following for MLLMs.

Background & Motivation

Multimodal large language models (MLLMs) must accurately follow user-specified instructions in real-world applications—such as producing JSON-formatted outputs, adhering to word limits, or incorporating specific keywords. Three key bottlenecks currently impede progress:

Scarcity of training data: Open-source MLLMs lack high-quality multimodal instruction-following training data, resulting in poor performance under complex constraints.

Oversimplified existing benchmarks: Benchmarks such as MIA-Bench contain only simple atomic instructions (averaging 2.6 constraints per question) with weak correlation between constraints and visual content, causing most models to exceed 80% accuracy and rendering the benchmarks unable to differentiate model capabilities.

Imprecise evaluation strategies: Existing methods rely on LLM-as-a-judge, which yields unreliable judgments for constraints requiring precise verification, such as word counting or format checking.

These three limitations have stalled progress in MLLM instruction following, necessitating simultaneous breakthroughs in data generation, benchmark construction, and evaluation strategy.

Method

Overall Architecture

MM-IFEngine is an end-to-end image–instruction pair generation pipeline consisting of three stages:

  1. Image Filter: High-quality images are selected from datasets such as CC3M and ALLaVA, with low-resolution and semantically impoverished images removed. For unannotated image-only datasets, IC9600 and RAM metrics are applied to select semantically rich natural scene images.
  2. Task Generation: For images without QA pairs, task instructions are generated by sampling from a predefined pool of 16 task descriptions, with GPT-4o producing a tailored task instruction list for each image. For datasets with existing QA pairs (e.g., ALLaVA), questions containing few-shot examples or answer options are filtered out using regular expressions and length constraints.
  3. Constraints Integration: Constraints are sampled from a pool of 6 major categories and 32 subcategories (text length, mathematical requirements, format, rhetorical logic, action requirements, and keywords). An LLM generates specific constraint content and verifies its compatibility with the corresponding task instruction.
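A minimal sketch of the constraint-integration step, assuming a toy constraint pool and hypothetical names (`TrainingSample`, `integrate_constraints`) that are not from the paper's code; the real pipeline uses an LLM to write the concrete constraint text and to verify compatibility with the task.

```python
import random
from dataclasses import dataclass

# Toy pool: the paper's 6 major categories with one illustrative constraint
# each (the actual pool spans 32 subcategories with LLM-generated content).
CONSTRAINT_POOL = {
    "text length": "answer in at most 60 words",
    "mathematical": "mention exactly three objects",
    "format": "respond in valid JSON",
    "rhetorical logic": "include one metaphor",
    "action": "end with a question to the reader",
    "keywords": "use the word 'vendor' at least once",
}

@dataclass
class TrainingSample:
    image_id: str
    task: str
    constraints: list[str]

def integrate_constraints(image_id: str, task: str, k: int,
                          rng: random.Random) -> TrainingSample:
    """Stage 3 sketch: attach k constraints from distinct categories to a
    task instruction produced in stage 2."""
    categories = rng.sample(sorted(CONSTRAINT_POOL), k=k)
    return TrainingSample(image_id, task,
                          [CONSTRAINT_POOL[c] for c in categories])

sample = integrate_constraints("cc3m_000123", "Describe the busiest stall.",
                               k=4, rng=random.Random(0))
```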

Key Designs

MM-IFInstruct-23k (SFT Dataset):

  • Responses are generated by InternVL2.5-78B-MPO, with post-processing retaining only samples achieving a constraint satisfaction rate ≥ 80%.
  • The final dataset contains 23k items drawn from CC3M (16k), ALLaVA (6k), and MultiUI/Geo170k/ChartQA (4k).
  • Each sample contains 3–12 constraints, with an average constraint count substantially exceeding that of existing datasets.
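A sketch of the ≥ 80% post-processing filter, assuming sample objects with `response` and `constraints` fields and a pluggable `check` verifier (rule-based function or LLM judge); these names are my assumptions, not the paper's released code.

```python
from typing import Callable

def satisfaction_rate(response: str, constraints: list[str],
                      check: Callable[[str, str], bool]) -> float:
    """Fraction of constraints the response satisfies."""
    if not constraints:
        return 1.0
    return sum(check(response, c) for c in constraints) / len(constraints)

def filter_for_sft(samples, check, threshold: float = 0.8):
    """Keep only samples meeting the paper's >= 80% satisfaction cutoff."""
    return [s for s in samples
            if satisfaction_rate(s.response, s.constraints, check) >= threshold]
```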

MM-IFDPO-23k (DPO Preference Dataset):

  • Positive samples are taken directly from the high-quality SFT data above.
  • Negative samples are generated by Qwen2-VL-7B-Instruct under four settings:
      ◦ Image provided, with a random 1/3 of constraints removed.
      ◦ Image provided, with a random 2/3 of constraints removed.
      ◦ Image provided, with all constraints removed.
      ◦ Full prompt provided, but without the image.
  • Ablation experiments show that removing 100% of constraints yields the best negative samples, as this maximizes the semantic gap between positive and negative samples.
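A sketch of how the weakened prompts for rejected responses might be built; `degraded_prompt`, the prompt layout, and the `generate()` wrapper are my assumptions, not the paper's code.

```python
import random

def degraded_prompt(task: str, constraints: list[str], drop_frac: float,
                    rng: random.Random) -> str:
    """Drop `drop_frac` of the constraints (1.0 = remove all, the best
    setting in the ablation) while keeping the task itself intact."""
    kept = rng.sample(constraints,
                      k=round(len(constraints) * (1.0 - drop_frac)))
    if not kept:
        return task
    return task + "\nConstraints:\n" + "\n".join(f"- {c}" for c in kept)

# Usage sketch, with a hypothetical generate() wrapping Qwen2-VL-7B-Instruct:
#   chosen   = sft_sample.response                                # all constraints met
#   rejected = generate(degraded_prompt(task, constraints, 1.0, rng), image)
```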

MM-IFEval Benchmark:

  • 400 questions (300 Compose-Level + 100 Perception-Level).
  • 32 constraint subcategories, averaging 5.1 constraints per question.
  • Compose-Level: compositional constraints on output format, keywords, etc.
  • Perception-Level: requires visual perception ability, covering natural scenes, UI interfaces, charts, and mathematical expressions.
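For intuition, a hypothetical Compose-Level item in the spirit of the benchmark; the wording below is illustrative, not drawn from MM-IFEval itself.

```python
example_item = {
    "level": "compose",
    "image": "street_market.jpg",
    "instruction": "Describe the busiest stall in the photo.",
    "constraints": [
        "respond in valid JSON with keys 'stall' and 'goods'",  # format
        "use at most 60 words",                                 # text length
        "include the keyword 'vendor'",                         # keywords
        "name exactly two kinds of goods",                      # mathematical
        "end by asking the reader a question",                  # action
    ],
}
```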

Hybrid Evaluation Strategy:

  1. Rule-based verification: Predefined functions check constraints that admit precise verification, such as format and word count.
  2. Direct LLM judgment: Evaluates constraints that do not require precise counting, such as the inclusion of specific vocabulary.
  3. Comparative LLM judgment: For subjective constraints such as tone and style, two responses are generated (one with and one without the constraint) and compared.
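The rule-based tier is easy to picture; a minimal sketch of three verifier functions matching the constraint types named above (my own, not the paper's released code):

```python
import json
import re

def check_word_limit(response: str, max_words: int) -> bool:
    """Precise count an LLM judge often gets wrong: word limits."""
    return len(response.split()) <= max_words

def check_json(response: str) -> bool:
    """Format check: does the response parse as JSON?"""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def check_keyword(response: str, keyword: str) -> bool:
    """Whole-word, case-insensitive keyword presence."""
    return re.search(rf"\b{re.escape(keyword)}\b", response, re.I) is not None
```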

Loss & Training

  • SFT stage: standard cross-entropy loss.
  • DPO stage: standard DPO loss; the implicit KL anchoring to the reference model preserves the model's original generalization ability.
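For reference, the standard DPO objective (a general fact about DPO rather than a detail unique to this paper), with $y_w$ the constraint-satisfying response, $y_l$ the degraded one, and $\pi_{\mathrm{ref}}$ the frozen reference model:

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

The $\beta$-scaled log-ratios implicitly constrain $\pi_\theta$ to stay close (in KL) to $\pi_{\mathrm{ref}}$, which is why general VQA ability survives the preference tuning.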

Key Experimental Results

Main Results: Instruction-Following Benchmark Improvements

| Model | MM-IFEval | MIA-Bench | IFEval | Average |
|---|---|---|---|---|
| Qwen2-VL-7B (base) | 42.0 | 80.5 | 47.4 | 56.6 |
| + MM-IFInstruct-23k (SFT) | 52.3 (+10.3) | 87.7 (+7.2) | 52.6 (+5.2) | 64.2 (+7.6) |
| + MM-IFDPO-23k (DPO) | 52.2 (+10.2) | 88.1 (+7.6) | 59.7 (+12.3) | 66.7 (+10.1) |
| LLaVA-NeXT-Llama3-8B (base) | 39.7 | 83.3 | 50.7 | 57.9 |
| + MM-IFDPO-23k (DPO) | 49.3 (+9.6) | 90.0 (+6.7) | 69.1 (+18.4) | 69.5 (+11.6) |

VQA Benchmark Retention

| Model | MMMU | MMBench | MMStar | AI2D | OCRBench | Average |
|---|---|---|---|---|---|---|
| Qwen2-VL-7B (base) | 53.9 | 81.0 | 60.8 | 82.9 | 86.7 | 72.3 |
| + MM-IFDPO-23k | 54.0 | 81.3 | 58.5 | 83.3 | 86.8 | 72.4 |

VQA performance is nearly unaffected after DPO training, owing to DPO's implicit KL regularization toward the reference model.

MM-IFEval Leaderboard Highlights

| Model | C-Level | P-Level | Average |
|---|---|---|---|
| GPT-4o | 71.5 | 44.0 | 64.6 |
| Qwen2-VL-72B | 53.4 | 43.0 | 50.8 |
| Qwen2-VL-7B + DPO | 55.2 | 43.0 | 52.2 |

After DPO fine-tuning, the 7B model surpasses the original 72B model (52.2 vs. 50.8 average), a 24.3% relative improvement over its own 42.0 baseline.

Ablation Study on DPO Negative Sample Strategy

Progressively increasing the proportion of removed constraints from 33% → 66% → 100% yields monotonically improving results, indicating that widening the semantic gap between positive and negative samples is more effective for DPO training. The strategy of removing the image yields the weakest performance.

Highlights & Insights

  1. Systematic solution: The work simultaneously addresses three major bottlenecks—data, benchmarks, and evaluation—forming a complete closed loop.
  2. DPO substantially outperforms SFT: Negative samples are constructed by removing constraints, and the KL anchoring preserves generalization ability, yielding larger gains than SFT (+10.1 vs. +7.6 average across the three benchmarks; +12.3 vs. +5.2 on IFEval).
  3. Smaller models surpass larger ones: The 7B model after DPO fine-tuning outperforms the original 72B model on MM-IFEval, demonstrating the value of high-quality instruction-following data.
  4. Perception-Level remains challenging: Even GPT-4o achieves only 44.0 on P-Level, indicating that visual constraint understanding is far from solved.
  5. Elegant constraint granularity design: The hierarchical constraint system with 6 major categories and 32 subcategories balances coverage and controllability.

Limitations & Future Work

  1. Perception-Level improvements are limited: DPO fine-tuning primarily improves Compose-Level performance, with little gain on P-Level.
  2. Data generation relies on GPT-4o, which incurs high cost and may introduce associated biases.
  3. The benchmark is relatively small (only 400 questions), limiting statistical power.
  4. Validation is conducted only on 7–8B scale models; gains on larger models remain untested.

Related Work

  • LLM instruction following: Text-based instruction-following benchmarks such as IFEval, CFBench, and InFoBench.
  • Multimodal instruction following: Multimodal benchmarks such as MIA-Bench and VisIT-Bench, which feature simple constraints and coarse evaluation.
  • Instruction fine-tuning data: Synthetic datasets such as ShareGPT4V and ALLaVA, which lack data specifically targeting instruction following.

Rating

  • Novelty: ⭐⭐⭐⭐ — Full-stack innovation spanning data generation, benchmarking, and evaluation strategy; the hybrid evaluation strategy is a particular highlight.
  • Practicality: ⭐⭐⭐⭐⭐ — The dataset and evaluation tools are fully open-sourced and can be directly applied to improve any MLLM.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive multi-benchmark validation with sufficient ablation studies, though the benchmark scale is somewhat limited.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure with highly informative figures and tables.