# Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs
Conference: ICCV 2025 · arXiv: 2503.20309 · Area: Multimodal VLM
Keywords: Multimodal large language models, preference alignment, instruction following, DPO, hallucination mitigation
Authors: Zitian Wang, Yue Liao, Kang Rong, Fengyun Rao, Yibo Yang, Si Liu (Beihang University, NUS, KAUST)
## TL;DR
This paper proposes the Instruction-oriented Preference Alignment (IPA) framework, which anchors alignment signals to instruction completion efficacy rather than hallucination factors alone, via an automated preference construction mechanism and a progressive preference data collection pipeline. IPA achieves consistent improvements on Qwen2VL-7B across 9 benchmarks spanning hallucination evaluation, general VQA, and text comprehension.
## Background & Motivation
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language understanding; however, noisy or ambiguous annotations in SFT data can lead models to learn incorrect information. Preference alignment has thus emerged as a post-training strategy for correcting such behaviors.
Existing preference alignment methods, however, suffer from fundamental limitations:
- Over-focus on hallucination factors: Methods such as RLHF-V, RLAIF-V, and Topic-Overwrite primarily construct preference pairs by injecting or detecting hallucinations, restricting improvements to hallucination mitigation.
- Neglect of intrinsic quality dimensions: When multiple responses show no significant difference in hallucination, these methods lack the ability to discriminate between them.
- Dependence on human or commercial models: Manual annotation is costly, and reliance on GPT-4V incurs substantial financial overhead.
The core question raised by the authors is: What factors in preference pairs truly determine the direction of alignment?
The answer is instruction completion efficacy: whether a response adequately fulfills the core requirements of the instruction. For example, given the question "Is there a cat in the image?", both "Yes." and "There is a gray cat in the lower-left corner of the image." are equally free of hallucination, yet the latter satisfies the instruction better through grounded observation and completeness of detail.
## Method
### Overall Architecture
IPA consists of two components: an automated preference construction mechanism and a progressive preference data collection pipeline.
### Stage 1: Response Sampling
For each sample \(s = (V, I, r^{ref})\), where \(V\) is the image, \(I\) the instruction, and \(r^{ref}\) the reference response, two sampling strategies are employed:
- Normal sampling: \(r^{norm} \sim \pi_G(\cdot | V, I; \theta_{\pi_G})\)
- Contrastive sampling (inspired by VCD): \(r^{cont} \sim \pi_G(\cdot | t(V), I; \theta_{\pi_G}), \quad t \sim \mathcal{T}\)
where \(t\) is a perturbation operator randomly selected from the perturbation set \(\mathcal{T}\) (noise, blur, scaling, etc.). Contrastive sampling is intended to diversify the deficient response patterns that stem from the limited robustness of \(\pi_G\).
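To make Stage 1 concrete, below is a minimal Python sketch of the two strategies. The `mllm.generate(image, instruction)` wrapper and the exact perturbation parameters are illustrative assumptions, not the paper's implementation:

```python
# Sketch of normal vs. contrastive response sampling. `mllm` is a hypothetical
# wrapper exposing generate(image, instruction) -> str; perturbation strengths
# and the sample count n are placeholder choices.
import random
import numpy as np
from PIL import Image, ImageFilter

def perturb(image: Image.Image) -> Image.Image:
    """Apply one perturbation t drawn at random from T (noise, blur, scaling)."""
    t = random.choice(["noise", "blur", "scale"])
    if t == "noise":
        arr = np.asarray(image).astype(np.int16)
        arr = arr + np.random.randint(-40, 41, arr.shape)   # additive pixel noise
        return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    if t == "blur":
        return image.filter(ImageFilter.GaussianBlur(radius=4))
    w, h = image.size
    return image.resize((max(w // 4, 1), max(h // 4, 1))).resize((w, h))  # lossy rescale

def sample_responses(mllm, image, instruction, n=4):
    normal = [mllm.generate(image, instruction) for _ in range(n)]                # r^norm
    contrastive = [mllm.generate(perturb(image), instruction) for _ in range(n)]  # r^cont
    return normal + contrastive
```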
### Stage 2: Reflective Revision
Responses are refined through two sub-stages:
- Diagnostic feedback generation: The revision model \(\pi_R\) identifies critical deficiencies in the initial response: \(fb \sim \pi_R(\cdot | V, I, r, I^{fb}; \theta_{\pi_R})\)
- Feedback-driven refinement: The response is refined based on the generated feedback: \(r^{rev} \sim \pi_R(\cdot | V, I, r, fb, I^{rf}; \theta_{\pi_R})\)
Beyond correcting visual hallucinations, the revision stage more importantly expands the depth of task-relevant information.
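A sketch of the two sub-stages under the same hypothetical `generate` interface; the two prompts merely paraphrase the roles of \(I^{fb}\) and \(I^{rf}\) and are not the paper's exact wording:

```python
# Sketch of reflective revision: diagnose deficiencies, then refine. `reviser` is
# the hypothetical MLLM wrapper from the Stage 1 sketch (pi_R in the text).
FEEDBACK_PROMPT = (  # stands in for I^fb
    "Instruction: {instruction}\nResponse: {response}\n"
    "Identify the critical deficiencies of this response: factual errors about "
    "the image, missing task-relevant details, or failures to follow the instruction."
)
REFINE_PROMPT = (    # stands in for I^rf
    "Instruction: {instruction}\nResponse: {response}\nFeedback: {feedback}\n"
    "Rewrite the response to resolve the feedback while staying faithful to the image."
)

def reflective_revision(reviser, image, instruction, response):
    fb = reviser.generate(image, FEEDBACK_PROMPT.format(
        instruction=instruction, response=response))               # fb ~ pi_R(. | V, I, r, I^fb)
    r_rev = reviser.generate(image, REFINE_PROMPT.format(
        instruction=instruction, response=response, feedback=fb))  # r^rev ~ pi_R(. | V, I, r, fb, I^rf)
    return fb, r_rev
```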
### Stage 3: Instruction-Oriented Verification
A key design choice is that the verification process decouples alignment-signal extraction from response-format constraints:
The verification model evaluates four dimensions:

- Logical entailment: Whether the reference answer can be logically derived from the response
- Detail completeness: Whether critical details are missing
- Contradiction detection: Whether the response contradicts the reference
- Instruction compliance: Whether the response follows the instruction
\(v = 1\) indicates all conditions are satisfied (preferred); \(v = 0\) indicates certain failure patterns are present (dispreferred). Preference pairs \(\mathcal{P}^{resp} = \{(r^w, r^l)\}\) are constructed accordingly.
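A sketch of the verification step and pair construction follows. The prompt paraphrases the four checks, and pairing every verified response with every rejected one is an assumption of this sketch rather than a detail from the paper:

```python
# Binary verification (v in {0, 1}) and preference-pair construction; `verifier`
# is the same hypothetical MLLM wrapper used in the earlier sketches.
VERIFY_PROMPT = (
    "Instruction: {instruction}\nReference answer: {reference}\nResponse: {response}\n"
    "Answer 'yes' only if ALL of the following hold: (1) the reference answer can be "
    "logically derived from the response; (2) no critical detail is missing; (3) the "
    "response does not contradict the reference; (4) the response follows the instruction."
)

def verify(verifier, image, instruction, reference, response) -> int:
    out = verifier.generate(image, VERIFY_PROMPT.format(
        instruction=instruction, reference=reference, response=response))
    return int(out.strip().lower().startswith("yes"))

def build_pairs(verifier, image, instruction, reference, candidates):
    """Split candidates by verdict, then pair winners with losers: P^resp = {(r^w, r^l)}."""
    verdicts = [verify(verifier, image, instruction, reference, r) for r in candidates]
    wins = [r for r, v in zip(candidates, verdicts) if v == 1]
    losses = [r for r, v in zip(candidates, verdicts) if v == 0]
    return [(w, l) for w in wins for l in losses]
```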
### Progressive Preference Collection Pipeline
Round 1: Preference construction is executed on the seed dataset to collect initial preference data. Hard samples, i.e., those for which no winning response \(\mathcal{R}^{win}\) can be generated, are retained to form \(\mathcal{D}_2\).
Round 2 (Self-Evolution): The generator and reviser are optimized via DPO using Round 1 preference data: \(\pi_G^+ = \text{DPO}(\pi_G, \mathcal{P}_1^{resp}), \quad \pi_R^+ = \text{DPO}(\pi_R, \mathcal{P}_1^{fb})\). The enhanced models then reprocess the hard samples.
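For reference, these updates use the standard DPO objective (a fact about DPO itself, not a formula restated from the paper), where \(\sigma\) is the logistic function, \(\pi_{\text{ref}}\) the frozen pre-update policy, and \(\beta\) the KL-strength hyperparameter:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(V, I, r^w, r^l)}\left[\log \sigma\left(\beta \log \frac{\pi(r^w \mid V, I)}{\pi_{\text{ref}}(r^w \mid V, I)} - \beta \log \frac{\pi(r^l \mid V, I)}{\pi_{\text{ref}}(r^l \mid V, I)}\right)\right]$$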
Round 3 (Reference-Guided): For remaining hard samples, the reference response \(r^{ref}\) is introduced during the revision stage: \(fb \sim \pi_R(\cdot | V, I, r, r^{ref}, I^{fb}; \theta_{\pi_R})\)
A total of 89K preference pairs are collected.
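Putting the rounds together, a high-level driver could look like the sketch below; `run_round` (composing the three stages above) and `dpo_update` are hypothetical stand-ins, and returning pairs keyed by "resp" / "fb" mirrors the \(\mathcal{P}^{resp}\) / \(\mathcal{P}^{fb}\) notation:

```python
# Hypothetical orchestration of the three collection rounds (not the paper's code).
# run_round(samples, generator, reviser, use_reference) -> (pairs_dict, hard_samples)
# dpo_update(model, pairs) -> DPO-optimized model
def progressive_pipeline(seed_samples, generator, reviser, run_round, dpo_update):
    # Round 1: seed data -> initial pairs; unresolved samples form D_2.
    pairs_1, hard = run_round(seed_samples, generator, reviser, use_reference=False)
    # Round 2 (self-evolution): strengthen both models on Round 1 pairs, then retry.
    generator = dpo_update(generator, pairs_1["resp"])   # P_1^resp
    reviser = dpo_update(reviser, pairs_1["fb"])         # P_1^fb
    pairs_2, hard = run_round(hard, generator, reviser, use_reference=False)
    # Round 3 (reference-guided): expose r^ref to the reviser for leftover hard samples.
    pairs_3, _ = run_round(hard, generator, reviser, use_reference=True)
    return [pairs_1, pairs_2, pairs_3]
```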
### Training Details
- Backbone model: Qwen2VL-7B
- Alignment method: DPO + LoRA
- Training duration: 1 epoch
- Dataset: Multi-source multimodal samples covering VQA, text comprehension, and open-ended instructions
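The recipe itself is standard preference training; in practice an off-the-shelf trainer (e.g., TRL's `DPOTrainer` with a PEFT LoRA config) realizes it. For clarity, here is a minimal PyTorch sketch of the DPO loss on sequence-level log-probabilities; `beta=0.1` is a common default, not a value reported here:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_w: torch.Tensor, pi_l: torch.Tensor,
             ref_w: torch.Tensor, ref_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Inputs are summed log-probs of the chosen (w) and rejected (l) responses
    under the trainable policy (pi_*) and the frozen reference model (ref_*)."""
    logits = beta * ((pi_w - pi_l) - (ref_w - ref_l))  # implicit reward margin
    return -F.logsigmoid(logits).mean()
```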
## Key Experimental Results
### Main Results: Consistent Improvements Across Benchmarks
| Model | HallBench | POPE | MMHal | MMMU | MMStar | MMVet | MME | LLaVABench | OCRBench |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2VL-7B | 50.0 | 86.2 | 3.6 | 53.8 | 60.7 | 63.1 | 1676 | 76.6 | 86.2 |
| + IPA | 54.3 | 87.2 | 3.7 | 54.6 | 61.7 | 64.2 | 1687 | 84.0 | 87.3 |
| Gain | +4.3 | +1.0 | +0.1 | +0.8 | +1.0 | +1.1 | +11 | +7.4 | +1.1 |
| Qwen2.5VL-7B | 54.7 | 86.2 | 3.7 | 57.6 | 64.7 | 65.6 | 1694 | 75.2 | 87.5 |
| + IPA | 55.7 | 86.6 | 3.9 | 59.8 | 66.5 | 68.3 | 1707 | 87.3 | 88.2 |
Consistent gains on Qwen2.5VL-7B further demonstrate the cross-model transferability of the preference data.
### Comparison with Existing Methods
| Method | HallBench | POPE | MMMU | MMStar | MMVet | MME | LLaVABench | OCRBench |
|---|---|---|---|---|---|---|---|---|
| RLHF-V | 50.2 | 86.4 | 52.9 | 61.0 | 60.4 | 1682 | 76.6 | 86.3 |
| RLAIF-V | 50.2 | 86.5 | 53.8 | 61.1 | 58.2 | 1674 | 78.8 | 86.4 |
| VLFeedback | 52.2 | 84.7 | 53.7 | 60.7 | 60.6 | 1682 | 81.4 | 86.6 |
| MMPR | 53.4 | 86.4 | 54.3 | 58.5 | 61.7 | 1681 | 83.5 | 86.3 |
| IPA (Ours) | 54.3 | 87.2 | 54.6 | 61.7 | 64.2 | 1687 | 84.0 | 87.3 |
- Hallucination-oriented methods (RLHF-V, RLAIF-V, Topic-Overwrite) tend to exhibit performance trade-offs across multi-dimensional evaluations.
- IPA achieves consistent improvements across all dimensions, with particularly notable gains over the Qwen2VL-7B baseline on MMVet (+1.1) and LLaVABench (+7.4).
### Ablation Study
| Configuration | Avg. | HallBench | MMMU | MMStar | MMVet | OCRBench |
|---|---|---|---|---|---|---|
| Baseline | 68.1 | 50.0 | 53.8 | 60.7 | 63.1 | 86.2 |
| Round 1 w/o CS | 69.1 | 52.3 | 53.4 | 61.2 | 61.7 | 87.0 |
| Round 1 | 70.1 | 52.9 | 54.2 | 61.9 | 62.7 | 87.0 |
| + Round 2 | 70.4 | 53.7 | 54.6 | 61.7 | 63.8 | 86.9 |
| + Round 3 w/ RGS | 69.8 | 53.4 | 54.7 | 61.1 | 62.7 | 87.2 |
| + Round 3 | 70.5 | 54.3 | 54.6 | 61.7 | 64.2 | 87.3 |
Key findings:

- Contrastive sampling (CS) contributes an average gain of 1.0 point (Round 1 vs. Round 1 w/o CS); Round 1 as a whole lifts HallBench by +2.9 over the baseline.
- Self-evolution (Round 2) yields an additional 0.3-point improvement.
- Reference-guided sampling (RGS) induces simple imitation behavior, causing a 0.7-point drop relative to the full Round 3.
- Reference-guided feedback (the full Round 3) effectively recovers hard samples.
## Highlights & Insights
- Paradigm shift: Moving from hallucination-oriented to instruction-oriented preference alignment, anchored to instruction completion capability, achieves comprehensive performance improvement rather than single-dimensional optimization.
- Elegant verification design: Preference judgment is reformulated as a binary verification task, establishing clear decision boundaries by decoupling format variation.
- Progressive collection pipeline: The three-round progressive strategy maximizes the utilization of hard samples, with 89K preference pairs covering diverse multi-source multimodal scenarios.
- Cross-model generalization: Consistent improvements on Qwen2.5VL-7B demonstrate the generality of the preference signals.
- Effectiveness of contrastive sampling: Visual perturbations not only increase negative-sample diversity but also expose robustness weaknesses in the model's multimodal understanding.
## Limitations & Future Work
- Validation on larger-scale models (e.g., tens of billions of parameters) is absent due to computational resource constraints.
- The verification model and the aligned model both use Qwen2VL-7B, which may introduce bias.
- Preference data is primarily constructed based on Qwen2VL, requiring additional filtering when applied to weaker models (e.g., LLaVA-1.5-7B).
- Online preference optimization approaches (e.g., online DPO or RLHF with PPO) are not explored.
## Related Work & Insights
- Difference from MMPR: MMPR focuses on CoT reasoning and formatted answer matching, whereas IPA targets instruction completion capability.
- Comparison with VLFeedback: VLFeedback relies on GPT-4V annotation and optimizes for helpfulness, faithfulness, and ethics; IPA is fully automated and centers on instruction compliance.
- Implications for future work: The instruction-oriented preference construction paradigm is generalizable to arbitrary multimodal tasks and MLLM architectures.
## Rating
- Novelty: ⭐⭐⭐⭐ — The instruction-oriented perspective on preference alignment is novel, and the verification mechanism is elegantly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive coverage of 9 benchmarks, thorough ablations, and cross-model validation strengthen the claims.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with well-articulated motivation; Figure 1 provides an intuitive and compelling contrast.
- Value: ⭐⭐⭐⭐ — Provides a scalable paradigm for preference data construction with practical applicability to the MLLM community.