
S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

Conference: ACL 2026
arXiv: 2604.18512
Code: None
Area: Multimodal VLM / Preference Alignment
Keywords: Multi-image reasoning, DPO preference optimization, visual search, difficulty grading, VLM alignment

TL;DR

This paper proposes a Simple-to-Hard (S2H) DPO framework that constructs multi-image preference data across three progressively harder levels (anchored reasoning → cross-image comparison → global visual search), systematically improving VLM multi-image reasoning while preserving single-image performance.

Background & Motivation

State of the Field: VLMs have achieved remarkable progress in single-image understanding, yet effective reasoning across multiple images remains challenging. Multi-image reasoning requires localizing relevant images, comparing, and integrating information from multiple visual sources.

Limitations of Prior Work: Existing multi-image alignment methods (e.g., MIA-DPO) primarily focus on "anchored reasoning" — where the question pre-specifies which image to attend to (e.g., "In image 3, ...") — thereby bypassing global visual search and autonomous cross-image comparison, two critical capabilities. This leaves models underperforming in more complex multi-image scenarios.

Root Cause: MIA-DPO trains exclusively on Level 1 data (single-image anchored questions), neglecting the higher-order reasoning capabilities required at Level 2 (multi-image anchored comparison) and Level 3 (global visual search). Different levels induce qualitatively distinct reasoning patterns, and low-level training fails to generalize to higher levels.

Paper Goals: To explicitly define the capability hierarchy required for multi-image reasoning and construct preference data covering all levels to comprehensively improve VLM multi-image reasoning.

Starting Point: The paper defines a three-level capability hierarchy — Level 1 (reasoning over a pre-specified single image), Level 2 (comparing pre-specified multiple images), and Level 3 (autonomously searching all images to locate those satisfying a given condition) — and constructs corresponding chosen/rejected pairs for DPO training.

Core Idea: Chosen/rejected pairs are created via prompt-driven complexity rather than model-specific hallucinations, making the dataset model-agnostic and covering the full reasoning spectrum from simple to hard.

Method

Overall Architecture

S2H-DPO converts existing single-image data into multi-image preference data at three levels, with 20K samples per level. Level 1 constructs preference pairs using distractor images and model hallucinations; Level 2 designs kinship recognition and visual arithmetic tasks to test cross-image comparison; Level 3 designs global visual search tasks requiring the model to search all images before localizing the target. All levels are trained jointly.
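
As a rough illustration of the resulting data, one Level 2 preference record might look like the following (a hypothetical sketch; the field names, file names, and wording are assumptions, not the paper's released format):

```python
# Hypothetical layout of a single Level 2 (cross-image comparison) preference
# sample; keys and values are illustrative only.
level2_sample = {
    "level": 2,  # 1 = single-image anchored, 2 = cross-image comparison, 3 = global search
    "images": ["img_001.jpg", "img_002.jpg", "img_003.jpg", "img_004.jpg"],
    "prompt": "Are the cars in image 1 and image 3 the same color?",
    "chosen": "No. The car in image 1 is red, while the car in image 3 is white.",
    "rejected": "Yes, both cars are red.",  # deterministically wrong, derived from labels
}
```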

Key Designs

  1. Three-Level Reasoning Capability Hierarchy:

    • Function: Systematically defines the complete capability spectrum for multi-image reasoning.
    • Mechanism: Level 1 (single-image anchored) — "What color is the car in image 2?" requires attending only to the specified image; Level 2 (multi-image anchored comparison) — "Are the cars in images 1 and 3 the same color?" requires cross-image relational comparison; Level 3 (global search) — "Which image contains a white car?" requires inspecting all images to locate the target. Each level strictly demands more capabilities than the previous.
    • Design Motivation: Training solely on Level 1, as in MIA-DPO, is insufficient — different levels induce qualitatively distinct reasoning patterns, and low-level training does not generalize to higher levels.
  2. Model-Agnostic Chosen/Rejected Construction:

    • Function: Eliminates the need to regenerate data for each new model.
    • Mechanism: Level 1 uses distractor images to trigger hallucinations (identical to MIA-DPO); Level 2 leverages pre-labeled datasets (kinship datasets, synthetic visual arithmetic) to deterministically generate correct/incorrect pairs; Level 3 samples target-concept images from ImageNet paired with random distractor images, where chosen responses accurately describe the target image and rejected responses provide generic, non-targeted descriptions. Semantic similarity filtering via CLIP/MPNet removes low-quality pairs (a minimal filtering sketch follows this list).
    • Design Motivation: MIA-DPO relies on model-specific hallucinations to generate rejected samples, necessitating regeneration for each new model. The prompt-driven approach produces contrastive pairs through task design itself, making it universally applicable across models.
  3. Joint Multi-Level Training:

    • Function: Simultaneously learns reasoning capabilities across all levels.
    • Mechanism: Data from all three levels are mixed and trained with the standard DPO loss \(L_{\text{DPO}} = -\mathbb{E}[\log \sigma(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)})]\). Evaluation is conducted on LLaVA-v1.5-7B, Qwen2.5-VL-7B, and Qwen3-VL-2B.
    • Design Motivation: Ablation experiments demonstrate that joint training outperforms training on any single level, as different reasoning levels mutually reinforce each other.
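
The CLIP/MPNet filter mentioned in design 2 is not released; below is a minimal sketch of the text-side check, assuming an MPNet sentence-embedding model and an illustrative near-duplicate threshold (the image-text check with CLIP would follow the same pattern):

```python
# Sketch of the response-level similarity filter; the model name, threshold,
# and function signature are assumptions, not the paper's code.
from sentence_transformers import SentenceTransformer, util

mpnet = SentenceTransformer("all-mpnet-base-v2")

def keep_pair(chosen: str, rejected: str, max_sim: float = 0.9) -> bool:
    """Drop pairs whose chosen and rejected responses are nearly identical,
    since such pairs carry almost no preference signal."""
    emb = mpnet.encode([chosen, rejected],
                       convert_to_tensor=True, normalize_embeddings=True)
    return util.cos_sim(emb[0], emb[1]).item() < max_sim
```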

Loss & Training

Standard DPO loss with temperature \(\beta=0.1\), learning rate \(5 \times 10^{-5}\), trained for 3 epochs. 20K samples per level.
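
For reference, a minimal PyTorch sketch of this objective (the function name and tensor conventions are illustrative; the formulation itself is the standard DPO loss shown above):

```python
import torch.nn.functional as F
from torch import Tensor

def dpo_loss(pi_chosen_logp: Tensor, pi_rejected_logp: Tensor,
             ref_chosen_logp: Tensor, ref_rejected_logp: Tensor,
             beta: float = 0.1) -> Tensor:
    """Standard DPO objective. Each argument is log p(y|x) summed over the
    response tokens, for the trainable policy and the frozen reference model."""
    chosen_logratio = pi_chosen_logp - ref_chosen_logp
    rejected_logratio = pi_rejected_logp - ref_rejected_logp
    # -E[ log sigma( beta * (chosen log-ratio - rejected log-ratio) ) ]
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```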

Key Experimental Results

Main Results

| Method | BLINK | MANTIS | NLVR2 | Multi-image Avg. |
| --- | --- | --- | --- | --- |
| LLaVA-v1.5 Baseline | 37.1 | 41.9 | 52.1 | 43.7 |
| MIA-DPO | 42.9 | 44.2 | 54.2 | 47.1 |
| S2H-DPO | 43.4 | 47.9 | 55.6 | 49.0 |
| Gain vs. Baseline | +6.3 | +6.0 | +3.5 | +5.3 |

Ablation Study

| Configuration | Multi-image Avg. | Single-image Avg. | Notes |
| --- | --- | --- | --- |
| Level 1 only | 47.1 | Maintained | Equivalent to MIA-DPO |
| Level 2 only | Improved | Maintained | Cross-image comparison is beneficial |
| Level 3 only | Improved | Maintained | Global search is most challenging |
| Level 1+2+3 | 49.0 | Maintained | Joint training is optimal |

Key Findings

  • S2H-DPO outperforms MIA-DPO on all multi-image benchmarks, with a more pronounced advantage on harder Level 3 tasks.
  • Joint training across all three levels surpasses training on any single level; different reasoning levels mutually reinforce each other.
  • A key advantage is that multi-image reasoning gains are achieved without any degradation in single-image performance (no decline on MMStar or POPE).
  • Unlike MIA-DPO, S2H-DPO's data construction does not depend on model-specific hallucinations and is universally applicable across models.

Highlights & Insights

  • Clear and compelling definition of capability levels: The progressive hierarchy from anchored reasoning → comparison → search, where each level strictly requires more capabilities than the previous, provides a systematic task analysis framework transferable to other multimodal reasoning scenarios.
  • Prompt-driven vs. hallucination-driven contrastive design: The former generates natural contrast through task difficulty, while the latter relies on model-specific deficiencies. The former is more general and does not become obsolete as models improve.
  • Practical importance of preserving single-image performance: Multi-image gains should not come at the cost of single-image degradation; S2H-DPO successfully achieves both simultaneously.

Limitations & Future Work

  • The specific task designs for each level (kinship recognition, visual arithmetic) may lack sufficient diversity.
  • Level 3 rejected samples are generated by "omitting the target specification," which may result in inconsistent quality.
  • Validation is limited to 7B and 2B models; effectiveness on larger models remains unknown.
  • Scenarios involving more than four images are not considered.

Comparison with Prior Work

  • vs. MIA-DPO: MIA-DPO relies solely on Level 1 data and model hallucinations; S2H-DPO covers all three levels with model-agnostic data construction.
  • vs. LLaVA-RLHF/HA-DPO: These methods focus on single-image preference alignment, whereas S2H-DPO targets hierarchical improvement of multi-image reasoning.

Rating

  • Novelty: ⭐⭐⭐⭐ The three-level capability hierarchy is insightful, though the underlying methodology (DPO + synthetic data) is not novel in itself.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three multi-image and two single-image benchmarks, three models, and sufficient ablations.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clear and the capability hierarchy visualization is effective, though some descriptions are verbose.
