Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment¶
Conference: NeurIPS 2025 · arXiv: 2506.06970 · Code: N/A · Area: Multimodal VLM · Keywords: Cross-modal retrieval, preference alignment, DPO, modality gap, MLLM
TL;DR¶
This paper proposes MAPLE, a framework that leverages the inherent modality alignment capabilities of off-the-shelf MLLMs to automatically construct preference data, and introduces a Relative Preference Alignment (RPA) loss to guide cross-modal representation learning, achieving significant improvements on fine-grained retrieval tasks.
Background & Motivation¶
Contrastive learning models such as CLIP have demonstrated strong performance in cross-modal retrieval; however, their feature spaces exhibit a pronounced modality gap—a systematic separation between image and text embeddings in the shared space—which limits retrieval effectiveness.
A key observation motivates this work: off-the-shelf MLLMs (e.g., Qwen2-VL) inherently possess strong modality alignment capabilities. To quantify this, the authors propose a unified metric based on the 1-Wasserstein distance (WD) that puts logit-based models (MLLMs) and embedding-based models (CLIP) on the same footing, and show that the alignment quality of MLLMs substantially surpasses that of CLIP.
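A minimal sketch of one way such a unified metric could be computed, assuming each model is first reduced to one scalar score per image–text pair (cosine similarity for an embedding model, the softmax "yes" probability for an MLLM). The paper distinguishes distributional and discriminative variants; this sketch reflects only the discriminative flavor, and the function name and dummy scores are illustrative:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def alignment_gap(pos_scores, neg_scores):
    """1-Wasserstein distance between the score distributions of matched
    (positive) and mismatched (negative) image-text pairs. A larger
    separation indicates better modality alignment, regardless of whether
    the scores come from a logit-based or an embedding-based model."""
    return wasserstein_distance(np.asarray(pos_scores), np.asarray(neg_scores))

# Illustrative usage with synthetic scores:
rng = np.random.default_rng(0)
pos = rng.normal(0.7, 0.1, 1000)  # scores of true pairs
neg = rng.normal(0.3, 0.1, 1000)  # scores of mismatched pairs
print(alignment_gap(pos, neg))    # ~0.4 for these synthetic distributions
```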
However, fine-tuning MLLMs into retrieval models degrades this inherent alignment capability. The central question therefore becomes: how can the powerful cross-modal alignment ability of MLLMs be preserved while converting them into retrievers?
Two core challenges are identified:

1. Standard contrastive learning performs coarse-grained alignment by uniformly pushing away all negatives, ignoring fine-grained semantic differences.
2. Directly applying contrastive fine-tuning to MLLMs destroys their pre-existing modality alignment priors.
Method¶
Overall Architecture¶
MAPLE (Modality-Aligned Preference Learning for Embeddings) consists of two main components:

1. Preference data construction: offline hard negative mining followed by online MLLM-based scoring.
2. Preference alignment training: aligning the embedding space with MLLM preferences via the RPA loss.
Key Designs¶
1. MLLM-Based Retriever Architecture¶
- The model is initialized from a pretrained MLLM.
- The causal attention mask is replaced with bidirectional attention.
- Mean pooling is applied over the final hidden states to obtain retrieval representations.
- The model is fine-tuned using LoRA; a minimal sketch of the resulting encoder follows.
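The sketch below shows how such a retriever could extract embeddings, assuming a Hugging Face-style interface; the exact attention-mask surgery (switching the backbone from causal to bidirectional attention) and the LoRA setup are omitted and may differ from the paper's implementation:

```python
import torch
import torch.nn.functional as F

def encode(model, inputs):
    """Extract a retrieval embedding from an MLLM backbone that has been
    patched to attend bidirectionally: run the transformer, mean-pool the
    final hidden states over non-padding tokens, then L2-normalize."""
    out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[-1]                 # (B, T, D)
    mask = inputs["attention_mask"].unsqueeze(-1)  # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
    return F.normalize(pooled, dim=-1)             # unit-norm embeddings
```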
2. Preference Data Construction¶
Offline stage (candidate generation; see the sketch below):

- DINOv2 is used to extract image embeddings, and SemDedup is applied for deduplication.
- For each image, the top-\(K\) nearest neighbors are retrieved as the hard negative candidate set \(\mathcal{C}_i^{img}\).
- The MLLM's multi-image reasoning capability is used to generate discriminative descriptions, forming the text candidate set \(\mathcal{C}_i^{txt}\).
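A sketch of the nearest-neighbor mining step, assuming deduplicated DINOv2 embeddings are already in hand (`mine_hard_negatives` is an illustrative name, not the paper's code):

```python
import torch
import torch.nn.functional as F

def mine_hard_negatives(img_emb: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Given image embeddings of shape (N, D), return for each image the
    indices of its top-k nearest neighbors (itself excluded), which serve
    as the hard negative candidate set C_i^img."""
    z = F.normalize(img_emb, dim=-1)
    sim = z @ z.t()                        # (N, N) cosine similarities
    sim.fill_diagonal_(float("-inf"))      # never retrieve the image itself
    return sim.topk(k, dim=-1).indices     # (N, k) candidate indices
```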
Online stage (scoring and structuring; see the sketch below):

- Alignment scores are computed by prompting the MLLM to answer "yes"/"no" for each image–text pair; a softmax over the two logits yields the alignment score \(\alpha_{ii}\).
- Candidates are ranked by alignment score, and two preference structures are constructed:
    - Pairwise preference: all candidate pairs \((x_{r_a}, x_{r_b})\) satisfying \(a < b\).
    - Listwise preference: exploiting the suffix structure of the complete ranking list.
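A sketch of the scoring step, assuming the MLLM exposes next-token logits and that "yes"/"no" each map to a single token (the prompt construction and tokenizer quirks are glossed over):

```python
import torch

@torch.no_grad()
def alignment_score(mllm, tokenizer, inputs) -> torch.Tensor:
    """Score image-text pairs by asking the MLLM whether the caption
    matches the image, taking the softmax probability of "yes" over
    the two-way {yes, no} logits at the final position."""
    yes_id = tokenizer.convert_tokens_to_ids("yes")
    no_id = tokenizer.convert_tokens_to_ids("no")
    logits = mllm(**inputs).logits[:, -1, :]                   # next-token logits
    two_way = torch.stack([logits[:, yes_id], logits[:, no_id]], dim=-1)
    return two_way.softmax(dim=-1)[:, 0]                       # alpha = P("yes")
```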
3. Relative Preference Alignment (RPA) Loss¶
Starting from DPO, two key modifications are introduced:
Eliminating the reference model: a uniform prior \(U\) is adopted as the reference policy \(\pi_{\text{ref}}\), simplifying the objective to:

$$\mathcal{L}_{\text{DPO-simplified}} = -\mathbb{E}\left[\log \sigma\big(\beta \log \pi_\theta(y_w|x) - \beta \log \pi_\theta(y_l|x)\big)\right]$$
Adapting to embedding models: Log-probabilities are replaced with scaled similarity scores \(\beta(z^{anchor} \cdot z^{candidate})\).
Pairwise RPA loss:

$$\mathcal{L}_{\text{RPA-Pairwise}}^{txt2img} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{0 \le k < l \le K} (\alpha_{i,r_k} - \alpha_{i,r_l}) \log \sigma(s_{ik} - s_{il})$$
The preference weight \((\alpha_{i,r_k} - \alpha_{i,r_l})\) ensures that candidate pairs with larger preference gaps receive greater attention.
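A sketch of this pairwise objective in PyTorch, assuming per-anchor similarities and MLLM alignment scores are already sorted by rank (tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def rpa_pairwise(sim: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Pairwise RPA loss.
    sim:   (N, K+1) scaled similarities s_ik between anchor i and its
           ranked candidates (rank 0 = most preferred).
    alpha: (N, K+1) MLLM alignment scores in the same rank order.
    Every ordered pair (k, l) with k < l contributes a Bradley-Terry
    term weighted by its preference gap (alpha_k - alpha_l)."""
    K1 = sim.size(1)
    k, l = torch.triu_indices(K1, K1, offset=1, device=sim.device)
    gap = alpha[:, k] - alpha[:, l]               # preference weights
    margin = sim[:, k] - sim[:, l]                # s_ik - s_il
    return -(gap * F.logsigmoid(margin)).sum(dim=-1).mean()
```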
Listwise RPA loss:

$$\mathcal{L}_{\text{RPA-Listwise}}^{txt2img} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=0}^{K-1} w_{ik} \log \frac{\exp(s_{ik})}{\sum_{j=k}^{K} \exp(s_{ij})}$$
where the weight \(w_{ik}\) is the average difference in MLLM alignment scores between the \(k\)-th candidate and all subsequent candidates.
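A corresponding sketch of the listwise objective under the same assumptions; each rank competes against its suffix, in the style of ListMLE:

```python
import torch

def rpa_listwise(sim: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Listwise RPA loss over ranked candidates.
    sim, alpha: (N, K+1), sorted so that rank 0 is most preferred.
    The k-th term is the log-softmax of s_ik over the suffix
    {s_ik, ..., s_iK}, weighted by w_ik, the average alignment-score
    gap between candidate k and all candidates ranked below it."""
    N, K1 = sim.shape
    loss = sim.new_zeros(())
    for k in range(K1 - 1):
        log_p = sim[:, k] - torch.logsumexp(sim[:, k:], dim=-1)
        w = (alpha[:, k : k + 1] - alpha[:, k + 1 :]).mean(dim=-1)
        loss = loss - (w * log_p).mean()
    return loss
```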
Loss & Training¶
Regularized joint optimization:

$$\mathcal{L} = \lambda \mathcal{L}_{RPA} + (1 - \lambda) \mathcal{L}_{contrast}$$
- \(\mathcal{L}_{contrast}\) is the standard contrastive loss, serving as regularization to prevent feature collapse from excessive alignment.
- \(\lambda\) balances the preference alignment strength against general retrieval capability.
- Training data: a subset of OpenImages.
- An extended negative pool strategy is employed to implicitly increase the effective batch size without additional computational overhead; a sketch of the joint objective follows.
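To make the objective concrete, here is a sketch of the joint loss, reusing `rpa_listwise` from above; `beta` and `lam` are stand-ins for the paper's similarity scaling and balancing hyperparameters, and the extended negative pool is only indicated in a comment:

```python
import torch
import torch.nn.functional as F

def joint_loss(z_txt, z_img, alpha, lam: float = 0.5, beta: float = 20.0):
    """lam * L_RPA + (1 - lam) * L_contrast.
    z_txt: (N, D) anchor text embeddings.
    z_img: (N, K+1, D) ranked image candidates per anchor (rank 0 = positive).
    alpha: (N, K+1) MLLM alignment scores in rank order."""
    sim = beta * torch.einsum("nd,nkd->nk", z_txt, z_img)   # s_ik
    l_rpa = rpa_listwise(sim, alpha)
    # InfoNCE against the rank-0 positives; in-batch candidates (and, in
    # the paper, the extended pool) act as negatives for each other.
    logits = beta * z_txt @ z_img[:, 0, :].t()              # (N, N)
    target = torch.arange(z_txt.size(0), device=z_txt.device)
    l_contrast = F.cross_entropy(logits, target)
    return lam * l_rpa + (1 - lam) * l_contrast
```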
Key Experimental Results¶
Main Results¶
| Model | COCO T/I R@1 | Flickr30k T/I R@1 | Winoground T/I | NaturalBench T/I |
|---|---|---|---|---|
| SigLIPv2 (2B) | 72.8/56.1 | 95.4/86.0 | 39.8/17.0 | 65.5/68.7 |
| VladVA (LLaVA-7B) | 72.9/59.0 | 94.3/83.3 | 40.5/17.5 | -/- |
| MAPLE (Qwen2-VL-7B) | 75.5/60.3 | 94.3/86.1 | 56.0/32.7 | 76.1/76.8 |
Improvements on fine-grained retrieval are particularly pronounced: on Winoground, Text R@1 improves by 15.5 points and Image R@1 by 15.2 points relative to VladVA.
Ablation Study¶
| Method | COCO T/I | Winoground T/I | NaturalBench T/I |
|---|---|---|---|
| Baseline (\(\mathcal{L}_{contrast}\)) | 74.0/54.4 | 42.5/20.5 | 61.4/62.5 |
| RPA-Pairwise only (no \(\mathcal{L}_{contrast}\)) | 51.9/52.4 | 48.8/34.7 | 70.1/77.3 |
| \(\mathcal{L}_{contrast}\) + RPA-Listwise | 71.9/58.6 | 51.0/28.2 | 69.2/71.2 |
| Full: \(\mathcal{L}_{contrast}\) + RPA-Listwise + extended negatives | 75.5/60.3 | 56.0/32.7 | 76.1/76.8 |
Key Findings¶
- Using RPA loss alone substantially improves fine-grained retrieval at the cost of general retrieval performance; adding \(\mathcal{L}_{contrast}\) regularization recovers both.
- Listwise RPA outperforms Pairwise: the listwise formulation accounts for the full ranking structure rather than treating pairs independently.
- Extended negative pool benefits both general and fine-grained retrieval by effectively increasing the batch size.
- Modality gap analysis: MAPLE with RPA significantly reduces the distributional gap (\(W_{dist\text{-}gap}\)) while improving the discriminative gap (\(W_{disc\text{-}gap}\)).
Highlights & Insights¶
- MLLMs inherently possess strong modality alignment ability: this observation is independently valuable, suggesting that MLLMs can serve as "alignment teachers" for other models.
- Elegant transfer from DPO to embedding space: replacing log-probabilities with similarity scores and eliminating the reference model adapts the framework naturally to the retrieval setting.
- Refined preference weighting: using MLLM alignment score differences as preference strength weights focuses training on sample pairs with clear preference signals.
- Unified modality gap metric: the Wasserstein distance-based metric enables comparisons across architectures with different output formats (logit-based vs. embedding-based).
Limitations & Future Work¶
- Cross-modal representations may inherit biases inherent to the MLLM backbone.
- The approach has not been validated on more complex tasks such as composed retrieval.
- Computing MLLM alignment scores during the online stage introduces additional training cost.
- General retrieval performance shows only marginal improvement over the baseline in some configurations (e.g., COCO Text R@1 increases from 74.0 to 75.5, with slight degradation in certain settings).
- Future work could explore additional MLLM backbones and larger-scale training data.
Related Work & Insights¶
- Connection to DPO/RLHF: the work extends LLM alignment techniques to the domain of visual retrieval.
- Alternative direction for improving CLIP: rather than modifying the encoder architecture, the method leverages external MLLM knowledge to improve alignment.
- Inspiration: similar ideas could be applied to video retrieval and fine-grained matching in multimodal RAG systems.
Rating¶
- Novelty: ⭐⭐⭐⭐ (DPO-to-embedding-space transfer and discovery of MLLM alignment priors)
- Experimental Thoroughness: ⭐⭐⭐⭐ (multiple general and fine-grained benchmarks with comprehensive ablations)
- Writing Quality: ⭐⭐⭐⭐ (technically rich but somewhat complex in structure)
- Value: ⭐⭐⭐⭐ (significant gains on fine-grained retrieval; the MLLM prior finding is broadly inspiring)