DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning

Conference: ACL 2025
arXiv: 2506.21096
Code: GitHub
Area: Multimodal VLM
Keywords: Sentence Representation Learning, Cross-modal Alignment, Intra-modal Alignment, Contrastive Learning, Knowledge Distillation

TL;DR

DALR addresses Cross-Modal misalignment Bias (CMB) and Intra-modal Semantic Divergence (ISD) in multimodal sentence representation learning with a dual-level alignment strategy: cross-modal consistency learning and intra-modal rank distillation. It achieves state-of-the-art performance on semantic textual similarity (STS) and transfer (TR) tasks.

Background & Motivation

Background: Sentence representation learning maps sentences into low-dimensional vectors while preserving semantic information; these vectors are widely used in semantic similarity and information extraction tasks. Since SimCSE, the "PLM + contrastive learning" paradigm has become mainstream. Incorporating visual information (e.g., MCSE, KDMCSE) has been shown to provide rich supervisory signals.

Limitations of Prior Work: Existing multimodal sentence representation methods face two key challenges when aligning images and text at a coarse level:

  • Cross-Modal misalignment Bias (CMB): Textual information is dense and selective, focusing on key details, whereas images capture all content indiscriminately, resulting in massive redundancy. For example, a "boat" might only occupy a small area in an image, while most visual patches contain irrelevant information. Furthermore, annotators' cognitive biases lead to multiple semantic descriptions for the same image, further magnifying modal heterogeneity.

  • Intra-modal Semantic Divergence (ISD): Due to co-referencing the same image, texts with widely different semantics might be incorrectly identified as highly similar. For example, for the same football image, "a boy in a blue and white jersey kicks the ball" and "a group of people watching football players" have completely different descriptive focuses, yet both are highly similar to the image, leading to false negatives.

Key Challenge: Methods such as KDMCSE mitigate this by filtering false negatives using thresholds. However, this hard thresholding still leads to misclassification, as high image-text similarity does not equate to cross-text semantic consistency.

Goal: Simultaneously address CMB in cross-modal alignment and ISD in intra-modal consistency to achieve fine-grained dual-level alignment.

Key Insight: At the cross-modal level, an auxiliary consistency task is introduced to generate soft labels for guiding alignment. At the intra-modal level, multi-teacher rank distillation is utilized to capture continuous semantic ranking structures.

Core Idea: Relationships between sentences are continuous ranking structures rather than simple binary positive/negative labels. Dual-level alignment spanning both cross-modal and intra-modal levels is required to fully exploit visual signals.

Method

Overall Architecture

DALR consists of three modules:

  1. Multimodal Contrastive Learning Module (Baseline): Guides sentence representation learning by leveraging visual information.
  2. Cross-Modal Alignment Module: Refines image-text semantic matching via an auxiliary consistency task.
  3. Intra-Modal Alignment Module: Enhances internal textual consistency via rank distillation + KL divergence.

Key Designs

Cross-modal Consistency Learning

A dataset \(\mathcal{D}'\) containing matched and mismatched image-text pairs is constructed (mismatched pairs are generated by shuffling images). Cosine embedding loss is employed for binary classification:

\[\mathcal{L}_{cons} = \begin{cases} 1 - \cos(h_s^{v'}, s_s^{z'}) & \text{if } y' = 1 \\ \max(0, \cos(h_s^{v'}, s_s^{z'}) - m) & \text{if } y' = 0 \end{cases}\]

where \(m = 0.2\) is the negative margin. This task captures deeper semantic relations and is trained in parallel with contrastive learning to generate cross-modal soft labels.
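The pair construction and the consistency loss above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `build_pairs`, `consistency_loss`, and the `shift` parameter are hypothetical names, and a cyclic shift stands in for random shuffling so that every mismatched pair is guaranteed to be a true mismatch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_pairs(texts, images, shift=1):
    """Return (text, image, label) triples: matched pairs (label 1) followed
    by mismatched pairs (label 0).

    Assumption: the source shuffles images; a cyclic shift is used here so
    that no text is accidentally re-paired with its own image. `shift` must
    not be a multiple of len(texts).
    """
    n = len(texts)
    matched = [(texts[i], images[i], 1) for i in range(n)]
    mismatched = [(texts[i], images[(i + shift) % n], 0) for i in range(n)]
    return matched + mismatched


def consistency_loss(text_vec, image_vec, label, margin=0.2):
    """Cosine embedding loss for one pair, with the margin m = 0.2 from the text."""
    sim = cosine(text_vec, image_vec)
    if label == 1:
        return 1.0 - sim            # pull matched pairs together
    return max(0.0, sim - margin)   # penalize mismatched pairs only above the margin
```

A mismatched pair whose similarity is already below the margin contributes zero loss, so the task only pushes apart pairs the model still confuses.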

Cross-Modal Alignment

The text-to-visual distribution \(P_i^{t2v}\) of the student model and the text-to-text \(Q_i^{t2t}\) and visual-to-visual \(Q_i^{v2v}\) distributions of the teacher model are calculated. Alignment is promoted by minimizing the KL divergence:

\[\mathcal{L}_{CMA} = \frac{1}{2} \sum_{i=1}^{N} \left( D_{KL}(Q_i^{t2t} \| P_i^{t2v}) + D_{KL}(Q_i^{v2v} \| (P_i^{t2v})^T) \right)\]

Total cross-modal learning loss: \(\mathcal{L}_{CML} = \mathcal{L}_{cons} + \mathcal{L}_{CMA}\)
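A rough sketch of \(\mathcal{L}_{CMA}\) on plain similarity matrices follows. It rests on several assumptions not stated in these notes: each distribution is a row-wise softmax over an \(N \times N\) similarity matrix, the temperature `tau` is a hypothetical value, and \((P_i^{t2v})^T\) is read as the visual-to-text distribution obtained from the transposed student similarity matrix.

```python
import math


def softmax(row, tau):
    """Temperature-scaled softmax over one row of similarities."""
    scaled = [x / tau for x in row]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kl(q, p, eps=1e-12):
    """KL(q || p) for two discrete distributions of equal length."""
    return sum(qi * math.log((qi + eps) / (pi + eps)) for qi, pi in zip(q, p) if qi > 0)


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def cma_loss(student_t2v, teacher_t2t, teacher_v2v, tau=0.05):
    """0.5 * sum_i [ KL(Q_i^{t2t} || P_i^{t2v}) + KL(Q_i^{v2v} || P_i^{v2t}) ].

    Assumption: P^{v2t} is the row-wise softmax of the transposed student
    text-to-visual similarity matrix.
    """
    p_t2v = [softmax(r, tau) for r in student_t2v]
    p_v2t = [softmax(r, tau) for r in transpose(student_t2v)]
    q_t2t = [softmax(r, tau) for r in teacher_t2t]
    q_v2v = [softmax(r, tau) for r in teacher_v2v]
    total = 0.0
    for i in range(len(student_t2v)):
        total += kl(q_t2t[i], p_t2v[i]) + kl(q_v2v[i], p_v2t[i])
    return 0.5 * total
```

When the student's cross-modal similarities already match both teacher matrices, the loss is zero; any disagreement in how probability mass is spread across the batch is penalized.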

Intra-Modal Rank Distillation

Assuming relations between samples are continuous rather than binary, multi-teacher models (SimCSE + DiffCSE) are used to generate coarse-grained semantic rankings as pseudo labels. ListMLE is used to optimize ranking learning:

\[\mathcal{L}_{rank} = -\sum_{i=1}^{N} \log \left( \prod_{j=1}^{M} \frac{\exp(S(x_i)_{\pi_i^T(j)} / \tau)}{\sum_{k=j}^{M} \exp(S(x_i)_{\pi_i^T(k)} / \tau)} \right)\]

Simultaneously, KL divergence is introduced to align the global distribution:

\[\mathcal{L}_{IMA} = \sum_{i=1}^{N} D_{KL}(Q_i^{t2t} \| P_i^{t2t})\]

Total intra-modal loss: \(\mathcal{L}_{IML} = \mathcal{L}_{rank} + \mathcal{L}_{IMA}\)
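The ListMLE term \(\mathcal{L}_{rank}\) can be sketched for a single anchor sentence as below. The notes do not define the symbols, so this assumes \(S(x_i)\) holds the student's similarity scores to \(M\) candidates and \(\pi_i^T\) is the teacher's ranking of those candidates. `combine_teachers` (plain averaging) and the value of `tau` are illustrative assumptions; the notes do not say how the SimCSE and DiffCSE scores are merged.

```python
import math


def combine_teachers(score_lists):
    """Average per-candidate scores across teachers (assumed merge rule)."""
    n_teachers = len(score_lists)
    return [sum(col) / n_teachers for col in zip(*score_lists)]


def teacher_ranking(teacher_scores):
    """Candidate indices sorted from most to least similar according to the teacher."""
    return sorted(range(len(teacher_scores)), key=lambda j: teacher_scores[j], reverse=True)


def list_mle(student_scores, ranking, tau=0.05):
    """Negative log-likelihood of the teacher ranking under a Plackett-Luce
    model parameterized by the student scores."""
    ordered = [student_scores[j] / tau for j in ranking]
    loss = 0.0
    for j in range(len(ordered)):
        tail = ordered[j:]                 # candidates not yet placed
        m = max(tail)
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        loss -= ordered[j] - log_z         # -log softmax of the j-th ranked item
    return loss
```

The loss is small when the student's scores sort the candidates in the same order as the teacher and grows as the two orderings diverge, which is what lets it express graded rather than binary relations.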

Loss & Training

The final training objective is formulated as:

\[\mathcal{L}_{total} = \mathcal{L}_{Info} + \lambda \mathcal{L}_{CML} + \mu \mathcal{L}_{IML}\]

where \(\mathcal{L}_{Info}\) is the InfoNCE contrastive loss, and \(\lambda\) and \(\mu\) are balancing hyperparameters.
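How the terms combine can be shown with a small sketch. The InfoNCE form below is the standard in-batch version with positives on the diagonal of a similarity matrix. The temperature and the values passed for \(\lambda\) and \(\mu\) are placeholders, since these notes do not report the settings used in the paper.

```python
import math


def info_nce(sim, tau=0.05):
    """In-batch InfoNCE: sim[i][j] is the similarity of anchor i to candidate j,
    and sim[i][i] is the positive pair."""
    loss = 0.0
    for i, row in enumerate(sim):
        scaled = [s / tau for s in row]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        loss += log_z - scaled[i]          # -log softmax of the positive
    return loss / len(sim)


def total_loss(l_info, l_cons, l_cma, l_rank, l_ima, lam, mu):
    """L_total = L_Info + lam * (L_cons + L_CMA) + mu * (L_rank + L_IMA)."""
    l_cml = l_cons + l_cma                 # cross-modal learning loss
    l_iml = l_rank + l_ima                 # intra-modal learning loss
    return l_info + lam * l_cml + mu * l_iml
```

For example, `total_loss(1.0, 0.5, 0.5, 0.25, 0.25, lam=2.0, mu=4.0)` weights the cross-modal sum by 2 and the intra-modal sum by 4 before adding them to the contrastive term.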

Key Experimental Results

Main Results: STS Task (Spearman's Correlation Coefficient \(\times 100\))

Flickr Dataset:

| Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg |
|---|---|---|---|---|---|---|---|---|
| SimCSE-BERT | 69.9 | 79.8 | 72.9 | 81.9 | 77.8 | 76.6 | 68.4 | 75.3 |
| MCSE-BERT | 71.4 | 81.8 | 74.8 | 83.6 | 77.5 | 79.5 | 72.6 | 77.3 |
| KDMCSE-BERT | 74.4 | 83.1 | 76.3 | 83.7 | 78.8 | 81.3 | 73.0 | 78.6 |
| DALR-BERT | 73.9 | 84.0 | 76.5 | 84.3 | 80.6 | 81.8 | 75.3 | 79.5 |
| KDMCSE-RoBERTa | 73.6 | 83.8 | 77.4 | 84.0 | 81.5 | 82.3 | 71.2 | 79.1 |
| DALR-RoBERTa | 73.6 | 84.4 | 77.2 | 84.9 | 82.0 | 82.6 | 74.6 | 79.9 |

DALR-BERT achieves an average score of 79.5 on Flickr, outperforming KDMCSE-BERT by +0.9. DALR-RoBERTa reaches 79.9, outperforming KDMCSE-RoBERTa by +0.8.
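For readers unfamiliar with the metric, STS scores of this kind are conventionally Spearman rank correlations between model similarity scores and human ratings, multiplied by 100. A minimal stdlib sketch is given below; `spearman_x100` is an illustrative name, and the evaluation toolkit actually used is not stated in these notes.

```python
import math


def ranks(values):
    """1-based ranks, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman_x100(predicted, gold):
    """Spearman correlation (Pearson on ranks) scaled by 100."""
    rp, rg = ranks(predicted), ranks(gold)
    n = len(rp)
    mean_p, mean_g = sum(rp) / n, sum(rg) / n
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(rp, rg))
    std_p = math.sqrt(sum((a - mean_p) ** 2 for a in rp))
    std_g = math.sqrt(sum((b - mean_g) ** 2 for b in rg))
    return 100.0 * cov / (std_p * std_g)
```

Because only the ordering matters, a model is rewarded for ranking sentence pairs the way annotators do, regardless of the absolute scale of its cosine similarities.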

Transfer Tasks

| Model | MR | CR | SUBJ | MPQA | SST | TREC | MRPC | Avg |
|---|---|---|---|---|---|---|---|---|
| KDMCSE-BERT (flickr) | 82.78 | 87.89 | 95.37 | 90.08 | 87.61 | 86.08 | 75.88 | 86.53 |
| DALR-BERT | 82.95 | 88.10 | 95.89 | 90.83 | 88.04 | 86.60 | 76.06 | 86.92 |
| KDMCSE-RoBERTa (flickr) | 83.21 | 88.16 | 95.73 | 90.46 | 88.05 | 86.30 | 76.18 | 86.87 |
| DALR-RoBERTa | 83.57 | 88.69 | 96.44 | 91.01 | 88.96 | 86.80 | 76.74 | 87.45 |

DALR-RoBERTa achieves an average of 87.45 on TR tasks, outperforming KDMCSE-RoBERTa by +0.58.

Ablation Study

| Setting | STS Avg | TR Avg |
|---|---|---|
| Baseline (SimCSE + Multimodal Contrastive) | 77.3 | 85.64 |
| + Cross-Modal Alignment (CML) | 78.8 | 86.38 |
| + Intra-Modal Alignment (IML) | 78.6 | 86.25 |
| + CML + IML (DALR) | 79.5 | 86.92 |

Both modules make independent contributions, and their combination yields the best performance. In rank distillation, multiple teachers (SimCSE + DiffCSE) outperform a single teacher.
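The ablation numbers can be turned into per-module gains with a few lines of arithmetic. The snippet uses only the figures reported in the ablation table; the dictionary keys and the helper name are illustrative.

```python
# (STS Avg, TR Avg) per setting, copied from the ablation table.
ablation = {
    "baseline": (77.3, 85.64),
    "+CML": (78.8, 86.38),
    "+IML": (78.6, 86.25),
    "+CML+IML": (79.5, 86.92),
}


def gains_over_baseline(table, base="baseline"):
    """Absolute (STS, TR) improvement of each setting over the baseline row."""
    base_sts, base_tr = table[base]
    return {
        name: (round(sts - base_sts, 2), round(tr - base_tr, 2))
        for name, (sts, tr) in table.items()
        if name != base
    }


gains = gains_over_baseline(ablation)
```

On STS, CML alone adds 1.5 points and IML alone adds 1.3, while the full model adds 2.2. The combined gain is smaller than the sum of the individual gains, which suggests the two modules partly overlap in what they correct.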

Key Findings

  1. Cross-modal consistency learning effectively mitigates CMB, with the most significant improvement observed on SICK-R (+2.3 over KDMCSE-BERT on Flickr).
  2. Rank distillation capturing continuous semantic structures is more effective than hard-threshold filtering.
  3. Multi-teacher distillation outperforms single-teacher distillation, as different teachers provide complementary ranking perspectives.
  4. Consistent improvements are achieved across both Flickr and COCO datasets.

Highlights & Insights

  • Accurate Diagnosis of CMB and ISD: Visually demonstrates cross-modal misalignment and intra-modal divergence through clear illustrations.
  • Soft Labels Instead of Hard Thresholds: Consistency learning generates continuous soft labels to guide alignment, avoiding erroneous filtering caused by hard thresholds.
  • Complementary Design of ListMLE + KL: ListMLE preserves ranking structures while KL divergence learns global distributions. Combining both is more robust than using either in isolation.
  • Lightweight Incremental Design: Plugged incrementally into existing multimodal contrastive learning frameworks without requiring additional architectural modifications.

Limitations & Future Work

  1. Reliance on frozen visual/textual teacher encoders (CLIP); the quality of the teachers directly bounds the performance.
  2. Evaluations are conducted solely on English datasets; cross-lingual generalization remains unverified.
  3. Systematic ablation is lacking for teacher model selection (SimCSE + DiffCSE) in rank distillation.
  4. Training overhead increases due to the multiple teachers and auxiliary tasks.
  5. Future work could explore integration with more powerful visual encoders (e.g., SigLIP).

Related Work & Insights

  • SimCSE \(\to\) MCSE \(\to\) KDMCSE \(\to\) DALR: The evolutionary path of multimodal sentence representation is clear, and DALR addresses two previously overlooked issues on top of prior work.
  • The application of ListMLE, a listwise learning-to-rank loss, to sentence representation learning is noteworthy.
  • Insight: Cross-modal alignment should not only consider binary positive/negative pairs but also model the continuous ranking relationships among samples.

Rating

  • Novelty: ⭐⭐⭐ — The formulation of CMB/ISD problems is insightful, but the overall framework is a combination of existing techniques.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Dual benchmarks (STS and TR), two PLM backbones, two multimodal datasets, and complete ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ — The motivation is clearly illustrated with figures, and the methodological derivation is rigorous.
  • Value: ⭐⭐⭐ — Shows consistent improvements in the field of sentence representation learning, though the performance gain is moderate (~1%).