
HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models

Conference: ACL2025
arXiv: 2506.00805
Code: GitHub
Area: Multimodal VLM / Medical Vision-Language Model Alignment
Keywords: Medical VLM, Preference Optimization, Self-Contrastive Rewarding, Modality Alignment, Hallucination

TL;DR

This paper proposes HSCR, a hierarchical self-contrastive rewarding method that exposes the model's intrinsic modality misalignment through visual token dropout to automatically generate high-quality preference data. Combined with explicit/implicit multi-level preference optimization, it significantly enhances the zero-shot performance and trustworthiness of medical VLMs using only 2,000 training samples.

Background & Motivation

Problem Definition

Medical Vision-Language Models (Med-VLMs) process medical images by integrating a visual encoder into an LLM. However, limited paired multimodal medical training data leads to severe modality misalignment—models may hallucinate image content, preferring text-based priors while ignoring actual visual information. This severely compromises model trustworthiness in high-risk medical scenarios.

Two Major Challenges of Existing Methods

Challenge 1: Low Sampling Probability of Preference Data

  • Preference data generated by human annotators or GPT-4o exhibits a distribution shift from the decoding behavior of the Med-VLM.
  • These external preference data have a very low sampling probability during Med-VLM optimization.
  • This leads to weak reward signals and poor alignment performance.

Challenge 2: Insufficient Effectiveness of Adjacent-Level Contrast

  • In traditional binary preference optimization (correct vs. incorrect), the preference gap is too large.
  • A weakly trained Med-VLM can easily saturate its capability in distinguishing pre-selected outputs from non-selected ones.
  • Coarse-grained contrast fails to capture subtle preference differences.

Method

Overall Architecture

HSCR consists of three steps:

  1. Token-level Self-Contrastive Reward Data Generation (Section 3.1)
  2. Similarity-Aware Preference Reranking (Section 3.2)
  3. Multi-Level Preference Optimization (MLPO) (Section 3.3)

Key Designs

Key Design 1: Token-level Self-Contrastive Reward Data Generation

Core Idea: Leverage the Med-VLM's own intrinsic misalignment to generate dispreferred responses without external tools or annotations.

Step 1 - Visual Token Dropout:

  • Apply a 70% dropout rate to the original visual tokens \(i\) to obtain \(i'\).
  • Compute the logits with the full and with the dropped-out visual tokens: \(\text{logit}_\theta(y|i,x)\) and \(\text{logit}_\theta(y|i',x)\).
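A minimal sketch of the dropout step may help make it concrete. The summary does not say whether dropped tokens are removed from the sequence or zeroed out, so the version below assumes removal; the function name `drop_visual_tokens`, the fixed seed, and the toy embeddings are all hypothetical.

```python
import random

def drop_visual_tokens(visual_tokens, drop_rate=0.7, seed=0):
    """Return the visual tokens that survive random dropout, in original order.

    Assumption: dropped tokens are removed entirely (not masked to zero).
    """
    rng = random.Random(seed)
    n_keep = max(1, round(len(visual_tokens) * (1.0 - drop_rate)))
    keep_idx = sorted(rng.sample(range(len(visual_tokens)), n_keep))
    return [visual_tokens[idx] for idx in keep_idx]

# Toy "visual tokens": 10 patch embeddings of dimension 2.
i_full = [[float(p), float(p) + 0.5] for p in range(10)]
i_drop = drop_visual_tokens(i_full, drop_rate=0.7)
print(len(i_full), len(i_drop))
```

In a real Med-VLM the same prompt would then be run twice, once with `i_full` and once with `i_drop`, to obtain the two sets of logits.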

Step 2 - Identify Modality-Coupled Tokens: Locate the tokens most susceptible to visual information by contrasting the differences between the two sets of logits:

\[P_{\text{diff}} = \text{Softmax}[(1+\beta)\cdot\text{logit}_\theta(y|i,x) - \beta\cdot\text{logit}_\theta(y|i',x)]\]

where \(\beta=0.9\) controls the contrast intensity. The top-\(n\) tokens (\(n=10\)) with the largest logit differences are selected. These tokens are strongly coupled with the visual modality and are highly prone to inducing hallucinations under misalignment.
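The contrast and the top-\(n\) selection can be sketched on toy logits. This is an illustration under assumptions, not the authors' code: the summary does not specify how the "logit difference" of a token is measured, so the sketch uses the drop in the emitted token's logit when visual tokens are removed, and all function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_probs(logit_full, logit_drop, beta=0.9):
    """P_diff = Softmax[(1 + beta) * logit_full - beta * logit_drop]."""
    return softmax([(1 + beta) * f - beta * d
                    for f, d in zip(logit_full, logit_drop)])

def top_n_sensitive_positions(full_seq, drop_seq, response_ids, n=10):
    """Rank response positions by how far the emitted token's logit
    falls under visual token dropout (assumed difference measure)."""
    gaps = [full_seq[t][tok] - drop_seq[t][tok]
            for t, tok in enumerate(response_ids)]
    order = sorted(range(len(gaps)), key=lambda t: gaps[t], reverse=True)
    return order[:n]

# Toy example: 3 response positions, vocabulary of 4 tokens.
full_seq = [[2.0, 0.1, 0.0, -1.0], [0.5, 3.0, 0.2, 0.0], [1.0, 0.9, 0.8, 0.7]]
drop_seq = [[1.9, 0.2, 0.0, -1.0], [0.6, 0.5, 0.3, 0.1], [1.0, 0.9, 0.8, 0.7]]
response_ids = [0, 1, 0]

sensitive = top_n_sensitive_positions(full_seq, drop_seq, response_ids, n=2)
p_diff = contrastive_probs(full_seq[1], drop_seq[1])
print(sensitive, [round(p, 3) for p in p_diff])
```

Position 1 ranks first here because its token's logit depends most on the image; position 2, whose logits are unchanged by dropout, is not visually coupled at all.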

Step 3 - Generate Dispreferred Responses: For the identified sensitive tokens, decode and replace them with weakly visually related tokens (i.e., hallucinated outputs) based on \(P_{\text{diff}}\) in ascending order. By replacing varying numbers of sensitive tokens, a set of dispreferred responses with different degrees of error \(\{y_{l1}, y_{l2}, ..., y_{lk}\}\) is generated.
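One possible reading of this step is sketched below: each sensitive position is overwritten with the vocabulary entry that has the lowest \(P_{\text{diff}}\) (the least visually supported token), and corrupting 1, 2, or 3 positions yields progressively worse responses. The replacement rule, the `counts` parameter, and both function names are assumptions for illustration only.

```python
def least_grounded_token(p_diff, exclude):
    """Pick the vocabulary id with the lowest contrastive probability,
    skipping the token that was originally emitted."""
    candidates = [v for v in range(len(p_diff)) if v != exclude]
    return min(candidates, key=lambda v: p_diff[v])

def build_dispreferred(response_ids, sensitive_positions, p_diff_seq, counts=(1, 2, 3)):
    """Return one corrupted copy of the response per entry in `counts`;
    a larger count replaces more sensitive tokens."""
    outputs = []
    for c in counts:
        corrupted = list(response_ids)
        for t in sensitive_positions[:c]:
            corrupted[t] = least_grounded_token(p_diff_seq[t], exclude=response_ids[t])
        outputs.append(corrupted)
    return outputs

# Toy example: 4-token response, vocabulary of 5 tokens.
response_ids = [0, 1, 2, 3]
sensitive_positions = [2, 0, 3]  # most sensitive first
p_diff_seq = [
    [0.70, 0.12, 0.03, 0.10, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.80, 0.08, 0.02],
    [0.01, 0.09, 0.10, 0.75, 0.05],
]
dispreferred = build_dispreferred(response_ids, sensitive_positions, p_diff_seq)
print(dispreferred)
```

Each successive response differs from the preferred one in one more position, which gives the graded set \(\{y_{l1}, y_{l2}, ..., y_{lk}\}\) its ordering by degree of error.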

Key Design 2: Similarity-Aware Preference Reranking

To ensure preference rankings accurately reflect semantic differences:

  • Compute the semantic similarity \(\text{sim}(y_{lk}, y_w)\) between each dispreferred response \(y_{lk}\) and the preferred response \(y_w\).
  • Rerank them in descending order of similarity.
  • Select \(j\) responses (\(j=3\)) that have a similarity difference of at least 0.1 for optimization.
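The reranking rule can be sketched as a greedy pass over the sorted candidates. The similarity scores are taken as given inputs here, since the summary does not name the similarity model, and the interpretation of "a similarity difference of at least 0.1" as a gap between consecutively selected responses is an assumption; `select_graded_negatives` is a hypothetical helper.

```python
def select_graded_negatives(candidates, sims, j=3, min_gap=0.1):
    """Sort candidates by similarity to the preferred response (descending)
    and greedily keep up to j whose similarity is at least `min_gap`
    below the previously kept one."""
    ranked = sorted(zip(candidates, sims), key=lambda cs: cs[1], reverse=True)
    selected = []
    for cand, score in ranked:
        if not selected or selected[-1][1] - score >= min_gap:
            selected.append((cand, score))
        if len(selected) == j:
            break
    return selected

# Toy similarity scores between each dispreferred response and y_w.
candidates = ["y_l1", "y_l2", "y_l3", "y_l4", "y_l5"]
sims = [0.92, 0.88, 0.75, 0.71, 0.50]
selected = select_graded_negatives(candidates, sims)
print(selected)
```

Near-duplicates such as `y_l2` (only 0.04 below `y_l1`) are skipped, so the retained responses are separated by a clear semantic margin.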

Key Design 3: Multi-Level Preference Optimization (MLPO)

Explicit Preference Learning—distinguish between correct and incorrect responses:

\[L_E = -\sum_{j=1}^{k} \mathbb{E}_{(x,y_w,y_{lj})\sim D}\left[\log\sigma\left(\gamma\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{sft}}(y_w|x)} - \gamma\log\frac{\pi_\theta(y_{lj}|x)}{\pi_{\text{sft}}(y_{lj}|x)}\right)\right]\]

This is similar to standard DPO, but contrasts the preferred response with all dispreferred responses (rather than only a single pair).
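To make the relation to DPO explicit, setting \(k=1\) in \(L_E\) leaves a single pair:

\[L_E\big|_{k=1} = -\mathbb{E}_{(x,y_w,y_{l1})\sim D}\left[\log\sigma\left(\gamma\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{sft}}(y_w|x)} - \gamma\log\frac{\pi_\theta(y_{l1}|x)}{\pi_{\text{sft}}(y_{l1}|x)}\right)\right]\]

This has the same form as the standard DPO objective, with \(\gamma\) playing the role of the DPO temperature and \(\pi_{\text{sft}}\) the role of the reference policy.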

Implicit Preference Learning—distinguish between varying degrees of incorrect responses:

\[L_I = -\sum_{j=1}^{k}\sum_{m=j+1}^{k} \mathbb{E}_{(x,y_{lj},y_{lm})\sim D}\left[\log\sigma\left(\gamma\log\frac{\pi_\theta(y_{lj}|x)}{\pi_{\text{sft}}(y_{lj}|x)} - \gamma\log\frac{\pi_\theta(y_{lm}|x)}{\pi_{\text{sft}}(y_{lm}|x)}\right)\right]\]

This encourages the model to learn the relative quality ranking among dispreferred responses (those closer to being correct should rank higher than those further away).

Total Loss: \(L_{\text{HSCR}} = L_E + L_I\)
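A per-example version of the two losses can be sketched directly from the formulas, taking the log-ratios \(\log\frac{\pi_\theta(y|x)}{\pi_{\text{sft}}(y|x)}\) as precomputed inputs. This is a sketch under assumptions: the expectation over \(D\) is dropped, the dispreferred responses are assumed to be ordered from closest-to-correct to most corrupted, and the function names are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def pair_loss(ratio_better, ratio_worse, gamma=0.1):
    """-log sigma(gamma * (ratio_better - ratio_worse)),
    where each ratio is log(pi_theta(y|x) / pi_sft(y|x))."""
    return -log_sigmoid(gamma * (ratio_better - ratio_worse))

def hscr_loss(ratio_w, ratios_l, gamma=0.1):
    """Return (L_E, L_I, L_E + L_I) for one example.

    ratios_l must be ordered from closest-to-correct to most corrupted.
    """
    k = len(ratios_l)
    explicit = sum(pair_loss(ratio_w, r, gamma) for r in ratios_l)
    implicit = sum(pair_loss(ratios_l[j], ratios_l[m], gamma)
                   for j in range(k) for m in range(j + 1, k))
    return explicit, implicit, explicit + implicit

# Toy log-ratios: preferred response, then three graded dispreferred ones.
ratio_w = 2.0
ratios_l = [1.0, 0.0, -1.0]
explicit, implicit, total = hscr_loss(ratio_w, ratios_l)
print(round(explicit, 4), round(implicit, 4), round(total, 4))
```

At initialization, when \(\pi_\theta = \pi_{\text{sft}}\), every log-ratio is zero and each of the six pairs contributes \(\log 2\); the loss falls below that value once the policy ranks the responses in the intended order.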

Ingenuity of Loss Function Design

  • Explicit preference provides coarse-grained alignment directions.
  • Implicit preference captures fine-grained preference gradients.
  • The two complement each other, allowing the model to simultaneously learn "what is correct" and "how severe the error is".

Key Experimental Results

Experimental Setup

  • Visual Encoder: CLIP-ViT-L/14@336px
  • LLM: Mistral-7B
  • Training Data: Only 2,000 training samples
  • Hyperparameters: \(j=3, \beta=0.9, n=10, \gamma=0.1\), LoRA rank=16, 2 epochs
  • Evaluation: RAD-VQA, SLAKE, PathVQA (open/closed-ended); captioning & instruction-following

Med-VQA Main Results

| Method | RAD-VQA (Open/Closed) | SLAKE (Open/Closed) | PathVQA (Open/Closed) |
|---|---|---|---|
| GPT-4o | 51.6/63.97 | 59.06/71.63 | 24.14/75.97 |
| LLaVA-Med1.5 | 32.31/56.62 | 42.45/56.49 | 10.01/59.75 |
| ST-LLaVA | 33.81/59.16 | 40.13/55.53 | 10.38/52.05 |
| LiPO | 31.85/57.37 | 43.18/58.13 | 9.37/60.17 |
| HSCR | 35.92/60.13 | 45.32/63.46 | 12.36/64.17 |

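Under the zero-shot setting, HSCR achieves the best results among the compared methods other than GPT-4o on all six metrics. On the SLAKE closed-ended task it gains 6.97 points over LLaVA-Med1.5 (56.49 to 63.46), and on the RAD-VQA closed-ended task it comes within 3.84 points of GPT-4o (60.13 vs. 63.97).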
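The per-benchmark gains over LLaVA-Med1.5 follow directly from the table; a short check, using only the numbers reported above (the dictionary layout is just a convenience for this sketch):

```python
# (open-ended, closed-ended) accuracy, copied from the main results table.
baseline = {"RAD-VQA": (32.31, 56.62), "SLAKE": (42.45, 56.49), "PathVQA": (10.01, 59.75)}
hscr = {"RAD-VQA": (35.92, 60.13), "SLAKE": (45.32, 63.46), "PathVQA": (12.36, 64.17)}

# Absolute gain of HSCR over LLaVA-Med1.5, in accuracy points.
gains = {
    name: tuple(round(h - b, 2) for h, b in zip(hscr[name], baseline[name]))
    for name in baseline
}
for name, (open_gain, closed_gain) in gains.items():
    print(f"{name}: open +{open_gain}, closed +{closed_gain}")
```

The largest improvement is on SLAKE closed-ended questions, and the closed-ended gain exceeds 3.5 points on every benchmark.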

Captioning & Instruction-Following Results

| Method | Conversation | Description | Overall |
|---|---|---|---|
| LLaVA-Med1.5 SFT (60K-IM) | 58.6 | 42.5 | 54.4 |
| HSCR (2K) | 59.4 (+0.8) | 52.9 (+10.4) | 57.7 (+3.3) |

Key Findings: With only 2,000 samples, HSCR improves performance more than scaling SFT data from 10K to 60K does (+10.4 vs. +4.4 points on the description task).

Ablation Study

1. Explicit vs. Implicit Preferences

| Explicit | Implicit | SLAKE Closed |
|---|---|---|
| ✗ | ✗ | 56.49 |
| ✓ | ✗ | 57.78 (+1.29) |
| ✗ | ✓ | 60.32 (+3.83) |
| ✓ | ✓ | 63.46 (+6.97) |

Implicit preferences outperform explicit preferences when used alone, but the combination of both yields the best results.

2. Preference Data Construction Methods

| Method | SLAKE Closed |
|---|---|
| LLaVA-Med1.5 Baseline | 56.49 |
| GPT-4o Generated Preference | 57.96 (+1.47) |
| HSCR Self-Contrastive Preference | 63.46 (+6.97) |

Self-contrastive preferences far outperform GPT-4o external preferences, validating the effectiveness of leveraging the model's intrinsic misalignment.

3. Comparison of Masking Strategies

| Strategy | SLAKE Closed | PathVQA Closed |
|---|---|---|
| Pixel-Level Mask | 57.49 | 60.79 |
| Patch-Level Mask | 58.44 | 61.83 |
| Latent Space Mask | 60.32 | 62.77 |
| Visual Token Dropout | 63.46 | 64.17 |

Perturbations closer to the input of the LLM backbone are more effective. Visual token dropout directly eliminates the transmission of visual information to the LLM, making it the most effective at triggering intrinsic misalignment.

4. Masking Ratio: 70% is optimal; a masking ratio below 50% is insufficient to effectively perturb visual information.

General Multimodal Task Generalization

Applying HSCR to a general VLM (LLaVA-v1.5) outperforms the DPO baseline on the AMBER benchmark, demonstrating that the method is not limited to the medical domain.

Highlights & Insights

  1. Clever exploitation of "defects": Instead of trying to fix the model's misalignment directly, this work leverages it to generate high-quality preference data—allowing the model to "expose its own problems" to train itself.
  2. Exceptional data efficiency: Reaching SOTA with only 2,000 samples is much more efficient than scaling data sizes in standard SFT.
  3. Value of implicit preference learning: Reveals that relative quality among dispreferred responses contains rich signals, which are ignored by traditional binary DPO.
  4. Design motivation for visual token dropout: Inspired by MAE and ViT, this design exposes the degree of modality coupling by directly removing visual information at the LLM input level.
  5. Representation learning analysis: Qualitative visualization using t-SNE shows that HSCR aligns correct response embeddings more tightly with image embeddings, directly validating the effectiveness of modality alignment.

Limitations & Future Work

  1. Medical data quality and diversity remain limited, which impacts the model's generalization to rare clinical cases.
  2. Evaluations are primarily conducted in controlled experimental environments, lacking clinical workflow integration and real-world validation.
  3. The 70% dropout rate is high; whether key visual information might be lost in certain scenarios warrants further investigation.
  4. The computational complexity of implicit preference is \(O(k^2)\)—scaling quadratically with the number of dispreferred responses.
  5. Validated only on Mistral-7B; the effects on larger-scale models or different architectures remain unknown.
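
The quadratic cost noted in item 4 can be made concrete by counting the pairwise terms in \(L_E\) and \(L_I\). Assuming the \(k\) in the loss equals the three dispreferred responses kept after reranking (the \(j=3\) of the experimental setup):

\[\underbrace{k}_{\text{explicit pairs}} + \underbrace{\binom{k}{2}}_{\text{implicit pairs}} = \frac{k(k+1)}{2}, \qquad k=3:\; 3 + 3 = 6\]

Each training example therefore contributes six pairwise comparisons in this setting, versus one for standard binary DPO.
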

Related Work

  • VLM Preference Optimization: RLHF-V (human feedback), POVID (diffusion-noise-generated rejection responses), RLAIF-V (multi-VLM aggregation)
  • Medical VLM Alignment: ST-LLaVA (self-training + GPT-4o scoring), MMedPO (multi-agent preference data construction)
  • Contrastive Decoding: VCD (Contrastive Decoding to reduce hallucination)
  • DPO and Variants: Rafailov et al., LiPO (listwise preference optimization)

Rating ⭐⭐⭐⭐

The method is elegantly designed and well motivated, and it is highly data-efficient (2K samples). The hierarchical multi-level preference optimization is novel, and the ablation studies validate the contribution of each component. The main limitations are that validation is restricted to 7B models and that evaluation in real clinical settings is missing.