Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization

Conference: CVPR 2025
arXiv: 2503.17928
Code: https://github.com/zhangzef/NaPO
Area: Alignment RLHF
Keywords: Modality Bias, Noise-Aware Optimization, Preference Learning, MLLM Debiasing, Hallucination Mitigation

TL;DR

Targeting the modality bias issue in MLLMs (over-reliance on language priors or visual details), NaPO constructs a biased dataset, RLAIF-V-Bias, by masking modality information. It proposes a noise-aware preference optimization algorithm based on a negative Box-Cox transformation to achieve robust training on automatically constructed noisy data, yielding superior results in both debiasing and hallucination mitigation.

Background & Motivation

Background: MLLMs perform remarkably well across various tasks, but suffer from a pervasive modality bias problem—models tend to over-rely on information from one modality while ignoring the other.

Limitations of Prior Work: Modality bias can be categorized into two types: (1) Language bias—models rely heavily on language priors while ignoring visual inputs (e.g., answering "bears are brown" even when shown a polar bear image); (2) Visual bias—models pay excessive attention to visual details and generate content unrelated to the query (e.g., describing numerous irrelevant visual details when asked "is the house on the left?"). Existing methods either require balancing the dataset distribution or rely on large-scale supervised fine-tuning, with the latter risking the loss of pre-existing knowledge.

Key Challenge: When formulating debiasing as a preference optimization problem, automatically constructing biased data is relatively easy. However, automated data inevitably contains noise (some "biased" responses are actually of decent quality), and standard DPO is highly susceptible to overfitting on such noisy data.

Goal: (1) How can effective debiasing preference data be constructed automatically? (2) How can preference optimization remain robust on automatically constructed data that contains noise?

Key Insight: Control the information flow via masking to generate biased responses—masking visual information yields language-biased responses, and masking text information yields visually-biased responses. Then, handle the inevitable noise in the data using a noise-aware preference optimization loss.

Core Idea: Use modality masking to construct biased data, employ a negative Box-Cox transformation to smoothly transition between BCE and MAE, and dynamically adjust the robustness of the optimization based on the noise level of the data.
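
As a worked check of the two limits, write \(p\) for the preference probability, i.e. the sigmoid of the DPO reward margin. The shorthand \(p\) is introduced here for convenience and is not necessarily the paper's notation. The negative Box-Cox family is then

\[
\mathcal{L}_q(p) = \frac{1 - p^q}{q}, \qquad q \in (0, 1].
\]

Its two endpoints recover the familiar losses:

\[
\lim_{q \to 0} \frac{1 - p^q}{q} = \lim_{q \to 0} \frac{1 - e^{q \log p}}{q} = -\log p \quad \text{(BCE)}, \qquad \mathcal{L}_1(p) = 1 - p \quad \text{(MAE)}.
\]

Differentiating gives \(\partial \mathcal{L}_q / \partial p = -p^{q-1}\), compared with \(-1/p\) for BCE. Relative to BCE, each pair's gradient is therefore scaled by \(p^q \le 1\), which shrinks the pull of low-\(p\) pairs (the ones most likely to be mislabeled) as \(q\) grows.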

Method

Overall Architecture

The method consists of two stages: (1) Data Construction—based on the RLAIF-V dataset, language-biased and visually-biased responses are generated by masking visual/text modality information, forming the RLAIF-V-Bias dataset; (2) Training Algorithm—optimization is conducted using the NaPO algorithm, where standard DPO loss is applied to the original preference data, noise-aware NaPO loss is applied to the biased contrastive data, and dynamic weights are used to balance the three losses.
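
The data construction stage can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `mllm` callable, the `BiasSample` container, and the convention of passing `None` in place of `[MASK]` are all hypothetical names and choices made here for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical generator signature: (image, text) -> response.
# Passing None stands in for the [MASK] placeholder; the actual masking
# mechanism used by the authors may differ.
Generator = Callable[[Optional[object], Optional[str]], str]


@dataclass
class BiasSample:
    image: object
    prompt: str
    chosen: str           # y_w: preferred response from the original data
    rejected: str         # y_l: rejected response from the original data
    language_biased: str  # y_lb = MLLM([MASK]; t)
    visual_biased: str    # y_vb = MLLM(v; [MASK])


def build_bias_sample(mllm: Generator, image: object, prompt: str,
                      chosen: str, rejected: str) -> BiasSample:
    """Extend one preference pair with two automatically generated losers."""
    return BiasSample(
        image=image,
        prompt=prompt,
        chosen=chosen,
        rejected=rejected,
        # Visual input masked: the model can only lean on language priors.
        language_biased=mllm(None, prompt),
        # Text input masked: the model can only describe what it sees.
        visual_biased=mllm(image, None),
    )
```

No filtering step appears in the sketch because, as described above, noisy biased responses are handled later at the loss level rather than removed.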

Key Designs

  1. Modality Biased Response Generation (RLAIF-V-Bias Dataset):

    • Function: Automatically construct preference training data tailored for language and visual biases.
    • Mechanism: Language-biased responses are generated via \(y_{lb} = \text{MLLM}([\text{MASK}]; t)\) (masking all visual information so the model relies solely on language priors); visually-biased responses are generated via \(y_{vb} = \text{MLLM}(v; [\text{MASK}])\) (masking all text information so the model relies solely on visual content). The final dataset comprises: original unbiased responses (winning samples) + language-biased responses + visually-biased responses (losing samples).
    • Design Motivation: Capture biased samples at low cost by controlling the information flow rather than using manual annotation; eschew explicit filtering in favor of handling noise via soft selection in the downstream NaPO phase.
  2. Noise-Aware Preference Optimization (NaPO):

    • Function: Achieve robust preference optimization on automatically constructed data containing noise.
    • Mechanism: Unify the BCE loss in DPO with the noise-robust MAE loss using a negative Box-Cox transformation. The NaPO loss is formulated as \(\mathcal{L}_{\text{NaPO}} = \frac{1}{q}(1 - \sigma(\beta \log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)})^q)\), where \(q \in (0,1]\) controls the noise robustness: \(q \to 0\) approaches BCE (fast convergence but noise-sensitive), while \(q \to 1\) approaches MAE (noise-robust but slow convergence).
    • Design Motivation: MAE satisfies the symmetric loss condition (noise-robustness), which BCE lacks, though BCE achieves faster convergence. Balance them dynamically by adjusting \(q\).
  3. Adaptive Noise Coefficients and Dynamic Weights:

    • Function: Automatically adjust the \(q\) value and loss weight according to the noise level of each sample.
    • Mechanism: A key observation is that noisy samples (which are mislabeled as biased but are actually unbiased) tend to have a small reward margin, whereas truly biased samples have a larger margin. Thus: \(q = 1 - \sigma(\alpha \cdot \psi(x, y_w, y_l))\), where a larger margin leads to a smaller \(q\) (closer to BCE as the data is trustworthy), and a smaller margin leads to a larger \(q\) (closer to MAE due to potential noise). Meanwhile, the margin is also utilized to calculate the weight \(\gamma_i\) for each loss term.
    • Design Motivation: Noise characteristics differ between language-biased and visually-biased data—language bias uses the average log probability (\(\psi_\mu\), \(\alpha=0.5\)) to distinguish noise, while visual bias uses the sum of log probabilities (\(\psi_\Sigma\), \(\alpha=0.01\)).
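
The NaPO loss and the adaptive coefficient from designs 2 and 3 can be sketched in plain Python for a single preference pair. This is an illustrative sketch under stated assumptions: the margin \(\psi\) is passed in as a precomputed scalar because its exact definition is not reproduced in this summary, and the `q_min` clamp is a numerical safeguard added here, not something taken from the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_prob(logp_w: float, ref_logp_w: float,
                    logp_l: float, ref_logp_l: float,
                    beta: float = 0.1) -> float:
    """p = sigma(beta * [log-ratio of y_w minus log-ratio of y_l])."""
    return sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))


def adaptive_q(psi: float, alpha: float, q_min: float = 1e-6) -> float:
    """q = 1 - sigma(alpha * psi): a large margin gives a small, BCE-like q.

    q_min is an assumption of this sketch; it keeps q strictly positive.
    """
    return max(q_min, 1.0 - sigmoid(alpha * psi))


def napo_loss(p: float, q: float) -> float:
    """Negative Box-Cox loss (1 - p^q) / q for q in (0, 1]."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return (1.0 - p ** q) / q
```

For a pair with \(p = 0.3\), `napo_loss` moves between the MAE value \(1 - p = 0.7\) at \(q = 1\) and the BCE value \(-\log 0.3 \approx 1.204\) as \(q\) approaches 0.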

Loss & Training

The final optimization objective is: \(\mathcal{L}_\gamma = \gamma_{y_l} \cdot \mathcal{L}_{\text{DPO}}(x, y_w, y_l) + \gamma_{y_{lb}} \cdot \mathcal{L}_{\text{NaPO}}(x, y_w, y_{lb}) + \gamma_{y_{vb}} \cdot \mathcal{L}_{\text{NaPO}}(x, y_w, y_{vb})\). The original preference pairs are high quality and use the DPO loss; the automatically generated biased pairs are noisy and use the NaPO loss; the weights \(\gamma\) are computed dynamically from the margins. Training configuration: LLaVA-v1.5-7B, \(\beta=0.1\), learning rate 5e-7, 4 epochs, batch size 4, on 8×A100 80GB GPUs. Training took 7 hours.
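
A minimal sketch of how the three terms combine for one training example is given below. The function names are hypothetical, and the weights \(\gamma\) are taken as plain inputs because this summary does not give the formula that derives them from the margins.

```python
import math


def dpo_loss(p: float) -> float:
    """Standard DPO term: binary cross-entropy on the preference probability."""
    return -math.log(p)


def napo_loss(p: float, q: float) -> float:
    """Negative Box-Cox term (1 - p^q) / q for q in (0, 1]."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return (1.0 - p ** q) / q


def total_loss(p_l: float, p_lb: float, p_vb: float,
               q_lb: float, q_vb: float,
               gammas: tuple = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of one DPO term and two NaPO terms.

    p_l, p_lb, p_vb: preference probabilities of y_w over y_l, y_lb, y_vb.
    q_lb, q_vb: noise coefficients for the two biased pairs.
    gammas: (gamma_l, gamma_lb, gamma_vb), supplied by the caller here.
    """
    g_l, g_lb, g_vb = gammas
    return (g_l * dpo_loss(p_l)
            + g_lb * napo_loss(p_lb, q_lb)
            + g_vb * napo_loss(p_vb, q_vb))
```

Setting `gammas=(1.0, 0.0, 0.0)` reduces the objective to plain DPO on the original pair, which makes the contribution of the two biased terms easy to isolate when debugging.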

Key Experimental Results

Main Results

| Model Configuration | VLind CB↑ | VLind LP↑ | ObjHal CHAIR_s↓ | ObjHal CHAIR_i↓ | AMBER HalRate↓ | MMHal Score↑ |
| --- | --- | --- | --- | --- | --- | --- |
| LLaVA-v1.5-7B (Baseline) | 0.0 | 0.0 | 53.6 | 25.2 | 36.4 | 2.11 |
| + RLAIF-V (Standard DPO) | 39.4 | 25.4 | 32.0 | 8.5 | 23.4 | 3.23 |
| + RLAIF-V-Bias (DPO) | 0.3 | 0.4 | 35.3 | 10.5 | 22.4 | 3.28 |
| + RLAIF-V-Bias (NaPO) | 58.9 | 44.0 | 25.7 | 6.2 | 20.7 | 3.31 |

Ablation Study

| Configuration | VLind CB↑ | VLind LP↑ | CHAIR_s↓ | CHAIR_i↓ |
| --- | --- | --- | --- | --- |
| Full (NaPO + Dynamic Weights) | 58.9 | 44.0 | 25.7 | 6.2 |
| w/o Dynamic Weights | 50.0 | 38.2 | 27.7 | 8.0 |
| NaPO → DPO | 43.4 | 32.2 | 29.0 | 8.3 |
| Language-biased data only | 40.4 | 36.4 | 28.0 | 6.4 |
| Visually-biased data only | 62.3 | 31.4 | 26.3 | 7.6 |

Key Findings

  • DPO fails severely on biased data: Training on the RLAIF-V-Bias data with standard DPO yields a VLind CB score of only 0.3, essentially the untrained baseline (0.0) and far below the 39.4 obtained with the original RLAIF-V data. This indicates that standard DPO does not cope with the noise in automatically constructed data.
  • Language and visual bias data are complementary: Language-biased data alone is more effective against language priors (LP 36.4 vs. 31.4), while visually-biased data alone is stronger on common-sense bias (CB 62.3 vs. 40.4) and hallucination (CHAIR_s 26.3 vs. 28.0). Combining both gives the best LP (44.0) and CHAIR scores (25.7 / 6.2), although visually-biased data alone scores higher on CB (62.3 vs. 58.9).
  • Choice of noise metric is crucial: Replacing \(\psi_\mu\) with \(\psi_\Sigma\) for language bias causes the CB score to drop from 58.9 to 21.9.
  • Generalization to 13B: The method remains effective on LLaVA-v1.5-13B, raising the CB score from 31.5 to 42.1.

Highlights & Insights

  • Formulating the debiasing problem as preference optimization is very natural, since biased responses inherently constitute undesired behaviors, a setting for which DPO is natively suited. The core innovation lies in addressing the practical challenge of noisy automatic data.
  • The negative Box-Cox transformation unifies MAE and BCE into one continuous family of losses, so the parameter \(q\) acts as an explicit dial for noise robustness, grounded in the symmetric-loss property of MAE.
  • The finding that language and visual biases require different noise metrics has clear practical value, and reflects the different distributional characteristics of the two noise types.

Limitations & Future Work

  • Validation is only conducted on LLaVA-v1.5; other advanced MLLMs (e.g., Qwen-VL, InternVL) have not been tested.
  • Replacing all DPO terms with NaPO leads to performance degradation (Table 6), indicating that NaPO still makes assumptions about data quality.
  • Whether bias is always harmful remains open to discussion—in certain scenarios, a moderate level of bias may actually be beneficial.
  • The choice of the noise coefficient \(\alpha\) relies on manual hyperparameter tuning, offering limited automation.

Comparison with Related Methods

  • vs RLAIF-V: RLAIF-V performs general preference optimization, whereas this work constructs data specifically targeting modality bias. With equivalent data volume, RLAIF-V-Bias + NaPO significantly outperforms RLAIF-V (CB +19.5, CHAIR_s -6.3).
  • vs Standard DPO: Standard DPO completely fails on automatically constructed noisy data (CB at only 0.3), which NaPO addresses by dynamically incorporating noise robustness.
  • vs Data Filtering Methods: Instead of explicit data filtering, this method acts via soft selection at the loss level to handle noise gracefully without discarding valuable information.

Rating

  • Novelty: ⭐⭐⭐⭐ The theoretical framework unifying BCE/MAE via the negative Box-Cox transformation is innovative, though the modality-masking approach for constructing biased data is relatively straightforward.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluations on 4 benchmarks, comprehensive ablation, and detailed analysis of noise metrics are solid, though model coverage is limited.
  • Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are clear, though the notation system is relatively complex.
  • Value: ⭐⭐⭐⭐ The NaPO algorithm can generalize to other noisy preference optimization scenarios, showing high practical utility.