VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare
Conference: ACL 2025
arXiv: 2502.13775
Code: https://github.com/anudeex/VITAL.git
Area: Medical NLP
Keywords: Pluralistic alignment, healthcare, LLM benchmark, value diversity, dataset
TL;DR
This paper constructs VITAL, the first pluralistic alignment benchmark dataset for the healthcare domain, containing 13.1K value scenarios and 5.4K multiple-choice questions. Extensive evaluation of 8 LLMs demonstrates that existing pluralistic alignment techniques (especially ModPlural) perform poorly in medical scenarios, and simple prompting yields better results.
Background & Motivation
Background: LLM alignment techniques (e.g., RLHF, DPO) are increasingly mature, but typically model the "average" preference, neglecting value diversity across different cultures, demographics, and communities. Sorensen et al. (2024) proposed a pluralistic alignment framework, defining three modes: Overton (covering all diverse perspectives), Steerable (steering based on user-specified attributes), and Distributional (matching real-world distributions). Feng et al. (2024) proposed the ModPlural multi-LLM collaboration scheme.
Limitations of Prior Work: (1) Existing alignment datasets (such as OpinionQA, GlobalOpinionQA) are not focused on the healthcare domain; (2) Plurality is particularly critical in medical scenarios, where culture, religion, and personal values influence health decisions; (3) The effectiveness of existing pluralistic alignment techniques in the healthcare domain remains unverified.
Key Challenge: General-purpose pluralistic alignment techniques may not transfer to domain-specific settings. Misalignment in medical scenarios can lead to harmful health advice or belief homogenization.
Goal: (1) Construct the first pluralistic alignment benchmark for healthcare; (2) Systematically evaluate existing methods on this benchmark; (3) Explore future improvement directions.
Key Insight: Healthcare is a highly sensitive and contested domain, so the authors build the pluralistic dataset from surveys, opinion polls, and moral dilemmas that touch on health decisions.
Core Idea: The healthcare domain requires specialized pluralistic alignment benchmarks and methods, as general-purpose solutions show limited effectiveness here.
Method
Overall Architecture
Construct the VITAL dataset \(\rightarrow\) Evaluate 8 LLMs using four alignment techniques (Vanilla, Prompting, MoE, ModPlural) \(\rightarrow\) Analyze performance across three pluralistic modes (Overton, Steerable, Distributional).
Key Designs
- Dataset Construction (VITAL):
  - Function: Construct a healthcare pluralistic alignment benchmark containing 13.1K value scenarios and 5.4K multiple-choice questions
  - Mechanism: Collect multiple-choice questions from existing surveys and moral datasets (such as OpinionQA, GlobalOpinionQA, and MoralChoice), then apply few-shot classification with FLAN-T5 to retain only samples that are health-related, represent pluralistic viewpoints, and require actions
  - Data Distribution: The Overton mode contains 1,649 text samples, the Steerable mode contains 15,340 samples (text+QA), and the Distributional mode contains 1,857 QA samples
  - Quality Validation: Human annotators checked 10% of the samples and confirmed 80% of them as health-related (Fleiss' Kappa: 0.49)
- Evaluation Techniques:
  - Vanilla: Direct LLM output with no instructions
  - Prompting: Appending pluralistic instructions within the prompt
  - MoE: The primary LLM functions as a router to select the most relevant community LLM (perspective/culture LLM), feeding its response back to the primary LLM for final generation
  - ModPlural: The primary LLM collaborates with multiple community LLMs. In the Overton mode, it concatenates community messages to perform multi-document summarization; in the Steerable mode, it selects the most relevant community LLM; in the Distributional mode, it aggregates the community probability distributions
- Evaluation Metrics:
  - Overton: Value coverage, computed from sentence-level entailment with an NLI model and supplemented by human evaluation and GPT-as-Judge
  - Steerable: Accuracy (whether the final response adheres to the designated steering attribute)
  - Distributional: Jensen-Shannon divergence between the model's answer distribution and the ground-truth distribution (lower is better)
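The Distributional metric and ModPlural's aggregation step can be sketched together. This is a minimal illustration, not the authors' implementation: `mix_distributions` and `js_divergence` are hypothetical helper names, the equal community weights and the example probabilities are made up, and base-2 logarithms are assumed so that the divergence falls in [0, 1]. Whether the paper reports the divergence itself or its square root (the Jensen-Shannon distance) is not stated in this summary.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two distributions over the same options."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same answer options")
    # m is the midpoint distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def mix_distributions(
    dists: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted average of per-community answer distributions (assumed scheme)."""
    total = sum(weights)
    n_options = len(dists[0])
    return [
        sum(w * d[i] for w, d in zip(weights, dists)) / total
        for i in range(n_options)
    ]


# Hypothetical three-option question with two community LLMs.
community_dists = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]]
model_dist = mix_distributions(community_dists, weights=[1.0, 1.0])
survey_dist = [0.5, 0.3, 0.2]  # made-up ground-truth distribution
print(f"JSD = {js_divergence(model_dist, survey_dist):.4f}")
```

A score of 0 means the aggregated distribution matches the survey exactly, and 1 (in base 2) means the two distributions place mass on disjoint options.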
LLM Agent Experiments
The experiment investigated replacing fine-tuned community LLMs with lightweight LLM agents (role-playing agents based on Mistral-7B). A healthcare-specific agent pool was constructed, with GPT-4o selecting the 6 most relevant agents. The NLI coverage of the 6 agents was 44.16% (vs. 47.84% for original community LLMs), which increased to 49.37% when using 10 agents.
Key Experimental Results
Main Results: Overton Mode Coverage (%)
| Method | LLaMA2-7B | Gemma-7B | Qwen2.5-7B | LLaMA3-8B | ChatGPT | Average |
|---|---|---|---|---|---|---|
| Aligned Vanilla | 20.76 | 38.60 | 32.41 | 18.93 | 26.70 | 26.10 |
| + Prompting | 22.88 | 40.61 | 34.42 | 27.41 | 32.22 | 30.46 |
| + MoE | 19.58 | 26.00 | 28.14 | 24.70 | 18.84 | 22.79 |
| + ModPlural | 15.38 | 22.18 | 22.30 | 24.51 | 18.06 | 20.09 |
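The gaps in the table are easier to compare as differences from the vanilla baseline. The short script below copies the numbers from the table and computes, for each method, the change in the reported Average column in percentage points. The names `OVERTON_TABLE` and `summarize` are illustrative only. It also computes the mean of the five model columns shown, which does not equal the reported Average (for example 27.48 vs. 26.10 for the vanilla row); a plausible but unconfirmed explanation is that the Average covers more models than the columns displayed here.

```python
from typing import Dict, List, Tuple

# method -> (scores for the five models shown, reported Average column),
# copied from the Overton coverage table.
OVERTON_TABLE: Dict[str, Tuple[List[float], float]] = {
    "Aligned Vanilla": ([20.76, 38.60, 32.41, 18.93, 26.70], 26.10),
    "+ Prompting": ([22.88, 40.61, 34.42, 27.41, 32.22], 30.46),
    "+ MoE": ([19.58, 26.00, 28.14, 24.70, 18.84], 22.79),
    "+ ModPlural": ([15.38, 22.18, 22.30, 24.51, 18.06], 20.09),
}


def summarize(
    table: Dict[str, Tuple[List[float], float]],
    baseline: str = "Aligned Vanilla",
) -> Dict[str, Dict[str, float]]:
    """Per-method mean of the shown columns and change vs. the baseline average."""
    base_avg = table[baseline][1]
    summary = {}
    for method, (scores, reported_avg) in table.items():
        summary[method] = {
            "mean_of_shown": sum(scores) / len(scores),
            "reported_avg": reported_avg,
            "delta_vs_baseline": reported_avg - base_avg,  # percentage points
        }
    return summary


for method, stats in summarize(OVERTON_TABLE).items():
    print(
        f"{method:16s} reported avg {stats['reported_avg']:6.2f}  "
        f"mean of shown {stats['mean_of_shown']:6.2f}  "
        f"delta vs vanilla {stats['delta_vs_baseline']:+6.2f}"
    )
```

On the reported averages, prompting gains 4.36 points over vanilla, while MoE and ModPlural lose 3.31 and 6.01 points respectively.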
Ablation Study: Comparison of Community LLM Sources
| Configuration | LLaMA2-7B | LLaMA3-8B | Gemma-7B |
|---|---|---|---|
| Perspective Community LLM | 15.15 | 23.82 | 22.37 |
| Culture Community LLM | 17.61 | 25.11 | 22.45 |
| Healthcare LLM as Primary Model | 12.00 (ModPlural) | - | - |
| Replacing Community LLMs with 6 LLM Agents | 44.16 (NLI) | - | - |
| 10 LLM Agents | 49.37 (NLI) | - | - |
Key Findings
- Prompting > ModPlural: Across all 8 models and 3 alignment modes, simple prompting consistently outperformed the more complex ModPlural multi-LLM collaboration scheme, with the maximum performance gap reaching 55.5%.
- No Consistent Gain from Model Scale: Scaling up models did not yield consistent performance improvements.
- NLI Evaluation Bias: Overton coverage is positively correlated with the number of generated response sentences. ModPlural's summarization tends to compress multiple arguments into a single sentence, resulting in artificially lower NLI scores.
- Inadequacy of Direct Domain-Specific Model Replacement: Using a medical-specialized LLM (mental-llama2-7b) as the primary model yielded no substantial improvements, suggesting that simple domain "patching" is insufficient.
- Smaller Gap in Distributional Mode: ModPlural performed reasonably well in the Distributional mode, narrowing its gap to the other methods.
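The sentence-count effect in the NLI finding follows from how a sentence-level coverage score is typically structured. The sketch below is one plausible reading of the Overton metric, not the paper's code: a reference value counts as covered if any sentence of the response entails it. `overton_coverage` and `split_sentences` are hypothetical names, and `keyword_entails` is a toy substring check standing in for a real NLI model so that the example runs without downloads.

```python
import re
from typing import Callable, List, Sequence


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def overton_coverage(
    response: str,
    reference_values: Sequence[str],
    entails: Callable[[str, str], bool],
) -> float:
    """Fraction of reference values entailed by at least one response sentence."""
    if not reference_values:
        raise ValueError("need at least one reference value")
    sentences = split_sentences(response)
    covered = sum(
        any(entails(sentence, value) for sentence in sentences)
        for value in reference_values
    )
    return covered / len(reference_values)


def keyword_entails(premise: str, hypothesis: str) -> bool:
    """Toy stand-in for an NLI model: substring match, NOT real entailment."""
    return hypothesis.lower() in premise.lower()


values = ["autonomy", "safety", "religious freedom"]  # made-up value labels
short = "Patient autonomy should be respected."
longer = short + " Public safety also matters. Religious freedom may shape the decision."

print(overton_coverage(short, values, keyword_entails))
print(overton_coverage(longer, values, keyword_entails))
```

Because each value is scored with an `any` over sentences, appending sentences can only keep coverage equal or raise it. With a real NLI model, a single dense summary sentence may additionally fail to entail each individual value clearly, which would be consistent with the lower ModPlural scores reported above.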
Highlights & Insights
- Counter-Intuitive Finding: Complex multi-LLM collaboration schemes (ModPlural, MoE) underperform simple prompting in healthcare scenarios, indicating that general-purpose pluralistic schemes may not transfer to domain-specific tasks and that domain-specific design matters.
- Feasibility of Replacing Community LLMs with Agents: 10 lightweight agents outperformed the fine-tuned community LLMs in NLI coverage (49.37% vs. 47.84%) without requiring expensive fine-tuning, offering a scalable alternative whose agent pool can be extended as needed.
- Dataset Design: The approach of combining textual scenarios with multiple-choice questions to cover three pluralistic modes could plausibly transfer to other sensitive domains such as law and education.
Limitations & Future Work
- Data construction relies heavily on FLAN-T5 filtering, and human verification covered only 10% of the samples with moderate agreement (Kappa 0.49), so the dataset may contain noise.
- The Overton evaluation leverages NLI models at the sentence level to judge entailment, which suffers from biases related to sentence count and semantic compression.
- Larger state-of-the-art LLMs (e.g., GPT-4, LLaMA3-70B) were not evaluated.
- No novel alignment method was proposed; research was restricted to benchmark evaluation.
Related Work & Insights
- vs. ModPlural (Feng et al., 2024): ModPlural is the state-of-the-art pluralistic alignment method. This paper exposes its shortcomings in healthcare, despite its strong performance in general domains.
- vs. OpinionQA (Santurkar et al., 2023): OpinionQA is a widely used benchmark dataset for alignment but does not focus on healthcare. VITAL addresses this critical gap.
- vs. MoralChoice (Scherrer et al., 2023): MoralChoice offers moral scenarios but is not medical-specific; VITAL filters and extends a healthcare-related subset from it.
Rating
- Novelty: ⭐⭐⭐⭐ The first healthcare pluralistic alignment benchmark, filling an important gap.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across 8 models, 4 methods, and 3 modes for a comprehensive study, but lacks a proposed novel methodology.
- Writing Quality: ⭐⭐⭐⭐ Clear structure and detailed analysis.
- Value: ⭐⭐⭐⭐ Both the dataset and counter-intuitive findings provide valuable reference for the community.