PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference

Basic Information

Conference: ACL 2025
Dataset: PKU-Alignment/PKU-SafeRLHF
Institution: Peking University / HKUST / Infinigence-AI
Area: LLM Safety Alignment / RLHF
Keywords: safety alignment, RLHF, preference data, harm categories, severity levels, moderation

TL;DR

This work releases PKU-SafeRLHF, a large-scale safety preference dataset containing 44.6k refined prompts, 265k QA pairs with safety meta-labels, and 166.8k preference data items. It introduces 19 harm categories and 3 severity levels of annotation for the first time, and trains a severity-sensitive moderation model (93% accuracy) along with a SafeRLHF alignment pipeline based on this dataset.

Background & Motivation

  • LLM Safety Issues: LLM training data is sourced from the internet, containing substantial noise, errors, and societal biases, which might lead models to generate offensive content, leak privacy, spread misinformation, or even exhibit deceptive alignment behavior.
  • Data Bottlenecks: Safety alignment methods depend heavily on high-quality preference data and meta-label classifications, but large-scale annotation is expensive.
  • Limitations of Prior Work:
    • Prompts in the earlier BeaverTails dataset were collected from the internet; they are limited in quality and diversity, and their harm categories follow a long-tail distribution.
    • Severity grading is missing: most existing work relies only on binary safe/unsafe classification.
    • Helpfulness and harmlessness are often annotated jointly, so the two dimensions cannot be studied separately.
  • Goal: Provide an open-source, high-quality, multi-level safety preference dataset to support finer-grained risk control and more effective safety alignment.

Method

Dataset Construction

Data Scale

| Component | Scale |
| --- | --- |
| Refined prompts | 44.6k |
| QA pairs (with safety meta-labels) | 265k |
| Preference data | 166.8k |
| Harm categories | 19 |
| Severity levels | 3 |

Prompt Generation

  • 63.6% of the prompts are generated by Alpaca3-70B, and 14% by WizardLM-30B-Uncensored.
  • Safety guidelines and few-shot examples are written for each harm category.
  • Severity rules are supplied as input so that the model generates three different prompts, ranging from minor to severe.
  • Alpaca3-70B is used with context expansion to extend prompts, enhancing diversity.

Response Generation

Alpaca-(1,2,3) models are used to generate responses:

  1. Generate high-quality responses using default parameters first.
  2. Increase the temperature to generate 10 additional responses.
  3. Rank based on textual similarity and filter out gibberish to select high-quality, low-similarity responses.

Compared to BeaverTails, semantic ambiguity and gibberish content are reduced by 32%.
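
The selection in step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the gibberish heuristic, the use of `difflib.SequenceMatcher` as the similarity measure, the greedy keep-if-dissimilar rule (swapped in here for the ranking step), and the `max_sim` threshold are all hypothetical choices.

```python
from difflib import SequenceMatcher


def looks_like_gibberish(text: str, min_words: int = 3, max_repeat_ratio: float = 0.5) -> bool:
    """Crude, hypothetical filter: too short, or dominated by one repeated token."""
    words = text.split()
    if len(words) < min_words:
        return True
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) > max_repeat_ratio


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def select_diverse(candidates: list[str], k: int, max_sim: float = 0.8) -> list[str]:
    """Greedily keep up to k non-gibberish candidates that are not
    too similar to any candidate already kept."""
    kept: list[str] = []
    for text in candidates:
        if looks_like_gibberish(text):
            continue
        if all(similarity(text, other) <= max_sim for other in kept):
            kept.append(text)
        if len(kept) == k:
            break
    return kept
```

Listing the default-parameter response first in `candidates` means it is always kept when it passes the filter, and the higher-temperature samples are added only if they differ enough from it.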

Model Selection Strategy

  • Base models from the Llama family (7B/8B) are fine-tuned (SFT) on the Alpaca 52K dataset.
  • Reasons for not using chat models or larger models:
    • RLHF requires a PTX loss, which calls for transparent SFT data; models fine-tuned in-house on Alpaca 52K provide this.
    • The 7B/8B parameter scale is suitable for academia to train on a single machine with 8x A800 GPUs.

19 Harm Categories

The categories span a broad range of safety risks, including insult, discrimination, privacy invasion, cybercrime, financial crime, white-collar crime, and mental manipulation, among others.

Correlation Analysis Between Categories:

  • The correlation coefficient between financial crime and white-collar crime is 0.55.
  • Insulting behavior and discriminatory behavior exhibit a significant correlation.
  • Most categories have low or negative correlations, demonstrating the effectiveness of the taxonomy.
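
As an illustration of how such a correlation between categories can be computed, the sketch below applies the Pearson coefficient to binary per-sample category indicators. The summary does not state which coefficient the authors used, so Pearson is an assumption, and the label vectors are toy data invented for the example.

```python
from math import sqrt


def pearson(xs: list[int], ys: list[int]) -> float:
    """Pearson correlation between two equal-length indicator vectors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data: 1 means the QA pair was tagged with that category (hypothetical labels).
labels = {
    "financial_crime":    [1, 1, 0, 0, 1, 0],
    "white_collar_crime": [1, 0, 0, 0, 1, 0],
    "insult":             [0, 0, 1, 1, 0, 0],
}

for a, b in [("financial_crime", "white_collar_crime"), ("financial_crime", "insult")]:
    print(f"{a} vs {b}: {pearson(labels[a], labels[b]):+.2f}")
```

Categories that tend to be assigned together get a positive coefficient, while categories that rarely co-occur get a value near zero or below it.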

Definition of 3 Severity Levels

| Level | Description | Scope of Impact |
| --- | --- | --- |
| Minor | Short-term, mild negative impact, self-recoverable | Individual |
| Moderate | Typically illegal, potentially causing severe personal harm or limited group impact | Individual \(\rightarrow\) Group |
| Severe | Targeted at groups, causing widespread severe harm and long-term impact | Group \(\rightarrow\) Society |

The severity definitions draw on references including the US Congress, MPAA movie ratings, FEMA emergency management, PEGI game ratings, and the Anthropic Responsible Scaling Policy.

Annotation Method: Dual-Preference + Single-Preference

Safety Meta-Label Annotation

  • 28 full-time annotators performed human-AI collaborative annotation.
  • Annotation content: safety (safe/unsafe), corresponding harm category (out of 19), and severity level (out of 3).
  • Compared to pure human annotation in BeaverTails, consistency is significantly improved.

Dual-Preference Annotation

  • Helpfulness preference \(\mathcal{D}_R\): Labelling which of the two responses to the same prompt is more helpful.
  • Harmlessness preference \(\mathcal{D}_C\): Labelling which of the two responses to the same prompt is more harmless.
  • Decoupled annotation allows for the independent training of the Reward Model and the Cost Model.
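
One way to picture the decoupled labels is a single record per response pair from which two separate training examples are derived. The sketch below is illustrative only: the `AnnotatedPair` class and its field names are hypothetical and are not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedPair:
    """One prompt with two responses and decoupled labels (illustrative fields)."""
    prompt: str
    responses: tuple[str, str]
    is_safe: tuple[bool, bool]
    more_helpful: int    # index (0 or 1) of the more helpful response
    more_harmless: int   # index (0 or 1) of the more harmless response


def to_helpfulness_example(p: AnnotatedPair) -> tuple[str, str, str]:
    """Entry for D_R: (prompt, preferred, rejected), ranked by helpfulness only."""
    w = p.more_helpful
    return p.prompt, p.responses[w], p.responses[1 - w]


def to_harmlessness_example(p: AnnotatedPair) -> tuple[str, str, str, bool, bool]:
    """Entry for D_C: (prompt, safer, less safe, safer_is_safe, other_is_safe)."""
    w = p.more_harmless
    return p.prompt, p.responses[w], p.responses[1 - w], p.is_safe[w], p.is_safe[1 - w]
```

Because the two indices are stored independently, a response can win on helpfulness and lose on harmlessness, which is exactly the case a single coupled label cannot express.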

Single-Preference Annotation

  • Annotators give one overall preference that weighs both helpfulness and harmlessness.
  • The trade-off between the two is therefore judged directly, in a single label.

Application 1: Severity-Sensitive Moderation Model

A moderation model is trained using all severity meta-labels:

| Method | Safety Accuracy | F1-Score | False Positive Rate |
| --- | --- | --- | --- |
| Llama-Guard | 0.78 | 0.71 | 0.055 |
| Llama-Guard 2 | 0.88 | 0.87 | 0.107 |
| Perspective API | 0.53 | 0.18 | 0.053 |
| OpenAI Moderation API | 0.53 | 0.10 | 0.002 |
| Ours | 0.93 | 0.93 | 0.077 |

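  • Binary classification (safe/unsafe) accuracy is 93%, ahead of the strongest baseline, Llama-Guard 2 (88%), although the false positive rate (0.077) is higher than that of Llama-Guard, Perspective API, and OpenAI Moderation API.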
  • Severity classification accuracy is 85%.
  • Precise matching accuracy for the 19 harm categories is 71.3%.
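
For reference, the metrics reported above can be computed as in the sketch below. It assumes that "unsafe" is the positive class (label 1) and that precise matching means the predicted category set equals the annotated set exactly; neither convention is confirmed by this summary.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, F1, and false positive rate with 1 = unsafe (assumed positive class)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


def exact_match_accuracy(true_sets: list[set[str]], pred_sets: list[set[str]]) -> float:
    """Share of samples whose predicted category set equals the annotated set."""
    return sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)
```

Exact matching is a strict criterion for multi-label output, since one missing or extra category makes the whole sample count as wrong.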

Application 2: Safe RLHF Pipeline

Reward Model (RM)

A Bradley-Terry model is used to train the helpfulness reward model:

\[\mathcal{L}_R(\phi; \mathcal{D}_R) = -\mathbb{E}[\log \sigma(R_\phi(y_w, x) - R_\phi(y_l, x))]\]
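
A minimal numeric sketch of this loss, assuming the reward scores \(R_\phi(y_w, x)\) and \(R_\phi(y_l, x)\) have already been computed for each pair; the function names are illustrative, and a real implementation would operate on model outputs in a deep learning framework.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_loss(pairs: list[tuple[float, float]]) -> float:
    """Mean of -log sigmoid(r_w - r_l) over (r_w, r_l) score pairs,
    where r_w is the score of the preferred (more helpful) response."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)
```

The loss equals \(\log 2\) when both responses score the same and shrinks toward zero as the preferred response's score pulls ahead.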

Cost Model (CM)

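In addition to the pairwise comparison loss, a classification term is introduced to leverage safety label information:

\[\mathcal{L}_C(\psi; \mathcal{D}_C) = -\mathbb{E}[\log \sigma(C_\psi(y_w, x) - C_\psi(y_l, x))] - \mathbb{E}[\log \sigma(s_w \cdot C_\psi(y_w, x)) + \log \sigma(s_l \cdot C_\psi(y_l, x))]\]

where \(s_w = s(y_w)\) and \(s_l = s(y_l)\), with \(s(y) = +1\) (harmful) or \(-1\) (harmless). The first term is the pairwise comparison loss, in which \(y_w\) denotes the response ranked as more harmful, following SafeRLHF (Dai et al., 2024); the second is the classification term.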
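
A minimal numeric sketch of a cost-model loss that combines a pairwise comparison term with the sign-based classification term described above. It assumes precomputed cost scores and assumes that \(y_w\) is the response ranked as more harmful (the SafeRLHF convention); the function names are illustrative.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def cost_loss(batch: list[tuple[float, float, int, int]]) -> float:
    """Mean loss over (c_w, c_l, s_w, s_l) tuples.

    c_w, c_l: cost scores of the more harmful and less harmful response (assumed order).
    s_w, s_l: +1 if the response is labeled harmful, -1 if harmless.
    """
    total = 0.0
    for c_w, c_l, s_w, s_l in batch:
        pairwise = log_sigmoid(c_w - c_l)                                 # ranking term
        classification = log_sigmoid(s_w * c_w) + log_sigmoid(s_l * c_l)  # sign term
        total -= pairwise + classification
    return total / len(batch)
```

The classification term pushes harmful responses toward positive cost and harmless ones toward negative cost, which is what makes a score of zero meaningful as a safety boundary.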

Experiments

RLHF Alignment Experiments (Table 2)

| Dataset / Setting | Alpaca1 Helpfulness | Alpaca1 Harmlessness | Alpaca2 Harmlessness | Alpaca3 Harmlessness |
| --- | --- | --- | --- | --- |
| BeaverTails (dual) | 76.8% | 83.7% | 63.8% | 77.1% |
| Ours (single) | 81.4% | 86.1% | 88.6% | 86.8% |
| Ours (dual) | 87.3% | 86.5% | 94.0% | 92.5% |

Key Findings:

  1. Dual-preference (decoupling helpfulness and harmlessness) significantly outperforms single-preference direct alignment.
  2. The quality of PKU-SafeRLHF data is superior to BeaverTails.

Direct Comparison Experiments (Table 3)

Win rates of PKU-SafeRLHF aligned models vs. original Alpaca models:

  • Helpfulness win rate: 80.86% to 90.25%
  • Harmlessness win rate: 86.50% to 92.33%

RM/CM Evaluation & Verification

  • The evaluations from RM and CM show high consistency with human evaluation (Figure 7a).
  • The safety threshold of CM (score = 0) aligns closely with the human-annotated safety boundaries (3~4 points).
  • This validates that the CM can serve as a reliable pointwise evaluation metric, despite being trained via pairwise ranking loss.
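
Using the Cost Model as a pointwise metric amounts to thresholding its score. The sketch below assumes that a cost above 0 means unsafe and measures how often that decision matches human safe/unsafe labels; the function names and the toy scores in the usage line are hypothetical.

```python
def is_unsafe(cost: float, threshold: float = 0.0) -> bool:
    """Pointwise decision: a cost above the threshold is treated as unsafe (assumed convention)."""
    return cost > threshold


def agreement(costs: list[float], human_unsafe: list[bool], threshold: float = 0.0) -> float:
    """Fraction of samples where the thresholded cost matches the human label."""
    matches = sum(is_unsafe(c, threshold) == h for c, h in zip(costs, human_unsafe))
    return matches / len(costs)


# Toy scores and labels, for illustration only.
print(agreement([-1.2, 0.4, 2.0, -0.1], [False, True, True, True]))
```

Sweeping `threshold` over a range of values would show whether 0 is in fact close to the boundary that best matches the human labels.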

Highlights & Insights

  1. Multi-level Safety Annotation: This work introduces 19 harm categories + 3 severity levels into safety preference data for the first time, moving beyond simple binary safety judgment.
  2. Decoupled Annotation Design: Dual-preference annotations for helpfulness and harmlessness enable researchers to independently study both dimensions and their trade-offs.
  3. High-Quality Data: Joint human-AI annotation significantly improves consistency, and the data generation pipeline reduces semantic ambiguity and gibberish content by 32% compared to BeaverTails.
  4. Practical Moderation Models: The severity-sensitive moderation models enable fine-grained control over different risk levels, making them suitable for real-world deployment.
  5. Calibration Properties of CM: Although the Cost Model is trained using pairwise loss, its scores align highly with human severity assessments, allowing for direct pointwise evaluation.

Limitations & Future Work

  1. Data Scale: Compared to large-scale preference datasets from commercial entities, the scale remains relatively small (166.8k).
  2. Category Overlap: There is unavoidable overlap among the 19 harm categories (e.g., financial crime \(\leftrightarrow\) white-collar crime).
  3. Limited Domain Adaptability: Although coverage is broad, highly specialized domains like legal, medical, and finance require domain-specific annotations.
  4. Cultural and Linguistic Limitations: The dataset focuses primarily on English, limiting its applicability to non-English languages and different cultural settings.
  5. Annotator Bias: Despite collaborative annotation, individual differences among the 28 annotators may still impact consistency.

Related Work

  • Safety Datasets: BeaverTails (Ji et al., 2024), PKU-Beaver (Dai et al., 2023)
  • Alignment Methods: RLHF (Ouyang et al., 2022), SafeRLHF (Dai et al., 2024), DPO (Rafailov et al., 2024)
  • Safety Moderation: Llama-Guard (Inan et al., 2023), Perspective API, OpenAI Moderation API
  • Safety Evaluation: Red Teaming (Zhu et al., 2023), Anthropic Responsible Scaling Policy

Rating

⭐⭐⭐⭐ (4/5)

  • Value: High-quality, multi-level safety preference dataset that fills the gap in fine-grained safety annotations (+1).
  • Practicality: The moderation models and RLHF pipeline can be directly applied to real-world applications (+0.5).
  • Annotation Design: The dual-preference + severity rating annotation scheme has both theoretical and practical significance (+0.5).
  • Open-Source Contribution: Dataset and models are fully open-sourced, benefiting community research (+0.5).
  • Deductions: Slightly limited data scale, hard-to-eliminate overlaps among harm categories, and English-only focus (-1).