
HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

Conference: NeurIPS 2025 arXiv: 2505.11475 Code: https://huggingface.co/datasets/nvidia/HelpSteer3 Area: Multilingual Translation Keywords: Preference Dataset, Human Annotation, STEM, Code, Multilingual, Reward Models, CC-BY-4.0

TL;DR

NVIDIA releases an open, human-annotated preference dataset of 40K+ samples covering general, STEM, code, and multilingual (13 languages) tasks. A reward model trained on this dataset reaches 82.4% on RM-Bench (roughly +10 points over prior open data), and the dataset carries a commercially friendly CC-BY-4.0 license.

Background & Motivation

Background: Preference data has evolved from low-quality annotations (HH-RLHF) to GPT-4-labeled data (UltraFeedback) to synthetically filtered datasets (HelpSteer2), yet diversity remains insufficient: nearly all mainstream datasets are English-only.

Limitations of Prior Work: As LLM applications expand into programming, scientific reasoning, and multilingual interaction, RLHF data must cover these emerging domains. GPT-4-annotated data is restricted from commercial use under terms of service.

Key Challenge: High quality, diversity, and permissive licensing are difficult to achieve simultaneously — prior datasets satisfy at most two of these three requirements.

Goal: Construct a large-scale, high-quality, permissively licensed preference dataset spanning STEM, code, and multilingual domains.

Key Insight: Engage annotators with domain-specific expertise (scientists, engineers, multilingual speakers) under a stratified quality control framework.

Core Idea: Expert-stratified annotation + multi-domain and multilingual coverage + CC-BY-4.0 = the most comprehensive open-source preference dataset to date.

Method

Overall Architecture

Stratified data collection: prompts for different subsets are sourced from different pools (e.g., ShareGPT, WildChat); 17 models generate paired responses; 3–5 independent expert annotators provide 7-point preference ratings with written rationales; and rigorous post-processing is applied throughout.
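To make the aggregation step concrete, here is a minimal sketch of collapsing several annotators' 7-point preference ratings into one label. The encoding is an assumption for illustration (ratings in −3..+3, where negative favors response 1, positive favors response 2, and 0 is a tie); the released data and the paper's actual pipeline may aggregate differently.

```python
from statistics import mean

def aggregate_preference(ratings: list[int]) -> int:
    """Collapse per-annotator 7-point preferences (-3..+3) into one label.

    Negative values favor response 1, positive values favor response 2,
    and 0 is a tie. This sketch simply rounds the mean rating to the
    nearest integer; the actual HelpSteer3 pipeline may differ.
    """
    return round(mean(ratings))

# Three annotators mildly-to-strongly prefer response 2.
print(aggregate_preference([1, 2, 3]))   # 2
# Disagreement that cancels out yields a tie.
print(aggregate_preference([-1, 0, 1]))  # 0
```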

Key Designs

  1. Stratified Annotator Expertise:

    • General: everyday scenarios; STEM: relevant degree and work experience; Code: software engineering background for code quality evaluation; Multilingual: language fluency to verify correct target-language usage.
    • Weighted Cohen's \(\kappa\) reaches 0.890 (strong agreement); positional bias is negligible (mean preference shift \(-0.003\)).
  2. Broad Coverage:

    • Code (8,857 samples): 14 programming languages, Python at 38.2%.
    • Multilingual (8,063 samples): 13 natural languages, Chinese at 30.2%.
    • STEM (4,918 samples): high-difficulty scientific problems.
    • General (18,638 samples): general-purpose scenarios.
  3. Multi-Level Quality Control:

    • 3–5 independent annotators → retain the 3 most consistent → remove invalid entries and filter outliers.
    • Bias detection: positional bias \(< 0.003\), standard deviation \(1.950\).
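The "retain the 3 most consistent annotators" step above can be sketched as a small selection problem: among all trios of annotators, keep the one with the smallest total pairwise disagreement. The function names, data layout, and disagreement metric (mean absolute rating difference) are illustrative assumptions, not the paper's exact procedure.

```python
from itertools import combinations

def pairwise_disagreement(a: list[int], b: list[int]) -> float:
    """Mean absolute rating difference between two annotators."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def most_consistent_trio(annotators: dict[str, list[int]]) -> tuple[str, ...]:
    """Pick the 3 annotators whose ratings disagree least with each other."""
    return min(
        combinations(annotators, 3),
        key=lambda trio: sum(
            pairwise_disagreement(annotators[p], annotators[q])
            for p, q in combinations(trio, 2)
        ),
    )

# Hypothetical ratings from 4 annotators on the same 4 items (-3..+3 scale).
ratings = {
    "a1": [2, -1, 3, 0],
    "a2": [2, -1, 2, 0],
    "a3": [-3, 3, -2, 1],  # outlier annotator
    "a4": [1, -1, 3, 0],
}
print(most_consistent_trio(ratings))  # ('a1', 'a2', 'a4')
```

Agreement statistics such as the weighted Cohen's \(\kappa\) reported above would then be computed on the retained ratings only.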

Key Experimental Results

Main Results

| Dataset / RM | RM-Bench Overall | RM-Bench Hard | JudgeBench |
|---|---|---|---|
| Trained on HelpSteer2 | ~72% | ~61% | ~63% |
| English RM (Gen+STEM+Code) | 79.9% | 71.1% | 73.7% |
| Multilingual RM | 82.4% | 80.0% | 69.4% |
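Reward models on pairwise preference data like this are commonly trained with a Bradley–Terry objective; a hedged sketch of that loss (a standard choice, not necessarily the paper's exact formulation) is:

\[
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_c,\, y_r)}\left[\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)\right]
\]

where \(x\) is the prompt, \(y_c\) and \(y_r\) are the chosen and rejected responses, \(r_\theta\) is the scalar reward model, and \(\sigma\) is the sigmoid.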

Dataset Statistics Comparison

| Metric | HelpSteer3 | HelpSteer2 |
|---|---|---|
| Total Samples | 40,476 | 9,125 |
| Avg. Context Length | 2,638 | 711 |
| Cohen's \(\kappa\) | 0.890 | 0.878 |

Key Findings

  • RM-Bench Hard improves from ~56% to 80.0% (+24 points); high-quality data yields the largest gains on difficult samples.
  • A 4.4× scale-up with substantially expanded diversity is achieved while maintaining high inter-annotator agreement (\(\kappa = 0.890\)).
  • The CC-BY-4.0 license enables unrestricted commercial use.

Highlights & Insights

  • First systematic treatment of diversity: STEM + Code + Multilingual coverage moves preference data beyond the English-only norm of prior datasets.
  • Quality and scale achieved simultaneously: a 4× expansion while maintaining \(\kappa > 0.89\).
  • +10% on RM-Bench demonstrates that data quality is the critical bottleneck for reward model performance.

Limitations & Future Work

  • Multilingual distribution is uneven (Chinese at 30%; some languages are underrepresented).
  • Coverage is text-only; multimodal preferences are not addressed.
  • Annotation of certain subjective tasks may remain contentious.

Comparison with Prior Datasets

  • vs. HelpSteer2: 4.4× larger with multi-domain and multilingual expansion.
  • vs. UltraFeedback: human annotation vs. GPT-4 annotation, with no licensing restrictions.
  • vs. Skywork-Preference: higher annotation quality (\(\kappa = 0.890\)) with broader dimensional coverage.

Rating

  • Novelty: ⭐⭐⭐⭐ First systematically constructed multi-dimensional preference dataset.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple RM training runs, multi-benchmark evaluation, and ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Data construction pipeline described in thorough detail.
  • Value: ⭐⭐⭐⭐⭐ CC-BY-4.0 open release with a +10% RM improvement; directly applicable to industrial settings.