HelpSteer3: Human-Annotated Feedback and Edit Data to Empower Inference-Time Scaling¶

Conference: ACL 2025
arXiv: 2503.04378
Code: Available
Institution: NVIDIA
Area: Others
Keywords: Inference-time scaling, human feedback, feedback-edit model, open-domain tasks, RLHF, dataset

TL;DR¶

NVIDIA releases the HelpSteer3 dataset (annotated by over 7,000 annotators from over 80 countries) to train specialized Feedback and Edit models. At inference time, these models form an "initial response \(\rightarrow\) feedback \(\rightarrow\) edit" loop that enables inference-time scaling for open-domain general tasks. Built on 70B models from the Llama 3 family, the method scores 92.7 on Arena Hard, ahead of OpenAI o1-preview (90.4) and DeepSeek R1 (92.3).

Background & Motivation¶

Limitations of Inference-Time Scaling: Current mainstream inference-time scaling techniques (such as DeepSeek R1 and OpenAI o1) rely on the "thinking before answering" paradigm. This paradigm requires task answers to be verifiable (e.g., mathematics, coding, logical reasoning), so it does not generalize to open-domain general tasks.
Insufficient Feedback Formats in RLHF: Traditional RLHF primarily uses pairwise preferences (A vs B) or fixed-dimension scoring (e.g., accuracy, creativity), lacking natural language feedback that specifically points out "what is wrong and how to improve".
Limited Effectiveness of Self-Feedback: Directly prompting instruction models for self-feedback and self-editing might work on simple tasks, but shows almost no improvement or even degradation on highly difficult tasks (such as complex programming problems in Arena Hard).
Inspiration from Human Editing Paradigms: When writing papers, coding, or making major decisions, humans follow a cycle of "first draft \(\rightarrow\) soliciting feedback \(\rightarrow\) revising." This rich feedback-improvement mechanism has not yet been fully utilized by LLMs.
Core Problem: Can specialized models be trained to mimic human feedback and editing capabilities, allowing open-domain tasks to also achieve effective inference-time scaling?

Method¶

Overall Architecture: Feedback-Edit Inference-Time Scaling System¶

The system consists of three independent models:
1. Initial Response Model (e.g., Llama-3.1-Nemotron-70B-Instruct): Generates the initial response.
2. Feedback Model: Generates detailed natural language feedback for the initial response, pointing out shortcomings and suggesting directions for improvement.
3. Edit Model: Edits and improves the initial response based on the feedback.
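The three-model flow above can be sketched as a single pass through the loop. This is a minimal illustration, not the authors' implementation: the function `feedback_edit_once` and the toy stand-in models are hypothetical, and each model is assumed to be a plain text-to-text callable.

```python
from typing import Callable

# Hypothetical interfaces (assumptions for illustration only):
GenerateFn = Callable[[str], str]            # prompt -> initial response
FeedbackFn = Callable[[str, str], str]       # (prompt, response) -> feedback
EditFn = Callable[[str, str, str], str]      # (prompt, response, feedback) -> edited response


def feedback_edit_once(prompt: str, generate: GenerateFn,
                       give_feedback: FeedbackFn, edit: EditFn) -> str:
    """Run one initial response -> feedback -> edit pass."""
    draft = generate(prompt)
    critique = give_feedback(prompt, draft)
    return edit(prompt, draft, critique)


# Toy stand-ins so the sketch runs without any model.
def toy_generate(prompt: str) -> str:
    return "draft answer"


def toy_feedback(prompt: str, response: str) -> str:
    return "The response is partially helpful. Add an example."


def toy_edit(prompt: str, response: str, feedback: str) -> str:
    return response + " (revised per feedback)"


final = feedback_edit_once("Explain recursion.", toy_generate, toy_feedback, toy_edit)
print(final)
```

Keeping the three roles as separate callables mirrors the point that the models are independent and can be swapped or deployed separately.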

Dataset Construction: HelpSteer3¶

Data Collection Process¶

Prompt Sources: Sampled from ShareGPT and WildChat, covering four major categories: General, STEM, Coding, and Multilingual.
Response Generation: Responses are generated using 16+ different models (including Nemotron 340B, Mistral Large 2, Gemma 2, etc.), deliberately incorporating models with varying capability levels to enhance generalization.
Feedback Annotation: Over 7,000 annotators provide 3-5 pieces of natural language feedback (50-250 words) for each response, starting with "The response is {not/slightly/partially/mostly/perfectly} helpful", focusing on overall helpfulness evaluation.
Response Editing: The collected feedback is passed to a separate pool of annotators, who edit the response using only the feedback from the three most consistent annotators.
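The feedback format described above (a fixed helpfulness prefix and a 50-250 word length) lends itself to a simple check. The sketch below is an assumption about how such a check might look; `check_feedback` and its return fields are hypothetical and not part of the released dataset tooling.

```python
import re

# Prefix from the annotation guideline quoted above.
_PREFIX = re.compile(
    r"^The response is (not|slightly|partially|mostly|perfectly) helpful\b"
)


def check_feedback(text: str, min_words: int = 50, max_words: int = 250) -> dict:
    """Extract the helpfulness level and check the word-count range."""
    match = _PREFIX.match(text.strip())
    n_words = len(text.split())
    return {
        "level": match.group(1) if match else None,  # None if the prefix is absent
        "n_words": n_words,
        "length_ok": min_words <= n_words <= max_words,
    }


good = check_feedback("The response is mostly helpful. " + "word " * 60)
bad = check_feedback("Looks fine.")
```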

Three Training Datasets¶

Feedback Demonstration (81,642 samples): Teaches the model how to generate feedback.
Edit Demonstration (14,461 samples): Teaches the model how to edit responses based on feedback, including all permutations of feedback to learn order-invariance.
Edit Preference (3,274 pairs): Distinguishes between good and bad edits (such as edits that do not follow the feedback or that simply copy the original response).
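The permutation idea in the Edit Demonstration set can be illustrated as follows. With three pieces of feedback per response, all orderings yield \(3! = 6\) training samples that share the same edited target. The record layout and the function name `expand_edit_samples` are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Sequence


def expand_edit_samples(prompt: str, response: str,
                        feedbacks: Sequence[str], edited: str) -> List[Dict]:
    """Create one sample per ordering of the feedback; the target never changes."""
    return [
        {"prompt": prompt, "response": response,
         "feedback": list(order), "target": edited}
        for order in permutations(feedbacks)
    ]


samples = expand_edit_samples(
    "Explain recursion.",
    "draft answer",
    ["feedback A", "feedback B", "feedback C"],
    "edited answer",
)
print(len(samples))
```

Because every ordering maps to the same target, the model is pushed to treat the feedback as a set rather than a sequence.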

Model Training¶

Initialized with Llama-3.3-70B-Instruct:
- Feedback SFT: Fine-tuned on the Feedback Demonstration dataset for 1 epoch.
- Edit SFT: Fine-tuned on the Edit Demonstration dataset for 1 epoch.
- Edit RM: Trained a Bradley-Terry Reward Model on the Edit Preference dataset, designed such that each batch contains both (bad edit, good edit) and (no edit, good edit) pairs.
- Edit RL: Further optimized the Edit model using REINFORCE Leave One Out (RLOO) guided by the Edit RM. RL training addresses the issue where the SFT model had an approximately 30% probability of directly copying the original response.
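To make the two training objectives concrete, the sketch below shows the standard Bradley-Terry pairwise loss, \(-\log \sigma(r_{\text{good}} - r_{\text{bad}})\), and the standard leave-one-out baseline used by RLOO. These are textbook forms operating on plain scalar rewards, not the authors' training code; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def bradley_terry_loss(reward_good: float, reward_bad: float) -> float:
    """-log(sigmoid(reward_good - reward_bad)), computed in a stable form."""
    margin = reward_good - reward_bad
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def rloo_advantages(rewards: Sequence[float]) -> List[float]:
    """Advantage of each sample = its reward minus the mean reward of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


loss_tied = bradley_terry_loss(0.0, 0.0)    # equal rewards: loss is log(2)
advantages = rloo_advantages([1.0, 2.0, 3.0])
```

In this framing, an edit that merely copies the original response would receive a low reward from the Edit RM (trained on (no edit, good edit) pairs) and therefore a negative advantage relative to the other samples.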

Multi-Dimensional Inference-Time Scaling¶

Four scalable dimensions:
- Number of Initial Responses: Multiple initial responses are generated for each prompt (Best-of-N) and selected via a reward model.
- Number of Effective Feedbacks: More feedback is generated and re-ranked based on constructive criticism keywords to filter out effective feedback.
- Number of Edited Responses: Multiple edited versions are generated for the same set of feedback, and the one with the highest reward is selected.
- Multi-Dimensional Joint Scaling: Scales multiple dimensions simultaneously to achieve optimal performance.
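Two of the selection steps above, keyword-based feedback re-ranking and reward-based selection, can be sketched as below. The keyword list is an assumption made for illustration (the summary does not list the actual keywords), and `len` stands in for a reward model.

```python
import re
from typing import Callable, List, Sequence

# Hypothetical keyword list; the actual keywords are not given in this summary.
CRITIQUE_KEYWORDS = frozenset(
    {"however", "but", "should", "could", "missing", "incorrect"}
)


def keyword_score(feedback: str) -> int:
    """Count words in the feedback that appear in the keyword list."""
    words = re.findall(r"[a-z]+", feedback.lower())
    return sum(1 for w in words if w in CRITIQUE_KEYWORDS)


def top_feedback(candidates: Sequence[str], k: int) -> List[str]:
    """Keep the k feedback strings with the most constructive-criticism keywords."""
    return sorted(candidates, key=keyword_score, reverse=True)[:k]


def best_of(candidates: Sequence[str], reward: Callable[[str], float]) -> str:
    """Select the candidate with the highest reward (Best-of-N)."""
    return max(candidates, key=reward)


feedback_pool = [
    "The response is perfectly helpful. Great job.",
    "The response is partially helpful. However, it is missing an example "
    "and should cite sources.",
]
effective = top_feedback(feedback_pool, k=1)
chosen = best_of(["short edit", "a longer, more detailed edit"], reward=len)
```

The same `best_of` helper applies to both initial responses and edited responses; only the candidate pool changes.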

Key Experimental Results¶

Experimental Setup¶

Evaluation Metrics: AlpacaEval 2.0 LC (Easy), GPT-4-Turbo MT Bench (Medium), Arena Hard (Hard).
Base Models: Llama-3.1-Nemotron-70B-Instruct, Llama-3.3-70B-Instruct.
External Baselines: Llama-3.1-405B-Instruct, Claude-3.5-Sonnet, GPT-4o, OpenAI o1-preview, DeepSeek R1.

Main Results¶

Model	MT Bench	AlpacaEval LC	Arena Hard
Nemotron-70B-Instruct	8.98	57.6	85.0
+ Feedback + Edit	9.16	62.8	87.0
Llama-3.3-70B-Instruct	8.29	35.0	62.4
+ Feedback + Edit	9.07	36.9	74.8
GPT-4o-2024-05-13	8.74	57.5	79.3
Claude-3-5-Sonnet	8.81	52.4	79.2

The Feedback-Edit system improves both base models on all three benchmarks; the largest gain is for Llama-3.3-70B-Instruct on Arena Hard (62.4 \(\rightarrow\) 74.8). The accompanying increase in response length remains controllable.

Ablation Study¶

Setting	MT Bench	AlpacaEval LC	Arena Hard
Nemotron-70B Baseline	8.98	57.6	85.0
+ Self-Feedback + Self-Edit	9.11	64.6	84.6 \(\downarrow\)
+ Feedback + Self-Edit	8.94	66.2	85.4
+ Feedback + Edit w/o RL	9.12	64.4	86.4
+ Edit w/o Feedback	9.14	67.4	84.5 \(\downarrow\)
+ Feedback + Edit (Full)	9.16	62.8	87.0

Key Findings:
- Self-Feedback helps on simple tasks but degrades performance on hard ones (Arena Hard: 84.6 vs 85.0), which supports the case for training specialized models.
- Without RL, the Edit model copies the original response unchanged with a probability of \(\sim 30\%\).
- Removing Feedback drops Arena Hard performance below the baseline (84.5 vs 85.0), indicating that feedback is critical for hard tasks.
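As a quick check on the findings above, the Arena Hard deltas relative to the Nemotron-70B baseline can be recomputed directly from the ablation table. The scores are copied from the table; the variable names are illustrative.

```python
# Arena Hard scores from the ablation table.
BASELINE = 85.0
ARENA_HARD = {
    "Self-Feedback + Self-Edit": 84.6,
    "Feedback + Self-Edit": 85.4,
    "Feedback + Edit w/o RL": 86.4,
    "Edit w/o Feedback": 84.5,
    "Feedback + Edit (Full)": 87.0,
}

# Round to one decimal to avoid floating-point noise in the differences.
deltas = {name: round(score - BASELINE, 1) for name, score in ARENA_HARD.items()}
below_baseline = sorted(name for name, d in deltas.items() if d < 0)

for name, d in deltas.items():
    print(f"{name}: {d:+.1f}")
```

Only the two settings that lack a trained Feedback model fall below the baseline; the full system gains 2.0 points.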

Inference-Time Scaling Performance¶

Optimal configuration (8 initial responses \(\times\) 16 effective feedbacks + Nemotron-70B-Select selector):
- Arena Hard: 92.7, outperforming OpenAI o1-preview (90.4) and DeepSeek R1 (92.3).
- It only requires approximately \(16 \times\) token generation (equivalent to Best-of-16), but performs significantly better than pure Best-of-N (88.5).

Distillation Experiments¶

Model	AlpacaEval LC	Arena Hard
Llama-3.1-8B + Distill	41.5	55.5
Llama-3.3-70B + Distill	61.6	88.8
Nemotron-70B + Distill	61.3	88.4

Distilled data can significantly boost the zero-shot performance of base models (Llama-3.3-70B Arena Hard: 62.4 \(\rightarrow\) 88.8), making it suitable for latency-sensitive scenarios.

Highlights & Insights¶

Novel Inference-Time Scaling Paradigm: It systematizes the "feedback-edit" human collaboration mode into an LLM inference-time scaling method, making it the first inference-time scaling scheme to achieve SOTA on open-domain general tasks.
Large-Scale, High-Quality Dataset: 7,000+ annotators, 80+ countries, 14 programming languages, and 13 natural languages; data is open-sourced under CC-BY-4.0.
Decomposable System Deployment: The Feedback and Edit models can be deployed on separate computing resources, and sampling can be parallelized. Total latency is therefore only about \(2 \times\) that of greedy generation, much lower than methods like DeepSeek R1 that must sequentially generate a large number of thinking tokens.
Rigorous Ablation Design: Through control experiments such as Self-Feedback, Self-Edit, removing RL, and removing Feedback, the contribution of each component is clearly quantified.
Feasibility of Distillation: It validates that the data generated by the feedback-edit system can be used for distillation, catering to different latency requirements.

Limitations & Future Work¶

Computational Cost: The optimal configuration requires generating a large amount of feedback alongside re-ranking and filtering. There is still room for optimization in sampling and selection processes (e.g., constrained decoding to reduce ineffective feedback).
Data Timeliness: Prompts originate from ShareGPT/WildChat from 2023-2024, which may not represent current, highly complex user queries.
Response Length Limitations: Prompts requiring response lengths of 2000+ words or inputs of 4000+ words were skipped, limiting applicability in long-text scenarios.
Validated Only at the 70B Scale: The effectiveness of the complete Feedback-Edit system has not been explored on larger (e.g., 405B) or smaller (except for distillation to 8B) models.
Small Size of Edit Preference Data (3,274 pairs): This may limit the generalization capability of the Edit RM, and it only covers General/STEM subsets.

Related Work¶

RLHF and Preference Modeling: Ouyang et al. (2022), HelpSteer2 (Wang et al., 2024), UltraFeedback (Cui et al., 2023), etc., use scores or preference pairs, whereas this work extends feedback to natural language.
Inference-Time Scaling (Thinking Paradigm): OpenAI o1, DeepSeek R1, QwQ, etc., scale by training models to generate chains-of-thought, but are limited to verifiable tasks.
Self-Correction and Self-Improvement: Self-Refine (Madaan et al., 2023), Self-Debug (Chen et al., 2023). This work shows that self-correction by general-purpose models is ineffective on difficult tasks.
Critique Models: CritiqueLLM (Ke et al., 2024), Critique-out-Loud (Ankner et al., 2024), Shepherd (Wang et al., 2023), etc., train critic models, whereas this work goes a step further and adds an editing component to close the loop.
Aligner (Ji et al., 2024): Trains feedback-free edit models, whereas this work demonstrates that feedback-guided editing significantly outperforms feedback-free editing on difficult tasks.

Rating¶

⭐⭐⭐⭐: This work establishes a new paradigm for inference-time scaling in open-domain general tasks, backed by a large dataset, a solid experimental design, and clear ablation analyses. The outperformance over o1-preview and DeepSeek R1 on Arena Hard is compelling. The main limitations are the computational cost, which still needs optimization, and the room left to scale the data further.