WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback

Conference: ACL2026
arXiv: 2408.15549
Code: No public training code; Dataset: https://huggingface.co/datasets/microsoft/WildFeedback
Area: LLM Alignment / User Feedback / Preference Learning
Keywords: In-situ user feedback, Preference data construction, SAT/DSAT, DPO, checklist evaluation

TL;DR

WildFeedback automatically identifies satisfaction and dissatisfaction signals in real multi-turn user-ChatGPT conversations and converts these naturally occurring preferences into preference training samples and per-sample checklist evaluation criteria. Small open-source instruction models trained on this data align more closely with real user needs than the same models trained on UltraFeedback, on both general benchmarks and real-user preference tests.

Background & Motivation

Background: LLM alignment typically relies on two types of data: human-annotated preference data or synthetic preference data generated/judged by strong models like GPT-4. The former is costly, subjective, and limited in scale; the latter is cheap and scalable but risks recursively injecting the strong model's own preferences and biases into the student model.

Limitations of Prior Work: Real-world users naturally express feedback within products, such as "Thanks, this is exactly what I wanted" or "No, please rewrite it; you ignored my requirement." These signals are closer to actual usage scenarios than offline annotations, yet they are not structured as thumbs-up/down and are often scattered across multi-turn contexts. Simply using the responses that triggered feedback for training is insufficient because negative feedback only indicates the old response was poor; a better response satisfying that specific user's preference must also be constructed.

Key Challenge: Alignment requires real user preferences, which are inherently noisy, implicit, and context-dependent. Relying solely on static annotation sets lacks scale and authenticity, while relying only on model self-evaluation weakens the diversity of human preferences.

Goal: The authors aim to address three sub-problems: first, detecting which user utterances in real multi-turn conversations contain satisfaction or dissatisfaction signals; second, converting these signals into preferred-dispreferred response pairs for SFT/DPO; third, constructing an evaluation method where automatic assessment is based on the actual preferences expressed by the user in that sample rather than just asking GPT-4 "which is better."

Key Insight: The key observation is that although user feedback lacks explicit labels, it often follows interpretable linguistic patterns in sessions. By summarizing these patterns into SAT/DSAT rubrics and having GPT-4 detect and summarize preferences based on these rubrics, "wild feedback" can be transformed into structured preference data.

Core Idea: Replace offline human/model preference annotations with in-situ user feedback from real multi-turn conversations, and use sample-level user preference checklists to guide both preference sample construction and model evaluation.

Method

WildFeedback does not propose a new alignment loss but rather a data pipeline from real user interactions to preference training and evaluation. The input is a batch of multi-turn user-LLM conversations, and the output is a preference dataset containing prompts, user preference descriptions, preferred responses, dispreferred responses, and a held-out benchmark evaluated via user preference checklists.

Overall Architecture

The workflow consists of four steps. First, it identifies user satisfaction (SAT) and dissatisfaction (DSAT) signals turn-by-turn within 148,715 multi-turn ChatGPT sessions from WildChat. Second, for sessions containing feedback, it extracts the full conversation history prior to the feedback as the prompt and summarizes the user's subsequent feedback into natural language preferences. Third, it constructs response pairs based on these preferences: the original response triggering a DSAT serves as the dispreferred response, while the preferred response is generated by GPT-4 or the current policy model guided by user preference prompts. Fourth, it uses the generated WildFeedback data to perform one round of SFT followed by one round of DPO on instruction models like Phi-3, LLaMA-3, and Qwen-2, evaluating them on general benchmarks and a user preference checklist benchmark.
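
Steps two and three can be illustrated with a minimal sketch. The class and function names below (`Turn`, `PreferenceSample`, `build_dsat_sample`) are hypothetical, and the choice to cut the prompt just before the response that triggered the DSAT is an assumption about how "history prior to the feedback" is truncated; it is not the authors' released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


@dataclass
class PreferenceSample:
    prompt: List[Turn]   # conversation history used as the prompt
    preference: str      # natural-language summary of the user's feedback
    preferred: str       # response written to satisfy the preference
    dispreferred: str    # original response that triggered the DSAT


def build_dsat_sample(turns: List[Turn], feedback_idx: int,
                      preference: str, preferred: str) -> PreferenceSample:
    """Assemble one sample from a session whose turn `feedback_idx` is a DSAT utterance."""
    if feedback_idx < 2 or turns[feedback_idx].role != "user":
        raise ValueError("feedback must be a user turn that follows a request and a response")
    if turns[feedback_idx - 1].role != "assistant":
        raise ValueError("the turn before the feedback must be an assistant response")
    # Assumption: the prompt is everything up to, but excluding, the criticized response.
    return PreferenceSample(
        prompt=turns[: feedback_idx - 1],
        preference=preference,
        preferred=preferred,
        dispreferred=turns[feedback_idx - 1].text,
    )


session = [
    Turn("user", "Write a haiku about rain."),
    Turn("assistant", "Here is a twelve-line poem about rain ..."),
    Turn("user", "No, please rewrite it; you ignored my requirement."),
]
sample = build_dsat_sample(session, 2, "Wants a three-line haiku", "Soft rain on the roof ...")
```

In the GPT-4 expert setting described above, the `preferred` string would come from GPT-4 prompted with the summarized preference; here it is passed in directly to keep the sketch self-contained.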

The paper also incorporates evaluation into the framework. Traditional AlpacaEval/MT-Bench uses GPT-4 for general quality judgments, but here each sample has preferences summarized from real feedback, such as "be more concise," "requires factual correction," or "do not ignore formatting." During evaluation, these preferences are provided to GPT-4 as a checklist for pairwise comparison, reducing the misalignment between the judge and real user preferences.
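
As a rough illustration of checklist-guided pairwise judging, the sketch below builds a judge prompt that lists the summarized preferences before the two responses. The prompt wording, the function names, and the `Winner: A` / `Winner: B` output convention are all assumptions made for this example; the paper's actual judge prompt is not reproduced here.

```python
from typing import Optional, Sequence


def build_checklist_judge_prompt(history: str, response_a: str, response_b: str,
                                 checklist: Sequence[str]) -> str:
    """Return a pairwise-comparison prompt that puts the user's preferences up front."""
    if not checklist:
        raise ValueError("checklist must contain at least one preference")
    items = "\n".join(f"- {item}" for item in checklist)
    return (
        "You are comparing two assistant responses to the same conversation.\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"User preferences to check (taken from the user's own feedback):\n{items}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "For each preference, state which response satisfies it better, "
        "then end with a single line: 'Winner: A' or 'Winner: B'."
    )


def parse_winner(judge_output: str) -> Optional[str]:
    """Read the final verdict line; return None when no verdict is found."""
    for line in reversed(judge_output.strip().splitlines()):
        line = line.strip()
        if line == "Winner: A":
            return "A"
        if line == "Winner: B":
            return "B"
    return None


prompt = build_checklist_judge_prompt(
    "User: Summarize this article.",
    "A long, detailed summary ...",
    "A two-sentence summary.",
    ["be more concise", "do not ignore formatting"],
)
```

Returning `None` for an unparseable verdict lets the caller decide whether to retry or drop the comparison instead of silently counting it for one side.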

Key Designs

  1. SAT/DSAT Feedback Signal Identification:

    • Function: Identify whether a user is expressing satisfaction or dissatisfaction from natural multi-turn dialogues and locate specific feedback utterances.
    • Mechanism: The method adapts the user satisfaction estimation ideas from SPUR, using 9 categories of SAT rubrics and 9 categories of DSAT rubrics. SAT includes thanks, learning, compliance, praise, personal details, humor, confirmation, positive closing, and narrowing down; DSAT includes negative feedback, request for modification, factual error, unrealistic expectation, no further interaction, being ignored, low quality, insufficient detail, and style issues. GPT-4o classifies utterances based on these rubrics.
    • Design Motivation: Real user feedback is not always an explicit button but is often reflected in the language of the next turn. Categorizing feedback into interpretable rubrics avoids using vague emotional words as training signals and facilitates analysis of why users are satisfied or dissatisfied.
  2. Constructing Semi-synthetic Preference Pairs from Feedback:

    • Function: Convert original sessions into (prompt, preferred, dispreferred) preference samples usable for DPO.
    • Mechanism: For sessions with SAT/DSAT, the system first has GPT-4 summarize user preferences, then truncates the conversation history before the feedback as the prompt. For the GPT-4 expert version, the original DSAT-triggering response is the dispreferred response, and GPT-4 generates the preferred response under preference and safety constraints. For the on-policy version, Phi-3, Qwen-2, and LLaMA-3 generate their own preferred/dispreferred responses, where preferred generation is guided by user preference system prompts.
    • Design Motivation: This step transforms "user dissatisfaction" from a weak supervision signal into trainable preference pairs. Compared to data like UltraFeedback where GPT-4 provides uniform scores, WildFeedback prompts and preferences originate from real human-computer interactions and preserve multi-turn context.
  3. Checklist-guided Evaluation & Filtering:

    • Function: Use sample-level user preference constraints for evaluation and data filtering to prevent automatic judges from scoring based only on general preferences.
    • Mechanism: Instead of merely comparing two responses, the judge (GPT-4) is given the summarized user preferences as a checklist and required to perform pairwise judgment accordingly. During on-policy data construction, if a generated preferred response fails to beat the dispreferred response under checklist evaluation, the sample is filtered out; the GPT-4 expert version is largely retained due to its stability.
    • Design Motivation: LLM-as-a-judge tends to favor long responses or its own style, which may deviate from real user preferences. Adding a checklist shifts the evaluation criteria from "generalized good response" to "what this specific user wanted in this specific task."
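
The rubric labels and the on-policy filtering rule can be sketched as follows. The rubric names are taken from the list above; everything else (`Pair`, `is_valid_label`, `filter_on_policy_pairs`, and the toy judge that stands in for a GPT-4 call) is hypothetical and only meant to show the control flow.

```python
from typing import Callable, List, NamedTuple

SAT_RUBRICS = (
    "thanks", "learning", "compliance", "praise", "personal details",
    "humor", "confirmation", "positive closing", "narrowing down",
)
DSAT_RUBRICS = (
    "negative feedback", "request for modification", "factual error",
    "unrealistic expectation", "no further interaction", "being ignored",
    "low quality", "insufficient detail", "style issues",
)


class Pair(NamedTuple):
    prompt: str
    preference: str    # summarized user preference, used as the checklist
    preferred: str     # candidate preferred response
    dispreferred: str


# A judge returns True when the candidate preferred response wins under the checklist.
Judge = Callable[[Pair], bool]


def is_valid_label(polarity: str, rubric: str) -> bool:
    """Check that a classifier's (polarity, rubric) output names a known rubric."""
    if polarity == "SAT":
        return rubric in SAT_RUBRICS
    if polarity == "DSAT":
        return rubric in DSAT_RUBRICS
    return False


def filter_on_policy_pairs(pairs: List[Pair], judge: Judge) -> List[Pair]:
    """Keep only pairs whose preferred response beats the dispreferred one."""
    return [pair for pair in pairs if judge(pair)]


def toy_concise_judge(pair: Pair) -> bool:
    # Stand-in for an LLM judge: for a "concise" preference, the shorter response wins.
    return len(pair.preferred) < len(pair.dispreferred)


pairs = [
    Pair("Explain DPO.", "be more concise", "Short answer.", "A very long answer ..."),
    Pair("Explain SFT.", "be more concise", "An even longer rambling answer ...", "Brief."),
]
kept = filter_on_policy_pairs(pairs, toy_concise_judge)
```

In practice the judge would wrap a checklist-conditioned model call; passing it in as a callable keeps the filtering logic testable without any API access.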

Loss & Training

Training does not modify the DPO objective but integrates WildFeedback data into a standard alignment pipeline: each base model undergoes 1 epoch of SFT on preferred responses, followed by 1 epoch of DPO on full preference pairs. Experiments cover three open-source instruction models: Phi-3-mini-4k-instruct, Meta-Llama-3-8B-Instruct, and Qwen2-7B-Instruct, comparing five settings: original instruction model, WF GPT-4, WF On-policy, UF GPT-4, and UF On-policy.
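
For reference, the standard DPO objective that this pipeline leaves unchanged can be written for a single preference pair as below. The sketch works on already-summed sequence log-probabilities; the default `beta=0.1` is an assumption chosen for illustration, since the summary does not state the value used in the paper.

```python
import math


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred w, dispreferred l) pair.

    loss = -log sigmoid(beta * [(log pi(w) - log ref(w)) - (log pi(l) - log ref(l))])
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero and the loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the preferred response's log-probability lowers the loss.
loss_after_update = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

In the two-stage recipe described above, the SFT stage would first fit the preferred responses, and the resulting model would typically serve as the reference policy for this loss; that detail is a common convention and is assumed here, not confirmed by the summary.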

The held-out test set is built so that it does not reward overfitting to individual users. The authors used FAISS to cluster user prompts and summarized preferences into 70 groups and selected the 10 samples closest to each cluster center (700 candidates), then deduplicated and filtered out meaningless tasks to obtain 540 held-out samples. This ensures evaluation reflects the "mainstream preferences of most users in similar tasks" rather than a few idiosyncratic individual preferences.
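
The "closest to each cluster center" selection can be sketched in plain Python. The example assumes embeddings, cluster assignments, and centroids have already been computed (for instance by a k-means step such as the FAISS clustering mentioned above); the function name `select_near_centroids` and the use of Euclidean distance are assumptions for this illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def select_near_centroids(embeddings: Sequence[Sequence[float]],
                          assignments: Sequence[int],
                          centroids: Sequence[Sequence[float]],
                          per_cluster: int = 10) -> List[int]:
    """Return indices of the `per_cluster` samples nearest to each cluster center."""
    by_cluster: Dict[int, List[Tuple[float, int]]] = {}
    for idx, (vec, cluster) in enumerate(zip(embeddings, assignments)):
        distance = math.dist(vec, centroids[cluster])  # Euclidean distance
        by_cluster.setdefault(cluster, []).append((distance, idx))
    selected: List[int] = []
    for cluster in sorted(by_cluster):
        nearest = sorted(by_cluster[cluster])[:per_cluster]
        selected.extend(idx for _, idx in nearest)
    return selected


embeddings = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [0.1, 0.0]]
assignments = [0, 0, 1, 1, 0]
centroids = [[0.0, 0.0], [5.0, 5.0]]
picked = select_near_centroids(embeddings, assignments, centroids, per_cluster=2)
```

With 70 clusters and `per_cluster=10` this yields at most 700 candidates, which matches the pool that is then deduplicated and filtered down to 540 samples.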

Key Experimental Results

Main Results

WildFeedback demonstrates it can extract a usable scale of feedback data from real sessions. Out of 148,715 multi-turn sessions in WildChat, approximately 12.8% contain feedback signals; 20,281 GPT-4 version preference samples and several on-policy versions were eventually constructed.

| Data/Metric | SAT | DSAT | Total (all sessions/utterances) |
|---|---|---|---|
| Sessions with feedback | 5,447 | 13,582 | 148,715 |
| Utterances with feedback | 8,186 | 27,711 | 628,467 |
| GPT-4 vs. Human Consistency | \(\kappa=0.69\) | \(\kappa=0.50\) | Near human level |
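
The agreement figures and the 12.8% feedback share can be made concrete with a small sketch. It assumes the reported \(\kappa\) is Cohen's kappa between GPT-4 labels and human labels, which the summary does not state explicitly; the label lists are invented toy data, while the session counts are the ones reported in this section.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels (not from the paper): does the utterance carry a SAT signal?
gpt4_labels = ["SAT", "SAT", "NONE", "NONE"]
human_labels = ["SAT", "NONE", "NONE", "NONE"]
kappa = cohens_kappa(gpt4_labels, human_labels)

# Share of WildChat sessions with any feedback, from the reported session counts.
feedback_share = (5_447 + 13_582) / 148_715
```

On the toy labels the raw agreement is 75%, but half of that is expected by chance, which is why kappa comes out lower than the raw agreement rate.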

Compared to existing preference datasets, WildFeedback is multi-turn, derived from real in-situ user feedback, and has much longer prompts that more closely resemble actual product interactions.

| Dataset | Samples | Prompt Length | Response Length | Multi-turn? | Feedback Source |
|---|---|---|---|---|---|
| WebGPT | 38,925 | 51 | 188 | No | Human Annotation |
| Anthropic HH | 118,263 | 186 | 95 | No | Human Annotation |
| OASST1 | 35,905 | 168 | 221 | Yes | Human Written |
| UltraFeedback | 61,135 | 159 | 256 | No | GPT-4 |
| Ours GPT-4 | 20,281 | 929 | 440 | Yes | In-situ User Feedback |
| Ours Qwen-2 | 11,509 | 1,057 | 541 | Yes | In-situ User Feedback |
| Ours Phi-3 | 9,194 | 931 | 344 | Yes | In-situ User Feedback |
| Ours LLaMA-3 | 10,659 | 982 | 376 | Yes | In-situ User Feedback |

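On general benchmarks, training with WildFeedback generally outperforms both the original models and the UltraFeedback controls. This is most evident for Phi-3 and LLaMA-3, where WF GPT-4 improves on the base model across AlpacaEval 2, Arena-Hard, and MT-Bench. The one exception against the UltraFeedback control is the raw AlpacaEval 2 win rate, where UF GPT-4 scores slightly higher (38.4 vs. 36.6 for Phi-3, 43.2 vs. 42.8 for LLaMA-3).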

| Model/Training Data | AlpacaEval2 LC | AlpacaEval2 WR | Arena-Hard WR | MT-Bench |
|---|---|---|---|---|
| Phi-3 Base | 24.3 | 17.4 | 15.4 | 7.32 |
| Phi-3 + WF On-policy | 29.0 | 27.1 | 30.1 | 7.42 |
| Phi-3 + UF On-policy | 27.2 | 25.9 | 28.7 | 7.40 |
| Phi-3 + WF GPT-4 | 34.9 | 36.6 | 32.4 | 7.75 |
| Phi-3 + UF GPT-4 | 32.5 | 38.4 | 30.5 | 7.68 |
| LLaMA-3 Base | 22.9 | 22.6 | 20.6 | 7.10 |
| LLaMA-3 + WF GPT-4 | 34.2 | 42.8 | 32.9 | 7.57 |
| LLaMA-3 + UF GPT-4 | 32.2 | 43.2 | 32.6 | 7.49 |
| Qwen-2 Base | 28.7 | 26.0 | 24.9 | 7.55 |
| Qwen-2 + WF On-policy | 42.6 | 34.4 | 36.1 | 8.02 |
| Qwen-2 + UF On-policy | 38.3 | 34.2 | 29.2 | 7.72 |
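
To make the WF vs. UF comparison easier to read, the sketch below subtracts each UltraFeedback row from its matched WildFeedback row. The numbers are copied from the results table above; the dictionary keys and the helper name `wf_minus_uf` are only labels chosen for this example.

```python
METRICS = ("AlpacaEval2 LC", "AlpacaEval2 WR", "Arena-Hard WR", "MT-Bench")

# (WildFeedback row, UltraFeedback row) for each matched setting in the table.
MATCHED = {
    "Phi-3 On-policy": ((29.0, 27.1, 30.1, 7.42), (27.2, 25.9, 28.7, 7.40)),
    "Phi-3 GPT-4": ((34.9, 36.6, 32.4, 7.75), (32.5, 38.4, 30.5, 7.68)),
    "LLaMA-3 GPT-4": ((34.2, 42.8, 32.9, 7.57), (32.2, 43.2, 32.6, 7.49)),
    "Qwen-2 On-policy": ((42.6, 34.4, 36.1, 8.02), (38.3, 34.2, 29.2, 7.72)),
}


def wf_minus_uf(matched):
    """Per-metric difference (WF - UF), rounded to hide floating-point noise."""
    return {
        name: {metric: round(wf - uf, 2) for metric, wf, uf in zip(METRICS, wf_row, uf_row)}
        for name, (wf_row, uf_row) in matched.items()
    }


deltas = wf_minus_uf(MATCHED)
# Settings and metrics where the UltraFeedback control scores higher.
uf_ahead = [(name, metric) for name, row in deltas.items()
            for metric, delta in row.items() if delta < 0]
```

Across these four matched settings, the only cells where UltraFeedback leads are the raw AlpacaEval 2 win rates of the two GPT-4 expert settings; the length-controlled win rate favors WildFeedback in every pair.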

Ablation Study

Rather than traditional module-removal ablation, the paper validates components through data construction versions, evaluation checklists, UltraFeedback comparison, and feedback type analysis.

| Configuration/Analysis | Key Metric | Description |
|---|---|---|
| Pairs without checklist | GPT-4 judge bias | Without a checklist, GPT-4 does not always favor the user-aligned response and is influenced by its general aesthetic. |
| With checklist | Alignment > 70% | For GPT-4 expert data, the checklist refocuses the judge on sample-specific user needs. |
| Small model on-policy preferred | Alignment ~ 50% | Small models are harder to steer, so preference pairs need checklist-based filtering. |
| WildFeedback held-out test | Gain +5.3 points | With the checklist, the win rate of LLaMA-3 + WF GPT-4 against UF GPT-4 rises from 45.5% to 50.8%, indicating WF is more user-aligned. |
| Feedback Distribution | DSAT types | DSAT is concentrated on modification requests and factual errors; SAT is more dispersed. |
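
A pairwise win rate like the 45.5% and 50.8% figures above can be computed from judge verdicts as in the following sketch. Counting a tie as half a win is an assumption made for this example; the summary does not say how ties were handled, and the verdict list is toy data.

```python
from typing import Sequence


def win_rate(verdicts: Sequence[str]) -> float:
    """Win rate in percent from 'win' / 'loss' / 'tie' verdicts (tie = half a win)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    score = 0.0
    for verdict in verdicts:
        if verdict == "win":
            score += 1.0
        elif verdict == "tie":
            score += 0.5  # assumption: ties split evenly between the two models
        elif verdict != "loss":
            raise ValueError(f"unknown verdict: {verdict!r}")
    return 100.0 * score / len(verdicts)


toy_rate = win_rate(["win", "loss", "tie", "win"])

# Reported change for LLaMA-3 + WF GPT-4 vs. UF GPT-4 once the checklist is added,
# expressed in percentage points.
checklist_gain_points = round(50.8 - 45.5, 1)
```

The gain is a difference of two percentages, so it is most clearly reported in percentage points.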

Key Findings

  • The benefit of WildFeedback is not just "more data" but better alignment between training data and actual usage scenarios. Its prompts come from multi-turn sessions and preferences from natural feedback, making its gains on real-user preference benchmarks more interpretable than those of UltraFeedback.
  • The Checklist is the most critical evaluation design. Without it, the GPT-4 judge might select responses based on universal aesthetics; with it, it can better distinguish which response satisfies the actual preference expressed by the user in that session.
  • DSAT sessions outnumber SAT sessions by roughly 2.5 to 1 (13,582 vs. 5,447), indicating a natural selection bias in product data: users are more likely to keep interacting to correct a model when they are dissatisfied. This focuses the data on failure cases but may over-represent negative feedback scenarios in the training distribution.

Highlights & Insights

  • The biggest highlight is elevating "user feedback" from noise in product logs to trainable preference data. While many works assume preferences must be scored by annotators or strong models, WildFeedback shows that next-turn user reactions are a form of supervision.
  • Checklist-guided evaluation is highly transferable to personalized agents, recommendation dialogues, and customer service. Once "what the user wants" is summarized from behavior or text, evaluation shifts from general quality scores to sample-level goal achievement.
  • The analysis of feedback types is valuable: dissatisfaction often stems from factual errors and modification requests, while satisfaction is more fragmented. This suggests actual system optimization should prioritize fixing "hard errors" over merely achieving a more pleasant tone.
  • The "semi-synthetic" strategy is pragmatic: users provide the real preference, and strong models complete the preferred response. It acknowledges the necessity of model-generated data while restricting it within the constraints of real user preferences.

Limitations & Future Work

  • In-situ feedback may be malicious, dangerous, or unreasonable. The authors use safety prompts and OpenAI moderation filtering, which is only a baseline; a more systematic distinction between "genuine preferences" and "preferences that should not be learned" is needed.
  • Selection bias exists. Users leave feedback more often when dissatisfied, so WildFeedback may over-represent correction, rewriting, and complaining scenarios while under-representing silent but satisfied users.
  • Evaluation still relies on GPT-4o as a judge. While checklists mitigate bias, they do not eliminate systematic issues of LLM-as-a-judge, especially when GPT-4 also summarizes the checklist.
  • On-policy small models have weak controllability regarding user preferences; nearly half of the preferred responses may not actually align with preferences. Rejection sampling or multi-model cross-verification could improve data quality.

Related Work & Insights

  • vs UltraFeedback: UltraFeedback scores offline prompt-response pairs via GPT-4, offering scale and reproducibility; WildFeedback mines real multi-turn feedback, offering better alignment with real product needs.
  • vs Anthropic HH / WebGPT: These rely on human-annotated preferences, which are controllable but expensive, and annotator preferences may not represent end-users. WildFeedback uses actual task feedback, reducing the "annotator-user" preference gap.
  • vs OASST1: OASST1 contains multi-turn dialogues, but many are human-written; WildFeedback captures how users actually follow up and supplement requirements after model failures in real interactions.
  • Insights: Alignment data can shift from "annotation tasks" to "interaction log mining." For education, medical QA, and agents, researching how clicks, dwell time, retries, and rewrites translate into preference signals is a promising direction.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Automatic construction of preference data and checklist evaluation from in-situ feedback is highly relevant to real product alignment.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple models, benchmarks, and human consistency, though long-term effects on users remain unassessed.
  • Writing Quality: ⭐⭐⭐⭐ Clear methodological chain and convincing data diagnostics.
  • Value: ⭐⭐⭐⭐⭐ Directly inspires LLM alignment, conversational recommendation, and interactive evaluation; serves as a foundational framework for learning from real feedback.