Aligning Large Language Models with Implicit Preferences from User-Generated Content

Conference: ACL 2025
arXiv: 2506.04463
Code: https://zhaoxuan.info/PUGC.github.io/
Area: LLM/NLP
Keywords: LLM alignment, implicit preferences, User-Generated Content (UGC), DPO, scalable alignment

TL;DR

Proposes PUGC, a framework that generates preference data from the implicit human preferences in unlabeled User-Generated Content (UGC). PUGC converts UGC into queries and reference texts, scores model-generated responses against those references, and trains with DPO to achieve scalable, domain-specific alignment. With Mistral-7B-Instruct as the base model, it reaches a state-of-the-art length-controlled win rate of 35.93% on AlpacaEval 2.

Background & Motivation

Background: LLM alignment relies on preference feedback data. While human annotation yields high quality, it is costly and unscalable. Although LLM-based labeling (e.g., Constitutional AI) is scalable, it may introduce model biases.

Limitations of Prior Work: (a) High-quality preference data remains a scarce resource, as each instance requires humans to compare the quality of two responses; (b) Domain-specific alignment is even more challenging, as general preference data does not apply well to vertical domains; (c) A vast amount of unlabeled UGC containing rich implicit human preferences exists on the internet but remains unexploited.

Key Challenge: UGC is not created to guide LLM generation, yet it reflects the knowledge, values, and preferences of its creators. The primary challenge is how to convert these "implicit preferences" into "explicit training signals".

Goal: Automatically extract preference signals from unlabeled UGC to achieve low-cost, scalable LLM alignment.

Key Insight: Treat UGC as a "reference standard for good responses." Although UGC does not directly answer questions, the information and perspectives it contains can be utilized to evaluate the quality of LLM responses.

Core Idea: Implicit preferences from UGC \(\rightarrow\) reference texts \(\rightarrow\) scoring LLM responses \(\rightarrow\) DPO alignment.

Method

Overall Architecture

The PUGC pipeline has four steps: (1) generate user queries from UGC automatically via reverse questioning; (2) prompt a policy LLM to generate multiple candidate responses for each query; (3) score each candidate using the original UGC as the reference text, so that responses closer to the UGC's information and style receive higher scores; (4) construct preference pairs from the scores for DPO training.
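Step (4) can be illustrated with a minimal sketch. It assumes that the highest-scoring candidate becomes the preferred response and the lowest-scoring one the dispreferred response; the summary does not state the exact pairing rule, so this selection strategy, the `PreferencePair` type, and the `min_margin` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PreferencePair:
    """One DPO training example (hypothetical container, not from the paper)."""
    prompt: str
    chosen: str
    rejected: str


def build_preference_pair(
    query: str,
    candidates: Sequence[str],
    scores: Sequence[float],
    min_margin: float = 0.0,
) -> Optional[PreferencePair]:
    """Pair the best- and worst-scored candidates for one query.

    Assumption: best-vs-worst pairing. Returns None when the score gap is
    not larger than `min_margin`, since such a pair carries no clear signal.
    """
    if len(candidates) != len(scores) or len(candidates) < 2:
        raise ValueError("need at least two candidates, one score each")
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] - scores[worst] <= min_margin:
        return None
    return PreferencePair(query, candidates[best], candidates[worst])
```

Skipping pairs with a small score gap is one simple way to keep ambiguous comparisons out of the training set; whether PUGC applies such a filter is not stated here.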

Key Designs

UGC → Query Conversion:
- Function: Automatically generate corresponding user queries from unstructured UGC
- Mechanism: Prompt the LLM with "What question does this text answer?" to reverse-generate queries from UGC
- Design Motivation: UGC itself is not in QA format and needs to be transformed before being used for alignment
UGC-based Reference Scoring:
- Function: Evaluate the quality of LLM-generated responses using UGC as a reference standard
- Mechanism: Measure the alignment between LLM responses and UGC using metrics such as semantic similarity and information coverage. High score = preferred, low score = dispreferred
- Design Motivation: UGC embodies the knowledge and insights of the creators; responses that align with it are more likely to be "good" answers
Domain-Specific Alignment:
- Function: Align the model to a target domain using UGC drawn from that domain
- Mechanism: Collect domain-specific UGC (such as upvoted posts from medical, legal, or technical communities) and allow the model to learn preferences within that domain
- Design Motivation: The standard for a "good response" varies across domains—the medical domain expects accuracy and caution, whereas technical communities demand practical and detailed explanations
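The first two designs can be sketched as follows. The prompt wording and the scoring function are illustrative assumptions: `QUERY_PROMPT` paraphrases the reverse-questioning idea, and `coverage_score` is a toy token-overlap proxy for "information coverage", not the scorer used in the paper.

```python
import re

# Hypothetical prompt template for reverse-generating a query from UGC.
QUERY_PROMPT = (
    "Read the following user-generated text and write the question "
    "that it answers.\n\nText:\n{ugc}\n\nQuestion:"
)


def make_query_prompt(ugc: str) -> str:
    """Fill the reverse-questioning template with one piece of UGC."""
    return QUERY_PROMPT.format(ugc=ugc.strip())


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, used as a crude unit of information."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def coverage_score(response: str, reference: str) -> float:
    """Fraction of the reference's distinct tokens that appear in the response.

    Toy stand-in for reference-based scoring: 1.0 means the response
    mentions every token of the UGC reference, 0.0 means none.
    """
    ref = _tokens(reference)
    if not ref:
        return 0.0
    return len(ref & _tokens(response)) / len(ref)
```

In practice a semantic-similarity model or an LLM judge conditioned on the UGC would replace the token overlap; the interface (response and reference in, scalar score out) is the part that matters for building preference pairs.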

Loss & Training

Standard DPO loss is adopted, where preferred/dispreferred responses are determined by the UGC-based reference scores.
Fine-tuning is conducted based on Mistral-7B-Instruct.
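For reference, the standard DPO objective for a query \(x\) with preferred response \(y_w\) and dispreferred response \(y_l\) is

\[
\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
\]

where \(\pi_\theta\) is the policy being trained, \(\pi_{\text{ref}}\) is the frozen reference model, and \(\beta\) controls how far the policy may drift from the reference. The sketch below evaluates this loss for a single pair from sequence log-probabilities. The function name and the default `beta=0.1` are assumptions for illustration, not values reported in this summary.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is \(\log 2\); it falls below that value as the policy raises the preferred response's likelihood relative to the dispreferred one.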

Key Experimental Results

Main Results

| Method | AlpacaEval 2 LC Win Rate (↑) | Description |
| --- | --- | --- |
| Mistral-7B-Instruct baseline | Baseline | No alignment |
| DPO + Traditional Preference Data | Medium | Human/LLM-annotated preferences |
| DPO + PUGC | 35.93% (SOTA) | +9.37% over traditional |

Analysis Dimensions

| Dimension | Result |
| --- | --- |
| Reward Signal Quality | PUGC's implicit preferences show high correlation with human-annotated preferences |
| Domain-Specific Alignment | Performs better when aligned on vertical-domain UGC |
| UGC Quality Robustness | Exhibits a degree of robustness against low-quality UGC |
| Theory of Mind | PUGC alignment enhances the model's capability of understanding user intentions |

Key Findings

Implicit preferences from UGC yield high-quality alignment signals, as evidenced by the 9.37% gain over DPO trained on traditional preference data.
Domain-specific UGC alignment outperforms generic alignment within its targeted domain, validating the value of domain adaptation.
The method exhibits tolerance toward UGC quality, meaning not all UGC needs to be of pristine quality.
PUGC alignment improves the model's Theory of Mind, helping it better understand what users actually want.

Highlights & Insights

The insight that "UGC equals free preference data" holds immense practical value, as billions of UGC pieces on the internet represent untapped alignment resources.
The conversion pipeline from "implicit to explicit" is highly practical and scalable, with a concise design of reverse-generating queries and scoring based on UGC references.
Domain-specific alignment is a unique advantage of PUGC; while general preference data struggles to cover all vertical fields, domain-specific UGC is ubiquitous.
The enhancement in Theory of Mind implies that UGC contains rich signals about "how humans formulate thoughts."

Limitations & Future Work

The quality and representativeness of UGC affect the alignment direction; UGC from low-quality communities may introduce biases.
The accuracy of the "reverse-generated queries" serves as a bottleneck in the pipeline, as incorrect queries lead to faulty preference signals.
Validated only on Mistral-7B; performance on larger models remains unexplored.
UGC may contain outdated or erroneous information.

vs Traditional RLHF/DPO: Standard methods require explicit human preference annotations; PUGC extracts implicit preferences from unlabeled UGC, substantially reducing costs.
vs Constitutional AI: Constitutional AI uses AI feedback guided by a set of principles to generate preferences; PUGC extracts preferences from authentic human-written content, making them more grounded.
vs AgoraBench (Evaluating LMs as Data Generators): AgoraBench evaluates synthetic data quality, whereas PUGC utilizes existing UGC, representing a different data source strategy.

Rating

Novelty: ⭐⭐⭐⭐ Utilizing implicit UGC preferences for alignment is novel and offers exceptionally high practical value.
Experimental Thoroughness: ⭐⭐⭐⭐ Reports a state-of-the-art AlpacaEval 2 result together with multi-dimensional analysis.
Writing Quality: ⭐⭐⭐⭐ Motivation is clear.
Value: ⭐⭐⭐⭐⭐ A scalable alignment solution with direct utility for practical products.