Reverse Engineering Human Preferences with Reinforcement Learning¶
Conference: NeurIPS 2025 (Spotlight)
arXiv: 2505.15795
Code: N/A
Area: AI Safety
Keywords: LLM-as-a-Judge, Adversarial Attack, Preference Reverse Engineering, Reinforcement Learning, Detectability
TL;DR¶
A reinforcement learning-trained preamble generator is used to inflate the evaluation scores of downstream LLMs, exposing critical vulnerabilities in the LLM-as-a-Judge evaluation framework. The attack is nearly undetectable and demonstrates cross-model transferability.
Background & Motivation¶
Starting Point¶
Background: LLM-as-a-Judge has become the dominant framework for evaluating LLM capabilities—leveraging a powerful LLM as a judge to predict human preferences. However, this framework is susceptible to malicious exploitation:
Gamability: LLM outputs can be adversarially optimized to satisfy the judge model.
Limitations of Prior Work: Existing post-hoc editing methods directly modify model responses, making them readily detectable.
Security Threat: If leaderboard scores can be artificially inflated, the credibility of the entire evaluation ecosystem is called into question.
Method¶
Overall Architecture¶
A preamble generator model is trained to prepend an implicit steering text before the candidate LLM's response, inducing the judge LLM to assign higher scores.
Key Designs¶
1. Pipeline Architecture
Two LLMs are composed in sequence:
- Preamble Generator \(\pi_\phi\): generates a preamble \(p\) conditioned on the prompt.
- Candidate LLM \(M\) (frozen): takes \([p; \text{prompt}]\) as input and produces response \(y\).
- Judge LLM \(J\): scores the response \(y\).
Key: The candidate LLM remains frozen; only the preamble generator is trained.
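The three-stage composition above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_preamble`, `candidate_generate`, and `judge_score` are hypothetical stand-ins for the actual model calls.

```python
# Hypothetical sketch of the preamble -> candidate -> judge pipeline.
# All three functions below are stubs standing in for real LLM calls.

def generate_preamble(prompt: str) -> str:
    # pi_phi: the trainable preamble generator (stubbed with a fixed string).
    return "Answer thoroughly and cite concrete examples."

def candidate_generate(preamble: str, prompt: str) -> str:
    # M: the frozen candidate LLM; it conditions on [preamble; prompt],
    # but the preamble never appears in its visible output.
    return f"Response to: {prompt}"

def judge_score(response: str) -> float:
    # J: the judge LLM, scoring only the candidate's response.
    return min(10.0, 5.0 + 0.01 * len(response))

def pipeline(prompt: str) -> tuple[str, float]:
    p = generate_preamble(prompt)       # p ~ pi_phi(. | prompt)
    y = candidate_generate(p, prompt)   # y = M([p; prompt])
    return y, judge_score(y)            # reward = J(y)

response, reward = pipeline("Explain photosynthesis.")
```

Note that the judge sees only `y`, which is why only the preamble generator's parameters need to change to game the score.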
2. RL Training
- Reward Signal: The judge LLM's score for the response.
- Policy: Parameters of the preamble generator \(\pi_\phi\).
- Objective: \(\max_\phi \mathbb{E}_{p \sim \pi_\phi}[J(M(p, \text{prompt}))]\)
- Optimized using PPO.
3. Stealthiness
- Unlike direct response editing, the preamble is used internally by the model; both the user and the judge observe only the candidate LLM's natural output.
- The preamble influences the model's activation states, thereby altering generation behavior.
- The attack is nearly undetectable: the distributional shift in the responses themselves is minimal.
Loss & Training¶
The training objective augments the expected judge score with a KL penalty toward the initial policy:
\[
\max_\phi \; \mathbb{E}_{p \sim \pi_\phi}\big[R(p)\big] - \beta \, \mathrm{KL}\!\left(\pi_\phi \,\|\, \pi_0\right)
\]
where \(R(p) = J(M(p, \text{prompt}))\) is the judge score and \(\pi_0\) is the initial policy of the preamble generator.
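In RLHF-style PPO training, this KL penalty is commonly folded into the per-sample reward; a minimal sketch of that shaping, assuming a scalar per-sequence log-probability and a hypothetical coefficient `beta`, might look like:

```python
def shaped_reward(judge_score: float, logp_phi: float,
                  logp_ref: float, beta: float = 0.05) -> float:
    # R(p) minus a KL penalty estimated from log-prob differences,
    # keeping pi_phi close to the initial policy pi_0.
    # `beta` is an assumed hyperparameter, not reported in the paper.
    return judge_score - beta * (logp_phi - logp_ref)
```

With `beta = 0`, this reduces to the raw objective of maximizing the judge's score; a larger `beta` trades score for staying near \(\pi_0\), which also helps keep the generated preambles fluent.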
Key Experimental Results¶
Main Results¶
Evaluation score gains across different frameworks (AlpacaEval 2.0, Win Rate vs. GPT-4):
| Method | Baseline Win Rate | Post-Attack Win Rate | Detectability |
|---|---|---|---|
| Direct Response | 25.3% | - | - |
| Post-hoc Editing | - | 38.5% | High (82%) |
| Manual Prompt Injection | - | 32.1% | Medium (55%) |
| Preamble Generator (Ours) | - | 42.8% | Low (12%) |
Cross-model transferability:
| Train Judge | Test Judge | Baseline | Post-Attack | Transfer |
|---|---|---|---|---|
| GPT-4 | GPT-4 | 25.3% | 42.8% | (in-distribution) |
| GPT-4 | Claude-3 | 28.1% | 38.5% | ✓ Effective |
| GPT-4 | Gemini | 26.5% | 36.2% | ✓ Effective |
| Claude-3 | GPT-4 | 25.3% | 37.5% | ✓ Effective |
Ablation Study¶
Effect of preamble length on attack performance:
| Preamble Length (tokens) | Win Rate | Detectability | Response Naturalness |
|---|---|---|---|
| 0 (no preamble) | 25.3% | 0% | 100% |
| 16 | 32.5% | 5% | 98% |
| 64 | 38.8% | 8% | 95% |
| 128 | 42.8% | 12% | 92% |
| 256 | 43.2% | 25% | 85% |
Key Findings¶
- The preamble generator raises Win Rate from 25.3% to 42.8%, surpassing post-hoc editing by a substantial margin.
- Attack detectability is only 12%, far below that of direct editing (82%).
- Cross-model transfer: A preamble generator trained against one judge model remains effective against other judge models.
- This suggests that certain features of human preference are model-agnostic and can be systematically exploited.
Highlights & Insights¶
- Spotlight paper: Exposes a fundamental security vulnerability in the LLM-as-a-Judge paradigm.
- Undetectability: The core innovation lies in conducting the attack via the preamble (invisible) rather than by modifying the response (visible).
- Cross-model transfer: Implies the existence of systematic biases in preference judgments, rather than weaknesses specific to individual models.
- Dual implications: Serves simultaneously as a security warning and as a new direction for optimizing upstream inputs via preamble-based steering.
Limitations & Future Work¶
- Although the preamble is "invisible" to end users, inspection of the system prompt can still reveal it.
- If evaluators simultaneously audit the system prompt, the attack's effectiveness is reduced.
- RL training requires a large number of judge API calls, resulting in substantial computational cost.
- The study focuses primarily on English-language settings; cross-lingual generalization remains unvalidated.
Related Work & Insights¶
- LLM-as-a-Judge (Zheng et al.): Established the LLM-based evaluation framework.
- Prompt Injection: Related work on system prompt attacks.
- AlpacaEval: LLM evaluation leaderboard.
Rating¶
- ⭐ Novelty: 9/10 — The preamble-based attack is a genuinely novel approach that surfaces deep-seated issues.
- ⭐ Experimental Thoroughness: 8/10 — Main results, cross-model transfer, and preamble-length ablations are all covered; findings directly inform the design of more robust evaluation systems.
- ⭐ Writing Quality: 9/10 — Spotlight-level work with clear and compelling argumentation.