GRAM: A Generative Foundation Reward Model for Reward Generalization
Conference: ICML 2025
arXiv: 2506.14175
Code: None
Area: LLM Alignment
Keywords: Reward Model, Generative Model, Foundation Model, RLHF, Generalization
TL;DR
GRAM trains reward models generatively rather than discriminatively. It pre-trains a generative reward model on large-scale unlabeled data, then fine-tunes it on labeled preference data. The paper also proves that fine-tuning with label smoothing is mathematically equivalent to optimizing a regularized pairwise ranking loss. The resulting reward model is reported to generalize across tasks.
Background & Motivation
Background: In LLM alignment, the reward model (RM) is a core component of RLHF, used to guide the model to generate outputs that align with human preferences. Currently, reward models are typically trained in a discriminative manner—directly learning a scoring function on human preference label data.
Limitations of Prior Work: Discriminative reward models rely heavily on labeled human preference data, resulting in insufficient generalization capabilities. When facing new tasks or out-of-distribution data, RM performance drops significantly. Meanwhile, acquiring high-quality preference data is extremely expensive.
Key Challenge: The limited amount of labeled preference data versus the need for reward models to generalize to a wide range of tasks. How can an RM obtain generalization capabilities through large-scale unlabeled data, similar to foundation language models?
Goal: Build a "foundation reward model" that can transfer to various tasks with minimal or even zero labeled data.
Key Insight: Leverage the "pre-training + fine-tuning" paradigm of LLMs—first train a generative RM on large-scale unsupervised data, then fine-tune it with a small amount of preference data.
Core Idea: The log-likelihood of a generative model naturally serves as a reward signal. By pre-training and fine-tuning, a highly generalizable foundation reward model can be constructed.
Method
Overall Architecture
- Stage 1 (Unsupervised Pre-training): Pre-train in a generative manner on large-scale unlabeled text to learn the general quality distribution of language.
- Stage 2 (Supervised Fine-tuning): Fine-tune on labeled human preference data to adapt the generative capability for reward scoring.
- Inference: Given a prompt-response pair, the log-likelihood (or its variant) of the generative RM is used as the reward score.
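
The inference step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the model exposes one vector of logits per response token, and the names `log_softmax`, `log_likelihood_reward`, and `length_normalize` are hypothetical. The length-normalized option is one possible reading of "or its variant".

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def log_likelihood_reward(step_logits, response_ids, length_normalize=False):
    """Sum log p(y_t | y_<t, x) over the response tokens.

    step_logits[t] is assumed to hold the logits the model produced just
    before emitting response token t (a real model would supply these).
    """
    if len(step_logits) != len(response_ids):
        raise ValueError("need one logit vector per response token")
    total = 0.0
    for logits, token_id in zip(step_logits, response_ids):
        total += log_softmax(logits)[token_id]
    if length_normalize:
        # Assumed variant: average per-token log-probability.
        total /= max(len(response_ids), 1)
    return total

# Toy example: 3-token vocabulary, 2-token response.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
likely = log_likelihood_reward(toy_logits, [0, 2])    # tokens the model favors
unlikely = log_likelihood_reward(toy_logits, [1, 0])  # tokens it disfavors
print(likely > unlikely)
```

Summed log-likelihood favors shorter responses because every extra token adds a non-positive term, which is why a length-normalized variant is often worth comparing.
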
Key Designs
- Generative RM:
    - Unlike traditional RMs that append a scalar classification head at the end of the sequence, GRAM directly uses the log-likelihood \(\log p_\theta(y|x)\) of the generative model as the reward.
    - Key Insight: High-quality responses should have a higher likelihood under the generative model.
    - Design Motivation: Generative models can leverage large-scale unlabeled data for pre-training to obtain a generalized understanding of language quality.
- Equivalence between Label Smoothing and Regularized Pairwise Ranking:
    - The paper proves that, when training with label smoothing, the generative loss is equivalent to a regularized pairwise ranking loss.
    - With smoothing rate \(\epsilon\), the preference objective is \(\mathcal{L}_{\text{smooth}} = (1-\epsilon)\mathcal{L}_{\text{CE}}(y_w) + \epsilon \mathcal{L}_{\text{CE}}(y_l)\). Most of the weight goes to the preferred response \(y_w\), while a small weight \(\epsilon\) goes to the dispreferred response \(y_l\). The \(\epsilon\) term therefore acts as a regularizer that keeps the preference margin from growing without bound, while the dominant term still ranks \(y_w\) above \(y_l\).
    - Design Motivation: Establish a unified perspective of generative and discriminative models under the same training objective class.
- Transfer of Foundation Reward Models:
    - The pre-trained generative RM can be applied directly (zero-shot) or with few-shot fine-tuning to various downstream tasks.
    - Downstream uses include response ranking, RLHF training signals, and task adaptation.
    - Design Motivation: Mimic the zero-shot and few-shot transfer capabilities of foundation language models.
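
The response-ranking use case listed above can be sketched as follows. This is an illustrative sketch only: `rank_responses`, `pairwise_accuracy`, and `toy_reward` are hypothetical names, and `toy_reward` is a deliberately crude word-overlap stand-in for a trained reward model, not anything from the paper.

```python
from typing import Callable, List, Tuple

RewardFn = Callable[[str, str], float]

def rank_responses(reward_fn: RewardFn, prompt: str, candidates: List[str]) -> List[str]:
    """Return candidates sorted from highest to lowest reward."""
    return sorted(candidates, key=lambda r: reward_fn(prompt, r), reverse=True)

def pairwise_accuracy(reward_fn: RewardFn, pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of (prompt, chosen, rejected) triples where chosen scores higher."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(reward_fn(p, c) > reward_fn(p, r) for p, c, r in pairs)
    return correct / len(pairs)

def toy_reward(prompt: str, response: str) -> float:
    """Stand-in reward (assumption): count response words that appear in the prompt."""
    prompt_words = set(prompt.lower().split())
    return float(sum(w in prompt_words for w in response.lower().split()))

prompt = "what is the capital of france"
ranked = rank_responses(toy_reward, prompt, ["i like cheese", "paris is the capital of france"])
print(ranked[0])

pairs = [
    (prompt, "paris is the capital of france", "i like cheese"),
    ("name a fruit", "apple", "name a fruit name a fruit"),  # the toy reward gets this wrong
]
accuracy = pairwise_accuracy(toy_reward, pairs)
print(accuracy)
```

Any scoring function with the same `(prompt, response) -> float` signature can be swapped in, so the same harness works for a zero-shot model and for one adapted with a few labeled examples.
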
Loss & Training
- Pre-training: Standard autoregressive language model loss \(\mathcal{L} = -\sum_t \log p_\theta(y_t | y_{<t}, x)\)
- Fine-tuning: Preference loss with label smoothing, which is equivalent to a regularized pairwise ranking loss.
- The training strategy supports two modes:
    - Direct training on preference data (label smoothed CE).
    - Freezing the generative model and training only lightweight adaptation layers.
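
The equivalence claimed for the fine-tuning loss can be checked numerically under one simplifying assumption, which is mine rather than the source's: the model's preference is a two-way softmax, so that \(p(y_w) = \sigma(\Delta)\) for a margin \(\Delta\) between the two candidates. Since \(\log\sigma(-\Delta) = \log\sigma(\Delta) - \Delta\), the smoothed cross-entropy collapses to \(-\log\sigma(\Delta) + \epsilon\Delta\), a Bradley-Terry ranking loss plus a linear penalty on the margin. The function names below are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def label_smoothed_ce(margin, eps):
    """Smoothed cross-entropy over two labels with p(preferred) = sigmoid(margin)."""
    return -(1 - eps) * log_sigmoid(margin) - eps * log_sigmoid(-margin)

def regularized_ranking_loss(margin, eps):
    """Bradley-Terry ranking loss plus a linear penalty eps * margin."""
    return -log_sigmoid(margin) + eps * margin

def optimal_margin(eps):
    """Margin where the gradient sigmoid(margin) - (1 - eps) is zero."""
    return math.log((1 - eps) / eps)

eps = 0.1
for margin in (-3.0, -0.5, 0.0, 1.2, 4.0):
    a = label_smoothed_ce(margin, eps)
    b = regularized_ranking_loss(margin, eps)
    assert math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)

best = optimal_margin(eps)
print(best)
```

Without smoothing, the ranking loss keeps decreasing as the margin grows. With \(\epsilon > 0\), the loss is minimized at the finite margin \(\log\frac{1-\epsilon}{\epsilon}\), which is one way to see the regularization effect.
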
Key Experimental Results
Main Results
| Task | Metric | GRAM (qualitative) | Baseline | Reported gain (qualitative) |
|---|---|---|---|---|
| Preference Ranking (RewardBench) | Accuracy↑ | Significant improvement | Standard BT RM | Several percentage points |
| RLHF Training | Win Rate↑ | Higher | Standard RM | Significant |
| Task Adaptation (Few-shot) | Accuracy↑ | Better | Direct Fine-tuning RM | Significant improvement |
| Zero-shot Transfer | Accuracy↑ | Viable | Requires training data | No annotation needed |
Ablation Study
| Configuration | Effect on key metric | Interpretation |
|---|---|---|
| No Pre-training | Significant drop | Unsupervised pre-training is key to generalization |
| No Label Smoothing | Drop | Regularization effect is important for stable training |
| Pure Discriminative RM | Poor generalization | Performance degrades severely on out-of-distribution data |
| Different Pre-training Scales | Improves with scale | Larger models generalize better |
Key Findings
- Generative RMs significantly outperform discriminative RMs in cross-task generalization.
- Label smoothing is not just a regularization heuristic: training with it is mathematically equivalent to optimizing a regularized pairwise ranking loss.
- The quality and scale of pre-training directly determine the generalization capability of the foundation RM.
- GRAM outperforms several strong baselines in both RLHF training and direct ranking tasks.
Highlights & Insights
- Theoretical Contribution: Proof of equivalence between label smoothing and regularized pairwise ranking, unifying the generative and discriminative training perspectives.
- Paradigm Innovation: Extends the "pre-training + fine-tuning" paradigm from language models to reward models.
- Practical Value: The foundation RM can be reused across tasks, substantially reducing annotation costs for new tasks.
- Meta-learning Perspective: Generative pre-training can be viewed as implicitly learning the meta-knowledge of "what constitutes good text."
Limitations & Future Work
- The inference cost of generative RMs is higher than that of discriminative RMs with a single scalar output.
- Log-likelihood as a reward may not align perfectly with human preferences on certain tasks (e.g., creative writing).
- Domain bias in pre-training data may affect the generalization of the RM in specific specialized fields.
- Large-scale pre-training is computationally expensive.
Related Work & Insights
- Connection with DPO: DPO also uses a likelihood-based implicit reward, namely the log-likelihood ratio between the policy and a reference model; GRAM further generalizes this concept to foundation models.
- Comparison with GPT-4 as a judge: GRAM provides a parameterized, trainable alternative.
- Insight: The equivalence between generative and discriminative models can potentially be leveraged in more scenarios.
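
For reference, the DPO implicit reward mentioned above is \(\beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}\), as defined in the DPO paper. The sketch below computes it from sequence log-probabilities. The function names and the toy numbers are hypothetical and are not taken from GRAM.

```python
import math

def dpo_implicit_reward(policy_logp, reference_logp, beta=0.1):
    """beta * log(pi_theta(y|x) / pi_ref(y|x)), computed from log-probabilities."""
    return beta * (policy_logp - reference_logp)

def dpo_loss(chosen, rejected, beta=0.1):
    """-log sigmoid(reward(chosen) - reward(rejected)).

    chosen and rejected are (policy_logp, reference_logp) tuples.
    """
    margin = dpo_implicit_reward(*chosen, beta=beta) - dpo_implicit_reward(*rejected, beta=beta)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Toy sequence log-probabilities (made up for illustration).
chosen = (-12.0, -15.0)    # policy likes the chosen response more than the reference does
rejected = (-20.0, -18.0)  # policy likes the rejected response less than the reference does
reward_gap = dpo_implicit_reward(*chosen) - dpo_implicit_reward(*rejected)
print(reward_gap, dpo_loss(chosen, rejected))
```

The structural difference is that the DPO reward is always measured relative to a reference model, whereas the reward described in this summary is the generative model's own log-likelihood.
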
Rating
- Novelty: ⭐⭐⭐⭐ The concept of generative foundation RM and its theoretical connections are novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive multi-task validation.
- Writing Quality: ⭐⭐⭐⭐ Good integration of theory and experiments.
- Value: ⭐⭐⭐⭐⭐ Provides a new paradigm for RM training with high practicality.