# GoalRank: Group-Relative Optimization for a Large Ranking Model
Conference: ICLR 2026 · arXiv: 2509.22046 · Code: None · Area: LLM Alignment / Recommendation Ranking · Keywords: ranking, generator-only, group-relative optimization, scaling law, recommendation
## TL;DR
This paper theoretically proves that for any Multi-Generator-Evaluator (Multi-G-E) ranking system, there exists a larger generator-only model that approximates the optimal policy with smaller error and satisfies scaling laws. Based on this, GoalRank is proposed—a framework that uses a reward model to construct a group-relative reference policy for training a large generator-only ranking model, achieving significant improvements over SOTA in online A/B testing.
## Background & Motivation
- Background: The ranking stage of recommender systems selects an ordered list of length \(L\) from \(N\) candidates (a \(P(N,L)\) combinatorial space). The dominant paradigm is the Generator-Evaluator (G-E) two-stage framework: a generator produces candidate lists, and an evaluator selects the best one.
- Limitations of Prior Work: The gains from adding generators in multi-generator ensembles saturate rapidly (Fig. 1d), and the two-stage paradigm introduces cross-stage inconsistency and engineering complexity.
- Key Challenge: The policy space of two-stage methods is constrained to mixtures of \(k\) small generators, whereas the scaling law of end-to-end large models suggests that a single, larger model may perform better.
- Goal: (1) Can a generator-only model theoretically outperform Multi-G-E? (2) How should such a large ranking model be trained?
- Key Insight: Theorem 1 proves that a larger generator-only model has strictly smaller approximation error than any finite Multi-G-E system, and that this error approaches zero as the model scales.
- Core Idea: Use a reward model to construct a group-relative softmax reference policy over groups of candidate lists, and train a large ranking model via cross-entropy to approximate the optimal policy.
## Method

### Overall Architecture
(1) Train a reward model \(\hat{r}(l)\) to estimate list-level user feedback → (2) Construct a list group \(\mathcal{B}_u\) for each user (containing lists generated by the main model and various auxiliary policies) → (3) Build a reference policy \(\pi^{ref}\) via group-relative softmax → (4) Minimize the cross-entropy between \(\pi_\theta\) and \(\pi^{ref}\).
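Steps (2)–(4) can be sketched in a few lines. The reward values, group size, and helper names below are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

def reference_policy(r_hat):
    """Group-relative softmax over a list group B: z-normalize the
    reward-model scores within the group, then apply softmax."""
    r = np.asarray(r_hat, dtype=float)
    z = (r - r.mean()) / (r.std() + 1e-8)  # subtract group mean, divide by group std
    e = np.exp(z - z.max())                # max-shift for numerical stability
    return e / e.sum()

def cross_entropy(pi_theta, pi_ref):
    """Training objective: CE between the ranking model's distribution
    over the group's lists and the group-relative reference policy."""
    return -float(np.sum(pi_ref * np.log(np.asarray(pi_theta) + 1e-12)))

# Toy group of 4 candidate lists with reward-model scores r_hat(l).
r_hat = [0.9, 0.1, 0.4, 0.6]
pi_ref = reference_policy(r_hat)

# A uniform model distribution gives a nonzero CE to minimize.
pi_theta = np.full(4, 0.25)
loss = cross_entropy(pi_theta, pi_ref)

# The reference policy is a distribution and preserves the reward ordering.
assert abs(pi_ref.sum() - 1.0) < 1e-9
assert list(np.argsort(pi_ref)) == list(np.argsort(r_hat))
```

In the actual framework \(\pi_\theta\) would come from the large ranking model; here it is a uniform placeholder to make the loss computation concrete.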
### Key Designs
- Theorem 1 (Theoretical Justification): For any \(k\)-mixture \((\alpha,\beta)\)-bounded policy space \(\mathcal{C}_m^k\), there exists a generator-only policy space \(\mathcal{F}_M\) with width \(\geq k\alpha+n\) such that \(\mathcal{E}(\mathcal{F}_M) < \mathcal{E}(\mathcal{C}_m^k)\), and \(\lim_{n\to\infty} \mathcal{E}(\mathcal{F}_M) = 0\).
- Group-Relative Reference Policy:
  - \(\pi^{ref}(l|\mathcal{B}) = \frac{\exp((\hat{r}(l) - \bar{r}_\mathcal{B})/\sigma_\mathcal{B})}{\sum_{l' \in \mathcal{B}} \exp((\hat{r}(l') - \bar{r}_\mathcal{B})/\sigma_\mathcal{B})}\)
  - Normalization (subtracting the group mean \(\bar{r}_\mathcal{B}\) and dividing by the group standard deviation \(\sigma_\mathcal{B}\)) preserves the reward ordering within the group even under a biased reward model.
  - Condition: the ranking order is reliable when the intra-group reward gap exceeds \(\sigma^*\).
- List Group Construction: An auxiliary policy set \(\mathcal{M}\) (comprising heuristic and lightweight neural models) is introduced to generate diverse list groups for each user, ensuring sufficiently large intra-group reward gaps.
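A toy sketch of list group construction: lists sampled from a stand-in main model are pooled with lists from heuristic auxiliary policies. The policy choices (a popularity ranker and a random sampler) and the item-popularity reward proxy are illustrative assumptions, not the paper's \(\mathcal{M}\):

```python
import numpy as np

rng = np.random.default_rng(0)
N_ITEMS, L = 50, 6
pop = rng.random(N_ITEMS)  # toy per-item popularity scores

def main_model_list():
    # Stand-in for a list sampled from the large ranking model.
    return rng.choice(N_ITEMS, size=L, replace=False)

def popularity_list():
    # Heuristic auxiliary policy: top-L items by popularity.
    return np.argsort(pop)[::-1][:L]

def random_list():
    # Auxiliary policy adding diversity: uniform random list.
    return rng.choice(N_ITEMS, size=L, replace=False)

# Group B_u for one user: main-model samples plus auxiliary-policy lists.
group = [main_model_list() for _ in range(3)] + [popularity_list(), random_list()]

# Score each list with a toy list-level reward (sum of item popularity);
# the contrasting auxiliary policies widen the intra-group reward gap.
r_hat = [float(pop[l].sum()) for l in group]
gap = max(r_hat) - min(r_hat)
assert gap > 0.0  # in practice, require gap > sigma* for a reliable ordering
```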
## Key Experimental Results

### Main Results (ML-1M / Industry / Amazon-Book)
| Method | ML-1M H@6 | Industry H@6 | Book H@6 |
|---|---|---|---|
| Best G-E (PIER) | 62.74 | 45.35 | 71.14 |
| Best MG-E (G-100) | 60.64 | - | - |
| GoalRank | 64.51 | 55.33 | 74.29 |
| Gain vs. best baseline | +2.77% | +11.3% | +1.7% |
### Scaling Law Verification
GoalRank's performance improves consistently with model parameter count and training data volume, demonstrating a clear scaling law.
### Online A/B Testing
On an industrial-scale short-video platform, GoalRank achieves significant improvements over SOTA baselines on core metrics.
## Key Findings
- Generator-only GoalRank outperforms all G-E and MG-E baselines across all datasets, validating Theorem 1.
- MG-E methods show saturating or even declining performance gains as the number of generators increases from 3 to 100.
- Group-relative normalization makes training robust to absolute biases in the reward model, requiring only correct ordinal relationships.
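The robustness claim can be checked directly: z-score normalization is invariant to any positive affine bias in the reward model, so a shifted and rescaled \(\hat{r}\) yields exactly the same reference policy. A small numpy check (not from the paper):

```python
import numpy as np

def reference_policy(r_hat):
    """Group-relative softmax: softmax of within-group z-scores."""
    r = np.asarray(r_hat, dtype=float)
    z = (r - r.mean()) / r.std()
    e = np.exp(z - z.max())
    return e / e.sum()

r_true = np.array([0.2, 0.8, 0.5])
r_biased = 3.0 * r_true + 10.0  # systematically biased reward model

pi = reference_policy(r_true)
pi_biased = reference_policy(r_biased)

# Identical distributions: under positive affine bias, only the ordinal
# structure of the rewards affects the reference policy.
assert np.allclose(pi, pi_biased)
```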
## Highlights & Insights
- Theory-driven practice: Theorem 1 is not merely theoretical decoration—it motivates the entire framework design: abandoning the two-stage paradigm, scaling up the generator, and using group-relative normalization to avoid reliance on precise rewards.
- Correspondence with DPO/RLHF: GoalRank's group-relative optimization closely parallels RLHF/DPO in LLM alignment—both use reward signals to construct reference distributions for policy training, differing only in that the task shifts from text generation to list ranking.
- Scaling law for recommender systems: This work is the first to demonstrate LLM-like scaling laws in a ranking task.
## Limitations & Future Work
- A pretrained reward model is required, and its quality caps GoalRank's achievable performance.
- The choice of the auxiliary policy set \(\mathcal{M}\) affects the diversity of list groups and the magnitude of intra-group reward gaps.
- Validation is limited to recommendation ranking; applicability to LLM alignment (e.g., RLHF) remains unexplored.
## Related Work & Insights
- vs. PIER/NAR4Rec: Two-stage G-E methods are constrained by finite coverage of the candidate list space. GoalRank optimizes directly over the full policy space.
- vs. DPO/RCPO: RCPO replaces pairwise comparisons with ranked-choice preferences in text alignment; GoalRank replaces pointwise scoring with group-relative optimization in recommendation ranking. The core ideas are consistent.
## Rating
- Novelty: ⭐⭐⭐⭐ — The theoretical proof that generator-only models outperform G-E systems is a strong contribution; the group-relative optimization approach is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Public datasets + industrial datasets + online A/B testing + scaling law analysis.
- Writing Quality: ⭐⭐⭐⭐ — Theory and practice are well integrated.
- Value: ⭐⭐⭐⭐ — Significant implications for the ranking paradigm in recommender systems.