GoalRank: Group-Relative Optimization for a Large Ranking Model

Conference: ICLR 2026 | arXiv: 2509.22046 | Code: None | Area: LLM Alignment / Recommendation Ranking | Keywords: ranking, generator-only, group-relative optimization, scaling law, recommendation

TL;DR

This paper theoretically proves that for any Multi-Generator-Evaluator (Multi-G-E) ranking system, there exists a larger generator-only model that approximates the optimal policy with smaller error and satisfies scaling laws. Based on this, GoalRank is proposed—a framework that uses a reward model to construct a group-relative reference policy for training a large generator-only ranking model, achieving significant improvements over SOTA in online A/B testing.

Background & Motivation

Background: The ranking stage of recommender systems selects an ordered list of length \(L\) from \(N\) candidates (a \(P(N,L)\) combinatorial space). The dominant paradigm is the Generator-Evaluator (G-E) two-stage framework: a generator produces candidate lists, and an evaluator selects the best one.
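For scale, even modest settings make this space enormous: with, say, \(N = 100\) candidates and \(L = 6\) slots (illustrative numbers, not necessarily the paper's), \(P(100, 6) = 100 \cdot 99 \cdots 95 \approx 8.6 \times 10^{11}\) possible lists.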

Limitations of Prior Work: The gains from increasing the number of generators or using multi-generator ensembles saturate rapidly (Fig. 1d). The two-stage paradigm introduces cross-stage inconsistency and engineering complexity.

Key Challenge: The policy space of two-stage methods is constrained to mixtures of \(k\) small generators, whereas the scaling law of end-to-end large models suggests that a single, larger model may perform better.

Goal: (1) Can a generator-only model theoretically outperform Multi-G-E? (2) How should such a large ranking model be trained?

Key Insight: Theorem 1 proves that a larger generator-only model has strictly smaller approximation error than any finite Multi-G-E system, and this error approaches zero as the model scales.

Core Idea: A reward model is used to construct a group-relative softmax reference policy over groups of candidate lists, and a large ranking model is trained via cross-entropy to approximate the optimal policy.

Method

Overall Architecture

(1) Train a reward model \(\hat{r}(l)\) to estimate list-level user feedback → (2) Construct a list group \(\mathcal{B}_u\) for each user (containing lists generated by the main model and various auxiliary policies) → (3) Build a reference policy \(\pi^{ref}\) via group-relative softmax → (4) Minimize the cross-entropy between \(\pi_\theta\) and \(\pi^{ref}\).
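
Putting the four steps together, the training objective reduces to a cross-entropy against the group-relative reference distribution. The following is a minimal sketch of that objective in PyTorch, assuming list-level model scores and reward estimates are already computed; the function name `goalrank_loss`, the batch shapes, and the random stand-in tensors are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def goalrank_loss(list_scores: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the model's distribution over a list group and
    the group-relative softmax reference policy built from reward estimates.

    list_scores: (B, G) unnormalized scores the ranking model assigns to the
                 G candidate lists in each user's group.
    rewards:     (B, G) reward-model estimates r_hat(l) for the same lists.
    """
    # Group-relative normalization: subtract the group mean, divide by the std.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True).clamp_min(1e-6)
    ref_policy = F.softmax((rewards - mean) / std, dim=-1)    # pi_ref(l | B)

    log_model = F.log_softmax(list_scores, dim=-1)            # log pi_theta(l | B)
    return -(ref_policy * log_model).sum(dim=-1).mean()       # cross-entropy loss

# Toy usage: 2 users, each with a group of 4 candidate lists.
scores = torch.randn(2, 4, requires_grad=True)   # stand-in for model outputs
rewards = torch.randn(2, 4)                      # stand-in for reward-model scores
loss = goalrank_loss(scores, rewards)
loss.backward()
```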

Key Designs

  1. Theorem 1 (Theoretical Justification): For any \(k\)-mixture \((\alpha,\beta)\)-bounded policy space \(\mathcal{C}_m^k\), there exists a generator-only policy space \(\mathcal{F}_M\) with width \(\geq k\alpha + n\) such that \(\mathcal{E}(\mathcal{F}_M) < \mathcal{E}(\mathcal{C}_m^k)\), and \(\lim_{n\to\infty} \mathcal{E}(\mathcal{F}_M) = 0\).

  2. Group-Relative Reference Policy:

     \(\pi^{ref}(l \mid \mathcal{B}) = \frac{\exp((\hat{r}(l) - \bar{r}_\mathcal{B})/\sigma_\mathcal{B})}{\sum_{l' \in \mathcal{B}} \exp((\hat{r}(l') - \bar{r}_\mathcal{B})/\sigma_\mathcal{B})}\)

     Normalization (subtracting the group mean \(\bar{r}_\mathcal{B}\) and dividing by the group standard deviation \(\sigma_\mathcal{B}\)) preserves the ranking order within the group even under a biased reward model; the ordering is reliable when the intra-group reward gap exceeds a threshold \(\sigma^*\) (see the toy check after this list).

  3. List Group Construction: An auxiliary policy set \(\mathcal{M}\) (comprising heuristic and lightweight neural models) is introduced to generate diverse list groups for each user, ensuring sufficiently large intra-group reward gaps.
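
To make the normalization property concrete, here is a toy check (made-up numbers, not from the paper) that the group-relative softmax is unchanged when the reward model's outputs are distorted by a positive scale and an additive offset; only the within-group ordering and relative gaps matter.

```python
import numpy as np

def group_relative_softmax(rewards: np.ndarray) -> np.ndarray:
    """Reference policy over one list group: softmax of standardized rewards."""
    z = (rewards - rewards.mean()) / rewards.std()
    e = np.exp(z - z.max())                 # max-subtraction for numerical stability
    return e / e.sum()

rewards = np.array([0.2, 0.5, 0.9, 0.1])    # made-up r_hat values for 4 lists
distorted = 2.0 * rewards + 3.7             # reward model with scale and offset bias

print(group_relative_softmax(rewards))
# The reference policy is identical under any affine distortion of the rewards.
print(np.allclose(group_relative_softmax(rewards),
                  group_relative_softmax(distorted)))   # True
```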

Key Experimental Results

Main Results (ML-1M / Industry / Amazon-Book)

| Method | ML-1M H@6 | Industry H@6 | Book H@6 |
| --- | --- | --- | --- |
| Best G-E (PIER) | 62.74 | 45.35 | 71.14 |
| Best MG-E (G-100) | 60.64 | - | - |
| GoalRank | 64.51 | 55.33 | 74.29 |
| Gain vs. best baseline | +2.77% | +11.3% | +1.7% |

Scaling Law Verification

GoalRank's performance improves consistently with model parameter count and training data volume, demonstrating a clear scaling law.
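
For context on how such a claim is typically quantified, one can fit a saturating power law \(\text{metric}(N) \approx c - aN^{-b}\) to (model size, metric) points. The sketch below shows one way to do such a fit; every data point is invented for illustration and none of these numbers come from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (parameter count, H@6) pairs; NOT numbers from the paper.
params = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
hit_rate = np.array([0.58, 0.60, 0.62, 0.635, 0.645])

def saturating_power_law(n, a, b, c):
    # Metric approaches the asymptote c as model size n grows.
    return c - a * n ** (-b)

(a, b, c), _ = curve_fit(saturating_power_law, params, hit_rate,
                         p0=(10.0, 0.3, 0.7), maxfev=20000)
print(f"exponent b = {b:.3f}, asymptote c = {c:.3f}")
```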

Online A/B Testing

On an industrial-scale short-video platform, GoalRank achieves significant improvements over SOTA baselines on core metrics.

Key Findings

  • Generator-only GoalRank outperforms all G-E and MG-E baselines across all datasets, validating Theorem 1.
  • MG-E methods show saturating or even declining performance gains as the number of generators increases from 3 to 100.
  • Group-relative normalization makes training robust to absolute biases in the reward model, requiring only correct ordinal relationships.

Highlights & Insights

  • Theory-driven practice: Theorem 1 is not merely theoretical decoration—it motivates the entire framework design: abandoning the two-stage paradigm, scaling up the generator, and using group-relative normalization to avoid reliance on precise rewards.
  • Correspondence with DPO/RLHF: GoalRank's group-relative optimization closely parallels RLHF/DPO in LLM alignment—both use reward signals to construct reference distributions for policy training, differing only in that the task shifts from text generation to list ranking.
  • Scaling law for recommender systems: This work is the first to demonstrate LLM-like scaling laws in a ranking task.

Limitations & Future Work

  • A pretrained reward model is required; its quality caps the performance GoalRank can reach.
  • The choice of the auxiliary policy set \(\mathcal{M}\) affects the diversity of list groups and the magnitude of intra-group reward gaps.
  • Validation is limited to recommendation ranking; applicability to LLM alignment (e.g., RLHF) remains unexplored.

Comparison with Related Work

  • vs. PIER/NAR4Rec: Two-stage G-E methods are constrained by finite coverage of the candidate list space; GoalRank optimizes directly over the full policy space.
  • vs. DPO/RCPO: RCPO replaces pairwise comparisons with ranked-choice preferences in text alignment; GoalRank replaces pointwise scoring with group-relative optimization in recommendation ranking. The core ideas are consistent.

Rating

  • Novelty: ⭐⭐⭐⭐ — The theoretical proof that generator-only models outperform G-E systems is a strong contribution; the group-relative optimization approach is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Public datasets + industrial datasets + online A/B testing + scaling law analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Theory and practice are well integrated.
  • Value: ⭐⭐⭐⭐ — Significant implications for the ranking paradigm in recommender systems.