GoalRank: Group-Relative Optimization for a Large Ranking Model

Conference: ICLR 2026
arXiv: 2509.22046
Code: None
Area: LLM Alignment / Recommendation Ranking
Keywords: ranking, generator-only, group-relative optimization, scaling law, recommendation

TL;DR

The paper proves that, for any Multi-Generator-Evaluator (Multi-G-E) ranking system, there exists a larger generator-only model that approximates the optimal policy with strictly smaller error, and that this error shrinks toward zero as the model scales. Building on this result, it proposes GoalRank, a framework that uses a reward model to construct a group-relative reference policy for training a large generator-only ranking model. GoalRank is reported to achieve significant improvements over SOTA in online A/B testing.

Background & Motivation

Background: The ranking stage of recommender systems selects an ordered list of length \(L\) from \(N\) candidates (a \(P(N,L)\) combinatorial space). The dominant paradigm is the Generator-Evaluator (G-E) two-stage framework: a generator produces candidate lists, and an evaluator selects the best one.
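To give a sense of the size of this space, the sketch below counts the \(P(N,L)\) ordered lists and runs a toy Generator-Evaluator step. The sizes (`N=50`, `L=6`), the random "generators", and the `toy_evaluator` scoring rule are illustrative assumptions, not the paper's setup.

```python
import math
import random

def num_ordered_lists(n_candidates: int, list_len: int) -> int:
    """Size of the ranking space: P(N, L) = N! / (N - L)!."""
    return math.perm(n_candidates, list_len)

def toy_evaluator(ranked_list, item_value):
    """Hypothetical evaluator: position-discounted sum of item values."""
    return sum(item_value[item] / (pos + 1) for pos, item in enumerate(ranked_list))

def generator_evaluator_step(n_candidates, list_len, n_generators, seed=0):
    """Each toy 'generator' proposes one random list; the evaluator keeps the best."""
    rng = random.Random(seed)
    item_value = [rng.random() for _ in range(n_candidates)]
    proposals = [rng.sample(range(n_candidates), list_len) for _ in range(n_generators)]
    best = max(proposals, key=lambda lst: toy_evaluator(lst, item_value))
    return best, proposals

space = num_ordered_lists(50, 6)    # hypothetical N=50, L=6
best, proposals = generator_evaluator_step(50, 6, n_generators=100)
coverage = len(proposals) / space   # fraction of the space the evaluator ever scores
```

Even at these small assumed sizes the space holds 11,441,304,000 ordered lists, so an evaluator that scores 100 proposals sees only a vanishing fraction of it.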

Limitations of Prior Work: The gains from increasing the number of generators or using multi-generator ensembles saturate rapidly (Fig. 1d). The two-stage paradigm introduces cross-stage inconsistency and engineering complexity.

Key Challenge: The policy space of two-stage methods is constrained to mixtures of \(k\) small generators, whereas the scaling law of end-to-end large models suggests that a single, larger model may perform better.

Goal: (1) Can a generator-only model theoretically outperform Multi-G-E? (2) How should such a large ranking model be trained?

Key Insight: Theorem 1 proves that, for any finite Multi-G-E system, there exists a sufficiently wide generator-only model with strictly smaller approximation error, and that this error approaches zero as the model scales.

Core Idea: A reward model is used to construct a group-relative softmax reference policy over groups of candidate lists, and a large ranking model is trained via cross-entropy to approximate the optimal policy.

Method

Overall Architecture

(1) Train a reward model \(\hat{r}(l)\) to estimate list-level user feedback → (2) Construct a list group \(\mathcal{B}_u\) for each user (containing lists generated by the main model and various auxiliary policies) → (3) Build a reference policy \(\pi^{ref}\) via group-relative softmax → (4) Minimize the cross-entropy between \(\pi_\theta\) and \(\pi^{ref}\).
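Step (4) can be sketched as follows, assuming the model exposes one log-probability per list in the group. The function name `group_cross_entropy` and the toy inputs are hypothetical; how the paper's model factorizes \(\pi_\theta\) over a list is not specified in this summary, and the toy model probabilities are normalized over the group purely for simplicity.

```python
import numpy as np

def group_cross_entropy(ref_probs, list_log_probs):
    """Cross-entropy -sum_l pi_ref(l) * log pi_theta(l) over one list group."""
    ref_probs = np.asarray(ref_probs, dtype=float)
    list_log_probs = np.asarray(list_log_probs, dtype=float)
    if ref_probs.shape != list_log_probs.shape:
        raise ValueError("ref_probs and list_log_probs must have the same shape")
    return float(-np.sum(ref_probs * list_log_probs))

# Toy group of three lists: the reference policy prefers the first list.
ref = [0.7, 0.2, 0.1]
aligned = np.log([0.7, 0.2, 0.1])      # model agrees with the reference
misaligned = np.log([0.1, 0.2, 0.7])   # model prefers the worst list

loss_aligned = group_cross_entropy(ref, aligned)
loss_misaligned = group_cross_entropy(ref, misaligned)
```

The loss is smallest when the model's distribution over the group matches the reference, which is what pushes \(\pi_\theta\) toward \(\pi^{ref}\).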

Key Designs

Theorem 1 (Theoretical Justification): For any \(k\)-mixture \((\alpha,\beta)\)-bounded policy space \(\mathcal{C}_m^k\), there exists a generator-only policy space \(\mathcal{F}_M\) with width \(\geq k\alpha+n\) such that \(\mathcal{E}(\mathcal{F}_M) < \mathcal{E}(\mathcal{C}_m^k)\), and \(\lim_{n\to\infty} \mathcal{E}(\mathcal{F}_M) = 0\).
Group-Relative Reference Policy:
\(\pi^{ref}(l|\mathcal{B}) = \frac{\exp((\hat{r}(l) - \bar{r}_\mathcal{B})/\sigma_\mathcal{B})}{\sum_{l'} \exp((\hat{r}(l') - \bar{r}_\mathcal{B})/\sigma_\mathcal{B})}\)
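Here \(\bar{r}_\mathcal{B}\) and \(\sigma_\mathcal{B}\) are the mean and standard deviation of \(\hat{r}\) over the group \(\mathcal{B}\). This normalization (subtracting the group mean and dividing by the standard deviation) preserves the ranking order within the group even under a biased reward model.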
Condition: the ranking order is reliable when the intra-group reward gap exceeds \(\sigma^*\).
List Group Construction: An auxiliary policy set \(\mathcal{M}\) (comprising heuristic and lightweight neural models) is introduced to generate diverse list groups for each user, ensuring sufficiently large intra-group reward gaps.
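The group-relative reference policy above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name, the `eps` guard for constant-reward groups, and the example scores are assumptions, and the population standard deviation (NumPy's default) is used because the summary does not say which estimator the paper uses.

```python
import numpy as np

def group_relative_reference(rewards, eps=1e-8):
    """Softmax over z-scored rewards within one list group."""
    r = np.asarray(rewards, dtype=float)
    # z-score within the group; eps is an added guard against zero variance
    z = (r - r.mean()) / (r.std() + eps)
    # subtracting the max is for numerical stability and leaves the softmax unchanged
    z = z - z.max()
    weights = np.exp(z)
    return weights / weights.sum()

# Hypothetical reward-model scores for a group of four candidate lists.
rewards = np.array([0.9, 0.4, 0.1, 0.6])
pi_ref = group_relative_reference(rewards)

# Lists ordered from most to least preferred under the reference policy.
preferred_order = np.argsort(-pi_ref)
```

The resulting `pi_ref` sums to one and ranks the lists in the same order as the raw rewards, so it can serve directly as the target distribution for the cross-entropy objective.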

Key Experimental Results

Main Results (ML-1M / Industry / Amazon-Book)

Method	ML-1M H@6	Industry H@6	Book H@6
Best G-E (PIER)	62.74	45.35	71.14
Best MG-E (G-100)	60.64	-	-
GoalRank	64.51	55.33	74.29
Gain vs. best listed baseline (absolute / relative)	+1.77 (+2.8%)	+9.98 (+22.0%)	+3.15 (+4.4%)

Scaling Law Verification

GoalRank's performance improves consistently with model parameter count and training data volume, demonstrating a clear scaling law.

Online A/B Testing

On an industrial-scale short-video platform, GoalRank achieves significant improvements over SOTA baselines on core metrics.

Key Findings

Generator-only GoalRank outperforms every reported G-E and MG-E baseline on all three datasets, which is consistent with Theorem 1.
MG-E methods show saturating or even declining performance gains as the number of generators increases from 3 to 100.
Group-relative normalization makes training robust to absolute biases in the reward model, requiring only correct ordinal relationships.
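The robustness claim in the last finding can be checked with a short worked derivation. As an illustrative assumption (the summary does not state the exact bias model), suppose the reward model's output is a positive affine distortion of the original estimate, \(\tilde{r}(l) = a\,\hat{r}(l) + b\) with \(a > 0\). Then the group statistics transform the same way and the normalized score is unchanged:

```latex
\begin{aligned}
\bar{\tilde{r}}_\mathcal{B} &= a\,\bar{r}_\mathcal{B} + b,
\qquad
\tilde{\sigma}_\mathcal{B} = a\,\sigma_\mathcal{B}, \\
\frac{\tilde{r}(l) - \bar{\tilde{r}}_\mathcal{B}}{\tilde{\sigma}_\mathcal{B}}
&= \frac{a\,\hat{r}(l) + b - a\,\bar{r}_\mathcal{B} - b}{a\,\sigma_\mathcal{B}}
= \frac{\hat{r}(l) - \bar{r}_\mathcal{B}}{\sigma_\mathcal{B}}.
\end{aligned}
```

Under this assumption \(\pi^{ref}(l|\mathcal{B})\) is identical for \(\tilde{r}\) and \(\hat{r}\), so a constant offset or rescaling of the reward model does not change the training target. Distortions that reorder lists within a group are not covered by this argument.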

Highlights & Insights

Theory-driven practice: Theorem 1 is not merely theoretical decoration—it motivates the entire framework design: abandoning the two-stage paradigm, scaling up the generator, and using group-relative normalization to avoid reliance on precise rewards.
Correspondence with DPO/RLHF: GoalRank's group-relative optimization closely parallels RLHF/DPO in LLM alignment. Both use reward signals to construct reference distributions for policy training; the main difference is that the task shifts from text generation to list ranking.
Scaling law for recommender systems: This work is the first to demonstrate LLM-like scaling laws in a ranking task.

Limitations & Future Work

A pretrained reward model is required; its quality ceiling determines the upper bound of GoalRank.
The choice of the auxiliary policy set \(\mathcal{M}\) affects the diversity of list groups and the magnitude of intra-group reward gaps.
Validation is limited to recommendation ranking; applicability to LLM alignment (e.g., RLHF) remains unexplored.

Comparison with Related Work

vs. PIER/NAR4Rec: Two-stage G-E methods are constrained by finite coverage of the candidate list space. GoalRank optimizes directly over the full policy space.
vs. DPO/RCPO: RCPO replaces pairwise comparisons with ranked-choice preferences in text alignment; GoalRank replaces pointwise scoring with group-relative optimization in recommendation ranking. The core ideas are consistent.

Rating

Novelty: ⭐⭐⭐⭐ — The theoretical proof that generator-only models outperform G-E systems is a strong contribution; the group-relative optimization approach is novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Public datasets + industrial datasets + online A/B testing + scaling law analysis.
Writing Quality: ⭐⭐⭐⭐ — Theory and practice are well integrated.
Value: ⭐⭐⭐⭐ — Significant implications for the ranking paradigm in recommender systems.