
Beyond Pairwise: Empowering LLM Alignment With Ranked Choice Modeling

Conference: ICLR 2026
arXiv: 2510.23631
Code: None
Area: LLM Alignment / Preference Optimization
Keywords: preference optimization, ranked choice, DPO, Mallows model, multinomial logit, alignment

TL;DR

RCPO (Ranked Choice Preference Optimization) extends LLM alignment from pairwise preferences to ranked choice feedback. It unifies utility-based (MNL) and rank-based (Mallows-RMJ) choice models under a single maximum likelihood estimation (MLE) objective, and it outperforms DPO and its variants with both single-best and top-k feedback.

Background & Motivation

Background: DPO and its variants (SimPO, R-DPO, AlphaPO, etc.) have become mainstream methods for LLM alignment. However, they rely on pairwise preferences—comparing only two responses (preferred vs. dispreferred) per prompt.

Limitations of Prior Work: Real-world annotation feedback is far richer than pairwise comparisons. For instance, InstructGPT collects rankings of \(K\) responses but decomposes them into \(\binom{K}{2}\) pairs for training. Academic work often retains only the highest and lowest scoring pairs. This "pairwise compression" loses intermediate ranking information and may distort the original preference structure.

Key Challenge: Annotators provide multi-way comparisons and rankings, but training algorithms consume only pairwise data, so information is wasted and the preference structure is distorted at the same time.

Goal: Design an alignment framework that trains directly on ranked choice feedback (single-best and top-k rankings).

Key Insight: Discrete choice models from economics and operations research provide a mature theoretical basis for handling multi-way choice and ranking data. By treating prompts as contexts, responses as items, and candidate sets as assortments, LLM alignment maps directly onto the MLE of a choice model.

Core Idea: Unify LLM preference optimization under choice model theory. DPO corresponds to the Bradley-Terry model as a special case, and richer choice models such as MNL and Mallows can be substituted for it.

Method

Overall Architecture

RCPO reinterprets preference optimization as a discrete choice problem. An annotator chooses the best response (single-best) or ranks the top \(k\) (top-k) from a set of candidates. Following this analogy, the prompt \(x\) is the context, each response \(y\) is an item, and the candidate set \(S\) is the assortment. The alignment objective is formulated as the MLE of a choice model \(g\):

\[\max_{\pi_\theta} \sum_i \log g\big(\mu_i^k, S_i, \{r_{\pi_\theta}(x_i, y)\}_{y \in S_i}\big),\quad r_{\pi_\theta}(x,y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}\]

Here, \(\mu_i^k\) is the top-k ranking, and \(r_{\pi_\theta}\) is the implicit reward (log-likelihood ratio of the policy relative to the reference). The choice model \(g\) is a replaceable component: using Bradley-Terry recovers DPO, while using stronger models allows the framework to ingest ranked choice data directly. The paper instantiates this using two models with closed-form solutions: MNL (utility-based) and Mallows-RMJ (ordinal-based).
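The implicit reward defined above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `implicit_rewards`, the example log-probabilities, and the value of `beta` are all assumptions made for the example.

```python
import numpy as np

def implicit_rewards(policy_logps, ref_logps, beta=0.1):
    """r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)), one value per candidate."""
    policy_logps = np.asarray(policy_logps, dtype=float)
    ref_logps = np.asarray(ref_logps, dtype=float)
    return beta * (policy_logps - ref_logps)

# Hypothetical sequence log-probabilities for three candidates of one prompt.
rewards = implicit_rewards([-12.0, -15.0, -20.0], [-13.0, -14.0, -20.0], beta=0.1)
```

A candidate receives a positive implicit reward when the policy assigns it more probability than the reference does; these per-candidate values are what the choice model \(g\) consumes.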

This work focuses on preference modeling and loss functions. The core method is "replacing the choice model \(g\) to rewrite the MLE objective" without a multi-stage pipeline; thus, no architecture diagram is provided.

Key Designs

1. MNL (Multinomial Logit) Branch: Generalizing Bradley-Terry from "Pick 1 of 2" to "Pick 1 of N and Top-K"

The Bradley-Terry model underlying DPO compares only two responses at a time. MNL extends the candidate set to arbitrary size: when the random utility terms are i.i.d. Gumbel noise, the selection probability is a softmax over the candidate set. In the single-best format, the loss is \(-\log\sigma\big(-\log\sum_{y_i \in S \setminus \{y_w\}} \exp(f_\theta(x, y_i, y_w))\big)\), where \(f_\theta(x, y_i, y_w) = r_{\pi_\theta}(x, y_i) - r_{\pi_\theta}(x, y_w)\) is the implicit reward gap, so the expression equals the negative log softmax probability of \(y_w\). The logsumexp term compares the preferred response against all non-preferred responses simultaneously. For top-k, the likelihood chains \(k\) softmax terms, each selecting the next-best response from the remaining candidates. DPO is the minimal special case with \(|S|=2, k=1\).
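The chained-softmax likelihood described above can be sketched as follows. This is an illustrative NumPy version under the assumption that the implicit rewards have already been computed; the names `mnl_topk_loss` and `dpo_loss` are hypothetical and not taken from the paper.

```python
import numpy as np

def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    values = np.asarray(values, dtype=float)
    m = np.max(values)
    return m + np.log(np.sum(np.exp(values - m)))

def mnl_topk_loss(rewards, ranking):
    """Negative log-likelihood of a top-k ranking under MNL.

    rewards: implicit rewards for every candidate in S.
    ranking: indices of the k chosen items, best first (k=1 is single-best).
    """
    rewards = np.asarray(rewards, dtype=float)
    remaining = list(range(len(rewards)))
    loss = 0.0
    for idx in ranking:
        # One softmax stage: pick idx among the candidates still remaining.
        loss += logsumexp(rewards[remaining]) - rewards[idx]
        remaining.remove(idx)
    return loss

def dpo_loss(r_w, r_l):
    """Bradley-Terry / DPO loss: -log sigmoid(r_w - r_l)."""
    return np.log1p(np.exp(-(r_w - r_l)))
```

With two candidates and `ranking=[0]`, `mnl_topk_loss` reduces to `dpo_loss`, which mirrors the statement that DPO is the \(|S|=2, k=1\) special case.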

2. Mallows-RMJ Branch: Ordinal Modeling Independent of Reward Magnitude

While MNL expands the candidate set, it still relies on cardinal reward values, which makes it sensitive to reward model noise. Mallows-RMJ assumes that the probability of a ranking \(\mu\) decays exponentially with its distance from a central ranking \(\mu_0\). The probability of a response being selected is proportional to \(\phi(x)^{d(y_i, S)}\), where \(d(y_i, S)\) is the relative rank of \(y_i\) within \(S\). In the single-best format, the loss essentially counts how many non-preferred items have rewards exceeding that of the preferred item. In the top-k format, it extends to pairwise comparisons along the ranking chain. Because it uses only ordinal information, it is inherently more robust to fluctuations in reward values.

Two engineering challenges are addressed for SGD compatibility: first, the dispersion parameter \(\phi(x)\) is estimated using an entropy proxy; second, the indicator function for reward ranking is approximated using a smooth sigmoid to ensure differentiability while maintaining preference structure semantics.
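A rough sketch of the sigmoid-smoothed counting loss for the single-best case is given below. It rests on several assumptions that are not confirmed by this summary: `phi` is passed in as a fixed number in (0, 1) rather than estimated with the entropy proxy, the normalizing constant of the choice probability is treated as independent of the policy and dropped, and `temperature` is a hypothetical smoothing knob.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mallows_rmj_single_best_loss(rewards, winner, phi, temperature=1.0):
    """Smoothed -log(phi ** d), where d counts candidates out-rewarding the winner.

    Assumptions (illustrative only): phi in (0, 1) is given, and the
    normalizer of the choice probability is omitted.
    """
    rewards = np.asarray(rewards, dtype=float)
    others = np.delete(rewards, winner)
    # Soft version of the indicator 1[r(y_i) > r(y_w)] for each non-preferred item.
    soft_count = np.sum(sigmoid((others - rewards[winner]) / temperature))
    return -np.log(phi) * soft_count

# Hypothetical rewards: the loss is small when the labeled winner already ranks first.
good = mallows_rmj_single_best_loss([2.0, 0.0, -1.0], winner=0, phi=0.5)
bad = mallows_rmj_single_best_loss([2.0, 0.0, -1.0], winner=2, phi=0.5)
```

Only the sign and rough size of each reward gap matter here, which is one way to read the claim that the ordinal branch is less sensitive to reward magnitudes.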

Loss & Training

On the UltraFeedback dataset, multiple responses are sampled per prompt and ranked using the Skywork-Reward-V2 model. Data is formatted into pairwise, single-best, or top-k structures. The parameter \(\beta\) controls the deviation from the reference policy. The candidate set size \(|S|\) can vary per prompt, allowing finer-grained preference modeling for complex prompts.
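The three feedback formats can be derived from one scored candidate set roughly as follows. This is a hedged sketch: the function name, the dictionary layout, and the choice of highest-versus-lowest for the pairwise format are assumptions for illustration, and the scores stand in for reward model outputs.

```python
def format_feedback(responses, scores, k=2):
    """Build pairwise, single-best, and top-k records from scored responses."""
    # Rank candidate indices by score, best first.
    order = sorted(range(len(responses)), key=lambda i: scores[i], reverse=True)
    ranked = [responses[i] for i in order]
    return {
        # Assumption: pairwise keeps only the best and worst responses.
        "pairwise": {"chosen": ranked[0], "rejected": ranked[-1]},
        "single_best": {"chosen": ranked[0], "candidates": ranked},
        "top_k": {"ranking": ranked[:k], "candidates": ranked},
    }

# Hypothetical responses and reward scores for one prompt.
example = format_feedback(["a", "b", "c", "d"], [0.2, 0.9, -0.3, 0.5], k=2)
```

The pairwise record discards the two middle responses, whereas the top-k record keeps their order, which is the information gap the framework targets.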

Key Experimental Results

Main Results: Llama-3-8B-Instruct

| Method | AlpacaEval LC ↑ | AlpacaEval WR ↑ | Arena-Hard WR ↑ | UltraFeedback WR ↑ |
| --- | --- | --- | --- | --- |
| DPO | 41.24 | 40.24 | 32.6 | 62.36 |
| SimPO | 44.15 | 38.84 | 33.5 | 50.17 |
| DPO-AllPairs | 33.02 | 38.47 | 29.6 | 51.95 |
| Mallows-RMJ-Pairwise | 39.33 | 48.71 | | |

The strongest variant, Mallows-RMJ-PO-Top-2, outperforms the strongest non-RCPO baseline (IPO) by 4.00 / 19.5 / 6.2 / 9.47 percentage points across AlpacaEval LC/WR, Arena-Hard WR, and UltraFeedback WR, respectively.

Main Results (Multi-model)

RCPO consistently outperforms or matches DPO and SimPO across Llama-3-8B, Gemma-2-9B, and Mistral-7B.

Ablation Study

  • DPO-AllPairs: Decomposing rankings into all possible pairs actually degrades performance, confirming that pairwise compression distorts information.
  • Mallows-RMJ: Outperforms DPO even in the pairwise setting, suggesting rank-based models are inherently better suited for preference learning.
  • Top-k Feedback: Further improves performance, validating the value of richer feedback formats.

Key Findings

  • The Mallows-RMJ variants perform best, particularly on AlpacaEval WR (+8-10 pp), indicating that robustness to reward noise is a critical advantage.
  • Gradient analysis reveals that Mallows-RMJ performs adaptive weighting: it assigns higher weights to prompts with low dispersion and to pairs with close rewards, effectively performing "hard negative mining."
  • MNL's multi-way extension also yields improvements, but they are smaller than those of Mallows-RMJ.
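The adaptive weighting in the gradient finding can be illustrated with a short worked equation. It assumes a smoothed single-best loss of the form \(\ell = -\log\phi(x)\sum_{y_i \in S \setminus \{y_w\}} \sigma(\Delta_i)\) with \(\Delta_i = r_{\pi_\theta}(x, y_i) - r_{\pi_\theta}(x, y_w)\) and \(\phi(x) \in (0,1)\) treated as a constant; this form is an assumption consistent with the description in this summary, not necessarily the paper's exact expression.

\[\nabla_\theta \ell = \underbrace{-\log\phi(x)}_{\text{prompt weight}} \sum_{y_i \in S \setminus \{y_w\}} \underbrace{\sigma(\Delta_i)\big(1 - \sigma(\Delta_i)\big)}_{\text{pair weight}} \, \nabla_\theta \Delta_i\]

Under this assumed form, the prompt weight \(-\log\phi(x)\) grows as \(\phi(x)\) shrinks (low dispersion), and the pair weight \(\sigma(\Delta_i)(1-\sigma(\Delta_i))\) peaks at \(\Delta_i = 0\), so pairs with nearly equal rewards contribute the most, which matches the "hard negative mining" reading.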

Highlights & Insights

  • Bridging Choice Theory and Alignment: Systematically introducing discrete choice theory to LLM alignment provides a theoretical framework for designing new algorithms. DPO, SimPO, and R-DPO are viewed as special cases.
  • Rank-based vs. Utility-based Insights: Mallows-RMJ, by using only ordinal relations, is more robust than MNL (which relies on precise reward values). This suggests that when reward models are noisy, rank-based methods are superior.
  • Information Efficiency: Directly training on top-k rankings is more efficient and effective than decomposing them into \(\binom{K}{2}\) pairs, providing direct guidance for preference data collection.

Limitations & Future Work

  • Experiments primarily involve 7-9B models; larger models require validation.
  • Ranking feedback is generated by reward models rather than real human annotations, potentially inheriting systematic biases.
  • The use of an entropy proxy for the dispersion parameter \(\phi(x)\) has not been fully validated.
  • The study focuses on single-best and top-k, leaving other ranking models (e.g., Plackett-Luce, Thurstone) unexplored.

Related Work

  • vs. DPO (Rafailov et al., 2023): DPO is a special case of RCPO (Bradley-Terry choice model with pairwise feedback). RCPO extends both the preference format and the choice model.
  • vs. SimPO (Meng et al., 2024): SimPO uses length-normalized log-likelihood but is limited to pairwise comparisons. It can be integrated into the RCPO framework.
  • vs. Align Once (MLC): MLC focuses on cross-lingual consistency, while RCPO focuses on the information efficiency of preference feedback. They are complementary.

Rating

  • Novelty: ⭐⭐⭐⭐ Systematically applying choice theory to LLM alignment is a novel theoretical contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Tested across 3 models and multiple baselines, though limited to 7-9B scales.
  • Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theoretical derivation and clear framework presentation with insightful gradient analysis.
  • Value: ⭐⭐⭐⭐ Provides a generalized framework for alignment, with Mallows-RMJ showing high practical utility.