
Post Hoc Regression Refinement via Pairwise Rankings

Conference: NeurIPS 2025 arXiv: 2508.16495 Code: ktirta/regref Area: LLM/NLP Keywords: regression refinement, pairwise ranking, inverse-variance weighting, LLM ranker, few-shot learning

TL;DR

This paper proposes RankRefine, a model-agnostic post-processing regression refinement method that fuses predictions from a base regressor with estimates derived from pairwise rankings via inverse-variance weighting. Without any retraining, the method achieves up to 10% relative MAE reduction in molecular property prediction using only 20 pairwise comparisons and a general-purpose LLM.

Background & Motivation

Accurate prediction of continuous attributes is critical in science and engineering. Molecular property prediction (MPP) serves as a representative example, where reliable estimation of physical and chemical properties can accelerate drug discovery, materials design, and catalyst development. Unlike computer vision or NLP, however, many specialized domains face a fundamental bottleneck: obtaining annotations requires expert-driven experiments that are both costly and time-consuming, and real-world tasks frequently operate under data-scarce conditions, where even 50 labeled samples may be all that is available.

An underutilized resource is expert knowledge expressed in relative comparative form: pairwise judgments (e.g., "which molecule has higher solubility?") are far easier to elicit than precise numerical annotations. Key insights include:

Humans excel at relative comparison rather than absolute estimation (Thurstone, 1927): pairwise tasks reduce cognitive load and avoid scale-interpretation bias.

General-purpose LLMs inherit this advantage: pretraining corpora are rich in relative statements, and RLHF alignment further reinforces comparative capabilities.

Existing methods either require retraining or address ranking without regression: a plug-and-play post-processing solution is lacking.

Method

Overall Architecture

The core idea of RankRefine is to fuse predictions from a base regressor with estimates derived from pairwise rankings.

Given:

- Base regressor output \(\hat{y}_0^{\text{reg}}\) and its uncertainty \(\sigma_{\text{reg}}^2\)
- A reference set \(\mathbb{D} = \{(x_i, y_i)\}_{i=1}^k\), sampled directly from the training set
- An external ranker (LLM or human expert) providing pairwise comparisons between \(x_0\) and each \(x_i\)

RankRefine computes a ranking-based estimate \(\hat{y}_0^{\text{rank}}\) and fuses the two estimates via inverse-variance weighting.

Key Designs

Inverse-Variance Weighting Fusion (RankRefine Fusion Theorem):

If \(\hat{y}_0^{\text{reg}}\) and \(\hat{y}_0^{\text{rank}}\) are independent unbiased Gaussian estimators, the minimum-variance unbiased estimator is:

\[\hat{y}_0^* = \sigma_{\text{post}}^2 \left(\frac{\hat{y}_0^{\text{reg}}}{\sigma_{\text{reg}}^2} + \frac{\hat{y}_0^{\text{rank}}}{\sigma_{\text{rank}}^2}\right), \quad \sigma_{\text{post}}^2 = \left(\frac{1}{\sigma_{\text{reg}}^2} + \frac{1}{\sigma_{\text{rank}}^2}\right)^{-1}\]

Core guarantee: \(\sigma_{\text{post}}^2 < \sigma_{\text{reg}}^2\), meaning any informative ranker with finite variance reduces the expected MAE.
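The fusion rule is a two-term precision-weighted average, so it can be sketched in a few lines of plain Python (a minimal illustration; the function name `fuse` is mine, not from the paper):

```python
def fuse(y_reg, var_reg, y_rank, var_rank):
    """Inverse-variance weighted fusion of two independent unbiased estimates.

    Returns the fused estimate and its posterior variance, which is always
    strictly smaller than either input variance when both are finite.
    """
    var_post = 1.0 / (1.0 / var_reg + 1.0 / var_rank)
    y_post = var_post * (y_reg / var_reg + y_rank / var_rank)
    return y_post, var_post


# Equal variances -> the fused estimate is the midpoint, variance is halved.
y_post, var_post = fuse(1.0, 1.0, 3.0, 1.0)  # -> (2.0, 0.5)
```

Note that the posterior variance \(\sigma_{\text{post}}^2\) depends only on the two input variances, not on the estimates themselves, which is what makes the improvement guarantee independent of any particular prediction.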

Ranking Estimation via the Bradley-Terry Model:

Pairwise probability \(P(x_i \succ x_j) = s(y_i - y_j)\), where \(s(z) = (1 + e^{-z})^{-1}\). The ranking estimate is obtained by minimizing the negative log-likelihood:

\[\hat{y}_0^{*\text{rank}} = \arg\min_{\hat{y}_0^{\text{rank}}} \left[-\sum_{x_i \in \mathbb{A}} \log s(\hat{y}_0^{\text{rank}} - y_i) - \sum_{x_j \in \mathbb{B}} \log(1 - s(\hat{y}_0^{\text{rank}} - y_j))\right]\]

where \(\mathbb{A}\) is the subset of reference items ranked below \(x_0\) and \(\mathbb{B}\) is the subset ranked above \(x_0\).

Variance of the Ranking Estimate (Lemma 3.2):

\[\sigma_{\text{rank}}^2 \approx \left[\sum_{y_i \in \mathbb{A} \cup \mathbb{B}} s(\Delta_i)(1 - s(\Delta_i))\right]^{-1}, \quad \Delta_i = \hat{y}_0^{*\text{rank}} - y_i\]

This is derived via inverse Fisher information approximation.
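The Bradley-Terry negative log-likelihood above is convex in \(\hat{y}_0^{\text{rank}}\), so its minimizer can be found by bisecting on the score (gradient) equation; the Fisher-information variance then falls out of the same sigmoid terms. A self-contained sketch under those definitions (function names are mine; fixed bisection bounds are an assumption for illustration, and both reference subsets are assumed non-empty):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rank_estimate(below, above, lo=-100.0, hi=100.0, iters=80):
    """Bradley-Terry MLE of the query value from pairwise comparisons.

    below: reference values ranked below the query (set A)
    above: reference values ranked above the query (set B)
    Returns (estimate, approximate variance via inverse Fisher information).
    """
    def grad(y0):
        # Derivative of the negative log-likelihood; monotone increasing in y0.
        g = sum(-(1.0 - sigmoid(y0 - yi)) for yi in below)
        g += sum(sigmoid(y0 - yj) for yj in above)
        return g

    # Bisection: the root of the monotone gradient is the unique minimizer.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    y0 = 0.5 * (lo + hi)

    # Lemma 3.2: variance ~ inverse Fisher information at the optimum.
    info = sum(sigmoid(y0 - y) * (1.0 - sigmoid(y0 - y)) for y in below + above)
    return y0, 1.0 / info


# Symmetric case: one reference below at 0, one above at 2 -> estimate 1.0.
y0, var_rank = rank_estimate([0.0], [2.0])
```

Comparisons near the estimate (\(s(\Delta_i)\) close to 0.5) contribute the most Fisher information, which matches the intuition that close calls are the most informative.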

Quantitative Condition for Guaranteed Improvement (Implication 4):

For \(\text{MAE}_{\text{post}} \leq \alpha \cdot \text{MAE}_{\text{reg}}\), it suffices that:

\[\sigma_{\text{rank}}^2 \leq \frac{\alpha^2 \sigma_{\text{reg}}^2}{1 - \alpha^2}\]
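This bound is easy to evaluate in practice before querying the ranker. A one-line helper (hypothetical name, directly transcribing the inequality above):

```python
def rank_variance_bound(alpha, var_reg):
    """Largest sigma_rank^2 that still guarantees MAE_post <= alpha * MAE_reg,
    per Implication 4: sigma_rank^2 <= alpha^2 * sigma_reg^2 / (1 - alpha^2)."""
    assert 0.0 < alpha < 1.0
    return (alpha ** 2) * var_reg / (1.0 - alpha ** 2)


# Example: to guarantee a 5% MAE reduction (alpha = 0.95) with var_reg = 1.0,
# the ranker variance may be as large as ~9.26 -- a very loose requirement.
bound = rank_variance_bound(0.95, 1.0)
```

The bound grows rapidly as \(\alpha \to 1\), which is why even a weak ranker (accuracy near 55%) can clear it.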

Loss & Training

RankRefine requires no training whatsoever; it is a purely post-processing method. The only requirements are:

- A base regressor capable of outputting predictions and uncertainty estimates (e.g., random forest, Gaussian process)
- An external ranker providing pairwise comparison results
- Regularization: when the ranker is overconfident, apply tempered variance \(\sigma_{\text{rank}}^2 \leftarrow \max(\sigma_{\text{rank}}^2, c \cdot \sigma_{\text{reg}}^2)\)
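The tempering step is a simple variance floor; a minimal sketch (the name `temper_rank_variance` and the default \(c = 0.1\) are my illustrative choices, not values from the paper):

```python
def temper_rank_variance(var_rank, var_reg, c=0.1):
    """Guard against an overconfident ranker by flooring its variance
    at a fraction c of the regressor's variance, so the ranking estimate
    can never dominate the fusion entirely."""
    return max(var_rank, c * var_reg)


# An implausibly small ranker variance gets raised to the floor c * var_reg;
# a reasonable one passes through unchanged.
tempered = temper_rank_variance(0.01, 1.0, c=0.1)  # -> 0.1
```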

Key Experimental Results

Main Results

Molecular property prediction with LLM as ranker (ChatGPT-4o, 20 pairwise comparisons):

| Dataset | Pairwise Ranking Accuracy | β (MAE_post / MAE_reg) |
|---|---|---|
| Lipophilicity | 0.622±0.008 | 0.957±0.012 |
| Solubility | 0.693±0.035 | 0.934±0.048 |
| VDss | 0.605±0.010 | 0.895±0.053 |
| Caco2 | 0.660±0.013 | 0.970±0.027 |
| Half Life | 0.602±0.014 | 0.971±0.005 |
| FreeSolv | 0.681±0.050 | 0.937±0.012 |

Even with ranking accuracy as low as 60%–69%, RankRefine consistently reduces MAE (β < 1).

Human ranker experiment (age estimation, 6 participants, 15 reference individuals):

| MAE_reg | Pairwise Ranking Accuracy | β |
|---|---|---|
| 6.343±0.610 | 0.759±0.052 | 0.954±0.046 |

Human pairwise judgments (76% accuracy) reduce age estimation MAE by approximately 5%, validating practical utility in human-in-the-loop settings.

Ablation Study

Effect of ranking accuracy and reference set size (9 TDC datasets, oracle ranker):

On most datasets, RankRefine yields improvement (β < 1) with ranking accuracy as low as 0.55 and as few as \(k=10\) comparisons. \(k=20\) is generally optimal; further increasing to \(k=30\) yields marginal additional gains.

Comparison with baselines (\(k=30\)):

| Comparison | Ranking Accuracy 0.50–0.95 | Ranking Accuracy > 0.95 |
|---|---|---|
| RankRefine vs. Projection (Yan et al. 2024) | RankRefine wins | Projection wins |
| RankRefine vs. RbR (Gonçalves et al. 2023) | RankRefine wins overall | RankRefine wins |

RankRefine outperforms the projection baseline across the practically feasible accuracy range of 0.50–0.95, falling slightly behind only when ranking is near-perfect.

Robustness experiments:

- With regressor bias up to 60% of the label SD, β remains < 1 (improvement holds)
- With biased reference-set sampling (10% coverage), β = 0.884 at RB = 90% and 60% ranking accuracy (still an 11.6% improvement)
- Under distribution shift (non-overlapping reference and query sets), ranking accuracy ≥ 65% still yields improvement

Key Findings

  1. Ranking accuracy threshold is remarkably low: accuracy as low as 55% suffices for improvement, well below intuitive expectation
  2. LLMs do not rely purely on memorization: on proprietary compound–activity datasets not publicly available, ChatGPT-4o still achieves 60.14% pairwise ranking accuracy
  3. Cross-domain generalization: RankRefine is effective across molecular prediction, tabular data (agriculture, education, international fees)
  4. Near-perfect rankers exhibit slight degradation: due to overconfidence in Fisher information and boundary extrapolation issues

Highlights & Insights

  1. Theoretical elegance: the guarantee that any finite-variance ranker improves regression (Corollary 3.2.1) provides a rigorous theoretical foundation
  2. Extreme practicality: no retraining, no architectural changes, and negligible computational cost (only LLM API calls required)
  3. Human–machine collaborative self-correction: humans + RankRefine can improve prediction quality without modifying any model
  4. Cognitive science inspiration: the method leverages the cognitive science finding that humans (and LLMs) excel at relative comparison
  5. Particularly valuable under data scarcity: meaningful improvements are achievable with only 50 training samples and 20 LLM pairwise comparisons

Limitations & Future Work

  1. Theoretical assumptions (Gaussian, unbiased, independent errors) may not hold in practice, especially under heavy-tailed or skewed noise
  2. The method relies on well-calibrated variance estimates from both the regressor and ranker; miscalibration may offset or diminish gains
  3. Oracle experiments assume uniformly random ranking errors, whereas real rankers may exhibit systematic biases (e.g., consistently failing at extreme values)
  4. The use of the Bradley-Terry model on true attribute values (vs. learned latent scores) introduces a modeling mismatch
  5. The method handles only scalar regression; extension to multivariate or structured targets (e.g., full pharmacokinetic profiles) remains unexplored
  • Complementary to RankUp (Huang et al. 2024, joint training of regression and ranking): RankRefine is purely post-processing and does not alter training
  • Distinguished from Pairwise Difference Regression (Tynes et al. 2021): the latter requires training to predict pairwise differences, whereas RankRefine does not
  • Suggested direction: leveraging LLM reasoning capabilities (rationales) to provide interpretable ranking justifications, enhancing trust in decision-critical domains

Rating

  • Novelty: ⭐⭐⭐⭐ Inverse-variance weighting fusion is not novel per se, but the framework design integrating pairwise rankings into post-processing regression is elegant and theoretically well-grounded
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 9 molecular datasets + 3 tabular datasets + human experiments + LLM experiments + extensive ablations
  • Writing Quality: ⭐⭐⭐⭐⭐ The theory–experiment–analysis logical chain is complete, derivations are clear, and experimental design is thorough
  • Value: ⭐⭐⭐⭐ Highly practical for data-scarce domains, though applicable scenarios may be concentrated in scientific computing