# Post Hoc Regression Refinement via Pairwise Rankings
Conference: NeurIPS 2025 · arXiv: 2508.16495 · Code: ktirta/regref · Area: LLM/NLP · Keywords: regression refinement, pairwise ranking, inverse-variance weighting, LLM ranker, few-shot learning
## TL;DR
This paper proposes RankRefine, a model-agnostic post-processing regression refinement method that fuses predictions from a base regressor with estimates derived from pairwise rankings via inverse-variance weighting. Without any retraining, the method achieves up to 10% relative MAE reduction in molecular property prediction using only 20 pairwise comparisons and a general-purpose LLM.
## Background & Motivation
Accurate prediction of continuous attributes is critical in science and engineering. Molecular property prediction (MPP) serves as a representative example, where reliable estimation of physical and chemical properties can accelerate drug discovery, materials design, and catalyst development. Unlike computer vision or NLP, however, many specialized domains face a fundamental bottleneck: obtaining annotations requires expert-driven experiments that are both costly and time-consuming, so real-world tasks frequently operate under data-scarce conditions, where even 50 labeled samples can count as a sizable dataset.
An underutilized resource is expert knowledge expressed in relative comparative form: pairwise judgments (e.g., "which molecule has higher solubility?") are far easier to elicit than precise numerical annotations. Key insights include:
Humans excel at relative comparison rather than absolute estimation (Thurstone, 1927): pairwise tasks reduce cognitive load and avoid scale-interpretation bias.
General-purpose LLMs inherit this advantage: pretraining corpora are rich in relative statements, and RLHF alignment further reinforces comparative capabilities.
Existing methods either require retraining or address ranking without regression: a plug-and-play post-processing solution is lacking.
## Method
### Overall Architecture
The core idea of RankRefine is to fuse predictions from a base regressor with estimates derived from pairwise rankings.
Given:

- Base regressor output \(\hat{y}_0^{\text{reg}}\) and its uncertainty \(\sigma_{\text{reg}}^2\)
- A reference set \(\mathbb{D} = \{(x_i, y_i)\}_{i=1}^k\) (sampled directly from the training set)
- An external ranker (LLM or human expert) providing pairwise comparisons between \(x_0\) and each \(x_i\)
RankRefine computes a ranking-based estimate \(\hat{y}_0^{\text{rank}}\) and fuses the two estimates via inverse-variance weighting.
### Key Designs
Inverse-Variance Weighting Fusion (RankRefine Fusion Theorem):
If \(\hat{y}_0^{\text{reg}}\) and \(\hat{y}_0^{\text{rank}}\) are independent unbiased Gaussian estimators, the minimum-variance unbiased estimator is:

\[\hat{y}_0^{\text{post}} = \frac{\sigma_{\text{rank}}^2\,\hat{y}_0^{\text{reg}} + \sigma_{\text{reg}}^2\,\hat{y}_0^{\text{rank}}}{\sigma_{\text{reg}}^2 + \sigma_{\text{rank}}^2}, \qquad \sigma_{\text{post}}^2 = \frac{\sigma_{\text{reg}}^2\,\sigma_{\text{rank}}^2}{\sigma_{\text{reg}}^2 + \sigma_{\text{rank}}^2}\]

Core guarantee: \(\sigma_{\text{post}}^2 < \sigma_{\text{reg}}^2\), meaning any informative ranker with finite variance reduces the expected MAE.
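The fusion step is simply precision-weighted averaging. A minimal sketch in plain Python (the function name `fuse` is illustrative, not from the paper's codebase):

```python
def fuse(y_reg: float, var_reg: float, y_rank: float, var_rank: float):
    """Inverse-variance weighted fusion of two independent unbiased estimates.

    Returns the minimum-variance combined estimate and its variance.
    """
    w_reg = 1.0 / var_reg    # precision of the regressor estimate
    w_rank = 1.0 / var_rank  # precision of the ranking estimate
    y_post = (w_reg * y_reg + w_rank * y_rank) / (w_reg + w_rank)
    var_post = 1.0 / (w_reg + w_rank)  # always below min(var_reg, var_rank)
    return y_post, var_post
```

For two estimates of equal variance the fused value is the midpoint and the variance is halved, e.g. `fuse(2.0, 1.0, 4.0, 1.0)` returns `(3.0, 0.5)`.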
Ranking Estimation via the Bradley-Terry Model:
Pairwise probability \(P(x_i \succ x_j) = s(y_i - y_j)\), where \(s(z) = (1 + e^{-z})^{-1}\). The ranking estimate is obtained by minimizing the negative log-likelihood:

\[\hat{y}_0^{\text{rank}} = \arg\min_{y} \left[ -\sum_{i \in \mathbb{A}} \log s(y - y_i) \; - \sum_{j \in \mathbb{B}} \log s(y_j - y) \right]\]

where \(\mathbb{A}\) is the subset of reference items ranked below \(x_0\) and \(\mathbb{B}\) is the subset ranked above \(x_0\).
Variance of the Ranking Estimate (Lemma 3.2):
\[\sigma_{\text{rank}}^2 \approx \left[\sum_{i=1}^{k} s\!\left(\hat{y}_0^{\text{rank}} - y_i\right)\left(1 - s\!\left(\hat{y}_0^{\text{rank}} - y_i\right)\right)\right]^{-1}\]

This is derived via an inverse Fisher information approximation of the Bradley-Terry likelihood at \(\hat{y}_0^{\text{rank}}\).
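Since the Bradley-Terry negative log-likelihood is convex in \(y\), the estimate and its Fisher-information variance can be sketched with a simple grid search (helper names and the grid-search minimizer are illustrative simplifications; a proper 1-D optimizer would be used in practice):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def rank_estimate(refs_below, refs_above, grid):
    """Bradley-Terry estimate of y_0 from pairwise outcomes, plus its variance.

    refs_below: reference values judged below x_0 (x_0 wins the comparison)
    refs_above: reference values judged above x_0 (x_0 loses the comparison)
    grid: candidate values for y_0 searched exhaustively
    """
    def nll(y):  # negative log-likelihood of the observed comparisons
        loss = -sum(math.log(sigmoid(y - yi)) for yi in refs_below)
        loss -= sum(math.log(sigmoid(yj - y)) for yj in refs_above)
        return loss

    y_hat = min(grid, key=nll)
    # Fisher information: sum of s'(y - y_i) = s(1 - s) over all references
    info = sum(sigmoid(y_hat - yr) * (1.0 - sigmoid(y_hat - yr))
               for yr in list(refs_below) + list(refs_above))
    return y_hat, 1.0 / info
```

With references \(\{0, 1\}\) below and \(\{3, 4\}\) above, the estimate lands at 2.0 by symmetry.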
Quantitative Condition for Guaranteed Improvement (Implication 4):
For \(\text{MAE}_{\text{post}} \leq \alpha \cdot \text{MAE}_{\text{reg}}\), it suffices that:

\[\sigma_{\text{rank}}^2 \leq \frac{\alpha^2}{1 - \alpha^2}\,\sigma_{\text{reg}}^2\]

(under the Gaussian assumption, MAE is proportional to the standard deviation, so the MAE ratio equals \(\sigma_{\text{post}} / \sigma_{\text{reg}}\)).
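Assuming the Gaussian MAE-proportional-to-sigma relationship, the condition can be solved for the largest admissible ranker variance; a small self-checking sketch (helper name illustrative):

```python
def max_rank_variance(alpha: float, var_reg: float) -> float:
    """Largest sigma_rank^2 for which the fused estimator still satisfies
    MAE_post <= alpha * MAE_reg, from
    var_reg * var_rank / (var_reg + var_rank) <= alpha^2 * var_reg."""
    assert 0.0 < alpha < 1.0
    return alpha ** 2 / (1.0 - alpha ** 2) * var_reg
```

For example, targeting a 10% MAE reduction (\(\alpha = 0.9\)) with \(\sigma_{\text{reg}}^2 = 1\) tolerates a ranker variance of about 4.26, i.e. the ranking estimate may be substantially noisier than the regressor and still guarantee the target.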
### Loss & Training
RankRefine requires no training whatsoever; it is a purely post-processing method. The only requirements are:

- A base regressor capable of outputting predictions and uncertainty estimates (e.g., random forest, Gaussian process)
- An external ranker providing pairwise comparison results
- Regularization: when the ranker is overconfident, apply a tempered variance \(\sigma_{\text{rank}}^2 \leftarrow \max(\sigma_{\text{rank}}^2, c \cdot \sigma_{\text{reg}}^2)\)
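Putting the pieces together, a hedged end-to-end sketch of one refinement step, including the tempered-variance floor (all names are illustrative, and grid search stands in for a proper optimizer; this is not the authors' implementation):

```python
import math

def rankrefine(y_reg, var_reg, refs_below, refs_above, grid, c=1.0):
    """One post hoc refinement step: Bradley-Terry ranking estimate,
    tempered variance, then inverse-variance fusion.

    c: tempering constant; var_rank is floored at c * var_reg."""
    s = lambda z: 1.0 / (1.0 + math.exp(-z))
    nll = lambda y: (-sum(math.log(s(y - yi)) for yi in refs_below)
                     - sum(math.log(s(yj - y)) for yj in refs_above))
    y_rank = min(grid, key=nll)                       # ranking-based estimate
    info = sum(s(y_rank - yr) * (1.0 - s(y_rank - yr))
               for yr in list(refs_below) + list(refs_above))
    var_rank = max(1.0 / max(info, 1e-12), c * var_reg)  # tempered variance
    w = var_rank / (var_reg + var_rank)               # weight on the regressor
    return w * y_reg + (1.0 - w) * y_rank             # fused prediction
```

Raising `c` shifts weight back toward the base regressor, which is the intended safeguard against an overconfident ranker.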
## Key Experimental Results
### Main Results
Molecular property prediction with LLM as ranker (ChatGPT-4o, 20 pairwise comparisons):
| Dataset | Pairwise Ranking Accuracy | β (MAE_post/MAE_reg) |
|---|---|---|
| Lipophilicity | 0.622±0.008 | 0.957±0.012 |
| Solubility | 0.693±0.035 | 0.934±0.048 |
| VDss | 0.605±0.010 | 0.895±0.053 |
| Caco2 | 0.660±0.013 | 0.970±0.027 |
| Half Life | 0.602±0.014 | 0.971±0.005 |
| FreeSolv | 0.681±0.050 | 0.937±0.012 |
Even with ranking accuracy as low as 60%–69%, RankRefine consistently reduces MAE (β < 1).
Human ranker experiment (age estimation, 6 participants, 15 reference individuals):
| MAE_reg | Pairwise Ranking Accuracy | β |
|---|---|---|
| 6.343±0.610 | 0.759±0.052 | 0.954±0.046 |
Human pairwise judgments (76% accuracy) reduce age estimation MAE by approximately 5%, validating practical utility in human-in-the-loop settings.
### Ablation Study
Effect of ranking accuracy and reference set size (9 TDC datasets, oracle ranker):
On most datasets, RankRefine yields improvement (β < 1) with ranking accuracy as low as 0.55 and as few as \(k=10\) comparisons. \(k=20\) is generally optimal; further increasing to \(k=30\) yields marginal additional gains.
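A toy Monte-Carlo version of this ablation, with a synthetic ranker that answers each pairwise comparison correctly with probability `acc`, illustrates the trend (this simplified setup is not the paper's protocol and does not reproduce its exact accuracy thresholds; it only shows that a more accurate ranker drives β down):

```python
import math
import random

def simulate_beta(acc: float, k: int = 20, n: int = 150,
                  sigma_reg: float = 1.0, seed: int = 0) -> float:
    """Refine a Gaussian-noise regressor with a synthetic noisy ranker.

    Returns beta = MAE_post / MAE_reg over n simulated queries."""
    rng = random.Random(seed)
    s = lambda z: 1.0 / (1.0 + math.exp(-z))
    grid = [i * 0.1 for i in range(-20, 121)]  # candidate y values in [-2, 12]
    err_reg = err_post = 0.0
    for _ in range(n):
        y0 = rng.uniform(0.0, 10.0)             # true value
        y_reg = y0 + rng.gauss(0.0, sigma_reg)  # base regressor prediction
        refs = [rng.uniform(0.0, 10.0) for _ in range(k)]
        below, above = [], []
        for yr in refs:                          # noisy pairwise labels
            truth = y0 > yr
            label = truth if rng.random() < acc else not truth
            (below if label else above).append(yr)
        nll = lambda y: (-sum(math.log(s(y - b)) for b in below)
                         - sum(math.log(s(a - y)) for a in above))
        y_rank = min(grid, key=nll)              # Bradley-Terry estimate
        info = sum(s(y_rank - yr) * (1.0 - s(y_rank - yr)) for yr in refs)
        var_rank = 1.0 / max(info, 1e-9)         # inverse Fisher information
        w = var_rank / (sigma_reg ** 2 + var_rank)
        y_post = w * y_reg + (1.0 - w) * y_rank  # inverse-variance fusion
        err_reg += abs(y_reg - y0)
        err_post += abs(y_post - y0)
    return err_post / err_reg
```

With a perfect ranker (`acc=1.0`) the simulated β drops well below 1, while a coin-flip ranker (`acc=0.5`) yields no benefit, consistent with the accuracy-threshold picture above.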
Comparison with baselines (\(k=30\)):
| Method | Ranking Accuracy 0.5–0.95 | Ranking Accuracy >0.95 |
|---|---|---|
| RankRefine vs. Projection (Yan et al. 2024) | RankRefine wins | Projection wins |
| RankRefine vs. RbR (Gonçalves et al. 2023) | RankRefine wins overall | RankRefine wins |
RankRefine outperforms the projection baseline across the practically feasible accuracy range of 0.50–0.95, falling slightly behind only when ranking is near-perfect.
Robustness experiments:

- With regressor bias up to 60% of the SD, β remains < 1 (improvement holds)
- With biased reference-set sampling (10% coverage), β = 0.884 at RB=90% with 60% ranking accuracy (still an 11.6% improvement)
- Under distribution shift (non-overlapping reference and query sets), ranking accuracy ≥ 65% still yields improvement
## Key Findings
- Ranking accuracy threshold is remarkably low: accuracy as low as 55% suffices for improvement, well below intuitive expectation
- LLMs do not rely purely on memorization: on proprietary compound–activity datasets not publicly available, ChatGPT-4o still achieves 60.14% pairwise ranking accuracy
- Cross-domain generalization: RankRefine is effective across molecular prediction, tabular data (agriculture, education, international fees)
- Near-perfect rankers exhibit slight degradation: due to overconfidence in Fisher information and boundary extrapolation issues
## Highlights & Insights
- Theoretical elegance: the guarantee that any finite-variance ranker improves regression (Corollary 3.2.1) provides a rigorous theoretical foundation
- Extreme practicality: no retraining, no architectural changes, and negligible computational cost (only LLM API calls required)
- Human–machine collaborative self-correction: humans + RankRefine can improve prediction quality without modifying any model
- Cognitive science inspiration: the method leverages the cognitive science finding that humans (and LLMs) excel at relative comparison
- Particularly valuable under data scarcity: meaningful improvements are achievable with only 50 training samples and 20 LLM pairwise comparisons
## Limitations & Future Work
- Theoretical assumptions (Gaussian, unbiased, independent errors) may not hold in practice, especially under heavy-tailed or skewed noise
- The method relies on well-calibrated variance estimates from both the regressor and ranker; miscalibration may offset or diminish gains
- Oracle experiments assume uniformly random ranking errors, whereas real rankers may exhibit systematic biases (e.g., consistently failing at extreme values)
- The use of the Bradley-Terry model on true attribute values (vs. learned latent scores) introduces a modeling mismatch
- The method handles only scalar regression; extension to multivariate or structured targets (e.g., full pharmacokinetic profiles) remains unexplored
## Related Work & Insights
- Complementary to RankUp (Huang et al. 2024, joint training of regression and ranking): RankRefine is purely post-processing and does not alter training
- Distinguished from Pairwise Difference Regression (Tynes et al. 2021): the latter requires training to predict pairwise differences, whereas RankRefine does not
- Suggested direction: leveraging LLM reasoning capabilities (rationales) to provide interpretable ranking justifications, enhancing trust in decision-critical domains
## Rating
- Novelty: ⭐⭐⭐⭐ Inverse-variance weighting fusion is not novel per se, but the framework design integrating pairwise rankings into post-processing regression is elegant and theoretically well-grounded
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 9 molecular datasets + 3 tabular datasets + human experiments + LLM experiments + extensive ablations
- Writing Quality: ⭐⭐⭐⭐⭐ The theory–experiment–analysis logical chain is complete, derivations are clear, and experimental design is thorough
- Value: ⭐⭐⭐⭐ Highly practical for data-scarce domains, though applicable scenarios may be concentrated in scientific computing