
Provably Efficient Online RLHF with One-Pass Reward Modeling

Conference: NeurIPS 2025 arXiv: 2502.07193 Code: github.com/ZinYY/Online_RLHF Area: LLM Alignment Keywords: online RLHF, reward modeling, online mirror descent, contextual dueling bandit, computational efficiency

TL;DR

This paper proposes a one-pass reward modeling method based on online mirror descent (OMD) that eliminates the computational bottleneck in online RLHF — namely, storing all historical data and re-optimizing from scratch at each iteration — achieving \(\mathcal{O}(1)\) time and memory complexity per iteration while also improving upon MLE methods in statistical efficiency.

Background & Motivation

RLHF (Reinforcement Learning from Human Feedback) is a core technique for aligning LLMs. Conventional RLHF relies on fixed preference datasets, but limited coverage makes it difficult for reward models to generalize to out-of-distribution samples. Online RLHF addresses this by iteratively collecting data and continuously improving the model, a strategy whose effectiveness has been demonstrated by Claude and LLaMA-2.

However, online RLHF faces a severe computational bottleneck:

  • Each iteration requires incorporating new data into the historical dataset and re-optimizing the reward model over the entire dataset
  • Under MLE estimation, the computational complexity at round \(t\) is \(\mathcal{O}(t \log t)\) with \(\mathcal{O}(t)\) memory
  • This becomes prohibitive over long training horizons, especially on edge devices

Core Problem: Can one design an online RLHF algorithm that is both statistically and computationally efficient?

Method

Overall Architecture

The paper formalizes RLHF as a contextual preference bandit problem and adopts the Bradley-Terry preference model:

\[\mathbb{P}[y=1 \mid x, a, a'] = \sigma(\phi(x,a)^\top \theta^* - \phi(x,a')^\top \theta^*)\]

where \(\phi(x,a)\) is a known feature map, \(\theta^*\) is an unknown parameter, and \(\sigma\) is the sigmoid function. The nonlinearity coefficient \(\kappa = \max 1/\dot{\sigma}(\cdot)\), i.e., the largest reciprocal of the sigmoid derivative over feasible reward gaps, characterizes the learning difficulty (\(\kappa\) can be exponentially large).
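To make the model concrete, here is a minimal sketch of the Bradley-Terry preference probability under the linear reward parameterization above (names such as `phi_a` and `theta_star` are illustrative, not taken from the paper's code):

```python
import numpy as np

def sigmoid(u: float) -> float:
    return 1.0 / (1.0 + np.exp(-u))

def preference_prob(phi_a: np.ndarray, phi_a_prime: np.ndarray, theta: np.ndarray) -> float:
    """P[y = 1 | x, a, a']: probability that a is preferred over a' under Bradley-Terry."""
    reward_gap = (phi_a - phi_a_prime) @ theta  # r(x, a) - r(x, a') with linear rewards
    return float(sigmoid(reward_gap))

# Toy usage in a 4-dimensional feature space.
rng = np.random.default_rng(0)
theta_star = rng.normal(size=4)
phi_a, phi_a_prime = rng.normal(size=4), rng.normal(size=4)
print(preference_prob(phi_a, phi_a_prime, theta_star))
```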

Key Designs: One-Pass Reward Modeling

Problem with conventional MLE: At round \(t\), one must solve

\[\hat{\theta}_{t+1} = \arg\min_{\theta} \sum_{i=1}^t \ell_i(\theta)\]

which requires iterating over all historical data, incurring \(\mathcal{O}(t \log t)\) computation per round.

Proposed approach: MLE is replaced by OMD with a second-order Taylor expansion under a customized local norm:

\[\widetilde{\theta}_{t+1} = \arg\min_{\theta \in \Theta} \left\{ \langle g_t(\widetilde{\theta}_t), \theta \rangle + \frac{1}{2\eta} \|\theta - \widetilde{\theta}_t\|_{\widetilde{\mathcal{H}}_t}^2 \right\}\]

where the local norm is \(\widetilde{\mathcal{H}}_t = \mathcal{H}_t + \eta H_t(\widetilde{\theta}_t)\) and \(\mathcal{H}_t = \sum_{i=1}^{t-1} H_i(\widetilde{\theta}_{i+1}) + \lambda I\).

The key closed-form equivalent update is:

\[\widetilde{\theta}_{t+1}' = \widetilde{\theta}_t - \eta \widetilde{\mathcal{H}}_t^{-1} g_t(\widetilde{\theta}_t), \quad \widetilde{\theta}_{t+1} = \text{Proj}_\Theta(\widetilde{\theta}_{t+1}')\]

This update depends only on the current sample and requires no stored historical data, yielding \(\mathcal{O}(1)\) per round.

Core design insight: The local norm \(\mathcal{H}_t\) captures second-order information (approximating the Hessian), which is essential for statistical efficiency. Standard OMD with a first-order approximation would sacrifice convergence rates; the second-order Taylor expansion here achieves both a closed-form solution and statistical efficiency.
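The per-round update can be sketched in a few lines, assuming a Euclidean-ball parameter set \(\Theta\) and an explicit linear solve for readability (the paper's practical recipe replaces the solve with HVP + conjugate gradient; the names and the projection choice here are illustrative, not the authors' implementation):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

class OnePassRewardModel:
    """Sketch of the one-pass OMD update with a customized local norm."""

    def __init__(self, dim, eta=1.0, lam=1.0, radius=1.0):
        self.theta = np.zeros(dim)
        self.H_cum = lam * np.eye(dim)      # cumulative matrix: sum of per-round Hessians + lam * I
        self.eta, self.radius = eta, radius

    def update(self, z, y):
        """z = phi(x, a) - phi(x, a'); y in {0, 1} is the observed preference label."""
        p = sigmoid(z @ self.theta)
        grad = (p - y) * z                          # gradient of the logistic loss at theta_t
        H_t = p * (1.0 - p) * np.outer(z, z)        # per-round Hessian at theta_t
        H_tilde = self.H_cum + self.eta * H_t       # customized local norm
        theta_new = self.theta - self.eta * np.linalg.solve(H_tilde, grad)
        norm = np.linalg.norm(theta_new)            # simple Euclidean-ball stand-in for Proj_Theta
        if norm > self.radius:
            theta_new *= self.radius / norm
        p_new = sigmoid(z @ theta_new)
        self.H_cum += p_new * (1.0 - p_new) * np.outer(z, z)  # accumulate Hessian at the new iterate
        self.theta = theta_new
        return self.theta
```

Note that the only state carried across rounds is `theta` and the \(d \times d\) matrix `H_cum`; no historical samples are stored, which is the source of the \(\mathcal{O}(1)\) per-round cost in \(t\).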

Three Online RLHF Settings

Setting 1: Passive Data Collection — The algorithm does not control data collection and applies the "pessimism in the face of uncertainty" principle:

\[\widetilde{J}_{T+1}(\pi) = \Phi^\top \widetilde{\theta}_{T+1} - \widetilde{\beta}_{T+1} \|\Phi\|_{\mathcal{H}_{T+1}^{-1}}\]
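A sketch of how this pessimistic value could be used to score candidate responses for a single prompt, with `beta` standing in for the confidence radius \(\widetilde{\beta}_{T+1}\) (names and shapes are assumptions for illustration):

```python
import numpy as np

def pessimistic_best_response(Phi, theta, H_cum, beta):
    """Return the index of the response with the largest pessimistic (lower-bound) reward.

    Phi: (n_candidates, d) matrix whose rows are phi(x, a_i) for one prompt x.
    """
    H_inv = np.linalg.inv(H_cum)
    means = Phi @ theta                                            # estimated rewards
    bonuses = np.sqrt(np.einsum("nd,dk,nk->n", Phi, H_inv, Phi))   # ||phi||_{H^{-1}}
    return int(np.argmax(means - beta * bonuses))
```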

Setting 2: Active Data Collection — The algorithm queries the sample with the highest uncertainty:

\[(x_{t+1}, a_{t+1}, a_{t+1}') = \arg\max \|\phi(x,a) - \phi(x,a')\|_{\mathcal{H}_{t+1}^{-1}}\]
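In code, the query rule amounts to scanning a candidate pool for the pair whose feature difference has the largest \(\mathcal{H}_{t+1}^{-1}\)-norm; a minimal sketch (pool construction and names are assumptions):

```python
import numpy as np

def most_uncertain_pair(pairs, H_cum):
    """pairs: list of (phi_a, phi_a_prime); returns the index maximizing ||phi_a - phi_a'||_{H^{-1}}."""
    H_inv = np.linalg.inv(H_cum)
    scores = [float((pa - pb) @ H_inv @ (pa - pb)) for pa, pb in pairs]  # squared local norm
    return int(np.argmax(scores))
```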

Setting 3: Deployment-Time Adaptation — During online deployment, exploitation and exploration are balanced: the first action maximizes the estimated reward, while the second jointly maximizes the reward and its distance from the first action.
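One plausible instantiation of this selection rule, with the trade-off weight `beta` and the exact reward-plus-distance combination as assumptions rather than the paper's formula:

```python
import numpy as np

def select_two_responses(Phi, theta, H_cum, beta):
    """First response: greedy on estimated reward. Second response: estimated reward plus
    an uncertainty bonus given by the local-norm distance from the first response."""
    H_inv = np.linalg.inv(H_cum)
    means = Phi @ theta
    first = int(np.argmax(means))
    diffs = Phi - Phi[first]
    bonuses = np.sqrt(np.einsum("nd,dk,nk->n", diffs, H_inv, diffs))
    scores = means + beta * bonuses
    scores[first] = -np.inf                      # force a distinct second response
    second = int(np.argmax(scores))
    return first, second
```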

Practical Techniques

  • Hessian-Vector Product (HVP) + Conjugate Gradient: Reduces the \(\mathcal{O}(d^3)\) Hessian inversion to \(\mathcal{O}(d)\) (see the sketch after this list)
  • Rejection Sampling to approximate model uncertainty: \(n\) responses are sampled and reward ranking is used in place of exact confidence bounds
  • A time-increasing schedule \(\lambda_t = \lambda_0 \cdot \min(1, f(t/T))\) is adopted for \(\lambda\) to approximate the cumulative Hessian effect
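A minimal sketch of replacing the explicit solve \(\widetilde{\mathcal{H}}_t^{-1} g_t\) with conjugate gradient driven by Hessian-vector products, here via SciPy's `LinearOperator` and `cg`. The rank-one HVP of the single-sample logistic Hessian is exact, while the cumulative term is left as a user-supplied callable `hvp_cum`; this is an illustrative simplification, not the authors' implementation:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def omd_step_hvp(theta, z, y, hvp_cum, eta, dim):
    """One OMD step where the local-norm system is solved with CG + Hessian-vector products.

    hvp_cum(v) must return H_cum @ v (e.g., lam * v plus accumulated rank-one terms),
    so no d x d matrix is ever formed or inverted explicitly.
    """
    p = sigmoid(z @ theta)
    grad = (p - y) * z
    s = p * (1.0 - p)

    def local_norm_hvp(v):
        # (H_cum + eta * H_t(theta)) @ v, applying H_t(theta) = s * z z^T as a rank-one product
        return hvp_cum(v) + eta * s * z * (z @ v)

    op = LinearOperator((dim, dim), matvec=local_norm_hvp)
    direction, _ = cg(op, grad)
    return theta - eta * direction

# Example with a plain ridge accumulator lam * I (a real run would also fold in per-round terms):
# theta = omd_step_hvp(theta, z, y, hvp_cum=lambda v: 1.0 * v, eta=0.5, dim=theta.shape[0])
```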

Training Setup

  • Backbone models: Llama-3-8B-Instruct and Qwen2.5-7B-Instruct
  • Feature dimension: \(d = 4096\) (last-layer embeddings)
  • Datasets: Ultrafeedback-binarized (64K prompts) and Mixture2
  • Passive setting: \(T = 30{,}000\) data points sampled randomly
  • Active setting: only 6,400 samples selected from the full dataset

Key Experimental Results

Main Results

Passive Data Collection (Llama-3-8B, Ultrafeedback):

| Metric | MLE | Ours (OMD) |
| --- | --- | --- |
| Convergence speed | Slow | Fast (especially advantageous when \(T < 10{,}000\)) |
| Evaluation accuracy | Lower | Higher |
| Evaluation loss | Higher | Lower |

In the low-data regime (\(T < 10{,}000\)), the OMD method achieves higher evaluation accuracy with fewer samples, validating the improvement in statistical efficiency.

Active Data Collection (only 6,400 samples):

| Method | ACC (%) | Training Time (s) |
| --- | --- | --- |
| Rand-MLE | 69.51 ± 0.5 | 4876 ± 47 |
| Active-MLE | 69.82 ± 0.4 | 4982 ± 52 |
| Rand-OMD | 68.97 ± 0.6 | 1456 ± 31 |
| Ours | 70.43 ± 0.3 | 1489 ± 36 |

The OMD method reduces training time to approximately one-third of MLE while achieving higher accuracy. Active data selection further improves performance.

Ablation Study

Deployment-Time Adaptation: Data is split into 20 chunks and processed sequentially. Multiple action selection strategies are compared:

  • The proposed strategy (best + top-1/q percentile) outperforms random selection, best+second-best, and best+worst strategies under both MLE and OMD baselines
  • Win rate analysis shows that the OMD method is competitive with MLE while incurring substantially lower computational cost

Key Findings

  1. The OMD method yields at least a \(\sqrt{\kappa}\)-fold improvement in statistical efficiency (theoretically guaranteed), where \(\kappa\) can be exponentially large
  2. Training time is reduced by approximately \(3\times\) with equal or higher accuracy
  3. Memory requirements drop from \(\mathcal{O}(T)\) to \(\mathcal{O}(1)\), which is highly significant for edge device deployment
  4. Active data selection combined with OMD is the optimal configuration: highest accuracy and fastest training

Highlights & Insights

  1. Strong unity of theory and practice: Starting from a theoretical analysis of contextual dueling bandits, the paper derives a practically deployable LLM algorithm
  2. Unified treatment of three settings: Passive, active, and deployment-time settings all follow naturally from the same OMD framework
  3. Practical significance of \(\mathcal{O}(1)\) complexity: The core bottleneck in online RLHF is continuously updating the reward model; this paper reduces the associated complexity from linear growth in the number of iterations to a constant
  4. Elegant local norm design: Constructing the local norm using the Hessian at a lookahead point preserves second-order information while admitting a closed-form solution
  5. HVP + Conjugate Gradient for practicality: Reducing complexity from \(\mathcal{O}(d^3)\) to \(\mathcal{O}(d)\) makes real-time updates feasible in a 4096-dimensional feature space

Limitations & Future Work

  1. Assumes a fixed feature map \(\phi\) — in practice, the feature space changes during LLM fine-tuning
  2. Based on the Bradley-Terry model — not extended to more general preference models such as Plackett-Luce
  3. Assumes a linear reward function — real-world reward functions are typically nonlinear and require neural network approximation
  4. Experiments operate only on last-layer embeddings — multi-layer features and full fine-tuning remain unexplored
  5. Theoretical guarantees for rejection sampling-based uncertainty approximation are weak — a gap between theory and practice remains

Related Work

  • Online RLHF (Dong et al., Guo et al.): Demonstrate that online iterative training outperforms offline training, but at high computational cost
  • Faury et al. (Implicit OMD for logistic bandits): Provide the theoretical foundation for the OMD approach, which this paper extends to the RLHF setting
  • Zhang & Sugiyama: Source of the second-order Taylor expansion approximation idea
  • Das et al. (Active RLHF): Baseline for the active data collection setting; this paper substantially improves computational efficiency while maintaining equivalent statistical guarantees
  • Foster et al.: Address the problem of enumerating an exponentially large response space, which is complementary to this paper's focus on iterative computational cost

Takeaway: The OMD + local norm combination is generalizable to other online learning problems (e.g., online DPO, online preference optimization) and is particularly valuable for personalized alignment on edge devices.

Rating

  • Novelty: ⭐⭐⭐⭐ — The application of OMD + local norm to RLHF is novel, though the underlying techniques originate from the bandit literature
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Covers three settings with both theoretical and empirical validation, though scale is limited (Llama-3 8B)
  • Writing Quality: ⭐⭐⭐⭐⭐ — Rigorous theoretical derivations, clear algorithmic descriptions, and well-organized setting taxonomy
  • Value: ⭐⭐⭐⭐ — Addresses the core computational bottleneck of online RLHF with practical implications for edge deployment and continual learning