Provably Efficient Online RLHF with One-Pass Reward Modeling¶
Conference: NeurIPS 2025 arXiv: 2502.07193 Code: github.com/ZinYY/Online_RLHF Area: LLM Alignment Keywords: online RLHF, reward modeling, online mirror descent, contextual dueling bandit, computational efficiency
TL;DR¶
This paper proposes a one-pass reward modeling method based on online mirror descent (OMD) that eliminates the computational bottleneck in online RLHF — namely, storing all historical data and re-optimizing from scratch at each iteration — achieving \(\mathcal{O}(1)\) time and memory complexity per iteration while also improving upon MLE methods in statistical efficiency.
Background & Motivation¶
RLHF (Reinforcement Learning from Human Feedback) is a core technique for aligning LLMs. Conventional RLHF relies on fixed preference datasets, but limited coverage makes it difficult for reward models to generalize to out-of-distribution samples. Online RLHF addresses this by iteratively collecting data and continuously improving the model, a strategy whose effectiveness has been demonstrated by Claude and LLaMA-2.
However, online RLHF faces a severe computational bottleneck:

- Each iteration requires incorporating new data into the historical dataset and re-optimizing the reward model over the entire dataset
- Under MLE estimation, the computational complexity at round \(t\) is \(\mathcal{O}(t \log t)\), with \(\mathcal{O}(t)\) memory
- This becomes prohibitive over long training horizons, especially on edge devices
Core Problem: Can one design an online RLHF algorithm that is both statistically and computationally efficient?
Method¶
Overall Architecture¶
The paper formalizes RLHF as a contextual dueling (preference) bandit problem and adopts the Bradley-Terry preference model:

$$\mathbb{P}\left(a^1 \succ a^2 \mid x\right) = \sigma\left(\left\langle \phi(x, a^1) - \phi(x, a^2),\ \theta^* \right\rangle\right),$$

where \(\phi(x,a)\) is a known feature map, \(\theta^*\) is an unknown parameter, and \(\sigma\) is the sigmoid function. The nonlinearity coefficient \(\kappa = \max 1/\dot{\sigma}(\cdot)\), with the maximum taken over the feature and parameter domain, characterizes the learning difficulty (\(\kappa\) can be exponentially large).
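As a quick illustration, a toy NumPy sketch of this preference model (the feature vectors and \(\theta^*\) below are random stand-ins):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

d = 8                                    # toy feature dimension
rng = np.random.default_rng(0)
theta_star = rng.normal(size=d)          # unknown ground-truth parameter
phi_a1 = rng.normal(size=d)              # phi(x, a1)
phi_a2 = rng.normal(size=d)              # phi(x, a2)

# Bradley-Terry: P(a1 preferred over a2 | x) = sigma(<phi(x,a1) - phi(x,a2), theta*>)
p_prefer = sigmoid((phi_a1 - phi_a2) @ theta_star)
print(f"P(a1 > a2 | x) = {p_prefer:.3f}")
```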
Key Designs: One-Pass Reward Modeling¶
Problem with conventional MLE: At round \(t\), one must solve

$$\widehat{\theta}_t = \arg\min_{\theta \in \Theta} \sum_{i=1}^{t-1} \ell_i(\theta), \qquad \ell_i(\theta) = -\,y_i \log \sigma\!\left(z_i^\top \theta\right) - (1 - y_i) \log\!\left(1 - \sigma\!\left(z_i^\top \theta\right)\right),$$

where \(z_i = \phi(x_i, a_i^1) - \phi(x_i, a_i^2)\) and \(y_i \in \{0,1\}\) is the observed preference. This requires iterating over all historical data, incurring \(\mathcal{O}(t \log t)\) computation per round.
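For contrast, a minimal sketch of the per-round MLE objective (names illustrative); note that it touches all \(t-1\) stored pairs every time it is re-optimized:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mle_loss(theta, Z, y):
    """Negative log-likelihood over ALL t-1 historical preference pairs.

    Z: (t-1, d) array of feature differences z_i = phi(x_i, a_i^1) - phi(x_i, a_i^2)
    y: (t-1,) array of preference labels in {0, 1}
    """
    p = sigmoid(Z @ theta)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Re-optimizing this from scratch each round costs O(t log t) time and
# O(t) memory -- exactly the bottleneck the one-pass update removes.
```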
Proposed approach: MLE is replaced by OMD with a second-order Taylor expansion of the loss under a customized local norm:

$$\widetilde{\theta}_{t+1} = \arg\min_{\theta \in \Theta}\ \left\langle \nabla \ell_t(\widetilde{\theta}_t),\ \theta \right\rangle + \frac{1}{2\eta} \left\| \theta - \widetilde{\theta}_t \right\|^2_{\widetilde{\mathcal{H}}_t},$$

where the local norm is induced by \(\widetilde{\mathcal{H}}_t = \mathcal{H}_t + \eta H_t(\widetilde{\theta}_t)\), with \(\mathcal{H}_t = \sum_{i=1}^{t-1} H_i(\widetilde{\theta}_{i+1}) + \lambda I\) and \(H_i(\theta)\) denoting the Hessian of \(\ell_i\) at \(\theta\).
The key closed-form equivalent update (up to a projection onto \(\Theta\)) is:

$$\widetilde{\theta}_{t+1} = \widetilde{\theta}_t - \eta\, \widetilde{\mathcal{H}}_t^{-1} \nabla \ell_t(\widetilde{\theta}_t).$$

This update depends only on the current sample and requires no stored historical data, yielding \(\mathcal{O}(1)\) time and memory per round.
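A minimal NumPy sketch of the resulting one-pass learner, assuming the Bradley-Terry logistic loss above; the step size `eta`, regularizer `lam`, and omission of the projection step are simplifications:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

class OnePassOMD:
    """Reward-model update with O(1) cost per round in t (sketch, no projection)."""

    def __init__(self, d, eta=0.5, lam=1.0):
        self.theta = np.zeros(d)
        self.H = lam * np.eye(d)    # cumulative matrix H_t (O(d^2) memory, constant in t)
        self.eta = eta

    def update(self, z, y):
        """One preference sample: z = phi(x,a1) - phi(x,a2), y in {0, 1}."""
        p = sigmoid(z @ self.theta)
        grad = (p - y) * z                      # gradient of the logistic loss at theta_t
        sdot = p * (1.0 - p)                    # sigmoid derivative (Hessian weight)
        H_tilde = self.H + self.eta * sdot * np.outer(z, z)   # local norm H~_t
        self.theta = self.theta - self.eta * np.linalg.solve(H_tilde, grad)
        # Accumulate second-order information at the new iterate theta_{t+1},
        # matching H_t = sum_i H_i(theta_{i+1}) + lam * I above.
        p_new = sigmoid(z @ self.theta)
        self.H += p_new * (1.0 - p_new) * np.outer(z, z)
```

Each `update` call touches only the current sample; the dense solve still costs \(\mathcal{O}(d^3)\) here, which the HVP + conjugate-gradient trick described below avoids.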
Core design insight: The local norm \(\widetilde{\mathcal{H}}_t\) captures second-order information (approximating the Hessian), which is essential for statistical efficiency. Standard OMD with a first-order approximation would sacrifice the convergence rate; the second-order Taylor expansion here achieves both a closed-form solution and statistical efficiency.
Three Online RLHF Settings¶
Setting 1: Passive Data Collection — The algorithm does not control data collection and applies the "pessimism in the face of uncertainty" principle: the output policy maximizes a lower-confidence-bound (pessimistic) estimate of the reward.
Setting 2: Active Data Collection — The algorithm queries the candidate pair whose preference is most uncertain under the current model (largest confidence width).
Setting 3: Deployment-Time Adaptation — During online deployment, exploitation and exploration are balanced: the first action maximizes the estimated reward, while the second jointly maximizes the reward and its distance from the first action.
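Hedged sketches of the three selection rules; the confidence-width bonus and the coefficient `beta` are illustrative assumptions rather than the paper's exact formulas:

```python
import numpy as np

def _widths(V, H_inv):
    """Confidence widths ||v||_{H^-1} for each row v of V."""
    return np.sqrt(np.einsum("id,de,ie->i", V, H_inv, V))

def pessimistic_action(feats, theta, H_inv, beta):
    """Setting 1 (passive): act on a lower confidence bound of the reward."""
    return int(np.argmax(feats @ theta - beta * _widths(feats, H_inv)))

def most_uncertain_pair(diffs, H_inv):
    """Setting 2 (active): query the pair with the largest uncertainty,
    where diffs[i] = phi(x, a_i^1) - phi(x, a_i^2)."""
    return int(np.argmax(_widths(diffs, H_inv)))

def deployment_pair(feats, theta, H_inv, beta):
    """Setting 3 (deployment): first action exploits the reward estimate;
    the second trades reward against distance from the first."""
    rewards = feats @ theta
    a1 = int(np.argmax(rewards))
    scores = rewards + beta * _widths(feats - feats[a1], H_inv)
    scores[a1] = -np.inf        # force the second action to differ from the first
    return a1, int(np.argmax(scores))
```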
Practical Techniques¶
- Hessian-Vector Product (HVP) + Conjugate Gradient: Reduces the \(\mathcal{O}(d^3)\) Hessian inversion to \(\mathcal{O}(d)\) (see the sketch after this list)
- Rejection Sampling to approximate model uncertainty: \(n\) responses are sampled and reward ranking is used in place of exact confidence bounds
- A time-increasing schedule \(\lambda_t = \lambda_0 \cdot \min(1, f(t/T))\) is adopted for \(\lambda\) to approximate the cumulative Hessian effect
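A sketch of the HVP + conjugate-gradient step with SciPy, solving \(\widetilde{\mathcal{H}}_t x = \nabla \ell_t\) matrix-free; `H_acc` is kept as an explicit matrix here for simplicity, whereas in the LLM setting each matrix-vector product would come from an autodiff HVP at \(\mathcal{O}(d)\) cost:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def omd_step_cg(theta, H_acc, z, y, eta=0.5, cg_iters=50):
    """One-pass OMD step that never forms or inverts H~_t explicitly."""
    p = 1.0 / (1.0 + np.exp(-(z @ theta)))
    grad = (p - y) * z
    sdot = p * (1.0 - p)

    def matvec(v):
        # (H_t + eta * sdot * z z^T) v -- only matrix-vector products needed
        return H_acc @ v + eta * sdot * z * (z @ v)

    d = theta.shape[0]
    op = LinearOperator((d, d), matvec=matvec)
    direction, info = cg(op, grad, maxiter=cg_iters)  # info == 0 on convergence
    return theta - eta * direction
```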
Loss & Training¶
- Backbone models: Llama-3-8B-Instruct and Qwen2.5-7B-Instruct
- Feature dimension: \(d = 4096\) (last-layer embeddings)
- Datasets: Ultrafeedback-binarized (64K prompts) and Mixture2
- Passive setting: \(T = 30{,}000\) data points sampled randomly
- Active setting: only 6,400 samples selected from the full dataset
Key Experimental Results¶
Main Results¶
Passive Data Collection (Llama-3-8B, Ultrafeedback):
| Metric | MLE | Ours (OMD) |
|---|---|---|
| Convergence speed | Slow | Fast (especially advantageous when \(T < 10{,}000\)) |
| Evaluation accuracy | Lower | Higher |
| Evaluation loss | Higher | Lower |
In the low-data regime (\(T < 10{,}000\)), the OMD method achieves higher evaluation accuracy with fewer samples, validating the improvement in statistical efficiency.
Active Data Collection (only 6,400 samples):
| Method | ACC (%) | Training Time (s) |
|---|---|---|
| Rand-MLE | 69.51 ± 0.5 | 4876 ± 47 |
| Active-MLE | 69.82 ± 0.4 | 4982 ± 52 |
| Rand-OMD | 68.97 ± 0.6 | 1456 ± 31 |
| Active-OMD (Ours) | 70.43 ± 0.3 | 1489 ± 36 |
The OMD method reduces training time to approximately one-third of MLE while achieving higher accuracy. Active data selection further improves performance.
Ablation Study¶
Deployment-Time Adaptation: Data is split into 20 chunks and processed sequentially, and multiple action selection strategies are compared:

- The proposed strategy (best + top-1/q percentile) outperforms random selection, best + second-best, and best + worst strategies under both MLE and OMD baselines
- Win-rate analysis shows that the OMD method is competitive with MLE while incurring substantially lower computational cost
Key Findings¶
- The OMD method yields at least a \(\sqrt{\kappa}\)-fold improvement in statistical efficiency (theoretically guaranteed), where \(\kappa\) can be exponentially large
- Training time is reduced by approximately \(3\times\) with equal or higher accuracy
- Memory requirements drop from \(\mathcal{O}(T)\) to \(\mathcal{O}(1)\), which is highly significant for edge device deployment
- Active data selection combined with OMD is the optimal configuration: highest accuracy and fastest training
Highlights & Insights¶
- Strong unity of theory and practice: Starting from a theoretical analysis of contextual dueling bandits, the paper derives a practically deployable LLM algorithm
- Unified treatment of three settings: Passive, active, and deployment-time settings all follow naturally from the same OMD framework
- Practical significance of \(\mathcal{O}(1)\) complexity: The core bottleneck in online RLHF is continuously updating the reward model; this paper reduces the associated cost from \(\mathcal{O}(t \log t)\) time and \(\mathcal{O}(t)\) memory at round \(t\) to a constant
- Elegant local norm design: Constructing the local norm using the Hessian at a lookahead point preserves second-order information while admitting a closed-form solution
- HVP + Conjugate Gradient for practicality: Reducing complexity from \(\mathcal{O}(d^3)\) to \(\mathcal{O}(d)\) makes real-time updates feasible in a 4096-dimensional feature space
Limitations & Future Work¶
- Assumes a fixed feature map \(\phi\) — in practice, the feature space changes during LLM fine-tuning
- Based on the Bradley-Terry model — not extended to more general preference models such as Plackett-Luce
- Assumes a linear reward function — real-world reward functions are typically nonlinear and require neural network approximation
- Experiments operate only on last-layer embeddings — multi-layer features and full fine-tuning remain unexplored
- Theoretical guarantees for rejection sampling-based uncertainty approximation are weak — a gap between theory and practice remains
Related Work & Insights¶
- Online RLHF (Dong et al., Guo et al.): Demonstrate that online iterative training outperforms offline training, but at high computational cost
- Faury et al. (Implicit OMD for logistic bandits): Provide the theoretical foundation for the OMD approach, which this paper extends to the RLHF setting
- Zhang & Sugiyama: Source of the second-order Taylor expansion approximation idea
- Das et al. (Active RLHF): Baseline for the active data collection setting; this paper substantially improves computational efficiency while maintaining equivalent statistical guarantees
- Foster et al.: Address the problem of enumerating an exponentially large response space, which is complementary to this paper's focus on iterative computational cost
Takeaway: The OMD + local norm combination is generalizable to other online learning problems (e.g., online DPO, online preference optimization) and is particularly valuable for personalized alignment on edge devices.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The application of OMD + local norm to RLHF is novel, though the underlying techniques originate from the bandit literature
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers three settings with both theoretical and empirical validation, though scale is limited (Llama-3 8B)
- Writing Quality: ⭐⭐⭐⭐⭐ — Rigorous theoretical derivations, clear algorithmic descriptions, and well-organized setting taxonomy
- Value: ⭐⭐⭐⭐ — Addresses the core computational bottleneck of online RLHF with practical implications for edge deployment and continual learning