Tackling Heavy-Tailed Q-Value Bias in Offline-to-Online Reinforcement Learning with Laplace-Robust Modeling¶
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=I7UK5qHNBL
Code: https://github.com/USTC-AI4EEE/LAROO
Area: Reinforcement Learning / Offline-to-Online RL
Keywords: Offline-to-Online RL, Q-value Bias, Heavy-tailed Distribution, Laplace-Robust Modeling, Ensemble Models
TL;DR¶
This paper reveals for the first time that the Q-value bias during the online fine-tuning stage of Offline-to-Online Reinforcement Learning (O2O RL) follows a heavy-tailed distribution. It proposes LAROO: using an adaptive Laplace noise to "absorb" the heavy-tailed nature of the bias into the noise, combined with a robust loss \(D_b(x)\) to reduce estimation variance, and a conservative ensemble estimate to pull the bias mean back to zero. LAROO outperforms previous state-of-the-art O2O methods with an average improvement of +54.8% on D4RL.
Background & Motivation¶
Background: O2O RL pre-trains an agent on an offline dataset and then fine-tunes it with a small amount of online interaction to break through the ceiling of insufficient state-action coverage in offline data. Due to distribution shifts between offline and online data, the pre-trained Q-network misestimates Q-values on online data, misleading the update direction. Mainstream approaches focus on "improving Q-value accuracy" by adding conservative penalties (Cal-QL), using ensemble models for stable estimation (ENOTO), or increasing update frequency (SO2).
Limitations of Prior Work: A common assumption of these methods is that the Q-bias (the difference between estimated Q-values and true cumulative returns) has finite variance or follows a Gaussian distribution. Using Monte Carlo rollouts for ground-truth returns and computing sample-wise differences, the paper measures the Q-bias distribution of Cal-QL / PEX / ENOTO / SO2 during online fine-tuning. Their kurtosis generally ranges from 8.5 to 9.0 (a Gaussian has kurtosis 3), with variances in the tens of thousands and extremely long right tails, which clearly indicates a heavy-tailed, positively skewed distribution rather than a Gaussian.
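The diagnostic described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function name `q_bias_stats` and the synthetic data are assumptions, and the kurtosis is the Pearson (non-excess) version so that a Gaussian scores 3.

```python
import numpy as np

def q_bias_stats(q_estimates, mc_returns):
    """Summarize the sample-wise Q-bias (estimated Q minus Monte Carlo return)."""
    bias = np.asarray(q_estimates, dtype=float) - np.asarray(mc_returns, dtype=float)
    centered = bias - bias.mean()
    var = np.mean(centered ** 2)                    # population variance
    std = np.sqrt(var)
    skewness = np.mean(centered ** 3) / std ** 3    # > 0 means a longer right tail
    kurtosis = np.mean(centered ** 4) / var ** 2    # Pearson kurtosis, Gaussian -> 3
    return {"mean": bias.mean(), "variance": var,
            "skewness": skewness, "kurtosis": kurtosis}

# Synthetic stand-ins for Monte Carlo returns and Q estimates (hypothetical data).
rng = np.random.default_rng(0)
returns = rng.normal(size=200_000)
gaussian_stats = q_bias_stats(returns + rng.normal(size=returns.size), returns)
laplace_stats = q_bias_stats(returns + rng.laplace(size=returns.size), returns)
```

On this synthetic data the Gaussian-noise case should land near kurtosis 3 and the Laplace-noise case near 6; values of 8.5 to 9.0, as reported for the baselines, indicate tails heavier than either.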
Key Challenge: The root of the heavy-tailed Q-bias is non-homogeneous distribution shift. Online samples further from the offline distribution exhibit larger Q-bias (verified by Spearman correlations of ~0.44-0.54 with DDR/DKNN distances). These distant points constitute the long tail, while the max operator amplifies these overestimations. Heavy-tailed bias brings immense estimation variance; under \(l_2\) loss, extreme outliers in the tail dominate gradients, causing Q-values to oscillate violently or even collapse, making fine-tuning unstable and slow. Existing methods focus only on reducing the mean/variance of the bias without addressing its distributional shape, thus failing to suppress Kurtosis.
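A rough sketch of how such a distance-versus-bias rank correlation could be computed is shown here. Everything in it is an assumption made for illustration: `knn_distance` is a plain mean k-nearest-neighbor distance standing in for the paper's distance measures, and the bias values are synthetic, generated to grow with distance.

```python
import numpy as np

def knn_distance(queries, reference, k=5):
    """Mean Euclidean distance from each query to its k nearest reference points."""
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))          # shape (n_queries, n_reference)
    return np.sort(dists, axis=1)[:, :k].mean(axis=1)

def spearman(x, y):
    """Spearman rank correlation; assumes no ties (fine for continuous data)."""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rank_x, rank_y)[0, 1]

rng = np.random.default_rng(0)
offline = rng.normal(size=(500, 4))                     # hypothetical offline state-actions
online = rng.normal(scale=2.0, size=(300, 4))           # hypothetical shifted online samples
dist = knn_distance(online, offline)
bias = 0.5 * dist + rng.normal(scale=0.5, size=dist.size)  # synthetic bias, grows with distance
rho = spearman(dist, bias)
```

A clearly positive `rho` on real data would support the claim that samples far from the offline distribution carry larger Q-bias.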
Goal: Explicitly model and eliminate the "heavy-tail" pathology at the distributional level, "standardizing" Q-bias from a difficult heavy-tailed distribution into a form with a near-zero mean and controlled tails.
Key Insight: The authors draw inspiration from robust regression, where the Laplace distribution is used to model errors with outliers. The Laplace distribution is naturally suited for heavy-tailed data with outliers, and its negative log-likelihood is proportional to the absolute error \(|x|\) (linear growth) rather than the square growth of \(l_2\), naturally suppressing tail outliers.
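For reference, the two negative log-likelihoods can be written side by side. These are the standard density formulas; the symbol \(\sigma\) (Gaussian standard deviation) is introduced here only for the comparison.

\[
-\log p_{\mathrm{Laplace}}(x \mid \mu, b) = \frac{|x-\mu|}{b} + \log(2b),
\qquad
-\log p_{\mathcal{N}}(x \mid \mu, \sigma) = \frac{(x-\mu)^2}{2\sigma^2} + \tfrac{1}{2}\log(2\pi\sigma^2)
\]

Ignoring the constant terms, a residual ten times larger costs ten times more under the Laplace model but a hundred times more under the Gaussian model, which is why tail outliers dominate an \(l_2\) fit.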
Core Idea: Introduce a parameterizable, adaptive Laplace noise term to "absorb" the heavy-tailedness of the Q-bias, transferring the heavy tails from the bias to the noise. Then, use an ensemble model to pull the residual overestimation mean back to zero. Together, these steps reshape the heavy-tailed Q-bias into a standardized form.
Method¶
Overall Architecture¶
LAROO performs two operations on top of standard off-policy Q-learning (using TD3/TD3BC as the backbone): ① Modeling and transferring heavy tails: the difference between true and estimated Q-values is assumed to follow Laplace noise \(\varepsilon_\theta\). The Q-network is updated by minimizing the KL divergence between two Laplace likelihoods ("ground truth given TD target" and "ground truth given estimated Q"), which yields a robust loss function \(D_b(x)\) that replaces the original \(l_2\) loss. ② Adaptation + Re-centering: a robust variance estimate of the in-batch TD-error updates the Laplace scale parameter \(b\) at every batch (so the noise fits the current heavy-tailedness), while an ensemble computes the TD target from the minimum Q-value to pull the bias mean toward zero.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Online Fine-tuning Data<br/>Heavy-tailed Q-bias"] --> B["Laplace Noise Modeling<br/>Absorb heavy-tailedness into noise"]
B --> C["Robust Loss Db(x)<br/>Linear penalty + bounded gradient"]
A --> D["Adaptive Scale Update<br/>TD-error robust variance estimate b"]
D --> C
C --> E["Ensemble Re-centering<br/>Min Q pulls mean back"]
E --> F["Standardized Q-bias<br/>Stable & efficient fine-tuning"]
Key Designs¶
1. Laplace Noise Modeling: Transferring heavy tails from bias to noise
Addressing the mismatch where Q-bias is heavy-tailed but prior methods assume Gaussian/finite variance, LAROO assumes the target Q-value and estimated Q-value each carry independent Laplace noise \(\varepsilon_{\hat\theta}\sim\mathrm{Laplace}(\mu,b_1)\) and \(\varepsilon_\theta\sim\mathrm{Laplace}(\mu,b_2)\), binding the noise to the bias via the definition \(\mathrm{Bias}(Q_\theta)=-\mathbb{E}[\varepsilon_\theta]\). The Laplace likelihood of the true \(Q(s,a)\) is written under both conditions, and the Q-network is updated by minimizing their KL divergence (Eq. 5). The advantage is that the negative log of the Laplace likelihood is proportional to \(|Q(s,a)-Q_\theta(s,a)|\), demonstrating linear rather than quadratic growth. Consequently, optimization is dominated by the "central mass" of the Q-bias rather than rare extreme outliers in the tail. The heavy-tailedness is explicitly handled by the noise term, while the Q-value itself remains an expectation estimate.
2. Robust Loss \(D_b(x)\): Replacing \(l_2\) square penalty with bounded gradients
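Simplifying the KL divergence by setting \(b=b_1=b_2\) yields the robust function that replaces \(l_2\): \(D_b(x)=\frac{|x|}{b}+\exp\!\big(-\tfrac{|x|}{b}\big)-1\), where \(x\) is the gap between the estimated Q-value and the TD target.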
It possesses two critical properties: first, the exponential term \(\exp(-|x|/b)\) down-weights the loss for large tail deviations, weakening the influence of outliers; second, its gradient is constrained within \([-1/b,\,1/b]\) and does not increase with \(x\). This contrasts with \(l_2\) loss gradients, which diverge linearly with bias. Furthermore, \(D_b(x)\) is intrinsically coupled with the Laplace distribution: as heavy-tailed bias becomes more frequent and scale \(b\) increases, the gradient bounds tighten, automatically increasing suppression of extreme deviations. The authors theoretically prove (Theorem 4.4/4.5) that when \(b>1\), both single-step estimation bias and variance updated with \(D_b(x)\) are strictly smaller than with \(l_2\).
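The bounded-gradient property can be checked numerically. The sketch below assumes \(D_b\) takes the closed-form KL divergence between two Laplace distributions of equal scale \(b\), that is \(|x|/b+\exp(-|x|/b)-1\); the paper's exact normalization may differ, and the function names are illustrative.

```python
import numpy as np

def d_b(x, b):
    """Assumed form of D_b: |x|/b + exp(-|x|/b) - 1 (equal-scale Laplace KL)."""
    z = np.abs(x) / b
    return z + np.exp(-z) - 1.0

def d_b_grad(x, b):
    """Derivative of d_b with respect to x; its magnitude never exceeds 1/b."""
    return np.sign(x) * (1.0 - np.exp(-np.abs(x) / b)) / b

def l2_grad(x):
    """Derivative of the squared loss x**2 / 2, which grows linearly with x."""
    return x

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
b = 2.0
grads = d_b_grad(x, b)       # stays inside [-1/b, 1/b]
l2_grads = l2_grad(x)        # reaches +/-100 at the outliers
```

Near zero the loss behaves like a scaled quadratic (the gradient is roughly \(x/b^2\)), while for large \(|x|\) it becomes linear, so an outlier at \(x=100\) contributes a gradient of at most \(1/b\) instead of 100.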
3. Adaptive Scale Update: Tracking heavy-tailedness with TD-error robust variance
To ensure the Laplace noise fits the changing bias distribution during fine-tuning, the scale parameter \(b\) (corresponding to tail thickness and variance) must be updated in real-time. Since true Q-bias is unavailable during training and standard sample variance is sensitive to Kurtosis, LAROO uses two techniques: first, TD-error is used as a proxy—under independence assumptions, \(T Q_{\hat\theta}-Q_\theta=\varepsilon_\theta-\varepsilon_{\hat\theta}\), so the TD-error variance is exactly twice the Q-bias variance, thus \(b=s_{\omega^*}/\sqrt{2}\). Second, the Kurtosis-robust MBBE variance estimator (MSE-best biased estimator) \(s_{\omega^*}^2=\big(\tfrac{\kappa}{n}+\tfrac{n+1}{n-1}\big)^{-1}s^2\) replaces standard sample variance, explicitly incorporating Kurtosis \(\kappa\).
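A per-batch scale update along these lines might look as follows. This is a sketch under stated assumptions rather than the released implementation: \(\kappa\) is taken to be the excess kurtosis (so that the Gaussian case reduces to the familiar \((n-1)/(n+1)\) shrinkage), the kurtosis is a simple plug-in moment estimate, and the function names are hypothetical.

```python
import numpy as np

def mbbe_variance(td_errors):
    """Kurtosis-aware shrinkage of the sample variance, per the formula in the text."""
    n = td_errors.size
    centered = td_errors - td_errors.mean()
    s2 = centered @ centered / (n - 1)                  # unbiased sample variance
    m2 = np.mean(centered ** 2)
    kappa = np.mean(centered ** 4) / m2 ** 2 - 3.0      # excess kurtosis (assumption)
    weight = 1.0 / (kappa / n + (n + 1) / (n - 1))      # shrinks more when tails are heavy
    return weight * s2

def laplace_scale(td_errors):
    """Scale b = s / sqrt(2), the relation stated in the text."""
    return np.sqrt(mbbe_variance(td_errors)) / np.sqrt(2.0)

rng = np.random.default_rng(0)
td_errors = rng.laplace(scale=1.5, size=256)            # hypothetical batch of TD-errors
b_hat = laplace_scale(td_errors)
```

Because the weight is always below 1, the estimate is pulled toward zero, and the pull is stronger when the batch kurtosis is high, which damps the effect of a few extreme TD-errors on \(b\).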
4. Ensemble Re-centering: Pulling residual overestimation mean to zero
Noise modeling + \(D_b(x)\) mainly suppress the "heavy tail," but large positive tail deviations can still push the mean positive (overestimation). LAROO introduces an ensemble of Q-functions, where the TD target uses the minimum Q-value over a random subset of \(M\) heads: \(y_{\min}=r+\gamma\max_{a'}\min_{1\le k\le M}Q^{(k)}_{\hat\theta}(s',a')\). The final loss averages \(D_b\big(Q^{(k)}_{\theta}-y_{\min}\big)\) across \(K\) heads (Eq. 8). Taking the minimum re-centers the bias because the Q-heads are approximately independent: the minimum is unlikely to come from the "most optimistic" head, which lowers the high positive bias and pushes the mean toward zero.
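The target and loss computation can be sketched with plain arrays. The code below is illustrative only: it assumes the next action \(a'\) has already been chosen by the policy (as in TD3), omits terminal-state masking, uses an assumed closed form for \(D_b\), and all function and variable names are hypothetical.

```python
import numpy as np

def min_subset_target(reward, next_q_heads, gamma, subset_size, rng):
    """TD target from the minimum over a random subset of target Q-heads.

    next_q_heads has shape (K, batch): each target head evaluated at (s', a').
    """
    num_heads = next_q_heads.shape[0]
    subset = rng.choice(num_heads, size=subset_size, replace=False)
    return reward + gamma * next_q_heads[subset].min(axis=0)

def ensemble_loss(q_heads, target, b):
    """Average the robust loss over all K heads and the batch.

    Uses the assumed form D_b(x) = |x|/b + exp(-|x|/b) - 1.
    """
    z = np.abs(q_heads - target[None, :]) / b
    return np.mean(z + np.exp(-z) - 1.0)

rng = np.random.default_rng(0)
K, batch = 10, 4
next_q = rng.normal(loc=50.0, scale=5.0, size=(K, batch))   # hypothetical target-head outputs
reward = np.ones(batch)
target = min_subset_target(reward, next_q, gamma=0.99, subset_size=2, rng=rng)
loss = ensemble_loss(rng.normal(loc=50.0, size=(K, batch)), target, b=2.0)
```

Every head regresses toward the same pessimistic target, so the subset size controls how conservative the update is: a larger subset yields a lower minimum on average.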
Loss & Training¶
The offline phase uses TD3BC pre-training for 1 million gradient steps (LAPO for sparse AntMaze). The online phase uses TD3 with ensemble Q-functions for 100k steps. The online loss is the ensemble \(D_b(x)\) loss (Eq. 8), with scale \(b\) re-estimated per batch using MBBE. The backbone aligns with the ENOTO baseline for fairness.
Key Experimental Results¶
Main Results¶
On D4RL (MuJoCo + sparse AntMaze), LAROO is compared against SOTA methods like PEX / SO2 / Cal-QL / ENOTO / BOORL over 5 seeds. Normalized returns after fine-tuning for representative tasks:
| Task | SO2 | Cal-QL | ENOTO | BOORL | LAROO |
|---|---|---|---|---|---|
| Hopper-medium | 94.4 | 90.8 | 96.6 | 102.1 | 106.7 |
| Walker2d-random | 20.8 | 1.6 | 38.3 | 6.4 | 71.6 |
| Walker2d-medium | 100.6 | 83.7 | 110.2 | 98.6 | 120.4 |
| Walker2d-medium-expert | 110.8 | 110.2 | 118.0 | 109.1 | 126.7 |
| Halfcheetah-medium | 84.4 | 52.2 | 84.8 | 89.7 | 92.5 |
Total performance gain across all tasks within 100k steps \(\delta_{\text{sum}}(0.1M)\): LAROO reaches 550.4, far exceeding the 355.4 of the second-best BOORL. This is a relative improvement of about +54.8% over BOORL, i.e. \((550.4-355.4)/355.4\).
Ablation Study¶
| Configuration | Phenomenon | Explanation |
|---|---|---|
| LAROO (Full) | Near-zero mean, controlled tails | Noise + Ensemble synergistically standardize Q-bias |
| w/o Noise Model | Near-zero mean but still heavy-tailed | Ensemble only re-centers the mean, cannot suppress Kurtosis |
| w/o Ensemble Model | Removed tails but still overestimated | Noise removes tails but mean remains positive |
| \(D_b(x)\) vs Huber/Cauchy | \(D_b(x)\) is superior | Better fit for heavy-tailed Q-bias |
| Laplace vs Gaussian fit | Laplace fits better | Empirical Q-bias is closer to Laplace |
Key Findings¶
- Noise model contributes more than ensemble: Laplace noise modeling is the more critical component, though both are necessary—one handles heavy tails, the other handles the mean.
- Strong plug-and-play capability: Simply replacing the \(l_2\) loss of existing methods (TD3 / Cal-QL / ENOTO) with \(D_b(x)\) further improves performance.
- Gains beyond high UTD/Ensemble: Even with UTD=1 and ensemble size=1, LAROO outperforms Cal-QL/PEX in 10 out of 13 experiments, showing gains stem from robust modeling rather than just hyperparameter tuning.
Highlights & Insights¶
- Turning "Diagnosis" into "Explicit Modeling Goal": After showing empirically that Q-bias has high kurtosis (8.5+), the authors target it directly with Laplace noise. This closes the loop from problem discovery to design.
- "Transferring" rather than "Eliminating" Heavy Tails: Moving heavy-tailedness from the bias to a parameterizable noise term is equivalent to replacing \(l_2\) with the linear-penalty \(D_b(x)\), which is elegant and theoretically sound.
- TD-error as a Proxy for Q-bias Variance: Since true Q-bias is unobservable during training, using \(\mathrm{Var}(\text{TD-error})=2\,\mathrm{Var}(\text{Q bias})\) combined with MBBE provides a robust observation mechanism.
Limitations & Future Work¶
- The methodology relies on Assumptions 4.1/4.2 (independent noise, same Laplace parameters). If the actual bias deviates significantly from Laplace (e.g., Cauchy-like extreme tails), the fitting advantage might diminish.
- Theoretical guarantees (smaller bias/variance) depend on \(b>1\); guarantees for \(b \le 1\) are not provided.
- Evaluations are limited to continuous control in D4RL; generalization to image-based observations or discrete actions remains to be verified.
Related Work & Insights¶
- vs Cal-QL / ENOTO: These methods reduce the mean/variance of Q-bias but cannot suppress Kurtosis; LAROO operates at the full distribution level to handle heavy tails explicitly.
- vs Distributional RL (DSAC): DSAC models the entire Q-value (return) distribution and often assumes Gaussianity, which is incompatible with expectation-based pre-trained Q-networks. LAROO only models the Q-bias as heavy-tailed, keeping the Q-value as an expectation estimate for better compatibility.
- vs Extreme Q-learning: The latter uses the Gumbel distribution to estimate max Q-values, but its training objective differs from standard offline algorithms. LAROO changes only the loss and re-centering method, preserving the Q-value estimation target.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First to reveal heavy-tailed Q-bias in O2O RL and explicitly model it via Laplace noise.
- Experimental Thoroughness: ⭐⭐⭐⭐ Extensive D4RL tasks with 5 seeds; solid ablation and plug-and-play testing.
- Writing Quality: ⭐⭐⭐⭐ Clear logic from discovery to theory.
- Value: ⭐⭐⭐⭐⭐ \(D_b(x)\) is highly practical for stable O2O RL fine-tuning.