
ORVIT: Near-Optimal Online Distributionally Robust Reinforcement Learning

Conference: AAAI 2026 · arXiv: 2508.03768 · Code: None · Area: Reinforcement Learning · Keywords: Distributionally Robust RL, online learning, f-divergence, Minimax Optimal, Sample Complexity

TL;DR

This paper studies online distributionally robust reinforcement learning and proposes the RVI-\(f\) algorithm, built on \(f\)-divergence uncertainty sets. Without relying on any structural assumptions, it achieves near minimax-optimal regret under \(\chi^2\) divergence and, under KL divergence, regret matching the lower bound in every parameter except the horizon dependence.

Background & Motivation

State of the Field

Distributionally robust reinforcement learning (DRRL) optimizes worst-case performance over uncertainty sets centered at the training environment, providing a principled solution to the sim-to-real gap. Existing studies have produced rich results under generative model and offline data access settings.

Limitations of Prior Work

Existing DRRL methods either rely on a generative model (allowing arbitrary queries of state-action pairs) or assume offline datasets with sufficient coverage. These assumptions are difficult to satisfy in practical deployment—real environments are unknown and data are sparse, requiring the agent to explore on its own.

Root Cause

Online DRRL faces an information gap problem: states covered by the worst-case environment may never be visited during training, yet the agent must still act reliably in those unseen states. This leads to an inherent mismatch between the training data distribution and the evaluation objective (the robust value function).

Paper Goals

To design an online DRRL algorithm with near-optimal sample complexity without relying on fail-state structure, coverage assumptions, or vanishing minimum-value assumptions.

Starting Point

The paper focuses on \(\chi^2\) and KL divergence uncertainty sets, leveraging the smooth geometric properties of these divergences to mitigate the information gap (more tractable than TV divergence), and achieves tight regret bounds through carefully designed exploration bonuses.

Core Idea

By directly constructing confidence intervals for the robust value function (rather than modeling the full transition dynamics), the paper uses the dual representation of \(f\)-divergences to design adaptive bonus terms, enabling near-optimal online robust learning without structural assumptions.

Method

Overall Architecture

RVI-\(f\) (Robust Value Iteration with \(f\)-Divergence) is a model-based meta-algorithm that executes three phases per episode \(k\):

  1. Nominal transition estimation: Estimate the empirical transition kernel \(\hat{P}_h^k\) of the training environment from historical data.
  2. Optimistic robust estimation: Construct confidence intervals for the robust value function via optimistic upper bounds to determine the current policy.
  3. Policy execution and data collection: Execute the policy, observe rewards and state transitions, and update the dataset.

A schematic sketch of one such episode appears below.
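
The following minimal sketch (our reconstruction on a toy tabular MDP, not the authors' code) shows how the three phases fit together in one episode. The bonus here is a crude Hoeffding-style stand-in for the paper's \(B_{k,h}^f\), and `robust_backup` uses a plain expectation where RVI-\(f\) would plug in the \(f\)-divergence dual of Key Design 1.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 2, 5
P_true = rng.dirichlet(np.ones(S), size=(H, S, A))  # unknown true kernel
R = rng.uniform(0, 1, size=(H, S, A))               # known rewards in [0, 1]

N = np.zeros((H, S, A))          # visit counts N_h^k(s, a)
N_next = np.zeros((H, S, A, S))  # transition counts

def robust_backup(p_hat_h, v_next):
    # Placeholder: plain expectation E_{P_hat}[V]. RVI-f instead evaluates
    # the robust expectation via the f-divergence dual of Key Design 1.
    return p_hat_h @ v_next      # shape (S, A)

for k in range(1, 51):
    # Phase 1: nominal transition estimation from historical data
    # (Laplace-smoothed so every row is a valid distribution).
    P_hat = (N_next + 1.0 / S) / (N[..., None] + 1.0)

    # Phase 2: optimistic robust backward induction with an exploration bonus.
    Q_up = np.zeros((H + 1, S, A))
    V_up = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        bonus = H * np.sqrt(np.log(2.0 * k) / np.maximum(N[h], 1.0))  # stand-in for B^f
        Q_up[h] = np.minimum(R[h] + robust_backup(P_hat[h], V_up[h + 1]) + bonus, H)
        V_up[h] = Q_up[h].max(axis=1)
    pi = Q_up[:H].argmax(axis=2)  # greedy optimistic policy

    # Phase 3: execute the policy, observe transitions, update the dataset.
    s = 0
    for h in range(H):
        a = pi[h, s]
        s_next = rng.choice(S, p=P_true[h, s, a])
        N[h, s, a] += 1
        N_next[h, s, a, s_next] += 1
        s = s_next
```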

Key Design 1: Dual Representation of \(f\)-Divergence Uncertainty Sets

Function: Provides a computationally tractable equivalent form for the robust expected value.

Mechanism: Convex duality reduces the robust expectation to a one-dimensional optimization problem.

For \(\chi^2\) divergence (\(f(t) = (t-1)^2\)):
\[
\mathbb{E}_{\mathcal{U}_h^\sigma(s,a)}[V] = \sup_{\eta \in [0,H]} \left\{ -\sqrt{\sigma \, \text{Var}_{P_h^\star(\cdot|s,a)}\big((\eta - V)_+\big)} + \big[\mathbb{P}_h^\star(V - \eta)_+\big](s,a) \right\}
\]

For KL divergence (\(f(t) = t \log t\)):
\[
\mathbb{E}_{\mathcal{U}_h^\sigma(s,a)}[V] = \sup_{\eta > 0} \left\{ -\eta \log\big([\mathbb{P}_h^\star \exp\{-V/\eta\}](s,a)\big) - \eta\sigma \right\}
\]

Design Motivation: The dual representation reduces high-dimensional distributional optimization to scalar optimization, enabling efficient solution of the robust Bellman equation. Moreover, the smoothness of \(\chi^2\) and KL divergences moderates the influence of rare states through the geometric structure of the divergence, alleviating the information gap problem present under TV divergence.
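
As a sanity check on these dual forms, the sketch below (our construction, with illustrative values standing in for \(P_h^\star(\cdot|s,a)\), \(V\), and \(\sigma\)) evaluates the KL dual by a one-dimensional search over \(\eta\) and compares it against a direct constrained solve of the primal worst-case problem.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.special import logsumexp

p_star = np.array([0.5, 0.3, 0.2])  # nominal next-state distribution
V = np.array([1.0, 4.0, 7.0])       # next-state values, V in [0, H]
sigma = 0.1                         # KL uncertainty-set radius

# Dual form: sup_{eta > 0} { -eta * log E_{p*}[exp(-V/eta)] - eta * sigma }
# (logsumexp with weights b keeps the inner expectation numerically stable).
neg_dual = lambda eta: -(-eta * logsumexp(-V / eta, b=p_star) - eta * sigma)
dual_val = -minimize_scalar(neg_dual, bounds=(1e-3, 1e3), method="bounded").fun

# Primal form: min_p p @ V  subject to  KL(p || p*) <= sigma, p on the simplex.
kl = lambda p: float(np.sum(p * np.log(np.maximum(p, 1e-12) / p_star)))
primal = minimize(
    lambda p: p @ V, p_star, method="SLSQP",
    bounds=[(1e-9, 1.0)] * len(p_star),
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0},
                 {"type": "ineq", "fun": lambda p: sigma - kl(p)}],
)
print(dual_val, primal.fun)  # the two values agree up to solver tolerance
```

The scalar search is exactly the tractability the dual buys: the primal is an \(S\)-dimensional constrained program, while the dual is a line search.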

Key Design 2: Optimistic Robust Estimation and Bonus Terms

Function: Construct confidence intervals \([\underline{Q}_h^k, \overline{Q}_h^k]\) for the robust value function, ensuring the true robust optimal value function falls within the interval with high probability.

Mechanism:
\[
\overline{Q}_h^k(s,a) = \min\big\{\overline{R}_h^k(s,a) + B_{k,h}^f(s,a),\; H\big\}
\]

where \(\overline{R}_h^k\) is the robust Bellman operator evaluated under the estimated model, and \(B_{k,h}^f\) is the key exploration bonus term.

For \(\chi^2\) divergence, the bonus term is:
\[
B_{k,h}^{\chi^2}(s,a) = \sqrt{\frac{\sigma c_1 L \,\text{Var}_{\hat{P}_h^k}\left[\frac{\overline{V}_{h+1}^k + \underline{V}_{h+1}^k}{2}\right]}{N_h^k(s,a) \vee 1}} + \frac{2\sqrt{\sigma}\,\mathbb{E}_{\hat{P}_h^k}\big[\overline{V}_{h+1}^k - \underline{V}_{h+1}^k\big]}{H} + \text{lower-order terms}
\]

Design Motivation: The bonus term comprises three components: a variance term (reflecting estimation uncertainty), an interval-width term (reflecting confidence-interval quality), and lower-order correction terms. The variance-weighted design causes the bonus to shrink naturally for frequently visited state-action pairs, yielding tight regret bounds; a sketch of the two leading terms follows.
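
For concreteness, here is our transcription of the two leading bonus terms (`c1_L` stands in for the constant \(c_1 L\), and the lower-order corrections are omitted):

```python
import numpy as np

def chi2_bonus_leading(p_hat, V_up, V_low, N_sa, sigma, H, c1_L=1.0):
    """Two leading terms of B^{chi^2}_{k,h}(s, a).

    p_hat        : estimated next-state distribution P_hat_h^k(. | s, a)
    V_up, V_low  : optimistic / pessimistic next-step value vectors
    N_sa         : visit count N_h^k(s, a)
    """
    mid = 0.5 * (V_up + V_low)                     # midpoint value estimate
    var_mid = p_hat @ mid**2 - (p_hat @ mid) ** 2  # Var_{P_hat}[(V_up + V_low)/2]
    width = p_hat @ (V_up - V_low)                 # E_{P_hat}[V_up - V_low]
    return (np.sqrt(sigma * c1_L * var_mid / max(N_sa, 1))
            + 2.0 * np.sqrt(sigma) * width / H)
```

As the count \(N_h^k(s,a)\) grows, the first term shrinks at rate \(1/\sqrt{N}\); as the confidence interval tightens, the second term vanishes as well.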

Key Design 3: Lower Bound Construction

Function: Prove minimax lower bounds for online DRRL by constructing hard instances.

Mechanism: A family of hard RMDPs \(\{\mathcal{M}_1, \dots, \mathcal{M}_S\}\) is constructed, each with 3 states, \(A\) actions, and horizon \(H\). All MDPs share the same nominal transition but differ in the rewards of different actions at \(s_2\). The Bretagnolle-Huber inequality is applied to establish a regret lower bound for any algorithm on this family.

Design Motivation: By carefully choosing \(\mu^\star = \sqrt{c'A/(SKp)}\) and the worst-case transition kernel, the lower bound matches the upper bound in all key parameters.
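
For reference, the Bretagnolle-Huber inequality states that for any two probability measures \(P, Q\) and any event \(A\),
\[
P(A) + Q(A^c) \ge \tfrac{1}{2} \exp\big(-\mathrm{KL}(P \,\|\, Q)\big).
\]
Because the hard instances are close in KL divergence, no algorithm can reliably tell them apart from a bounded amount of data, and the resulting misidentification probability translates into regret on at least one instance of the family.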

Loss & Training

The optimization objective is to minimize cumulative robust regret:
\[
\text{Regret}(K) = \sum_{k=1}^K \left[ V_1^{\star,\sigma}(s_1^k) - V_1^{\pi^k,\sigma}(s_1^k) \right]
\]

Key Experimental Results

Main Results

| Method | Assumptions | \(\chi^2\) Regret | KL Regret |
| --- | --- | --- | --- |
| TV-OPROVI | Vanishing minimum value | - | - |
| TV/χ²/KL-ORBIT | Coverage ratio \(C_{vr}\) | \(\tilde{O}(\sqrt{C_{vr}H^4S^3AK})\) | Extra \(S/(\sigma P_{\min}^\star)\) factor |
| RVI-χ² (Ours) | None | \(\tilde{O}(\sqrt{H^4(1+\sigma)SAK})\) | - |
| RVI-KL (Ours) | None | - | \(\tilde{O}(\sqrt{H^4 e^{2H^2}SAK/(P_{\min}^\star \sigma^2)})\) |

Lower Bound Matching

| Uncertainty Set | Upper Bound | Lower Bound | Gap |
| --- | --- | --- | --- |
| \(\chi^2\) | \(\tilde{O}(\sqrt{H^4(1+\sigma)SAK})\) | \(\Omega(\sqrt{H^4(1+\sigma)SAK})\) | Logarithmic factor |
| KL | \(\tilde{O}(\sqrt{H^4 e^{2H^2}SAK/(P_{\min}^\star\sigma^2)})\) | \(\Omega(\sqrt{H^4SAK/(P_{\min}^\star\sigma^2)})\) | \(H\) dependence |

Key Findings

  1. Minimax optimality achieved under \(\chi^2\) divergence (up to logarithmic factors)—the first assumption-free optimal result in online DRRL.
  2. The exponential factor \(e^{2H^2}\) under KL divergence is not an artifact of the analysis but an intrinsic cost of the geometry of the KL uncertainty set.
  3. On Gambler's Problem and Frozen Lake experiments, the method consistently improves worst-case performance under significant distribution shift.
  4. The sample complexity \(\tilde{O}(H^5(1+\sigma)SA/\varepsilon^2)\) is consistent with known results in the generative-model setting (via the standard online-to-PAC conversion sketched below).
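
One standard way to read item 4 (a sketch, assuming the \(\chi^2\) regret bound above): run until the average per-episode regret drops below \(\varepsilon\), so that a policy drawn uniformly from the executed ones is \(\varepsilon\)-optimal, and count \(H\) samples per episode:
\[
\frac{\text{Regret}(K)}{K} = \tilde{O}\left(\sqrt{\frac{H^4(1+\sigma)SA}{K}}\right) \le \varepsilon
\;\Longrightarrow\;
K = \tilde{O}\left(\frac{H^4(1+\sigma)SA}{\varepsilon^2}\right)
\;\Longrightarrow\;
KH = \tilde{O}\left(\frac{H^5(1+\sigma)SA}{\varepsilon^2}\right).
\]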

Highlights & Insights

  1. First near-optimal online DRRL algorithm without structural assumptions: no fail-state, coverage ratio, or vanishing minimum-value assumptions are required.
  2. Unified meta-algorithm framework: RVI-\(f\) accommodates different \(f\)-divergences, demonstrating the generality of the framework.
  3. Tight minimax lower bounds: information-theoretic lower bounds are established for the online setting for the first time, filling a theoretical gap.
  4. Deep insight: the smoothness of \(\chi^2\) and KL divergences naturally alleviates the information gap problem that requires additional assumptions under TV divergence.

Limitations & Future Work

  1. Under KL divergence, a gap in \(H\) dependence between the upper and lower bounds remains (\(e^{2H^2}\) vs. no exponential term).
  2. The algorithm is model-based, requiring storage of transition kernel estimates with memory overhead \(O(S^2AH)\).
  3. Only tabular MDPs are considered; the approach has not been extended to function approximation settings.
  4. Experiments are small-scale (Gambler's Problem and Frozen Lake), lacking validation in large-scale environments.
  5. Under KL divergence, the assumption \(P_{\min}^\star > 0\) is required, which is restrictive for sparse-transition environments.
  6. When \(\sigma\) is very small, the robust bound may be loose relative to the non-robust case.
Related Work

  1. UCB-VI (Azar et al. 2017): Minimax-optimal algorithm for non-robust online RL; the sampling strategy in this paper is inspired by it.
  2. He et al. (ICML 2025): Online DRRL algorithm based on the coverage ratio assumption; this paper removes that assumption.
  3. Lu et al. (2024): Online DRRL under TV divergence, requiring a vanishing minimum-value assumption.
  4. Insight: A "geometric perspective" on robust optimization—the smoothness properties of different divergences determine the difficulty of exploration, an insight potentially generalizable to broader distributionally robust optimization problems.

Rating

⭐⭐⭐⭐⭐ (5/5)

Strengths: Outstanding theoretical contributions (first assumption-free + minimax-optimal result), elegant algorithm design, unified framework accommodating multiple divergences, and complete upper and lower bound analysis.

Weaknesses: The experimental section is relatively weak, with insufficient demonstration of practical application scenarios.