
Sample-Efficient Tabular Self-Play for Offline Robust Reinforcement Learning

Conference: NeurIPS 2025 · arXiv: 2512.00352 · Code: None · Area: Robust Reinforcement Learning / Offline RL · Keywords: offline RL, robust MDP, zero-sum Markov game, sample complexity, pessimism

TL;DR

This paper proposes the RTZ-VI-LCB algorithm for offline robust two-player zero-sum Markov games (RTZMGs). By combining pessimistic robust value iteration with Bernstein-style penalties, it achieves a near-optimal sample complexity of \(O(C_r^* \cdot H^4 \cdot S \cdot (A+B) / \varepsilon^2)\), significantly improving on the prior best bound of \(O(H^5 \cdot S^2 \cdot AB / \varepsilon^2)\) in its dependence on both the state space and the action space.

Background & Motivation

State of the Field

Background: Offline RL learns policies from a fixed dataset without online interaction. Robust RL further accounts for environmental uncertainty, where the transition kernel may vary within an uncertainty set. The combination of these two settings forms the emerging direction of offline robust RL.

Limitations of Prior Work: (a) Existing offline robust RL methods exhibit redundant dependence on \(S\) (number of states) and \(AB\) (joint action space size) in their sample complexity bounds; (b) the standard concentrability coefficients used to measure dataset coverage are overly conservative for worst-case analysis.

Key Challenge: How can one simultaneously handle distributional shift and environmental uncertainty under partial coverage, while obtaining tight sample complexity bounds?

Key Insight: The paper introduces robust unilateral clipped concentrability as a tighter coverage measure, and employs two-stage subsampling to break statistical dependencies, yielding sample bounds with information-theoretically optimal dependence on the state and action spaces.

Core Idea: A model-based approach — first estimate the transition kernel from data and construct an uncertainty set around it, then perform pessimistic (LCB) robust value iteration over this set, using Bernstein inequalities to obtain tighter statistical estimates.

Method

Overall Architecture

  • Input: Offline dataset (\(N\) tuples of \((s, a_1, a_2, r, s')\)), uncertainty set radius \(\rho\)
  • Algorithm: RTZ-VI-LCB (Robust Two-player Zero-sum Value Iteration with Lower Confidence Bound)
  • Output: Near-optimal robust policy \(\hat{\pi}\)
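
To fix notation, here is a minimal sketch of how the offline tuples might be aggregated into per-step visit counts \(N_h(s,a_1,a_2)\), an empirical transition kernel \(\hat{P}_h\), and empirical mean rewards. It assumes each tuple is tagged with its step index \(h\) (finite-horizon setting); the array shapes and names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def build_empirical_model(dataset, H, S, A, B):
    """Aggregate offline tuples (h, s, a1, a2, r, s_next) into visit counts
    N_h(s, a1, a2), an empirical transition kernel P_hat, and mean rewards."""
    N = np.zeros((H, S, A, B))
    P_counts = np.zeros((H, S, A, B, S))
    r_sum = np.zeros((H, S, A, B))
    for h, s, a1, a2, r, s_next in dataset:
        N[h, s, a1, a2] += 1
        P_counts[h, s, a1, a2, s_next] += 1
        r_sum[h, s, a1, a2] += r
    n = np.maximum(N, 1)                 # guard against division by zero
    P_hat = P_counts / n[..., None]      # rows of unvisited (h,s,a1,a2) stay all-zero
    r_hat = r_sum / n
    return N, P_hat, r_hat
```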

Key Designs

  1. Robust Value Iteration with Pessimistic Estimation

    • Function: Performs worst-case optimization over the transition kernel within the estimated uncertainty set, while applying LCB penalties to state-action pairs with insufficient coverage.
    • Mechanism: \(\hat{Q}_h(s,a,b) = \hat{r}_h(s,a,b) + \min_{P \in \hat{U}(s,a,b)} P \hat{V}_{h+1} - \Gamma_h(s,a,b)\), where \(\hat{U}\) is the estimated uncertainty set and \(\Gamma_h\) is the Bernstein penalty term.
    • Design Motivation: The pessimism principle prevents overly optimistic value estimates in regions with insufficient data (a simplified code sketch of this backup follows the list).
  2. Robust Unilateral Clipped Concentrability

    • Function: Defines a tighter distribution coverage measure \(C_r^*\).
    • Mechanism: Coverage is required only over the occupancy distributions induced by the target (equilibrium) policy and unilateral deviations from it, rather than over all policy pairs, and density ratios are clipped to remove extreme values.
    • Comparison with Prior Measures: Strictly smaller than the standard concentrability coefficient, permitting more relaxed data collection conditions.
  3. Two-Stage Subsampling

    • Function: Randomly partitions the dataset into two disjoint subsets, used separately for uncertainty set construction and value iteration (a minimal split helper is sketched after this list).
    • Design Motivation: Suppresses statistical dependencies between estimation errors and value functions, simplifying the theoretical analysis.
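
To make the backup in design 1 concrete, here is a minimal sketch of one backward step under a TV uncertainty set. It is an illustration under simplifying assumptions, not the paper's released code: the inner minimization uses the greedy mass-shifting solution (exact for a TV ball over the simplex), the stage-game value is taken as a max-min over pure actions rather than a mixed-strategy Nash value, and the penalty \(\Gamma_h\) is assumed precomputed (see the next subsection).

```python
import numpy as np

def tv_worst_case(p_hat, v_next, rho):
    """inf over P' with TV(P', p_hat) <= rho of <P', v_next>:
    greedily move up to rho probability mass from the highest-value
    next states onto the lowest-value next state."""
    p = p_hat.copy()
    dst = int(np.argmin(v_next))          # worst next state for the max-player
    budget = rho
    for src in np.argsort(v_next)[::-1]:  # best next states first
        if src == dst:
            continue
        moved = min(p[src], budget)
        p[src] -= moved
        p[dst] += moved
        budget -= moved
        if budget <= 0:
            break
    return float(p @ v_next)


def pessimistic_robust_backup(r_hat, P_hat, N, Gamma, V_next, rho, H):
    """One backward step h of an RTZ-VI-LCB-style backup (simplified):
    Q_h = r_hat + inf_{P in U_rho} P V_{h+1} - Gamma_h, clipped to [0, H].

    r_hat, N, Gamma: [S, A, B]; P_hat: [S, A, B, S]; V_next: [S].
    """
    S, A, B = r_hat.shape
    Q = np.zeros((S, A, B))
    for s in range(S):
        for a in range(A):
            for b in range(B):
                if N[s, a, b] == 0:
                    continue              # unvisited pairs keep the pessimistic value 0
                worst = tv_worst_case(P_hat[s, a, b], V_next, rho)
                Q[s, a, b] = np.clip(r_hat[s, a, b] + worst - Gamma[s, a, b], 0.0, H)
    # Pure-strategy max-min value as a stand-in for the mixed Nash value of each stage game.
    V = Q.min(axis=2).max(axis=1)
    return Q, V
```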
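
And a small helper for the two-fold split in design 3, again only an illustrative sketch following the description above (the paper's two-stage subsampling may involve additional care with per-state-action counts, which is omitted here):

```python
import numpy as np

def two_fold_split(dataset, seed=0):
    """Randomly partition the offline dataset into two disjoint halves,
    one for building the empirical model / uncertainty set and one for
    running the pessimistic value iteration."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(dataset))
    half = len(dataset) // 2
    return [dataset[i] for i in idx[:half]], [dataset[i] for i in idx[half:]]
```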

Loss & Training

  • Uncertainty set construction: Based on the empirical transition kernel \(\hat{P}_h\) and TV/KL/chi-squared balls.
  • Bernstein penalty: \(\Gamma_h(s,a,b) = c \sqrt{\frac{\widehat{\operatorname{Var}}_{\hat{P}_h}(\hat{V}_{h+1})(s,a,b)}{N_h(s,a,b)}} + \frac{H}{N_h(s,a,b)}\) (constants and logarithmic factors omitted; a small helper computing this penalty is sketched below).
  • Pure offline learning; no online interaction is required.
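
For completeness, a vectorized sketch of a penalty of the form above (constants and log factors dropped; the array shapes follow the earlier sketches and are my assumptions, not the paper's code):

```python
import numpy as np

def bernstein_penalty(P_hat, V_next, N, H, c=1.0):
    """Gamma_h(s,a,b) ~ c * sqrt(Var_{P_hat}[V_{h+1}] / N) + H / N,
    with counts clipped below by 1 to avoid division by zero.

    P_hat: [S, A, B, S]; V_next: [S]; N: [S, A, B]; returns [S, A, B].
    """
    mean = P_hat @ V_next                      # E_{P_hat}[V_{h+1}]
    second = P_hat @ (V_next ** 2)             # E_{P_hat}[V_{h+1}^2]
    var = np.maximum(second - mean ** 2, 0.0)  # empirical variance, clipped at 0
    n = np.maximum(N, 1)
    return c * np.sqrt(var / n) + H / n
```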

Key Experimental Results

Sample Complexity Comparison

Main Results

| Method | Sample Complexity | State Dependence | Action Dependence |
| --- | --- | --- | --- |
| Prior SOTA | \(O(H^5 S^2 AB / \varepsilon^2)\) | \(S^2\) | \(AB\) |
| RTZ-VI-LCB (this work) | \(O(C_r^* H^4 S (A+B) / \varepsilon^2)\) | \(S\) | \(A+B\) |
| Information-theoretic lower bound | \(\Omega(C_r^* H^3 S (A+B) / \varepsilon^2)\) | \(S\) | \(A+B\) |

Key Findings

  • The sample complexity matches the information-theoretic lower bound in \(S\) and \(A+B\) up to an \(H\) factor.
  • The algorithm remains near-optimal in both the high-uncertainty and low-uncertainty regimes of the radius \(\rho\).
  • The robust unilateral clipped concentrability coefficient is strictly smaller than the standard coefficient.

Highlights & Insights

  • First near-optimal sample complexity for offline RTZMGs: Achieves the information-theoretically optimal dependence on \(S\) and \(A+B\).
  • New distribution measure: Robust unilateral clipped concentrability is a general-purpose tool potentially applicable to other offline RL problems.
  • Unified treatment of high and low uncertainty regimes: A single algorithmic framework achieves optimal performance in both extremes.

Limitations & Future Work

  • Tabular setting only: Assumes a finite state-action space; function approximation is not considered.
  • Gap of an \(H\) factor: Whether the remaining gap relative to the lower bound can be closed remains open.
  • Restriction to two-player zero-sum games: Although extensions to multi-player general-sum games are discussed, the core analysis is limited to the two-player zero-sum setting.
  • Limited empirical validation: The contribution is primarily theoretical, with minimal experimental verification.

Comparison with Related Work

  • vs. Blanchet et al. (2024): Their sample complexity scales as \(O(S^2 AB)\); this work improves the dependence to \(O(S(A+B))\).
  • vs. Shi et al. (2022): They analyze offline robust single-agent RL; this work extends the framework to two-player zero-sum games.
  • vs. Cui & Du (2022): They study offline (non-robust) Markov games; this work incorporates environmental robustness.

Rating

  • Novelty: ⭐⭐⭐⭐ New distribution measure and optimal sample complexity bounds
  • Experimental Thoroughness: ⭐⭐ Primarily theoretical; limited empirical evaluation
  • Writing Quality: ⭐⭐⭐⭐ Theorems are clearly stated; technically rigorous
  • Value: ⭐⭐⭐⭐ Significant theoretical advancement for offline robust RL

Additional Remarks

  • The theoretical analysis framework and technical tools developed in this paper offer insights for adjacent research areas.
  • The core contribution lies in a deeper theoretical understanding, providing a foundation for subsequent practical improvements.
  • The paper is methodologically complementary to other NeurIPS 2025 papers published concurrently.
  • The paper's exposition of problem motivation and technical approach is worth studying.
  • Readers are encouraged to consult the appendix for complete proofs and technical details.

Further Reading

  • This research direction is closely related to several active topics in the AI community.
  • The rigor of the theoretical results provides a solid mathematical foundation for subsequent empirical studies.
  • The methodology can be generalized to broader problem settings.
  • Follow-up work from the same research group is worth monitoring.
  • For newcomers to the theoretical direction, the proof sketch section provides an excellent technical roadmap.
  • From a methodological perspective, the paper demonstrates how careful mathematical modeling can reduce complex problems to analytically tractable frameworks.

Technical Notes

  • The proof of the core theorem relies on multi-step refined concentration inequality analysis and properties of the Bellman equation.
  • The choice between pessimism and optimism in algorithm design is a central consideration in offline RL theory.
  • Logarithmic factors hidden in the theoretical bounds are non-negligible in practical applications.