# Distributionally Robust Online Markov Game with Linear Function Approximation
- Conference: AAAI 2026
- arXiv: 2511.07831
- Code: None
- Area: Multi-Agent Reinforcement Learning / Robust Game Theory
- Keywords: Markov Game, Distributional Robustness, Linear Function Approximation, Coarse Correlated Equilibrium, Sample Complexity
## TL;DR
This paper studies online distributionally robust Markov games with linear function approximation. It is the first to identify the hardness of learning in this setting, and it proposes the DR-CCE-LSI algorithm, whose regret is minimax-optimal in the feature dimension \(d\) under a specific feature-mapping condition.
## Background & Motivation

### State of the Field
Multi-agent reinforcement learning (MARL) has been extensively studied within game-theoretic frameworks, with Markov games serving as a central model. Linear function approximation has been introduced to handle large state spaces. Some progress has been made on single-agent robust RL and tabular robust games.
### Limitations of Prior Work
- The sim-to-real gap is more severe in MARL: equilibrium solutions are highly sensitive to small environmental perturbations.
- Online settings are understudied: existing robust Markov game algorithms primarily target generative-model or offline settings.
- Linear approximation and robustness have not been combined: the online robust general-sum game setting with linear function approximation remains entirely unexplored.
### Root Cause
Support shift: samples collected online may fail to cover all trajectories under worst-case environments, causing algorithms to perform no better than random guessing on unobserved states. This is the fundamental obstacle to robust online learning.
### Paper Goals
To design provably sample-efficient algorithms for finding approximate robust coarse correlated equilibria (CCEs) in general-sum Markov games with linear function approximation and \(d\)-rectangular uncertainty sets.
### Starting Point
- First establish the impossibility of online learning via a constructed counterexample (lower bound \(\Omega(\sigma \cdot HK)\)).
- Introduce the vanishing minimum assumption to circumvent the support shift problem.
- Design a least-squares value iteration algorithm with agent-specific bonus terms.
## Core Idea
Under the vanishing minimum assumption, exploit the \(d\)-rectangular structure to preserve linearity, and achieve sample-efficient robust equilibrium learning via coordinate-wise ridge regression and agent-specific exploration bonuses.
## Method

### Overall Architecture
The DR-CCE-LSI (Distributionally Robust CCE Least-Squares Iteration) algorithm proceeds as follows in each episode \(k\):

1. Each player \(i\) estimates the robust action-value function \(Q_{i,h}^k\) via ridge regression on historical data (a sketch follows this list).
2. A CCE of the \(n\)-player matrix game induced by the estimated \(Q\) matrices is computed via the Find-CCE subroutine.
3. The CCE policy is executed to interact with the environment and collect new data.
4. The feature covariance matrix \(\Lambda_h^{k+1}\) is updated.
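A minimal sketch of the regression step (step 1), assuming illustrative names and a generic scalar regularizer; the paper's actual pseudocode may differ:

```python
import numpy as np

def ridge_weights(Phi, targets, lam=1.0):
    """Step 1 of DR-CCE-LSI, sketched: ridge regression for one player and
    one step h. Phi is the (k-1, d) matrix of features phi(s_h^tau, a_h^tau)
    from past episodes; targets holds the regression targets built from the
    clipped next-step robust values [V_{i,h+1}]_{alpha_j}. The names and the
    regularizer lam are assumptions for illustration.
    """
    d = Phi.shape[1]
    Lambda = Phi.T @ Phi + lam * np.eye(d)   # feature covariance Lambda_h^k
    w_hat = np.linalg.solve(Lambda, Phi.T @ targets)
    return w_hat, Lambda
```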
### Key Design 1: \(d\)-Rectangular Uncertainty Set
Function: Define an appropriate robust uncertainty set compatible with the linear MDP structure.
Mechanism: Assume the transition kernel \(P_h(s'|s, \boldsymbol{a}) = \langle \phi_{s\boldsymbol{a}}, \mu_h^0(s') \rangle\) has a linear structure. The \(d\)-rectangular uncertainty set perturbs each coordinate of \(\mu_h\) independently:

$$\mathcal{U}_{TV}^{\sigma_i}(\mu_h^0) = \bigotimes_{j \in [d]} \left\{\mu_{h,j} : D_{TV}\big(\mu_{h,j} \,\|\, \mu_{h,j}^0\big) \leq \sigma_i\right\}$$
Design Motivation: The \(d\)-rectangular structure ensures the robust action-value function retains the linear form \(Q_{i,h}^{\pi,\sigma}(s,\boldsymbol{a}) = \langle \phi_{s\boldsymbol{a}}, w_{i,h} \rangle\), to which the algorithm's estimate adds a bonus term; this avoids the completeness assumptions required under general function approximation.
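To see why \(d\)-rectangularity preserves linearity: the inner infimum factorizes over coordinates, so the robust backup is again an inner product of \(\phi\) with a coordinate-wise worst-case weight vector. A minimal sketch, treating each factor \(\mu_{h,j}\) as a probability vector over states for illustration and solving each one-dimensional TV problem greedily:

```python
import numpy as np

def tv_worst_case(mu0, V, sigma):
    """Minimize <mu, V> over {mu in the simplex : D_TV(mu, mu0) <= sigma}:
    move up to sigma probability mass from the highest-value states onto
    the value-minimizing state."""
    order = np.argsort(V)[::-1]            # states by value, descending
    mu, budget = mu0.astype(float), sigma
    for s in order:
        take = min(mu[s], budget)
        mu[s] -= take
        budget -= take
        if budget <= 0:
            break
    mu[np.argmin(V)] += sigma - budget     # relocate the removed mass
    return float(mu @ V)

def robust_linear_backup(phi, mu0, V, sigma):
    """Because the uncertainty set factorizes over coordinates j in [d],
    inf_P E_P[V] = sum_j phi_j * (inf over the j-th TV ball), i.e. the
    robust value is still <phi, w> for a coordinate-wise worst-case w.
    mu0: (d, S) nominal factors mu_{h,j}^0; phi: (d,) nonnegative features."""
    w = np.array([tv_worst_case(mu0[j].copy(), V, sigma) for j in range(len(mu0))])
    return phi @ w
```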
### Key Design 2: Vanishing Minimum Assumption and Equivalent Optimization
Function: Avoid the support shift problem by assuming \(\min_s V_{i,h}^{\pi,\sigma}(s) = 0\).
Mechanism: Under this assumption, the robust optimization problem simplifies to

$$\inf_{P \in \mathcal{U}_{TV}^{\sigma_i}(P_{h,s,\boldsymbol{a}}^0)} \mathbb{E}_P[V] = \sigma_i \, \mathbb{E}_{\tilde{P}_{h,s,\boldsymbol{a}}}[V],$$

which guarantees that the support of the worst-case transition kernel stays within that of the nominal kernel (equation (12) in Proposition 4.3).
Design Motivation: This assumption is equivalent to adding an absorbing failure state \(s_f\) (with zero reward) to the Markov game, which is natural in many practical scenarios (e.g., games where a player may fail at each step).
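The simplification can be sanity-checked against the TV duality standard in the robust-RL literature, \(\inf_{P} \mathbb{E}_P[V] = \max_{\alpha} \{\mathbb{E}_{P^0}[[V]_\alpha] - \sigma\alpha\}\) when \(\min_s V(s) = 0\), which is also where the clipped targets \([V]_{\alpha_j}\) of Key Design 3 come from. A small numerical check of that duality (illustrative; this is not the paper's equation (12)):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.uniform(0.0, 1.0, size=6)
V[0] = 0.0                                 # enforce the vanishing minimum
P0 = rng.dirichlet(np.ones(6))
sigma = 0.2

# dual form: max over alpha of E_{P0}[min(V, alpha)] - sigma * alpha
alphas = np.linspace(0.0, V.max(), 1001)
dual = max(P0 @ np.minimum(V, a) - sigma * a for a in alphas)

# exact worst case: shift sigma mass from the largest V onto the smallest
order = np.argsort(V)[::-1]
P, budget = P0.copy(), sigma
for s in order:
    take = min(P[s], budget)
    P[s] -= take
    budget -= take
P[np.argmin(V)] += sigma - budget
print(np.isclose(dual, P @ V, atol=1e-2))  # True up to the alpha grid
```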
### Key Design 3: Agent-Specific Bonus Terms
Function: Design exploration bonuses to guide sufficient policy exploration.
Mechanism: The bonus for player \(i\) is

$$\Gamma_{h,k}^i(s,\boldsymbol{a}) = \beta_i \sum_{j=1}^d \sqrt{\phi_j(s,\boldsymbol{a}) \, \mathbf{1}_j^\top (\Lambda_h^k)^{-1} \mathbf{1}_j \, \phi_j(s,\boldsymbol{a})},$$

where \(\beta_i = \min\{H, 1/\sigma_i\} \sqrt{c_\beta \, n d \log(ndHK/\delta)}\).
Design Motivation: Unlike the single UCB bonus in non-robust settings, the robust setting requires \(d\) coordinate-wise confidence upper bounds. Each coordinate corresponds to an independent ridge regression task targeting \([V_{i,h+1}^k(s_{h+1}^\tau)]_{\alpha_j}\). Furthermore, the factor \(\min\{H, 1/\sigma_i\}\) in \(\beta_i\) reflects each player's individual robustness preference.
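A direct transcription of the bonus as a sketch (array names assumed; \(c_\beta\) is the unspecified absolute constant from the bound):

```python
import numpy as np

def dr_bonus(phi, Lambda_inv, H, sigma_i, n, d, K, delta, c_beta=1.0):
    """Agent-specific bonus Gamma_{h,k}^i(s, a). Since 1_j^T Lambda^{-1} 1_j
    is the j-th diagonal entry of Lambda^{-1}, the sum reduces to
    sum_j phi_j * sqrt([Lambda^{-1}]_{jj}) for nonnegative features phi_j."""
    beta_i = min(H, 1.0 / sigma_i) * np.sqrt(
        c_beta * n * d * np.log(n * d * H * K / delta))
    return beta_i * np.sum(phi * np.sqrt(np.diag(Lambda_inv)))
```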
### Key Design 4: Find-CCE Subroutine
Function: Address the instability (non-Lipschitz continuity) of CCE with respect to the \(Q\)-value matrix.
Mechanism: First locate the nearest surrogate game matrix \(\tilde{Q}\) within an \(\epsilon\)-cover of the \(Q\)-value function class, then compute an exact CCE of \(\tilde{Q}\). This ensures that for any two neighboring \(Q\)-values in the cover, the deviation in value functions when using the same CCE is bounded by \(2\epsilon\).
Design Motivation: Directly computing CCE from \(Q\) functions causes the covering number of the value function class to explode (Lemma 4.4 gives a counterexample showing that a small perturbation to \(Q\) can cause a CCE value difference of 1). The "discretize first, then solve equilibrium" strategy resolves this technical difficulty.
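For intuition, the "compute an exact CCE of a fixed matrix game" step can be written as a linear program over joint-action distributions. A sketch for small action spaces; the \(\epsilon\)-cover snapping step is omitted, and the welfare objective is an arbitrary tie-breaking choice, not the paper's:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def exact_cce(Q):
    """Q: (n, A_1, ..., A_n) per-player payoff tensors. Returns a joint
    distribution mu satisfying, for every player i and deviation a_i',
    E_mu[Q_i(a_i', a_{-i})] <= E_mu[Q_i(a)]  (the CCE constraints)."""
    n, sizes = Q.shape[0], Q.shape[1:]
    joint = list(product(*(range(m) for m in sizes)))   # row-major order
    A_ub, b_ub = [], []
    for i in range(n):
        for dev in range(sizes[i]):
            row = np.empty(len(joint))
            for k, a in enumerate(joint):
                a_dev = a[:i] + (dev,) + a[i + 1:]
                row[k] = Q[(i,) + a_dev] - Q[(i,) + a]  # deviation gain
            A_ub.append(row)
            b_ub.append(0.0)
    # any feasible point is a CCE; maximize total welfare to pick one
    c = -sum(Q[i].reshape(-1) for i in range(n))        # matches joint order
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                  A_eq=np.ones((1, len(joint))), b_eq=[1.0],
                  bounds=(0.0, 1.0))
    return res.x.reshape(sizes)
```

In the algorithm, this exact solve is applied to the snapped surrogate \(\tilde{Q}\) from the \(\epsilon\)-cover rather than to the raw estimate.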
## Loss & Training
The regret is defined as the maximum over players of the cumulative gain from a unilateral best response:

$$\text{Regret}(K) = \max_{i \in [n]} \sum_{k=1}^K \left[V_{i,1}^{\star, \pi_{-i}^k, \sigma}(s_1^k) - V_{i,1}^{\pi^k, \sigma}(s_1^k)\right]$$
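As a literal transcription (array names assumed):

```python
import numpy as np

def robust_cce_regret(V_best_response, V_joint):
    """Regret(K): max over players i of the cumulative gap between the
    robust best-response value V_{i,1}^{*, pi_{-i}^k, sigma}(s_1^k) and
    the achieved value V_{i,1}^{pi^k, sigma}(s_1^k).
    Both inputs are (n_players, K) arrays."""
    return float(np.max(np.sum(V_best_response - V_joint, axis=1)))
```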
## Key Experimental Results

### Main Results
In a simulated game with 5 states, 2 players, and \(H=3\) (structured as in Figure 1 of the paper), DR-CCE-LSI is compared against the non-robust baseline NQOVI:
| Perturbation level \(\rho\) | DR-CCE-LSI (Ours) | NQOVI |
|---|---|---|
| 0.0 (no shift) | Slightly lower (optimality sacrificed for robustness) | Better |
| 0.1 | Comparable | Begins to degrade |
| 0.2+ | Significantly better | Performance degrades sharply |
### Theoretical Comparison
| Setting | Method | Upper Bound | Lower Bound |
|---|---|---|---|
| Non-robust single-agent | He et al. (2023) | \(\sqrt{d^2H^3K}\) | \(\sqrt{d^2H^3K}\) |
| Non-robust multi-player | Cisneros et al. (2023) | \(\sqrt{d^3H^5K}\) | \(\sqrt{d^2H^3K}\) |
| Robust multi-player (Ours) | DR-CCE-LSI | \(dH\min\{H,1/\sigma\}\sqrt{K}\) | \(dH^{1/2}\min\{H,1/\sigma\}\sqrt{K}\) |
### Key Findings
- The proposed upper bound matches that of single-agent robust settings in \(d\) and \(K\), achieving minimax optimality with respect to feature dimension \(d\).
- The factor \(\min\{H, 1/\min_i \sigma_i\}\) in the regret bound characterizes the cost of robustness.
- The advantage of DR-CCE-LSI becomes more pronounced as the environmental perturbation level increases.
- The impossibility theorem (Theorem 4.1) establishes that without additional assumptions, the regret is \(\Omega(\sigma \cdot HK)\) (linear growth, hence not learnable).
## Highlights & Insights
- First theoretical result: The first sample-efficient algorithm for online robust linear Markov games.
- Complete picture of impossibility and feasibility: the paper first proves a linear-regret lower bound without additional assumptions, then makes the problem tractable via the vanishing minimum assumption.
- Interesting observation on "risk-preference consistency": inconsistent robustness levels among players (a large spread in \(\sigma_i\)) lead to reduced sample efficiency, suggesting that shared risk awareness is beneficial in cooperative learning.
- Technical contribution: The Find-CCE subroutine resolves the fundamental technical challenge of CCE instability with respect to \(Q\)-values.
## Limitations & Future Work
- A gap of \(H^{1/2}\) in the \(H\)-dependence between upper and lower bounds remains, potentially addressable via variance-weighted ridge regression.
- The vanishing minimum assumption, while reasonable, restricts the scope of applicability.
- Only TV-divergence-based \(d\)-rectangular uncertainty sets are considered; extensions to \(\chi^2\) and KL divergences remain unexplored.
- The feature mapping must satisfy non-degeneracy and positive definiteness conditions (Corollary 5.3), imposing additional requirements on feature design.
- Experiments are conducted only in small-scale simulated environments, lacking empirical evaluation in more complex settings.
- Robust games in the decentralized learning setting remain an open problem.
## Related Work & Insights
- Liu et al. (2024): Single-agent online robust linear MDP with upper bound \(O(dH\min\{1/\sigma, H\}\sqrt{K})\); the present work generalizes this to the multi-player setting.
- Cisneros et al. (2023): The NQOVI algorithm for non-robust online linear games with upper bound \(O(\sqrt{d^3H^5K})\).
- Jiao et al. (2024): Minimax-optimal algorithm for robust games in the generative model setting.
- Insight: The additional challenges of robustness in game settings (equilibrium instability + support shift) provide a rich space for theoretical investigation.
## Rating
⭐⭐⭐⭐ (4/5)
Strengths: Novel problem formulation, complete theoretical analysis (impossibility + feasibility + minimax optimality), and solid technical contributions.
Weaknesses: The gap in \(H\)-dependence remains unresolved, experiments are overly simplistic, and empirical validation of the vanishing minimum assumption is insufficient.