
Distributionally Robust Online Markov Game with Linear Function Approximation

Conference: AAAI 2026
arXiv: 2511.07831
Code: None
Area: Multi-Agent Reinforcement Learning / Robust Game Theory
Keywords: Markov Game, Distributional Robustness, Linear Function Approximation, Coarse Correlated Equilibrium, Sample Complexity

TL;DR

This paper studies online distributionally robust Markov games with linear function approximation. It is the first work to establish the hardness of learning in this setting, and it proposes the DR-CCE-LSI algorithm, which achieves sample complexity that is minimax-optimal in the feature dimension \(d\) under a specific feature-mapping condition.

Background & Motivation

State of the Field

Multi-agent reinforcement learning (MARL) has been extensively studied within game-theoretic frameworks, with Markov games serving as a central model. Linear function approximation has been introduced to handle large state spaces. Some progress has been made on single-agent robust RL and tabular robust games.

Limitations of Prior Work

The sim-to-real gap is more severe in MARL: equilibrium solutions are highly sensitive to small environmental perturbations.

Online settings are understudied: existing robust Markov game algorithms primarily target generative model or offline settings.

Linear approximation and robustness have not been combined: the online robust general-sum game setting with linear function approximation remains entirely unexplored.

Root Cause

Support shift: samples collected online may fail to cover all trajectories under worst-case environments, causing algorithms to perform no better than random guessing on unobserved states. This is the fundamental obstacle to robust online learning.

Paper Goals

To design provably sample-efficient algorithms for finding approximate robust coarse correlated equilibria (CCE) in general-sum games with linear function approximation and \(d\)-rectangular uncertainty sets.

Starting Point

  1. First establish the impossibility of online learning via a constructed counterexample (lower bound \(\Omega(\sigma \cdot HK)\)).
  2. Introduce the vanishing minimum assumption to circumvent the support shift problem.
  3. Design a least-squares value iteration algorithm with agent-specific bonus terms.

Core Idea

Under the vanishing minimum assumption, exploit the \(d\)-rectangular structure to preserve linearity, and achieve sample-efficient robust equilibrium learning via coordinate-wise ridge regression and agent-specific exploration bonuses.

Method

Overall Architecture

The DR-CCE-LSI (Distributionally Robust CCE Least-Squares Iteration) algorithm proceeds as follows in each episode \(k\):

  1. Each player \(i\) estimates the robust action-value function \(Q_{i,h}^k\) via ridge regression on historical data.
  2. A CCE of the \(n\)-player matrix game induced by the estimated \(Q\) matrices is computed via the Find-CCE subroutine.
  3. The CCE policy is executed to interact with the environment and collect new data.
  4. The feature covariance matrix \(\Lambda_h^{k+1}\) is updated.
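
For concreteness, here is a minimal Python sketch of this per-episode loop. The helpers `ridge_estimate_Q` and `find_cce` are placeholders standing in for the paper's coordinate-wise ridge regression (Key Design 3) and the Find-CCE subroutine (Key Design 4); the toy sizes and random data are ours, not the paper's.

```python
import numpy as np

# Toy sizes for illustration only (not the paper's experiment).
n, H, K, d, A = 2, 3, 10, 4, 2   # players, horizon, episodes, feature dim, actions per player
rng = np.random.default_rng(0)

# Simplex-valued toy feature map over joint actions (state dependence omitted).
phi = rng.random((A, A, d))
phi /= phi.sum(axis=-1, keepdims=True)
Lambda = [np.eye(d) for _ in range(H)]   # regularized feature covariance per step

def ridge_estimate_Q(i, h, Lam):
    """Placeholder for the paper's coordinate-wise ridge regression plus
    agent-specific bonus; returns a toy Q matrix of the right shape."""
    return rng.random((A, A))

def find_cce(Q_list):
    """Placeholder for the Find-CCE subroutine; returns a joint action
    distribution (uniform here, purely as a stand-in)."""
    return np.full((A, A), 1.0 / (A * A))

for k in range(K):
    # Steps 1-2: backward pass estimating robust Q's and a CCE per step.
    pi = [None] * H
    for h in reversed(range(H)):
        Q = [ridge_estimate_Q(i, h, Lambda[h]) for i in range(n)]
        pi[h] = find_cce(Q)
    # Steps 3-4: roll out the CCE policy and update the covariances.
    for h in range(H):
        a = np.unravel_index(rng.choice(A * A, p=pi[h].ravel()), (A, A))
        x = phi[a]                        # observed feature vector phi(s, a)
        Lambda[h] += np.outer(x, x)
```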

Key Design 1: \(d\)-Rectangular Uncertainty Set

Function: Define an appropriate robust uncertainty set compatible with the linear MDP structure.

Mechanism: Assume the transition kernel \(P_h(s'|s, \boldsymbol{a}) = \langle \phi_{s\boldsymbol{a}}, \mu_h^0(s') \rangle\) has a linear structure. The \(d\)-rectangular uncertainty set perturbs each coordinate of \(\mu_h\) independently:

$$\mathcal{U}_{TV}^{\sigma_i}(\mu_h^0) = \bigotimes_{j \in [d]} \left\{ \mu_{h,j} : D_{TV}(\mu_{h,j} \,\|\, \mu_{h,j}^0) \leq \sigma_i \right\}$$

Design Motivation: The \(d\)-rectangular structure ensures the robust action-value function retains a linear form \(Q_{i,h}^{\pi,\sigma}(s,\boldsymbol{a}) = \langle \phi_{s\boldsymbol{a}}, w_{i,h} \rangle + \text{bonus}\), avoiding the completeness assumptions required in general function approximation.
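
A small numerical illustration of why \(d\)-rectangularity preserves linearity: each factor \(\mu_{h,j}\) is perturbed independently within its own TV ball, and the robust backup \(\mathbb{E}_P[V]\) remains an inner product \(\langle \phi, w \rangle\). The greedy per-coordinate perturbation below is a hedged sketch (it covers only the case \(\sigma \leq \mu_j(\arg\max V)\)); all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, S, sigma = 4, 5, 0.1                  # toy feature dim, states, TV radius

# Nominal factors: each coordinate mu0_j is a distribution over next states.
mu0 = rng.random((d, S))
mu0 /= mu0.sum(axis=1, keepdims=True)
phi = rng.random(d); phi /= phi.sum()    # simplex-valued feature of one (s, a)
V = rng.random(S)                        # arbitrary next-state value function

# d-rectangularity: the adversary perturbs each coordinate independently.
# Greedy worst case per coordinate: move mass sigma from the largest-V state
# to the smallest-V state (moving mass m changes TV distance by exactly m).
hi, lo = np.argmax(V), np.argmin(V)
mu = mu0.copy()
for j in range(d):
    m = min(sigma, mu[j, hi])
    mu[j, hi] -= m
    mu[j, lo] += m

# Linearity is preserved: E_P[V] = <phi, w> with w_j = E_{mu_j}[V].
w = mu @ V
print(float(phi @ w))                    # robust backup, still linear in phi
```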

Key Design 2: Vanishing Minimum Assumption and Equivalent Optimization

Function: Avoid the support shift problem by assuming \(\min_s V_{i,h}^{\pi,\sigma}(s) = 0\).

Mechanism: Under this assumption, the robust optimization problem simplifies to:

$$\inf_{P \in \mathcal{U}_{TV}^{\sigma_i}(P_{h,s,\boldsymbol{a}}^0)} \mathbb{E}_P[V] = \sigma_i \, \mathbb{E}_{\tilde{P}_{h,s,\boldsymbol{a}}}[V]$$

This guarantees that the support of the worst-case transition kernel does not exceed that of the nominal kernel (equation (12) in Proposition 4.3).

Design Motivation: This assumption is equivalent to adding an absorbing failure state \(s_f\) (with zero reward) to the Markov game, which is natural in many practical scenarios (e.g., games where a player may fail at each step).
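
To make the support-shift intuition concrete, the following sketch solves the inner problem \(\inf_{P : D_{TV}(P \| P^0) \leq \sigma} \mathbb{E}_P[V]\) by brute force as a linear program, with an absorbing failure state \(s_f\) carrying zero value and zero nominal mass. This is our own illustrative construction, not code from the paper: the worst case shifts mass onto \(s_f\), which remains harmless to estimate precisely because \(V(s_f) = 0\).

```python
import numpy as np
from scipy.optimize import linprog

# Toy nominal next-state distribution over 4 states plus an absorbing
# failure state s_f with V(s_f) = 0 and zero nominal mass (illustrative).
p0 = np.array([0.4, 0.3, 0.2, 0.1, 0.0])
V = np.array([1.0, 0.8, 0.5, 0.2, 0.0])  # min_s V(s) = 0, attained at s_f
sigma = 0.15
S = len(p0)

# Variables z = (p, t): minimize V @ p subject to
#   sum(p) = 1,  |p - p0| <= t componentwise,  sum(t) <= 2*sigma  (TV = L1/2).
c = np.concatenate([V, np.zeros(S)])
A_ub = np.block([
    [np.eye(S), -np.eye(S)],               #  p - t <= p0
    [-np.eye(S), -np.eye(S)],              # -p - t <= -p0
    [np.zeros((1, S)), np.ones((1, S))],   #  sum(t) <= 2*sigma
])
b_ub = np.concatenate([p0, -p0, [2 * sigma]])
A_eq = np.concatenate([np.ones(S), np.zeros(S)])[None, :]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, None)] * (2 * S))
# The optimizer moves mass sigma from the highest-V state onto s_f, so the
# worst case lives on the nominal support plus the known failure state.
print(res.fun)   # E_{p0}[V] - sigma * max(V) = 0.76 - 0.15 = 0.61
```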

Key Design 3: Agent-Specific Bonus Terms

Function: Design exploration bonuses to guide sufficient policy exploration.

Mechanism: The bonus for player \(i\) is:

$$\Gamma_{h,k}^i(s,\boldsymbol{a}) = \beta_i \sum_{j=1}^d \sqrt{\phi_j(s,\boldsymbol{a}) \, \mathbf{1}_j^\top (\Lambda_h^k)^{-1} \mathbf{1}_j \, \phi_j(s,\boldsymbol{a})}$$

where \(\beta_i = \min\{H, 1/\sigma_i\} \sqrt{c_\beta n d \log(ndHK/\delta)}\).

Design Motivation: Unlike the single UCB bonus in non-robust settings, the robust setting requires \(d\) coordinate-wise confidence upper bounds. Each coordinate corresponds to an independent ridge regression task targeting \([V_{i,h+1}^k(s_{h+1}^\tau)]_{\alpha_j}\). Furthermore, the factor \(\min\{H, 1/\sigma_i\}\) in \(\beta_i\) reflects each player's individual robustness preference.
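
The bonus is cheap to compute once \((\Lambda_h^k)^{-1}\) is available, since \(\mathbf{1}_j^\top (\Lambda_h^k)^{-1} \mathbf{1}_j\) is just the \(j\)-th diagonal entry. A hedged sketch (\(\beta_i\) follows the formula above; the toy dimensions and data are ours):

```python
import numpy as np

def bonus(phi_sa, Lam_inv, beta_i):
    """Gamma_i(s, a) = beta_i * sum_j sqrt(phi_j^2 * [Lam^{-1}]_{jj}),
    since 1_j^T Lam^{-1} 1_j is the j-th diagonal entry of Lam^{-1}."""
    return beta_i * np.sum(np.sqrt(phi_sa**2 * np.diag(Lam_inv)))

# Toy instantiation.
d, H, K, n, sigma_i, delta, c_beta = 4, 3, 100, 2, 0.1, 0.05, 1.0
beta_i = min(H, 1 / sigma_i) * np.sqrt(c_beta * n * d * np.log(n * d * H * K / delta))

rng = np.random.default_rng(2)
Lam = np.eye(d) + sum(np.outer(x, x) for x in rng.random((K, d)))
phi_sa = rng.random(d); phi_sa /= phi_sa.sum()
print(bonus(phi_sa, np.linalg.inv(Lam), beta_i))
```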

Key Design 4: Find-CCE Subroutine

Function: Address the instability (non-Lipschitz continuity) of CCE with respect to the \(Q\)-value matrix.

Mechanism: First locate the nearest surrogate game matrix \(\tilde{Q}\) within an \(\epsilon\)-cover of the \(Q\)-value function class, then compute the exact CCE of \(\tilde{Q}\). This ensures that for any two neighboring \(Q\)-values in the cover, the deviation in value functions when using the same CCE is bounded within \(2\epsilon\).

Design Motivation: Directly computing CCE from \(Q\) functions causes the covering number of the value function class to explode (Lemma 4.4 gives a counterexample showing that a small perturbation to \(Q\) can cause a CCE value difference of 1). The "discretize first, then solve equilibrium" strategy resolves this technical difficulty.
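
A minimal sketch of the discretize-then-solve idea for two players, assuming a uniform grid as the \(\epsilon\)-cover and computing an exact CCE of the rounded game as an LP feasibility problem over joint action distributions. Both helpers are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def round_to_cover(Q, eps):
    """Snap Q entrywise to a uniform eps-grid -- a stand-in for locating
    the nearest neighbor in an eps-cover of the Q-function class."""
    return np.round(Q / eps) * eps

def exact_cce(Q1, Q2):
    """Exact CCE of a 2-player matrix game as an LP feasibility problem:
    find a joint distribution x over (a1, a2) such that neither player
    gains by deviating to any fixed action."""
    A1, A2 = Q1.shape
    rows = []
    for dev in range(A1):                 # player 1 deviates to `dev`
        rows.append((Q1[dev, :][None, :] - Q1).ravel())
    for dev in range(A2):                 # player 2 deviates to `dev`
        rows.append((Q2[:, dev][:, None] - Q2).ravel())
    res = linprog(np.zeros(A1 * A2),      # feasibility: zero objective
                  A_ub=np.array(rows), b_ub=np.zeros(len(rows)),
                  A_eq=np.ones((1, A1 * A2)), b_eq=[1.0],
                  bounds=[(0, None)] * (A1 * A2))
    return res.x.reshape(A1, A2)

rng = np.random.default_rng(3)
Q1, Q2 = rng.random((2, 3, 3))            # toy 3x3 payoff matrices
x = exact_cce(round_to_cover(Q1, 0.1), round_to_cover(Q2, 0.1))
print(x)                                  # a CCE of the rounded game
```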

Loss & Training

The regret is defined as the maximum over players of the cumulative gain from unilateral deviation:

$$\text{Regret}(K) = \max_{i \in [n]} \sum_{k=1}^K \left[ V_{i,1}^{\star, \pi_{-i}^k, \sigma}(s_1^k) - V_{i,1}^{\pi^k, \sigma}(s_1^k) \right]$$
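
As a sanity check on the definition, a toy computation of \(\text{Regret}(K)\) from made-up value arrays, where `V_star` plays the role of the best-response values \(V_{i,1}^{\star,\pi_{-i}^k,\sigma}(s_1^k)\):

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 2, 5
V_pi = rng.random((n, K))                 # V_{i,1}^{pi^k, sigma}(s_1^k)
V_star = V_pi + 0.1 * rng.random((n, K))  # best-response values (>= V_pi)
regret = max(float(np.sum(V_star[i] - V_pi[i])) for i in range(n))
print(regret)                             # max over players of summed gaps
```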

Key Experimental Results

Main Results

In a simulated game with 5 states, 2 players, and \(H=3\) (structured as in Figure 1 of the paper), DR-CCE-LSI is compared against the non-robust baseline NQOVI:

| Perturbation level \(\rho\) | DR-CCE-LSI (Ours) | NQOVI |
| --- | --- | --- |
| 0.0 (no shift) | Slightly lower (optimality sacrificed for robustness) | Better |
| 0.1 | Comparable | Begins to degrade |
| 0.2+ | Significantly better | Degrades sharply |

Theoretical Comparison

| Setting | Method | Upper Bound | Lower Bound |
| --- | --- | --- | --- |
| Non-robust single-agent | He et al. (2023) | \(\sqrt{d^2H^3K}\) | \(\sqrt{d^2H^3K}\) |
| Non-robust multi-player | Cisneros et al. (2023) | \(\sqrt{d^3H^5K}\) | \(\sqrt{d^2H^3K}\) |
| Robust multi-player (Ours) | DR-CCE-LSI | \(dH\min\{H,1/\sigma\}\sqrt{K}\) | \(dH^{1/2}\min\{H,1/\sigma\}\sqrt{K}\) |

Key Findings

  1. The proposed upper bound matches that of single-agent robust settings in \(d\) and \(K\), achieving minimax optimality with respect to feature dimension \(d\).
  2. The factor \(\min\{H, 1/\min_i \sigma_i\}\) in the regret bound characterizes the cost of robustness.
  3. The advantage of DR-CCE-LSI becomes more pronounced as the environmental perturbation level increases.
  4. The impossibility theorem (Theorem 4.1) establishes that without additional assumptions, the regret is \(\Omega(\sigma \cdot HK)\) (linear growth, hence not learnable).

Highlights & Insights

  1. First theoretical result: The first sample-efficient algorithm for online robust linear Markov games.
  2. Complete picture of impossibility and feasibility: The paper first proves that the problem is unlearnable in general (a linear-regret lower bound), then makes it tractable via the vanishing minimum assumption.
  3. Interesting observation on "risk preference consistency": Inconsistent robustness levels among players (large variance in the \(\sigma_i\)) lead to reduced sample efficiency, suggesting that shared risk awareness is beneficial in cooperative learning.
  4. Technical contribution: The Find-CCE subroutine resolves the fundamental technical challenge of CCE instability with respect to \(Q\)-values.

Limitations & Future Work

  1. A gap of \(H^{1/2}\) in the \(H\)-dependence between upper and lower bounds remains, potentially addressable via variance-weighted ridge regression.
  2. The vanishing minimum assumption, while reasonable, restricts the scope of applicability.
  3. Only TV-divergence-based \(d\)-rectangular uncertainty sets are considered; extensions to \(\chi^2\) and KL divergences remain unexplored.
  4. The feature mapping must satisfy non-degeneracy and positive definiteness conditions (Corollary 5.3), imposing additional requirements on feature design.
  5. Experiments are conducted only in small-scale simulated environments, lacking empirical evaluation in more complex settings.
  6. Robust games in the decentralized learning setting remain an open problem.
Related Work

  1. Liu et al. (2024): Single-agent online robust linear MDPs with upper bound \(O(dH\min\{1/\sigma, H\}\sqrt{K})\); the present work generalizes this to the multi-player setting.
  2. Cisneros et al. (2023): The NQOVI algorithm for non-robust online linear games with upper bound \(O(\sqrt{d^3H^5K})\).
  3. Jiao et al. (2024): A minimax-optimal algorithm for robust games in the generative model setting.
  4. Insight: The additional challenges of robustness in game settings (equilibrium instability plus support shift) provide a rich space for theoretical investigation.

Rating

⭐⭐⭐⭐ (4/5)

Strengths: Novel problem formulation, complete theoretical analysis (impossibility + feasibility + minimax optimality), and solid technical contributions.

Weaknesses: The gap in \(H\)-dependence remains unresolved, experiments are overly simplistic, and empirical validation of the vanishing minimum assumption is insufficient.