Convergence of Regret Matching in Potential Games and Constrained Optimization¶

Conference: ICLR2026
OpenReview: https://openreview.net/forum?id=EOV1q1U23N
Code: To be confirmed
Area: Optimization Theory / Online Learning / Game Theory
Keywords: Regret Matching, Potential Games, Constrained Optimization, Convergence Analysis, No-Regret Learning

TL;DR¶

This paper provides the first proof that Regret Matching+ (RM+) serves as a reliable and fast first-order optimizer for "constrained optimization over products of simplices" (including potential games as a special case), converging to an \(\epsilon\)-KKT point in \(O_\epsilon(1/\epsilon^4)\) steps. Simultaneously, it proves that the original Regret Matching (RM) can take exponential steps to converge even in two-player common-interest games, establishing the first worst-case (and exponential) separation between RM and RM+.

Background & Motivation¶

Background: Regret Matching (RM), proposed by Hart & Mas-Colell (2000) with roots in Blackwell’s (1956) approachability framework, is a fundamental algorithm in no-regret online learning. Every action is selected with a probability proportional to its "accumulated non-negative regret." It is parameter-free and scale-invariant. Its variant, RM+ (Tammelin 2014), simply truncates negative components of the regret vector to zero at each step. Despite this minor change, RM+ is significantly more powerful in practice and serves as the core engine for superhuman AIs in Poker (Libratus / DeepStack) and Stratego. In two-player zero-sum games, the theory that no-regret implies convergence of the average strategy to a minimax/Nash equilibrium is well-established.

Limitations of Prior Work: Beyond the regime of two-player zero-sum games, theoretical understanding is scarce. The question of whether "RM converges to Nash equilibrium in potential games" has remained open for two decades (Kleinberg et al. 2009; Marden et al. 2007). Furthermore, experiments by Tewolde et al. (2025) found that the RM family (especially RM+) performs exceptionally well on constrained optimization benchmarks, significantly outperforming gradient-descent-type algorithms. However, no theory guaranteed even its asymptotic convergence to KKT points in constrained optimization.

Key Challenge: The primary obstacle is that "no-regret" and "convergence to KKT points" are fundamentally decoupled in non-convex problems. Proposition 3.2 provides a counterexample: there exists an iteration sequence where the regret relative to observed gradients is exactly zero, yet the KKT gap at every point remains \(\Omega(1)\). That is, no-regret property alone cannot imply convergence. Moreover, RM+ lacks the "one-step improvement" property of gradient descent; even if initialized near a KKT point, it may overshoot severely in one step. Since it is parameter-free, it lacks a learning rate to mitigate such behavior.

Goal: Within a unified framework of "maximizing an \(L\)-smooth function \(u\) over a product of simplices \(X=\Delta(A_1)\times\cdots\times\Delta(A_n)\)" (where potential games are a multi-linear special case), this work aims to characterize the iteration complexity required for RM and RM+ to reach \(\epsilon\)-KKT points / \(\epsilon\)-Nash equilibria and distinguish the essential differences between them.

Key Insight: The authors observe that while regret and KKT gap are generally unrelated, they become entangled in the specific trajectory generated by RM+. Specifically, the \(\ell_2\) norm of the RM+ regret vector is monotonically increasing and grows by at least \(\text{KKTGap}^2\) at each step. When this norm is sufficiently large, RM+ recovers a one-step improvement property.

Core Idea: The convergence rate is parametrized as a function of the "growth rate of cumulative regret"—the slower the regret grows, the faster the convergence. This allows the no-regret property of RM+ (\(\sqrt T\) regret) to be translated into an \(O_\epsilon(1/\epsilon^4)\) convergence rate. RM is shown to be exponentially slow because it can be stalled when the regret of optimal actions remains negative for long periods.

Method¶

Overall Architecture¶

The paper is purely theoretical. The core analysis follows the "regret ↔ KKT gap" line of reasoning. The setting views the constrained optimization problem \(\max_{x\in X} u(x)\) as a collaborative optimization where \(n\) "players" each control a simplex \(\Delta(A_i)\) and observe their respective gradient block \(\nabla_{x_i} u(x)\). The goal is to minimize the KKT gap:

\[\text{KKTGap}(x) := \max_{x'\in X}\langle x'-x,\nabla u(x)\rangle = \sum_{i=1}^n \text{BRGap}_i\big(x_i,\nabla_{x_i}u(x)\big),\]

which is the sum of best-response gaps for all players. A zero KKT gap is equivalent to a first-order stationary point. When \(u\) is multi-linear, this reduces to finding a Nash equilibrium in potential games. The two algorithms differ only in truncation: RM uses \(\theta^{(t)}=[r^{(t-1)}]_+\) to normalize the strategy but retains negative values in the cumulative regret \(r^{(t)}\), while RM+ truncates the entire cumulative regret to be non-negative: \(r^{(t)}=[r^{(t-1)}+u^{(t)}-\langle x^{(t)},u^{(t)}\rangle\mathbf 1]_+\). The paper analyzes simultaneous and alternating (coordinate-descent-like) updates.

Key Designs¶

1. One-Step Improvement Lemma for RM+: Lower Bounding Utility Gain by "Square of Best-Response Gap / Regret Norm"

In potential games (where utility is linear for a single player), RM+ possesses a clean monotonic improvement property (Lemma 3.3). For any non-negative regret \(r\) and strategy \(x=r/\|r\|_1\), the update \(x'=r'/\|r'\|_1\) satisfies:

\[\langle x'-x,u\rangle \ge \frac{1}{\|r'\|_1}\Big(\max_{a}u[a]-\langle x,u\rangle\Big)^2 = \frac{1}{\|r'\|_1}\text{BRGap}(x,u)^2.\]

This implies that utility strictly increases as long as the current strategy is not a best response, and the improvement is significant provided the regret norm \(\|r'\|_1\) is not too large. This lemma holds for any non-negative initial regret, an invariance property of RM+ truncation that RM lacks. It shows RM+ does not "idle" without making progress.

2. Parametrizing Convergence Speed with Regret Growth: Linking Nash Convergence to No-Regret Properties

By applying a telescoping sum of the one-step improvement to the potential function, the core conclusion of Theorem 3.4 / 1.2 is reached: if the regret on each simplex grows as at most \(T^\alpha\) (\(\alpha\in[0,1/2]\)), then \(\epsilon\)-lazy alternating RM+ converges to an \(\epsilon\)-Nash equilibrium within:

\[1+\frac{(C(n,m)\Phi_{\text{range}})^{\beta}}{\epsilon^{2\beta}}\ \text{steps},\quad \beta=\tfrac{1}{1-\alpha}\]

where \(\Phi_{\text{range}}\) is the range of the potential function. The pivotal realization is that RM+ always guarantees regret \(\le\sqrt{mT}\) (meaning \(\alpha=1/2\)), which immediately yields the main result of \(O_\epsilon(1/\epsilon^4)\) in Theorem 1.1 (corresponding to a \(T^{-1/4}\) rate). This result is "surprising" because it is the first to let regret dominate the convergence speed to Nash equilibria; previously, regret was only known to drive convergence to Coarse Correlated Equilibria (CCE). A corollary is that if RM+ lacked the no-regret property (linear regret \(\Omega(t)\)), the bound would be exponential; conversely, if regret only grows as a constant (\(\alpha=0\)), the rate improves to \(T^{-1/2}\).

3. "Conditional One-Step Improvement + Monotonic Regret Norm" in Non-convex / Simultaneous Updates: Guaranteeing Progress

Outside potential games, \(u\) is no longer multi-linear, and the standard one-step improvement of RM+ fails. The authors resolve this through the coupling of two lemmas. First (Lemma 3.7), using the quadratic smoothness bound \(u(x')\ge u(x)+\langle\nabla u(x),x'-x\rangle-\tfrac L2\|x-x'\|_2^2\), they prove a one-step improvement \(u(x')-u(x)\ge\tfrac{1}{2\|r'\|_1}\text{KKTGap}^2\) holds only when the regret norm is sufficiently large (\(\|r'\|_2\ge\max\{2m,9mL\}\)). Here, small norm is the obstacle, reversing the intuition of potential games. Second (Lemma 3.8), the crucial "glue":

\[\|r^{(t)}\|_2^2 \ge \|r^{(t-1)}\|_2^2 + \|[g^{(t)}]_+\|_2^2 \ge \|r^{(t-1)}\|_2^2 + \text{KKTGap}(x^{(t)})^2,\]

showing that the \(\ell_2\) norm of the regret vector is monotonically non-decreasing and increases by at least the square of the KKT gap at each step. Combined, these eliminate the possibility of RM+ stalling in the small-norm region: if the algorithm hasn't converged (large KKT gap), the regret norm increases rapidly until it crosses the threshold where Lemma 3.7 takes over. This yields \(O_\epsilon(1/\epsilon^4)\) for general \(L\)-smooth functions (Theorem 3.9). For standard "zero-initialized" versions without hyperparameters, the authors guarantee a coarser \(O_\epsilon(1/\epsilon^8)\) (Theorem 3.12), highlighting the cost of parameter-independence.

4. Exponential Lower Bound for RM: Optimal Actions Stalled by Extreme Negative Regret

In contrast to RM+, the authors construct a class of two-player \(m\times m\) common-interest games where RM requires \(m^{\Omega(m)}\) steps to converge to a \(\tfrac{1}{2m}\)-Nash equilibrium (Theorem 4.4 / 1.3). This bound holds for both adversarial and uniform random initializations. The failure stems from RM's conditional one-step improvement (Lemma 3.6): utility only improves if the cumulative regret of some best-response action is non-negative. In the constructed game, the regret of the unique "good" action takes exponentially long to return to positive values; until then, RM effectively stalls. Notably, RM still converges to CCE at \(T^{-1/2}\) (being no-regret), leading to a striking contrast (Corollary 1.4): there exist two-player potential games where RM reaches \(\epsilon\)-CCE in \(O_\epsilon(1/\epsilon^2)\) steps but requires \(\exp(\Omega(1/\epsilon))\) steps for \(\epsilon\)-Nash. This is the first worst-case separation between RM and RM+, and it is exponential.

Loss & Training¶

The paper introduces an acceleration method: Discounted RM+ (DRM+). In each round, the regret vector is multiplied by a discount factor \(\alpha^{(t)}=1-\gamma\) (geometric discounting, fixed \(\gamma\in(0,1)\)). Discounting keeps the regret norm pinned within \(\sqrt{m/\gamma}\) (corresponding to the "constant regret" \(\alpha=0\) case in Key Design 2), while preserving the one-step improvement. Consequently, \(\epsilon\)-lazy alternating DRM+ improves the convergence rate from \(T^{-1/4}\) to \(T^{-1/2}\) (Corollary 3.5, approx. \(1+\tfrac{m\Phi_{\text{range}}}{\epsilon^2\sqrt\gamma}\) steps). This is the first proof that discounting provides a "provable worst-case improvement" over standard RM+.

Key Experimental Results¶

As a pure theory paper, there are no large-scale empirical experiments, only a schematic (Figure 1) showing the KKT gap of RM remaining high after 100,000 steps in the constructed common-interest game, while RM+ converges in hundreds. The core "data" are the theoretical complexities.

Main Results: RM+ Convergence Complexity¶

Setting	Algorithm / Condition	Complexity (to \(\epsilon\)-KKT / \(\epsilon\)-Nash)	Source
Constrained Opt. over Simplex Products	RM+ (General case)	\(O_\epsilon(1/\epsilon^4)\)	Thm 1.1
Same as above	RM+, Regret growth \(\le T^\alpha\)	\(O_\epsilon(1/\epsilon^{2/(1-\alpha)})\)	Thm 1.2
Same as above	RM+, Bounded regret (\(\alpha=0\))	\(O_\epsilon(1/\epsilon^2)\)	Thm 1.2
Potential Games	\(\epsilon\)-lazy Alternating RM+	\(1+(m\Phi_{\text{range}})^2/\epsilon^4\)	Thm 3.4
Potential Games	\(\epsilon\)-lazy Alternating DRM+ (Discount \(1-\gamma\))	\(1+m\Phi_{\text{range}}/(\epsilon^2\sqrt\gamma)\)	Cor 3.5
Multi-simplex (Zero Init, Parameter-free)	RM+	\(O_\epsilon(1/\epsilon^8)\)	Thm 3.12

Separation: RM vs RM+¶

Target	RM	RM+	Note
\(\epsilon\)-CCE (Average Dist.)	\(O_\epsilon(1/\epsilon^2)\)	\(O_\epsilon(1/\epsilon^2)\)	Both are no-regret; CCE is fast for both.
\(\epsilon\)-Nash (Individual)	\(\exp(\Omega(1/\epsilon))\)	\(O_\epsilon(1/\epsilon^4)\)	Exponential separation in 2-player common-interest games (Cor 1.4).

Key Findings¶

Regret norm monotonicity is the pivot of the proof: Lemma 3.8 (norm increases by at least \(\text{KKTGap}^2\) each step) ensures RM+ cannot circulate indefinitely in small-norm regions, which is the switch that activates the "conditional one-step improvement."
Truncating negative regret makes a radical difference: The only difference between RM and RM+ is zeroing negative regret, yet it changes Nash convergence from exponential to polynomial. RM fails because optimal actions can accumulate extreme negative regret.
"Parameter-free" is not free: To achieve the optimal \(1/\epsilon^4\) in multi-simplex settings, the regret must be initialized above a threshold related to \(L\) and \(m\). Standard zero-initialization only guarantees \(1/\epsilon^8\).
Discounting provides provable acceleration: DRM+ improves the rate from \(T^{-1/4}\) to \(T^{-1/2}\), providing the first worst-case theoretical justification for why discounting performs better in practice.

Highlights & Insights¶

Bridging the unrelated quantities of "KKT gap" and "cumulative regret": Generally, these have no relationship (Prop 3.2). This paper proves that on the RM+ trajectory, the KKT gap is controlled by regret. The non-asymptotic rate is a direct byproduct of the no-regret property, a profound insight.
The "Conditional Improvement + Norm Monotonicity" framework is transferable: For any parameter-free first-order method lacking global one-step improvement but possessing some form of potential/norm monotonicity, one can adopt this two-stage argument: prove the norm grows quickly, then prove improvement in the large-norm regime.
The lower bound construction illuminates the stalling mechanism: By using a recursively defined payoff matrix \(A_{m,k}\), the paper ensures actions take turns. Each action requires \(T_k\ge\frac{k-2}{2}T_{k-1}\) steps to return to positive regret, resulting in factorial-like delay. This "layered delay" construction is valuable for analyzing other lagging dynamics.

Limitations & Future Work¶

Whether RM+ truly has \(\Omega(\sqrt T)\) regret in potential games is unknown: If regret bounds can be tightened below \(\sqrt T\), Theorem 1.2 would automatically imply faster Nash convergence.
Assumes full feedback: All results assume observation of full utility/gradient vectors. Whether the rate holds under stochastic or bandit feedback is undetermined.
Asymptotic convergence of RM under alternating updates remains open: The lower bound shows it can be exponentially slow, but whether it eventually converges is not resolved.
Lazy version requires prior knowledge of \(\epsilon\): \(\epsilon\)-lazy alternating RM+ is not an anytime algorithm. While a non-lazy version (Theorem C.7) also achieves \(O_\epsilon(1/\epsilon^4)\), it introduces additional iteration dependencies.

vs. RM/RM+ theory in zero-sum games (Farina et al. 2023): In zero-sum games, both RM and RM+ share a \(T^{-1/2}\) rate with no provable separation. This work shifts to potential games/constrained optimization to provide the first exponential worst-case separation, justifying the practical preference for RM+.
vs. Gradient Descent optimizers: GD relies on tuning learning rates for improvement, but experiments show RM+ is often superior. This work establishes RM+ as a "reliable and fast" first-order optimizer, officially bringing it into the optimization toolkit.
vs. Classical CCE convergence (Moulin & Vial 1978): Previously, regret was only used to guarantee convergence to the weaker CCE. Theorem 1.2 proves that regret also governs convergence to the stronger Nash equilibrium, narrowing the gap between these solution concepts under RM+ dynamics.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Resolves a 20-year-old open problem and provides the first exponential separation between RM and RM+.
Experimental Thoroughness: ⭐⭐⭐⭐ Pure theory paper; lacks empirical plots but theorems comprehensively cover simultaneous/alternating/discounted/multi-simplex settings.
Writing Quality: ⭐⭐⭐⭐⭐ Exceptional clarity in explaining the non-obvious "Regret ↔ KKT gap" bridge.
Value: ⭐⭐⭐⭐⭐ Provides a rigorous foundation for the core RM+ algorithm used in Game AI, guiding practical choices (e.g., using RM+ over RM and applying discounting).