Optimism Without Regularization: Constant Regret in Zero-Sum Games¶
Conference: NeurIPS 2025 arXiv: 2506.16736 Code: Not available Area: Other Keywords: Fictitious Play, Optimistic Learning, Zero-Sum Games, Regret Bounds, No Regularization
TL;DR¶
This paper provides the first proof that Optimistic Fictitious Play without regularization achieves \(O(1)\) constant regret in \(2\times2\) zero-sum games, matching the optimal rate of regularized Optimistic FTRL. It further establishes an \(\Omega(\sqrt{T})\) regret lower bound for Alternating Fictitious Play, separating the capabilities of optimism and alternation in the unregularized setting.
Background & Motivation¶
Regret bounds for online learning algorithms in games constitute a central problem in game theory and machine learning. The key background is as follows:
Is regularization necessary? - In general adversarial settings, regularization is essential for no-regret learning (e.g., FTRL requires a bounded step size \(\eta\)). - However, in the special structure of zero-sum games, unregularized algorithms can also achieve sublinear regret.
History of Fictitious Play (FP): - FP is the most classical unregularized algorithm (Brown, 1951), equivalent to Follow-the-Leader with step size \(\eta \to \infty\). - FP can incur \(\Omega(T)\) regret in general settings (highly sensitive to oscillation). - In zero-sum games, Robinson (1951) proved sublinear regret for FP, though the upper bound is slow: \(O(T^{1-1/n})\) for \(n\times n\) games. - Recent work has progressively improved FP's regret bounds, achieving \(O(\sqrt{T})\) on diagonal matrices and generalized rock-paper-scissors matrices.
Regularized optimistic algorithms: - Optimistic FTRL (including Optimistic MWU and Optimistic GD) achieves \(O(1)\) constant regret in zero-sum games. - However, the standard RVU bound proof technique critically relies on a finite step size upper bound (i.e., nonzero regularization).
This motivates the core question: Can \(O(1)\) regret be achieved in zero-sum games with no regularization (\(\eta \to \infty\))? Can a variant of FP attain \(O(1)\)?
This question has both theoretical significance and direct applications in equilibrium computation for combinatorial games and self-play training in multi-agent reinforcement learning.
Method¶
Overall Architecture¶
The paper studies Optimistic Fictitious Play (OFP)—the optimistic variant of FP—with the following update rule (stated for player 1, who maximizes \(x_1^\top A x_2\); player 2 is symmetric with payoff matrix \(-A^\top\)):

\[ x_1^{t+1} \in \operatorname*{arg\,max}_{x \in \Delta_m} \Big\langle x,\ \sum_{s=1}^{t} A x_2^s + A x_2^t \Big\rangle \]

Compared to standard FP (which best-responds to \(\sum_{s=1}^{t} A x_2^s\) alone), OFP additionally introduces a bias toward the most recent feedback (the \(+Ax_2^t\) term), which is equivalent to Optimistic FTRL without regularization.
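A minimal NumPy sketch of both updates (an illustration under the convention that player 1 maximizes \(x_1^\top A x_2\); the function and variable names are mine, not the paper's):

```python
# Fictitious Play vs. Optimistic Fictitious Play on a zero-sum game.
# Player 1 maximizes x1' A x2; player 2's payoff matrix is -A'.
import numpy as np

def best_response_max(payoff_vec):
    """Pure best response: the maximizing vertex of the simplex."""
    e = np.zeros_like(payoff_vec)
    e[np.argmax(payoff_vec)] = 1.0
    return e

def run_fp(A, T, optimistic=False):
    m, n = A.shape
    x1, x2 = np.ones(m) / m, np.ones(n) / n   # arbitrary initial strategies
    cum1, cum2 = np.zeros(m), np.zeros(n)     # cumulative payoff vectors y^t
    for _ in range(T):
        fb1, fb2 = A @ x2, -A.T @ x1          # this round's feedback
        cum1 += fb1
        cum2 += fb2
        # Optimism: count the most recent feedback twice (the +A x2^t term).
        x1 = best_response_max(cum1 + fb1 if optimistic else cum1)
        x2 = best_response_max(cum2 + fb2 if optimistic else cum2)
    return cum1, cum2
```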
Key Designs¶
- Geometric perspective in dual space: An energy function \(\Psi(y) = \max_{x \in \Delta_m \times \Delta_n} \langle x, y \rangle\) is defined as the support function of \(\Delta_m \times \Delta_n\), where \(y^t\) denotes the stacked cumulative payoff vectors of the two players. The key relationship is:

\[ \operatorname{Reg}_1(T) + \operatorname{Reg}_2(T) = \Psi(y^T) \]

That is, total regret equals exactly the energy of the dual iterate (Proposition 3.4): in a zero-sum game the realized payoffs of the two players cancel, leaving only the best-response benchmarks. Proving constant regret is thus equivalent to proving energy boundedness; a numerical check appears in the sketch after this list.
- Unified skew subgradient descent perspective (Proposition 3.6): Both FP and OFP can be written as skew subgradient descent on the energy \(\Psi\), differing only in the predicted dual vector:

\[ y^{t+1} = y^t + J g^{t+1}, \qquad g^{t+1} \in \partial\Psi(\tilde{y}^{t+1}), \qquad J = \begin{pmatrix} 0 & A \\ -A^\top & 0 \end{pmatrix}, \]

where \(\tilde{y}^{t+1} = y^t\) (FP) or \(\tilde{y}^{t+1} = 2y^t - y^{t-1}\) (OFP), and \(J\) is anti-symmetric.
- Standard FP: Due to the anti-symmetry of \(J\), the single-step energy change satisfies \(\Delta\Psi \geq 0\) (non-decreasing energy), growing by at least a constant over every \(\sqrt{T}\) steps.
- Optimistic FP: When the predicted and actual dual vectors map to the same primal vertex (i.e., \(\partial\Psi(y^{t+1}) = \partial\Psi(\tilde{y}^{t+1})\)), \(\Delta\Psi \leq 0\) (energy non-increasing).
- Exploiting the structure of \(2\times2\) games: Under Assumption 1 (writing \(A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}\), any \(2\times2\) zero-sum game can be transformed WLOG into the form \(\det A = 0\), \(a, d > \max\{0, b, c\}\)), the dual vectors \(y^t\) all lie in the same two-dimensional subspace, enabling planar geometric analysis of OFP's trajectory.
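As a quick empirical check of this dual-space picture (a sketch reusing `run_fp` from the code above; for the product of simplices the support function splits coordinate-wise, \(\Psi(y) = \max_i (y_1)_i + \max_j (y_2)_j\)):

```python
# Numerical check of Proposition 3.4: total regret equals the dual energy.
import numpy as np

def energy(cum1, cum2):
    """Support function of Delta_m x Delta_n at the stacked dual vector."""
    return cum1.max() + cum2.max()

A = np.array([[1.0, -1.0], [-1.0, 1.0]])          # Matching Pennies
for label, optimistic in [("FP", False), ("OFP", True)]:
    cum1, cum2 = run_fp(A, T=10_000, optimistic=optimistic)
    print(f"{label}: total regret = Psi(y^T) = {energy(cum1, cum2):.2f}")
# Expected: FP's energy grows on the order of sqrt(T); OFP's stays O(1).
```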
Core Proof Idea for the Main Theorem¶
Theorem 3.5 (Energy Boundedness): In a \(2\times2\) zero-sum game satisfying Assumption 1, the dual energy of OFP satisfies, for all \(t\),

\[ \Psi(y^t) \leq 8 a_{\max}\left(1 + \frac{2 a_{\max}}{a_{\text{gap}}}\right)^2, \]

where \(a_{\max} = \|A\|_\infty\) is the maximum entry magnitude of \(A\) and \(a_{\text{gap}} = \min_{(i,j) \neq (k,\ell)} |A_{ij} - A_{k\ell}|\) is the minimum gap between distinct entries.
The core argument is that in dual space, the trajectory of \(y^t\) is confined to a bounded region: whenever the energy grows beyond a certain threshold, the optimistic prediction causes the algorithm to "rebound" back into a low-energy region.
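To make the constant concrete (my arithmetic, reading \(a_{\max}\) as the maximum entry magnitude): Matching Pennies with \(A = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}\) already satisfies Assumption 1 (\(\det A = 0\) and \(a = d = 1 > \max\{0, b, c\} = 0\)), with \(a_{\max} = 1\) and \(a_{\text{gap}} = |1 - (-1)| = 2\), so the theorem bounds OFP's regret by \(8 \cdot 1 \cdot (1 + 2/2)^2 = 32\) for every horizon \(T\).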
Lower Bound for Alternating FP¶
Theorem 3.2: On the Matching Pennies game, for almost all initializations, Alternating FP incurs \(\Omega(\sqrt{T})\) regret. This separates the capabilities of optimism and alternation in the unregularized setting: optimism achieves \(O(1)\), while alternation alone cannot improve beyond \(O(\sqrt{T})\).
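For intuition, here is a hedged sketch of AFP on Matching Pennies (the alternation convention—player 2 best-responding after seeing player 1's current move—and the irrational-looking initialization are my assumptions, the latter chosen to avoid the symmetric starts not covered by the theorem):

```python
# Alternating Fictitious Play (AFP) on Matching Pennies: players update in
# turn, so player 2's best response accounts for player 1's current move.
import numpy as np

def br(v):
    """Pure best response: the maximizing vertex of the simplex."""
    e = np.zeros_like(v)
    e[np.argmax(v)] = 1.0
    return e

def afp_total_regret(A, T, init1, init2):
    cum1, cum2 = np.array(init1, float), np.array(init2, float)
    for _ in range(T):
        x1 = br(cum1)                  # player 1 responds to history...
        cum2 += -A.T @ x1              # ...player 2 sees that move first
        x2 = br(cum2)
        cum1 += A @ x2
    # Total regret = dual energy (Prop. 3.4), up to the constant offset
    # introduced by the nonzero initialization.
    return cum1.max() + cum2.max()

A = np.array([[1.0, -1.0], [-1.0, 1.0]])
# Floats stand in for the theorem's "almost all irrational" initializations.
init1, init2 = [np.sqrt(2), 0.0], [0.0, np.sqrt(3)]
for T in (1_000, 4_000, 16_000):
    print(T, round(afp_total_regret(A, T, init1, init2), 2))
# Omega(sqrt(T)) signature: the value should roughly double as T quadruples.
```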
Key Experimental Results¶
Main Results: Empirical Regret Comparison¶
| Game | FP Regret Growth | OFP Regret Growth | AFP Regret Growth |
|---|---|---|---|
| Matching Pennies (\(2\times2\)) | \(\sim\sqrt{T}\) | Constant | \(\sim\sqrt{T}\) |
| \(15\times15\) Identity Matrix | \(\sim\sqrt{T}\) | Constant | \(\sim\sqrt{T}\) |
| \(15\times15\) Generalized RPS | \(\sim\sqrt{T}\) | Constant | \(\sim\sqrt{T}\) |
Ablation Study: Summary of Regret Bounds in \(2\times2\) Games¶
| Algorithm Variant | Bounded \(\eta\) (FTRL) | \(\eta \to \infty\) (FP) |
|---|---|---|
| Standard | \(O(\sqrt{T})\) | \(O(\sqrt{T})\) |
| Optimistic | \(O(1)\) | \(O(1)\) ⭐ Ours |
| Alternating | \(O(T^{1/5})\) | \(\Omega(\sqrt{T})\) ⭐ Ours |
Key Findings¶
- Constant regret of OFP holds empirically in higher-dimensional games: Although the theory is proven only for \(2\times2\), OFP exhibits constant regret on \(15\times15\) games, strongly suggesting the result generalizes.
- Alternation cannot substitute for optimism: In the unregularized setting, the two are strictly separated (\(O(1)\) vs. \(\Omega(\sqrt{T})\)), whereas with regularization, alternation can achieve \(O(T^{1/5})\).
- The energy function provides an intuitive geometric picture: The dual trajectory of OFP oscillates within a bounded region without diverging, whereas the energy of FP and AFP grows monotonically.
Highlights & Insights¶
- First result to match the optimal rate of regularized algorithms within the unregularized regime, challenging the implicit assumption that regularization is a necessary condition for \(O(1)\) regret.
- The energy analysis in dual space introduces a novel proof technique fundamentally different from the standard RVU bound approach.
- The separation between optimism and alternation is theoretically elegant: under the same FP framework, the two improvement strategies exhibit qualitatively different behavior in the large step size limit.
- The results have potential implications for equilibrium computation algorithms (e.g., MCCFR variants in combinatorial games) and self-play RL.
Limitations & Future Work¶
- Restricted to \(2\times2\) zero-sum games: The proof relies heavily on the structure of \(2\times2\) games (dual vectors lying in a two-dimensional subspace); extending to general \(n\times n\) games is the primary open problem.
- Assumes a unique interior Nash equilibrium: Degenerate games (e.g., those with pure strategy NE) are outside the scope.
- Constants depend on game parameters: \(8a_{\max}(1+2a_{\max}/a_{\text{gap}})^2\) can be very large when \(a_{\text{gap}}\) is small (near-degenerate).
- Last-iterate convergence not addressed: Only time-average convergence is proven; the last-iterate behavior of OFP remains unknown.
- The lower bound for Alternating FP does not cover every initialization: it is proven only for almost all irrational initializations.
Related Work & Insights¶
- The paper has deep connections to energy conservation in continuous-time Hamiltonian flows: discretization causes energy growth in standard FP, while the optimistic correction "restores" approximate conservation.
- The results may inspire the design of unregularized algorithms for extensive-form games.
- The findings provide theoretical reference for understanding convergence behavior in self-play training (e.g., AlphaStar, OpenAI Five).
- The methodology bears connections to work on unregularized acceleration in Frank-Wolfe optimization.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First proof that an FP variant achieves \(O(1)\) regret—a surprising result.
- Experimental Thoroughness: ⭐⭐⭐ Experiments are primarily validatory (\(T=10{,}000\)); large-scale or application-oriented experiments are absent.
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are elegant, intuitions are well-explained, and Table 1 provides a clear landscape summary.
- Value: ⭐⭐⭐⭐ Significant theoretical contribution to online learning and game theory, though the restriction to \(2\times2\) limits immediate applicability.