Sampling Complexity of TD and PPO in RKHS¶

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=5gUMhTUDi0
Code: https://github.com/conniemessi/TD-and-PPO-in-RKHS
Area: Learning Theory / Reinforcement Learning Theory
Keywords: RKHS, Temporal Difference, PPO/NPG, Sample Complexity, Non-asymptotic bounds

TL;DR¶

In the unified function space of Reproducing Kernel Hilbert Spaces (RKHS), this paper decouples and analyzes policy evaluation (kernelized TD critic) and policy improvement (KL-regularized proximal/natural gradient updates). It provides non-asymptotic, instance-adaptive convergence bounds dependent on RKHS entropy and derives per-iteration sampling rules required to guarantee $O(k^{-1/2})$ convergence, validating the theoretically predicted step-size scheduling on CartPole, Acrobot, and HalfCheetah.

Background & Motivation¶

Background: Policy gradient and trust region methods (NPG, TRPO, PPO, actor-critic) combined with TD critics are the primary tools for modern continuous control RL, as they are compatible with high-capacity function approximators and provide stable improvement steps.

Limitations of Prior Work: Despite empirical success, global convergence theory for such algorithms under "rich function approximation" remains fragmented. Most existing analyses hold only for tabular/linear cases or rely on heavy realizability / concentrability conditions. When using non-linear critics, guarantees often assume the value/advantage functions are "exactly" estimated. More importantly, many policy improvement bounds treat policy expectation terms as having "no sampling noise," thus the amount of data required for a single improvement step has never been clearly elucidated.

Key Challenge: A coupling exists between the statistical error of policy evaluation (how accurately the critic is learned) and the step size/sample volume of policy improvement—less accurate evaluation necessitates more conservative steps or more samples. Previous theories separate the two, either assuming exact evaluation or failing to specify per-iteration data requirements. Meanwhile, different function classes (tabular, Sobolev, Gaussian kernels, wide neural networks via NTK) have fragmented analyses, lacking a unified framework.

Goal: (i) Design a policy evaluation algorithm under rich function classes with controllable statistical errors; (ii) couple this evaluation step with a provable improvement step toward the optimal policy and explicitly quantify the required samples per iteration.

Key Insight: The authors adopt a "function space" perspective, directly optimizing policies in RKHS. The advantage of RKHS is its ability to represent tabular, linear, Sobolev, Gaussian kernels, and wide neural networks (via their NTK) using a unified language of entropy/covering numbers. The dynamics of kernel gradient descent mirror the NTK dynamics of wide networks trained via gradient descent; hence, RKHS results can be directly transferred to neural critics/actors.

Core Idea: Integrate a "kernelized TD estimator" (using implicit preconditioning for geometric convergence, avoiding matrix inversion) with "KL-regularized proximal policy updates" (obtaining NPG/PPO-style updates by exponentiating the evaluation value) in RKHS. Convergence rates are characterized using RKHS entropy, which in turn determines the required per-iteration sampling volume.

Method¶

Overall Architecture¶

Consider an MDP $(S,A,P,r,\gamma)$, assuming the target Q-function $Q^\pi$ resides in an RKHS $\mathcal{H}(S\times A)$ induced by a symmetric positive definite kernel $K$. The algorithm utilizes an outer-loop policy iteration (NPG in RKHS, Algorithm 1): in each outer iteration $k$, the current policy $\pi_k$ collects $n^{(k)}$ one-step transition quadruplets $(s_0,a_0,s_1,a_1)$. Then, kernelized TD iterates $T^{(k)}$ steps in RKHS to obtain a critic $f^{(k)}\approx Q^{(k)}$. Finally, a single KL-regularized proximal update advances the policy to $\pi_{k+1}\propto \pi_k\exp\{\Delta_k f^{(k)}\}$. The contribution is establishing non-asymptotic bounds for these two steps and linking them to prove a global convergence rate of $O(k^{-1/2})$, while providing the required sampling rules to achieve this rate.

As a theoretical analysis paper focusing on error decomposition and policy improvement inequalities, it does not utilize a multi-module engineering pipeline; the core resides in the bounds and their coupling relationships.

Key Designs¶

1. Kernelized TD critic: Implicit preconditioning via RKHS functional gradients to avoid cubic matrix inversion

Policy evaluation is formulated as a fixed-point kernel ridge regression (KRR) in RKHS: $$\hat{Q}^\pi=\min_{f\in\mathcal{H}}\frac{1}{n}\sum_{i=1}^n\big(f(\omega_0^{(i)})-r(\omega_0^{(i)})-\gamma\hat{Q}^\pi(\omega_1^{(i)})\big)^2+\lambda\|f\|_{\mathcal{H}}^2,$$ where $\omega=(s,a)$. By the Representer Theorem, it has a closed-form solution $\hat{Q}^\pi=K(\cdot,\omega_0)b^\pi$, where $b^\pi=[K+\lambda nI-\gamma C]^{-1}r$. However, direct inversion is cubic. The authors treat this as functional optimization in RKHS, using kernel gradient descent: $$f_{t+1}=(1-\alpha_t)f_t-\eta_t\sum_{i=1}^n\big(f_t(\omega_0^{(i)})-r(\omega_0^{(i)})-\gamma f_t(\omega_1^{(i)})\big)K(\omega_0^{(i)},\cdot).$$ Crucially, using the RKHS inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ instead of the $\ell_2$ inner product on $\mathbb{R}^n$ changes the gradient and Hessian, equating to an implicit preconditioner that directs iterations toward favorable directions. With proper constant weight decay $\alpha$ and step size $\eta$, the error converges geometrically (exponentially) to $\hat{Q}^\pi$ without requiring matrix inversion.

2. Non-asymptotic TD error bounds dependent on RKHS entropy: A unified bound for tabular / Sobolev / NTK / Gaussian

The authors characterize the statistical error of the critic using empirical processes, where RKHS complexity is measured by entropy: the entropy of the unit ball satisfies $H(\delta,\|\cdot\|_{L^\infty},\mathcal{B})\le C\delta^{-2\beta}|\log\delta|^{2\kappa}$ for $\beta\in[0,1)$ and $\kappa\ge 0$. Theorem 10 provides the non-asymptotic TD error: $$\sqrt{\frac{1}{n}\sum_{i=1}^n\big(f_t(\omega_0^{(i)})-Q^\pi(\omega_0^{(i)})\big)^2}\le O_p\Big((1-c\gamma)^{-\frac{2+\beta}{2+2\beta}}\,n^{-\frac{1}{2+2\beta}}\,|\log n|^{\frac{\kappa}{1+\beta}}\Big)\|Q^\pi\|_{\mathcal{H}}.$$ A single entropy assumption links four function space classes, demonstrating the "instance-adaptive" nature of the work: rates collapse to $n^{-1/2}$ for tabular/Gaussian and specific non-parametric rates for Sobolev and NTK.

3. KL-regularized proximal update: RKHS-version NPG by exponentiating evaluations with explicit sampling rules

Policy improvement uses the following update: $$\pi_{k+1}=\arg\max_{\int_A\pi=1,\pi\ge0}\mathbb{E}_n\Big[\Delta_k\int_A f^{(k)}(s,a)\pi(a|s)da-\mathrm{KL}\big(\pi(\cdot|s)\,\|\,\pi_k(\cdot|s)\big)\Big].$$ The solution is $\pi_{k+1}\propto\pi_k\exp\{\Delta_k f^{(k)}\}$, equivalent to natural policy gradient, termed explicit PPO (NPG). This generalizes NPG to general RKHS functions where policies can be continuous functions of $(s,a)$. Its most significant contribution is explicitly quantifying the sample volume required to guarantee improvement.

4. Evaluation-Improvement Coupling: Mirror-descent inequality + sampling table for $O(k^{-1/2})$ global convergence

Measuring performance via expected total return $R[\pi]$, the authors adapt mirror descent analysis to obtain the core inequality (Theorem 13): $$\inf_k\big(R[\pi^*]-R[\pi_k]\big)\le\frac{\big(\sum_k 2\Delta_k\|f^{(k)}-Q^{(k)}\|_{L^\infty}\big)+\big(\sum_k\Delta_k^2(1-\gamma)^{-1}\|r\|_{L^\infty}\big)+\mathbb{E}_{S\sim\nu^*}\mathrm{KL}(\pi^*\|\pi_0)}{\sum_k\Delta_k}.$$ Global convergence $O(k^{-1/2})$ is achieved by setting $\Delta_k=1/\sqrt{k}$ and following the sampling rules in Table 1 for different kernels. This reveals that the required samples increase as iterations $k$ and policy complexity $\|\pi_k\|_{\mathcal{H}}$ grow.

Loss & Training¶

The theoretical algorithm follows a two-stage "Evaluation for $T$ steps + Improvement for one step" structure. In practice, a $T$ that is too small (e.g., $T=4$) might lead to insufficient critic learning and unstable actor updates. Consequently, the implementation utilizes joint optimization: the critic's TD-error loss and the actor's NPG objective are combined into a unified loss, minimized together for $T$ epochs per batch. The model uses a two-layer MLP where actor and critic share parameters, and the policy is $\pi_\theta\propto\mathrm{Softmax}(f_\theta)$.

Key Experimental Results¶

Main Results: Step-size Scheduling and Sampling Efficiency¶

Environment	Verification Point	Key Observation
CartPole-v1	Step-size exponent $\alpha$	$\Delta_k=k^{-0.5}$ converges; $k^{-0.2}$ diverges; $k^{-1.5}$ plateaus.
Acrobot-v1	Step-size exponent $\alpha$	Same trend; $k^{-0.5}$ shows slight drops at late stages (policy complexity vs. sample scarcity).
HalfCheetah-v5	Scalability	Maintains reasonable runtime; $k^{-0.5}$ remains optimal.

Results consistently confirm Corollary 14: only the $\Delta_k=1/\sqrt{k}$ schedule achieves both convergence and stability.

Sampling/Computational Efficiency (CartPole-v1, vs. clip-PPO)¶

Method	Data Requirement	Performance	Throughput (steps/s)
clip-PPO (GAE)	Requires full trajectories	Slower convergence, higher variance	5406.0
Ours NPG (one-step TD)	Only $(s_0,a_0,r_0,s_1,a_1)$	Stable, reaches max reward 500	7210.1

The critic learns via single-step TD-error, which is more lightweight than GAE's recursive advantage calculations, yielding ~33% higher throughput.

Key Findings¶

Step-size scheduling is the decisive factor for NPG+TD stability: The theoretical $k^{-1/2}$ is a "critical exponent" verified across environments.
One-step TD critic outperforms GAE: Removing dependence on full trajectories improves both sampling efficiency and computational throughput.
Sample demand growing with policy complexity is a real constraint: When the network grows complex and samples are insufficient, performance drops, matching theoretical predictions of $n(k)$ growing with $\|\pi_k\|_{\mathcal{H}}$.

Highlights & Insights¶

Unifying four function spaces via RKHS entropy: Tabular, Sobolev, NTK, and Gaussian kernels are all unified under one framework of minimax optimal rates.
Functional gradient as implicit preconditioning: Performing gradient descent under the RKHS inner product automatically provides preconditioning and geometric convergence.
Explicit sample requirement formulas: Theorem 13 explicitly decomposes global suboptimality into evaluation error, step-size drift, and initial KL, allowing for the derivation of a sampling schedule.
NTK bridging theory and practice: Since kernel gradient descent mirrors the NTK dynamics of wide networks, RKHS conclusions directly guide neural critic/actor design.

Limitations & Future Work¶

Gap between $L^2$ and $L^\infty$ metrics: Convergence analysis is established under $L^2$, but RL theory often requires $L^\infty$; bridging this gap remains future work.
Difficulty in estimating RKHS norm $\|\pi_k\|_{\mathcal{H}}$: Particularly for NTK where policies correspond to deep networks, making the sampling table difficult to execute precisely in practice.
Dependency on sampling quality (Assumption 1 & 6): Requires bounded initial distributions and transitions that do not deviate far from the initial distribution.
Gap between theoretical two-stage and joint implementation: The implementation uses a combined loss, creating a slight discrepancy with the theoretical analysis of sequential stages.

vs. Neural TD (Cai et al., 2019): They provide finite-sample analysis for neural TD with sub-linear rates; this paper achieves geometric convergence and minimax rates while covering continuous actions.
vs. Mirror Descent for Neural Policies (Liu et al., 2019): They analyze neural policies in the NTK regime for discrete actions; this paper generalizes NPG to general RKHS and continuous actions with explicit sampling counts.
vs. Kernel LSTD (Duan et al., 2024): They analyze kernel LSTD for Markov chains; this work utilizes gradient-based kernel TD for policy evaluation and couples it with policy improvement.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Unified four function spaces via RKHS entropy and explicitly coupled evaluation error with sample complexity.
Experimental Thoroughness: ⭐⭐⭐ Standard control tasks verify trends, but experiments are light for a theoretical paper and do not include large-scale benchmarks.
Writing Quality: ⭐⭐⭐⭐ Clear progression from theorems to sampling tables; however, notation is dense.
Value: ⭐⭐⭐⭐ Provides a firmer theoretical foundation for PPO/NPG under rich function approximation with actionable sampling/step-size rules.

Environment	Verification Point	Key Observation
CartPole-v1	Step-size exponent \(\alpha\)	\(\Delta_k=k^{-0.5}\) converges; \(k^{-0.2}\) diverges; \(k^{-1.5}\) plateaus.
Acrobot-v1	Step-size exponent \(\alpha\)	Same trend; \(k^{-0.5}\) shows slight drops at late stages (policy complexity vs. sample scarcity).
HalfCheetah-v5	Scalability	Maintains reasonable runtime; \(k^{-0.5}\) remains optimal.