Scalable Policy-Based RL Algorithms for POMDPs¶
Conference: NeurIPS 2025 arXiv: 2510.06540 Code: None Area: Reinforcement Learning / Theory Keywords: POMDP, Policy Optimization, TD Learning, Superstate MDP, Linear Function Approximation
TL;DR¶
This paper proposes approximating POMDPs as finite-state Superstate MDPs (where states are truncated histories), derives a tighter upper bound on the optimal value function gap (decaying exponentially with history length), and provides the first finite-time convergence guarantees for standard TD learning combined with policy optimization under non-Markovian sampling.
Background & Motivation¶
Partially Observable Markov Decision Processes (POMDPs) serve as a general framework for decision-making under uncertainty, with broad applications in autonomous driving, medical diagnosis, and game playing. However, solving POMDPs faces fundamental challenges:
Key Challenge: A POMDP can be converted into a fully observable MDP over belief states, but the belief is a continuous probability distribution over latent states, and computing it requires the entire, ever-growing history; exact planning in this formulation is PSPACE-complete.
Limitations of Prior Work:

- History-based methods (Loch & Singh; Littman) are effective in practice but lack rigorous theoretical performance guarantees
- Kara & Yüksel (2023) proposed the Superstate MDP to approximate POMDPs as finite-state MDPs, but their value function gap bound is \(O(1/(1-\gamma)^2)\), which is overly loose
- Cayci et al. (2024) resort to computationally expensive \(m\)-step TD learning as a workaround
- Q-learning methods face convergence difficulties under linear function approximation
Core Problem: Can standard RL algorithms (TD learning + Policy Optimization) directly approximate POMDPs with theoretical guarantees?
Method¶
Overall Architecture¶
- POMDP → Superstate MDP Conversion: Given truncation length \(l\), the history \(H_t\) is truncated to the most recent \(l\) (action, observation) pairs, forming the Superstate \(\mathcal{G}(H_t)\)
- Approximation Quality Analysis: The gap between the optimal value function of the Superstate MDP and that of the original POMDP is shown to decay exponentially with \(l\)
- Policy Optimization Algorithm: Alternates between TD learning (policy evaluation) and exponential policy updates (based on POLITEX)
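The truncation step above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the helper name `superstate` and the convention of keeping a short history whole are assumptions.

```python
def superstate(history, l):
    """G(H): the most recent l (action, observation) pairs of a history.
    Histories shorter than l are kept whole here (one possible convention
    for the initial steps; the paper's exact handling may differ)."""
    return tuple(history[-l:])

# Toy rollout with binary actions and observations.
history = [(0, 1), (1, 0), (0, 0), (1, 1), (0, 1)]
assert superstate(history, l=3) == ((0, 0), (1, 1), (0, 1))

# The payoff: the superstate space is finite, with at most
# (|A| * |Y|)^l full-length superstates, versus unboundedly many
# raw histories.
n_actions, n_obs, l = 2, 2, 3
assert (n_actions * n_obs) ** l == 64
```

This finiteness is what lets the analysis treat the truncated process as an MDP with a finite state space.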
Key Designs¶
- Improved Approximation Guarantee (Theorem 2): Under the uniform filter stability condition (Assumption 1):

\[
\|V^*(\pi(H)) - \tilde{V}(\mathcal{G}(H))\|_\infty \leq \frac{2\bar{r}(1-\rho)^l}{1-\gamma} + \frac{2\bar{r}\gamma(1-\rho)^l}{(1-\gamma)\left((1-\gamma)+\gamma(1-\rho)^l\right)}
\]
Key Improvement: Prior bounds scale as \((1-\gamma)^{-2}\) or \((1-\gamma)^{-3}\); when \((1-\rho)^l\) is sufficiently small, the proposed bound is approximately \(\frac{2\bar{r}(1-\rho)^l}{1-\gamma}\), reducing the horizon dependence from quadratic to linear. Moreover, it is a worst-case (sup-norm) bound rather than an expectation bound.
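To see how the two-term bound behaves, one can evaluate it numerically. All parameter values below (\(\bar{r}\), \(\rho\), \(\gamma\)) are illustrative assumptions; the sketch only checks two properties that follow directly from the formula: geometric decay in \(l\), and an elementary upper cap obtained by lower-bounding the second denominator by \((1-\gamma)^2\).

```python
def theorem2_bound(r_bar, rho, gamma, l):
    """Evaluate the two-term Theorem 2 bound for given parameters."""
    eps = (1.0 - rho) ** l  # the filter-stability factor (1 - rho)^l
    term1 = 2 * r_bar * eps / (1 - gamma)
    term2 = (2 * r_bar * gamma * eps
             / ((1 - gamma) * ((1 - gamma) + gamma * eps)))
    return term1 + term2

r_bar, rho, gamma = 1.0, 0.3, 0.9  # illustrative values, not from the paper
bounds = [theorem2_bound(r_bar, rho, gamma, l) for l in range(1, 16)]

# The bound shrinks monotonically as the history window l grows.
assert all(later < earlier for earlier, later in zip(bounds, bounds[1:]))

# Since the second denominator is at least (1 - gamma)^2, the whole bound
# is capped by 2*r_bar*eps*(1/(1-gamma) + gamma/(1-gamma)**2).
for l in range(1, 16):
    eps = (1 - rho) ** l
    cap = 2 * r_bar * eps * (1 / (1 - gamma) + gamma / (1 - gamma) ** 2)
    assert theorem2_bound(r_bar, rho, gamma, l) <= cap + 1e-12
```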
- Key Algebraic Lemma (Lemma 2): For positive vectors \(\mathbf{a}, \mathbf{b}, \mathbf{c}, \mathbf{d}\) with \(\sum_i a_i = \sum_i c_i = 1\):

\[
\left|\sum_i a_i b_i - \sum_i c_i d_i\right| \leq \frac{\|\mathbf{a}-\mathbf{c}\|_1}{2}\max\left(\|\mathbf{b}\|_\infty, \|\mathbf{d}\|_\infty\right) + \|\mathbf{b}-\mathbf{d}\|_\infty - \frac{\|\mathbf{a}-\mathbf{c}\|_1}{4}\|\mathbf{b}-\mathbf{d}\|_\infty
\]
Conventional decomposition-based triangle inequality arguments introduce looseness; this inequality exploits the normalization of probability mass functions to obtain a tighter bound. The lemma has independent value and can also improve results of Subramanian & Mahajan (2019).
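A quick randomized spot-check of the inequality as stated (not a proof, just a sanity check on random instances; the helper names are ad hoc):

```python
import random

def lemma2_lhs(a, b, c, d):
    """|sum_i a_i b_i - sum_i c_i d_i|."""
    return abs(sum(x * y for x, y in zip(a, b))
               - sum(x * y for x, y in zip(c, d)))

def lemma2_rhs(a, b, c, d):
    """Right-hand side of Lemma 2 as stated above."""
    l1 = sum(abs(x - y) for x, y in zip(a, c))   # ||a - c||_1
    m = max(max(b), max(d))                      # max of the sup norms
    bd = max(abs(x - y) for x, y in zip(b, d))   # ||b - d||_inf
    return (l1 / 2) * m + bd - (l1 / 4) * bd

rng = random.Random(0)

def simplex(n):
    """A random probability vector of length n."""
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

# Check LHS <= RHS on many random instances.
for _ in range(1000):
    n = rng.randint(2, 6)
    a, c = simplex(n), simplex(n)
    b = [rng.random() for _ in range(n)]
    d = [rng.random() for _ in range(n)]
    assert lemma2_lhs(a, b, c, d) <= lemma2_rhs(a, b, c, d) + 1e-12
```

The subtracted third term is what makes the bound strictly tighter than the plain triangle-inequality decomposition whenever \(\mathbf{a} \neq \mathbf{c}\) and \(\mathbf{b} \neq \mathbf{d}\).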
- Convergence of Standard TD Learning under Non-Markovian Sampling (Lemma 4): The core challenge is that samples are drawn from the belief state \(\pi(H_t)\) of the true POMDP rather than the Superstate \(\mathcal{G}(H_t)\). The authors show that under the filter stability condition, the transition operator of the Superstate MDP possesses a contraction property (Lemma 3), enabling standard TD learning to converge to a value function approximation of the Superstate MDP despite operating in a non-Markovian environment.
Loss & Training¶
Algorithm 2: Policy Optimization for the Superstate MDP

- Outer loop (\(M\) policy updates): exponential update rule from POLITEX: \(\mu_i(a|B) \propto \exp\left(\eta \sum_{j=1}^{i} \bar{Q}^{\mu_{j-1}}(B,a)\right)\)
- Inner loop (\(\tau + l'\) steps of TD learning): standard SARSA updates with projection onto a ball of radius \(R\)
- The first \(l'\) steps serve as a burn-in period for filter stabilization
- Supports linear function approximation: \(Q(B,a) = \phi^T(B,a)\theta\)
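The loop structure can be sketched on a toy tabular problem standing in for the superstate MDP. Everything below is an illustrative assumption, not the paper's implementation: the environment is a random deterministic-transition MDP, tabular \(Q\) corresponds to one-hot features \(\phi\), and the projection onto the radius-\(R\) ball is omitted since tabular SARSA with bounded rewards stays bounded on its own.

```python
import math
import random

N_STATES, N_ACTIONS = 4, 2
GAMMA, ETA, ALPHA = 0.9, 1.0, 0.1
M_UPDATES, TD_STEPS, BURN_IN = 5, 2000, 100  # outer loop M, tau, l'

rng = random.Random(0)
# Random toy dynamics and rewards for the stand-in superstate MDP.
P = [[rng.randrange(N_STATES) for _ in range(N_ACTIONS)] for _ in range(N_STATES)]
R = [[rng.random() for _ in range(N_ACTIONS)] for _ in range(N_STATES)]

def policy_from_scores(scores, state):
    """POLITEX-style exponential weights over the running sum of Q-estimates."""
    z = [math.exp(ETA * scores[state][a]) for a in range(N_ACTIONS)]
    total = sum(z)
    return [p / total for p in z]

score_sum = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # sum_j Q^{mu_j}
for _ in range(M_UPDATES):
    # Inner loop: SARSA evaluation of the current exponential-weights policy.
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    s = 0
    a = rng.choices(range(N_ACTIONS), policy_from_scores(score_sum, s))[0]
    for t in range(BURN_IN + TD_STEPS):
        s2 = P[s][a]
        a2 = rng.choices(range(N_ACTIONS), policy_from_scores(score_sum, s2))[0]
        if t >= BURN_IN:  # first l' steps are burn-in, as in the algorithm
            td = R[s][a] + GAMMA * Q[s2][a2] - Q[s][a]  # SARSA update
            Q[s][a] += ALPHA * td
        s, a = s2, a2
    # Outer loop: accumulate the Q-estimate for the exponential update.
    for st in range(N_STATES):
        for ac in range(N_ACTIONS):
            score_sum[st][ac] += Q[st][ac]

# With rewards in [0, 1), Q-estimates stay within [0, 1/(1 - gamma)].
assert all(0.0 <= q <= 1.0 / (1 - GAMMA) for row in Q for q in row)
```

The key structural point mirrored here is that the behavior policy at iteration \(i\) depends only on the accumulated Q-estimates of earlier iterations, exactly the alternation between policy evaluation and exponential policy improvement described above.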
Key Experimental Results¶
This paper is a purely theoretical contribution; core results are presented as theorems.
Main Results (Theorem 3, Regret Bound)¶
| Component | Description | Order |
|---|---|---|
| \(\xi_{\text{FA}}\) | Function approximation error | Depends on feature quality |
| \(\xi_{\text{HA}}\) | History approximation error | \(\sim (1-\rho)^l\), decays exponentially with \(l\) |
| Main term | Regret convergence rate | \(O(T^{3/4}\log T)\) |
Total regret: \(\mathcal{R}_T \leq T \cdot (\xi_{\text{FA}} + \xi_{\text{HA}}) + O(T^{3/4}\log T)\)
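The trade-off implied by this decomposition can be made concrete with a small numeric sketch. The constants below (the hidden \(O(\cdot)\) factor, \(\xi_{\text{FA}}\), \(\rho\), \(T\)) are illustrative assumptions, since the summary does not specify them:

```python
import math

def regret_bound(T, xi_fa, rho, l, c=1.0):
    """R_T <= T*(xi_FA + xi_HA) + c * T^{3/4} * log T,
    with xi_HA modeled as (1 - rho)^l and c an assumed constant."""
    xi_ha = (1 - rho) ** l
    return T * (xi_fa + xi_ha) + c * T ** 0.75 * math.log(T)

T, xi_fa, rho = 10**6, 1e-4, 0.3

# Growing l shrinks the history-approximation term of the bound...
assert regret_bound(T, xi_fa, rho, l=20) < regret_bound(T, xi_fa, rho, l=5)

# ...and once l is large, the sublinear T^{3/4} log T term dominates
# the remaining linear-in-T error terms.
linear_part = T * (xi_fa + (1 - rho) ** 20)
sublinear_part = T ** 0.75 * math.log(T)
assert sublinear_part > linear_part
```

The catch, noted under Limitations below, is that the superstate space grows as \((|\mathcal{Y}|\cdot|\mathcal{A}|)^l\), so increasing \(l\) buys accuracy at an exponential computational price.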
Comparison with Prior Work¶
| Method | Approximation Bound | Algorithm | Computational Cost | Assumptions |
|---|---|---|---|---|
| Kara & Yüksel (2023) | \(O(1/(1-\gamma)^2)\) | Q-learning | Standard | Ergodicity + asymptotic convergence |
| Cayci et al. (2024) | Polynomial | \(m\)-step TD | High | Similar |
| Abel et al. (2016) | \(O(1/(1-\gamma)^3)\) | — | — | — |
| Ours | \(O((1-\rho)^l/(1-\gamma))\) | Standard TD + PO | Standard | Filter stability |
Key Findings¶
- For the first time, standard TD learning is proven to converge, with finite-time guarantees, when run directly on data generated by a POMDP (i.e., under non-Markovian sampling)
- The \(m\)-step TD variant of Cayci et al. is not required, avoiding additional computational overhead
- Linear function approximation enables scalability to large state spaces
- The algebraic lemma (Lemma 2) can independently improve other POMDP approximation analyses
Highlights & Insights¶
- Practicality: Provides theoretical justification for the intuitively appealing practice of treating finite histories as MDP states and solving with standard RL algorithms
- Generality of the Algebraic Lemma: Lemma 2 offers a tighter control than the triangle inequality for bounds of the form \(|\sum a_ib_i - \sum c_id_i|\), with broad applicability
- Separation of Concerns: The value function error is decomposed into a function approximation error \(\xi_{\text{FA}}\) (improvable via better features) and a history approximation error \(\xi_{\text{HA}}\) (exponentially reducible by increasing \(l\)), providing clear guidance for hyperparameter tuning
Limitations & Future Work¶
- The filter stability condition (Assumption 1) requires sufficient mixing of transition and observation kernels, which may not hold in certain deterministic environments
- The \(O(T^{3/4}\log T)\) regret rate may not be optimal
- The exponentially growing Superstate space (\(|\mathcal{Y}|^l \cdot |\mathcal{A}|^l\)) still poses scalability challenges for large \(l\)
- No numerical experiments are provided to validate the tightness of theoretical bounds
- Future work may explore replacing linear function approximation with LSTMs or Transformers for greater expressive power
Related Work & Insights¶
- Significantly improves upon the approximation bounds and algorithmic guarantees of Kara & Yüksel (2023)
- Extends the POLITEX algorithm from the average-reward to the discounted-reward setting
- Filter stability theory (van Handel, 2008) provides the key tool for belief state contraction
- Offers improved performance bounds for the approximate information state framework of Subramanian & Mahajan (2019)
Rating¶
- Novelty: ⭐⭐⭐⭐ The algebraic lemma is novel and broadly useful; the convergence analysis of standard TD in POMDPs represents an important theoretical advance
- Experimental Thoroughness: ⭐⭐⭐ Purely theoretical work with no experiments, but theorems are clearly and completely stated
- Writing Quality: ⭐⭐⭐⭐ Problem motivation is well articulated, with thorough comparison to prior work
- Value: ⭐⭐⭐⭐ Provides a theoretical foundation for scalable POMDP solving with practical implications for RL system design