
Revealing POMDPs: Qualitative and Quantitative Analysis for Parity Objectives

Conference: AAAI 2026 arXiv: 2511.13134 Code: None Area: Reinforcement Learning Keywords: POMDP, Revealing POMDP, Parity Objectives, Computational Complexity, Quantitative Analysis

TL;DR

This paper proves that limit-sure analysis for revealing POMDPs under parity objectives is equivalent to almost-sure analysis (EXPTIME-complete), and that quantitative analysis can also be completed within EXPTIME, thereby resolving two important open problems for this subclass.

Background & Motivation

State of the Field

Partially Observable Markov Decision Processes (POMDPs) are a central model for sequential decision-making under uncertainty. At each step, the environment is in a hidden state; after the controller selects an action, it observes only partial state information (a signal) and must make optimal decisions on this basis.

Computational analysis of POMDPs encompasses two broad categories:

Qualitative Analysis:
  • Almost-sure: Does there exist a policy under which the objective is satisfied with probability exactly 1?
  • Limit-sure: Can the probability of satisfying the objective be made arbitrarily close to 1?

Quantitative Analysis: Computing an approximation of the optimal probability of satisfying the objective.

Objective Types in POMDPs

  • Reachability objectives: A target set of states is visited at least once.
  • Parity objectives: Each state is assigned a non-negative integer priority; a run satisfies the objective if the minimum priority appearing infinitely often along it is even. Parity objectives can express all ω-regular specifications (subsuming LTL, liveness properties, etc.) and are thus the most general class of objectives considered here.
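
To make the parity condition concrete, here is a minimal sketch (illustrative, not from the paper): for an eventually-periodic run, the priorities seen infinitely often are exactly those on the repeating cycle, so the check reduces to inspecting the minimum priority on the cycle.

```python
def parity_satisfied(cycle_priorities):
    """Check the parity condition for an eventually-periodic run.

    The priorities occurring infinitely often are exactly those on the
    repeating cycle, so the run satisfies the parity objective iff the
    minimum priority on the cycle is even.
    """
    return min(cycle_priorities) % 2 == 0

# A run whose cycle carries priorities 2, 3 satisfies parity (min = 2, even);
# a cycle carrying 1, 2 does not (min = 1, odd).
print(parity_satisfied([2, 3]))  # True
print(parity_satisfied([1, 2]))  # False
```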

The Undecidability Barrier in General POMDPs

For general POMDPs, the computational landscape is highly negative:

| Analysis Type | Reachability | Parity |
| --- | --- | --- |
| Almost-sure | EXPTIME-complete | Undecidable |
| Limit-sure | Undecidable | Undecidable |
| Quantitative | Undecidable | Undecidable |

This means that for general POMDPs, all analysis problems except almost-sure reachability are algorithmically intractable. A natural research direction therefore arises: do there exist decidable subclasses of POMDPs?

Core Motivation: Why Study Revealing POMDPs?

Revealing POMDPs are a special subclass in which every visited state has a positive probability of being "announced" to the controller. Intuitively, the controller's uncertainty (belief distribution) occasionally collapses to a Dirac distribution over a single state, granting complete state knowledge.

This subclass has physical significance: in many practical applications (computational biology, probabilistic planning, robotic path planning), the environment is globally partially observable but occasionally "leaks" complete state information.

Prior work (Belly et al. 2025) established that almost-sure analysis of revealing POMDPs under parity objectives is EXPTIME-complete. However, limit-sure analysis and quantitative analysis remained open problems.

Method

Overall Architecture

The authors' technical approach proceeds in two main stages:

  1. Introducing belief-reachability objectives: This new objective requires that a target state be visited at a moment when the controller has complete state knowledge (i.e., the belief is a Dirac distribution).
  2. Proving that the value of parity objectives equals the belief-reachability value to the almost-sure winning state set.

This decomposition reduces the complex parity objective problem to the more tractable belief-reachability problem.

Key Designs

1. Formal Definition of Revealing POMDPs

A revealing POMDP requires that for every state \(s'\), if there is a positive-probability transition to \(s'\) from \((s,a)\), then \(s'\) must also be announced as a signal with positive probability. Formally:

\[\sum_{z \in \mathcal{Z}} \delta(s,a)(s',z) > 0 \implies \delta(s,a)(s',s') > 0\]

That is, every reachable state has a corresponding "revealing signal" such that when the controller observes this signal, its belief collapses to the Dirac distribution \(\mathbb{1}_{s'}\) (complete state certainty).
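
The revealing condition can be checked mechanically on an explicit transition table. The sketch below (names and encoding are illustrative, assuming signals include the states themselves, so the pair \((s', s')\) denotes "reach \(s'\) and hear the signal naming it") is not the paper's formalization, just a direct transcription of the implication above.

```python
def is_revealing(delta, states, actions):
    """Check the revealing condition: whenever s_next is reachable from
    (s, a) with positive probability, the revealing pair (s_next, s_next)
    -- i.e. the signal that names s_next -- must also have positive probability.

    delta: dict mapping (s, a) to a dict {(s_next, signal): probability}.
    """
    for s in states:
        for a in actions:
            out = delta.get((s, a), {})
            for s_next in states:
                reachable = sum(p for (t, _z), p in out.items() if t == s_next)
                if reachable > 0 and out.get((s_next, s_next), 0.0) <= 0.0:
                    return False
    return True

# s1 is revealed with probability 0.2 whenever it is entered -> revealing.
delta = {("s0", "a"): {("s1", "s1"): 0.2, ("s1", "blank"): 0.8}}
print(is_revealing(delta, ["s0", "s1"], ["a"]))  # True
```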

2. Belief-Reachability Objectives

Given a target state set \(\mathcal{X}\), the belief-reachability objective requires the existence of some time \(t\) at which the controller's belief \(B_t\) is a Dirac distribution over a target state (writing \(\mathcal{D}_\mathcal{X}\) for the set of Dirac distributions concentrated on states in \(\mathcal{X}\)):

\[\text{Belief-Reach}(\mathcal{X}) := \{\rho \in \Omega : \exists t \geq 0, B_t(\rho) \in \mathcal{D}_\mathcal{X}\}\]

This is strictly stronger than ordinary reachability—it requires not only that a target state be visited, but also that the controller knows it has visited a target state. The authors show that belief-reachability generalizes reachability objectives (Remark 2).
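
The distinction between reaching a target and *knowing* one has been reached can be illustrated with a standard Bayesian belief update followed by a Dirac test. The sketch below uses the same illustrative transition encoding as above (a dict \((s, a) \mapsto \{(s', z): p\}\)); it is a toy transcription of the definition, not the paper's algorithm.

```python
def update_belief(belief, a, z, delta, states):
    """Bayes update of the belief after playing action a and observing
    signal z.  delta: dict (s, a) -> {(s_next, signal): probability}."""
    new = {t: sum(belief[s] * delta.get((s, a), {}).get((t, z), 0.0)
                  for s in states)
           for t in states}
    total = sum(new.values())  # assumed > 0: z was actually observable
    return {t: p / total for t, p in new.items()}

def belief_reached(belief, targets, eps=1e-12):
    """Belief-reachability test at one time step: the belief must be a
    Dirac distribution concentrated on some target state."""
    return any(belief[t] > 1.0 - eps for t in targets)

# Uniform uncertainty over {s0, s1}; observing the revealing signal "s1"
# collapses the belief to a Dirac on s1, so belief-reachability holds.
delta = {("s0", "a"): {("s1", "s1"): 0.3, ("s1", "blank"): 0.7},
         ("s1", "a"): {("s0", "s0"): 0.5, ("s0", "blank"): 0.5}}
b = update_belief({"s0": 0.5, "s1": 0.5}, "a", "s1", delta, ["s0", "s1"])
print(belief_reached(b, {"s1"}))  # True
```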

3. Reliable Actions and Stopping Policies

  • Reliable actions: Actions that preserve the current belief-reachability value. Formally, action \(a\) is reliable if and only if the expected value after executing \(a\) equals the current value.
  • Stopping policies: Policies that, within finitely many steps and with positive probability, collapse the belief to a Dirac distribution over a terminal state.
  • Key Lemma (Lemma 7): A policy that is both stopping and uses only reliable actions is optimal.

Core proof idea: Reliable actions make the belief-reachability value a martingale, while stopping guarantees arrival at a terminal within finite time. Together, these yield optimality via the optional stopping theorem.
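
The reliability condition amounts to a one-step expectation check. The sketch below assumes (for illustration only) that belief-reachability values and one-step belief transitions are given explicitly as dictionaries; all names are hypothetical.

```python
def is_reliable(value, belief, a, successors, eps=1e-9):
    """An action is reliable iff the expected belief-reachability value
    after playing it equals the current value, i.e. the value process
    stays a martingale under the action.

    value:      dict belief -> belief-reachability value
    successors: dict (belief, action) -> list of (probability, next_belief)
    """
    expected = sum(p * value[b2] for p, b2 in successors[(belief, a)])
    return abs(expected - value[belief]) < eps

# From b0 (value 0.5), action "a" splits evenly between a winning and a
# losing Dirac belief: expected value 0.5 -> reliable.  Action "b" moves
# surely to the losing belief: expected value 0.0 -> not reliable.
value = {"b0": 0.5, "win": 1.0, "lose": 0.0}
succ = {("b0", "a"): [(0.5, "win"), (0.5, "lose")],
        ("b0", "b"): [(1.0, "lose")]}
print(is_reliable(value, "b0", "a", succ))  # True
print(is_reliable(value, "b0", "b", succ))  # False
```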

4. Stopping Property of Uniformly Random Reliable Policies (Lemma 8)

For revealing POMDPs, the policy that selects reliable actions uniformly at random is \((n,q)\)-stopping with parameters:

  • \(n = |\mathcal{S}| + 2\)
  • \(q = \delta_{\min}^2 (\delta_{\min}/|\mathcal{A}|)^{|\mathcal{S}|}\)

where \(\delta_{\min}\) is the minimum nonzero probability in the transition function. The proof constructs a stratified set of states and exploits the revealing property to guarantee a positive probability of state revelation at each step.
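
Plugging numbers into Lemma 8's bound shows how quickly \(q\) shrinks with the state count. A direct transcription of the two formulas (a sketch, with hypothetical example values):

```python
def stopping_params(num_states, num_actions, delta_min):
    """Parameters from Lemma 8: the uniformly random reliable policy is
    (n, q)-stopping with n = |S| + 2 and
    q = delta_min^2 * (delta_min / |A|)^|S|."""
    n = num_states + 2
    q = delta_min ** 2 * (delta_min / num_actions) ** num_states
    return n, q

# With |S| = 3 states, |A| = 2 actions and delta_min = 0.1:
# n = 5 and q = 0.01 * 0.05^3 = 1.25e-6 -- exponentially small in |S|.
print(stopping_params(3, 2, 0.1))
```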

5. Reduction from Belief-Reachability to Parity Objectives (Lemma 15)

The value of a parity objective equals the belief-reachability value to the almost-sure winning state set. Since the almost-sure winning state set is computable in EXPTIME (a known result), and the quantitative analysis of belief-reachability lies in EXPTIME (Theorem 1), the overall quantitative analysis of parity objectives also lies in EXPTIME.

Algorithm: Point-Based Dynamic Programming

To achieve the EXPTIME upper bound, the authors avoid the naïve approach of explicitly enumerating all exponentially many histories (which would yield 2EXPTIME), and instead employ point-based dynamic programming:

  • The belief space is discretized into a finite set of representative belief points.
  • Backward induction is performed over an exponentially long finite horizon.
  • Lemma 10 reduces the infinite-horizon problem to an exponentially long finite-horizon problem.
  • Lemma 11 provides a point-based algorithm that approximates finite-horizon belief-reachability values within EXPTIME.
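
The backward-induction backbone of such a point-based scheme can be sketched as follows. This is an illustrative simplification, not the paper's algorithm: it assumes a precomputed finite set of belief points with explicit one-step transitions between them, and performs value backups only at those points.

```python
def finite_horizon_belief_reach(points, targets, step, horizon):
    """Point-based backward induction over a finite set of belief points.

    points:  finite iterable of (hashable) belief points
    targets: belief points that are Dirac on a target state (value 1)
    step:    dict (point, action) -> list of (probability, next_point)
    Returns the finite-horizon belief-reachability value at each point.
    """
    value = {b: (1.0 if b in targets else 0.0) for b in points}
    actions = {a for (_, a) in step}
    for _ in range(horizon):
        new = {}
        for b in points:
            if b in targets:
                new[b] = 1.0  # target already reached with certainty
                continue
            # One backup: maximize the expected next-step value over actions.
            backups = [sum(p * value[b2] for p, b2 in step[(b, a)])
                       for a in actions if (b, a) in step]
            new[b] = max(backups, default=0.0)
        value = new
    return value

# Toy instance: b0 splits between the target belief b1 and b2, which
# surely leads to b1; within horizon 2 the value at b0 reaches 1.
step = {("b0", "a"): [(0.5, "b1"), (0.5, "b2")],
        ("b2", "a"): [(1.0, "b1")]}
print(finite_horizon_belief_reach(["b0", "b1", "b2"], {"b1"}, step, 2))
```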

Loss & Training

This is a purely theoretical work and involves no neural network training. The core "optimization" consists of searching the policy space for optimal policies, realized via dynamic programming and linear programming methods.

Key Experimental Results

Main Results

This is a purely theoretical work; the "experiments" are proofs of complexity results. The main results are summarized below.

Computational complexity for general POMDPs:

| Analysis Problem | Reachability | Parity |
| --- | --- | --- |
| Almost-sure | EXPTIME-complete | Undecidable |
| Limit-sure | Undecidable | Undecidable |
| Quantitative | Undecidable | Undecidable |

Computational complexity for revealing POMDPs (contributions of this paper in bold):

| Analysis Problem | Reachability | Parity |
| --- | --- | --- |
| Almost-sure | EXPTIME | EXPTIME-complete |
| Limit-sure | **EXPTIME** | **EXPTIME-complete** |
| Quantitative | **EXPTIME** | **EXPTIME** |

Ablation Study

Logical dependencies among theorems (the "ablation" of a theoretical work):

| Result | Key Lemmas Required | Core Technique |
| --- | --- | --- |
| Theorem 1 (belief-reachability quantitative analysis ∈ EXPTIME) | Lemmas 7, 8, 10, 11 | Reliable-action martingale argument + point-based DP |
| Corollary 5 (limit-sure = almost-sure) | Theorem 4 | Existence of optimal policies |
| Theorem 3 (parity quantitative analysis ∈ EXPTIME) | Theorem 1 + Lemma 15 | Parity → belief-reachability reduction |
| Theorem 4 (existence of optimal policies) | Corollary 9 + reduction | Construction of stopping optimal policies |

Key Findings

  1. Limit-sure equals almost-sure (Corollary 5): This is a property specific to revealing POMDPs—in general POMDPs, limit-sure is strictly weaker than almost-sure and its analysis is undecidable.
  2. Quantitative and qualitative analyses share the same complexity (EXPTIME): Quantitative analysis is typically harder than qualitative, but in revealing POMDPs both fall within the same complexity class.
  3. Optimal policies exist (Theorem 4): Not only can the optimal value be approximated, but some policy attains it exactly. In general POMDPs, optimal policies need not exist, even within the class of finite-memory policies.
  4. Belief-reachability as a bridging tool: This newly introduced objective type serves as a bridge from reachability to parity objectives.

Highlights & Insights

  1. Completing the complexity landscape for revealing POMDPs: All entries in Table 2 are filled, providing a comprehensive complexity characterization for this subclass.
  2. The belief-reachability concept: This is a novel objective type with independent value, incorporating "knowledge acquisition" into the objective definition.
  3. The power of revelation: Occasional state revelation suffices to transform undecidable problems into decidable ones, revealing a profound effect of partial information on computational complexity.
  4. Elegant application of martingale arguments: Reliable actions maintain the value as a martingale; stopping policies guarantee finite termination. The combination yields a concise and powerful proof strategy.
  5. Breaking the finite-memory bottleneck: In general POMDPs, limit-sure and quantitative analyses remain undecidable even under finite-memory policy restrictions. The revealing property provides a more structurally grounded avenue for progress.

Limitations & Future Work

  1. Practical feasibility of EXPTIME: Although theoretically decidable, the actual running time of EXPTIME algorithms may be prohibitive, especially for large state and action spaces.
  2. Restrictiveness of the revealing assumption: Requiring every state to have a positive probability of being revealed may be too strong for certain application domains.
  3. Discounted objectives not considered: The paper focuses on logical objectives (parity/reachability) and does not address discounted cumulative reward objectives more commonly used in reinforcement learning.
  4. Absence of matching lower bounds: The quantitative analysis establishes an EXPTIME upper bound without a matching EXPTIME-hardness lower bound (only the almost-sure and limit-sure parity analyses have EXPTIME-completeness results).
  5. Extension to weaker revealing models: Whether weaker notions of the revealing property can also guarantee decidability remains an open question.

Related Work

  • Belly et al. (2025): Proved that almost-sure parity analysis for revealing POMDPs is EXPTIME-complete; the direct predecessor of this work.
  • Chen and Liew (2023): Introduced the revealing blind MDP model and studied discounted objectives.
  • Avrachenkov et al. (2025): Studied belief-stationary policies for strongly connected revealing POMDPs.
  • Pineau et al. (2003): Pioneering work on point-based value iteration algorithms.
  • Insight: Structured POMDP subclasses represent a highly promising research direction—identifying structural assumptions naturally satisfied in practice offers a principled path to overcoming undecidability barriers.

Rating

  • Novelty: ⭐⭐⭐⭐ (resolves two important open problems)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (theoretical work with complete proofs)
  • Writing Quality: ⭐⭐⭐⭐⭐ (clear structure with excellent progression from overview to detail)
  • Value: ⭐⭐⭐⭐ (completes the complexity characterization of revealing POMDPs; serves as a theoretical benchmark)