
Automaton Constrained Q-Learning

Conference: NeurIPS 2025 · arXiv: 2510.05061 · Code: None · Area: Reinforcement Learning / Safe RL · Keywords: LTL, Automaton, Safety Constraints, Goal-Conditioned RL, CMDP

TL;DR

This paper proposes ACQL (Automaton Constrained Q-Learning), which translates Linear Temporal Logic (LTL) task specifications into automata and combines goal-conditioned learning with minimal safety constraints. ACQL is the first scalable method to simultaneously support sequential temporal goals and non-stationary safety constraints in continuous control environments.

Background & Motivation

Starting Point

Key Insight: Real-world robotic tasks require achieving sequential sub-goals while adhering to dynamic safety constraints (e.g., a warehouse robot must restock shelves in order while avoiding obstacles and remaining within battery range).

Root Cause

Key Challenge: Goal-conditioned RL (GCRL) can reach individual goals but cannot reason over goal sequences or enforce safety; safe RL can only handle static safety constraints.

State of the Field

Background: LTL can express complex temporal tasks, but existing LTL+RL methods rely on sparse binary rewards and perform poorly in complex continuous environments.

Limitations of Prior Work

Gap: There is no scalable method that simultaneously supports sequential temporal goals and non-stationary safety constraints.

Method

Overall Architecture

ACQL constructs an augmented product CMDP through the following steps:

  1. Translate the LTL (Linear Temporal Logic) task specification into a Deterministic Büchi Automaton (DBA).
  2. Extract a safety constraint map \(S: \mathcal{Q} \to \Phi\) and a liveness condition / sub-goal map \(G: \mathcal{Q} \to \mathcal{G}^+\) from the automaton.
  3. Compose the MDP with the automaton into an augmented CMDP with state space \(\mathcal{S}^A = \mathcal{S} \times \mathcal{G}^+ \times \mathcal{Q}\).
  4. Learn a reward Q-function \(Q^r\) and a safety Q-function \(Q^c\); the policy is obtained by constrained maximization: \(\pi^*(s^A) = \arg\max_{a: Q^c(s^A,a) > \mathcal{L}} Q^r(s^A, a)\).
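
To make the construction concrete, here is a minimal Python sketch of the product-CMDP step (steps 1–3) and the constrained action selection (step 4). Since no official code is released, every interface here is a hypothetical stand-in: the `env.step` signature, `dba.delta`, `dba.is_accepting`, the labeling function, the maps `S` and `G`, and the fallback when no candidate action clears the threshold are all assumptions, not the authors' implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class AugmentedState:
    s: object       # environment state
    g_plus: tuple   # active sub-goal list G(q)
    q: int          # current automaton (DBA) state

def product_step(env, dba, label_fn, S, G, aug, action):
    """One transition of the augmented CMDP over S x G+ x Q (interfaces assumed)."""
    s_next, _, done, _ = env.step(action)          # base MDP transition (old-gym 4-tuple assumed)
    q_next = dba.delta(aug.q, label_fn(s_next))    # automaton transition on state labels
    reward = float(dba.is_accepting(q_next))       # illustrative sparse acceptance reward
    cost = S(q_next)(s_next)                       # safety signal; S(q) treated as a callable constraint
    return AugmentedState(s_next, G(q_next), q_next), reward, cost, done

def constrained_policy(q_reward, q_safety, aug, candidate_actions, L=0.0):
    """pi*(s^A) = argmax over {a : Q^c(s^A, a) > L} of Q^r(s^A, a)."""
    qr = np.array([q_reward(aug, a) for a in candidate_actions])
    qc = np.array([q_safety(aug, a) for a in candidate_actions])
    safe = qc > L
    if safe.any():
        safe_idx = np.flatnonzero(safe)
        return candidate_actions[safe_idx[np.argmax(qr[safe])]]
    return candidate_actions[int(np.argmax(qc))]   # assumed fallback: least unsafe action
```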

Key Designs

  1. Goal-Conditioned Learning + HER to Address Sparse Rewards:

    • Function: Incorporate the sub-goal list \(g^+\) into the product CMDP state and apply HER to learn from failed trajectories.
    • Mechanism: Sub-goals corresponding to automaton states are explicitly encoded into observations; HER retrospectively assigns rewards for sub-goals that were reached.
    • Design Motivation: Sparse rewards are the core bottleneck of prior LTL+RL methods; GCRL techniques substantially mitigate this issue (a relabeling sketch follows this list).
  2. Minimal Safety Constraint as a Replacement for Cumulative Cost:

    • Function: Replace the standard CMDP cumulative cost constraint with a minimal safety constraint \(\mathbb{E}[\min_t c^A(s_t, a_t)] > \mathcal{L}\).
    • Mechanism: Safety evaluation is recast as a classification problem (safe / unsafe) rather than regression over future cumulative costs; the safety Q-function is defined as \(Q^c(s,a) = \mathbb{E}[\min_t c^A(s_t, a_t)]\) with a progressively scheduled discount factor \(\gamma_c \to 1\).
    • Design Motivation: Predicting cumulative costs requires modeling long-range dependencies, leading to high variance and slow convergence; the minimal safety formulation reduces the problem to binary classification, and \(\mathcal{L}=0\) applies universally across all tasks (the corresponding TD targets are sketched after the Loss & Training section).
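
As referenced above, here is a minimal sketch of hindsight relabeling over the sub-goal component, assuming the common "future" relabeling strategy and a task-specific `reached(next_state, goal)` predicate; the paper's exact HER variant may differ.

```python
import numpy as np

def her_relabel(trajectory, reached, rng=None):
    """Relabel transitions with sub-goals actually achieved later in the episode.

    trajectory: list of (state, action, next_state, subgoal) tuples.
    reached(next_state, goal) -> bool is a task-specific predicate (assumed).
    """
    rng = rng or np.random.default_rng()
    relabeled = []
    for t, (s, a, s_next, _) in enumerate(trajectory):
        # 'future' strategy: pick a state achieved later in the episode
        # and treat it as the hindsight sub-goal for this transition.
        g_hind = trajectory[rng.integers(t, len(trajectory))][2]
        r_hind = 1.0 if reached(s_next, g_hind) else 0.0
        relabeled.append((s, a, s_next, g_hind, r_hind))
    return relabeled
```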

Loss & Training

  • Reward Q-function: Standard TD target \(y^r_t = r^A_t + \gamma Q^r_{\bar\psi}(s^A_{t+1}, \pi_j(s^A_{t+1}))\)

  • Safety Q-function: Hamilton–Jacobi reachability-based target \(y^c_t = \gamma_c \min\{c^A(s_t, a_t), Q^c_{\bar\theta}(s^A_{t+1}, \pi_j(s^A_{t+1}))\} + (1-\gamma_c) c^A(s_t, a_t)\)

  • \(\gamma_c\) is progressively scheduled from a small value to \(1.0\) (three-timescale stochastic approximation, with convergence guarantees).
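
A short sketch of both targets, transcribing the formulas above into NumPy. The linear anneal of \(\gamma_c\) is an illustrative placeholder; the paper schedules it within a three-timescale stochastic approximation whose exact shape is not reproduced here.

```python
import numpy as np

def td_targets(r, c, qr_next, qc_next, gamma=0.99, gamma_c=0.85):
    """Per-transition targets for the two critics.

    y^r = r + gamma * Q^r(s', pi(s'))
    y^c = gamma_c * min(c, Q^c(s', pi(s'))) + (1 - gamma_c) * c
    qr_next / qc_next are target-network evaluations at the policy action.
    """
    y_r = r + gamma * qr_next
    y_c = gamma_c * np.minimum(c, qc_next) + (1.0 - gamma_c) * c
    return y_r, y_c

def gamma_c_schedule(step, total_steps, start=0.5, end=1.0):
    """Assumed linear anneal of gamma_c toward 1.0 over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + frac * (end - start)
```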

Key Experimental Results

Main Results

Success rates (approximate):

| Task | Agent | CRM-RS | LOF | ACQL |
| --- | --- | --- | --- | --- |
| Dual-goal sequential navigation | Ant | 0% | ~20% | ~80% |
| Branching navigation | Ant | 0% | ~10% | ~70% |
| Single goal + safety constraint | Quadcopter | ~30% | ~40% | ~90% |
| Dual goal + vanishing safety | PointMass | ~50% | ~20% | ~95% |

Ablation Study

  • Removing HER: Performance degrades substantially, confirming the critical role of sub-goals + HER in overcoming sparse rewards.
  • Replacing minimal safety with cumulative cost: Safety constraint violation rates increase, and \(\mathcal{L}\) requires manual tuning.
  • Fixing \(\gamma_c=1\) (no scheduling): Safety Q-function learning becomes unstable.

Key Findings

  • ACQL significantly outperforms CRM-RS and LOF across all tested tasks.
  • Non-stationary safety constraints (e.g., "avoid region \(o_1\) before reaching \(g_1\), but may pass through it afterward") pose a fatal challenge for baseline methods.
  • ACQL is successfully deployed on a 6-DOF robotic arm, completing goal-reaching tasks in a cabinet environment with safety constraints.

Highlights & Insights

  • First unified framework: elevates both safe RL and GCRL to the class of tasks expressible in LTL.
  • The minimal safety constraint is more concise and general than a cumulative cost (\(\mathcal{L}=0\) works for every task) and yields more stable learning; the idea, drawn from Hamilton–Jacobi reachability analysis, elegantly bridges formal methods and deep RL.
  • A single goal-conditioned policy handles all sub-goals, whereas LOF must train a separate policy for each automaton edge and CRM-RS handles safety only implicitly via episode termination, which lacks robustness.
  • Convergence is theoretically guaranteed via three-timescale stochastic approximation.

Limitations & Future Work

  • Evaluation is limited to recurrence-class LTL formulas; more general LTL fragments are not covered.
  • Obstacles in the experimental environments are introduced solely through task specifications (no physical obstacles), which may limit realism.
  • The method is restricted to the online RL setting; zero-shot generalization to new LTL tasks is not addressed.
  • Scalability to high-dimensional continuous observation spaces (e.g., visual inputs) remains unverified.

Rating

  • Theoretical Innovation: ⭐⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Overall: ⭐⭐⭐⭐