
Expressive Temporal Specifications for Reward Monitoring

Conference: AAAI 2026 arXiv: 2511.12808 Code: github Area: Other Keywords: Temporal Logic, Reward Monitor, Quantitative Semantics, Sparse Reward, Reinforcement Learning

TL;DR

This paper leverages quantitative linear temporal logic (LTLf[F]) to automatically synthesize quantitative reward monitors (QRMs) that generate dense, continuous-valued reward streams for reinforcement learning agents at runtime, alleviating the sparse-reward problem that Boolean semantics induces in long-horizon tasks.

Background & Motivation

Core Challenge in Reward Design

Reward function design in reinforcement learning is critical yet fraught with difficulties:

  • Manual design burden: different environments require careful engineering by domain experts, resulting in high maintenance costs.
  • Sparse reward problem: when goals require many steps to achieve, agents receive little effective feedback over extended periods, severely hampering learning efficiency.
  • Safety concerns: poorly designed rewards may lead to reward hacking, unintended behaviors, or even catastrophic outcomes.

Temporal Logic as Reward Specification

Temporal logic provides a formal, high-level language for expressing complex goals and constraints. However, existing work predominantly adopts Boolean semantics (satisfied/not satisfied), which inherently produces sparse rewards—feedback is only given when the formula is fully satisfied.

Advantages of Quantitative Semantics

This paper proposes replacing Boolean semantics with quantitative semantics, where the degree of formula satisfaction is a real value in \([0,1]\). This enables the monitor to output a continuous signal at every timestep indicating "how close the agent is to completing the task," thereby:

  1. providing denser training signals;
  2. naturally supporting non-Markovian properties;
  3. remaining algorithm-agnostic and compatible with any RL algorithm.

Method

Overall Architecture

The system pipeline is as follows: the user specifies task objectives as "specification–reward" pairs in LTLf[F] formulas with scalar weights → the Synth algorithm automatically synthesizes a quantitative reward monitor QRM → the QRM is composed with the environment MDP via synchronous product to form an augmented MDP → any RL algorithm trains on the augmented MDP.

Key Designs

1. Quantitative Linear Temporal Logic LTLf[F]

LTLf[F] shares the syntax of standard LTLf but extends semantics from Boolean to real-valued \([0,1]\):

  • Atomic proposition \(p\): value given by labeling function \(\mathcal{L}(s)(p) \in [0,1]\) (rather than 0/1)
  • Negation: \([\![\neg\varphi]\!] = 1 - [\![\varphi]\!]\)
  • Conjunction: \([\![\varphi_1 \wedge \varphi_2]\!] = \min([\![\varphi_1]\!], [\![\varphi_2]\!])\)
  • Eventually \(F\): \([\![F\varphi]\!] = \max_{j} [\![\varphi,j]\!]\)
  • Always \(G\): \([\![G\varphi]\!] = \min_{j} [\![\varphi,j]\!]\)
  • Until \(\mathcal{U}\): \([\![\varphi_1 \mathcal{U} \varphi_2]\!] = \max_{j} \min\big([\![\varphi_2, j]\!],\; \min_{k<j} [\![\varphi_1, k]\!]\big)\)

This fuzzy treatment yields continuous-valued feedback at every timestep, rather than a 0/1 signal only at terminal states.
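To make these rules concrete, here is a minimal Python sketch of a recursive evaluator for LTLf[F] quantitative semantics over a finite trace. The tuple-based formula encoding, the `value` function, and the atom names are illustrative choices, not the paper's implementation; atoms are assumed to be pre-evaluated to values in \([0,1]\) at each step by the labeling function.

```python
# Minimal sketch of LTLf[F] quantitative semantics over a finite trace.
# A trace is a list of dicts mapping atom names to values in [0, 1],
# i.e., the labeling function L(s)(p) evaluated at each step.

def value(formula, trace, i=0):
    """Degree of satisfaction of `formula` on `trace` starting at step i."""
    op = formula[0]
    if op == "atom":                       # ("atom", "p")
        return trace[i][formula[1]]
    if op == "not":                        # ("not", phi)
        return 1.0 - value(formula[1], trace, i)
    if op == "and":                        # ("and", phi1, phi2)
        return min(value(formula[1], trace, i), value(formula[2], trace, i))
    if op == "F":                          # eventually: max over suffix positions
        return max(value(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "G":                          # always: min over suffix positions
        return min(value(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "U":                          # until: max_j min(phi2@j, min_{k<j} phi1@k)
        best = 0.0
        for j in range(i, len(trace)):
            prefix = [value(formula[1], trace, k) for k in range(i, j)]
            best = max(best, min([value(formula[2], trace, j)] + prefix))
        return best
    raise ValueError(f"unknown operator {op}")

# Example: "eventually reach the goal while always staying safe"
trace = [{"goal": 0.1, "safe": 1.0}, {"goal": 0.6, "safe": 0.9}, {"goal": 0.95, "safe": 1.0}]
spec = ("and", ("F", ("atom", "goal")), ("G", ("atom", "safe")))
print(value(spec, trace))   # 0.9 = min(max goal proximity, min safety margin)
```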

2. Quantitative Reward Monitor (QRM) Construction

A QRM is defined as a finite-state machine with registers \(\mathcal{A} = (Q, q_0, \delta_q, \mathcal{V}, \delta_v, \delta_r)\):

  • State set \(Q\): tracks progress in formula evaluation
  • Registers \(\mathcal{V}\): store quantitative values of subformulas (e.g., accumulated min/max values)
  • Reward function \(\delta_r\): reads the current satisfaction degree from the reward registers and multiplies it by the weight \(\rho\)

The construction procedure (Algorithm 1, Synth) is inductive:

  • Atom \(p\): 2 states; a register records \(\mathcal{L}(s)(p)\)
  • Negation: reuses the sub-monitor and adds a \(1 - \mathcal{V}(\psi)\) register
  • Conjunction: Cartesian product of the two sub-monitors; registers track the min
  • Until: Cartesian product plus three new registers tracking \(\min\varphi_1\), \(\max\varphi_2\), and \(\max(\min\varphi_1, \max\varphi_2)\)

A memoization cache avoids redundant computation for repeated subformulas.
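As a toy illustration of the register-machine view, the sketch below monitors the single formula \(F\,p\) with weight \(\rho\): one register keeps the running maximum of the atom's value, and the reward function scales it by the weight at every step. The class and method names are hypothetical and not taken from the paper's code; the finite control is trivial for this formula.

```python
# Illustrative register-machine monitor for "F p" with weight rho, in the
# spirit of the QRM tuple (Q, q0, delta_q, V, delta_v, delta_r).

class EventuallyMonitor:
    def __init__(self, atom, rho=1.0):
        self.atom = atom        # name of the quantitative atomic proposition
        self.rho = rho          # scalar weight attached to the specification
        self.state = "q0"       # finite control state (never changes for F p)
        self.reg_max = 0.0      # register: running max of L(s)(p)

    def step(self, labels):
        """Consume one labeled observation, return this step's reward."""
        # delta_v: update the register with the new atom value
        self.reg_max = max(self.reg_max, labels[self.atom])
        # delta_r: reward = weight * current degree of satisfaction of F p
        return self.rho * self.reg_max

monitor = EventuallyMonitor("goal", rho=1.0)
for labels in [{"goal": 0.2}, {"goal": 0.7}, {"goal": 0.5}]:
    print(monitor.step(labels))   # 0.2, 0.7, 0.7 -- a dense, non-decreasing signal
```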

Theorem 1 (Complexity): The number of states and transitions of the synthesized monitor grows linearly with the size of the formula.

Theorem 3 (Correctness): The register value output by the QRM at each timestep \(i\) equals the LTLf[F] semantic value \([\![\varphi,i]\!](\lambda)\).

3. Safety Specification Handling

For syntactically identifiable safety formulas (safe-LTLf: in negation normal form and excluding Until), once violated, the monitor outputs a fixed penalty \(\zeta \leq 0\) for all remaining timesteps. This covers two scenarios:

  • Terminal failure (e.g., running a red light in autonomous driving → immediate termination)
  • Non-terminal hazard (positive rewards should not resume after a violation)

In composite monitors, violation of any safety property globally overrides the rewards of all monitors.
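A minimal sketch of this override when several monitors are composed. The violation test used here (a safety monitor's satisfaction degree dropping to 0) and the simple summation of task rewards are illustrative assumptions, not the paper's exact mechanism.

```python
class SafetyOverride:
    """Latches any safety violation and overrides all rewards with zeta afterwards."""

    def __init__(self, zeta=-1.0):
        self.zeta = zeta          # fixed penalty, zeta <= 0
        self.violated = False

    def reward(self, task_rewards, safety_values):
        # Assumption for illustration: a safety degree of 0 counts as a violation.
        if any(v <= 0.0 for v in safety_values):
            self.violated = True
        if self.violated:
            return self.zeta      # positive rewards never resume after a violation
        return sum(task_rewards)  # otherwise, aggregate the weighted monitor rewards
```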

4. Compositional Learning with MDPs

The augmented MDP \(\mathcal{M} \otimes \mathcal{A}\) is defined as follows (see the wrapper sketch below):

  • Augmented state space \(\mathcal{S}' = Q \times \mathcal{S}\)
  • Transition probabilities unchanged; rewards produced by the monitor
  • Theorem 4: Markov policies on the augmented MDP suffice to optimally capture non-Markovian objectives
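A sketch of the synchronous product as a Gymnasium wrapper, assuming a monitor object exposing `reset()`, `step(labels)`, and a `state` attribute, plus a labeling function mapping observations to atom values in \([0,1]\); none of these names come from the paper's code.

```python
# Sketch of the product M (x) A: the agent observes (monitor state, env state)
# and is rewarded by the monitor instead of the environment.

import gymnasium as gym

class MonitoredEnv(gym.Wrapper):
    def __init__(self, env, monitor, labeling_fn):
        super().__init__(env)
        self.monitor = monitor
        self.labeling_fn = labeling_fn   # obs -> {atom: value in [0, 1]}

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.monitor.reset()
        return (self.monitor.state, obs), info

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # Dynamics are untouched; only the reward is produced by the monitor.
        reward = self.monitor.step(self.labeling_fn(obs))
        return (self.monitor.state, obs), reward, terminated, truncated, info
```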

Loss & Training

The proposed method is algorithm-agnostic:

  • Discrete environments (Toy, Safety Gridworlds): tabular Q-learning
  • Continuous environments (Classic, Box2D): PPO
  • Reward convergence detection uses an exponential moving average with span = 32 and a tolerance adapted within \([0.002, 0.02]\)
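A sketch of such a convergence check under these settings. The exact stopping rule (comparing the EMA against its value `span` episodes earlier) is an assumption; the paper only reports the EMA span and the tolerance range.

```python
# EMA-based reward-convergence check (span = 32, tolerance in [0.002, 0.02]).

def ema_converged(returns, span=32, tol=0.01):
    alpha = 2.0 / (span + 1)              # standard span -> smoothing factor
    ema, history = None, []
    for r in returns:
        ema = r if ema is None else alpha * r + (1 - alpha) * ema
        history.append(ema)
    # Converged if the EMA moved less than `tol` over the last `span` episodes.
    return len(history) > span and abs(history[-1] - history[-1 - span]) < tol
```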

Key Experimental Results

Main Results

Comparison across 12 Gymnasium/DeepMind environments using three reward sources: handcrafted reward (Base), Boolean monitor (Boolean), and quantitative monitor (Quant.). Representative results are shown below.

| Category | Environment | Base task completion | Boolean task completion | Quant. task completion |
|---|---|---|---|---|
| Classic | Acrobot | 99.07% | 4.84% | 94.77% |
| Classic | Mountain Car | 39.82% | 37.04% | 42.28% |
| Classic | Pendulum | 74.39% | 45.03% | 72.11% |
| Toy | Frozen Lake | 58.86% | 62.16% | 58.96% |
| Toy | Cliff Walking | 85.33% | 84.38% | 84.95% |
| Safety | Sokoban | 50.23% | 52.00% | 53.15% |
| Box2D | Bipedal Walker | 16.78% | 7.43% | 14.43% |
| Box2D | Lunar Lander | 57.45% | 43.44% | 45.61% |

Ablation Study

| Configuration | Key Observation | Note |
|---|---|---|
| Quantitative vs. Boolean semantics | Quantitative consistently ≥ Boolean | Quantitative semantics provides denser signals |
| Classic/Box2D gap > Toy/Grid | Continuous state spaces benefit more | Quantitative distance metrics are more effective in continuous environments |
| Safety specification global override | Safety consistently enforced in Safety Gridworlds | All rewards blocked after a safety violation |
| Convergence speed | Frozen Lake: 29 vs. 45 | The quantitative monitor can converge faster |

Key Findings

  1. Quantitative monitors consistently ≥ Boolean monitors: Across all 12 environments, the quantitative monitor's task completion rate either exceeds or matches that of the Boolean monitor.
  2. Significant advantage in Classic/Box2D: In continuous state spaces, quantitative distance metrics (e.g., angular proximity, positional proximity) offer qualitative improvements over Boolean "reached/not reached" signals.
  3. Outperforms handcrafted rewards in some environments: In Mountain Car, the quantitative monitor (42.28%) surpasses the handcrafted reward (39.82%), as quantitative specifications can simultaneously encode position and velocity objectives.
  4. Linear construction overhead: The number of QRM states and transitions grows linearly with formula size, confirming practical applicability.

Highlights & Insights

  1. Elegant integration of formal methods and RL: Users only need to write high-level temporal logic formulas; the system automatically generates dense rewards, replacing tedious manual reward engineering.
  2. Natural densification via quantitative semantics: No additional reward shaping or hindsight experience replay (HER) is required—quantitative semantics inherently produces dense signals.
  3. Safety-first design philosophy: The mechanism of globally overriding all rewards upon safety specification violation is both simple and effective.
  4. Natural support for non-Markovian objectives: By including monitor states in the augmented MDP state, non-Markovian objectives are Markovianized.
  5. Linear construction complexity: The linear relationship with formula size ensures practical usability.

Limitations & Future Work

  1. Quantitative atom design still requires domain knowledge: Although the monitor is constructed automatically, the quantitative labeling function \(\mathcal{L}(s)(p)\) still requires manual definition.
  2. Limited advantage in small discrete environments: In Toy/Grid environments, quantitative distance may guide agents toward high-reward but suboptimal regions.
  3. Discounted temporal logic unexplored: Discounted LTL could potentially further improve performance on long-horizon tasks.
  4. Multi-agent settings not addressed: The authors note that ATLf[F] could extend the approach to multi-agent scenarios.
  5. Lack of large-scale environment validation: The test environments are relatively simple; validation on complex benchmarks such as Atari or MuJoCo has not been conducted.

Rating

  • Novelty: ⭐⭐⭐⭐ — The idea of synthesizing reward monitors via quantitative temporal logic is novel and mathematically grounded.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — 12 environments, 3 baselines, and detailed fluent definitions.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Formally rigorous, proofs complete, appendix extremely detailed.
  • Value: ⭐⭐⭐⭐ — Provides a mathematically principled automated solution for reward design in RL.