Test-driven Reinforcement Learning in Continuous Control¶
Conference: AAAI 2026 (Oral)
arXiv: 2511.07904
Code: https://github.com/KezhiAdore/TdRL
Area: Reinforcement Learning / Continuous Control / Task Representation
Keywords: Test-driven RL, Satisficing Theory, Multi-objective Optimization, Trajectory Return Function, Lexicographic Comparison
TL;DR¶
This paper proposes the Test-driven Reinforcement Learning (TdRL) framework, which replaces a single reward function with multiple test functions — pass-fail tests defining optimality criteria and indicative tests guiding learning — to represent task objectives. A return function is learned via lexicographic-heuristic trajectory comparison, matching or surpassing hand-crafted reward methods on the DeepMind Control Suite while naturally supporting multi-objective optimization.
Background & Motivation¶
In RL, the reward function bears a dual responsibility: (1) defining optimal behavior; and (2) guiding the learning process. This dual role makes reward design extremely difficult, frequently causing the following issues:
- Reward Hacking: agents exploit loopholes in the reward function rather than genuinely completing the task
- Design Bias: experts tend to evaluate individual state-action pairs in isolation, neglecting their effect on the entire trajectory
- Multi-objective Dilemma: in real-world tasks (e.g., autonomous driving, which must simultaneously account for safety, speed, comfort, and regulations), it is extremely difficult to determine weights among objectives
- Limitations of PbRL: relies on human preference labels, which are subject to subjective bias
- Limitations of IRL: requires large amounts of expert demonstrations and generalizes poorly
- LLM-generated Rewards: still require manual domain knowledge and extensive iterative training feedback
Core Insight (Satisficing Theory): In multi-objective settings, humans do not pursue the optimum on a single metric but instead seek a "satisfactory solution" across objectives. For instance, drivers do not blindly minimize travel time but arrive on schedule subject to safety, comfort, and compliance constraints.
Method¶
Overall Architecture¶
A four-stage iterative pipeline (see the sketch below):

1. Collect Trajectory: the policy interacts with the environment to collect trajectories
2. Return Learning: update the return function \(R_\xi^{ind}\) based on trajectory comparison outcomes
3. Reward Learning: decompose trajectory returns into a state-action reward function \(r_\phi(s,a)\)
4. Policy Optimization: optimize the policy using SAC/PPO based on the learned reward
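The loop below is a minimal Python skeleton of these four stages. All names (`collect`, `compare`, `fit_return`, `fit_reward`, `improve_policy`) are hypothetical caller-supplied stubs for illustration, not the API of the official TdRL repository:

```python
import random
from typing import Callable, Dict, List, Tuple

Trajectory = Dict[str, list]  # hypothetical trajectory container

def tdrl_loop(
    collect: Callable[[], List[Trajectory]],
    compare: Callable[[Trajectory, Trajectory], float],
    fit_return: Callable[[List[Tuple[Trajectory, ...]], List[float]], None],
    fit_reward: Callable[[List[Trajectory]], None],
    improve_policy: Callable[[List[Trajectory]], None],
    num_iters: int = 100,
    buffer_cap: int = 100,   # the paper caps the trajectory buffer at 100
    pairs_per_iter: int = 32,
) -> None:
    """Skeleton of the four TdRL stages; all callables are caller-supplied stubs."""
    buffer: List[Trajectory] = []
    for _ in range(num_iters):
        buffer.extend(collect())                    # 1. collect trajectories
        buffer = buffer[-buffer_cap:]
        if len(buffer) >= 2:
            pairs = [tuple(random.sample(buffer, 2)) for _ in range(pairs_per_iter)]
            labels = [compare(a, b) for a, b in pairs]  # mu in {0, 0.5, 1}
            fit_return(pairs, labels)               # 2. return learning (BT loss)
        fit_reward(buffer)                          # 3. reward decomposition
        improve_policy(buffer)                      # 4. SAC/PPO policy update
```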
Key Designs¶
- Test Function Taxonomy (see the code sketch after this list):
- Pass-fail tests \(z^{pf}: \tau \to \{0, 1\}\): hard constraints defining optimal behavior (e.g., "did the agent reach the goal?" "is the torso upright?"), with binary outcomes
- Indicative tests \(z^{ind}: \tau \to \mathbb{R}\): provide continuous metrics to guide learning (e.g., "locomotion speed," "energy consumption"); they do not define optimality but help discriminate between trajectories
- The two test types have distinct roles: pass-fail tests define "what is correct," while indicative tests facilitate "how to learn efficiently"
- No pre-specified weights are required for combination — lexicographic ordering handles priority automatically
- Theoretical Guarantee (Theorem 1):
- If the trajectory return function \(R(\tau)\) assigns higher returns to trajectories closer to the optimal trajectory set \(\tilde{\mathcal{T}}\) (distance monotonicity)
- Then maximum-entropy policy optimization yields policies closer to the optimal policy set \(\tilde{\Pi}\)
- Formally: \(d(\tau_1, \tilde{\mathcal{T}}) \leq d(\tau_2, \tilde{\mathcal{T}}) \Rightarrow R(\tau_1) \geq R(\tau_2)\); under maximum-entropy optimization, the policy achieving the higher return then satisfies \(d(\pi_1, \tilde{\Pi}) \leq d(\pi_2, \tilde{\Pi})\)
- Lexicographic Trajectory Comparison (see the code sketch after this list):
- Rather than directly computing \(d(\tau, \tilde{\mathcal{T}})\) (since the optimal trajectory set is unknown), heuristic rules determine which of \(\tau_1\) and \(\tau_2\) is closer to \(\tilde{\mathcal{T}}\)
- Priority rules: (1) both pass all pass-fail tests → tie; (2) the trajectory passing more pass-fail tests is preferred; (3) the trajectory passing harder tests (lower historical pass rate) is preferred; (4) rank by degree of optimization on indicative tests, prioritizing the least-optimized dimension
- Outputs \(\mu \in \{0, 0.5, 1\}\) as preference labels for the Bradley-Terry model
- Return Function Learning:
- A fully connected network \(R_\xi^{ind}\) maps \(n\) indicative test results to a scalar return
- Loss: \(\mathcal{L}_R^{Dis}\) (distance-based cross-entropy) + \(\mathcal{L}_R^{Penalty}\) (numerical stability term)
- Decomposition into state-action reward: \(\mathcal{L}_r = \sum[R(\tau) - \sum r_\phi(s,a)]^2\)
- Two gradient balancing methods: GN (gradient norm normalization) and ES (early stopping, \(K^{ES}=10\))
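
A minimal Python sketch of the test taxonomy and the lexicographic comparison, assuming dictionary-style trajectories. The thresholds follow the multi-objective table in the results section; the function names, trajectory layout, and exact tie-breaking details are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Illustrative test functions for a Walker-style task. Thresholds follow the
# multi-objective table below; signatures and trajectory layout are assumptions.

def pf_torso_upright(traj) -> bool:
    """Pass-fail test z^pf: tau -> {0, 1}; here: cos(theta) stays in [0.9, 1.0]."""
    return bool(np.all(np.asarray(traj["cos_theta"]) >= 0.9))

def pf_torso_height(traj) -> bool:
    """Pass-fail test: torso height stays above 1.2."""
    return bool(np.all(np.asarray(traj["height"]) > 1.2))

def ind_speed(traj) -> float:
    """Indicative test z^ind: tau -> R; here: mean x-axis velocity."""
    return float(np.mean(traj["x_velocity"]))

PASS_FAIL = [pf_torso_upright, pf_torso_height]
INDICATIVE = [ind_speed]

def lexicographic_compare(t1, t2, pass_rates) -> float:
    """Return mu = 1.0 if t1 is preferred, 0.0 if t2 is, 0.5 for a tie.

    Mirrors the four priority rules above; `pass_rates` holds the historical
    pass rate of each pass-fail test (a proxy for test difficulty).
    """
    p1 = [z(t1) for z in PASS_FAIL]
    p2 = [z(t2) for z in PASS_FAIL]
    if all(p1) and all(p2):                       # rule 1: both pass everything
        return 0.5
    if sum(p1) != sum(p2):                        # rule 2: more tests passed wins
        return 1.0 if sum(p1) > sum(p2) else 0.0
    h1 = sum(1.0 - r for r, ok in zip(pass_rates, p1) if ok)
    h2 = sum(1.0 - r for r, ok in zip(pass_rates, p2) if ok)
    if h1 != h2:                                  # rule 3: harder tests passed wins
        return 1.0 if h1 > h2 else 0.0
    # Rule 4 (simplified): compare indicative results starting from the
    # least-optimized dimension, assuming higher is better on each test.
    v1 = sorted(z(t1) for z in INDICATIVE)
    v2 = sorted(z(t2) for z in INDICATIVE)
    if v1 == v2:
        return 0.5
    return 1.0 if v1 > v2 else 0.0
```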
Loss & Training¶
- Return function: \(\mathcal{L}_R^{Dis}\) (BT-model cross-entropy) + \(\mathcal{L}_R^{Penalty}\) (MSE regularization)
- Reward decomposition: \(\mathcal{L}_r = \text{MSE}(R(\tau),\, \sum r_\phi(s,a))\)
- Policy optimization: standard SAC objective (with maximum entropy), or PPO
Key hyperparameters: 9,000 steps of unsupervised pre-training; trajectory buffer maximum capacity 100; segment size 50; reward/return network lr = \(3 \times 10^{-4}\); reward ensemble = 3
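A hedged PyTorch sketch of the two training losses. The Bradley-Terry cross-entropy follows the definitions above; the concrete penalty form, tensor shapes, and network interfaces are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def return_loss(R, z1, z2, mu, penalty_coef=1e-3):
    """L_R^Dis + L_R^Penalty on a batch of trajectory pairs.

    R      : network mapping n indicative test results to a scalar return
    z1, z2 : indicative test results per trajectory, shape (B, n)
    mu     : preference labels in {0, 0.5, 1}, shape (B,)
    The squared-magnitude penalty is an assumed instance of the paper's
    MSE-style numerical-stability regularizer.
    """
    r1 = R(z1).squeeze(-1)
    r2 = R(z2).squeeze(-1)
    # Bradley-Terry model: P(tau_1 preferred) = sigmoid(R(tau_1) - R(tau_2)).
    bt = F.binary_cross_entropy_with_logits(r1 - r2, mu)
    penalty = (r1.pow(2) + r2.pow(2)).mean()  # keeps returns from growing unbounded
    return bt + penalty_coef * penalty

def decomposition_loss(reward_fn, states, actions, traj_return):
    """L_r = [R(tau) - sum_t r_phi(s_t, a_t)]^2, averaged over the batch.

    states (B, T, ds), actions (B, T, da), traj_return (B,).
    """
    per_step = reward_fn(states, actions).squeeze(-1)  # (B, T)
    return F.mse_loss(per_step.sum(dim=1), traj_return)
```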
Key Experimental Results¶
Main Results: DM-Control Continuous Control Tasks¶
| Task | SAC + Oracle Reward | TdRL-GN | TdRL-ES | PPO + Oracle Reward | PPO + TdRL |
|---|---|---|---|---|---|
| Walker-Stand | ~980 | ~980 | ~980 | ~970 | ~960 |
| Walker-Run | ~650 | ~670 | ~650 | ~550 | ~480 |
| Cheetah-Run | ~830 | ~830 | ~850 | ~750 | ~720 |
| Quadruped-Run | ~780 | ~800 | ~770 | ~650 | ~600 |
Ablation Study¶
| Variant | Walker-Run Performance | Notes |
|---|---|---|
| TdRL-ES (\(K^{ES}=10\)) | ~650 | Recommended default |
| Without Penalty term | Unstable training | Return values grow unbounded |
| Direct reward learning (no return decomposition) | Unstable training | Requires tanh clipping + repeated rescaling |
| \(K^{ES}\) too large | Performance drops | Insufficient Penalty constraint |
| \(K^{ES}\) too small | Performance drops | Over-constrained, return learning impeded |
Multi-objective Analysis (Walker-Run)¶
| Objective | Oracle SAC | TdRL | Threshold |
|---|---|---|---|
| Torso upright \(\cos(\theta)\) | ✓ (satisfied) | ✓ (satisfied) | [0.9, 1.0] |
| Torso height | ✗ (not satisfied) | ✓ (satisfied) | >1.2 |
| X-axis velocity | ~8 | ~8 | 8 |
Key Findings¶
- TdRL matches or surpasses oracle reward: achieves comparable performance without manual reward engineering
- Natural multi-objective support: in Walker-Run, the oracle reward's weight design leads to "upright but crouched" running (high uprightness score, low torso height); TdRL's pass-fail tests ensure all three objectives are satisfied
- TdRL's policy stability is slightly lower than with the oracle reward: trajectory-level evaluation inherently has higher variance than state-action-level evaluation, but this same property is what mitigates reward hacking
- Both GN and ES gradient balancing methods are effective; ES is simpler (recommended: \(K^{ES}=10\))
- TdRL is also applicable to on-policy methods (PPO), although the theory is limited to the maximum-entropy framework
Highlights & Insights¶
- A new task representation paradigm: test functions vs. reward functions, with functional separation (defining objectives vs. guiding learning)
- The lexicographic comparison is inspired by human decision-making: hard constraints are checked first, then soft metrics — consistent with the human heuristic of "ensure safety before seeking optimality"
- Complete theoretical contribution: Theorem 1 provides convergence guarantees, rigorously derived from max-entropy RL
- Key distinction from PbRL: TdRL does not require human preference labels but automatically generates trajectory comparisons via test functions
- The conceptual contribution outweighs the engineering contribution — the paper introduces a new way of thinking about RL task design
Limitations & Future Work¶
- Theory is grounded solely in the maximum-entropy RL framework; on-policy methods such as PPO lack theoretical guarantees
- Test functions still require manual design; though simpler than reward functions, the process is not fully automated
- Lexicographic comparison is a heuristic; theoretical optimality is not guaranteed
- Early-stage training is slower than with oracle reward (the return function must first be learned)
- Validation is limited to simulated environments; no experiments on real robots or autonomous driving scenarios
Related Work & Insights¶
| Comparison Dimension | TdRL | PbRL | IRL | Reward Engineering |
|---|---|---|---|---|
| No preference labels required | ✓ | ✗ | ✓ | ✓ |
| No expert demonstrations required | ✓ | ✓ | ✗ | ✓ |
| Trajectory-level evaluation | ✓ | ✓ | — | ✗ |
| Multi-objective support | Native | Requires design | — | Requires weights |
- Future direction: using LLMs to automatically generate test functions (noted in the paper)
- Broader potential of satisficing theory in RL
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (entirely new paradigm; Oral-level conceptual contribution)
- Technical Depth: ⭐⭐⭐⭐ (theoretical theorem + heuristic algorithm + complete implementation)
- Experimental Thoroughness: ⭐⭐⭐ (only 4 DM-Control tasks; real-world scenarios absent)
- Practical Value: ⭐⭐⭐⭐ (reduces task design complexity in multi-objective RL)
- The DMC benchmark is relatively simple; scalability to more complex tasks remains to be verified
- The lexicographic assumption presupposes a clear priority ordering among tests, which may be more complex in practice
- The relationship with and advantages over RLHF require further clarification
Comparison with Related Approaches¶
- vs. traditional reward design: simpler to design; objective definition and learning guidance are decoupled
- vs. RLHF: replaces human comparative feedback with pre-defined test functions
- vs. multi-objective RL: native framework support vs. requiring additional algorithmic adaptation
Inspiration & Connections¶
The test-driven task representation approach is applicable to safety constraint definition in autonomous driving. It also offers an alternative perspective on the reward hacking problem.
Overall Rating ⭐⭐⭐⭐ (4/5)¶
An Oral paper with novel concepts and solid theoretical contributions. However, the experimental scope is limited (DMC only), and test function design still involves subjective factors.