Zero-Shot Context Generalization in Reinforcement Learning from Few Training Contexts

Conference: NeurIPS 2025
arXiv: 2507.07348
Code: https://github.com/chapman20j/ZeroShotGeneralization-CMDPs
Area: Reinforcement Learning
Keywords: context generalization, contextual MDP, data augmentation, Bellman equation, zero-shot transfer

TL;DR

This paper proposes the Context-Enhanced Bellman Equation (CEBE) and Context Sample Enhancement (CSE), which leverage first-order derivative information of environment dynamics and reward functions with respect to context parameters to achieve zero-shot generalization to unseen contexts when training is restricted to a single context.

Background & Motivation

Deep reinforcement learning (DRL) has achieved remarkable success in games, NLP, and robotics, yet trained policies often fail to generalize to evaluation environments with different parameters. For instance, in robotic control, discrepancies between training-time physical parameters (e.g., friction coefficients, mass) and deployment conditions can cause policy failure.

Two main strategies exist to address this: (1) continual learning (training continues at deployment), which is often infeasible due to safety and cost constraints; and (2) domain randomization (training across multiple contexts), which can be prohibitively expensive when constructing diverse training environments (e.g., designing complex robots).

The underlying difficulty is that zero-shot generalization from few contexts is impossible without sufficient structural priors. In many physical control problems, however, such a prior is available: the form of the dynamics equations is known, and only the parameters are uncertain.

The paper's starting point is to exploit environment differentiability: when the transition function \(T^c\) and reward function \(R^c\) are differentiable with respect to the context parameter \(c\), the dynamics and rewards of nearby contexts can be approximated by a first-order Taylor expansion around the training context. This enables data augmentation without sampling from new environments, a form of "virtual" domain randomization.

Method

Overall Architecture

Given a Contextual MDP (CMDP), data is collected under a base training context \(c_0\). The context gradients \(\partial_c T\) and \(\partial_c R\) of the dynamics and reward are then used to transform each sampled transition linearly, producing virtual samples for nearby contexts \(c\). Training on these augmented samples optimizes CEBE and yields a policy intended to generalize to unseen contexts.

Key Designs

Context-Enhanced Bellman Equation (CEBE):
- The standard Bellman equation at a context \(c\) requires the transition and reward functions at \(c\), which are unknown when only data from \(c_0\) is available.
- CEBE replaces them with first-order Taylor expansions around \(c_0\):
  - Deterministic transition: \(T_{\mathrm{CE}}^c(s,a) = \delta_{f^{c_0}(s,a) + \partial_c f^{c_0}(s,a)(c-c_0)}\), where \(f^{c_0}(s,a)\) is the next state under \(c_0\).
  - Reward: \(R_{\mathrm{CE}}^c = R^{c_0} + \partial_c R^{c_0} \cdot (c - c_0) + \partial_{s'} R^{c_0} \partial_c T^{c_0} \cdot (c - c_0)\), where \(\partial_c T^{c_0}\) stands for \(\partial_c f^{c_0}(s,a)\) in the deterministic case.
- At \(c = c_0\), CEBE reduces exactly to the standard Bellman equation.
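
The two approximations above can be made concrete on a toy system. The following is a minimal sketch, not the paper's code: the dynamics `f`, the reward, the step size `DT`, and all function names are hypothetical, chosen so that the context gradients can be written by hand (the context `c` plays the role of a mass).

```python
DT = 0.1  # hypothetical integration step


def f(s, a, c):
    """Toy deterministic dynamics: velocity update for a point of mass c."""
    return s + DT * a / c


def reward(s, a, s_next, c):
    """Toy reward: track velocity 1.0, with an action cost scaled by c."""
    return -(s_next - 1.0) ** 2 - 0.1 * c * a ** 2


def dc_f(s, a, c):
    """Context gradient of the dynamics, d f / d c."""
    return -DT * a / c ** 2


def dc_reward(s, a, s_next, c):
    """Context gradient of the reward with s_next held fixed."""
    return -0.1 * a ** 2


def ds_next_reward(s, a, s_next, c):
    """Gradient of the reward with respect to the next state."""
    return -2.0 * (s_next - 1.0)


def ce_next_state(s, a, c, c0):
    """First-order approximation of the next state at context c."""
    return f(s, a, c0) + dc_f(s, a, c0) * (c - c0)


def ce_reward(s, a, c, c0):
    """First-order approximation of the reward at context c."""
    s_next0 = f(s, a, c0)
    dc = c - c0
    return (reward(s, a, s_next0, c0)
            + dc_reward(s, a, s_next0, c0) * dc
            + ds_next_reward(s, a, s_next0, c0) * dc_f(s, a, c0) * dc)
```

With `c == c0` both functions return the base-context values, and for a small shift such as `c0 = 1.0`, `c = 1.01` the gap to the true `f(s, a, c)` and `reward(s, a, f(s, a, c), c)` is on the order of the squared shift.
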
First-Order Accuracy Guarantee (Theorem 2):
- Core result: \(\|Q_{\mathrm{CE}}^c - Q_{\mathrm{BE}}^c\|_\infty \leq O(\|c - c_0\|^2)\)
- The Q-function of CEBE is a first-order approximation of the true Q-function (error is second-order).
- Prerequisite: transition and reward functions are sufficiently smooth with respect to context parameters.
- A general \((T, R)\)-stability result for Q-functions under small perturbations to transitions and rewards is also established (Theorem 1).
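
How a first-order expansion leads to a second-order Q-function error can be outlined as follows. This is a heuristic sketch rather than the paper's proof: it assumes the stability result bounds the Q-function gap linearly in the size of the transition and reward perturbations, with a hypothetical constant \(K\) that this summary does not specify, and it writes \(f_{\mathrm{CE}}^c\) for the linearized next state in the CEBE definition.

\[
\begin{aligned}
f^{c}(s,a) &= f^{c_0}(s,a) + \partial_c f^{c_0}(s,a)(c - c_0) + O(\|c - c_0\|^2) = f_{\mathrm{CE}}^{c}(s,a) + O(\|c - c_0\|^2), \\
R^{c} &= R_{\mathrm{CE}}^{c} + O(\|c - c_0\|^2), \\
\|Q_{\mathrm{CE}}^c - Q_{\mathrm{BE}}^c\|_\infty &\le K \left( \sup_{s,a} \|f^{c}(s,a) - f_{\mathrm{CE}}^{c}(s,a)\| + \sup |R^{c} - R_{\mathrm{CE}}^{c}| \right) = O(\|c - c_0\|^2).
\end{aligned}
\]

The first two lines are Taylor's theorem under the smoothness prerequisite; the last line is where a stability result of the Theorem 1 type would be applied.
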
Context Sample Enhancement (CSE):
- Translates CEBE into a practical data augmentation method for environments with deterministic transitions.
- Given a sample \((s, a, r, s')\) and a context perturbation \(\Delta c\), CSE generates an augmented sample:
  - \(\bar{r} = r + \partial_c R \cdot \Delta c + \partial_{s'} R \cdot \partial_c T \cdot \Delta c\)
  - \(\bar{s}' = s' + \partial_c T \cdot \Delta c\)
- Implementation is straightforward: a linear transformation is applied to sampled data within each training batch.
- Unlike domain randomization (LDR), CSE does not require sampling from new environments; only context gradients are needed.
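
The augmentation formulas above amount to a few batched contractions. The sketch below assumes the context gradients are already available as arrays; the function name `cse_augment`, the argument names, and the array layout are illustrative assumptions, not the repository's API.

```python
import numpy as np


def cse_augment(r, s_next, dR_dc, dR_ds_next, dT_dc, delta_c):
    """Apply the CSE linear transformation to a batch of samples.

    Assumed shapes (B = batch, S = state dim, C = context dim):
      r          (B,)       rewards
      s_next     (B, S)     next states
      dR_dc      (B, C)     gradient of the reward w.r.t. the context
      dR_ds_next (B, S)     gradient of the reward w.r.t. the next state
      dT_dc      (B, S, C)  Jacobian of the next state w.r.t. the context
      delta_c    (B, C)     context perturbation for each sample
    """
    # Shift of the next state induced by the context perturbation.
    ds = np.einsum("bsc,bc->bs", dT_dc, delta_c)
    # Direct effect on the reward plus the effect through the next state.
    r_bar = (r
             + np.einsum("bc,bc->b", dR_dc, delta_c)
             + np.einsum("bs,bs->b", dR_ds_next, ds))
    s_bar = s_next + ds
    return r_bar, s_bar
```

A zero perturbation returns the original rewards and next states, which mirrors CEBE reducing to the standard Bellman equation at the training context.
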
Policy Optimality Guarantee (Theorem 4): An \(\varepsilon\)-optimal policy derived by optimizing CEBE is \((2\delta + 2\varepsilon)\)-optimal in the original CMDP, where \(\delta\) is the CEBE approximation error.

Loss & Training

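A standard off-policy RL algorithm (e.g., SAC) is used. Within the training loop: sample a batch from the replay buffer → draw a context perturbation \(\Delta c\) such that the perturbed context \(c + \Delta c\) lies in the ball \(\mathcal{B}(c, \varepsilon)\) around the training context \(c\) → augment samples via CSE → update networks with augmented samples. Here \(\varepsilon\) is the perturbation radius, not the optimality tolerance of Theorem 4. The entire procedure only requires access to context gradient information from the environment.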
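
One way to organize this loop is sketched below. It is an illustration under assumptions: the summary does not say how \(\Delta c\) is distributed, so uniform sampling in the ball is a choice made here, and `sample_delta_c`, `cse_training_step`, the batch keys, and the two callbacks are hypothetical names standing in for the CSE transformation and the off-policy update.

```python
import numpy as np


def sample_delta_c(batch_size, context_dim, eps, rng):
    """Draw perturbations uniformly from the ball of radius eps centred at 0."""
    direction = rng.normal(size=(batch_size, context_dim))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    # Scaling by u**(1/d) makes the samples uniform in volume, not in radius.
    radius = eps * rng.random((batch_size, 1)) ** (1.0 / context_dim)
    return direction * radius


def cse_training_step(batch, augment_fn, update_fn, eps, rng):
    """One step: perturb contexts, augment the batch, update the networks."""
    batch_size = batch["reward"].shape[0]
    context_dim = batch["context"].shape[1]
    delta_c = sample_delta_c(batch_size, context_dim, eps, rng)
    augmented = augment_fn(batch, delta_c)  # e.g. the CSE linear transformation
    return update_fn(augmented)             # e.g. one SAC gradient step
```

Since contexts are assumed observable, `augment_fn` would typically also relabel each sample's context as the original context plus `delta_c` so that the policy sees the perturbed context.
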

Key Experimental Results

Main Results

| Environment | Metric | CSE (Ours) | Baseline | LDR (Ideal Upper Bound) |
| --- | --- | --- | --- | --- |
| SimpleDirection | Return | Close to LDR | Significant degradation | Optimal |
| PendulumGoal (\(g\) varies) | Return | \(\approx\) LDR | Large degradation | \(\approx\) CSE |
| PendulumGoal (\(\tau > 0.6\)) | Return | Outperforms LDR | Degradation | Below CSE |
| CheetahVelocity | Return | \(\approx\) LDR | Degrades after \(v > 1.5\) | \(\approx\) CSE |
| AntDirection | Return | \(\approx\) LDR in most regions | Significantly worse | \(\approx\) CSE |

Ablation Study

| Configuration | Q-Function Approximation Error | Note |
| --- | --- | --- |
| Cliffwalker (Reward 1) | Slope \(\approx 2\) (log-log) | Validates \(O(\lVert c - c_0 \rVert^2)\) first-order accuracy |
| Cliffwalker (Reward 2) | Slope \(\approx 2\) (log-log) | Holds under different reward functions |
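
A slope check of this kind can be reproduced on a toy problem. The sketch below is not the paper's Cliffwalker setup: it assumes a small random tabular MDP in which only the reward depends on a scalar context (so \(\partial_c T = 0\)), with a hypothetical reward \(\sin(W c)\) and weights `W` invented for illustration. It computes the optimal Q-function under the true and the linearized reward and fits the log-log slope of the error.

```python
import numpy as np


def optimal_q(P, R, gamma=0.9, iters=500):
    """Value iteration. P: (S, A, S) transition probabilities, R: (S, A) rewards."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q


rng = np.random.default_rng(0)
S, A, c0 = 5, 2, 0.5
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)       # normalize rows into probabilities
W = rng.uniform(0.5, 1.5, size=(S, A))   # hypothetical reward weights


def reward(c):
    """True reward at context c (smooth and nonlinear in c)."""
    return np.sin(W * c)


def reward_ce(c):
    """First-order Taylor expansion of the reward around c0."""
    return np.sin(W * c0) + W * np.cos(W * c0) * (c - c0)


deltas = np.array([0.01, 0.02, 0.04, 0.08])
errors = np.array([
    np.abs(optimal_q(P, reward_ce(c0 + d)) - optimal_q(P, reward(c0 + d))).max()
    for d in deltas
])
# Slope of log(error) against log(|c - c0|); second-order error gives about 2.
slope = np.polyfit(np.log(deltas), np.log(errors), 1)[0]
print(f"log-log slope: {slope:.2f}")
```

A fitted slope near 2 indicates that the sup-norm Q-function error shrinks quadratically with the context shift, which is the pattern the table reports.
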

Key Findings

In simple environments (SimpleDirection), CSE nearly perfectly matches the performance of ideal domain randomization (LDR).
In PendulumGoal, CSE surpasses LDR in certain context regions (e.g., target torque \(\tau > 0.6\)).
In MuJoCo environments (CheetahVelocity, AntDirection), CSE performs on par with LDR and substantially outperforms the baseline.
Tabular experiments precisely validate the first-order accuracy theory for \(Q_{\mathrm{CE}}\) (slope \(\approx 2\) on log-log plots).

Highlights & Insights

Minimal and efficient: CSE only requires context gradient information from the environment (generally available in physics simulators) and is implemented as simple linear data augmentation with negligible additional computational cost.
Theory-driven practice: Starting from CMDP perturbation theory, the paper proves first-order accuracy before deriving a practical algorithm, and the tabular experiments reproduce the predicted second-order error scaling.
Effective substitute for LDR: In the reported environments, performance comparable to domain randomization is achieved without sampling from additional environments.

Limitations & Future Work

CSE is currently limited to environments with deterministic transitions; stochastic transitions would require defining gradients of transport maps.
Both the analysis and experiments assume fully observable states and contexts; partially observable settings may require additional treatment.
The effective range of the Taylor expansion is limited—first-order approximation may be insufficient for large context shifts.
The sample complexity advantage of CSE over domain randomization in high-dimensional context spaces remains to be theoretically analyzed.

Comparison with Related Work

vs. Domain Randomization (LDR): LDR requires sampling from multiple real contexts; CSE achieves comparable results via gradient approximation, making it suitable for settings where constructing new environments is costly.
vs. Meta-RL: Meta-RL requires a large number of contexts for meta-training; CSE needs only a single context plus gradient information.
vs. Qiao et al. (2021): They apply sample enhancement in the action space via gradients; this paper extends the idea to the context space to address generalization.
vs. Modi et al. (2017): They assume access to multiple contexts and use zeroth-order approximation; this paper achieves broader coverage from a single context using first-order approximation.

Rating

Novelty: ⭐⭐⭐⭐ Introducing perturbation analysis into CMDP generalization is a novel perspective with solid theoretical contributions.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-level validation from tabular to MuJoCo settings, though environment complexity could be further increased.
Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are clear, experimental design is sound, and the paper is concisely structured.
Value: ⭐⭐⭐⭐ Provides theoretical foundations and practical tools for sim-to-real transfer and few-context generalization.