Learning What Matters Now: Dynamic Preference Inference under Contextual Shifts

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=qRbCkTk9ZR
Code: https://github.com/XianweiC/DPI
Area: Multi-Objective Reinforcement Learning / Preference Inference / Cognition-Inspired Decision Making
Keywords: Multi-objective RL, dynamic preferences, variational inference, preference-conditioned policy, non-stationary environments, Envelope operator

TL;DR

This work treats the preference weights in multi-objective RL (MORL), traditionally assumed to be known constants, as latent variables that drift with context. It maintains a posterior belief over "what matters now" via online variational inference and trains it jointly with a preference-conditioned actor–critic, so that agents can rapidly reprioritize goals after event-driven distribution shifts.

Background & Motivation

Background: Multi-objective reinforcement learning (MORL) deals with vector-valued rewards (e.g., efficiency vs. safety, energy vs. ethics). Two main approaches exist: scalarization, which collapses vector rewards into scalars using a fixed preference vector, and Pareto-based methods (e.g., Envelope Q-learning, PD-MORL), which approximate the Pareto front for post-hoc preference selection. A common prerequisite for these methods is that preference vectors are externally provided.
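
To make the role of the preference vector concrete, the following minimal sketch shows linear scalarization under two fixed weight vectors. The action names, the reward values, and the objective ordering (ethics, energy) are hypothetical and chosen only for illustration; they are not taken from the paper's environments.

```python
def scalarize(reward_vec, omega):
    """Collapse a vector reward into a scalar with preference weights omega."""
    return sum(w * r for w, r in zip(omega, reward_vec))


def best_action(outcomes, omega):
    """Pick the action whose vector reward has the highest scalarized value."""
    return max(outcomes, key=lambda action: scalarize(outcomes[action], omega))


# Hypothetical vector rewards, ordered as (ethics, energy).
OUTCOMES = {"wait": [1.0, -0.6], "cut_in_line": [-1.0, 0.8]}

print(best_action(OUTCOMES, [0.8, 0.2]))  # ethics-heavy weights
print(best_action(OUTCOMES, [0.2, 0.8]))  # energy-heavy weights
```

Under these assumed numbers the ethics-heavy weights select `wait` and the energy-heavy weights select `cut_in_line`, which illustrates why a method that receives a fixed weight vector cannot follow a change in priorities on its own.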

Limitations of Prior Work: In reality, preferences are almost never directly observed. Humans may prioritize fairness and patience while waiting in a queue, but as hunger increases or time runs out, they shift weight toward "energy/survival" and justify cutting in line. That is, the relative importance of goals drifts dynamically with contextual factors like resources, time, and risk; certain goals may become unreachable, temporarily irrelevant, or exceptionally important. Agents with fixed weights either pursue unreachable goals or ignore emerging priorities, which can be catastrophic in safety-critical scenarios like autonomous driving or healthcare.

Key Challenge: Cognitive science offers descriptive theories (self-regulation, multi-goal pursuit, constructive preferences) on how humans reprioritize preferences according to context, but these are not algorithms deployable in high-dimensional, partially observable, non-stationary control problems. Conversely, RL algorithms excel at policy optimization but outsource the determination of "what is worth optimizing" to a fixed reward function. A bridge between the two is missing.

Goal: To enable agents to infer and adapt to their current goal trade-offs online while remaining effective and interpretable, given only vector rewards and partial observations under non-stationary conditions.

Core Idea: [Preference as Latent State] The preference vector $\omega_t$, which encodes the relative importance of multiple objectives, is modeled as a latent variable that must be inferred online rather than as a fixed input supplied a priori. The agent maintains a posterior $q_\phi(\omega_t \mid s_{t-H+1:t})$ and samples from it before acting, which captures cognitive uncertainty and lets it explore multiple value configurations. A preference-conditioned actor–critic then conditions its policy on the inferred $\omega_t$. The framework is intentionally decomposed into two processes inspired by cognitive science: value appraisal (determining "what matters now") and action selection (deciding "how to act accordingly").

Method

Overall Architecture

DPI (Dynamic Preference Inference) is a two-module, cognition-inspired agent. The value appraisal module feeds recent state history $s_{t-H+1:t}$ (including observations and internal states like energy/remaining time) into a recurrent encoder, outputting a distribution of latent preference vectors over the simplex $\Delta^{d-1}$, from which $\hat\omega_t$ is sampled. The action selection module is a preference-conditioned actor–critic. During decision-making, an on-policy envelope operator selects the candidate with the highest scalarized value among $K$ preference candidates. The entire pipeline is stabilized by three regularization terms: variational ELBO, directional alignment, and self-consistency. Note: the environment transition $p_{env}(s_{t+1}\mid s_t,a_t)$ is independent of $\omega_t$ and remains a standard MDP; the generative decomposition of preferences is strictly an internal perceptual model for inference.

flowchart LR
    H["State History<br/>s_{t-H+1:t}"] --> E[Recurrent History Encoder]
    E --> Q["Preference Posterior<br/>q_φ(z_t|·)=N(μ,σ²)<br/>ω=softmax(z)"]
    Q -->|"Sample K Candidates ω⁽ⁱ⁾"| ENV["Envelope Operator<br/>ω̂=argmax ⟨ω⁽ⁱ⁾, V(s,ω⁽ⁱ⁾)⟩"]
    S["Current State s_t"] --> AC
    ENV -->|"ω̂_t"| AC["Preference-conditioned actor–critic<br/>π(a|s,ω̂), V⃗(s,ω̂)∈R^d"]
    AC --> ACT[Step Environment / Vector Reward r⃗_t]
    ACT -.Vector Return as Evidence.-> E

Key Designs

1. Value Appraisal as Variational Inference: Automatically amplifying uncertainty in ambiguous contexts. Preferences are treated as latent states with an explicit distribution rather than as point estimates. Specifically, an unconstrained latent vector $z_t\in\mathbb{R}^d$ is introduced with posterior $q_\phi(z_t\mid e_t)=\mathcal{N}(\mu_t,\mathrm{diag}(\sigma_t^2))$, where $e_t$ denotes the evidence at step $t$, and is mapped to the simplex via $\omega_t=\mathrm{softmax}(z_t)$. This design has two benefits: ambiguous contexts yield wider posteriors (expressing "I am unsure what to prioritize"), and the agent explores multiple value configurations by sampling $z_t$ rather than committing to a single trade-off. How strongly the evidence supports a preference configuration is defined via a Boltzmann-rational likelihood in which the scalarized return under that preference, scaled by $\beta$, acts as the unnormalized log-likelihood: $p(e_t\mid z_t)\propto\exp(\beta\,U_t(\omega_t;e_t))$, where $U_t(\omega_t;e_t)=\langle\omega_t,\vec{G}_t(e_t)\rangle$ and $\vec{G}_t$ is the vector return estimated by the actor–critic. With an isotropic Gaussian prior $p_0(z)=\mathcal{N}(0,I)$, minimizing the KL divergence between $q_\phi$ and the target posterior $p^*(z_t\mid e_t)$ is equivalent to maximizing the ELBO: $$\mathcal{L}_{\text{ELBO}}=\beta\,\mathbb{E}_{z_t\sim q_\phi}\big[U_t(\omega_t;e_t)\big]-\mathrm{KL}\big(q_\phi(z_t\mid e_t)\,\|\,\mathcal{N}(0,I)\big).$$ The first term aligns beliefs with evidence, while the second "anchors" preferences to the prior, preventing drastic shifts unless recent returns provide strong contradictory evidence.
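
The ELBO above can be estimated numerically as in the following minimal sketch. It assumes a diagonal Gaussian posterior given directly by `mu` and `sigma` (no recurrent encoder), a hand-picked vector return `G`, and plain Monte Carlo sampling; all function names and numbers are hypothetical, and a real implementation would presumably use an autodiff framework to backpropagate through the reparameterized samples.

```python
import math
import random


def softmax(z):
    """Map an unconstrained vector z to the probability simplex."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(s * s + m * m - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))


def elbo_estimate(mu, sigma, vec_return, beta=1.0, n_samples=64, rng=None):
    """Monte Carlo estimate of beta * E_q[<softmax(z), G>] - KL(q || N(0, I))."""
    rng = rng or random.Random(0)
    utility = 0.0
    for _ in range(n_samples):
        # Reparameterized sample: z = mu + sigma * eps with eps ~ N(0, 1).
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        omega = softmax(z)
        utility += sum(w * g for w, g in zip(omega, vec_return))
    return beta * utility / n_samples - kl_to_standard_normal(mu, sigma)


G = [1.0, -0.5, 0.2]  # hypothetical vector return, for illustration only
print(elbo_estimate([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], G))   # posterior equals prior
print(elbo_estimate([1.0, -1.0, 0.0], [0.5, 0.5, 0.5], G))  # shifted, narrower posterior
```

When the posterior equals the prior, the KL term vanishes and the estimate reduces to the average scalarized return; moving `mu` toward the objectives with higher return raises the first term at the cost of a larger KL penalty.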

2. Action Selection via On-Policy Envelope Operator: Selecting the most promising trade-off from candidates. Humans rarely commit to a single objective weight; they weigh several reasonable configurations and act on the most promising one. Correspondingly, the policy and value function are preference-conditioned: $\pi_\theta(a_t\mid s_t,\omega_t)$ and $\vec{V}_\theta(s_t,\omega_t)\in\mathbb{R}^d$, with scalarized value $V^{\text{scalar}}(s_t,\omega_t)=\langle\omega_t,\vec{V}_\theta(s_t,\omega_t)\rangle$. At each step, $K$ candidates are sampled from the posterior, and the one maximizing the scalarized value is chosen: $\hat\omega_t=\arg\max_i\langle\omega_t^{(i)},\vec{V}_\theta(s_t,\omega_t^{(i)})\rangle$. This adapts the envelope operator from Yang et al., but executes it on-policy at decision time with a single preference-conditioned actor–critic rather than inside the Bellman update of a Q-network. The same $\hat\omega_t$ is reused for optimization: the vector TD error $\vec\delta_t=\vec{r}_t+\gamma\vec{V}_\theta(s_{t+1},\hat\omega_t)-\vec{V}_\theta(s_t,\hat\omega_t)$ and the vector advantage $\vec{A}_t$ computed from it via GAE are projected to the scalar $A_t=\langle\hat\omega_t,\vec{A}_t\rangle$ for PPO's clipped surrogate, keeping the action and preference spaces consistent. The critic is stabilized by dual supervision on the vector and scalarized returns: $\mathcal{L}_{\text{critic}}=\xi\|\vec{V}_\theta-\vec{G}\|_2^2+(1-\xi)(V^{\text{scalar}}_\theta-\langle\hat\omega,\vec{G}\rangle)^2$.
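
The selection and projection steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `toy_critic` is a hypothetical stand-in for $\vec{V}_\theta$, the function names are invented for this example, and the GAE routine assumes a single rollout segment without episode terminations.

```python
import math
import random


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sample_candidates(mu, sigma, k, rng):
    """Draw k preference candidates omega = softmax(mu + sigma * eps)."""
    return [softmax([m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)])
            for _ in range(k)]


def envelope_select(state, candidates, vector_value_fn):
    """Return the candidate maximizing <omega, V(state, omega)>."""
    return max(candidates, key=lambda w: dot(w, vector_value_fn(state, w)))


def scalarized_gae(rewards, values, next_values, omegas, gamma=0.99, lam=0.95):
    """Vector GAE followed by the projection A_t = <omega_hat_t, A_vec_t>.

    rewards[t], values[t] = V(s_t, omega_hat_t) and
    next_values[t] = V(s_{t+1}, omega_hat_t) are length-d lists.
    """
    d = len(rewards[0])
    running = [0.0] * d
    advantages = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        # Per-objective TD error, using the same omega_hat_t on both sides.
        delta = [rewards[t][j] + gamma * next_values[t][j] - values[t][j]
                 for j in range(d)]
        running = [delta[j] + gamma * lam * running[j] for j in range(d)]
        advantages[t] = dot(omegas[t], running)
    return advantages


def toy_critic(state, omega):
    # Hypothetical stand-in for V_theta(s, omega); not a trained network.
    return [s * (0.5 + w) for s, w in zip(state, omega)]


rng = random.Random(0)
candidates = sample_candidates([0.0, 0.0], [1.0, 1.0], k=8, rng=rng)
omega_hat = envelope_select([1.0, 3.0], candidates, toy_critic)
print(omega_hat)
```

The scalar advantages returned by `scalarized_gae` are what a PPO clipped surrogate would consume; the vector quantities are kept until the last step so that the projection always uses the preference that was actually selected at that step.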

3. Directional Alignment + Self-Consistency: Preventing jitter and "loophole exploitation." Simply maximizing $U_t=\langle\omega_t,\vec{G}_t\rangle$ may induce degenerate solutions in which preferences oscillate violently or exploit transient reward fluctuations. DPI adds two cognition-inspired regularizers. Directional alignment pulls the predicted preference toward the direction of the vector return: $$\mathcal{L}_{\text{dir}}=\mathbb{E}\Big[1-\tfrac{\langle\omega_t^{\text{pred}},\vec{G}_t\rangle}{\|\omega_t^{\text{pred}}\|_2\,\|\vec{G}_t\|_2}\Big].$$ The term is active only when $\|\vec{G}_t\|_2>0$, and it discourages the agent from placing weight on unattainable goals. Self-consistency, $\mathcal{L}_{\text{stab}}=\|\omega_t^{\text{pred}}-\hat\omega_t\|_2^2$, anchors the encoder's prediction to the preference selected by the envelope operator, reducing misalignment between "what the strategy intends" and "what the inference yields." The total objective is: $$\mathcal{L}=\mathcal{L}_{\text{PPO}}+\mathcal{L}_{\text{critic}}-\mathcal{L}_{\text{ELBO}}+\lambda\mathcal{L}_{\text{dir}}+\gamma\mathcal{L}_{\text{stab}}.$$ Here $\lambda$ and $\gamma$ weight the two regularizers; this $\gamma$ is a loss coefficient and is distinct from the discount factor $\gamma$ in the TD error.
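
The two regularizers can be written per sample as in the sketch below; the expectation in $\mathcal{L}_{\text{dir}}$ would then be a batch average. The function names and the `eps` threshold used to detect a zero return are assumptions made for this example.

```python
import math


def directional_alignment_loss(omega_pred, vec_return, eps=1e-8):
    """1 - cos(omega_pred, G); returns 0.0 (inactive) when ||G|| is zero."""
    g_norm = math.sqrt(sum(g * g for g in vec_return))
    if g_norm <= eps:
        return 0.0
    # omega_pred lies on the simplex, so its norm is strictly positive.
    w_norm = math.sqrt(sum(w * w for w in omega_pred))
    dot = sum(w * g for w, g in zip(omega_pred, vec_return))
    return 1.0 - dot / (w_norm * g_norm)


def self_consistency_loss(omega_pred, omega_hat):
    """Squared L2 distance between the predicted and the selected preference."""
    return sum((p - h) ** 2 for p, h in zip(omega_pred, omega_hat))


print(directional_alignment_loss([1.0, 0.0], [2.0, 0.0]))  # aligned directions
print(directional_alignment_loss([1.0, 0.0], [0.0, 3.0]))  # orthogonal directions
print(self_consistency_loss([0.5, 0.5], [1.0, 0.0]))
```

Because the cosine ignores magnitude, the alignment term only constrains which objectives the preference points toward, while the self-consistency term penalizes any gap between the encoder output and the envelope-selected candidate.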

Key Experimental Results

Main Results

DPI is evaluated in two non-stationary environments: Queue (energy vs. ethics) and Maze (2D navigation with deadlines, hazards, and energy constraints). It is compared against 6 baselines (MER = Mean Episode Return, SR = Success Rate):

| Method | Queue MER | Queue SR (%) | Maze MER | Maze SR (%) |
| --- | --- | --- | --- | --- |
| RANDOM | −24.24 | 17.25 | −223.55 | 0.00 |
| FIXED (Fixed Prefs) | −4.19 | 10.05 | 16.15 | 1.12 |
| RS (Random Switch) | −4.29 | 11.43 | −23.66 | 0.01 |
| HEURISTIC (Rule-based) | −1.60 | 10.05 | −3.65 | 0.00 |
| ENVELOPE (Ext. Prefs) | −3.54 | 25.10 | 10.36 | 0.01 |
| DPI (w/ Q-learning) | 3.74 | 29.09 | 27.35 | 42.94 |
| DPI (w/ PPO) | 10.34 | 39.95 | 30.16 | 59.04 |

In Queue, DPI (w/ PPO) exceeds ENVELOPE, the baseline with the highest SR, by 14.85 percentage points in SR (39.95% vs. 25.10%). In Maze, its MER is 191.1% higher than ENVELOPE's (30.16 vs. 10.36), and it reaches a 59.04% SR, whereas FIXED, HEURISTIC, and ENVELOPE fail almost entirely (SR of at most 1.12%).
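
The headline comparisons can be reproduced from the table with a few lines of arithmetic. The variable names are illustrative; the numbers are copied from the table, and the comparison against FIXED is an additional check, since FIXED has the highest Maze MER among the non-DPI methods.

```python
# Values copied from the results table.
queue_sr = {"ENVELOPE": 25.10, "DPI (w/ PPO)": 39.95}
maze_mer = {"FIXED": 16.15, "ENVELOPE": 10.36, "DPI (w/ PPO)": 30.16}

# Absolute SR gap in percentage points.
sr_gap_pp = queue_sr["DPI (w/ PPO)"] - queue_sr["ENVELOPE"]


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


gain_vs_envelope = relative_gain(maze_mer["DPI (w/ PPO)"], maze_mer["ENVELOPE"])
gain_vs_fixed = relative_gain(maze_mer["DPI (w/ PPO)"], maze_mer["FIXED"])

print(round(sr_gap_pp, 2))         # 14.85
print(round(gain_vs_envelope, 1))  # 191.1
print(round(gain_vs_fixed, 1))     # 86.7
```

Relative gains are only meaningful here because both reference MER values are positive; for baselines with a negative MER the absolute difference is the safer comparison.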

Ablation Study

Short-term resilience is measured by Post-Shift Performance (PS@K), the average return over the $K$ steps after an event:

| Variant | Description | Effect |
| --- | --- | --- |
| Full DPI | All reg. terms + PPO | Fastest recovery, highest PS@K |
| w/o KL | No KL prior in ELBO | Significant performance drop |
| w/o dir | No directional alignment | Significant performance drop |
| w/o sta | No self-consistency anchoring | Significant performance drop |
| w/ Q-learning | Q-learning instead of Actor-Critic | Significant degradation in recovery |

History window $H$ ablation: performance degrades at $H=1$ (lack of context) and plateaus after $H\geq9$; $H=3$ is chosen as a compute-performance trade-off.
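
The PS@K metric used in this ablation can be sketched as follows. The exact windowing is not spelled out in this summary, so the sketch assumes that the window covers the $K$ steps immediately following each event step, that a window truncated by the end of the episode is averaged over the steps that remain, and that results are averaged over events; the function name is hypothetical.

```python
def post_shift_performance(step_rewards, event_steps, k):
    """Mean scalar reward over the k steps after each event, averaged over events."""
    window_means = []
    for t in event_steps:
        window = step_rewards[t + 1 : t + 1 + k]
        if window:  # skip events with no steps left in the episode
            window_means.append(sum(window) / len(window))
    if not window_means:
        return float("nan")
    return sum(window_means) / len(window_means)


rewards = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
print(post_shift_performance(rewards, event_steps=[1], k=3))  # window is steps 2..4
```

A small `k` emphasizes how quickly the agent reacts to a shift, while a larger `k` blends recovery speed with steady-state performance after the event.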

Key Findings

PS@K Curves: HEURISTIC and ENVELOPE remain stagnant post-event. FIXED maintains moderate PS@K by greedily farming scalar rewards but fails tasks (low SR), indicating that no single fixed preference handles all events. DPI recovers quickly and attains the highest PS@K.
Preference–Reward Alignment: Measured by $\mathrm{Align}(t)=\frac{\langle\hat\omega_t,\vec{r}_t\rangle}{\|\hat\omega_t\|\|\vec{r}_t\|}$, DPI maintains positive cosine similarity that spikes post-event, while baselines hover near zero or become negative. This suggests that, among the compared methods, only DPI learns value representations that track task semantics.
Interpretable event→preference→behavior chain: The agent takes shortcuts when deadlines tighten, increases avoidance during hazard surges, and prioritizes waiting when energy is low. Patterns are consistent across contexts, showing DPI re-evaluates "what matters now" rather than replaying fixed plans.

Highlights & Insights

Challenging the "Known Reward/Preference" Assumption: DPI's true contribution lies in reframing the problem: preference is a latent state, and value appraisal is a sub-problem to be learned. It translates descriptive theories of "constructive preferences" into a deployable variational inference algorithm.
Elegant Reuse of the Envelope Operator: By bringing the envelope operator online at decision time within a preference-conditioned actor–critic, the agent maintains on-policy consistency across both action and preference spaces.
Regularization against Degeneracy: Directional alignment and self-consistency terms prevent the agent from pursuing unattainable goals or exploiting transient fluctuations, abstracting the intuition that human preferences are smooth and grounded in feasibility.
Uncertainty as Exploration: The Gaussian posterior + softmax captures cognitive uncertainty and naturally provides an exploration mechanism within the preference space.

Limitations & Future Work

Controlled Simulations: The environments (Queue/Maze/Modified MuJoCo) are closed and controlled. Moving toward open-world or real-world scenarios remains a core challenge.
Simple Preference Dynamics: Current event-driven shifts are discrete and single-agent, lacking multi-agent coupling or long-term evolutionary preference dynamics.
Strong Boltzmann-Rationality Assumptions: Treating scalarized returns directly as log-likelihoods may not hold in more complex POMDPs.
Hyperparameter Sensitivity: While robust to major parameters, the cost of tuning $\beta$ and coefficients $\lambda, \gamma$ on more difficult tasks is not fully shown.

Related Work

MORL Strands: Both scalarization and Pareto/Envelope methods assume externally supplied preferences. DPI fills this gap by treating preference weights explicitly as latent states.
Cognitive Bridging: Bounded rationality, appraisal theory, and dual-process frameworks support the "appraisal vs. selection" split, while the Bayesian brain theory provides the foundation for variational posterior updates.
Difference from RLHF: Unlike RLHF, which infers rewards from external human feedback, DPI enables agents to infer their own current trade-offs online using internal vector reward evidence.

Rating

Novelty: ⭐⭐⭐⭐. Reframing preferences as drifting latent states is a significant shift in problem setting, and the VI + envelope instantiation is clean, although the individual components are existing tools.
Experimental Thoroughness: ⭐⭐⭐. Systematic benchmarks across three environments and six baselines with detailed alignment analysis; however, the environments are somewhat toy-like, and the MuJoCo results are secondary.
Writing Quality: ⭐⭐⭐⭐. Clear narrative from cognitive motivation to interpretable behavior, with intuitive examples such as the queue-jumping anecdote.
Value: ⭐⭐⭐⭐. Provides an interpretable, online-adaptive preference inference paradigm for MORL and cognition-inspired RL, with open-source code enhancing utility.