MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following¶

Conference: ACL 2026
arXiv: 2606.06058
Code: https://github.com/m-salmani78/MDP-GRPO
Area: LLM Alignment / RLVR / Multi-Constraint Instruction Following
Keywords: GRPO, Verifiable Rewards, Multi-Constraint Instructions, Advantage Stabilization, Prospect Theory

TL;DR¶

MDP-GRPO addresses the instability of GRPO under discrete, low-variance rewards in multi-constraint instruction following. It combines multi-temperature sampling, a dual-anchor advantage, prospect-theoretic shaping, and an asymmetric KL penalty, enabling small models to achieve more stable soft and hard constraint satisfaction rates (SSR/HSR) on IFEval, FollowBench, and a custom multi-constraint test set.

Background & Motivation¶

Background: LLMs can follow many natural language instructions. However, when a request contains multiple explicit constraints (e.g., format, vocabulary, case, ending phrases, structured output), models often fail to satisfy all of them. Such multi-constraint prompts are common in real-world deployment (legal templates, product copy, developer tool outputs, security policies), where "missing one makes it unusable."

Limitations of Prior Work: Reinforcement Learning with Verifiable Rewards (RLVR) is well-suited for such tasks because each constraint can be deterministically checked by rule-based verifiers, avoiding the bias of learned reward models or LLM-as-a-judge. However, these rewards are typically discrete, sparse, and low-variance. GRPO relies on within-group z-score normalization of multiple samples for the same prompt; when the reward distribution within a group is overly homogeneous, the advantage estimate suffers from three pathological failure modes: low-variance amplification, mean-centering blindness, and zero-variance collapse.

Key Challenge: The within-group relative normalization of GRPO only identifies "who is better in the group" and ignores the absolute reward level. In the early stages of multi-constraint training, all samples in a group may be equally wrong or equally right, giving zero intra-group variance and therefore no learning signal. When the variance is non-zero but small, dividing by a tiny standard deviation amplifies minute reward differences into excessive gradients. Models need to retain within-group comparison signals while also knowing "how far they are from full constraint satisfaction."

Goal: The authors aim to stabilize GRPO without introducing a critic, making it suitable for deterministic multi-constraint rewards. Goals include reducing homogeneous groups, restoring learning signals in zero-variance groups, controlling the magnitude of advantage updates, and more conservatively constraining policy drift when punishing constraint violations.

Key Insight: Instead of modifying the rewards themselves, the paper focuses on sampling and advantage estimation. Multi-temperature sampling makes homogeneous groups less likely; dual-anchor advantage injects absolute goal levels into the advantage estimate; prospect shaping borrows loss aversion from behavioral economics to bound update magnitudes; and asymmetric KL divergence constrains policy deviation more strongly during negative-advantage updates.

Core Idea: The standard single within-group z-score advantage of GRPO is extended into a hybrid advantage consisting of a "within-group relative signal + goal-aware anchor signal." This is passed through bounded, asymmetric shaping before the policy update, mitigating low-variance amplification, mean-centering blindness, and zero-variance collapse simultaneously.

Method¶

Overall Architecture¶

MDP-GRPO follows the critic-free, group-based policy optimization of GRPO. Given an instruction \(x\), the model generates \(G\) completions, and the reward for each completion is the fraction of satisfied constraints, \(r(x,y)=\frac{1}{C(x)}\sum_{t=1}^{C(x)} c_t(x,y)\), where \(c_t(x,y)\in\{0,1\}\) indicates whether constraint \(t\) is satisfied. Standard GRPO computes the advantage from the group mean and standard deviation alone. MDP-GRPO adds three stabilization modules: (1) multi-temperature sampling to generate more diverse responses, (2) a dual-anchor advantage that computes both a group-relative signal \(z_i\) and a goal-aware signal \(\delta_i\), and (3) bounded prospect-theoretic shaping, which yields the final advantage \(A_i\). This advantage is used in the clipped GRPO objective, with the KL coefficient chosen asymmetrically according to the sign of \(A_i\).
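
To make the starting point concrete, the following minimal sketch computes the standard within-group z-score advantage on toy reward groups. The function name `grpo_advantages`, the `eps` value, the use of the population standard deviation, and the reward values are illustrative assumptions, not taken from the paper or its code.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Within-group z-score advantage (assumed form: population std plus eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Zero-variance groups: identical rewards give all-zero advantages, and an
# all-wrong group is indistinguishable from an all-right group.
all_wrong = grpo_advantages([0.0, 0.0, 0.0, 0.0])
all_right = grpo_advantages([1.0, 1.0, 1.0, 1.0])

# Low-variance group: one completion satisfies 6 of 6 constraints, the others
# 5 of 6. After normalization this looks the same as a 0-vs-1 reward gap.
near_tie = grpo_advantages([5 / 6, 5 / 6, 5 / 6, 1.0])
wide_gap = grpo_advantages([0.0, 0.0, 0.0, 1.0])
```

Because the z-score is scale-invariant, `near_tie` and `wide_gap` are almost identical, which illustrates why small reward differences can produce large updates.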

Regarding data, the authors constructed 3,000 training prompts. Approximately one-third of the seed prompts are from existing data, while the rest are manually curated, covering general Q&A, creative writing, and material assistance. Each instruction is injected with 1-6 constraints, with a taxonomy covering 9 high-level categories and 26 constraint types, verified through deterministic validators such as regex and parsers.
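
As a rough illustration of how deterministic validators can produce the fractional reward, the sketch below checks four toy constraints with string operations and a regex. The validator names, the constraints, and the sample response are all hypothetical; the paper's actual 26 constraint types and their checkers are not reproduced here.

```python
import re

def ends_with_phrase(text, phrase):
    """True if the response ends with the required phrase."""
    return text.rstrip().endswith(phrase)

def is_all_lowercase(text):
    """True if the response contains no uppercase letters."""
    return text == text.lower()

def bullet_count(text):
    """Count lines that start with a '-' or '*' bullet marker."""
    return len(re.findall(r"^[ \t]*[-*][ \t]+", text, flags=re.MULTILINE))

def within_word_limit(text, limit):
    """True if the whitespace-separated token count is at most `limit`."""
    return len(text.split()) <= limit

def constraint_reward(text, checks):
    """Fraction of satisfied constraints: r = (1/C) * sum_t c_t."""
    results = [bool(check(text)) for check in checks]
    return sum(results) / len(results)

checks = [
    lambda t: ends_with_phrase(t, "thank you."),
    is_all_lowercase,
    lambda t: bullet_count(t) == 2,
    lambda t: within_word_limit(t, 5),
]
response = "- first point\n- second point\nthank you."
reward = constraint_reward(response, checks)  # 3 of 4 checks pass
```

With only a handful of constraints per prompt, the reward takes just a few discrete values, which is the setting in which group normalization becomes fragile.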

Key Designs¶

Multi-Temperature Group Sampling:
- Function: Reduces the probability of \(G\) completion rewards being identical for the same prompt.
- Mechanism: Standard GRPO typically uses a fixed temperature for sampling the entire group. MDP-GRPO employs a temperature schedule \(\mathbf{T}=[\tau_1,...,\tau_G]\) for different samples in a group, e.g., \([0.1, 0.4, 0.7, 1.0]\). Low-temperature samples provide high-quality exploitation, while high-temperature samples increase exploration, making different constraint satisfaction patterns more likely.
- Design Motivation: The root cause of zero-variance collapse is the lack of reward differentiation within a group, leaving the advantage without direction. Multi-temperature sampling improves reward dispersion at the data generation stage without changing the objective function, which is particularly beneficial for small group sizes like \(G=4\).
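
A minimal sketch of assigning a temperature schedule to the completions of one group follows. How the schedule is extended when \(G\) exceeds its length is not specified in this note, so cyclic tiling is an assumption, as are the function names and the dictionary keys used for the decoding settings.

```python
from itertools import cycle, islice

def assign_temperatures(schedule, group_size):
    """Tile the schedule cyclically over the G completions (assumed rule)."""
    if not schedule:
        raise ValueError("schedule must be non-empty")
    return list(islice(cycle(schedule), group_size))

def sampling_configs(schedule, group_size, top_p=0.9, max_new_tokens=1024):
    """One decoding config per completion; key names are illustrative only."""
    return [
        {"temperature": t, "top_p": top_p, "max_new_tokens": max_new_tokens}
        for t in assign_temperatures(schedule, group_size)
    ]

schedule = [0.1, 0.4, 0.7, 1.0]
temps_g4 = assign_temperatures(schedule, 4)
temps_g8 = assign_temperatures(schedule, 8)
configs = sampling_configs(schedule, 4)
```

Each completion in the group would then be generated with its own entry of `configs` instead of one shared temperature.
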

Dual-Anchor Advantage:
- Function: Retains both relative intra-group comparison and absolute performance relative to the goal.
- Mechanism: The standard group signal is \(z_i=(r_i-\mu_{group})/(\sigma_{group}+\epsilon)\). MDP-GRPO introduces a goal-aware anchor: assuming a neutral baseline in which each constraint is satisfied independently with \(p=0.5\), the reward is the mean of \(C(x)\) Bernoulli outcomes, so its center is \(\mu_{goal}=0.5\) and its standard deviation is \(\sigma_{goal}=\sqrt{p(1-p)/C(x)}=1/(2\sqrt{C(x)})\). The absolute advantage is then \(\delta_i=(r_i-\mu_{goal})/\sigma_{goal}=2\sqrt{C(x)}(r_i-0.5)\). The final advantage mixes the shaped \(z_i\) and \(\delta_i\) signals using a mixing weight \(\alpha\).
- Design Motivation: Mean-centering blindness causes "all-wrong groups" and "all-right groups" to look identical after normalization. The goal-aware anchor informs the model whether a completion is above or below the neutral constraint satisfaction level, restoring directional signals even when all group samples are identical.
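
The two signals can be sketched as follows, using the formulas stated above. The function names and toy reward groups are illustrative assumptions; the sketch only shows the unshaped \(z_i\) and \(\delta_i\) and does not reproduce the paper's final mixing rule.

```python
import math
import numpy as np

def group_signal(rewards, eps=1e-6):
    """Group-relative signal z_i = (r_i - mu_group) / (sigma_group + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def goal_signal(rewards, num_constraints, mu_goal=0.5):
    """Goal-aware signal delta_i = 2 * sqrt(C) * (r_i - 0.5)."""
    r = np.asarray(rewards, dtype=float)
    sigma_goal = 1.0 / (2.0 * math.sqrt(num_constraints))
    return (r - mu_goal) / sigma_goal

# Two homogeneous groups for a prompt with C = 4 constraints.
all_wrong = [0.0, 0.0, 0.0, 0.0]
all_right = [1.0, 1.0, 1.0, 1.0]

z_wrong, z_right = group_signal(all_wrong), group_signal(all_right)
d_wrong, d_right = goal_signal(all_wrong, 4), goal_signal(all_right, 4)
```

Here both `z_wrong` and `z_right` are all zeros, while `d_wrong` is -2 and `d_right` is +2 for every sample, so the anchor restores a direction that the z-score alone cannot provide.
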

Prospect-Theoretic Shaping and Asymmetric KL:
- Function: Limits excessive updates caused by low-variance amplification and imposes stronger penalties on constraint violations.
- Mechanism: A scaled tanh transformation is applied to the raw advantage signal, with a larger bound (in magnitude) on the negative side, i.e., \(\lambda_->\lambda_+>0\). Experiments use \((\lambda_+,\lambda_-)=(1.25,2.0)\) and \(\beta_{PT}=0.8\), so positive gains show diminishing returns while negative violations are more "painful." Additionally, asymmetric KL uses a higher coefficient \(\beta^{high}_{KL}=0.025\) when \(A_i<0\) and \(\beta^{low}_{KL}=0.01\) when \(A_i\ge 0\).
- Design Motivation: In multi-constraint tasks, regressing on a single constraint can make an output unusable; thus, negative updates should be more conservative. Bounded tanh prevents advantage explosion, and loss aversion emphasizes correcting samples that violate constraints.
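
The sketch below shows one plausible reading of the shaping and the KL switch. The exact functional form is not given in this note, so \(f(a)=\lambda_\pm\tanh(\beta_{PT}\,a)\) is an assumption, as is the mixing rule \((1-\alpha)f(z_i)+\alpha f(\delta_i)\) in `hybrid_advantage`; all function names are hypothetical. Only the hyperparameter values come from the text.

```python
import numpy as np

def prospect_shape(a, lam_pos=1.25, lam_neg=2.0, beta_pt=0.8):
    """Bounded asymmetric tanh (assumed form): output lies in (-lam_neg, lam_pos)."""
    a = np.asarray(a, dtype=float)
    scale = np.where(a >= 0, lam_pos, lam_neg)
    return scale * np.tanh(beta_pt * a)

def kl_coefficient(advantage, beta_low=0.01, beta_high=0.025):
    """Higher KL coefficient for negative advantages, lower otherwise."""
    a = np.asarray(advantage, dtype=float)
    return np.where(a < 0, beta_high, beta_low)

def hybrid_advantage(z, delta, alpha=0.2):
    """Assumed mixing rule: (1 - alpha) * f(z) + alpha * f(delta)."""
    return (1.0 - alpha) * prospect_shape(z) + alpha * prospect_shape(delta)

# An all-wrong group with C = 4: z is zero everywhere, delta is -2.
adv = hybrid_advantage(np.zeros(4), np.full(4, -2.0))
betas = kl_coefficient(adv)
```

Under these assumptions the all-wrong group receives a bounded negative advantage instead of zero, and every sample in it is paired with the higher KL coefficient.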

Loss & Training¶

Training uses the standard GRPO clipped surrogate, with the advantage replaced by the shaped hybrid advantage and, optionally, an asymmetric KL coefficient. Models tested include Gemma-2-2B-Instruct and Llama-3.2-3B-Instruct. Training settings: single NVIDIA A100, learning rate \(1\times10^{-5}\), batch size 32, PPO clip \(\epsilon_{clip}=0.2\), base KL coefficient 0.01, maximum generation length 1024, top_p=0.9, 1 epoch. The primary group size is \(G=8\), with \(G=4\) used to analyze the effect of multi-temperature sampling in small groups. The dual-anchor mixing weight is \(\alpha=0.2\), and the target center is \(\mu_{goal}=0.5\).
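
A simplified sketch of a clipped surrogate with a sign-dependent KL coefficient is given below. It works on per-sequence log-probabilities and a precomputed per-sample KL estimate, whereas practical GRPO implementations operate per token; the function name, argument names, and this simplification are assumptions rather than the authors' implementation.

```python
import numpy as np

def clipped_loss_asym_kl(logp_new, logp_old, advantages, kl_to_ref,
                         clip_eps=0.2, beta_low=0.01, beta_high=0.025):
    """Mean of -(clipped surrogate) + beta(A_i) * KL, per-sequence simplification."""
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    kl = np.asarray(kl_to_ref, dtype=float)

    ratio = np.exp(logp_new - logp_old)            # importance ratio
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    surrogate = np.minimum(unclipped, clipped)     # PPO-style pessimistic bound

    beta = np.where(adv < 0, beta_high, beta_low)  # asymmetric KL coefficient
    return float(np.mean(-surrogate + beta * kl))

# On-policy case (ratio = 1) with one positive and one negative advantage.
loss_on_policy = clipped_loss_asym_kl([-1.0, -1.0], [-1.0, -1.0],
                                      [1.0, -1.0], [0.1, 0.1])
# Large ratio with positive advantage: the surrogate is clipped at 1 + clip_eps.
loss_clipped = clipped_loss_asym_kl([0.0], [-1.0], [1.0], [0.0])
```

In the on-policy example the two surrogate terms cancel and only the KL penalties remain, with the negative-advantage sample penalized 2.5 times as strongly as the positive one.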

Key Experimental Results¶

Main Results¶

Model / Group Size	Method	IFEval SSR/HSR	Custom SSR/HSR	FollowBench SSR/HSR	Key Observation
Gemma-2-2B, G=8	Baseline	56.7 / 45.1	54.8 / 18.8	63.7 / 52.9	Zero-shot instruction model
Gemma-2-2B, G=8	GRPO	73.7 / 62.4	68.4 / 29.0	64.0 / 53.2	Standard GRPO shows significant gains
Gemma-2-2B, G=8	MDP-GRPO	75.3 / 64.1	70.3 / 32.8	66.9 / 57.4	More stable; Custom HSR +3.8 vs GRPO
Llama-3.2-3B, G=8	Baseline	54.2 / 46.8	60.3 / 20.8	69.7 / 59.8	Llama initially stronger on FollowBench
Llama-3.2-3B, G=8	GRPO	66.1 / 58.5	65.1 / 24.8	68.4 / 58.9	Custom gains; FollowBench slight drop
Llama-3.2-3B, G=8	MDP-GRPO	71.3 / 59.8	65.8 / 25.2	69.4 / 59.1	IFEval SSR +5.2 vs GRPO

The paper notes that individual components are not always globally optimal across all metrics. For instance, PT-GRPO achieves the highest HSR of 65.8% for Gemma-2-2B on IFEval, and DA-PT-GRPO achieves a slightly higher SSR of 71.5% for Llama-3.2-3B on IFEval than MDP-GRPO's 71.3%. The full pipeline is emphasized as providing a more stable overall profile rather than ranking first in every single metric.

Ablation Study¶

Setting	Key Figure	Description
Gemma, G=8, Custom HSR	GRPO 29.0, DA-GRPO 32.6, DA-PT-GRPO 33.4, MDP-GRPO 32.8	Goal anchors help most with complex constraint combinations
Gemma, G=4, IFEval	GRPO 69.7/58.2, MT-GRPO 71.1/59.4, MDP-GRPO 71.2/59.5	MT restores reward dispersion in small groups
Gemma, G=4, Custom HSR	GRPO 28.6, MT-GRPO 30.6, MDP-GRPO 30.4	MT effects are amplified in low-diversity settings
Llama, G=4, IFEval	GRPO 67.2/55.0, MT-GRPO 70.5/58.4	Multi-temperature sampling also effective for Llama small groups
Difficulty Analysis	Baseline HSR <10% at Difficulty 4; DA-PT-GRPO ~20% vs GRPO ~12% at Difficulty 5	Stabilization methods are more robust to degradation at high constraint counts

Key Findings¶

- Standard GRPO significantly improves verifiable instruction following but suffers from homogeneous groups and mean-centering issues on difficult multi-constraint prompts.
- Dual-anchor (DA) shows the most consistent HSR gains on the Custom Test Set, aligning with its design goal of fixing zero-variance/absolute blindness.
- Prospect shaping effectively controls KL drift while preserving reward gains; DA-PT-GRPO further suppresses KL drift.
- MT-GRPO might increase KL and decrease entropy under the current schedule, requiring careful temperature tuning; however, it is critical for performance recovery in the \(G=4\) small-group setting.

Highlights & Insights¶

- Diagnosis precedes method: The three failure modes (low-variance amplification, mean-centering blindness, and zero-variance collapse) clearly explain the instability of GRPO under discrete rewards.
- Advantage modification over reward modification: Multi-constraint rewards provided by deterministic checkers are reliable and cheap. MDP-GRPO avoids introducing learned reward models, instead supplementing signals at the advantage estimation and sampling stages.
- Reintroducing absolute target levels to critic-free RL: The appeal of GRPO is the lack of a value model, but the cost is only seeing relative within-group quality. Dual-anchor serves as a lightweight compromise, adding "goal awareness" to critic-free methods.
- Restrained use of Prospect Theory: It is not used to redefine human preference goals, but rather as a bounded asymmetric transformation for advantage shaping, which is more acceptable from an engineering perspective.

Limitations & Future Work¶

- The method relies on explicit, automatically verifiable constraints. For subjective, stylistic, or underspecified constraints, learned rewards or preference feedback may be necessary, which reintroduces reward misspecification and judge bias.
- MDP-GRPO introduces multiple hyperparameters: anchor mixing weight, shaping parameters, temperature schedule, and asymmetric KL coefficients. These require recalibration and KL monitoring when migrating to different reward scales or task domains.
- Experiments focus on 2B/3B instruction-tuned models. Larger models, multilingual tasks, and structured domains such as tool use and code generation have not yet been validated.
- The method optimizes for constraint satisfaction encoded in the reward specification; it does not automatically guarantee broader safety, factuality, or value alignment.
- While multi-temperature sampling increases reward dispersion, it may also increase KL or decrease entropy, so decoding settings need to be controlled carefully in practical training.

Related Work & Insights¶

- vs Standard GRPO: Standard GRPO uses intra-group z-score advantage, which is simple but fragile under discrete, low-variance rewards; MDP-GRPO stabilizes updates via sampling, anchors, and shaping.
- vs MAPO / NGRPO: Related works attempt to fix group-relative advantage allocation or all-negative groups; MDP-GRPO differs by simultaneously reducing homogeneous groups, restoring goal anchors, and applying prospect-style bounded shaping.
- vs KTO: KTO uses prospect-theoretic utility at the objective function level for binary preferences; MDP-GRPO applies prospect-inspired shaping at the advantage level, keeping the reward definition unchanged.
- Insights: For tasks where rule-based verifiers are strong and rewards are discrete (formatted output, tool API schemas, code linting, compliance document generation), a goal-anchored advantage is worth considering in place of the purely relative advantage of standard GRPO.

Rating¶

Novelty: ⭐⭐⭐⭐☆ Clear contribution by deconstructing GRPO failure modes and combining dual-anchor/prospect shaping.
Experimental Thoroughness: ⭐⭐⭐⭐☆ Covers two models, two group sizes, three types of benchmarks, ablation, and training diagnostics; the main drawback is the limited model scale.
Writing Quality: ⭐⭐⭐⭐☆ Motivations and formulas are clear, tables are complete; overall logic is well-structured.
Value: ⭐⭐⭐⭐☆ Very practical for RLVR, multi-constraint tasks, and critic-free policy optimization, suitable for future expansion to larger models.