MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following
Conference: ACL 2026
arXiv: 2606.06058
Code: https://github.com/m-salmani78/MDP-GRPO
Area: LLM Alignment / RLVR / Multi-Constraint Instruction Following
Keywords: GRPO, Verifiable Rewards, Multi-Constraint Instructions, Advantage Stabilization, Prospect Theory
TL;DR
MDP-GRPO addresses the instability of GRPO under discrete, low-variance rewards in multi-constraint instruction following. By integrating multi-temperature sampling, dual-anchor advantage, prospect-theoretic shaping, and asymmetric KL divergence, it enables small models to achieve more stable soft/hard constraint satisfaction rates on IFEval, FollowBench, and custom multi-constraint test sets.
Background & Motivation
Background: LLMs are capable of following many natural language instructions. However, when a request contains multiple explicit constraints (e.g., format, vocabulary, case, ending phrases, structured output), models often fail to satisfy all of them. In real-world deployment (legal templates, product copy, developer tool outputs, security policies), such multi-constraint prompts are common, and missing a single constraint can make the output unusable.
Limitations of Prior Work: Reinforcement Learning with Verifiable Rewards (RLVR) is well-suited for such tasks because each constraint can be deterministically checked by rule-based verifiers, avoiding the bias of learned reward models or LLM-as-a-judge. However, these rewards are typically discrete, sparse, and low-variance. GRPO relies on within-group z-score normalization of multiple samples for the same prompt; when the reward distribution within a group is overly homogeneous, the advantage estimate suffers from three pathological failure modes: low-variance amplification, mean-centering blindness, and zero-variance collapse.
Key Challenge: GRPO's within-group normalization identifies which sample is better within the group but ignores the absolute reward level. In the early stages of multi-constraint training, all samples may be equally wrong or equally right, so intra-group variance is zero and the advantage vanishes. Even when variance is non-zero but small, dividing by the group standard deviation can amplify minute reward differences into excessive gradients. The model needs to retain within-group comparison signals while also knowing how far it is from full constraint satisfaction.
Goal: The authors aim to stabilize GRPO without introducing a critic, making it suitable for deterministic multi-constraint rewards. Goals include reducing homogeneous groups, restoring learning signals in zero-variance groups, controlling the magnitude of advantage updates, and more conservatively constraining policy drift when punishing constraint violations.
Key Insight: Instead of modifying the rewards themselves, the paper focuses on sampling and advantage estimation. Multi-temperature sampling prevents homogeneous groups; dual-anchor advantage injects absolute goal levels into the advantage estimate; prospect shaping uses loss aversion from behavioral economics to limit update magnitudes; and asymmetric KL divergence more strongly constrains policy deviation during negative advantage updates.
Core Idea: The standard single within-group z-score advantage of GRPO is extended into a hybrid advantage consisting of a "within-group relative signal + goal-aware anchor signal." This is passed through bounded, asymmetric shaping before the policy update, mitigating low-variance amplification, mean-centering blindness, and zero-variance collapse simultaneously.
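The three failure modes can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the reward groups are made-up values, and `grpo_advantages` is a hypothetical helper that implements the plain within-group z-score, not code from the paper's repository.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Plain within-group z-score advantage, as used by standard GRPO."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Zero-variance collapse: identical rewards give an all-zero advantage.
all_wrong = grpo_advantages([0.0, 0.0, 0.0, 0.0])
all_right = grpo_advantages([1.0, 1.0, 1.0, 1.0])

# Mean-centering blindness: the all-wrong and all-right groups are
# indistinguishable after normalization.
print(all_wrong, all_right)

# Low-variance amplification: one sample satisfies a single extra
# constraint out of six, yet it receives (almost exactly) the same
# advantage as a sample that beats its group by a full reward unit.
near_tie = grpo_advantages([0.5, 0.5, 0.5, 4 / 6])
clear_gap = grpo_advantages([0.0, 0.0, 0.0, 1.0])
print(near_tie, clear_gap)
```

In both of the last two groups the best sample gets an advantage of roughly 1.73, which is why the method bounds the shaped advantage and adds an absolute anchor.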
Method
Overall Architecture
MDP-GRPO follows the critic-free, group-based policy optimization of GRPO. Given an instruction \(x\), the model generates \(G\) completions, and the reward for each completion is the fraction of satisfied constraints, \(r(x,y)=\frac{1}{C(x)}\sum_t c_t(x,y)\). Standard GRPO computes the advantage from the group mean and standard deviation alone. MDP-GRPO adds three stabilization modules: multi-temperature sampling to generate more diverse responses; joint computation of the group-relative signal \(z_i\) and the goal-aware signal \(\delta_i\); and bounded prospect-theoretic shaping, which yields the final advantage \(A_i\). This advantage is used in the clipped GRPO objective, with the KL coefficient chosen asymmetrically according to the sign of \(A_i\).
Regarding data, the authors constructed 3,000 training prompts. Approximately one-third of the seed prompts are from existing data, while the rest are manually curated, covering general Q&A, creative writing, and material assistance. Each instruction is injected with 1-6 constraints, with a taxonomy covering 9 high-level categories and 26 constraint types, verified through deterministic validators such as regex and parsers.
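As a rough illustration of how deterministic validators turn into the reward \(r(x,y)\), the sketch below scores one completion against a handful of rule-based checks. The validator names and the example constraints are hypothetical and are not taken from the paper's 26-type taxonomy; only the "fraction of satisfied constraints" reward follows the text.

```python
import json
import re

# Hypothetical validators: each returns True when its constraint holds.
def all_lowercase(text):
    return text == text.lower()

def ends_with(phrase):
    return lambda text: text.rstrip().endswith(phrase)

def max_words(limit):
    return lambda text: len(text.split()) <= limit

def has_n_bullets(n):
    pattern = re.compile(r"^[ \t]*[-*] ", flags=re.MULTILINE)
    return lambda text: len(pattern.findall(text)) == n

def is_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def constraint_reward(completion, validators):
    """Fraction of satisfied constraints: r = (1 / C) * sum_t c_t."""
    results = [bool(check(completion)) for check in validators]
    return sum(results) / len(results)

completion = "- apples\n- pears\nthat is all."
validators = [all_lowercase, ends_with("that is all."),
              has_n_bullets(3), max_words(10)]
reward = constraint_reward(completion, validators)  # 3 of 4 checks pass
print(reward)
```

Because every check is a pure function of the completion, the reward is reproducible and takes only \(C(x)+1\) distinct values, which is the source of the discreteness discussed above.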
Key Designs
- Multi-Temperature Group Sampling:
  - Function: Reduces the probability that all \(G\) completions for the same prompt receive identical rewards.
  - Mechanism: Standard GRPO typically uses a fixed temperature for sampling the entire group. MDP-GRPO employs a temperature schedule \(\mathbf{T}=[\tau_1,...,\tau_G]\) for different samples in a group, e.g., \([0.1, 0.4, 0.7, 1.0]\). Low-temperature samples provide high-quality exploitation, while high-temperature samples increase exploration, making different constraint satisfaction patterns more likely.
  - Design Motivation: The root cause of zero-variance collapse is the lack of reward differentiation within a group, leaving the advantage without direction. Multi-temperature sampling improves reward dispersion at the data generation stage without changing the objective function, which is particularly beneficial for small group sizes like \(G=4\).
- Dual-Anchor Advantage:
  - Function: Retains both relative intra-group comparison and absolute performance relative to the goal.
  - Mechanism: The standard group signal is \(z_i=(r_i-\mu_{group})/(\sigma_{group}+\epsilon)\). MDP-GRPO introduces a goal-aware anchor: assuming a neutral baseline where each constraint is satisfied independently with \(p=0.5\), the reward target center is \(\mu_{goal}=0.5\) and the standard deviation is \(\sigma_{goal}=1/(2\sqrt{C(x)})\). The absolute advantage is expressed as \(\delta_i=2\sqrt{C(x)}(r_i-0.5)\). The final advantage is a mixture of shaped \(z_i\) and \(\delta_i\).
  - Design Motivation: Mean-centering blindness causes "all-wrong groups" and "all-right groups" to look identical after normalization. The goal-aware anchor informs the model whether a completion is above or below the neutral constraint satisfaction level, restoring directional signals even when all group samples are identical.
- Prospect-Theoretic Shaping and Asymmetric KL:
  - Function: Limits excessive updates caused by low-variance amplification and imposes stronger penalties on constraint violations.
  - Mechanism: A scaled tanh transformation is applied to the raw advantage signal, with a larger bound on the negative side than on the positive side, i.e., \(\lambda_->\lambda_+>0\). Experiments use \((\lambda_+,\lambda_-)=(1.25,2.0)\) and \(\beta_{PT}=0.8\), so positive gains exhibit diminishing returns and negative violations are more "painful." Additionally, asymmetric KL uses a higher coefficient \(\beta^{high}_{KL}=0.025\) when \(A_i<0\) and \(\beta^{low}_{KL}=0.01\) when \(A_i\ge 0\).
  - Design Motivation: In multi-constraint tasks, regressing on a single constraint can make an output unusable; thus, negative updates should be more conservative. Bounded tanh prevents advantage explosion, and loss aversion emphasizes correcting samples that violate constraints.
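The three designs above can be sketched end to end in NumPy. Several details are assumptions made for illustration, since the summary does not spell them out: cycling the schedule when \(G\) exceeds its length, the shaping form \(\lambda_\pm\tanh(\beta_{PT}\,a)\), and the mixture \((1-\alpha)f(z_i)+\alpha f(\delta_i)\). The function names are hypothetical; the numeric constants are the ones quoted in the text.

```python
import numpy as np

LAMBDA_POS, LAMBDA_NEG, BETA_PT = 1.25, 2.0, 0.8  # shaping values from the text
ALPHA, MU_GOAL = 0.2, 0.5                         # mixing weight, goal center
BETA_KL_LOW, BETA_KL_HIGH = 0.01, 0.025           # asymmetric KL coefficients

def group_temperatures(schedule, group_size):
    """One temperature per sample; cycling the schedule is an assumed rule."""
    return [schedule[i % len(schedule)] for i in range(group_size)]

def sample_token(logits, tau, rng):
    """Draw one token id from softmax(logits / tau)."""
    scaled = np.asarray(logits, dtype=float) / tau
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def prospect_shape(a):
    """Assumed bounded shaping: gains in [0, 1.25), losses in (-2.0, 0]."""
    a = np.asarray(a, dtype=float)
    return np.where(a >= 0, LAMBDA_POS, LAMBDA_NEG) * np.tanh(BETA_PT * a)

def mdp_advantages(rewards, num_constraints, eps=1e-6):
    """Mix the shaped group-relative signal z with the shaped goal anchor delta."""
    r = np.asarray(rewards, dtype=float)
    z = (r - r.mean()) / (r.std() + eps)
    delta = 2.0 * np.sqrt(num_constraints) * (r - MU_GOAL)
    return (1 - ALPHA) * prospect_shape(z) + ALPHA * prospect_shape(delta)

def kl_coefficients(advantages):
    """Stronger KL penalty for samples with negative advantage."""
    return np.where(np.asarray(advantages) < 0, BETA_KL_HIGH, BETA_KL_LOW)

temps = group_temperatures([0.1, 0.4, 0.7, 1.0], group_size=8)
token = sample_token([5.0, 0.0, 0.0], tau=temps[0], rng=np.random.default_rng(0))

# Homogeneous groups with C(x) = 4: z is zero, but the anchor keeps a sign.
adv_all_wrong = mdp_advantages([0.0, 0.0, 0.0, 0.0], num_constraints=4)
adv_all_right = mdp_advantages([1.0, 1.0, 1.0, 1.0], num_constraints=4)
print(adv_all_wrong, kl_coefficients(adv_all_wrong))
print(adv_all_right, kl_coefficients(adv_all_right))
```

Under these assumptions the all-wrong group receives a negative advantage (with the higher KL coefficient) and the all-right group a positive one, whereas the plain z-score would return zero for both.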
Loss & Training
Training utilizes the standard GRPO clipped surrogate, replacing the advantage and optionally using asymmetric KL. Models tested include Gemma-2-2B-Instruct and Llama-3.2-3B-Instruct. Training settings: single NVIDIA A100, learning rate \(1\times10^{-5}\), batch size 32, PPO clip \(\epsilon_{clip}=0.2\), base KL coefficient 0.01, maximum generation length 1024, top_p=0.9, 1 epoch. The primary group size is \(G=8\), with \(G=4\) used to analyze the effect of multi-temperature sampling in small groups. The dual-anchor mixing weight is \(\alpha=0.2\), and the target center is \(\mu_{goal}=0.5\).
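A minimal sketch of the resulting objective, under simplifying assumptions: it works at the sequence level (real GRPO implementations typically operate per token), takes the per-sample KL estimate as a given input rather than computing it, and uses a hypothetical function name. The clip range and KL coefficients are the values listed above.

```python
import numpy as np

def mdp_grpo_loss(logp_new, logp_old, advantages, kl_per_sample,
                  eps_clip=0.2, beta_kl_low=0.01, beta_kl_high=0.025):
    """Clipped surrogate minus a KL penalty whose weight depends on sign(A_i)."""
    adv = np.asarray(advantages, dtype=float)
    ratio = np.exp(np.asarray(logp_new, dtype=float)
                   - np.asarray(logp_old, dtype=float))
    clipped = np.clip(ratio, 1.0 - eps_clip, 1.0 + eps_clip)
    # PPO-style pessimistic bound on the policy-gradient term.
    surrogate = np.minimum(ratio * adv, clipped * adv)
    # Asymmetric KL: negative-advantage samples are held closer to the reference.
    beta = np.where(adv < 0, beta_kl_high, beta_kl_low)
    objective = surrogate - beta * np.asarray(kl_per_sample, dtype=float)
    return -objective.mean()  # minimize the negative objective

# Unchanged policy (ratio = 1), one positive and one negative sample.
loss_on_policy = mdp_grpo_loss([-1.0, -2.0], [-1.0, -2.0],
                               advantages=[1.0, -1.0],
                               kl_per_sample=[0.1, 0.1])

# A ratio of exp(0.5) on a positive-advantage sample is clipped at 1.2.
loss_clipped = mdp_grpo_loss([0.5], [0.0], advantages=[1.0],
                             kl_per_sample=[0.0])
print(loss_on_policy, loss_clipped)
```

With equal KL estimates, the negative-advantage sample contributes a penalty 2.5 times larger than the positive one, which is the intended conservatism on constraint violations.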
Key Experimental Results
Main Results
| Model / Group Size | Method | IFEval SSR/HSR | Custom SSR/HSR | FollowBench SSR/HSR | Key Observation |
|---|---|---|---|---|---|
| Gemma-2-2B, G=8 | Baseline | 56.7 / 45.1 | 54.8 / 18.8 | 63.7 / 52.9 | Zero-shot instruction model |
| Gemma-2-2B, G=8 | GRPO | 73.7 / 62.4 | 68.4 / 29.0 | 64.0 / 53.2 | Standard GRPO shows significant gains |
| Gemma-2-2B, G=8 | MDP-GRPO | 75.3 / 64.1 | 70.3 / 32.8 | 66.9 / 57.4 | More stable; Custom HSR +3.8 vs GRPO |
| Llama-3.2-3B, G=8 | Baseline | 54.2 / 46.8 | 60.3 / 20.8 | 69.7 / 59.8 | Llama initially stronger on FollowBench |
| Llama-3.2-3B, G=8 | GRPO | 66.1 / 58.5 | 65.1 / 24.8 | 68.4 / 58.9 | Custom gains; FollowBench slight drop |
| Llama-3.2-3B, G=8 | MDP-GRPO | 71.3 / 59.8 | 65.8 / 25.2 | 69.4 / 59.1 | IFEval SSR +5.2 vs GRPO |
The paper notes that individual components are not always globally optimal across all metrics. For instance, PT-GRPO achieves the highest HSR of 65.8% for Gemma-2-2B on IFEval, and DA-PT-GRPO achieves a slightly higher SSR of 71.5% for Llama-3.2-3B on IFEval than MDP-GRPO's 71.3%. The full pipeline is emphasized as providing a more stable overall profile rather than ranking first in every single metric.
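For readers unfamiliar with the SSR/HSR columns in the table above, the sketch below shows how the two rates are commonly computed. It assumes the FollowBench-style definitions (SSR averages the per-prompt fraction of satisfied constraints; HSR counts a prompt only if every constraint is satisfied), which this summary does not state explicitly; the function name and the toy data are hypothetical.

```python
def ssr_hsr(per_prompt_results):
    """Return (SSR, HSR) in percent from per-prompt lists of boolean checks."""
    soft = [sum(checks) / len(checks) for checks in per_prompt_results]
    hard = [float(all(checks)) for checks in per_prompt_results]
    n = len(per_prompt_results)
    return 100 * sum(soft) / n, 100 * sum(hard) / n

# Four toy prompts with 2, 2, 2, and 4 constraints respectively.
results = [
    [True, True],
    [True, False],
    [False, False],
    [True, True, True, False],
]
ssr, hsr = ssr_hsr(results)
print(ssr, hsr)  # 56.25 25.0
```

Under these definitions HSR can never exceed SSR, and the gap widens as the number of constraints per prompt grows, which is consistent with the much lower HSR on the Custom set.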
Ablation Study
| Setting | Key Figure | Description |
|---|---|---|
| Gemma, G=8, Custom HSR | GRPO 29.0, DA-GRPO 32.6, DA-PT-GRPO 33.4, MDP-GRPO 32.8 | Goal anchors help most with complex constraint combinations |
| Gemma, G=4, IFEval | GRPO 69.7/58.2, MT-GRPO 71.1/59.4, MDP-GRPO 71.2/59.5 | MT restores reward dispersion in small groups |
| Gemma, G=4, Custom HSR | GRPO 28.6, MT-GRPO 30.6, MDP-GRPO 30.4 | MT effects are amplified in low-diversity settings |
| Llama, G=4, IFEval | GRPO 67.2/55.0, MT-GRPO 70.5/58.4 | Multi-temperature sampling also effective for Llama small groups |
| Difficulty Analysis | Baseline HSR <10% at Difficulty 4; DA-PT-GRPO ~20% vs GRPO ~12% at Difficulty 5 | Stabilization methods are more robust to degradation at high constraint counts |
Key Findings
- Standard GRPO significantly improves verifiable instruction following but suffers from homogeneous groups and mean-centering issues on difficult multi-constraint prompts.
- Dual-anchor (DA) shows the most consistent HSR gains on the Custom Test Set, aligning with its design goal of fixing zero-variance/absolute blindness.
- Prospect shaping effectively controls KL drift while preserving reward gains; DA-PT-GRPO further suppresses KL drift.
- MT-GRPO might increase KL and decrease entropy under the current schedule, requiring careful temperature tuning; however, it is critical for performance recovery in the \(G=4\) small-group setting.
Highlights & Insights
- Diagnosis precedes method: The three failure modes (low-variance amplification, mean-centering blindness, and zero-variance collapse) clearly explain the instability of GRPO under discrete rewards.
- Advantage modification over reward modification: Multi-constraint rewards provided by deterministic checkers are reliable and cheap. MDP-GRPO avoids introducing learned reward models, instead supplementing signals at the advantage estimation and sampling stages.
- Reintroducing absolute target levels to critic-free RL: The appeal of GRPO is the lack of a value model, but the cost is only seeing relative within-group quality. Dual-anchor serves as a lightweight compromise, adding "goal awareness" to critic-free methods.
- Restrained use of Prospect Theory: It is not used to redefine human preference goals, but rather as a bounded asymmetric transformation for advantage shaping, which is more acceptable from an engineering perspective.
Limitations & Future Work
- The method relies on explicit, automatically verifiable constraints. For subjective, stylistic, or underspecified constraints, learned rewards or preference feedback may be necessary, bringing back reward misspecification and judge bias.
- MDP-GRPO introduces multiple hyperparameters: anchor mixing weight, shaping parameters, temperature schedule, and asymmetric KL. These require recalibration and KL monitoring when migrating to different reward scales or task domains.
- Experiments focused on 2B/3B instruction-tuned models. Larger models, multilingual tasks, and structured domains such as tool use and code generation have not yet been validated.
- The method optimizes for constraint satisfaction encoded in the reward specification; it does not automatically guarantee broader safety, factuality, or value alignment.
- While multi-temperature sampling increases dispersion, it may also increase KL or decrease entropy, requiring controlled decoding during practical training.
Related Work & Insights
- vs Standard GRPO: Standard GRPO uses intra-group z-score advantage, which is simple but fragile under discrete, low-variance rewards; MDP-GRPO stabilizes updates via sampling, anchors, and shaping.
- vs MAPO / NGRPO: Related works attempt to fix group-relative advantage allocation or all-negative groups; MDP-GRPO differs by simultaneously reducing homogeneous groups, restoring goal anchors, and applying prospect-style bounded shaping.
- vs KTO: KTO uses prospect-theoretic utility at the objective function level for binary preferences; MDP-GRPO applies prospect-inspired shaping at the advantage level, keeping the reward definition unchanged.
- Insights: For tasks where rule-based verifiers are strong and rewards are discrete (formatted output, tool API schemas, code linting, compliance document generation), a goal-anchored advantage is worth considering in place of the plain within-group z-score used by standard GRPO.
Rating
- Novelty: ⭐⭐⭐⭐☆ Clear contribution by deconstructing GRPO failure modes and combining dual-anchor/prospect shaping.
- Experimental Thoroughness: ⭐⭐⭐⭐☆ Covers two models, two group sizes, three types of benchmarks, ablation, and training diagnostics; the main drawback is the limited model scale.
- Writing Quality: ⭐⭐⭐⭐☆ Motivations and formulas are clear, tables are complete; overall logic is well-structured.
- Value: ⭐⭐⭐⭐☆ Very practical for RLVR, multi-constraint tasks, and critic-free policy optimization, suitable for future expansion to larger models.