# Generalizing Verifiable Instruction Following
Conference: NeurIPS 2025 · arXiv: 2507.02833 · Code: Available · Area: Reinforcement Learning · Keywords: Instruction Following, Verifiable Constraints, RLVR, GRPO, Generalization
## TL;DR
This paper introduces IFBench, a benchmark for evaluating generalization in precise instruction following, demonstrating that current SOTA models severely overfit to the 25 constraint templates of IFEval. It further proposes IF-RLVR, a training method based on GRPO with verifiable rewards, which significantly improves both in-domain and out-of-domain instruction following performance.
## Background & Motivation
Precise Instruction Following (IF) is a critical capability for effective human–LLM interaction. Users frequently embed output constraints in their instructions, such as "answer only with yes or no" or "mention 'abracadabra' at least 3 times." IFEval is the most widely used evaluation benchmark, comprising 25 verifiable constraint templates, yet it has become saturated—many 2B-parameter models already exceed 80% accuracy.
Core Finding: Most models severely overfit to IFEval's 25 constraint types and fail to generalize to unseen output constraints. This is largely because mainstream training approaches (as described in, e.g., the Nemotron-4 technical report) directly synthesize instruction-following data from the IFEval taxonomy. This paper exposes such overfitting by constructing IFBench, on which leading models such as GPT-4.1 and Claude 3.7 Sonnet score below 50%.
## Method

### Overall Architecture
The work comprises three interrelated contributions:
- IFBench (Evaluation): 58 novel, diverse, and challenging verifiable constraints spanning 7 categories: counting, ratio, words, sentence, format, custom, and copy.
- IFTrain (Training Constraints): 29 new manually annotated training constraints with corresponding verification functions.
- IF-RLVR (Training Method): RL training using GRPO with verifiable rewards.
### Key Designs

#### IFBench Benchmark Construction
- Constraint Sources: Collected from LM user feedback and hand-crafted to cover core IF skills.
- Selection Criteria: Each constraint must be accompanied by a Python verification function to ensure reproducible evaluation.
- Test Prompt Construction: Instantiated constraints are appended to held-out prompts from WildChat to prevent train–test leakage.
- Evaluation Setup: 300 prompts, each with 1–2 constraints, evaluated in two settings (sketched in code below):
    - Single-turn: the instruction and the constraint are provided together.
    - Multi-turn: the model first responds to the instruction, then is asked in a second turn to revise its response to satisfy the constraint.
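To make the two settings concrete, here is a minimal sketch in chat-message form (the task and revision wording are illustrative, not taken from the paper):

```python
# Single-turn: instruction and constraint arrive in one user message.
single_turn = [
    {"role": "user",
     "content": "Summarize the article below. Constraint: use only unique words."},
]

# Multi-turn: the model answers first, then a second user turn asks it
# to revise the response so the constraint is satisfied.
multi_turn = [
    {"role": "user", "content": "Summarize the article below."},
    {"role": "assistant", "content": "<model's first response>"},
    {"role": "user", "content": "Revise your answer so that it uses only unique words."},
]
```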
Representative Constraint Examples: maintaining a 2:1 ratio of declarative to interrogative sentences, using only unique words, copying a specific portion of the input, and so on (a toy verifier for the unique-words constraint is sketched below).
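Each such constraint pairs with a Python verification function, per the selection criteria above. A minimal, hypothetical verifier for the unique-words constraint might look like this; IFBench's actual functions differ:

```python
import re

def all_unique_words(response: str) -> bool:
    """Hypothetical verifier: passes iff no word appears more than once
    (case-insensitive, punctuation ignored)."""
    words = re.findall(r"[a-z']+", response.lower())
    return len(words) == len(set(words))

assert all_unique_words("Each word here appears once")
assert not all_unique_words("the cat and the dog")
```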
#### IF-RLVR Training Pipeline
Data Construction:

- Prompts are randomly sampled from Tülu-3-SFT.
- Each prompt is appended with 1 to \(n\) constraints (\(n \in \{1,2,3,4,5,6\}\)).
- Constraints are drawn from IFTrain and IFEval with extended variable ranges.
- A constraint conflict dictionary is maintained to prevent contradictory constraint combinations (see the sampling sketch below).
- Approximately 60k–100k training prompts are generated in total.
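A minimal sketch of conflict-aware constraint sampling (all names, including `CONFLICTS` and `sample_constraints`, are hypothetical; the paper's pipeline may differ):

```python
import random

# Hypothetical conflict dictionary: constraint id -> ids it must not
# be combined with (e.g., unique-words conflicts with keyword repetition).
CONFLICTS: dict[str, set[str]] = {
    "all_unique_words": {"repeat_keyword"},
    "repeat_keyword": {"all_unique_words"},
}

def sample_constraints(pool: list[str], k: int, rng: random.Random) -> list[str]:
    """Draw up to k mutually compatible constraint ids from the pool."""
    chosen: list[str] = []
    for cid in rng.sample(pool, len(pool)):  # random order over the pool
        if all(cid not in CONFLICTS.get(c, set()) for c in chosen):
            chosen.append(cid)
        if len(chosen) == k:
            break
    return chosen

rng = random.Random(0)
pool = ["all_unique_words", "repeat_keyword", "max_word_count", "json_format"]
print(sample_constraints(pool, k=3, rng=rng))
```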
Training:

- GRPO (Group Relative Policy Optimization) is used with outcome supervision.
- Each sampled output is scored by the verification functions according to whether its constraints are satisfied.
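For intuition, here is a minimal sketch of the group-relative advantage computation at the heart of GRPO (a generic illustration, not the paper's training code; `group_relative_advantages` is a hypothetical helper):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: normalize each sample's reward against the
    mean and std of its own group of rollouts for the same prompt.

    rewards: shape (num_prompts, samples_per_prompt), e.g. (B, 16).
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: 16 samples per prompt, as in the paper's setup.
rng = np.random.default_rng(0)
adv = group_relative_advantages(rng.random((4, 16)))
```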
### Loss & Training
Multi-Constraint Reward Function:

$$\text{Instance Reward} = \sum_{i=1}^{n} \text{verifiable\_reward}_i \cdot \text{reward\_multiplier}_i \cdot \text{reward\_weight}_i$$
Both the reward multiplier and the reward weight default to 1 and can be adjusted to up- or down-weight specific constraints; a sketch of the computation follows.
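A minimal sketch of this instance-reward computation, assuming each constraint carries its own Python verifier (the `Constraint` container and all names are hypothetical):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    verify: Callable[[str], bool]  # Python verification function
    multiplier: float = 1.0        # reward_multiplier_i (defaults to 1)
    weight: float = 1.0            # reward_weight_i (defaults to 1)

def instance_reward(response: str, constraints: list[Constraint]) -> float:
    """Sum of verifiable_reward_i * reward_multiplier_i * reward_weight_i."""
    return sum(
        float(c.verify(response)) * c.multiplier * c.weight
        for c in constraints
    )

# Example: two toy constraints on one response.
constraints = [
    Constraint(lambda r: r.lower().count("abracadabra") >= 3),
    Constraint(lambda r: len(r.split()) <= 100),
]
print(instance_reward("abracadabra " * 3, constraints))  # 2.0
```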
Training Hyperparameters: max_token_length=2048, temperature=1, lr=5e-7, 16 samples per prompt, 8 H100 GPUs, local mini-batch size of 32, roughly 2,000 training steps (about one day). For base models, a reasoning chat template is used, with max_token_length=10240 and beta=0 (i.e., no KL penalty).
## Key Experimental Results

### Main Results
Large Gap Between SOTA Models on IFEval vs. IFBench:
| Model | IFEval (%) | IFBench (%) |
|---|---|---|
| o3 | ~95 | ~55 |
| Claude 4 Sonnet | ~90 | <50 |
| Qwen3-32B | ~90 | <50 |
| GPT-4.1 | ~88 | <50 |
IF-RLVR Training Results:
| Model | IFEval Before→After | IFBench Before→After |
|---|---|---|
| Tülu-3-8B-DPO | 82.4 → 92.2 | 28.9 → 44.6 |
| Qwen2.5-7B (base) | N/A → 87.8 | N/A → 53.7 |
| Llama3.1-8B (base) | N/A → 88.2 | N/A → 54.1 |
| OLMo2-instruct | 61.7 → 74.5 | 16.7 → 44.6 |
### Ablation Study
Effect of Multi-Constraint Training (Qwen2.5 policy):
| Constraints per Prompt | IFBench (%) | IFEval (%) |
|---|---|---|
| 1 | 48.9 | 71.2 |
| 2 | 53.1 | 79.9 |
| 3 | 59.5 | 77.8 |
| 5 | 55.8 | 79.9 |
| 6 | 54.1 | 85.8 |
Effect of Training Constraint Diversity: Combining IFTrain (out-of-domain) and IFEval (in-domain) constraints yields the best overall performance. Training with only the 29 out-of-domain constraints already improves IFBench scores; adding all 25 IFEval constraints achieves the highest IFEval scores.
Generalization Across Variable Ranges: Training on a wider variable range (covering and extending beyond the test range) performs at least as well as training on the exact test range, and both outperform training on a disjoint range.
GRPO vs. DPO Comparison (same data and policy):
| Training Method | IFEval (%) | IFBench (%) |
|---|---|---|
| Continued DPO (from the DPO checkpoint) | 79.67 | 29.3 |
| GRPO (from the DPO checkpoint) | 89.65 | 30.6 |
Base vs. Instruct Models under IF-RLVR: Base models trained with a reasoning chat template generalize better on IFBench (54.1 vs. 44.6), suggesting that RLVR combined with reasoning is beneficial for IF generalization.
## Key Findings
- Severe Overfitting: SOTA models exceed 90% on IFEval but fall below 50% on IFBench.
- Constraint Diversity Is Critical: Increasing both the variety of training constraints and the number of constraints per prompt substantially improves generalization.
- GRPO Substantially Outperforms DPO: Under identical data, GRPO consistently outperforms DPO, as RLVR can produce accurate training signals for prompts of arbitrary difficulty.
- Constraint–Task Trade-off: After IF-RLVR training, models tend to prioritize satisfying constraints at the expense of response quality (LLM-as-judge scores drop from 7.0 to 6.4).
- RLVR Is Viable for Base Models: Strong IF capability can be acquired via direct RLVR on base models without SFT or DPO pretraining.
## Highlights & Insights
- Exposing the Generalization Illusion: Saturation on IFEval reflects memorization of 25 constraint types rather than genuine IF capability.
- IFBench Targets Long-Tail Constraints: It covers skills where models are genuinely weak, including counting, ratio, and copy constraints.
- Unique Advantage of RLVR: Unlike DPO, which requires chosen/rejected pairs that are difficult to construct, RLVR only requires a verification function to generate training signal for prompts of any difficulty.
- Instruction Hierarchy Finding: Different models prioritize constraints and tasks differently—Qwen2.5 tends to prioritize constraints, while Tülu-3 tends to prioritize task quality.
- Multi-Turn Training: Mixing single-turn and multi-turn training data yields the best results.
## Limitations & Future Work
- The paper focuses exclusively on verifiable constraints; many real-world user constraints are difficult to verify automatically.
- Some constraints may appear unnatural or contrived.
- IF-RLVR training slightly degrades performance on other downstream tasks (e.g., AlpacaEval).
- Balancing strategies when constraints conflict with task requirements warrant further investigation—incorporating a preference reward model signal is suggested.
- Joint training of IF-RLVR with other RLVR tasks such as mathematics and coding remains unexplored.
## Related Work & Insights
- IFEval (Zhou et al., 2023): An IF evaluation benchmark with 25 verifiable constraints; now largely saturated.
- FollowBench (Jiang et al., 2023): Tests IF ability under incrementally increasing numbers of constraints, but relies on LLM-as-judge evaluation.
- VFF (Wang et al.): Automatically generates verifiable training/test data and trains with SFT and DPO.
- Tülu-3 (Lambert et al., 2024): First demonstrates the potential of RLVR for instruction following.
- DeepSeek-R1 (Guo et al., 2025): A successful application of RLVR to mathematical reasoning.
- Takeaway: Verifiable rewards combined with diverse constraint combinations constitute the key paradigm for improving IF generalization.
## Rating
- Novelty: ★★★★☆ — IFBench fills an important gap; the IF-RLVR training method is systematic and comprehensive.
- Experimental Thoroughness: ★★★★★ — Ablations are exceptionally detailed, covering constraint count, variable ranges, training methods, and multi-turn settings.
- Value: ★★★★★ — IFBench, IFTrain, and IF-RLVR code are all open-sourced and directly reusable.
- Writing Quality: ★★★★☆ — Content is rich but information-dense; some experiments require careful cross-referencing.