Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL¶
- Conference: ACL 2026
- arXiv: 2604.17073
- Code: None
- Area: LLM Evaluation
- Keywords: Abstention Calibration, Post-Refusal Clarification, Verifiable Reward, GRPO, Unanswerable Queries
TL;DR¶
Abstain-R1 proposes a clarification-aware RLVR reward that jointly optimizes explicit abstention and post-refusal clarification (identifying missing information) on unanswerable queries, enabling a 3B model to match or surpass large models such as DeepSeek-R1 on both abstention and clarification quality.
Background & Motivation¶
Background: RL post-training (e.g., RLVR/GRPO) has substantially improved the reasoning capabilities of LLMs; however, existing training objectives assume all queries are answerable and reward answer generation per se, even when queries are in fact unsolvable.
Limitations of Prior Work: When a query is semantically clear but informationally insufficient (e.g., undefined variables, contradictory premises), models tend to hallucinate or "fill in the world" to produce superficially complete answers, incurring what has been termed the "Hallucination Tax." Existing abstention methods either train models to produce generic refusals ("I don't know") or encourage follow-up questions without verifying whether those questions accurately identify the missing critical information.
Key Challenge: A bare refusal provides no value — users need to know why a query cannot be answered and what information is missing. Yet existing RL training lacks verifiable signals to assess the quality of post-refusal clarification.
Goal: To train models to (1) explicitly abstain on unanswerable queries; (2) provide semantically aligned clarification after abstaining, accurately identifying missing information; and (3) maintain performance on answerable queries.
Key Insight: Incorporating clarification quality into the RLVR reward design, using a lightweight verifier model to assess whether a model's clarification is semantically consistent with a reference clarification.
Core Idea: Mixing unanswerable samples into standard GRPO training and jointly optimizing abstention and clarification via a hierarchical reward function consisting of an abstention format reward and a clarification correctness reward.
Method¶
Overall Architecture¶
A three-stage training pipeline: (1) constructing the Abstain-CoT dataset (with reasoning chains and abstention+clarification annotations) for SFT cold-start; (2) performing SFT on Qwen2.5-3B-Instruct to establish basic abstention and reasoning formats; (3) applying GRPO-based RL training with a 30%/70% mix of unanswerable and answerable queries, optimized with a composite reward function.
Key Designs¶
- Clarification-Aware Composite Reward Function:
- Function: Provides learnable, fine-grained reward signals for unanswerable queries.
- Mechanism: The total reward \(r(o,y)\) is defined for two cases: answerable queries receive a format reward \(r_{\text{fmt}}\) plus a correctness reward \(r_{\text{ans}}\); unanswerable queries receive a format reward plus an abstention reward \(r_{\text{ref}}\). The key lies in the hierarchical design of \(r_{\text{ref}}\): outputting a boxed "I don't know" yields a base score of 0.3, with an additional 0.7 awarded if the clarification passes the verifier \(\mathcal{V}\), totaling 1.0. A penalty of \(-1\) is applied when the model abstains on answerable queries to prevent over-refusal.
- Design Motivation: Rewarding abstention alone causes models to refuse everything. Adding a clarification correctness reward and an abstention penalty on the answerable side creates a bilateral constraint. (A minimal sketch of this composite reward appears after this list.)
- Lightweight Verifier Model \(\mathcal{V}\):
- Function: Judges in real time during RL training whether a clarification is correct.
- Mechanism: The original question is reformulated as a meta-level query ("why is this unanswerable?"), and the verifier compares the semantic consistency between the model's clarification \(\hat{c}\) and the reference clarification \(c^\star\). A conservative 3B verifier (xVerify-3B-Ia) is used during training to reduce reward hacking, while a stronger o4-mini is used at evaluation.
- Design Motivation: String matching is too brittle; LLM-based semantic comparison is more robust. Using verifiers of different strengths for training and evaluation mitigates overfitting.
- Abstain-CoT Dataset and SFT Cold-Start:
- Function: Provides an initial abstention and reasoning format for the RL stage.
- Mechanism: Semantically clear but unanswerable samples are selected from AbstentionBench, and DeepSeek-V3 is used to generate 4.6K structured training samples with `<thinking>` reasoning chains, spanning mathematics, life sciences, fact-checking, and other domains.
- Design Motivation: Without SFT cold-start, RL must learn the abstention format from scratch, which makes convergence under sparse rewards extremely difficult.
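As referenced above, here is a minimal sketch of the composite reward, assuming abstention is signaled by a boxed "I don't know" and that the verifier \(\mathcal{V}\) is exposed as a boolean-returning callable; the helper names, the format-reward value, and the exact-match answer check are illustrative assumptions, not the paper's implementation.

```python
ABSTAIN_MARKER = r"\boxed{I don't know}"  # assumed abstention format


def composite_reward(output: str,
                     is_answerable: bool,
                     reference_answer: str,
                     reference_clarification: str,
                     verifier) -> float:
    """Hierarchical reward: r_fmt + r_ans on answerable queries,
    r_fmt + r_ref (0.3 abstention + 0.7 verified clarification) on unanswerable ones."""
    r_fmt = 0.1 if "\\boxed{" in output else 0.0       # illustrative format reward
    abstained = ABSTAIN_MARKER in output

    if is_answerable:
        if abstained:
            return r_fmt - 1.0                         # penalize abstention on answerable queries
        correct = reference_answer in output           # stand-in for the paper's answer check
        return r_fmt + (1.0 if correct else 0.0)       # r_ans

    if not abstained:
        return r_fmt                                   # no abstention, no r_ref
    r_ref = 0.3                                        # base reward for explicit abstention
    clarification = output.split(ABSTAIN_MARKER, 1)[-1].strip()
    # The verifier judges semantic consistency between the model's clarification
    # and the reference explanation of why the query is unanswerable.
    if verifier(clarification, reference_clarification):
        r_ref += 0.7
    return r_fmt + r_ref
```

During training the verifier slot would be filled by the conservative xVerify-3B-Ia, with o4-mini reserved for evaluation, as described above.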
Loss & Training¶
The standard GRPO objective is used: \(G\) candidate outputs are generated per query, policy gradients are computed from the within-group relative advantage \(A_i\), and KL regularization prevents divergence from the reference policy. Answerable and unanswerable queries are mixed at a 7:3 ratio.
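For concreteness, a minimal sketch of the data mixing and the standard GRPO group-relative advantage \(A_i = (r_i - \operatorname{mean}(r)) / \operatorname{std}(r)\), computed over the \(G\) rollouts of each query; the function names and batch size are illustrative assumptions.

```python
import random
import statistics


def sample_batch(answerable, unanswerable, batch_size=256, unans_ratio=0.3):
    """Mix answerable and unanswerable queries at the paper's 7:3 ratio."""
    n_unans = int(batch_size * unans_ratio)
    return (random.sample(unanswerable, n_unans)
            + random.sample(answerable, batch_size - n_unans))


def group_advantages(rewards):
    """Group-relative advantages over the G rollouts of a single query (standard GRPO)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]
```

These advantages then weight the clipped policy-gradient update, with the KL penalty toward the reference policy, as in standard GRPO.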
Key Experimental Results¶
Main Results¶
| Dataset | Metric | Abstain-R1 (3B) | Qwen2.5-3B (base) | DeepSeek-R1 | Gain (vs. base) |
|---|---|---|---|---|---|
| Abstain-Test | U-Ref (Abstention Rate) | 68.1% | 9.4% | 52.2% | +58.7% |
| Abstain-Test | U-Clar (Clarification Accuracy) | 55.1% | 0.6% | 46.5% | +54.5% |
| Abstain-Test | A-Acc (Answerable Accuracy) | 57.2% | 48.8% | 78.6% | +8.4% |
| SelfAware | U-Ref | 91.4% | 82.3% | 63.8% | +9.1% |
| Abstain-QA | U-Ref | 40.1% | 30.0% | 9.1% | +10.1% |
Ablation Study¶
| Configuration | A-Acc | U-Ref | U-Clar | Note |
|---|---|---|---|---|
| Abstain-R1 | 57.2% | 68.1% | 55.1% | Full model |
| w/o SFT | 53.3% | 65.1% | 8.5% | No cold-start; clarification quality collapses |
| w/o RL | 55.4% | 51.9% | 37.0% | SFT only; insufficient abstention |
| w/o Unans | 67.5% | 4.4% | 3.1% | No unanswerable data; almost never abstains |
| w/o clarification reward | 55.9% | 64.5% | 50.2% | No clarification reward; clarification degrades |
Key Findings¶
- SFT is the primary source of clarification capability (removing it drops U-Clar from 55.1% to 8.5%); RL primarily reinforces the timing of abstention.
- The abstention penalty on answerable queries is critical: without it, A-FU (false abstention rate) rises sharply from 20.4% to 36.2%.
- The 3B model surpasses large models including DeepSeek-R1 on abstention and clarification, demonstrating that calibrated abstention can be achieved through targeted training rather than scale alone.
- During RL training, the model gradually becomes more concise while simultaneously improving abstention rate, clarification accuracy, and answer accuracy.
Highlights & Insights¶
- Treating post-refusal clarification as a first-class training objective is the paper's most central contribution: rather than simply saying "I don't know," the model is trained to say "I don't know, and here is why," which is of great value in high-stakes applications such as medicine and law.
- The hierarchical reward design (0.3 base abstention + 0.7 clarification correctness) achieves a sound balance between conciseness and informativeness, and is transferable to other RL training scenarios requiring structured outputs.
- Using verifiers of different strengths for training and evaluation (conservative 3B for training, strong o4-mini for evaluation) is a practical technique for mitigating reward hacking.
Limitations & Future Work¶
- Answerable accuracy remains substantially below large models (57.2% vs. DeepSeek-R1's 78.6%), with the 3B backbone's reasoning capacity as the bottleneck.
- The false abstention rate of 20.4% is non-trivial, with approximately one in five answerable questions incorrectly refused.
- Clarification quality depends on the quality of reference clarifications, which are generated by DeepSeek-V3 and may introduce bias.
- The approach targets only "semantically clear but informationally insufficient" unanswerable queries and does not cover other unanswerable types such as semantic ambiguity.
Related Work & Insights¶
- vs. AbstentionBench: The latter evaluates abstention capability but does not address training methods; Abstain-R1 provides a complete training-evaluation framework.
- vs. Hallucination Tax (Song et al.): The latter diagnoses how RL training exacerbates hallucination; Abstain-R1 directly provides a solution (mixing unanswerable samples with a composite reward).
- vs. CoCoNot: The latter uses SFT to learn contextual non-compliance but is brittle out-of-distribution; Abstain-R1 achieves stronger generalization via RL.
Rating¶
- Novelty: ⭐⭐⭐⭐ Incorporating clarification quality into RLVR is a novel perspective, though the core technique remains standard GRPO.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three benchmarks, multi-dimensional metrics, detailed ablations, reward sensitivity analysis, and training dynamics analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Research questions are precisely defined, RQ organization is clear, and figures and tables are information-dense.