# ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training
**Conference:** ACL 2026 · **arXiv:** 2604.07484 · **Code:** GitHub · **Area:** Alignment / RLHF / Reward Modeling · **Keywords:** Generative Reward Model, Self-Training, Consistency-Aware, Pseudo Labels, Position Bias
## TL;DR
ConsistRM proposes a consistency-aware self-training framework for generative reward models (GRMs). It introduces two modules — temporal consistency pseudo-labels (integrating online-state and memory-driven preference consistency) and semantic consistency critique rewards (measuring semantic similarity across multiple generated critiques) — achieving an average improvement of 1.5% across five benchmarks without human annotation, while significantly mitigating position bias.
## Background & Motivation
State of the Field: Generative reward models (GRMs) replace traditional scalar reward models by generating textual critiques and preference labels, offering stronger expressiveness and generalization. Representative works include DeepSeek-GRM (critique generation + self-derived rules) and RM-R1 (distilled reasoning traces + reinforcement learning).
Limitations of Prior Work: GRM training faces two major challenges: (1) reliance on expensive human-annotated data, limiting scalability; and (2) self-training methods (e.g., majority-vote pseudo-labels in TTRL) are prone to reward hacking and early overfitting to noisy pseudo-labels, as reward signals are tightly coupled with the policy model.
Root Cause: Self-training requires reliable pseudo-labels, yet model-generated pseudo-labels are inherently unstable — single-round voting is susceptible to sampling randomness, and pseudo-label bias accumulates in later training stages.
Paper Goals: Design a stable and effective GRM self-training framework that requires no human annotation.
Starting Point: Leverage the model's intrinsic consistency signals as a self-supervised source — if a model produces consistent preference judgments for the same sample across multiple sampled rollouts and successive training rounds, those judgments are more likely to be correct.
Core Idea: Construct reliable pseudo-labels via temporal consistency (current round + historical memory) and provide fine-grained rewards via semantic consistency (similarity across multiple critique texts), enabling stable annotation-free GRM self-training.
## Method

### Overall Architecture
Given a query \(q\) and two candidate responses \((a_1, a_2)\), the GRM generates a structured output \(o = (c, y)\), where \(c\) is a textual critique and \(y \in \{-1, 1\}\) is the preference label. ConsistRM provides self-supervised signals for GRPO reinforcement learning through two core modules: Consistency-Aware Answer Reward (CAAR) and Consistency-Aware Critique Reward (CACR).
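To make the structured output \(o = (c, y)\) concrete, here is a minimal parsing sketch. The summary does not specify the GRM's actual serialization, so a simple tagged format is assumed purely for illustration; a parse failure corresponds to the "invalid format" case penalized later.

```python
import re

def parse_grm_output(text):
    """Hypothetical sketch: parse a GRM output o = (c, y).

    Assumed (not from the paper) layout:
        <critique>...</critique><label>1|2</label>
    where label 1 means a1 preferred (y = +1) and 2 means a2 (y = -1).
    Returns (critique_text, y) or None on invalid format.
    """
    c = re.search(r"<critique>(.*?)</critique>", text, re.S)
    y = re.search(r"<label>([12])</label>", text)
    if not (c and y):
        return None  # invalid format -> the format penalty applies
    return c.group(1).strip(), 1 if y.group(1) == "1" else -1
```

A `None` return here is what later maps to the \(r = -5\) format penalty.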
### Key Designs
- **Consistency-Aware Answer Reward (CAAR):**
    - Function: Construct reliable pseudo-labels for self-training.
    - Mechanism: Integrates two layers of consistency signals. Online-state consistency \(s_{\text{online}}^{(n)} = \frac{1}{K}\sum_{j=1}^{K} y_j\) aggregates preference predictions from \(K\) rollouts in the current round; memory-driven consistency \(s_{\text{memory}}^{(n)} = \frac{1}{n-1}\sum_{i=1}^{n-1} \hat{y}^{(i)}\) aggregates pseudo-labels from all previous rounds. The final pseudo-label is \(\hat{y}^{(n)} = \operatorname{sgn}(s_{\text{online}}^{(n)} + s_{\text{memory}}^{(n)})\); when the two signals cancel out, the output is 0 and no supervision is provided, preventing low-confidence samples from dominating optimization.
    - Design Motivation: Online voting alone is unreliable in early training; historical memory provides a stable anchor. The ternary label scheme (+1/−1/0) explicitly handles uncertain samples and is more robust than forced binary classification.
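The CAAR pseudo-label rule above can be sketched in a few lines. This follows the formulas as written (function and variable names are mine, not the paper's); in round 0 the memory term is simply absent.

```python
def caar_pseudo_label(rollout_labels, memory_labels):
    """Sketch of the CAAR pseudo-label (names are illustrative).

    rollout_labels: preference labels y_j in {-1, +1} from K current rollouts.
    memory_labels:  pseudo-labels from previous rounds (empty in round 0).
    Returns +1, -1, or 0 (0 = no supervision for this sample).
    """
    s_online = sum(rollout_labels) / len(rollout_labels)
    s_memory = sum(memory_labels) / len(memory_labels) if memory_labels else 0.0
    s = s_online + s_memory
    return (s > 0) - (s < 0)  # sgn(s): +1, -1, or 0
```

For example, two rollouts split 1/−1 with no memory yield 0 (no supervision), while a unanimous round agreeing with memory yields a confident ±1.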
- **Consistency-Aware Critique Reward (CACR):**
    - Function: Provide fine-grained quality rewards for critique texts.
    - Mechanism: Each critique \(c_j\) is encoded into a vector using Qwen3-4B-Embedding; a cosine similarity matrix is computed, and critiques are ranked by semantic consistency. Critiques ranked in the top \(p\) fraction whose preference labels are correct receive an additional reward \(r_j^{(c)} = 0.1\). The intuition is that critiques remaining semantically consistent across multiple generations indicate the model has converged to a stable evaluation region for that sample.
    - Design Motivation: CAAR supervises outcomes (preference labels); CACR adds complementary process supervision (critique content). High semantic consistency across critiques is more likely to reflect reliable evaluation.
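A minimal sketch of the CACR ranking step, assuming critiques are already embedded (the paper uses Qwen3-4B-Embedding; any sentence encoder stands in here) and that "semantic consistency" means mean cosine similarity to the other critiques — the exact ranking statistic is my assumption:

```python
import numpy as np

def cacr_rewards(embeddings, labels_correct, top_p=0.5, bonus=0.1):
    """Sketch of the Consistency-Aware Critique Reward.

    embeddings:     (K, d) array, one row per critique embedding.
    labels_correct: length-K booleans; True if that rollout's preference
                    label matches the CAAR pseudo-label.
    Critiques in the top-p fraction by mean cosine similarity to the
    other critiques, with a correct label, receive the 0.1 bonus.
    """
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T                                 # cosine similarity matrix
    np.fill_diagonal(sim, 0.0)                    # ignore self-similarity
    consistency = sim.sum(axis=1) / (len(X) - 1)  # mean similarity to others
    k = max(1, int(round(top_p * len(X))))
    top_idx = set(np.argsort(-consistency)[:k])
    return [bonus if (i in top_idx and labels_correct[i]) else 0.0
            for i in range(len(X))]
```

An outlier critique (semantically far from the rest) ranks low and gets no bonus even if its label happens to be correct.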
- **Format Constraints and Combined Reward:**
    - Function: Ensure valid output format and integrate multi-level rewards.
    - Mechanism: The final reward is \(r^{(n)} = r_j^{(a,n)} + r_j^{(c,n)}\) when the format is valid and \(\hat{y} \neq 0\); \(r = -5\) for invalid format; \(r = 0\) when \(\hat{y} = 0\). GRPO is used for reinforcement training with global batch size 64, learning rate 1e-6, 8 rollouts, and KL coefficient 0.001.
    - Design Motivation: Format constraints guarantee parsable outputs; the combined reward provides consistent optimization signals at multiple granularities.
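The three-way case split for the final reward is straightforward to state as code (a direct transcription of the rule above; function and argument names are mine):

```python
def combined_reward(format_valid, pseudo_label, r_answer, r_critique,
                    invalid_penalty=-5.0):
    """Final per-rollout reward, following the paper's case split.

    - invalid output format -> fixed penalty (-5)
    - pseudo-label is 0     -> no supervision for this sample (reward 0)
    - otherwise             -> answer reward + critique bonus
    """
    if not format_valid:
        return invalid_penalty
    if pseudo_label == 0:
        return 0.0
    return r_answer + r_critique
```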
### Loss & Training
GRPO is used for training over 4 epochs, with maximum generation length 1024 (training) / 2048 (inference) and temperature 1.0 (training) / 0 (inference). Training follows a two-stage pipeline: SFT on HelpSteer3, followed by RFT (reinforcement fine-tuning). ConsistRM replaces the reward signal during the RFT stage.
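For reference, the reported hyperparameters collected in one place — a plain dict, since no specific training framework or config schema is named in the summary (the field names are illustrative):

```python
# Hyperparameters as reported in the summary; keys are illustrative,
# not tied to any particular RLHF framework's config format.
grpo_config = {
    "algorithm": "GRPO",
    "epochs": 4,
    "global_batch_size": 64,
    "learning_rate": 1e-6,
    "num_rollouts": 8,
    "kl_coef": 0.001,
    "max_gen_len": {"train": 1024, "inference": 2048},
    "temperature": {"train": 1.0, "inference": 0.0},
}
```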
## Key Experimental Results

### Main Results
**Five-Benchmark Performance on Qwen3-8B**
| Method | RewardBench | PPE Pref | RM-Bench | RMB | JudgeBench | Avg. | Δ |
|---|---|---|---|---|---|---|---|
| Qwen3-8B (Base) | 81.6 | 63.8 | 75.8 | 78.8 | 54.3 | 70.9 | - |
| + SFT | 82.7 | 65.0 | 77.1 | 76.9 | 51.7 | 70.7 | -0.2 |
| + RFT | 85.4 | 65.4 | 78.2 | 78.2 | 55.4 | 72.5 | +1.6 |
| + TTRL | 85.3 | 65.0 | 77.4 | 74.2 | 56.8 | 71.7 | +0.8 |
| + ConsistRM | 85.6 | 67.7 | 78.3 | 79.1 | 56.9 | 73.5 | +2.6 |
### Ablation Study
| Configuration | RewardBench | PPE Pref | RM-Bench | RMB | JudgeBench | Avg. | Δ |
|---|---|---|---|---|---|---|---|
| ConsistRM | 85.6 | 67.7 | 78.3 | 79.1 | 56.9 | 73.5 | - |
| w/o CACR | 84.9 | 64.8 | 77.3 | 78.1 | 56.0 | 72.2 | -1.3 |
| w/o Online-State | 85.5 | 64.1 | 78.6 | 76.7 | 56.7 | 72.3 | -1.2 |
| w/o Memory-Driven | 84.3 | 63.1 | 75.4 | 74.2 | 54.8 | 70.4 | -3.2 |
### Key Findings
- Memory-driven consistency preference is the most critical component (removing it causes a 3.2-point drop), demonstrating that historical information is essential for pseudo-label quality.
- ConsistRM achieves significantly greater position bias mitigation (+5.3 vs. +1.4 for RFT), as consistency rewards encourage the model to focus on content rather than position.
- ConsistRM enables a 4B model to match the performance of an 8B model through multi-round voting.
- Replacing CACR with token-level confidence (DeepConfidence) leads to reward hacking and performance degradation.
## Highlights & Insights
- The core assumption — "consistency implies reliability" — is concise and compelling, leveraging the model's intrinsic consistency signals in place of external annotation.
- The temporal memory mechanism (aggregation of historical pseudo-labels) is a key innovation, providing a stable anchor across training rounds.
- The ternary label design (+1/−1/0) elegantly handles uncertain samples, preventing noisy label contamination.
- ConsistRM also improves generation efficiency — producing more concise critiques (1,717 vs. 1,924 tokens).
## Limitations & Future Work
- Semantic consistency evaluation operates only at the holistic critique level, without fine-grained alignment of individual semantic segments within critiques.
- Validation is limited to Qwen3 and LLaMA-3.1; generalizability across broader model families remains to be verified.
- Hyperparameters for consistency rewards (top-\(p\) ratio, CACR reward value of 0.1) may require task-specific tuning.
## Related Work & Insights
- vs. TTRL: TTRL uses majority voting as pseudo-labels but lacks temporal consistency across rounds, leading to bias accumulation in later stages; ConsistRM's historical memory significantly mitigates this issue.
- vs. DeepSeek-GRM: The latter requires human-annotated reward signals, whereas ConsistRM is entirely annotation-free.
- Insight: Consistency signals may serve as a general, low-cost quality indicator for self-training scenarios.
## Rating
- Novelty: ⭐⭐⭐⭐ The consistency-aware self-training paradigm is elegantly designed, particularly the temporal consistency memory mechanism.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five benchmarks, four models, complete ablations, and analysis of position bias and multi-round voting.
- Writing Quality: ⭐⭐⭐⭐ Method descriptions are clear and experimental analysis is thorough.
- Value: ⭐⭐⭐⭐ Provides a practical and effective solution for annotation-free GRM training.