ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training

Conference: ACL 2026
arXiv: 2604.07484
Code: GitHub
Area: Alignment RLHF / Reward Modeling
Keywords: Generative Reward Model, Self-Training, Consistency-Aware, Pseudo Labels, Position Bias

TL;DR

ConsistRM proposes a consistency-aware self-training framework for generative reward models (GRMs). It introduces two modules — temporal consistency pseudo-labels (integrating online-state and memory-driven preference consistency) and semantic consistency critique rewards (measuring semantic similarity across multiple generated critiques) — achieving an average improvement of 1.5% across five benchmarks without human annotation, while significantly mitigating position bias.

Background & Motivation

State of the Field: Generative reward models (GRMs) replace traditional scalar reward models by generating textual critiques and preference labels, offering stronger expressiveness and generalization. Representative works include DeepSeek-GRM (critique generation + self-derived rules) and RM-R1 (distilled reasoning traces + reinforcement learning).

Limitations of Prior Work: GRM training faces two major challenges: (1) reliance on expensive human-annotated data, limiting scalability; and (2) self-training methods (e.g., majority-vote pseudo-labels in TTRL) are prone to reward hacking and early overfitting to noisy pseudo-labels, as reward signals are tightly coupled with the policy model.

Root Cause: Self-training requires reliable pseudo-labels, yet model-generated pseudo-labels are inherently unstable — single-round voting is susceptible to sampling randomness, and pseudo-label bias accumulates in later training stages.

Paper Goals: Design a stable and effective GRM self-training framework that requires no human annotation.

Starting Point: Use the model's intrinsic consistency as a self-supervised signal. If the model produces the same preference judgment for a sample across multiple rollouts and across training rounds, that judgment is more likely to be correct.

Core Idea: Construct reliable pseudo-labels via temporal consistency (current round + historical memory) and provide fine-grained rewards via semantic consistency (similarity across multiple critique texts), enabling stable annotation-free GRM self-training.

Method

Overall Architecture

Given a query \(q\) and two candidate responses \((a_1, a_2)\), the GRM generates a structured output \(o = (c, y)\), where \(c\) is a textual critique and \(y \in \{-1, 1\}\) is the preference label. ConsistRM provides self-supervised signals for GRPO reinforcement learning through two core modules: Consistency-Aware Answer Reward (CAAR) and Consistency-Aware Critique Reward (CACR).

Key Designs

Consistency-Aware Answer Reward (CAAR):
- Function: Construct reliable pseudo-labels for self-training.
- Mechanism: Integrates two layers of consistency signals. Online-state consistency \(s_{\text{online}}^{(n)} = \frac{1}{K}\sum_{j=1}^{K} y_j\) aggregates preference predictions from \(K\) rollouts in the current round; memory-driven consistency \(s_{\text{memory}}^{(n)} = \frac{1}{n-1}\sum_{i=0}^{n-1} \hat{y}^{(i)}\) aggregates pseudo-labels from all previous rounds. The final pseudo-label is \(\hat{y}^{(n)} = \text{sgn}(s_{\text{online}}^{(n)} + s_{\text{memory}}^{(n)})\); when the two signals disagree, the output is 0 (no supervision provided), preventing low-confidence samples from dominating optimization.
- Design Motivation: Online voting alone is unreliable in early training; historical memory provides a stable anchor. The ternary label scheme (+1/−1/0) explicitly handles uncertain samples, yielding more robustness than binary forced classification.
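
The CAAR pseudo-label rule can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `caar_pseudo_label` is hypothetical, and it assumes the memory score is the plain mean of the stored pseudo-labels and is 0 when no history exists yet. It follows the sgn formula literally, so a 0 label arises only when the two scores cancel exactly; a rule that abstains on any sign disagreement would be a different design.

```python
from typing import List


def sgn(x: float) -> int:
    """Sign function returning -1, 0, or +1."""
    return (x > 0) - (x < 0)


def caar_pseudo_label(rollout_labels: List[int], history: List[int]) -> int:
    """Combine online-state and memory-driven consistency into one pseudo-label.

    rollout_labels: K preference labels in {-1, +1} from the current round.
    history: pseudo-labels in {-1, 0, +1} from earlier rounds (may be empty).
    """
    # Online-state consistency: mean preference over the K rollouts.
    s_online = sum(rollout_labels) / len(rollout_labels)
    # Memory-driven consistency: mean of past pseudo-labels.
    # Assumption: treated as 0 when there is no history (first round).
    s_memory = sum(history) / len(history) if history else 0.0
    return sgn(s_online + s_memory)


# First round: only the online vote matters.
print(caar_pseudo_label([1, 1, 1, -1], []))
# Unanimous online vote exactly cancelled by history: no supervision.
print(caar_pseudo_label([1] * 8, [-1, -1]))
# A weak online majority is overridden by a consistent history.
print(caar_pseudo_label([1, 1, 1, -1, -1, -1, -1, -1], [1, 1, 1]))
```

In the third call the current round leans toward −1 (3 votes against 5), but the history is uniformly +1, so the memory term acts as the anchor described above.
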
Consistency-Aware Critique Reward (CACR):
- Function: Provide fine-grained quality rewards for critique texts.
- Mechanism: Each critique \(c_j\) is encoded into a vector using Qwen3-Embedding-4B; a cosine similarity matrix is computed over the critiques of a sample, and critiques are ranked by semantic consistency. Critiques ranked in the top \(p\) with correct preference labels receive an additional reward \(r_j^{(c)} = 0.1\). The intuition is that semantically consistent critiques across multiple generations indicate the model has converged to a stable evaluation region for that sample.
- Design Motivation: CAAR supervises outcomes (preference labels); CACR provides complementary process supervision (critique content). High semantic consistency in critiques is more likely to reflect reliable evaluation.
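
A rough sketch of the CACR bonus is given below, with several assumptions that the summary does not confirm: the consistency score of a critique is taken to be its mean cosine similarity to the other critiques, "top \(p\)" is read as the top fraction `top_p` of the ranking, and "correct" means agreeing with the pseudo-label. The embeddings are passed in as plain vectors, so no embedding model is called, and the name `cacr_rewards` is hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cacr_rewards(
    embeddings: List[Sequence[float]],
    labels: List[int],
    pseudo_label: int,
    top_p: float = 0.5,   # assumed fraction; the actual value is a hyperparameter
    bonus: float = 0.1,   # critique reward value stated in the text
) -> List[float]:
    """Give `bonus` to critiques that are both highly consistent and label-correct."""
    k = len(embeddings)
    # Consistency score: mean similarity to every other critique (assumption).
    scores = [
        sum(cosine(embeddings[i], embeddings[j]) for j in range(k) if j != i) / (k - 1)
        for i in range(k)
    ]
    n_top = max(1, math.ceil(top_p * k))
    ranked = sorted(range(k), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n_top])
    return [
        bonus if (i in top and pseudo_label != 0 and labels[i] == pseudo_label) else 0.0
        for i in range(k)
    ]


# Toy 2-D "embeddings": critiques 0, 1, 3 are close together, critique 2 is an outlier.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
toy_labels = [1, 1, 1, -1]
print(cacr_rewards(toy_embeddings, toy_labels, pseudo_label=1))
```

In the toy call, critiques 1 and 3 rank highest, but only critique 1 also matches the pseudo-label, so it alone receives the bonus.
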
Format Constraints and Combined Reward:
- Function: Ensure valid output format and integrate multi-level rewards.
- Mechanism: The final reward for rollout \(j\) is \(r_j^{(n)} = r_j^{(a,n)} + r_j^{(c,n)}\) when the format is valid and \(\hat{y}^{(n)} \neq 0\); \(r_j^{(n)} = -5\) for invalid format; and \(r_j^{(n)} = 0\) when \(\hat{y}^{(n)} = 0\). GRPO is used for reinforcement training with global batch size 64, learning rate 1e-6, 8 rollouts, and KL coefficient 0.001.
- Design Motivation: Format constraints ensure parsable outputs; the combined reward provides consistent optimization signals at multiple granularities.
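
The piecewise reward above can be written as a small function. This is a sketch under two assumptions: the format check takes precedence over the \(\hat{y} = 0\) case (the summary does not say which wins when both apply), and the answer reward is passed in as a number because its value is not given here. The name `combined_reward` and the example answer reward of 1.0 are hypothetical.

```python
def combined_reward(
    format_ok: bool,
    pseudo_label: int,
    answer_reward: float,
    critique_reward: float,
) -> float:
    """Combine format, answer, and critique rewards for one rollout."""
    if not format_ok:
        # Unparsable output is penalized (assumed to be checked first).
        return -5.0
    if pseudo_label == 0:
        # No pseudo-label for this sample: provide no supervision.
        return 0.0
    return answer_reward + critique_reward


print(combined_reward(False, 1, 1.0, 0.1))  # invalid format
print(combined_reward(True, 0, 1.0, 0.1))   # uncertain sample
print(combined_reward(True, 1, 1.0, 0.1))   # answer reward plus critique bonus
```
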

Loss & Training

GRPO is used for training over 4 epochs, with maximum generation length 1024 (training) / 2048 (inference) and temperature 1.0 (training) / 0 (inference). Training follows a two-stage pipeline: SFT on HelpSteer3, followed by RFT (reinforcement fine-tuning). ConsistRM replaces the reward signal during the RFT stage.
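
To make the role of the rollout group concrete, the sketch below shows the group-relative advantage normalization commonly used in GRPO: each rollout's reward is centered and scaled by the statistics of its own group. It is an illustration, not the paper's training code; the use of the population standard deviation, the `eps` constant, and the function name `group_advantages` are assumptions, and implementations differ on these details.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# Eight rollouts for one sample: some correct, some wrong, one invalid format.
# The reward values are illustrative only.
rewards = [1.1, 1.0, 0.0, 0.0, 1.1, 0.0, -5.0, 1.0]
print([round(a, 2) for a in group_advantages(rewards)])

# If every rollout in the group receives reward 0, all advantages are 0.
print(group_advantages([0.0] * 8))
```

Under this normalization, a sample whose rollouts all receive reward 0 (for example, valid outputs with a pseudo-label of 0) yields zero advantages, so it contributes no policy-gradient signal.
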

Key Experimental Results

Main Results

Five-Benchmark Performance on Qwen3-8B

Method	RewardBench	PPE Pref	RM-Bench	RMB	JudgeBench	Avg.	Δ
Qwen3-8B (Base)	81.6	63.8	75.8	78.8	54.3	70.9	-
+ SFT	82.7	65.0	77.1	76.9	51.7	70.7	-0.2
+ RFT	85.4	65.4	78.2	78.2	55.4	72.5	+1.6
+ TTRL	85.3	65.0	77.4	74.2	56.8	71.7	+0.8
+ ConsistRM	85.6	67.7	78.3	79.1	56.9	73.5	+2.6

Ablation Study

Configuration	RewardBench	PPE Pref	RM-Bench	RMB	JudgeBench	Avg.	Δ
ConsistRM	85.6	67.7	78.3	79.1	56.9	73.5	-
w/o CACR	84.9	64.8	77.3	78.1	56.0	72.2	-1.3
w/o Online-State	85.5	64.1	78.6	76.7	56.7	72.3	-1.2
w/o Memory-Driven	84.3	63.1	75.4	74.2	54.8	70.4	-3.2
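
The Avg. and Δ columns can be recomputed from the per-benchmark scores, as in the short check below. It assumes Δ is the difference between unrounded five-benchmark means, rounded to one decimal; under that assumption the memory-driven row gives −3.2, whereas subtracting the already-rounded averages (70.4 − 73.5) would give −3.1.

```python
# Per-benchmark scores copied from the ablation table above.
rows = {
    "ConsistRM": [85.6, 67.7, 78.3, 79.1, 56.9],
    "w/o CACR": [84.9, 64.8, 77.3, 78.1, 56.0],
    "w/o Online-State": [85.5, 64.1, 78.6, 76.7, 56.7],
    "w/o Memory-Driven": [84.3, 63.1, 75.4, 74.2, 54.8],
}


def mean(xs):
    return sum(xs) / len(xs)


full = mean(rows["ConsistRM"])
averages = {name: round(mean(scores), 1) for name, scores in rows.items()}
deltas = {
    name: round(mean(scores) - full, 1)
    for name, scores in rows.items()
    if name != "ConsistRM"
}

print(averages)
print(deltas)
```
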

Key Findings

Memory-driven consistency preference is the most critical component (removing it causes a 3.2-point drop), demonstrating that historical information is essential for pseudo-label quality.
ConsistRM achieves significantly greater position bias mitigation (+5.3 vs. +1.4 for RFT), as consistency rewards encourage the model to focus on content rather than position.
ConsistRM enables a 4B model to match the performance of an 8B model through multi-round voting.
Replacing CACR with token-level confidence (DeepConfidence) leads to reward hacking and performance degradation.

Highlights & Insights

The core assumption — "consistency implies reliability" — is concise and compelling, leveraging the model's intrinsic consistency signals in place of external annotation.
The temporal memory mechanism (aggregation of historical pseudo-labels) is a key innovation, providing a stable anchor across training rounds.
The ternary label design (+1/−1/0) elegantly handles uncertain samples, preventing noisy label contamination.
ConsistRM also improves generation efficiency — producing more concise critiques (1,717 vs. 1,924 tokens).

Limitations & Future Work

Semantic consistency evaluation operates only at the holistic critique level, without fine-grained alignment of individual semantic segments within critiques.
Validation is limited to Qwen3 and LLaMA-3.1; generalizability across broader model families remains to be verified.
Hyperparameters for consistency rewards (top-\(p\) ratio, CACR reward value of 0.1) may require task-specific tuning.

Related Work & Insights

vs. TTRL: TTRL uses majority voting as pseudo-labels but lacks temporal consistency across rounds, leading to bias accumulation in later stages; ConsistRM's historical memory significantly mitigates this issue.
vs. DeepSeek-GRM: The latter requires human-annotated reward signals, whereas ConsistRM is entirely annotation-free.
Insight: Consistency signals may serve as a general, low-cost quality indicator for self-training scenarios.

Rating

Novelty: ⭐⭐⭐⭐ The consistency-aware self-training paradigm is elegantly designed, particularly the temporal consistency memory mechanism.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five benchmarks, four models, complete ablations, and analysis of position bias and multi-round voting.
Writing Quality: ⭐⭐⭐⭐ Method descriptions are clear and experimental analysis is thorough.
Value: ⭐⭐⭐⭐ Provides a practical and effective solution for annotation-free GRM training.