Training Language Model to Critique for Better Refinement¶

Conference: ACL 2025
arXiv: 2506.22157
Code: https://github.com/publicstaticvo/critique
Area: LLM NLP
Keywords: Critique-Refinement Loop, Critique Utility, DPO Variant, Automatic Preference Learning, Multi-Task Evaluation

TL;DR¶

This paper proposes Refinement-oriented Critique Optimization (RCO), which trains a critic model using "Critique Utility" (CU), the probability that a refinement guided by a critique is preferred over the initial response, as the reward signal. The critic is optimized with a DRO-style MSE objective (a DPO-like variant that regresses on scalar rewards rather than binary preferences), so critique quality never has to be evaluated directly. Across five tasks (dialogue generation, summarization, QA, mathematical reasoning, and code generation), RCO-trained 7B/13B critic models significantly outperform 70B baseline models and the DPCO method on the CU and RQS metrics.

Background & Motivation¶

Background: The critique ability of LLMs is key to automatic evaluation and self-improvement. Recent work has made progress by training critic models with SFT or RLHF on human-annotated critique data.

Limitations of Prior Work: (a) Existing methods train critic models for evaluation rather than for driving refinement, which decouples critique from refinement; (b) Directly evaluating critique quality is difficult and subjective, because there is no objective standard for what constitutes a "good" critique; (c) Human annotation of critique preferences is expensive and inconsistent in quality.

Key Challenge: A good critique should lead to good refinement results, but existing methods fail to establish a causal link between critique quality and refinement efficacy.

Goal: Design a refinement-oriented critic training method in which critique quality is defined directly by the degree of improvement the critique brings.

Key Insight: Build a closed loop: critic generates critiques \(\rightarrow\) actor refines based on critiques \(\rightarrow\) evaluate preference of refinement vs. original response \(\rightarrow\) use preference as critique rewards.

Core Idea: Use the refinement improvement rate (CU = probability of refinement being superior to the original) as the reward signal for the critic model, optimized via a DRO-style MSE objective.

Method¶

Overall Architecture¶

Dataset \(\mathcal{D}\): prompt \(x\) + initial answer \(y_0\) (generated by the actor) \(\rightarrow\) the critic generates \(N\) critiques \(c_1,...,c_N\) \(\rightarrow\) the actor produces \(M\) refinements \(y_{i1},...,y_{iM}\) for each \(c_i\) \(\rightarrow\) preferences between each refinement and \(y_0\) are evaluated to calculate CU \(\rightarrow\) the critic is trained with CU as the reward.
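
The loop above can be sketched as a small orchestration function. This is only an illustrative sketch: `collect_critique_rewards` and the three callables (`generate_critique`, `refine`, `preference_score`) are hypothetical names standing in for the critic, the actor, and the judge, and are not taken from the paper's code.

```python
from statistics import mean
from typing import Callable, List, Tuple


def collect_critique_rewards(
    prompt: str,
    initial_answer: str,
    generate_critique: Callable[[str, str], str],      # hypothetical critic call
    refine: Callable[[str, str, str], str],            # hypothetical actor call
    preference_score: Callable[[str, str, str], float],  # hypothetical judge call
    n_critiques: int = 4,
    m_refinements: int = 5,
) -> List[Tuple[str, float]]:
    """Return (critique, CU estimate) pairs for one (prompt, initial answer)."""
    records = []
    for _ in range(n_critiques):
        # Critic: sample one critique of the initial answer.
        critique = generate_critique(prompt, initial_answer)
        # Actor: sample M refinements conditioned on that critique.
        refinements = [
            refine(prompt, initial_answer, critique) for _ in range(m_refinements)
        ]
        # Judge: score each refinement against the initial answer (1, 0.5, or 0).
        scores = [preference_score(prompt, r, initial_answer) for r in refinements]
        # CU estimate: mean preference score over the M refinements.
        records.append((critique, mean(scores)))
    return records
```

Each returned pair would then serve as one training example for the critic, with the CU estimate as its scalar reward.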

Key Designs¶

Critique Utility (CU) Definition:
- \(CU(c_i | y_0, x) = P(y \succ y_0 | y \sim \pi_{c_i})\), where \(\pi_{c_i}\) is the actor's refinement distribution given \(x\), \(y_0\), and \(c_i\).
- Approximation: \(CU \approx \frac{1}{M}\sum_{j=1}^{M} PS(y_{ij}, y_0)\), where \(PS\) is the preference score of a refinement against the original.
- \(PS = 1\) (refinement better than original), \(0.5\) (tie), \(0\) (original better)
- Preferences are judged by Qwen-2.5-72B-Instruct. Each pair is judged in both orders to reduce position bias, yielding \(2M=10\) judgments per critique.
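
A minimal sketch of the CU estimate with position swapping follows. It assumes the judge returns one of the labels `"first"`, `"second"`, or `"tie"` for an ordered pair, and that the \(2M\) scores are simply averaged; both the label format and the function names are assumptions made for illustration, not details confirmed by the source.

```python
from typing import Callable, Sequence


def preference_score(verdict: str, refinement_is_first: bool) -> float:
    """Map a judge verdict to PS from the refinement's point of view."""
    if verdict not in ("first", "second", "tie"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    if verdict == "tie":
        return 0.5
    # The refinement wins when the preferred position is the one it occupies.
    refinement_won = (verdict == "first") == refinement_is_first
    return 1.0 if refinement_won else 0.0


def critique_utility(
    refinements: Sequence[str],
    initial_answer: str,
    judge: Callable[[str, str], str],  # hypothetical: judge(first, second) -> verdict
) -> float:
    """Average PS over both orderings of every (refinement, original) pair."""
    scores = []
    for y in refinements:
        # Ordering 1: refinement shown first.
        scores.append(preference_score(judge(y, initial_answer), True))
        # Ordering 2: original shown first.
        scores.append(preference_score(judge(initial_answer, y), False))
    return sum(scores) / len(scores)
```

With this averaging, a judge that always picks whichever response appears first contributes one win and one loss per pair, so its position bias cancels to a CU of 0.5.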

Training Objective Derivation (DRO-style MSE Loss):
- Starting point: Maximize \(\mathbb{E}_{c \sim p_\theta}[CU(c)] - \beta D_{KL}[p_\theta \| p]\)
- Optimal solution: \(p^*(c) = \frac{p(c) \exp(\frac{1}{\beta}CU(c))}{Z_\beta}\)
- Key point: The normalization constant \(Z_\beta\) can be approximated through \(N\) sampled critiques.
- Final loss (Eq. 7): \(\mathcal{L}_{RCO} = \frac{1}{2N}\sum_i (\log\frac{p_\theta(c_i)}{p(c_i)} + \log Z_\beta - \frac{1}{\beta}CU(c_i))^2\)
- Advantage: Unlike standard DPO, which uses only binary preferences, RCO regresses on continuous scalar CU values and therefore exploits a finer-grained signal.
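
The loss in Eq. 7 can be sketched in plain Python. Because \(p^*\) must sum to one, \(Z_\beta = \mathbb{E}_{c \sim p}[\exp(CU(c)/\beta)]\); the sketch assumes this expectation is estimated by a simple Monte Carlo average over the \(N\) critiques sampled from the base critic \(p\), which may differ from the exact estimator used in the paper. The function names and the use of sequence-level log-probabilities as inputs are illustrative assumptions.

```python
import math
from typing import Sequence


def estimate_log_z(cu: Sequence[float], beta: float) -> float:
    """log of (1/N) * sum_i exp(CU_i / beta), computed in a numerically stable way."""
    scaled = [u / beta for u in cu]
    peak = max(scaled)
    return peak + math.log(sum(math.exp(s - peak) for s in scaled) / len(scaled))


def rco_loss(
    logp_theta: Sequence[float],  # log p_theta(c_i) under the critic being trained
    logp_ref: Sequence[float],    # log p(c_i) under the frozen base critic
    cu: Sequence[float],          # CU estimate for each critique
    beta: float,
) -> float:
    """(1 / 2N) * sum_i (log(p_theta/p) + log Z_beta - CU_i / beta)^2."""
    if not (len(logp_theta) == len(logp_ref) == len(cu)):
        raise ValueError("all inputs must have the same length")
    log_z = estimate_log_z(cu, beta)
    residuals = [
        (lt - lr) + log_z - u / beta
        for lt, lr, u in zip(logp_theta, logp_ref, cu)
    ]
    return sum(r * r for r in residuals) / (2 * len(residuals))
```

The loss reaches zero exactly when \(\log\frac{p_\theta(c_i)}{p(c_i)} = \frac{1}{\beta}CU(c_i) - \log Z_\beta\) for every sampled critique, which is the log form of the optimal solution \(p^*\) stated above.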

Data Collection Flow:
- Initial Answers: 4 actor models (LLaMA-2-7B/13B/70B-Chat, LLaMA-3-8B-Instruct) \(\times\) 10,000 prompts = 40,000 answers.
- Critique Generation: 5 base critic models (LLaMA-2-7B/13B-Chat, LLaMA-3-8B, Auto-J-13B, UltraCM-13B), generating \(N=4\) critiques for each answer.
- Refinement Generation: For each critique, the actor that generated the initial answer produces \(M=5\) refinements.
- Five Tasks: Dialogue generation, summarization, QA, mathematical reasoning, and code generation across 14 datasets.
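
To make the scale of this pipeline concrete, the counts above can be multiplied out. The sketch below gives the volume per base critic model and assumes that each critic is run over all 40,000 initial answers, which the summary does not state explicitly; `collection_budget` is a hypothetical helper name.

```python
def collection_budget(
    prompts: int = 10_000,
    actors: int = 4,
    critiques_per_answer: int = 4,      # N
    refinements_per_critique: int = 5,  # M
    judgments_per_refinement: int = 2,  # both presentation orders
) -> dict:
    """Count the generations and judge calls needed for ONE base critic model."""
    initial_answers = prompts * actors
    critiques = initial_answers * critiques_per_answer
    refinements = critiques * refinements_per_critique
    judgments = refinements * judgments_per_refinement
    return {
        "initial_answers": initial_answers,
        "critiques": critiques,
        "refinements": refinements,
        "judgments": judgments,
    }


budget = collection_budget()
print(budget)
```

Under these assumptions, one critic requires 160,000 critiques, 800,000 refinements, and 1,600,000 judge calls.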

Key Experimental Results¶

Main Results (CU and RQS Scores)¶

| Model | Method | Dialog CU | Summ. CU | QA CU | Math CU | Code CU | Overall CU |
|---|---|---|---|---|---|---|---|
| — | Initial Answer | — | — | — | — | — | 46.9 |
| LLaMA-2-70B-Chat | Baseline | 82.7 | 68.3 | 88.8 | 62.7 | 59.7 | 72.4 |
| LLaMA-3-70B-Inst | Baseline | 82.6 | 87.8 | 86.3 | 76.2 | 78.1 | 82.2 |
| Self-refinement | — | 75.2 | 77.6 | 79.9 | 64.5 | 65.8 | 72.6 |
| LLaMA-2-7B-Chat | Base | 83.5 | 63.4 | 87.1 | 60.1 | 59.7 | 70.8 |
| LLaMA-2-7B-Chat | +DPCO | 79.2 | 70.2 | 91.2 | 58.7 | 62.1 | 72.3 |
| LLaMA-2-7B-Chat | +RCO | 90.4 | 77.4 | 94.3 | 70.7 | — | — |

- The RCO-trained 7B model exceeds LLaMA-2-70B-Chat (72.4) and self-refinement (72.6) in Overall CU.
- DPCO (Direct Preference Critique Optimization) offers only limited improvement over the base critic (72.3 vs. 70.8 Overall CU), indicating that directly evaluating critique preferences is less effective than using refinement signals.

- On GSM8K, refinement assisted by the RCO critic improves accuracy by 3-5% compared to the base critic.
- On MBPP code generation, the RCO refinement pass rate increases by 4-7% over self-refinement.
- Cross-model generalization: When LLaMA-3-8B is used as the actor, the RCO critic also outperforms the baseline.

RewardBench Evaluation¶

Models trained with RCO also show improvements when acting as discriminative judges, indicating that critique training simultaneously enhances preference judgment capabilities.

Human Evaluation¶

- Critique preference: In a human evaluation of 200 samples, RCO critiques are rated superior to baseline critiques in a significantly higher proportion of cases.
- Refinement preference: Refinements based on RCO critiques are superior to those based on baseline critiques.

Key Findings¶

- CU as a reward signal is more effective than direct critique preferences: good critiques lead to good refinements.
- Continuous scalar CU values provide richer training signals than binary preferences.
- Small models (7B) trained with RCO can exceed the critique capabilities of large models (70B).

Highlights & Insights¶

- Closed-loop design: The complete feedback loop of critique \(\rightarrow\) refine \(\rightarrow\) evaluate \(\rightarrow\) reward requires no human annotation.
- Clever CU definition: Converts the abstract "critique quality" into a quantifiable "refinement improvement rate," which is simple yet effective.
- DRO-style objective function: Leverages continuous scalar CU instead of binary preferences, carrying more information than standard DPO.
- Multi-task generalization: Covers a wide scope with 5 tasks across 14 datasets.

Limitations & Future Work¶

- The calculation of CU depends on Qwen-2.5-72B-Instruct for preference estimation, so biases in the judge model propagate into training.
- Each critique requires \(M=5\) refinements and \(2M=10\) position-swapped preference judgments, which makes data collection expensive.
- Actors and critics use different models. In practice, a unified model that possesses both capabilities may be preferable.
- How the CU signal decays over training epochs remains unexplored, so it is unverified whether multi-round iterative training stays effective.

- Difference from CriticGPT (McAleese et al., 2024): The latter trains critics using RLHF + human annotations, whereas RCO automatically derives rewards using refinement signals.
- Complementarity with Critic-CoT (Zheng et al., 2024): Critic-CoT focuses on step-by-step critique formats, while RCO focuses on the training signal of the critique, so the two can be combined.
- Inspiration: The concept of CU can be generalized beyond critiques: the quality of any intermediate output can be measured by its downstream efficacy.

Rating¶

Novelty: ⭐⭐⭐⭐ (Innovative idea of using CU as a reward signal)
Theory Depth: ⭐⭐⭐⭐ (Complete DRO derivation with a theoretically grounded objective function)
Experimental Thoroughness: ⭐⭐⭐⭐⭐ (5 tasks + CU/RQS/Accuracy/RewardBench + Human Evaluation)
Value: ⭐⭐⭐⭐ (Directly applicable to training open-source critic models)
Overall Recommendation: ⭐⭐⭐⭐ (Solid advancement in the critique-refinement direction)