ACL 2025 (Findings) Reasoning Self-Critic Chain-of-Thought Step-by-Step Reasoning Critic Weakly Supervised Data Construction Iterative Refinement System-2 Thinking

Critic-CoT: Boosting the Reasoning Abilities of Large Language Model via Chain-of-Thoughts Critic¶

Conference: ACL 2025 (Findings)
arXiv: 2408.16326
Code: GitHub
Area: LLM Reasoning
Keywords: Self-Critic, Chain-of-Thought, Step-by-Step Reasoning Critic, Weakly Supervised Data Construction, Iterative Refinement, System-2 Thinking

TL;DR¶

This paper proposes Critic-CoT, a framework that moves LLM self-criticism from System-1 intuitive judgment to System-2 deliberate analysis through a step-by-step Chain-of-Thought critic paradigm and automated, weakly supervised data construction that needs no human annotation. Two-stage training (GPT-4 distillation followed by self-criticism) improves Llama-3-70B-Instruct from 89.6% to 95.4% on GSM8K and from 50.4% to 68.4% on MATH500 when the trained critic filters sampled solutions before majority voting. The paper also finds that criticism capability and task-solving capability reinforce each other.

Background & Motivation¶

Background: Self-criticism (Self-Critic) has become a key mechanism for improving the reasoning capabilities of LLMs. Methods such as Reflexion and Self-Refine enable models to generate feedback on their own outputs and iteratively improve them, reducing the reliance on external human annotation.

Limitations of Prior Work:
- (Coarse Criticism Granularity) Existing self-criticism methods mostly employ simple prompting to ask the model to directly point out errors in the entire response (instance-level). This resembles fast, intuitive System-1 thinking in cognitive science, resulting in low criticism accuracy.
- (Lack of Specialized Training) Models are typically not specifically trained for criticism capabilities, meaning they naturally lack the ability to thoroughly analyze the correctness of each reasoning step and precisely locate the first erroneous step.
- (Unclear Relationship between Criticism and Solving) The relationship between criticism capability and task-solving capability has not been deeply investigated: does training the model to criticize harm its original solving capability, or can the two synergize?

Key Challenge: Enabling a model to "criticize" a reasoning process is a complex task. It requires understanding the correctness of each step and precisely locating the first error. However, existing methods lack structured criticism formats and matching large-scale training data (manually annotating reasoning steps is extremely expensive).

Key Insight: A step-by-step CoT criticism format makes controlled, weakly supervised data construction possible: "as long as the criticism successfully locates the error and the corrected answer is correct, the criticism can be deemed valid." This removes the need for human annotation of each step's correctness.

Core Idea: By using a step-by-step CoT criticism format combined with automated weakly supervised data construction and two-stage training (distillation \(\rightarrow\) self-criticism), the LLM learns System-2 step-by-step self-criticism and correction.

Method¶

Overall Architecture¶

Input: A problem \(Q\) and an \(n\)-step solution attempt \(Att = [s_1, \ldots, s_n]\) generated by the model.

Criticism Output: Step-by-step labels \(L = [l_1, \ldots, l_n]\), where \(l_i = +1\) (correct) or \(l_i = -1\) (incorrect).

Correction Output: Rewriting from the first erroneous step \(i\): \(Att' = [s'_i, \ldots, s'_{n'}]\).
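The input and output formats above can be made concrete with a minimal Python sketch. The names `Critique`, `first_error`, and `apply_correction` are hypothetical and are not taken from the paper's code. The sketch assumes 0-based step indices and assumes that the steps before the first erroneous step are kept unchanged while the rest are replaced by the rewritten steps.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Critique:
    """Step-by-step labels L = [l_1, ..., l_n]; +1 = correct, -1 = incorrect."""
    labels: List[int]

    def first_error(self) -> Optional[int]:
        """Return the 0-based index of the first step labeled -1, or None."""
        for i, label in enumerate(self.labels):
            if label == -1:
                return i
        return None


def apply_correction(attempt: List[str], critique: Critique,
                     rewritten: List[str]) -> List[str]:
    """Keep steps before the first error, then append the rewritten steps.

    Assumption: `rewritten` plays the role of Att' = [s'_i, ..., s'_{n'}].
    """
    i = critique.first_error()
    if i is None:
        # No step was flagged, so the attempt is returned unchanged.
        return list(attempt)
    return attempt[:i] + rewritten


attempt = ["s1", "s2", "s3"]
critique = Critique(labels=[+1, -1, +1])
print(apply_correction(attempt, critique, ["s2'", "s3'", "s4'"]))
```

Note that the corrected attempt may have a different length than the original, which matches the \(n'\) in \(Att'\).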

Key Designs¶

1. Step-by-Step CoT Criticism and Weakly Supervised Data Construction¶

For each solution attempt generated by the model, the critic model reviews the steps one by one and labels each as correct or incorrect. The resulting critiques are filtered with weakly supervised constraints (no human annotation required):

Collection from Erroneous Attempts: Erroneous attempt + criticism identifies the error + corrected answer is correct \(\rightarrow\) collect valid criticism data \(C = (Q, Att, Cri)\) and correction data \(R = (Q, Att, Cri_{-1}, Att')\)
Collection from Correct Attempts: Correct attempt + criticism determines all steps are correct \(\rightarrow\) collect correct criticism samples
Discarding Conditions: The criticism fails (for example, it flags a correct attempt or the corrected answer is still wrong) or misses an error \(\rightarrow\) discard and resample

Key Insight: Using the correctness of the final answer as an indirect verification signal completely bypasses the need for human annotation of reasoning steps. As long as the math problem has a ground-truth answer, whether the criticism and correction are valid can be automatically determined.
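The filtering rule described above can be sketched as a small decision function. This is an illustrative sketch, not the authors' implementation: the function name `classify_sample`, the string return values, and the use of exact string equality to compare answers are all assumptions.

```python
from typing import List, Optional


def classify_sample(attempt_answer: str, labels: List[int],
                    refined_answer: Optional[str], gold: str) -> Optional[str]:
    """Decide whether a (attempt, critique, correction) triple is kept.

    Only the final answers are compared with the ground truth; no
    step-level human labels are used. Exact string match is an assumption.
    """
    attempt_correct = attempt_answer == gold
    error_flagged = -1 in labels

    if attempt_correct and not error_flagged:
        # Correct attempt and the critic accepts every step.
        return "critique"
    if (not attempt_correct) and error_flagged and refined_answer == gold:
        # Wrong attempt, critic found an error, correction reaches the gold answer.
        return "critique+refinement"
    # Everything else is discarded and resampled.
    return None


print(classify_sample("12", [+1, -1], "15", gold="15"))
print(classify_sample("15", [+1, +1], None, gold="15"))
print(classify_sample("12", [+1, +1], None, gold="15"))
```

The third call models a critique that misses the error in a wrong attempt, so nothing is collected for it.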

2. Two-Stage Training (Auto Train)¶

Stage 1 (Distilling Teacher Criticism Capability):
- Use GPT-4-Turbo as the critic teacher \(M_C\) to perform step-by-step criticism and correction on the outputs of the generator model \(M_G\) (e.g., Llama-3-70B)
- Collect a valid dataset \(D_1\) via weakly supervised filtering
- Fine-tune the initial model \(M_0\) using \(D_1\) to obtain \(M_1\)

Stage 2 (Self-Criticism Iteration):
- Have \(M_1\) act as its own critic, analyzing its own newly generated outputs
- Collect more data \(D_2\) through weak supervision
- Retrain from \(M_0\) using \(D_1 \cup D_2\) to obtain the final model \(M_2\)
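The two stages can be summarized as a short orchestration sketch. The callables `collect` and `fine_tune` are hypothetical placeholders for the weakly supervised data collection and the fine-tuning job; their signatures are assumptions, as is reusing the same problem set in both stages and forming \(D_1 \cup D_2\) by plain list concatenation.

```python
from typing import Callable, List, TypeVar

Model = TypeVar("Model")
Sample = TypeVar("Sample")


def auto_train(
    base_model: Model,                 # M_0
    generator: Model,                  # M_G, produces solution attempts in Stage 1
    teacher_critic: Model,             # M_C, e.g. a stronger teacher model
    problems: List[str],
    collect: Callable[[Model, Model, List[str]], List[Sample]],
    fine_tune: Callable[[Model, List[Sample]], Model],
) -> Model:
    # Stage 1: the teacher critiques the generator's attempts; keep valid samples.
    d1 = collect(generator, teacher_critic, problems)
    m1 = fine_tune(base_model, d1)

    # Stage 2: M_1 generates new attempts and critiques them itself.
    d2 = collect(m1, m1, problems)

    # The final model is retrained from M_0 (not continued from M_1) on D_1 and D_2.
    return fine_tune(base_model, d1 + d2)
```

The point the sketch makes explicit is that \(M_1\) is only used to produce \(D_2\); the final model \(M_2\) starts again from \(M_0\).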

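Key Insight: Stage 1 essentially distills the teacher model's Pass1@N capability into the student model's Top1@N capability. Because the weakly supervised filter keeps only teacher critiques that succeed and resamples the rest, the student is trained on the teacher's successful outputs rather than its average single-shot behavior, so the student is theoretically not limited by the teacher's performance ceiling. Stage 2's self-criticism data adds samples from the model's own distribution, further improving performance.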

3. Inference Stage Strategies¶

Iterative Refinement:
- Execute repeatedly: criticize \(\rightarrow\) correct \(\rightarrow\) re-criticize, until no errors are found or the limit is reached
- Maximum depth \(d=8\), restart limit \(n=8\)
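The loop above might look like the following sketch. The callables `solve`, `critique`, and `refine` are hypothetical stand-ins for model calls. The summary does not spell out what a restart does, so the sketch assumes that a restart discards the current chain and samples a fresh attempt once the depth limit is reached.

```python
from typing import Callable, List


def iterative_refine(
    question: str,
    solve: Callable[[str], List[str]],
    critique: Callable[[str, List[str]], List[int]],
    refine: Callable[[str, List[str], List[int]], List[str]],
    max_depth: int = 8,      # d in the text
    max_restarts: int = 8,   # n in the text
) -> List[str]:
    if max_depth < 1 or max_restarts < 1:
        raise ValueError("max_depth and max_restarts must be at least 1")

    attempt: List[str] = []
    for _ in range(max_restarts):
        # Assumption: each restart samples a new attempt from scratch.
        attempt = solve(question)
        for _ in range(max_depth):
            labels = critique(question, attempt)
            if -1 not in labels:
                # The critic accepts every step, so stop early.
                return attempt
            attempt = refine(question, attempt, labels)
    # Budget exhausted: fall back to the last attempt (an assumption).
    return attempt
```

With the default limits, a critic that never accepts triggers 8 x 8 = 64 critique-and-correct rounds, which is the worst-case cost of this strategy.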

Critic as Filter:
- Sample \(m\) solutions for a given problem
- Use the critic model to check each solution, filtering out those where errors are detected
- Perform Majority Voting on the remaining solutions
- Utilizes extra sampling more effectively than pure majority voting
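A minimal sketch of the filter-then-vote procedure follows. The function name `critic_filtered_vote` and the input shape (final answer paired with the critic's step labels) are assumptions made for illustration. Falling back to a plain majority vote when the critic rejects every sample is also an assumption; the summary does not say how that case is handled.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def critic_filtered_vote(
    samples: Sequence[Tuple[str, List[int]]],
) -> Optional[str]:
    """Each sample is (final_answer, critic_step_labels)."""
    # Keep only the solutions in which the critic found no erroneous step.
    kept = [answer for answer, labels in samples if -1 not in labels]

    # Assumption: if everything was filtered out, vote over all samples.
    pool = kept if kept else [answer for answer, _ in samples]
    if not pool:
        return None
    return Counter(pool).most_common(1)[0][0]


samples = [
    ("7", [+1, -1]),
    ("7", [+1, -1]),
    ("7", [-1]),
    ("9", [+1, +1]),
    ("9", [+1, +1]),
    ("5", [+1]),
]
print(critic_filtered_vote(samples))
```

In this toy input, plain majority voting would pick "7" (three votes), while the filtered vote picks "9" because all three "7" solutions were flagged by the critic.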

Key Experimental Results¶

Main Results (Table 1: Solution Accuracy %)¶

Model	Method	GSM8K	MATH500
Llama-3-70B-Instruct	Baseline (Greedy)	89.6	50.4
Llama-3-70B-Instruct	Maj1@96	94.1	62.2
Critic-CoT	Direct Reasoning (Without Critic)	91.7	57.6
Critic-CoT	Iterative Refinement	93.3	57.8
Critic-CoT	Critic + Maj1@96	95.4	66.6
Critic-CoT	Critic + Maj1@512	—	68.4
GPT-4-0314	Baseline	92.0	52.6
DeepSeek-V2-236B	Baseline	92.2	56.3

Critic filtering + Maj1@96 improves performance on MATH500 by +4.4pp (62.2 \(\rightarrow\) 66.6) compared to pure Maj1@96.
After training only the criticism capability, direct reasoning without using the critic module also improved from 50.4 to 57.6 (+7.2pp), indicating that criticism training naturally enhances the solving capability as a byproduct.

Critic Accuracy¶

Method	Critic Accuracy on MATH Erroneous Samples
Direct Prompt Critic (System-1)	Lower
Critic-CoT Step-by-Step CoT Critic	Significantly Higher

Step-by-step CoT criticism significantly outperforms instance-level direct prompting in locating the first erroneous step.

OOD Generalization¶

Task	Baseline \(\rightarrow\) Critic-CoT
StrategyQA (Commonsense Reasoning)	Improved
AGIEval (Comprehensive Ability)	Improved
HumanEval (Code Generation)	Improved

Improvements are also observed on out-of-distribution (OOD) tasks, indicating that criticism capability is a generalizable reasoning meta-skill.

Ablation Study¶

Two-Stage vs. Single-Stage: Stage 2 self-criticism data yields significant additional improvements; performance drops when Stage 2 is excluded.
Mutual Reinforcement of Critic & Solver: Direct reasoning performance improves after training the critic (89.6 \(\rightarrow\) 91.7 on GSM8K and 50.4 \(\rightarrow\) 57.6 on MATH500), suggesting that the two capabilities share underlying knowledge.
Iterative Refinement Depth: Additional refinement rounds yield diminishing returns, and iterative refinement underperforms the Critic-as-Filter strategy on GSM8K (93.3 vs. 95.4).

Highlights & Insights¶

Elegant Weakly Supervised Closed-Loop Design: Utilizing "criticism \(\rightarrow\) correction \(\rightarrow\) is the answer correct?" as automatic verification completely avoids human annotation of reasoning steps, allowing scalable production of training data.
Important Discovery of Criticism-Solving Mutual Reinforcement: Training criticism capability does not harm the original solving capability and instead improves it. This is an important empirical contribution, and it suggests that jointly training both skills is a win-win in this setting.
Cognitive Science Analogy of System-1 \(\rightarrow\) System-2: Relating the methodology to Kahneman's dual-system theory is clear and persuasive, helping readers intuitively grasp the fundamental differences between instance-level and step-level criticism.
Efficiency of the Critic-as-Filter Strategy: Removing low-quality samples with the critic model before majority voting uses the sampling budget more efficiently than simply increasing the number of samples.
Theoretical Explanation for Surpassing Teacher Performance: Stage 1 distills the teacher's Pass1@N \(\rightarrow\) Top1@N mapping, allowing the student to potentially outperform the teacher on Top1.

Limitations & Future Work¶

Core experiments are only conducted on Llama-3-70B: It is not verified whether smaller models (e.g., 7B/13B) can also acquire effective criticism capabilities through this framework. Small models may lack the foundational reasoning required to generate valid criticism.
Weak supervision relies on the assumption "correct answer = valid criticism": This is not applicable to open-ended tasks with multiple correct answers or non-unique target responses (e.g., writing, translation).
Criticism quality ceiling is limited by the generator model's own capabilities: If a model completely lacks knowledge about a certain concept, the critic will not be able to evaluate it correctly. In short, "you cannot critique what you do not understand."
Iterative refinement incurs significant computational overhead: Each round requires generating a full criticism and a correction, and with maximum depth \(d=8\) and restart limit \(n=8\) the worst case reaches \(8 \times 8 = 64\) criticism-and-correction rounds.
Thorough experimentation is limited to mathematical reasoning tasks: Other tasks requiring precise step-by-step verification, such as code reasoning and logical reasoning, warrant deeper exploration.
Stage 1 relies on GPT-4 as a teacher: How to cold-start criticism capabilities without a powerful closed-source model remains an open question.

Related Work & Insights¶

Difference from CriticGPT (McAleese et al., 2024): The latter trains critic models to assist human annotators (as a tool in the RLHF pipeline), whereas this work directly enhances the LLM's own reasoning ability.
Comparison with Math-Shepherd (Wang et al., 2023): The latter uses a Process Reward Model (PRM) to provide scalar scores, whereas Critic-CoT uses natural language criticism + correction, which is more flexible and interpretable.
Comparison with Self-Refine (Madaan et al., 2023): Self-Refine relies on instance-level prompt feedback, whereas Critic-CoT employs step-level trained CoT criticism, achieving higher granularity and accuracy.
Insights: (1) The criticism-correction loop can be combined with RLHF—using Criticism Utility (CU) as a reward signal; (2) step-level criticism formats can be generalized to code debugging (line-by-line inspection); (3) the weakly supervised construction paradigm can be transferred to any task with a deterministic verifier.

Rating¶

Novelty: ⭐⭐⭐⭐ (Step-by-step CoT criticism format + weakly supervised automatic construction, novel System-2 perspective)
Theoretical Depth: ⭐⭐⭐ (Mainly empirically driven and lacking deep theoretical analysis, but provides solid empirical evidence for the mutual reinforcement of criticism and solving)
Experimental Thoroughness: ⭐⭐⭐⭐ (covers in-domain + OOD generalization, comprehensive ablation studies, compares both reasoning strategies)
Utility Value: ⭐⭐⭐⭐⭐ (Critic-as-Filter is plug-and-play, weakly supervised data construction is highly scalable)
Overall Recommendation: ⭐⭐⭐⭐ (Solid progress in the self-criticism direction, with inspiring findings on the mutual reinforcement of criticism and task solving)