Reinforcement Learning Teachers of Test Time Scaling¶
**Conference**: NeurIPS 2025 | **arXiv**: 2506.08388 | **Code**: GitHub | **Area**: Reinforcement Learning | **Keywords**: Reasoning Language Models, Knowledge Distillation, Reinforcement Learning, Test-Time Inference, Teacher-Student Framework
TL;DR¶
This paper proposes the Reinforcement Learning Teacher (RLT) framework, which provides both the problem and the answer to a teacher model and trains it to generate effective explanatory reasoning chains rather than solving problems from scratch. This enables a 7B-parameter teacher to produce distillation data superior to that generated by models orders of magnitude larger.
Background & Motivation¶
The current training paradigm for reasoning language models (reasoning LMs) faces two fundamental challenges:
1. The Exploration Bottleneck of RL: RL training relies on a sparse, binary correctness reward, which provides a learning signal only when the model can already solve a problem with some probability. This means RL essentially reinforces existing capabilities rather than enabling genuine acquisition of new skills. Small models, with their limited initial capacity, therefore struggle to benefit from RL.
2. Misalignment Between Training Objectives and Actual Use: RL-trained reasoning models are often not deployed directly; instead, they serve as teachers to generate reasoning traces for student model distillation or cold-start initialization of subsequent RL iterations. However, reasoning chains trained to "solve problems correctly" are not necessarily well-suited for student learning. Existing pipelines rely heavily on heuristic post-processing (e.g., GPT-based formatting cleanup, incorrect-answer filtering) to improve distillation data quality.
Core Insight: In practice, an effective teacher's strength lies not in independently discovering complex theorems, but in leveraging known answers to construct clear, pedagogically effective explanations for students. Accordingly, this paper redefines the teacher model's task — not to solve problems from scratch, but to "connect the dots" given a known answer, generating explanations maximally useful to students.
Method¶
Overall Architecture¶
The RLT framework inverts the conventional RL reasoning training paradigm:

- **Conventional approach**: The model receives only the problem, reasons through it, and produces a solution (sparse reward: correct/incorrect).
- **RLT approach**: The model receives both the problem and the answer, and generates a step-by-step explanation (dense reward: based on student comprehension).
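To make the inversion concrete, here is a minimal Python sketch of the data flow, assuming a `<think>`-tag convention and hypothetical prompt wording (the paper's exact templates differ): the teacher is prompted with both the question and its answer, and its think span is later repackaged as a student-format distillation example.

```python
import re

def build_teacher_prompt(question: str, answer: str) -> str:
    # Hypothetical template; the paper's actual system prompt differs. The key
    # point is that the ground-truth answer is part of the teacher's input.
    return (
        "You are given a problem and its correct solution. Explain, step by "
        "step, how to arrive at the solution.\n\n"
        f"Problem: {question}\nSolution: {answer}"
    )

def to_distillation_example(question: str, answer: str, teacher_output: str) -> dict:
    # Extract the teacher's think span (assumed <think>...</think> tags) and
    # repackage it in the student's format: question -> reasoning -> answer.
    # Per the paper, no filtering or post-processing is applied to the trace.
    match = re.search(r"<think>(.*?)</think>", teacher_output, re.DOTALL)
    think = match.group(1).strip() if match else teacher_output.strip()
    return {
        "prompt": question,
        "completion": f"<think>\n{think}\n</think>\n{answer}",
    }
```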
Key Designs¶
- **Task Redefinition**: The RLT model's system prompt includes both the problem and the ground-truth answer; the task is to generate a pedagogical explanation connecting the two. At inference time, the teacher's think tokens are extracted directly and, after label substitution, used as distillation data for the student, with no filtering or post-processing required.
- **Dense Reward Function**: The quality of the teacher's explanation is evaluated via student feedback, comprising two complementary terms (a combined implementation sketch follows this list):
- **Student Comprehension Reward** \(r^{SS}\): Measures the degree to which the student understands the ground-truth answer after observing the teacher's explanation, quantified via the log-probabilities of the answer tokens under the student model:
\(r^{SS}(o_i, s_i, q_i) = \text{avg}\{\log \pi_s^{s_i}\} + \alpha \min\{\log \pi_s^{s_i}\}\)
where \(\pi_s^{s_i} = \pi_s(s_i \mid t_{o_i}, q_i)\) is the student's probability over answer \(s_i\) given explanation \(t_{o_i}\) and problem \(q_i\). The avg+min combination ensures no individual answer token is neglected.
- **Logical Interpretability Reward** \(r^{KL}\): Ensures that each step of the teacher's explanation constitutes a logically coherent progression from the student's perspective, measured by the KL divergence between teacher and student distributions over the think tokens:
\(r^{KL}(o_i, s_i, q_i) = \text{avg}\{\mathbb{D}_{KL}(\pi_\theta^{t_{o_i}} \,\|\, \pi_s^{t_{o_i}})\} + \alpha \max\{\mathbb{D}_{KL}(\pi_\theta^{t_{o_i}} \,\|\, \pi_s^{t_{o_i}})\}\)
Crucially, the teacher's distribution is conditioned on both the problem and the answer, while the student's distribution is conditioned on the problem alone. If a reasoning step is only coherent when the answer is known, the KL divergence will be large, thereby penalizing such answer-leaking explanations.
- **Final Reward**: \(r_i^{RLT} = r^{SS}(o_i, s_i, q_i) - \lambda r^{KL}(o_i, s_i, q_i)\)
- **Training Objective**: The standard GRPO objective is used, with the conventional correctness reward replaced by the RLT reward \(r_i^{RLT}\) when computing group-relative advantages.
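The two reward terms and their combination can be expressed compactly. Below is a minimal PyTorch sketch, assuming the relevant log-probabilities and logits have already been gathered from the teacher and student; the tensor shapes, the shared coefficient \(\alpha\), and the default values are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def rlt_reward(
    answer_logprobs: torch.Tensor,       # (A,) student log-probs of answer tokens, log pi_s(s_i | t_o, q_i)
    teacher_think_logits: torch.Tensor,  # (T, V) teacher logits over think tokens, conditioned on q_i AND s_i
    student_think_logits: torch.Tensor,  # (T, V) student logits over think tokens, conditioned on q_i only
    alpha: float = 0.5,                  # assumed weighting for the min/max terms
    lam: float = 1.0,                    # assumed weighting for the KL penalty
) -> torch.Tensor:
    # r^SS: avg + alpha * min over the student's answer-token log-probs,
    # so that no single answer token is neglected.
    r_ss = answer_logprobs.mean() + alpha * answer_logprobs.min()

    # Per-token KL(teacher || student) over the think tokens. A step that is
    # coherent only when the answer is known yields a large KL and is penalized.
    kl_per_token = F.kl_div(
        F.log_softmax(student_think_logits, dim=-1),  # input: student log-probs
        F.log_softmax(teacher_think_logits, dim=-1),  # target: teacher log-probs
        log_target=True,
        reduction="none",
    ).sum(dim=-1)

    # r^KL: avg + alpha * max, so one answer-leaking step dominates the penalty.
    r_kl = kl_per_token.mean() + alpha * kl_per_token.max()

    # Final dense reward: r^RLT = r^SS - lambda * r^KL.
    return r_ss - lam * r_kl
```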
Loss & Training¶
- Qwen2.5-7B-Instruct serves as the teacher backbone.
- A brief SFT phase precedes RL to adapt the model to the new system prompt format.
- Training consists of only 125 steps (less than one epoch), with batch size 1024 and learning rate \(1 \times 10^{-6}\).
- A separate 7B model serves as the student for reward computation during RL training.
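Since the rewards enter training through GRPO, here is, for reference, a minimal sketch of the group-relative advantage computation (this is the standard GRPO formulation, not a detail quoted from the paper): each (problem, answer) pair gets a group of sampled explanations, and each explanation's advantage is its reward's z-score within the group.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, group_size), one RLT reward per sampled
    # explanation of the same (problem, answer) pair. GRPO normalizes
    # within the group, so no learned value function is needed.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```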
Key Experimental Results¶
Main Results: Distillation Performance Comparison¶
| Model | Data Size | AIME 2024 | MATH 500 | GPQA Diamond | Overall |
|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | N.A. | 10.00 | 74.20 | 33.30 | 39.17 |
| Bespoke-7B (R1 distill + post-proc.) | 17K | 20.00 | 82.00 | 37.80 | 46.60 |
| RLT-7B (no post-processing) | 17K | 23.30 | 82.80 | 42.40 | 49.50 |
| s1-32B | 1K | 50.00 | 92.60 | 56.60 | 66.40 |
| Bespoke-32B | 17K | 63.30 | 93.00 | 58.10 | 71.47 |
| RLT-32B | 17K | 66.70 | 93.40 | 59.60 | 73.23 |
Cold-Start RL Performance Comparison¶
| Method | AIME 2024 | MATH 500 | GPQA Diamond | Overall |
|---|---|---|---|---|
| RL w/o cold start | 13.30 | 74.20 | 34.80 | 40.77 |
| Conventional RL teacher (raw) + RL | 10.00 | 71.00 | 34.80 | 38.60 |
| Conventional RL teacher (GPT post-proc.) + RL | 16.70 | 78.20 | 36.90 | 43.93 |
| Bespoke-7B + RL | 16.70 | 82.80 | 45.40 | 48.30 |
| RLT-7B + RL | 26.70 | 84.00 | 40.90 | 50.53 |
Key Findings¶
- Small Models Outperforming Large Ones: Raw explanations generated by the 7B RLT model yield better distillation performance than reasoning chains produced by models orders of magnitude larger (e.g., DeepSeek-R1 at 671B) with careful filtering and GPT-based post-processing.
- Cross-Scale Effectiveness: Explanations generated by the 7B RLT model still outperform all baselines when used to distill a 32B student, demonstrating that small teachers can effectively instruct larger students.
- Zero-Shot Transfer: RLT generates distillation data zero-shot on the countdown task — which was never seen during training — and surpasses the performance of RL trained directly on that task (55.7% vs. 50.8%).
- High Correlation Between Reward and Distillation Quality: Pearson correlation exceeds 0.89, validating the effectiveness of the RLT reward function design.
- Qualitative Deficiencies in R1 Reasoning Chains: Low-RLT-reward R1 chains frequently attempt to invoke external tools (e.g., calculators) or employ idiosyncratic linguistic patterns from training data (e.g., humorous commentary), whereas RLT explanations are more substantive and automatically incorporate verification steps.
Highlights & Insights¶
- By simplifying the task from "problem solving" to "explanation generation," the framework elegantly circumvents the exploration bottleneck of RL, enabling effective RL training even for small models.
- The design of the dense reward function reflects deep pedagogical intuition: an effective explanation should not only lead the student to the correct answer, but each reasoning step must also be coherent within the student's cognitive framework.
- The framework entirely eliminates the dependence on verifier-based filtering and post-processing in the distillation pipeline, substantially simplifying the reasoning model training workflow.
Limitations & Future Work¶
- RLT training requires a student model to compute rewards online, increasing computational overhead.
- Validation has been conducted only on mathematics and programming tasks; effectiveness in other reasoning domains (e.g., multi-step logical reasoning, commonsense reasoning) remains to be explored.
- The optimal teacher-student pairing has not been thoroughly investigated.
- Future work may explore joint teacher-student co-training and self-distillation schemes in which a single model alternates between teacher and student roles.
Related Work & Insights¶
- This work stands in sharp contrast to conventional RL reasoning approaches such as DeepSeek-R1: rather than pursuing the model's standalone problem-solving capability, it focuses on optimizing distillation effectiveness.
- The framework has significant implications for democratizing reasoning model training: the expensive RL burden is shifted to small, specialized teacher models, while large models require only inexpensive SFT.
- The work motivates a new perspective: training objectives should be aligned with the model's actual deployment purpose, rather than pursuing superficial capability improvements.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Redefining the teacher's task from "solving" to "explaining" is a highly creative contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Multi-dimensional evaluation spanning distillation, cold-start RL, cross-domain transfer, reward analysis, and qualitative analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ — Motivation is articulated with clarity and depth; experimental design features rigorous controlled comparisons.
- Value: ⭐⭐⭐⭐⭐ — Introduces a novel and practically effective paradigm for reasoning model training.