Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

Conference: ICML 2026
arXiv: 2509.14234
Code: None (no public repository provided in the paper)
Area: LLM / NLP; RLHF alternatives; Reference-free RL
Keywords: GRPO, self-synthesized reference, self-proposed rubric, unverifiable reward, HealthBench

TL;DR

This paper proposes Compute as Teacher (CaT), which reuses the \(G\) rollouts already sampled by GRPO: a frozen anchor model "synthesizes" them into a pseudo-reference answer, and in unverifiable domains the same anchor derives binary rubrics from that pseudo-reference, against which a judge model scores each rollout to produce the RL reward. Inference compute is thus turned directly into a supervision signal without any human annotation. On HealthBench, CaT achieves up to 30% improvement over baselines and matches or surpasses inference-time aggregation while using 9× less test-time compute.

Background & Motivation

Background: Current large model post-training mainly relies on two approaches—SFT with human-annotated references (Ouyang et al. 2022), or RLVR with programmatic verifiers (e.g., math/code GRPO, Shao et al. 2024). Both require the existence and accessibility of "reference answers".

Limitations of Prior Work: In tasks such as medical consultation, life advice, open dialogue, and creative writing, answers are inherently open-ended, admit multiple valid solutions, and are subject to expert disagreement, making it impractical to write ground truth or programmatic checkers. Common fallbacks either rely on costly annotation pipelines or simply let another LLM assign a 1–10 score (LLM-as-judge); the latter has repeatedly been shown to suffer from inconsistency, verbosity bias, style bias, and susceptibility to reward hacking (Zheng et al. 2023).

Key Challenge: RL training requires a "reference signal" to compute advantage; yet in unverifiable domains, this signal can neither come from humans nor be generated by programs. As a result, RL post-training remains scarce in exactly the open-ended domains where it would be most valuable.

Goal: (i) Provide a stable and usable reward signal for RL in unverifiable tasks without any human references; (ii) Ensure this reward mechanism is plug-and-play with existing RLVR pipelines (GRPO) without significant additional compute.

Key Insight: GRPO already samples \(G\) parallel rollouts per prompt to estimate advantage, and these rollouts naturally "disagree where the model is uncertain": one may get an intermediate step right, another the final answer, and a third may perform a correct verification. The set of rollouts as a whole carries more information than any single one, yet it is currently used only to compute the group mean and standard deviation for advantage normalization, so most of that information is wasted.

Core Idea: Use "synthesis" to reconcile multiple rollouts into a pseudo-reference answer \(s\), then have the model itself extract several binary rubric criteria from \(s\) as rewards—turning "compute into supervision" via a plug-and-play two-stage pipeline: reference estimation and reward derivation.

Method

Overall Architecture

Input: prompt \(q\), current policy \(\pi_t\), frozen anchor \(\pi_0\) (typically the initial policy), judge \(\pi_J\) (e.g., GPT-4o). Procedure:

  1. Sample \(G\) rollouts \(o_{1:G}\) using \(\pi_t\) (shared with GRPO);
  2. Reference estimation: Use \(\pi_0\) to synthesize a pseudo-reference \(s \sim \pi_0(\cdot \mid p_{\text{syn}}, o_{1:G})\) from the rollouts;
  3. Reward derivation: In verifiable domains, compare each rollout's extracted answer with that of \(s\); in unverifiable domains, have \(\pi_0\) generate \(n\ge 5\) binary rubrics \(\mathcal{R}=\{r_1,\dots,r_n\}\) from \(s\), then have \(\pi_J\) answer yes/no for each rollout on each rubric, and take the pass rate as the reward: \(R_{\text{rub}}(o;\mathcal{R}) = \frac{1}{n}\sum_j \mathbf{1}[\pi_J(o,r_j)=\text{yes}]\);
  4. GRPO updates \(\pi_t\) using \(\hat A_i = (R_i - \bar R_G)/\sigma_G\).
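
The pass-rate reward in step 3 and the group-normalized advantage in step 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `keyword_judge` is a hypothetical offline stand-in for \(\pi_J\), the rubric strings and rollouts are made up, and the `eps` guard and the use of the population standard deviation are choices made here for the sketch.

```python
from statistics import mean, pstdev
from typing import Callable, List

Judge = Callable[[str, str], bool]  # hypothetical signature: (rollout, rubric) -> yes/no


def rubric_reward(rollout: str, rubrics: List[str], judge: Judge) -> float:
    """R_rub(o; R): fraction of binary rubrics the judge marks as satisfied."""
    return sum(judge(rollout, r) for r in rubrics) / len(rubrics)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """A_i = (R_i - mean) / std over the G rollouts of one prompt.

    eps is an assumed guard against a zero standard deviation.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keyword_judge(rollout: str, rubric: str) -> bool:
    # Toy stand-in for pi_J: "yes" iff the rubric keyword appears in the rollout.
    return rubric.lower() in rollout.lower()


# Made-up rubrics (n = 5) and rollouts, for illustration only.
rubrics = ["doctor", "lifestyle", "hydration", "sleep", "follow-up"]
rollouts = [
    "See a doctor and review lifestyle habits.",
    "Improve sleep.",
    "Doctor visit, lifestyle, hydration, sleep, and a follow-up.",
]
rewards = [rubric_reward(o, rubrics, keyword_judge) for o in rollouts]
advantages = group_advantages(rewards)
```

In this toy group, the rollout that satisfies all five rubrics receives the largest advantage, and a group whose rewards are all identical would receive zero advantage everywhere.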

A key highlight is that this mechanism is fully aligned with native GRPO sampling: it only adds 1 synthesis, 1 rubric generation, and \(n\times G\) very short yes/no judgments, all parallelizable, with total overhead much less than the \(G\) rollouts themselves.
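
To make the overhead claim concrete, the extra model calls per prompt can be tallied as below. The function name is hypothetical; \(G=8\) is the group size reported under Loss & Training and \(n=5\) is the minimum rubric count, both used here only as illustrative inputs. Call counts are a rough proxy for cost, since each yes/no judgment is far shorter than a rollout.

```python
from typing import Dict


def extra_calls_per_prompt(group_size: int, num_rubrics: int) -> Dict[str, int]:
    """Count the model calls added on top of the G rollouts GRPO already samples."""
    return {
        "synthesis": 1,                              # one pseudo-reference from the anchor
        "rubric_generation": 1,                      # one rubric list from the anchor
        "judge_yes_no": num_rubrics * group_size,    # n x G short binary judgments
    }


calls = extra_calls_per_prompt(group_size=8, num_rubrics=5)
total = sum(calls.values())  # 1 + 1 + 5 * 8 = 42
```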

Key Designs

  1. Synthesis as Reference Estimator:

    • Function: Reconcile \(G\) divergent rollouts into a pseudo-reference \(s\), rather than "selecting" one.
    • Mechanism: The frozen initial policy \(\pi_0\) (not the current \(\pi_t\)) reads all rollouts under a fixed prompt \(p_{\text{syn}}\) and generates a new answer. Crucially, the original prompt \(q\) is deliberately omitted from the input (see ablation in Appx 6.4), forcing the model to rely solely on the rollouts for reconciliation rather than simply re-answering; using the frozen anchor instead of the current policy decouples "exploration" from "estimation"—\(\pi_t\) improves via RL, while \(\pi_0\) provides a stable reference baseline, preventing target drift as the policy evolves. Empirically, synthesis disagrees with majority vote on 5–15% of questions, yet when it disagrees, its accuracy remains 70–86% (Table 1), and in ~1% of cases, it "gets it right when all rollouts are wrong"—something no selection method (majority vote, Self-BoN, min-PPL) can achieve in principle.
    • Design Motivation: Selection can at best recover "the best rollout"; synthesis can splice correct segments across rollouts to generate out-of-distribution, superior answers, fully exploiting the potential of inference compute.
  2. Self-proposed Rubrics (Core Contribution):

    • Function: In unverifiable domains without any human reference, decompose the coarse judgment of "is this answer good" into several binary, fine-grained criteria like "does the answer satisfy condition \(r_j\)".
    • Mechanism: \(\mathcal{R} \sim \pi_0(\cdot \mid p_{\text{rub}}, s)\), where the anchor model extracts \(\ge 5\) binary, auditable, repeatable criteria from the pseudo-reference \(s\) (e.g., "advises consulting a doctor", "mentions lifestyle modification", "avoids giving a diagnosis"), then the judge model \(\pi_J\) independently judges each rollout as yes/no for each criterion, with reward as the proportion satisfied. The entire pipeline from inference compute → pseudo-reference → rubrics → reward involves no human reference at any stage.
    • Design Motivation: Decomposition brings three benefits—(i) Reliability: each binary question is much more stable for LLMs than "scoring 1–10"; (ii) Auditability: specific failed criteria can be traced for debugging; (iii) Reduced style bias: rubric rewards "content coverage" rather than style/length, mitigating verbosity bias and reward hacking.
  3. Drop-in Compatibility with Verifiable Domains + Compute Amortization:

    • Function: The same framework reduces to "answer matching with pseudo-reference" in math/code verifiable domains, with no code changes.
    • Mechanism: In verifiable domains, reward simplifies to \(R_{\text{ver}}(o;s)=\mathbf{1}[\texttt{answer}(o)=\texttt{answer}(s)]\), still using synthesis for \(s\); this is analogous to TTRL's majority-vote pseudo-labeling, except that the pseudo-label comes from synthesis and can therefore fall outside the set of rollout answers. For compute amortization: after training, a single forward pass can produce answers matching or surpassing inference-time aggregation, so the "G× compute per deployment" cost is paid once during training and absorbed into the model weights.
    • Design Motivation: Unverifiable domains are the real challenge, but the authors show that simply swapping the reward derivation line allows the framework to "plug into" any domain, demonstrating that CaT is not a healthcare-specific trick but a truly unified paradigm.
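
The synthesis input and the verifiable-domain reward described above can be sketched as follows. This is a hedged illustration, not the paper's code: `P_SYN` is a hypothetical instruction (the actual \(p_{\text{syn}}\) wording is not reproduced here), the `Answer: <value>` format assumed by `extract_answer` is an assumption, and `pseudo_reference` is hand-written in place of a real \(\pi_0\) generation.

```python
import re
from collections import Counter
from typing import List, Optional

# Hypothetical synthesis instruction; stands in for p_syn.
P_SYN = "Reconcile the candidate responses below into one improved response."


def build_synthesis_input(rollouts: List[str]) -> str:
    """Assemble the anchor's input from p_syn and the rollouts only (no original prompt q)."""
    parts = [P_SYN]
    for i, o in enumerate(rollouts, start=1):
        parts.append(f"[Response {i}]\n{o}")
    return "\n\n".join(parts)


def extract_answer(text: str) -> Optional[str]:
    """Assumed answer format: the last 'Answer: <value>' occurrence in the text."""
    matches = re.findall(r"Answer:\s*(\S+)", text)
    return matches[-1] if matches else None


def verifiable_reward(rollout: str, pseudo_reference: str) -> float:
    """R_ver(o; s) = 1 if answer(o) == answer(s), else 0."""
    a_o, a_s = extract_answer(rollout), extract_answer(pseudo_reference)
    return float(a_o is not None and a_o == a_s)


def majority_answer(rollouts: List[str]) -> Optional[str]:
    """Selection baseline for comparison: the most frequent extracted answer."""
    answers = [a for a in map(extract_answer, rollouts) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None


rollouts = ["Work: 3*4. Answer: 12", "Answer: 12", "Recheck units. Answer: 15"]
synthesis_input = build_synthesis_input(rollouts)
pseudo_reference = "Reconciled solution. Answer: 15"  # hand-written stand-in for pi_0's output
rewards = [verifiable_reward(o, pseudo_reference) for o in rollouts]
```

Here the pseudo-reference sides with the minority rollout, so the rewards differ from what majority-vote pseudo-labeling would assign; a synthesized answer that matches no rollout at all would give every rollout a reward of zero.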

Loss & Training

  • Base: GRPO's clipped surrogate + KL regularization to \(\pi_0\);
  • Group size \(G=8\);
  • Anchor \(\pi_0\) is the same as the initial policy, judge \(\pi_J=\) GPT-4o;
  • Compute overhead: synthesis is roughly equivalent to one extra rollout; rubric scoring requires \(n\times G\) very short yes/no judgments, fully parallelizable.
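
For readers unfamiliar with the base objective, a per-token sketch of a GRPO-style clipped surrogate with a KL penalty toward \(\pi_0\) is given below. It follows the commonly used form from Shao et al. 2024; the function name and the `clip_eps` and `kl_coef` values are assumptions for illustration, not settings reported for CaT.

```python
import math


def grpo_token_objective(
    logp_new: float,
    logp_old: float,
    logp_ref: float,
    advantage: float,
    clip_eps: float = 0.2,   # assumed clipping range
    kl_coef: float = 0.01,   # assumed KL weight
) -> float:
    """Per-token objective to maximize: clipped surrogate minus a KL penalty to the reference."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative KL estimator: pi_ref/pi_new - log(pi_ref/pi_new) - 1
    log_ref_ratio = logp_ref - logp_new
    kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
    return surrogate - kl_coef * kl


on_policy = grpo_token_objective(-1.0, -1.0, -1.0, advantage=0.5)
clipped_up = grpo_token_objective(math.log(2.0), 0.0, math.log(2.0), advantage=1.0)
unclipped_down = grpo_token_objective(math.log(2.0), 0.0, math.log(2.0), advantage=-1.0)
```

With a positive advantage the gain from raising the probability ratio is capped at `1 + clip_eps`, while with a negative advantage the unclipped (more pessimistic) term is kept.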

Key Experimental Results

Main Results

| Model | Dataset | Initial | CaT | Inference-time Synthesis | Relative Gain / Compute Ratio |
| --- | --- | --- | --- | --- | --- |
| Gemma 3 4B | HealthBench | base | +up to 30% | < CaT | CaT uses 1× test compute vs synth 9× |
| Qwen 3 4B | HealthBench | base | significantly exceeds base | ≈ CaT | 9× test compute reduced to 1× |
| Llama 3.1 8B | HealthBench | base | 0.38 (vs SFT 0.28) | < CaT | same as above |
| All three models | MATH-500 | base | up to +33% | ≈ CaT | drop-in matches verifiable baseline |

Ablation Study

| Configuration | Key Result | Notes |
| --- | --- | --- |
| CaT (self-proposed rubric) | Matches physician rubric | "Self-written standards ≈ doctor-written standards" on two models |
| Model-as-judge (1–10 scoring) | All models significantly below CaT | Coarse judgment is unstable, reward noise is high |
| CaT-SFT (SFT with pseudo-reference) | Llama 0.28 vs CaT 0.38 | RL generalizes better than SFT on small data |
| Synthesis vs Majority/Self-BoN/Min-PPL | CaT wins on HealthBench, matches on MATH-500 | Synthesis advantage greatest in unverifiable domains |
| Synthesis input 8 vs 1 rollout | 0.85 vs 0.80 (Qwen MATH) | Shows synthesis performs cross-rollout reasoning, not just "one more sample" |

Key Findings

  • Self-proposed rubrics rival expert annotation: On HealthBench, self-proposed rubrics match those designed by human doctors on two models, demonstrating that models capable of generating decent answers can also extract effective scoring dimensions.
  • Synthesis truly reconciles: In ~1% of cases, "all rollouts are wrong but synthesis is right", and when disagreeing with majority vote, synthesis accuracy is as high as 82–86%, indicating synthesis can generate superior out-of-distribution answers.
  • Compute is burned into weights: After CaT training, a single forward pass matches or exceeds inference-time synthesis, which needs about 9× the test-time compute, so that per-deployment cost is paid once during training instead.
  • Llama benefits most from RL, less from synthesis: Weaker models are less adept at meta-cognitive reconciliation, but RL compensates; this suggests CaT is especially beneficial for mid- and low-capacity models.
  • Entropy collapse saturates training: Once rollouts converge, synthesis loses reconciliation space and further training yields diminishing returns, consistent with entropy collapse in RL fine-tuning.
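
The synthesis-versus-majority statistics quoted above (disagreement rate, accuracy when disagreeing, and the "all rollouts wrong but synthesis right" rate) could be computed along the following lines. This is a sketch with made-up toy records; the record fields and function name are hypothetical, and gold answers are used only for this offline analysis, never for training.

```python
from collections import Counter
from typing import Dict, List


def synthesis_vs_majority(records: List[Dict]) -> Dict[str, float]:
    """Compare the synthesized answer with the majority-vote answer on labeled examples."""
    n = len(records)
    disagree = correct_when_disagree = all_wrong_synth_right = 0
    for rec in records:
        majority = Counter(rec["rollout_answers"]).most_common(1)[0][0]
        synth_ok = rec["synthesis_answer"] == rec["gold"]
        if rec["synthesis_answer"] != majority:
            disagree += 1
            correct_when_disagree += synth_ok
        # Synthesis is right although no rollout contains the gold answer.
        if synth_ok and rec["gold"] not in rec["rollout_answers"]:
            all_wrong_synth_right += 1
    return {
        "disagree_rate": disagree / n,
        "acc_when_disagree": correct_when_disagree / disagree if disagree else float("nan"),
        "all_wrong_synth_right_rate": all_wrong_synth_right / n,
    }


# Toy records, invented for illustration only.
records = [
    {"rollout_answers": ["4", "4", "5"], "synthesis_answer": "4", "gold": "4"},
    {"rollout_answers": ["7", "7", "9"], "synthesis_answer": "9", "gold": "9"},
    {"rollout_answers": ["1", "2", "2"], "synthesis_answer": "3", "gold": "3"},
    {"rollout_answers": ["6", "6", "8"], "synthesis_answer": "8", "gold": "6"},
]
stats = synthesis_vs_majority(records)
```

The third toy record is the case no selection method can recover: the gold answer appears in none of the rollouts, so only a generative aggregator can produce it.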

Highlights & Insights

  • Paradigm shift: "compute as supervision": Previous reference-free RL (TTRL, Absolute Zero) only used selective aggregators like majority vote, and only in verifiable domains; this work is the first to combine "generative aggregator + binary rubric" into a unified pipeline usable in unverifiable domains, essentially translating inference best-of-N gains into training supervision signals.
  • Decoupling anchor and policy is a key engineering detail: Using frozen \(\pi_0\) instead of \(\pi_t\) for synthesis avoids positive feedback drift ("fooling oneself"), anchoring the reward signal to a stable reference distribution—akin to KL-to-reference in RLHF, but here the anchor estimates the target rather than constraining update magnitude.
  • Rubric as a "white-box interface" for reward: Rubrics make rewards readable, auditable, and debuggable, which is crucial for industrial deployment when asking "why was this penalized"—effectively upgrading reward engineering from black-box LLM judges to structured checklists, enabling manual spot-checking/curation.
  • Transferable design: The synthesis-as-aggregator + rubric-as-reward paradigm can be directly applied to reasoning trace evaluation, multi-turn dialog, agentic trajectories, etc.; any scenario "requiring scoring but lacking clear standards" can have the model generate its own criteria.

Limitations & Future Work

  • Dependence on base model capability: Weak models produce low-information rollouts and have poor reconciliation ability, so CaT's gains diminish accordingly; fundamentally, CaT "amplifies base capability with compute" and is ineffective for models lacking domain mastery.
  • Training saturates after entropy collapse: Once rollouts converge, synthesis loses reconciliation space and training plateaus; the authors suggest introducing exploration rewards or more diverse sampling strategies in future work.
  • Judge model dependence: Using GPT-4o as \(\pi_J\) poses cost and reproducibility issues for small teams; the effect of replacing with open-source judges is not systematically studied in the main text (briefly mentioned in Appx 6.3).
  • Rubric granularity remains coarse: Current rubrics are binary yes/no, lacking partial credit, hierarchical standards, or confidence weighting; future work could introduce finer-grained rubrics to improve reward signal resolution.
  • Risk of self-collusion: The paper does not systematically examine whether self-generated rubrics are biased toward the anchor's own answer style, a potential avenue for reward hacking that warrants further investigation.

Comparison with Related Work

  • vs TTRL / Absolute Zero: These also pursue reference-free RL, but only use majority vote/self-play in verifiable domains; CaT uses synthesis to go beyond rollout support and, via rubrics, brings unverifiable domains into the RL framework.
  • vs Rubrics as Rewards (RaR, Gunjal et al. 2026): RaR also uses rubric-based scoring, but rubrics are constructed from human references; CaT makes rubrics self-generated, fully removing human annotation dependence.
  • vs Test-time scaling (majority vote, best-of-N): That approach incurs "G× compute per deployment"; CaT amortizes the same compute into training, reducing deployment to 1×.
  • vs LLM-as-judge (Zheng et al. 2023): 1–10 scoring is coarse and style-biased; CaT uses binary rubrics and auditable criteria, reducing noise and bias.
  • vs Constitutional AI / Self-Instruct: Those methods target specific abilities (harmlessness, instruction following); CaT provides a more general, domain-agnostic reference-free RL framework.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Integrates "compute as supervision + self-generated rubric" into a plug-and-play unified pipeline, a key step for reference-free RL in unverifiable domains.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers three model families and two domains, includes comparison with human expert rubrics and full ablation against selection baselines; but only tested at 4–8B scale and in a single (healthcare) unverifiable domain.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear exposition, concise "why it works" intuition, tightly integrated algorithm blocks, figures, and conclusions.
  • Value: ⭐⭐⭐⭐⭐ Provides a paradigm immediately embeddable in industrial RLHF pipelines (GRPO-compatible, no verifier, no human annotation), highly valuable for open-domain post-training.