Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
Conference: ICML 2026
arXiv: 2509.14234
Code: None (No public repository provided in the paper)
Area: LLM / NLP; RLHF alternatives; Reference-free supervised RL
Keywords: GRPO, self-synthesized reference, self-proposed rubric, non-verifiable rewards, HealthBench
TL;DR
This paper proposes Compute as Teacher (CaT): a frozen anchor model "synthesizes" a pseudo-reference answer from the \(G\) rollouts that GRPO already samples for each prompt. In non-verifiable domains, the anchor then derives binary rubrics from this pseudo-reference, and each rollout's pass ratio against those rubrics serves as its RL reward. This turns inference compute directly into a supervision signal without any human annotation. CaT achieves up to a 30% improvement over baselines on HealthBench and matches or exceeds inference-time aggregation with \(9\times\) lower test-time compute.
Background & Motivation
Background: Current LLM post-training primarily follows two paths: SFT based on human-labeled reference answers (Ouyang et al. 2022), or RLVR with programmatic verifiers (such as GRPO for math/code, Shao et al. 2024). Both paths require the "existence and accessibility of reference answers."
Limitations of Prior Work: In tasks like medical consultation, life advice, open dialogue, and creative writing, answers are naturally open-ended, admit multiple valid solutions, or are subject to expert disagreement. Ground truths are difficult to write, and programmatic checkers are impossible to construct. Common fallbacks include expensive annotation pipelines or using another LLM to score on a 1–10 scale (LLM-as-judge), the latter of which is prone to inconsistency, verbosity bias, style bias, and reward hacking (Zheng et al. 2023).
Key Challenge: RL training requires a "reference signal" to calculate advantage; however, in non-verifiable domains, this signal is neither available from humans nor generated by programs. Consequently, post-training signal is scarcest in exactly the open-ended domains where it would be most valuable.
Goal: (i) Provide a stable, usable reward signal for RL in non-verifiable tasks without any human reference; (ii) Ensure this reward mechanism is plug-and-play with existing RLVR pipelines (GRPO) without introducing significant extra compute.
Key Insight: GRPO already samples \(G\) parallel rollouts for each prompt to estimate advantage. These rollouts tend to "diverge where the model is uncertain": one might get an intermediate step right, another might get the final answer right, and a third might perform a correct verification. The information content of the whole group is inherently greater than that of any single rollout, yet GRPO uses the group only to compute a baseline and normalize rewards, so most of this information is wasted.
Core Idea: Use "synthesis" to reconcile multiple rollouts into a pseudo-reference answer \(s\), then let the model extract several binary rubric criteria from \(s\) as rewards. This "compute for supervision" approach is implemented as a plug-and-play two-stage pipeline: reference estimation and reward derivation.
Method
Overall Architecture
Input: prompt \(q\), current policy \(\pi_t\), frozen anchor \(\pi_0\) (usually the initial policy), judge \(\pi_J\) (e.g., GPT-4o). Process:
- Sample \(G\) rollouts \(o_{1:G}\) using \(\pi_t\) (reused from GRPO);
- Reference estimation: Perform synthesis on rollouts using \(\pi_0\) to obtain a pseudo-reference \(s \sim \pi_0(\cdot \mid p_{\text{syn}}, o_{1:G})\);
- Reward derivation: For verifiable domains, perform direct string matching; for non-verifiable domains, \(\pi_0\) generates \(n \ge 5\) binary rubrics \(\mathcal{R}=\{r_1,\dots,r_n\}\) from \(s\). Then \(\pi_J\) judges each rollout against each criterion as yes/no. The reward is the pass ratio \(R_{\text{rub}}(o;\mathcal{R}) = \frac{1}{n}\sum_j \mathbf{1}[\pi_J(o,r_j)=\text{yes}]\);
- Update \(\pi_t\) using GRPO with \(\hat A_i = (R_i - \bar R_G)/\sigma_G\).
The highlight is that the entire mechanism aligns directly with native GRPO sampling: it only adds 1 synthesis + 1 rubric generation + \(n \times G\) extremely short yes/no judgments. This is parallelizable and the total overhead is much smaller than the \(G\) rollouts themselves.
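The reward derivation and advantage steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Judge` callable, the toy `keyword_judge`, the example rubrics and rollouts, the use of the population standard deviation for \(\sigma_G\), and the small `eps` guard are all hypothetical choices made for the sketch.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence

# Hypothetical judge signature: (rollout, rubric) -> True for "yes", False for "no".
Judge = Callable[[str, str], bool]


def rubric_reward(rollout: str, rubrics: Sequence[str], judge: Judge) -> float:
    """Pass ratio R_rub(o; R) = (1/n) * sum_j 1[judge(o, r_j) == yes]."""
    if not rubrics:
        raise ValueError("need at least one rubric")
    return sum(judge(rollout, r) for r in rubrics) / len(rubrics)


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-normalized advantages A_i = (R_i - mean) / std for one prompt.

    Assumption: population std plus a small eps so that a group with
    identical rewards yields zero advantages instead of a division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keyword_judge(rollout: str, rubric: str) -> bool:
    """Toy stand-in for pi_J: a rubric passes if its keyword appears in the rollout."""
    return rubric.lower() in rollout.lower()


# Toy rubrics (n = 5) and a toy group of G = 3 rollouts.
rubrics = ["doctor", "lifestyle", "hydration", "rest", "follow-up"]
rollouts = [
    "See a doctor and schedule a follow-up.",
    "Rest, hydration and lifestyle changes help; see a doctor.",
    "Try a new supplement.",
]

rewards = [rubric_reward(o, rubrics, keyword_judge) for o in rollouts]
advantages = group_advantages(rewards)
print(rewards)
print(advantages)
```

In a real run, `keyword_judge` would be replaced by a call to the judge model \(\pi_J\), and each of the \(n \times G\) judge calls is independent, which is what makes the scoring step easy to parallelize.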
Key Designs
- Synthesis as reference estimator:
    - Function: Reconciles \(G\) divergent rollouts into a single pseudo-reference \(s\), rather than "selecting" one.
    - Mechanism: The frozen initial policy \(\pi_0\) (not the current \(\pi_t\)) reads all rollouts under a fixed prompt \(p_{\text{syn}}\) and generates a new answer. The design intentionally excludes the original prompt \(q\) from the input (ablation in Appx 6.4), forcing the model to reconcile the information in the rollouts rather than re-answer the question. Using a frozen anchor instead of the current policy decouples "exploration" from "estimation": while \(\pi_t\) improves via RL, \(\pi_0\) provides a stable estimator, so the targets do not move as the policy drifts. Empirically, synthesis disagrees with the majority vote in 5–15% of cases, yet remains 70–86% accurate when it disagrees (Table 1). It is even correct in ~1% of cases where every rollout is wrong, which is impossible for selection methods (majority vote, Self-BoN, min-PPL).
    - Design Motivation: Selection can at most recover the "best rollout"; synthesis can stitch correct fragments from different rollouts into an answer that lies outside the rollout set and is better than any member of it, extracting more from the same inference compute.
- Self-proposed Rubrics (Novelty):
    - Function: In non-verifiable domains without human references, decomposes a coarse judgment such as "is this good" into several fine-grained binary judgments of the form "does it satisfy criterion \(r_j\)".
    - Mechanism: \(\mathcal{R} \sim \pi_0(\cdot \mid p_{\text{rub}}, s)\). The anchor model distills \(n \ge 5\) binary, auditable, and repeatable criteria from the pseudo-reference \(s\) (e.g., "suggested consulting a doctor," "mentioned lifestyle modification," "avoided giving a diagnosis"). The judge model \(\pi_J\) then independently answers yes/no for each rollout and each criterion. Reward = fraction of criteria satisfied. The entire pipeline goes from inference compute → pseudo-reference → rubrics → reward with no human reference at any step.
    - Design Motivation: Decomposition brings three benefits: (i) Reliability: binary questions are much more stable for LLMs than 1–10 scoring; (ii) Auditability: one can check which specific criterion failed, facilitating debugging; (iii) Mitigation of style bias: rubrics reward content coverage rather than writing style or length, alleviating verbosity bias and reward hacking.
- Drop-in compatibility with verifiable domains + compute amortization:
    - Function: The same framework reduces to "answer matching against the pseudo-reference" in verifiable domains like math/code, with no change beyond the reward derivation step.
    - Mechanism: The verifiable reward simplifies to \(R_{\text{ver}}(o;s)=\mathbf{1}[\texttt{answer}(o)=\texttt{answer}(s)]\), with \(s\) still provided by synthesis. This is analogous to majority-vote pseudo-labeling in TTRL, except that synthesis can go beyond the support of the rollout set. Regarding compute amortization: after training, a single forward pass produces answers equivalent to or better than inference-time aggregation, effectively burning the "G-fold deployment cost" once into the model weights.
    - Design Motivation: Non-verifiable tasks are the true challenge, but showing that changing one line of reward derivation lets the framework "plug" into any domain supports CaT as a general paradigm rather than a healthcare-specific trick.
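To make the synthesis input and the verifiable reward concrete, here is a small Python sketch. Everything named in it is an assumption for illustration: the wording of `P_SYN` (the paper's actual \(p_{\text{syn}}\) is not reproduced in this note), the `[Response i]` layout, the toy `extract_answer` rule based on an `Answer:` marker, and the hand-written `pseudo_reference` that stands in for a call to the frozen anchor \(\pi_0\).

```python
from collections import Counter
from typing import List, Sequence

# Hypothetical wording; the paper's actual synthesis prompt p_syn is not shown here.
P_SYN = (
    "Below are several candidate responses to the same question. "
    "Reconcile them into one final response."
)


def build_synthesis_input(rollouts: Sequence[str]) -> str:
    """Assemble the anchor's input from p_syn and the rollouts only.

    The original question q is deliberately left out, as described above.
    """
    parts = [P_SYN]
    for i, o in enumerate(rollouts, start=1):
        parts.append(f"[Response {i}]\n{o}")
    return "\n\n".join(parts)


def extract_answer(text: str) -> str:
    """Toy extractor (assumption): the text after the last 'Answer:' marker."""
    return text.rsplit("Answer:", 1)[-1].strip()


def verifiable_rewards(rollouts: Sequence[str], reference: str) -> List[float]:
    """R_ver(o; s) = 1 if answer(o) == answer(s), else 0, for each rollout."""
    target = extract_answer(reference)
    return [1.0 if extract_answer(o) == target else 0.0 for o in rollouts]


def majority_answer(rollouts: Sequence[str]) -> str:
    """Selection baseline: the most common extracted answer."""
    return Counter(extract_answer(o) for o in rollouts).most_common(1)[0][0]


rollouts = [
    "6 * 2 = 12. Answer: 12",
    "Half of 24 is 12. Answer: 12",
    "The sides are 3, 5 and 7, so the perimeter is 15. Answer: 15",
]
synthesis_input = build_synthesis_input(rollouts)

# Stand-in for s ~ pi_0(. | p_syn, o_{1:G}); a real run would call the frozen anchor.
pseudo_reference = "Only the third response uses the side lengths correctly. Answer: 15"

rewards = verifiable_rewards(rollouts, pseudo_reference)
print(rewards)
print(majority_answer(rollouts))
```

In this toy group the stand-in pseudo-reference sides with the minority rollout, so the reward differs from what majority-vote pseudo-labeling would assign. Whether a real anchor resolves such a disagreement correctly is an empirical question that the sketch does not address.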
Loss & Training
- Base: GRPO's clipped surrogate + KL regularization to \(\pi_0\);
- Group size \(G=8\);
- Anchor \(\pi_0\) is the same as the initial policy; Judge \(\pi_J=\) GPT-4o;
- Compute overhead: Synthesis is roughly equal to 1 extra rollout; rubric scoring requires \(n \times G\) short yes/no judgments, fully parallelizable.
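For reference, the clipped surrogate with KL regularization mentioned above can be written in the standard GRPO form of Shao et al. (2024), with the KL reference set to the anchor \(\pi_0\). This is a sketch of the usual formulation, not a transcription from the paper: \(\pi_\theta\) plays the role of the current policy \(\pi_t\), \(\pi_{\theta_{\text{old}}}\) is the policy that sampled the rollouts, and the clip range \(\epsilon\), KL weight \(\beta\), and per-token length normalization are assumptions carried over from that formulation.

\[
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\; o_{1:G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big( \min\big( \rho_{i,t}\, \hat A_i,\; \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, \hat A_i \big) - \beta\, D_{\text{KL}}\big[\pi_\theta \,\|\, \pi_0\big] \Big) \right]
\]

\[
\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}, \qquad \hat A_i = \frac{R_i - \bar R_G}{\sigma_G}
\]

Here \(R_i\) is the CaT reward of rollout \(o_i\) (\(R_{\text{rub}}\) or \(R_{\text{ver}}\)), so CaT changes only how \(R_i\) is produced and leaves the optimizer untouched.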
Key Experimental Results
Main Results
| Model | Dataset | Initial | CaT | Inference-time Synthesis | Gain / Compute Ratio |
|---|---|---|---|---|---|
| Gemma 3 4B | HealthBench | base | +up to 30% | < CaT | CaT uses 1× test compute vs synth 9× |
| Qwen 3 4B | HealthBench | base | Significantly > base | ≈ CaT | 9× test compute reduced to 1× |
| Llama 3.1 8B | HealthBench | base | 0.38 (vs SFT 0.28) | < CaT | Same as above |
| Three Models | MATH-500 | base | Up to +33% | ≈ CaT | Drop-in parity with verifiable baselines |
Ablation Study
| Configuration | Key Observation (HealthBench) | Description |
|---|---|---|
| CaT (self-proposed rubric) | On par with physician rubric | "Self-written criteria ≈ doctor-written criteria" across models |
| Model-as-judge (1–10 scoring) | Significantly lower than CaT | Coarse judgments are unstable; reward noise is high |
| CaT-SFT (SFT using pseudo-reference) | Llama 0.28 vs CaT 0.38 | RL generalizes better than SFT on small data |
| Synthesis vs Majority/Self-BoN/Min-PPL | HealthBench: Synthesis wins; MATH-500: Parity | Synthesis shows greatest advantage in non-verifiable domains |
| Synthesis input 8 vs 1 | 0.85 vs 0.80 (Qwen MATH) | Proves synthesis performs cross-rollout reasoning, not just "sampling one more" |
Key Findings
- Self-proposed rubrics rival expert annotations: On HealthBench, self-proposed rubrics performed nearly identically to those designed by human physicians, suggesting that models capable of "writing decent answers" can also "distill valid scoring dimensions."
- Synthesis performs true reconciliation: In ~1% of tasks, "synthesis is correct while all rollouts are wrong." It maintains 82–86% accuracy even when disagreeing with the majority vote, indicating it can generate superior answers outside the rollout distribution.
- Compute is "burned" into weights: After training, a single CaT forward pass matches or exceeds inference-time synthesis over \(G\) rollouts, which costs about \(9\times\) the test-time compute (\(G=8\) rollouts plus one synthesis pass). The "9-fold deployment cost" is thus amortized into a "one-time training cost."
- Llama gains less from synthesis but most from RL: Weaker models struggle with meta-cognitive reconciliation, but RL compensates; this suggests CaT is friendlier to small/medium models.
- Saturation after entropy collapse: As rollouts converge, the reconciliation space for synthesis vanishes, leading to diminishing marginal returns. This aligns with common entropy collapse in RL fine-tuning.
Highlights & Insights
- Paradigm shift of "Compute as Supervision": Previous reference-free RL (TTRL, Absolute Zero) only utilized selective aggregators like majority vote and was limited to verifiable domains. CaT is the first to combine "generative aggregator + binary rubrics" into a unified pipeline for non-verifiable domains, essentially translating inference-time best-of-N gains into training-time supervision signals.
- Decoupling anchor and policy is a critical engineering detail: Using frozen \(\pi_0\) instead of \(\pi_t\) for synthesis prevents positive feedback drift ("deceiving oneself") and anchors the reward signal to a stable reference distribution—consistent with the KL-to-reference philosophy in RLHF, but used for target estimation rather than update constraint.
- Rubrics serve as a "white-box interface" for rewards: Rubrics make rewards readable, auditable, and debuggable. This is crucial for industrial deployment to answer "why was this penalized"—upgrading reward engineering from black-box LLM judges to structured checklists for manual spot-checks/curation.
- Transferable design: The synthesis-as-aggregator + rubric-as-reward paradigm can be directly migrated to reasoning trace scoring, multi-turn dialog, and agentic trajectory scenarios. In any scenario where "scoring is needed but standards are unclear," the model can generate its own standards first.
Limitations & Future Work
- Reliance on base model capability: If the base model generates low-information rollouts or has poor reconciliation ability, CaT gains shrink. Essentially, CaT "amplifies base capabilities" and offers little in domains the model has not mastered at all.
- Saturation after entropy collapse: Once rollouts converge, synthesis loses reconciliation room, hitting a bottleneck. Future work could introduce exploration rewards or diverse sampling strategies.
- Dependence on judge model: Using GPT-4o as \(\pi_J\) poses cost and reproducibility issues for small teams. The effect of replacing it with open-source judges was not systematically studied in the main text (briefly mentioned in Appx 6.3).
- Coarse rubric granularity: Current rubrics are binary yes/no. Future work could introduce partial credit, hierarchical standards, or confidence weighting to increase reward signal resolution.
- Risk of self-collusion: Self-generated rubrics may inherit the anchor's own biased patterns and thus favor the model's existing behavior. The paper does not systematically discuss this potential reward-hacking path, which is worth investigating.
Related Work & Insights
- vs TTRL / Absolute Zero: They also perform reference-free RL but use majority vote/self-play only in verifiable domains; CaT goes beyond rollout support via synthesis and incorporates the non-verifiable domain via rubrics.
- vs Rubrics as Rewards (RaR, Gunjal et al. 2026): RaR also uses rubrics, but they are constructed from human references; CaT makes rubrics self-generated, removing the human dependency.
- vs Test-time scaling (majority vote, best-of-N): That path incurs "\(G\) times compute at every deployment"; CaT amortizes that compute once during training, returning to \(1\times\) at deployment.
- vs LLM-as-judge (Zheng et al. 2023): 1–10 scoring is coarse and style-biased; CaT uses binary rubrics + auditable criteria for noise and bias reduction.
- vs Constitutional AI / Self-Instruct: Those methods target specific capabilities (harmlessness, instruction-following); CaT provides a more general, domain-agnostic reference-free RL framework.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Connecting "compute as supervision + self-generated rubrics" into a plug-and-play unified pipeline is a key step for reference-free RL in non-verifiable domains.
- Experimental Thoroughness: ⭐⭐⭐⭐ Coverage of three model families + two domains, including comparisons with human expert rubrics and detailed selection baseline ablations; however, the non-verifiable evaluation was limited to 4–8B models in a single domain (healthcare).
- Writing Quality: ⭐⭐⭐⭐⭐ Clear progression, concise "why it works" intuition sections, and strong alignment between algorithms, charts, and conclusions.
- Value: ⭐⭐⭐⭐⭐ Provides a paradigm that can be immediately embedded into industrial RLHF pipelines (GRPO compatible, no verifier needed, no human labels needed), offering extremely high value for open-domain post-training.