Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning¶
Conference: NeurIPS 2025 · arXiv: 2505.16270 · Code: GitHub · Area: Recommender Systems · Keywords: Mistake Log, Pilot-Copilot, logits rectification, fine-tuning, error-aware
TL;DR¶
This paper proposes the Transformer Copilot framework, which systematically records a "Mistake Log" during LLM fine-tuning, trains an auxiliary Copilot model to learn the Pilot's error patterns, and rectifies logits at inference time to improve generation quality, achieving up to 34.5% improvement across 12 benchmarks.
Background & Motivation¶
Background: Supervised fine-tuning (SFT) is the standard approach for adapting LLMs to specific domains. However, fine-tuned models still suffer from train-test misalignment at inference time — they cannot fully capture task-specific nuances or may overfit certain patterns in the training data.
Limitations of Prior Work:
- Standard SFT updates parameters via loss gradients, discarding each error immediately after consumption; the model retains no explicit memory of where, how, or why it erred.
- Data-side interventions (self-refinement) and external feedback (RLHF, Reflexion) require additional data or human annotation.
- The final fine-tuned parameters \(\theta_T^P\) contain no error information from the training trajectory; these valuable learning signals are wasted.
Key Challenge: SFT optimizes parameters to minimize loss, but parameter updates are "use-and-discard" — the model may repeatedly make similar errors on the same class of problems because it lacks an explicit error-reflection mechanism.
Goal: Rather than modifying the Pilot model's training process, leverage intermediate signals recorded during training (the Mistake Log) to assist correction at inference time.
Key Insight: Analogous to the "mistake notebook" reflection mechanism in human learning — recording errors, analyzing their causes, and using them as reminders during evaluation.
Core Idea: Systematically record the model's error patterns (inputs, internal states, token-level errors) during fine-tuning, train an auxiliary Copilot to learn these patterns, and rectify the Pilot's logits at inference time.
Method¶
Overall Architecture¶
Three components: (1) Mistake Log definition and collection; (2) Copilot model design and joint training; (3) logits fusion at inference time. The Pilot is fine-tuned normally while the Copilot learns the Pilot's error patterns in parallel; at inference time, both collaborate for generation.
Key Designs¶
- Mistake Log:
- Function: Systematically records three types of information throughout the entire fine-tuning process.
- Three components:
- Question \(\tilde{X}_t\): Input representation (encoder output or embedding layer output).
- Rationale \(h_t\): Hidden states of each token across all decoder layers \(\{h_{t,i,l}\}_{l=1}^{L^P}\), reflecting the model's internal "reasoning process."
- Mistake \(\ell_t\): Token-level prediction error \(\ell_t(p_{t,i}, \hat{p}_{t,i}) = p_{t,i} - \hat{p}_{t,i}\), precisely quantifying the direction and magnitude of each token's error.
- Complete Mistake Log: \(M_T = \{(\tilde{X}_t, h_t, \ell_t)\}_{t=1}^T\)
- Copilot Model Design:
- Function: Initialized from the Pilot's decoder, trained to predict the Pilot's token-level errors.
- Encoder-Decoder Copilot: Input is the token-level error sequence \(\ell_{t,<i}\) (projected to hidden dim). Uses modified cross-attention where Queries come from the Copilot's own hidden states and Keys/Values come from the concatenation of the Pilot's input representations and pooled hidden states.
- Decoder-only Copilot: Odd-numbered layers use standard self-attention; even-numbered layers use modified cross-attention attending to Pilot information.
- Loss function: \(\mathcal{L}_t^C = \sqrt{\sum_i \|f_{t,i}^C - \ell_t(p_{t,i}, \hat{p}_{t,i})\|^2}\) (RMSE to avoid excessive gradient smoothing).
- Joint Training Paradigm:
- Each iteration: (a) Pilot forward pass and parameter update; (b) collect Mistake Log entry; (c) sample from Mistake Log to train Copilot.
- The Copilot continuously tracks the Pilot's evolution, learning the most recent error patterns.
- Inference-time Logits Rectification:
- Core formula: \(\tilde{p}_{t,i} = \hat{p}_{t,i} + \lambda f_{t,i}^C\)
- The Copilot autoregressively generates error predictions, which are added back to the Pilot's logits.
- \(\lambda\) is a rectification strength hyperparameter (default 1); theoretical guarantees ensure the existence of \(\lambda_0 > 0\) such that the rectified output is closer to the true distribution.
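The three mechanisms above can be sketched in a toy NumPy example. Everything here is illustrative, not the paper's implementation: the vocabulary size, the random "Pilot" distribution, and the hand-made noisy Copilot prediction are all assumptions; only the error definition \(\ell_t = p - \hat{p}\), the RMSE loss, and the rectification rule \(\tilde{p} = \hat{p} + \lambda f^C\) follow the formulas in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size (assumed)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# (1) Mistake Log entry: the token-level error l_t = p_true - p_hat,
# the gap between the target distribution and the Pilot's prediction.
logits = rng.normal(size=VOCAB)   # stand-in for the Pilot's next-token logits
p_hat = softmax(logits)
p_true = np.eye(VOCAB)[3]         # one-hot ground truth (token id 3)
err = p_true - p_hat              # Mistake Log entry l_t

# (2) Copilot objective: regress onto l_t with an RMSE-style loss.
def copilot_loss(f_pred, err_true):
    return np.sqrt(np.sum((f_pred - err_true) ** 2))

# (3) Inference-time rectification: p_tilde = p_hat + lambda * f^C.
def rectify(p_hat, f_copilot, lam=1.0):
    return p_hat + lam * f_copilot

# Even an imperfect Copilot (its error prediction carries small noise)
# moves the rectified output much closer to the ground truth.
f_noisy = err + rng.normal(scale=0.01, size=VOCAB)
p_tilde = rectify(p_hat, f_noisy)
```

Note that the Copilot only needs to predict the *error*, a lower-variance target than the full output distribution, which is consistent with why a small Copilot suffices.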
Theoretical Guarantee (Theorem 4.1)¶
Under the mild condition that Copilot error \(\epsilon_C < \sqrt{\epsilon_P^2 + \sigma_P^2}\), the rectified \(\tilde{p}_{t,i}\) is strictly closer to the true distribution \(p_{t,i}\) than \(\hat{p}_{t,i}\). Notably, the Copilot may have larger bias than the Pilot (\(\epsilon_C > \epsilon_P\)) and still be effective — explaining why a smaller Copilot is sufficient.
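The leniency of the condition can be illustrated with a toy Monte Carlo under an assumed scalar Gaussian error model (the specific values of \(\epsilon_P\), \(\sigma_P\), \(\epsilon_C\) are made up; this is an intuition check, not the paper's proof): the Copilot's bias exceeds the Pilot's, yet rectification still reduces the expected error.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

eps_P, sigma_P = 0.10, 0.30  # assumed Pilot bias and noise
eps_C = 0.20                 # assumed Copilot error: eps_C > eps_P, yet
                             # eps_C < sqrt(eps_P**2 + sigma_P**2) ~= 0.316

p_true = 0.0
p_hat = p_true + eps_P + rng.normal(scale=sigma_P, size=n)  # Pilot outputs
true_err = p_true - p_hat                                   # l_t for each sample
f_copilot = true_err + rng.normal(scale=eps_C, size=n)      # imperfect error prediction
p_tilde = p_hat + f_copilot                                 # rectified, lambda = 1

rmse_pilot = np.sqrt(np.mean((p_hat - p_true) ** 2))   # ~ sqrt(eps_P^2 + sigma_P^2)
rmse_rect = np.sqrt(np.mean((p_tilde - p_true) ** 2))  # ~ eps_C
```

After rectification the residual error is governed by the Copilot's error alone, so any \(\epsilon_C\) below the Pilot's combined bias-plus-noise level yields a net improvement.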
Key Experimental Results¶
Main Results¶
12 benchmarks spanning commonsense reasoning, arithmetic reasoning, and recommendation tasks.
| Pilot Model | Task Type | w/o Copilot | + Copilot | Gain |
|---|---|---|---|---|
| T5 series | Commonsense reasoning | baseline | +2–15% | Significant |
| LLaMA-3.2-3B | Commonsense reasoning | baseline | +5–34.5% | Up to 34.5% |
| Qwen2.5-7B + 1B Copilot | General | Below Qwen2.5-14B | Surpasses Qwen2.5-14B | 6B fewer params |
| Various Pilots | Arithmetic reasoning | baseline | +2–20% | Consistent gains |
Ablation Study¶
| Analysis Dimension | Finding |
|---|---|
| Copilot size | A 1B Copilot is sufficient to effectively assist a 3B–7B Pilot |
| Computational overhead | Marginal increment — Copilot inference cost is far lower than scaling up the Pilot |
| Transferability | A trained Copilot can be directly transferred to a new Pilot without retraining |
| Scalability | Remains consistently effective as Pilot scale increases |
| Logits rectification visualization | Copilot correction directions align with correct answers |
Key Findings¶
- Copilot as "error corrector," not "independent reasoner": It learns the Pilot's error patterns rather than independent task knowledge.
- Small Copilot suffices: A 1B Copilot assisting a 7B Pilot can outperform a standalone 14B model — more parameter-efficient than simply scaling up.
- Cross-model transferability: A Copilot trained on one Pilot transfers to other Pilots in the same family, suggesting error patterns are shared across models.
- Precise token-level rectification: Visualizations show that the Copilot applies corrections at exactly the positions where the Pilot makes formatting or factual errors.
Highlights & Insights¶
- The "mistake notebook" metaphor is intuitive and effective: The reflection mechanism from human learning is formalized as a Mistake Log — a natural and practical concept.
- Exploiting discarded training signals: In standard SFT, intermediate hidden states and token-level errors are discarded after parameter updates; Copilot transforms this "waste" into valuable supervision signals.
- Lenient theoretical conditions: The Copilot need not be more accurate than the Pilot — it only needs to satisfy mild conditions to guarantee improvement, making a small Copilot viable.
- High parameter efficiency: No part of the Pilot is modified; significant gains are achieved solely by attaching a small auxiliary model.
Limitations & Future Work¶
- Training-time memory overhead: Storing the Mistake Log (hidden states and errors across all training steps) may require strategies such as retaining only the most recent \(N\) steps for large-scale training.
- Inference latency: Although the Copilot is small, the additional forward pass still increases latency.
- Copilot error accumulation: During autoregressive inference, the Copilot conditions on its own generated error predictions rather than ground-truth errors, potentially accumulating bias.
- Dependence on the Pilot's training trajectory: The Copilot learns error patterns from a specific training run; changes to training data or hyperparameters may necessitate retraining.
- No comparison against alignment methods: Methods such as RLHF and DPO also target train-inference misalignment, but the paper provides no direct comparison against them.
Related Work & Insights¶
- vs. Self-Refinement / Reflexion: These methods prompt the model to self-reflect at inference time, requiring multiple inference passes. Copilot requires only a single forward pass and is thus more efficient.
- vs. Knowledge Distillation: Distillation transfers knowledge from a large model to a small one; Copilot uses a small model to assist a large one — the direction is reversed.
- vs. Speculative Decoding: Both use a small model to assist large-model inference, but Speculative Decoding accelerates decoding while Copilot improves output quality.
- vs. Logits Calibration: Post-processing methods such as temperature scaling apply global adjustments; Copilot performs token-level conditional rectification, offering finer granularity.
Rating¶
- Novelty: ⭐⭐⭐⭐ The Mistake Log concept and Pilot-Copilot framework are novel, though logits rectification itself is not entirely new.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 12 benchmarks × 3 task types × multiple Pilots + theoretical analysis + visualization + transferability/scalability analysis — highly comprehensive.