Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty¶
Conference: AAAI 2026 | arXiv: 2511.12991 | Code: None | Area: LLM Alignment | Keywords: LLM honesty, supervised fine-tuning, knowledge boundary, neuron restoration, parameter-efficient
TL;DR¶
This paper reveals that the root cause of SFT-induced dishonesty in LLMs is impaired self-expression (rather than degraded self-knowledge), and proposes the HCNR framework accordingly. By identifying honesty-critical neurons via Fisher information and restoring them to their pre-trained states with Hessian-guided compensation, HCNR recovers 33.25% of honesty using only 256 data samples and 20% of parameters, achieving over 2.23× speedup.
Background & Motivation¶
Importance and Fragility of LLM Honesty¶
LLM honesty encompasses two dimensions:
Self-knowledge: the ability to recognize the boundaries of one's own knowledge
Faithful self-expression: the ability to respond truthfully based on that self-knowledge
Honesty is typically established during the alignment phase (e.g., RLHF), but supervised fine-tuning (SFT) can severely undermine this property. For example:
- After fine-tuning on legal QA, LLMs begin to confidently fabricate legal provisions
- After fine-tuning on medical diagnosis, LLMs produce plausible-sounding answers to questions beyond their knowledge
Such "hallucinations" can have serious consequences in high-stakes domains.
Assumptions and Limitations of Prior Methods¶
Existing honesty recovery methods (e.g., RAIT, DPO, ORPO) share an implicit assumption: SFT deeply destroys the model's knowledge boundary capabilities, necessitating large-scale data and full-parameter adjustment for repair. This results in:
- Requirements for thousands of specially crafted IDK ("I don't know") samples
- Long training times (30–40 minutes)
- Risk of catastrophic forgetting on downstream tasks
Key Observation: Dishonesty as an "Illusory Phenomenon"¶
Two experiments reveal a counterintuitive finding:
Observation 1: During RAIT honesty-enhancement training, model honesty recovers rapidly within approximately 60 gradient steps—suggesting that the core knowledge boundary capability has not been destroyed.
Observation 2: Linear probes (logistic regression) trained on the hidden states of fine-tuned LLMs can distinguish answerable from unanswerable questions with high accuracy (high AUROC). Moreover, probes trained on the base model transfer directly to fine-tuned models while maintaining high AUROC—indicating that SFT does not alter the geometric structure of knowledge boundary representations.
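A minimal sketch of this probe-transfer check is given below: a logistic-regression probe is fit on the base model's hidden states and evaluated (via AUROC) on the fine-tuned model's hidden states. This is an illustrative sketch, not the paper's code; the `.npy` paths, the layer from which hidden states are taken, and the simple train/test split are all hypothetical placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical inputs: hidden states (n_questions, d_model) for the same questions,
# extracted from the base model and from the SFT'd model, plus labels
# (1 = answerable, 0 = unanswerable / beyond the model's knowledge).
base_hidden = np.load("base_hidden_states.npy")   # placeholder path
ft_hidden   = np.load("ft_hidden_states.npy")     # placeholder path
labels      = np.load("answerable_labels.npy")    # placeholder path

n_train = len(labels) // 2  # simple split: first half train, second half test

# Probe trained on the *base* model's representations...
probe = LogisticRegression(max_iter=1000).fit(base_hidden[:n_train], labels[:n_train])

# ...evaluated in-domain on the base model,
auroc_base = roc_auc_score(labels[n_train:],
                           probe.decision_function(base_hidden[n_train:]))

# ...and transferred zero-shot to the fine-tuned model's representations.
# A high transferred AUROC supports the claim that SFT preserves the
# geometry of knowledge-boundary representations.
auroc_transfer = roc_auc_score(labels[n_train:],
                               probe.decision_function(ft_hidden[n_train:]))

print(f"AUROC on base model: {auroc_base:.3f}")
print(f"AUROC transferred to fine-tuned model: {auroc_transfer:.3f}")
```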
Conclusion: SFT-induced dishonesty is a failure of self-expression, not a loss of self-knowledge.
Method¶
Overall Architecture¶
HCNR (Honesty-Critical Neurons Restoration) proceeds in two stages:
Stage 1: Identifying and Restoring Honesty-Critical Neurons
- Fisher information is used to assess each neuron's importance to honesty and to downstream tasks
- Neurons with high honesty importance and low task importance are selected
- Among these, the layers/neurons most perturbed by SFT are prioritized
- Selected neurons are restored to their pre-trained states
Stage 2: Hessian-Guided Compensation
- Restored neurons and unrestored task neurons introduce coordination misalignment
- The Hessian matrix is used to compute an optimal compensation vector that minimizes activation discrepancy
Key Designs¶
1. Intra-layer Sensitivity Assessment¶
Core Idea: The diagonal elements of the Fisher Information Matrix (FIM) serve as unbiased estimates of neuron importance.
For the \(k\)-th neuron in layer \(j\), its importance on dataset \(D\) is estimated from the expected squared gradient of the log-likelihood (the corresponding diagonal Fisher entry):

\[
s_{j,k}(D) = \mathbb{E}_{(x,y)\sim D}\left[\left(\frac{\partial \log p_{\theta}(y \mid x)}{\partial \theta_{j,k}}\right)^{2}\right]
\]
\(s_{j,k}^{hon}\) and \(s_{j,k}^{task}\) are computed on honesty data \(D^{hon}\) and task data \(D^{task}\) respectively, and a priority score \(r_{j,k}\) contrasts the two (the paper uses a KL-divergence-based formulation rather than a simple difference).
A high \(r_{j,k}\) indicates that the neuron is critical for honesty but has minimal impact on downstream tasks—precisely the neurons to be preserved. The top \(R_{IW}\) fraction of neurons per layer are selected as candidates.
Design Motivation: Restoring all neurons indiscriminately would degrade task performance, so precise identification of neurons that are honesty-relevant but task-neutral is necessary. The KL-divergence-based priority score discriminates these two neuron types better than a simple difference would.
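Below is a minimal PyTorch sketch of the diagonal-Fisher importance estimate and the per-layer candidate selection. It assumes a Hugging Face-style causal LM that returns `.loss` when `labels` are provided, scores individual weights rather than whole neurons for brevity, and uses a simple honesty-to-task importance ratio in place of the paper's KL-divergence-based priority score.

```python
import torch

def diagonal_fisher(model, data_loader, device="cuda"):
    """Diagonal Fisher estimate: squared gradients of the log-likelihood,
    averaged over a small calibration set (e.g., 128 samples)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    for batch in data_loader:
        model.zero_grad()
        out = model(input_ids=batch["input_ids"].to(device),
                    labels=batch["labels"].to(device))
        out.loss.backward()  # gradient of the (negative) log-likelihood
        for n, p in model.named_parameters():
            if n in fisher and p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(data_loader), 1) for n, f in fisher.items()}

def select_candidates(fisher_hon, fisher_task, r_iw=0.5, eps=1e-8):
    """Per parameter tensor ('layer'), keep the top r_iw fraction of entries
    with high honesty importance relative to task importance."""
    candidates = {}
    for name in fisher_hon:
        # Illustrative priority score: honesty importance / task importance.
        score = fisher_hon[name] / (fisher_task[name] + eps)
        flat = score.flatten()
        k = max(1, int(r_iw * flat.numel()))
        threshold = torch.topk(flat, k).values.min()
        candidates[name] = score >= threshold  # boolean mask of candidate weights
    return candidates
```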
2. Cross-layer Perturbation Analysis¶
SFT perturbs different layers to different degrees due to the hierarchical specialization of LLMs; the layers with the greatest perturbation should be prioritized.
The top \(R_{CW}\) fraction of highly perturbed layers are selected. The final honesty-critical neuron set \(A^{hc}\) is obtained by intersecting the candidate layers and candidate neurons.
Design Motivation: Indiscriminately protecting all layers over-constrains downstream performance. In practice, certain layers (e.g., middle layers) exhibit greater perturbation in their honesty neurons and require priority restoration.
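A small sketch of the cross-layer step, under the assumption (not confirmed by the text above) that layer perturbation is measured as the relative L2 change SFT induced on each layer's candidate honesty neurons; the final set \(A^{hc}\) is then the candidates inside the top-\(R_{CW}\) most perturbed layers.

```python
import torch

def select_perturbed_layers(pretrained_state, finetuned_state, candidate_masks, r_cw=0.4):
    """Rank parameter tensors ('layers') by how strongly SFT changed their
    candidate honesty neurons (relative L2 change is an assumed stand-in for
    the paper's metric), and keep the top r_cw fraction."""
    perturbation = {}
    for name, mask in candidate_masks.items():
        w_pre, w_ft = pretrained_state[name], finetuned_state[name]
        delta = (w_ft - w_pre)[mask]
        perturbation[name] = (delta.norm() / (w_pre[mask].norm() + 1e-8)).item()
    ranked = sorted(perturbation, key=perturbation.get, reverse=True)
    keep = ranked[: max(1, int(r_cw * len(ranked)))]
    # Final honesty-critical set A^hc: candidate neurons inside the kept layers.
    return {name: candidate_masks[name] for name in keep}
```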
3. Hessian-Guided Compensation¶
Simply restoring neurons to their pre-trained states introduces new misalignment—because all parameters are updated in coordination during SFT. Restoring a subset breaks this coordination, causing a rebound in honesty task loss.
The compensation vector is derived within the OBS (Optimal Brain Surgeon) framework: using a second-order (Hessian-based) expansion of the loss, HCNR solves for the adjustment to the unrestored weights that best offsets the shift introduced by restoring the honesty-critical neurons. The final weight update applies the restoration and this compensation together.
Design Motivation: Restoration without compensation causes honesty rebound (ablation shows F1 dropping from 72.84 to 65.96); Hessian compensation precisely bridges the coordination gap between restored and task neurons.
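The compensation step can be illustrated with the generic OBS-style closed form: holding the restored coordinates fixed at their pre-trained values, the adjustment to the remaining coordinates that minimizes the second-order loss increase \(\tfrac{1}{2}\delta^{\top} H \delta\) is \(\delta_F = -H_{FF}^{-1} H_{FR}\,\delta_R\). The NumPy sketch below implements this generic formula, not necessarily the paper's exact derivation; the toy Hessian surrogate \(H = X^{\top} X\) is an assumption.

```python
import numpy as np

def obs_compensation(delta_restored, H, restored_idx):
    """Generic OBS-style compensation (a sketch, not the paper's exact formula).

    Second-order view: a weight change d raises the loss by ~ 1/2 * d^T H d.
    Minimizing over the free (unrestored) coordinates with the restored
    coordinates held fixed gives delta_free = -H_FF^{-1} H_FR delta_restored.
    """
    n = H.shape[0]
    free_idx = np.setdiff1d(np.arange(n), restored_idx)
    H_FF = H[np.ix_(free_idx, free_idx)]
    H_FR = H[np.ix_(free_idx, restored_idx)]
    delta_free = -np.linalg.solve(H_FF, H_FR @ delta_restored)

    delta = np.zeros(n)
    delta[restored_idx] = delta_restored   # restore toward pre-trained values
    delta[free_idx] = delta_free           # compensation on the other weights
    return delta

# Toy usage: 6 weights, restore weights {1, 4} back toward pre-training and
# compensate the rest so the quadratic loss increase stays minimal.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))
H = A.T @ A + 1e-3 * np.eye(6)             # PSD Hessian surrogate (e.g., X^T X)
restored_idx = np.array([1, 4])
delta_restored = rng.normal(size=2)        # pre-trained minus fine-tuned values
print(obs_compensation(delta_restored, H, restored_idx))
```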
Loss & Training¶
- HCNR is training-free—no additional training is required; only a small amount of data is needed to compute Fisher/Hessian statistics
- Only \(|D^{hon}| = |D^{task}| = 128\) samples are required
- Hyperparameters: \(R_{IW} = 0.5\) (select 50% of neurons per layer), \(R_{CW} = 0.4\) (select 40% of layers)
- Only 20% of total parameters are modified
- Experiments are repeated 3 times and averaged
- Runs on Nvidia A800-80GB GPU
Key Experimental Results¶
Main Results¶
Results on Llama-3.1-8B-Instruct fine-tuned on HotpotQA and MedMCQA, followed by honesty recovery:
| Method | FalseQA F1 | NEC F1 | RefuNQ F1 | KUQ F1 | SelfAware F1 | Task Acc. |
|---|---|---|---|---|---|---|
| Fine-tuned | 56.51 | 35.46 | 32.43 | 68.50 | 67.01 | 30.65 |
| RAIT | 68.59 | 68.28 | 71.21 | 80.38 | 64.46 | 27.05 |
| DPO | 69.12 | 69.52 | 72.91 | 80.96 | 64.76 | 29.00 |
| ORPO | 65.83 | 70.03 | 71.26 | 79.21 | 65.21 | 29.60 |
| HCNR | 68.30 | 71.90 | 71.70 | 82.90 | 69.40 | 30.30 |
Efficiency comparison (recovery after HotpotQA fine-tuning):
| Method | Data Size | Param % | Time | Avg. F1 | Avg. RF Δ |
|---|---|---|---|---|---|
| RAIT | 5000 | 100% | 8.76 min | 70.58 | +33.40 |
| DPO | 5000 | 100% | 42.78 min | 71.45 | +37.41 |
| ORPO | 9000 | 100% | 30.97 min | 70.31 | +39.94 |
| HCNR | 256 | 20% | 3.93 min | 72.84 | +42.64 |
Ablation Study¶
| Stage 1 Config | Stage 2 Config | Avg. F1 | Avg. RF Δ | Task Acc. |
|---|---|---|---|---|
| Random | Ours | 65.44 | +36.31 | 29.60 |
| w/o Task | Ours | 70.43 | +33.24 | 28.30 |
| Ours | w/o Compensation | 65.96 | +33.09 | 30.37 |
| Random | w/o Compensation | 54.21 | +23.04 | 29.70 |
| Ours | Ours | 72.84 | +42.64 | 30.30 |
Key Findings¶
- HCNR achieves top performance on 3–4 of 5 honesty benchmarks while maintaining the highest task accuracy
- Efficiency advantages are pronounced: only 256 data samples (20× reduction), 20% of parameters, and 3.93 minutes (2.23× speedup) suffice to outperform all baselines
- F1 saturates at 128 samples: further data increases yield negligible gains, confirming the hypothesis that honesty degradation is a localized phenomenon
- Hessian compensation is indispensable: removing compensation reduces F1 from 72.84 to 65.96 and RF Δ from 42.64 to 33.09
- ICL yields the worst recovery: indicating that fine-tuning impairs in-context learning capabilities
- Cross-model generalization: effective across 5 LLM families including Llama-3, Qwen2/3, and Mistral
Highlights & Insights¶
- The core insight is highly valuable: "SFT-induced dishonesty is a failure of expression, not a loss of cognition"—this finding reshapes the understanding of SFT side effects
- The linear probe transfer experiment is elegantly designed: probes trained on the base model transfer directly to fine-tuned models with sustained effectiveness, providing strong evidence for the robustness of knowledge boundary representations
- Training-free design: unlike RAIT/DPO/ORPO, HCNR requires no additional training—only statistical computation followed by direct weight modification
- Asymmetric behavior of \(R_{IW}\) and \(R_{CW}\): \(R_{IW}\) saturates rapidly (intra-layer neuron selection is relatively insensitive), while \(R_{CW}\) has a clear optimum at 0.3–0.4 (indicating that cross-layer selection is more critical)
- Pareto frontier dominance: on the task-honesty tradeoff plot, HCNR's Pareto frontier strictly dominates all baselines
Limitations & Future Work¶
- Assumption of pre-trained state optimality: the framework assumes that the pre-trained state represents optimal honesty, whereas the post-alignment state may in fact be superior
- Approximations in Fisher/Hessian computation: the diagonal Fisher approximation and the finite-data Hessian estimate introduce approximation error, and their accuracy is limited by the small amount of calibration data available
- Only LoRA and full fine-tuning evaluated: other PEFT methods (e.g., Prefix Tuning, Adapter) remain untested
- Narrow definition of honesty: only the dimension of "refusing to answer unknown questions" is considered; broader honesty aspects such as factual error correction and uncertainty calibration are not addressed
- Safety concerns: whether honesty restoration may simultaneously revive certain behaviors intentionally suppressed by SFT warrants further analysis
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The insight that "dishonesty is a failure of expression" is highly innovative; the HCNR framework is elegantly designed
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 5 LLM families, 4 fine-tuning datasets, 5 honesty benchmarks, and detailed ablations
- Writing Quality: ⭐⭐⭐⭐⭐ — Narrative is fluent with a clear logical flow from observations to method to validation
- Value: ⭐⭐⭐⭐⭐ — Directly practical for safe LLM deployment; the method is efficient and readily applicable