Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching¶

Conference: ACL 2025
arXiv: 2406.06326
Code: https://github.com/zhangxy-2019/Effective-Knowledge-Injection
Area: LLM NLP / Knowledge Injection
Keywords: Knowledge Injection, Self-Teaching, Feynman Technique, continual learning, Knowledge Acquisition

TL;DR¶

Inspired by the Feynman technique, the paper proposes Self-Tuning, a framework that teaches an LLM to learn from new documents through three kinds of self-teaching tasks: memorization, comprehension, and self-reflection. It substantially improves the model's ability to acquire and recall knowledge from those documents.

Background & Motivation¶

Background: The knowledge of LLMs becomes outdated due to one-time training and a constantly changing world, necessitating the continuous injection of new knowledge.

Limitations of Prior Work: After standard continual pre-training, models struggle to extract the knowledge they have stored; even with subsequent instruction fine-tuning, knowledge extraction remains limited.

Key Challenge: Existing methods overemphasize "memorization" while neglecting "comprehension": even when perplexity (PPL) on the new documents is reduced, the knowledge cannot be effectively extracted in QA tasks.

Goal: To enable LLMs to efficiently absorb, comprehend, and recall new knowledge from raw documents.

Key Insight: Design self-supervised learning tasks by drawing inspiration from the core concept of "comprehension + self-reflection" in the Feynman technique.

Core Idea: First teach the model "how to learn" (Stage 1), and then let it learn new documents autonomously (Stages 2 and 3).

Method¶

Overall Architecture¶

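Three-stage training: Stage 1 teaches the model how to absorb knowledge, using training documents together with their self-teaching tasks and QA data \(\rightarrow\) Stage 2 applies the learned strategy to the test documents while reviewing the training QA data \(\rightarrow\) Stage 3 continues training on the test documents alone.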

Key Designs¶

Memorization Task:
- Function: Performs next-token prediction on the raw text.
- Mechanism: Standard language modeling to embed factual information into the parameters.
- Design Motivation: The first step of the Feynman technique, memorizing basic facts.

Comprehension Task:
- Function: Summarization, key information identification, and natural language inference (NLI).
- Mechanism: (i) Uses document titles as gold summaries, (ii) uses spaCy to identify entities as key information, and (iii) generates NLI samples from the documents.
- Design Motivation: The "explaining in one's own words" aspect of the Feynman technique.

Self-Reflection Task:
- Function: Five formats: "teaching", "flashcards", cloze tests, multiple choice, and sentence completion.
- Mechanism: All tasks are generated from the document content in a self-supervised manner and are answered closed-book, which exercises recall.
- Design Motivation: The "finding and filling knowledge gaps" aspect of the Feynman technique.
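The three task families above can be illustrated with a minimal sketch that turns one raw document into training examples without any human annotation. This is not the authors' generation code: the function names, the prompt wording, and the dictionary layout are all hypothetical, and the answer span for the cloze task is passed in by hand where the paper would rely on entities found by spaCy.

```python
import re


def make_memorization_example(text):
    """Memorization: plain next-token prediction over the raw document."""
    return {"task": "memorization", "prompt": "", "target": text}


def make_summarization_example(title, text):
    """Comprehension: the document title serves as the gold summary."""
    return {
        "task": "summarization",
        "prompt": f"{text}\nWrite a title for the passage above.",
        "target": title,
    }


def make_cloze_example(text, answer):
    """Self-reflection: mask one known span and ask the model to restore it.

    `answer` stands in for an entity that an NER tool would supply.
    """
    if answer not in text:
        raise ValueError("answer span must occur in the text")
    masked = text.replace(answer, "____", 1)
    return {"task": "cloze", "prompt": f"Fill in the blank: {masked}", "target": answer}


def make_sentence_completion_example(text):
    """Self-reflection: show the first half of the last sentence, predict the rest."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = sentences[-1].split()
    cut = max(1, len(words) // 2)
    return {
        "task": "sentence_completion",
        "prompt": "Complete the sentence: " + " ".join(words[:cut]),
        "target": " ".join(words[cut:]),
    }
```

Every target is derived from the document itself, which is what makes the tasks self-supervised; only the choice of which span to mask needs an external tool.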

Loss & Training¶

Stage 1: \(L^{Stage1}_\theta = L_\theta(D^{Doc}_{train}) + L_\theta(D^{Self}_{train}) + L_\theta(D^{QA}_{train})\)
Stage 2: \(L^{Stage2}_\theta = L_\theta(D^{Doc}_{test}) + L_\theta(D^{QA}_{train})\)
Stage 3: \(L^{Stage3}_\theta = L_\theta(D^{Doc}_{test})\)
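A small sketch can make the data mixture of each stage explicit. It assumes the per-dataset losses have already been computed and are simply summed with equal weight, as the formulas are written; the dataset keys and the function name are hypothetical.

```python
# Which datasets contribute to the objective at each stage.
# Keys mirror the symbols D^{Doc}_{train}, D^{Self}_{train}, D^{QA}_{train}, D^{Doc}_{test}.
STAGE_DATASETS = {
    1: ("train_doc", "train_self", "train_qa"),
    2: ("test_doc", "train_qa"),
    3: ("test_doc",),
}


def stage_loss(stage, loss_by_dataset):
    """Sum the per-dataset losses that the given stage's objective includes."""
    return sum(loss_by_dataset[name] for name in STAGE_DATASETS[stage])


# Illustrative loss values only; they are not taken from the paper.
example_losses = {"train_doc": 2.0, "train_self": 1.5, "train_qa": 1.0, "test_doc": 2.5}
stage_totals = {stage: stage_loss(stage, example_losses) for stage in STAGE_DATASETS}
```

Note that no test-set QA data appears in any stage: the test documents are seen only as raw text, so QA performance on them measures knowledge extraction rather than memorized answers.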

Key Experimental Results¶

Main Results (Llama2-7B, Wiki-Bio single-domain scenario)¶

Method	PPL↓	EM↑	F1↑	Reasoning Acc↑	NQ F1↑	CSQA Acc↑
Closed-book	8.41	2.87	14.63	7.96	24.67	53.40
Cont. Pre-train	7.28	3.62	15.96	15.09	24.11	53.40
Std. Ins.-tuning	6.83	5.13	19.15	39.09	23.67	51.84
PIT	2.08	11.61	27.15	11.93	26.31	57.58
Self-Tuning	1.11	31.52	50.83	44.31	25.67	66.01
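For readers unfamiliar with the EM and F1 columns, the sketch below shows how exact match and token-level F1 are commonly computed for short-answer QA, following the widely used SQuAD-style normalization. Whether the paper applies exactly this normalization is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only fully correct answers, while F1 gives partial credit for overlapping tokens, which is why the F1 column is consistently higher than the EM column in the table.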

Ablation Study¶

Variant	EM	F1	Reasoning Acc
Self-Tuning (Full)	31.52	50.83	44.31
w/o Review (remove Stage 2 QA)	EM drops	F1 drops	-
via Reading Comp. (replaced with reading comprehension)	Lower than full version	-	-

Key Findings¶

Self-Tuning improves the knowledge extraction EM from 2.87% to 31.52%, approaching the open-book level (31.83%).
The PPL drops from 8.41 (closed-book) to 1.11, demonstrating that new documents are effectively memorized.
Excellent knowledge retention: relative to the closed-book baseline, NQ F1 rises from 24.67 to 25.67 and CSQA Acc from 53.40 to 66.01, rather than decreasing.
Significant advantages are also maintained in the cross-domain scenario (Wiki-Film).
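To make the perplexity finding concrete, the following sketch uses the standard definition of perplexity as the exponential of the mean negative log-likelihood per token. Under that definition, a PPL of 1.11 corresponds to a geometric-mean token probability of about 0.90, compared with about 0.12 at a PPL of 8.41. The function names are illustrative, and natural-log token probabilities are assumed as input.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over tokens (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def geometric_mean_token_probability(ppl):
    """Invert perplexity into the geometric-mean probability assigned per token."""
    return 1.0 / ppl
```

A low perplexity only shows that the document text is predicted well; the QA metrics are what test whether the stored knowledge can actually be extracted.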

Highlights & Insights¶

The analogy to the Feynman technique is highly intuitive, and the three-part task design (memorize, comprehend, self-reflect) maps directly onto an established human learning method.
All self-teaching tasks are generated in a self-supervised manner, requiring no additional annotations or special templates.
The knowledge retention results are encouraging: learning new knowledge does not necessarily imply forgetting old knowledge.
The Wiki-Newpages-2023-QA dataset itself is a highly valuable contribution.

Limitations & Future Work¶

The three-stage training increases computational costs.
Validated only on Wikipedia-like knowledge documents; the performance on long technical documents remains unknown.
The training documents for Stage 1 require related QA data, so generalizing to completely new domains requires additional effort.

vs PIT (Jiang et al. 2024): PIT focuses only on memorization rather than comprehension, whereas Self-Tuning demonstrates that comprehension + self-reflection is far superior to pure memorization.
vs ReadComprehension (Cheng et al. 2024): The reading comprehension framework relies on mining patterns, while Self-Tuning's self-supervised generation is more flexible.

Supplementary Details¶

Dataset source: Wikipedia NewPages from September to October 2023.
Three datasets: Wiki-Bio (single-domain), Wiki-Multi (multi-domain), and Wiki-Film (cross-domain).
Evaluation dimensions: Memorization (PPL), Extraction (EM/F1), and Reasoning (NLI Accuracy).
Knowledge retention evaluation: Natural Questions and CommonsenseQA.
Self-teaching tasks are generated through self-supervision using spaCy and NLTK.
Consistent advantages are also verified on Qwen2-7B and Mistral-7B.
The cross-domain scenario uses Wiki-Bio training data to test generalization capability.
Self-Tuning's knowledge extraction EM approaches the open-book level.
Self-reflection tasks encompass five formats: teaching, flashcards, cloze tests, multiple choice, and sentence completion.

Rating¶

Novelty: ⭐⭐⭐⭐ The idea of introducing the Feynman technique to LLM knowledge injection is novel, and the self-teaching task design is systematic.
Experimental Thoroughness: ⭐⭐⭐⭐ 3 scenarios \(\times\) 3 models \(\times\) multiple metrics + knowledge retention evaluation.
Writing Quality: ⭐⭐⭐⭐ Clear motivation with a smooth transition between method and experiments.
Value: ⭐⭐⭐⭐⭐ Provides a practical training framework for updating LLM knowledge.