Towards a Principled Evaluation of Knowledge Editors¶

Conference: ACL 2025
arXiv: 2507.05937
Code: Yes (Open-source evaluation framework)
Area: NLP / Knowledge Editing
Keywords: Knowledge Editing, Model Editing, Evaluation Methodology, MEMIT, Edit Batch Size

TL;DR¶

This paper shows systematically that the choice of scoring method (argmax, multiple-choice, generation match) and of edit batch size in knowledge editing evaluation can reverse the ranking of knowledge editors. A human evaluation further shows that string-match-based scoring is prone to false positives.

Background & Motivation¶

Knowledge Editing has attracted widespread attention in recent years, aiming to update knowledge in pretrained language models through local, targeted modifications without full retraining. Currently, mainstream evaluation datasets include zsRE, CounterFact, MQuAKE, and RippleEdits, but they employ distinctly different scoring methods:

zsRE uses argmax scoring: it checks token by token whether the target matches the model's greedy (top-1) prediction.
CounterFact uses a multiple-choice format (comparing sequence log-likelihoods).
MQuAKE/RippleEdits use string matching within generated text.

Do these different methods cause inconsistencies in editor rankings? How reliable is string matching? To what extent does the edit batch size degrade the model's overall capabilities? These critical questions have not been fully explored previously.

Method¶

Overall Architecture¶

This paper constructs a unified evaluation framework that integrates four knowledge editing datasets and incorporates the LM Evaluation Harness, enabling simultaneous execution of knowledge editing evaluation and general language understanding tasks on edited models.

Key Designs¶

Comparison of Three Scoring Methods:
- Argmax: Checks token-by-token whether the target string is the highest probability prediction, calculating accuracy.
- MC (Multiple Choice): Compares the sequence log-likelihood of the original target and the edited target.
- Generate (Generation Match): Generates a fixed-length text and checks for the occurrence of the target string.
- Design Motivation: Different methods may implicitly favor specific editors.
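A minimal sketch of the three scoring methods, assuming a toy next-token model in place of a real language model. The names `toy_lm`, `argmax_score`, `mc_score`, and `generate_score`, and the example prompt, are hypothetical illustrations and are not taken from the paper's framework. The toy probabilities are chosen so that the three methods disagree on the same edit.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: maps a token prefix to next-token log-probabilities.
LogProbFn = Callable[[Sequence[str]], Dict[str, float]]

def toy_lm(prefix: Sequence[str]) -> Dict[str, float]:
    """Toy bigram-style model (illustration only): depends on the last token."""
    table = {
        "in": {"the": -0.5, "Rome": -1.2, "Paris": -2.5},
        "the": {"city": -0.1, "Rome": -3.0},
        "city": {"of": -0.1},
        "of": {"Rome": -0.2, "Paris": -2.0},
        "Rome": {".": -0.1},
    }
    return table.get(prefix[-1], {".": 0.0})

def argmax_score(lm: LogProbFn, prompt: List[str], target: List[str]) -> float:
    """Fraction of target tokens that are the top-1 prediction (teacher-forced)."""
    prefix, hits = list(prompt), 0
    for tok in target:
        dist = lm(prefix)
        hits += max(dist, key=dist.get) == tok
        prefix.append(tok)
    return hits / len(target)

def sequence_logprob(lm: LogProbFn, prompt: List[str], target: List[str]) -> float:
    """Sum of token log-probabilities of the target given the prompt."""
    prefix, total = list(prompt), 0.0
    for tok in target:
        total += lm(prefix).get(tok, float("-inf"))
        prefix.append(tok)
    return total

def mc_score(lm: LogProbFn, prompt: List[str], new: List[str], old: List[str]) -> bool:
    """Success if the edited target is more likely than the original target."""
    return sequence_logprob(lm, prompt, new) > sequence_logprob(lm, prompt, old)

def generate_score(lm: LogProbFn, prompt: List[str], target: List[str],
                   max_new_tokens: int) -> bool:
    """Greedy-decode a fixed number of tokens, then substring-match the target."""
    prefix, out = list(prompt), []
    for _ in range(max_new_tokens):
        dist = lm(prefix)
        tok = max(dist, key=dist.get)
        out.append(tok)
        prefix.append(tok)
    return " ".join(target) in " ".join(out)

PROMPT = ["The", "Eiffel", "Tower", "is", "in"]
NEW, OLD = ["Rome"], ["Paris"]
print(argmax_score(toy_lm, PROMPT, NEW))                      # top-1 is "the", so a miss
print(mc_score(toy_lm, PROMPT, NEW, OLD))                     # "Rome" beats "Paris"
print(generate_score(toy_lm, PROMPT, NEW, max_new_tokens=4))  # "the city of Rome"
```

In this toy case argmax counts the edit as a failure while MC and generate count it as a success, and the generate verdict itself depends on the generation length.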
Selection of Four Editors:
- MEMIT: Computes explicit parameter updates for MLP layers identified via causal tracing; specifically designed for batch editing.
- LoRA: A parameter-efficient fine-tuning method.
- In-Context: Prepends edited facts to the input in natural language format.
- Context-Retriever: Combines in-context editing with a RAG-style retriever that selects the 4 nearest-neighbor (4-NN) edits most relevant to the query.
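The difference between the two context-based editors can be sketched as follows. This is an illustration under stated assumptions: the bag-of-words cosine similarity is a stand-in for whatever retriever the paper's framework uses, and the function names `retrieve` and `build_prompt` are hypothetical.

```python
import math
from collections import Counter
from typing import List

def bow(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, edits: List[str], k: int = 4) -> List[str]:
    """Return the k edits most similar to the query (k-NN, most similar first)."""
    q = bow(query)
    return sorted(edits, key=lambda e: cosine(q, bow(e)), reverse=True)[:k]

def build_prompt(query: str, edits: List[str], k: int = 4) -> str:
    """Context-Retriever style: prepend only the retrieved edits.
    An In-Context style editor would prepend every edit in the batch instead."""
    return "\n".join(retrieve(query, edits, k)) + "\n" + query

EDITS = [
    "The capital of France is Lyon.",
    "The Eiffel Tower is located in Rome.",
    "Mount Everest lies in Peru.",
    "Python was created by Ada Lovelace.",
    "The Nile flows through Canada.",
    "Mozart was born in Tokyo.",
]
QUERY = "The Eiffel Tower is located in"
print(build_prompt(QUERY, EDITS))
```

Because the prompt holds at most `k` edits regardless of batch size, its length stays bounded, whereas prepending all edits grows with the batch.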
Analysis of Generation Length Impact: Longer generated text leads to a higher false positive rate; human annotation on 200 samples is conducted to verify the reliability of the matching algorithm.
Edit Batch Size Experiments: Gradually scales up the batch size from 1 to 2048 to observe the trend of knowledge editing performance and general capabilities.

Loss & Training¶

LoRA hyperparameters: rank=8, alpha=32, 20 epochs, GPT-2-XL learning rate 5e-3, GPT-J learning rate 1e-3.
MEMIT uses hyperparameters published in the original paper.
2048 items are randomly sampled from each dataset for all experiments.
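The settings above can be collected into a small configuration sketch. The class name `LoraSettings`, the dictionary keys, and the sampling seed are assumptions made for illustration; only the numeric hyperparameters and the sample size of 2048 come from the text.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence

@dataclass(frozen=True)
class LoraSettings:
    """LoRA hyperparameters as listed above (field names are hypothetical)."""
    rank: int = 8
    alpha: int = 32
    epochs: int = 20
    learning_rate: float = 1e-3

# Per-model learning rates; the keys are illustrative labels, not official model IDs.
LORA_BY_MODEL = {
    "gpt2-xl": LoraSettings(learning_rate=5e-3),
    "gpt-j": LoraSettings(learning_rate=1e-3),
}

def sample_items(dataset: Sequence, n: int = 2048, seed: int = 0) -> List:
    """Randomly sample n items without replacement (the seed is an assumption)."""
    if len(dataset) <= n:
        return list(dataset)
    return random.Random(seed).sample(list(dataset), n)

subset = sample_items(range(10_000))
print(len(subset), LORA_BY_MODEL["gpt-j"])
```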

Key Experimental Results¶

Main Results (Accuracy Comparison under Different Scoring Methods, GPT-J)¶

Dataset	Scoring	Context-Retriever	In-Context	MEMIT	LoRA	NoEdit
zsRE	argmax	0.735	0.764	0.727	0.756	0.278
zsRE	gen	0.619	0.656	0.629	0.653	0.066
CF	argmax	0.365	0.391	0.312	0.356	0.095
CF	MC	0.800	0.794	0.866	0.688	0.614
CF	gen	0.505	0.511	0.462	0.442	0.200
MQuAKE	gen	0.213	0.198	0.153	0.133	0.050
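The ranking reversal can be read directly off the CounterFact (CF) rows of the table. The short script below sorts the four editors under each scoring method, using only the numbers reported above; the helper name `ranking` is illustrative.

```python
EDITORS = ["Context-Retriever", "In-Context", "MEMIT", "LoRA"]

# CounterFact accuracies on GPT-J, copied from the table above.
CF_SCORES = {
    "argmax": [0.365, 0.391, 0.312, 0.356],
    "MC":     [0.800, 0.794, 0.866, 0.688],
    "gen":    [0.505, 0.511, 0.462, 0.442],
}

def ranking(scores):
    """Editors ordered from best to worst for one scoring method."""
    pairs = sorted(zip(EDITORS, scores), key=lambda p: p[1], reverse=True)
    return [name for name, _ in pairs]

for method, scores in CF_SCORES.items():
    print(f"{method:>6}: {' > '.join(ranking(scores))}")
```

MEMIT comes first under MC, last under argmax, and third under gen, so the same set of edited models yields three different orderings.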

Ablation Study (Comparison between LLM-as-Judge and Exact Match)¶

Dataset	Mistral-7B	Qwen-32B	Exact Match
zsRE	0.625	0.903	0.882
CF	0.647	0.955	0.917
MQuAKE	0.654	0.897	0.897
RipEd	0.757	0.903	0.896

Key Findings¶

MEMIT's "advantage" on CounterFact depends on the scoring method: Under MC, MEMIT scores 0.866, the highest of all editors, but it ranks last under argmax (0.312) and second to last under generate (0.462, ahead of only LoRA at 0.442).
String matching suffers from false positive issues: As the generation length increases beyond 30 tokens, the false positive rate rises significantly.
Context-Retriever exhibits a higher false positive rate: This is likely due to more diverse generated text.
When edit batch size increases: The performance of the In-Context editor drops sharply at batch > 64 due to context window limits, while MEMIT remains relatively more robust.
Damage to general capability: LoRA is the most destructive (with perplexity soaring to the millions), MEMIT is the gentlest, and Context-Retriever actually recovers at large batch sizes (due to retrieving harmless edits).
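To give a sense of scale for "perplexity in the millions", the sketch below uses the standard definition of perplexity as the exponential of the negative mean token log-probability. The log-probability values are made up for illustration and are not measurements from the paper.

```python
import math
from typing import Sequence

def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(-mean log-probability), with natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Illustrative values only: a healthy model vs. a heavily damaged one.
healthy = perplexity([-2.0, -3.0, -2.5, -3.5])  # mean -2.75, perplexity about 15.6
damaged = perplexity([-14.0] * 4)               # mean -14, perplexity about 1.2 million
print(f"{healthy:.1f} vs {damaged:,.0f}")
```

A perplexity above one million means the model assigns each token an average log-probability of about -14 nats or lower, that is, an average per-token probability below one in a million.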

Highlights & Insights¶

This work is the first to show systematically how the choice of evaluation methodology affects knowledge editor rankings, which matters for fair comparison in this field.
Integrating the LM Evaluation Harness into the knowledge editing evaluation framework addresses the gap in "editor side-effect" research.
The "self-recovery" phenomenon of the Context-Retriever at large batch sizes is intriguing: the more edits there are, the more harmless the retrieved edits become.
LLM-as-Judge demonstrates preliminary feasibility as an alternative evaluation method.

Limitations & Future Work¶

Experiments are conducted on only two small models (GPT-J 6B, GPT-2-XL 1.5B) and need to be extended to larger models.
Instruction-tuned models are not covered.
LoRA hyperparameters are optimized only for batch=16, which may disadvantage LoRA at other batch sizes.
The sample size for LLM-as-Judge is too small (200 cases), requiring larger-scale validation.
A wider coverage of editing methods is lacking.

Related Work & Insights¶

Comparisons with MEMIT (Meng et al., 2023b) reveal the existence of scoring method preferences.
MQuAKE (Zhong et al., 2024) focuses on multi-hop reasoning capabilities; this study finds that in-context editing has a greater advantage on this task.
RippleEdits' (Cohen et al., 2023) forgetfulness query design gives unedited models an inherent advantage on this task.
Insight: Future evaluations of knowledge editing should apply multiple scoring methods uniformly across datasets and report the impact of edits on general capabilities.

Rating¶

Novelty: ⭐⭐⭐⭐ - First systematic analysis of how evaluation methods affect editor rankings.
Experimental Thoroughness: ⭐⭐⭐⭐ - Evaluates four datasets, four editors, and multiple batch sizes, and includes human evaluation, but is limited to small models.
Writing Quality: ⭐⭐⭐⭐ - Well-structured, with rich charts and rigorous argumentation.
Value: ⭐⭐⭐⭐ - Significantly advances the standardization of evaluation methodology in the knowledge editing field.