ScEdit: Script-based Assessment of Knowledge Editing

Conference: ACL 2025
arXiv: 2505.23291
Code: Yes (https://github.com/asdfo123/ScEdit)
Area: NLP / Knowledge Editing Evaluation
Keywords: Knowledge Editing, Script Generation, Evaluation Benchmark, Procedural Reasoning, Text-level Evaluation

TL;DR

The authors propose ScEdit, a script-based evaluation benchmark for knowledge editing, which extends traditional "What"-style factual recall evaluation to "How"-style procedural reasoning. It introduces a two-tier evaluation system at both token and text levels, revealing significant limitations of existing knowledge editing methods in practical application scenarios.

Background & Motivation

While the field of Knowledge Editing (KE) has developed rapidly in recent years, existing evaluation frameworks suffer from fundamental deficiencies:

Oversimplified evaluation: Current metrics (Efficacy, Generalization, Locality) mainly focus on predicting the next few tokens of "What"-style questions, on which many methods have achieved near-perfect scores.

Disconnection from practical applications: LLMs are increasingly deployed as agents where users ask "How"-style procedural questions (e.g., "How to travel from Beijing to Singapore?") rather than simple factual queries.

Lack of text-level evaluation: Existing evaluations only focus on token probabilities, ignoring the quality of long-text generation.

To illustrate with an example from the paper: Singapore has implemented a visa-free policy for Chinese tourists, but an LLM whose knowledge has not been updated would still advise users to apply for a visa. This is more than a simple factual mistake: it leads the entire action plan (script) astray.

The motivation of ScEdit is to elevate knowledge editing evaluation from "correctly answering factual questions" to "correctly guiding multi-step procedural tasks," which is closer to the real-world challenges faced by LLMs during deployment.

Method

Overall Architecture

ScEdit consists of three core components:
1. Facts: Standard (s, r, o^c) → (s, r, o) knowledge triple editing.
2. Script Questions: "How"-style multi-step questions based on the edited facts.
3. Scripts: Multi-step responses generated by the model to the script questions.

The evaluation framework is divided into two tiers:
- Token-level: Evaluates metrics such as S-ES, S-NS, and S-BO using cloze prompts.
- Text-level: Evaluates Executability, Coherence, Consistency, and Completeness via automatic (GPT-4) and human evaluation.

Key Designs

Script Question Generation
- Function: Generate "How"-style questions that would be affected by the edited facts, starting from each knowledge editing triple.
- Mechanism:
  - Generate candidate script questions using GPT-4 with few-shot prompting.
  - Manually filter to ensure the questions strictly depend on the edited facts.
  - Example: (Panamera, manufacturer, Porsche → Ford) → "How to book a maintenance service for a Panamera?"
- Design Motivation: Simulate real-world scenarios where LLMs acting as agents are required to answer procedural questions.
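
To make the data layout concrete, here is a minimal sketch of how one edit case and its script question could be represented. The class name `EditCase`, its field names, and the `mentions_subject` pre-filter are hypothetical and not taken from the ScEdit codebase; only the Panamera example comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditCase:
    """One knowledge edit (s, r, o^c) -> (s, r, o) plus a dependent script question."""
    subject: str          # s
    relation: str         # r
    old_object: str       # o^c, the answer before editing
    new_object: str       # o, the answer after editing
    script_question: str  # "How"-style question whose answer depends on the fact


def mentions_subject(case: EditCase) -> bool:
    """Cheap, hypothetical pre-filter: the question should at least mention the subject.

    It does not replace the manual dependency check described above.
    """
    return case.subject.lower() in case.script_question.lower()


panamera = EditCase(
    subject="Panamera",
    relation="manufacturer",
    old_object="Porsche",
    new_object="Ford",
    script_question="How to book a maintenance service for a Panamera?",
)
```

A string match like this can only discard obviously unrelated candidates; whether a question strictly depends on the edited fact still requires the manual filtering step.
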
Token-level Evaluation Metrics
- S-ES (Script-based Efficacy Success): Measures whether the edited model prefers the new answer under the script cloze prompt.
  - Truncate the original script at the first occurrence of the old answer, then concatenate it with the script question to form the cloze prompt Q̃.
  - Check if P(o|Q̃) > P(o^c|Q̃).
- S-NS (Script-based Neighborhood Success): Measures whether editing affects unrelated neighboring facts.
  - Construct neighbor-fact-based script cloze prompts.
  - Check if the model still retains the neighboring facts.
- S-BO (Script-based Bleedover): Measures the bleedover effect of editing on semantically close facts.
  - Measures the drop in accuracy of neighboring facts after editing.
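
The S-ES check can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the question is placed before the truncated script (the order is not specified above), `LogProbFn` is a hypothetical interface for scoring a continuation, `toy_log_prob` stands in for a real edited model, and the example script text is invented for the demonstration.

```python
import math
from typing import Callable, Optional

# Hypothetical scorer type: returns log P(continuation | prompt) under the edited model.
LogProbFn = Callable[[str, str], float]


def build_cloze_prompt(question: str, script: str, old_answer: str) -> Optional[str]:
    """Cut the script just before the first old-answer mention and prepend the question."""
    cut = script.find(old_answer)
    if cut == -1:
        return None  # the script never mentions the old answer, so no cloze prompt
    return f"{question}\n{script[:cut]}"


def s_es_success(log_prob: LogProbFn, prompt: str, new_answer: str, old_answer: str) -> bool:
    """True if the model prefers the new answer: P(o | Q~) > P(o^c | Q~)."""
    return log_prob(prompt, new_answer) > log_prob(prompt, old_answer)


def toy_log_prob(prompt: str, continuation: str) -> float:
    """Stand-in for a real model: pretends the edit succeeded by favouring "Ford"."""
    return math.log(0.6) if continuation == "Ford" else math.log(0.1)


question = "How to book a maintenance service for a Panamera?"
# Invented pre-edit script, used only to show the truncation step.
script = "1. Find the nearest Porsche service center. 2. Call to schedule a visit."
prompt = build_cloze_prompt(question, script, old_answer="Porsche")
edit_succeeded = s_es_success(toy_log_prob, prompt, "Ford", "Porsche")
```

Under this reading, the reported S-ES would be the percentage of cloze prompts for which the comparison holds; S-NS would apply the same kind of comparison to prompts built from neighboring facts.
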
Text-level Evaluation Metrics (7-point Likert Scale)
- Executability: Evaluates whether the script steps are logically executable (disregarding knowledge updates).
- Coherence: Evaluates whether the script is coherent with the updated facts.
- Consistency: Evaluates whether the script is internally consistent (no confusion between old and new facts).
- Completeness: Evaluates whether the script adequately addresses all aspects of the question.
- Design Motivation: Token-level metrics cannot capture the overall generation quality and editing effects of the generated text.
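
As a rough sketch of how per-script ratings might be turned into the table-style averages reported later, the snippet below averages Likert scores per dimension. It assumes the 7-point scale runs from 1 to 7; the function name `aggregate_ratings` and the sample ratings are made up for illustration.

```python
from statistics import mean
from typing import Dict, List

DIMENSIONS = ("executability", "coherence", "consistency", "completeness")


def aggregate_ratings(ratings: List[Dict[str, int]]) -> Dict[str, float]:
    """Average 7-point Likert ratings per dimension over a set of scripts."""
    for rating in ratings:
        for dim in DIMENSIONS:
            score = rating[dim]
            if not 1 <= score <= 7:
                raise ValueError(f"{dim} score {score} is outside the 1-7 scale")
    return {dim: round(mean(r[dim] for r in ratings), 2) for dim in DIMENSIONS}


# Made-up ratings for two scripts, for illustration only.
example = [
    {"executability": 7, "coherence": 2, "consistency": 7, "completeness": 5},
    {"executability": 6, "coherence": 4, "consistency": 6, "completeness": 6},
]
summary = aggregate_ratings(example)
```

Keeping the four dimensions separate rather than collapsing them into one score matters here, because a script can be fully executable while still being incoherent with the updated fact.
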
Two Subtasks
- ScEdit-CF (Counterfactual Editing): Evaluates the performance of counterfactual knowledge editing in script scenarios based on the CounterFact dataset.
- ScEdit-T (Temporal Editing): Evaluates temporal update editing based on WikiFactDiff, which more closely reflects real-world scenarios.

Dataset Statistics

Subtask	No. of Edit Cases	S-Eff. Samples	S-Spec. Samples
ScEdit-CF	1,830	7,342	13,672
ScEdit-T	1,762	7,038	6,597

Key Experimental Results

Main Results (Token-level, ScEdit-CF and ScEdit-T)

Method	Model	ES↑ (CF)	S-ES↑ (CF)	S-NS↑ (CF)	ES↑ (T)	S-ES↑ (T)	S-BO↓ (T)
Base	GPT2-XL	20.55	21.18	81.52	44.27	41.72	0.00
FT	GPT2-XL	100.00	71.27	65.08	87.17	52.80	1.15
ROME	GPT2-XL	99.95	74.76	80.24	99.15	68.00	0.13
MEMIT	GPT2-XL	93.72	58.11	81.16	81.44	52.13	0.03
PROMPT	GPT2-XL	96.28	69.63	42.88	99.49	84.39	0.54
FT	GPT-J	100.00	83.94	25.81	99.60	97.90	5.47
ROME	GPT-J	99.95	86.50	83.35	99.60	74.29	0.28
MEMIT	GPT-J	99.95	74.59	85.07	99.09	64.66	0.08

Key takeaway: S-ES drops by 27% on average compared to traditional PS metrics, demonstrating that script scenarios are much more challenging.
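
As a quick arithmetic check on the rows reproduced in the table above, the snippet below computes the average gap between ES and S-ES on ScEdit-CF for the edited models. It covers only these seven rows and compares against ES rather than PS, so it is not expected to reproduce the paper's 27% figure exactly.

```python
# (method, model, ES, S-ES) on ScEdit-CF, copied from the table above.
rows = [
    ("FT", "GPT2-XL", 100.00, 71.27),
    ("ROME", "GPT2-XL", 99.95, 74.76),
    ("MEMIT", "GPT2-XL", 93.72, 58.11),
    ("PROMPT", "GPT2-XL", 96.28, 69.63),
    ("FT", "GPT-J", 100.00, 83.94),
    ("ROME", "GPT-J", 99.95, 86.50),
    ("MEMIT", "GPT-J", 99.95, 74.59),
]

# Per-row drop from the traditional metric to its script-based counterpart.
gaps = [es - s_es for _, _, es, s_es in rows]
mean_gap = sum(gaps) / len(gaps)
largest = max(rows, key=lambda row: row[2] - row[3])

print(f"mean ES - S-ES gap: {mean_gap:.2f} points")
print(f"largest gap: {largest[0]} on {largest[1]}")
```

On these rows the average gap is roughly 24.4 points, with MEMIT on GPT2-XL showing the largest drop, which is in line with the takeaway that script-based prompts are considerably harder.
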

Text-level Evaluation (ScEdit-CF, LLAMA3-8B)

Method	Exec.↑	Coh.↑	Cons.↑	Comp.↑
Base Model	6.74	2.48	6.86	5.40
FT	2.94	2.97	6.17	2.17
ROME	6.41	4.32	6.57	4.67
MEMIT	6.54	3.67	6.70	4.98
PROMPT	6.36	4.35	6.05	5.49

Key Findings

S-ES is much harder than ES: S-ES is on average 27% lower than traditional PS, indicating that script scenarios significantly increase editing difficulty.
FT practically fails at the text level: Executability drops to only 2.94/7, nearly destroying the fundamental capabilities of the model.
ROME achieves the best overall performance: It performs robustly at both token and text levels, suggesting that the locate-then-edit strategy remains effective in script scenarios.
MEMIT preserves locality best: It achieves the best S-NS and S-BO, but at the cost of editing efficacy.
PROMPT shows good efficacy but poor locality: Its S-NS on GPT2-XL is only 42.88%, indicating that in-context editing severely interferes with neighboring facts.
Token-level and text-level evaluations capture different dimensions: Coherence is only weakly correlated with S-ES, and Consistency is almost uncorrelated with token-level metrics. The two evaluation tiers therefore provide distinct information, and neither can substitute for the other.
The edit-efficacy vs. locality trade-off is more pronounced in script scenarios.

Highlights & Insights

Valuable problem formulation: For the first time, knowledge editing evaluation is extended from factual recall to procedural reasoning, aligning more closely with real-world deployment scenarios.
Complementary two-tier evaluation system: Token-level and text-level evaluations capture different dimensions, as verified by correlation analysis.
More practical ScEdit-T subtask: It is constructed based on real Wikipedia temporal updates, rather than artificially designed counterfactuals.
Identification of cases with high token-level scores but text-level failures: This exposes the blind spots of existing evaluation frameworks.

Limitations & Future Work

Limited model scales: Only models of size <10B were evaluated; larger models might exhibit different behaviors.
Single-edit focus: Large-scale or sequential editing scenarios are not considered.
Data generated by GPT-4: There might be a bias toward causal language models, and some samples might be less natural.
Embodied AI scenarios not covered: Scripts are assumed to be carried out by humans; robotic execution is an important direction for future work.
Editing methods remain based on the triple paradigm: How to develop editing methods suitable for script scenarios remains an open question.
Potential overestimation/underestimation in automated text-level evaluation: Although manual validation was performed, the coverage remains limited.

Related Work & Insights

Complementary to ROME/MEMIT evaluation frameworks: ScEdit provides more challenging and practical evaluation scenarios.
Theoretical foundation from script generation work: Studies such as Schank & Abelson (1975) and CoScript provide the theoretical basis for scripts.
Inspirations for "script-aware" knowledge editing methods: Editing should not only modify facts but also ensure that reasoning chains relying on those facts are correctly updated.
Connection to multi-hop reasoning evaluation (Ripple Effect, MeLLo): While sharing similar directions, ScEdit takes a step further by focusing on procedural reasoning.

Rating

Novelty: ⭐⭐⭐⭐ — Elevating knowledge editing evaluation to the level of script/procedural reasoning is a valuable contribution, offering a unique "How"-style evaluation perspective.
Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive coverage with three models, six editing methods, two-tier token+text evaluations, manual+automatic evaluations, and metric correlation analyses.
Writing Quality: ⭐⭐⭐⭐ — The motivation is clearly articulated, the evaluation system is logically designed, and the examples/diagrams are intuitive.
Value: ⭐⭐⭐⭐ — Proposes an important complement to the evaluation paradigm in the knowledge editing field, revealing the performance gap of existing methods in real-world scenarios.