ScEdit: Script-based Assessment of Knowledge Editing
Conference: ACL 2025
arXiv: 2505.23291
Code: Yes (https://github.com/asdfo123/ScEdit)
Area: NLP / Knowledge Editing Evaluation
Keywords: Knowledge Editing, Script Generation, Evaluation Benchmark, Procedural Reasoning, Text-level Evaluation
TL;DR
The authors propose ScEdit, a script-based evaluation benchmark for knowledge editing, which extends traditional "What"-style factual recall evaluation to "How"-style procedural reasoning. It introduces a two-tier evaluation system at both token and text levels, revealing significant limitations of existing knowledge editing methods in practical application scenarios.
Background & Motivation
While the field of Knowledge Editing (KE) has developed rapidly in recent years, existing evaluation frameworks suffer from fundamental deficiencies:
- Oversimplified evaluation: Current metrics (Efficacy, Generalization, Locality) mainly focus on predicting the next few tokens of "What"-style questions, on which many methods have achieved near-perfect scores.
- Disconnection from practical applications: LLMs are increasingly deployed as agents where users ask "How"-style procedural questions (e.g., "How to travel from Beijing to Singapore?") rather than simple factual queries.
- Lack of text-level evaluation: Existing evaluations only focus on token probabilities, ignoring the quality of long-text generation.
To illustrate with an example from the paper: Singapore has implemented a visa-free policy for Chinese tourists, but an un-updated LLM would still advise users to apply for a visa. This error is not just a simple factual mistake; it leads the entire action plan (script) astray.
The motivation of ScEdit is to elevate knowledge editing evaluation from "correctly answering factual questions" to "correctly guiding multi-step procedural tasks," which is closer to the real-world challenges faced by LLMs during deployment.
Method
Overall Architecture
ScEdit consists of three core components:
1. Facts: Standard (s, r, o^c) → (s, r, o) knowledge triple editing.
2. Script Questions: "How"-style multi-step questions based on the edited facts.
3. Scripts: Multi-step responses generated by the model to the script questions.
The evaluation framework is divided into two tiers:
- Token-level: Evaluates metrics such as S-ES, S-NS, and S-BO using cloze prompts.
- Text-level: Evaluates Executability, Coherence, Consistency, and Completeness via automatic (GPT-4) and human evaluation.
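The three components can be pictured as one record per edit case. The sketch below is only an illustration: the class and field names (`FactEdit`, `ScriptCase`, `script_steps`) are hypothetical and are not taken from the ScEdit repository. The values come from the Panamera example used in this note.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class FactEdit:
    """One knowledge edit (s, r, o^c) -> (s, r, o). Field names are hypothetical."""
    subject: str
    relation: str
    old_object: str  # o^c, the answer before the edit
    new_object: str  # o, the answer after the edit


@dataclass
class ScriptCase:
    """A fact edit paired with a "How"-style question and the model's script."""
    edit: FactEdit
    script_question: str
    script_steps: List[str] = field(default_factory=list)  # filled by the model


case = ScriptCase(
    edit=FactEdit("Panamera", "manufacturer", "Porsche", "Ford"),
    script_question="How to book a maintenance service for a Panamera?",
)
```

Keeping the edit immutable and the script mutable mirrors the workflow described above: the fact is fixed up front, while the script is whatever the edited model later generates.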
Key Designs
- Script Question Generation
  - Function: Generate "How"-style questions that would be affected by the edited facts, starting from each knowledge editing triple.
  - Mechanism:
    - Generate candidate script questions using GPT-4 with few-shot prompting.
    - Manually filter to ensure the questions strictly depend on the edited facts.
  - Example: (Panamera, manufacturer, Porsche → Ford) → "How to book a maintenance service for a Panamera?"
  - Design Motivation: Simulate real-world scenarios where LLMs acting as agents are required to answer procedural questions.
- Token-level Evaluation Metrics
  - S-ES (Script-based Efficacy Success): Measures whether the edited model prefers the new answer under the script cloze prompt.
    - Truncate the original script at the first occurrence of the old answer and concatenate it with the script question to form the cloze prompt Q̃.
    - Check whether P(o|Q̃) > P(o^c|Q̃).
  - S-NS (Script-based Neighborhood Success): Measures whether editing affects unrelated neighboring facts.
    - Construct neighbor-fact-based script cloze prompts.
    - Check whether the model still retains the neighboring facts.
  - S-BO (Script-based Bleedover): Measures how far the edit bleeds over onto semantically close facts.
    - Quantified as the drop in accuracy on neighboring facts after editing.
- Text-level Evaluation Metrics (7-point Likert Scale)
  - Executability: Evaluates whether the script steps are logically executable (disregarding knowledge updates).
  - Coherence: Evaluates whether the script is coherent with the updated facts.
  - Consistency: Evaluates whether the script is internally consistent (no confusion between old and new facts).
  - Completeness: Evaluates whether the script adequately addresses all aspects of the question.
  - Design Motivation: Token-level metrics cannot capture the overall generation quality and editing effects of the generated text.
- Two Subtasks
  - ScEdit-CF (Counterfactual Editing): Evaluates the performance of counterfactual knowledge editing in script scenarios based on the CounterFact dataset.
  - ScEdit-T (Temporal Editing): Evaluates temporal update editing based on WikiFactDiff, which more closely reflects real-world scenarios.
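The S-ES check described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the prompt layout (question first, then the truncated script) is a guess at how the two pieces are concatenated, the example script is made up, and `toy_log_prob` is a stand-in for scoring an answer with a real edited model.

```python
from typing import Callable, Optional

# Hypothetical scorer type: returns log P(answer | prompt) under some model.
LogProbFn = Callable[[str, str], float]


def build_cloze_prompt(question: str, script: str, old_answer: str) -> Optional[str]:
    """Cut the original script just before the first mention of the old answer."""
    cut = script.find(old_answer)
    if cut == -1:
        return None  # the old answer never appears, so no cloze prompt exists
    # Assumed layout: script question first, then the truncated script.
    return question + "\n" + script[:cut]


def s_es_hit(prompt: str, new_answer: str, old_answer: str, log_prob: LogProbFn) -> bool:
    """True when the model scores the new answer o above the old answer o^c."""
    return log_prob(prompt, new_answer) > log_prob(prompt, old_answer)


question = "How to book a maintenance service for a Panamera?"
# Made-up pre-edit script, used only to show where the truncation falls.
script = "1. Locate the nearest Porsche service center.\n2. Call to schedule a visit."
prompt = build_cloze_prompt(question, script, "Porsche")


def toy_log_prob(prompt: str, answer: str) -> float:
    """Stand-in for a real model; the scores are arbitrary."""
    return {"Ford": -1.0, "Porsche": -2.5}[answer]


hit = s_es_hit(prompt, "Ford", "Porsche", toy_log_prob)
```

Averaging such hits over all cloze prompts would presumably give the percentage reported as S-ES. An S-NS check could reuse the same comparison on prompts built from neighboring facts, with success meaning the original answer is still preferred.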
Dataset Statistics
| Subtask | No. of Edit Cases | S-Eff. Samples | S-Spec. Samples |
|---|---|---|---|
| ScEdit-CF | 1,830 | 7,342 | 13,672 |
| ScEdit-T | 1,762 | 7,038 | 6,597 |
Key Experimental Results
Main Results (Token-level, ScEdit-CF and ScEdit-T)
| Method | Model | ES↑ (CF) | S-ES↑ (CF) | S-NS↑ (CF) | ES↑ (T) | S-ES↑ (T) | S-BO↓ (T) |
|---|---|---|---|---|---|---|---|
| Base | GPT2-XL | 20.55 | 21.18 | 81.52 | 44.27 | 41.72 | 0.00 |
| FT | GPT2-XL | 100.00 | 71.27 | 65.08 | 87.17 | 52.80 | 1.15 |
| ROME | GPT2-XL | 99.95 | 74.76 | 80.24 | 99.15 | 68.00 | 0.13 |
| MEMIT | GPT2-XL | 93.72 | 58.11 | 81.16 | 81.44 | 52.13 | 0.03 |
| PROMPT | GPT2-XL | 96.28 | 69.63 | 42.88 | 99.49 | 84.39 | 0.54 |
| FT | GPT-J | 100.00 | 83.94 | 25.81 | 99.60 | 97.90 | 5.47 |
| ROME | GPT-J | 99.95 | 86.50 | 83.35 | 99.60 | 74.29 | 0.28 |
| MEMIT | GPT-J | 99.95 | 74.59 | 85.07 | 99.09 | 64.66 | 0.08 |
Key takeaway: S-ES drops by 27% on average compared to traditional PS metrics, demonstrating that script scenarios are much more challenging.
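To make the gap concrete, the sketch below averages the ScEdit-CF ES and S-ES columns over the edited-model rows listed in the table above. It is not expected to reproduce the 27% figure: that number is stated against PS, which is not among the columns reproduced in this note, and the table here may not include every row from the paper. The variable names are illustrative only.

```python
# (method, model, ES, S-ES) copied from the ScEdit-CF columns of the table above.
# The unedited Base row is left out because no edit was applied to it.
rows = [
    ("FT", "GPT2-XL", 100.00, 71.27),
    ("ROME", "GPT2-XL", 99.95, 74.76),
    ("MEMIT", "GPT2-XL", 93.72, 58.11),
    ("PROMPT", "GPT2-XL", 96.28, 69.63),
    ("FT", "GPT-J", 100.00, 83.94),
    ("ROME", "GPT-J", 99.95, 86.50),
    ("MEMIT", "GPT-J", 99.95, 74.59),
]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


es_mean = mean([es for _, _, es, _ in rows])
s_es_mean = mean([s_es for _, _, _, s_es in rows])
gap = es_mean - s_es_mean  # difference in percentage points

print(f"mean ES {es_mean:.2f}, mean S-ES {s_es_mean:.2f}, gap {gap:.2f} points")
```

On these seven rows the gap comes out at roughly 24 points, which points in the same direction as the key takeaway even though it is measured against ES rather than PS.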
Text-level Evaluation (ScEdit-CF, LLAMA3-8B)
| Method | Exec.↑ | Coh.↑ | Cons.↑ | Comp.↑ |
|---|---|---|---|---|
| Base Model | 6.74 | 2.48 | 6.86 | 5.40 |
| FT | 2.94 | 2.97 | 6.17 | 2.17 |
| ROME | 6.41 | 4.32 | 6.57 | 4.67 |
| MEMIT | 6.54 | 3.67 | 6.70 | 4.98 |
| PROMPT | 6.36 | 4.35 | 6.05 | 5.49 |
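One way to read this table is as a change relative to the unedited base model, since editing should raise Coherence without lowering the other three dimensions. The sketch below computes those differences from the numbers in the table; the helper name `delta_vs_base` and the dictionary layout are illustrative, not part of the paper.

```python
DIMENSIONS = ("Exec", "Coh", "Cons", "Comp")

# Scores copied from the text-level table above (7-point Likert scale).
scores = {
    "Base": (6.74, 2.48, 6.86, 5.40),
    "FT": (2.94, 2.97, 6.17, 2.17),
    "ROME": (6.41, 4.32, 6.57, 4.67),
    "MEMIT": (6.54, 3.67, 6.70, 4.98),
    "PROMPT": (6.36, 4.35, 6.05, 5.49),
}


def delta_vs_base(table):
    """Return each method's score minus the base model's score, per dimension."""
    base = table["Base"]
    return {
        method: {dim: round(val - ref, 2) for dim, val, ref in zip(DIMENSIONS, vals, base)}
        for method, vals in table.items()
        if method != "Base"
    }


deltas = delta_vs_base(scores)
for method, change in deltas.items():
    print(method, change)
```

Read this way, FT gains less than half a point of Coherence while losing more than three points each of Executability and Completeness, whereas ROME, MEMIT, and PROMPT gain more than a point of Coherence at a much smaller cost.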
Key Findings
- S-ES is much harder than ES: S-ES is on average 27% lower than traditional PS, indicating that script scenarios significantly increase editing difficulty.
- FT practically fails at the text level: Executability drops to only 2.94/7, nearly destroying the fundamental capabilities of the model.
- ROME achieves the best overall performance: It performs robustly at both token and text levels, suggesting that the locate-then-edit strategy remains effective in script scenarios.
- MEMIT preserves locality best: It achieves the best S-NS and S-BO, but at the cost of editing efficacy.
- PROMPT shows good efficacy but poor locality: Its S-NS on GPT2-XL is only 42.88, well below the unedited base model's 81.52, indicating that in-context editing severely interferes with neighboring facts.
- Token-level and text-level evaluations capture different dimensions: Coherence is only weakly correlated with S-ES, and Consistency is almost uncorrelated with the token-level metrics, so neither tier can substitute for the other.
- The edit-efficacy vs. locality trade-off is more pronounced in script scenarios.
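The correlation finding above rests on comparing per-sample token-level outcomes with per-sample text-level ratings. This note does not say which coefficient the paper uses, so the sketch below assumes a plain Pearson correlation. The two input lists are invented placeholders that only exercise the function; they are not results from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Placeholder inputs (NOT data from the paper): 1/0 S-ES hits and Likert ratings.
s_es_hits = [1, 0, 1, 1, 0, 1]
coherence_ratings = [5, 2, 3, 6, 4, 2]
r = pearson(s_es_hits, coherence_ratings)
```

A coefficient near zero on real per-sample data would be the kind of evidence behind the claim that the two tiers measure different things. Because Likert ratings are ordinal, a rank-based coefficient could be the more natural choice.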
Highlights & Insights
- Valuable problem formulation: For the first time, knowledge editing evaluation is extended from factual recall to procedural reasoning, aligning more closely with real-world deployment scenarios.
- Complementary two-tier evaluation system: Token-level and text-level evaluations capture different dimensions, as verified by correlation analysis.
- More practical ScEdit-T subtask: It is constructed based on real Wikipedia temporal updates, rather than artificially designed counterfactuals.
- Identification of cases with high token-level scores but text-level failures: This exposes the blind spots of existing evaluation frameworks.
Limitations & Future Work
- Limited model scales: Only models of size <10B were evaluated; larger models might exhibit different behaviors.
- Single-edit focus: Large-scale or sequential editing scenarios are not considered.
- Data generated by GPT-4: There might be a bias toward causal language models, and some samples might be less natural.
- Embodied AI scenarios not covered: Scripts are treated only as instructions for humans to carry out; robotic execution is an important direction for future work.
- Editing methods remain based on the triple paradigm: How to develop editing methods suitable for script scenarios remains an open question.
- Potential overestimation/underestimation in automated text-level evaluation: Although manual validation was performed, the coverage remains limited.
Related Work & Insights
- Complementary to ROME/MEMIT evaluation frameworks: ScEdit provides more challenging and practical evaluation scenarios.
- Theoretical foundation from script generation work: Studies such as Schank & Abelson (1975) and CoScript provide the theoretical basis for scripts.
- Inspirations for "script-aware" knowledge editing methods: Editing should not only modify facts but also ensure that reasoning chains relying on those facts are correctly updated.
- Connection to multi-hop reasoning evaluation (Ripple Effect, MeLLo): While sharing similar directions, ScEdit takes a step further by focusing on procedural reasoning.
Rating
- Novelty: ⭐⭐⭐⭐ — Elevating knowledge editing evaluation to the level of script/procedural reasoning is a valuable contribution, offering a unique "How"-style evaluation perspective.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive coverage with three models, six editing methods, two-tier token+text evaluations, manual+automatic evaluations, and metric correlation analyses.
- Writing Quality: ⭐⭐⭐⭐ — The motivation is clearly articulated, the evaluation system is logically designed, and the examples/diagrams are intuitive.
- Value: ⭐⭐⭐⭐ — Proposes an important complement to the evaluation paradigm in the knowledge editing field, revealing the performance gap of existing methods in real-world scenarios.