ScEdit: Script-based Assessment of Knowledge Editing

Conference: ACL 2025
arXiv: 2505.23291
Code: Yes (https://github.com/asdfo123/ScEdit)
Area: NLP / Knowledge Editing Evaluation
Keywords: Knowledge Editing, Script Generation, Evaluation Benchmark, Procedural Reasoning, Text-level Evaluation

TL;DR

The authors propose ScEdit, a script-based evaluation benchmark for knowledge editing, which extends traditional "What"-style factual recall evaluation to "How"-style procedural reasoning. It introduces a two-tier evaluation system at both token and text levels, revealing significant limitations of existing knowledge editing methods in practical application scenarios.

Background & Motivation

While the field of Knowledge Editing (KE) has developed rapidly in recent years, existing evaluation frameworks suffer from fundamental deficiencies:

Oversimplified evaluation: Current metrics (Efficacy, Generalization, Locality) mainly focus on predicting the next few tokens of "What"-style questions, on which many methods have achieved near-perfect scores.

Disconnection from practical applications: LLMs are increasingly deployed as agents where users ask "How"-style procedural questions (e.g., "How to travel from Beijing to Singapore?") rather than simple factual queries.

Lack of text-level evaluation: Existing evaluations only focus on token probabilities, ignoring the quality of long-text generation.

To illustrate with an example from the paper: Singapore has implemented a visa-free policy for Chinese tourists, but an LLM that has not been updated would still advise users to apply for a visa. This is not just an isolated factual mistake; it leads the entire action plan (script) astray.

The motivation of ScEdit is to elevate knowledge editing evaluation from "correctly answering factual questions" to "correctly guiding multi-step procedural tasks," which is closer to the real-world challenges faced by LLMs during deployment.

Method

Overall Architecture

ScEdit consists of three core components:

  1. Facts: Standard (s, r, o^c) → (s, r, o) knowledge triple editing.
  2. Script Questions: "How"-style multi-step questions based on the edited facts.
  3. Scripts: Multi-step responses generated by the model to the script questions.
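
The three components above can be pictured as one record per edit case. The following is a minimal sketch, not the layout used in the ScEdit repository: the class and field names (`Fact`, `EditCase`, `old_object`, `new_object`) are hypothetical, and the example values come from the Panamera case quoted in this summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Fact:
    """One edit (s, r, o^c) -> (s, r, o). Field names are illustrative."""
    subject: str
    relation: str
    old_object: str  # o^c, the answer before editing
    new_object: str  # o, the answer after editing


@dataclass
class EditCase:
    """A fact plus the "How"-style questions that depend on it."""
    fact: Fact
    script_questions: List[str] = field(default_factory=list)
    # Scripts produced by the model under evaluation, keyed by question.
    scripts: Dict[str, List[str]] = field(default_factory=dict)


case = EditCase(
    fact=Fact("Panamera", "manufacturer", "Porsche", "Ford"),
    script_questions=["How to book a maintenance service for a Panamera?"],
)
```

Keeping the generated scripts separate from the questions reflects that the questions are fixed benchmark data, while the scripts change with every model and editing method.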

The evaluation framework is divided into two tiers:

  • Token-level: Evaluates metrics such as S-ES, S-NS, and S-BO using cloze prompts.
  • Text-level: Evaluates Executability, Coherence, Consistency, and Completeness via automatic (GPT-4) and human evaluation.

Key Designs

  1. Script Question Generation

    • Function: Generate "How"-style questions that would be affected by the edited facts, starting from each knowledge editing triple.
    • Mechanism:
      • Generate candidate script questions using GPT-4 with few-shot prompting.
      • Manually filter to ensure the questions strictly depend on the edited facts.
      • Example: (Panamera, manufacturer, Porsche → Ford) → "How to book a maintenance service for a Panamera?"
    • Design Motivation: Simulate real-world scenarios where LLMs acting as agents are required to answer procedural questions.
  2. Token-level Evaluation Metrics

    • S-ES (Script-based Efficacy Success): Measures whether the edited model prefers the new answer under the script cloze prompt.
      • Truncate the original script at the first occurrence of the old answer, then concatenate it with the script question to form the cloze prompt Q̃.
      • Check if P(o|Q̃) > P(o^c|Q̃).
    • S-NS (Script-based Neighborhood Success): Measures whether editing affects unrelated neighboring facts.
      • Construct neighbor-fact-based script cloze prompts.
      • Check if the model still retains the neighboring facts.
    • S-BO (Script-based Bleedover): Measures the bleedover effect of editing on semantically close facts.
      • Measures the drop in accuracy of neighboring facts after editing.
  3. Text-level Evaluation Metrics (7-point Likert Scale)

    • Executability: Evaluates whether the script steps are logically executable (disregarding knowledge updates).
    • Coherence: Evaluates whether the script is coherent with the updated facts.
    • Consistency: Evaluates whether the script is internally consistent (no confusion between old and new facts).
    • Completeness: Evaluates whether the script adequately addresses all aspects of the question.
    • Design Motivation: Token-level metrics cannot capture the overall generation quality and editing effects of the generated text.
  4. Two Subtasks

    • ScEdit-CF (Counterfactual Editing): Evaluates the performance of counterfactual knowledge editing in script scenarios based on the CounterFact dataset.
    • ScEdit-T (Temporal Editing): Evaluates temporal update editing based on WikiFactDiff, which more closely reflects real-world scenarios.
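
The S-ES check described under the token-level metrics can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Scorer` callable is a hypothetical stand-in for whatever returns the model's probability of a continuation, placing the question before the truncated script is an assumed ordering, and the example script and the toy probabilities are invented for the demonstration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical scorer: probability the model assigns to `continuation`
# right after `prompt`. Not an API from the ScEdit repository.
Scorer = Callable[[str, str], float]


def build_cloze_prompt(question: str, script: str, old_answer: str) -> Optional[str]:
    """Cut the script just before the first occurrence of the old answer
    and prepend the script question (ordering is an assumption)."""
    cut = script.find(old_answer)
    if cut == -1:
        return None  # the script never mentions the old answer
    return question + "\n" + script[:cut]


def s_es(score: Scorer, samples: List[Tuple[str, str, str]]) -> float:
    """Percentage of (prompt, new, old) samples with P(new) > P(old)."""
    if not samples:
        raise ValueError("no samples to evaluate")
    wins = sum(score(p, new) > score(p, old) for p, new, old in samples)
    return 100.0 * wins / len(samples)


# Toy demonstration with an invented script and fixed probabilities.
question = "How to book a maintenance service for a Panamera?"
script = "1. Find the nearest Porsche service center. 2. Call to book a slot."
prompt = build_cloze_prompt(question, script, "Porsche")


def toy_score(prompt: str, continuation: str) -> float:
    return {"Ford": 0.6, "Porsche": 0.3}[continuation]


result = s_es(toy_score, [(prompt, "Ford", "Porsche")])
```

A neighborhood check in the style of S-NS could reuse the same comparison on prompts built from neighboring facts, counting a success when the original answer stays more probable.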

Dataset Statistics

Subtask     No. of Edit Cases   S-Eff. Samples   S-Spec. Samples
ScEdit-CF   1,830               7,342            13,672
ScEdit-T    1,762               7,038            6,597

Key Experimental Results

Main Results (Token-level, ScEdit-CF and ScEdit-T)

Method   Model     ES↑      S-ES↑   S-NS↑   ES↑(T)   S-ES↑(T)   S-BO↓(T)
Base     GPT2-XL   20.55    21.18   81.52   44.27    41.72      0.00
FT       GPT2-XL   100.00   71.27   65.08   87.17    52.80      1.15
ROME     GPT2-XL   99.95    74.76   80.24   99.15    68.00      0.13
MEMIT    GPT2-XL   93.72    58.11   81.16   81.44    52.13      0.03
PROMPT   GPT2-XL   96.28    69.63   42.88   99.49    84.39      0.54
FT       GPT-J     100.00   83.94   25.81   99.60    97.90      5.47
ROME     GPT-J     99.95    86.50   83.35   99.60    74.29      0.28
MEMIT    GPT-J     99.95    74.59   85.07   99.09    64.66      0.08

Key takeaway: S-ES drops by 27% on average compared to traditional PS metrics, demonstrating that script scenarios are much more challenging.
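
The size of the drop can be inspected directly from the rows above. The sketch below computes the per-row gap between ES and S-ES, assuming the first two numeric columns are the ScEdit-CF values. Because the table reports ES rather than PS and covers only the rows shown here, the result is not expected to reproduce the 27% figure exactly.

```python
# (method, model, ES, S-ES) copied from the edited-model rows of the table;
# reading the first two numeric columns as ScEdit-CF is an assumption.
rows = [
    ("FT", "GPT2-XL", 100.00, 71.27),
    ("ROME", "GPT2-XL", 99.95, 74.76),
    ("MEMIT", "GPT2-XL", 93.72, 58.11),
    ("PROMPT", "GPT2-XL", 96.28, 69.63),
    ("FT", "GPT-J", 100.00, 83.94),
    ("ROME", "GPT-J", 99.95, 86.50),
    ("MEMIT", "GPT-J", 99.95, 74.59),
]

# Gap in percentage points between fact-level and script-level efficacy.
gaps = {(method, model): es - s_es for method, model, es, s_es in rows}
mean_gap = sum(gaps.values()) / len(gaps)

for (method, model), gap in gaps.items():
    print(f"{method:<7}{model:<9}{gap:6.2f}")
print(f"mean gap: {mean_gap:.2f} points")
```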

Text-level Evaluation (ScEdit-CF, LLAMA3-8B)

Method       Exec.↑   Coh.↑   Cons.↑   Comp.↑
Base Model   6.74     2.48    6.86     5.40
FT           2.94     2.97    6.17     2.17
ROME         6.41     4.32    6.57     4.67
MEMIT        6.54     3.67    6.70     4.98
PROMPT       6.36     4.35    6.05     5.49

Key Findings

  1. S-ES is much harder than ES: S-ES is on average 27% lower than traditional PS, indicating that script scenarios significantly increase editing difficulty.
  2. FT practically fails at the text level: Executability drops to only 2.94/7 (versus 6.74 for the base model), meaning that fine-tuning nearly destroys the model's fundamental generation capabilities.
  3. ROME achieves the best overall performance: It performs robustly at both token and text levels, suggesting that the locate-then-edit strategy remains effective in script scenarios.
  4. MEMIT preserves locality best: It achieves the best S-NS and S-BO, but at the cost of editing efficacy.
  5. PROMPT shows good efficacy but poor locality: S-NS is only 42.88%, indicating that in-context editing severely interferes with neighboring facts.
  6. Token-level and text-level evaluations capture different dimensions: Coherence is only weakly correlated with S-ES, and Consistency is almost uncorrelated with token-level metrics. Each tier therefore provides information that the other cannot replace.
  7. The edit-efficacy vs. locality trade-off is more pronounced in script scenarios.
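
The correlation finding above compares per-sample token-level outcomes with text-level ratings. The summary does not say which coefficient the authors used, so the sketch below assumes Pearson's r purely for illustration. The input values are made-up placeholders, not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder data: 1/0 S-ES outcomes and 1-7 Coherence ratings per sample.
s_es_success = [1, 0, 1, 1, 0, 1]
coherence = [6, 2, 5, 3, 4, 7]
r = pearson(s_es_success, coherence)
```

A coefficient near zero for a pair such as Consistency and S-ES would match the claim that the text-level rating carries information the token-level metric misses.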

Highlights & Insights

  • Valuable problem formulation: For the first time, knowledge editing evaluation is extended from factual recall to procedural reasoning, aligning more closely with real-world deployment scenarios.
  • Complementary two-tier evaluation system: Token-level and text-level evaluations capture different dimensions, as verified by correlation analysis.
  • More practical ScEdit-T subtask: It is constructed based on real Wikipedia temporal updates, rather than artificially designed counterfactuals.
  • Identification of cases with high token-level scores but text-level failures: This exposes the blind spots of existing evaluation frameworks.

Limitations & Future Work

  1. Limited model scales: Only models with fewer than 10B parameters were evaluated; larger models might behave differently.
  2. Single-edit focus: Large-scale or sequential editing scenarios are not considered.
  3. Data generated by GPT-4: There might be a bias toward causal language models, and some samples might be less natural.
  4. Embodied AI scenarios not covered: The scripts are written for human execution; robotic execution is an important direction for future work.
  5. Editing methods remain based on the triple paradigm: How to develop editing methods suitable for script scenarios remains an open question.
  6. Potential overestimation/underestimation in automated text-level evaluation: Although manual validation was performed, the coverage remains limited.

Related Work & Inspiration

  • Complementary to ROME/MEMIT evaluation frameworks: ScEdit provides more challenging and practical evaluation scenarios.
  • Theoretical foundation from script generation work: Studies such as Schank & Abelson (1975) and CoScript provide the theoretical basis for scripts.
  • Inspirations for "script-aware" knowledge editing methods: Editing should not only modify facts but also ensure that reasoning chains relying on those facts are correctly updated.
  • Connection to multi-hop reasoning evaluation (Ripple Effect, MeLLo): While sharing similar directions, ScEdit takes a step further by focusing on procedural reasoning.

Rating

  • Novelty: ⭐⭐⭐⭐ — Elevating knowledge editing evaluation to the level of script/procedural reasoning is a valuable contribution, offering a unique "How"-style evaluation perspective.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive coverage with three models, six editing methods, two-tier token+text evaluations, manual+automatic evaluations, and metric correlation analyses.
  • Writing Quality: ⭐⭐⭐⭐ — The motivation is clearly articulated, the evaluation system is logically designed, and the examples/diagrams are intuitive.
  • Value: ⭐⭐⭐⭐ — Proposes an important complement to the evaluation paradigm in the knowledge editing field, revealing the performance gap of existing methods in real-world scenarios.