BMIKE-53: Investigating Cross-Lingual Knowledge Editing with In-Context Learning

Conference: ACL 2025
arXiv: 2406.17764
Code: GitHub
Area: Knowledge Editing
Keywords: cross-lingual knowledge editing, in-context learning, multilingual benchmark, script type, language confusion

TL;DR

Proposes BMIKE-53, a cross-lingual benchmark covering 53 languages and integrating three knowledge editing datasets (zsRE, CounterFact, and WikiFactDiff). It systematically evaluates in-context knowledge editing methods from zero-shot to 8-shot settings, revealing that writing systems (Latin vs. non-Latin) are more decisive than language families for cross-lingual editing performance, and that metric-specific exemplar strategies significantly outperform hybrid configurations.

Background & Motivation

Background: LLMs encode a vast amount of knowledge during pre-training, but this knowledge is static and becomes outdated over time. Knowledge Editing (KE) techniques aim to selectively modify specific knowledge in LLMs while leaving other knowledge unaffected. Gradient-free In-Context Learning (ICL) methods are particularly suitable for closed-source models because they do not require access to model parameters.

Limitations of Prior Work: Existing KE research predominantly focuses on monolingual (English) scenarios. Cross-lingual KE—where knowledge edited in a source language needs to generalize to equivalent queries in other target languages—presents greater challenges, yet systematic studies are scarce. Most existing cross-lingual KE works employ gradient-based methods (e.g., ROME/MEMIT), which incur high computational overhead and are inapplicable to closed-source models. More critically, there is a lack of a unified evaluation benchmark covering a wide range of languages.

Key Challenge: Can knowledge edited in a source language via ICL transfer effectively to semantically equivalent queries in other target languages? What factors determine the success or failure of such cross-lingual transfer?

Key Insight: Building the most comprehensive multilingual KE benchmark to date (53 languages \(\times\) 3 datasets) to systematically analyze the capability boundaries of cross-lingual IKE from multiple dimensions, including model scale, exemplar strategies, query types, and linguistic attributes.

Method

Overall Architecture

Two major components:

  1. BMIKE-53 Benchmark Construction: Unified format for three KE datasets \(\rightarrow\) Structured translation via GPT-4o expanded to 52 target languages \(\rightarrow\) Native speaker review + back-translation quality control.
  2. Cross-Lingual IKE Evaluation: Define cross-lingual IKE tasks and four query types \(\rightarrow\) Design 4 experimental settings (zero-shot / one-shot / 8-shot mixed / 8-shot metric-specific) \(\rightarrow\) Multi-dimensional analysis.

Key Designs

  1. Integration of three datasets in a unified format:

    • Function: Integrates zsRE (standard factual modification), CounterFact (counterfactual knowledge updates), and WikiFactDiff (real-world temporal updates) into a unified benchmark.
    • Mechanism: Each data record uniformly contains the edited knowledge item + four types of evaluation queries (reliability, generality, locality, portability), stored in JSON format.
    • Design Motivation: The three datasets cover a complete spectrum of KE scenarios from standard to counterfactual to temporal real-world contexts. Portability queries are constructed via one-hop knowledge graph reasoning to test the logical reasoning capacity of edited knowledge.
  2. Four IKE experimental settings:

    • Function: Systematically controls the impact of exemplar quantity and quality on cross-lingual IKE.
    • Mechanism: zero-shot (no exemplar, relying entirely on pre-training) \(\rightarrow\) one-shot (1 random exemplar to familiarize the model with the format) \(\rightarrow\) 8-shot mixed (8 mixed-type exemplars exposing various query patterns) \(\rightarrow\) 8-shot metric-specific (8 exemplars of the same type as the evaluation metric to provide targeted guidance).
    • Design Motivation: Experiments demonstrate that exemplar quality (type matching) is far more important than quantity—the performance gains of metric-specific configurations on locality and portability far exceed those from merely increasing the count of mixed exemplars.
  3. Design of four cross-lingual query types:

    • Function: Evaluates the completeness of cross-lingual knowledge editing from different perspectives.
    • Mechanism: Reliability (precisely translated queries) evaluates basic editing capability \(\rightarrow\) Generality (semantically equivalent but differently phrased queries) tests generalization \(\rightarrow\) Locality (irrelevant queries) evaluates knowledge retention \(\rightarrow\) Portability (one-hop inference queries) tests reasoning transfer.
    • Design Motivation: The four query categories progress in difficulty—while rel/gen reach similar levels, loc/port perform significantly worse, revealing the true bottlenecks of cross-lingual IKE.
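
The four settings differ only in which exemplars are placed before the test query, which a short sketch can make concrete. The code below is an illustration under stated assumptions, not the benchmark's actual implementation: the record fields (`edit`, `queries`), the prompt template, the two-per-type composition of the mixed setting, and the random choice of query type in the one-shot setting are all hypothetical.

```python
import random

QUERY_TYPES = ("reliability", "generality", "locality", "portability")


def format_demo(record, query_type):
    """Render one exemplar: the edited fact plus one query and its answer."""
    q = record["queries"][query_type]
    return (
        f"New fact: {record['edit']}\n"
        f"Question: {q['question']}\n"
        f"Answer: {q['answer']}"
    )


def select_exemplars(pool, setting, metric, rng):
    """Return (record, query_type) pairs for the chosen setting."""
    if setting == "zero-shot":
        return []
    if setting == "one-shot":
        # Assumption: both the record and its query type are drawn at random.
        return [(rng.choice(pool), rng.choice(QUERY_TYPES))]
    records = rng.sample(pool, 8)
    if setting == "8-shot mixed":
        # Assumption: cycle through the four types, two exemplars per type.
        return [(r, QUERY_TYPES[i % 4]) for i, r in enumerate(records)]
    if setting == "8-shot metric-specific":
        # All eight exemplars share the query type being evaluated.
        return [(r, metric) for r in records]
    raise ValueError(f"unknown setting: {setting}")


def build_prompt(pool, setting, metric, edit, question, seed=0):
    """Concatenate the exemplars and the test query into one prompt."""
    rng = random.Random(seed)
    demos = [format_demo(r, t) for r, t in select_exemplars(pool, setting, metric, rng)]
    test = f"New fact: {edit}\nQuestion: {question}\nAnswer:"
    return "\n\n".join(demos + [test])


# Toy exemplar pool with hypothetical field names.
pool = [
    {
        "edit": f"fact {i}",
        "queries": {
            t: {"question": f"{t} question {i}?", "answer": f"answer {i}"}
            for t in QUERY_TYPES
        },
    }
    for i in range(8)
]

print(build_prompt(pool, "8-shot metric-specific", "locality", "X", "Y?"))
```

In the cross-lingual task the edit would be stated in the source language and the question posed in the target language; the sketch leaves both as opaque strings.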

Loss & Training

This work does not involve model training. Evaluation metrics are F1 score and Exact Match (EM). Cross-lingual performance is normalized using English EM as the baseline.
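
As a rough illustration of the two metrics, the sketch below computes Exact Match and a token-overlap F1 in the style commonly used for extractive QA. It assumes whitespace tokenization and case-folding as the only normalization; the paper's exact normalization is not stated here, and whitespace splitting is not suitable for unsegmented scripts such as Chinese, Japanese, or Thai. The normalization against English EM is not shown.

```python
from collections import Counter


def normalize(text):
    """Case-fold and split on whitespace (an assumed, minimal normalization)."""
    return text.casefold().split()


def exact_match(prediction, reference):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris"))
print(f1_score("the city of Paris", "Paris"))
```

F1 gives partial credit to a verbose answer that contains the target, whereas EM does not, which is why the two can diverge sharply on generative outputs.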

Key Experimental Results

Main Results

Cross-lingual IKE performance of Llama3.1-8B across three datasets (average F1 of 52 languages):

| Setting | zsRE rel | zsRE port | CF rel | CF loc | WFD rel | WFD port |
|---|---|---|---|---|---|---|
| zero-shot | 65.53 | 10.05 | 63.01 | 18.68 | 67.84 | 4.15 |
| one-shot | 75.27 | 20.81 | 71.92 | 12.66 | 70.53 | 4.15 |
| 8-shot mixed | 74.29 | 25.18 | 75.15 | 11.40 | 68.57 | 8.87 |
| 8-shot metric-specific | 74.86 | 32.86 | 73.88 | 47.55 | 71.98 | 14.58 |

Llama3.2-3B shows the same trend but overall lower performance: CF loc for 8-shot metric-specific is 31.61 (vs. 47.55 for 8B).
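
To make the size of these differences explicit, the short script below subtracts the Llama3.1-8B F1 scores reported above for the locality and portability columns. The values are copied from the results; only the helper name `gain` is introduced for illustration.

```python
# Llama3.1-8B average F1, locality and portability columns only.
f1 = {
    "zero-shot": {"zsRE port": 10.05, "CF loc": 18.68, "WFD port": 4.15},
    "one-shot": {"zsRE port": 20.81, "CF loc": 12.66, "WFD port": 4.15},
    "8-shot mixed": {"zsRE port": 25.18, "CF loc": 11.40, "WFD port": 8.87},
    "8-shot metric-specific": {"zsRE port": 32.86, "CF loc": 47.55, "WFD port": 14.58},
}


def gain(setting, baseline):
    """F1 difference (in points) of `setting` over `baseline`, per column."""
    return {k: round(f1[setting][k] - f1[baseline][k], 2) for k in f1[setting]}


print(gain("8-shot metric-specific", "8-shot mixed"))
# → {'zsRE port': 7.68, 'CF loc': 36.15, 'WFD port': 5.71}
print(gain("one-shot", "zero-shot"))
# → {'zsRE port': 10.76, 'CF loc': -6.02, 'WFD port': 0.0}
```

The second line shows the one-shot drop on CounterFact locality (about 6 points below zero-shot), while the first shows that switching from mixed to metric-specific exemplars at the same shot count adds roughly 36 points on the same column.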

Ablation Study

| Analysis Dimension | Key Findings |
|---|---|
| Model Scale (3B vs. 8B) | 8B comprehensively outperforms 3B, with larger gaps observed in loc/port queries. |
| Dataset Differences | WFD portability shows the lowest performance (involving second-order temporal knowledge chain reasoning). |
| Exemplar Strategy | 8-shot metric-specific > 8-shot mixed > one-shot > zero-shot |
| Script Type | Latin-script languages >> non-Latin-script languages (independent of linguistic family). |
| Linguistic Attribute Correlation | Syntactic similarity is positively correlated with performance (\(p < 0.05\)); phonological similarity is also positively correlated; language family shows no significant correlation. |
| One-shot on Locality | Intrinsically harmful: random exemplars can mislead the model when they do not match the target query type. |

Key Findings

  • Exemplar Quality >> Exemplar Quantity: The improvement of 8-shot metric-specific on loc and port far exceeds that of 8-shot mixed, demonstrating that targeted matching is crucial.
  • Script Type is the Decisive Factor for Cross-Lingual KE: Non-Latin-script languages (regardless of whether they belong to the Indo-European family) consistently perform worse than Latin-script languages, while the impact of language families is insignificant.
  • Language Confusion: In non-Latin-script languages, the model frequently responds in English (even when instructions require the target language), which is the direct cause of the poor performance of non-Latin languages.
  • One-shot Can Be Harmful: For locality queries, a single mismatched random exemplar degrades performance—exemplar strategies need alignment with query types.
  • Portability is the Biggest Bottleneck: Port queries perform the poorest across all configurations, indicating that cross-lingual transfer of knowledge reasoning is a major vulnerability of current LLMs.
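
One rough way to flag the language confusion described above is to check which script a response is written in. The sketch below is a heuristic of our own, not the paper's detection procedure: it measures the share of alphabetic characters whose Unicode name begins with `LATIN`, and the function names and the 0.5 threshold are assumptions.

```python
import unicodedata


def latin_fraction(text):
    """Share of alphabetic characters that belong to the Latin script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(unicodedata.name(ch, "").startswith("LATIN") for ch in letters)
    return latin / len(letters)


def looks_confused(response, target_is_latin_script, threshold=0.5):
    """Flag a mostly-Latin response when the target language uses another script."""
    return (not target_is_latin_script) and latin_fraction(response) > threshold


print(looks_confused("Paris", target_is_latin_script=False))
print(looks_confused("Париж", target_is_latin_script=False))
```

The heuristic cannot detect confusion between two Latin-script languages, and it may over-flag answers that legitimately contain Latin-script named entities, so it is best treated as a first-pass filter before a proper language identifier.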

Highlights & Insights

  • Most comprehensive multilingual knowledge editing benchmark to date: 53 languages \(\times\) 3 KE datasets in a unified format, covering a complete spectrum from standard to counterfactual and temporal scenarios.
  • The "Script Type > Language Family" finding offers new linguistic insights—the weakness of non-Latin scripts is a representation issue rather than a language family issue.
  • Systematic analysis of the language confusion phenomenon (models responding in English to non-Latin queries) provides a crucial reference for multilingual LLM research.
  • The design of the metric-specific exemplar strategy is simple yet highly effective, offering a "quality-first" rule of thumb for ICL research.

Limitations & Future Work

  • Only two models (Llama3.2-3B and Llama3.1-8B) were evaluated, excluding larger \(70\text{B}+\) scales or proprietary models like GPT-4.
  • Multilingual translation relies on GPT-4o, which may introduce systematic translation bias (especially in low-resource languages).
  • No direct comparison is made with gradient-based methods (e.g., ROME/MEMIT), making it difficult to evaluate the relative competitiveness of ICL-based methods.
  • Portability queries in WikiFactDiff are constructed based on automated knowledge graph reasoning, which may introduce noise.
  • Focusing solely on factual knowledge editing, the study does not cover the editing of reasoning rules or commonsense knowledge.

Related Work

  • vs ROME/MEMIT: Gradient-based methods modify specific parameters, which is computationally expensive and inapplicable to closed-source models. The ICL-based method in this work introduces zero parameter updates but exhibits limited cross-lingual transfer capabilities.
  • vs ReMaKE: ReMaKE leverages retrieval-augmented generation for cross-lingual KE, but only targets batch editing scenarios, whereas this work covers a broader range of editing settings.
  • vs Beniwal et al. (EACL 2024): Prior cross-lingual KE works cover a limited set of languages. This work scales to 53 languages and systematically analyzes the impact of linguistic properties.
  • vs Multilingual ICL Studies: Lai et al. (2023) study the multilingual capabilities of English-centric LLMs, whereas this work focuses on cross-lingual transfer specifically in knowledge-editing scenarios.

Rating

  • Novelty: ⭐⭐⭐⭐ The first multilingual KE benchmark of this scale; the insights on script types are innovative.
  • Experimental Thoroughness: ⭐⭐⭐⭐ 53 languages, 3 datasets, 4 setups, and multi-dimensional analysis of linguistic attributes, though with a limited variety of model architectures.
  • Writing Quality: ⭐⭐⭐⭐ Clearly structured with progressive multi-dimensional analyses and rich diagrams.
  • Value: ⭐⭐⭐⭐ Directly contributes to the cross-lingual NLP and knowledge-editing communities, and the benchmark itself is highly reusable.