A Variational Approach for Mitigating Entity Bias in Relation Extraction

Conference: ACL 2025
arXiv: 2506.11381
Code: None
Area: NLP Understanding
Keywords: Relation Extraction, Entity Bias, Variational Information Bottleneck, VIB, Debias

TL;DR

Proposes applying the Variational Information Bottleneck (VIB) to entity debiasing in relation extraction. Entities are mapped to a probability distribution \(\mathcal{N}(\mu, \sigma^2)\), which compresses entity-specific information while retaining task-relevant features; the variance \(\sigma^2\) then quantifies how much the model relies on entities versus context. The method achieves SOTA in both ID and OOD settings on three datasets from different domains: TACRED, REFinD, and BioRED.

Background & Motivation

Background: Relation extraction models often overly rely on the information of the entities themselves (e.g., "Microsoft" \(\rightarrow\) investment relation) while ignoring contextual clues.

Limitations of Prior Work: Entity masking loses useful information; Structural Causal Models (SCM) replace entity embeddings with convex hull centers, but lack interpretability and incur high computational overhead.

Core Idea: Map entity embeddings to a Gaussian distribution using VIB. A large \(\sigma\) means the model has less knowledge of the entity and relies more on context, whereas a small \(\sigma\) means more entity information is preserved. Entity-specific information is compressed by minimizing the KL divergence.

Method

Key Designs

  1. Entity-Selective VIB: The VIB transformation \(z = \mu + \epsilon \cdot \sigma\), with \(\epsilon \sim \mathcal{N}(0, I)\), is applied only to entity tokens; non-entity tokens retain their original embeddings.
  2. Hybrid Embedding: \(x' = x \cdot (1-M) + x \cdot M \cdot (1-\beta) + z \cdot M \cdot \beta\), where \(M\) is the entity-token mask and \(\beta\) controls the degree of VIB replacement.
  3. Adaptive Loss Weight: \(\alpha\) is automatically computed based on the ratio of CE and VIB losses.
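
The first two designs can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy embeddings, and the values of `mu` and `sigma` (which a trained model would predict from the input) are assumptions made for illustration.

```python
import random

def vib_sample(mu, sigma, rng):
    """Reparameterized sample z = mu + eps * sigma, with eps ~ N(0, 1) per dimension."""
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mu, sigma)]

def hybrid_embeddings(x, mu, sigma, mask, beta, rng):
    """Blend original and VIB embeddings token by token.

    x, mu, sigma: one vector (list of floats) per token
    mask: 1 for entity tokens, 0 otherwise (the M in the formula)
    beta: degree of VIB replacement, in [0, 1]
    """
    out = []
    for x_t, mu_t, sigma_t, m in zip(x, mu, sigma, mask):
        z_t = vib_sample(mu_t, sigma_t, rng)
        # x' = x*(1-M) + x*M*(1-beta) + z*M*beta
        out.append([
            xi * (1 - m) + xi * m * (1 - beta) + zi * m * beta
            for xi, zi in zip(x_t, z_t)
        ])
    return out

rng = random.Random(0)
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]      # three tokens, 2-dim toy embeddings
mu = [[0.0, 0.0], [0.5, 0.5], [0.0, 0.0]]     # hypothetical predicted means
sigma = [[1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]  # sigma = 0 on the entity token makes z equal mu
mask = [0, 1, 0]                              # only the middle token is an entity
x_prime = hybrid_embeddings(x, mu, sigma, mask, beta=0.5, rng=rng)
```

Because the mask is 0 for the first and last tokens, their embeddings pass through unchanged regardless of the sampled noise; only the entity token is moved halfway toward its VIB sample when `beta=0.5`.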

Loss & Training

\(\mathcal{L} = L_{CE} + \alpha L_{VIB}\), where \(L_{VIB} = \mathbb{E}[KL(p(z|x,e) \| r(z|e))]\)
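
As a rough sketch of the VIB term, the KL divergence between a diagonal Gaussian and a standard normal has a well-known closed form. The code below assumes the prior \(r(z|e)\) is \(\mathcal{N}(0, I)\), which this summary does not state, and it takes \(\alpha\) as a plain input because the exact adaptive rule is not given here. All function names are illustrative.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.

    Closed form: 0.5 * sum(sigma^2 + mu^2 - 1 - ln(sigma^2)).
    Assumes a standard normal prior; the actual prior may differ.
    """
    return 0.5 * sum(
        s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )

def total_loss(ce_loss, vib_loss, alpha):
    """L = L_CE + alpha * L_VIB, with alpha supplied by the caller."""
    return ce_loss + alpha * vib_loss

# The KL term vanishes when the posterior already matches the prior,
# and grows as mu moves away from 0 or sigma moves away from 1.
kl_match = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
kl_shifted = kl_to_standard_normal([1.0], [1.0])
```

Minimizing this term pulls each entity's distribution toward the shared prior, which is what discards entity-specific information.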

Key Experimental Results

Dataset | Method   | ID F1 | OOD F1
--------|----------|-------|-------
TACRED  | LUKE+VIB | 70.4  | 66.5
TACRED  | LUKE+SCM | 68.6  | 64.8
REFinD  | LUKE+VIB | 75.4  | 74.8
BioRED  | LUKE+VIB | 61.2  | 58.7
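
The gaps implied by these rows can be recomputed with a few lines of arithmetic. The snippet uses only the numbers listed above; the variable names are illustrative, and the VIB-vs-SCM comparison is limited to TACRED because SCM rows for REFinD and BioRED are not listed here.

```python
# (dataset, method) -> (ID F1, OOD F1), copied from the table above
results = {
    ("TACRED", "LUKE+VIB"): (70.4, 66.5),
    ("TACRED", "LUKE+SCM"): (68.6, 64.8),
    ("REFinD", "LUKE+VIB"): (75.4, 74.8),
    ("BioRED", "LUKE+VIB"): (61.2, 58.7),
}

def ood_drop(id_f1, ood_f1):
    """F1 points lost when moving from the ID to the OOD setting."""
    return round(id_f1 - ood_f1, 1)

drops = {key: ood_drop(*scores) for key, scores in results.items()}

vib = results[("TACRED", "LUKE+VIB")]
scm = results[("TACRED", "LUKE+SCM")]
# (ID gain, OOD gain) of VIB over SCM on TACRED, in F1 points
tacred_gain = (round(vib[0] - scm[0], 1), round(vib[1] - scm[1], 1))
```

On TACRED this gives a gain of 1.8 F1 (ID) and 1.7 F1 (OOD) for VIB over SCM; cross-dataset averages cannot be recomputed from these four rows alone.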

Key Findings

  • VIB outperforms baselines such as SCM on all three domains, with larger improvements under OOD settings (average gain of +2.8% OOD vs. +1.6% ID).

Variance Interpretability Analysis

Relation Type | Variance \(\sigma^2\) | Interpretation
--------------|-----------------------|---------------
pers:title    | Low (0.12)            | Entity information is important
org:date      | High (0.89)           | Relies more on context
pers:org      | Mid (0.45)            | Both entity and context are important
loc:loc       | High (0.78)           | Location relations are primarily determined by context

  • Variance analysis shows: the pers:title relation exhibits low variance (entity information is important), while the org:date relation exhibits high variance (more dependent on context), validating the interpretability of VIB.
  • \(\beta=0.5\) is optimal, neither relying entirely on the original embedding nor entirely on the VIB embedding.
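
One way to read the variance table programmatically is to bucket each relation's variance and rank relations by it. The sketch below uses the four values reported above; the 0.3 and 0.6 thresholds are illustrative assumptions chosen to reproduce the Low/Mid/High labels, not values taken from the paper.

```python
# Per-relation variances as reported in the table above
relation_variance = {
    "pers:title": 0.12,
    "org:date": 0.89,
    "pers:org": 0.45,
    "loc:loc": 0.78,
}

def reliance_level(variance, low=0.3, high=0.6):
    """Bucket a variance value; the thresholds are hypothetical."""
    if variance < low:
        return "Low"    # entity information is largely preserved
    if variance > high:
        return "High"   # the model leans mostly on context
    return "Mid"

levels = {rel: reliance_level(v) for rel, v in relation_variance.items()}

# Relations ordered from most entity-reliant to most context-reliant
ranked = sorted(relation_variance, key=relation_variance.get)
```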

Highlights & Insights

  • Variance serves as an intuitive interpretability metric: it quantifies how much the model relies on each entity's information.
  • Simpler than SCM, replacing complex neighborhood construction with standard probabilistic tools.

Limitations & Future Work

  • Requires predefined entity positions (dependent on entity markers), making it unable to handle scenarios with unclear entity boundaries.
  • Although adaptive \(\alpha\) is simple, it may not be the optimal strategy; learning-based weight scheduling could be explored.
  • The Gaussian distribution assumption might be overly simplistic; more complex distributions (e.g., Mixture of Gaussians) might provide stronger representational capacity.
  • The effectiveness in nested entity scenarios has not been verified.
  • Information compression in VIB may lose critical entity information in extreme cases, especially when the entity itself is vital for relation determination.
  • The integration with prompt learning or in-context learning methods has not been explored.
  • The performance on larger-scale pre-trained models (such as RoBERTa-large, DeBERTa-v3) has not been tested.

Comparison with Related Methods

  • vs. Entity Masking Methods: Entity masking replaces entity names with [MASK] to debias, but loses useful information such as entity types; VIB achieves flexible debiasing through probabilistic compression.
  • vs. SCM (Zhang et al.): SCM replaces entity embeddings with convex hull centers, which is computationally expensive and has poor interpretability; VIB's variance directly quantifies the degree of reliance.
  • vs. Causal Inference Methods: Causal inference requires explicit causal graphs; VIB implicitly achieves debiasing via information-theoretic tools, making it cleaner and simpler.
  • vs. Contrastive Learning Debiasing: Contrastive learning requires constructing positive and negative pairs; VIB only requires standard supervision signals plus KL regularization.

Supplementary Discussion

  • The core innovation of this method lies in representing each entity as a distribution rather than a single point embedding, so that the variance provides an additional dimension for analyzing entity reliance.
  • The experimental design covers various scenarios and baseline comparisons, showing statistically significant results.
  • The modular design of the method makes it easy to extend to related tasks and new datasets.
  • No code is listed for this paper (Code: None); open-sourcing the code and data would be of high value for community replication and subsequent research.
  • Compared with concurrent work, this paper has advantages in the depth of problem formulation and the comprehensiveness of experimental analysis.
  • The writing logic of the paper is clear, forming a complete closed loop from problem definition to method design to experimental verification.
  • The computational overhead of the method is reasonable, offering deployability in practical applications.
  • Future work can consider integration with more modalities (such as audio, 3D point clouds).
  • Validating the scalability of the method on larger-scale data and models is an important future direction.
  • Combining the method with reinforcement learning to achieve end-to-end optimization can be considered.
  • Cross-domain transfer is a direction worth exploring; the generalization of the method requires more validation.
  • For edge computing and mobile deployment scenarios, lightweight versions of the method are worth researching.
  • Long-term evaluation and user studies can provide a more comprehensive assessment of the method.
  • Comparative analysis with human experts can better locate the strengths and weaknesses of the method.

Rating

  • Novelty: ⭐⭐⭐⭐ VIB for RE debiasing is theoretically sound and intuitive.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive across three domains, ID/OOD, and two backbones.
  • Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are clear.
  • Value: ⭐⭐⭐⭐ A debiasing scheme with both interpretability and high performance.