A Variational Approach for Mitigating Entity Bias in Relation Extraction¶

Conference: ACL 2025
arXiv: 2506.11381
Code: None
Area: NLP Understanding
Keywords: Relation Extraction, Entity Bias, Variational Information Bottleneck, VIB, Debias

TL;DR¶

Proposes applying the Variational Information Bottleneck (VIB) to entity debiasing in relation extraction. Each entity is mapped to a Gaussian distribution \(\mathcal{N}(\mu, \sigma^2)\), which compresses entity-specific information while retaining task-relevant features; the variance \(\sigma^2\) then quantifies how much the model relies on the entity versus the context. The method achieves SOTA in both ID and OOD settings on datasets from three domains: TACRED (general), REFinD (financial), and BioRED (biomedical).

Background & Motivation¶

Background: Relation extraction models often over-rely on the entities themselves (e.g., seeing "Microsoft" and predicting an investment relation) while ignoring contextual clues.

Limitations of Prior Work: Entity masking loses useful information; Structural Causal Models (SCM) replace entity embeddings with convex hull centers, but lack interpretability and incur high computational overhead.

Core Idea: Map entity embeddings to a Gaussian distribution using VIB. A large \(\sigma\) means little entity-specific information is retained, so the model relies more on context, whereas a small \(\sigma\) means more entity information is preserved. Entity-specific information is compressed by minimizing the KL divergence between this distribution and a prior.

Method¶

Key Designs¶

Entity-Selective VIB: The VIB transformation \(z = \mu + \epsilon \cdot \sigma\), with noise \(\epsilon \sim \mathcal{N}(0, I)\), is applied only to entity tokens, while non-entity tokens retain their original embeddings.
Hybrid Embedding: \(x' = x \cdot (1-M) + x \cdot M \cdot (1-\beta) + z \cdot M \cdot \beta\), where \(M\) is the entity mask (1 on entity tokens, 0 elsewhere) and \(\beta\) controls the degree of VIB replacement.
Adaptive Loss Weight: \(\alpha\) is automatically computed based on the ratio of CE and VIB losses.
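
The first two designs can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the toy embeddings, and the `mu`/`sigma` values (which a trained encoder would predict) are all assumptions made for the example.

```python
import random

def reparameterize(mu, sigma, rng):
    """Sample z = mu + eps * sigma with eps ~ N(0, 1), element-wise."""
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mu, sigma)]

def hybrid_embedding(x, z, mask, beta):
    """Compute x' = x*(1-M) + x*M*(1-beta) + z*M*beta for one token sequence.

    x, z: lists of token vectors (lists of floats).
    mask: 1 for entity tokens, 0 for all other tokens.
    """
    out = []
    for x_tok, z_tok, m in zip(x, z, mask):
        out.append([
            xi * (1 - m) + xi * m * (1 - beta) + zi * m * beta
            for xi, zi in zip(x_tok, z_tok)
        ])
    return out

rng = random.Random(0)
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]     # toy token embeddings
mask = [0, 1, 0]                              # only the middle token is an entity
mu = [[0.0, 0.0], [0.5, -0.5], [0.0, 0.0]]    # hypothetical encoder outputs
sigma = [[1.0, 1.0], [0.1, 0.1], [1.0, 1.0]]  # hypothetical encoder outputs
z = [reparameterize(m, s, rng) for m, s in zip(mu, sigma)]
x_prime = hybrid_embedding(x, z, mask, beta=0.5)
```

Non-entity tokens pass through unchanged regardless of `z`, and setting `beta=0.0` recovers the original embeddings everywhere, which matches the role of \(\beta\) as the degree of VIB replacement.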

Loss & Training¶

\(\mathcal{L} = L_{CE} + \alpha L_{VIB}\), where \(L_{VIB} = \mathbb{E}[KL(p(z|x,e) \| r(z|e))]\)
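
A rough sketch of how the two loss terms might be combined is given below. It rests on two assumptions that the summary does not confirm: the prior \(r(z|e)\) is replaced by a standard normal, which gives the closed form \(\tfrac{1}{2}(\sigma^2 + \mu^2 - 1) - \ln\sigma\) per dimension, and \(\alpha\) is taken as the plain ratio of the CE loss to the VIB loss. The helper names are hypothetical.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.

    Assumption: a standard normal prior stands in for r(z|e).
    """
    return sum(
        0.5 * (s * s + m * m - 1.0) - math.log(s)
        for m, s in zip(mu, sigma)
    )

def adaptive_alpha(ce_loss, vib_loss, eps=1e-8):
    """Hypothetical weighting: rescale the VIB term to the magnitude of CE."""
    return ce_loss / (vib_loss + eps)

def total_loss(ce_loss, vib_loss):
    """L = L_CE + alpha * L_VIB with alpha computed from the current losses."""
    alpha = adaptive_alpha(ce_loss, vib_loss)
    return ce_loss + alpha * vib_loss

vib = kl_to_standard_normal(mu=[1.0, 0.0], sigma=[1.0, 1.0])
loss = total_loss(ce_loss=2.0, vib_loss=vib)
```

In an autograd framework, a ratio-based \(\alpha\) would normally be treated as a constant (detached from the graph) so that it only rescales the gradient of the VIB term.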

Key Experimental Results¶

Dataset	Method	ID F1	OOD F1
TACRED	LUKE+VIB	70.4	66.5
TACRED	LUKE+SCM	68.6	64.8
REFinD	LUKE+VIB	75.4	74.8
BioRED	LUKE+VIB	61.2	58.7
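
As a quick worked check on the TACRED rows, the only dataset for which both methods appear in this table, the gains and ID-to-OOD drops can be computed directly. The numbers are copied from the table; the helper names are illustrative.

```python
# F1 scores copied from the TACRED rows of the table above.
scores = {
    "LUKE+VIB": {"ID": 70.4, "OOD": 66.5},
    "LUKE+SCM": {"ID": 68.6, "OOD": 64.8},
}

def gain(setting):
    """F1 gain of LUKE+VIB over LUKE+SCM in one setting, in points."""
    return round(scores["LUKE+VIB"][setting] - scores["LUKE+SCM"][setting], 1)

def id_ood_gap(method):
    """Drop in F1 from ID to OOD for one method, in points."""
    return round(scores[method]["ID"] - scores[method]["OOD"], 1)

print(gain("ID"), gain("OOD"))                         # 1.8 1.7
print(id_ood_gap("LUKE+VIB"), id_ood_gap("LUKE+SCM"))  # 3.9 3.8
```

Averages over all three datasets cannot be recomputed from the rows shown here, since the SCM baseline is listed only for TACRED.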

Key Findings¶

VIB outperforms baselines such as SCM on all three domains, with larger improvements under OOD settings (on average +2.8% on OOD vs. +1.6% on ID).

Variance Interpretability Analysis¶

Relation Type	Variance \(\sigma^2\)	Interpretation
pers:title	Low (0.12)	Entity information is important
org:date	High (0.89)	Relies more on context
pers:org	Mid (0.45)	Both entity and context are important
loc:loc	High (0.78)	Location relations are primarily determined by context
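
One plausible way to obtain a per-relation figure like those in the table is to average the predicted entity variance over all examples of each relation. The summary does not say how the paper aggregates, so the procedure, the function name, and the toy records below are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean

def mean_variance_by_relation(records):
    """Average sigma^2 per relation label.

    records: iterable of (relation_label, sigma) pairs, where sigma is the
    list of per-dimension standard deviations predicted for one entity.
    """
    buckets = defaultdict(list)
    for relation, sigma in records:
        # Mean variance across embedding dimensions for this example.
        buckets[relation].append(mean([s * s for s in sigma]))
    return {rel: mean(vals) for rel, vals in buckets.items()}

# Toy values, not taken from the paper.
toy = [
    ("rel_a", [0.3, 0.4]),
    ("rel_a", [0.5, 0.5]),
    ("rel_b", [1.0, 0.8]),
]
summary = mean_variance_by_relation(toy)
```

Under this reading, a relation with a low average (here `rel_a`) would be one where entity information is kept largely intact, and a high average (`rel_b`) one where the model leans on context.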

Variance analysis shows: the pers:title relation exhibits low variance (entity information is important), while the org:date relation exhibits high variance (more dependent on context), validating the interpretability of VIB.
\(\beta=0.5\) is optimal, neither relying entirely on the original embedding nor entirely on the VIB embedding.

Highlights & Insights¶

Variance serves as an intuitive interpretability metric: it quantifies how much the model relies on the information carried by each entity.
Simpler than SCM, replacing complex neighborhood construction with standard probabilistic tools.

Limitations & Future Work¶

Requires predefined entity positions (dependent on entity markers), making it unable to handle scenarios with unclear entity boundaries.
Although adaptive \(\alpha\) is simple, it may not be the optimal strategy; learning-based weight scheduling could be explored.
The Gaussian distribution assumption might be overly simplistic; more complex distributions (e.g., Mixture of Gaussians) might provide stronger representational capacity.
The effectiveness in nested entity scenarios has not been verified.
Information compression in VIB may lose critical entity information in extreme cases, especially when the entity itself is vital for relation determination.
The integration with prompt learning or in-context learning methods has not been explored.
The performance on larger-scale pre-trained models (such as RoBERTa-large, DeBERTa-v3) has not been tested.

Comparison with Related Methods¶

vs Entity Masking Methods: Entity masking replaces entity names with [MASK] to debias, but loses useful information such as entity types; VIB achieves flexible debiasing through probabilistic compression.
vs SCM (Zhang et al.): SCM replaces entity embeddings with convex hull centers, which is computationally expensive and has poor interpretability; VIB's variance directly quantifies the degree of reliance.
vs Causal Inference Methods: Causal inference requires explicit causal graphs; VIB implicitly achieves debiasing via information-theoretic tools, making it cleaner and simpler.
vs Contrastive Learning Debias: Contrastive learning requires constructing positive and negative pairs; VIB only requires standard supervision signals plus KL regularization.

Supplementary Discussion¶

The core innovation of this method lies in representing each entity as a distribution rather than a single point embedding, so that the learned variance exposes how much the model relies on the entity versus the context.
The experimental design covers various scenarios and baseline comparisons, showing statistically significant results.
The modular design of the method makes it easy to extend to related tasks and new datasets.
No code release is listed for this paper; open-sourcing the code and data would be of high value for community replication and subsequent research.
Compared with concurrent work, this paper has advantages in the depth of problem formulation and the comprehensiveness of experimental analysis.
The writing logic of the paper is clear, forming a complete closed loop from problem definition to method design to experimental verification.
The computational overhead of the method is reasonable, offering deployability in practical applications.
Future work can consider integration with more modalities (such as audio, 3D point clouds).
Validating the scalability of the method on larger-scale data and models is an important future direction.
Combining the method with reinforcement learning to achieve end-to-end optimization can be considered.
Cross-domain transfer is an exploration-worthy direction—the generalization of the method requires more validation.
For edge computing and mobile deployment scenarios, lightweight versions of the method are worth researching.
Long-term evaluation and user studies can provide a more comprehensive assessment of the method.
Comparative analysis with human experts can better locate the strengths and weaknesses of the method.

Rating¶

Novelty: ⭐⭐⭐⭐ VIB for RE debiasing is theoretically sound and intuitive.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive across three domains, ID/OOD, and two backbones.
Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are clear.
Value: ⭐⭐⭐⭐ A debiasing scheme with both interpretability and high performance.