A Mutual Information Perspective on Knowledge Graph Embedding
Conference: ACL 2025
Code: None
Area: Graph Learning / Knowledge Graphs
Keywords: Knowledge Graph Embedding, Mutual Information Maximization, Semantic Representation, Complex Relation Patterns, Entity-Relation Modeling
TL;DR
This paper proposes a Knowledge Graph Embedding (KGE) framework based on mutual information maximization. It enhances the semantic representation capability of entities and relations by maximizing the mutual information between different components of triples, achieving consistent performance improvements under complex relational patterns (e.g., 1-N, N-1).
Background & Motivation
Background: Knowledge Graph Embedding (KGE) is a core approach to knowledge graph completion, that is, predicting links missing from a knowledge graph. Existing methods represent entities and relations in a low-dimensional vector space and score the plausibility of a triple \((h, r, t)\) with a scoring function. Representative methods include distance-based models such as TransE (translation) and RotatE (rotation), and bilinear models such as ComplEx.
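
To make the notion of a scoring function concrete, here is a minimal pure-Python sketch of the commonly published forms of the three scorers named above. The function names and the list-based vector representation are illustrative assumptions, not the paper's code; margins, regularizers, and the choice between L1 and L2 norms are left out or fixed arbitrarily.

```python
import cmath
import math

def transe_score(h, r, t):
    """TransE: negative L2 distance between h + r and t (real vectors).
    The L1 norm is an equally common choice."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def rotate_score(h, phases, t):
    """RotatE: rotate each complex coordinate of h by the unit-modulus
    relation e^{i * phase}, then sum the moduli of the residual."""
    return -sum(abs(hi * cmath.exp(1j * p) - ti) for hi, p, ti in zip(h, phases, t))

def complex_score(h, r, t):
    """ComplEx: real part of the trilinear product <h, r, conj(t)>."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real

# In all three cases a higher score means a more plausible triple.
print(transe_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]))
print(rotate_score([1 + 0j], [math.pi / 2], [1j]))
print(complex_score([1 + 1j], [1j], [-1 + 1j]))
```

The toy inputs are chosen so that each triple fits its model well: the TransE and RotatE residuals are (numerically) zero, and the ComplEx product is positive.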
Limitations of Prior Work: Existing KGE methods face three core issues: (1) high intra-group similarity: entity embeddings under the same relation can lie too close together to be distinguished; (2) semantic information loss: rich semantic details of entities and relations are compressed or lost during embedding; (3) insufficient reasoning capability: especially in complex relational patterns such as 1-N and N-1, models struggle to capture cases where one head entity corresponds to multiple tail entities (or vice versa).
Key Challenge: Traditional KGE methods only learn embeddings via negative-sampling training of scoring functions, without explicitly preserving the semantic dependency structures among triple components. While learned representations can distinguish positive and negative triples to some extent, they fail to fully capture deep semantic associations between entities and relations.
Goal: Design a general KGE enhancement framework that explicitly maximizes the mutual information between triple components from an information-theoretic perspective, so that embeddings better preserve semantic dependencies and relational structure.
Key Insight: The authors observe that Mutual Information (MI), which measures the shared information between two random variables, has achieved success in fields like contrastive learning and representation learning. Introducing MI into KGE can directly constrain the degree of information sharing between \((h, r)\) and \(t\), as well as between \((r, t)\) and \(h\).
Core Idea: Introduce mutual information maximization as a regularization objective into KGE training to force learned embeddings to retain more semantic dependencies among components, thereby improving capability in modeling complex relation patterns.
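
For reference, the standard definition of mutual information and the InfoNCE lower bound can be written as follows. The symbols \(X\), \(Y\), \(f\), and \(N\) are generic notation introduced here, not the paper's; in this setting \(X\) would play the role of a combined \((h, r)\) representation and \(Y\) the tail entity \(t\).

\[
I(X; Y) = \mathbb{E}_{p(x, y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right]
\]

\[
I(X; Y) \;\ge\; \log N + \mathbb{E}\left[\log \frac{\exp f(x, y)}{\sum_{j=1}^{N} \exp f(x, y_j)}\right]
\]

Here \(f\) is a learned critic (similarity) function and \(y_1, \dots, y_N\) contain the one positive \(y\) plus \(N - 1\) negatives drawn from \(p(y)\). Since the expectation term is never positive, the bound cannot exceed \(\log N\), so the number of candidates caps how much mutual information this estimator can certify.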
Method
Overall Architecture
The method is a plug-and-play framework applicable to various existing KGE models. The input is a triple \((h, r, t)\) from a knowledge graph, which a backbone KGE model maps to entity and relation embedding vectors. On top of this, an additional mutual information maximization module enhances embedding quality by maximizing the mutual information between different parts of the triple. The final training objective jointly optimizes the original KGE loss and the mutual information loss.
Key Designs
- Mutual Information Maximization Module:
  - Function: Maximize the mutual information between \((h, r)\) and \(t\), as well as between \((r, t)\) and \(h\) in triples.
  - Mechanism: Use a lower-bound estimator of mutual information (e.g., InfoNCE or MINE). The combined embedding of \(h\) and \(r\) (e.g., a concatenation or learned transformation) is paired with the corresponding tail entity \(t\) as a positive sample, and with randomly sampled entities as negative samples. Maximizing the lower bound of mutual information between positive pairs forces the embeddings to retain more semantic associations in the vector space. Symmetrically, the same process is applied to \((r, t)\) and \(h\).
  - Design Motivation: While traditional KGE only implicitly learns correlations among components via scoring functions, mutual information maximization provides an explicit constraint on information preservation, ensuring crucial semantic dependencies are not lost.
- Combined Embedding Generation:
  - Function: Fuse head entity and relation embeddings into a joint representation, which forms one side of the mutual information calculation.
  - Mechanism: Employ a learnable projection network to combine \((h, r)\) into a single vector. The combination can be adapted to the characteristics of the underlying KGE model (e.g., addition for TransE, rotation transformation for RotatE).
  - Design Motivation: Directly concatenating \(h\) and \(r\) may lack flexibility. The projection network helps learn a combined representation more suitable for mutual information estimation.
- Multi-Perspective Mutual Information Alignment:
  - Function: Perform mutual information maximization simultaneously from both the head-relation and relation-tail directions.
  - Mechanism: Simultaneously maximize \(I((h,r); t)\) and \(I((r,t); h)\) to ensure semantic information is preserved in either direction. This is particularly helpful for handling 1-N relations (one head pointing to multiple tails) and N-1 relations (multiple heads pointing to one tail).
  - Design Motivation: 1-N and N-1 relations are classic challenges in KGE. Bidirectional mutual information maximization equips the model with stronger discriminative capabilities from both directions.
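
The three designs above can be sketched together in a few lines of pure Python. This is an illustration under stated assumptions, not the authors' implementation: it picks InfoNCE as the estimator, dot-product similarity as the critic, and fixed additive combiners (`combine_hr`, `combine_rt`) in place of the learnable projection network; all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def combine_hr(h, r):
    """Assumed additive combiner for (h, r): predicts the tail as h + r."""
    return [a + b for a, b in zip(h, r)]

def combine_rt(r, t):
    """Assumed additive combiner for (r, t): predicts the head as t - r."""
    return [b - a for a, b in zip(r, t)]

def info_nce(query, positive, negatives, temperature=1.0):
    """InfoNCE loss for one query: negative log-softmax of the positive
    among the positive plus all negatives, with dot-product similarity."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]

def bidirectional_mi_loss(h, r, t, neg_tails, neg_heads, temperature=1.0):
    """Sum of the (h, r) -> t and (r, t) -> h InfoNCE terms."""
    tail_term = info_nce(combine_hr(h, r), t, neg_tails, temperature)
    head_term = info_nce(combine_rt(r, t), h, neg_heads, temperature)
    return tail_term + head_term

# Toy 2-d embeddings where h + r equals t exactly.
h, r, t = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
negatives = [[-1.0, -1.0], [0.0, -1.0]]
loss = bidirectional_mi_loss(h, r, t, negatives, negatives)
print(f"bidirectional InfoNCE loss: {loss:.4f}")
```

Minimizing each term pushes the combined query toward its true counterpart and away from the sampled entities, which is how the lower bound on mutual information is tightened in each direction.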
Loss & Training
The total loss function is the weighted sum of the original loss of the backbone KGE model and the mutual information loss: \(\mathcal{L} = \mathcal{L}_{KGE} + \lambda \cdot \mathcal{L}_{MI}\), where \(\lambda\) controls the strength of mutual information regularization. Minimizing \(\mathcal{L}_{MI}\) corresponds to maximizing lower-bound estimates of mutual information in both directions. Contrastive-learning-style negative sampling is used to construct the positive and negative pairs for training.
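
A minimal sketch of how the two terms combine is shown below. The backbone loss here is a log-sigmoid negative-sampling loss, which is one common choice for KGE training and is an assumption, since the note does not say which loss each backbone uses; `mi` is a placeholder number standing in for a computed mutual information loss, and the function names are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def kge_negative_sampling_loss(pos_score, neg_scores):
    """-log sigmoid(pos) minus the mean over negatives of log sigmoid(-neg)."""
    pos_term = softplus(-pos_score)
    neg_term = sum(softplus(s) for s in neg_scores) / len(neg_scores)
    return pos_term + neg_term

def total_loss(kge_loss, mi_loss, lam):
    """L = L_KGE + lambda * L_MI."""
    return kge_loss + lam * mi_loss

kge = kge_negative_sampling_loss(pos_score=2.0, neg_scores=[-1.0, -3.0])
mi = 0.5  # placeholder value, not a result from the paper
for lam in (0.0, 0.1, 1.0):
    print(lam, round(total_loss(kge, mi, lam), 4))
```

Setting `lam` to zero recovers the plain backbone objective, which corresponds to the "w/o MI" baseline configuration.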
Key Experimental Results
Main Results
| Dataset | Metric | MI-TransE | TransE | MI-RotatE | RotatE | MI-ComplEx | ComplEx |
|---|---|---|---|---|---|---|---|
| FB15k-237 | MRR | Consistent Gain | Baseline | Consistent Gain | Baseline | Consistent Gain | Baseline |
| WN18RR | Hits@1 | Significant Gain | Baseline | Significant Gain | Baseline | Significant Gain | Baseline |
| FB15k-237 | Hits@10 | Consistent Gain | Baseline | Consistent Gain | Baseline | Consistent Gain | Baseline |
The authors conducted extensive experiments on FB15k-237 and WN18RR, applying the MI framework to typical baseline models including TransE, RotatE, and ComplEx. The framework yielded consistent and significant improvements across various metrics like MRR, Hits@1, Hits@3, and Hits@10.
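
For readers unfamiliar with the metrics, the sketch below computes MRR and Hits@k from a list of ranks. It assumes each rank is the 1-based position of the correct entity among all candidates for one test query; the rank values are made up for illustration and are not results from the paper.

```python
def mrr(ranks):
    """Mean reciprocal rank over 1-based ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 2, 10, 50]  # illustrative ranks only
print("MRR:", round(mrr(ranks), 4))
for k in (1, 3, 10):
    print(f"Hits@{k}:", hits_at_k(ranks, k))
```

MRR rewards placing the correct entity near the top of the list, while Hits@k only checks whether it falls inside the top k, so the two can move differently when a method mostly fixes mid-ranked predictions.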
Ablation Study
| Configuration | Key Metric (MRR) | Description |
|---|---|---|
| Full MI Framework | Highest | Complete bidirectional mutual information framework |
| w/o (h,r)→t MI | Obvious decline | Removing mutual information from the head-relation to tail-entity direction |
| w/o (r,t)→h MI | Obvious decline | Removing mutual information from the relation-tail to head-entity direction |
| w/o MI (baseline) | Lowest | Pure baseline model, without mutual information enhancement |
Ablation studies show that both directions of mutual information contribute: removing either one causes a clear drop in MRR, with the \((h, r) \to t\) direction having a slightly larger impact on performance.
Key Findings
- The proposed MI framework is highly generic: it brings consistent improvements across all evaluated baseline models, demonstrating that mutual information maximization is an enhancement strategy orthogonal to specific KGE architectures.
- The improvement is particularly pronounced on complex 1-N and N-1 relation patterns, supporting the claim that bidirectional mutual information alleviates the difficulty of modeling one-to-many and many-to-one relations.
- Visualization analysis indicates that applying the MI framework improves the intra-group discriminability of entity embeddings, spreading different entities further apart in the vector space. This is consistent with the mutual information constraint reducing intra-group similarity.
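
The 1-N and N-1 categories mentioned in these findings are usually assigned per relation from the average number of tails per head and heads per tail. The sketch below uses the 1.5 threshold that is conventional in the KGE literature; whether the paper uses exactly this rule is an assumption, and the function name and toy triples are hypothetical.

```python
from collections import defaultdict

def relation_categories(triples, threshold=1.5):
    """Label each relation as 1-1, 1-N, N-1, or N-N."""
    tails_of = defaultdict(set)  # (relation, head) -> distinct tails
    heads_of = defaultdict(set)  # (relation, tail) -> distinct heads
    for h, r, t in triples:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)

    tails_per_head = defaultdict(list)
    heads_per_tail = defaultdict(list)
    for (r, _), tails in tails_of.items():
        tails_per_head[r].append(len(tails))
    for (r, _), heads in heads_of.items():
        heads_per_tail[r].append(len(heads))

    categories = {}
    for r in tails_per_head:
        avg_tph = sum(tails_per_head[r]) / len(tails_per_head[r])
        avg_hpt = sum(heads_per_tail[r]) / len(heads_per_tail[r])
        head_side = "N" if avg_hpt >= threshold else "1"
        tail_side = "N" if avg_tph >= threshold else "1"
        categories[r] = f"{head_side}-{tail_side}"
    return categories

triples = [
    ("usa", "has_state", "texas"),
    ("usa", "has_state", "ohio"),
    ("usa", "has_state", "utah"),
    ("texas", "located_in", "usa"),
    ("ohio", "located_in", "usa"),
    ("paris", "capital_of", "france"),
]
print(relation_categories(triples))
```

In this toy graph `has_state` comes out as 1-N (one head, several tails) and `located_in` as N-1, matching the reading of those labels used in the Method section.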
Highlights & Insights
- The plug-and-play framework design is the most prominent highlight. Without modifying baseline model architectures, only an additional mutual information loss term needs to be added to boost various KGE models' performance. This level of generality is relatively rare in the KGE field.
- The information-theoretic perspective provides a novel theoretical foundation for KGE. Distinct from prior purely geometric (translation, rotation) or algebraic (bilinear) designs, examining embedding quality from an information preservation standpoint opens up new optimization avenues.
- The bidirectional mutual information formulation could extend to other tasks requiring asymmetric modeling, such as user-item interactions in recommender systems and query-document matching in document retrieval.
Limitations & Future Work
- Lower-bound estimators of mutual information carry inherent bias; how the choice of estimator affects final performance is not fully investigated.
- The extra mutual information calculation module increases training overhead; its scalability on ultra-large KGs remains to be verified.
- Evaluated only on the link prediction task; performance on other downstream tasks of knowledge graph completion (e.g., triple classification, relation prediction) remains unknown.
- No comparison has been made with recent KGE methods based on pre-trained language models (such as KG-BERT).
Related Work & Insights
- vs TransE/RotatE: These methods learn embeddings through translational or rotational geometric constraints. This paper adds a mutual information constraint on top of them, so the two are complementary rather than competing alternatives.
- vs Contrastive learning methods (e.g., SimKGC): Contrastive learning also leverages positive and negative pairs, but SimKGC focuses on contrastive training of text encoders. The proposed mutual information framework directly operates on embedding spaces, making it more lightweight.
- vs Information Bottleneck methods: Information bottleneck compresses representations to eliminate redundancy, whereas this paper pursues the opposite direction—maximizing information retention. Both philosophies exhibit advantages in different scenarios.
Rating
- Novelty: ⭐⭐⭐⭐ Applying the mutual information perspective to KGE is relatively novel, though mutual information maximization has been widely utilized in representation learning.
- Experimental Thoroughness: ⭐⭐⭐⭐ Generalizability is validated across multiple baselines and datasets, and visualization analyses enhance the persuasiveness.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation and a well-organized description of the framework.
- Value: ⭐⭐⭐⭐ The plug-and-play characteristic gives it high practical value, driving meaningful progress in the KGE community.