
Uncertain Knowledge Graph Completion via Semi-Supervised Confidence Distribution Learning

Conference: NeurIPS 2025
arXiv: 2510.16601
Code: https://github.com/seucoin/unKR/tree/main/unKR_ssCDL
Area: Graph Learning / Knowledge Graphs
Keywords: Uncertain Knowledge Graph, Confidence Distribution Learning, Semi-Supervised Learning, Meta Self-Training, Knowledge Graph Completion

TL;DR

ssCDL converts triple confidence scores from scalars into Gaussian confidence distributions to capture supervisory signals from neighboring confidence values, and employs meta self-training to generate high-quality pseudo confidence labels for negatively sampled triples, thereby rebalancing the training data. The method significantly outperforms all baselines on both confidence prediction and link prediction for uncertain knowledge graph completion.

Background & Motivation

Background: Uncertain knowledge graphs (UKGs) associate each triple with a confidence score in \([0, 1]\), providing more precise knowledge representations than deterministic knowledge graphs. Representative UKGs include NELL, ConceptNet, and Probase. Existing UKG completion methods (UKGE, PASSLEAF, BEUrRE, UPGAT, etc.) perform link prediction and confidence prediction via embedding learning.

Limitations of Prior Work: The confidence score distribution in real-world UKGs is severely imbalanced. In NELL, for instance, nearly all stored triples have confidence scores above 0.9, since low-confidence triples are typically regarded as erroneous and excluded from storage. Training embeddings on such skewed data causes models to be strongly biased toward the high-confidence regime, yielding poor predictions on low-confidence samples.

Key Challenge: Embedding learning requires sufficient samples across different confidence levels, yet the intrinsic nature of UKGs results in an extreme scarcity of low-confidence data. Two challenges arise: Challenge 1 — how to extract supervisory signals for unseen confidence values from the existing imbalanced labeled data; Challenge 2 — how to generate reliable confidence labels for unlabeled triples produced by negative sampling.

Goal: To address both challenges simultaneously by augmenting labeled data and expanding unlabeled data, thereby improving UKG embedding quality on two fronts.

Key Insight: Confidence is inherently a fuzzy concept (0.77 and 0.78 are not fundamentally distinct). Inspired by label distribution learning in facial age estimation, the paper extends a single scalar confidence to a Gaussian distribution.

Core Idea: Transform triple confidence into a Gaussian distribution to introduce neighboring confidence signals (addressing Challenge 1), and use meta self-training to generate reliable pseudo confidence distributions for negative samples (addressing Challenge 2).

Method

Overall Architecture

ssCDL consists of two core components: CDL-RL (a Confidence Distribution Learning-based Relation Learner) and PCDG (a Pseudo Confidence Distribution Generator). CDL-RL iteratively learns UKG embeddings on labeled data and pseudo-labeled data generated by PCDG. PCDG is optimized via meta-learning, using CDL-RL's performance on labeled data as the meta-objective to assess pseudo-label quality. The two components are alternately optimized through meta self-training.

Key Designs

  1. Confidence Distribution Learning (CDL):

    • Function: Converts a scalar confidence value into a discrete distribution to introduce supervisory signals from neighboring confidence levels.
    • Mechanism: For a triple with confidence \(s\), a 101-dimensional confidence distribution vector \(\boldsymbol{s}\) is generated from a Gaussian with mean \(s\) and standard deviation \(\sigma\), discretized at granularity \(1/100\) (see the first sketch after this list). On the prediction side, the concatenated embeddings of \((h, r, t)\) are passed through a two-layer FCN followed by Softmax to produce the predicted distribution \(\hat{\boldsymbol{s}}\), optimized jointly with a KL-divergence term (distribution similarity) and an MSE term (deviation between the expected and true values). An independent link prediction branch (FCN + Sigmoid + margin-based ranking loss) is also included, with uncertainty weights dynamically balancing the two tasks.
    • Design Motivation: Confidence values near 0.78, such as 0.76, 0.77, and 0.79, can also provide supervisory signals, effectively alleviating data sparsity. Distributional modeling is theoretically equivalent to smoothing the label space — analogous to label smoothing but more structurally grounded.
  2. Pseudo Confidence Distribution Generator (PCDG):

    • Function: Generates high-quality pseudo confidence labels for unlabeled negatively sampled triples.
    • Mechanism: PCDG shares the same architecture as CDL-RL but maintains independent parameters. Meta-learning loop (see the second sketch after this list): PCDG first generates pseudo confidence distributions \(\mathcal{D}_{tmp}\) for the unlabeled data; CDL-RL takes one gradient step on \(\mathcal{D} \cup \mathcal{D}_{tmp}\) to obtain \(\theta^+\); PCDG's parameters \(\eta\) are then updated using \(\mathcal{L}(\mathcal{D}, \theta^+)\) as the meta-objective. Selection strategy: a pseudo distribution is admitted into the training set only if its maximum descriptiveness exceeds a threshold.
    • Design Motivation: Conventional self-training suffers from progressive drift. Meta-learning mitigates this by back-validating whether pseudo labels genuinely benefit CDL-RL, thus enabling quality-aware selection.
  3. Three-Stage Meta Self-Training:

    • Function: Coordinates the training of both components to ensure stability.
    • Mechanism: Stage ① (\(< T_{PCDG}\)): CDL-RL is trained solely on labeled data to stabilize embeddings. Stage ② (\(T_{PCDG} \le \cdot < T_{CDLRL}\)): PCDG training begins, but pseudo labels are not yet fed into CDL-RL. Stage ③ (\(\ge T_{CDLRL}\)): Filtered pseudo labels from PCDG are incorporated into CDL-RL training.
    • Design Motivation: The progressive introduction of components avoids early-stage noise accumulation; threshold-based filtering further ensures pseudo-label quality.
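To make the CDL label construction concrete, here is a minimal sketch (not the authors' code) of turning a scalar confidence into a 101-bin discrete Gaussian distribution. The helper name and normalization are assumptions; the 101 bins and \(1/100\) granularity follow the paper, and \(\sigma = 0.6\) is the fixed value the paper reports (its exact unit convention is an assumption here).

```python
# A minimal sketch of the CDL label construction: a scalar confidence s
# becomes a discrete Gaussian over the grid {0, 0.01, ..., 1.00}, so
# neighboring confidence levels also receive supervisory signal.
# Helper name is hypothetical; bins=101 and sigma=0.6 follow the paper.
import numpy as np

def confidence_distribution(s: float, sigma: float = 0.6, bins: int = 101) -> np.ndarray:
    grid = np.linspace(0.0, 1.0, bins)        # confidence levels at 1/100 granularity
    density = np.exp(-0.5 * ((grid - s) / sigma) ** 2)
    return density / density.sum()            # normalize to a probability vector

d = confidence_distribution(0.78)
print(d.argmax() / 100)                       # 0.78: the mode stays at the true confidence
print(round(d[77] / d[78], 3))                # neighbors such as 0.77 get nearly equal weight
```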
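And a hedged sketch of one meta self-training step between PCDG and CDL-RL. It follows the loop described above (pseudo labels, then a one-step virtual update to \(\theta^+\), then a meta-objective on labeled data), but the PyTorch framing, the function names, and the use of `torch.func.functional_call` are assumptions, not the released implementation.

```python
# A sketch of one meta self-training step; `cdl_rl` and `pcdg` are assumed
# to be nn.Modules with the same architecture, and `loss_fn` is the CDL
# confidence-prediction loss (KL + MSE). Not the authors' released code.
import torch
from torch.func import functional_call

def meta_step(cdl_rl, pcdg, opt_pcdg, loss_fn,
              labeled_x, labeled_y, unlabeled_x, inner_lr=1e-3):
    # 1) PCDG proposes pseudo confidence distributions D_tmp for the
    #    negatively sampled (unlabeled) triples.
    pseudo_y = pcdg(unlabeled_x)

    # 2) Virtual one-step update of CDL-RL on D ∪ D_tmp; create_graph=True
    #    keeps the graph so gradients can later flow back into PCDG
    #    (the second-order computation the limitations section mentions).
    params = dict(cdl_rl.named_parameters())
    inner_loss = (loss_fn(cdl_rl(labeled_x), labeled_y)
                  + loss_fn(cdl_rl(unlabeled_x), pseudo_y))
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    theta_plus = {name: p - inner_lr * g
                  for (name, p), g in zip(params.items(), grads)}

    # 3) Meta-objective L(D, theta^+): update PCDG's parameters eta so that
    #    its pseudo labels genuinely improve CDL-RL on the labeled data.
    meta_loss = loss_fn(functional_call(cdl_rl, theta_plus, (labeled_x,)), labeled_y)
    opt_pcdg.zero_grad()
    meta_loss.backward()     # gradients reach pcdg through pseudo_y
    opt_pcdg.step()
    return meta_loss.detach()
```

Under the three-stage schedule, this step only runs once training passes \(T_{PCDG}\), and the threshold-filtered pseudo labels only enter CDL-RL training after \(T_{CDLRL}\).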

Loss & Training

The total loss of CDL-RL is: \(\mathcal{L} = \frac{1}{2\lambda_{CP}^2}\mathcal{L}_{CP} + \frac{\phi}{2\lambda_{LP}^2}\mathcal{L}_{LP} + \log(\lambda_{CP} \cdot \lambda_{LP})\), where \(\phi=0.1\) limits the disproportionate contribution of \(\mathcal{L}_{LP}\) caused by the large number of negative samples. Fifty negative samples are generated per positive sample for link prediction. Pseudo labels contribute only to \(\mathcal{L}_{CP}\) optimization. Embedding dimensionality is 128 for NL27k and 512 for CN15k.
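As a reading aid, here is how that uncertainty-weighted combination might look in code; the learnable log-parameterization of \(\lambda_{CP}\) and \(\lambda_{LP}\) is an assumption for numerical stability, while \(\phi = 0.1\) follows the paper.

```python
# A sketch of the uncertainty-weighted total loss of CDL-RL
# (assumed log-parameterization; phi = 0.1 as in the paper).
import torch

log_lam_cp = torch.nn.Parameter(torch.zeros(()))  # learnable task uncertainty, CP
log_lam_lp = torch.nn.Parameter(torch.zeros(()))  # learnable task uncertainty, LP

def total_loss(l_cp, l_lp, phi=0.1):
    lam_cp, lam_lp = log_lam_cp.exp(), log_lam_lp.exp()
    return (l_cp / (2 * lam_cp ** 2)
            + phi * l_lp / (2 * lam_lp ** 2)
            + torch.log(lam_cp * lam_lp))
```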

Key Experimental Results

Main Results

| Dataset | Task | Metric | ssCDL | Best Baseline | Gain |
| --- | --- | --- | --- | --- | --- |
| NL27k | Confidence Prediction | MSE↓ | 0.009 | 0.019 (PASSLEAF-RotatE) | −52.6% |
| NL27k | Confidence Prediction | MAE↓ | 0.042 | 0.051 (PASSLEAF-DistMult) | −17.6% |
| NL27k | Link Prediction | WMRR↑ | 0.727 | 0.715 (PASSLEAF-RotatE) | +1.7% |
| NL27k | Link Prediction | Hits@1↑ | 0.636 | 0.586 (PASSLEAF-ComplEx) | +8.5% |
| CN15k | Confidence Prediction | MSE↓ | 0.034 | 0.094 (PASSLEAF-RotatE) | −63.8% |
| CN15k | Confidence Prediction | MAE↓ | 0.116 | 0.248 (PASSLEAF-RotatE) | −53.2% |
| CN15k | Link Prediction | Hits@1↑ | 0.133 | 0.086 (PASSLEAF-ComplEx) | +54.7% |

Ablation Study

| Configuration | NL27k MSE↓ | NL27k MAE↓ | NL27k WMRR↑ | CN15k MSE↓ | CN15k MAE↓ |
| --- | --- | --- | --- | --- | --- |
| ssCDL (full) | 0.009 | 0.042 | 0.727 | 0.034 | 0.116 |
| w/o CDL | 0.015 | 0.057 | 0.586 | 0.044 | 0.141 |
| w/o Meta Self-Training | 0.010 | 0.045 | 0.718 | 0.035 | 0.118 |

Key Findings

  • CDL contributes substantially more than meta self-training: removing CDL increases NL27k MSE from 0.009 to 0.015 (+67%), whereas removing meta self-training raises it only to 0.010 (+11%), indicating that enhancing supervisory signals from labeled data is more impactful than incorporating unlabeled data.
  • On low-confidence triples (\(<0.5\)), ssCDL significantly outperforms all baselines (lowest low-confidence MAE), validating that distribution learning effectively alleviates class imbalance.
  • All methods perform better on NL27k than CN15k, as ConceptNet confidence scores are defined by source frequency, providing insufficient discriminability.
  • Gains on link prediction are less pronounced than on confidence prediction (indirect benefit vs. direct optimization), yet ssCDL comprehensively surpasses all baselines.

Highlights & Insights

  • Cross-domain transfer of label distribution learning: Migrating Label Distribution Learning from facial age estimation to knowledge graph confidence modeling exploits the inherent continuous fuzziness of confidence scores. This insight is straightforward yet effective — suggesting that a "distribution-based" strategy can be applied to any scenario with imbalanced numerical labels.
  • Meta-learning for pseudo-label quality control: Rather than naively using model predictions as pseudo labels, the meta-objective back-validates whether pseudo labels genuinely benefit CDL-RL, yielding a more principled approach than fixed-threshold filtering. PCDG itself continuously evolves, forming a virtuous cycle.
  • Three-stage progressive strategy: The progression from stabilizing embeddings → training the generator → introducing pseudo labels is a simple yet effective scheme for preventing early-stage noise interference.

Limitations & Future Work

  • Experiments are conducted only on two small-scale datasets — NL27k (175K quadruples) and CN15k (241K quadruples); generalization to larger-scale UKGs remains unexplored.
  • PCDG and CDL-RL share an identical architecture (two-layer FCN); lighter-weight or more specialized generator designs are not investigated.
  • The Gaussian standard deviation \(\sigma\) is fixed at 0.6 without adaptive tuning, though different confidence ranges may warrant different degrees of smoothing.
  • Meta-learning involves second-order gradient computation, incurring considerable training overhead and raising scalability concerns.
  • The paper notes quality issues with ConceptNet's confidence definition, and calls for the development of better UKG evaluation benchmarks.

Comparison with Related Work

  • vs. UKGE: Directly predicts scalar confidence using the DistMult scoring function, ignoring confidence imbalance. ssCDL directly addresses this issue.
  • vs. PASSLEAF: Also employs semi-supervised learning but only mitigates the false-negative problem, and is susceptible to progressive drift under conventional self-training. ssCDL's meta self-training framework is more robust.
  • vs. BEUrRE: Models entity uncertainty via box embeddings across dimensions, whereas ssCDL focuses on confidence at the triple level.
  • vs. UPGAT: Leverages subgraph features and GAT without addressing confidence imbalance. ssCDL offers an orthogonal and complementary direction of improvement.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of confidence distribution learning and meta self-training is original, and the cross-domain transfer is commendable.
  • Experimental Thoroughness: ⭐⭐⭐ Limited to two small datasets, though ablation studies, low-confidence analysis, and hyperparameter sensitivity analysis are relatively complete.
  • Writing Quality: ⭐⭐⭐⭐ The structure mapping two challenges to two corresponding solutions is clear, and the methodological derivation is detailed.
  • Value: ⭐⭐⭐ UKG completion is a specialized sub-field, but the confidence distribution learning paradigm has broader transfer potential.