Credal Ensemble Distillation for Uncertainty Quantification

Conference: AAAI 2026 arXiv: 2511.13766 Code: not publicly released (experimental code provided in supplementary material) Area: Model Compression / Uncertainty Quantification Keywords: Knowledge Distillation, Deep Ensembles, Uncertainty Quantification, Credal Sets, OOD Detection

TL;DR

This paper proposes the Credal Ensemble Distillation (CED) framework, which distills a deep ensemble (DE) teacher into a single-model student called CREDIT. Rather than predicting a single softmax distribution, CREDIT outputs class probability intervals that define a credal set, achieving superior or comparable uncertainty estimation on OOD detection tasks while substantially reducing inference overhead (from 5× to 1×).

Background & Motivation

Uncertainty quantification (UQ) for deep neural networks is critical for model trustworthiness and robustness. Uncertainty is decomposed into two types:

  • Aleatoric Uncertainty (AU): inherent randomness in the data-generating process
  • Epistemic Uncertainty (EU): uncertainty due to insufficient model knowledge

Deep Ensembles (DE) have emerged as a strong UQ baseline by combining multiple independently trained networks, effectively distinguishing AU from EU. However, their key bottleneck is high inference cost: \(M\) models incur \(M\)-fold computation and storage.

Limitations of existing distillation approaches:

Ensemble Distillation (ED): distills DE into a single neural network (SNN) outputting one softmax distribution, but loses EU information (a single distribution cannot express uncertainty about uncertainty).

Ensemble Distribution Distillation (EDD): distills into a model outputting a Dirichlet distribution, but suffers from the lack of ground-truth Dirichlet labels; in practice, accuracy drops severely (74.56% vs. SNN's 91.79% on VGG16), and recent theoretical work has criticized its EU interpretation.

Key Challenge: How can a single forward pass simultaneously preserve predictive capability and EU quantification?

Key Insight: This work replaces single distributions or parametric distributions with credal sets (convex sets of probability distributions defined by class probability intervals) as a second-order uncertainty representation. Credal sets are naturally derived from the multiple predictive distributions of a DE and require no distributional assumptions such as those underlying the Dirichlet.

Method

Overall Architecture

CED consists of three steps: (1) a credal wrapper extracts probability intervals and the cross-probability from the \(M\) softmax outputs of the DE teacher; (2) a CREDIT student model is designed to output a vector in \(\mathbb{R}^{2C+1}\) encoding the cross-probability, interval lengths, and a weight factor; (3) a novel loss function trains the student to match the teacher's credal information. At inference, the cross-probability is used for classification, while the full output reconstructs the credal set for UQ.

Key Designs

  1. Credal Wrapper (Teacher Side):

    • Function: Extracts class probability intervals from the \(M\) predictive distributions of the DE.
    • Mechanism: For each class \(k\), the upper bound is \(\overline{p}_k = \max_m p_{m,k}\) and the lower bound is \(\underline{p}_k = \min_m p_{m,k}\). These intervals define the credal set \(\mathbb{Q}\). A normalized cross-probability is then computed as \(p^*_k = \underline{p}_k + \beta(\overline{p}_k - \underline{p}_k)\), where \(\beta = (1-\sum_k \underline{p}_k)/\sum_k \Delta p_k\) and \(\Delta p_k = \overline{p}_k - \underline{p}_k\) is the interval length.
    • Design Motivation: The cross-probability is the most representative single-point estimate of a probability interval system. The factor \(\beta\) ensures the cross-probability is properly normalized.
  2. CREDIT Student Architecture:

    • Function: Modifies the final layer of a standard SNN to output \(2C+1\) values.
    • Mechanism: The first \(C\) logits pass through a softmax to yield the cross-probability \(\mathbf{p}_S^*\); the next \(C\) pass through a sigmoid to yield interval lengths \(\Delta\mathbf{p}_S\); the final scalar passes through a sigmoid to yield the weight factor \(\beta_S\). The probability intervals are reconstructed as \(\underline{p}_{S,k} = p^*_{S,k} - \beta_S \Delta p_{S,k}\) and \(\overline{p}_{S,k} = p^*_{S,k} + (1-\beta_S)\Delta p_{S,k}\).
    • Design Motivation: The key constraint is to guarantee that the reconstructed probability intervals are valid (\(\underline{p} \leq p^* \leq \overline{p}\), \(\sum \underline{p} \leq 1 \leq \sum \overline{p}\)). This is ensured by the combination of softmax and sigmoid activations, as verified mathematically.
  3. Distillation Loss:

    • Function: Trains CREDIT to match the DE teacher's credal information.
    • Mechanism: \(\mathcal{L}_{ced} = \text{CE}(\mathbf{p}^*, \mathbf{p}_S^*) + \text{MSE}(\Delta\mathbf{p}, \Delta\mathbf{p}_S) + \text{MSE}(\beta, \beta_S)\). The first term (cross-entropy) preserves predictive performance; the latter two (MSE) capture the imprecision of the credal set.
    • Design Motivation: The credal set distillation objective is decomposed into three independently optimizable targets: accurate classification, interval width, and interval position. Temperature scaling is also applied to the teacher targets, with \(T=2.5\) used in the main experiments.
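The three components above can be sketched directly from the stated formulas. The following is a minimal NumPy illustration, not the authors' implementation: the function names (`credal_wrapper`, `student_head`, `ced_loss`) are my own, and in practice the student head would sit on top of a trained backbone and the loss would be computed batch-wise in an autodiff framework.

```python
import numpy as np

def credal_wrapper(P):
    """Teacher side: P has shape (M, C), the M ensemble softmax outputs.
    Returns the cross-probability p*, interval lengths, and beta."""
    upper = P.max(axis=0)                       # per-class upper bound
    lower = P.min(axis=0)                       # per-class lower bound
    delta = upper - lower                       # interval lengths
    beta = (1.0 - lower.sum()) / delta.sum()    # normalizing weight factor
    p_star = lower + beta * delta               # sums to 1 by construction
    return p_star, delta, beta

def student_head(z):
    """Student side: map a raw (2C+1)-vector of logits to credal outputs.
    First C logits -> softmax (cross-probability), next C -> sigmoid
    (interval lengths), last scalar -> sigmoid (weight factor)."""
    C = (len(z) - 1) // 2
    e = np.exp(z[:C] - z[:C].max())
    p_star = e / e.sum()
    delta = 1.0 / (1.0 + np.exp(-z[C:2 * C]))
    beta = 1.0 / (1.0 + np.exp(-z[-1]))
    lower = p_star - beta * delta               # reconstructed lower bounds
    upper = p_star + (1.0 - beta) * delta       # reconstructed upper bounds
    return p_star, delta, beta, lower, upper

def ced_loss(teacher, student):
    """CE on cross-probabilities plus MSE on interval lengths and beta."""
    (p_t, d_t, b_t), (p_s, d_s, b_s) = teacher, student
    ce = -(p_t * np.log(p_s + 1e-12)).sum()
    return ce + np.mean((d_t - d_s) ** 2) + (b_t - b_s) ** 2
```

Note that \(\beta \in [0,1]\) on the teacher side follows from \(\sum_k \underline{p}_k \leq 1 \leq \sum_k \overline{p}_k\), so the cross-probability always lies inside the intervals.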

Uncertainty Quantification

From the reconstructed credal set \(\mathbb{Q}_S\) of CREDIT, the following quantities are computed via constrained optimization:

  • TU (Total Uncertainty) = \(\overline{H}(\mathbb{Q}_S)\), the maximum Shannon entropy attained by any distribution in the credal set
  • AU (Aleatoric Uncertainty) = \(\underline{H}(\mathbb{Q}_S)\), the minimum Shannon entropy
  • EU (Epistemic Uncertainty) = \(\overline{H} - \underline{H}\); the width of this entropy interval reflects insufficient model knowledge
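To illustrate what these constrained optimizations look like, here is a small pure-NumPy sketch for a credal set given by probability intervals. It mirrors the formulation but is not the authors' solver: the max-entropy case uses the standard characterization that the maximizer clips a common level to the bounds (found by bisection), and the min-entropy case exploits that a concave function is minimized at a vertex of the polytope and enumerates vertices, which is tractable only for small \(C\) (consistent with the cost for \(C > 10\) noted below). Function names are my own.

```python
import itertools
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero entries."""
    p = p[p > 1e-12]
    return float(-(p * np.log(p)).sum())

def max_entropy(lower, upper):
    """Upper entropy H-bar: the maximizer equalizes probabilities where the
    bounds allow, i.e. p_k = clip(c, lower_k, upper_k) for a common level c
    chosen by bisection so the probabilities sum to 1."""
    lo_c, hi_c = 0.0, 1.0
    for _ in range(100):
        c = 0.5 * (lo_c + hi_c)
        if np.clip(c, lower, upper).sum() < 1.0:
            lo_c = c
        else:
            hi_c = c
    return entropy(np.clip(c, lower, upper))

def min_entropy(lower, upper):
    """Lower entropy H-underbar: entropy is concave, so the minimum sits at
    a vertex of {lower <= p <= upper, sum p = 1}, where all but at most one
    coordinate lies on a bound. Enumerate those vertices (small C only)."""
    C = len(lower)
    best = np.inf
    for free in range(C):
        others = [k for k in range(C) if k != free]
        for choice in itertools.product([0, 1], repeat=C - 1):
            p = np.empty(C)
            for k, at_upper in zip(others, choice):
                p[k] = upper[k] if at_upper else lower[k]
            p[free] = 1.0 - p[others].sum()     # residual mass
            if lower[free] - 1e-9 <= p[free] <= upper[free] + 1e-9:
                best = min(best, entropy(np.clip(p, 0.0, 1.0)))
    return best
```

For example, with `lower = [0.5, 0.1, 0.1]` and `upper = [0.8, 0.3, 0.3]`, the max-entropy distribution is \([0.5, 0.25, 0.25]\) and the min-entropy vertex is \([0.8, 0.1, 0.1]\); the gap between the two entropies is the EU score.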

Key Experimental Results

Main Results (VGG16, CIFAR-10 vs. SVHN OOD Detection)

| Method  | AUROC (EU) | AUROC (TU) | AUPRC (EU) | AUPRC (TU) | Inference Time |
|---------|-----------|-----------|-----------|-----------|----------------|
| DE (5×) | 89.99 | 91.53 | 93.78 | 95.09 | 5 × 2.22 s |
| SNN     | /     | 89.44 | /     | 93.71 | 2.22 s |
| ED      | /     | 91.07 | /     | 94.51 | 2.22 s |
| EDD*    | 90.94 | 90.96 | 93.66 | 93.78 | 2.22 s |
| MCDO    | 51.42 | 89.12 | 74.72 | 93.64 | 2.22 s |
| CED     | 93.56 | 92.51 | 96.09 | 95.21 | 2.26 s |

Ablation Study (ResNet50 + CIFAR-10-C OOD)

| Method | AUROC (EU) ↑ | AUROC (TU) ↑ | Accuracy | Note |
|--------|-------------|-------------|----------|------|
| DE   | 87.78 | 94.08 | 93.40 | 5-model ensemble, performance upper bound |
| CED  | 96.80 | 95.23 | 91.77 | Single model; EU surpasses DE |
| ED   | /     | 94.09 | 92.02 | No EU estimation capability |
| EDD* | 89.48 | 91.04 | 80.38 | Severe accuracy degradation |

Key Findings

  • CED's EU estimation significantly outperforms all baselines: On VGG16/SVHN, CED achieves EU-AUROC of 93.56%, substantially surpassing DE (89.99%) and EDD* (90.94%), indicating that credal sets capture EU more faithfully than DE's discrete sampling or the Dirichlet distribution.
  • CED does not compromise accuracy: CED (92.23%) is on par with ED (92.18%) and SNN (91.79%), whereas EDD accuracy collapses to 74.56% on VGG16.
  • EU vs. TU: CED performs better at OOD detection using EU than TU, whereas other methods benefit more from TU, indicating a qualitative improvement in CED's EU estimation.
  • Inference efficiency: CED inference time (2.26s) is nearly identical to SNN (2.22s), compared to 5×2.22s = 11.1s for DE.
  • Ensemble size ablation: DE performance continues to improve with ensemble size, whereas CED's gains saturate around \(M=5\), indicating that a moderately sized ensemble teacher already suffices for distillation.
  • Temperature scaling: \(T=2.5\) yields the best performance; excessively high temperatures (\(T=10\)) degrade results.
  • Medical Imaging Case Study (Camelyon17): CED achieves EU AUARC of 97.12% under OOD settings, outperforming DE (95.92%).

Highlights & Insights

  • Introducing credal sets into knowledge distillation is an elegant innovation: credal sets, as a second-order representation, are more flexible than Dirichlet (requiring no distributional assumptions) and more compact than DE (single model).
  • The CREDIT architecture is minimally invasive, adding only \(C+1\) output nodes with no modifications to the backbone.
  • The paper mathematically proves that CREDIT's output probability intervals are always valid (satisfying credal set conditions), providing an important correctness guarantee from an engineering perspective.
  • The loss function design is conceptually clear: CE preserves classification and MSE preserves imprecision, without requiring complex learning strategies as in EDD.

Limitations & Future Work

  • Scalability to large label spaces: When \(C\) is large (100 or 1,000 classes), softmax produces very small probability values, which may destabilize the regression loss.
  • Calibration: CED's ECE (6.71%) is considerably higher than DE's (1.46%), indicating that calibration requires further improvement.
  • Optimization overhead: Computing \(\overline{H}\) and \(\underline{H}\) requires solving constrained optimization problems, which may incur non-negligible cost when \(C > 10\).
  • Evaluation limited to classification: Generalization to regression, detection, and other tasks remains unexplored.
  • Dependence on teacher quality: CED's performance ceiling is bounded by the DE teacher.

Comparisons & Takeaways

  • Compared to BNNs: CED does not require a posterior distribution over weights, making training considerably simpler.
  • Compared to EDD: CED avoids both the missing ground-truth Dirichlet labels and the accuracy degradation associated with EDD.
  • Credal sets have a rich theoretical foundation in classical machine learning (Levi 1980, imprecise probability theory); their introduction into deep learning distillation represents an effective bridge between theory and practice.
  • Insight: Probability intervals are better suited than point estimates or parametric distributions for expressing "what the model does not know."

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (First proposal of credal sets combined with distillation; theoretically well-motivated)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Multiple backbones, datasets, ablations, and a medical imaging case study)
  • Writing Quality: ⭐⭐⭐⭐ (Content-dense but clearly structured)
  • Value: ⭐⭐⭐⭐⭐ (Strong candidate for a new standard method in UQ; highly practical)