Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models¶

Conference: CVPR 2026
arXiv: 2603.21426
Code: github.com/Jingchensun/beta-kd
Area: Multimodal VLM
Keywords: Knowledge Distillation, Uncertainty Weighting, Bayesian Inference, Gibbs Prior, Multi-task Balancing

TL;DR¶

This paper proposes Beta-KD, an uncertainty-aware knowledge distillation framework grounded in a Bayesian perspective. By modeling teacher supervision as a Gibbs prior and deriving a closed-form solution via Laplace approximation, Beta-KD automatically balances data and teacher signals, consistently improving distillation performance on multimodal VQA benchmarks.

Background & Motivation¶

Knowledge distillation (KD) is a core technique for compressing large models, yet it faces unique challenges in the context of multimodal LLM distillation:

Multi-loss balancing: Distillation losses involve multiple components — cross-entropy (learning from data), KL divergence (learning teacher distributions), feature alignment losses, etc. — each with different scales, gradients, and optimization dynamics.
Capacity gap: The large capacity disparity between teacher and student leads to inconsistent scales and variances in logits and hidden representations.
High cost of weight search: Grid search over loss weights is impractical for large-scale LLMs.

Core Problem: How can one automatically balance data supervision and teacher supervision without manually tuning loss weights?

Method¶

Overall Architecture¶

KD is formulated as a MAP inference problem over student activations, where teacher information serves as a Gibbs prior. The prior's partition function is simplified via Laplace approximation, and the resulting weight \(\beta\) is either learned as a shared scalar (task level) or predicted by a neural network (instance level).

Key Designs¶

Teacher-Informed Gibbs Prior:
- \(p(a^s | a^t, \beta) = \frac{1}{Z_\beta(a^t)} \exp[-\beta \ell(a^s; a^t)]\)
- \(\ell\) can be any alignment energy (FKL, RKL, Cosine, MSE, etc.)
- \(\beta\) controls supervision strength: larger \(\beta\) implies greater trust in the teacher; smaller \(\beta\) implies greater trust in the data.
MAP Inference and Laplace Approximation:
- MAP objective: \(\min_{a^s} -\log p(y|a^s) + \beta\ell(a^s;a^t) + \log Z_\beta(a^t)\)
- After Laplace approximation: \(\log Z_\beta \approx -\frac{d}{2}\log\beta + \text{const}\), where \(d\) is the dimensionality of \(a^s\)
- Final objective: \(\min \mathcal{L}_{CE} + \beta \ell - \frac{d}{2}\log\beta\) (the \(-\frac{d}{2}\log\beta\) term acts as a built-in regularizer on \(\beta\))
Two Uncertainty Granularities:
- Task-level (homoscedastic): \(\beta\) is a shared learnable scalar per task.
- Instance-level (heteroscedastic): \(\beta(x) = g_\phi(h(x)) > 0\), predicted by a lightweight network from the input.
- Instance-level uncertainty allows each sample to have its own data–teacher balance.
Energy Function Design Space Exploration:
- Cosine-Probs is found to perform best (scale-invariant, focuses on directional alignment).
- Pre-softmax logit matching (MSE-Logits, Cosine-Logits) performs poorly for generative MLLMs.
- This finding contrasts with conclusions drawn from discriminative tasks.
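The difference between probability-space and logit-space energies can be illustrated with a small sketch. It assumes that Cosine-Probs means \(1 - \cos\) between the softmax outputs of student and teacher, and that MSE-Logits means the mean squared error between raw logits; the function names are hypothetical and this is not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine_energy(u, v):
    """1 - cos(u, v): zero when u and v point the same way (assumed form)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def cosine_probs(student_logits, teacher_logits):
    """Assumed Cosine-Probs energy: cosine distance between softmax outputs."""
    return cosine_energy(softmax(student_logits), softmax(teacher_logits))

def mse_logits(student_logits, teacher_logits):
    """Assumed MSE-Logits energy: mean squared error on pre-softmax logits."""
    n = len(student_logits)
    return sum((s - t) ** 2 for s, t in zip(student_logits, teacher_logits)) / n

# Student logits equal the teacher's plus a constant offset, so both models
# predict the same distribution even though the raw logits differ.
teacher = [2.0, 1.0, 0.1]
student = [z + 5.0 for z in teacher]
print("Cosine-Probs:", cosine_probs(student, teacher))
print("MSE-Logits:  ", mse_logits(student, teacher))
```

In this toy case the two models agree on every token probability, so the cosine energy is numerically zero while the logit-space MSE still reports a large penalty. The sketch only illustrates the invariance property; it does not reproduce the reported benchmark differences.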

Loss & Training¶

\[\min_{\theta,\phi} \mathcal{L}_{CE}(\theta) + g_\phi(h(x))\ell(\theta) - \frac{d}{2}\log g_\phi(h(x))\]
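The \(-\frac{d}{2}\log\) term in this loss can be traced back to the Laplace step. The following is a sketch of that reasoning rather than the paper's exact argument. It assumes that \(\ell(\cdot\,; a^t)\) has a single minimum \(a^\star\) with \(\ell(a^\star; a^t) = 0\) and a positive-definite Hessian \(H\) at that point, and that \(d\) is the dimensionality of \(a^s\).

\[
\begin{aligned}
Z_\beta(a^t) &= \int \exp\!\left[-\beta\,\ell(a; a^t)\right] da
\;\approx\; \int \exp\!\left[-\tfrac{\beta}{2}(a - a^\star)^\top H\,(a - a^\star)\right] da
= \left(\frac{2\pi}{\beta}\right)^{d/2} |H|^{-1/2}, \\
\log Z_\beta(a^t) &\approx -\frac{d}{2}\log\beta + \underbrace{\frac{d}{2}\log 2\pi - \frac{1}{2}\log|H|}_{\text{constant in } \beta}.
\end{aligned}
\]

Holding \(\ell\) fixed, the \(\beta\)-dependent part of the objective has a unique stationary point:

\[
\frac{\partial}{\partial\beta}\left[\beta\,\ell - \frac{d}{2}\log\beta\right] = \ell - \frac{d}{2\beta} = 0
\quad\Longrightarrow\quad
\beta^\star = \frac{d}{2\,\ell}.
\]

The second derivative \(\frac{d}{2\beta^2}\) is positive, so \(\beta^\star\) is a minimum. Under these assumptions the weight is large when the student is already close to the teacher and shrinks as the alignment energy grows.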

The visual encoder and tokenizer are frozen; only the language backbone is fine-tuned.
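A per-sample version of the objective above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the released code: `task_beta` (an exponentiated learnable scalar) and `instance_beta` (a linear head followed by softplus) are hypothetical parameterizations chosen only to keep \(\beta\) positive, and `d` is treated as a given constant.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z)); non-negative for all z."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def task_beta(log_beta):
    """Task-level beta: one shared scalar, stored as log(beta) (assumption)."""
    return math.exp(log_beta)

def instance_beta(features, weights, bias, eps=1e-6):
    """Instance-level beta(x) = g_phi(h(x)) > 0.

    Hypothetical g_phi: a linear head plus softplus; eps guards against
    underflow to exactly zero for very negative pre-activations.
    """
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return softplus(z) + eps

def beta_kd_loss(ce, energy, beta, d):
    """L_CE + beta * ell - (d / 2) * log(beta) for a single sample."""
    if beta <= 0.0:
        raise ValueError("beta must be strictly positive")
    return ce + beta * energy - 0.5 * d * math.log(beta)

def optimal_beta(energy, d):
    """Closed-form minimizer over beta for a fixed energy: d / (2 * ell)."""
    return d / (2.0 * energy)

# Toy numbers (not from the paper): compare the loss at the closed-form
# beta with the loss at half and at double that value.
ce, energy, d = 1.2, 0.25, 4
best = optimal_beta(energy, d)
for beta in (best / 2, best, best * 2):
    print(f"beta={beta:5.1f}  loss={beta_kd_loss(ce, energy, beta, d):.4f}")
```

For a fixed energy, the loss is lowest at `optimal_beta`, which is the balance point between the \(\beta\ell\) term and the log regularizer. In actual training \(\beta\) would be updated by gradient descent together with the student parameters rather than set in closed form.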

Key Experimental Results¶

Main Results¶

Method	ScienceQA VQA-Acc	ScienceQA IMG-Acc	Setting
CE+JS	48.5	54.8	Baseline
CE+JS w/ Beta-KD (Task)	50.5 (+1.1)	58.1 (+1.7)	Task-level
CE+JS w/ Beta-KD (Instance)	53.3 (+3.9)	66.9 (+10.6)	Instance-level

Ablation Study¶

Configuration	Key Metric	Note
FKL/RKL/JS/TVD and other losses	All show improvement	Method is robust to loss function choice
Task-level vs. Instance-level	Instance-level superior	Fine-grained adaptation is beneficial
2-loss vs. 3-loss	Both effective	Compatible with arbitrary combinations

Key Findings¶

Instance-level uncertainty weighting improves ScienceQA accuracy by up to 4.7 absolute points.
Gains are larger on IMG-Acc (+10.6), suggesting greater benefit for visually grounded questions.
Logit-level matching fails for generative MLLMs, contrary to findings in discriminative settings.
Training dynamics visualization reveals faster convergence, smoother optimization, and closer teacher–student logit alignment.

Highlights & Insights¶

The Bayesian formulation provides an elegant theoretical interpretation of KD: teacher supervision as a Gibbs prior, distillation as MAP inference.
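The Laplace approximation naturally yields the \(-\frac{d}{2}\log\beta\) regularization term, which keeps \(\beta\) from collapsing to zero (the trivial minimizer of \(\beta\ell\) alone), while the \(\beta\ell\) term keeps it from growing without bound whenever \(\ell > 0\).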
The energy function design space exploration offers practical guidance: Cosine-Probs is the preferred choice.
The derivation carries through directly to the implementation: the Laplace-approximated MAP objective is itself the training loss, with \(\beta\) realized as a learnable scalar or a lightweight network.

Limitations & Future Work¶

The instance-level uncertainty network introduces additional parameters and computational overhead.
Experiments are primarily conducted on MobileVLM; validation on larger-scale teachers is limited.
The Laplace approximation assumes local quadratic structure, which may be inaccurate on non-convex losses.
Integration with more recent backbone models (e.g., Qwen2.5-VL) has not been explored.

Related Work¶

Beta-KD is related to Kendall & Gal's multi-task uncertainty weighting but generalizes it to arbitrary distillation losses.
Multimodal KD methods such as LLaVA-KD and Align-KD can benefit from this framework.
BayesKD targets uncertainty over model parameters, while Beta-KD targets uncertainty over activations — representing complementary perspectives.

Rating¶

Novelty: ⭐⭐⭐⭐ The theoretical framework combining Gibbs prior and Laplace approximation is novel.
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple loss combinations, two uncertainty granularities, and six benchmarks.
Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are clear and rigorous.
Value: ⭐⭐⭐⭐ Automatic loss balancing is highly practical for large-model KD.