Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models

Conference: CVPR 2026
arXiv: 2603.21426
Code: github.com/Jingchensun/beta-kd
Area: Multimodal VLM
Keywords: Knowledge Distillation, Uncertainty Weighting, Bayesian Inference, Gibbs Prior, Multi-task Balancing

TL;DR

Proposes Beta-KD, an uncertainty-aware knowledge distillation framework from a Bayesian perspective. By modeling teacher supervision as a Gibbs prior and deriving a closed-form solution via Laplace approximation, it automatically adjusts the balance between data and teacher signals, consistently improving distillation performance on multimodal VQA benchmarks.

Background & Motivation

Knowledge Distillation (KD) is a core technique for compressing large models, but it faces specific challenges in Multimodal LLM (MLLM) distillation:

  • Multi-loss Balancing Difficulty: Distillation combines several loss terms (cross-entropy for learning from data, KL divergence for learning from teacher distributions, and feature alignment), each with a different scale, gradient magnitude, and optimization dynamics.
  • Capacity Gap: The significant capacity difference between teacher and student models leads to inconsistent scales and variances in logits and hidden representations.
  • High Cost of Weight Search: Grid searching for weights is impractical for large-scale LLMs.

Core Problem: How to automatically balance data supervision and teacher supervision without manual weight tuning?

Method

Overall Architecture

Beta-KD recasts the balancing problem as MAP (Maximum A Posteriori) inference over the student's activations. The data enters through the cross-entropy likelihood, and the teacher enters through a Gibbs prior whose strength is controlled by a coefficient \(\beta\). A Laplace approximation replaces the intractable partition function of that prior with a closed-form expression in \(\beta\), and a lightweight network then predicts \(\beta\), so the data and teacher supervision paths are weighted automatically without grid search.

Key Designs

1. Teacher-Informed Gibbs Prior: Writing "Teacher Trust" as an Adjustable Temperature

To balance data and teacher automatically, Beta-KD first gives teacher supervision a probabilistic form. The teacher's constraint on the student activations \(a^s\) is written as a Gibbs prior \(p(a^s \mid a^t, \beta) = \frac{1}{Z_\beta(a^t)} \exp[-\beta\,\ell(a^s; a^t)]\), where \(a^t\) denotes the teacher activations, \(Z_\beta(a^t)\) is the partition function, and the alignment energy \(\ell\) can take any form such as FKL, RKL, Cosine, or MSE. The coefficient \(\beta\) acts as the degree of trust in the teacher: a large \(\beta\) concentrates the prior around the teacher's output, while a small \(\beta\) flattens it and lets the data term dominate. The balancing problem thus reduces to learning a single positive scalar.
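As a concrete illustration of how \(\beta\) controls trust, the sketch below evaluates the Gibbs prior for one special case chosen here for convenience: a quadratic energy \(\ell = \frac{1}{2}\lVert a^s - a^t\rVert^2\), for which the prior is a Gaussian with precision \(\beta\) and \(Z_\beta\) is known exactly. The function names and the toy vectors are hypothetical and are not taken from the Beta-KD code.

```python
import math


def quadratic_energy(a_s, a_t):
    """Illustrative alignment energy: 0.5 * ||a_s - a_t||^2 (an assumption, not the paper's choice)."""
    return 0.5 * sum((s - t) ** 2 for s, t in zip(a_s, a_t))


def gibbs_log_prior(a_s, a_t, beta):
    """log p(a_s | a_t, beta) for the quadratic energy, where Z_beta = (2*pi/beta)^(d/2) exactly."""
    d = len(a_s)
    log_z = 0.5 * d * math.log(2.0 * math.pi / beta)
    return -beta * quadratic_energy(a_s, a_t) - log_z


teacher = [1.0, -2.0, 0.5]
near = [1.1, -2.0, 0.4]   # student activations close to the teacher
far = [3.0, 0.0, -1.5]    # student activations far from the teacher

for beta in (0.5, 5.0):
    print(beta, gibbs_log_prior(near, teacher, beta), gibbs_log_prior(far, teacher, beta))
```

With the larger \(\beta\), the log density rises for the student that is close to the teacher and drops sharply for the one that is far away, which is the sense in which \(\beta\) encodes trust in the teacher.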

2. MAP Inference + Laplace Approximation: Naturally Deriving a Regularization Term

Viewing distillation as MAP inference over student activations, the objective is \(\min_{a^s} -\log p(y\mid a^s) + \beta\ell(a^s; a^t) + \log Z_\beta(a^t)\). The difficulty lies in the partition function \(Z_\beta\), which is intractable for general \(\ell\). A Laplace approximation gives \(\log Z_\beta \approx -\frac{d}{2}\log \beta + \text{const}\), where \(d\) is the dimensionality of \(a^s\). Substituting this back yields the final objective \(\min \mathcal{L}_{CE} + \beta\ell - \frac{d}{2}\log\beta\). The term \(-\frac{d}{2}\log\beta\) is a regularizer that falls out of the derivation rather than being added by hand: it grows without bound as \(\beta \to 0\), so \(\beta\) cannot collapse to zero, while \(\beta\ell\) penalizes arbitrarily large \(\beta\).
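The Laplace step can be sketched as follows. This is a reconstruction under stated assumptions, not the paper's own derivation: the energy is assumed to have a unique minimizer \(a^\star\) with \(\ell(a^\star; a^t) = 0\) and a positive-definite Hessian \(H\) there, and \(d\) is the dimensionality of \(a^s\).

```latex
% Assumptions (not stated in the summary): \ell(\cdot; a^t) has a unique minimizer a^\star
% with \ell(a^\star; a^t) = 0 and positive-definite Hessian H; d = \dim(a^s).
\begin{aligned}
Z_\beta(a^t) &= \int \exp\!\big[-\beta\,\ell(a; a^t)\big]\,da \\
\ell(a; a^t) &\approx \tfrac{1}{2}\,(a - a^\star)^\top H\,(a - a^\star) \\
Z_\beta(a^t) &\approx \int \exp\!\big[-\tfrac{\beta}{2}\,(a - a^\star)^\top H\,(a - a^\star)\big]\,da
  = \Big(\frac{2\pi}{\beta}\Big)^{d/2} |H|^{-1/2} \\
\log Z_\beta(a^t) &\approx -\frac{d}{2}\log\beta
  + \underbrace{\frac{d}{2}\log 2\pi - \frac{1}{2}\log|H|}_{\text{constant in } \beta}
\end{aligned}
```

Only the \(-\frac{d}{2}\log\beta\) part depends on \(\beta\), which is why it is the only piece of \(\log Z_\beta\) that survives in the training objective.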

3. Task-level and Instance-level Granularity: Sample-Specific Balancing

A fixed global \(\beta\) is too coarse, since different samples benefit from the teacher to different degrees. Beta-KD provides two granularities: task-level (homoscedastic), where \(\beta\) is a shared learnable scalar per task; and instance-level (heteroscedastic), where a lightweight network predicts \(\beta(x) = g_\phi(h(x)) > 0\) from the input. In the reported ScienceQA results, instance-level adaptation gives clearly larger gains than task-level adaptation (for example +10.6 vs. +1.7 on IMG-Acc), which supports sample-specific balancing.
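The two granularities can be sketched in a few lines. Everything below is an assumption made for illustration: the summary does not specify the architecture of \(g_\phi\), so a single linear layer followed by softplus stands in for it, and the class names, weights, and feature vectors are hypothetical.

```python
import math


def softplus(z, eps=1e-6):
    """Numerically stable log(1 + e^z); eps keeps the result strictly positive under underflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z))) + eps


class TaskLevelBeta:
    """One shared beta per task, stored as log(beta) so that beta stays positive."""

    def __init__(self, init_beta=1.0):
        self.log_beta = math.log(init_beta)

    def __call__(self, h=None):
        # The input features are ignored: every sample receives the same beta.
        return math.exp(self.log_beta)


class InstanceLevelBeta:
    """Hypothetical head g_phi(h) = softplus(w . h + b), giving one beta per sample."""

    def __init__(self, weights, bias=0.0):
        self.weights = list(weights)
        self.bias = bias

    def __call__(self, h):
        z = sum(w * x for w, x in zip(self.weights, h)) + self.bias
        return softplus(z)


task_beta = TaskLevelBeta(init_beta=2.0)
instance_beta = InstanceLevelBeta(weights=[0.8, -0.5], bias=0.1)

h1, h2 = [0.2, -1.0], [1.5, 0.3]  # stand-ins for pooled features h(x) of two samples
print(task_beta(h1), task_beta(h2))
print(instance_beta(h1), instance_beta(h2))
```

In a real training loop, `log_beta` and the head's weights would be trainable parameters updated jointly with the student; here they are fixed numbers so the example stays dependency-free.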

4. Energy Function Design Space: Generative MLLMs Prefer Cosine-Probs

The choice of \(\ell\) is critical. The paper systematically explores the design space and finds that Cosine-Probs performs best: it is scale-invariant and aligns direction only, which sidesteps the logit scale and variance mismatches caused by the capacity gap. Pre-softmax logit matching (MSE-Logits, Cosine-Logits), by contrast, performs poorly on generative MLLMs, contrary to common findings on discriminative tasks.
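One mechanism behind this difference can be shown on a toy example. The sketch below is not the paper's experiment; it only illustrates that an energy defined on logits can penalize a student whose predicted distribution already matches the teacher's exactly. The helper names and the three-class logits are made up for the illustration.

```python
import math


def softmax(logits):
    """Softmax with max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_energy(u, v):
    """1 - cosine similarity; unchanged when u or v is rescaled by a positive factor."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mse_energy(u, v):
    """Mean squared error between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


teacher_logits = [2.0, 0.5, -1.0]
# Same predicted distribution as the teacher, but every logit is offset by +3.
student_logits = [z + 3.0 for z in teacher_logits]

cosine_probs = cosine_energy(softmax(student_logits), softmax(teacher_logits))
mse_logits = mse_energy(student_logits, teacher_logits)
print(cosine_probs, mse_logits)
```

Cosine-Probs reports essentially zero energy because the two distributions coincide, while MSE-Logits reports an error of about 9 for the same pair, so a logit-level loss would keep pushing on a student that has nothing left to learn from this sample.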

Loss & Training

Final optimization objective (instance-level):

\[\min_{\theta,\phi} \mathcal{L}_{CE}(\theta) + g_\phi(h(x))\,\ell(\theta) - \frac{d}{2}\log g_\phi(h(x))\]

During training, the visual encoder and tokenizer are frozen, and only the language backbone is fine-tuned.
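For a single sample, the objective is a simple function of the predicted coefficient \(\beta = g_\phi(h(x))\). Setting its derivative with respect to \(\beta\), namely \(\ell - \frac{d}{2\beta}\), to zero gives the stationary point \(\beta^\star = \frac{d}{2\ell}\), so samples with a large teacher-student mismatch are assigned a smaller weight. The sketch below checks this numerically; the function names and the numbers plugged in are illustrative assumptions, not values from the paper.

```python
import math


def beta_kd_loss(ce_loss, align_energy, beta, d):
    """Per-sample objective: L_CE + beta * l - (d / 2) * log(beta)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return ce_loss + beta * align_energy - 0.5 * d * math.log(beta)


def optimal_beta(align_energy, d):
    """Stationary point of the objective in beta: d / (2 * l)."""
    return d / (2.0 * align_energy)


ce, energy, d = 1.2, 0.4, 8           # made-up values for one sample
beta_star = optimal_beta(energy, d)   # 8 / (2 * 0.4)

# Evaluate the objective on a small grid around beta_star.
grid = [beta_star * s for s in (0.25, 0.5, 1.0, 2.0, 4.0)]
losses = [beta_kd_loss(ce, energy, b, d) for b in grid]
best = grid[losses.index(min(losses))]
print(beta_star, best)
```

Because the objective is convex in \(\beta\) for \(\ell > 0\), the grid minimum coincides with \(\beta^\star\). In training, \(\beta\) is not solved for in closed form but predicted by \(g_\phi\) and optimized by gradient descent together with \(\theta\).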

Key Experimental Results

Main Results

| Method | ScienceQA VQA-Acc | ScienceQA IMG-Acc | Setting |
|---|---|---|---|
| CE+JS | 48.5 | 54.8 | Baseline |
| CE+JS w/ Beta-KD (Task) | 50.5 (+1.1) | 58.1 (+1.7) | Task-level |
| CE+JS w/ Beta-KD (Instance) | 53.3 (+3.9) | 66.9 (+10.6) | Instance-level |

Ablation Study

| Configuration | Key Metric | Description |
|---|---|---|
| Various losses (FKL/RKL/JS/TVD) | Consistent gains | Methodology is robust to energy function choice |
| Task-level vs. Instance-level | Instance-level is superior | Fine-grained adaptation is valuable |
| 2-loss vs. 3-loss | Effective in both | Applicable to any loss combination |

Key Findings

  • Instance-level uncertainty weighting improves ScienceQA performance by up to +4.7 absolute points.
  • The improvement on IMG-Acc is even larger (+10.6), indicating greater assistance for vision-related questions.
  • Logit-level matching fails in generative MLLMs, unlike in discriminative models.
  • Training dynamics visualization shows faster convergence, smoother optimization, and closer teacher-student logit alignment.

Highlights & Insights

  • Elegant theoretical explanation of KD from a unified Bayesian perspective: teacher supervision as a Gibbs prior and distillation as MAP inference.
  • Laplace approximation yields the \(-\frac{d}{2}\log\beta\) regularization term, naturally preventing extreme values of \(\beta\).
  • Exploration of the energy function design space provides useful practical guidelines: Cosine-Probs is the most effective.
  • Logical consistency from theoretical derivation to implementation.

Limitations & Future Work

  • The instance-level uncertainty network adds parameters and computational overhead.
  • Experiments are primarily based on MobileVLM; validation with larger-scale teachers is limited.
  • The Laplace approximation assumes the energy is locally quadratic around its minimum, which may be imprecise for non-convex losses.
  • Integration with newer base models (e.g., Qwen2.5-VL) has not been verified.

Related Work & Insights

  • Related to multi-task uncertainty weighting by Kendall & Gal, but generalized to arbitrary distillation losses.
  • Multimodal KD methods like LLaVA-KD and Align-KD could benefit from this approach.
  • Unlike BayesKD, which focuses on model parameter uncertainty, Beta-KD focuses on activation uncertainty.

Rating

  • Novelty: ⭐⭐⭐⭐ Innovative theoretical framework using Gibbs prior and Laplace approximation.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple loss combinations, two granularities, and 6 benchmarks.
  • Writing Quality: ⭐⭐⭐⭐ Clear and rigorous theoretical derivation.
  • Value: ⭐⭐⭐⭐ Automatic loss balancing is highly practical for large model KD.