Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding¶
Conference: CVPR 2026 arXiv: 2603.11423 Code: None Area: Knowledge Distillation / Video Understanding / Vision-Language Models Keywords: multi-sample distillation, black-box distillation, video LVLM, adversarial distillation, teacher sampling variance
TL;DR¶
This paper proposes R-MSD, a framework that addresses the unreliability of single-sample supervision in black-box distillation of video LVLMs. It builds a teacher pool by sampling \(K\) responses per input, applies task-adaptive quality matching (quality-weighted pairing for closed-ended tasks, uniform pairing for open-ended tasks), and trains the student against an online critic-as-discriminator via adversarial distillation.
Background & Motivation¶
Background: Large vision-language models (LVLMs) have achieved remarkable progress in video understanding, yet deployment remains constrained by computational cost. Knowledge distillation is the predominant approach for transferring strong teacher capabilities to compact student models. Recent analyses further suggest that distillation can extend the reasoning upper bound of student models, while reinforcement learning (RL) methods are bounded by the base model's distribution.
Limitations of Prior Work: Existing distillation methods sample only a single teacher response per input and treat it as a reliable supervision signal. This assumption fails severely in video understanding, where variance arises at two levels: (1) cross-question variance: quality scores span \([0.10, 1.0]\) with \(\sigma=0.22\) over 200 samples (MCQ: \(\sigma=0.10\) vs. Visual QA: \(\sigma=0.24\)); (2) intra-question sampling uncertainty: \(\sigma\) ranges from \(0.07\) (MCQ) to \(0.15\) (Visual QA), and the format violation rate rises from 1% overall to 10% on temporal QA.
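For concreteness, a minimal sketch of how these two variance levels could be measured, assuming a per-task quality metric in \([0, 1]\); the exact-match `quality` function here is a placeholder, not the paper's metric:

```python
import statistics

def quality(response: str, ground_truth: str) -> float:
    # Placeholder metric in [0, 1]; the paper uses task-specific GT-based metrics.
    return float(response.strip().lower() == ground_truth.strip().lower())

def variance_report(samples: dict[str, list[str]], gts: dict[str, str]):
    """Return (mean intra-question sigma, cross-question sigma of mean quality)."""
    # Intra-question: spread of quality across the K samples of one question.
    intra = [statistics.pstdev([quality(r, gts[q]) for r in rs])
             for q, rs in samples.items()]
    # Cross-question: spread of mean quality across different questions.
    means = [statistics.mean([quality(r, gts[q]) for r in rs])
             for q, rs in samples.items()]
    return statistics.mean(intra), statistics.pstdev(means)
```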
Key Challenge: The task heterogeneity inherent to video understanding (closed-ended tasks amenable to GT verification vs. open-ended tasks lacking reliable metrics) makes a unified supervision strategy inevitably suboptimal for one category — failing to filter low-quality closed-ended responses introduces noise, while lexical-matching-based ranking for open-ended tasks penalizes semantically correct but differently worded answers.
Goal: How can supervision noise arising from teacher sampling variance be modeled and mitigated, particularly in video understanding scenarios with a mixture of closed-ended and open-ended tasks?
Key Insight: Maintain a pool of \(K\) teacher responses per input and apply task-type-specific quality matching strategies, complemented by an online discriminator for distribution-level supervision.
Core Idea: Replace single-sample supervision with a multi-sample teacher pool and task-adaptive matching (GT-weighted for closed-ended tasks, uniform for open-ended tasks), and leverage an online discriminator to prevent reward hacking associated with static reward models.
Method¶
Overall Architecture¶
R-MSD comprises three components and two stages: (1) multi-sample teacher collection — \(K=4\) teacher responses are sampled per input; (2) task-adaptive quality assessment — closed-ended tasks use GT-based quality scores, and open-ended tasks use a uniform distribution; (3) online discriminator adversarial distillation — scoring via a critic value head and optimization via policy gradient. Stage 1 is an SFT warm-up (using the best teacher response), and Stage 2 is RL-based adversarial distillation.
Key Designs¶
- Task-Adaptive Quality Assessment and Matching (see the code sketch after this list)
  - Closed-ended tasks (MCQ / temporal grounding / spatial grounding / numerical / OCR): \(q_k = \mathbb{I}(\text{valid}(T_k)) \cdot \text{Metric}(T_k, y^*)\), with quality-weighted sampling \(p_k \propto q_k\)
  - Open-ended tasks: uniform pairing \(p_k = 1/K\) to avoid lexical bias
  - A quality filtering threshold \(\tau=0.3\) removes low-quality responses (retaining 72% of teacher responses)
  - Design Motivation: closed-ended tasks benefit from available GT signals, whereas imposing a lexical ranking on open-ended tasks penalizes semantically correct but differently phrased responses
- Two-Stage Training and Composite Reward (see the reward sketch after the Loss & Training list)
  - Stage 1: SFT on the best teacher response per input (50K samples, 1 epoch), providing a stable initialization
  - Stage 2: the student samples \(N=8\) rollouts, each paired with the teacher pool via quality-weighted matching
  - Composite reward: \(R = 0.4 D_\phi + 0.1 R_{outer} + 0.1 R_{task} + 0.4 R_{content}\)
  - \(R_{outer}\) evaluates outer format compliance; \(R_{task}\) evaluates task-specific format; \(R_{content}\) is the GT-matching score
  - Design Motivation: decoupling format and content reward components aligns training directly with evaluation metrics
- Critic-as-Discriminator Online Discrimination
  - The critic value head is reused to score the final token; the discriminator is trained with quality-weighted GAD pairing objectives
  - The student maximizes the composite reward via policy gradient, with a KL penalty to prevent deviation from the reference policy
  - Design Motivation: the online discriminator co-evolves with the student, avoiding the reward hacking associated with static reward models
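A minimal sketch of the task-adaptive matching described above, assuming a per-task `metric` and a format validator `is_valid` are supplied; the function names, fallback branch, and data layout are illustrative, not the paper's implementation:

```python
import random
from typing import Callable, List

TAU = 0.3  # quality filtering threshold reported in the paper
K = 4      # teacher responses sampled per input

def quality_scores(pool: List[str], gt: str,
                   is_valid: Callable[[str], bool],
                   metric: Callable[[str, str], float]) -> List[float]:
    # q_k = 1[valid(T_k)] * Metric(T_k, y*) for closed-ended tasks.
    return [metric(t, gt) if is_valid(t) else 0.0 for t in pool]

def pair_teacher(pool: List[str], gt: str, task_type: str,
                 is_valid: Callable[[str], bool],
                 metric: Callable[[str, str], float]) -> str:
    """Pick one teacher response from the pool to pair with a student rollout."""
    if task_type == "closed":
        q = quality_scores(pool, gt, is_valid, metric)
        kept = [(t, qk) for t, qk in zip(pool, q) if qk >= TAU]  # filter low quality
        if not kept:
            return random.choice(pool)  # illustrative fallback, not from the paper
        ts, ws = zip(*kept)
        return random.choices(ts, weights=ws, k=1)[0]  # sample with p_k proportional to q_k
    return random.choice(pool)  # open-ended: uniform pairing, p_k = 1/K
```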
Loss & Training¶
- Stage 1: Cross-entropy loss, 50K samples, 1 epoch
- Stage 2: \(\mathcal{L}_{RL} = -\mathbb{E}[R(S)] + \gamma\, D_{KL}(\pi_S \,\|\, \pi_{ref})\), 60K samples, 1 epoch
- Teacher: Qwen3-VL-235B (frozen); Student: Qwen3-VL-4B
- \(K=4\), \(N=8\), batch size 128, AdamW with lr=\(2\times10^{-6}\) (Stage 1) / \(1\times10^{-6}\) (Stage 2)
- Training: 16 frames; Inference: 64 frames
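A minimal sketch of the Stage-2 objective under the stated reward weights; `d_phi` stands in for the critic-as-discriminator score on the final token, the sub-rewards are assumed to be precomputed in \([0, 1]\), and the KL coefficient `gamma` is a placeholder value not reported in this summary:

```python
def composite_reward(d_phi: float, r_outer: float, r_task: float, r_content: float) -> float:
    # R = 0.4 * D_phi + 0.1 * R_outer + 0.1 * R_task + 0.4 * R_content
    return 0.4 * d_phi + 0.1 * r_outer + 0.1 * r_task + 0.4 * r_content

def rl_loss(rewards, logp_student, logp_ref, gamma: float = 0.01):
    """Monte-Carlo estimate of L_RL = -E[R(S)] + gamma * KL(pi_S || pi_ref).

    KL is approximated per rollout as log pi_S(S) - log pi_ref(S);
    gamma = 0.01 is an assumed placeholder, not a value from the paper.
    """
    n = len(rewards)
    kl = sum(s - r for s, r in zip(logp_student, logp_ref)) / n
    return -(sum(rewards) / n) + gamma * kl
```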
Key Experimental Results¶
Main Results¶
| Benchmark | R-MSD (4B) | Qwen3-VL-4B (base) | SFT+RL (4B) | Gain vs. base |
|---|---|---|---|---|
| VideoMME | 65.3% | 63.8% | 64.1% | +1.5 |
| Video-MMMU | 58.6% | 55.4% | 55.8% | +3.2 |
| WorldSense | 49.2% | 46.7% | 47.0% | +2.5 |
| MLVU_MCQ | 72.4% | 71.6% | 71.8% | +0.8 |
| MathVista | 66.3% | 63.7% | 64.0% | +2.6 |
| MathVerse | 39.2% | 35.6% | 36.0% | +3.6 |
| VsTAR Chain2 When (tIoU) | 23.4 | 18.5 | 19.0 | +4.9 |
Ablation Study¶
| Configuration | VideoMME | Video-MMMU | Notes |
|---|---|---|---|
| A: \(K=1\) (baseline) | 63.8 | 55.4 | Single-sample baseline |
| B: \(K=4\) (multi-sample) | 64.5 | 56.8 | Gain from multi-sample alone |
| C: +quality filtering | 65.0 | 57.6 | Low-quality teacher responses filtered |
| D: +quality weighting | 65.3 | 58.6 | Full R-MSD |
| \(K=8\) | 65.4 | 58.7 | Marginal gain; \(K=4\) most cost-effective |
| \(\tau=0\) (no filtering) | 64.6 | 57.2 | Filtering is necessary |
| \(\tau=0.5\) (too strict) | 64.8 | 57.8 | Over-filtering also suboptimal |
| Closed-ended with uniform | — | 56.2 | GT scoring outperforms uniform (58.6 vs. 56.2) |
| Open-ended with GT scoring | — | 58.4 | Uniform outperforms GT (59.1 vs. 58.4) |
Key Findings¶
- An equal-budget SFT+RL baseline yields only marginal gains (+0.3–0.4%), highlighting the advantage of multi-sample supervision
- \(K=4\) is the most cost-effective operating point; \(K=8\) yields only marginal additional gains
- Task-specialization is validated: GT-weighted scoring outperforms uniform for closed-ended tasks, while uniform outperforms GT-based scoring for open-ended tasks — consistent with the design assumptions
- Pass@k analysis reveals that R-MSD achieves 3.2% higher Pass@1, with upper bounds converging as \(k\) increases, indicating the improvement stems from a more concentrated output distribution
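For reference, the standard unbiased Pass@k estimator (Chen et al., 2021) that such an analysis typically relies on; whether the paper uses exactly this estimator is an assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased Pass@k: 1 - C(n - c, k) / C(n, k), given n samples with c correct.
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```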
Highlights & Insights¶
- The first systematic quantification of teacher sampling variance in video LVLMs, with concrete empirical evidence (\(\sigma=0.22\); format violation rates of 1–10%) demonstrating the unreliability of single-sample supervision
- Task-adaptive design is precise: quality filtering for closed-ended tasks and avoidance of lexical bias for open-ended tasks align naturally with the fundamental differences between the two task types
- The composite reward explicitly decouples format and content components, directly aligning training with evaluation metrics
- The online discriminator effectively mitigates the reward hacking problem associated with static reward models
Limitations & Future Work¶
- Multi-sample collection cost scales linearly with \(K\), increasing data-collection computation approximately fourfold at \(K=4\)
- Uniform pairing for open-ended tasks is a conservative choice that does not exploit semantic quality signals (e.g., LLM-as-judge)
- Closed-ended quality assessment relies on GT annotations and is not directly applicable to weakly supervised settings
- Nearly no improvement is observed on LongVideoBench, attributed to the mismatch between training frames (16) and test frames (64)
- Validation is limited to the 4B-scale student; effectiveness at larger or smaller scales remains unexplored
Related Work & Insights¶
- vs. GAD (Ye et al. 2025): R-MSD extends GAD by incorporating task-adaptive multi-sample matching, generalizing from plain text to video multimodal settings
- vs. PromptKD: The latter employs unsupervised logit distillation, whereas R-MSD leverages GT quality signals combined with an adversarial discriminator
- vs. RLVR (Yue et al. 2025): RL methods do not extend the reasoning upper bound, while distillation can transfer novel reasoning patterns — R-MSD provides more reliable distillation supervision
- The proposed methodology for quantifying teacher sampling variance is transferable to arbitrary distillation scenarios
- The task-specialization strategy for closed-ended and open-ended tasks offers general value for mixed-task training pipelines
Rating¶
- Novelty: ⭐⭐⭐⭐ — The combination of multi-sample pooling and task-adaptive matching addresses a genuine practical problem, with well-aligned motivation and design
- Experimental Thoroughness: ⭐⭐⭐⭐ — Six video and two image benchmarks, with detailed ablation and sensitivity analyses
- Writing Quality: ⭐⭐⭐⭐ — Clear problem formulation, intuitive variance analysis figures, and a complete methodological pipeline
- Value: ⭐⭐⭐⭐ — A practical contribution to the video VLM distillation field, with broadly applicable methodology