Unified Reinforcement and Imitation Learning for Vision-Language Models

Conference: NeurIPS 2025 · arXiv: 2510.19307 · Code: unavailable (NVIDIA internal) · Area: Multimodal VLM · Keywords: VLM distillation, reinforcement learning, imitation learning, GRPO, GAIL

TL;DR

This paper proposes RIL (Unified Reinforcement and Imitation Learning), a training framework that combines GRPO-based reinforcement learning with GAIL-style adversarial imitation learning. By learning the text-generation style of large VLMs (72B), RIL substantially improves small VLMs (7B) without adding inference latency or requiring an explicit "thinking" process.

Background & Motivation

VLMs have demonstrated strong multimodal understanding by integrating visual and language modalities, yet deploying large models (72B+) remains infeasible under constrained computational budgets. Existing strategies for improving VLM performance each carry notable limitations:

Scaling model size: Models such as GPT-4o and Qwen2.5-VL-72B achieve strong results but cannot be deployed on mobile devices or AR headsets.

Think-Answer paradigm: Methods like DeepSeek-R1 leverage long chain-of-thought reasoning to boost performance, but substantially increase inference latency and computational overhead.

Architectural modifications: Adding multiple visual encoders alters the inference pipeline and reduces deployment flexibility.

Conventional knowledge distillation: Feature-distance-based distillation loses much of its effectiveness in high-dimensional feature spaces.

Key Insight: Can a small model learn the "speaking style" (text generation style) of a large model without modifying its architecture or increasing inference cost? RIL combines reinforcement learning (to optimize answer quality) with imitation learning (to learn expressive style) within a unified adversarial framework.
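For orientation, the "imitation" half rests on the classic GAN/GAIL minimax objective, written here over text responses \(y\) rather than GAIL's original state-action pairs (standard formulation; the paper's exact notation may differ):

\[
\min_{\theta} \max_{D} \;
\mathbb{E}_{y \sim \pi_{\text{teacher}}}\big[\log D(y)\big]
+ \mathbb{E}_{y \sim \pi_{\theta}}\big[\log\big(1 - D(y)\big)\big]
\]

The discriminator \(D\) learns to assign high scores to teacher responses, while the student policy \(\pi_\theta\) is rewarded for responses that \(D\) mistakes for the teacher's.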

Method

Overall Architecture

RIL comprises three core components: a student VLM (Generator), a Discriminator, and an LLM-as-a-Judge. Training consists of a discriminator pre-training phase followed by joint RIL training, where each step updates the student model using both reinforcement rewards and imitation rewards.
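The code is not public, so the following is only a minimal sketch of this two-phase schedule; every interface name (`sample_group`, `grpo_update`, `is_correct`, etc.) and the additive combination of the two rewards are assumptions:

```python
# Hypothetical sketch of RIL's two-phase training schedule.
# All model interfaces below are assumed, not taken from the paper's code.

def train_ril(student, discriminator, teacher, judge, loader,
              pretrain_steps=500, ril_steps=5000):
    data = iter(loader)

    # Phase 1: pre-train the discriminator to tell student responses
    # from teacher responses on the same (image, question) pairs.
    for _ in range(pretrain_steps):
        batch = next(data)
        student_resp = student.generate(batch["images"], batch["questions"])
        teacher_resp = teacher.generate(batch["images"], batch["questions"])
        discriminator.update(real=teacher_resp, fake=student_resp)

    # Phase 2: joint RIL training. Each step scores every student rollout
    # with a binarized style reward (discriminator) plus a binarized
    # correctness reward (LLM-as-a-Judge), then takes a Dr.GRPO step.
    for _ in range(ril_steps):
        batch = next(data)
        group = student.sample_group(batch["images"], batch["questions"])
        teacher_resp = teacher.generate(batch["images"], batch["questions"])
        rewards = [
            float(discriminator.score(resp) >= 0.5)            # r_s, binarized
            + float(judge.is_correct(resp, batch["answers"]))  # r_a, binarized
            for resp in group
        ]
        student.grpo_update(group, rewards)                    # RL update
        discriminator.update(real=teacher_resp, fake=group)    # keep D in sync
```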

Key Designs

  1. Discriminator Architecture and Pre-training

    • The discriminator shares the same architecture and initial parameters as the student VLM, with only the language head replaced by a linear discrimination head (\(\mathbb{R}^{d \times v} \to \mathbb{R}^{d \times 1}\)) followed by a sigmoid activation.
    • Pre-training objective: distinguish between responses generated by the student and those generated by the teacher.
    • The architectural symmetry between generator and discriminator prevents the "balance problem," avoiding the training collapse caused by either side dominating the other.
    • Continuous discriminator scores are binarized to 0/1, providing a cleaner learning signal (see the discriminator sketch after this list).
  2. RIL Reward Design (Dual Rewards)

    • Similarity reward \(r_s\): Derived from the discriminator, measuring the stylistic similarity between student responses and teacher responses (after binarization).
    • Correctness reward \(r_a\): Derived from LLM-as-a-Judge (Qwen2.5-32B), measuring whether the response is semantically consistent with the ground truth (also binarized).
    • The two rewards are complementary: the discriminator focuses on "sounding like the teacher," while the Judge focuses on "being correct."
  3. Teacher-Guided GRPO

    • During the GRPO update, both student and teacher VLM responses are included as candidates.
    • When all sampled student responses for a given question are incorrect, the teacher's correct response provides an escape from the zero-reward trap.
    • This not only stabilizes training but also creates opportunities for the student to surpass the teacher.
  4. Multi-Teacher Strategy

    • Multiple large teacher VLMs (e.g., Qwen2.5-VL-72B + InternVL3-78B) are employed to provide more diverse response styles.
    • Multiple teachers make the discriminator more robust and prevent overfitting to any single response pattern.
    • Exposure to a richer distribution of correct responses improves learning efficiency.
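A minimal sketch of the discriminator and the binarized similarity reward described in items 1–2. The backbone's call signature, the last-token pooling, and the 0.5 threshold are all assumptions:

```python
import torch
import torch.nn as nn

class RILDiscriminator(nn.Module):
    """Sketch: same backbone (architecture + initialization) as the
    student VLM, with the d x v language head replaced by a d x 1
    scalar head and a sigmoid. Interfaces are assumed, not official."""

    def __init__(self, vlm_backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = vlm_backbone           # shared with the student
        self.head = nn.Linear(hidden_dim, 1)   # replaces the LM head

    def forward(self, input_ids, pixel_values):
        # Assumed to return last hidden states [batch, seq, hidden_dim].
        h = self.backbone(input_ids, pixel_values)
        score = torch.sigmoid(self.head(h[:, -1, :]))  # pool final token
        return score.squeeze(-1)                       # p(teacher-written)

def similarity_reward(score: torch.Tensor, threshold: float = 0.5):
    # Binarize the continuous discriminator score to a clean 0/1 signal.
    return (score >= threshold).float()
```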

Loss & Training

  • Student models: Qwen2.5-VL-7B/3B, InternVL3-8B/2B/1B
  • Teacher models: Qwen2.5-VL-72B, InternVL3-78B
  • Judge: Qwen2.5-32B
  • Dr.GRPO (an improved GRPO variant that removes the per-group standard-deviation and response-length normalizations) serves as the RL backbone; a minimal advantage sketch follows this list.
  • The discriminator and Judge are not required at inference time, preserving the original inference speed.
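How exactly the teacher's response enters the update is not fully specified here; the sketch below assumes binarized rewards and that the teacher rollout simply joins the student's group before the mean-centered Dr.GRPO advantage is computed:

```python
import numpy as np

def teacher_guided_advantages(student_rewards, teacher_reward):
    """Dr.GRPO-style advantages for one question's rollout group, with
    the teacher's response appended as an extra candidate. Dr.GRPO drops
    GRPO's per-group std division, so the advantage is simply the
    mean-centered reward."""
    rewards = np.array(student_rewards + [teacher_reward], dtype=float)
    return rewards - rewards.mean()

# The zero-reward trap: if every student rollout is wrong, plain GRPO
# yields all-zero advantages and no gradient. A correct teacher
# response restores a learning signal:
print(teacher_guided_advantages([0, 0, 0, 0], 1))
# [-0.2 -0.2 -0.2 -0.2  0.8]
```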

Key Experimental Results

Main Results (Average over 14 Benchmarks)

| Model | AI2D | ChartQA | MathVista | MMB | MMMU | BLINK | 14-Bench Avg |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (base) | 83.9 | 87.3 | 67.8 | 83.5 | 55.0 | 56.4 | ~70.8 |
| + RL (Dr.GRPO) | 84.5 | 90.0 | 69.5 | 84.3 | 57.2 | 60.7 | ~73.3 |
| + RIL (single teacher) | 86.7 | 95.4 | 74.5 | 86.8 | 61.8 | 68.5 | ~78.0 |
| + RIL (dual teacher) | 86.1 | 95.6 | 79.7 | 86.3 | 65.7 | 70.0 | ~79.7 |
| InternVL3-8B + RIL | 87.4 | 95.5 | 74.1 | 88.7 | 66.8 | 60.1 | ~75.5 |

Ablation Study (Multi-Teacher vs. Single Teacher)

| Configuration | MMMU | MathVista | BLINK | Avg. trend |
|---|---|---|---|---|
| Single teacher (same family, 72B) | 61.8 | 74.5 | 68.5 | High |
| Single teacher (cross-family, 78B) | 60.9 | 74.6 | 68.1 | High |
| Dual teacher (both) | 65.7 | 79.7 | 70.0 | Best |

Key Findings

  • RIL substantially outperforms pure RL: On Qwen2.5-VL-7B, RIL improves over Dr.GRPO by roughly 5–6 points on the 14-benchmark average, with ChartQA jumping from 90.0 to 95.4.
  • Dual-teacher significantly outperforms single-teacher: MathVista improves from 74.5/74.6 (single teacher) to 79.7 (dual teacher); MMMU improves from 61.8/60.9 to 65.7.
  • Small models also benefit substantially: InternVL3-1B achieves 3–8 point gains across multiple benchmarks under RIL.
  • Synergy with distilled models: VLMs that have already undergone feature-level distillation perform better under RIL, suggesting that intrinsic feature alignment and RIL objectives are complementary.

Highlights & Insights

  • Paradigm innovation: Introducing GAN-style adversarial learning into VLM training, where the discriminator distinguishes between "student-written" and "teacher-written" text styles, is a highly elegant formulation.
  • Stability of binarized rewards: Continuous discriminator scores introduce ambiguity; binarization significantly stabilizes training, which is consistent with the binary reward design philosophy of GRPO.
  • No Think-Answer required: Unlike methods such as Vision-R1, RIL does not require a chain-of-thought at inference time, preserving the original inference speed.
  • Architecture- and tokenizer-agnostic: Since RIL operates purely on textual responses, the student and teacher can employ entirely different image encoders and tokenizers.
  • Generality of LLM-as-a-Judge: Traditional RL rewards rely on answer parsing (applicable only to tasks with canonical answers such as mathematics), whereas a Judge can evaluate open-ended responses.
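As a concrete illustration of that generality, here is a hypothetical judge wrapper; the paper's actual prompt, `generate()` interface, and verdict parsing for Qwen2.5-32B are not public, so every detail below is an assumption:

```python
# Hypothetical LLM-as-a-Judge prompt; not the paper's actual template.
JUDGE_PROMPT = """You are grading a model's answer against a reference.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Reply with exactly one word: CORRECT or INCORRECT."""

def correctness_reward(judge_llm, question, reference, answer):
    # Query the judge and binarize its verdict into the r_a reward.
    verdict = judge_llm.generate(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
    return 1.0 if verdict.strip().upper() == "CORRECT" else 0.0
```

Unlike rule-based answer parsing, this works even when responses have no canonical form to match.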

Limitations & Future Work

  • Access to large teacher VLMs for response generation is required, so the training cost includes large-model inference.
  • Discriminator training stability still requires careful hyperparameter tuning (initialization, pre-training steps, etc.).
  • The Judge model may carry its own biases, leading to inaccurate correctness assessments in certain domains.
  • The practical inference efficiency gains in real deployment scenarios (mobile or edge devices) have not been empirically validated.
  • Preparing data from dual teachers belonging to different VLM families entails considerable engineering effort.

Comparison with Related Methods

  • vs. DeepSeek-R1 / Vision-R1: These methods rely on a think-answer process that increases inference latency, whereas RIL preserves the original inference speed.
  • vs. conventional knowledge distillation: Traditional distillation performs alignment in high-dimensional feature spaces, while RIL performs alignment at the natural language response level (a "verbalization effect"), yielding superior results.
  • vs. GAIL: Classic GAIL is designed for robotic control; RIL is the first to effectively adapt it to VLM training, introducing four key modifications.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Introducing adversarial imitation learning into VLM training is an entirely new angle; the unified framework integrating GRPO is elegantly designed.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 14 benchmarks, multiple model scales (1B–8B), multi-teacher combinations, and comprehensive ablations.
  • Writing Quality: ⭐⭐⭐⭐ The framework is described clearly, though the dense tables and the key innovations require careful repeated reading to fully grasp.
  • Value: ⭐⭐⭐⭐⭐ Offers a practical path to substantially improving small VLM performance without any increase in inference cost.