ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation

Conference: NeurIPS 2025 · arXiv: 2509.25100 · Code: Not released · Area: LLM Alignment · Keywords: Knowledge Distillation, Preference Optimization, ORPO, Mixed Policy, Cross-Architecture Distillation

TL;DR

This paper proposes ORPO-Distill, which reformulates cross-architecture LLM knowledge distillation as a preference optimization problem. The teacher model generates positive reasoning chains while the student model generates negative ones; an ORPO contrastive loss is used for training, augmented by a mixed-policy update strategy for student negative samples. The method consistently outperforms black-box KD baselines across 5 QA benchmarks.

Background & Motivation

Two dominant paradigms in LLM knowledge distillation:

White-box KD: Relies on teacher logits, requiring shared vocabulary/architecture between teacher and student, limiting flexibility.

Black-box KD: Requires only teacher-generated sequences, enabling cross-architecture distillation, but typically limited to CoT distillation.

Three limitations of existing black-box KD:

Single CoT distillation provides limited supervision signal: only one teacher reasoning chain is used.

Absence of contrastive learning: no distinction is made between "good" and "bad" reasoning paths.

Distribution mismatch: training uses teacher sequences, while inference relies on student autoregressive generation, creating a distribution gap.

Three contributions of this paper:

  1. Replacing single CoT with diverse reasoning traces.
  2. Applying ORPO preference optimization for contrastive distillation (teacher as positive, student as negative).
  3. Introducing mixed-policy updates to address the distribution mismatch.

Method

Overall Architecture

Input prompt → Teacher generates \(K\) positive reasoning chains → Student generates \(K\) negative reasoning chains → Construct preference dataset \(\langle\)Prompt, Chosen, Rejected\(\rangle\) → ORPO contrastive training → Mixed-policy negative sample update.
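As a rough illustration of this data flow, the sketch below shows how the \(\langle\)Prompt, Chosen, Rejected\(\rangle\) triples might be assembled. The `PreferencePair` record, the `build_preference_dataset` helper, and the one-to-one pairing of teacher and student chains are illustrative assumptions; the paper's code is not released.

```python
from dataclasses import dataclass

# Hypothetical record for one preference example; field names are illustrative.
@dataclass
class PreferencePair:
    prompt: str    # input question
    chosen: str    # teacher-generated reasoning chain + answer (positive)
    rejected: str  # student-generated reasoning chain + answer (negative)

def build_preference_dataset(prompts, teacher_chains, student_chains):
    """Pair teacher (positive) and student (negative) chains for each prompt.

    teacher_chains / student_chains map a prompt to its K sampled chains.
    Pairing by zip is an assumption; a cross product would also be possible.
    """
    dataset = []
    for prompt in prompts:
        for pos, neg in zip(teacher_chains[prompt], student_chains[prompt]):
            dataset.append(PreferencePair(prompt, pos, neg))
    return dataset
```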

Loss & Training

ORPO (Odds Ratio Preference Optimization) combines SFT and preference alignment into a single objective:

\[L_{SFT} = -\log q_\theta(y_P \mid x)\]
\[L_{OR} = -\log \sigma\left(\log \frac{\text{odds}\;q_\theta(y_P|x)}{\text{odds}\;q_\theta(y_N|x)}\right)\]

where the odds function is defined as:

\[\text{odds}\;q_\theta(y|x) = \frac{q_\theta(y|x)}{1 - q_\theta(y|x)}\]

The final loss is:

\[L_{ORPO} = L_{SFT} + \lambda \cdot L_{OR}\]
  • \(\lambda = 0.1\): mild preference skew (as in human preference alignment).
  • \(\lambda = 1\): strong adaptation (used in this work, to penalize erroneous student generation paths).
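A minimal PyTorch-style sketch of this objective is given below, assuming `chosen_logps` and `rejected_logps` are length-normalized sequence log-probabilities \(\log q_\theta(y\mid x)\) computed elsewhere from the student's token logits. This is one plausible reading of the formulas above, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, lam=1.0):
    """ORPO objective from sequence log-probs (illustrative sketch).

    chosen_logps / rejected_logps: length-normalized log q_theta(y|x),
    shape (batch,). lam is the weight on the odds-ratio term (lambda = 1 here).
    """
    # SFT term: negative log-likelihood of the teacher (chosen) sequence.
    sft_loss = -chosen_logps.mean()

    # log odds(q) = log q - log(1 - q), computed from log q.
    # In practice log q may need clamping away from 0 for numerical stability.
    def log_odds(logp):
        return logp - torch.log1p(-torch.exp(logp))

    # Odds-ratio term: raise the chosen odds relative to the rejected odds.
    log_odds_ratio = log_odds(chosen_logps) - log_odds(rejected_logps)
    or_loss = -F.logsigmoid(log_odds_ratio).mean()

    return sft_loss + lam * or_loss
```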

Key Designs

Preference Dataset Construction

Key finding: Using student-generated negative samples outperforms teacher-generated negatives.

| Setting | MedQA | ARC-C |
| --- | --- | --- |
| (Positive: Teacher, Negative: Teacher) | 41.72 | 45.87 |
| (Positive: Teacher, Negative: Student) | 49.33 | 56.48 |

Diversity Sampling:
  • Temperature sampling with \(\tau = 0.8\), generating \(K\) reasoning chains.
  • Both teacher and student use the Reason-then-Answer format.
  • Rejection sampling: chains with ROUGE-L overlap > 0.80 are discarded.
  • Experiments with \(K \in \{4, 8, 12\}\) show diminishing returns beyond \(K = 8\).
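The sketch below shows one plausible way to implement this sampling-and-rejection step. The simplified whitespace-token ROUGE-L, the placeholder `generate` callable, and the choice to compare each candidate against already accepted chains are assumptions for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence between two token lists."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l_f1(cand, ref):
    """ROUGE-L F1 over whitespace tokens (simplified; no stemming)."""
    c, r = cand.split(), ref.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)

def sample_diverse_chains(generate, prompt, k=8, temperature=0.8,
                          max_overlap=0.80, max_attempts=100):
    """Sample up to K chains, discarding any whose ROUGE-L overlap with an
    already accepted chain exceeds max_overlap."""
    kept = []
    for _ in range(max_attempts):
        if len(kept) == k:
            break
        chain = generate(prompt, temperature=temperature)  # placeholder model call
        if all(rouge_l_f1(chain, prev) <= max_overlap for prev in kept):
            kept.append(chain)
    return kept
```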

Mixed-Policy Update

Three policy variants:

| Policy | Parameter \(\phi\) | Negative Sample Source |
| --- | --- | --- |
| Off-policy | \(\phi = 0\) | Fixed; generated by the initial student model |
| On-policy | \(\phi = 1\) | Re-sampled from the latest checkpoint every epoch |
| Mixed-policy | \(\phi = 0.5\) | Latest checkpoint with probability \(\phi\); otherwise the initial model |

Why does on-policy underperform off-policy? Frequently updated negative samples are of higher quality but lower diversity, narrowing the contrastive margin. Mixed-policy preserves negative sample diversity by anchoring to the initial student model.
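A small sketch of the mixed-policy negative refresh is shown below. The `generate()` method and the `refresh_negatives` helper are hypothetical names, but the \(\phi\)-controlled choice between the frozen initial student and the latest checkpoint follows the table above.

```python
import random

def sample_negative(prompt, initial_student, current_student, phi=0.5):
    """Mixed-policy negative sampling (illustrative sketch, not the paper's code).

    With probability phi the negative chain comes from the latest student
    checkpoint (on-policy); otherwise from the frozen initial student
    (off-policy). phi = 0 and phi = 1 recover the two pure variants.
    """
    policy = current_student if random.random() < phi else initial_student
    return policy.generate(prompt, temperature=0.8)  # hypothetical generate() API

def refresh_negatives(prompts, initial_student, current_student, phi=0.5):
    """Rebuild the negative pool once per epoch under the mixed policy."""
    return {p: sample_negative(p, initial_student, current_student, phi)
            for p in prompts}
```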

Training Configuration:
  • Teacher: InternLM 2.5 7B-Chat
  • Students: InternLM 2.5 1.8B-Chat, TinyLlama 1.1B-Instruct
  • Full-parameter fine-tuning, 5 epochs
  • \(K = 8\), \(\lambda = 1\), \(\phi = 0.5\)

Key Experimental Results

Main Results: Accuracy (%) on 5 Datasets across Student Models

| Setting | MedQA | ARC-C | StrategyQA | OBQA | GSM8K | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| TinyLlama 1.1B | | | | | | |
| Zero-shot CoT | 29.78 | 29.95 | 43.52 | 26.60 | 11.97 | 28.36 |
| Single CoT FT | 32.10 | 32.63 | 46.25 | 29.05 | 31.56 | 34.32 |
| Diverse CoT FT | 34.85 | 35.40 | 47.84 | 33.60 | 36.22 | 37.58 |
| Off-Policy ORPO | 38.95 | 41.20 | 49.77 | 37.45 | 39.45 | 41.36 |
| On-Policy ORPO | 35.11 | 38.01 | 49.24 | 35.60 | 36.88 | 38.97 |
| Mixed-Policy ORPO | 40.25 | 43.55 | 51.25 | 40.10 | 40.72 | 43.17 |
| InternLM 1.8B | | | | | | |
| Zero-shot CoT | 35.82 | 37.12 | 54.15 | 27.40 | 41.02 | 39.10 |
| Single CoT FT | 37.94 | 40.45 | 57.50 | 41.35 | 44.38 | 44.32 |
| Diverse CoT FT | 40.56 | 42.15 | 58.66 | 54.50 | 47.50 | 48.67 |
| Off-Policy ORPO | 49.33 | 56.48 | 59.39 | 53.20 | 51.25 | 53.93 |
| On-Policy ORPO | 43.25 | 49.80 | 58.50 | 52.79 | 47.94 | 50.46 |
| Mixed-Policy ORPO | 50.43 | 59.32 | 61.75 | 55.22 | 52.47 | 55.84 |

Teacher model (InternLM 2.5 7B-Chat) Zero-shot CoT average: 59.58%.

Ablation Study: Contribution of Each Component

| Component | TinyLlama Avg. Gain | InternLM 1.8B Avg. Gain |
| --- | --- | --- |
| Single CoT → Diverse CoT | +3.26 | +4.35 |
| Diverse CoT → Off-Policy ORPO | +3.78 | +5.26 |
| Off-Policy → Mixed-Policy | +1.81 | +1.91 |
| Total Gain (Single CoT → Mixed-Policy) | +8.85 | +11.52 |

Key Findings

  1. Every component contributes: diverse reasoning chains (+3.3–4.4 points), ORPO contrastive training (+3.8–5.3 points), and mixed-policy updates (+1.8–1.9 points) each yield additive improvements.
  2. On-policy underperforms off-policy: validating the hypothesis that negative sample diversity matters more than quality.
  3. Mixed-policy is universally optimal: consistently best across all datasets and student models.
  4. Cross-architecture distillation is effective: both TinyLlama (Llama architecture) and InternLM 1.8B (InternLM architecture) benefit from the InternLM 7B teacher.
  5. Smaller models gain more in relative terms: the 1.1B model improves by +14.81 points on average over Zero-shot CoT (roughly a 52% relative gain), while the 1.8B model improves by +16.74 points (roughly 43% relative).

Highlights & Insights

  1. Reformulating distillation as preference optimization is the central conceptual contribution: black-box KD is fundamentally about teaching the student to "prefer" the teacher's reasoning style, and ORPO provides a natural framework for this.
  2. Student negatives outperforming teacher negatives is an illuminating finding: student-generated negatives directly expose the student's weaknesses, making contrastive training more targeted.
  3. Mixed-policy elegantly balances quality and diversity: the design is analogous to the \(\varepsilon\)-greedy strategy in reinforcement learning.
  4. The method is remarkably simple: it requires no reward model, no PPO training, and no white-box access, resulting in low implementation cost.
  5. The empirical finding of \(K=8\) serves as a practical reference for future work.

Limitations & Future Work

  1. Only multi-choice QA tasks are evaluated: positive/negative sample labels rely on ground-truth answers; open-ended generation tasks would require additional verifiers.
  2. Teacher model is limited to 7B: the performance ceiling with larger teachers (e.g., 70B) remains unexplored.
  3. Coarse-grained \(\phi\) search: only \(\phi \in \{0, 0.5, 1\}\) are tested, without fine-grained or adaptive tuning.
  4. Full-parameter fine-tuning: compatibility with parameter-efficient methods such as LoRA in resource-constrained settings is not validated.
  5. No comparison with other preference optimization methods such as DPO or PPO.
  6. The performance gap between on-policy and off-policy should not be attributed solely to diversity: factors such as the optimization landscape may also play a role.
Related Work

  • ORPO (Hong et al., 2024): The original work merging SFT and preference alignment; this paper extends its application to the distillation setting.
  • On-policy Distillation (Agarwal et al., 2024): On-policy updates are beneficial in white-box settings, but this paper finds mixed-policy superior in black-box scenarios.
  • MAGDI (Chen et al., 2024): Uses multi-teacher contrastive distillation but does not leverage student-generated outputs.
  • DistiLLM (Ko et al., 2024): Demonstrates the importance of student-generated outputs (SGOs) in white-box settings; this paper generalizes the insight to black-box distillation.

Rating

  • ⭐⭐⭐⭐ (4/5)
  • Novelty ⭐⭐⭐⭐: The reformulation of distillation as preference optimization is elegant; mixed-policy is an effective novel contribution.
  • Experimental Thoroughness ⭐⭐⭐⭐: Five datasets, two student models, and component-wise ablations are provided; however, comparisons with larger teachers and more preference optimization baselines are missing.
  • Writing Quality ⭐⭐⭐⭐: Concise and clear, with well-structured algorithm pseudocode.
  • Value ⭐⭐⭐⭐⭐: The method is simple and effective, with low barriers to adoption in practical small-model deployment.
  • Theoretical Depth ⭐⭐⭐: Intuitive explanations (diversity vs. quality) are offered, but formal theoretical analysis is absent.