ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation¶
Conference: NeurIPS 2025 | arXiv: 2509.25100 | Code: Not released | Area: LLM Alignment | Keywords: Knowledge Distillation, Preference Optimization, ORPO, Mixed Policy, Cross-Architecture Distillation
TL;DR¶
This paper proposes ORPO-Distill, which reformulates cross-architecture LLM knowledge distillation as a preference optimization problem. The teacher model generates positive reasoning chains while the student model generates negative ones; an ORPO contrastive loss is used for training, augmented by a mixed-policy update strategy for student negative samples. The method consistently outperforms black-box KD baselines across 5 QA benchmarks.
Background & Motivation¶
Two dominant paradigms in LLM knowledge distillation:
White-box KD: Relies on teacher logits, requiring shared vocabulary/architecture between teacher and student, limiting flexibility.
Black-box KD: Requires only teacher-generated sequences, enabling cross-architecture distillation, but typically limited to CoT distillation.
Three limitations of existing black-box KD:
Single CoT distillation provides limited supervision signal: only one teacher reasoning chain is used.
Absence of contrastive learning: no distinction is made between "good" and "bad" reasoning paths.
Distribution mismatch: training uses teacher sequences, while inference relies on student autoregressive generation, creating a distribution gap.
Three contributions of this paper:
- Replacing single CoT with diverse reasoning traces.
- Applying ORPO preference optimization for contrastive distillation (teacher as positive, student as negative).
- Introducing mixed-policy updates to address the distribution mismatch.
Method¶
Overall Architecture¶
Input prompt → Teacher generates \(K\) positive reasoning chains → Student generates \(K\) negative reasoning chains → Construct preference dataset \(\langle\)Prompt, Chosen, Rejected\(\rangle\) → ORPO contrastive training → Mixed-policy negative sample update.
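To make the dataset-construction step concrete, here is a minimal sketch of how the \(\langle\)Prompt, Chosen, Rejected\(\rangle\) triples could be assembled from the \(K\) teacher and \(K\) student chains. The paper does not state whether chains are paired exhaustively (\(K \times K\)) or one-to-one, so the cross-product pairing below is an illustrative assumption.

```python
from itertools import product
from typing import Dict, List


def build_preference_triples(
    prompt: str,
    teacher_chains: List[str],   # K positive reasoning chains from the teacher
    student_chains: List[str],   # K negative reasoning chains from the student
) -> List[Dict[str, str]]:
    """Pair each teacher (chosen) chain with each student (rejected) chain.

    The exhaustive K x K pairing is an assumption for illustration, not
    necessarily the authors' construction.
    """
    return [
        {"prompt": prompt, "chosen": pos, "rejected": neg}
        for pos, neg in product(teacher_chains, student_chains)
    ]
```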
Loss & Training¶
ORPO (Odds Ratio Preference Optimization) combines SFT and preference alignment into a single objective. Its odds-ratio term pushes the student to assign higher odds to the chosen (teacher) chain \(y_w\) than to the rejected (student) chain \(y_l\):

\[ \mathcal{L}_{\mathrm{OR}} = -\log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right) \]

where the odds function is defined as:

\[ \mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)} \]

The final loss is:

\[ \mathcal{L}_{\mathrm{ORPO}} = \mathcal{L}_{\mathrm{SFT}} + \lambda \cdot \mathcal{L}_{\mathrm{OR}} \]
- \(\lambda = 0.1\): mild preference skew (as in human preference alignment).
- \(\lambda = 1\): strong adaptation (used in this work, to penalize erroneous student generation paths).
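A minimal PyTorch sketch of this objective, assuming length-normalized sequence log-probabilities for the chosen (teacher) and rejected (student) chains have already been gathered under the current student model. The function name and tensor handling are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def orpo_loss(chosen_logps, rejected_logps, chosen_nll, lam=1.0):
    """Sketch of the ORPO objective used for contrastive distillation.

    chosen_logps / rejected_logps: length-normalized log P_theta(y|x) of the
    teacher (chosen) and student (rejected) chains under the current student.
    chosen_nll: token-level NLL on the chosen chain (the SFT term).
    """
    # log odds(y|x) = log p - log(1 - p), computed in log space for stability.
    log_odds_chosen = chosen_logps - torch.log1p(
        -torch.exp(chosen_logps).clamp(max=1.0 - 1e-6)
    )
    log_odds_rejected = rejected_logps - torch.log1p(
        -torch.exp(rejected_logps).clamp(max=1.0 - 1e-6)
    )
    # Odds-ratio term: raise the odds of the chosen chain over the rejected one.
    ratio_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    # Final objective: L_SFT + lambda * L_OR (this paper sets lambda = 1).
    return chosen_nll + lam * ratio_loss.mean()
```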
Key Designs¶
Preference Dataset Construction
Key finding: Using student-generated negative samples outperforms teacher-generated negatives.
| Setting | MedQA | ARC-C |
|---|---|---|
| (Positive-Teacher, Negative-Teacher) | 41.72 | 45.87 |
| (Positive-Teacher, Negative-Student) | 49.33 | 56.48 |
Diversity Sampling
- Temperature sampling with \(\tau = 0.8\), generating \(K\) reasoning chains.
- Both teacher and student use the Reason-then-Answer format.
- Rejection sampling: chains with ROUGE-L overlap > 0.80 are discarded (a sketch follows below).
- Experiments with \(K \in \{4, 8, 12\}\) show diminishing returns beyond \(K = 8\).
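A minimal sketch of the diversity filter, using Google's rouge-score package as one possible ROUGE-L implementation; the library choice and the greedy keep-or-discard order are assumptions, since the paper only specifies the 0.80 overlap threshold.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)


def keep_diverse(chains, max_overlap=0.80):
    """Keep a sampled chain only if its ROUGE-L F1 against every
    already-kept chain is at most max_overlap (greedy filtering)."""
    kept = []
    for chain in chains:
        too_similar = any(
            scorer.score(prev, chain)["rougeL"].fmeasure > max_overlap
            for prev in kept
        )
        if not too_similar:
            kept.append(chain)
    return kept
```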
Mixed-Policy Update
Three policy variants:
| Policy | Parameter \(\phi\) | Negative Sample Source |
|---|---|---|
| Off-policy | \(\phi = 0\) | Fixed; generated by the initial student model |
| On-policy | \(\phi = 1\) | Re-sampled from the latest checkpoint every epoch |
| Mixed-policy | \(\phi = 0.5\) | Latest checkpoint with probability \(\phi\); otherwise initial model |
Why does on-policy underperform off-policy? Frequently updated negative samples are of higher quality but lower diversity, narrowing the contrastive margin. Mixed-policy preserves negative sample diversity by anchoring to the initial student model.
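A minimal sketch of the mixed-policy sampling rule. The granularity of the coin flip (per negative sample, per example, or per epoch) is not specified, so the per-sample flip below is an assumption; note that \(\phi = 0\) and \(\phi = 1\) recover the off-policy and on-policy variants from the table.

```python
import random
from typing import Callable, List


def sample_negatives(
    prompt: str,
    init_policy: Callable[[str], str],     # frozen initial student sampler
    current_policy: Callable[[str], str],  # latest student checkpoint sampler
    phi: float = 0.5,
    k: int = 8,
) -> List[str]:
    """Mixed-policy negative sampling (sketch): each negative is drawn from the
    latest checkpoint with probability phi, otherwise from the frozen initial
    student, keeping negatives diverse while partially tracking the policy."""
    return [
        (current_policy if random.random() < phi else init_policy)(prompt)
        for _ in range(k)
    ]
```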
Training Configuration
- Teacher: InternLM 2.5 7B-Chat
- Students: InternLM 2.5 1.8B-Chat, TinyLlama 1.1B-Instruct
- Full-parameter fine-tuning, 5 epochs
- \(K = 8\), \(\lambda = 1\), \(\phi = 0.5\)
Key Experimental Results¶
Main Results: Accuracy (%) on 5 Datasets across Student Models¶
| Setting | MedQA | ARC-C | StrategyQA | OBQA | GSM8K | Avg. |
|---|---|---|---|---|---|---|
| **TinyLlama 1.1B** | | | | | | |
| Zero-shot CoT | 29.78 | 29.95 | 43.52 | 26.60 | 11.97 | 28.36 |
| Single CoT FT | 32.10 | 32.63 | 46.25 | 29.05 | 31.56 | 34.32 |
| Diverse CoT FT | 34.85 | 35.40 | 47.84 | 33.60 | 36.22 | 37.58 |
| Off-Policy ORPO | 38.95 | 41.20 | 49.77 | 37.45 | 39.45 | 41.36 |
| On-Policy ORPO | 35.11 | 38.01 | 49.24 | 35.60 | 36.88 | 38.97 |
| Mixed-Policy ORPO | 40.25 | 43.55 | 51.25 | 40.10 | 40.72 | 43.17 |
| **InternLM 1.8B** | | | | | | |
| Zero-shot CoT | 35.82 | 37.12 | 54.15 | 27.40 | 41.02 | 39.10 |
| Single CoT FT | 37.94 | 40.45 | 57.50 | 41.35 | 44.38 | 44.32 |
| Diverse CoT FT | 40.56 | 42.15 | 58.66 | 54.50 | 47.50 | 48.67 |
| Off-Policy ORPO | 49.33 | 56.48 | 59.39 | 53.20 | 51.25 | 53.93 |
| On-Policy ORPO | 43.25 | 49.80 | 58.50 | 52.79 | 47.94 | 50.46 |
| Mixed-Policy ORPO | 50.43 | 59.32 | 61.75 | 55.22 | 52.47 | 55.84 |
Teacher model (InternLM 2.5 7B-Chat) Zero-shot CoT average: 59.58%.
Ablation Study: Contribution of Each Component¶
| Component | TinyLlama Avg. Gain | InternLM 1.8B Avg. Gain |
|---|---|---|
| Single CoT → Diverse CoT | +3.26 | +4.35 |
| Diverse CoT → Off-Policy ORPO | +3.78 | +5.26 |
| Off-Policy → Mixed-Policy | +1.81 | +1.91 |
| Total Gain (Single CoT → Mixed-Policy) | +8.85 | +11.52 |
Key Findings¶
- Every component contributes: diverse reasoning chains (+3–4 points), ORPO contrastive training (+4–5 points), and mixed-policy updates (about +2 points) yield additive improvements.
- On-policy underperforms off-policy, supporting the hypothesis that negative-sample diversity matters more than quality.
- Mixed-policy is universally optimal: consistently best across all datasets and student models.
- Cross-architecture distillation is effective: both TinyLlama (Llama architecture) and InternLM 1.8B (InternLM architecture) benefit from the InternLM 7B teacher.
- Smaller models gain more in relative terms: the 1.1B student improves by +14.81 points over Zero-shot CoT (roughly a 52% relative gain), while the 1.8B student improves by +16.74 points (roughly a 43% relative gain).
Highlights & Insights¶
- Reformulating distillation as preference optimization is the central conceptual contribution: black-box KD is fundamentally about teaching the student to "prefer" the teacher's reasoning style, and ORPO provides a natural framework for this.
- Student negatives outperforming teacher negatives is an illuminating finding: student-generated negatives directly expose the student's weaknesses, making contrastive training more targeted.
- Mixed-policy elegantly balances quality and diversity: the design is analogous to the \(\varepsilon\)-greedy strategy in reinforcement learning.
- The method is remarkably simple: it requires no reward model, no PPO training, and no white-box access, resulting in low implementation cost.
- The empirical finding of \(K=8\) serves as a practical reference for future work.
Limitations & Future Work¶
- Only multi-choice QA tasks are evaluated: positive/negative sample labels rely on ground-truth answers; open-ended generation tasks would require additional verifiers.
- Teacher model is limited to 7B: the performance ceiling with larger teachers (e.g., 70B) remains unexplored.
- Coarse-grained \(\phi\) search: only \(\phi \in \{0, 0.5, 1\}\) are tested, without fine-grained or adaptive tuning.
- Full-parameter fine-tuning: compatibility with parameter-efficient methods such as LoRA in resource-constrained settings is not validated.
- No comparison with other preference optimization methods such as DPO or PPO.
- The performance gap between on-policy and off-policy should not be attributed solely to diversity: factors such as the optimization landscape may also play a role.
Related Work & Insights¶
- ORPO (Hong et al., 2024): The original work merging SFT and preference alignment; this paper extends its application to the distillation setting.
- On-policy Distillation (Agarwal et al., 2024): On-policy updates are beneficial in white-box settings, but this paper finds mixed-policy superior in black-box scenarios.
- MAGDI (Chen et al., 2024): Uses multi-teacher contrastive distillation but does not leverage student-generated outputs.
- DistillM (Ko et al., 2024): Demonstrates the importance of student-generated outputs (SGOs) in white-box settings; this paper generalizes the insight to black-box distillation.
Rating¶
- ⭐⭐⭐⭐ (4/5)
- Novelty ⭐⭐⭐⭐: The reformulation of distillation as preference optimization is elegant; mixed-policy is an effective novel contribution.
- Experimental Thoroughness ⭐⭐⭐⭐: Five datasets, two student models, and component-wise ablations are provided; however, comparisons with larger teachers and more preference optimization baselines are missing.
- Writing Quality ⭐⭐⭐⭐: Concise and clear, with well-structured algorithm pseudocode.
- Value ⭐⭐⭐⭐⭐: The method is simple and effective, with low barriers to adoption in practical small-model deployment.
- Theoretical Depth ⭐⭐⭐: Intuitive explanations (diversity vs. quality) are offered, but formal theoretical analysis is absent.