ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation

Conference: NeurIPS 2025 · arXiv: 2509.25100 · Code: Not released · Area: LLM Alignment · Keywords: Knowledge Distillation, Preference Optimization, ORPO, Mixed Policy, Cross-Architecture Distillation

TL;DR

This paper proposes ORPO-Distill, which reformulates cross-architecture LLM knowledge distillation as a preference optimization problem. The teacher model generates positive reasoning chains while the student model generates negative ones; an ORPO contrastive loss is used for training, augmented by a mixed-policy update strategy for student negative samples. The method consistently outperforms black-box KD baselines across 5 QA benchmarks.

Background & Motivation

Two dominant paradigms in LLM knowledge distillation:

White-box KD: Relies on teacher logits, requiring shared vocabulary/architecture between teacher and student, limiting flexibility.

Black-box KD: Requires only teacher-generated sequences, enabling cross-architecture distillation, but typically limited to CoT distillation.

Three limitations of existing black-box KD:

Single CoT distillation provides limited supervision signal: only one teacher reasoning chain is used.

Absence of contrastive learning: no distinction is made between "good" and "bad" reasoning paths.

Distribution mismatch: training uses teacher sequences, while inference relies on student autoregressive generation, creating a distribution gap.

Three contributions of this paper:

  1. Replacing single CoT with diverse reasoning traces.
  2. Applying ORPO preference optimization for contrastive distillation (teacher as positive, student as negative).
  3. Introducing mixed-policy updates to address the distribution mismatch.

Method

Overall Architecture

Input prompt → Teacher generates \(K\) positive reasoning chains → Student generates \(K\) negative reasoning chains → Construct preference dataset \(\langle\)Prompt, Chosen, Rejected\(\rangle\) → ORPO contrastive training → Mixed-policy negative sample update.
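As a rough illustration of this data flow, the sketch below shows how the \(\langle\)Prompt, Chosen, Rejected\(\rangle\) triples might be assembled. The `PreferencePair` record, the `build_preference_dataset` helper, and the one-to-one pairing of teacher and student chains are illustrative assumptions; the paper's code is not released.

```python
from dataclasses import dataclass

# Hypothetical record for one preference example; field names are illustrative.
@dataclass
class PreferencePair:
    prompt: str    # input question
    chosen: str    # teacher-generated reasoning chain + answer (positive)
    rejected: str  # student-generated reasoning chain + answer (negative)

def build_preference_dataset(prompts, teacher_chains, student_chains):
    """Pair teacher (positive) and student (negative) chains for each prompt.

    teacher_chains / student_chains map a prompt to its K sampled chains.
    Pairing by zip is an assumption; a cross product would also be possible.
    """
    dataset = []
    for prompt in prompts:
        for pos, neg in zip(teacher_chains[prompt], student_chains[prompt]):
            dataset.append(PreferencePair(prompt, pos, neg))
    return dataset
```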

Loss & Training

ORPO (Odds Ratio Preference Optimization) combines SFT and preference alignment into a single objective:

\[L_{SFT} = -\log q_\theta(y_P \mid x)\]
\[L_{OR} = -\log \sigma\left(\log \frac{\text{odds}\;q_\theta(y_P|x)}{\text{odds}\;q_\theta(y_N|x)}\right)\]

where the odds function is defined as:

\[\text{odds}\;q_\theta(y|x) = \frac{q_\theta(y|x)}{1 - q_\theta(y|x)}\]

The final loss is:

\[L_{ORPO} = L_{SFT} + \lambda \cdot L_{OR}\]
  • \(\lambda = 0.1\): mild preference skew (as in human preference alignment).
  • \(\lambda = 1\): strong adaptation (used in this work, to penalize erroneous student generation paths).
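A minimal PyTorch-style sketch of this objective is given below, assuming `chosen_logps` and `rejected_logps` are length-normalized sequence log-probabilities \(\log q_\theta(y\mid x)\) computed elsewhere from the student's token logits. This is one plausible reading of the formulas above, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, lam=1.0):
    """ORPO objective from sequence log-probs (illustrative sketch).

    chosen_logps / rejected_logps: length-normalized log q_theta(y|x),
    shape (batch,). lam is the weight on the odds-ratio term (lambda = 1 here).
    """
    # SFT term: negative log-likelihood of the teacher (chosen) sequence.
    sft_loss = -chosen_logps.mean()

    # log odds(q) = log q - log(1 - q), computed from log q.
    # In practice log q may need clamping away from 0 for numerical stability.
    def log_odds(logp):
        return logp - torch.log1p(-torch.exp(logp))

    # Odds-ratio term: raise the chosen odds relative to the rejected odds.
    log_odds_ratio = log_odds(chosen_logps) - log_odds(rejected_logps)
    or_loss = -F.logsigmoid(log_odds_ratio).mean()

    return sft_loss + lam * or_loss
```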

Key Designs

Preference Dataset Construction

Key finding: Using student-generated negative samples outperforms teacher-generated negatives.

| Setting | MedQA | ARC-C |
| --- | --- | --- |
| (Positive: Teacher, Negative: Teacher) | 41.72 | 45.87 |
| (Positive: Teacher, Negative: Student) | 49.33 | 56.48 |

Diversity Sampling:
  • Temperature sampling with \(\tau = 0.8\), generating \(K\) reasoning chains.
  • Both teacher and student use the Reason-then-Answer format.
  • Rejection sampling: chains with ROUGE-L overlap > 0.80 are discarded.
  • Experiments with \(K \in \{4, 8, 12\}\) show diminishing returns beyond \(K = 8\).
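The sketch below shows one plausible way to implement this sampling-and-rejection step. The simplified whitespace-token ROUGE-L, the placeholder `generate` callable, and the choice to compare each candidate against already accepted chains are assumptions for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence between two token lists."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l_f1(cand, ref):
    """ROUGE-L F1 over whitespace tokens (simplified; no stemming)."""
    c, r = cand.split(), ref.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)

def sample_diverse_chains(generate, prompt, k=8, temperature=0.8,
                          max_overlap=0.80, max_attempts=100):
    """Sample up to K chains, discarding any whose ROUGE-L overlap with an
    already accepted chain exceeds max_overlap."""
    kept = []
    for _ in range(max_attempts):
        if len(kept) == k:
            break
        chain = generate(prompt, temperature=temperature)  # placeholder model call
        if all(rouge_l_f1(chain, prev) <= max_overlap for prev in kept):
            kept.append(chain)
    return kept
```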

Mixed-Policy Update

Three policy variants:

| Policy | Parameter \(\phi\) | Negative Sample Source |
| --- | --- | --- |
| Off-policy | \(\phi = 0\) | Fixed; generated by the initial student model |
| On-policy | \(\phi = 1\) | Re-sampled from the latest checkpoint every epoch |
| Mixed-policy | \(\phi = 0.5\) | Latest checkpoint with probability \(\phi\); otherwise the initial model |

Why does on-policy underperform off-policy? Frequently updated negative samples are of higher quality but lower diversity, narrowing the contrastive margin. Mixed-policy preserves negative sample diversity by anchoring to the initial student model.
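A small sketch of the mixed-policy negative refresh is shown below. The `generate()` method and the `refresh_negatives` helper are hypothetical names, but the \(\phi\)-controlled choice between the frozen initial student and the latest checkpoint follows the table above.

```python
import random

def sample_negative(prompt, initial_student, current_student, phi=0.5):
    """Mixed-policy negative sampling (illustrative sketch, not the paper's code).

    With probability phi the negative chain comes from the latest student
    checkpoint (on-policy); otherwise from the frozen initial student
    (off-policy). phi = 0 and phi = 1 recover the two pure variants.
    """
    policy = current_student if random.random() < phi else initial_student
    return policy.generate(prompt, temperature=0.8)  # hypothetical generate() API

def refresh_negatives(prompts, initial_student, current_student, phi=0.5):
    """Rebuild the negative pool once per epoch under the mixed policy."""
    return {p: sample_negative(p, initial_student, current_student, phi)
            for p in prompts}
```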

Training Configuration:
  • Teacher: InternLM 2.5 7B-Chat
  • Students: InternLM 2.5 1.8B-Chat, TinyLlama 1.1B-Instruct
  • Full-parameter fine-tuning, 5 epochs
  • \(K = 8\), \(\lambda = 1\), \(\phi = 0.5\)

Key Experimental Results

Main Results: Accuracy (%) on 5 Datasets across Student Models

| Setting | MedQA | ARC-C | StrategyQA | OBQA | GSM8K | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| TinyLlama 1.1B | | | | | | |
| Zero-shot CoT | 29.78 | 29.95 | 43.52 | 26.60 | 11.97 | 28.36 |
| Single CoT FT | 32.10 | 32.63 | 46.25 | 29.05 | 31.56 | 34.32 |
| Diverse CoT FT | 34.85 | 35.40 | 47.84 | 33.60 | 36.22 | 37.58 |
| Off-Policy ORPO | 38.95 | 41.20 | 49.77 | 37.45 | 39.45 | 41.36 |
| On-Policy ORPO | 35.11 | 38.01 | 49.24 | 35.60 | 36.88 | 38.97 |
| Mixed-Policy ORPO | 40.25 | 43.55 | 51.25 | 40.10 | 40.72 | 43.17 |
| InternLM 1.8B | | | | | | |
| Zero-shot CoT | 35.82 | 37.12 | 54.15 | 27.40 | 41.02 | 39.10 |
| Single CoT FT | 37.94 | 40.45 | 57.50 | 41.35 | 44.38 | 44.32 |
| Diverse CoT FT | 40.56 | 42.15 | 58.66 | 54.50 | 47.50 | 48.67 |
| Off-Policy ORPO | 49.33 | 56.48 | 59.39 | 53.20 | 51.25 | 53.93 |
| On-Policy ORPO | 43.25 | 49.80 | 58.50 | 52.79 | 47.94 | 50.46 |
| Mixed-Policy ORPO | 50.43 | 59.32 | 61.75 | 55.22 | 52.47 | 55.84 |

Teacher model (InternLM 2.5 7B-Chat) Zero-shot CoT average: 59.58%.

Ablation Study: Contribution of Each Component

| Component | TinyLlama Avg. Gain | InternLM 1.8B Avg. Gain |
| --- | --- | --- |
| Single CoT → Diverse CoT | +3.26 | +4.35 |
| Diverse CoT → Off-Policy ORPO | +3.78 | +5.26 |
| Off-Policy → Mixed-Policy | +1.81 | +1.91 |
| Total Gain (Single CoT → Mixed-Policy) | +8.85 | +11.52 |

Key Findings

  1. Every component contributes: diverse reasoning chains (+3.3–4.4 points), ORPO contrastive training (+3.8–5.3 points), and mixed-policy updates (+1.8–1.9 points) each yield additive improvements.
  2. On-policy underperforms off-policy: validating the hypothesis that negative sample diversity matters more than quality.
  3. Mixed-policy is universally optimal: consistently best across all datasets and student models.
  4. Cross-architecture distillation is effective: both TinyLlama (Llama architecture) and InternLM 1.8B (InternLM architecture) benefit from the InternLM 7B teacher.
  5. Smaller models gain more in relative terms: the 1.1B model improves by +14.81 points on average over Zero-shot CoT (roughly a 52% relative gain), while the 1.8B model improves by +16.74 points (roughly 43% relative).

Highlights & Insights

  1. Reformulating distillation as preference optimization is the central conceptual contribution: black-box KD is fundamentally about teaching the student to "prefer" the teacher's reasoning style, and ORPO provides a natural framework for this.
  2. Student negatives outperforming teacher negatives is an illuminating finding: student-generated negatives directly expose the student's weaknesses, making contrastive training more targeted.
  3. Mixed-policy elegantly balances quality and diversity: the design is analogous to the \(\varepsilon\)-greedy strategy in reinforcement learning.
  4. The method is remarkably simple: it requires no reward model, no PPO training, and no white-box access, resulting in low implementation cost.
  5. The empirical finding of \(K=8\) serves as a practical reference for future work.

Limitations & Future Work

  1. Only multi-choice QA tasks are evaluated: positive/negative sample labels rely on ground-truth answers; open-ended generation tasks would require additional verifiers.
  2. Teacher model is limited to 7B: the performance ceiling with larger teachers (e.g., 70B) remains unexplored.
  3. Coarse-grained \(\phi\) search: only \(\phi \in \{0, 0.5, 1\}\) are tested, without fine-grained or adaptive tuning.
  4. Full-parameter fine-tuning: compatibility with parameter-efficient methods such as LoRA in resource-constrained settings is not validated.
  5. No comparison with other preference optimization methods such as DPO or PPO.
  6. The performance gap between on-policy and off-policy should not be attributed solely to diversity: factors such as the optimization landscape may also play a role.
Related Work

  • ORPO (Hong et al., 2024): The original work merging SFT and preference alignment; this paper extends its application to the distillation setting.
  • On-policy Distillation (Agarwal et al., 2024): On-policy updates are beneficial in white-box settings, but this paper finds mixed-policy superior in black-box scenarios.
  • MAGDI (Chen et al., 2024): Uses multi-teacher contrastive distillation but does not leverage student-generated outputs.
  • DistiLLM (Ko et al., 2024): Demonstrates the importance of student-generated outputs (SGOs) in white-box settings; this paper generalizes the insight to black-box distillation.

Rating

  • ⭐⭐⭐⭐ (4/5)
  • Novelty ⭐⭐⭐⭐: The reformulation of distillation as preference optimization is elegant; mixed-policy is an effective novel contribution.
  • Experimental Thoroughness ⭐⭐⭐⭐: Five datasets, two student models, and component-wise ablations are provided; however, comparisons with larger teachers and more preference optimization baselines are missing.
  • Writing Quality ⭐⭐⭐⭐: Concise and clear, with well-structured algorithm pseudocode.
  • Value ⭐⭐⭐⭐⭐: The method is simple and effective, with low barriers to adoption in practical small-model deployment.
  • Theoretical Depth ⭐⭐⭐: Intuitive explanations (diversity vs. quality) are offered, but formal theoretical analysis is absent.