CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning
Conference: ACL 2026
arXiv: 2601.05858
Code: https://github.com/alexandra-dragomir/CLewR
Area: Optimization & Theory
Keywords: Curriculum Learning, Preference Optimization, Machine Translation, Catastrophic Forgetting, DPO
TL;DR
The paper proposes CLewR (Curriculum Learning with Restarts), a strategy that sorts preference data from easy to hard and restarts the curriculum at every epoch of preference optimization training. This mitigates catastrophic forgetting and consistently improves machine translation quality across multiple model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization algorithms (DPO, CPO, ARPO).
Background & Motivation
Background: Large language models demonstrate strong performance in zero-shot multilingual machine translation. Subsequent work further improves translation quality through preference optimization (e.g., DPO, CPO, ARPO), contrasting high-quality translations with low-quality ones.
Limitations of Prior Work: Preference optimization methods ignore the order in which samples are presented during training, a factor that significantly affects training effectiveness. Existing curriculum learning work (e.g., CurriDPO) sorts data by difficulty but does not address catastrophic forgetting during training: easy samples learned early are forgotten in later stages.
Key Challenge: Traditional curriculum learning arranges data from easy to hard and traverses it only once, so knowledge from easy samples is forgotten as the model concentrates on hard samples late in training; yet abandoning the curriculum forfeits its benefits.
Goal: Design a data-level curriculum strategy that keeps the benefits of curriculum learning while mitigating catastrophic forgetting, applicable to preference optimization training for machine translation.
Key Insight: Sort the data from easy to hard, but restart the ordering at every epoch, so that each epoch traverses all samples completely, from easiest to hardest.
Core Idea: Because every epoch begins again with the easy samples, the restarted curriculum inherently mitigates catastrophic forgetting while retaining the easy-to-hard structure.
Method
Overall Architecture
CLewR consists of two phases: (1) a sorting phase, which ranks all training triplets by the similarity between the chosen and rejected translations; and (2) a training phase, which traverses all data in the fixed easy-to-hard order within each epoch, with no random shuffling. Sorting is completed once before training begins, and every epoch repeats the same ordering.
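A minimal sketch of this two-phase structure in PyTorch-style code; `difficulty_score`, `loss_fn`, and the hyperparameters are illustrative placeholders rather than the paper's actual implementation:

```python
from torch.utils.data import DataLoader

def clewr_train(model, triplets, difficulty_score, loss_fn, optimizer,
                num_epochs=3, batch_size=8):
    """CLewR sketch: sort once, then replay the same easy-to-hard order
    every epoch. All names here are illustrative, not from the paper."""
    # Phase 1 (sorting): score each (x, y_w, y_l) triplet once before
    # training; a lower score means an easier pair, so ascending order
    # is exactly the easy-to-hard curriculum.
    ordered = sorted(triplets, key=difficulty_score)

    # Phase 2 (training): shuffle=False preserves the curriculum order,
    # and re-iterating the loader each epoch is the "restart": every
    # epoch begins with the easiest samples again.
    loader = DataLoader(ordered, batch_size=batch_size, shuffle=False)
    for _ in range(num_epochs):
        for batch in loader:
            loss = loss_fn(model, batch)  # native DPO/CPO/ARPO loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```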
Key Designs
- Multi-Metric Difficulty Scoring:
  - Function: Computes a difficulty score for each preference triplet \((x, y_w, y_l)\)
  - Mechanism: Computes BLEU, COMET-22, and METEOR between the chosen translation \(y_w\) and the rejected translation \(y_l\), normalizes each metric, and averages them into a similarity score \(s\); high-similarity pairs (small gap between chosen and rejected) are hard, low-similarity pairs are easy (see the scoring sketch after this list)
  - Design Motivation: A single metric may be biased; combining three complementary translation evaluation metrics yields a more robust difficulty estimate
- Epoch-Level Restart Mechanism:
  - Function: Restarts the easy-to-hard curriculum ordering at each epoch
  - Mechanism: Data is sorted once before training; each epoch then traverses all data in the same easy-to-hard order (no shuffling), so the model revisits easy samples at the start of every epoch
  - Design Motivation: Traversing a curriculum only once invites catastrophic forgetting; restarting it each epoch lets the model reinforce easy-sample knowledge every round
- CLewR-z Variant (ARPO Distance-Based Sorting):
  - Function: Uses ARPO's adaptive distance function \(z_\theta\) instead of external metrics for sorting
  - Mechanism: ARPO's \(z_\theta(y_w, y_l)\) encodes the log-likelihood gap between the chosen and rejected responses, and \(s = -z_\theta\) serves as the curriculum score; an enhanced ARPO variant is also proposed that combines \(z_\theta\) with weighted BLEU and COMET distances (see the note after the sketch below)
  - Design Motivation: Sorting by the model's internal distance signal rather than external metrics aligns the curriculum ordering more closely with the optimization objective
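The exact normalization in the paper may differ; below is a minimal sketch of the multi-metric similarity score, using `sacrebleu` for BLEU, NLTK's METEOR, and a hypothetical `comet_fn` standing in for a COMET-22 scorer:

```python
import sacrebleu
from nltk.translate.meteor_score import meteor_score  # needs NLTK wordnet data

def similarity_score(y_w: str, y_l: str, comet_fn) -> float:
    """Similarity between the chosen (y_w) and rejected (y_l)
    translations; a high score marks a hard preference pair. comet_fn
    is a placeholder for a COMET-22 scorer (e.g., loaded via the
    unbabel-comet package), assumed to return a score in [0, 1]."""
    # Score the rejected translation against the chosen one, treating
    # the chosen translation as the reference; rescale BLEU to [0, 1].
    bleu = sacrebleu.sentence_bleu(y_l, [y_w]).score / 100.0
    meteor = meteor_score([y_w.split()], y_l.split())
    comet = comet_fn(y_l, y_w)
    return (bleu + meteor + comet) / 3.0
```

Sorting triplets in ascending order of this score then yields the easy-to-hard curriculum. For CLewR-z, the same sorting key would instead be \(s = -z_\theta(y_w, y_l)\), computed from the model's own log-likelihoods rather than from external metrics.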
Loss & Training
CLewR is compatible with three preference optimization algorithms: DPOP (an enhanced DPO), CPO (contrastive preference optimization with behavior cloning), and ARPO (adaptive rejection preference optimization). Training uses each algorithm's native loss function; CLewR changes only the order in which the data is presented.
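For context, a minimal sketch of the plain DPO objective that DPOP extends (the DPOP penalty term and the CPO/ARPO objectives are omitted); CLewR leaves the loss itself untouched and changes only the order in which batches reach it:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Plain DPO loss from sequence log-probabilities of the chosen (w)
    and rejected (l) translations under the policy and a frozen
    reference model; DPOP additionally penalizes drops in the chosen
    translation's log-probability below the reference's."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```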
Key Experimental Results
Main Results
Gemma2-9B BLEU scores on 6 Romance languages (en→xx direction)
| Metric | DPOP | +CurriDPO | +CLewR | CPO | +CLewR | ARPO | +CLewR |
|---|---|---|---|---|---|---|---|
| BLEU | 23.26 | 21.81 | 22.35 | 33.53 | 36.24 | 35.37 | 36.63 |
Qwen2.5-7B BLEU scores on 6 Romance languages (en→xx direction)
| Metric | DPOP | +CLewR | CPO | +CLewR | ARPO | +CLewR |
|---|---|---|---|---|---|---|
| BLEU | 24.43 | 23.59 | 27.68 | 30.05 | 30.41 | 31.56 |
Ablation Study
| Config | Note | Effect |
|---|---|---|
| CLewR (multi-metric sorting) | BLEU+COMET+METEOR combined sorting | Best |
| CLewR-z (model distance sorting) | Using ARPO internal distance sorting | Near-best |
| ARPO-z'-V1/V2 (enhanced distance) | Enhanced distance function | Further improves baseline ARPO |
| CurriDPO | Competing method | Worse than CLewR |
Key Findings
- CLewR consistently improves performance on CPO and ARPO; effects on DPOP vary by model
- On Gemma2, the best CLewR + ARPO-z'-V2 configuration reaches BLEU 37.45 (en→xx), a +2.08 improvement over baseline ARPO
- CLewR outperforms CurriDPO across all model families and most preference optimization algorithms
- Enhanced ARPO (combining external metrics with distance function) further improves baseline ARPO performance
Highlights & Insights
- The method is extremely simple: it changes only the data presentation order, without touching the model architecture or loss function, yet yields consistent performance improvements
- The restart mechanism is an elegant resolution of the tension between curriculum ordering and catastrophic forgetting
- CLewR is broadly applicable and integrates seamlessly with DPO, CPO, ARPO, and other preference optimization algorithms
Limitations & Future Work
- Gains on DPOP are less stable than on CPO/ARPO, possibly due to DPOP's dependence on a reference model
- Difficulty scores are computed once before training, so the curriculum cannot adapt dynamically as the model learns
- Only validated on machine translation tasks; effectiveness on other preference optimization scenarios (e.g., dialogue, summarization) remains to be verified
Related Work & Insights
- vs CurriDPO: CurriDPO uses iterative curriculum but does not address catastrophic forgetting; CLewR is more effective through epoch-level restarts
- vs X-ALMA: X-ALMA is a strong ARPO-based baseline; CLewR further improves upon ARPO
Rating
- Novelty: ⭐⭐⭐⭐ The restart mechanism, though simple, effectively resolves the tension between curriculum learning and catastrophic forgetting
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three model families, three preference optimization algorithms, multilingual evaluation, comprehensive coverage
- Writing Quality: ⭐⭐⭐⭐ Clear motivation, concise method, detailed experiments
- Recommendation: ⭐⭐⭐⭐ Provides a simple and effective curriculum learning strategy for preference optimization training