mR3: Multilingual Rubric-Agnostic Reward Reasoning Models¶

Conference: ICLR 2026
arXiv: 2510.01146
Code: github.com/rubricreward/mr3
Area: LLM Reasoning / Alignment & RLHF
Keywords: multilingual reward models, reasoning-based evaluation, curriculum learning, rubric-based assessment, knowledge distillation

TL;DR¶

This paper introduces mR3, a family of multilingual rubric-agnostic reward reasoning models covering 72 languages. Through systematic data construction (GPT-OSS-120B distillation with difficulty filtering) and curriculum learning, the 14B model surpasses the 120B teacher model and all comparable baselines on multilingual evaluation benchmarks, while supporting point-wise, pair-wise, and binary evaluation paradigms.

Background & Motivation¶

Background: The LLM-as-judge evaluation paradigm has been widely adopted in English settings, but support for non-English languages remains extremely limited. Existing reward models (e.g., ArmoRM, RM-R1) focus almost exclusively on English, while multilingual evaluation models (e.g., m-Prometheus) cover only 6 languages and lack systematic study of training strategies.

Limitations of Prior Work:
- Existing reward models exhibit significant accuracy degradation in non-English settings
- LLMs lack coherent reasoning capability in low-resource languages (LRLs)
- Multilingual evaluation lacks a standardized framework; existing work supports only pair-wise comparison, without point-wise or binary evaluation
- There is no systematic investigation into how to construct high-quality training data for multilingual reward models, including which languages to use for instructions, rubrics, and reasoning chains

Key Challenge: Multilingual evaluation requires both strong reasoning ability and cross-lingual knowledge transfer, yet existing models' reasoning capabilities degrade substantially in non-English languages. How to simultaneously improve both under limited multilingual data conditions remains an open challenge.

Goal:
- Design a multilingual reward reasoning model covering 72 languages
- Systematically study optimal combinations of instruction language, reasoning language, and target language
- Explore data selection and curriculum learning strategies
- Support point-wise, pair-wise, and binary evaluation paradigms

Key Insight: Rather than training conventional scalar reward models, this work trains generative reward models that produce reasoning traces alongside scores, improving evaluation interpretability and cross-lingual robustness through explicit reasoning.

Core Idea: Construct a 72-language alignment dataset (100K samples) via GPT-OSS-120B distillation combined with difficulty filtering and curriculum learning, training generative reward reasoning models that outperform the teacher model despite having far fewer parameters.

Method¶

Overall Architecture¶

Input: task instruction \(t\) + input instance \(i\) + candidate response \(a\) + evaluation rubric \(r\)

Output: reasoning trace + brief explanation \(e\) + score \(s\)

Formally, \(f(x) = y\), where \(x = (t, i, a, r)\) and \(y = (\text{trace}, e, s)\)

Three evaluation modes: point-wise (scoring a single response), pair-wise (comparing two responses), and binary (correct/incorrect judgment).
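The input/output interface above can be sketched as plain data types. This is only an illustration under stated assumptions: the class names, the prompt template, and the concrete score vocabularies (1 to 5 for point-wise, "A"/"B" for pair-wise, "true"/"false" for binary) are hypothetical and not taken from the paper; since the model is rubric-agnostic, the real score range is whatever the rubric \(r\) defines.

```python
from dataclasses import dataclass
from enum import Enum


class EvalMode(Enum):
    POINT_WISE = "point-wise"   # score a single response
    PAIR_WISE = "pair-wise"     # compare two responses
    BINARY = "binary"           # correct / incorrect judgment


@dataclass(frozen=True)
class JudgeInput:
    task_instruction: str    # t
    input_instance: str      # i
    candidate_response: str  # a
    rubric: str              # r


@dataclass(frozen=True)
class JudgeOutput:
    trace: str        # reasoning trace
    explanation: str  # brief explanation e
    score: str        # score s, kept as text because its range depends on the rubric


def build_prompt(x: JudgeInput) -> str:
    """Concatenate x = (t, i, a, r) into one prompt (hypothetical template)."""
    return (
        f"Task: {x.task_instruction}\n"
        f"Input: {x.input_instance}\n"
        f"Response: {x.candidate_response}\n"
        f"Rubric: {x.rubric}\n"
    )


def is_valid_score(mode: EvalMode, score: str) -> bool:
    """Check a score against an assumed vocabulary for each evaluation mode."""
    if mode is EvalMode.POINT_WISE:
        return score in {"1", "2", "3", "4", "5"}  # assumed 1-5 scale
    if mode is EvalMode.PAIR_WISE:
        return score in {"A", "B"}                 # assumed labels
    return score in {"true", "false"}              # assumed labels
```

Keeping the score as text rather than a number reflects the rubric-agnostic framing: the same output type can carry a rating, a preference label, or a binary verdict.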

Key Designs¶

Multilingual Data Construction Pipeline
- Function: Filter and construct a 100K high-quality multilingual training set from over 3 million samples
- Mechanism:
  - The initial data pool is drawn from 6 public datasets (Human Arena Preference, HelpSteer3, MMMLU, HumanEval-XL, MATH-500 Multilingual, PolyGuardMix), covering 125 languages
  - For data lacking rubrics, English rubrics are automatically generated using GPT-4.1
  - GPT-OSS-120B is used to distill outputs under three language strategies: eng-eng (English instruction + English reasoning), tgt-eng (target-language instruction + English reasoning), and tgt-tgt (target-language instruction + target-language reasoning)
  - Quality filtering: Only samples correctly answered under all three strategies are retained
  - Difficulty filtering: Samples that GPT-OSS-20B answers correctly in all 5 of 5 attempts are discarded as "easy"
  - The dataset is downsampled to 100K, prioritizing harder samples
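The filtering steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Sample` fields and function names are hypothetical, and ranking the survivors by the number of failed attempts is an assumed way to "prioritize harder samples", since the summary does not specify the exact rule.

```python
from dataclasses import dataclass
from typing import Dict, List

STRATEGIES = ("eng-eng", "tgt-eng", "tgt-tgt")
NUM_ATTEMPTS = 5  # attempts by the smaller model used for difficulty filtering


@dataclass
class Sample:
    sample_id: str
    teacher_correct: Dict[str, bool]  # strategy -> teacher output matched the gold label
    small_model_correct: List[bool]   # one flag per attempt of the smaller model


def passes_quality_filter(s: Sample) -> bool:
    """Keep only samples answered correctly under all three language strategies."""
    return all(s.teacher_correct.get(k, False) for k in STRATEGIES)


def is_easy(s: Sample) -> bool:
    """A sample is 'easy' if the smaller model is correct in every attempt."""
    return len(s.small_model_correct) == NUM_ATTEMPTS and all(s.small_model_correct)


def num_failures(s: Sample) -> int:
    """Assumed difficulty proxy: how many attempts the smaller model got wrong."""
    return sum(1 for ok in s.small_model_correct if not ok)


def select_training_set(samples: List[Sample], budget: int) -> List[Sample]:
    kept = [s for s in samples if passes_quality_filter(s) and not is_easy(s)]
    kept.sort(key=num_failures, reverse=True)  # hardest first; stable for ties
    return kept[:budget]
```

In this sketch the budget would be 100,000 for the setting described above; the toy usage in a test can use a much smaller value.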
Curriculum Learning Strategy
- Function: Optimize the ordering of training data
- Mechanism: Four orderings are compared: random shuffling, English-first, difficulty ordering, and mixed schemes. Ordering from easy to hard yields the best results, with difficulty measured by prediction consistency and token length
- Design Motivation: Easy samples first establish foundational capabilities, and hard samples then refine them later in training, so that noisy samples do not disrupt early training
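An easy-to-hard ordering of this kind can be sketched with a single sort. The way the two difficulty signals are combined here is an assumption made for illustration (higher prediction consistency counts as easier, and shorter traces break ties); the field names are hypothetical and the paper's exact scoring rule may differ.

```python
from typing import List, NamedTuple


class TrainingExample(NamedTuple):
    example_id: str
    consistency: float  # fraction of sampled predictions that were correct, in [0, 1]
    trace_tokens: int   # token length of the reasoning trace


def curriculum_order(examples: List[TrainingExample]) -> List[TrainingExample]:
    """Sort easy -> hard: high consistency first, then shorter traces first."""
    return sorted(examples, key=lambda e: (-e.consistency, e.trace_tokens))


examples = [
    TrainingExample("c", consistency=0.2, trace_tokens=900),
    TrainingExample("a", consistency=0.8, trace_tokens=300),
    TrainingExample("b", consistency=0.8, trace_tokens=700),
]
ordered_ids = [e.example_id for e in curriculum_order(examples)]
```

The resulting order would then be consumed sequentially by the trainer, without shuffling, so that the curriculum is preserved.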
Multilingual Reasoning Strategy Study
- Function: Systematically compare the effectiveness of eng-eng, tgt-eng, and tgt-tgt reasoning paths
- Key Findings:
  - eng-eng achieves the highest overall performance (most mature English reasoning capability)
  - tgt-eng follows closely; larger models are more robust to non-English prompts
  - tgt-tgt is weakest before fine-tuning but shows the largest gains after fine-tuning, even surpassing the base model's eng-eng performance
- Design Motivation: Target-language reasoning is critical for interpretability and accessibility in low-resource language settings
Training Objective: SFT over RL
- Function: Standard cross-entropy training to maximize the log-likelihood of target tokens
- Core Formula: \(\mathcal{L}_{\text{SFT}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T_i}\log \pi_\theta(y_t^{(i)} | y_{<t}^{(i)}, x^{(i)})\)
- Design Motivation: Experiments show that RL-based methods (e.g., RLVR) are less effective than SFT in this setting
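To make the formula concrete, the loss can be evaluated by hand from per-token probabilities. The sketch below assumes the probabilities \(\pi_\theta(y_t^{(i)} \mid y_{<t}^{(i)}, x^{(i)})\) have already been computed by some model; the function name is hypothetical. Note that, following the formula as written, the sum is normalized by the number of sequences \(N\), not by the total token count.

```python
import math
from typing import List


def sft_loss(token_probs: List[List[float]]) -> float:
    """Negative log-likelihood of target tokens, averaged over N sequences.

    token_probs[i][t] is the probability the model assigns to the t-th
    target token of sequence i, given the input and the preceding tokens.
    """
    n = len(token_probs)
    total_log_prob = 0.0
    for sequence in token_probs:
        for p in sequence:
            total_log_prob += math.log(p)
    return -total_log_prob / n


# Two toy sequences: one with two target tokens, one with a single token.
toy_loss = sft_loss([[0.5, 0.5], [0.25]])
```

For the toy input, both sequences have a log-likelihood of \(\ln 0.25\), so the loss equals \(2 \ln 2\).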

Loss & Training¶

Objective: SFT cross-entropy loss, applied to base models from the Qwen3 family (4B/8B/14B)
Curriculum learning: training data ordered from easy to hard by difficulty
Multilingual alignment: each sample is aligned across all three language strategies

Key Experimental Results¶

Main Results (Pairwise Evaluation Benchmarks, eng-eng Setting)¶

Model	m-RewardBench (23 lang)	RewardBench (1 lang)	MM-Eval (18 lang)	IndoPref (1 lang)
GPT-OSS-120B	89.05	90.30	85.01	72.15
Nemotron-Multi-49B	89.03	89.62	76.27	68.40
R3-Qwen3-14B-LoRA	88.07	91.00	84.04	72.65
mR3-Qwen3-14B	89.18	90.79	86.05	74.14
mR3-Qwen3-8B	88.44	90.50	84.84	72.86
mR3-Qwen3-4B	87.61	89.74	82.62	72.22

mR3-Qwen3-14B surpasses the 120B teacher model with only 14B parameters (+0.13 on m-RewardBench, +0.49 on RewardBench, +1.04 on MM-Eval, +1.99 on IndoPref), while being 3.5× smaller than the 49B Nemotron model.
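The margins over the teacher can be recomputed directly from the table rows. The short check below only uses numbers from the table and the parameter counts named in the text; the variable names are illustrative.

```python
# Scores copied from the main results table (pairwise benchmarks, eng-eng setting).
teacher = {"m-RewardBench": 89.05, "RewardBench": 90.30, "MM-Eval": 85.01, "IndoPref": 72.15}
mr3_14b = {"m-RewardBench": 89.18, "RewardBench": 90.79, "MM-Eval": 86.05, "IndoPref": 74.14}

# Absolute margin of mR3-Qwen3-14B over GPT-OSS-120B on each benchmark.
margins = {name: round(mr3_14b[name] - teacher[name], 2) for name in teacher}

# Parameter-count ratios relative to the 14B model.
ratio_vs_nemotron = 49 / 14   # Nemotron-Multi-49B vs. mR3-Qwen3-14B
ratio_vs_teacher = 120 / 14   # GPT-OSS-120B vs. mR3-Qwen3-14B

for name, delta in margins.items():
    print(f"{name}: +{delta:.2f}")
```

The smallest margin is on m-RewardBench and the largest on IndoPref, so the gain over the teacher is uneven across benchmarks.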

Ablation Study¶

Configuration	Key Finding
Curriculum: easy→hard vs. random	Easy→hard achieves the best results on HelpSteer3 validation set
Data scale: 50K vs. 100K vs. 200K	100K is the sweet spot; 200K yields no significant improvement
Language strategy: eng-eng vs. tgt-tgt	eng-eng achieves higher absolute scores, but tgt-tgt shows the largest gains after fine-tuning
Difficulty filtering: with vs. without	Removing easy samples significantly improves model performance
Training method: SFT vs. RLVR	SFT consistently outperforms RL-based methods on this task

Key Findings¶

Small model, large impact: The 14B model outperforms the 120B teacher and the 49B competitor on all four pairwise benchmarks in the main results table, suggesting that high-quality data and a well-chosen training strategy can matter more than parameter count
Step-change improvement of tgt-tgt: The base model's target-language reasoning is the weakest, yet it shows the largest gains after fine-tuning, even surpassing the base model's eng-eng performance — indicating that multilingual training effectively "activates" cross-lingual reasoning capabilities
Downstream DPO validation: Using mR3-Qwen3-14B as the reward model for DPO on Qwen3-30B-A3B improves the English win rate on m-ArenaHard-v2.0 from 49.1% to 57.3%
Human evaluation: 20 native speakers evaluated across 12 languages; mR3's reasoning traces substantially outperform the Qwen3 baseline on factuality (2.78 vs. 2.06) and logical coherence (2.67 vs. 2.05)
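For the downstream DPO validation mentioned above, one plausible way to use a pair-wise judge is to turn its verdicts into preference records. The summary does not describe how the preference data was actually built, so everything below is an assumption for illustration: the judge signature, the "A"/"B" verdict labels, and the toy judge are hypothetical. The `prompt`/`chosen`/`rejected` layout is a record format commonly used by DPO training code.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical judge signature: (prompt, response_a, response_b) -> "A" or "B".
PairwiseJudge = Callable[[str, str, str], str]


def build_dpo_pairs(
    candidates: Dict[str, Tuple[str, str]],
    judge: PairwiseJudge,
) -> List[Dict[str, str]]:
    """Convert pair-wise verdicts into prompt/chosen/rejected records."""
    pairs = []
    for prompt, (resp_a, resp_b) in candidates.items():
        verdict = judge(prompt, resp_a, resp_b)
        if verdict not in ("A", "B"):
            continue  # skip verdicts that could not be parsed
        chosen, rejected = (resp_a, resp_b) if verdict == "A" else (resp_b, resp_a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


def toy_judge(prompt: str, resp_a: str, resp_b: str) -> str:
    """Stand-in for a reward reasoning model: prefers the longer response."""
    return "A" if len(resp_a) >= len(resp_b) else "B"


dpo_pairs = build_dpo_pairs(
    {"Explain overfitting.": ("It is bad.", "The model memorizes noise in the training data.")},
    toy_judge,
)
```

In a real setup the toy judge would be replaced by a call to the reward model, with the score parsed from its generated output.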

Highlights & Insights¶

The unified 72-language training framework represents a major advance in multilingual reward modeling, far exceeding the prior best of 6 languages in m-Prometheus. The three-strategy alignment data design (eng-eng/tgt-eng/tgt-tgt) is particularly elegant, enabling controlled research while covering realistic usage scenarios
Easy-to-hard curriculum learning is effective for reward model training: This finding is directly transferable to the training of other generative evaluation models
Data quality > data scale: A 14B model trained on 100K samples curated from a pool of over 3 million outperforms its 120B teacher, and scaling the training set to 200K yields no significant improvement, underscoring the importance of multi-stage filtering (three-strategy consistency + difficulty filtering)
Interpretability value of target-language reasoning: Although English reasoning achieves higher accuracy, target-language reasoning is critical for accessibility and user trust in low-resource language settings, and fine-tuning effectively closes the gap

Limitations & Future Work¶

The distillation outputs from GPT-OSS-120B carry inherent language bias (strongest in English), which propagates to mR3
Coverage of low-resource languages among the 72 languages may be uneven, as the source datasets are biased toward high- and medium-resource languages
Training relies solely on SFT; the potential of RL-based post-training (e.g., GRPO) is not fully explored
Human evaluation covers only 12 of the 72 training languages, although this is already more than comparable work
Future directions: Specialized data augmentation for low-resource languages (e.g., high-resource → low-resource translation with back-translation), and exploring whether online RL fine-tuning can yield further improvements

Related Work & Insights¶

vs. R3 (Anugraha et al., 2025): R3 is the English-only predecessor of mR3, trained exclusively on English data. mR3 inherits its rubric-agnostic framework and extends it to 72 languages, outperforming R3 on multilingual benchmarks (m-RewardBench: 89.18 vs. 88.07; MM-Eval: 86.05 vs. 84.04), while R3 retains a slight edge on the English-only RewardBench (91.00 vs. 90.79)
vs. m-Prometheus (Pombal et al., 2025): Covers only 6 languages with 480K training samples; m-RewardBench score of 79.51 vs. mR3's 89.18, a substantial margin
vs. Nemotron-Multilingual-49B (Wang et al., 2025): With 49B parameters, it supports pair-wise evaluation in only 13 languages; mR3-14B surpasses it on all four pairwise benchmarks with 1/3.5 the parameters and far broader language coverage (72 vs. 13 languages)

Rating¶

Novelty: ⭐⭐⭐⭐ The unified 72-language framework and three-strategy alignment data construction are novel, though the model architecture and training method (SFT) are relatively standard
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 7 benchmarks, multiple ablations, curriculum learning comparisons, DPO downstream validation, and a 20-annotator 12-language human evaluation — exceptionally comprehensive
Writing Quality: ⭐⭐⭐⭐ Clear structure with rich tables and figures, though the paper is lengthy (extensive appendix) and core contributions require extraction from a large volume of experiments
Value: ⭐⭐⭐⭐⭐ Addresses a critical gap in multilingual reward modeling with direct practical impact on non-English LLM alignment

Key Experimental Results¶

Model	mR3-RewardBench	Size
GPT-OSS-120B	~88%	120B
mR3-Qwen-14B	88.46%	14B (9× smaller)

Human evaluators (20 annotators, 12 languages) prefer the reasoning quality of mR3.

Rating¶

Novelty: ⭐⭐⭐⭐ First large-scale multilingual reward reasoning model
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 72 languages + human evaluation
Value: ⭐⭐⭐⭐⭐ Foundational infrastructure for multilingual LLM alignment
