Learning to Route Languages for Multilingual Policy Optimization

Conference: ICML 2026
arXiv: 2605.25360
Code: https://github.com/Guochry/LRPO (Available)
Area: Alignment RLHF / Multilingual LLM / Online Policy Optimization
Keywords: Multilingual RL, GRPO, Language Router, Multi-Armed Bandit, Cross-lingual Reward Calibration

TL;DR

This paper proposes LRPO (Language-Routed Policy Optimization), which treats "which language to use for rollout generation" as a learnable variable. Using a contextual bandit-style language router, the method selects the most informative language combination for each training sample under a fixed rollout budget. Cross-lingual similarity rewards, calibrated via offline estimation and online adjustment, put multilingual rollouts on a common scale for GRPO. LRPO consistently outperforms GRPO and various dominant-language baselines across Qwen/Llama/Gemma backbones on five multilingual benchmarks.

Background & Motivation

Background: Existing RL approaches to extend LLMs to multilingual scenarios primarily follow two paths. One directly applies GRPO (shao2024deepseekmath): sampling a set of rollouts in the original language for each training question, scoring them with a reward model, and performing policy updates after intra-group normalization. The other explicitly constructs cross-lingual preference pairs (MAPO/LIDR/MPO), treating English (or other "dominant language") responses as naturally higher-quality anchors to which other languages are aligned.

Limitations of Prior Work: The GRPO approach fixes each question to a single language, leaving the decision of "which language can more accurately answer this question" to the model's internal implicit mechanisms, thus wasting complementary knowledge encoded in different languages. The dominant-language approach assumes English is always a superior supervision source, but this assumption often fails for questions with strong regional knowledge or cultural context—for example, on a question about "Greek etiquette," an Arabic rollout might be closer to the correct answer than English/Chinese rollouts.

Key Challenge: Given a limited rollout budget (\(K\) samples per question), "which languages to sample" is itself an online exploration-exploitation decision problem. However, existing methods either make no decision (monolingual) or use a fixed and often incorrect prior (English-first).

Goal: Within a fixed budget of \(K\) rollouts, allow the model to learn "which languages should be sampled more for which themes/regions," and combine these multilingual rollouts into a unified GRPO framework for policy updates.

Key Insight: Explicitly model "language selection" as a contextual multi-armed bandit, where each question's theme \(t(x)\) and optional region \(g(x)\) serve as the context, and each language is an arm. The arm's return is the average GRPO reward generated by that language under that context. Cross-lingual similarity must also be calibrated before it is used as a reward signal; otherwise, the differing scales of raw similarity across language pairs distort intra-group preferences.

Core Idea: Use a lightweight "Theme × Language + Region × Language" dual-matrix router for online language selection learning. Use offline statistics + online calibration to bring multilingual similarity rewards to the same scale, then feed them back into GRPO for joint optimization.

Method

LRPO extends the traditional "Sample → Score → Update" three-stage process of GRPO into a four-stage process: the router first decides which languages to use for the current round, the policy generates rollouts in the specified languages, rewards are cross-lingually calibrated, and finally, GRPO updates the policy while the router is updated via EMA.

Overall Architecture

  • Input: Training question \(x\) (original language \(\ell_x\), theme \(t(x)\), optional region \(g(x)\)), policy \(\pi_\theta\), routing parameters \((\mathbf{A},\mathbf{B})\), rollout budget \(K\), on-policy quota \(K_{\text{on}}\).
  • Routing Phase: Synthesize logits from a theme matrix \(\mathbf{A}_{t(x)}\) and (if present) a region matrix \(\mathbf{B}_{g(x)}\). Obtain a language distribution \(p(\ell\mid x)\) via softmax with temperature \(\tau\). Reserve \(K_{\text{on}}\) samples for \(\ell_x\) (to preserve on-policy nature); sample the remaining \(K-K_{\text{on}}\) samples according to \(p\) with \(\epsilon\)-greedy to ensure minimum exploration.
  • Rollout Phase: Use language tags or target-language system prompts to guide \(\pi_\theta\) to generate \(\{y_k\}\) according to the sampled \(\{\ell_k\}\).
  • Reward Phase: Compute cross-lingual semantic similarity between each rollout and the reference answer using mmBERT, followed by mean-based or quantile-based calibration at the language-pair level. This is multiplied by an indicator for "whether the target language was actually generated" to form the final reward.
  • Update Phase: Perform GRPO gradient steps after intra-group normalization. Every \(M\) steps, update \(\mathbf{A}, \mathbf{B}\) using EMA with average rewards aggregated by \((t,g,\ell)\) buckets.
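The routing phase described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `route_languages`, the dictionary layout of the logit rows, and the default values are hypothetical, and \(\epsilon\)-greedy is assumed to mean "with probability \(\epsilon\) pick a language uniformly, otherwise sample from \(p\)".

```python
import math
import random

def route_languages(A_row, B_row, orig_lang, K, K_on, tau=1.0, eps=0.1, rng=None):
    """Pick K rollout languages for one question (illustrative sketch).

    A_row: dict language -> theme logit A[t(x), l]
    B_row: dict language -> region logit B[g(x), l], or None if no region
    """
    rng = rng or random.Random()
    langs = sorted(A_row)
    # Sum theme and (optional) region logits, then divide by the temperature.
    logits = [(A_row[l] + (B_row[l] if B_row is not None else 0.0)) / tau
              for l in langs]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Reserve K_on rollouts for the question's original language.
    chosen = [orig_lang] * K_on
    for _ in range(K - K_on):
        if rng.random() < eps:
            # Assumed form of exploration: a uniform draw over all languages.
            chosen.append(rng.choice(langs))
        else:
            chosen.append(rng.choices(langs, weights=probs, k=1)[0])
    return chosen, dict(zip(langs, probs))

if __name__ == "__main__":
    A_row = {"en": 1.0, "el": 2.0, "ar": 0.0}   # made-up theme logits
    B_row = {"en": 0.0, "el": 1.0, "ar": 0.0}   # made-up region logits
    langs, p = route_languages(A_row, B_row, "en", K=8, K_on=2,
                               rng=random.Random(0))
    print(langs)
    print(p)
```

Lowering `tau` sharpens the distribution toward the highest-logit language, while raising `eps` keeps rarely chosen languages in play, which is the lever the annealing described in the router design adjusts over training.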

Key Designs

  1. Dual-Matrix Language Router (Contextual Bandit):

    • Function: Selects a set of the most informative languages for each question under a fixed rollout budget.
    • Mechanism: Uses two small tabular logit matrices: \(\mathbf{A}\in\mathbb{R}^{T\times L}\) (Theme × Language) and \(\mathbf{B}\in\mathbb{R}^{G\times L}\) (Region × Language). The distribution is \(p(\ell\mid x)\propto\exp\!\big((A_{t(x),\ell}+\mathbb{I}[g(x)\neq\varnothing]B_{g(x),\ell})/\tau\big)\). It retains \(K_{\text{on}}\) original-language rollouts for on-policy consistency and samples the rest from \(p\) with \(\epsilon\)-greedy exploration. Every \(M\) steps, the average reward \(\bar r_{t,g,\ell}\) accumulated in \((t,g,\ell)\) buckets is written back to \(\mathbf{A},\mathbf{B}\) via EMA (\(A_{t,\ell}\leftarrow(1-\alpha)A_{t,\ell}+\alpha\bar r_{t,g,\ell}\), with \(\mathbf{B}\) updated analogously). Both \(\epsilon\) and \(\tau\) are annealed over training to emphasize exploration early and exploitation later.
    • Design Motivation: Upgrades the determination of "which language provides the most informative rollout" from an implicit mechanism to a learnable contextual bandit. This avoids bias from fixed dominant-language assumptions and naturally handles structures like "regional knowledge requires local languages" (e.g., Greek questions are better suited for Greek, with regional logits providing an additional bias).
  2. Offline Estimation + Online Calibration of Cross-lingual Similarity Rewards:

    • Function: Calibrates raw mmBERT similarity—which has inconsistent scales across different language pairs—into a quality reward that is comparable within the group.
    • Mechanism: In the offline phase, three types of response pairs—semantically equivalent (upper bound alignment), naturally mismatched, and hard negatives—are collected for each language pair \(\langle\ell_i,\ell_j\rangle\) to form an empirical distribution \(\mathcal{S}_{\ell_i,\ell_j}\). In the online RL phase, for each rollout \(y^{(\ell_j)}\) and reference \(z^{(\ell_i)}\), \(s=\mathrm{sim}(y,z)\) is calculated followed by one of two calibrations: Mean Calibration \(r^{\text{qual}}=s-\lambda(\mu_{\ell_i,\ell_j}-\mu_{\text{ref}})\), aligning the mean of equivalent pairs to a global reference mean; or Quantile Calibration \(r^{\text{qual}}=\mathcal{Q}_{\ell_i,\ell_j}(s)\), mapping raw scores directly to cross-lingually comparable empirical quantiles.
    • Design Motivation: Raw similarity contains systematic biases across language pairs (e.g., Chinese-English equivalent pairs mean \(\approx 0.85\), while Chinese-Arabic \(\approx 0.65\)). Without calibration, rollouts in low-resource languages are always suppressed during intra-group normalization, and the "language utility" learned by the router becomes contaminated by this measurement bias, eventually degenerating into monolingual GRPO.
  3. Language Consistency Gating + GRPO Joint Update:

    • Function: Ensures the policy actually outputs the language specified by the router, making the "language channel" externally observable and internally capable of receiving learning signals.
    • Mechanism: Uses a language identifier to compute \(r^{\text{lang}}(y_k)=\mathbb{I}[\mathrm{Lang}(y_k)=\ell_k]\), which is multiplied by the quality reward to get the final reward \(r_k=r^{\text{qual}}_k\cdot r^{\text{lang}}_k\). If the language is incorrect, the reward is zeroed. Policy gradient updates are performed using the GRPO objective with intra-group normalization. Router updates are delayed by \(M\) steps using EMA of the recent window's reward bucket means to prevent single-step noise from misguiding the router.
    • Design Motivation: Without \(r^{\text{lang}}\), the policy might learn to "always answer in English regardless of the requested language" to bypass the router. Multiplicative gating turns "language compliance" into a hard constraint, ensuring that the \(\bar r_{t,g,\ell}\) seen by the router accurately reflects the utility of generating in \(\ell\) for that theme.
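The two calibration rules and the multiplicative language gate can be illustrated with a short Python sketch. The function names are hypothetical, the quantile map \(\mathcal{Q}_{\ell_i,\ell_j}\) is assumed here to be the empirical CDF of the offline scores for that language pair, and the reference mean of 0.75 in the demo is a made-up value; only the pair means 0.85 and 0.65 come from the text above.

```python
from bisect import bisect_right

def mean_calibrate(s, mu_pair, mu_ref, lam=1.0):
    """Mean calibration: r_qual = s - lam * (mu_pair - mu_ref)."""
    return s - lam * (mu_pair - mu_ref)

def quantile_calibrate(s, offline_scores_sorted):
    """Quantile calibration, assumed to be the empirical CDF:
    the fraction of offline scores for this language pair that are <= s."""
    return bisect_right(offline_scores_sorted, s) / len(offline_scores_sorted)

def gated_reward(r_qual, detected_lang, target_lang):
    """Final reward r = r_qual * 1[Lang(y) == target language]."""
    return r_qual * (1.0 if detected_lang == target_lang else 0.0)

if __name__ == "__main__":
    mu_ref = 0.75  # hypothetical global reference mean
    # Raw similarities that look different but sit at a comparable
    # distance from their own language pair's mean.
    zh_en = mean_calibrate(0.80, mu_pair=0.85, mu_ref=mu_ref)
    zh_ar = mean_calibrate(0.62, mu_pair=0.65, mu_ref=mu_ref)
    print(zh_en, zh_ar)

    offline = sorted([0.1, 0.4, 0.6, 0.9])  # toy offline score sample
    print(quantile_calibrate(0.5, offline))

    # A rollout in the wrong language gets zero reward regardless of quality.
    print(gated_reward(zh_ar, detected_lang="en", target_lang="ar"))
```

In the demo, the Chinese-Arabic rollout has the lower raw similarity (0.62 vs. 0.80) but ends up with the slightly higher calibrated reward, which is the kind of reordering calibration is meant to allow.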

Loss & Training

The policy follows the GRPO objective, performing reward normalization within each multilingual rollout group. The router does not use gradients but updates the logit matrices via EMA every \(M\) policy steps. Training data includes 4,885 samples from HelpSteer3 + CARE, covering 14 languages. Themes are automatically categorized into 6 types (Regional Knowledge, General Knowledge, Chat, Reasoning, Safety, Translation) using gpt-oss-120b, with a 98% consistency rate compared to human annotation.
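As a rough sketch of the two updates described here, the following Python shows intra-group reward normalization and the gradient-free EMA write-back to a router table. All names are hypothetical. The population standard deviation, the `eps` stabilizer, and keying the buckets by (context, language), with theme as context for \(\mathbf{A}\) and region as context for \(\mathbf{B}\), are assumptions made for illustration rather than details confirmed by the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """GRPO-style normalization within one multilingual rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

def bucket_means(records):
    """Average rewards per (context, language) bucket.

    records: iterable of (context, language, reward) tuples collected
    over the last M policy steps.
    """
    buckets = defaultdict(list)
    for context, lang, reward in records:
        buckets[(context, lang)].append(reward)
    return {key: mean(vals) for key, vals in buckets.items()}

def ema_update(table, means, alpha=0.1):
    """table[key] <- (1 - alpha) * table[key] + alpha * r_bar, no gradients."""
    for key, r_bar in means.items():
        table[key] = (1 - alpha) * table.get(key, 0.0) + alpha * r_bar
    return table

if __name__ == "__main__":
    print(group_advantages([0.2, 0.4, 0.6, 0.8]))

    A = {("Reasoning", "en"): 1.0}  # toy theme-by-language logits
    window = [("Reasoning", "en", 1.0), ("Reasoning", "en", 0.0),
              ("Reasoning", "el", 0.5)]
    print(ema_update(A, bucket_means(window), alpha=0.2))
```

Because the write-back only touches buckets that received rollouts in the window, languages the router stops sampling keep their old logits, which is one reason the exploration term matters.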

Key Experimental Results

Main Results

Evaluation covers five multilingual benchmarks (CARE / CARE-pro / mGSM-v2 / Global-MMLU-Lite / Include-Lite) with three backbones. Representative results on Qwen2.5-1.5b-it (mGSM-v2 average and Overall average) are shown below:

| Method | mGSM-v2 Avg. | Overall Avg. |
|---|---|---|
| Vanilla | 24.87 | 28.64 |
| DPO | 27.02 | 29.33 |
| MAPO | 25.64 | 28.40 |
| MPO | 25.05 | 28.38 |
| GRPO | 32.33 | 30.42 |
| LRPO (Ours) | 38.25 | 32.15 |

On Qwen2.5-1.5b-it, LRPO improves the mGSM-v2 average from 24.87 (Vanilla) to 38.25 (+13.38), and its Overall average is +1.73 above GRPO. Averaged over seen languages across benchmarks, LRPO gains +5.08 over the instruction-tuned baseline and +2.85 over GRPO. Even on the stronger Gemma3-4b-it, it maintains a narrow lead (46.89 vs. 46.67 for GRPO), indicating that the improvements are not limited to small models benefiting from multilingual signals.

Ablation Study

| Router Variant | mGSM-v2 Avg. | Overall Avg. | Description |
|---|---|---|---|
| Monolingual | 32.33 | 30.42 | Degenerates to GRPO |
| Input-dominant | 36.25 | 31.78 | Fixed routing, biased towards on-policy |
| EN-dominant | 37.89 | | Simulates MAPO-style dominant language |
| LRPO (Learned + Calibrated) | 38.25 | 32.15 | Full Model |

Fixed routing (whether biased towards the input language or English) is inferior to learnable routing. The EN-dominant variant is close to LRPO on mGSM-v2 but lags significantly on the regional knowledge-focused CARE series, confirming the failure of the "dominant language hypothesis" on region-grounded tasks.

Key Findings

  • Router Contribution: Expanding each question from "monolingual" to "router-assigned languages" lets GRPO's intra-group comparison exploit cross-lingual complementary knowledge. This is the main source of the +5.92 gain on mGSM-v2 over GRPO (38.25 vs. 32.33).
  • Necessity of Cross-Lingual Calibration: Using raw mmBERT similarity directly lets language-pair biases contaminate intra-group normalization. The router then collapses onto the language of the reference answer, degenerating into the Monolingual variant.
  • Regional Matrix Benefits: The gain from the regional matrix \(\mathbf{B}\) for regional tasks like CARE or Include-Lite is significantly larger than for pure reasoning tasks like mGSM-v2, aligning with the prior that regional knowledge is best carried by local languages.

Highlights & Insights

  • Formulating "language selection" as a contextual bandit is a clean approach. While traditional multilingual RL papers fix the language or default to English, this paper uses a tabular \(\mathbf{A}+\mathbb{I}\cdot\mathbf{B}\) parameterization to learn "Theme × Language" and "Region × Language" priors online, yielding significant gains with near-zero extra computational overhead.
  • The two-stage cross-lingual similarity calibration (offline + online) is noteworthy: any multimodal/multilingual RLHF using embedding similarity as a reward (e.g., image-text, video-text, cross-domain code) will face the issue of inconsistent scale. Quantile calibration \(\mathcal{Q}_{\ell_i,\ell_j}(s)\) provides a plug-and-play solution that does not rely on a parametric calibration model.
  • The multiplicative gate \(r^{\text{qual}}\cdot r^{\text{lang}}\) simply handles the degenerate solution where the model ignores the requested language. This effectively upgrades the "language condition" from a soft constraint to a hard constraint, which is insightful for future RLHF tasks requiring specific styles, formats, or tool usage.

Limitations & Future Work

  • The router is tabular; the number of themes \(T\) and regions \(G\) rely on coarse classification (6 themes). With finer-grained themes or regions (thousands), embedding parameterization would be necessary to avoid instability in EMA estimation caused by data sparsity.
  • Cross-lingual calibration depends on mmBERT. The quality of "semantically equivalent pairs" in the offline \(\mathcal{S}_{\ell_i,\ell_j}\) determines the calibration upper bound. For truly low-resource language pairs where parallel corpora are scarce, calibration remains an open problem.
  • Experiments cover the 1B–4B range for Qwen/Llama/Gemma but haven't been validated at the 30B+ scale. As larger models already achieve significant cross-lingual transfer via GRPO, the relative gains of LRPO might narrow.
  • Training data is limited to human preference sets (HelpSteer3 + CARE). The "language utility" learned by the router is biased by the data distribution—for instance, the language coverage of regional questions in CARE determines the learning range for \(\mathbf{B}\). Moving to new regions requires a cold-start mechanism.

Comparison with Related Work

  • vs. MAPO / LIDR / MPO: These methods assume English anchors are more reliable and align other languages to English via translation or log-odds. LRPO does the opposite: it does not presuppose a dominant language, allowing the data to reveal which language is most useful for which task, while avoiding the mixing of language identification errors and content quality errors through calibration and gating.
  • vs. GRPO: This method is a multilingual extension of GRPO. It expands rollout groups from monolingual to multilingual, adds cross-lingual calibration to the reward, and introduces a learnable language router. It is fully compatible with existing GRPO infrastructure and serves as a low-cost upgrade for multilingual SFT/RL pipelines.
  • vs. Inference-time cross-lingual methods (CCL/CoT): Those methods perform cross-lingual Chain-of-Thought concatenation during inference. This paper pushes cross-lingual signals down to the training reward. The directions are orthogonal and theoretically can be combined.