Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation

Conference: ACL 2026
arXiv: 2601.14896
Code: GitHub
Area: Reinforcement Learning
Keywords: Multilingual RAG, Reinforcement Learning, GRPO, Knowledge Bias, Knowledge Conflict

TL;DR

This paper proposes the LcRL framework, which addresses knowledge bias and knowledge conflict in multilingual RAG through language-coupled GRPO policy optimization and anti-consistency penalty rewards, achieving significant improvements in multilingual question-answering tasks.

Background & Motivation

Background: Retrieval-Augmented Generation (RAG) has become an effective paradigm for mitigating LLM hallucinations and knowledge deficiencies. In multilingual scenarios, knowledge differences across languages are significant due to extremely unbalanced training data distributions. Multilingual RAG (MRAG) requires models to effectively acquire and integrate external knowledge from multilingual corpora.

Limitations of Prior Work: Existing MRAG methods primarily adopt a "one-size-fits-all" strategy: equivalent queries in different languages are processed through single-round retrieval and unified optimization. This leads to two core issues: (1) Knowledge Bias: because their knowledge reserves vary by language, LLMs produce significantly different answers for semantically equivalent queries in different languages; (2) Knowledge Conflict: when the retrieval set spans multiple languages, retrieved documents can be semantically relevant yet factually inconsistent with one another, which interferes with generating the correct answer.

Key Challenge: Policies in existing RL-based RAG methods (e.g., Search-R1) are optimized independently within a single language, failing to reconcile contradictory facts across languages or exploit cross-lingual complementary effects.

Goal: Design a language-coupled reinforcement learning framework that allows LLMs to adaptively decide whether to retrieve, which language's resources to retrieve, and how to reconcile conflicting knowledge across languages.

Key Insight: Couple multilingual decisions and empirical rewards into the GRPO framework.

Core Idea: Sample and evaluate semantically equivalent multilingual queries within the same group to facilitate cross-lingual knowledge transfer.

Method

Overall Architecture

LcRL is based on the GRPO framework and integrates two core modules: (1) a language-coupled rollout module—performing intra-group sampling for semantically equivalent multilingual queries combined with a hierarchical multi-round retrieval strategy; (2) a language-coupled reward module—replacing exact matching with n-gram recall and introducing an anti-consistency penalty to stabilize training. The LLM interacts with a search engine in an interleaved multi-round manner, controlling retrieval and generation via <search> and <answer> tags.
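The interleaved interaction described above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: `generate` and `search` are hypothetical callables standing in for the policy LLM and the search engine, and the `<information>` wrapper for retrieved passages is an assumption borrowed from common Search-R1-style setups rather than something stated in this summary.

```python
import re
from typing import Callable, List

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_episode(question: str,
                generate: Callable[[str], str],
                search: Callable[[str, int], List[str]],
                max_rounds: int = 4) -> str:
    """Interleave generation and retrieval until an <answer> tag appears.

    `generate(context)` returns the next model segment (hypothetical).
    `search(query, round_idx)` returns retrieved passages (hypothetical).
    """
    context = question
    for round_idx in range(1, max_rounds + 1):
        step = generate(context)
        context += "\n" + step
        answer = ANSWER_RE.search(step)
        if answer:
            return answer.group(1).strip()
        query = SEARCH_RE.search(step)
        if query is None:
            break  # neither tag present: stop rather than loop forever
        docs = search(query.group(1).strip(), round_idx)
        # Assumed wrapper tag; retrieved text is appended for the next step.
        context += "\n<information>" + " ".join(docs) + "</information>"
    return ""  # no answer produced within the round budget
```

Passing the round index to `search` leaves room for a round-dependent choice of retrieval languages, which is what the hierarchical strategy in Key Designs relies on.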

Key Designs

  1. Hierarchical Multi-round Retrieval Strategy:

    • Function: Dynamically selects the set of retrieval languages based on the retrieval round.
    • Mechanism: Round 1 retrieves the native language \(\mathcal{R}_L(q)\) to capture culturally relevant evidence; Round 2 expands globally to all other languages \(\bigcup_{l \in \mathcal{L} \setminus \{L\}} \mathcal{R}_l(q)\) to compensate for knowledge gaps; Round 3 and beyond anchor to high-resource languages (e.g., English) \(\mathcal{R}_{en}(q)\) as factual anchors.
    • Design Motivation: Prioritize native language to avoid conflict, then gradually expand to utilize cross-lingual complementarity, and finally use high-resource languages as a fallback.
  2. Language-Coupled GRPO:

    • Function: Couples multilingual queries during GRPO group sampling, allowing answers in different languages to share the same set of baselines for advantage estimation.
    • Mechanism: For a set of semantically equivalent queries \(\mathcal{Q} = \{q_1, q_2, \dots, q_n\}\), each in a different language, a rollout \(o_i \sim \pi_\theta(\cdot | q_i; \mathcal{R})\) is sampled per query, and the advantage \(\hat{A}_{i,t}^{\text{coupled}}\) is normalized across the entire multilingual group rather than within each language. This encourages the policy to bind embeddings of different languages to a unified, high-quality reasoning path.
    • Design Motivation: Through cross-lingual intra-group normalization, low-resource languages can implicitly learn from the behaviors of high-resource languages, mitigating knowledge bias.
  3. Anti-consistency Penalty Reward:

    • Function: Detects and penalizes high similarity among incorrect answers to prevent the policy from collapsing onto a shared error pattern.
    • Mechanism: Defines a "bad sample" set \(B_q = \{i \in G_q | r_{\text{ans}}(i) < \tau_{\text{bad}}\}\), calculates the maximum similarity \(m_i\) for each bad sample against other bad samples, and applies a penalty \(r_{\text{anti\_align}}(i) = -p_i \cdot w_q\) to clustered similar incorrect answers.
    • Design Motivation: GRPO in tool-integrated RL is prone to training collapse due to Lazy Likelihood Displacement; the anti-consistency penalty breaks the positive feedback loop of error patterns.
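The three designs above can be sketched in a few small functions. This is an illustrative sketch under stated assumptions: mean/standard-deviation normalization is the usual GRPO choice and is assumed here for the coupled advantage, the population standard deviation and `eps` are assumptions, and the similarity function `sim` is left as a parameter because this summary does not specify it. The sketch stops at \(m_i\); how \(m_i\) maps to \(p_i\) and \(w_q\) is not given here, so it is not implemented.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List


def retrieval_languages(round_idx: int, native: str,
                        all_langs: List[str], anchor: str = "en") -> List[str]:
    """Native language first, then all other languages, then a high-resource anchor."""
    if round_idx == 1:
        return [native]
    if round_idx == 2:
        return [lang for lang in all_langs if lang != native]
    return [anchor]


def coupled_advantages(rewards_by_lang: Dict[str, List[float]],
                       eps: float = 1e-6) -> Dict[str, List[float]]:
    """Normalize rewards with one baseline shared by the whole multilingual group."""
    pooled = [r for rs in rewards_by_lang.values() for r in rs]
    mu, sigma = mean(pooled), pstdev(pooled)  # assumption: population std
    return {lang: [(r - mu) / (sigma + eps) for r in rs]
            for lang, rs in rewards_by_lang.items()}


def max_bad_similarity(answers: List[str], r_ans: List[float], tau_bad: float,
                       sim: Callable[[str, str], float]) -> Dict[int, float]:
    """For each bad sample (r_ans < tau_bad), max similarity to the other bad samples."""
    bad = [i for i, r in enumerate(r_ans) if r < tau_bad]
    return {i: max((sim(answers[i], answers[j]) for j in bad if j != i),
                   default=0.0)
            for i in bad}
```

With a shared baseline, a language whose rollouts all score low receives negative advantages relative to a language that scores high; normalizing each language separately would instead assign zero advantage to both when rewards are uniform within a language, so no cross-lingual signal would remain.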

Loss & Training

Reward function: Character 3-gram recall \(r_{\text{ans}}(i) = \text{c3Recall}(\hat{a}_i, a_{\text{gold}})\) is used instead of binary exact matching to provide a dense reward signal. The final reward is \(r_{\text{total}}(i) = \max(0, r_{\text{ans}}(i) + \lambda \cdot \tilde{r}_{\text{anti\_align}}(i))\), where \(\tilde{r}_{\text{anti\_align}}(i)\) is the anti-consistency penalty clipped to the range \([-0.5, 0]\). The objective follows the standard PPO-style clipped surrogate with KL regularization.
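The reward computation can be illustrated with a short sketch. Several details are assumptions, since this summary does not specify them: 3-grams are counted as a multiset over raw characters with no text normalization, a gold answer shorter than three characters falls back to exact match, and the default `lam=1.0` is a placeholder rather than the paper's value of \(\lambda\).

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Multiset of character n-grams; empty if the text is shorter than n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def c3_recall(prediction: str, gold: str) -> float:
    """Fraction of the gold answer's character 3-grams found in the prediction."""
    gold_grams = char_ngrams(gold)
    total = sum(gold_grams.values())
    if total == 0:
        # Assumption: very short gold answers fall back to exact match.
        return float(prediction == gold)
    pred_grams = char_ngrams(prediction)
    overlap = sum(min(count, pred_grams[gram])
                  for gram, count in gold_grams.items())
    return overlap / total


def total_reward(r_ans: float, r_anti_align: float, lam: float = 1.0) -> float:
    """max(0, r_ans + lam * clip(r_anti_align, -0.5, 0))."""
    clipped = max(-0.5, min(0.0, r_anti_align))
    return max(0.0, r_ans + lam * clipped)
```

Unlike exact match, a partially correct answer such as "Pari" against the gold "Paris" still earns partial credit (two of three gold 3-grams), which is the sense in which the reward is dense.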

Key Experimental Results

Main Results

| Dataset | Metric | LcRL (Qwen2.5-3B) | mSearch-R1 | Search-R1 | D-RAG |
|---|---|---|---|---|---|
| MKQA | fEM | 41.2 | 37.9 | 22.6 | 37.4 |
| MKQA | c3Recall | 57.0 | 53.2 | 34.8 | 43.3 |
| MKQA | CLR | 99.1 | 95.6 | 83.6 | 90.2 |
| XOR-TyDi | fEM | 31.7 | 21.2 | 18.4 | 31.5 |
| XOR-TyDi | c3Recall | 43.9 | 35.8 | 32.0 | 38.9 |

Ablation Study

| Configuration | fEM | c3Recall | Description |
|---|---|---|---|
| Full LcRL | 41.2 | 57.0 | Full model |
| w/o Lc Reward | 30.8 | 42.2 | Remove language-coupled reward |
| w/o c3Recall Reward | 18.0 | 20.2 | Replace with exact matching |
| w/o Lc Rollout | 30.4 | 45.7 | Remove language-coupled sampling |
| w/o multi-language Rollout | 27.9 | 38.5 | Remove multilingual retrieval |
| Replace by PPO | 15.5 | 21.7 | Replace GRPO with PPO |

Key Findings

  • LcRL achieves statistically significant improvements over all baselines (t-test \(p < 0.01\)), with fEM reaching 47.6 on Qwen3-8B.
  • As the number of languages in the retrieval set increases, only LcRL's performance continues to improve, while other methods decline sharply once the retrieval set exceeds two languages.
  • LcRL performs robustly under limited training data conditions and successfully transfers to languages unseen during training.
  • GRPO significantly outperforms PPO, benefiting from the intra-group learning mechanism that promotes cross-lingual generalization.

Highlights & Insights

  • The design of language-coupled GRPO cleverly exploits the complementarity of multilingual equivalent queries, representing a meaningful extension of standard GRPO.
  • The anti-consistency penalty effectively addresses the reward collapse problem in RL training and is transferable to other tool-integrated RL scenarios.
  • The hierarchical retrieval strategy (Native \(\rightarrow\) Global \(\rightarrow\) High-resource) achieves a good balance between simplicity and effectiveness.

Limitations & Future Work

  • Evaluation was limited to three LLMs and did not cover a broader range of open-source multilingual models.
  • The retriever was fixed to multilingual-e5-base without exploring the possibility of joint optimization.
  • There is a lack of specialized retrieval relevance annotation datasets for multilingual RAG.
  • Future work could explore effects on larger scale models and broader language coverage.

Comparison with Related Work

  • vs Search-R1: Search-R1 is a monolingual RL-RAG method; LcRL resolves its optimization instability in multilingual settings through the language-coupling mechanism.
  • vs D-RAG: D-RAG mitigates conflict through dialectical reasoning within a fixed pipeline; LcRL performs end-to-end joint optimization of retrieval and generation.
  • vs SFT Methods: RL methods can remain competitive under low-resource conditions, whereas SFT relies on large-scale data.

Rating

  • Novelty: ⭐⭐⭐⭐ Language-coupled GRPO and anti-consistency penalty are significant innovations for multilingual RL-RAG.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multiple models \(\times\) two datasets \(\times\) detailed ablation \(\times\) data scale/language coverage analysis.
  • Writing Quality: ⭐⭐⭐⭐ Problem definitions are clear, the method description is organized, and visualizations are rich.
  • Value: ⭐⭐⭐⭐ Opens a new route for post-training optimization of multilingual RAG; the anti-consistency penalty idea is widely reusable.