
Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation

Conference: ACL 2026 · arXiv: 2601.14896 · Code: GitHub · Area: Reinforcement Learning · Keywords: Multilingual RAG, Reinforcement Learning, GRPO, Knowledge Bias, Knowledge Conflict

TL;DR

This paper proposes the LcRL framework, which addresses knowledge bias and knowledge conflict in multilingual RAG through language-coupled GRPO policy optimization and anti-alignment penalty rewards, achieving significant improvements on multilingual question answering tasks.

Background & Motivation

Background: Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for mitigating hallucinations and knowledge insufficiency in LLMs. In multilingual settings, the extreme imbalance of training data distributions leads to substantial knowledge disparities across languages, requiring multilingual RAG (MRAG) models to effectively acquire and integrate external knowledge from multilingual corpora.

Limitations of Prior Work: Existing MRAG methods predominantly adopt a one-size-fits-all strategy—processing semantically equivalent queries across different languages through single-round retrieval and unified optimization. This gives rise to two core problems: (1) Knowledge Bias—LLMs produce markedly different responses to semantically equivalent queries in different languages due to uneven knowledge coverage per language; (2) Knowledge Conflict—when retrieved collections span multiple languages, linguistic variation causes retrieved documents to be semantically related yet factually inconsistent, disrupting correct answer generation.

Key Challenge: RL-based RAG methods such as Search-R1 optimize policies independently within a single language, making them unable to reconcile conflicting facts across languages or exploit cross-lingual complementarity.

Goal: Design a language-coupled reinforcement learning framework that enables LLMs to adaptively decide whether to retrieve, which language resources to retrieve from, and how to effectively reconcile conflicting knowledge across languages.

Key Insight: Couple multilingual decision-making and reward signals within the GRPO framework.

Core Idea: Allow semantically equivalent multilingual queries to be sampled and evaluated within the same group, thereby promoting cross-lingual knowledge transfer.

Method

Overall Architecture

LcRL builds upon the GRPO framework and integrates two core modules: (1) a language-coupled rollout module that performs intra-group sampling over semantically equivalent multilingual queries, combined with a hierarchical multi-turn retrieval strategy; and (2) a language-coupled reward module that replaces exact match with n-gram recall and introduces an anti-alignment penalty to stabilize training. The LLM interacts with a search engine in an interleaved multi-turn manner, with retrieval and generation controlled via <search> and <answer> tags.
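The paper describes this interaction loop in prose only; below is a minimal Python sketch of such a tag-controlled rollout. The `llm.generate` and `search_engine.query` interfaces, the turn budget, and the `<information>` wrapper for retrieved documents are assumptions for illustration; only the `<search>` and `<answer>` tags come from the paper.

```python
import re

SEARCH_TAG = re.compile(r"<search>(.*?)</search>", re.S)
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.S)

def rollout(llm, search_engine, question, max_turns=4):
    """Interleaved retrieval-generation loop controlled by <search>/<answer> tags.

    `llm.generate` and `search_engine.query` are hypothetical interfaces standing in
    for the policy model and the multilingual retriever.
    """
    context = question
    for turn in range(1, max_turns + 1):
        completion = llm.generate(context)                 # policy emits the next segment
        answer = ANSWER_TAG.search(completion)
        if answer:                                         # terminal action: final answer
            return answer.group(1).strip()
        query = SEARCH_TAG.search(completion)
        if query is None:                                  # neither tag emitted: stop early
            return completion.strip()
        docs = search_engine.query(query.group(1).strip(), turn=turn)
        # append retrieved evidence so the next generation step conditions on it
        # (the <information> wrapper is illustrative, not specified in the summary)
        context += completion + "\n<information>\n" + "\n".join(docs) + "\n</information>\n"
    return ""                                              # turn budget exhausted
```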

Key Designs

  1. Hierarchical Multi-Turn Retrieval Strategy:

    • Function: Dynamically selects the retrieval language set based on the current retrieval round.
    • Mechanism: Round 1 retrieves in the native language \(\mathcal{R}_L(q)\) to capture culturally relevant evidence; Round 2 expands globally to all other languages \(\bigcup_{l \in \mathcal{L} \setminus \{L\}} \mathcal{R}_l(q)\) to fill knowledge gaps; Round 3 and beyond anchor retrieval to a high-resource language (e.g., English) \(\mathcal{R}_{en}(q)\) as a factual anchor.
    • Design Motivation: Prioritizing native-language retrieval avoids conflicts; progressively expanding exploits cross-lingual complementarity; the high-resource language serves as a reliable fallback (see the sketch after this list).
  2. Language-Coupled GRPO:

    • Function: Couples multilingual queries within the GRPO group sampling so that responses from different languages share a common group baseline for advantage estimation.
    • Mechanism: For a set of semantically equivalent queries \(\mathcal{Q} = \{q_1, q_2, \dots, q_n\}\), each output \(o_i \sim \pi_\theta(\cdot | q_i; \mathcal{R})\) originates from a different language query, and the advantage \(\hat{A}_{i,t}^{\text{coupled}}\) is normalized across the entire multilingual group, encouraging the policy to bind embeddings of different languages to a unified high-quality reasoning path.
    • Design Motivation: Cross-lingual intra-group normalization allows lower-resource languages to implicitly learn from the behavior of higher-resource languages, thereby mitigating knowledge bias (see the sketch after this list).
  3. Anti-Alignment Penalty Reward:

    • Function: Detects and penalizes high similarity among incorrect responses, preventing erroneous patterns from collapsing into a single mode.
    • Mechanism: A set of "bad samples" \(B_q = \{i \in G_q | r_{\text{ans}}(i) < \tau_{\text{bad}}\}\) is defined; for each bad sample, the maximum similarity \(m_i\) to other bad samples is computed; a penalty \(r_{\text{anti\_align}}(i) = -p_i \cdot w_q\) is applied to clustered similar incorrect responses.
    • Design Motivation: GRPO in tool-integrated RL is prone to training collapse due to Lazy Likelihood Displacement; the anti-alignment penalty breaks the positive feedback loop of erroneous patterns.
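The sketch referenced in designs 1 and 2 above: per-round selection of retrieval languages, and a GRPO-style advantage computed over the whole multilingual group. The function names, the z-score normalization, and the 1e-6 stabilizer are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def retrieval_languages(turn, native_lang, all_langs, anchor_lang="en"):
    """Hierarchical multi-turn retrieval: native language first, then all other
    languages, then a high-resource anchor language for every later turn."""
    if turn == 1:
        return [native_lang]                                   # culturally relevant evidence
    if turn == 2:
        return [l for l in all_langs if l != native_lang]      # fill knowledge gaps globally
    return [anchor_lang]                                       # factual anchor from turn 3 on

def coupled_advantages(rewards_by_lang):
    """Language-coupled GRPO advantage: one baseline over the whole multilingual group.

    `rewards_by_lang` maps a language code to the rewards of its rollouts, e.g.
    {"en": [0.9, 0.4], "sw": [0.1, 0.3]}. All rollouts answer semantically
    equivalent queries, so they share a single mean/std for normalization.
    """
    flat = np.concatenate([np.asarray(r, dtype=float) for r in rewards_by_lang.values()])
    mean, std = flat.mean(), flat.std() + 1e-6                 # stabilizer is an assumption
    return {lang: (np.asarray(r, dtype=float) - mean) / std
            for lang, r in rewards_by_lang.items()}
```

Because the mean and standard deviation are shared across languages, a rollout in a low-resource language receives a positive advantage whenever it matches the group's best behavior, even if every rollout in its own language is weak; this is the mechanism behind the knowledge-bias mitigation described in design 2.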

Loss & Training

The reward function replaces binary exact match with character-level 3-gram Recall \(r_{\text{ans}}(i) = \text{c3Recall}(\hat{a}_i, a_{\text{gold}})\) to provide a dense reward signal. The final reward is \(r_{\text{total}}(i) = \max(0, r_{\text{ans}}(i) + \lambda \cdot \tilde{r}_{\text{anti\_align}}(i))\), where the anti-alignment penalty is clipped to the range \([-0.5, 0]\). The objective follows the standard PPO-clip form with KL regularization.
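A minimal sketch of this reward, covering both the c3Recall term and the anti-alignment penalty from the Key Designs list. The character 3-gram definition, the Jaccard similarity between bad samples, the choice of \(p_i = m_i\), and the default thresholds are assumptions; the paper specifies only the quantities named in the formulas above.

```python
def char_3grams(text):
    """Character 3-grams after lowercasing and removing whitespace (assumed definition)."""
    text = "".join(text.lower().split())
    return {text[i:i + 3] for i in range(len(text) - 2)}

def c3_recall(pred, gold):
    """Fraction of the gold answer's character 3-grams covered by the prediction."""
    gold_grams = char_3grams(gold)
    if not gold_grams:
        return 0.0
    return len(gold_grams & char_3grams(pred)) / len(gold_grams)

def anti_alignment_penalties(answers, ans_rewards, tau_bad=0.3, w_q=1.0):
    """Penalize clusters of mutually similar incorrect answers (assumed form).

    Bad samples are rollouts with answer reward below tau_bad; each is penalized in
    proportion to its maximum similarity m_i to another bad sample, and the penalty
    is clipped to [-0.5, 0] as stated above.
    """
    bad = [i for i, r in enumerate(ans_rewards) if r < tau_bad]
    penalties = [0.0] * len(answers)
    for i in bad:
        sims = []
        for j in bad:
            if j != i:
                gi, gj = char_3grams(answers[i]), char_3grams(answers[j])
                sims.append(len(gi & gj) / max(1, len(gi | gj)))   # Jaccard (assumption)
        m_i = max(sims, default=0.0)
        penalties[i] = max(-0.5, -m_i * w_q)
    return penalties

def total_rewards(answers, gold, lam=1.0, **kwargs):
    """r_total(i) = max(0, r_ans(i) + lambda * r_anti_align(i))."""
    ans = [c3_recall(a, gold) for a in answers]
    pen = anti_alignment_penalties(answers, ans, **kwargs)
    return [max(0.0, r + lam * p) for r, p in zip(ans, pen)]
```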

Key Experimental Results

Main Results

Dataset    Metric    LcRL (Qwen2.5-3B)   mSearch-R1   Search-R1   D-RAG
MKQA       fEM       41.2                37.9         22.6        37.4
MKQA       c3Recall  57.0                53.2         34.8        43.3
MKQA       CLR       99.1                95.6         83.6        90.2
XOR-TyDi   fEM       31.7                21.2         18.4        31.5
XOR-TyDi   c3Recall  43.9                35.8         32.0        38.9

Ablation Study

Configuration                fEM    c3Recall   Note
Full LcRL                    41.2   57.0       Complete model
w/o Lc Reward                30.8   42.2       Remove language-coupled reward
w/o c3Recall Reward          18.0   20.2       Replace with exact match
w/o Lc Rollout               30.4   45.7       Remove language-coupled sampling
w/o multi-language Rollout   27.9   38.5       Remove multilingual retrieval
Replace by PPO               15.5   21.7       Replace GRPO with PPO

Key Findings

  • LcRL achieves significant improvements over all baselines (t-test p < 0.01), with fEM reaching 47.6 on Qwen3-8B.
  • As the number of retrieval languages increases, only LcRL exhibits consistent performance gains, while the other methods degrade sharply beyond two languages.
  • LcRL demonstrates strong robustness under limited training data and successfully transfers to languages unseen during training.
  • GRPO substantially outperforms PPO, benefiting from the intra-group learning mechanism that promotes cross-lingual generalization.

Highlights & Insights

  • The language-coupled GRPO design elegantly exploits the complementarity of multilingual equivalent queries, representing a meaningful extension of standard GRPO.
  • The anti-alignment penalty effectively addresses reward collapse in RL training and is transferable to other tool-integrated RL settings.
  • The hierarchical retrieval strategy (native → global → high-resource) strikes a favorable balance between simplicity and effectiveness.

Limitations & Future Work

  • Evaluation is conducted on only three LLMs, without coverage of a broader range of open-source multilingual models.
  • The retriever is fixed as multilingual-e5-base; joint optimization of the retriever is not explored.
  • No dedicated retrieval relevance annotation dataset tailored for multilingual RAG is available.
  • Future work may explore broader language coverage and the effects of larger-scale models.

Comparison with Related Methods

  • vs Search-R1: Search-R1 is a monolingual RL-RAG method; LcRL resolves its optimization instability in multilingual settings through the language-coupling mechanism.
  • vs D-RAG: D-RAG mitigates conflicts via dialectical reasoning but within a fixed pipeline, whereas LcRL jointly optimizes retrieval and generation end-to-end.
  • vs SFT Methods: RL-based methods achieve competitive performance under low-resource conditions, while SFT approaches rely on large amounts of annotated data.

Rating

  • Novelty: ⭐⭐⭐⭐ Language-coupled GRPO and the anti-alignment penalty constitute important innovations for multilingual RL-RAG.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multiple models × two datasets × comprehensive ablations × data scale and language coverage analyses.
  • Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear, methodological exposition is well-organized, and visualizations are informative.
  • Value: ⭐⭐⭐⭐ Opens a new direction for post-training optimization in multilingual RAG; the anti-alignment penalty idea is broadly reusable.