Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers¶
Conference: NeurIPS 2025 · arXiv: 2512.10422 · Code: GitHub · Area: Retrieval-Augmented Generation & Question Answering · Keywords: RAG, query expansion, layer-contrastive ranking, multi-hop QA, cooperative mechanism
TL;DR¶
CoopRAG is a framework that achieves bidirectional cooperation between the retriever and the LLM through query expansion, retriever layer-contrastive reranking, and reasoning chain completion. It surpasses HippoRAG2 by 5.3% on multi-hop QA and by 35.2% on single-hop QA.
Background & Motivation¶
- Insufficient query information: Original questions lack sufficient information to adequately guide retrieval and reasoning.
- Shallow retrieval: Existing retrievers tend to rely on surface-level pattern matching rather than deep semantic understanding.
- Hallucination: LLMs tend to generate incorrect information when faced with uncertain knowledge.
- Bidirectional enhancement: A mechanism enabling mutual feedback and correction between the retriever and the LLM is needed.
Method¶
Overall Architecture¶
CoopRAG consists of five stages (a pipeline sketch in code follows the list):
- Query Expansion: The LLM decomposes the question into sub-questions and an uncertain reasoning chain.
- Expansion-Augmented Retrieval: The expanded query is used to retrieve top-\(n\) documents.
- Ranking by Contrasting Layers (RaLa): Documents are reranked to top-\(k\).
- Reasoning Chain Completion: The LLM fills in uncertain parts of the reasoning chain.
- Final Inference: The answer is generated.
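The sketch below ties the five stages together; it is a minimal illustration, not the paper's code. `llm` stands for any text-completion call, `retriever.search` for a dense retriever, and `rala_rerank` is sketched in the RaLa section below.

```python
def cooprag_answer(question: str, llm, retriever, n: int = 20, k: int = 5) -> str:
    # 1. Query Expansion: decompose into sub-questions and an uncertain
    #    reasoning chain, masking unknown entities instead of guessing them.
    expansion = llm(
        "Decompose the question into sub-questions and a reasoning chain. "
        "Mask unknown entities with <UNCERTAIN> and the final answer with <FILL>.\n"
        f"Question: {question}"
    )

    # 2. Expansion-Augmented Retrieval: retrieve top-n with the expanded query.
    docs = retriever.search(question + "\n" + expansion, top_n=n)

    # 3. Ranking by Contrasting Layers (RaLa): rerank down to top-k.
    docs = rala_rerank(question, docs, top_k=k)

    # 4. Reasoning Chain Completion: fill the placeholders from evidence.
    completed = llm(
        "Replace every <UNCERTAIN> and <FILL> placeholder using the documents.\n"
        f"Documents: {docs}\nChain: {expansion}"
    )

    # 5. Final Inference: answer with the completed chain as context.
    return llm(f"Question: {question}\nReasoning: {completed}\nAnswer:")
```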
Key Designs¶
Query Expansion (Section 3.2):
The LLM generates:
- A sub-question set \(S = \{s_1, s_2, \ldots, s_{|S|}\}\)
- An uncertain reasoning chain \(R = \{(e_1, r_1, e_1'), \ldots, (e_{|R|}, r_{|R|}, \texttt{<FILL>})\}\)
Core innovation: uncertain entities are replaced with <UNCERTAIN> masks rather than guessed, so the model does not commit to hallucinated facts before retrieval.
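As a made-up illustration (not an example from the paper), a two-hop question might expand into:

```python
question = "In which country was the director of Inception born?"

sub_questions = [
    "Who directed Inception?",
    "In which country was that director born?",
]

# Unknown entities stay masked, so the LLM never commits to a guessed
# name before retrieval has supplied the evidence.
reasoning_chain = [
    ("Inception", "directed by", "<UNCERTAIN>"),
    ("<UNCERTAIN>", "born in", "<FILL>"),
]
```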
Ranking by Contrasting Layers (RaLa, Section 3.4):
Motivation: Lower Transformer layers capture syntactic information, while higher layers capture semantics. Approach: The difference between a document's top-layer and intermediate-layer representations, relative to the query, serves as the reranking signal:
\(g(q_i, d_j) = \max_{l \in C} \left(\langle q_i, d_j^{(L)}\rangle - \langle q_i, d_j^{(l)}\rangle\right)\), where \(L\) is the top layer, \(d_j^{(l)}\) is the representation of document \(d_j\) at layer \(l\), and \(C\) is the candidate set of intermediate layers.
Practical implementation (for cost reduction): a single gap weight \(\omega_{U,D} = g(q_0, d_0)\) is computed per query–document pair and used to weight the ranking score.
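A minimal sketch of RaLa-style scoring with Hugging Face `transformers`, assuming mean-pooled per-layer hidden states as representations and a top-layer-similarity-plus-gap combination; the paper's exact pooling, layer set \(C\), and score combination may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any encoder exposing per-layer hidden states works for this sketch.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def layer_reps(text: str) -> list[torch.Tensor]:
    """Mean-pooled representation of `text` at every layer (assumed pooling)."""
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = enc(**inputs).hidden_states  # (num_layers + 1) x [1, T, H]
    return [h.mean(dim=1).squeeze(0) for h in hidden]

def gap_score(q: torch.Tensor, d_reps, contrast_layers) -> float:
    """g(q, d) = max_{l in C} (<q, d^(L)> - <q, d^(l)>), with L the top layer."""
    top_sim = torch.dot(q, d_reps[-1])
    return max((top_sim - torch.dot(q, d_reps[l])).item() for l in contrast_layers)

def rala_rerank(question: str, docs: list[str], top_k: int = 5,
                contrast_layers=(4, 6, 8)) -> list[str]:
    q = layer_reps(question)[-1]  # query top-layer representation
    scored = []
    for doc in docs:
        d_reps = layer_reps(doc)
        sim = torch.dot(q, d_reps[-1]).item()
        # Assumed combination: top-layer similarity weighted by the layer gap.
        scored.append((sim + gap_score(q, d_reps, contrast_layers), doc))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```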
Reasoning Chain Completion (Section 3.5):
The LLM uses the top-\(k\) documents to fill in the <UNCERTAIN> and <FILL> placeholders and reconstruct a complete reasoning chain.
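Continuing the illustrative Inception example from the query-expansion section, completion turns the masked chain into a grounded one:

```python
# Before completion (output of query expansion):
chain = [("Inception", "directed by", "<UNCERTAIN>"),
         ("<UNCERTAIN>", "born in", "<FILL>")]

# After completion against the top-k retrieved documents:
completed = [("Inception", "directed by", "Christopher Nolan"),
             ("Christopher Nolan", "born in", "United Kingdom")]
```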
Difficulty-Aware Training (Section 3.6):
Weighted loss: each question \(U_i\) contributes with weight \(\alpha_{U_i} = \log(1 + |S_{U_i}|)\), where \(S_{U_i}\) is its sub-question set.
Harder questions (with more sub-questions) receive higher loss weights.
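As a quick check of the weighting, a question decomposed into three sub-questions gets weight \(\log(1+3) \approx 1.39\), exactly twice the \(\log(1+1) \approx 0.69\) of a single-hop question. A minimal sketch (the per-question base losses are stand-ins):

```python
import math

def difficulty_weight(num_subquestions: int) -> float:
    """alpha = log(1 + |S|): more sub-questions -> higher loss weight."""
    return math.log1p(num_subquestions)

# Toy batch of (number of sub-questions, per-question base loss).
batch = [(1, 0.8), (3, 0.8)]
weighted_loss = sum(difficulty_weight(n) * loss for n, loss in batch)
# The 3-sub-question item contributes log(4)/log(2) = 2x the single-hop one.
```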
Key Experimental Results¶
Retrieval Performance (Table 2 — Multi-hop QA)¶
| Method | HotpotQA R@2 | MuSiQue R@2 | 2Wiki R@2 | Key Improvement |
|---|---|---|---|---|
| HippoRAG2 (L3.3) | 83.5% | 56.1% | 76.2% | baseline |
| HippoRAG2 (GPT) | 80.5% | 53.5% | 74.6% | reference |
| SiReRAG (GPT) | 80.0% | 52.5% | 60.6% | below baseline |
| CoopRAG (G2-9B) | 87.9% | 59.4% | 80.1% | +5.9% |
| CoopRAG (G2-27B) | 88.3% | 59.4% | 80.8% | +6.3% |
| CoopRAG (L3.3) | 86.9% | 58.2% | 80.6% | +3.4% |
| CoopRAG (GPT) | 88.8% | 59.6% | 80.4% | +8.3% |
Single-hop QA Performance (Table 2 right, NaturalQuestions)¶
| Method | R@2 | R@5 | Improvement |
|---|---|---|---|
| HippoRAG2 (L3.3) | 45.6% | 78.0% | baseline |
| HippoRAG2 (GPT) | 44.4% | 76.4% | reference |
| CoopRAG (G2-9B) | 71.6% | 88.9% | +27.2% |
| CoopRAG (G2-27B) | 72.8% | 89.5% | +28.4% |
| CoopRAG (L3.3) | 77.2% | 90.8% | +31.6% |
| CoopRAG (GPT) | 80.8% | 92.1% | +35.2% |
QA Performance (Table 3 — EM/F1)¶
| Method | HotpotQA | MuSiQue | 2Wiki | NQ |
|---|---|---|---|---|
| HippoRAG2 (L3.3) | 62.7/75.5 | 37.2/48.6 | 65.0/71.0 | 48.6/63.3 |
| HippoRAG2 (GPT) | 56.3/71.1 | 35.0/49.3 | 60.5/69.7 | 43.4/60.0 |
| CoopRAG (G2-9B) | 64.4/78.1 | 52.2/65.2 | 70.0/78.1 | 63.8/72.7 |
| CoopRAG (L3.3) | 64.7/79.0 | 52.6/66.6 | 71.2/78.8 | 70.9/80.3 |
| CoopRAG (GPT) | 65.5/79.2 | 52.8/66.7 | 70.8/78.6 | 71.3/80.5 |
Performance on Recent Benchmarks (Table 4 — MMLU-Pro)¶
| Task | Baseline | CoopRAG-Algo | CoopRAG-Math | Best Gain |
|---|---|---|---|---|
| Math | 54.63% | 53.89% | 60.25% | +5.62% |
| CS | 37.80% | 40.73% | 42.20% | +4.40% |
| Physics | 38.49% | 39.26% | 44.19% | +5.70% |
| Average | — | +1.35% | +3.02% | best |
Highlights & Insights¶
- Expansion–completion loop: Query expansion combined with reasoning chain completion forms a complete bidirectional cooperative cycle, in which the LLM guides retrieval while retrieval supplements the LLM's knowledge.
- Layer-contrastive design: Exploiting the multi-layer structure of Transformer internal representations moves reranking beyond the limits of a single top-layer vector.
- Uncertainty masking: Explicitly masking uncertain entities to prevent hallucination is a seemingly simple yet practically effective design choice.
- Scale efficiency: CoopRAG with Gemma2-9B already outperforms HippoRAG2 built on Llama3.3-70B, suggesting the gains come from the cooperative mechanism rather than model scale.
Limitations & Future Work¶
- Question type constraints: The method primarily targets factual QA; its applicability to open-ended questions remains unexplored.
- LLM dependency: The approach relies on the LLM's expansion capability; weaker models may lead to failure cases.
- Computational cost: Multiple LLM calls (expansion → completion → inference) result in relatively high inference latency.
- Heuristic layer selection: The dynamic layer selection via \(\omega\) weights remains heuristic, with limited theoretical grounding.
Related Work & Insights¶
- RAG methods: HippoRAG, SiReRAG, HopRAG, GraphRAG, LightRAG
- Query augmentation: HyDE, GAR, Step-Back-Prompting
- Dense retrieval: ColBERT, DPR, GTR
- Reasoning chains: CoT, ToT, ReAct
- QA systems: KBQA, multi-hop QA
Rating¶
⭐⭐⭐⭐