ReSCORE: Label-free Iterative Retriever Training for Multi-hop Question Answering with Relevance-Consistency Supervision

Conference: ACL 2025 (Long Paper)
arXiv: 2505.21250
Code: Project Page
Area: Information Retrieval / Multi-hop Question Answering
Keywords: multi-hop QA, dense retriever training, label-free supervision, iterative RAG, LLM distillation

TL;DR

This paper proposes ReSCORE, a label-free method for training a dense retriever inside an iterative RAG framework. An LLM scores each retrieved document by the joint probability of document-query relevance and document-answer consistency, and these scores serve as pseudo-labels for the retriever. The resulting system achieves state-of-the-art performance on three multi-hop QA datasets (MuSiQue, HotpotQA, and 2WikiMHQA).

Background & Motivation

Multi-hop question answering (MHQA) requires reasoning across multiple documents. Current SOTA systems follow the iterative retrieval-augmented generation (Iterative RAG) paradigm. However, prior work has two key limitations:

Dense retrievers require labeled data: Although dense retrievers (such as Contriever) outperform BM25 in semantic matching, they need annotated query-document pairs for fine-tuning. In MHQA, the query at each iteration step (the rewritten question) depends on the LLM that produced it, so annotating relevant documents for every possible query is prohibitively expensive.
Existing iterative RAG methods do not train the retriever: Methods such as IRCoT, Adaptive-RAG, and Adaptive-Note perform well at iterative reasoning, but they all rely on a sparse retriever (BM25) or on dense retrievers that have not been adapted to the target domain, and none of them fine-tunes the retriever.

Core Problem

How can an effective dense retriever be trained for multi-hop question answering without document relevance labels?

Method

Overall Architecture

ReSCORE operates within an iterative RAG framework: given a question \(q\), the system iteratively retrieves documents, generates intermediate "thoughts", and rewrites the query until the LLM generates a final answer (rather than "unknown"). During training, the probability distribution generated by the LLM is used as pseudo-labels to supervise the retriever, updating the query encoder through KL divergence loss. The complete system is named IQATR (Iterative Question Answerer with Trained Retriever).
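
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `retrieve`, `generate_thought`, and `answer` are hypothetical callables standing in for the dense retriever and the LLM, the exact ordering of the answer check and the thought generation is an assumption, and the default cap of six iterations follows the limit mentioned under Limitations.

```python
from typing import Callable, List, Tuple


def iterative_rag(
    question: str,
    retrieve: Callable[[str], List[str]],                # hypothetical: query -> documents
    generate_thought: Callable[[str, List[str]], str],   # hypothetical: LLM "thought"
    answer: Callable[[str, List[str]], str],             # hypothetical: LLM answer or "unknown"
    max_iters: int = 6,
) -> Tuple[str, List[str]]:
    """Retrieve, try to answer, and reconstruct the query until an answer is produced."""
    query = question
    collected: List[str] = []
    for _ in range(max_iters):
        # Add newly retrieved documents to the running context, skipping duplicates.
        for doc in retrieve(query):
            if doc not in collected:
                collected.append(doc)
        prediction = answer(question, collected)
        if prediction.strip().lower() != "unknown":
            return prediction, collected
        # Thought-concat: append the generated thought to the ORIGINAL question.
        thought = generate_thought(question, collected)
        query = f"{question} {thought}"
    return "unknown", collected
```

Because the reconstructed query always starts from the original question, each iteration keeps the original focus while the appended thought steers retrieval toward information that is still missing.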

Key Designs

Relevance-Consistency Joint Pseudo-Label Generation: The core formula is \(Q_{\text{LM}}^{(i)}(d_j^{(i)} | q) \propto P_{\text{LM}}(a, q | d_j^{(i)}) = P_{\text{LM}}(q | d_j^{(i)}) \cdot P_{\text{LM}}(a | q, d_j^{(i)})\), where \(a\) is the reference answer and \(d_j^{(i)}\) is the \(j\)-th document retrieved at iteration \(i\). The first term \(P_{\text{LM}}(q|d)\) measures the relevance of the document to the query, and the second term \(P_{\text{LM}}(a|q,d)\) measures how consistent the document is with the answer to that query. Using consistency alone produces many false positives (documents that overlap lexically with the answer but are semantically irrelevant still receive high scores), whereas the relevance term filters out these irrelevant documents.
KL Divergence Training Loss: The LLM probability distribution \(Q_{\text{LM}}\) is used as soft labels to train the retriever by minimizing \(D_{\text{KL}}(Q_{\text{LM}}^{(i)} \| P_R^{(i)})\). Here, \(P_R\) is the softmax distribution over the dot products of the query and document vectors. Only the query encoder is trained, while the document encoder is frozen. To control computational overhead, pseudo-labels are computed only for the top-\(M\) retrieved documents, with \(M = 32\).
Iterative Query Reconstruction: Each iteration generates a "thought" (a compression of key information from retrieved documents), which is concatenated with the original query to form a new query (Thought-concat strategy). This approach outperforms direct LLM query rewriting on complex questions, as it retains the original focus of the question.
Iterative Training Mechanism: Training is not completed in a single step but progresses throughout the entire iterative RAG process. The query in each iteration differs (due to reconstruction), corresponding to different retrieved document sets, which allows the retriever to learn to retrieve documents complementary to previous rounds in subsequent iterations.
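
The pseudo-label and loss computations from the list above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the paper's code: the log-probabilities are made-up numbers standing in for LLM token likelihoods, the function names are hypothetical, and details such as temperature scaling or length normalization are omitted.

```python
import numpy as np


def log_softmax(scores):
    """Numerically stable log-softmax over a 1-D array of scores."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()
    return shifted - np.log(np.exp(shifted).sum())


def pseudo_labels(log_p_q_given_d, log_p_a_given_qd):
    """Q_LM(d_j | q) proportional to P(q | d_j) * P(a | q, d_j), normalized over the candidates."""
    joint = np.asarray(log_p_q_given_d, dtype=float) + np.asarray(log_p_a_given_qd, dtype=float)
    return np.exp(log_softmax(joint))


def retriever_log_probs(query_vec, doc_vecs):
    """log P_R(d_j | q): log-softmax over query-document dot products."""
    return log_softmax(np.asarray(doc_vecs, dtype=float) @ np.asarray(query_vec, dtype=float))


def kl_divergence(q_lm, log_p_r, eps=1e-12):
    """D_KL(Q_LM || P_R) = sum_j Q_j * (log Q_j - log P_R_j)."""
    q_lm = np.asarray(q_lm, dtype=float)
    return float(np.sum(q_lm * (np.log(q_lm + eps) - log_p_r)))


# Toy example (illustrative numbers only). Document 1 is a "false positive":
# it has the highest consistency score but a very low relevance score.
log_relevance = [-2.0, -9.0, -8.0]     # log P(q | d_j)
log_consistency = [-1.0, -0.5, -6.0]   # log P(a | q, d_j)
q_lm = pseudo_labels(log_relevance, log_consistency)

query_vec = np.array([1.0, 0.0])
doc_vecs = np.array([[0.2, 0.1], [0.9, 0.3], [0.1, 0.8]])
loss = kl_divergence(q_lm, retriever_log_probs(query_vec, doc_vecs))
```

In the toy example, ranking by consistency alone would put document 1 first, while the joint score puts document 0 first. In actual training the loss would be backpropagated through the query encoder only, since the document encoder is frozen; the sketch merely evaluates the loss.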

Key Experimental Results

Dataset	Metric	IQATR (ReSCORE)	IRCoT (Prev. SOTA)	Adaptive-RAG	Gain vs. IRCoT
MuSiQue	EM / F1	23.4 / 32.7	22.0 / 31.8	23.6 / 31.8	+1.4 / +0.9
HotpotQA	EM / F1	47.2 / 59.3	44.4 / 56.2	42.0 / 53.8	+2.8 / +3.1
2WikiMHQA	EM / F1	50.0 / 59.7	49.7 / 54.9	40.6 / 49.8	+0.3 / +4.8

Note: In the table above, the Prev. SOTA results use Flan-T5-XL + BM25, whereas IQATR uses Llama-3.1-8B + Contriever trained with ReSCORE.

Contriever Fine-Tuning Performance Comparison (Within the Same Framework):

Dataset	Baseline (Contriever) EM / F1	+ ReSCORE EM / F1	Δ EM / Δ F1
MuSiQue	15.2 / 23.8	23.4 / 32.7	+8.2 / +8.9
HotpotQA	39.4 / 52.3	47.2 / 59.3	+7.8 / +7.0
2WikiMHQA	32.8 / 41.6	50.0 / 59.7	+17.2 / +18.1

Ablation Study

Comparison of Pseudo-GT Label Types (Table 3, single-step reranking): Using only \(P(q|d)\) (relevance) yields an average recall improvement of 5.37%; using only \(P(a|q,d)\) (consistency) actually decreases recall by 23.8% (severe false positives); combining both as \(P(q,a|d)\) yields an improvement of 14.4%.
Pseudo-GT vs. GT Labels (Table 4): Surprisingly, ReSCORE's pseudo-labels outperform human GT labels. The reason is that during single-step training, GT labels force the query to simultaneously align with multiple documents that are far apart in semantic space (e.g., "Billie Eilish", "Avocado", "Mexico Presidents"), pulling the query encoder towards the centroid of these documents, which is suboptimal for retrieving any single document. In contrast, ReSCORE progressively retrieves complementary documents through an iterative process.
Query Reconstruction Strategies (Table 5): Thought-concat performs better on complex questions (MuSiQue, HotpotQA, averaging 17+ tokens); LLM-rewrite is slightly superior on simple questions (2WikiMHQA, 11.7 tokens). This is because the LLM tends to lose focus when rewriting complex queries.
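
The centroid effect described for GT labels can be made concrete with a short derivation. This is an illustrative sketch under assumptions not stated in the source: the retriever distribution is \(P_R(d_j | q) \propto \exp(\mathbf{q}^\top \mathbf{d}_j)\) with no temperature, \(\mathbf{q}\) is treated as a free vector (in practice it is the output of the query encoder), and \(T\) denotes any target distribution over the candidate documents.

```latex
\[
\nabla_{\mathbf{q}} D_{\text{KL}}(T \,\|\, P_R)
  = -\sum_j T(d_j)\, \nabla_{\mathbf{q}} \log P_R(d_j | q)
  = \sum_j \bigl( P_R(d_j | q) - T(d_j) \bigr)\, \mathbf{d}_j
\]
```

Under these assumptions, a gradient-descent step moves \(\mathbf{q}\) toward \(\sum_j T(d_j)\, \mathbf{d}_j\), the target-weighted mean of the document vectors, and away from the mean under the current retriever distribution. When \(T\) spreads its mass evenly over several gold documents that lie far apart, that target mean is their centroid. When \(T\) concentrates on the documents that matter for the current iteration's query, the pull is toward those documents only.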

Highlights & Insights

A label-free training method for dense retrievers that uses LLM probability signals as pseudo-labels.
The joint modeling approach of Relevance + Consistency solves the false positive issue caused by using consistency alone.
The discovery that pseudo-labels outperform human GT labels exposes the limitations of GT labels as training signals in multi-hop scenarios.
ReSCORE can act as a plug-in to enhance the performance of various existing iterative RAG frameworks (e.g., Self-RAG, FLARE, Adaptive-Note).
Thorough statistical significance testing (10 seeds, t-test, p < 0.05).

Limitations & Future Work

Insufficient Generalization: The model is fine-tuned on specific datasets, which limits its out-of-distribution (OOD) generalization ability across datasets with different reasoning patterns and hop counts.
Computational Overhead: The iterative retrieval process increases latency and computational costs, requiring LLM inference at each iteration step.
Frozen Document Encoder: Training only the query encoder while freezing the document encoder caps how far the retriever can adapt.
Dependence on Reference Answers: The pseudo-labels require the answer \(a\), which limits applicability in scenarios where reference answers are not readily available.
Fixed Iteration Budget: The maximum number of iterations is fixed at 6, which may be insufficient for extremely complex questions that require more reasoning hops.

Comparison with Related Methods:

Method	Retriever	Training Method	Iterative	MHQA Adaptation
ATLAS	Dense	LLM consistency distillation	✗	Single-hop
REPLUG	Dense	LLM consistency	✗	Single-hop
IRCoT	BM25	No training	✓	✓ but no retriever training
Adaptive-RAG	BM25	No training (classifier trained)	✓	✓ but no retriever training
ReSCORE	Dense	LLM relevance + consistency	✓	✓ Iterative training

Key difference from ATLAS/REPLUG: ReSCORE models relevance and consistency jointly (rather than consistency alone) and is trained within an iterative framework (rather than in a single step). Key difference from IRCoT/Adaptive-RAG: ReSCORE trains the retriever instead of using an off-the-shelf one.

The concept of using "LLM probabilities as soft supervision signals" is highly valuable and can be extended to other retrieval scenarios requiring label-free training (e.g., conversational retrieval, multi-turn search).
The discovery that pseudo-GT outperforms GT implies that in multi-step reasoning tasks, hard labels might be less effective than iterative soft labels, which is worth validating in other domains.
The decomposition framework of Relevance + Consistency provides an interpretable document evaluation approach, which could be used to evaluate document quality in RAG systems.

Rating

Novelty: ⭐⭐⭐⭐ The core idea (joint relevance + consistency pseudo-labeling combined with an iteratively trained retriever) is clearly novel, although the individual components (LLM distillation, iterative RAG) are not new on their own.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely solid, featuring three datasets, multiple ablation dimensions, cross-comparisons with various methods, and statistical significance testing.
Writing Quality: ⭐⭐⭐⭐ Clear logic, smooth mathematical derivations, and vivid examples (such as the FIFA World Cup false positive example). A minor issue is the relatively large number of tables.
Value: ⭐⭐⭐⭐ Highly valuable for retriever training methods in RAG systems, particularly due to the finding that "pseudo-labels outperform GT" and the relevance-consistency decoupling concept.