SetR: Shifting from Ranking to Set Selection for Retrieval Augmented Generation
Conference: ACL 2025
arXiv: 2507.06838
Code: LGAI-Research/SetR
Area: Information Retrieval / RAG
Keywords: RAG, Set Selection, Reranking, Chain-of-Thought, Multi-hop QA, Information Need Identification
TL;DR
SetR replaces the rank-then-truncate step in RAG with set selection. It uses CoT reasoning to identify a query's information needs and then selects the set of passages that jointly covers them. On multi-hop QA benchmarks it outperforms reranking baselines while passing fewer passages to the generator (2.91 on average vs. a fixed 5).
Background & Motivation
- Problem Definition: The retrieval module of a RAG system must ensure that the retrieved passages are not only individually relevant but also collectively constitute a complete information set to correctly answer complex questions.
- Limitations of Prior Work: Existing reranking methods score each passage's relevance individually, rank the passages, and keep the top-k. This causes three core problems: (1) information complementarity between passages is ignored, so redundant content may be retrieved; (2) there is no guarantee that the selection covers all the information a multi-hop question requires; (3) the top-k parameter must be tuned manually.
- Design Motivation: RAG systems have fundamentally different information needs from search engines. A search engine ranks individual results, whereas RAG requires a set of passages that collectively support generation, so the paradigm should shift from "ranking" to "set selection".
- Core Contributions: (1) Proposing a set-based passage selection paradigm based on information need identification; (2) training and open-sourcing the SetR model for efficient selection; (3) comprehensively outperforming commercial and open-source rerankers on multi-hop RAG benchmarks.
Method
Overall Architecture
SetR workflow: First-stage retriever (BM25/bge) returns top-20 candidate passages \(\rightarrow\) SetR analyzes information needs of queries via CoT reasoning \(\rightarrow\) SetR selects the optimal subset that covers all information needs from candidates (without ranking or a fixed k value).
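The workflow above can be sketched as a small pipeline. This is a minimal illustration, not the released SetR code: `set_selection_rag` and the `retrieve`, `select`, and `generate` callables are hypothetical placeholders for the first-stage retriever, the SetR selector, and the generator.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces (assumptions, not taken from the SetR repository).
Retriever = Callable[[str, int], List[str]]        # (query, n) -> candidate passages
Selector = Callable[[str, Sequence[str]], List[int]]  # (query, candidates) -> chosen indices
Generator = Callable[[str, Sequence[str]], str]    # (query, context passages) -> answer


def set_selection_rag(
    query: str,
    retrieve: Retriever,
    select: Selector,
    generate: Generator,
    num_candidates: int = 20,
) -> Tuple[str, List[int]]:
    """Retrieve candidates, select a variable-size subset, then generate."""
    candidates = retrieve(query, num_candidates)
    chosen = select(query, candidates)
    # Keep valid, de-duplicated indices in candidate order.
    # The result is a set: no ranking and no fixed k are implied.
    kept = sorted({i for i in chosen if 0 <= i < len(candidates)})
    context = [candidates[i] for i in kept]
    return generate(query, context), kept
```

The only difference from a rerank-then-truncate pipeline is the middle step: the selector returns however many indices it judges necessary instead of an ordering that is cut at a fixed k.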
Key Designs
- Information Requirement Identification (IRI): A structured CoT reasoning strategy with three steps: (a) enumerate the key information needs required to answer the question; (b) for each need, identify the candidate passages that contain relevant information; (c) select the subset of passages that collectively provides the most comprehensive coverage. Unlike general CoT reasoning, IRI explicitly models complete coverage of the information needs.
- Model Distillation: Using GPT-4o as a teacher to perform zero-shot set selection annotation on 40K MS MARCO queries, which is then distilled into the Llama-3.1-8B-Instruct student model. Training is conducted for 5 epochs with an effective batch size of 512, a learning rate of \(5 \times 10^{-6}\), and the AdamW optimizer.
- Adaptive Number of Passages: The model decides how many passages to select for each query, averaging only 2.63 to 3.41 passages (vs. a fixed 5 for the baselines), which reduces noise in the generator's context while improving information precision.
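The coverage objective behind the three-step process can be illustrated with a toy greedy set cover. This is only a stand-in for intuition: SetR performs these steps through LLM reasoning, not through this algorithm, and the function name, the need labels, and the passage indices below are hypothetical.

```python
from typing import Dict, List, Set


def select_covering_set(need_to_passages: Dict[str, Set[int]]) -> List[int]:
    """Greedily pick passages until every answerable need is covered.

    `need_to_passages` maps each information need (step a) to the candidate
    passages that contain it (step b); the return value is the subset (step c).
    """
    # Needs with no supporting passage cannot be covered and are skipped.
    uncovered = {need for need, ps in need_to_passages.items() if ps}
    selected: List[int] = []
    while uncovered:
        # Count how many still-uncovered needs each passage would satisfy.
        gain: Dict[int, int] = {}
        for need in uncovered:
            for p in need_to_passages[need]:
                gain[p] = gain.get(p, 0) + 1
        # Largest gain wins; ties go to the lowest passage index.
        best = min(gain, key=lambda p: (-gain[p], p))
        selected.append(best)
        uncovered = {n for n in uncovered if best not in need_to_passages[n]}
    return sorted(selected)


needs = {
    "director of the film": {3, 8},
    "birthplace of that director": {8, 11},
    "population of that city": {5},
}
print(select_covering_set(needs))  # [5, 8]
```

Passage 8 covers two needs at once, so two passages suffice for three needs. The size of the output depends on the query rather than on a preset k, which is the behavior the adaptive selection bullet describes.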
Three Model Variants
- SetR-Selection only: Only outputs selection results without the reasoning process.
- SetR-CoT: Conducts CoT reasoning using a general "Let's think step-by-step" prompt.
- SetR-CoT & IRI: The full model, featuring CoT reasoning with structured information need identification.
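The three variants differ only in what the model is asked to produce before the final selection. The sketch below makes that concrete with hypothetical prompt wording and a hypothetical output format; apart from the "Let's think step-by-step" phrase quoted above, none of the strings are taken from the SetR prompts.

```python
import re
from typing import List, Sequence

# Hypothetical instruction text per variant (assumed for illustration only).
INSTRUCTIONS = {
    "selection_only": "Select the passages needed to answer the question.",
    "cot": "Let's think step-by-step, then select the passages needed to answer the question.",
    "cot_iri": (
        "Step 1: list the information needs of the question. "
        "Step 2: for each need, identify the passages that contain it. "
        "Step 3: select the set of passages that together covers all needs."
    ),
}
# Assumed output convention so that the selection can be parsed.
SUFFIX = "End with one line of the form 'Final selection: [i] [j] ...'."


def build_prompt(query: str, passages: Sequence[str], variant: str) -> str:
    """Number the candidates and prepend the variant-specific instruction."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"{INSTRUCTIONS[variant]} {SUFFIX}\n\nQuestion: {query}\n\nPassages:\n{numbered}"


def parse_selection(model_output: str, num_passages: int) -> List[int]:
    """Read passage identifiers from the text after the last selection marker."""
    # Identifiers mentioned earlier, inside the reasoning, are ignored.
    tail = model_output.rsplit("Final selection:", 1)[-1]
    ids = {int(m) for m in re.findall(r"\[(\d+)\]", tail)}
    # Drop identifiers that do not correspond to a candidate passage.
    return sorted(i for i in ids if 1 <= i <= num_passages)
```

Parsing only the text after the marker matters for the two CoT variants, whose reasoning typically mentions passages that are not selected in the end.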
Experiments
Main Results (End-to-End QA)
Using bge-large-en-v1.5 as the first-stage retriever and Llama-3.1-8B as the generator:
| Model | No. of Passages | HotpotQA EM/F1 | 2Wiki EM/F1 | MuSiQue EM/F1 | MHRAG Acc |
|---|---|---|---|---|---|
| BM25 only | 5.00 | 30.07/30.97 | 31.17/25.22 | 7.44/10.78 | 41.82 |
| bge-reranker-large | 5.00 | 32.48/33.24 | 31.92/25.47 | 8.06/12.50 | 43.50 |
| RankGPT (gpt-4o) | 5.00 | 33.85/34.45 | 34.36/28.06 | 9.43/13.25 | 45.69 |
| SetR-CoT & IRI | 2.91 | 36.62/38.11 | 35.44/30.35 | 10.79/15.43 | 47.14 |
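For readers unfamiliar with the EM and F1 columns, the following is a minimal sketch of SQuAD-style exact match and token-level F1. Which implementation the paper uses is an assumption here: under this definition F1 can never be lower than EM, yet the 2Wiki column reports F1 below EM, so the paper's F1 may be computed differently.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "Paris, France" against the gold answer "Paris" scores 0 on exact match and 2/3 on token F1, since one of the two predicted tokens is correct and the single gold token is recalled.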
Ablation Study
| Variant | HotpotQA F1 | MHRAG Acc | Description |
|---|---|---|---|
| SetR-Selection only | 37.84 | 46.20 | Outperforms all reranking baselines even without reasoning |
| SetR-CoT | 38.20 | 45.26 | Highest HotpotQA F1, but MHRAG accuracy falls below both Selection only and IRI |
| SetR-CoT & IRI | 38.11 | 47.14 | Best MHRAG accuracy from explicit information need analysis; HotpotQA F1 within 0.1 of SetR-CoT |
Information Coverage and Robustness Analysis (MultiHopRAG)
| Metric | Reranking Baseline | SetR |
|---|---|---|
| Hit@k | 48.87% | 69.90% (+21.03 pp) |
| Information Coverage Rate | 19.33% | 36.49% (+17.16 pp) |
| Precision@5 | 0.1799 (RankGPT, best baseline) | 0.2268 (+26.1% relative) |
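The gains in this table mix two units: the Hit@k and coverage deltas are absolute differences in percentage points, while the Precision@5 delta is a relative change. A short check of the arithmetic, using only the values from the table (the helper names are made up for this sketch):

```python
def absolute_gain_pp(baseline_pct: float, new_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(new_pct - baseline_pct, 2)


def relative_gain_pct(baseline: float, new: float) -> float:
    """Change relative to the baseline value, in percent."""
    return round((new - baseline) / baseline * 100, 1)


print(absolute_gain_pp(48.87, 69.90))     # 21.03 (Hit@k)
print(absolute_gain_pp(19.33, 36.49))     # 17.16 (information coverage)
print(relative_gain_pct(0.1799, 0.2268))  # 26.1  (precision)
```

Expressed on the same relative scale as the precision figure, the Hit@k gain would be `relative_gain_pct(48.87, 69.90)`, roughly 43%, so the three deltas should not be compared directly.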
Key Findings
- Less is More: SetR outperforms every baseline in the main results while using only 2.91 passages on average, against the 5 passages each baseline uses.
- Significant Improvement in Information Coverage: Hit@k and information coverage improve by about 21 and 17 percentage points respectively, whereas traditional reranking yields only about a 10% improvement.
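- IRI vs. General CoT: On MHRAG, generic CoT (45.26) underperforms even the Selection only variant (46.20), while CoT with IRI reaches 47.14. The improvement therefore does not stem from CoT reasoning alone, but from task-specific structured analysis. On HotpotQA F1 the three variants are within 0.4 points of each other.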
- Contradiction between Precision and Rank Metrics: SetR significantly leads in terms of Precision but shows slightly lower MRR/NDCG, exposing the limitations of traditional ranking metrics in multi-hop scenarios.
- Increasing the Number of Passages Degrades Baseline Performance: More passages introduce more noise and conflicting information; the selective strategy of SetR successfully avoids this issue.
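The tension between precision and rank metrics noted above can be reproduced on a toy example. The passage ids and gold labels below are invented for illustration, and `coverage` is a simple recall-style definition assumed for this sketch rather than the paper's exact metric.

```python
from typing import Sequence, Set


def reciprocal_rank(ordered_ids: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first gold passage, or 0.0 if none is present."""
    for rank, pid in enumerate(ordered_ids, start=1):
        if pid in gold:
            return 1.0 / rank
    return 0.0


def precision(selected_ids: Sequence[str], gold: Set[str]) -> float:
    """Fraction of the selected passages that are gold evidence."""
    if not selected_ids:
        return 0.0
    return sum(pid in gold for pid in selected_ids) / len(selected_ids)


def coverage(selected_ids: Sequence[str], gold: Set[str]) -> float:
    """Fraction of the gold evidence that was selected."""
    return len(set(selected_ids) & gold) / len(gold)


gold = {"g1", "g2"}                              # a two-hop question
reranked_top5 = ["g1", "n1", "n2", "n3", "n4"]   # ranked list cut at k=5
selected_set = ["n5", "g1", "g2"]                # unordered set, listed in candidate order

for name, ids in [("rerank", reranked_top5), ("set", selected_set)]:
    print(name, reciprocal_rank(ids, gold), precision(ids, gold), coverage(ids, gold))
```

The reranked list earns a perfect reciprocal rank because its first passage is gold, yet it supplies only one of the two required hops. The selected set scores half that on reciprocal rank, since an unordered set has no meaningful first position, but has higher precision and full coverage.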
Highlights
- Paradigm Innovation: Presented as the first shift of RAG retrieval from "ranking" to "set selection", a concise idea that changes the selection objective from per-passage relevance to joint coverage.
- High Efficiency and Practicality: Fewer passages + better performance = reduced context window pressure and inference costs.
- Fully Open Source: Model weights, training data construction pipelines, and code are all publicly available.
- Meticulous Ablation Design: Three variants clearly validate the independent contribution of IRI.
Limitations
- Distillation relies heavily on the quality of the GPT-4o teacher's annotations, so the student may inherit the teacher's biases.
- The training data is based on MS MARCO (generic English domain), and the generalizability to non-English or specific domains remains to be verified.
- SetR performs worse than the best reranking methods on rank-based metrics (MRR/NDCG), suggesting that the two paradigms might be complementary.
- SetR has not been evaluated on single-hop QA or with long-context LLMs.
- Variable-sized outputs may introduce uncertainty into downstream system design.
Related Work
- RAG Reranking: LLM-based listwise reranking methods like RankGPT (Sun et al., 2024) and RankZephyr (Pradeep et al., 2023b) focus on individual relevance.
- Iterative Retrieval: Multi-turn retrieval refinements like Self-RAG (Asai et al., 2023) and CoRAG (Wang et al., 2024) are effective but carry high computational overhead.
- Query Decomposition: Methods like IRCoT (Trivedi et al., 2023) and RAPTOR (Sarthi et al., 2024) improve retrieval by decomposing complex queries.
- Context Compression: Approaches like Chirkova et al. (2025) prune retrieved context, which is complementary to set selection.
Rating
| Dimension | Score (1-10) |
|---|---|
| Novelty | 9 |
| Technical Depth | 7 |
| Experimental Thoroughness | 8 |
| Writing Quality | 8 |
| Value | 9 |
| Overall Score | 8.2 |