🔍 Information Retrieval & RAG

💬 ACL2025 · 84 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (81) · 💬 ACL2026 (73) · 🧪 ICML2026 (26) · 🤖 AAAI2026 (21) · 🧠 NeurIPS2025 (25) · 📹 ICCV2025 (5)

🔥 Top topics: RAG ×50 · LLM ×12 · Reasoning ×6 · Question Answering ×6 · Adversarial Robustness ×4

A Reality Check on Context Utilisation for Retrieval-Augmented Generation: This paper proposes the DRUID real-world fact verification dataset and the ACU evaluation metric, revealing that synthetic datasets (CounterFact, ConflictQA) exaggerate the impact of context features, leading to overly optimistic assessments of LLM context utilization capabilities, and calling for the study of RAG using real-world retrieved data.
A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens: This paper reveals an intriguing phenomenon in LLM text embeddings: when embedding vectors are mapped back to the vocabulary space via the decoding layer, the tokens with the highest decoding probability align closely with the keywords of the input text. Spectral analysis further shows that this phenomenon is primarily controlled by the first principal component. Based on this, a simple training-free sparse retrieval method is proposed, preserving over 80% of the original dense retrieval performance.
Accelerating Adaptive Retrieval Augmented Generation via Instruction-Driven Representation Reduction of Retrieval Overlaps: This paper proposes IDR², a model-agnostic adaptive RAG acceleration framework. By eliminating redundant representations of overlapping documents across multi-round retrieval and using retrieved content to guide parallel decoding, it achieves approximately 2× end-to-end acceleration without compromising generation quality.
AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark: This paper proposes AIR-Bench, the first heterogeneous IR benchmark that leverages LLMs to automatically generate test data. It covers 2 tasks (QA/Long-Doc), 9 domains, and 13 languages across 69 datasets. A three-stage quality control pipeline ensures that the generated data is highly consistent with human annotations, addressing the limitations of narrow domain coverage and high update costs in traditional IR benchmarks.
Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval: This paper formally defines the Visualized Information Retrieval (Vis-IR) paradigm, which uniformly renders multimodal information into screenshots for retrieval. It constructs the VIRA dataset containing 13 million screenshots, the UniSE retrieval model family, and the MVRB benchmark, laying the foundation for unified search engines.
ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search: Proposes the ARise framework, which integrates Bayesian risk assessment and dynamic RAG into Monte Carlo Tree Search to address the error propagation and verification bottleneck issues in knowledge-augmented reasoning. On multi-hop QA tasks, it outperforms state-of-the-art KAR methods by 23.10% and RAG-equipped reasoning models (DeepSeek-R1) by 25.37% in average accuracy.
Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models: Astute RAG proposes a robust RAG approach against imperfect retrieval. By executing three steps—adaptive generation of internal LLM knowledge as a supplement, source-aware knowledge consolidation, and reliability-based answer generation—it significantly outperforms existing robust RAG methods on Gemini and Claude. Furthermore, it is the only method that does not perform worse than the No-RAG baseline in the worst-case scenario (where all retrieved documents are irrelevant).
Atomic LLM: A Fine-Grained Information Retrieval Evaluation Benchmark for Language Models: This paper proposes the Atomic LLM benchmark, which decomposes information retrieval evaluation into atomic-level fact retrieval tasks. It evaluates the information retrieval capabilities of LLMs across multiple dimensions, including factual precision, source attribution, and granularity coverage, revealing systematic deficiencies of existing LLMs in precise fact extraction.
Automatic Benchmark Generation from Scientific Papers via Retrieval-Augmented LLMs: This paper proposes an automated benchmark generation method based on retrieval-augmented LLMs. It automatically extracts testable knowledge points from scientific papers and generates high-quality evaluation questions. Its effectiveness has been validated across domains such as NLP, machine learning, and bioinformatics, providing a new paradigm for the rapid construction of domain-specific LLM evaluation benchmarks.
Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims: The ClaimSpect framework is proposed to automatically decompose complex claims into hierarchical aspect trees and discover supporting/neutral/opposing viewpoints along with their degree of consensus from a corpus through discriminative retrieval.
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models: This paper proposes CoIR, the first comprehensive benchmark for code information retrieval. Comprising 10 datasets across 4 major categories, 8 subtasks, and 14 programming languages, CoIR reveals that even state-of-the-art (SOTA) retrieval models underperform in code retrieval, and highlights that many models have overfitted to existing leaderboards.
Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence: This paper presents the first systematic study of the individual and combined effects of multiple heuristic biases (brevity, position, literal, and repetition biases) in dense retrievers. It reveals that when multiple biases are compounded, the probability of a retriever selecting the document containing the answer drops below 10%, and these biases can be exploited to manipulate RAG systems, leading to a 34% drop in performance.
ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry: This paper proposes ComRAG, a retrieval-augmented generation framework for real-time community question answering (CQA) in industry. Using a tri-store architecture (a static knowledge vector store plus high-quality and low-quality dynamic QA vector stores) and a centroid-based memory mechanism, it achieves up to a 25.9% improvement in vector similarity across three CQA datasets while reducing latency by 8.7%-23.3%.
Contradiction Detection in RAG-Based Chatbots: This paper addresses the contradiction between retrieved documents and generated responses in RAG dialogue systems by proposing a multi-granularity contradiction detection framework that identifies explicit, implicit, and partial contradictions while providing interpretable contradiction localization.
Cross-Lingual Relevance Transfer for Document Retrieval: This paper proposes a cross-lingual relevance transfer method that transfers relevance judgment capability to low-resource languages using a retrieval model trained on high-resource languages (e.g., English), significantly outperforming existing methods on multiple cross-lingual document retrieval benchmarks.
Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG: DTA proposes dividing RAG queries into four quadrants based on parametric knowledge boundaries and retrieval knowledge boundaries. For queries where "both are unknown," DTA constructs preference data and applies DPO to train the model to answer "I don't know." This addresses the issue of RAFT models generating answers even when retrieval is entirely noisy, achieving an effective balance between accuracy and appropriate abstention.
Don't Reinvent the Wheel: Efficient Instruction-Following Text Embedding based on Guided Space Transformation: The GSTransform framework is proposed, which adapts pre-computed generic embeddings in real time to the semantic space specified by user instructions via a lightweight space transformation. This avoids re-encoding the entire corpus for each new instruction, achieving an average score of 66.01 across 9 datasets (compared to the SOTA baseline of 55.31) while delivering a 6x to 300x speedup in real-time latency.
Drama: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers: The Drama framework is proposed to systematically explore the combination of multiple LLM-based data augmentation strategies (cropped sentences + synthetic queries + LLM reranking preferences) with pruned LLM backbones. It trains small retriever models (0.1B-1B parameters) via single-stage contrastive learning, matching the 1B-parameter Gecko on BEIR with only 0.3B parameters, while exhibiting strong multilingual and long-context capabilities.
Empaths at SemEval-2025 Task 11: Retrieval-Augmented Approach to Perceived Emotions Prediction: This paper proposes the EmoRAG system, which combines a retrieval-augmented generation (RAG) pipeline with multi-LLM ensemble aggregation. Without any additional training, it achieves competitive results across 28 languages on the SemEval-2025 Task 11 multi-label emotion detection task, with an average F1-micro score of 0.638.
Enhancing Lexicon-Based Text Embeddings with Large Language Models: This paper proposes the LENS framework, which is the first to apply LLMs to general lexicon-based text embeddings. By utilizing token embedding clustering to resolve LLM vocabulary redundancy and introducing bidirectional attention to overcome the limitations of causal LLMs, LENS outperforms dense embeddings trained on the same data on MTEB. When combined with dense embeddings, it achieves state-of-the-art (SOTA) performance on BEIR.
Evaluation of Attribution Bias in Generator-Aware Retrieval-Augmented Large Language Models: Defines and investigates LLMs' attribution sensitivity and bias toward authorship information in RAG. Through counterfactual evaluation, this study reveals that informing LLMs of document authorship significantly alters attribution quality by 3-18%, and LLMs exhibit an attribution bias toward human authorship.
EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation: Proposes EXIT—an extractive context compression framework that evaluates sentence-query relevance in parallel via context-aware sentence-level binary classification, outperforming existing abstractive and extractive compression methods in both QA accuracy and inference latency.
FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation: This study reveals that existing context-faithful RAG methods achieve faithfulness by forcibly suppressing parametric knowledge, which increases the risk of misunderstanding the context (unfaithful errors decrease by 6.65% while mismatch errors increase by 6.42%). The paper proposes FaithfulRAG, which resolves knowledge conflicts through fact-level conflict detection (self-fact mining) and conflict reasoning (self-think module), outperforming the strongest baselines by 8-9 percentage points on FaithEval/SQuAD/MuSiQue/RealtimeQA.
FlashBack: Efficient Retrieval-Augmented Language Modeling for Fast Inference: To address the inference efficiency issues in Retrieval-Augmented Language Models (RALMs) caused by the repeated recomputation of KV cache due to prepending retrieved content, this paper proposes FlashBack. FlashBack appends retrieved content to preserve the input's KV cache, and utilizes Marking Tokens + LoRA fine-tuning to adapt to the new context pattern, achieving up to a 4x inference speedup on Llama 2-7B while maintaining comparable perplexity.
FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation: This paper proposes FlexRAG, an open-source RAG framework oriented towards research and prototyping. It supports text, multimodal, and web retrieval, achieving an order-of-magnitude lower resource overhead than similar frameworks (such as FlashRAG) through memory mapping and asynchronous processing.
From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on RAG Systems: This paper systematically investigates the impact of coreference resolution on the two stages of RAG systems: document retrieval and QA generation. It finds that coreference resolution consistently improves retrieval performance (with mean pooling models benefiting the most). In QA tasks, the performance improvement for small models is significantly greater than that for large models, even allowing small models to reach the baseline performance of large models.
GainRAG: Preference Alignment in Retrieval-Augmented Generation through Gain Signal Synthesis: This paper identifies a systematic gap between the "relevance" optimized by the retriever and the "gain" actually needed by the LLM in RAG: passages containing the gold answer still have a nearly 50% probability of causing incorrect generation, whereas indirectly relevant passages are often more effective. It proposes GainRAG, which defines the "gain" signal based on contrastive decoding perplexity and trains a lightweight selector to perform gain-oriented passage filtering between the retriever and the LLM. GainRAG outperforms Standard RAG and Rerank baselines across six QA datasets.
GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation: GaRAGe is a RAG benchmark consisting of 2,366 questions and over 35K human-annotated grounding passages. Through fine-grained grounding relevance annotations, it systematically evaluates LLMs' abilities to identify relevant information, deflect (refuse to answer), and attribute references in RAG scenarios.
GeAR: Generation Augmented Retrieval: GeAR introduces a fusion encoder and a text decoder into the traditional bi-encoder retrieval framework. It enhances the retrieval model's comprehension of fine-grained internal semantics of documents through generation tasks, without introducing additional computational overhead for global retrieval.
Controllable and Reliable Knowledge-Intensive Task-Oriented Conversational Agents with Declarative Genie Worksheets: Genie proposes a programmable framework for knowledge-intensive task-oriented dialogues. It defines LLM agent policies through declarative Worksheet specifications, limiting the LLM to two roles (semantic parsing and response generation) while an algorithmic runtime system enforces the policies. This improves the realistic task completion rate from 21.8% to 82.8%.
Graph of Records: Boosting Retrieval Augmented Generation for Long-context Summarization with Graphs: Proposes Graph of Records (GoR), which constructs a graph structure from LLM historical responses and retrieved text chunks. It utilizes GNNs to learn semantic and logical associations among nodes, combined with a self-supervised BERTScore training objective, improving ROUGE scores by 8-19% over retrieval baselines across four long-context global summarization datasets.
GRAF: Graph Retrieval Augmented by Facts for Romanian Legal Multi-Choice Question Answering: This paper proposes the GRAF algorithm, which integrates a legal knowledge graph (Law-RoG) and Graph Attention Networks for Romanian legal multiple-choice question answering, while open-sourcing JuRO (10,836 questions), the first Romanian legal MCQA dataset, and CROL, a legal corpus.
Gumbel Reranking: Differentiable End-to-End Reranker Optimization: This paper reformulates the reranking process in RAG systems as a document-level Top-k attention masking problem. By leveraging the Gumbel trick and relaxed Top-k sampling, it achieves end-to-end differentiable optimization to directly minimize the final language modeling loss, yielding a 10.4% improvement in Recall@5 on HotpotQA.
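For readers unfamiliar with the Gumbel trick mentioned in this summary, the following is a minimal sketch of hard Gumbel top-k sampling over reranker scores. It is not the paper's method: the paper uses a relaxed, differentiable top-k, whereas this sketch only illustrates the discrete sampling step that the relaxation approximates. The function name `gumbel_top_k` and the example scores are hypothetical.

```python
import math
import random


def gumbel_top_k(log_scores, k, rng=None):
    """Sample k distinct indices by perturbing log-scores with Gumbel(0, 1) noise.

    Taking the top-k of the perturbed scores draws indices without
    replacement, with higher-scoring items more likely to be chosen.
    """
    rng = rng or random.Random()
    perturbed = []
    for idx, score in enumerate(log_scores):
        u = rng.random()
        # Guard against log(0): random() can return exactly 0.0.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((score + gumbel, idx))
    perturbed.sort(reverse=True)
    return [idx for _, idx in perturbed[:k]]


# Hypothetical reranker log-scores for five retrieved documents.
doc_log_scores = [2.0, 0.5, 1.5, -1.0, 0.0]
selected = gumbel_top_k(doc_log_scores, k=2, rng=random.Random(0))
# Document-level 0/1 mask: only the selected documents stay visible.
mask = [1 if i in selected else 0 for i in range(len(doc_log_scores))]
```

The resulting `mask` plays the role of a document-level top-k attention mask; making it differentiable requires replacing the hard top-k with a relaxed version, which this sketch does not attempt.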
HASH-RAG: Bridging Deep Hashing with Retriever for Efficient, Fine Retrieval and Augmented Generation: Hash-RAG systematically integrates deep hashing into the RAG framework, enabling retrieval in only 10% of the time required by traditional methods. Its Prompt-Guided Chunk-to-Context (PGCC) module enhances generation quality without sacrificing efficiency.
Health-LLM: Personalized Retrieval-Augmented Disease Prediction System: This paper proposes the Health-LLM framework, which integrates feature score extraction from health reports via LLM + Llama Index, RAG-augmented medical knowledge retrieval, and CAAFE automated feature engineering with an XGBoost classifier. It achieves an accuracy of 0.833 and an F1 score of 0.762 for disease prediction on the IMCS-21 Chinese telemedicine dataset, significantly outperforming GPT-4 few-shot+RAG (accuracy 0.68) and fine-tuned LLaMA-2-13B (accuracy 0.73).
HELIOS: Harmonizing Early Fusion, Late Fusion, and LLM Reasoning for Multi-Granular Table-Text Retrieval: Proposes HELIOS, a three-stage graph retrieval framework (edge-level early fusion → node-level late fusion → star-graph-level LLM refinement). Through multi-granular coordination, it addresses the three major challenges in table-text retrieval: retrieval unit granularity, query dependency relationship discovery, and advanced reasoning, achieving a 42.6% improvement in Answer Recall on OTT-QA.
Hierarchical Document Refinement for Long-context Retrieval-augmented Generation: This paper proposes LongRefiner, a plug-and-play long-document refinement system. Through three stages (dual-level query analysis, hierarchical document structuring, and adaptive refinement), it outperforms full-text input on 7 QA datasets using only 1/10 of the token budget, with latency just 1/10 that of the strongest baseline.
HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation: This paper proposes HoH, the first large-scale dynamic benchmark specifically designed to evaluate the impact of outdated information on RAG systems, containing 96,124 QA pairs and 219,463 documents, revealing the severe hazards of outdated information on RAG performance and safety.
HybGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases: This paper proposes HybGRAG, a method that leverages both textual and relational information through a Retriever Bank, coupled with a Critic module's self-reflection to iteratively correct question routing errors, achieving an average Hit@1 improvement of 51% on hybrid question answering tasks over semi-structured knowledge bases.
Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion: The authors question whether the performance gains of LLM-based query expansion (HyDE/Query2doc) truly stem from "hypothetical document generation." They find that performance improvements consistently occur only when the LLM-generated documents contain sentences semantically consistent with the gold evidence, revealing potential knowledge leakage issues in benchmarks.
InspireDebate: Multi-Dimensional Evaluation-Guided Reasoning for Debating: A two-component framework is proposed: InspireScore (a debate evaluation system integrating 4 subjective dimensions and 2 objective dimensions) and InspireDebate (a debate framework optimized through a three-stage process of CoT-SFT + multi-dimensional DPO + Web-RAG). This evaluation system improves correlation with expert judgment by 44%, and the debate performance surpasses the baseline by 57%.
Investigating Language Preference of Multilingual RAG Systems: This work systematically investigates the language preference issue in both the retrieval and generation stages of multilingual RAG (mRAG) systems. It proposes the MLRS metric to quantify the degree of retriever preference for specific languages, revealing that retrievers favor high-resource and query languages, while generators prefer the query language and Latin-script languages. Finally, it designs the DKM-RAG framework, which effectively mitigates the preference issue by fusing translated passages with the model's internal knowledge.
Investigating the Robustness of Retrieval-Augmented Generation at the Query Level: Proposes the first modular analysis framework for query-level RAG robustness. Through 1092+ experiments across 5 perturbation types × 4 retrievers × 3 LLMs × 3 datasets, the study reveals the complementary robustness of dense and sparse retrievers against different perturbation types and provides actionable engineering recommendations.
KnowShiftQA: How Robust are RAG Systems when Textbook Knowledge Shifts in K-12 Education?: The authors construct the KnowShiftQA dataset (3,005 questions across 5 subjects) to simulate differences between textbooks and the parametric knowledge of LLMs through hypothetical knowledge updates. They systematically evaluate the robustness of RAG systems and find that the performance of existing systems drops by 22-27% under knowledge shifts.
LDIR: Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations: This paper proposes LDIR, a method that selects anchor texts using farthest point sampling (FPS) and computes the semantic similarity between target texts and each anchor text. This constructs low-dimensional (≤ 500 dimensions), dense, and interpretable text embeddings, yielding performance close to black-box models and significantly outperforming existing interpretable embedding approaches.
Length-Induced Embedding Collapse in PLM-based Models: Identifies and rigorously proves the "length collapse" phenomenon in PLM-based text embedding models—where long-text embeddings tend to cluster together because self-attention acts as a low-pass filter whose filtering effect intensifies as text length increases, over-suppressing high-frequency information. Proposes the TempScale method to mitigate the distributional discrepancy between short and long text embeddings by scaling down the attention temperature, improving MTEB by 0.94% and LongEmbed by 1.10%.
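The effect of lowering the attention temperature can be illustrated with a generic temperature-scaled softmax. This is a sketch of the general idea only, not the TempScale implementation. The function names and example logits are hypothetical, and the convention used here (dividing logits by the temperature) is an assumption.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; a smaller temperature gives a sharper distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower entropy means attention is less spread out."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical attention logits for one query over six keys.
logits = [1.2, 0.9, 1.0, 1.1, 0.8, 1.3]
baseline = softmax_with_temperature(logits, temperature=1.0)
sharpened = softmax_with_temperature(logits, temperature=0.5)
```

With the smaller temperature the weights concentrate on fewer keys, so `entropy(sharpened)` is lower than `entropy(baseline)`. This works against the averaging (low-pass) behaviour that the summary attributes to self-attention on long inputs.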
Re-ranking Using Large Language Models for Mitigating Exposure to Harmful Content on Social Media Platforms: Proposes a pairwise preference re-ranking method based on LLMs to demote harmful content in social media recommendation sequences under zero-shot and few-shot settings. The method significantly outperforms industrial-grade classifiers such as Perspective API and OpenAI Moderation API, while introducing two new evaluation metrics: PP-k and EWN.
Logical Consistency is Vital: Neural-Symbolic Information Retrieval for Negative-Constraint Queries: This work proposes NS-IR, which translates natural language queries and documents into first-order logic (FOL) and optimizes dense retrieval embeddings using two techniques: logic alignment and connective constraints, significantly improving retrieval performance in complex logical scenarios such as negative-constraint queries.
MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation: This paper proposes MAIN-RAG, a training-free multi-agent RAG filtering framework in which three LLM agents (Predictor → Judge → Final-Predictor) collaborate to evaluate the relevance of retrieved documents. It designs an adaptive threshold (based on the score mean and standard deviation) to dynamically filter noisy documents, achieving a 2-11% accuracy improvement across 4 QA benchmarks.
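The adaptive threshold can be sketched as follows. The summary only states that the threshold is derived from the mean and standard deviation of the relevance scores, so the exact form used here (mean minus `n` standard deviations) and the names `adaptive_filter` and `n` are assumptions for illustration.

```python
from statistics import mean, pstdev


def adaptive_filter(scored_docs, n=0.0):
    """Keep documents whose score reaches an adaptive threshold.

    scored_docs: list of (doc_id, relevance_score) pairs.
    n: hypothetical knob; threshold = mean(scores) - n * std(scores).
       A larger n lowers the threshold and keeps more documents.
    """
    scores = [score for _, score in scored_docs]
    threshold = mean(scores) - n * pstdev(scores)
    return [doc for doc, score in scored_docs if score >= threshold]


# Hypothetical judge scores for three retrieved documents.
docs = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
kept_strict = adaptive_filter(docs, n=0.0)   # threshold is the mean score
kept_lenient = adaptive_filter(docs, n=2.0)  # threshold two std devs below the mean
```

Because the threshold is computed per query from that query's own score distribution, no fixed cutoff has to be tuned in advance.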
MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation: Constructs MEMERAG, the first native multilingual RAG meta-evaluation benchmark covering 5 languages. It achieves high inter-annotator agreement through flowchart-guided annotation, and is designed to evaluate and compare multilingual RAG automatic evaluators.
Micro-Act: Mitigate Knowledge Conflict in QA via Actionable Self-Reasoning: This paper proposes the Micro-Act framework, which introduces a hierarchical action space (navigational, functional, and bridging actions) and adaptive granularity decomposition. It enables LLMs to automatically perceive context complexity and break the knowledge comparison down layer by layer. Micro-Act outperforms state-of-the-art (SOTA) methods across 5 knowledge-conflict benchmarks while maintaining robustness in conflict-free scenarios.
Mitigating Lost-in-Retrieval Problems in RAG Multi-Hop QA: This paper identifies the "lost-in-retrieval" problem in RAG multi-hop QA—where subsequent sub-questions suffer a drastic drop in retrieval performance due to the lack of key entities after sub-question decomposition. To address this, the ChainRAG framework is proposed, which constructs a sentence graph, performs progressive retrieval, and rewrites sub-questions (to complete missing entities) to form a coherent reasoning chain, consistently outperforming baselines across three datasets: MuSiQue, 2Wiki, and HotpotQA.
MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System: This paper proposes two metrics directly quantifying chunking quality, Boundary Clarity and Chunk Stickiness, along with a granularity-aware Mixture-of-Chunkers (MoC) framework. By employing a regex-guided lightweight chunking strategy, it achieves superior performance in RAG systems compared to traditional methods and direct LLM-based chunking.
MT-RAIG: Novel Benchmark and Evaluation Framework for Retrieval-Augmented Insight Generation over Multiple Tables: This paper introduces MT-RAIG Bench—the first large-scale benchmark for retrieval-augmented insight generation over multiple tables—and MT-RAIG Eval—a decomposition-based, fine-grained automatic evaluation framework. Experiments demonstrate that even frontier LLMs underperform on multi-table reasoning (achieving only around 40% faithfulness and 60% completeness).
Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness: The BordIRLines benchmark dataset was constructed, containing territorial dispute queries in 49 languages with paired retrieved Wikipedia documents. Through a systematic evaluation of cross-lingual robustness in multilingual RAG environments, it was found that retrieving multilingual documents improves response consistency and reduces geopolitical bias better than retrieving only same-language documents.
NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering: NeuSym-RAG proposes a hybrid neural-symbolic retrieval framework that parses PDF documents through multiview chunking and simultaneously stores them into a relational database and a vector database. Under this framework, an LLM Agent iteratively interacts with the backends via executable actions (SQL queries, vector retrieval, viewing images, etc.), improving performance on academic paper QA by 17.3% compared to classic RAG.
On Synthetic Data Strategies for Domain-Specific Generative Retrieval: This paper systematically investigates synthetic data strategies for training generative retrieval models on domain-specific corpora, proposing multi-granular query generation, constraint-based queries, and preference learning based on hard negatives, which significantly improves retrieval performance.
Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval: For the low-resource and morphologically rich language Amharic, this paper proposes dense retrieval models based on pre-trained Amharic BERT/RoBERTa and a ColBERT late-interaction model. These achieve substantial improvements in passage retrieval with parameters far fewer than those of multilingual baselines, establishing the first systematic retrieval benchmark for the language.
Pandora's Box or Aladdin's Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models: This paper defines 7 noise types in RAG systems from a linguistic perspective and builds the NoiserBench comprehensive evaluation framework. Through large-scale experiments on 8 LLMs, it discovers that noise can be categorized into harmful noise (counterfactual, supportive, orthographic) and beneficial noise (semantic, datatype, illegal sentence). Remarkably, beneficial noise can improve model accuracy by 1-3%.
Parenting: Optimizing Knowledge Selection of Retrieval-Augmented Language Models with Parameter Decoupling and Tailored Tuning: Inspired by functional areas of the human brain, the Parenting framework is proposed. It decouples and localizes subspaces related to "context adherence" and "noise robustness" in the parameter space of LLMs, and designs customized fine-tuning strategies for different subspaces to achieve a balanced enhancement of both capabilities.
PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data: Proposes a synthetic data generation pipeline to create the PersonaBench benchmark, which contains diverse user personas and simulated private documents (chat logs, AI interactions, purchase history). It is designed to evaluate AI models' ability to extract personal information from noisy user data. Experimental results show that current RAG formulations are far from sufficient for this task.
PRISM: A Framework for Producing Interpretable Political Bias Embeddings with Political-Aware Cross-Encoder: This work proposes the PRISM framework, which models political bias embedding as an interpretable task for the first time. It automatically extracts controversial topics and left/right bias indicators from a weakly labeled news corpus as embedding dimensions. Then, a political-aware cross-encoder is employed to score articles along each topic dimension, generating sparse and semantically transparent political bias embeddings. The proposed method achieves 86.1% accuracy on NewsSpectrum (outperforming POLITICS by 34.8%) while supporting diverse retrieval.
Uncovering Visual-Semantic Psycholinguistic Properties from the Distributional Structure of Text Embedding Space: Proposes the Neighborhood Stability Measure (NSM)—an unsupervised, distribution-free method that estimates word imageability and concreteness by quantifying the stability of neighborhoods in the text embedding space, outperforming existing approaches that rely on multimodal or generative models while using only the text modality.
RAEmoLLM: Retrieval Augmented LLMs for Cross-Domain Misinformation Detection Using In-Context Learning Based on Emotional Information: This paper proposes RAEmoLLM, the first RAG framework based on emotional information retrieval. It utilizes the implicit embeddings of an emotional LLM to construct a retrieval database, providing emotion-related few-shot examples for cross-domain misinformation detection. Without the need for fine-tuning, RAEmoLLM improves performance by up to 15.64%, 31.18%, and 15.73% on three benchmarks respectively compared to other few-shot methods.
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework: RAGEval proposes a schema-based automated evaluation dataset generation framework. It can automatically generate high-quality document-question-answer-reference quadruplets for different vertical domains (finance, law, medicine, etc.) and introduces three new evaluation metrics—Completeness, Hallucination, and Irrelevance—to rigorously assess the factual accuracy of RAG systems.
RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models: Proposes RARE, which introduces two retrieval-augmented actions into the MCTS reasoning framework of rStar (A6: generate search queries based on the original question and retrieve; A7: retrieve and re-answer sub-questions) and replaces the original discriminator with a Retrieval-Augmented Factuality Scorer (RAFS), enabling LLaMA 3.1 to match or exceed GPT-4o performance on medical and commonsense reasoning tasks.
Redundancy, Isotropy and Intrinsic Dimensionality of Prompt-Based Text Embeddings: This paper systematically studies the performance robustness of prompt-based text embedding models (such as gte-Qwen2, E5-mistral, etc.) under post-processing dimensionality reduction. It discovers that classification/clustering tasks can largely preserve performance with only 0.5% of the original dimensions, and quantitatively explains the differences in embedding redundancy across different task prompts using two metrics: Intrinsic Dimension (ID) and isotropy (IsoScore).
Re-identification of De-identified Documents with Autoregressive Infilling: This paper proposes a RAG-based re-identification method for de-identified documents. It first uses sparse and dense retrieval to find relevant background documents, then uses an autoregressive infilling model to infer the masked personally identifiable information (PII), recovering up to 80% of the masked text across three datasets.
Reranking-based Generation for Unbiased Perspective Summarization: This paper constructs a controlled test suite for the political perspective summarization task to evaluate the reliability of existing evaluation metrics, finding that LLM-based metrics significantly outperform traditional ones. It also demonstrates that reranking-based methods and DPO training on reranked data can substantially improve both the coverage and faithfulness of perspective summaries.
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model: This work proposes SafeRAG, the first Chinese RAG security evaluation benchmark. It designs four novel attack tasks (silver noise, inter-context conflict, soft ad, and white DoS) capable of bypassing existing retrievers, filters, and generators. By systematically evaluating security vulnerabilities across 14 RAG components, it reveals that even state-of-the-art RAG systems are highly vulnerable to these attacks.
SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation: SeaKR leverages the self-aware uncertainty of LLMs' internal hidden layers (measured by the Gram determinant of hidden representations from multiple EOS token samplings) to adaptively decide when to retrieve, how to re-rank retrieval results, and which reasoning strategies to select. It improves F1 in multi-hop QA by 6% compared to DRAGIN and 9.5% compared to IRCoT.
Semantic Outlier Removal with Embedding Models and LLMs: This paper proposes SORE (Semantic Outlier Removal), a text cleaning method based on multilingual sentence embeddings and approximate nearest neighbor (ANN) search. It identifies core content via metadata embeddings and flags text segments that match predefined outlier categories or deviate significantly from the core content. SORE approaches LLM-level precision at very low computational cost and has been deployed in production to process millions of documents daily.
SetR: Shifting from Ranking to Set Selection for Retrieval Augmented Generation: SetR is proposed to shift the document ranking paradigm in RAG to a set selection paradigm. By using CoT reasoning to identify the information needs of queries and select the optimal document set, SetR significantly improves multi-hop QA performance while utilizing fewer documents (an average of 2.91 vs. 5).
SGIC: A Self-Guided Iterative Calibration Framework for RAG: SGIC uses token-level uncertainty scores from the LLM (document relevance uncertainty plus answer confidence uncertainty) as guidance signals for self-calibration. By iteratively injecting the previous response and its uncertainty score into the prompt to trigger in-context reasoning, it improves the EM of Llama2-7B on HotpotQA from 69.1% to 77.2% (+8.1 points) and also yields a 2.8% boost for GPT-4o.
Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models: This paper systematically investigates the "sticky token" phenomenon in text embedding models, where repeating certain anomalous tokens in sentences drags their cosine similarity towards a fixed value. It proposes an efficient detection method, STD, and identifies 868 sticky tokens across 40 checkpoints from 14 model families, revealing performance degradation of up to 50% on downstream tasks.
The Distracting Effect: Understanding Irrelevant Passages in RAG: This paper proposes a formal metric for the Distracting Effect (DE) of passages and develops multiple techniques to acquire highly distracting passages (including answer-skewed retrieval and categorized generation). It demonstrates the robustness of this metric across different LLMs, and finally improves QA accuracy by up to 7.5% through fine-tuning LLMs with highly distracting training samples.
Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience: This paper proposes the CoRE framework, which uses Monte Carlo Tree Search (MCTS) to build an experience memory containing both successful and failed reasoning trajectories. During inference, it retrieves positive and negative exemplars via Contrastive In-Context Learning (Contrastive ICL) to strengthen LLM reasoning over structured data (tables, databases), achieving average improvements of 3.44% and 4.24% on Text-to-SQL and TableQA, respectively.
Towards Adaptive Memory-Based Optimization for Enhanced Retrieval-Augmented Generation: This paper proposes the Amber framework, which enhances retrieval efficiency and answer quality in open-domain question answering within an iterative RAG paradigm through the collaboration of three components: an Agent-based Memory Updater, an Adaptive Information Collector, and a Multi-granular Content Filter.
Typed-RAG: Type-Aware Decomposition of Non-Factoid Questions for Retrieval-Augmented Generation: Typed-RAG performs type-aware decomposition of non-factoid questions (NFQs). It decomposes complex multi-aspect questions into single-aspect sub-queries and applies retrieval and generation strategies tailored to each question type (debate, experience, comparison, etc.), significantly improving RAG performance on NFQA.
Unanswerability Evaluation for Retrieval Augmented Generation: UAEval4RAG is a comprehensive framework for evaluating retrieval-augmented generation (RAG) systems on unanswerable queries. It defines six categories of unanswerability, automatically synthesizes test data from any given knowledge base, and evaluates how well systems refuse to answer. Experiments reveal that no single configuration optimizes performance for both answerable and unanswerable queries across all datasets.
VISA: Retrieval Augmented Generation with Visual Source Attribution: VISA proposes a RAG method based on visual source attribution, which leverages large vision-language models (VLMs) to highlight the precise region supporting the generated answer with bounding boxes on retrieved document screenshots, and constructs two datasets, Wiki-VISA and Paper-VISA, to verify its effectiveness.
VoxRAG: A Step Toward Transcription-Free RAG Systems in Spoken Question Answering: This paper proposes VoxRAG, a modular speech-to-speech retrieval-augmented generation system. It uses CLAP audio embeddings to bypass transcription and retrieve semantically relevant audio segments directly from spoken queries. It validates the feasibility of transcription-free spoken retrieval in a podcast question-answering scenario, achieving a Recall@10 of 0.60 on somewhat relevant segments.
When Claims Evolve: Evaluating and Enhancing the Robustness of Embedding Models Against Misinformation Edits: A perturbation framework is proposed to systematically evaluate the robustness of sentence embedding models when handling edited misinformation claims. Standard embedding models exhibit significant performance degradation, which can be mitigated using two approaches: knowledge distillation and claim normalization, bringing up to a 17 percentage point improvement in in-domain robustness and a 10 percentage point improvement in cross-domain generalization.
When Should Dense Retrievers Be Updated in Evolving Corpora? Detecting Out-of-Distribution Corpora Using GradNormIR: This paper proposes GradNormIR, an unsupervised method that uses gradient norms to detect whether a corpus is out-of-distribution (OOD) for a dense retriever, without requiring queries. The detection result indicates when the retriever needs to be updated, helping keep retrieval robust as the corpus evolves.
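
The SeaKR entry above describes an uncertainty signal based on the Gram determinant of hidden representations from several sampled generations. The following is a minimal sketch of that idea only, not the paper's implementation: the function names, the toy 3-dimensional vectors, and the `eps` regularization term are all assumptions made for illustration. Intuitively, near-identical representations give a nearly singular Gram matrix (small determinant, low uncertainty), while divergent representations give a larger determinant.

```python
def gram_matrix(vectors):
    """Pairwise inner products G[i][j] = <v_i, v_j>."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(matrix)
    m = [row[:] for row in matrix]  # work on a copy
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def gram_uncertainty(hidden_states, eps=1e-3):
    """Hypothetical uncertainty score: det(G + eps * I).

    `eps` is an assumed regularizer that keeps the score nonzero when
    samples coincide; it is not taken from the paper.
    """
    g = gram_matrix(hidden_states)
    for i in range(len(g)):
        g[i][i] += eps
    return determinant(g)


# Toy stand-ins for hidden representations of three sampled outputs.
consistent = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
divergent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

low = gram_uncertainty(consistent)
high = gram_uncertainty(divergent)
print(low < high)
```

In an adaptive RAG loop, a score like this could be compared against a tuned threshold to decide whether to trigger retrieval; the threshold and its tuning are outside the scope of this sketch.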