🔍 Information Retrieval & RAG

💬 ACL2026 · 44 paper notes

All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG

This paper systematically reveals severe language bias (favoring English and the query language) in the reranking stage of multilingual RAG systems, and proposes the LAURA framework, which aligns the reranker via supervision signals driven by downstream generation quality, effectively mitigating bias and improving generation performance.

An Iterative Utility Judgment Framework Inspired by Philosophical Relevance via LLMs

Inspired by Schutz's philosophical theory of relevance, this paper proposes ITEM, an iterative utility judgment framework that enables the three core RAG components—relevance ranking, utility judgment, and answer generation—to mutually and dynamically enhance one another, yielding improvements over baselines across retrieval, utility judgment, and QA tasks.

Bayesian Active Learning with Gaussian Processes Guided by LLM Relevance Scoring

This paper proposes BAGEL, a Bayesian active learning framework based on Gaussian Processes (GP) that propagates sparse LLM relevance signals across the embedding space via an exploration–exploitation strategy under a limited LLM budget, enabling global passage retrieval that substantially outperforms conventional LLM re-ranking methods.

Beyond Black-Box Interventions: Latent Probing for Faithful Retrieval-Augmented Generation

This paper proposes ProbeRAG, which discovers the linear separability of conflicting and aligned knowledge in LLM latent spaces, and designs a three-stage framework (fine-grained knowledge pruning → latent conflict probing → conflict-aware attention) to address RAG faithfulness from the perspective of the model's internal mechanisms.

Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation

This paper formally defines the "soft-failure" threat in RAG systems (generating fluent but uninformative responses), proposes DEJA, a black-box evolutionary attack framework that injects adversarial documents to exploit safety alignment mechanisms and induce ambiguous responses, achieving a Soft Attack Success Rate (SASR) exceeding 79% with high stealthiness.

CarO: Chain-of-Analogy Reasoning Optimization for Robust Content Moderation

This paper proposes CarO (Chain-of-Analogy Reasoning Optimization), a two-stage training framework that enables LLMs to autonomously generate analogical reference cases during inference for content moderation. The framework combines RAG-guided analogical chain generation, SFT, and customized DPO. On ambiguous moderation benchmarks, CarO achieves an average F1 improvement of 24.9%, substantially outperforming reasoning models (DeepSeek R1) and dedicated moderation models (LLaMA Guard).

ChAIRO: Contextual Hierarchical Analogical Induction and Reasoning Optimization for LLMs

This paper proposes ChAIRO, a Contextual Hierarchical Analogical Induction and Reasoning Optimization framework that employs a three-stage pipeline (analogical case generation → rule induction → rule-injected fine-tuning) to enable LLMs to autonomously generate analogical cases and induce explicit moderation rules for content moderation. ChAIRO achieves a 4.5% F1 improvement over single-instance rule generation and a 2.3% improvement over static RAG.

ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals

This paper proposes ChunQiuTR, the first temporal retrieval benchmark built upon a non-Gregorian calendar system, constructed from the Spring and Autumn Annals and its exegetical traditions. It further introduces CTD (Calendrical Temporal Dual-encoder), which achieves temporally-aware retrieval via Fourier-based absolute calendrical context and relative temporal offset biases, substantially outperforming pure semantic baselines.
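A minimal sketch of what "Fourier-based absolute calendrical context" might look like, as a guess at CTD's temporal features: map a date's offset (e.g. years since a reference reign year) onto sin/cos features at several frequencies, so nearby dates get nearby vectors. The dimensions and frequency spacing here are illustrative assumptions, not the paper's architecture.

```python
import math

def fourier_time_features(offset: float, dims: int = 8) -> list[float]:
    """Encode a temporal offset as sin/cos features at several frequencies."""
    feats = []
    for k in range(dims // 2):
        freq = 1.0 / (10000 ** (2 * k / dims))   # transformer-style spacing
        feats += [math.sin(freq * offset), math.cos(freq * offset)]
    return feats

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Nearby years end up closer in feature space than distant ones.
y0, y1, y2 = (fourier_time_features(t) for t in (0, 1, 50))
print(l2(y0, y1) < l2(y0, y2))  # → True
```

The point of the multi-frequency encoding is that a purely semantic embedder has no such monotone notion of temporal distance, which is what the paper's pure-semantic baselines lack.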

CodePromptZip: Code-specific Prompt Compression for Retrieval-Augmented Generation in Coding Tasks with LMs

This paper proposes CodePromptZip, the first code-specific prompt compression framework, which constructs training data via type-aware priority ranking and trains a small-model compressor with a copy mechanism. It achieves improvements of 23.4%, 28.7%, and 8.7% over the best baseline across three coding tasks.

Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation

This paper proposes IRAP (Interactive Retrieval-Augmented Preference Elicitation), a method that quantifies natural-language software performance requirements into mathematical functions. Evaluated on 4 real-world datasets against 10 state-of-the-art baselines, IRAP achieves up to 40× performance improvement using only 5 interaction rounds.

Context Attribution with Multi-Armed Bandit Optimization

This paper proposes CAMAB, which frames context attribution in RAG — identifying which context segments contribute to the generated answer — as a Combinatorial Multi-Armed Bandit (CMAB) problem. Using Linear Thompson Sampling to adaptively explore the space of context subsets, CAMAB reduces model queries by up to 30% compared to SHAP and ContextCite while matching or surpassing attribution quality on HotpotQA, CNN/DM, and TyDi QA.

CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

This paper proposes CounterRefine, a lightweight inference-time repair layer: a standard RAG pipeline first generates a preliminary answer, which then conditions a counterevidence retrieval step to collect supporting and contradicting evidence; a constrained KEEP/REVISE decision gate combined with deterministic validation corrects erroneous answers, improving GPT-5's accuracy on SimpleQA from 67.3% to 73.1%.
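A guess at the shape of the KEEP/REVISE gate: tally how much of the answer-conditioned evidence supports versus contradicts the preliminary answer, and revise only when contradiction clearly outweighs support. The counting scheme and margin are illustrative assumptions; the paper additionally applies deterministic validation.

```python
def decide(support: int, contradict: int, margin: int = 1) -> str:
    """Return 'KEEP' or 'REVISE' for the preliminary answer."""
    return "REVISE" if contradict - support >= margin else "KEEP"

def repair(preliminary: str, evidence: list[tuple[str, bool]]) -> str:
    """evidence: (candidate answer, supports_preliminary) pairs."""
    support = sum(1 for _, ok in evidence if ok)
    contradict = len(evidence) - support
    if decide(support, contradict) == "REVISE":
        # Pick the most frequent contradicting candidate as the revision.
        candidates = [ans for ans, ok in evidence if not ok]
        return max(set(candidates), key=candidates.count)
    return preliminary

evidence = [("1912", False), ("1912", False), ("1915", True)]
print(repair("1915", evidence))  # → 1912
```

The constrained gate is what keeps the repair layer lightweight: when counterevidence is weak or split, the pipeline defaults to KEEP rather than second-guessing a likely-correct answer.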

CRAFT: Training-Free Cascaded Retrieval for Tabular QA

This paper proposes CRAFT, a three-stage cascaded table retrieval framework requiring no dataset-specific training (SPLADE sparse filtering → semantic mini-table ranking → neural re-ranking). By augmenting table representations with Gemini-generated captions and descriptions, CRAFT achieves SOTA on NQ-Tables (R@1 49.84), demonstrates strong zero-shot generalization on OTT-QA, and exhibits remarkable robustness to query paraphrasing.

CURaTE: Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge

CURaTE proposes a behavioral unlearning framework based on sentence embedding matching: a general-purpose unlearning embedder is trained prior to deployment (without any forget set); after deployment, embeddings of incoming unlearning requests are stored in a database; at inference time, cosine similarity determines whether to answer or refuse a query. LLM weights are never modified, yielding near-perfect knowledge preservation.
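The inference-time gate can be sketched as follows: embed the incoming query, compare it against stored embeddings of unlearning requests, and refuse when the best cosine similarity clears a threshold. The toy bag-of-words embedder and the 0.8 threshold are stand-in assumptions for the paper's trained unlearning embedder.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for the trained embedder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class UnlearningGate:
    def __init__(self, threshold: float = 0.8):
        self.forget_db = []          # embeddings of unlearning requests
        self.threshold = threshold

    def add_request(self, text: str):
        """Post-deployment: store the embedding, never touch LLM weights."""
        self.forget_db.append(embed(text))

    def should_refuse(self, query: str) -> bool:
        q = embed(query)
        return any(cosine(q, e) >= self.threshold for e in self.forget_db)

gate = UnlearningGate()
gate.add_request("who is john doe")
print(gate.should_refuse("who is john doe"))   # → True
print(gate.should_refuse("capital of france")) # → False
```

Because the gate sits entirely outside the model, knowledge preservation is trivially near-perfect: any query that fails to match the forget database is answered exactly as before.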

Detecting RAG Extraction Attack via Dual-Path Runtime Integrity Game

This paper proposes CanaryRAG, a RAG runtime defense mechanism inspired by stack canaries in software security. By injecting non-semantic canary tokens into retrieved chunks and designing a dual-path integrity game — the target path should not leak canary tokens, while the Oracle path should be able to elicit them — CanaryRAG detects knowledge base extraction attacks in real time, achieving state-of-the-art protection without compromising task performance or inference latency.
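A hypothetical sketch of the canary half of this design: plant a non-semantic marker token in each retrieved chunk, then flag a likely extraction attempt when the model's answer reproduces a canary verbatim (a benign answer paraphrases the evidence and never emits one). The token format and detection rule are illustrative assumptions, not the paper's construction.

```python
import secrets

def inject_canary(chunk: str) -> tuple[str, str]:
    """Append a random non-semantic marker token to a retrieved chunk."""
    canary = f"ZX{secrets.token_hex(4)}ZX"
    return f"{chunk} {canary}", canary

def leaked(answer: str, canaries: list[str]) -> bool:
    """Target-path check: a normal answer should never reproduce a canary."""
    return any(c in answer for c in canaries)

chunks = ["Paris is the capital of France.", "The Seine flows through Paris."]
guarded, canaries = zip(*(inject_canary(c) for c in chunks))

# Benign answer paraphrases the evidence: no canary appears.
print(leaked("The capital of France is Paris.", list(canaries)))  # → False
# Extraction attack regurgitates a chunk verbatim, canary included.
print(leaked(guarded[0], list(canaries)))                         # → True
```

The dual-path game in the paper adds the complementary check: an Oracle path that *should* be able to elicit the canaries, which distinguishes a working defense from one that merely suppresses all chunk content.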

Domain-Specific Data Generation Framework for RAG Adaptation

This paper proposes RAGen, a scalable and modular data generation framework that automatically synthesizes domain-specific QAC (Question-Answer-Context) data through document-level concept extraction, multi-chunk evidence assembly, and Bloom's Taxonomy-guided question generation. The framework supports contrastive fine-tuning of embedding models and supervised fine-tuning of LLMs, achieving substantial improvements over AutoRAG and LlamaIndex baselines across three domain-specific datasets.

DQA: Diagnostic Question Answering for IT Support

This paper proposes the DQA framework, which achieves systematic fault diagnosis in enterprise IT support by maintaining persistent diagnostic states and aggregating retrieved evidence at the root-cause level rather than processing documents individually. The success rate improves from a baseline of 41.3% to 78.7%, while the average number of turns decreases from 8.4 to 3.9.

End-to-End Optimization of LLM-Driven Multi-Agent Search Systems via Heterogeneous-Group-Based Reinforcement Learning

This paper proposes MHGPO (Multi-Agent Heterogeneous Group Policy Optimization), a critic-free multi-agent RL method that achieves end-to-end optimization in a three-agent search system (Rewriter→Reranker→Answerer) through heterogeneous-group relative advantage estimation and backward reward propagation. The method captures implicit cross-agent dependencies and cross-trajectory correlations, significantly outperforming MAPPO and GRPO baselines on multi-hop QA benchmarks such as HotpotQA.

Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization

CW-GRPO reframes process supervision as "advantage redistribution": an LLM judge evaluates the retrieval utility and reasoning correctness of each search turn, computes a contribution score to rescale outcome-based advantages, and achieves turn-level credit assignment without introducing an unstable value function. The approach outperforms standard GRPO by 5.0% on Qwen3-8B.

Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion

This paper demonstrates that the apparent "English preference" in multilingual RAG systems is primarily an artifact of structural priors embedded in evaluation benchmarks (i.e., gold evidence concentrated in English and cultural priors) rather than an intrinsic model bias. The authors propose a debiased language preference metric, DeLP, which reveals that retrievers actually prefer monolingual alignment. Building on this insight, they design the DELTA query augmentation framework, which consistently outperforms English-pivot strategies on multilingual RAG benchmarks.

FAITH: Factuality Alignment through Integrating Trustworthiness and Honestness

This paper proposes FAITH, a framework that maps LLM uncertainty signals (consistency + semantic entropy) to natural-language descriptions of knowledge state quadrants (trustworthiness × honestness), designs uncertainty-aware fine-grained reward functions for PPO training, and applies a RAG module to correct potentially erroneous outputs, systematically improving the factual accuracy of LLMs.

Feedback Adaptation for Retrieval-Augmented Generation

This paper proposes feedback adaptation as a new problem setting for RAG systems—investigating how quickly and effectively corrective feedback propagates to future queries. It defines two evaluation axes, correction latency and post-feedback performance, and introduces PatchRAG as a training-free, inference-time feedback integration approach that achieves immediate correction and strong generalization.

FLARE: Task-Agnostic Embedding Model Evaluation via Normalizing Flows

This paper proposes FLARE, a label-free text embedding model evaluation framework based on normalizing flows. By estimating informational sufficiency directly from log-likelihoods, FLARE avoids the collapse of distance-based density estimation in high-dimensional spaces, achieving a Spearman \(\rho\) of 0.90 against supervised baselines across 11 datasets.

From Relevance to Authority: Authority-aware Generative Retrieval in Web Search Engines

This paper proposes AuthGR, the first framework to systematically integrate document authority into generative retrieval. It combines VLM-based multimodal authority scoring, a three-stage progressive training pipeline (CPT→SFT→GRPO), and a hybrid ensemble deployment pipeline. The approach is validated through large-scale A/B testing on Naver's commercial search engine, demonstrating significant improvements in user engagement.

How Retrieved Context Shapes Internal Representations in RAG

This paper systematically analyzes how retrieved documents influence the internal states of LLMs in RAG from the perspective of hidden representations, identifying five key patterns: random documents induce large representation drift and trigger refusal behavior; relevant documents primarily confirm rather than alter parametric knowledge; a single relevant document can anchor representations in multi-document settings; later layers progressively emphasize parametric knowledge, thereby limiting the influence of retrieved evidence; and LLMs can distinguish random documents in early layers but fail to reliably separate distractor documents from relevant ones even at the final layer.

Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy

HEAVEN proposes a plug-and-play two-stage hybrid-vector framework that accelerates coarse retrieval via Visual Summary Pages (VS-Pages) with a single-vector model and reduces multi-vector reranking computation via POS-based query token filtering. Across four benchmarks, the framework retains 99.87% of the multi-vector Recall@1 while reducing per-query FLOPs by 99.82%.

Is Agentic RAG Worth It? An Experimental Comparison of RAG Approaches

This paper systematically compares Enhanced RAG and Agentic RAG across four dimensions—user intent handling, query rewriting, document refinement, and underlying LLM selection—on four datasets. The results show that each paradigm has distinct advantages: Agentic RAG is more flexible in intent routing and query rewriting, while Enhanced RAG is more effective in document reranking. Notably, Agentic RAG incurs up to 3.3× higher cost.

KoCo-Bench: Can Large Language Models Leverage Domain Knowledge in Software Development?

KoCo-Bench introduces the first code benchmark with an explicit domain knowledge corpus, covering 11 frameworks and 25 projects across 6 emerging domains (RL, Agent, RAG, etc.). It evaluates LLMs' ability to acquire and apply domain knowledge for code generation and knowledge comprehension, revealing that even the strongest coding agent, Claude Code, scores only 34.2%.

MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits

This paper proposes MAB-DQA, a framework that decomposes complex queries into multiple aspect sub-queries, dynamically evaluates the importance of each aspect via a multi-armed bandit mechanism (Thompson Sampling), and redistributes retrieval budgets accordingly, achieving significant improvements in retrieval precision and answer accuracy for multimodal document question answering.
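The bandit mechanism can be sketched with Beta-Bernoulli Thompson Sampling: each aspect sub-query is an arm with a Beta posterior over its usefulness, each round the retrieval budget goes to the aspect with the highest posterior sample, and the posterior is updated with a binary usefulness reward. The reward signal below is a simulated stand-in, not the paper's relevance feedback.

```python
import random

class AspectBandit:
    def __init__(self, aspects):
        # Beta(1, 1) priors: one [successes, failures] pair per aspect
        self.posteriors = {a: [1, 1] for a in aspects}

    def pick_aspect(self) -> str:
        """Thompson Sampling: sample each posterior, take the argmax."""
        samples = {a: random.betavariate(s, f)
                   for a, (s, f) in self.posteriors.items()}
        return max(samples, key=samples.get)

    def update(self, aspect: str, useful: bool):
        self.posteriors[aspect][0 if useful else 1] += 1

random.seed(0)
bandit = AspectBandit(["symptoms", "treatment", "history"])
# Simulate: "treatment" evidence is useful 90% of the time, others 10%.
for _ in range(200):
    a = bandit.pick_aspect()
    bandit.update(a, random.random() < (0.9 if a == "treatment" else 0.1))

# After enough rounds, the sampled budget concentrates on the useful aspect.
print(bandit.posteriors)
```

The appeal of Thompson Sampling here is that it needs no tuned exploration schedule: low-value aspects keep receiving occasional probes, so a query whose important aspect shifts mid-session can still recover budget.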

MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation

This paper proposes MASS-RAG, a training-free multi-agent synthesis RAG framework that employs three specialized filtering agents—Summarizer, Extractor, and Reasoner—to process retrieved documents from complementary perspectives, followed by a Synthesis Agent that integrates multi-perspective evidence or candidate answers, consistently outperforming strong baselines across four benchmarks.

Multi-Faceted Self-Consistent Preference Alignment for Query Rewriting in Conversational Search

This paper proposes MSPA-CQR, which constructs self-consistent preference data across three dimensions—rewriting, retrieval, and response—and trains a query rewriting model via prefix-guided multi-dimensional DPO, achieving significant improvements over existing methods in both in-distribution and out-of-distribution settings.

RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

RACER proposes a training-free speculative decoding method that unifies retrieval-based exact pattern matching with logits-based future prediction. It constructs a Logits Tree via a copy-logit strategy and a Retrieval Tree via an LRU-eviction Aho-Corasick automaton, achieving over 2× inference speedup across multiple benchmarks.
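The retrieval side of this design can be illustrated with a plain n-gram lookup in place of the paper's LRU-evicted Aho-Corasick automaton: index every n-gram of the context, then propose whatever continuation followed the current suffix as draft tokens for the verifier. The n=2 window, greedy continuation choice, and toy context are all assumptions for illustration.

```python
from collections import defaultdict

def build_index(tokens, n=2):
    """Map each n-gram of the context to the tokens observed after it."""
    index = defaultdict(list)
    for i in range(len(tokens) - n):
        index[tuple(tokens[i:i + n])].append(tokens[i + n])
    return index

def draft(index, generated, n=2, k=3):
    """Propose up to k draft tokens by repeatedly matching the last n-gram."""
    out = []
    cur = list(generated)
    for _ in range(k):
        key = tuple(cur[-n:])
        if key not in index:
            break
        nxt = index[key][0]          # greedy: first observed continuation
        out.append(nxt)
        cur.append(nxt)
    return out

ctx = "the cat sat on the mat and the cat sat on the rug".split()
index = build_index(ctx)
print(draft(index, ["the", "cat"]))  # → ['sat', 'on', 'the']
```

Exact pattern matching like this drafts long runs essentially for free when the model is copying from context (common in RAG), which is what the logits-based tree then complements for novel continuations.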

ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval

ReasonEmbed introduces three technical innovations—ReMixer, a non-trivial synthetic data pipeline (82K high-quality samples); Redapter, an adaptive reasoning-intensity-weighted training strategy; and multi-backbone implementation—achieving an nDCG@10 of 38.1 on the BRIGHT benchmark, surpassing all existing text embedding models by approximately 10 points.

Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking

This paper proposes Region-R1, which formulates query-side region cropping in multimodal re-ranking as a decision-making problem. Via reinforcement learning (r-GRPO), the model learns when and how to crop question-relevant regions from the query image, achieving improvements of 20% and 8% in CondRecall@1 on E-VQA and InfoSeek, respectively.

RepoShapley: Shapley-Enhanced Context Filtering for Repository-Level Code Completion

This paper proposes RepoShapley, a coalition-aware context filtering framework based on Shapley values, which estimates the interactive contribution of retrieved code snippets in combination to determine whether each snippet should be retained or discarded, thereby significantly improving repository-level code completion quality.
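The coalition-aware idea can be made concrete with textbook Shapley values: a snippet's value is its average marginal contribution to the utility of the coalition it joins, over all orderings. The utility function below is a toy stand-in (the paper scores actual completion quality), chosen to show an interaction that per-snippet scoring would miss.

```python
import itertools

def shapley_values(snippets, utility):
    """Exact Shapley values by enumerating all permutations (fine for small n)."""
    n = len(snippets)
    values = {s: 0.0 for s in snippets}
    perms = list(itertools.permutations(snippets))
    for perm in perms:
        coalition = frozenset()
        for s in perm:
            values[s] += utility(coalition | {s}) - utility(coalition)
            coalition = coalition | {s}
    return {s: v / len(perms) for s, v in values.items()}

# Toy utility: snippet A only helps when snippet B is also present --
# exactly the kind of interaction independent relevance scores cannot see.
def utility(coalition):
    score = 0.0
    if "B" in coalition:
        score += 1.0
    if "A" in coalition and "B" in coalition:
        score += 1.0
    return score

vals = shapley_values(["A", "B", "C"], utility)
print(vals)  # A earns credit only via its interaction with B; C gets 0
```

Exact enumeration is exponential in the number of snippets, so a practical system would need sampling-based approximation; the sketch above only fixes the quantity being estimated.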

Prune-then-Merge: Towards Efficient Multi-Vector Visual Document Retrieval

This paper proposes Prune-then-Merge, a two-stage training-free multi-vector document compression framework. It first removes low-information patches via adaptive attention-based pruning, then applies hierarchical clustering to merge the remaining high-signal patches. Evaluated across 29 VDR datasets, the framework extends the near-lossless compression range from 50–60% to 60–70% and significantly outperforms single-stage methods at high compression ratios of 80%+.

SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

This paper proposes SlideAgent, a hierarchical agentic framework that constructs structured knowledge representations via three dedicated agents operating at the global, page, and element levels, achieving significant improvements in fine-grained understanding of multi-page visual documents, particularly presentation slides.

Stable-RAG: Mitigating Retrieval-Permutation-Induced Hallucinations in Retrieval-Augmented Generation

This paper identifies a critical yet previously overlooked vulnerability in RAG systems—high sensitivity to the ordering of retrieved documents—and proposes Stable-RAG, which applies spectral clustering over hidden states induced by document permutations to identify dominant reasoning patterns, then uses DPO alignment to redirect hallucinated outputs toward correct answers, achieving simultaneous improvements in accuracy and reasoning consistency across three QA datasets.

TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

This paper introduces TaxPraBen, the first LLM evaluation benchmark targeting Chinese real-world tax practice. It comprises 14 datasets with 7.3K samples spanning three authentic scenarios—tax risk prevention, tax inspection analysis, and tax planning. The paper proposes a scalable evaluation paradigm based on structured parsing, field-aligned extraction, and hybrid numerical–textual matching. Evaluation of 19 LLMs reveals that closed-source models and Chinese-optimized models outperform others, while YaYi2, a tax-domain fine-tuned model, yields only marginal improvements.

To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs

This paper introduces GlobalLies—a multilingual parallel dataset comprising 440 misinformation generation templates and 6,867 entities across 8 languages and 195 countries—and reveals that LLMs exhibit systematic country-level and language-level biases in misinformation propagation: compliance rates are significantly higher for low-HDI countries (statistical correlation \(\rho=-0.355\), \(p=5\times10^{-7}\)), low-resource languages elicit compliance rates more than 30% higher than English, and existing safety classifiers and RAG-based safeguards provide uneven protection globally.

TPA: Next Token Probability Attribution for Detecting Hallucinations in RAG

This paper proposes TPA, a framework that mathematically decomposes the generation probability of each token in an LLM into contributions from seven sources (Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, and Initial Embedding), and combines part-of-speech (POS) tagging for feature aggregation to achieve state-of-the-art hallucination detection in RAG settings.

Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection

This paper proposes FinFRE-RAG, a two-stage framework that serializes high-dimensional tabular transaction data into natural language via importance-guided feature reduction, and combines label-aware retrieval-augmented in-context learning to substantially improve F1/MCC of open-source LLMs on financial fraud detection, narrowing the performance gap with specialized tabular classifiers.
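The serialization step can be sketched as: keep only the top-ranked features of a transaction row and render them as a short natural-language description for the LLM. The feature-importance scores and the sentence template below are illustrative assumptions, not the paper's method.

```python
def serialize_transaction(row: dict, importance: dict, top_k: int = 3) -> str:
    """Render the top_k most important features of a row as natural language."""
    kept = sorted(row, key=lambda f: importance.get(f, 0.0), reverse=True)[:top_k]
    parts = [f"{f.replace('_', ' ')} is {row[f]}" for f in kept]
    return "Transaction where " + ", ".join(parts) + "."

row = {"amount": 9800, "merchant_category": "jewelry",
       "hour": 3, "card_present": False, "zip": "10001"}
importance = {"amount": 0.9, "hour": 0.7, "card_present": 0.6,
              "merchant_category": 0.3, "zip": 0.1}
print(serialize_transaction(row, importance))
# → Transaction where amount is 9800, hour is 3, card present is False.
```

Pruning before serialization matters because high-dimensional rows flattened naively into text both exceed useful context length and bury the few features a fraud signal actually depends on.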

VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

VideoStir proposes a structured and intent-aware RAG framework for long video understanding. By modeling videos as spatio-temporal graphs for multi-hop clip retrieval and training an intent relevance scorer for frame-level filtering, the framework achieves performance comparable to state-of-the-art long video RAG methods without relying on any auxiliary text tools.

Why These Documents? Explainable Generative Retrieval with Hierarchical Category Paths

This paper proposes HyPE, a generative retrieval framework that first generates hierarchical category paths (e.g., "Government >> Government by cities") before decoding document identifiers, providing query-relevant explanations for retrieval results while simultaneously improving retrieval accuracy.