
🔍 Information Retrieval & RAG

💬 ACL2026 · 73 paper notes

📌 Same area in other venues: 📷 CVPR2026 (9) · 🔬 ICLR2026 (81) · 🧪 ICML2026 (26) · 🤖 AAAI2026 (21) · 🧠 NeurIPS2025 (24) · 📹 ICCV2025 (5)

🔥 Top topics: RAG ×25 · Question Answering ×8 · Reasoning ×7 · LLM ×7 · Dialogue ×5

A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

Using a purpose-built diagnostic benchmark for financial documents (single-digit perturbation + text masking), this study shows empirically that aggregating VLM patch tokens into a single vector causes semantically very different pages (e.g., $1.2M vs $7.2M) to collapse into nearly identical vectors with cosine similarity \(> 0.99\). The root cause is "global texture dominance," which neither the tested mitigation strategies nor retrieval-tuned embeddings resolve.
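
The collapse described above can be reproduced in a toy setting. The sketch below is not the paper's benchmark or model: it assumes random patch vectors that share a common "texture" offset, mean pooling as the aggregation, and a single replaced patch standing in for a one-digit change. All names and sizes are illustrative.

```python
import math
import random

def mean_pool(patches):
    """Average a list of equal-length patch vectors into one vector."""
    dim = len(patches[0])
    return [sum(p[i] for p in patches) / len(patches) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

rng = random.Random(0)
dim, n_patches = 32, 1024  # assumed toy sizes

# Every patch is a shared "page texture" vector plus small noise.
texture = [rng.gauss(0, 1) for _ in range(dim)]
page_a = [[t + 0.1 * rng.gauss(0, 1) for t in texture] for _ in range(n_patches)]

# page_b differs from page_a in exactly one patch (think "1" vs "7").
page_b = [list(p) for p in page_a]
page_b[500] = [rng.gauss(0, 1) for _ in range(dim)]

similarity = cosine(mean_pool(page_a), mean_pool(page_b))
print(f"cosine after mean pooling: {similarity:.6f}")
```

Because one patch contributes only 1/1024 of the pooled vector, the shared texture dominates and the two pooled vectors are almost parallel.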

A Survey of Reasoning-Intensive Retrieval: Progress and Challenges

This paper systematically organizes the emerging direction of "Reasoning-Intensive Retrieval (RIR)." It provides the first comprehensive three-part survey—benchmarks, methods, and challenges—following the pipeline of query/index/retriever/reranker/iteration, and points out that current evaluations rely excessively on traditional IR metrics like nDCG.

Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning

ConvAgent is proposed to train conversational search agents to alternate between search and reasoning in multi-turn interactions by decomposing RL rewards into three complementary components: outcome reward, information gain reward, and mixed-initiative behavior reward.

All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG

The study systematically reveals that multilingual RAG systems exhibit severe language bias (preference for English and query languages) during the reranking stage. It proposes the LAURA framework, which aligns the reranker via supervised signals driven by downstream generation quality, effectively mitigating bias and improving generation performance.

An Iterative Utility Judgment Framework Inspired by Philosophical Relevance via LLMs

Inspired by Schutz's philosophical theory of relevance, this paper proposes ITEM, an iterative utility judgment framework. By enabling dynamic interaction and mutual enhancement among three RAG components (relevance ranking, utility judgment, and answer generation), ITEM outperforms baselines in retrieval, utility judgment, and QA tasks.

AuthorityBench: Benchmarking LLM Authority Perception for Reliable Retrieval-Augmented Generation

AuthorityBench constructs the first LLM "authority perception" benchmark using 10K web domains (PageRank ground truth) + 22K entities (Wikipedia cross-lingual sitelink ground truth) + 120 RAG questions. The study finds that ListJudge / PairJudge + PointScore yields the most accurate outputs, adding web text can degrade performance, and utilizing authority signals for RAG filtering improves answer accuracy by up to 14 percentage points.

Bayesian Active Learning with Gaussian Processes Guided by LLM Relevance Scoring

BAGEL is proposed as a Bayesian active learning framework based on Gaussian Processes (GP). By using an exploration-exploitation balance strategy to propagate sparse LLM relevance signals across the global embedding space under a limited LLM budget, it achieves passage retrieval that significantly outperforms traditional LLM reranking methods.

Benchmarking and Enabling Efficient Chinese Medical Retrieval via Asymmetric Encoders

This paper proposes CMedTEB (Chinese Medical Text Embedding Benchmark) and CARE (Asymmetric Retrieval Framework). The former establishes a high-quality Chinese medical retrieval/reranking/STS benchmark through multi-LLM voting and expert validation. The latter utilizes an asymmetric architecture with a lightweight BERT for query encoding and a large LLM for document encoding, achieving LLM-level retrieval precision with BERT-level online latency through a two-stage progressive alignment strategy.

Beyond Black-Box Interventions: Latent Probing for Faithful Retrieval-Augmented Generation

ProbeRAG is proposed to address RAG faithfulness through the model's internal mechanisms by discovering the linear separability of conflicting/aligned knowledge in the LLM's latent space. It employs a three-stage framework: fine-grained knowledge pruning, latent space conflict probing, and conflict-aware attention.

Beyond Chunks and Graphs: Retrieval-Augmented Generation through Triplet-Driven Thinking

T2RAG changes the minimum retrieval unit of RAG from "text chunks/KG nodes" to atomic triplets. Offline, the corpus is extracted into a collection of triplet propositions for indexing. Online, the LLM decomposes the question into searchable triplets with ? placeholders and iteratively retrieves evidence from the triplet library to fill in the blanks; once all placeholders are resolved, it generates the final answer. This achieves an average improvement of up to 11% across six datasets while reducing retrieval costs by up to 45%.
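
A minimal sketch of placeholder filling over a triplet store may make the loop concrete. It assumes exact string matching in place of whatever retrieval T2RAG actually uses, and the store contents, the `?director` / `?city` placeholder names, and the `match` / `resolve` helpers are all hypothetical.

```python
def match(pattern, triplet):
    """Return {placeholder: value} if the triplet fits the pattern, else None."""
    binding = {}
    for p, t in zip(pattern, triplet):
        if p.startswith("?"):
            # A placeholder must bind to the same value everywhere it appears.
            if binding.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return binding

def resolve(patterns, store):
    """Fill placeholders pattern by pattern, carrying bindings forward."""
    bindings = {}
    for pattern in patterns:
        # Substitute placeholders that earlier patterns already resolved.
        grounded = tuple(bindings.get(x, x) for x in pattern)
        for triplet in store:
            found = match(grounded, triplet)
            if found is not None:
                bindings.update(found)
                break
    return bindings

# Toy triplet store (illustrative data only).
store = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
]
# "Where was the director of Inception born?" as two query triplets.
patterns = [
    ("Inception", "directed_by", "?director"),
    ("?director", "born_in", "?city"),
]
print(resolve(patterns, store))
```

The second pattern only becomes searchable after the first one binds `?director`, which is the iterative part of the idea.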

BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning

To address slow inference and information drowning in RAG with 10k+ word contexts, the authors synthesize multi-hop long-context training data via a "short-context seed data → Wikipedia expansion → head-tail iterative pruning" pipeline. They fine-tune a 3B Llama-3.2 as an extractive summarizer (BRIEF-Pro), which at a 32× compression rate outperforms LongLLMLingua at 9× compression across four multi-hop QA datasets. It also enables direct control of summary length through sentence-count instructions.

Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

This paper proposes DGPO: using teacher demonstrations for cold-start KD initialization, followed by applying KL distillation penalties to "error samples" during the PPO stage. This allows 0.5B compact models to acquire Agentic RAG capabilities, increasing the average EM across 7 QA benchmarks from 0.006 to 0.329, with some datasets even surpassing the 3B teacher.

ChatR1: Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering

The authors extend "Search + Reasoning" RL frameworks (e.g., Search-R1 / R1-Searcher) from single-turn QA to multi-turn conversational QA. They propose ChatR1: a framework that jointly optimizes reasoning, searching, and answering end-to-end via PPO. It introduces an "intent-aware reward" using token-F1 between model-generated search queries and human-authored rewrites as a turn-level dense reward. ChatR1 outperforms ChatGPT/Claude using a 3B backbone across five CQA datasets and demonstrates significantly improved out-of-domain transfer capabilities.
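
Token-level F1 between two strings is a standard overlap measure. The sketch below shows one common way to compute it, assuming lowercased whitespace tokenization; the paper's exact tokenization and reward scaling are not given here, and the example queries are made up.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 with multiset overlap (whitespace tokens, lowercased)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Two empty strings agree perfectly; one empty string scores zero.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Model-generated search query vs. a human-authored rewrite (toy strings).
print(token_f1("who founded tesla motors", "who founded tesla"))
```

Here precision is 3/4 and recall is 3/3, so the score is 6/7.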

ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals

This paper proposes ChunQiuTR, the first time-keyed retrieval benchmark based on non-Gregorian calendars, constructed from the Spring and Autumn Annals and its commentary tradition. It introduces the Calendar-Temporal Dual-Encoder (CTD), which achieves time-aware retrieval through Fourier absolute calendar contexts and relative offset biases, significantly outperforming pure semantic baselines.

CiteGuard: Faithful Citation Attribution for LLMs via Retrieval-Augmented Validation

CiteGuard is a retrieval-augmented agent framework that provides a more faithful foundation for scientific citation attribution via expanded retrieval actions (including full-text search and contextual retrieval). It achieves 68.1% accuracy on the CiteME benchmark, a 10 percentage point improvement over baselines and close to human performance (69.2%).

Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

This paper presents the first systematic evaluation of the impact of "code-switched queries" on modern IR systems. The authors propose the manually annotated CSR-L benchmark and an LLM-generated 11-task CS-MTEB suite. They find that even strong 8B multilingual models suffer a drop of 4–13 points in nDCG@10 under query-side code-switching, with rerankers plummeting from 60 to 25. Lexicon-based vocabulary expansion is shown to mitigate the issue but fails to close the gap to monolingual baselines.

CodePromptZip: Code-specific Prompt Compression for Retrieval-Augmented Generation in Coding Tasks with LMs

CodePromptZip is proposed as the first code-oriented prompt compression framework. It constructs training data through type-aware prioritization and trains a small model compressor with a copy mechanism. It achieves performance improvements of 23.4%, 28.7%, and 8.7% over the best baselines across three coding tasks.

Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation

The authors propose IRAP, an Interactive Retrieval-Augmented Preference elicitation method that quantifies natural language software performance requirements into mathematical functions. IRAP achieves up to a 40x performance improvement over 10 state-of-the-art (SOTA) methods across four real-world datasets with only 5 rounds of interaction.

Context Attribution with Multi-Armed Bandit Optimization

This paper proposes CAMAB, which models context attribution in RAG (identifying which context snippets contribute to generated answers) as a Combinatorial Multi-Armed Bandit (CMAB) problem. By using Linear Thompson Sampling to adaptively explore the context subset space, CAMAB reduces the number of model queries by up to 30% compared to SHAP and ContextCite on HotpotQA, CNN/DM, and TyDi QA while matching or exceeding attribution quality.

CORAL: Adaptive Retrieval Loop for Culturally-Aligned Multilingual RAG

CORAL reframes multilingual RAG failures as "retrieval condition misalignment"—not just query reformulation, but the need to dynamically switch the retrieval corpus. Through a closed loop consisting of a planner and a critic agent performing "corpus selection → retrieval → scoring/filtering → sufficiency check → corpus/query adjustment," the method achieves a 3.58pp improvement over the strongest baseline for low-resource languages on two cultural benchmarks and a 3.91pp gain on CLIcK Korean cultural QA.

CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

This paper proposes CounterRefine, a lightweight inference-time repair layer: first, a standard RAG generates a preliminary answer; then, answer-conditioned counterevidence retrieval gathers supporting/refuting evidence; finally, a constrained KEEP/REVISE decision and deterministic verification repair incorrect answers. It improves the accuracy of GPT-5 on SimpleQA from 67.3% to 73.1%.

CRAFT: Training-Free Cascaded Retrieval for Tabular QA

This paper proposes CRAFT, a three-stage cascaded table retrieval framework that requires no dataset-specific training (SPLADE sparse filtering → semantic mini-table ranking → neural re-ranking). By enhancing table representations with Gemini-generated titles and descriptions, it achieves SOTA on NQ-Tables (R@1 49.84), demonstrates strong zero-shot generalization on OTT-QA, and exhibits significant robustness to query rewrites.

Disco-RAG: Discourse-Aware Retrieval-Augmented Generation

The authors propose Disco-RAG, which explicitly injects Rhetorical Structure Theory (RST) into the RAG pipeline. By parsing intra-chunk RST trees (local hierarchy), constructing inter-chunk rhetorical graphs (global coherence), and generating discourse-aware blueprints to guide responses, it achieves training-free SOTA performance on three long-document benchmarks: Loong, ASQA, and SciNews (Loong overall +12.74 LLM Score).

Domain-Specific Data Generation Framework for RAG Adaptation

This paper proposes RAGen, a scalable modular data generation framework that automatically synthesizes domain-specific QAC (Question-Answer-Context) data through document-level concept extraction, multi-chunk evidence assembly, and Bloom's Taxonomy-guided question generation. It supports contrastive finetuning of embedding models and supervised finetuning of LLMs, significantly outperforming AutoRAG and LlamaIndex baselines across three domain datasets.

DQA: Diagnostic Question Answering for IT Support

This paper proposes the DQA framework, which achieves systematic troubleshooting in enterprise IT support by maintaining a persistent diagnostic state and aggregating retrieval evidence at the root-cause level (instead of per-document processing). It improves the success rate from a 41.3% baseline to 78.7% and reduces the average number of turns from 8.4 to 3.9.

End-to-End Optimization of LLM-Driven Multi-Agent Search Systems via Heterogeneous-Group-Based Reinforcement Learning

This paper proposes MHGPO (Multi-Agent Heterogeneous Group Policy Optimization), a critic-free multi-agent RL method. By employing heterogeneous group relative advantage estimation and backward reward propagation, it achieves end-to-end optimization in a three-agent search system (Rewriter→Reranker→Answerer). It captures implicit cross-agent dependencies and cross-trajectory correlations, significantly outperforming MAPPO and GRPO baselines on multi-hop QA benchmarks such as HotpotQA.

Enhancing Factuality through Consensus and Consistency in Summarization Using Minimum Bayes Risk Decoding

This paper proposes ConSUM, which evaluates both the factual consistency of summary candidates with the source document and the consensus among candidates. By combining MBR decoding with factuality metrics such as FENICE/FIZZ for reranking, it enhances the factual reliability of summaries on CNN/DailyMail, XSum, and in human evaluations.
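
Plain MBR decoding picks the candidate that agrees most with the other candidates. The sketch below illustrates only that consensus step, with word-set Jaccard overlap as a stand-in utility. It does not implement FENICE, FIZZ, or the source-consistency term, and the candidate sentences are invented.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap, used here as a stand-in utility function."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def mbr_select(candidates, utility=jaccard):
    """Return the candidate with the highest average utility against the others.

    Expects at least two candidates.
    """
    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)
    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]

candidates = [
    "the court ruled for the plaintiff",
    "the court ruled for the defendant",
    "the court ruled for the plaintiff today",
    "stocks fell sharply",
]
print(mbr_select(candidates))
```

Swapping `utility` for a factuality metric would change which candidate wins while leaving the selection rule unchanged.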

Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization

CW-GRPO redefines process supervision as "advantage redistribution": using an LLM judge to evaluate the retrieval utility and reasoning correctness of each search round, calculating contribution scores to scale outcome-based advantages. This achieves turn-level credit assignment without introducing unstable value functions, outperforming standard GRPO by 5.0% on Qwen3-8B.
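
One way to read "advantage redistribution" is sketched below. The group-relative advantage is the usual GRPO-style standardization of outcome rewards within a sampled group. The per-turn scaling rule (weights proportional to contribution scores and averaging to 1) is an assumption made for illustration, not the paper's formula, and the contribution scores would come from an LLM judge rather than being hard-coded.

```python
import statistics

def group_relative_advantages(rewards):
    """Standardize outcome rewards within one sampled group of trajectories."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / std if std > 0 else 0.0 for r in rewards]

def redistribute(advantage, contributions):
    """Spread one trajectory-level advantage over turns (assumed rule).

    Weights are proportional to contribution scores and average to 1,
    so the per-turn advantages sum to advantage * number_of_turns.
    """
    total = sum(contributions)
    n = len(contributions)
    if total == 0:
        return [advantage] * n
    return [advantage * n * c / total for c in contributions]

rewards = [1.0, 0.0, 0.0, 1.0]                 # outcome rewards for 4 rollouts
advantages = group_relative_advantages(rewards)
turn_scores = [6, 3, 1]                        # hypothetical judge scores, 3 turns
print(redistribute(advantages[0], turn_scores))
```

No value function appears anywhere: credit per turn comes from rescaling the outcome-based advantage.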

Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion

This paper discovers that the "English preference" in multilingual RAG systems is primarily an artifact of structural priors in evaluation benchmarks (gold evidence concentrated in English, cultural priors) rather than an inherent model bias. It proposes DeLP, a debiased language preference metric, to reveal that retrievers actually prefer monolingual alignment. Based on this, the DELTA query enhancement framework is designed, consistently surpassing English-pivoting strategies in multilingual RAG.

eTracer: Towards Traceable Text Generation via Claim-Level Grounding

eTracer decomposes RAG responses into atomic claims and searches for sentence-level evidence (supporting or refuting) within the context. By using a three-step pipeline (decomposition → embedding retrieval → entailment judgment), it outputs a signed score matrix. This allows for precise back-tracing of factual origins and quantitative assessment of response faithfulness in biomedical scenarios.

Feedback Adaptation for Retrieval-Augmented Generation

This paper proposes "Feedback Adaptation" as a new problem setting for RAG systems—investigating how quickly and effectively corrective feedback propagates to future queries. It defines two evaluation axes: correction latency and post-feedback performance, and introduces PatchRAG as a training-free inference-time feedback integration solution to achieve instant correction and strong generalization.

FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking

The Kasisto team developed FinRAG-12B based on Gemma 3 12B-IT using an efficient 143M-token data recipe (LLM-as-Judge filtering + citation labeling + 22% unanswerable samples + two-stage curriculum). Compressed via W4A16 quantization to 8.4 GB for single-GPU deployment, it outperforms GPT-4.1 in both answer quality (JudgeLM 6.21) and citation quality (73.1). Its refusal rate of 12% balances the unsafe 4.3% of the base model and the 20.2% over-refusal of GPT-4.1. Following deployment across 40+ financial institutions, the query resolution rate increased significantly by +7.1pp (\(p<0.001\)), with latency and costs reduced by 3–5\(\times\) and 20–50\(\times\) respectively compared to GPT-4.1.

FLARE: Task-Agnostic Embedding Model Evaluation via Normalizing Flows

The FLARE framework is proposed, utilizing Normalizing Flows for label-free text embedding model evaluation. By directly estimating information sufficiency from log-likelihood, it avoids the collapse of distance-based density estimation in high-dimensional spaces, achieving a Spearman \(\rho\) of 0.90 with supervised benchmarks across 11 datasets.

From Relevance to Authority: Authority-aware Generative Retrieval in Web Search Engines

This paper proposes AuthGR, the first framework to systematically integrate document authority into generative retrieval. It combines VLM-based multimodal authority scoring, a three-stage progressive training pipeline (CPT→SFT→GRPO), and a hybrid ensemble deployment pipeline, and shows significant user engagement improvements in large-scale A/B tests on the Naver commercial search engine.

GIFT: Guided Fine-Tuning and Transfer for Enhancing Instruction-Tuned Language Models

GIFT transforms the instruction-tuned model from a passive merging target into a teacher that provides confidence scores for training tokens. These scores guide the LoRA fine-tuning of the base model, after which the adapter is merged back into the instruction model. This approach consistently outperforms direct fine-tuning and transfer baselines like Shadow-FT on mathematical, medical, and instruction-following tasks.

GLIER: Generative Legal Inference and Evidence Ranking for Legal Case Retrieval

This paper proposes GLIER: a two-stage framework that reframes Legal Case Retrieval (LCR) from "direct text similarity matching" to "jointly generating latent variables of Charge + Constitutive Elements via seq2seq, then fusing them through multi-view (generation confidence + structural matching + lexical BM25) MLP." It outperforms SAILER and KELLER on LeCaRD/LeCaRDv2 and beats strong full-data baselines using only 10% of the training data.

How Large Language Models Balance Internal Knowledge with User and Document Assertions

This paper moves beyond the binary conflict paradigm of "parametric knowledge vs. a single external source" and proposes a ternary source interaction evaluation framework comprising "parametric (P) / user assertion (U) / document assertion (D)." Based on evaluations of 27 LLMs across two datasets, it finds that most models are more credulous of documents than users, post-training further strengthens this preference, and most models are "impressionable" — failing to distinguish whether external information is helpful or harmful.

How Retrieved Context Shapes Internal Representations in RAG

This paper systematically analyzes how retrieved documents in RAG influence the internal states of LLMs from the perspective of hidden representations. It identifies five key patterns: random documents induce large representation drifts and trigger refusal behaviors; relevant documents primarily confirm rather than alter parametric knowledge; a single relevant document can anchor representations in multi-document scenarios; later layers progressively emphasize parametric knowledge, thereby limiting the impact of retrieved evidence; and LLMs can distinguish random documents in early layers but remain unable to reliably differentiate between distractor and relevant documents even in the final layers.

Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy

HEAVEN proposes a plug-and-play two-stage hybrid vector framework. It accelerates single-vector coarse retrieval via Visual Summary Pages (VS-Pages) and reduces multi-vector re-ranking computation through POS-based query token filtering. This approach maintains 99.87% of multi-vector Recall@1 while reducing per-query FLOPs by 99.82% across four benchmarks.

HyperMem: Hypergraph Memory for Long-Term Conversations

HyperMem replaces pairwise edges in traditional RAG with "hyperedges" (edges connecting \(\ge 3\) nodes), organizing long-term conversation memory into a "Topic → Episode → Fact" structure. By combining coarse-to-fine retrieval with hypergraph embedding propagation, it solves retrieval fragmentation caused by multi-episode cross-temporal dependencies, achieving 92.73% LLM-as-judge accuracy on the LoCoMo benchmark (compared to the previous SOTA of 86.49%).

IF-GEO: Conflict-Aware Instruction Fusion for Multi-Query Generative Engine Optimization

This paper treats "optimizing a single document for multiple potential queries simultaneously" as a constrained multi-objective optimization problem and proposes IF-GEO. It follows a "diverge-then-converge" strategy: first, LLMs perform reverse-retrieval of representative queries and generate structured edit requests; then, through Priority × Necessity scoring + Deduplication + Conflict Resolution + Global Revision Blueprint, multiple conflicting edit instructions are fused into one executable revision blueprint. Additionally, WCP/DR/WTR risk-aware stability metrics are introduced. On GEO-Bench, it pushes the objective overall from Auto-GEO's 7.59 up to 11.03, while reducing the worst single-query performance drop from -0.0511 to -0.0090.

Is Agentic RAG Worth It? An Experimental Comparison of RAG Approaches

This study systematically compares Enhanced RAG and Agentic RAG across four dimensions—user intent processing, query rewriting, document refinement, and underlying LLM selection—using four datasets. The findings indicate that both paradigms have distinct advantages: Agentic RAG is more flexible in intent routing and query rewriting, while Enhanced RAG is more effective in document reranking. However, the cost of Agentic RAG is up to 3.3 times higher.

Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation

This paper proposes the LcRL framework, which addresses knowledge bias and knowledge conflict in multilingual RAG through language-coupled GRPO policy optimization and anti-alignment penalty rewards, achieving significant improvements in multilingual QA tasks.

Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation

This paper proposes EviOmni, which learns to extract rational evidence from retrieved documents via a "reason-then-extract" paradigm. By integrating evidence reasoning and evidence extraction into a unified trajectory, the method utilizes knowledge token masking to avoid information leakage. Optimized via GRPO with verifiable rewards, the model achieves higher accuracy than full-text retrieval while maintaining a significant compression ratio (~38x) across five benchmarks.

MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits

The MAB-DQA framework is proposed to decompose complex queries into multiple aspect sub-queries and utilize a Multi-Armed Bandit mechanism (Thompson Sampling) to dynamically evaluate the importance of each aspect and reallocate the retrieval budget. This significantly improves retrieval precision and response accuracy in multimodal document question answering.
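
Thompson Sampling over query aspects can be sketched with a Beta-Bernoulli bandit. In the toy below, each arm is an aspect sub-query and a "success" means a retrieval for that aspect returned useful evidence. The simulated hit rates, the binary reward, and the function name are assumptions for illustration, not details from the paper.

```python
import random

def thompson_allocate(true_hit_rates, budget, seed=0):
    """Spend a retrieval budget across aspects with Beta-Bernoulli Thompson Sampling.

    true_hit_rates simulates how often each aspect yields useful evidence;
    a real system would observe this signal instead of sampling it.
    """
    rng = random.Random(seed)
    n = len(true_hit_rates)
    wins = [1] * n     # Beta alpha parameters (uniform prior)
    losses = [1] * n   # Beta beta parameters (uniform prior)
    pulls = [0] * n
    for _ in range(budget):
        # Sample a plausible hit rate per aspect, then query the best sample.
        samples = [rng.betavariate(w, l) for w, l in zip(wins, losses)]
        arm = max(range(n), key=samples.__getitem__)
        pulls[arm] += 1
        if rng.random() < true_hit_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls

# Three aspects with very different usefulness (simulated).
print(thompson_allocate([0.9, 0.3, 0.1], budget=1000))
```

Over time, most of the budget flows to the aspect whose posterior looks best, while weaker aspects still receive occasional exploratory queries.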

MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation

This paper proposes MASS-RAG, a training-free multi-agent synthesis RAG framework. It processes retrieved documents from complementary perspectives via three specialized filtering agents (Summarizer/Extractor/Reasoner) and integrates multi-view evidence or candidate answers through a Synthesis Agent, consistently outperforming strong baselines across four benchmarks.

More Than Efficiency: Embedding Compression Improves Domain Adaptation in Dense Retrieval

This paper demonstrates that PCA vector compression is not merely for acceleration but also serves as a zero-training domain adaptation method for dense retrievers, where fitting PCA solely with target domain queries improves NDCG@10 across 75.4% of model-dataset combinations.

MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks

MTR-Suite proposes a comprehensive framework spanning benchmark auditing, conversational data synthesis, and retrieval evaluation. It utilizes MTR-Eval to diagnose annotation quality and MTR-Pipeline to generate MTR-Bench—a challenging multi-turn retrieval benchmark—at approximately 1/400 of the cost of manual labor.

Multi-Faceted Self-Consistent Preference Alignment for Query Rewriting in Conversational Search

This paper proposes MSPA-CQR, which constructs self-consistent preference data from three dimensions—rewriting, retrieval, and response—and trains the query rewriting model using prefix-guided multi-dimensional DPO. It significantly outperforms existing methods in both in-distribution and out-of-distribution scenarios.

Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA

This paper introduces MuDABench, shifting multi-document QA from "finding relevant snippets" to "extraction, aggregation, and quantitative analysis over large-scale semi-structured collections." It demonstrates that vanilla RAG struggles even with increased recall, while metadata-aware multi-agent workflows significantly improve results but still trail human experts.

Optimizing User Profiles via Contextual Bandits for Retrieval-Augmented LLM Personalization

The PURPLE framework is proposed, modeling the user profile construction in retrieval-augmented LLM personalization as a contextual bandit problem. It captures dependencies between records via a Plackett-Luce ranking model and directly optimizes retrieval to match generation quality using the LLM's log-likelihood of reference responses as the reward signal.
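
For reference, the standard Plackett-Luce model gives the probability of an ordering \(\pi\) over \(n\) items as a product of sequential softmax choices. The scores \(s_i\) below are generic item scores; how PURPLE parameterizes them is not stated here.

\[
P(\pi \mid s) \;=\; \prod_{i=1}^{n} \frac{\exp\left(s_{\pi_i}\right)}{\sum_{j=i}^{n} \exp\left(s_{\pi_j}\right)}
\]

Each factor is the probability of choosing item \(\pi_i\) next from the items not yet placed, so the ranking is built one position at a time.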

PL-MTEB: Polish Massive Text Embedding Benchmark

PL-MTEB constructs a 30-task evaluation set for Polish text embeddings covering classification, clustering, pair classification, retrieval, and semantic similarity. It systematically evaluates 30 Polish and multilingual embedding models, showing that while large models generally lead, factors such as task type, training data leakage, and model scale significantly impact the conclusions.

Quantifying and Improving the Robustness of Retrieval-Augmented Language Models Against Spurious Features in Grounding Data

This paper proposes the SURE framework to systematically evaluate the sensitivity of RAG generation to semantically irrelevant spurious features (style, source, logic, format, metadata) in retrieved documents and significantly improves RALM robustness using synthetic data generated by SURE through SFT/DPO.

RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora

This paper proposes the RARE framework, which tracks cross-document redundancy by decomposing documents into atomic facts. It introduces CRRF (Criterion-separated Reciprocal Rank Fusion) to stabilize multi-criterion LLM judgments. By constructing the RedQA benchmark on high-redundancy enterprise corpora (Finance, Legal, Patents), the study reveals that for 4-hop high-overlap settings, the PerfRecall@10 of mainstream retrievers plummets from 66.4% to a range of 5.0-27.9%.
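
CRRF builds on Reciprocal Rank Fusion. The sketch below shows only standard RRF, with the conventional constant `k = 60`, applied to rankings that might come from separate judgment criteria. The criterion-separated variant itself is not reproduced, and the document IDs are placeholders.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking exact ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# One ranking per (hypothetical) judgment criterion.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["b", "c", "a"],
]
print(reciprocal_rank_fusion(rankings))  # ['b', 'a', 'c']
```

Because only ranks enter the sum, criteria whose raw scores live on different scales can be combined without calibration.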

ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval

ReasonEmbed introduces three technical innovations—the ReMixer non-trivial synthetic data method (82K high-quality samples), Redapter adaptive reasoning-intensity weighted training, and multi-backbone implementations—achieving an nDCG@10 of 38.1 on the BRIGHT benchmark, significantly outperforming all existing text embedding models by approximately 10 points.

Reliable Evaluation Protocol for Low-Precision Retrieval

This paper reveals that low-precision retrieval systems (e.g., binary or quantized embeddings) generate massive numbers of "spurious ties" during evaluation because of reduced score granularity, leading to highly unstable results. It proposes two complementary strategies, HPS (High-Precision Scoring) and TRM (Tie-Aware Metrics), that make low-precision retrieval evaluation more reliable and consistent.
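
The effect of spurious ties can be seen with a small tie-aware metric. The sketch below computes the expected reciprocal rank of a single relevant document when tied documents are ordered uniformly at random. It is a generic construction, not the paper's TRM definition, and rounding to one decimal is a stand-in for quantization.

```python
def expected_reciprocal_rank(scores, relevant_index):
    """Expected 1/rank of one relevant document under random tie-breaking."""
    target = scores[relevant_index]
    higher = sum(s > target for s in scores)
    tied = sum(s == target for s in scores)  # includes the relevant doc itself
    # The relevant doc lands uniformly on ranks higher+1 .. higher+tied.
    ranks = range(higher + 1, higher + tied + 1)
    return sum(1.0 / r for r in ranks) / tied

full_precision = [0.91, 0.88, 0.87, 0.52]
quantized = [round(s, 1) for s in full_precision]  # -> three-way tie at 0.9

print(expected_reciprocal_rank(full_precision, 0))
print(expected_reciprocal_rank(quantized, 0))
```

At full precision the relevant document is ranked first. After rounding it ties with two others, so any single run may report rank 1, 2, or 3, and the expectation averages over those outcomes.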

Rerank Before You Reason: Analyzing Reranking Tradeoffs through Effective Token Cost in Deep Search Agents

This paper systematically investigates the efficiency-effectiveness trade-offs of listwise reranking in deep search agents. By introducing the Effective Token Cost (ETC) metric, the study finds that moderate-depth reranking is generally more cost-effective than increasing search-time reasoning budgets, achieving comparable or higher end-to-end accuracy with lower token overhead.

Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education

This paper proposes KITE, a RAG tutoring system for course materials oriented towards algorithm tracing and problem-solving. Through intent-aware Socratic feedback and multi-stage retrieval, it demonstrates superior grounding and pedagogical scaffolding effects across automated metrics, simulated students, and expert reviews.

Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method

This paper proposes Adaptive Table Retrieval (ATR), which replaces fixed top-k table retrieval with a query-adaptive threshold. By combining relevance calibration, inter-table semantic grouping, and sliding-window reranking, it simultaneously improves retrieval recall, text-to-SQL execution accuracy, and inference efficiency on Spider, BIRD, and Spider 2.0.

REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning

REZE performs eigenspace decomposition on anchor-positive relation representations during domain embedding pre-finetuning. It uses robust statistics to identify and soft-shrink task-specific shifts, thereby absorbing shared domain knowledge while suppressing representation drift caused by heterogeneous tasks.

RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs in Medicine

RiTeK constructs two large-scale medical Textual Knowledge Graphs (TKG) and corresponding complex reasoning QA datasets, covering 6 topological structures and rich textual descriptions. It evaluates 11 retrieval methods and reveals the severe inadequacies of existing LLM-driven retrieval systems in medical TKG reasoning.

S2G-RAG: Structured Sufficiency and Gap Judging for Iterative Retrieval-Augmented QA

S2G-RAG explicitly models "evidence sufficiency" and "next-step gaps" in iterative RAG as a structured controller named S2G-Judge. Using gap-guided queries and sentence-level evidence extraction to mitigate noise, it improves F1 from 43.3 (SIM-RAG) to 56.5 under the HotpotQA BM25 setting.

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

This paper establishes the first comprehensive MTEB-style text embedding benchmark for Slovak (a low-resource West Slavic language with ~5 million speakers), named SkMTEB (31 datasets, 7 task categories, approximately 4x the depth of existing multilingual coverage). The study evaluates 31 embedding models and utilizes vocabulary trimming + targeted fine-tuning to compress Multilingual E5 into locally deployable Slovak embedding models (45M/365M). These models reduce size by up to 62% while matching the performance of commercial APIs.

Test-Time Training for Zero-Resource Dense Retrieval Reranking

DART adaptively adjusts the scoring function of dense retrievers with a bilinear matrix at inference time. Using retrieval results as pseudo-labels, it reranks without any labeled data and achieves an average improvement of 2.1% NDCG@10 on the BEIR benchmark with latency kept under 10 ms.

Low-Resource Language Dilemma in Multilingual Retrieval: Evidence from Amharic

Using Amharic as a diagnostic case, this paper reveals that powerful multilingual retrieval models fail to transfer effectively to morphologically rich low-resource languages in zero-shot settings, with a 23% relative drop in MRR@10. While language-specific fine-tuning provides 32-60% improvements, it still fails to reach the level of monolingual retrievers, indicating that multilingual retrieval alone is insufficient to guarantee equitable information access for low-resource languages.
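For readers unfamiliar with the metric, here is a minimal sketch of how MRR@10 and a relative drop are typically computed. The helper names and the toy numbers are illustrative assumptions and are not taken from the paper.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Mean reciprocal rank of the first relevant document in the top k.

    A query with no relevant document in its top k contributes 0.
    """
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def relative_drop(baseline, observed):
    """Fraction of the baseline score that was lost."""
    return (baseline - observed) / baseline

# Toy rankings: the first query hits at rank 2, the second misses.
toy_mrr = mrr_at_k([["a", "b"], ["x", "y", "z"]], [{"b"}, {"q"}])
# Toy scores: falling from 0.40 to 0.30 is a 25% relative drop.
toy_drop = relative_drop(0.40, 0.30)
```

A relative drop is measured against the baseline score, so it is not the same as a difference in absolute MRR points.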

UnIte: Uncertainty-based Iterative Document Sampling for Domain Adaptation in Information Retrieval

UnIte shifts the bottleneck of unsupervised domain adaptation for neural retrievers from "generating more pseudo-queries" to "smarter document selection." It first filters low-density noisy documents using aleatoric uncertainty, then iteratively samples high-value documents based on epistemic uncertainty that evolves dynamically with model training. It consistently outperforms DUQGen on large BEIR corpora while using fewer pseudo-queries.

Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning

Verbal-R3 upgrades the traditional reranker from a module that "only provides relevance scores" to a bridging module that "provides scores and generates explanatory Verbal Annotations." This module is then used to train and guide RAG reasoners, simultaneously improving answer accuracy and test-time scaling efficiency in multi-hop question answering.

VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

VideoStir proposes a structured and intent-aware long video RAG framework. It models videos as spatio-temporal graphs for multi-hop clip retrieval and trains an intent relevance scorer for frame-level filtering. It achieves performance comparable to SOTA long video RAG methods without relying on auxiliary text tools.

VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval

This paper proposes Visualize-then-Retrieve (VisRet), a new paradigm that converts text queries into images via T2I generation models and then performs retrieval within the image modality. It achieves an average nDCG@30 improvement of 0.125 (CLIP) and 0.121 (E5-V) across four benchmarks, and increases downstream VQA accuracy by 15.7% on Visual-RAG-ME.

When Does Mixing Help? Analyzing Query Embedding Interpolation in Multilingual Dense Retrieval

This paper employs "embedding-level interpolation" as a controllable proxy to investigate the sensitivity of multilingual dense retrieval to mixed-language queries. By systematically varying the mixing ratio of two parallel queries on mMARCO, the study finds that the optimal mixing ratio outperforms the best monolingual query in 88/105 settings. This gain is highly structured: English plays the role of the "strongest mixing partner" and an "asymmetric hegemon" within the vector space.
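Embedding-level interpolation can be sketched as a convex combination of two query vectors, scored against a document for a sweep of mixing ratios. The 2-D vectors, the cosine scoring, and the function names below are assumptions for illustration. The paper's encoders and evaluation protocol are not reproduced here.

```python
import math

def interpolate(q_a, q_b, alpha):
    """Return alpha * q_a + (1 - alpha) * q_b, elementwise."""
    if len(q_a) != len(q_b):
        raise ValueError("query embeddings must have the same dimension")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(q_a, q_b)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

def sweep(q_a, q_b, doc, steps=10):
    """Score doc against the mixed query for alpha = 0, 1/steps, ..., 1."""
    return [(i / steps, cosine(interpolate(q_a, q_b, i / steps), doc))
            for i in range(steps + 1)]

# Toy embeddings of the "same" query in two languages (invented values).
q_lang_a = [1.0, 0.0]
q_lang_b = [0.0, 1.0]
doc = [1.0, 1.0]

scores = sweep(q_lang_a, q_lang_b, doc)
best_alpha, best_score = max(scores, key=lambda pair: pair[1])
```

In this toy setup the mixed query at `alpha = 0.5` aligns with the document better than either monolingual endpoint, the kind of outcome the study counts when the optimal mix beats the best monolingual query.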

When Retrieval is Ineffective in Biomedical RAG: A Large-Scale Empirical Study

This large-scale empirical study spanning 5 models, 10 datasets, 4 retrieval methods, and 4 retrieval corpora finds that biomedical RAG only provides marginal and unstable improvements of 1-2 points. The true bottleneck is the model's capacity to effectively utilize retrieved evidence rather than the quality of retrieval itself.

Why Mean Pooling Works: Quantifying Second-Order Collapse in Text Embeddings

This paper argues that mean pooling, in theory, discards the second-order structure of token embeddings, and proposes the SOCM metric to quantify this second-order collapse. Experiments show that modern contrastively fine-tuned text encoders produce more concentrated token embeddings, which makes them less prone to collapse than base models, and that lower SOCM correlates with higher MTEB performance.
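A small sketch of what losing second-order structure means: two sets of token embeddings can share a mean while having different covariance, so mean pooling maps them to the same vector. This does not implement SOCM, which the summary does not define. The toy vectors and helper names are assumptions.

```python
def mean_pool(tokens):
    """Average token embeddings into a single vector."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(t[j] for t in tokens) / n for j in range(dim)]

def covariance(tokens):
    """Population covariance matrix of the token embeddings."""
    mu = mean_pool(tokens)
    n = len(tokens)
    dim = len(mu)
    return [[sum((t[i] - mu[i]) * (t[j] - mu[j]) for t in tokens) / n
             for j in range(dim)]
            for i in range(dim)]

# Two toy "sentences": identical tokens versus widely spread tokens.
tight = [[1.0, 1.0], [1.0, 1.0]]
spread = [[2.0, 0.0], [0.0, 2.0]]

same_mean = mean_pool(tight) == mean_pool(spread)
cov_tight = covariance(tight)
cov_spread = covariance(spread)
```

Both sets pool to `[1.0, 1.0]`, yet `cov_tight` is all zeros while `cov_spread` is `[[1.0, -1.0], [-1.0, 1.0]]`. The more concentrated the token embeddings, the less information this pooling step can discard, which is consistent with the summary's observation about contrastively fine-tuned encoders.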

Why These Documents? Explainable Generative Retrieval with Hierarchical Category Paths

HyPE makes generative retrieval explainable: before decoding document identifiers, it first generates a hierarchical category path (e.g., "Government >> Government by cities") that serves as a query-relevant explanation, while also improving retrieval accuracy.