🔍 Information Retrieval & RAG

🤖 AAAI2026 · 21 paper notes

🔥 Top topics: RAG ×6 · Reasoning ×4 · LLM ×3 · Agents ×3 · Dialogue ×2

"As Eastern Powers, I Will Veto." : An Investigation of Nation-Level Bias of Large Language Models in International Relations

This paper systematically investigates nation-level bias of LLMs in international relations, designing three bias evaluation paradigms (DirectQA, Association Test, Vote Simulation) grounded in real UN Security Council data. It reveals the multi-dimensional nature of such bias—varying across models and evaluation contexts—and proposes a RAG+Reflexion debiasing framework.

Beyond Perplexity: Let the Reader Select Retrieval Summaries via Spectrum Projection Score

This paper proposes Spectrum Projection Score (SPS), a training-free metric that evaluates retrieval summary quality by measuring the alignment between summary token embeddings and the principal subspace of the reader LLM, serving as a replacement for conventional perplexity-based metrics. Combined with the xCompress inference-time controller, SPS achieves substantial improvements over perplexity-based methods across 5 QA datasets (HotpotQA EM +3.6).

Cog-RAG: Cognitive-Inspired Dual-Hypergraph with Theme Alignment Retrieval-Augmented Generation

This paper proposes Cog-RAG, which constructs a dual-hypergraph index comprising a theme hypergraph and an entity hypergraph to simulate the human "top-down" cognitive process via a two-stage retrieval strategy (theme first, then details), achieving global-to-local semantic alignment for generation.

ComLQ: Benchmarking Complex Logical Queries in Information Retrieval

This paper introduces ComLQ, the first IR benchmark targeting complex logical queries spanning 14 query types (conjunction, disjunction, negation, and their combinations). It proposes a subgraph-guided LLM data synthesis pipeline and a negation consistency metric LSNC, revealing that existing retrievers suffer severely in logical reasoning—particularly in negation modeling.

ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning

Inspired by the metacognitive regulation mechanism of the prefrontal cortex, this paper proposes the ComoRAG framework, which achieves stateful multi-step reasoning via a dynamic memory workspace and iterative probe queries, significantly outperforming existing RAG methods on long narrative understanding tasks (200K+ tokens).

ConvMix: A Mixed-Criteria Data Augmentation Framework for Conversational Dense Retrieval

This paper proposes ConvMix, a mixed-criteria data augmentation framework that leverages LLMs to perform scalable relevance annotation augmentation from both query and document directions, combined with clustering-based diversity selection and Fisher information-based in-distribution supervision, to systematically improve conversational dense retrieval performance.

Do Retrieval Augmented Language Models Know When They Don't Know?

This paper systematically analyzes the refusal calibration problem in RAG models, finding that RALMs exhibit an over-refusal rate exceeding 55% when all retrieved documents are irrelevant (even when the model's internal knowledge suffices to answer), and proposes a mechanism combining uncertainty estimation with refusal-aware fine-tuning to balance refusal behavior and answer quality.

Exposing the Cracks: Vulnerabilities of Retrieval-Augmented LLM-Based Machine Translation

This work develops a controlled noise injection framework to systematically evaluate retrieval-augmented machine translation (REAL-MT), introduces two new metrics—Fidelity and CAR—and reveals across 10 language pairs × 4 noise types that models blindly adopt retrieved context even when it is contradictory (CAR remains 65–78%). Large reasoning models (LRMs) are found to be even more vulnerable by "rationalizing" erroneous context, and a fundamental trade-off exists between noise robustness and clean-context utilization.

Magnitude Matters: A Superior Class of Similarity Metrics for Holistic Semantic Understanding

This paper proposes two parameter-free, magnitude-aware vector similarity metrics—Overlap Similarity (OS) and Hyperbolic Tangent Similarity (HTS)—that achieve significantly lower MSE than Cosine Similarity and Dot Product on classification tasks (paraphrase detection, natural language inference) across 4 sentence embedding models and 8 NLP benchmarks, without any additional training overhead.
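The motivation for magnitude-aware metrics can be seen in a small sketch of the two baselines named above: cosine similarity is invariant to vector length, while the dot product grows with it without bound. The snippet does not implement OS or HTS, whose definitions are not given in this summary; the helper names and example vectors are illustrative assumptions.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; depends only on direction, not on length."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot(a, b) / (norm_a * norm_b)

# Illustrative vectors (not from the paper): same direction, 10x the length.
v = [1.0, 2.0, 2.0]
scaled = [10.0 * x for x in v]

print(cosine(v, scaled))  # direction is identical, so magnitude is ignored
print(dot(v, v), dot(v, scaled))  # dot product scales with the magnitude
```

Neither baseline gives a bounded score that also reflects how different the magnitudes are, which is the gap the proposed metrics are described as addressing.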

Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction

This paper proposes H2Memory, a four-layer hierarchical heterogeneous memory structure (Log Graphs / Background Memory / Topic Outlines / Principles), validated on the PAL-Set dataset (100 users × 8.4 months of interaction), improving BLEU-1 on demand paraphrasing and solution recommendation tasks from 13.59 to 26.67.

N2N-GQA: Noise-to-Narrative for Graph-Based Table-Text Question Answering Using LLMs

This paper proposes N2N-GQA, presented as the first zero-shot framework for open-domain hybrid table-text question answering. Its core mechanism transforms noisy retrieved documents into a dynamic evidence graph (documents as nodes, TF-IDF shared-term weights as edges) and applies graph centrality-based pruning to identify "bridging documents" that connect multi-hop reasoning chains. On OTT-QA, it achieves +39.6 EM over Vanilla RAG (8.0 → 48.8), approaching the fine-tuned system CORE (49.0 EM) in a zero-shot setting.
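The evidence-graph idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: whitespace tokenization, the TF-IDF variant, the product-of-weights edge score, and weighted degree as the centrality measure are all choices made here for brevity, and every function name is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Simple TF-IDF weights per document (assumed variant: tf * log(N/df))."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(len(docs) / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def build_evidence_graph(docs: list[str]) -> dict[tuple[int, int], float]:
    """Edge weight = sum over shared terms of the product of TF-IDF weights."""
    vectors = tfidf_vectors(docs)
    edges = {}
    for i, j in combinations(range(len(docs)), 2):
        shared = vectors[i].keys() & vectors[j].keys()
        weight = sum(vectors[i][t] * vectors[j][t] for t in shared)
        if weight > 0:
            edges[(i, j)] = weight
    return edges

def prune_by_centrality(docs: list[str], keep: int) -> list[int]:
    """Keep the `keep` documents with the highest weighted degree (assumed centrality)."""
    degree = [0.0] * len(docs)
    for (i, j), weight in build_evidence_graph(docs).items():
        degree[i] += weight
        degree[j] += weight
    ranked = sorted(range(len(docs)), key=lambda i: degree[i], reverse=True)
    return sorted(ranked[:keep])

# Toy corpus: the middle document shares a term with each neighbor.
docs = ["alpha beta", "beta gamma", "gamma delta"]
print(build_evidence_graph(docs))
print(prune_by_centrality(docs, keep=1))
```

In the toy corpus the middle document is the only one linked to both others, so it has the highest weighted degree and survives pruning, which mirrors the role of a bridging document in a two-hop chain.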

OPERA: A Reinforcement Learning-Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval

This paper proposes OPERA, a hierarchical framework comprising a Goal Planning Module and a Reason-Execute Module, combined with MAPGRPO—a training algorithm specifically designed for multi-agent settings—to substantially improve performance on reasoning-oriented multi-hop retrieval tasks.

PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation

This paper extends the Prediction-Powered Inference (PPI) framework to sub-instance-level ranking metrics (e.g., Precision@K), achieving unbiased ranking metric estimation using only 30–100 human annotations combined with large-scale LLM judgments. The computational complexity is reduced from \(O(2^{|C|})\) to \(O(2^K)\), and the approach has been successfully deployed to guide an LLM-based query rewriting system in an Indian e-commerce search setting.
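For orientation, the basic prediction-powered estimator of a mean can be sketched as below. This is the standard PPI starting point, not the paper's ranking-metric extension: a large pool of LLM judgments supplies the bulk estimate, and a small human-labeled set corrects its bias. The function name and the toy numbers are assumptions for illustration.

```python
from statistics import fmean

def ppi_mean(llm_unlabeled, llm_labeled, human_labeled):
    """Prediction-powered mean: LLM average plus a human-measured bias correction.

    llm_unlabeled: LLM judgments on the large pool without human labels.
    llm_labeled:   LLM judgments on the small human-annotated subset.
    human_labeled: human judgments on that same subset, in the same order.
    """
    if len(llm_labeled) != len(human_labeled):
        raise ValueError("labeled LLM and human judgments must be paired")
    # Rectifier: average amount by which the LLM judge is off on labeled items.
    rectifier = fmean([h - m for h, m in zip(human_labeled, llm_labeled)])
    return fmean(llm_unlabeled) + rectifier

# Toy example (made-up values): the LLM judge is too generous on average.
estimate = ppi_mean(
    llm_unlabeled=[1, 1, 0, 1],
    llm_labeled=[1, 1],
    human_labeled=[1, 0],
)
print(estimate)
```

Here the raw LLM average is 0.75, and the labeled subset shows the judge overestimating by 0.5 on average, so the corrected estimate is lower than the raw one.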

PRIME: Planning and Retrieval-Integrated Memory for Enhanced Reasoning

Inspired by dual-process cognitive theory, PRIME is a multi-agent reasoning framework in which a Quick Thinking Agent (System 1) rapidly generates intuitive answers, a Reflection Agent evaluates their confidence, and—when uncertainty is detected—six specialized System 2 agents (Planning / Search / Reading / Hypothesis / Integration / Decision) are triggered for deep knowledge-retrieval reasoning. The framework enables open-source LLaMA 3 to approach GPT-4o performance on medical and multi-hop QA benchmarks.
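The control flow described above, a fast answer that escalates only when confidence is low, can be sketched as a simple gate. This is a minimal illustration, not PRIME's implementation: the three callables stand in for the Quick Thinking Agent, the Reflection Agent, and the whole System 2 pipeline, and the function name and the 0.7 threshold are hypothetical.

```python
from typing import Callable, Tuple

def answer_with_gating(
    question: str,
    quick_answer: Callable[[str], str],
    estimate_confidence: Callable[[str, str], float],
    deliberate: Callable[[str], str],
    threshold: float = 0.7,  # hypothetical cutoff, not taken from the paper
) -> Tuple[str, str]:
    """Return (answer, route), where route is 'system1' or 'system2'."""
    draft = quick_answer(question)
    # Accept the fast answer only when the reflection step is confident enough.
    if estimate_confidence(question, draft) >= threshold:
        return draft, "system1"
    # Otherwise fall back to the slower retrieval-and-reasoning pipeline.
    return deliberate(question), "system2"
```

A call such as `answer_with_gating(q, fast_llm, reflector, slow_pipeline)` would then spend retrieval effort only on the questions the reflection step flags as uncertain.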

RAGFort: Dual-Path Defense Against Proprietary Knowledge Base Extraction in Retrieval-Augmented Generation

This paper proposes RAGFort, the first systematic dual-path framework for defending against RAG knowledge base extraction attacks. It combines contrastive reindexing (inter-class) to isolate topic boundaries with constrained cascade generation (intra-class) to suppress sensitive content output. RAGFort reduces the knowledge recovery rate to 0.51× that of an unprotected system while preserving answer quality.

REAP: Enhancing RAG with Recursive Evaluation and Adaptive Planning for Multi-Hop Question Answering

This paper proposes REAP, a dual-module iterative framework that addresses multi-hop question answering through recursive collaboration between a Sub-task Planner (SP), which maintains a global perspective to dynamically guide reasoning trajectories, and a Fact Extractor (FE), which extracts structured facts and latent clues from retrieved content. Using Llama-3.1-8B, REAP substantially outperforms all baselines on 4 benchmarks (HotpotQA F1 68.0 vs. runner-up 63.4).

ReFeed: Retrieval Feedback-Guided Dataset Construction for Style-Aware Query Rewriting

This paper proposes a retrieval feedback-driven dataset construction framework that automatically builds high-quality style-aware query rewriting datasets through a closed-loop pipeline of three steps: identifying retrieval failure cases, LLM-based stylistic rewriting, and re-retrieval verification. The resulting dataset provides a data foundation for training retrieval-aligned rewriting models.

RRRA: Resampling and Reranking through a Retriever Adapter

This paper proposes the RRRA framework, which attaches a lightweight learnable adapter to a Bi-Encoder to model the false-negative probability of each candidate document. The adapter is used simultaneously for negative resampling during training and reranking during inference, consistently outperforming strong baselines such as SimANS and TriSampler on NQ, TQ, and MS MARCO.

SR-KI: Scalable and Real-Time Knowledge Integration into LLMs via Supervised Attention

This paper proposes SR-KI, a framework that injects structured knowledge bases into LLM KV caches via a two-stage training procedure (retrieval layer localization + attention supervision loss). On a single A100 40GB GPU, SR-KI supports injection of up to 40K knowledge base entries, achieves a compression ratio of up to 99.75% via top-100 selection, and maintains an average Recall@10 above 88%.
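As a quick consistency check on these figures, assume the compression ratio is the fraction of injected entries discarded by top-100 selection (the summary does not state the definition). Under that assumption the numbers agree:

\[
1 - \frac{100}{40{,}000} = 1 - 0.0025 = 0.9975 = 99.75\%
\]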

Towards Inference-Time Scaling for Continuous Space Reasoning

This work presents the first systematic investigation of whether inference-time scaling techniques from discrete text reasoning can transfer to continuous latent-space reasoning models (COCONUT). It finds that dropout sampling can generate diverse reasoning paths (Pass@32 reaching 44.43%), but PRM/ORM yields less than 2.3% improvement, with the root cause being that continuous thought representations lack the geometric inductive bias needed to distinguish correct from incorrect reasoning.
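For readers unfamiliar with the metric, Pass@k is commonly computed with the unbiased estimator sketched below, where n samples are drawn per problem and c of them are correct. Whether the paper uses this estimator or simply checks if any of 32 samples is correct is not stated in this summary, so treat the snippet as background rather than the authors' evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct.

    n: total samples generated for the problem.
    c: number of those samples that are correct.
    k: sampling budget being evaluated (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # 1 minus the chance that all k drawn samples come from the n - c wrong ones.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

When n equals k, the estimator reduces to a simple check of whether any sample is correct, so the two conventions coincide in that case.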

When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents

By analyzing 10,734 reasoning trajectories, this paper reveals a severe "Right for Wrong Reasons" (RWR) phenomenon in small language models (7–9B): 50–69% of correct answers contain fundamental reasoning flaws. The authors propose the Reasoning Integrity Score (RIS) as a process-level metric, find that RAG effectively improves reasoning quality while metacognitive interventions are harmful, and distill a fast classifier (0.86 F1, 100× speedup) for real-time deployment.