🔍 Information Retrieval & RAG
🔬 ICLR2026 · 33 paper notes
- AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
  - This paper proposes AMemGym — the first long-horizon conversational memory benchmark environment supporting on-policy interactive evaluation. It drives LLM-simulated users via structured data sampling (user profile → state evolution → personalized QA), reveals ranking biases inherent in off-policy evaluation, and systematically diagnoses write/read/utilization failure modes across RAG, long-context, and agent-based memory systems.
- Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
  - This paper proposes ARC-JSD, a method that computes the Jensen-Shannon Divergence (JSD) between response distributions under full context and sentence-ablated context, enabling efficient and accurate RAG context attribution without fine-tuning, gradient computation, or surrogate models. Combined with Logit Lens for mechanistic analysis, ARC-JSD identifies the attention heads and MLP layers responsible for context attribution, and reduces hallucination rates by approximately 39% via a gating mechanism.
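A minimal sketch of the ablate-and-compare idea (my illustration, not the paper's code), assuming a HuggingFace-style causal LM for `model`/`tokenizer` and glossing over tokenizer special-token details:

```python
import torch
import torch.nn.functional as F

def js_divergence(p_logits, q_logits):
    """Jensen-Shannon divergence between two sets of next-token distributions."""
    p, q = F.softmax(p_logits, dim=-1), F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.clamp_min(1e-12).log() - b.clamp_min(1e-12).log())).sum(-1)
    return 0.5 * (kl(p, m) + kl(q, m)).mean().item()

def attribute_context(model, tokenizer, sentences, question, answer):
    """Score each context sentence by the JSD its removal induces on the answer distribution."""
    def answer_logits(ctx):
        ids = tokenizer(" ".join(ctx) + "\n" + question + "\n" + answer,
                        return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0]
        n_ans = len(tokenizer(answer, add_special_tokens=False).input_ids)
        return logits[-n_ans - 1:-1]  # positions that predict the answer tokens

    full = answer_logits(sentences)
    # higher JSD => removing sentence i shifts the response distribution more
    return [js_divergence(full, answer_logits(sentences[:i] + sentences[i + 1:]))
            for i in range(len(sentences))]
```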
- Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
  - This paper reformulates positional encoding as prior distributions within a Bayesian attention mechanism, unifying NoPE (uniform prior) and ALiBi (Laplacian prior), and proposes a Generalized Gaussian prior (GGD-BAM) that achieves perfect passkey retrieval at 500× the training length by adding only 384 parameters.
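A rough sketch of positional encoding as a log-prior bias added to attention logits; treating `alpha`/`beta` as learnable per-head scalars is my assumption, and the exact parameterization in the paper may differ:

```python
import torch

def ggd_position_bias(seq_len, alpha, beta):
    """Additive attention bias = log-density (up to a constant) of a generalized Gaussian
    prior over relative distance |i - j|; beta = 1 gives an ALiBi-like Laplacian decay,
    while a flat prior corresponds to NoPE."""
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs().float()
    return -((dist / alpha) ** beta)  # add to attention logits before softmax

# inside attention (per head): scores = q @ k.transpose(-2, -1) / d ** 0.5 + ggd_position_bias(T, alpha, beta)
```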
- Beyond RAG vs. Long-Context: Learning Distraction-Aware Retrieval for Efficient Knowledge Grounding
  - This paper proposes LDAR (Learning Distraction-Aware Retrieval), a lightweight adaptive retriever that learns to select passages by sampling a continuous quantile band from the query-passage similarity distribution. LDAR surpasses long-context methods while using approximately half the token budget, balancing information coverage against the influence of distracting passages.
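A toy illustration of quantile-band selection; the fixed `q_lo`/`q_hi` below stand in for the continuous quantile parameters that LDAR samples and learns:

```python
import numpy as np

def quantile_band_select(similarities, q_lo, q_hi):
    """Keep passages whose query-passage similarity falls inside the [q_lo, q_hi] quantile band,
    rather than taking a fixed top-k; the band trades coverage against distracting passages."""
    lo, hi = np.quantile(similarities, [q_lo, q_hi])
    return np.where((similarities >= lo) & (similarities <= hi))[0]

# sims = retriever scores for one query's candidate passages
# selected = quantile_band_select(sims, q_lo=0.80, q_hi=0.99)
```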
- BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs
  - This paper proposes BTZSC, a benchmark comprising 22 datasets, which for the first time systematically compares four model families — NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs (38 models in total) — under a unified zero-shot protocol. Qwen3-Reranker-8B achieves a new SOTA with macro F1 = 0.72, while embedding models demonstrate the best accuracy–latency trade-off.
- Digging Deeper: Learning Multi-Level Concept Hierarchies
  - This paper proposes Multi-Level Concept Splitting (MLCS), which extends concept splitting from a single layer to a recursive multi-level process. Using only top-level concept annotations, MLCS automatically discovers concept hierarchy trees of arbitrary depth. The authors further introduce the Deep-HiCEMs architecture to represent and leverage these deep hierarchies, enabling test-time concept interventions at multiple levels of granularity.
- Efficient Discriminative Joint Encoders for Large Scale Vision-Language Re-ranking
  - This paper proposes EDJE (Efficient Discriminative Joint Encoder), which moves visual feature extraction offline and compresses visual tokens via a lightweight attention adapter, achieving a throughput of 50k image-text pairs per second. EDJE matches the retrieval performance of existing joint encoders on Flickr (zero-shot) and COCO (fine-tuned), requiring only 49 kB of storage per image.
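One plausible shape for such a compression adapter, as a hedged sketch rather than the paper's architecture: a small set of learned queries cross-attends to the offline-extracted visual tokens.

```python
import torch
import torch.nn as nn

class VisualTokenCompressor(nn.Module):
    """Learned queries cross-attend to precomputed visual tokens, shrinking them to a few slots."""
    def __init__(self, dim=256, num_queries=8, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens):
        # visual_tokens: (batch, n_tokens, dim), loaded from offline-extracted features
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        compressed, _ = self.attn(q, visual_tokens, visual_tokens)
        return compressed  # (batch, num_queries, dim), joined with text in the joint encoder
```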
- Embedding-Based Context-Aware Reranker
  - This paper proposes EBCAR, a lightweight embedding-space reranking framework that injects structural information via document ID embeddings and passage positional encodings. It employs a hybrid mechanism combining shared full attention and dedicated masked attention to enable cross-passage reasoning. EBCAR achieves state-of-the-art average nDCG@10 on the ConTEB benchmark with only 126M parameters, while delivering inference throughput more than 150× faster than LLM-based rerankers.
- Fine-tuning with RAG for Improving LLM Learning of New Skills
  - This paper proposes transforming RAG from a permanent inference-time dependency into a training-time teacher signal. Hints are extracted from agent failures, used to augment a teacher model that generates higher-quality trajectories, and then removed during distillation into a student model. The student thereby internalizes the retrieval-augmented behavior without requiring runtime RAG, achieving a 91% success rate on ALFWorld (baseline: 79%) and a score of 72 on WebShop (baseline: 61).
- Flow of Spans: Generalizing Language Models to Dynamic Span-Vocabulary via GFlowNets
  - This paper proposes FoSS, the first framework to incorporate GFlowNets into span-level language modeling. By constructing a DAG-structured state space in place of the conventional token-by-token tree structure, FoSS enables more flexible and diverse text generation, achieving up to a 12.5% improvement in MAUVE score.
- FutureMind: Equipping Small Language Models with Strategic Thinking-Pattern Priors via Adaptive Knowledge Distillation
  - This paper proposes FutureMind, a training-free framework that distills structured reasoning and retrieval strategies from LLMs into reusable thinking-pattern priors. Through a four-stage pipeline (question analysis → logical reasoning → strategy planning → retrieval guidance) and three retrieval paradigms, FutureMind enables SLMs to achieve state-of-the-art performance on multi-hop QA benchmarks.
- G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge
  - This paper proposes G-reasoner, which standardizes heterogeneous knowledge sources via a four-layer unified graph interface called QuadGraph, trains a 34M-parameter GNN-based graph foundation model to jointly reason over graph topology and textual semantics, and achieves state-of-the-art performance over existing GraphRAG methods across 6 benchmarks in conjunction with an LLM.
- Hierarchical Concept-based Interpretable Models
  - This paper introduces HiCEMs, hierarchical concept embedding models that automatically discover fine-grained sub-concepts within the embedding space of a pretrained CEM via Concept Splitting, without requiring additional annotations, thereby constructing a hierarchical concept structure that supports test-time concept interventions at multiple granularities to improve task performance.
- HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks
  - This paper proposes HUME, a human evaluation framework that systematically measures human performance on 16 MTEB datasets spanning reranking, classification, clustering, and STS tasks. Humans rank 4th overall (77.6 vs. the best model score of 80.1). The study reveals that cases where models surpass human performance tend to occur on tasks with the lowest human agreement, and evaluates 9 LLMs as potential annotation proxies.
- Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
  - This paper proposes HybridDeepSearcher, which constructs the HDS-QA dataset to train a large reasoning model (LRM) to distinguish parallelizable from sequentially dependent search queries. The approach achieves F1 gains of +15.9 on FanOutQA and +11.5 on a BrowseComp subset, while substantially reducing inference latency and demonstrating consistent test-time search scaling.
- Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement
  - This paper proposes the Judge's Verdict Benchmark — a two-stage evaluation framework based on relevance filtering followed by a Cohen's Kappa human-similarity test — to systematically assess 54 LLM judges. The framework identifies 27 Tier 1 judges (23 human-like and 4 super-consistent). The central finding is that high correlation does not imply high agreement; Kappa combined with z-score is necessary to properly measure LLM judge quality.
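For reference, agreement versus correlation on hypothetical relevance verdicts can be computed like this (toy data, not the benchmark's):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# hypothetical binary relevance verdicts on the same filtered examples
human = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 1])
judge = np.array([1, 1, 1, 1, 0, 1, 1, 1, 0, 1])

kappa = cohen_kappa_score(human, judge)     # chance-corrected agreement
pearson = np.corrcoef(human, judge)[0, 1]   # correlation alone ignores chance agreement
print(f"kappa={kappa:.2f}, pearson={pearson:.2f}")
```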
- Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction
  - This paper proposes MA-PaPSP, a training-free plug-and-play selective prediction framework for arbitrary VLMs. It constructs proxy embeddings via k-NN weighted averaging over an external retrieval dataset (reducing representational variance) and applies contrastive normalization scoring (improving calibration). MA-PaPSP consistently outperforms PaPSP and LLM-as-judge baselines on selective prediction across image captioning, image-text matching, and classification tasks.
- LightRetriever: A LLM-based Text Retrieval Architecture with Extremely Faster Query Inference
  - This paper proposes LightRetriever, an extremely asymmetric LLM-based retrieval architecture: the document side retains a full LLM encoder, while the query side eliminates deep modeling entirely — dense retrieval reduces to embedding lookup and averaging, and sparse retrieval reduces to token counting — achieving a 1000× query encoding speedup and a 10× end-to-end throughput improvement while retaining 95% of retrieval performance.
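A sketch of the asymmetric query side under my reading of the summary: the dense path is a static embedding lookup plus averaging, and the sparse path is plain term counting scored against document term weights produced by the full LLM; the table and weight formats are assumptions.

```python
import numpy as np
from collections import Counter

def dense_query_vector(query_token_ids, token_embedding_table):
    """Dense query encoding with no transformer pass: static embedding lookup plus averaging."""
    return token_embedding_table[query_token_ids].mean(axis=0)

def sparse_query_vector(query_token_ids):
    """Sparse query representation: plain term counts; learned weights live on the document side."""
    return Counter(query_token_ids)

def sparse_score(query_counts, doc_term_weights):
    """Dot product between query term counts and document term weights produced by the full LLM."""
    return sum(c * doc_term_weights.get(tok, 0.0) for tok, c in query_counts.items())
```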
- Mapping Semantic & Syntactic Relationships with Geometric Rotation
  - This paper proposes RISE (Rotor-Invariant Shift Estimation), a method that leverages Clifford algebra rotors to represent utterance-level semantic–syntactic transformations (negation, conditionalization, and politeness) as consistent rotation operations on the unit hypersphere. Through systematic experiments across 7 languages × 3 embedding models × 3 transformation types, the paper demonstrates that these rotations transfer across languages and model architectures (77%–95% retention rate), extending the Linear Representation Hypothesis (LRH) from word-level to cross-lingual utterance-level and generalizing it to geodesic structures on curved manifolds.
- Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis
  - This paper proposes PDS (Prototype-Guided Data Synthesis), the first training-free multimodal dataset distillation framework. It leverages CLIP's aligned embedding space to perform modality-specific clustering, applies the Hungarian algorithm for cross-modal prototype matching, and employs an unCLIP decoder to synthesize distilled images from image prototypes. On a distillation set of as few as 100 pairs, PDS surpasses all optimization-based methods at zero training cost while achieving state-of-the-art cross-architecture generalization.
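The clustering-plus-matching core can be sketched as below (the unCLIP synthesis step is omitted; the choice of k-means, cosine cost, and `k` are mine):

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def match_prototypes(image_embs, text_embs, k):
    """Cluster each modality in CLIP space, then pair prototypes with the Hungarian algorithm."""
    img_protos = KMeans(n_clusters=k, n_init=10).fit(image_embs).cluster_centers_
    txt_protos = KMeans(n_clusters=k, n_init=10).fit(text_embs).cluster_centers_
    a = img_protos / np.linalg.norm(img_protos, axis=1, keepdims=True)
    b = txt_protos / np.linalg.norm(txt_protos, axis=1, keepdims=True)
    rows, cols = linear_sum_assignment(-a @ b.T)  # maximize cosine similarity of matched pairs
    return img_protos[rows], txt_protos[cols]     # image prototypes are later decoded via unCLIP
```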
- On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation
  - This paper proposes HOMER, a framework that constructs a three-role LLM collaboration mechanism (conflicting script extractor + hierarchical imaginator + caption generator) grounded in the GTVH theory of verbal humor. By explicitly modeling script opposition, multi-perspective associative chains, and joke database retrieval to build an imagination tree for creative space expansion, HOMER achieves an average improvement of ~7% over baselines on the New Yorker cartoon benchmark using GPT-4o as the backbone, and significantly outperforms all baselines in human evaluation.
- Q-RAG: Long Context Multi-Step Retrieval via Value-Based Embedder Training
  - Multi-step retrieval is formulated as an MDP and solved via value-based RL (soft Q-learning) to fine-tune the embedder rather than the LLM. The Q-function is designed as the inner product of state and action embeddings—proven to be a universal approximator—and combined with RoPE relative positional encoding to enable temporal reasoning. Training requires only a single A100 GPU for 12 hours; models trained on 4K-token contexts generalize to 1M+ token contexts, achieving near-perfect NIAH performance on the RULER benchmark.
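A minimal sketch of the inner-product Q-function with a one-step soft Q-learning target; the reward signal, episode structure, and state encoding are assumptions, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def soft_q_retrieval_loss(state_emb, action_embs, next_state_emb, next_action_embs,
                          chosen, reward, tau=1.0, gamma=0.99):
    """One-step soft Q-learning loss where Q(s, a) is the state/passage embedding inner product.

    state_emb: (d,) query-plus-history embedding; action_embs: (N, d) candidate passages;
    chosen: index of the retrieved passage; reward: scalar feedback on the final answer."""
    q_sa = action_embs[chosen] @ state_emb                    # Q(s, a)
    with torch.no_grad():
        next_q = next_action_embs @ next_state_emb            # Q(s', ·) over next candidates
        soft_v = tau * torch.logsumexp(next_q / tau, dim=0)   # soft value of the next state
        target = reward + gamma * soft_v
    return F.mse_loss(q_sa, target)
```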
- Query-Level Uncertainty in Large Language Models
  - This paper introduces the concept of Query-Level Uncertainty and proposes an Internal Confidence method that estimates, prior to generation (via a single forward pass), whether an LLM is capable of answering a given query. The approach is training-free and enables efficient adaptive inference strategies including RAG triggering, model cascading, and abstention.
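A hedged sketch of the adaptive-inference gate; the actual Internal Confidence score aggregates richer internal signals, so the entropy-based `internal_confidence_stub` below is only a placeholder, and `retrieve_fn`/`generate_fn` are hypothetical helpers.

```python
import torch

def internal_confidence_stub(model, tokenizer, query):
    """Placeholder confidence from a single forward pass over the query (negative entropy of
    the next-token distribution); the actual method aggregates richer internal signals."""
    ids = tokenizer(query, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    return -float(entropy)  # higher = more confident

def answer_with_adaptive_rag(model, tokenizer, query, retrieve_fn, generate_fn, threshold):
    """Trigger retrieval only when the model looks unsure about the query itself."""
    if internal_confidence_stub(model, tokenizer, query) >= threshold:
        return generate_fn(query)                              # answer from parametric knowledge
    return generate_fn(retrieve_fn(query) + "\n" + query)      # fall back to RAG
```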
- RAEE: A Robust Retrieval-Augmented Early Exit Framework for Efficient Inference
  - This paper proposes RAEE, a retrieval-augmented early exit framework that requires no classifier training. By retrieving exit information from semantically similar samples, RAEE dynamically determines the optimal exit layer, simultaneously accelerating inference and correcting model mispredictions — achieving a dual gain in both efficiency and performance.
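A small sketch of the retrieval-based exit decision, assuming an offline database that records each stored sample's earliest sufficient exit layer (the aggregation rule below is my choice):

```python
import numpy as np

def retrieval_augmented_exit_layer(query_emb, db_embs, db_exit_layers, k=8):
    """Pick an exit layer from the exit layers recorded for the k most similar stored samples."""
    sims = db_embs @ query_emb / (np.linalg.norm(db_embs, axis=1) * np.linalg.norm(query_emb) + 1e-12)
    top = np.argsort(-sims)[:k]
    weights = np.exp(sims[top]) / np.exp(sims[top]).sum()      # similarity-weighted vote
    return int(round(float(weights @ db_exit_layers[top])))    # layer at which to stop the forward pass
```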
- RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
  - This paper introduces RAVENEA, the first benchmark for evaluating multimodal retrieval-augmented cultural understanding. It comprises 1,868 instances and 11,396 human-ranked Wikipedia documents, spanning 11 categories across 8 countries. The benchmark evaluates 7 multimodal retrievers and 17 VLMs, finding that culture-aware RAG yields average improvements of 6% on cVQA and 11% on cIC.
- RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning
  - This paper proposes RefTool, a framework that automatically creates executable Python tools from external reference materials (e.g., textbooks, knowledge snippets), addressing the failure of existing tool creation methods that rely on LLMs' intrinsic knowledge in specialized domains. RefTool achieves an average improvement of 12.3% over prior methods on causal reasoning, physics, and chemistry tasks.
- Retrieval-Augmented Generation for Predicting Cellular Responses to Gene Perturbation
  - This paper proposes PT-RAG (Perturbation-aware Two-stage Retrieval-Augmented Generation), the first application of differentiable retrieval-augmented generation to single-cell gene perturbation response prediction. The framework combines semantic retrieval of candidate perturbations via GenePT embeddings with Gumbel-Softmax-based conditional discrete sampling for cell-type-aware, end-to-end retrieval optimization. PT-RAG surpasses the STATE baseline on the Replogle-Nadig dataset (Pearson 0.633 vs. 0.624), while demonstrating that naïve RAG severely degrades performance (a Pearson of only 0.396), establishing that differentiable, cell-type-aware retrieval is indispensable in this domain.
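A sketch of the differentiable selection step; how the cell-type signal enters the score is my guess, and the Gumbel-Softmax call is the standard PyTorch one rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def differentiable_retrieve(query_emb, candidate_embs, cell_type_emb, tau=0.5):
    """Soft selection of one candidate perturbation, kept differentiable with Gumbel-Softmax."""
    scores = candidate_embs @ query_emb + candidate_embs @ cell_type_emb  # cell-type-aware scoring
    weights = F.gumbel_softmax(scores, tau=tau, hard=True)  # one-hot forward, soft gradients backward
    return weights @ candidate_embs                          # embedding of the selected candidate
```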
- Revela: Dense Retriever Learning via Language Modeling
  - This paper proposes Revela, which integrates retriever learning into language modeling via an in-batch attention mechanism. Next-token prediction (NTP) draws not only on within-sequence context but also on other sequences in the batch, weighted by retriever similarity scores, enabling training of a strong dense retriever without labeled query-document pairs.
- Summaries as Centroids for Interpretable and Scalable Text Clustering
  - This paper proposes k-NLPmeans and k-LLMmeans, which periodically replace numeric centroids with textual summaries (summary-as-centroid) during k-means iterations, achieving interpretable cluster prototypes while preserving the standard k-means objective. The number of LLM calls is independent of dataset size.
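A compact sketch of the summary-as-centroid loop; `embed_fn` and `summarize_fn` are placeholders for an embedding model and an LLM call, and the refresh schedule is my choice:

```python
import numpy as np

def summary_centroid_kmeans(texts, embed_fn, summarize_fn, k, iters=9, refresh_every=3, seed=0):
    """k-means whose centroids are periodically replaced by embeddings of LLM-written cluster
    summaries (summary-as-centroid); LLM calls scale with k and iters, not with len(texts)."""
    rng = np.random.default_rng(seed)
    X = np.stack([embed_fn(t) for t in texts])
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    summaries = [""] * k
    for it in range(iters):
        assign = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            members = np.where(assign == c)[0]
            if len(members) == 0:
                continue
            if (it + 1) % refresh_every == 0:
                summaries[c] = summarize_fn([texts[i] for i in members])  # textual prototype
                centroids[c] = embed_fn(summaries[c])
            else:
                centroids[c] = X[members].mean(axis=0)                    # standard numeric update
    return assign, summaries
```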
- Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding
  - This paper proposes Token-Guard, a token-level hallucination control method based on self-checking decoding, which detects and suppresses hallucinations during decoding via token-level/segment-level scoring in the hidden space and an iterative refinement mechanism, achieving an average F1 improvement of 16.3%.
- TokMem: One-Token Procedural Memory for Large Language Models
  - This paper proposes TokMem, which compiles reusable task procedures into single trainable memory tokens that serve simultaneously as procedure indices and generation control signals, enabling efficient invocation of 1,000+ task procedures without long prompts and supporting catastrophic-forgetting-free continual expansion.
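A minimal sketch of one-token procedural memory as a trainable embedding bank prepended to a frozen LM's input embeddings (my reading of the mechanism; the actual training and indexing scheme may differ):

```python
import torch
import torch.nn as nn

class ProceduralMemoryTokens(nn.Module):
    """A bank of trainable one-token procedure memories prepended to a frozen LM's input embeddings."""
    def __init__(self, num_procedures, hidden_dim):
        super().__init__()
        self.memory = nn.Embedding(num_procedures, hidden_dim)

    def forward(self, procedure_id, prompt_embeds):
        # prompt_embeds: (batch, seq, hidden); prepend one memory token per example,
        # then pass the result to the LM via inputs_embeds
        mem = self.memory(procedure_id).unsqueeze(1)     # (batch, 1, hidden)
        return torch.cat([mem, prompt_embeds], dim=1)
```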
- Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders
  - This paper proposes RAGLens, which leverages sparse autoencoders (SAEs) to disentangle RAG-hallucination-specific features from LLM internal activations, and constructs a lightweight, interpretable hallucination detector via mutual information-based feature selection combined with a Generalized Additive Model (GAM). RAGLens surpasses existing methods across multiple benchmarks and supports token-level interpretable feedback and hallucination mitigation.
- Your Language Model Secretly Contains Personality Subnetworks
  - This paper proposes extracting persona-specific subnetworks from pretrained LLMs via activation-guided pruning, enabling efficient persona switching without any training, and introduces a contrastive pruning strategy to enhance parameter separation between opposing personas.