🔍 Information Retrieval & RAG
🔬 ICLR2026 · 81 paper notes
📌 Same area in other venues: 📷 CVPR2026 (9) · 💬 ACL2026 (73) · 🧪 ICML2026 (26) · 🤖 AAAI2026 (21) · 🧠 NeurIPS2025 (24) · 📹 ICCV2025 (5)
🔥 Top topics: RAG ×19 · Reasoning ×13 · LLM ×8 · Multimodal/VLM ×5 · Question Answering ×4
- A Dense Subset Index for Collective Query Coverage
-
DISCO models "multiple documents collaboratively covering a complex query" as a monotonic submodular coverage objective. Through vector augmentation and random projection, it rewrites the marginal gain of each greedy iteration into an indexable inner product form. This enables a modified multi-vector IVF index to approximate greedy solutions in sublinear time, achieving over \(100\times\) speedup compared to standard greedy algorithms while providing higher coverage than traditional IR indices.
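As a point of reference, the plain greedy baseline that DISCO is said to approximate can be sketched as follows. This is not DISCO's index: representing each document and the query as a set of "aspects" is an assumption made only for illustration, and `greedy_cover` is a hypothetical helper name.

```python
def greedy_cover(doc_aspects, query_aspects, k):
    """Greedily pick up to k documents that maximise covered query aspects.

    doc_aspects: dict mapping a document id to the set of aspects it covers
    (a toy stand-in for a learned coverage function).
    """
    covered = set()
    chosen = []
    for _ in range(k):
        best_doc, best_gain = None, 0
        for doc_id, aspects in doc_aspects.items():
            if doc_id in chosen:
                continue
            # Marginal gain: query aspects this document adds beyond what is covered.
            gain = len((aspects & query_aspects) - covered)
            if gain > best_gain:
                best_doc, best_gain = doc_id, gain
        if best_doc is None:  # no remaining document adds coverage
            break
        chosen.append(best_doc)
        covered |= doc_aspects[best_doc] & query_aspects
    return chosen, covered


docs = {"d1": {"a", "b"}, "d2": {"b", "c"}, "d3": {"c", "d", "e"}}
query = {"a", "b", "c", "d"}
chosen, covered = greedy_cover(docs, query, k=2)
print(chosen, sorted(covered))
```

Each round scans every remaining document, which is the linear cost that an indexable inner-product form of the marginal gain is meant to avoid.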
- AdaCache: Adaptive Caching and Context Augmentation for Efficient LLM Serving
-
AdaCache addresses two types of waste in RAG inference—redundant recomputation of the same text chunks and the uniform provision of top-k contexts regardless of query difficulty. It proposes "Hierarchical Caching + Attention-aware Selective Recomputation" and "Confidence-driven Adaptive Context Augmentation," reducing Time to First Token (TTFT) by 1.4\(\times\) to 5.0\(\times\) compared to state-of-the-art RAG caching systems across six datasets and three models while maintaining generation quality.
- AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
-
This paper introduces AMemGym, the first on-policy interactive evaluation benchmark for long-horizon conversation memory. By utilizing structured data sampling (User Persona → State Evolution → Personalized QA), it drives LLMs to simulate users in role-play scenarios. The study reveals ranking bias issues in traditional off-policy evaluations and systematically diagnoses the "write/read/utilization" three-stage failure modes in RAG, long-context, and Agent memory systems.
- AssoMem: Scalable Memory QA with Multi-Signal Associative Retrieval
-
AssoMem constructs a "clue-utterance" associative memory graph for large-scale personal memory QA and adaptively fuses three signals—relevance, importance, and temporal alignment—using mutual information for ranking. It significantly outperforms SOTA models based solely on semantic similarity in both retrieval and generation across multiple benchmarks.
- Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
-
The authors propose ARC-JSD, a method that achieves efficient and precise RAG context attribution without fine-tuning, gradient computation, or surrogate models by calculating the Jensen-Shannon Divergence (JSD) of response distributions between full and ablated contexts. Combined with Logit Lens for mechanistic analysis, it identifies attention heads and MLP layers responsible for attribution, reducing hallucination rates by approximately 39% through gating operations.
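The core quantity here, the Jensen-Shannon Divergence between a full-context and an ablated-context distribution, can be illustrated with a minimal sketch. The single-distribution simplification, the toy probabilities, and the helper name `rank_contexts` are assumptions for illustration; how ARC-JSD aggregates over response tokens is not specified in this summary.

```python
import math


def kl(p, q):
    """KL divergence KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rank_contexts(full_dist, ablated_dists):
    """Rank context pieces by how much removing each one shifts the output."""
    scores = {name: jsd(full_dist, dist) for name, dist in ablated_dists.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical next-token distributions with and without each sentence.
full = [0.7, 0.2, 0.1]
ablated = {"s1": [0.2, 0.5, 0.3], "s2": [0.68, 0.21, 0.11]}
print(rank_contexts(full, ablated))
```

A larger divergence means the response depends more on the ablated piece, so `s1` ranks first in this toy case.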
- Attribution-Guided Decoding
-
This paper proposes the AGD decoding strategy, which, at each generation step, selects from the high-probability candidate tokens the one with the highest attribution score with respect to a user-specified "Region of Interest" (ROI). This transforms attribution methods from passive analysis tools into active generation steering tools, achieving significant improvements in both instruction following and factuality tasks.
- Automated Formalization via Conceptual Retrieval-Augmented LLMs
-
CRAMF automatically constructs a "concept–definition" knowledge base from Mathlib4, utilizing query augmentation, dual-channel hybrid retrieval, and reranking to provide precise formal definitions for LLM-based autoformalizers. It serves as a plug-and-play module that improves translation accuracy by an average relative gain of 29.9%, reaching up to 62.1%.
- Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
-
The paper reformulates positional encoding as a prior distribution within a Bayesian attention mechanism, unifying NoPE (uniform prior) and ALiBi (Laplace prior). It proposes a Generalized Gaussian Prior (GGD-BAM) that achieves perfect passkey retrieval at 500x training length with an addition of only 384 parameters.
- Beyond RAG vs. Long-Context: Learning Distraction-Aware Retrieval for Efficient Knowledge Grounding
-
This paper proposes LDAR (Learning Distraction-Aware Retrieval), a lightweight adaptive retriever that learns to select a continuous band of passages based on query-passage similarity distributions. By balancing information coverage against the impact of distracting passages, it outperforms long-context methods using approximately half the token budget.
- Beyond Sequential Reranking: Reranker-Guided Search Improves Reasoning Intensive Retrieval
-
This paper replaces the rigid "top-k sequential scan" in the "retrieve-and-rerank" pipeline with a greedy search on a document similarity proximity graph (Reranker-Guided-Search, RGS). By prioritizing documents whose neighbors have already received high scores, RGS improves NDCG@10 by 3.5, 2.9, and 5.1 points on BRIGHT, FollowIR, and M-BEIR benchmarks respectively, under a strict budget of 100 reranker calls per query.
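One way to picture budgeted search over a proximity graph is a best-first expansion that always explores the neighbors of the highest-scored document seen so far. The sketch below is an assumed variant for illustration, not the paper's exact procedure; `score_fn` stands in for a reranker call, and the graph and scores are toy data.

```python
import heapq


def reranker_guided_search(graph, seeds, score_fn, budget):
    """Best-first search over a document graph under a reranker-call budget."""
    scores = {}
    frontier = []  # max-heap emulated with negated scores
    for doc in seeds:
        if len(scores) >= budget:
            break
        if doc not in scores:
            scores[doc] = score_fn(doc)
            heapq.heappush(frontier, (-scores[doc], doc))
    while frontier and len(scores) < budget:
        _, doc = heapq.heappop(frontier)  # best-scored unexpanded document
        for neighbor in graph.get(doc, []):
            if len(scores) >= budget:
                break
            if neighbor not in scores:
                scores[neighbor] = score_fn(neighbor)
                heapq.heappush(frontier, (-scores[neighbor], neighbor))
    return sorted(scores, key=scores.get, reverse=True)


graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
relevance = {"a": 0.5, "b": 0.9, "c": 0.1, "d": 0.95, "e": 0.2}
ranked = reranker_guided_search(graph, ["a"], relevance.get, budget=4)
print(ranked)
```

With a budget of four calls, the search follows the high-scoring `b` to reach `d` and never spends a call on `e`, which sits behind the low-scoring `c`.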
- Beyond Text-Only: Towards Multimodal Table Retrieval in Open-World
-
This paper argues that "serializing tables into text before retrieval" sacrifices structural and multimodal information. It redefines open-domain table retrieval as "multimodal retrieval of table screenshots" and constructs TaR-ViR, the first benchmark for image-based table retrieval. Experiments demonstrate that multimodal retrievers can match or exceed text-based ones in recall while bypassing the error-prone table-to-text conversion process.
- Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
-
During speculative decoding training, only a single greedy draft path is optimized, while during decoding, an entire draft tree is used for re-ranking and verification. This misalignment limits acceleration. This paper proposes Group Tree Optimization (GTO), which uses "draft tree rewards + group-based draft policy training" to directly align with the tree policy used during decoding. GTO achieves an average increase in acceptance length of 7.4% across multiple LLMs, with a relative speedup of 7.7% over EAGLE-3.
- BrowseNet: Graph-Based Associative Memory for Contextual Information Retrieval
-
BrowseNet organizes the corpus into a "graph-of-chunks" using named entities as edges and text chunks as nodes. By decomposing multi-hop questions into directed acyclic query-subgraphs and performing beam-search-like subgraph traversal along the graph to retrieve evidence, it achieves SOTA Exact Match and recall on HotpotQA, 2WikiMQA, and MuSiQue with only a single LLM call.
- BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs
-
The authors propose the BTZSC benchmark (spanning 22 datasets), which for the first time systematically compares four major model families—NLI Cross-Encoders, Embedding Models, Rerankers, and instruction-tuned LLMs (38 models in total)—under a unified zero-shot protocol. The study finds that Qwen3-Reranker-8B achieves a new SOTA with a macro \(F1=0.72\), while embedding models offer the optimal trade-off between precision and latency.
- CFT-RAG: An Entity Tree Based Retrieval Augmented Generation Algorithm With Cuckoo Filter
-
CFT-RAG integrates a Cuckoo Filter into the entity localization stage of Tree-RAG. By combining fingerprints, block linked lists, and temperature-based sorting, it reduces the complexity of "searching entities in a forest" from \(O(n)\) breadth-first search to approximately \(O(1)\). On the DART dataset, it achieves an 800%+ speedup in retrieval compared to naive Tree-RAG, while simultaneously improving generation accuracy.
- ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks
-
ChronoPlay is the first RAG benchmark generation framework for the gaming domain. It utilizes a "dual-source synthesis engine" (official knowledge to ensure factual accuracy + player community templates to ensure question authenticity) for automated data creation. It further implements a "dual dynamic update mechanism" (refreshing knowledge based on version updates and resampling question distributions by detecting interest drift via JS divergence), allowing the benchmark to evolve with game versions and player focus, thereby exposing RAG system performance fluctuations that static benchmarks fail to detect.
- Conformalized Hierarchical Calibration for Uncertainty-Aware Adaptive Hashing
-
To address the persistent issues of pseudo-label noise and blind domain alignment in Unsupervised Domain Adaptive Hashing (UDAH), COLA introduces a "conformal hierarchical calibration" framework. It quantifies sample reliability at the semantic level using the size of conformal prediction sets and predicts the stability of each hash bit at the representation level. By upgrading uncertainty from heuristic thresholds to continuous weights with statistical guarantees, and utilizing a self-regulating closed loop for dynamic multi-objective loss scheduling, COLA achieves new SOTA mAP results on Office-Home, Office-31, and Digits datasets.
- Counterfactual Reasoning for Retrieval-Augmented Generation
-
CF-RAG embeds counterfactual query generation, dialectical evidence retrieval, and parallel evidence arbitration into the RAG inference process. It distinguishes between evidence that truly determines an answer and merely highly correlated distracting evidence by testing whether the evidence supports only the original query and not similar counterfactual ones, significantly enhancing RAG robustness in multi-hop QA, long-tail entity, and noisy retrieval scenarios.
- Deep Global-sense Hard-negative Discriminative Generation Hashing for Cross-modal Retrieval
-
DGHDGH introduces "Hard Negative Generation" (HNG) into cross-modal hashing for the first time. It utilizes a cross-modal structural graph for bidirectional iterative message propagation to perceive global sample correlation. Based on this, it performs channel-wise, difficulty-adaptive anchor-negative interpolation to synthesize hard negatives that are close to the anchor but do not violate other category boundaries, thereby training a more discriminative Hamming co-space.
- DeepRAG: Thinking to Retrieve Step by Step for Large Language Models
-
DeepRAG formalizes the "reasoning while retrieving" process as a Markov Decision Process (MDP). It enables LLMs to autonomously decide whether to "use internal knowledge or perform external retrieval" for each sub-problem during step-by-step problem decomposition. Through a three-step pipeline—binary tree search for data synthesis, imitation learning, and calibration training—DeepRAG achieves a 25.41% relative improvement in answer accuracy across five QA datasets while significantly reducing retrieval frequency.
- Demystifying Deep Search: A Holistic Evaluation with Hint-free Multi-Hop Questions and Factorised Metrics
-
Addressing the two major issues of "reasoning path leakage in questions" and "reliance on a single pass rate" in current deep search evaluations, this paper constructs WebDetective, a hint-free multi-hop QA benchmark (controlled Wikipedia sandbox + full traceability), and a set of factorised metrics that decouple "Search Sufficiency / Knowledge Utilisation / Refusal Behaviour." After evaluating 25 frontier models, it reveals that current systems are proficient at executing given reasoning paths but generally fail to autonomously discover them, showing poor synthesis despite sufficient evidence and an almost complete failure to provide appropriate refusals when evidence is missing.
- Eigen-Agent: Adaptive Multi-Agent Scientific Reasoning with Monitor-Based RAG
-
Eigen-Agent utilizes a triad of "token-level monitored implicit retrieval + anchor-reference hierarchical refinement + quality-aware iteration" to eliminate the "tool tax" where explicit RAG interrupts reasoning and prevents multi-agent systems from averaging strong solutions into weak ones. It achieves a state-of-the-art accuracy of 48.3% on HLE Bio/Chem Gold while reducing token usage by 53.5% and agent steps by 43.7%.
- ELViS: Efficient Visual Similarity from Local Descriptors that Generalizes Across Domains
-
ELViS performs image pair re-ranking in "similarity space" rather than "appearance space": it first refines the similarity matrix of local descriptors using Optimal Transport (OT) with data-dependent dustbin gains, then sums the strongest correspondence of each descriptor as a "vote" weighted by a learnable function to compute image-level similarity. With 1/20th of the parameters and several times the speed, it significantly outperforms Transformer-based re-ranking methods in cross-domain retrieval.
- Embedding-Based Context-Aware Reranker
-
This paper proposes EBCAR, a lightweight reranking framework operating in the embedding space. By introducing structural information through document ID embeddings and passage position encodings, combined with a hybrid mechanism of shared full attention and specialized masked attention for cross-paragraph reasoning, EBCAR achieves the best average nDCG@10 on the ConTEB benchmark with only 126M parameters. Its inference speed is over 150x faster than LLM-based rerankers.
- Expert Heads: Robust Evidence Identification for Large Language Models
-
By analyzing attention distributions under document permutation perturbations, the authors identify a small subset of "Expert Heads" that consistently focus on gold documents regardless of their position. Using these heads' votes as a zero-shot signal for document retrieval and ranking significantly outperforms dense retrievers on HotpotQA, 2Wiki, and MuSiQue.
- Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs
-
An open-source DeepResearch system built using two 4B small models: Fathom-Search-4B handles multi-turn real-time web search and evidence reasoning (stably exceeding 20 tool calls), while Fathom-Synthesizer-4B synthesizes retrieval trajectories into citation-dense research reports. By utilizing the DUETQA dataset, RAPO optimization algorithm, and controllable step-level rewards, the system pushes open-source DeepSearch to levels approaching closed-source systems.
- Flow of Spans: Generalizing Language Models to Dynamic Span-Vocabulary via GFlowNets
-
This paper proposes FoSS, which introduces GFlowNets to span-level language models for the first time. By constructing a DAG-structured state space to replace the traditional token-by-token tree structure, it achieves more flexible and diverse text generation, with MAUVE scores improving by up to 12.5%.
- FrugalRAG: Less is More in RL Finetuning for Multi-hop Question Answering
-
FrugalRAG proposes a two-stage "Explore-then-Frugal" finetuning framework: the first stage uses supervised finetuning (SFT) to transform a small model into an exploratory policy that maximizes evidence recall through multiple retrieval queries; the second stage applies GRPO reinforcement learning (RL) to teach the model to "decide when to stop based on question difficulty." Consequently, on multi-hop QA tasks like HotPotQA, it reduces the number of retrievals by nearly half using only 1,000 training samples while maintaining or even improving answer accuracy.
- Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks
-
The authors construct COMPACTDS, a high-quality datastore with 380B tokens that enables sub-second retrieval using 456GB of memory on a single machine. They demonstrate that a "frustratingly simple" minimal RAG pipeline consistently delivers significant gains (up to 33% relative improvement) on reasoning-intensive benchmarks such as MMLU, MMLU Pro, GPQA, and MATH, rivaling or exceeding Google Search and complex agentic RAG systems.
- G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge
-
G-reasoner is proposed to standardize heterogeneous knowledge sources via a four-layer unified graph interface called QuadGraph. A 34M-parameter GNN Graph Foundation Model (GFM) is trained to jointly reason over graph topology and text semantics. Combined with LLMs, it outperforms state-of-the-art (SOTA) GraphRAG methods across six benchmarks.
- Graph-based Nearest Neighbors with Dynamic Updates via Random Walks
-
This paper proposes a novel random-walk-based theoretical framework for the HNSW graph index and designs the SPatch deletion algorithm, which first connects the neighborhood of a deleted node into a clique and then sparsifies it, achieving an optimal trade-off across recall, query speed, deletion latency, and memory footprint.
- GRO-RAG: Gradient-aware Re-rank Optimization for Multi-source Retrieval-Augmented Generation
-
GRO-RAG proposes a completely training-free multi-source RAG framework: it first greedily selects complementary retrieval sources using a "relevance-redundancy" submodular objective, then lets a frozen LLM re-rank documents via the inner product of document representations and generation loss gradients obtained through a single forward-backward pass, directly aligning "what to retrieve" with "what the generation target actually needs."
- Hierarchical Concept-based Interpretable Models
-
HiCEMs introduces hierarchical concept embedding models. Through the Concept Splitting method, it automatically discovers fine-grained sub-concepts in the embedding space of pre-trained CEMs (without additional annotations) to construct a hierarchical concept structure. This allows the model to perform test-time concept interventions at different levels of granularity to improve task performance.
- Hierarchical Encoding Tree with Modality Mixup for Cross-modal Hashing
-
HINT utilizes structural entropy to compress sparse image-text pairings into a hierarchical "encoding tree," excavating multi-granularity semantic communities. By sampling intra-modal and cross-modal proxy samples from the tree, it progressively aligns the two modalities via an MMD-driven curriculum Mixup, achieving more robust unsupervised cross-modal hashing.
- HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
-
HiPRAG decomposes the reasoning trajectories of agentic RAG into parsable discrete steps, determines online "whether to search" for each decision, and provides a gated hierarchical process reward for RL. This allows the model to improve accuracy while compressing the over-search rate from 27% to 2.3%.
- HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks
-
This paper proposes the HUME human evaluation framework to systematically measure human performance across 16 datasets in MTEB (Reranking/Classification/Clustering/STS). Findings show humans rank 4th overall (77.6 vs. best model 80.1), revealing that "superhuman" model performance often occurs in tasks with the lowest human agreement. Additionally, the study assesses the feasibility of using 9 LLMs as annotation proxies.
- Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
-
HybridDeepSearcher is proposed, which trains a Large Reasoning Model (LRM) using the HDS-QA dataset to distinguish between parallelizable and sequentially dependent search queries. It achieves a +15.9 F1 improvement on FanOutQA and +11.5 on the BrowseComp subset, while significantly reducing inference latency and demonstrating consistent test-time search scaling capabilities.
- Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment
-
Targeting real-world retrieval scenarios where "documents in two languages coexist," this paper reveals that mainstream multilingual embeddings blindly rank irrelevant English documents ahead of relevant documents in the target language. It proposes a new evaluation scenario and the Max@R metric to quantify this bias, while utilizing JSD distribution-level alignment and InfoNCE retrieval losses to significantly improve cross-lingual alignment and flatten performance gaps across languages using only 2.8k samples, without harming monolingual retrieval.
- Interact-RAG: Reason and Interact with the Corpus, Beyond Black-Box Retrieval
-
Addressing the limitation where existing agentic RAG treats retrieval as a "black-box query" and agents can only repeatedly rephrase queries, this paper proposes Interact-RAG. By introducing a "Corpus Interaction Engine," the retrieval process is decomposed into fine-grained action primitives: multi-faceted retrieval, entity anchoring, and context shaping. This is coupled with a "Plan-Reason-Execute" workflow for trajectory synthesis, followed by SFT+RL to train an end-to-end autonomous agent. It achieves an average improvement of 22.5% over the second-best method across six RAG benchmarks.
- KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model
-
This paper transforms a 0.5B Qwen2 decoder into a fully bidirectional encoder, coupled with a "Pre-training → Fine-tuning → Contrastive Distillation" three-stage pipeline, Focal-style reweighting, online hard negative mixing, and high-quality data engineering covering 100+ categories. This allows KaLM-Embedding-V2.5 to achieve SOTA results in the <1B parameter segment on MTEB Chinese and English benchmarks, even competing with models 3–26 times larger.
- Lean Finder: Semantic Search for Mathlib That Understands User Intents
-
Addressing the pain point that mathlib4 retrieval "only aligns with machine-translated informalization but fails to match real mathematician queries," Lean Finder utilizes "reverse-engineered synthetic user queries + multimodal contrastive learning + DPO preference alignment" to train a user-intent-oriented Lean semantic retriever. It achieves a 30%+ improvement over existing engines and GPT-4o on real-world queries.
- Learning Retrieval Models with Sparse Autoencoders
-
By replacing the vocabulary projection head of SPLADE with a pre-trained Sparse Autoencoder (SAE), queries and documents are encoded into sparse vectors within a "latent vocabulary" space. The resulting SPLARE model systematically outperforms vocabulary-based sparse retrieval in multilingual and cross-domain tasks, matching dense SOTA on MMTEB for the first time.
- Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement
-
GIRCSE allows LLMs to autoregressively generate a sequence of "soft tokens" at inference time to iteratively refine sentence embeddings, supervised by step-wise contrastive losses. This marks the first effective utilization of LLM generative capabilities for embedding tasks, surprisingly unlocking "test-time scaling" where generating more tokens leads to higher vector quality.
- Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction
-
The MA-PaPSP framework is proposed to construct proxy embeddings (using k-NN weighted averaging to reduce representation variance) and contrastive normalized scores (to improve calibration) via an external retrieval dataset. This provides reliable "refusal" capabilities for any VLM without training, outperforming PaPSP and LLM-as-judge baselines across selective prediction for image captioning, image-text matching, and classification.
- LightRetriever: A LLM-based Text Retrieval Architecture with Extremely Faster Query Inference
-
LightRetriever is proposed as an extremely asymmetric LLM retrieval architecture: while the document side retains the full LLM encoder, the query side completely removes deep modeling—dense retrieval requires only embedding lookup plus averaging, and sparse retrieval requires only token counting. This achieves a \(1000 \times\) speedup in query encoding and a \(10 \times\) improvement in end-to-end throughput while maintaining \(95\%\) of retrieval performance.
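The query-side operations described above (embedding lookup plus averaging for dense retrieval, token counting for sparse retrieval) can be sketched in a few lines. The toy embedding table and the function names are assumptions for illustration; the actual tokenizer, the embedding table, and how it is learned are not given in this summary.

```python
from collections import Counter


def dense_query(tokens, table):
    """Average the looked-up token embeddings; no deep model on the query side."""
    dim = len(next(iter(table.values())))
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:  # no known tokens: fall back to a zero vector
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sparse_query(tokens):
    """Sparse query representation as raw token counts."""
    return Counter(tokens)


# Hypothetical 2-dimensional embedding table.
table = {"red": [1.0, 0.0], "apple": [0.0, 1.0]}
print(dense_query(["red", "apple"], table))
print(sparse_query(["red", "red", "apple"]))
```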
- LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora
-
LinearRAG identifies that performance bottlenecks in existing GraphRAG methods stem from unstable and expensive relationship extraction. It proposes a "Tri-Graph" that extracts only entities without relations, coupled with a two-stage retrieval process (semantic bridging for entity activation + global importance aggregation for passage retrieval). This approach reduces indexing time by 77% with zero LLM token consumption while outperforming all SOTA models across four benchmarks.
- Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs
-
LiteCoST utilizes strong LLMs to rewrite "long-document QA" into auditable "extract-then-answer" trajectories. These structure-priority behaviors are distilled into 3B/7B small models via a dual-signal SFT→GRPO approach, allowing small models to reach parity with GPT-4o on financial, legal, and scientific long-document QA while reducing latency by 2–4 times.
- Mapping Semantic & Syntactic Relationships with Geometric Rotation
-
This paper proposes RISE (Rotor-Invariant Shift Estimation), a method that utilizes Clifford algebra rotors to represent clausal semantic-syntactic transformations (negation, conditioning, politeness) as consistent rotation operations on a unit hypersphere. Systematic experiments across 7 languages, 3 embedding models, and 3 transformations demonstrate that these rotations are transferable across languages and models (77%-95% maintenance rate). This work marks the first extension of the Linear Representation Hypothesis (LRH) from word-level to cross-lingual clausal level, generalized to geodesic structures on curved manifolds.
- MergePRAG: Orthogonal Merging of Passage-experts for Multi-hop Parametric RAG
-
MergePRAG utilizes a hypernetwork to translate retrieved passages from each hop into "passage expert" parameters. These are incrementally superimposed into critical layers of the LLM via a continual merging mechanism based on Gram–Schmidt orthogonalization, extending parametric RAG (PRAG) from single-hop to multi-hop reasoning for the first time.
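The Gram–Schmidt step behind such a merge can be illustrated on flat vectors: each new update is stripped of its components along previously merged directions before being added. Treating expert parameters as flat lists and the helper names below are assumptions for illustration; the layer-level merging itself is not detailed in this summary.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(new_vec, basis):
    """Remove from new_vec its projections onto mutually orthogonal basis vectors."""
    residual = list(new_vec)
    for b in basis:
        norm_sq = dot(b, b)
        if norm_sq == 0:
            continue  # skip degenerate (all-zero) directions
        coeff = dot(residual, b) / norm_sq
        residual = [r - coeff * x for r, x in zip(residual, b)]
    return residual


def merge_incrementally(updates):
    """Orthogonalize each update against earlier ones, then sum the residuals."""
    basis = []
    merged = [0.0] * len(updates[0])
    for update in updates:
        residual = orthogonalize(update, basis)
        basis.append(residual)
        merged = [m + r for m, r in zip(merged, residual)]
    return merged, basis


merged, basis = merge_incrementally([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
print(basis)
print(merged)
```

Because every stored residual is orthogonal to the earlier ones, a later hop's update cannot overwrite the directions already claimed by earlier hops.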
- MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
-
By appending a small set of learnable Meta Tokens to VLM input sequences and organizing information by granularity into these tokens using Matryoshka-nested multi-vector contrastive training, MetaEmbed allows users to freely choose between 1 to 64 vectors at test-time to trade off retrieval accuracy against indexing and latency overhead. This achieves SOTA performance on MMEB and ViDoRe with compact multi-vector representations and scales stably to 32B parameters.
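The accuracy-versus-cost trade-off can be pictured with a standard late-interaction (MaxSim) score computed over only the first few vectors on each side. This is a sketch under assumptions: it supposes that nested training lets a prefix of the vectors be used on its own, and the vectors and function names are toy values, not taken from the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """Late interaction: sum over query vectors of the best-matching doc vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def flexible_score(query_vecs, doc_vecs, n_query, n_doc):
    """Score using only the first n_query and n_doc vectors (coarse to fine)."""
    return maxsim(query_vecs[:n_query], doc_vecs[:n_doc])


query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[1.0, 0.0], [0.5, 0.5]]
print(flexible_score(query_vecs, doc_vecs, 1, 1))  # cheapest setting
print(flexible_score(query_vecs, doc_vecs, 2, 2))  # more vectors, finer match
```

Using fewer vectors shrinks the index and the number of dot products per document, while using more lets later vectors contribute finer-grained matches.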
- MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector
-
MILCO employs a "multilingual connector + English MLM head" to project text from 39+ languages into a shared English vocabulary sparse space. Combined with "Sparse Alignment Pre-training" to prevent semantic collapse and a "LexEcho dual-view" to recover rare entities lost during translation, a single 560M sparse model outperforms dense, sparse, and multi-vector baselines such as BGE-M3 and Qwen3-Embed in both multilingual and cross-lingual retrieval tasks.
- MLP Memory: A Retriever-Pretrained Memory for Large Language Models
-
The next-token distribution obtained from kNN retrieval over the entire pre-training corpus is distilled into a lightweight, all-MLP module. This allows LLMs to access "retrieval-style knowledge" via a single forward pass during inference, achieving higher QA accuracy and reduced hallucination at 2.5× the speed of RAG.
- MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval
-
MRMR constructs the first multimodal retrieval benchmark targeting expert-level, multidisciplinary, and reasoning-intensive scenarios. It includes 1,435 queries spanning 23 domains, represents both queries and documents as interleaved image-text sequences, and introduces a novel "Contradiction Retrieval" task. Evaluations reveal that current multimodal retrieval models significantly lag behind a naive "text retriever + image captioning" approach in tasks requiring reasoning.
- On the Theoretical Limitations of Embedding-Based Retrieval
-
This paper derives a lower bound theorem for the "dimension required for single-vector embeddings to represent all top-k document combinations" using sphere packing arguments from high-dimensional geometry. It empirically demonstrates through free embedding optimization and a minimalist real-world dataset, LIMIT, that as long as the number of relevant combinations to be represented is sufficient—even for queries as simple as "who likes apples"—dense retrieval models with fixed dimensions are destined to fail. This is a fundamental bottleneck of the single-vector paradigm rather than an issue of data or scale.
- On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation
-
This paper proposes the HOMER framework, which constructs a three-role LLM collaboration mechanism (Conflicting Script Extractor + Hierarchical Imaginator + Caption Generator) based on the GTVH humor theory. It expands the creative space by explicitly modeling script opposition, multi-perspective association chains, and joke database retrieval to construct an imagination tree. With GPT-4o as the backbone, HOMER achieves an average improvement of ~7% on the New Yorker cartoon benchmark, and human evaluation shows it significantly outperforming all baselines.
- OSCAR: Online Soft Compression for RAG
-
OSCAR utilizes a lightweight compressor to perform online, query-aware compression of each retrieved document into a few embedding tokens. This achieves 2–5× end-to-end inference acceleration on 1B–24B LLMs with negligible performance degradation.
- Q-RAG: Long Context Multi‑Step Retrieval via Value‑Based Embedder Training
-
Q-RAG models multi-step retrieval as an MDP, using value-based reinforcement learning to fine-tune only the embedder (leaving the LLM frozen). This allows the retrieval agent to step-by-step pick supporting facts directly within the latent space of chunk embeddings. It achieves SOTA on long-context benchmarks like BabiLong and RULER (up to 10 million tokens) and can be trained using a single A100.
- Query-Aware Flow Diffusion for Graph-Based RAG with Retrieval Guarantees
-
QAFD-RAG introduces "flow diffusion" into graph-based RAG by dynamically re-weighting edges based on query semantics. This ensures information flows only along paths aligned with the query, enabling the training-free extraction of compact, interpretable reasoning subgraphs. It provides the first statistical guarantee for "recalling relevant subgraphs with high probability" and consistently outperforms baselines like GraphRAG and LightRAG in QA and Text-to-SQL tasks.
- Query-Level Uncertainty in Large Language Models
-
The authors propose the concept of Query-Level Uncertainty and introduce the Internal Confidence (IC) method to estimate whether an LLM can answer a given query before generation (via a single forward pass). This enables efficient, training-free adaptive inference (RAG triggering, model cascading, and abstention).
- RAEE: A Robust Retrieval-Augmented Early Exit Framework for Efficient Inference
-
This paper proposes RAEE, a training-free retrieval-augmented early exit framework. By retrieving exit information from semantically similar samples to dynamically determine the optimal exit layer, it not only accelerates inference but also corrects model mispredictions, improving both speed and performance.
- RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning
-
The RefTool framework is proposed to automatically create executable Python tools based on external reference materials (textbooks, knowledge snippets), addressing the failure of existing tool creation methods that rely on LLM internal knowledge in specialized domains. It outperforms existing methods by an average of 12.3% on causal reasoning, physics, and chemistry tasks.
- Rethinking Reasoning in Document Ranking: Why Chain-of-Thought Falls Short
-
This paper presents the first systematic and fair controlled experiment showing that explicit Chain-of-Thought (CoT) reasoning does not yield benefits in LLM document reranking tasks. Across pointwise and listwise approaches, and under both SFT and RL training, direct rerankers consistently outperform reasoning-based rerankers while requiring significantly less inference computation.
- Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval
-
Retro* reformulates the task of "determining relevance between query and document" as a pointwise reasoning task based on an explicit 0–100 rubric. It uses score integration across multiple samples at test time and a customized SFT + RL training strategy designed for the scoring mechanism. It achieves SOTA on the reasoning-intensive retrieval benchmark BRIGHT while being significantly faster than listwise/setwise methods due to its native pointwise parallelism.
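The score integration step can be illustrated with a small sketch. It assumes each sampled reasoning trace ends in a numeric score on the 0–100 rubric and that integration is a plain mean over samples; the function names `integrate_scores` and `rank_documents` and the averaging rule are illustrative assumptions, not the paper's exact formulation.

```python
from statistics import fmean


def integrate_scores(sampled_scores):
    """Combine several sampled rubric scores into one value.

    Assumption: integration is a plain mean; scores outside 0-100 are dropped.
    """
    valid = [s for s in sampled_scores if 0 <= s <= 100]
    if not valid:
        raise ValueError("no valid rubric scores to integrate")
    return fmean(valid)


def rank_documents(scores_by_doc):
    """Rank documents by integrated score, highest first.

    Each document is scored independently, so the per-document work
    could run in parallel (the pointwise property noted in the summary).
    """
    integrated = {doc: integrate_scores(s) for doc, s in scores_by_doc.items()}
    return sorted(integrated, key=integrated.get, reverse=True)


# Hypothetical sampled scores for three documents (three samples each).
samples = {
    "doc_a": [80, 70, 90],
    "doc_b": [60, 95, 40],
    "doc_c": [85, 85, 85],
}
ranking = rank_documents(samples)
```

Averaging over samples mainly smooths out the variance of a single sampled judgment: `doc_b` receives one very high score but ranks last once all three samples are taken into account.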
- Reusing Pre-training Data at Test Time is a Compute Multiplier
-
The authors reuse the "exact same corpus used for pre-training" for retrieval augmentation at test time. They find this acts as a compute multiplier that yields performance equivalent to ~5x pre-training compute on MMLU, indicating that current pre-training does not fully extract knowledge from the data. By layering test-time compute techniques like self-consistency, re-ranking, and variance reduction, LLaMA 3.1 8B achieves an additional 10-point gain on MMLU.
- Revela: Dense Retriever Learning via Language Modeling
-
The authors propose Revela, which integrates retriever learning into language modeling via an in-batch attention mechanism. In this framework, Next Token Prediction (NTP) depends not only on the intra-sequence context but also on other sequences within the batch (weighted by retriever similarity), enabling the training of powerful dense retrievers without labeled query-document pairs.
- Robust Test-Time Video-Text Retrieval: Benchmarking and Adapting for Query Shifts
-
To address the sharp performance collapse of Video-Text Retrieval (VTR) models under real-world query perturbations, this paper establishes the MLVP benchmark, which contains 12 types of spatio-temporal perturbations across 5 intensity levels. It identifies the root cause as perturbation-amplified "hubness," where a few gallery videos dominate retrieval rankings. The authors then propose HAT-VTR, a test-time adaptation framework that suppresses hubs at the similarity level using Hubness Suppression Memory (HSM) and adapts to video temporal dynamics using multi-granularity losses. It significantly outperforms existing TTA methods in Recall@1 across various query shift scenarios.
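Hubness itself is easy to observe on a toy similarity matrix. The sketch below counts how often each gallery item is ranked first and then applies a crude count-based penalty to the similarity scores. The penalty rule, the `penalty` value, and the function names are assumptions made for illustration; this is not the paper's HSM module.

```python
from collections import Counter


def top1_counts(similarity):
    """similarity[q][g] is the score of gallery item g for query q.

    Returns how many queries rank each gallery item first.
    """
    counts = Counter()
    for row in similarity:
        best = max(range(len(row)), key=row.__getitem__)
        counts[best] += 1
    return counts


def suppress_hubs(similarity, counts, penalty=0.1):
    """Subtract penalty * top-1 count from each gallery column.

    Assumed, simplistic rule: items that win often are pushed down.
    """
    return [
        [score - penalty * counts.get(g, 0) for g, score in enumerate(row)]
        for row in similarity
    ]


# Toy matrix: gallery item 0 is a hub that wins every query.
sim = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.80],
    [0.70, 0.65, 0.30],
    [0.60, 0.10, 0.55],
]
before = top1_counts(sim)
after = top1_counts(suppress_hubs(sim, before))
```

Before suppression, item 0 is the top-1 result for all four queries; after the penalty, the top-1 results are spread over items 1 and 2.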
- Seeing Through Words: Controlling Visual Retrieval Quality with Language Models
-
Addressing the issues of semantic ambiguity and the inability to control image quality in text-to-image retrieval for short queries (e.g., "a dog"), this paper proposes QCQC: it utilizes a generative language model to complete short queries into detailed descriptions. By conditioning on discretized "relevance + aesthetics" quality levels, users can guide retrieval results toward specific quality tiers (low/medium/high). The method is plug-and-play for any frozen VLM.
- SmartChunk Retrieval: Query-Aware Chunk Compression with Planning for Efficient Document RAG
-
SmartChunk Retrieval utilizes a low-latency planner to select the appropriate chunk granularity range for each query and directly generates high-level chunk embeddings using a lightweight compression encoder. This achieves Q&A performance close to or exceeding tree/graph-based RAG in long-document scenarios at a significantly lower cost.
- Summaries as Centroids for Interpretable and Scalable Text Clustering
-
This paper proposes k-NLPmeans and k-LLMmeans, which periodically replace numerical centroids with text summaries (summary-as-centroid) during k-means iterations. This yields interpretable cluster prototypes while preserving the standard k-means objective, and keeps the number of LLM calls independent of the dataset size.
- Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking
-
This paper systematically compares two mainstream routes for training LLM rerankers: Contrastive Learning (CL) and Supervised Fine-Tuning (SFT). By decomposing the gradients into "weight × direction" components, it proves that SFT's superiority stems primarily from the weight term (providing larger update steps for hard samples). Based on this, GMR-3B / GMR-7B are trained using pure SFT, achieving SOTA on the self-constructed 40-dataset MRB benchmark for general multimodal reranking.
- SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
-
The authors construct parallel corpora with identical structures where entities are mapped to real and synthetic names respectively. By comparing task performance across these "parallel worlds," they quantify the LLM's Knowledge Advantage Gap (\(\text{KA}\)) and find that this gap persists even with RAG and CoT enhancements.
- Think Then Embed: Generative Context Improves Multimodal Embedding
-
To address the failure of using Multimodal Large Language Models (MLLMs) directly as encoders under complex instructions, this work proposes the Think-Then-Embed (TTE) framework. It first employs a reasoner to generate an "Embedding-Centric Reasoning" (ECR) trajectory, and then utilizes an embedder to produce vectors conditioned on both the original input and this reasoning trajectory. TTE achieves SOTA on MMEB-V2 (TTE\(_t\)-7B 71.5%), leading other open-source models by an absolute margin of approximately 7%.
- TokMem: One-Token Procedural Memory for Large Language Models
-
TokMem is proposed to compile reusable task procedures into a single trainable memory token. This token serves as both a procedure index and a generation control signal, enabling efficient invocation of 1000+ task procedures without long prompts while supporting continual expansion without forgetting.
- Tools Are Under-Documented: Simple Document Expansion Boosts Tool Retrieval
-
This paper identifies that the primary bottleneck in tool retrieval is the poor quality of existing tool documentation. It proposes a low-cost LLM pipeline to systematically supplement original tool documents into structured profiles containing specific fields (function_description, when_to_use, limitations, tags). The authors construct the TOOL-REX benchmark along with a large-scale training corpus, and train Tool-Embed (dense retriever) and Tool-Rank (reranker), achieving a new SOTA by pushing \(N@10\) to 52.23 and 56.44 on ToolRet and TOOL-REX respectively.
- The Topology of Reasoning: Augmenting Generation with Retrieved Cell Complexes for Text-Graph QA
-
TopoRAG "lifts" text graphs into cell complexes, treating nodes, edges, and cycles as 0/1/2-cells. It employs topology-aware sub-complex retrieval and multi-dimensional message passing to feed high-order dependencies (cycles) into LLMs, consistently outperforming GraphRAG baselines such as G-Retriever and SubgraphRAG across three TGQA datasets.
- Uncertainty-driven Embedding Convolution
-
UEC converts multiple pre-trained text embedding models into Gaussian probabilistic embeddings post-hoc. It then adaptively fuses them using weights estimated from each model's uncertainty for the current query and scores them using a variance-embedded similarity function. It consistently outperforms baselines such as uniform/weighted ensembles and model merging in retrieval, classification, and STS tasks.
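As a rough illustration of uncertainty-weighted fusion, the sketch below combines diagonal Gaussian embeddings with inverse-variance (precision) weights, a standard way to merge independent Gaussian estimates. It assumes all models have already been mapped to a shared dimensionality, and the function name is hypothetical; the weighting scheme and similarity function actually used by UEC may differ.

```python
def fuse_gaussian_embeddings(means, variances):
    """Precision-weighted fusion of diagonal Gaussian embeddings.

    means[i][d] and variances[i][d] describe model i on dimension d.
    Assumption: every model uses the same dimensionality and all
    variances are strictly positive.
    """
    dim = len(means[0])
    fused_mean, fused_var = [], []
    for d in range(dim):
        precisions = [1.0 / var[d] for var in variances]
        total = sum(precisions)
        # Models with lower variance (higher confidence) get more weight.
        weighted = sum(p * m[d] for p, m in zip(precisions, means))
        fused_mean.append(weighted / total)
        fused_var.append(1.0 / total)
    return fused_mean, fused_var


# Two hypothetical models; the second is less certain on dimension 0.
mean, var = fuse_gaussian_embeddings(
    means=[[1.0, 0.0], [3.0, 0.0]],
    variances=[[1.0, 1.0], [3.0, 1.0]],
)
```

On dimension 0 the fused mean lands closer to the first model's value because that model reports the smaller variance, and the fused variance is smaller than either input variance.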
- Welfarist Formulations for Diverse Similarity Search
-
This paper models "attribute diversity in retrieval results" as a welfare function maximization problem from mathematical economics—treating each attribute as an agent and replacing the similarity sum of standard nearest neighbor search with Nash Social Welfare (geometric mean). This achieves a query-adaptive trade-off between "relevance" and "diversity" and provides efficient algorithms with provable approximation guarantees that can be applied on top of any ANN.
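The difference between a similarity sum and Nash Social Welfare can be seen on a toy example. The sketch below assumes each candidate carries a single attribute and that an attribute's utility is the summed similarity of the selected items carrying it (plus a small smoothing term so empty attributes do not zero the product). The greedy loop is a simple heuristic chosen for illustration, not the paper's algorithm, and it carries none of the paper's approximation guarantees.

```python
import math


def nash_welfare(selected, attributes, smoothing=1e-3):
    """Geometric mean over attributes of (summed similarity + smoothing)."""
    totals = {a: 0.0 for a in attributes}
    for sim, attr in selected:
        totals[attr] += sim
    log_sum = sum(math.log(u + smoothing) for u in totals.values())
    return math.exp(log_sum / len(attributes))


def similarity_sum(selected):
    """Objective of standard nearest neighbor search: total similarity."""
    return sum(sim for sim, _ in selected)


def greedy_select(candidates, k, objective):
    """Repeatedly add the candidate that most increases the objective."""
    chosen, remaining = [], list(candidates)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda c: objective(chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical candidates as (similarity to query, attribute) pairs.
candidates = [(0.9, "red"), (0.85, "red"), (0.8, "red"), (0.6, "blue"), (0.5, "green")]
attrs = ["red", "blue", "green"]

by_sum = greedy_select(candidates, 3, similarity_sum)
by_nsw = greedy_select(candidates, 3, lambda s: nash_welfare(s, attrs))
```

Maximizing the sum returns the three most similar items, all red. Maximizing the geometric mean returns one item per attribute, because a second red item adds little to the product while an uncovered attribute keeps it near zero.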
- When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation
-
Addressing the contradiction that GraphRAG often underperforms relative to basic RAG on real-world tasks, this paper proposes GraphRAG-Bench. This benchmark covers the entire pipeline from graph construction to retrieval and generation, featuring tasks across four difficulty levels. It systematically answers "when and why to use graphs": basic RAG suffices for simple fact retrieval, while graph structures provide substantial gains in complex multi-hop reasoning and context summarization tasks requiring the integration of scattered concepts, though at the cost of significantly higher token consumption.
- Your Language Model Secretly Contains Personality Subnetworks
-
This paper proposes extracting personality-specific subnetworks from pre-trained LLMs through activation-guided pruning, enabling efficient personality switching without any training, and introduces a contrastive pruning strategy to enhance parameter separation between opposing personalities.
- Youtu-GraphRAG: Vertically Unified Agents for Graph Retrieval-Augmented Complex Reasoning
-
Youtu-GraphRAG uses a "graph schema" to vertically integrate graph construction and retrieval, which are traditionally isolated. The construction end uses the schema to constrain extraction and perform automatic expansion; the indexing end builds a four-layer knowledge tree via "topology + semantic" dual-perception community detection; the retrieval end uses the same schema to decompose complex questions into atomic sub-queries with iterative reflection. It saves up to 33.60% of tokens and improves accuracy by 16.62% across 6 benchmarks compared to SOTA.
- ZeroGR: A Generalizable and Scalable Framework for Zero-Shot Generative Retrieval
-
ZeroGR utilizes natural language task instructions to generalize Generative Retrieval (GR) from supervised single-task settings to zero-shot heterogeneous retrieval. It unifies arbitrary document formats into keyword-based text DocIDs, builds indexes using an instruction-tuned query generator for pseudo-query generation, and employs "reverse-annealed" decoding to balance precision and recall, achieving new SOTA results for GR on BEIR/MAIR and approaching the performance of dense retrieval.