🔍 Information Retrieval & RAG

🧠 NeurIPS 2025 · 30 paper notes

Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering

This paper introduces the MMDocRAG benchmark (4,055 expert-annotated QA pairs) to systematically evaluate 60 VLMs/LLMs and 14 retrievers on quote selection and interleaved text-image answer generation in multimodal document RAG. Results reveal that the strongest model, GPT-4.1, achieves only 70.2% Quote Selection F1, while fine-tuning yields substantial performance gains.

Chain-of-Retrieval Augmented Generation (CoRAG)

This paper proposes CoRAG, a framework that automatically generates intermediate retrieval chains (sub-query → sub-answer) via rejection sampling, fine-tunes an LLM to learn iterative retrieval and reasoning, and supports diverse test-time decoding strategies (greedy / Best-of-N / tree search) for flexible compute scaling. CoRAG improves multi-hop QA by more than 26 EM points and attains state-of-the-art results on 9 of the 10 KILT benchmark tasks.
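
A minimal sketch of the chain idea under stated assumptions: `llm` and `retriever` are hypothetical callables standing in for the fine-tuned model and the retriever, and the accept/reject test is a simple exact match rather than the paper's exact criterion.

```python
# Sketch of CoRAG-style retrieval-chain generation with rejection sampling.
# `llm` and `retriever` are hypothetical stand-ins, not the paper's API.

def generate_chain(question, llm, retriever, max_hops=4):
    """Greedy decoding of one retrieval chain: sub-query -> sub-answer."""
    chain = []
    for _ in range(max_hops):
        sub_query = llm(f"Question: {question}\nChain so far: {chain}\n"
                        "Next sub-query (or 'DONE'):")
        if sub_query.strip() == "DONE":
            break
        docs = retriever(sub_query)
        sub_answer = llm(f"Sub-query: {sub_query}\nDocs: {docs}\nSub-answer:")
        chain.append((sub_query, sub_answer))
    return chain

def rejection_sample_chains(question, gold_answer, llm, retriever, n=8):
    """Keep only chains whose final answer matches the gold answer; the
    accepted chains become supervision for fine-tuning the iterative policy."""
    accepted = []
    for _ in range(n):
        chain = generate_chain(question, llm, retriever)
        final = llm(f"Question: {question}\nChain: {chain}\nFinal answer:")
        if final.strip().lower() == gold_answer.strip().lower():
            accepted.append(chain)
    return accepted
```

Best-of-N decoding at test time then amounts to sampling several chains and keeping the best-scoring one, which is where the compute-scaling knob comes from.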

Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers

CoopRAG is a framework that achieves bidirectional cooperation between the retriever and the LLM through query expansion, retriever layer-contrastive reranking, and reasoning chain completion. It surpasses HippoRAG2 by 5.3% on multi-hop QA and by 35.2% on single-hop QA.

Deep Research Brings Deeper Harm

This paper reveals critical safety vulnerabilities in Deep Research (DR) agents — even when the underlying LLM correctly refuses harmful queries, deploying it as a DR agent can still produce detailed, professional, and dangerous reports. Two targeted jailbreak methods, Plan Injection and Intent Hijack, are proposed alongside the DeepREJECT evaluation metric. Experiments on 6 LLMs demonstrate that DR agents systematically undermine alignment mechanisms.

DICE: Discrete Interpretable Comparative Evaluation with Probabilistic Scoring for RAG

This paper proposes the DICE framework, which achieves interpretable, robust, and efficient evaluation of RAG systems through a two-stage assessment pipeline (evidence-coupled deep analysis + probabilistic {A, B, Tie} scoring) combined with a Swiss-system tournament. On a Chinese financial QA dataset, DICE attains 85.7% agreement with human experts, substantially outperforming RAGAS (45.7%).
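
A toy sketch of the tournament half of the pipeline, assuming a hypothetical `judge(a, b)` that returns probabilities over {A wins, B wins, Tie} from the evidence-coupled analysis stage:

```python
# Swiss-system tournament over probabilistic {A, B, Tie} judgments.
import random

def swiss_tournament(systems, judge, rounds=5):
    scores = {s: 0.0 for s in systems}
    played = set()
    for _ in range(rounds):
        # Swiss pairing: sort by current score, pair adjacent systems.
        standings = sorted(systems, key=lambda s: -scores[s])
        for a, b in zip(standings[::2], standings[1::2]):
            if (a, b) in played:
                continue
            p_a, p_b, p_tie = judge(a, b)  # probabilistic scoring
            scores[a] += p_a + 0.5 * p_tie
            scores[b] += p_b + 0.5 * p_tie
            played.update({(a, b), (b, a)})
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy usage with a random judge standing in for the real assessor:
def judge(a, b):
    p = [random.random() for _ in range(3)]
    return tuple(x / sum(p) for x in p)

print(swiss_tournament([f"rag_{i}" for i in range(6)], judge))
```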

Generalized Contrastive Learning for Universal Multimodal Retrieval

This paper proposes Generalized Contrastive Learning (GCL), which performs contrastive learning over all 6 modality-pair combinations within a mini-batch (image↔text, image↔image+text, text↔image+text). Without constructing new triplet datasets and using only existing image-text pairs, GCL improves VISTA's average retrieval precision on M-BEIR from 21.18 to 34.06 (+60.8%), and on the text→image+text task of MMEB from 10.1% to 31.1%.
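
One reading of "all 6 modality-pair combinations" is the six ordered pairs over {image, text, image+text}; a minimal sketch under that assumption, with random tensors standing in for real encoder outputs:

```python
# In-batch contrastive (InfoNCE) loss averaged over every ordered pair of
# distinct modalities; encoders are replaced by random embeddings here.
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.07):
    """Row i of q matches row i of k; all other rows are negatives."""
    logits = q @ k.t() / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

def gcl_loss(embeds, temperature=0.07):
    """embeds: modality name -> (batch, dim) L2-normalized embeddings."""
    names = list(embeds)
    losses = [info_nce(embeds[a], embeds[b], temperature)
              for a in names for b in names if a != b]  # 6 pairs for 3 modalities
    return torch.stack(losses).mean()

B, D = 8, 32
embeds = {m: F.normalize(torch.randn(B, D), dim=-1)
          for m in ("image", "text", "image+text")}
print(gcl_loss(embeds).item())
```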

Hierarchical Retrieval: The Geometry and a Pretrain-Finetune Recipe

This paper investigates the feasibility of Dual Encoders (DE) for Hierarchical Retrieval (HR), theoretically proving that embedding dimensionality need only grow linearly with hierarchy depth and logarithmically with document count. After identifying the "lost-in-the-long-distance" phenomenon, the paper proposes a pretrain-finetune strategy that improves long-distance recall from 19% to 76% on WordNet.

HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG

By decoupling the filtering capability of a lightweight Flash model from the reasoning capability of a Pro model, the paper constructs a multi-stage pipeline (query optimization → hierarchical filtering → two-pass generation → citation verification) that achieves SOTA performance in the MMU-RAG competition.

How Should We Evaluate Data Deletion in Graph-Based ANN Indexes?

To address the lack of a unified evaluation methodology for data deletion in graph-based ANN indexes, this paper formally defines three baseline approaches—lazy deletion, eager deletion, and reconstruction—proposes a deployment-oriented evaluation framework and metric suite, and introduces the Deletion Control algorithm, which dynamically switches deletion strategies under accuracy constraints based on empirical analysis.
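
A toy adjacency-list index contrasting the lazy and eager baselines; the connectivity-repair rule is illustrative, not the paper's implementation:

```python
# Lazy deletion tombstones a node and defers cost to search/rebuild time;
# eager deletion pays the graph-repair cost immediately.

class ToyGraphIndex:
    def __init__(self):
        self.neighbors = {}   # node id -> set of neighbor ids
        self.deleted = set()  # tombstones used by lazy deletion

    def lazy_delete(self, v):
        """Mark v deleted; it still routes traffic until a rebuild,
        so search must skip tombstoned nodes when collecting results."""
        self.deleted.add(v)

    def eager_delete(self, v):
        """Remove v now and bridge its in-neighbors to its out-neighbors
        so graph connectivity (and thus recall) is preserved."""
        out = self.neighbors.pop(v, set())
        for u, edges in self.neighbors.items():
            if v in edges:
                edges.discard(v)
                edges |= {w for w in out if w != u and w in self.neighbors}
        self.deleted.discard(v)
```

A Deletion-Control-style policy would then monitor accuracy estimates and switch between the two paths (plus periodic reconstruction) as the constraints dictate.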

HyperGraphRAG: Retrieval-Augmented Generation via Hypergraph-Structured Knowledge Representation

This paper proposes HyperGraphRAG, the first RAG method based on hypergraph structure, which models n-ary relations (n ≥ 2) via hyperedges. It overcomes the binary-relation bottleneck of existing graph-based RAG methods, achieving comprehensive improvements over StandardRAG and the GraphRAG family on question-answering tasks across medical, agricultural, computer science, and legal domains.
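
A minimal sketch of the hyperedge representation: each n-ary fact is stored once as a hyperedge over all of its entities instead of being shredded into binary triples. The coverage-count ranking and the example fact are illustrative assumptions.

```python
from collections import defaultdict

class HyperKG:
    def __init__(self):
        self.hyperedges = []                    # (fact text, entity set)
        self.entity_index = defaultdict(list)   # entity -> hyperedge ids

    def add_fact(self, fact_text, entities):
        eid = len(self.hyperedges)
        self.hyperedges.append((fact_text, frozenset(entities)))
        for e in entities:
            self.entity_index[e].append(eid)

    def retrieve(self, query_entities):
        """Rank hyperedges by how many query entities they cover."""
        hits = defaultdict(int)
        for e in query_entities:
            for eid in self.entity_index[e]:
                hits[eid] += 1
        ranked = sorted(hits, key=hits.get, reverse=True)
        return [self.hyperedges[eid][0] for eid in ranked]

kg = HyperKG()
kg.add_fact("Drug X treats disease Y in patients with condition Z",
            {"Drug X", "disease Y", "condition Z"})  # one 3-ary fact
print(kg.retrieve({"Drug X", "condition Z"}))
```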

Improving Consistency in Retrieval-Augmented Systems with Group Similarity Rewards

This paper proposes Con-RAG, a framework that trains RAG generators to produce informationally consistent outputs under paraphrased inputs. Using Paraphrased Set GRPO (PS-GRPO), it computes group similarity rewards across multiple generations of semantically equivalent queries, improving both consistency and accuracy without requiring explicit ground-truth supervision.
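
A minimal sketch of the group-similarity reward, assuming a generic `similarity` function (token overlap below) rather than the paper's scorer; in PS-GRPO these per-output rewards would then be normalized into group-relative advantages:

```python
# Each output is rewarded for agreeing with the other outputs generated
# for paraphrases of the same query, so no gold answers are needed.

def group_similarity_rewards(generations, similarity):
    n = len(generations)
    return [sum(similarity(gi, gj) for j, gj in enumerate(generations)
                if j != i) / (n - 1)
            for i, gi in enumerate(generations)]

def jaccard(a, b):  # toy stand-in similarity
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

outs = ["paris is the capital", "the capital is paris", "london maybe"]
print(group_similarity_rewards(outs, jaccard))  # third output scores lowest
```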

Is PRM Necessary? Problem-Solving RL Implicitly Induces PRM Capability in LLMs

This study systematically demonstrates that pure RL training (without explicit PRM supervision) implicitly induces strong process judgment capability; existing PRMs are even less effective than simple majority voting on strong reasoning models such as DeepSeek-R1 and QwQ-32B. The paper proposes Self-PRM, which allows a model to rerank its outputs using its own internal reward signal, consistently outperforming external PRMs.
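
A minimal sketch of the Self-PRM reranking loop; `generate` and `self_score_step` are hypothetical stand-ins for sampling and the model's own per-step judgment, and min-aggregation over steps is a common PRM convention assumed here:

```python
def self_prm_rerank(problem, generate, self_score_step, n=8):
    """Sample n candidate solutions (each a list of reasoning steps) and
    return the one whose weakest step the model itself rates highest."""
    candidates = [generate(problem) for _ in range(n)]
    def chain_score(steps):
        return min(self_score_step(problem, s) for s in steps)
    return max(candidates, key=chain_score)
```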

Learning Task-Agnostic Representations through Multi-Teacher Distillation

This paper proposes a task-agnostic multi-teacher distillation framework based on mutual information maximization. By estimating the conditional distribution of teacher embeddings via Gaussian kernels, the student model learns high-information-density general-purpose representations without relying on any downstream task labels, achieving state-of-the-art performance among same-scale models across text, vision, and molecular modeling domains.

MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning?

This paper proposes MIR-Bench, the first large-scale and diverse many-shot in-context reasoning benchmark. By automatically generating input-output pairs from programming problems, it evaluates LLMs' pattern recognition capabilities, revealing a performance saturation phenomenon caused by attention diffusion in many-shot settings, and demonstrating that transductive reasoning consistently outperforms inductive reasoning across models.

MITRA: An AI Assistant for Knowledge Retrieval in Physics Collaborations

This paper proposes MITRA, a locally deployed RAG system for large physics experiment collaborations (e.g., CERN CMS), featuring a two-tier vector database architecture (abstract store + full-text store) and a fully on-premise deployment strategy. MITRA substantially outperforms traditional keyword-based search (BM25) on semantic retrieval tasks, improving Precision@1 from 0.13 to 0.75.
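
A minimal sketch of the two-tier lookup; `embed`, `abstract_store`, and `chunk_store` expose a hypothetical search API, not MITRA's actual interface:

```python
def two_tier_search(query, embed, abstract_store, chunk_store,
                    k_docs=20, k_chunks=5):
    q = embed(query)
    # Tier 1: a cheap pass over abstract embeddings narrows the candidates.
    doc_ids = abstract_store.search(q, top_k=k_docs)
    # Tier 2: full-text chunk search restricted to those documents.
    return chunk_store.search(q, top_k=k_chunks, doc_filter=doc_ids)
```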

MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

This paper proposes MuRating, a scalable multilingual data selection framework that aggregates multiple English data quality scorers via pairwise comparisons, transfers quality signals to 17 languages through translation, and trains a language-agnostic multilingual quality assessment model, achieving consistent performance gains in LLM pretraining at both 1.2B and 7B scales.

RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering

This paper introduces RAG-IGBench, a benchmark specifically designed to evaluate the quality of interleaved image-text content generated via retrieval-augmented generation. It proposes novel automatic evaluation metrics spanning three dimensions—text quality, image quality, and image-text consistency—and demonstrates strong correlation with human evaluation.

Reliable Decision Making via Calibration Oriented Retrieval Augmented Generation

This paper proposes CalibRAG, a framework that trains a temperature-conditioned forecasting function to ensure confidence calibration in RAG-assisted decision-making, achieving improvements in both calibration quality and accuracy.
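
For intuition only, a generic temperature-scaling sketch of confidence calibration; CalibRAG's temperature-conditioned forecasting function is richer than this, so treat it as background rather than the paper's method:

```python
# Fit a single temperature T on held-out (confidence logit, correctness)
# pairs so that sigmoid(logit / T) tracks empirical accuracy.
import torch

def fit_temperature(logits, labels, steps=200, lr=0.05):
    T = torch.ones(1, requires_grad=True)
    opt = torch.optim.Adam([T], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        p = torch.sigmoid(logits / T.clamp(min=1e-3))
        loss = torch.nn.functional.binary_cross_entropy(p, labels.float())
        loss.backward()
        opt.step()
    return T.detach()

logits = torch.randn(100) * 3             # toy overconfident logits
labels = (torch.rand(100) < 0.6).long()   # toy correctness indicators
print(fit_temperature(logits, labels))
```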

Retrieval-Augmented Generation for Reliable Interpretation of Radio Regulations

This paper presents a dedicated RAG pipeline for radio regulations, a legally sensitive, high-stakes domain, and introduces the first multiple-choice evaluation benchmark for ITU radio regulations. The proposed system achieves 97% retrieval accuracy and a +11.9% QA accuracy gain over GPT-4o, substantially outperforming naive full-document insertion into the prompt.

Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization

This paper proposes AlignRAG, a framework that reframes RAG as "retrieval-augmented reasoning" and trains a dedicated Critic Language Model (CLM) to iteratively critique and refine the reasoning process at test time, addressing the misalignment between reasoning chains and retrieved evidence. An 8B CLM surpasses a 72B standard CLM on out-of-distribution tasks.

RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition

This paper proposes the Routing-to-RAG (R2RAG) system, which employs an LLM-based query classifier to route simple queries to a single-turn Vanilla RAG pipeline and complex queries to an iterative Vanilla Agent. All components are built on two lightweight models, Qwen3-4B (unquantized) and Qwen3-Reranker-0.6B, running on a single consumer-grade GPU, and the system won the Best Dynamic Evaluation award in the open-source track of the NeurIPS 2025 MMU-RAG competition.
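
The routing step itself reduces to a small dispatch function; `classify`, `vanilla_rag`, and `vanilla_agent` are hypothetical stand-ins for the LLM classifier and the two answering paths:

```python
def r2rag_answer(query, classify, vanilla_rag, vanilla_agent):
    """classify(query) -> 'simple' or 'complex' (LLM-based in the paper)."""
    if classify(query) == "simple":
        return vanilla_rag(query)   # one retrieve-then-generate turn
    return vanilla_agent(query)     # iterative retrieve/reason loop
```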

Scaling Language-Centric Omnimodal Representation Learning

This paper proposes the LCO-Emb framework and demonstrates that Multimodal Large Language Models (MLLMs) implicitly establish cross-modal alignment during generative pretraining. Lightweight text-only contrastive fine-tuning suffices to activate full omnimodal representation capabilities. The work further identifies the Generation-Representation Scaling Law (GRSL), which establishes a positive correlation between generative capability and representation performance.

SeCon-RAG: A Two-Stage Semantic Filtering and Conflict-Free Framework for Trustworthy RAG

This paper proposes SeCon-RAG, a two-stage defense framework. The first stage employs clustering combined with semantic graph filtering to remove poisoned documents, while the second stage performs conflict-aware filtering at inference time. SeCon-RAG comprehensively outperforms existing RAG defense methods across 5 LLMs and 3 QA datasets, maintaining high accuracy and near-zero attack success rates even under 100% poisoning rates.

SuperCLIP: CLIP with Simple Classification Supervision

SuperCLIP augments the CLIP contrastive learning framework with an extremely simple classification loss — requiring only a lightweight linear layer that increases total FLOPs by merely 0.077% — to recover fine-grained textual supervision that CLIP underutilizes, achieving consistent improvements on zero-shot classification, image-text retrieval, and vision-only tasks.
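
A minimal sketch of the combined objective: the standard symmetric CLIP contrastive loss plus one cross-entropy term from a single linear head on the image features. Dimensions and the label source are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def clip_contrastive(img, txt, t=0.07):
    """Symmetric in-batch image-text contrastive loss."""
    logits = img @ txt.t() / t
    y = torch.arange(img.size(0))
    return (F.cross_entropy(logits, y) + F.cross_entropy(logits.t(), y)) / 2

B, D, C = 16, 64, 1000                 # batch, embed dim, class count
classifier = nn.Linear(D, C)           # the only added parameters
img = F.normalize(torch.randn(B, D), dim=-1)
txt = F.normalize(torch.randn(B, D), dim=-1)
labels = torch.randint(0, C, (B,))     # e.g. class labels mined from captions

loss = clip_contrastive(img, txt) + F.cross_entropy(classifier(img), labels)
print(loss.item())
```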

SymRTLO: Enhancing RTL Code Optimization with LLMs and Neuron-Inspired Symbolic Reasoning

This paper proposes SymRTLO, the first neurosymbolic framework integrating LLMs with symbolic reasoning for RTL code optimization. By combining retrieval-augmented optimization rules, AST template-guided code generation, and an FSM symbolic system, SymRTLO achieves improvements of up to 43.9%, 62.5%, and 51.1% in power, performance, and area (PPA), respectively.

The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models

Through systematic interpretability analysis, this work discovers that in native multimodal VLMs (Chameleon, Emu3), image-to-text cross-modal information transfer is concentrated at a single end-of-image [EOI] token—forming a "narrow gate" bottleneck. Ablating the [EOI] token's attention causes catastrophic performance collapse, whereas in non-native VLMs (LLaVA, etc.) the information transfer is distributed. This mechanistic difference can be exploited for semantic manipulation and robustness improvement.

The Transparent Earth: A Multimodal Foundation Model for the Earth's Subsurface

This paper proposes Transparent Earth, a Transformer-based multimodal foundation model that fuses 8 heterogeneous geophysical observation modalities via positional encoding and text-derived modality embeddings, enabling zero-shot inference and in-context learning for Earth subsurface property prediction.

Think Straight, Stop Smart: Structured Reasoning for Efficient Multi-Hop RAG

This paper proposes the TSSS (Think Straight, Stop Smart) framework, which achieves state-of-the-art accuracy and competitive efficiency on multi-hop RAG benchmarks through (i) template-based reasoning that caches repeated prefixes and anchors sub-queries to the main question, and (ii) a retriever-based deterministic terminator that halts reasoning upon sub-query repetition.
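
A minimal sketch of the repetition-triggered stop rule; the paper's terminator consults retriever state, whereas this toy version simply normalizes and compares the sub-queries themselves:

```python
def should_stop(new_sub_query, seen):
    """Halt the reasoning loop as soon as a sub-query repeats."""
    key = " ".join(new_sub_query.lower().split())
    if key in seen:
        return True
    seen.add(key)
    return False

seen = set()
for q in ["Who founded X?", "Where is X based?", "who founded  x?"]:
    print(q, "->", "stop" if should_stop(q, seen) else "continue")
```

Because the stop decision is deterministic and needs no extra LLM call, termination adds essentially no inference cost.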

Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation

This paper proposes a dual-component framework (Windsock + DANCE) to address three core challenges in multimodal RAG: the Windsock module adaptively determines when to retrieve and which modality to retrieve (text/image/none) based on the query, while the DANCE instruction fine-tuning strategy improves how retrieved information is utilized by dynamically selecting the model's weakest modality for noise-robust training. The overall framework achieves a 17.07% performance improvement while reducing retrieval calls by 8.95%.

Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals

This paper introduces RAGuard, the first benchmark dataset to systematically evaluate the robustness of RAG systems against misleading retrieved content. By constructing a realistic retrieval corpus from Reddit — containing supporting, misleading, and unrelated documents — it demonstrates that all tested LLM-RAG systems perform worse than a zero-shot baseline when exposed to misleading retrievals, whereas human annotators maintain consistent judgment.