🔍 Information Retrieval & RAG
📷 CVPR2026 · 7 paper notes
- Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval
    - This paper introduces MCMR (Multi-Condition Multimodal Retrieval), a large-scale benchmark built around a dual-evidence design: certain attributes are inferable only from images and others only from text, so retrieval tasks cannot be solved unimodally. The benchmark systematically evaluates 5 retrievers and 7 MLLM rerankers, revealing modality asymmetry and fine-grained reasoning gaps (a toy schema sketch follows this list).
- CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering
    - CC-VQA is proposed as a training-free method for mitigating knowledge conflicts in KB-VQA. Through a two-stage strategy combining visual-centric contextual conflict reasoning and correlation-guided encoding/decoding, it achieves absolute accuracy improvements of 3.3%–6.4% on three benchmarks: E-VQA, InfoSeek, and OK-VQA.
- Explaining CLIP Zero-shot Predictions Through Concepts
    - This paper proposes EZPC, which learns a linear projection matrix \(A\) that jointly maps CLIP image and text embeddings into an interpretable concept space. The method provides faithful, human-understandable explanations for CLIP predictions with negligible accuracy loss (an H-mean gap of roughly 1% on CIFAR-100/CUB/ImageNet-100) and an inference overhead of only about 0.1 ms (a minimal concept-projection sketch follows this list).
- M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
    - This paper proposes M4-RAG, the first large-scale multilingual, multi-cultural, multimodal RAG evaluation framework, covering 42 languages and 189 countries with more than 80K cultural VQA instances. It systematically reveals two key findings: RAG helps smaller models, but its benefits do not grow with model size, and cross-lingual retrieval suffers severe performance degradation.
- MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
    - MuCo is a multi-turn, dialogue-based contrastive learning framework that leverages the conversational capabilities of MLLMs to process multiple associated query-target pairs within a single forward pass, substantially improving training efficiency and achieving state-of-the-art performance on the MMEB and M-BEIR retrieval benchmarks (a toy loss sketch follows this list).
- NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval
    - NanoVDR exploits the modality asymmetry between queries and documents, distilling the query-encoding capability of a 2B VLM teacher into a 69M text-only encoder via pointwise cosine alignment. On the ViDoRe benchmark, the student retains 95.1% of the teacher's performance while cutting query latency by 50×, at a total training cost of only 13 GPU hours (a minimal alignment-loss sketch follows this list).
- RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
    - This paper proposes RobustVisRAG, a causality-guided dual-path framework that decouples semantic–degradation entanglement in VisRAG by capturing degradation signals via a non-causal path and learning clean semantics via a causal path. Under real-world degradation conditions, the framework achieves improvements of 7.35%, 6.35%, and 12.40% in retrieval, generation, and end-to-end performance, respectively, while preserving performance on clean data.
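
For the MCMR note above, here is a minimal sketch of what a dual-evidence benchmark item could look like: some conditions are verifiable only from the image and others only from the text, so no single modality suffices. All field names and values are illustrative assumptions, not taken from the paper's release.

```python
from dataclasses import dataclass, field

@dataclass
class DualEvidenceItem:
    """Hypothetical schema for one MCMR-style multi-condition query."""
    query: str
    image_only_conditions: dict = field(default_factory=dict)  # verifiable only from the image
    text_only_conditions: dict = field(default_factory=dict)   # verifiable only from the caption/text

    def solvable_unimodally(self) -> bool:
        # The dual-evidence design aims to make this False for every item:
        # one modality alone should never carry all required conditions.
        return not self.image_only_conditions or not self.text_only_conditions

item = DualEvidenceItem(
    query="red leather handbag with a gold chain strap",
    image_only_conditions={"color": "red", "strap": "gold chain"},
    text_only_conditions={"material": "leather", "brand": "ACME"},
)
print(item.solvable_unimodally())  # False: both modalities are required
```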
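For the EZPC note, here is a minimal sketch of zero-shot classification in a concept space, assuming a single learned matrix \(A\) projects both CLIP image and text embeddings into concept coordinates. The dimensions, the concept vocabulary, and the way \(A\) is trained are all assumptions; the paper's actual formulation may differ.

```python
import torch
import torch.nn.functional as F

d_embed, n_concepts, n_classes = 512, 64, 10

A = torch.randn(d_embed, n_concepts) * 0.02  # learned projection (random placeholder here)
img_emb = F.normalize(torch.randn(1, d_embed), dim=-1)          # CLIP image embedding
txt_emb = F.normalize(torch.randn(n_classes, d_embed), dim=-1)  # CLIP class-prompt embeddings

# Map both modalities into the shared concept space with the same matrix A.
img_concepts = img_emb @ A   # (1, n_concepts): per-concept activations serve as the explanation
cls_concepts = txt_emb @ A   # (n_classes, n_concepts)

# Zero-shot scores computed entirely in concept space.
logits = F.normalize(img_concepts, dim=-1) @ F.normalize(cls_concepts, dim=-1).T
pred = logits.argmax(dim=-1)

# The most activated concepts double as a human-readable rationale for the prediction.
top_concepts = img_concepts.squeeze(0).topk(5).indices
print(pred.item(), top_concepts.tolist())
```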
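For the MuCo note, here is a toy contrastive loss over per-turn embeddings, assuming one forward pass of the MLLM yields a query and a target embedding for each turn of the dialogue; how MuCo actually pools per-turn hidden states and samples negatives is an assumption here. The efficiency gain is that a batch of size `b` with `t` turns contributes `b*t` positive pairs instead of `b`.

```python
import torch
import torch.nn.functional as F

def multi_turn_info_nce(query_emb: torch.Tensor,
                        target_emb: torch.Tensor,
                        temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE over (batch, turns, dim) embeddings from multi-turn dialogues."""
    b, t, d = query_emb.shape
    q = F.normalize(query_emb.reshape(b * t, d), dim=-1)
    k = F.normalize(target_emb.reshape(b * t, d), dim=-1)

    # Every other target in the batch *and* in the same dialogue acts as a negative.
    logits = q @ k.T / temperature
    labels = torch.arange(b * t, device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage: 4 dialogues, 3 query-target pairs each, 768-d embeddings.
loss = multi_turn_info_nce(torch.randn(4, 3, 768), torch.randn(4, 3, 768))
print(loss.item())
```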
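For the NanoVDR note, here is a minimal sketch of a pointwise cosine-alignment objective: the small text-only student is trained so that its query embedding matches the frozen 2B teacher's embedding for the same query, while the document index keeps the teacher's embeddings. Whether NanoVDR needs a projection between mismatched dimensions, or adds other loss terms, is not stated in the summary above and is left out here.

```python
import torch
import torch.nn.functional as F

def pointwise_cosine_alignment(student_q: torch.Tensor,
                               teacher_q: torch.Tensor) -> torch.Tensor:
    """Distillation loss pulling student query embeddings toward frozen teacher ones.

    student_q: (batch, d) from the small text-only query encoder.
    teacher_q: (batch, d) from the VLM teacher, precomputed and detached.
    """
    return (1.0 - F.cosine_similarity(student_q, teacher_q.detach(), dim=-1)).mean()

# Toy usage: only the query path must be cheap at serving time,
# so the student never needs to see images at all.
student_q = torch.randn(8, 768, requires_grad=True)
teacher_q = torch.randn(8, 768)
loss = pointwise_cosine_alignment(student_q, teacher_q)
loss.backward()
print(loss.item())
```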