🔍 Information Retrieval & RAG
📷 CVPR2026 · 9 paper notes
📌 Same area in other venues: 🔬 ICLR2026 (81) · 💬 ACL2026 (73) · 🧪 ICML2026 (26) · 🤖 AAAI2026 (21) · 🧠 NeurIPS2025 (24) · 📹 ICCV2025 (5)
🔥 Top topics: Multimodal/VLM ×3 · RAG ×2
- CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering
This paper proposes CC-VQA, a training-free method for mitigating knowledge conflict in knowledge-based VQA. Its two-stage strategy combines visual-centric context conflict reasoning with correlation-guided encoding/decoding, achieving absolute accuracy gains of 3.3%-6.4% across the E-VQA, InfoSeek, and OK-VQA benchmarks.
- Explaining CLIP Zero-shot Predictions Through Concepts
This paper proposes EZPC, which learns a linear projection matrix that maps CLIP image-text embeddings into an interpretable concept space. With almost no loss in zero-shot classification accuracy (an H-mean gap of only ~1% on CIFAR-100/CUB/ImageNet-100), it provides faithful explanations of CLIP predictions in terms of human-understandable concepts, while adding only about 0.1 ms of inference overhead.
- Language-driven Fine-grained Retrieval
LaFG replaces the semantically sparse one-hot category name supervision in Fine-Grained Image Retrieval (FGIR) with "attribute-level language prototypes." It leverages an LLM to expand category names into attribute descriptions, uses a frozen VLM to encode and cluster these into a dataset-level attribute vocabulary, and aggregates the Top-K attributes per category into prototypes to supervise the retrieval model. This establishes comparability across inter-class details, achieving SOTA results on CUB / Cars / SOP while significantly improving generalization to unseen classes.
- M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
This paper proposes M4-RAG, the first large-scale multilingual, multicultural, and multimodal RAG evaluation framework. Covering 42 languages and 80K+ cultural VQA instances from 189 countries, it systematically shows that RAG is effective for small models but does not scale positively with model size, and that cross-lingual retrieval suffers severe performance degradation.
- Mask to Align, Weight to Disambiguate: Reliable Unsupervised Cross-Modal Hashing with Masked-Weight Contrast
To address the "partial alignment + semantic ambiguity" issues in unsupervised cross-modal hashing, UWMCH performs token masking before fusion to force the model to learn complementary semantics. It then uses semantic affinity to re-weight contrastive losses to suppress false negatives, supplemented by dual-scale semantic regularization to stabilize the hashing space. It achieves the best mAP in 21 out of 24 settings across three retrieval benchmarks.
- MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
MuCo is a multi-turn contrastive learning framework that leverages the conversational capabilities of MLLMs to process multiple associated query-target pairs in a single forward pass. This significantly improves training efficiency and achieves SOTA performance on the MMEB and M-BEIR retrieval benchmarks.
- POGA: Paraphrased and Oppositional Graph Alignment for Fine-Grained Cross-Modal Retrieval
POGA parses both images and text into structured scene graphs, uses LLMs to automatically generate "paraphrased positive samples + counterfactual negative samples" along with their difference information, and trains with a composite loss across four granularities: global, node, relation, and focus. This allows the model both to recognize object attributes and to reject "semantically similar but factually incorrect" descriptions in fine-grained long-text retrieval.
- ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology
ProM3E utilizes an "align-then-fuse" two-stage framework to train a Masked Variational Autoencoder (MVAE) within the embedding space. By inferring Gaussian distribution representations of missing modalities from a small subset of visible modalities, it supports any-to-any modality generation, modality inversion retrieval, and uncertainty analysis regarding "which modalities to fuse." It comprehensively outperforms TaxaBind on ecological multimodal tasks.
- RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
RobustVisRAG is a causality-guided dual-path framework that decouples semantic-degradation entanglement in VisRAG by capturing degradation signals through a non-causal path while learning pure semantics via a causal path. Under real-world degradation, it achieves performance gains of 7.35%, 6.35%, and 12.40% on retrieval, generation, and end-to-end tasks, respectively, while maintaining performance on clean data.
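
As a rough illustration of the concept-space idea in the EZPC note above, the sketch below projects an image embedding and several class text embeddings through one linear matrix, then scores each class by the dot product in concept space. Because that score is a sum of per-concept terms, the largest terms can serve as an explanation. Everything here is an assumption made for illustration: the matrix `W`, the toy embeddings, the concept names, and the helpers `project`, `concept_contributions`, and `explain` are hypothetical and are not taken from the paper, whose actual scoring and training procedure may differ.

```python
# Hypothetical sketch of concept-space scoring with a linear projection.
# W, the embeddings, and the concept names are toy values for illustration,
# not values or APIs from the EZPC paper.

def project(W, v):
    """Map an embedding v (length d) into concept space with W (k x d)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def concept_contributions(W, image_emb, text_emb):
    """Per-concept terms whose sum equals the concept-space dot product."""
    img_c = project(W, image_emb)
    txt_c = project(W, text_emb)
    return [a * b for a, b in zip(img_c, txt_c)]

def explain(W, image_emb, class_text_embs, concept_names, top=2):
    """Return the best-scoring class, its score, and its strongest concepts."""
    contribs = {
        name: concept_contributions(W, image_emb, t)
        for name, t in class_text_embs.items()
    }
    scores = {name: sum(c) for name, c in contribs.items()}
    best = max(scores, key=scores.get)
    ranked = sorted(
        zip(concept_names, contribs[best]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return best, scores[best], ranked[:top]

# Toy setup: 2 concepts over 3-dimensional embeddings.
W = [
    [1.0, 0.0, 0.0],  # concept 0 reads embedding dimension 0
    [0.0, 1.0, 0.0],  # concept 1 reads embedding dimension 1
]
concept_names = ["striped", "has wings"]
image_emb = [0.9, 0.1, 0.3]
class_text_embs = {
    "zebra": [1.0, 0.0, 0.2],
    "sparrow": [0.0, 1.0, 0.1],
}

best, score, top_concepts = explain(W, image_emb, class_text_embs, concept_names)
print(best, score, top_concepts)
```

In this toy setup the prediction and its explanation come from the same quantities, which is the property that makes a linear concept projection attractive for interpretability. In practice the projection would be learned rather than hand-written.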