šŸ” Information Retrieval & RAG

📷 CVPR2025 · 12 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (81) · 💬 ACL2026 (73) · 🧪 ICML2026 (26) · 🤖 AAAI2026 (21) · 🧠 NeurIPS2025 (25) · 📹 ICCV2025 (5)

🔥 Top topics: RAG ×4 · Few-/Zero-Shot Learning ×3

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

Upgrades CLIP from the traditional one-to-one (image, text) contrastive learning paradigm to a multi-to-multi paradigm that contrasts multiple image embeddings with multiple texts. VLMs generate multi-perspective, multi-level descriptions, and a multi-branch visual encoder outputs diverse visual embeddings. The result is more comprehensive vision-language alignment that substantially outperforms baselines in retrieval, classification, and dense prediction tasks.

ChatHuman: Chatting about 3D Humans with Tools

ChatHuman is an LLM-based, language-driven system that automatically selects and integrates specialized 3D human analysis tools (3D pose estimation, shape recovery, contact detection, human-object interaction analysis, emotion recognition, etc.). To handle new tools, it uses academic papers as tool manuals and applies RAG (Retrieval-Augmented Generation) to create in-context examples. It outperforms existing LLM-based models in tool selection accuracy and overall performance on 3D human-related tasks.

COBRA: COmBinatorial Retrieval Augmentation for Few-Shot Adaptation

Proposes COBRA, a combinatorial mutual information (CMI)-based retrieval-augmented few-shot adaptation method. By simultaneously accounting for both the similarity of retrieved samples to the target task and the diversity among the samples themselves, COBRA retrieves high-quality auxiliary data from LAION-2B. It consistently outperforms traditional nearest-neighbor retrieval methods across multiple image classification benchmarks with negligible computational overhead.
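The idea of trading off similarity to the target task against diversity among the retrieved samples can be illustrated with a small sketch. Note that this is not COBRA's CMI objective: it swaps in a simpler maximal-marginal-relevance (MMR) style greedy rule, and the names `select_diverse`, `pool`, `targets`, and `trade_off` are hypothetical.

```python
def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def select_diverse(pool, targets, k, trade_off=0.5):
    """Greedy MMR-style selection (an assumption, not the paper's CMI function).

    Each step picks the pool item with the best balance between relevance
    to the target set and redundancy with the items already selected.
    Returns the indices of the selected pool items, in selection order.
    """
    # Relevance: best similarity to any target (few-shot) embedding.
    relevance = [max(cosine(p, t) for t in targets) for p in pool]
    selected = []
    while len(selected) < min(k, len(pool)):
        best, best_score = None, float("-inf")
        for i, p in enumerate(pool):
            if i in selected:
                continue
            # Redundancy: highest similarity to anything already chosen.
            redundancy = max((cosine(p, pool[j]) for j in selected), default=0.0)
            score = trade_off * relevance[i] - (1 - trade_off) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected


targets = [(1.0, 0.0)]
pool = [(1.0, 0.1), (1.0, 0.12), (1.0, -0.5)]
diverse = select_diverse(pool, targets, k=2)
nearest = select_diverse(pool, targets, k=2, trade_off=1.0)
```

With `trade_off=1.0` the rule reduces to plain nearest-neighbor retrieval and returns the two near-duplicate items; with the default it skips the near-duplicate in favor of a less similar but more distinct sample.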

EZSR: Event-based Zero-Shot Recognition

This paper proposes the EZSR framework for zero-shot object recognition in event camera data. By utilizing a scalar-wise modulation strategy, it addresses the semantic misalignment between event embeddings and CLIP text embeddings. It overcomes training data scarcity through large-scale event data synthesis from static RGB images, achieving a 47.84% zero-shot accuracy on N-ImageNet with a ViT-B/16 backbone.

Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning

This paper extends Retrieval-Augmented Learning (RAL) to Few-Shot Recognition (FSR) for the first time, exposing two main challenges of retrieved data: distribution imbalance and domain gap. It proposes a two-stage method, SWAT (finetuning the vision encoder on mixed data first, then retraining the classifier on few-shot labeled data), outperforming all prior methods by \(>6\%\) across 9 benchmarks.

GOAL: Global-Local Object Alignment Learning

Proposes the GOAL method, which enhances CLIP's understanding of long text descriptions through two modules: Local Image-Sentence Matching (LISM) and Token Similarity-based Learning (TSL). By introducing local semantic alignment on top of global alignment, it significantly improves image-text retrieval performance.

LotusFilter: Fast Diverse Nearest Neighbor Search via a Learned Cutoff Table

This paper proposes LotusFilter, which constructs a cutoff table by precomputing neighbor relationships for each vector offline and performs diversity filtering using greedy set deletion during the online stage. This reduces the complexity of traditional diverse search from \(O(DS^2)\) to \(O(T+S+KL)\). The filtering process requires only 0.02 ms/query and uses only 1/40 of the memory of traditional methods.
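A minimal sketch of the two stages described above, under simplifying assumptions: the table is built by brute force with a fixed distance threshold `eps` (the learned aspect of the paper's table is not modeled), and the candidate list is assumed to come from some approximate nearest neighbor search, already sorted by distance to the query. The function names are hypothetical.

```python
from math import dist


def build_cutoff_table(vectors, eps):
    """Offline step: for every vector id, list the ids of vectors closer than eps.

    Brute-force O(N^2) construction, for illustration only.
    """
    table = {i: [] for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if dist(vectors[i], vectors[j]) < eps:
                table[i].append(j)
                table[j].append(i)
    return table


def diverse_filter(candidates, table, k):
    """Online step: greedy set deletion over candidates sorted by query distance.

    Keep the closest surviving candidate, then delete all of its
    table neighbors from the candidate set; stop after k results.
    """
    removed = set()
    picked = []
    for c in candidates:
        if len(picked) == k:
            break
        if c in removed:
            continue
        picked.append(c)
        removed.update(table[c])  # only table lookups, no distance computation
    return picked


vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
table = build_cutoff_table(vectors, eps=1.0)
result = diverse_filter([0, 1, 2, 3, 4], table, k=3)
```

Here ids 1 and 3 are dropped because they sit within `eps` of already selected ids 0 and 2. The online loop touches only precomputed table entries, which is what makes the filtering step cheap relative to recomputing pairwise distances among candidates.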

Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation

The CRPL framework is proposed to improve prompt learning of CLIP in unsupervised domain adaptation (UDA) through source-augmented pseudo-labeling and an optimal transport-based cluster preservation strategy, ensuring that the text embeddings of target prompts better cover the cluster structures of visual embeddings.

RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings

RANGE approximates high-resolution visual information and injects it into location embeddings via a retrieval-augmented strategy. This addresses the tendency of contrastive learning methods (e.g., SatCLIP) to discard modality-specific information. RANGE achieves up to a 13.1% performance gain on classification tasks and a 0.145 increase in \(R^2\) on regression tasks.
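One way to picture a retrieval-augmented location embedding is sketched below. It is an illustration under assumptions rather than the paper's exact procedure: the database is taken to hold paired location embeddings (`db_keys`) and visual features (`db_values`), the retrieved visual feature is a softmax-weighted average, and the `temperature` parameter and all names are hypothetical.

```python
from math import exp


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores, temperature):
    """Numerically stable softmax with a temperature."""
    top = max(scores)
    exps = [exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieval_augmented_embedding(query, db_keys, db_values, temperature=0.1):
    """Append an approximated visual feature to a location embedding.

    The query location embedding is compared against the database keys;
    the matching visual features are averaged with softmax weights and
    concatenated to the original embedding.
    """
    weights = softmax([dot(query, key) for key in db_keys], temperature)
    dim = len(db_values[0])
    visual = [sum(w * v[d] for w, v in zip(weights, db_values)) for d in range(dim)]
    return list(query) + visual


query = (1.0, 0.0)
db_keys = [(1.0, 0.0), (0.0, 1.0)]
db_values = [(10.0, 0.0), (0.0, 10.0)]
emb = retrieval_augmented_embedding(query, db_keys, db_values, temperature=0.01)
```

With a low temperature the weights concentrate on the closest database entry, so the appended part of `emb` is close to that entry's visual feature; a higher temperature blends features from several nearby entries.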

Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis

RAG-Gesture proposes a gesture synthesis framework based on Retrieval-Augmented Generation (RAG). It leverages explicit linguistic knowledge to retrieve semantically relevant exemplar motions from a gesture database, and injects them into the diffusion model's generation process at inference time through DDIM inversion and retrieval guidance, producing semantically rich and natural co-speech gestures without training.

Towards Smart Point-and-Shoot Photography

A smart "point-and-shoot" photography system is proposed: a CLIP-text-embedding-based composition quality assessor (CCQA) first evaluates the current composition quality, then a Mixture of Experts (MoE) camera pose adjustment model (CPAM) predicts yaw/pitch adjustment angles. On the PCARD dataset (320K images generated from 4K panoramas), it achieves a 79.3% AUC for adjustment suggestions and a 0.613 IoU for adjustment accuracy.

VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

Proposes the first RAG framework that directly takes document images (rather than parsed text) as input, utilizing an LVLM as a dual-encoder retriever plus two self-supervised pre-training tasks (contrastive + generative) to achieve document image retrieval, outperforming text RAG by 24 points on ChartQA.