🔍 Information Retrieval & RAG
📹 ICCV2025 · 7 paper notes
- Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching

  This paper proposes D2S-VSE, a two-stage training framework that addresses the information-density asymmetry between images and text in image-text matching. In the first stage, the model is pre-trained on LLaVA-generated dense captions to raise the information capacity of text embeddings; in the second stage, the dense-caption embeddings are distilled into the embeddings of the original sparse captions. The method achieves state-of-the-art results on MS-COCO and Flickr30K.
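  The second-stage distillation can be sketched in a few lines. This is a minimal numpy illustration under assumed details (a cosine-alignment loss and a frozen dense-caption teacher), not the paper's actual objective:

  ```python
  import numpy as np

  def l2_normalize(x, eps=1e-8):
      return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

  def dense_to_sparse_distill_loss(sparse_emb, dense_emb):
      # Pull the sparse-caption (student) embedding toward the frozen
      # dense-caption (teacher) embedding: minimize 1 - cosine similarity.
      s, d = l2_normalize(sparse_emb), l2_normalize(dense_emb)
      return float(np.mean(1.0 - np.sum(s * d, axis=-1)))

  rng = np.random.default_rng(0)
  dense = rng.normal(size=(4, 512))                 # embeddings of dense captions
  sparse = dense + 0.1 * rng.normal(size=(4, 512))  # embeddings of sparse captions
  loss = dense_to_sparse_distill_loss(sparse, dense)
  ```

  In practice this term would be combined with the usual contrastive image-text matching loss; the sketch only shows the capacity-transfer direction (dense teacher → sparse student).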
- External Knowledge Injection for CLIP-Based Class-Incremental Learning

  This paper proposes Engine (ExterNal knowledGe INjEction), a framework with two components: dual-branch injection tuning, where the visual branch is enhanced via data augmentation and the text branch via GPT-4-generated discriminative descriptions, and post-tuning knowledge injection at inference, which re-ranks predictions using pairwise discriminative features. Without storing any historical samples, Engine improves on prior CLIP-based class-incremental learning methods by 3–10% across 9 benchmark datasets.
- LangBridge: Interpreting Image as a Combination of Language Embeddings

  LangBridge achieves interpretable vision-language alignment by explicitly decomposing visual features into linear combinations of LLM vocabulary embeddings, and it supports pretraining-free adapter transfer across different LLMs.
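  The core idea, rebuilding each visual token as a weighted mix of vocabulary embeddings, can be sketched as follows. This is a toy numpy sketch; the real adapter's scoring network, temperature, and shapes are assumptions:

  ```python
  import numpy as np

  def softmax(x, axis=-1):
      z = x - x.max(axis=axis, keepdims=True)
      e = np.exp(z)
      return e / e.sum(axis=axis, keepdims=True)

  def decompose_into_vocab(visual_tokens, vocab_table, temperature=1.0):
      # Score each visual token against every vocabulary embedding, then
      # rebuild it as a convex combination of those embeddings. The weights
      # are directly interpretable: "which words does this patch look like".
      logits = visual_tokens @ vocab_table.T / temperature
      weights = softmax(logits)
      return weights @ vocab_table, weights

  rng = np.random.default_rng(0)
  vocab = rng.normal(size=(1000, 64))   # toy LLM vocabulary embedding table
  patches = rng.normal(size=(16, 64))   # toy visual tokens from an image encoder
  recon, w = decompose_into_vocab(patches, vocab)
  ```

  Because the output lives in the span of the vocabulary table rather than in an opaque projected space, swapping in another LLM's vocabulary table is what makes the pretraining-free adapter transfer plausible.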
- MonSTeR: a Unified Model for Motion, Scene, Text Retrieval

  This paper proposes MonSTeR, the first tri-modal retrieval model spanning motion, scene, and text. Inspired by topological deep learning, it builds a unified latent space through higher-order relationship modeling. By capturing intrinsic dependencies among all three modalities, MonSTeR substantially outperforms baselines that rely solely on unimodal representations across multiple retrieval tasks, and it can also serve as an evaluation tool for human-scene interaction models.
- OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

  This paper presents OHRBench, the first benchmark for evaluating the cascading impact of OCR on RAG systems. It comprises 8,561 document images across 7 domains with 8,498 QA pairs, and it systematically reveals how OCR-induced Semantic Noise and Formatting Noise affect the retrieval and generation stages in distinct ways.
- Representation Shift: Unifying Token Compression with FlashAttention

  This paper proposes Representation Shift, a training-free, model-agnostic token-importance metric that measures how much a token's representation changes across a network layer. Because it does not depend on attention scores, which FlashAttention never materializes, it makes token compression compatible with FlashAttention for the first time, yielding up to 5.5× speedups on video understanding and image classification tasks.
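  A minimal numpy sketch of the metric (the L2 norm as the shift measure, the keep-ratio, and top-k selection are assumptions): the score needs only the token representations before and after a layer, never the attention matrix, which is why it composes with FlashAttention.

  ```python
  import numpy as np

  def representation_shift(tokens_in, tokens_out):
      # Per-token importance: how much the layer changed each token.
      return np.linalg.norm(tokens_out - tokens_in, axis=-1)

  def compress_tokens(tokens_in, tokens_out, keep_ratio=0.5):
      scores = representation_shift(tokens_in, tokens_out)
      k = max(1, int(round(len(scores) * keep_ratio)))
      keep = np.sort(np.argsort(scores)[-k:])  # top-k shifted, original order
      return tokens_out[keep]

  rng = np.random.default_rng(0)
  x_in = rng.normal(size=(8, 32))
  x_out = x_in.copy()
  x_out[:4] += rng.normal(size=(4, 32))  # only the first 4 tokens change
  kept = compress_tokens(x_in, x_out, keep_ratio=0.5)
  ```

  Tokens whose representations barely move are treated as redundant and dropped, so the compression step slots in between layers without touching the attention kernel itself.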
- ViLU: Learning Vision-Language Uncertainties for Failure Prediction

  This paper proposes ViLU, a post-hoc uncertainty-quantification framework for VLM zero-shot prediction. By fusing visual embeddings, predicted text embeddings, and image-conditioned text representations via cross-attention, ViLU builds uncertainty-aware multimodal representations that significantly outperform existing failure-prediction methods across 13 classification datasets and large-scale image-text datasets.
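  The fusion step can be sketched as single-head cross-attention in numpy. The head count, scaling, and the downstream failure-prediction head are assumptions, not the paper's exact architecture:

  ```python
  import numpy as np

  def softmax(x, axis=-1):
      z = x - x.max(axis=axis, keepdims=True)
      e = np.exp(z)
      return e / e.sum(axis=axis, keepdims=True)

  def vilu_features(img_emb, class_text_embs):
      # Image-conditioned text representation: the visual embedding queries
      # the class text embeddings via scaled dot-product cross-attention.
      d = img_emb.shape[-1]
      sims = img_emb @ class_text_embs.T
      conditioned = softmax(sims / np.sqrt(d)) @ class_text_embs
      # Text embedding of the zero-shot predicted class.
      predicted = class_text_embs[np.argmax(sims)]
      # Concatenated uncertainty-aware representation; in ViLU this would
      # feed a small binary failure-prediction head (not shown).
      return np.concatenate([img_emb, predicted, conditioned])

  rng = np.random.default_rng(0)
  texts = rng.normal(size=(10, 64))  # 10 class prompts encoded by the VLM
  image = rng.normal(size=(64,))
  feats = vilu_features(image, texts)
  ```

  Being post-hoc, nothing here touches the VLM's weights: the uncertainty signal is learned entirely on top of frozen image and text embeddings.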