Efficient Discriminative Joint Encoders for Large Scale Vision-Language Re-ranking
Conference: ICLR 2026 · arXiv: 2510.06820 · Code: GitHub · Area: Information Retrieval · Keywords: Vision-Language Retrieval, Joint Encoder, Re-ranking, Token Compression, Efficient Inference
TL;DR
This paper proposes EDJE (Efficient Discriminative Joint Encoder), which moves visual feature extraction offline and compresses visual tokens with a lightweight attention adapter, achieving a throughput of 50k image-text pairs per second. EDJE matches the retrieval performance of existing joint encoders on Flickr30k (zero-shot) and COCO (fine-tuned) while requiring only 49 kB of storage per image.
Background & Motivation
In large-scale multimodal retrieval, embedding-based models (e.g., CLIP) enable efficient search via vector similarity, but independently encoding the two modalities limits fine-grained cross-modal interaction. Joint encoders (e.g., BLIP, BLIP-2) process both modalities jointly and achieve stronger retrieval performance. Cross-encoder re-ranking is already a standard paradigm in text retrieval.
Key Challenge: Visual feature extraction is a severe bottleneck in existing joint encoders—processing a batch of 64 images with BLIP using ViT-B takes ~400 ms and ~1,400 ms with ViT-L, accounting for 83%–93% of total inference time. By contrast, the widely used MiniLM re-ranker in text retrieval has only 22 M parameters and processes an equivalent batch in ~60 ms. This explains why multimodal re-rankers are nearly absent from practical retrieval systems.
Core Idea: Offline visual feature extraction—images are encoded once and cached to disk; at inference time, only a compact joint encoder processes a small number of visual tokens together with text tokens. A token compression adapter further reduces storage requirements substantially.
Method
Overall Architecture
EDJE operates in two stages:
- Offline stage: A ViT encodes images; an adapter compresses the resulting features into a compact token set, which is stored to disk.
- Online stage: A compact language model (MiniLM) jointly processes the compressed visual tokens and text tokens to produce re-ranking scores.
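The two-stage split can be sketched minimally as follows. This is a hypothetical illustration, not the released code: the function names `offline_encode`/`online_score` and the `.npy`-file cache layout are assumptions. The storage comment follows from the paper's numbers: 64 tokens × 384 dims × 2 bytes (FP16) = 49,152 bytes ≈ 49 kB.

```python
import numpy as np

# Illustrative shapes matching the paper's Compressed-64 variant.
NUM_TOKENS, DIM = 64, 384  # 64 * 384 * 2 bytes (FP16) = 49,152 B ~ 49 kB


def offline_encode(image_id: str, visual_tokens: np.ndarray, cache_dir: str) -> None:
    """Offline stage: compress once per image, store FP16 tokens to disk."""
    assert visual_tokens.shape == (NUM_TOKENS, DIM)
    np.save(f"{cache_dir}/{image_id}.npy", visual_tokens.astype(np.float16))


def online_score(image_id: str, text_tokens: np.ndarray, joint_encoder, cache_dir: str):
    """Online stage: load cached visual tokens, concatenate with text tokens,
    and let the compact joint encoder produce a re-ranking score."""
    visual = np.load(f"{cache_dir}/{image_id}.npy").astype(np.float32)
    return joint_encoder(np.concatenate([visual, text_tokens], axis=0))
```

At query time only `online_score` runs, so the ViT's cost is paid once per image regardless of how many queries it is re-ranked against.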
Key Designs
- Paradigm shift via visual pre-computation: Because the visual encoder operates solely on images, its outputs can be cached and reused. A ViT-B projects each 16×16 patch into an embedding of dimension \(d=384\); stored in FP16, the footprint is comparable to that of the original 8-bit RGB image. Scaling up the visual encoder improves representation quality without increasing online cost. The key challenge is that raw storage is infeasible at web scale (potentially reaching terabytes), necessitating a compression strategy.
- Token compression adapter: \(m\) learnable universal query tokens \(\mathbf{Q} = [\mathbf{q}_1, \ldots, \mathbf{q}_m]\) aggregate information from \(n\) visual tokens \(\mathbf{X}\) via cross-attention: \(\mathbf{H} = \text{MultiHeadAttention}(\mathbf{Q},\, \mathbf{X}\mathbf{W}_K,\, \mathbf{X}\mathbf{W}_V)\). The output is passed through a residual block and linearly projected into the language-model embedding space: \(\mathbf{Y} = \bigl(\mathbf{H} + \text{MLP}(\text{LayerNorm}(\mathbf{H}))\bigr)\mathbf{W}_{proj}\). This compresses 576 ViT tokens down to 64 tokens, reducing storage from 442 kB to 49 kB per image.
- Compact joint encoder: Following the VLM paradigm, visual tokens are projected into the language embedding space, concatenated with text tokens, and processed jointly via self-attention for cross-modal interaction. The language model is MiniLM (33 M parameters), substantially smaller than BLIP's 139–167 M. The design is modular: any ViT-based visual encoder can be paired with any pre-trained language model.
Loss & Training
Four training objectives are used:
- Image-Text Matching (ITM): binary classification of positive pairs versus in-batch hard negatives.
- Masked Language Modeling (MLM): 50% of text tokens are masked and predicted, encouraging cross-modal dependency.
- Text Embedding Recovery: minimization of the cosine distance between the projected CLS token and the text-encoder embedding.
- Local-to-Compressed Distillation: the uncompressed Local model serves as a teacher; logit-level distillation is applied to the compressed model.
Pre-training uses 14 M image-text pairs from CC12M, CC3M, SBU, VG, and COCO; fine-tuning uses COCO only.
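The combined objective might look roughly as follows. This is a sketch under stated assumptions: equal loss weights, a KL form for the logit-level distillation, and all tensor shapes are guesses, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def training_losses(itm_logits, itm_labels,
                    mlm_logits, mlm_labels,
                    cls_proj, text_emb,
                    student_logits=None, teacher_logits=None):
    """Sketch of the four objectives; equal weighting is an assumption."""
    losses = {
        # ITM: positive pair vs. in-batch hard negative (2-way classification)
        "itm": F.cross_entropy(itm_logits, itm_labels),
        # MLM over the 50% masked positions (-100 marks unmasked tokens)
        "mlm": F.cross_entropy(mlm_logits, mlm_labels, ignore_index=-100),
        # Text embedding recovery: cosine distance to the text-encoder embedding
        "recover": 1.0 - F.cosine_similarity(cls_proj, text_emb).mean(),
    }
    if teacher_logits is not None:
        # Local -> Compressed: match the uncompressed teacher's logits
        losses["distill"] = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )
    return sum(losses.values()), losses
```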
Key Experimental Results
Main Results
Flickr30k Zero-Shot Retrieval (SigLIP2 ViT-L/16, 384²):
| Method | T2I R@1 | I2T R@1 | Storage/Image | Params | Inference Time |
|---|---|---|---|---|---|
| BLIP ViT-L/16 | 86.7 | 96.7 | 2,359 kB | 139 M | 101.61 ms |
| BLIP-2 ViT-L/16 | 88.6 | 96.9 | 2,359 kB | 167 M | 98.64 ms |
| EDJE Local | 87.8 | 96.5 | 442 kB | 33 M | 4.14 ms |
| EDJE Compressed-64 | 86.9 | 96.4 | 49 kB | 33 M | 1.91 ms |
EDJE Gains over Various Embedding Models (Flickr30k Zero-Shot T2I R@1):
| Backbone | Base | +EDJE | Gain |
|---|---|---|---|
| CLIP ViT-B/16 | 62.1 | 76.8 | +14.7 |
| CLIP ViT-L/14 | 65.2 | 80.6 | +15.4 |
| SigLIP2 ViT-B/16 | 82.1 | 84.3 | +2.2 |
| SigLIP2 ViT-L/16 | 82.3 | 87.8 | +5.5 |
Ablation Study
- Token count: Experiments with {32, 64, 128, 256} target tokens show that 64 tokens achieve the best efficiency–performance trade-off; 32 tokens exhibit a notable performance drop, while 256 tokens approach the performance of the Local variant with 576 tokens.
- Re-ranking pool size: Retrieval performance on R@1/5/10 remains stable as \(k\) varies from 5 to 50, demonstrating EDJE's robustness to noisy candidates.
- Training objectives: Incrementally adding MLM and text embedding recovery on top of ITM each yield positive contributions; the full combination is strongest.
- Local-to-Compressed distillation: Provides additional discriminative gains for the compressed variant.
- Semantic analysis: The 64 compressed tokens map to semantically meaningful object/scene descriptors (e.g., boulders, caves), whereas a large proportion of the 576 uncompressed tokens map to meaningless special tokens (e.g., unused80), confirming substantial redundancy in raw ViT tokens.
- Quantization: Quantizing the compressed tokens incurs negligible performance loss, offering a further storage–performance trade-off.
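To illustrate the quantization trade-off, a simple symmetric int8 scheme (an assumption for illustration; the paper does not specify its quantization method here) would halve the FP16 footprint again, from ~49 kB to ~25 kB per image plus one scale factor:

```python
import numpy as np


def quantize_int8(tokens: np.ndarray):
    """Symmetric per-image int8 quantization of compressed tokens (sketch)."""
    scale = np.abs(tokens).max() / 127.0
    q = np.clip(np.round(tokens / scale), -127, 127).astype(np.int8)
    return q, scale  # 64 * 384 * 1 byte ~ 24.6 kB, vs. 49 kB in FP16


def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 tokens before feeding the joint encoder."""
    return q.astype(np.float32) * scale
```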
Key Findings
- EDJE serves as a plug-and-play re-ranker that improves retrieval across all tested embedding models (CLIP, DFN, MetaCLIP, SigLIP2).
- Inference is 53× faster than BLIP-2, with 48× storage reduction (49 kB vs. 2,359 kB per image).
- Quantizing compressed tokens causes minimal performance degradation, enabling further storage optimization.
Highlights & Insights
- The paper precisely identifies visual feature extraction as the joint-encoder bottleneck (83%–93% of inference time); the proposed offline-plus-compression solution is elegant and principled.
- The semantic analysis of the token compression adapter is highly informative: most ViT tokens are indeed redundant, and 64 compressed tokens suffice to capture critical semantics.
- The modular architecture makes EDJE a highly practical drop-in re-ranker.
- The paper's narrative is exceptionally clear, progressing logically from bottleneck analysis → paradigm shift → concrete design → experimental validation.
- This work is the first to systematically transfer the mature cross-encoder re-ranking paradigm from text retrieval to multimodal retrieval, filling an important gap.
Limitations & Future Work
- Coverage is limited to image-text retrieval; multilingual multimodal retrieval, audio, and video modalities are not addressed.
- The discriminative capacity of the joint encoder leaves room for improvement; replacing MiniLM with a larger or more capable language model is worth exploring.
- Performance degrades noticeably at 32 tokens; more aggressive compression methods deserve investigation.
- Gains over DFN and MetaCLIP are smaller than those over CLIP and SigLIP2 (DFN was fine-tuned as a filtering network on Flickr).
- Downstream applications where joint encoders excel—such as zero-shot classification and large-scale data filtering—are not explored.
Related Work & Insights
- Relation to BLIP family: EDJE can be viewed as an efficient replacement for BLIP's re-ranking capability; the core contribution is shifting visual feature extraction from online to offline.
- Connection to ColBERT: The approach shares conceptual similarity with ColBERT's token-level offline storage in text retrieval, but adds a compression dimension.
- Relation to Q-Former: The token compression layer resembles BLIP-2's Q-Former but is lighter and focuses on compression rather than generation.
- Comparison with LightningDOT: LightningDOT performs re-ranking with region features but compresses each region to a single vector, making it closer in nature to an embedding model than a true joint encoder.
Rating
- Novelty: ⭐⭐⭐⭐ The core idea (offline visual features + compressed tokens) is intuitively clear, though each component has precedents.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Validation spans multiple backbones, with detailed ablations, semantic visualization, and comprehensive efficiency analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Problem formulation is precise, motivation is exceptionally clear, and bottleneck analysis is data-driven.
- Value: ⭐⭐⭐⭐⭐ High practical value; fills the gap left by the absence of joint-encoder re-rankers in multimodal retrieval.