NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

Conference: CVPR 2026 | arXiv: 2603.12824 | Code: None (Aalto University; data and model not released) | Area: Information Retrieval | Keywords: Knowledge Distillation, Visual Document Retrieval, Asymmetric Encoding, VLM Compression, Cross-modal Transfer

TL;DR

NanoVDR exploits the inherent asymmetry between queries and documents to distill a 2B-parameter VLM document retriever into a 69M text-only query encoder via pointwise cosine alignment. The student retains 95.1% of the teacher's performance across the ViDoRe benchmarks, reduces query latency by 50×, and requires only 13 GPU hours to train.

Background & Motivation

Visual Document Retrieval (VDR) treats document pages as images and employs VLMs to encode both queries and document pages into a shared embedding space. State-of-the-art systems (e.g., ColPali, DSE-Qwen2, Tomoro-8B) use billion-parameter VLMs to encode both sides symmetrically. However, this design imposes unnecessary symmetry: while documents contain complex visual content requiring strong visual understanding, queries are merely short text strings with no visual information whatsoever. Encoding a plain-text query with a 2B-parameter VLM entirely wastes the model's visual processing capacity, while incurring query latencies of several seconds and mandating GPU inference.

Core Problem

VDR systems use the same heavy VLM for both query and document encoding, resulting in prohibitively high online query inference costs (latency >2 s, GPU required). Can the asymmetry—queries are pure text—be exploited to distill the VLM's query encoding capability into a lightweight text-only model that supports real-time CPU inference?

Method

Overall Architecture

NanoVDR decouples the retrieval pipeline into two paths: (1) Offline document indexing: a frozen 2B VLM teacher (Qwen3-VL-Embedding-2B) encodes each document page into a 2048-dimensional single-vector embedding; (2) Online query encoding: a distilled lightweight text-only student model (DistilBERT/BERT/ModernBERT + MLP projector) maps query text into the teacher's embedding space. Retrieval scores are computed via cosine similarity.
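
The resulting split at inference time can be summarized with a minimal PyTorch sketch. This is illustrative rather than the released code: `student`, `tokenizer`, and the random `doc_embeddings` tensor are placeholders standing in for the distilled query encoder and the teacher-built index.

```python
import torch
import torch.nn.functional as F

# Offline (once per corpus): the frozen 2B VLM teacher encodes each document page into a
# single 2048-d vector; a random tensor stands in for those cached teacher embeddings here.
doc_embeddings = F.normalize(torch.randn(10_000, 2048), dim=-1)

# Online (per query): the lightweight text-only student maps the query into the same space.
def encode_query(student, tokenizer, query: str) -> torch.Tensor:
    inputs = tokenizer(query, return_tensors="pt", truncation=True)
    with torch.no_grad():
        q = student(**inputs)                        # (1, 2048) after the MLP projector
    return F.normalize(q, dim=-1)

def retrieve(query_emb: torch.Tensor, k: int = 5):
    # On L2-normalized vectors, cosine similarity reduces to a dot product.
    scores = doc_embeddings @ query_emb.squeeze(0)   # (num_docs,)
    return torch.topk(scores, k)                     # top-k scores and document indices
```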

Key Designs

  1. Query-Centric Distillation: Training proceeds in two steps—the frozen VLM teacher first encodes all training queries in text-only mode, caching target embeddings \(\mathbf{v}_t^Q\); the student text encoder is then trained to align its output \(\mathbf{v}_s^Q\) with the teacher. The loss is remarkably simple: \(\mathcal{L}_\text{align} = 1 - \cos(\mathbf{v}_s^Q, \mathbf{v}_t^Q)\). The key insight is that the entire training procedure requires no document images—because the teacher maps both queries and documents into the same space, a student that learns to align query embeddings automatically acquires the ability to retrieve against document embeddings.

  2. Pointwise Alignment Outperforms Ranking-Based Losses: The paper systematically compares six distillation objectives (pure alignment, pure ranking KL divergence, their combinations, and InfoNCE). The conclusion is counterintuitive: NDCG@5 increases monotonically as the alignment weight increases. Pure alignment outperforms pure ranking by +1.1/+4.0/+2.5 on v1/v2/v3, respectively. The authors attribute this to the richer geometric structure encoded in a high-quality teacher embedding space compared to relative ranking alone. Practically, alignment requires only caching teacher query embeddings (1 GPU hour), whereas ranking additionally requires caching document embeddings (24 GPU hours), making alignment superior in both cost and accuracy.

  3. Multilingual Query Augmentation: Analysis reveals that the primary distillation bottleneck is cross-lingual transfer (DistilBERT is predominantly English-trained) rather than cross-modal transfer—English query retention reaches 94.3%, while Portuguese retention falls to only 75.6%. The remedy is lightweight: approximately 490K English queries are translated into five target languages using Helsinki-NLP Opus-MT, and the frozen teacher encodes the translated queries to produce new target embeddings. The entire augmentation involves only text data, with no images required. After augmentation, the cross-lingual performance gap narrows from 18.6 pp to 2.7 pp.

  4. Student Architecture: A pretrained text backbone with mean pooling and a two-layer MLP projector (768→768→2048). Three scales are explored: NanoVDR-S (DistilBERT, 69M), NanoVDR-M (BERT-base, 112M), and NanoVDR-L (ModernBERT-base, 151M). Larger backbones yield only marginal gains, indicating that query encoding does not require large model capacity. A code sketch of the student and the alignment loss follows this list.
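
Below is a minimal sketch of the student from point 4 and the alignment objective from point 1, assuming a Hugging Face DistilBERT backbone. The GELU activation inside the projector is an assumption; the paper specifies only the layer sizes (768→768→2048), mean pooling, and the cosine alignment loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class NanoVDRStudent(nn.Module):
    def __init__(self, backbone_name: str = "distilbert-base-uncased", teacher_dim: int = 2048):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size          # 768 for DistilBERT / BERT-base
        self.projector = nn.Sequential(                    # two-layer MLP projector: 768 -> 768 -> 2048
            nn.Linear(hidden, hidden),
            nn.GELU(),                                      # activation choice is an assumption
            nn.Linear(hidden, teacher_dim),
        )

    def forward(self, input_ids, attention_mask):
        hidden_states = self.backbone(input_ids=input_ids,
                                      attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-6)  # mean pooling over tokens
        return self.projector(pooled)                       # (batch, 2048), in the teacher's space

def alignment_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    # L_align = 1 - cos(v_s^Q, v_t^Q), averaged over a batch of cached teacher query embeddings
    return (1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()
```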

Loss & Training

  • Training loss: pure pointwise cosine alignment \(\mathcal{L}_\text{align} = 1 - \cos(\mathbf{v}_s^Q, \mathbf{v}_t^Q)\)
  • OneCycleLR schedule (peak lr=2e-4, 3% warmup), batch size 256, gradient accumulation over 4 steps (effective batch size 1024)
  • 20 epochs, single GPU, 10–12 hours
  • Multilingual augmentation variant: dataset size doubled, epochs halved (10), lr slightly increased to 3e-4 (the base training setup is sketched below)
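
A sketch of this training setup, reusing NanoVDRStudent and alignment_loss from the sketch above. The AdamW optimizer and the train_loader (yielding tokenized query batches paired with their cached teacher embeddings) are assumptions; the hyperparameters follow the bullets above.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

EPOCHS, ACCUM_STEPS = 20, 4                       # batch size 256 x 4 accumulation = effective 1024
steps_per_epoch = len(train_loader)               # assumed loader of (query batch, teacher embedding)

student = NanoVDRStudent()
optimizer = AdamW(student.parameters(), lr=2e-4)  # optimizer choice is an assumption
scheduler = OneCycleLR(optimizer, max_lr=2e-4, pct_start=0.03,   # peak lr 2e-4, 3% warmup
                       total_steps=EPOCHS * steps_per_epoch // ACCUM_STEPS)

for epoch in range(EPOCHS):
    for step, (batch, teacher_emb) in enumerate(train_loader):
        student_emb = student(**batch)                            # (256, 2048)
        loss = alignment_loss(student_emb, teacher_emb) / ACCUM_STEPS
        loss.backward()
        if (step + 1) % ACCUM_STEPS == 0:                         # update every 4 micro-batches
            optimizer.step()
            scheduler.step()                                      # OneCycleLR steps once per update
            optimizer.zero_grad()
```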

Key Experimental Results

| Model | Parameters | ViDoRe v1 | ViDoRe v2 | ViDoRe v3 | Query Latency (CPU) |
|---|---|---|---|---|---|
| Tomoro-8B | 8.0B | 90.6 | 65.0 | 59.0 | 8,225 ms |
| Teacher (Qwen3-VL-2B) | 2.2B | 84.3 | 65.3 | 50.0 | |
| DSE-Qwen2 | 2.2B | 85.1 | 55.7 | 41.3 | 2,539 ms |
| ColPali | 3.0B | 84.2 | 54.7 | 42.0 | 7,284 ms |
| NanoVDR-S | 69M | 82.2 | 60.5 | 43.5 | 51 ms |
| NanoVDR-S-Multi | 69M | 82.2 | 61.9 | 46.5 | 51 ms |
| NanoVDR-L | 151M | 82.4 | 61.5 | 44.2 | 109 ms |

Key figures: NanoVDR-S-Multi retains 95.1% of teacher performance, achieves 50× lower CPU latency, and uses 32× fewer parameters.

Ablation Study

  • Loss function: Pure alignment uniformly outperforms pure ranking (+1.1/+4.0/+2.5); InfoNCE (hard labels) collapses in performance (−10.7/−21.6/−14.1)
  • Data efficiency: 25% of training data suffices to achieve 93% retention on v1; even 10% data yields 79% retention
  • Language bottleneck: English retention is 94.3%; Portuguese (entirely absent from the training set) is only 75.6%. After augmentation, all languages exceed 92%
  • Teacher quality vs. cosine similarity: Teacher quality is the strongest predictor of distillation success (r=+0.607), while student–teacher cosine similarity is nearly uncorrelated with downstream performance (r=+0.094), suggesting that the geometric structure of the embedding space matters more than pointwise alignment accuracy
  • Model scale: Differences among 69M, 112M, and 151M are negligible, confirming that query encoding does not benefit from larger models

Highlights & Insights

  • Extreme simplicity: The entire method can be summarized in one sentence—run the frozen teacher forward once to obtain query embeddings, then train a small model to perform cosine alignment. No complex distillation strategies, no negative samples, no image processing
  • Deep insight into asymmetry: The fundamental difference between queries and documents (text vs. vision) is translated into an asymmetric system design. While this observation seems obvious in hindsight, many complex systems overlook it
  • Alignment > Ranking: In a sufficiently high-quality teacher embedding space, directly aligning embedding coordinates proves more effective than matching ranking distributions—a finding with broad implications for the retrieval distillation community
  • 13 GPU-hour training cost: Compared to VLM training that routinely requires hundreds of GPU hours, the practical accessibility of this approach is remarkable

Limitations & Future Work

  • Performance is upper-bounded by the teacher; the student can never surpass the teacher
  • Offline document indexing still requires the full 2B VLM; indexing cost is not reduced
  • Only pure-text query scenarios are validated; multimodal queries (e.g., image-accompanied queries) remain unexplored
  • Multilingual augmentation relies on machine translation quality and may introduce errors in terminology-heavy domains
  • No thorough comparison with contemporary lightweight VLMs such as ModernVBERT

Comparison with Related Work

  • ColPali/Tomoro: Use multi-vector representations with MaxSim, yielding high quality but extremely high latency (7–8 s) and enormous index storage (256–819 GB/M). NanoVDR uses single-vector cosine similarity, achieving 51 ms latency and 8.2 GB/M index size
  • DSE-Qwen2: Also uses single-vector representations, but encodes queries with a 2B VLM (2.5 s latency). NanoVDR outperforms DSE-Qwen2 on v2/v3 (benefiting from a better teacher) while using 32× fewer parameters
  • ModernVBERT: A 250M vision-language encoder requiring the full model for both query and document encoding. NanoVDR's query side requires no visual module whatsoever
  • SERVAL: Generates document descriptions via a VLM and then indexes them with a text encoder, incurring the massive inference cost of a 72B VLM + 7B encoder. NanoVDR directly distills the embedding space, which is more straightforward and efficient

Broader Implications

  • The "asymmetric distillation" paradigm generalizes to many settings, such as recommendation systems where item embeddings are computed offline with a large model and user embeddings are computed online with a lightweight one
  • If the finding that pointwise alignment outperforms ranking-based distillation holds in other retrieval tasks (text retrieval, code retrieval), it could reshape best practices in retrieval distillation
  • The observation that cross-lingual transfer, rather than cross-modal transfer, is the primary bottleneck offers broadly applicable guidance for multimodal model compression research

Rating

  • Novelty: ⭐⭐⭐⭐ The asymmetric distillation paradigm is novel and the alignment > ranking finding is valuable, though the overall method is relatively straightforward
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 22 datasets × 3 benchmark versions × 6 loss functions × 3 backbones; ablations are exhaustive and the cross-lingual analysis is convincing
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, every claim supported by data, thorough appendix, strong reproducibility
  • Value: ⭐⭐⭐⭐⭐ Directly addresses the core deployment bottleneck of VDR systems (latency and cost); a 69M model with CPU inference offers exceptional practical utility