Revela: Dense Retriever Learning via Language Modeling¶
Conference: ICLR 2026
arXiv: 2506.16552
Code: To be confirmed
Area: Information Retrieval
Keywords: dense retrieval, self-supervised learning, language modeling, in-batch attention, retriever
TL;DR¶
This paper proposes Revela, which integrates retriever learning into language modeling via an in-batch attention mechanism. Next-token prediction (NTP) draws not only on within-sequence context but also on other sequences in the batch, weighted by retriever similarity scores, enabling training of a strong dense retriever without labeled query-document pairs.
Background & Motivation¶
Background: Dense retrievers typically require labeled query-document pairs for training, making annotation expensive in specialized domains and complex scenarios.
Limitations of Prior Work: Self-supervised retrieval methods (e.g., Contriever) tend to overfit structural biases in data; auto-encoding approaches lack pairwise supervision.
Key Challenge: LMs learn inter-token dependencies via NTP (a successful self-supervised paradigm). The key question is how to extend a similar idea to learning inter-chunk dependencies.
Key Insight: Retrieval is framed as "sequence-level NTP"—NTP identifies the most relevant preceding tokens, while retrieval identifies the most relevant documents.
Core Idea: In-batch attention is introduced into Transformer blocks, allowing NTP to leverage both within-sequence context and other sequences in the batch, with cross-sequence weights provided by the retriever.
Method¶
Overall Architecture¶
Documents are split into chunks and placed in the same batch. The retriever computes inter-chunk similarity, which modulates in-batch attention within the LM's Transformer blocks. The NTP loss jointly optimizes both the LM and the retriever.
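As a rough illustration of the batching scheme, here is a minimal sketch of chunking and batch construction; the chunk length, chunks-per-document count, and batch size are illustrative placeholders rather than the paper's settings, and whitespace splitting stands in for the real tokenizer.

```python
from typing import List

def chunk_document(text: str, chunk_len: int = 256) -> List[str]:
    """Split a document into fixed-length chunks (whitespace tokens stand in
    for a real tokenizer; the paper's exact chunking strategy may differ)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_len]) for i in range(0, len(words), chunk_len)]

def build_batch(documents: List[str], chunks_per_doc: int = 4, batch_size: int = 16) -> List[str]:
    """Fill one training batch with chunks, keeping several chunks of the same
    document together so they act like hard negatives for each other."""
    batch: List[str] = []
    for doc in documents:
        chunks = chunk_document(doc)[:chunks_per_doc]
        if len(batch) + len(chunks) > batch_size:
            break
        batch.extend(chunks)
    return batch
```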
Key Designs¶
- In-batch Attention: The embedding of sequence \(i\) can attend to the embeddings of other sequences \(j\) in the batch, with weights modulated by the retriever-computed \(\text{Sim}(D_i, D_j)\) (see the sketch after this list).
- Joint Optimization: The retriever and LM are trained end-to-end under a shared NTP objective.
- Same-document Negatives: Different chunks from the same document are placed in the same batch, functioning similarly to hard negatives.
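The sketch below (PyTorch, simplified) shows the two central pieces: retrieval-weighted mixing of in-batch context and the shared NTP loss. It is a deliberate simplification of the paper's design: Revela places in-batch attention inside every Transformer block, whereas this sketch applies a single retrieval-weighted mixing step to mean-pooled hidden states; the pooling choice, `tau`, and the function names are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def in_batch_mix(hidden: torch.Tensor, chunk_emb: torch.Tensor, tau: float = 1e-4) -> torch.Tensor:
    """Let each chunk's hidden states absorb context from other chunks in the batch,
    weighted by retriever similarity (self-similarity is masked out).

    hidden:    (B, T, d) LM hidden states for B chunks of T tokens each.
    chunk_emb: (B, d) retriever embeddings of the same chunks.
    """
    B = hidden.size(0)
    sim = chunk_emb @ chunk_emb.T                           # (B, B) Sim(D_i, D_j)
    self_mask = torch.eye(B, dtype=torch.bool, device=sim.device)
    weights = F.softmax(sim.masked_fill(self_mask, float("-inf")) / tau, dim=-1)
    context = hidden.mean(dim=1)                            # (B, d) pooled per-chunk context
    mixed = weights @ context                               # (B, d) retrieval-weighted context
    return hidden + mixed.unsqueeze(1)                      # broadcast onto every token position

def ntp_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Standard next-token prediction loss: position t predicts token t + 1."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
```

Because `weights` is a differentiable function of the retriever embeddings, backpropagating the NTP loss through the mixed hidden states updates the retriever and the LM together, which is the joint optimization described above.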
Loss & Training¶
- Training is conducted on Wikipedia (general retrieval) or code corpora (StackOverflow + documentation) for code retrieval.
- Both the retriever and LM are fine-tuned with LoRA (rank=256); temperature \(\tau=10^{-4}\), learning rate \(10^{-4}\).
- Training runs for only 1 epoch: ~10K steps on Wikipedia (~44 hours) and ~11K steps on code (~48 hours), using 4×A100 GPUs.
- At inference, queries and documents support up to 2048 tokens; the <eos> token embedding serves as the document representation (see the sketch after this list).
- "Query:" and "Passage:" prefixes are prepended to distinguish queries from passages.
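A minimal sketch of this inference-time encoding convention, assuming a Hugging Face-style tokenizer and causal LM with right padding and an appended <eos> token; the function name and arguments are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed(texts, tokenizer, model, is_query: bool, max_len: int = 2048) -> torch.Tensor:
    """Encode queries or passages and pool the last non-padding (<eos>) position."""
    prefix = "Query: " if is_query else "Passage: "
    enc = tokenizer(
        [prefix + t for t in texts],
        padding=True, truncation=True, max_length=max_len, return_tensors="pt",
    )
    hidden = model(**enc, output_hidden_states=True).hidden_states[-1]  # (B, T, d)
    eos_pos = enc["attention_mask"].sum(dim=1) - 1   # last real token; assumed to be <eos>
    reps = hidden[torch.arange(hidden.size(0)), eos_pos]                # (B, d)
    return F.normalize(reps, dim=-1)

# Usage: rank passages for each query by the dot product of normalized embeddings.
# scores = embed(queries, tok, lm, is_query=True) @ embed(passages, tok, lm, is_query=False).T
```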
Key Experimental Results¶
Main Results¶
| Method | CoIR (nDCG@10) | BRIGHT | BEIR |
|---|---|---|---|
| E5-Mistral-7B (supervised) | baseline | baseline | baseline |
| Revela-3B (unsupervised) | +2.8 over the baseline | surpasses commercial APIs | unsupervised SOTA |
Key Findings¶
- Revela-3B, trained without labeled data, outperforms the supervised 7B-parameter E5-Mistral on CoIR.
- It achieves unsupervised SOTA on BEIR using ~1000× less data and ~10× less compute.
- Cross-domain generalization is stronger than contrastive learning methods.
- Performance scales consistently with batch size and model size.
Ablation Study¶
| Ablation / Analysis | Finding |
|---|---|
| Batch size scaling | Performance improves monotonically with batch size; larger batches provide more negatives. |
| Retriever size scaling | Consistent improvement from 135M to 3B, following a scaling law. |
| LM size | A larger LM (1B→3B) provides a stronger learning signal for the retriever. |
| Mixed-domain training | Joint training on Wikipedia + Code does not hurt single-domain performance while improving cross-domain generalization. |
| vs. REPLUG | Revela outperforms REPLUG at all scales; joint optimization is superior to a frozen LM. |
| Cross-domain generalization | Revela trained on Wikipedia surpasses commercial APIs on unseen reasoning-intensive tasks in BRIGHT. |
Detailed CoIR Results (nDCG@10)¶
| Method | Scale | Avg. nDCG@10 (Δ vs. the baseline row above) |
|---|---|---|
| UniXCoder (supervised) | 0.1B | baseline |
| Revela | 0.1B | +11.1 |
| E5-PT (weakly supervised, 270M pairs) | 0.3B | baseline |
| Revela | 0.5B | +9.7 |
| BGE-M3 (supervised) | 0.6B | baseline |
| Revela | 0.5B | surpasses |
| E5-Mistral-7B (supervised) | 7B | baseline |
| Revela | 3B | +2.8 |
Highlights & Insights¶
- The NTP→retrieval analogy is both natural and effective: inter-token dependencies ↔ inter-document dependencies.
- Power of joint optimization: REPLUG relies on perplexity from a frozen LM, which is often poorly calibrated; Revela's joint updates address this fundamental issue.
- Remarkable data efficiency: ~1000× less data and ~10× less compute suffice to achieve unsupervised SOTA on BEIR—demonstrating that method design matters more than data scaling.
- Cross-domain generalization: Stronger than traditional contrastive learning methods, as the NTP objective captures more general "semantic dependencies" rather than superficial co-occurrence.
Limitations & Future Work¶
- Performance is sensitive to batch size; a sufficiently large batch (16+) is required, which may become a bottleneck under resource constraints.
- In-batch attention increases training-time computational overhead, as each sequence must attend to all other sequences in the batch.
- Only text and code retrieval have been validated; multimodal retrieval (image, audio, etc.) remains unexplored.
- The effect of chunk segmentation strategies (sentence boundaries, fixed length, etc.) on performance has not been thoroughly analyzed.
- For very long documents (>2048 tokens), the current chunking approach may lose long-range dependencies.
Related Work & Insights¶
- vs. Contriever (Izacard et al.): Contriever uses contrastive learning (same document = positive, cross-document = negative); Revela models cross-document conditional probabilities via NTP, capturing "why two documents are related" more precisely.
- vs. Atlas (Izacard et al.): Atlas trains the retriever using cross-attention signals from an encoder-decoder architecture and requires periodic re-indexing; Revela uses a decoder-only model with in-batch attention, which is more efficient and requires no re-indexing.
- vs. REPLUG (Shi et al.): REPLUG distills a frozen LM via perplexity; Revela trains jointly—experiments confirm joint training is significantly superior at all scales.
- vs. E5-PT (Wang et al.): E5-PT trains on 270M weakly supervised pairs, whereas Revela learns from raw text alone—demonstrating that a well-designed objective can substitute for large volumes of labeled data.
- Broader Inspiration: The in-batch attention paradigm can generalize to any setting requiring modeling of "intra-set relationships," such as multi-document summarization and cross-modal retrieval.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The paradigm of learning retrieval via NTP is highly original.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three benchmarks, multi-scale evaluation, and scaling analysis.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clear and formulations are rigorous.
- Value: ⭐⭐⭐⭐⭐ Provides a powerful new paradigm for self-supervised retrieval.