VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval
Conference: ACL 2026
arXiv: 2505.20291
Code: GitHub
Area: Image Generation
Keywords: Text-to-Image Retrieval, Visualized Query, Cross-modal Alignment, Retrieval-Augmented Generation, Modality Projection
TL;DR
This paper proposes Visualize-then-Retrieve (VisRet), a new paradigm that visualizes text queries into images via T2I generation models before performing retrieval within the image modality. It achieves an average nDCG@30 improvement of 0.125 (CLIP) and 0.121 (E5-V) across four benchmarks, with a 15.7% increase in downstream VQA accuracy on Visual-RAG-ME.
Background & Motivation
Background: Text-to-image (T2I) retrieval is a critical component for knowledge-intensive applications. Common methods rank candidate images by their similarity to text queries in a shared embedding space. While cross-modal embedding models (e.g., CLIP, E5-V) continue to improve, cross-modal similarity alignment still faces fundamental challenges.
Limitations of Prior Work: Cross-modal embeddings often behave as "bags of concepts," failing to capture structured visual relationships such as pose, perspective, and spatial layout. For instance, for a query like "a bar-headed goose with wings spread," embedding models may match the species but fail to recognize subtle visual features like wing posture or low-angle perspective. Current improvements (query rewriting, multi-stage re-ranking) remain constrained by the inherent difficulty of cross-modal similarity alignment.
Key Challenge: Text is inherently limited in its ability to exhaustively describe complex visual-spatial relationships, and cross-modal retrievers have an intrinsic weakness in identifying subtle visual-spatial features. Encoding all visual requirements into a text query may even harm retrieval performance due to embedding quality constraints.
Goal: To propose a retrieval paradigm that bypasses the weaknesses of cross-modal similarity matching by projecting text queries into the image modality, leveraging the superior performance of retrievers in single-modality retrieval.
Key Insight: Visualization provides a more intuitive and expressive medium than text for expressing compositional concepts (entity + pose + spatial relations). Performing retrieval within the image modality avoids the pitfalls of cross-modal retrievers and exploits their stronger capabilities in single-modality tasks.
Core Idea: Decompose T2I retrieval into two stages: "text-to-image modality projection" and "image-to-image in-modality retrieval." A T2I generation model visualizes the text query, and the generated images are then used directly as queries for image-to-image retrieval.
Method
Overall Architecture
VisRet consists of two stages: (1) Modality Projection: An LLM transforms the original text query into T2I instructions, and a T2I generation model generates \(m\) visualization images \(\{v_1,\ldots,v_m\} \equiv \mathcal{T}(q)\); (2) In-modality Retrieval: Each generated image is used independently for retrieval, and results are aggregated via Reciprocal Rank Fusion (RRF) to produce the final ranked list.
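The two stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `draft_instruction`, `generate_images`, and `retrieve` are hypothetical callables standing in for the LLM, the T2I model, and the image retriever, and the default `lam=60.0` is a commonly used RRF constant that the source does not specify.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Sequence


def rrf_fuse(ranked_lists: Sequence[Sequence[str]], lam: float = 60.0) -> Dict[str, float]:
    """Reciprocal Rank Fusion: sum 1 / (lam + rank) over all lists (ranks start at 1)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (lam + rank)
    return dict(scores)


def visualize_then_retrieve(
    query: str,
    draft_instruction: Callable[[str], str],         # hypothetical: LLM turns q into a T2I instruction q'
    generate_images: Callable[[str, int], List[Any]],  # hypothetical: T2I model returns m images for q'
    retrieve: Callable[[Any], List[str]],             # hypothetical: image-to-image retriever, returns ranked ids
    m: int = 3,
    k: int = 10,
    lam: float = 60.0,                                # assumed value; not given in the source
) -> List[str]:
    # Stage 1: modality projection (text query -> m generated images).
    instruction = draft_instruction(query)
    images = generate_images(instruction, m)
    # Stage 2: one in-modality retrieval per generated image, then RRF fusion.
    ranked_lists = [retrieve(image) for image in images]
    scores = rrf_fuse(ranked_lists, lam)
    # Sort by descending fused score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]
```

Passing the three components in as callables mirrors the plug-and-play framing: the retriever and its index are used as-is, and only the query side changes.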
Key Designs
- Modality Projection:
  - Function: Converts text queries into image queries to make key visual-spatial requirements explicit.
  - Mechanism: Given an original query \(q\), an LLM drafts T2I instructions \(q'\) in the text space, describing images that satisfy the latent features of \(q\). A T2I generation method (e.g., Stable Diffusion) then projects \(q'\) into \(m\) images \(\{v_1,\ldots,v_m\}\). Sampling the generator multiple times introduces diversity.
  - Design Motivation: A visualized query can depict the required entities, poses, and perspectives all at once, whereas encoding them solely in text leaves retrieval limited by cross-modal matching quality.
- In-modality Retrieval and RRF Aggregation:
  - Function: Completes retrieval within the image modality and aggregates results from multiple visualizations.
  - Mechanism: Each generated image \(v_i\) independently retrieves a ranked list \(\mathcal{R}(v_i, \mathcal{I})\) from the image corpus \(\mathcal{I}\). These \(m\) lists are fused using RRF: \(\text{score}_{\text{RRF}}(r) = \sum_{i=1}^{m} \frac{1}{\lambda + \text{rank}_i(r)}\), where \(\text{rank}_i(r)\) is the rank of candidate \(r\) in the \(i\)-th list and \(\lambda\) controls the impact of low-ranking items. The \(k\) highest-scoring results are returned.
  - Design Motivation: Operating entirely within the image modality avoids cross-modal retriever weaknesses. Multi-image aggregation increases query diversity.
- Visual-RAG-ME Benchmark Construction:
  - Function: Provides a retrieval evaluation benchmark for comparing visual features of multiple entities.
  - Mechanism: Extends Visual-RAG by constructing questions that compare visual features of two biologically similar entities (e.g., which one has a lighter color or smoother surface). Candidates are identified via BM25, comparison questions are manually constructed, and labels are retrieved from iNaturalist annotations. It contains 50 high-quality queries.
  - Design Motivation: Existing benchmarks primarily evaluate single-entity retrieval and lack scenarios requiring reasoning across visual features of multiple entities, which is a significant challenge for T2I retrieval.
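The in-modality retrieval step from the designs above reduces to nearest-neighbor search between image embeddings. The sketch below is an illustration only: it assumes the generated query image and the corpus images have already been embedded by the same image encoder (for example, the image tower of CLIP), uses cosine similarity as the scoring function, and works on made-up toy vectors. The function names and the `top_n` parameter are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_in_modality(
    query_embedding: Sequence[float],
    index: Dict[str, Sequence[float]],  # image id -> pre-computed image embedding
    top_n: int = 30,
) -> List[str]:
    """Rank corpus images by similarity to the embedding of one generated query image."""
    scored = [(cosine(query_embedding, emb), image_id) for image_id, emb in index.items()]
    # Highest similarity first; ties broken by id for a deterministic order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [image_id for _, image_id in scored[:top_n]]


# Toy corpus with 2-dimensional embeddings (illustrative values only).
toy_index = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0], "img_c": [1.0, 1.0]}
ranking = retrieve_in_modality([1.0, 0.1], toy_index, top_n=2)
```

Because the corpus index stores only image embeddings, the same index can serve both the original cross-modal setup and this image-to-image setup; only the query embedding changes.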
Loss & Training
VisRet is a training-free, plug-and-play method: it requires no changes to the retriever or to pre-computed image embedding indices. The only added cost per query is one LLM call to draft the T2I instructions and the T2I generation of the query images.
Key Experimental Results
Main Results
nDCG@30 across four benchmarks (CLIP Retriever)
| Method | Visual-RAG | Visual-RAG-ME | INQUIRE-Rerank-Hard | COCO-Hard |
|---|---|---|---|---|
| Original Query | 0.385 | 0.435 | 0.412 | 0.042 |
| LLM Rewriting | 0.395 | 0.572 | 0.407 | 0.093 |
| Corpus Captioning (BLIP) | 0.271 | 0.371 | 0.401 | 0.153 |
| VISA Reranking | 0.388 | 0.457 | 0.000 | 0.000 |
| VisRet | 0.438 | 0.605 | 0.455 | 0.108 |
nDCG@30 across four benchmarks (E5-V Retriever)
| Method | Visual-RAG | Visual-RAG-ME | INQUIRE-Rerank-Hard | COCO-Hard |
|---|---|---|---|---|
| Original Query | 0.407 | 0.486 | 0.407 | 0.178 |
| LLM Rewriting | 0.391 | 0.566 | 0.412 | 0.182 |
| VisRet | 0.461 | 0.622 | 0.425 | 0.205 |
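For readers unfamiliar with the metric in the tables above, nDCG@k can be computed as in the following sketch. It assumes binary relevance labels and the standard \(\log_2\) rank discount; the source does not say whether the benchmarks use binary or graded relevance, so treat that as an assumption. The function names are illustrative.

```python
import math
from typing import Sequence, Set


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (positions start at 1)."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG@k with binary gains: 1.0 for a relevant image, 0.0 otherwise."""
    gains = [1.0 if image_id in relevant else 0.0 for image_id in ranked_ids]
    # The ideal ranking places every relevant image first.
    ideal_gains = [1.0] * min(len(relevant), k)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0.0 else 0.0


# Two relevant images, retrieved at ranks 1 and 3 (illustrative ids).
example = ndcg_at_k(["a", "b", "c"], {"a", "c"}, k=3)
```

A query whose relevant images all appear at the top of the list scores 1.0, and the score decays as relevant images slide down the ranking, which is why small rank changes near the top move nDCG@1 and nDCG@10 more than nDCG@30.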
Ablation Study
Impact of T2I Generation Models on Visual-RAG-ME Performance (CLIP Retriever)
| T2I Model | nDCG@1 | nDCG@10 | nDCG@30 |
|---|---|---|---|
| Stable Diffusion 3.5 | 0.270 | 0.467 | 0.484 |
| FLUX.1-dev | 0.320 | 0.501 | 0.494 |
| DALL-E 3 | 0.346 | 0.554 | 0.553 |
| gpt-image-1 (high quality) | 0.460 | 0.632 | 0.605 |
Multi-image Aggregation vs. Single-image (CLIP Retriever)
| Benchmark | nDCG@30 (3 images) | nDCG@30 (1 image) |
|---|---|---|
| Visual-RAG | 0.438 | 0.425 |
| Visual-RAG-ME | 0.605 | 0.602 |
Key Findings
- VisRet achieves an average nDCG@10 improvement of 0.109 (38%↑) for the CLIP retriever and 0.078 (23%↑) for E5-V.
- Downstream VQA accuracy: using the top-1 retrieved image as context yields gains of 3.8% on Visual-RAG and 15.7% on Visual-RAG-ME.
- T2I generation quality is a key bottleneck: on Visual-RAG-ME with the CLIP retriever, gpt-image-1 reaches an nDCG@30 of 0.605 versus 0.484 for Stable Diffusion 3.5. Failure modes include lack of focus, factual errors, and poor instruction following.
- Single-image visualization only slightly reduces performance (nDCG@30 drops from 0.438 to 0.425 on Visual-RAG and from 0.605 to 0.602 on Visual-RAG-ME); the benefit of multi-image aggregation comes from increased query diversity.
- While visualized queries improve retrieval, they cannot replace real images as independent knowledge sources.
Highlights & Insights
- Novel and Practical: Bypasses fundamental cross-modal alignment difficulties through "visualize-then-retrieve," offering a simple yet powerful perspective.
- Training-free and Plug-and-play: No retraining of retrievers or modification of existing infrastructure is required, allowing direct use of existing image embedding indices.
- Visual-RAG-ME benchmark fills the gap in evaluating retrieval for multi-entity visual feature comparison.
- VisRet is roughly 5× faster than VISA re-ranking in measured latency, because VISA requires an LVLM to process the top-k candidates.
Limitations & Future Work
- Performance highly depends on T2I generation quality; gains from weaker models (e.g., Stable Diffusion) are limited.
- Generated images may contain factual errors (e.g., inaccurate species appearance), affecting retrieval quality.
- Evaluation is primarily focused on the natural species domain; effectiveness in other knowledge-intensive fields (e.g., medicine, architecture) remains to be verified.
- The computational cost of T2I generation is higher than simple query rewriting.
Related Work & Insights
- vs LLM Query Rewriting: Query rewriting still matches in the text-image cross-modal space, whereas VisRet shifts completely into the image modality to avoid cross-modal weaknesses.
- vs VISA Reranking: VISA relies on an LVLM to process the top-k candidates, so its cost scales linearly with \(k\) and its performance is capped by the quality of the initial retrieval; VisRet instead changes the query modality itself.
- vs Corpus Captioning: Converting images to text results in information loss, which is particularly detrimental in knowledge-intensive scenarios.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Reframing T2I retrieval as "visualization + image-to-image retrieval" is a unique and elegant perspective.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across four benchmarks, two retrievers, multiple ablations, and downstream VQA.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation, concise methodology, and intuitive charts.
- Value: ⭐⭐⭐⭐ Provides a new paradigm for knowledge-intensive T2I retrieval, with practical utility enhanced by its training-free nature.