VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval

Conference: ACL 2026 | arXiv: 2505.20291 | Code: GitHub | Area: Image Generation | Keywords: Text-to-image retrieval, query visualization, cross-modal alignment, retrieval-augmented generation, modality projection

TL;DR

This paper proposes Visualize-then-Retrieve (VisRet), a novel retrieval paradigm that first visualizes a text query into images via a T2I generative model and then performs retrieval within the image modality. VisRet achieves an average nDCG@30 improvement of 0.125 (CLIP) and 0.121 (E5-V) across four benchmarks, and improves downstream VQA accuracy by 15.7% on Visual-RAG-ME.

Background & Motivation

Background: Text-to-image (T2I) retrieval is a critical component in knowledge-intensive applications. Common approaches embed text queries and candidate images into a shared representation space and rank by similarity. Despite continuous advances in cross-modal embedding models (e.g., CLIP, E5-V), cross-modal similarity alignment remains a fundamental challenge.

Limitations of Prior Work: Cross-modal embeddings often behave as "bags of concepts," failing to capture structured visual relationships such as pose, viewpoint, and spatial layout. For instance, when querying "a bar-headed goose with wings spread," embedding models can match the species but fail to recognize subtle visual features such as wing posture and the upward-looking angle. Existing improvements (query rewriting, multi-stage reranking) remain constrained by the inherent difficulty of cross-modal similarity alignment.

Key Challenge: Text is inherently insufficient for exhaustively describing complex visual-spatial relationships, and cross-modal retrievers exhibit an intrinsic weakness in recognizing fine-grained visual-spatial features. Encoding all visual requirements into a text query may actually degrade retrieval performance due to embedding quality limitations.

Goal: To propose a retrieval paradigm that bypasses the weaknesses of cross-modal similarity matching by projecting text queries into the image modality, thereby leveraging the stronger intra-modal retrieval capability of existing retrievers.

Key Insight: Visualization provides a more intuitive and expressive medium than text for conveying compositional concepts (entity + pose + spatial relationship). Performing retrieval within the image modality avoids the weaknesses of cross-modal retrievers and exploits their stronger intra-modal capabilities.

Core Idea: Decompose T2I retrieval into two stages—"text → image modality projection" and "image → image intra-modal retrieval"—by visualizing text queries via a T2I generative model and subsequently performing image-to-image retrieval using the generated images.

Method

Overall Architecture

VisRet consists of two stages: (1) Modality Projection—an LLM transforms the original text query into a T2I instruction, which a T2I generative model uses to produce \(m\) visualization images \(\{v_1,\ldots,v_m\} \equiv \mathcal{T}(q)\); (2) Intra-modal Retrieval—each generated image is used independently for retrieval, and the resulting ranked lists are aggregated via Reciprocal Rank Fusion (RRF) to produce the final result.
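
A minimal sketch of this two-stage pipeline is shown below. The callables (`draft_t2i_instruction`, `generate_images`, `retrieve`, `fuse`) are illustrative placeholders for an LLM, a T2I generator, an image-to-image retriever, and a rank-fusion step; they are assumptions for exposition, not interfaces from the paper's released code.

```python
from typing import Callable, List

def visret(
    query: str,
    draft_t2i_instruction: Callable[[str], str],         # LLM: q -> T2I instruction q'
    generate_images: Callable[[str, int], List[bytes]],  # T2I model: q' -> m images
    retrieve: Callable[[bytes, int], List[str]],         # image -> ranked corpus item IDs
    fuse: Callable[[List[List[str]], int], List[str]],   # rank-list aggregation (e.g., RRF)
    m: int = 3,
    k: int = 30,
) -> List[str]:
    # Stage 1: modality projection (text query -> m visualization images).
    instruction = draft_t2i_instruction(query)
    visualizations = generate_images(instruction, m)
    # Stage 2: intra-modal retrieval, one ranked list per visualization image,
    # followed by aggregation into a single top-k result.
    ranked_lists = [retrieve(v, k) for v in visualizations]
    return fuse(ranked_lists, k)
```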

Key Designs

  1. Modality Projection:

    • Function: Transforms a text query into an image query, making key visual-spatial requirements explicit.
    • Mechanism: Given the original query \(q\), an LLM first drafts a T2I instruction \(q'\) in text space, describing an image that likely satisfies the implicit visual feature requirements of \(q\). A T2I generation method (e.g., Stable Diffusion) then projects \(q'\) into \(m\) images \(\{v_1,\ldots,v_m\}\). Diversity is introduced by sampling the T2I model multiple times.
    • Design Motivation: A visualized query can simultaneously depict the required entity, pose, and viewpoint—information that, when encoded solely through text, is limited by the quality of cross-modal matching.
  2. Intra-modal Retrieval and RRF Aggregation:

    • Function: Performs retrieval within the image modality and aggregates results from multiple visualizations.
    • Mechanism: Each generated image \(v_i\) is used independently to retrieve a ranked list \(\mathcal{R}(v_i, \mathcal{I})\), and the \(m\) lists are fused via RRF: \(\text{score}_{\text{RRF}}(r) = \sum_{i=1}^{m} \frac{1}{\lambda + \text{rank}_i(r)}\), where \(\lambda\) controls the influence of low-ranked items. The top-\(k\) results with the highest scores are returned (a minimal RRF sketch follows this list).
    • Design Motivation: Operating entirely within the image modality avoids the weaknesses of cross-modal retrievers and leverages their stronger intra-modal capabilities. Multi-image aggregation increases query diversity.
  3. Visual-RAG-ME Benchmark Construction:

    • Function: Provides a retrieval evaluation benchmark for comparing visual features across multiple entities.
    • Mechanism: Visual-RAG is extended with questions comparing visual features of two biologically similar entities (e.g., which has a lighter color or smoother surface). Candidate entities are identified via BM25, comparison questions are manually constructed, and retrieval labels are annotated from iNaturalist, yielding 50 high-quality queries.
    • Design Motivation: Existing benchmarks primarily evaluate single-entity retrieval and lack scenarios requiring cross-entity visual feature reasoning, which represents an important challenge for T2I retrieval.
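
As referenced in the intra-modal retrieval item above, the RRF formula can be implemented in a few lines. This is a sketch only: the item IDs and the default \(\lambda = 60\) are illustrative assumptions, not values confirmed by the paper.

```python
from collections import defaultdict
from typing import Dict, List

def rrf_fuse(ranked_lists: List[List[str]], lam: float = 60.0, k: int = 30) -> List[str]:
    """Fuse m ranked lists: score_RRF(r) = sum_i 1 / (lam + rank_i(r))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[item_id] += 1.0 / (lam + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)[:k]

# Toy usage: three visualizations retrieve slightly different top-3 lists.
print(rrf_fuse([["a", "b", "c"], ["b", "a", "d"], ["a", "d", "b"]], k=3))
# -> ['a', 'b', 'd']  (items ranked highly across multiple lists accumulate the largest scores)
```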

Loss & Training

VisRet is a training-free, plug-and-play method that requires no modification to retrievers or pre-computed image embedding indices. It requires only a one-time use of an LLM to generate T2I instructions and a T2I model to produce visualization images.
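
The plug-and-play property is easiest to see at the index level: the same precomputed image-embedding matrix serves both the cross-modal baseline and VisRet, and only the query-side embedding changes. The sketch below assumes L2-normalized corpus embeddings from a generic image encoder (e.g., CLIP); the names and shapes are illustrative.

```python
import numpy as np

def search(index: np.ndarray, item_ids: list, query_emb: np.ndarray, k: int = 30):
    """Cosine-similarity search: index is (num_items, d), query_emb is (d,)."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = index @ q                      # (num_items,) cosine similarities
    top = np.argsort(-sims)[:k]
    return [(item_ids[i], float(sims[i])) for i in top]

# Baseline text-to-image retrieval: search(index, ids, encode_text(query))
# VisRet intra-modal retrieval:     search(index, ids, encode_image(visualization))
# The index itself is untouched in both cases, so no retraining or re-indexing is needed.
```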

Key Experimental Results

Main Results

nDCG@30 across four benchmarks (CLIP retriever)

| Method | Visual-RAG | Visual-RAG-ME | INQUIRE-Rerank-Hard | COCO-Hard |
| --- | --- | --- | --- | --- |
| Original Query | 0.385 | 0.435 | 0.412 | 0.042 |
| LLM Rewriting | 0.395 | 0.572 | 0.407 | 0.093 |
| Corpus Captioning (BLIP) | 0.271 | 0.371 | 0.401 | 0.153 |
| VISA Reranking | 0.388 | 0.457 | 0.000 | 0.000 |
| VisRet | 0.438 | 0.605 | 0.455 | 0.108 |

nDCG@30 across four benchmarks (E5-V retriever)

| Method | Visual-RAG | Visual-RAG-ME | INQUIRE-Rerank-Hard | COCO-Hard |
| --- | --- | --- | --- | --- |
| Original Query | 0.407 | 0.486 | 0.407 | 0.178 |
| LLM Rewriting | 0.391 | 0.566 | 0.412 | 0.182 |
| VisRet | 0.461 | 0.622 | 0.425 | 0.205 |
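
For reference, the nDCG@k values reported in these tables can be computed as sketched below. This assumes binary relevance labels; the benchmarks may use graded relevance, in which case the gain term would change accordingly.

```python
import math

def ndcg_at_k(ranked_ids: list, relevant: set, k: int = 30) -> float:
    """nDCG@k with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item_id in enumerate(ranked_ids[:k], start=1)
        if item_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```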

Ablation Study

Effect of T2I generation model on Visual-RAG-ME (CLIP retriever)

| T2I Model | nDCG@1 | nDCG@10 | nDCG@30 |
| --- | --- | --- | --- |
| Stable Diffusion 3.5 | 0.270 | 0.467 | 0.484 |
| FLUX.1-dev | 0.320 | 0.501 | 0.494 |
| DALL-E 3 | 0.346 | 0.554 | 0.553 |
| gpt-image-1 (high quality) | 0.460 | 0.632 | 0.605 |

Multi-image aggregation vs. single image (CLIP retriever)

| Benchmark | 3-image nDCG@30 | 1-image nDCG@30 |
| --- | --- | --- |
| Visual-RAG | 0.438 | 0.425 |
| Visual-RAG-ME | 0.605 | 0.602 |

Key Findings

  • VisRet achieves an average nDCG@10 improvement of 0.109 (38%↑) with the CLIP retriever and 0.078 (23%↑) with E5-V.
  • Downstream VQA accuracy improves by 3.8% on Visual-RAG (top-1 retrieval) and 15.7% on Visual-RAG-ME.
  • T2I generation model quality is the critical performance bottleneck: gpt-image-1 substantially outperforms Stable Diffusion 3.5; three failure modes are identified—lack of focus, factual inaccuracies, and poor instruction following.
  • Using a single visualization incurs only a marginal performance drop; the gains from multi-image aggregation stem from increased query diversity.
  • Visualized queries improve retrieval but cannot substitute real images as an independent knowledge source.

Highlights & Insights

  • The perspective is novel and practical: by adopting a "visualize-then-retrieve" paradigm, the paper elegantly circumvents the fundamental difficulty of cross-modal alignment.
  • The training-free, plug-and-play design requires no retraining of retrievers or modification of existing infrastructure, enabling direct use of precomputed image embedding indices.
  • The Visual-RAG-ME benchmark addresses the gap in retrieval evaluation for multi-entity visual feature comparison.
  • VisRet incurs lower practical latency than VISA reranking (approximately 5× faster), since VISA requires an LVLM to process top-\(k\) candidates.

Limitations & Future Work

  • Performance is strongly dependent on T2I generation model quality; weaker models (e.g., Stable Diffusion) yield limited gains.
  • Generated images may contain factual errors (e.g., inaccurate species appearances), degrading retrieval quality.
  • Evaluation is currently focused on natural species; effectiveness in other knowledge-intensive domains (e.g., medicine, architecture) remains to be validated.
  • The computational cost of T2I generation is higher than that of simple query rewriting.

Comparison with Baselines

  • vs. LLM Query Rewriting: Query rewriting still operates in the text–image cross-modal space; VisRet fully transitions into the image modality, avoiding cross-modal weaknesses.
  • vs. VISA Reranking: VISA relies on an LVLM to process top-\(k\) candidates, with cost scaling linearly in \(k\) and quality bounded by the initial retrieval; VisRet fundamentally changes the query modality.
  • vs. Corpus Captioning: Converting images to text loses information, particularly in knowledge-intensive scenarios.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Reframes T2I retrieval as "visualization + image-to-image retrieval," offering a distinctive and elegant perspective.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers four benchmarks, two retrievers, extensive ablation analyses, and downstream VQA evaluation.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated, the method is concise, and figures are intuitive.
  • Value: ⭐⭐⭐⭐ Introduces a new paradigm for knowledge-intensive T2I retrieval; the training-free plug-and-play nature enhances practical applicability.