Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval

Conference: CVPR 2025
arXiv: 2412.11077
Code: https://github.com/Pter61/osrcir2024/
Area: LLM Reasoning
Keywords: Composed Image Retrieval, Zero-Shot, Chain-of-Thought, Multimodal Large Language Model, Training-Free

TL;DR

This paper proposes OSrCIR, a training-free, one-stage method for zero-shot composed image retrieval. A multimodal large language model processes the reference image and the modification text together, and reflective Chain-of-Thought reasoning is used to capture the user's implicit modification intent. OSrCIR outperforms existing training-free methods by 1.80% to 6.44% across multiple benchmarks.

Background & Motivation

Background: Composed Image Retrieval (CIR) retrieves a target image using a reference image paired with a modification text. Zero-shot CIR (ZS-CIR) leverages pre-trained models such as CLIP, bypassing the need for large amounts of labeled triplet data.
Limitations of Prior Work: Current training-free ZS-CIR methods adopt a two-stage pipeline: an image captioner first generates a textual caption of the reference image, and an LLM then infers the target description from that caption and the modification text. This leads to two issues: (1) the captioning stage is unaware of the modification intent and therefore misses crucial visual details; and (2) simple prompting limits the reasoning capabilities of the LLM.
Key Challenge: Separating captioning from reasoning causes information loss: key visual details dropped during the captioning stage cannot be recovered by the subsequent LLM reasoning.
Goal: Design a one-stage reasoning approach that preserves complete visual information and fully exploits the reasoning capability of MLLMs.
Key Insight: Directing the MLLM to handle the reference image and the modification text simultaneously, thereby avoiding information loss in the intermediate captioning stage.
Core Idea: One-stage MLLM reasoning combined with Reflective Chain-of-Thought (Reflective CoT) guidance to precisely understand modification intent.

Method

Overall Architecture

Reference image \(I_r\) + modification text \(T_m\) → MLLM (with Reflective CoT prompt) → target image description \(T_t\) → CLIP text encoding → cosine similarity matching with image database → retrieval results. All computations are conducted on a single NVIDIA A100 GPU using PyTorch.

Key Designs

One-Stage Inference Process:
- Function: Eliminates the information loss inherent in two-stage methods by inferring the target description directly from the image and the modification text.
- Mechanism: \(T_t = \Psi_M(p_c \circ I_r \circ T_m)\), where the CoT prompt, reference image, and modification text are concatenated as input to the MLLM to generate the target description in a single pass. Since the MLLM simultaneously "sees" both the image and the modification intent, it can preserve key visual details relevant to the modification.
- Design Motivation: In two-stage methods, the captioner is unaware of the modification intent and cannot decide which details to retain. The one-stage approach allows complete visual information to participate in the reasoning process.
Reflective Chain-of-Thought (Reflective CoT):
- Function: Guides the MLLM to reason about the user's modification intent step-by-step, avoiding misunderstandings caused by simple prompting.
- Mechanism: A four-step progressive reasoning process: (1) Reference image description: focusing on visual details related to the modification text; (2) Thoughts: analyzing the modification intent and the affected visual elements; (3) Reflections: filtering out incorrect intents and identifying the most relevant modified elements to mitigate hallucination; (4) Target image description: generating the final description based on the filtered elements. All steps are completed within a single prompt to ensure efficiency.
- Design Motivation: User modification intents are often implicit (e.g., "without human" actually means retaining the puppy and the blurred human background), requiring multi-step reasoning and reflection to be accurately comprehended.
Vision-by-Language In-Context Learning (Vision-by-Language ICL):
- Function: Enables the MLLM to understand the expected output format of each CoT step while maintaining the zero-shot setting.
- Mechanism: Providing example outputs in pure text format (without reference images) to guide the MLLM to generate properly formatted reasoning outputs at each step.
- Design Motivation: Merely providing CoT guidelines is insufficient; the MLLM requires concrete examples to understand the expected behavior at each step.
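
The three designs above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' prompt or code: the section labels in `STEPS`, the text-only example in `ICL_EXAMPLE`, the request dictionary layout, and the function names are all hypothetical, and no real MLLM API is called. The sketch only shows how the prompt \(p_c\), the image \(I_r\), and the text \(T_m\) can travel in a single request, and how the final description \(T_t\) can be pulled out of a four-section response.

```python
# Hypothetical section labels for the four Reflective CoT steps described above.
# The paper's actual prompt wording is not reproduced here.
STEPS = (
    "Reference Image Description",
    "Thoughts",
    "Reflections",
    "Target Image Description",
)

# Illustrative text-only in-context example (no image attached).
ICL_EXAMPLE = {
    "Modification": "make the sofa red and remove the cushions",
    "Reference Image Description": "A grey sofa with two cushions in a bright living room.",
    "Thoughts": "The user wants a different sofa colour and no cushions; the room should stay.",
    "Reflections": "Only the colour and the cushions change; the living room setting is kept.",
    "Target Image Description": "A red sofa without cushions in a bright living room.",
}


def build_cot_prompt():
    """Assemble the instruction text p_c: step guidelines plus one text-only example."""
    lines = ["Answer using the following sections, in this order:"]
    lines += [f"{i}. {step}" for i, step in enumerate(STEPS, start=1)]
    lines.append("")
    lines.append("Example (text only, no image):")
    lines += [f"{key}: {value}" for key, value in ICL_EXAMPLE.items()]
    return "\n".join(lines)


def build_request(image_ref, modification_text):
    """Bundle p_c, I_r and T_m into one request so the model sees all three at once."""
    return {
        "instructions": build_cot_prompt(),
        "image": image_ref,
        "modification_text": modification_text,
    }


def parse_target_description(response_text):
    """Return the text after the last 'Target Image Description:' label."""
    marker = STEPS[-1] + ":"
    position = response_text.rfind(marker)
    if position == -1:
        raise ValueError("response has no target description section")
    return response_text[position + len(marker):].strip()
```

Keeping all four steps in one request mirrors the single-pass design: the model is queried once per query image, and only the last section is forwarded to the retrieval stage.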

Loss & Training

This method is entirely training-free. Retrieval uses a frozen CLIP model and ranks candidate images by the cosine similarity between their image encodings and the text encoding of the target description \(T_t\). Results on CIRCO and CIRR are obtained on hidden test sets through the benchmarks' submission servers, while Fashion-IQ and GeneCIS are evaluated directly.
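
The ranking step can be illustrated with a small, dependency-free sketch. It assumes the CLIP text embedding of \(T_t\) and the candidate image embeddings have already been computed; the three-dimensional toy vectors, the candidate ids, and the function names below are hypothetical stand-ins, not outputs of a real CLIP model.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


def rank_candidates(text_embedding, image_embeddings):
    """Return candidate ids sorted from most to least similar to the text embedding."""
    scores = {
        image_id: cosine_similarity(text_embedding, embedding)
        for image_id, embedding in image_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy stand-ins for CLIP embeddings (real ones have hundreds of dimensions).
toy_text = [1.0, 0.0, 0.2]
toy_images = {
    "a": [0.9, 0.1, 0.1],
    "b": [0.0, 1.0, 0.0],
    "c": [0.5, 0.5, 0.0],
}
ranking = rank_candidates(toy_text, toy_images)
```

Because the encoder stays frozen, the image embeddings of a database can be computed once and reused for every query; only the text embedding of \(T_t\) changes per query.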

Key Experimental Results

Main Results

| Method | CIRCO mAP@5 | CIRCO mAP@25 | CIRR R@1 | CIRR R@5 |
|---|---|---|---|---|
| CIReVL (ViT-L/14) | 18.57 | 20.89 | 24.55 | 52.31 |
| CIReVL* (GPT-4o) | 18.92 | 21.15 | 24.83 | 52.68 |
| OSrCIR | 23.87 | 27.84 | 29.45 | 57.68 |
| LinCIR | 12.59 | 15.00 | 25.04 | 53.25 |
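
For readers unfamiliar with the two metric families in the table, the sketch below shows one common way to compute Recall@K (single target per query, as reported for CIRR) and average precision at K (multiple ground truths per query, as reported for CIRCO). The normalization by `min(k, number of ground truths)` is an assumption about the benchmark convention and should be checked against the official evaluation code; the function names are hypothetical. Averaging either value over all queries gives the reported R@K or mAP@K.

```python
def recall_at_k(ranked_ids, target_id, k):
    """1.0 if the single target appears in the top-k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def average_precision_at_k(ranked_ids, ground_truth_ids, k):
    """AP@k for one query with several ground truths.

    Assumed normalization: divide by min(k, number of ground truths).
    """
    ground_truths = set(ground_truth_ids)
    if not ground_truths:
        raise ValueError("at least one ground-truth id is required")
    hits = 0
    precision_sum = 0.0
    for rank, candidate_id in enumerate(ranked_ids[:k], start=1):
        if candidate_id in ground_truths:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(ground_truths))
```

As a worked case, a ranking `["a", "b", "c", "d", "e"]` with ground truths `{"a", "c"}` has precision 1/1 at rank 1 and 2/3 at rank 3, so AP@5 is (1 + 2/3) / 2 = 5/6.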

Ablation Study

| Configuration | CIRCO mAP@5 | Description |
|---|---|---|
| Full OSrCIR | 23.87 | Full Reflective CoT |
| w/o Reflections step | 21.42 | Contribution of the reflections step is +2.45 |
| Simple prompt (w/o CoT) | 19.85 | Overall contribution of the CoT framework is +4.02 |
| Two-stage + GPT-4o | 18.92 | One-stage outperforms two-stage |
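
The contributions quoted in the Description column are plain differences against the full model. The short check below recomputes them from the CIRCO mAP@5 values in the ablation table; the dictionary and function names are only illustrative.

```python
# CIRCO mAP@5 values copied from the ablation table above.
CIRCO_MAP5 = {
    "Full OSrCIR": 23.87,
    "w/o Reflections step": 21.42,
    "Simple prompt (w/o CoT)": 19.85,
    "Two-stage + GPT-4o": 18.92,
}


def gain_of_full_model_over(configuration):
    """mAP@5 gain of the full model over an ablated configuration, in points."""
    return round(CIRCO_MAP5["Full OSrCIR"] - CIRCO_MAP5[configuration], 2)


gains = {
    name: gain_of_full_model_over(name)
    for name in CIRCO_MAP5
    if name != "Full OSrCIR"
}
```

The three gains come out as 2.45 (reflections step), 4.02 (whole CoT framework), and 4.95 (one-stage versus two-stage with GPT-4o).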

Key Findings

OSrCIR significantly outperforms existing training-free methods across all CLIP architectures (ViT-B/32, ViT-L/14, ViT-G/14).
The reflection step effectively filters out hallucinations from the Thoughts stage, improving reasoning accuracy.
The improvement is most prominent on CIRCO (+5.30 mAP@5 over CIReVL), which has a more accurate evaluation protocol; the gain on CIRR (+4.90 R@1 over CIReVL) is smaller but still significant.
The inference efficiency of the one-stage method is comparable to that of the two-stage method, incurring no additional overhead.
On Fashion-IQ, OSrCIR (ViT-L/14) achieves an average R@10 of 33.26%, which is 4.21 points higher than CIReVL* and 5.46 points higher than the best training-based method, Context-I2W; the per-category scores for Shirt/Dress/Toptee are 33.17/29.70/36.92, respectively.
Solely upgrading to a stronger MLLM (CIReVL \(\rightarrow\) CIReVL*) yields only a marginal improvement (+0.35 mAP@5 on CIRCO, from 18.57 to 18.92), indicating that the two-stage paradigm itself is the bottleneck.
By default, GPT-4o (temperature=0) is used, and the results are averaged over three runs; the method also supports GPT-4o-mini, GPT-4V, and open-source models such as LLaVA and MiniGPT4.

Highlights & Insights

Reason-before-Retrieve Paradigm: OSrCIR reformulates CIR as an MLLM reasoning problem rather than a simple feature combination problem, aligning better with the human cognitive process of retrieval.
Critical Role of the Reflection Step: The Reflections step resembles human "wait, let me think again" self-correction, effectively reducing reasoning hallucinations. Contributing +2.45 mAP@5, the reflection step is core to the performance gain.
Transferability to Other Multimodal Retrieval Tasks: The design concept of Reflective CoT can be applied to any multimodal task that requires understanding implicit intents.
Importance of Information Integrity: Experiments demonstrate that the one-stage method outperforms the two-stage approach with GPT-4o (+4.95 mAP@5), indicating that fully preserving visual information is more important than employing a stronger LLM.
Ingenious Design of Vision-by-Language ICL: Using only pure text examples (sans images) is sufficient to guide the MLLM to understand the output format of each CoT step, maintaining a truly zero-shot setting.
Consistency Across CLIP Architectures: From ViT-B/32 to ViT-G/14, OSrCIR consistently leads by a large margin, demonstrating that the method does not rely on a specific CLIP version. Qualitative analysis shows that OSrCIR accurately captures key details missed by CIReVL (such as poster types, dog breeds, and colors).

Limitations & Future Work

It relies heavily on the reasoning capacity of the MLLM, and the effectiveness may vary significantly across different MLLMs.
The improvement on CIRR is relatively limited (+4.9 R@1), likely due to noisy annotations in this benchmark.
The combination with training-based methods has not been explored; theoretically, integrating training could lead to further improvements.
The retrieval still relies on the text-image alignment quality of CLIP, with CLIP's performance bottlenecking the final retrieval accuracy.
The four-step reasoning of CoT increases the MLLM inference overhead, which demands efficiency considerations in large-scale retrieval scenarios.
When the modification text is extremely simple (e.g., changing only the color), the reflection step might be redundant.

On the GeneCIS benchmark (a more generalized composed retrieval setting), OSrCIR also significantly outperforms all adaptation methods, showcasing robust cross-benchmark generalization.

vs CIReVL: CIReVL is a two-stage method where the separation of captioning and reasoning causes information loss; OSrCIR is a one-stage method that preserves complete visual information.
vs LinCIR/Pic2Word: These methods require training a text inversion network, whereas OSrCIR is completely training-free and delivers superior performance (CIRCO mAP@5: OSrCIR 23.87% vs LinCIR 12.59% vs Context-I2W 13.04%).
vs LDRE: LDRE uses diffusion model ensembles, which incur high computational overhead; OSrCIR is much more efficient.

Implementation Details

By default, the MLLM is GPT-4o with the API temperature set to 0 and all other parameters left at their defaults. The retrieval module is built on PyTorch and runs on a single NVIDIA A100. ViT-B/32 and ViT-L/14 use the official CLIP weights, and ViT-G/14 uses OpenCLIP weights.

Rating

- Novelty: ⭐⭐⭐⭐ First application of one-stage + reflective CoT in ZS-CIR
- Experimental Thoroughness: ⭐⭐⭐⭐ Verified across multiple benchmarks and architectures (CIRCO/CIRR/Fashion-IQ/GeneCIS) with detailed ablations
- Writing Quality: ⭐⭐⭐⭐ Clearly explained motivation and intuitive illustrations
- Value: ⭐⭐⭐⭐ Training-free method achieving new SOTA with high practical applicability