Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval
Conference: CVPR 2025
arXiv: 2412.11077
Code: https://github.com/Pter61/osrcir2024/
Area: LLM Reasoning
Keywords: Composed Image Retrieval, Zero-Shot, Chain-of-Thought, Multimodal Large Language Model, Training-Free
TL;DR
This paper proposes OSrCIR, a training-free one-stage zero-shot composed image retrieval method. It utilizes multimodal large language models to directly process the reference image and modification text, and accurately understands the user's implicit intent through reflective Chain-of-Thought reasoning, outperforming existing training-free methods by 1.80% to 6.44% across multiple benchmarks.
Background & Motivation
- Background: Composed Image Retrieval (CIR) retrieves a target image using a reference image paired with a modification text. Zero-shot CIR (ZS-CIR) leverages pre-trained models such as CLIP, bypassing the need for large amounts of labeled triplet data.
- Limitations of Prior Work: Current training-free ZS-CIR methods adopt a two-stage pipeline: an image captioner first generates a textual caption of the reference image, and an LLM then infers the target description from that caption and the modification text. This leads to two issues: (1) the captioning stage is unaware of the modification intent and therefore misses crucial visual details; and (2) simple prompting limits the reasoning capabilities of the LLM.
- Key Challenge: Separating captioning from reasoning in two-stage methods causes information loss: key visual details are dropped during the captioning stage, and the subsequent LLM reasoning cannot recover them.
- Goal: Design a one-stage reasoning approach that preserves complete visual information and fully exploits the reasoning capability of MLLMs.
- Key Insight: Directing the MLLM to handle the reference image and the modification text simultaneously, thereby avoiding information loss in the intermediate captioning stage.
- Core Idea: One-stage MLLM reasoning combined with Reflective Chain-of-Thought (Reflective CoT) guidance to precisely understand modification intent.
Method
Overall Architecture
Reference image \(I_r\) + modification text \(T_m\) → MLLM (with Reflective CoT prompt) → target image description \(T_t\) → CLIP text encoding → cosine similarity matching with image database → retrieval results. All computations are conducted on a single NVIDIA A100 GPU using PyTorch.
Key Designs
- One-Stage Inference Process:
    - Function: Eliminates the information loss inherent in two-stage methods by inferring the target description directly from the image and modification text.
    - Mechanism: \(T_t = \Psi_M(p_c \circ I_r \circ T_m)\), where the CoT prompt \(p_c\), reference image \(I_r\), and modification text \(T_m\) are concatenated as input to the MLLM \(\Psi_M\), which generates the target description \(T_t\) in a single pass. Since the MLLM simultaneously "sees" both the image and the modification intent, it can preserve key visual details relevant to the modification.
    - Design Motivation: In two-stage methods, the captioner is unaware of the modification intent and cannot decide which details to retain. The one-stage approach allows complete visual information to participate in the reasoning process.
- Reflective Chain-of-Thought (Reflective CoT):
    - Function: Guides the MLLM to reason about the user's modification intent step-by-step, avoiding misunderstandings caused by simple prompting.
    - Mechanism: A four-step progressive reasoning process: (1) Reference image description: focusing on visual details related to the modification text; (2) Thoughts: analyzing the modification intent and the affected visual elements; (3) Reflections: filtering out incorrect intents and identifying the most relevant modified elements to mitigate hallucination; (4) Target image description: generating the final description based on the filtered elements. All steps are completed within a single prompt to ensure efficiency.
    - Design Motivation: User modification intents are often implicit (e.g., "without human" actually means retaining the puppy and the blurred human background), requiring multi-step reasoning and reflection to be accurately comprehended.
- Vision-by-Language In-Context Learning (Vision-by-Language ICL):
    - Function: Enables the MLLM to understand the expected output format of each CoT step while maintaining the zero-shot setting.
    - Mechanism: Example outputs are provided in pure text format (without reference images) to guide the MLLM to generate properly formatted reasoning outputs at each step.
    - Design Motivation: Merely providing CoT guidelines is insufficient; the MLLM requires concrete examples to understand the expected behavior at each step.
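
The three designs above can be sketched as one prompt-building step plus a parser for the final answer. This is only an illustrative sketch, not the authors' prompt: the guideline wording, the text-only example, the section labels, and the helper names `build_reflective_cot_prompt` and `extract_target_description` are all hypothetical, and the MLLM call (with the reference image attached) is deliberately left out.

```python
import re

# Hypothetical section labels for the four Reflective CoT steps.
STEPS = [
    "Reference Image Description",
    "Thoughts",
    "Reflections",
    "Target Image Description",
]

# Hypothetical text-only in-context example (no image attached),
# in the spirit of Vision-by-Language ICL.
TEXT_ONLY_EXAMPLE = {
    "modification": "make the car red",
    "Reference Image Description": "A blue car parked on a city street.",
    "Thoughts": "Only the colour of the car should change.",
    "Reflections": "The street and parking position are not mentioned, so keep them.",
    "Target Image Description": "A red car parked on a city street.",
}


def build_reflective_cot_prompt(modification_text: str) -> str:
    """Assemble the text part of the input (p_c and T_m); I_r is attached separately."""
    lines = ["Answer in four labelled sections, in this order:"]
    lines += [f"{i}. {name}" for i, name in enumerate(STEPS, start=1)]
    lines.append("")
    lines.append("Example (text only):")
    lines.append(f"Modification: {TEXT_ONLY_EXAMPLE['modification']}")
    lines += [f"{name}: {TEXT_ONLY_EXAMPLE[name]}" for name in STEPS]
    lines.append("")
    lines.append(f"Modification: {modification_text}")
    return "\n".join(lines)


def extract_target_description(mllm_output: str) -> str:
    """Return the text following the last 'Target Image Description:' label."""
    matches = re.findall(r"Target Image Description:\s*(.+)", mllm_output)
    if not matches:
        raise ValueError("no target description found in MLLM output")
    return matches[-1].strip()
```

Keeping all four steps in one prompt means a single MLLM call per query; only the last section is passed on to the retrieval stage.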
Loss & Training
This method is entirely training-free. Retrieval uses a frozen CLIP model and ranks candidate images by the cosine similarity between their image embeddings and the text embedding of the target description \(T_t\). CIRCO and CIRR are evaluated on hidden test sets via their submission servers; Fashion-IQ and GeneCIS are evaluated directly.
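
The ranking step can be illustrated with plain NumPy. This is a minimal sketch that assumes the CLIP text embedding of \(T_t\) and the gallery image embeddings have already been computed; the function name `rank_by_cosine` and the toy vectors are hypothetical and stand in for real CLIP features.

```python
import numpy as np


def rank_by_cosine(text_emb: np.ndarray, image_embs: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Return indices of the top_k gallery images, most similar first."""
    # L2-normalise so that a dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb)
    g = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = g @ t
    # Sort by descending similarity; a stable sort keeps ties in gallery order.
    order = np.argsort(-sims, kind="stable")
    return order[:top_k]


# Toy example: 2-D vectors in place of CLIP embeddings.
query = np.array([1.0, 0.0])
gallery = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
print(rank_by_cosine(query, gallery, top_k=3).tolist())  # [1, 2, 0]
```

Because both encoders are frozen, the gallery embeddings can be computed once and reused for every query.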
Key Experimental Results
Main Results
| Method | CIRCO mAP@5 | CIRCO mAP@25 | CIRR R@1 | CIRR R@5 |
|---|---|---|---|---|
| CIReVL (ViT-L/14) | 18.57 | 20.89 | 24.55 | 52.31 |
| CIReVL* (GPT-4o) | 18.92 | 21.15 | 24.83 | 52.68 |
| OSrCIR | 23.87 | 27.84 | 29.45 | 57.68 |
| LinCIR | 12.59 | 15.00 | 25.04 | 53.25 |
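
For readers unfamiliar with the two metrics in the table, the per-query computations can be sketched as follows. This is an assumption-laden illustration, not the benchmarks' official evaluation code: the function names are hypothetical, and normalising AP@K by `min(k, number of ground truths)` is a common convention that should be checked against each benchmark's own protocol. The reported scores are averages of these per-query values over all queries.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[int], targets: Set[int], k: int) -> float:
    """1.0 if any ground-truth image appears in the top k, else 0.0."""
    return float(any(idx in targets for idx in ranked[:k]))


def average_precision_at_k(ranked: Sequence[int], targets: Set[int], k: int) -> float:
    """AP@k for one query; assumes `targets` is non-empty."""
    hits = 0
    precision_sum = 0.0
    for position, idx in enumerate(ranked[:k], start=1):
        if idx in targets:
            hits += 1
            # Precision at this rank, counted only where a hit occurs.
            precision_sum += hits / position
    return precision_sum / min(k, len(targets))
```

Recall@K suits benchmarks with a single ground-truth target per query, while mAP@K rewards ranking several valid targets near the top.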
Ablation Study
| Configuration | CIRCO mAP@5 | Description |
|---|---|---|
| Full OSrCIR | 23.87 | Full Reflective CoT |
| w/o Reflections step | 21.42 | Contribution of the reflections step is +2.45 |
| Simple prompt (w/o CoT) | 19.85 | Overall contribution of the CoT framework is +4.02 |
| Two-stage + GPT-4o | 18.92 | One-stage outperforms two-stage |
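
The gains quoted in the table are simple differences against the full model. The short check below recomputes them from the CIRCO mAP@5 values listed above; the variable names are illustrative only.

```python
# CIRCO mAP@5 values copied from the ablation table.
full = 23.87
ablations = {
    "w/o Reflections step": 21.42,
    "Simple prompt (w/o CoT)": 19.85,
    "Two-stage + GPT-4o": 18.92,
}

# Gain of the full model over each ablated configuration, in mAP@5 points.
gains = {name: round(full - score, 2) for name, score in ablations.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```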
Key Findings
- OSrCIR significantly outperforms existing training-free methods across all CLIP architectures (ViT-B/32, ViT-L/14, ViT-G/14).
- The reflection step effectively filters out hallucinations from the Thoughts stage, improving reasoning accuracy.
- The improvement is particularly prominent on CIRCO (+5.30 mAP@5 over CIReVL), which has a more accurate evaluation protocol, while the gain on CIRR is smaller but still significant.
- The inference efficiency of the one-stage method is comparable to that of the two-stage method, incurring no additional overhead.
- On Fashion-IQ, OSrCIR (ViT-L/14) achieves an average R@10 of 33.26%, which is 4.21 points higher than CIReVL* and 5.46 points higher than the best training-based method, Context-I2W; the per-category scores for Shirt/Dress/Toptee are 33.17/29.70/36.92, respectively.
- Solely upgrading to a stronger MLLM (CIReVL \(\rightarrow\) CIReVL*) yields only a marginal improvement (+0.35 mAP@5 on CIRCO, from 18.57 to 18.92), indicating that the two-stage paradigm itself is the bottleneck.
- By default, GPT-4o (temperature=0) is used, and the results are averaged over three runs; the method also supports GPT-4o-mini, GPT-4V, and open-source models such as LLaVA and MiniGPT4.
Highlights & Insights
- Reason-before-Retrieve Paradigm: OSrCIR reformulates CIR as an MLLM reasoning problem rather than a simple feature combination problem, aligning better with the human cognitive process of retrieval.
- Critical Role of the Reflection Step: The Reflections step resembles human "wait, let me think again" self-correction, effectively reducing reasoning hallucinations. Contributing +2.45 mAP@5, the reflection step is core to the performance gain.
- Transferability to Other Multimodal Retrieval Tasks: The design concept of Reflective CoT can be applied to any multimodal task that requires understanding implicit intents.
- Importance of Information Integrity: Experiments demonstrate that the one-stage method outperforms the two-stage approach with GPT-4o (+4.95 mAP@5), indicating that fully preserving visual information is more important than employing a stronger LLM.
- Ingenious Design of Vision-by-Language ICL: Using only pure text examples (sans images) is sufficient to guide the MLLM to understand the output format of each CoT step, maintaining a truly zero-shot setting.
- Consistency Across CLIP Architectures: From ViT-B/32 to ViT-G/14, OSrCIR consistently outperforms the baselines by a large margin, demonstrating that the method does not rely on a specific CLIP version. Qualitative analysis shows that OSrCIR can accurately capture key details missed by CIReVL (such as poster types, dog breeds, and colors).
Limitations & Future Work
- It relies heavily on the reasoning capacity of the MLLM, and the effectiveness may vary significantly across different MLLMs.
- The improvement on CIRR is relatively limited (+4.9 R@1), likely due to noisy annotations in this benchmark.
- The combination with training-based methods has not been explored; theoretically, integrating training could lead to further improvements.
- The retrieval still relies on the text-image alignment quality of CLIP, with CLIP's performance bottlenecking the final retrieval accuracy.
- The four-step reasoning of CoT increases the MLLM inference overhead, which demands efficiency considerations in large-scale retrieval scenarios.
- When the modification text is extremely simple (e.g., changing only the color), the reflection step might be redundant.
- OSrCIR also significantly outperforms all adaptation methods on the GeneCIS benchmark (a more general composed-retrieval setting), showing robust cross-benchmark generalization.
Related Work & Insights
- vs CIReVL: CIReVL is a two-stage method where the separation of captioning and reasoning causes information loss; OSrCIR is a one-stage method that preserves complete visual information.
- vs LinCIR/Pic2Word: These methods require training a text inversion network, whereas OSrCIR is completely training-free and delivers superior performance (CIRCO mAP@5: OSrCIR 23.87% vs LinCIR 12.59% vs Context-I2W 13.04%).
- vs LDRE: LDRE uses diffusion model ensembles, which incur high computational overhead; OSrCIR is much more efficient.
Implementation Details
By default, the MLLM is GPT-4o with the API temperature set to 0 and all other parameters left at their defaults. The retrieval module is built on PyTorch running on a single NVIDIA A100. The ViT-B/32 and ViT-L/14 CLIP variants use the official weights, and ViT-G/14 uses OpenCLIP weights.
Rating
- Novelty: ⭐⭐⭐⭐ First application of one-stage + reflective CoT in ZS-CIR
- Experimental Thoroughness: ⭐⭐⭐⭐ Verified across multiple benchmarks and architectures (CIRCO/CIRR/Fashion-IQ/GeneCIS) with detailed ablations
- Writing Quality: ⭐⭐⭐⭐ Clearly explained motivation and intuitive illustrations
- Value: ⭐⭐⭐⭐ Training-free method achieving new SOTA with high practical applicability