
Adapting In-context Generation for Enhanced Composed Image Retrieval

Conference: CVPR 2026
Paper: CVF Open Access
Code: https://github.com/JThuge/DAIG
Area: Multimodal VLM / Image-Text Retrieval
Keywords: Composed Image Retrieval, Few-shot, T2I Fine-tuning, Synthetic Triplets, Domain Adaptation

TL;DR

This paper proposes DAIG, which uses as few as 32 target domain samples for in-context fine-tuning (CIR-LoRA) of a pre-trained T2I model (Flux), so that the model can synthesize "unbiased, domain-aligned" Composed Image Retrieval (CIR) triplets in batches. A two-stage training framework (feature-perturbed pre-training, DRSP, followed by angular-margin fine-tuning, FRA) then feeds this synthetic data into any off-the-shelf CIR model. The gains on CIRR and FashionIQ are plug-and-play and come with zero additional inference cost; for example, 32-shot SPRC improves from 29.88 to 42.05 R@1 on CIRR.

Background & Motivation

Background: The input for Composed Image Retrieval (CIR) is a dual-modality query consisting of a "reference image \(I_r\) + a relative description \(T_c\)." The goal is to retrieve a target image \(I_t\) from a gallery that matches the user's modification intent. Supervised CIR methods (CLIP4CIR, BLIP4CIR, SPRC, etc.) perform well due to cross-modal alignment in VLMs but rely heavily on human-annotated \((I_r, T_c, I_t)\) triplets.

Limitations of Prior Work: Annotating triplets is extremely expensive, making supervised CIR difficult to scale. Zero-shot CIR (ZS-CIR) attempts to bypass annotation, but three mainstream routes have flaws: inversion networks (mapping images to pseudo-tokens), training-free LLM inference (slow and complex), and triplet synthesis (CompoDiff/VISTA/CoAlign). Triplet synthesis has the most potential, but synthesized data lacks target domain knowledge, leading to a hard-to-eliminate domain gap.

Key Challenge: CoAlign, the work most relevant to this paper, uses a frozen T2I model for zero-shot in-context generation: an LLM writes text triplets, fills them into a layout template, and the T2I model generates "two semantically related sub-images" in one forward pass. It has two fundamental issues: (1) a large distribution drift between the generated images and the real target domain; (2) a lack of task priors, which leads to overly similar backgrounds across the two sub-images and introduces bias when they are used as \(I_r/I_t\). In other words, freely generated triplets are biased and noisy, so their benefit stays limited even after filtering.

Goal: To generate clean, unbiased CIR training triplets that align with the target domain using only a few annotations (few-shot), even as few as 32 samples, and to enhance any existing CIR model with these samples.

Key Insight: The authors observe that LoRA can capture and align with a target domain distribution from very few samples, while in-context descriptions can inject the CIR task objective into a T2I model. The approach therefore shifts from "generation with a frozen model" to "generation with a few-shot fine-tuned model."

Core Idea: Use 32 target domain samples for parameter-efficient in-context fine-tuning of a T2I model (CIR-LoRA) to inject domain and task priors simultaneously, generating unbiased domain-adaptive triplets. A two-stage framework (robust pre-training on synthetic data + fine-grained fine-tuning on real data) is then used to enhance off-the-shelf CIR models.

Method

Overall Architecture

DAIG consists of three serial components: (i) In-context Generative Fine-tuning, (ii) Domain-Adaptive In-context Generation, and (iii) a Two-stage CIR Training Framework.

The first component takes 32 target domain triplets as input and outputs a T2I model \(G'\) that "understands" both the domain and the CIR task: each triplet is turned into a stitched image plus an in-context description, and these pairs are used to fine-tune Flux with CIR-LoRA. The second component pairs \(G'\) with an LLM for batch data creation: the LLM generates ~20k text triplets from (object, edit) templates, and \(G'\) renders them into the synthetic triplet set \(S'\) (~20k). The third component uses \(S'\) and the real annotations \(S\) in two stages: distribution robust synthetic pre-training (DRSP) on \(S'\), followed by fine-grained real adaptation (FRA) with an angular margin on \(S\).

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["32 Target Domain Triplets<br/>Stitched Images + In-context Descriptions"] --> B["CIR-LoRA<br/>In-context Generative Fine-tuning<br/>MoE-LoRA injected into Cross-Attention"]
    B --> C["Domain-Adaptive T2I Generation<br/>LLM writes text triplets → G' synthesizes<br/>≈20k unbiased triplets S'"]
    C --> D["DRSP Robust Synthetic Pre-training<br/>Gaussian Perturbation of Visual Features + Contrastive Learning"]
    D --> E["FRA Fine-grained Real Adaptation<br/>Diagonal matching pairs with Angular Margin φ"]
    E --> F["Enhanced CIR Model<br/>Zero additional inference cost"]

Key Designs

1. CIR-LoRA: Simultaneous Injection of Domain and Task Priors

To address the bias of frozen T2I models, the authors perform parameter-efficient fine-tuning with 32 samples. Each target domain triplet's \(I_r\) and \(I_t\) are stitched horizontally. A captioner generates descriptions \(T_r, T_t\) for both, which along with \(T_c\) are put into a template to get an in-context description \(T_{ic}\). During fine-tuning, the backbone is frozen while a learnable CIR-LoRA is integrated into the cross-attention layers:

\[\text{Attention}(Q,K,V)=\text{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V,\quad K=W_k\,\tau_{txt}(T_{ic}),\ V=W_v\,\tau_{txt}(T_{ic})\]

The key is using a MoE (Mixture of Experts) for each projection weight \(W\). A routing function \(r\) assigns weights to experts \(B_i, A_i\) based on description characteristics. This allows the model to handle diverse edit operations (add/delete/replace/viewpoint) by assigning optimal experts.
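
The routing idea can be illustrated with a minimal NumPy sketch. It assumes a softmax router over the mean-pooled text tokens and mixes the low-rank expert updates \(B_i A_i\) on top of a frozen projection weight. The function name `moe_lora_project`, the router matrix `W_gate`, the pooling choice, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_lora_project(tokens, W, A, B, W_gate):
    """Project text tokens with a frozen weight plus routed low-rank updates.

    tokens: (n_tokens, d_in)      text-encoder outputs for T_ic
    W:      (d_out, d_in)         frozen projection (e.g. W_k or W_v)
    A:      (n_experts, r, d_in)  LoRA down-projections A_i
    B:      (n_experts, d_out, r) LoRA up-projections B_i
    W_gate: (n_experts, d_in)     hypothetical router weights
    """
    # Router: one weight per expert from the mean-pooled description (assumption).
    gate = softmax(W_gate @ tokens.mean(axis=0))        # (n_experts,)
    # Mixed update: delta_W = sum_i gate_i * B_i @ A_i
    delta_W = np.einsum("e,eor,eri->oi", gate, B, A)    # (d_out, d_in)
    return tokens @ (W + delta_W).T, gate

rng = np.random.default_rng(0)
n_tokens, d_in, d_out, r, n_experts = 5, 16, 8, 4, 3
tokens = rng.normal(size=(n_tokens, d_in))
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(n_experts, r, d_in)) * 0.01
B = np.zeros((n_experts, d_out, r))  # common LoRA init: B = 0, so the update starts at zero
W_gate = rng.normal(size=(n_experts, d_in))

K, gate = moe_lora_project(tokens, W, A, B, W_gate)
```

With `B` initialised to zero the routed projection equals the frozen one, so fine-tuning starts from the pre-trained behaviour; only `A`, `B`, and the router would be trained in this sketch.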

2. DAIG Generation Pipeline: LLM Scripting + G' Image Synthesis

To scale data, LLMs are driven by templates \(P(\text{object}, \text{edit})\) to output \((T_r, T_c, T_t)\) in JSON, constrained by target domain examples. These are converted to in-context descriptions for \(G'\) to synthesize \(S'=\{I_r^i, T_c^i, I_t^i\}_{i=1}^M\). This ensures diversity and high fidelity without the need for additional filtering like in CoAlign.
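
The scripting step can be sketched as follows, assuming the LLM replies with a JSON object holding the three texts. The prompt wording, the in-context template, and the helper names (`build_prompt`, `parse_text_triplet`, `to_in_context_description`) are hypothetical; the paper's exact templates are not reproduced here, and the LLM reply is a hard-coded stand-in.

```python
import json

# Hypothetical templates, for illustration only.
LLM_PROMPT = (
    "You write training data for composed image retrieval.\n"
    "Object: {obj}\nEdit type: {edit}\n"
    'Return JSON with keys "T_r", "T_c", "T_t".'
)
IN_CONTEXT_TEMPLATE = (
    "Two side-by-side images. [LEFT] {T_r} [RIGHT] {T_t} "
    "The right image differs from the left as follows: {T_c}"
)

def build_prompt(obj, edit):
    """Fill the (object, edit) template that drives the LLM."""
    return LLM_PROMPT.format(obj=obj, edit=edit)

def parse_text_triplet(llm_output):
    """Validate the LLM's JSON reply and return (T_r, T_c, T_t)."""
    data = json.loads(llm_output)
    missing = {"T_r", "T_c", "T_t"} - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data["T_r"], data["T_c"], data["T_t"]

def to_in_context_description(T_r, T_c, T_t):
    """Turn a text triplet into the single prompt given to the T2I model."""
    return IN_CONTEXT_TEMPLATE.format(T_r=T_r, T_c=T_c, T_t=T_t)

# Stand-in for a real LLM response.
reply = (
    '{"T_r": "a dog on grass", '
    '"T_c": "make the dog sit on a sofa", '
    '"T_t": "a dog sitting on a sofa"}'
)
T_ic = to_in_context_description(*parse_text_triplet(reply))
```

Validating the reply before image synthesis keeps malformed LLM outputs from reaching the more expensive generation step.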

3. DRSP: Distribution Robust Synthetic Pre-training

Synthetic T2I images often form a sparse distribution relative to the real domain. The authors model the statistics (mean \(\mu(v)\), std \(\sigma(v)\)) of the visual features \(v\) as a multivariate Gaussian and apply reparameterized perturbations:

\[\tilde v=\tilde\sigma(v)\,\frac{v-\mu(v)}{\sigma(v)}+\tilde\mu(v)\]

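Here \(\tilde\mu(v)\) and \(\tilde\sigma(v)\) denote the perturbed statistics sampled from those Gaussians. \(\tilde v\) then replaces \(v\) for alignment. This "stretches" the sparse synthetic distribution to improve generalization; because the perturbation is applied only during training, it adds no inference cost.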
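
A minimal NumPy sketch of this perturbation is given below. It assumes token-level visual features of shape `(batch, n_tokens, d)`, computes \(\mu(v)\) and \(\sigma(v)\) over the token axis, and uses the spread of those statistics across the batch as the Gaussian scale. That last choice, the tensor layout, and the function name `perturb_features` are assumptions in the style of uncertainty-based feature augmentation, not details confirmed by the paper.

```python
import numpy as np

def perturb_features(v, rng, eps=1e-6):
    """Resample feature statistics; v has shape (batch, n_tokens, d)."""
    mu = v.mean(axis=1, keepdims=True)            # mu(v):    (batch, 1, d)
    sigma = v.std(axis=1, keepdims=True) + eps    # sigma(v): (batch, 1, d)
    # Spread of the statistics across the batch sets the noise scale (assumption).
    mu_scale = mu.std(axis=0, keepdims=True)      # (1, 1, d)
    sigma_scale = sigma.std(axis=0, keepdims=True)
    # Reparameterized sampling: statistic + standard normal noise * scale.
    mu_tilde = mu + rng.standard_normal(mu.shape) * mu_scale
    sigma_tilde = sigma + rng.standard_normal(sigma.shape) * sigma_scale
    # v_tilde = sigma_tilde * (v - mu) / sigma + mu_tilde
    return sigma_tilde * (v - mu) / sigma + mu_tilde

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 7, 16))
v_tilde = perturb_features(v, rng)  # training only; inference uses v unchanged
```

When every sample in the batch shares the same statistics, the noise scale is zero and the features pass through unchanged, so the amount of perturbation follows the diversity already present in the batch.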

4. FRA: Fine-grained Real Adaptation

FRA fine-tunes on real annotations \(S\) by adding an angular margin \(\varphi\) to the matching pairs (diagonal elements) in the similarity matrix:

\[p_{i,j}=\frac{e^{\cos(\theta_{i,j})/\tau}}{\sum_{k\in B} e^{\cos(\theta_{i,k})/\tau}},\quad \theta_{i,j}=\arccos\big(\text{sim}(f_q^i, f_t^j)\big)+\varphi\cdot\mathbb{I}(i=j)\]

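Because the margin lowers the logit of each matching pair during training, the model must pull matched query and target features closer than a plain contrastive loss would require. This yields more discriminative representations and helps bridge the final domain gap.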
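
The margin mechanism can be sketched in NumPy as below. The sketch assumes cosine similarity for \(\text{sim}\), applies the temperature to the cosine (i.e. \(\cos(\theta)/\tau\), as is usual for margin-based softmax losses), and averages the negative log-probability of the diagonal entries. The function name `fra_loss` and the values of `margin` and `tau` are illustrative assumptions.

```python
import numpy as np

def fra_loss(f_q, f_t, margin=0.1, tau=0.07):
    """Contrastive loss with an additive angular margin on matching pairs.

    f_q: (B, d) composed-query features, f_t: (B, d) target features;
    row i of f_q matches row i of f_t.
    """
    f_q = f_q / np.linalg.norm(f_q, axis=1, keepdims=True)
    f_t = f_t / np.linalg.norm(f_t, axis=1, keepdims=True)
    sim = np.clip(f_q @ f_t.T, -1.0, 1.0)                # cosine similarities
    theta = np.arccos(sim) + margin * np.eye(len(sim))   # add phi on the diagonal only
    logits = np.cos(theta) / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))                      # -log p_{i,i}, averaged

rng = np.random.default_rng(0)
f_q = rng.normal(size=(8, 32))
f_t = f_q + 0.1 * rng.normal(size=(8, 32))  # targets close to their queries
loss_plain = fra_loss(f_q, f_t, margin=0.0)
loss_margin = fra_loss(f_q, f_t, margin=0.1)
```

For the same features the margin version reports a higher loss, which is the extra pressure that pushes matching pairs closer together during fine-tuning.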

Key Experimental Results

Main Results

On the CIRR test set, DAIG acts as a plug-and-play enhancement for three base models across 32-shot, 1%, and 100% data rates.

| Setting | Method | R@1 | R@5 | R@10 | Avg. |
|---|---|---|---|---|---|
| 32-shot | CLIP4CIR† | 22.87 | 52.12 | 64.63 | 52.12 |
| 32-shot | + DAIG | 31.02 | 63.71 | 75.81 | 61.51 |
| 32-shot | SPRC† | 29.88 | 57.61 | 69.25 | 62.46 |
| 32-shot | + DAIG | 42.05 | 72.41 | 82.00 | 71.87 |
| 100% | SPRC† | 52.05 | 82.22 | 89.98 | 81.27 |
| 100% | + DAIG | 53.88 | 84.10 | 90.60 | 82.40 |

Synthesis Dataset Comparison

Using only 20k synthetic triplets (DAIG-DRSP) outperforms existing datasets with millions of samples.

| Dataset | Scale | CIRR Avg. | FashionIQ Avg@10 |
|---|---|---|---|
| ST18M | 18M | 62.47 | 30.97 |
| DAIG-DRSP (Ours) | 20k | 71.68 | 44.74 |
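
As a quick check of the data-efficiency claim, the figures in the comparison above can be turned into a scale ratio and absolute gains. The variable names are illustrative; the numbers are copied from the comparison and nothing else is assumed.

```python
# Figures copied from the dataset comparison above.
st18m = {"scale": 18_000_000, "cirr_avg": 62.47, "fiq_avg10": 30.97}
daig = {"scale": 20_000, "cirr_avg": 71.68, "fiq_avg10": 44.74}

scale_ratio = st18m["scale"] / daig["scale"]                  # how many times fewer triplets
cirr_gain = round(daig["cirr_avg"] - st18m["cirr_avg"], 2)    # absolute points
fiq_gain = round(daig["fiq_avg10"] - st18m["fiq_avg10"], 2)   # absolute points

print(f"{scale_ratio:.0f}x fewer triplets, "
      f"+{cirr_gain} CIRR Avg., +{fiq_gain} FashionIQ Avg@10")
```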

Ablation Study

| Stage | Configuration | CIRR R@5 | FashionIQ Avg@10 |
|---|---|---|---|
| DRSP | ZSIG (Zero-shot Gen) | 66.10 | 39.66 |
| DRSP | + CIR-LoRA | 71.01 | 43.89 |
| DRSP | + Feature Perturbation | 72.02 | 45.02 |
| FRA | w/ Angular Margin φ | 75.52 | 45.51 |

Highlights & Insights

  • Shifting from "Frozen" to "Few-shot Fine-tuned" Generation: A small change that addresses both domain bias and task prior deficiency.
  • MoE for Task Diversity: MoE routing handles the wide range of editing operations in relative descriptions better than a single LoRA.
  • Zero-cost Gain via DRSP: Perturbing feature statistics to broaden sparse distributions is a clean, plug-and-play regularization for any "train on synthetic, test on real" scenario.

Limitations & Future Work

  • Computational Dependency: Relies on heavy T2I (Flux) and LLM (Qwen2.5) models for the generation pipeline.
  • Object/Edit Set Construction: These sets currently require manual or heuristic definition, which introduces a new dependency.
  • Benchmark Coverage: Validated primarily on FashionIQ and CIRR; performance on more open-domain or long-tail datasets remains to be explored.

Comparison with Related Work

  • vs CoAlign: By fine-tuning instead of freezing the model, DAIG generates unbiased domain-adaptive data that outperforms CoAlign's 534k triplets with only 20k samples.
  • vs PromptCLIP / PTG: DAIG demonstrates that "generating high-quality data" is more effective than "prompt tuning" in few-shot CIR settings.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐