Adapting In-context Generation for Enhanced Composed Image Retrieval¶
Conference: CVPR 2026
Paper: CVF Open Access
Code: https://github.com/JThuge/DAIG
Area: Multimodal VLM / Image-Text Retrieval
Keywords: Composed Image Retrieval, Few-shot, T2I Fine-tuning, Synthetic Triplets, Domain Adaptation
TL;DR¶
This paper proposes DAIG, which uses 32 target-domain samples for in-context fine-tuning (CIR-LoRA) of a pre-trained T2I model (Flux), so that the model can synthesize "unbiased, domain-aligned" Composed Image Retrieval (CIR) triplets at scale. A two-stage training framework (feature-perturbed pre-training, DRSP, followed by angular-margin fine-tuning, FRA) then uses this synthetic data to improve any off-the-shelf CIR model, achieving significant performance gains on CIRR and FashionIQ in a plug-and-play manner with zero additional inference cost.
Background & Motivation¶
Background: The input for Composed Image Retrieval (CIR) is a dual-modality query consisting of a "reference image \(I_r\) + a relative description \(T_c\)." The goal is to retrieve a target image \(I_t\) from a gallery that matches the user's modification intent. Supervised CIR methods (CLIP4CIR, BLIP4CIR, SPRC, etc.) perform well due to cross-modal alignment in VLMs but rely heavily on human-annotated \((I_r, T_c, I_t)\) triplets.
Limitations of Prior Work: Annotating triplets is extremely expensive, making supervised CIR difficult to scale. Zero-shot CIR (ZS-CIR) attempts to bypass annotation, but three mainstream routes have flaws: inversion networks (mapping images to pseudo-tokens), training-free LLM inference (slow and complex), and triplet synthesis (CompoDiff/VISTA/CoAlign). Triplet synthesis has the most potential, but synthesized data lacks target domain knowledge, leading to a hard-to-eliminate domain gap.
Key Challenge: The work most relevant to this paper, CoAlign, uses a frozen T2I model for zero-shot in-context generation: an LLM writes text triplets and fills them into a layout template, and the T2I model generates "two semantically related sub-images" in one forward pass. However, it has two fundamental issues: (1) a large distribution shift between generated images and the real target domain; (2) a lack of task priors, which leads to overly similar backgrounds in the two sub-images and introduces bias when they are used as \(I_r\) and \(I_t\). In other words, freely generated triplets are biased and noisy, and remain of limited use even after filtering.
Goal: To generate clean, unbiased CIR training triplets that align with the target domain using only a few annotations (few-shot), even as few as 32 samples, and to enhance any existing CIR model with these samples.
Key Insight: The authors observe that LoRA can capture and align with a target-domain distribution from very few samples, while in-context descriptions can inject the CIR task objective into a T2I model. The approach therefore shifts from "generation with a frozen model" to "generation with a few-shot fine-tuned model."
Core Idea: Use 32 target domain samples for parameter-efficient in-context fine-tuning of a T2I model (CIR-LoRA) to inject domain and task priors simultaneously, generating unbiased domain-adaptive triplets. A two-stage framework (robust pre-training on synthetic data + fine-grained fine-tuning on real data) is then used to enhance off-the-shelf CIR models.
Method¶
Overall Architecture¶
DAIG consists of three serial components: (i) In-context Generative Fine-tuning → (ii) Domain-Adaptive In-context Generation → (iii) Two-stage CIR Training Framework.
The first component takes 32 target-domain triplets as input and outputs a T2I model \(G'\) that "understands" both the domain and the CIR task: each triplet is converted into a stitched image plus an in-context description, which is used to fine-tune Flux with CIR-LoRA. The second component pairs \(G'\) with an LLM for batch data creation: the LLM generates ~20k text triplets from (object, edit) templates, and \(G'\) renders them into the synthetic triplet set \(S'\). The third component uses \(S'\) and the real annotations \(S\) in two stages: distribution robust synthetic pre-training (DRSP) on \(S'\), followed by fine-grained real adaptation (FRA) with an angular margin on \(S\).
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["32 Target Domain Triplets<br/>Stitched Images + In-context Descriptions"] --> B["CIR-LoRA<br/>In-context Generative Fine-tuning<br/>MoE-LoRA injected into Cross-Attention"]
B --> C["Domain-Adaptive T2I Generation<br/>LLM writes text triplets → G' synthesizes<br/>≈20k unbiased triplets S'"]
C --> D["DRSP Robust Synthetic Pre-training<br/>Gaussian Perturbation of Visual Features + Contrastive Learning"]
D --> E["FRA Fine-grained Real Adaptation<br/>Diagonal matching pairs with Angular Margin φ"]
E --> F["Enhanced CIR Model<br/>Zero additional inference cost"]
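The construction of one fine-tuning sample (stitched image plus in-context description) can be sketched as follows. This is a minimal illustration rather than the authors' code: images are toy nested lists of pixel values, and the wording inside `build_in_context_description` is a hypothetical stand-in for the paper's template, which is not reproduced in these notes.

```python
from typing import List

Image = List[List[int]]  # toy grayscale image: a list of pixel rows


def stitch_horizontally(ref: Image, tgt: Image) -> Image:
    """Place the reference image on the left and the target image on the right."""
    if len(ref) != len(tgt):
        raise ValueError("images must have the same height to be stitched")
    return [ref_row + tgt_row for ref_row, tgt_row in zip(ref, tgt)]


def build_in_context_description(t_r: str, t_c: str, t_t: str) -> str:
    """Combine both captions and the relative description into T_ic.

    The template text is an assumption made for illustration only.
    """
    return (
        "A side-by-side pair of images. "
        f"[LEFT] {t_r} [RIGHT] {t_t} "
        f"The right image differs from the left as follows: {t_c}"
    )


ref = [[0, 0], [0, 0]]
tgt = [[1, 1], [1, 1]]
stitched = stitch_horizontally(ref, tgt)
t_ic = build_in_context_description(
    "a dog sitting on grass", "make the dog lie down", "a dog lying on grass"
)
```

Keeping both sub-images and all three texts in a single training sample is what lets the fine-tuned model see the reference, the edit, and the result together.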
Key Designs¶
1. CIR-LoRA: Simultaneous Injection of Domain and Task Priors
To address the bias of frozen T2I models, the authors perform parameter-efficient fine-tuning with 32 samples. For each target-domain triplet, \(I_r\) and \(I_t\) are stitched horizontally, and a captioner generates descriptions \(T_r\) and \(T_t\) for the two images. These captions, together with \(T_c\), are filled into a template to obtain an in-context description \(T_{ic}\). During fine-tuning, the backbone is frozen while a learnable CIR-LoRA is integrated into the cross-attention layers.
The key design is a Mixture of Experts (MoE) over the low-rank update of each projection weight \(W\): a routing function \(r\) assigns weights to the expert pairs \((B_i, A_i)\) based on the characteristics of the description. This lets the model handle diverse edit operations (add/delete/replace/viewpoint) by emphasizing the experts best suited to each one.
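The exact update rule is not reproduced in these notes. Assuming the standard MoE-LoRA form, the adapted projection would read

\[ W'x = Wx + \sum_{i=1}^{N} r_i \, B_i A_i x, \qquad \sum_{i=1}^{N} r_i = 1, \]

where \(N\) is the number of experts. The sketch below implements this assumed form in plain Python. The class name `MoELoRALinear`, the softmax gate computed from the layer input, and the zero initialisation of \(B_i\) are illustrative choices, not details confirmed by the source.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(z: Vector) -> Vector:
    peak = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class MoELoRALinear:
    """Frozen weight W plus a routed sum of low-rank expert updates B_i A_i."""

    def __init__(self, w: Matrix, rank: int, num_experts: int, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(w), len(w[0])
        self.w = w  # frozen backbone weight, never updated
        self.a = [rand_matrix(rank, d_in, rng) for _ in range(num_experts)]
        # B starts at zero (usual LoRA init), so the layer initially equals W.
        self.b = [[[0.0] * rank for _ in range(d_out)] for _ in range(num_experts)]
        self.gate = rand_matrix(num_experts, d_in, rng)  # assumed linear router

    def route(self, x: Vector) -> Vector:
        return softmax(matvec(self.gate, x))

    def forward(self, x: Vector) -> Vector:
        out = matvec(self.w, x)
        for r_i, a_i, b_i in zip(self.route(x), self.a, self.b):
            delta = matvec(b_i, matvec(a_i, x))
            out = [o + r_i * d for o, d in zip(out, delta)]
        return out


layer = MoELoRALinear([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], rank=2, num_experts=4)
x = [0.5, -1.0, 2.0]
base_output = layer.forward(x)
```

Only `a`, `b`, and `gate` would be trained in such a setup, which keeps the number of learnable parameters small relative to the frozen backbone.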
2. DAIG Generation Pipeline: LLM Scripting + G' Image Synthesis
To scale up the data, an LLM is prompted with templates \(P(\text{object}, \text{edit})\), constrained by target-domain examples, to output text triplets \((T_r, T_c, T_t)\) in JSON. These are converted to in-context descriptions from which \(G'\) synthesizes \(S'=\{(I_r^i, T_c^i, I_t^i)\}_{i=1}^M\). According to the authors, this yields diverse, high-fidelity triplets without the additional filtering that CoAlign requires.
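The text side of this pipeline (prompt construction and validation of the JSON reply) can be sketched as below. The prompt wording, the key names `T_r`/`T_c`/`T_t`, and both function names are hypothetical; no real LLM is called, and a canned reply stands in for the model output.

```python
import json
from typing import Dict, List

# Hypothetical prompt wording; the paper's template P(object, edit) is not shown here.
PROMPT_TEMPLATE = (
    "Object: {obj}\n"
    "Edit type: {edit}\n"
    "Target-domain examples:\n{examples}\n"
    'Answer with one JSON object: {{"T_r": "...", "T_c": "...", "T_t": "..."}}'
)
REQUIRED_KEYS = ("T_r", "T_c", "T_t")


def build_prompt(obj: str, edit: str, examples: List[Dict[str, str]]) -> str:
    """Fill the (object, edit) template and append target-domain examples."""
    shown = "\n".join(json.dumps(example) for example in examples)
    return PROMPT_TEMPLATE.format(obj=obj, edit=edit, examples=shown)


def parse_text_triplet(raw: str) -> Dict[str, str]:
    """Parse an LLM reply and reject it unless all three texts are present."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [
        key for key in REQUIRED_KEYS
        if not isinstance(data.get(key), str) or not data[key].strip()
    ]
    if missing:
        raise ValueError(f"missing or empty fields: {missing}")
    return {key: data[key].strip() for key in REQUIRED_KEYS}


examples = [
    {"T_r": "a red dress", "T_c": "make it sleeveless", "T_t": "a red sleeveless dress"}
]
prompt = build_prompt("dress", "replace attribute", examples)

# Canned reply standing in for a real LLM call.
reply = '{"T_r": "a blue shirt", "T_c": "add a collar", "T_t": "a blue shirt with a collar"}'
triplet = parse_text_triplet(reply)
```

Validating the reply before image synthesis is a cheap guard: a triplet with a missing caption would otherwise produce an incomplete in-context description for \(G'\).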
3. DRSP: Distribution Robust Synthetic Pre-training
Synthetic T2I images often form a sparse distribution relative to the real domain. The authors model the statistics (mean \(\mu(v)\), std \(\sigma(v)\)) of the visual features \(v\) as a multivariate Gaussian and apply reparameterized perturbations to them, which yields a perturbed feature \(\tilde v\).
\(\tilde v\) then replaces \(v\) in the alignment objective. This "stretches" the sparse synthetic distribution and improves generalization; because the perturbation is applied only during training, it adds zero inference cost.
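The perturbation equation itself is not reproduced in these notes. One common way to realise the description above (an assumption here, not a confirmed detail of DAIG) is to resample the feature statistics around their batch-level spread:

\[ \tilde v = \big(\sigma(v) + \epsilon_\sigma \Sigma_\sigma\big)\,\frac{v - \mu(v)}{\sigma(v)} + \big(\mu(v) + \epsilon_\mu \Sigma_\mu\big), \qquad \epsilon_\mu, \epsilon_\sigma \sim \mathcal{N}(0, 1), \]

where \(\Sigma_\mu\) and \(\Sigma_\sigma\) denote the standard deviations of \(\mu(v)\) and \(\sigma(v)\) across the batch. The plain-Python sketch below follows this assumed form; `perturb_batch`, `strength`, and the token-by-channel feature layout are illustrative names and choices.

```python
import math
import random
from typing import List, Tuple

Tokens = List[List[float]]  # one sample: [num_tokens][dim]
EPS = 1e-6  # keeps the standard deviations strictly positive


def channel_stats(tokens: Tokens) -> Tuple[List[float], List[float]]:
    """Per-channel mean and std over the tokens of one sample."""
    n, dim = len(tokens), len(tokens[0])
    mean = [sum(t[d] for t in tokens) / n for d in range(dim)]
    std = [
        math.sqrt(sum((t[d] - mean[d]) ** 2 for t in tokens) / n + EPS)
        for d in range(dim)
    ]
    return mean, std


def batch_spread(stats: List[List[float]]) -> List[float]:
    """Per-channel std of a statistic across the batch (Sigma in the formula)."""
    b, dim = len(stats), len(stats[0])
    spread = []
    for d in range(dim):
        column = [s[d] for s in stats]
        centre = sum(column) / b
        spread.append(math.sqrt(sum((c - centre) ** 2 for c in column) / b + EPS))
    return spread


def perturb_batch(batch: List[Tokens], rng: random.Random,
                  strength: float = 1.0) -> List[Tokens]:
    """Training-time only: resample each sample's mean/std, then re-normalise."""
    stats = [channel_stats(tokens) for tokens in batch]
    means = [s[0] for s in stats]
    stds = [s[1] for s in stats]
    sigma_mu, sigma_sigma = batch_spread(means), batch_spread(stds)
    out = []
    for tokens, mean, std in zip(batch, means, stds):
        dim = len(mean)
        new_mean = [mean[d] + strength * rng.gauss(0, 1) * sigma_mu[d] for d in range(dim)]
        new_std = [std[d] + strength * rng.gauss(0, 1) * sigma_sigma[d] for d in range(dim)]
        out.append([
            [new_std[d] * (t[d] - mean[d]) / std[d] + new_mean[d] for d in range(dim)]
            for t in tokens
        ])
    return out


batch = [
    [[1.0, 2.0], [3.0, 6.0]],
    [[0.0, 1.0], [2.0, 5.0]],
    [[4.0, 0.0], [8.0, 2.0]],
]
perturbed = perturb_batch(batch, random.Random(0))
```

At inference time the function is simply not called, which is consistent with the zero-inference-cost claim.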
4. FRA: Fine-grained Real Adaptation
FRA fine-tunes on the real annotations \(S\) by adding an angular margin \(\varphi\) to the matching pairs (diagonal elements) of the query-target similarity matrix.
Because matched pairs must clear this margin, the model is forced to learn more discriminative representations, which helps bridge the remaining domain gap.
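The loss is not written out in these notes. Assuming an ArcFace-style additive angular margin inside a standard contrastive objective, it would take the form

\[ \mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\big(\cos(\theta_{ii} + \varphi)/\tau\big)}{\exp\big(\cos(\theta_{ii} + \varphi)/\tau\big) + \sum_{j \neq i} \exp\big(\cos\theta_{ij}/\tau\big)}, \]

where \(\theta_{ij}\) is the angle between query \(i\) and target \(j\), \(B\) is the batch size, and the temperature \(\tau\) is an assumed hyperparameter. A minimal sketch of this assumed form, with hypothetical function names:

```python
import math
from typing import List

Matrix = List[List[float]]


def angular_margin_logits(sim: Matrix, margin: float) -> Matrix:
    """Replace each diagonal cosine s_ii by cos(arccos(s_ii) + margin)."""
    out = []
    for i, row in enumerate(sim):
        new_row = list(row)
        clipped = max(-1.0, min(1.0, row[i]))  # guard against rounding outside [-1, 1]
        new_row[i] = math.cos(math.acos(clipped) + margin)
        out.append(new_row)
    return out


def contrastive_loss(sim: Matrix, margin: float = 0.0, temperature: float = 0.07) -> float:
    """Query-to-target cross-entropy where row i should match column i."""
    logits = angular_margin_logits(sim, margin)
    total = 0.0
    for i, row in enumerate(logits):
        scaled = [s / temperature for s in row]
        peak = max(scaled)  # log-sum-exp with max subtraction for stability
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        total += log_z - scaled[i]
    return total / len(logits)


# Toy cosine similarities between 2 composed queries (rows) and 2 targets (columns).
sim = [[0.9, 0.2], [0.1, 0.8]]
loss_plain = contrastive_loss(sim, margin=0.0)
loss_margin = contrastive_loss(sim, margin=0.2)
```

With the margin, an already well-aligned pair still incurs a larger loss, so training keeps pulling matched pairs together. Note that \(\cos(\theta + \varphi)\) stops decreasing once \(\theta + \varphi\) exceeds \(\pi\), so a practical implementation would need to handle that case.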
Key Experimental Results¶
Main Results¶
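On the CIRR test set, DAIG acts as a plug-and-play enhancement for three base models across the 32-shot, 1%, and 100% data settings; the table below lists a subset of these results. Avg. is not the mean of the three recall columns shown; it presumably follows the usual CIRR convention, the mean of R@5 and R\(_\text{subset}\)@1, whose subset columns are omitted here.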
| Setting | Method | R@1 | R@5 | R@10 | Avg. |
|---|---|---|---|---|---|
| 32-shot | CLIP4CIR† | 22.87 | 52.12 | 64.63 | 52.12 |
| 32-shot | + DAIG | 31.02 | 63.71 | 75.81 | 61.51 |
| 32-shot | SPRC† | 29.88 | 57.61 | 69.25 | 62.46 |
| 32-shot | + DAIG | 42.05 | 72.41 | 82.00 | 71.87 |
| 100% | SPRC† | 52.05 | 82.22 | 89.98 | 81.27 |
| 100% | + DAIG | 53.88 | 84.10 | 90.60 | 82.40 |
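For reference, the R@K values in the table are recall percentages: the share of queries whose ground-truth target appears among the top K retrieved gallery images. A minimal sketch, with toy rankings and a hypothetical helper name `recall_at_k`:

```python
from typing import Sequence


def recall_at_k(rankings: Sequence[Sequence[str]], targets: Sequence[str], k: int) -> float:
    """Percentage of queries whose target is among the top-k ranked gallery ids."""
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return 100.0 * hits / len(targets)


# Toy example: 4 queries, each with a ranked gallery of 3 image ids.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "a", "b"],
    ["a", "c", "b"],
]
targets = ["a", "a", "b", "b"]

r1 = recall_at_k(rankings, targets, 1)  # 25.0
r2 = recall_at_k(rankings, targets, 2)  # 50.0
r3 = recall_at_k(rankings, targets, 3)  # 100.0
```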
Synthesis Dataset Comparison¶
Pre-training on only 20k synthetic triplets (DAIG-DRSP) outperforms pre-training on existing synthetic datasets that are orders of magnitude larger, such as ST18M with 18M samples.
| Dataset | Scale | CIRR Avg. | FashionIQ Avg@10 |
|---|---|---|---|
| ST18M | 18M | 62.47 | 30.97 |
| DAIG-DRSP (Ours) | 20k | 71.68 | 44.74 |
Ablation Study¶
| Stage | Configuration | CIRR R@5 | FashionIQ Avg@10 |
|---|---|---|---|
| DRSP | ZSIG (Zero-shot Gen) | 66.10 | 39.66 |
| DRSP | + CIR-LoRA | 71.01 | 43.89 |
| DRSP | + Feature Perturbation | 72.02 | 45.02 |
| FRA | w/ Angular Margin φ | 75.52 | 45.51 |
Highlights & Insights¶
- Shifting from "Frozen" to "Few-shot Fine-tuned" Generation: A small change that addresses both domain bias and task prior deficiency.
- MoE for Task Diversity: MoE routing handles the wide range of editing operations in relative descriptions better than a single LoRA.
- Zero-cost Gain via DRSP: Perturbing feature statistics to broaden sparse distributions is a clean, plug-and-play regularization for any "train on synthetic, test on real" scenario.
Limitations & Future Work¶
- Computational Dependency: Relies on heavy T2I (Flux) and LLM (Qwen2.5) models for the generation pipeline.
- Object/Edit Set Construction: These sets currently require manual or heuristic definition, which introduces a new dependency.
- Benchmark Coverage: Validated primarily on FashionIQ and CIRR; performance on more open-domain or long-tail datasets remains to be explored.
Related Work & Insights¶
- vs CoAlign: By fine-tuning instead of freezing the model, DAIG generates unbiased domain-adaptive data that outperforms CoAlign's 534k triplets with only 20k samples.
- vs PromptCLIP / PTG: DAIG demonstrates that "generating high-quality data" is more effective than "prompt tuning" in few-shot CIR settings.
Rating¶
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐