# Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

- Conference: CVPR 2026
- arXiv: 2604.05393
- Code: Project Page
- Area: Object Detection / Image Retrieval
- Keywords: Composed Image Retrieval, Instance-Level Consistency, Attention Modulation, Fine-Grained Retrieval, Visual Anchoring
## TL;DR
This paper proposes Object-Anchored Composed Image Retrieval (OACIR), a new task formulation, along with a large-scale benchmark OACIRR (160K+ quadruplets) and the AdaFocal framework. AdaFocal employs a context-aware attention modulator to adaptively enhance focus on anchored instance regions, substantially outperforming existing methods in instance-level retrieval fidelity.
## Background & Motivation

- Background: Composed Image Retrieval (CIR) enables flexible retrieval via multimodal queries that combine a reference image with modification text, with broad applications in e-commerce and interactive search.
- Limitations of Prior Work: CIR inherently prioritizes semantic matching and treats the reference image as only a coarse-grained visual anchor, so it cannot reliably retrieve a user-specified instance in the presence of visually similar distractors.
- Practical Need: in scenarios such as digital memory retrieval and long-term identity tracking, instance-level fidelity matters more than broad semantic alignment.
- Key Challenge: the task requires simultaneously achieving (1) compositional reasoning over three information sources (anchored instance + global scene + text modification) and (2) precise discrimination of the target instance within a gallery dense with visually similar distractors.
- Core Idea: combining explicit bounding-box visual anchoring with an adaptive attention enhancement mechanism elevates CIR from semantic-level to instance-level retrieval.
## Method

### Overall Architecture

- Query branch: \((I_r, B_r, T_m)\) → image encoder → CAAM predicts the modulation scalar \(\beta\) → the attention activation mechanism enhances the instance region → multimodal encoder → query representation \(f_q\).
- Target branch: \(I_t\) → image encoder → multimodal encoder → target representation \(f_t\).
- Training: a contrastive loss aligns the representations from the two branches.
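To make the dataflow concrete, here is a toy, runnable sketch. Everything in it is a hypothetical simplification: `PatchEncoder`, the linear `beta_head`, and the weighted pooling are stand-ins for the paper's large pretrained encoders and full biased cross-attention (sketched under Key Designs below).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 64  # toy embedding size

class PatchEncoder(nn.Module):
    """Stand-in image encoder: 16x16 conv patches -> token sequence."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, img):                                # (B, 3, H, W)
        return self.proj(img).flatten(2).transpose(1, 2)   # (B, N, dim)

class QueryBranch(nn.Module):
    """(I_r, B_r, T_m) -> f_q; a scalar beta gates extra weight on box tokens."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.img_enc = PatchEncoder(dim)
        self.beta_head = nn.Linear(2 * dim, 1)   # toy CAAM: context -> beta
        self.out = nn.Linear(dim, dim)

    def forward(self, ref_img, box_mask, text_emb):
        v = self.img_enc(ref_img)                                  # (B, N, dim)
        ctx = torch.cat([v.mean(1), text_emb], dim=-1)             # pooled context
        beta = self.beta_head(ctx)                                 # (B, 1)
        w = F.softmax(beta * box_mask, dim=1)                      # (B, N) weights
        f_q = self.out((w.unsqueeze(-1) * v).sum(1))               # weighted pool
        return F.normalize(f_q, dim=-1)

class TargetBranch(nn.Module):
    """I_t -> f_t: no text, no modulation."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.img_enc = PatchEncoder(dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, img):
        return F.normalize(self.out(self.img_enc(img).mean(1)), dim=-1)

# Toy usage: 224x224 images -> 196 patch tokens per image.
ref, tgt = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
box_mask = torch.zeros(2, 196); box_mask[:, :40] = 1.0  # patches inside B_r
text_emb = torch.randn(2, DIM)                          # stand-in for T_m embedding
f_q, f_t = QueryBranch()(ref, box_mask, text_emb), TargetBranch()(tgt)
cos_sim = (f_q * f_t).sum(-1)                           # retrieval score
```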
### Key Designs

- Context-Aware Attention Modulator (CAAM):
    - The reference image and modification text are fed into the multimodal encoder, along with \(K\) learnable context probe tokens.
    - The probe tokens learn contextual cues through interaction with the multimodal input.
    - A Transformer-based Contextual Reasoning Module (CRM) aggregates and reasons over these cues, producing the modulation scalar \(\beta\) via linear projection.
    - Design Motivation: the degree of instance focus should vary dynamically with the query context. When the modification text demands large scene changes, instance attention should be relaxed; when only the background changes, it should be intensified.
- Attention Activation Mechanism: \(\beta\) is injected as a dynamic bias into the cross-attention of the query branch (see the sketch after this list): \(\hat{q}_m = \text{Softmax}\left(\frac{QK^T + \beta \cdot M_{B_r}}{\sqrt{d_k}}\right)V\), where \(M_{B_r}\) is a binary mask spatially aligned with the bounding box. \(\beta > 0\) amplifies attention over the instance region, enabling adaptive focus.
- OACIRR Benchmark Construction (four-stage pipeline):
    - Image Pair Collection: same-instance, cross-context image pairs are extracted from DeepFashion2, Stanford Cars, Products-10K, and Google Landmarks v2.
    - Image Pair Filtering: overly similar pairs are removed (to prevent shortcut learning), along with category-centroid images.
    - Quadruplet Annotation: modification texts are generated by an MLLM; bounding boxes are annotated by a grounding model.
    - Gallery Construction: hard negatives (same-category but different-instance distractors) are mined in a targeted manner.
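A sketch of the two mechanisms above, under stated assumptions: the `CAAM` module is a plausible guess at the probe-token/CRM wiring (probe count, CRM depth, and pooling are not pinned down by the summary), while `box_biased_attention` directly implements the displayed equation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAAM(nn.Module):
    """Hypothetical sketch: K learnable probes attend to the multimodal
    tokens, a small Transformer (the CRM) reasons over them, and a linear
    head projects the pooled result to the scalar beta."""
    def __init__(self, dim=64, num_probes=4, num_heads=4):
        super().__init__()
        self.probes = nn.Parameter(torch.randn(num_probes, dim))
        self.gather = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.crm = nn.TransformerEncoderLayer(dim, num_heads, dim * 4,
                                              batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, mm_tokens):                     # (B, L, dim) image+text tokens
        p = self.probes.unsqueeze(0).expand(mm_tokens.size(0), -1, -1)
        p, _ = self.gather(p, mm_tokens, mm_tokens)   # probes collect contextual cues
        p = self.crm(p)                               # contextual reasoning
        return self.head(p.mean(1))                   # (B, 1): modulation scalar beta

def box_biased_attention(Q, K, V, box_mask, beta):
    """Cross-attention with the additive instance bias of the equation above:
    softmax((QK^T + beta * M_Br) / sqrt(d_k)) V.
    Q: (B, Lq, d); K, V: (B, Lk, d); box_mask: (B, Lk) binary; beta: (B, 1)."""
    d_k = Q.size(-1)
    logits = Q @ K.transpose(-2, -1)                  # (B, Lq, Lk)
    bias = (beta * box_mask).unsqueeze(1)             # broadcast over query tokens
    attn = F.softmax((logits + bias) / d_k ** 0.5, dim=-1)
    return attn @ V

# Toy usage
B, Lq, Lk, d = 2, 8, 196, 64
beta = CAAM(dim=d)(torch.randn(B, 32, d))
mask = torch.zeros(B, Lk); mask[:, :40] = 1.0         # patches inside B_r
out = box_biased_attention(torch.randn(B, Lq, d), torch.randn(B, Lk, d),
                           torch.randn(B, Lk, d), mask, beta)
```

With \(\beta > 0\), every key token inside \(M_{B_r}\) receives a uniform logit boost before the softmax, shifting attention mass toward the anchored instance; \(\beta \approx 0\) recovers plain cross-attention.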
### Loss & Training
- Contrastive Alignment Loss: In-batch contrastive learning maximizes cosine similarity between correct query–target pairs.
- Differentiated learning rates: CAAM at 1e-4; multimodal encoder at 1e-5.
- Temperature parameter \(\tau = 0.07\).
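A minimal sketch of this recipe, assuming a one-directional in-batch InfoNCE objective and hypothetical submodule names (`caam`, `mm_enc`) for the optimizer parameter groups:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_alignment_loss(f_q, f_t, tau=0.07):
    """In-batch contrastive loss: the i-th target is the positive for the
    i-th query; all other targets in the batch serve as negatives.
    f_q, f_t: (B, d), assumed L2-normalized, so f_q @ f_t.T is cosine sim."""
    logits = f_q @ f_t.t() / tau
    labels = torch.arange(f_q.size(0), device=f_q.device)
    return F.cross_entropy(logits, labels)

# Differentiated learning rates via optimizer parameter groups
# ("caam" / "mm_enc" are illustrative stand-ins for the real submodules).
model = nn.ModuleDict({"caam": nn.Linear(64, 1), "mm_enc": nn.Linear(64, 64)})
optimizer = torch.optim.AdamW([
    {"params": model["caam"].parameters(),   "lr": 1e-4},  # CAAM: 1e-4
    {"params": model["mm_enc"].parameters(), "lr": 1e-5},  # encoder: 1e-5
])

# Toy usage
f_q = F.normalize(torch.randn(4, 64), dim=-1)
f_t = F.normalize(torch.randn(4, 64), dim=-1)
loss = contrastive_alignment_loss(f_q, f_t)  # tau = 0.07
```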
## Key Experimental Results

### Main Results (OACIRR Benchmark, ViT-G Backbone)
| Method | Fashion \(R_{ID}@1\) | Car \(R_{ID}@1\) | Product \(R_{ID}@1\) | Landmark \(R_{ID}@1\) | Avg |
|---|---|---|---|---|---|
| GME (7B) | 44.98 | 63.11 | 83.44 | 77.11 | 62.53 |
| SPRC (trained on CIRR) | 28.62 | 25.13 | 54.39 | 40.41 | 37.30 |
| SPRC (trained on OACIRR) | 65.25 | 72.87 | 86.05 | 76.32 | 74.05 |
| AdaFocal | 77.15 | 78.42 | 91.86 | 82.92 | 79.00 |
### Ablation Study
| Configuration | \(R_{ID}@1\) | R@1 | Avg | Note |
|---|---|---|---|---|
| w/o CAAM (\(\beta=0\)) | 77.74 | 58.39 | 74.91 | Baseline |
| Average pooling + frozen probes | 79.70 | 59.84 | 76.39 | Simple aggregation insufficient |
| Transformer CRM + learnable probes | 82.59 | 62.88 | 79.00 | Reasoning capacity + task adaptation |
### Key Findings
- Training on OACIRR data boosts SPRC's average score from 37.30% to 74.05%: instance-consistent training data is the key factor.
- AdaFocal yields a further gain of +4.95 points (74.05% → 79.00%): adaptive attention modulation is effective.
- The gap between \(R@1\) and \(R_{ID}@1\) reveals that the primary failure mode of existing methods is instance misidentification.
## Highlights & Insights
- Advancing CIR from semantic-level to instance-level retrieval represents an important paradigm shift in the retrieval community.
- OACIRR is the first large-scale instance-level composed retrieval benchmark spanning four domains, offering substantial community value.
- The context-aware modulation mechanism in CAAM elegantly balances instance fidelity with compositional reasoning.
## Limitations & Future Work
- Bounding box annotation increases user interaction cost; future work may explore automatic instance anchoring.
- The current framework supports only single-instance anchoring; multi-instance scenarios remain to be addressed.
- Video-level instance tracking retrieval has not been explored.
## Related Work & Insights
- Shares the instance consistency objective with Re-ID (person re-identification) but is more general in scope.
- The attention bias injection idea draws from generative models (e.g., Prompt-to-Prompt) and is successfully transferred to retrieval tasks.
- Has direct applicability to product search, digital asset management, and related applications.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ New task definition + new benchmark + new method, a trifecta of contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Cross-paradigm comparisons, comprehensive ablations, and complete qualitative analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with detailed description of the dataset construction pipeline.
- Value: ⭐⭐⭐⭐⭐ The problem formulation and benchmark contributions will advance the retrieval field.