Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

Conference: CVPR 2026
arXiv: 2604.05393
Code: Project Page
Area: Object Detection / Image Retrieval
Keywords: Composed Image Retrieval, Instance-Level Consistency, Attention Modulation, Fine-Grained Retrieval, Visual Anchoring

TL;DR

This paper proposes Object-Anchored Composed Image Retrieval (OACIR), a new task formulation, together with the large-scale OACIRR benchmark (160K+ quadruplets) and the AdaFocal framework. AdaFocal uses a context-aware attention modulator to adaptively strengthen focus on the anchored instance region, and reaches an average score of 79.00 on OACIRR versus 74.05 for SPRC trained on the same data.

Background & Motivation

Background: Composed Image Retrieval (CIR) enables flexible retrieval via multimodal queries combining a reference image and modification text, with broad applications in e-commerce and interactive search.

Limitations of Prior Work: CIR inherently prioritizes semantic matching and treats the reference image as only a coarse-grained visual anchor, so it cannot reliably retrieve the user-specified instance when visually similar distractors are present.

Practical Need: In scenarios such as digital memory retrieval and long-term identity tracking, ensuring instance-level fidelity is more critical than broad semantic alignment.

Key Challenge: The task requires simultaneously achieving (1) compositional reasoning over three information sources (anchored instance + global scene + text modification) and (2) precise discrimination of the target instance from a gallery dense with visually similar distractors.

Core Idea: By combining explicit bounding-box visual anchoring with an adaptive attention enhancement mechanism, the approach elevates CIR from semantic-level to instance-level retrieval.

Method

Overall Architecture

Query branch: $(I_r, B_r, T_m)$ → Image encoder → CAAM predicts modulation scalar $\beta$ → Attention activation mechanism enhances instance region → Multimodal encoder → Query representation $f_q$
Target branch: $I_t$ → Image encoder → Multimodal encoder → Target representation $f_t$
Training: Contrastive learning loss aligns representations from both branches.

Key Designs

Context-Aware Attention Modulator (CAAM):
The reference image and modification text are fed into the multimodal encoder, along with $K$ learnable context probe tokens.
Probe tokens learn contextual cues through interaction with the multimodal input.
A Transformer-based Contextual Reasoning Module (CRM) aggregates and reasons over these cues, producing a modulation scalar $\beta$ via linear projection.
Design Motivation: The degree of instance focus should vary dynamically with query context. When the modification text demands large scene changes, instance attention should be relaxed; when only the background changes, instance attention should be intensified.
Attention Activation Mechanism: $\beta$ is injected as a dynamic bias into the cross-attention of the query branch:
$$\{\hat{q}_m\} = \text{Softmax}\left(\frac{QK^T + \beta \cdot M_{B_r}}{\sqrt{d_k}}\right)V$$
where $M_{B_r}$ is a binary mask spatially aligned with the bounding box $B_r$. $\beta > 0$ amplifies attention over the instance region, enabling adaptive focus.
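To make the bias term concrete, the following is a minimal pure-Python sketch of the formula above, not the authors' implementation. It assumes the keys are image patches laid out on a grid and that the mask has one entry per key shared by all queries; the helper names (`box_mask`, `biased_attention`) and the end-exclusive box layout are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def box_mask(grid_h, grid_w, box):
    """Flattened binary mask over a grid of image patches.

    box = (row0, col0, row1, col1) in patch units, end-exclusive.
    This layout is an assumption made for the sketch.
    """
    r0, c0, r1, c1 = box
    return [1.0 if (r0 <= r < r1 and c0 <= c < c1) else 0.0
            for r in range(grid_h) for c in range(grid_w)]

def biased_attention(Q, K, V, mask, beta):
    """Compute Softmax((Q K^T + beta * M) / sqrt(d_k)) V.

    The mask has one entry per key and is shared by all queries.
    Returns (outputs, attention_weights).
    """
    scale = math.sqrt(len(K[0]))
    outputs, weights = [], []
    for q in Q:
        logits = [(sum(a * b for a, b in zip(q, k)) + beta * m) / scale
                  for k, m in zip(K, mask)]
        w = softmax(logits)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Toy example: 2x2 patch grid, bounding box covering only the top-left patch.
mask = box_mask(2, 2, (0, 0, 1, 1))
Q = [[1.0, 0.0]]                      # one query token
K = [[0.5, 0.5]] * 4                  # identical keys, so unbiased attention is uniform
V = [[1.0], [2.0], [3.0], [4.0]]

_, w_plain = biased_attention(Q, K, V, mask, beta=0.0)
_, w_focus = biased_attention(Q, K, V, mask, beta=2.0)
print(w_plain[0][0], w_focus[0][0])   # weight on the anchored patch grows with beta
```

With $\beta = 0$ the computation reduces to ordinary scaled dot-product attention, which corresponds to the "w/o CAAM" ablation setting; a positive $\beta$ shifts probability mass toward patches inside the box without zeroing out the rest of the scene.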
OACIRR Benchmark Construction (four-stage pipeline):
Image Pair Collection: Same-instance, cross-context image pairs are extracted from DeepFashion2, Stanford Cars, Products-10K, and Google Landmarks v2.
Image Pair Filtering: Overly similar pairs are removed (to prevent shortcut learning), along with category-centroid images.
Quadruplet Annotation: Modification texts are generated by an MLLM; bounding boxes are annotated by a grounding model.
Gallery Construction: Hard negatives (same-category but different-instance distractors) are mined in a targeted manner.
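The gallery construction step can be illustrated with a small sketch. This is not the benchmark's actual mining procedure: the record fields (`id`, `category`, `instance`) and the function `build_gallery` are hypothetical, and the simple truncation stands in for whatever ranking the real pipeline uses to select the hardest distractors.

```python
def build_gallery(images, target_id, num_negatives=2):
    """Return the target followed by hard negatives.

    A hard negative here is an image from the same category as the target
    but showing a different instance (assumed record schema).
    """
    by_id = {img["id"]: img for img in images}
    target = by_id[target_id]
    hard = [img for img in images
            if img["category"] == target["category"]
            and img["instance"] != target["instance"]]
    # Placeholder selection: keep the first few candidates in input order.
    return [target] + hard[:num_negatives]

images = [
    {"id": "a1", "category": "car", "instance": "car_001"},
    {"id": "a2", "category": "car", "instance": "car_001"},   # same instance, not a negative
    {"id": "b1", "category": "car", "instance": "car_002"},   # hard negative
    {"id": "c1", "category": "car", "instance": "car_003"},   # hard negative
    {"id": "d1", "category": "dress", "instance": "dress_001"},  # different category, easy negative
]

gallery = build_gallery(images, target_id="a1")
print([img["id"] for img in gallery])
```

Same-instance images are excluded from the negatives because they would count as valid matches under an instance-level metric.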

Loss & Training

Contrastive Alignment Loss: In-batch contrastive learning maximizes cosine similarity between correct query–target pairs.
Differentiated learning rates: CAAM at 1e-4; multimodal encoder at 1e-5.
Temperature parameter $\tau = 0.07$.
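As a rough illustration of the in-batch contrastive objective, the sketch below computes a one-directional (query to target) InfoNCE-style loss over cosine similarities with $\tau = 0.07$. Whether the paper uses a one-directional or symmetric form is not stated in this summary, so the direction is an assumption, and the function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def in_batch_contrastive_loss(queries, targets, tau=0.07):
    """Mean cross-entropy over the batch, where targets[i] is the positive
    for queries[i] and every other target in the batch is a negative."""
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, tj)) / tau for tj in t]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))  # stable log-sum-exp
        total += log_z - logits[i]
    return total / len(q)

f_q = [[1.0, 0.0], [0.0, 1.0]]        # toy query representations
f_t = [[2.0, 0.0], [0.0, 3.0]]        # matching targets (same directions)

aligned = in_batch_contrastive_loss(f_q, f_t)
swapped = in_batch_contrastive_loss(f_q, f_t[::-1])
print(aligned, swapped)               # aligned pairs give a much smaller loss
```

A small temperature such as 0.07 sharpens the softmax, so even modest cosine-similarity gaps between the positive and the in-batch negatives produce large differences in the loss.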

Key Experimental Results

Main Results (OACIRR Benchmark, ViT-G Backbone)

Method	Fashion $R_{ID}@1$	Car $R_{ID}@1$	Product $R_{ID}@1$	Landmark $R_{ID}@1$	Avg
GME (7B)	44.98	63.11	83.44	77.11	62.53
SPRC (trained on CIRR)	28.62	25.13	54.39	40.41	37.30
SPRC (trained on OACIRR)	65.25	72.87	86.05	76.32	74.05
AdaFocal	77.15	78.42	91.86	82.92	79.00
Note: Avg is not the mean of the four $R_{ID}@1$ columns shown. For AdaFocal those four values average 82.59, which is the $R_{ID}@1$ entry in the ablation table, so Avg appears to also include metrics not listed here (such as R@1).

Ablation Study

Configuration	$R_{ID}@1$	R@1	Avg	Note
w/o CAAM ($\beta=0$)	77.74	58.39	74.91	Baseline
Average pooling + frozen probes	79.70	59.84	76.39	Simple aggregation insufficient
Transformer CRM + learnable probes	82.59	62.88	79.00	Reasoning capacity + task adaptation

Key Findings

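Training on OACIRR data raises SPRC's Avg from 37.30 to 74.05 (+36.75 points), which suggests that instance-consistent training data is the dominant factor.
AdaFocal adds a further 4.95 points over SPRC trained on the same data (74.05 → 79.00), indicating that adaptive attention modulation is effective.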
The gap between $R@1$ and $R_{ID}@1$ reveals that the primary failure mode of existing methods is instance misidentification.

Highlights & Insights

Advancing CIR from semantic-level to instance-level retrieval represents an important paradigm shift in the retrieval community.
OACIRR is the first large-scale instance-level composed retrieval benchmark spanning four domains, offering substantial community value.
The context-aware modulation mechanism in CAAM elegantly balances instance fidelity with compositional reasoning.

Limitations & Future Work

Bounding box annotation increases user interaction cost; future work may explore automatic instance anchoring.
The current framework supports only single-instance anchoring; multi-instance scenarios remain to be addressed.
Video-level instance tracking retrieval has not been explored.

Related Work & Insights

OACIR shares the instance-consistency objective with Re-ID (person re-identification) but is more general in scope.
The attention bias injection idea draws on generative models (e.g., Prompt-to-Prompt) and is transferred here to retrieval.
The approach is directly applicable to product search, digital asset management, and related applications.

Rating

Novelty: ⭐⭐⭐⭐⭐ New task definition + new benchmark + new method, a trifecta of contributions.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Cross-paradigm comparisons, comprehensive ablations, and complete qualitative analysis.
Writing Quality: ⭐⭐⭐⭐ Clear structure with detailed description of the dataset construction pipeline.
Value: ⭐⭐⭐⭐⭐ The problem formulation and benchmark contributions will advance the retrieval field.
