
Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization

Conference: NeurIPS 2025 · arXiv: 2503.07038 · Code: Available (GitHub) · Area: Computer Vision / Image Retrieval
Keywords: Small Object Image Retrieval, Multi-Object Attention Optimization, Image Retrieval, Attention Interpretability, Global Descriptor

TL;DR

MaO is a novel approach to Small Object Image Retrieval (SoIR) that integrates multi-object pre-training with attention-based feature refinement. By aggregating the representations of multiple objects into a single global descriptor, it achieves substantial improvements over existing retrieval methods across multiple benchmarks.

Background & Motivation

Retrieving images containing a specific small object from a large-scale image corpus is akin to finding a needle in a haystack. Conventional instance-based image retrieval (IBIR) methods are primarily evaluated on datasets with large, centered objects (e.g., RParis6K, ROxford5K), where objects occupy on average 40% of the image area. In real-world scenarios, however, target objects are often very small (potentially occupying as little as 0.5% of the image area) and appear alongside numerous distracting objects.

Existing methods face three key challenges:

Insufficient small-object representation: Global encoders tend to focus on large objects and background regions, so the features of small objects are drowned out.

Multi-object interference: The presence of multiple objects of the same category within a scene leads to representational confusion.

Single-descriptor constraint: Efficient retrieval requires a single compact descriptor per image, which conflicts with the need to accurately represent multiple small objects.

Method

Overall Architecture

MaO operates in two stages (as illustrated in Figure 2):

Stage A — Multi-Object Fine-tuning:

  1. An open-vocabulary detector (OVD, specifically OWLv2) decomposes each image into \(k\) object crops.
  2. Each crop is encoded independently, yielding \(k\) feature vectors \(\{v_1, \dots, v_k\} \subset \mathbb{R}^d\).
  3. These are fused into a global descriptor \(v_c\) via average pooling.
  4. The model is trained with an InfoNCE contrastive loss to align \(v_c\) with the query object feature \(v_q\).
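
A minimal PyTorch sketch of steps 3–4, assuming L2-normalized embeddings; the function names, temperature `tau`, and batch-negative scheme are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def fuse_crops(crop_feats: torch.Tensor) -> torch.Tensor:
    """Average-pool k crop embeddings of shape (k, d) into one
    L2-normalized global descriptor v_c of shape (d,)."""
    return F.normalize(crop_feats.mean(dim=0), dim=-1)

def info_nce(v_c: torch.Tensor, v_q: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Batched InfoNCE: fused descriptor v_c[i] should match its own query
    embedding v_q[i]; the other queries in the batch act as negatives."""
    logits = v_c @ v_q.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(v_c.size(0), device=v_c.device)
    return F.cross_entropy(logits, targets)
```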

Stage B — Multi-Object Attention Optimization (post-training refinement):

  1. LeGrad is employed to generate interpretability heatmaps.
  2. A single token \(\hat{v}_c\) is optimized so that the attention maps of all crops align with their corresponding object masks.
  3. The result is a refined representation that accounts for all objects simultaneously.

Key Designs

Object decomposition strategy: OWLv2 is applied in "objectness" mode to detect arbitrary objects, with a confidence threshold of 0.2. Each crop is centered on its detected object and is never smaller than the backbone input size, which effectively filters out background noise.
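
A sketch of this cropping rule; the 224-pixel backbone input size and the boundary-clamping behavior are assumptions:

```python
from PIL import Image

def crop_object(image: Image.Image, box: tuple, min_size: int = 224) -> Image.Image:
    """Square crop centered on a detected box, never smaller than the
    backbone input resolution (224 here is an assumption)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    side = max(x1 - x0, y1 - y0, min_size)
    # Shift the window so it stays inside the frame where possible.
    left = max(0.0, min(cx - side / 2, image.width - side))
    top = max(0.0, min(cy - side / 2, image.height - side))
    return image.crop((int(left), int(top), int(left + side), int(top + side)))
```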

Attention optimization objective:

\[\hat{v}_c = \arg\max_{v_c} \left[ \sum_i \mathrm{IoU}\big(E(v_c^{\top} v_i),\, m_i\big) + \alpha\, v_c^{\top} \sum_i v_i \right]\]

where \(E(\cdot)\) denotes the interpretability map generated by LeGrad, \(m_i\) is the object mask (obtained via SAM), and \(\alpha=0.03\) is a regularization weight. The regularization term prevents the optimized token from deviating excessively from the initial representation.
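
A hedged sketch of the refinement loop under these definitions; `legrad_map` is a stand-in for a differentiable LeGrad call, and the soft-IoU surrogate is an assumption, since the exact differentiable objective is not spelled out here:

```python
import torch

def refine_token(v_init, crop_feats, masks, legrad_map,
                 alpha=0.03, lr=0.1, steps=80):
    """Gradient-ascent refinement of the global token (objective above).

    `legrad_map(v_c, v_i)` stands in for a differentiable LeGrad call that
    returns the heatmap of the similarity v_c . v_i; soft IoU is used as a
    differentiable surrogate for mask agreement (an assumption).
    """
    v_c = v_init.clone().requires_grad_(True)
    opt = torch.optim.SGD([v_c], lr=lr)
    for _ in range(steps):
        loss = torch.zeros((), device=v_init.device)
        for v_i, m_i in zip(crop_feats, masks):
            heat = legrad_map(v_c, v_i)               # (H, W), values in [0, 1]
            inter = (heat * m_i).sum()
            union = heat.sum() + m_i.sum() - inter
            loss = loss - inter / (union + 1e-6)      # maximize soft IoU
        loss = loss - alpha * v_c @ crop_feats.sum(dim=0)  # stay near the crops
        opt.zero_grad()
        loss.backward()
        opt.step()
    return v_c.detach()
```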

Lightweight fine-tuning: LoRA (rank=256) is used to fine-tune the Transformer backbone on the VoxDet training set with a batch size of 128, for a single epoch. The refinement process takes 0.03 seconds per object over 80 iterations.
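
A possible way to wire this up with Hugging Face `peft`; the CLIP checkpoint and the target module names are illustrative assumptions (the paper also fine-tunes a DINOv2 backbone):

```python
from transformers import CLIPVisionModel
from peft import LoraConfig, get_peft_model

backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
# Rank-256 LoRA on the attention projections; the module names follow the
# HF CLIP implementation and are an assumption about where LoRA is applied.
lora_cfg = LoraConfig(r=256, lora_alpha=256,
                      target_modules=["q_proj", "k_proj", "v_proj"])
model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```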

Loss & Training

  • Stage A: InfoNCE contrastive loss, aligning the averaged multi-object representation with the query object representation.
  • Stage B: Gradient-based optimization that maximizes the mask-IoU objective under an embedding regularizer.
  • Optimizer: AdamW with learning rate \(5 \times 10^{-5}\), exponentially decayed to \(1 \times 10^{-6}\) (see the sketch after this list).
  • Refinement stage learning rate: \(1 \times 10^{-1}\); executed offline on gallery images.
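
A sketch of this schedule; the total step count is an assumption needed to derive the per-step decay factor:

```python
import torch

total_steps = 10_000                      # assumption; not stated above
# `model` is the LoRA-wrapped backbone from the earlier sketch.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# Per-step gamma so the learning rate decays from 5e-5 to 1e-6 by the end.
gamma = (1e-6 / 5e-5) ** (1.0 / total_steps)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

# Inside the training loop: optimizer.step(); scheduler.step()
```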

Key Experimental Results

Main Results

The authors introduce four SoIR benchmarks and evaluate using mAP:

| Method | VoxDet | PerMiR | VoxDetW | PerMiRW | INSTRE-XS | INSTRE-XXS |
|---|---|---|---|---|---|---|
| GSS (zero-shot) | 52.01 | 26.73 | 52.01 | 26.73 | 82.34 | 67.98 |
| GeM (zero-shot) | 51.08 | 25.98 | 51.08 | 25.98 | 74.74 | 53.27 |
| SuperGlobal | 47.33 | 17.48 | 47.33 | 17.48 | 56.11 | 33.02 |
| CLIP (zero-shot) | 44.52 | 26.98 | – | – | – | – |
| MaO-CLIP (zero-shot) | ~70+ | ~89+ | – | – | – | – |
| MaO-DINOv2 (fine-tuned) | 83.70 | – | 68.54 | – | – | – |

MaO outperforms conventional image-retrieval methods by 18–26 mAP points on VoxDet, demonstrating a clear advantage in multi-object interference scenarios.

Ablation Study

| Configuration | VoxDet (mAP) | VoxDetW (mAP) |
|---|---|---|
| DINOv2 backbone (zero-shot) | 51.23 | 51.23 |
| + Fine-tuning | 54.33 | 54.33 |
| + Full-image optimization | 69.54 | 48.24 |
| + Multi-object optimization (MaO) | 83.70 | 68.54 |

The ablation reveals that:

  • Fine-tuning alone yields modest gains (+3 mAP).
  • Full-image attention optimization is effective in controlled settings (+15 mAP on VoxDet) but degrades in the wild (48.24 vs. 54.33 on VoxDetW).
  • Multi-object attention optimization is the primary contributor (+14 mAP on VoxDet, +20 on VoxDetW).

Key Findings

  1. Smaller objects are harder: When objects occupy only 0.5% of the image area, MaO still achieves ~50% AP, while other methods largely fail.
  2. Resolution sensitivity: MaO effectively leverages high-resolution images to improve retrieval, whereas global methods degrade at higher resolutions.
  3. Controllable clutter effect: As the number of objects increases from 1 to 6, MaO's mAP drops only from 0.96 to 0.82.
  4. Improved attention distribution: Visualizations show that DINOv2 attention concentrates on backgrounds (e.g., shelves), while MaO effectively distributes attention across individual objects.

Highlights & Insights

  1. Clear problem formulation: SoIR is systematically defined and studied for the first time, establishing standardized benchmarks.
  2. Elegant method design: The conflicting objectives of encoding each object individually and fusing into a single descriptor are reconciled through attention optimization.
  3. Strong practicality: The final representation is a single global feature vector (512-D or 768-D), fully compatible with standard retrieval pipelines without additional storage overhead (a minimal indexing example follows this list).
  4. Inventive application of LeGrad: Repurposing an interpretability tool as a feature optimization signal is a notably creative contribution.
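
To make the compatibility concrete, a minimal FAISS sketch over random placeholder descriptors (768-D, matching one of the dimensions mentioned above):

```python
import numpy as np
import faiss

d = 768                                         # descriptor dimension
gallery = np.random.randn(10_000, d).astype("float32")  # placeholder MaO descriptors
faiss.normalize_L2(gallery)

index = faiss.IndexFlatIP(d)                    # inner product == cosine after L2 norm
index.add(gallery)

query = np.random.randn(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)           # top-10 nearest gallery images
```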

Limitations & Future Work

  1. Dependence on OVD quality: Missed detections by the object detector (e.g., only 63% recall on VoxDet at IoU=0.9) prevent certain objects from being encoded into the global representation.
  2. Degradation in high-density scenes: When OVD detects more than 25 objects, specific small targets may be underweighted.
  3. Refinement stage overhead: At 0.03 seconds per object, costs accumulate in multi-object scenarios (though offline execution is feasible).
  4. Cross-domain generalization unexplored: The VoxDet training set consists of synthetic 3D data; generalization to real-world data remains to be validated.
  5. Insufficient handling of occlusion: Feature extraction for overlapping objects remains challenging.

Related Work & Comparisons

  • MaskInversion: Uses interpretability maps to optimize single-object representations; MaO extends this paradigm to multi-object scenarios.
  • α-CLIP: A CLIP variant incorporating an additional mask channel; performs well in multi-object settings but still falls short of MaO.
  • PDM: Employs diffusion models for personalized retrieval, but incurs high computational costs unsuitable for large-scale global search.
  • Insight: Interpretability tools such as LeGrad can serve not only as analysis instruments but also, run in reverse, as guidance signals for feature optimization.

Rating

  • Novelty: ★★★★☆ — First systematic treatment of SoIR; attention optimization approach is original.
  • Technical Depth: ★★★★☆ — Two-stage framework is clearly designed with a mathematically grounded optimization objective.
  • Experimental Thoroughness: ★★★★★ — Multiple benchmarks, multiple backbones, detailed ablations, and visualization analyses.
  • Practicality: ★★★★★ — Single-vector retrieval, fully compatible with standard retrieval pipelines.
  • Overall Recommendation: ★★★★☆