
Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization

Conference: NeurIPS 2025 · arXiv: 2503.07038 · Code: Available (GitHub) · Area: Computer Vision / Image Retrieval
Keywords: Small Object Image Retrieval, Multi-Object Attention Optimization, Image Retrieval, Attention Interpretability, Global Descriptor

TL;DR

MaO is a novel approach to Small Object Image Retrieval (SoIR) that integrates multi-object pre-training with attention-based feature refinement. By aggregating the representations of multiple objects into a single global descriptor, it achieves substantial improvements over existing retrieval methods across multiple benchmarks.

Background & Motivation

Retrieving images containing a specific small object from a large-scale image corpus is akin to finding a needle in a haystack. Conventional instance-based image retrieval (IBIR) methods are primarily evaluated on datasets with large, centered objects (e.g., RParis6K, ROxford5K), where objects occupy on average 40% of the image area. In real-world scenarios, however, target objects are often very small (potentially occupying as little as 0.5% of the image area) and appear alongside numerous distracting objects.

Existing methods face three key challenges:

Insufficient small-object representation: Global encoders tend to focus on large objects and background regions, so the features of small objects are drowned out.

Multi-object interference: The presence of multiple objects of the same category within a scene leads to representational confusion.

Single-descriptor constraint: Efficient retrieval requires a single compact descriptor per image, which conflicts with the need to accurately represent multiple small objects.

Method

Overall Architecture

MaO operates in two stages (as illustrated in Figure 2):

Stage A — Multi-Object Fine-tuning:

  1. An open-vocabulary detector (OVD, specifically OWLv2) decomposes each image into \(k\) object crops.
  2. Each crop is encoded independently, yielding \(k\) feature vectors \(\{v_1, \dots, v_k\} \subset \mathbb{R}^d\).
  3. These are fused into a global descriptor \(v_c\) via average pooling.
  4. The model is trained with an InfoNCE contrastive loss to align \(v_c\) with the query object feature \(v_q\).
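
A minimal PyTorch sketch of steps 3–4, assuming L2-normalized embeddings; the function names, temperature `tau`, and batch-negative scheme are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def fuse_crops(crop_feats: torch.Tensor) -> torch.Tensor:
    """Average-pool k crop embeddings of shape (k, d) into one
    L2-normalized global descriptor v_c of shape (d,)."""
    return F.normalize(crop_feats.mean(dim=0), dim=-1)

def info_nce(v_c: torch.Tensor, v_q: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Batched InfoNCE: fused descriptor v_c[i] should match its own query
    embedding v_q[i]; the other queries in the batch act as negatives."""
    logits = v_c @ v_q.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(v_c.size(0), device=v_c.device)
    return F.cross_entropy(logits, targets)
```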

Stage B — Multi-Object Attention Optimization (post-training refinement):

  1. LeGrad is employed to generate interpretability heatmaps.
  2. A single token \(\hat{v}_c\) is optimized so that the attention maps of all crops align with their corresponding object masks.
  3. The result is a refined representation that accounts for all objects simultaneously.

Key Designs

Object decomposition strategy: OWLv2 is applied in "objectness" mode to detect arbitrary objects, with a confidence threshold of 0.2. Each crop is centered on its detected object and is never smaller than the backbone input size, which effectively filters out background noise.
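
A sketch of this cropping rule; the 224-pixel backbone input size and the boundary-clamping behavior are assumptions:

```python
from PIL import Image

def crop_object(image: Image.Image, box: tuple, min_size: int = 224) -> Image.Image:
    """Square crop centered on a detected box, never smaller than the
    backbone input resolution (224 here is an assumption)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    side = max(x1 - x0, y1 - y0, min_size)
    # Shift the window so it stays inside the frame where possible.
    left = max(0.0, min(cx - side / 2, image.width - side))
    top = max(0.0, min(cy - side / 2, image.height - side))
    return image.crop((int(left), int(top), int(left + side), int(top + side)))
```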

Attention optimization objective:

\[\hat{v}_c = \arg\max_{v_c} \left[ \sum_i \mathrm{IoU}\big(E(v_c^{\top} v_i),\, m_i\big) + \alpha\, v_c^{\top} \sum_i v_i \right]\]

where \(E(\cdot)\) denotes the interpretability map generated by LeGrad, \(m_i\) is the object mask (obtained via SAM), and \(\alpha=0.03\) is a regularization weight. The regularization term prevents the optimized token from deviating excessively from the initial representation.
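
A hedged sketch of the refinement loop under these definitions; `legrad_map` is a stand-in for a differentiable LeGrad call, and the soft-IoU surrogate is an assumption, since the exact differentiable objective is not spelled out here:

```python
import torch

def refine_token(v_init, crop_feats, masks, legrad_map,
                 alpha=0.03, lr=0.1, steps=80):
    """Gradient-ascent refinement of the global token (objective above).

    `legrad_map(v_c, v_i)` stands in for a differentiable LeGrad call that
    returns the heatmap of the similarity v_c . v_i; soft IoU is used as a
    differentiable surrogate for mask agreement (an assumption).
    """
    v_c = v_init.clone().requires_grad_(True)
    opt = torch.optim.SGD([v_c], lr=lr)
    for _ in range(steps):
        loss = torch.zeros((), device=v_init.device)
        for v_i, m_i in zip(crop_feats, masks):
            heat = legrad_map(v_c, v_i)               # (H, W), values in [0, 1]
            inter = (heat * m_i).sum()
            union = heat.sum() + m_i.sum() - inter
            loss = loss - inter / (union + 1e-6)      # maximize soft IoU
        loss = loss - alpha * v_c @ crop_feats.sum(dim=0)  # stay near the crops
        opt.zero_grad()
        loss.backward()
        opt.step()
    return v_c.detach()
```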

Lightweight fine-tuning: LoRA (rank=256) is used to fine-tune the Transformer backbone on the VoxDet training set with a batch size of 128, for a single epoch. The refinement process takes 0.03 seconds per object over 80 iterations.
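
A possible way to wire this up with Hugging Face `peft`; the CLIP checkpoint and the target module names are illustrative assumptions (the paper also fine-tunes a DINOv2 backbone):

```python
from transformers import CLIPVisionModel
from peft import LoraConfig, get_peft_model

backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
# Rank-256 LoRA on the attention projections; the module names follow the
# HF CLIP implementation and are an assumption about where LoRA is applied.
lora_cfg = LoraConfig(r=256, lora_alpha=256,
                      target_modules=["q_proj", "k_proj", "v_proj"])
model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```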

Loss & Training

  • Stage A: InfoNCE contrastive loss, aligning the averaged multi-object representation with the query object representation.
  • Stage B: Gradient-based optimization that maximizes the mask-IoU objective under an embedding regularizer.
  • Optimizer: AdamW with learning rate \(5 \times 10^{-5}\), exponentially decayed to \(1 \times 10^{-6}\) (see the sketch after this list).
  • Refinement stage learning rate: \(1 \times 10^{-1}\); executed offline on gallery images.
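
A sketch of this schedule; the total step count is an assumption needed to derive the per-step decay factor:

```python
import torch

total_steps = 10_000                      # assumption; not stated above
# `model` is the LoRA-wrapped backbone from the earlier sketch.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# Per-step gamma so the learning rate decays from 5e-5 to 1e-6 by the end.
gamma = (1e-6 / 5e-5) ** (1.0 / total_steps)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

# Inside the training loop: optimizer.step(); scheduler.step()
```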

Key Experimental Results

Main Results

The authors introduce four SoIR benchmarks and evaluate using mAP:

| Method | VoxDet | PerMiR | VoxDetW | PerMiRW | INSTRE-XS | INSTRE-XXS |
|---|---|---|---|---|---|---|
| GSS (zero-shot) | 52.01 | 26.73 | 52.01 | 26.73 | 82.34 | 67.98 |
| GeM (zero-shot) | 51.08 | 25.98 | 51.08 | 25.98 | 74.74 | 53.27 |
| SuperGlobal | 47.33 | 17.48 | 47.33 | 17.48 | 56.11 | 33.02 |
| CLIP (zero-shot) | 44.52 | 26.98 | – | – | – | – |
| MaO-CLIP (zero-shot) | ~70+ | ~89+ | – | – | – | – |
| MaO-DINOv2 (fine-tuned) | 83.70 | – | 68.54 | – | – | – |

MaO outperforms conventional image-retrieval methods by 18–26 mAP points on VoxDet, demonstrating a clear advantage in multi-object interference scenarios.

Ablation Study

| Configuration | VoxDet (mAP) | VoxDetW (mAP) |
|---|---|---|
| DINOv2 backbone (zero-shot) | 51.23 | 51.23 |
| + Fine-tuning | 54.33 | 54.33 |
| + Full-image optimization | 69.54 | 48.24 |
| + Multi-object optimization (MaO) | 83.70 | 68.54 |

The ablation reveals that:

  • Fine-tuning alone yields modest gains (+3 mAP).
  • Full-image attention optimization is effective in controlled settings (+15 mAP on VoxDet) but degrades in the wild (48.24 vs. 54.33 on VoxDetW).
  • Multi-object attention optimization is the primary contributor (+14 mAP on VoxDet, +20 on VoxDetW).

Key Findings

  1. Smaller objects are harder: When objects occupy only 0.5% of the image area, MaO still achieves ~50% AP, while other methods largely fail.
  2. Resolution sensitivity: MaO effectively leverages high-resolution images to improve retrieval, whereas global methods degrade at higher resolutions.
  3. Controllable clutter effect: As the number of objects increases from 1 to 6, MaO's mAP drops only from 0.96 to 0.82.
  4. Improved attention distribution: Visualizations show that DINOv2 attention concentrates on backgrounds (e.g., shelves), while MaO effectively distributes attention across individual objects.

Highlights & Insights

  1. Clear problem formulation: SoIR is systematically defined and studied for the first time, establishing standardized benchmarks.
  2. Elegant method design: The conflicting objectives of encoding each object individually and fusing into a single descriptor are reconciled through attention optimization.
  3. Strong practicality: The final representation is a single global feature vector (512-D or 768-D), fully compatible with standard retrieval pipelines without additional storage overhead (a minimal indexing example follows this list).
  4. Inventive application of LeGrad: Repurposing an interpretability tool as a feature optimization signal is a notably creative contribution.
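
To make the compatibility concrete, a minimal FAISS sketch over random placeholder descriptors (768-D, matching one of the dimensions mentioned above):

```python
import numpy as np
import faiss

d = 768                                         # descriptor dimension
gallery = np.random.randn(10_000, d).astype("float32")  # placeholder MaO descriptors
faiss.normalize_L2(gallery)

index = faiss.IndexFlatIP(d)                    # inner product == cosine after L2 norm
index.add(gallery)

query = np.random.randn(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)           # top-10 nearest gallery images
```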

Limitations & Future Work

  1. Dependence on OVD quality: Missed detections by the object detector (e.g., only 63% recall on VoxDet at IoU=0.9) prevent certain objects from being encoded into the global representation.
  2. Degradation in high-density scenes: When OVD detects more than 25 objects, specific small targets may be underweighted.
  3. Refinement stage overhead: At 0.03 seconds per object, costs accumulate in multi-object scenarios (though offline execution is feasible).
  4. Cross-domain generalization unexplored: The VoxDet training set consists of synthetic 3D data; generalization to real-world data remains to be validated.
  5. Insufficient handling of occlusion: Feature extraction for overlapping objects remains challenging.

Related Work & Comparisons

  • MaskInversion: Uses interpretability maps to optimize single-object representations; MaO extends this paradigm to multi-object scenarios.
  • α-CLIP: A CLIP variant incorporating an additional mask channel; performs well in multi-object settings but still falls short of MaO.
  • PDM: Employs diffusion models for personalized retrieval, but incurs high computational costs unsuitable for large-scale global search.
  • Insight: Interpretability tools such as LeGrad can serve not only as analysis instruments but also, run in reverse, as guidance signals for feature optimization.

Rating

  • Novelty: ★★★★☆ — First systematic treatment of SoIR; attention optimization approach is original.
  • Technical Depth: ★★★★☆ — Two-stage framework is clearly designed with a mathematically grounded optimization objective.
  • Experimental Thoroughness: ★★★★★ — Multiple benchmarks, multiple backbones, detailed ablations, and visualization analyses.
  • Practicality: ★★★★★ — Single-vector retrieval, fully compatible with standard retrieval pipelines.
  • Overall Recommendation: ★★★★☆