
Referring Expression Comprehension for Small Objects

Conference: ICCV 2025 | arXiv: 2510.03701 | Code: GitHub | Area: Autonomous Driving | Keywords: Referring Expression Comprehension, Small Object Detection, Parameter-Efficient Fine-Tuning, Progressive Zooming, GroundingDINO

TL;DR

This work introduces the SOREC dataset (100K referring expression–bounding box pairs for small objects) and the PIZA module (Progressive-Iterative Zooming Adapter). PIZA enables pretrained models such as GroundingDINO to autoregressively zoom in on extremely small targets, yielding substantial accuracy gains for small-object REC in autonomous driving scenarios.

Background & Motivation

Referring Expression Comprehension (REC) aims to localize a specific object in an image given a natural language description. State-of-the-art models have achieved over 90% accuracy on standard benchmarks such as RefCOCO, yet extremely small object localization remains a significant challenge.

Lack of datasets: Existing REC datasets (e.g., RefCOCO) predominantly contain medium-to-large objects and lack annotations for small targets. In SOREC, the average bounding box occupies only 0.05% of the image area—far smaller than conventional objects.

Pretrain–finetune gap: Large-scale pretrained models (e.g., GroundingDINO) perform well on normally sized objects, but low-resolution features are insufficient for fine-grained localization of extremely small targets. Naively rescaling images to the model's input resolution discards critical details.

Urgent application demand: Detecting small objects at a distance—pedestrians, traffic signs, street lights—is safety-critical in autonomous driving, yet existing methods perform poorly in such scenarios.

Core insight: When facing a small object, humans first coarsely localize and then progressively zoom in. PIZA mimics this behavior through autoregressive zooming.

Method

Overall Architecture

PIZA (Progressive-Iterative Zooming Adapter) formulates small-object localization as a search process. Given a pretrained model \(F\), PIZA extends it to \(F^{*}\), which autoregressively predicts a sequence of bounding boxes:

\[P = (\mathbf{b}_0, \mathbf{b}_1, \cdots, \mathbf{b}_T)\]

where \(\mathbf{b}_0\) covers the entire image and \(\mathbf{b}_T\) is the final localization of the small target. At each step:

\[\hat{\mathbf{b}}_{i+1} = F^{*}(\mathbf{x}_i, \mathbf{t}, \mathbf{b}_{0:i})\]

where \(\mathbf{x}_i\) is the image region cropped by \(\mathbf{b}_i\) at step \(i\) and \(\mathbf{t}\) is the referring expression.
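The autoregressive search loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `model` is a hypothetical callable standing in for the extended predictor, returning the next box in full-image coordinates plus a stop flag from the EOS head.

```python
def progressive_zoom(model, image, text, max_steps=5):
    """Sketch of PIZA-style autoregressive zooming (hypothetical API).

    `image` is a 2D array-like (list of rows); `model(crop, text, boxes)`
    returns the next predicted box (x0, y0, x1, y1) in full-image
    coordinates and a boolean stop flag ([EOS] vs. [CONT]).
    """
    h, w = len(image), len(image[0])
    boxes = [(0, 0, w, h)]              # b_0 covers the entire image
    for _ in range(max_steps):
        x0, y0, x1, y1 = boxes[-1]
        # x_i: the zoomed-in region defined by the latest box
        crop = [row[x0:x1] for row in image[y0:y1]]
        box, stop = model(crop, text, boxes)
        boxes.append(box)
        if stop:                        # EOS head predicts termination
            break
    return boxes[-1]                    # b_T: final small-object box
```

The loop mirrors the formulation above: each step conditions on the crop, the text, and the full box history \(\mathbf{b}_{0:i}\).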

Key Designs

  1. PIZA Module (Zooming-Step Embedding): Inspired by timestep embeddings in diffusion models, PIZA learns a zooming-step embedding to represent the progress of the search process. The pipeline is:

    • Extract 6-dimensional low-level features from the bounding box sequence \(\mathbf{b}_{0:i}\) (normalized size, relative size, normalized width/height, center coordinates)
    • Derive embedding \(\mathbf{h} \in \mathbb{R}^d\) via learnable Fourier embedding + Transformer encoder + average pooling
    • Two prediction heads: an EOS head (binary classification for search termination) and a Progress head predicting search progress \(\hat{z} \in [0,1]\)
    • Total parameter count: only 0.27M; embedding dimension \(d = 16\)
  2. Three Parameter-Efficient Fine-Tuning Integration Modes: PIZA can be flexibly integrated into different PEFT methods:

    • PIZA-CoOp: inserts the zooming-step embedding \(\mathbf{h}\) into the learnable prompt embedding sequence
    • PIZA-LoRA: injects \(\mathbf{h}\) into the LoRA bottleneck layer: \(W\mathbf{x} + BA\mathbf{x} + BC\mathbf{h}\)
    • PIZA-Adapter+: adds \(\mathbf{h}\) to the output of the channel-scaling layer in Adapter+
  3. Extended Training Dataset Construction: Ground-truth search trajectories are constructed to supervise the autoregressive zooming process. Key steps:

    • Estimate the bounding box area ratio distribution \(p(r)\) from the pretraining dataset
    • Determine the zooming factor \(z_j^*\) at each step using an exponential weighting scheme, prioritizing precision at early steps
    • Gradually transition the aspect ratio from the full image to that of the target
    • Generate [CONT] and [EOS] labels to supervise the stopping decision
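The zooming-step embedding pipeline in design 1 can be sketched in PyTorch. Layer counts, head sizes, and the Fourier parameterization below are illustrative assumptions; the paper's 0.27M-parameter module may differ in detail.

```python
import torch
import torch.nn as nn

class PIZAModule(nn.Module):
    """Sketch of the zooming-step embedding (dimensions are assumptions).

    Maps 6-d box-sequence features for b_{0:i} to an embedding h via a
    learnable Fourier projection, a small Transformer encoder, and
    average pooling, then predicts EOS (stop) and progress z in [0, 1].
    """
    def __init__(self, d=16, n_freq=8):
        super().__init__()
        # learnable Fourier embedding: learned frequencies over features
        self.freq = nn.Linear(6, n_freq, bias=False)
        self.proj = nn.Linear(2 * n_freq, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=2,
                                           dim_feedforward=2 * d,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.eos_head = nn.Linear(d, 1)       # binary stop decision
        self.progress_head = nn.Linear(d, 1)  # search progress z

    def forward(self, box_feats):             # (B, i+1, 6)
        f = self.freq(box_feats)
        x = self.proj(torch.cat([f.sin(), f.cos()], dim=-1))
        h = self.encoder(x).mean(dim=1)       # average pool over steps
        eos = torch.sigmoid(self.eos_head(h))
        z = torch.sigmoid(self.progress_head(h))
        return h, eos, z
```

The resulting \(\mathbf{h}\) is what the PEFT variants below inject into the frozen backbone.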

Loss & Training

  • Contrastive loss and localization loss based on GroundingDINO
  • A random step from the search trajectory is sampled per mini-batch for forward computation
  • AdamW optimizer, learning rate \(2 \times 10^{-4}\), decayed by 0.5 at epoch 3, trained for 5 epochs total
  • LoRA applied to all self-attention and cross-attention modules with rank=16
  • Adapter+ inserted after each self-attention and FFN module
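The PIZA-LoRA injection \(W\mathbf{x} + BA\mathbf{x} + BC\mathbf{h}\) can be sketched as a single linear layer. This is a hedged illustration, not the released code: dimensions are arbitrary, and \(B\) is zero-initialized as in standard LoRA so training starts from the pretrained behavior.

```python
import torch
import torch.nn as nn

class PIZALoRALinear(nn.Module):
    """Sketch of PIZA-LoRA: frozen linear layer W plus a LoRA update,
    with the zooming-step embedding h injected into the bottleneck
    (W x + B A x + B C h). Dimensions here are illustrative."""
    def __init__(self, in_dim, out_dim, emb_dim=16, rank=16):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim)            # frozen pretrained
        self.W.requires_grad_(False)
        self.A = nn.Linear(in_dim, rank, bias=False)   # LoRA down-proj
        self.B = nn.Linear(rank, out_dim, bias=False)  # LoRA up-proj
        self.C = nn.Linear(emb_dim, rank, bias=False)  # h -> bottleneck
        nn.init.zeros_(self.B.weight)  # zero-init: output starts as W x

    def forward(self, x, h):
        # W x + B (A x + C h)  ==  W x + B A x + B C h
        return self.W(x) + self.B(self.A(x) + self.C(h))
```

Because \(B\) starts at zero, the injected path contributes nothing at initialization, which is the usual LoRA trick for stable fine-tuning.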

Key Experimental Results

Main Results

Parameter-efficient fine-tuning results on SOREC (Train-L)

| Method | Params | Val mAcc | Test-A mAcc | Test-B mAcc | Test-A Acc50 | Test-B Acc50 |
| --- | --- | --- | --- | --- | --- | --- |
| Zero-shot | 0 | 0.2 | 0.3 | 0.0 | 1.0 | 0.2 |
| Full fine-tuning | 173.0M | 37.4 | 43.8 | 30.5 | 69.6 | 55.6 |
| LoRA | 1.3M | 25.2 | 30.7 | 19.7 | 50.2 | 37.3 |
| PIZA-LoRA | 1.5M | 34.5 | 39.3 | 29.0 | 54.0 | 43.4 |
| Adapter+ | 3.3M | 34.6 | 40.7 | 27.6 | 65.9 | 51.3 |
| PIZA-Adapter+ | 3.5M | 39.0 | 45.1 | 31.7 | 66.2 | 52.2 |

Ablation Study

Contribution of each PIZA component (PIZA-Adapter+, Train-S, mAcc/Acc50/Acc75)

| Configuration | Val (mAcc/Acc50/Acc75) | Test-A | Test-B | Notes |
| --- | --- | --- | --- | --- |
| w/o PIZA module | 26.0/48.1/24.8 | 32.0/55.0/33.3 | 20.3/40.4/17.9 | No zooming: baseline |
| w/o emb. insertion | 36.7/53.2/41.7 | 42.8/59.2/49.9 | 30.3/45.8/34.0 | No embedding injection |
| Full PIZA-Adapter+ | 36.8/53.5/41.8 | 43.1/59.6/50.1 | 30.4/45.9/34.1 | Full model |

Effect of adapter bottleneck dimension

| Dimension d | Params | Val mAcc | Test-A mAcc |
| --- | --- | --- | --- |
| 32 | 1.6M | 35.1 | 40.8 |
| 64 | 1.9M | 36.6 | 42.2 |
| 128 | 2.4M | 36.4 | 41.8 |
| 256 | 3.5M | 36.8 | 43.1 |

Key Findings

  • The zero-shot baseline is near zero (mAcc 0.2%), confirming that pretrained models completely fail on extremely small objects.
  • PIZA-Adapter+ surpasses full fine-tuning with only 3.5M parameters (vs. 173M), demonstrating the effectiveness of progressive zooming.
  • Removing the PIZA module causes mAcc to drop sharply from 36.8 to 26.0, confirming that autoregressive zooming is the core contribution.
  • Test-A (traffic signs, etc.) consistently outperforms Test-B (other small objects) by more than 10 mAcc points.
  • Performance improves continuously with larger training sets, indicating room for further dataset scaling.
  • On average, only 2.11 zooming steps are required to complete localization.

Highlights & Insights

  • Strong dataset contribution: SOREC is the first REC dataset targeting small objects in autonomous driving, with objects occupying only 0.05% of image area and expressions averaging 25.5 words (vs. 3.5 in RefCOCO), filling an important gap in the field.
  • Mimicking human visual search: The progressive zooming strategy is intuitive and efficient, analogous to the human "scan-then-focus" visual search behavior.
  • Parameter efficiency: The PIZA module introduces only 0.27M parameters and achieves substantial performance gains by flexibly injecting zooming-step information into various PEFT frameworks.
  • Reproducible dataset pipeline: The semi-automated construction pipeline (SAM + GPT-4o + crowdsourcing) provides a replicable template for creating similar datasets in other domains.

Limitations & Future Work

  • Autoregressive zooming increases the number of inference steps (averaging 2–3 per sample), which may be insufficient for latency-critical applications.
  • Validation is currently limited to GroundingDINO; transferability to other foundation models (e.g., GLIPv2, Florence) remains to be explored.
  • Dataset expressions are generated by GPT-4o, which may limit linguistic diversity; 18.45% of expressions contain minor errors.
  • The reliability of the automatic stopping decision (EOS prediction) warrants further investigation.
  • Integration with classical small-object detection techniques such as multi-scale feature extraction could be explored.

Takeaways

  • This work reframes REC from a "single-step localization" problem to a "multi-step search" problem, providing a new paradigm for handling extreme-scale objects.
  • The zooming-step embedding in PIZA draws inspiration from timestep embeddings in diffusion models, demonstrating an elegant cross-domain transfer of ideas.
  • The SOREC construction pipeline (foundation model segmentation → GPT-generated descriptions → crowdsourced verification) offers a reference for automated training data production.
  • The approach has direct implications for long-range object understanding and safety-critical planning in autonomous driving.

Rating

  • Novelty: ⭐⭐⭐⭐ — The progressive zooming localization concept is intuitive and novel; the dataset fills a critical gap.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple PEFT methods are compared with thorough ablations, though comparisons against broader REC baselines are limited.
  • Writing Quality: ⭐⭐⭐⭐⭐ — The paper is well-structured with clearly articulated motivation and detailed dataset construction descriptions.
  • Value: ⭐⭐⭐⭐ — Both the dataset and the method make important contributions to small-object understanding.