
SRA-Det: Learning Omni-Grained Open-Vocabulary Detection Beyond Category Names

Conference: CVPR 2026
Paper: CVF Open Access
Code: None
Area: Object Detection / Open-Vocabulary Detection
Keywords: Open-vocabulary detection, fine-grained recognition, semantic retrieval, soft-min matching, attribute-augmented data

TL;DR

Open-vocabulary detectors tend to match only "category names" and remain insensitive to fine-grained attributes such as color, material, and pattern. SRA-Det addresses this by using learnable retrieval queries to extract multiple semantic "facets" from text tokens, then applying soft-min matching as a "logical AND" so that every facet must be satisfied. Combined with an attribute-augmented data pipeline that uses an LLM for generation and CLIP for dual-verification, SRA-Det achieves 54.9 mAP on FG-OVD and maintains 40.4 AP on LVIS under zero-shot settings.

Background & Motivation

Background: The mainstream approach for Open-Vocabulary Detection (OVD) leverages vision-language pre-training (e.g., CLIP) to align region features with text embeddings, enabling the recognition of arbitrary categories beyond the training vocabulary. Models like GLIP, GroundingDINO, OWL-ViT, YOLO-World, MM-GDINO, and LLMDet have demonstrated strong zero-shot generalization on COCO and LVIS.

Limitations of Prior Work: These methods and their evaluations remain at the "coarse-grained category" level—as long as the box matches the category, it is considered correct, regardless of whether the model truly understands fine-grained attributes in the description. Consequently, a detector that finds "a dog" might fail to distinguish "a dog with gray and white fur and blue eyes"; one that finds "a knife" might miss "a grey metal knife with a black plastic handle." Benchmarks like FG-OVD use "hard negative" descriptions differing by only one or two attributes to challenge detectors, where mainstream models exhibit significant performance drops.

Key Challenge: The root cause is that existing methods compress the entire description into a single final text vector. Even models like NoctOWL and HA-FGOVD, which have begun to emphasize attributes, eventually fuse attribute information linearly into a global vector. Once collapsed into a single vector, strong category semantics dominate attribute clues, allowing "partially matched" objects (violating one or two key attributes) to still receive high scores.

Goal: The authors decompose the problem into two sub-problems: (i) how text should be represented for visual matching, and (ii) how to obtain fine-grained attribute supervision at low cost on large-scale detection data.

Key Insight: A correct detection must simultaneously satisfy all semantic clues mentioned in the description. Therefore, the matching score should be a "logical AND" of these clues, rather than a weighted sum where weak attributes are masked by strong category signals. The prerequisite for an "AND" operation is explicitly decomposing the description into multiple facets for individual verification.

Core Idea: Use a small set of learnable retrieval queries to "retrieve" multiple complementary semantic facets from token-level text features, then use a differentiable soft-min to aggregate facet similarities. This ensures that any mismatched attribute significantly lowers the total score. Simultaneously, an LLM+CLIP pipeline automatically generates attribute supervision to feed this mechanism.

Method

Overall Architecture

SRA-Det consists of two complementary components: the model side replaces single-vector text representation with "multi-semantic facets + soft-min matching" to solve "how to represent and verify text"; the data side uses an automated pipeline to add dense visual attribute labels to large-scale detection data, solving "where fine-grained supervision comes from."

The model itself is a DETR-style dual-stream structure: an image encoder and object decoder produce \(N\) object queries \(v_i\), while a text encoder encodes category names or free-form descriptions into token embeddings. The key modification occurs in the text branch: the Semantic Retrieval Augmentation (SRA) module uses \(K\) queries to extract \(K\) semantic facets \(\hat h_k\) from token embeddings. During inference, the Multi-Semantic Matching module calculates the similarity between each object query and these \(K\) facets, aggregating them into a final score \(s_i\) via soft-min. The training data is enriched by the Attribute-Augmented Data Pipeline before being fed into the detector.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Detection Datasets<br/>(O365 / V3Det / VAW / GoldG)"] --> P["Attribute-Augmented Data Pipeline<br/>LLM Attribute Gen + CLIP Dual-Verification"]
    P --> T["Train SRA-Det"]
    D["Category Names / Free Descriptions"] --> TE["Text Encoder<br/>Token Embeddings H"]
    TE --> SRA["Semantic Retrieval Augmentation (SRA)<br/>K queries extract K facets Ĥ"]
    IMG["Image"] --> OD["Object Decoder<br/>Object Queries vᵢ"]
    SRA --> MSM["Multi-Semantic Matching<br/>Soft-min aggregation of K similarities"]
    OD --> MSM
    MSM --> O["Final Score sᵢ"]
    T -.-> SRA
    T -.-> OD

Key Designs

1. Semantic Retrieval Augmentation (SRA): Decomposing a description into complementary facets

This module addresses the problem of category semantics drowning out attributes in single-vector representations. It takes \(K\) retrieval queries \(Q=\{q_1,\dots,q_K\}\) (\(K=3\) in the paper) and the text token sequence \(H=\{h_1,\dots,h_L\}\) (from the last layer of the text encoder, excluding [SOS]/[EOS]). Each query uses attention to "read" these tokens and extract a semantic facet: \(r_k=\mathrm{Attention}(q_k,H,H)\). To make different queries attend to different parts, the authors equip each query with a learnable positional embedding \(p_k\) and inject the previous retrieval result as context for the next: \(q_k=\mathrm{LN}(r_{k-1}+p_k)\), where the initial context is the global average pool \(r_0=\mathrm{AvgPool}(H)\). This "query-by-query conditioning" lets the facets extract complementary information progressively.
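The query-by-query conditioning described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes single-head scaled dot-product attention without learned projections, a layer norm without learned scale or shift, and random vectors standing in for the token features \(H\) and positional embeddings \(p_k\). The function names (`layer_norm`, `attend`, `retrieve_facets`) are hypothetical.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, tokens):
    """Single-head scaled dot-product attention with keys = values = tokens."""
    d = len(query)
    scores = [sum(q * h for q, h in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    weights = softmax(scores)
    out = [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)]
    return out, weights

def retrieve_facets(tokens, pos_embeds):
    """Run K retrieval steps; step k is conditioned on the result of step k-1."""
    d = len(tokens[0])
    # r_0: global average pool over the token sequence.
    r_prev = [sum(tok[j] for tok in tokens) / len(tokens) for j in range(d)]
    facets, attn_maps = [], []
    for p_k in pos_embeds:
        # q_k = LN(r_{k-1} + p_k)
        q_k = layer_norm([r + p for r, p in zip(r_prev, p_k)])
        # r_k = Attention(q_k, H, H)
        r_k, a_k = attend(q_k, tokens)
        facets.append(r_k)
        attn_maps.append(a_k)
        r_prev = r_k
    return facets, attn_maps

# Toy inputs (assumed sizes): L = 6 tokens, D = 8 dimensions, K = 3 queries.
random.seed(0)
L, D, K = 6, 8, 3
H = [[random.gauss(0, 1) for _ in range(D)] for _ in range(L)]
P = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
facets, attn_maps = retrieve_facets(H, P)
```

Each entry of `attn_maps` is a distribution over the \(L\) tokens, which is the quantity the diversity and coverage regularizers operate on.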

The extracted \(r_k\) is fused with the global text embedding \(h_{\mathrm{eos}}\): \(h'_k=h_{\mathrm{eos}}+r_k\), \(h_k=h'_k+\mathrm{MLP}(\mathrm{LN}(h'_k))\), and finally projected and L2-normalized to obtain the multi-semantic representation \(\hat H=\{\hat h_1,\dots,\hat h_K\}\). To ensure these \(K\) facets are both diverse and comprehensive, two regularizations are applied to the attention maps \(A=\{a_1,\dots,a_K\}\): a diversity loss \(\mathcal L_{\mathrm{div}}=\frac{1}{K(K-1)}\sum_{i\neq j}\langle a_i,a_j\rangle^2\) to minimize overlap between queries, and a coverage loss \(\mathcal L_{\mathrm{cov}}=\frac1L\sum_{j=1}^L\big(\frac1K\sum_{i=1}^K a_{i,j}-\frac1L\big)^2\) to encourage all valid tokens to be attended to.
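The two regularizers can be written directly from their formulas. The sketch below assumes each attention map \(a_i\) is a list of \(L\) non-negative weights summing to 1; the function names and the toy attention maps are illustrative only.

```python
def diversity_loss(attn):
    """Mean squared inner product over all ordered pairs of distinct attention maps."""
    k = len(attn)
    total = 0.0
    for i in range(k):
        for j in range(k):
            if i != j:
                dot = sum(a * b for a, b in zip(attn[i], attn[j]))
                total += dot ** 2
    return total / (k * (k - 1))

def coverage_loss(attn):
    """Mean squared deviation of the per-token average attention from the uniform value 1/L."""
    k, n = len(attn), len(attn[0])
    total = 0.0
    for j in range(n):
        mean_j = sum(attn[i][j] for i in range(k)) / k
        total += (mean_j - 1.0 / n) ** 2
    return total / n

# Three queries that each attend to a different token: no overlap, full coverage.
disjoint = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Three queries that all attend to the same token: maximal overlap, poor coverage.
collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

div_disjoint, cov_disjoint = diversity_loss(disjoint), coverage_loss(disjoint)
div_collapsed, cov_collapsed = diversity_loss(collapsed), coverage_loss(collapsed)
```

In this toy case both losses vanish for the disjoint maps, while the collapsed maps give a diversity loss of 1 and a coverage loss of 2/9, so minimizing the two terms pushes the queries apart and spreads attention over all tokens.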

2. Multi-Semantic Matching (Soft-min): Differentiable logic AND for attribute satisfaction

Given \(K\) semantic facets, how should they be aggregated to reflect "all must be satisfied"? For each object query \(v_i\), the similarity with each facet is computed: \(s_{i,k}=\langle v_i,\hat h_k\rangle\). A hard minimum \(\min_k s_{i,k}\) acts as a strict "logical AND" (any facet can veto), but it is non-smooth and propagates gradients only to the minimum item, which can cause training collapse. The authors use soft-min instead:

\[s_i=-\tau\log\sum_{k=1}^{K}\exp\!\Big(-\frac{s_{i,k}}{\tau}\Big)+\tau\log K\]

When facet scores are close, it behaves like an average; when one score is significantly low, it approaches the minimum. Thus, a single mismatched attribute (e.g., mistaking "metal" for "rattan") is sufficient to drag down the overall score. This "short-board effect" is the fundamental difference from standard single-vector dot products, where weak attributes can be averaged out by strong category scores.
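A small numerical sketch makes the two regimes concrete. The temperature `tau=0.1` and the facet scores are assumed values for illustration; the summary does not state the temperature used in the paper. Factoring out the minimum before exponentiating is a standard stabilization and does not change the result.

```python
import math

def soft_min(scores, tau=0.1):
    """s = -tau * log(sum_k exp(-s_k / tau)) + tau * log(K), computed stably."""
    k = len(scores)
    m = min(scores)
    # Factor out the minimum so every exponent is <= 0 and cannot overflow.
    lse = math.log(sum(math.exp(-(s - m) / tau) for s in scores))
    return m - tau * lse + tau * math.log(k)

def mean(scores):
    return sum(scores) / len(scores)

# All facets agree: soft-min coincides with the average.
all_match = [0.8, 0.8, 0.8]
# One facet is violated (for example, the material): soft-min drops toward the minimum.
one_mismatch = [0.9, 0.9, 0.1]

agg_match = soft_min(all_match)
agg_mismatch = soft_min(one_mismatch)
avg_mismatch = mean(one_mismatch)
```

With these assumed numbers the plain average of the mismatched case stays above 0.6, whereas the soft-min lands close to the violated facet's score. In general the soft-min value lies between \(\min_k s_{i,k}\) and \(\min_k s_{i,k}+\tau\log K\), and it never exceeds the arithmetic mean.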

3. Attribute-Augmented Data Pipeline: LLM generation and CLIP dual-verification

A detector requires fine-grained supervision, but large-scale OVD datasets often lack attribute labels. To avoid high manual labeling costs and MLLM hallucinations, the pipeline follows three steps: (1) Category-prior attribute generation: an LLM (DeepSeek V3.1) generates candidate visual attributes for each category, restricted to discriminative, purely visual traits (color, shape, part, material, etc.). (2) CLIP feature extraction: RoI crops are passed through the CLIP image encoder with augmentations (padding and flipping) to obtain robust visual features \(\hat f_{roi}\). Textual descriptions are encoded using multiple templates to get \(\hat f^d_{attr}\) and \(f^c_{attr} = \mathrm{norm}(f_{attr} + f_{class})\). (3) Dual-verification labeling: an attribute is retained only if \(s^d_{attr}>\max(s_{class},\gamma)\) and \(s^c_{attr}>\max(s_{class},\gamma)\), where \(s^d_{attr}\), \(s^c_{attr}\), and \(s_{class}\) are the similarities between the RoI feature and the attribute, class-fused attribute, and class text embeddings, and \(\gamma\) is a threshold. Both attribute scores must therefore exceed the baseline category score.
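The dual-verification rule in step (3) reduces to a two-sided threshold test. The sketch below assumes the similarity scores have already been computed by CLIP; the threshold `gamma=0.2`, the candidate attributes, and all scores are made-up values for illustration, and the function names are hypothetical.

```python
def keep_attribute(s_direct, s_combined, s_class, gamma=0.2):
    """Keep an attribute only if both attribute scores beat max(s_class, gamma)."""
    threshold = max(s_class, gamma)
    return s_direct > threshold and s_combined > threshold

def label_attributes(candidates, s_class, gamma=0.2):
    """Filter LLM-proposed attributes for one RoI.

    candidates maps attribute name -> (s_direct, s_combined), i.e. the RoI's
    similarity to the attribute text alone and to the class-fused attribute text.
    """
    return [
        name
        for name, (s_d, s_c) in candidates.items()
        if keep_attribute(s_d, s_c, s_class, gamma)
    ]

# Hypothetical scores for one "knife" RoI whose class similarity is 0.25.
candidates = {
    "metal": (0.31, 0.33),         # passes both checks
    "wooden": (0.18, 0.35),        # fails the direct check
    "black handle": (0.29, 0.24),  # fails the class-fused check
}
kept = label_attributes(candidates, s_class=0.25)
```

Requiring both scores to pass means an attribute that looks plausible only when fused with the class name, or only on its own, is discarded.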

Loss & Training

The objective follows DETR-style detectors with the added regularizations:

\[\mathcal L=\mathcal L_{\mathrm{align}}+\mathcal L_{\mathrm{align\_des}}+\mathcal L_{\mathrm{box}}+\mathcal L_{\mathrm{GIoU}}+\mathcal L_{\mathrm{div}}+\mathcal L_{\mathrm{cov}}\]

Here \(\mathcal L_{\mathrm{align}}\) and \(\mathcal L_{\mathrm{align\_des}}\) are alignment losses for category names and augmented descriptions, respectively. The image backbone is Swin-T, and the text encoder is Open-CLIP ViT-B/16. Training uses \(K=3\) and a batch size of 256 for 12 epochs.

Key Experimental Results

Main Results

In zero-shot settings, SRA-Det (Swin-T) achieves 54.9 mAP averaged over the eight FG-OVD subsets, outperforming OWL-ViT(L/14) by 13.8 mAP and even exceeding the fine-tuned NoctOWLv2(L/14). It maintains 40.4 AP on LVIS minival.

| Benchmark | Metric | SRA-Det (Swin-T) | Comparison | Gain |
|---|---|---|---|---|
| FG-OVD (Zero-shot, Avg 8 sets) | mAP | 54.9 | OWL-ViT(L/14) 41.1 | +13.8 |
| LVIS minival (Zero-shot) | AP | 40.4 | YOLO-World-L 35.4 | +5.0 |
| LVIS minival (Zero-shot) | AP | 40.4 | GroundingDINO 27.4 | +13.0 |
| FG-OVD (Fine-tuned) | mAP | 67.8 | GUIDED 66.4 | +1.4 |

Compared to MLLMs, SRA-Det (0.116B params) achieves a higher F1 (46.29) and Recall (51.71) on FG-OVD than Qwen3-VL-2B (2B) and Rex-Omni (4B), demonstrating superior parameter efficiency.

Ablation Study

| Configuration | LVIS AP | FG-OVD Hard | Explanation |
|---|---|---|---|
| V3Det only | 27.6 | 25.3 | Baseline |
| + Attr-Aug Data | 27.9 (+0.3) | 30.3 (+5.0) | Automated supervision |
| + VAW | 28.5 (+0.6) | 43.1 (+12.8) | Real attribute data impact |
| + O365 & GoldG | 40.4 (+11.9) | 45.2 (+2.1) | Massive data for category-level |
| w/o SRA | 28.5 | 40.8 | Removing facet-based retrieval |
| w/ SRA (Full) | 28.5 | 43.1 (+2.3) | Fine-grained improvement |

Key Findings

  • SRA targets fine-grained needs: Adding SRA kept LVIS AP at 28.5 while boosting FG-OVD Hard from 40.8 to 43.1, confirming that single vectors suffice for categories, but multiple facets are required for fine-grained verification.
  • Real data outperforms synthetic: While automated data provided +5.0 mAP on FG-OVD Hard, real attribute data (VAW) provided +12.8 mAP, highlighting that explicit visual attribute supervision is the bottleneck.
  • Large detection sets primarily assist category levels: O365 & GoldG significantly boosted LVIS (+11.9) but only marginally improved FG-OVD Hard (+2.1).

Highlights & Insights

  • Differentiable Logic AND: The soft-min operator smoothly interpolates between "average" and "minimum," allowing the model to punish partial matches while maintaining stable gradient flow.
  • Progressive Query Conditioning: Initializing subsequent queries with previous results allows \(K\) facets to be complementary rather than redundant, yielding a stable +0.8 mAP gain.
  • Robust Pipeline: The "pure visual attribute + dual CLIP verification + higher than category score" filtering mechanism effectively mitigates LLM hallucinations and CLIP noise.

Limitations & Future Work

  • The pipeline's reliance on LLM generation and CLIP verification is imperfect and can introduce bias or noise. Performance remains limited for rare or ambiguous attributes with subtle visual evidence.
  • The use of a fixed \(K=3\) might not be optimal for long descriptions with many attributes. Future work could investigate adaptive \(K\) or long-tail attribute reweighting.

Comparison with Related Work

  • vs HA-FGOVD / NoctOWL: Whereas previous works still fuse attributes into a single vector (allowing category signals to mask attribute violations), SRA-Det uses distinct facets and soft-min to ensure an "all-or-nothing" satisfaction.
  • vs GUIDED: GUIDED uses a three-stage pipeline (subject recognition, coarse detection, attribute discrimination). SRA-Det integrates multi-attribute consistency directly into the matching mechanism in an end-to-end fashion.
  • vs LLMDet / MM-GDINO: These focus on scaling category-level AP via MLLM fine-tuning. SRA-Det focuses on fine-grained precision, outperforming them significantly on attribute-heavy benchmarks.

Rating

  • Novelty: ⭐⭐⭐⭐ Introducing "multi-semantic facets + soft-min AND" to OVD is clear and effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Coverage of LVIS/FG-OVD/OmniLabel plus extensive MLLM comparisons.
  • Writing Quality: ⭐⭐⭐⭐ Logical flow and clear coupling between formulas and diagrams.
  • Value: ⭐⭐⭐⭐ Addresses the coarse-to-fine gap in OVD with reusable matching and data paradigms.