AgentDet: A Shared-Blackboard Multi-Agent Framework for Zero-/Few-Shot Object Detection

Conference: CVPR 2026
Paper: CVF Open Access
Code: None
Area: Multi-Agent / Zero-Few-Shot Object Detection
Keywords: Zero/Few-shot Detection, Multi-Agent, Shared Blackboard, Knowledge Base, Pseudo-incremental Learning

TL;DR

AgentDet decomposes zero-/few-shot object detection into four agents: Scout, Pinner, Curator, and Judge. The agents collaborate through a "Shared Blackboard" and a patch-level "Knowledge Base" (KB). The framework stores visual evidence in the KB as fragments, assembles those fragments into holistic textual clues for LLM-based box prediction, and trains only the Judge agent. It reports competitive results on PASCAL VOC and COCO for both ZSOD and FSOD tasks.

Background & Motivation

Background: Zero-shot and few-shot object detection (ZSOD/FSOD) aims to detect objects with zero or \(K\) labeled images of novel classes. Traditional FSOD relies on transfer learning, meta-learning, metric learning, and data augmentation. Recent trends shift toward VLM/LLM-based methods (e.g., Grounding DINO, FM-FSOD, LLaFS) to leverage zero-shot capabilities of foundation models.

Limitations of Prior Work: Traditional methods suffer from catastrophic forgetting and unstable generalization when novel samples are extremely scarce or the base/novel gap is large; episodic training is often unstable. Foundation model approaches have their own drawbacks: VLMs like Grounding DINO rely on large-scale visual pre-training that conflicts with the "few-shot generalization" protocol (essentially "peeking" at answers). FM-FSOD treats LLMs as static classifiers without linguistic reasoning. LLaFS is tied to CodeLlama and requires polygon-level supervision, failing at zero-shot detection. Crucially, most works treat ZSOD and FSOD as separate tasks rather than a unified framework.

Key Challenge: LLMs possess rich linguistic priors and cross-modal reasoning but act like a "blind expert": they can describe an object in detail but cannot "see" specific instances in an image, which makes it difficult to output precise coordinates directly.

Goal: ① Provide a single architecture covering both ZSOD and FSOD that transitions smoothly from 0-shot to few-shot without structural changes; ② Continuously accumulate visual knowledge under realistic conditions where no box annotations exist for novel classes.

Key Insight: The authors use a cinematic analogy. When no photo of a target exists (e.g., the collage scene in The Truman Show, or a composite sketch), one can piece together eyes, a nose, and a mouth from magazines to form a face. Detection follows the same three steps: accumulate fragment-level visual evidence → assemble it into credible holistic textual clues → use LLM linguistic priors for box localization.

Core Idea: Decouple detection into four collaborative agents centered around a "Shared Blackboard + Patch-level KB." Novel classes are handled via a pseudo-incremental approach: when no box labels exist, only high-consistency local evidence is written to the KB. When a few labels arrive, the system upgrades to FSOD without retraining the architecture.

Method

Overall Architecture

The input to AgentDet is a query image \(I\) and a closed class set \(C=C_b \cup C_n\) (base + novel). The output is the final set of boxes and confidence scores \((B,\omega)\). The system uses two types of memory: a Knowledge Base (KB) (persistent, stores patch-level visual evidence, supports retrieval) and a Shared Blackboard (Board) (transient, aggregates clues and control flags during a single inference pass). Four agents interact with these memories: Scout writes "holistic textual clues," Pinner pins "fragment-level clues" retrieved from the KB to the board, Curator maintains/updates the KB safely, and Judge reads the board to output final boxes. During training, only the Judge is trained (updating its image encoder and a LoRA-tuned LLM detection head), while other modules (CLIP, DINOv2) remain frozen.

The board state \(M^{(t)}_{board}\) at time \(t\) is denoted as \(M^{(t)}_{board}=\{H^{(t)}, R^{(t)}, B^{(t)}, \omega^{(t)}, r^{(t)}_{scout}, r^{(t)}_{pinner}, pm^{(t)}\}\), where \(H\) is the holistic textual clue, \(R\) represents fragment references from the KB, \((B,\omega)\) are predictions, \(r_{scout}, r_{pinner}\in\{0,1\}\) are readiness flags, and \(pm\in\{0,1\}\) is a gate controlling KB write-access.
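The board state above maps naturally onto a small record type. The following is a minimal sketch, not the authors' implementation: the class name `BoardState`, the box format, the use of integer ids for KB references, and the `judge_can_run` rule (Judge waits for both readiness flags) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) format


@dataclass
class BoardState:
    """Transient per-image blackboard; field names mirror the symbols in the text."""
    H: List[str] = field(default_factory=list)        # holistic textual clue
    R: List[int] = field(default_factory=list)        # ids of KB fragments (assumed int ids)
    B: List[Box] = field(default_factory=list)        # predicted boxes
    omega: List[float] = field(default_factory=list)  # confidence per box
    r_scout: int = 0    # readiness flag, 0 or 1
    r_pinner: int = 0   # readiness flag, 0 or 1
    pm: int = 0         # gate controlling KB write-access, 0 or 1

    def judge_can_run(self) -> bool:
        # Assumption: the Judge reads the board only once both upstream agents are ready.
        return self.r_scout == 1 and self.r_pinner == 1


board = BoardState()
board.H, board.r_scout = ["dog", "bicycle"], 1
board.R, board.r_pinner = [17, 42], 1
print(board.judge_can_run())  # True
```

Because a fresh `BoardState` is created per image, nothing written here persists across queries; only the KB does.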

graph TD
    Q["Query Image I + Class Set C"] --> S["Agent-Scout<br/>CLIP alignment for Holistic Text Clues H"]
    S --> BB["Shared Blackboard<br/>Aggregate H, R, B, ω and gate pm"]
    KB[("Agent-Curator<br/>Pseudo-incremental KB maintenance")] -->|Retrieve patches| P["Agent-Pinner<br/>DINOv2 for fragment visual clues R"]
    P --> BB
    BB --> J["Agent-Judge<br/>LLM reads R, H for Box Output + Self-Refinement"]
    J -->|Safe write-back update| KB
    J --> O["Final Detection B, ω"]

Key Designs

1. Shared Blackboard + KB: Decoupling "Fragment Evidence" from "Per-image Coordination"

This design addresses the "blind expert" nature of LLMs and agent coordination. The KB (\(\mathcal{K}\)) is a persistent database where each entry stores \(\langle p_k, \theta^{attr}_k, \theta^{rect}_k, s_k\rangle\) (visual patch, attribute embedding, box coordinates, CLIP score). The Board (\(M_{board}\)) is a low-latency state space surviving only for one episode. A consistency gate determines safe writing to the KB:

\[\text{cons}(k,c)=\mathbb{1}\!\left[\max_i M_{i,k}>\tau_{search}\right]\cdot\mathbb{1}\!\left[\max_i S_{i,c}>\theta\right]\]

Here \(M\) is the query-KB similarity matrix, \(S\) is the crop-class similarity matrix, and the maximum runs over the row index \(i\). A "safe write" (\(pm=1\)) occurs only if the consistency requirement holds and the attribute score satisfies \(s_{p,a}>\tau_{seg}\), which prevents KB contamination.
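The gate can be written out directly from the formula. This is a minimal sketch under stated assumptions: matrices are plain nested lists, the function names `cons` and `safe_write` are hypothetical, and the threshold values in the example are arbitrary rather than taken from the paper.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def cons(M: Matrix, S: Matrix, k: int, c: int,
         tau_search: float, theta: float) -> int:
    """cons(k, c) = 1[max_i M[i][k] > tau_search] * 1[max_i S[i][c] > theta]."""
    best_match = max(row[k] for row in M)  # best query patch for KB entry k
    best_class = max(row[c] for row in S)  # best crop for class c
    return int(best_match > tau_search) * int(best_class > theta)


def safe_write(cons_value: int, s_pa: float, tau_seg: float) -> int:
    """pm = 1 only if the gate passes and the attribute score clears tau_seg."""
    return int(cons_value == 1 and s_pa > tau_seg)


# Toy values (illustrative only): 2 query patches x 2 KB entries, 2 crops x 2 classes.
M = [[0.2, 0.9], [0.4, 0.1]]
S = [[0.7, 0.05], [0.3, 0.1]]
gate = cons(M, S, k=1, c=0, tau_search=0.8, theta=0.5)
print(gate, safe_write(gate, s_pa=0.6, tau_seg=0.5))  # 1 1
```

Both indicators must fire, so a strong KB match for a class that Scout never saw in the image still yields no write.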

2. Agent-Scout: CLIP-aligned "Holistic Textual Clues" for Search-Space Narrowing

Scout identifies probable classes in the image. It performs uniform cropping at multiple scales and aspect ratios to generate region proposals \(B\). Each crop is encoded by the CLIP visual encoder, and each class name \(c_j\) is encoded through prompt templates. A similarity matrix \(S\) is computed with temperature scaling: \(S=\tau\cdot\text{softmax}(\theta_{vis}\theta^\top_{txt})\). Classes whose score exceeds a threshold \(\theta\) form \(C_{detected}\), which an LLM can optionally verify to yield \(C_{target}\). The resulting class set is the holistic clue \(H\), which is pinned to the board to guide subsequent retrieval.
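The scoring step can be sketched in a few lines of plain Python. The code follows the formula as written in this summary (temperature multiplies the softmax output). Two points are assumptions rather than facts from the paper: the softmax is taken over classes for each crop, and a class counts as detected when its best crop score exceeds \(\theta\). The toy 2-D embeddings stand in for real CLIP features.

```python
import math
from typing import List, Sequence


def softmax(row: Sequence[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scout_scores(theta_vis: Sequence[Sequence[float]],
                 theta_txt: Sequence[Sequence[float]],
                 tau: float) -> List[List[float]]:
    """S = tau * softmax(theta_vis @ theta_txt^T), softmax over classes (assumed axis)."""
    S = []
    for v in theta_vis:
        logits = [sum(a * b for a, b in zip(v, t)) for t in theta_txt]
        S.append([tau * p for p in softmax(logits)])
    return S


def detected_classes(S: Sequence[Sequence[float]],
                     class_names: Sequence[str], theta: float) -> List[str]:
    """Keep class j if any crop scores it above theta (assumed decision rule)."""
    return [name for j, name in enumerate(class_names)
            if max(row[j] for row in S) > theta]


# Toy embeddings: two crops, three class prompts.
crops = [[1.0, 0.0], [0.0, 1.0]]
prompts = [[5.0, 0.0], [0.0, 5.0], [-5.0, -5.0]]
S = scout_scores(crops, prompts, tau=1.0)
print(detected_classes(S, ["cat", "dog", "car"], theta=0.5))  # ['cat', 'dog']
```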

3. Agent-Curator: Crop-then-oversegment for Pseudo-incremental KB Maintenance

Curator fragments visual evidence into the KB. It performs multi-scale over-segmentation \(P_n\) to extract CLIP visual embeddings and coordinates. For each class, it generates a text attribute list \(A_c\) and matches the attributes to patches via CLIP similarity \(s_{p,a}\). The "pseudo-incremental" essence lies in the asymmetry between training and inference. Training uses GT boxes to crop and then over-segment, which ensures high-purity patches. Inference on novel classes (without boxes) over-segments the whole image and uses the consistency gate to decide whether an entry should be submitted, delayed, or decayed.

4. Agent-Pinner: DINOv2 Retrieval for "Fragment Visual Clues"

Pinner determines which KB fragments are relevant to the query image. It over-segments the query image into patches \(P_q\) and uses DINOv2 features to compute the similarity \(M=QK^\top\) between query patches and KB entries. Only fragments whose similarity exceeds a threshold are kept in the subset \(R\), which limits background noise. The spatial and semantic descriptors of these fragments are projected into a reference sequence \(\theta_{ref}\) for the Judge. Notably, Pinner uses DINOv2 for retrieval while Scout uses CLIP for alignment, so the two provide complementary evidence.
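The retrieval step amounts to a matrix product followed by a threshold. The sketch below is illustrative only: it assumes L2-normalized feature vectors (so dot products behave like cosine similarities), assumes a KB entry is pinned when its best-matching query patch exceeds the threshold, and uses the hypothetical names `similarity` and `pin_fragments`.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def similarity(Q: Matrix, K: Matrix) -> List[List[float]]:
    """M = Q K^T, with one row per query patch and one column per KB entry."""
    return [[sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]


def pin_fragments(M: Matrix, threshold: float) -> List[int]:
    """Return indices of KB entries whose best query-patch match exceeds threshold."""
    if not M:
        return []
    n_kb = len(M[0])
    return [k for k in range(n_kb) if max(row[k] for row in M) > threshold]


# Toy unit-length features standing in for DINOv2 embeddings.
Q = [[1.0, 0.0], [0.6, 0.8]]               # two query patches
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # three KB entries
M = similarity(Q, K)
print(pin_fragments(M, threshold=0.7))  # [0, 1]
```

The third KB entry points away from both query patches, so it is left out of \(R\) and never reaches the Judge's prompt.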

5. Agent-Judge: The Decision-maker and Sole Trained Agent

The Judge consumes \(C_{target}\) from Scout and \(\theta_{ref}\) from Pinner. It encodes the query image via EVA-CLIP + Q-Former into tokens \(V\). The LLM (LoRA-tuned) receives a prompt \(P\) containing task descriptions and retrieved knowledge to output initial boxes \(B_{init}\). A prompt-based self-refinement follows: the LLM re-processes initial results to produce \(B_{final}\). Confidence \(\omega\) is calculated from token sequence probabilities. Training uses a unified detection loss, updating only the image encoder and LLM LoRA.
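The summary says only that \(\omega\) is computed from token sequence probabilities, without giving the formula. One common choice, shown here purely as an assumption, is the length-normalized sequence probability (the geometric mean of the per-token probabilities of the generated box string). The function name `box_confidence` is hypothetical.

```python
import math
from typing import Sequence


def box_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities: exp(mean of log-probs).

    Assumed scoring rule; the paper's exact definition may differ.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Log-probabilities of the tokens that spell out one predicted box.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.95), math.log(0.7)]
omega = box_confidence(logprobs)
print(round(omega, 3))
```

Normalizing by length keeps scores comparable between boxes whose coordinate strings tokenize into different numbers of tokens.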

Loss & Training

Training uses a single unified detection loss. Only the parameters of Agent-Judge (image encoder and LLM LoRA) are updated, while CLIP, DINOv2, and the Q-Former core remain frozen. The LLM backbones tested include Llama3.1-8B-Instruct, Qwen2.5-7B, and Qwen3-8B. The authors present this lightweight recipe as key to generalizing to unseen classes.
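For readers unfamiliar with LoRA, the idea is that a frozen weight matrix \(W\) is adapted by a trainable low-rank product, \(W' = W + \frac{\alpha}{r} BA\). The sketch below shows that update on tiny nested lists. It illustrates the general LoRA technique, not the paper's configuration: the rank, \(\alpha\), and which layers are adapted are not stated in this summary.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def matmul(X: Matrix, Y: Matrix) -> List[List[float]]:
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_weight(W: Matrix, A: Matrix, B: Matrix, alpha: float) -> List[List[float]]:
    """Effective weight W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r."""
    r = len(A)
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight
A = [[1.0, 2.0]]              # trainable, rank r = 1
B_zero = [[0.0], [0.0]]       # LoRA initializes B to zero, so training starts at W
B_trained = [[1.0], [0.0]]    # illustrative values after some updates

print(lora_weight(W, A, B_zero, alpha=2.0))     # [[1.0, 0.0], [0.0, 1.0]]
print(lora_weight(W, A, B_trained, alpha=2.0))  # [[3.0, 4.0], [0.0, 1.0]]
```

Only `A` and `B` receive gradients, which is why the trained parameter count stays small relative to the full LLM.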

Key Experimental Results

Main Results

PASCAL VOC few-shot (mAP, Novel Split 1):

| Method | 1-shot | 3-shot | 5-shot | 10-shot |
|---|---|---|---|---|
| ICPE (AAAI23) | 54.1 | 62.5 | 65.3 | 66.3 |
| FM-FSOD† (CVPR24) | 41.6 | - | 55.8 | 61.2 |
| LLMdet† (CVPR25) | 39.9 | 51.7 | 56.1 | 60.8 |
| AgentDet-Qwen2.5-7b† | 55.3 | 63.2 | 68.3 | 69.3 |
| AgentDet-Qwen3-8b† | 56.5 | 64.2 | 68.5 | 69.6 |

COCO (AP, cross-shot):

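| Method | 0-shot | 1-shot | 5-shot | 10-shot | 30-shot |
|---|---|---|---|---|---|
| FM-FSOD† (CVPR24) | - | 5.7 | 21.9 | 27.7 | 37.0 |
| LLMdet† (CVPR25) | 7.2 | 8.8 | 22.3 | 27.8 | 37.8 |
| AgentDet-Qwen3-8b† | 9.4 | 10.8 | 23.9 | 31.6 | 37.5 |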
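To make the cross-shot trend concrete, the per-shot margin of AgentDet-Qwen3-8b over LLMdet can be computed directly from the COCO rows above. The numbers are copied from the table; the variable names are just for illustration.

```python
# COCO AP values copied from the table above.
shots = [0, 1, 5, 10, 30]
llmdet = [7.2, 8.8, 22.3, 27.8, 37.8]
agentdet = [9.4, 10.8, 23.9, 31.6, 37.5]

# Margin in AP points, rounded to one decimal to hide float noise.
margins = {k: round(a - b, 1) for k, a, b in zip(shots, agentdet, llmdet)}
print(margins)  # {0: 2.2, 1: 2.0, 5: 1.6, 10: 3.8, 30: -0.3}
```

The margin is positive through 10-shot and turns slightly negative at 30-shot.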

Ablation Study

Module ablation (AgentDet-Qwen2.5/Llama, VOC 10-shot, mAP):

| Configuration | mAP | Note |
|---|---|---|
| AgentDet-Qwen (Full) | 65.1 | Full model |
| w/o Q-Former tuning | 52.2 | -8.2% |
| w/o Knowledge Base (KB) | 41.3 | -19.1% |
| w/o LLM fine-tuning | 0.0 | Fails completely |
| w/o Agent-Scout | 30.2 | Significant drop |

Key Findings

  • Knowledge Base is critical: Removing the KB lowers mAP to 41.3 (listed as a 19.1% drop). In the same table, removing Agent-Scout is lower still at 30.2, so both memory-related components matter heavily.
  • LLM Fine-tuning is mandatory: The model scores 0.0 without fine-tuning, which indicates that language priors alone cannot perform localization without task adaptation of the LLM.
  • Low-shot Advantage: On COCO, AgentDet leads LLMdet at 0-shot (9.4 vs 7.2) and 1-shot (10.8 vs 8.8), but the advantage disappears at 30-shot, where it trails slightly (37.5 vs 37.8).
  • Parameter Efficiency: Only a small fraction of parameters (LoRA + encoder) are trained; KB size is capped via "last-place elimination" to maintain efficiency.
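The "last-place elimination" cap mentioned above can be sketched with a min-heap: once the KB is full, each insertion evicts whichever entry currently ranks last. This is a minimal sketch under assumptions. The summary does not say which score defines "last place", so the code ranks by a single float (for example the stored CLIP score \(s_k\)), and the class name `CappedKB` is hypothetical.

```python
import heapq
from typing import List, Optional, Tuple


class CappedKB:
    """Fixed-capacity store that evicts the lowest-scoring entry when full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._heap: List[Tuple[float, int, str]] = []  # (score, insertion order, id)
        self._counter = 0

    def add(self, entry_id: str, score: float) -> Optional[str]:
        """Insert an entry; return the id that was eliminated, or None."""
        item = (score, self._counter, entry_id)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
            return None
        # Push the new item, then pop the smallest; a weak newcomer evicts itself.
        return heapq.heappushpop(self._heap, item)[2]

    def ids(self) -> List[str]:
        return sorted(entry[2] for entry in self._heap)


kb = CappedKB(capacity=2)
kb.add("patch_a", 0.9)
kb.add("patch_b", 0.2)
print(kb.add("patch_c", 0.5))  # patch_b
print(kb.ids())                # ['patch_a', 'patch_c']
```

Each insertion costs O(log n) in the KB size, so the cap bounds both memory and the cost of Pinner's retrieval.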

Highlights & Insights

  • "Blind Expert" metaphor realized: Converts the LLM's inability to see instances into a process of feeding it "visual fragments" and "textual clues."
  • Shared Blackboard + Consistency Gate: Reuses the classic blackboard architecture for multimodal agents, keeping per-image coordination state (the board) separate from long-term knowledge (the KB).
  • Asymmetric Training/Inference: The "crop-then-oversegment" strategy ensures high-quality training samples while allowing the system to learn from unlabeled test flows via "safe write" rules.

Limitations & Future Work

  • Plateauing at 30-shot: Performance gains diminish as data increases, suggesting the framework is specialized for data-scarce regimes.
  • Heaviness of Pipeline: Running four agents, two memories, multiple encoders, and LLM self-refinement likely incurs significant latency (inference time not detailed).
  • External Model Dependency: Quality of "fragments" is capped by frozen CLIP/DINOv2 performance.

Comparison with Related Work

  • vs FM-FSOD (CVPR24): FM-FSOD uses LLM as a static classifier; AgentDet involves the LLM in the decision loop (self-refinement) and adds an explicit visual KB.
  • vs LLaFS (CVPR24): LLaFS requires polygon-level supervision; AgentDet operates on boxes and supports ZSOD via pseudo-incrementality.
  • Multi-Agent Paradigms: AgentDet tailors general multi-agent concepts (ReAct, AutoGen) specifically for detection through the shared blackboard coordinator.

Rating

  • Novelty: ⭐⭐⭐⭐ Connects blackboard agents + patch KB + pseudo-incrementality under a unified ZSOD/FSOD view.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Strong results on VOC/COCO; clear ablation on components.
  • Writing Quality: ⭐⭐⭐ Vivid analogies, though some formal threshold definitions are scattered.
  • Value: ⭐⭐⭐⭐ A reusable multi-agent recipe for low-shot open-vocabulary detection.