Can OOD Object Detectors Learn from Foundation Models?

Conference: ECCV 2024
arXiv: 2409.05162
Code: GitHub
Area: Object Detection
Keywords: OOD Object Detection, Synthetic Data, Foundation Models, Stable Diffusion, Scene-Level Editing

TL;DR

SyncOOD proposes an automated data curation method that leverages LLMs to imagine semantically novel OOD concepts and performs region-level editing on ID images via Stable Diffusion Inpainting to synthesize scene-level OOD samples. After refining bounding boxes with SAM and filtering via feature similarity, a lightweight MLP classifier is trained, substantially outperforming SOTA on multiple OOD detection benchmarks with a minimal amount of synthetic data.

Background & Motivation

Background: Modern object detectors achieve excellent performance on closed-set data, but in open-world applications they often misclassify OOD objects as ID classes, threatening deployment reliability. OOD object detection aims to identify such unknown objects and label them as OOD.

Limitations of Prior Work:

  • Most methods synthesize OOD data in the detector's latent space (e.g., VOS, SR-VAE, DFDD), which is limited by latent space quality and lacks interpretability.
  • Adversarial sample methods (e.g., SAFE) lack semantic diversity.
  • Video pseudo-supervision methods introduce additional data requirements.
  • All of these methods are bound to closed-set distributions, potentially biasing them towards the ID datasets.

Key Insight: Can foundation models (LLM + Stable Diffusion + SAM) trained on massive open data be leveraged to synthesize high-quality OOD samples? Two key observations: (1) "Hard" OOD samples close to ID data are more beneficial for learning precise decision boundaries; (2) context acts as a distracting factor in OOD detection.

Method

Overall Architecture

SyncOOD consists of two core phases: (1) OOD Data Synthesis—utilizing foundation models to automatically generate annotated, scene-level OOD images; (2) OOD Detector Training—optimizing ID/OOD decision boundaries via hard-sample mining and a lightweight classifier. The entire pipeline is fully automated and requires almost no human annotation. The key lies in decoupling OOD synthesis into four steps: "concept discovery \(\rightarrow\) region-level editing \(\rightarrow\) annotation refinement \(\rightarrow\) sample filtering".

Key Designs

1. LLM-Driven Novel Concept Imagination (Step 1)

  • Based on ID category labels, GPT-4 is utilized via in-context learning to brainstorm \(M\) semantically novel, visually similar, and context-compatible OOD concepts for each ID class.
  • The LLM is responsible for semantic separability: the imagined novel concepts are semantically distinct from the ID classes.
  • Experiments show that a single in-context example is sufficient to discover high-quality novel concepts.
  • Concepts overlapping with OOD classes in the test set are removed to prevent information leakage.
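
The concept-imagination step can be sketched as prompt assembly followed by a leakage filter. This is a minimal illustration, not the authors' prompt: the prompt wording and the helper names `build_concept_prompt` and `filter_concepts` are assumptions, and the actual LLM call is left out.

```python
from typing import Dict, List, Set


def build_concept_prompt(id_class: str, example_class: str,
                         example_concepts: List[str], m: int) -> str:
    """Assemble a one-shot (single in-context example) prompt for an LLM.

    The wording is hypothetical; only the structure (one worked example,
    then the query class) follows the description above.
    """
    example = ", ".join(example_concepts)
    return (
        "For a given object category, list objects that look similar and "
        "appear in the same scenes, but belong to a different category.\n"
        f"Category: {example_class}\nNovel concepts: {example}\n"
        f"Category: {id_class}\nNovel concepts (give {m}):"
    )


def filter_concepts(candidates: Dict[str, List[str]], id_classes: Set[str],
                    test_ood_classes: Set[str], m: int) -> Dict[str, List[str]]:
    """Drop concepts that collide with ID or test-set OOD classes; keep up to m."""
    blocked = {c.lower() for c in id_classes | test_ood_classes}
    kept: Dict[str, List[str]] = {}
    for id_class, concepts in candidates.items():
        seen: Set[str] = set()
        survivors: List[str] = []
        for concept in concepts:
            key = concept.strip().lower()
            # Skip empty strings, blocked names, and case-insensitive duplicates.
            if key and key not in blocked and key not in seen:
                seen.add(key)
                survivors.append(concept.strip())
        kept[id_class] = survivors[:m]
    return kept
```

Here `candidates` stands in for the parsed LLM replies, one list per ID class. Matching is by exact lowercase name, so synonyms of test-set classes would need a stricter check.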

2. Region-Level Image Editing (Step 2)

  • Use Stable-Diffusion-Inpainting for box-conditioned editing: \(\mathbf{x}^{\text{edit}} = \text{SDI}(\mathbf{x}^{\text{id}}, \mathbf{b}^{\text{id}}, \mathbf{y}^{\text{novel}})\)
  • The bounding box of the ID object is used as the editing mask, and the novel concept acts as the text prompt condition.
  • Key advantage: keeps the original scene context unchanged, replacing only the object inside the target region, which eliminates context bias interference.
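
Box-conditioned editing needs the ID bounding box turned into a binary inpainting mask. The sketch below shows only that conversion in plain Python; the box convention (pixel coordinates with exclusive maxima) and the helper name `box_to_mask` are assumptions, and the inpainting call itself is omitted.

```python
from typing import List, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) in pixels, max exclusive.
Box = Tuple[int, int, int, int]


def box_to_mask(box: Box, width: int, height: int) -> List[List[int]]:
    """Return a height x width mask: 1 inside the box (region to repaint), 0 elsewhere."""
    x0, y0, x1, y1 = box
    # Clamp the box to the image so out-of-range annotations do not fail.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]
```

Pixels with value 0 are left untouched by the inpainting model, which is what keeps the scene context identical between the ID image and its edited counterpart.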

3. SAM Bounding Box Refinement (Step 3)

  • Due to the randomness of diffusion models, the position/size of the edited object may shift.
  • SAM is used within the padded region of the edited area to obtain the instance mask with the highest confidence.
  • After converting the mask to a bounding box, its IoU with the original box is computed. Only samples with \(\text{IoU} > \gamma\) are kept, which discards edits whose position or scale changed excessively.
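
The refinement check can be illustrated as follows, assuming the segmentation mask arrives as a binary grid and boxes use exclusive maxima. The helper names and the value of `gamma` passed by the caller are illustrative, not taken from the paper.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Tightest box around the non-zero pixels, or None for an empty mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not ys:
        return None
    xs = [x for row in mask for x, v in enumerate(row) if v]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def keep_refined_box(mask: List[List[int]], original_box: Box,
                     gamma: float) -> Optional[Box]:
    """Return the refined box if IoU with the original exceeds gamma, else None."""
    refined = mask_to_box(mask)
    if refined is None or iou(refined, original_box) <= gamma:
        return None
    return refined
```

A `None` result means the edited sample is discarded, either because no object was segmented or because the object drifted too far from the original box.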

4. Feature Similarity-Based Hard Sample Mining (Step 4)

  • The pre-trained detector is used to extract latent space features of ID/OOD object pairs.
  • Visually similar but semantically distinct OOD samples are selected based on cosine similarity: \(\epsilon_{\text{low}} < \text{sim}(\mathbf{z}^{\text{edit}}, \mathbf{z}^{\text{id}}) < \epsilon_{\text{up}}\)
  • Too high a similarity indicates failed editing, while too low a similarity indicates image distortion.
  • The samples retained are hard OOD samples that are "just confusing enough".
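
The similarity band can be sketched in a few lines, assuming each ID/OOD pair is given as two feature vectors of equal length. The function names are hypothetical and the thresholds are supplied by the caller.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_hard_samples(pairs: List[Tuple[Sequence[float], Sequence[float]]],
                        eps_low: float, eps_up: float) -> List[int]:
    """Indices of (z_edit, z_id) pairs with eps_low < sim < eps_up."""
    return [
        i for i, (z_edit, z_id) in enumerate(pairs)
        if eps_low < cosine_similarity(z_edit, z_id) < eps_up
    ]
```

For example, with `eps_low=0.3` and `eps_up=0.9` (illustrative values), a pair with identical features is rejected as a failed edit and a pair with orthogonal features is rejected as distorted.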

Loss & Training

  • Train a lightweight 3-layer MLP as a plug-and-play OOD detector.
  • Use standard binary classification loss to optimize the ID/OOD decision boundary.
  • The detector parameters remain frozen; only the additional MLP is trained, so the ID performance (mAP) is unaffected.
  • Following SAFE, multi-scale features are extracted as training samples.
  • Hyperparameters: learning rate 1e-4 (PASCAL-VOC) and 5e-5 (BDD-100K), momentum 0.9, dropout 0.5, batch size 32.
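
As a sketch of the objective, assume the MLP \(f_\theta\) outputs one logit per object feature \(\mathbf{z}_i\), with label \(t_i = 1\) for ID objects and \(t_i = 0\) for synthesized OOD objects (the label convention and the symbols \(f_\theta\), \(t_i\), \(N\) are assumptions, not notation from the paper). The standard binary cross-entropy over a batch of \(N\) features is then:

\[
\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, t_i \log \sigma\big(f_\theta(\mathbf{z}_i)\big) + (1 - t_i)\log\big(1 - \sigma(f_\theta(\mathbf{z}_i))\big) \Big]
\]

Here \(\sigma\) is the sigmoid function. Only \(\theta\) (the MLP weights) is updated; the features \(\mathbf{z}_i\) come from the frozen detector.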

Key Experimental Results

Main Results

| Method | VOC→COCO FPR95↓ | VOC→OI FPR95↓ | BDD→COCO FPR95↓ | BDD→OI FPR95↓ |
|---|---|---|---|---|
| MSP | 70.99 | 73.13 | 80.94 | 79.04 |
| VOS | 47.53 | 51.33 | 44.27 | 35.54 |
| SAFE | 47.40 | 20.06 | 32.56 | 16.04 |
| DFDD | 41.34 | 44.52 | 30.71 | 22.67 |
| Ours+FRCNN | 36.44 | 13.34 | 22.67 | 12.96 |
| Ours+VOS | 34.97 | 11.25 | 23.09 | 14.12 |

Both SyncOOD variants achieve lower FPR95 than every listed baseline on all four benchmarks, while using only about 25% (VOC) and 20% (BDD) of the auxiliary data that SAFE uses.
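
For readers unfamiliar with the metric, FPR95 is the fraction of OOD objects still accepted as ID when the threshold is set so that 95% of ID objects are accepted. The following is a minimal sketch, assuming higher scores mean "more ID-like"; the function name is hypothetical and this is not the benchmark's evaluation code.

```python
from typing import Sequence


def fpr_at_95_tpr(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """False positive rate on OOD samples at a 95% true positive rate on ID samples."""
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of ID samples covering at least 95% (integer ceiling).
    k = (95 * len(ranked) + 99) // 100
    threshold = ranked[k - 1]
    # OOD samples scoring at or above the threshold are wrongly accepted as ID.
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)
```

The function returns a fraction in [0, 1]; the table reports the same quantity as a percentage, and lower is better.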

Ablation Study

| Ablation Item | FPR95 (COCO / OI) ↓ | Analysis |
|---|---|---|
| Synthetic data size 2k→14k | 37.82~36.70 / 13.87~12.96 | Highly stable performance, high data efficiency |
| Number of concepts 3→8 | 36.96~37.91 / 13.15~13.58 | Insensitive to the number of concepts |
| W/o SAM refinement | 39.55 / 13.72 | SAM provides more accurate bounding boxes |
| W/o similarity filtering | 39.29 / 13.68 | Filter effectively removes noisy samples |
| Using object-centric images | 51.99 / 20.70 | Scene-level editing far outperforms pure object images |
| Scene-level w/o bounding boxes | 48.01 / 18.61 | Precise OOD bounding box annotations are essential |

Key Findings

  • Extremely high data efficiency: Only 2k synthetic samples are enough to approach peak performance, significantly superior to SAFE which requires 16k+ samples.
  • Scene-level editing is crucial: Pure object images and full-image OOD are far inferior to region-level editing.
  • Context consistency is key: Even slight background modifications lead to significant shifts in detector features.
  • Sweet spot in similarity range: >0.9 indicates failed editing, whereas too low indicates image distortion.

Highlights & Insights

  • First to achieve photorealistic scene-level OOD synthesis: Extends OOD synthesis from the latent space to the pixel space.
  • Seamlessly orchestrating LLM, SD, and SAM: Effectively cascades concept imagination \(\rightarrow\) image editing \(\rightarrow\) annotation refinement, leveraging the unique strengths of each.
  • Explicitly decoupling synthesis and selection: Semantic separability is guaranteed by the LLM, whereas visual similarity is ensured by feature filtering.
  • Plug-and-play design: Superimposes OOD detection capability onto any detector without modifying the original model.

Limitations & Future Work

  • Synthesis quality is constrained by the capabilities of Stable Diffusion and SAM; certain concepts might fail to edit.
  • Removing LLM-imagined concepts that overlap with test-set OOD classes requires knowing those classes in advance, which itself risks leaking prior knowledge about the test set.
  • Only evaluated with two base detectors (a vanilla Faster R-CNN and a VOS-trained model); more modern detector architectures have not been explored.
  • Computational overhead (GPT-4 API calls, SD generation time) is not discussed.
  • Can be extended to multi-modal OOD detection such as 3D and video.

Related Work & Insights

  • VOS: A classic method that samples virtual outliers in the latent space, acting as the direct baseline for SyncOOD.
  • SAFE: A similar framework but utilizes adversarial noise, with data efficiency far lower than SyncOOD.
  • Dream-OOD: Uses diffusion models to synthesize OOD data for image classification, so it is not directly applicable to detection tasks.
  • Insight: The open-world knowledge in foundation models can be efficiently injected into downstream detection tasks via automated pipelines.

Rating

Dimension Rating
Novelty ⭐⭐⭐⭐
Technical Depth ⭐⭐⭐⭐
Experimental Thoroughness ⭐⭐⭐⭐⭐
Engineering Practicality ⭐⭐⭐⭐
Writing Quality ⭐⭐⭐⭐