Can OOD Object Detectors Learn from Foundation Models?¶

Conference: ECCV 2024
arXiv: 2409.05162
Code: GitHub
Area: Object Detection
Keywords: OOD Object Detection, Synthetic Data, Foundation Models, Stable Diffusion, Scene-Level Editing

TL;DR¶

SyncOOD is an automated data curation method: an LLM imagines semantically novel OOD concepts, and Stable Diffusion Inpainting performs region-level editing on ID images to synthesize scene-level OOD samples. After SAM refines the bounding boxes and a feature-similarity filter selects the samples, a lightweight MLP classifier is trained on the result. The approach substantially outperforms SOTA on multiple OOD detection benchmarks while using only a small amount of synthetic data.

Background & Motivation¶

Background: Modern object detectors achieve excellent performance on closed-set data, but often misclassify OOD classes as ID classes in open-world applications, threatening deployment reliability. OOD object detection aims to identify and label unknown objects.

Limitations of Prior Work:
- Most methods synthesize OOD data in the detector's latent space (e.g., VOS, SR-VAE, DFDD), which is limited by latent space quality and lacks interpretability.
- Adversarial sample methods (e.g., SAFE) lack semantic diversity.
- Video pseudo-supervision methods introduce additional data requirements.
- All methods are bound to closed-set distributions, potentially biasing towards ID datasets.

Key Insight: Can foundation models (LLM + Stable Diffusion + SAM) trained on massive open data be leveraged to synthesize high-quality OOD samples? Two key observations: (1) "Hard" OOD samples close to ID data are more beneficial for learning precise decision boundaries; (2) context acts as a distracting factor in OOD detection.

Method¶

Overall Architecture¶

SyncOOD consists of two core phases: (1) OOD Data Synthesis—utilizing foundation models to automatically generate annotated, scene-level OOD images; (2) OOD Detector Training—optimizing ID/OOD decision boundaries via hard-sample mining and a lightweight classifier. The entire pipeline is fully automated and requires almost no human annotation. The key lies in decoupling OOD synthesis into four steps: "concept discovery \(\rightarrow\) region-level editing \(\rightarrow\) annotation refinement \(\rightarrow\) sample filtering".

Key Designs¶

1. LLM-Driven Novel Concept Imagination (Step 1)

Based on ID category labels, GPT-4 is utilized via in-context learning to brainstorm \(M\) semantically novel, visually similar, and context-compatible OOD concepts for each ID class.
The LLM ensures semantic separability: the imagined concepts are semantically distinct from the ID classes.
Experiments show that a single in-context example is sufficient to discover high-quality novel concepts.
Concepts overlapping with OOD classes in the test set are removed to prevent information leakage.
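A minimal sketch of Step 1, assuming a one-shot prompt and a simple name-based leakage filter. The prompt wording, the in-context example, and the helper names `build_concept_prompt` and `remove_leaked_concepts` are illustrative assumptions, not the paper's actual prompt or code. Dropping concepts that match an ID class name is an extra safeguard assumed here. The call to the LLM itself is omitted.

```python
from typing import Dict, Iterable, List

# Hypothetical one-shot example; the source does not give the real prompt text.
IN_CONTEXT_EXAMPLE = (
    "ID class: dog\n"
    "Novel concepts: wolf, fox, hyena\n"
)


def build_concept_prompt(id_class: str, num_concepts: int) -> str:
    """Build a prompt with a single in-context example for one ID class."""
    return (
        f"List {num_concepts} object concepts that look similar to the ID class "
        "and appear in similar scenes, but belong to a different category.\n\n"
        f"{IN_CONTEXT_EXAMPLE}\n"
        f"ID class: {id_class}\n"
        "Novel concepts:"
    )


def _norm(name: str) -> str:
    """Normalize a concept name for comparison."""
    return name.strip().lower()


def remove_leaked_concepts(
    concepts: Dict[str, List[str]],
    id_classes: Iterable[str],
    test_ood_classes: Iterable[str],
) -> Dict[str, List[str]]:
    """Drop concepts that match an ID class or a test-set OOD class."""
    blocked = {_norm(c) for c in id_classes} | {_norm(c) for c in test_ood_classes}
    cleaned: Dict[str, List[str]] = {}
    for id_class, names in concepts.items():
        kept: List[str] = []
        for name in names:
            n = _norm(name)
            # Keep each surviving concept once, in its original order.
            if n and n not in blocked and n not in kept:
                kept.append(n)
        cleaned[id_class] = kept
    return cleaned
```

Exact string matching is the simplest possible filter; synonyms or near-duplicates of test-set classes would need a stricter check.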

2. Region-Level Image Editing (Step 2)

Use Stable-Diffusion-Inpainting for box-conditioned editing: \(\mathbf{x}^{\text{edit}} = \text{SDI}(\mathbf{x}^{\text{id}}, \mathbf{b}^{\text{id}}, \mathbf{y}^{\text{novel}})\)
The bounding box of the ID object is used as the editing mask, and the novel concept acts as the text prompt condition.
Key advantage: keeps the original scene context unchanged, replacing only the object inside the target region, which eliminates context bias interference.
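The inputs to the box-conditioned edit can be sketched as follows: the ID bounding box \(\mathbf{b}^{\text{id}}\) becomes a binary mask and the novel concept \(\mathbf{y}^{\text{novel}}\) becomes a text prompt. The prompt template, the `EditRequest` container, and the function names are assumptions made for illustration. The inpainting call itself is left out because it depends on the library used.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; the max coordinates are exclusive.
Box = Tuple[int, int, int, int]


def box_to_mask(box: Box, width: int, height: int) -> List[List[int]]:
    """Return a height x width mask: 255 inside the box (to edit), 0 elsewhere."""
    x0, y0, x1, y1 = box
    # Clip the box to the image so out-of-range boxes do not break the mask.
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(width, x1), min(height, y1)
    return [
        [255 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


@dataclass
class EditRequest:
    prompt: str             # text condition built from the novel concept
    mask: List[List[int]]   # region to repaint; the rest of the scene is kept


def make_edit_request(box: Box, concept: str, width: int, height: int) -> EditRequest:
    """Bundle the mask and prompt for one region-level edit."""
    # The template below is a placeholder, not the prompt used in the paper.
    return EditRequest(
        prompt=f"a photo of a {concept}",
        mask=box_to_mask(box, width, height),
    )
```

Because pixels outside the mask are left untouched, the scene context stays identical between the ID image and its edited counterpart.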

3. SAM Bounding Box Refinement (Step 3)

Due to the randomness of diffusion models, the position/size of the edited object may shift.
SAM is used within the padded region of the edited area to obtain the instance mask with the highest confidence.
After converting the mask to a bounding box, the IoU with the original box is calculated; only samples with \(\text{IoU} > \gamma\) are kept, which discards edits whose position or scale changed excessively.
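The mask-to-box conversion and IoU check can be sketched as below, assuming the instance mask is already available as a 2D grid of 0/1 values (the SAM call is not shown). The function names are illustrative and any value of `gamma` passed in is a placeholder, since the threshold is not stated here.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in pixels; the max coordinates are exclusive.
Box = Tuple[int, int, int, int]


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Tightest box around the nonzero pixels, or None for an empty mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine_and_filter(mask: List[List[int]], original_box: Box, gamma: float) -> Optional[Box]:
    """Return the refined box if its IoU with the original exceeds gamma, else None."""
    refined = mask_to_box(mask)
    if refined is None:
        return None
    return refined if box_iou(refined, original_box) > gamma else None
```

A sample for which `refine_and_filter` returns `None` would simply be dropped from the synthetic set.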

4. Feature Similarity-Based Hard Sample Mining (Step 4)

The pre-trained detector is used to extract latent space features of ID/OOD object pairs.
Visually similar but semantically distinct OOD samples are selected by cosine similarity, keeping only pairs that satisfy \(\epsilon_{\text{low}} < \text{sim}(\mathbf{z}^{\text{edit}}, \mathbf{z}^{\text{id}}) < \epsilon_{\text{up}}\).
Too high a similarity indicates failed editing, while too low a similarity indicates image distortion.
The filter therefore retains hard OOD samples that are "just confusing enough".
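The similarity band can be sketched in a few lines, assuming each ID/OOD pair is given as two plain feature vectors. The function names are illustrative, and the thresholds are arguments because their exact values are hyperparameters.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_hard_samples(
    pairs: List[Tuple[Sequence[float], Sequence[float]]],
    eps_low: float,
    eps_up: float,
) -> List[int]:
    """Indices of (z_edit, z_id) pairs whose similarity lies strictly inside the band."""
    return [
        i
        for i, (z_edit, z_id) in enumerate(pairs)
        if eps_low < cosine_similarity(z_edit, z_id) < eps_up
    ]
```

For example, with `eps_low=0.3` and `eps_up=0.9`, a pair with identical features (similarity 1.0) is rejected as a failed edit and an orthogonal pair (similarity 0.0) is rejected as too dissimilar.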

Loss & Training¶

Train a lightweight 3-layer MLP as a plug-and-play OOD detector.
Use standard binary classification loss to optimize the ID/OOD decision boundary.
The detector parameters remain frozen; only the additional MLP is trained, so the ID performance (mAP) is unaffected.
Following SAFE, multi-scale features are extracted as training samples.
Hyperparameters: learning rate 1e-4 (PASCAL-VOC) and 5e-5 (BDD-100K), momentum 0.9, dropout 0.5, batch size 32.
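For reference, if the "standard binary classification loss" is taken to be binary cross-entropy (an assumption; the exact form is not given here), the objective over \(N\) training features can be written as below. The symbols \(f_\theta\) (the MLP), \(\sigma\) (the sigmoid), and the labeling convention \(y_i = 1\) for ID features \(\mathbf{z}^{\text{id}}\) and \(y_i = 0\) for synthesized OOD features \(\mathbf{z}^{\text{edit}}\) are introduced for illustration only.

\[
\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \sigma\big(f_\theta(\mathbf{z}_i)\big) + (1 - y_i)\log\Big(1 - \sigma\big(f_\theta(\mathbf{z}_i)\big)\Big) \Big]
\]

Only \(\theta\) is updated; the detector that produces \(\mathbf{z}_i\) stays frozen, which is why the ID mAP does not change.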

Key Experimental Results¶

Main Results¶

Method	VOC→COCO FPR95↓	VOC→OI FPR95↓	BDD→COCO FPR95↓	BDD→OI FPR95↓
MSP	70.99	73.13	80.94	79.04
VOS	47.53	51.33	44.27	35.54
SAFE	47.40	20.06	32.56	16.04
DFDD	41.34	44.52	30.71	22.67
Ours+FRCNN	36.44	13.34	22.67	12.96
Ours+VOS	34.97	11.25	23.09	14.12

SyncOOD achieves the lowest FPR95 on all four benchmarks, substantially outperforming SOTA while using only about 25% (VOC) and 20% (BDD) of the auxiliary data that SAFE uses.
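As a reminder of what the FPR95 column measures, here is a minimal sketch under the common convention that ID objects are the positive class and a higher score means "more ID-like". The threshold is set so that at least 95% of ID objects are accepted, and FPR95 is the fraction of OOD objects that also pass it. The function name and the threshold-selection details are assumptions; evaluation code for these benchmarks may differ in tie handling.

```python
from typing import Sequence


def fpr_at_95_tpr(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """Fraction of OOD scores accepted at the threshold that keeps >= 95% of ID scores."""
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(id_scores)
    # Index of the largest threshold that still accepts at least 95% of ID samples.
    k = (len(ranked) * 5) // 100
    threshold = ranked[k]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)
```

Lower is better: a value of 36.44 in the table means that 36.44% of OOD objects are still accepted as ID when 95% of true ID objects are accepted.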

Ablation Study¶

Ablation Item	FPR95(COCO/OI)↓	Analysis
Synthetic data size 2k→14k	37.82~36.70 / 13.87~12.96	Highly stable performance, high data efficiency
Number of concepts 3→8	36.96~37.91 / 13.15~13.58	Insensitive to the number of concepts
W/o SAM refinement	39.55 / 13.72	SAM provides more accurate bounding boxes
W/o similarity filtering	39.29 / 13.68	Filter effectively removes noisy samples
Using object-centric images	51.99 / 20.70	Scene-level editing far outperforms pure object images
Scene-level w/o bounding boxes	48.01 / 18.61	Precise OOD bounding box annotations are essential

Key Findings¶

Extremely high data efficiency: 2k synthetic samples are enough to approach peak performance, whereas SAFE requires 16k+ samples.
Scene-level editing is crucial: Pure object images and full-image OOD are far inferior to region-level editing.
Context consistency is key: Even slight background modifications lead to significant shifts in detector features.
Sweet spot in similarity range: a similarity above 0.9 indicates failed editing, whereas too low a similarity indicates image distortion.

Highlights & Insights¶

First to achieve photorealistic scene-level OOD synthesis: Extends OOD synthesis from the latent space to the pixel space.
Seamlessly orchestrating LLM, SD, and SAM: Effectively cascades concept imagination \(\rightarrow\) image editing \(\rightarrow\) annotation refinement, leveraging the unique strengths of each.
Explicitly decoupling synthesis and selection: Semantic separability is guaranteed by the LLM, whereas visual similarity is ensured by feature filtering.
Plug-and-play design: Superimposes OOD detection capability onto any detector without modifying the original model.

Limitations & Future Work¶

Synthesis quality is constrained by the capabilities of Stable Diffusion and SAM; certain concepts might fail to edit.
LLM-imagined concepts that overlap with test-set OOD classes must be removed, but this step itself requires knowing the test-set classes and so risks leaking prior knowledge.
Only evaluated on two detectors (Faster R-CNN and VOS); more modern detector architectures have not been explored.
Computational overhead (GPT-4 API calls, SD generation time) is not discussed.
Can be extended to multi-modal OOD detection such as 3D and video.

VOS: A classic method that samples OOD samples in the latent space, acting as the direct baseline for SyncOOD.
SAFE: A similar framework that relies on adversarial noise instead of synthesized images; its data efficiency is far lower than SyncOOD's.
Dream-OOD: Uses diffusion models to synthesize OOD data for image classification, and is not directly applicable to detection tasks.
Insight: The open-world knowledge in foundation models can be efficiently injected into downstream detection tasks via automated pipelines.

Rating¶

Dimension	Rating
Novelty	⭐⭐⭐⭐
Technical Depth	⭐⭐⭐⭐
Experimental Thoroughness	⭐⭐⭐⭐⭐
Engineering Practicality	⭐⭐⭐⭐
Writing Quality	⭐⭐⭐⭐