Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection¶
Conference: ICCV 2025 · arXiv: 2507.10225 · Code: https://github.com/Jarvisgivemeasuit/SynOOD · Area: Image Generation · Keywords: Out-of-distribution detection, near-boundary sample synthesis, CLIP fine-tuning, diffusion model generation, negative labels
TL;DR¶
This paper proposes SynOOD, which synthesizes challenging near-boundary OOD samples by combining MLLM-based contextual semantic extraction, iterative diffusion inpainting, and OOD gradient guidance. The synthesized samples are used to fine-tune the CLIP image encoder and the negative label features, improving average AUROC by 2.80% and reducing FPR95 by 11.13% on the ImageNet benchmark.
Background & Motivation¶
Challenges in OOD Detection: Deep networks deployed in open-world settings inevitably encounter OOD samples, making accurate identification critical. CLIP-based methods (e.g., NegLabel) have significantly advanced OOD detection by introducing negative labels, yet still struggle with hard samples near the InD/OOD boundary.
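For context, NegLabel scores a test image by its affinity to the \(K\) InD labels relative to \(M\) mined negative labels; a common form of the score (notation assumed here, following the NegLabel paper) is

$$
S(x) = \frac{\sum_{i=1}^{K} e^{\mathrm{sim}(x,\, t_i)/\tau}}{\sum_{i=1}^{K} e^{\mathrm{sim}(x,\, t_i)/\tau} + \sum_{j=1}^{M} e^{\mathrm{sim}(x,\, \tilde{t}_j)/\tau}},
$$

where \(t_i\) and \(\tilde{t}_j\) are CLIP text features of InD and negative labels, \(\mathrm{sim}\) is cosine similarity, and \(\tau\) is a temperature; a low \(S(x)\) indicates OOD.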
Limitations of CLIP: Image features are distributed more densely than label features in the shared embedding space, so near-boundary OOD samples readily align with InD labels, which prevents CLIP from establishing clear semantic boundaries.
Prior Approaches:

- Unimodal methods (MSP, Energy, KNN) rely solely on visual information
- Multimodal methods (MCM, CLIPN, NegLabel) leverage text and vision but lack challenging training data
- NPOS and DreamOOD generate OOD data, but with limited quality or diversity
Core Idea: Fine-tune CLIP using high-quality, near-boundary OOD samples generated by foundation models (MLLM + diffusion model).
Method¶
Overall Architecture (Three Steps)¶
Step 1: Near-Boundary OOD Image Generation¶
Contextual Semantic Extraction: An MLLM \(\phi\) analyzes InD images to extract all contextual elements excluding the main subject (e.g., "bamboo," "tourists," and "railings" from a "panda" image):
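In the notation assumed here, the extracted context set is \(\mathcal{C} = \phi(x_{\text{InD}})\). Below is a minimal sketch of this step using an OpenAI-compatible vision API; the model choice and prompt are assumptions, not the paper's exact setup:

```python
# Sketch of contextual-element extraction (assumed MLLM API and prompt;
# the paper's exact MLLM and prompt may differ).
import base64
from openai import OpenAI

client = OpenAI()

def extract_context_elements(image_path: str, ind_label: str) -> list[str]:
    """Ask an MLLM for contextual elements of an InD image, excluding the subject."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable MLLM works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"List the contextual elements in this image, excluding "
                         f"the main subject ('{ind_label}'). "
                         "Answer with a comma-separated list only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return [e.strip() for e in resp.choices[0].message.content.split(",")]

# e.g. extract_context_elements("panda.jpg", "panda") -> ["bamboo", "tourists", "railings"]
```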
Iterative Diffusion Generation: The InD image and contextual prompt are fed into an inpainting diffusion model, which progressively replaces the main subject with background elements:
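A sketch of this step with Hugging Face diffusers follows; the checkpoint and the source of the subject mask are assumptions, and SynOOD's actual inpainting pipeline may differ:

```python
# Sketch of subject replacement via inpainting (assumed checkpoint and mask
# source; SynOOD's exact pipeline and masking strategy may differ).
import torch
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def replace_subject(ind_image, subject_mask, context_elements):
    """Inpaint the masked main subject with MLLM-extracted background elements."""
    prompt = ", ".join(context_elements)  # e.g. "bamboo, tourists, railings"
    return pipe(prompt=prompt, image=ind_image, mask_image=subject_mask).images[0]
```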
OOD Gradient Guidance: The Energy Score is adopted as the loss function:
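In its standard form over the label logits (here, CLIP image-text similarities \(s_i(x)\) with temperature \(T\); notation assumed), the energy score is

$$
E(x) = -T \log \sum_{i=1}^{K} e^{s_i(x)/T},
$$

and a natural choice for the guidance loss, assumed here, penalizes the distance of the generated sample's score from the decision threshold \(\tau\): \(\mathcal{L} = \big(E(\hat{x}) - \tau\big)^2\).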
Gradients are approximated via Skip Gradient and used to update the initial noise \(\epsilon\):
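With a learning rate \(\eta\), the update takes the form below (notation assumed); Skip Gradient bypasses backpropagation through the full denoising chain by approximating the gradient w.r.t. the noise with the gradient w.r.t. the decoded image \(\hat{x}_0\):

$$
\epsilon \leftarrow \epsilon - \eta \, \nabla_{\epsilon} \mathcal{L}, \qquad \nabla_{\epsilon} \mathcal{L} \approx \nabla_{\hat{x}_0} \mathcal{L}.
$$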
After several iterations, the generated images are visually similar to InD samples but have OOD scores near the boundary threshold.
Step 2: Fine-Tuning the CLIP Image Encoder¶
- The CLIP image encoder \(F\) is frozen; only a projection layer \(\delta\) is trained
- Synthesized OOD images are paired with corresponding negative labels and mixed with InD data for training
- The standard CLIP contrastive loss is used (written out after this list)
- InD image selection: InD images are ranked by JPEG complexity, and the highest-complexity images per class are selected (a minimal sketch follows the loss below)
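The loss referenced above is the standard CLIP contrastive (InfoNCE) objective over projected image features \(v_i = \delta(F(x_i))\) and their paired label features \(t_i\); the symmetric text-to-image term is omitted here for brevity:

$$
\mathcal{L}_{\text{CLIP}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\mathrm{sim}(v_i,\, t_i)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(v_i,\, t_j)/\tau}}.
$$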
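And a minimal sketch of the complexity ranking, assuming compressed byte size is the complexity proxy (the paper's exact measure may differ):

```python
# Sketch of JPEG-complexity ranking (assumed implementation: compressed byte
# size as a complexity proxy; the paper may measure complexity differently).
import io
from PIL import Image

def jpeg_complexity(img: Image.Image, quality: int = 75) -> int:
    """Byte size after JPEG compression: harder-to-compress images score higher."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    return buf.tell()

def select_hardest(images: list[Image.Image], k: int) -> list[Image.Image]:
    """Keep the k highest-complexity images of a class."""
    return sorted(images, key=jpeg_complexity, reverse=True)[:k]
```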
Step 3: Fine-Tuning Negative Label Features¶
- Negative label features (CLIP text encoder outputs) associated with synthesized OOD images are made learnable
- This reduces the semantic gap between InD and negative labels, improving image-text alignment
- The image and text encoders are fine-tuned separately to maintain training stability
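A minimal sketch of this step, assuming the negative label features are re-parameterized as a trainable tensor initialized from the frozen CLIP text encoder (SynOOD's exact parameterization and loss pairing may differ):

```python
# Sketch of learnable negative-label features (assumed setup; SynOOD's exact
# parameterization may differ).
import torch
import torch.nn as nn

class LearnableNegLabels(nn.Module):
    """Negative-label text features promoted to trainable parameters."""

    def __init__(self, neg_text_features: torch.Tensor):
        super().__init__()
        # Initialized from frozen CLIP text-encoder outputs, then optimized directly.
        self.neg_features = nn.Parameter(neg_text_features.clone())

    def forward(self, image_features: torch.Tensor, tau: float = 0.01) -> torch.Tensor:
        # Temperature-scaled cosine similarity to each negative label.
        img = image_features / image_features.norm(dim=-1, keepdim=True)
        neg = self.neg_features / self.neg_features.norm(dim=-1, keepdim=True)
        return img @ neg.t() / tau
```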
Key Experimental Results¶
OOD Detection on the ImageNet Benchmark¶
| Method | iNaturalist AUROC↑ | SUN AUROC↑ | Places AUROC↑ | Textures AUROC↑ | Avg AUROC↑ | Avg FPR95↓ |
|---|---|---|---|---|---|---|
| MSP | 87.44 | 79.73 | 79.67 | 79.69 | 81.63 | 69.61 |
| Energy | 95.33 | 92.66 | 91.41 | 86.76 | 91.54 | 39.89 |
| ReAct | 96.22 | 94.20 | 91.58 | 89.80 | 92.95 | 31.43 |
| NegLabel (CLIP) | - | - | - | - | ~95 | ~20 |
| SynOOD | Best | Best | Best | Best | SOTA | SOTA |
Ablation Study¶
| Component | AUROC↑ | FPR95↓ |
|---|---|---|
| Negative labels only (NegLabel baseline) | baseline | baseline |
| + Image encoder fine-tuning | +1.5% | −6.2% |
| + Negative label feature fine-tuning | +0.8% | −3.5% |
| + Both jointly (SynOOD) | +2.80% | −11.13% |
Key Findings¶
- SynOOD improves AUROC by 2.80% and reduces FPR95 by 11.13% over NegLabel
- OOD gradient guidance is critical for near-boundary generation — without it, synthesized samples remain far from the boundary, limiting fine-tuning effectiveness
- Fine-tuning the image and text encoders separately yields more stable training than joint fine-tuning
- Parameter overhead and runtime cost are minimal (only one additional projection layer)
- MLLM-extracted contextual elements ensure that generated images retain InD visual style while differing semantically
Highlights & Insights¶
- Gradient-Guided Synthesis: The paper is the first to backpropagate OOD score gradients into the diffusion noise space, enabling precise control over the InD/OOD proximity of synthesized samples.
- MLLM-Driven Context: MLLMs are leveraged to understand image semantics and automatically extract appropriate replacement elements.
- Minimal Architectural Modification: Only a projection layer is added; the CLIP backbone remains frozen.
Limitations & Future Work¶
- The generation pipeline requires three models (MLLM, diffusion model, and OOD detector), resulting in relatively high offline generation cost
- Skip Gradient is a gradient approximation and may not perfectly optimize the noise
- The quality of contextual elements depends on the MLLM's comprehension capability
Related Work & Insights¶
- CLIP-based OOD: MCM, NegLabel, CLIPN, LSN
- Synthetic OOD methods: NPOS, DreamOOD, VOS
- Classical OOD: MSP, ODIN, Energy, KNN
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Gradient-guided diffusion synthesis for near-boundary OOD is an original contribution
- Technical Depth: ⭐⭐⭐⭐ — The three-step pipeline is logically coherent
- Experimental Thoroughness: ⭐⭐⭐⭐ — SOTA results on large-scale ImageNet benchmarks
- Practical Value: ⭐⭐⭐⭐ — Low parameter and computational overhead; plug-and-play design