
Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection

Conference: ICCV 2025
arXiv: 2507.10225
Code: https://github.com/Jarvisgivemeasuit/SynOOD
Area: Image Generation
Keywords: Out-of-distribution detection, near-boundary sample synthesis, CLIP fine-tuning, diffusion model generation, negative labels

TL;DR

This paper proposes SynOOD, which synthesizes challenging near-boundary OOD samples by combining MLLM-based contextual semantic extraction, iterative diffusion inpainting, and OOD gradient guidance. The synthesized samples are used to fine-tune the CLIP image encoder and negative label features, achieving a 2.80% AUROC improvement and 11.13% FPR95 reduction on the ImageNet benchmark.

Background & Motivation

Challenges in OOD Detection: Deep networks deployed in open-world settings inevitably encounter OOD samples, making accurate identification critical. CLIP-based methods (e.g., NegLabel) have significantly advanced OOD detection by introducing negative labels, yet still struggle with hard samples near the InD/OOD boundary.

Limitations of CLIP: In CLIP's feature space, image embeddings are packed far more densely than label embeddings, so near-boundary OOD samples readily align with InD labels and CLIP cannot establish a clear semantic boundary between the two.

Prior Approaches:

  • Unimodal methods (MSP, Energy, KNN) rely solely on visual information
  • Multimodal methods (MCM, CLIPN, NegLabel) leverage text and vision but lack challenging training data
  • NPOS and DreamOOD generate OOD data but with limited quality or diversity

Core Idea: Fine-tune CLIP using high-quality, near-boundary OOD samples generated by foundation models (MLLM + diffusion model).

Method

Overall Architecture (Three Steps)

Step 1: Near-Boundary OOD Image Generation

Contextual Semantic Extraction: An MLLM \(\phi\) analyzes InD images to extract all contextual elements excluding the main subject (e.g., "bamboo," "tourists," and "railings" from a "panda" image):

\[p^{con} = \phi(x^{in}, p^{in})\]
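A minimal sketch of this extraction step; `query_mllm` is a hypothetical wrapper around whatever MLLM backend is used (the paper's pipeline is not tied to a specific model or API, and the prompt wording here is ours):

```python
def extract_context(query_mllm, image, ind_label: str) -> list:
    """Ask the MLLM for contextual elements, excluding the InD subject."""
    prompt = (
        f"List the contextual elements visible in this image, excluding "
        f"the main subject '{ind_label}'. Answer as a comma-separated list."
    )
    answer = query_mllm(image=image, prompt=prompt)  # p^con = phi(x^in, p^in)
    return [e.strip() for e in answer.split(",") if e.strip()]
```

For a "panda" photo this might return `["bamboo", "tourists", "railings"]`, which then seeds the inpainting prompt.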

Iterative Diffusion Generation: The InD image and contextual prompt are fed into an inpainting diffusion model, which progressively replaces the main subject with background elements:

\[z_T = \sqrt{\bar{\alpha}_T}z^{in} + \sqrt{1-\bar{\alpha}_T}\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
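A minimal PyTorch sketch of this forward-noising step, with our own variable names (`alpha_bar_T` is the cumulative schedule coefficient \(\bar{\alpha}_T\)):

```python
import torch

def noise_latent(z_in: torch.Tensor, alpha_bar_T: float, eps=None):
    """DDPM forward noising: z_T = sqrt(a_bar_T) z_in + sqrt(1 - a_bar_T) eps."""
    if eps is None:
        eps = torch.randn_like(z_in)  # eps ~ N(0, I)
    z_T = (alpha_bar_T ** 0.5) * z_in + ((1.0 - alpha_bar_T) ** 0.5) * eps
    return z_T, eps  # eps is returned because the OOD guidance later updates it
```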

OOD Gradient Guidance: The Energy Score is adopted as the loss function:

\[\mathcal{L}^O = m_{out} - \tau \cdot \log\sum_{i=1}^{C} e^{g_i(x^{syn})/\tau}\]

Gradients are approximated via Skip Gradient and used to update the initial noise \(\epsilon\):

\[\epsilon := \epsilon - r \cdot \tilde{\nabla}_\epsilon\mathcal{L}^O\]

After several iterations, the generated images are visually similar to InD samples but have OOD scores near the boundary threshold.
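Putting the loss and the noise update together, a rough sketch of the guidance loop under our reading: `denoise` runs the inpainting sampler to a clean latent, `score_logits` is the detector \(g\), and the Skip Gradient is approximated by treating the sampler's Jacobian with respect to the noise as identity (the paper's exact approximation may differ):

```python
import torch

def energy_loss(logits: torch.Tensor, m_out: float, tau: float) -> torch.Tensor:
    """L^O = m_out - tau * logsumexp(g(x_syn) / tau): minimizing it drives
    the sample's energy score toward the margin m_out near the boundary."""
    return m_out - tau * torch.logsumexp(logits / tau, dim=-1)

def guide_noise(eps, z_in, alpha_bar_T, denoise, score_logits,
                m_out, tau=1.0, r=0.1, n_iters=5):
    for _ in range(n_iters):
        z_T = (alpha_bar_T ** 0.5) * z_in + ((1 - alpha_bar_T) ** 0.5) * eps
        with torch.no_grad():
            z0 = denoise(z_T)                    # full sampler, no backprop
        z0 = z0.detach().requires_grad_(True)
        loss = energy_loss(score_logits(z0), m_out, tau).sum()
        grad = torch.autograd.grad(loss, z0)[0]
        # Skip-gradient-style shortcut: treat d z0 / d eps as identity, so
        # the gradient at z0 stands in for the gradient w.r.t. eps.
        eps = eps - r * grad
    return eps
```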

Step 2: Fine-Tuning the CLIP Image Encoder

  • The CLIP image encoder \(F\) is frozen; only a projection layer \(\delta\) is trained
  • Synthesized OOD images are paired with corresponding negative labels and mixed with InD data for training
  • The standard CLIP contrastive loss is used (a minimal sketch follows this list):
\[\mathcal{L}^P = -\frac{1}{2m}\sum_{i=1}^{2m}\log\frac{\exp(sim(\hat{I}_i, T_i)/\tau)}{\sum_{j=1}^{M'}\exp(sim(\hat{I}_i, T_j)/\tau)}\]
  • InD image selection: images are ranked by JPEG complexity, and the highest-complexity images per class are selected
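A minimal sketch of Step 2 under our naming; the feature dimension of 512 is an assumption (typical for ViT-B-style CLIP encoders), and only `delta` receives gradients:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectedEncoder(nn.Module):
    """Frozen CLIP image encoder F followed by a trainable projection delta."""
    def __init__(self, clip_image_encoder: nn.Module, dim: int = 512):
        super().__init__()
        self.encoder = clip_image_encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)           # backbone stays frozen
        self.delta = nn.Linear(dim, dim)      # the only trained parameters

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.encoder(images)
        return F.normalize(self.delta(feats), dim=-1)   # I_hat

def clip_loss(img_feats, txt_feats, targets, tau: float = 0.07):
    """Contrastive L^P: each image (InD or synthesized OOD) is pulled toward
    its paired label among all M' labels (InD + negative)."""
    logits = img_feats @ txt_feats.t() / tau            # sim(I_hat_i, T_j)/tau
    return F.cross_entropy(logits, targets)
```

Here `targets` would hold the index of each image's paired label: the ground-truth class for InD images, the matched negative label for synthesized OOD images.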

Step 3: Fine-Tuning Negative Label Features

  • Negative label features (CLIP text encoder outputs) associated with synthesized OOD images are made learnable (see the sketch after this list)
  • This reduces the semantic gap between InD and negative labels, improving image-text alignment
  • The image and text encoders are fine-tuned separately to maintain training stability
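A minimal sketch of Step 3, assuming the negative-label features are simply lifted into trainable parameters (names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableNegLabels(nn.Module):
    """Negative-label features initialized from CLIP text-encoder outputs
    and then optimized directly; the text encoder itself is never updated."""
    def __init__(self, neg_text_feats: torch.Tensor):
        super().__init__()
        # (num_neg_labels, dim), e.g. encodings of "a photo of a <neg label>"
        self.feats = nn.Parameter(neg_text_feats.clone())

    def forward(self) -> torch.Tensor:
        return F.normalize(self.feats, dim=-1)
```

In keeping with the stability note above, these parameters would be tuned in a separate phase from the image-side projection rather than jointly.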

Key Experimental Results

OOD Detection on the ImageNet Benchmark

| Method | iNat AUROC↑ | SUN AUROC↑ | Places AUROC↑ | Texture AUROC↑ | Avg AUROC↑ | Avg FPR95↓ |
|---|---|---|---|---|---|---|
| MSP | 87.44 | 79.73 | 79.67 | 79.69 | 81.63 | 69.61 |
| Energy | 95.33 | 92.66 | 91.41 | 86.76 | 91.54 | 39.89 |
| ReAct | 96.22 | 94.20 | 91.58 | 89.80 | 92.95 | 31.43 |
| NegLabel (CLIP) | – | – | – | – | ~95 | ~20 |
| SynOOD | Best | Best | Best | Best | SOTA | SOTA |

Ablation Study

| Component | AUROC↑ | FPR95↓ |
|---|---|---|
| Negative labels only (NegLabel baseline) | baseline | baseline |
| + Image encoder fine-tuning | +1.5% | −6.2% |
| + Negative label feature fine-tuning | +0.8% | −3.5% |
| + Both jointly (SynOOD) | +2.80% | −11.13% |

Key Findings

  • SynOOD improves AUROC by 2.80% and reduces FPR95 by 11.13% over NegLabel
  • OOD gradient guidance is critical for near-boundary generation — without it, synthesized samples remain far from the boundary, limiting fine-tuning effectiveness
  • Fine-tuning the image and text encoders separately yields more stable training than joint fine-tuning
  • Parameter overhead and runtime cost are minimal (only one additional projection layer)
  • MLLM-extracted contextual elements ensure that generated images retain InD visual style while differing semantically

Highlights & Insights

  1. Gradient-Guided Synthesis: The paper is the first to backpropagate OOD score gradients into the diffusion noise space, enabling precise control over the InD/OOD proximity of synthesized samples.
  2. MLLM-Driven Context: MLLMs are leveraged to understand image semantics and automatically extract appropriate replacement elements.
  3. Minimal Architectural Modification: Only a projection layer is added; the CLIP backbone remains frozen.

Limitations & Future Work

  • The generation pipeline requires three models (MLLM, diffusion model, and OOD detector), resulting in relatively high offline generation cost
  • Skip Gradient is a gradient approximation and may not perfectly optimize the noise
  • The quality of contextual elements depends on the MLLM's comprehension capability
Related Work

  • CLIP-based OOD: MCM, NegLabel, CLIPN, LSN
  • Synthetic OOD methods: NPOS, DreamOOD, VOS
  • Classical OOD: MSP, ODIN, Energy, KNN

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Gradient-guided diffusion synthesis for near-boundary OOD is an original contribution
  • Technical Depth: ⭐⭐⭐⭐ — The three-step pipeline is logically coherent
  • Experimental Thoroughness: ⭐⭐⭐⭐ — SOTA results on large-scale ImageNet benchmarks
  • Practical Value: ⭐⭐⭐⭐ — Low parameter and computational overhead; plug-and-play design