
Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection

Conference: ICCV 2025
arXiv: 2507.10225
Code: https://github.com/Jarvisgivemeasuit/SynOOD
Area: Image Generation
Keywords: Out-of-distribution detection, near-boundary sample synthesis, CLIP fine-tuning, diffusion model generation, negative labels

TL;DR

This paper proposes SynOOD, which synthesizes challenging near-boundary OOD samples by combining MLLM-based contextual semantic extraction, iterative diffusion inpainting, and OOD gradient guidance. The synthesized samples are used to fine-tune the CLIP image encoder and negative label features, achieving a 2.80% AUROC improvement and 11.13% FPR95 reduction on the ImageNet benchmark.

Background & Motivation

Challenges in OOD Detection: Deep networks deployed in open-world settings inevitably encounter OOD samples, making accurate identification critical. CLIP-based methods (e.g., NegLabel) have significantly advanced OOD detection by introducing negative labels, yet still struggle with hard samples near the InD/OOD boundary.

Limitations of CLIP: In CLIP's feature space, image embeddings are packed far more densely than label embeddings, so near-boundary OOD samples readily align with InD labels and CLIP cannot establish a clear semantic boundary between the two.

Prior Approaches:

  • Unimodal methods (MSP, Energy, KNN) rely solely on visual information
  • Multimodal methods (MCM, CLIPN, NegLabel) leverage text and vision but lack challenging training data
  • NPOS and DreamOOD generate OOD data but with limited quality or diversity

Core Idea: Fine-tune CLIP using high-quality, near-boundary OOD samples generated by foundation models (MLLM + diffusion model).

Method

Overall Architecture (Three Steps)

Step 1: Near-Boundary OOD Image Generation

Contextual Semantic Extraction: An MLLM \(\phi\) analyzes InD images to extract all contextual elements excluding the main subject (e.g., "bamboo," "tourists," and "railings" from a "panda" image):

\[p^{con} = \phi(x^{in}, p^{in})\]
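A minimal sketch of this extraction step; `query_mllm` is a hypothetical wrapper around whatever MLLM backend is used (the paper's pipeline is not tied to a specific model or API, and the prompt wording here is ours):

```python
def extract_context(query_mllm, image, ind_label: str) -> list:
    """Ask the MLLM for contextual elements, excluding the InD subject."""
    prompt = (
        f"List the contextual elements visible in this image, excluding "
        f"the main subject '{ind_label}'. Answer as a comma-separated list."
    )
    answer = query_mllm(image=image, prompt=prompt)  # p^con = phi(x^in, p^in)
    return [e.strip() for e in answer.split(",") if e.strip()]
```

For a "panda" photo this might return `["bamboo", "tourists", "railings"]`, which then seeds the inpainting prompt.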

Iterative Diffusion Generation: The InD image and contextual prompt are fed into an inpainting diffusion model, which progressively replaces the main subject with background elements:

\[z_T = \sqrt{\bar{\alpha}_T}z^{in} + \sqrt{1-\bar{\alpha}_T}\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
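A minimal PyTorch sketch of this forward-noising step, with our own variable names (`alpha_bar_T` is the cumulative schedule coefficient \(\bar{\alpha}_T\)):

```python
import torch

def noise_latent(z_in: torch.Tensor, alpha_bar_T: float, eps=None):
    """DDPM forward noising: z_T = sqrt(a_bar_T) z_in + sqrt(1 - a_bar_T) eps."""
    if eps is None:
        eps = torch.randn_like(z_in)  # eps ~ N(0, I)
    z_T = (alpha_bar_T ** 0.5) * z_in + ((1.0 - alpha_bar_T) ** 0.5) * eps
    return z_T, eps  # eps is returned because the OOD guidance later updates it
```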

OOD Gradient Guidance: The Energy Score is adopted as the loss function:

\[\mathcal{L}^O = m_{out} - \tau \cdot \log\sum_{i=1}^{C} e^{g_i(x^{syn})/\tau}\]

Gradients are approximated via Skip Gradient and used to update the initial noise \(\epsilon\):

\[\epsilon := \epsilon - r \cdot \tilde{\nabla}_\epsilon\mathcal{L}^O\]

After several iterations, the generated images are visually similar to InD samples but have OOD scores near the boundary threshold.
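Putting the loss and the noise update together, a rough sketch of the guidance loop under our reading: `denoise` runs the inpainting sampler to a clean latent, `score_logits` is the detector \(g\), and the Skip Gradient is approximated by treating the sampler's Jacobian with respect to the noise as identity (the paper's exact approximation may differ):

```python
import torch

def energy_loss(logits: torch.Tensor, m_out: float, tau: float) -> torch.Tensor:
    """L^O = m_out - tau * logsumexp(g(x_syn) / tau): minimizing it drives
    the sample's energy score toward the margin m_out near the boundary."""
    return m_out - tau * torch.logsumexp(logits / tau, dim=-1)

def guide_noise(eps, z_in, alpha_bar_T, denoise, score_logits,
                m_out, tau=1.0, r=0.1, n_iters=5):
    for _ in range(n_iters):
        z_T = (alpha_bar_T ** 0.5) * z_in + ((1 - alpha_bar_T) ** 0.5) * eps
        with torch.no_grad():
            z0 = denoise(z_T)                    # full sampler, no backprop
        z0 = z0.detach().requires_grad_(True)
        loss = energy_loss(score_logits(z0), m_out, tau).sum()
        grad = torch.autograd.grad(loss, z0)[0]
        # Skip-gradient-style shortcut: treat d z0 / d eps as identity, so
        # the gradient at z0 stands in for the gradient w.r.t. eps.
        eps = eps - r * grad
    return eps
```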

Step 2: Fine-Tuning the CLIP Image Encoder

  • The CLIP image encoder \(F\) is frozen; only a projection layer \(\delta\) is trained
  • Synthesized OOD images are paired with corresponding negative labels and mixed with InD data for training
  • The standard CLIP contrastive loss is used (a minimal sketch follows this list):
\[\mathcal{L}^P = -\frac{1}{2m}\sum_{i=1}^{2m}\log\frac{\exp(sim(\hat{I}_i, T_i)/\tau)}{\sum_{j=1}^{M'}\exp(sim(\hat{I}_i, T_j)/\tau)}\]
  • InD image selection: images are ranked by JPEG complexity, and the highest-complexity images per class are selected
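A minimal sketch of Step 2 under our naming; the feature dimension of 512 is an assumption (typical for ViT-B-style CLIP encoders), and only `delta` receives gradients:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectedEncoder(nn.Module):
    """Frozen CLIP image encoder F followed by a trainable projection delta."""
    def __init__(self, clip_image_encoder: nn.Module, dim: int = 512):
        super().__init__()
        self.encoder = clip_image_encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)           # backbone stays frozen
        self.delta = nn.Linear(dim, dim)      # the only trained parameters

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.encoder(images)
        return F.normalize(self.delta(feats), dim=-1)   # I_hat

def clip_loss(img_feats, txt_feats, targets, tau: float = 0.07):
    """Contrastive L^P: each image (InD or synthesized OOD) is pulled toward
    its paired label among all M' labels (InD + negative)."""
    logits = img_feats @ txt_feats.t() / tau            # sim(I_hat_i, T_j)/tau
    return F.cross_entropy(logits, targets)
```

Here `targets` would hold the index of each image's paired label: the ground-truth class for InD images, the matched negative label for synthesized OOD images.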

Step 3: Fine-Tuning Negative Label Features

  • Negative label features (CLIP text encoder outputs) associated with synthesized OOD images are made learnable (see the sketch after this list)
  • This reduces the semantic gap between InD and negative labels, improving image-text alignment
  • The image and text encoders are fine-tuned separately to maintain training stability
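A minimal sketch of Step 3, assuming the negative-label features are simply lifted into trainable parameters (names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableNegLabels(nn.Module):
    """Negative-label features initialized from CLIP text-encoder outputs
    and then optimized directly; the text encoder itself is never updated."""
    def __init__(self, neg_text_feats: torch.Tensor):
        super().__init__()
        # (num_neg_labels, dim), e.g. encodings of "a photo of a <neg label>"
        self.feats = nn.Parameter(neg_text_feats.clone())

    def forward(self) -> torch.Tensor:
        return F.normalize(self.feats, dim=-1)
```

In keeping with the stability note above, these parameters would be tuned in a separate phase from the image-side projection rather than jointly.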

Key Experimental Results

OOD Detection on the ImageNet Benchmark

| Method | iNat AUROC↑ | SUN AUROC↑ | Places AUROC↑ | Texture AUROC↑ | Avg AUROC↑ | Avg FPR95↓ |
|---|---|---|---|---|---|---|
| MSP | 87.44 | 79.73 | 79.67 | 79.69 | 81.63 | 69.61 |
| Energy | 95.33 | 92.66 | 91.41 | 86.76 | 91.54 | 39.89 |
| ReAct | 96.22 | 94.20 | 91.58 | 89.80 | 92.95 | 31.43 |
| NegLabel (CLIP) | – | – | – | – | ~95 | ~20 |
| SynOOD | Best | Best | Best | Best | SOTA | SOTA |

Ablation Study

| Component | AUROC↑ | FPR95↓ |
|---|---|---|
| Negative labels only (NegLabel baseline) | baseline | baseline |
| + Image encoder fine-tuning | +1.5% | −6.2% |
| + Negative label feature fine-tuning | +0.8% | −3.5% |
| + Both jointly (SynOOD) | +2.80% | −11.13% |

Key Findings

  • SynOOD improves AUROC by 2.80% and reduces FPR95 by 11.13% over NegLabel
  • OOD gradient guidance is critical for near-boundary generation — without it, synthesized samples remain far from the boundary, limiting fine-tuning effectiveness
  • Fine-tuning the image and text encoders separately yields more stable training than joint fine-tuning
  • Parameter overhead and runtime cost are minimal (only one additional projection layer)
  • MLLM-extracted contextual elements ensure that generated images retain InD visual style while differing semantically

Highlights & Insights

  1. Gradient-Guided Synthesis: The paper is the first to backpropagate OOD score gradients into the diffusion noise space, enabling precise control over the InD/OOD proximity of synthesized samples.
  2. MLLM-Driven Context: MLLMs are leveraged to understand image semantics and automatically extract appropriate replacement elements.
  3. Minimal Architectural Modification: Only a projection layer is added; the CLIP backbone remains frozen.

Limitations & Future Work

  • The generation pipeline requires three models (MLLM, diffusion model, and OOD detector), resulting in relatively high offline generation cost
  • Skip Gradient is a gradient approximation and may not perfectly optimize the noise
  • The quality of contextual elements depends on the MLLM's comprehension capability
Related Work

  • CLIP-based OOD: MCM, NegLabel, CLIPN, LSN
  • Synthetic OOD methods: NPOS, DreamOOD, VOS
  • Classical OOD: MSP, ODIN, Energy, KNN

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Gradient-guided diffusion synthesis for near-boundary OOD is an original contribution
  • Technical Depth: ⭐⭐⭐⭐ — The three-step pipeline is logically coherent
  • Experimental Thoroughness: ⭐⭐⭐⭐ — SOTA results on large-scale ImageNet benchmarks
  • Practical Value: ⭐⭐⭐⭐ — Low parameter and computational overhead; plug-and-play design