Revisiting Logit Distributions for Reliable Out-of-Distribution Detection¶
Conference: NeurIPS 2025 arXiv: 2510.20134 Code: GitHub Area: Multimodal VLM / OOD Detection Keywords: OOD detection, logit distribution, CLIP, post-hoc method, scoring function
TL;DR¶
This paper proposes LogitGap, a novel post-hoc OOD detection scoring function that explicitly exploits the "gap" between the maximum logit and the remaining logits to distinguish in-distribution (ID) from out-of-distribution (OOD) samples. A top-N selection strategy is introduced to filter noisy logits. Theoretical analysis and experiments demonstrate that LogitGap outperforms MCM and MaxLogit across multiple scenarios.
Background & Motivation¶
OOD detection is a critical safety requirement for deploying deep learning models in open-world settings. Post-hoc methods have attracted broad interest due to their flexibility — they require no modification of model parameters. The central challenge is designing effective scoring functions that maximize the separability between ID and OOD samples.
Limitations of two representative scoring functions:
MaxLogit: \(S(x) = \max_k z_k\), uses only the maximum logit value and completely ignores information from the remaining logits.
MCM (Maximum Concept Matching): \(S(x) = \max_k \frac{e^{z_k/\tau}}{\sum_j e^{z_j/\tau}}\), implicitly incorporates all logits via softmax, but softmax compresses absolute logit information, and different logit patterns may map to similar probability distributions.
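For concreteness, minimal NumPy versions of these two baselines (our sketch, not the paper's code; `logits` is a batch of \(K\)-way logit rows):

```python
import numpy as np

def maxlogit_score(logits: np.ndarray) -> np.ndarray:
    """S(x) = max_k z_k: keeps only the top logit, discards the rest."""
    return logits.max(axis=1)

def mcm_score(logits: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """S(x) = max_k softmax(z / tau)_k: every logit enters, but only through
    the softmax normalizer, so absolute magnitudes are compressed away."""
    z = logits / tau
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.max(axis=1)
```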
Key observation: ID samples exhibit a peaked logit distribution (one dominant logit far exceeding the rest), while OOD samples exhibit a flatter distribution (the maximum logit is less prominent, non-maximum logits are relatively higher). This results in a noticeably larger "logit gap" for ID samples — a natural discriminative cue.
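A toy illustration (numbers ours, purely for intuition): for \(K = 4\), a peaked ID-like vector \(\boldsymbol{z}_{\text{ID}} = (10, 2, 2, 2)\) has mean gap \(\frac{1}{3}\big((10-2)+(10-2)+(10-2)\big) = 8\), while a flatter OOD-like vector \(\boldsymbol{z}_{\text{OOD}} = (5, 4, 4, 4)\) has mean gap \(1\). The softmax-compression problem is also visible here: MCM with \(\tau = 1\) maps both \((10, 2, 2, 2)\) and \((30, 2, 2, 2)\) to scores \(\approx 1\) (saturation), whereas their gaps (8 vs. 28) remain clearly distinguishable.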
Method¶
Overall Architecture¶
LogitGap is a purely post-hoc method that takes the logit vector output of a pretrained model (e.g., CLIP) as input, requiring no training or model modification. The core pipeline is: compute logits → sort in descending order → compute the mean gap among top-N logits → use as the OOD score.
Key Designs¶
- LogitGap Scoring Function (see the sketch after this list)
  - The logit vector \(\boldsymbol{z}\) is sorted in descending order as \(\boldsymbol{z}'\); the score is the average difference between the maximum logit and all remaining logits: \(S_{\text{LogitGap}}(x;f) = \frac{1}{K-1}\sum_{j=2}^{K}(z'_1 - z'_j)\)
  - Equivalent form: \(S = z'_1 - \bar{z}'_K\), i.e., the maximum logit minus the mean of the remaining \(K-1\) logits.
  - ID samples receive higher scores (peaked distribution, large gap); OOD samples receive lower scores (flat distribution, small gap).
  - Key distinction from MCM: MCM incorporates non-maximum logits only implicitly through the softmax denominator (losing absolute-value information), whereas LogitGap quantifies the gap explicitly (preserving it in full).
- LogitGap-topN Refinement
  - Problem: in \(K\)-way classification, many tail classes are semantically irrelevant to the input; their logits contribute little to ID/OOD discrimination and mainly introduce noise.
  - Solution: compute the gap using only the top-\(N\) logits: \(S_{\text{topN}} = \frac{1}{N-1}\sum_{j=2}^{N}(z'_1 - z'_j)\)
  - \(N\) selection: the optimal \(N\) maximizes the mean score difference between ID and OOD samples; since \(z'_1\) does not depend on \(N\), this reduces to \(\arg\max_{N}(\mathbb{E}_{\text{OOD}}[\bar{z}'_N] - \mathbb{E}_{\text{ID}}[\bar{z}'_N])\).
  - Training-free strategy: only a small ID validation set (≤100 samples) is required; proxy OOD data is synthesized from it via interpolation transforms and noise injection (see the selection sketch after this list).
- Theoretical Guarantee (Theorem 4.1)
  - When the temperature parameter satisfies \(\tau > 2(K-1)\), the false positive rate (FPR) of LogitGap is provably no greater than that of MCM.
  - Key insights:
    - MCM suffers severe information loss at high temperatures (probability mass becomes overly dispersed).
    - LogitGap operates on raw logit gaps and is insensitive to the temperature parameter.
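A minimal NumPy sketch of the score and the training-free \(N\) selection, written directly from the formulas above (function names are ours; the paper's released code may differ):

```python
import numpy as np

def logitgap_score(logits: np.ndarray, n=None) -> np.ndarray:
    """Mean gap between the top logit and the next n-1 logits, per row.

    logits: (batch, K) array. n=None uses all K logits (plain LogitGap);
    an integer n gives LogitGap-topN. Higher score = more ID-like.
    """
    z = np.sort(logits, axis=1)[:, ::-1]         # each row sorted descending: z'
    n = z.shape[1] if n is None else n
    return (z[:, :1] - z[:, 1:n]).mean(axis=1)   # mean over j of (z'_1 - z'_j), j = 2..n

def select_n(id_logits: np.ndarray, pseudo_ood_logits: np.ndarray) -> int:
    """Pick N maximizing E_OOD[zbar'_N] - E_ID[zbar'_N].

    id_logits come from the small ID validation set; pseudo_ood_logits would
    come from interpolated / noise-injected ID images (taken as given here).
    """
    z_id = np.sort(id_logits, axis=1)[:, ::-1]
    z_ood = np.sort(pseudo_ood_logits, axis=1)[:, ::-1]
    K = id_logits.shape[1]
    objectives = [
        (z_ood[:, 1:n].mean() - z_id[:, 1:n].mean(), n) for n in range(2, K + 1)
    ]
    return max(objectives)[1]
```

With these helpers, plain LogitGap is `logitgap_score(logits)`, and LogitGap-topN is `logitgap_score(logits, n=select_n(id_val_logits, pseudo_ood_logits))`.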
Composability with Other Methods¶
LogitGap can serve as a plug-in replacement for the scoring function in existing methods, and can be combined with few-shot approaches such as CoOp and ID-Like for additional gains.
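As a usage sketch of the plug-in idea, here is zero-shot CLIP scoring with the `logitgap_score` helper above standing in for MCM. This assumes the Hugging Face `transformers` CLIP API; the label set and image path are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

id_classes = ["goldfish", "tabby cat", "pizza"]         # stand-in ID label set
prompts = [f"a photo of a {c}" for c in id_classes]

image = Image.open("query.jpg")                          # placeholder input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image            # (1, K) scaled cosine similarities

ood_score = logitgap_score(logits.numpy())               # drop-in for the MCM softmax score
```

The same one-line substitution applies in the few-shot combinations: CoOp or ID-Like produce their own logits, and LogitGap simply replaces the final scoring step.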
Key Experimental Results¶
Main Results (CLIP ViT-B/16, ImageNet as ID, Zero-shot)¶
| Method | NINCO FPR95↓ | ImageNet-O FPR95↓ | ImageNetOOD FPR95↓ | Avg. FPR95↓ | Avg. AUROC↑ |
|---|---|---|---|---|---|
| MCM | 79.67 | 75.85 | 80.98 | 78.83 | 77.15 |
| MaxLogit | 79.41 | 77.15 | 75.85 | 77.47 | 76.96 |
| GL-MCM | 74.38 | 72.35 | 79.16 | 75.30 | 74.74 |
| LogitGap | 76.83 | 72.35 | 76.37 | 75.18 | 79.23 |
| LogitGap* | 77.42 | 71.95 | 75.40 | 74.92 | 79.41 |

(LogitGap\* applies the top-N refinement with adaptively selected \(N\).)
Ablation Study (Few-shot Setting, Combined with Other Methods)¶
| Method | Avg. FPR95↓ | Avg. AUROC↑ |
|---|---|---|
| CoOp (1-shot) | 80.60 | 74.78 |
| CoOp + LogitGap* (1-shot) | 78.67 | 77.02 |
| ID-Like (1-shot) | 79.07 | 71.40 |
| ID-Like + LogitGap* (1-shot) | 71.68 | 78.48 |
| CoOp (4-shot) | 79.09 | 76.17 |
| CoOp + LogitGap* (4-shot) | 76.49 | 78.41 |
Key Findings¶
- LogitGap achieves state-of-the-art performance in both zero-shot and few-shot settings: relative to MCM, average FPR95 drops by 3.65 percentage points on ImageNet and 5.78 points on ImageNet-100.
- Strong complementarity with few-shot methods: ID-Like + LogitGap* reduces FPR95 from 79.07 to 71.68 and improves AUROC from 71.40 to 78.48.
- Applicable to conventionally trained models: not limited to CLIP; effective on ResNet as well.
- top-N selection consistently improves performance: LogitGap* with adaptively chosen \(N\) generally outperforms LogitGap variants that fix \(N = 0.2K\).
Highlights & Insights¶
- Extreme simplicity: the method requires no training, no additional data, and no model modification — the scoring function reduces to a single formula.
- Solid theoretical foundation: Theorem 4.1 establishes a rigorous FPR upper bound relationship between LogitGap and MCM, going beyond purely empirical justification.
- Strong composability: can serve as a drop-in replacement for any logit-based OOD scoring function.
- Deep insight: the difference in logit distribution shape between ID and OOD samples (peaked vs. flat) is an underexploited discriminative cue that LogitGap makes explicit.
- Principled hyperparameter selection: the choice of \(N\) is formulated as a mean-gap maximization problem, avoiding blind search.
Limitations & Future Work¶
- Near-semantic OOD detection remains challenging: when OOD data is semantically close to ID data (e.g., ImageNet-10 vs. ImageNet-20), the improvement from LogitGap diminishes.
- Adaptive \(N\) selection relies on OOD simulation: the assumption of approximating OOD data via interpolation and noise injection may not generalize to all scenarios.
- Restricted to logit-based models: a complete logit vector output is required, rendering the method inapplicable to certain black-box APIs.
- Combination with feature-based methods unexplored: approaches such as Mahalanobis distance and KNN may be complementary to LogitGap.
- Potential for nonlinear logit transformations: the current score is a linear function of the sorted logits; whether nonlinear transformations of the logit distribution could extract even more signal remains an open question.
Related Work & Insights¶
- vs. MCM: MCM implicitly leverages logit information via softmax but loses absolute values; LogitGap explicitly utilizes the gap, preserving complete information.
- vs. MaxLogit: MaxLogit considers only the maximum value; LogitGap exploits the shape of the entire logit distribution.
- vs. GL-MCM: GL-MCM introduces local feature cues; LogitGap operates purely at the logit level and is more lightweight.
- vs. Energy: the energy score also aggregates all logits, but through log-sum-exp; LogitGap's pairwise gaps to the maximum yield better performance in the paper's comparisons.
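For reference, the energy score this comparison refers to (the standard formulation from Liu et al., 2020, not code from this paper) aggregates logits as follows:

```python
import numpy as np

def energy_score(logits: np.ndarray, t: float = 1.0) -> np.ndarray:
    """Negative free energy, T * logsumexp(z / T); higher = more ID-like."""
    z = logits / t
    m = z.max(axis=1)                                    # row max, for stability
    return t * (m + np.log(np.exp(z - m[:, None]).sum(axis=1)))
```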
Rating¶
- Novelty: ⭐⭐⭐⭐ — Explicit exploitation of logit gaps offers a fresh perspective, though the core formula is remarkably simple (which is also a strength).
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers zero-shot, few-shot, and conventionally trained settings with diverse ID/OOD combinations and comprehensive ablation experiments.
- Writing Quality: ⭐⭐⭐⭐⭐ — Theoretical analysis is clear and rigorous; figures are intuitive (the score distribution comparison in Fig. 1 is particularly convincing).
- Value: ⭐⭐⭐⭐ — High practical value (zero-cost replacement), though the theoretical contribution is relatively incremental.