Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models
Conference: AAAI 2026 arXiv: 2601.08476 Code: Not released Area: Multimodal VLM / OOD Detection Keywords: OOD Detection, VLM, Cross-modal Proxy Co-evolution, Zero-shot, Test-time Adaptation, CLIP, Negative Labels
TL;DR
This paper proposes CoEvo, a training-free, annotation-free test-time framework that dynamically updates positive and negative proxy caches through a bidirectional, sample-conditioned co-evolution of text and visual proxies. On ImageNet-1K, CoEvo improves AUROC by 1.33% and reduces FPR95 from 18.92% to 10.22% (a 45.98% relative reduction) over the strongest negative-label baseline, achieving state-of-the-art zero-shot OOD detection.
Background & Motivation
Vision-language models (e.g., CLIP) deployed in open-world settings require reliable OOD detection to reject unseen categories. Zero-shot OOD detection requires no annotated negative samples. The dominant "negative label" paradigm selects text labels semantically unrelated to in-distribution (ID) classes from lexicons such as WordNet as OOD proxies, determining ID/OOD membership by comparing the similarity of a test image to positive and negative labels.
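To make the scoring rule concrete, here is a minimal NumPy sketch of the negative-label paradigm (function and variable names are mine, not the paper's): the ID score is the softmax mass that falls on the positive labels.

```python
import numpy as np

def neglabel_score(image_feat, pos_text_feats, neg_text_feats, temperature=0.01):
    """Negative-label OOD score: share of similarity mass on ID labels.

    image_feat: (D,) L2-normalized CLIP image embedding.
    pos_text_feats: (K, D) embeddings of ID class prompts.
    neg_text_feats: (M, D) embeddings of mined negative labels.
    Higher score -> more likely in-distribution.
    """
    sims = np.concatenate([pos_text_feats @ image_feat,
                           neg_text_feats @ image_feat]) / temperature
    probs = np.exp(sims - sims.max())   # stable softmax over all labels
    probs /= probs.sum()
    return probs[:len(pos_text_feats)].sum()
```

An image embedding close to an ID prompt and far from all negatives scores near 1; one close to a negative label scores near 0, and the threshold between them decides ID/OOD membership.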
However, static negative label design suffers from two fundamental limitations:
- Unmodeled Negative Space: A globally fixed negative label set can only sparsely sample the vast semantic space outside ID classes, leaving many negative semantics relevant to current test samples uncovered.
- Modality Misalignment: Visual features drift under test-time distribution shift while text negative labels remain unchanged, distorting the cross-modal similarity geometry and destabilizing decision thresholds.
AdaNeg partially addresses the first issue by adapting visual proxies, but text negative labels remain static — adaptation is unidirectional.
Core Problem
How can bidirectional, sample-conditioned dynamic adaptation of both text and visual proxies be achieved without training or annotation, so that zero-shot OOD detection remains reliable under test-time distribution shift?
Method
Overall Architecture
CoEvo maintains two modality-specific proxy caches at test time (a text cache and a visual cache), each containing positive and negative queues. The core is a Proxy-Aligned Co-Evolution mechanism: visual cues guide dynamic mining of text negative labels, and the updated text proxies in turn refine the visual decision boundary, forming a closed loop.
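The two caches can be pictured with a small data-structure sketch (capacities, field names, and the eviction policy shown here are illustrative assumptions, not the paper's implementation):

```python
from collections import deque
import numpy as np

class ProxyCache:
    """One modality-specific proxy cache with positive and negative queues.

    Positive proxies stay fixed (text) or grow with confident ID samples
    (visual); negative proxies evolve online as test samples arrive.
    """
    def __init__(self, pos_proxies, neg_capacity=1000):
        self.pos = np.asarray(pos_proxies)     # fixed ID semantic anchors
        self.neg = deque(maxlen=neg_capacity)  # evolving negative proxies

    def enqueue_negative(self, feat):
        """Admit a high-confidence OOD feature; oldest entry is evicted."""
        self.neg.append(np.asarray(feat))

    def similarity(self, image_feat):
        """Max cosine similarity to positive and negative proxies."""
        pos_sim = float(np.max(self.pos @ image_feat))
        neg_sim = float(np.max(np.stack(self.neg) @ image_feat)) if self.neg else -1.0
        return pos_sim, neg_sim
```

CoEvo would keep one such cache per modality; the co-evolution loop then couples their negative queues.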
Key Designs
- Text Proxy Cache:
  - Positive proxy queue \(\mathbf{T}_p\): CLIP text encodings of ID class names, kept fixed to preserve ID semantic anchors.
  - Negative proxy queue \(\mathbf{T}_n\): initialized with \(M\) negative class encodings selected from large-scale lexicons (WordNet/CSP), dynamically evolved during inference.
- Visual Proxy Cache:
  - Positive proxy queue \(\mathbf{V}_p \in \mathbb{R}^{K \times L \times D}\): stores \(L\) visual instances per ID class, initialized from the corresponding text encodings, with high-confidence ID samples incrementally enqueued.
  - Negative proxy queue \(\mathbf{V}_n\): high-confidence OOD samples are enqueued; a priority-queue strategy evicts low-confidence or stale samples.
- Proxy-Aligned Co-Evolution Mechanism (core contribution):
  - Text proxy evolution: gated by a confidence-margin mechanism (based on adaptive threshold \(\delta\) and margin \(\gamma\)). For a high-confidence sample:
    - Predicted OOD → retrieve text negative labels semantically close to the test sample (Near Negatives), tightening the local open-set boundary.
    - Predicted ID → retrieve text negative labels semantically far from the test sample (Far Negatives), expanding negative-space coverage.
    - A deduplication constraint prevents redundant enqueuing.
  - Visual proxy evolution: after each text proxy update, the visual negative proxy queue expands to accommodate newly exposed OOD semantics; an entropy-based confidence measure decides whether to replace existing proxies, so the queue always stores the highest-confidence samples.
- OOD Score Evolution:
  - Pre-evolution score (used for proxy updating): \(\mathcal{S}^{\text{pre}} = \lambda \mathcal{S}_T^{\text{pre}} + (1-\lambda) \mathcal{S}_V^{\text{pre}}\), with a higher text weight (\(\lambda = 0.8\)), since visual proxies are sparse and unreliable at cold start.
  - Post-evolution score (used for the final decision): \(\mathcal{S}^{\text{post}} = (1-\lambda) \mathcal{S}_T^{\text{post}} + \lambda \mathcal{S}_V^{\text{post}}\), with the weights flipped: after evolution, visual proxies have accumulated instance-level information and become more discriminative.
  - This weight-flipping strategy is a key design choice: at cold start, trust semantic priors; after sufficient evolution, trust visual evidence.
- Adaptive Threshold: a data-driven threshold estimate (minimizing the intra-class variance of ID/OOD scores) is adopted, providing greater robustness than a fixed threshold.
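The weight-flipping score fusion is simple enough to write down directly; \(\lambda = 0.8\) is the paper's value, while the function interface is my own sketch:

```python
def fuse_scores(s_text, s_visual, lam=0.8, evolved=False):
    """CoEvo-style weight flipping between modalities.

    Before evolution (cold start), trust the text prior:
        S_pre  = lam * S_T + (1 - lam) * S_V
    After evolution, flip the weights so visual evidence dominates:
        S_post = (1 - lam) * S_T + lam * S_V
    """
    if not evolved:
        return lam * s_text + (1 - lam) * s_visual
    return (1 - lam) * s_text + lam * s_visual
```

With `lam=0.8`, a text score of 1.0 and a visual score of 0.0 fuse to 0.8 before evolution and only 0.2 after it, which is exactly the cold-start/steady-state trade-off the design describes.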
Loss & Training
Completely training-free and annotation-free. The CLIP backbone parameters are not modified; only the proxy caches are evolved online at test time.
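As a hedged illustration of the adaptive threshold used at test time: minimizing intra-class variance of a 1-D score distribution is essentially Otsu's method, so under that assumption (the paper's exact estimator may differ) a sketch could look like:

```python
import numpy as np

def adaptive_threshold(scores, n_bins=100):
    """Pick the threshold minimizing intra-class score variance (Otsu-style).

    scores: 1-D array of OOD scores observed so far at test time.
    Returns the candidate split with the lowest weighted within-class
    variance, i.e. the best two-cluster partition of the score stream.
    """
    scores = np.asarray(scores, dtype=float)
    candidates = np.linspace(scores.min(), scores.max(), n_bins)[1:-1]
    best_t, best_var = candidates[0], np.inf
    for t in candidates:
        lo, hi = scores[scores < t], scores[scores >= t]
        if len(lo) == 0 or len(hi) == 0:
            continue
        w = len(lo) / len(scores)
        intra = w * lo.var() + (1 - w) * hi.var()  # weighted within-class variance
        if intra < best_var:
            best_t, best_var = t, intra
    return best_t
```

On a bimodal score stream the minimizer lands in the gap between the ID and OOD clusters, which is why a data-driven estimate tracks distribution shift better than a fixed cutoff.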
Key Experimental Results
Main Results on ImageNet-1K (CLIP ViT-B/16)
| Method | iNaturalist FPR95↓ | SUN FPR95↓ | Places FPR95↓ | Textures FPR95↓ | Avg. FPR95↓ | Avg. AUROC↑ |
|---|---|---|---|---|---|---|
| NegLabel | 1.91 | 20.53 | 35.59 | 43.56 | 25.40 | 94.21 |
| CSP | 1.54 | 13.66 | 29.32 | 25.52 | 17.51 | 95.76 |
| AdaNeg | 0.59 | 9.50 | 34.34 | 31.27 | 18.92 | 96.66 |
| CoEvo-NegLabel | 0.53 | 4.42 | 23.51 | 12.42 | 10.22 | 97.95 |
| CoEvo-CSP | 0.46 | 4.68 | 25.83 | 12.78 | 10.94 | 97.85 |
- FPR95 drops from 18.92% (AdaNeg) to 10.22%, a relative reduction of 45.98%.
- The largest improvement is on the Textures dataset (31.27% → 12.42%), which exhibits the most severe distribution shift.
OpenOOD Benchmark (Near-OOD + Far-OOD)
| Method | Near-OOD FPR95↓ | Far-OOD FPR95↓ | ID ACC↑ |
|---|---|---|---|
| AdaNeg | 67.51 | 17.31 | 67.13 |
| CoEvo-NegLabel | 64.64 | 15.24 | 66.83 |
| CoEvo-CSP | 66.88 | 14.47 | 67.36 |
- Substantial gains are observed in the Far-OOD setting; improvements are more modest for Near-OOD.
Ablation Study
| Proxy Evolution Configuration | Avg. FPR95↓ | Avg. AUROC↑ |
|---|---|---|
| No evolution (NegLabel) | 24.97 | 94.56 |
| Text evolution only | 21.77 | 95.38 |
| Visual evolution only | 17.41 | 96.99 |
| Bidirectional co-evolution | 10.22 | 97.95 |
- Both text and visual evolution contribute independently; bidirectional combination yields super-additive gains.
- The standalone contribution of visual evolution (−7.56 FPR95) exceeds that of text evolution (−3.20 FPR95).
Robustness to Data Imbalance (FPR95↓, ImageNet as ID vs. SUN as OOD)
| ID:OOD Ratio | NegLabel | AdaNeg | CoEvo-NegLabel |
|---|---|---|---|
| 1:100 | 23.00 | 28.00 | 17.00 |
| 1:1 | 21.55 | 8.01 | 5.27 |
| 100:1 | 19.69 | 17.40 | 14.77 |
- CoEvo achieves the best performance across all ratios and remains effective under extreme imbalance (100:1, only 100 OOD samples).
Highlights & Insights
- Bidirectional co-evolution is the core innovation: The closed-loop design, in which text proxies guide visual adaptation and visual feedback refines the text proxies, addresses the fundamental limitation of unidirectional adaptation in prior methods.
- The pre/post weight-flipping strategy is elegant: Trusting text priors at cold start and visual evidence after sufficient evolution allows adaptive modulation of cross-modal contributions.
- Fully training- and annotation-free: CoEvo is plug-and-play with any CLIP model and requires neither additional training nor OOD annotation.
- A 45.98% relative reduction in FPR95 is practically significant: fewer OOD inputs mistaken for ID translates directly into safer real-world deployment.
Limitations & Future Work
- Limited Near-OOD improvement: Gains are less pronounced in fine-grained OOD settings such as SSB-hard compared to Far-OOD scenarios.
- Test-time computational overhead: Each sample requires Top-N text candidate retrieval and dual-modal cache updates, introducing higher inference latency than static methods.
- Dependence on initial negative label quality: If the WordNet/CSP initial negative labels offer insufficient coverage, the evolution starts from a poor initialization.
- Order sensitivity: Online evolution results may be affected by the order of test samples, a concern not thoroughly discussed in the paper.
- Validated only on CLIP ViT-B/16: Performance on larger-scale VLMs (e.g., ViT-L, SigLIP) remains unknown.
Related Work & Insights
- vs. NegLabel/CSP (static negative labels): CoEvo addresses the sparsity and misalignment issues of static methods through dynamic proxy evolution, reducing FPR95 by approximately 60%.
- vs. AdaNeg (unidirectional adaptation): AdaNeg adapts only visual proxies while text labels remain static; CoEvo achieves bidirectional adaptation with substantial additional gains (FPR95 from 18.92% to 10.22%).
- vs. MCM (maximum softmax): MCM does not use negative labels and relies solely on ID class similarity, yielding performance well below negative-label methods.
- vs. training-based methods (LoCoOp/LAPT): CoEvo as a training-free method surpasses training-based methods requiring prompt tuning.
The co-evolution paradigm generalizes naturally to other test-time adaptation scenarios for VLMs, including domain adaptation, continual learning, and open-set recognition. The weight-flipping strategy — trusting priors at cold start and data after evolution — represents a general-purpose design pattern for online learning. The proxy cache with entropy-based online updates can also be adapted for retrieval-augmented systems.
Rating
- Novelty: ⭐⭐⭐⭐ — Bidirectional proxy co-evolution mechanism is novel; the weight-flipping design is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — ImageNet-1K, OpenOOD, ablations, hyperparameter sensitivity, and imbalance analysis are all thoroughly covered.
- Writing Quality: ⭐⭐⭐⭐ — Problem analysis is thorough, mathematical derivations are clear, and algorithm pseudocode is complete.
- Value: ⭐⭐⭐⭐ — Direct value for safe VLM deployment; training-free nature facilitates practical adoption.