Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models
Conference: AAAI 2026 arXiv: 2601.08476 Code: Not released Area: Multimodal VLM / OOD Detection Keywords: OOD Detection, VLM, Cross-modal Proxy Co-evolution, Zero-shot, Test-time Adaptation, CLIP, Negative Labels
TL;DR
This paper proposes CoEvo, a training-free, annotation-free test-time framework that dynamically updates positive and negative proxy caches through a bidirectional, sample-conditioned co-evolution of text and visual proxies. On ImageNet-1K, CoEvo improves AUROC by 1.33% and reduces FPR95 from 18.92% to 10.22% (a 45.98% relative reduction) over the strongest negative-label baseline, achieving state-of-the-art zero-shot OOD detection.
Background & Motivation
Vision-language models (e.g., CLIP) deployed in open-world settings require reliable OOD detection to reject unseen categories. Zero-shot OOD detection requires no annotated negative samples. The dominant "negative label" paradigm selects text labels semantically unrelated to in-distribution (ID) classes from lexicons such as WordNet as OOD proxies, determining ID/OOD membership by comparing the similarity of a test image to positive and negative labels.
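To make the scoring rule concrete, here is a minimal NumPy sketch of the negative-label paradigm (function and variable names are mine, not the paper's): the ID score is the softmax mass that falls on the positive labels.

```python
import numpy as np

def neglabel_score(image_feat, pos_text_feats, neg_text_feats, temperature=0.01):
    """Negative-label OOD score: share of similarity mass on ID labels.

    image_feat: (D,) L2-normalized CLIP image embedding.
    pos_text_feats: (K, D) embeddings of ID class prompts.
    neg_text_feats: (M, D) embeddings of mined negative labels.
    Higher score -> more likely in-distribution.
    """
    sims = np.concatenate([pos_text_feats @ image_feat,
                           neg_text_feats @ image_feat]) / temperature
    probs = np.exp(sims - sims.max())   # stable softmax over all labels
    probs /= probs.sum()
    return probs[:len(pos_text_feats)].sum()
```

An image embedding close to an ID prompt and far from all negatives scores near 1; one close to a negative label scores near 0, and the threshold between them decides ID/OOD membership.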
However, static negative label design suffers from two fundamental limitations:
- Unmodeled Negative Space: A globally fixed negative label set can only sparsely sample the vast semantic space outside ID classes, leaving many negative semantics relevant to current test samples uncovered.
- Modality Misalignment: Visual features drift under test-time distribution shift while text negative labels remain unchanged, distorting the cross-modal similarity geometry and destabilizing decision thresholds.
AdaNeg partially addresses the first issue by adapting visual proxies, but text negative labels remain static — adaptation is unidirectional.
Core Problem
How can bidirectional, sample-conditioned dynamic adaptation of both text and visual proxies be achieved without training or annotation, so that zero-shot OOD detection remains reliable under test-time distribution shift?
Method
Overall Architecture
CoEvo maintains two modality-specific proxy caches at test time (a text cache and a visual cache), each containing positive and negative queues. The core is a Proxy-Aligned Co-Evolution mechanism: visual cues guide dynamic mining of text negative labels, and the updated text proxies in turn refine the visual decision boundary, forming a closed loop.
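The two caches can be pictured with a small data-structure sketch (capacities, field names, and the eviction policy shown here are illustrative assumptions, not the paper's implementation):

```python
from collections import deque
import numpy as np

class ProxyCache:
    """One modality-specific proxy cache with positive and negative queues.

    Positive proxies stay fixed (text) or grow with confident ID samples
    (visual); negative proxies evolve online as test samples arrive.
    """
    def __init__(self, pos_proxies, neg_capacity=1000):
        self.pos = np.asarray(pos_proxies)     # fixed ID semantic anchors
        self.neg = deque(maxlen=neg_capacity)  # evolving negative proxies

    def enqueue_negative(self, feat):
        """Admit a high-confidence OOD feature; oldest entry is evicted."""
        self.neg.append(np.asarray(feat))

    def similarity(self, image_feat):
        """Max cosine similarity to positive and negative proxies."""
        pos_sim = float(np.max(self.pos @ image_feat))
        neg_sim = float(np.max(np.stack(self.neg) @ image_feat)) if self.neg else -1.0
        return pos_sim, neg_sim
```

CoEvo would keep one such cache per modality; the co-evolution loop then couples their negative queues.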
Key Designs
- Text Proxy Cache:
  - Positive proxy queue \(\mathbf{T}_p\): CLIP text encodings of ID class names, kept fixed to preserve ID semantic anchors.
  - Negative proxy queue \(\mathbf{T}_n\): initialized with \(M\) negative class encodings selected from large-scale lexicons (WordNet/CSP), dynamically evolved during inference.
- Visual Proxy Cache:
  - Positive proxy queue \(\mathbf{V}_p \in \mathbb{R}^{K \times L \times D}\): stores \(L\) visual instances per ID class, initialized from the corresponding text encodings, with high-confidence ID samples incrementally enqueued.
  - Negative proxy queue \(\mathbf{V}_n\): high-confidence OOD samples are enqueued; a priority-queue strategy evicts low-confidence or stale samples.
- Proxy-Aligned Co-Evolution Mechanism (core contribution):
  - Text proxy evolution: gated by a confidence-margin mechanism (based on adaptive threshold \(\delta\) and margin \(\gamma\)). For a high-confidence sample:
    - Predicted OOD → retrieve text negative labels semantically close to the test sample (Near Negatives), tightening the local open-set boundary.
    - Predicted ID → retrieve text negative labels semantically far from the test sample (Far Negatives), expanding negative-space coverage.
    - A deduplication constraint prevents redundant enqueuing.
  - Visual proxy evolution: after each text proxy update, the visual negative proxy queue expands to accommodate newly exposed OOD semantics; an entropy-based confidence measure decides whether to replace existing proxies, so the queue always stores the highest-confidence samples.
- OOD Score Evolution:
  - Pre-evolution score (used for proxy updating): \(\mathcal{S}^{\text{pre}} = \lambda \mathcal{S}_T^{\text{pre}} + (1-\lambda) \mathcal{S}_V^{\text{pre}}\), with a higher text weight (\(\lambda = 0.8\)), since visual proxies are sparse and unreliable at cold start.
  - Post-evolution score (used for the final decision): \(\mathcal{S}^{\text{post}} = (1-\lambda) \mathcal{S}_T^{\text{post}} + \lambda \mathcal{S}_V^{\text{post}}\), with the weights flipped: after evolution, visual proxies have accumulated instance-level information and become more discriminative.
  - This weight-flipping strategy is a key design choice: at cold start, trust semantic priors; after sufficient evolution, trust visual evidence.
- Adaptive Threshold: a data-driven threshold estimate (minimizing the intra-class variance of ID/OOD scores) is adopted, providing greater robustness than a fixed threshold.
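The weight-flipping score fusion is simple enough to write down directly; \(\lambda = 0.8\) is the paper's value, while the function interface is my own sketch:

```python
def fuse_scores(s_text, s_visual, lam=0.8, evolved=False):
    """CoEvo-style weight flipping between modalities.

    Before evolution (cold start), trust the text prior:
        S_pre  = lam * S_T + (1 - lam) * S_V
    After evolution, flip the weights so visual evidence dominates:
        S_post = (1 - lam) * S_T + lam * S_V
    """
    if not evolved:
        return lam * s_text + (1 - lam) * s_visual
    return (1 - lam) * s_text + lam * s_visual
```

With `lam=0.8`, a text score of 1.0 and a visual score of 0.0 fuse to 0.8 before evolution and only 0.2 after it, which is exactly the cold-start/steady-state trade-off the design describes.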
Loss & Training
Completely training-free and annotation-free. The CLIP backbone parameters are not modified; only the proxy caches are evolved online at test time.
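As a hedged illustration of the adaptive threshold used at test time: minimizing intra-class variance of a 1-D score distribution is essentially Otsu's method, so under that assumption (the paper's exact estimator may differ) a sketch could look like:

```python
import numpy as np

def adaptive_threshold(scores, n_bins=100):
    """Pick the threshold minimizing intra-class score variance (Otsu-style).

    scores: 1-D array of OOD scores observed so far at test time.
    Returns the candidate split with the lowest weighted within-class
    variance, i.e. the best two-cluster partition of the score stream.
    """
    scores = np.asarray(scores, dtype=float)
    candidates = np.linspace(scores.min(), scores.max(), n_bins)[1:-1]
    best_t, best_var = candidates[0], np.inf
    for t in candidates:
        lo, hi = scores[scores < t], scores[scores >= t]
        if len(lo) == 0 or len(hi) == 0:
            continue
        w = len(lo) / len(scores)
        intra = w * lo.var() + (1 - w) * hi.var()  # weighted within-class variance
        if intra < best_var:
            best_t, best_var = t, intra
    return best_t
```

On a bimodal score stream the minimizer lands in the gap between the ID and OOD clusters, which is why a data-driven estimate tracks distribution shift better than a fixed cutoff.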
Key Experimental Results
Main Results on ImageNet-1K (CLIP ViT-B/16)
| Method | iNaturalist FPR95↓ | SUN FPR95↓ | Places FPR95↓ | Textures FPR95↓ | Avg. FPR95↓ | Avg. AUROC↑ |
|---|---|---|---|---|---|---|
| NegLabel | 1.91 | 20.53 | 35.59 | 43.56 | 25.40 | 94.21 |
| CSP | 1.54 | 13.66 | 29.32 | 25.52 | 17.51 | 95.76 |
| AdaNeg | 0.59 | 9.50 | 34.34 | 31.27 | 18.92 | 96.66 |
| CoEvo-NegLabel | 0.53 | 4.42 | 23.51 | 12.42 | 10.22 | 97.95 |
| CoEvo-CSP | 0.46 | 4.68 | 25.83 | 12.78 | 10.94 | 97.85 |
- FPR95 drops from 18.92% (AdaNeg) to 10.22%, a relative reduction of 45.98%.
- The largest improvement is on the Textures dataset (31.27% → 12.42%), which exhibits the most severe distribution shift.
OpenOOD Benchmark (Near-OOD + Far-OOD)
| Method | Near-OOD FPR95↓ | Far-OOD FPR95↓ | ID ACC↑ |
|---|---|---|---|
| AdaNeg | 67.51 | 17.31 | 67.13 |
| CoEvo-NegLabel | 64.64 | 15.24 | 66.83 |
| CoEvo-CSP | 66.88 | 14.47 | 67.36 |
- Substantial gains are observed in the Far-OOD setting; improvements are more modest for Near-OOD.
Ablation Study
| Proxy Evolution Configuration | Avg. FPR95↓ | Avg. AUROC↑ |
|---|---|---|
| No evolution (NegLabel) | 24.97 | 94.56 |
| Text evolution only | 21.77 | 95.38 |
| Visual evolution only | 17.41 | 96.99 |
| Bidirectional co-evolution | 10.22 | 97.95 |
- Both text and visual evolution contribute independently; bidirectional combination yields super-additive gains.
- The standalone contribution of visual evolution (−7.56 FPR95) exceeds that of text evolution (−3.20 FPR95).
Robustness to Data Imbalance (FPR95↓, ImageNet as ID vs. SUN as OOD)
| ID:OOD Ratio | NegLabel | AdaNeg | CoEvo-NegLabel |
|---|---|---|---|
| 1:100 | 23.00 | 28.00 | 17.00 |
| 1:1 | 21.55 | 8.01 | 5.27 |
| 100:1 | 19.69 | 17.40 | 14.77 |
- CoEvo achieves the best performance across all ratios and remains effective under extreme imbalance (100:1, only 100 OOD samples).
Highlights & Insights
- Bidirectional co-evolution is the core innovation: The closed-loop design, in which text proxies guide visual adaptation and visual feedback refines the text proxies, addresses the fundamental limitation of unidirectional adaptation in prior methods.
- The pre/post weight-flipping strategy is elegant: Trusting text priors at cold start and visual evidence after sufficient evolution allows adaptive modulation of cross-modal contributions.
- Fully training- and annotation-free: CoEvo is plug-and-play with any CLIP model and requires neither additional training nor OOD annotation.
- A 45.98% relative reduction in FPR95 is practically significant: fewer OOD inputs mistaken for ID translates directly into safer real-world deployment.
Limitations & Future Work
- Limited Near-OOD improvement: Gains are less pronounced in fine-grained OOD settings such as SSB-hard compared to Far-OOD scenarios.
- Test-time computational overhead: Each sample requires Top-N text candidate retrieval and dual-modal cache updates, introducing higher inference latency than static methods.
- Dependence on initial negative label quality: If the WordNet/CSP initial negative labels offer insufficient coverage, the evolution starts from a poor initialization.
- Order sensitivity: Online evolution results may be affected by the order of test samples, a concern not thoroughly discussed in the paper.
- Validated only on CLIP ViT-B/16: Performance on larger-scale VLMs (e.g., ViT-L, SigLIP) remains unknown.
Related Work & Insights
- vs. NegLabel/CSP (static negative labels): CoEvo addresses the sparsity and misalignment issues of static methods through dynamic proxy evolution, reducing FPR95 by approximately 60%.
- vs. AdaNeg (unidirectional adaptation): AdaNeg adapts only visual proxies while text labels remain static; CoEvo achieves bidirectional adaptation with substantial additional gains (FPR95 from 18.92% to 10.22%).
- vs. MCM (maximum softmax): MCM does not use negative labels and relies solely on ID class similarity, yielding performance well below negative-label methods.
- vs. training-based methods (LoCoOp/LAPT): CoEvo as a training-free method surpasses training-based methods requiring prompt tuning.
The co-evolution paradigm generalizes naturally to other test-time adaptation scenarios for VLMs, including domain adaptation, continual learning, and open-set recognition. The weight-flipping strategy — trusting priors at cold start and data after evolution — represents a general-purpose design pattern for online learning. The proxy cache with entropy-based online updates can also be adapted for retrieval-augmented systems.
Rating
- Novelty: ⭐⭐⭐⭐ — Bidirectional proxy co-evolution mechanism is novel; the weight-flipping design is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — ImageNet-1K, OpenOOD, ablations, hyperparameter sensitivity, and imbalance analysis are all thoroughly covered.
- Writing Quality: ⭐⭐⭐⭐ — Problem analysis is thorough, mathematical derivations are clear, and algorithm pseudocode is complete.
- Value: ⭐⭐⭐⭐ — Direct value for safe VLM deployment; training-free nature facilitates practical adoption.