# Harnessing Vision-Language Models for Time Series Anomaly Detection
- Conference: AAAI 2026
- arXiv: 2506.06836
- Code: ZLHe0/VLM4TS
- Area: Multimodal VLM
- Keywords: time series anomaly detection, VLM, vision transformer, zero-shot, ViT4TS
## TL;DR
A two-stage zero-shot framework for time series anomaly detection: ViT4TS, a lightweight ViT stage, performs multi-scale cross-patch matching on line-chart renderings of the series to localize candidate anomaly intervals; VLM4TS then leverages GPT-4o with global temporal context to validate and refine them. Across 11 benchmarks, F1-max surpasses the best baseline by 24.6%, at only 1/36 the token consumption of existing LLM-based methods.
## Background & Motivation
### State of the Field
Traditional time series anomaly detection (TSAD) methods train domain-specific models on numerical data; they lack the visual-temporal understanding that human experts use to identify contextual anomalies (e.g., gradual drift). Directly applying VLMs to TSAD, however, runs into a resolution-context dilemma.
### Root Cause
Short windows preserve visual resolution but provide limited context, and sliding many short windows over a long sequence drives token costs prohibitively high (a 1000-step sequence → ~20 images → ~20,000 tokens); the rough arithmetic below spells this out.
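A rough version of that arithmetic, where the window length \(L_w\), stride \(L_s\), and the ~1,000-token cost per rendered image are illustrative assumptions rather than the paper's reported settings:

\[
N_{\text{win}} = \left\lfloor \frac{1000 - L_w}{L_s} \right\rfloor + 1 \approx 20, \qquad N_{\text{win}} \times {\sim}1000\ \text{tokens/image} \approx 20{,}000\ \text{tokens}.
\]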
### Limitations of Prior Work
Long windows retain global context but suffer severe resolution degradation, making precise localization of anomaly boundaries infeasible.
## Method
### Stage 1: ViT4TS — Visual Screening
- Time Series to Image: Renders 1-D time series as plain line charts (no ticks/legends), with window length \(L_w\) matched to image width and stride \(L_s = \lfloor L_w/4 \rfloor\).
- Multi-scale Embedding Extraction: Extracts patch-level feature maps \(\mathbf{F} \in \mathbb{R}^{P \times P \times D}\) using CLIP ViT-B/16, then applies average pooling with kernels \(k \in \{2,3\}\) to obtain multi-scale features.
- Cross-patch Matching: Each window's patch embeddings are matched against those of other windows via cosine dissimilarity, and the median over references yields an anomaly score map; because anomalies are sparse, most reference windows are normal, so the median is robust.
- Multi-scale Fusion: Patch scores across scales are aggregated via harmonic averaging and mapped back to time steps, where the 0.25 quantile produces a 1-D anomaly score \(s(t)\).
- Candidate Extraction: A Gaussian threshold \(\tau\) on \(s(t)\) extracts candidate anomaly intervals \(\hat{\mathbf{A}}\). A code sketch of this pipeline follows below.
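The Stage-1 pipeline is concrete enough to sketch. The following is a minimal, hedged rendition, not the authors' code: it assumes CLIP ViT-B/16 from HuggingFace `transformers` as the backbone, matplotlib for the plain line-chart rendering, and, as a simplification of cross-patch matching, compares each patch only to the same-position patch in other windows.

```python
# Minimal sketch of the Stage-1 (ViT4TS) idea; not the authors' code.
# Assumptions: L_w = 224 so the window matches the 224-pixel image width;
# harmonic multi-scale fusion, the 0.25-quantile projection back to time
# steps, and the Gaussian threshold are omitted for brevity.
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")
vit = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16").eval()

def render_window(x: np.ndarray) -> Image.Image:
    """Render one window as a plain line chart: no ticks, legends, or axes."""
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)
    ax.plot(x, linewidth=1.0)
    ax.axis("off")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight", pad_inches=0)
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

@torch.no_grad()
def patch_features(img: Image.Image) -> torch.Tensor:
    """Return the P x P x D patch feature map F (P = 14 for ViT-B/16 at 224px)."""
    out = vit(**processor(images=img, return_tensors="pt"))
    tokens = out.last_hidden_state[0, 1:]  # drop the CLS token
    P = int(tokens.shape[0] ** 0.5)
    return tokens.reshape(P, P, -1)

def pooled(fmap: torch.Tensor, k: int) -> torch.Tensor:
    """Average-pool a P x P x D map with kernel k (the multi-scale step)."""
    x = fmap.permute(2, 0, 1).unsqueeze(0)  # 1 x D x P x P
    return F.avg_pool2d(x, k).squeeze(0).permute(1, 2, 0)

def window_score_maps(series: np.ndarray, L_w: int = 224) -> np.ndarray:
    """Score each patch by its median cosine dissimilarity to the same-position
    patch in every other window; anomaly sparsity keeps most references normal."""
    L_s = L_w // 4  # stride, as in the paper
    starts = range(0, len(series) - L_w + 1, L_s)
    feats = torch.stack(
        [patch_features(render_window(series[i:i + L_w])) for i in starts])
    feats = F.normalize(feats, dim=-1)  # (N, P, P, D), unit-norm embeddings
    maps = []
    for i in range(feats.shape[0]):
        sims = torch.einsum("pqd,npqd->npq", feats[i], feats)
        dissim = (1.0 - sims).numpy()  # (N, P, P) cosine dissimilarities
        dissim[i] = np.nan             # exclude the trivial self-match
        maps.append(np.nanmedian(dissim, axis=0))
    return np.stack(maps)  # one P x P anomaly score map per window
```

From these per-window maps, the paper's full pipeline would pool at kernels \(k \in \{2,3\}\) (the `pooled` helper), fuse scales by harmonic averaging, project scores to time steps via the 0.25 quantile, and apply the Gaussian threshold \(\tau\) to obtain \(\hat{\mathbf{A}}\).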
### Stage 2: VLM4TS — VLM Verification
- Visual Input: Renders the full time series as a single line chart with coordinate axes.
- Text Input: A prompt lists candidate intervals from ViT4TS, asking the VLM to confirm, reject, or add anomalies and assign confidence scores of 1–3.
- Output: A refined anomaly set in JSON format with confidence scores and natural-language explanations; intervals with confidence = 1 are discarded. A sketch of what this call might look like follows below.
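A minimal sketch of the Stage-2 verification call, assuming the OpenAI Python client; the prompt paraphrases the description above rather than reproducing the authors' actual prompt, and `verify_anomalies` is a hypothetical helper:

```python
# Hedged sketch of Stage-2 (VLM4TS) verification; not the authors' code.
# Assumes: a PNG of the full series rendered with coordinate axes, and
# ViT4TS candidates passed as (start, end) index pairs.
import base64
import json

from openai import OpenAI

client = OpenAI()

def verify_anomalies(chart_png: str,
                     candidates: list[tuple[int, int]]) -> list[dict]:
    """Ask GPT-4o to confirm/reject/add anomaly intervals with 1-3 confidence."""
    with open(chart_png, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "The chart shows a univariate time series with coordinate axes. "
        f"Candidate anomaly intervals (start, end): {candidates}. "
        "Confirm, reject, or add intervals; give each a confidence from 1 to 3 "
        "and a brief explanation. Answer with a JSON list of objects: "
        '{"start": int, "end": int, "confidence": int, "reason": str}.'
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    refined = json.loads(resp.choices[0].message.content)
    # Per the paper's rule, intervals with confidence == 1 are discarded.
    return [a for a in refined if a["confidence"] > 1]
```

Note that GPT-4o may wrap its answer in a Markdown code fence, so production code would strip fences before `json.loads`.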
### Key Design Considerations
- Both ViT4TS and VLM4TS are zero-shot, requiring no in-domain fine-tuning.
- The two stages are complementary: ViT4TS provides high-recall, precise local detection, while VLM4TS improves precision via global contextual understanding.
## Key Experimental Results
Evaluation covers 11 benchmarks (5 NAB subsets + 2 NASA subsets + 4 YAHOO subsets), with comparisons against from-scratch, time-series-pretrained, and LLM-based methods.
### Table 1: F1-max Comparison
| Method | Type | Avg. F1-max |
|---|---|---|
| LSTM-DT | from scratch | 0.529 |
| AER | from scratch | 0.527 |
| UniTS | TS pretrained | 0.390 |
| TimesFM | TS pretrained | 0.388 |
| ViT4TS | ours (stage 1) | 0.612 |
| VLM4TS | ours (full) | 0.659 |
VLM4TS outperforms the best baseline, LSTM-DT, by 24.6% relative (0.659 vs. 0.529).
### Table 2: vs. LLM/VLM Methods (Efficiency)
| Method | Avg. F1-max | Avg. Tokens/Seq. | Avg. Time/Seq. (s) |
|---|---|---|---|
| SigLLM-PG | 0.128 | 62,133 | 2,575 |
| TAMA | 0.587 | 32,965 | 88 |
| VLM4TS | 0.665 | 1,212 | 15 |
Token consumption is only 1/27 of TAMA's and 1/51 of SigLLM-PG's.
### Ablation Study (Table 3)
- Removing patch-level embeddings → F1 drops by 11.94%.
- Removing cross-patch matching → F1 on the YAHOO group drops by 18.76%.
- Removing the ViT4TS screening stage → F1 on the YAHOO group plummets from 0.651 to 0.292.
## Highlights & Insights
- Two-stage divide-and-conquer resolves the resolution-context dilemma: lightweight ViT screening combined with heavyweight VLM verification balances precision and efficiency.
- Fully zero-shot: requires no training on time series data, relying purely on visual pretraining weights and VLM inference.
- Strong cross-domain generalization: consistently outperforms domain-specific models across 11 datasets spanning aerospace telemetry, network traffic, and social media data.
- Token efficiency: cuts token usage by roughly 36× relative to sliding-window VLM methods, making large-scale deployment practical.
## Limitations & Future Work
- VLM4TS assumes anomaly sparsity and behaves conservatively on datasets with dense synthetic anomalies (YAHOO A3/A4).
- Only univariate time series are validated; multivariate extension is discussed only in the appendix.
- The VLM stage relies on the GPT-4o API, introducing cost and latency constraints.
- The line-chart rendering is relatively simple; richer visual representations such as spectrograms and recurrence plots remain unexplored.
- Cross-patch matching may incur high memory overhead on very long sequences (partially mitigated by a median-reference variant).
## Rating
- Novelty: ⭐⭐⭐⭐ — Reformulating 1-D anomaly detection as 2-D visual understanding; the two-stage design elegantly addresses the resolution-context dilemma.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 11 datasets × multiple baseline categories, with comprehensive ablation and efficiency analysis.
- Writing Quality: ⭐⭐⭐⭐ — Motivation figures are intuitive and method descriptions are clear.
- Value: ⭐⭐⭐⭐ — Provides a viable paradigm for applying VLMs to non-traditional vision tasks.