Harnessing Vision-Language Models for Time Series Anomaly Detection

Conference: AAAI2026 arXiv: 2506.06836 Code: ZLHe0/VLM4TS Area: Multimodal VLM Keywords: time series anomaly detection, VLM, vision transformer, zero-shot, ViT4TS

TL;DR

This paper proposes a two-stage zero-shot time series anomaly detection framework: ViT4TS uses a lightweight ViT to perform multi-scale cross-patch matching on line-chart renderings of the series, localizing candidate anomaly intervals, while VLM4TS leverages GPT-4o with global temporal context to validate and refine them. Across 11 benchmarks the framework's F1-max surpasses the best baseline by 24.6%, while consuming only ~1/36 the tokens of existing LLM-based methods.

Background & Motivation

State of the Field

Traditional time series anomaly detection (TSAD) methods train domain-specific models on numerical data and lack the visual-temporal understanding that human experts use to identify contextual anomalies (e.g., gradual drift).

Directly applying VLMs to TSAD encounters a resolution-context dilemma:

  • Short windows preserve resolution but provide limited context, and incur prohibitively high token costs (a 1000-step sequence → ~20 images → ~20,000 tokens).
  • Long windows retain global context but suffer from severe resolution degradation, making precise anomaly-boundary localization infeasible.

Method

Stage 1: ViT4TS — Visual Screening

  1. Time Series to Image: Renders 1-D time series as plain line charts (no ticks/legends), with window length \(L_w\) matched to image width and stride \(L_s = \lfloor L_w/4 \rfloor\).
  2. Multi-scale Embedding Extraction: Extracts patch-level feature maps \(\mathbf{F} \in \mathbb{R}^{P \times P \times D}\) using CLIP ViT-B/16, then applies average pooling with kernels \(k \in \{2,3\}\) to obtain multi-scale features.
  3. Cross-patch Matching: Exploiting anomaly sparsity, patch embeddings of each window are matched against those of other windows via cosine dissimilarity; the median is taken to produce an anomaly score map.
  4. Multi-scale Fusion: Patch scores across scales are aggregated via harmonic averaging, mapped back to time steps, and the 0.25 quantile is used to generate a 1-D anomaly score \(s(t)\).
  5. Thresholding: A Gaussian threshold \(\tau\) extracts candidate anomaly intervals \(\hat{\mathbf{A}}\).
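The screening pipeline above can be sketched in a few dozen lines. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `vit4ts_scores`, the pooling layout, and the time-step mapping are simplifications, and in the paper the \((P, P, D)\) feature maps come from CLIP ViT-B/16 rather than being supplied directly.

```python
import numpy as np

def cosine_dissim(a, b):
    """Pairwise cosine dissimilarity between rows of a (n, D) and b (m, D)."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return 1.0 - a @ b.T

def avg_pool(fmap, k):
    """Non-overlapping k x k average pooling over a (P, P, D) feature map."""
    P, _, D = fmap.shape
    q = P // k
    return fmap[:q * k, :q * k].reshape(q, k, q, k, D).mean(axis=(1, 3))

def vit4ts_scores(windows, L_w, kernels=(1, 2, 3), q=0.25, eps=1e-8):
    """Per-window 1-D anomaly scores of length L_w, fused across pooling scales.

    windows: list of (P, P, D) patch feature maps, one per rendered chart
    window (stand-ins for CLIP ViT-B/16 features in this sketch).
    """
    n = len(windows)
    inv_sum = np.zeros((n, L_w))  # accumulator for the harmonic mean
    for k in kernels:
        pooled = [avg_pool(w, k) for w in windows]
        q_k = pooled[0].shape[0]
        for i in range(n):
            flat_i = pooled[i].reshape(-1, pooled[i].shape[-1])
            # cross-patch matching: compare against patches of all OTHER windows
            others = np.concatenate(
                [p.reshape(-1, p.shape[-1]) for j, p in enumerate(pooled) if j != i])
            d = cosine_dissim(flat_i, others)
            smap = np.median(d, axis=1).reshape(q_k, q_k)   # 2-D anomaly score map
            col = np.quantile(smap, q, axis=0)              # collapse rows per column
            # map column scores back to time steps of the window
            ts = np.repeat(col, int(np.ceil(L_w / q_k)))[:L_w]
            inv_sum[i] += 1.0 / (ts + eps)
    return len(kernels) / inv_sum                           # harmonic mean over scales
```

Anomaly sparsity is what makes the median reference work: as long as most windows are normal, an anomalous patch disagrees with the majority and receives a high median dissimilarity, while a normal patch finds many close matches.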

Stage 2: VLM4TS — VLM Verification

  1. Visual Input: Renders the full time series as a single line chart with coordinate axes.
  2. Text Input: A prompt lists candidate intervals from ViT4TS, asking the VLM to confirm, reject, or add anomalies and assign confidence scores of 1–3.
  3. Output: A refined anomaly set in JSON format with confidence scores and natural language explanations; intervals with confidence = 1 are discarded.
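The verification step can be sketched as a prompt builder plus a response filter. The prompt wording, JSON schema, and function names below are illustrative assumptions; the source only specifies that the prompt lists candidate intervals, confidences run 1-3, and confidence-1 intervals are discarded.

```python
import json

def build_prompt(candidates):
    """Hypothetical prompt template; the paper's exact wording is not reproduced."""
    lines = [
        "You are shown a line chart of a full time series with coordinate axes.",
        "Candidate anomaly intervals (start, end) from a visual screener:",
    ]
    lines += [f"  {i + 1}. [{s}, {e}]" for i, (s, e) in enumerate(candidates)]
    lines += [
        "Confirm, reject, or add anomaly intervals. For each kept interval,",
        "assign a confidence score from 1 (low) to 3 (high) and briefly explain.",
        'Answer as JSON: [{"start": int, "end": int, "confidence": int, "reason": str}]',
    ]
    return "\n".join(lines)

def refine(vlm_json, min_conf=2):
    """Parse the VLM's JSON answer and drop confidence-1 intervals."""
    items = json.loads(vlm_json)
    return [(d["start"], d["end"]) for d in items if d["confidence"] >= min_conf]
```

In practice the prompt and rendered chart would be sent to GPT-4o via its API; the filter then yields the refined anomaly set \(\hat{\mathbf{A}}\).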

Key Design Considerations

  • Both ViT4TS and VLM4TS are zero-shot, requiring no in-domain fine-tuning.
  • The two stages are complementary: ViT4TS provides high-recall, precise local detection, while VLM4TS improves precision via global contextual understanding.

Key Experimental Results

Evaluated on 11 benchmarks (5 NAB subsets + 2 NASA subsets + 4 YAHOO subsets), comparing against from-scratch, time-series-pretrained, and LLM-based methods.

Table 1: F1-max Comparison

Method    Type              Avg. F1-max
LSTM-DT   from scratch      0.529
AER       from scratch      0.527
UniTS     TS pretrained     0.390
TimesFM   TS pretrained     0.388
ViT4TS    ours (stage 1)    0.612
VLM4TS    ours (full)       0.659

VLM4TS outperforms the best baseline LSTM-DT by 24.6%.

Table 2: vs. LLM/VLM Methods (Efficiency)

Method      Avg. F1-max    Avg. Tokens/Seq.    Avg. Time/Seq.
SigLLM-PG   0.128          62,133              2,575 s
TAMA        0.587          32,965              88 s
VLM4TS      0.665          1,212               15 s

Token consumption is only 1/27 of TAMA and 1/51 of SigLLM-PG.

Ablation Study (Table 3)

  • Removing patch-level embeddings → F1 drops by 11.94%.
  • Removing cross-patch matching → F1 on the YAHOO group drops by 18.76%.
  • Removing the ViT4TS screening stage → F1 on the YAHOO group plummets from 0.651 to 0.292.

Highlights & Insights

  • Two-stage divide-and-conquer resolves the resolution-context dilemma: lightweight ViT screening combined with heavyweight VLM verification balances precision and efficiency.
  • Fully zero-shot: requires no training on time series data, relying purely on visual pretraining weights and VLM inference.
  • Strong cross-domain generalization: consistently outperforms domain-specific models across 11 datasets spanning aerospace telemetry, network traffic, and social media data.
  • Token efficiency: uses roughly 36× fewer tokens than sliding-window VLM methods, making large-scale deployment practical.

Limitations & Future Work

  • VLM4TS assumes anomaly sparsity and behaves conservatively on datasets with dense synthetic anomalies (YAHOO A3/A4).
  • Only univariate time series are validated; multivariate extension is discussed only in the appendix.
  • The VLM stage relies on the GPT-4o API, introducing cost and latency constraints.
  • The line-chart rendering is relatively simple; richer visual representations such as spectrograms and recurrence plots remain unexplored.
  • Cross-patch matching may incur high memory overhead on very long sequences (partially mitigated by a median-reference variant).

Rating

  • Novelty: ⭐⭐⭐⭐ — Reformulating 1-D anomaly detection as 2-D visual understanding; the two-stage design elegantly addresses the resolution-context dilemma.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 11 datasets × multiple baseline categories, with comprehensive ablation and efficiency analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Motivation figures are intuitive and method descriptions are clear.
  • Value: ⭐⭐⭐⭐ — Provides a viable paradigm for applying VLMs to non-traditional vision tasks.