A Theoretical Analysis of Detecting Large Model-Generated Time Series

Conference: AAAI 2026
arXiv: 2511.07104
Code: None
Area: Time Series / AI-Generated Content Detection
Keywords: time series large model, generation detection, uncertainty contraction, recursive forecasting, UCE

TL;DR

This work presents the first theoretical framework for detecting content generated by time series large models (TSLMs). By establishing the Contraction Hypothesis, it shows that TSLM-generated sequences exhibit exponentially decaying uncertainty under recursive forecasting. Based on this insight, the proposed Uncertainty Contraction Estimator (UCE) achieves an in-distribution AUROC of 0.855 across 32 datasets, substantially outperforming 10 text-detection baselines.

Background & Motivation

Background: Time series large models (e.g., Chronos, Timer, TimeMoE) are now capable of zero-shot long-horizon forecasting on unseen domains. This capability may be maliciously exploited to fabricate financial transaction records, scientific experiment data, and environmental monitoring indicators, posing serious threats to data authenticity.

Limitations of Prior Work: LLM text detection methods (DetectGPT, Fast-DetectGPT, etc.) rely on token-level probability or rank differences to distinguish human-written from AI-generated text. However, time series exhibit fundamental modal differences: low information density (adjacent values such as 25.1°C and 25.2°C are nearly identical) and smooth probability distributions (high entropy), rendering token-level probability differences non-discriminative. Experiments confirm that 10 text-detection baselines achieve an average AUROC of only 0.670 on time series.

Key Challenge: Point-wise probabilities lack discriminative power in time series, yet the dynamic distributional evolution at the sequence level encodes essential differences between real and generated sequences—the challenge lies in characterizing and exploiting this distinction.

Goal: Answer three questions: (1) why do text detection methods fail on time series? (2) what unique properties of the time series modality can be leveraged for detection? (3) how can a theoretically grounded detector for TSLM-generated time series be designed?

Key Insight: Rather than examining point-wise probabilities, this work analyzes the dynamics of TSLM internal predictive distributions during recursive forecasting. The authors find that, due to sampling strategies, each step of a TSLM-generated sequence yields a more concentrated distribution than the true distribution, and this effect accumulates and amplifies through recursive prediction.

Core Idea: TSLM-generated time series exhibit exponentially decaying uncertainty (distributional contraction) under recursive forecasting, while real sequences maintain stable uncertainty. Detecting AI-generated sequences is achieved by quantifying this dynamic uncertainty discrepancy.

Method

Overall Architecture

Given a candidate time series \(\mathbf{X}_t = (X_1, \ldots, X_t)\), a TSLM is used to compute internal predictive distributions over prefixes of varying lengths. Uncertainty metrics are extracted from these distributions, and the uncertainty level is used to determine whether the sequence is model-generated. The core theoretical insight is that uncertainty in TSLM-generated sequences decays exponentially with recursive steps, whereas real sequences do not exhibit this behavior.

Key Designs

  1. Contraction Hypothesis:

    • Function: Provides the theoretical foundation for detection—TSLM-generated time series exhibit progressively concentrated distributions, while real sequences do not.
    • Mechanism: The time series is decomposed into a trend component \(T_t\) and Gaussian noise \(n_t \sim \mathcal{N}(0, \sigma_t^2)\), where \(\sigma_t^2 = \sum_{i=1}^l \alpha_i \sigma_{t-i}^2\). The theoretical analysis proceeds in three steps: (a) Distributional Consistency: the internal predictive distribution \(f_\theta\) of an ideal model coincides with the true distribution \(f_t\) (proved via the Gibbs inequality and cross-entropy minimization); (b) Sampling-Induced Variance Scaling: sampling strategies (temperature sampling, top-k, etc.) contract the internal predictive variance \(\tilde{\sigma}_t^2\) to \(\hat{\sigma}_t^2 = \gamma_t \cdot \tilde{\sigma}_t^2\), where \(\gamma_t < 1\) reduces uncertainty and smaller \(\gamma_t\) yields lower uncertainty scores; (c) Recursive Variance Decay: when generated values serve as subsequent inputs, the variance follows \(\tilde{\sigma}_t^2 = \sum_{i=1}^l \alpha_i \gamma_{t-i} \tilde{\sigma}_{t-i}^2\), so \(\gamma_t < 1\) drives the uncertainty exponentially toward zero (a toy simulation of this decay follows the list below).
    • Design Motivation: To theoretically explain why TSLM-generated sequences exhibit statistically detectable differences from real sequences, rather than relying on heuristic observations.
  2. Uncertainty Contraction Estimator (UCE):

    • Function: Translates the Contraction Hypothesis into a practically computable detection score.
    • Mechanism: For a candidate sequence, \(N\) time points \(t_1, \ldots, t_N\) are sampled at fixed intervals \(\Delta t\). For each prefix \(\mathbf{X}_{t_i}\), the TSLM computes the internal distribution \(\hat{P}_{t_i} = p_\theta(\cdot | X_1, \ldots, X_{t_i})\). Three uncertainty metrics are computed within a neighborhood \(\mathcal{U}\) around the distributional mean: (a) entropy \(E = -\sum_{x \in \mathcal{U}} \hat{P}(x) \log \hat{P}(x)\), (b) maximum probability \(P_{\max} = \max_{x \in \mathcal{U}} \hat{P}(x)\), and (c) variance \(\text{Var} = \sum_{x \in \mathcal{U}} (x - \mu)^2 \hat{P}(x)\). The UCE score averages the chosen metric over the sampled prefixes, \(\text{UCE} = \frac{1}{N} \sum_{i=1}^N s_{t_i}\), and sequences with lower uncertainty are classified as model-generated (a minimal scoring sketch follows the Loss & Training paragraph below).
    • Design Motivation: Leveraging distributional signals rather than point-wise probabilities, and covering complementary facets of uncertainty (information-theoretic, concentration, dispersion), keeps the detector discriminative while remaining computationally simple.
  3. Modal Difference Analysis:

    • Function: Explains why text detection methods fail on time series.
    • Mechanism: Text tokens have large semantic distances and sharp probability distributions: a small number of tokens carry most of the probability mass (e.g., after "I eat an," "apple"/"orange" dominate over other words), making token probabilities and ranks highly discriminative. In contrast, adjacent time series values are extremely similar and probability distributions are smooth; mutual information between neighboring values is large, but the individual information content of each value is small, so token-level probability differences are negligible (a small numeric illustration follows this list).
    • Design Motivation: Provides modal-level theoretical justification for introducing distributional-level detection methods.
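The exponential decay in step (c) of the Contraction Hypothesis is easy to reproduce numerically. Below is a minimal toy simulation, not taken from the paper: it iterates the variance recursion with and without a constant sampling-contraction factor \(\gamma < 1\); the lag order, the weights \(\alpha_i\), and \(\gamma\) are illustrative assumptions.

```python
import numpy as np

# Toy simulation of the Contraction Hypothesis (illustrative values, not from the paper).
# Real sequence:      sigma_t^2 = sum_i alpha_i * sigma_{t-i}^2          (stays stable)
# Generated sequence: sigma_t^2 = gamma * sum_i alpha_i * sigma_{t-i}^2  (gamma < 1 => decay)

l = 3                              # lag order (assumed)
alpha = np.array([0.5, 0.3, 0.2])  # mixing weights summing to 1 (assumed)
gamma = 0.9                        # constant sampling contraction factor (assumed)
steps = 200

real = [1.0] * l  # initial variances
gen = [1.0] * l

for _ in range(steps):
    lags_real = np.array(real[-l:][::-1])  # sigma_{t-1}^2, ..., sigma_{t-l}^2
    lags_gen = np.array(gen[-l:][::-1])
    real.append(float(alpha @ lags_real))          # convex combination: stays at 1.0
    gen.append(float(gamma * (alpha @ lags_gen)))  # contracts at every recursive step

print(f"real variance after {steps} steps:      {real[-1]:.4f}")  # ~1.0
print(f"generated variance after {steps} steps: {gen[-1]:.2e}")   # exponentially small
```

Running this prints a real-sequence variance that stays at 1.0 and a generated-sequence variance several orders of magnitude smaller, mirroring the claim that uncertainty in recursively generated sequences decays toward zero while real sequences hold steady.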
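The modal gap described in item 3 can also be made concrete numerically. The two distributions below are assumed purely for illustration (not from the paper): a peaked, text-like next-token distribution versus a smooth, time-series-like one over the same number of bins.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

V = 1000  # vocabulary size / number of quantization bins (assumed)

# "Text-like": a few tokens dominate (e.g., "apple"/"orange" after "I eat an").
text_like = np.full(V, 0.1 / (V - 2))
text_like[:2] = [0.6, 0.3]

# "Time-series-like": mass spread smoothly over neighboring value bins
# (adjacent values such as 25.1 and 25.2 are nearly interchangeable).
bins = np.arange(V)
time_like = np.exp(-0.5 * ((bins - V / 2) / 50.0) ** 2)
time_like /= time_like.sum()

for name, p in [("text-like", text_like), ("time-series-like", time_like)]:
    print(f"{name:17s} entropy = {entropy(p):.3f} nats, max prob = {p.max():.4f}")
# The sharp text-like distribution has low entropy and a high top probability;
# the smooth time-series-like one has high entropy and a tiny top probability,
# so point-wise probabilities and ranks carry little detection signal.
```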

Loss & Training

UCE is a zero-shot detection method requiring no training. It only requires white-box access to the TSLM's internal predictive distributions (logits). In experiments, Chronos-T5 (large) is used as the primary TSLM, generating forecast sequences of horizon \(H=64\).
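As a rough illustration of how a UCE score could be computed with such white-box access, here is a minimal sketch of the UCE-Entropy variant. The `predictive_dist` callable, the neighborhood construction, and all parameter defaults are hypothetical stand-ins for illustration, not the paper's implementation or the Chronos API.

```python
import numpy as np

def uce_score(series, predictive_dist, N=8, dt=16, radius=10):
    """Sketch of UCE-Entropy: average next-step entropy over sampled prefixes.

    series          -- 1-D numpy array, the candidate time series
    predictive_dist -- hypothetical callable mapping a prefix to (values, probs),
                       the TSLM's internal next-step distribution (white-box access)
    N, dt           -- number of sampled prefixes and the fixed interval between them
    radius          -- half-width (in bins) of the neighborhood U around the mean
    """
    scores = []
    for t in range(dt, min(N * dt, len(series) - 1) + 1, dt):
        values, probs = predictive_dist(series[:t])
        mu = float((values * probs).sum())            # distributional mean
        center = int(np.abs(values - mu).argmin())    # bin closest to the mean
        p = probs[max(0, center - radius):center + radius + 1]
        p = p[p > 0]
        scores.append(float(-(p * np.log(p)).sum()))  # entropy within U
    return float(np.mean(scores))

# Usage (hypothetical): flag a candidate as model-generated when its score falls
# below a threshold calibrated on known-real data, following the rule that lower
# uncertainty indicates TSLM-generated content.
```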

Key Experimental Results

Main Results

Evaluated on 32 datasets (12 in-distribution + 20 zero-shot) against 10 text-detection baselines:

| Method | In-Dist AUROC | In-Dist TPR@1%FPR | Zero-Shot AUROC | Zero-Shot TPR@1%FPR |
|---|---|---|---|---|
| DetectLLM-LLR | 0.815 | 0.324 | 0.705 | 0.233 |
| Baseline Average | 0.670 | 0.118 | 0.632 | 0.151 |
| UCE-Entropy | 0.855 | 0.447 | 0.731 | 0.286 |

Cross-Model Detection (Timer & Time-MoE)

| Model / Horizon | UCE-Entropy AUROC | UCE-Entropy TPR |
|---|---|---|
| Timer H=96 | 0.833 | 0.301 |
| Timer H=768 | 0.788 | 0.366 |
| Time-MoE H=96 | 0.829 | 0.320 |
| Time-MoE H=336 | 0.957 | 0.611 |
| Time-MoE H=720 | 0.950 | 0.561 |

Key Findings

  • UCE-Entropy consistently achieves the best performance across all settings: In-Dist AUROC of 0.855 surpasses the strongest baseline DetectLLM-LLR (0.815) by 0.040, with a TPR gain of 0.123.
  • Strong cross-model generalization: AUROC reaches 0.957 on Time-MoE with long sequences (H=336), suggesting that long-range predictions from MoE architectures are more detectable.
  • Among the three metrics, Entropy is the most stable, followed by MaxProb, while Variance performs relatively poorly on non-probabilistic models.
  • Empirical validation of the Contraction Hypothesis: within 1024 tokens, entropy and variance of generated sequences continuously decay toward 0 while maximum probability approaches 1, whereas real sequences maintain stable fluctuations.

Highlights & Insights

  • First theoretical framework for TSLM-generated content detection: The logical chain from modal difference analysis → Contraction Hypothesis → theoretical proof → detector design is complete and rigorous, bridging the gap between text detection and time series detection.
  • Universality of the Contraction Hypothesis: Chronos uses top-k + median sampling (where \(\gamma_t < 1\) holds directly); Timer/Time-MoE use MSE loss (equivalent in effect to \(\gamma_t < 1\)), demonstrating that the contraction phenomenon is a general property of TSLMs rather than an artifact of specific architectures.
  • Zero-shot, training-free design: UCE requires neither labeled data nor dedicated training, and uses an existing TSLM as the detection tool, resulting in extremely low deployment cost.

Limitations & Future Work

  • White-box requirement: Access to TSLM internal distributions is necessary, making the method inapplicable in black-box settings. The authors briefly discuss approximating with a locally deployed probabilistic model, but this has not been thoroughly validated.
  • Recursive forecasting assumption: If a TSLM employs non-recursive generation strategies (e.g., parallel decoding), the Contraction Hypothesis may not hold.
  • Idealized assumptions: The theoretical proofs rely on Gaussian noise structure and infinite model capacity assumptions, from which practical TSLMs may deviate.
  • Adversarial robustness: Adversaries may attempt to disguise uncertainty levels by injecting noise through post-processing.

Comparison with Related Methods

  • vs DetectGPT/Fast-DetectGPT: Perturbation-based text detection relies on local probability variations, which lose discriminative power on the smooth distributions of time series. UCE shifts to distributional-level dynamic analysis, a paradigm better suited to the time series modality.
  • vs FourierGPT: Spectral analysis of token probability sequences is an inspiring approach, but it remains a token-level method. UCE directly analyzes distributional signals, avoiding the fundamental limitation of low information density.
  • vs Binoculars: The idea of cross-perplexity between two models may carry over to black-box detection of time series; using distributional differences between two distinct TSLMs as a detection signal is a promising direction.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First TSLM generation detection framework; the Contraction Hypothesis is novel and rigorously proved.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 32 datasets, 10 baselines, 3 TSLMs, cross-model generalization verified.
  • Writing Quality: ⭐⭐⭐⭐ Modal difference analysis is insightful; the three-part theoretical development is logically progressive.
  • Value: ⭐⭐⭐⭐ Opens a new direction in AI-generated content detection for the time series domain.