Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement

Conference: ACL 2025
arXiv: 2503.01875
Code: Available on HuggingFace (datasets and models)
Area: Time Series / LLM
Keywords: time series QA, multi-task learning, context enhancement, LLM fine-tuning, TSQA dataset

TL;DR

Proposes the Time-MQA framework and the TSQA dataset (~200k QA pairs), unifying time series forecasting, imputation, anomaly detection, classification, and open-ended reasoning QA under a natural language question answering paradigm, and endowing LLMs with time series understanding and reasoning capabilities through continual pre-training.

Background & Motivation

Background: Time series analysis is crucial in fields such as finance, healthcare, and energy. However, most existing methods and datasets focus on single tasks (e.g., forecasting or anomaly detection) and lack unified multi-task reasoning capabilities.

Limitations of Prior Work: Although LLMs have been introduced to time series analysis, most works only focus on single tasks (e.g., forecasting) and lack cross-task reasoning and natural language interaction capabilities. More critically, there is a lack of large-scale, paired text-time series datasets.

Key Challenge: Users expect to query time series intuitively in natural language (e.g., "Why did the temperature drop sharply at the 10th hour?"), but existing methods can only output numerical results and fail to provide explanatory reasoning.

Goal: To build a unified multi-task time series QA framework that enables LLMs to answer various time series questions using natural language.

Key Insight: Combine a large-scale TSQA dataset with LoRA-based continual pre-training of an LLM.

Core Idea: By constructing a multi-domain, multi-task time series QA dataset of ~200k QA pairs, the LLM is continually pre-trained to acquire time series knowledge and reasoning capabilities.

Method

Overall Architecture

Time-MQA learns a function \(f: (X, C, Q) \rightarrow A\):
- \(X\): Time series input
- \(C\): Contextual information (background description, feature specification, domain knowledge)
- \(Q\): Natural language question
- \(A\): Answer (can be predicted values, classification labels, anomaly timestamps, or textual explanations)
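To make the mapping \(f: (X, C, Q) \rightarrow A\) concrete, the following is a minimal sketch of how one \((X, C, Q)\) triple might be serialized into a single text prompt. The class name `TSQAExample`, the function `build_prompt`, and the prompt layout are illustrative assumptions, not the format used by the authors.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TSQAExample:
    """One (X, C, Q) triple; the field names are hypothetical."""
    series: Sequence[float]  # X: the time series values
    context: str             # C: background, feature description, domain knowledge
    question: str            # Q: the natural language question


def build_prompt(ex: TSQAExample, precision: int = 2) -> str:
    """Serialize (X, C, Q) into one text prompt (assumed layout)."""
    values = ", ".join(f"{v:.{precision}f}" for v in ex.series)
    return (
        f"Context: {ex.context}\n"
        f"Time series: [{values}]\n"
        f"Question: {ex.question}\n"
        "Answer:"
    )


# The same series under two different contexts yields two different prompts.
readings = [21.5, 21.7, 18.2, 17.9]
question = "Is there a sharp drop, and what could explain it?"
weather = TSQAExample(readings, "Hourly outdoor temperature in Celsius.", question)
server = TSQAExample(readings, "CPU temperature of an idle server.", question)
prompt_weather = build_prompt(weather)
prompt_server = build_prompt(server)
```

Only the context string differs between the two prompts, which is the situation context enhancement is meant to handle: the answer \(A\) can legitimately differ even though \(X\) and \(Q\) are identical.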

Three base models (Mistral 7B, Llama-3 8B, and Qwen-2.5 7B) are each continually pre-trained with LoRA adapters.

Key Designs

TSQA Dataset Construction: Covers 12 domains (healthcare, finance, energy, transportation, environment, IoT, etc.) and 5 task types:
- Forecasting: 42,557 instances, input length 64-256, prediction length 8-32, sourced from public datasets such as UTSD and from financial report data.
- Imputation: 38,657 instances, with 4-12 values randomly removed and replaced with "X".
- Anomaly Detection: 37,000 instances, input length 8-256, sourced from standard datasets such as UCR, ECG, and KPI.
- Classification: 37,000 instances, mostly sourced from human activity recognition data.
- Open-ended Reasoning QA: 37,629 instances, with question-answer pairs generated by GPT-4o covering topics such as trends, seasonality, and volatility.

Context Enhancement: Every instance is augmented with textual context (background information, feature descriptions, and task descriptions), so that the model can reason differently about the same time series under different contexts.

Training Strategy: TSQA is mixed with a general QA corpus (OpenOrca) at a 7:3 ratio, giving 10k QA pairs for continual pre-training. Training takes about 1 day on a single A100-80GB GPU.
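The imputation task described in the task list above (4-12 values randomly removed and replaced with "X") can be illustrated with a short sketch. The function name `make_imputation_instance`, the two-decimal formatting, and the comma-separated layout are assumptions made for illustration; the notes do not specify how the authors serialize the series.

```python
import random
from typing import List, Tuple


def make_imputation_instance(
    series: List[float],
    min_missing: int = 4,
    max_missing: int = 12,
    seed: int = 0,
) -> Tuple[str, List[int], List[float]]:
    """Hide a random set of values behind "X" and return the ground truth."""
    if len(series) < min_missing:
        raise ValueError("series is shorter than the minimum number of gaps")
    rng = random.Random(seed)
    # Number of hidden values, capped by the series length.
    k = rng.randint(min_missing, min(max_missing, len(series)))
    positions = sorted(rng.sample(range(len(series)), k))
    hidden = set(positions)
    tokens = ["X" if i in hidden else f"{v:.2f}" for i, v in enumerate(series)]
    targets = [series[i] for i in positions]
    return ", ".join(tokens), positions, targets


series = [float(i) for i in range(64)]
masked_text, positions, targets = make_imputation_instance(series, seed=42)
```

Here `masked_text` would go into the question side of a QA pair, while `positions` and `targets` form the reference answer against which an imputation error can be computed.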

Loss & Training

Uses LoRA (\(r=16\), \(\alpha=16\)) + AdamW (8-bit)
Learning rate \(5\text{e-}5\), embedding learning rate \(1\text{e-}5\), cosine scheduler
4000 max steps, 1000 warm-up steps
Batch size 4, gradient accumulation 8
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
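As a rough illustration of what these hyperparameters imply, the sketch below computes the learning rate at a given step and the effective batch size. It assumes a linear warm-up followed by cosine decay to zero, and that the 4000 steps count optimizer updates; neither detail is stated in the notes, and the separate embedding learning rate is not modeled.

```python
import math

PEAK_LR = 5e-5        # learning rate from the notes
MAX_STEPS = 4000      # max steps from the notes
WARMUP_STEPS = 1000   # warm-up steps from the notes
BATCH_SIZE = 4        # batch size from the notes
GRAD_ACCUM = 8        # gradient accumulation from the notes


def lr_at(step: int) -> float:
    """Assumed schedule: linear warm-up, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sequences contributing to each optimizer update.
effective_batch = BATCH_SIZE * GRAD_ACCUM
# Sequences processed over the whole run, if steps are optimizer updates.
sequences_seen = effective_batch * MAX_STEPS
```

Under these assumptions the effective batch size is 32 and the run processes 128,000 sequences, which would correspond to several passes over a 10k-pair training mixture if each sequence holds one QA pair.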

Key Experimental Results

Main Results

Model	Forecasting↓	Imputation↓	Anomaly↑	Classification↑	Judgment↑	MCQ↑
Doubao	—*	0.018	0.52	0.44	0.78	0.56
GPT-4o	1.79	0.018	0.64	0.32	0.72	0.58
Llama-3 8B	2.01	0.020	0.54	0.24	0.74	0.48
Qwen-2.5 7B	1.82	0.016	0.68	0.52	0.82	0.54
Mistral 7B	1.35	0.014	0.58	0.44	0.80	0.64
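The arrows in the table header mark whether lower (↓) or higher (↑) is better. As a hedged sketch of how such columns might be computed, the code below implements mean squared error for the forecasting column and plain accuracy for a ↑ column. The notes state explicitly only that forecasting uses MSE and that Judgment is an accuracy; treating the other ↑ columns as accuracies is an assumption, and the toy data is invented for illustration.

```python
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between reference and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(labels: Sequence[str], predictions: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(labels) != len(predictions) or len(labels) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


# Toy forecasting example: squared errors are 0, 0 and 4.
forecast_mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# Toy 50-item evaluation set with 32 correct answers.
labels = ["yes"] * 50
predictions = ["yes"] * 32 + ["no"] * 18
toy_accuracy = accuracy(labels, predictions)
```

With 50 items per task, each additional correct answer moves an accuracy score by 0.02, which is worth keeping in mind when comparing neighboring rows of the table.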

Ablation Study

Model	Judgment↑	MCQ↑
Zero-shot Mistral 7B	0.78	0.60
TSQA-tuned Mistral 7B	0.80	0.64

Key Findings

The TSQA-tuned Mistral 7B achieves the lowest forecasting MSE (1.35), ahead of GPT-4o (1.79), and also the best imputation error (0.014) and MCQ score (0.64), suggesting that continual pre-training is effective.
Qwen-2.5 7B performs best in anomaly detection (0.68), classification (0.52), and Judgment (0.82).
The TSQA-tuned Mistral 7B shows a modest improvement in open-ended reasoning over its zero-shot version (Judgment: 0.80 vs. 0.78; MCQ: 0.64 vs. 0.60).
User study (78 participants): Mistral is most favored in numerical precision (80.8% preference), whereas Qwen is stronger in trend analysis.
The fine-tuned model can provide the reasoning behind predictions, demonstrating capabilities beyond numerical output.

Highlights & Insights

Dataset Contribution is Prominent: TSQA is the first large-scale, multi-domain, multi-task time series QA dataset (~200k QA pairs), more than 10x larger than existing datasets.
Unified Framework Idea: Unifies traditionally isolated time series tasks into a QA paradigm, aligning with the interaction paradigm of the LLM era.
Value of Context Enhancement: The same time series can yield different interpretations under different background information; this design fosters the model's contextual reasoning capabilities.
Open-source Ecosystem: Datasets, models, and user study questionnaires are all open-sourced.

Limitations & Future Work

The MSE for forecasting tasks is relatively high (1.35-2.01). Long time series remain a challenge for LLMs, and there is a risk of hallucination.
Limited experimental scale: Only 50 QA pairs are used to evaluate each task, raising doubts about statistical significance.
The answers for open-ended reasoning QA are generated by GPT-4o, meaning answer quality is bounded by GPT-4o's capabilities.
Training used only 10k QA pairs (~5% of the ~200k), leaving the potential of the complete dataset largely unexploited.
Automatic evaluation metrics cannot fully measure the quality of open-ended answers, and the user study has limited coverage.
Financial data is limited to earnings calls and does not include richer financial time series.
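The concern about 50 evaluation items per task can be made quantitative with a confidence interval on an accuracy score. The sketch below uses the Wilson score interval, a standard choice for binomial proportions; applying it here is this note's illustration, not an analysis from the paper. The counts 32/50 and 30/50 correspond to the MCQ scores of 0.64 and 0.60 in the ablation table, assuming 50 items were used for that task.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


tuned = wilson_interval(32, 50)      # MCQ 0.64 on an assumed 50 items
zero_shot = wilson_interval(30, 50)  # MCQ 0.60 on an assumed 50 items

# True when the two intervals share at least one point.
intervals_overlap = tuned[0] < zero_shot[1] and zero_shot[0] < tuned[1]
```

Under these assumptions the interval for 0.64 spans roughly 0.50 to 0.76 and overlaps heavily with the one for 0.60, so a gap of two questions out of 50 cannot by itself separate the tuned and zero-shot models.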

Related Work & Insights

Positioning: Unlike works such as Time-LLM and TimeMMD, which focus on single tasks, Time-MQA emphasizes unified multi-task QA.
The TSQA dataset fills the time series community's gap in large-scale paired text-numerical data.
Insights: Similar QA datasets could be constructed in other fields (e.g., spatial data, graph-structured data) to empower LLMs with domain reasoning capabilities.
The concept of context enhancement is generalizable: adding text descriptions to any numerical data could improve an LLM's understanding and reasoning performance.

Rating

⭐⭐⭐⭐ (3.5/5)
- Novelty: ⭐⭐⭐⭐. The unified multi-task QA framework and large-scale dataset are the primary contributions, but the method itself (LoRA fine-tuning) is relatively standard.
- Experimental Thoroughness: ⭐⭐⭐. Multi-model comparisons plus a user study, but only 50 evaluation samples per task is somewhat insufficient.
- Writing Quality: ⭐⭐⭐⭐. The framework and dataset descriptions are clear, and the illustrations are abundant.
- Value: ⭐⭐⭐⭐. The dataset contribution is the most valuable, serving as an important resource for time series + LLM research.