Learning to Reason Over Time: Timeline Self-Reflection for Temporal Reasoning
Conference: ACL 2025
arXiv: 2504.05258
Code: https://github.com/amazon-science/TISER
Area: Others
Keywords: Temporal reasoning, self-reflection, timeline construction, test-time scaling, synthetic data
TL;DR
TISER brings test-time scaling to LLM temporal reasoning through a four-stage pipeline of "reasoning → timeline construction → self-reflection → answer generation." Combined with fine-tuning on synthetic reasoning-trajectory data, it enables 7B open-source models to outperform GPT-4 on multiple temporal reasoning benchmarks and to reach SOTA results on tasks such as TGQA.
Background & Motivation
Background: LLMs exhibit excellent performance across many tasks, but temporal reasoning (understanding event order, duration, and temporal interval relationships) remains a weak spot. Benchmarks like TRAM and TimeBench indicate that even state-of-the-art models frequently fail on complex temporal queries.
Limitations of Prior Work: Existing approaches rely on prompt engineering (CoT), specialized pre-training (Temp-T5), or mathematical reasoning modules; however, they all lack explicit temporal structure representations. Models do not explicitly organize and cross-reference temporal information during the reasoning process.
Key Challenge: Temporal reasoning requires models to concurrently excel at two tasks: (a) extracting and sequencing temporal events from text, and (b) making logical inferences based on temporal order. Pure CoT reasoning lacks structured temporal representations, making it error-prone under complex temporal dependencies.
Goal: Enable LLMs to explicitly construct timelines during inference and to use self-reflection to detect and correct inconsistencies in their temporal reasoning.
Key Insight: Inspired by test-time scaling, the reasoning trajectory is extended to capture complex temporal dependencies. Rather than simply being made longer, it is systematically structured into three phases: reasoning → timeline → reflection.
Core Idea: Have the LLM explicitly construct an event timeline during reasoning to serve as "scaffolding", then self-reflect and correct its reasoning against that timeline, which significantly improves temporal reasoning accuracy.
Method
Overall Architecture
Four-stage reasoning pipeline (iterative):
1. Stage I - Reasoning: Generate an initial CoT reasoning trajectory \(r\) based on the question and temporal context.
2. Stage II - Timeline Construction: Extract temporal events from the reasoning trajectory and context, organizing them into an ordered timeline \(t\).
3. Stage III - Reflection: Compare the reasoning trajectory \(r\) with the timeline \(t\), detect inconsistencies/omissions/errors, and generate an improved reasoning trajectory \(r'\).
4. Stage IV - Answer Generation: Generate the final answer based on the refined reasoning and the timeline.
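The four stages can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: `generate` is a hypothetical caller-supplied function that sends a prompt to an LLM and returns text, the prompt wording is invented, and the stopping rule (stop when reflection no longer changes the reasoning, or after `max_rounds`) is an assumption, since the summary does not specify one. Whether the stages are separate model calls or sections of a single generation is also not stated here; separate calls are used only for clarity.

```python
from typing import Callable, Dict


def tiser_infer(question: str, context: str,
                generate: Callable[[str], str],
                max_rounds: int = 2) -> Dict[str, str]:
    """Run reasoning -> timeline -> reflection -> answer with a caller-supplied LLM."""
    base = f"Context: {context}\nQuestion: {question}\n"

    # Stage I: initial chain-of-thought trajectory r.
    reasoning = generate(base + "Reason step by step.")
    timeline = ""

    for _ in range(max_rounds):
        # Stage II: ordered timeline t built from the context and current reasoning.
        timeline = generate(
            base + f"Reasoning: {reasoning}\nList the events in chronological order."
        )
        # Stage III: refined trajectory r' after checking r against t.
        refined = generate(
            base + f"Reasoning: {reasoning}\nTimeline: {timeline}\n"
            "Check the reasoning against the timeline and rewrite it if needed."
        )
        if refined == reasoning:  # hypothetical stopping rule: nothing changed
            break
        reasoning = refined

    # Stage IV: final answer from the refined reasoning and the timeline.
    answer = generate(
        base + f"Reasoning: {reasoning}\nTimeline: {timeline}\nGive the final answer."
    )
    return {"reasoning": reasoning, "timeline": timeline, "answer": answer}
```

Passing the model as a plain callable keeps the sketch independent of any particular inference library and makes it easy to exercise with a stub.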
Key Designs
- Explicit Timeline Construction (Stage II):
  - Function: Extract all relevant temporal events from the reasoning trajectory and original text, arranging them in chronological order.
  - Mechanism: Aggregate scattered temporal information in the text into an ordered structure, mimicking how humans draw timelines to solve complex temporal problems.
  - Design Motivation: The timeline serves as an "external memory", allowing the model to intuitively cross-reference the chronological order of events rather than relying on implicit parametric memory.
- Iterative Self-Reflection (Stage III):
  - Function: Compare initial reasoning against the timeline, detect inconsistencies (e.g., incorrect event sorting, missing key time points), and generate corrected reasoning.
  - Mechanism: Form a feedback loop of reasoning → timeline → cross-referencing → correction, which can iterate repeatedly until consistency is reached.
  - Design Motivation: This is the core of test-time scaling: improving accuracy by extending the reasoning process.
- Synthetic Reasoning Trajectory Dataset:
  - Function: Starting from existing temporal reasoning datasets, use GPT-4 or DeepSeek to generate intermediate reasoning trajectories in the TISER format.
  - Mechanism: For each sample \((q, a, c)\), generate a complete trajectory containing reasoning \(r\), timeline \(t\), and reflection \(f\). Retain only samples where the final answer \(a'\) matches the gold answer \(a\).
  - Design Motivation: Ensure the correctness of the synthetic reasoning process by retaining only trajectories leading to the correct answer.
- Structured Output Templates:
  - Use XML tags to separate outputs of different stages: <reasoning>, <timeline>, <reflection>, <answer>.
  - LoRA fine-tuning enables the model to learn to output in this format.
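To make the tagged output format and the timeline-as-reference idea concrete, here is a minimal sketch that splits a generated trace into its four sections and checks an ordering claim against a sorted timeline. The tag names come from the description above; everything else is an assumption for illustration, including the `(name, start_year, end_year)` event representation and the helper names `parse_trace`, `order_events`, and `happened_before`.

```python
import re

TAGS = ("reasoning", "timeline", "reflection", "answer")


def parse_trace(text: str) -> dict:
    """Extract the content of each stage tag; a missing tag maps to None."""
    sections = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    return sections


def order_events(events):
    """Sort (name, start_year, end_year) tuples by start year, then end year."""
    return sorted(events, key=lambda e: (e[1], e[2]))


def happened_before(timeline, first: str, second: str) -> bool:
    """True if `first` ends no later than `second` starts, per the timeline."""
    spans = {name: (start, end) for name, start, end in timeline}
    return spans[first][1] <= spans[second][0]
```

A reflection step could use a check like `happened_before` to flag a reasoning trajectory that asserts an order the timeline contradicts; in TISER itself this comparison is performed by the model in natural language rather than by code.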
Loss & Training
- Base Models: Mistral-7B, Qwen2.5-7B
- Fine-tuning Method: LoRA SFT
- Training Data: Synthetic reasoning trajectory versions of TGQA + TempReason + TimeQA
- Data Generator: GPT-4 or DeepSeek V2.5
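The training data consists of generated trajectories that are kept only when their final answer \(a'\) matches the gold answer \(a\). A minimal sketch of that filter follows; the field names `predicted_answer` and `gold_answer` and the whitespace/case normalization are assumptions, since the exact matching rule is not given here.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed, lenient matching rule)."""
    return " ".join(answer.lower().split())


def filter_trajectories(samples):
    """Keep samples whose generated answer a' equals the gold answer a."""
    return [
        s for s in samples
        if normalize(s["predicted_answer"]) == normalize(s["gold_answer"])
    ]
```

For example, a sample with `predicted_answer="  Paris "` and `gold_answer="paris"` is kept, while one whose prediction is `"Lyon"` is dropped before fine-tuning.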
Key Experimental Results
Main Results (Exact Match / F1)
| Model | Inference Method | TGQA | TempReason L2 | TempReason L3 | TimeQA Easy | TimeQA Hard | Average |
|---|---|---|---|---|---|---|---|
| GPT-4 | Standard | 72.5/82.5 | 78.6/86.2 | 81.9/88.3 | 83.6/93.7 | 76.0/85.3 | 78.5/87.2 |
| GPT-4 | TISER | 82.8/93.4 | 79.8/87.2 | 84.7/91.3 | 84.4/90.5 | 77.2/86.4 | 81.8/89.8 |
| Qwen2.5-7B | Standard | 46.1/48.9 | 51.0/53.6 | 40.1/42.7 | 70.9/73.5 | 53.2/55.8 | 52.3/55.0 |
| Mistral-7B + TISER-FT (GPT-4) | TISER | 80.5/87.4 | 82.5/84.3 | 87.1/88.5 | 97.5/98.5 | 95.9/96.4 | 88.7/91.0 |
| Qwen2.5-7B + TISER-FT (GPT-4) | TISER | 84.5/94.2 | 85.5/87.5 | - | - | - | - |
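For readers unfamiliar with the two metrics in the table, the sketch below shows one common way to compute Exact Match and token-level F1 for short-answer QA. It follows the widely used SQuAD-style normalization (lowercasing, stripping punctuation and English articles); whether the paper uses exactly this normalization is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit when a prediction contains the gold answer plus extra tokens, which is why the F1 column is never lower than the EM column in the table.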
Ablation Study
| Configuration | Average EM | Description |
|---|---|---|
| Full TISER | 85.6 | Reasoning + Timeline + Reflection |
| w/o Reflection | Decrease | Without iterative correction |
| w/o Timeline | Decrease | Without explicit temporal structure |
| Standard CoT | 55.7 | Baseline standard fine-tuning |
Key Findings
- 7B Model Outperforms GPT-4: TISER-fine-tuned Mistral-7B reaches an average EM of 88.7 across the five benchmarks, versus 78.5 for GPT-4 with standard inference (+10.2).
- Timeline is Core: Explicitly constructing timelines provides a huge boost compared to pure CoT because it externalizes implicit temporal information into verifiable structures.
- Self-Reflection is Effective but Relies on Timeline Support: Self-reflection without a timeline as an "anchor" shows limited efficacy; reflection requires a structured reference.
- Improvement Even in Standard Inference: TISER-fine-tuned models outperform standard fine-tuned models even under standard inference, i.e., without the TISER inference pipeline.
- Robust OOD Generalization: Performance is maintained or even improved on unseen benchmarks such as MultiHopRAG and Test-of-Time.
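The headline gap can be reproduced from the per-benchmark EM scores in the main results table. The short check below recomputes both averages and their difference; the numbers are copied from the table, and the variable names are illustrative only.

```python
# Per-benchmark EM scores copied from the main results table
# (TGQA, TempReason L2, TempReason L3, TimeQA Easy, TimeQA Hard).
mistral_tiser_em = [80.5, 82.5, 87.1, 97.5, 95.9]
gpt4_standard_em = [72.5, 78.6, 81.9, 83.6, 76.0]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


mistral_avg = round(mean(mistral_tiser_em), 1)  # average EM, Mistral-7B + TISER-FT
gpt4_avg = round(mean(gpt4_standard_em), 1)     # average EM, GPT-4 standard
gap = round(mistral_avg - gpt4_avg, 1)          # difference in average EM
print(mistral_avg, gpt4_avg, gap)
```

Most of the gap comes from the two TimeQA columns, where the fine-tuned model exceeds 95 EM.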
Highlights & Insights
- The "Drawing a Timeline" Concept is natural and intuitive: it is how humans solve complex temporal problems. Translating this cognitive strategy into an LLM reasoning pipeline is highly ingenious.
- Synthetic Data Quality Control is well-designed. Retaining only reasoning trajectories that lead to correct answers ensures the accuracy of the training signal.
- Small Models Greatly Outperforming Large Models carries substantial practical significance. A 7B model fine-tuned with TISER beats GPT-4 by over 10 points of average EM, suggesting that on these benchmarks a structured reasoning strategy can matter more than model scale.
Limitations & Future Work
- The model is currently evaluated only on temporal reasoning tasks. The underlying concept of TISER (explicitly building domain structures \(\to\) self-reflection) could potentially be transferred to other structured reasoning tasks, such as spatial reasoning.
- Performance depends on the quality of the synthetic training data generated by GPT-4 or DeepSeek.
- The configuration of iterative reflection loops and stopping criteria requires further investigation.
Related Work & Insights
- vs TG-LLM (Xiong et al. 2024): TG-LLM also focuses on temporal reasoning but uses CoT. TISER adds explicit timelines and self-reflection, resulting in a substantial performance gain.
- vs s1 (Muennighoff et al. 2025): s1 utilizes budget forcing for general test-time scaling, whereas TISER is a specialized test-time scaling method tailored for temporal reasoning.
- vs Self-Refine (Madaan et al. 2023): Self-Refine performs general self-reflection, whereas TISER introduces timelines as a structured reference for reflection.
Rating
- Novelty: ⭐⭐⭐⭐ The combined design of timeline construction and self-reflection is both natural and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Benchmarked on multiple datasets (5+), multiple models, OOD evaluations, complete ablation studies, and impressive results outperforming GPT-4 using only a 7B model.
- Writing Quality: ⭐⭐⭐⭐ Clear flow with standard algorithm pseudocode.
- Value: ⭐⭐⭐⭐⭐ Provides a transferable paradigm for enhanced structured reasoning, with open-sourced code and data.