Learning to Reason Over Time: Timeline Self-Reflection for Temporal Reasoning

Conference: ACL 2025
arXiv: 2504.05258
Code: https://github.com/amazon-science/TISER
Area: Others
Keywords: Temporal reasoning, self-reflection, timeline construction, test-time scaling, synthetic data

TL;DR

TISER is a framework for test-time scaling of LLM temporal reasoning, built on a four-stage pipeline of "reasoning → timeline construction → self-reflection → answer generation." Combined with fine-tuning on synthetic reasoning-trajectory data, it enables 7B open-source models to outperform GPT-4 on multiple temporal reasoning benchmarks and to achieve SOTA results on tasks such as TGQA.

Background & Motivation

Background: LLMs exhibit excellent performance across many tasks, but temporal reasoning (understanding event order, duration, and temporal interval relationships) remains a weak spot. Benchmarks like TRAM and TimeBench indicate that even state-of-the-art models frequently fail on complex temporal queries.

Limitations of Prior Work: Existing approaches rely on prompt engineering (CoT), specialized pre-training (Temp-T5), or mathematical reasoning modules; however, they all lack explicit temporal structure representations. Models do not explicitly organize and cross-reference temporal information during the reasoning process.

Key Challenge: Temporal reasoning requires models to concurrently excel at two tasks: (a) extracting and sequencing temporal events from text, and (b) making logical inferences based on temporal order. Pure CoT reasoning lacks structured temporal representations, making it error-prone under complex temporal dependencies.

Goal: Enable LLMs to explicitly construct timelines during inference and to use self-reflection to detect and correct inconsistencies in their temporal reasoning.

Key Insight: Inspired by test-time scaling, TISER extends the reasoning trajectory to capture complex temporal dependencies. Rather than simply making the trajectory longer, it structures it into three phases that precede the final answer: reasoning → timeline → reflection.

Core Idea: Let the LLM explicitly construct an event timeline during reasoning to serve as "scaffolding", then have it reflect on and correct its reasoning against that timeline, which improves temporal reasoning accuracy.

Method

Overall Architecture

Four-stage reasoning pipeline (iterative):
1. Stage I - Reasoning: Generate an initial CoT reasoning trajectory \(r\) based on the question and temporal context.
2. Stage II - Timeline Construction: Extract temporal events from the reasoning trajectory and context, organizing them into an ordered timeline \(t\).
3. Stage III - Reflection: Compare the reasoning trajectory \(r\) with the timeline \(t\), detect inconsistencies/omissions/errors, and generate an improved reasoning trajectory \(r'\).
4. Stage IV - Answer Generation: Generate the final answer based on the refined reasoning and the timeline.
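The four stages can be sketched as a chain of prompts in which each stage consumes the outputs of the earlier ones. This is a minimal illustration, not the authors' implementation: the `generate` callable, the `Trace` container, and the prompt wording are all assumptions, and the summary does not say whether the stages run as separate calls or as one structured generation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    reasoning: str   # r, Stage I
    timeline: str    # t, Stage II
    reflection: str  # r', Stage III
    answer: str      # Stage IV


def tiser_pipeline(question: str, context: str,
                   generate: Callable[[str], str]) -> Trace:
    """Run the four stages in order; `generate` is a hypothetical LLM call."""
    base = f"Context: {context}\nQuestion: {question}\n"
    # Stage I: initial chain-of-thought reasoning r
    r = generate(base + "Reason step by step about the temporal facts.")
    # Stage II: ordered timeline t built from the reasoning and the context
    t = generate(base + f"Reasoning: {r}\nList the events in chronological order.")
    # Stage III: compare r against t and produce refined reasoning r'
    r_prime = generate(
        base + f"Reasoning: {r}\nTimeline: {t}\n"
        "Check the reasoning against the timeline and correct any errors."
    )
    # Stage IV: final answer from the refined reasoning and the timeline
    a = generate(
        base + f"Refined reasoning: {r_prime}\nTimeline: {t}\n"
        "Give the final answer only."
    )
    return Trace(r, t, r_prime, a)
```

Any function that maps a prompt string to a completion string can be passed as `generate`, which keeps the sketch independent of a particular model API.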

Key Designs

Explicit Timeline Construction (Stage II):
- Function: Extract all relevant temporal events from the reasoning trajectory and original text, arranging them in chronological order.
- Mechanism: Aggregate scattered temporal information in the text into an ordered structure, mimicking how humans draw timelines to solve complex temporal problems.
- Design Motivation: The timeline serves as an "external memory", allowing the model to intuitively cross-reference the chronological order of events rather than relying on implicit parametric memory.
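As a rough illustration of the timeline as an ordered structure, the sketch below sorts extracted events chronologically and renders them as text. The `Event` fields, the year granularity, the rendering format, and the example events are assumptions made for illustration; in TISER the model itself writes the timeline in natural language.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    description: str
    start: int  # start year (assumed granularity)
    end: int    # end year


def build_timeline(events: List[Event]) -> List[Event]:
    """Order events by start year, breaking ties by end year."""
    return sorted(events, key=lambda e: (e.start, e.end))


def render_timeline(timeline: List[Event]) -> str:
    """Render one event per line so later stages can cross-reference it."""
    return "\n".join(
        f"{e.start}-{e.end}: {e.description}" for e in timeline
    )


# Hypothetical events, scattered in the order they appear in a passage.
events = [
    Event("Alice worked at Company B", 2005, 2010),
    Event("Alice studied at University A", 1998, 2002),
    Event("Alice lived in City C", 2002, 2005),
]
timeline = build_timeline(events)
print(render_timeline(timeline))
```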
Iterative Self-Reflection (Stage III):
- Function: Compare initial reasoning against the timeline, detect inconsistencies (e.g., incorrect event sorting, missing key time points), and generate corrected reasoning.
- Mechanism: Form a feedback loop of reasoning → timeline → cross-referencing → correction, which can iterate repeatedly until consistency is reached.
- Design Motivation: This is the core of test-time scaling: improving accuracy by extending the reasoning process.
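The feedback loop can be illustrated with a toy consistency check. In this sketch, which is an assumption-laden simplification, the reasoning is reduced to a list of "a started before b" claims, the timeline to a mapping of start years, and the correction step to flipping contradicted claims; in TISER the model regenerates its reasoning in natural language instead. The `max_rounds` cap is also an assumption, since the summary does not state a stopping criterion.

```python
from typing import Dict, List, Tuple

Claim = Tuple[str, str]  # (a, b) is read as "a started before b"


def find_inconsistencies(claims: List[Claim],
                         start_year: Dict[str, int]) -> List[Claim]:
    """Return the claims that the timeline contradicts."""
    return [(a, b) for a, b in claims if not start_year[a] < start_year[b]]


def reflect_until_consistent(claims: List[Claim],
                             start_year: Dict[str, int],
                             max_rounds: int = 3) -> Tuple[List[Claim], int]:
    """Repeat check-and-correct until no contradiction remains.

    Returns the final claims and the number of correction rounds used.
    """
    claims = list(claims)
    for round_no in range(max_rounds):
        bad = find_inconsistencies(claims, start_year)
        if not bad:
            return claims, round_no
        # Toy correction: flip each contradicted claim.
        claims = [(b, a) if (a, b) in bad else (a, b) for a, b in claims]
    return claims, max_rounds


# Hypothetical timeline and an initial reasoning with one ordering error.
start_year = {"job_a": 2001, "job_b": 2005, "job_c": 2010}
initial = [("job_b", "job_a"), ("job_a", "job_c")]
fixed, rounds = reflect_until_consistent(initial, start_year)
```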
Synthetic Reasoning Trajectory Dataset:
- Function: Starting from existing temporal reasoning datasets, use GPT-4 or DeepSeek to generate intermediate reasoning trajectories in the TISER format.
- Mechanism: For each sample \((q, a, c)\), generate a complete trajectory containing reasoning \(r\), timeline \(t\), and reflection \(f\). Retain only samples where the final answer \(a'\) matches the gold answer \(a\).
- Design Motivation: Ensure the correctness of the synthetic reasoning process by retaining only trajectories leading to the correct answer.
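The answer-based filter can be sketched as follows. The dictionary keys and the whitespace/case normalization are assumptions; the summary only says that a trajectory is kept when its final answer \(a'\) matches the gold answer \(a\), without specifying how the match is computed.

```python
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def keep_correct(trajectories: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only trajectories whose predicted answer matches the gold answer."""
    return [
        t for t in trajectories
        if normalize(t["predicted"]) == normalize(t["gold"])
    ]


# Hypothetical generated trajectories for two training samples.
candidates = [
    {"question": "q1", "trace": "...", "predicted": " Company B ", "gold": "company b"},
    {"question": "q2", "trace": "...", "predicted": "2003", "gold": "2005"},
]
kept = keep_correct(candidates)
```

Note that a filter of this kind checks only the final answer, so a trajectory with flawed intermediate reasoning that happens to reach the right answer would still pass.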
Structured Output Templates:
- Use XML tags to separate outputs of different stages: <reasoning>, <timeline>, <reflection>, <answer>.
- LoRA fine-tuning enables the model to learn to output in this format.
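A tagged output of this kind can be split back into its stages with a small parser. The sketch below uses the four tag names listed above; the example output string is invented, and the choice to return `None` for a missing section is an assumption.

```python
import re
from typing import Dict, Optional

TAGS = ("reasoning", "timeline", "reflection", "answer")


def parse_tiser_output(text: str) -> Dict[str, Optional[str]]:
    """Extract the content of each stage tag; None if a tag is missing."""
    sections: Dict[str, Optional[str]] = {}
    for tag in TAGS:
        # Non-greedy match across newlines between the open and close tag.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    return sections


# Invented model output in the tagged format.
output = (
    "<reasoning>Alice left University A in 2002.</reasoning>\n"
    "<timeline>1998-2002: University A\n2005-2010: Company B</timeline>\n"
    "<reflection>The order is consistent with the timeline.</reflection>\n"
    "<answer>Company B</answer>"
)
parsed = parse_tiser_output(output)
```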

Loss & Training

Base Models: Mistral-7B, Qwen2.5-7B
Fine-tuning Method: LoRA SFT
Training Data: Synthetic reasoning trajectory versions of TGQA + TempReason + TimeQA
Data Generator: GPT-4 or DeepSeek V2.5

Key Experimental Results

Main Results (Exact Match / F1)

Model	Inference Method	TGQA	TempReason L2	TempReason L3	TimeQA Easy	TimeQA Hard	Average
GPT-4	Standard	72.5/82.5	78.6/86.2	81.9/88.3	83.6/93.7	76.0/85.3	78.5/87.2
GPT-4	TISER	82.8/93.4	79.8/87.2	84.7/91.3	84.4/90.5	77.2/86.4	81.8/89.8
Qwen2.5-7B	Standard	46.1/48.9	51.0/53.6	40.1/42.7	70.9/73.5	53.2/55.8	52.3/55.0
Mistral-7B + TISER-FT (GPT-4)	TISER	80.5/87.4	82.5/84.3	87.1/88.5	97.5/98.5	95.9/96.4	88.7/91.0
Qwen2.5-7B + TISER-FT (GPT-4)	TISER	84.5/94.2	85.5/87.5	-	-	-	-
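For readers unfamiliar with the two metrics in each cell, the sketch below shows a common way to compute exact match and token-level F1 for short answers. The normalization (lowercasing and whitespace tokenization) is an assumption; the paper's evaluation script may normalize differently, for example by stripping punctuation.

```python
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase and split on whitespace (assumed normalization)."""
    return text.lower().split()


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(tokens(pred) == tokens(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall."""
    p, g = tokens(pred), tokens(gold)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit when the prediction contains the gold answer plus extra tokens, which is why the F1 values in the table are at least as high as the EM values.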

Ablation Study

Configuration	Average EM	Description
Full TISER	85.6	Reasoning + Timeline + Reflection
w/o Reflection	Decrease	Without iterative correction
w/o Timeline	Decrease	Without explicit temporal structure
Standard CoT	55.7	Baseline standard fine-tuning

Key Findings

7B Model Outperforms GPT-4: TISER-fine-tuned Mistral-7B achieves 88.7 average EM across the five benchmarks, compared with 78.5 for GPT-4 under standard inference (+10.2) and 81.8 for GPT-4 prompted with TISER.
Timeline is Core: Explicitly constructing timelines provides a huge boost compared to pure CoT because it externalizes implicit temporal information into verifiable structures.
Self-Reflection is Effective but Relies on Timeline Support: Self-reflection without a timeline as an "anchor" shows limited efficacy; reflection requires a structured reference.
Improvement Even in Standard Inference: TISER-fine-tuned models perform better than standard fine-tuned models even when not using the TISER inference pipeline (i.e., using standard inference).
Robust OOD Generalization: Performance is maintained or even improved on unseen benchmarks such as MultiHopRAG and Test-of-Time.
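The average EM figures quoted in the first finding can be recomputed from the per-benchmark EM values in the main results table. The snippet below is only an arithmetic check on those reported numbers; the row labels are shortened for readability.

```python
# Per-benchmark EM from the main results table, in column order:
# TGQA, TempReason L2, TempReason L3, TimeQA Easy, TimeQA Hard.
em = {
    "GPT-4 (standard)": [72.5, 78.6, 81.9, 83.6, 76.0],
    "Mistral-7B + TISER-FT": [80.5, 82.5, 87.1, 97.5, 95.9],
}

# Unweighted mean over the five benchmarks, rounded to one decimal.
average = {name: round(sum(v) / len(v), 1) for name, v in em.items()}
gap = round(average["Mistral-7B + TISER-FT"] - average["GPT-4 (standard)"], 1)

print(average, gap)
```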

Highlights & Insights

The "Drawing a Timeline" Concept is natural and intuitive: it mirrors how humans approach complex temporal problems. Translating this cognitive strategy into an LLM reasoning pipeline is an elegant design choice.
Synthetic Data Quality Control is well-designed. Retaining only reasoning trajectories that lead to correct answers ensures the accuracy of the training signal.
Small Models Greatly Outperforming Large Models carries substantial practical significance. A 7B model fine-tuned with TISER beats GPT-4 by over 10 EM points on average, suggesting that on these benchmarks a structured reasoning strategy combined with task-specific fine-tuning can matter more than model scale.

Limitations & Future Work

The method is currently evaluated only on temporal reasoning tasks. The underlying concept of TISER (explicitly building domain structures \(\to\) self-reflection) could potentially be transferred to other structured reasoning tasks, such as spatial reasoning.
The approach depends on the quality of the training data generated by a stronger model (GPT-4 or DeepSeek V2.5).
The configuration of iterative reflection loops and stopping criteria requires further investigation.

vs TG-LLM (Xiong et al. 2024): TG-LLM also focuses on temporal reasoning but relies on CoT. TISER adds explicit timelines and self-reflection, resulting in a substantial performance gain.
vs s1 (Muennighoff et al. 2025): s1 utilizes budget forcing for general test-time scaling, whereas TISER is a specialized test-time scaling method tailored for temporal reasoning.
vs Self-Refine (Madaan et al. 2023): Self-Refine performs general self-reflection, whereas TISER introduces timelines as a structured reference for reflection.

Rating

Novelty: ⭐⭐⭐⭐ The combined design of timeline construction and self-reflection is both natural and effective.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Benchmarked on multiple datasets (5+), multiple models, OOD evaluations, complete ablation studies, and impressive results outperforming GPT-4 using only a 7B model.
Writing Quality: ⭐⭐⭐⭐ Clear flow with standard algorithm pseudocode.
Value: ⭐⭐⭐⭐⭐ Provides a transferable paradigm for enhanced structured reasoning, with open-sourced code and data.