SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models

Conference: AAAI 2026
arXiv: 2511.09993
Code: GitHub
Area: Code Intelligence
Keywords: cross-calendar reasoning, temporal reasoning, six calendar systems, tool-augmented agent, evaluation benchmark

TL;DR

This paper proposes SPAN, a cross-calendar temporal reasoning benchmark covering 6 calendars and 10 reasoning directions over a 100-year range, for a total of 37,380 instances. Baseline LLMs achieve an average accuracy of only 34.5% (none exceeding 80%), revealing two systematic failure modes: Future-Date Degradation and Calendar Asymmetry Bias. A tool-augmented Time Agent reaches 95.31%, demonstrating that cross-calendar reasoning requires external tools rather than parametric knowledge.

Background & Motivation

Background: Temporal reasoning evaluation for LLMs is limited to the Gregorian calendar, overlooking the importance of 20+ global calendar systems for multicultural applications.

Limitations of Prior Work: (a) No cross-calendar reasoning benchmark exists; (b) conversion between calendar systems involves complex astronomical, religious, and cultural rules (e.g., the Islamic calendar is lunar-based and approximately 11 days shorter than the Gregorian calendar per year); (c) LLMs' temporal knowledge is predominantly derived from Gregorian-calendar corpora.
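
The roughly 11-day figure follows from simple arithmetic; the constants below are standard mean month and year lengths, not values taken from the paper:

```python
# Back-of-the-envelope check of the "~11 days shorter per year" claim
# (standard mean month/year lengths, not figures from the paper).
synodic_month = 29.530588   # mean lunar (synodic) month, in days
tropical_year = 365.2422    # mean solar year, in days

islamic_year = 12 * synodic_month      # 12 lunar months ~= 354.37 days
drift = tropical_year - islamic_year   # ~= 10.9 days the Islamic year falls short
print(f"Islamic year ~= {islamic_year:.2f} days, drift ~= {drift:.1f} days/year")
```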

Key Challenge: Cross-calendar conversion requires precise mathematical computation and domain knowledge, yet parametric memory in LLMs cannot cover all calendar–date combinations, especially for future dates.

Goal: Establish a systematic benchmark for cross-calendar temporal reasoning and design a tool-augmented solution.

Key Insight: Template-driven dynamic instantiation to avoid data contamination; six calendars covering the world's major cultural spheres.

Core Idea: A systematic benchmark of 6 calendars × 10 reasoning directions × 100 years, combined with a tool-augmented Time Agent achieving 95.31% accuracy.

Method

Overall Architecture

  • Six calendars: Gregorian, Chinese Lunar, Saka, Hebrew, Islamic, and Persian.
  • Ten reasoning directions: intra-calendar and cross-calendar reasoning (in both conversion directions), combined with date/festival reasoning and polarity/content question types.
  • One-hundred-year range: 1960–2060.

Key Designs

  1. Template-Driven Dynamic Generation:

    • Function: Runtime instantiation to prevent data contamination.
    • Four stages: calendar conversion → template matching → variable instantiation (\(n_d\in[1,10]\) days / \(n_w\in[1,10]\) weeks / \(n_y\in[1,5]\) years) → code-execution verification.
    • Design Motivation: Static datasets are susceptible to contamination from LLM pretraining data.
  2. Time Agent:

    • Function: LLM + search_calendar tool interface.
    • Three-step pipeline: few-shot prompting to generate executable code → code execution → GPT-4o generates the final answer based on execution results.
    • The search_calendar interface supports the query forms {calendar_name, year, month, day} and {calendar_name, year, festival_name} (see the sketch after this list).
    • Design Motivation: Precise calendar conversion requires algorithmic computation rather than memorization.
  3. Two Reasoning Types:

    • Date reasoning: reasoning given a specific date.
    • Festival reasoning: computing a date given a festival name in a particular calendar.
    • Each type includes polarity questions (yes/no) and content questions (specific date).
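
As a rough illustration of the tool interface described above, here is a minimal sketch (not the authors' implementation). It assumes the third-party convertdate package for four of the six calendars; Chinese Lunar conversion and the festival-name query form are omitted, since they would require additional tables or libraries.

```python
# Minimal sketch of a search_calendar-style tool, NOT the paper's implementation.
# Assumes the third-party `convertdate` package; Chinese Lunar conversion and the
# {calendar_name, year, festival_name} query form would need extra data sources.
from convertdate import hebrew, indian_civil, islamic, persian

# calendar name -> function mapping a (year, month, day) in that calendar to Gregorian
TO_GREGORIAN = {
    "islamic": islamic.to_gregorian,
    "hebrew": hebrew.to_gregorian,
    "persian": persian.to_gregorian,
    "saka": indian_civil.to_gregorian,
}

def search_calendar(calendar_name: str, year: int, month: int, day: int) -> dict:
    """Answer the {calendar_name, year, month, day} query form by exact conversion."""
    g_year, g_month, g_day = TO_GREGORIAN[calendar_name](year, month, day)
    return {
        "query": {"calendar": calendar_name, "date": (year, month, day)},
        "gregorian": (g_year, g_month, g_day),
    }

# A Time-Agent-style step: the LLM emits this call, the result is executed, and the
# model then verbalises the final answer from the returned structure.
print(search_calendar("islamic", 1447, 1, 1))   # 1 Muharram 1447 AH -> Gregorian date
```

The point of the design is that the LLM only decides which call to make; the exact conversion is delegated to deterministic code rather than recalled from parameters.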

Key Experimental Results

Main Results (37,380 instances)

| Model | Average Accuracy | Notes |
| --- | --- | --- |
| GPT-4o | ~45% | Strongest closed-source model |
| Claude-3.7-Sonnet | ~43% | Competitive |
| DeepSeek-V3 | ~45% | Matches the closed-source models |
| Gemini-1.5-Pro | <30% | Weakest |
| All LLMs (average) | 34.5% | None exceeds 80% |
| OpenAI-o1 (reasoning) | 59.29% | Second place |
| GPT-4o + RAG | 43.69% | Only +0.68% over GPT-4o |
| Time Agent | 95.31% | Tool augmentation wins decisively |

Systematic Failure Mode Analysis

| Failure Mode | Manifestation | Magnitude |
| --- | --- | --- |
| Future-Date Degradation | Past ~40% → future ~25% | −15 pp |
| Calendar Asymmetry Bias | Gregorian→others vs. the reverse | 3.97–17.49% gap |
| Polarity vs. Content | Polarity > Content | +18.86% on average |
| Date vs. Festival | Festival > Date | +2.87–12.60% |

Key Findings

  • Cross-calendar reasoning is a systematic blind spot for LLMs—34.5% accuracy approaches random chance.
  • Future-Date Degradation: Accuracy on future dates is 10–15 pp lower than on past dates—future events are absent from training data.
  • Calendar Asymmetry Bias: Accuracy in the Gregorian→other direction is 3.97–17.49 pp higher than the reverse, reflecting pretraining data that is predominantly Gregorian-centric.
  • Tool augmentation is the only effective solution: Time Agent (95.31%) vs. o1 reasoning (59.29%) vs. RAG (43.69%)—neither reasoning nor retrieval alone suffices.
  • RAG provides negligible benefit (+0.68%) because retrieved content is also grounded in parametric knowledge sources.

Highlights & Insights

  • Two systematic biases in LLMs' temporal knowledge are revealed—Future-Date Degradation and Calendar Asymmetry Bias are generalizable findings. Other knowledge domains (e.g., non-Western legal systems, non-English literature) may exhibit similar center-periphery biases.
  • Clear stratification among tools, reasoning, and retrieval: Time Agent (95%) >> o1 (59%) >> RAG (44%) >> base LLMs (34%). Precise computation tasks require external tools rather than stronger reasoning or more retrieval.
  • Contamination-resistant dynamic instantiation is a simple yet critical design choice that ensures the long-term validity of the benchmark.
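
As a simplified illustration of that design choice, the sketch below instantiates a question template at evaluation time and recomputes the gold answer by execution (an intra-calendar Gregorian example; the template text and helper are hypothetical, only the variable ranges follow the paper):

```python
# Simplified sketch of template-driven dynamic instantiation, not the authors' generator.
# Questions are materialised at evaluation time and the gold answer is recomputed by code,
# so no fixed test set can leak into pretraining corpora.
import random
from datetime import date, timedelta

TEMPLATE = "What is the date {n_d} days after {base} in the Gregorian calendar?"

def instantiate(seed: int) -> tuple[str, str]:
    rng = random.Random(seed)
    n_d = rng.randint(1, 10)                      # n_d in [1, 10] days, as in the paper
    base = date(rng.randint(1960, 2060), rng.randint(1, 12), rng.randint(1, 28))
    question = TEMPLATE.format(n_d=n_d, base=base.isoformat())
    gold = (base + timedelta(days=n_d)).isoformat()   # verified by execution, not recall
    return question, gold

print(instantiate(seed=42))
```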

Limitations & Future Work

  • Only six calendars are covered; the benchmark could be extended to the Japanese, Thai Buddhist, Indian national, and other calendars.
  • The Time Agent depends on the coverage of the search_calendar API.
  • Only calendar conversion is tested; more complex temporal reasoning (e.g., determining the day of the week when converting from the Islamic to the Chinese Lunar calendar) is not evaluated.
  • Internalizing temporal reasoning capabilities into model parameters is a direction worth exploring.

Comparison & Discussion

  • vs. TimeQA/TempReason and similar temporal benchmarks: These cover only the Gregorian calendar; SPAN is the first to encompass multiple calendar systems, filling the gap in multicultural temporal reasoning evaluation.
  • vs. tool-augmented LLMs: The Time Agent demonstrates the irreplaceability of external tools for precise computation tasks; the gap between its 95.31% and the reasoning model's 59.29% indicates that certain capabilities must be externalized.
  • vs. RAG approaches: GPT-4o + RAG improves accuracy by only 0.68% (43.01→43.69%), showing that retrieval augmentation is largely ineffective for computation-oriented tasks, since retrieved content is also derived from parametric knowledge sources.
  • Insight: Multicultural and multi-system evaluation is critical for globally deployed AI. LLMs' knowledge biases (Gregorian-first, past-first) reflect structural imbalances in training data.
  • Future directions: The SPAN evaluation paradigm can be extended to other culturally dependent reasoning tasks (e.g., non-Western legal systems, traditional medicine).
  • Practical application scenarios: International meeting scheduling, cross-national holiday computation, and public services for multicultural communities all rely on accurate cross-calendar conversion.
  • Connection to other tool-augmented work: SPAN's findings are consistent with those of RoutingGen in code generation—certain tasks inherently require external tools rather than pure model reasoning.
  • Generalizability of contamination prevention: The template-driven dynamic instantiation approach is transferable to other benchmark designs that need to prevent pretraining data leakage.
  • Cultural coverage of six calendars: Gregorian (Western), Chinese Lunar (East Asian), Saka (Indian), Hebrew (Jewish culture), Islamic (Muslim world), Persian (Iran)—covering the world's major cultural spheres.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First cross-calendar reasoning benchmark; discovery of two systematic failure modes.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 37,380 instances × 6 calendars × 10 directions × multiple models + tool comparisons.
  • Writing Quality: ⭐⭐⭐⭐ Systematic benchmark design.
  • Value: ⭐⭐⭐⭐ Significant contribution to multicultural LLM evaluation and tool-augmented reasoning.