SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models
Conference: AAAI 2026
arXiv: 2511.09993
Code: GitHub
Area: Code Intelligence
Keywords: cross-calendar reasoning, temporal reasoning, six calendar systems, tool-augmented agent, evaluation benchmark
TL;DR
This paper proposes SPAN, a cross-calendar temporal reasoning benchmark (6 calendars × 10 reasoning directions × 100-year range × 37,380 instances). Baseline LLMs achieve an average accuracy of only 34.5% (none exceeding 80%), revealing two systematic failure modes—Future-Date Degradation and Calendar Asymmetry Bias. A tool-augmented Time Agent achieves 95.31%, demonstrating that cross-calendar reasoning requires external tools rather than parametric knowledge.
Background & Motivation
Background: Temporal reasoning evaluation for LLMs is limited to the Gregorian calendar, overlooking the importance of 20+ global calendar systems for multicultural applications.
Limitations of Prior Work: (a) No cross-calendar reasoning benchmark exists; (b) conversion between calendar systems involves complex astronomical, religious, and cultural rules (e.g., the Islamic calendar is lunar-based and approximately 11 days shorter than the Gregorian calendar per year); (c) LLMs' temporal knowledge is predominantly derived from Gregorian-calendar corpora.
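The ~11-day annual drift between the Islamic and Gregorian calendars mentioned above follows from basic astronomy. A back-of-the-envelope check, using the standard mean synodic month and mean Gregorian year lengths (the constants are well-known values, not from the paper):

```python
# A 12-synodic-month lunar year vs. the mean Gregorian solar year:
# the difference is the ~11-day annual drift of the Islamic calendar.
SYNODIC_MONTH = 29.530589            # mean synodic month, in days
lunar_year = 12 * SYNODIC_MONTH      # ~354.37 days (Islamic calendar year)
gregorian_year = 365.2425            # mean Gregorian year, in days
drift = gregorian_year - lunar_year  # ~10.88 days of drift per year
```

This drift is why Islamic dates cycle through all Gregorian seasons over roughly 33 years, making memorized date correspondences unreliable.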
Key Challenge: Cross-calendar conversion requires precise mathematical computation and domain knowledge, yet parametric memory in LLMs cannot cover all calendar–date combinations, especially for future dates.
Goal: Establish a systematic benchmark for cross-calendar temporal reasoning and design a tool-augmented solution.
Key Insight: Template-driven dynamic instantiation to avoid data contamination; six calendars covering the world's major cultural spheres.
Core Idea: A systematic benchmark of 6 calendars × 10 reasoning directions × 100 years, combined with a tool-augmented Time Agent achieving 95.31% accuracy.
Method
Overall Architecture
Six calendars: Gregorian, Chinese Lunar, Saka, Hebrew, Islamic, and Persian. Ten reasoning directions, covering intra-calendar reasoning and cross-calendar reasoning in both directions, instantiated as date and festival questions in both polarity and content forms. One-hundred-year range: 1960–2060.
Key Designs
- Template-Driven Dynamic Generation:
  - Function: Runtime instantiation to prevent data contamination.
  - Four stages: calendar conversion → template matching → variable instantiation (\(n_d\in[1,10]\) days / \(n_w\in[1,10]\) weeks / \(n_y\in[1,5]\) years) → code-execution verification.
  - Design Motivation: Static datasets are susceptible to contamination from LLM pretraining data.
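The instantiation stages above can be sketched as follows. This is a minimal illustration assuming a single "n days after" template; the template text, function names, and verification step are illustrative stand-ins, not the paper's implementation:

```python
import random
from datetime import date, timedelta

# Hypothetical question template; SPAN's actual templates are not reproduced here.
TEMPLATE = "What is the date {n_d} days after {src} in the {cal} calendar?"

def instantiate(src_date, cal, rng):
    """Instantiate one benchmark item: pick a variable, fill the template,
    and compute the gold answer by code execution (which also verifies it)."""
    n_d = rng.randint(1, 10)              # variable instantiation: n_d in [1, 10]
    question = TEMPLATE.format(n_d=n_d, src=src_date.isoformat(), cal=cal)
    gold = src_date + timedelta(days=n_d)  # gold answer computed, not memorized
    return {"question": question, "answer": gold.isoformat(), "n_d": n_d}

item = instantiate(date(2025, 3, 1), "Gregorian", random.Random(0))
```

Because variables are drawn at runtime, the concrete question/answer pairs need not exist verbatim in any pretraining corpus, which is the contamination-resistance argument.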
- Time Agent:
  - Function: LLM + `search_calendar` tool interface.
  - Three-step pipeline: few-shot prompting to generate executable code → code execution → GPT-4o generates the final answer from the execution results.
  - The `search_calendar` interface supports `{calendar_name, year, month, day}` and `{calendar_name, year, festival_name}` queries.
  - Design Motivation: Precise calendar conversion requires algorithmic computation rather than memorization.
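The two query forms of the tool interface can be mocked as below. The tiny lookup table and return formats are hand-made stand-ins for illustration; a real backend would implement proper conversion algorithms for each calendar:

```python
# Illustrative mock of a search_calendar-style tool: it accepts either a
# {calendar_name, year, month, day} date query or a
# {calendar_name, year, festival_name} festival query.
FESTIVALS = {
    # (calendar_name, year, festival_name) -> Gregorian date
    # (Chinese New Year 2025 fell on January 29)
    ("Chinese Lunar", 2025, "Spring Festival"): "2025-01-29",
}

def search_calendar(calendar_name, year, month=None, day=None, festival_name=None):
    if festival_name is not None:                 # festival-form query
        return FESTIVALS.get((calendar_name, year, festival_name))
    # date-form query: echo a normalized date key (stand-in for real conversion)
    return f"{calendar_name} {year:04d}-{month:02d}-{day:02d}"

result = search_calendar("Chinese Lunar", 2025, festival_name="Spring Festival")
# result == "2025-01-29"
```

The agent's generated code calls such an interface and the LLM only verbalizes the returned result, which is why exact conversion accuracy no longer depends on parametric memory.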
- Two Reasoning Types:
  - Date reasoning: reasoning given a specific date.
  - Festival reasoning: computing a date given a festival name in a particular calendar.
  - Each type includes polarity questions (yes/no) and content questions (specific date).
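A sketch of how one gold fact yields both question formats, a content question asking for the specific date and polarity (yes/no) questions checking a candidate date. The wording, helper name, and distractor date are illustrative, not taken from the paper:

```python
# Turn one gold (calendar, festival, date) fact into SPAN-style
# content and polarity question instances.
def make_questions(cal, festival, gold_date, distractor_date):
    content = (f"On which Gregorian date does {festival} ({cal}) fall?", gold_date)
    polarity = [
        (f"Does {festival} ({cal}) fall on {gold_date}?", "yes"),
        (f"Does {festival} ({cal}) fall on {distractor_date}?", "no"),
    ]
    return {"content": content, "polarity": polarity}

qs = make_questions("Chinese Lunar", "Spring Festival", "2025-01-29", "2025-02-10")
```

The polarity form has a 50% guessing baseline while the content form has essentially none, which is relevant when reading the polarity-vs-content gap in the results below.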
Key Experimental Results
Main Results (37,380 instances)
| Model | Average Accuracy | Notes |
|---|---|---|
| GPT-4o | ~45% | Strongest closed-source |
| Claude-3.7-Sonnet | ~43% | Competitive |
| DeepSeek-V3 | ~45% | Matches closed-source |
| Gemini-1.5-Pro | <30% | Worst |
| All LLMs (average) | 34.5% | None exceeds 80% |
| OpenAI-o1 (reasoning) | 59.29% | 2nd place |
| GPT-4o + RAG | 43.69% | Only +0.68% |
| Time Agent | 95.31% | Tool-augmented wins decisively |
Systematic Failure Mode Analysis
| Failure Mode | Manifestation | Magnitude |
|---|---|---|
| Future-Date Degradation | Past ~40% → Future ~25% | −15 pp |
| Calendar Asymmetry Bias | Gregorian→others vs. reverse | 3.97–17.49% gap |
| Polarity vs. Content | Polarity > Content | +18.86% on average |
| Date vs. Festival | Festival > Date | +2.87–12.60% |
Key Findings
- Cross-calendar reasoning is a systematic blind spot for LLMs—34.5% accuracy approaches random chance.
- Future-Date Degradation: Accuracy on future dates is 10–15 pp lower than on past dates—future events are absent from training data.
- Calendar Asymmetry Bias: Accuracy in the Gregorian→other direction is 3.97–17.49 pp higher than the reverse, reflecting predominantly Gregorian-centric pretraining data.
- Tool augmentation is the only effective solution: Time Agent (95.31%) vs. o1 reasoning (59.29%) vs. RAG (43.69%)—neither reasoning nor retrieval alone suffices.
- RAG provides negligible benefit (+0.68%) because retrieved content is also grounded in parametric knowledge sources.
Highlights & Insights
- Two systematic biases in LLMs' temporal knowledge are revealed—Future-Date Degradation and Calendar Asymmetry Bias are generalizable findings. Other knowledge domains (e.g., non-Western legal systems, non-English literature) may exhibit similar center-periphery biases.
- Clear stratification among tools, reasoning, and retrieval: Time Agent (95%) >> o1 (59%) >> RAG (44%) >> base LLMs (34%). Precise computation tasks require external tools rather than stronger reasoning or more retrieval.
- Contamination-resistant dynamic instantiation is a simple yet critical design choice that ensures the long-term validity of the benchmark.
Limitations & Future Work
- Only six calendars are covered; the benchmark could be extended to the Japanese, Thai Buddhist, Indian national, and other calendars.
- The Time Agent depends on the coverage of the `search_calendar` API.
- Only calendar conversion is tested; more complex temporal reasoning (e.g., "day of the week from Islamic to Lunar calendar") is not evaluated.
- Internalizing temporal reasoning capabilities into model parameters is a direction worth exploring.
Related Work & Insights
- vs. TimeQA/TempReason and similar temporal benchmarks: These cover only the Gregorian calendar; SPAN is the first to encompass multiple calendar systems, filling the gap in multicultural temporal reasoning evaluation.
- vs. tool-augmented LLMs: The Time Agent demonstrates the irreplaceability of external tools for precise computation tasks—the gap between its 95.31% and the reasoning model's 59.29% indicates that certain capabilities must be externalized.
- vs. RAG approaches: GPT-4o + RAG improves accuracy by only 0.68% (43.01→43.69%), showing that retrieval augmentation is largely ineffective for computation-oriented tasks, since retrieved content is also derived from parametric knowledge sources.
- Insight: Multicultural and multi-system evaluation is critical for globally deployed AI. LLMs' knowledge biases (Gregorian-first, past-first) reflect structural imbalances in training data.
- Future directions: The SPAN evaluation paradigm can be extended to other culturally dependent reasoning tasks (e.g., non-Western legal systems, traditional medicine).
- Practical application scenarios: International meeting scheduling, cross-national holiday computation, and public services for multicultural communities all rely on accurate cross-calendar conversion.
- Connection to other tool-augmented work: SPAN's findings are consistent with those of RoutingGen in code generation—certain tasks inherently require external tools rather than pure model reasoning.
- Generalizability of contamination prevention: The template-driven dynamic instantiation approach is transferable to other benchmark designs that need to prevent pretraining data leakage.
- Cultural coverage of six calendars: Gregorian (Western), Chinese Lunar (East Asian), Saka (Indian), Hebrew (Jewish culture), Islamic (Muslim world), Persian (Iran)—covering the world's major cultural spheres.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First cross-calendar reasoning benchmark; discovery of two systematic failure modes.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 37,380 instances × 6 calendars × 10 directions × multiple models + tool comparisons.
- Writing Quality: ⭐⭐⭐⭐ Clear, systematic presentation of the benchmark design.
- Value: ⭐⭐⭐⭐ Significant contribution to multicultural LLM evaluation and tool-augmented reasoning.