Counterfactual-Consistency Prompting for Relative Temporal Understanding in Large Language Models

Conference: ACL 2025 (Short)
arXiv: 2502.11425
Code: None
Area: Causal Reasoning / Temporal Reasoning
Keywords: Counterfactual Prompting, Temporal Consistency, Event Ordering, Large Language Models, Temporal Reasoning

TL;DR

This paper proposes Counterfactual-Consistency Prompting (CCP), which addresses the inconsistency of large language models (LLMs) in temporal reasoning by generating counterfactual questions and imposing collective constraints on the answers. Reported accuracy gains range from 5.7 to 8.8 points across MATRES, TRACIE, and MC-TACO.

Background & Motivation

Background: Large language models (LLMs) have demonstrated strong capabilities across various natural language processing tasks, but their performance in temporal reasoning remains immature. Temporal reasoning requires models to accurately understand and judge logical relationships along the temporal dimension, such as the chronological order or simultaneity of events.

Limitations of Prior Work: Existing LLMs suffer from severe inconsistency issues when processing temporal queries. Specifically, when asked whether event A occurred before event B, a model might answer "yes"; yet when asked in an alternative way whether event B occurred after event A, it may produce a contradictory answer. This confusion over mutually exclusive temporal relations (e.g., "before" and "after") makes the model's predictions unreliable.

Key Challenge: LLM temporal reasoning depends on surface language patterns rather than true temporal logical understanding, causing the model to generate inconsistent judgments when presented with different formulations of the same temporal relation. While prior work has pointed out this issue, an effective solution is still lacking.

Goal: To improve the consistency and accuracy of LLMs in temporal reasoning tasks through prompt engineering without modifying model parameters, particularly when dealing with explicit event ordering, implicit event ordering, and temporal common sense understanding.

Key Insight: The study borrows the counterfactual question from causal reasoning, "what would happen if the conditions changed?", and applies it to temporal relation judgment: counterfactual questions are constructed to verify and correct the model's temporal judgments.

Core Idea: By automatically generating counterfactual dual questions for each temporal query (e.g., swapping "before" with "after") and enforcing collective consistency constraints on all related answers, the model is compelled to produce logically self-consistent temporal judgments.

Method

Overall Architecture

The overall process of the proposed method consists of three phases: (1) Given a temporal reasoning question, its counterfactual variants are first constructed via a counterfactual generation module; (2) The LLM is prompted to answer the original question and all counterfactual questions separately; (3) A consistency constraint module performs collective verification on all answers, selecting the final answer that satisfies temporal logical consistency. The input is a natural language question about temporal relations of events, and the output is the consistency-rectified temporal relation judgment.
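
The three phases can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `ask_llm` is a hypothetical callable standing in for any LLM API, the question templates are invented for the example, answers are assumed to be yes/no, the two events are assumed not to be simultaneous, and the agreement-count rule with a fallback to the original answer on ties is only one possible way to resolve conflicts.

```python
from typing import Callable, Dict

# Hypothetical stand-in for an LLM call: takes a question, returns "yes" or "no".
AskFn = Callable[[str], str]


def build_variants(a: str, b: str) -> Dict[str, bool]:
    """Phase 1: map each question to whether its answer should MATCH
    the answer to the original question "Did A happen before B?"."""
    return {
        f"Did {a} happen before {b}?": True,   # original question
        f"Did {b} happen after {a}?": True,    # equivalent rephrasing
        f"Did {a} happen after {b}?": False,   # relation word substituted
        f"Did {b} happen before {a}?": False,  # event order swapped
    }


def answer_with_consistency(a: str, b: str, ask_llm: AskFn) -> str:
    variants = build_variants(a, b)
    # Phase 2: query the original question and every variant independently.
    said_yes = {q: ask_llm(q).strip().lower().startswith("yes") for q in variants}
    # Phase 3: turn each answer into a vote for or against "A before B".
    votes_before = sum(said_yes[q] == same for q, same in variants.items())
    if votes_before * 2 == len(variants):
        # Tie: fall back to the answer given to the original question.
        original = next(iter(variants))
        return "yes" if said_yes[original] else "no"
    return "yes" if votes_before * 2 > len(variants) else "no"
```

With a stub that answers the original question wrongly but the three variants correctly, the collective check recovers "yes"; a model that says "yes" to everything produces a tie and keeps its original answer.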

Key Designs

Counterfactual Question Generation:
- Function: Generates semantically complementary or opposing counterfactual versions for each original temporal question.
- Mechanism: For a question about the temporal relationship between event A and event B, variants are generated by swapping the order of the events, substituting the temporal relation word, or both. For example, given the original question "Did A happen before B?", substituting the relation word yields the opposing question "Did A happen after B?", whose answer should be the opposite, while swapping the events as well yields the equivalent question "Did B happen after A?", whose answer should be the same. Generation is automatic, based on the symmetry and mutual exclusivity of temporal relations, and requires no additional model reasoning.
- Design Motivation: The answer to a single question may be biased by surface language patterns. Counterfactual duals check the same temporal relationship from multiple perspectives, which exposes the model's inconsistencies.
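
A rule-based generator of this kind could look like the sketch below. The `TemporalQuestion` class, the `CONVERSE` table, and the question template are hypothetical names introduced for illustration; only "before" and "after" are covered, and the expected-answer flags assume yes/no answers about two events that do not happen at the same time.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Converse relation words (illustrative; not the paper's full relation set).
CONVERSE = {"before": "after", "after": "before"}


@dataclass(frozen=True)
class TemporalQuestion:
    event_a: str
    relation: str  # "before" or "after"
    event_b: str

    def text(self) -> str:
        return f"Did {self.event_a} happen {self.relation} {self.event_b}?"


def counterfactual_variants(q: TemporalQuestion) -> List[Tuple[TemporalQuestion, bool]]:
    """Return (variant, expect_same_answer) pairs for the original question."""
    flipped = CONVERSE[q.relation]
    return [
        # Relation word substituted: the answer should flip.
        (TemporalQuestion(q.event_a, flipped, q.event_b), False),
        # Events swapped: the answer should flip.
        (TemporalQuestion(q.event_b, q.relation, q.event_a), False),
        # Both changes applied: equivalent to the original, same answer.
        (TemporalQuestion(q.event_b, flipped, q.event_a), True),
    ]


if __name__ == "__main__":
    original = TemporalQuestion("the meeting", "before", "the lunch")
    for variant, same in counterfactual_variants(original):
        print(variant.text(), "(same answer)" if same else "(opposite answer)")
```

Keeping the expected relationship to the original answer next to each variant lets a later consistency check run without re-parsing the question text.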
Collective Consistency Constraints:
- Function: Ensures that answers to the original question and all its counterfactual variants are logically consistent with regard to temporal logic.
- Mechanism: A set of constraint rules based on temporal logic is defined. For instance, if "A before B" is true, then "B after A" must also be true, and "A after B" must be false. These constraints are applied to all question-answer pairs. If a violation is detected, a voting or optimization strategy is utilized to select the answer combination that satisfies the most constraints. Constraints can be implemented through simple logical rules with extremely low computational overhead.
- Design Motivation: LLMs are prone to producing locally optimal but globally inconsistent answers when answering each question independently. Collective constraints bind multiple related questions together, leveraging the transitivity and symmetry of temporal logic to rectify individual erroneous answers.
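
One way to realize "select the answer combination that satisfies the most constraints" is to enumerate the logically consistent orderings and keep the one that agrees with the largest number of model answers. The sketch below does this for two events; the function names and the key format are assumptions made for the example, simultaneity is not modeled, and ties are resolved in favor of the first candidate ordering.

```python
from typing import Dict, Tuple

# Keys are (first_event, relation, second_event); values are the model's yes/no answers.
Answers = Dict[Tuple[str, str, str], bool]
World = Tuple[str, str]  # (earlier_event, later_event)


def truth_in_world(key: Tuple[str, str, str], world: World) -> bool:
    """Return whether the questioned relation holds under a candidate ordering."""
    first, relation, second = key
    earlier, later = world
    if relation == "before":
        return (first, second) == (earlier, later)
    if relation == "after":
        return (first, second) == (later, earlier)
    raise ValueError(f"unsupported relation: {relation!r}")


def resolve(answers: Answers, a: str, b: str) -> Tuple[World, Answers]:
    """Pick the ordering that agrees with the most answers and return it
    together with the corrected, mutually consistent answer set."""
    worlds = [(a, b), (b, a)]

    def agreement(world: World) -> int:
        return sum(truth_in_world(k, world) == v for k, v in answers.items())

    best = max(worlds, key=agreement)  # on a tie, the first ordering wins
    return best, {k: truth_in_world(k, best) for k in answers}


if __name__ == "__main__":
    model_answers = {
        ("A", "before", "B"): True,
        ("B", "after", "A"): True,
        ("A", "after", "B"): True,   # contradicts the first two answers
        ("B", "before", "A"): False,
    }
    ordering, corrected = resolve(model_answers, "A", "B")
    print(ordering)                       # ('A', 'B')
    print(corrected[("A", "after", "B")])  # False
```

Because every candidate ordering is consistent by construction, the corrected answers can never violate the "before"/"after" rules, whatever the raw answers were.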
Adaptive Prompting Strategy:
- Function: Adjusts prompt templates according to different types of temporal tasks (explicit events, implicit events, temporal common sense).
- Mechanism: For event ordering with explicit temporal expressions, the prompts guide the model to focus on temporal markers in the text; for implicit event ordering, the prompts guide the model toward causal reasoning and world knowledge deduction; for temporal common sense tasks, the prompts contain relevant temporal common sense examples.
- Design Motivation: Different types of temporal reasoning tasks require the model to activate different reasoning capabilities; a unified prompt template struggles to cover all scenarios.
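
Task-specific prompting of this kind amounts to choosing a template by task type. The sketch below is purely illustrative: the template wording, the `PROMPT_TEMPLATES` table, and the `build_prompt` helper are hypothetical and do not reproduce the prompts used in the paper.

```python
from typing import Sequence

# Hypothetical templates, one per task type described above.
PROMPT_TEMPLATES = {
    "explicit": (
        "Focus on the temporal markers (dates, tenses, connectives) in the passage.\n"
        "Passage: {context}\nQuestion: {question}\nAnswer yes or no."
    ),
    "implicit": (
        "The passage may not state the order directly. "
        "Reason about causes, effects, and world knowledge.\n"
        "Passage: {context}\nQuestion: {question}\nAnswer yes or no."
    ),
    "commonsense": (
        "Here are examples of temporal common sense:\n{examples}\n"
        "Passage: {context}\nQuestion: {question}\nAnswer yes or no."
    ),
}


def build_prompt(task_type: str, context: str, question: str,
                 examples: Sequence[str] = ()) -> str:
    """Fill the template for the given task type."""
    if task_type not in PROMPT_TEMPLATES:
        raise ValueError(f"unknown task type: {task_type!r}")
    example_text = "\n".join(f"- {e}" for e in examples)
    # str.format ignores keyword arguments a template does not use.
    return PROMPT_TEMPLATES[task_type].format(
        context=context, question=question, examples=example_text
    )


if __name__ == "__main__":
    print(build_prompt(
        "commonsense",
        "She put the kettle on.",
        "Did the water boil within an hour?",
        examples=["Boiling a kettle takes a few minutes."],
    ))
```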

Loss & Training

This work is a purely inference-stage method and does not involve model training or fine-tuning; hence, there is no loss function. The core of the method lies in the prompt design during inference and the post-processing consistency verification.

Key Experimental Results

Main Results

The method was evaluated on three types of temporal reasoning tasks: explicit event ordering (MATRES), implicit event ordering (TRACIE), and temporal common sense understanding (MC-TACO, etc.).

Dataset	Model	Baseline Accuracy	+CCP Accuracy	Gain (percentage points)
MATRES	GPT-4	72.3%	79.1%	+6.8
MATRES	GPT-3.5	65.4%	73.2%	+7.8
TRACIE	GPT-4	68.7%	76.5%	+7.8
TRACIE	GPT-3.5	59.3%	68.1%	+8.8
MC-TACO	GPT-4	74.5%	80.2%	+5.7

Ablation Study

Configuration	MATRES	TRACIE	Description
Full CCP	79.1%	76.5%	Full method
w/o Counterfactual Gen	73.8%	70.2%	Uses original questions only, removing counterfactuals
w/o Consistency Constraints	75.4%	72.9%	Generates counterfactuals but performs no constraint verification
Simple Voting Only	76.2%	73.8%	Uses majority voting instead of logical constraints
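
To make the ablation easier to read, the accuracy lost by each variant relative to the full method can be computed directly from the table. The snippet below only restates the numbers above; the variable and function names are illustrative.

```python
# Accuracies (%) copied from the ablation table.
FULL = {"MATRES": 79.1, "TRACIE": 76.5}
ABLATIONS = {
    "w/o Counterfactual Gen": {"MATRES": 73.8, "TRACIE": 70.2},
    "w/o Consistency Constraints": {"MATRES": 75.4, "TRACIE": 72.9},
    "Simple Voting Only": {"MATRES": 76.2, "TRACIE": 73.8},
}


def drops(full, ablations):
    """Accuracy lost (in percentage points) by each variant versus the full method."""
    return {
        name: {dataset: round(full[dataset] - acc, 1) for dataset, acc in scores.items()}
        for name, scores in ablations.items()
    }


if __name__ == "__main__":
    for name, per_dataset in drops(FULL, ABLATIONS).items():
        print(f"{name}: {per_dataset}")
    # w/o Counterfactual Gen: {'MATRES': 5.3, 'TRACIE': 6.3}
    # w/o Consistency Constraints: {'MATRES': 3.7, 'TRACIE': 3.6}
    # Simple Voting Only: {'MATRES': 2.9, 'TRACIE': 2.7}
```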

Key Findings

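Both modules, counterfactual question generation and consistency constraints, contribute to the final performance. In the ablation table, removing counterfactual generation costs more (5.3 points on MATRES, 6.3 on TRACIE) than removing the consistency constraints (3.7 and 3.6 points), so counterfactual generation accounts for the larger share of the gain.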
The performance gains of the method on the weaker model (GPT-3.5) are larger than on the stronger model (GPT-4), indicating a greater compensatory effect for models with weaker reasoning capabilities.
The effect is most pronounced on the implicit event ordering task, as temporal judgments for implicit events rely more on reasoning than surface cues, making the inconsistency problem more prominent.
While simple majority voting shows some effectiveness, the logic-constraint-based approach is superior, illustrating that explicit temporal logic constraints are more effective than statistical voting.

Highlights & Insights

Clever use of counterfactual duals: The method borrows counterfactual thinking from causal reasoning to resolve consistency issues without any additional training, achieving gains of 5.7 to 8.8 accuracy points purely through prompt engineering and inference-time post-processing. The approach is lightweight and generalizable.
Revealing systematic flaws in LLM temporal reasoning: The paper not only proposes a solution but also conducts a fine-grained analysis of inconsistency patterns, finding that model sensitivity varies considerably across temporal relation words (before/after/during).
Potential for training-free transfer: Counterfactual consistency checking could be transferred to other reasoning tasks that require logical consistency, such as spatial or causal reasoning, by defining the corresponding logical constraint rules.

Limitations & Future Work

As a short paper, the experimental scale is relatively limited, verifying the approach on only a few representative datasets without covering broader temporal reasoning scenarios.
Counterfactual generation relies on pre-defined temporal relationship symmetry rules, which may fail to handle more complex multi-event chained temporal relationships.
The method increases the number of API calls during inference (generating and answering multiple counterfactual variants for each query), which could pose cost issues in large-scale applications.
No comparison was made with models fine-tuned specifically on temporal reasoning data.

Comparison with Related Methods

vs Chain-of-Thought Prompting: CoT improves accuracy through step-by-step reasoning but cannot guarantee consistent answers across different phrasings of the same question. The proposed method targets consistency directly through explicit logical constraints, so the two approaches are complementary.
vs Self-Consistency: Self-consistency methods boost accuracy by taking a majority vote over multiple sampled outputs, whereas the collective constraints in this work are based on domain logic rather than statistics. Experiments demonstrate that logical constraints outperform simple voting.
vs Temporal Fine-tuning: Fine-tuning approaches require labeled data and incur training costs, though they can learn deeper temporal patterns. The proposed method needs no training, at the price of extra inference calls, and is bounded by what prompt engineering can achieve.

Rating

Novelty: ⭐⭐⭐⭐ The approach of counterfactual consistency checking is relatively novel in temporal reasoning, although the core idea (symmetry constraints) is not entirely new in NLP.
Experimental Thoroughness: ⭐⭐⭐ The experimental scale is limited due to the short paper page limit, but it covers three typical types of temporal tasks.
Writing Quality: ⭐⭐⭐⭐ Problem motivation is clear, and the method description is concise, which fits the constraints of a short paper.
Value: ⭐⭐⭐⭐ Provides a practical, zero-training-cost approach to improving temporal reasoning consistency, with generalizable concepts.