Causal Graph based Event Reasoning using Semantic Relation Experts

Conference: ACL 2025
arXiv: 2506.06910
Code: github
Area: Causal Reasoning
Keywords: Causal Graph Generation, Event Reasoning, Multi-agent Collaboration, Semantic Relation Experts, Explainable Prediction

TL;DR

The paper proposes a framework that generates causal event graphs through multi-round collaborative discussion among four types of semantic relation experts (Temporal, Discourse, Conditional, Commonsense). In a zero-shot setting, it is competitive with fine-tuned models on multiple downstream tasks, such as event prediction and event forecasting, while also providing explainable causal event chains.

Background & Motivation

Task Definition: Given a set of events, construct a global causal event graph (nodes = events, directed edges = causal relationships) and utilize this graph to assist in reasoning tasks such as event prediction and forecasting.
Limitations of Prior Work: Existing event reasoning methods mainly rely on distributional co-occurrence patterns of events, lacking explicit modeling of deep causal logic. Even SOTA LLMs still exhibit low accuracy in identifying causal relationships under standard in-context learning (ICL) settings.
Key Challenge: Deciding whether one event causes another requires considering how both events are embedded in the global context. For example, an earthquake by itself is common, but "officials request aid" only occurs when both "city faces resource shortage" and "city suffers damage" hold simultaneously. An LLM that evaluates event pairs in isolation easily overlooks such joint, multi-event causes.
Core Idea: Four types of semantic relation experts are designed to generate a global causal graph through multi-round debate-like collaboration. This causal graph then drives downstream reasoning to achieve explainable event prediction.
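As a minimal sketch of the task definition, a causal event graph can be stored as a mapping from each event to its direct causes. The `CausalGraph` class and its method names are illustrative assumptions, not the paper's code, and treating all direct causes as jointly required is a simplification made only to mirror the earthquake example above.

```python
from collections import defaultdict


class CausalGraph:
    """Directed graph: an edge cause -> effect means 'cause contributes to effect'."""

    def __init__(self):
        self.causes = defaultdict(set)  # effect -> set of direct causes
        self.events = set()

    def add_edge(self, cause, effect):
        self.events.update((cause, effect))
        self.causes[effect].add(cause)

    def direct_causes(self, event):
        return set(self.causes.get(event, set()))

    def is_supported(self, event, observed):
        """True if every direct cause of `event` is observed (assumed conjunctive)."""
        needed = self.direct_causes(event)
        return bool(needed) and needed <= set(observed)


graph = CausalGraph()
graph.add_edge("earthquake hits city", "city suffers damage")
graph.add_edge("city suffers damage", "officials request aid")
graph.add_edge("city faces resource shortage", "officials request aid")

# Damage alone is not enough; both causes must hold jointly.
damage_only = graph.is_supported("officials request aid", {"city suffers damage"})
both_causes = graph.is_supported(
    "officials request aid",
    {"city suffers damage", "city faces resource shortage"},
)
print(damage_only, both_causes)
```

A pairwise classifier would see only one of the two incoming edges at a time, which is the failure mode the Key Challenge item describes.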

Method

Overall Architecture

An LLM simulates four "experts," each focusing on a different semantic dimension. The pipeline has three stages (independent analysis, multi-round discussion, and judge consolidation) and produces a global causal event graph. The graph is then applied to three downstream tasks: Explainable Event Likelihood (EEL) prediction, Event Forecasting (ForecastQA), and Next Event Prediction (Narrative Cloze).

Key Designs

1. Four Types of Semantic Relation Experts

Each expert is assigned a different perspective for causal determination:

Expert	Dimension of Focus	Core Idea
Temporal Expert	Event temporal relationships	Temporal order is a necessary condition for causality. It narrows the search space by filtering temporally plausible event pairs.
Discourse Expert	Shared entity relationships	Event pairs sharing entities are more likely to have causal chains, since operations on entities can trigger subsequent events.
Conditional (Precondition) Expert	Counterfactual preconditions	Uses counterfactual reasoning (would event B still occur if event A were removed?) to identify necessary preconditions.
Commonsense Expert	Implicit background knowledge	Captures intermediate knowledge not explicitly mentioned in the text, bridging seemingly unrelated event pairs.

2. Multi-Round Collaborative Discussion Mechanism

A "separation of concerns" strategy is adopted to avoid forcing the LLM to process all dimensions at once:

Initialization: The four experts independently generate initial causal judgments and their reasoning paths.
Multi-round Discussion (up to 3 rounds): In each round, each expert receives the responses of all other experts, analyzes them, revises its own causal links, and gives the rationale for each change. Experts can accept, refute, or supplement other experts' views.
Judge Consolidation: A Causality Judge LLM summarizes all discussion results, resolves remaining disagreements, and outputs the final causal graph.
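The three stages above can be sketched as a small control loop. This is only an illustration under stated assumptions: the experts are plain Python callables rather than prompted LLMs, `make_stub_expert` is a hypothetical toy policy, stopping early when no view changes is an assumption, and `majority_judge` is a majority-vote stand-in for the Causality Judge LLM.

```python
from collections import Counter

MAX_ROUNDS = 3  # the discussion is capped at three rounds


def run_discussion(experts, events, max_rounds=MAX_ROUNDS):
    """experts: name -> callable(events, own_edges, peer_edges) -> set of (cause, effect)."""
    # Initialization: each expert judges independently, with no peer input.
    views = {name: expert(events, set(), {}) for name, expert in experts.items()}
    for _ in range(max_rounds):
        updated = {}
        for name, expert in experts.items():
            peers = {other: edges for other, edges in views.items() if other != name}
            updated[name] = expert(events, views[name], peers)
        if updated == views:  # assumption: stop once no expert revises its view
            break
        views = updated
    return views


def majority_judge(views):
    """Stand-in for the judge LLM: keep edges backed by more than half of the experts."""
    votes = Counter(edge for edges in views.values() for edge in edges)
    return {edge for edge, n in votes.items() if n > len(views) / 2}


def make_stub_expert(initial_edges):
    """Toy expert: starts from fixed edges, then adopts any edge proposed by two or more peers."""
    def expert(events, own_edges, peer_edges):
        if not peer_edges:  # initialization call
            return set(initial_edges)
        votes = Counter(e for edges in peer_edges.values() for e in edges)
        return set(own_edges) | {e for e, n in votes.items() if n >= 2}
    return expert


events = ["earthquake", "damage", "aid request"]
edge_a = ("earthquake", "damage")
edge_b = ("damage", "aid request")
experts = {
    "temporal": make_stub_expert({edge_a, edge_b}),
    "discourse": make_stub_expert({edge_a}),
    "conditional": make_stub_expert({edge_b}),
    "commonsense": make_stub_expert({edge_a}),
}
final_views = run_discussion(experts, events)
final_graph = majority_judge(final_views)
print(sorted(final_graph))
```

In a real system each callable would wrap an LLM prompt for one expert role, and the judge would be another LLM call that also reads the experts' rationales.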

3. Downstream Reasoning based on Causal Graph (CGEL)

The causal graph is utilized for explainable event likelihood prediction: given a set of observed events and a query event, the system determines whether the query event can be inserted into the causal graph. If it can, the prediction is considered "likely," and a causal event chain is output as an explanation. This method is zero-shot and does not require fine-tuning on downstream tasks.
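One way to picture the insertion check is the sketch below. It assumes the query's candidate direct causes are already available (in practice an LLM would propose them) and returns one causal chain as the explanation. The function name `explain_query`, its parameters, and the backward walk are illustrative assumptions rather than the paper's procedure.

```python
def explain_query(edges, observed, query, candidate_causes):
    """Return a causal chain ending at `query`, or None if it cannot be attached.

    edges: set of (cause, effect) pairs from the generated causal graph
    observed: set of observed events (the graph's nodes)
    candidate_causes: events proposed as direct causes of the query
                      (supplied by hand here; hypothetical LLM output in practice)
    """
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, []).append(cause)

    anchors = [c for c in candidate_causes if c in observed]
    if not anchors:
        return None  # the query cannot be inserted, so it is not predicted as likely

    # Walk back from the first anchor towards a root to build the explanation.
    chain = [query, anchors[0]]
    seen = {anchors[0]}
    while parents.get(chain[-1]):
        nxt = sorted(parents[chain[-1]])[0]  # deterministic pick for the sketch
        if nxt in seen:  # guard against cycles
            break
        chain.append(nxt)
        seen.add(nxt)
    return list(reversed(chain))


edges = {
    ("earthquake hits city", "city suffers damage"),
    ("city suffers damage", "officials request aid"),
}
observed = {event for edge in edges for event in edge}

likely_chain = explain_query(
    edges, observed, "aid convoys arrive", ["officials request aid"]
)
unlikely_chain = explain_query(edges, observed, "stock market rallies", [])
print(likely_chain)
print(unlikely_chain)
```

The returned chain plays the role of the explanation: a query that attaches to the graph is judged "likely" and comes with the path that supports it.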

Loss & Training

This method is a pure inference-time framework and requires no training or fine-tuning. GPT-4o and Llama-70B-instruct are used as the base LLMs, implementing the expert role assignment and discussion protocol through carefully designed prompts.

Key Experimental Results

Main Results: Causal Graph Generation Quality (CRAB Dataset, Graph-level Metrics)

Method	LLM	BAcc	F1:Causal	F1:Non-Causal	Macro F1
Direct (Zero-shot direct generation)	GPT-4o	70.86	66.17	76.80	71.48
Pairwise (Pairwise determination)	GPT-4o	73.93	62.99	82.37	72.68
Experts wo collab	GPT-4o	74.92	70.21	78.23	74.22
Collab with experts	GPT-4o	79.27	75.62	82.80	79.21
Direct	Llama-70B	63.08	53.42	69.35	61.39
Collab with experts	Llama-70B	73.69	73.31	71.67	72.49
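For readers unfamiliar with the column headers, the sketch below computes balanced accuracy and macro F1 for binary causal/non-causal labels using their standard definitions; whether the paper's evaluation script matches this exactly is an assumption. It also checks that each Macro F1 value in the table equals the mean of the two per-class F1 columns, up to rounding. The toy labels are made up for illustration.

```python
def class_f1(tp, fp, fn):
    """F1 for one class from its true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def graph_metrics(gold, pred):
    """gold, pred: equal-length lists of 0/1 labels per event pair (1 = causal)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    f1_causal = class_f1(tp, fp, fn)
    f1_non_causal = class_f1(tn, fn, fp)  # roles swap for the negative class
    recall_pos = tp / (tp + fn) if tp + fn else 0.0
    recall_neg = tn / (tn + fp) if tn + fp else 0.0
    return {
        "bacc": (recall_pos + recall_neg) / 2,
        "f1_causal": f1_causal,
        "f1_non_causal": f1_non_causal,
        "macro_f1": (f1_causal + f1_non_causal) / 2,
    }


# Toy labels (illustrative only, not from CRAB).
toy = graph_metrics(gold=[1, 1, 1, 0, 0, 0, 0, 0], pred=[1, 1, 0, 0, 0, 0, 1, 0])
print({name: round(value, 4) for name, value in toy.items()})

# (F1:Causal, F1:Non-Causal, Macro F1) rows copied from the table above.
TABLE_ROWS = [
    (66.17, 76.80, 71.48),
    (62.99, 82.37, 72.68),
    (70.21, 78.23, 74.22),
    (75.62, 82.80, 79.21),
    (53.42, 69.35, 61.39),
    (73.31, 71.67, 72.49),
]
table_consistent = all(abs((c + n) / 2 - m) <= 0.01 for c, n, m in TABLE_ROWS)
print(table_consistent)
```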

Downstream Task Results

Task	System	Accuracy
Event Forecasting (ForecastQA)	GPT-4 baseline	51.3%
	One-shot baseline	50.0%
	CGEL (Ours)	62.7%
	BERT-large + MDS (Fine-tuned)	67.4%
Next Event Prediction (NC)	ELM	46.0%
	EGELM	50.0%
	CGEL with context	61.0%

In the EEL task, CGEL beats the one-shot baseline by 41.6% on causality, 48.4% on informativeness, and 37.0% on coherence.

Ablation Study

Setting	BAcc	Macro F1	Macro F1 Drop vs. Full Method
Collab with experts (Full)	79.27	79.21	—
Collab wo experts (No expert roles)	75.39	75.51	-3.70
w/o Temporal Expert	77.51	77.72	-1.49
w/o Precondition Expert	77.48	77.26	-1.95
w/o Discourse Expert	78.32	78.29	-0.92
w/o Commonsense Expert	78.88	78.85	-0.36

Removing any expert leads to a performance drop, with the precondition expert and temporal expert having the greatest impact.
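The drop column can be reproduced from the Macro F1 values in the table; a short check, with the numbers copied from the rows above and variable names chosen only for this sketch:

```python
FULL_MACRO_F1 = 79.21

# Macro F1 per ablation setting, copied from the table above.
ablation_macro_f1 = {
    "Collab wo experts": 75.51,
    "w/o Temporal Expert": 77.72,
    "w/o Precondition Expert": 77.26,
    "w/o Discourse Expert": 78.29,
    "w/o Commonsense Expert": 78.85,
}

drops = {
    setting: round(FULL_MACRO_F1 - score, 2)
    for setting, score in ablation_macro_f1.items()
}

# Rank only the single-expert removals to see which expert matters most.
single_expert = {k: v for k, v in drops.items() if k.startswith("w/o")}
most_important = max(single_expert, key=single_expert.get)

for setting, drop in drops.items():
    print(f"{setting}: -{drop:.2f}")
print(most_important)
```

The differences match the table's last column, which confirms the drops are measured in Macro F1 points rather than BAcc.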

Debate Trajectory Analysis

Expert	Initial Overlap with Gold	Post-Discussion Overlap with Gold	Contribution	Error Flip Rate
Temporal Expert	13%	33%	64%	0%
Discourse Expert	17%	24%	64%	0%
Precondition Expert	17%	22%	46%	67%
Commonsense Expert	22%	26%	57%	0%

The temporal expert is initially the weakest but shows the greatest improvement after discussion; the precondition expert has the highest error flip rate.

Highlights & Insights

Heterogeneous Expert Collaboration Outperforms Homogeneous Debate: Unlike approaches like ChatEval that let multiple identical-role LLMs debate, this work gives each agent a different semantic relation expertise, achieving a true "separation of concerns." Removing the expert roles (Collab wo experts) lowers BAcc by 3.88 points (79.27 to 75.39) and Macro F1 by 3.70 points.
Competitive Zero-Shot Performance against Fine-tuned Models: CGEL achieves 62.7% on ForecastQA, within 4.7 points of the 67.4% reached by a fine-tuned BERT-large, without requiring any task-specific training data. It also outputs causal event chains as explanations, which is typically unachievable with fine-tuned models.
Analyzable and Debuggable Debate Process: The flipping, adding, and conflict patterns of each expert in the discussion are tracked in detail, forming a transparent decision path that facilitates future improvements.

Limitations & Future Work

Reliance on the base LLM's causal understanding capability, which may deviate from human causal perception.
High computational cost of multi-round multi-expert discussion (requiring multiple LLM calls per scenario).
No graded judgment of causal strength; the method performs only binary causal/non-causal classification.
Using GPT-4 as an evaluator may introduce bias toward its own generated content.
Potential to extend to more types of semantic relation experts and validation across more domains/languages.

Rating

Novelty: ★★★★☆ — The framework of heterogeneous semantic experts collaborating to generate a global causal graph is novel, differing fundamentally from existing multi-agent debate methods.
Value: ★★★★☆ — Zero-shot and explainable, applicable to various scenarios like event prediction, forecasting, and explanation.
Experimental Thoroughness: ★★★★★ — Comprehensive evaluation dimensions including intrinsic evaluation, three extrinsic tasks, ablation, and debate trajectory analysis.
Writing Quality: ★★★★☆ — Clear motivation and systematic framework description, though some symbol explanations are slightly redundant.