WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models

Conference: ACL 2025
arXiv: 2505.20249
Code: GitHub
Area: LLM Evaluation
Keywords: benchmark, weather impact, LLM evaluation, multi-label classification, question answering

TL;DR

Proposes WXImpactBench, the first LLM evaluation benchmark for extreme weather impact understanding. It features a four-stage data construction pipeline and two evaluation tasks (multi-label classification and ranking-based question answering) to systematically evaluate the capabilities of multiple LLMs in the domain of climate adaptation.

Background & Motivation

Problem Definition: Climate change adaptation requires understanding the societal impacts of extreme weather, yet the effectiveness of LLMs in this domain remains unexplored.
Limitations of Prior Work: Existing climate-related data mostly originates from structured meteorological records, which lack narratives of day-to-day impacts. Such data may also already be included in LLMs' pre-training corpora, which biases evaluation.
Key Challenge: Climate terminology in historical newspapers is often polysemous (e.g., "blizzard" can refer to both a snowstorm and a sports team), and text noise from OCR digitization severely degrades downstream tasks.
Ours: Builds a high-quality extreme weather impact dataset from historical newspapers and designs the WXImpactBench benchmark to evaluate LLMs using multi-label classification and ranking-based QA tasks.

Method

Overall Architecture

The benchmark combines a four-stage data construction pipeline with a two-task evaluation framework:
1. Corpus Collection: Digitized newspaper texts from two historical periods were acquired from proprietary archives.
2. Post-OCR Error Correction: GPT-4o was used to correct OCR errors, achieving BLEU/ROUGE scores highly consistent with human annotations.
3. Topic-Aware Article Selection: LDA topic modeling was used to filter 53,521 articles, and manual review by three domain experts yielded 350 high-quality samples.
4. Human Label Annotation: Six categories of vulnerability-related impacts were defined (infrastructure, political, financial, ecological, agricultural, and human health), and three annotators assigned multi-label binary annotations.
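
Step 2 of the pipeline reports BLEU/ROUGE agreement between GPT-4o corrections and human corrections. As a rough illustration of what such a check measures, the sketch below computes a simplified ROUGE-1 (clipped unigram overlap) in plain Python. The function name `rouge1`, the whitespace tokenization, and the example strings are assumptions for illustration; this is not the authors' evaluation script.

```python
from collections import Counter
from typing import Dict


def rouge1(candidate: str, reference: str) -> Dict[str, float]:
    """Simplified ROUGE-1: clipped unigram overlap on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-token minimum of the two counts
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: model-corrected OCR text vs. a human-corrected reference.
human = "the blizzard blocked the railway for three days"
model = "the blizzard blocked the railway for thee days"
scores = rouge1(model, human)
print(scores)  # 7 of 8 unigrams match, so precision = recall = f1 = 0.875
```

A production check would more likely rely on an established BLEU/ROUGE implementation with proper tokenization and stemming; the point here is only that a high score means the model's correction shares most of its words with the human one.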

Key Designs

Multi-Label Classification Task: Evaluates LLMs' ability to distinguish among six categories of weather impacts, utilizing row-wise accuracy as a strict metric (requiring correct classification across all six labels simultaneously).
Ranking-Based QA Task: Generates pseudo-questions for each article and constructs a candidate article pool of 100 articles (1 positive + 99 negatives) to evaluate the retrieval and ranking capabilities of LLMs, laying the foundation for climate RAG system development.
Hybrid Context Version: Long texts are segmented into passages of approximately 250 tokens, each annotated independently, yielding 1,386 samples for measuring how context length affects performance.
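
The segmentation behind the hybrid context version can be sketched as follows. This is a minimal illustration that assumes whitespace-separated words as "tokens" and fixed-size, non-overlapping windows; the summary does not say which tokenizer or boundary rule the authors used, and `split_into_passages` is a hypothetical helper name.

```python
from typing import List


def split_into_passages(text: str, max_tokens: int = 250) -> List[str]:
    """Split text into consecutive passages of at most max_tokens whitespace tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


# Hypothetical 600-word article made of placeholder words.
article = " ".join(f"w{i}" for i in range(600))
passages = split_into_passages(article)
print([len(p.split()) for p in passages])  # [250, 250, 100]
```

Because each passage is labeled on its own, a label that applies to the full article may not apply to every passage, which is why the passages need independent annotation rather than inherited labels.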

Loss & Evaluation Metrics

Classification Task: \(\mathcal{L}(\hat{\mathcal{Y}}_t, \mathcal{Y}_t) = -\sum_{i=1}^{6} y_i \log \hat{y}_i\)
Classification Metrics: F1-score, Accuracy, Row-wise Accuracy
Ranking Task Metrics: Hit@1, nDCG@5, Recall@5, MRR
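
For the ranking task, each query has exactly one relevant article in a pool of 100, which simplifies the listed metrics considerably. The sketch below shows how they could be computed under that setup. The function names, the article IDs, and the two example rankings are hypothetical, and how an LLM produces the ranking is left out.

```python
import math
from typing import Dict, List


def single_positive_metrics(ranked_ids: List[str], positive_id: str, k: int = 5) -> Dict[str, float]:
    """Metrics for one query whose candidate pool has exactly one relevant article."""
    rank = ranked_ids.index(positive_id) + 1  # 1-based; raises ValueError if absent
    in_top_k = rank <= k
    return {
        "hit@1": 1.0 if rank == 1 else 0.0,
        f"recall@{k}": 1.0 if in_top_k else 0.0,
        # With a single positive the ideal DCG is 1, so nDCG is just the discount term.
        f"ndcg@{k}": 1.0 / math.log2(rank + 1) if in_top_k else 0.0,
        "rr": 1.0 / rank,
    }


def average_metrics(per_query: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over queries; the mean of 'rr' is the MRR."""
    return {key: sum(m[key] for m in per_query) / len(per_query) for key in per_query[0]}


# Hypothetical pool: 1 positive ("pos") + 99 negatives, matching the benchmark setup.
negatives = [f"neg{i}" for i in range(99)]
ranking_a = ["pos"] + negatives                      # positive ranked 1st
ranking_b = negatives[:2] + ["pos"] + negatives[2:]  # positive ranked 3rd
results = average_metrics([
    single_positive_metrics(ranking_a, "pos"),
    single_positive_metrics(ranking_b, "pos"),
])
print(results)
```

For these two queries the averages are Hit@1 = 0.5, Recall@5 = 1.0, nDCG@5 = 0.75, and MRR = 2/3, which shows how Hit@1 penalizes a near miss that the other three metrics still partially reward.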

Experiments

Main Results

Model	Infrastructure	Political	Financial	Ecological	Agricultural	Human Health	Average
GPT-4o	80.94	58.46	65.82	46.81	70.33	73.23	65.93
DeepSeek-V3-671B	81.87	44.44	60.91	36.00	61.74	65.20	58.03
Mistral-24B-IT	79.12	47.18	59.64	44.90	67.74	66.88	60.91
Gemma-2-9b-IT	77.42	43.33	54.60	42.16	55.60	61.82	55.82

Table: Zero-shot F1-score (%) on the hybrid context version. In the source table, ↑ marks the improvement over the long context version.
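
To make the per-category columns concrete, the sketch below computes an F1-score for each of the six labels from binary gold and predicted label vectors, then averages them. It assumes F1 is computed on the positive class of each label, which the summary does not confirm; `per_label_f1` and the toy data are hypothetical.

```python
from typing import Dict, List

LABELS = ["infrastructure", "political", "financial",
          "ecological", "agricultural", "human_health"]


def per_label_f1(gold: List[List[int]], pred: List[List[int]]) -> Dict[str, float]:
    """Positive-class F1 for each label column, returned as percentages."""
    scores = {}
    for j, label in enumerate(LABELS):
        tp = sum(1 for g, p in zip(gold, pred) if g[j] == 1 and p[j] == 1)
        fp = sum(1 for g, p in zip(gold, pred) if g[j] == 0 and p[j] == 1)
        fn = sum(1 for g, p in zip(gold, pred) if g[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        scores[label] = 100.0 * 2 * tp / denom if denom else 0.0
    return scores


# Toy data: four articles, six binary labels each (not taken from the benchmark).
gold = [[1, 0, 1, 0, 0, 1],
        [1, 1, 0, 0, 1, 0],
        [0, 0, 1, 1, 0, 0],
        [1, 0, 0, 0, 1, 1]]
pred = [[1, 0, 1, 0, 0, 1],
        [1, 0, 0, 0, 1, 1],
        [0, 0, 1, 0, 0, 0],
        [1, 1, 0, 0, 0, 1]]

scores = per_label_f1(gold, pred)
average = sum(scores.values()) / len(scores)
print(scores, round(average, 2))
```

On this toy data the infrastructure and financial labels score 100, while political and ecological score 0, so a single average can hide large differences between categories.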

Ablation Study

Setup	Impact
Long Context vs Hybrid Context	The hybrid context version improves F1 by 2.38 points on average, indicating that LLMs perform better on shorter texts.
Zero-shot vs One-shot	One-shot generally improves performance, though some models (e.g., Mixtral) exhibit instability.
Historical vs Modern Text	Performance on modern text is generally better, as historical narrative styles increase the difficulty of understanding.

Key Findings

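Among the models listed above, GPT-4o achieves the highest F1 in five of the six categories (DeepSeek-V3 leads only on infrastructure), yet all models are weak at identifying ecological and political impacts.
Model size is not the decisive factor: in the table above, DeepSeek-V3 (671B) underperforms Mistral-24B in four of the six categories (political, ecological, agricultural, and human health) and on average F1 (58.03 vs. 60.91).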
The hybrid context version generally outperforms the long context version, indicating that current LLMs still have room for improvement in long text understanding.
Row-wise accuracy is extremely low (the highest is only ~30%), demonstrating that accurately classifying all six categories of impact simultaneously is highly challenging.
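
The gap between per-label accuracy and row-wise accuracy can be illustrated with a small sketch. The helper names and the toy labels below are hypothetical and are not taken from the benchmark data.

```python
from typing import List


def label_accuracy(gold: List[List[int]], pred: List[List[int]]) -> float:
    """Fraction of individual (article, label) cells predicted correctly."""
    cells = [g == p for g_row, p_row in zip(gold, pred) for g, p in zip(g_row, p_row)]
    return sum(cells) / len(cells)


def row_wise_accuracy(gold: List[List[int]], pred: List[List[int]]) -> float:
    """Fraction of articles whose six labels are all predicted correctly."""
    return sum(g_row == p_row for g_row, p_row in zip(gold, pred)) / len(gold)


# Toy data: four articles, six binary labels each.
gold = [[1, 0, 1, 0, 0, 1],
        [1, 1, 0, 0, 1, 0],
        [0, 0, 1, 1, 0, 0],
        [1, 0, 0, 0, 1, 1]]
pred = [[1, 0, 1, 0, 0, 1],
        [1, 0, 0, 0, 1, 1],
        [0, 0, 1, 0, 0, 0],
        [1, 1, 0, 0, 0, 1]]

cell_acc = label_accuracy(gold, pred)    # 19 of 24 cells correct
row_acc = row_wise_accuracy(gold, pred)  # only the first article is fully correct
print(round(cell_acc, 3), row_acc)       # 0.792 0.25
```

A single wrong label invalidates the whole row, so errors compound. Under the simplifying assumption that each of the six labels is predicted independently with 80% accuracy, the expected row-wise accuracy is only about 0.8^6 ≈ 0.26.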

Highlights & Insights

The first LLM evaluation benchmark for extreme weather impact understanding, filling a critical gap in climate NLP.
The four-stage data construction pipeline is elegantly designed, combining OCR error correction, LDA topic modeling, and domain-expert annotation.
The evaluation tasks are designed to encompass both classification and retrieval application scenarios, establishing a foundation for the development of climate RAG systems.

Limitations & Future Work

The relatively small dataset size (350 articles) may limit the statistical significance of the evaluation.
Only English newspapers are covered, lacking cross-lingual evaluation.
The pseudo-questions for the ranking-based QA task are generated by an LLM, which might introduce bias.
The taxonomy of six impact categories may not cover all types of weather impacts.

Related Work

Climate Text Processing: Mallick et al. (2024), Xie et al. (2024) focus on extreme weather event extraction.
Climate Benchmarks: CLLMate (Li et al., 2024) focuses on weather forecasting rather than impact understanding.
OCR Correction: Neural OCR correction models by Drobac & Lindén (2020).
Disaster NLP: Disaster text classification by Purohit et al. (2013), Imran et al. (2016).

Rating

Dimension	Score
Novelty	⭐⭐⭐⭐
Technical Depth	⭐⭐⭐
Experimental Thoroughness	⭐⭐⭐⭐
Writing Quality	⭐⭐⭐⭐
Value	⭐⭐⭐⭐