Power Ensemble Aggregation for Improved Extreme Event AI Prediction¶
Conference: NeurIPS 2025 arXiv: 2511.11170 Code: Unavailable Area: Time Series Keywords: extreme event prediction, ensemble aggregation, power mean, heat wave classification, climate prediction
TL;DR¶
This paper proposes an adaptive ensemble aggregation method based on the power mean. By applying nonlinear aggregation (power exponent \(p>1\)) to the scores of ensemble members from generative weather prediction models, the method significantly improves classification performance for extreme high-temperature events, with greater gains at higher quantile thresholds.
Background & Motivation¶
Deep learning weather forecasting models (e.g., GraphCast, PanguWeather) have surpassed physics-based models on routine weather prediction, yet extreme event prediction remains a critical weakness. The root causes are as follows:
Rarity: Extreme events are, by definition, low-probability occurrences, occupying a negligible fraction of training data; models consequently tend to predict outcomes near the mean.
Inherent bias of mean aggregation: When ensemble models are used, the standard practice of averaging member predictions naturally suppresses extreme values—even if a minority of members correctly predict an extreme event, their signals are diluted by the conservative majority.
Lack of adaptive mechanisms: Extreme events of varying intensity require different degrees of "bias toward extremes," yet traditional aggregation methods (mean or maximum) lack this flexibility.
Core motivation: Between the mean (overly conservative) and the maximum (overly aggressive), does an optimal intermediate strategy exist? The power mean provides a continuously tunable compromise.
Method¶
Overall Architecture¶
The system consists of three components: (1) a U-Net-based deterministic weather prediction model employing a cubed-sphere grid to avoid polar singularities; (2) Perlin noise injection that converts the model into a generative one, producing \(n=50\) ensemble members; and (3) power mean aggregation that transforms ensemble member scores into extreme event classification probabilities.
Key Designs¶
- Extreme Event Definition and Classification Framework
Extreme events are defined using local climatology: for each location and season, the local climatological mean and standard deviation of surface air temperature are computed, and temperature is converted into a local anomaly \(x\). A threshold \(q\) is defined such that a temperature anomaly \(x\) satisfying \(\Phi(x) \geq q\) is classified as an extreme event, where \(\Phi\) is the standard normal CDF.
This is a critical design choice—using a local rather than a global threshold avoids the limitation of detecting only perennially hot regions such as the Sahara Desert.
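The local-anomaly scoring described above can be sketched in a few lines. This is a minimal illustration, not the paper's code; the function name and the example climatology values are hypothetical, and `scipy.stats.norm.cdf` plays the role of \(\Phi\):

```python
import numpy as np
from scipy.stats import norm

def extreme_event_label(temperature, clim_mean, clim_std, q=0.98):
    """Score a temperature against the local climatology.

    Returns the local standardized anomaly x, the score Phi(x),
    and the binary extreme-event label (Phi(x) >= q).
    """
    x = (temperature - clim_mean) / clim_std  # local anomaly
    score = norm.cdf(x)                       # Phi(x), in (0, 1)
    return x, score, score >= q

# Example: a +2.5 sigma anomaly exceeds the q = 0.98 threshold.
x, s, is_extreme = extreme_event_label(32.5, clim_mean=30.0, clim_std=1.0)
```

Because `clim_mean` and `clim_std` are computed per location and season, the same absolute temperature can be extreme in one region and routine in another.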
- Power Mean Aggregation
Given local anomaly predictions \(\{\hat{x}_i\}_{i=1}^n\) from \(n\) ensemble members, the score for each member is first computed as \(\hat{s}_i = \Phi(\hat{x}_i)\), and power mean aggregation is then applied:
$$\hat{s} = \left(\frac{1}{n}\sum_{i=1}^n \hat{s}_i^p\right)^{1/p}$$
When \(p=1\), this reduces to the arithmetic mean; as \(p \to \infty\), it approaches the maximum. The parameter \(p \geq 1\) controls the degree to which the aggregation is skewed toward extreme-predicting members.
Note that the power mean is applied to the scores \(\hat{s}_i\) (which are positive) rather than to the anomalies \(\hat{x}_i\) (which may be negative), as the power operation requires positive inputs.
- Generative Model Construction
Perlin noise is injected into the inputs of the deterministic U-Net baseline to introduce ensemble diversity. Perlin noise offers a spatial coherence advantage over white noise—perturbations to weather fields should be spatially smooth rather than pixel-wise independent.
Specific improvements include: generating Perlin noise on a 3D cube \([0,1]^3\) and taking the 2D slice corresponding to the Earth's surface to ensure global continuity; randomizing gradient magnitudes via a log-normal distribution to better capture extreme values (standard Perlin noise is constrained to \([-1,1]\)); and combining noise at different frequencies into fractal noise via amplitude modulators learned through convolutional layers.
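To illustrate the octave-combination idea, here is a heavily simplified stand-in: smooth random fields at increasing frequencies are summed with geometrically decaying amplitudes. This is *not* the paper's method — the paper uses true Perlin gradient noise sampled on a 3D cube with amplitudes learned by convolutional layers — and the function name and parameters are hypothetical:

```python
import numpy as np
from scipy.ndimage import zoom

def fractal_noise(shape=(48, 48), octaves=4, persistence=0.5, rng=None):
    """Sum spatially smooth random fields across octaves.

    A simplified fractal-noise sketch: each octave draws a coarse
    random grid and bilinearly upsamples it, so perturbations are
    spatially coherent rather than pixel-wise independent.
    """
    rng = np.random.default_rng(rng)
    field = np.zeros(shape)
    for o in range(octaves):
        coarse = rng.standard_normal((2 ** (o + 1) + 1,) * 2)
        factors = [s / c for s, c in zip(shape, coarse.shape)]
        smooth = zoom(coarse, factors, order=1)  # bilinear upsampling
        field += (persistence ** o) * smooth[: shape[0], : shape[1]]
    return field

noise = fractal_noise(rng=0)  # one 48x48 perturbation field
```

The spatial smoothness is the key property: adding such a field to a temperature input perturbs whole regions coherently, mimicking plausible forecast uncertainty.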
Loss & Training¶
The Continuous Ranked Probability Score (CRPS) is used as the training loss. Training data consists of ERA5 reanalysis from 1990–2010, resampled to 1.5° spatial resolution and daily temporal resolution. A cubed-sphere grid (\(6\times48\times48\)) is used to avoid polar singularities. The model is trained on a single 16 GB GPU within a few hours.
Key Experimental Results¶
Main Results — AUC Comparison Across Quantiles¶
| Quantile \(q\) | Lead Time | Mean Aggregation AUC | Power Mean AUC | \(p_{opt}\) | Relative Improvement (RI) |
|---|---|---|---|---|---|
| 0.80 | 7 days | Baseline | Marginally better | ~2 | ~0.5% |
| 0.90 | 7 days | Baseline | Better | ~5 | ~1.0% |
| 0.98 | 7 days | Baseline | Significantly better | ~20 | ~2.5% |
| 0.80 | 12 days | Baseline | Better | ~2 | ~1% |
| 0.98 | 12 days | Baseline | Substantially better | ~20 | ~4% |
Comparison with GraphCast (Test Set: 2018)¶
| Method | \(q=0.80\) | \(q=0.90\) | \(q=0.98\) |
|---|---|---|---|
| Persistence model | Low | Low | Low |
| GraphCast (deterministic) | Good | Good | Limited |
| Ensemble mean | Good | Good | Good |
| Ensemble power mean | Best | Best | Best |
At high quantiles (\(q=0.98\)) and long lead times, power mean aggregation applied to a simple generative model even surpasses deterministic GraphCast.
Key Findings¶
- The optimal power exponent \(p_{opt}\) scales exponentially with quantile \(q\): \(\log(p_{opt})\) is a nearly perfectly linear function of \(q\), providing a simple rule for extrapolating \(p_{opt}\) across quantiles.
- Improvement increases with event extremity: The relative improvement RI grows as the quantile threshold increases, demonstrating that the method is most effective for the most extreme events.
- Improvement increases with lead time: Greater uncertainty at longer forecast horizons amplifies the benefit of biasing toward extreme ensemble members via the power mean.
- The \(p_{opt}\) optimized on the validation set transfers directly to different forecast lead times, indicating good robustness.
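The log-linear scaling of \(p_{opt}\) can be exploited with a one-line fit. The sketch below uses the approximate 7-day values from the results table above (\(p_{opt} \approx 2, 5, 20\) at \(q = 0.80, 0.90, 0.98\)); these are rough readings, so the fitted line and any extrapolation are purely illustrative:

```python
import numpy as np

# Approximate (q, p_opt) pairs from the paper's 7-day results.
q = np.array([0.80, 0.90, 0.98])
p_opt = np.array([2.0, 5.0, 20.0])

# Fit log(p_opt) as a linear function of q.
slope, intercept = np.polyfit(q, np.log(p_opt), deg=1)

def extrapolate_p(q_new):
    """Extrapolate p_opt to a new quantile via the fitted log-linear law."""
    return float(np.exp(slope * q_new + intercept))

p_099 = extrapolate_p(0.99)  # suggested exponent for a rarer threshold
```

Since the slope is positive, the fit encodes the paper's core finding: rarer events call for a larger exponent, i.e. aggregation closer to the maximum.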
Highlights & Insights¶
- Elegant simplicity: Introducing a single hyperparameter \(p\) achieves a continuously tunable aggregation strategy spanning from the mean to the maximum.
- Exponential scaling law of \(p_{opt}\): Provides an off-the-shelf guide for selecting \(p\) at different extremity thresholds.
- Model-agnostic applicability: Power mean aggregation can be applied to any generative forecasting model without modifying the model architecture.
- Simplicity can outperform complexity: a simple generative model combined with power mean aggregation surpasses the sophisticated GraphCast on extreme events.
Limitations & Future Work¶
- Simplified extreme event definition: Only a single variable (surface air temperature) and a static climatology are used.
- Climate change not considered: Anomalies defined relative to a fixed climatology may be unreliable under nonstationary climate conditions.
- Limitations of the AUC metric: Application-oriented evaluation incorporating socioeconomic loss is absent.
- Only tested on a simple self-built model: Whether the approach is effective for stronger baselines (e.g., GenCast) remains unverified.
- Multivariate extreme events (e.g., compound extreme weather) are not addressed.
Related Work & Insights¶
- GraphCast: DeepMind's deterministic weather forecasting model, used as the comparison baseline in this paper.
- WeatherBench2: A general-purpose weather forecasting benchmark and data source.
- Perlin noise: A coherent noise method from computer graphics, creatively repurposed here for meteorological ensemble perturbation.
Rating¶
- Novelty: ⭐⭐⭐☆☆ — The power mean has precedents; the contribution lies in its systematic validation in the climate domain.
- Experimental Thoroughness: ⭐⭐⭐⭐☆ — Multi-quantile and multi-lead-time comparisons are thorough, though stronger baseline models are not evaluated.
- Writing Quality: ⭐⭐⭐⭐☆ — Concise and focused.
- Value: ⭐⭐⭐⭐☆ — The method is simple and general, with practical relevance to extreme event prediction.