# Beta Distribution Learning for Reliable Roadway Crash Risk Assessment
- Conference: AAAI 2026
- arXiv: 2511.04886
- Code: https://www.gb-liang.com/projects/betarisk
- Area: Traffic Safety / Autonomous Driving / Uncertainty Estimation
- Keywords: Beta Distribution, Crash Risk Assessment, Satellite Imagery, Uncertainty Quantification, Calibration
## TL;DR
The paper proposes a geospatial deep learning framework based on Beta distribution learning: from multi-scale satellite imagery it predicts the full probability distribution of fatal crash risk rather than a point estimate, improving Recall by 17–23% while naturally expressing uncertainty through the distribution's shape.
## Background & Motivation
Background: Road traffic crashes cause over 1.3 million deaths annually worldwide, with economic losses reaching up to 3% of GDP. Traditional traffic safety research typically analyzes driving behavior, road infrastructure, traffic patterns, and weather in isolation, overlooking the complex spatial interactions among multiple factors.
Limitations of Prior Work: Existing DNN-based risk estimators produce point estimates without conveying model uncertainty; modern DNNs are generally miscalibrated, with predicted confidence mismatched to actual accuracy. Crash data is extremely sparse (the annual crash rate for 25m² road segments in the U.S. is approximately 0.1%), making traditional estimation methods highly unreliable.
Key Challenge: Safety-critical applications require models that are both (a) high-recall—dangerous areas must not be missed—and (b) well-calibrated—predicted confidence must faithfully reflect correctness probability. Point estimates cannot distinguish between "certain low risk" and "uncertain moderate risk."
Goal: Starting from satellite imagery, learn a crash fatality risk assessment model that is accurate, well-calibrated, and capable of outputting a complete probability distribution.
Key Insight: Risk estimation is framed as a Beta probability distribution learning problem, exploiting the natural \([0,1]\) support and flexible shape parameters of the Beta distribution to represent both risk and uncertainty.
Core Idea: By predicting the parameters \((\alpha, \beta)\) of a Beta distribution rather than a single risk value, geometric information from data augmentation is converted into structured probabilistic supervision signals, enabling uncertainty-aware assessment of crash risk.
## Method
### Overall Architecture
Three satellite image crops at different resolutions (1.19, 0.60, 0.30 m/pixel) → shared ResNet-50 backbone for feature extraction → channel-wise concatenation → two parallel prediction heads:
| Component | Output | Function | Used at Inference |
|---|---|---|---|
| Beta distribution head | \((\alpha, \beta)\), two positive scalars | Defines \(\text{Beta}(\alpha,\beta)\); mean \(R=\alpha/(\alpha+\beta)\) is the risk score | ✓ |
| Auxiliary classification head | Single logit | Binary classification (crash / no crash); assists backbone in learning discriminative features | ✗ |
### Key Designs
**Beta Probabilistic Modeling**
- Function: The model outputs the two shape parameters \((\alpha, \beta)\) of a Beta distribution rather than a single risk scalar.
- Mechanism: A sharp Beta distribution (large \(\alpha+\beta\)) indicates high confidence; a broad distribution (small \(\alpha+\beta\)) indicates high uncertainty. The same mean of 0.5 can correspond to two semantically distinct cases: "confidently moderate risk" (\(\alpha=10, \beta=10\)) vs. "highly uncertain" (\(\alpha=2, \beta=2\)).
- Design Motivation: Safety-critical applications require the model to express not only its predictions but also its degree of confidence. The Beta distribution is naturally defined on \([0,1]\), perfectly matching the value range of risk probabilities.
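The sharp-vs-broad intuition above can be checked with the closed-form Beta moments, mean \(\alpha/(\alpha+\beta)\) and variance \(\alpha\beta/((\alpha+\beta)^2(\alpha+\beta+1))\):

```python
import math

def beta_moments(a: float, b: float):
    """Mean and standard deviation of Beta(a, b)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

# Same risk score (mean 0.5), very different confidence:
m1, s1 = beta_moments(10, 10)   # "confidently moderate risk"
m2, s2 = beta_moments(2, 2)     # "highly uncertain"
print(m1, s1)                   # 0.5, std ~0.109
print(m2, s2)                   # 0.5, std ~0.224
```

A point estimate collapses both cases to 0.5; the standard deviation (roughly doubled here) is exactly the information the Beta head preserves.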
**Programmatic Target Distribution Generation**
- Function: Dynamically generates target Beta distributions as supervision signals based on the geometric properties of random crop augmentation.
- Mechanism: For positive samples, an influence score is computed as \(0.7 \times (1 - \text{normalized distance}) + 0.3 \times \text{relative crop size}\), which modulates the target distribution's mean and concentration. Crops closer to the crash center and of larger size yield higher-mean, more concentrated target distributions.
- Design Motivation: Crash risk decays continuously in space—when a crop deviates from the crash center, visual evidence weakens and the target distribution should be flatter with a lower mean. This approach elevates data augmentation from a simple regularization technique to a rich source of structured supervision signals.
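The influence score above can be sketched together with a mapping from influence to target Beta parameters. The 0.7/0.3 weighting is from the paper; the mean/concentration parameterization and all constants below are my illustrative assumptions, not the paper's exact mapping.

```python
def influence_score(norm_dist: float, rel_crop_size: float) -> float:
    """Paper's weighting: 0.7 * centrality + 0.3 * relative crop size."""
    return 0.7 * (1.0 - norm_dist) + 0.3 * rel_crop_size

def target_beta(norm_dist: float, rel_crop_size: float,
                base_conc: float = 4.0, max_conc: float = 20.0):
    """Map influence to a target Beta(alpha, beta) for a positive sample.

    Higher influence -> higher mean and larger concentration (sharper
    target). Constants here are illustrative, not from the paper.
    """
    s = influence_score(norm_dist, rel_crop_size)
    mean = 0.5 + 0.45 * s                      # mean in (0.5, 0.95]
    conc = base_conc + (max_conc - base_conc) * s
    return mean * conc, (1.0 - mean) * conc    # alpha, beta

# Crop centered on the crash and large -> sharp, high-mean target:
a1, b1 = target_beta(norm_dist=0.0, rel_crop_size=1.0)
# Crop far from the crash and small -> flatter, lower-mean target:
a2, b2 = target_beta(norm_dist=0.9, rel_crop_size=0.2)
```

This is the sense in which random cropping becomes supervision: each augmented view gets its own target distribution instead of inheriting the original binary label.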
**Multi-Scale Input Design**
- Function: Three satellite image crops at different resolutions are fed into the same backbone.
- Mechanism: High-resolution inputs capture local road details (lane markings, intersection geometry); low-resolution inputs capture macro-scale environmental context (urban density, surrounding infrastructure).
- Design Motivation: Crash risk is jointly determined by local road characteristics and broader environmental factors.
### Loss & Training
A composite loss function \(\mathcal{L} = \lambda_1 \cdot \mathcal{L}_{BCE} + \lambda_2 \cdot \mathcal{L}_{W_2^2}\), where \(\lambda_1=5, \lambda_2=1\).
- \(\mathcal{L}_{W_2^2}\) is a mean–variance surrogate for the Wasserstein-2 distance: \((\mu_p - \mu_t)^2 + (\sigma_p - \sigma_t)^2\), jointly optimizing the risk score (mean) and confidence (standard deviation).
- Compared to KL divergence, the \(W_2\) surrogate provides more stable gradients when the predicted and target distributions have limited overlap.
- The larger weight \(\lambda_1=5\) prioritizes classification capability and recall.
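The composite loss can be sketched as follows (pure Python for clarity). Applying the BCE term to the Beta mean used as a risk probability is my assumption; the paper could equally apply it to the auxiliary logit.

```python
import math

def beta_mean_std(a: float, b: float):
    """Closed-form mean and standard deviation of Beta(a, b)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

def w2_surrogate(pred, target):
    """Mean-variance surrogate for the squared Wasserstein-2 distance:
    (mu_p - mu_t)^2 + (sigma_p - sigma_t)^2."""
    mp, sp = beta_mean_std(*pred)
    mt, st = beta_mean_std(*target)
    return (mp - mt) ** 2 + (sp - st) ** 2

def bce(p: float, y: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy on a scalar risk probability."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def composite_loss(pred, target, y, lam1=5.0, lam2=1.0):
    """L = lam1 * L_BCE + lam2 * L_W2^2, with the paper's lam1=5, lam2=1."""
    risk = pred[0] / (pred[0] + pred[1])       # Beta mean as risk score
    return lam1 * bce(risk, y) + lam2 * w2_surrogate(pred, target)

loss = composite_loss(pred=(6.0, 2.0), target=(8.0, 2.0), y=1)
```

Because the surrogate is written directly in terms of mean and standard deviation, its gradients stay informative even when the predicted and target densities barely overlap, which is where KL divergence becomes unstable.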
Training details: 75 epochs, AdamW + CosineAnnealingWarmRestarts, distribution head lr = 0.02, backbone lr = 1e-4, batch size 48 (multi-scale), NVIDIA A100.
## Key Experimental Results
### Main Results
Evaluated on the MSCM dataset (four major cities in Texas; 80,276 geographic locations; 240,828 multi-scale satellite images):
| Method | F1 | Precision | Recall | AUC | ECE↓ | Brier↓ |
|---|---|---|---|---|---|---|
| ImageNet | 0.4753 | 0.4968 | 0.4555 | 0.7980 | 0.1281 | 0.1600 |
| MSCM-SS | 0.4966 | 0.4981 | 0.4950 | 0.8165 | 0.1006 | 0.1458 |
| MSCM-MS | 0.5409 | 0.6731 | 0.4521 | 0.8572 | 0.1067 | 0.1296 |
| Prob-MS (Ours) | 0.5762 | 0.6296 | 0.5311 | 0.8663 | 0.0881 | 0.1211 |
On the most critical metric, Recall, Prob-MS improves over MSCM-MS by 17.5% (relative), while also achieving the lowest calibration error (ECE).
### Ablation Study
Deep ensemble comparison—single model vs. three-model ensemble:
| Method | F1 | Recall | ECE↓ | Brier↓ | Variance↓ | Disagr. Rate↓ |
|---|---|---|---|---|---|---|
| Ensemble MSCM-MS (3 models) | 0.5966 | 0.5165 | 0.0787 | 0.1112 | 0.0925 | 16.93% |
| Ensemble Prob-MS (3 models) | 0.5976 | 0.5361 | 0.0605 | 0.1075 | 0.0822 | 15.14% |
| Single Prob-MS | 0.5762 | 0.5311 | 0.0881 | 0.1211 | — | — |
A single Prob-MS model already approaches the overall performance of the MSCM-MS ensemble, which requires three times the compute, and exceeds its Recall by nearly 3% (relative).
### Key Findings
- Baseline models produce severely polarized predictions (concentrated near 0 and 1), whereas Prob-MS utilizes the full probability spectrum to express varying degrees of confidence.
- Erroneous predictions (FP/FN) are consistently associated with higher uncertainty—indicating that the model can "know when it is uncertain."
- San Antonio River Walk case study: Prob-MS correctly identifies multiple fatal crash locations missed by MSCM-MS and produces spatially more coherent risk maps.
## Highlights & Insights
- Data Augmentation → Probabilistic Supervision: Converting geometric properties of random crops into structured Beta distribution targets is a generalizable idea transferable to other tasks requiring spatially decaying supervision.
- Trustworthy Failure Modes: Even when predictions are incorrect, high uncertainty provides valuable safety signals for downstream decision-making.
- Relies Solely on Public Satellite Imagery: No traffic sensors, road-side cameras, or other infrastructure required, enabling global scalability.
- \(W_2\) Surrogate Loss: More stable than KL divergence and directly optimizes mean and standard deviation jointly, with approximation errors on the order of \(10^{-3}\) to \(10^{-2}\).
## Limitations & Future Work
- Only static geographic risk is estimated; real-time traffic flow, weather, time-of-day, and other dynamic factors are not considered.
- Geographic coverage is limited to Texas; differences in climate, road design, and driving culture may affect generalization.
- The centrality weight 0.7 and size weight 0.3 are manually set and could be replaced by a learnable adaptive mechanism.
- The approach is fundamentally correlational rather than causal—the model learns associations between visual features and crashes, which do not imply causality.
## Related Work & Insights
- vs. MSCM-MS: Transitioning from deterministic classification to probabilistic distribution learning yields +17% Recall and substantially improved calibration.
- vs. Deep Ensemble: A single model achieves ensemble-level performance at one-third the computational cost.
- vs. Monte Carlo Simulation Methods: No complex parameter tuning or high computational overhead required; enables near-real-time inference.
- Insights: Beta distribution learning can be extended to other safety-critical uncertainty estimation tasks such as medical imaging and disaster risk assessment; programmatic label generation is a generalizable training strategy.
## Rating
- Novelty: ⭐⭐⭐⭐ — The combination of Beta distribution learning and programmatic label generation is concise and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers quantitative evaluation, qualitative analysis, case studies, and ensemble comparisons, though limited to a single region.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure and excellent visualizations.
- Value: ⭐⭐⭐⭐ — The uncertainty-aware prediction paradigm offers strong transferable insights.