# GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration
Conference: CVPR 2026 | arXiv: 2603.13068 | Code: https://github.com/yihaoding/geochemad | Area: Self-Supervised Learning | Keywords: Geochemical Anomaly Detection, Unsupervised Learning, Transformer, Benchmark Dataset, Mineral Exploration
## TL;DR
This paper introduces GeoChemAD, an open-source benchmark dataset, and GeoChemFormer, a two-stage framework that performs unsupervised geochemical anomaly detection via spatial context learning and elemental dependency modeling, achieving an average AUC of 0.7712 across 8 subsets.
## Background & Motivation
Geochemical anomaly detection (GAD) is critical for mineral exploration, as it identifies locations where elemental concentrations deviate from regional baselines to indicate mineralization zones. Surface geochemical distributions result from primary in-situ and secondary dispersion processes (weathering, erosion), and collected data may reflect multi-stage, multi-source mineralization processes, leading to high spatial discontinuity, uncertainty, and stochasticity. Three key issues exist in prior work:
1. Non-reproducible data: Most studies rely on proprietary datasets (primarily from the China Geological Survey), precluding fair comparison and result reproduction. Some papers even omit critical metadata.
2. Narrow scenario coverage: Studies typically focus on a single region, single sampling medium (sediment), and single target element (gold), leaving model generalizability across different spatial scales, sampling densities, and element types unknown.
3. Disconnect between anomalies and targets: Anomalies detected by unsupervised methods may be unrelated to actual mineralization or the target element, a core pain point in practical exploration.
Traditional statistical methods (PCA, factor analysis) struggle to capture complex nonlinear patterns. Deep learning approaches such as AE/VAE can model compositional relationships but neglect spatial dependencies. CNNs are constrained by fixed receptive fields, and graph models are limited in depth and representational capacity. The application of Transformers to GAD remains in its early stages, with systematic investigation of self-supervised pretraining strategies largely absent.
## Method

### Overall Architecture
The paper contributes two components: (1) the GeoChemAD benchmark dataset; and (2) the GeoChemFormer two-stage framework:
- Stage 1: Spatial Context Learning (SCL), which learns spatial geochemical representations from neighborhood samples.
- Stage 2: Elemental dependency modeling for anomaly detection, computing anomaly scores via reconstruction error.
### Key Designs
- GeoChemAD Dataset: Sourced from publicly available data of the Geological Survey of Western Australia (GSWA) Accelerated Geoscience Program, the dataset comprises 8 subsets spanning 3 sampling media (2 sediment, 3 rock-chip, and 3 soil subsets), 4 target elements (Au, Cu, W, Ni), and spatial scales from 6 km² to 8,500 km². Each subset provides a geochemical sample CSV (metadata, spatial coordinates, and 124–126 elemental concentrations) and a CSV of known mineralization sites. Raw sentinel values (e.g., −9999, −0.5) are retained to preserve data integrity and require appropriate preprocessing. All data use the GDA2020 coordinate reference system to ensure spatial consistency. Compared to existing studies, which are mostly single-region, single-element, and non-public, this is the first standardized, multi-scenario open-source GAD benchmark.
- Spatial Context Learning (SCL): For a query location \(p_i\), a KD-tree retrieves the \(K\) nearest neighbors, constructing a neighborhood token sequence \(\mathcal{S} = [\mathbf{e}, \mathbf{q}_i, \mathbf{t}_1, \ldots, \mathbf{t}_K]\), where \(\mathbf{e}\) is the target element token, \(\mathbf{q}_i\) is the query location encoding, and \(\mathbf{t}_j = [\Delta x_j, \Delta y_j, \mathbf{f}_j]\) combines the relative spatial offset with the concentration vector. A Transformer encoder then produces a spatial context representation \(\mathbf{q}_i'\). The training objective is to predict the target elemental concentration at the query location:
\(\mathcal{L}_{\text{sc}} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2\)
Core Idea: The model predicts the center-point concentration solely from neighborhood information, compelling it to learn geological spatial context rather than simple memorization.
- Elemental Dependency Modeling: In Stage 2, the spatial representation learned by SCL serves as a geological context token, which is concatenated with individual element tokens and fed into a Transformer encoder to learn inter-elemental dependencies. The anomaly score is the mean reconstruction error across all elements:
\(s_i = \frac{1}{C}\sum_{c=1}^{C}(x_{i,c} - \hat{x}_{i,c})^2\)
Samples that deviate from the learned elemental dependency patterns receive higher anomaly scores.
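The two building blocks above can be sketched in a few lines. This is a minimal illustration with synthetic data and assumed shapes, not the paper's implementation: the trained Transformer encoders are replaced by a stand-in reconstruction, and the paper's KD-tree retrieval is replaced by brute-force distances for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, K = 500, 8, 16                       # samples, elements, neighborhood size
coords = rng.uniform(0, 100, size=(N, 2))  # planar sample coordinates
conc = rng.lognormal(size=(N, C))          # elemental concentration vectors f_j

# Stage 1 (SCL input): find each point's K nearest neighbors (excluding the
# point itself) and build tokens t_j = [dx, dy, f_j] relative to the query.
d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)  # (N, N)
np.fill_diagonal(d2, np.inf)               # exclude the query point itself
neigh = np.argsort(d2, axis=1)[:, :K]      # (N, K) neighbor indices
offsets = coords[neigh] - coords[:, None, :]               # (N, K, 2)
tokens = np.concatenate([offsets, conc[neigh]], axis=-1)   # (N, K, 2 + C)

# Stage 2 (anomaly score): mean squared reconstruction error over elements,
# s_i = (1/C) * sum_c (x_ic - x_hat_ic)^2; x_hat stands in for model output.
x_hat = conc + rng.normal(scale=0.1, size=conc.shape)
scores = ((conc - x_hat) ** 2).mean(axis=1)                # (N,)
```

In the actual framework, `tokens` (plus the target-element and query tokens) feed the Stage-1 encoder, and `x_hat` comes from the Stage-2 reconstruction head.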
### Loss & Training
Training proceeds in two stages: Stage 1 pretrains SCL with MSE loss (20–60 epochs); Stage 2 performs anomaly detection via reconstruction error. The evaluation metric is AUC, averaged over 20 repeated random samplings of background samples. Data preprocessing includes CLR/ILR transformations to address compositional closure, PCA/causal discovery/LLM-assisted feature selection, and IDW/Kriging spatial interpolation.
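Two of the preprocessing steps mentioned above can be made concrete: masking the retained sentinel values and applying a centered log-ratio (CLR) transform to address compositional closure. A small sketch with illustrative column names (not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy table; real subsets carry 124-126 elemental concentration columns.
df = pd.DataFrame({
    "Au_ppb": [3.0, -9999.0, 12.0, 5.0],
    "Cu_ppm": [40.0, 55.0, -0.5, 61.0],
    "Ni_ppm": [20.0, 18.0, 25.0, -9999.0],
})

# Replace known sentinel codes (-9999, -0.5) with NaN, then drop
# incomplete rows; imputation would be an alternative.
clean = df.replace([-9999.0, -0.5], np.nan).dropna()

# CLR: log of each part relative to the geometric mean of its row,
# so each transformed row sums to zero.
X = clean.to_numpy()
clr = np.log(X) - np.log(X).mean(axis=1, keepdims=True)
```

The ILR transform used in the ablations is a further change of basis on top of the same log-ratio idea.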
## Key Experimental Results

### Main Results
| Dataset | GeoChemFormer (T2) | Vanilla Transformer (T1) | AE | VAE-GAN | Best Baseline |
|---|---|---|---|---|---|
| sed1 | 0.7228 | 0.7111 | 0.5851 | 0.6843 | T1: 0.7111 |
| rock1 | 0.7844 | 0.7031 | 0.5516 | 0.6953 | T1: 0.7031 |
| soil1 | 0.8704 | 0.7242 | 0.5934 | 0.7124 | T1: 0.7242 |
| soil3 | 0.8334 | 0.6101 | 0.5544 | 0.6160 | VAE-CG: 0.6509 |
| Average | 0.7712 | 0.7147 | 0.7046 | 0.7279 | VAE-GAN: 0.7279 |
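The AUC values above follow the protocol described under Loss & Training: AUC between known mineralization sites and randomly subsampled background points, averaged over 20 repeats. A sketch with synthetic scores and a rank-based AUC (equivalent to the Mann-Whitney U statistic; counts and distributions here are illustrative):

```python
import numpy as np

def auc(pos, neg):
    # Rank-based AUC = U / (n_pos * n_neg); assumes no ties, which holds
    # for continuous anomaly scores.
    s = np.concatenate([pos, neg])
    ranks = s.argsort().argsort() + 1
    u = ranks[: len(pos)].sum() - len(pos) * (len(pos) + 1) / 2
    return u / (len(pos) * len(neg))

rng = np.random.default_rng(42)
pos_scores = rng.normal(1.0, 0.5, size=20)    # scores at mineralization sites
bg_scores = rng.normal(0.0, 0.5, size=2000)   # scores at background samples

aucs = [auc(pos_scores, rng.choice(bg_scores, size=200, replace=False))
        for _ in range(20)]                    # 20 repeats, as in the paper
mean_auc = float(np.mean(aucs))
```

Repeating the background subsampling reduces the variance induced by the very small positive sets (7–32 sites per subset).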
### Ablation Study
| Configuration | Key Metric | Notes |
|---|---|---|
| SCL pretraining 20 epochs | rock2 AUC = 0.919 | Fast convergence on small datasets |
| SCL pretraining 40 epochs | sed1 AUC = 0.743 | Sediment data requires more training |
| K = 16 (neighborhood size) | Best on soil2 | Compact neighborhood suits soil samples |
| K = 256 (neighborhood size) | Best on sed1 (AUC = 0.720) | Sediment requires larger spatial context |
| ILR transformation | Average 0.6788 | Best preprocessing for Transformer-based models |
| LLM feature selection | Average 0.7412 | Automated selection outperforms manual |
### Key Findings
- GeoChemFormer achieves the best performance on 5 of 8 subsets and exhibits the lowest variance (0.0039), demonstrating strong stability.
- Spatial context learning is critical to performance gains, especially on sediment and soil datasets.
- Preprocessing strategies (feature selection, transformation type) have significantly different impacts across model types.
## Highlights & Insights
- Filling a field gap: Provides the first open, multi-region, multi-element, multi-sampling-medium GAD benchmark dataset.
- Target-element awareness: The target-element token design links anomaly detection to the specific mineralization element of interest.
- The two-stage design decouples spatial context and elemental dependency, yielding a natural and effective pretraining strategy.
## Limitations & Future Work
- Data originate from a single geographic region in Western Australia; generalizability to other continents or geological settings (e.g., tropical weathering environments, glacial geomorphology) remains unverified.
- The number of positive samples (mineralization sites) is limited (7–32 per subset), constraining statistical robustness of evaluation and potentially causing large AUC variance.
- The temporal dimension is not considered (sampling variation across time periods and the dynamic effects of weathering/erosion).
- On some subsets, deep generative models (AE) still outperform GeoChemFormer (e.g., rock2 AUC 0.9185 vs. T2 0.8050; rock3 AUC 0.8446 vs. T2 0.7302), suggesting Transformers are not always optimal in small-sample or high-contrast scenarios.
- The KD-tree-based K-nearest-neighbor retrieval in SCL may face scalability challenges on large-scale datasets (>100,000 samples), which the paper does not discuss.
- The choice of feature selection strategy (PCA/CD/LLM) substantially affects results, yet the paper provides no guidance for automatically selecting the optimal strategy.
## Related Work & Insights
- vs. Traditional statistical methods (Z-score, Mahalanobis): Average AUC of only 0.50–0.53; unable to capture complex nonlinear patterns in geochemical data.
- vs. AE/VAE family: AE performs excellently on certain subsets (rock2: 0.9185) but exhibits high cross-dataset variance (0.0220) and poor stability. GeoChemFormer achieves more stable cross-scenario performance through spatial context learning.
- vs. VAE-GAN: VAE-GAN achieves an average AUC of 0.7279 with low variance (0.0041), making it the most stable non-Transformer method, yet GeoChemFormer still surpasses it by 0.0433.
- vs. existing GAD deep learning studies (Yang2023, Yu2024, etc.): These works employ proprietary data and single-region evaluation, precluding fair comparison. The standardized GeoChemAD dataset enables rigorous future benchmarking.
- Insight: The SCL "predict center from neighborhood" strategy resembles masked prediction paradigms and is transferable to other geospatial anomaly detection tasks (environmental monitoring, urban heat island effects). The target-element-aware design philosophy — directing the model's attention toward "anomalies related to what" rather than simply "whether an anomaly exists" — offers valuable reference for anomaly detection in any domain.
## Rating
- Novelty: ⭐⭐⭐ — The methodological design is sound but not groundbreaking; the primary contribution lies in the dataset.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation with 12 baselines, multi-dimensional preprocessing analysis, ablation studies, and case studies.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with thorough dataset description.
- Value: ⭐⭐⭐⭐ — The open-source dataset makes an important contribution to the intersection of geoscience and AI.