CuMoLoS-MAE: A Masked Autoencoder for Remote Sensing Data Reconstruction¶
Conference: NeurIPS 2025
arXiv: 2508.14957
Code: To be confirmed
Area: Autonomous Driving
Keywords: Masked Autoencoder, Remote Sensing Data Reconstruction, Uncertainty Quantification, Curriculum Learning, Monte Carlo Ensemble
TL;DR¶
This paper proposes CuMoLoS-MAE, a Masked Autoencoder that combines a curriculum masking strategy with Monte Carlo stochastic ensemble inference, achieving high-fidelity reconstruction and pixel-wise uncertainty quantification for remote sensing atmospheric profile data.
Background & Motivation¶
- Atmospheric profile data acquired by remote sensing instruments (Doppler lidars, radars, radiometers, etc.) are frequently corrupted by low-SNR conditions, range folding, and spurious discontinuities, resulting in large numbers of missing or distorted return values.
- Traditional gap-filling methods (e.g., sliding-window mean filtering) blur critical fine-scale structures such as wind shear lines, updraft/downdraft cores, and small eddies.
- Existing deep learning methods (e.g., VAEs) can recover sharper structures but provide no reconstruction uncertainty estimates, limiting their reliable use in data assimilation and early-warning systems.
- Consequently, a reconstruction approach is needed that simultaneously recovers fine atmospheric structures and provides pixel-level confidence information.
Core Problem¶
- Fine Structure Recovery: How to preserve key atmospheric features such as updraft cores and wind shear lines during denoising/inpainting?
- Uncertainty Quantification: How to provide reliable per-pixel confidence estimates to support downstream data assimilation and early-warning decisions?
- Training Efficiency: How to enable stable learning of reconstruction capability from sparse contexts?
Method¶
Overall Architecture¶
CuMoLoS-MAE (Curriculum-Guided Monte Carlo Stochastic Ensemble Masked Autoencoder) consists of two core stages: a curriculum-masked MAE during training and a Monte Carlo ensemble during inference.
Micro-Patch MAE with Curriculum Masking¶
- Input Patching: The Doppler lidar time–height array is divided into 64×64 image patches, and each patch is further tokenized into 2×2 micro-patches to capture fine-scale structures and mesoscale dynamics.
- Encoder–Decoder Architecture: The encoder is a 12-layer ViT that processes only the unmasked visible tokens; the decoder is a lightweight 4-layer structure that reconstructs the full field from the encoded visible tokens together with learnable mask tokens.
- Curriculum Masking Strategy:
- The masking ratio is fixed at 50% for the first 5 epochs.
- It is then raised via cosine annealing, reaching 70% at epoch 30.
- The ratio remains at 70% thereafter.
- This progressive strategy forces the model to incrementally learn reconstruction from increasingly sparse contexts.
- Loss Function: MSE loss is computed only on masked pixels.
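As a sketch, the schedule above can be written as a per-epoch masking-ratio function. The function name and the exact shape of the cosine ramp between epochs 5 and 30 are assumptions; the paper specifies only the three phases (fixed 50%, cosine anneal, fixed 70%):

```python
import math

def mask_ratio(epoch: int,
               warmup_epochs: int = 5,
               ramp_end_epoch: int = 30,
               start_ratio: float = 0.50,
               end_ratio: float = 0.70) -> float:
    """Curriculum masking schedule: hold 50% for the first 5 epochs,
    cosine-anneal up to 70% by epoch 30, then hold 70%."""
    if epoch < warmup_epochs:
        return start_ratio
    if epoch >= ramp_end_epoch:
        return end_ratio
    # Cosine ramp from start_ratio to end_ratio over the transition window.
    t = (epoch - warmup_epochs) / (ramp_end_epoch - warmup_epochs)
    return start_ratio + (end_ratio - start_ratio) * 0.5 * (1.0 - math.cos(math.pi * t))
```

At each epoch this ratio determines how many micro-patch tokens are masked, and the MSE loss is then evaluated only on those masked positions.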
Monte Carlo Ensemble Inference¶
- At inference time, \(N=50\) independent random masks are sampled for each input image.
- The full mask → encode → decode pipeline is executed separately for each mask.
- The 50 reconstruction results are aggregated as follows:
- Mean \(\bar{X} = \frac{1}{N}\sum_{i=1}^{N}\hat{X}^{(i)}\) serves as the high-fidelity denoised reconstruction.
- Standard deviation \(\sigma_X = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{X}^{(i)} - \bar{X})^2}\) serves as the pixel-wise uncertainty map.
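The aggregation step is straightforward to sketch in numpy. Here `toy_reconstruct` is a hypothetical stand-in for the paper's mask → encode → decode pipeline (each call perturbs the input to mimic reconstruction under a different random mask); `mc_ensemble` implements the mean and the 1/N-convention standard deviation from the formulas above:

```python
import numpy as np

def mc_ensemble(reconstruct, x, n_samples=50, seed=0):
    """Run N stochastic reconstructions and aggregate them into a mean
    field (denoised output) and a per-pixel uncertainty map (std)."""
    rng = np.random.default_rng(seed)
    recons = np.stack([reconstruct(x, rng) for _ in range(n_samples)])
    return recons.mean(axis=0), recons.std(axis=0)  # np.std uses 1/N by default

# Toy stand-in for the mask -> encode -> decode pipeline.
def toy_reconstruct(x, rng):
    return x + rng.normal(scale=0.1, size=x.shape)

x = np.zeros((4, 4))
x_bar, sigma = mc_ensemble(toy_reconstruct, x, n_samples=50)
```

Note that `np.std` with default arguments matches the paper's \(\frac{1}{N}\) normalization (population standard deviation), not the \(\frac{1}{N-1}\) sample variant.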
Training Details¶
- Data Preprocessing: SNR filtering is applied (intensity ≥ 0.005), and valid velocities are clipped to \([-5, 5]\) m/s.
- Training Data: ARM SGP site data from June 1–9, 2011; the test set is June 15, 2011 (unseen data).
- Optimizer: AdamW (learning rate \(1.5 \times 10^{-4} \cdot \frac{32}{256}\), weight decay 0.05).
- Training Configuration: 500 epochs, batch size 32, single NVIDIA A100 GPU.
- Learning Rate Schedule: Cosine schedule with warmup, aligned with the masking curriculum.
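Two of the settings above reduce to small computations: the optimizer entry is the standard linear learning-rate scaling rule, and the preprocessing step is a filter-and-clip. A minimal sketch follows; the function names and the NaN convention for filtered-out returns are assumptions, not stated in the paper:

```python
import numpy as np

def scaled_lr(base_lr: float = 1.5e-4, batch_size: int = 32,
              ref_batch: int = 256) -> float:
    # Linear learning-rate scaling rule from the optimizer setting:
    # effective lr = 1.5e-4 * 32 / 256 = 1.875e-5.
    return base_lr * batch_size / ref_batch

def preprocess(velocity: np.ndarray, intensity: np.ndarray,
               snr_min: float = 0.005, v_max: float = 5.0) -> np.ndarray:
    """SNR filtering (keep returns with intensity >= 0.005) and clipping
    of valid velocities to [-5, 5] m/s. Filtered-out returns are set to
    NaN here (an assumed convention for illustration)."""
    v = np.clip(velocity, -v_max, v_max)
    return np.where(intensity >= snr_min, v, np.nan)
```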
Key Experimental Results¶
Main Results (1,028 Test Images)¶
| Method | PSNR (dB) ↑ | SSIM ↑ | MSE ↓ | FID ↓ | Spectral Fidelity ↑ |
|---|---|---|---|---|---|
| 8×8 Mean Filter | 23.41 | 0.4950 | 0.5186 | 5.13 | 91.67% |
| CVAE | 26.70 | 0.4190 | 0.4036 | 3.28 | 80.21% |
| DnCNN (Noise2Void) | 23.09 | 0.6466 | 0.6232 | 0.12 | 36.46% |
| U-Net (Noise2Void) | 27.70 | 0.7016 | 0.2581 | 0.44 | 49.48% |
| CuMoLoS-MAE | 29.45 | 0.7857 | 0.1854 | 1.87 | 93.75% |
- CuMoLoS-MAE achieves the best performance on PSNR, SSIM, MSE, and spectral fidelity; its PSNR is 1.75 dB higher than the second-best U-Net.
- Low-frequency spectral fidelity reaches 93.75%, far surpassing the Noise2Void variants (36%–49%) and thereby preserving storm-scale energy.
Uncertainty Quantification Quality¶
- Pearson correlation between the Monte Carlo standard deviation \(\sigma_X\) and the absolute reconstruction error: r = 0.961 ± 0.037
- Global Spearman rank correlation: ρ = 0.926
- Per-pixel MAE sorted by \(\sigma\) quantile increases monotonically from 0.028 to 0.999 (a 35.1× range).
- The top 1%/5%/10%/20% of pixels by \(\sigma\) capture 10.1%/30.6%/43.4%/59.4% of the total error, respectively.
- These results demonstrate that the uncertainty estimates are highly reliable and can be effectively used for error triage.
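The error-triage claim above can be checked with a small helper that ranks pixels by \(\sigma\) and measures how much of the total absolute error the top fraction contains. `error_captured` is a hypothetical name, and the synthetic data below only illustrates the mechanism; the numbers differ from the paper's:

```python
import numpy as np

def error_captured(sigma, abs_err, top_frac):
    """Fraction of total absolute error contained in the top `top_frac`
    of pixels ranked by predicted uncertainty sigma (descending)."""
    order = np.argsort(sigma.ravel())[::-1]  # most uncertain first
    k = max(1, int(round(top_frac * order.size)))
    return abs_err.ravel()[order[:k]].sum() / abs_err.sum()

rng = np.random.default_rng(0)
sigma = rng.random(10_000)
abs_err = sigma + 0.1 * rng.random(10_000)  # error correlated with sigma
```

When \(\sigma\) and error are well correlated (r = 0.961 in the paper), the top-\(\sigma\) pixels capture a disproportionate share of the error, which is exactly what makes \(\sigma\)-based triage useful downstream.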
Ablation Study: Temporal Window Size¶
| Window Size | PSNR ↑ | SSIM ↑ | MSE ↓ | Spectral Fidelity ↑ |
|---|---|---|---|---|
| 64×64 | 29.45 | 0.7857 | 0.1854 | 93.75% |
| 128×64 | 30.11 | 0.7697 | 0.2253 | 87.50% |
| 256×64 | 28.55 | 0.6103 | 0.3205 | 38.02% |
- The 64×64 window already provides sufficient context: 128×64 gains slightly on PSNR but degrades SSIM, MSE, and spectral fidelity, suggesting that denoising is primarily a local process.
- Larger windows introduce more tokens and masked regions without increasing model capacity, leading to performance degradation.
Ablation Study: Curriculum Masking¶
- Curriculum masking allows the reconstruction loss to fall below 0.20 approximately 26 epochs earlier (epoch 198 vs. 224), improving training efficiency by roughly 10%.
- Final metrics are approximately equivalent; the primary benefit of curriculum masking lies in accelerating convergence.
Highlights & Insights¶
- Elegant Monte Carlo Ensemble Design: By sampling multiple random masks independently at inference time, the method approximates the posterior predictive distribution without any modification to model architecture or training procedure, yielding high-quality uncertainty estimates.
- Curriculum Masking Strategy: Progressively increasing the masking ratio enables the model to smoothly learn reconstruction from extremely sparse contexts, accelerating convergence.
- Highly Reliable Uncertainty Estimates: The correlation between \(\sigma\) and true error reaches 0.961, which is of high practical value for remote sensing data assimilation and extreme weather early warning.
- Micro-Patch Design: The 2×2 micro-patch tokenization captures fine atmospheric structures more effectively than standard 16×16 patching, making it more suitable for physical field data of this type.
- Spectral Fidelity Metric: The paper introduces a PSD-based low-frequency fidelity evaluation metric that better reflects physical field reconstruction quality than conventional pixel-level metrics.
Limitations & Future Work¶
- Very Small Data Scale: Training uses only nine days of data from a single site (ARM SGP) with one-day testing, raising concerns about generalizability.
- High Inference Cost: Each image requires 50 forward passes, imposing substantial computational overhead for real-time deployment.
- Single Variable Validation: Reconstruction is evaluated only on vertical velocity fields; other atmospheric variables such as temperature and humidity are not tested.
- Questionable Area Classification: This paper belongs to remote sensing/meteorological data reconstruction rather than autonomous driving.
- Missing Stronger Baselines: No comparison is made with recent diffusion model or Flow Matching denoising approaches.
- Lack of Cross-Sensor Validation: Noise characteristics vary significantly across lidar models, necessitating additional cross-sensor experiments.
Related Work & Insights¶
| Compared Method | Advantages | Disadvantages |
|---|---|---|
| Sliding-Window Mean Filter | Simple and fast | Blurs fine structures; PSNR only 23.41 |
| CVAE | Capable of generating sharp structures | No uncertainty estimation; lowest SSIM (0.419) |
| Noise2Void (DnCNN/U-Net) | No paired data required; best FID | Very poor spectral fidelity (36%–49%); severely loses low-frequency information |
| CuMoLoS-MAE | Best reconstruction quality + reliable uncertainty + high spectral fidelity | High inference cost (50 samples) |
- The Monte Carlo ensemble concept transfers to other MAE applications (e.g., medical imaging, remote sensing inpainting) for obtaining uncertainty estimates without modifying the model.
- The curriculum masking strategy offers a useful reference for other MAE variants that require high masking ratios during training.
- The PSD-based spectral fidelity metric is worth adopting more broadly in physical field reconstruction tasks.
- The uncertainty-weighted strategy from data assimilation can be applied to point cloud completion and sensor fusion in autonomous driving perception.
Rating¶
- Novelty: 3.5/5 — The combination of Monte Carlo ensemble and curriculum masking is moderately novel, though neither component is entirely new.
- Experimental Thoroughness: 2.5/5 — Data scale and baseline coverage are insufficient; ablation experiments are relatively simple.
- Writing Quality: 4/5 — Structure is clear, with good coordination between formulas and visualizations.
- Value: 3/5 — Practically valuable for the meteorological remote sensing domain, but scale and generalizability remain to be verified.