
PlantTraitNet: An Uncertainty-Aware Multimodal Framework for Global-Scale Plant Trait Inference from Citizen Science Data

Conference: AAAI 2026
arXiv: 2511.06943
Code: https://github.com/GeoSense-Freiburg/PlantTraitNet
Area: Multimodal VLM
Keywords: Plant trait prediction, multimodal fusion, uncertainty estimation, citizen science, global-scale mapping

TL;DR

This paper proposes PlantTraitNet, a multimodal, multi-task, uncertainty-aware deep learning framework that leverages weakly supervised plant photographs from citizen science platforms (iNaturalist, Pl@ntNet), fusing image features (DINOv2), depth priors (Depth-Anything-V2), and geospatial priors (Climplicit) to simultaneously predict four key plant functional traits: plant height, leaf area, specific leaf area, and leaf nitrogen content. The resulting global trait maps consistently outperform existing global trait products when benchmarked against sPlotOpen vegetation survey data.

Background & Motivation

Global plant trait maps (e.g., leaf nitrogen content, plant height) are fundamental to understanding ecosystem processes such as carbon and energy cycling. However, existing trait maps are constrained by the high cost of field measurements and sparse geographic coverage. Citizen science platforms such as iNaturalist host over 50 million georeferenced plant photographs that contain rich visual information on plant morphology and physiology, representing an underutilized data resource.

Limitations of Prior Work: (1) Citizen science data lack direct trait annotations and provide only species labels; (2) weak labels obtained by matching species names to the TRY database are highly noisy; (3) image quality is highly variable (feature noise); (4) prior work has focused primarily on single-task models that do not exploit inter-trait correlations; (5) existing global trait maps have limited accuracy.

Key Insight: The paper integrates citizen science imagery with computer vision and geospatial AI, extracting features via visual foundation models, encoding 3D structural information via depth foundation models, and encoding geospatial context via climate foundation models. Uncertainty-guided data cleaning is employed to handle noise, enabling scalable and more accurate global trait mapping.

Method

Overall Architecture

Input: Citizen science plant photograph + geographic coordinates + acquisition timestamp
Output: Simultaneous prediction of four plant traits (plant height H, leaf area LA, specific leaf area SLA, leaf nitrogen content LN)

Pipeline: Image encoder (DINOv2 ViT-B/14) → image embedding (768-dim); depth encoder (Depth-Anything-V2 ViT-B) → depth embedding (768-dim); geospatial encoder (Climplicit) → climate embedding (projected to 256-dim); three-way concatenation → projection to 1024-dim → 8-layer residual backbone → four independent trait prediction heads (with uncertainty estimation).
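
A minimal PyTorch sketch of this fusion pipeline, assuming the frozen encoders are run offline and their embeddings are passed in directly; the dimensions follow the description above, while layer details (LayerNorm, GELU, block width) are assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One residual MLP block of the backbone (layer details are assumptions)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        return x + self.net(x)

class PlantTraitNetSketch(nn.Module):
    """Fuses precomputed DINOv2, Depth-Anything-V2, and Climplicit embeddings and
    predicts a mean and a log-scale (uncertainty) for each of the four traits."""
    def __init__(self, img_dim=768, depth_dim=768, clim_dim=256,
                 hidden=1024, n_blocks=8, n_traits=4):
        super().__init__()
        self.proj = nn.Linear(img_dim + depth_dim + clim_dim, hidden)
        self.backbone = nn.Sequential(*[ResidualBlock(hidden) for _ in range(n_blocks)])
        # One independent prediction head per trait, each emitting (mean, log-scale).
        self.heads = nn.ModuleList([nn.Linear(hidden, 2) for _ in range(n_traits)])

    def forward(self, img_emb, depth_emb, clim_emb):
        # img_emb / depth_emb: 768-dim DINOv2 / Depth-Anything-V2 features;
        # clim_emb: Climplicit climate embedding projected to 256-dim.
        z = self.proj(torch.cat([img_emb, depth_emb, clim_emb], dim=-1))
        z = self.backbone(z)
        out = torch.stack([head(z) for head in self.heads], dim=1)  # (B, 4, 2)
        means, log_scales = out[..., 0], out[..., 1]                # each (B, 4)
        return means, log_scales
```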

Key Designs

  1. Multimodal Feature Fusion:

    • Function: Fuses embeddings from three modalities — image, depth, and geospatial context.
    • Mechanism: DINOv2 provides general-purpose visual features; Depth-Anything-V2 supplies monocular depth priors encoding sensor-to-plant-surface distance to assist morphological reconstruction; Climplicit encodes latitude/longitude/month into continuous climate embeddings capturing global climate factors such as temperature and precipitation. The three feature streams are fused via simple concatenation followed by linear projection.
    • Design Motivation: Plant traits are strongly influenced by climate conditions; depth information facilitates inference of 3D structural traits such as plant height, which standard 2D images lack explicit spatial cues to recover.
  2. Uncertainty-Aware Training:

    • Function: Each trait prediction head outputs both a predicted value and an associated uncertainty (predicted variance/scale).
    • Mechanism: A Laplace distribution is used for leaf area (long-tailed distribution); Gaussian distributions are used for the remaining traits. The training objective is the negative log-likelihood (NLL), enabling the model to jointly learn predictions and uncertainties.
    • Design Motivation: Citizen science data are inherently noisy (variable image quality, weak label noise); uncertainty estimation enables dynamic down-weighting of noisy samples and filtering of unreliable data points.
  3. Uncertainty-Guided Data Cleaning Loop:

    • Function: Two-stage iterative cleaning of the training data.
    • Mechanism (a schematic code sketch follows this list):
      • Stage 1 (Uncertainty Filtering): After training for one epoch, inference is run on all training samples and the top 5% of samples with the highest joint uncertainty (e.g., winter scenes, leafless branches, blurry images) are removed; this is iterated until the number of high-uncertainty samples falls below a threshold.
      • Stage 2 (Residual-Aware Filtering): Reference dataset performance is tracked to identify the "inflection point" epoch (where the model begins to overfit noisy labels); for high-uncertainty samples, residuals between predictions and species-level medians are computed, and samples exhibiting both high uncertainty and high residuals are removed (e.g., juvenile plants annotated with adult height values).
    • Design Motivation: Pure uncertainty filtering may erroneously discard legitimate high-variance samples (e.g., tall plants) due to heteroscedasticity; residual-aware filtering compensates for this bias.
  4. Multi-Task Learning:

    • Function: Simultaneously predicts four traits using a shared backbone network.
    • Mechanism: Shared multimodal representation with independent trait prediction heads, exploiting ecological correlations among traits.
    • Design Motivation: Compared to single-task models, the multi-task model achieves substantially better performance on plant height (R² from 0.12 to 0.19) while reducing computational cost by approximately 75%.
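
A minimal numpy sketch of the two-stage cleaning loop from design 3 above. The 5% drop fraction comes from the description; the joint-uncertainty score, the stopping count, and the quantile cutoffs in stage 2 are assumptions:

```python
import numpy as np

def stage1_uncertainty_filter(unc, keep, drop_frac=0.05, stop_count=1000):
    """One round of stage-1 filtering: drop the top `drop_frac` of the kept samples
    by joint predictive uncertainty (winter scenes, leafless branches, blurry images).
    Re-run after each training epoch until `done` is True.
    unc: (N,) joint uncertainty per sample, e.g. the summed predicted scales."""
    cutoff = np.quantile(unc[keep], 1.0 - drop_frac)
    flagged = keep[unc[keep] > cutoff]
    done = len(flagged) < stop_count
    return np.setdiff1d(keep, flagged), done

def stage2_residual_filter(unc, preds, weak_labels, keep, unc_q=0.95, res_q=0.95):
    """Stage 2: at the 'inflection point' epoch, drop samples that are BOTH
    high-uncertainty and far from their species-level median label
    (e.g. juvenile plants annotated with adult height).
    preds, weak_labels: (N, 4) predictions and species-median targets."""
    residual = np.abs(preds - weak_labels).mean(axis=1)  # averaged over the four traits (assumption)
    high_unc = unc[keep] > np.quantile(unc[keep], unc_q)
    high_res = residual[keep] > np.quantile(residual[keep], res_q)
    return keep[~(high_unc & high_res)]
```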

Loss & Training

  • Leaf area: Laplace negative log-likelihood
  • Remaining three traits: Gaussian negative log-likelihood
  • Stratified sampling by plant functional type (herb/shrub/tree) to ensure balanced batches
  • AdamW optimizer with cosine annealing learning rate schedule, gradient clipping max_norm=1.0
  • Approximately 90M total parameters; trained for up to 30 epochs on a single NVIDIA RTX A6000
  • Model selection via Pareto front + hypervolume maximization
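
A minimal PyTorch sketch of the heteroscedastic NLL objective: Laplace for leaf area, Gaussian for the other three traits. Predicting the log of the scale parameter for numerical stability is an assumption:

```python
import math
import torch

def trait_nll(means, log_scales, targets, laplace_mask):
    """Per-sample, per-trait negative log-likelihood, averaged over the batch.
    means, log_scales, targets: (B, 4) tensors for (H, LA, SLA, LN).
    laplace_mask: (4,) bool tensor, True where the Laplace NLL is used (LA)."""
    scale = log_scales.exp()
    # Laplace NLL: |y - mu| / b + log(2b)
    laplace = (targets - means).abs() / scale + log_scales + math.log(2.0)
    # Gaussian NLL: (y - mu)^2 / (2 sigma^2) + log(sigma) + 0.5 * log(2 pi)
    gaussian = (targets - means) ** 2 / (2 * scale ** 2) + log_scales + 0.5 * math.log(2 * math.pi)
    nll = torch.where(laplace_mask, laplace, gaussian)
    return nll.mean()

# Leaf area (index 1 of H, LA, SLA, LN) uses the Laplace likelihood.
laplace_mask = torch.tensor([False, True, False, False])
```

The stratified sampling by functional type can be approximated with inverse-frequency sample weights; the herb/shrub/tree indices below are hypothetical:

```python
import numpy as np
from torch.utils.data import WeightedRandomSampler

pft = np.array([0, 0, 0, 1, 2, 2])        # functional type per sample (0=herb, 1=shrub, 2=tree)
weights = 1.0 / np.bincount(pft)[pft]     # inverse-frequency weight per sample
sampler = WeightedRandomSampler(weights.tolist(), num_samples=len(weights), replacement=True)
```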

Key Experimental Results

Main Results (Global Trait Maps vs. sPlotOpen Benchmark, 1° Resolution)

| Method | H (R²/nMAE/r) | LA (R²/nMAE/r) | SLA (R²/nMAE/r) | LN (R²/nMAE/r) |
|---|---|---|---|---|
| Ours (Refined) | 0.18/0.22/0.45 | 0.34/0.14/0.57 | 0.27/0.13/0.59 | -0.12/0.17/0.50 |
| Schiller | -0.32/0.28/0.42 | 0.11/0.17/0.52 | 0.16/0.14/0.53 | 0.06/0.14/0.40 |
| Wolf | -0.61/0.31/0.43 | -0.02/0.18/0.53 | 0.02/0.16/0.50 | -0.20/0.18/0.41 |
| Moreno | — | — | -0.72/0.23/0.23 | -0.85/0.22/0.17 |

Ablation Study (Multimodal Configurations)

| Configuration | H (R²) | LA (R²) | SLA (R²) | LN (R²) | # Top Rankings |
|---|---|---|---|---|---|
| DINOv2 only | 0.15 | 0.31 | 0.32 | 0.14 | 1 |
| DINOv2 + Climplicit | 0.19 | 0.32 | 0.31 | 0.16 | 3 |
| DINOv2 + Climplicit + DA-V2 | 0.19 | 0.32 | 0.31 | 0.18 | 4 |
| Single-task (same config) | 0.12 | 0.34 | 0.33 | 0.21 | — |

Key Findings

  • PlantTraitNet consistently outperforms existing global trait products on H, LA, and SLA.
  • The climate prior (Climplicit) contributes the largest performance gain, consistent with the ecological principle that plant traits are climate-driven.
  • The depth prior provides selective gains, particularly improving leaf nitrogen content prediction.
  • Multi-task learning yields substantial improvement on plant height (R² 0.12→0.19) while saving ~75% of computation.
  • Uncertainty-guided cleaning improves R² by 4% for SLA and 13% for LA.
  • The model captures intraspecific variation, particularly in plant height prediction, rather than simply regressing to species means.
  • Phylogenetic analysis shows that prediction errors are largely independent of species relatedness, demonstrating strong generalization.

Highlights & Insights

  • Innovative Use of Citizen Science Data: The first systematic conversion of 50M+ georeferenced plant photographs into global trait maps via a highly scalable methodology.
  • Two-Stage Uncertainty-Guided Cleaning: An elegant solution to feature noise and label noise in weakly supervised data; residual-aware filtering specifically addresses heteroscedastic traits.
  • Integration of Geospatial Foundation Models: The Climplicit encoding strategy (concatenating embeddings for months 3/6/9/12 to capture seasonal variation) is simple yet effective.
  • Capture of Intraspecific Variation: Despite training with species-level weak labels, the model retains the ability to distinguish individual differences across developmental stages within the same species.
  • Pareto Front Checkpoint Selection: Multi-objective optimization selects the checkpoint with the best simultaneous performance across all four traits (a small code sketch follows this list).
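
A small sketch of the Pareto-front + hypervolume checkpoint selection, assuming per-epoch validation R² for the four traits; the reference point at the origin and the product-of-gaps hypervolume for a single point are simplifying assumptions:

```python
import numpy as np

def pareto_front(scores):
    """Indices of non-dominated checkpoints (higher is better in every column)."""
    n = len(scores)
    dominated = np.zeros(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i]):
                dominated[i] = True
                break
    return np.where(~dominated)[0]

def select_checkpoint(scores, ref_point):
    """Among Pareto-optimal checkpoints, pick the one spanning the largest
    hypervolume against the reference point (here: product of per-trait gaps)."""
    front = pareto_front(scores)
    volumes = np.prod(np.maximum(scores[front] - ref_point, 0.0), axis=1)
    return front[np.argmax(volumes)]

# Hypothetical per-epoch validation R² for (H, LA, SLA, LN).
scores = np.array([[0.15, 0.30, 0.29, 0.14],
                   [0.19, 0.32, 0.31, 0.18],
                   [0.18, 0.33, 0.30, 0.17]])
best_epoch = select_checkpoint(scores, ref_point=np.zeros(4))
```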

Limitations & Future Work

  • Leaf nitrogen content (LN) prediction remains weak (negative R²), as it is a biochemical trait that is inherently difficult to infer from imagery.
  • The weakly supervised paradigm is a fundamental limitation — species-level annotations disregard inter-individual variation (intraspecific variability).
  • Citizen science data exhibit spatial and taxonomic biases (skewed toward Europe and North America, and toward herbaceous species).
  • Global trait maps exhibit systematic bias (R² substantially lower than r), requiring better calibration methods.
  • Only 2D images are utilized; video or time-series information is not exploited.

Comparison with Prior Work

  • vs. Schiller et al. (2021): A pioneering single-task model that does not evaluate intraspecific variation; PlantTraitNet comprehensively outperforms it.
  • vs. Wolf et al. (2022): Introduced the sPlotOpen validation protocol and global trait maps; PlantTraitNet achieves better performance on the same benchmark.
  • vs. Remote Sensing Methods (Moreno, Butler): Traditional remote sensing extrapolation methods perform far worse than computer vision-based approaches.

Rating

  • Novelty: ⭐⭐⭐⭐ — First uncertainty-aware multimodal multi-task framework for global plant trait inference, with multiple novel components.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Multimodal ablations, loss function ablations, global benchmark comparisons, intraspecific variation analysis, and phylogenetic analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure, solid ecological background, and outstanding cross-disciplinary integration.
  • Value: ⭐⭐⭐⭐⭐ — Establishes a new paradigm for leveraging citizen science imagery in global ecological mapping; datasets and code are open-sourced.