MicroEvoEval: A Systematic Evaluation Framework for Image-Based Microstructure Evolution Prediction

Conference: AAAI 2026 arXiv: 2511.08955 Code: GitHub Area: Benchmark Evaluation Keywords: Microstructure evolution, spatiotemporal prediction, benchmark evaluation, deep learning surrogate models, materials science

TL;DR

This paper introduces MicroEvoEval, the first standardized benchmark for image-level microstructure evolution prediction, encompassing 4 representative physical tasks (planar wave propagation, grain growth, spinodal decomposition, dendritic solidification), 14 models (5 domain-specific + 9 general spatiotemporal architectures), and a multi-dimensional evaluation framework (numerical accuracy + physical fidelity + computational efficiency). The study finds that modern general-purpose architectures (e.g., VMamba) outperform domain-specific models in long-term stability and physical fidelity while achieving an order-of-magnitude improvement in computational efficiency.

Background & Motivation

Background: Microstructure evolution (MicroEvo) simulation is critical for materials design. Traditional methods rely on PDE numerical solvers such as phase-field simulations, which are accurate but computationally prohibitive. Deep learning surrogate models have advanced rapidly in recent years, with methods such as E3D-LSTM, ConvGRU, and VMamba being applied to this task.

Limitations of Prior Work: (a) Absence of standardized benchmarks — existing methods are developed and evaluated independently on heterogeneous tasks and datasets, precluding systematic comparison; (b) Overemphasis on numerical accuracy at the expense of physical fidelity — standard image metrics (MSE, SSIM) fail to assess whether predicted microstructures conform to physical laws (e.g., phase area fractions, mean region areas); (c) Lack of long-term stability analysis — existing evaluations typically focus on short-term prediction, neglecting error accumulation in autoregressive long-term forecasting.

Critical Gap: Powerful general-purpose spatiotemporal prediction models (e.g., SimVP.v2, PredFormer) have never been systematically evaluated on physically constrained MicroEvo tasks.

Core Idea: Construct a standardized benchmark organized around a taxonomy of physical mechanisms (periodic, non-conserved, conserved, coupled multi-physics), evaluate models along three dimensions — numerical accuracy, physical fidelity, and computational efficiency — and design both short-term and long-term prediction settings to analyze error accumulation.

Method

Overall Architecture

The benchmark comprises four components: (1) 4 physical tasks with corresponding datasets; (2) 14 evaluated models (5 MicroEvo-specific + 9 general spatiotemporal); (3) dual short-term/long-term evaluation settings; (4) a multi-dimensional evaluation metric system.

Four Physical Tasks

  1. Planar wave propagation (periodic structure): Describes scalar field wave propagation with an analytical solution; the simplest task.
  2. Grain growth (Allen-Cahn equation, non-conserved order parameter): Multi-grain coarsening process in which grains progressively merge and grow.
  3. Spinodal decomposition (Cahn-Hilliard equation, conserved order parameter): Spontaneous phase separation in binary alloys governed by a fourth-order PDE; more challenging.
  4. Dendritic solidification (coupled multi-physics): Solidification involving coupled temperature field and order parameter; the most complex task.
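For reference, tasks 2 and 3 are named after well-known phase-field equations. Their standard textbook forms are sketched below; the paper's exact parameterization (mobilities \(L\), \(M\), gradient coefficient \(\kappa\), and free energy density \(f\)) may differ:

```latex
% Allen-Cahn: non-conserved order parameter \phi (second-order in space)
\frac{\partial \phi}{\partial t}
  = -L\left(\frac{\partial f}{\partial \phi} - \kappa \nabla^{2} \phi\right)

% Cahn-Hilliard: conserved concentration c (fourth-order in space)
\frac{\partial c}{\partial t}
  = \nabla \cdot \left[\, M \, \nabla\!\left(\frac{\partial f}{\partial c} - \kappa \nabla^{2} c\right) \right]
```

The extra divergence-of-gradient structure in Cahn-Hilliard is what makes it fourth-order and its conservation constraint harder for surrogates to respect.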

All data are derived from 256×256 high-fidelity numerical simulations and downsampled to 64×64. Two evaluation settings are designed: short-term (10 frames input → 10 frames prediction) and long-term (autoregressive prediction of 50/90 frames).
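The long-term setting amounts to an autoregressive rollout in which the model's own outputs are fed back as context. A minimal sketch, assuming a hypothetical `model` callable that maps a window of frames to the same number of predicted frames (actual per-model interfaces in the benchmark differ):

```python
import numpy as np

def rollout(model, context, horizon, window=10):
    """Autoregressive rollout: feed the model's own predictions back as input.

    model:   callable mapping a (window, H, W) array to a (window, H, W) array
             (assumed interface for illustration)
    context: (window, H, W) array of observed frames
    horizon: total number of frames to predict (e.g. 50 or 90)
    """
    frames = list(context)
    preds = []
    while len(preds) < horizon:
        # Predict the next chunk from the most recent `window` frames,
        # which after the first iteration are themselves predictions.
        chunk = model(np.stack(frames[-window:]))
        for f in chunk:
            preds.append(f)
            frames.append(f)
    return np.stack(preds[:horizon])
```

Because later chunks are conditioned entirely on earlier predictions, any per-frame error compounds across chunks; this is precisely the error-accumulation behavior the long-term setting is designed to expose.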

Key Evaluation Metrics

  1. Numerical accuracy: RMSE, SSIM — measure pixel-level prediction accuracy.
  2. Physical fidelity (core innovation):
     • L-ETAP (logarithmic error in total area proportion): assesses whether the predicted total area fraction of each phase is correct — reflects conservation laws and phase equilibrium.
     • L-EAPSR (logarithmic error in average per-segment area ratio): assesses whether the average size of individual regions is correct — reflects coarsening dynamics.
  3. Computational efficiency: Inference time.
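The two fidelity metrics are described above only informally. The sketch below implements one plausible reading (log10 of the absolute error in phase area fraction, and log10 of the relative error in mean connected-region area); the paper's exact formulas may differ, and `l_etap`, `l_eapsr`, and `_mean_segment_area` are hypothetical names:

```python
import numpy as np

def _mean_segment_area(mask):
    """Mean area of 4-connected foreground regions (simple BFS labeling)."""
    mask = mask.astype(bool)
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    areas = []
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                stack, area = [(i, j)], 0
                seen[i, j] = True
                while stack:
                    y, x = stack.pop()
                    area += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                areas.append(area)
    return float(np.mean(areas)) if areas else 0.0

def l_etap(pred, true, thr=0.5):
    """log10 of the absolute error in total area proportion of the phase."""
    err = abs((pred > thr).mean() - (true > thr).mean())
    return float(np.log10(err + 1e-12))

def l_eapsr(pred, true, thr=0.5):
    """log10 of the relative error in average per-segment (region) area."""
    a_p = _mean_segment_area(pred > thr)
    a_t = _mean_segment_area(true > thr)
    return float(np.log10(abs(a_p - a_t) / (a_t + 1e-12) + 1e-12))
```

Under this reading, values around −2.5 (as in the short-term results below) would correspond to errors on the order of 3×10⁻³; more negative is better.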

14 Evaluated Models

  • MicroEvo-specific (5): E3D-LSTM, ConvGRU, PredRNN, SpatioTemporalFormer, etc.
  • General spatiotemporal (9): SimVP, SimVP.v2, TAU, MAU, PredRNN++, SwinLSTM, VMRNN, VMamba, PredFormer

Key Experimental Results

Short-Term Prediction (10→10) Key Results

| Model | Type | RMSE (Grain Growth) | SSIM (Grain Growth) | L-EAPSR |
| --- | --- | --- | --- | --- |
| E3D-LSTM | MicroEvo | 0.034 | 0.986 | -2.488 |
| ConvGRU | MicroEvo | 0.021 | 0.992 | -2.586 |
| PredRNN | MicroEvo | 0.025 | 0.992 | -2.541 |
| SimVP.v2 | General | ~low | ~high | ~good |
| VMamba | General | ~best | ~best | ~best |
| PredFormer | General | ~competitive | ~competitive | ~competitive |

Planar wave propagation is the simplest task, with all models achieving high SSIM (>0.99). Spinodal decomposition and dendritic solidification are more challenging, with more pronounced inter-model differences.

Long-Term Prediction (10→90) Key Findings

| Finding | Description |
| --- | --- |
| Short-term accuracy ≠ long-term stability | Some models with low short-term RMSE suffer rapid error accumulation and collapse completely in long-term autoregressive prediction. |
| General architectures exhibit greater long-term stability | SimVP.v2 and PredFormer degrade more slowly in long-term prediction. |
| VMamba achieves the best overall performance | Superior accuracy and physical fidelity, plus an order-of-magnitude advantage in computational efficiency. |
| Physical fidelity decouples from pixel-level accuracy | Cases exist where SSIM is high but L-ETAP/L-EAPSR is poor, indicating that pixel-level metrics mask structural errors. |
| Error accumulation patterns are task-dependent | Long-term collapse patterns for conserved PDEs (spinodal decomposition) differ from those for non-conserved PDEs (grain growth). |

Computational Efficiency Comparison

| Efficiency Tier | Representative Models | Relative Inference Time |
| --- | --- | --- |
| Fastest | SimVP.v2, VMamba | 1× (baseline) |
| Moderate | PredFormer, TAU | 3–5× |
| Slowest | E3D-LSTM, PredRNN++ | 10× |

VMamba achieves the highest accuracy while maintaining extremely high inference efficiency, making it an ideal candidate for practical surrogate modeling.
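Inference-time comparisons of this kind typically exclude warm-up runs so that one-off setup costs do not skew the average. A minimal timing harness, illustrative only and not the paper's protocol:

```python
import time
import numpy as np

def mean_inference_time(model, batch, n_warmup=3, n_runs=10):
    """Average wall-clock seconds per forward pass, measured after warm-up."""
    for _ in range(n_warmup):   # warm-up: exclude compilation/cache setup costs
        model(batch)
    t0 = time.perf_counter()
    for _ in range(n_runs):
        model(batch)
    return (time.perf_counter() - t0) / n_runs
```

Relative tiers such as those in the table above are obtained by dividing each model's mean time by that of the fastest model.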

Key Findings

  • Modern general-purpose architectures unexpectedly outperform domain-specific models: General architectures such as VMamba and SimVP.v2 lead not only in numerical accuracy but also in physical fidelity and long-term stability — challenging the common assumption that physical tasks require embedded physical priors.
  • Short-term accuracy is a poor predictor of long-term stability: Models that perform well under short-term evaluation may completely collapse in long-term autoregressive settings, revealing a critical methodological flaw in evaluations that consider only short-term prediction.
  • Physical fidelity metrics expose the blind spots of pixel-level metrics: A model may achieve high SSIM while being entirely incorrect in its physical structure (e.g., large deviations in phase area fractions), demonstrating the necessity of structure-preserving metrics.
  • VMamba's efficiency advantage is remarkable: It achieves superior accuracy and fidelity while being an order of magnitude faster than many competing models — a crucial advantage for scenarios requiring rapid exploration of materials design spaces.

Highlights & Insights

  • The introduction of physical fidelity metrics (L-ETAP, L-EAPSR) represents a key contribution. In materials science, whether predicted microstructures preserve correct phase area fractions and region-level statistical properties is more important than pixel-level accuracy. This insight generalizes to other prediction tasks with physical constraints.
  • The finding that general-purpose architectures outperform domain-specific models carries important implications: inductive biases inherent in architectures (e.g., translational invariance in CNNs, long-range dependency modeling in Mamba) may be more effective than manually embedded physical priors. This suggests that the materials science AI community should more actively adopt recent architectural advances from computer vision.

Limitations & Future Work

  • All data are derived from PDE numerical simulations; validation on real experimental microstructure data is absent — the structural and noise characteristics of simulation data may differ from those of physical experiments.
  • The 64×64 resolution is relatively low; practical microstructure analysis typically requires higher resolution.
  • The effect of directly embedding physical constraints (e.g., PDE residual losses) into general-purpose models has not been evaluated — this is an important direction for future work.
  • While L-ETAP and L-EAPSR are valuable, they do not cover all physically important quantities (e.g., interfacial morphology, topological structure).
Comparison with Related Work

  • vs. E3D-LSTM (Yang et al., 2021): Among the earliest deep learning methods applied to MicroEvo, but lacking physical fidelity evaluation and comparison with general-purpose architectures. MicroEvoEval addresses both gaps.
  • vs. general spatiotemporal benchmarks (e.g., OpenSTL): These focus on natural video and weather prediction tasks. MicroEvoEval is the first benchmark targeting physically constrained microstructure evolution.
  • vs. physics-guided methods (e.g., PhyDNet): Not included in the evaluation scope, but the strong performance of general-purpose architectures suggests that purely data-driven approaches may already be sufficiently powerful.

Rating

  • Novelty: ⭐⭐⭐⭐ First standardized MicroEvo benchmark; physical fidelity metric design is valuable.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 14 models × 4 tasks × short-term + long-term × three-dimensional metrics; outstanding in scale and systematicity.
  • Writing Quality: ⭐⭐⭐⭐ Clear taxonomy of physical tasks; well-structured evaluation design.
  • Value: ⭐⭐⭐⭐ Significant reference value for the materials science AI community; reveals the potential of general-purpose architectures.