
Representation Learning for Spatiotemporal Physical Systems

Conference: CVPR 2026 · arXiv: 2603.13227 · Code: https://github.com/helenqu/physical-representation-learning
Area: Self-Supervised Learning
Keywords: JEPA, Physical Systems, Representation Learning, Parameter Estimation, Spatiotemporal PDE

TL;DR

This paper systematically benchmarks four learning paradigms — JEPA, VideoMAE, an autoregressive foundation model (MPP), and an operator learning method (DISCO) — across three PDE-based physical systems. It finds that latent-space predictive objectives (JEPA) consistently outperform pixel-level prediction methods on the downstream task of physical parameter estimation, achieving 28–51% relative MSE reduction with greater data efficiency.

Background & Motivation

Machine learning applied to spatiotemporal physical systems has largely focused on autoregressive surrogate modeling for next-frame prediction, aiming to learn efficient substitutes for numerical simulations. Such approaches are costly to train, suffer from compounding errors, and — more fundamentally — do not align with the core needs of scientific inquiry, which often center on higher-level downstream tasks such as estimating control parameters (Reynolds number, Prandtl number, etc.) or making qualitative predictions (e.g., laminar vs. turbulent flow).

The key question is: which learning paradigm best preserves physically meaningful information? Intuitively, methods specifically designed for physical modeling (e.g., autoregressive foundation models, neural operators) should outperform general-purpose self-supervised approaches. But is this actually the case? This question has previously lacked systematic investigation.

The paper addresses this by using physical parameter estimation accuracy as a quantifiable proxy for representation quality, systematically evaluating different learning paradigms — latent-space prediction (JEPA), pixel reconstruction (MAE), autoregressive foundation models (MPP), and operator learning (DISCO) — on physical systems. The core idea is: predicting representations in latent space (rather than pixel values) may better capture the high-level dynamical information of physical systems.

Method

Overall Architecture

Rather than proposing a new model architecture, this paper designs a systematic evaluation framework:

  1. Four models are pretrained separately on three PDE-based physical systems.
  2. Encoders are frozen, and attentive probes are trained for physical parameter estimation.
  3. Representation quality is assessed via parameter estimation MSE.
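
The three-step protocol can be illustrated with a fully synthetic toy (none of this is the paper's code: the "encoder" is a fixed random projection, the probe is a ridge-regression head standing in for the attentive probe, and the data are Gaussian fields whose mean level encodes the parameter):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "trajectories": 64 flattened fields whose mean level depends on a
# scalar physical parameter theta (a stand-in for, e.g., a Reynolds number).
theta = rng.uniform(0.5, 2.0, size=64)
X = theta[:, None] + rng.normal(scale=0.5, size=(64, 32))

# Step 1: "pretrain" an encoder (here just a fixed random projection).
W = rng.normal(size=(32, 8)) / np.sqrt(32)
encode = lambda x: x @ W  # frozen encoder

# Step 2: train a probe on the frozen features (ridge-regression head).
Z = encode(X)
w = np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(8), Z.T @ theta)

# Step 3: representation quality = parameter-estimation MSE.
mse = np.mean((Z @ w - theta) ** 2)
```

Even this crude pipeline beats a mean predictor, which is the sense in which parameter-estimation MSE quantifies how much physical information the frozen representation retains.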

Key Designs

  1. JEPA (Joint Embedding Predictive Architecture) for Physical Dynamics:

    • Function: Learns an encoder \(f: \mathcal{X} \to \mathcal{Z}\) and predictor \(g: \mathcal{Z} \to \mathcal{Z}\) to predict representations of future time segments in latent space.
    • Mechanism: Given \(T\) timesteps \(x_{0:T}\), the trajectory is split into consecutive segments \(x_{t:t+k}\); segment \(i\) is encoded as \(z_i = f(x_i)\), and the latent prediction loss is minimized: \(\mathcal{L}(f,g) = \mathbb{E}_{x_i, x_{i+1} \sim \mathcal{X}}[\ell_{\text{VICReg}}(g(f(x_i)), f(x_{i+1}))]\)
    • VICReg loss is applied to prevent representational collapse: \(\ell_{\text{VICReg}}(z_i, z_{i+1}) = \lambda s(z_i, z_{i+1}) + \mu[v(z_i) + v(z_{i+1})] + \nu[c(z_i) + c(z_{i+1})]\) where \(s\) is the invariance term (L2 distance), \(v\) is variance regularization, and \(c\) is covariance regularization.
    • Encoder: 3D ConvNeXt downsampling CNN; Predictor: inverted bottleneck CNN over the channel dimension.
    • Design Motivation: By minimizing error in representation space rather than pixel space, JEPA avoids learning low-level visual details (e.g., texture) and focuses on high-level dynamical features.
  2. VideoMAE Baseline (Pixel-Level Reconstruction):

    • Function: Trains an encoder–decoder pair to minimize pixel reconstruction error over masked regions.
    • Mechanism: Spatiotemporal tube masking with pixel-level MSE reconstruction.
    • Architecture: ViT-tiny/16, output shape \(l/16 \times w/16 \times t/2 \times 384\).
    • Design Motivation: Serves as a representative of pixel-level predictive paradigms, providing a direct contrast to JEPA.
  3. Physics-Specific Baselines:

    • MPP (Multiple Physics Pretraining): An autoregressive foundation model that predicts pixel values frame by frame; uses publicly released pretrained weights (AViT-tiny).
    • DISCO: An operator meta-learning framework that infers trajectory-specific operator networks from short context windows; pretrained on The Well dataset.
    • Design Motivation: Tests whether methods specifically designed for physical modeling genuinely outperform general self-supervised methods on scientific downstream tasks.
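
The VICReg objective above can be sketched in a few lines of NumPy; the hyperparameters follow the paper's setting (\(\lambda=2, \mu=40, \nu=2\)), while the hinge target of 1 on the per-dimension standard deviation follows the original VICReg formulation:

```python
import numpy as np

def vicreg_loss(z_a, z_b, lam=2.0, mu=40.0, nu=2.0, eps=1e-4):
    """z_a, z_b: (batch, dim) paired embeddings, e.g. g(f(x_i)) and f(x_{i+1})."""
    # Invariance term s: mean squared L2 distance between paired embeddings.
    inv = np.mean(np.sum((z_a - z_b) ** 2, axis=1))

    def variance(z):
        # Hinge on per-dimension std, keeping it above 1 to prevent collapse.
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def covariance(z):
        # Penalize squared off-diagonal covariance entries (decorrelation).
        zc = z - z.mean(axis=0)
        n, d = zc.shape
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    return (lam * inv
            + mu * (variance(z_a) + variance(z_b))
            + nu * (covariance(z_a) + covariance(z_b)))
```

The invariance term pulls the predicted and target representations together, while the variance and covariance terms keep the embedding space from collapsing to a constant, which is what lets JEPA train without a pixel-space decoder.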

Loss & Training

  • JEPA and VideoMAE are pretrained separately on each physical system for 6 epochs to learn system-specific dynamics.
  • During fine-tuning, encoders are frozen and attentive probes are trained for 100 epochs (following the V-JEPA fine-tuning protocol).
  • MPP, whose pretraining did not include the target datasets, undergoes end-to-end fine-tuning.
  • AdamW optimizer with cosine learning rate scheduling.
  • VICReg hyperparameters: \(\lambda=2, \mu=40, \nu=2\).
  • Input: \(l \times w \times d \times 16\) (16-frame context).
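
As a rough sketch of the attentive probe (hedged: this mirrors the V-JEPA-style probe in spirit, not the paper's exact implementation, and all dimensions and weight initializations here are illustrative), a single learnable query cross-attends over the frozen encoder's tokens and a linear head regresses the physical parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class AttentiveProbe:
    """One learnable query pools frozen tokens via cross-attention,
    then a linear head maps the pooled vector to parameter estimates."""

    def __init__(self, dim, n_params, seed=0):
        rng = np.random.default_rng(seed)
        self.query = rng.normal(scale=dim ** -0.5, size=(1, dim))
        self.w_k = rng.normal(scale=dim ** -0.5, size=(dim, dim))
        self.w_v = rng.normal(scale=dim ** -0.5, size=(dim, dim))
        self.head = rng.normal(scale=dim ** -0.5, size=(dim, n_params))

    def __call__(self, tokens):
        """tokens: (n_tokens, dim) frozen encoder features for one trajectory."""
        k, v = tokens @ self.w_k, tokens @ self.w_v
        attn = softmax(self.query @ k.T / np.sqrt(k.shape[1]))  # (1, n_tokens)
        pooled = attn @ v                                       # (1, dim)
        return (pooled @ self.head).ravel()                     # (n_params,)
```

Only the probe's weights would be trained during fine-tuning; the encoder that produced `tokens` stays frozen, so the probe's MSE reflects the representation rather than further task-specific feature learning.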

Key Experimental Results

Main Results

| Method | Type | Active Matter MSE↓ | Shear Flow MSE↓ | Rayleigh-Bénard MSE↓ |
|--------|------|--------------------|-----------------|----------------------|
| JEPA | Latent-space prediction | 0.079 | 0.38 | 0.13 |
| VideoMAE | Pixel reconstruction | 0.160 | 0.67 | 0.18 |
| DISCO | Operator learning | 0.057 | 0.13 | 0.01 |
| MPP (full fine-tune) | Autoregressive foundation model | 0.230 | 0.59 | 0.08 |

JEPA vs. VideoMAE improvement: Active Matter 51%, Shear Flow 43%, Rayleigh-Bénard 28%.

Ablation Study

Data-efficiency ablation on Shear Flow:

| Fine-tuning Data Fraction | JEPA MSE↓ | VideoMAE MSE↓ | Notes |
|---------------------------|-----------|---------------|-------|
| 10% | 0.57 | 0.98 | JEPA at 10% already surpasses VideoMAE at 100% |
| 50% | 0.40 | 0.75 | JEPA reaches ~95% of its best performance |
| 100% | 0.38 | 0.67 | Baseline comparison |

Key Findings

  • JEPA consistently outperforms VideoMAE across all three physical systems, with relative improvements ranging from 28% to 51%.
  • Not all physics-specific methods outperform general self-supervised approaches: MPP, despite end-to-end fine-tuning, underperforms frozen-encoder JEPA on Active Matter and Shear Flow. This mirrors findings in NLP where autoregressive models underperform encoder models on non-generative tasks (BERT vs. GPT).
  • DISCO and JEPA are top performers in their respective categories: both operate via latent-space prediction mechanisms (DISCO through hypernetworks producing latent embeddings; JEPA through encoder-based latent prediction), whereas MPP and VideoMAE both perform pixel-level prediction — strongly suggesting that the latent-space mechanism is the decisive factor.
  • JEPA exhibits higher data efficiency: it surpasses VideoMAE trained on 100% of the fine-tuning data using only 10%.
  • The relative ranking of methods varies across systems: DISCO achieves MSE = 0.01 on Rayleigh-Bénard, far outpacing all other methods, possibly because the physical structure of that system aligns particularly well with operator learning.

Highlights & Insights

  • Novel evaluation perspective: The paper shifts self-supervised representation learning evaluation from ImageNet image classification to physical parameter estimation, providing a distinctive scientific lens.
  • Significant core finding: Latent-space prediction outperforms pixel-level prediction — a conclusion that holds consistently across three distinct physical systems, suggesting broad generality.
  • Analogy to NLP: The observation that autoregressive models underperform encoder models on non-generative downstream tasks — classically noted in the BERT vs. GPT debate — is validated here in the context of physical modeling (MPP vs. JEPA).
  • Elegant experimental design: Using quantifiable physical parameters as a proxy for representation quality avoids the subjectivity inherent in conventional evaluation metric selection.
  • Concise and impactful: The paper makes its contribution through experimental findings and insights rather than architectural novelty, communicating its message efficiently.

Limitations & Future Work

  • Evaluation is limited to three 2D PDE systems; generalizability to 3D systems, particle-based systems, or non-PDE systems remains unknown.
  • Downstream tasks are restricted to parameter estimation (regression); classification tasks (e.g., laminar vs. turbulent flow) and other scientific tasks are not explored.
  • The JEPA encoder is a simple 3D CNN; the effect of larger-scale models or more complex architectures is not investigated.
  • The paper does not analyze what the learned representations physically encode — visualization or interpretability analysis is absent.
  • DISCO substantially outperforms JEPA on certain systems (Rayleigh-Bénard MSE: 0.01 vs. 0.13), indicating that physics-informed inductive biases retain irreplaceable advantages in specific settings.

Related Work

  • V-JEPA (Assran et al., 2025) and VICReg (Bardes et al., 2021): Provide the theoretical and practical foundation for applying JEPA to spatiotemporal data.
  • MPP (McCabe et al., 2024): A representative physics foundation model; its underperformance on non-generative tasks warrants attention from the community.
  • DISCO (Morel et al., 2025): A representative operator meta-learning method; its strong performance validates the value of physics-informed inductive biases.
  • The Well (Ohana et al., 2025): Provides standardized datasets for physical systems.
  • Broader implication: Scientific machine learning may benefit from distinguishing between tasks requiring precise simulation and those requiring system understanding, favoring representation learning over autoregressive modeling for the latter.

Rating

  • Novelty: ⭐⭐⭐⭐ — The perspective is novel, though JEPA itself is not a new contribution; the value lies in the systematic experimental findings.
  • Experimental Thoroughness: ⭐⭐⭐ — Three systems and data efficiency analysis are convincing, but diversity of systems and tasks could be further extended.
  • Writing Quality: ⭐⭐⭐⭐ — Concise and clear, with key messages well-highlighted and accessible for quick reading.
  • Value: ⭐⭐⭐⭐ — The finding that latent-space prediction outperforms pixel-level prediction provides actionable guidance for the scientific ML community.