Representation Learning for Spatiotemporal Physical Systems

Conference: CVPR2025
arXiv: 2603.13227
Code: GitHub
Area: Self-Supervised Learning
Keywords: Self-Supervised Learning, Spatiotemporal Physical Systems, JEPA, Representation Learning, Parameter Estimation, Masked Autoencoder, Physical Modeling

TL;DR

This work systematically evaluates whether general-purpose self-supervised learning methods learn physically meaningful representations of spatiotemporal physical systems, using physical parameter estimation as the probe task. JEPA, which predicts in the latent space, outperforms pixel-level reconstruction (VideoMAE) on all three systems and the autoregressive model MPP on two of the three. It comes close to the domain-specific physical modeling method DISCO on Active Matter, although DISCO remains clearly ahead on Shear Flow and Rayleigh-Bénard.

Background & Motivation

The application of machine learning to spatiotemporal physical systems primarily focuses on frame-by-frame prediction (surrogate modeling), aiming to learn accurate simulators of system evolution.
However, training these simulators is expensive, and they suffer from error accumulation during autoregressive rollout.
A more critical question: For downstream scientific tasks (such as estimating physical parameters of a system), which learning paradigm is most effective at extracting physically relevant information?
The capability of parameter estimation serves as a quantifiable metric for a model's "understanding" of the underlying physics, as these parameters govern the temporal evolution of the system.
Prior work has rarely focused on the differences in physical representation quality across different learning paradigms.

Core Problem

Can general self-supervised learning methods effectively learn physically meaningful spatiotemporal representations? Latent-space prediction vs. pixel-level reconstruction: which paradigm is more suitable for extracting physical information?

Method

1. JEPA (Joint Embedding Predictive Architecture)

Temporal JEPA based on VICReg loss: Given a sample \(x_{0:T}\) of \(T\) steps, adjacent \(k\) frames are grouped together, encoded using an encoder \(f\), and a predictor \(g\) is used to predict the representation of the next group of \(k\) frames in the latent space.
Loss function = invariance term (MSE between predicted and target representations) + variance regularization + covariance regularization; the two regularizers are designed to prevent representation collapse.
Encoder: 3D CNN (ConvNeXt-style), outputting \(l/16 \times w/16 \times 128\).
Predictor: CNN with inverted bottlenecks.
Key Design: Making predictions in the latent space rather than the pixel space, which avoids learning low-level details.
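The three loss terms described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function name `vicreg_loss`, the plain-list data layout, and the weights (25, 25, 1, the defaults from the original VICReg formulation) are assumptions made for illustration, and the summarized work may weight or normalize the terms differently.

```python
import math


def vicreg_loss(pred, target, inv_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style loss between predicted and target embeddings.

    pred, target: lists of N embedding vectors, each a list of D floats.
    The weights are illustrative assumptions, not values from the paper.
    Returns (total_loss, (invariance, variance, covariance)).
    """
    n, d = len(pred), len(pred[0])

    # Invariance term: mean squared error between prediction and target.
    inv = sum(
        (p - t) ** 2 for pv, tv in zip(pred, target) for p, t in zip(pv, tv)
    ) / (n * d)

    def centered(z):
        means = [sum(row[j] for row in z) / n for j in range(d)]
        return [[row[j] - means[j] for j in range(d)] for row in z]

    def variance_term(z):
        # Hinge on the per-dimension standard deviation: penalize std < 1.
        zc = centered(z)
        total = 0.0
        for j in range(d):
            var = sum(row[j] ** 2 for row in zc) / (n - 1)
            total += max(0.0, 1.0 - math.sqrt(var + eps))
        return total / d

    def covariance_term(z):
        # Sum of squared off-diagonal covariance entries: decorrelates dims.
        zc = centered(z)
        total = 0.0
        for i in range(d):
            for j in range(d):
                if i != j:
                    c = sum(row[i] * row[j] for row in zc) / (n - 1)
                    total += c ** 2
        return total / d

    var = variance_term(pred) + variance_term(target)
    cov = covariance_term(pred) + covariance_term(target)
    return inv_w * inv + var_w * var + cov_w * cov, (inv, var, cov)


# Usage: spread-out, decorrelated embeddings vs. fully collapsed ones.
spread = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
collapsed = [[0.5, 0.5]] * 4
loss_spread, _ = vicreg_loss(spread, spread)
loss_collapsed, parts_collapsed = vicreg_loss(collapsed, collapsed)
print(loss_spread, loss_collapsed)
```

In this toy case the collapsed embeddings have zero invariance error yet still incur a large loss through the variance term, which is why the regularizers discourage the trivial constant solution.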

2. Masked Autoencoder (VideoMAE)

Standard VideoMAE ViT-tiny/16 architecture.
Temporal tube masking: all frames use the same spatial mask.
Objective: minimize the pixel-level reconstruction error in the masked regions.
Encoder outputs \(l/16 \times w/16 \times t/2 \times 384\).
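Tube masking and the masked-region objective can be sketched as follows. This is a simplified illustration rather than the VideoMAE code: `tube_mask` and `masked_mse` are hypothetical helper names, patches are represented as short lists of pixel values, and the masking ratio of 0.9 is an assumed default that the summary does not state.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=0):
    """Return mask[t][i][j], True where a patch is hidden.

    A single spatial mask is sampled once and reused for every frame,
    so each masked patch forms a "tube" through time.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked = set(rng.sample(range(num_patches), num_masked))
    spatial = [
        [(i * grid_w + j) in masked for j in range(grid_w)]
        for i in range(grid_h)
    ]
    # Copy the same spatial mask for each frame.
    return [[row[:] for row in spatial] for _ in range(num_frames)]


def masked_mse(pred, target, mask):
    """Mean squared pixel error computed over masked patches only."""
    total, count = 0.0, 0
    for t, frame in enumerate(mask):
        for i, row in enumerate(frame):
            for j, hidden in enumerate(row):
                if hidden:
                    p, q = pred[t][i][j], target[t][i][j]
                    total += sum((a - b) ** 2 for a, b in zip(p, q))
                    count += len(p)
    return total / count if count else 0.0


# Usage: 4 frames on a 4x4 patch grid, 75% of patches hidden.
mask = tube_mask(num_frames=4, grid_h=4, grid_w=4, mask_ratio=0.75)
target = [[[[0.0] for _ in range(4)] for _ in range(4)] for _ in range(4)]
pred = [[[[1.0] for _ in range(4)] for _ in range(4)] for _ in range(4)]
print(masked_mse(pred, target, mask))
```

Because the mask is identical in every frame, the model cannot recover a hidden patch by copying it from a neighboring frame, which is the usual motivation for tube masking.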

3. Physical Modeling Baselines

DISCO: In-context operator learning, which infers trajectory-specific evolution rules from a short context window, with an embedding dimension of \(1 \times 384\).
MPP: An autoregressive foundation model that performs pixel-level frame-by-frame prediction, utilizing the released pretrained AViT-tiny weights.

4. Fine-Tuning and Evaluation

The encoder is frozen, and an attentive probe is trained on top of it (100 epochs).
Evaluation task: Physical parameter regression (MSE loss).
As MPP was not pretrained on the target datasets, end-to-end fine-tuning is used instead.
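The probing setup can be pictured with a deliberately simplified forward pass: a learned query attends over the frozen encoder's tokens, and a linear head maps the pooled vector to the physical parameters. The names `attentive_pool` and `probe_predict` are hypothetical, and the sketch omits the key/value projections, multiple heads, and training loop that a real attentive probe would include.

```python
import math


def attentive_pool(tokens, query):
    """Single-query attention pooling over frozen encoder tokens.

    tokens: list of token vectors (lists of D floats) from the frozen encoder.
    query:  learned query vector of length D (a trainable probe parameter).
    """
    d = len(query)
    scores = [
        sum(q * x for q, x in zip(query, tok)) / math.sqrt(d) for tok in tokens
    ]
    # Numerically stable softmax over tokens.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    pooled = [
        sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)
    ]
    return pooled, weights


def probe_predict(tokens, query, W, b):
    """Linear regression head on the pooled representation."""
    pooled, _ = attentive_pool(tokens, query)
    return [
        sum(wj * pj for wj, pj in zip(row, pooled)) + bi
        for row, bi in zip(W, b)
    ]


# Usage: with a zero query the attention is uniform, so pooling is a mean.
tokens = [[1.0, 0.0], [3.0, 0.0]]
pooled, weights = attentive_pool(tokens, query=[0.0, 0.0])
prediction = probe_predict(tokens, [0.0, 0.0], W=[[1.0, 0.0]], b=[0.5])
print(pooled, weights, prediction)
```

Only the query and the head (`query`, `W`, `b` here) would be trained against the MSE regression loss; the encoder that produces `tokens` stays fixed.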

Evaluated Physical Systems (from The Well dataset)

Active Matter: Collective dynamics of active rod-like particles in a Stokes fluid. Parameters: \(\alpha\) (active dipole strength), \(\zeta\) (particle alignment strength).
Rayleigh-Bénard Convection: A horizontal fluid layer heated from below and cooled from above, forming convection cells. Parameters: Rayleigh number \(Ra\), Prandtl number \(Pr\).
Shear Flow: The boundary between fluid layers flowing in parallel at different velocities. Parameters: Reynolds number, Schmidt number.

Key Experimental Results

Physical Parameter Estimation MSE (\(\downarrow\))

Method	Active Matter	Shear Flow	Rayleigh-Bénard
JEPA	0.079	0.38	0.13
VideoMAE	0.160	0.67	0.18
DISCO	0.057	0.13	0.01
MPP (End-to-End Fine-Tuning)	0.230	0.59	0.08

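JEPA vs. VideoMAE: JEPA reduces MSE by 51% on Active Matter, 43% on Shear Flow, and 28% on Rayleigh-Bénard.
JEPA vs. DISCO (a domain-specific physical modeling method): JEPA comes close on Active Matter, with a gap of only 0.022, but DISCO remains well ahead on Shear Flow (0.13 vs. 0.38) and Rayleigh-Bénard (0.01 vs. 0.13).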
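The percentages and gaps quoted for this table follow from simple arithmetic on the reported MSE values. A short script, with hypothetical helper names `relative_reduction` and `absolute_gap`, makes the calculation explicit:

```python
SYSTEMS = ["Active Matter", "Shear Flow", "Rayleigh-Bénard"]

# MSE values copied from the table above (lower is better).
MSE = {
    "JEPA": [0.079, 0.38, 0.13],
    "VideoMAE": [0.160, 0.67, 0.18],
    "DISCO": [0.057, 0.13, 0.01],
    "MPP": [0.230, 0.59, 0.08],
}


def relative_reduction(method, baseline):
    """Percentage by which `method` lowers MSE relative to `baseline`."""
    return {
        s: 100.0 * (b - m) / b
        for s, m, b in zip(SYSTEMS, MSE[method], MSE[baseline])
    }


def absolute_gap(method, reference):
    """MSE of `method` minus MSE of `reference`, per system."""
    return {
        s: round(m - r, 3)
        for s, m, r in zip(SYSTEMS, MSE[method], MSE[reference])
    }


gains = relative_reduction("JEPA", "VideoMAE")
gaps = absolute_gap("JEPA", "DISCO")
for system in SYSTEMS:
    print(f"{system}: {gains[system]:.0f}% lower than VideoMAE, "
          f"{gaps[system]:+.3f} vs. DISCO")
```

Rounded to whole percentages, the reductions relative to VideoMAE are 51, 43, and 28, and the absolute gaps to DISCO are 0.022, 0.25, and 0.12.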

Data Efficiency (Shear Flow)

Fine-tuning Data Ratio	JEPA	VideoMAE
10%	0.57	0.98
50%	0.40	0.75
100%	0.38	0.67

JEPA using only 10% of the data (0.57) already outperforms VideoMAE using 100% of the data (0.67).

Highlights & Insights

Latent-Space Prediction \(\gg\) Pixel-Level Reconstruction: The core finding is simple yet powerful; JEPA significantly outperforms VideoMAE across all systems.
Not All Physical Modeling Methods are Superior: MPP (an autoregressive model) trails general-purpose JEPA on Active Matter (0.230 vs. 0.079) and Shear Flow (0.59 vs. 0.38), although it is ahead on Rayleigh-Bénard (0.08 vs. 0.13). This is broadly in line with findings in NLP, where autoregressive models tend to underperform encoder-based models on non-generative tasks.
Superb Data Efficiency: JEPA scales better with fine-tuning data than VideoMAE; on Shear Flow, JEPA with 10% of the data (0.57) already beats VideoMAE with 100% (0.67).
Both DISCO and JEPA are Latent-Space Prediction Models: The two strongest methods overall both operate in the latent space, which suggests a shared design principle.
Code Available: Complete experimental code is provided.

Limitations & Future Work

Evaluation is limited to only three physical systems (all of which are 2D PDE systems).
The JEPA encoder utilizes a 3D CNN; a ViT-based JEPA architecture was not explored.
Downstream tasks are confined to parameter estimation, with other scientific tasks like classification (e.g., laminar vs. turbulent flow) remaining unexplored.
DISCO significantly outperforms JEPA on Rayleigh-Bénard (\(MSE = 0.01\) vs. \(0.13\)), indicating that physical inductive biases remain critical in certain systems.
Analysis of what physical information is actually captured by the JEPA-learned representations is not provided.

Comparison with Related Work

vs. VideoMAE: JEPA outperforms it on all three systems, supporting the view that latent-space prediction is better suited than pixel-level reconstruction for extracting physical information.
vs. MPP (Autoregressive Foundation Model): General-purpose JEPA outperforms end-to-end fine-tuned MPP on Active Matter and Shear Flow (but not on Rayleigh-Bénard), suggesting that the next-frame prediction paradigm is not necessarily optimal for representation learning.
vs. DISCO: Domain-specific physical modeling methods still hold an advantage on every system. The gap for general-purpose JEPA is small on Active Matter (0.022) but large on Rayleigh-Bénard (0.13 vs. 0.01).
vs. V-JEPA/I-JEPA: The JEPA variant in this work is specifically designed for physical systems, adopting a 3D CNN rather than a ViT.

Insights & Connections

Crucial takeaway for the Scientific ML community: Pursuing the most accurate surrogate model may not be necessary; the latent-space prediction paradigm might be superior for downstream tasks.
Consistent with LeCun’s JEPA philosophy: Abstract representations are more effective than pixel-level reconstruction.
The trade-off between autoregressive and encoder-based models shows similar trends in both AI for Physics and NLP.
Offers a new direction for the design of foundation models in physical systems: potentially utilizing JEPA rather than an autoregressive paradigm.

Rating

Novelty: ⭐⭐⭐⭐ (First systematic comparison of self-supervised paradigms for parameter estimation in physical systems)
Experimental Thoroughness: ⭐⭐⭐ (3 systems × 4 methods, though downstream tasks are relatively limited)
Writing Quality: ⭐⭐⭐⭐⭐ (Concise and clear, with key findings presented straightforwardly)
Value: ⭐⭐⭐⭐ (Strong guiding significance for the Scientific ML community)