PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement
Conference: CVPR 2026
arXiv: 2509.24850
Code: GitHub
Area: Human Understanding
Keywords: rPPG, physics-informed network, temporal convolutional network, hemodynamics, Navier-Stokes, lightweight model
TL;DR
Starting from the Navier-Stokes equations, this work derives through rigorous mathematical analysis that rPPG pulse signals obey a second-order damped harmonic oscillator model whose discrete solution is equivalent to a causal convolution operator, thereby providing a first-principles justification for the TCN architecture. The resulting PHASE-Net, with only 0.29M parameters, achieves state-of-the-art performance across multiple datasets.
Background & Motivation
Background: Remote photoplethysmography (rPPG) extracts physiological signals such as heart rate by capturing subtle changes in subcutaneous blood volume via ordinary cameras, and is a key technology for contactless physiological monitoring. Deep learning methods (PhysNet, PhysFormer, RhythmMamba, etc.) have become the dominant paradigm.
Limitations of Prior Work:
- Most existing deep learning models are designed heuristically—treating rPPG as a generic spatiotemporal signal processing task, with architecture choices relying on empirical trial and error.
- The lack of physical theoretical grounding means models may overfit dataset-specific noise patterns, resulting in poor cross-domain generalization.
- Artifacts from head motion and illumination changes are far stronger than genuine pulse signals, and "black-box" models cannot provide reliability guarantees.
Key Challenge: Deep models deliver high accuracy but offer neither physical interpretability nor theoretical guarantees, because their architectures are not tied to the dynamics of the signal they model.
Goal: Can one design an rPPG model whose architecture directly embodies the physical laws governing the signal, starting from first principles?
Key Insight: Derive blood flow pulse dynamics from the Navier-Stokes equations and rigorously prove that TCN is the physically correct architectural choice.
Core Idea: The physical dynamics of rPPG signals are equivalent to causal convolution; therefore, TCN is not a heuristic choice but a physical inevitability.
Method
Overall Architecture
Visual encoder (3 EST Blocks, each containing a ZAS module) extracts spatiotemporal features → Adaptive Spatial Filter (ASF) generates spatial attention masks, aggregates features, and computes temporal differences → Gated Temporal Convolutional Network (GTCN) models long-range temporal dynamics → rPPG waveform output.
Key Designs
- Physics Derivation Chain: From Navier-Stokes to TCN
  - Starting point: The Beer-Lambert law establishes a linear relationship between pixel variation \(\Delta I(t)\) and subcutaneous blood volume \(\Delta V(t)\); vascular compliance further links \(\Delta V(t)\) to local blood pressure pulsation \(z(t)\).
  - Linearizing the Navier-Stokes equations → 1D momentum + continuity equations → eliminating velocity variables yields a damped wave equation: \(\frac{\partial^2 p'}{\partial t^2} + \alpha \frac{\partial p'}{\partial t} = c^2 \frac{\partial^2 p'}{\partial x^2}\)
  - At a fixed observation point \(x_0\), this reduces to a second-order ODE (damped harmonic oscillator): \(\ddot{z} + \alpha \dot{z} + \omega^2 z = u(t)\)
  - Semi-implicit Euler discretization → LTI state-space model → Proposition 1 proves its solution is a causal convolution \(z_t = \sum_{m=0}^{\infty} g[m] \cdot a_{t-m}\), with impulse response \(g[m]\) and discrete input \(a_t\) → Proposition 2 proves that a finite-length (FIR) filter can approximate this infinite (IIR) response to arbitrary precision \(\varepsilon\) → a TCN, which stacks finite causal convolutions, is the computational realization of this physical process up to the truncation error \(\varepsilon\).
  - Significance: According to the authors, this is the first complete logical chain from hemodynamic first principles to a specific network architecture.
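The last two steps of the chain above can be checked numerically. The sketch below is not the paper's code: it discretizes \(\ddot{z} + \alpha \dot{z} + \omega^2 z = u(t)\) with one common form of semi-implicit Euler (velocity first, then position), reads off the impulse response \(g[m]\), and compares the recurrence against a causal convolution with the full kernel and with a truncated FIR kernel. The values of `ALPHA`, `OMEGA`, `DT`, the random input, and the 300-tap cutoff are all assumptions chosen for illustration.

```python
import math
import random

def simulate(u, alpha, omega, dt):
    """Roll out z'' + alpha*z' + omega^2*z = u from rest with semi-implicit Euler."""
    z, v, out = 0.0, 0.0, []
    for u_t in u:
        out.append(z)                                  # z_t only sees inputs before step t
        v += dt * (u_t - alpha * v - omega ** 2 * z)   # velocity update uses the old position
        z += dt * v                                    # position update uses the new velocity
    return out

def impulse_response(n, alpha, omega, dt):
    """g[0..n-1]: response of the recurrence to a unit impulse at step 0."""
    return simulate([1.0] + [0.0] * (n - 1), alpha, omega, dt)

def causal_conv(g, u):
    """z_t = sum_m g[m] * u[t-m], using only the taps present in g (an FIR filter)."""
    return [sum(g[m] * u[t - m] for m in range(min(t + 1, len(g))))
            for t in range(len(u))]

# Hypothetical settings: 30 fps video, 1.2 Hz (72 bpm) oscillator, mild damping.
ALPHA, OMEGA, DT = 2.0, 2 * math.pi * 1.2, 1 / 30

random.seed(0)
u = [random.uniform(-1.0, 1.0) for _ in range(600)]   # arbitrary driving input

z_rec = simulate(u, ALPHA, OMEGA, DT)
g = impulse_response(len(u), ALPHA, OMEGA, DT)
z_full = causal_conv(g, u)         # all taps: should reproduce the recurrence
z_fir = causal_conv(g[:300], u)    # truncated to 300 taps (10 s at 30 fps)

err_full = max(abs(a - b) for a, b in zip(z_rec, z_full))
err_fir = max(abs(a - b) for a, b in zip(z_rec, z_fir))
print(f"full-kernel error: {err_full:.2e}, 300-tap FIR error: {err_fir:.2e}")
```

Because the damping makes \(g[m]\) decay geometrically, the truncation error shrinks as more taps are kept, which is the intuition behind the FIR approximation result.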
- Zero-FLOPs Axial Swapper (ZAS)
  - Applies intra-block spatial transposition to the last \(k = \lfloor pC \rfloor\) channels of the feature map (the \(H \times W\) plane is divided into \(b \times b\) blocks and each block is transposed as a matrix), leaving the remaining channels unchanged.
  - Key properties: self-inverse (ZAS(ZAS(\(X\))) = \(X\), guaranteeing invertibility and gradient stability); energy-preserving (\(\|\)ZAS(\(X\))\(\|_2 = \|X\|_2\), 1-Lipschitz to prevent signal amplification).
  - Design Motivation: Injects cross-region spatial interaction with zero FLOPs and zero parameters, enhancing feature mixing across distant facial regions.
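A minimal sketch of the ZAS operation as described above, written with nested lists so it runs without dependencies. It assumes that "\(b \times b\) blocks" means tiles of \(b\) by \(b\) pixels and that \(H\) and \(W\) are divisible by \(b\); the function names `zas` and `l2_norm` and the default values of `p` and `b` are illustrative, not taken from the paper.

```python
import math

def zas(x, p=0.5, b=2):
    """Transpose every b-by-b tile in the last floor(p*C) channels of x ([C][H][W])."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    k = math.floor(p * C)
    out = [[row[:] for row in ch] for ch in x]       # copy; untouched channels stay as-is
    for c in range(C - k, C):
        for i0 in range(0, H, b):                    # top-left corner of each tile
            for j0 in range(0, W, b):
                for i in range(b):
                    for j in range(b):
                        out[c][i0 + i][j0 + j] = x[c][i0 + j][j0 + i]
    return out

def l2_norm(x):
    """Euclidean norm over all entries (fsum keeps the result order-independent)."""
    return math.sqrt(math.fsum(v * v for ch in x for row in ch for v in row))

# One 4x4 channel, fully swapped (p=1.0) with 2x2 tiles.
demo = [[[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]]
swapped = zas(demo, p=1.0, b=2)
print(swapped[0])
print(zas(swapped, p=1.0, b=2) == demo)              # self-inverse check
print(math.isclose(l2_norm(swapped), l2_norm(demo))) # energy-preservation check
```

The operation only moves values to new positions, so the self-inverse and norm-preserving properties follow from it being a permutation whose square is the identity. In a tensor framework the same effect would typically be an indexing or reshape-and-transpose view rather than an explicit loop.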
- Adaptive Spatial Filter (ASF)
  - For each frame, a lightweight convolution generates a spatial logit map → spatial softmax normalization produces an attention mask \(M_t\) → weighted aggregation over the spatial dimensions yields a 1D feature vector \(\mathbf{z}_t\).
  - Simultaneously computes a first-order temporal difference \(\mathbf{v}_t = \mathbf{z}_t - \mathbf{z}_{t-1}\) to encode pulse "velocity."
  - Output = \([\mathbf{z}_t, \mathbf{v}_t]\) channel concatenation, retaining both spatially purified intensity information and short-term temporal variation.
  - Design Motivation: The forehead and cheeks have high SNR while other regions are dominated by noise, so global average pooling (GAP), which weights every pixel equally, is suboptimal.
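The ASF steps above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the "lightweight convolution" is stood in for by a hypothetical 1x1 convolution with weights `w`, and the first frame's difference is set to zero because \(\mathbf{z}_{-1}\) does not exist.

```python
import math

def spatial_softmax(logits):
    """Softmax over all H*W positions of a 2D logit map; the result sums to 1."""
    peak = max(v for row in logits for v in row)      # subtract max for stability
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def asf(frames, w):
    """frames: [T][C][H][W]; w: length-C weights of an assumed 1x1 conv.

    Returns [T][2C]: per-frame [z_t, v_t] with v_t = z_t - z_{t-1}.
    """
    outputs, z_prev = [], None
    for f in frames:
        C, H, W = len(f), len(f[0]), len(f[0][0])
        logits = [[sum(w[c] * f[c][i][j] for c in range(C)) for j in range(W)]
                  for i in range(H)]
        mask = spatial_softmax(logits)                # attention mask M_t
        z = [sum(mask[i][j] * f[c][i][j] for i in range(H) for j in range(W))
             for c in range(C)]                       # mask-weighted spatial aggregation
        v = [0.0] * C if z_prev is None else [a - b for a, b in zip(z, z_prev)]
        outputs.append(z + v)                         # channel concatenation [z_t, v_t]
        z_prev = z
    return outputs

# Two frames, two channels, 2x2 spatial grid (toy values).
frames = [
    [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [0.0, 1.0]]],
    [[[2.0, 2.0], [2.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]],
]
uniform = asf(frames, [0.0, 0.0])    # zero weights: uniform mask, same as GAP
focused = asf(frames, [10.0, 0.0])   # large weight: mask concentrates on one pixel
print(uniform)
print(focused[0])
```

With zero weights the mask is uniform and the module falls back to global average pooling, which makes the contrast with a learned, region-selective mask easy to see.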
- Gated Temporal Convolutional Network (GTCN)
  - Dual-path causal dilated TCN: one path uses a tanh activation and the other a sigmoid gate; the two outputs are fused by element-wise multiplication.
  - Physical significance: Implements the causal convolution operation derived in Propositions 1 & 2, modeling long-range temporal dynamics.
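A single-channel sketch of the gating described above, \(y = \tanh(w_f * x) \odot \sigma(w_g * x)\) with causal dilated convolutions. The kernel weights, the dilation schedule (1, 2, 4, 8), and the absence of residual connections are assumptions made for brevity; the summary does not specify them.

```python
import math

def causal_dilated_conv(x, w, dilation):
    """y[t] = sum_k w[k] * x[t - k*dilation]; samples before t=0 count as zero."""
    return [sum(w[k] * x[t - k * dilation]
                for k in range(len(w)) if t - k * dilation >= 0)
            for t in range(len(x))]

def gated_block(x, w_filter, w_gate, dilation):
    """tanh(filter path) * sigmoid(gate path), fused element-wise."""
    f = causal_dilated_conv(x, w_filter, dilation)
    g = causal_dilated_conv(x, w_gate, dilation)
    return [math.tanh(a) / (1.0 + math.exp(-b)) for a, b in zip(f, g)]

def gtcn(x, layers):
    """Stack gated blocks; layers is a list of (w_filter, w_gate, dilation)."""
    for w_filter, w_gate, dilation in layers:
        x = gated_block(x, w_filter, w_gate, dilation)
    return x

def receptive_field(kernel_size, dilations):
    """Number of past-and-present samples that can influence one output."""
    return 1 + (kernel_size - 1) * sum(dilations)

# Hypothetical weights and dilation schedule.
layers = [([0.5, -0.3, 0.2], [0.1, 0.4, -0.2], d) for d in (1, 2, 4, 8)]
x = [0.1 * i for i in range(40)]
y = gtcn(x, layers)

# Causality check: changing samples from t=20 onward leaves earlier outputs untouched.
x_changed = x[:20] + [5.0] * 20
y_changed = gtcn(x_changed, layers)
print(y[:20] == y_changed[:20])
print(receptive_field(3, [1, 2, 4, 8]))
```

Doubling the dilation at each layer grows the receptive field exponentially with depth while the parameter count grows only linearly, which is how a small TCN can cover the long kernel that the truncated impulse response requires.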
Loss & Training
Negative Pearson correlation loss: \(\mathcal{L}_{\text{pred}} = -\frac{\sum_t (\hat{y}_t - \bar{\hat{y}})(y_t - \bar{y})}{\sqrt{\sum_t (\hat{y}_t - \bar{\hat{y}})^2 \sum_t (y_t - \bar{y})^2}}\), directly optimizing morphological similarity between the predicted waveform and the ground truth.
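The loss above translates directly into code. The sketch below is a plain-Python version for a single waveform pair; the small `eps` term guarding against a zero denominator is an added assumption, not part of the formula.

```python
import math

def neg_pearson_loss(pred, target, eps=1e-12):
    """Negative Pearson correlation between two equal-length waveforms."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return -cov / math.sqrt(var_p * var_t + eps)   # eps avoids division by zero

# Toy ground-truth pulse and three candidate predictions.
y = [math.sin(0.3 * t) for t in range(100)]
print(neg_pearson_loss(y, y))                         # perfectly correlated
print(neg_pearson_loss([2 * v + 3 for v in y], y))    # rescaled and shifted copy
print(neg_pearson_loss([-v for v in y], y))           # inverted waveform
```

The loss is unchanged by positive rescaling or offset of the prediction, so it rewards matching waveform shape and timing rather than absolute amplitude.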
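Key Experimental Results
Main Results (In-Domain Evaluation)
Errors are in bpm (lower is better). Params is a qualitative size class for the baselines and the reported parameter count for PHASE-Net.
| Method | UBFC MAE↓ | UBFC RMSE↓ | PURE MAE↓ | PURE RMSE↓ | BUAA MAE↓ | MMPD MAE↓ | Params |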
|---|---|---|---|---|---|---|---|
| PhysNet | 2.95 | 3.67 | 2.10 | 2.60 | 10.89 | 4.80 | Large |
| PhysFormer | 0.92 | 2.46 | 1.10 | 1.75 | 8.45 | 11.99 | Large |
| RhythmFormer | 0.50 | 0.78 | 0.27 | 0.47 | 9.19 | 4.69 | Medium |
| Contrast-Phys+ | 0.21 | 0.80 | 0.48 | 0.98 | - | - | Medium |
| Style-rPPG | 0.17 | 0.41 | 0.39 | 0.62 | - | - | Medium |
| LST-rPPG | 0.16 | 0.57 | 0.32 | 0.62 | - | - | Medium |
| PHASE-Net | 0.15 | 0.53 | 0.14 | 0.35 | 5.89 | 4.78 | 0.29M |
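Cross-Domain Generalization (Leave-One-Out)
Each model is trained on the other three datasets and tested on the held-out one (U = UBFC, P = PURE, B = BUAA, M = MMPD); MAE is in bpm.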
| Method | Others→U MAE↓ | Others→P MAE↓ | Others→B MAE↓ | Others→M MAE↓ |
|---|---|---|---|---|
| PhysFormer | 10.29 | 19.75 | 22.09 | 13.90 |
| RhythmFormer | 14.71 | 21.11 | 6.04 | 16.14 |
| EfficientPhys | 12.87 | 7.15 | 32.30 | 12.87 |
| PHASE-Net | 10.04 | 2.86 | - | - |
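Key Findings
- MAE of 0.14 bpm on PURE, roughly half that of RhythmFormer (0.27), the best baseline in the table; the authors attribute the gain to physical priors acting as an inductive bias.
- With only 0.29M parameters, PHASE-Net has the lowest MAE on UBFC, PURE, and BUAA and is within 0.09 bpm of the best MMPD result (RhythmFormer, 4.69); its UBFC RMSE (0.53) trails Style-rPPG (0.41).
- Cross-domain Others→PURE MAE of 2.86 bpm, well below EfficientPhys (7.15), PhysFormer (19.75), and RhythmFormer (21.11); on Others→UBFC the margin over PhysFormer is small (10.04 vs. 10.29). The authors credit physical priors for the improved generalization.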
- On challenging datasets such as BUAA and MMPD, methods like PhysFormer exhibit negative correlation (\(R < 0\)), while PHASE-Net maintains positive correlation.
Highlights & Insights
- First derivation of an rPPG network architecture from first principles: The complete mathematical proof chain from Navier-Stokes → ODE → SSM → causal convolution → TCN elevates architectural selection from empiricism to physical inevitability.
- ZAS zero-FLOPs module: Pure permutation operations enhance cross-region feature interaction; the mathematical proofs of self-inverse and energy-preserving properties elegantly guarantee training stability.
- Temporal difference design in ASF: Unifies spatial aggregation and temporal differentiation in a single module, providing the downstream physical model with complete state information in the form of "position + velocity."
- Paradigm of theoretical rigor combined with engineering minimalism: 0.29M parameters demonstrate that well-chosen inductive biases can dramatically reduce model complexity.
Limitations & Future Work
- The physical derivation relies on several simplifying assumptions (laminar flow, linearization, single-point observation, elastic restoring force approximation), which may not hold under extreme motion or atypical vascular conditions.
- The block size \(b\) and channel ratio \(p\) of ZAS require manual specification, lacking an adaptive mechanism.
- Validation on large-scale in-the-wild datasets such as VIPL-HR has not been performed.
- The cross-domain generalization table has missing entries (Others→B, Others→M), leaving the evaluation incomplete.
Related Work & Insights
- vs. PhysFormer/RhythmMamba: These methods model temporal sequences with Transformers/SSMs, representing general-purpose sequence models; PHASE-Net demonstrates from a physical perspective that causal convolution (TCN) is the correct computational primitive for rPPG.
- vs. conventional PINN paradigm: Classical PINNs embed physical equations into the loss function; PHASE-Net's innovation lies in using physical laws to constrain the network architecture itself—"physics determines structure" rather than "physics constrains training."
- Implication: This methodology of "deriving network structure from PDEs" is generalizable to other signal processing tasks with well-defined physical models (e.g., seismic waves, acoustic signals).
Rating
- Novelty: ⭐⭐⭐⭐⭐ First derivation of an rPPG network architecture from first principles; paradigm-level significance.
- Experimental Thoroughness: ⭐⭐⭐⭐ In-domain and cross-domain evaluation on 4 datasets with comprehensive ablations, though some cross-domain entries are missing.
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous derivations and a clear, coherent logical chain from physics to architecture.
- Value: ⭐⭐⭐⭐⭐ The physics-driven architecture design paradigm has broad applicability; the extreme efficiency of 0.29M parameters is well-suited for deployment.