rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training

Conference: CVPR 2026
arXiv: 2604.11156
Code: https://github.com/Tianyang-Dai/rPPG-VQA
Area: Human Understanding
Keywords: Remote Photoplethysmography, Video Quality Assessment, Unsupervised Learning, Multimodal Large Language Models, Data Curation

TL;DR

rPPG-VQA proposes the first video quality assessment framework tailored for remote photoplethysmography (rPPG), i.e., contactless heart rate measurement from video. It combines a signal-level multi-method consensus SNR with scene-level disturbance recognition by a multimodal large language model (MLLM), and uses a two-stage adaptive sampling strategy to curate in-the-wild training data.

Background & Motivation

Background: Unsupervised rPPG aims to learn contactless heart rate detection from unannotated video data, but existing research focuses primarily on methodological innovation while neglecting data quality issues.

Limitations of Prior Work: (1) Motion, illumination, and other noise in in-the-wild videos may overwhelm weak physiological signals; (2) AI-generated videos lack any real physiological basis; (3) conventional VQA evaluates human perceptual quality, which is misaligned with rPPG requirements; (4) a single SNR metric is easily fooled by periodic non-physiological signals (e.g., strobe lights).

Key Challenge: Videos with high visual quality may contain no extractable physiological signal, while visually degraded videos may still carry valid signals — a distinction that conventional VQA cannot make.

Core Idea: A dual-branch assessment — the signal-level branch applies multi-method consensus SNR to eliminate algorithmic bias, while the scene-level branch employs an MLLM to identify disturbances such as motion and illumination variation.

Method

Overall Architecture

In-the-wild video input → Signal-level branch (multiple rPPG algorithms extract signals → consensus SNR score) + Scene-level branch (MLLM evaluates motion/illumination/compression disturbances) → Fusion into a unified quality score → Two-stage adaptive sampling → Target training set construction.
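The summary does not state how the two branch outputs are fused, so the following is only a minimal sketch under stated assumptions: the consensus SNR (in dB) is squashed to [0, 1] with a logistic function, and the result is discounted by a disturbance score in [0, 1]. The function name `fuse_quality` and the parameters `snr_midpoint_db` and `snr_scale_db` are hypothetical and do not come from the paper.

```python
import math


def fuse_quality(consensus_snr_db: float, disturbance: float,
                 snr_midpoint_db: float = 0.0, snr_scale_db: float = 3.0) -> float:
    """Hypothetical fusion rule (an assumption, not the paper's formula).

    consensus_snr_db: signal-level score in dB (higher is better).
    disturbance: scene-level score in [0, 1] (higher is worse).
    """
    if not 0.0 <= disturbance <= 1.0:
        raise ValueError("disturbance must lie in [0, 1]")
    # Logistic squashing maps an unbounded dB value to (0, 1).
    signal_quality = 1.0 / (1.0 + math.exp(-(consensus_snr_db - snr_midpoint_db) / snr_scale_db))
    # A heavily disturbed scene pulls the unified score toward zero.
    return signal_quality * (1.0 - disturbance)
```

A multiplicative rule is chosen here only because it lets either branch veto a video; a weighted sum would be an equally plausible reading of "fusion".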

Key Designs

Signal-Level Multi-Method Consensus SNR:
- Function: Evaluates the integrity of physiological signals in a video while eliminating single-algorithm bias.
- Mechanism: Multiple traditional rPPG algorithms (GREEN, ICA, CHROM, POS, etc.) independently extract a signal from the same video, and an SNR is estimated for each. If a genuine physiological signal is present, all methods should yield consistently high SNR (a method-agnostic property); disagreement between methods indicates an unreliable signal.
- Design Motivation: A single SNR metric is susceptible to periodic noise (e.g., strobe lights producing heartbeat-like signals); multi-method consensus filters out such false positives.
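The consensus idea can be sketched as follows. This is an illustration under several assumptions, not the paper's implementation: the input is a `(T, 3)` array of spatially averaged RGB skin traces, GREEN, CHROM, and POS are implemented in simplified single-window form (no band-pass filtering or overlap-add), the SNR is computed around the spectral peak of each extracted signal because no reference heart rate is available, and the consensus rule (mean minus standard deviation across methods) is a hypothetical choice that penalizes disagreement.

```python
import numpy as np


def green(rgb):
    """GREEN: zero-centered mean green-channel trace."""
    g = rgb[:, 1]
    return g - g.mean()


def chrom(rgb):
    """CHROM, simplified to a single window without band-pass filtering."""
    n = rgb / rgb.mean(axis=0)  # temporal normalization per channel
    x = 3 * n[:, 0] - 2 * n[:, 1]
    y = 1.5 * n[:, 0] + n[:, 1] - 1.5 * n[:, 2]
    s = x - (x.std() / y.std()) * y
    return s - s.mean()


def pos(rgb):
    """POS, simplified to a single window without overlap-add."""
    n = rgb / rgb.mean(axis=0)
    s1 = n[:, 1] - n[:, 2]
    s2 = -2 * n[:, 0] + n[:, 1] + n[:, 2]
    h = s1 + (s1.std() / s2.std()) * s2
    return h - h.mean()


def snr_db(signal, fs, band=(0.7, 4.0), half_width=0.1):
    """Power near the in-band spectral peak and its first harmonic vs. the rest of the band."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    peak = freqs[in_band][np.argmax(power[in_band])]
    near_peak = (np.abs(freqs - peak) <= half_width) | (np.abs(freqs - 2 * peak) <= half_width)
    sig = power[in_band & near_peak].sum()
    noise = power[in_band & ~near_peak].sum()
    return 10 * np.log10(sig / noise)


def consensus_snr(rgb, fs, methods=(green, chrom, pos)):
    """Hypothetical consensus rule: mean SNR minus the spread across methods."""
    scores = np.array([snr_db(m(rgb), fs) for m in methods])
    return scores.mean() - scores.std(), scores
```

With this rule, a video scores well only when every method finds a strong periodic component; a high SNR from one method alone is offset by the spread term.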
Scene-Level MLLM Disturbance Recognition:
- Function: Identifies scene-level disturbances that signal-level metrics cannot capture.
- Mechanism: An MLLM performs human-like scene reasoning over video frames to detect complex disturbances such as unstable illumination, severe motion, and camera artifacts, producing a disturbance score.
- Design Motivation: Signal-level metrics cannot determine whether a periodic signal has a physiological origin, and they lack the scene context needed to differentiate genuine biosignals from confounding artifacts.
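One way the scene-level branch could be wired up is sketched below. Everything here is an assumption for illustration: the prompt wording, the JSON reply format, the three disturbance keys, the max-aggregation rule, and the `query_mllm` callable, which stands in for whatever MLLM client is actually used and is not a real library API.

```python
import json
from typing import Callable, Sequence

# Hypothetical disturbance categories, taken from the examples in the text above.
DISTURBANCES = ("unstable_illumination", "severe_motion", "camera_artifacts")

PROMPT = (
    "You are assessing a face video for remote photoplethysmography. "
    "Rate each disturbance from 0 (absent) to 1 (severe) and reply with JSON only, "
    "using exactly these keys: " + ", ".join(DISTURBANCES) + "."
)


def disturbance_score(frames: Sequence[bytes],
                      query_mllm: Callable[[str, Sequence[bytes]], str]) -> float:
    """Ask a (hypothetical) MLLM client for per-disturbance ratings and aggregate them."""
    reply = query_mllm(PROMPT, frames)
    ratings = json.loads(reply)
    # Clip each rating to [0, 1] in case the model answers out of range.
    values = [min(1.0, max(0.0, float(ratings[key]))) for key in DISTURBANCES]
    # Assumed aggregation: the worst single disturbance dominates.
    return max(values)
```

Passing the client as a callable keeps the sketch independent of any particular model and makes it easy to test with a stub.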
Two-Stage Adaptive Sampling (TAS):
- Function: Constructs an optimal training set from a large-scale uncurated video pool.
- Mechanism: Stage 1 applies quality thresholds to filter low-quality videos; Stage 2 employs duration-aware probabilistic sampling to balance quality, diversity, and efficiency.
- Design Motivation: Naive filtering may yield insufficiently diverse training sets; probabilistic sampling maintains data diversity while ensuring quality.
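The two stages can be sketched as below. The summary does not give the sampling distribution, so this sketch assumes that quality scores lie in [0, 1], that the sampling weight is quality times duration, and that sampling stops once a total-duration budget is reached. The function name `two_stage_sample` and the `budget_s` parameter are hypothetical.

```python
import numpy as np


def two_stage_sample(quality, duration_s, threshold, budget_s, seed=0):
    """Return indices of selected videos (illustrative, not the paper's exact rule)."""
    quality = np.asarray(quality, dtype=float)
    duration_s = np.asarray(duration_s, dtype=float)

    # Stage 1: hard quality threshold removes clearly unusable videos.
    kept = np.flatnonzero(quality >= threshold)
    if kept.size == 0:
        return []

    # Stage 2: duration-aware probabilistic sampling without replacement.
    # Assumed weight: quality * duration; the epsilon keeps every weight positive.
    weights = quality[kept] * duration_s[kept] + 1e-12
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    order = rng.choice(kept, size=kept.size, replace=False, p=probs)

    chosen, total = [], 0.0
    for idx in order:
        if total >= budget_s:
            break
        chosen.append(int(idx))
        total += duration_s[idx]
    return chosen
```

Because the second stage is random rather than a strict top-k ranking, lower-scoring videos that pass the threshold still have a chance of being selected, which is the diversity argument made in the design motivation.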

Loss & Training

The curated training set is used to train existing unsupervised rPPG methods (e.g., ContrastPhys, SiNC), validating the effectiveness of the VQA framework.

Key Experimental Results

Main Results

| Sampling Strategy | HR MAE on PURE Test Set | Note |
| --- | --- | --- |
| All (full data) | High error | Low-quality data degrades training |
| Random | Moderate error | Better than using all data |
| rPPG-VQA | Lowest error | Quality curation is highly effective |
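For readers unfamiliar with the metric, HR MAE is the mean absolute error between predicted and reference heart rates in beats per minute. The sketch below shows one common way to compute it, assuming the heart rate is read off the strongest spectral peak of the predicted rPPG signal within 0.7 to 4.0 Hz; the paper's exact evaluation protocol (window length, filtering, peak detection) is not given in this summary, and the function names are hypothetical.

```python
import numpy as np


def estimate_hr_bpm(signal, fs, band=(0.7, 4.0)):
    """Heart rate from the strongest spectral peak inside the assumed pulse band."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return 60.0 * freqs[in_band][np.argmax(power[in_band])]


def hr_mae(predicted_bpm, reference_bpm):
    """Mean absolute error between predicted and reference heart rates (bpm)."""
    predicted = np.asarray(predicted_bpm, dtype=float)
    reference = np.asarray(reference_bpm, dtype=float)
    return float(np.mean(np.abs(predicted - reference)))
```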

Ablation Study

| Configuration | HR MAE | Note |
| --- | --- | --- |
| Signal-level + Scene-level | Best | Dual branches are complementary |
| Signal-level only | Second best | Scene disturbances overlooked |
| Scene-level only | Moderate | Signal quality assessment missing |

Key Findings

- Training on all in-the-wild videos performs worse than training on a quality-curated subset.
- The dual-branch design shows clear complementary effects; each branch alone has blind spots.
- The TAS strategy maintains training set diversity while ensuring data quality.

Highlights & Insights

- First systematic study of data quality for rPPG: Addresses the data-side gap in unsupervised rPPG research.
- Method-agnostic signal quality metric: Multi-algorithm consensus reduces single-algorithm bias, and the idea may transfer to other signal processing tasks.

Limitations & Future Work

- MLLM inference incurs substantial computational cost.
- Quality threshold selection requires manual tuning.
- Future work may explore end-to-end quality-aware training frameworks.

- vs. Conventional VQA (PSNR/SSIM): These metrics are designed for human perceptual quality and are misaligned with rPPG requirements.
- vs. Post-hoc signal evaluation: Such evaluation requires prior signal extraction and cannot pre-filter raw videos.

Rating

Novelty: ⭐⭐⭐⭐ First systematic treatment of data quality for rPPG
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparison across multiple sampling strategies
Writing Quality: ⭐⭐⭐⭐ Problem formulation is precise and well-motivated
Value: ⭐⭐⭐⭐ Unlocks the potential of in-the-wild data for unsupervised rPPG