rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training

Conference: CVPR 2026 arXiv: 2604.11156 Code: https://github.com/Tianyang-Dai/rPPG-VQA Area: Human Understanding Keywords: Remote Photoplethysmography, Video Quality Assessment, Unsupervised Learning, Multimodal Large Language Models, Data Curation

TL;DR

rPPG-VQA proposes the first video quality assessment framework tailored for remote photoplethysmography (rPPG, contactless heart-rate detection from video), combining signal-level multi-method consensus SNR with scene-level MLLM disturbance recognition, along with a two-stage adaptive sampling strategy to curate in-the-wild training data.

Background & Motivation

Background: Unsupervised rPPG aims to learn contactless heart rate detection from unannotated video data, but existing research focuses primarily on methodological innovation while neglecting data quality issues.

Limitations of Prior Work: (1) Motion, illumination, and other noise in in-the-wild videos may overwhelm weak physiological signals; (2) AI-generated videos lack any real physiological basis; (3) conventional VQA evaluates human perceptual quality, which is misaligned with rPPG requirements; (4) a single SNR metric is easily fooled by periodic non-physiological signals (e.g., strobe lights).

Key Challenge: Videos with high visual quality may contain no extractable physiological signal, while visually degraded videos may still carry valid signals — a distinction that conventional VQA cannot make.

Core Idea: A dual-branch assessment — the signal-level branch applies multi-method consensus SNR to eliminate algorithmic bias, while the scene-level branch employs an MLLM to identify disturbances such as motion and illumination variation.

Method

Overall Architecture

In-the-wild video input → Signal-level branch (multiple rPPG algorithms extract signals → consensus SNR score) + Scene-level branch (MLLM evaluates motion/illumination/compression disturbances) → Fusion into a unified quality score → Two-stage adaptive sampling → Target training set construction.
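The fusion step above can be sketched as a weighted combination of the two branch outputs. Note this is an illustrative assumption: the paper does not specify the fusion rule, score ranges, or weights used here.

```python
# Hypothetical sketch of the score-fusion step. The [0, 1] score ranges,
# the linear weighting, and the default alpha are assumptions, not details
# taken from the paper.

def fuse_quality(snr_score: float, disturbance_score: float, alpha: float = 0.5) -> float:
    """Combine a signal-level consensus SNR score (higher = better) with a
    scene-level disturbance score (higher = more disturbed) into a single
    quality score in [0, 1]."""
    signal_q = max(0.0, min(1.0, snr_score))            # assume pre-normalised to [0, 1]
    scene_q = 1.0 - max(0.0, min(1.0, disturbance_score))  # invert: low disturbance = high quality
    return alpha * signal_q + (1 - alpha) * scene_q
```

A video with a strong consensus signal and no scene disturbance scores 1.0; a video with no signal and heavy disturbance scores 0.0.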

Key Designs

  1. Signal-Level Multi-Method Consensus SNR:

    • Function: Evaluates the integrity of physiological signals in a video while eliminating single-algorithm bias.
    • Mechanism: Multiple traditional rPPG algorithms (GREEN, ICA, CHROM, POS, etc.) independently extract signals and estimate SNR. If a genuine physiological signal is present, all methods should yield consistently high SNR (a method-agnostic property); inconsistency across methods indicates an unreliable signal.
    • Design Motivation: A single SNR metric is susceptible to periodic noise (e.g., strobe lights producing heartbeat-like signals); multi-method consensus filters out such false positives.
  2. Scene-Level MLLM Disturbance Recognition:

    • Function: Identifies scene-level disturbances that signal-level metrics cannot capture.
    • Mechanism: An MLLM performs human-like scene reasoning over video frames to detect complex disturbances such as unstable illumination, severe motion, and camera artifacts, producing a disturbance score.
    • Design Motivation: Signal-level metrics cannot distinguish the physiological origin of a signal and lack scene context to differentiate genuine biosignals from confounding artifacts.
  3. Two-Stage Adaptive Sampling (TAS):

    • Function: Constructs an optimal training set from a large-scale uncurated video pool.
    • Mechanism: Stage 1 applies quality thresholds to filter low-quality videos; Stage 2 employs duration-aware probabilistic sampling to balance quality, diversity, and efficiency.
    • Design Motivation: Naive filtering may yield insufficiently diverse training sets; probabilistic sampling maintains data diversity while ensuring quality.
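The consensus SNR idea in design 1 can be sketched as follows. The SNR definition (spectral peak inside the plausible heart-rate band against the rest of the spectrum) and the consensus rule (taking the minimum across methods, so a single method fooled by periodic noise cannot inflate the score) are illustrative assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def snr_db(signal: np.ndarray, fs: float, band=(0.7, 3.0)) -> float:
    """SNR (dB) of the strongest spectral peak inside the plausible
    heart-rate band (0.7-3.0 Hz ~ 42-180 bpm) versus the remaining spectrum."""
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    power = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    peak_f = freqs[in_band][np.argmax(power[in_band])]
    peak_mask = np.abs(freqs - peak_f) <= 0.1      # +/- 0.1 Hz around the peak
    noise = power[~peak_mask].sum() + 1e-12        # avoid division by zero
    return 10 * np.log10(power[peak_mask].sum() / noise)

def consensus_snr(signals: list[np.ndarray], fs: float) -> float:
    """Consensus rule (assumed here): the *minimum* SNR across the rPPG
    methods' outputs, so all methods must agree for a high score."""
    return min(snr_db(s, fs) for s in signals)
```

With this rule, a strobe light that fools one method but not the others drags the consensus score down, which is exactly the false-positive filtering the design motivation describes.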
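The two-stage adaptive sampling in design 3 might look like the sketch below. The threshold value, the duration-aware weight, and the weighted-sampling rule (Efraimidis-Spirakis keys for sampling without replacement) are all illustrative assumptions, not the paper's specification.

```python
import random

def two_stage_sample(videos, k, q_min=0.5, seed=0):
    """Sketch of two-stage adaptive sampling over a video pool.

    videos: list of dicts with 'quality' (0-1) and 'duration' (seconds).
    Stage 1: hard quality threshold.
    Stage 2: weighted sampling without replacement, down-weighting very
    long clips so a few long videos do not dominate the training set.
    """
    rng = random.Random(seed)
    pool = [v for v in videos if v["quality"] >= q_min]          # stage 1

    def weight(v):
        # quality raises the weight; long duration lowers it
        return v["quality"] / (1.0 + v["duration"] / 60.0)

    # stage 2: Efraimidis-Spirakis keys -> weighted sample w/o replacement
    keyed = sorted(pool, key=lambda v: rng.random() ** (1.0 / weight(v)), reverse=True)
    return keyed[:min(k, len(pool))]
```

Because stage 2 is probabilistic rather than a top-k cut on quality alone, moderately scored but scene-diverse clips still have a chance of selection, matching the stated diversity motivation.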

Loss & Training

The curated training set is used to train existing unsupervised rPPG methods (e.g., ContrastPhys, SiNC), validating the effectiveness of the VQA framework.

Key Experimental Results

Main Results

| Sampling Strategy | HR MAE (PURE test set) | Note |
|---|---|---|
| All (full data) | High error | Low-quality data degrades training |
| Random | Moderate error | Better than using all data |
| rPPG-VQA | Lowest error | Quality curation is highly effective |

Ablation Study

| Configuration | HR MAE | Note |
|---|---|---|
| Signal-level + Scene-level | Best | Dual branches are complementary |
| Signal-level only | Second best | Scene disturbances overlooked |
| Scene-level only | Moderate | Signal quality assessment missing |

Key Findings

  • Training on all in-the-wild videos performs worse than training on a quality-curated subset.
  • The dual-branch design shows clear complementary effects; each branch alone has blind spots.
  • The TAS strategy maintains training set diversity while ensuring data quality.

Highlights & Insights

  • First systematic study of data quality for rPPG: Addresses the data-side gap in unsupervised rPPG research.
  • Method-agnostic signal quality metric: Multi-algorithm consensus eliminates bias and the idea is transferable to other signal processing tasks.

Limitations & Future Work

  • MLLM inference incurs substantial computational cost.
  • Quality threshold selection requires manual tuning.
  • Future work may explore end-to-end quality-aware training frameworks.

Comparison with Alternatives

  • vs. Conventional VQA (PSNR/SSIM): Designed for human perceptual quality, misaligned with rPPG requirements.
  • vs. Post-hoc signal evaluation: Requires prior signal extraction and cannot pre-filter raw videos.

Rating

  • Novelty: ⭐⭐⭐⭐ First systematic treatment of data quality for rPPG
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparison across multiple sampling strategies
  • Writing Quality: ⭐⭐⭐⭐ Problem formulation is precise and well-motivated
  • Value: ⭐⭐⭐⭐ Unlocks the potential of in-the-wild data for unsupervised rPPG