Deep Learning-based Contactless PPG Methods: Comparison
Please note this is a comparison between Version 1 by Aoxin Ni and Version 2 by Conner Chen.

Physiological measurements are widely used to determine a person’s health condition. Photoplethysmography (PPG) is a physiological measurement method that detects volumetric changes in the blood in vessels beneath the skin. Medical devices based on PPG have been introduced to measure different physiological parameters including heart rate (HR), respiratory rate, heart rate variability (HRV), oxyhemoglobin saturation, and blood pressure. Due to its low cost and non-invasive nature, PPG is utilized in many devices such as finger pulse oximeters, sports bands, and wearable sensors. PPG-based physiological measurements can be categorized into two types: contact-based and contactless.

  • remote PPG
  • heart rate measurement methods
  • review of heart rate measurement methods
  • deep learning for contactless or remote heart rate measurement
  • comparison of deep learning methods for contactless or remote heart rate measurement

1. Background

A common practice in the medical field for measuring the heart rate is electrocardiography (ECG) [1][2], where voltage changes produced by the heart's electrical activity are detected using electrodes placed on the skin. In general, ECG provides a more reliable heart rate measurement than PPG [3][4]; hence, ECG is often used as the reference for evaluating PPG methods [1][2][3][4]. Typically, the 10 electrodes of an ECG machine are attached to different parts of the body, including the wrists and ankles. In contrast, PPG-based medical devices come in differing sensor shapes worn on different parts of the body, such as rings, earpieces, and bands [1][5][6][7][8][9][10]; they all use a light source and a photodetector, together with signal processing, to detect the PPG signal, see Figure 1. The signal processing serves to process the reflected optical signal from the skin [1].

Figure 1. PPG signal processing framework.

Early research in this field concentrated on obtaining the PPG signal and ways to perform pulse wave analysis [12]. A comparison between ECG and PPG is discussed in [13][14]. There are survey papers covering different PPG applications involving wearable devices [15][16], atrial fibrillation detection [17], and blood pressure monitoring [18]. Papers have also been published that use deep learning for contact-based PPG, e.g., [19][20][21][22]. The previous survey papers on contact-based PPG methods are listed in Table 1.

Table 1. Previous survey papers on contact-based PPG methods.

Emphasis | Ref | Year | Task
Contact | [12] | 2007 | Basic principle of PPG operation, pulse wave analysis, clinical applications
Contact (ECG and PPG) | [13] | 2018 | Breathing rate (BR) estimation from ECG and PPG, BR algorithms and their assessment
Contact | [17] | 2020 | Approaches for PPG-based atrial fibrillation detection
Contact (wearable device) | [15] | 2019 | PPG acquisition, HR estimation algorithms, developments in wrist PPG applications, biometric identification
Contact (ECG and PPG) | [14] | 2012 | Accuracy of pulse rate variability (PRV) as an estimate of HRV
Contact (wearable device) | [16] | 2018 | Current developments and challenges of wearable PPG-based monitoring technologies
Contact (blood pressure) | [18] | 2015 | Approaches involving PPG for continuous and non-invasive monitoring of blood pressure

Although contact-based PPG methods are non-invasive, they can be restrictive due to the requirement of their contact with the skin. Contact-based methods can be irritating or distracting in some situations, for example, for newborn infants [23][24][25][26]. When a less restrictive approach is desired, contactless PPG methods are considered. The use of contactless PPG methods or remote PPG (rPPG) methods has been growing in recent years [27][28][29][30][31].


Contactless PPG methods usually utilize a video camera to capture images which are then processed by image processing algorithms [27][28][29][30][31]. The physics of rPPG is similar to that of contact-based PPG: the light-emitting diode of contact-based PPG methods is replaced with ambient illuminance, and the photodetector is replaced with a video camera, see Figure 2. The light reaching the camera sensor can be separated into static (DC) and dynamic (AC) components. The DC component corresponds to static elements including tissue, bone, and static blood, while the AC component corresponds to variations in light absorption due to arterial blood volume changes. Figure 3 provides an illustration of the common image processing steps in rPPG methods. In the signal extraction part of the framework, a region of interest (ROI), normally on the face, is extracted.

Figure 2. Illustration of rPPG generation: diffused and specular reflections of ambient illuminance are captured by a camera, with the diffused reflection indicating volumetric changes in blood vessels.

Figure 3. rPPG or contactless PPG image processing framework: signal extraction step (ROI detection and tracking), signal estimation step (filtering and dimensionality reduction), and heart rate estimation step (frequency analysis and peak detection).

In earlier studies, video images from motionless faces were considered [32][33][34]. Several papers relate to exercising situations [35][36][37][38][39]. ROI detection and ROI tracking constitute two major image processing parts of the framework. The Viola and Jones (VJ) algorithm [40] is often used to detect face areas [41][42][43][44]. As an example of prior work on skin detection, a neural network classifier was used to detect skin-like pixels in [45]. In the signal estimation part, a bandpass filter is applied to eliminate undesired frequency components. A common choice for the frequency band is [0.7 Hz, 4 Hz], which corresponds to an HR between 42 and 240 beats per minute (bpm) [45][46][47][48]. To separate a signal into uncorrelated components and to reduce dimensionality, independent component analysis (ICA) was utilized in [49][50][51][52] and principal component analysis (PCA) was utilized in [33][34][35][53][54]. In the heart rate estimation module, the dimensionality-reduced data are mapped to a heart rate estimate using frequency analysis or peak detection methods. The survey papers on rPPG methods that have already appeared in the literature are listed in Table 2. These survey papers provide comparisons with contact-based PPG methods.
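As a rough illustration of the signal estimation and heart rate estimation steps described above, the sketch below band-limits a raw rPPG trace to [0.7 Hz, 4 Hz] in the frequency domain and takes the dominant spectral peak as the HR. The synthetic trace and all parameter choices are illustrative, not taken from any cited method.

```python
import numpy as np

def estimate_hr(trace, fs, band=(0.7, 4.0)):
    """Estimate heart rate (bpm) from a raw rPPG trace by band-limiting
    the spectrum to the plausible HR range and picking the dominant peak."""
    x = trace - np.mean(trace)                     # remove the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    peak_freq = freqs[in_band][np.argmax(spectrum[in_band])]
    return 60.0 * peak_freq

# Synthetic 72 bpm (1.2 Hz) trace sampled at a typical 30 fps camera rate
rng = np.random.default_rng(0)
fs = 30.0
t = np.arange(0.0, 20.0, 1.0 / fs)
trace = np.sin(2 * np.pi * 1.2 * t) + 0.3 * rng.standard_normal(t.size)
print(round(estimate_hr(trace, fs), 1))  # → 72.0
```

In practice, a time-domain bandpass filter (e.g., a Butterworth design) is typically applied before the frequency analysis; restricting the peak search to the band has the same effect in this toy setting.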

Challenges in rPPG include subject motion and ambient lighting variations [55][56][57]. Due to the success of deep learning in many computer vision and speech processing applications [58][59][60], deep learning methods have been considered for rPPG to deal with these challenges, for example, [39][44]. In deep learning methods, feature extraction and classification are carried out together within one network structure. The required datasets for deep learning models are collected using RGB cameras. As noted earlier, the focus of this review is on deep learning-based contactless heart rate measurement methods.


Table 2. Survey papers on conventional contactless methods previously reported in the literature.
Emphasis | Ref | Year | Task
Contactless | [61] | 2018 | Provides typical components of rPPG and notes the main challenges; groups published studies by their choice of algorithm
Contactless | [62] | 2012 | Covers three main stages of monitoring physiological measurements based on photoplethysmographic imaging: image acquisition, data collection, and parameter extraction
Contactless and contact | [63] | 2016 | Reviews contact-based PPG and its limitations; introduces research activities on wearable and non-contact PPG
Contactless and contact | [64] | 2009 | Reviews photoplethysmographic measurement techniques from contact to non-contact sensing placement, and from point measurement to imaging measurement
Contactless (newborn infants) | [23] | 2013 | Investigates the feasibility of camera-based PPG for contactless HR monitoring in newborn infants with ambient light
Contactless (newborn infants) | [25] | 2016 | Comparative analysis to benchmark state-of-the-art video and image-guided noninvasive pulse rate (PR) detection
Contactless and contact | [65] | 2017 | Heart rate measurement using facial videos based on photoplethysmography and ballistocardiography
Contactless and contact | [66] | 2014 | Covers methods of non-contact HR measurement with capacitively coupled ECG, Doppler radar, optical vibrocardiography, thermal imaging, RGB camera, and HR from speech
Contactless and contact (RR) | [67] | 2011 | Discusses respiration monitoring approaches (both contact and non-contact)
Contactless (newborn infants) | [26] | 2019 | Addresses HR measurement in babies
Contactless | [68] | 2019 | Examines challenges associated with illumination variations and motion artifacts
Contactless | [69] | 2017 | Covers HR measurement techniques including camera-based photoplethysmography, reflectance pulse oximetry, laser Doppler technology, capacitive sensors, piezoelectric sensors, electromyography, and a digital stethoscope
Contactless (main challenges) | [70] | 2015 | Covers issues in motion and ambient lighting tolerance, image optimization (including multi-spectral imaging), and region of interest optimization

2. Contactless PPG Methods Based on Deep Learning

Previous works on deep learning-based contactless HR methods can be divided into two groups: combinations of conventional and deep learning methods, and end-to-end deep learning methods. In what follows, a review of these papers is provided.

2.1. Combination of Conventional and Deep Learning Methods

Li et al. 2021 [71] presented multi-modal machine learning techniques related to heart diseases. From Figure 3, it can be seen that one or more components of the contactless HR framework can be achieved using deep learning. These components include ROI detection and tracking, signal estimation, and HR estimation.

2.1.1. Deep Learning Methods for Signal Estimation

Qiu et al. 2018 [72] developed a method called EVM-CNN. The pipeline of this method consists of three modules: face detection and tracking, feature extraction, and HR estimation. In the face detection and tracking module, 68 facial landmarks inside a bounding box are detected using a regression local binary features-based approach [73]. An ROI defined by eight points around the central part of the face is then automatically extracted and passed to the next module. In the feature extraction module, spatial decomposition and temporal filtering are applied to obtain so-called feature images: the sequence of ROIs is down-sampled into several bands, and the lowest bands are reshaped and concatenated into a new image. The three channels of this new image are transformed into the frequency domain via the fast Fourier transform (FFT), where unwanted frequency bands are removed; the bands are then transformed back to the time domain by the inverse FFT and merged into a feature image. In the HR estimation module, a convolutional neural network (CNN) estimates the HR from the feature image. The CNN used in this method has a simple structure with several convolution layers and uses depth-wise and point-wise convolutions to reduce the computational burden and model size.

As shown in Figure 4, the first two modules (face detection/tracking and feature extraction) follow conventional rPPG approaches, whereas the HR estimation module uses deep learning to improve HR estimation performance.

Figure 4. EVM-CNN modules.
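The parameter savings from replacing a standard convolution with a depth-wise plus point-wise (separable) pair, as in EVM-CNN's estimation network, can be illustrated with a quick count. The layer sizes below are made-up examples, and biases are ignored:

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution layer."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    """Weights in a depth-wise (one k x k filter per input channel)
    plus point-wise (1 x 1 channel-mixing) pair."""
    return k * k * c_in + c_in * c_out

# Hypothetical layer: 3x3 kernels, 64 input channels, 128 output channels
print(conv_params(3, 64, 128))       # → 73728
print(separable_params(3, 64, 128))  # → 8768, roughly 8x fewer weights
```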

2.1.2. Deep Learning Methods for Signal Extraction

Luguev et al. 2020 [74] established a framework which uses deep spatial-temporal networks for contactless HRV measurements from raw facial videos. In this method, a 3D convolutional neural network is used for pulse signal extraction. As for the computation of HRV features, conventional signal processing methods including frequency domain analysis and peak detection are used. More specifically, raw video sequences are inputted into the 3D-CNN without any skin segmentation. Several convolution operations with rectified linear units (ReLU) are used as activation functions together with pooling operations to produce spatiotemporal features. In the end, a pulse signal is generated by a channel-wise convolution operation. The mean absolute error is used as the loss function of the model.

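The conventional HRV computation mentioned above (peak detection followed by interval statistics) can be sketched as follows. The naive peak detector and the perfectly regular synthetic pulse train are simplified stand-ins, not the authors' implementation:

```python
import numpy as np

def hrv_sdnn(pulse, fs):
    """Toy HRV sketch: detect pulse peaks, form inter-beat intervals (IBIs),
    and return SDNN, the standard deviation of the IBIs in milliseconds."""
    min_gap = int(0.25 * fs)              # refractory period (at most 240 bpm)
    threshold = 0.5 * np.max(pulse)
    peaks, last = [], -min_gap
    for i in range(1, len(pulse) - 1):
        is_local_max = pulse[i] > pulse[i - 1] and pulse[i] >= pulse[i + 1]
        if is_local_max and pulse[i] > threshold and i - last >= min_gap:
            peaks.append(i)
            last = i
    ibis = np.diff(peaks) / fs * 1000.0   # inter-beat intervals in ms
    return float(np.std(ibis))

fs = 30.0
pulse = np.zeros(600)
pulse[30::30] = 1.0                       # idealized regular beats, one per second
print(hrv_sdnn(pulse, fs))                # → 0.0 (a regular rhythm has no variability)
```

A real pipeline would first extract the pulse signal (here, the 3D-CNN's output) and use a more robust peak detector before computing HRV features.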

Paracchini et al. 2020 [75] implemented rPPG based on a single-photon avalanche diode (SPAD) camera. This method combines deep learning and conventional signal processing to extract and examine the pulse signal. The main advantage of using a SPAD camera is its superior performance in dark environments compared with CCD or CMOS cameras. Its framework is shown in Figure 5. The signal extraction part has two components: facial skin detection and signal creation. A U-shaped network performs skin detection over all visible facial skin rather than a specific skin area, and outputs a binary skin mask. A raw pulse signal is then obtained by averaging the intensity values of all pixels inside the binary mask. Signal estimation is achieved by filtering, FFT, and peak detection. The experimental results include HR, respiration rate, and tachogram measurements.

Figure 5. rPPG using SPAD camera.
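The mask-averaging step described above reduces each frame to one sample of the raw pulse signal. A minimal sketch, with a made-up mask and random frames standing in for real data:

```python
import numpy as np

# Hypothetical illustration: one sample of the raw pulse signal per frame
# is the mean intensity of the pixels the network marked as skin.
rng = np.random.default_rng(1)
frames = rng.integers(0, 256, size=(5, 4, 4)).astype(float)  # 5 toy frames
skin_mask = np.zeros((4, 4), dtype=bool)
skin_mask[1:3, 1:3] = True            # pretend these pixels are facial skin

raw_pulse = np.array([frame[skin_mask].mean() for frame in frames])
print(raw_pulse.shape)                # → (5,) — one sample per frame
```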

In another work, Zhan et al. 2020 [76] focused on understanding CNN-based PPG signal extraction. Four questions were addressed: (1) Does the CNN learn PPG, BCG, or a combination of both? (2) Can a finger oximeter be directly used as a reference for CNN training? (3) Does the CNN learn the spatial context information of the measured skin? (4) Is the CNN robust to motion, and how is this motion robustness achieved? To answer these questions, a CNN-PPG framework and four experiments were designed. The results indicate that the availability of multiple convolutional kernels is necessary for a CNN to arrive at a flexible channel combination through the spatial operation, but may not provide the same motion robustness as a multi-site measurement. Another conclusion is that PPG-related prior knowledge may still be helpful for CNN-based PPG extraction.


2.2. End-to-End Deep Learning Methods

In this section, end-to-end deep learning systems are presented which take video as the input and use different network architectures to generate a physiological signal as the output.

2.2.1. VGG-Style CNN

Chen and Mcduff 2018 [77] developed an end-to-end method for video-based heart and breathing rates using a deep convolutional network named DeepPhys. To address the issue caused by subject motion, the proposed method uses a motion representation algorithm based on a skin reflection model. As a result, motions are captured more effectively. To guide the motion estimation, an attention mechanism using appearance information was designed. It was shown that the motion representation model and the attention mechanism used enable robust measurements under heterogeneous lighting and motions.


The model is based on a VGG-style CNN for estimating the physiological signal under motion [78]. VGG is an object recognition model that supports up to 19 layers; built as a deep CNN, it has been shown to outperform baselines in many image processing tasks. Figure 6 illustrates the architecture of this end-to-end convolutional attention network. The current video frame at time t and the normalized difference between the frames at t and t + 1 constitute the inputs to the appearance and motion models, respectively. The network learns spatial masks, which are shared between the models, and extracts features for recovering the blood volume pulse (BVP) and respiration signals.

Figure 6. DeepPhys architecture.
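The normalized frame-difference input described above can be sketched per pixel as (c(t+1) − c(t)) / (c(t+1) + c(t)); the eps constant and the toy frames below are illustrative additions:

```python
import numpy as np

def normalized_difference(frame_t, frame_t1, eps=1e-8):
    """Sketch of a DeepPhys-style motion input: the frame difference
    normalized by the local brightness."""
    c0 = frame_t.astype(float)
    c1 = frame_t1.astype(float)
    return (c1 - c0) / (c1 + c0 + eps)  # eps guards against dark (zero) pixels

a = np.full((2, 2), 2.0)  # toy frame at time t
b = np.full((2, 2), 4.0)  # toy frame at time t + 1
print(normalized_difference(a, b)[0, 0])  # → ~0.333 (brightness grew by a third)
```

Normalizing by local brightness makes the motion input less sensitive to heterogeneous illumination, which is part of what gives DeepPhys its robustness.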

Deep PPG, proposed by Reiss et al. 2019 [79], addresses three shortcomings of existing datasets. First is dataset size: while the number of subjects can be considered sufficient (8–24 participants per dataset), the length of each session's recording can be rather short. Second is the small number of activities: the publicly available datasets include data from only two to three different activities. Third, data are recorded in laboratory settings rather than in real-world environments.


A new dataset, called PPG-DaLiA [80], was thus introduced in this paper: a PPG dataset for motion compensation and heart rate estimation in daily living activities. Figure 7 illustrates the architecture of the VGG-like CNN used, where the time–frequency spectra of PPG signals are used as the input to estimate the heart rate.

Figure 7. Deep PPG architecture.

2.2.2. CNN-LSTM Network

Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture which can process not only a single data point (such as an image) but also an entire sequence of data points (such as speech or video). It has previously been used for tasks such as connected handwriting recognition, speech recognition, and anomaly detection in network traffic [81][82][83].


rPPG signals are usually collected using a video camera and are therefore sensitive to multiple contributing factors, including variation in skin tone, lighting conditions, and facial structure. Meta-rPPG [84] is an end-to-end supervised learning approach which performs well when training data are abundant and their distribution does not deviate too much from the testing data distribution. To cope with unforeseeable changes during testing, a transductive meta-learner takes unlabeled samples during testing for a self-supervised weight adjustment, providing fast adaptation to the changes. The proposed network is split into two parts: a feature extractor and an rPPG estimator, modeled by a CNN and an LSTM network, respectively.


2.2.3. 3D-CNN Network

A 3D convolutional neural network (3D-CNN) is a network whose kernels slide in three dimensions. 3D-CNNs have been shown to perform better than 2D-CNNs in spatiotemporal information learning [85].

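To make the kernel "sliding in three dimensions" concrete, here is a naive single-channel 3D convolution over a toy video volume; the shapes and the averaging kernel are illustrative, not from any cited network:

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D convolution: the kernel slides along time, height,
    and width of a video volume shaped (T, H, W)."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

video = np.ones((8, 6, 6))             # toy 8-frame, 6x6-pixel clip
kernel = np.ones((3, 3, 3)) / 27.0     # spatiotemporal averaging kernel
print(conv3d_valid(video, kernel).shape)  # → (6, 4, 4)
```

Because the kernel also spans the time axis, each output value mixes information across frames, which is what lets a 3D-CNN learn spatiotemporal features directly.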

Špetlík et al. 2018 [41] proposed a two-step convolutional neural network to estimate the heart rate from a sequence of facial images, see Figure 8. The proposed architecture has two components: an extractor and an HR estimator. The extractor is run over a temporal sequence of face images, and the resulting signal is fed to the HR estimator to predict the heart rate.

Figure 8. HR-CNN modules.

In the work from Yu et al. 2019 [86], a two-stage end-to-end method was proposed. This work deals with video compression loss and recovers the rPPG signal from highly compressed videos. It consists of two parts: (1) a spatiotemporal video enhancement network (STVEN) for video enhancement, and (2) an rPPG network (rPPGNet) for rPPG signal recovery. rPPGNet can work on its own for obtaining rPPG measurements. The STVEN network can be added and jointly trained to further boost the performance, particularly on highly compressed videos.


Another method from Yu et al. 2019 [87] proposes the use of deep spatiotemporal networks for reconstructing precise rPPG signals from raw facial videos. With the constraint of trend consistency with ground truth pulse curves, this method recovers rPPG signals with accurate pulse peaks, where the heartbeat peaks of the measured rPPG signal are located at the corresponding R peaks of the ground truth ECG signal.


To address the issue of a lack of training data, a heart track convolutional neural network was developed by Rerepelkina et al. 2020 [88] for remote video-based heart rate tracking. This learning-based method is trained on synthetic data to accurately estimate the heart rate in different conditions. Synthetic data do not include video and include only PPG curves. To select the most suitable parts of the face for pulse tracking at each particular moment, an attention mechanism is used.


Similar to the previous method, the method proposed by Bousefsaf et al. 2019 [89] also uses synthetic data. 

Figure 9 illustrates the process of how synthetic data are generated. A 3D-CNN classifier structure was developed for both extraction and classification of unprocessed video streams. The CNN acts as a feature extractor: its final activations are fed into two dense layers (a multilayer perceptron) that classify the pulse rate. The network ensures concurrent mapping by producing a prediction for each local group of pixels.
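Because the dense layers classify the pulse rate and a prediction is produced for each local pixel group, the per-patch outputs must be combined into one estimate. A hypothetical sketch of that step (the class layout in `bins_to_bpm` is an assumption, not taken from the paper): average the per-patch probability distributions, then map the winning class to bpm:

```python
import numpy as np

def bins_to_bpm(n_classes=75, lo=45.0, step=2.0):
    """Assumed class layout: class k corresponds to lo + k*step bpm."""
    return lo + step * np.arange(n_classes)

def aggregate_patch_predictions(prob_maps):
    """prob_maps: (n_patches, n_classes) softmax outputs, one row per local
    pixel group. Average the distributions, then take the argmax class."""
    mean_probs = prob_maps.mean(axis=0)
    return int(np.argmax(mean_probs))

rng = np.random.default_rng(1)
n_patches, n_classes = 64, 75
probs = rng.random((n_patches, n_classes))
probs[:, 13] += 2.0                              # most patches favour class 13
probs /= probs.sum(axis=1, keepdims=True)        # normalize rows to distributions
k = aggregate_patch_predictions(probs)
print(bins_to_bpm()[k])                          # class 13 -> 71.0 bpm
```

Averaging distributions before the argmax makes the estimate robust to a minority of patches that land on poorly illuminated or non-skin pixels.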

Figure 9. Process of generating synthetic data.

Liu et al. 2020 [90] developed a lightweight rPPG estimation network, named DeeprPPG, based on spatiotemporal convolutions for utilization involving different types of input skin. To further boost the robustness, a spatiotemporal rPPG aggregation strategy was designed to adaptively aggregate rPPG signals from multiple skin regions into a final one. Extensive experimental studies were conducted to show its robustness when facing unseen skin regions in unseen scenarios. 
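One simple way to realize such adaptive aggregation, shown here as an assumption rather than the authors' exact strategy, is to weight each skin region's trace by its spectral SNR in the pulse band, so that clean regions dominate the fused signal:

```python
import numpy as np

def band_snr(signal, fs, band=(0.7, 4.0)):
    """Ratio of spectral power inside the pulse band to power outside it."""
    spec = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    inside = (freqs >= band[0]) & (freqs <= band[1])
    return spec[inside].sum() / (spec[~inside].sum() + 1e-12)

def aggregate_regions(signals, fs):
    """Weight each region's rPPG trace by its pulse-band SNR, then average."""
    w = np.array([band_snr(s, fs) for s in signals])
    w /= w.sum()
    return (w[:, None] * signals).sum(axis=0), w

fs = 30.0
t = np.arange(0, 10, 1.0 / fs)
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 1.2 * t) + 0.05 * rng.normal(size=t.size)  # good region
noisy = 0.2 * np.sin(2 * np.pi * 1.2 * t) + rng.normal(size=t.size)   # poor region
fused, weights = aggregate_regions(np.stack([clean, noisy]), fs)
print(weights)  # the clean region receives almost all of the weight
```

Because the weights are computed per window, the fusion adapts as lighting or motion degrades individual regions over time.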

Table 3 lists the contactless HR methods that use deep learning.

Table 3. Deep learning-based contactless PPG methods.

| Focus | Ref | Year | Feature | Dataset |
|---|---|---|---|---|
| End-to-end system; robust to illumination changes and subject's motion | [41][46] | 2018 | Two-step convolutional neural network composed of an extractor and an HR estimator | COHFACE, PURE, MAHNOB-HCI |
| Signal estimation enhancement | [72][77] | 2019 | Eulerian video magnification (EVM) to extract face color changes; CNN to estimate heart rate | MMSE-HR |
| 3D-CNN for signal extraction | [74][79] | 2020 | Deep spatiotemporal networks for contactless HRV measurements from raw facial videos; data augmentation | MAHNOB-HCI |
| Single-photon camera | [75][80] | 2020 | Neural network for skin detection | N/A |
| Understanding of CNN-based PPG methods | [76][81] | 2020 | Analysis of CNN-based remote PPG to understand limitations and sensitivities | HNU, PURE |
| End-to-end system; attention mechanism | [77][82] | 2018 | Robust measurement under heterogeneous lighting and motions | MAHNOB-HCI |
| End-to-end system; real-life conditions dataset | [79][84] | 2019 | Addresses major shortcomings of existing datasets: dataset size, small number of activities, data recording in laboratory settings | PPG-DaLiA |
| Synthetic training data; attention mechanism | [88][93] | 2020 | CNN trained with synthetic data to accurately estimate HR in different conditions | UBFC-RPPG, MoLi-ppg-1, MoLi-ppg-2 |
| Synthetic training data | [89][94] | 2019 | Automatic 3D-CNN training process with synthetic data, with no image processing | UBFC-RPPG |
| End-to-end supervised learning approach; meta-learning | [84][89] | 2017 | Meta-rPPG for abundant training data with a distribution not deviating too much from that of the testing data | MAHNOB-HCI, UBFC-RPPG |
| Counter video compression loss | [86][91] | 2019 | STVEN for video quality enhancement; rPPGNet for signal recovery | MAHNOB-HCI |
| Spatiotemporal network | [87][92] | 2019 | Measuring rPPG signal from raw facial video; taking temporal context into account | MAHNOB-HCI |
| Spatiotemporal network | [90][95] | 2020 | Spatiotemporal convolution network; different types of input skin | MAHNOB-HCI, PURE |