SEQ-RPPG: A FAST BVP SIGNAL EXTRACTION METHOD FROM FRAME SEQUENCES

Abstract

Remote photoplethysmography (rPPG) can be widely used in many kinds of physical, health, and emotional monitoring, such as monitoring the heart rate of drivers, consumers, the elderly, and infants. Several rPPG methods have been proposed in the past few years, but non-contact heart rate estimation in realistic situations remains challenging, and existing deep learning-based rPPG methods cannot achieve real-time performance on low-cost devices. To deal with this problem, a simple, fast, and pre-processing-free approach called sequence-based rPPG (SEQ-rPPG) is proposed for non-contact heart rate estimation. SEQ-rPPG first transforms the RGB frame sequence into a new signal sequence by a learning-based linear mapping and then outputs the final BVP signal using 1D-CNN-based spectral transforms and time-domain filtering. It requires no complex pre-processing, has the fastest speed among the compared methods, can run in real-time on mobile ARM CPUs, and achieves real-time beat-to-beat performance on desktop CPUs. Furthermore, we present a well-annotated dataset, focusing on providing large-scale, highly synchronized PPG and video recordings; the entire dataset will be made available to the research community. Benefiting from this high-quality dataset, other deep learning-based models also reduced their errors. To demonstrate the efficacy of the proposed method, it is compared with state-of-the-art methods. Experimental results on both the self-built and publicly available datasets demonstrate its effectiveness, and we also verify that processing in the frequency domain is effective.

1. INTRODUCTION

Blood volume pulse (BVP) is a physiological signal reflecting cardiac activity. The most widely used method for BVP extraction is the photoplethysmogram (PPG). However, PPG requires the subject to wear an optical contact sensor, which can cause discomfort and practical difficulties, such as when monitoring patients in pediatric intensive care units. Non-contact BVP extraction is possible with high-sensitivity cameras and webcams using ambient light as the illumination source (Takano & Ohta, 2007; Verkruysse et al., 2008), so remote PPG (rPPG) has attracted extensive attention in recent years. Early rPPG methods mainly focused on temporal modeling. For example, independent component analysis (ICA) (Poh et al., 2010a) was used to extract BVP signals from RGB traces, and Poh et al. (2010b) added a detrending operation to improve its robustness. However, blind source separation in RGB color space showed limited success. de Haan & Jeanne (2013) proposed a chrominance-based method that is robust to luminance changes. In short, temporal modeling centered on linear optical models and filter-based post-processing, which empirically proved fast and effective. Spatial modeling is another important way to extract BVP (Tulyakov et al., 2016; Bobbia et al., 2019): these methods select the facial regions with the highest signal-to-noise ratio and can thus accommodate more head movement, illumination change, and shading. However, such handcrafted algorithms have poor accuracy compared with deep learning-based approaches, while deep learning-based methods are far more computationally expensive than handcrafted algorithms, making them difficult to deploy on low-cost devices.
The high computation cost comes from two sources. First is pre-processing. In Niu et al. (2020a;b); Song et al. (2021); Lu et al. (2021), the algorithms require fine segmentation of the face, so they rely on handcrafted features, and the computation required for pre-processing is even greater than that of the model itself. Second, some end-to-end models (Liu et al., 2021; Yu et al., 2022) achieve high accuracy, but the models are too large to run on devices without GPUs. Although some models (Yu et al., 2019; Liu et al., 2020) can run in real-time on mobile CPUs with good accuracy, their computation cost is still too high. rPPG is commonly used in affective computing and telemedicine, where users must run other components alongside the rPPG application; allocating most of the computation resources to real-time rPPG is unacceptable. Therefore, it is necessary to design an rPPG algorithm that occupies only a small fraction of the available computation resources. Apart from algorithm design, the dataset is also vital for obtaining better learning models. Different training sets have a huge impact on model performance, mainly because in most datasets the video signal and the BVP signal are not well synchronized. Some datasets are highly synchronized, such as PURE (Stricker et al., 2014) and SCAMPS (McDuff et al., 2022). However, PURE contains only 59 minutes of video, and SCAMPS, as a synthetic dataset, has not yet shown strong enough generalization. In addition, some works (Yu et al., 2019; Botina-Monsalve et al., 2022; Comas et al., 2022) focused on making models adaptable to non-synchronized datasets; a more common practice is to use unsupervised methods (e.g., POS) to synchronize the signals, but designing methods robust to non-synchronization remains an open problem. Although there are ways to alleviate non-synchronization, the scarcity of high-quality datasets still limits the performance of some models. To deal with the problems mentioned above, an effective BVP extraction method is proposed, and a new dataset is constructed.
The main contributions of this paper can be summarized as follows:

• A simple, fast, and pre-processing-free method is proposed for non-contact BVP extraction. The proposed method, called sequence-based rPPG (SEQ-rPPG), uses linear mapping, spectral transforms, and time-domain filtering to extract BVP signals. It requires no complex pre-processing and can run on mobile ARM CPUs in real-time; its computation cost and memory usage are much smaller than those of existing algorithms.

• A large, synchronized dataset is built for better model training. We developed software that simultaneously records the ground truth signal from a blood pulse meter and the RGB frames from a webcam, with all UNIX timestamps logged. The dataset is designed for large-scale training, with more than 30 hours (3.24M frames) of video available, longer in duration than any public dataset.

2. RELATED WORK

2.1. END-TO-END PHYSIOLOGICAL SIGNAL MEASUREMENT NETWORK

Many studies claim to be end-to-end models while being vague about the definition, often simplifying the pre-processing step or removing it entirely. In DeepPhys (Chen & McDuff, 2018) and MTTS-CAN (Liu et al., 2020), pre-processing algorithms generate difference and average frames, which are fed into the motion and appearance branches of the network, respectively. EfficientPhys (Liu et al., 2021) uses a pre-processing approach similar to MTTS-CAN but performs it within the model and learns the normalization parameters; it is hard to say whether this counts as pre-processing or as part of the model internals. In PhysNet (Yu et al., 2019) and PhysFormer (Yu et al., 2022), the model accepts raw image input directly without additional pre-processing. MTTS-CAN is an improved version of DeepPhys with higher accuracy, and although both require pre-processing, it is simple and adds little extra cost. PhysFormer and PhysNet are both end-to-end models, but PhysFormer's parameter count and computation are much larger than PhysNet's, which does not fit a lightweight, real-time setting. We therefore take EfficientPhys, MTTS-CAN, and PhysNet as representative end-to-end models and compare the proposed method against them.

2.2. FAST FOURIER CONVOLUTION

Chi et al. (2020) proposed Fast Fourier Convolution (FFC), which obtains a large receptive field even in the shallow layers of a network. It improves accuracy on multiple tasks and datasets when FFC layers replace the convolutional layers of standard convolutional networks. Suvorov et al. (2022) applied FFC to a wide range of image restoration tasks, showing superior performance, especially on periodic patterns. Shchekotov et al. (2022) applied FFC to speech and audio denoising, significantly improving the baseline method while reducing the number of parameters.
FFC is particularly sensitive to periodic signals, which makes it attractive for tasks that deal with them, such as audio denoising and rPPG. The difference between the above work and ours is that we neither model spatially nor divide the channels into time-domain and frequency-domain groups; we use plain time-domain channels and bypass them with residual connections. In this view, SEQ-rPPG does not model the video: it filters noise from a 1D sequence and extracts the BVP signal.

3. METHOD

Handcrafted algorithms are based on separating reflection components (Shafer, 1985). For the power spectral density (PSD), the most commonly used representation for heart rate extraction, linearity gives C_{aX+bY} = aC_X + bC_Y, where C_X and C_Y are the PSDs of signals X and Y. This means that the rPPG signal can always be obtained by a linear mapping of the color space. In early rPPG algorithms (Poh et al., 2010a;b; de Haan & Jeanne, 2013; Wang et al., 2017; Tulyakov et al., 2016), there are two steps from the RGB signal to the BVP signal: first, a linear dimensionality reduction yields a 1-dimensional signal; second, filtering post-processing in the time or frequency domain removes noise. SEQ-rPPG also performs two main steps (see Fig. 1): it first linearly maps the raw 8×8 facial RGB frames to a raw signal sequence, and then uses 1D convolutions to filter this sequence and output the BVP waveform. The linear mapping is done inside the model, so the method requires no pre-processing; the parameters of the linear mapping module are learned, and after training it holds appropriate weights for combining the RGB signals.
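The classical two-step pipeline described above can be sketched in a few lines. This is a minimal NumPy illustration, not any specific published method; the projection weights are purely illustrative (CHROM and POS derive theirs from a skin reflection model):

```python
import numpy as np

def classical_rppg(rgb, fs=30.0):
    """Classical two-step pipeline (sketch): a fixed linear projection of
    the mean-RGB traces, followed by a frequency-domain bandpass filter."""
    x = rgb - rgb.mean(axis=0)               # remove the DC level per channel
    s = x @ np.array([-1.0, 2.0, -1.0])      # step 1: linear map to a 1-D signal
    spec = np.fft.rfft(s)
    freqs = np.fft.rfftfreq(len(s), 1.0 / fs)
    spec[(freqs < 0.7) | (freqs > 4.0)] = 0  # step 2: keep only the pulse band
    return np.fft.irfft(spec, n=len(s))

# toy input: a 1.2 Hz pulse mixed into the three color channels
t = np.arange(300) / 30.0
pulse = np.sin(2 * np.pi * 1.2 * t)
rgb = np.stack([0.2 * pulse, 0.5 * pulse, 0.1 * pulse], axis=1) + 1.0
bvp = classical_rppg(rgb)                    # recovers a scaled copy of the pulse
```

SEQ-rPPG replaces the fixed projection with a learned one and the fixed bandpass with learned spectral and temporal filters.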

3.1. LINEAR MAPPING MODULE

Given an RGB facial video input X ∈ R^{N×W×H×C}, we first transpose it to X ∈ R^{N×C×W×H} so that the channel lies in the second dimension, then reshape it to X ∈ R^{NC×WH}. X consists of N sets of RGB signals, and the raw signal is obtained from RGB as

Y_n = Σ_{i=1}^{WH} m_i (R_i, G_i, B_i)^T + b,

where each m_i is a row vector that maps an RGB color to a real number and b is a bias. This color transformation can be achieved by convolution without activation layers: a linear convolution layer with kernel size 3 and stride 3. The linear mapping outputs multi-channel sequences, where the number of channels equals the number of convolution kernels.
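A minimal NumPy sketch of this mapping, under the simplifying assumptions of a single output channel and a kernel shared across pixels (as the stride-3 convolution implies); the weights are illustrative, not learned:

```python
import numpy as np

def linear_mapping(frames, m, b=0.0):
    """Map each 8x8 RGB frame to one sample of the raw signal: a shared
    row vector m maps every pixel's (R, G, B) to a real number, and the
    contributions are summed over all W*H pixels, plus a bias."""
    n = frames.shape[0]
    x = frames.reshape(n, -1, 3).astype(np.float64)  # (N, WH, 3)
    per_pixel = x @ m                                # m applied to each pixel
    return per_pixel.sum(axis=1) + b                 # (N,): one value per frame

frames = np.ones((5, 8, 8, 3))        # dummy facial crops
m = np.array([0.25, 0.5, 0.25])       # illustrative RGB weights
y = linear_mapping(frames, m)         # all-ones frames -> 64 pixels * 1.0 each
```

In the model, several such kernels run in parallel, producing a multi-channel sequence instead of a single one.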

3.2. SIGNAL PROCESSING MODULE

The given raw signal is usually filtered with several convolutional layers. Such filtering operates in the time domain, so adjacent signal points are strongly correlated while distant ones are weakly correlated. For periodic signals, accounting for the correlation between similar periods helps filter noise over a wider range. To this end, the C-channel signal is transformed by a Real Fast Fourier Transform (RFFT), and the real and imaginary parts of the spectrum are stacked along the channel dimension, yielding a representation with 2C channels. A convolution layer is then applied, and its output is re-decomposed into real and imaginary parts, converted back to complex numbers, and recovered to a time-domain signal by the Inverse Real Fast Fourier Transform (IRFFT). The output signal is mixed with the original signal through a residual connection, and the number of channels remains constant throughout the process. The final signal is obtained by alternating this spectral transform with 1D convolution operations.

The module takes the C=64 mapped sequence Y and stacks the following layers (Fig. 2):

Y → RealFFT1D → Conv1D-BN-ReLU (C=128, K=8, S=1) → InvRealFFT1D → Conv1D-BN-ReLU (C=64, K=15, S=1) → RealFFT1D → Conv1D-BN-ReLU (C=128, K=3, S=1) → InvRealFFT1D → Conv1D-BN-ReLU (C=32, K=5, S=1)
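The spectral-transform step can be illustrated with a minimal single-channel NumPy sketch. The learned convolution on the real/imaginary channels is replaced here by illustrative per-frequency gains; the residual connection mixes the filtered signal back with the input:

```python
import numpy as np

def spectral_block(sig, freq_gains):
    """One spectral-transform step (sketch): RFFT, a pointwise linear
    operation on the real/imaginary parts (standing in for the learned
    Conv1D), IRFFT back to the time domain, then a residual connection."""
    spec = np.fft.rfft(sig)
    re, im = spec.real * freq_gains, spec.imag * freq_gains
    filtered = np.fft.irfft(re + 1j * im, n=len(sig))
    return sig + filtered                       # residual mix

fs = 30.0
t = np.arange(450) / fs                         # 15 s at 30 fps
sig = np.sin(2*np.pi*1.2*t) + 0.5*np.sin(2*np.pi*6.0*t)   # pulse + noise
freqs = np.fft.rfftfreq(len(sig), 1.0 / fs)
gains = ((freqs >= 0.7) & (freqs <= 4.0)).astype(float) - 1.0
out = spectral_block(sig, gains)  # residual + (bandpass - identity) = bandpass
```

With these gains the block reduces to an exact bandpass: the 6 Hz noise term is removed and the 1.2 Hz pulse survives, which is the kind of period-aware filtering a time-domain kernel of small width cannot express.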

4. DATASET

To collect the dataset, we developed capture software that obtains raw data from a Logitech C930c webcam and a Contec CMS50E oximeter. The webcam has two capture modes, 1920×1080 resolution with MJPG encoding and 640×480 resolution with YUY2 encoding, both at 30 fps. MJPG is a common fast intra-frame compression format, and YUY2 is a raw YUV format that uses 4 bytes to store 2 Y-components, 1 U-component, and 1 V-component. Three video encoding formats are recorded simultaneously: RAW RGB, MJPG, and H264, where RAW RGB is the original lossless RGB format and H264 is the inter-frame compression format common on the Internet. Ground truth BVP waveforms are read from the oximeter at a sampling rate of 20 Hz. During recording, subjects were required to complete a series of tasks or watch a video; they were not required to keep their heads still, so larger head movements may occur. After completing an assigned task, the subject took a short break, and the experimenter assigned the next task. All 58 subjects (16 male and 42 female) are Chinese students, mostly master's students, and some of the female subjects wore makeup. Excluding data that was unavailable due to formatting errors, incomplete tasks, or equipment failure, over 30 hours (3.24M frames) of video remain. Fig. 3 shows that the proposed dataset is well synchronized, where GT is the ground truth signal and Video is the BVP signal extracted from the video. More details of the self-built dataset can be found in Table 1 and Fig. 4. As shown in Fig. 4, the dataset contains head movement, illumination variation, and expression change. We evaluate three models: SEQ-FT with alternating convolutions in the frequency and time domains, SEQ-T with convolutions in the time domain only, and the streamlined SEQ-tiny (see Fig. 5).
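Since the oximeter samples at 20 Hz while the camera runs at 30 fps, each frame must be paired with a ground truth value using the recorded UNIX timestamps. The exact resampling scheme is not specified in the paper, so the linear interpolation below is an assumption:

```python
import numpy as np

def align_bvp_to_frames(bvp_ts, bvp_vals, frame_ts):
    """Resample the ~20 Hz oximeter BVP onto the ~30 fps frame timestamps
    by linear interpolation over the recorded UNIX timestamps."""
    return np.interp(frame_ts, bvp_ts, bvp_vals)

# toy check with a linear ramp, which linear interpolation recovers exactly
bvp_ts = np.arange(0.0, 10.0, 0.05)         # 20 Hz oximeter clock
frame_ts = np.arange(0.0, 9.9, 1.0 / 30.0)  # 30 fps camera clock
aligned = align_bvp_to_frames(bvp_ts, 2.0 * bvp_ts, frame_ts)
```

Note that `np.interp` clamps queries outside the oximeter's time range, so frames before the first or after the last BVP sample should be trimmed rather than interpolated.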
All models are trained and tested in the same TensorFlow environment (version 2.6.0, compiled with CUDA 11.6, cuDNN 8.3, and AVX2). Hardware platforms include an Nvidia GeForce RTX 3080 10G (GPU), an AMD Ryzen 9 5950X 16-core (CPU), and an ARM Cortex-A72 4-core (CPU); tests on the ARM platform also used TensorFlow 2.6.0, but without any acceleration support for GPUs or x86 CPUs. To train efficiently, we cut the video with a moving window with a step size of 3 seconds and used MediaPipe to perform face detection on the middle frame (other people occasionally appeared in the background, but these cases were handled), and only the facial images were fed to the model. Although Liu et al. (2020); Yu et al. (2019); Liu et al. (2021) claim to be pre-processing-free models, face detection had to be performed because of the large movements in the collected dataset. All models used the same data during training and testing, differing only in resolution. In addition, because the collected dataset lacks enough high heart rate samples, we randomly selected some segments and skipped every other frame, doing the same for the BVP waveform, which doubles the apparent heart rate. This additional data accounted for about 25% of the samples and was evenly distributed in the training set. TS-CAN uses the Nadam optimizer with default parameters, MAE loss, and batch size 128. PhysNet uses the SGD optimizer with learning rate 0.005, NP loss, and batch size 32. Our model uses the Adam optimizer with default parameters, MSE loss, and batch size 32. The classical methods and TS-CAN used a 0.75-2.5 Hz bandpass filter at test time; neither our models nor PhysNet applied filters. CHROM, POS, and ICA (de Haan & Jeanne, 2013; Wang et al., 2017; Poh et al., 2010b) use the code provided in Boccignone et al. (2022) and Boccignone et al. (2020). The final heart rate is extracted from the peak of the signal power spectrum in the 0.5-4 Hz range.
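The frame-skipping augmentation described above can be sketched directly; keeping every other frame and BVP sample doubles the apparent pulse frequency when the result is still interpreted at the original frame rate:

```python
import numpy as np

def double_rate_augment(frames, bvp):
    """High heart rate augmentation: keep every other frame and BVP
    sample, doubling the apparent pulse frequency at the original fps."""
    return frames[::2], bvp[::2]

fps = 30.0
t = np.arange(600) / fps
bvp = np.sin(2 * np.pi * 1.0 * t)     # 1 Hz pulse = 60 bpm
frames = np.zeros((600, 8, 8, 3))     # dummy video stand-in
f2, b2 = double_rate_augment(frames, bvp)

# interpreted at the original 30 fps, the skipped signal peaks at 2 Hz (120 bpm)
freqs = np.fft.rfftfreq(len(b2), 1.0 / fps)
peak_hz = freqs[np.argmax(np.abs(np.fft.rfft(b2)))]
```
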
In calculating the errors and Pearson correlation coefficients, we used a 30-second moving average window with a 1-second step.
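The heart rate readout used in evaluation (the power-spectrum peak within 0.5-4 Hz, computed per window) can be sketched as:

```python
import numpy as np

def hr_from_signal(sig, fs=30.0, lo=0.5, hi=4.0):
    """Heart rate (bpm) as the power-spectrum peak within [lo, hi] Hz."""
    freqs = np.fft.rfftfreq(len(sig), 1.0 / fs)
    power = np.abs(np.fft.rfft(sig)) ** 2
    band = (freqs >= lo) & (freqs <= hi)
    return 60.0 * freqs[band][np.argmax(power[band])]

t = np.arange(900) / 30.0                          # one 30-second window
hr = hr_from_signal(np.sin(2 * np.pi * 1.5 * t))   # 1.5 Hz pulse -> 90 bpm
```

With a 30-second window the frequency resolution is 1/30 Hz, i.e., 2 bpm per bin, which bounds the quantization error of this estimator.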

5.3. EXPERIMENTAL RESULTS

The extensive experimental results are presented in Tables 3, 4, and 5. The rPPG subset is more difficult because it contains more lighting changes and facial activity. SEQ-FT achieved the best accuracy on the collected dataset, with second place going to EfficientPhys-C. The results also show that SEQ-FT is more accurate than SEQ-T, confirming that the spectral transform is effective. We compared models tested on UBFC-rPPG after training on different training sets. Our models perform similarly and all achieve high accuracy; although SEQ-FT and SEQ-tiny differ greatly in parameter count, their accuracy is similar, which may be due to the stable head poses and stable lighting of UBFC-rPPG. TS-CAN and PhysNet train better on the collected dataset than on other datasets; in particular, their results outperform those trained on UBFC-rPPG, indicating that a high-quality training set helps improve model accuracy. Among all cross-dataset results on UBFC-rPPG, the best is TS-CAN trained on the collected dataset, and the second best is PhysNet trained on the collected dataset. On the PURE dataset (Table 5), EfficientPhys-C achieves the best results, followed by TS-CAN. Notably, compared to training on UBFC-rPPG, training EfficientPhys-C on the collected dataset brings a significant improvement, with MAE dropping from 5.90 to 1.82. Our model has very high accuracy on the collected dataset, so it is well suited to such scenarios. It is competitive with small models like TS-CAN on the two public datasets, UBFC-rPPG and PURE, with far fewer parameters and FLOPs, though a gap remains compared to larger models like EfficientPhys. The results of swapping the training and test sets are interesting: cross-dataset evaluations are usually trained on PURE and tested on UBFC-rPPG, but few papers have reversed them.
The results suggest that models trained on UBFC-rPPG may lack generalizability and perform poorly on PURE. The good news is that our collected dataset can solve this problem: models trained on it transfer to different test sets. We also tested the computation cost of the models on different platforms (see Table 6). Unlike the per-frame timing in Liu et al. (2020), we measure the time spent processing a whole batch. Past work paid little attention to input size: the input is usually resized into a float32 matrix, and if the input is large, its memory usage may exceed that of the model itself. The reported model memory usage is a theoretical minimum based on the number of parameters and will be smaller than the actual value. We report the average inference time over large inputs, and device load is also considered: slow inference on some devices may stem from an inability to use computational resources efficiently, leaving the device in a low-load state. Although both low load and high computation cost slow inference down, the energy consumed and heat generated differ; a smaller computation cost means less heat and better energy efficiency, which is especially important on battery-powered devices. Our model has a significant speed advantage, being much faster and consuming far less power than prior models. Based on performance on ARM mobile CPUs, the models fall into three classes: (1) can run in real-time on desktop platforms but not on mobile devices, such as EfficientPhys; (2) can run in real-time on mobile devices, but with a significant computation cost that increases heat dissipation and battery load, such as PhysNet and TS-CAN; (3) runs smoothly on mobile devices with a tiny computation cost, such as the proposed method.

5.5. VISUALIZATION & DISCUSSION

To visually verify the effectiveness of the proposed method, SEQ-FT was chosen for visualization: we randomly selected a video, fed it into the model, and captured the intermediate feature maps wherever possible. Each feature map can be viewed as a multi-channel signal sequence, and its power spectral density (PSD) is computed per channel; the 0.5-2.5 Hz portion is shown. A clear heart rate peak is already visible in the output of the linear mapping, but a considerable amount of noise power is also present in the PSD. As the features go deeper, the noise is suppressed and details start to appear, such as sub-peaks in the high-frequency region.

6. CONCLUSIONS

In terms of accuracy, our model performs similarly to or better than existing models; in terms of speed, it is several times faster than current models and runs effectively on mobile CPUs. On a Raspberry Pi 4B, our highest-accuracy model takes 146 ms per inference and our fastest model 6.3 ms, compared with 1260 ms for TS-CAN and 8530 ms for EfficientPhys. However, SEQ-FT does not perform as well as expected on CPUs, and making the spectral transform run more efficiently on CPUs is worth investigating. The quality of our dataset is better than UBFC-rPPG and PURE: the accuracy of existing models improves significantly when trained on it, e.g., the MAE of TS-CAN decreases from 1.47 (trained on PURE) to 0.69, and the MAE of PhysNet decreases from 1.99 (trained on PURE) to 0.72. We expect the collected dataset to become a baseline training set. However, the proposed method is sensitive to the random shift present in datasets, and we will address this issue in future work.

7. REPRODUCIBILITY STATEMENT

Although the steps required to reproduce the results have been described in this paper, the factors that most affect the results are highlighted in this section. The model has not been finely tuned; its accuracy is not very sensitive to hyperparameters, but it is sensitive to the training set. When working with the datasets, there were some key operations. First, we implemented face detection and tracking. Face detection boxes are usually difficult to keep stable, so we added a displacement threshold of 25 pixels to the tracker: any movement smaller than the threshold is filtered out, while larger movements snap the detection box to the new position, thus avoiding the introduction of periodic noise. Second, when building the training set, we generated additional high heart rate samples with heart rates in the 120-

Our approach is sensitive to the length of the input frame sequence because it focuses on modeling the time dimension. We trained SEQ-FT on the collected dataset and tested it on the collected dataset and UBFC-rPPG (Table 7). The results show that a longer sequence improves accuracy; however, overly long input sequences cause too much delay, so 15 seconds (450 frames) is a suitable value. Our model ignores spatial information, so a higher resolution does not improve accuracy but greatly increases the number of parameters. We trained SEQ-FT on the collected dataset and tested it on the collected dataset and UBFC-rPPG (Table 8); the results show that a resolution of 8×8 is sufficient. Ablations on the model structure (Table 9) show that the combination of spectral transforms and time-domain convolutions is optimal; all models were trained on the collected dataset and tested on the collected dataset and UBFC-rPPG.

A.2 PRIVACY

If the rPPG method is applied in a commercial application, the server may want to capture the user's facial image to optimize the user experience and improve the algorithm. However, the user's facial

A.5 MORE EFFICIENT PRE-PROCESSING

Simple pre-processing can be eliminated with reasonable engineering. In the original work, the pre-processing of TS-CAN was done in NumPy; we reimplemented it in TensorFlow and moved it inside the model. After deployment we then only need to feed in the original video, without caring about the specific pre-processing implementation, which helps with deployment in non-Python environments (e.g., mobile or web). This change does not affect accuracy and makes full use of TensorFlow's matrix operations.

    # Approach of the original literature
    motion, appearance = pre_process(vid)  # pre-processing with NumPy

    # Pre-processing moved inside the model with TensorFlow
    x_ = (x_ - tf.reduce_mean(x_, axis=(0,))) / tf.math.reduce_std(x_, axis=(0,))
    motion = tf.concat([x_, tf.zeros([1, *size, 3])], axis=0)
    appearance = tf.expand_dims(tf.reduce_mean(x, axis=(0,)), axis=0)
    d1, r1 = self.diff_input(motion), self.rawf_input(appearance)
    ...

    # A better approach, closer to end-to-end
    BVP = TS_CAN(vid)



Figure 1: Overview of our framework.

Figure 2: Signal processing module.

Figure 3: BVP signal extracted from UBFC-rPPG video and our collected video, where the collected dataset is highly synchronized.

Figure 4: Overview of our dataset, subjects' faces can move unconstrained and emotions are induced during tasks.

As shown in Fig. 6, the final output PSD is very close to the ground truth. This feature visualization makes our model well interpretable, and our visualization of temporal modeling demonstrates the role of convolutional networks in a different dimension from the spatial-modeling focus of Liu et al. (2020) and Niu et al. (2020a).

Figure 6: Power Spectral Density (PSD) of the feature maps.

Figure 7: Facial images in different resolutions.

Figure 8: Extracting the BVP signal from 2 seconds or 15 seconds of video over the same duration.

Data collection workflows

Theoretical parameters and FLOPs

Results on collected dataset

Cross-dataset evaluation on UBFC-rPPG

Computation cost on different platforms

180 bpm range, representing about 25% of the total number of samples. These samples were obtained by skipping every other frame, so the heart rate is doubled; this is exceptionally important for training the model to correctly identify high heart rates, since without enough high heart rate samples in the training set, large errors may occur when estimating high heart rate test samples. Finally, we ensured that the BVP signal in the training set was strictly aligned with the frames via the recorded UNIX timestamps. Our model does not require facial segmentation, but tracking based on face detection is necessary due to the uncertainty of head position in the collected dataset. The above steps were applied to all models we implemented.

Wim Verkruysse, Lars O Svaasand, and J Stuart Nelson. Remote plethysmographic imaging using ambient light. Opt. Express, 16(26):21434-21445, Dec 2008. doi: 10.1364/OE.16.021434. URL http://opg.optica.org/oe/abstract.cfm?URI=oe-16-26-21434.
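The 25-pixel displacement threshold used for face tracking in the reproducibility statement might look like the following sketch; the box format and function names are assumptions for illustration:

```python
import numpy as np

def stabilize(prev_box, det_box, thresh=25.0):
    """Suppress small face-detector jitter: keep the previous box unless
    the new detection has moved more than `thresh` pixels."""
    prev = np.asarray(prev_box[:2], dtype=float)
    det = np.asarray(det_box[:2], dtype=float)
    if np.linalg.norm(det - prev) < thresh:
        return prev_box        # sub-threshold movement: filtered out
    return det_box             # large movement: jump to the new position

box = (100, 100, 64, 64)                       # (x, y, w, h)
jitter = stabilize(box, (110, 105, 64, 64))    # ~11 px shift: filtered
moved = stabilize(box, (160, 100, 64, 64))     # 60 px shift: accepted
```

Freezing the box for sub-threshold shifts avoids feeding the model small periodic box oscillations that could masquerade as a pulse.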

Effect of different input sequence lengths on accuracy

The effect of different resolutions on accuracy

Effect of different model structures and numbers of layers on accuracy (spectral transform in the frequency domain vs. 1D convolution in the time domain)

image involves privacy issues, and the higher the resolution, the more private information it contains (see Fig. 7), so using low-resolution images helps protect privacy. Besides, low-resolution images reduce network usage during upload and improve the user experience. Our model uses 8×8 inputs, which implies strong privacy.

