SEQ-RPPG: A FAST BVP SIGNAL EXTRACTION METHOD FROM FRAME SEQUENCES

Abstract

Remote photoplethysmography (rPPG) can be widely applied to physical, health, and emotional monitoring, such as monitoring the heart rate of drivers, consumers, the elderly, and infants. Several rPPG methods have been proposed in recent years, but non-contact heart rate estimation in realistic situations remains challenging. In particular, existing deep learning-based rPPG methods cannot achieve real-time performance on low-cost devices. To address this problem, we propose a simple, fast, and pre-processing-free approach, called sequence-based rPPG (SEQ-rPPG), for non-contact heart rate estimation. SEQ-rPPG first transforms the RGB frame sequence into a new signal sequence through a learned linear mapping, and then outputs the final BVP signal using a 1D-CNN-based spectral transform and time-domain filtering. It requires no complex pre-processing, has the fastest speed among the compared methods, runs in real time on mobile ARM CPUs, and achieves real-time beat-to-beat performance on desktop CPUs. Furthermore, we present a well-annotated dataset, focused on providing large-scale, highly synchronized PPG and video recordings. The entire dataset will be made available to the research community. Benefiting from this high-quality dataset, other deep learning-based models also achieve lower errors. To demonstrate the efficacy of the proposed method, we compare it with state-of-the-art methods. Experimental results on both the self-built and publicly available datasets demonstrate the effectiveness of the proposed method. We also verify that processing in the frequency domain is effective.

1. INTRODUCTION

Blood volume pulse (BVP) is a physiological signal that reflects cardiac activity. The most common method of BVP signal extraction is photoplethysmography (PPG). However, PPG requires the subject to wear an optical contact sensor, which can cause discomfort and lead to practical difficulties, such as when monitoring patients in pediatric intensive care units. Non-contact BVP extraction is possible via high-sensitivity cameras and webcams using ambient light as the source of illumination (Takano & Ohta, 2007; Verkruysse et al., 2008). Remote PPG (rPPG) has therefore attracted extensive attention in recent years. Early rPPG methods mainly focused on temporal modeling. For example, independent component analysis (ICA) (Poh et al., 2010a) was used to extract BVP signals from RGB signals, and Poh et al. (2010b) added a detrending operation to improve the robustness of ICA. However, blind source separation in RGB color space showed limited success. de Haan & Jeanne (2013) proposed a chrominance-based method that is robust to luminance changes. Temporal modeling thus centered on linear optical models and filter-based post-processing, which empirically proved to be fast and effective. In addition, spatial modeling is another important route to BVP extraction (Tulyakov et al., 2016; Bobbia et al., 2019): these methods attempt to select the facial regions with the highest signal-to-noise ratio, and can thereby accommodate more head movement, illumination change, and shading. However, these handcrafted algorithms achieve poor accuracy compared with deep learning-based approaches. Recently, deep learning-based approaches (Chen & McDuff, 2018; Niu et al., 2020a;b; Liu et al., 2020; Song et al., 2021; Yu et al., 2019; Lu et al., 2021; Liu et al., 2021; Yu et al., 2022) have achieved better accuracy.
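To make the flavor of these handcrafted temporal methods concrete, the following is a minimal sketch of a chrominance-based projection in the spirit of de Haan & Jeanne (2013), operating on spatially averaged RGB traces. The projection coefficients follow the commonly cited CHROM formulation; the filter order and heart-rate band are our assumptions, not values from this paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def chrom_bvp(rgb, fs=30.0):
    """Chrominance-based pulse extraction (sketch, after de Haan & Jeanne).

    rgb: array of shape (T, 3) holding spatially averaged R, G, B traces.
    fs:  video frame rate in Hz.
    Returns a 1-D pulse signal of length T.
    """
    rgb = np.asarray(rgb, dtype=float)
    # Normalize each channel by its temporal mean to suppress skin tone
    # and slow illumination changes.
    norm = rgb / rgb.mean(axis=0)
    r, g, b = norm[:, 0], norm[:, 1], norm[:, 2]
    # Two chrominance projections designed to cancel specular distortions.
    x = 3.0 * r - 2.0 * g
    y = 1.5 * r + g - 1.5 * b
    # Band-pass to the plausible heart-rate band (0.7-4 Hz, i.e. 42-240 bpm).
    lo, hi = 0.7 / (fs / 2), 4.0 / (fs / 2)
    bb, aa = butter(3, [lo, hi], btype="band")
    xf, yf = filtfilt(bb, aa, x), filtfilt(bb, aa, y)
    # Alpha tuning combines the two projections into a single pulse signal.
    alpha = xf.std() / (yf.std() + 1e-9)
    return xf - alpha * yf
```

The band-pass step is an example of the filter-based post-processing mentioned above; it is what makes such pipelines fast enough to run without any learned components.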
However, the computation cost of these methods is much larger than that of handcrafted algorithms, making them difficult to deploy on low-cost devices. The high computation cost has two sources. The first is pre-processing: the algorithms in Niu et al. (2020a;b); Song et al. (2021); Lu et al. (2021) require fine segmentation of the face, so they rely on handcrafted features, and the computation required for pre-processing exceeds that of the model itself. The second is model size: some end-to-end models (Liu et al., 2021; Yu et al., 2022) achieve high accuracy but are too large to run on devices without GPUs. Although some models (Yu et al., 2019; Liu et al., 2020) can run in real time on mobile CPUs with good accuracy, their computation cost is still too high. rPPG is commonly used in affective computing and telemedicine, where users must run other components alongside the rPPG application; it is unacceptable for real-time rPPG to consume most of the computation resources. It is therefore necessary to design an rPPG algorithm that occupies only a small share of the computation budget. Apart from algorithm design, the dataset is also vital for training better models. Different training sets have a huge impact on model performance, mainly because in most datasets the video and BVP signals are not well synchronized. Some datasets are highly synchronized, such as PURE (Stricker et al., 2014) and SCAMPS (McDuff et al., 2022). However, PURE contains only 59 minutes of video, and SCAMPS, as a synthetic dataset, has not yet shown sufficiently strong generalization. In addition, some works (Yu et al., 2019; Botina-Monsalve et al., 2022; Comas et al., 2022) focused on making models robust to non-synchronized datasets; a more common practice is to use unsupervised methods (e.g., POS) to synchronize the signals, but designing methods that are robust to non-synchronization remains an open problem.
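The synchronization practice mentioned above can be illustrated with a generic alignment step: estimate the time offset between a video-derived pulse signal and the contact-PPG ground truth by normalized cross-correlation, then shift one signal accordingly. This is a standard alignment sketch for intuition, not a procedure taken from this paper; the function name and the search window are our choices.

```python
import numpy as np

def estimate_lag(reference, signal, fs, max_lag_s=2.0):
    """Estimate the offset (in seconds) of `signal` relative to `reference`
    via normalized cross-correlation over a bounded lag window.

    reference, signal: 1-D arrays of equal length sampled at `fs` Hz.
    Returns the lag l (seconds) maximizing correlation of
    reference[t] with signal[t + l].
    """
    max_lag = int(max_lag_s * fs)
    # Zero-mean, unit-variance normalization so amplitude differences
    # between the two sensors do not bias the correlation.
    a = (reference - reference.mean()) / (reference.std() + 1e-9)
    b = (signal - signal.mean()) / (signal.std() + 1e-9)
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.dot(a[max(0, -l):len(a) - max(0, l)],
                   b[max(0, l):len(b) - max(0, -l)]) for l in lags]
    return lags[int(np.argmax(corr))] / fs
```

A dataset recorded with shared UNIX timestamps, as described in the next section, largely removes the need for this kind of post-hoc alignment.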
Although there are different ways to alleviate the synchronization problem, the scarcity of high-quality datasets still limits the performance of some models. To address the problems above, we propose an effective BVP extraction method and construct a new dataset. The main contributions of this paper can be summarized as follows:

• A simple, fast, and pre-processing-free method is proposed for non-contact BVP extraction. The proposed method, called sequence-based rPPG (SEQ-rPPG), uses a linear mapping, a spectral transform, and time-domain filtering to extract BVP signals. It requires no complex pre-processing and can run on mobile ARM CPUs in real time; its computation cost and memory usage are much smaller than those of existing algorithms.

• A large, synchronized dataset is built for better model training. We developed software that simultaneously records the ground-truth signal from a blood pulse meter and the RGB signal from a webcam, together with all UNIX timestamps. The dataset is designed for large-scale training, offering more than 30 hours (3.24M frames) of video, longer in duration than any public dataset.
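The three SEQ-rPPG stages listed in the first contribution can be sketched as a forward pass. The sketch below uses randomly initialized weights as stand-ins for learned parameters; the frame dimension, kernel widths, and the moving-average smoother are our illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def seq_rppg_forward(frames, rng=None):
    """Illustrative forward pass of the three SEQ-rPPG stages:
    (1) a learned linear mapping from each flattened RGB frame to a scalar,
    (2) a 1-D convolution along the time axis (spectral transform),
    (3) a fixed time-domain smoothing filter.

    frames: array of shape (T, D) -- T flattened, downsampled RGB frames.
    Returns a length-T BVP estimate.
    """
    T, D = frames.shape
    rng = np.random.default_rng(0) if rng is None else rng
    # Stage 1: linear mapping, one scalar sample per frame.
    w = rng.standard_normal(D) / np.sqrt(D)
    s = frames @ w                                      # (T,)
    # Stage 2: temporal 1-D convolution; "same" mode keeps length T.
    kernel = rng.standard_normal(9) / 3.0
    s = np.convolve(s, kernel, mode="same")
    # Stage 3: fixed moving-average filter as time-domain smoothing.
    return np.convolve(s, np.ones(5) / 5.0, mode="same")
```

Because every stage is a small linear or 1-D operation over a scalar sequence, the per-frame cost is tiny, which is what makes real-time operation on mobile ARM CPUs plausible.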

2.1. END-TO-END PHYSIOLOGICAL SIGNAL MEASUREMENT NETWORK

Many studies claim to be end-to-end models while simplifying the pre-processing, or even removing it, and remain vague about what end-to-end means. In DeepPhys (Chen & McDuff, 2018) and MTTS-CAN (Liu et al., 2020), pre-processing algorithms generate differential and average frames, which are fed into the motion and appearance branches of the network, respectively. EfficientPhys (Liu et al., 2021) uses a pre-processing approach similar to MTTS-CAN, but performs it inside the model and learns the normalization parameters; it is debatable whether this constitutes pre-processing or part of the model itself. In PhysNet (Yu et al., 2019) and PhysFormer (Yu et al., 2022), the model accepts raw image input directly without additional pre-processing. MTTS-CAN is an improved version of DeepPhys with higher accuracy; although both require pre-processing, that pre-processing is simple and adds little cost. PhysFormer and PhysNet are both end-to-end models, but PhysFormer has far more parameters and computation than PhysNet, which makes it unsuitable for demonstrating lightweight, real-time operation. We therefore take EfficientPhys, MTTS-CAN, and PhysNet as representative end-to-end models and compare the proposed method with them.
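The motion/appearance input preparation used by the DeepPhys family can be sketched as follows. This is a simplified illustration of the idea (normalized frame differences for motion, standardized frames for appearance); the exact normalization constants in the published implementations differ.

```python
import numpy as np

def motion_appearance_inputs(frames, eps=1e-6):
    """DeepPhys/MTTS-CAN-style input preparation (simplified sketch).

    frames: float array of shape (T, H, W, 3), pixel values in [0, 1].
    Returns (motion, appearance), each of shape (T-1, H, W, 3).
    """
    f = np.asarray(frames, dtype=float)
    # Motion branch: (c(t+1) - c(t)) / (c(t+1) + c(t)) suppresses static
    # skin tone and illumination, keeping only temporal change.
    motion = (f[1:] - f[:-1]) / (f[1:] + f[:-1] + eps)
    motion /= motion.std() + eps
    # Appearance branch: standardized raw frames give the spatial context
    # (where the skin is) used to attend over the motion branch.
    appearance = (f[:-1] - f.mean()) / (f.std() + eps)
    return motion, appearance
```

Note how cheap this step is relative to face segmentation: it is a handful of element-wise array operations per frame, which is why DeepPhys-style pre-processing adds little cost compared with landmark-based pipelines.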

