METAPHYS: FEW-SHOT ADAPTATION FOR NON-CONTACT PHYSIOLOGICAL MEASUREMENT

Abstract

Large individual differences in physiological processes make designing personalized health sensing algorithms challenging. Existing machine learning systems struggle to generalize to unseen subjects or contexts, especially in video-based physiological measurement. Although fine-tuning for each user might address this issue, it is difficult to collect large sets of training data for specific individuals because supervised algorithms require medical-grade sensors to generate the training targets. Therefore, learning personalized or customized models from a small number of unlabeled samples is very attractive, as it would allow fast calibration. In this paper, we present MetaPhys, a novel meta-learning approach for learning personalized cardiac signals from 18 seconds of video. MetaPhys supports both supervised and unsupervised adaptation. We evaluate our proposed approach on two benchmark datasets and demonstrate superior performance in cross-dataset evaluation, with substantial reductions (42% to 44%) in error compared with state-of-the-art approaches. Visualizations of attention maps and ablation experiments reveal how the model adapts to each subject and why our proposed approach leads to these improvements. We also demonstrate that our proposed method significantly reduces bias across skin types.

1. INTRODUCTION

The importance of scalable health sensing has been acutely highlighted during the SARS-CoV-2 (COVID-19) pandemic. The virus has been linked to increased risk of myocarditis and other serious cardiac (heart) conditions (Puntmann et al., 2020). Contact sensors (electrocardiograms, oximeters) are the current gold standard for measuring heart function. However, these devices are still not ubiquitously available, especially in low-resource settings. The development of video-based contactless sensing of vital signs presents an opportunity for highly scalable physiological monitoring. Furthermore, in clinical settings non-contact sensing could reduce the risk of infection for vulnerable patients (e.g., infants and the elderly) and the discomfort caused to them (Villarroel et al., 2019). While camera-based sensing has compelling advantages, it also presents unsolved challenges. The use of ambient illumination means camera-based measurement is sensitive to environmental differences in the intensity and composition of the incident light. Camera sensors can differ in sensitivity across the frequency spectrum. People (the subjects) exhibit large individual differences in appearance (e.g., skin type, facial hair) and physiology (e.g., pulse dynamics). Finally, contextual differences mean that motions in a video at test time might differ from those seen in the training data. One specific example is that performance is biased across skin types (Nowara et al., 2020). This problem is not isolated to physiological measurement: studies have found systematic biases in facial gender classification, with error rates up to 7x higher on women than on men and poorer performance on people with darker skin types (Buolamwini & Gebru, 2018).
Moreover, there are several challenges in collecting large corpora of high-quality physiological data: 1) recruiting and instrumenting participants is often expensive and requires advanced technical expertise; 2) the data can reveal the identity of the subjects and/or sensitive health information, making it difficult for researchers to share such datasets. Therefore, training supervised models that generalize well across environments and subjects is challenging. For these reasons, we observe that performance on cross-dataset evaluation is significantly worse than on within-dataset evaluation with current state-of-the-art methods (Chen & McDuff, 2018; Liu et al., 2020). Calibration of consumer health sensors is often performed in a clinic, where a clinician collects readings from a high-end sensor to calibrate a consumer-level device the patient owns; this is partly due to the variability in readings from consumer devices across individuals. Ideally, we would train a personalized model for each individual; however, standard supervised training schemes require large amounts of labeled data. Obtaining enough physiological training data for each individual is difficult because it requires medical-grade devices to provide reliable labels. Being able to generate a personalized model from a small number of training samples would enable customization based on a few seconds or minutes of video captured while visiting a clinic with access to a gold-standard device. Furthermore, if this could be achieved without even needing such devices (i.e., in an unsupervised manner), the impact would be greater still. Finally, combining remote physiological measurement with telehealth could provide patients' vital signs to clinicians during remote diagnosis.
Given that requests for telehealth appointments increased more than 10x during COVID-19, and that this trend is expected to continue (Smith et al., 2020), robust personalized models are of growing importance. Meta-learning, or learning to learn, has been extensively studied in the past few years (Hospedales et al., 2020). Instead of learning a single generalized mapping, the goal of meta-learning is to design a model that can adapt to a new task or context with a small amount of data. Due to this inherent ability for fast adaptation, meta-learning is a good candidate strategy for building personalized models (e.g., personalization in dialogue and video retargeting (Madotto et al., 2019; Lee et al., 2019)). However, we argue that meta-learning is underused in healthcare, where clinicians routinely adapt their clinical knowledge to different patients. The goal of this work is to develop a meta-learning based personalization framework for remote physiological measurement that uses a limited amount of data from an unseen individual (task), mimicking how a clinician manually calibrates sensor readings for a specific patient. When meta-learning is applied to remote physiological measurement, there are two scenarios: 1) supervised adaptation with a few samples of labeled data from a clinical-grade sensor, and 2) unsupervised adaptation with unlabeled data. We hypothesize that supervised adaptation is more likely to yield a robust personalized model with only a few labels, while unsupervised adaptation may personalize the model less effectively but with much lower effort and complexity. In this paper, we propose MetaPhys, a novel meta-learning approach that addresses the aforementioned challenges.
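The supervised adaptation scenario described above can be sketched as a MAML-style inner/outer loop: a shared initialization is meta-trained so that a few gradient steps on a new subject's small support set yield a personalized model. The sketch below is a toy first-order illustration on scalar linear models (each "subject" has its own slope, standing in for individual physiology); it is not the MetaPhys architecture or training procedure, and all names and data are hypothetical.

```python
import numpy as np

def loss_and_grad(w, x, y):
    """Squared error of the linear model y_hat = w * x, with gradient w.r.t. w."""
    err = w * x - y
    return np.mean(err ** 2), np.mean(2 * err * x)

def inner_adapt(w, x, y, lr=0.1, steps=3):
    """Personalization: a few gradient steps on a subject's small support set."""
    for _ in range(steps):
        _, g = loss_and_grad(w, x, y)
        w = w - lr * g
    return w

def meta_train(tasks, meta_lr=0.05, epochs=200):
    """First-order MAML: move the shared init so that a few inner steps
    on each task (subject) produce a low loss on that task's query set."""
    w = 0.0
    for _ in range(epochs):
        for (xs, ys, xq, yq) in tasks:
            w_task = inner_adapt(w, xs, ys)          # adapt on support set
            _, g = loss_and_grad(w_task, xq, yq)     # first-order outer gradient
            w = w - meta_lr * g
    return w

# Toy "subjects": each has its own slope (standing in for individual physiology).
rng = np.random.default_rng(0)
tasks = []
for slope in [0.8, 1.0, 1.2]:
    x = rng.normal(size=20)
    tasks.append((x[:10], slope * x[:10], x[10:], slope * x[10:]))

w0 = meta_train(tasks)
# Few-shot adaptation to an unseen subject (slope 1.1) from 5 support samples.
x_new = rng.normal(size=5)
w_new = inner_adapt(w0, x_new, 1.1 * x_new)
```

The unsupervised variant would replace the labeled support loss in `inner_adapt` with a self-supervised objective computed from the video alone; the outer loop is unchanged.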
Our contributions are: 1) a meta-learning based deep neural framework, supporting both supervised and unsupervised few-shot adaptation, for camera-based vital sign measurement; 2) a systematic cross-dataset evaluation showing that our system considerably outperforms the state of the art (42% to 52% reduction in heart rate error); 3) an ablation experiment that freezes weights in the temporal and appearance branches to test their sensitivity during adaptation; 4) an analysis of performance for subjects with different skin types. Our code, example models, and video results can be found on our GitHub page.¹

2. BACKGROUND

Video-Based Physiological Measurement: Video-based physiological measurement is a growing interdisciplinary domain that leverages ubiquitous imaging devices (e.g., webcams, smartphone cameras) to measure vital signs and other physiological processes. Early work established that changes in light reflected from the body could be used to capture subtle variations in blood volume and motion related to the photoplethysmogram (PPG) (Takano & Ohta, 2007; Verkruysse et al., 2008) and the ballistocardiogram (BCG) (Balakrishnan et al., 2013), respectively. Video analysis enables non-contact, spatial and temporal measurement of arterial and peripheral pulsations and allows for magnification of these signals (Wu et al., 2012), which may help with examination (e.g., (Abnousi et al., 2019)). From the PPG and BCG signals, heart rate can be extracted (Poh et al., 2010b; Balakrishnan et al., 2013). However, the relationship between pixels and underlying physiological changes in a video is complex, and neural models have shown strong performance compared with source separation techniques (Chen & McDuff, 2018; Yu et al., 2019; Zhan et al., 2020). Conventional supervised learning requires a large amount of training data to produce a generalized model. However, obtaining a large body of physiological and facial data is complicated and expensive. Current public datasets have limited

¹ https://github.com/anonymous0paper/MetaPhys

