METAPHYS: FEW-SHOT ADAPTATION FOR NON-CONTACT PHYSIOLOGICAL MEASUREMENT

Abstract

There are large individual differences in physiological processes, making the design of personalized health sensing algorithms challenging. Existing machine learning systems struggle to generalize to unseen subjects or contexts, especially in video-based physiological measurement. Although fine-tuning for each user could address this issue, it is difficult to collect large sets of training data for specific individuals because supervised algorithms require medical-grade sensors to generate the training targets. Therefore, learning personalized or customized models from a small number of unlabeled samples is very attractive, as it would allow fast calibration. In this paper, we present a novel meta-learning approach called MetaPhys for learning personalized cardiac signals from 18 seconds of video data. MetaPhys works in both supervised and unsupervised settings. We evaluate our proposed approach on two benchmark datasets and demonstrate superior performance in cross-dataset evaluation, with substantial reductions (42% to 44%) in error compared with state-of-the-art approaches. Visualizations of attention maps and ablation experiments reveal how the model adapts to each subject and why our proposed approach leads to these improvements. We also demonstrate that our proposed method significantly reduces performance bias across skin types.
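MetaPhys' algorithmic details are not given in this excerpt; the sketch below only illustrates the general MAML-style few-shot adaptation pattern that the abstract alludes to, on a toy linear regression problem standing in for per-subject calibration. All function names, the synthetic tasks, and the hyperparameters are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grad(w, X, y):
    """Squared-error loss and gradient for a linear model y ~ X @ w."""
    err = X @ w - y
    return float(err @ err) / len(y), 2.0 * X.T @ err / len(y)

def maml_step(w_meta, tasks, inner_lr=0.1, outer_lr=0.05):
    """One meta-update (first-order MAML): adapt to each task's support
    set, evaluate on its query set, and move the meta-parameters toward
    parameters that adapt well after a single inner gradient step."""
    meta_grad = np.zeros_like(w_meta)
    for Xs, ys, Xq, yq in tasks:
        _, g = loss_and_grad(w_meta, Xs, ys)   # inner-loop gradient
        w_task = w_meta - inner_lr * g         # per-task adaptation
        _, gq = loss_and_grad(w_task, Xq, yq)  # query-set gradient
        meta_grad += gq
    return w_meta - outer_lr * meta_grad / len(tasks)

def make_task(slope, n=8):
    """A synthetic 'subject': a linear map with a subject-specific slope,
    split into a support set (for adaptation) and a query set (for eval)."""
    X = rng.normal(size=(2 * n, 3))
    y = X @ np.array([slope, 1.0, -0.5])
    return X[:n], y[:n], X[n:], y[n:]

# Meta-train across a population of simulated subjects.
w = np.zeros(3)
for _ in range(200):
    w = maml_step(w, [make_task(s) for s in (0.5, 1.0, 1.5)])

# Personalize to a new, unseen subject with a few gradient steps
# on its small support set.
Xs, ys, Xq, yq = make_task(slope=1.2)
w_adapted = w.copy()
for _ in range(5):
    _, g = loss_and_grad(w_adapted, Xs, ys)
    w_adapted -= 0.1 * g

loss_before, _ = loss_and_grad(w, Xq, yq)
loss_after, _ = loss_and_grad(w_adapted, Xq, yq)
```

After meta-training, a handful of gradient steps on a new subject's support set should cut the query-set error relative to the unadapted meta-parameters, which is the behavior the abstract's "fast calibration" claim relies on.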

1. INTRODUCTION

The importance of scalable health sensing has been acutely highlighted during the SARS-CoV-2 (COVID-19) pandemic. The virus has been linked to increased risk of myocarditis and other serious cardiac (heart) conditions (Puntmann et al., 2020). Contact sensors (electrocardiograms, oximeters) are the current gold standard for measuring heart function. However, these devices are still not ubiquitously available, especially in low-resource settings. The development of video-based contactless sensing of vital signs presents an opportunity for highly scalable physiological monitoring. Furthermore, in clinical settings non-contact sensing could reduce the risk of infection for vulnerable patients (e.g., infants and the elderly) and the discomfort caused to them (Villarroel et al., 2019). While camera-based sensing offers compelling advantages, it also presents unsolved challenges. The reliance on ambient illumination makes camera-based measurement sensitive to environmental differences in the intensity and composition of the incident light. Camera sensors can differ in sensitivity across the frequency spectrum. People (the subjects) exhibit large individual differences in appearance (e.g., skin type, facial hair) and physiology (e.g., pulse dynamics). Finally, contextual differences mean that motions in a video at test time might differ from those seen in the training data. One specific example is the performance bias that exists across skin types (Nowara et al., 2020). This problem is not isolated to physiological measurement: studies have found systematic biases in facial gender classification, with error rates up to 7x higher for women than for men and poorer performance on people with darker skin types (Buolamwini & Gebru, 2018).
Moreover, there are several challenges in collecting large corpora of high-quality physiological data: 1) recruiting and instrumenting participants is often expensive and requires advanced technical expertise, and 2) the data can reveal the identity of subjects and/or sensitive health information, making it difficult for researchers to share such datasets. Therefore, training supervised models that generalize well across environments and subjects is challenging. For these reasons, we observe that with current state-of-the-art methods, performance on cross-dataset evaluation is significantly worse than on within-dataset evaluation (Chen & McDuff, 2018; Liu et al., 2020).

