DROPS: DEEP RETRIEVAL OF PHYSIOLOGICAL SIGNALS VIA ATTRIBUTE-SPECIFIC CLINICAL PROTOTYPES

Abstract

The ongoing digitization of health records within the healthcare industry results in large-scale datasets. Manually extracting clinically-useful insight from such datasets is non-trivial. However, doing so at scale while simultaneously leveraging patient-specific attributes such as sex and age can assist with clinical-trial enrollment, medical school educational endeavours, and the evaluation of the fairness of neural networks. To facilitate the reliable extraction of clinical information, we propose to learn embeddings, known as clinical prototypes (CPs), via supervised contrastive learning. We show that CPs can be efficiently used for large-scale retrieval and clustering of physiological signals based on multiple patient attributes. We also show that CPs capture attribute-specific semantic relationships.

1. INTRODUCTION

Physiological data are being collected at a burgeoning rate. Such growth is driven by the digitization of previous patient records, the presence of novel health monitoring and recording systems, and the recent recommendation to facilitate the exchange of health records (European Commission, 2019). This engenders large-scale datasets from which the manual extraction of clinically-useful insight is non-trivial. Such insight can include, but is not limited to, medical diagnoses, prognoses, or treatment. In the presence of large-scale datasets, retrieving instances based on some user-defined criteria has been a longstanding goal within the machine learning community (Manning et al., 2008). This information retrieval (IR) process typically consists of a query that is used to search through a large database and retrieve matched instances. Within healthcare, the importance of an IR system is threefold (Hersh & Hickam, 1998; Hersh, 2008). First, it provides researchers with greater control over which patients to choose for clinical trial recruitment. Second, IR systems can serve as an educational and diagnostic tool, allowing physicians to identify seemingly similar patients who exhibit different clinical parameters and vice versa. Lastly, if the query were to consist of sensitive attributes such as sex, age, and race, then such a system would allow researchers to more reliably evaluate the individual and counterfactual fairness of a particular model (Verma & Rubin, 2018). To illustrate this point, let us assume the presence of a query instance that corresponds to a patient with an abnormality of the heart, atrial fibrillation, who is male and under the age of 25. To reliably determine the sensitivity of a model with respect to sex, one would observe its response when exposed to a counterfactual instance, namely the exact same instance but with a different sex label (Kusner et al., 2017).
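To make the generic query-and-retrieve process described above concrete, the following is a minimal sketch of similarity-based retrieval over a small database of embeddings. It is illustrative only and is not the method proposed in this paper: the function names `cosine_similarity` and `retrieve_top_k`, the instance identifiers, the two-dimensional toy vectors, and the choice of cosine similarity as the matching criterion are all assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(query, database, k=2):
    """Return the ids of the k database entries most similar to the query."""
    ranked = sorted(
        database,
        key=lambda item_id: cosine_similarity(query, database[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy database: instance id -> embedding (assumed values).
# Real representations of physiological signals would be learned and
# have far more dimensions.
database = {
    "ecg_a": [1.0, 0.0],
    "ecg_b": [0.9, 0.1],
    "ecg_c": [0.0, 1.0],
}

# Hypothetical query embedding.
query = [1.0, 0.05]

top = retrieve_top_k(query, database, k=2)
print(top)
```

In this sketch the query is compared against every entry, which scales linearly with the size of the database; a large-scale system would typically need a more efficient search strategy.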
At present, deep-learning-based IR systems within the healthcare domain fail to incorporate such patient-specific attributes. Existing IR systems that retrieve instances from electronic health records (Wang et al., 2019; Chamberlin et al., 2019) do not incorporate an attribute-specific search and do not trivially extend to physiological signals. In this paper, we propose to learn embeddings, referred to as clinical prototypes (CPs). CPs are efficient descriptors of a combination of patient-specific attributes, such as disease, sex, and age. We learn these embeddings via contrastive learning whereby representations of instances are encouraged to be similar to their corresponding clinical prototype and dissimilar to the others. To the best of our knowledge, we are the first to design a supervised contrastive learning-based large-scale retrieval system for electrocardiogram (ECG) signals.

Contributions. Our contributions are the following:

