EFFECTIVE SELF-SUPERVISED TRANSFORMERS FOR SPARSE TIME SERIES DATA

Abstract

Electronic health records (EHRs) recorded in hospital settings such as intensive care units (ICUs) typically contain a wide range of numeric time series data characterized by high sparsity and irregular observations. Self-supervised Transformer architectures have shown outstanding performance in a variety of structured tasks in natural language processing and computer vision. However, the sparse, irregular nature of ICU EHR time series poses challenges for the application of Transformers that have not been widely explored. One of the major challenges is the quadratic scaling of self-attention layers, which can significantly limit the input sequence length. In this work, we introduce TESS, Transformers for EHR data with Self Supervised learning, a self-supervised Transformer-based architecture designed to extract robust representations from EHR data. We propose the application of input binning to aggregate the time series inputs and sparsity information into a regular sequence with fixed length, enabling the training of larger and deeper Transformers. We demonstrate that significant compression of ICU EHR data is possible without sacrificing useful information, likely due to the highly correlated nature of observations in small time bins. We then introduce self-supervised prediction tasks that provide rich and informative signals for model pre-training. TESS outperforms state-of-the-art deep learning models on multiple downstream tasks from the MIMIC-IV and PhysioNet-2012 ICU EHR datasets.

1. INTRODUCTION

Electronic health record (EHR) data collected in the hospital contains an immense amount of information about patients. This data typically comes in the form of vital sign measurements, lab results, and diagnoses/treatments. Patients in an Intensive Care Unit (ICU) are particularly heavily monitored, with frequent vital sign observations and diagnostic tests. The resulting multivariate numeric time series is high-dimensional, sparse, and irregularly distributed across time, making it challenging to apply standard time series analysis methods that are primarily designed for densely sampled data. These challenges are not unique to health care, and data with such characteristics commonly arises in fields such as finance, banking, and e-commerce (Cao et al., 2021; Gómez-Losada & Duch-Brown, 2019; Zhang et al., 2015).

Good models of clinical outcomes need to extract predictive signal from the values, frequencies, and missingness patterns of such data. Hand-crafting such features is a non-trivial and time-consuming task, which has led to the exploration of deep learning for problems arising in healthcare. However, when labels are noisy and scarce, such methods are also susceptible to overfitting. Self-supervised learning (SSL) (Chopra et al., 2005; Caron et al., 2021) has risen in popularity as a tool to reduce the dependence of representation learning on large amounts of labelled data. SSL relies on the premise that domain experts have prior knowledge about the patterns in high-dimensional data; by translating this domain knowledge into pseudo-tasks, practitioners can ensure that this knowledge is transferred to representation learning models prior to fine-tuning them with labelled data. The premise of SSL is attractive for EHR data, where few positive samples might be observed for a desired outcome and privacy limitations can prevent the collection of larger labelled datasets (Krishnan et al., 2022; Bak et al., 2022).
Methods for SSL are often applied to Transformers (Vaswani et al., 2017), which have proven to be an effective neural architecture for finding useful patterns across a variety of domains. Self-supervised Transformer models currently produce state-of-the-art results in natural language processing (NLP) (Brown et al., 2020), computational histopathology (Chen & Krishnan, 2021; Chen et al., 2022), computer vision (He et al., 2022), and cross-modal learning (Radford et al., 2021). On numeric EHR data (Li et al., 2021; Ren et al., 2021; Tipirneni & Reddy, 2022), however, there remain many open questions regarding good self-supervised tasks and whether SSL and Transformers can achieve the same level of success as in other fields. In this work, we present TESS, an approach for self-supervised training of Transformers on ICU EHR data that produces representations which generalise well to different downstream tasks of interest. Prior work in this vein has used embedded input sequences with one sequence element for every patient event (Li et al., 2021; Tipirneni & Reddy, 2022). This limits scalability, as patients can have hundreds of events in a relatively short period of time, while the memory and runtime complexity of self-attention layers scale quadratically with input length. More efficient attention layers have been proposed (Wang et al., 2020), but state-of-the-art Transformer models still generally use quadratic attention. Training large models with this input representation consequently requires significant hardware resources or aggressive input truncation, which can negatively impact accuracy.

Contributions: To address the limitations discussed above, we first propose an application of time binning to compress the input. We observe that increasing time resolution beyond a certain point does not improve model performance, likely because consecutive measurements of the same event within a small time window are highly correlated.
By aggregating events within time bins and including auxiliary data describing the input sparsity structure, we can significantly compress the input without substantial loss of useful information. Each bin is projected using a multi-layer perceptron (MLP) to a given embedding dimension and combined with an embedded representation of the time period the bin covers. This shorter, denser input allows us to train larger and deeper Transformer models, which in turn leads to better representations.

Next, we propose an SSL approach to train TESS. The measurements that clinicians choose to take reflect their understanding of a patient's condition and related treatment strategies. Consequently, the missingness patterns of different events contain predictive signals of interest for a variety of tasks (Lipton et al., 2016), since they represent a clinician's unobserved intent (or lack thereof) to treat or to measure the value in question. A good representation of a patient's state should be aware of both the past state that the patient transitioned from and the future states they might evolve into. We construct SSL tasks by predicting a combination of both missingness masks and event values. Since adjacent measurements of the same event are likely to be highly correlated, we introduce a masked event-type dropout scheme that encourages the model to learn representations that draw information from other time bins and events rather than relying on simple interpolation.

We evaluate TESS on the MIMIC-IV (Johnson et al., 2022) and PhysioNet-2012 (Silva et al., 2012) datasets, showing that it outperforms state-of-the-art baselines on multiple downstream tasks such as mortality prediction and phenotype classification. We also evaluate the efficacy of self-supervised learning with TESS, showing that it learns a good representation of patients that can be fine-tuned effectively with only a small amount of labelled data.
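As a concrete illustration, the input binning and event-type dropout described above can be sketched in a few lines of NumPy. This is a minimal sketch on synthetic data, not the paper's implementation: the per-bin aggregation (mean value plus observation count as the sparsity signal), the bin count, and the dropout rate are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy input: irregular (time, event_type, value) triplets for one
# ICU stay over a 48-hour window. Event types and values are synthetic.
n_events, n_types, horizon = 200, 5, 48.0
times  = rng.uniform(0, horizon, n_events)      # irregular timestamps (hours)
types  = rng.integers(0, n_types, n_events)     # which measurement was taken
values = rng.normal(size=n_events)              # standardized measurement values

def time_bin(times, types, values, n_bins, n_types, horizon):
    """Aggregate irregular events into a regular, fixed-length sequence.

    Each bin carries, per event type, the mean observed value plus an
    observation count that encodes the sparsity/missingness structure
    (one possible instantiation of the paper's auxiliary sparsity data).
    """
    bins = np.minimum((times / horizon * n_bins).astype(int), n_bins - 1)
    val_sum = np.zeros((n_bins, n_types))
    count   = np.zeros((n_bins, n_types))
    np.add.at(val_sum, (bins, types), values)   # sum values per (bin, type)
    np.add.at(count,   (bins, types), 1)        # count observations per (bin, type)
    mean = np.divide(val_sum, count, out=np.zeros_like(val_sum), where=count > 0)
    # Concatenate means and counts: one fixed-size vector per time bin,
    # ready to be projected by an MLP and summed with a time embedding.
    return np.concatenate([mean, count], axis=1)    # shape (n_bins, 2 * n_types)

x = time_bin(times, types, values, n_bins=24, n_types=n_types, horizon=horizon)
print(x.shape)  # (24, 10): 200 irregular events compressed to 24 regular bins

# Event-type dropout (sketch): hide entire event types in the input so the
# model cannot recover a masked value by interpolating adjacent bins of the
# same event, and must instead use other time bins and correlated events.
dropped = rng.choice(n_types, size=2, replace=False)    # drop 2 of 5 types
masked = x.copy()
masked[:, dropped] = 0.0              # zero the mean values of dropped types...
masked[:, n_types + dropped] = 0.0    # ...and their observation counts
```

In this formulation the sequence length is fixed at `n_bins` regardless of how many raw events a patient has, which is what makes quadratic self-attention affordable for larger, deeper models.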

2. RELATED WORK

Deep learning for sparse irregular time series. A variety of neural network models have been proposed for supervised learning on sparse irregular time series data. Most are based on recurrent neural networks (Hochreiter & Schmidhuber, 1997; Cho et al., 2014) that expect regularly sampled inputs with-

