A MULTI-MODAL AND MULTITASK BENCHMARK IN THE CLINICAL DOMAIN

Anonymous

Abstract

Healthcare represents one of the most promising application areas for machine learning algorithms, including modern methods based on deep learning. Modern deep learning algorithms perform best on large datasets and on unstructured modalities such as text or image data, and advances in deep learning have often been driven by the availability of such large datasets. Here, we introduce Multi-Modal Multi-Task MIMIC-III (M3), a dataset and benchmark for evaluating machine learning algorithms in the healthcare domain. This dataset contains multi-modal patient data collected from intensive care units (including physiological time series, clinical notes, ECG waveforms, and tabular inputs) and defines six clinical tasks (including predicting mortality, decompensation, readmission, and other outcomes) that serve as benchmarks for comparing algorithms. We introduce new multi-modal and multitask models for this dataset and show that they outperform previous state-of-the-art results that rely on only a subset of all tasks and modalities. This highlights the potential of multitask and multi-modal learning to improve the performance of algorithms in the healthcare domain. More generally, we envision M3 as a general resource that will help accelerate research in applying machine learning to healthcare.

1. INTRODUCTION

Healthcare and medicine are some of the most promising areas in which machine learning algorithms can have an impact (Yu et al., 2018). Techniques relying on machine learning have found successful applications in dermatology, ophthalmology, and many other fields of medicine (Esteva et al., 2017; Gulshan et al., 2016; Hannun et al., 2019). Modern machine learning techniques, including algorithms based on deep learning, perform best on large datasets and on unstructured inputs, such as text, images, and other forms of raw signal data (You et al., 2016; Agrawal et al., 2016). Progress in modern machine learning has in large part been driven by the availability of these types of large datasets, as well as by competitive benchmarks on which algorithms are evaluated (Deng et al., 2009; Lin et al., 2014). Recently, machine learning algorithms that combine data from multiple domains and that are trained to simultaneously solve a large number of tasks have achieved performance gains in domains such as machine translation and drug discovery (Johnson et al., 2017; Ramsundar et al., 2015). Current research in this area is driven by widely adopted computational benchmarks, particularly in the field of natural language processing (Wang et al., 2018a; 2019). In this paper, we argue that multi-modal and multitask benchmarks can similarly drive progress in applications of machine learning to healthcare. In many healthcare settings, we have access to data coming from diverse modalities, including radiology images, clinical notes, wearable sensor data, and others, and we are solving many tasks, for example, estimating disease risk, predicting readmission, and forecasting decompensation events. These kinds of settings are naturally suited to modern deep learning algorithms; developing models that effectively leverage diverse tasks and modalities has the potential to greatly improve the performance of machine learning algorithms in the clinical domain.
As a first step in this research direction, we introduce in this paper Multi-Modal Multi-Task MIMIC-III (M3), a dataset and benchmark for evaluating machine learning algorithms in healthcare that is inspired by popular multitask benchmarks in other application domains, such as natural language processing (Wang et al., 2018b; McCann et al., 2018). Previous clinical datasets and benchmarks have focused either on specific tasks in isolation, as in Khadanga et al. (2020), or on multiple tasks over a single input modality (Harutyunyan et al., 2019). Our work is the first to combine multiple tasks and modalities into one benchmark. More specifically, we propose a dataset that is derived from the MIMIC-III database and comprises data collected from over forty thousand patients who stayed in intensive care units (ICUs) of the Beth Israel Deaconess Medical Center between 2001 and 2012 (Johnson et al., 2016). As part of this dataset, we have collected data from four modalities (physiological time series, clinical notes, ECG waveforms, and tabular data) and have defined six clinical tasks, including mortality prediction, decompensation, and readmission. We also propose an evaluation framework to benchmark models on this dataset. As a demonstration of how the M3 benchmark can drive progress in clinical applications of machine learning, we propose a first set of multi-modal and multitask models and evaluate them on our new benchmark. We find that these models achieve high performance levels and may serve as strong baselines for future work. In particular, our models outperform previous state-of-the-art results that rely on only a subset of all tasks and modalities. These results highlight the potential of multitask and multi-modal learning to improve the performance of algorithms in the healthcare domain. We envision M3 as a general resource that will help accelerate research in applying machine learning to healthcare.
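To make the structure of the dataset concrete, the following is a minimal illustrative sketch of how a single ICU stay, with its four input modalities and a few of the task labels, might be represented in code. The class name, field names, and shapes are hypothetical and do not reflect the actual M3 schema; the three remaining benchmark tasks are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical record layout; field names are illustrative only,
# not the actual M3 schema.
@dataclass
class ICUStayRecord:
    # Input modalities
    vitals: List[List[float]]   # physiological time series (time x features)
    notes: List[str]            # clinical notes written during the stay
    ecg: List[float]            # ECG waveform samples
    tabular: Dict[str, object]  # static fields (demographics, admission info)
    # Task labels; None means the label is unavailable for this stay.
    # (The other three benchmark tasks are omitted from this sketch.)
    in_hospital_mortality: Optional[int] = None   # binary label
    decompensation: Optional[List[int]] = None    # per-hour binary labels
    readmission: Optional[int] = None             # binary label

# Example: a toy record with two time steps of vitals and one note.
record = ICUStayRecord(
    vitals=[[80.0, 120.0], [82.0, 118.0]],
    notes=["Patient admitted with shortness of breath."],
    ecg=[0.01, 0.02, -0.01],
    tabular={"age": 67, "gender": "F"},
    in_hospital_mortality=0,
)
```

Keeping all modalities and labels in one record per stay makes it straightforward to train a single model on any subset of tasks, which is the usage pattern a multi-modal multitask benchmark encourages.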
To facilitate such uses, we release M3 and our models as an easy-to-use open-source package for the research community.

Contributions. In summary, our paper makes the following contributions:
• We define a new benchmark for machine learning algorithms in the clinical domain. It defines six clinical tasks and is the first to collect data across multiple modalities.
• We introduce new multi-modal and multitask machine learning models that outperform previous state-of-the-art methods relying on only a subset of tasks or modalities. This highlights the importance of multi-modal and multitask learning in clinical settings.
• We package our benchmark in an easy-to-use format so that the clinical machine learning community can further build upon our work.

2. BACKGROUND

Machine Learning in the Clinical Domain. Machine learning has been successfully applied throughout healthcare, including in areas such as medical imaging, drug discovery, and many others (Rajpurkar et al., 2017; Vamathevan et al., 2019). In this paper, we restrict our attention to a specific healthcare setting: intensive care. The Medical Information Mart for Intensive Care (MIMIC-III) database is one of the most important resources for applying machine learning to intensive care (Johnson et al., 2016). Data collected in the ICUs include vital signs, lab events, medical interventions, and socio-demographic information.

Multi-Modal and Multi-Task Learning. Multitask learning trains models to simultaneously solve multiple tasks (Ruder, 2017). Successful applications of multitask learning include machine translation and drug discovery (Johnson et al., 2017; Ramsundar et al., 2015). Current research in this area is driven by popular benchmarks, particularly in the field of natural language processing (Wang et al., 2018b; 2019; Rajpurkar et al., 2016). Multi-modal machine learning combines and models data of different modalities, such as vision, language, and speech. A key challenge in multi-modal learning is combining representations over diverse input types. Applications of multi-modal learning include image captioning and visual question answering (Anderson et al., 2018; Agrawal et al., 2016; Moradi et al., 2018; Nguyen et al., 2019).
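The hard-parameter-sharing design underlying much of multitask learning can be sketched in a few lines: a shared encoder produces one representation, and lightweight task-specific heads map it to each prediction. The sketch below uses NumPy with arbitrary dimensions and task names chosen for illustration; it is not the architecture of any model discussed in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hard parameter sharing: one shared encoder, one small head per task.
# All shapes and task names here are illustrative.
d_in, d_hidden = 16, 8
W_shared = rng.normal(size=(d_in, d_hidden))   # shared encoder weights
heads = {
    "mortality": rng.normal(size=(d_hidden, 1)),
    "readmission": rng.normal(size=(d_hidden, 1)),
}

def forward(x):
    """Map a batch of inputs to per-task probabilities."""
    h = np.tanh(x @ W_shared)  # shared representation used by every task
    # Each task applies only its own head to the shared features.
    return {task: 1.0 / (1.0 + np.exp(-(h @ W))) for task, W in heads.items()}

x = rng.normal(size=(4, d_in))  # batch of 4 synthetic inputs
preds = forward(x)              # dict: task name -> (4, 1) probabilities
```

Because the encoder is updated by gradients from every task, the shared representation is pushed to capture features useful across tasks, which is the usual motivation for multitask learning on clinical data.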



Our code is available here: https://github.com/DoubleBlindGithub/M3

