A MULTI-MODAL AND MULTI-TASK BENCHMARK IN THE CLINICAL DOMAIN

Anonymous

Abstract

Healthcare represents one of the most promising application areas for machine learning algorithms, including modern methods based on deep learning. Modern deep learning algorithms perform best on large datasets and on unstructured modalities such as text or image data, and advances in deep learning have often been driven by the availability of such large datasets. Here, we introduce Multi-Modal Multi-Task MIMIC-III (M3), a dataset and benchmark for evaluating machine learning algorithms in the healthcare domain. This dataset contains multi-modal patient data collected from intensive care units, including physiological time series, clinical notes, ECG waveforms, and tabular inputs, and defines six clinical tasks, including predicting mortality, decompensation, and readmission, which serve as benchmarks for comparing algorithms. We introduce new multi-modal and multi-task models for this dataset, and show that they outperform previous state-of-the-art models that rely on only a subset of the tasks and modalities. This highlights the potential of multi-task and multi-modal learning to improve the performance of algorithms in the healthcare domain. More generally, we envision M3 as a general resource that will help accelerate research in applying machine learning to healthcare.

1. INTRODUCTION

Healthcare and medicine are some of the most promising areas in which machine learning algorithms can have an impact (Yu et al., 2018). Techniques relying on machine learning have found successful applications in dermatology, ophthalmology, and many other fields of medicine (Esteva et al., 2017; Gulshan et al., 2016; Hannun et al., 2019). Modern machine learning techniques, including algorithms based on deep learning, perform best on large datasets and on unstructured inputs, such as text, images, and other forms of raw signal data (You et al., 2016; Agrawal et al., 2016). Progress in modern machine learning has in large part been driven by the availability of these types of large datasets, as well as by competitive benchmarks on which algorithms are evaluated (Deng et al., 2009; Lin et al., 2014). Recently, machine learning algorithms that combine data from multiple domains and that are trained to simultaneously solve a large number of tasks have achieved performance gains in domains such as machine translation and drug discovery (Johnson et al., 2017; Ramsundar et al., 2015). Current research in this area is driven by widely adopted computational benchmarks, particularly in the field of natural language processing (Wang et al., 2018a; 2019).

In this paper, we argue that multi-modal and multi-task benchmarks can similarly drive progress in applications of machine learning to healthcare. In many healthcare settings, we have access to data coming from diverse modalities, including radiology images, clinical notes, and wearable sensor data, and we are solving many tasks, for example estimating disease risk, predicting readmission, and forecasting decompensation events. These kinds of settings are naturally suited to modern deep learning algorithms; developing models that effectively leverage diverse tasks and modalities has the potential to greatly improve the performance of machine learning algorithms in the clinical domain.
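The multi-modal, multi-task setting described above is commonly realized as a set of per-modality encoders whose outputs are fused into a shared representation, with one lightweight prediction head per clinical task. The following is a minimal numerical sketch of that pattern; all modality names, dimensions, and weight initializations are illustrative assumptions, not the models actually proposed in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, w):
    """A one-layer encoder with a ReLU nonlinearity (illustrative)."""
    return np.maximum(x @ w, 0.0)

# Hypothetical per-modality input dimensions.
dims = {"time_series": 76, "notes": 128, "tabular": 20}
hidden = 32

# One encoder weight matrix per modality.
enc_w = {m: rng.normal(size=(d, hidden)) * 0.1 for m, d in dims.items()}

# One linear head per task (e.g. mortality, decompensation, readmission),
# all reading off the same shared representation.
tasks = ["mortality", "decompensation", "readmission"]
head_w = {t: rng.normal(size=(hidden * len(dims), 1)) * 0.1 for t in tasks}

def forward(inputs):
    # Encode each modality separately, then fuse by concatenation.
    shared = np.concatenate([encoder(inputs[m], enc_w[m]) for m in dims], axis=1)
    # Each task head maps the shared representation to a probability.
    return {t: 1.0 / (1.0 + np.exp(-(shared @ head_w[t]))) for t in tasks}

# A batch of four patients with random features for each modality.
batch = {m: rng.normal(size=(4, d)) for m, d in dims.items()}
preds = forward(batch)
```

Because the encoders are shared across tasks, a gradient signal from any one task updates the representation used by all of them, which is the mechanism by which multi-task training can improve each individual task.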

