DISTANTLY SUPERVISED END-TO-END MEDICAL ENTITY EXTRACTION FROM ELECTRONIC HEALTH RECORDS WITH HUMAN-LEVEL QUALITY

Anonymous

Abstract

Medical entity extraction (EE) is a standard procedure used as a first stage in medical text processing. Usually medical EE is a two-step process: named entity recognition (NER) and named entity normalization (NEN). We propose a novel method of performing medical EE from electronic health records (EHR) as a single-step multi-label classification task by fine-tuning a transformer model pretrained on a large EHR dataset. Our model is trained end-to-end in a distantly supervised manner using targets automatically extracted from a medical knowledge base. We show that our model learns to generalize for entities that appear frequently enough, achieving human-level classification quality for the most frequent entities. Our work demonstrates that medical entity extraction can be done end-to-end without human supervision and with human quality, given a large enough amount of unlabeled EHR and a medical knowledge base.

1. INTRODUCTION

Wide adoption of electronic health records (EHR) in the medical care industry has led to the accumulation of large volumes of medical data (Pathak et al., 2013). This data contains information about symptoms, syndromes, diseases, lab results and patient treatments, and presents an important source of data for building various medical systems (Birkhead et al., 2015). Information extracted from medical records is used for clinical support systems (CSS) (Shao et al., 2016; Topaz et al., 2016; Zhang et al., 2014), lethality estimation (Jo et al., 2015; Luo & Rumshisky, 2016), drug side-effect discovery (LePendu et al., 2012; Li et al., 2014; Wang et al., 2009), selection of patients for clinical and epidemiological studies (Mathias et al., 2012; Kaelber et al., 2012; Manion et al., 2012), medical knowledge discovery (Hanauer et al., 2014; Jensen et al., 2012) and personalized medicine (Yu et al., 2019). Large volumes of medical text data and multiple applicable tasks determine the importance of accurate and efficient information extraction from EHR.

Information extraction from electronic health records is a difficult natural language processing task. EHR present a heterogeneous, dynamic combination of structured, semi-structured and unstructured texts. Such records contain patients' complaints, anamneses, demographic data, lab results, instrumental results, diagnoses, drugs, dosages, medical procedures and other information contained in medical records (Wilcox, 2015). Electronic health records are characterised by several linguistic phenomena making them harder to process:

• Rich special terminology, complex and volatile sentence structure.
• Often missing term parts and punctuation.
• Many abbreviations, special symbols and punctuation marks.
• Context-dependent terms and a large number of synonyms.
• Multi-word terms, fragmented and non-contiguous terms.
From a practical point of view, the task of medical information extraction splits into entity extraction and relation extraction. We focus on medical entity extraction in this work. In the case of medical texts such entities represent symptoms, diagnoses, drug names, etc. Entity extraction, also referred to as concept extraction, is the task of extracting from free text a list of concepts or entities present. Often this task is combined with finding the boundaries of extracted entities as an intermediate step. Medical entity extraction in practice divides into two sequential tasks: named entity recognition (NER) and named entity normalization (NEN). During NER, sequences of tokens that contain entities are selected from the original text. During NEN, each sequence is linked with specific concepts from a knowledge base (KB). We used the Unified Medical Language System (UMLS) KB (Bodenreider, 2004) as the source of medical entities in this paper. In this paper we make the following contributions. First, we show that a single transformer model (Devlin et al., 2018) is able to perform NER and NEN for electronic health records simultaneously by using the representation of the EHR for a single multi-label classification task. Second, we show that, provided a large enough number of examples, such a model can be trained using only automatically assigned labels from the KB to generalize to unseen and difficult cases. Finally, we empirically estimate the number of examples needed to achieve human-quality medical entity extraction in such a distantly supervised setup.
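The single-step formulation above can be sketched as follows: a transformer encoder (stubbed out here) pools an EHR text into one vector, and a linear head scores every KB concept independently with a sigmoid, so extraction becomes thresholded multi-label classification. This is a minimal illustrative sketch, not the paper's implementation; all sizes, names and the random "encoder output" are assumptions.

```python
import numpy as np

NUM_CONCEPTS = 5   # toy KB size; a real UMLS-derived label set is far larger
HIDDEN = 8         # toy hidden size standing in for the transformer width

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def extract_entities(h, W, b, threshold=0.5):
    """Return indices of KB concepts whose sigmoid score exceeds threshold."""
    scores = sigmoid(h @ W + b)  # one independent score per concept
    return [i for i, s in enumerate(scores) if s > threshold]

rng = np.random.default_rng(0)
h = rng.normal(size=HIDDEN)                   # stand-in for the pooled encoder output
W = rng.normal(size=(HIDDEN, NUM_CONCEPTS))   # classification head weights
b = np.zeros(NUM_CONCEPTS)
predicted = extract_entities(h, W, b)
print(predicted)
```

Because each concept is scored independently, a record can be assigned any subset of KB concepts, which is what lets one model replace the NER + NEN pipeline.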

2. RELATED WORK

The first systems for named entity extraction from medical texts combined NER and NEN using term vocabularies and heuristic rules. One of the first such systems was the Linguistic String Project - Medical Language Processor, described in Sager et al. (1986). Columbia University developed the Medical Language Extraction and Encoding System (MedLEE), using rule-based models at first and subsequently adding feature-based models (Friedman, 1997). Since 2000 the National Library of Medicine of the USA has developed the MetaMap system, based mainly on rule-based approaches (Aronson et al., 2000). Rule-based approaches depend heavily on the volume and completeness of dictionaries and the number of applied rules. These systems are also very brittle in the sense that their quality drops sharply when applied to texts from new subdomains or new institutions. Entity extraction in general falls into three broad categories: rule-based, feature-based and deep-learning (DL) based. Deep learning models consist of a context encoder and a tag decoder. The context encoder applies a DL model to produce a sequence of contextualized token representations used as input for the tag decoder, which assigns an entity class to each token in the sequence. For a comprehensive survey see Li et al. (2020). In most entity extraction systems the EE task is explicitly (or for some DL models implicitly) separated into NER and NEN tasks. Feature-based approaches solve the NER task as a sequence markup problem by applying feature-based models such as Hidden Markov Models (Okanohara et al., 2006) and Conditional Random Fields (Lu et al., 2015). The downside of such models is the requirement of extensive feature engineering. Another method for NER is to use DL models (Ma & Hovy, 2016; Lample et al., 2016). These models not only select text spans containing named entities but also extract quality entity representations which can be used as input for NEN.
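The encoder/decoder split described above can be illustrated with a toy sketch: a context encoder (stubbed here with random vectors) yields one representation per token, and a tag decoder assigns a BIO label to each token by softmax-free argmax over tag logits. Real systems use BiLSTM or transformer encoders and CRF decoders; all names and shapes below are illustrative assumptions.

```python
import numpy as np

TAGS = ["O", "B-ENT", "I-ENT"]  # minimal BIO tag set

def tag_decoder(token_reprs, W):
    """Assign the highest-scoring BIO tag to each token representation."""
    logits = token_reprs @ W  # shape: (num_tokens, num_tags)
    return [TAGS[i] for i in logits.argmax(axis=1)]

rng = np.random.default_rng(1)
tokens = ["patient", "reports", "chest", "pain"]
reprs = rng.normal(size=(len(tokens), 6))  # stand-in context-encoder outputs
W = rng.normal(size=(6, len(TAGS)))        # toy decoder weights
tags = tag_decoder(reprs, W)
print(list(zip(tokens, tags)))
```

A CRF decoder would replace the per-token argmax with a joint decoding over the whole tag sequence, which is what enforces valid BIO transitions.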
For example, in Ma & Hovy (2016) the authors combine a DL bidirectional long short-term memory network and conditional random fields. The main approaches for the NEN task are rule-based (D'Souza & Ng, 2015; Kang et al., 2013), feature-based (Xu et al., 2017a; Leaman et al., 2013) and DL methods (Li et al., 2017a; Luo et al., 2018b), and their different combinations (Luo et al., 2018a). Among DL approaches a popular way is to use distance metrics between entity representations (Ghiasvand & Kate, 2014) or ranking metrics (Xu et al., 2017a; Leaman et al., 2013). In addition to ranking tasks, DL models are used to create contextualized and more representative term embeddings. This is done with a wide range of models: Word2Vec (Mikolov et al., 2013), ELMo (Peters et al., 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2018). The majority of approaches combine several DL models to extract context-aware representations which are used for ranking or classification against a dictionary of reference entity representations (Ji et al., 2020). The majority of modern medical EE systems apply NER and NEN sequentially. Considering that NER and NEN models are themselves often multistage, full EE systems are often complex combinations of multiple ML and DL models. Such models are hard to train end-to-end, and if the NER task fails the whole system fails. This can be partially mitigated by simultaneous training of NER and NEN components. In Durrett & Klein (2014) a CRF model is used to train NER and NEN simultaneously. Le et al. (2015) proposed a model that merged NER and NEN at prediction
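The distance-based NEN approach mentioned above can be sketched briefly: each KB concept keeps a reference embedding, and a recognized mention is linked to the concept with the highest cosine similarity. This is a hedged illustration under assumed data; the concept IDs and all embeddings below are made up for the example, not real UMLS entries.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def link_mention(mention_vec, kb):
    """Return the KB concept ID whose reference embedding is closest to the mention."""
    return max(kb, key=lambda cid: cosine(mention_vec, kb[cid]))

# Illustrative concept IDs with toy reference embeddings (not real CUIs)
kb = {
    "C-CHEST-PAIN": np.array([1.0, 0.1, 0.0]),
    "C-HEADACHE":   np.array([0.0, 1.0, 0.2]),
}
mention = np.array([0.9, 0.2, 0.05])  # embedding of a recognized mention
print(link_mention(mention, kb))      # → C-CHEST-PAIN
```

Ranking-based NEN replaces the fixed similarity with a learned scoring function over (mention, candidate) pairs, but the candidate-selection structure is the same.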

