DISTANTLY SUPERVISED END-TO-END MEDICAL ENTITY EXTRACTION FROM ELECTRONIC HEALTH RECORDS WITH HUMAN-LEVEL QUALITY

Anonymous

Abstract

Medical entity extraction (EE) is a standard procedure used as a first stage in medical text processing. Usually, medical EE is a two-step process: named entity recognition (NER) and named entity normalization (NEN). We propose a novel method of doing medical EE from electronic health records (EHR) as a single-step multi-label classification task by fine-tuning a transformer model pretrained on a large EHR dataset. Our model is trained end-to-end in a distantly supervised manner using targets automatically extracted from a medical knowledge base. We show that our model learns to generalize for entities that are present frequently enough, achieving human-level classification quality for the most frequent entities. Our work demonstrates that medical entity extraction can be done end-to-end without human supervision and with human-level quality, given the availability of a large enough amount of unlabeled EHR and a medical knowledge base.

1. INTRODUCTION

Wide adoption of electronic health records (EHR) in the medical care industry has led to the accumulation of large volumes of medical data (Pathak et al., 2013). This data contains information about symptoms, syndromes, diseases, lab results and patient treatments, and is an important source of data for building various medical systems (Birkhead et al., 2015). Information extracted from medical records is used for clinical support systems (CSS) (Shao et al., 2016) (Topaz et al., 2016) (Zhang et al., 2014), lethality estimation (Jo et al., 2015) (Luo & Rumshisky, 2016), drug side-effects discovery (LePendu et al., 2012) (Li et al., 2014) (Wang et al., 2009), selection of patients for clinical and epidemiological studies (Mathias et al., 2012) (Kaelber et al., 2012) (Manion et al., 2012), medical knowledge discovery (Hanauer et al., 2014) (Jensen et al., 2012) and personalized medicine (Yu et al., 2019). Large volumes of medical text data and multiple applicable tasks determine the importance of accurate and efficient information extraction from EHR.

Information extraction from electronic health records is a difficult natural language processing task. EHR present a heterogeneous, dynamic combination of structured, semi-structured and unstructured texts. Such records contain patients' complaints, anamneses, demographic data, lab results, instrumental results, diagnoses, drugs, dosages, medical procedures and other information (Wilcox, 2015). Electronic health records are characterised by several linguistic phenomena that make them harder to process:

• Rich special terminology, complex and volatile sentence structure.
• Often missing term parts and punctuation.
• Many abbreviations, special symbols and punctuation marks.
• Context-dependent terms and a large number of synonyms.
• Multi-word terms, fragmented and non-contiguous terms.

From a practical point of view, the task of medical information extraction splits into entity extraction and relation extraction. We focus on medical entity extraction in this work. In the case of medical texts, such entities represent symptoms, diagnoses, drug names, etc.
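As a rough illustration of the framing summarized in the abstract (entity extraction as single-step multi-label classification, with targets derived automatically from a knowledge base), the sketch below builds a multi-hot target vector for a text fragment. Everything in it is an assumption made for illustration: the toy knowledge base `TOY_KB`, the entity identifiers and the helper `distant_labels` are hypothetical, and naive substring matching stands in for whatever target extraction procedure is actually used, which this section does not describe.

```python
from typing import Dict, List

# Hypothetical toy knowledge base: entity ID -> surface forms.
# These identifiers and terms are illustrative only, not taken from the paper.
TOY_KB: Dict[str, List[str]] = {
    "E_COUGH": ["cough", "coughing"],
    "E_FEVER": ["fever", "high temperature"],
    "E_HEADACHE": ["headache"],
}

# Fixed label order, so position i of every target vector refers to the same entity.
ENTITY_IDS: List[str] = sorted(TOY_KB)


def distant_labels(text: str) -> List[int]:
    """Return a multi-hot vector over ENTITY_IDS.

    Position i is 1 if any surface form of entity i occurs in the text
    (case-insensitive substring match, an assumed stand-in for real matching).
    """
    lowered = text.lower()
    return [
        int(any(term in lowered for term in TOY_KB[entity_id]))
        for entity_id in ENTITY_IDS
    ]


if __name__ == "__main__":
    example = "Patient reports dry cough and high temperature."
    print(dict(zip(ENTITY_IDS, distant_labels(example))))
```

Vectors of this kind could serve as training targets for a classifier with one independent output per entity. Because no human annotation enters the labelling step, the supervision is distant rather than manual.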

