Exploring semantic information in disease: Simple Data Augmentation Techniques for Chinese Disease Normalization

Abstract

Disease is a core concept in the medical field, and the task of normalizing disease names underlies all disease-related tasks. However, due to the multi-axis and multi-grain nature of disease names, general text data augmentation techniques often inject incorrect information and hurt performance. To address this problem, we propose a set of data augmentation techniques that work together as an augmented training task for disease normalization. Our data augmentation methods are based on both the clinical disease corpus and the standard disease corpus derived from ICD-10 coding. Extensive experiments demonstrate the effectiveness of the proposed methods: they yield up to a 3% performance gain over non-augmented counterparts, and they work even better on smaller datasets.

1. Introduction

Disease is a central concept in medical text processing. One of the most important disease-related tasks, disease normalization, takes diseases as both input and output: it matches the diagnosis terms used in clinical documents to the standard names in ICD coding. The task mainly faces the following three challenges.

First, diverse writing styles. Disease names can be written in many ways, and different doctors have different writing habits, so a single disease may appear under thousands of name variants.

Second, data scarcity. Some diseases are not covered in the training set, which often leads to few-shot or zero-shot scenarios. For example, in the Chinese disease normalization dataset CHIP-CDN, there are 40,472 diseases to classify, but the training set provides data for only 3,505 of them (i.e. less than 10% of all diseases). Figure 1 illustrates the data scarcity problem in the CHIP-CDN dataset.

Third, semantic density. Disease names are usually short, so every character carries substantial semantic information. Two diseases can have very different meanings even if they share many common characters, and a single changed character can alter the semantics dramatically. For instance, "髂总动脉夹层 (Common iliac artery dissection)" and "颈总动脉夹层 (Common carotid artery dissection)" differ in only one character, yet the locations of these diseases are very distinct, one in the lower half of the body and the other in the upper half.

Among the challenges discussed above, data scarcity is the biggest, since the other problems can usually be mitigated by providing larger datasets for models to learn from. A common way to address data scarcity is data augmentation, and numerous augmentation methods exist for general corpora, such as synonym replacement and back translation.
Wei & Zou (2019) have shown that simple text data augmentation methods can be effective for text classification problems. However, because of the unique structure of disease names (i.e. their semantic density), general text data augmentation methods do not work well on them, and sometimes even hurt overall performance. For example, applying random deletion (Wei & Zou, 2019) to the disease "阻塞性睡眠呼吸暂停 (Obstructive Sleep Apnoea)" may yield "阻塞性睡眠 (Obstructive Sleep)", which dramatically changes the meaning of the name and turns it into a different disease. Admittedly, general data augmentation methods may be able to address the challenge of diverse writing styles, since performing random operations on text can be seen as emulating different writing behaviors. However, for the reason above, general data augmentation methods tend to hurt performance, as our experiments demonstrate. Designing data augmentation methods specific to the disease corpus is therefore necessary, and we propose a set of disease-oriented data augmentation methods to bridge this gap.

Like other disease-related tasks, disease normalization can be thought of as a text matching process, from clinical names to standard names in ICD coding. The key to this task is for the model to learn encodings that capture enough information to judge disease similarity. For instance, the model needs to tell that "左肾发育不全 (Left renal agenesis)" and "先天性肾发育不全 (Congenital renal agenesis)" are the same disease while "髂总动脉夹层 (Common iliac artery dissection)" and "颈总动脉夹层 (Common carotid artery dissection)" are not, even though both pairs share many common characters.

Our methods are based on the following two assumptions. First, disease names have the property of structural invariance. A disease name consists of several types of key elements, such as location, clinical manifestation, etiology, and pathology.
In a pair of a clinical disease name and a standard ICD disease name, these elements correspond to each other in most cases. Therefore, we can replace a specific element in the clinical name and the standard ICD name at the same time to generate a new pair, and the matching relationship of the newly generated pair is still maintained. We screen the generated standard ICD names to ensure that they belong to correct labels and that the generated pairs are effective. Note that replacing components may produce a clinical disease name that turns out to be fake (i.e. the disease does not actually exist), but the key point is to make models learn the necessary semantic associations within disease names.

Second, labels in the disease normalization task have a transitivity property. Specifically, a more specific description of an object can be subsumed into a larger group with a coarser description; e.g., a yellow chair is also a chair. The ICD coding system likewise defines clear granularities of diseases. Therefore, we can treat a fine-grained disease as its coarse-grained ancestor by additionally assigning it the parent label.

Normally, a data augmentation method generates new data that are trained along with the existing data, without altering the training paradigm. However, the disease normalization task assigns each disease a unique label, while our methods augment the labels. If the traditional training paradigm were still applied to our augmentation methods, the same input disease could receive different labels, which would make the model difficult to train due to label confusion.
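The element-replacement idea above can be sketched in code. The following is a minimal illustrative sketch, not the authors' implementation: it assumes disease names have already been segmented into typed elements (location, etiology, pathology, etc.) by some upstream tool, and the sample segmentation, element types, and the tiny ICD vocabulary are all hypothetical.

```python
# Illustrative sketch of simultaneous element replacement in a
# (clinical name, standard ICD name) pair, with the screening step
# that discards pairs whose generated standard name is not a real
# ICD entry. Element segmentation is assumed to be done upstream.

def render(elements):
    """Join typed elements back into a disease-name string."""
    return "".join(text for _, text in elements)

def swap_element(pair, elem_type, new_text, icd_vocab):
    """Replace every element of `elem_type` in both names of the pair.

    pair: (clinical_elements, standard_elements), each a list of
          (element_type, text) tuples.
    Returns the new (clinical_name, standard_name) pair, or None if
    the generated standard name is not a valid ICD term (screening).
    """
    new_pair = []
    for elements in pair:
        new_pair.append([(t, new_text if t == elem_type else text)
                         for t, text in elements])
    clinical_name = render(new_pair[0])
    standard_name = render(new_pair[1])
    if standard_name not in icd_vocab:
        return None  # screened out: label would not be a real ICD entry
    return clinical_name, standard_name

# Toy example with a hypothetical segmentation and a two-entry vocabulary.
clinical = [("loc", "左肾"), ("path", "发育不全")]
standard = [("etio", "先天性"), ("loc", "肾"), ("path", "发育不全")]
icd_vocab = {"先天性肾发育不全", "先天性肺发育不全"}

print(swap_element((clinical, standard), "loc", "肺", icd_vocab))
# -> ('肺发育不全', '先天性肺发育不全')
```

Swapping the location element with one whose generated standard name is absent from the vocabulary (e.g. "心" here) returns None, which corresponds to the screening step described above.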
To overcome this problem, we treat the data augmentation operation as a pre-training task (which we call augmented training) prior to the original task, so that the model first learns the necessary semantic information within disease names and then leverages that information when fine-tuning on the actual normalization dataset.
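The label-transitivity assumption can likewise be sketched as a simple augmentation over ICD-10 codes, where the parent category of a four-character subcategory is obtained by dropping the part after the dot. This is an illustrative sketch under that standard ICD-10 code structure; the sample entries below are hypothetical, not drawn from CHIP-CDN.

```python
# Illustrative sketch of the transitivity augmentation: each
# (clinical name, fine-grained code) pair also yields a pair with the
# coarse-grained parent label, when that parent exists in the coding table.

def parent_code(icd_code):
    """Coarsen an ICD-10 code by stripping the subcategory after the dot,
    e.g. 'Q60.5' -> 'Q60'."""
    return icd_code.split(".")[0]

def augment_with_parents(pairs, code_to_name):
    """Augment (name, code) pairs with parent labels where available."""
    augmented = list(pairs)
    for name, code in pairs:
        parent = parent_code(code)
        if parent != code and parent in code_to_name:
            augmented.append((name, parent))
    return augmented

# Hypothetical coding table and training pair for illustration only.
code_to_name = {
    "Q60": "肾发育不全和其他肾退化缺陷",
    "Q60.5": "先天性肾发育不全",
}
pairs = [("左肾发育不全", "Q60.5")]

print(augment_with_parents(pairs, code_to_name))
# -> [('左肾发育不全', 'Q60.5'), ('左肾发育不全', 'Q60')]
```

Pairs generated this way would then be used in the augmented-training stage described above, before fine-tuning on the original normalization labels, so that the coarse parent labels never conflict with the unique labels of the actual task.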



Figure 1: Data scarcity problem in the CHIP-CDN dataset. The blue line represents the overall number of diseases in ICD coding, grouped by the first coding letter, and the red line represents the number of diseases provided by the CHIP-CDN training set.

