Exploring semantic information in disease: Simple Data Augmentation Techniques for Chinese Disease Normalization

Abstract

Disease is a core concept in the medical field, and normalizing disease names underpins all disease-related tasks. However, due to the multi-axis and multi-grained nature of disease names, general text data augmentation techniques often inject incorrect information and harm performance. To address this problem, we propose a set of data augmentation techniques that work together as an augmented training task for disease normalization. Our data augmentation methods are based on both a clinical disease corpus and a standard disease corpus derived from ICD-10 coding. Extensive experiments demonstrate the effectiveness of the proposed methods: they achieve up to a 3% performance gain over non-augmented counterparts, and they work even better on smaller datasets.

1. Introduction

Disease is a central concept in medical text processing. One of the most important tasks, disease normalization, takes disease names as both input and output: it matches the diagnosis terms used in clinical documents to standard names in ICD coding. The task faces three main challenges. First, diverse writing styles: different doctors have different writing habits, so a single disease may appear under thousands of name variants. Second, data scarcity: some diseases may not be covered in the training set, which often leads to few-shot or zero-shot scenarios. For example, in the Chinese disease normalization dataset CHIP-CDN, there are 40,472 diseases to classify, but the training set provides data for only 3,505 of them (i.e., less than 10% of all diseases). Figure 1 illustrates the data scarcity problem in the CHIP-CDN dataset. Third, semantic density: disease names are usually short, so every character carries substantial semantic information. Two disease names can have very different meanings even when they share many common characters, and changing a single character can dramatically alter the meaning. For instance, "髂总动脉夹层 (Common iliac artery dissection)" and "颈总动脉夹层 (Common carotid artery dissection)" differ in only one character, yet the anatomical locations of these diseases are entirely distinct, one in the lower half of the body and the other in the upper half.

Among the challenges discussed above, data scarcity is the biggest one, since the other problems can usually be mitigated by providing larger datasets for models to learn from. A common way to address data scarcity is data augmentation, and numerous methods exist for general corpora, such as synonym replacement and back translation.
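To make the token-level augmentation discussed here concrete, the following is a minimal sketch of EDA-style random deletion in the spirit of Wei & Zou (2019), assuming character-level tokenization for Chinese; it illustrates the general technique, not the augmentation methods proposed in this paper.

```python
import random

def random_deletion(tokens, p=0.2, seed=None):
    """EDA-style random deletion: drop each token independently with probability p.

    For Chinese disease names, each character is treated as one token
    (an assumption made for this illustration).
    """
    rng = random.Random(seed)
    if len(tokens) == 1:
        return tokens  # never delete the only token
    kept = [t for t in tokens if rng.random() > p]
    # Keep at least one token: if everything was dropped, keep a random one
    return kept if kept else [rng.choice(tokens)]

# Applied to a disease name, this can delete clinically critical characters
# and silently turn the name into a different (or invalid) disease.
name = list("阻塞性睡眠呼吸暂停")  # "Obstructive Sleep Apnoea", char-tokenized
print("".join(random_deletion(name, p=0.4, seed=0)))
```

Because every character of a disease name is semantically loaded, even a single deletion by this procedure can change the disease identity, which is exactly the failure mode described below.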
Wei & Zou (2019) showed that simple text data augmentation methods can be effective for text classification problems. However, because of the unique structure of disease names (i.e., semantic density), general text data augmentation methods do not work well on them, and sometimes even hurt overall performance. For example, if random deletion (Wei & Zou, 2019) is performed on the disease "阻塞性睡眠呼吸暂停 (Obstructive Sleep Apnoea)" and results in "阻塞性睡眠 (Obstructive Sleep)", the meaning of the disease name changes dramatically and it becomes another disease. Admittedly, general

