DISCRIMINATIVE CROSS-MODAL DATA AUGMENTATION FOR MEDICAL IMAGING APPLICATIONS

Abstract

While deep learning methods have shown great success in medical image analysis, they require a large number of medical images for training. Due to data privacy concerns and the scarcity of medical annotators, it is often difficult to obtain enough labeled medical images for model training. In this paper, we study cross-modality data augmentation to mitigate the data-deficiency issue in the medical imaging domain. We propose a discriminative unpaired image-to-image translation model that translates images in a source modality into images in a target modality, where the translation task is conducted jointly with the downstream prediction task and is guided by the prediction. Experiments on two applications demonstrate the effectiveness of our method.

1. INTRODUCTION

Developing deep learning methods to analyze medical images for decision-making has attracted much research interest in the past few years. Promising results have been achieved in using medical images for skin cancer diagnosis (Esteva et al., 2017; Tschandl et al., 2019), chest disease identification (Jaiswal et al., 2019), and diabetic eye disease detection (Cheung et al., 2019), to name a few. It is well known that deep learning methods are data-hungry: deep learning models typically contain tens of millions of weight parameters, and effectively training such large models requires a large number of labeled training images. However, in the medical domain, it is very difficult to collect labeled training images for many reasons, including privacy barriers and the unavailability of doctors for annotating disease labels. To address the deficiency of medical images, many approaches (Krizhevsky et al., 2012; Cubuk et al., 2018; Takahashi et al., 2019; Zhong et al., 2017; Perez & Wang, 2017) have been proposed for data augmentation. These approaches create synthetic images based on the original images and use the synthetic images as additional training data. The most commonly used data augmentation operations include cropping, flipping, rotation, translation, and scaling. Augmented images created by these methods are often very similar to the original images; for example, a cropped image is part of the original image. In clinical practice, due to the large disparity among patients, the medical image of a new patient (at test time) is often very different from the images of patients used for model training. If the augmented images are very close to the original images, they are of limited use in improving the model's ability to generalize to unseen patients. It is therefore important to create diverse augmented images that are non-redundant with the original images.
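To make the limitation of classic augmentation concrete, the sketch below implements three of the operations named above (horizontal flip, vertical flip, random crop) on a 2-D image array. The function name `simple_augmentations` and the 3/4 crop size are illustrative choices, not part of the paper; the point is that every output is a small geometric perturbation of the input, so the augmented images stay close to the original.

```python
import numpy as np

def simple_augmentations(image, rng):
    """Classic geometric augmentations: each output is derived directly
    from the input image, so augmented samples remain highly redundant
    with the original (the limitation discussed in the text)."""
    h, w = image.shape
    ch, cw = (3 * h) // 4, (3 * w) // 4           # 3/4-size random crop
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return {
        "hflip": image[:, ::-1],                  # mirror left-right
        "vflip": image[::-1, :],                  # mirror top-bottom
        "crop": image[top:top + ch, left:left + cw],
    }

rng = np.random.default_rng(0)
img = np.arange(64).reshape(8, 8)
variants = simple_augmentations(img, rng)
```

Every variant here shares most of its pixels with `img`, which is precisely why such augmentations add little clinical diversity across patients.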
To create non-redundant augmented images for one modality such as CT, one possible solution is to leverage images from other modalities such as X-ray, MRI, and PET. In clinical practice, many different types of imaging techniques are applied to diagnose and treat the same disease. For example, to diagnose lung cancer, doctors can use chest X-rays, CT scans, and MRI scans, among others. As a result, different modalities of medical images are accumulated for the same disease. When training a deep learning model on a modality of interest (denoted by X), if the number of original images in this modality is small, we may convert images from other modalities into the target modality X and use the converted images as additional training data. For example, when a hospital would like to train a deep learning model for CT-based lung cancer diagnosis, it can collect MRI, X-ray, and PET images of lung cancer and use them to augment the CT training dataset. Images of different modalities are typically from different patients, so their clinical diversity is large. This motivates us to study cross-modality data augmentation to address the data-deficiency issue in training deep learning models for medical-image-based clinical decision-making. The problem setup is as follows. The goal is to train a deep learning model to diagnose disease D based on one modality (denoted by X) of medical images. However, the number of images in this modality is limited and insufficient to train effective models. Meanwhile, there is another modality (denoted by Y) of medical images used for diagnosing disease D. Cross-modality data augmentation refers to leveraging the images in Y to augment the training set in X. Specifically, we translate images in Y into images that have a similar style to those in X and add the translated images, together with their disease labels, to the training dataset in X.
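The problem setup above can be sketched as a small data-assembly step. The function name `cross_modal_augment` and the `translate` callable are illustrative stand-ins (in the paper, `translate` is the learned translator); the sketch only shows how translated Y-images and their original disease labels are appended to the X training set.

```python
import numpy as np

def cross_modal_augment(x_images, x_labels, y_images, y_labels, translate):
    """Cross-modality augmentation: map each source-modality image (Y)
    into target-modality style via `translate`, then append the
    translated images, with their original disease labels, to the
    target-modality (X) training set."""
    translated = np.stack([translate(img) for img in y_images])
    images = np.concatenate([x_images, translated], axis=0)
    labels = np.concatenate([x_labels, y_labels], axis=0)
    return images, labels

# Toy usage with an identity "translator" as a placeholder.
x_imgs, x_lbls = np.zeros((5, 4, 4)), np.zeros(5)
y_imgs, y_lbls = np.ones((3, 4, 4)), np.ones(3)
aug_imgs, aug_lbls = cross_modal_augment(
    x_imgs, x_lbls, y_imgs, y_lbls, translate=lambda im: im)
```

Note that the labels of the translated images are carried over unchanged: translation alters the imaging style, not the diagnosis.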
Compared with simple augmentations such as cropping, scaling, and rotation, cross-modality data augmentation can bring in more diversity, since images in different modalities come from different patients and hence are clinically more heterogeneous. More diverse augmented images are more valuable for improving the model's generalization to unseen patients. To perform cross-modality data augmentation, we propose a discriminative unpaired image-to-image translation (DUIIT) method. Given images in a source modality, we translate them into images in a target modality. The translated images, together with their associated disease labels, are added to the training set in the target modality to train the predictive model. Different from unsupervised translation methods such as CycleGAN (Zhu et al., 2017a), which perform translation between images without considering their disease labels, our method conducts the translation in a discriminative way, where the translation is guided by the predictive task. Our model performs cross-modality image translation and predictive modeling simultaneously, enabling the two tasks to mutually benefit each other. The translated images are designed not only to have a similar style to those in the target modality, but also to be useful for training the predictive model. Our model consists of two modules: a translator and a predictor. The translator transforms images in the source modality into synthesized images in the target modality. The translated images (together with their class labels) are then combined with the images in the target modality to train the predictor, which takes an image as input and predicts its class label. The two modules are trained jointly so that the translator learns to generate target images that are effective for training the predictor. We apply our method to two medical imaging applications. In the first application, the source modality is MRI and the target modality is CT.
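The joint training of the two modules can be sketched in PyTorch. The tiny `translator` and `predictor` below are hypothetical stand-ins (the paper's actual architectures, e.g., a GAN-style generator and a CNN classifier, are not reproduced here), and the adversarial style-matching loss of the full model is omitted; the sketch only shows how the prediction loss on translated images back-propagates into the translator, so that translation is guided by the predictive task.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical minimal modules on 1-channel 16x16 images.
translator = nn.Conv2d(1, 1, kernel_size=3, padding=1)       # source -> target style
predictor = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16, 2))  # binary classifier
ce = nn.CrossEntropyLoss()

def joint_step(x_imgs, x_lbls, y_imgs, y_lbls, lam=1.0):
    """One joint step: the predictor is trained on real target-modality
    images AND on translated source-modality images; the loss on the
    translated images also flows back into the translator, guiding it
    to produce images useful for prediction.  The full model would add
    an adversarial loss constraining translated images to match the
    target-modality style (omitted in this sketch)."""
    translated = translator(y_imgs)                # synthesized target images
    pred_loss = ce(predictor(x_imgs), x_lbls) \
        + lam * ce(predictor(translated), y_lbls)
    pred_loss.backward()                           # gradients reach both modules
    return pred_loss.item()

x_imgs = torch.randn(4, 1, 16, 16); x_lbls = torch.randint(0, 2, (4,))
y_imgs = torch.randn(3, 1, 16, 16); y_lbls = torch.randint(0, 2, (3,))
loss = joint_step(x_imgs, x_lbls, y_imgs, y_lbls)
```

After `backward()`, both `translator` and `predictor` hold gradients, which is what enables the two modules to be updated jointly rather than in separate stages.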
Our method translates MRI images into CTs in a discriminative way and uses the combination of original CTs and translated CTs to train the predictive model, which achieves substantially lower prediction error compared with using original CTs only. In the second application, the source modality is PET and the target modality is CT. By translating PET images to CTs, our method significantly improves prediction performance. The major contributions of this paper are:
• We propose a discriminative unpaired image-to-image translation method that translates medical images from a source modality to a target modality to augment the training data in the target modality. The translation is guided by the predictive task.
• On two applications, we demonstrate the effectiveness of our method.
The rest of the paper is organized as follows. Section 2 reviews related work. Sections 3 and 4 present the method and experiments. Section 5 concludes the paper.



2. RELATED WORK

Radford et al., 2015) to generate synthetic medical images, which are used for data augmentation and lesion classification. Jin et al. (2018) develop a 3D GAN to learn lung nodule properties in 3D space. The 3D GAN is conditioned on a volume of interest whose central part, containing nodules, has been erased. GANs have also been used to generate segmentation maps (Guibas et al., 2017), where a two-stage pipeline is applied. Mok & Chung (2018) apply conditional GANs (Mirza & Osindero, 2014; Odena et al., 2017) to synthesize brain MRI images in a coarse-to-fine manner for brain tumor segmentation. To preserve fine-grained details of the tumor core, they encourage the generator to delineate tumor boundaries. Cross-modality translation has

