NETWORK-AGNOSTIC KNOWLEDGE TRANSFER FOR MEDICAL IMAGE SEGMENTATION

Abstract

Conventional transfer learning leverages the weights of pre-trained networks but requires similar neural architectures. Alternatively, knowledge distillation can transfer knowledge between heterogeneous networks, but it often requires access to the original training data or additional generative networks. Knowledge transfer between networks can be improved by making it agnostic to the choice of network architecture and reducing its dependence on the original training data. We propose a knowledge transfer approach from a teacher to a student network in which the student is trained on an independent transferal dataset whose annotations are generated by the teacher. Experiments were conducted on five state-of-the-art networks for semantic segmentation and seven datasets across three imaging modalities. We studied knowledge transfer from a single teacher, the combination of knowledge transfer and fine-tuning, and knowledge transfer from multiple teachers. The student model with a single teacher achieved performance similar to that of the teacher, and the student model with multiple teachers achieved better performance than the teachers. The salient features of our algorithm include: 1) no need for the original training data or generative networks, 2) knowledge transfer between different architectures, 3) ease of implementation for downstream tasks by using the downstream task dataset as the transferal dataset, and 4) knowledge transfer of an ensemble of independently trained models into one student model. Extensive experiments demonstrate that the proposed algorithm is effective for knowledge transfer and easy to tune.

1. INTRODUCTION

Deep learning often requires a sufficiently large training dataset, which is expensive to build and not easy to share between users. For example, a major challenge in semantic segmentation of medical images is the limited availability of annotated data (Litjens et al., 2017). Due to ethical concerns and confidentiality constraints, medical datasets are often not released along with the trained networks. This highlights the need for knowledge transfer between neural networks that does not require access to the original training dataset. At the same time, because deep neural networks are effectively black boxes, transferring knowledge between heterogeneous networks is difficult. To address these limitations, several algorithms have been proposed to reuse or share the knowledge of neural networks, such as network weight transfer (Tan et al., 2018), knowledge distillation (Hinton et al., 2015), federated learning (Yang et al., 2019), and self-training (Xie et al., 2020b). Some conventional algorithms directly transfer the weights of standard large models trained on natural image datasets to different tasks (Kang & Gwak, 2019; Motamed et al., 2019; Jodeiri et al., 2019; Raghu et al., 2019). For example, Iglovikov & Shvets (2018) adopted VGG11 pre-trained on ImageNet as the encoder of U-Net for 2D image segmentation. Similarly, the convolutional 3D network (Tran et al., 2015), pre-trained on natural video datasets, was used as the encoder of 3D U-Net for 3D MR (Magnetic Resonance) medical image segmentation (Zeng et al., 2017). Transferring network weights generally requires adjusting the architecture of the receiving model, which in turn limits the flexibility of the receiving network. Another technique involving knowledge transfer is federated learning (Yang et al., 2019); it has received attention for its ability to train a large-scale model in a decentralized manner without requiring users' data.
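The architectural-compatibility requirement of weight transfer can be illustrated with a minimal sketch: weights can only be copied between models where parameter names and shapes line up. Plain dictionaries of NumPy arrays stand in for real network state here, and all layer names are hypothetical, not taken from the cited works.

```python
import numpy as np

# Minimal sketch: weight transfer succeeds only for parameters whose
# names and shapes match between source and target. The state dicts
# and layer names below are illustrative.

def transfer_matching_weights(source_state, target_state):
    """Copy matching parameters into target_state; return the names copied."""
    copied = []
    for name, weight in source_state.items():
        if name in target_state and target_state[name].shape == weight.shape:
            target_state[name] = weight.copy()
            copied.append(name)
    return copied

# A pre-trained "encoder" and a receiver whose decoder differs in shape.
pretrained = {"enc.conv1": np.ones((8, 3)), "dec.conv1": np.ones((4, 8))}
receiver = {"enc.conv1": np.zeros((8, 3)), "dec.conv1": np.zeros((6, 8))}

copied = transfer_matching_weights(pretrained, receiver)
# Only the compatible encoder layer transfers; the mismatched decoder
# layer keeps its own weights and must be trained from scratch.
```

This is why, as noted above, the receiving model's architecture typically has to be adjusted to match the donor's before any weights can be reused.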
In general, federated learning approaches adopt a central model to capture the shared knowledge of all users by aggregating their gradients. Due to the difficulty of transferring knowledge between heterogeneous networks, federated learning often requires all devices, including both the central servers and local users, to use the same neural architecture (Xie et al., 2020a). To the best of our knowledge, there has been no federated learning system that uses heterogeneous networks. Knowledge distillation is the process of transferring the knowledge of a large neural network or an ensemble of neural networks (teacher) to a smaller network (student) (Hinton et al., 2015). Given a set of trained teacher models, one feeds training data to them and uses their predictions instead of the true labels to train the student model. For effective knowledge transfer, however, it is essential that a reasonable fraction of the training examples be observable by the student (Li et al., 2018) or that the metadata at each layer be provided (Lopes et al., 2017). Yoo et al. (2019) used a generative network to extract the knowledge of a teacher network, generating labeled artificial images to train another network; their method, however, had to train an additional generative network for each teacher network. Different from knowledge distillation, self-training aims to transfer knowledge to a more capable model. The self-training framework (Scudder, 1965) has three main steps: train a teacher model on labeled images; use the teacher to generate pseudo labels for unlabeled images; and train a student model on the combination of labeled and pseudo-labeled images. Xie et al. (2020b) proposed self-training with a noisy student for classification, which iterates this process several times by treating the student as a teacher to relabel the unlabeled data and training a new student.
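The three steps of the self-training framework can be sketched with a toy one-dimensional example, in which a scalar threshold classifier stands in for a real segmentation network; all functions and data here are illustrative.

```python
import numpy as np

# Toy sketch of the three-step self-training loop (Scudder, 1965).
# A 1-D threshold classifier stands in for a neural network.

def train(features, labels):
    """Fit a scalar threshold at the midpoint of the two class means."""
    return (features[labels == 0].mean() + features[labels == 1].mean()) / 2

def predict(threshold, features):
    return (features > threshold).astype(int)

rng = np.random.default_rng(0)
labeled_x = np.concatenate([rng.normal(0, 1, 50), rng.normal(4, 1, 50)])
labeled_y = np.array([0] * 50 + [1] * 50)
unlabeled_x = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])

# Step 1: train a teacher model on the labeled data.
teacher = train(labeled_x, labeled_y)
# Step 2: use the teacher to generate pseudo labels for unlabeled data.
pseudo_y = predict(teacher, unlabeled_x)
# Step 3: train a student on labeled + pseudo-labeled data combined.
student = train(np.concatenate([labeled_x, unlabeled_x]),
                np.concatenate([labeled_y, pseudo_y]))
```

In the noisy-student variant of Xie et al. (2020b), steps 2 and 3 would be repeated, with the trained student taking the teacher's place in the next iteration.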
These studies required labeled images for training the student; moreover, they implicitly required that the pseudo-labeled images be similar in content to the original labeled images. Inspired by self-training, we propose a network-agnostic knowledge transfer algorithm for medical image segmentation. The algorithm transfers the knowledge of a teacher model to a student model by training the student on a transferal dataset whose annotations are generated by the teacher. The algorithm has the following characteristics: 1) the transferal dataset requires no manual annotation and is independent of the teacher-training dataset; 2) the student does not need to inherit the weights of the teacher, and as such, knowledge transfer can be conducted between heterogeneous neural architectures; 3) the algorithm is straightforward to combine with fine-tuning for downstream tasks, especially by using the downstream task dataset as the transferal dataset; 4) the algorithm can transfer the knowledge of an ensemble of independently trained models into one model. We conducted extensive experiments on semantic segmentation using five state-of-the-art neural networks and seven datasets. The networks are DeepLabv3+ (Chen et al., 2018), U-Net (Ronneberger et al., 2015), AttU-Net (Oktay et al., 2018), SDU-Net (Wang et al., 2020), and Panoptic-FPN (Kirillov et al., 2019). Of the seven datasets, four public datasets involve breast lesions, nerve structures, skin lesions, and natural image objects, and three internal/in-house datasets involve breast lesions (a single dataset with two splits) and thyroid nodules. Experiments showed that the proposed algorithm performed well for knowledge transfer in semantic image segmentation.

2. ALGORITHM

The main goal of the proposed algorithm is to transfer the knowledge of one or more neural networks to an independent network without access to the original training datasets. This section presents the proposed knowledge transfer algorithm for semantic segmentation in Algorithm 1 and its application with fine-tuning for a downstream task in Algorithm 2 (A.4). The knowledge transfer algorithm first employs one or more teacher models to generate pseudo masks for the transferal dataset and then trains the student model on the pseudo-annotated transferal dataset. For an image x in the transferal dataset D, we obtain the pseudo mask as the weighted average of the teacher models' outputs: y = Σ_{T_i ∈ T} w_i · T_i(x), where w_i is the weight for model T_i. The output of a model T_i(x) is either a soft mask (pixel values ranging from 0 to 1) or a binary mask (pixel values of 0 or 1). Since the ensemble of teacher models is not our primary focus, we simply set equal weights for all teacher models. We adopt two constraints to exclude images without salient targets, as shown in Figure 1. The first constraint is on the number of target pixels in the pseudo mask, where a target pixel is a pixel with a value above 0.5. We exclude an image (of size 384×384 pixels) if the number of target pixels is less than a threshold of 256. The second constraint is on the gray-level entropy of the pseudo
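The pseudo-mask step and the two exclusion constraints can be sketched as follows. The teacher models are stand-in callables returning soft masks in [0, 1]; the 256-pixel threshold follows the text, but `entropy_min` is an assumed placeholder value, and applying the entropy test to the pseudo mask is also an assumption, since this excerpt truncates mid-sentence before the second constraint is fully specified.

```python
import numpy as np

# Sketch of Algorithm 1's pseudo-mask generation and filtering.
# y = sum_i w_i * T_i(x) with equal weights, then two exclusion
# constraints. `entropy_min` is an illustrative value, not from the text.

def ensemble_pseudo_mask(teachers, weights, image):
    """Weighted average of teacher outputs: y = sum_i w_i * T_i(x)."""
    return sum(w * t(image) for w, t in zip(weights, teachers))

def gray_level_entropy(mask, bins=256):
    """Shannon entropy (bits) of the mask's gray-level histogram."""
    hist, _ = np.histogram(mask, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def keep_image(pseudo_mask, min_target_pixels=256, entropy_min=1.0):
    """Keep an image only if its pseudo mask passes both constraints."""
    enough_targets = (pseudo_mask > 0.5).sum() >= min_target_pixels
    informative = gray_level_entropy(pseudo_mask) >= entropy_min
    return bool(enough_targets and informative)

# Two stand-in teachers with equal weights on a 384x384 image.
rng = np.random.default_rng(0)
image = rng.random((384, 384))
teachers = [lambda x: np.clip(x + 0.1, 0, 1),
            lambda x: np.clip(x - 0.1, 0, 1)]
mask = ensemble_pseudo_mask(teachers, [0.5, 0.5], image)
```

An all-zero pseudo mask fails both constraints, so the corresponding image would be excluded from the transferal dataset before student training.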

