XMIXUP: EFFICIENT TRANSFER LEARNING WITH AUXILIARY SAMPLES BY CROSS-DOMAIN MIXUP

Abstract

Transferring knowledge from large source datasets is an effective way to fine-tune deep neural networks for a target task with a small sample size. A great number of algorithms have been proposed to facilitate deep transfer learning, and these techniques can be generally categorized into two groups: Regularized Learning of the target task using models pre-trained on source datasets, and Multitask Learning with both source and target datasets to train a shared backbone neural network. In this work, we aim to improve the multitask paradigm for deep transfer learning via Cross-domain Mixup (XMixup). While existing multitask learning algorithms need to run backpropagation over both the source and target datasets and usually incur a higher gradient complexity, XMixup transfers knowledge from source to target tasks more efficiently: for every class of the target task, XMixup selects auxiliary samples from the source dataset and augments training samples via the simple mixup strategy. We evaluate XMixup on six real-world transfer learning datasets. Experimental results show that XMixup improves accuracy by 1.9% on average. Compared with other state-of-the-art transfer learning approaches, XMixup costs much less training time while still obtaining higher accuracy.

1. INTRODUCTION

Performance of deep learning algorithms in real-world applications is often limited by the size of the training dataset. Training a deep neural network (DNN) with a small number of training samples usually leads to over-fitting and poor generalization. A common yet effective solution is to train DNN models under transfer learning settings Pan et al. (2010) using large source datasets. Knowledge transferred from the source domain helps DNNs learn better features and achieve higher generalization performance for pattern recognition in the target domain Donahue et al. (2014); Yim et al. (2017).

Background. For example, the paradigm of Donahue et al. (2014) first trains a DNN model using a large (and possibly irrelevant) source dataset (e.g., ImageNet), then uses the weights of the pre-trained model as the starting point of optimization and fine-tunes the model on the target dataset. In this way, blessed by the power of large source datasets, the fine-tuned model is usually capable of handling the target task with better generalization performance. Furthermore, Yim et al. (2017); Li et al. (2018; 2019) propose transfer learning algorithms that regularize the training procedure using the pre-trained models, so as to constrain the divergence of weights and feature maps between the pre-trained and fine-tuned DNN models. Later, Chen et al. (2019); Wan et al. (2019) introduce new algorithms that prevent such regularization from hurting transfer learning: Chen et al. (2019) truncates the tail spectrum of the batch of gradients, while Wan et al. (2019) truncates the ill-posed directions of the aggregated gradients. In addition to the aforementioned strategies, a great number of methods transfer knowledge from the multi-task learning perspective, such as Ge & Yu (2017b); Cui et al. (2018). More specifically, Seq-Train Cui et al. (2018) proposes a two-phase approach: the algorithm first picks auxiliary samples from the source dataset with respect to the target task, then pre-trains a model with the auxiliary samples and fine-tunes it on the target dataset. Moreover, Co-Train Ge & Yu (2017b) adopts a multi-task co-training approach that simultaneously trains a shared backbone network on both source and target datasets with separate Fully-Connected (FC) layers for each. While all the above algorithms enable knowledge transfer from source datasets to target tasks, they sometimes perform poorly due to the following critical technical issues.



• Catastrophic Forgetting and Negative Transfer. Most transfer learning algorithms Donahue et al. (2014); Yim et al. (2017); Li et al. (2018; 2019) consist of two steps: pre-training and fine-tuning. Given the features learned by the pre-trained model, either forgetting good features during fine-tuning (catastrophic forgetting) Chen et al. (2019) or preserving inappropriate features/filters that reject knowledge from the target domain (negative transfer) Li et al. (2019); Wan et al. (2019) hurts the performance of transfer learning. Hence, a compromise is needed between the features learned from the source and target domains during fine-tuning. Multi-task learning with Seq-Train Cui et al. (2018) and Co-Train Ge & Yu (2017b) suggests feasible solutions to balance the knowledge learned from the two domains, by fine-tuning the model with a selected set of auxiliary samples (rather than the whole source dataset) Cui et al. (2018) or by alternately learning features from both domains during fine-tuning Ge & Yu (2017b).

• Gradient Complexity of Seq-Train and Co-Train. Deep transfer learning algorithms based on multi-task learning are computationally inefficient. Though pre-trained models based on key datasets, such as ImageNet, are ubiquitously available for free, multi-task algorithms usually need additional steps for knowledge transfer. Prior to fine-tuning on the target dataset, Seq-Train requires an additional step that selects auxiliary samples and "mid-tunes" the pre-trained model on them Cui et al. (2018). Furthermore, Co-Train Ge & Yu (2017b) incurs additional backpropagation cost, as the two datasets are combined in-situ. Hence, a deep transfer learning algorithm is needed that requires neither an explicit "mid-tuning" procedure nor additional backpropagation to learn from the source dataset.
Our Work. With both technical issues in mind, we aim to design efficient and effective deep transfer learning algorithms with low computational complexity from the multi-task learning perspective. We propose XMixup, namely Cross-domain Mixup, a novel deep transfer learning algorithm that enables knowledge transfer from source to target domains through low-cost Mixup Zhang et al. (2018b). More specifically, given source and target datasets for image classification tasks, XMixup runs deep transfer learning in two steps: (1) Auxiliary sample selection: XMixup pairs every class of the target dataset with a dedicated class of the source dataset, where the samples of the source class are treated as auxiliary samples for the target class; (2) Mixup with auxiliary samples and fine-tuning: XMixup randomly combines samples from the paired classes of the two domains using the mixup strategy Zhang et al. (2018a), and fine-tunes the model on the mixed data. To the best of our knowledge, this work makes three sets of contributions, as follows.

1. We study cross-domain deep transfer learning for DNN classifiers from the multi-task learning perspective, where knowledge transfer from the source to the target task is viewed as a co-training procedure of the shared DNN layers using the target dataset and auxiliary samples Ge & Yu (2017b); Cui et al. (2018). We review the existing solutions Donahue et al. (2014); Yim et al. (2017); Li et al. (2018; 2019), summarize their technical limitations, and pay particular attention to the issues of catastrophic forgetting Chen et al. (2019), negative transfer Wan et al. (2019), and computational complexity.

2. In terms of methodology, we extend the use of Mixup Zhang et al. (2018b) to cross-domain knowledge transfer, where the source and target datasets have different sets of classes and the aim of transfer learning is to adapt to the classes of the target domain. While vanilla mixup augments the training data with rich features and regularizes stochastic training beyond empirical risk minimization (ERM), the proposed XMixup uses mixup to fuse samples from the source and target domains. In this way, the catastrophic forgetting issue is partly alleviated, as the model keeps learning from both domains, at a lower cost than Chen et al. (2019). To control the effect of knowledge transfer, XMixup also offers a tuning parameter that trades off the two domains in the mixup of samples Zhang et al. (2018b).
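The cross-domain mixup step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, data layout, and the use of a Beta(α, α) mixing coefficient (as in vanilla mixup) are assumptions; the source describes only that each target sample is combined with an auxiliary sample from its paired source class.

```python
import numpy as np

def xmixup_batch(target_images, target_labels, aux_images, alpha=0.5, rng=None):
    """Mix each target image with an auxiliary image from its paired source
    class: x_mix = lam * x_t + (1 - lam) * x_s, with lam ~ Beta(alpha, alpha).
    `aux_images[c]` holds the auxiliary samples selected for target class c
    (hypothetical data layout). The returned lam values would weight the
    target-domain and source-domain loss terms during fine-tuning."""
    rng = np.random.default_rng() if rng is None else rng
    mixed, lams = [], []
    for x_t, y_t in zip(target_images, target_labels):
        lam = rng.beta(alpha, alpha)                 # trade-off between domains
        pool = aux_images[y_t]                        # paired source class
        x_s = pool[rng.integers(len(pool))]           # random auxiliary sample
        mixed.append(lam * x_t + (1.0 - lam) * x_s)
        lams.append(lam)
    return np.stack(mixed), np.asarray(lams)
```

Note that the tuning parameter `alpha` plays the role described above: it controls how strongly auxiliary source samples are blended into the target batch, and hence the degree of knowledge transfer.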

