XMIXUP: EFFICIENT TRANSFER LEARNING WITH AUXILIARY SAMPLES BY CROSS-DOMAIN MIXUP

Abstract

Transferring knowledge from large source datasets is an effective way to fine-tune deep neural networks for a target task with a small sample size. A great number of algorithms have been proposed to facilitate deep transfer learning, and these techniques can generally be categorized into two groups: Regularized Learning of the target task using models that have been pre-trained on source datasets, and Multitask Learning with both source and target datasets to train a shared backbone neural network. In this work, we aim to improve the multitask paradigm for deep transfer learning via Cross-domain Mixup (XMixup). While existing multitask learning algorithms need to run backpropagation over both the source and target datasets and usually incur higher gradient complexity, XMixup transfers knowledge from source to target tasks more efficiently: for every class of the target task, XMixup selects auxiliary samples from the source dataset and augments training samples via the simple mixup strategy. We evaluate XMixup on six real-world transfer learning datasets. Experiment results show that XMixup improves accuracy by 1.9% on average. Compared with other state-of-the-art transfer learning approaches, XMixup costs much less training time while still obtaining higher accuracy.
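The cross-domain mixup step described above can be sketched as follows. This is a minimal illustration, not the paper's exact recipe: the Beta-distributed mixing coefficient is an assumption carried over from standard mixup, and the paired auxiliary source samples are assumed to have been pre-selected for each target class.

```python
import numpy as np

def xmixup_batch(x_target, x_source, alpha=2.0, rng=None):
    """Mix each target sample with its paired auxiliary source sample.

    Sketch only: lambda ~ Beta(alpha, alpha) as in standard mixup; the
    source samples are assumed to be pre-selected per target class.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    # Bias lambda toward the target sample so the target label dominates
    # the mixed example (an illustrative choice of this sketch).
    lam = max(lam, 1.0 - lam)
    x_mixed = lam * x_target + (1.0 - lam) * x_source
    return x_mixed, lam
```

The mixed batch is then fed through the shared backbone as an ordinary training batch, so no extra backpropagation pass over the source dataset is required.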



In addition to the aforementioned strategies, a great number of methods have been proposed to transfer knowledge from the multitask learning perspective, such as Ge & Yu (2017b); Cui et al. (2018). More specifically, Seq-Train Cui et al. (2018) proposes a two-phase approach, where the algorithm first picks auxiliary samples from the source datasets with respect to the target task, then pre-trains a model with the auxiliary samples and fine-tunes the model using the target dataset. Moreover, Co-Train Ge & Yu (2017b) adopts a multi-task co-training approach to simultaneously train a shared backbone network using both source and target datasets and their corresponding separate



learning algorithms in real-world applications is often limited by the size of training datasets. Training a deep neural network (DNN) model with a small number of training samples usually leads to over-fitting with poor generalization performance. A common yet effective solution is to train DNN models under transfer learning settings Pan et al. (2010) using large source datasets. The knowledge transferred from the source domain helps DNNs learn better features and achieve higher generalization performance for pattern recognition in the target domain Donahue et al. (2014); Yim et al. (2017).

Backgrounds. For example, the paradigm of Donahue et al. (2014) proposes to first train a DNN model using a large (and possibly irrelevant) source dataset (e.g., ImageNet), then use the weights of the pre-trained model as the starting point of optimization and fine-tune the model using the target dataset. In this way, blessed by the power of large source datasets, the fine-tuned model is usually capable of handling the target task with better generalization performance. Furthermore, the authors of Yim et al. (2017); Li et al. (2018; 2019) propose transfer learning algorithms that regularize the training procedure using the pre-trained models, so as to constrain the divergence of the weights and feature maps between the pre-trained and fine-tuned DNN models. Later, Chen et al. (2019); Wan et al. (2019) introduce new algorithms that prevent the regularization from hurting transfer learning, where Chen et al. (2019) proposes to truncate the tail spectrum of the batch of gradients while Wan et al. (2019) proposes to truncate the ill-posed directions of the aggregated gradients.
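The weight-divergence regularization discussed above (in the spirit of Li et al. (2018)) can be illustrated with a minimal sketch: penalize the squared distance between the current weights and the pre-trained starting point. The flat parameter dictionaries and the hyperparameter `beta` are illustrative assumptions, not the papers' exact formulation.

```python
import numpy as np

def starting_point_penalty(params, pretrained, beta=0.01):
    """Sketch of a starting-point (L2-SP-style) regularizer.

    Sums the squared distance between current weights and the
    pre-trained weights over all shared parameters, scaled by beta.
    Parameters only present in `params` (e.g., a new classifier head)
    are left unpenalized.
    """
    penalty = 0.0
    for name, w in params.items():
        if name in pretrained:
            penalty += float(np.sum((w - pretrained[name]) ** 2))
    return beta * penalty
```

During fine-tuning this penalty would be added to the task loss, so gradient descent is pulled back toward the pre-trained solution rather than drifting freely.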

