INTERPRETATIONS OF DOMAIN ADAPTATIONS VIA LAYER VARIATIONAL ANALYSIS

Abstract

Transfer learning is known to perform efficiently in many applications, yet little of the literature explains the mechanism behind its success. This study establishes both formal derivations and heuristic analysis to formulate a theory of transfer learning in deep learning. Our framework, built on layer variational analysis, proves that the success of transfer learning can be guaranteed under corresponding data conditions. Moreover, our theoretical calculation yields intuitive interpretations of the knowledge transfer process. Subsequently, an alternative method for network-based transfer learning is derived. The method shows an increase in efficiency and accuracy for domain adaptation, and is particularly advantageous when new-domain data is sparse during adaptation. Numerical experiments over diverse tasks validated our theory and verified that our analytic expression achieved better performance in domain adaptation than the gradient descent method.

1. INTRODUCTION

Transfer learning is a technique that enables a neural network to learn rapidly from one (source) domain to another (target) domain, mimicking the human brain's capacity for transferring cognitive understanding. The concept has proven considerably advantageous, and different frameworks have been formulated for applications in various fields. For instance, it has been widely applied in image classification (Quattoni et al., 2008; Zhu et al., 2011; Hussain et al., 2018), object detection (Shin et al., 2016), and natural language processing (NLP) (Houlsby et al., 2019; Raffel et al., 2019). Beyond computer vision and NLP, transferability is fundamentally and directly related to domain adaptation and adversarial learning (Luo et al., 2017; Cao et al., 2018; Ganin et al., 2016). Another major field adopting transfer learning is domain adaptation, which investigates transition problems between two close domains (Kouw & Loog, 2018). A typical understanding is that transfer learning addresses the general problem in which the two domains can be rather distinct, allowing both the sample space and the label space to differ, whereas domain adaptation is a subfield of transfer learning in which the sample and label spaces are fixed and only the probability distributions are allowed to vary. Several studies have experimentally investigated the transferability of network features or representations and discussed its relation to network structures (Yosinski et al., 2014), features, and parameter spaces (Neyshabur et al., 2020; Gonthier et al., 2020). In general, all methods that improve the predictive performance on a target domain using knowledge of a source domain fall under the transfer learning category (Weiss et al., 2016; Tan et al., 2018). This work particularly focuses on network-based transfer learning, a specific framework that reuses a pretrained network.
This approach is often referred to as finetuning, which has proven powerful and is widely applied with deep-learning models (Ge & Yu, 2017; Guo et al., 2019). Yet even with abundant successes in applications, the theoretical understanding of the network-based transfer learning mechanism remains limited. This paper presents a theoretical framework, set out from functional variational analysis (Gelfand et al., 2000), to rigorously discuss the mechanism of transfer learning. Within this framework, error estimates can be computed to support the foundation of transfer learning, and an interpretation is provided connecting the theoretical derivations to the transfer learning mechanism. Our contributions can be summarized as follows: we formalize transfer learning in a rigorous setting and employ variational analysis to build a theoretical foundation for the empirical technique. A theorem is proved through layer variational analysis showing that, under certain data similarity conditions, a transferred net is guaranteed to transfer knowledge successfully. Moreover, a comprehensible interpretation of the finetuning mechanism is presented. Subsequently, the interpretation reveals that the reduction of the non-trivial transfer learning loss can be represented in a linear regression form, which naturally leads to analytic, globally optimal solutions. Experiments in domain adaptation were conducted and showed promising results, which validated our theoretical framework.
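The claim that the transfer loss reduction takes a linear regression form admits a minimal numerical illustration: once the first k layers are frozen, fitting a linear output layer under squared loss is an ordinary least-squares problem with a closed-form, globally optimal solution. The sketch below uses a hypothetical frozen feature extractor and synthetic data as stand-ins (none of these weights or shapes come from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen feature extractor: the first k layers of a pretrained net.
W1 = rng.normal(size=(8, 3))
b1 = rng.normal(size=8)

def features(x):
    # phi(x) = sigma(W1 x + b1) with a ReLU activation (illustrative choice)
    return np.maximum(W1 @ x + b1, 0.0)

# Synthetic "new domain" data: few samples, as in the sparse-data setting.
X = rng.normal(size=(20, 3))
Y = X @ rng.normal(size=(3, 2)) + 0.01 * rng.normal(size=(20, 2))

Phi = np.stack([features(x) for x in X])        # (20, 8) frozen features
Phi1 = np.hstack([Phi, np.ones((len(X), 1))])   # append a bias column

# Closed-form, globally optimal linear head under squared loss:
#   min_W ||Phi1 W - Y||^2  is ordinary least squares.
W_star, *_ = np.linalg.lstsq(Phi1, Y, rcond=None)

loss = np.mean((Phi1 @ W_star - Y) ** 2)
```

Because least squares is globally optimal over the head weights, `loss` can never exceed the loss of any other head (e.g. the all-zero one), which gradient descent would only approach iteratively.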

2. RELATED WORK

Transfer Learning

Transfer learning based on deep learning has achieved great success, and finetuning a pretrained model is considered an influential approach for knowledge transfer (Guo et al., 2019; Girshick et al., 2014; Long et al., 2015). Due to its importance, many studies attempt to understand transferability from a wide range of perspectives, such as its dependency on the base model (Kornblith et al., 2019) and its relation to features and parameters (Pan & Yang, 2009). Empirically, similarity appears to be a key factor for successful knowledge transfer; the similarity between the pretrained and finetuned models was discussed in (Xuhong et al., 2018). Our work begins by setting up a mathematical framework with data similarity defined, which leads to a new formulation with intuitive interpretations for transferred models.

Domain Adaptation

Typically, domain adaptation considers data distributions and their deviations in order to search for mappings that align domains. Early literature suggested that assuming data are drawn from certain probability distributions (Blitzer et al., 2006) can be used to model and compensate for the domain mismatch. Some studies then sought theoretical conditions under which a successful adaptation can be achieved. In particular, (Redko et al., 2020) estimated learning bounds under various statistical conditions and yielded theoretical guarantees in classical learning settings. Inspired by deep learning, feature extraction (Wang & Deng, 2018) and efficient finetuning of networks (Patricia & Caputo, 2014; Donahue et al., 2014; Li et al., 2019) have become popular techniques for domain adaptation tasks. The finetuning of networks was investigated further in (Wei et al., 2018) to see what is actually transferred and learned; this is closely related to our investigation of weight optimization.

3.1. FRAMEWORK FOR NETWORK-BASED TRANSFER LEARNING

To formally address error estimates, we first formulate the framework and notation for consistency.

Definition 3.1 (Neural networks). An $n$-layer neural network $f : \mathbb{R}^{d_0} \to \mathbb{R}^{d_n}$ is a function of the form
$$f = f_n \circ f_{n-1} \circ \cdots \circ f_1, \qquad (1)$$
where for each $j = 1, \ldots, n$, the layer $f_j = \sigma_j \circ A_j : \mathbb{R}^{d_{j-1}} \to \mathbb{R}^{d_j}$ is composed of an affine function $A_j(z) := W_j z + b_j$ and an activation function $\sigma_j : \mathbb{R}^{d_j} \to \mathbb{R}^{d_j}$, with $W_j \in \mathcal{L}(\mathbb{R}^{d_{j-1}}, \mathbb{R}^{d_j})$, $b_j \in \mathbb{R}^{d_j}$, and $\mathcal{L}(K, V)$ the collection of all linear maps between two linear spaces $K \to V$.

The concept of transfer learning is formulated as follows:

Definition 3.2 ($k$-layer fixed transfer learning). Given one (large and diverse) dataset $D = \{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}\}$ and a corresponding $n$-layer network $f = f_n \circ f_{n-1} \circ \cdots \circ f_1 : \mathcal{X} \to \mathcal{Y}$ trained on $D$ under loss $L(f)$, $k$-layer fixed transfer learning finds a new network $g : \mathcal{X} \to \mathcal{Y}$ of the form
$$g = g_n \circ g_{n-1} \circ \cdots \circ g_{k+1} \circ \underbrace{f_k \circ \cdots \circ f_1}_{\text{fixed}},$$
under loss $\widetilde{L}(g)$ when a new and similar dataset $\widetilde{D} = \{(\tilde{x}_i, \tilde{y}_i) \in \mathcal{X} \times \mathcal{Y}\}$ is given. The first $k$ layers of $f$ remain fixed, and $g_j := \sigma_j \circ \widetilde{A}_j$ are new layers with affine functions $\widetilde{A}_j$ to be adjusted ($k < j \leq n$). The net $f$ trained on the original data $D$ is called the pretrained net, and $g$ trained on the new data $\widetilde{D}$ is called the transferred net or the finetuned net. Transfer learning has the empirical implication of
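Definitions 3.1 and 3.2 can be made concrete in a few lines of code. The sketch below (layer sizes, activations, and weights are arbitrary illustrations, not taken from the paper) builds a pretrained net f as a composition of layers per Definition 3.1, then forms a transferred net g by keeping the first k layers fixed and replacing the remaining layer with a new affine map to be adjusted on the new dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

def layer(W, b, act=np.tanh):
    """One layer f_j = sigma_j ∘ A_j with A_j(z) = W z + b (Definition 3.1)."""
    return lambda z: act(W @ z + b)

def compose(layers):
    """Compose layers as f = f_n ∘ ... ∘ f_1 (applied first-to-last)."""
    def net(x):
        for f in layers:
            x = f(x)
        return x
    return net

# A pretrained 3-layer net f = f3 ∘ f2 ∘ f1; dimensions are illustrative.
dims = [4, 6, 6, 2]
params = [(rng.normal(size=(dims[j + 1], dims[j])), rng.normal(size=dims[j + 1]))
          for j in range(3)]
f_layers = [layer(W, b) for W, b in params]
f = compose(f_layers)

# k-layer fixed transfer (k = 2): reuse f1, f2 unchanged; only the new
# top layer's affine map would be adjusted when finetuning on new data.
k = 2
W_new = rng.normal(size=(dims[3], dims[2]))
b_new = rng.normal(size=dims[3])
g = compose(f_layers[:k] + [layer(W_new, b_new)])

x = rng.normal(size=dims[0])   # a sample input
```

Here `g` shares the frozen feature map `f2 ∘ f1` with `f` exactly; only the replaced top layer differs, which is what the error estimates in the sequel quantify.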

