INTERPRETATIONS OF DOMAIN ADAPTATIONS VIA LAYER VARIATIONAL ANALYSIS

Abstract

Transfer learning is known to perform efficiently in many applications, yet limited literature reports the mechanism behind the scenes. This study establishes both formal derivations and heuristic analysis to formulate a theory of transfer learning in deep learning. Our framework, built on layer variational analysis, proves that the success of transfer learning can be guaranteed under corresponding data conditions. Moreover, our theoretical calculation yields intuitive interpretations of the knowledge transfer process. Subsequently, an alternative method for network-based transfer learning is derived. The method shows an increase in efficiency and accuracy for domain adaptation, and is particularly advantageous when new-domain data is sparse during adaptation. Numerical experiments over diverse tasks validated our theory and verified that our analytic expression achieves better performance in domain adaptation than the gradient descent method.

1. INTRODUCTION

Transfer learning is a technique that enables a neural network to learn rapidly from one (source) domain on another domain, mimicking the human brain's ability to transfer cognitive understanding. The concept has proven considerably advantageous, and different frameworks have been formulated for applications in various fields. For instance, it has been widely applied in image classification (Quattoni et al., 2008; Zhu et al., 2011; Hussain et al., 2018), object detection (Shin et al., 2016), and natural language processing (NLP) (Houlsby et al., 2019; Raffel et al., 2019). Beyond applications in computer vision and NLP, transferability is fundamentally and directly related to domain adaptation and adversarial learning (Luo et al., 2017; Cao et al., 2018; Ganin et al., 2016). Another major field adopting transfer learning is domain adaptation, which investigates transition problems between two close domains (Kouw & Loog, 2018). A typical understanding is that transfer learning addresses the general problem where two domains can be rather distinct, allowing both the sample space and the label space to differ, whereas domain adaptation is a subfield of transfer learning in which the sample and label spaces are fixed and only the probability distributions may vary. Several studies have investigated the transferability of network features or representations experimentally, and discussed its relation to network structures (Yosinski et al., 2014), features, and parameter spaces (Neyshabur et al., 2020; Gonthier et al., 2020). In general, all methods that improve the predictive performance on a target domain using knowledge of a source domain fall under the transfer learning category (Weiss et al., 2016; Tan et al., 2018). This work focuses in particular on network-based transfer learning, a specific framework that reuses a pretrained network.
This approach is often referred to as finetuning, which has been shown to be powerful and is widely applied with deep-learning models (Ge & Yu, 2017; Guo et al., 2019). Even with abundant successes in applications, the theoretical understanding of the network-based transfer learning mechanism remains limited. This paper presents a theoretical framework, built on functional variational analysis (Gelfand et al., 2000), to rigorously discuss the mechanism of transfer learning. Under the framework, error estimates are computed to support the foundation of transfer learning, and an interpretation is provided connecting the theoretical derivations with the transfer learning mechanism. Our contributions can be summarized as follows: we formalize transfer learning in a rigorous setting and use variational analysis to build a theoretical foundation for this empirical technique. We prove, through layer variational analysis, that under certain data similarity conditions a transferred net is guaranteed to transfer knowledge successfully. Moreover, a comprehensible interpretation of the finetuning mechanism is presented. The interpretation reveals that the reduction of the non-trivial transfer learning loss can be cast as a linear regression, which naturally leads to analytical (globally) optimal solutions. Experiments in domain adaptation were conducted and showed promising results, validating our theoretical framework.

2. RELATED WORK

Transfer Learning Transfer learning based on deep learning has achieved great success, and finetuning a pretrained model has been considered an influential approach for knowledge transfer (Guo et al., 2019; Girshick et al., 2014; Long et al., 2015) . Due to its importance, many studies attempt to understand the transferability from a wide range of perspectives, such as its dependency on the base model (Kornblith et al., 2019) , and the relation with features and parameters (Pan & Yang, 2009) . Empirically, similarity could be a factor for successful knowledge transfer. The similarity between the pretrained and finetuned models was discussed in (Xuhong et al., 2018) . Our work begins by setting up a mathematical framework with data similarity defined, which leads to a new formulation with intuitive interpretations for transferred models.

Domain Adaptation

Typically, domain adaptation considers data distributions and deviations to search for mappings that align domains. Early literature suggested that assuming data drawn from certain probability distributions (Blitzer et al., 2006) can be used to model and compensate for domain mismatch. Subsequent studies sought theoretical arguments for when a successful adaptation can be achieved. In particular, (Redko et al., 2020) estimated learning bounds under various statistical conditions and obtained theoretical guarantees in classical learning settings. Inspired by deep learning, feature extraction (Wang & Deng, 2018) and efficient finetuning of networks (Patricia & Caputo, 2014; Donahue et al., 2014; Li et al., 2019) have become popular techniques for domain adaptation tasks. The finetuning of networks was investigated further to see what is transferred and learned in (Wei et al., 2018), which is close to our investigation of weight optimization.

3.1. FRAMEWORK FOR NETWORK-BASED TRANSFER LEARNING

To formally address error estimates, we first fix the framework and notation.

Definition 3.1 (Neural networks). An n-layer neural network f : R^{d_0} → R^{d_n} is a function of the form
f = f_n ∘ f_{n−1} ∘ ⋯ ∘ f_1, (1)
where for each j = 1, …, n, f_j = σ_j ∘ A_j : R^{d_{j−1}} → R^{d_j} is a layer composed of an affine function A_j(z) := W_j z + b_j and an activation function σ_j : R^{d_j} → R^{d_j}, with W_j ∈ L(R^{d_{j−1}}, R^{d_j}), b_j ∈ R^{d_j}, and L(K, V) the collection of all linear maps between two linear spaces K → V.

The concept of transfer learning is formulated as follows:

Definition 3.2 (k-layer fixed transfer learning). Given one (large and diverse) dataset D = {(x_i, y_i) ∈ X × Y} and a corresponding n-layer network f = f_n ∘ f_{n−1} ∘ ⋯ ∘ f_1 : X → Y trained on D under loss L(f), k-layer fixed transfer learning finds a new network g : X → Y of the form
g = g_n ∘ g_{n−1} ∘ ⋯ ∘ g_{k+1} ∘ (f_k ∘ ⋯ ∘ f_1) [first k layers fixed], (2)
under loss L(g) when a new and similar dataset D̃ = {(x̃_i, ỹ_i) ∈ X × Y} is given. The first k layers of f remain fixed, and g_j := σ_j ∘ Ã_j are new layers with affine functions Ã_j to be adjusted (k < j ≤ n). The net f trained on the original data D is called the pretrained net, and g trained on the new data D̃ is called the transferred net or the finetuned net.

Transfer learning carries the empirical implication of "transferring" knowledge from one domain to another by fixing the first few pretrained layers. One main goal of this study is to characterize the transferring process under layer finetuning. To comprehend the transfer learning mechanism, we use the loss function to define the concept of a "well-trained" net. The losses of the pretrained net f and the transferred net g are computed on a general label space Y with norm ‖·‖_Y:
L(f) = (1/N) Σ_{i=1}^{N} ‖f(x_i) − y_i‖²_Y,   L(g) = (1/Ñ) Σ_{i=1}^{Ñ} ‖g(x̃_i) − ỹ_i‖²_Y. (3)

Definition 3.3 (Well-trained net).
Given a dataset D = {(x_i, y_i) ∈ X × Y}, a network h is called well-trained under data D within error ε_trained > 0 if L(h) = (1/N) Σ_{i=1}^{N} ‖h(x_i) − y_i‖²_Y ≤ ε²_trained. The definition then implies that a well-pretrained net f within error ε_pretrained satisfies
L(f) ≤ ε²_pretrained. (5)
In certain domain adaptation applications, discussions are phrased in terms of probability distributions (Redko et al., 2020). Our framework is compatible with the usual probabilistic setting under the independent and identically distributed (i.i.d.) assumption frequently made in practice. In this case, Eq. (3) is equivalent to L(f) = E_{(x,y)∼D} ℓ²(f(x), y), L(g) = E_{(x̃,ỹ)∼D̃} ℓ²(g(x̃), ỹ), where ℓ is a norm (error) function. One intuition for successful knowledge transfer is that the previous knowledge base and the new domain share some common (or similar) features. The resemblance of two datasets D, D̃ under transfer learning is defined in this work by:

Definition 3.4 (Data deviation). The data deviation ε_data ≥ 0 of two datasets D and D̃ satisfies
max_i min_j ‖(x̃_i, ỹ_i) − (x_j, y_j)‖_{X×Y} ≤ ε_data,  (i ≤ |D̃|, j ≤ |D|). (7)

Note that the inner minimization in Eq. (7) is recognized as performing sample alignment in the context of domain adaptation. In fact, for each i ≤ |D̃| there exists an index
j_i := arg min_j ‖(x̃_i, ỹ_i) − (x_j, y_j)‖_{X×Y}, (8)
so j_i corresponds to the closest sample in D to the i-th sample of D̃. Domain alignment is then to perform sample alignment over all samples of a given domain. Consequently, this definition first aligns the two datasets and then takes the largest distance among all aligned pairs. Without loss of generality, in the remaining discussion the sample index is rearranged according to Eq. (8), with j_i renamed as i for convenience.
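The alignment-then-maximum computation in Definition 3.4 can be sketched in a few lines of numpy (the function name and the joint-Euclidean choice for ‖·‖_{X×Y} are our assumptions for illustration):

```python
import numpy as np

def data_deviation(D_src, D_tgt):
    """Data deviation of Definition 3.4 between a source dataset
    D = {(x_j, y_j)} and a target dataset D~ = {(x~_i, y~_i)}.

    Each dataset is a pair (X, Y) of 2D arrays; samples and labels are
    concatenated so the norm on X x Y is the Euclidean norm of the joint
    vector. Returns (eps_data, alignment), where alignment[i] is the
    index j_i of the closest source sample to target sample i.
    """
    Xs, Ys = D_src
    Xt, Yt = D_tgt
    S = np.concatenate([Xs, Ys], axis=1)           # (|D|,  dx+dy)
    T = np.concatenate([Xt, Yt], axis=1)           # (|D~|, dx+dy)
    # pairwise distances between every target and every source joint sample
    dists = np.linalg.norm(T[:, None, :] - S[None, :, :], axis=2)
    alignment = dists.argmin(axis=1)               # sample alignment j_i
    eps_data = dists.min(axis=1).max()             # max_i min_j || . ||
    return eps_data, alignment
```

For large datasets, a spatial index (e.g. scipy.spatial.cKDTree) would replace the dense distance matrix.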

3.2. TRANSFER LOSS BOUND VIA LAYER VARIATIONS

Neural networks of the form Eq. (1) with Lipschitz activation functions are Lipschitz continuous. We utilize Lipschitz constants to control the network propagation and estimate the error of last-r-layer finetuned nets. The Lipschitz constant of a network h is denoted by C_h throughout. Let a pretrained net f be composed of two parts, f = (f_n ∘ ⋯ ∘ f_{n−r+1}) ∘ (f_{n−r} ∘ ⋯ ∘ f_1) := F ∘ F_{n−r}, where F_{n−r} = f_{n−r} ∘ ⋯ ∘ f_1 are the layers to be fixed and F = f_n ∘ ⋯ ∘ f_{n−r+1} are the last r layers to be finetuned (1 ≤ r < n). With only the two essential definitions above, our main theorem addressing the validity of the transfer learning technique can be derived:

Theorem 3.5 (Finetuned loss bounded by pretrain loss and data). Let D, D̃ be two datasets with deviation ε_data and let f = F ∘ F_{n−r} be a well-pretrained network under D. Then a net finetuning the last r layers of f, of the form g := (F + δF) ∘ F_{n−r}, has bounded loss on D̃:
L(g) = (1/Ñ) Σ_{i=1}^{Ñ} ‖g(x̃_i) − ỹ_i‖²_Y ≤ C₁ ε²_pretrained + C₂ ε²_data, (10)
where δF is a Lipschitz function with C_δF ≤ ε_data, and C₁, C₂ are two constants depending on C_{F_{n−r}}, C_F and C_x̃ := max_i ‖x̃_i‖_X.

Sketch of proof. By g = (F + δF) ∘ F_{n−r} and the notation of Eq. (8), we have
g(x̃_i) = (F + δF) ∘ F_{n−r}(x̃_i) = f(x_{j_i}) + v₁, (11)
with v₁ := δF ∘ F_{n−r}(x̃_i) + f(x̃_i) − f(x_{j_i}). We compute
(1/Ñ) Σ_{i=1}^{Ñ} ‖g(x̃_i) − ỹ_i‖²_Y = (1/Ñ) Σ_{i=1}^{Ñ} ‖f(x_{j_i}) − ỹ_i + v₁‖²_Y ≤ 3 (ε²_pretrained + ε²_data + ‖v₁‖²_Y),
with Eq. (5) and the triangle inequality applied in the last inequality. Further, we estimate
‖v₁‖²_Y ≤ 2 (‖δF ∘ F_{n−r}(x̃_i)‖²_Y + ‖f(x̃_i) − f(x_{j_i})‖²_Y) ≤ 2 (C²_δF C²_{F_{n−r}} C²_x̃ + C²_F C²_{F_{n−r}} ε²_data). (12)
With Eq. (12) and the condition C_δF ≤ ε_data, the result Eq. (10) follows. For the complete proof, see Appendix 7.1.

Theorem 3.5 shows that the finetuned loss is bounded by the pretrained loss and the data difference in Eq. (10), linking the relation of network functions and data into one.
This explains the intuition for why a well-pretrained net is necessary, and why transfer learning is expected to be viable on datasets with certain similarities. Thus, with minimal conditions and definitions, Theorem 3.5 succinctly addresses when transfer learning guarantees a well-finetuned model under a new dataset. In fact, the generalization error under finetuning is also bounded (Theorem 7.1, Appendix).

Remark 3.6. For two very different datasets, one has ε_data ≫ 0. Eq. (10) then indicates that such a transfer learning process is generally more difficult, as the upper bound becomes large even with a small pretraining error ε_pretrained. Typically, in domain adaptation tasks ε_data is not expected to be large.

Remark 3.7. Eq. (10) can be regarded as an extension of Eq. (5) to the case D̃ ≠ D. The two equations coincide if ε_data = 0, where the two domains D, D̃ perfectly align.
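The constants C₁, C₂ in Theorem 3.5 are driven by Lipschitz constants such as C_F and C_{F_{n−r}}. For a feed-forward net of the form Eq. (1) with 1-Lipschitz activations (e.g. ReLU), one standard upper bound on C_h is the product of the spectral norms of the weight matrices; a minimal sketch (the helper name is ours, and the true Lipschitz constant can be smaller than this bound):

```python
import numpy as np

def layer_lipschitz_bound(weights):
    """Upper bound on the Lipschitz constant of a feed-forward net with
    1-Lipschitz activations: the product of the spectral norms (largest
    singular values) of its weight matrices. Constants of this kind
    enter C_1, C_2 in Theorem 3.5."""
    return float(np.prod([np.linalg.norm(W, 2) for W in weights]))
```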

4. INTERPRETATIONS & CHARACTERIZATIONS OF NETWORK-BASED TRANSFER LEARNING

Inspired by the mathematical framework of Theorem 3.5, an intuitive interpretation of network-based transfer learning can be derived, which in turn leads to a novel formulation for neural network finetuning. The derivation begins with the Layer Variational Analysis (LVA), where the effect of data variation and transmission can be addressed.

4.1. LAYER VARIATIONAL ANALYSIS (LVA)

To clarify the effect of a single-layer variation (r = 1 in Theorem 3.5), we denote the latent variables z_i, z̃_i and the last finetuned layer g_n from Eq. (2) by
z_i := F_{n−1}(x_i),   z̃_i := F_{n−1}(x̃_i),   g_n := f_n + δf_n. (14)
The target prediction error can be related to the pretrain error via the expansion
‖g(x̃_i) − ỹ_i‖_Y = ‖[g_n(z̃_i) − f_n(z̃_i)] + [f_n(z̃_i) − f_n(z_i)] + [f(x_i) − y_i] + [y_i − ỹ_i]‖_Y, (15)
with six self-canceling terms added for auxiliary purposes. Note that the identities g(x̃_i) = g_n(z̃_i) and f_n(z_i) = f(x_i) are used, and the second grouping can be rewritten as
f_n(z̃_i) − f_n(z_i) = J(f_n)(z_i) · δz_i + δz_iᵀ · H(f_n)(z_i) · δz_i + O(δz_i³), (16)
where δz_i := z̃_i − z_i corresponds to the latent feature shift, and J(f_n)(z_i) and H(f_n)(z_i) are the Jacobian and the Hessian matrix of f_n at z_i, respectively. In principle, non-linear activation functions introduce the higher-order δz_i terms in Eq. (16); by considering regression-type problems (assuming f_n purely an affine function) the finetune loss Eq. (15) simplifies to
L(g) = (1/Ñ) Σ_{i=1}^{Ñ} ‖δf_n(z̃_i) − q_i‖²_Y, (17)
with δy_i := ỹ_i − y_i and the vector q_i ∈ Y defined as
q_i := δy_i − J(f_n)(z_i) · δz_i + (y_i − f(x_i)), (18)
which is referred to as the transferal residue. Via the transmission of layer variations (Appendix 7.3) one may also write
q_i = δy_i − J(f)(x_i) · δx_i + (y_i − f(x_i)). (19)

4.2. LVA INTERPRETATIONS & INTUITIONS

The equivalent form Eq. (17) of the transfer loss L(g) = (1/Ñ) Σ_i ‖g(x̃_i) − ỹ_i‖²_Y under LVA renders an intuitive interpretation of network-based transfer learning. The resulting picture is particularly helpful for domain adaptation.

Transferal residue: We justify the name transferal residue for Eq. (19). There are two pairs:
q_i = [δy_i − J(f)(x_i) · δx_i]  (domain mismatch, ∆)  +  [y_i − f(x_i)]  (pretrain error, ℰ), (20)
where the first pair (∆) evaluates the domain mismatch; the other measures the pretrain error.
To see this, observe: (a) a perfect pretrain f gives f(x_i) = y_i, hence (ℰ) = 0; and (b) if the two domains match perfectly, D̃ = D, then δx_i = δy_i = 0 and (∆) = 0. In such a perfect case q_i ≡ 0, and there is nothing new for f to learn/adapt in D̃. On the contrary, if f is ill-pretrained, or δx_i, δy_i are large, q_i becomes large too, indicating there is much to adapt in D̃. Therefore, the transferal residue characterizes the amount of new knowledge to be learned in the new domain, hence the name. In short: q_i → 0 (nothing to adapt); q_i large (much to adapt).

Layer variations to cancel the transferal residue: Whenever the transferal residue is nonzero, the finetuned net g adjusts away from f to react, as Eq. (17) assigns the layer variation δf_n to neutralize q_i. Minimizing the adaptation loss L(g) over δf_n: q_i ≈ 0 requires no adaptation (δf_n ≈ 0), while q_i ≠ 0 requires adaptation, with δf_n obtained by minimizing Eq. (17). In fact, this last adaptation step is analytically solvable if δf_n is a linear functional and ‖·‖_Y is the L²-norm. By the Moore-Penrose pseudo-inverse, the minimum is achieved at
δf_n = (z̃ᵀ · z̃)⁻¹ · z̃ᵀ · q, (21)
with z̃ := (z̃_1, …, z̃_Ñ) and q := (q_1, …, q_Ñ). In short, layer variations tend to digest the knowledge discrepancy due to domain differences.
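Eq. (21) is an ordinary least-squares solve: stacking the latent features z̃_i as rows of a matrix and the transferal residues q_i as rows of a target, the optimal linear δf_n comes out of one pseudo-inverse. A minimal numpy sketch (function name ours; bias omitted, as in the linear-functional assumption):

```python
import numpy as np

def lva_last_layer_update(Z_tilde, Q):
    """One-shot finetuning of the last (linear) layer, cf. Eq. (21).

    Z_tilde : (N, d) latent features z~_i = F_{n-1}(x~_i), stacked as rows
    Q       : (N, m) transferal residues q_i of Eq. (18), stacked as rows
    Returns the weight dW of the layer variation delta f_n minimizing
    (1/N) * sum_i || Z_tilde[i] @ dW - Q[i] ||^2.
    """
    # Moore-Penrose solution dW = (Z^T Z)^+ Z^T Q; lstsq evaluates it stably.
    dW, *_ = np.linalg.lstsq(Z_tilde, Q, rcond=None)
    return dW  # adapted layer: z |-> f_n(z) + z @ dW
```

In contrast to gradient-descent finetuning, this is a single linear solve, which is why the adaptation takes "one step" in the experiments.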

Knowledge transfer:

As q_i reflects the amount of new knowledge left to be digested in D̃, we see that a term J(f)(x_i)δx_i automatically emerges to reduce the remainder in q_i. In particular, the Jacobian term appears with a negative sign, opposite to f(x_i); it acts as a self-correction (self-adaptation) to the new domain. The purpose of J(f)(x_i)δx_i is to annihilate the new label shift δy_i in Eq. (20) (e.g. Fig. 1); notice that this self-correction is not needed if D̃ = D, δx_i = 0. Since the pretrained f contains previous-domain information, the term J(f)(x_i)δx_i indicates that the old knowledge is carried over by f to digest the new, unknown data. Such quantification yields an interpretation and renders an intuitive view of network-based transfer learning. The LVA formulation naturally admits four types of domain transitions according to Eq. (20): (a) δx_i = 0, δy_i = 0; (b) δx_i = 0, δy_i ≠ 0; (c) δx_i ≠ 0, δy_i = 0; and (d) δx_i ≠ 0, δy_i ≠ 0. The analysis can be extended to multi-layer cases as well as Convolutional Neural Networks (CNNs), though the analytic computation soon becomes cumbersome; therefore, only one-layer LVA finetuning is demonstrated in this work. Further discussions and calculations for the LVA of two layers and CNNs can be found in Appendix 7.2. Although there is no explicit formula for the optimal solution in multi-layer cases, iterative regression can be applied recursively to obtain a suboptimal result analytically without invoking gradient descent. The LVA formulation not only renders interesting perspectives in theory but also demonstrates successful knowledge adaptation in practice.

5. EXPERIMENTS

Three experiments in various fields were conducted to verify our theoretical formulation. The first task was a simulated 1D time-series regression; the second was speech enhancement (SE) on a real-world voice dataset; the third was an image deblurring task, extending our formula to Convolutional Neural Networks (CNNs). The three tasks demonstrated that the LVA has prompt adaptation ability in new domains and confirmed the calculations of Eqs. (15)∼(18). The code is available on GitHub. For detailed implementations and additional results on adaptive image classification, see Appendix 8.

5.1. TIME SERIES (1D SIGNAL) REGRESSION

Goal: Train a DNN predicting a temporal signal D and adapt it to another signal with noise, D̃.
Dataset: Two signals are designed as follows, with D̃ mimicking a noisy signal (a low-quality device):
D = {(t_i, sin(5πt_i))}_i,   D̃ = {(t_i + 0.05 ξ_i, γ(t_i) sin(5πt_i) + 0.03 η_i)}_i, (22)
with {t_i | i = 1, …, 2000} equally spaced in [−1, 1], γ(t_i) = 0.4 t_i + 1.3998, ξ_i ∼ N(1.5, 0.8), and η_i ∼ U(−1, 1), where N(µ, σ) is a normal distribution of mean µ and variance σ, and U(−1, 1) is the uniform distribution on [−1, 1].
Implementation: The pretrained model f consists of 3 fully-connected layers of 64 nodes with ReLU activation functions. We refer to the Appendix for more implementation details.
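A sketch of Eq. (22) in numpy (we read N(1.5, 0.8) as mean 1.5 and variance 0.8, hence standard deviation √0.8; the seed and variable names are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)                    # seed is our choice
t = np.linspace(-1, 1, 2000)                      # t_i equally spaced in [-1, 1]

# Source domain D: clean sinusoid
y_src = np.sin(5 * np.pi * t)

# Target domain D~: jittered inputs, amplitude drift, additive noise (Eq. 22)
xi = rng.normal(1.5, np.sqrt(0.8), size=t.shape)  # xi_i ~ N(1.5, 0.8), variance 0.8
eta = rng.uniform(-1.0, 1.0, size=t.shape)        # eta_i ~ U(-1, 1)
gamma = 0.4 * t + 1.3998
x_tgt = t + 0.05 * xi
y_tgt = gamma * np.sin(5 * np.pi * t) + 0.03 * eta
```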

Result analysis:

The adaptation results of the conventional gradient descent (GD) method and of the LVA, Eqs. (18), (21), were compared. The transferred net g was finetuned from f with GD for 12,000 epochs, while the LVA requires only one step. After adaptation, the LVA reached the lowest L²-loss compared to GD with the same number of finetuned layers; see Fig. 2(f). This verifies the stability and validity of the proposed framework. Beyond attaining the lowest loss, the improvement in actual signal prediction can also be observed in the visualizations of Fig. 2(b)∼(e) for finetuning the last one and two layers. Notably, the finetuning loss for two layers decreased further than that for one layer, meeting the expectation that the two-layer case has more free parameters to reduce the loss. Comparing Fig. 2(b)(c), the 1-layer LVA obtained more accurate signal predictions than GD.

5.2. SPEECH ENHANCEMENT

To show that the proposed method is effective on real data, we conducted experiments on an SE task.
Goal: Train an SE model to enhance (denoise) speech signals in a noisy environment D and adapt it to another noisy environment D̃.
Dataset: Speech data pairs were prepared for the source domain D (serving as the training set) and the target domain D̃ (serving as the adaptation set) as follows:
D = {(x_ijk, y_i)}_{i,k}^{N₁,N₂}  adapt→  D̃ = {(x̃_ijk, ỹ_i)}_{i,k}^{Ñ₁,Ñ₂},   (x_ijk = y_i + c_j × n_k;  x̃_ijk = ỹ_i + c_j × ñ_k),
where y_i denotes a patch of a clean speech (as a label), which is corrupted with noise n_k of type k to form a noisy speech patch x_ijk through an amplification c_j ∝ exp(−SNR_j), determined by a given signal-to-noise ratio SNR_j. N₁ (resp. Ñ₁) denotes the number of patches extracted from clean utterances in D (resp. D̃); N₂ (resp. Ñ₂) is the number of noise types contained in D (resp. D̃). An SE model f on D performs denoising such that f(x_ijk) ≈ y_i. In this experiment, 8,000 utterances (corresponding to N₁ = 112,000 patches) were randomly excerpted from the Deep Noise Suppression Challenge (Reddy et al., 2020) dataset; these clean utterances were contaminated with the source-domain noise types n_k ∈ {White-noise, Train, Sea, Aircraft-cabin, Airplane-takeoff} (thus N₂ = 5) at four SNR levels, {−5, 0, 5, 10} dB, to form the training set. These 112,000 noisy-clean pairs were used to train the pretrained SE model. We tested performance using different numbers of adaptation data, with Ñ₁ varying from 20 to 400 patches (Fig. 3). The adaptation data were contaminated by three target noise types, n_k ∈ {Baby-cry, Bell, Siren} (thus Ñ₂ = 3), at {−1, 1} dB SNRs.
Implementation: A Bi-directional LSTM (BLSTM) (Chen et al., 2015) model was used to construct the SE model f under L²-loss, consisting of a one-layer BLSTM of 300 nodes and one fully-connected output layer. The transferred net g, adapted from f by finetuning the last layer, was obtained by GD to compare with the LVA of Eq. (21). For data processing and network details, see Appendix 8.2.
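The pair construction x_ijk = y_i + c_j × n_k can be sketched as follows; we assume the common convention that the gain c_j is set so the pair attains the prescribed SNR in dB (which makes c_j decay exponentially in SNR_j, matching c_j ∝ exp(−SNR_j)):

```python
import numpy as np

def make_noisy(clean, noise, snr_db):
    """Form a noisy-clean training pair x = y + c * n, with the gain c
    chosen so that 10*log10(||y||^2 / ||c*n||^2) equals the target SNR.
    Assumes `noise` is at least as long as `clean`."""
    n = noise[:len(clean)]
    c = np.linalg.norm(clean) / (np.linalg.norm(n) * 10 ** (snr_db / 20))
    return clean + c * n

# toy example: a 440 Hz tone at 16 kHz, corrupted at 0 dB SNR
clean = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))
noise = np.random.default_rng(1).standard_normal(16000)
noisy = make_noisy(clean, noise, snr_db=0)   # 0 dB: equal signal/noise power
```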

Domain alignment:

It was mentioned that domain alignment ought to be applied prior to the LVA. For real-world data, optimal transport (OT) (Villani, 2009) was selected to match domain distributions. For contrast, LVA with and without OT were both conducted.
Evaluation metric: SE performance is usually measured by three evaluation metrics: L²-error (MSE), the perceptual evaluation of speech quality (PESQ) (Rix et al., 2001), and short-time objective intelligibility (STOI) (Taal et al., 2011). Each metric evaluates a different aspect of speech: PESQ ∈ [−0.5, 4.5] evaluates speech quality; STOI ∈ [0, 1] estimates speech intelligibility.

5.3. IMAGE DEBLURRING

Implementation: This experiment demonstrated that image-related tasks can also be handled by extending our LVA, Eqs. (17), (18), from fully-connected layers to CNNs. Although the analytic formula Eq. (21) is only valid for a fully-connected layer of regression type, a key observation admitting this extension is that a CNN kernel can locally be regarded as a fully-connected layer, in that every receptive field is exactly covered by the kernels of a CNN. By proper arrangement, a CNN kernel C = (C_ijγ) can be folded into the 2D weight matrix W = (W_m) of a fully-connected layer. As such, the LVA extends to CNNs as finetune layers; for detailed implementations see the Appendix. In this experiment, an SRCNN model f (Fig. 5) was trained on the source domain D to perform deblurring, f(x) = y, and GD and LVA were subsequently used to adapt f → g on D̃ such that g(x̃) = ỹ.
Result analysis: Fig. 7 showed the results of models adapted by GD and LVA. In
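The folding of a CNN kernel into a fully-connected weight can be sketched with numpy's sliding windows: each receptive field becomes one row of a design matrix, so the closed-form update of Eq. (21) applies to the flattened kernel. A single-channel, stride-1, valid-padding sketch (function name ours; this computes the cross-correlation form used by CNN layers):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_as_matmul(image, kernel):
    """View a 2D CNN layer (stride 1, valid padding) as one
    fully-connected map: rows of Z are receptive fields, and the folded
    kernel is the weight vector, so the LVA least-squares update applies
    to w unchanged."""
    kh, kw = kernel.shape
    patches = sliding_window_view(image, (kh, kw))   # (H', W', kh, kw)
    H, W = patches.shape[:2]
    Z = patches.reshape(H * W, kh * kw)              # one row per receptive field
    w = kernel.reshape(-1)                           # folded kernel
    return (Z @ w).reshape(H, W)
```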

6. CONCLUSIONS

This study provides a theoretical framework for interpreting the mechanism underlying domain adaptation. We derived a firm basis ensuring that a finetuned net can adapt to a new dataset. The objective of this study, to address "why transfer learning might work" and unravel the underlying process, was achieved. Based on the LVA, a novel formulation for finetuned nets was introduced and yielded meaningful interpretations of transfer learning. The formalism was further validated by domain adaptation applications in speech enhancement and image deblurring. Various tasks were observed to reach the lowest L²-loss under the LVA, which serves as the theoretical limit for gradient descent to approach. The LVA interpretations and the successful experiments give this study an inspiring perspective on transfer learning.

7.1. PROOF OF THEOREM 3.5

Consider finetuning the last r layers in Eq. (2) with k + r = n. Denote by F_{n−r} = f_{n−r} ∘ ⋯ ∘ f_1 and F = f_n ∘ ⋯ ∘ f_{n−r+1} the two parts of the pretrained network, and by G = g_n ∘ ⋯ ∘ g_{n−r+1} the finetuned part. We also write the difference G − F = δF. With this notation, we rewrite
g(x̃_i) = G ∘ F_{n−r}(x̃_i) = (F + δF) ∘ F_{n−r}(x̃_i) = f(x_{j_i}) + v₁, (A.23)
where v₁ = δF ∘ F_{n−r}(x̃_i) + f(x̃_i) − f(x_{j_i}) and j_i is the corresponding index such that Eq. (7) is satisfied. Therefore, one computes
(1/Ñ) Σ_{i=1}^{Ñ} ‖g(x̃_i) − ỹ_i‖²_Y = (1/Ñ) Σ_{i=1}^{Ñ} ‖f(x_{j_i}) − ỹ_i + v₁‖²_Y
  ≤ (3/Ñ) Σ_{i=1}^{Ñ} (‖f(x_{j_i}) − y_{j_i}‖²_Y + ‖y_{j_i} − ỹ_i‖²_Y + ‖v₁‖²_Y)
  ≤ 3 (ε²_pretrained + ε²_data + ‖v₁‖²_Y), (A.24)
where the last term is further estimated by the triangle inequality to yield
‖v₁‖²_Y ≤ 2 (‖δF ∘ F_{n−r}(x̃_i)‖²_Y + ‖f(x̃_i) − f(x_{j_i})‖²_Y) ≤ 2 (C²_δF C²_{F_{n−r}} C²_x̃ + C²_F C²_{F_{n−r}} ε²_data), (A.25)
with C_x̃ := max_i ‖x̃_i‖_X. Note that without any additional assumption, the first term in Eq. (A.25) is of order 1. To obtain a small error of g, one sufficient condition is to have C_δF ≈ ε_data. It is interesting to point out that the analysis relates the Lipschitz constant C_δF of the network perturbation to the data similarity ε_data.

Theorem 7.1 (Generalization error bound). Given a test (held-out) set D̃_test = {(x̃_i^(test), ỹ_i^(test)) ∈ X × Y}_{i=1}^{N_test} with |D̃_test| ≤ |D̃|, a well-finetuned net g in Theorem 3.5 has the generalization error bound
L_test(g) = (1/N_test) Σ_{i=1}^{N_test} ‖g(x̃_i^(test)) − ỹ_i^(test)‖²_Y ≤ C₁ ε²_data + C₂ L(g), (A.26)
where ε_data is the data deviation between D̃ and D̃_test computed by Definition 3.4, C₁, C₂ are two constants with C₁ depending on the Lipschitz constant C_g, and L(g) is the finetune loss in Theorem 3.5.

Proof.
L_test(g) = (1/N_test) Σ_{i=1}^{N_test} ‖g(x̃_i^(test)) − ỹ_i^(test)‖²_Y
  ≤ (3/N_test) Σ_{i=1}^{N_test} (‖g(x̃_i^(test)) − g(x̃_i)‖²_Y + ‖ỹ_i^(test) − ỹ_i‖²_Y + ‖g(x̃_i) − ỹ_i‖²_Y)
  ≤ (3/N_test) Σ_{i=1}^{N_test} (C²_g ε²_test + ε²_test) + (3/N_test) Σ_{i=1}^{Ñ} ‖g(x̃_i) − ỹ_i‖²_Y. (A.27)

7.2. PERTURBATION APPROXIMATION DERIVATION FOR THE 2-LAYER CASE

Similarly, let δz_i := z̃_i − z_i and denote the finetuned n-th and (n−1)-th layers by
g_n = f_n + δf_n,   g_{n−1} = f_{n−1} + δf_{n−1}. (A.28)
The perturbation assumption allows the approximation
f_{n−1}(z̃_i) ≈ f_{n−1}(z_i) + J(f_{n−1})(z_i) · δz_i, (A.29)
where J(f_{n−1})(z_i) is the Jacobian matrix of f_{n−1} at z_i. The network is assumed to deviate from the pretrained f_{n−1} by an affine function,
δf_{n−1}(z̃_i) := W_{n−1} · δz_i + b, (A.30)
for some weight W_{n−1} and bias b. With Eqs. (A.29) and (A.30), we obtain
g_{n−1}(z̃_i) ≈ f_{n−1}(z_i) + b + [J(f_{n−1})(z_i) + W_{n−1}] · δz_i. (A.31)
With Eq. (A.31),
g(x̃_i) = (f_n + δf_n) ∘ g_{n−1}(z̃_i) ≈ f_n ∘ f_{n−1}(z_i) + f_n ∘ ([J(f_{n−1})(z_i) + W_{n−1}] · δz_i + b) + δf_n ∘ f_{n−1}(z_i). (A.32)
Note that the terms involving products of δf_n and δf_{n−1} are omitted, as we only keep first-order terms. Parallel to Eq. (15) in the 1-layer case, we approximate
‖g(x̃_i) − ỹ_i‖_Y = ‖[g(x̃_i) − f_n ∘ f_{n−1}(z_i)] + [f_n ∘ f_{n−1}(z_i) − y_i] + [y_i − ỹ_i]‖_Y (A.33)
and therefore
‖g(x̃_i) − ỹ_i‖_Y ≈ ‖δf_n ∘ f_{n−1}(z_i) + f_n ∘ ([J(f_{n−1})(z_i) + W_{n−1}] · δz_i + b) + f(x_i) − y_i − δy_i‖_Y. (A.34)
The 2-layer finetuning case thus involves three groups of unknown parameters: δf_n, b, and W_{n−1}. Although there is no explicit formula for the optimal solution, iterative regression can be applied multiple times to obtain a sub-optimal result analytically without using gradient descent. In fact, if b is first ignored, the optimization of δf_n and W_{n−1} under the L²-loss can be solved. Once δf_n and W_{n−1} are solved and fixed, b can be reintroduced into the formula and solved again by regression.
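The iterative-regression scheme above can be sketched as alternating least squares. For simplicity we collapse the composite of f_n with W_{n−1} into one effective matrix M and omit the bias b (both simplifying assumptions of ours); each half-step is then an ordinary regression:

```python
import numpy as np

def two_layer_lva(H, dZ, R, iters=50):
    """Sub-optimal 2-layer LVA by alternating regressions.

    H  : (N, d) penultimate features f_{n-1}(z_i)
    dZ : (N, p) latent shifts delta z_i
    R  : (N, m) residual targets (the transferal-residue analogue)
    Solves R ~ H @ dWn + dZ @ M for the last-layer variation dWn and the
    effective second-layer variation M, one block at a time.
    """
    dWn = np.zeros((H.shape[1], R.shape[1]))
    M = np.zeros((dZ.shape[1], R.shape[1]))
    for _ in range(iters):
        # fix M, regress dWn:  H @ dWn ~ R - dZ @ M
        dWn, *_ = np.linalg.lstsq(H, R - dZ @ M, rcond=None)
        # fix dWn, regress M:  dZ @ M ~ R - H @ dWn
        M, *_ = np.linalg.lstsq(dZ, R - H @ dWn, rcond=None)
    return dWn, M
```

Since the joint problem is linear in (dWn, M), this block-coordinate descent decreases the L²-loss monotonically.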
Similar approximations for general multi-layers cases would still be reliable as long as the features in the latent space are close to each other. That is to say, the features of the source and target domain share similarities. These intuition-motivated formulations were shown to successfully adapt knowledge in experiments for both images and speech signals. The results can be found in Section 5.

7.3. TRANSMISSION OF LAYER VARIATIONS

We show heuristically that the propagation of layer variations in neural nets can be traced via data deviations. Given two datasets D, D̃ with the indexing after sample alignment, we define
δx_i := x̃_i − x_i,   δy_i := ỹ_i − y_i. (A.35)
Let the data deviation be ε as in Definition 3.4; Eq. (A.35) can be decomposed by orders of ε,
δx_i = δx_i^(1) + δx_i^(2) + ⋯,   δy_i = δy_i^(1) + δy_i^(2) + ⋯, (A.36)
with O(‖δx_i^(k)‖_X) = O(‖δy_i^(k)‖_Y) = ε^k. To obtain a simple picture, we first neglect the higher-order terms δx_i^(2), δy_i^(2), etc. In this case, δx_i^(1) and δy_i^(1) can simply be denoted δx_i and δy_i, respectively, without confusion. By Eq. (14), finetuning the last layer in Theorem 3.5 reveals that
L(g) = (1/Ñ) Σ_{i=1}^{Ñ} ‖(f_n + δf_n) ∘ f_{n−1} ∘ ⋯ ∘ f_1(x̃_i) − ỹ_i‖²_Y = (1/Ñ) Σ_{i=1}^{Ñ} ‖δf_n(z̃_i) − ỹ_i + f(x_i) + J(f)(x_i) δx_i‖²_Y + O(ε²), (A.37)
where J(f)(x_i) is the Jacobian of f at x_i and O(ε²) contains terms like H(f)(x_i) δx_i δx_i, with H(f)(x_i) the Hessian tensor of f at x_i. Some effort converts Eq. (A.37) into the familiar form
L(g) = (1/Ñ) Σ_{i=1}^{Ñ} ‖δf_n(z̃_i) − q_i‖²_Y, (A.38)
with
q_i = δy_i − J(f)(x_i) · δx_i + (y_i − f(x_i)). (A.39)
Note that Eq. (A.38) recovers Eq. (17) via a different route and starting point, although the two q_i's in Eqs. (18) and (A.39) differ slightly due to the distinct variational conditions. This heuristic derivation confirms that data deviations induce network variations leading to the alternative finetune form; consequently, both derivations yield the same result.

8. EXPERIMENTAL DETAILS

All code and data can be found at https://github.com/HHTseng/Layer-Variational-Analysis.git. The norms ‖·‖_X and ‖·‖_Y in all experiments are set as Euclidean L²-norms, unless otherwise specified. Some notable details are remarked here.

8.1. 1D SERIES REGRESSION

Dataset: A 1D series D for the pretrained f and another series D̃ for transfer learning are given by (Fig. 8)
D = {(x_i, sin(5πx_i)) | x_i ∈ [−1, 1]}_{i=1}^{N=2000},
D̃ = {(x_i + 0.05 ξ_i, γ(x_i) y_i + 0.03 η_i) =: (x̃_i, ỹ_i) | x_i ∈ [−1, 1]}_{i=1}^{Ñ=2000}, (B.40)
with γ(x_i) = 0.4 x_i + 1.3998, ξ_i ∼ N(1.5, 0.8) (normal distribution, mean 1.5, variance 0.8), and η_i ∼ U(−1, 1) (uniform distribution on [−1, 1]).
Network Architecture: The pretrained model f using D was a 3-layer fully-connected network with 64 nodes per layer and ReLU activation functions except at the output layer. A finetuned model g_GD using gradient descent (GD) retrained the last layer of f on D̃. f and g_GD were trained under L²-loss with the ADAM optimizer at learning rate 10⁻³ for 8,000 and 12,000 epochs, respectively. The finetuned model g_pseudo obtained by the LVA method required no training process but directly replaced the last layer of f by f_n → (f_n + δf_n), with δf_n given by Eq. (21).
Hardware: One NVIDIA V100 GPU (32 GB GPU memory) with 4 CPUs (128 GB CPU memory).
Runtime: Approximately 3 sec for the LVA method; 4 minutes for the GD method.

8.2. SPEECH ENHANCEMENT

Dataset The utterances used in the experiment were excerpted from the Deep Noise Suppression Challenge (Reddy et al., 2020). We randomly selected 8000 clean utterances for training and 100 utterances for testing. Five noise types {White-noise, Train, Sea, Aircraft-cabin, Airplane-takeoff} were used to form the training set (to estimate the pretrained SE model), and three noise types {Baby-cry, Bell, Siren} were used to form the transfer learning and testing sets. For the training set, the 8000 clean utterances were equally divided and contaminated by the five noise types (thus, each noise type had 1600 noisy utterances) at four signal-to-noise-ratio (SNR) levels, {-5, 0, 5, 10} dB. For the testing set, 100 clean utterances were contaminated by the {Baby-cry, Bell, Siren} noises at {-1, 1} dB SNRs. For the adaptation set, we prepared 20-400 patches contaminated by the three noise types {Baby-cry, Bell, Siren} at {-1, 1} dB SNRs.

Network Architecture The overall flowchart of the SE task is shown in Fig. 9. In this study, we implemented SE systems using two neural network models to evaluate the proposed LVA. The first is the deep denoising autoencoder (DDAE) model (Lu et al., 2013), which serves as the simpler architecture. The other is the bi-directional LSTM (BLSTM) (Chen et al., 2015), which is relatively complicated compared with the DDAE. The DDAE is composed of 5 fully-connected layers with 512 units each; activation functions were added to each layer except the last, using LeakyReLU with a negative slope of 0.01. The BLSTM consists of a one-layer BLSTM of 300 nodes and one fully-connected layer. The training setting is the same for the two SE models: the learning rate of the Adam optimizer was set to $10^{-3}$. After pretraining, only the last-layer parameters were updated in the finetuning stage. According to Eq. (21), the pseudo-inverse of the latent features was used to calculate $\delta f_n$.
Hardware One NVIDIA V100 GPU (32 GB GPU memory) with 4 CPUs (128 GB CPU memory).

Runtime Approximately 7 minutes for the LVA method and 25 minutes for the GD method with 40 utterances.

8.2.1. ADAPTATION ON MISMATCHED SPEAKERS

In Sec. 5.2, we confirmed the effectiveness of the proposed LVA on noise-type and SNR adaptation for the SE task, where the main speech distortion comes from the background noise. In this section, we further investigate the achievable performance of LVA on SE when the speakers are mismatched between the source and target domains. First, we excluded the utterances of the speakers used in the training set and regenerated the adaptation and test sets by randomly sampling 200 utterances (100 for the adaptation set and 100 for the test set) from the remaining Deep Noise Suppression Challenge dataset. The experimental setup was the same as in Sec. 5.2, with the testing speakers not present in the training set. We note that since the speakers did not match, an OT process was performed to align speakers.

In this extended experiment, we utilize a pretrained FullSubNet (Hao et al., 2021) to perform SE domain adaptation tasks. The FullSubNet is a state-of-the-art SE model reaching PESQ 2.89 and STOI 0.96 on the DNS challenge. This implementation serves to examine the effect of the pretrained network. To compare with the previous results of the BLSTM model in Sec. 8.2, the same source-domain data and target-domain set are used for adaptation. The FullSubNet receives spectrum amplitudes as input and outputs a complex ideal ratio mask (cIRM), which is compared with cIRM labels under the $L_2$-loss. The architecture of the FullSubNet is composed of a full-band and a sub-band model; each contains two stacked LSTM layers and one fully-connected layer. The LSTMs contain 512 hidden units in the full-band model and 384 hidden units in the sub-band model. The detailed structures can be found in (Hao et al., 2021). A pretrained FullSubNet was obtained from the official GitHub repository, where the last linear layer was finetuned to adapt to the target set.
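The OT alignment step above is not spelled out in detail. As a minimal illustrative stand-in (not necessarily the procedure used in the paper), discrete optimal transport with uniform marginals and equal sample counts reduces to a one-to-one matching, solvable exactly with scipy's `linear_sum_assignment`; the per-sample embedding features below are hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

# Hypothetical embeddings for 50 source and 50 target utterances; the target
# set is a shuffled, slightly perturbed copy of the source set.
src = rng.normal(size=(50, 16))
perm = rng.permutation(50)
tgt = src[perm] + 0.01 * rng.normal(size=(50, 16))

# Squared-Euclidean cost matrix C[i, j] = ‖src_i − tgt_j‖²
C = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)

# With uniform marginals and equal sample counts, discrete OT reduces to a
# one-to-one matching, solved by the Hungarian algorithm.
row, col = linear_sum_assignment(C)
tgt_aligned = tgt[col]   # tgt_aligned[i] is the transport match of src[i]
```

After alignment, the paired indices play the role of the "sample alignment" assumed in Eq. (A.35).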
The finetuned networks were subsequently evaluated on an adaptation test set to measure PESQ and STOI; the results are shown in Table 2. The comparison shows that finetuning from a pretrained FullSubNet achieves enhanced baseline scores on the target domain (see Table 2, [no finetune]). While adaptation by gradient descent significantly improves the pretrained network, the proposed LVA method outperforms the traditional gradient descent method in terms of both PESQ and STOI. These results are consistent with the previous SE experiments, confirming that the proposed LVA promptly reaches better adaptation.

8.3. SUPER RESOLUTION FOR IMAGE DEBLURRING

CNN extension The LVA formulation Eq. (14)∼(19) can be extended to CNN layers via the observation shown in Fig. 11, whose left-hand side depicts the usual convolution operation in CNNs. Given a CNN, denote its kernels by $\{C_{ij\alpha\beta}\}$ with $(i,j,\alpha,\beta) \in L_1 \times L_2 \times C_{\mathrm{in}} \times C_{\mathrm{out}}$, where $L_1, L_2$ give the kernel size (e.g., $3\times 3$), $C_{\mathrm{in}}$ the input channels, and $C_{\mathrm{out}}$ the output channels. By flattening the input and the convolved output into 1D vectors (Fig. 11, right-hand side), the convolution acts as a fully-connected layer; this completes the conversion of a CNN layer to a fully-connected layer, and Eq. (17)∼(18) follow naturally. Extensions to other locally-connected networks are a possible future scope.

Datasets We used the CUration of Flickr Events Dataset (CUFED) (Wang et al., 2016), available at https://acsweb.ucsd.edu/~yuw176/event-curation.html, to train an SRCNN (Dong et al., 2016) for image deblurring. We randomly selected 2000 images from CUFED and randomly cropped 20 patches of size 33 × 33 from each image. Three pairs of training & finetuning sets were formed (Fig. 12). Multiple pretrained architectures $f$ were implemented:

$$f = f_3 \circ f_2 \circ f_1 \;(\text{3-layer CNN}), \qquad f = f_5 \circ f_4 \circ f_3 \circ f_2 \circ f_1 \;(\text{5-layer CNN}), \qquad f = f_7 \circ f_6 \circ f_5 \circ f_4 \circ f_3 \circ f_2 \circ f_1 \;(\text{7-layer CNN}). \qquad \text{(B.46)}$$

Network Architecture We adopted a fast end-to-end super-resolution model similar to the original SRCNN architecture (Dong et al., 2016), with a batch-normalization layer inserted right after each ReLU activation. The training batch size for $f$ was 128, and the batch size for gradient-descent finetuning was 1. The network architectures of the pretrained models corresponding to Eq. (B.46) are:
1. 3-layer CNN with kernel sizes (9, 5, 5) and channel sizes (1, 32, 32, 1);
2. 5-layer CNN with kernel sizes (9, 5, 5, 5, 5) and channel sizes (1, 32, 32, 32, 32, 1);
3. 7-layer CNN with kernel sizes (9, 5, 5, 5, 5, 5, 5) and channel sizes (1, 32, 32, 32, 32, 32, 32, 1),
where the first and last channel sizes are always 1, indicating the input and output luminance channel.
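The conversion in Fig. 11 can be checked numerically: a convolution is reproduced exactly by one matrix-vector product on the flattened input. A minimal sketch, restricted to a single channel and 'valid' (no-padding) convolution for brevity:

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 'valid' 2-D cross-correlation (as in CNN layers), single channel."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def conv_as_matrix(k, H, W):
    """Dense matrix M such that M @ x.ravel() == conv2d_valid(x, k).ravel()."""
    kh, kw = k.shape
    oh, ow = H - kh + 1, W - kw + 1
    M = np.zeros((oh * ow, H * W))
    for i in range(oh):
        for j in range(ow):
            row = np.zeros((H, W))
            row[i:i + kh, j:j + kw] = k   # kernel placed at output position (i, j)
            M[i * ow + j] = row.ravel()
    return M

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))
k = rng.normal(size=(3, 3))
M = conv_as_matrix(k, 6, 6)              # a 16×36 "fully-connected" weight matrix
print(np.allclose(M @ x.ravel(), conv2d_valid(x, k).ravel()))  # True
```

With multiple channels, the same construction applies per input/output channel pair, stacking the resulting blocks into one larger weight matrix.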
MNIST → USPS: In this experiment, the pretrained model was trained on MNIST (source domain) to 98.94% accuracy over 20 epochs. When applied to the target domain USPS, the accuracy of the pretrained model dropped sharply to 65.32%. Subsequently, gradient descent (GD) and the proposed LVA were deployed to improve the target-domain results. Table 4 shows the adaptation accuracy attained at different finetune epochs. LVA consistently outperformed GD and reached high accuracy in just a few epochs. These two additional experiments on image classification again verify that the proposed LVA is capable of real-world DA tasks; its adaptability is general.



Footnotes: A patch is defined as a temporal segment extracted from an utterance; in practice, 128 magnitude spectral vectors form a patch. The pretrained FullSubNet is available at https://github.com/haoxiangsnr/FullSubNet.
Published as a conference paper at ICLR 2023.
[Figure panels: (e) STOI score of BLSTM; (f) loss of BLSTM.]



Figure 1: An illustration of the transferal residue, Eq. (20). In this example, a pretrained animal classifier $f$ with one-hot labels is adapted from $D \to \widetilde D$, where $D$ consists of dog images and $\widetilde D$ of cat images. The self-correction term, which transfers the old knowledge, attempts to correct the predictions, Eq. (20).

Fig. 2(d)(e) shows the results of 2-layer finetuning by GD and LVA; the 2-layer LVA formulation is derived in Appendix 7.2. Observing that the LVA formulation enables prompt adaptation, we further conduct two real-world applications.

Figure 2: (a) Datasets $D$ and $\widetilde D$ given by Eq. (22), (b) 1-layer finetuning by GD, (c) 1-layer LVA finetuning, (d) 2-layer finetuning by GD, (e) 2-layer LVA finetuning, (f) the training loss of the finetuned net over epochs.

Note that the speech contents, noise types, and SNRs were all mismatched between the training and adaptation sets.

Figure 3: Performance of GD vs. LVA for different numbers $N_1$ of adaptation patches from $\widetilde D$.

Result analysis: We excerpted another 100 utterances, contaminated with the target noise types $n_k \in$ {Baby-cry, Bell, Siren} at {-1, 1} dB SNR levels, to form the test set. Note that the speech contents were mismatched between the adaptation and test sets. Fig. 3 shows the domain adaptation results from environment $D \to \widetilde D$, comparing GD with LVA. On this test set, the PESQ and STOI of the original noisy speech (without SE) were 1.704 and 0.808, respectively. The results in Fig. 3 show consistent out-performance of LVA over GD across different amounts of adaptation data in $\widetilde D$ ($N_1$ on the horizontal axis). In particular, the $L_2$-loss of LVA was notably lower than that of GD, confirming that LVA indeed derives the globally optimal weights of the transferred net. LVA significantly outperformed GD especially when the number of target samples in $\widetilde D$ was small. The LVA equipped with OT alignment (LVA-OT) achieved performance similar to LVA. Enhanced spectrograms from the different SE models (Pretrained, GD, LVA, and LVA-OT) are shown in Fig. 4, together with the clean and unprocessed noisy spectrograms for comparison. Fig. 4 shows that LVA recovers the speech components with clearer structure than Pretrained and GD (see green rectangles). More extensive result analyses are provided in Appendix 8.2.

Figure 4: Spectrograms of an utterance (Baby-cry noise at SNR -1 dB; $N_1$ = 400 for adaptation).

5.3 SUPER RESOLUTION FOR IMAGE DEBLURRING

Goal: Train a Super-Resolution Convolutional Neural Network (SRCNN, Fig. 5) (Dong et al., 2016) to deblur images of domain $D$ and adapt it to another, more blurred domain $\widetilde D$.

Figure 5: [SRCNN for image deblurring] Pretrained model: a sequence of CNN layers $f = f_n \circ \cdots \circ f_2 \circ f_1$. The last CNN layer $f_n$ is finetuned to adapt to the more blurred images in $\widetilde D$.

Figure 6: Sample images (CUFED) of $D$ and $\widetilde D$.

Result analysis: Column (b) shows the more blurred inputs $\tilde x$ of $\widetilde D$; columns (c)(d)(e) show deblurred images to be compared with the ground truths in column (a). Columns (c)(d) compare the two methods adapted with the same number of target-domain samples ($N$ = 256 patches), where LVA reached the highest PSNR scores. Even when the number of target-domain samples for GD was increased to $N$ = 16384 in column (e), the corresponding PSNRs were still not comparable to the LVA adaptation. More extensive experiments contrasting the adaptation outcomes of GD and LVA are given in Appendix 8.3.

Figure 7: Image deblurring results of adapted SRCNNs on testing data SET14.

Figure 8: The original data $D$ and the new data $\widetilde D$ for transfer learning.

Figure 9: The flowchart of a deep denoising autoencoder (DDAE) for SE.

Implementation All speech utterances were recorded at a 16 kHz sampling rate. The speech signals were first transformed into spectral features (magnitude spectrograms) by a short-time Fourier transform (STFT), using a Hamming window of 512 points with a hop length of 256 points. To adequately shrink the magnitude range, a log1p function (log1p(x) := log(1 + x)) was applied to every element of the spectral features. During training, noisy spectral vectors were fed to the SE model to generate enhanced spectral vectors; the clean spectral vectors served as the reference, and the loss was computed from the difference between the enhanced and reference spectral vectors. The SE model was then trained to minimize this difference. During inference, noisy spectral vectors were input to the trained SE model to generate enhanced ones, which were converted back to a time-domain waveform via inverse STFT using the preserved original noisy phase.
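The feature pipeline described above can be sketched with scipy; the synthetic sine "utterance" below is a placeholder for a real waveform:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs                      # 1 s of audio at 16 kHz
wave = 0.5 * np.sin(2 * np.pi * 440 * t)    # placeholder "utterance"

# STFT: Hamming window of 512 points, hop length 256 (= nperseg - noverlap)
_, _, Z = stft(wave, fs=fs, window='hamming', nperseg=512, noverlap=256)
mag, phase = np.abs(Z), np.angle(Z)

# log1p compression of the magnitudes, as described above
feat = np.log1p(mag)                        # model input/output space
mag_back = np.expm1(feat)                   # invert after "enhancement"

# Reconstruct the waveform with the preserved original phase via inverse STFT
_, wave_rec = istft(mag_back * np.exp(1j * phase), fs=fs,
                    window='hamming', nperseg=512, noverlap=256)
```

In a real SE system, `feat` would be enhanced by the model before `expm1` and the inverse STFT; here the round trip simply recovers the input.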

Figure 10: SRCNN for image deblurring. A pretrained model is composed of a sequence of CNN layers $f = f_n \circ \cdots \circ f_2 \circ f_1$, where the last CNN layer $f_n$ is finetuned to fit the more blurred images of $\widetilde D$.

Figure 11: (LHS) The masked patch $x_{ij\alpha}$ is regarded as fully-connected under the CNN kernels, Eq. (B.41). (RHS) By flattening the input and the convolved output into 1D vectors, a CNN is equivalent to a fully-connected layer, Eq. (B.42).

Figure 12: Samples of CUFED for training and finetuning (bicubic{×3, ×4, ×6}) for Eq. (B.43)∼(B.43).

Figure 13: Results of image deblurring by SRCNN. Column (b): testing input images from SET14; column (c): results of LVA model finetuned with 256 patches; column (d): results of GD model finetuned with 256 patches; column (e): results of GD model finetuned with 16,384 patches. The corresponding PSNR shown in the figures indicated our LVA method reached the best PSNR with the least finetuning samples.


In Table 1, we denote LVA-OT as LVA for shorthand. The comparison of BLSTM model adaptation using GD and LVA is shown in Table 1, which demonstrates that both GD and LVA improve the PESQ and STOI scores after adaptation, while LVA consistently yields better performance across noise types and SNR levels.

Table 1: Performance of finetuned models on mismatched speakers.

Table 2: Performance of finetuned models from a pretrained FullSubNet.

Table 4: MNIST → USPS classification accuracy. Columns: finetune method, and accuracy after 10, 50, 90, and 120 epochs.


More extensive results Fig. 13 shows the deblurring results of models finetuned by GD and by LVA, where our method reached the highest PSNR with the fewest finetuning samples ($N$ = 256). This outcome indicates that the LVA method is beneficial for few-shot learning, especially when adaptation data are scarce.
1. The loss is roughly inversely proportional to the PSNR score. Increasing the number of CNN layers helps handle more difficult tasks, such as {×3} → {×4, ×6} compared with {×3} → {×6} and {×4, ×6} → {×3}, since {×3} → {×4, ×6} must adapt to more new data than it was originally trained on.
2. In all figures, the LVA method performed more stably than GD in both loss and PSNR, even when the sample size (x-axis) decreased sharply: the orange bars fluctuate, while the blue bars do not.
3. In panel (a), GD showed more difficulty learning {×3} → {×4, ×6} and {×3} → {×6} than {×3, ×4} → {×6}, whereas the LVA method remained roughly stable while retaining its advantage over GD.
4. Overall, the improvement of the GD method was not consistent with the amount of data, likely due to local minima in the neural networks. Conversely, the performance of the LVA method scales stably with the number of samples, so acceptable results can be obtained with few finetuning data.

Hardware One NVIDIA V100 GPU (32 GB GPU memory) with 4 CPUs (128 GB CPU memory).

Runtime Approximately 4 minutes for the LVA method and 15 minutes for the GD method with 256 patches (1,000 epochs).
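The remark that the loss is roughly inversely related to the PSNR follows from the definition PSNR = 10 log₁₀(MAX²/MSE): the smaller the $L_2$ reconstruction error, the higher the PSNR. A minimal helper (patch size and noise levels are illustrative):

```python
import numpy as np

def psnr(ref, est, max_val=1.0):
    """PSNR in dB: 10·log10(MAX² / MSE). Higher PSNR ⇔ lower L2 loss."""
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
img = rng.uniform(size=(33, 33))                        # a 33×33 patch, as in training
noisy_small = img + 0.01 * rng.normal(size=img.shape)   # small reconstruction error
noisy_large = img + 0.10 * rng.normal(size=img.shape)   # large reconstruction error

# Smaller L2 loss ⇒ higher PSNR, as noted above.
print(psnr(img, noisy_small) > psnr(img, noisy_large))  # True
```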

8.4. ADDITIONAL EXPERIMENTS ON DOMAIN ADAPTIVE IMAGE CLASSIFICATIONS

We conduct two additional experiments on images to investigate domain adaptation ability, using (1) Office-31 (Saenko et al., 2010) and (2) MNIST → USPS.

Office-31: Table 3 shows the results of domain transitions within the three domains of Office-31: Amazon (A), DSLR (D), and Webcam (W), yielding six possible transitions for classifying 31 items. On each individual source domain (A, D, W), the pretrained model was trained for 30 epochs and subsequently finetuned on the target domains for another 30 epochs. The proposed LVA obtained higher adaptation accuracy in most domain transitions, confirming that LVA can promptly adapt to new tasks.

