FACTOR NORMALIZATION FOR DEEP NEURAL NETWORK MODELS

Abstract

Deep neural network (DNN) models often involve high-dimensional features. In most cases, these features can be decomposed into two parts. The first part is a low-dimensional factor. The second part is the residual feature, with much-reduced variability and inter-feature correlation. This decomposition leads to a number of interesting theoretical findings for deep neural network training. Accordingly, we are inspired to develop a new factor normalization method for better performance. The proposed method leads to a new deep learning model with two important features. First, it allows factor-related feature extraction. Second, it allows adaptive learning rates for the factors and the residuals, respectively. This leads to fast convergence on both training and validation datasets. A number of empirical experiments are presented to demonstrate its superior performance. The code is available at https://github.com/HazardNeo4869/FactorNormalization.

1. INTRODUCTION

In recent decades, the progress of deep learning, together with advances in GPU devices, has led to a growing popularity of deep neural network (DNN) models in both academia and industry. DNN models have been widely used in various fields, such as image classification (Simonyan & Zisserman, 2014; He et al., 2016a), speech recognition (Hinton et al., 2012; Maas et al., 2017), and machine translation (Wu et al., 2016; Vaswani et al., 2017). However, due to their deep structure, most DNN models are extremely difficult to train. The practical training of a DNN model often depends heavily on empirical experience and is extremely time consuming. Therefore, a series of effective optimization methods have been developed for fast DNN training. According to a recent survey by Sun et al. (2019), most optimization methods with explicit derivatives can be roughly categorized into two groups: first-order optimization methods and high-order optimization methods. The widely used stochastic gradient descent (SGD) algorithm and its variants (Robbins & Monro, 1951; Jain et al., 2018) are typical examples of first-order optimization methods. The SGD algorithm computes only the first-order derivatives (i.e., the gradient) on a randomly sampled batch. By doing so, the SGD algorithm can handle large datasets with limited computational resources. Unfortunately, the practical feasibility of SGD comes at the cost of a sublinear convergence speed (Johnson & Zhang, 2013a). For better convergence speed, various accelerated SGD algorithms have been developed, such as the popular momentum method (Polyak, 1964; Qian, 1999) and the Nesterov accelerated gradient (NAG) method (Nesterov, 1983; Sutskever et al., 2013). Both take the direction of previous gradient updates into consideration.
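As an illustration, the momentum update mentioned above can be sketched on a toy least-squares problem. This is our own minimal NumPy sketch (all variable names, the problem setup, and the hyperparameter values are our illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: find theta minimizing ||X theta - y||^2 / (2N).
N, p = 200, 5
X = rng.normal(size=(N, p))
theta_true = rng.normal(size=p)
y = X @ theta_true

def batch_grad(theta, idx):
    """Mini-batch gradient of the squared loss."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / len(idx)

theta = np.zeros(p)
velocity = np.zeros(p)
lr, beta = 0.1, 0.9                                # step size and momentum coefficient

for step in range(500):
    idx = rng.choice(N, size=32, replace=False)    # random batch (the "S" in SGD)
    velocity = beta * velocity + batch_grad(theta, idx)  # accumulate past directions
    theta -= lr * velocity                         # momentum update
```

Setting `beta = 0` recovers plain SGD; the accumulated `velocity` term is what lets momentum damp oscillations along poorly scaled directions.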
Further improvements include AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), RMSprop (Tieleman & Hinton, 2012), Adam (Kingma & Ba, 2014), and others. For more stable gradient estimation, the stochastic average gradient (SAG) (Roux et al., 2012) and stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013b) methods have also been developed. In addition to the first-order optimization methods, high-order optimization methods also exist. Popular representatives are Newton's method and its variants (Shanno, 1970; Hu et al., 2019; Pajarinen et al., 2019). Compared to first-order methods, high-order methods might achieve faster convergence since they take the Hessian matrix into consideration. For example, Newton's method can achieve a quadratic convergence speed under appropriate conditions (Avriel, 2003). However, calculating and storing the Hessian matrix and its inverse is extremely expensive in terms of both time and storage. This has led to the development of approximation methods, such as the quasi-Newton method (Avriel, 2003) and the stochastic quasi-Newton method (Luo et al., 2014). The idea of these methods is to approximate the inverse Hessian matrix by a positive definite matrix. The DFP (Fletcher & Powell, 1963; Davidon, 1991), BFGS (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970), and L-BFGS (Nocedal, 1980; Liu & Nocedal, 1989) methods are popular representatives. Moreover, as a useful technique for fast convergence, various pre-conditioning techniques are also popularly used (Huckle, 1999; Benzi, 2002; Tang et al., 2009). The basic idea of pre-conditioning is to transform a difficult or ill-conditioned linear system (e.g., Aθ = b) into an easier system with a better condition number (Higham & Mary, 2019). As a consequence, the information contained in the feature covariance can be effectively used (Wang et al., 2019).
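As a toy illustration of the pre-conditioning idea (our own sketch, not from the paper), the following compares plain gradient descent with a Jacobi (diagonal) pre-conditioner on an ill-conditioned quadratic whose minimizer solves Aθ = b:

```python
import numpy as np

# Ill-conditioned quadratic: minimize 0.5 * theta^T A theta - b^T theta,
# i.e., solve A theta = b, where the scales of A are widely spread.
A = np.diag([100.0, 10.0, 1.0, 0.1])
b = np.array([1.0, 1.0, 1.0, 1.0])
theta_star = np.linalg.solve(A, b)

def run_gd(precond_diag, steps=200):
    """Gradient descent preconditioned by the diagonal matrix M = diag(precond_diag)."""
    theta = np.zeros_like(b)
    M_inv = 1.0 / precond_diag                 # inverse of the diagonal preconditioner
    lr = 1.0 / np.max(M_inv * np.diag(A))      # stable step size for the scaled system
    for _ in range(steps):
        g = A @ theta - b                      # gradient of the quadratic
        theta -= lr * M_inv * g                # preconditioned update
    return theta

plain = run_gd(np.ones(4))                     # no preconditioning (M = I)
jacobi = run_gd(np.diag(A))                    # Jacobi preconditioner M = diag(A)

err_plain = np.max(np.abs(plain - theta_star))
err_jacobi = np.max(np.abs(jacobi - theta_star))
```

Because `A` here is diagonal, the Jacobi preconditioner makes the scaled system perfectly conditioned and GD converges essentially immediately, while plain GD barely moves along the smallest-eigenvalue direction. In practice `A` is not diagonal and the same logic applies only approximately.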
Other interesting methods trying to extract useful information from the feature covariance also exist; see, for example, Denton et al. (2015), Ghiasi & Fowlkes (2016), and Lai et al. (2017). However, to the best of our knowledge, there seem to be no existing models or methods particularly designed for high-dimensional features with a factor structure. Meanwhile, ample empirical experience suggests that most high-dimensional features demonstrate a strong factor type of covariance structure. In other words, a significant amount of the feature variability can be explained by a latent factor of very low dimension. As a consequence, we can decompose the original features into two parts. The first is a low-dimensional factor part, which accounts for a significant portion of the total variability. The second is the residual part with factor effects removed. This residual part has the same dimension as the original feature but much-reduced variability. Moreover, its inter-feature correlation is also reduced substantially. To this end, the original learning problem concerning the high-dimensional features can be decomposed into two sub-problems. The first is a learning problem concerning the latent factor. This is relatively simple since the dimension of the factor is very low. The second concerns the residual feature. This remains challenging due to the high dimension. However, compared with the original problem, it is much easier because the inter-feature dependence has been substantially reduced. For a practical implementation, we propose a novel method called factor normalization. It starts with a benchmark model (e.g., VGG or ResNet) and then slightly modifies the benchmark model into a new model structure. Compared with the benchmark model, the new model takes the latent factor and the residuals as different inputs. The benchmark model is retained to process the residuals.
The latent factor is then added back to the model in the last layer. This compensates for the information loss due to factor extraction. By doing so, the new model allows the factor-related features and the residual-related features to be processed separately. Furthermore, different (i.e., adaptive) learning rates can be used for the factor and the residuals, respectively. This leads to adaptive learning and thus fast convergence. The rest of this article is organized as follows. Section 2 develops our theoretical motivation with statistical insights. Section 3 provides the details of the proposed new model. Section 4 demonstrates the outstanding performance of the proposed model via extensive empirical experiments. Section 5 concludes the article with a brief discussion of future research.
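One simple way to realize the factor/residual decomposition described above is via principal component analysis (PCA). The sketch below is our illustrative reading of the decomposition step only (not the authors' exact implementation); the simulation settings N, p, d and the noise scale are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate high-dimensional features with a low-dimensional factor structure:
# X = F B^T + E, where F is an N x d latent factor with d << p.
N, p, d = 500, 100, 3
F = rng.normal(size=(N, d))                # latent factors
B = rng.normal(size=(p, d)) * 3.0          # strong factor loadings
E = rng.normal(size=(N, p)) * 0.5          # weak idiosyncratic residuals
X = F @ B.T + E

# Estimate the factor space by the top-d principal directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V_d = Vt[:d].T                             # p x d estimated loading directions

factor = Xc @ V_d                          # N x d estimated factor (low-dim input)
residual = Xc - factor @ V_d.T             # N x p residual with factor effects removed

# The residual should carry far less variability than the original feature.
var_ratio = residual.var() / Xc.var()
```

In the proposed model, `residual` would be fed to the benchmark network while `factor` is concatenated back at the last layer; the small `var_ratio` is what makes the residual sub-problem easier than the original one.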

2. THEORETICAL MOTIVATION

To motivate our new model, we provide here a number of interesting theoretical insights from different perspectives. Since the SGD algorithm is a stochastic version of the gradient descent (GD) algorithm, we focus on the standard GD algorithm in this section for simplicity of discussion.

2.1. THE GD ALGORITHM

Let (X_i, Y_i) be the observation collected from the i-th instance with 1 ≤ i ≤ N, where Y_i is often the class label and X_i = (X_{i1}, ..., X_{ip})^⊤ ∈ R^p is the associated p-dimensional feature. The loss function evaluated at the i-th instance is defined as ℓ(Y_i, X_i^⊤ θ), where θ ∈ R^p is the unknown parameter. Then, the global loss is given by L_N(θ) = N^{-1} Σ_{i=1}^N ℓ(Y_i, X_i^⊤ θ). The global gradient is given by ∇L_N(θ) = N^{-1} Σ_{i=1}^N ℓ'(Y_i, X_i^⊤ θ) X_i, where ℓ'(y, u) denotes ∂ℓ(y, u)/∂u.
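The global loss and gradient above can be made concrete with a specific choice of ℓ. The sketch below (ours, not from the paper) uses the squared loss ℓ(y, u) = (y − u)²/2, so that ℓ'(y, u) = u − y, and runs plain GD on simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Data: N instances with p-dimensional features X_i and responses Y_i.
N, p = 300, 10
X = rng.normal(size=(N, p))
theta_true = rng.normal(size=p)
Y = X @ theta_true

def global_loss(theta):
    """L_N(theta) = N^{-1} sum_i l(Y_i, X_i^T theta) with squared loss."""
    r = X @ theta - Y
    return 0.5 * np.mean(r ** 2)

def global_grad(theta):
    """Global gradient: N^{-1} sum_i l'(Y_i, X_i^T theta) X_i."""
    return X.T @ (X @ theta - Y) / N

# Plain GD iteration: theta <- theta - lr * grad, using the full sample each step.
theta = np.zeros(p)
lr = 0.5
for _ in range(1000):
    theta -= lr * global_grad(theta)
```

Replacing the full-sample sums with sums over a random batch turns this GD iteration into the SGD iteration discussed earlier.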
