FACTOR NORMALIZATION FOR DEEP NEURAL NETWORK MODELS

Abstract

Deep neural network (DNN) models often involve high-dimensional features. In most cases, these high-dimensional features can be decomposed into two parts. The first part is a low-dimensional factor. The second part is the residual feature, with much-reduced variability and inter-feature correlation. This decomposition leads to a number of interesting theoretical findings for deep neural network training. Motivated by these findings, we develop a new factor normalization method for better performance. The proposed method leads to a new deep learning model with two important features. First, it allows factor-related feature extraction. Second, it allows adaptive learning rates for factors and residuals, respectively. This leads to fast convergence on both training and validation datasets. A number of empirical experiments are presented to demonstrate its superior performance. The code is available at https://github.com/HazardNeo4869/FactorNormalization.
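The decomposition described above can be sketched as follows. This is an illustrative sketch only, assuming a PCA-style estimate of the factor subspace; the dimensions, variable names, and simulated data are hypothetical and not the paper's exact estimator.

```python
import numpy as np

# Illustrative sketch (not the paper's exact algorithm): split
# high-dimensional features X into a low-dimensional factor part and a
# residual part using the top-r principal directions.
rng = np.random.default_rng(0)
n, p, r = 1000, 50, 3                      # samples, feature dim, factor dim

# Simulate features driven by r latent factors plus small noise.
F = rng.normal(size=(n, r))                # latent factors
B = rng.normal(size=(r, p))                # loading matrix
X = F @ B + 0.1 * rng.normal(size=(n, p))  # observed features

# Estimate the factor subspace from the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / n
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
V = eigvecs[:, -r:]                        # top-r eigenvectors

factor_part = Xc @ V @ V.T                 # projection onto factor subspace
residual_part = Xc - factor_part           # much-reduced variability
```

In this toy setting the residual carries only a small fraction of the total feature variance, which is the property the abstract attributes to the residual part.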

1. INTRODUCTION

In recent decades, the progress of deep learning, together with advances in GPU devices, has led to a growing popularity of deep neural network (DNN) models in both academia and industry. DNN models have been widely used in various fields, such as image classification (Simonyan & Zisserman, 2014; He et al., 2016a), speech recognition (Hinton et al., 2012; Maas et al., 2017), and machine translation (Wu et al., 2016; Vaswani et al., 2017). However, due to their deep structure, most DNN models are extremely difficult to train. The practical training of a DNN model often depends heavily on empirical experience and is extremely time consuming. Therefore, a series of effective optimization methods has been developed for fast DNN training. According to a recent survey by Sun et al. (2019), most optimization methods with explicit derivatives can be roughly categorized into two groups: first-order optimization methods and high-order optimization methods. The widely used stochastic gradient descent (SGD) algorithm and its variants (Robbins & Monro, 1951; Jain et al., 2018) are typical examples of first-order optimization methods. The SGD algorithm computes only the first-order derivatives (i.e., the gradient) on a randomly sampled batch. By doing so, the SGD algorithm can handle large datasets with limited computational resources. Unfortunately, the practical feasibility of SGD comes at the cost of a sublinear convergence rate (Johnson & Zhang, 2013a). For better convergence speed, various accelerated SGD algorithms have been developed, such as the popular momentum method (Polyak, 1964; Qian, 1999) and the Nesterov accelerated gradient (NAG) method (Nesterov, 1983; Sutskever et al., 2013), both of which take the previous update direction into consideration.
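The momentum and NAG updates mentioned above can be sketched on a toy quadratic loss. This is a minimal illustration in the Sutskever et al. (2013) formulation; the loss, step size, and momentum coefficient are illustrative choices, not values from the paper.

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss f(w) = 0.5 * ||w||^2.
    return w

def sgd_momentum(w, steps=100, lr=0.1, mu=0.9, nesterov=False):
    """SGD with classical momentum, or NAG when nesterov=True."""
    v = np.zeros_like(w)
    for _ in range(steps):
        # NAG evaluates the gradient at the look-ahead point w + mu * v;
        # classical momentum evaluates it at the current point w.
        g = grad(w + mu * v) if nesterov else grad(w)
        v = mu * v - lr * g   # velocity accumulates past update directions
        w = w + v
    return w

w0 = np.array([5.0, -3.0])
w_momentum = sgd_momentum(w0)                 # classical momentum
w_nag = sgd_momentum(w0, nesterov=True)       # Nesterov accelerated gradient
```

Both variants drive the iterate toward the minimizer at the origin; the only difference is where the gradient is evaluated, which is exactly the "previous update direction" information the text refers to.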
Further improvements include AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), RMSprop (Tieleman & Hinton, 2012), Adam (Kingma & Ba, 2014), and others. For more stable gradient estimation, the stochastic average gradient (SAG) (Roux et al., 2012) and stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013b) methods have also been developed. In addition to first-order optimization methods, high-order optimization methods also exist. Popular representatives are Newton's method and its variants (Shanno, 1970; Hu et al., 2019; Pajarinen et al., 2019). Compared to first-order methods, high-order methods can achieve faster convergence because they take the information in the Hessian matrix into consideration.
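Of the adaptive methods listed above, Adam is the most widely used; its update rule can be sketched on the same toy quadratic loss. The hyperparameter values below are the commonly used defaults from Kingma & Ba (2014), shown for illustration only.

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss f(w) = 0.5 * ||w||^2.
    return w

def adam(w, steps=1000, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: per-coordinate adaptive steps from running moment estimates."""
    m = np.zeros_like(w)   # first-moment (mean of gradients) estimate
    v = np.zeros_like(w)   # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)   # bias correction for zero init
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w_final = adam(np.array([5.0, -3.0]))
```

The division by the square root of the second-moment estimate gives each coordinate its own effective learning rate, which is the defining feature of this family of adaptive methods.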

Code availability: https://github.com/HazardNeo4869/FactorNormalization

