A GRADIENT-BASED KERNEL APPROACH FOR EFFICIENT NETWORK ARCHITECTURE SEARCH

Abstract

It is widely accepted that vanishing and exploding gradient values are the main reason behind the difficulty of deep network training. In this work, we take a further step to understand the optimization of deep networks and find that both gradient correlations and gradient values have strong impacts on model training. Inspired by this finding, we explore a simple yet effective network architecture search (NAS) approach that leverages gradient correlations and gradient values to find well-performing architectures. Specifically, we first formulate these two terms into a unified gradient-based kernel and then select the architectures with the largest kernels at initialization as the final networks. The new approach replaces the expensive "train-then-test" evaluation paradigm with a lightweight scoring function based on the gradient-based kernel at initialization. Experiments show that our approach achieves competitive results while being orders of magnitude faster than "train-then-test" paradigms on image classification tasks. Furthermore, the extremely low search cost enables wide applications: the approach also obtains performance improvements on two text classification tasks.¹

1. INTRODUCTION

Understanding and improving the optimization of deep networks has been an active field of artificial intelligence. One of the mysteries in deep learning is why extremely deep neural networks are hard to train. Currently, the vanishing and exploding gradient problem is widely believed to be the main answer (Bengio et al., 1994; Hochreiter & Schmidhuber, 1997; Pascanu et al., 2013): gradients exponentially decrease (or increase) from the top layer to the bottom layer of a multilayer network. By extension, the vanishing and exploding gradient problem now also refers to extremely small (or large) gradient values in general. Following this explanation, several widely-used training techniques have been proposed to assist optimization. These studies can be roughly classified into four categories: initialization-based approaches (Glorot & Bengio, 2010; He et al., 2015; Zhang et al., 2019), activation-based approaches (Hendrycks & Gimpel, 2016; Klambauer et al., 2017), normalization-based approaches (Ioffe & Szegedy, 2015; Salimans & Kingma, 2016; Ulyanov et al., 2016; Lei Ba et al., 2016; Nguyen & Salazar, 2019), and skip-connection-based approaches (He et al., 2016; Huang et al., 2017a). These techniques provide good alternatives to stabilize gradients and bring promising performance improvements.

Despite much evidence showing the importance of steady gradient values, we are still curious whether there are unexplored but important factors. To answer this question, we conduct extensive experiments and find some surprising phenomena. First, the accuracy curves of some models converge well even though the vanishing and exploding gradient problem persists. Second, we find several cases where models with similar gradient values show very different convergence behavior. These results indicate that gradient values may not be as vital as we expect, and that some hidden factors play a significant role in optimization.
To identify these hidden factors, we explore other gradient features, such as the covariance matrix, correlations, and variance, which evaluate gradients from different perspectives. The covariance matrix evaluates the similarity between any two parameters and is widely used in optimization studies (Zhao & Zhang, 2015; Faghri et al., 2020). Gradient correlation is defined to evaluate the similarity between the gradients of any two examples. Among these features, experiments show that gradient correlation is an important factor. Compared to widely-explored gradient values, gradient correlations are rarely studied. Gradient values affect the step size in optimization: vanishing gradients prevent models from changing weight values and, in the worst case, can completely stop training. Gradient correlations instead evaluate the randomness of gradient directions in Euclidean space: lower gradient correlations indicate more "random" gradient directions and more "conflicting" parameter updates. For a better understanding, we visualize absolute gradient values and gradient correlations (see Section 4 for more details). The visualization shows that models with either small values or small correlations have worse convergence and generalization performance, indicating the importance of both factors.

Following this new finding, we take a further step and explore gradient correlations and gradient values in network architecture search. We first formulate the two factors into a unified gradient-based kernel, defined as the average of the gradient dot-product matrix. Motivated by the observation that architectures with larger kernels at initialization tend to have better average results, we develop a lightweight network architecture search approach, called GT-NAS, which evaluates architectures according to the gradient-based kernel at initialization.
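The two quantities above can be made concrete with a minimal numpy sketch. We assume a linear model with squared loss so that per-example gradients have a closed form; the helper names (`per_example_gradients`, `gradient_correlations`) are ours, not from the paper's released code.

```python
import numpy as np

def per_example_gradients(w, X, t):
    """Analytic per-example gradients of the squared loss
    L_i = 0.5 * (w @ x_i - t_i)^2 for the linear model y = w @ x.
    Returns an (n, d) matrix whose i-th row is dL_i/dw."""
    residuals = X @ w - t          # (n,)
    return residuals[:, None] * X  # (n, d)

def gradient_correlations(G):
    """Cosine similarity between every pair of per-example gradients.
    Near-zero off-diagonal entries mean 'random', conflicting updates."""
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    G_unit = G / np.clip(norms, 1e-12, None)
    return G_unit @ G_unit.T       # (n, n), diagonal is 1

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))        # 8 examples, 4 features
t = rng.normal(size=8)
w = rng.normal(size=4)

G = per_example_gradients(w, X, t)
C = gradient_correlations(G)
mean_abs_value = np.abs(G).mean()                   # gradient-value statistic
mean_corr = C[np.triu_indices(8, k=1)].mean()       # gradient-correlation statistic
```

For a deep network one would replace the analytic gradient with per-example backpropagated gradients; the two summary statistics are computed the same way.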
Unlike other NAS approaches that evaluate architectures via full training, which takes massive computation resources, GT-NAS only relies on a few examples for kernel calculation. We first select the top-k architectures with the largest kernels as candidates, which are then trained to select the best one based on validation performance. In practice, k is usually set to a very small value. Experiments show that GT-NAS achieves competitive results with extremely fast search on NAS-Bench-201 (Ying et al., 2019), a NAS benchmark dataset. The low search cost allows us to apply NAS to diverse tasks. Specifically, the structure searched by the proposed policy outperforms the naive baseline without NAS on 4 datasets covering image classification and text classification. The main contributions are summarized as follows:

• We show that both gradient correlations and gradient values matter in optimization, which gives a new insight for understanding and developing optimization techniques.

• Following this finding, we propose a gradient-based kernel approach for NAS, which is able to search for well-performing architectures efficiently.

• Experiments show that GT-NAS achieves competitive results while being orders of magnitude faster than the naive baseline.
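The selection policy described above can be sketched in a few lines. This is a schematic numpy illustration, not the authors' implementation: `grad_fn(arch, batch)` is a hypothetical hook that returns the per-example gradient matrix of an architecture at initialization, and the kernel is the average of the gradient dot-product matrix as defined in the introduction.

```python
import numpy as np

def kernel_score(G):
    """Gradient-based kernel at initialization.  G is an (n, p) matrix whose
    i-th row is the gradient on example i; the score is the mean entry of
    the Gram matrix G @ G.T, i.e. the average pairwise gradient dot product.
    It grows with both gradient magnitude and gradient correlation."""
    return (G @ G.T).mean()

def select_top_k(grad_fn, archs, batch, k=2):
    """Rank candidate architectures by kernel score at initialization and
    keep the top-k; only these candidates are then fully trained and the
    best one is picked on validation performance."""
    scores = np.array([kernel_score(grad_fn(a, batch)) for a in archs])
    top = np.argsort(scores)[::-1][:k]
    return [archs[i] for i in top]
```

Because `kernel_score` needs gradients from only a few examples and no training steps, ranking a large candidate pool is cheap compared with the "train-then-test" paradigm.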

2. RELATED WORK

Understanding and improving optimization Understanding and improving the optimization of deep networks has long been a hot research topic. Bengio et al. (1994) find the vanishing and exploding gradient problem in neural network training. To address this problem, a lot of promising approaches have been proposed in recent years, which can be classified into four research lines, including activation-based approaches (Hendrycks & Gimpel, 2016; Klambauer et al., 2017), whose motivation is to avoid unsteady derivatives of the activation with respect to the inputs. The fourth line focuses on gradient clipping (Pascanu et al., 2013).

Gradient-based Kernel Our work is also related to gradient-based kernels (Advani & Saxe, 2017; Jacot et al., 2018). NTK (Jacot et al., 2018) is a popular gradient-based kernel, defined as the Gram matrix of gradients. It is proposed to analyze a model's convergence and generalization. Following these studies, many researchers are devoted to understanding current networks from the perspective of NTK (Lee et al., 2019; Hastie et al., 2019; Allen-Zhu et al., 2019; Arora et al., 2019). Du et al. (2019b) use the Gram matrix of gradients to prove that, for a shallow neural network with m hidden nodes, gradient descent converges to a globally optimal solution when m is sufficiently large.
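For reference, the Gram-matrix-of-gradients view can be written out explicitly. For a network f with parameters θ evaluated on n examples, the NTK entry for a pair of examples and the averaged kernel used in this work (the mean of the gradient dot-product matrix, as defined in the introduction) are:

```latex
\Theta(x_i, x_j) \;=\; \nabla_\theta f(x_i;\theta)^{\top}\, \nabla_\theta f(x_j;\theta),
\qquad \Theta \in \mathbb{R}^{n \times n},
\qquad
\kappa \;=\; \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \Theta(x_i, x_j).
```

Diagonal entries of Θ capture gradient magnitudes, while off-diagonal entries capture cross-example gradient alignment, which is why the averaged kernel κ reflects both gradient values and gradient correlations.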



¹ We will release the code on publication.




