A GRADIENT-BASED KERNEL APPROACH FOR EFFICIENT NETWORK ARCHITECTURE SEARCH

Abstract

It is widely accepted that vanishing and exploding gradient values are the main reason behind the difficulty of training deep networks. In this work, we take a further step toward understanding the optimization of deep networks and find that both gradient correlations and gradient values have strong impacts on model training. Inspired by this new finding, we explore a simple yet effective network architecture search (NAS) approach that leverages gradient correlations and gradient values to find well-performing architectures. Specifically, we first formulate these two terms into a unified gradient-based kernel and then select the architectures with the largest kernels at initialization as the final networks. The new approach replaces the expensive "train-then-test" evaluation paradigm with a lightweight scoring function based on the gradient-based kernel at initialization. Experiments show that our approach achieves competitive results on image classification tasks while being orders of magnitude faster than "train-then-test" paradigms. Furthermore, the extremely low search cost enables wide applications: our approach also obtains performance improvements on two text classification tasks.¹

1. INTRODUCTION

Understanding and improving the optimization of deep networks has been an active field of artificial intelligence. One of the mysteries in deep learning is why extremely deep neural networks are hard to train. Currently, the vanishing and exploding gradient problem is widely believed to be the main answer (Bengio et al., 1994; Hochreiter & Schmidhuber, 1997; Pascanu et al., 2013): gradients exponentially decrease (or increase) from the top layer to the bottom layer in a multilayer network. By extension, the vanishing and exploding gradient problem now also refers to extremely small (or large) gradient values in general. Following this explanation, several widely-used training techniques have been proposed to assist optimization. These studies can be roughly classified into four categories: initialization-based approaches (Glorot & Bengio, 2010; He et al., 2015; Zhang et al., 2019), activation-based approaches (Hendrycks & Gimpel, 2016; Klambauer et al., 2017), normalization-based approaches (Ioffe & Szegedy, 2015; Salimans & Kingma, 2016; Ulyanov et al., 2016; Lei Ba et al., 2016; Nguyen & Salazar, 2019), and skip-connection-based approaches (He et al., 2016; Huang et al., 2017a). These techniques provide good alternatives to stabilize gradients and bring promising performance improvements.

Despite much evidence showing the importance of steady gradient values, we are still curious about whether there are unexplored but important factors. To answer this question, we conduct extensive experiments and find some surprising phenomena. First, the accuracy curves of some models converge well even though the gradient vanishing and exploding problem remains. Second, we find several cases where models with similar gradient values show different convergence behavior. These results indicate that gradient values may not be as vital as we expect, and that some hidden factors play a significant role in optimization.
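The exponential decay of gradients across layers can be seen in a small numerical sketch (ours, not from the paper): in a deep linear network, the backpropagated gradient at a layer is a product of the Jacobians of all later layers, so its norm shrinks (or grows) exponentially with depth depending on the weight scale. The network size and initialization scale below are illustrative choices.

```python
import numpy as np

# Illustrative sketch: backpropagate a gradient through a stack of
# randomly initialized linear layers and record its norm at each depth.
rng = np.random.default_rng(0)
depth, width = 20, 64
scale = 0.5 / np.sqrt(width)  # deliberately small init -> vanishing gradients
Ws = [scale * rng.standard_normal((width, width)) for _ in range(depth)]

g = np.ones(width)            # stand-in for the top-layer gradient dL/dh_L
norms = [np.linalg.norm(g)]
for W in reversed(Ws):        # chain rule: multiply by each layer's Jacobian
    g = W.T @ g
    norms.append(np.linalg.norm(g))

print(f"top-layer grad norm:    {norms[0]:.3e}")
print(f"bottom-layer grad norm: {norms[-1]:.3e}")
```

With a scale of roughly `2 / sqrt(width)` instead, the same loop exhibits exploding rather than vanishing gradient norms.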
To uncover these hidden factors, we explore other gradient features, such as the covariance matrix, gradient correlations, and gradient variance. These features evaluate gradients from different perspectives. The covariance matrix measures the similarity between the gradients of any two parameters and is widely used in optimization studies (Zhao & Zhang, 2015; Faghri et al., 2020). Gradient correlation is defined to measure the similarity between the gradients of any two examples. Among these features, experiments show that gradient correlation is an important factor. Compared to widely-explored

¹We will release the code on publication.
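The notion of gradient correlation described above can be sketched concretely. The snippet below (our illustration; the paper's exact kernel is defined later) computes per-example gradients for a simple squared-loss linear model and then their pairwise cosine similarities, i.e. a gradient correlation matrix. All shapes and the model choice are assumptions made for the example.

```python
import numpy as np

# Per-example gradients of the loss 0.5 * (w.x_i - y_i)^2 w.r.t. w are
# g_i = (w.x_i - y_i) * x_i; gradient correlation compares g_i and g_j.
rng = np.random.default_rng(0)
n, d = 4, 10
X = rng.standard_normal((n, d))   # n examples, d features
y = rng.standard_normal(n)
w = rng.standard_normal(d)        # parameters at initialization

residual = X @ w - y
G = residual[:, None] * X         # shape (n, d): one gradient per row

# Normalized Gram matrix of per-example gradients:
# C[i, j] is the cosine similarity between grad_i and grad_j.
norms = np.linalg.norm(G, axis=1, keepdims=True)
C = (G / norms) @ (G / norms).T
print(np.round(C, 3))
```

The unnormalized Gram matrix `G @ G.T` is the analogous quantity built from raw gradient values; combining the two perspectives mirrors the paper's idea of a unified gradient-based kernel.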

