RSO: A GRADIENT-FREE SAMPLING-BASED APPROACH FOR TRAINING DEEP NEURAL NETWORKS

Abstract

We propose RSO (random search optimization), a gradient-free, sampling-based approach for training deep neural networks. RSO adds a perturbation to a weight in a deep neural network and tests whether it reduces the loss on a mini-batch. If the loss decreases, the weight is updated; otherwise the existing weight is retained. Surprisingly, we find that repeating this process a few times for each weight is sufficient to train a deep neural network. The number of weight updates for RSO is an order of magnitude smaller than for backpropagation with SGD. RSO can make aggressive weight updates in each step because there is no concept of a learningning rate, and the weight update step for individual layers is not coupled with the magnitude of the loss. RSO is evaluated on classification tasks on the MNIST and CIFAR-10 datasets with deep neural networks of 6 to 10 layers, where it achieves accuracies of 99.1% and 81.8% respectively. We also find that after updating the weights just 5 times, the algorithm obtains a classification accuracy of 98% on MNIST.

1. INTRODUCTION

Deep neural networks solve a variety of problems by using multiple layers to progressively extract higher-level features from the raw input. The commonly adopted method for training deep neural networks is backpropagation (Rumelhart et al. (1985)), which has been around for the past 35 years. Backpropagation assumes that the function is differentiable and uses the partial derivative with respect to each weight w_i to minimize the function f(x, w) as follows: w_{i+1} = w_i - η ∇_{w_i} f(x, w), where η is the learning rate. The method is efficient because it makes a single functional estimate to update all the weights of the network. However, the partial derivative for some weight w_j, where j ≠ i, would change once w_i is updated, yet this change is not factored into the weight update rule for w_j. Moreover, it may not even be optimal for all weights to move in the direction obtained from the gradients in the previous layer. Although deep neural networks are non-convex (and the weight update rule measures approximate gradients), this update rule works surprisingly well in practice. To explain this observation, recent literature (Du et al. (2019); Li & Liang (2018)) argues that because the network is over-parametrized, the initial set of weights is very close to the final solution, and even a little nudging around the initialization point using gradient descent leads to a very good solution. We take this argument to another extreme: instead of using gradient-based optimizers, which provide strong direction and magnitude signals for updating the weights, we explore the region around the initialization point by sampling weight changes to minimize the objective function. Formally, our weight update rule is

    w_{i+1} = w_i,           if f(x, w_i) <= f(x, w_i + ∆w_i)
    w_{i+1} = w_i + ∆w_i,    if f(x, w_i) >  f(x, w_i + ∆w_i)

where ∆w_i is the weight change hypothesis.
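The piecewise rule above can be sketched for a single scalar weight as follows; this is an illustrative sketch, where the perturbation scale `sigma` and the toy `loss_fn` are assumptions for exposition, not values from the paper:

```python
import random

def rso_update(w_i, loss_fn, sigma=0.1):
    """One RSO step for a single scalar weight: sample a Gaussian
    perturbation dw and keep it only if it strictly lowers the loss;
    otherwise the existing weight is retained."""
    dw = random.gauss(0.0, sigma)
    if loss_fn(w_i + dw) < loss_fn(w_i):
        return w_i + dw  # perturbation reduced the loss: accept it
    return w_i           # loss did not decrease: keep the old weight
```

By construction the loss is non-increasing across steps, which is why no learning rate is needed: the accepted step size is whatever sampled perturbation happened to help.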
Here, we explicitly test the region around the initial set of weights by computing the function and update a weight only if the update minimizes the loss; see Fig. 1.

Figure 1: Gradient descent vs. sampling. In gradient descent we estimate the gradient at a given point and take a small step in the opposite direction of the gradient. In contrast, for sampling-based methods we explicitly compute the function at different points and then choose the point where the function is minimum.

Surprisingly, our experiments demonstrate that the above update rule requires fewer weight updates than backpropagation to find good minimizers for deep neural networks, strongly suggesting that merely exploring regions around randomly initialized networks is sufficient, even without explicit gradient computation. We evaluate this weight update scheme (called RSO, for random search optimization) on classification datasets like MNIST and CIFAR-10 with deep convolutional neural networks (6-10 layers) and obtain competitive accuracy numbers: RSO obtains 99.1% accuracy on MNIST and 81.8% accuracy on CIFAR-10 using just the random search optimization algorithm. We do not use any other optimizer, even for the final classification layer. Although RSO is computationally expensive (it requires a number of loss evaluations that is linear in the number of network parameters), our hope is that as we develop better intuition about the structural properties of deep neural networks, we will be able to accelerate RSO (using Hebbian principles, Gabor filters, or depth-wise convolutions). If the number of trainable parameters can be reduced drastically (Frankle et al. (2020)), search-based methods could become a viable alternative to backpropagation.
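To make the linear-in-parameters cost concrete, the following self-contained sketch applies the RSO update rule to a toy least-squares problem, cycling over every parameter and accepting a sampled Gaussian perturbation only when it lowers the loss. The toy data, the number of passes, and the perturbation scale are assumptions for illustration; the paper applies the same rule to the weights of convolutional networks on mini-batches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "mini-batch": a linear regression problem standing in for the
# mini-batch loss of a deep network (illustrative data, not from the paper).
X = rng.normal(size=(64, 5))
w_true = rng.normal(size=5)
y = X @ w_true

def loss(w):
    return float(np.mean((X @ w - y) ** 2))

w = rng.normal(size=5)   # random initialization
w_init = w.copy()
sigma = 0.5              # scale of sampled weight changes (an assumption)

# RSO: cycle through every weight a few times; each candidate update costs
# one extra loss evaluation, so total work is linear in #parameters.
for _ in range(30):
    for j in range(w.size):
        cand = w.copy()
        cand[j] += rng.normal(0.0, sigma)
        if loss(cand) < loss(w):
            w = cand     # aggressive update, no learning rate involved
```

Note that the loss can only decrease across iterations, and the per-weight update magnitude is set by the sampling distribution rather than by the gradient magnitude of any layer.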
Furthermore, since the architectural innovations of the past decade use backpropagation by default, a different optimization algorithm could potentially lead to a different class of architectures, because the minimizers of an objective function reached via different greedy optimizers could themselves be different.

2. RELATED WORK

Multiple optimization techniques have been proposed for training deep neural networks. When gradient-based methods were believed to get stuck in local minima under random initialization, layer-wise training was popular for optimizing deep neural networks (Hinton et al. (2006); Bengio et al. (2007)) using contrastive methods (Hinton (2002)). In a similar spirit, recent work on Greedy InfoMax by Löwe et al. (2019) maximizes mutual information between adjacent layers instead of training a network end to end. Taylor et al. (2016) find the weights of each layer independently by solving a sequence of optimization problems that can be solved globally in closed form. Weight-perturbation methods (Werfel et al. (2004)) have been used for approximate gradient estimation in situations where gradient computation is expensive. However, these training methods do not generalize to deep neural networks with more than 2-3 layers, and it has not been shown that their performance increases as the network gets deeper. Hence, backpropagation with SGD or other gradient-based optimizers (Duchi et al. (2011); Sutskever et al. (2013); Kingma & Ba (2014)) is commonly used for optimizing deep neural networks. Recently, multiple works have proposed that because these networks are heavily over-parametrized, the initial set of random filters is already close to the final solution and gradient-based optimizers only nudge the parameters to obtain the final solution (Du et al. (2019); Li & Liang (2018)). For example, training only the batch-norm parameters while keeping the random filters fixed can obtain very good results with heavily parametrized, very deep neural networks (> 800 layers), as shown in Frankle et al. (2020). It was also shown by Ramanujan et al. (2020) that networks can be trained by just masking out some weights without modifying the original set of weights - although one can argue that

