RSO: A GRADIENT FREE SAMPLING BASED APPROACH FOR TRAINING DEEP NEURAL NETWORKS

Abstract

We propose RSO (random search optimization), a gradient-free, sampling-based approach for training deep neural networks. To this end, RSO adds a perturbation to a weight in a deep neural network and tests whether it reduces the loss on a mini-batch; if it does, the weight is updated, otherwise the existing weight is retained. Surprisingly, we find that repeating this process a few times for each weight is sufficient to train a deep neural network. The number of weight updates for RSO is an order of magnitude smaller than for backpropagation with SGD. RSO can make aggressive weight updates in each step as there is no concept of learning rate. The weight update step for individual layers is also not coupled with the magnitude of the loss. RSO is evaluated on classification tasks on the MNIST and CIFAR-10 datasets with deep neural networks of 6 to 10 layers, where it achieves accuracies of 99.1% and 81.8% respectively. We also find that after updating the weights just 5 times, the algorithm obtains a classification accuracy of 98% on MNIST.

1. INTRODUCTION

Deep neural networks solve a variety of problems using multiple layers to progressively extract higher-level features from the raw input. The commonly adopted method to train deep neural networks is backpropagation (Rumelhart et al. (1985)), which has been around for the past 35 years. Backpropagation assumes that the function is differentiable and leverages the partial derivative w.r.t. the weight w_i for minimizing the function f(x, w) as follows: w_{i+1} = w_i - η ∂f(x, w)/∂w_i, where η is the learning rate. The method is efficient because it makes a single functional estimate to update all the weights of the network. However, the partial derivative for some weight w_j, where j ≠ i, would change once w_i is updated, yet this change is not factored into the weight update rule for w_j. Moreover, it may not even be optimal for all weights to move in the same direction as obtained from the gradients in the previous layer. Although deep neural networks are non-convex (and the weight update rule measures approximate gradients), this update rule works surprisingly well in practice. To explain this observation, recent literature (Du et al. (2019); Li & Liang (2018)) argues that because the network is over-parametrized, the initial set of weights is very close to the final solution, and even a little bit of nudging around the initialization point using gradient descent leads to a very good solution. We take this argument to the other extreme: instead of using gradient-based optimizers, which provide strong direction and magnitude signals for updating the weights, we explore the region around the initialization point by sampling weight changes to minimize the objective function. Formally, our weight update rule is

w_{i+1} = w_i,          if f(x, w_i) <= f(x, w_i + Δw_i)
w_{i+1} = w_i + Δw_i,   if f(x, w_i) > f(x, w_i + Δw_i)

where Δw_i is the weight change hypothesis. Here, we explicitly test the region around the initial set of weights by computing the function and update a weight only if it reduces the loss, see Fig. 1.
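The update rule above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the toy least-squares loss, the linear model, the Gaussian perturbation scale, and the number of sweeps are all assumptions made for demonstration; the idea carried over from the text is only the accept/reject test against the loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, x, y):
    # Toy least-squares loss for a linear model; stands in for the
    # mini-batch loss of a deep network (an assumption for illustration).
    return float(np.mean((x @ w - y) ** 2))

def rso_sweep(w, x, y, scale=0.1):
    # One RSO sweep: propose a random perturbation for each weight in turn
    # and accept it only if the loss on the batch strictly decreases.
    for i in range(w.size):
        candidate = w.copy()
        candidate[i] += rng.normal(0.0, scale)
        if loss(candidate, x, y) < loss(w, x, y):
            w = candidate  # keep the perturbed weight
        # otherwise the existing weight is retained
    return w

# Usage: fit a 3-weight linear model by repeated RSO sweeps.
x = rng.normal(size=(64, 3))
y = x @ np.array([1.0, -2.0, 0.5])
w = rng.normal(size=3)
w0 = w.copy()
for _ in range(200):
    w = rso_sweep(w, x, y)
print(loss(w0, x, y), "->", loss(w, x, y))
```

Because a perturbation is only ever accepted when it lowers the loss, the loss is non-increasing across sweeps by construction, with no learning rate involved.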

