CHANNEL-DIRECTED GRADIENTS FOR OPTIMIZATION OF CONVOLUTIONAL NEURAL NETWORKS

Abstract

We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error. The method requires only simple processing of existing stochastic gradients, can be used in conjunction with any optimizer, and has only a linear overhead (in the number of parameters) compared to computation of the stochastic gradient. The method works by computing the gradient of the loss function with respect to output-channel directed re-weighted H^0 or Sobolev metrics, which has the effect of smoothing components of the gradient along a chosen direction of the parameter tensor. We show that defining the gradients along the output-channel direction leads to a performance boost, while other directions can be detrimental. We present the continuum theory of such gradients, their discretization, and their application to deep networks. Experiments on benchmark datasets, several networks, and baseline optimizers show that optimizers can be improved in generalization error simply by computing the stochastic gradient with respect to output-channel directed metrics.

1. INTRODUCTION

Stochastic gradient descent (SGD) is currently the dominant algorithm for optimizing large-scale convolutional neural networks (CNNs) (LeCun et al. (1998); Simonyan & Zisserman (2014); He et al. (2016b)). Although there has been considerable activity in optimization methods seeking to improve performance, SGD still dominates in terms of its generalization ability. Despite SGD's dominance, there is often a gap between training and real-world test accuracy, which motivates research into improved optimization methods.

In this paper, we derive new optimization methods that are simple modifications of SGD. The methods implicitly induce correlation along the output direction of parameter tensors in CNNs. This is based on the empirical observation that parameter tensors in trained networks typically exhibit correlation along the output-channel dimension (see Figure 1). We thus explore encoding this correlation by constructing gradients that are smooth in the output direction, which we show improves generalization accuracy. This is done by introducing new Riemannian metrics on the parameter tensors, which change the underlying geometry of the space of tensors, and reformulating the gradient with respect to those metrics.

Our contributions are as follows. First, we formulate output channel-directed Riemannian metrics (a re-weighted version of the standard L^2 metric and a Sobolev metric) over the space of parameter tensors. This encodes channel-directed correlation in the gradient optimization without changing the loss. Second, we compute Riemannian gradients with respect to these metrics, showing linear complexity (in the number of parameters) over standard gradient computation, and thus derive new optimization methods for CNN training.
Finally, we apply the methodology to training CNNs and show an empirical advantage in generalization accuracy, especially at small batch sizes, over standard optimizers (SGD, Adam) across numerous applications (image classification, semantic segmentation, generative adversarial networks) via simple modification of existing optimizers.
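To make the idea concrete, the following is a minimal sketch (not the paper's implementation) of a channel-directed Sobolev gradient: the stochastic gradient of a convolutional weight tensor is smoothed along the output-channel axis by solving (I + λL) g_s = g, where L is a discrete 1-D Laplacian along that axis. The function name, the choice of Neumann boundary conditions, and the value of λ are illustrative assumptions.

```python
import numpy as np

def channel_directed_sobolev_gradient(grad, lam=1.0):
    """Sketch: smooth grad along the output-channel axis (axis 0) by
    solving (I + lam * L) g_s = g, where L is the discrete 1-D Laplacian
    with Neumann (zero-flux) boundary conditions. lam = 0 recovers the
    ordinary (H^0 / L^2) gradient."""
    c_out = grad.shape[0]
    # Tridiagonal Laplacian along the output-channel direction.
    L = 2.0 * np.eye(c_out)
    L[0, 0] = L[-1, -1] = 1.0  # Neumann boundaries
    idx = np.arange(c_out - 1)
    L[idx, idx + 1] = -1.0
    L[idx + 1, idx] = -1.0
    A = np.eye(c_out) + lam * L
    # Flatten all non-channel dimensions; one linear solve is shared
    # across every (C_in, k, k) location, so the overhead is modest.
    flat = grad.reshape(c_out, -1)
    smoothed = np.linalg.solve(A, flat)
    return smoothed.reshape(grad.shape)

# Usage: smooth a noisy conv-weight gradient of shape (C_out, C_in, k, k).
rng = np.random.default_rng(0)
g = rng.standard_normal((64, 32, 3, 3))
g_s = channel_directed_sobolev_gradient(g, lam=2.0)
```

In practice this transformation would be applied to each convolutional layer's gradient before the optimizer's update step; since the per-layer solve involves only a C_out × C_out system, the cost stays linear in the number of parameters.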

1.1. RELATED WORK

We discuss related work in deep network optimization; for a detailed survey, see Bottou et al. (2018). SGD, e.g., Bottou (2012), samples a batch of data to tractably estimate the gradient of the loss function. As the stochastic gradient is a noisy estimate of the full gradient, learning rates must follow

