CHANNEL-DIRECTED GRADIENTS FOR OPTIMIZATION OF CONVOLUTIONAL NEURAL NETWORKS

Abstract

We introduce optimization methods for convolutional neural networks that improve the generalization error of existing gradient-based optimizers. The method requires only simple processing of existing stochastic gradients, can be used in conjunction with any optimizer, and has only linear overhead (in the number of parameters) compared to computing the stochastic gradient. It works by computing the gradient of the loss function with respect to output-channel directed re-weighted H^0 or Sobolev metrics, which has the effect of smoothing components of the gradient along a given direction of the parameter tensor. We show that defining the gradients along the output-channel direction leads to a performance boost, while other directions can be detrimental. We present the continuum theory of such gradients, its discretization, and its application to deep networks. Experiments on benchmark datasets, several networks, and baseline optimizers show that optimizers can be improved in generalization error simply by computing the stochastic gradient with respect to output-channel directed metrics.

1. INTRODUCTION

Stochastic gradient descent (SGD) is currently the dominant algorithm for optimizing large-scale convolutional neural networks (CNNs) (LeCun et al. (1998); Simonyan & Zisserman (2014); He et al. (2016b)). Although there has been much activity in optimization methods seeking to improve performance, SGD still dominates in terms of its generalization ability. Despite SGD's dominance, there is often still a gap between training and real-world test accuracy, which motivates research into improved optimization methods. In this paper, we derive new optimization methods that are simple modifications of SGD. The methods implicitly induce correlation in the output direction of parameter tensors in CNNs. This is based on the empirical observation that parameter tensors in trained networks typically exhibit correlation over the output-channel dimension (see Figure 1). We thus explore encoding this correlation by constructing gradients that are smooth in the output direction, which we show improves generalization accuracy. This is done by introducing new Riemannian metrics on the parameter tensors, which change the underlying geometry of the space of tensors, and reformulating the gradient with respect to those metrics. Our contributions are as follows. First, we formulate output channel-directed Riemannian metrics (a re-weighted version of the standard L^2 metric and a Sobolev metric) over the space of parameter tensors. This encodes channel-directed correlation in the gradient optimization without changing the loss. Second, we compute Riemannian gradients with respect to these metrics, showing linear complexity (in the number of parameters) over standard gradient computation, and thus derive new optimization methods for CNN training.
Finally, we apply the methodology to training CNNs and show an empirical advantage in generalization accuracy, especially at small batch sizes, over standard optimizers (SGD, Adam) on numerous applications (image classification, semantic segmentation, generative adversarial networks), with simple modifications of existing optimizers.
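To make the re-weighted metric idea concrete, the following is a minimal NumPy sketch of one plausible channel-directed gradient: the gradient of a convolutional weight tensor is decomposed along the output-channel axis into its channel mean and a residual, and the residual is down-weighted, yielding a gradient whose components are more correlated across output channels. The function name, the specific mean/residual decomposition, and the weight lam are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np

def channel_directed_h0_gradient(grad, lam=0.1):
    """Sketch of a re-weighted H^0 gradient (illustrative form).

    grad has shape (C_out, C_in, k, k). The mean over the output-channel
    axis is kept at full weight, while the per-channel residual is scaled
    by lam < 1, inducing correlation across output channels.
    """
    mean = grad.mean(axis=0, keepdims=True)  # average over output channels
    return mean + lam * (grad - mean)        # down-weighted residual

rng = np.random.default_rng(0)
g = rng.standard_normal((64, 32, 3, 3))      # gradient of a conv weight
g_tilde = channel_directed_h0_gradient(g, lam=0.1)
```

Note the operation leaves the channel mean of the gradient untouched and shrinks only the variation across output channels, so it changes the descent direction without changing the loss being optimized.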

1.1. RELATED WORK

We discuss related work in deep network optimization; for a detailed survey, see Bottou et al. (2018). SGD, e.g., Bottou (2012), samples a batch of data to tractably estimate the gradient of the loss function. As the stochastic gradient is a noisy estimate of the gradient, learning rates must follow a decaying schedule to ensure convergence. Adaptive methods, e.g., Adam (Kingma & Ba (2014)), adjust the learning rate so that parameters that have changed infrequently, based on historical gradients, are updated more quickly than parameters that have changed frequently. Another way to interpret such methods is that they change the underlying metric on the space on which the loss function is defined to an isotropically scaled version of the L^2 metric, given by a simple diagonal matrix; we change the metric anisotropically. We show that our method can be used in conjunction with such methods, boosting performance by simply using the stochastic gradient computed with our metrics. As the stochastic gradient is computed from samples, different runs of the algorithm can converge to different local optima. To reduce the variance, several methods have been formulated, e.g., Defazio et al. (2014); Johnson & Zhang (2013). We are not motivated by variance reduction, but rather by inducing correlation in the parameter tensor to improve generalization. However, as our method smooths the gradient, our experiments show reduced variance with our metrics compared to SGD. Natural gradient methods also relate tangentially to our work. These works are based on Amari's (Amari (1998)) information geometry on probability measures, and the metric considered is the Fisher information metric. The motivation for these methods is re-parametrization invariance of optimization, whereas our motivation is imposing correlation in the parameter tensor.



Figure 1: Visualization of parameter tensors of convolutional layers trained on ImageNet. Frequently within layers (especially deeper layers), the weights are correlated along the output-channel direction. The table shows output-direction correlation (invariant to re-scaling) relative to the input direction. Our method induces parameter correlations in the output direction.

Another method motivated by variance reduction is Osher et al. (2018) (see applications in Wang et al. (2019); Liang et al. (2020); Wang et al. (2020)), where the stochastic gradient is pre-multiplied by an inverse Laplacian smoothing matrix. For CNNs, the gradient with respect to the parameters is rasterized in row or column order of network filters before smoothing. Our work is inspired by Osher et al. (2018), though we are motivated by correlation in the parameter tensor. Osher et al. (2018) can be interpreted as using the gradient of the loss with respect to a Sobolev metric. A first insight over Osher et al. (2018) is that keeping the structure of the parameter tensor and defining the Sobolev metric along the output-channel direction boosts accuracy, while other directions do not. Second, we introduce a re-weighted H^0 metric that preferentially treats the output-channel direction; it can be implemented in a line of PyTorch code, has linear (in parameter size) complexity, and performs comparably (in many cases) to our channel-directed Sobolev metric, boosting the accuracy of SGD. Third, our Sobolev gradient, a variant of the ordinary one, has linear rather than quasi-linear complexity (not requiring the FFT, as Osher et al. (2018) does). Sobolev gradients have been used in computer vision (Sundaramoorthi et al. (2007); Charpiat et al. (2007)) for their coarse-to-fine evolution (Sundaramoorthi et al. (2008)); we adapt that formulation to CNNs. We formulate Sobolev gradients by considering the space of parameter tensors as a Riemannian manifold and choosing a Sobolev metric on the tangent space. By choosing a metric, gradients intrinsic to the manifold can be computed, and gradient flows are guaranteed to decrease the loss. Other Riemannian metrics have been used for optimization in neural networks, e.g., Amari (1998); Marceau-Caron & Ollivier (2016); Hoffman et al. (2013); Gunasekar et al.
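The linear-complexity Sobolev smoothing can be sketched as solving a tridiagonal system along the output-channel axis, in contrast to FFT-based smoothing of a rasterized gradient. Below is an illustrative NumPy reconstruction, assuming the smoothed gradient g_s solves (I + lam*L) g_s = g fiber-wise, where L is the 1-D Neumann Laplacian over output channels; the Thomas algorithm gives an O(n) solve per fiber. The boundary conditions, function name, and lam are assumptions, not the authors' exact discretization.

```python
import numpy as np

def sobolev_channel_gradient(grad, lam=1.0):
    """Solve (I + lam*L) g_s = grad along axis 0 (output channels).

    L is the 1-D Laplacian with Neumann boundaries, so the tridiagonal
    system is solved in linear time with the Thomas algorithm; all other
    tensor dimensions are treated as independent fibers.
    """
    n = grad.shape[0]
    g = grad.reshape(n, -1)
    # Tridiagonal coefficients of I + lam*L (Neumann boundary conditions)
    b = np.full(n, 1.0 + 2.0 * lam)          # main diagonal
    b[0] = b[-1] = 1.0 + lam
    a = np.full(n, -lam)                     # sub-diagonal (a[0] unused)
    c = np.full(n, -lam)                     # super-diagonal (c[-1] unused)
    # Forward sweep
    cp = np.empty(n)
    dp = np.empty_like(g)
    cp[0] = c[0] / b[0]
    dp[0] = g[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (g[i] - a[i] * dp[i - 1]) / m
    # Back substitution
    x = np.empty_like(g)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x.reshape(grad.shape)

rng = np.random.default_rng(1)
g = rng.standard_normal((16, 3, 2, 2))       # gradient of a small conv weight
g_s = sobolev_channel_gradient(g, lam=2.0)
```

With Neumann boundaries, each row and column of I + lam*L sums to one, so the smoothing preserves the channel sum of the gradient and leaves channel-constant gradients unchanged, while damping high-frequency variation across output channels.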

