CHANNEL-DIRECTED GRADIENTS FOR OPTIMIZATION OF CONVOLUTIONAL NEURAL NETWORKS

Abstract

We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error. The method requires only simple processing of existing stochastic gradients, can be used in conjunction with any optimizer, and has only a linear overhead (in the number of parameters) compared to computation of the stochastic gradient. The method works by computing the gradient of the loss function with respect to output-channel directed re-weighted H 0 or Sobolev metrics, which has the effect of smoothing components of the gradient across a certain direction of the parameter tensor. We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental. We present the continuum theory of such gradients, its discretization, and application to deep networks. Experiments on benchmark datasets, several networks, and baseline optimizers show that optimizers can be improved in generalization error by simply computing the stochastic gradient with respect to output-channel directed metrics.

1. INTRODUCTION

Stochastic gradient descent (SGD) is currently the dominant algorithm for optimizing large-scale convolutional neural networks (CNNs) (LeCun et al. (1998); Simonyan & Zisserman (2014); He et al. (2016b)). Although there has been considerable activity in optimization methods seeking to improve performance, SGD still dominates in terms of its generalization ability. Despite SGD's dominance, there is still often a gap between training and real-world test accuracy, which motivates research in improved optimization methods. In this paper, we derive new optimization methods that are simple modifications of SGD. The methods implicitly induce correlation in the output direction of parameter tensors in CNNs. This is based on the empirical observation that parameter tensors in trained networks typically exhibit correlation over the output channel dimension (see Figure 1). We thus explore encoding correlation by constructing smooth gradients in the output direction, which we show improves generalization accuracy. This is done by introducing new Riemannian metrics on the parameter tensors, which changes the underlying geometry of the space of tensors, and reformulating the gradient with respect to those metrics. Our contributions are as follows. First, we formulate output channel-directed Riemannian metrics (a re-weighted version of the standard $L^2$ metric and another that is a Sobolev metric) over the space of parameter tensors. This encodes channel-directed correlation in the gradient optimization without changing the loss. Second, we compute Riemannian gradients with respect to the metrics, showing linear complexity (in the number of parameters) over standard gradient computation, and thus derive new optimization methods for CNN training.
Finally, we apply the methodology to training CNNs and show the empirical advantage in generalization accuracy, especially with small batch sizes, over standard optimizers (SGD, Adam) on numerous applications (image classification, semantic segmentation, generative adversarial networks) with simple modification of existing optimizers.

1.1. RELATED WORK

We discuss related work in deep network optimization; for a detailed survey, see Bottou et al. (2018). SGD, e.g., Bottou (2012), samples a batch of data to tractably estimate the gradient of the loss function. As the stochastic gradient is a noisy version of the gradient, learning rates must follow a decay schedule in order to converge. Many methods have been formulated to choose the learning rate over epochs and components of the gradient, including adaptive learning rates (e.g., Duchi et al. (2011); Zeiler (2012); Kingma & Ba (2014); Bengio (2015); Loshchilov & Hutter (2017); Luo et al. (2019)). For instance, Adam (Kingma & Ba (2014)) adaptively adjusts the learning rate so that parameters that have changed infrequently based on historical gradients are updated more quickly than parameters that have changed frequently. Another way to interpret such methods is that they change the underlying metric on the space on which the loss function is defined to an isotropically scaled version of the $L^2$ metric given by a simple diagonal matrix; we change the metrics anisotropically. We show that our method can be used in conjunction with such methods by simply using the stochastic gradient computed with our metrics to boost performance. As the stochastic gradient is computed based on sampling, different runs of the algorithm can result in different local optima. To reduce the variance, several methods have been formulated, e.g., Defazio et al. (2014); Johnson & Zhang (2013). We are not motivated by variance reduction, but rather by inducing correlation in the parameter tensor to improve generalization. However, as our method smooths the gradient, our experiments show reduced variance with our metrics compared to SGD.

Architectures                 AlexNet   VGG-16   DenseNet   ResNet-50
Input Channel Correlation     0.0057    0.0082   0.0029     0.0047
Output Channel Correlation    0.0267    0.0178   0.0116     0.0077

Figure 1: Visualization of parameter tensors of convolutional layers trained on ImageNet. Frequently within layers (especially deeper layers), there is correlation of the weights along the output channel direction. The table shows output correlation (invariant to re-scaling) relative to the input direction. Our method induces parameter correlations in the output direction.
Another method motivated by variance reduction is Osher et al. (2018) (see applications Wang et al. (2019) ; Liang et al. (2020) ; Wang et al. (2020) ), where the stochastic gradient is pre-multiplied with an inverse Laplacian smoothing matrix. For CNNs, the gradient with respect to parameters is rasterized in row or column order of network filters before smoothing. Our work is inspired by Osher et al. (2018) , though we are motivated by correlation in the parameter tensor. Osher et al. (2018) can be interpreted as using the gradient of the loss with respect to a Sobolev metric. One insight over Osher et al. (2018) is that keeping the structure of the parameter tensor and defining the Sobolev metric with respect to the output-channel direction boosts accuracy, while other directions do not. Secondly, we introduce a re-weighted H 0 metric that preferentially treats the output-channel direction, and can be implemented with a line of Pytorch code, has linear (in parameter size) complexity, and performs comparably (in many cases) to our channel-directed Sobolev metric, boosting accuracy of SGD. Third, our Sobolev gradient, a variant of the ordinary one, has linear complexity rather than quasi-linear (not requiring FFT as Osher et al. (2018) ). Sobolev gradients have been used in computer vision Sundaramoorthi et al. (2007) ; Charpiat et al. (2007) for their coarse-to-fine evolution Sundaramoorthi et al. (2008) ; we adapt that formulation to CNNs. We formulate Sobolev gradients by considering the space of parameter tensors as a Riemannian manifold, and choosing the Sobolev metric on the tangent space. By choosing a metric, gradients intrinsic to the manifold can be computed and gradient flows are guaranteed to decrease loss. Other Riemannian metrics have been used for optimization in neural networks, e.g., Amari (1998) ; Marceau-Caron & Ollivier (2016); Hoffman et al. (2013) ; Gunasekar et al. (2020) and tangentially relate to our work. 
These works are based on Amari's information geometry on probability measures (Amari (1998)), and the metric considered is the Fisher information metric. The motivation for these methods is re-parametrization invariance of optimization, whereas our motivation is imposing correlation in the parameter space. Other works (Gunasekar et al. (2020)) use the Hessian metric (in the convex case), but these metrics are data-dependent and the gradient is challenging to compute, requiring (a large) inverse matrix computation.

2. CHANNEL-DIRECTED GRADIENTS

We now present the theory to define channel-directed gradients. To do this, we formulate new metrics on the space of tensors, and then derive analytic formulas for channel-directed gradients in terms of the standard L 2 gradient. As we show, our channel-directed gradients effectively smooth the components of the L 2 gradient across the output direction of the parameter tensors of the CNN, which induces correlation in that direction in the gradient and thus also the parameter tensor. Another interpretation is we are changing the geometry of the loss landscape (without changing the loss) to a more smooth one by changing the metric of the space on which the loss is defined.

2.1. BACKGROUND ON RIEMANNIAN GRADIENTS

We present the definition of the gradient on a Riemannian manifold, and show the dependence of the gradient on the chosen metric on the manifold (see Carmo (1992); Abraham et al. (2012) for more details). A manifold $\mathcal{X}$ is a space that is locally linear around each point $X \in \mathcal{X}$; this linear space is the tangent space, denoted $T_X\mathcal{X}$. A Riemannian manifold has a smoothly varying positive definite bilinear form $\langle \cdot, \cdot \rangle$ (called the metric) on the tangent space. This metric allows one to define the notion of lengths of curves on the space, in addition to other operations, including gradients.

Definition 1 (Gradient of a Function) Let $\mathcal{X}$ be a Riemannian manifold, and $f : \mathcal{X} \to \mathbb{R}$ be a function. The directional derivative of $f$ at $X \in \mathcal{X}$ along a direction $k \in T_X\mathcal{X}$ is defined as $df(X) \cdot k = \frac{d}{d\varepsilon} f(X + \varepsilon k)\big|_{\varepsilon=0}$. The gradient of $f$ at $X \in \mathcal{X}$ is the vector $\nabla f(X) \in T_X\mathcal{X}$ that satisfies the relation

$$ df(X) \cdot k = \langle \nabla f(X), k \rangle \tag{1} $$

for all $k \in T_X\mathcal{X}$.

Note that "the" gradient will depend on the choice of the metric on the manifold. We note that any such gradient will decrease the function $f$ by moving infinitesimally in the tangent space in the direction of the negative gradient, as $df(X) \cdot k = -\|\nabla f(X)\|^2 < 0$ when $k = -\nabla f(X)$ (and $\nabla f(X) \neq 0$), where $\|\cdot\|$ is the norm induced from the metric. The gradient flow, defined by the differential equation $\dot{X}_t = -\nabla f(X_t)$, will converge to a local minimum. In our application of this theory to CNN optimization, $f$ will be the loss function, and $\mathcal{X}$ will be the space of parameter tensors.

A consequence of this definition is that the gradient is the direction (up to a scale factor) in the tangent space that optimizes the following problem:

$$ \arg\max_{k \in T_X\mathcal{X} \setminus \{0\}} \frac{|df(X) \cdot k|}{\|k\|}. $$

Thus, the gradient can be regarded as the most efficient direction, as it maximizes the ratio of the change in energy by perturbing in a direction $k$ over the cost (defined by the metric) of $k$. Thus, by constructing the metric to have small costs for perturbations (directions) that we prefer for gradients, the gradient flow will move in these preferential directions while minimizing the function, and thus land in favorable local minima.
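To make the dependence of the gradient on the metric concrete, the following is a minimal NumPy sketch in finite dimensions. It assumes a metric of the form $\langle u, v \rangle_M = u^\top M v$ with a symmetric positive definite matrix $M$; the toy quadratic function, the matrix values, and the function name `riemannian_gradient` are illustrative choices and do not come from the paper.

```python
import numpy as np

def riemannian_gradient(euclid_grad, metric):
    """Gradient w.r.t. <u, v>_M = u^T M v, given the Euclidean gradient.

    Solves M g = euclid_grad, so that <g, k>_M = euclid_grad . k for every k.
    """
    return np.linalg.solve(metric, euclid_grad)

# Toy quadratic f(x) = 0.5 * x^T A x (hypothetical example).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
x = np.array([1.0, -2.0])
euclid_grad = A @ x                      # gradient under the standard L2 metric

M = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric positive definite metric
g = riemannian_gradient(euclid_grad, M)

k = np.array([0.3, -0.7])                # an arbitrary tangent direction
directional = euclid_grad @ k            # df(X) . k
inner = g @ M @ k                        # <grad_M f, k>_M, equal to df(X) . k
descent = euclid_grad @ (-g)             # df(X) . (-grad_M f) = -||g||_M^2
```

The same directional derivative is represented by different gradient vectors under different metrics, and moving along the negative of either one decreases the function to first order.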

2.2. CHANNEL-DIRECTED METRICS

In existing deep network gradient-based optimization schemes, the underlying metric on the loss function is assumed to be the standard Euclidean $L^2$ metric. We will consider a re-weighted version of the $L^2$ metric and a Sobolev metric that favor correlation in the output channel direction of the gradient and thus the parameter tensors. To formulate the methodology, we start from a continuum formulation, where we treat weight tensors in the continuum, formulate the metrics in the continuum, and then in the next sub-section derive the gradients with respect to these metrics. Finally, we discretize gradient flows in the implementation to derive iterative schemes.

The standard $H^0$ (i.e., $L^2$) metric is

$$ \langle k_1, k_2 \rangle_{H^0} = \int_{\mathcal{O},\mathcal{I},\mathcal{H},\mathcal{W}} k_1(o, i, h, w) \cdot k_2(o, i, h, w) \, do \, di \, dh \, dw, $$

where $k_1, k_2$ are in the tangent space of tensors. We now define a re-weighted version of $H^0$ that favors tangent vectors that have global smoothness in the direction of the $\mathcal{O}$ dimension:

$$ \langle k_1, k_2 \rangle_{H^0_\lambda} = \int_{\mathcal{I},\mathcal{H},\mathcal{W}} \bar{k}_1(i, h, w) \cdot \bar{k}_2(i, h, w) \, di \, dh \, dw + \frac{\lambda}{O} \langle k_1 - \bar{k}_1, k_2 - \bar{k}_2 \rangle_{H^0}, \tag{4} $$

where $\lambda > 0$ is a hyper-parameter, and $\bar{k}$ is the average value in the output channel direction, i.e.,

$$ \bar{k}(i, h, w) = \frac{1}{O} \int_{\mathcal{O}} k(o, i, h, w) \, do. $$

The metric in (4) splits the tangent vector into global translations in the output channel direction and its orthogonal complement, i.e., the deformation. The weight $\lambda$ is used to control the weighting between the translation and deformation components, i.e., larger values of $\lambda$ mean that deformations more heavily influence the norm of the perturbation. As shown in the next sub-section, this means that gradients with respect to this metric weight channel-directed translations more heavily than deformations.

Next, we introduce a channel-directed version of a Sobolev metric, defined as follows:

$$ \langle k_1, k_2 \rangle_{\tilde{H}^1} = \int_{\mathcal{I},\mathcal{H},\mathcal{W}} \bar{k}_1(i, h, w) \cdot \bar{k}_2(i, h, w) \, di \, dh \, dw + \lambda O \left\langle \frac{\partial k_1}{\partial o}, \frac{\partial k_2}{\partial o} \right\rangle_{H^0}, \tag{6} $$

where $\frac{\partial}{\partial o}$ indicates the partial derivative with respect to the output channel direction. The partial derivative in the $o$-direction implies that tensor perturbations that are smooth along the $o$-direction are close with respect to these metrics, which will imply that the corresponding gradients will exhibit smoothness in this direction, i.e., convolution filters that are nearby in the output direction will exhibit correlation. The metric is a weighted combination of the $H^0$ metric of the derivative in the output direction, and the $H^0$ metric of the output-directed translation. Note that the traditional Sobolev metric uses the $H^0$ metric of the perturbation rather than the translation. Our choice is motivated by computational efficiency of the corresponding gradient, to be discussed below. The scale factors of $O$ in the expressions above are chosen so that the metric is scale invariant with respect to different sizes of output channels. The part of the metric with the partial derivative component implies that tensors that differ in the output channel direction by a non-smooth perturbation are far away in distance. Tensors that differ by just a channel-directed translation are close.
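The split of a perturbation into a channel-directed translation and a deformation can be illustrated with a small NumPy sketch. It assumes a discrete tensor of shape (O, I, H, W), a unit-spaced sum in place of each integral, and a periodic forward difference in place of $\partial / \partial o$; the function names are hypothetical.

```python
import numpy as np

def split_translation_deformation(k):
    """Split k (shape O x I x H x W) into its output-channel mean and the remainder."""
    k_bar = k.mean(axis=0, keepdims=True)        # translation component, 1 x I x H x W
    return np.broadcast_to(k_bar, k.shape), k - k_bar

def output_derivative_energy(k):
    """Discrete stand-in for the squared H^0 norm of dk/do (periodic boundary)."""
    dk = np.roll(k, -1, axis=0) - k              # forward difference along o
    return float(np.sum(dk ** 2))

rng = np.random.default_rng(0)
k = rng.standard_normal((8, 3, 5, 5))
translation, deformation = split_translation_deformation(k)

# The two parts are orthogonal in H^0, and a pure translation carries
# no energy in the derivative term of the Sobolev-type metric.
h0_cross = float(np.sum(translation * deformation))
energy_translation = output_derivative_energy(translation)
energy_full = output_derivative_energy(k)
```

Under both metrics only the deformation part is penalized by $\lambda$, which is why perturbations that are constant along the output channel direction are cheap.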

2.3. COMPUTING CHANNEL-DIRECTED GRADIENTS

We now compute gradients with respect to the metrics defined in the previous sub-section in terms of the $H^0$ gradient so that existing SGD code can be re-used with minimal changes. To compute the relation between the channel-directed gradients and the usual $H^0$ gradient, we note by (1) that the directional derivative can be written as an inner product with the gradient with respect to any metric:

$$ dL(X) \cdot k = \langle \nabla_{H^0_\lambda} L(X), k \rangle_{H^0_\lambda} = \langle \nabla_{\tilde{H}^1} L(X), k \rangle_{\tilde{H}^1} = \langle \nabla_{H^0} L(X), k \rangle_{H^0}. \tag{7} $$

With this relation, we may compute the channel-directed gradients in terms of the $H^0$ gradient (details are in Appendix D). Letting $f = \nabla_{H^0} L(X)$, we have

$$ \nabla_{H^0_\lambda} L(X) = \bar{f} + \frac{1}{\lambda}(f - \bar{f}) \quad \text{and} \quad f = \nabla_{\tilde{H}^1} L(X) - \lambda O^2 \frac{\partial^2}{\partial o^2} \nabla_{\tilde{H}^1} L(X), \tag{8} $$

where the last expression is a second order ordinary differential equation (ODE), whose solution we discuss next. Notice that the re-weighted $H^0$ gradient (8) re-weights the channel-directed translation component and the deformation component of the $H^0$ gradient differently, i.e., as $\lambda$ gets larger, the channel-directed translation becomes more prominent.

Our Sobolev gradient effectively computes local averages, as we show, in the output channel direction, and by doing so effectively imposes an ordering of the kernels in CNNs so that nearby kernels (according to the distance in the output direction) are similar. As the ordering of kernels in CNNs is arbitrary, in the sense that permutations of kernels in the output direction along with the input channels result in the same output, we are free to impose one ordering, which Sobolev effectively does during the optimization so that filters that are close in the $o$-dimension are similar. In obtaining the expression for the Sobolev gradient below, we have assumed periodic boundary conditions in the $\mathcal{O}$ dimension, which further imposes the ordering of filters such that starting and ending filters in the $o$-dimension are similar. The periodic assumption implies that the Sobolev gradient can be computed as a circular convolution with the $H^0$ gradient, which is simpler to compute in practice. In fact, the Sobolev gradient is given as

$$ \nabla_{\tilde{H}^1} L(X)(o, i, h, w) = \frac{1}{O} \int_{\mathcal{O}} K((o - \tilde{o})/O) \, f(\tilde{o}, i, h, w) \, d\tilde{o}, \quad \text{where } K(o) = 1 + \frac{o^2 - o + 1/6}{2\lambda}, \text{ for } o \in [0, 1]. \tag{9} $$

Note that the re-weighted $H^0$ solution also has an interpretation of convolution with respect to a smoothing kernel. Figure 2 shows plots of the kernels for the parameter $\lambda$ chosen in experiments. For each $o$, the Sobolev or re-weighted $H^0$ gradient is a local average whose weights decay away from $o$. Thus, the effect of the metrics is to induce smoothness of the gradient along the output channel direction.

The Sobolev gradient need not use the convolution formula, as one can just integrate the ODE twice, an advantage of our mean variant of the Sobolev metric. This saves one from having to compute the convolution directly, and hence gives a reduction in computational cost from quadratic (or quasi-linear with an FFT) to linear in $O$ given the $H^0$ gradient. The Sobolev gradient can be computed as

$$ g(o, i, h, w) = g(0, i, h, w) + o \frac{\partial g}{\partial o}(0, i, h, w) - \frac{1}{\lambda} \int_0^o (o - \tilde{o}) \left( f(\tilde{o} O, i, h, w) - \bar{f}(i, h, w) \right) d\tilde{o} \tag{10} $$

$$ \frac{\partial g}{\partial o}(0, i, h, w) = -\frac{1}{\lambda} \int_0^1 o \left( f(oO, i, h, w) - \bar{f}(i, h, w) \right) do \tag{11} $$

$$ g(0, i, h, w) = \int_0^1 K(o) \, f(oO, i, h, w) \, do, \quad o \in [0, 1], \tag{12} $$

where $g = \nabla_{\tilde{H}^1} L(X)$ and $f = \nabla_{H^0} L(X)$. These are just three integrals that can be computed in linear complexity with respect to $O$. The gradient flows under these metrics are given by

$$ \dot{X}_t = -\nabla L(X_t), \tag{13} $$

where $t$ denotes the artificial time variable, $\dot{X}$ is the time derivative of the parameter tensor, and $\nabla$ denotes the gradient with respect to the desired metric.
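The two routes to the Sobolev gradient, circular convolution with the kernel in (9) and the three integrals (10)-(12), can be compared numerically. The sketch below is an illustration rather than the authors' implementation (their code is in Figure 12): it assumes a left Riemann sum on the normalized grid $o_j = j/O$ and a gradient array of shape (O, I, H, W), and all function names are hypothetical.

```python
import numpy as np

def sobolev_kernel(o, lam):
    """Kernel K(o) = 1 + (o^2 - o + 1/6) / (2*lam) for o in [0, 1]."""
    return 1.0 + (o ** 2 - o + 1.0 / 6.0) / (2.0 * lam)

def sobolev_grad_conv(f, lam):
    """Circular convolution of f with K along axis 0 (quadratic cost in O)."""
    O = f.shape[0]
    idx = np.arange(O)
    dist = ((idx[:, None] - idx[None, :]) % O) / O   # (o - o~)/O wrapped into [0, 1)
    weights = sobolev_kernel(dist, lam) / O          # Riemann-sum weights
    return np.tensordot(weights, f, axes=(1, 0))

def sobolev_grad_cumsum(f, lam):
    """Same quantity from the three integrals, using cumulative sums (linear in O)."""
    O = f.shape[0]
    o = (np.arange(O) / O).reshape((O,) + (1,) * (f.ndim - 1))
    h = f - f.mean(axis=0, keepdims=True)            # f minus its output-channel mean
    g0 = np.sum(sobolev_kernel(o, lam) * f, axis=0, keepdims=True) / O   # as in (12)
    gp0 = -np.sum(o * h, axis=0, keepdims=True) / (lam * O)              # as in (11)
    # int_0^o (o - o~) h(o~) do~ = o * cumsum(h) - cumsum(o~ * h), as in (10)
    inner = (o * np.cumsum(h, axis=0) - np.cumsum(o * h, axis=0)) / O
    return g0 + o * gp0 - inner / lam

rng = np.random.default_rng(0)
f = rng.standard_normal((16, 3, 2, 2))               # stand-in for an H^0 gradient
g_conv = sobolev_grad_conv(f, lam=1.0)
g_cumsum = sobolev_grad_cumsum(f, lam=1.0)

def roughness(t):
    """Sum of squared differences between neighbouring output channels."""
    return float(np.sum(np.diff(t, axis=0) ** 2))
```

With this discretization the two functions agree up to floating point error, while the cumulative-sum version touches each output channel only a constant number of times.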

2.4. PROPERTIES OF CHANNEL DIRECTED GRADIENT FLOWS

Correlation in the Weight Tensor: By the convolution formula, the Sobolev gradient is a smoothing of the $H^0$ gradient. Noting that the gradient flow (13) integrates (smooth) gradients over time, the final tensor will exhibit correlation in the output direction, as it sums smooth (correlated) gradients in the output direction and the initialization, which is typically chosen to be decorrelated noise.

Coarse-to-Fine Evolution and Removal of Some Local Minima: Sobolev gradient flows evolve according to coarse-scale perturbations before moving to finer scale perturbations (Sundaramoorthi et al. (2008)). This avoids being trapped in local minima due to fine-scale structures. Also, since Sobolev balls can fit in any $L^2$ ball but not vice-versa, the loss landscape changes (i.e., topologically in the continuum) and some local minima (in $L^2$) may cease to exist numerically. As wide local minima generalize well (Chaudhari et al. (2019)), the numerical removal of local minima due to fine structures (e.g., sharp minima) may encourage convergence to wide minima and hence generalize better than SGD. The correlated nature of the Sobolev (and re-weighted $H^0$) gradient makes it difficult to lock into sharp local minima.

    import torch

    def reweighted_H0_grad(grad, lam):
        # grad: L2 gradient (e.g., param.grad.data); lam > 0 weights the translation of the L2 grad
        # (the argument is named lam because lambda is a reserved word in Python)
        grad += lam * torch.mean(grad, 0, True).repeat(grad.size(0), 1, 1, 1)
        return grad

Figure 3: PyTorch code to compute the re-weighted $H^0$ ($H^0_\lambda$) gradient from the $H^0$ gradient.
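The relation between formula (8) and an update of the Figure 3 form (the $H^0$ gradient plus a multiple of its output-channel mean) can be checked with a short NumPy sketch. The function names are hypothetical, and reading the Figure 3 weight as $\lambda - 1$ is our own observation from the algebra, not a statement made in the paper.

```python
import numpy as np

def reweighted_h0_grad(f, lam):
    """Formula (8): output-channel mean of f plus the remainder scaled by 1/lam."""
    f_bar = f.mean(axis=0, keepdims=True)
    return f_bar + (f - f_bar) / lam

def mean_added_grad(f, weight):
    """Figure 3 style update: add `weight` times the output-channel mean to f."""
    return f + weight * f.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 4, 3, 3))    # stand-in for an H^0 gradient, O x I x H x W
lam = 3.0
g = reweighted_h0_grad(f, lam)
g_added = mean_added_grad(f, lam - 1.0)  # equals lam * g
```

Since $\lambda \, (\bar{f} + (f - \bar{f})/\lambda) = f + (\lambda - 1)\bar{f}$, the two forms differ only by a positive scale factor, which can be absorbed into the learning rate.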

3. APPLICATION TO SGD AND IMPLEMENTATION

To apply re-weighted $H^0$ and Sobolev channel-directed gradients to optimizing CNNs based on SGD or its variants, we discretize the gradient flow (13) according to forward Euler. We approximate the standard $H^0$ gradient of the loss, $\nabla_{H^0} L(X)$, using a mini-batch, as is standard. We then use this approximation of the $H^0$ gradient to approximate the $\tilde{H}^1$ gradient, $\nabla_{\tilde{H}^1} L(X)$, by discretizing (10)-(12) using a standard Riemann sum. Note that (10) can be computed for each $o$, the output channel index of the tensor, with the cumulative sum (CUMSUM) operation, which is linear in cost, as are (11) and (12). We compute the Sobolev gradient for each convolutional layer parameter tensor independently of the others. We use $\lambda = 1$ for the $\tilde{H}^1$ gradient and add it to a scaled version (by a hyper-parameter) of the $H^0$ gradient (as in Figure 2) to avoid over-smoothing. The re-weighted $H^0$ gradient is computed by using (8) from the $H^0$ stochastic gradient. Both our gradients require only a few additional lines of code; the code for re-weighted $H^0$ is shown in Figure 3 (see Appendix Figure 12 for the $\tilde{H}^1$ code). Thus, our gradients replace the usual one, and other additions to SGD (e.g., momentum, Adam) can be used.
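A forward Euler discretization of the flow with the re-weighted $H^0$ gradient can be sketched as follows. This is a toy illustration under stated assumptions: the loss is a hypothetical quadratic $\frac{1}{2}\|X - T\|^2$ whose $H^0$ gradient is known in closed form, no mini-batching is involved, and the function names, learning rate, and $\lambda$ are arbitrary choices.

```python
import numpy as np

def reweighted_h0_grad(f, lam):
    """Formula (8): output-channel mean of f plus the remainder scaled by 1/lam."""
    f_bar = f.mean(axis=0, keepdims=True)
    return f_bar + (f - f_bar) / lam

def euler_step(weight, h0_grad, lr, lam):
    """One forward Euler step of the gradient flow under the re-weighted metric."""
    return weight - lr * reweighted_h0_grad(h0_grad, lam)

rng = np.random.default_rng(0)
target = rng.standard_normal((6, 2, 3, 3))   # hypothetical minimizer T
weight = np.zeros_like(target)               # parameter tensor X, shape O x I x H x W

losses = []
for _ in range(50):
    losses.append(0.5 * float(np.sum((weight - target) ** 2)))
    h0_grad = weight - target                # H^0 gradient of 0.5 * ||X - T||^2
    weight = euler_step(weight, h0_grad, lr=0.1, lam=2.0)
```

In a real training loop, the $H^0$ gradient would instead be the mini-batch gradient of each convolutional weight, and the transformed gradient would be handed to the optimizer in its place.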

4. EXPERIMENTS

We test our methods on different baseline optimizers and tasks. Our intent is to show that any method can be improved just by switching to either of our gradients. We fix λ = 1 unless specified otherwise. Table 1 shows the settings for each experiment. Experiments are run on a single NVIDIA Titan Xp GPU except for GANs, which are run on a Tesla v100 GPU due to memory requirements. Osher et al. (2018) . For SGD, we set the initial learning rate to be 0.1 and 0.01 on ResNet-56 and VGG-16 respectively with momentum 0.9 and weight decay 5e-4. For ADAM, we set the initial learning rate to 0.01. We decrease the learning rate by a factor of 10 every 40 epochs as Osher et al. (2018) . We run 25 independent trials on SGD and 10 on ADAM (due to lower variance of ADAM), and report the average. In Figure 4 , we show an example of training and test accuracy curves (batch size of 8) for baselines as well as Laplacian Smoothing (LS) Osher et al. (2018) , which rasterizes before smoothing. We out-perform all methods. We also apply LS (without rasterization) to smooth the gradient in our output-channel directed fashion, which improves LS, but we still out-perform it. In Figure 5 (left), we compare the histograms of test accuracy over multiple runs of ours and SGD. Our method achieves higher average test accuracy with reduced variance. To investigate the effect of different channel directions of smoothing, we apply our method as well as LS along different channel-directions. We compare approaches under two settings, which are smoothing gradients in all layers and smoothing gradients in only convolutional layers. Figure 5 (right) shows that our output-channel direction is preferred regardless of smoothing method used. This shows that the output channel smoothing is essential. Smoothing only convolutional layers in a rasterized order (as Table 2 summarizes results (over 10-25 trials). Both our gradients improve over H 0 . 
A greater advantage is achieved when the batch size is small, as the stochastic gradient is noisy and our method imposes regularity. Both gradients perform similarly, but $\tilde{H}^1$ performs better with ADAM.

Effect of Smoothing Parameter: We examine the effect of the smoothing parameter on MNIST (LeCun & Cortes (2010)) and Fashion-MNIST (Xiao et al. (2017)) by varying it from 0 to 20. We conduct training on the test set (10000 samples) and test on the training set (60000 samples) to make generalization more challenging. We use a 2-layer CNN with 50 and 100 5 × 5 filters in each layer, respectively, and train with batch size 100. Figure 6 shows the accuracy at the 100th epoch (average over 5 trials). Note that $\lambda = 0$ is SGD. Our methods are not sensitive to $\lambda$ and improve over SGD for every nonzero $\lambda$ tested.

Semantic Segmentation:

The experiments are conducted on PascalVOC (Everingham et al. (2015)) using the popular UNet segmentation network (Ronneberger et al. (2015)) with ResNet-50 as the encoder (https://github.com/nyoki-mtl/pytorch-segmentation). We use initial learning rate 7e-3 and batch size 2 (to fit on Titan Xp memory), and average results over 3 trials. Figure 7 shows results. Both our gradients improve segmentation accuracy by ~8% over SGD on the test set. We reduced the generalization gap from 0.163 to 0.151 (by 7.4%) and 0.150 (by 8.0%) for $\tilde{H}^1$ and $H^0_\lambda$, respectively.

2017) score (lower is better). Learning rates are 1e-4 and 4e-4 for the generator and discriminator, respectively. We compare to SGD with momentum 0.9 and weight decay 5e-4. All models are trained with batch size 2 (to fit on Tesla v100 memory). For each optimizer, we summarize the results of 24 different trained models. Table 3 provides results. Our methods achieve better average FID score with less variance.

Method              FID
SGD                 65.77 ± 11.94
+ $\tilde{H}^1$     60.17 ± 6.15
+ $H^0_\lambda$     57.99 ± 5.01

Table 3: Results on the image generation task. Our methods achieve better results with reduced variance due to regularity imposed during training.

Speed: With PyTorch, re-weighted $H^0$ adds negligible overhead. Currently, $\tilde{H}^1$ increases training time on CIFAR-10 by 50% with batch size 128. 70% of this overhead is due to using tensor transpose and saving/loading, which is required due to limitations of the PyTorch library. This can be eliminated by implementing our own PyTorch function in C++; in this case, $\tilde{H}^1$ would add a 15% overhead.

5. CONCLUSION

Using stochastic gradients that promote correlation (and smoothness) in the output-channel dimension of CNN network tensors is effective in improving accuracy of SGD and its variants. We reformulated the gradient (without changing the loss) by changing the underlying Riemannian geometry on the tensor space using two different metrics. In the continuum, Sobolev changes the topology of the loss landscape (possibly removing fine-scale local minima), and so has better theoretical properties. Both the channel-directed re-weighted H 0 and H1 gave accuracy boosts, with H1 performing better with ADAM. Regularity in other tensor dimensions is not effective in improving accuracy. Both channel-directed gradients have the same (linear) computational complexity and not much cost over SGD (re-weighted H 0 is faster), and the code is simple.

A ADDITIONAL ANALYSIS OF EVOLUTION OF CHANNEL-DIRECTED OPTIMIZATION

Figure 8 and Figure 9 present the evolution of training and test accuracy of ADAM and SGD with different batch sizes. Using channel-directed gradients ($\tilde{H}^1$ in this experiment) for SGD or ADAM improves test accuracy for any batch size. More prominent performance gains are seen for smaller batch sizes. This is because the stochastic gradient is typically noisier when the batch size is small, and our proposed channel-directed metrics implicitly encode smoothness.

B ADDITIONAL EXPERIMENTAL VERIFICATION OF OUTPUT-CHANNEL DIRECTION

To investigate the effect of different channel directions of smoothing, we apply our method as well as LS along different channel-directions. Figure 10 shows that our output-channel direction is preferred regardless of different smoothing approaches. 

C REGULARITY OF TRAINED CONVOLUTIONAL LAYERS

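We show in Figure 11 that the final weight tensors at convergence in our methods have correlation in the output channel dimension, as should be the case, since the tensor is composed of a component that is smooth. To show this, we plot the correlation between filters in the weight tensors as a function of the distance in the output channel dimension. This is done over multiple tensor layers in ResNet-56 and over multiple trials of optimization on CIFAR-10. We also show the correlation of filters in the input channel direction. As can be seen, all optimization methods produce tensors that exhibit correlation (and, in addition, smoothness for Sobolev) in the output channel direction, while there is no (or much less) correlation in the input direction. Notice that our methods increase the amount of regularity compared to SGD, as they impose this in optimization.

We first derive the re-weighted $L^2$ gradient under the $H^0_\lambda$ metric, following the notation of the paper. Let $f := \nabla_{H^0} L(X)$ be the standard $L^2$ gradient; we want to solve for $g := \nabla_{H^0_\lambda} L(X)$. By (4) and (7) we have

$$ \langle f, k \rangle_{H^0} = \langle g, k \rangle_{H^0_\lambda} \tag{14} $$
$$ = \langle \bar{g}, \bar{k} \rangle_{H^0} + \lambda \langle g - \bar{g}, k - \bar{k} \rangle_{H^0}. $$

Note that $\langle \bar{g}, k - \bar{k} \rangle_{H^0} = 0$ holds for all $k$. This is because $\int_{\mathcal{O}} \bar{g}\,(k - \bar{k}) \, do = \bar{g} \int_{\mathcal{O}} (k - \bar{k}) \, do$ and $\int_{\mathcal{O}} (k - \bar{k}) \, do = 0$, since $k - \bar{k}$ is zero-mean. In this way, $\bar{k}$ and $k - \bar{k}$ form an orthogonal decomposition.

After decomposing $f$ and $k$ into

$$ f = \bar{f} + (f - \bar{f}), \quad k = \bar{k} + (k - \bar{k}), $$

by simple algebra we have

$$ \bar{f} = \bar{g}, \quad f - \bar{f} = \lambda (g - \bar{g}), $$

which leads to the result of (8).

We then derive the Sobolev gradient under the $H^1$ metric, following similar computations in Sundaramoorthi et al. (2007). Consider $\nabla_{H^1} L(X)$, the Sobolev gradient under the $H^1$ metric. By (6) and (7) we have

$$ \langle \nabla_{H^0} L(X), k \rangle_{H^0} = \langle \nabla_{H^1} L(X), k \rangle_{H^1} \tag{18} $$
$$ = \frac{1}{O} \langle k, \nabla_{H^1} L(X) \rangle_{H^0} + \lambda O \left\langle \frac{\partial k}{\partial o}, \frac{\partial \nabla_{H^1} L(X)}{\partial o} \right\rangle_{H^0}. $$

Integrating by parts and considering the periodic boundary conditions, we have

$$ \langle \nabla_{H^0} L(X), k \rangle_{H^0} = \left\langle \nabla_{H^1} L(X) - \lambda O^2 \frac{\partial^2}{\partial o^2} \nabla_{H^1} L(X), k \right\rangle_{H^0}. $$

Since $k$ can be any perturbation, by uniqueness, we have

$$ \nabla_{H^0} L(X) = \nabla_{H^1} L(X) - \lambda O^2 \frac{\partial^2}{\partial o^2} \nabla_{H^1} L(X), $$

which is (8). Similarly, for the $\tilde{H}^1$ metric, we have

$$ \nabla_{H^0} L(X) = \nabla_{\tilde{H}^1} L(X) - \lambda O^2 \frac{\partial^2}{\partial o^2} \nabla_{\tilde{H}^1} L(X). \tag{22} $$

First observe that by computing the output-channel directed average of both sides of the above equation, we see that $\overline{\nabla_{\tilde{H}^1} L(X)} = \overline{\nabla_{H^0} L(X)}$, i.e., the average values are the same. One may integrate (22) twice to solve for the $\tilde{H}^1$ gradient. For simplicity, let $f$ be the $L^2$ gradient and $g$ be the $\tilde{H}^1$ gradient. Integrating twice yields

$$ g(o, i, h, w) = g(0, i, h, w) + \int_0^o \frac{\partial g}{\partial o}(0, i, h, w) \, d\tilde{o} - \frac{1}{\lambda} \int_0^o \int_0^{\hat{o}} \left( f(\tilde{o} O, i, h, w) - \bar{f}(i, h, w) \right) d\tilde{o} \, d\hat{o} \tag{23} $$
$$ = g(0, i, h, w) + \int_0^o \frac{\partial g}{\partial o}(0, i, h, w) \, d\tilde{o} - \frac{1}{\lambda} \int_0^o \int_{\tilde{o}}^{o} \left( f(\tilde{o} O, i, h, w) - \bar{f}(i, h, w) \right) d\hat{o} \, d\tilde{o} \tag{24} $$
$$ = g(0, i, h, w) + o \frac{\partial g}{\partial o}(0, i, h, w) - \frac{1}{\lambda} \int_0^o (o - \tilde{o}) \left( f(\tilde{o} O, i, h, w) - \bar{f}(i, h, w) \right) d\tilde{o}. \tag{25} $$

Note that here we normalize the channel direction by letting $o \in [0, 1]$. With boundary conditions $g(0) = g(1)$, $\frac{\partial g}{\partial o}(0) = \frac{\partial g}{\partial o}(1)$ and $\bar{f} = \bar{g}$, we have

$$ \frac{\partial g}{\partial o}(0, i, h, w) = -\frac{1}{\lambda} \int_0^1 o \left( f(oO, i, h, w) - \bar{f}(i, h, w) \right) do. $$

For simplicity, we drop $i, h, w$ and $O$ in the following derivations. We have

$$ g(0) = g(o) - o \frac{\partial g}{\partial o}(0) + \frac{1}{\lambda} \int_0^o (o - \tilde{o}) (f(\tilde{o}) - \bar{f}) \, d\tilde{o} \tag{27} $$
$$ = g(o) + \frac{o}{\lambda} \int_0^1 \tilde{o} \, (f(\tilde{o}) - \bar{f}) \, d\tilde{o} + \frac{1}{\lambda} \int_0^o (o - \tilde{o}) (f(\tilde{o}) - \bar{f}) \, d\tilde{o}. \tag{28} $$

Integrating both sides over $o \in [0, 1]$ and using $\bar{g} = \bar{f}$ gives

$$ g(0) = \bar{g} + \frac{1}{\lambda} \int_0^1 o \, do \cdot \int_0^1 o \, (f(o) - \bar{f}) \, do + \frac{1}{\lambda} \int_0^1 \int_0^o (o - \tilde{o}) (f(\tilde{o}) - \bar{f}) \, d\tilde{o} \, do \tag{29} $$
$$ = \bar{f} + \frac{1}{2\lambda} \int_0^1 o f(o) \, do - \frac{1}{4\lambda} \bar{f} + \frac{1}{\lambda} \left( \int_0^1 \int_0^o (o - \tilde{o}) f(\tilde{o}) \, d\tilde{o} \, do - \bar{f} \int_0^1 \int_0^o (o - \tilde{o}) \, d\tilde{o} \, do \right) \tag{30} $$
$$ = \left( 1 - \frac{1}{4\lambda} - \frac{1}{6\lambda} \right) \bar{f} + \frac{1}{2\lambda} \int_0^1 o f(o) \, do + \frac{1}{\lambda} \int_0^1 \int_{\tilde{o}}^1 (o - \tilde{o}) f(\tilde{o}) \, do \, d\tilde{o} \tag{31} $$
$$ = \left( 1 - \frac{5}{12\lambda} \right) \int_0^1 f(o) \, do + \frac{1}{2\lambda} \int_0^1 o f(o) \, do + \frac{1}{\lambda} \int_0^1 \left( \frac{1}{2} + \frac{\tilde{o}^2}{2} - \tilde{o} \right) f(\tilde{o}) \, d\tilde{o} \tag{32} $$
$$ = \int_0^1 \left( 1 + \frac{o^2 - o + 1/6}{2\lambda} \right) f(o) \, do. $$

This gives (12) in the main paper.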
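As a quick consistency check on the kernel obtained above (a worked step added here for illustration, using only the expression for $K$ in (9)), the kernel integrates to one over a period:

```latex
\int_0^1 K(o)\,do
  = 1 + \frac{1}{2\lambda}\int_0^1 \left(o^2 - o + \tfrac{1}{6}\right) do
  = 1 + \frac{1}{2\lambda}\left(\tfrac{1}{3} - \tfrac{1}{2} + \tfrac{1}{6}\right)
  = 1 .
```

This agrees with the observation that the output-channel averages of the $\tilde{H}^1$ gradient and the $H^0$ gradient coincide: averaging the circular convolution in (9) over $o$ multiplies $\bar{f}$ by the integral of $K$, which is 1.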

E CODE FOR SOBOLEV GRADIENT

The PyTorch code to compute the Sobolev gradient is provided in Figure 12. In theory, the 'cumsum' operation should be the part of the code with the largest computational cost. However, in order to match the standard PyTorch library, tensor operations including 'permute', 'repeat' and 'unsqueeze' are currently required. These operations contribute over 70% of the computational overhead, and can be avoided if the computation is done in C++.

    .permute(1, 2, 3, 0)
    gp_0 = torch.matmul(tmp_diff, s) / (-lambda * L**3)
    gp_0 = gp_0.unsqueeze_(3).repeat(1, 1, 1, L)
    s = s.unsqueeze_(0).unsqueeze_(0).unsqueeze_(0).repeat(tmp_diff.size(0), tmp_diff.size(1), tmp_diff.size(2

Figure 12: PyTorch code to compute the Sobolev ($\tilde{H}^1$) gradient from the $H^0$ gradient. The 'permute', 'repeat' and 'unsqueeze' operations are due to standard library limitations, and can be avoided by further code optimization (e.g., writing the function in C++/CUDA that PyTorch calls).

F FURTHER ANALYSIS OF CORRELATION IN CONVOLUTIONAL LAYERS

Existing analysis of regularity in CNNs (Mao et al. (2017)) focuses on filter-level and kernel-level regularities in pruning. To the best of our knowledge, the channel-directed regularity proposed in this paper has not been investigated previously, nor has it been used in optimization. We show below that the output-direction correlation is not due to randomness in the network, nor to particular weight transformations that leave the output behavior of the CNN fixed. The reason for the output-direction correlation is unknown to us and is a direction for future investigation.

Could it be due to random noise? No. Figure 13 shows the histogram of correlation of a representative convolutional layer from an ImageNet-pretrained DenseNet. Note that the input channel correlation is distributed as a zero-mean Gaussian, which is likely due to random noise. In contrast, the output channel correlation (our proposed channel direction) shows positive correlation. There are also outliers (points near 0.3) corresponding to channels with mean value far from zero (see vertical lines in Figure 1). This shows that the neural network prefers regularities in the output channel direction.

Could it be due to scaling? No. In modern CNNs, scaling the affine factor in a BatchNorm layer could create such structure in the following convolutional layer without affecting the output of the neural network. Note that in Figure 1 we use a correlation that is invariant to re-scalings, so this is not the case. We also investigate this further. Figure 14 presents a scatter plot of the mean of tensors in the output direction (a larger mean corresponds to stronger correlation) against the standard deviation of output channels. If scalings were contributing to the regularity, there would be a positive correlation between the mean and the standard deviation, as a scaling amplifies both. The plot does not show a positive correlation between mean and standard deviation in this channel direction, which means the structure is not simply due to scaling up all weights within particular channels (producing the same CNN).

Figure 14: Scatter plot of mean and standard deviation of output channels. There is no positive correlation between channel mean and standard deviation, showing that the structure in the output channel direction is not due to scaling. Each color corresponds to a layer.
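The logic of the scaling test can be made concrete with a small synthetic sketch. The example below uses made-up Gaussian weights, not the DenseNet or ResNet tensors analyzed here; the layer size, the seed, the range of scale factors and all function names are assumptions chosen for illustration. It shows two things: rescaling each channel by its own positive factor induces a strong positive correlation between channel mean and channel standard deviation, which is the signature Figure 14 would display if scaling were the cause, and the Pearson correlation between two channels is unchanged by such rescalings.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical layer: 64 output channels, 100 weights each, common nonzero mean.
base = [[rng.gauss(1.0, 1.0) for _ in range(100)] for _ in range(64)]

# Rescale each channel by its own positive factor (stand-in for channel-wise scaling).
factors = [rng.uniform(0.5, 4.0) for _ in base]
scaled = [[c * w for w in channel] for c, channel in zip(factors, base)]

def mean_std_correlation(channels):
    """Correlation, across channels, between channel mean and channel std."""
    means = [statistics.fmean(ch) for ch in channels]
    stds = [statistics.pstdev(ch) for ch in channels]
    return pearson(means, stds)

r_base = mean_std_correlation(base)      # no per-channel scaling applied
r_scaled = mean_std_correlation(scaled)  # per-channel scaling applied

# Correlation between two channels before and after positive rescaling of each.
r_pair = pearson(base[0], base[1])
r_pair_rescaled = pearson([3.0 * w for w in base[0]], [0.2 * w for w in base[1]])
```

In this toy setting `r_scaled` is close to one while `r_base` stays near zero, and `r_pair` equals `r_pair_rescaled` up to floating-point error. The absence of a positive trend in Figure 14 is therefore evidence against the scaling explanation.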



Let X : O × I × H × W → R denote a parameter tensor of a layer of a convolutional neural network. Here O = [0, O] denotes indices to the output channel dimension of the tensor, I = [0, I]

Figure 2: Visualization of kernels applied to the H 0 gradient under different metrics for λ = 1. This illustrates the smoothing effect of the metrics. In practice, the gradients are computed with linear-cost formulas rather than through the convolution interpretation.

Figure 4: Evolution of training and test accuracy on CIFAR-10: an example with batchsize = 8. Our metric improves both training and test accuracy.

Figure 5: Distribution of results on CIFAR-10. Left: Histogram of test accuracy. Ours achieves higher average with significantly reduced variance. Right: Results from different methods. Best accuracy obtained from our proposed direction. Ours: SGD+ H1 ; LS-ChanDir: LS applied in our proposed channel direction; Ours+O: output channel smoothing; Ours+R: parameters rasterized into a 1-D vector to perform smoothing; Ours+I: Input channel smoothing.

Figure 6: Results on MNIST and Fashion-MNIST with different choice of smoothness. Our methods improve classification accuracy over SGD (i.e., λ = 0) for a wide range of smoothness.

Figure 7: Semantic Segmentation Results on PascalVOC. Sobolev H1 and re-weighted H 0 (H 0 λ ) improve segmentation accuracy by 8.5% and 7.8% respectively relative to SGD.

Figure 8: Training and test accuracy on CIFAR-10 with ADAM.

Figure 9: Training and test accuracy on CIFAR-10 with SGD.

Figure 10: Channel-Directed Smoothing Leads to Better Performance. Best accuracy obtained from our proposed direction. A: Output-Channel Directed; B: Input-Channel Directed; All: parameters rasterized into a 1-D vector to perform smoothing; Ours: re-weighted L 2 .

Figure 11: Correlation of Final Tensor. Correlation between weights within different channel directions in CIFAR trained ResNet 56 conv layers (over 10 trials). |i -j| is distance between weight locations in tensor for correlation computation. Sobolev/re-weighted H 0 show strong correlation in output direction, but not input. SGD shows correlation in output direction.

o) do, we integrate both sides over the entire interval [0, 1].

def Sobolev_grad(grad=param.grad.data, lambda):  # grad: L2 gradient; lambda > 0
    L = grad.size(0)
    s = torch.arange(L, dtype=torch.float32).cuda()
    tmp_mean = torch.mean(grad, 0, True).repeat(L, 1, 1, 1)
    tmp_diff = (grad - tmp_mean)

    ), 1)
    # Sobolev gradient computation
    tmp2 = s * gp_0 - (s * torch.cumsum(tmp_diff, dim=3) - torch.cumsum(s * tmp_diff, dim=3)) / (lambda * L**2)
    grad = lambda * (tmp2.permute(3, 0, 1, 2) + tmp_mean)
    return grad

Figure 13: Histogram of correlation of a representative tensor. While the input channel correlation is distributed as zero-mean Gaussian, the output channel shows positive correlation and sparsity. The neural network prefers regularities in the output channel direction.

Experimental settings.

Test accuracy on CIFAR-10. Channel-directed gradients improve on H 0 in all cases, reducing errors by up to 11%. Results are averaged over 25 trials for SGD and 10 trials for ADAM.

