UNDERSTANDING THE COVARIANCE STRUCTURE OF CONVOLUTIONAL FILTERS

Abstract

Neural network weights are typically initialized at random from univariate distributions, controlling just the variance of individual weights even in highly-structured operations like convolutions. Recent ViT-inspired convolutional networks such as ConvMixer and ConvNeXt use large-kernel depthwise convolutions whose learned filters have notable structure; this presents an opportunity to study their empirical covariances. In this work, we first observe that such learned filters have highly-structured covariance matrices; moreover, we find that covariances calculated from a small network can be used to effectively initialize a variety of larger networks of different depths, widths, patch sizes, and kernel sizes, indicating a degree of model-independence in the covariance structure. Motivated by this finding, we then propose a learning-free multivariate initialization scheme for convolutional filters using a simple, closed-form construction of their covariance. Models using our initialization outperform those using traditional univariate initializations, and typically meet or exceed the performance of those initialized from the covariances of learned filters; in some cases, this improvement can be achieved without training the depthwise convolutional filters at all. Our code is available at https://github.com/locuslab/convcov.

1. INTRODUCTION

Early work in deep learning for vision demonstrated that the convolutional filters in trained neural networks are often highly-structured, in some cases qualitatively resembling filters known from classical computer vision (Krizhevsky et al., 2017). However, for many years it was standard to replace large-filter convolutions with stacks of small-filter convolutions, which have less room for any notable amount of structure. In the past year, this trend has changed, with inspiration from the long-range spatial mixing abilities of vision transformers: some of the most prominent new convolutional neural networks, such as ConvNeXt and ConvMixer, once again use large-filter convolutions. These new models also completely separate the processing of the channel and spatial dimensions, meaning that the now-single-channel filters are, in some sense, more independent of each other than in previous models such as ResNets.

This presents an opportunity to investigate the structure of convolutional filters. In particular, we seek to understand the statistical structure of convolutional filters, with the goal of initializing them more effectively. Most initialization strategies for neural networks focus simply on controlling the variance of individual weights, as in Kaiming (He et al., 2015) and Xavier (Glorot & Bengio, 2010) initialization, neglecting the fact that many layers in neural networks are highly-structured, with interdependencies between weights, particularly after training. Consequently, we study the covariance matrices of the parameters of convolutional filters, which we find to have a large degree of perhaps-interpretable structure. We observe that the covariance of filters calculated from pretrained models can be used to effectively initialize new convolutions by sampling filters from the corresponding multivariate Gaussian distribution.
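This covariance-based baseline can be sketched in a few lines of NumPy. The function and variable names below are our own illustration, not the paper's released code: we fit a multivariate Gaussian to a set of pretrained k × k depthwise filters and sample fresh filters from it.

```python
import numpy as np

def empirical_filter_init(pretrained_filters, n_new, seed=0):
    """Fit a multivariate Gaussian to pretrained (n, k, k) depthwise filters
    and sample n_new fresh k x k filters from it."""
    n, k, _ = pretrained_filters.shape
    flat = pretrained_filters.reshape(n, k * k)   # one row per filter
    mu = flat.mean(axis=0)                        # mean filter, length k^2
    sigma = np.cov(flat, rowvar=False)            # k^2 x k^2 empirical covariance
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mu, sigma, size=n_new)
    return samples.reshape(n_new, k, k)

# e.g., 256 pretrained 9x9 filters (stand-ins here) -> 512 new initial filters
pretrained = np.random.default_rng(1).normal(size=(256, 9, 9))
new_filters = empirical_filter_init(pretrained, 512)
```

Note that the covariance is estimated across filters, pooling all channels (and possibly layers) of the pretrained model into one sample set.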
We then propose a closed-form and completely learning-free construction of covariance matrices for randomly initializing convolutional filters from Gaussian distributions. Our initialization is highly effective, especially for larger filters, deeper models, and shorter training times; it usually outperforms both standard uniform initialization techniques and our baseline technique of initializing by sampling from the distributions of pre-trained filters, both in terms of final accuracy and time-to-convergence. Models using our initialization often see gains of over 1% accuracy on CIFAR-10 and short-training ImageNet classification; it also leads to small but significant performance gains on full-scale, ≈ 80%-accuracy ImageNet training. Indeed, in some cases our initialization works so well that it outperforms uniform initialization even when the filters are not trained at all. And our initialization is almost completely free to compute.

Related work

Saxe et al. (2013) proposed to replace random i.i.d. Gaussian weights with random orthogonal matrices, a constraint under which weights depend on each other and are thus, in some sense, "multivariate"; Xiao et al. (2018) also proposed an orthogonal initialization for convolutions. Like these works, our initialization greatly improves the trainability of deep (depthwise) convolutional networks, but it is much simpler and constraint-free, being just a random sample from a multivariate Gaussian distribution. Zhang et al. (2022) suggests that the main purpose of pretraining may be to find a good initialization, and crafts a mimicking initialization based on observed, desirable information-transfer patterns. We similarly initialize convolutional filters to be closer to those found in pre-trained models, but do so in a completely random and simpler manner. Romero et al. (2021) proposes an analytic parameterization of variable-size convolutions, based in part on Gaussian filters; while our covariance construction is also analytic and built upon Gaussian filters, we use them to specify the distribution of filters.

Preliminaries This work is concerned with depthwise convolutional filters, each of which is parametrized by a k × k matrix, where k (generally odd) denotes the filter's size. Our aim is to study distributions that arise from convolutional filters in pretrained networks, and to explore properties of distributions whose samples produce strong initial parameters for convolutional layers. More specifically, we hope to understand the covariance among pairs of filter parameters for fixed filter size k. This is intuitively expressed as a covariance matrix Σ ∈ R^{k²×k²} with block structure: Σ has k × k blocks, where each block [Σ_{i,j}] ∈ R^{k×k} corresponds to the covariance between filter pixel (i, j) and all other k² − 1 filter pixels. That is, [Σ_{i,j}]_{ℓ,m} = [Σ_{ℓ,m}]_{i,j} gives the covariance of pixels (i, j) and (ℓ, m).

In practice, we restrict our study to multivariate Gaussian distributions, which by convention are considered as distributions over n-dimensional vectors rather than matrices, where the distribution N(µ, Σ′) has a covariance matrix Σ′ ∈ S^n_+ in which Σ′_{i,j} = Σ′_{j,i} represents the covariance between vector elements i and j. To align with this convention when sampling filters, we convert from our original block covariance matrix representation to the representation above by a simple reassignment of matrix entries, given by

    Σ′_{k(i−1)+j, k(ℓ−1)+m} := [Σ_{i,j}]_{ℓ,m}  for 1 ≤ i, j, ℓ, m ≤ k.    (1)

In this form, we may now easily generate a filter F ∈ R^{k×k} by drawing a sample f ∈ R^{k²} from N(µ, Σ′) and assigning F_{i,j} := f_{k(i−1)+j}. In this paper, we assume covariance matrices are in the block form unless we are sampling from a distribution, where the conversion between forms is assumed.

Scope We restrict our study to the large-filter depthwise convolutions found in new ViT-style CNNs, namely the popular ConvMixer and ConvNeXt architectures. These networks consist of a patch embedding layer followed by alternating spatial- and channel-mixing steps.
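The block-to-vector conversion and filter sampling above can be sketched in NumPy; with 0-based indices the reassignment becomes Σ′[k·i+j, k·ℓ+m] = [Σ_{i,j}]_{ℓ,m}, which is exactly what a reshape performs (names below are ours):

```python
import numpy as np

def block_to_vector_cov(block_cov):
    """Convert a block covariance of shape (k, k, k, k), indexed as
    block_cov[i, j, l, m] = [Sigma_{i,j}]_{l,m}, to the flat k^2 x k^2
    form used for multivariate Gaussian sampling. With 0-based indices,
    Sigma'[k*i+j, k*l+m] = block_cov[i, j, l, m], i.e., a reshape."""
    k = block_cov.shape[0]
    return block_cov.reshape(k * k, k * k)

def sample_filter(mu, block_cov, rng):
    """Draw one k x k filter F with F[i, j] = f[k*i + j]."""
    k = block_cov.shape[0]
    f = rng.multivariate_normal(mu, block_to_vector_cov(block_cov))
    return f.reshape(k, k)

rng = np.random.default_rng(0)
k = 3
block_cov = np.eye(k * k).reshape(k, k, k, k)  # identity covariance in block form
F = sample_filter(np.zeros(k * k), block_cov, rng)
```

Because both the conversion and the final filter assembly are plain reshapes, sampling an entire layer's filters costs no more than one call to a multivariate Gaussian sampler.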
Both use depthwise convolution for spatial mixing, but ConvMixer uses pointwise convolution (equivalently, per-pixel linear layers) for channel mixing while ConvNeXt uses MLPs. ConvMixer uses no internal downsampling, while ConvNeXt includes several downsampling stages. Unlike normal convolutions, the filters in depthwise convolutions act on each input channel separately rather than summing features over input channels. The depth of networks throughout the paper is synonymous with the number of depthwise convolutional layers. All networks investigated use a fixed filter size throughout the network, though the methods we present could easily be extended to the non-uniform case. Further, none of the methods presented concern the biases of convolutional layers.
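To make the depthwise operation concrete, here is a minimal NumPy sketch (our own illustration, not the ConvMixer/ConvNeXt implementation) of a per-channel "valid" convolution, in which no summation occurs across channels:

```python
import numpy as np

def depthwise_conv2d(x, filters):
    """x: (C, H, W); filters: (C, k, k). Each channel is cross-correlated
    with its own single-channel filter; unlike a standard convolution,
    outputs are never summed over input channels. 'Valid' padding for brevity."""
    C, H, W = x.shape
    _, k, _ = filters.shape
    out = np.zeros((C, H - k + 1, W - k + 1))
    for c in range(C):                       # channels processed independently
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[c, i, j] = np.sum(x[c, i:i+k, j:j+k] * filters[c])
    return out

x = np.arange(2 * 5 * 5, dtype=float).reshape(2, 5, 5)
filters = np.ones((2, 3, 3)) / 9.0           # per-channel mean filter
y = depthwise_conv2d(x, filters)             # shape (2, 3, 3)
```

In a deep learning framework this corresponds to a grouped convolution with the number of groups equal to the number of channels (e.g., `groups=C` in PyTorch's `nn.Conv2d`).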



Our contribution is most advantageous for large-filter convolutions, which have become prevalent in recent work: ConvNeXt (Liu et al., 2022b) uses 7 × 7 convolutions, and ConvMixer (Trockman & Kolter, 2022) uses 9 × 9; taking the trend a step further, Ding et al. (2022) uses 31 × 31 convolutions, and Liu et al. (2022a) uses 51 × 51 sparse convolutions. Many other works argue for large-filter convolutions (Wang et al., 2022; Chen et al., 2022; Han et al., 2021).

