UNDERSTANDING THE COVARIANCE STRUCTURE OF CONVOLUTIONAL FILTERS

Abstract

Neural network weights are typically initialized at random from univariate distributions, controlling just the variance of individual weights even in highly-structured operations like convolutions. Recent ViT-inspired convolutional networks such as ConvMixer and ConvNeXt use large-kernel depthwise convolutions whose learned filters have notable structure; this presents an opportunity to study their empirical covariances. In this work, we first observe that such learned filters have highly-structured covariance matrices, and moreover, we find that covariances calculated from a small network may be used to effectively initialize a variety of larger networks of different depths, widths, patch sizes, and kernel sizes, indicating a degree of model-independence to the covariance structure. Motivated by this finding, we then propose a learning-free multivariate initialization scheme for convolutional filters using a simple, closed-form construction of their covariance. Models using our initialization outperform those using traditional univariate initializations, and typically meet or exceed the performance of those initialized from the covariances of learned filters; in some cases, this improvement can be achieved without training the depthwise convolutional filters at all. Our code is available at https://github.com/locuslab/convcov.

Published as a conference paper at ICLR 2023

sampling from the distributions of pre-trained filters, both in terms of final accuracy and time-to-convergence. Models using our initialization often see gains of over 1% accuracy on CIFAR-10 and short-training ImageNet classification; it also leads to small but significant performance gains on full-scale, ≈ 80%-accuracy ImageNet training. Indeed, in some cases our initialization works so well that it outperforms uniform initialization even when the filters aren't trained at all. And our initialization is almost completely free to compute.

Related work. Saxe et al. (2013) proposed to replace random i.i.d. Gaussian weights with random orthogonal matrices, a constraint in which weights depend on each other and are thus, in some sense, "multivariate"; Xiao et al. (2018) also proposed an orthogonal initialization for convolutions. Similarly to these works, our initialization greatly improves the trainability of deep (depthwise) convolutional networks, but is much simpler and constraint-free, being just a random sample from a multivariate Gaussian distribution. Zhang et al. (2022) suggest that the main purpose of pretraining may be to find a good initialization, and craft a mimicking initialization based on observed, desirable information transfer patterns. We similarly initialize convolutional filters to be closer to those found in pre-trained models, but do so in a completely random and simpler manner. Romero et al. (2021) propose an analytic parameterization of variable-size convolutions, based in part on Gaussian filters; while our covariance construction is also analytic and built upon Gaussian filters, we use them to specify the distribution of filters.

Our contribution is most advantageous for large-filter convolutions, which have become prevalent in recent work: ConvNeXt (Liu et al., 2022b) uses 7 × 7 convolutions, and ConvMixer (Trockman & Kolter, 2022) uses 9 × 9; taking the trend a step further, Ding et al. (2022) use 31 × 31, and Liu et al. (2022a) use 51 × 51 sparse convolutions. Many other works argue for large-filter convolutions (

1. INTRODUCTION

Early work in deep learning for vision demonstrated that the convolutional filters in trained neural networks are often highly-structured, in some cases being qualitatively similar to filters known from classical computer vision (Krizhevsky et al., 2017). However, for many years it became standard to replace large-filter convolutions with stacked small-filter convolutions, which have less room for any notable amount of structure. But in the past year, this trend has changed with inspiration from the long-range spatial mixing abilities of vision transformers. Some of the most prominent new convolutional neural networks, such as ConvNeXt and ConvMixer, once again use large-filter convolutions. These new models also completely separate the processing of the channel and spatial dimensions, meaning that the now-single-channel filters are, in some sense, more independent from each other than in previous models such as ResNets. This presents an opportunity to investigate the structure of convolutional filters.

In particular, we seek to understand the statistical structure of convolutional filters, with the goal of more effectively initializing them. Most initialization strategies for neural networks focus simply on controlling the variance of weights, as in Kaiming (He et al., 2015) and Xavier (Glorot & Bengio, 2010) initialization, which neglect the fact that many layers in neural networks are highly-structured, with interdependencies between weights, particularly after training. Consequently, we study the covariance matrices of the parameters of convolutional filters, which we find to have a large degree of perhaps-interpretable structure. We observe that the covariance of filters calculated from pretrained models can be used to effectively initialize new convolutions by sampling filters from the corresponding multivariate Gaussian distribution.
We then propose a closed-form and completely learning-free construction of covariance matrices for randomly initializing convolutional filters from Gaussian distributions. Our initialization is highly effective, especially for larger filters, deeper models, and shorter training times; it usually outperforms both standard uniform initialization techniques and our baseline technique of initializing by

Preliminaries. This work is concerned with depthwise convolutional filters, each of which is parametrized by a $k \times k$ matrix, where $k$ (generally odd) denotes the filter's size. Our aim is to study distributions that arise from convolutional filters in pretrained networks, and to explore properties of distributions whose samples produce strong initial parameters for convolutional layers. More specifically, we hope to understand the covariance among pairs of filter parameters for fixed filter size $k$. This is intuitively expressed as a covariance matrix $\Sigma \in \mathbb{R}^{k^2 \times k^2}$ with block structure: $\Sigma$ has $k \times k$ blocks, where each block $[\Sigma_{i,j}] \in \mathbb{R}^{k \times k}$ corresponds to the covariance between filter pixel $i, j$ and all other $k^2 - 1$ filter pixels. That is, $[\Sigma_{i,j}]_{\ell,m} = [\Sigma_{\ell,m}]_{i,j}$ gives the covariance of pixels $i, j$ and $\ell, m$.

In practice, we restrict our study to multivariate Gaussian distributions, which by convention are considered as distributions over $n$-dimensional vectors rather than matrices, where the distribution $\mathcal{N}(\mu, \Sigma')$ has a covariance matrix $\Sigma' \in \mathbb{S}^n_+$ where $\Sigma'_{i,j} = \Sigma'_{j,i}$ represents the covariance between vector elements $i$ and $j$. To align with this convention when sampling filters, we convert from our original block covariance matrix representation to the representation above by simple reassignment of matrix entries, given by

$$\Sigma'_{ki+j,\,k\ell+m} := [\Sigma_{i,j}]_{\ell,m} \quad \text{for } 1 \le i, j, \ell, m \le k. \tag{1}$$

In this form, we may now easily generate a filter $F \in \mathbb{R}^{k \times k}$ by drawing a sample $f \in \mathbb{R}^{k^2}$ from $\mathcal{N}(\mu, \Sigma')$ and assigning $F_{i,j} := f_{ki+j}$.
In this paper, we assume covariance matrices are in the block form unless we are sampling from a distribution, where the conversion between forms is assumed.

Scope. We restricted our study to the large-filter depthwise convolutions found in new ViT-style CNNs, namely the popular ConvMixer and ConvNeXt architectures. These networks consist of a patch embedding layer followed by alternating spatial- and channel-mixing steps. Both use depthwise convolution for spatial mixing, but ConvMixer uses pointwise convolution (equivalently, linear layers) for channel mixing while ConvNeXt uses MLPs. ConvMixer uses no internal downsampling, while ConvNeXt includes several downsampling stages. Unlike normal convolutions, the filters in depthwise convolutions act on each input channel separately rather than summing features over input channels. The depth of networks throughout the paper is synonymous with the number of depthwise convolutional layers. All networks investigated use a fixed filter size throughout the network, though the methods we present could easily be extended to the non-uniform case. Further, none of the methods presented concern the biases of convolutional layers.

Figure 1: In pre-trained models, the covariance matrices of convolutional filters are highly-structured. Filters in earlier layers tend to be focused, becoming more diffuse as depth increases. Observing the structure of each block, we note that there is often a static, centered negative component and a dynamic positive component that moves according to the block's position. Often, covariances are higher towards the center of the filters.
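
The conversion of Eq. (1) between the block form and the rearranged form, and the sampling step that follows it, can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `block_to_rearranged` and `sample_filters` are hypothetical, and the code uses zero-based indices (so row `k*i + j` runs from 0 to k² - 1), whereas Eq. (1) is written with indices starting at 1.

```python
import numpy as np

def block_to_rearranged(sigma_block: np.ndarray, k: int) -> np.ndarray:
    """Map a block-form covariance (k x k blocks, each k x k) to the
    rearranged form, where entry [k*i + j, k*l + m] is the covariance of
    filter pixels (i, j) and (l, m). Zero-based indices (an assumption)."""
    # As a (k^2, k^2) matrix, block (i, j) occupies rows k*i..k*i+k-1 and
    # columns k*j..k*j+k-1, so reshaping gives axes ordered [i, l, j, m].
    blocks = sigma_block.reshape(k, k, k, k)
    # Reorder the axes to [i, j, l, m], then flatten the pixel pairs.
    return blocks.transpose(0, 2, 1, 3).reshape(k * k, k * k)

def sample_filters(sigma_prime, k, n, rng, mean=None):
    """Draw n filters of shape (k, k) from N(mean, sigma_prime)."""
    mean = np.zeros(k * k) if mean is None else mean
    f = rng.multivariate_normal(mean, sigma_prime, size=n)  # shape (n, k*k)
    # F[i, j] = f[k*i + j] is exactly a row-major reshape.
    return f.reshape(n, k, k)

if __name__ == "__main__":
    k = 3
    rng = np.random.default_rng(0)
    # Toy "learned" filters, used only to obtain a valid covariance matrix.
    toy = rng.normal(size=(500, k, k))
    sigma_prime = np.cov(toy.reshape(500, -1), rowvar=False)
    # The axis swap is its own inverse, so it also maps rearranged -> block.
    sigma_block = block_to_rearranged(sigma_prime, k)
    assert np.allclose(block_to_rearranged(sigma_block, k), sigma_prime)
    print(sample_filters(sigma_prime, k, n=4, rng=rng).shape)
```

Because the reassignment only permutes entries, converting between the two forms costs nothing beyond a reshape and an axis swap.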

2. THE COVARIANCES OF TRAINED CONVOLUTIONAL FILTERS AND THEIR TRANSFERABILITY ACROSS ARCHITECTURES

In this section, we propose a simple starting point in our investigation of convolutional filter covariance structure: using the distribution of filters from pre-trained models to initialize filters in new models, a process we term covariance transfer. In the simplest case, we use a pre-trained model with exactly the same architecture as the model to be initialized; we then show that we can actually transfer filter covariances across very different models.

Basic method. We use $i \in 1, \ldots, D$ to denote the $i$-th depthwise convolutional layer of a model with $D$ layers. We denote the $j \in 1, \ldots, H$ filters of the $i$-th pre-trained layer of the model by $F_{ij}$ for a model with $H$ convolutional filters in a particular layer (i.e., hidden dimension $H$), and we use $F'$ to denote the filters of a new, untrained model. Then the empirical covariance of the filters in layer $i$ is

$$\Sigma_i = \mathrm{Cov}[\mathrm{vec}(F_{i1}), \ldots, \mathrm{vec}(F_{iH})],$$

with the mean $\mu_i$ computed similarly. Then the new model can be initialized by drawing filters from the multivariate Gaussian distribution with parameters $\mu_i, \Sigma_i$:

$$F'_{ij} \sim \mathcal{N}(\mu_i, \Sigma_i) \quad \text{for } j \in 1, \ldots, H,\ i \in 1, \ldots, D.$$

Note that in this section, we use the means of the filters in addition to the covariances to define the distributions from which to initialize. However, we found that the mean can be assumed to be zero with little change in performance, and we focus solely on the covariance in later sections.

Experiment design. We test our initialization methods primarily on ConvMixer since it is simple and exceptionally easy to train on CIFAR-10. We use FFCV (Leclerc et al., 2022) for fast data loading, along with our own implementations of fast depthwise convolution and RandAugment (Cubuk et al., 2020). To demonstrate the performance of our methods across a variety of training times, we train for 20, 50, or 200 epochs with a batch size of 512, and we repeat all experiments with three random seeds.
For all experiments, we use a simple triangular learning rate schedule (see Appendix A.1) with the AdamW optimizer, a learning rate of .01, and weight decay of .01 as in Trockman & Kolter (2022). Most of our CIFAR experiments use a ConvMixer-256/8 with either patch size 1 or 2; a ConvMixer-H/D has precisely D depthwise convolutional layers with H filters each, ideal for testing our initial covariance transfer techniques. We train ConvMixers using popular filter sizes 3, 7, and 9, as well as 15. We also test our methods on ConvNeXt (Liu et al., 2022b), which includes downsampling unlike ConvMixer; we use a patch size of 1 or 2 with ConvNeXt rather than the default 4 to accommodate relatively small CIFAR-10 images, and the default 7 × 7 filters.

For most experiments, we provide two baselines for comparison: standard uniform initialization, the default in PyTorch (He et al., 2015), as well as directly transferring the learned filters from a pre-trained model to the new model. In most cases, we expect new random initializations to fall between the performance of uniform and direct transfer initializations. For our covariance transfer experiments, we trained a variety of reference models from which to compute covariances; these are all trained for the full 200 epochs using the same settings as above.

Frozen filters. even when the filters are frozen; that is, the filter weights remain unchanged over the course of training, receiving no gradient updates. As we are initializing filters from the distribution of trained filters, we suspect that additional training may not be completely necessary. Consequently, in all experiments we investigate both models with thawed filters as well as their frozen counterparts. Freezing filters removes one of the two gradient calculations from depthwise convolution, resulting in substantial training speedups as kernel size increases (see Figure 2).
ConvMixer-512/12 with kernel size 9 × 9 is around 20% faster, while 15 × 15 is around 40% faster. Further, good performance in the frozen filter setting suggests that an initialization technique is highly effective.
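
The basic covariance transfer method described earlier in this section (estimate a mean and covariance per layer from the pre-trained filters, then sample new filters from the corresponding Gaussian) might look roughly as follows. This is a sketch under stated assumptions: filters are held as plain NumPy arrays of shape (H, k, k) per layer, and the helper names `fit_filter_gaussians` and `covariance_transfer` are hypothetical rather than taken from the released code.

```python
import numpy as np

def fit_filter_gaussians(pretrained_layers):
    """Per-layer empirical mean and covariance of flattened filters.

    `pretrained_layers` is a list with one (H, k, k) array per depthwise
    convolutional layer (an assumed storage layout).
    """
    stats = []
    for filters in pretrained_layers:
        flat = filters.reshape(filters.shape[0], -1)   # (H, k*k), row-major vec
        mu = flat.mean(axis=0)
        sigma = np.cov(flat, rowvar=False)             # (k*k, k*k)
        stats.append((mu, sigma))
    return stats

def covariance_transfer(stats, width, k, rng, zero_mean=False):
    """Sample `width` new filters per layer from N(mu_i, Sigma_i)."""
    new_layers = []
    for mu, sigma in stats:
        mean = np.zeros_like(mu) if zero_mean else mu
        # An empirical covariance can be rank-deficient when H < k*k, so
        # the validity check is skipped here (a choice made for this sketch).
        f = rng.multivariate_normal(mean, sigma, size=width,
                                    check_valid="ignore")
        new_layers.append(f.reshape(width, k, k))
    return new_layers

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for a narrow pre-trained model: 8 layers of 32 filters (9 x 9).
    reference = [rng.normal(size=(32, 9, 9)) for _ in range(8)]
    stats = fit_filter_gaussians(reference)
    # Initialize a wider model (256 filters per layer) from the same stats.
    init = covariance_transfer(stats, width=256, k=9, rng=rng)
    print(len(init), init[0].shape)
```

The sampling step does not depend on the width of the reference model, which is why the same procedure covers transfer from a narrower model: the covariances are simply estimated from fewer filters.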

2.1. RESULTS

The simplest case of covariance transfer (from exactly the same architecture) is a fairly effective initialization scheme for convolutional filters. In Fig. 3, note that this case of covariance transfer (group B) results in somewhat higher accuracies than uniform initialization (group A), particularly for 20-epoch training; it also substantially improves the case for frozen filters. Across all trials, the effect of using this initialization is higher for larger kernel sizes. In Fig. 8, we show that covariance transfer (gold) initially speeds up convergence, but the advantage over uniform initialization quickly fades. As expected, covariance transfer tends to fall between the performance of direct transfer, where we directly initialize using the filters of the pre-trained model, and default uniform initialization (see group D in Fig. 3 and the green curves in Fig. 8).

However, we acknowledge that it is not appealing to pre-train models just for an initialization technique with rather marginal gains, so we explore the feasibility of covariance transfer from smaller models, both in terms of width and depth.

Narrower models. We first see if it's possible to train a narrower reference model to calculate filter covariances to initialize a wider model; for example, using a ConvMixer-32/8 to initialize a ConvMixer-256/8. In Figure 4, we show that the optimal performance surprisingly comes from the covariances of a smaller model. For filter sizes greater than 3, the covariance transfer performance increases with width until width 32, and then decreases for width 256 for both the thawed and frozen cases. We plot this method in Fig. 3 (group C), and note that it almost uniformly exceeds the performance of covariance transfer from the same-sized model. Note that the method does not change; the covariances are simply calculated from a smaller sample of filters.

Shallower models.
Covariance transfer from a shallow model to a deeper model is somewhat more complicated, as there is no longer a one-to-one mapping between layers. Instead, we linearly interpolate the covariance matrices to the desired depth (see Appendix A.1 for more details). Surprisingly, we find that this technique is also highly effective: for example, for a 32-layer-deep ConvMixer, the optimal covariance transfer result is from an 8-layer-deep ConvMixer, and 4-deep models are also quite effective (see Figure 4).

Different patch sizes. Similarly, it is straightforward to transfer covariances between models with different patch sizes. We find that initializing ConvMixers with 1 × 1 patches from filter covariances of ConvMixers with 2 × 2 patches leads to a decrease in performance relative to using a reference model of the correct patch size; however, using the filters of a 1 × 1 patch-size ConvMixer to initialize a 2 × 2 patch-size ConvMixer increases performance (see group b vs. group B in Fig. 9). Yet, in both cases, the performance is better than uniform initialization.

Different filter sizes. Covariance transfer between models with different filter sizes is more challenging, as the covariance matrices have different sizes. In the block form, we mean-pad or clip each block to the target filter size, and then bilinearly interpolate over the blocks to reach a correctly-sized covariance matrix. This technique is still better than uniform initialization for filter sizes larger than 3 (which naturally has very little structure to transfer), especially in the frozen case (see Fig. 9).

[Figure: Test Accuracy (%) for initialization groups A through E on ConvMixer models.]

Discussion.
We have demonstrated that it is possible to initialize filters from the covariances of pre-trained models of different widths, depths, patch sizes, and kernel sizes; while some of these techniques perform better than others, they are almost all better than uniform initialization. Our observations indicate that the optimal choice of reference model is narrower or shallower, and perhaps with a smaller patch size or kernel size. We also found that covariance transfer from ConvMixers trained on ImageNet led to greater performance still (Appendix A). This suggests that the best covariances for filter initialization may be quite unrelated to the target model, i.e., model independent.

3. D.I.Y. FILTER COVARIANCES

Ultimately, the above methods for initializing convolutional filters via transfer are limited by the necessity of a trained network from which to form a filter distribution, which must be accessible at initialization. We thus use observations on the structure of filter covariance matrices to construct our own covariance matrices from scratch. Using our construction, we propose a depth-dependent but simple initialization strategy for convolutional filters that greatly outperforms previous techniques.

Visual observations. Filter covariance matrices in pre-trained ConvMixers and ConvNeXts have a great deal of structure, which we observe across models with different patch sizes, architectures, and data sets; see Figs. 1 and 32 for examples. In both the block and rearranged forms of the covariance matrices, we noticed clear repetitive structure, which led to an initial investigation on modeling covariances via Kronecker factorizations; see Appendix A for experimental results. Beyond this, we first note that the overall variance of filters tends to increase with depth, until breaking down towards the last layer.
Second, we note that the blocks of the covariances often have a static negative component in the center, with a dynamic positive component whose position mirrors that of the block itself. Finally, the covariance of filter parameters is greater in their center, i.e., covariance matrices are at first centrally-focused and become more diffuse with depth. These observations agree with intuition about the structure of convolutional filters: most filters have the greatest weight towards their center, and their parameters are correlated with their neighbors.

Constructing covariances. With these observations in mind, we propose a construction of covariance matrices. We fix the (odd) filter size $k \in \mathbb{N}^+$, let $\mathbf{1} \in \mathbb{R}^{k \times k}$ be the all-ones matrix, and, as a building block for our initialization, use unnormalized Gaussian-like filters $Z_\sigma \in \mathbb{R}^{k \times k}$ with a single variance parameter $\sigma$, defined elementwise by

$$(Z_\sigma)_{i,j} := \exp\left(-\frac{(i - \lfloor k/2 \rfloor)^2 + (j - \lfloor k/2 \rfloor)^2}{2\sigma}\right) \quad \text{for } 1 \le i, j \le k.$$

Such a construction produces filters similar to those observed in the blocks of the Layer #5 covariance matrix in Fig. 1. To capture the dynamic component that moves according to the position of its block, we define the block matrix $C \in \mathbb{R}^{k^2 \times k^2}$ with $k \times k$ blocks by

$$[C_{i,j}] = \mathrm{Shift}(Z_\sigma, i, j),$$

where the Shift operation translates each element of the matrix $i$ and $j$ positions forward in their respective dimensions; see Appendix E for details. We then define two additional components, both constructed from Gaussian filters: a static component $S = \mathbf{1} \otimes Z_\sigma \in \mathbb{R}^{k^2 \times k^2}$ and a blockwise mask component $M = Z_\sigma \otimes \mathbf{1} \in \mathbb{R}^{k^2 \times k^2}$, which encodes higher variance as pixels approach the center of the filter. Using these components and our intuition, we first consider $\hat{\Sigma} = M \odot (C - \tfrac{1}{2} S)$, where $\odot$ is an elementwise product.
While this adequately represents what we view to be the important structural components of filter covariance matrices, it does not satisfy the property $[\hat{\Sigma}_{i,j}]_{\ell,m} = [\hat{\Sigma}_{\ell,m}]_{i,j}$ (i.e., covariance matrices must be symmetric, accounting for our block representation). Consequently, we instead calculate its symmetric part, using the notation as follows to denote a "block-transpose":

$$\Sigma^B = \tilde{\Sigma} \iff [\Sigma_{i,j}]_{\ell,m} = [\tilde{\Sigma}_{\ell,m}]_{i,j} \quad \text{for } 1 \le i, j, \ell, m \le k. \tag{6}$$

Equivalently, this is the perfect shuffle permutation such that $(X \otimes Y)^B = Y \otimes X$ with $X, Y \in \mathbb{R}^{k \times k}$. First, we note that $C^B = C$ due to the definition of the shift operation used in Eq. 5 (see Appendix E). Then, noting that $S^B = M$ and $M^B = S$ by the previous rule, we define our construction of $\Sigma$ to be the symmetric part of $\hat{\Sigma}$ (where $C$, $S$, $M$ are implicitly parameterized by the $\sigma$ of $Z_\sigma$):

\begin{align}
\Sigma = \tfrac{1}{2}\left(\hat{\Sigma} + \hat{\Sigma}^B\right) &= \tfrac{1}{2}\left[M \odot (C - \tfrac{1}{2} S) + \left(M \odot (C - \tfrac{1}{2} S)\right)^B\right] \tag{7} \\
&= \tfrac{1}{2}\left[M \odot (C - \tfrac{1}{2} S) + M^B \odot (C^B - \tfrac{1}{2} S^B)\right] = \tfrac{1}{2}\left[M \odot (C - \tfrac{1}{2} S) + S \odot (C - \tfrac{1}{2} M)\right] \tag{8} \\
&= \tfrac{1}{2}\left[M \odot (C - S) + S \odot C\right]. \tag{9}
\end{align}

While $\Sigma$ is now symmetric (in the rearranged form of Eq. 1), it is not positive semi-definite, but can easily be projected to $\mathbb{S}^{k^2}_+$, as is often done automatically by multivariate Gaussian procedures. We illustrate our construction in Fig. 5, and provide an implementation in Fig. 30.

[Figure caption fragment: Here we use the parameters $\sigma_0 = .5$, $v_\sigma = .5$, $a_\sigma = 3$.]

Completing the initialization. As explained in Fig. 1, we observed that in pre-trained models, the filters become more "diffuse" as depth increases; we capture this fact in our construction by increasing the parameter $\sigma$ with depth according to a simple quadratic schedule. Let $d$ be the percentage depth, i.e., $d = \frac{i-1}{D-1}$ for the $i$-th convolutional layer of a model with $D$ total such layers. Then for layer $i$, we parameterize our covariance construction by a variance schedule:

$$\sigma(d) = \sigma_0 + v_\sigma d + \tfrac{1}{2} a_\sigma d^2,$$

where $\sigma_0, v_\sigma, a_\sigma$ jointly describe how the covariance evolves with depth.
Then, for each layer $i \in 1, \ldots, D$, we compute $d = \frac{i-1}{D-1}$ and initialize the filters as $F_{i,j} \sim \mathcal{N}(0, \Sigma_{\sigma(d)})$ for $j \in 1, \ldots, H$. We illustrate our complete initialization scheme in Figure 6.
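
To make the steps of this section concrete, the following NumPy sketch assembles the Gaussian filter, the components C, S, and M, the symmetrized covariance ½[M ⊙ (C − S) + S ⊙ C], a projection onto the positive semi-definite cone, and the quadratic depth schedule. It is an illustration under explicit assumptions, not the implementation referenced in Fig. 30: all function names are hypothetical, indices are zero-based so that the Gaussian is centered at `k // 2`, and the Shift operation is read as a non-wrapping translation that places the Gaussian peak of block (i, j) at within-block position (i, j), which differs from the modular indexing written in Appendix E.

```python
import numpy as np

def gaussian_filter(k, sigma):
    """Unnormalized Gaussian-like k x k filter centered at k // 2."""
    d2 = (np.arange(k) - k // 2) ** 2
    return np.exp(-(d2[:, None] + d2[None, :]) / (2.0 * sigma))

def covariance_construction(k, sigma):
    """Rearranged-form covariance: entry [k*i + j, k*l + m] relates
    pixels (i, j) and (l, m). Not yet positive semi-definite."""
    z = gaussian_filter(k, sigma)
    idx = np.arange(k)
    diff2 = (idx[:, None] - idx[None, :]) ** 2
    # c4[i, j, l, m]: Gaussian bump whose peak sits at (l, m) = (i, j)
    # (assumed reading of Shift; symmetric under (i, j) <-> (l, m)).
    c4 = np.exp(-(diff2[:, None, :, None] + diff2[None, :, None, :])
                / (2.0 * sigma))
    m4 = np.broadcast_to(z[:, :, None, None], (k,) * 4)  # M = Z (x) 1: Z[i, j]
    s4 = np.broadcast_to(z[None, None, :, :], (k,) * 4)  # S = 1 (x) Z: Z[l, m]
    sigma4 = 0.5 * (m4 * (c4 - s4) + s4 * c4)
    return sigma4.reshape(k * k, k * k)

def project_psd(mat):
    """Clip negative eigenvalues to zero (projection onto the PSD cone)."""
    w, v = np.linalg.eigh((mat + mat.T) / 2.0)
    return (v * np.clip(w, 0.0, None)) @ v.T

def sigma_schedule(d, sigma0, v_sigma, a_sigma):
    """Quadratic variance schedule over fractional depth d in [0, 1]."""
    return sigma0 + v_sigma * d + 0.5 * a_sigma * d ** 2

def init_depthwise_filters(depth, width, k, sigma0, v_sigma, a_sigma, rng):
    """One (width, k, k) array of sampled filters per depthwise layer."""
    layers = []
    for i in range(depth):
        d = i / (depth - 1) if depth > 1 else 0.0
        s = sigma_schedule(d, sigma0, v_sigma, a_sigma)
        cov = project_psd(covariance_construction(k, s))
        # Tiny negative eigenvalues from round-off are harmless here.
        f = rng.multivariate_normal(np.zeros(k * k), cov, size=width,
                                    check_valid="ignore")
        layers.append(f.reshape(width, k, k))
    return layers

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Schedule values reported in Appendix C for 2 x 2-patch CIFAR-10 models.
    filters = init_depthwise_filters(depth=8, width=256, k=9, sigma0=0.08,
                                     v_sigma=0.37, a_sigma=2.9, rng=rng)
    print(len(filters), filters[0].shape)
```

Under this reading, the variance of pixel (i, j) before projection is Z(1 − Z/2) evaluated at that pixel, which grows toward the center of the filter, matching the observation that covariances are higher there.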

4. RESULTS

In this section, we present the performance of our initialization within ConvMixer and ConvNeXt on CIFAR-10 and ImageNet classification, finding it to be highly effective, particularly for deep models with large filters. Our new initialization overshadows our previous covariance transfer results. Settings of initialization hyperparameters σ 0 , v σ , and a σ were found and fixed for CIFAR-10 experiments, while two such settings were used for ImageNet experiments. Appendix C.1 contains full details on our (relatively small) hyperparameter searches and experimental setups, as well as empirical evidence that our method is robust to a large swath of hyperparameter settings. Additional experiments on more datasets and baseline initializations may be found in Appendix B.

4.1. CIFAR-10 RESULTS

Thawed filters. In Fig. 3, we show that large-kernel models using our initialization (group E) outperform those using uniform initialization (group A), covariance transfer (groups B, C), and even those directly initializing via learned filters (group D). For 2 × 2-patch models (200 epochs), relative to uniform, our initialization causes up to a 1.1% increase in accuracy for ConvMixer-256/8, and up to 1.6% for ConvMixer-256/24. The effect size increases with the filter size, and is often more prominent for shorter training times. Results are similar for 1 × 1-patch models, but with a smaller increase for 7 × 7 filters (0.15% vs. 0.5%). Our initialization has the same effects for ConvNeXt (Fig. 7). However, our method works poorly for 3 × 3 filters, which we believe have fundamentally different structure than larger filters; this setting is better-served by our original covariance transfer techniques.

In addition to improving the final accuracy, our initialization also drastically speeds up convergence of models with thawed filters (see Fig. 8), particularly for deeper models. A ConvMixer-256/16 with 2 × 2 patches using our initialization reaches 90% accuracy in approximately 50% fewer epochs than uniform initialization, and around 25% fewer than direct learned filter transfer. The same occurs, albeit to a lesser extent, for 1 × 1 patches; note, however, that for this experiment we used the same initialization parameters for both patch sizes to demonstrate robustness to parameter choices.

Frozen filters. Our initialization leads to even more surprising effects in models with frozen filters. In Fig. 3, we see that frozen-filter 2 × 2-patch models using our initialization often exceed the performance of their uniform, thawed-filter counterparts by a significant margin of 0.4%-2.0% for 200 epochs, and an even larger margin of 0.6%-5.0% for 20 epochs (for large filters). That is, group E (frozen) consistently outperforms groups A-D (thawed), and in some cases even group E (thawed), especially for the deeper 24-layer ConvMixer. While this effect breaks down for 1 × 1-patch models, such frozen-filter models still see accuracy increases of 0.6%-3.5%. However, the effect can still be seen for 1 × 1-patch ConvNeXts (Fig. 7). Also note that frozen-filter models can be up to 40% faster to train (see Fig. 2), and may be more robust (Cazenavette et al.).

4.2. IMAGENET EXPERIMENTS

Our initialization performs extremely well on CIFAR-10 for large-kernel models, almost always helping and rarely hurting. Here, we explore if the performance gains transfer to larger-scale ImageNet models. We observe in Fig. 32, Appendix F that filter covariances for such models have finer-grained structure than models trained on CIFAR-10, perhaps due to using larger patches. Nonetheless, our initialization leads to quite encouraging improvements in this setting.

Experiment design. We used the "A1" training recipe from Wightman et al. (2021), with cross-entropy loss, fewer epochs, and a triangular LR schedule as in Trockman & Kolter (2022). We primarily demonstrate our initialization for 50-epoch training, as the difference between initializations is most pronounced for shorter training times. We also present two full, practical-scale 150-epoch experiments on large models. We also included covariance transfer experiments in Appendix F.

Thawed filters. On models trained for 50 epochs with thawed filters, our initialization improves the final accuracy by 0.4%-3.8% (see Table 1). For the relatively-shallow ConvMixer-512/12 on which we tuned the initialization parameters, we see a gain of just 0.4%; however, when increasing the depth to 24 or 32, we see larger gains of 1.8% and 3.8%, respectively, and a similar trend among the wider ConvMixer-1024 models. Our initialization also boosts the accuracy of the 18-layer ConvNeXt-Tiny from 76.0% to 77.1%; however, it decreased the accuracy of the smaller, 12-layer ConvNeXt-Atto. This is perhaps unsurprising, seeing as our initialization seems to be more helpful for deep models, and we used hyperparameters optimized for a model with a substantially different patch and filter size.

Our initialization is also beneficial for more-practical 150-epoch training, boosting accuracy by around 0.1% on both ConvMixer-1536/24 and ConvNeXt-Tiny (see Table 1, bottom rows).
While the effect is small, this demonstrates that our initialization is still helpful even for longer training times and very wide models. We expect that within deeper models and with slightly more parameter tuning, our initialization could lead to still larger gains in full-scale ImageNet training.

Frozen filters. Our initialization is extremely helpful for models with frozen filters. Using our initialization, the difference between thawed and frozen-filter models decreases with increasing depth, i.e., it leads to 2-11% improvements over models with frozen, uniformly-initialized filters. For ConvMixer-1024/32, the accuracy improves from 64.9% to 73.1%, which is over 1% better than the corresponding thawed, uniformly-initialized model, and only 2% from the best result using our initialization. This mirrors the effects we saw for deeper models on our earlier CIFAR-10 experiments. We see a similar effect for ConvNeXt-Tiny, with the frozen version using our initialization achieving 75.2% accuracy vs. the thawed 76.0%. In other words, our initialization so effectively captures the structure of convolutional filters that it is hardly necessary to train them after initialization; one benefit of this is that it substantially speeds up training for large-filter convolutions.

5. CONCLUSION

In this paper, we proposed a simple, closed-form, and learning-free initialization scheme for large depthwise convolutional filters. Models using our initialization typically reach higher accuracies more quickly than uniformly-initialized models. We also demonstrated that our random initialization of convolutional filters is so effective, that in many cases, networks perform nearly as well (or even better) if the resulting filters do not receive gradient updates during training. Moreover, like the standard uniform initializations generally used in neural networks, our technique merely samples from a particular statistical distribution, and it is thus almost completely computationally free. In summary, our initialization technique for the increasingly-popular large-kernel depthwise convolution operation almost always helps, rarely hurts, and is also free.


Covariance structure. As a first step towards modeling the structure of filter covariances, we replaced covariances with their Kronecker-factorized counterparts using the rearranged form of the covariance matrix defined in Eq. (1), i.e., $\Sigma = A \otimes A$ where $A \in \mathbb{S}^k_+$. Surprisingly, this slightly improved performance over unfactorized covariance transfer (see Fig. 12), suggesting that filter covariances are not only eminently transferable for initialization, but that their core structure may be simpler than meets the eye. Kronecker factorizations were computed via gradient descent minimizing the mean squared error.
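
As a rough illustration of fitting such a factorization by gradient descent on the mean squared error, the toy sketch below recovers a single factor A from a target that is exactly of the form A ⊗ A. It makes several assumptions that go beyond the text: the function name `fit_kronecker_factor`, the identity initialization, the step size, and the number of steps are all hypothetical choices tuned for this small example, and positive semi-definiteness of A is not enforced.

```python
import numpy as np

def fit_kronecker_factor(target, k, lr=0.1, steps=3000):
    """Gradient descent on mean((kron(A, A) - target)^2) over A (k x k)."""
    a = np.eye(k)  # hypothetical starting point
    losses = []
    for _ in range(steps):
        resid = np.kron(a, a) - target
        losses.append(float(np.mean(resid ** 2)))
        # kron(A, A)[i*k + p, j*k + q] = A[i, j] * A[p, q], so after this
        # reshape the residual is indexed [i, p, j, q].
        r4 = resid.reshape(k, k, k, k)
        # Product rule: A appears as both the first and the second factor.
        grad = (np.einsum("apbq,pq->ab", r4, a)
                + np.einsum("iajb,ij->ab", r4, a))
        a = a - lr * (2.0 / resid.size) * grad
    return a, losses

if __name__ == "__main__":
    # Toy target with a known factor, used only to check recovery.
    a_true = np.array([[2.0, 0.5, 0.0],
                       [0.5, 1.5, 0.3],
                       [0.0, 0.3, 1.0]])
    target = np.kron(a_true, a_true)
    a_hat, losses = fit_kronecker_factor(target, k=3)
    print(losses[0], losses[-1])
```

For an empirical covariance the fit would not be exact, and the step size would likely need to be adapted to the scale of the target matrix.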


Figure 22: We investigated how our initialization affects data efficiency by training on random (smaller) subsets of CIFAR-10. Each trial is averaged over 3 such subsets. In the very low-data setting (a), we trained for 300 epochs to compensate for the smaller number of iterations overall. In (b), we trained for 100 epochs. In some cases, using our initialization is comparable to doubling the number of training points.

C HYPERPARAMETER GRID SEARCHES & EXPERIMENTAL SETUP

CIFAR-10 hyperparameter search. We chose an initial setting of our method's three hyperparameters via visual inspection, and then refined them via small-scale grid searches. For CIFAR-10 experiments, we searched over parameters for ConvMixer-256/8 with frozen 9 × 9 filters trained for 20 epochs, and chose $\sigma_0 = .08$, $v_\sigma = .37$, $a_\sigma = 2.9$ for 2 × 2-patch models, and found the optimal parameters for 1 × 1-patch models to be approximately doubled. However, note that our initialization is quite robust to different parameter settings, with the difference from our doubling choice being less than 0.1% (see Figure 23). We used the same parameters across all kernel sizes, as well as for ConvNeXt, a choice which is likely sub-optimal; our search only serves as a rough heuristic.

ImageNet-1k hyperparameter search. We did a small grid search using a ConvMixer-512/12 with 14 × 14 patches and 9 × 9 filters trained for 10 epochs on ImageNet-1k (see Appendix F), from which we chose two candidate settings: $\sigma_0 = .15$, $v_\sigma = .5$, $a_\sigma = .25$ for frozen-filter models and $\sigma_0 = .15$, $v_\sigma = 0.25$, $a_\sigma = 1.0$ for thawed models. We use these parameters for all the ImageNet experiments, even for models with different patch and kernel sizes (e.g., ConvNeXt). This demonstrates that hyperparameter tuning is optional for our technique; its transferability is not surprising given our results in Sec.

E SHIFT FUNCTION DEFINITION & PROOF
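The shift operator and the block-symmetry property established in this appendix can also be checked numerically. The following is a minimal NumPy sketch, not the authors' code. The helper names (`shift`, `gaussian_kernel`, `build_blocks`, `block_transpose`) and the exact Gaussian parametrization are assumptions for illustration. Indices are 0-based here rather than 1-based, which does not affect the argument because the indices only enter through sums modulo $k$.

```python
import numpy as np

def shift(Z, dx, dy):
    """Circular shift: shift(Z, dx, dy)[i, j] == Z[(i + dx) % k, (j + dy) % k]."""
    return np.roll(Z, shift=(-dx, -dy), axis=(0, 1))

def gaussian_kernel(k, sigma):
    """Hypothetical Gaussian bump centered at (0, 0), the top-left entry, with wraparound."""
    idx = np.arange(k)
    d = np.minimum(idx, k - idx)         # circular distance to index 0
    return np.exp(-(d[:, None] ** 2 + d[None, :] ** 2) / (2.0 * sigma))

def build_blocks(Z):
    """Return C as a 4-D array with C[i, j] = shift(Z, i, j), i.e. block (i, j)."""
    k = Z.shape[0]
    C = np.empty((k, k, k, k))
    for i in range(k):
        for j in range(k):
            C[i, j] = shift(Z, i, j)
    return C

def block_transpose(C):
    """The (.)^B operation: entry (l, m) of block (i, j) becomes entry (i, j) of block (l, m)."""
    return C.transpose(2, 3, 0, 1)

if __name__ == "__main__":
    C = build_blocks(gaussian_kernel(k=5, sigma=1.5))
    print("block-symmetric:", np.array_equal(C, block_transpose(C)))
```

The check passes for any $Z$, not only a Gaussian, since it depends only on the commutativity of the index sums.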

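For a given matrix $Z \in \mathbb{R}^{k \times k}$ (say, a Gaussian kernel centered at $(0, 0)$, the top left of the filter), we assume the shift operator is defined as follows:

$$\mathrm{Shift}(Z, \delta x, \delta y)_{i,j} = Z_{(i+\delta x) \bmod k,\ (j+\delta y) \bmod k}.$$

Let $[C_{i,j}] = \mathrm{Shift}(Z_\sigma, i, j)$, and let the operation $(\cdot)^B$ be defined by $\Sigma^B = \Sigma' \iff [\Sigma'_{i,j}]_{\ell,m} = [\Sigma_{\ell,m}]_{i,j}$. Then, writing $Z$ for $Z_\sigma$,

$$[C_{i,j}]_{\ell,m} = \mathrm{Shift}(Z, i, j)_{\ell,m} = Z_{(i+\ell) \bmod k,\ (j+m) \bmod k} \tag{14}$$

$$[C_{\ell,m}]_{i,j} = \mathrm{Shift}(Z, \ell, m)_{i,j} = Z_{(\ell+i) \bmod k,\ (m+j) \bmod k}. \tag{15}$$

Since addition is commutative, the right-hand sides of (14) and (15) coincide. This shows that $[C_{i,j}]_{\ell,m} = [C_{\ell,m}]_{i,j}$ for all $1 \le i, j, \ell, m \le k$ (i.e., $C$ is "block-symmetric"), which shows $C = C^B$.

F ADDITIONAL IMAGENET EXPERIMENTS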



Figure 2: The backward pass is faster with frozen filters.

Figure 4: CIFAR-10 experimental results from initializing via covariances from narrower (top) and shallower (bottom) models. The numeric annotations represent the width (top) and depth (bottom) of the pre-trained model we use to initialize. U represents uniform initialization.

Figure 5: Our convolutional covariance matrix construction with σ = π/2.

Figure 7: Our init also improves ConvNeXt's accuracy on CIFAR-10 (group E vs. A).

Figure 8: Convergence plots: each data point runs through a full cycle of the LR schedule, and all points are averaged over three trials with shaded standard deviation.

[Figure legend: across patch sizes; ConvMixer-256/8, patch size 2 × 2; (a) Cov. CM-256/8 1 × 1, (b) Cov. CM-32/8 1 × 1, (c) Tfr. CM-256/8 1 × 1]

Figure 11: Convergence plots: each data point runs through a full cycle of the LR schedule, and all points are averaged over three trials with shaded standard deviation.

Figure 13: Filters learned or generated for ConvMixer-256/8 with 2 × 2 patches and 9 × 9 filters trained on CIFAR-10: learned filters (left), filters sampled from the Gaussian defined by the empirical covariance matrix of learned filters (center), and filters from our initialization technique (right).

Figure 20: Our initialization is also helpful on Tiny ImageNet. Here, for computational efficiency, we increased the patch size to 4 × 4 to accommodate the larger input size of 64 × 64.

Figure 21: Compared to other initialization techniques, ours results in faster convergence when training a ConvMixer-256/16 on CIFAR-10 and CIFAR-100. That is, one can often train models for fewer epochs when using our initialization to achieve results comparable to those from using other initialization techniques.

Figure 23: Grid search over initialization parameters $\sigma_0$, $v_\sigma$, $a_\sigma$ for ConvMixer-256/8 with 9 × 9 frozen filters and 2 × 2 patches trained for 20 epochs on CIFAR-10. Note that the performance of uniform initialization is only ≈85%, i.e., almost all choices result in some improvement.

Figure 24: Grid search over initialization parameters $\sigma_0$, $v_\sigma$, $a_\sigma$ for ConvMixer-256/8 with 9 × 9 frozen filters and 1 × 1 patches trained for 20 epochs on CIFAR-10. Note that the performance of uniform initialization is only ≈88%, i.e., almost all choices result in some improvement.

Figure 26: Grid search over initialization parameters $\sigma_0$, $v_\sigma$, $a_\sigma$ for ConvNeXt-atto on CIFAR-10 with frozen filters and 1 × 1 patches trained for 20 epochs, using the "sawtooth" variance schedule (see Fig. 27) to account for downsampling layers. While this perhaps shows better robustness to parameter changes than Fig. 26, the effect could also be due to effectively dividing the parameters by two.

Figure 31: Code to use our covariance construction and variance schedule to initialize depthwise convolutional layers in PyTorch. wconv is the weight of a depthwise convolutional layer (nn.Conv2d), and d ∈ [0, 1] is its depth as a fraction of the total depth.
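As a complement, and explicitly not a reproduction of the code in Figure 31, the following minimal NumPy sketch illustrates only the generic last step shared by covariance-based initializations: drawing filters from a zero-mean Gaussian with a given $k^2 \times k^2$ covariance and arranging them in the shape of a depthwise `nn.Conv2d` weight, `(channels, 1, k, k)`. The function names, the row-major flattening of each filter, and the eigendecomposition-based sampler are assumptions for illustration; the closed-form covariance construction and variance schedule are not shown.

```python
import numpy as np

def sample_depthwise_filters(cov, channels, k, rng):
    """Draw `channels` filters from N(0, cov) and shape them like a depthwise conv weight."""
    cov = 0.5 * (cov + cov.T)                        # guard against rounding asymmetry
    evals, evecs = np.linalg.eigh(cov)
    # Matrix square root via eigendecomposition: root @ root.T approximates cov.
    # Slightly negative eigenvalues from rounding are clipped to zero.
    root = evecs * np.sqrt(np.clip(evals, 0.0, None))
    z = rng.standard_normal((channels, k * k))       # i.i.d. standard normal draws
    flat = z @ root.T                                # each row has covariance cov
    return flat.reshape(channels, 1, k, k)

def empirical_covariance(filters):
    """Covariance of flattened filters; input of shape (n, 1, k, k), output (k^2, k^2)."""
    n = filters.shape[0]
    return np.cov(filters.reshape(n, -1), rowvar=False)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 3
    B = rng.standard_normal((k * k, k * k))
    cov = B @ B.T / (k * k)                          # an arbitrary PSD test covariance
    w = sample_depthwise_filters(cov, channels=256, k=k, rng=rng)
    print(w.shape)
```

The same `empirical_covariance` helper, applied to the learned filters of a trained layer, would give the kind of empirical covariance from which filters can be resampled.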

Figure 32: Covariance matrices from a ConvMixer trained on ImageNet exhibit similar structure to those of ConvMixers trained on CIFAR-10; however, later layers tend to have more structure, including a "checkerboard" pattern in each block.

Cazenavette et al. noticed that ConvMixers with 3 × 3 filters perform well

ImageNet-1k accuracy from various architectures and initializations. "Ours" denotes our proposed initialization. Bold indicates best within architecture and category (frozen or thawed).

Initializing via covariances from models with different patch (left) and filter sizes (right). Left: Lowercase denotes initializing from patch size 1 × 1, and uppercase 2 × 2. Right: Annotations denote the reference filter size, U is uniform.


ConvMixer performance on ImageNet-1k training with 10 epochs. Our initialization performs comparably to loading covariance matrices from previously-trained models (which were trained for 150 epochs).

ImageNet 50-epoch training

CIFAR-10 results for ConvMixer-256/8 with patch size 2. Bold denotes the highest per group, and blue bold denotes the second highest.

CIFAR-10 results for ConvMixer-256/24 with patch size 2. Bold denotes the highest per group, and blue bold denotes the second highest.

CIFAR-10 results for ConvMixer-256/8 with patch size 1. Bold denotes the highest per group, and blue bold denotes the second highest.

CIFAR-10 results for ConvNeXt-atto with patch size 1. Bold denotes the highest per group, and blue bold denotes the second highest.

CIFAR-10 results for ConvNeXt-atto with patch size 2. Bold denotes the highest per group, and blue bold denotes the second highest.

