REPAIR: RENORMALIZING PERMUTED ACTIVATIONS FOR INTERPOLATION REPAIR

Abstract

In this paper we look into the conjecture of Entezari et al. (2021) which states that if the permutation invariance of neural networks is taken into account, then there is likely no loss barrier to the linear interpolation between SGD solutions. First, we observe that neuron alignment methods alone are insufficient to establish low-barrier linear connectivity between SGD solutions due to a phenomenon we call variance collapse: interpolated deep networks suffer a collapse in the variance of their activations, causing poor performance. Next, we propose REPAIR (REnormalizing Permuted Activations for Interpolation Repair), which mitigates variance collapse by rescaling the preactivations of such interpolated networks. We explore the interaction between our method and the choice of normalization layer, network width, and depth, and demonstrate that using REPAIR on top of neuron alignment methods leads to 60%-100% relative barrier reduction across a wide variety of architecture families and tasks. In particular, we report a 74% barrier reduction for ResNet50 on ImageNet and a 90% barrier reduction for ResNet18 on CIFAR-10. Our code is available at https://github.com/KellerJordan/REPAIR.

1. INTRODUCTION

Training a neural network corresponds to optimizing a highly non-linear function by navigating a complex loss landscape with numerous minima, symmetries and saddles (Zhang et al., 2017; Keskar et al., 2017; Draxler et al., 2018; Şimşek et al., 2021). Overparameterization is one of the reasons behind the abundance of minima leading to different functions that behave similarly on the training data (Neyshabur et al., 2017; Nguyen et al., 2018; Li et al., 2018; Liu et al., 2020). Another reason is the existence of permutation and scaling invariances, which lead to functionally identical minima that differ in weight space (Brea et al., 2019; Entezari et al., 2021). Due to the relationship of the loss landscape with generalization and optimization, a large body of recent work (Li et al., 2017; Mei et al., 2018; Geiger et al., 2019; Nguyen et al., 2018; Fort et al., 2019; Şimşek et al., 2021; Juneja et al., 2022) studies the loss landscape of deep neural networks with the goal of navigating the optimizer to a region with desired properties, e.g., with respect to flatness around the SGD solution (Baldassi et al., 2020; Pittorino et al., 2020). Early work conjectured the existence of a non-linear path of non-increasing loss between solutions found by SGD (Freeman and Bruna, 2016; Draxler et al., 2018) and empirically showed how to find it (Garipov et al., 2018; Tatro et al., 2020; Pittorino et al., 2022). Recently, Entezari et al. (2021) conjectured the existence of such a linear path between SGD solutions if the permutation invariance of neural networks' weight space is taken into account. That is, with high probability over SGD solutions, for each pair of trained networks A and B there exists a permutation of the hidden units in each layer of B such that the linear path between A and the permuted network B (denoted B') has non-increasing loss relative to the endpoints. This conjecture is important from both theoretical and empirical perspectives.
Theoretically, it leads to a drastic simplification of the loss landscape, reducing the complexity obstacle for analyzing deep neural networks. Empirically, linear interpolation between neural network weights has become an important tool, having recently been used to set state-of-the-art accuracy on ImageNet (Wortsman et al., 2022a), improve the robustness of finetuned models (Wortsman et al., 2022b; Ilharco et al., 2022), build effective weight-space model ensembles (Izmailov et al., 2019; Frankle et al., 2020; Guo et al., 2022), and constructively merge models trained on separate data splits (Wang et al., 2020; Ainsworth et al., 2022). Therefore, any improvement toward reducing the obstacles to interpolation between trained models has the potential to lead to empirical progress in the above areas.

Figure 1 summarizes our main results. In each experiment, we interpolate between the weights of two independently trained networks whose hidden units have been aligned using the method described in Section 2.3, and compare the interpolated network before and after applying our correction method REPAIR. Left: the variance of activations in interpolated networks progressively collapses; we report the average variance across each layer, normalized by that of the corresponding layer in the original endpoint networks. REPAIR is designed to correct this phenomenon. Middle: REPAIR reduces the barrier to linear interpolation between aligned ResNet50s independently trained on ImageNet by 74% (from 76% to 20%). Right: REPAIR reduces the interpolation barrier across many choices of architecture, training dataset, and normalization layer; for each architecture/dataset pair we vary the network width, with larger markers indicating wider networks.
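As a toy illustration of why variance might collapse under midpoint interpolation (our own sketch, under the assumption that matched hidden units across the two networks remain only partially correlated after alignment): averaging two unit-variance preactivations with correlation c yields variance (1 + c)/2 < 1, a shrinkage that can compound across layers.

```python
import random
import statistics

# Toy sketch (our illustration, not the paper's code): matched hidden units
# from two independently trained networks are modeled as unit-variance
# Gaussians with correlation c. Averaging them, as midpoint interpolation
# of the weights roughly does to the preactivations, shrinks the variance
# to (1 + c) / 2.
random.seed(0)
n = 100_000
c = 0.5  # assumed correlation between matched units after alignment
shared = [random.gauss(0, 1) for _ in range(n)]
x = [c**0.5 * s + (1 - c)**0.5 * random.gauss(0, 1) for s in shared]
y = [c**0.5 * s + (1 - c)**0.5 * random.gauss(0, 1) for s in shared]
mid = [(xi + yi) / 2 for xi, yi in zip(x, y)]
print(statistics.variance(mid))  # close to (1 + c) / 2 = 0.75
```

With c = 1 (perfectly aligned, identical units) the variance would be preserved; any imperfection in alignment pushes c below 1 and shrinks the interpolated activations.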
Prior and concurrent works on linear interpolation (Singh and Jaggi, 2020; Entezari et al., 2021; Ainsworth et al., 2022) have focused on improving the algorithms used to bring the hidden units of two networks into alignment, in order to reduce the barrier to interpolation between them. Singh and Jaggi (2020) develop a strong optimal-transport-based method which allows linear interpolation between a pair of ResNet18 (He et al., 2016) networks such that the minimum accuracy attained along the path is 77%. This constitutes a "barrier" of 16% relative to the original endpoint networks, which achieve over 93% accuracy on the CIFAR-10 test set. Entezari et al. (2021) use an approach based on simulated annealing (Zhan et al., 2016) to find permutations such that wide multi-layer perceptrons (MLPs) (Rosenblatt, 1958) trained on MNIST (LeCun, 1998) can be linearly interpolated with a barrier of nearly zero. Ainsworth et al. (2022) make the first demonstration of zero-barrier connectivity between wide ResNets trained on CIFAR-10 by replacing¹ the standard Batch Normalization (Ioffe and Szegedy, 2015) layers with Layer Normalization (Ba et al., 2016), and develop several novel alignment methods. Further discussion of related work can be found in Appendix A. In this paper we are interested in understanding why alignment of the endpoint networks alone has so far been insufficient to reach low-barrier linear connectivity between standard deep networks.

Contributions

In this work we focus on understanding the source of the poor performance of standard deep networks (ResNet18, VGG11) whose weights have been linearly interpolated between pairs of networks with aligned neurons. Our contributions are as follows:

• We find that such interpolated networks suffer from a phenomenon of variance collapse, in which their hidden units have significantly smaller activation variance than the corresponding units of the original networks from which they were interpolated. We further identify and explain the reason behind this variance collapse (Figure 1 (left) and Section 3).

• We propose REPAIR (REnormalizing Permuted Activations for Interpolation Repair), a method that corrects variance collapse by rescaling hidden units in the interpolated network such that their statistics match those of the original networks (Section 4).

• We demonstrate that applying REPAIR to such interpolated networks leads to significant barrier reductions across a wide variety of architectures, datasets, normalization techniques, and network widths/depths (Section 5 and Figure 1 (middle and right)).
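The rescaling idea can be sketched in a few lines. The snippet below is our own minimal, per-unit illustration, not the authors' implementation (which operates on convolutional channels and can fold the correction into normalization-layer parameters): shift and scale one unit's preactivations so that its mean and standard deviation match the interpolation of the two endpoint networks' statistics.

```python
import statistics

def repair_rescale(acts_interp, acts_a, acts_b, alpha=0.5):
    """Hedged sketch of the REPAIR idea for a single hidden unit.

    acts_interp: the unit's preactivations in the interpolated network.
    acts_a, acts_b: the matched unit's preactivations in the two endpoints.
    The goal statistics are the alpha-interpolation of the endpoints' stats.
    """
    mu_goal = (1 - alpha) * statistics.fmean(acts_a) + alpha * statistics.fmean(acts_b)
    sd_goal = (1 - alpha) * statistics.pstdev(acts_a) + alpha * statistics.pstdev(acts_b)
    mu = statistics.fmean(acts_interp)
    sd = statistics.pstdev(acts_interp)
    # Affine correction: standardize, then map onto the goal statistics.
    return [(a - mu) / sd * sd_goal + mu_goal for a in acts_interp]
```

After the correction, the unit's activation statistics no longer collapse toward zero variance, which is the failure mode described above.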

2. PRELIMINARIES

In this section we give preliminary definitions and algorithms to be used throughout the paper.

2.1. LINEAR INTERPOLATION OF NEURAL NETWORKS

We consider the problem of interpolating between independently trained neural networks. That is, if we let θ1, θ2 be the weight vectors of two such networks, then we are interested in networks whose weights are given by the linear interpolation θα = (1 − α)θ1 + αθ2 for α ∈ [0, 1].
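In code, this weight-space interpolation is just an elementwise affine combination. A minimal sketch, treating the weight vectors as flat lists of floats (in practice one would iterate over the matched parameter tensors of the two networks):

```python
def interpolate(theta1, theta2, alpha):
    """Return theta_alpha = (1 - alpha) * theta1 + alpha * theta2,
    the elementwise linear interpolation of two weight vectors."""
    return [(1 - alpha) * w1 + alpha * w2 for w1, w2 in zip(theta1, theta2)]

# alpha = 0 recovers the first network, alpha = 1 the second,
# and alpha = 0.5 gives the midpoint studied throughout the paper.
midpoint = interpolate([0.0, 2.0], [2.0, 4.0], 0.5)
```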



¹ See the code release, https://github.com/samuela/git-re-basin/blob/main/src/resnet20.py#L18



Figure 1: REPAIR improves the performance of interpolated networks by mitigating variance collapse.

