REPAIR: RENORMALIZING PERMUTED ACTIVATIONS FOR INTERPOLATION REPAIR

Abstract

In this paper we look into the conjecture of Entezari et al. (2021) which states that if the permutation invariance of neural networks is taken into account, then there is likely no loss barrier to the linear interpolation between SGD solutions. First, we observe that neuron alignment methods alone are insufficient to establish low-barrier linear connectivity between SGD solutions due to a phenomenon we call variance collapse: interpolated deep networks suffer a collapse in the variance of their activations, causing poor performance. Next, we propose REPAIR (REnormalizing Permuted Activations for Interpolation Repair), which mitigates variance collapse by rescaling the preactivations of such interpolated networks. We explore the interaction between our method and the choice of normalization layer, network width, and depth, and demonstrate that using REPAIR on top of neuron alignment methods leads to 60%-100% relative barrier reduction across a wide variety of architecture families and tasks. In particular, we report a 74% barrier reduction for ResNet50 on ImageNet and a 90% barrier reduction for ResNet18 on CIFAR10. Our code is available at https://github.com/KellerJordan/REPAIR.
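The rescaling step described above can be illustrated with a minimal sketch. Here `repair_rescale` is a hypothetical helper (not the released implementation): given one channel of preactivations from an interpolated network, together with that channel's (mean, std) statistics measured in the two endpoint networks, it renormalizes the channel so its statistics match the interpolation of the endpoints' statistics, counteracting variance collapse.

```python
import torch

def repair_rescale(z_interp, stats_a, stats_b, alpha, eps=1e-5):
    # z_interp: preactivations of one channel in the interpolated network.
    # stats_a, stats_b: (mean, std) of the same channel in endpoint nets A, B.
    # Goal statistics are the linear interpolation of the endpoints' statistics.
    mean_a, std_a = stats_a
    mean_b, std_b = stats_b
    goal_mean = (1 - alpha) * mean_a + alpha * mean_b
    goal_std = (1 - alpha) * std_a + alpha * std_b
    # Standardize the collapsed channel, then restore the goal statistics.
    z = (z_interp - z_interp.mean()) / (z_interp.std() + eps)
    return z * goal_std + goal_mean

# Example: a channel whose variance has collapsed (std ~0.1) is rescaled so
# that its mean and std become the interpolation of the endpoint statistics.
torch.manual_seed(0)
z = 0.1 * torch.randn(1000) + 0.5
fixed = repair_rescale(z, stats_a=(0.0, 1.0), stats_b=(0.2, 1.2), alpha=0.5)
```

In practice such statistics would be estimated per-channel from forward passes of the endpoint networks over training data; the sketch only shows the affine correction itself.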

1. INTRODUCTION

Training a neural network corresponds to optimizing a highly non-linear function by navigating a complex loss landscape with numerous minima, symmetries, and saddles (Zhang et al., 2017; Keskar et al., 2017; Draxler et al., 2018; Şimşek et al., 2021). Overparameterization is one of the reasons behind the abundance of minima leading to different functions that behave similarly on the training data (Neyshabur et al., 2017; Nguyen et al., 2018; Li et al., 2018; Liu et al., 2020). Another reason is the existence of permutation and scaling invariances, which lead to functionally identical minima that differ in weight space (Brea et al., 2019; Entezari et al., 2021). Due to the relationship of the loss landscape with generalization and optimization, a large body of recent work (Li et al., 2017; Mei et al., 2018; Geiger et al., 2019; Nguyen et al., 2018; Fort et al., 2019; Şimşek et al., 2021; Juneja et al., 2022) studies the loss landscape of deep neural networks with the goal of steering the optimizer to a region with desired properties, e.g., with respect to flatness around the SGD solution (Baldassi et al., 2020; Pittorino et al., 2020). Early work conjectured the existence of a non-linear path of non-increasing loss between solutions found by SGD (Freeman and Bruna, 2016; Draxler et al., 2018) and empirically showed how to find it (Garipov et al., 2018; Tatro et al., 2020; Pittorino et al., 2022). Recently, Entezari et al. (2021) conjectured the existence of such a linear path between SGD solutions if the permutation invariance of the neural network weight space is taken into account. That is, with high probability over SGD solutions, for each pair of trained networks A and B there exists a permutation of the hidden units in each layer of B such that the linear path between A and the permuted network B' has non-increasing loss relative to the endpoints. This conjecture is important from both theoretical and empirical perspectives.
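The two ingredients of the conjecture, permutation invariance and linear interpolation in weight space, can be sketched on a toy two-layer MLP. The helper names below (`make_mlp`, `permute_hidden`, `interpolate`) are illustrative, not from the paper's codebase: permuting a layer's hidden units (rows of the producing layer, columns of the consuming layer) leaves the network's function unchanged, and interpolation is applied parameter-wise.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_mlp():
    return nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

def permute_hidden(mlp, perm):
    # Reorder the 16 hidden units: permute the rows (and bias) of the layer
    # that produces them and the columns of the layer that consumes them.
    # The permuted network computes exactly the same function.
    with torch.no_grad():
        mlp[0].weight.copy_(mlp[0].weight[perm])
        mlp[0].bias.copy_(mlp[0].bias[perm])
        mlp[2].weight.copy_(mlp[2].weight[:, perm])

def interpolate(net_a, net_b, alpha):
    # Naive weight-space interpolation: theta = (1-alpha)*theta_A + alpha*theta_B.
    net_mid = make_mlp()
    with torch.no_grad():
        for p_m, p_a, p_b in zip(net_mid.parameters(),
                                 net_a.parameters(), net_b.parameters()):
            p_m.copy_((1 - alpha) * p_a + alpha * p_b)
    return net_mid

net_a, net_b = make_mlp(), make_mlp()
x = torch.randn(5, 8)
y_before = net_b(x)
permute_hidden(net_b, torch.randperm(16))   # B -> B', functionally identical
assert torch.allclose(y_before, net_b(x), atol=1e-5)
midpoint = interpolate(net_a, net_b, 0.5)   # a point on the linear path A -> B'
```

The conjecture concerns the loss along this path: for a suitably chosen permutation of B (found in practice by neuron alignment methods), the loss of `midpoint` should be no higher than that of the endpoints.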
Theoretically, it leads to a drastic simplification of the loss landscape, reducing the complexity obstacle for analyzing deep neural networks. Empirically, linear interpolation between neural network weights has become an important tool, having recently been used to set state-of-the-art accuracy on ImageNet (Wortsman et al., 2022a), improve the robustness of finetuned models (Wortsman et al., 2022b; Ilharco et al., 2022), build effective weight-space model ensembles (Izmailov et al., 2019; Frankle et al., 2020; Guo et al., 2022), and constructively merge models trained on separate data splits (Wang et al., 2020; Ainsworth et al., 2022). Therefore, any improvement that reduces the obstacles to interpolation between trained models has the potential to lead to empirical progress in the above areas.

