DYNAMICAL ISOMETRY FOR RESIDUAL NETWORKS

Abstract

The training success, training speed, and generalization ability of neural networks rely crucially on the choice of random parameter initialization. It has been shown for multiple architectures that initial dynamical isometry is particularly advantageous. Known initialization schemes for residual blocks, however, lack this property: without Batch Normalization they suffer from degrading separability of different inputs with increasing depth and from instability, or they lack feature diversity. We propose a random initialization scheme, RISOTTO, that achieves perfect dynamical isometry for residual networks with ReLU activation functions even at finite depth and width. Unlike other schemes, which are initially biased towards the skip connections, it balances the contributions of the residual and skip branches. In experiments, we demonstrate that our approach outperforms, in most cases, initialization schemes proposed to make Batch Normalization obsolete, including Fixup and SkipInit, and facilitates stable training. Also in combination with Batch Normalization, we find that RISOTTO often achieves the overall best result.

1. INTRODUCTION

The random initialization of weights in a neural network plays a crucial role in determining its final performance. This effect becomes even more pronounced for very deep models, which seem to be able to solve many complex tasks more effectively. An important building block of many models are residual blocks (He et al., 2016), in which skip connections between non-consecutive layers are added to ease signal propagation (Balduzzi et al., 2017) and allow for faster training. ResNets, which consist of multiple residual blocks, have since become a popular centerpiece of many deep learning applications (Bello et al., 2021). Batch Normalization (BN) (Ioffe & Szegedy, 2015) is a key ingredient for training ResNets on large datasets. It allows training with larger learning rates, often improves generalization, and makes the training success robust to different choices of parameter initialization. It has furthermore been shown to smoothen the loss landscape (Santurkar et al., 2018) and to improve signal propagation (De & Smith, 2020). However, BN also has several drawbacks: it breaks the independence of samples in a minibatch and adds considerable computational cost. Batch sizes large enough to compute robust statistics can be infeasible if the input data requires a lot of memory. Moreover, BN also prevents adversarial training (Wang et al., 2022). For these reasons, finding alternatives to BN remains an active area of research (Zhang et al., 2018; Brock et al., 2021b). A combination of Scaled Weight Standardization and gradient clipping has recently outperformed BN (Brock et al., 2021b). However, a random parameter initialization scheme that achieves all the benefits of BN is still an open problem. Such a scheme would give deep learning systems the flexibility to drop into existing setups without modifying their pipelines.
For that reason, it is still necessary to develop initialization schemes that enable learning very deep neural network models without normalization or standardization methods. A direction of research pioneered by Saxe et al. (2013) and Pennington et al. (2017) has analyzed signal propagation through randomly parameterized neural networks in the infinite-width limit using random matrix theory. They argued that parameter initialization approaches with the dynamical isometry (DI) property avoid exploding or vanishing gradients, as the singular values of the input-output Jacobian are close to unity. DI is key to stable and fast training (Du et al., 2019; Hu et al., 2020). While Pennington et al. (2017) showed that it is not possible to achieve DI in networks with ReLU activations using independent weights or orthogonal weight matrices, Burkholz & Dubatovka (2019); Balduzzi et al. (2017) derived a way to attain perfect DI even in finite ReLU
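The DI criterion above can be checked empirically: for a ReLU network, the input-output Jacobian at an input x is the product of the weight matrices interleaved with the diagonal 0/1 masks of active units, so its singular values can be computed directly. The following minimal sketch (not the RISOTTO scheme itself; it uses a standard i.i.d. Gaussian "He" initialization purely for illustration, and the function name and widths are our own choices) computes this spectrum for a small fully-connected ReLU network:

```python
import numpy as np

def jacobian_singular_values(widths, x, seed=0):
    """Singular values of the input-output Jacobian of a random ReLU
    network evaluated at input x.  Since ReLU nets are piecewise linear,
    the Jacobian at x is J = W_L * D_{L-1} * W_{L-1} * ... * D_1 * W_1,
    where D_l is the diagonal 0/1 mask of units active in layer l."""
    rng = np.random.default_rng(seed)
    h = x
    J = np.eye(len(x))
    num_layers = len(widths) - 1
    for i, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
        # Illustrative i.i.d. Gaussian (He) initialization, variance 2/n_in.
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
        pre = W @ h
        J = W @ J
        if i < num_layers - 1:  # ReLU on all but the last layer
            mask = (pre > 0).astype(float)
            h = pre * mask
            J = mask[:, None] * J  # multiply by the diagonal mask D
        else:
            h = pre
    return np.linalg.svd(J, compute_uv=False)

x = np.random.default_rng(1).normal(size=64)
svals = jacobian_singular_values([64, 64, 64, 64], x)
```

Under a DI-preserving initialization the spread of `svals` around 1 stays small as depth grows; for the i.i.d. Gaussian initialization used here it widens with depth, which is exactly the failure mode the DI literature targets.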

