DYNAMICAL ISOMETRY FOR RESIDUAL NETWORKS

Abstract

The training success, training speed, and generalization ability of neural networks rely crucially on the choice of random parameter initialization. It has been shown for multiple architectures that initial dynamical isometry is particularly advantageous. Known initialization schemes for residual blocks, however, lack this property and either suffer from degrading separability of different inputs with increasing depth and from instability without Batch Normalization, or lack feature diversity. We propose a random initialization scheme, RISOTTO, that achieves perfect dynamical isometry for residual networks with ReLU activation functions even at finite depth and width. Unlike other schemes, which initially bias towards the skip connections, it balances the contributions of the residual and skip branches. In experiments, we demonstrate that in most cases our approach outperforms initialization schemes proposed to make Batch Normalization obsolete, including Fixup and SkipInit, and facilitates stable training. Also in combination with Batch Normalization, we find that RISOTTO often achieves the overall best result.

1. INTRODUCTION

Random initialization of the weights in a neural network plays a crucial role in determining its final performance. This effect becomes even more pronounced for very deep models, which seem to be able to solve many complex tasks more effectively. An important building block of many models are residual blocks (He et al., 2016), in which skip connections between non-consecutive layers are added to ease signal propagation (Balduzzi et al., 2017) and allow for faster training. ResNets, which consist of multiple residual blocks, have since become a popular centerpiece of many deep learning applications (Bello et al., 2021). Batch Normalization (BN) (Ioffe & Szegedy, 2015) is a key ingredient for training ResNets on large datasets. It allows training with larger learning rates, often improves generalization, and makes the training success robust to different choices of parameter initialization. It has furthermore been shown to smoothen the loss landscape (Santurkar et al., 2018) and to improve signal propagation (De & Smith, 2020). However, BN also has several drawbacks: It breaks the independence of samples in a minibatch and adds considerable computational costs. Batch sizes large enough to compute robust statistics can be infeasible if the input data requires a lot of memory. Moreover, BN also prevents adversarial training (Wang et al., 2022). For these reasons, finding alternatives to BN remains an active area of research (Zhang et al., 2018; Brock et al., 2021b). A combination of Scaled Weight Standardization and gradient clipping has recently outperformed BN (Brock et al., 2021b). However, a random parameter initialization scheme that achieves all the benefits of BN is still an open problem. An initialization scheme gives deep learning systems the flexibility to drop into existing setups without modifying pipelines.
For that reason, it is still necessary to develop initialization schemes that enable learning very deep neural network models without normalization or standardization methods. A line of research pioneered by Saxe et al. (2013) and Pennington et al. (2017) has analyzed the signal propagation through randomly parameterized neural networks in the infinite-width limit using random matrix theory. They argued that parameter initialization approaches with the dynamical isometry (DI) property avoid exploding or vanishing gradients, as the singular values of the input-output Jacobian are close to unity. DI is key to stable and fast training (Du et al., 2019; Hu et al., 2020). While Pennington et al. (2017) showed that it is not possible to achieve DI in networks with ReLU activations using independent weights or orthogonal weight matrices, Burkholz & Dubatovka (2019) and Balduzzi et al. (2017) derived a way to attain perfect DI even in finite ReLU networks by parameter sharing. This approach can also be combined (Blumenfeld et al., 2020; Balduzzi et al., 2017) with orthogonal initialization schemes for convolutional layers (Xiao et al., 2018). The main idea is to design a random initial network that represents a linear isometric map. We transfer a similar idea to ResNets but have to overcome the additional challenge of integrating residual connections and, in particular, potentially non-trainable identity mappings. In contrast to other ResNet initialization schemes that achieve (approximate) dynamical isometry by initially scaling down the residual connections, we balance the weights of the skip and residual connections and thus promote higher initial feature diversity, as highlighted by Fig. 1. We thus propose RISOTTO (Residual dynamical isometry by initial orthogonality), an initialization scheme that induces exact dynamical isometry (DI) for ResNets (He et al., 2016) with convolutional or fully-connected layers and ReLU activation functions.
RISOTTO achieves this for networks of finite width and depth, and not only in expectation but exactly. We provide theoretical and empirical evidence that highlights the advantages of our approach. Remarkably, and in contrast to other initialization schemes that aim to improve signal propagation in ResNets, RISOTTO can achieve performance gains even in combination with BN. We hypothesize that this improvement is due to the balanced skip and residual connections.
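The parameter-sharing idea behind exact DI in finite ReLU networks can be made concrete with a small numerical sketch. The snippet below is an illustration under our own assumptions, not the paper's code: it uses the "mirrored" pre-activation trick, where ReLU(h) - ReLU(-h) = h, so a ReLU block with shared orthogonal weight matrices `W` and `U` (hypothetical names) computes the exact linear isometry `U @ W @ x`, whose Jacobian has all singular values equal to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_orth(n):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

n = 16
W, U = rand_orth(n), rand_orth(n)

def mirrored_block(x):
    # "Looks-linear" parameter sharing: duplicate the pre-activations with
    # opposite signs, so ReLU(h) - ReLU(-h) = h and the block computes the
    # exact linear map U @ W @ x despite the ReLU nonlinearity.
    h = W @ x
    return U @ np.maximum(h, 0.0) - U @ np.maximum(-h, 0.0)

x = rng.standard_normal(n)
y = mirrored_block(x)
# Singular values of the effective input-output Jacobian U @ W.
svals = np.linalg.svd(U @ W, compute_uv=False)
print(np.linalg.norm(y) / np.linalg.norm(x))  # -> 1.0 up to floating point
```

Since the product of orthogonal matrices is orthogonal, the map is norm-preserving exactly, not just in expectation, which is the sense of "perfect DI even in finite ReLU networks" above.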

1.1. CONTRIBUTIONS

• To explain the drawbacks of most initialization schemes for residual blocks, we derive signal propagation results for finite networks without requiring mean field approximations and highlight input separability issues at large depths. Activations corresponding to different inputs become more similar with increasing depth, which makes it harder for the neural network to distinguish different classes.
• We propose a solution, RISOTTO, an initialization scheme for residual blocks that provably achieves dynamical isometry (exactly for finite networks, not only approximately). A residual block is initialized so that it acts as an orthogonal, norm- and distance-preserving transform.
• In experiments on multiple standard benchmark datasets, we demonstrate that our approach achieves competitive results in comparison with alternatives:
  - We show that RISOTTO facilitates training ResNets without BN or any other normalization method and often outperforms existing BN-free methods including Fixup, SkipInit, and NF ResNets.
  - It outperforms standard initialization schemes for ResNets with BN on Tiny ImageNet, CIFAR100, and ImageNet.

1.2. RELATED WORK

Preserving Signal Propagation Random initialization schemes have been designed for a multitude of neural network architectures and activation functions. Early work has focused on the layerwise preservation of average squared signal norms (Glorot & Bengio, 2010; He et al., 2015; Hanin, 2018) and their variance (Hanin & Rolnick, 2018). The mean field theory of infinitely wide networks has also integrated signal covariances into the analysis and further generated practical insights into good choices that avoid exploding or vanishing gradients and enable feature learning (Yang & Hu, 2021) if the parameters are drawn independently (Poole et al., 2016; Raghu et al., 2017; Schoenholz et al., 2017; Yang & Schoenholz, 2017; Xiao et al., 2018). Indirectly, these works demand that the average eigenvalue of the signal input-output Jacobian is steered towards 1. Yet, in this set-up, ReLU activation functions fail to support parameter choices that lead to good trainability of very deep networks.
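The "layerwise preservation of average squared signal norms" that this early work targets can be checked numerically. The sketch below is our own minimal illustration of the He et al. (2015) criterion: with Var(W_ij) = 2/fan_in, the expected squared norm of the post-ReLU activations matches the squared input norm, but only on average over weight draws, not exactly per network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 256

# He initialization keeps E[||ReLU(W x)||^2] = ||x||^2 across a layer:
# each pre-activation is N(0, (2/n_in) * ||x||^2), ReLU halves the
# second moment, and summing n_in units restores ||x||^2 in expectation.
x = rng.standard_normal(n_in)
norms = []
for _ in range(200):
    W = rng.standard_normal((n_in, n_in)) * np.sqrt(2.0 / n_in)
    norms.append(float(np.sum(np.maximum(W @ x, 0.0) ** 2)))
ratio = float(np.mean(norms)) / float(np.sum(x ** 2))
print(ratio)  # close to 1.0 when averaged over many weight draws
```

This preservation-in-expectation is precisely what separates such schemes from exact dynamical isometry, which constrains every singular value of the Jacobian rather than an average.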



Figure 1: Features of our initialization scheme RISOTTO. This figure has been designed using images from Flaticon.com.

Furthermore, we demonstrate that RISOTTO can successfully train ResNets without BN and achieve the same or better performance than the methods of Zhang et al. (2018) and Brock et al. (2021b).

