DYNAMICAL ISOMETRY FOR RESIDUAL NETWORKS

Abstract

The training success, training speed, and generalization ability of neural networks rely crucially on the choice of random parameter initialization. It has been shown for multiple architectures that initial dynamical isometry is particularly advantageous. Known initialization schemes for residual blocks, however, miss this property: they suffer from degrading separability of different inputs with increasing depth and from instability without Batch Normalization, or they lack feature diversity. We propose a random initialization scheme, RISOTTO, that achieves perfect dynamical isometry for residual networks with ReLU activation functions even at finite depth and width. Unlike other schemes, which initially bias towards the skip connections, it balances the contributions of the residual and skip branches. In experiments, we demonstrate that in most cases our approach outperforms initialization schemes proposed to make Batch Normalization obsolete, including Fixup and SkipInit, and facilitates stable training. Also in combination with Batch Normalization, we find that RISOTTO often achieves the overall best result.

1. INTRODUCTION

The random initialization of weights in a neural network plays a crucial role in determining its final performance. This effect becomes even more pronounced for very deep models, which seem to be able to solve many complex tasks more effectively. An important building block of many models are residual blocks (He et al., 2016), in which skip connections between non-consecutive layers are added to ease signal propagation (Balduzzi et al., 2017) and allow for faster training. ResNets, which consist of multiple residual blocks, have since become a popular centerpiece of many deep learning applications (Bello et al., 2021). Batch Normalization (BN) (Ioffe & Szegedy, 2015) is a key ingredient for training ResNets on large datasets. It allows training with larger learning rates, often improves generalization, and makes the training success robust to different choices of parameter initializations. It has furthermore been shown to smoothen the loss landscape (Santurkar et al., 2018) and to improve signal propagation (De & Smith, 2020). However, BN also has several drawbacks: It breaks the independence of samples in a minibatch and adds considerable computational costs. Batch sizes sufficiently large to compute robust statistics can be infeasible if the input data requires a lot of memory. Moreover, BN also prevents adversarial training (Wang et al., 2022). For these reasons, finding alternatives to BN is still an active area of research (Zhang et al., 2018; Brock et al., 2021b). A combination of Scaled Weight Standardization and gradient clipping has recently outperformed BN (Brock et al., 2021b). However, a random parameter initialization scheme that achieves all the benefits of BN is still an open problem. An initialization scheme gives deep learning systems the flexibility to drop into existing setups without modifying pipelines.
For that reason, it is still necessary to develop initialization schemes that enable learning very deep neural network models without normalization or standardization methods. A direction of research pioneered by Saxe et al. (2013); Pennington et al. (2017) has analyzed the signal propagation through randomly parameterized neural networks in the infinite width limit using random matrix theory. They have argued that parameter initialization approaches that have the dynamical isometry (DI) property avoid exploding or vanishing gradients, as the singular values of the input-output Jacobian are close to unity. DI is key to stable and fast training (Du et al., 2019; Hu et al., 2020). While Pennington et al. (2017) showed that it is not possible to achieve DI in networks with ReLU activations with independent weights or orthogonal weight matrices, Burkholz & Dubatovka (2019); Balduzzi et al. (2017) derived a way to attain perfect DI even in finite ReLU networks by parameter sharing. This approach can also be combined (Blumenfeld et al., 2020; Balduzzi et al., 2017) with orthogonal initialization schemes for convolutional layers (Xiao et al., 2018). The main idea is to design a random initial network that represents a linear isometric map.
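To illustrate why dynamical isometry matters, consider the simplest setting of a deep linear network, where the input-output Jacobian is just the product of the weight matrices. With i.i.d. Gaussian weights its singular values spread over many orders of magnitude, while orthogonal weights keep them all at exactly 1. The following NumPy sketch is ours and purely illustrative (all names and the width/depth choices are our assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 64, 50

def random_orthogonal(n, rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

# The input-output Jacobian of a deep *linear* network is the product of its weights.
J_orth = np.eye(n)
J_gauss = np.eye(n)
for _ in range(depth):
    J_orth = random_orthogonal(n, rng) @ J_orth
    J_gauss = (rng.standard_normal((n, n)) / np.sqrt(n)) @ J_gauss  # variance 1/n

s_orth = np.linalg.svd(J_orth, compute_uv=False)
s_gauss = np.linalg.svd(J_gauss, compute_uv=False)
print(s_orth.min(), s_orth.max())    # all = 1 (up to rounding): dynamical isometry
print(s_gauss.min(), s_gauss.max())  # spread over many orders of magnitude
```

The ReLU case is harder precisely because the nonlinearity destroys this product structure, which motivates the looks-linear constructions discussed below.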

[Figure 1: Features of our initialization scheme RISOTTO: isometry, feature diversity, and balanced skip and residual connections. This figure has been designed using images from Flaticon.com.]

We transfer a similar idea to ResNets but have to overcome the additional challenge of integrating residual connections and, in particular, potentially non-trainable identity mappings. In contrast to other ResNet initialization schemes that achieve (approximate) dynamical isometry by initially scaling down the residual connections, we balance the weights of the skip and residual connections and thus promote higher initial feature diversity, as highlighted by Fig. 1. We therefore propose RISOTTO (Residual dynamical isometry by initial orthogonality), an initialization scheme that induces exact dynamical isometry (DI) for ResNets (He et al., 2016) with convolutional or fully-connected layers and ReLU activation functions. RISOTTO achieves this for networks of finite width and finite depth, not only in expectation but exactly. We provide theoretical and empirical evidence that highlights the advantages of our approach. Remarkably, and in contrast to other initialization schemes that aim to improve signal propagation in ResNets, RISOTTO can achieve performance gains even in combination with BN. We hypothesize that this improvement is due to the balanced skip and residual connections. Furthermore, we demonstrate that RISOTTO can successfully train ResNets without BN and achieve the same or better performance than Zhang et al. (2018); Brock et al. (2021b).

1.1. CONTRIBUTIONS

• To explain the drawbacks of most initialization schemes for residual blocks, we derive signal propagation results for finite networks without requiring mean field approximations and highlight input separability issues for large depths. Accordingly, activations corresponding to different inputs become more similar with increasing depth, which makes it harder for the neural network to distinguish different classes.

• We propose a solution, RISOTTO, an initialization scheme for residual blocks that provably achieves dynamical isometry (exactly for finite networks, not only approximately). A residual block is initialized so that it acts as an orthogonal, norm- and distance-preserving transform.

• In experiments on multiple standard benchmark datasets, we demonstrate that our approach achieves competitive results in comparison with alternatives:
  - We show that RISOTTO facilitates training ResNets without BN or any other normalization method and often outperforms existing BN-free methods including Fixup, SkipInit, and NF ResNets.
  - It outperforms standard initialization schemes for ResNets with BN on Tiny ImageNet, CIFAR100, and ImageNet.

1.2. RELATED WORK

Preserving Signal Propagation Random initialization schemes have been designed for a multitude of neural network architectures and activation functions. Early work has focused on the layerwise preservation of average squared signal norms (Glorot & Bengio, 2010; He et al., 2015; Hanin, 2018) and their variance (Hanin & Rolnick, 2018). The mean field theory of infinitely wide networks has also integrated signal covariances into the analysis and further generated practical insights into good choices that avoid exploding or vanishing gradients and enable feature learning (Yang & Hu, 2021) if the parameters are drawn independently (Poole et al., 2016; Raghu et al., 2017; Schoenholz et al., 2017; Yang & Schoenholz, 2017; Xiao et al., 2018). Indirectly, these works demand that the average eigenvalue of the signal input-output Jacobian is steered towards 1. Yet, in this setup, ReLU activation functions fail to support parameter choices that lead to good trainability of very deep networks, as outputs corresponding to different inputs become more similar with increasing depth (Poole et al., 2016; Burkholz & Dubatovka, 2019). Yang & Schoenholz (2017) showed that ResNets can mitigate this effect and enable training deeper networks, but eventually also cannot distinguish different inputs. However, there are exceptions. Balanced networks (Li et al., 2021) can improve interlayer correlations and reduce the variance of the output. A more effective option is to remove the contribution of the residual part entirely, as proposed in successful ResNet initialization schemes like Fixup (Zhang et al., 2018), SkipInit (De & Smith, 2020), or ISONet (Qi et al., 2020). This, however, significantly limits the initial feature diversity that is usually crucial for training success (Blumenfeld et al., 2020) and scales down the residual connections.
A way to address this issue for other architectures with ReLUs, like fully-connected (Burkholz & Dubatovka, 2019) and convolutional (Balduzzi et al., 2017) layers, is a looks-linear weight matrix structure (Shang et al., 2016). This idea has not been transferred to residual blocks yet but has the advantage that it can be combined with orthogonal submatrices. These matrices induce perfect dynamical isometry (Saxe et al., 2013; Mishkin & Matas, 2015; Poole et al., 2016; Pennington et al., 2017), meaning that the eigenvalues of the initial input-output Jacobian are identical to 1 or -1 and not just close to unity on average. This property has been shown to enable the training of very deep neural networks (Xiao et al., 2018) and can improve their generalization ability (Hayase & Karakida, 2021) and training speed (Pennington et al., 2017; 2018). ResNets equipped with ReLUs can currently only achieve this property approximately and without a practical initialization scheme (Tarnowski et al., 2019) or with reduced feature diversity (Blumenfeld et al., 2020) and potential training instabilities (Zhang et al., 2018; De & Smith, 2020).

ResNet Initialization Approaches Fixup (Zhang et al., 2018), SkipInit (De & Smith, 2020), ISONet (Qi et al., 2020), and ReZero (Bachlechner et al., 2021) have been designed to enable training without requiring BN, yet usually cannot achieve equal performance. Training data informed approaches have also been successful (Zhu et al., 2021; Dauphin & Schoenholz, 2019), but they require computing the gradient of the input minibatches. Most methods only work well in combination with BN (Ioffe & Szegedy, 2015), as it seems to improve ill-conditioned initializations (Glorot & Bengio, 2010; He et al., 2016) according to Bjorck et al. (2018), allows training with larger learning rates (Santurkar et al., 2018), and might initially bias the residual block towards the identity, enabling signal to flow through (De & Smith, 2020).
The additional computational and memory costs of BN, however, have motivated research on alternatives, including different normalization methods (Wu & He, 2018; Salimans & Kingma, 2016; Ulyanov et al., 2016). Only recently has it been possible to outperform BN in generalization performance using scaled weight standardization and gradient clipping (Brock et al., 2021b;a), but this requires careful hyperparameter tuning. In experiments, we compare our initialization proposal RISOTTO with all three approaches: normalization free methods, BN, and normalization alternatives (e.g., NF ResNets).

2.1. BACKGROUND AND NOTATION

The object of our study is a general residual network that is defined by

z^0 := W^0 * x,  x^l = ϕ(z^{l-1}),  z^l := α_l f^l(x^l) + β_l h^l(x^l),  z^{out} := W^{out} P(x^L)   (1)

for 1 ≤ l ≤ L. P(·) denotes an optional pooling operation like maxpool or average pool, f^l(·) the residual connections, and h^l(·) the skip connections, which usually represent an identity mapping or a projection. For simplicity, we assume in our derivations and arguments that these functions are parameterized as f^l(x^l) = W^l_2 * ϕ(W^l_1 * x^l + b^l_1) + b^l_2 and h^l(x^l) = W^l_skip * x^l + b^l_skip (* denotes convolution), but our arguments also transfer to residual blocks in which more than one layer is skipped. Optionally, batch normalization (BN) layers are placed before or after the nonlinear activation function ϕ(·). We focus on ReLUs, ϕ(x) = max{0, x} (Krizhevsky et al., 2012), which are among the most commonly used activation functions in practice. All biases b^l_2 ∈ R^{N_{l+1}}, b^l_1 ∈ R^{N_{m_l}}, and b^l_skip ∈ R^{N_l} are assumed to be trainable and set initially to zero. We ignore them in the following, since we are primarily interested in the neuron states and signal propagation at initialization. The parameters α_l and β_l balance the contributions of the residual and the skip branch, respectively. Note that α_l is a trainable parameter, while β_l is just mentioned for convenience. Both parameters could also be integrated into the weight parameters W^l_2 ∈ R^{N_{l+1} × N_{m_l} × k^l_{2,1} × k^l_{2,2}}, W^l_1 ∈ R^{N_{m_l} × N_l × k^l_{1,1} × k^l_{1,2}}, and W^l_skip ∈ R^{N_{l+1} × N_l × 1 × 1}, but they make the discussion of different initialization schemes more convenient and simplify the comparison with standard He initialization approaches (He et al., 2015).

Residual Blocks Following the definition by He et al. (2016), we distinguish two types of residual blocks, Type B and Type C (see Figure 2a), which differ in the choice of W^l_skip. The Type C residual block is defined as z^l = α f^l(x^l) + h^l(x^l), so that the shortcuts h^l(·) are projections with a 1 × 1 kernel and trainable parameters. The Type B residual block has identity skip connections, z^l = α f^l(x^l) + x^l. Thus, W^l_skip represents the identity and is not trainable.
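For concreteness, the two block types of Eq. (1) can be sketched in a fully-connected NumPy analogue. This simplification is ours for illustration only: biases, convolutions, pooling, and BN are omitted, and the weights are placeholders rather than any particular initialization.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2, W_skip=None, alpha=1.0):
    """Fully-connected analogue of Eq. (1): x^{l+1} = relu(alpha * f(x) + h(x)).

    Type C: W_skip is a trainable projection matrix.
    Type B: W_skip is None; the skip branch is the identity (requires equal
            input and output dimensions).
    """
    f = W2 @ relu(W1 @ x)                    # residual branch (biases omitted)
    h = x if W_skip is None else W_skip @ x  # skip branch
    return relu(alpha * f + h)

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)
W1, W2, W_skip = (rng.standard_normal((n, n)) for _ in range(3))

out_b = residual_block(x, W1, W2)          # Type B: identity skip
out_c = residual_block(x, W1, W2, W_skip)  # Type C: projected skip
```

Setting alpha = 0.0 switches the residual branch off entirely, so a Type B block then reduces to relu(x), which is the regime that Fixup and SkipInit start training in.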

2.2. SIGNAL PROPAGATION FOR NORMAL RESNET INITIALIZATION

Most initialization methods for ResNets draw weight entries independently at random, including Fixup and SkipInit. To simplify the theoretical analysis of the induced random networks and to highlight the shortcomings of the independence assumption, we assume:

Definition 2.1 (Normally Distributed ResNet Parameters). All biases are initialized as zero and all weight matrix entries are independently normally distributed with w^l_{ij,2} ~ N(0, σ²_{l,2}), w^l_{ij,1} ~ N(0, σ²_{l,1}), and w^l_{ij,skip} ~ N(0, σ²_{l,skip}).

Most studies further focus on special cases of the following set of parameter choices.

Definition 2.2 (Normal ResNet Initialization). The choice σ²_{l,1} = 2/(N_{m_l} k^l_{1,1} k^l_{1,2}), σ²_{l,2} = 2/(N_{l+1} k^l_{2,1} k^l_{2,2}), and σ²_{l,skip} = 2/N_{l+1}, as used in Definition 2.1, and α_l, β_l ≥ 0 that fulfill α²_l + β²_l = 1.

Another common choice is W_skip = I instead of random entries. If β_l = 1, a nonzero α_l is still common if it accounts for the depth L of the network. In case α_l and β_l are the same for each layer, we drop the subscript l. For instance, Fixup (Zhang et al., 2018) and SkipInit (De & Smith, 2020) satisfy the above condition with α = 0 and β = 1. De & Smith (2020) argue that BN also suppresses the residual branch effectively. However, in combination with He initialization (He et al., 2015), it becomes more similar to α = β = √0.5. Li et al. (2021) study the case of free α_l but focus their analysis on identity mappings W^l_1 = I and W^l_skip = I. Like other theoretical work, we focus our following investigations on fully-connected layers to simplify the exposition. Similar insights would transfer to convolutional layers but would require extra effort (Yang & Schoenholz, 2017). The motivation for the general choice in Definition 2.2 is that it ensures that the average squared ℓ2-norm of the neuron states is identical in every layer. This has been shown by Li et al. (2021) for the special choice W^l_1 = I, W^l_skip = I, β = 1, and by Yang & Schoenholz (2017) in the mean field limit with a missing ReLU so that x^l = z^{l-1}. Hanin & Rolnick (2018) have also observed for W^l_skip = I and β = 1 that the squared signal norm increases in ∑_l α²_l. For completeness, we present the most general case next and prove it in the appendix.

Theorem 2.3 (Norm preservation). Let a neural network consist of fully-connected residual blocks as defined by Eq. (1) that start with a fully-connected layer W^0, which contains N_1 output channels. Assume that all biases are initialized as 0 and that all weight matrix entries are independently normally distributed with w^l_{ij,2} ~ N(0, σ²_{l,2}), w^l_{ij,1} ~ N(0, σ²_{l,1}), and w^l_{ij,skip} ~ N(0, σ²_{l,skip}). Then the expected squared norm of the output after one fully-connected layer and L residual blocks applied to input x is given by

E ∥x^L∥² = (N_1/2) σ²_0 ∏_{l=1}^{L-1} [ (N_{l+1}/2) ( α²_l σ²_{l,2} σ²_{l,1} N_{m_l}/2 + β²_l σ²_{l,skip} ) ] ∥x∥².

Note that this result does not rely on any (mean field) approximations and applies also to other parameter distributions that have zero mean and are symmetric around zero. Inserting the parameters of Definition 2.2 for fully-connected networks with k = 1 leads to the following insight that explains why this is the preferred initialization choice.

Insight 2.4 (Norm preserving initialization). According to Theorem 2.3, the normal ResNet initialization (Definition 2.2) preserves the average squared signal norm for arbitrary depth L.

[Figure 2(a): The two types of considered residual blocks. In Type C, the skip connection is a projection with a 1 × 1 kernel, while in Type B the input is directly added to the residual branch output via the skip connection. Both blocks have been described by He et al. (2016).]
[Figure 2(b): The correlation between two inputs for different initializations as they pass through a residual network consisting of a convolutional filter followed by 5 residual blocks (Type C), an average pool, and a linear layer on CIFAR10. Only RISOTTO maintains constant correlations after each residual block, while the correlation increases with depth for the other initializations. (c) Balancing skip and residual connections: performance of RISOTTO for different values of α for ResNet 18 (C) on CIFAR10. α = 0 is equivalent to SkipInit and achieves the lowest accuracy. The better performing α = 1 is implemented in RISOTTO.]

Even though this initialization setting is able to avoid exploding or vanishing signals, it still induces considerable issues, as the analysis of the joint signal corresponding to different inputs reveals. According to the next theorem, the signal covariance fulfills a layerwise recurrence relationship which implies that signals become more similar with increasing depth.

Theorem 2.5 (Layerwise signal covariance). Let a fully-connected residual block be given as defined by Eq. (1) with random parameters according to Definition 2.2. Let x^{l+1} denote the neuron states of layer l+1 for input x and x̃^{l+1} the same neurons but for input x̃. Then their covariance given all parameters of the previous layers fulfills

E_l ⟨x^{l+1}, x̃^{l+1}⟩ ≥ (1/4) (N_{l+1}/2) ( α² σ²_{l,2} σ²_{l,1} N_{m_l}/2 + 2β² σ²_{l,skip} ) ⟨x^l, x̃^l⟩ + (c/4) α² N_{l+1} σ²_{l,2} σ²_{l,1} N_{m_l} ∥x^l∥ ∥x̃^l∥ + c N_{l+1} E_{W^l_1} √[ ( α² σ²_{l,2} ∥ϕ(W^l_1 x^l)∥² + β² σ²_{l,skip} ∥x^l∥² ) ( α² σ²_{l,2} ∥ϕ(W^l_1 x̃^l)∥² + β² σ²_{l,skip} ∥x̃^l∥² ) ],   (2)

where the expectation E_l is taken with respect to the initial parameters W^l_2, W^l_1, and W^l_skip, and the constant c fulfills 0.24 ≤ c ≤ 0.25.

Note that this statement holds even for finite networks. To clarify what this means for the separability of inputs, we have to compute the expectation with respect to the parameters of W^l_1.
To gain an intuition, we employ an approximation that holds for a wide intermediary network.

Insight 2.6 (Covariance of signals for different inputs increases with depth). Let a fully-connected ResNet with random parameters as in Definition 2.2 be given. It follows from Theorem 2.5 that the outputs corresponding to different inputs become more difficult to distinguish for increasing depth L. For simplicity, let us assume that ∥x∥ = ∥x̃∥ = 1. Then, in the mean field limit N_{m_l} → ∞, the covariance of the signals is lower bounded by

E ⟨x^L, x̃^L⟩ ≥ γ₁^L ⟨x, x̃⟩ + γ₂ ∑_{k=0}^{L-1} γ₁^k = γ₁^L ⟨x, x̃⟩ + γ₂ (1 − γ₁^L)/(1 − γ₁)   (3)

for γ₁ = (1 + β²)/4 ≤ 1/2 and γ₂ = c(α² + 2) ≈ α²/4 + 1/2, using E_{l-1} ∥x^l∥ ∥x̃^l∥ ≈ 1.

Since γ₁ < 1, the contribution of the original input correlation ⟨x, x̃⟩ vanishes for increasing depth L. Meanwhile, by adding a constant contribution in every layer, irrespective of the input correlations, E ⟨x^L, x̃^L⟩ increases with L and converges to the maximum value 1 (or a slightly smaller value in case of smaller width N_{m_l}). Thus, deep models essentially map every input to almost the same output vector, which makes it impossible for the initial network to distinguish different inputs and provide information for meaningful gradients. Fig. 2b demonstrates this trend and compares it with our initialization proposal RISOTTO, which does not suffer from this problem. While the general trend holds for residual as well as standard fully-connected feed-forward networks (β = 0), interestingly, we still note a mitigation for a strong skip branch (β = 1): the contribution of the input correlations decreases more slowly and the constant contribution is reduced for larger β. Thus, residual networks make the training of deeper models feasible, as they were designed to do (He et al., 2016).
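The bound of Eq. (3) can be evaluated numerically. The sketch below is ours (the helper name is our choice, and we fix c = 0.25, the upper end of the stated range); it shows how initially orthogonal inputs (⟨x, x̃⟩ = 0) are mapped to outputs with correlation close to 1 within a few blocks:

```python
def covariance_bound(corr0, alpha, depth, c=0.25):
    """Lower bound of Insight 2.6 for unit-norm inputs with alpha^2 + beta^2 = 1."""
    beta2 = 1.0 - alpha**2
    g1 = (1.0 + beta2) / 4.0   # gamma_1: decay factor of the input correlation
    g2 = c * (alpha**2 + 2.0)  # gamma_2: constant contribution added per block
    return g1**depth * corr0 + g2 * (1.0 - g1**depth) / (1.0 - g1)

for L in (1, 5, 20):
    print(L, covariance_bound(corr0=0.0, alpha=0.0, depth=L))
# depth 1 -> 0.5, depth 5 -> 0.96875, depth 20 -> ~0.999999:
# orthogonal inputs become almost perfectly correlated as depth grows
```

Note that for any α the bound converges to 1, and a larger β (smaller α) only slows the decay of the input-correlation term, matching the mitigation effect discussed above.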
This observation is in line with the findings of Yang & Schoenholz (2017), which were obtained by mean field approximations for a different case without a ReLU after the residual block (so that x^l = z^{l-1}). It also explains how ResNet initialization approaches like Fixup (Zhang et al., 2018) and SkipInit (De & Smith, 2020) can be successful in training deep ResNets: they set α = 0 and β = 1. If W_skip = I, this approach even leads to dynamical isometry but trades it for very limited feature diversity (Blumenfeld et al., 2020) and an initially broken residual branch. Figure 2c highlights potential advantages that can be achieved with α ≠ 0 if the initialization can still maintain dynamical isometry, as our proposal RISOTTO does.
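The norm-preservation statement of Theorem 2.3 and Insight 2.4 can also be checked by simulation. The following Monte Carlo sketch is ours (widths, depth, α, and trial count are arbitrary illustrative choices); it estimates E∥x^L∥² under the normal ResNet initialization of Definition 2.2 with k = 1 and compares it to ∥x∥²:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

def sample_output_sq_norm(x, depth, n, alpha, rng):
    """||x^L||^2 for one draw of the normal ResNet init (Definition 2.2, k = 1)."""
    beta = np.sqrt(1.0 - alpha**2)
    std = np.sqrt(2.0 / n)  # sigma^2 = 2 / width for all layers of width n
    h = relu(std * rng.standard_normal((n, x.size)) @ x)   # first FC layer
    for _ in range(depth):
        W1 = std * rng.standard_normal((n, n))     # sigma_1^2 = 2 / N_m
        W2 = std * rng.standard_normal((n, n))     # sigma_2^2 = 2 / N_{l+1}
        Wskip = std * rng.standard_normal((n, n))  # sigma_skip^2 = 2 / N_{l+1}
        h = relu(alpha * W2 @ relu(W1 @ h) + beta * Wskip @ h)
    return np.sum(h**2)

x = rng.standard_normal(32)
ratio = np.mean([sample_output_sq_norm(x, depth=5, n=100, alpha=0.5, rng=rng)
                 for _ in range(400)]) / np.sum(x**2)
print(ratio)  # fluctuates around 1: the average squared norm is preserved
```

The same simulation with the covariance of two different inputs instead of the norm reproduces the correlation build-up of Insight 2.6; norm preservation alone is therefore not sufficient for trainability.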

2.3. RISOTTO: ORTHOGONAL INITIALIZATION OF RESNETS FOR DYNAMICAL ISOMETRY

Our main objective is to avoid the highlighted drawbacks of the ResNet initialization schemes that we have discussed in the last section. We aim to maintain input correlations not only on average but exactly, and to ensure that the input-output Jacobian of our randomly initialized ResNet is an isometry; all its eigenvalues thus equal 1 or -1. In comparison with Fixup and SkipInit, we also seek to increase the feature diversity and allow for arbitrary scaling of the residual versus the skip branch.

Looks-linear matrix structure The first step in designing an orthogonal initialization for a residual block is to allow the signal to propagate through a ReLU activation without losing half of the information. This can be achieved with the help of a looks-linear initialization (Shang et al., 2016; Burkholz & Dubatovka, 2019; Balduzzi et al., 2017), which leverages the identity x = ϕ(x) − ϕ(−x). Accordingly, the first layer maps the transformed input to a positive and a negative part. A fully-connected layer is defined by x^1 = [x̂^1_+; x̂^1_−] = ϕ([U^0; −U^0] x) with respect to a submatrix U^0. Note that the difference of both components defines a linear transformation of the input, x̂^1_+ − x̂^1_− = U^0 x. Thus, all information about U^0 x is contained in x^1. The next layers continue to separate the positive and negative part of a signal. Assuming this structure as input, the next layers x^{l+1} = ϕ(W^l x^l) proceed with the block structure W^l = [U^l, −U^l; −U^l, U^l]. As a consequence, the activations of every layer can be separated into a positive and a negative part as x^l = [x̂^l_+; x̂^l_−], so that x̂^l_+ − x̂^l_− = U^{l-1}(x̂^{l-1}_+ − x̂^{l-1}_−). The submatrices U^l can be specified as in the case of a linear neural network. Thus, if they are orthogonal, they induce a neural network with the dynamical isometry property (Burkholz & Dubatovka, 2019). With the help of the Delta Orthogonal initialization (Xiao et al., 2018), the same idea can also be transferred to convolutional layers.
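The looks-linear mechanism can be verified directly in a fully-connected NumPy sketch (ours, for illustration): after the first layer splits the signal into positive and negative parts, every later looks-linear layer acts as an exact orthogonal map on the difference of the two halves, and the ReLU loses no information.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def orthogonal(n, rng):
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

n = 16
x = rng.standard_normal(n)

# First layer: split into positive and negative parts, x^1 = relu([U0; -U0] x)
U0 = orthogonal(n, rng)
h = relu(np.concatenate([U0 @ x, -U0 @ x]))

# Later layers: W = [[U, -U], [-U, U]] keeps the two parts separated
U = orthogonal(n, rng)
W = np.block([[U, -U], [-U, U]])
h2 = relu(W @ h)

# The difference of the two halves is an exact linear (orthogonal) map of x,
# and the squared norm of the activations equals the squared input norm.
diff = h2[:n] - h2[n:]
print(np.allclose(diff, U @ U0 @ x))             # True
print(np.isclose(np.sum(h2**2), np.sum(x**2)))   # True
```

The key invariant is that the two halves have disjoint supports, so ∥h∥² = ∥x̂_+∥² + ∥x̂_−∥² equals the squared norm of the effective signal x̂_+ − x̂_−.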
Given a matrix H ∈ R^{N_{l+1} × N_l}, a convolutional tensor W ∈ R^{N_{l+1} × N_l × k_1 × k_2} is defined by w_{ijk'_1k'_2} = h_{ij} if k'_1 = ⌊k_1/2⌋ and k'_2 = ⌊k_2/2⌋, and w_{ijk'_1k'_2} = 0 otherwise. We make frequent use of the combination of the idea behind the Delta Orthogonal initialization and the looks-linear structure.

Definition 2.7 (Looks-linear structure). A tensor W ∈ R^{N_{l+1} × N_l × k_1 × k_2} is said to have looks-linear structure with respect to a submatrix U ∈ R^{⌊N_{l+1}/2⌋ × ⌊N_l/2⌋} if

w_{ijk'_1k'_2} = h_{ij} if k'_1 = ⌊k_1/2⌋ and k'_2 = ⌊k_2/2⌋, and 0 otherwise, with H = [U, −U; −U, U].   (4)

It has first layer looks-linear structure if H = [U; −U].

We impose this structure separately on the residual and skip branches but choose the corresponding submatrices wisely. To introduce RISOTTO, we only have to specify the corresponding submatrices for W^l_1, W^l_2, and W^l_skip. The main idea of RISOTTO is to choose them so that the initial residual block acts as a linear orthogonal map. The Type C residual block assumes that the skip connection is a projection such that h^l_i(x) = ∑_{j ∈ [N_l]} W^l_{ij,skip} * x^l_j, where W^l_skip ∈ R^{N_{l+1} × N_l × 1 × 1} is a trainable convolutional tensor with kernel size 1 × 1. Thus, we can adapt the skip connections to compensate for the added activations of the residual branch, as visualized in Fig. 3.

Definition 2.8 (RISOTTO for Type C residual blocks). For a residual block of the form x^{l+1} = ϕ(α * f^l(x^l) + h^l(x^l)), where f^l(x^l) = W^l_2 * ϕ(W^l_1 * x^l) and h^l(x^l) = W^l_skip * x^l, the weights W^l_1, W^l_2, and W^l_skip are initialized with looks-linear structure according to Def. 2.7 with the submatrices U^l_1, U^l_2, and U^l_skip, respectively. The matrices U^l_1, U^l_2, and M^l are drawn independently and uniformly from all matrices with orthogonal rows or columns (depending on their dimension), while the skip submatrix is set to U^l_skip = M^l − α U^l_2 U^l_1.
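A fully-connected analogue of Definition 2.8 makes the mechanism transparent (this simplification is ours: we use square orthogonal submatrices and matrix products in place of the convolutional Delta Orthogonal construction). The residual and skip contributions combine to α U_2 U_1 + U_skip = M, so the whole block reduces to the orthogonal map s ↦ M s on the effective signal s = x̂_+ − x̂_−:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def orthogonal(n, rng):
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def looks_linear(U):
    # Block structure of Definition 2.7, specialized to a dense matrix
    return np.block([[U, -U], [-U, U]])

n, alpha = 16, 0.5
U1, U2, M = (orthogonal(n, rng) for _ in range(3))
U_skip = M - alpha * U2 @ U1  # Definition 2.8
W1, W2, Wskip = looks_linear(U1), looks_linear(U2), looks_linear(U_skip)

# Separated input as produced by a looks-linear first layer: x^l = [relu(s); relu(-s)]
s = rng.standard_normal(n)
x = np.concatenate([relu(s), relu(-s)])

z = alpha * W2 @ relu(W1 @ x) + Wskip @ x  # Type C block of Eq. (1)
out = relu(z)

# The block acts as the orthogonal map s -> M s on the effective signal,
# and the squared activation norm is preserved exactly.
print(np.allclose(out[:n] - out[n:], M @ s))     # True
print(np.isclose(np.sum(out**2), np.sum(x**2)))  # True
```

Note that the cancellation U_skip + α U_2 U_1 = M holds for every α, which is what allows RISOTTO to keep both branches active instead of zeroing the residual one.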
The Type B residual block poses the additional challenge that we cannot adjust the skip connections initially, because they are defined by the identity and are not trainable. Thus, we have to adapt the residual connections instead to compensate for the added input signal. To be able to distinguish the positive and the negative part of the input signal after the two convolutional layers, we have to pass it through the first ReLU without transformation and thus define W^l_1 as an identity mapping.

Definition 2.9 (RISOTTO for Type B residual blocks). For a residual block of the form x^{l+1} = ϕ(α * f^l(x^l) + x^l), where f^l(x^l) = W^l_2 * ϕ(W^l_1 * x^l), RISOTTO initializes the weight W^l_1 as w^l_{1,ijk'_1k'_2} = 1 if i = j, k'_1 = ⌊k_1/2⌋, and k'_2 = ⌊k_2/2⌋, and w^l_{1,ijk'_1k'_2} = 0 otherwise. W^l_2 has looks-linear structure (according to Def. 2.7) with respect to a submatrix U^l_2 = M^l − (1/α) I, where M^l ∈ R^{N_{l+1}/2 × N_l/2} is a random matrix with orthogonal columns or rows, respectively.

Note that the Type B initialization in particular contains a considerable number of zero entries. To induce higher symmetry breaking and promote feature diversity (Blumenfeld et al., 2020), we also study a noisy variant, N-RISOTTO, in which we add a small amount of noise εΣ to our RISOTTO parameters, where ε = 10^{-4} and Σ follows the He normal distribution (He et al., 2015). As we prove in the appendix, residual blocks initialized with RISOTTO preserve the norm of the input and the cosine similarity of signals corresponding to different inputs not only on average but exactly, as exemplified by Fig. 2b. This addresses the drawbacks of initialization schemes that are based on independent weight entries, as discussed in the last section.

Theorem 2.10 (RISOTTO preserves signal norm and similarity). A residual block that is initialized with RISOTTO maps input activations x^l to output activations x^{l+1} such that the norm is preserved, ∥x^{l+1}∥² = ∥x^l∥².
The scalar product between activations corresponding to two inputs x and x̃ is preserved in the sense that ⟨x̂^{l+1}_+ − x̂^{l+1}_−, x̃^{l+1}_+ − x̃^{l+1}_−⟩ = ⟨x̂^l_+ − x̂^l_−, x̃^l_+ − x̃^l_−⟩.

The full proof is presented in Appendix A.3. It is straightforward, as the residual block is defined as an orthogonal linear transform, which maintains norms and distances of the separated positive and negative parts of a signal. Like the residual block itself, the input-output Jacobian is also formed of orthogonal submatrices. It follows that RISOTTO induces perfect dynamical isometry for finite width and depth.

Theorem 2.11 (RISOTTO achieves exact dynamical isometry for residual blocks). A residual block whose weights are initialized with RISOTTO achieves exact dynamical isometry, so that the singular values λ ∈ σ(J) of the input-output Jacobian J ∈ R^{N_{l+1} × k_{l+1} × N_l × k_l} fulfill λ ∈ {−1, 1}.

The detailed proof is given in Appendix A.2. Since the weights are initialized so that the residual block acts as an orthogonal transform, the input-output Jacobian is also an isometry, which has the required spectral properties. Drawing on the well established theory of dynamical isometry (Chen et al., 2018; Saxe et al., 2013; Mishkin & Matas, 2015; Poole et al., 2016; Pennington et al., 2017), we therefore expect RISOTTO to enable fast and stable training of very deep ResNets, as we demonstrate next in experiments.

3. EXPERIMENTS

In all our experiments, we use two kinds of ResNets consisting of residual blocks of Type B or Type C, as defined in Section 2.1 and visualized in Fig. 2a. A ResNet (B) contains Type B residual blocks if the input and output dimensions of a block are equal and a Type C block otherwise, while a ResNet (C) has Type C residual blocks throughout. All implementation details and our hyperparameter tuning approach are described in Appendix A.4. We use a learnable scalar α that is initialized as α = 1 to balance skip and residual connections (see Fig. 2c). Results for all cases are reported on the benchmark datasets CIFAR10, CIFAR100 (Krizhevsky et al., 2014), and Tiny ImageNet (Le & Yang, 2015). Additional key experiments on ImageNet are presented in Appendix A.9 and are in line with our findings on Tiny ImageNet. Our main objective is to highlight three advantageous properties of our proposed initialization RISOTTO: (a) It enables stable and fast training of deep ResNets without any normalization methods and outperforms state-of-the-art schemes designed for this purpose (see Table 1 and Fig. 4 (center)). (b) Without normalization, it can compete with NF ResNets, an alternative normalization approach to BN (see Fig. 4 (right) and Fig. 9). (c) It can outperform alternative initialization methods in combination with Batch Normalization (BN) (see Table 2 and Fig. 4 (left)). We hypothesize that this is enabled by the balance between skip and residual connections that is unique to RISOTTO.

RISOTTO without BN We start our empirical investigation by evaluating the performance of ResNets without any normalization layers. We compare our initialization scheme RISOTTO to the state-of-the-art baselines Fixup (Zhang et al., 2018) and SkipInit (De & Smith, 2020). Both methods have been proposed as substitutes for BN and are designed to achieve the benefits of BN by scaling down weights with depth (Fixup) and biasing signal flow towards the skip connection (Fixup and SkipInit).
Fixup has so far achieved the best performance for training ResNets without any form of normalization. As shown in Table 1, RISOTTO is able to outperform both Fixup and SkipInit. Moreover, we also observed in our experiments that Fixup and SkipInit are both susceptible to bad random seeds and can lead to many failed training runs. The unstable gradients at the beginning, due to the zero initialization of the last residual layer, might be responsible for this phenomenon. RISOTTO produces stable results. With ResNet (C), it also achieves the overall highest accuracy for all three datasets. We have also evaluated N-RISOTTO for ResNet50 (B) on ImageNet in comparison with Fixup, as shown in Fig. 10.

Can we reduce the number of BN layers? Considering the importance of BN for performance and how other methods struggle to compete with it, we note that reducing the number of BN layers in a network can still render similar benefits at reduced computational and memory costs. Often a single BN layer is sufficient, and with RISOTTO we are flexible in choosing its position. Fig. 4 (right) reports results for BN after the last layer, which usually achieves the best performance. Additional figures in the appendix (Figs. 5, 6, and 7) explore optional BN layer placements in more detail.

4. CONCLUSIONS

We have introduced RISOTTO, a new initialization method for residual networks with ReLU activations. It enables residual networks of any depth and width to achieve exact dynamical isometry at initialization. Furthermore, it balances the signal contributions of the residual and skip branches instead of suppressing the residual branch initially. This not only leads to higher feature diversity but also promotes stable and faster training. In practice, this is highly effective, as we demonstrate on multiple standard benchmark datasets. We show that RISOTTO competes with and often outperforms Batch Normalization free methods and even improves the performance of ResNets that are trained with Batch Normalization. While we have focused our exposition on ResNets with ReLU activation functions, RISOTTO could in future work also be transferred to transformer architectures, similarly to the approach by Bachlechner et al. (2021). In addition, integration into the ISONet (Qi et al., 2020) framework could potentially lead to further improvements.

A APPENDIX

A.1 SIGNAL PROPAGATION

Recall our definition of a residual block in Equation (1):
$$z^0 := W^0 x, \quad x^l = \phi(z^{l-1}), \quad z^l := \alpha_l f_l(x^l) + \beta_l h_l(x^l), \quad z^{\text{out}} := W^{\text{out}} P(x^L) \quad (5)$$
for $1 \leq l \leq L$, with $f_l(x^l) = W^l_2 \phi(W^l_1 x^l)$ and $h_l(x^l) = W^l_{\text{skip}} x^l$, where the biases have been set to zero. In the following theoretical derivations we focus on fully-connected networks for simplicity, so that $W^l_2 \in \mathbb{R}^{N_{l+1} \times N_{m_l}}$, $W^l_1 \in \mathbb{R}^{N_{m_l} \times N_l}$, and $W^l_{\text{skip}} \in \mathbb{R}^{N_{l+1} \times N_l}$. The general principle could also be transferred to convolutional layers similarly to the mean field analysis by Xiao et al. (2018). A common choice for the initialization of the parameters is defined as follows.

Definition A.1 (Normal ResNet Initialization for fully-connected residual blocks). Let a neural network consist of fully-connected residual blocks as defined by Equ. (5). All biases are initialized as 0 and all weight matrix entries are independently normally distributed with $w^l_{ij,2} \sim \mathcal{N}(0, \sigma^2_{l,2})$, $w^l_{ij,1} \sim \mathcal{N}(0, \sigma^2_{l,1})$, and $w^l_{ij,\text{skip}} \sim \mathcal{N}(0, \sigma^2_{l,\text{skip}})$. Then the Normal ResNet Initialization is defined by the choice $\sigma^2_{l,1} = 2/N_{m_l}$, $\sigma^2_{l,2} = 2/N_{l+1}$, $\sigma^2_{l,\text{skip}} = 2/N_{l+1}$, and $\alpha_l, \beta_l \geq 0$ that fulfill $\alpha_l^2 + \beta_l^2 = 1$.

Our objective is to analyze the distribution of the signals $x^{l+1}$ and $\tilde{x}^{l+1}$, which correspond to the random neuron states of an initial neural network evaluated at input $x^0$ or $\tilde{x}^0$, respectively. More precisely, we derive the average squared signal norm and the covariance of two signals that are evaluated at different inputs. We start with the squared signal norm and, for convenience, restate Theorem 2.10 before the proof.

Theorem A.2 (Theorem 2.10 in main manuscript). Let a neural network consist of residual blocks as defined by Equ. (1) or Equ. (5) that start with a fully-connected layer $W^0$, which has $N_1$ output channels.
Assume that all biases are initialized as 0 and that all weight matrix entries are independently normally distributed with $w^l_{ij,2} \sim \mathcal{N}(0, \sigma^2_{l,2})$, $w^l_{ij,1} \sim \mathcal{N}(0, \sigma^2_{l,1})$, and $w^l_{ij,\text{skip}} \sim \mathcal{N}(0, \sigma^2_{l,\text{skip}})$. Then the expected squared norm of the output after one fully-connected layer and L residual blocks applied to input x is given by
$$\mathbb{E}\|x^L\|^2 = \frac{N_1}{2} \sigma_0^2 \prod_{l=1}^{L-1} \frac{N_{l+1}}{2} \Big( \alpha_l^2 \sigma_{l,2}^2 \sigma_{l,1}^2 \frac{N_{m_l}}{2} + \beta_l^2 \sigma_{l,\text{skip}}^2 \Big) \|x\|^2.$$

Proof. First, we study how the signal is transformed by a single layer before we deduce the final output signal successively. To do so, we assume that the signal of the previous layer is given. This means we condition the expectation on the parameters of the previous layers and thus on $x^l$ and $\tilde{x}^l$. For notational convenience, we define $\mathbb{E}_l(z) = \mathbb{E}(z \mid x^l, \tilde{x}^l)$ and skip the index l in the following derivations. We write $x = x^{l+1}$, $z = z^l$, $f(x^l) = W^l_2 \phi(W^l_1 x^l)$, $h(x^l) = W^l_{\text{skip}} x^l$, $\alpha = \alpha_l$, $\beta = \beta_l$, and keep $x^l$ for the input of the block. Given all parameters of the previous layers, we deduce
$$\mathbb{E}(\|x\|^2 \mid x^l) = \sum_{i=1}^{N_{l+1}} \mathbb{E}(x_i)^2 = N_{l+1} \mathbb{E}(x_1)^2 = \frac{N_{l+1}}{2} \mathbb{E}_l(z_1)^2. \quad (6)$$
The first equality follows from the fact that all random parameters are independent and the signal components are identically distributed. The third equality holds because the distribution of each signal component is symmetric around zero, so that the ReLU projects half of the signal away while the contribution to the average of the squared signal is just cut in half.
We continue with
$$\mathbb{E}_l(z_1)^2 = \mathbb{E}_l \Big( \alpha \sum_{i=1}^{N_{m_l}} w_{2,1i} \, \phi\Big( \sum_{j=1}^{N_l} w_{1,ij} x^l_j \Big) + \beta \sum_{k=1}^{N_l} w_{\text{skip},1k} x^l_k \Big)^2 \quad (7)$$
$$= \alpha^2 \sum_{i=1}^{N_{m_l}} \mathbb{E}_l\big(w_{2,1i}^2\big) \, \mathbb{E}_l \Big( \phi\Big( \sum_{j=1}^{N_l} w_{1,ij} x^l_j \Big)^2 \Big) + \beta^2 \sum_{k=1}^{N_l} \mathbb{E}_l\big(w_{\text{skip},1k}^2\big) (x^l_k)^2 \quad (8)$$
$$= \alpha^2 \sigma_{l,2}^2 N_{m_l} \, \mathbb{E}_l \Big( \phi\Big( \sum_{j=1}^{N_l} w_{1,1j} x^l_j \Big)^2 \Big) + \beta^2 \sigma_{l,\text{skip}}^2 \|x^l\|^2 \quad (9)$$
$$= \alpha^2 \sigma_{l,2}^2 N_{m_l} \frac{1}{2} \, \mathbb{E}_l \Big( \sum_{j=1}^{N_l} w_{1,1j} x^l_j \Big)^2 + \beta^2 \sigma_{l,\text{skip}}^2 \|x^l\|^2 \quad (10)$$
$$= \Big( \alpha^2 \sigma_{l,2}^2 \sigma_{l,1}^2 \frac{N_{m_l}}{2} + \beta^2 \sigma_{l,\text{skip}}^2 \Big) \|x^l\|^2, \quad (11)$$
as all weight entries are independent and the expectation is a linear operation. To obtain Equation (10), we repeated the same argument as for Equation (6) to take care of the ReLU. Afterwards, we used again the independence of the weights $w_{1,1j}$. From repeated application of Equations (6) and (11), we obtain
$$\mathbb{E}\|x^L\|^2 = \frac{N_1}{2} \sigma_0^2 \prod_{l=1}^{L-1} \frac{N_{l+1}}{2} \Big( \alpha_l^2 \sigma_{l,2}^2 \sigma_{l,1}^2 \frac{N_{m_l}}{2} + \beta_l^2 \sigma_{l,\text{skip}}^2 \Big) \|x\|^2$$
for $x = x^0$. To make sure that the signal norm neither explodes nor vanishes for very deep networks, we would need to choose the weight variances so that
$$\frac{N_1}{2} \sigma_0^2 \prod_{l=1}^{L} \frac{N_{l+1}}{2} \Big( \alpha_l^2 \sigma_{l,2}^2 \sigma_{l,1}^2 \frac{N_{m_l}}{2} + \beta_l^2 \sigma_{l,\text{skip}}^2 \Big) \approx 1.$$
The common Normal ResNet Initialization actually achieves this, since $\sigma^2_{l,1} = 2/N_{m_l}$, $\sigma^2_{l,2} = \sigma^2_{l,\text{skip}} = 2/N_{l+1}$, $\sigma^2_0 = 2/N_1$, and $\alpha_l^2 + \beta_l^2 = 1$ even preserve the average norm in every layer. How does this choice affect whether signals for different inputs are distinguishable? To answer this question, we analyze the covariance of the neuron states for two different inputs. We begin again by analyzing the transformation of a single layer and condition on all parameters of the previous layers. To obtain a lower bound on the covariance, the following lemma will be helpful. It has been derived by Burkholz & Dubatovka (2019) as part of Theorem 5.

Lemma A.3. Assume that two random variables $z_1$ and $z_2$ are jointly normally distributed as $z \sim \mathcal{N}(0, V)$ with covariance matrix V.
Then the covariance of the ReLU-transformed variables $x_1 = \phi(z_1)$ and $x_2 = \phi(z_2)$ is
$$\mathbb{E}(x_1 x_2) = \sqrt{v_{11} v_{22}} \Big( g(\rho)\rho + \frac{\sqrt{1-\rho^2}}{2\pi} \Big) \geq \frac{1}{4} v_{12} + c \sqrt{v_{11} v_{22}}, \quad (13)$$
where $\rho = v_{12}/\sqrt{v_{11} v_{22}}$ and $g(\rho)$ is defined as $g(\rho) = \frac{1}{\sqrt{2\pi}} \int_0^\infty \Phi\big( \frac{\rho}{\sqrt{1-\rho^2}} u \big) \exp(-u^2/2)\, du$ for $|\rho| \neq 1$, with $g(-1) = 0$ and $g(1) = 0.5$. The constant fulfills $0.24 \leq c \leq 0.25$.

In the following, we assume that all weight parameters are normally distributed so that we can use the above lemma. However, other parameter distributions in large networks would lead to similar results, as the central limit theorem implies that the relevant quantities are approximately normally distributed. We study the covariance of the signals $x^{l+1} = x$ and $\tilde{x}^{l+1} = \tilde{x}$, which correspond to the random neuron states of an initial neural network evaluated at input $x^0$ or $\tilde{x}^0$, respectively.

Theorem A.4 (Theorem 2.5 in main manuscript). Let a fully-connected residual block be given as defined by Equ. (1) or Equ. (5). Assume that all biases are initialized as 0 and that all weight matrix entries are independently normally distributed with $w^l_{ij,2} \sim \mathcal{N}(0, \sigma^2_{l,2})$, $w^l_{ij,1} \sim \mathcal{N}(0, \sigma^2_{l,1})$, and $w^l_{ij,\text{skip}} \sim \mathcal{N}(0, \sigma^2_{l,\text{skip}})$. Let $x^{l+1}$ denote the neuron states of Layer l+1 for input x and $\tilde{x}^{l+1}$ the same neurons for input $\tilde{x}$. Then their covariance given all parameters of the previous layers fulfills
$$\mathbb{E}_l \langle x^{l+1}, \tilde{x}^{l+1} \rangle \geq \frac{1}{4} \frac{N_{l+1}}{2} \Big( \alpha_l^2 \sigma_{l,2}^2 \sigma_{l,1}^2 \frac{N_{m_l}}{2} + 2 \beta_l^2 \sigma_{l,\text{skip}}^2 \Big) \langle x^l, \tilde{x}^l \rangle + \frac{c}{4} \alpha_l^2 N_{l+1} \sigma_{l,2}^2 \sigma_{l,1}^2 N_{m_l} \|x^l\| \|\tilde{x}^l\|$$
$$+ c\, N_{l+1} \, \mathbb{E}_{W^l_1} \sqrt{ \big( \alpha_l^2 \sigma_{l,2}^2 \|\phi(W^l_1 x^l)\|^2 + \beta_l^2 \sigma_{l,\text{skip}}^2 \|x^l\|^2 \big) \big( \alpha_l^2 \sigma_{l,2}^2 \|\phi(W^l_1 \tilde{x}^l)\|^2 + \beta_l^2 \sigma_{l,\text{skip}}^2 \|\tilde{x}^l\|^2 \big) },$$
where the expectation $\mathbb{E}_l$ is taken with respect to the initial parameters $W^l_2$, $W^l_1$, and $W^l_{\text{skip}}$.

Proof. Let us assume again that all parameters of the previous layers are given, in addition to the parameters of the first residual layer $W_1 = W^l_1$, and use the notation from the proof of Theorem 2.10.
Based on similar arguments as in the derivation of the average squared signal norm, we observe that $z = z^l$ and $\tilde{z} = \tilde{z}^l$ are jointly normally distributed. In particular, the components $z_i$ are identically distributed for the same input. Each pair of components $z_i$ and $\tilde{z}_i$ for different inputs has covariance matrix V with entries
$$v_{11} = \alpha^2 \sigma_{l,2}^2 \|\phi(W_1 x^l)\|^2 + \beta^2 \sigma_{l,\text{skip}}^2 \|x^l\|^2, \quad v_{22} = \alpha^2 \sigma_{l,2}^2 \|\phi(W_1 \tilde{x}^l)\|^2 + \beta^2 \sigma_{l,\text{skip}}^2 \|\tilde{x}^l\|^2,$$
$$v_{12} = \alpha^2 \sigma_{l,2}^2 \langle \phi(W_1 x^l), \phi(W_1 \tilde{x}^l) \rangle + \beta^2 \sigma_{l,\text{skip}}^2 \langle x^l, \tilde{x}^l \rangle.$$
It follows that
$$\mathbb{E}_l(\langle x, \tilde{x} \rangle) = \mathbb{E}_{W_1} \mathbb{E}_l(\langle x, \tilde{x} \rangle \mid W_1) = \sum_{i=1}^{N_{l+1}} \mathbb{E}_{W_1} \mathbb{E}_l(\phi(z_i)\phi(\tilde{z}_i) \mid W_1) \geq N_{l+1} \, \mathbb{E}_{W_1} \Big( \frac{1}{4} v_{12} + c \sqrt{v_{11} v_{22}} \Big), \quad (17)$$
where we applied Lemma A.3 to obtain the inequality and used the fact that the entries of the covariance matrix are identically distributed across the index i with respect to $W_1$. We can bound the first term $\mathbb{E}_{W_1} v_{12}$ by using Lemma A.3 again, as $W_1 x^l$ and $W_1 \tilde{x}^l$ are jointly normally distributed given $x^l$ and $\tilde{x}^l$. The associated covariance matrix S for one component corresponding to two different inputs has entries $s_{11} = \sigma_{l,1}^2 \|x^l\|^2$, $s_{22} = \sigma_{l,1}^2 \|\tilde{x}^l\|^2$, and $s_{12} = \sigma_{l,1}^2 \langle x^l, \tilde{x}^l \rangle$. Lemma A.3 therefore gives us
$$\mathbb{E}_{W_1}(v_{12}) = \mathbb{E}_{W_1}\big( \alpha^2 \sigma_{l,2}^2 \langle \phi(W_1 x^l), \phi(W_1 \tilde{x}^l) \rangle \big) + \beta^2 \sigma_{l,\text{skip}}^2 \langle x^l, \tilde{x}^l \rangle \geq \alpha^2 \sigma_{l,2}^2 N_{m_l} \Big( \frac{1}{4} \sigma_{l,1}^2 \langle x^l, \tilde{x}^l \rangle + c\, \sigma_{l,1}^2 \|x^l\| \|\tilde{x}^l\| \Big) + \beta^2 \sigma_{l,\text{skip}}^2 \langle x^l, \tilde{x}^l \rangle. \quad (19)$$
Determining the second part of Equation (17) is more involved. With the shorthand $\gamma = \alpha^2 \sigma_{l,2}^2 / (\beta^2 \sigma_{l,\text{skip}}^2)$ we have
$$\mathbb{E}_{W_1}(\sqrt{v_{11} v_{22}}) = \mathbb{E}_{W_1} \sqrt{ \big( \alpha^2 \sigma_{l,2}^2 \|\phi(W_1 x^l)\|^2 + \beta^2 \sigma_{l,\text{skip}}^2 \|x^l\|^2 \big) \big( \alpha^2 \sigma_{l,2}^2 \|\phi(W_1 \tilde{x}^l)\|^2 + \beta^2 \sigma_{l,\text{skip}}^2 \|\tilde{x}^l\|^2 \big) } \quad (20)$$
$$= \beta^2 \sigma_{l,\text{skip}}^2 \|x^l\| \|\tilde{x}^l\| \, \mathbb{E}_{W_1} \sqrt{ \Big( \gamma \Big\| \phi\Big( W_1 \tfrac{x^l}{\|x^l\|} \Big) \Big\|^2 + 1 \Big) \Big( \gamma \Big\| \phi\Big( W_1 \tfrac{\tilde{x}^l}{\|\tilde{x}^l\|} \Big) \Big\|^2 + 1 \Big) } \quad (21)$$
$$\geq \beta^2 \sigma_{l,\text{skip}}^2 \|x^l\| \|\tilde{x}^l\| \, \mathbb{E}_{W_1} \sqrt{ \Big( \gamma \Big\| \phi\Big( W_1 \tfrac{x^l}{\|x^l\|} \Big) \Big\|^2 + 1 \Big) \Big( \gamma \Big\| \phi\Big( -W_1 \tfrac{x^l}{\|x^l\|} \Big) \Big\|^2 + 1 \Big) } \quad (22)$$
$$\geq \beta^2 \sigma_{l,\text{skip}}^2 \|x^l\| \|\tilde{x}^l\| \, \mathbb{E}_{W_1} \sqrt{ \Big( \gamma \sum_{j=1}^{M} w_{1,j1}^2 + 1 \Big) \Big( \gamma \sum_{j=M+1}^{N_{m_l}} w_{1,j1}^2 + 1 \Big) } \quad (23)$$
$$= \beta^2 \sigma_{l,\text{skip}}^2 \|x^l\| \|\tilde{x}^l\| \, \mathbb{E}_M \mathbb{E}_Y \sqrt{ \Big( \gamma \sigma_{l,1}^2 \frac{N_{m_l}}{2} \Big( \frac{2}{N_{m_l}} \sum_{j=1}^{M} y_j^2 \Big) + 1 \Big) \Big( \gamma \sigma_{l,1}^2 \frac{N_{m_l}}{2} \Big( \frac{2}{N_{m_l}} \sum_{j=M+1}^{N_{m_l}} y_j^2 \Big) + 1 \Big) } \quad (24)$$
$$\approx \beta^2 \sigma_{l,\text{skip}}^2 \|x^l\| \|\tilde{x}^l\| \Big( \gamma \sigma_{l,1}^2 \frac{N_{m_l}}{2} + 1 \Big) = \|x^l\| \|\tilde{x}^l\| \Big( \alpha^2 \sigma_{l,2}^2 \sigma_{l,1}^2 \frac{N_{m_l}}{2} + \beta^2 \sigma_{l,\text{skip}}^2 \Big).$$
In Equation (22), we have used the fact that $\mathbb{E}_{W_1}\sqrt{v_{11} v_{22}}$ is monotonically increasing in the covariance $s_{12}$, so that the minimum is attained for a perfectly negative association between $x^l$ and $\tilde{x}^l$, i.e., $\tilde{x}^l = -x^l$. To simplify the derivation, we further study the case $x^l/\|x^l\| = (1, 0, 0, \ldots)^T$. It follows that either $\phi(w_{1,j1})$ or $\phi(-w_{1,j1})$ is positive while the other one is zero. To ease the notation, by reindexing, we can assume that the first M components fulfill $\phi(w_{1,j1}) > 0$, while the remaining $N_{m_l} - M$ components fulfill the opposite. Note that because $w_{1,j1}$ is distributed symmetrically around zero, $M \sim \text{Bin}(N_{m_l}, 0.5)$ is a binomially distributed random variable with success probability 0.5, so that $\mathbb{E}M = N_{m_l}/2$. To make the dependence on $N_{m_l}$ of the different variables more obvious, we have replaced the random variables $w_{1,j1}$, which are normally distributed with standard deviation $\sigma_{l,1}$, by standard normally distributed random variables $y_j$. This makes the use of the law of large numbers in the last step more apparent. Note that this approximation is only accurate for large $N_{m_l} \gg 1$, which is usually fulfilled in practice.
Finally, combining Equations (17), (19), and (20), we receive
$$\mathbb{E}_l(\langle x, \tilde{x} \rangle) \geq \frac{1}{16} \alpha^2 N_{l+1} \sigma_{l,2}^2 \sigma_{l,1}^2 N_{m_l} \langle x^l, \tilde{x}^l \rangle + \frac{c}{4} \alpha^2 N_{l+1} \sigma_{l,2}^2 \sigma_{l,1}^2 N_{m_l} \|x^l\| \|\tilde{x}^l\| + \frac{1}{4} \beta^2 N_{l+1} \sigma_{l,\text{skip}}^2 \langle x^l, \tilde{x}^l \rangle$$
$$+ c\, N_{l+1} \, \mathbb{E}_{W_1} \sqrt{ \big( \alpha^2 \sigma_{l,2}^2 \|\phi(W_1 x^l)\|^2 + \beta^2 \sigma_{l,\text{skip}}^2 \|x^l\|^2 \big) \big( \alpha^2 \sigma_{l,2}^2 \|\phi(W_1 \tilde{x}^l)\|^2 + \beta^2 \sigma_{l,\text{skip}}^2 \|\tilde{x}^l\|^2 \big) } \quad (25)$$
$$\approx \frac{N_{l+1}}{2} \Big( \frac{\alpha^2}{4} \sigma_{l,1}^2 \frac{N_{m_l}}{2} \sigma_{l,2}^2 + \frac{\beta^2}{2} \sigma_{l,\text{skip}}^2 \Big) \langle x^l, \tilde{x}^l \rangle + c\, \frac{N_{l+1}}{2} \Big( \alpha^2 \sigma_{l,1}^2 \frac{N_{m_l}}{2} \sigma_{l,2}^2 + 2\alpha^2 \sigma_{l,2}^2 \sigma_{l,1}^2 \frac{N_{m_l}}{2} + 2\beta^2 \sigma_{l,\text{skip}}^2 \Big) \|x^l\| \|\tilde{x}^l\|. \quad (26)$$
To understand the problem that these derivations imply, we next choose the weight parameters so that the squared signal norm is preserved from one layer to the next and, for simplicity, study the case $\|x^l\| = \|\tilde{x}^l\| = 1$. Then we have
$$\mathbb{E}_l(\langle x, \tilde{x} \rangle) \geq \frac{1+\beta^2}{4} \langle x^l, \tilde{x}^l \rangle + c(\alpha^2 + 2) \approx \frac{1+\beta^2}{4} \langle x^l, \tilde{x}^l \rangle + \frac{\alpha^2}{4} + \frac{1}{2}. \quad (27)$$
Thus, the similarity of signals corresponding to different inputs always increases by at least a constant amount on average. Repeating the above bound layerwise, at Layer L we receive, for $\gamma_1 = \frac{1+\beta^2}{4} \leq \frac{1}{2}$ and $\gamma_2 = c(\alpha^2 + 2)$:
$$\mathbb{E}\langle x^L, \tilde{x}^L \rangle \geq \gamma_1^L \langle x, \tilde{x} \rangle + \gamma_2 \sum_{k=0}^{L-1} \gamma_1^k = \gamma_1^L \langle x, \tilde{x} \rangle + \frac{\gamma_2}{1-\gamma_1}\big(1 - \gamma_1^L\big). \quad (28)$$
According to our bound, output signals of very deep networks become more similar with increasing depth until they are almost indistinguishable. This phenomenon poses a great challenge for the trainability of deep residual neural networks with standard initialization schemes. Note that the case without skip connections is also covered by the choice $\alpha = 1$ and $\beta = 0$. Interestingly, nonzero skip connections ($\beta > 0$) counteract the increasing signal similarity by giving more weight to the original signal similarity (increased $\gamma_1$) while decreasing the constant contribution $\gamma_2$. This enables training of deeper models but cannot solve the general problem that increasingly deep models initially become worse at distinguishing different inputs. Even the best-case scenario of $\alpha = 0$ and $\beta = 1$ eventually leads to forgetting of the original input association, since $\gamma_1 = 0.5 < 1$.
With $\gamma_1 \leq 0.5$ and $\gamma_2 \approx 0.5$, the overall signal similarity $\mathbb{E}\langle x^L, \tilde{x}^L \rangle$ converges to $\gamma_2/(1-\gamma_1) \approx 1$ for $L \to \infty$, irrespective of the input similarity. Thus, every input signal is essentially mapped to the same vector in very deep networks, which explains the following insight.

Insight A.5 (Insight 2.6 in main paper). Let a fully-connected ResNet be given whose parameters are drawn according to Definition A.1. It follows from Theorem 2.5 that the outputs corresponding to different inputs become more difficult to distinguish with increasing depth L. In the mean field limit $N_{m_l} \to \infty$, the covariance of the signals is lower bounded by
$$\mathbb{E}\langle x^L, \tilde{x}^L \rangle \geq \gamma_1^L \langle x, \tilde{x} \rangle + \gamma_2 \sum_{k=0}^{L-1} \gamma_1^k = \gamma_1^L \langle x, \tilde{x} \rangle + \frac{\gamma_2}{1-\gamma_1}\big(1 - \gamma_1^L\big) \quad (29)$$
for $\gamma_1 = \frac{1+\beta^2}{4} \leq \frac{1}{2}$ and $\gamma_2 = c(\alpha^2 + 2)$, assuming $\mathbb{E}_{l-1}\|x^l\|\|\tilde{x}^l\| \approx 1$. However, our orthogonal initialization scheme RISOTTO does not suffer from increasing similarity of outputs corresponding to different inputs.
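The collapse described by Insight A.5 can be illustrated numerically. The sketch below uses hypothetical toy widths and the plug-in value c = 0.25 (the paper only pins c down to [0.24, 0.25]); it first iterates the layerwise lower bound and recovers the fixed point $\gamma_2/(1-\gamma_1) = 1$, then passes two random inputs through normally initialized residual blocks and observes their cosine similarity grow with depth:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# (1) Iterate the bound c_{l+1} = g1 * c_l + g2 for the balanced case
# alpha^2 = beta^2 = 1/2 and the plug-in value c = 0.25.
alpha2 = beta2 = 0.5
g1 = (1 + beta2) / 4                # 0.375
g2 = 0.25 * (alpha2 + 2)            # 0.625
sim = -1.0                          # worst case: anti-correlated inputs
for _ in range(50):
    sim = g1 * sim + g2
assert abs(sim - g2 / (1 - g1)) < 1e-9      # converged to the fixed point
assert abs(g2 / (1 - g1) - 1.0) < 1e-12     # which equals 1

# (2) Empirically, normally initialized residual blocks drive the cosine
# similarity of two inputs toward 1 with depth (toy widths, not the paper's).
def res_block(x, n_m, a, b):
    n = x.shape[0]
    W1 = rng.normal(0, np.sqrt(2 / n_m), size=(n_m, n))
    W2 = rng.normal(0, np.sqrt(2 / n), size=(n, n_m))
    Ws = rng.normal(0, np.sqrt(2 / n), size=(n, n))
    return relu(a * W2 @ relu(W1 @ x) + b * Ws @ x)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

n = 256
x, x_t = rng.standard_normal(n), rng.standard_normal(n)
c0 = cosine(x, x_t)                 # near 0 for random inputs
for _ in range(30):
    # use identical random weights for both inputs in each block
    state = rng.bit_generator.state
    x = res_block(x, 2 * n, np.sqrt(0.5), np.sqrt(0.5))
    rng.bit_generator.state = state
    x_t = res_block(x_t, 2 * n, np.sqrt(0.5), np.sqrt(0.5))
c30 = cosine(x, x_t)
assert abs(c0) < 0.3 and c30 > 0.85  # similarity collapses toward 1
```

The second part is a sanity check of the qualitative statement only; the exact speed of collapse depends on width and on the α, β split.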

A.2 DYNAMICAL ISOMETRY INDUCED BY RISOTTO

Theorem A.6 (Theorem 2.11 in the main paper). A residual block (of type B or type C) whose weights are initialized with RISOTTO achieves exact dynamical isometry, so that the singular values $\lambda \in \sigma(J)$ of the input-output Jacobian $J \in \mathbb{R}^{N_{l+1} \times k_{l+1} \times N_l \times k_l}$ fulfill $\lambda \in \{-1, 1\}$.

Proof. Consider a single element of the output activation of a type C residual block at layer l. At initialization, the output activation components are determined by the looks-linear structure of the RISOTTO weights, as detailed in the computation of the Jacobian below.

Considering the importance of BN for performance and how other methods struggle to compete with it, we empirically explore whether reducing the number of BN layers in a network still renders similar benefits at reduced computational and memory costs. We observe that in such a case, the position of the single BN layer in the network plays a crucial role. Figure 5 shows that RISOTTO enables training in all cases, while other initializations fail if the single BN layer is not placed after the last convolutional layer, where it normalizes all features. First, we need to determine the best position for the single BN layer in the network. The experiments reported in Figure 5 show that placing the BN layer after the last convolutional layer, before the final linear layer, achieves the best results. In Figure 7, we show results for Tiny ImageNet with a single BN layer after the last residual block, where RISOTTO again outperforms the standard initializations. BN after the last residual block controls the norms of the logits and potentially stabilizes the gradients, leading to better performance. Conversely, BN right after the first layer does not enable larger learning rates or better generalization. Figure 7 also shows that, even after optimal placement of the single BN layer, RISOTTO leads to the overall best results on Tiny ImageNet at reduced computational costs.

For ImageNet, we use our initialization with a small amount of noise $\epsilon \Sigma$ added to the RISOTTO parameters, where $\epsilon = 10^{-4}$ and $\Sigma$ follows the He Normal distribution (He et al., 2015) of the corresponding Kaiming Normal initialization. Adding this random noise enables symmetry breaking (Blumenfeld et al., 2020) and supports training according to our empirical observations. Results are reported in Figure 10. Remark: since our results on ImageNet stem from a single run, they are inherently susceptible to variance between runs. Nonetheless, we have comprehensively shown the effectiveness of RISOTTO on multiple datasets.

A.9 RESULTS ON PREACTIVATION RESNETS

In addition to the ResNet models we have considered so far, there is another family of residual networks that may be used for certain problems. Preactivation ResNets (He et al., 2016) apply the ReLU nonlinearity before the weights within a residual block. Our initialization scheme RISOTTO can easily be transferred to Preactivation ResNets in order to achieve dynamical isometry. We report experiments on CIFAR10 for Preactivation ResNets in Figure 11.



Figure 2: (a) The two types of considered residual blocks. In Type C the skip connection is a projection with a 1 × 1 kernel, while in Type B the input is directly added to the residual branch via the skip connection. Both blocks have been described by He et al. (2016). (b) The correlation between two inputs for different initializations as they pass through a residual network consisting of a convolutional filter followed by 5 residual blocks (Type C), an average pool, and a linear layer on CIFAR10. Only RISOTTO maintains constant correlations after each residual block, while the correlations increase with depth for the other initializations. (c) Balancing skip and residual connections: performance of RISOTTO for different values of α for ResNet 18 (C) on CIFAR10. α = 0 is equivalent to SkipInit and achieves the lowest accuracy. The better performing α = 1 is implemented in RISOTTO.

Figure 3: Visual explanation of how a Type C residual block is initialized with RISOTTO.

Figure 4: (left) A single BN layer is placed after the last residual block of a ResNet50 (C) on Tiny ImageNet. (center) A 300-layer ResNet (B) with RISOTTO versus Fixup. (right) RISOTTO versus NF ResNets (C) on Tiny ImageNet.

Figure 5: Comparing different positions for placing a single BN layer in a ResNet 18 (C) for CIFAR10. In each of the cases, RISOTTO allows stable training and converges to competitive accuracies, while standard methods fail in some cases.

Figure 9: Comparing NF ResNets to RISOTTO on all three datasets with a ResNet (C) (18 for CIFAR10 and 50 for the others). While RISOTTO performs similarly to He Normal on CIFAR100, it outperforms both He Uniform and He Normal on Tiny ImageNet and CIFAR10.

Figure 10: (a) We train a ResNet50 (B) with BN and compare RISOTTO (ours) with the standard Kaiming Normal initialization scheme. We train both models with LR = 0.01 and cosine annealing. We observe that with our initialization scheme we are able to train faster on ImageNet and even achieve better accuracy than Kaiming Normal. (b) Similarly, we compare RISOTTO (ours) with Fixup on a ResNet50 (B) without BN. Both are trained with LR = 0.01. RISOTTO tracks Fixup closely.

A.10 ADDITIONAL EXPERIMENTS WITH RESNET (B) FOR RISOTTO

We provide additional experiments showing the effectiveness of RISOTTO for ResNet (B) architectures in Figure 12.

Figure 11: We train a Preactivation ResNet18 (C) with BN, initialized with RISOTTO (ours), and compare with the standard Kaiming Normal and Kaiming Uniform initialization schemes. We train all models with LR = 0.1 and cosine annealing. RISOTTO enables dynamical isometry and achieves competitive generalization. Similarly, we also report results without BN comparing RISOTTO with Fixup, where RISOTTO is able to outperform Fixup.

Figure 13: Risotto enables training a 2000 layer network on CIFAR100.

RISOTTO as a substitute for Batch Normalization The mean test accuracy over 3 runs and 0.95 standard confidence intervals are reported to compare RISOTTO, Fixup, and SkipInit without using BN. RISOTTO achieves the overall best results for each of the benchmark datasets.

RISOTTO in combination with BN Despite its drawbacks, BN remains a popular method and is implemented per default, as it often leads to the best overall generalization performance for ResNets. Normalization-free initialization schemes have been unable to compete with BN, even though Fixup has come close. Whether the performance of batch normalized networks can still be improved is therefore still a relevant question. Table 2 compares RISOTTO with the two variants of He initialization for normally and uniformly distributed weights (He et al., 2015).


Implementation details for ResNet50 on Tiny ImageNet. "Step" denotes that the learning rate was reduced by a factor of 0.1 every 30 epochs.

A.5 ADDITIONAL EXPERIMENTS FOR BN LAYER PLACEMENT ON CIFAR100


The output activation component $x^{l+1}_{ik}$, for $i \in [0, N_{l+1}/2]$, is given by terms of the form $\phi(u^l_{nj,1} x^l_{jk}) - \phi(-u^l_{nj,1} x^l_{jk})$, which equal $u^l_{nj,1} x^l_{jk}$ by the ReLU identity $z = \phi(z) - \phi(-z)$. Taking the derivative of the output $x^{l+1}_{ik}$ with respect to an input element $x^l_{nk}$, we obtain the input-output Jacobian for $i \in [0, N_{l+1}/2]$ and $j \in [0, N_l/2]$. Since $M^l$ is orthogonal, the singular values of $J^l$ are one across the dimensions i and j. Due to the looks-linear form of the input to the weights, the complete Jacobian also takes the looks-linear form. The same argument applies to type B residual blocks. Hence, since $M^l$ is an orthogonal matrix, the singular values of the input-output Jacobian of residual blocks initialized with RISOTTO are exactly $\{1, -1\}$ for all depths and widths, and not just in the infinite width limit.
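As a numerical sanity check of the orthogonality argument, consider a fully-connected looks-linear analogue (a hypothetical simplification; the convolutional case reduces to a matrix operation in the channel dimension via the Delta Orthogonal structure). The ReLU masks of the positive and negative pathways sum to the identity, so the Jacobian equals the orthogonal matrix $M^l$ almost surely:

```python
import numpy as np

rng = np.random.default_rng(3)
M, _ = np.linalg.qr(rng.standard_normal((8, 8)))   # random orthogonal M^l

z = rng.standard_normal(8)                          # generic preactivation (no zeros)
D_pos = np.diag((z > 0).astype(float))              # ReLU mask of phi(z)
D_neg = np.diag((z < 0).astype(float))              # ReLU mask of phi(-z)

# Jacobian of z -> M phi(z) - M phi(-z): the two masks sum to the identity,
# so the piecewise-linear map has derivative M wherever no z_i is exactly 0.
J = M @ D_pos + M @ D_neg
assert np.allclose(J, M)

singular_values = np.linalg.svd(J, compute_uv=False)
assert np.allclose(singular_values, 1.0)            # exact dynamical isometry
```

The event $z_i = 0$ has probability zero for continuous inputs, which is why the isometry holds at finite width and not only in a mean field limit.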

A.3 SIGNAL PROPAGATION WITH RISOTTO

Now we analyze closely how RISOTTO transforms the input signal of residual blocks when initialized using Definitions 2.9 and 2.8. RISOTTO effectively creates an orthogonal mapping that induces DI. Note that our analysis of convolutional tensors simplifies to evaluating matrix operations, since the Delta Orthogonal initialization reduces a convolution to an effective matrix multiplication in the channel dimension of the input. Specifically, we can track the changes in the submatrices used in the initialization of a residual block with RISOTTO and observe the output activation. We start with a Type B residual block evaluated at a looks-linear input $x^l = (\hat{x}^l_+; \hat{x}^l_-)$ at Layer l and set α = 1. Adding the skip branch to the residual branch and inspecting the submatrices of the output of the residual block, we conclude that for type C residual blocks the output is of looks-linear form and has the same norm as the input, because of the orthogonal submatrices and the looks-linear structure. We now use the above formulations to prove that RISOTTO preserves the squared signal norm and similarities between inputs for residual blocks, the key property that allows stable training.

Theorem A.7 (RISOTTO preserves signal norm and similarity). A residual block that is initialized with RISOTTO maps input activations $x^l$ to output activations $x^{l+1}$ so that the norm $\|x^{l+1}\|^2 = \|x^l\|^2$ stays equal. The scalar product between activations corresponding to two inputs x and $\tilde{x}$ is preserved in the sense that $\langle x^{l+1}, \tilde{x}^{l+1} \rangle = \langle x^l, \tilde{x}^l \rangle$.

Proof. We first prove that the squared signal norms are preserved for both types of residual blocks, followed by the similarity between inputs. Consider a type C residual block. The preactivation of the previous layer is of looks-linear form $z^{l-1} = (\hat{z}^{l-1}; -\hat{z}^{l-1})$.
The preactivation of the current layer, as the signal passes through the residual block, then has the same squared norm as the previous one. Here we have used the fact that $M^l$ is an orthogonal matrix, so that $\|M^l x\| = \|x\|$, and the identity for ReLU activations by which $z = \phi(z) - \phi(-z)$. Since $x^{l+1} = \phi(z^l)$, taking the squared norm of the output and squaring the ReLU identity, where the cross term vanishes because $\langle \phi(\hat{z}^l), \phi(-\hat{z}^l) \rangle = 0$, we obtain $\|x^{l+1}\|^2 = \|x^l\|^2$. In fact, the norm preservation is a special case of the preservation of the scalar product, which we prove next. Consider two independent inputs x and $\tilde{x}$ with corresponding input activations $x^l$ and $\tilde{x}^l$ at Layer l. As a result of RISOTTO, the correlation between the preactivations is preserved, which in turn means that the similarity between activations of looks-linear form is preserved, i.e., $\langle x^{l+1}, \tilde{x}^{l+1} \rangle = \langle x^l, \tilde{x}^l \rangle$. These derivations follow from the looks-linear structure of the weights and the input as well as the orthogonality of the matrix $M^l$. The same proof strategy for both norm preservation and similarity can be followed for type B residual blocks using the signal propagation of A.3. This concludes the proof.
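The norm-preservation mechanism can be checked in a minimal fully-connected looks-linear sketch (hypothetical dimensions, not the full construction of Definitions 2.8 and 2.9): with an orthogonal M, the weight $W = [[M, -M], [-M, M]]$ acts linearly on the underlying signal, so activation norms and the scalar products of the underlying preactivations pass through unchanged:

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

# With orthogonal M, W = [[M, -M], [-M, M]] maps (phi(zhat); phi(-zhat)) to
# (M zhat; -M zhat), since M phi(z) - M phi(-z) = M z by z = phi(z) - phi(-z).
n = 6
M, _ = np.linalg.qr(rng.standard_normal((n, n)))
W = np.block([[M, -M], [-M, M]])

def block(zhat):
    x = np.concatenate([relu(zhat), relu(-zhat)])   # looks-linear activation
    return W @ x                                    # next preactivation

zhat, zhat_t = rng.standard_normal(n), rng.standard_normal(n)
out, out_t = block(zhat), block(zhat_t)

# The block acts linearly on the underlying signal: out = (M zhat; -M zhat).
assert np.allclose(out, np.concatenate([M @ zhat, -M @ zhat]))

# Squared activation norms are preserved, using ||phi(w)||^2 + ||phi(-w)||^2 = ||w||^2.
x_in = np.concatenate([relu(zhat), relu(-zhat)])
x_out = relu(out)                                   # = (phi(M zhat); phi(-M zhat))
assert np.isclose(np.sum(x_out**2), np.sum(x_in**2))

# Scalar products of the underlying preactivations are preserved by orthogonality.
assert np.isclose(out[:n] @ out_t[:n], zhat @ zhat_t)
```

The check deliberately asserts only norm preservation and preactivation scalar products, which follow directly from orthogonality and the ReLU identity.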

A.4 EXPERIMENTAL SETUP AND DETAILS

In all our experiments we use Stochastic Gradient Descent (SGD) with momentum 0.9 and weight decay 0.0005. We use 4 NVIDIA A100 GPUs to train all our models. All experiments are repeated for 3 runs and we report the mean and 0.95 confidence intervals. In experiments with ResNet101, we used a learning rate of 0.005 for all initialization schemes including ours. We performed the following hyperparameter tuning for each combination of initialization scheme, model, and dataset that we compare and report. For every experiment we use the best performing learning rate among 0.1, 0.05, 0.01, and 0.001, with either cosine annealing or step scheduling (whichever performs better), while keeping the batch size fixed to 256. To do so, we randomly split the training set into a training (90%) and a validation (10%) set and used the LR that performed best on the validation set to evaluate the trained model on the test set. We report the accuracies obtained on the test set.

Placing a single BN layer In order to identify the best position for placing a single BN layer in a ResNet, we experiment with 3 different positions. (i) First layer: the BN layer is placed right after the first convolutional layer, before the residual blocks. (ii) BN in the middle: the BN layer is placed after half of the residual blocks in the network. (iii) BN after the last residual block: the BN layer is placed before the pooling operation, right after the last residual block.

Correlation comparison in Figure 2b In order to compare the correlation between inputs for different initialization schemes, we use a vanilla residual network with five residual blocks, each consisting of the same number of channels (32) and a kernel size of (3, 3), followed by an average pooling and a linear layer.
The figure shows the correlation between two random samples of CIFAR10 averaged over 50 runs.

Tiny ImageNet Note that we use the validation set provided by the creators of Tiny ImageNet (Le & Yang, 2015) as a test set to measure the generalization performance of our trained models. We observe that RISOTTO enables faster training and better generalization. While NF ResNets usually require careful hyperparameter tuning for gradient clipping, we observe that they train well with vanilla SGD on smaller datasets.

A.8 RESULTS ON IMAGENET

We also report experiments on the large-scale ImageNet dataset. Due to limited computational resources, we are only able to report a single run on ResNet50 (B), but our results are in line with our findings on Tiny ImageNet.

A.12 RISOTTO WITH NOISE: N-RISOTTO

We also report results of RISOTTO initialized with additional noise. The model is initialized with RISOTTO, to whose parameters we add a small amount of noise $\epsilon \Sigma$, where $\epsilon = 10^{-4}$ and $\Sigma$ follows the He Normal distribution (He et al., 2015). For the CIFAR10/100 experiments with N-RISOTTO we use a learning rate of 0.05 with cosine annealing, except for CIFAR100 with ResNet50 (B), where we are able to use a higher learning rate of 0.1. For Tiny ImageNet we use a learning rate of 0.01 with cosine annealing.



