IDENTICAL INITIALIZATION: A UNIVERSAL APPROACH TO FAST AND STABLE TRAINING OF NEURAL NETWORKS

Abstract

A well-conditioned initialization is beneficial for training deep neural networks. However, existing initialization approaches do not simultaneously achieve stability and universality. Specifically, although the widely used Xavier and Kaiming initializations fit a variety of networks, they fail to train residual networks without Batch Normalization because they compute an inappropriate scale for the data flow. On the other hand, some works design stable initializations (e.g., Fixup and ReZero) based on dynamical isometry, an efficient learning mechanism. Nonetheless, these methods are designed specifically for either non-residual structures or residual blocks only, and some even require extra auxiliary components, limiting their applicable range. Intriguingly, we find that the identity matrix is a feasible and universal solution to the aforementioned problems, as it adheres to dynamical isometry while remaining applicable to a wide range of models. Motivated by this, we develop Identical Initialization (IDInit), a stable, universal, and fast-converging approach built on the identity matrix. Empirical results on a variety of benchmarks show that IDInit generalizes across network types and is practically useful, with good performance and fast convergence.

1. INTRODUCTION

Deep Neural Networks (DNNs) have attracted significant attention due to their versatility in various applications. A suitable initialization is important for well-conditioned training (Sutskever et al., 2013; Arpit et al., 2019; Huang et al., 2020; Pan et al., 2022). Common initialization methods include Xavier (Glorot & Bengio, 2010) and Kaiming initialization (He et al., 2015). Later, dynamical isometry (Saxe et al., 2014) was widely adopted to build a stable starting state for very deep networks (Mishkin & Matas, 2016; Burkholz & Dubatovka, 2019), and can even stably train 10,000-layer networks (Xiao et al., 2018). Despite these successes, the aforementioned methods are unsuitable for residual blocks without Batch Normalization, since they compute an improper scale for the data flow (see Sec. 4.2 and Sec. C.2). To address this issue, Bachlechner et al. (2021) and Hardt & Ma (2017) proposed stabilizing the training of residual blocks according to the dynamical isometry mechanism by realizing an identity transition, i.e., multiplying the residual stem by 0. Nevertheless, these identity-maintaining methods apply only to residual networks, and some even require auxiliary components to stabilize model training (Blumenfeld et al., 2020; Bachlechner et al., 2021; Zhao et al., 2021), losing generality to other network structures. To overcome these problems, in this paper we propose a stable, general, and fast-converging initialization approach based on the identity matrix.

Motivation on Identity Matrix. As mentioned above, an identity transition naturally corresponds to the isometric mechanism, which benefits fast convergence and improves performance (Bachlechner et al., 2021). To maintain this transition for both residual and non-residual modules, the identity matrix is a potential solution.
In detail, consider the i-th block in a DNN,

x^{(i+1)} = \left( r + \prod_{j=1}^{m} \theta^{(i,j)} \right) x^{(i)},  (1)

where m is the number of weights in the block, θ^{(i,j)} denotes the j-th weight in the i-th block, and r ∈ {0, I} indicates whether Eq. (1) denotes a residual layer. When r = 0 and m = 1, Eq. (1) denotes a non-residual layer. Under this condition, to achieve the identity transition, namely x^{(i+1)} = x^{(i)}, θ^{(i,1)} = I is the unique solution. When r = I and m ≥ 2, Eq. (1) denotes a residual layer. In this case, setting all {θ^{(i,j)}}_{j=1}^{m} to I is not feasible; therefore, it is usual to set the last weight θ^{(i,m)} to 0 to maintain identity (Zhang et al., 2019; Zhao et al., 2021). Overall, without loss of generality, it is necessary and feasible to apply an identity matrix as an initial state for maintaining identity.

Identical Initialization (IDInit). In this paper, we build on the identity matrix to design a novel and practical initialization, named Identical Initialization (IDInit). A simple case of IDInit is shown in Figure 1. In the non-residual setting, a square identity matrix easily achieves the identity transition; however, in the most general setting, non-square matrices cannot (Vaswani et al., 2017). To adapt to the non-square case, we modify the identity matrix to maintain signal variance, as described in Sec. 3.2. In the residual setting, directly setting the last weight in a residual stem to 0 causes a dead-neuron problem (Zhang et al., 2019; Zhao et al., 2021). To address this, we simply set some elements to an extremely small value ε to increase the number of trainable neurons, as in Figure 1. We also observe that an identical convolution layer that maintains the identity transition degrades performance dramatically.
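The identity-transition argument above can be checked numerically. The following is a minimal sketch (illustrative, not the authors' code) of the block transition in Eq. (1), verifying that θ = I yields the identity map in the non-residual case and that zeroing the last weight of a residual stem does the same:

```python
import numpy as np

def block_forward(x, thetas, residual=True):
    """Eq. (1): x_{i+1} = (r + theta_m @ ... @ theta_1) x_i, r = I or 0."""
    d = x.shape[0]
    prod = np.eye(d)
    for theta in thetas:          # accumulate the product of the stem weights
        prod = theta @ prod
    r = np.eye(d) if residual else np.zeros((d, d))
    return (r + prod) @ x

d = 4
x = np.random.randn(d)

# Non-residual, m = 1: theta = I is the unique identity-preserving choice.
assert np.allclose(block_forward(x, [np.eye(d)], residual=False), x)

# Residual, m = 2: setting the last weight to 0 makes the stem vanish,
# so the block reduces to the identity map x_{i+1} = x_i.
assert np.allclose(block_forward(x, [np.eye(d), np.zeros((d, d))]), x)
```

Note that with r = I, setting every θ^{(i,j)} = I would instead give x^{(i+1)} = 2 x^{(i)}, which is why the last weight must be (near-)zero in the residual case.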
To tackle this issue, we propose a modest change to the identical convolution layer that fuses spatial information, leading to significant improvement. Moreover, we find that model performance is uncertain under random initializations whose values are sampled from a probability distribution (Glorot & Bengio, 2010; He et al., 2015). IDInit handles this situation through its determinacy: it is unique and uncorrelated with samplers. To the best of our knowledge, this is the first attempt to put an identity-like initialization into practice. A previous work, Bartlett et al. (2019), used the identity matrix as a weight but focused excessively on approximate analysis, resulting in conclusions made imprecise by the gap between theory and reality. Additionally, that work addresses only relatively simple scenarios without an activation function and ignores momentum in stochastic gradient descent (SGD), which can help escape critical points to a degree. Furthermore, the method is incompatible with ResNets without Batch Normalization (Ioffe & Szegedy, 2015), since signals explode if all weights are set identically. Motivated by these observations and addressing these shortcomings, we provide a simple, practical, and efficient initialization method.
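To make the two ingredients above concrete, here is a hedged sketch of identity-like weight construction. The row-cyclic padding for non-square matrices and the helper names (`idinit_nonsquare`, `idinit_last_residual`) are our illustrative assumptions; the paper's variance-preserving variant is described in its Sec. 3.2:

```python
import numpy as np

def idinit_nonsquare(fan_out, fan_in):
    """Identity-like init for a possibly non-square weight matrix.

    Square case: exactly the identity. Non-square case: cycle the 1s so
    every output row still passes through one input coordinate
    (a simple stand-in for the paper's variance-preserving construction).
    """
    w = np.zeros((fan_out, fan_in))
    for i in range(fan_out):
        w[i, i % fan_in] = 1.0
    return w

def idinit_last_residual(fan_out, fan_in, eps=1e-6):
    """Last weight of a residual stem: near-zero, but with tiny eps entries
    instead of exact zeros, so its neurons stay trainable (avoiding the
    dead-neuron problem of an all-zero last weight)."""
    return eps * idinit_nonsquare(fan_out, fan_in)

# Square case reduces to the identity matrix.
assert np.allclose(idinit_nonsquare(4, 4), np.eye(4))
# Non-square case: each row still carries exactly one unit entry.
assert idinit_nonsquare(6, 4).shape == (6, 4)
assert np.allclose(idinit_nonsquare(6, 4).sum(axis=1), np.ones(6))
```

With this last-weight initialization the residual stem output is O(ε) rather than exactly zero, so the block is still (numerically) an identity map at initialization while every weight receives a nonzero gradient path.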

2. BACKGROUND AND RELATED WORK

Dynamical Isometry. Given an L-layer network with blocks formulated by Eq. (1), let x^{(0)} be the input and x^{(L)} the L-th layer's output. Assuming the signal magnitude (e.g., σ²(x^{(i)})) of each layer changes by a scale α, the last signal magnitude reaches α^L (e.g., σ²(x^{(L)}) = α^L σ²(x^{(0)})), which easily causes signal explosion or diffusion, especially for large L. Introduced from mean-field theory (Pennington et al., 2017; 2018), dynamical isometry is a comparably reasonable mechanism for measuring a model's trainability. This paradigm usually considers the input-output Jacobian J_{io} = ∂x^{(L)} / ∂x^{(0)}, whose mean squared singular value is denoted χ. Pennington et al. (2017) and Bachlechner et al. (2021) show that χ > 1 indicates the model is in a chaotic phase, where back-propagated gradients explode exponentially. By contrast, χ < 1 indicates an ordered phase, where back-propagated gradients vanish exponentially. χ = 1 is the critical line of initialization, avoiding both vanishing and exploding gradients.
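The χ criterion is easy to verify for a deep linear network, where the input-output Jacobian is just the product of the weight matrices. The sketch below (an illustration under linear-network assumptions, not the paper's experiments) compares identity initialization against an over-scaled Gaussian one:

```python
import numpy as np

d, L = 8, 50  # width and depth (illustrative choices)

# Identity-initialized deep linear network: J_io = W_L @ ... @ W_1 = I,
# so every squared singular value is 1 and chi sits on the critical line.
J = np.eye(d)
for _ in range(L):
    J = np.eye(d) @ J
chi = np.mean(np.linalg.svd(J, compute_uv=False) ** 2)
print(chi)  # 1.0

# Gaussian init with per-layer scale alpha > 1: chi grows like alpha^L,
# i.e., the chaotic phase with exponentially exploding gradients.
rng = np.random.default_rng(0)
Jg = np.eye(d)
for _ in range(L):
    Jg = rng.normal(0.0, 1.2 / np.sqrt(d), (d, d)) @ Jg
chi_gauss = np.mean(np.linalg.svd(Jg, compute_uv=False) ** 2)
print(chi_gauss > 1.0)  # True
```

This is the sense in which the identity matrix "adheres to dynamical isometry": it places the network exactly on the χ = 1 critical line at initialization.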



Figure 1: A simple case of IDInit on a residual and a non-residual network layer. ε is set to 1e-6; therefore, Ŷ ≈ 0. In both the left and right sub-figures, Y is equal to X.

