IDENTICAL INITIALIZATION: A UNIVERSAL APPROACH TO FAST AND STABLE TRAINING OF NEURAL NETWORKS

Abstract

A well-conditioned initialization is beneficial for training deep neural networks. However, existing initialization approaches do not simultaneously achieve stability and universality. Specifically, although the widely used Xavier and Kaiming initializations fit a broad variety of networks, they fail to train residual networks without Batch Normalization because they impose an inappropriate scale on the data flow. On the other hand, several works (e.g., Fixup and ReZero) design stable initializations based on dynamical isometry, an efficient learning mechanism. Nonetheless, these methods are tailored to either non-residual structures or residual blocks only, and some even require extra auxiliary components, limiting their applicable range. Intriguingly, we find that the identity matrix offers a feasible and universal solution to the aforementioned problems, as it satisfies dynamical isometry while remaining applicable to a wide range of models. Motivated by this, we develop Identical Initialization (IDInit), a stable, universal, and fast-converging initialization approach built on the identity matrix. Empirical results on a variety of benchmarks show that IDInit is universal across network types and practically useful, with good performance and fast convergence.

1. INTRODUCTION

Deep Neural Networks (DNNs) have attracted significant attention due to their versatility in various applications. A suitable initialization is important for well-conditioned training (Sutskever et al., 2013; Arpit et al., 2019; Huang et al., 2020; Pan et al., 2022). Common initialization methods include Xavier (Glorot & Bengio, 2010) and Kaiming initialization (He et al., 2015). Later, dynamical isometry (Saxe et al., 2014) became widely used for building a stable starting state for very deep networks (Mishkin & Matas, 2016; Burkholz & Dubatovka, 2019), and can even stably train 10,000-layer networks (Xiao et al., 2018). Despite these successes, the aforementioned methods are unsuitable for residual blocks without Batch Normalization, as they impose an improper scale on the data flow (see Sec. 4.2 and Sec. C.2). To address this issue, Bachlechner et al. (2021) and Hardt & Ma (2017) proposed to stabilize the training of residual blocks, following the dynamical-isometry mechanism, by realizing an identity transition through multiplying the residual stem by 0. Nevertheless, such identity-maintaining methods apply only to residual networks, and some even require auxiliary components to stabilize model training (Blumenfeld et al., 2020; Bachlechner et al., 2021; Zhao et al., 2021), losing generality to other network structures. To overcome the above problems, in this paper we propose a stable, general, and fast-converging initialization approach based on the identity matrix.

Motivation on Identity Matrix. As mentioned above, implementing an identity transition naturally corresponds to the isometric mechanism, which benefits fast convergence and performance (Bachlechner et al., 2021). To maintain this transition for both residual and non-residual modules, the identity matrix is a potential solution.
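The "multiply the residual stem by 0" mechanism mentioned above can be illustrated with a minimal sketch. This is an assumption-laden toy in numpy, not any paper's exact implementation: `rezero_block` and the tanh stem are hypothetical names chosen here; the key point is that a zero-initialized scalar gate makes the block an exact identity map at the start of training.

```python
import numpy as np

rng = np.random.default_rng(0)

def rezero_block(x, W, alpha):
    # Residual block: output = input + alpha * stem(input).
    # With alpha initialized to 0 (the ReZero-style idea), the block
    # computes the identity map at initialization, so signals and
    # gradients pass through unchanged regardless of how W is drawn.
    return x + alpha * np.tanh(x @ W)

x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 8))  # arbitrary stem weights
y = rezero_block(x, W, alpha=0.0)
print(np.allclose(y, x))  # True: the block starts as an identity
```

During training, `alpha` is learned, letting each block gradually "switch on" its stem while starting from a dynamically isometric state.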
In detail, consider the i-th block in a DNN,

x^{(i+1)} = (r + ∏_{j=1}^{m} θ^{(i,j)}) x^{(i)},   (1)

where m is the number of weights in a block, θ^{(i,j)} denotes the j-th weight in the i-th block, and r ∈ {0, I} indicates whether Eq. (1) denotes a residual layer. When r = 0 and m = 1, Eq. (1) reduces to a single non-residual layer.
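Eq. (1) above can be checked numerically. The sketch below is a simplified assumption: it treats each weight θ^{(i,j)} as a square matrix, initializes every weight to the identity matrix I, and verifies that both the non-residual case (r = 0) and the residual case (r = I) transmit the input with a fixed, well-behaved scale (the helper names are hypothetical, not from the paper).

```python
import numpy as np

def block_output(x, thetas, r):
    # Implements Eq. (1): x_next = (r + theta_m @ ... @ theta_1) @ x,
    # where r is the zero matrix (non-residual) or the identity (residual).
    prod = np.eye(x.shape[0])
    for theta in thetas:
        prod = theta @ prod
    return (r + prod) @ x

d, m = 8, 3
x = np.random.default_rng(1).standard_normal(d)
thetas = [np.eye(d) for _ in range(m)]  # identity-matrix initialization

# Non-residual case (r = 0): the block is exactly the identity map.
y_plain = block_output(x, thetas, np.zeros((d, d)))
# Residual case (r = I): the block doubles the signal, a fixed scale.
y_res = block_output(x, thetas, np.eye(d))

print(np.allclose(y_plain, x))      # True
print(np.allclose(y_res, 2.0 * x))  # True
```

In both cases the singular values of the block's input-output Jacobian are constant (all 1 or all 2), which is the dynamical-isometry property that motivates using the identity matrix as a universal starting point.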

