BLOCK-LEVEL STIFFNESS ANALYSIS OF RESIDUAL NETWORKS

Abstract

Residual Networks (ResNets) can be interpreted as dynamic systems, which are systems whose state changes over time and can be described with ordinary differential equations (ODEs) (Haber et al., 2018; Weinan, 2017). Specifically, the dynamic systems interpretation views individual residual blocks as ODEs. Numerical techniques for solving ODEs produce an approximation and therefore contain an error term. If an ODE is stiff, this error is likely to be amplified and to dominate the solution calculations, which negatively affects the accuracy of the approximated solution (Burden et al., 2015). Therefore, stiff ODEs are often numerically unstable. In this paper we leverage the dynamic systems interpretation to perform a novel theoretical analysis of ResNets using findings and tools from the numerical analysis of ODEs. Specifically, we perform a block-level stiffness analysis of ResNets. We find that residual blocks towards the end of ResNet models exhibit increased stiffness and that there is a statistically significant correlation between stiffness and model accuracy and loss. Based on these findings, we propose that ResNets behave as stiff, numerically unstable ODEs.

1. INTRODUCTION

There are three theoretical interpretations of Residual Networks (ResNets): (1) unraveled ResNets, (2) unrolled iterative estimation, and (3) dynamical systems. The unraveled interpretation views ResNets as a collection of 2^n paths along which the input data flows, where n is the number of residual blocks (Veit et al., 2016). The unrolled iterative estimation interpretation explains ResNets as iterative approximators, where the first estimate is provided by the first layer and is progressively refined by subsequent layers (Greff et al., 2017). Finally, the dynamical systems view interprets ResNets as discretized dynamical systems, where ResNets are seen as ordinary differential equations (ODEs) (Haber et al., 2018; Chen et al., 2018; Lu et al., 2018). Specifically, the dynamical systems interpretation regards the residual blocks of a ResNet as a series of forward Euler discretizations of an initial value ODE. This connection between residual blocks and ODEs can be leveraged for novel theoretical analyses that further our understanding and interpretation of ResNets. In this paper we perform a stiffness analysis of ResNets and their residual blocks by leveraging findings from the numerical analysis of ODEs. Stiffness is a property of an ODE that has important implications. If a differential equation is stiff, the solution to the equation will have an unpredictable error that will negatively affect the accuracy of the approximated solution (Burden et al., 2015). Therefore, stiff ODEs are often numerically unstable and their solutions have accuracy issues (Seinfeld et al., 1970; Shampine & Gear, 1979). There is no rigorous definition of stiffness; however, there are certain phenomena that indicate that a problem may be stiff. One way to assess the stiffness of an ODE is to analyze the eigenvalues of the Jacobian of the ODE.
Specifically, if the eigenvalues of the Jacobian differ greatly in magnitude (Butcher, 2008; Bui & Bui, 1979) or if a large portion of the eigenvalues have negative real parts (Burden et al., 2015), it is likely that the ODE is stiff. Unfortunately, there are no specific thresholds for what constitutes a high variation in the magnitude of the eigenvalues or a high proportion of eigenvalues with negative real parts. In this paper we investigate whether ResNets exhibit some of the characteristics that can indicate stiffness. Specifically, we focus on analyzing the eigenvalues of the Jacobian of individual residual blocks with respect to their inputs in ResNet18, ResNet34, and ResNet50. Using these eigenvalues we calculate (1) the stiffness index and (2) the proportion of eigenvalues with negative real parts for each residual block and target label, where the stiffness index captures the degree of variation in the magnitude of the eigenvalues (Kim et al., 2021). We find that stiffness varies significantly across residual blocks. Specifically, we find that residual blocks towards the end of the network indicate increased stiffness, i.e., they have a high stiffness index and also a high proportion of eigenvalues with negative real parts. For example, the last block in ResNet50 has a stiffness index of -35.32 and 31.96% of the eigenvalues of its Jacobian have negative real parts. We perform a correlation analysis between stiffness and model accuracy/loss and show that they are correlated and that their correlation is statistically significant. In particular, we calculate the Pearson correlation coefficient, which ranges from -1 to 1, where positive values indicate positive linear relationships and negative values indicate inverse relationships. We also compute the p-value of each correlation coefficient, where a p-value less than or equal to 0.05 is considered statistically significant.
For example, for ResNet18's last residual block, the stiffness index has a negative correlation with accuracy of -0.36 and the percentage of eigenvalues with negative real parts has a positive correlation with loss of 0.34. For the last residual block in ResNet34, the stiffness index has a negative correlation with accuracy of -0.20 and the percentage of eigenvalues with negative real parts has a negative correlation with accuracy of -0.38. Finally, for the last block in ResNet50, the correlation between the stiffness index and accuracy is -0.27 and the correlation between the percentage of eigenvalues with negative real parts and loss is 0.25. Based on these findings we propose that ResNets can be interpreted not only as ODEs, but specifically as stiff ODEs, which are numerically unstable. This interpretation could be another possible explanation of why deep neural networks (DNNs) are susceptible to adversarial examples. The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 provides a detailed explanation of the dynamic systems interpretation of ResNets, which is the basis of this paper. Section 4 describes our stiffness analysis, which investigates whether individual residual blocks behave as stiff ODEs. Finally, Section 5 reports the results of our analysis and Section 6 summarizes our conclusions.
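
As a rough illustration of the eigenvalue-based indicators described in this introduction, the sketch below computes the Jacobian of a toy residual function with respect to its input and derives two quantities from its eigenvalues. It is only a sketch under stated assumptions: the toy function f(x) = W2 tanh(W1 x), the helper names residual_jacobian and stiffness_indicators, and the random weights are all hypothetical, and the spread measure used here is the classical stiffness ratio max|Re(lambda)| / min|Re(lambda)|, which is swapped in for illustration and is not necessarily the stiffness index used in this paper.

```python
import numpy as np


def residual_jacobian(w1, w2, x):
    """Jacobian of the toy residual function f(x) = w2 @ tanh(w1 @ x) w.r.t. x.

    Assumption: a two-layer tanh map stands in for a real residual block.
    """
    pre_activation = w1 @ x
    # d tanh(u) / du = 1 - tanh(u)^2, applied elementwise
    return w2 @ np.diag(1.0 - np.tanh(pre_activation) ** 2) @ w1


def stiffness_indicators(jacobian):
    """Return (stiffness ratio, fraction of eigenvalues with negative real part).

    Assumption: the stiffness ratio max|Re| / min|Re| is used as a generic
    measure of eigenvalue spread; it is not claimed to be the paper's index.
    """
    real_parts = np.linalg.eigvals(jacobian).real
    abs_real = np.abs(real_parts)
    smallest = abs_real.min()
    # Guard against division by zero when an eigenvalue has zero real part
    ratio = float(abs_real.max() / smallest) if smallest > 0 else float("inf")
    negative_fraction = float(np.mean(real_parts < 0))
    return ratio, negative_fraction


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # hypothetical random weights and input
    dim = 8
    w1 = rng.normal(size=(dim, dim))
    w2 = rng.normal(size=(dim, dim))
    x = rng.normal(size=dim)
    ratio, negative_fraction = stiffness_indicators(residual_jacobian(w1, w2, x))
    print(f"stiffness ratio: {ratio:.2f}")
    print(f"eigenvalues with negative real part: {100 * negative_fraction:.2f}%")
```

For a real ResNet block the Jacobian would typically be obtained with automatic differentiation rather than in closed form; the eigenvalue step would be unchanged.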

2. RELATED WORK

There is a large body of numerical analysis literature that studies stiffness; however, in this paper we are the first to connect the concepts of stiffness and ResNets via the dynamical systems interpretation (Haber et al., 2018; Weinan, 2017) and to propose that ResNets can be viewed as stiff ODEs. Prior related works primarily focus on the challenges of solving stiff ODEs, which are prone to yielding unreliable results due to their stiffness. For an overview of the topic of stiff ODEs, please refer to Seinfeld et al. (1970), who provide a review of numerical integration techniques for stiff ODEs. Additionally, Shampine & Gear (1979) describe the meaning of stiffness, why stiff problems arise, and how they can be recognized, and also compare the appropriateness of different solution methods. Kaps & Rentrop (1979) propose generalized Runge-Kutta methods of order four with stepsize control as a solution method for stiff ODEs. Because stiff ODEs often contain varying time scales, Engquist & Tsai (2005) propose heterogeneous multiscale methods for stiff ODEs and show promising stability and convergence results. Enright (1978) develops a matrix updating technique that aims to reduce the computational cost of the matrix operations needed for solving stiff ODEs. Other works are domain specific, such as Young & Boris (1977), who propose a new method for solving stiff ODEs describing the chemical kinetics of reactive flow problems. One of the most recent works that focuses on solving stiff ODEs is Kim et al. (2021), who leverage neural ODEs. Neural ODEs were introduced by Chen et al. (2018) as a new form of continuous-depth deep learning model that parametrizes ODEs with a neural network and learns the underlying dynamic system. Kim et al. (2021) study learning neural ODEs on data generated from two classical stiff systems, ROBER (Robertson, 1967) and POLLU (Verwer, 1994), which describe the dynamics of species concentrations in stiff chemical reaction systems.
They propose a new derivative calcu-

