A PRIORI GUARANTEES OF FINITE-TIME CONVERGENCE FOR DEEP NEURAL NETWORKS

Anonymous

Abstract

In this paper, we perform a Lyapunov-based analysis of the loss function to derive an a priori upper bound on the settling time of deep neural networks. While previous studies have attempted to understand deep learning using a control theory framework, there is limited work on a priori finite-time convergence analysis. Drawing on advances in the analysis of finite-time control of non-linear systems, we provide a priori guarantees of finite-time convergence in a deterministic control-theoretic setting. We formulate the supervised learning framework as a control problem in which the weights of the network are control inputs and learning translates into a tracking problem. An analytical formula for an a priori finite upper bound on the settling time is provided under the assumption of bounded inputs. Finally, we prove that our loss function is robust against input perturbations.

1. INTRODUCTION

Over the past decade, deep neural networks have achieved human-like performance in various machine learning tasks, such as classification, natural language processing and speech recognition. Despite the popularity of deep learning, the underlying theoretical understanding remains relatively less explored. While attempts have been made to develop deep learning theory by drawing inspiration from related fields such as statistical learning and information theory, a comprehensive theoretical framework is still in an early developmental stage. It is difficult to perform mathematical analysis on deep neural networks due to the large number of parameters involved. Other open problems in deep learning concern the stability and the desired convergence rate of training. Since the performance of the network depends heavily on the training data and the choice of optimization algorithm, there is no guarantee that training will converge. Our work provides finite-time convergence guarantees for the training of a deep neural network by utilizing an established stabilization framework from control theory.

Existing works in deep learning theory have attempted to bridge the gap in understanding deep learning dynamics by focusing on simple models of neural networks [Saxe et al. (2013), Li & Yuan (2017), Arora et al. (2018), Jacot et al. (2018)]. This could be attributed to the fact that current state-of-the-art deep learning models are highly complex structures to analyze. Jacot et al. (2018) proved that a multi-layer fully-connected network with infinite width converges to a deterministic limit at initialization, with the rate of change of its weights going to zero. Saxe et al. (2013) analyzed deep linear networks and proved that these networks, surprisingly, have a rich non-linear structure; given the right initial conditions, deep linear networks are only a finite amount slower than shallow networks. Following this work, Arora et al.
(2018) proved the convergence of gradient descent to a global minimum for deep linear networks in which the weight matrix of every layer is full rank. While these studies give important insights into the design of neural network architectures and the behavior of training, their results may need to be modified in order to provide convergence guarantees for conventional deep neural networks. Du et al. (2018) extended the work of Jacot et al. (2018) by proving that gradient descent achieves zero training loss in deep neural networks with residual connections.

When it comes to the convergence of state variables of a dynamical system, control theory provides a rich mathematical framework which can be utilized for analyzing the non-linear dynamics of deep learning [Liu & Theodorou (2019)]. One of the early works relating deep learning to control theory was that of LeCun et al. (1988), which used the concept of optimal control and formulated back-propagation as an optimization problem with non-linear constraints. Non-linear control has gained increasing attention over the past few years in the context of neural networks, especially for recurrent neural networks [Allen-Zhu et al. (2019), Xiao (2017)] and reinforcement learning [Xu et al. (2013), Gupta et al. (2019), Wang et al. (2019), Kaledin et al. (2020)]. A class of recurrent neural networks called Zhang Neural Networks (ZNN) expresses the dynamics of the network as a set of ordinary differential equations and uses non-linear control to prove global or exponential stability for the time-varying Sylvester equation [Zhang et al. (2002), Guo et al. (2011)]. Li et al. (2013) introduced the sign bi-power activation function for ZNNs, which helps to prove the finite-time convergence property.
Haber & Ruthotto (2017) present deep learning as a parameter estimation problem of non-linear dynamical systems in order to tackle exploding and vanishing gradients. The focus of this paper is on deriving an a priori guarantee of finite-time convergence of training in a supervised learning framework under some assumptions on the inputs. The novelty lies in the fact that the weight update is cast as a finite-time control synthesis such that the loss function is proven to be a valid Lyapunov function. The resulting training update is derived as a function of time such that it ensures convergence of the Lyapunov function in finite time. The only assumption used is that the magnitude of at least one input is greater than zero. Thus, the learning problem is converted into a finite-time stabilization problem as studied rigorously in Bhat & Bernstein (2000). The contributions of the proposed study are twofold. First, we propose a Lyapunov candidate function to be used as the loss function. Second, we modify the weight update of the neural network in such a way that supervised training is converted into a dynamical control system. This allows us to use results from Bhat & Bernstein (2000) to guarantee finite-time convergence of training a priori. To the best of our knowledge, this is the first time a guarantee of finite-time convergence is studied in the context of training a general multi-layer neural network. The proposed results will enable time-bound training that will be useful in real-time applications.

The paper is organized as follows. Section 2 introduces the Lyapunov function from the control theory perspective. Section 2.1 derives the weight update and Lyapunov loss function for the single neuron case and proves that it satisfies the conditions required for the finite-time stability theorems of Bhat & Bernstein (2000) to be applicable.
Section 2.2 then proves that a similar result extends to a multi-layer perceptron network under reasonable assumptions on the input. In Section 2.3, we state the equations for computing upper bounds on the convergence time of neural network training. Section 2.4 extends the analysis to the case where bounded perturbations are admitted at the input and shows that the convergence guarantees continue to hold. In Section 3, numerical simulations are presented for both the single neuron and multi-layer perceptron cases. Section 4 collects conclusions and discusses future scope.

2. PROPOSED METHOD TO CONVERT SUPERVISED LEARNING INTO DYNAMICAL CONTROL SYSTEM

This section motivates the development of a priori bounds on the settling time under certain assumptions on the input. The weight update problem for supervised learning in neural networks is similar to the tracking problem of non-linear control systems. The idea is to synthesize a feedback control law based on a certain Lyapunov function $V(x)$. A Lyapunov function is a positive definite function, i.e. $V(x) > 0$ for $x \neq 0$, whose time derivative is negative definite along the given system dynamics, i.e. $\dot{V}(x) < 0$. We cast the supervised learning problem as a dynamical system which admits a valid Lyapunov function as the loss function, with the weight update designed as a function of time.
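To make the finite-time stabilization idea concrete before introducing the neural-network setting, the following sketch numerically integrates the scalar system $\dot{x} = -k|x|^{\alpha}\mathrm{sign}(x)$, the prototypical finite-time-stable system analyzed in Bhat & Bernstein (2000). The gain, exponent, step size and tolerance below are illustrative choices, not values from this paper.

```python
import numpy as np

def settling_time_empirical(x0, k=2.0, alpha=0.5, dt=1e-4, t_max=5.0, tol=1e-6):
    """Euler-integrate xdot = -k |x|^alpha sign(x) and return the first time
    |x| falls below `tol` (an empirical settling time)."""
    x, t = float(x0), 0.0
    while t < t_max and abs(x) >= tol:
        x -= dt * k * np.sign(x) * abs(x) ** alpha
        t += dt
    return t

# Closed-form settling time for this scalar system: T = |x0|^(1-alpha) / (k (1-alpha)).
x0, k, alpha = 1.0, 2.0, 0.5
T_bound = abs(x0) ** (1 - alpha) / (k * (1 - alpha))
T_emp = settling_time_empirical(x0, k, alpha)
# The trajectory reaches (numerical) zero no later than T_bound, unlike the
# exponential decay of xdot = -k x, which only converges asymptotically.
```

The non-Lipschitz term $|x|^{\alpha}$ with $\alpha \in (0,1)$ is what makes the settling time finite; the same mechanism drives the loss dynamics constructed below.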

2.1. SINGLE NEURON CASE

We start with the simplistic single neuron case. Let $x \in \mathbb{R}^n$ be the input to the network, where $x = [x_1 \; x_2 \; \cdots \; x_n]^\top$ and $|x_i| < c$, $i = 1, 2, \cdots, n$ holds true for some a priori known but arbitrary scalar $c \in (0, \infty)$. Let $y^*$ be the target output and $z = \sum_{i=1}^{n} w_i x_i + b$ be the linear combination of the weights $w_i$ with the inputs $x_i$ plus a bias $b$. The definition of the signum function used here is $\mathrm{sign}(x) = 1$ for $x > 0$, $\mathrm{sign}(x) = -1$ for $x < 0$, and $\mathrm{sign}(x) \in [-1, 1]$ for $x = 0$. For our analysis, we choose the sigmoid activation function $\sigma(z) = \frac{1}{1+e^{-z}}$, so the output of the neural network is $y = \sigma(z)$. Let the error in the output be defined as $\bar{e} = y - y^*$.

The first objective is to convert our loss function into a candidate Lyapunov function in order to apply control-theoretic principles. Consider the continuous function $E(\bar{e})$ as a candidate Lyapunov function:

$$E = \frac{|\bar{e}|^{\alpha+1}}{\alpha+1} \quad (1)$$

where $\alpha \in (0, 1)$ is a user-defined parameter. The second objective is to define the temporal rate of the weights as the control input that enforces the stability of the origin $\bar{e} = 0$. The Lyapunov function in (1) is used to show that it is indeed plausible to achieve this stability goal. Differentiating (1) with respect to time gives

$$\frac{dE}{dt} = |\bar{e}|^{\alpha} \, \mathrm{sign}(\bar{e}) \, \frac{e^{-z}}{(1+e^{-z})^2} \left( x_1 \dot{w}_1 + x_2 \dot{w}_2 + \cdots + x_n \dot{w}_n \right) \quad (2)$$

Define the control inputs $u_i \triangleq \dot{w}_i$ as

$$u_i = -k_i \, \mathrm{sign}(x_i) \, \mathrm{sign}(\bar{e}) \, e^{z} (1+e^{-z})^2, \quad i = 1, 2, \cdots, n \quad (3)$$

where $k_i > 0$, for all $i$, are tuning parameters to be chosen by the user. Note that all control inputs $u_1, u_2, \cdots, u_n$ remain bounded due to the boundedness assumption on the inputs $x_i$ and hence on $e^z$. Substituting (3) into (2) produces

$$\frac{dE}{dt} = -|\bar{e}|^{\alpha} \sum_{i=1}^{n} k_i |x_i| \quad (4)$$

Assumption 1. At least one of the inputs $x_i$, $i = 1, 2, \cdots, n$ is non-zero, i.e. $|x_j| > \gamma > 0$ for some $j \in [1, n]$, where $\gamma$ is an a priori known scalar.
It can be noted that Assumption 1 is reasonable for many practical applications, in that some inputs will always be non-zero with a known lower bound on their magnitude. The first main result of the paper is now in order.

Theorem 1. Let Assumption 1 hold and let the output of the neural network be given by $y = \sigma(z)$. Let all the inputs $x_i$, $i = 1, 2, \cdots, n$ be bounded by some a priori known scalar $a \in (0, \infty)$ such that $|x_i| < a$ holds for all $i$. Then, the weight update (3) causes the error $\bar{e} = y - y^*$ to converge to zero in finite time.

Proof. The proof is furnished using standard Lyapunov analysis arguments. Consider $E$ defined by (1) as a candidate Lyapunov function. Observing (4), the right hand side of the temporal derivative of $E$ is negative definite. Furthermore, (4) can be rewritten under Assumption 1 as

$$\frac{dE}{dt} \leq -k_{\min} \gamma E^{\beta} \quad (5)$$

where $k_{\min} = \min_i(k_i)$, $\beta = \frac{\alpha}{\alpha+1}$, and the identity $|\bar{e}|^{\alpha} = \left(|\bar{e}|^{\alpha+1}\right)^{\frac{\alpha}{\alpha+1}} = \left((\alpha+1)E\right)^{\beta}$ has been utilized, with the constant factor $(\alpha+1)^{\beta}$ absorbed into $k_{\min}$. Noting that $E$ is a positive definite function and the scalars $k_{\min}$ and $\gamma$ are always positive, the proof is complete by applying (Bhat & Bernstein, 2000, Theorem 4.2).
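A minimal numerical sketch of the single-neuron scheme follows: the Lyapunov loss (1) together with an Euler discretization of the control law (3). The data, gain, step size and iteration count are illustrative assumptions, and a discrete-time implementation only approximates the continuous-time guarantee (the sign term produces small chattering around zero error).

```python
import numpy as np

def lyapunov_loss(e_bar, alpha=0.7):
    # Candidate Lyapunov loss, eq. (1): E = |e|^(alpha+1) / (alpha+1)
    return abs(e_bar) ** (alpha + 1) / (alpha + 1)

def neuron_step(w, b, x, y_star, k=1.0, lr=0.01):
    """One Euler step of the control law (3): w_i <- w_i + lr * u_i."""
    z = w @ x + b
    y = 1.0 / (1.0 + np.exp(-z))          # sigmoid output
    e_bar = y - y_star
    # u_i = -k_i sign(x_i) sign(e) e^z (1 + e^{-z})^2, chosen so that the
    # sigma'(z) factor in dE/dt cancels and dE/dt = -|e|^alpha sum_i k_i |x_i|
    u = -k * np.sign(x) * np.sign(e_bar) * np.exp(z) * (1.0 + np.exp(-z)) ** 2
    return w + lr * u, e_bar

w, b = np.zeros(2), 0.0
x, y_star = np.array([0.5, -0.3]), 0.8
for _ in range(200):
    w, e_bar = neuron_step(w, b, x, y_star)
# |e_bar| shrinks toward zero, up to the chattering band of the discretized sign law
```

Note that the update is aggressive precisely because it is non-Lipschitz in the error: its magnitude does not vanish as $\bar{e} \to 0$, which is the source of the finite-time behavior.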

2.2. MULTI NEURON CASE

Consider a multi-layer perceptron with $N$ layers connected in a feed-forward manner (Bishop, 1995, Chapter 4). Let $x \in \mathbb{R}^n$ be the input to the network, where $x = [x_1 \; x_2 \; \cdots \; x_n]^\top$ and $|x_i| < c$, $i = 1, 2, \cdots, n$ holds true for some a priori known but arbitrary scalar $c \in (0, \infty)$. Let $y \in \mathbb{R}^m$ be the output of the network, $y = [y_1 \; y_2 \; \cdots \; y_m]^\top$, and let $y^* \in \mathbb{R}^m$ be the target output, $y^* = [y_1^* \; y_2^* \; \cdots \; y_m^*]^\top$. The error in the output layer can be expressed as $\bar{e} = [|y_1 - y_1^*| \; |y_2 - y_2^*| \; \cdots \; |y_m - y_m^*|]^\top$. Hence, the scalar candidate Lyapunov function can be written as

$$E = E_1 + \cdots + E_m = \frac{|\bar{e}_1|^{\alpha+1}}{\alpha+1} + \cdots + \frac{|\bar{e}_m|^{\alpha+1}}{\alpha+1} \quad (6)$$

As is usual for feed-forward networks, unit $j$ of layer $l$ computes its output

$$a_j^l = \sum_i w_{ji}^l z_i^l \quad (7)$$

using its inputs $z_i^l$, where the bias parameter has been embedded inside the linear combination and $z_j^l = \sigma(a_j^{l-1})$ with $\sigma$ the non-linear activation function. The aim of this section is to extend the single neuron case to a multi-neuron one. The simplest way to achieve this is to find the sensitivity of $E$ to the weight $w_{ji}^l$ of a given layer $l$:

$$\frac{\partial E_m}{\partial w_{ji}^l} = \frac{\partial E_m}{\partial a_j^l} \frac{\partial a_j^l}{\partial w_{ji}^l} \quad (8)$$

Using the standard notation $\delta_j^l \triangleq \frac{\partial E_m}{\partial a_j^l}$ together with (7) results in

$$\frac{\partial E_m}{\partial w_{ji}^l} = \delta_j^l z_i^l \quad (9)$$

It is straightforward to compute $\delta_m^L$ for the output layer $L$:

$$\delta_m^L = \frac{\partial E_m}{\partial a_m^L} = \sigma'(a_m^L) \frac{\partial E_m}{\partial y_m} \quad (10)$$

where $z_m^L$ is replaced by $y_m$ as it is the output layer. Finally, $\delta_j^l$ for the hidden units is given by

$$\delta_j^l = \frac{\partial E_m}{\partial a_j^l} = \sum_k \frac{\partial E_m}{\partial a_k^{l+1}} \frac{\partial a_k^{l+1}}{\partial a_j^l} \quad (11)$$

where the units labelled $k$ are the hidden or output units in layer $l+1$. Combining (7), $z_j^l = \sigma(a_j^l)$ and $\delta_j^l \triangleq \frac{\partial E_m}{\partial a_j^l}$ produces

$$\delta_j^l = \sigma'(a_j^l) \sum_k w_{kj}^{l+1} \delta_k^{l+1} \quad (12)$$

A slightly modified version of Assumption 1 is required before the next result of the paper is presented.

Assumption 2. At least one of the unit inputs $z_i^l$, $i = 1, 2, \cdots, L$ is non-zero, i.e. $|z_n^l| > \gamma > 0$ for some $n \in [1, L]$, where $L$ is the number of units in layer $l$ and $\gamma$ is an a priori known scalar.

Theorem 2.
Let the weight update for the weight connecting unit $i$ of layer $l$ to unit $j$ of layer $l+1$ of a multi-layer neural network be given by

$$\dot{w}_{ji}^l = -k_{ji}^l \, \mathrm{sign}(\delta_j^l z_i^l) \, |\delta_j^l z_i^l|^{\alpha} E^{\beta} \quad (13)$$

with some scalar $\beta \in (0, 1)$ such that $\alpha + \beta < 1$, where $k_{ji}^l > 0$ is a tuning parameter. Then the output vector $y$ converges to $y^*$ in finite time.

Proof. Consider the candidate Lyapunov function $E$ given by (6). Its temporal derivative is

$$\dot{E} = \sum_m \dot{E}_m = \sum_m \frac{\partial E_m}{\partial w_{ji}^l} \dot{w}_{ji}^l \quad (14)$$

which can be simplified using (13) and (9) as

$$\dot{E} = -E^{\beta} \sum_m k_{ji}^l |\delta_j^l z_i^l|^{\alpha+1} \quad (15)$$

Using Assumption 2, it is easy to conclude that for $k_{\min} = \min_{i,j,l} k_{ji}^l > 0$, the following inequality holds true:

$$\dot{E} \leq -k_{\min} \gamma^{\alpha+1} E^{\beta} \quad (16)$$

Noting that $E$ is a positive definite function and the scalars $k_{\min}$ and $\gamma$ are always positive, the proof is complete by applying (Bhat & Bernstein, 2000, Theorem 4.2).

Remark 1. The weight update (13) (respectively (3)) is in feedback control form, where the states $z_i^l$ and $\delta_j^l$ (respectively $x_i$ and $\bar{e}$) are used to influence the learning process.
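As a sanity check, update (13) can be exercised on a small fully-connected network. The sketch below implements the deltas (10)-(12) for sigmoid units and the sign-based update (13) with an Euler step. The architecture, data, $\alpha$, $\beta$, gain and step size are illustrative assumptions ($\alpha + \beta < 1$ as Theorem 2 requires), with all gains $k_{ji}^l$ set to one common value.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_lyapunov_step(W1, W2, x, y_star, alpha=0.7, beta=0.25, k=1.0, lr=0.1):
    """One Euler step of weight update (13) for a 1-hidden-layer sigmoid MLP."""
    a1 = W1 @ x                    # pre-activations, eq. (7), bias absorbed
    z1 = sigmoid(a1)
    a2 = W2 @ z1
    y = sigmoid(a2)
    e = y - y_star
    E = np.sum(np.abs(e) ** (alpha + 1)) / (alpha + 1)   # Lyapunov loss, eq. (6)
    # Output-layer deltas, eq. (10): sigma'(a) dE/dy with dE/dy_m = |e_m|^alpha sign(e_m)
    d2 = y * (1.0 - y) * np.abs(e) ** alpha * np.sign(e)
    # Hidden-layer deltas, eq. (12)
    d1 = z1 * (1.0 - z1) * (W2.T @ d2)
    # Sensitivities delta * z, eq. (9), then the non-Lipschitz update (13)
    g2, g1 = np.outer(d2, z1), np.outer(d1, x)
    W2 = W2 - lr * k * np.sign(g2) * np.abs(g2) ** alpha * E ** beta
    W1 = W1 - lr * k * np.sign(g1) * np.abs(g1) ** alpha * E ** beta
    return W1, W2, E

rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((4, 3))
W2 = 0.1 * rng.standard_normal((2, 4))
x, y_star = np.array([0.5, -1.0, 0.3]), np.array([0.2, 0.9])
E_hist = []
for _ in range(2000):
    W1, W2, E = mlp_lyapunov_step(W1, W2, x, y_star)
    E_hist.append(E)
# E_hist decays toward a small value; the discretized sign law chatters near zero
```

The update differs from plain gradient descent only in replacing each partial derivative $g$ by $\mathrm{sign}(g)|g|^{\alpha}E^{\beta}$, so it plugs into an existing backpropagation pipeline with minimal changes.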

2.3. SETTLING TIME FOR NEURAL NETWORK TRAINING

Theorems 1 and 2 show that, by using the candidate Lyapunov function as the loss function and by deriving the weight update from the temporal derivative of the loss, the supervised learning framework can be converted into a control problem. (Bhat & Bernstein, 2000, Theorem 4.2) states that if there exists a continuous positive definite function whose rate of change with respect to time is negative definite and satisfies a differential inequality of the form (5) or (16), then the origin is a finite-time-stable equilibrium. The theorem also gives the settling time when these conditions are satisfied. For our case, this means that the Lyapunov-based loss function will converge in finite time, with the settling time bounded as follows.

Settling time for the single neuron case:

$$T \leq \frac{1}{k_{\min} \gamma (1-\beta)} E_{\mathrm{ini}}^{(1-\beta)} \quad (17)$$

Settling time for the multi-neuron case:

$$T \leq \frac{1}{k_{\min} \gamma^{\alpha+1} (1-\beta)} E_{\mathrm{ini}}^{(1-\beta)} \quad (18)$$

where $\gamma$ is the lower bound on the inputs, $k_{\min}$ is the minimum value of the tuning parameters $k$, $\alpha$ is the scalar used in the Lyapunov loss function, and $E_{\mathrm{ini}}$ is the initial value of the loss, i.e. the loss at $t = 0$.
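The bounds (17) and (18) can be evaluated before training begins. The helper below does so, defaulting $\beta$ to $\alpha/(\alpha+1)$ as in Theorem 1 when no $\beta$ is supplied; the numbers in the example are illustrative, not taken from the paper's experiments.

```python
def settling_time_bound(E_ini, k_min, gamma, alpha, beta=None, multi_neuron=False):
    """A priori upper bound on settling time: eq. (17) for a single neuron,
    eq. (18) for the multi-neuron case. beta defaults to alpha/(alpha+1) as in
    Theorem 1; Theorem 2 allows any beta in (0, 1) with alpha + beta < 1."""
    if beta is None:
        beta = alpha / (alpha + 1.0)
    gain = gamma ** (alpha + 1.0) if multi_neuron else gamma
    return E_ini ** (1.0 - beta) / (k_min * gain * (1.0 - beta))

# Example: E_ini = 1, k_min = 1, gamma = 1, alpha = 0.5 gives beta = 1/3 and T <= 1.5
T_single = settling_time_bound(E_ini=1.0, k_min=1.0, gamma=1.0, alpha=0.5)
T_multi = settling_time_bound(E_ini=1.0, k_min=1.0, gamma=0.5, alpha=0.5,
                              beta=0.4, multi_neuron=True)
```

The bound shrinks as $k_{\min}$ grows, consistent with the empirical observation in Section 3 that larger gains make learning more aggressive.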

2.4. SENSITIVITY TO PERTURBATIONS

This section considers the robustness of training a neural network with the proposed algorithm. In control-theoretic terms, a perturbation can be understood either as a disturbance to the process or as modelling uncertainty. When supervised learning is viewed as a control process, a perturbation in each neuron of the input layer can be seen as external noise, which could cause the control system to diverge. In this section, we show that even with perturbed inputs, training a neural network with the proposed algorithm converges. The following assumption of an upper bound on the input perturbations is invoked.

Assumption 3. There exists an a priori known constant $M > 0$ such that all inputs $x_i$, $i = 1, 2, \cdots, N$ admit additive perturbations $\Delta x_i$ satisfying

$$|\Delta x_i| \leq M |x_i|^{\alpha} \quad (19)$$

for all $i$, where $N$ is the number of inputs. The following result is in order.

Theorem 3. Let Assumptions 2 and 3 hold. Let the weight update for the weight connecting unit $i$ of layer $l$ to unit $j$ of layer $l+1$ of a multi-layer neural network be given by (13). Then the output vector $y$ converges to $y^*$ in finite time in the presence of additive perturbations $\Delta x_n$, $n \in [1, L]$, provided $k_{\min} > M$.

Proof. The perturbation is considered only in the inputs; hence all hidden layer weights are updated as in the proof of Theorem 2. The dynamics of learning then yields the following revised temporal derivative of the Lyapunov function:

$$\dot{E} = -E^{\beta} \left[ \sum_m k_{ji} |\delta_j z_i|^{\alpha+1} + \sum_{p=0}^{n} k_{1p} |\delta_p x_p|^{\alpha} \, \mathrm{sign}(\delta_p x_p) \, \delta_p \Delta x_p \right] \quad (20)$$

where $k_{1p}$ is the gain parameter for the input layer weights. The expression in (20) can be simplified using Assumption 3 as

$$\dot{E} \leq -E^{\beta} \sum_m k_{ji} |\delta_j z_i|^{\alpha+1} + E^{\beta} \sum_{p=0}^{n} k_{1p} |\delta_p x_p|^{\alpha+1} M \leq -E^{\beta} \sum_m (k_{ji} - M) |\delta_j z_i|^{\alpha+1} \quad (21)$$

where the inputs $z_i$ now collect the inputs $x_p$ as well.
Similar to the proof of Theorem 2, (21) can be rewritten by applying Assumption 2 as

$$\dot{E} \leq -(k_{\min} - M) \gamma^{\alpha+1} E^{\beta} \quad (22)$$

Since $k_{\min} > M$, the proof is complete by applying (Bhat & Bernstein, 2000, Theorem 4.2).
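The flavor of Theorem 3 can be checked numerically in the single-neuron setting: the inputs are perturbed with uniform noise satisfying $|\Delta x_i| \leq M|x_i|$, which for sub-unit inputs also implies the bound (19) for $\alpha \in (0,1)$, and the gain is chosen with $k > M$. All numbers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x_clean = np.array([0.8, -0.5])
y_star, w, b = 0.7, np.zeros(2), 0.0
k, M, lr = 1.0, 0.2, 0.005          # gain satisfies k > M, as Theorem 3 requires

for _ in range(1000):
    # additive perturbation with |dx_i| <= M |x_i|; since |x_i| < 1 this also
    # satisfies |dx_i| <= M |x_i|^alpha of Assumption 3 for alpha in (0, 1)
    x = x_clean + M * np.abs(x_clean) * rng.uniform(-1.0, 1.0, size=2)
    z = w @ x + b
    y = 1.0 / (1.0 + np.exp(-z))
    e_bar = y - y_star
    # perturbed-input version of control law (3), Euler-discretized
    w = w - lr * k * np.sign(x) * np.sign(e_bar) * np.exp(z) * (1.0 + np.exp(-z)) ** 2
# training does not diverge: the clean-input prediction settles near the target,
# with a non-zero residual error induced by the persistent noise
```

The residual steady-state error mirrors the behavior reported for noisy data in Section 3: with persistent perturbations, the loss settles at a small non-zero value rather than exactly zero.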

3. EXPERIMENTS

This section presents empirical evidence that the theoretical results derived in the preceding sections hold in practice.

Convergence rate of training. To analyze the training convergence rate of the proposed method, we train a multi-layer perceptron (Section 2.2) to perform regression on the Boston Housing dataset (Harrison Jr & Rubinfeld (1978)). The experiment for the single neuron case is covered in Appendix A.2.2.

Training details. We compare our proposed method (Lyapunov-based loss and modified weight update equation) with the traditional L1 and L2 loss functions under the standard SGD weight update. The modified weight update for the Lyapunov loss in the multi-neuron case is given by (13). We plot the training and test loss with respect to time in Figure 1; the number of training examples, epochs, learning rate, etc. are specified in the figure captions. We observe that the Lyapunov loss function converges faster than the L1 and L2 losses, in line with the results proven in Section 2. This is attributed mainly to the non-Lipschitz weight updates (3) and (13). The settling times as well as the metrics of the trained networks are reported in Table 1. The time taken by the proposed algorithm to converge falls within the a priori upper bound. We also observe that the metrics (accuracy for the single neuron case and root mean square error, rmse, for the multi-neuron case) achieved by the proposed method are better than the baselines. Admittedly, the theoretical upper bounds on the settling time produced by (17) and (18) are conservative; nonetheless, they are a priori deterministic upper bounds, and the aggressive learning proves to be faster than traditional loss functions in the scenarios considered here.
This can be attributed to the fact that we operate in the realm of non-smooth functions, which tend to be more aggressive than the smooth L2 functions usually encountered in learning problems. We can also clearly observe that the loss value does not converge to zero. According to (Bhat & Bernstein, 2000, Theorem 5.2), a control system converges to a non-zero steady-state value if there is a constant perturbation in the input. Real-life datasets usually contain an inherent bias or constant perturbation, which prevents the loss from converging to zero. Of course, by setting α = 0 we can reject all persisting disturbances (Orlov (2005)) and force the loss to converge to zero, but this may introduce a discontinuity in back-propagation (discussed in Appendix A.1) and result in large numerical errors.

Hyperparameter analysis and its effect on training. In this section, we give an empirical analysis of how changes in the hyperparameters k and α affect the training of the neural network. In order to study the effect of tuning k, we work with the single neuron case for simplicity and ease of understanding. We use the Iris dataset for this experiment (Dua & Graff (2017)). Existing literature in control theory suggests that a monotonic increase in the tuning parameter k should result in a monotonic decrease in the corresponding loss. In neural network terminology, k can be considered the 'learning rate'. Figure 2 clearly shows that k behaves like a learning rate parameter, where increasing k makes the learning more aggressive. From Table 2, we observe that the settling time for the training loss of our Lyapunov loss function is always less than that of L1 and L2.
We also observe that the accuracy of our method is better than or similar to the baselines, which shows that the solution reached by our method is on par with or better than the baselines. For the multi-neuron case, we have defined a gain $k_{ji}^l$ for every weight. Similar to an adaptive learning rate, we can either tune $k_{ji}^l$ for every weight (which can be quite cumbersome for larger neural networks) or set all gains arbitrarily to the same value. It must be noted that $k_{ji}^l$ can be set to any value greater than the a priori known positive constant $M$. Next, we discuss the impact of different values of α on the training process of our neural network.


For these experiments, we tested the network for values of α ranging over (0, 1). We observed that the training loss follows the equations derived in the preceding sections as long as α ∈ (0.5, 0.9). If α is too close to zero, the resulting control law becomes nearly discontinuous, causing numerical instabilities, since current ODE solvers are unable to handle discontinuous right-hand sides. More details on the case α = 0 are given in Appendix A.1 (Bhat & Bernstein, 2000, Theorem 5.2).

Stability of learning in the presence of amplitude-bounded perturbations. To add input perturbations, we assume that an a priori upper bound M on the amplitude of possible input perturbations is known, following Assumption 3. We add a perturbation to each pixel in the image by sampling from a uniform distribution over (-∆x, ∆x); hence the additive noise remains within the a priori bounds. We vary M from 0.1 to 0.3 in increments of 0.1, giving three training cases. Figure 3 shows that the proposed training loss still reaches a steady-state error and does not diverge due to the perturbations introduced in the input dataset. Since we are dealing with noisy data, the loss converges to a non-zero steady-state value.

Experiments on larger datasets. In this section, we present results demonstrating the performance of our proposed algorithm on a larger dataset. We use the entire IMDB Wiki Faces dataset (Rothe et al. (2015)) with 0.5 million images for this experiment. The training set consists of ~0.2 million images, and the test and validation sets consist of ~0.1 million images each. As described in the previous section, a multi-layer perceptron is trained to predict age from the image of a face. The MLP is trained for 100 epochs with learning rate 0.0005 for all three loss functions and with α = 0.7 for the Lyapunov loss function.
From Figure 4 and Table 3, we can see that our proposed algorithm achieves test rmse (root mean square error) similar to the L1 and L2 baselines while providing finite-time convergence guarantees even on large datasets. The convergence time is also within the theoretically derived upper bound.

4. CONCLUSION AND FUTURE WORK

This paper studies the training of a deep neural network from a control theory perspective. We pose the supervised learning problem as a control problem by jointly designing the loss function as a Lyapunov function and the weight update from the temporal derivative of the Lyapunov function. Control theory principles are then applied to provide guarantees on the finite-time convergence and settling time of neural network training. In experiments on benchmark datasets, our proposed method converges within the a priori bounds derived from theory. It is also observed that in some cases our method enforces faster convergence compared to the standard L1 and L2 loss functions. We also prove that our method is robust to bounded perturbations in the input, with the convergence guarantees continuing to hold. The a priori guarantees on convergence time are a desirable property for training networks that are otherwise difficult to converge, notably in reinforcement learning. A future direction is to convert the continuous-time analysis framework to discrete time. This study introduces a novel perspective of viewing neural networks as control systems and opens machine learning research to a plethora of results that can be derived from control theory.

A.2.2 CONVERGENCE EXPERIMENTS ON SINGLE NEURON

In this experiment, we perform binary classification on the Iris dataset (Dua & Graff (2017)) as a representative example of the single neuron case presented in Section 2.1. Figure 7 shows how the noise affects the training images; taking the specific case M = 0.2, we observe that the images obtained after adding input perturbations are quite noisy. Figures 8 and 9 show that even when the inputs are perturbed, our proposed method converges in finite time.




Figure 1: Comparison of convergence with respect to time for the multi-layer perceptron trained on the Boston Housing dataset. All networks are trained for 4000 epochs with learning rate = 0.02 and α = 0.68.

Figure 2: Comparison of training and test convergence with respect to time for different values of the tuning parameter k for a single neuron trained on the Iris dataset. We convert the problem into a binary classification problem by considering only two output classes. All networks are trained for 600 epochs with 80 training examples and 20 test examples. The learning rate for L1, L2 and Lyapunov is set to the tuning parameter value k, and α = 0.8.

Figure 3: Comparison of training loss convergence with respect to time for different values of input perturbation ∆x for a multi-layer perceptron trained on the IMDB Wiki Faces dataset. The a priori upper bounds on input perturbations are: (a) ∆x = 0.1, (b) ∆x = 0.2, (c) ∆x = 0.3. Learning rate/k = 0.0009, α = 0.8, epochs = 100.

In this section, we present results that demonstrate the robustness of the proposed algorithm to bounded perturbations in the input while maintaining convergence of the training dynamics. For this case, we train a multi-layer perceptron on the regression task of predicting age given the image of a face. We use the IMDB Wiki Faces dataset (Rothe et al. (2015)), which has over 0.5 million face images of celebrities with age and gender labels. Only part of the dataset is used for this experiment: 20,000 images for training and 4,000 images each for validation and test.

Figure 4: Comparison of training convergence with respect to time for 0.5 million IMDB Wiki dataset

Figure 6: Comparison of convergence with respect to time for a single neuron trained on the Iris dataset. We convert the problem into a binary classification problem by considering only two of the three output classes. All networks are trained for 2100 epochs with 80 training examples and 20 test examples, α = 0.8, learning rate / k = 0.01

Figure 7: Pictorial representation of the effect of random additive noise with an upper bound of 0.2 on IMDB Wiki Faces Dataset. In this experiment, we work with monochrome images for the sake of simplicity.

Figure 8: Comparison of training loss convergence with respect to time for different values of input perturbation ∆x for a multi-layer perceptron trained on the IMDB Wiki Faces dataset. The a priori upper bounds on input perturbations are: (a) ∆x = 0.1, (b) ∆x = 0.2, (c) ∆x = 0.3, (d) ∆x = 0.4, (e) ∆x = 0.5, and (f) ∆x = 1.2.

weight update for the standard gradient descent algorithm, and dw/dt is the additional term we introduce to the weight update equation. Using (1), we get:

Table 1: Settling time in seconds for each experiment conducted. We compare the time taken for convergence by three loss functions: L1, L2 and the Lyapunov loss. The training conditions were similar for the individual cases in each experiment. The single neuron case is trained on the Iris dataset and the metric used is accuracy (higher is better); the multi-layer perceptron is trained on the Boston Housing dataset and the metric used is rmse (lower is better).

Table 2: Settling time in seconds for different values of the tuning parameter k for the single neuron case on the Iris dataset. We compare the time taken for convergence by three loss functions: L1, L2 and the Lyapunov loss. The training conditions were similar for the individual cases in the experiment.

Table 3: Settling time in seconds on the 0.5 million image IMDB Wiki dataset and rmse on the IMDB Wiki test dataset for the L1, L2 and Lyapunov loss functions.

Table 4 presents the theoretical upper bounds on the proposed Lyapunov loss function's settling time together with experimental results for training conducted under different assumed upper bounds on the input perturbation. We observe that the settling time for the proposed Lyapunov loss is similar to that of the L2 loss function, while it performs considerably better than the L1 loss function.

Table 4: Settling time in seconds for different values of the upper bound M on the additive input perturbations for the multi-layer perceptron case on the Boston Housing dataset. We compare the time taken for convergence by three loss functions: L1, L2 and the Lyapunov loss. The training conditions were similar for the individual cases in the experiment.

A.2.4 IMDB WIKI FACES EXPERIMENT ON PERTURBATIONS

APPENDIX

Gang Wang, Bingcong Li, and Georgios B. Giannakis. A multistep Lyapunov approach for finite-time analysis of biased stochastic approximation. arXiv preprint arXiv:1909.04299, 2019.

Lin Xiao. Accelerating a recurrent neural network to finite-time convergence using a new design formula and its application to time-varying matrix square root.

A.1 THE CASE α = 0

The case α = 0, as depicted in Figure 5, raises continuity issues. Consider the left and right limits of the function $f(\epsilon) = |\epsilon|^{\alpha} \mathrm{sign}(\epsilon)$, where $\epsilon = \delta_j z_i$ as appearing in (13). It can be seen that $\lim_{\epsilon \to 0^-} f(\epsilon) = \lim_{\epsilon \to 0^+} f(\epsilon) = 0$ for $\alpha \in (0, 1)$. However, for $\alpha = 0$, $\lim_{\epsilon \to 0^-} f(\epsilon) = -1$ and $\lim_{\epsilon \to 0^+} f(\epsilon) = 1$. It should be noted that $f(\epsilon)$ is non-Lipschitz for $\alpha \in (0, 1)$, since $\partial f / \partial \epsilon$ tends to infinity in the limit $\epsilon \to 0$. We agree that α = 0 reduces the loss function to the L1 loss, but the control update then becomes purely discontinuous due to the presence of $f(\epsilon)$ in (13). We do not deal with this case, for the continuity reasons mentioned in the main paper.

