LIPSCHITZ-BOUNDED EQUILIBRIUM NETWORKS

Abstract

This paper introduces new parameterizations of equilibrium neural networks, i.e. networks defined by implicit equations. This model class includes standard multilayer and residual networks as special cases. The new parameterization admits a Lipschitz bound during training via unconstrained optimization: no projections or barrier functions are required. Lipschitz bounds are a common proxy for robustness and appear in many generalization bounds. Furthermore, compared to previous works we show well-posedness (existence of solutions) under less restrictive conditions on the network weights and more natural assumptions on the activation functions: that they are monotone and slope restricted. These results are proved by establishing novel connections with convex optimization, operator splitting on non-Euclidean spaces, and contracting neural ODEs. In image classification experiments we show that the Lipschitz bounds are very accurate and improve robustness to adversarial attacks.

1. INTRODUCTION

Deep neural network models have revolutionized the field of machine learning: their accuracy on practical tasks such as image classification and their scalability have led to an enormous volume of research on different model structures and their properties (LeCun et al., 2015). In particular, deep residual networks with skip connections (He et al., 2016) have had a major impact, and neural ODEs have been proposed as an analog with "implicit depth" (Chen et al., 2018). Recently, a new structure has gained interest: equilibrium networks (Bai et al., 2019; Winston & Kolter, 2020), a.k.a. implicit deep learning models (El Ghaoui et al., 2019), in which model outputs are defined by implicit equations incorporating neural networks. This model class is very flexible: it is easy to show that it includes many previous structures as special cases, including standard multi-layer networks, residual networks, and (in a certain sense) neural ODEs. However, model flexibility in machine learning is always in tension with model regularity or robustness. While deep learning models have exhibited impressive generalisation performance in many contexts, it has also been observed that they can be very brittle, especially when targeted with adversarial attacks (Szegedy et al., 2014). In response to this, there has been a major research effort to understand and certify robustness properties of deep neural networks, e.g. Raghunathan et al. (2018a); Tjeng et al. (2018); Liu et al. (2019); Cohen et al. (2019) and many others. Global Lipschitz bounds (a.k.a. incremental gain bounds) provide a somewhat crude but nevertheless highly useful proxy for robustness (Tsuzuku et al., 2018; Fazlyab et al., 2019), and appear in several analyses of generalization (e.g. Bartlett et al., 2017; Zhou & Schoellig, 2019). Inspired by both of these lines of research, in this paper we propose new parameterizations of equilibrium networks with guaranteed Lipschitz bounds.
We build directly on the monotone operator framework of Winston & Kolter (2020) and the work of Fazlyab et al. (2019) on Lipschitz bounds. The main contribution of our paper is the ability to enforce tight bounds on the Lipschitz constant of an equilibrium network during training with essentially no extra computational effort. In addition, we prove existence of solutions with less restrictive conditions on the weight matrix and more natural assumptions on the activation functions via novel connections to convex optimization and contracting dynamical systems. Finally, we show via small-scale image classification experiments that the proposed parameterizations can provide significant improvement in robustness to adversarial attacks with little degradation in nominal accuracy. Furthermore, we observe small gaps between certified Lipschitz upper bounds and observed lower bounds computed via adversarial attack.

2. RELATED WORK

Equilibrium networks, Implicit Deep Models, and Well-Posedness. As mentioned above, it has been recently shown that many existing network architectures can be incorporated into a flexible model set called an equilibrium network (Bai et al., 2019; Winston & Kolter, 2020) or implicit deep model (El Ghaoui et al., 2019). In this unified model set, the network predictions are made not by forward computation of sequential hidden layers, but by finding a solution to an implicit equation involving a single layer of all hidden units. One major question for this type of network is its well-posedness, i.e. the existence and uniqueness of a solution to the implicit equation for all possible inputs. El Ghaoui et al. (2019) proposed a computationally verifiable but conservative condition on the spectral norm of the hidden unit weight. In Winston & Kolter (2020), a less conservative condition was developed based on monotone operator theory. Similar monotonicity constraints were previously used to ensure well-posedness of a different class of implicit models in the context of nonlinear system identification (Tobenkin et al., 2017, Theorem 1). On the question of well-posedness, our contribution is a more flexible model set and more natural assumptions on the activation functions: that they are monotone and slope-restricted.

Neural Network Robustness and Lipschitz Bounds. The Lipschitz constant of a function measures the worst-case sensitivity of the function, i.e. the maximum "amplification" of differences in inputs to differences in outputs. The key features of a good Lipschitz-bounded learning approach include a tight estimate of the Lipschitz constant and a computationally tractable training method with bounds enforced. For deep networks, Tsuzuku et al. (2018) proposed a computationally efficient but conservative approach, since its Lipschitz constant estimate is based on composing estimates for the individual layers, while Anil et al. (2019) proposed a combination of a novel activation function and weight constraints. For equilibrium networks, El Ghaoui et al. (2019) proposed an estimation of Lipschitz bounds via input-to-state stability (ISS) analysis. In Fazlyab et al. (2019), estimates for deep networks based on incremental quadratic constraints and semidefinite programming (SDP) were shown to give state-of-the-art results; however, this was limited to analysis of an already-trained network. The SDP test was incorporated into training via the alternating direction method of multipliers (ADMM) in Pauli et al. (2020); however, due to the complexity of the SDP, the training times recorded were almost 50 times longer than for unconstrained networks. Our approach uses a similar condition to Fazlyab et al. (2019) applied to equilibrium networks; however, we introduce a novel direct parameterization method that enables learning robust models via unconstrained optimization, removing the need for computationally expensive projections or barrier terms.

3.1. PROBLEM STATEMENT

We consider the weight-tied network in which $x \in \mathbb{R}^d$ denotes the input, $z \in \mathbb{R}^n$ denotes the hidden units, and $y \in \mathbb{R}^p$ denotes the output, given by the following implicit equation:

$$z = \sigma(Wz + Ux + b_z), \qquad y = W_o z + b_y \tag{1}$$

where $W \in \mathbb{R}^{n \times n}$, $U \in \mathbb{R}^{n \times d}$, and $W_o \in \mathbb{R}^{p \times n}$ are the hidden unit, input, and output weights, respectively, and $b_z \in \mathbb{R}^n$ and $b_y \in \mathbb{R}^p$ are bias terms. The implicit framework includes most current neural network architectures (e.g. deep and residual networks) as special cases. To streamline the presentation we assume that $\sigma : \mathbb{R} \to \mathbb{R}$ is a single nonlinearity applied elementwise, although our results also apply in the case that each channel has a different activation function, nonlinear or linear. Equation (1) is called an equilibrium network since its solutions are equilibrium points of the difference equation $z_{k+1} = \sigma(W z_k + Ux + b_z)$ or the ODE $\dot z(t) = -z(t) + \sigma(W z(t) + Ux + b_z)$. Our goal is to learn equilibrium networks (1) possessing the following two properties:
• Well-posedness: For every input $x$ and bias $b_z$, equation (1) admits a unique solution $z$.
• γ-Lipschitz: It has a finite Lipschitz bound of $\gamma$, i.e., for any input-output pairs $(x_1, y_1)$, $(x_2, y_2)$ we have $\|y_1 - y_2\|_2 \le \gamma \|x_1 - x_2\|_2$.
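The difference-equation view of (1) can be sketched numerically. The snippet below is a minimal illustration, not the solver used in the paper: the toy weights, the ReLU activation, and the helper name `solve_equilibrium` are all assumptions, and the weights are chosen with $\|W\|_2 < 1$ so that the plain iteration is a contraction (a stronger requirement than the conditions developed later).

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def solve_equilibrium(W, U, b_z, x, sigma=relu, tol=1e-12, max_iter=1000):
    """Iterate z_{k+1} = sigma(W z_k + U x + b_z) until the update is below tol."""
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_next = sigma(W @ z + U @ x + b_z)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    raise RuntimeError("fixed-point iteration did not converge")

# Hypothetical toy weights with ||W||_2 < 1, so the iteration contracts.
W = np.array([[0.2, -0.3], [0.1, 0.4]])
U = np.eye(2)
b_z = np.zeros(2)
W_o = np.array([[1.0, -1.0]])
b_y = np.zeros(1)

x = np.array([1.0, 0.5])
z_star = solve_equilibrium(W, U, b_z, x)   # hidden equilibrium
y = W_o @ z_star + b_y                     # network output
residual = np.linalg.norm(z_star - relu(W @ z_star + U @ x + b_z))
```

Here `residual` measures how well `z_star` satisfies the implicit equation; for weights that only satisfy the weaker conditions of Section 4, this naive iteration may fail to converge and an operator-splitting method is the safer choice.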

3.2. PRELIMINARIES

Monotone operator theory. The theory of monotone operators on Euclidean space (see the survey Ryu & Boyd (2016)) has been extensively applied in the development of equilibrium networks (Winston & Kolter, 2020). In this paper, we will use monotone operator theory on non-Euclidean spaces (Bauschke et al., 2011); in particular, we are interested in a finite-dimensional Hilbert space $\mathcal{H}$, which we identify with $\mathbb{R}^n$ equipped with a weighted inner product $\langle x, y \rangle_Q := y^\top Q x$ where $Q \succ 0$. The main benefit is that we can construct a more expressive equilibrium network set. A brief summary of relevant theory can be found in Appendix C.1; here we give some definitions that are frequently used throughout the paper. An operator is a set-valued or single-valued function defined by a subset of the space $A \subseteq \mathcal{H} \times \mathcal{H}$. A function $f : \mathcal{H} \to \mathbb{R} \cup \{\infty\}$ is proper if $f(x) < \infty$ for at least one $x$. The subdifferential and proximal operators of a proper function $f$ are defined as

$$\partial f(x) := \{ g \in \mathcal{H} \mid f(y) \ge f(x) + \langle y - x, g \rangle_Q,\ \forall y \in \mathcal{H} \}, \qquad \mathrm{prox}_f^{\alpha}(x) := \{ z \in \mathcal{H} \mid z = \arg\min_u \tfrac{1}{2} \|u - x\|_Q^2 + \alpha f(u) \}$$

respectively, where $\|x\|_Q := \sqrt{\langle x, x \rangle_Q}$ is the induced norm. For $n = 1$, we only consider the case of $Q = 1$. An operator $A$ is monotone if $\langle u - v, x - y \rangle_Q \ge 0$ and strongly monotone with parameter $m$ if $\langle u - v, x - y \rangle_Q \ge m \|x - y\|_Q^2$ for all $(x, u), (y, v) \in A$. The operator splitting problem is that of finding a zero of a sum of two operators $A$ and $B$, i.e. find an $x$ such that $0 \in (A + B)(x)$.

Dynamical systems theory. In this paper, we will also treat the solutions of (1) as equilibrium points of certain dynamical systems $\dot z(t) = f(z(t))$. Then, the well-posedness and robustness properties of (1) can be guaranteed by corresponding properties of the dynamical system's solution set.
A central focus in robust and nonlinear control theory for more than 50 years (largely unified by the modern theory of integral quadratic constraints (Megretski & Rantzer, 1997)) has been on systems which are interconnections of linear mappings and "simple" nonlinearities, i.e. those easily bounded in some sense by quadratic functions. Fortuitously, this characteristic is shared with deep, recurrent, and equilibrium neural networks, a connection that we use heavily in this paper and that has previously been exploited by Fazlyab et al. (2019); El Ghaoui et al. (2019); Revay et al. (2020) and others. A particular property we are interested in is called contraction (Lohmiller & Slotine, 1998), i.e., any pair of solutions $z_1(t)$ and $z_2(t)$ exponentially converge to each other:

$$\|z_1(t) - z_2(t)\| \le \alpha \|z_1(0) - z_2(0)\| e^{-\beta t}$$

for all $t > 0$ and some $\alpha, \beta > 0$. Contraction can be established by finding a Riemannian metric with respect to which nearby trajectories converge, which is a differential analog of a Lyapunov function. A nice property of a contracting dynamical system is that if it is time-invariant, a unique equilibrium exists and it possesses a certain level of robustness. Moreover, contraction can also be linked to monotone operators, i.e. a system is contracting w.r.t. a constant (state-independent) metric $Q$ if and only if the operator $-f$ is strongly monotone w.r.t. the $Q$-weighted inner product. We collect some directly relevant results from systems theory in Appendix C.2.

4. MAIN RESULTS

This section contains the main theoretical results of the paper: conditions implying well-posedness and Lipschitz-boundedness of equilibrium networks, and direct (unconstrained) parameterizations such that these conditions are automatically satisfied.

Assumption 1. The activation function $\sigma$ is monotone and slope-restricted in $[0, 1]$, i.e.,

$$0 \le \frac{\sigma(x) - \sigma(y)}{x - y} \le 1, \quad \forall x, y \in \mathbb{R},\ x \ne y. \tag{2}$$

Remark 1. We will show below (Proposition 1 in Section 4.2) that Assumption 1 is equivalent to the assumption on $\sigma$ in Winston & Kolter (2020), i.e. that $\sigma(\cdot) = \mathrm{prox}_f^1(\cdot)$ for some proper convex function $f$. However, the above assumption is arguably more natural, since it is easily verified for standard activation functions. Note also that if different channels have different activation functions, then we simply require that they all satisfy (2).

The following conditions are central to our results on well-posedness and Lipschitz bounds:

Condition 1. There exists a $\Lambda \in \mathbb{D}_+$, with $\mathbb{D}_+$ denoting diagonal positive-definite matrices, such that $W$ satisfies

$$2\Lambda - \Lambda W - W^\top \Lambda \succ 0. \tag{3}$$

Condition 2. Given a prescribed Lipschitz bound $\gamma > 0$, there exists $\Lambda \in \mathbb{D}_+$ such that $W, W_o, U$ satisfy

$$2\Lambda - \Lambda W - W^\top \Lambda - \frac{1}{\gamma} W_o^\top W_o - \frac{1}{\gamma} \Lambda U U^\top \Lambda \succ 0. \tag{4}$$

Remark 2. Note that Condition 2 implies Condition 1 since $\frac{1}{\gamma}(W_o^\top W_o + \Lambda U U^\top \Lambda) \succeq 0$. As a partial converse, if Condition 1 holds, then for any $W_o, U$ there exists a sufficiently large $\gamma$ such that Condition 2 is satisfied.

The main theoretical results of this paper are the following:

Theorem 1. If Assumption 1 and Condition 1 hold, then the equilibrium network (1) is well-posed, i.e. for all $x$ and $b_z$, equation (1) admits a unique solution $z$. Moreover, it has a finite Lipschitz bound from $x$ to $y$.

Theorem 2. If Assumption 1 and Condition 2 hold, then the equilibrium network (1) is well-posed and has a Lipschitz bound of $\gamma$.

As a consequence, we call (1) a Lipschitz bounded equilibrium network (LBEN) if its weights satisfy either (3) or (4).
The full proofs appear in Appendices E.1 and E.2, but here we sketch some of the main ideas. We can represent (1) as an algebraic interconnection between linear and nonlinear parts:

$$v = Wz + Ux + b_z, \qquad z = \sigma(v), \qquad y = W_o z + b_y. \tag{5}$$

It can be shown that for any pair of solutions to the nonlinear part $z_a = \sigma(v_a)$, $z_b = \sigma(v_b)$, if we define $\Delta_v = v_a - v_b$ and $\Delta_z = z_a - z_b$, then Assumption 1 implies the following:

$$\langle \Delta_v - \Delta_z, \Delta_z \rangle_\Lambda \ge 0 \tag{6}$$

for any $\Lambda \in \mathbb{D}_+$. This and Condition 1 can be used to prove global stability of a unique equilibrium of the differential equation $\dot v = -v + W\sigma(v) + Ux + b_z$, which proves there is a unique solution to (1) for any inputs. Next, straightforward manipulations of Condition 2 show that any pairs of inputs $x_a, x_b$ and outputs $y_a, y_b$ satisfy the following, where $\Delta_x = x_a - x_b$ and $\Delta_y = y_a - y_b$:

$$\gamma \|\Delta_x\|_2^2 - \frac{1}{\gamma} \|\Delta_y\|_2^2 \ge 2 \langle \Delta_v - \Delta_z, \Delta_z \rangle_\Lambda \ge 0,$$

where the last inequality comes from (6). This implies the Lipschitz bound $\|\Delta_y\|_2 \le \gamma \|\Delta_x\|_2$.

Remark 3. In Fazlyab et al. (2019) it was claimed that (6) holds with a richer (more powerful) class of multipliers $\Lambda$ which were previously introduced for robust stability analysis of systems with repeated nonlinearities (Chu & Glover, 1999; D'Amato et al., 2001; Kulkarni & Safonov, 2002). However, this is not true: a counterexample was given in Pauli et al. (2020), and here we provide a brief explanation: even if the nonlinearities $\sigma(v_i)$ are repeated when considered as functions of $v_i$, their increments $\Delta_{z_i} = \sigma(v_i + \Delta_{v_i}) - \sigma(v_i)$ are not repeated when considered as functions of $\Delta_{v_i}$, since they depend on the particular $v_i$, which generally differs between units.

Example 1. We illustrate the extra flexibility of Condition 1 compared to the condition of Winston & Kolter (2020) by a toy example. Consider $W \in \mathbb{R}^{2 \times 2}$ and take a slice near $W = 0$ of the form

$$W = \begin{bmatrix} 0 & W_{12} \\ 0 & W_{22} \end{bmatrix}, \quad \text{for which we have} \quad 2I - W - W^\top = \begin{bmatrix} 2 & -W_{12} \\ -W_{12} & 2 - 2W_{22} \end{bmatrix}. \tag{7}$$

By Sylvester's criterion, this matrix is positive-definite if and only if $W_{22} < 1$ and $\det(2I - W - W^\top) = 4(1 - W_{22}) - W_{12}^2 > 0$, which defines a parabolic region in the $W_{12}, W_{22}$ plane. Applying our condition (3), without loss of generality take $\Lambda = \mathrm{diag}(1, \alpha)$ with $\alpha > 0$ and we have

$$2\Lambda - \Lambda W - W^\top \Lambda = \begin{bmatrix} 2 & -W_{12} \\ -W_{12} & 2\alpha - 2\alpha W_{22} \end{bmatrix}.$$

The positivity test is now $W_{22} < 1$ and $4\alpha(1 - W_{22}) - W_{12}^2 > 0$. For each $W_{12}$ there is a sufficiently large $\alpha$ such that the second condition is satisfied, since the first implies $1 - W_{22} > 0$. Hence the only constraint on $W$ is that $W_{22} < 1$, which yields a much larger region in the $W_{12}, W_{22}$ plane (see Figure 1). Interestingly, in this simple example with ReLU activation, the condition $W_{22} < 1$ is also a necessary condition for well-posedness (El Ghaoui et al., 2019, Theorem 2.8).
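Example 1 can be checked numerically. The following sketch picks one illustrative point, $W_{12} = 3$, $W_{22} = 0$ (values chosen here, not taken from the paper), which lies outside the parabolic region for $\Lambda = I$ but inside the region allowed by a diagonal metric; the helper names are hypothetical.

```python
import numpy as np

def lmi_matrix(W, lam):
    """Return 2*Lam - Lam W - W^T Lam for Lam = diag(lam)."""
    Lam = np.diag(lam)
    return 2 * Lam - Lam @ W - W.T @ Lam

def is_positive_definite(M):
    # M is symmetric here, so eigvalsh applies.
    return bool(np.all(np.linalg.eigvalsh(M) > 0))

W12, W22 = 3.0, 0.0                      # illustrative values only
W = np.array([[0.0, W12], [0.0, W22]])

# Lambda = I: determinant test 4(1 - W22) - W12^2 = 4 - 9 < 0, so it fails.
holds_identity = is_positive_definite(lmi_matrix(W, [1.0, 1.0]))

# Lambda = diag(1, alpha) with alpha = 3: 4*3*(1 - 0) - 9 = 3 > 0, so it holds.
holds_scaled = is_positive_definite(lmi_matrix(W, [1.0, 3.0]))
```

The same weight matrix is rejected by the identity metric and accepted once the second diagonal entry of $\Lambda$ is scaled up, which is the extra freedom the example describes.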

4.1. DIRECT PARAMETERIZATION FOR UNCONSTRAINED OPTIMIZATION

Training a network that satisfies Condition 1 or 2 can be formulated as an optimization problem with convex constraints. In fact, Condition 1 is a linear matrix inequality (LMI) in the variables $\Lambda$ and $\Lambda W$, from which $W$ can be determined uniquely. Similarly, via Schur complement, Condition 2 is an LMI in the variables $\Lambda$, $\Lambda W$, $\Lambda U$, $W_o$, and $\gamma$, from which all network weights can be determined. In a certain theoretical sense LMI constraints are tractable (Nesterov & Nemirovskii (1994) proved they are polynomial-time solvable); however, even for moderate-scale networks (e.g. ≤ 100 activations) the associated barrier terms or projections become a major computational bottleneck. In this paper, we propose a direct parameterization that allows learning via unconstrained optimization, i.e. all network parameters are transformations of free (unconstrained) matrix variables, in such a way that the LMI constraints (3) or (4) are automatically satisfied.

For well-posedness, i.e. Condition 1, we parameterize via the following free variables: $V \in \mathbb{R}^{n \times n}$, $d \in \mathbb{R}^n$, and a skew-symmetric matrix $S = -S^\top \in \mathbb{R}^{n \times n}$, from which the hidden unit weight is

$$W = I - \Psi(V^\top V + \epsilon I + S), \tag{8}$$

where $\Psi = \mathrm{diag}(e^d)$ and $\epsilon > 0$ is some small constant to ensure strict positive-definiteness. Then it follows from straightforward manipulations that Condition 1 holds with $\Lambda = \Psi^{-1}$ if and only if $W$ can be written as (8). When $\Psi = I$, this reduces to the parameterization of Winston & Kolter (2020).

Similarly, for a specific Lipschitz bound, i.e. Condition 2, we add to the parameterization the free input and output weights $U$ and $W_o$, and an arbitrary $\gamma > 0$. We can construct

$$W = I - \Psi\left( \frac{1}{2\gamma} W_o^\top W_o + \frac{1}{2\gamma} \Psi^{-1} U U^\top \Psi^{-1} + V^\top V + \epsilon I + S \right), \tag{9}$$

for which (4) is automatically satisfied. Again, it can easily be verified that this construction is necessary and sufficient, i.e. any $W$ satisfying (4) can be constructed via (9).
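The parameterizations (8) and (9) can be sketched in a few lines of NumPy. This is an illustrative check rather than the training code used for the experiments: the function name `lben_hidden_weight`, the random test sizes, the value of the small constant, and the choice to obtain the skew-symmetric part as `S_free - S_free.T` are all assumptions made here.

```python
import numpy as np

def lben_hidden_weight(V, d, S_free, eps=1e-3, U=None, W_o=None, gamma=None):
    """Build W from free variables via (8), or via (9) when gamma is given."""
    n = V.shape[0]
    Psi = np.diag(np.exp(d))              # Psi = diag(e^d), positive diagonal
    S = S_free - S_free.T                 # assumed way to obtain a skew-symmetric S
    M = V.T @ V + eps * np.eye(n) + S
    if gamma is not None:                 # extra terms appearing in (9)
        Psi_inv = np.diag(np.exp(-d))
        M = M + (W_o.T @ W_o + Psi_inv @ U @ U.T @ Psi_inv) / (2.0 * gamma)
    return np.eye(n) - Psi @ M

rng = np.random.default_rng(0)
n, d_in, p, gamma = 5, 3, 2, 2.0          # hypothetical sizes and bound
V = rng.standard_normal((n, n))
d = rng.standard_normal(n)
S_free = rng.standard_normal((n, n))
U = rng.standard_normal((n, d_in))
W_o = rng.standard_normal((p, n))
Lam = np.diag(np.exp(-d))                 # Lambda = Psi^{-1}

# Left-hand side of LMI (3) for the well-posed parameterization (8).
W = lben_hidden_weight(V, d, S_free)
lmi3 = 2 * Lam - Lam @ W - W.T @ Lam
min_eig_3 = np.linalg.eigvalsh(lmi3).min()

# Left-hand side of LMI (4) for the Lipschitz parameterization (9).
W_lip = lben_hidden_weight(V, d, S_free, U=U, W_o=W_o, gamma=gamma)
lmi4 = (2 * Lam - Lam @ W_lip - W_lip.T @ Lam
        - (W_o.T @ W_o + Lam @ U @ U.T @ Lam) / gamma)
min_eig_4 = np.linalg.eigvalsh(lmi4).min()
```

In both cases the left-hand side collapses to $2(V^\top V + \epsilon I)$, so the smallest eigenvalue is at least $2\epsilon$ regardless of the values of the free variables; this is why no projection or barrier is needed.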

4.2. MONOTONE OPERATOR PERSPECTIVE

In this section, we will show that finding the solution to LBEN (1) is equivalent to solving a well-posed operator splitting problem, and hence a unique solution exists. First, we need the following observation on the activation function $\sigma$.

Proposition 1. Assumption 1 holds if and only if there exists a convex proper function $f : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$ such that $\sigma(\cdot) = \mathrm{prox}_f^1(\cdot)$.

The proof of Proposition 1 with a construction of $f$ appears in Appendix E.3, along with a list of $f$ for popular $\sigma$. It is well known in monotone operator theory (Ryu & Boyd, 2016) that for any convex closed proper function $f$, the proximal operator $\mathrm{prox}_f^1(x)$ is monotone and non-expansive (i.e. slope-restricted in $[0, 1]$). Proposition 1 is a converse result for scalar functions.

Remark 4. To our knowledge Proposition 1 is novel; however, for several popular activation functions the corresponding functions $f$ were computed in Li et al. (2019) (see also Table 3 in Appendix E.4). Compared with Li et al. (2019), our work gives a necessary and sufficient condition.

Now we connect LBEN (1) to an operator splitting problem.

Proposition 2. Finding a solution of LBEN (1) is equivalent to solving the well-posed operator splitting problem $0 \in (A + B)(z)$ with the operators

$$A(z) = (I - W)z - (Ux + b_z), \qquad B = \partial \mathbf{f} \tag{10}$$

where $\mathbf{f}(z) := \sum_{i=1}^n \lambda_i f(z_i)$ with $\lambda_i$ as the $i$th diagonal element of $\Lambda$.

The proof appears in Appendix E.4, and Theorem 1 follows directly since the above operator splitting problem has a unique solution for any $x, b_z$.

Computing an equilibrium. There exist various operator splitting algorithms to compute the solution of LBEN (1), e.g., ADMM (Boyd et al., 2011) and Peaceman-Rachford splitting (Kellogg, 1969). Winston & Kolter (2020) found that Peaceman-Rachford splitting converges very rapidly when properly tuned, and our experience agrees with this.

Gradient backpropagation. As shown in Winston & Kolter (2020, Section 3.5), the gradients of the loss function $\ell(\cdot)$ can be represented by

$$\frac{\partial \ell}{\partial (\cdot)} = \frac{\partial \ell}{\partial z_\star} (I - JW)^{-1} J \frac{\partial (W z_\star + Ux + b_z)}{\partial (\cdot)} \tag{11}$$

where $z_\star$ denotes the solution of (1), $(\cdot)$ denotes some learnable parameters in the parameterization (8) or (9), and $J \in D\sigma(W z_\star + Ux + b_z)$ with $D\sigma$ as the Clarke generalized Jacobian of $\sigma$. Since $\sigma$ is piecewise differentiable, the set $D\sigma(W z_\star + Ux + b_z)$ is a singleton almost everywhere. The following proposition reveals that (11) is well-defined; see the proof in Appendix E.5.

Proposition 3. The matrix $I - JW$ is invertible for all $z_\star$, $x$ and $b_z$.
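The structure of (11) can be illustrated on a toy problem. The sketch below applies the same implicit-differentiation step to the sensitivity of the equilibrium with respect to the input, which gives $(I - JW)^{-1} J U$, and compares it against central finite differences. The weights, the ReLU activation, the plain fixed-point solver, and the function names are assumptions chosen so that the example is small and the active set is away from the ReLU kink.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def solve_equilibrium(W, U, b_z, x, iters=500):
    """Plain fixed-point iteration; adequate here because ||W||_2 < 1."""
    z = np.zeros(W.shape[0])
    for _ in range(iters):
        z = relu(W @ z + U @ x + b_z)
    return z

def implicit_jacobian(W, U, b_z, x):
    """Sensitivity of the equilibrium to x: (I - J W)^{-1} J U."""
    z_star = solve_equilibrium(W, U, b_z, x)
    v = W @ z_star + U @ x + b_z
    J = np.diag((v > 0).astype(float))     # ReLU slope, 0 or 1 per unit
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) - J @ W, J @ U)

W = np.array([[0.2, -0.3], [0.1, 0.4]])    # hypothetical toy weights
U = np.eye(2)
b_z = np.zeros(2)
x = np.array([1.0, -2.0])                  # second unit is inactive at this input

jac = implicit_jacobian(W, U, b_z, x)

# Central finite differences, one input coordinate per column.
h = 1e-6
jac_fd = np.column_stack([
    (solve_equilibrium(W, U, b_z, x + h * e)
     - solve_equilibrium(W, U, b_z, x - h * e)) / (2 * h)
    for e in np.eye(2)
])
```

Only one linear solve with $I - JW$ is needed per backward pass, which is why invertibility of this matrix (Proposition 3) matters in practice.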

4.3. CONNECTIONS TO CONVEX OPTIMIZATION

Since LBEN (1) is equivalent to an operator splitting problem, an interesting question is whether it can further be connected to a convex optimization problem. Here we construct an equivalent convex problem for the LBEN whose parameterization satisfies $S = 0$.

Proposition 4. If the direct parameterization (either (8) or (9)) of an LBEN satisfies $S = 0$, then for all $x$ and $b_z$, the solution of (1) is the minimizer of the following strongly convex optimization problem:

$$\min_z \ \left\langle \tfrac{1}{2}(I - W)z - Ux - b_z,\ z \right\rangle_\Lambda + \mathbf{f}(z). \tag{12}$$

The proof is in Appendix E.6. Furthermore, for an important subclass of LBEN where $\sigma$ is ReLU, it has an equivalent convex quadratic programming (QP) formulation.

Proposition 5. Consider an LBEN (1) with ReLU activation. For all $x$ and $b_z$, the solution of (1) is the minimizer of the following strongly convex QP problem:

$$\min_z \ \tfrac{1}{2} z^\top H z + p^\top z \quad \text{s.t.} \quad z \ge 0, \quad (I - W)z \ge Ux + b_z \tag{13}$$

where $H = 2\Lambda - \Lambda W - W^\top \Lambda$ and $p = -\Lambda(Ux + b_z)$.

Note that the QP (13) also works for the case where $S$ is non-zero. The proof (see Appendix E.7) is built on the "key insights" of ReLU activation from Raghunathan et al. (2018b). This allows one to compute the solution of LBEN (1) using the many free or commercial QP solvers.
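A short consistency check helps to see why the QP (13) encodes the ReLU equilibrium. The derivation below is a sketch under the definitions of $H$ and $p$ given with (13), and is not a substitute for the proof in Appendix E.7.

```latex
\begin{align*}
\tfrac{1}{2} z^\top H z + p^\top z
  &= z^\top \Lambda (I - W) z - z^\top \Lambda (Ux + b_z) \\
  &= \sum_{i=1}^{n} \lambda_i \, z_i \, \big[(I - W) z - Ux - b_z\big]_i .
\end{align*}
% On the feasible set, z_i >= 0 and [(I - W) z - Ux - b_z]_i >= 0, and lambda_i > 0,
% so the objective is nonnegative and equals zero exactly when
%   z_i [ z - (W z + Ux + b_z) ]_i = 0  for every i.
% Together with feasibility, this complementarity condition is equivalent to
%   z = max(0, W z + Ux + b_z),
% i.e. equation (1) with ReLU activation.
```

In other words, the objective measures the violation of complementarity between $z$ and $z - (Wz + Ux + b_z)$, and the equilibrium is the feasible point at which it vanishes.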

4.4. CONTRACTING NEURAL ODES

In this section, we will prove the existence of a solution to (1) from a different perspective: by showing it is the equilibrium of a contracting dynamical system (a "neural ODE"). We first add a smooth state $v(t) \in \mathbb{R}^n$ to avoid the algebraic loop in (5). This idea has long been recognized as helpful for well-posedness questions (Zames, 1964). We define the dynamics of $v(t)$ by the following ODE:

$$\dot v(t) = -v(t) + W z(t) + Ux + b_z, \qquad z(t) = \sigma(v(t)). \tag{14}$$

The well-posedness of (1) is equivalent to the existence and uniqueness of an equilibrium of (14) for all $x$ and $b_z$, which is established by the following proposition.

Proposition 6. If Assumption 1 and Condition 1 hold, then the neural ODE (14) is contracting w.r.t. some constant metric $P \succ 0$.

The proof is in Appendix E.8. Moreover, the metric $P$ can be found via semidefinite programming. The above proposition also proves that the nonlinear operator $-f$ with $f(v) = -v + W\sigma(v) + Ux + b_z$, zeros of which define solutions of LBEN (1), is actually monotone w.r.t. the $P$-weighted inner product, which gives a first-order cutting-plane oracle for the zero location $v_\star$ such that $f(v_\star) = 0$. That is, given a test point $v_t \ne v_\star$, it proves that $v_\star$ is in the half-space defined by $\langle v_\star - v_t, f(v_t) \rangle_P > 0$. This may offer alternative ways to solve LBEN (1), e.g. via Nemirovski (2004); Nesterov (2007).
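The neural ODE (14) can be simulated directly. The sketch below uses forward-Euler integration on the slice from Example 1 with $W_{12} = 3$, $W_{22} = 0.5$, a weight matrix with spectral norm larger than one that nevertheless satisfies Condition 1 (e.g. with $\Lambda = \mathrm{diag}(1, 5)$). The step size, horizon, input, and function name are assumptions picked for illustration; the paper itself computes equilibria by operator splitting.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def integrate_neural_ode(W, U, b_z, x, h=0.1, steps=500):
    """Forward-Euler integration of dv/dt = -v + W relu(v) + U x + b_z, v(0) = 0."""
    v = np.zeros(W.shape[0])
    for _ in range(steps):
        v = v + h * (-v + W @ relu(v) + U @ x + b_z)
    return v

W = np.array([[0.0, 3.0], [0.0, 0.5]])     # slice from Example 1, W22 < 1
U = np.eye(2)
b_z = np.zeros(2)
x = np.array([1.0, 1.0])

v_end = integrate_neural_ode(W, U, b_z, x)
z_end = relu(v_end)                         # candidate solution of (1)
residual = np.linalg.norm(z_end - relu(W @ z_end + U @ x + b_z))

# Condition 1 with Lambda = diag(1, 5): 4*5*(1 - 0.5) - 9 = 1 > 0.
Lam = np.diag([1.0, 5.0])
min_eig = np.linalg.eigvalsh(2 * Lam - Lam @ W - W.T @ Lam).min()
```

For this input the trajectory settles at $v = z = (7, 2)$, which can be confirmed by substituting into (1).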

4.5. FEEDFORWARD NETWORKS AS A SPECIAL CASE

Consider a multi-layer feedforward network of the form

$$z_1 = U_0 x + b_0, \qquad z_{\ell+1} = \sigma(W_\ell z_\ell + b_\ell), \quad \ell = 1, \ldots, L-1, \qquad y = W_L z_L + b_L, \tag{15}$$

which can be rewritten as an equilibrium network (1) as shown in Appendix A. The above equilibrium network is obviously well-posed as a unique solution exists. The following proposition shows that (44) is also an LBEN.

Proposition 7. The LBEN parameterization (8) contains all feedforward networks.

In Winston & Kolter (2020), a set of well-posed equilibrium networks, called monotone operator equilibrium networks (MON), is introduced via the following parameterization:

$$W = (1 - m)I - A^\top A + B - B^\top \tag{16}$$

where $m > 0$ is a hyper-parameter and $A, B$ are learnable matrices. The MON parameterization can be understood as a special case of LBEN with fixed $\Psi = I$.

Proposition 8. The MON parameterization (16) does not contain all feedforward networks, and if $m \ge 1$ it does not contain any feedforward networks.

In the feedforward case, our Lipschitz bound condition (4) is equivalent to the state-of-the-art bound estimation method in Fazlyab et al. (2019). The major benefit of our direct parameterization (9) is that it allows such bounds to be imposed during training without any additional computational cost. The details are given in Appendix D.
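The role of the diagonal metric for feedforward networks can be illustrated numerically. The sketch below is illustrative only and does not reproduce the construction in the appendix: it assumes the hidden layers are stacked into one vector so that the hidden weight $W$ is strictly block lower triangular, uses a hypothetical layer weight `W1` with spectral norm 3, and picks the block scaling `lam2` by a simple Schur-complement argument made here.

```python
import numpy as np

n1, n2 = 2, 2
W1 = 3.0 * np.eye(2)                        # hypothetical layer weight, ||W1||_2 = 3

# Assumed stacking: z = (z_1, z_2), so W is strictly block lower triangular.
W = np.block([[np.zeros((n1, n1)), np.zeros((n1, n2))],
              [W1,                 np.zeros((n2, n2))]])

def lmi3(W, Lam):
    """Left-hand side of (3): 2*Lam - Lam W - W^T Lam."""
    return 2 * Lam - Lam @ W - W.T @ Lam

# Identity metric (the Psi = I case): fails because ||W1||_2 >= 2.
min_eig_identity = np.linalg.eigvalsh(lmi3(W, np.eye(n1 + n2))).min()

# Block-scaled diagonal metric Lam = diag(I, lam2 * I) with lam2 * ||W1||_2^2 < 4.
lam2 = 2.0 / np.linalg.norm(W1, 2) ** 2
Lam = np.diag(np.concatenate([np.ones(n1), lam2 * np.ones(n2)]))
min_eig_scaled = np.linalg.eigvalsh(lmi3(W, Lam)).min()
```

Shrinking the metric weight on later layers compensates for large layer gains, so Condition 1 can be met for this $W$ even though the identity metric rejects it.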

5. EXPERIMENTS

In this section we test our approach on the MNIST and CIFAR-10 image classification problems. Our numerical experiments focus on model robustness, the trade-off between model performance and the Lipschitz constant, and the tightness of the Lipschitz bound. We compare the proposed LBEN to unconstrained equilibrium networks, the monotone operator equilibrium network (MON) of Winston & Kolter (2020), and fully connected networks trained using Lipschitz margin training (LMT) (Tsuzuku et al., 2018). When studying model robustness to adversarial attacks, we use the L2 Fast Gradient Method, implemented as part of the Foolbox toolbox (Rauber et al., 2020). All models are trained on either a standard desktop computer with an NVIDIA GeForce RTX 2080 graphics card or a Google Cloud instance with an NVIDIA Tesla V100 graphics card. Details of the models and training procedure can be found in Appendix F. All code will be made available online, but links are omitted due to the double-blind review process.

5.1. MNIST EXPERIMENTS WITH FULLY-CONNECTED NETWORKS

Figure 2a plots the test error versus the observed Lipschitz constant, computed via adversarial attack, for each of the models trained. We can see clearly that the parameter γ in LBEN offers a trade-off between test error and Lipschitz constant. Comparing LBEN$_{\gamma=5}$ with both MON and LBEN$_{\gamma<\infty}$, we also note a slight regularizing effect in the lower test error. By comparison, LMT (Tsuzuku et al., 2018) with c as a tunable regularization parameter displays a qualitatively similar trade-off, but underperforms LBEN in terms of both test error and robustness. If we examine the unconstrained equilibrium model, we observe a Lipschitz constant more than an order of magnitude higher, i.e. this model has regions of extremely high sensitivity, without gaining any accuracy in terms of test error. For the LBEN models, the lower and upper bounds on the Lipschitz constant are very close: the markers are very close to their corresponding lines in Figure 2a; see also the table of numerical results in Appendix A, in which the approximation accuracy is in many cases around 90%. Next, we tested robustness of classification accuracy to adversarial attacks of various sizes; the results are shown in Figure 2b and summarized in Table 1. We can clearly see that decreasing γ (i.e. stronger regularization) in the LBEN models results in a far more gradual degradation of performance as perturbation size increases, with only a mild impact on nominal (zero perturbation) test error. Next, we examined the impact of our parameterization on computational complexity compared to other equilibrium models. The test and training errors versus number of epochs are plotted in Figure 5, and we can see that all models converge similarly, and also take roughly the same amount of time per epoch. This is a clear contrast to the results of Pauli et al. (2020), in which imposing Lipschitz constraints resulted in a fifty-fold increase in training time.
Interestingly, we can also see in Figure 5 the effect of regularisation for LBEN with γ = 5: higher training error but lower test error. We have observed several cases where the unconstrained equilibrium model became unstable during training, whereas LBEN never exhibited this problem. Finally, we examined the quality of the Lipschitz bounds as a function of network size, comparing the upper and lower bounds on fully connected networks with widths from 20 to 1000. The results are shown in Figure 6. It can be observed that network size has only a mild effect on the quality of the Lipschitz bounds, which decreases slightly as width is increased by a factor of 50.

5.2. CIFAR-10 EXPERIMENTS WITH CONVOLUTIONAL NETWORKS

The previous example looked at simple fully connected networks; however, our approach can also be applied to structured layers such as convolutions. Here, we perform several experiments exploring the use of convolutional layers on the CIFAR-10 dataset. To study the improved expressibility we will compare the LBEN to the LBEN with its metric set to the identity, denoted LBEN$_{\Lambda=I}$. Note that the model set LBEN$_{\Lambda=I,\gamma<\infty}$ corresponds to the MON. Additional model details can be found in Appendix F.2. In Figure 3a, we have plotted the test performance versus the observed Lipschitz constant for the LBEN and LBEN$_{\Lambda=I}$ for varying Lipschitz bound γ = 1, 2, 3, 5, 50, along with the LBEN$_{\gamma<\infty}$, MON, and feed-forward convolutional networks with 40, 81, 160, and 200 channels. Again, we see that the Lipschitz bound has a regularizing effect, trading off between nominal fit and robustness. Additionally, we see that the LBEN provides both better performance and robustness than the traditional feed-forward convolutional networks of similar sizes, highlighting the benefit of the equilibrium network structure. Comparing LBEN and LBEN$_{\Lambda=I}$, we can see that the metric gives higher quality models for LBEN with specified γ, but it is slightly worse for LBEN$_{\gamma<\infty}$ compared to MON. This is likely due to the extra expressiveness of the model leading to some overfitting. This can also be seen in the training curves in Figure 7. Figure 3b shows the test error versus the size of adversarial perturbation for the LBEN and 162 channel feed-forward convolutional network. We observe that the LBEN provides a much more gradual loss in performance than the feed-forward network, with γ = 5 offering an excellent mix of nominal performance and robustness. The feed-forward networks of different sizes exhibited similar results; however, only one is plotted in Figure 3b for clarity.

6. CONCLUSIONS

In this paper, we have shown that the flexible framework of equilibrium networks can be made robust via a simple and direct parameterization which results in guaranteed Lipschitz bounds. These results can also be directly applied (as a special case) to standard multilayer and residual deep neural networks, and also provide a direct parameterization of nonlinear ODEs satisfying strong stability and robustness properties. Extension to equilibrium network structures more general than ( 1) is an interesting area for future research. Our results can be extended to more general multivariable "activations" if they can be described accurately via monotonicity properties or integral quadratic constraints. One particular example where this is possible is where the "activation" computes the arg min of a quadratic program of the sort that appears in constrained model predictive control (Heath & Wills, 2007) .

A EXPERIMENTAL RESULTS ON MNIST CHARACTER RECOGNITION

This appendix contains tables of results on MNIST and CIFAR-10 data sets.

Legend:
• Err: test error (%).
• $\|a\|_2$: $\ell_2$ norm of adversarial attack.
• $\gamma_{up}$: certified upper bound on Lipschitz constant (for models that provide one).
• $\gamma_{low}$: observed lower bound on Lipschitz constant via adversarial attack.
• γ approx: approximation ratio of Lipschitz constant as a percentage, $100 \times \gamma_{low} / \gamma_{up}$.

Models:
• LBEN: the proposed Lipschitz bounded equilibrium network.
• MON: the monotone operator equilibrium network of Winston & Kolter (2020).
• UNC: an unconstrained equilibrium network, i.e. W directly parameterized.
• LMT: Lipschitz Margin Training model as in Tsuzuku et al. (2018).
• Lip-NN: the Lipschitz Neural Network model of Pauli et al. (2020). Note these figures are as reported in Pauli et al. (2020).

C PRELIMINARIES C.1 MONOTONE OPERATORS WITH NON-EUCLIDEAN INNER PRODUCTS

We present some basic properties of monotone operators on a finite-dimensional Hilbert space $\mathcal{H}$, which we identify with $\mathbb{R}^n$ equipped with a weighted inner product $\langle x, y\rangle_Q = y^\top Q x$ with $Q \succ 0$. For $n = 1$, we only consider the case $Q = 1$. The induced norm is $\|x\|_Q = \sqrt{\langle x, x\rangle_Q}$. A relation or operator is a set-valued or single-valued map defined by a subset of the space $A \subseteq \mathcal{H}\times\mathcal{H}$; we use the notation $A(x) = \{y \mid (x, y) \in A\}$. If $A(x)$ is a singleton, we call $A$ a function. Some commonly used operators include: the linear operator $A(x) = \{(x, Ax) \mid x \in \mathcal{H}\}$; the operator sum $A + B = \{(x, y + z) \mid (x, y) \in A, (x, z) \in B\}$; the inverse operator $A^{-1} = \{(y, x) \mid (x, y) \in A\}$; and the subdifferential operator $\partial f = \{(x, \partial f(x))\}$ with $x \in \operatorname{dom} f$ and $\partial f(x) = \{g \in \mathcal{H} \mid f(y) \ge f(x) + \langle y - x, g\rangle_Q, \ \forall y \in \mathcal{H}\}$. An operator $A$ has Lipschitz constant $L$ if for any $(x, u), (y, v) \in A$,
$$\|u - v\|_Q \le L \|x - y\|_Q.$$

C.2 DYNAMICAL SYSTEM THEORY

In this section, we present some concepts and results of dynamical system theory that are used in this paper. We consider a nonlinear system of the form
$$\dot z(t) = f(z(t)) \tag{26}$$
where $z(t) \in \mathbb{R}^n$ is the state, and the function $f$ is assumed to be Lipschitz continuous. By Picard's existence theorem, there is a unique solution for any initial condition. The above system is time-invariant since $f$ does not explicitly depend on $t$. System (26) is called a linear time-invariant (LTI) system if $f(z) = Az + b$ for some matrix $A \in \mathbb{R}^{n\times n}$ and $b \in \mathbb{R}^n$. The point $z_\star \in \mathbb{R}^n$ is called an equilibrium of (26) if $f(z_\star) = 0$. The central concern in dynamical system theory is stability. While there are many different stability notions (Khalil, 2002), here we mainly focus on two of them: exponential stability and contraction w.r.t. a constant metric $Q \succ 0$. System (26) is said to be locally exponentially stable at the equilibrium $z_\star$ w.r.t. the metric $Q$ if there exist positive constants $\alpha, \beta, \delta$ such that for any initial condition $z(0) \in B_\delta(z_\star) := \{z \mid \|z - z_\star\|_Q < \delta\}$, the following condition holds:
$$\|z(t) - z_\star\|_Q \le \alpha \|z(0) - z_\star\|_Q\, e^{-\beta t}, \quad \forall t > 0. \tag{27}$$
It is said to be globally exponentially stable if the above condition also holds for any $\delta > 0$. Exponential stability can be verified via Lyapunov's second method, i.e., finding a Lyapunov function $V = \|z - z_\star\|_P^2$ with $P \succ 0$ such that $\dot V(t) \le -2\beta V(t)$ along the solutions, i.e.,
$$(z - z_\star)^\top P f(z) + f(z)^\top P (z - z_\star) + 2\beta (z - z_\star)^\top P (z - z_\star) \le 0. \tag{28}$$
System (26) is said to be contracting w.r.t. the metric $Q$ if there exist positive constants $\alpha, \beta$ such that for any pair of solutions $z_1(t)$ and $z_2(t)$, we have
$$\|z_1(t) - z_2(t)\|_Q \le \alpha \|z_1(0) - z_2(0)\|_Q\, e^{-\beta t}, \quad \forall t > 0. \tag{29}$$
Note that contraction is a much stronger notion than global exponential stability, as Condition (27) is implied by Condition (29) by setting $z_1 = z$ and $z_2 = z_\star$.
However, unlike Lyapunov analysis, contraction analysis can be done via simple local analysis which does not require any prior knowledge about the equilibrium $z_\star$. Specifically, contraction can be established by the local exponential stability of the associated differential system defined by
$$\dot\Delta_z = Df(z)\Delta_z$$
where $\Delta_z(t)$ is the infinitesimal variation between $z(t)$ and its neighborhood solutions, and $Df$ is the Clarke generalized Jacobian. The condition for (26) to be contracting can be represented as a state-dependent Linear Matrix Inequality (LMI) as follows:
$$P\, Df(z) + Df(z)^\top P + 2\beta P \prec 0 \tag{30}$$
for some $P \succ 0$ and all $z \in \mathbb{R}^n$. For an LTI system, exponential stability and contraction are equivalent and the stability condition is satisfied if $A$ is Hurwitz stable (i.e. all eigenvalues of $A$ have strictly negative real part). For most applications, the dynamical system involves an external input $x(t) \in \mathbb{R}^m$ and an output $y(t) \in \mathbb{R}^p$, whose state-space representation takes the form
$$\dot z(t) = f(z(t), x(t)), \quad y(t) = h(z(t), x(t)). \tag{31}$$
Here we measure the robustness of the above system under input perturbation by the incremental $L_2$-gain. That is, system (31) has an incremental $L_2$-gain bound of $\gamma$ if for any pair of inputs $x_1(\cdot), x_2(\cdot)$ with $\int_0^T \|x_1(t) - x_2(t)\|_2^2\, dt < \infty$ for all $T > 0$, and any initial conditions $z_1(0)$ and $z_2(0)$, the solutions of (31) exist and satisfy
$$\int_0^T \|y_1(t) - y_2(t)\|_2^2\, dt \le \gamma^2 \int_0^T \|x_1(t) - x_2(t)\|_2^2\, dt + \kappa(z_1(0), z_2(0)) \tag{32}$$
for some function $\kappa(z_1, z_2) \ge 0$ with $\kappa(z, z) = 0$. Note that $\gamma$ can be viewed as a Lipschitz bound of all the mappings defined by (31), for some initial condition, from the input signal $x(\cdot)$ to $y(\cdot)$. For any two constant inputs $x_1, x_2$, let $z_1, z_2$ and $y_1, y_2$ be the corresponding equilibria and steady-state outputs, respectively. From (32) we have $\|y_1 - y_2\|_2^2 \le \gamma^2 \|x_1 - x_2\|_2^2 + \kappa(z_1, z_2)/T$, which implies a Lipschitz bound of $\gamma$ as $T \to \infty$.
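The steady-state argument can be spelled out in one line. The following is a short sketch, assuming both trajectories are started at their respective equilibria so that inputs, states and outputs are constant in time:

```latex
\begin{aligned}
T\,\|y_1 - y_2\|_2^2
  &= \int_0^T \|y_1 - y_2\|_2^2 \, dt
   \;\le\; \gamma^2 \int_0^T \|x_1 - x_2\|_2^2 \, dt + \kappa(z_1, z_2)
   = \gamma^2\, T\, \|x_1 - x_2\|_2^2 + \kappa(z_1, z_2), \\
\|y_1 - y_2\|_2^2
  &\le \gamma^2 \|x_1 - x_2\|_2^2 + \frac{\kappa(z_1, z_2)}{T}
   \;\xrightarrow[T \to \infty]{}\; \gamma^2 \|x_1 - x_2\|_2^2 .
\end{aligned}
```

Taking square roots gives $\|y_1 - y_2\|_2 \le \gamma \|x_1 - x_2\|_2$, i.e. the incremental gain bound is also a Lipschitz bound for the static input-output map.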
A particular class of nonlinear systems that have strong connections to various neural networks is the so-called Luré system, which takes the form
$$\dot z(t) = Az(t) + B\phi(Cz(t)) \tag{33}$$
where $A, B, C$ are constant matrices of appropriate size, and $\phi$ is a static nonlinearity with sector bound $[\alpha, \beta]$: for all pairs $(v, w)$ with $w = \phi(v)$,
$$(w - \alpha v)^\top (\beta v - w) \ge 0 \tag{34}$$
or equivalently
$$\begin{bmatrix} v \\ w \end{bmatrix}^\top \Pi \begin{bmatrix} v \\ w \end{bmatrix} \ge 0 \quad \text{with} \quad \Pi = \begin{bmatrix} -2\alpha\beta I & (\alpha+\beta) I \\ (\alpha+\beta) I & -2I \end{bmatrix}. \tag{35}$$
This implies that the origin is an equilibrium since $\phi(0) = 0$. The above system can be viewed as a feedback interconnection of a linear system
$$G: \quad \dot z(t) = Az(t) + Bw(t), \quad v(t) = Cz(t) \tag{36}$$
and a nonlinear memoryless component $w(t) = \phi(v(t))$. The above linear system can also be described by a transfer function $G(s)$ with $s \in \mathbb{C}$. We refer to Hespanha (2018) for details about frequency-domain concepts and results of linear systems. The frequency-domain representation of the sector bound condition (34) can be written as
$$\begin{bmatrix} \hat v(j\omega) \\ \hat w(j\omega) \end{bmatrix}^* \Pi \begin{bmatrix} \hat v(j\omega) \\ \hat w(j\omega) \end{bmatrix} \ge 0 \quad \forall \omega \in \mathbb{R} \tag{37}$$
where $\hat v(j\omega)$ and $\hat w(j\omega)$ are the Fourier transforms of $v$ and $w$, respectively, and $(\cdot)^*$ denotes the complex conjugate. Then, the closed-loop stability of the feedback interconnection can be verified by the Integral Quadratic Constraint (IQC) theorem (Megretski & Rantzer, 1997). Although the IQC framework allows for more general dynamic multipliers, here we only focus on the simple constant multiplier defined in (35).

Theorem 3. Let $G$ be stable and $\phi$ be a static nonlinearity with sector bound $[\alpha, \beta]$. The feedback interconnection of $G$ and $\phi$ is stable if there exists $\epsilon > 0$ such that
$$\begin{bmatrix} G(j\omega) \\ I \end{bmatrix}^* \Pi \begin{bmatrix} G(j\omega) \\ I \end{bmatrix} \preceq -\epsilon I, \quad \forall \omega \in \mathbb{R}. \tag{38}$$

The Kalman-Yakubovich-Popov (KYP) lemma (Rantzer, 1996) can be applied to demonstrate the equivalence of Condition (38) in Theorem 3 to an LMI condition. The result is stated as follows.

Theorem 4. There exists an $\epsilon > 0$ such that (38) holds if and only if there exists a matrix $P = P^\top$ such that
$$\begin{bmatrix} A^\top P + PA & PB \\ B^\top P & 0 \end{bmatrix} + \begin{bmatrix} C & 0 \\ 0 & I \end{bmatrix}^\top \Pi \begin{bmatrix} C & 0 \\ 0 & I \end{bmatrix} \prec 0.$$
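The equivalence between the sector condition (34) and the quadratic form (35) follows by direct expansion. A short check, assuming the upper-left block of $\Pi$ is $-2\alpha\beta I$ (the sign needed for the two statements to agree):

```latex
\begin{aligned}
\begin{bmatrix} v \\ w \end{bmatrix}^\top
\begin{bmatrix} -2\alpha\beta I & (\alpha+\beta) I \\ (\alpha+\beta) I & -2I \end{bmatrix}
\begin{bmatrix} v \\ w \end{bmatrix}
  &= -2\alpha\beta\, v^\top v + 2(\alpha+\beta)\, v^\top w - 2\, w^\top w \\
  &= 2\,(w - \alpha v)^\top (\beta v - w).
\end{aligned}
```

For the slope-restricted activations used in this paper the sector is $[0, 1]$, so the quadratic form reduces to $2\, w^\top (v - w) \ge 0$.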

D LBEN PARAMETERIZATION FOR FEEDFORWARD NETWORKS

Given an equilibrium network (1) with weights $U$, $W$, and $W_o$, we can estimate its Lipschitz bound $\gamma$ by solving the following SDP with $(n + 1)$ decision variables:
$$\min_{\gamma > 0,\ \Lambda \in \mathbb{D}_+} \gamma \quad \text{s.t.} \quad \begin{bmatrix} 2\Lambda - \Lambda W - W^\top \Lambda & -\Lambda U & W_o^\top \\ -U^\top \Lambda & \gamma I & 0 \\ W_o & 0 & \gamma I \end{bmatrix} \succ 0. \tag{39}$$
Note that the above LMI constraint is equivalent to (4) via Schur complement. A tight upper bound is then obtained by minimizing $\gamma$. When a deep neural network (a special case of equilibrium network) is considered, the above SDP yields the same bound estimate as LipSDP-Neuron in Fazlyab et al. (2019), since both formulations involve minimizing the gain bound $\gamma$ subject to an equivalent constraint (41). Training a feedforward network with a prescribed Lipschitz bound is a challenging problem due to the LMI constraint (39) as well as the sparse structure of $W$. Following the similar idea of direct parameterization, we will construct a parameterization built on (9) to represent the following weight:
$$W = \begin{bmatrix} 0 & & & \\ W_1 & \ddots & & \\ \vdots & \ddots & 0 & \\ 0 & \cdots & W_{L-1} & 0 \end{bmatrix}. \tag{40}$$
We first look at a simple case where $W$ is a dense strictly lower triangular matrix. Given a square matrix $H$, its LDU partition is defined as $H = [H]_D + [H]_L + [H]_U$ where $[H]_D$ is a diagonal matrix and $[H]_L$ ($[H]_U$) is a strictly lower (upper) triangular matrix. Given any hyper-parameter $\gamma > 0$, the parameterization contains the following free variables: $V \in \mathbb{R}^{n\times n}$, $W_o \in \mathbb{R}^{p\times n}$, and $\tilde U \in \mathbb{R}^{n\times d}$. Let $S = [H]_L - [H]_L^\top$, $\Psi = [H]_D^{-1}$ and $U = \Psi \tilde U$, where $H = V^\top V + \epsilon I + (W_o^\top W_o + \tilde U \tilde U^\top)/2\gamma$. Then, the LBEN parameterization (9) yields
$$W = I - \Psi\left(\frac{1}{2\gamma} W_o^\top W_o + \frac{1}{2\gamma} \Psi^{-1} U U^\top \Psi^{-1} + V^\top V + \epsilon I + S\right) = -2[H]_D^{-1}[H]_L,$$
which is a dense strictly lower triangular matrix. To impose the sparsity pattern in (40), we need
$$H = \begin{bmatrix} \Lambda_1 & H_1^\top & & & \\ H_1 & \Lambda_2 & H_2^\top & & \\ & H_2 & \Lambda_3 & H_3^\top & \\ & & \ddots & \ddots & \ddots \\ & & H_{L-2} & \Lambda_{L-1} & H_{L-1}^\top \\ & & & H_{L-1} & \Lambda_L \end{bmatrix}$$
where $\Lambda_i$ belongs to $\mathbb{D}_+$ for $1 \le i \le L$, and $H_j$ has the same dimension as $W_j$ for $1 \le j \le L - 1$.
To make $V^\top V$ have the same band structure as $H$, we further parameterize $V$ as follows:
$$V = \begin{bmatrix} \Gamma_1 & & & \\ \Phi_1 V_1 & \Gamma_2 & & \\ & \ddots & \ddots & \\ & & \Phi_{L-1} V_{L-1} & \Gamma_L \end{bmatrix}$$
where $\Gamma_i, \Phi_j \in \mathbb{D}_+$ and $V_j^\top V_j = I$. The unitary matrix $V_j$ can be parameterized by $V_j = e^{S_j}$ where $S_j = -S_j^\top$. The diagonal blocks of $V^\top V$ are $\Gamma_i^2 + \Phi_i^2$ with $\Phi_L = 0$, while the lower off-diagonal blocks are $\Gamma_{j+1}\Phi_j V_j$ with $1 \le j \le L - 1$. Similar techniques can be applied to the parameterization of $W_o$ and $U$.
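The dense strictly lower triangular case described above can be checked numerically. The following is a minimal pure-Python sketch, not the authors' implementation: the helper names (`build_H`, `weight_general`, `weight_closed_form`), the value of `eps`, and the example matrices are all illustrative assumptions. It builds $H$ from the free variables, forms $W = I - \Psi(H + S)$ with $\Psi = [H]_D^{-1}$ and $S = [H]_L - [H]_L^\top$, and compares the result with the closed form $-2[H]_D^{-1}[H]_L$.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def build_H(V, W_o, U_tilde, gamma, eps):
    """H = V^T V + eps*I + (W_o^T W_o + U_tilde U_tilde^T) / (2*gamma)."""
    n = len(V)
    VtV = matmul(transpose(V), V)
    WtW = matmul(transpose(W_o), W_o)
    UUt = matmul(U_tilde, transpose(U_tilde))
    return [[VtV[i][j] + (eps if i == j else 0.0)
             + (WtW[i][j] + UUt[i][j]) / (2.0 * gamma)
             for j in range(n)] for i in range(n)]


def weight_general(H):
    """W = I - Psi (H + S), Psi = [H]_D^{-1}, S = [H]_L - [H]_L^T."""
    n = len(H)
    S = [[H[i][j] if j < i else (-H[j][i] if j > i else 0.0)
          for j in range(n)] for i in range(n)]
    return [[(1.0 if i == j else 0.0) - (H[i][j] + S[i][j]) / H[i][i]
             for j in range(n)] for i in range(n)]


def weight_closed_form(H):
    """W = -2 [H]_D^{-1} [H]_L, which is strictly lower triangular."""
    n = len(H)
    return [[-2.0 * H[i][j] / H[i][i] if j < i else 0.0
             for j in range(n)] for i in range(n)]


# Illustrative free variables (n = 3 neurons, p = 1 output, d = 2 inputs).
V = [[1.0, 0.5, 0.0], [0.2, 1.0, 0.3], [0.0, 0.4, 1.0]]
W_o = [[1.0, -1.0, 0.5]]
U_tilde = [[0.3, 0.1], [0.0, 0.2], [0.5, -0.4]]

H = build_H(V, W_o, U_tilde, gamma=2.0, eps=1e-3)
W_general = weight_general(H)
W_closed = weight_closed_form(H)
```

Because $H$ is symmetric, the upper-triangular part of $H + S$ cancels and the diagonal of $\Psi(H + S)$ is the identity, which is why both routes give the same strictly lower triangular $W$.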

E PROOFS E.1 PROOF OF THEOREM 1

We present two proofs for the well-posedness of the equilibrium network (1). Both proofs are based on the following lemma.

Lemma 1 (Simpson-Porco & Bullo (2014)). For a time-invariant contracting dynamical system, all its solutions converge to a unique equilibrium.

(Monotone operator perspective): This proof is mainly based on Proposition 2, which states that the solution of (1) is also a zero of the operator splitting problem $0 \in (A + B)(z)$, where the operators $A$ and $B$ are given in (10). Condition 1 implies that the operator $A$ is strongly monotone, while Assumption 1 implies that the operator $B$ is maximal monotone. Furthermore, the Cayley operator $C_A$ is contractive and $C_B$ is non-expansive. Thus, applying the Peaceman-Rachford algorithm to $0 \in (A + B)(z)$ yields a contracting discrete-time system (24), since $C_A C_B$ is a contractive operator. Since (24) is time-invariant, it yields a unique solution $z$ for any $x$ and $b_z$.

(Neural ODE perspective): This proof is built on Proposition 6, which states that the neural ODE (14) is a contracting continuous-time dynamical system under Assumption 1 and Condition 1. For any fixed input $x$ and $b_z$, system (14) is also time-invariant and hence its solution converges to a unique equilibrium, which is also the solution of (1).

We now prove the Lipschitz boundedness of a well-posed equilibrium network. Condition 1 implies that there exists a constant $\epsilon > 0$ such that $2\Lambda - \Lambda W - W^\top \Lambda \succeq \epsilon I$. For any $\delta \in (0, \epsilon)$ and weights $W_o, U$, we can find a sufficiently large but finite $\gamma$ such that
$$\frac{1}{\gamma}\left(W_o^\top W_o + \Lambda U U^\top \Lambda\right) \preceq (\epsilon - \delta) I.$$
Then, Condition 2 holds for $\Lambda$ and $\gamma$ since
$$2\Lambda - \Lambda W - W^\top \Lambda - \frac{1}{\gamma}\left(W_o^\top W_o + \Lambda U U^\top \Lambda\right) \succeq \delta I \succ 0.$$
From Theorem 2, $\gamma$ is a Lipschitz bound for the well-posed equilibrium network (1).
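The phrase "sufficiently large but finite $\gamma$" can be made explicit. As a brief worked step (the symbol $M$ is introduced here only for illustration), write $M = W_o^\top W_o + \Lambda U U^\top \Lambda \succeq 0$; then any $\gamma$ at or above the following threshold works:

```latex
\frac{1}{\gamma} M \preceq (\epsilon - \delta) I
\quad\Longleftrightarrow\quad
\lambda_{\max}(M) \le \gamma\,(\epsilon - \delta)
\quad\Longleftrightarrow\quad
\gamma \ge \frac{\lambda_{\max}\!\left(W_o^\top W_o + \Lambda U U^\top \Lambda\right)}{\epsilon - \delta}.
```

The bound obtained this way is generally conservative; it only shows that a finite Lipschitz bound exists whenever Condition 1 holds.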

E.2 PROOF OF THEOREM 2

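Rearranging Eq. (4) yields
$$2\Lambda - \Lambda W - W^\top \Lambda \succ \frac{1}{\gamma}\left(W_o^\top W_o + \Lambda U U^\top \Lambda\right) \succeq 0.$$
The well-posedness of the equilibrium network (1) follows by Theorem 1. To obtain the Lipschitz bound, we first apply Schur complement to (4):
$$\begin{bmatrix} 2\Lambda - \Lambda W - W^\top \Lambda - \frac{1}{\gamma} W_o^\top W_o & -\Lambda U \\ -U^\top \Lambda & \gamma I \end{bmatrix} \succ 0.$$
Left-multiplying by $\begin{bmatrix} \Delta_z^\top & \Delta_x^\top \end{bmatrix}$ and right-multiplying by $\begin{bmatrix} \Delta_z^\top & \Delta_x^\top \end{bmatrix}^\top$ gives
$$2\Delta_z^\top \Lambda \Delta_z - 2\Delta_z^\top \Lambda W \Delta_z - \frac{1}{\gamma} \Delta_z^\top W_o^\top W_o \Delta_z - 2\Delta_z^\top \Lambda U \Delta_x + \gamma \|\Delta_x\|_2^2 \ge 0.$$
Since (5) implies $\Delta_v = W\Delta_z + U\Delta_x$ and $\Delta_y = W_o \Delta_z$, the above inequality is equivalent to
$$\gamma \|\Delta_x\|_2^2 - \frac{1}{\gamma} \|\Delta_y\|_2^2 \ge 2\Delta_z^\top \Lambda \Delta_v - 2\Delta_z^\top \Lambda \Delta_z = 2\langle \Delta_v - \Delta_z, \Delta_z\rangle_\Lambda.$$
Then, the Lipschitz bound of $\gamma$ for the equilibrium network (1) follows by (6).

E.3 PROOF OF PROPOSITION 1

(if): It is well-known that if $f$ is a convex closed proper function, then $\mathrm{prox}^1_f$ is monotone and non-expansive, i.e., it is slope-restricted in $[0, 1]$. Here $f$ need not be closed, as $\operatorname{dom} f$ (i.e. the range of $\sigma$) could be an open interval $(z_l, z_r)$ or a half-open interval $(z_l, z_r]$ or $[z_l, z_r)$. This can be resolved by defining $\hat f$ as the restriction of $f$ to the closed interval $[\hat z_l, \hat z_r]$, and then letting $\hat z_l \to z_l$ and $\hat z_r \to z_r$.

(only if): Assumption 1 implies that $\sigma$ is a non-decreasing and piecewise differentiable function on $\mathbb{R}$. Then, the range of $\sigma$ is an interval, denoted by $Z$. We will construct the derivative function $f'$ on $Z$ first and then integrate it to obtain $f$. Let $\{z_j \in Z\}_{j \in \mathbb{Z}}$ be the sequence containing all points such that either $\sigma'(x^-) = 0$ or $\sigma'(x^+) = 0$ for all $x \in \sigma^{-1}(z_j)$. Note that $\sigma^{-1}(z)$ is a singleton for

The KYP Lemma (Theorem 4) states that (42) is equivalent to the existence of a $P = P^\top$ such that
$$\begin{bmatrix} -2P & PW \\ W^\top P & 0 \end{bmatrix} + \begin{bmatrix} 0 & \Lambda \\ \Lambda & -2\Lambda \end{bmatrix} \prec 0.$$
It is clear from the upper-left block that $P \succ 0$. The above inequality also implies
$$2\langle -\Delta_v + W\Delta_z, \Delta_v\rangle_P \le \langle \Delta_z - \Delta_v, \Delta_z\rangle_\Lambda - \epsilon\left(\|\Delta_z\|_2^2 + \|\Delta_v\|_2^2\right) \le -\epsilon\left(\|\Delta_z\|_2^2 + \|\Delta_v\|_2^2\right)$$
for some $\epsilon > 0$. The contraction property of the neural ODE (14) follows since
$$\frac{d}{dt}\|\Delta_v\|_P^2 = 2\langle -\Delta_v + W\Delta_z, \Delta_v\rangle_P \le -\epsilon\left(\|\Delta_z\|_2^2 + \|\Delta_v\|_2^2\right) \le -2\beta \|\Delta_v\|_P^2$$
for some sufficiently small $\beta > 0$.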
As a byproduct of the above inequality, the operator $-f$ with $f(v) = -v + W\sigma(v) + Ux + b_z$ is strictly monotone w.r.t. the $P$-weighted inner product, since
$$\langle -f(v_a) + f(v_b), v_a - v_b\rangle_P = \langle \Delta_v - W\Delta_z, \Delta_v\rangle_P \ge \beta \|\Delta_v\|_P^2.$$

E.9 PROOF OF LEMMA 2

Note that (42) is equivalent to
$$2\Lambda - G_0(j\omega)\Lambda W - G_0(-j\omega) W^\top \Lambda \succeq \mu I \tag{43}$$
where $G_0(j\omega) = \frac{1}{1 + j\omega}$. For some $\omega \in (\mathbb{R} \cup \infty)$, let $g = \Re\, G_0(j\omega) = \Re\, G_0(-j\omega)$, where $\Re$ denotes the real part. It is easy to verify that $g = 1/(\omega^2 + 1) \in [0, 1]$. From (3) we have
$$2g\Lambda - g\Lambda W - gW^\top \Lambda \succeq g\epsilon I$$
for some $\epsilon > 0$. Rearranging the above inequality yields
$$2\Lambda - g\Lambda W - gW^\top \Lambda \succeq g\epsilon I + (1 - g)2\Lambda.$$
Now, since $g \in [0, 1]$, the right-hand side is a convex combination of two positive definite matrices, $\epsilon I$ and $2\Lambda$; therefore (43) holds for some $\mu > 0$ and all $\omega \in (\mathbb{R} \cup \infty)$.
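The claim that $g = \Re\, G_0(j\omega) = 1/(\omega^2 + 1)$ lies in $[0, 1]$ is easy to confirm numerically. A small sketch using Python's built-in complex numbers (the function names and the sample frequencies are illustrative choices):

```python
def G0(omega):
    """Scalar transfer function G0(j*omega) = 1 / (1 + j*omega)."""
    return 1.0 / complex(1.0, omega)


def real_gain(omega):
    """g = Re G0(j*omega), which should equal 1 / (omega^2 + 1)."""
    return G0(omega).real


omegas = [0.0, 0.5, 1.0, 10.0, 1e3]
gains = [real_gain(w) for w in omegas]
```

At $\omega = 0$ the gain is 1 and (43) reduces to Condition 1; as $\omega \to \infty$ the gain tends to 0 and the right-hand side of the rearranged inequality tends to $2\Lambda$.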

E.10 PROOF OF PROPOSITION 7

It is straightforward to verify that an equilibrium network with the following weights is identical to the feedforward network (15):
$$z = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_L \end{bmatrix}, \quad W = \begin{bmatrix} 0 & & & \\ W_1 & \ddots & & \\ \vdots & \ddots & 0 & \\ 0 & \cdots & W_{L-1} & 0 \end{bmatrix}, \quad U = \begin{bmatrix} U_0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad W_o = \begin{bmatrix} 0 & \cdots & 0 & W_L \end{bmatrix}. \tag{44}$$
To construct an LBEN parameterization in the form (8) for $W$, we first need the following lemma.

Lemma 3. Condition 1 holds for any strictly lower triangular $W$.

Proof. We prove it by showing that for any $\delta > 0$, there exists a $\Lambda \in \mathbb{D}_+$ such that
$$H(\Lambda_n, W_n) := \Lambda_n (I - W_n) + (I - W_n)^\top \Lambda_n \succeq 2^{2-n}\delta I, \tag{45}$$
where $\Lambda_n, W_n$ are the upper left $n \times n$ blocks of $\Lambda, W$, respectively. For $n = 1$, $\lambda_1 > \delta$ is sufficient since $W_1 = 0$. Assuming that (45) holds for $\Lambda_n$ and $W_n$, we have
$$H(\Lambda_{n+1}, W_{n+1}) - 2^{1-n}\delta I = \begin{bmatrix} H(\Lambda_n, W_n) - 2^{1-n}\delta I & -\Lambda_n w_{n+1} \\ -w_{n+1}^\top \Lambda_n & 2(\lambda_{n+1} - 2^{-n}\delta) \end{bmatrix}, \tag{46}$$
where $\Lambda_{n+1} = \operatorname{diag}(\Lambda_n, \lambda_{n+1})$ and $W_{n+1} = \begin{bmatrix} W_n & 0 \\ w_{n+1}^\top & 0 \end{bmatrix}$. By applying Schur complement to (46), Inequality (45) holds for the case of $n + 1$ if $\lambda_{n+1} > 2^{-n}\delta + 2^{n-2}|\Lambda_n w_{n+1}|^2/\delta$.

Based on the above lemma, we can construct a $V$ such that $V^\top V = \frac{1}{2}[\Lambda(I - W) + (I - W)^\top \Lambda] - \epsilon I$ where $\epsilon = 2^{1-n}\delta$. By choosing $\Psi = \Lambda^{-1}$ and $S = (\Lambda W - W^\top \Lambda)/2$, the LBEN parameterization (8) recovers the exact $W$. Thus, LBEN contains all feedforward networks (44). We note that "skip connections" as in a residual network can easily be added to the above structure via additional non-zero blocks in the lower-left part of the weight $W$.

In Winston & Kolter (2020), Peaceman-Rachford is used and the operator $I - W$ can be quickly inverted using the fast Fourier transform. The situation is more complicated in our case, as the term $W_{out}^\top W_{out}$ cannot be represented as a strict convolution and so is not diagonalized by the Fourier matrix. Instead, we apply the Forward-Backward Splitting algorithm shown in equation 23, which does not require a matrix inversion.
We have observed that the rate of convergence of the Forward-Backward splitting algorithm is highly dependent on the monotonicity parameter m. In particular, for the convolutional models, we found there was a strong trade-off between the ease of solve for the equilibrium versus the model expressibility and the accuracy of the Lipschitz bound.



Note that S can be parameterized via its upper or lower triangular components, or via S = N -N T with N free, which can be more straightforward if W is defined implicitly via linear operators, e.g. convolutions.
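The second option, $S = N - N^\top$ with $N$ free, can be illustrated with a few lines of pure Python. This is a minimal sketch with illustrative names and values; in the convolutional setting $N$ would be a linear operator rather than an explicit matrix.

```python
def skew_from_free(N):
    """Return S = N - N^T for a square matrix N given as a list of rows."""
    n = len(N)
    return [[N[i][j] - N[j][i] for j in range(n)] for i in range(n)]


def quadratic_form(S, x):
    """Compute x^T S x."""
    n = len(x)
    return sum(x[i] * S[i][j] * x[j] for i in range(n) for j in range(n))


# N is unconstrained; any values give a valid skew-symmetric S.
N = [[1.0, 2.0, -0.5], [0.3, -1.0, 4.0], [2.5, 0.0, 0.7]]
S = skew_from_free(N)
```

Since $x^\top S x = 0$ for every $x$, the skew-symmetric term contributes nothing to the symmetric part of a matrix it is added to, so it can be learned without any constraint.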



Figure 1: Valid coefficient ranges for Example 1. Gray region: the condition from Winston & Kolter (2020) is feasible: $2I - W - W^\top \succ 0$. White region (including gray region): our well-posedness condition is feasible: $\exists \Lambda \in \mathbb{D}_+ : 2\Lambda - \Lambda W - W^\top \Lambda \succ 0$. Black region: neither condition feasible.

the proof (see Appendix E.11). The set of feedforward networks in MON shrinks as the hyperparameter m increases. Most experiments in Winston & Kolter (2020) use m = 1, which excludes all feedforward networks.

(a) Nominal test error vs Lipschitz constant estimates: markers indicate observed lower bounds for all methods, vertical lines indicate certified upper bounds for LBEN (b) Test error with adversarial perturbation versus size of adversarial perturbation. Lower is better.

Figure 2: Image classification results on MNIST character recognition data set.

(a) Nominal test error vs observed lower bound on Lipschitz constant. (b) Test error with adversarial perturbation versus size of adversarial perturbation. Lower is better.

Figure 3: Image classification results on CIFAR-10 data set.

Figure 7: LBEN and MON training error versus epochs on CIFAR-10 dataset. The red curves have the metric set so that Λ = I whereas the blue curves optimize over the metric. The line styles correspond to different gain bounds. Note that both MON and LBEN γ<∞ achieve zero training error.

All other figures are calculated by the authors of the present paper.

Results from MNIST experiments.

Figure 5: Left: Training set error versus epochs. Right: Test set error versus epochs. Note that the left and right plots are on different scales. The times per epoch for the MON, unconstrained, LBEN$_{\gamma<\infty}$ and LBEN$_{\gamma=5}$ networks are 14.4, 16.1, 14.9 and 14.8 seconds, respectively.

Figure 6: Approximation accuracy of the Lipschitz bound versus the network width of LBEN from the MNIST example. The certified upper bound is $\gamma_{up}$ and the observed lower bound is $\gamma_{low}$.

Table columns: Err: $\|a\|_2 \le 0.5$ | Err: $\|a\|_2 \le 1.0$ | $\gamma_{up}$ | $\gamma_{low}$ | $\gamma$ approx. First row label: LBEN.

Results from CIFAR experiments. FF refers to the feed-forward convolutional network.


An operator $A$ is non-expansive if $L = 1$ and contractive if $L < 1$. An operator $A$ is monotone if
$$\langle u - v, x - y\rangle_Q \ge 0, \quad \forall (x, u), (y, v) \in A. \tag{18}$$
It is strongly monotone with parameter $m$ if
$$\langle u - v, x - y\rangle_Q \ge m \|x - y\|_Q^2, \quad \forall (x, u), (y, v) \in A.$$
A monotone operator $A$ is maximal monotone if no other monotone operator strictly contains it, which is a property required for the convergence of most fixed point iterations. Specifically, an affine operator $A(x) = Wx + b$ is (maximal) monotone if and only if $QW + W^\top Q \succeq 0$ and strongly monotone if $QW + W^\top Q \succeq mI$. A subdifferential $\partial f$ is maximal monotone if and only if $f$ is a convex closed proper function.

The resolvent and Cayley operators for an operator $A$ are denoted $R_A$ and $C_A$ and respectively defined as
$$R_A = (I + \alpha A)^{-1}, \quad C_A = 2R_A - I$$
for any $\alpha > 0$. When $A(x) = Wx + b$, then
$$R_A(x) = (I + \alpha W)^{-1}(x - \alpha b),$$
and when $A = \partial f$ for some CCP function $f$, then the resolvent is given by a proximal operator
$$R_A(x) = \mathrm{prox}^\alpha_f(x) := \arg\min_z \tfrac{1}{2}\|x - z\|_Q^2 + \alpha f(z).$$
The resolvent and Cayley operators are non-expansive for any maximal monotone $A$, and are contractive for strongly monotone $A$. Operator splitting methods consider finding a zero in a sum of operators (assumed here to be maximal monotone), i.e., find $z$ such that $0 \in (A + B)(z)$. For example, the convex optimization problem in (12) can be formulated as an operator splitting problem with $A(z) = (I - W)z - b$ and $B = \partial f$. Proposition 2 shows that $A$ is strongly monotone and Lipschitz with some parameters $m$ and $L$. Some popular operator splitting methods for this problem are as follows.

• Forward-backward splitting:
$$z^{k+1} = R_B\left(z^k - \alpha A(z^k)\right). \tag{23}$$
• Peaceman-Rachford splitting:
$$u^{k+1} = C_A C_B(u^k), \quad z^k = R_B(u^k). \tag{24}$$
• Douglas-Rachford splitting (or ADMM):
$$u^{k+1} = \tfrac{1}{2}\left(I + C_A C_B\right)(u^k), \quad z^k = R_B(u^k).$$

A sufficient condition for forward-backward splitting to converge is $\alpha < 2m/L^2$. The Peaceman-Rachford and Douglas-Rachford methods converge for any $\alpha > 0$, although the convergence speed will often vary substantially based upon $\alpha$.

Table 3: A list of common activation functions $\sigma(x)$ and associated convex proper $f(z)$ whose proximal operator is $\sigma(x)$. For $z \notin \operatorname{dom} f$, we have $f(z) = \infty$.
In the case of Softplus activation, $\mathrm{Li}_s(z)$ is the polylogarithm function.

Then, we define $f'$ as follows:

Without loss of generality, we assume that $0 \in Z$ and $\sigma^{-1}(0)$ is well-defined. We define the function $f$ as follows:

otherwise, where $C$ is an arbitrary constant. Note that $f$ is a convex function as $f'$ is a piecewise differentiable function on $Z$ and for those points where

Furthermore, since $\sigma$ is well-defined, we can conclude that $f$ is bounded from below. We also provide a list of $f$ for common activation functions in Table 3. A similar list can also be found in Li et al. (2019).

E.4 PROOF OF PROPOSITION 2

Similar to Winston & Kolter (2020), we first show that the solution of (1), if it exists, is a fixed point of the forward-backward iteration (23) with $\alpha = 1$:

...

Note that the necessary condition for $\sigma(\cdot)$ to be diagonal is that the weight $\Lambda$ is positive diagonal.

Now we prove the well-posedness of LBEN by showing that the operator splitting problem $0 \in (A + B)(z)$ has a unique solution for any $x$ and $b_z$. Both Conditions 1 and 2 imply that the operator $A$ is strongly monotone and its Cayley operator $C_A$ is contractive. Then, the Peaceman-Rachford iteration (24) is contracting and hence it converges to a unique fixed point.
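To make the fixed-point view concrete, the following is a minimal pure-Python sketch of a forward-backward iteration for a tiny equilibrium network. It rests on simplifying assumptions that are not the general LBEN setting: the metric is $\Lambda = I$, the activation is ReLU (whose proximal step is ReLU itself for any step size), and the weights, the step size `alpha`, and the function names are illustrative. The vector `c` stands for $Ux + b_z$.

```python
def relu(v):
    return [max(x, 0.0) for x in v]


def affine(W, z, c):
    """Compute W z + c for W given as a list of rows."""
    return [sum(W[i][j] * z[j] for j in range(len(z))) + c[i]
            for i in range(len(c))]


def forward_backward_solve(W, c, alpha=0.5, tol=1e-10, max_iter=10_000):
    """Iterate z <- ReLU((1 - alpha) z + alpha (W z + c)) until it stalls.

    This is z <- R_B(z - alpha * A(z)) with A(z) = (I - W) z - c and
    B the subdifferential whose resolvent is ReLU (assumes Lambda = I).
    """
    z = [0.0] * len(c)
    for k in range(max_iter):
        q = affine(W, z, c)
        z_next = relu([(1.0 - alpha) * zi + alpha * qi
                       for zi, qi in zip(z, q)])
        if max(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next, k + 1
        z = z_next
    raise RuntimeError("forward-backward iteration did not converge")


# Small example: 2 hidden units, W chosen so that I - W is strongly monotone.
W = [[0.1, -0.3], [0.2, 0.0]]
c = [1.0, -0.5]  # plays the role of U x + b_z
z_star, iterations = forward_backward_solve(W, c)
residual = max(abs(a - b) for a, b in zip(z_star, relu(affine(W, z_star, c))))
```

The returned `z_star` satisfies the implicit equation $z = \mathrm{ReLU}(Wz + Ux + b_z)$ up to the tolerance; no matrix inversion is needed, which is the property exploited for the convolutional models.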

E.5 PROOF OF PROPOSITION 3

The matrix $J$ is diagonal with elements in $[0, 1]$. Decompose $\Lambda = \Pi(J + \mu I)$ for some small $\mu > 0$, i.e. $\Pi = \Lambda(J + \mu I)^{-1}$, which is diagonal and positive-definite. By denoting $H = \Pi(I - W) + (I - W)^\top \Pi$ we obtain the following inequality from (3):

which can be rearranged as

Since $2\Pi(I - J) \succeq 0$, we can choose a sufficiently small $\mu$ such that

which further implies that $I - JW$ is strongly monotone w.r.t. the $\Pi$-weighted inner product, and is therefore invertible.

E.6 PROOF OF PROPOSITION 4

First, we show that (12) is strongly convex. Since $f(z)$ is a conic combination of convex functions $f(z_i)$, we only need to show that the quadratic term is strongly convex, i.e.,

where $\Lambda$

which follows by either Condition 1 or 2. Moreover, since $S = 0$ for the direct parameterization of $W$, we have $\Lambda(I - W) = (I - W)^\top \Lambda$ and hence $\partial J = A$. Then, finding the global minimizer of the strongly convex optimization problem (12) is equivalent to finding a zero of the operator splitting problem $0 \in \partial(J + f)(z) = (A + B)(z)$.

E.7 PROOF OF PROPOSITION 5

The proof is based on the "key insights" of ReLU activation from Raghunathan et al. (2018b). That is, a ReLU constraint $z = \max(x, 0)$ is equivalent to the following three linear and quadratic constraints between $z$ and $x$: (i) $z(z - x) = 0$, (ii) $z \ge x$, and (iii) $z \ge 0$. From this observation, an equilibrium network (1) can be equivalently expressed as the following constraints: (I) $z(z - q) = 0$, (II) $z \ge q$, and (III) $z \ge 0$, where $q = Wz + Ux + b$. Note that (II) and (III) can be rewritten as the linear constraints in the QP problem (13), while (I) is equivalent to $J(z) = 0$ with

for any $\Lambda \in \mathbb{D}_+$. It is obvious that $J(z) \ge 0$ for all $z$ satisfying (II) and (III), and hence the solution of (1) is a global minimizer of the QP problem (13). If $\Lambda$ satisfies either Condition 1 or 2, then $H$ is positive-definite and (13) is a strongly convex QP problem. Thus, its global minimizer is unique, which is also the solution of LBEN (1).
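The scalar equivalence between $z = \max(x, 0)$ and constraints (i) to (iii) can be checked directly. A small illustrative sketch (the function name and tolerance are assumptions made for this example):

```python
def satisfies_relu_constraints(z, x, tol=1e-12):
    """Check (i) z (z - x) = 0, (ii) z >= x, (iii) z >= 0 for scalars."""
    complementarity = abs(z * (z - x)) <= tol   # (i)
    above_input = z >= x - tol                  # (ii)
    nonnegative = z >= -tol                     # (iii)
    return complementarity and above_input and nonnegative


samples = [-2.0, -0.5, 0.0, 0.5, 3.0]
relu_ok = [satisfies_relu_constraints(max(x, 0.0), x) for x in samples]
# Counter-examples: each violates exactly one of the three constraints.
violates_i = satisfies_relu_constraints(2.0, 1.0)     # z (z - x) = 2 != 0
violates_ii = satisfies_relu_constraints(0.0, 1.0)    # z < x
violates_iii = satisfies_relu_constraints(-1.0, -1.0) # z < 0
```

Constraint (i) forces $z \in \{0, x\}$; (ii) and (iii) then select the larger of the two, which is exactly $\max(x, 0)$.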

E.8 PROOF OF PROPOSITION 6

From (14), the dynamics of $\Delta_v$ and $\Delta_z$ can be formulated as a feedback interconnection of a linear system $\dot\Delta_v = -\Delta_v + W\Delta_z$ and a static nonlinearity. The linear system can be represented by the transfer function $G(s) = \frac{1}{s+1}W$. The nonlinear component can be rewritten as $\Delta_z = \Phi(v_a, v_b)\Delta_v$ where $\Phi$ is a diagonal matrix with each $\Phi_{ii} \in [0, 1]$. For the nonlinear component $\Phi$, its input and output signals satisfy the quadratic constraint (6). For the linear system $G$, we have the following lemma.

Lemma 2. If Condition 1 holds, then for all $\omega \in \{\mathbb{R} \cup \infty\}$

E.11 PROOF OF PROPOSITION 8

From the MON parameterization (16) we have

Let $\mathcal{W}_m$ be the set of non-zero and strictly lower triangular

$\succeq 0$ for all $m_2 < m_1$. Proposition 8 follows if $\lim_{m \to 0} \mathcal{W}_m$ does not contain all strictly lower triangular $W$. Since $W$ is strictly lower triangular, $H(0, W)$ is a semidefinite matrix whose diagonal elements are 2. As the norm of $W$ increases, $H(0, W)$ becomes indefinite. Taking the feedforward network (44) with $L = 2$ as an example, the set $\mathcal{W}_0$ is characterized by

Now we show that $\mathcal{W}_m = \emptyset$ for all $m \ge 1$. Since the diagonal elements of $H(m, W)$ are non-positive when $m \ge 1$, the matrix $H(m, W)$ is not semi-definite for any strictly lower triangular $W$.

The models in the MNIST example are all fully connected models with 80 hidden neurons and ReLU activations. For the equilibrium models, the forward and backward passes are performed using the Peaceman-Rachford iteration scheme with $\epsilon = 1$ and a tolerance of $1 \times 10^{-2}$. When evaluating the models, we decrease the tolerance of the splitting method to $1 \times 10^{-4}$. We use the same $\alpha$ tuning procedure as Winston & Kolter (2020). All models were trained using the same initial point. Note that for LBEN, this requires initializing the metric $\Lambda = I$.

The feed-forward models trained using Lipschitz margin training were trained using the original author's code, which can be found at https://github.com/ytsmiling/lmt.

F.2 CIFAR-10 EXAMPLE

This section contains the model structures and the details of the training procedure used for the CIFAR-10 examples. All models are trained using the ADAM optimizer (Kingma & Ba, 2015) with an initial learning rate of $1 \times 10^{-3}$. The models were trained for 25 epochs and the learning rate was reduced by a factor of 10 after 15 epochs. Each model contains a single convolutional layer, an average pooling layer with kernel size 2, and a linear output layer.

The convolutional LBEN has 81 channels and is parametrized as discussed below. The MON similarly has 81 channels. Unless otherwise stated, the feed-forward convolutional network has 162 channels, which gives it approximately the same number of parameters as the LBEN.

The MON was evaluated using the Peaceman-Rachford iteration scheme.
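The step schedule described above can be written down explicitly. A minimal sketch, assuming zero-indexed epochs and an initial learning rate of $10^{-3}$ (the exponent sign is read from context); the function and constant names are illustrative and not taken from the authors' training code:

```python
NUM_EPOCHS = 25


def learning_rate(epoch, base_lr=1e-3, drop_epoch=15, factor=10.0):
    """Step schedule: base_lr for the first 15 epochs, base_lr / 10 after."""
    return base_lr if epoch < drop_epoch else base_lr / factor


# One learning rate per epoch, epochs numbered 0..24.
schedule = [learning_rate(epoch) for epoch in range(NUM_EPOCHS)]
```

With these settings the first 15 epochs use $10^{-3}$ and the remaining 10 epochs use $10^{-4}$.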

CONVOLUTIONAL LBEN

Following the approach of Winston & Kolter (2020), we parametrize $U$ and $V$ in equation 9 via convolutions. The skew-symmetric matrix is constructed by taking the skew-symmetric part of a convolution $\tilde S$, so that $S = \frac{1}{2}(\tilde S - \tilde S^\top)$. Similarly to Winston & Kolter (2020), we also find that using a weight-normalized parametrization improves performance. Specifically, we use the following parametrization:

