ESTIMATING LIPSCHITZ CONSTANTS OF MONOTONE DEEP EQUILIBRIUM MODELS

Abstract

Several methods have been proposed in recent years to provide bounds on the Lipschitz constants of deep networks, which can be used to provide robustness guarantees, generalization bounds, and characterizations of the smoothness of decision boundaries. However, existing bounds become substantially weaker with increasing depth of the network, which makes it unclear how to apply such bounds to recently proposed models such as the deep equilibrium (DEQ) model, which can be viewed as representing an infinitely-deep network. In this paper, we show that monotone DEQs, a recently proposed subclass of DEQs, have Lipschitz constants that can be bounded as a simple function of the strong-monotonicity parameter of the network. We derive simple-yet-tight bounds on both the input-output mapping and the weight-output mapping defined by these networks, and demonstrate that they are small relative to those for comparable standard DNNs. We show that one can use these bounds to design monotone DEQ models, even with e.g. multiscale convolutional structure, that still satisfy constraints on the Lipschitz constant. We also highlight how to use these bounds to develop PAC-Bayes generalization bounds that do not depend on the depth of the network, avoiding the exponential depth-dependence of comparable DNN bounds.

1. INTRODUCTION

Measuring the sensitivity of deep neural networks (DNNs) to changes in their inputs or weights is important in a wide range of applications. A standard way of measuring the sensitivity of a function $f$ is the Lipschitz constant of $f$: the smallest constant $L \in \mathbb{R}_+$ such that $\|f(x) - f(y)\|_2 \le L\|x - y\|_2$ for all inputs $x$ and $y$. While exact computation of the Lipschitz constant of DNNs is NP-hard (Virmaux & Scaman, 2018), bounds or estimates can be used to certify a network's robustness to adversarial input perturbations (Weng et al., 2018), encourage robustness during training (Tsuzuku et al., 2018), or serve as a complexity measure of the DNN (Bartlett et al., 2017), among other applications. An analogous Lipschitz constant that bounds the sensitivity of $f$ to changes in its weights can be used to derive generalization bounds for DNNs (Neyshabur et al., 2018). A growing number of methods for computing bounds on the Lipschitz constant of DNNs have been proposed in recent works, primarily based on semidefinite programs (Fazlyab et al., 2019; Raghunathan et al., 2018) or polynomial programs (Latorre et al., 2019). However, as the depth of the network increases, these bounds become either very loose or prohibitively expensive to compute. Additionally, they are typically not applicable to structured DNNs such as convolutional networks, which are in common use.

The deep equilibrium model (DEQ) (Bai et al., 2019) is an implicit-depth model which directly solves for the fixed point of an "infinitely-deep", weight-tied network. DEQs have been shown to perform as well as DNNs in domains such as computer vision (Bai et al., 2020) and sequence modelling (Bai et al., 2019), while avoiding the large memory footprint required by DNN training in order to backpropagate through a long computation chain.
Given that DEQs represent infinite-depth networks, however, their Lipschitz constants clearly cannot be bounded by existing methods, which are very loose even on networks of depth 10 or less. In this paper we take up the question of how to bound the Lipschitz constant of DEQs. In particular, we focus on monotone DEQs (monDEQs) (Winston & Kolter, 2020), a recently proposed class of DEQs which parameterizes the DEQ model in a way that guarantees the existence of a unique fixed point, which can be computed efficiently as the solution to a monotone operator splitting problem. We show that monDEQs, despite representing infinite-depth networks, have Lipschitz constants which can be bounded by a simple function of the strong-monotonicity parameter, the choice of which therefore directly influences the bound. We also derive a bound on the Lipschitz constant w.r.t. the weights of the monDEQ, from which we derive a deterministic PAC-Bayes generalization bound for the monDEQ by adapting the technique of Neyshabur et al. (2018). While such generalization bounds for DNNs are plagued by exponential dependence on network depth, the corresponding monDEQ bound does not involve any depth-like term. Empirically, we demonstrate that our Lipschitz bounds on fully-connected monDEQs trained on MNIST are small relative to those of comparable DNNs, even for DNNs of depth only 4. We show a similar trend for single- and multi-convolutional monDEQs as compared to the bounds on traditional CNNs computed by AutoLip and SeqLip (Virmaux & Scaman, 2018), the only existing methods for (even approximately) bounding CNN Lipschitz constants. Further, our monDEQ generalization bounds are comparable with bounds for DNNs of around depth 5, and avoid the exponential dependence on depth of those bounds. Finally, we also validate the significance of the small Lipschitz bounds for monDEQs by empirically demonstrating strong adversarial robustness on MNIST and CIFAR-10.

2. BACKGROUND AND RELATED WORK

Lipschitz constants of DNNs Lipschitz constants of DNNs were proposed as early as Szegedy et al. (2014) as a potential means of controlling adversarial robustness. The bound proposed in that work was the product of the spectral norms of the layers, which in practice is extremely loose. Virmaux & Scaman (2018) derive a tighter bound via a convex maximization problem; however, the bound is typically intractable and can only be approximated. Combettes & Pesquet (2019) bound the Lipschitz constant of DNNs by noting that the common nonlinearities employed as activation functions are averaged, nonexpansive operators; however, their method scales exponentially with the depth of the network. Zou et al. (2019) propose linear-program-based bounds specific to convolutional networks, which in practice are several orders of magnitude larger than empirical lower bounds. Upper bounds based on semidefinite programs, which relax the quadratic constraints imposed by the nonlinearities, are studied by Fazlyab et al. (2019); Raghunathan et al. (2018); Jin & Lavaei (2018). These bounds can be tight in practice but expensive to compute for deep networks; as such, Fazlyab et al. (2019) propose a sequence of SDPs which trade off computational complexity and accuracy. This allows us to compare our monDEQ bounds to their SDP bounds for networks of increasing depth (see Section 5). Latorre et al. (2019) show that the complexity of the optimization problems can be reduced by exploiting the sparsity of connections common to DNNs, but the resulting methods are still prohibitively expensive for deep networks.

DEQs and monotone DEQs An emerging focus of deep learning research is on implicit-depth models, typified by Neural ODEs (Chen et al., 2018) and deep equilibrium models (DEQs) (Bai et al., 2019; 2020).
Unlike traditional deep networks, which compute their output by sequential layer-wise computation, implicit-depth models simulate "infinite-depth" networks by specifying, and directly solving for, analytical conditions satisfied by their output. The DEQ model directly solves for the fixed point of an infinitely-deep, weight-tied and input-injected network, which would consist of the iteration $z^{i+1} = g(z^i, x)$, where $g$ represents a nonlinear layer computation applied repeatedly, $z^i$ is the activation at "layer" $i$, and $x$ is the network input, injected at each layer. Instead of iteratively applying the function $g$ (which indeed may not converge), the infinite-depth fixed point $z^* = g(z^*, x)$ can be found using a root-finding method. A key advantage of DEQs is that backpropagation through the fixed point can be performed analytically using the implicit function theorem; DEQ training therefore requires much less memory than DNN training, which must store intermediate layer activations for backpropagation. In standard DEQs, existence of a unique fixed point is not guaranteed, nor is stable convergence to a fixed point easy to obtain in practice. Monotone DEQs (monDEQs) (Winston & Kolter, 2020) improve upon this aspect by parameterizing the DEQ in a manner that guarantees the existence of a stable fixed point. Monotone operator theory provides a class of operator splitting methods which are guaranteed to converge linearly to the fixed point (see Ryu & Boyd (2016) for a primer). The monDEQ considers a weight-tied, input-injected network with iterations of the form

$$z^{(k+1)} = \sigma(W z^{(k)} + U x + b) \quad (1)$$

where $x \in \mathbb{R}^n$ is the input, $U \in \mathbb{R}^{h \times n}$ the input-injection weights, $z^{(k)} \in \mathbb{R}^h$ the hidden-unit activations at "layer" $k$, $W \in \mathbb{R}^{h \times h}$ the hidden-unit weights, $b \in \mathbb{R}^h$ a bias term, and $\sigma : \mathbb{R}^h \to \mathbb{R}^h$ an elementwise nonlinearity. The output of the monDEQ is defined as the fixed point of this iteration, i.e., a $z^*$ such that

$$z^* = \sigma(W z^* + U x + b). \quad (2)$$
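To make the parameterization and fixed-point computation concrete, here is a minimal NumPy sketch (the dimensions, random weights, and $m = 1$ setting are illustrative, not taken from the paper); it computes the fixed point with the damped forward-backward iteration discussed in the next paragraph.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, m = 5, 8, 1.0  # input dim, hidden dim, strong-monotonicity parameter (illustrative)

# Parameterize W = (1 - m) I - A^T A + B - B^T, which guarantees I - W >= m I.
A = rng.standard_normal((h, h)) / np.sqrt(h)
B = rng.standard_normal((h, h)) / np.sqrt(h)
W = (1 - m) * np.eye(h) - A.T @ A + B - B.T
U = rng.standard_normal((h, n)) / np.sqrt(n)
b = rng.standard_normal(h)
x = rng.standard_normal(n)

# Damped (forward-backward) iteration: z <- relu((I - alpha (I - W)) z + alpha (U x + b)).
L = np.linalg.norm(np.eye(h) - W, 2)  # spectral norm of I - W
alpha = m / L**2                      # step size strictly inside the convergent range
z = np.zeros(h)
for _ in range(5000):
    z = np.maximum((np.eye(h) - alpha * (np.eye(h) - W)) @ z + alpha * (U @ x + b), 0.0)

# z now approximates the unique fixed point z* = relu(W z* + U x + b).
residual = np.linalg.norm(z - np.maximum(W @ z + U @ x + b, 0.0))
```

Note that the fixed point of the damped iteration coincides with the fixed point of equation 1, so the residual of the original fixed-point equation can be used as a convergence check.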
Just as for DEQs, forward iteration of this system need not converge to $z^*$; instead, the fixed point is found as the solution to a particular operator splitting problem. Various operator splitting methods can be employed here, for example forward-backward iteration, which results in a damped version of the forward iteration

$$z^{(k+1)} = \sigma\big(z^{(k)} - \alpha((I - W)z^{(k)} - (U x + b))\big) = \sigma\big((I - \alpha(I - W))z^{(k)} + \alpha(U x + b)\big). \quad (3)$$

The operator $I - \alpha(I - W)$ appearing in this iteration is contractive for any $0 < \alpha < 2m/L^2$, and the iteration is guaranteed to converge so long as the operator $I - W$ is Lipschitz and strongly monotone with parameters $L$ (which is in fact the spectral norm $\|I - W\|_2$) and $m$ (Ryu & Boyd, 2016). In Section 3, we will see how unrolling this iteration leads directly to a bound on the Lipschitz constant of the monDEQ. To ensure the strong monotonicity condition, i.e. that $I - W \succeq mI$, the monDEQ parameterizes $W$ as $W = (1 - m)I - A^\top A + B - B^\top$. The strong-monotonicity parameter $m$ will in fact figure directly into the Lipschitz constant of the monDEQ.

Lipschitz constants for implicit-depth models A few prior works have proposed methods for bounding the Lipschitz constants of other classes of implicit-depth networks. Ghaoui et al. (2020) define restrictive conditions for well-posedness of an implicit network which differ from those of the monDEQ. In particular, they require the weight matrix $W$ to be such that forward iteration is stable (as opposed to the stability of the operator splitting methods required by the monDEQ). They derive Lipschitz constants and robustness guarantees under these conditions; for example, when $\|W\|_\infty < 1$, a Lipschitz bound can be derived by simply manipulating the fixed-point equation, as they demonstrate in their equation 4.3. Herrera et al.
(2020) propose a framework for implicit-depth models which incorporates the Neural ODE (Chen et al., 2018) (the solution of an ODE at a given time $T$) but not the monDEQ (which can be cast as finding the equilibrium point of an ODE). They derive bounds on the Lipschitz constant w.r.t. the network weights, but their framework cannot be applied to bound the Lipschitz constant of the monDEQ.

3. LIPSCHITZ CONSTANTS FOR MONOTONE DEQS

We now present our main methodological contributions: easily computable bounds on the Lipschitz constants of monDEQs. We first derive the Lipschitz bound on the input-output mapping defined by the monDEQ, followed by that for the weight-output mapping. As we describe below, both bounds turn out to depend inversely on the strong-monotonicity parameter $m$ of the monDEQ. Since $m$ is chosen for the monDEQ at design time, this provides a direct analytical handle on its Lipschitz constant.

3.1. LIPSCHITZ CONSTANTS WITH RESPECT TO INPUT

The naive way of computing $L$ for feedforward deep networks is by multiplying the spectral norms of the weight matrices. As stated above, simply employing forward iterations does not lead to convergence of the monDEQ. Analogously, if we were to adopt the naive method and unroll the forward iterations of the monDEQ as described in equation 1, we would end up with an infinite product of spectral norms, which would not converge unless $W$ itself were contractive. Here again, we instead consider unrolling the averaged operator $T := I - \alpha(I - W)$ employed in the forward-backward iterations, which ensures that the monDEQ converges and will also lead to a finite bound on the Lipschitz constant. Notice that $T$ appears in the forward-backward iterations in equation 3. In the sequel, let $L[A]$ denote the Lipschitz constant of a function or operator $A$. The following proposition, which we prove in Appendix A, bounds the Lipschitz constant $L[T]$.

Proposition 1. $L[T] \le \sqrt{1 - 2\alpha m + \alpha^2 L[I - W]^2}$.

This implies that for $\alpha \in \left(0, \frac{2m}{L[I-W]^2}\right)$, $L[T] < 1$. In our subsequent analysis, we only consider values of $\alpha$ in this range. We are now ready to state our bound for the Lipschitz constant of the monDEQ:

Theorem 1 (Lipschitz constant of monDEQ). Let $f(x) = z^*$ denote the output of the monDEQ on input $x$, as in equation 2. Consider any $x, y \in \mathbb{R}^n$. Then we have that
$$\|f(x) - f(y)\|_2 \le \frac{\|U\|_2}{m}\|x - y\|_2.$$
In other words, $L[f] \le \|U\|_2/m$.

Proof. Let $f_k(x) = z^{(k)}$ denote the $k$-th iterate of the forward-backward iterations as described in equation 3 (we begin with $f_0(x) = 0$). We unroll these iterations as follows:
$$\begin{aligned}
\|f_k(x) - f_k(y)\|_2 &= \|\sigma(T f_{k-1}(x) + \alpha U x + \alpha b) - \sigma(T f_{k-1}(y) + \alpha U y + \alpha b)\|_2 \\
&\le \|T f_{k-1}(x) + \alpha U x + \alpha b - T f_{k-1}(y) - \alpha U y - \alpha b\|_2 && (\sigma = \text{ReLU is 1-Lipschitz}) \\
&= \|T(f_{k-1}(x) - f_{k-1}(y)) + \alpha U(x - y)\|_2 \\
&\le \|T(f_{k-1}(x) - f_{k-1}(y))\|_2 + \alpha\|U(x - y)\|_2 \\
&\le L[T]\,\|f_{k-1}(x) - f_{k-1}(y)\|_2 + \alpha L[U]\,\|x - y\|_2 \\
&\le L[T]^k\,\|f_0(x) - f_0(y)\|_2 + \alpha\|U\|_2\|x - y\|_2 \sum_{i=0}^{k-1} L[T]^i && \text{(unrolling $k$ times)} \\
&= \alpha\|U\|_2\|x - y\|_2 \sum_{i=0}^{k-1} L[T]^i && (\text{since } f_0(x) = f_0(y) = 0)
\end{aligned}$$
Since the above inequality holds for all $k$, we can take the limit on both sides as $k \to \infty$, keeping $\alpha$ fixed. Notice that since the forward-backward iterations converge to the true $f$ (which does not depend on $\alpha$), we have $\lim_{k\to\infty} f_k = f$; that is, the dependence on $\alpha$ disappears on the left-hand side once we take the limit in $k$. Thus, using the continuity of the $\ell_2$ norm,
$$\|f(x) - f(y)\|_2 = \Big\|\lim_{k\to\infty} f_k(x) - \lim_{k\to\infty} f_k(y)\Big\|_2 \le \alpha\|U\|_2\|x - y\|_2 \sum_{i=0}^{\infty} L[T]^i = \frac{\alpha\|U\|_2}{1 - L[T]}\|x - y\|_2 \quad (\text{since } L[T] < 1)$$
$$\le \frac{\alpha\|U\|_2}{1 - \sqrt{1 - 2\alpha m + \alpha^2 L[I - W]^2}}\|x - y\|_2 \quad (\text{from Proposition 1})$$
Now, since the above result holds for any $\alpha$ in the range considered, taking $\alpha \to 0$, we have
$$L[f] \le \lim_{\alpha \to 0} \frac{\alpha\|U\|_2}{1 - \sqrt{1 - 2\alpha m + \alpha^2 L[I - W]^2}} = \frac{\|U\|_2}{m} \quad (\text{applying L'H\^opital's rule})$$

We observe that the Lipschitz constant of the monDEQ with respect to its inputs depends on only two quantities, namely $\|U\|_2$ and $m$, and does not depend at all on the weight matrix $W$. Furthermore, because $m$ is a hyperparameter chosen by the user, monDEQs have the notable property that one can essentially control the Lipschitz constant of the network (insofar as the influence of $W$ is concerned) by appropriately choosing $m$, without requiring any additional structure or regularization on $W$. This is in stark contrast to most existing DNN architectures, where enforcing Lipschitz bounds requires substantial additional effort.
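As a sanity check on Theorem 1, the following NumPy sketch (with small, hypothetical dimensions and random weights, not the paper's trained models) solves the fixed point at pairs of nearby inputs and compares the empirical ratio $\|f(x) - f(y)\|_2 / \|x - y\|_2$ against $\|U\|_2/m$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, h, m = 5, 8, 1.0

A = rng.standard_normal((h, h)) / np.sqrt(h)
B = rng.standard_normal((h, h)) / np.sqrt(h)
W = (1 - m) * np.eye(h) - A.T @ A + B - B.T  # guarantees I - W >= m I
U = rng.standard_normal((h, n)) / np.sqrt(n)
b = rng.standard_normal(h)

L = np.linalg.norm(np.eye(h) - W, 2)
alpha = m / L**2
T = np.eye(h) - alpha * (np.eye(h) - W)

def mondeq(x, iters=5000):
    """Fixed point z* = relu(W z* + U x + b) via damped forward-backward iteration."""
    z = np.zeros(h)
    for _ in range(iters):
        z = np.maximum(T @ z + alpha * (U @ x + b), 0.0)
    return z

bound = np.linalg.norm(U, 2) / m  # Theorem 1: L[f] <= ||U||_2 / m

# Empirical Lipschitz ratios over random nearby input pairs.
worst = 0.0
for _ in range(20):
    x = rng.standard_normal(n)
    y = x + 1e-2 * rng.standard_normal(n)
    worst = max(worst, np.linalg.norm(mondeq(x) - mondeq(y)) / np.linalg.norm(x - y))
```

The empirical ratio never exceeds the bound; note that the bound depends only on $\|U\|_2$ and $m$, not on the particular $W$ drawn.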

3.2. LIPSCHITZ CONSTANTS WITH RESPECT TO WEIGHTS

We now turn to the question of bounding the change in the output of the monDEQ when the weights are perturbed but the input remains fixed. This calculation has several important use cases, one of which is the derivation of generalization bounds for the monDEQ: given a bound on the change in the output upon perturbing the weights of the monDEQ, we can derive bounds on the generalization error in a straightforward manner, as detailed in Section 4 below. The following theorem establishes a perturbation bound for the monDEQ.

Theorem 2 (Perturbation bound for monDEQ). Let $I - W \succeq mI$ and $I - \bar{W} \succeq \bar{m}I$. The change in the output of the monDEQ upon perturbing the weights and biases from $W, U, b$ to $\bar{W}, \bar{U}, \bar{b}$ is bounded as follows:
$$\|f(\bar{W}, \bar{U}, \bar{b}) - f(W, U, b)\|_2 \le \frac{\|\bar{W} - W\|_2\,\|Ux + b\|_2}{m\bar{m}} + \frac{\|(\bar{U} - U)x\|_2 + \|\bar{b} - b\|_2}{\bar{m}}$$

The proof steps for Theorem 2 closely parallel those in the derivation of the Lipschitz constant with respect to the inputs, and are outlined in Appendix B. We highlight here again that the bound depends inversely on $m$, a design parameter in our control. Further, compared to a similar perturbation bound derived in Neyshabur et al. (2018), our perturbation bound for the monDEQ does not involve a depth-dependent product of spectral norms of weights. In addition, although we state the theorem in terms of a perturbation of $W$ (which can thus lead to a different strong-monotonicity parameter $\bar{m}$), the bound can also be adapted to perturbations of $A$ and $B$ in the typical monDEQ parameterization, which leads to a perturbed network that necessarily retains the same monotonicity parameter $m$ as the original (indeed, we take this approach in the next section, when deriving the generalization bound).
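A quick numerical check of Theorem 2 is possible along the same lines (again with small, hypothetical dimensions, not the paper's models): perturbing $A$, $B$, $U$, $b$ slightly preserves the monotonicity parameter, so $\bar{m} = m$, and the observed output change can be compared to the bound directly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, h, m = 4, 6, 1.0

def make_W(A, B):
    # W = (1 - m) I - A^T A + B - B^T: any A, B give I - W >= m I, so m-bar = m.
    return (1 - m) * np.eye(h) - A.T @ A + B - B.T

def mondeq(W, U, b, x, iters=6000):
    L = np.linalg.norm(np.eye(h) - W, 2)
    alpha = m / L**2
    T = np.eye(h) - alpha * (np.eye(h) - W)
    z = np.zeros(h)
    for _ in range(iters):
        z = np.maximum(T @ z + alpha * (U @ x + b), 0.0)
    return z

A = rng.standard_normal((h, h)) / np.sqrt(h)
B = rng.standard_normal((h, h)) / np.sqrt(h)
U = rng.standard_normal((h, n)) / np.sqrt(n)
b = rng.standard_normal(h)
x = rng.standard_normal(n)
W = make_W(A, B)

eps = 1e-3  # perturbation scale
W_p = make_W(A + eps * rng.standard_normal((h, h)), B + eps * rng.standard_normal((h, h)))
U_p = U + eps * rng.standard_normal((h, n))
b_p = b + eps * rng.standard_normal(h)

change = np.linalg.norm(mondeq(W_p, U_p, b_p, x) - mondeq(W, U, b, x))
# Theorem 2 bound with m-bar = m:
bound = (np.linalg.norm(W_p - W, 2) * np.linalg.norm(U @ x + b) / m**2
         + (np.linalg.norm((U_p - U) @ x) + np.linalg.norm(b_p - b)) / m)
```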

4. GENERALIZATION BOUND FOR MONDEQ

In this section, we demonstrate how the perturbation bound derived in Section 3.2 leads directly to a deterministic PAC-Bayes, margin-based bound on the monDEQ generalization error, following the analysis for DNNs of Neyshabur et al. (2018). A key difference from our work, however, is that the perturbation bound they derive involves the product of the spectral norms of all the weight matrices in the DNN; thus, as the network gets deeper, their bound grows exponentially looser. As in Neyshabur et al. (2018), our generalization bound is based on two key ingredients. The first is their deterministic PAC-Bayes margin bound (Lemma 1 in Appendix C), which adapts traditional PAC-Bayes bounds to bound the expected risk of a parameterized, deterministic classifier in terms of its empirical margin loss. The second is the perturbation bound on the monDEQ with respect to its weights, derived in Section 3.2 above. Crucially, since our perturbation bound does not explicitly involve a product of spectral norms of weights (which in the case of the monDEQ would be an infinite product), our final generalization bound does not either. The monDEQ model we consider here includes a fully connected layer at the end that maps $f$ to the output, so that $f_o(x) = W_o f(x) + b_o$, where $W_o$ and $b_o$ are the weights and bias of the output layer; these parameters are important to include here since they contribute directly to the perturbation bound. We also restrict the input $x$ to the monDEQ to lie in an $\ell_2$ ball of radius $B$. Let $h$ denote the hidden dimension of the monDEQ and $M$ the size of the training set, and define $\beta := \max\{\|U\|_2, \|A\|_2, \|b\|_2, \|W_o\|_2\}$. Let $L_\gamma(f_o)$ denote the expected margin loss at margin $\gamma$ of the monDEQ on the data distribution $D$, where
$$L_\gamma(f_o) = \mathbb{P}_{(x,y)\sim D}\Big[f_o(x)_y \le \gamma + \max_{j \ne y} f_o(x)_j\Big],$$
and let $\hat{L}_\gamma(f_o)$ denote the corresponding empirical margin loss on the training dataset.
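The empirical margin loss is straightforward to compute from the logits; a small sketch (the function name and logit values are made up for illustration):

```python
import numpy as np

def margin_loss(logits, labels, gamma):
    """Empirical margin loss: fraction of examples whose true-class logit
    does not beat every other logit by more than gamma."""
    M = logits.shape[0]
    true = logits[np.arange(M), labels]
    others = logits.copy()
    others[np.arange(M), labels] = -np.inf
    runner_up = others.max(axis=1)
    return float(np.mean(true <= gamma + runner_up))

logits = np.array([[3.0, 0.0, 1.0],    # margin 2.0
                   [0.5, 0.2, 0.4],    # margin 0.1
                   [0.0, 2.0, 1.0]])   # misclassified: negative margin
labels = np.array([0, 0, 0])
```

At $\gamma = 0$ this is just the training error; increasing $\gamma$ counts additionally every example classified with margin at most $\gamma$.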
We are now ready to state our generalization bound for the monDEQ:

Theorem 3 (Generalization bound for monDEQ). Let
$$\|W_\bullet\|_F^2 = \|A\|_F^2 + \|B\|_F^2 + \|U\|_F^2 + \|b\|_F^2 + \|W_o\|_F^2 + \|b_o\|_F^2.$$
For any $\delta, \gamma > 0$, with probability at least $1 - \delta$ over a training set of size $M$, we have
$$L_0(f_o) \le \hat{L}_\gamma(f_o) + O\left(\sqrt{\frac{h \ln(h)\big[\beta^2 B(\gamma + \beta) + m\beta B + m^2\big]^2}{\gamma^2 m^4 M}\|W_\bullet\|_F^2 + \frac{\ln(M\sqrt{M}/\delta)}{M}}\right).$$

Note that our bound does not involve any depth-like term that scales exponentially, such as the product of spectral norms of the weight matrices in Neyshabur et al. (2018), while retaining the same dependence on $h$, namely $\sqrt{h \ln h}$. To the best of our knowledge, this is the first generalization bound for an implicit-layer model having effectively infinite depth. The proof of Theorem 3 is given in Appendix C.

5. EXPERIMENTS

5.1. LIPSCHITZ CONSTANTS

In this section, we empirically verify the tightness of the Lipschitz constant of the monDEQ with respect to inputs. We conduct all our experiments on MNIST and CIFAR-10, for which several benchmarks exist for computing the Lipschitz constant. We conduct experiments on different monDEQ architectures (fully connected and convolutional) with varying parameters (strong-monotonicity parameter $m$ and width $h$), which we compare to DNNs of different depths and widths. We compute empirical lower bounds by maximizing the norm of the gradient over 10k randomly sampled points. A naive upper bound can be computed as $\prod_{i=1}^{d} \|W_i\|_2$. We include these bounds wherever applicable.

MNIST Here, we train DNNs of depths $d = 3, 4, \ldots, 14$ for a fixed hidden-layer width $h = 40$, and plot (Figure 1a) the bound on the Lipschitz constant given by the SDP-based method of Fazlyab et al. (2019) for these DNNs. We observe that all estimates of the Lipschitz constant increase exponentially with depth. For comparison, in Figure 1b we plot our Lipschitz bounds for monDEQs with fixed $h = 40, 60$, for a range of strong-monotonicity parameters $m$. We note that the DNNs all have test error of around 3%, while the monDEQ test error ranges from 2.4% to 4.3%, increasing with $m$ (see Figure 5 in Appendix F for details). We see that the Lipschitz constant of the monDEQ is much smaller, and that increasing $m$ decreases the Lipschitz constant of the monDEQ, illustrating how we can exercise control over it. We also compare the Lipschitz constants of monDEQs and DNNs of the same width, for a fixed depth $d = 5$. The results are shown in Figure 6 in Appendix F; the DNN numbers are derived from Figure 2(a) in Fazlyab et al. (2019). We observe that the Lipschitz bound of the monDEQ for the same width (and essentially infinite depth) is much lower than the bounds for regular DNNs.
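The two empirical baselines above can be illustrated on a toy ReLU network (a hypothetical 2-layer stand-in, not the paper's trained models): for piecewise-linear networks the Jacobian at a point gives a valid local sensitivity, so its largest spectral norm over samples lower-bounds $L$, while the product of layer spectral norms upper-bounds it.

```python
import numpy as np

rng = np.random.default_rng(3)
n, h, out = 4, 16, 3

# A small 2-layer ReLU network f(x) = W2 relu(W1 x), purely illustrative.
W1 = rng.standard_normal((h, n)) / np.sqrt(n)
W2 = rng.standard_normal((out, h)) / np.sqrt(h)

def jacobian(x):
    # For ReLU nets the Jacobian is W2 D W1, with D the 0/1 activation pattern at x.
    d = (W1 @ x > 0).astype(float)
    return W2 @ (d[:, None] * W1)

# Empirical lower bound: max spectral norm of the Jacobian over random samples.
lower = 0.0
for _ in range(1000):
    lower = max(lower, np.linalg.norm(jacobian(rng.standard_normal(n)), 2))

# Naive upper bound: product of layer spectral norms.
upper = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)
```

The gap between `lower` and `upper` is exactly the slack that the SDP-based methods (and, for monDEQs, Theorem 1) aim to close.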
Next, using the bound derived in Section 3.1, we compute the Lipschitz constants of convolutional monDEQ architectures, namely single-convolution and multi-tier convolutional monDEQs. We compare to the numbers in Figure 5 of Virmaux & Scaman (2018), which reports the Lipschitz constants computed by various methods for CNNs of increasing depth. For our estimate on the single-convolution monDEQ, we use a single convolutional layer with 128 channels, whereas for the multi-tier convolutional monDEQ, we use 3 convolutional layers with 32, 64, and 128 channels. In Figure 1c, we observe that, as for DNNs, the CNN Lipschitz constants estimated by existing methods also suffer with depth. However, Figure 1d shows that the Lipschitz bounds for convolutional monDEQs are much smaller; moreover, by increasing $m$, we can control the Lipschitz constant of both single- and multi-tier convolutional monDEQs. The test error of the convolutional monDEQs ranges from 0.65% to 3.22%, increasing with $m$ (see Figure 5 in Appendix F), but is not reported for the CNNs in Virmaux & Scaman (2018).

CIFAR-10

To demonstrate that the Lipschitz bounds scale to larger datasets, we run similar experiments on CIFAR-10. Figure 2a shows our bound (solid blue lines) for single-convolution monDEQs (128 channels) with a range of $m$ values, together with empirical lower bounds (dashed blue lines). Also shown are upper bounds (solid lines) and empirical lower bounds (dashed lines) for three standard CNN models (CNN sm, med, and lg, having 2, 4, and 6 convolutions respectively, detailed in Appendix F). The upper bounds are computed using the Greedy SeqLip method of Virmaux & Scaman (2018). We see that (a) the monDEQ Lipschitz bounds decrease with $m$, and (b) the upper bounds (and the gap between upper and lower bounds) for the medium and large CNNs are large by comparison. In Figure 2b we see that the test error of the monDEQ increases with $m$ from 26% to 33%, and that, despite their much higher Lipschitz bounds, the three CNNs have similar test error. Finally, on CIFAR-10 with data augmentation, we also trained a larger multi-tier convolutional monDEQ with three convolutional layers of 64, 128, and 128 channels. This model obtains 10.25% test error and has a Lipschitz upper bound of 1996.86, on par with the upper bound of the medium CNN (plotted in brown; $L = 1554.01$, test error 30.6%).

Unrolling monDEQs

In this experiment, we study whether unrolling the monDEQ ($m = 1$, $h = 40$) to a finite depth and constructing an equivalent DNN of this depth leads to a tight estimate of the Lipschitz constant of the monDEQ. Concretely, we do this for two operator splitting methods for the monDEQ: forward-backward (FB) iterations and Peaceman-Rachford (PR) iterations. For each value of $\alpha$ in a range, we calculate the number of iterations (FB or PR) required to converge within a tolerance of $10^{-3}$, and construct the equivalent DNN of this depth. Note that these unrolled DNNs compute the same function as the monDEQ (up to tolerance), and therefore must have the same Lipschitz constant (which is around 10). We compute naive upper bounds on the Lipschitz constants of these DNNs (we cannot use the SDP-based bound of Fazlyab et al. (2019) due to technicalities in the construction of the unrolled DNN; see Appendix D). We observe (Figure 7 in Appendix F) that the upper bounds corresponding to both PR and FB iterations lie in the range $10^5$ to $10^{13}$, suggesting that unrolling the monDEQ and applying standard techniques to the unrolled network is not a viable way to bound its Lipschitz constant. More details about the construction of these equivalent DNNs for both FB and PR iterations are provided in Appendix D.
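The FB-unrolling construction can be sketched as follows (toy dimensions and random weights, not the experiment's trained models): pick the depth $k$ at which $\|T\|_2^k$ reaches the tolerance, check that the unrolled network indeed matches the fixed point, and form the naive product-of-norms bound over the unrolled layers.

```python
import numpy as np

rng = np.random.default_rng(4)
n, h, m = 4, 8, 1.0

A = rng.standard_normal((h, h)) / np.sqrt(h)
B = rng.standard_normal((h, h)) / np.sqrt(h)
W = (1 - m) * np.eye(h) - A.T @ A + B - B.T
U = rng.standard_normal((h, n)) / np.sqrt(n)
b = rng.standard_normal(h)

L = np.linalg.norm(np.eye(h) - W, 2)
alpha = m / L**2
T = np.eye(h) - alpha * (np.eye(h) - W)
normT = np.linalg.norm(T, 2)  # < 1: the FB iteration is a contraction

# Depth at which the unrolled FB iteration reaches tolerance 1e-3.
tol = 1e-3
k = int(np.ceil(np.log(tol) / np.log(normT)))

# Unrolled "DNN": k explicit layers z <- relu(T z + alpha (U x + b)), starting at z = 0.
x = rng.standard_normal(n)
z = np.zeros(h)
for _ in range(k):
    z = np.maximum(T @ z + alpha * (U @ x + b), 0.0)

# Reference fixed point (run far past convergence); the unrolled net matches it.
z_star = z.copy()
for _ in range(20 * k):
    z_star = np.maximum(T @ z_star + alpha * (U @ x + b), 0.0)
gap = np.linalg.norm(z - z_star)

# Each unrolled layer acts linearly on the augmented state (z, x); the naive
# Lipschitz bound is the product of these per-layer spectral norms over k layers,
# which grows with depth even though the true constant is at most ||U||_2 / m.
M_layer = np.block([[T, alpha * U], [np.zeros((n, h)), np.eye(n)]])
naive = np.linalg.norm(M_layer, 2) ** k
```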

5.2. GENERALIZATION BOUNDS

A key advantage of the monDEQ generalization bounds derived in Section 4 is the lack of any depth analog that can cause the bounds to grow exponentially. To assess this aspect experimentally, we first compute the DNN generalization bound following the protocol of Nagarajan & Kolter (2018). We train DNNs (width 40) of depths varying from 3 to 14 layers, and compare to similar monDEQs with various values of $m$. Each model is trained on a sample of 4096 MNIST examples until the margin error at margin $\gamma = 10$ falls below 10%, which serves to standardize the experiments across choices of batch size and learning rate. As widely reported, we see that DNN bounds increase exponentially with depth, ranging numerically from $10^4$ for depth-3 networks to $10^8$ (see Figure 3a). For monDEQs of width 40, the bound decreases monotonically with $m$ and is confined to the range $10^4$ to $10^6$, as seen in Figure 3b (note the difference in scale). In contrast, the true test error of the DNNs increases only slightly with depth, and that of the monDEQs increases only slightly with $m$; the DNNs and monDEQs have comparable test error (see Figure 8 in Appendix G). Finally, as done for the Lipschitz bounds above, we compare our generalization bound to what we obtain by unrolling the monDEQ into a DNN and then computing the Neyshabur et al. (2018) bound more-or-less directly (see Appendix E). We do this only for FB iterations, as the inverted operators of PR iterations complicate the analysis. As seen in Figure 8c, the resulting bounds are quite high, though the difference from our bound is not as great as for the unrolled Lipschitz bounds above. We attribute this to the fact that our generalization bound technique is a minimal modification of that of Neyshabur et al. (2018); we expect it can be tightened with a more refined analysis.

5.3. ADVERSARIAL ROBUSTNESS OF MONDEQS

In this section, we empirically demonstrate an important use case for the tight Lipschitz constants of the monDEQ: robustness to adversarial examples. We experiment with both certified adversarial $\ell_2$ robustness and empirical robustness to $\ell_2$-bounded PGD attacks. Here we describe our results on MNIST; we report similar experiments on CIFAR-10 in Appendix H.

Certified adversarial robustness Consider any point $x'$ within an $\ell_2$ ball of radius $\epsilon$ around $x$. Then we have that
$$\|f(x) - f(x')\|_\infty \le \|f(x) - f(x')\|_2 \le L\|x - x'\|_2 \le L\epsilon.$$
Define $\mathrm{margin}(x) = f(x)_y - \max_{i \ne y} f(x)_i$, where $f(x)_y$ is the logit corresponding to the true label $y$ of input $x$. Then if $L\epsilon \le \frac{1}{2}\mathrm{margin}(x)$, we are certified robust to any perturbed input $x'$ within an $\ell_2$ ball of radius $\epsilon$ around $x$ (each logit changes by at most $L\epsilon$, so the margin cannot become negative). Thus, we can empirically compare DNNs and monDEQs with regard to this certificate on MNIST. For a range of $\epsilon$ values, we compute the certified robust test accuracy (the fraction of test points for which the aforementioned condition holds) for trained DNNs as well as monDEQs. We note that our choice of $\epsilon$ values corresponds to inputs normalized to the range $[0, 1]$. For DNNs, just as in Figure 1a, we vary the depth for fixed $h = 40$, and use the $L$ values computed by the method of Fazlyab et al. (2019). For monDEQs, we set width $h = 40$ and $m = 0.1, 20$, and substitute our upper bound for $L$. Note that since the $L$ values for monDEQs observed in Section 5.1 were significantly smaller than those for the DNNs, one would expect the certificate condition to hold more easily for monDEQs; indeed, we verify this in our experiments. In Figure 4a, we observe that the robust test accuracy of the monDEQ with $m = 20$ at $\epsilon = 0.2$ is 51%, while that of the best DNN ($d = 3$) is just 4%. This illustrates that monDEQs allow for better adversarial robustness certificates, owing to their small Lipschitz constants and the ability to control them by setting $m$.
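The certificate is cheap to evaluate per example; a small sketch (the logit values and Lipschitz bounds below are illustrative, not the paper's measurements):

```python
import numpy as np

def certified_radius(logits, label, lip):
    """Largest l2 radius eps with lip * eps <= margin/2: within this radius no
    other logit can overtake the true-class logit, so the prediction is stable."""
    others = np.delete(logits, label)
    margin = logits[label] - others.max()
    return max(margin, 0.0) / (2.0 * lip)

logits = np.array([4.0, 1.0, 0.5])  # margin(x) = 4 - 1 = 3
r_small_L = certified_radius(logits, 0, 5.0)    # small Lipschitz bound (monDEQ-like)
r_large_L = certified_radius(logits, 0, 100.0)  # large Lipschitz bound (DNN-like)
```

With the same margin, a 20x smaller Lipschitz bound certifies a 20x larger radius, which is why the much smaller monDEQ bounds translate directly into stronger certificates.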

Empirical robustness

We also assess the empirical robustness of monDEQs to $\ell_2$-bounded projected gradient descent (PGD) attacks (implemented as part of the Foolbox toolbox (Rauber et al., 2017; 2020)) on both trained monDEQs and DNNs on MNIST, and compute the accuracy on the adversarially perturbed test examples. Figure 4b shows the results: in general, over a range of $\epsilon$ values, the robust test accuracy of the monDEQ with $m = 20$ is larger than that of the DNNs.

6. CONCLUSION

In this paper, we derived Lipschitz bounds for monotone DEQs, a recently proposed class of implicit-layer networks, and showed that they depend in a straightforward manner on the strong-monotonicity parameter $m$ of these networks. Having derived a Lipschitz bound with respect to perturbations of the weights, we were able to derive a PAC-Bayes generalization bound for the monotone DEQ which does not depend exponentially on depth. We showed empirically that our bounds are sensible, can be controlled by choosing $m$ suitably, and do not suffer with increasing depth of the network. As future work, we aim to analyze the vacuousness of the derived generalization bound; since our bound does not suffer exponentially with depth, we hope to tighten the analysis and derive a non-vacuous generalization bound.

A PROOF OF PROPOSITION 1

Proof.
$$\begin{aligned}
\|Tx - Ty\|_2^2 &= \|((1 - \alpha)I + \alpha W)x - ((1 - \alpha)I + \alpha W)y\|_2^2 \\
&= \|x - y - \alpha(I - W)(x - y)\|_2^2 \\
&= \|x - y\|_2^2 - 2\alpha(x - y)^\top(I - W)(x - y) + \alpha^2\|(I - W)(x - y)\|_2^2 \\
&\le \|x - y\|_2^2 - 2\alpha(x - y)^\top(I - W)(x - y) + \alpha^2 L[I - W]^2\|x - y\|_2^2
\end{aligned}$$
Now note that by the strong monotonicity of the monDEQ, $I - W \succeq mI$, which implies that $(x - y)^\top(I - W)(x - y) \ge m\|x - y\|_2^2$. Substituting this bound above, we have
$$\|Tx - Ty\|_2^2 \le \|x - y\|_2^2 - 2\alpha m\|x - y\|_2^2 + \alpha^2 L[I - W]^2\|x - y\|_2^2 = (1 - 2\alpha m + \alpha^2 L[I - W]^2)\|x - y\|_2^2.$$
Thus, we have that $L[T] \le \sqrt{1 - 2\alpha m + \alpha^2 L[I - W]^2}$.

B PROOF OF THEOREM 2

In order to derive the perturbation bound in Theorem 2, we first state the following proposition, which bounds the norm of the output after $k$ forward-backward iterations.

Proposition 2. Let $f_k(W, U, b)$ denote the $k$-th iterate of the forward-backward iterations of the monDEQ parameterized by $W, U, b$ on a fixed arbitrary input $x$. Further, let $T(W) = (1 - \alpha)I + \alpha W$. Then we have that
$$\|f_k(W, U, b)\|_2 \le \frac{\alpha\|Ux + b\|_2}{1 - L[T(W)]}.$$

Proof of Proposition 2.
$$\begin{aligned}
\|f_k(W, U, b)\|_2 &= \|\sigma(T f_{k-1}(W, U, b) + \alpha(Ux + b))\|_2 \\
&\le \|T f_{k-1}(W, U, b) + \alpha(Ux + b)\|_2 && (\text{ReLU is 1-Lipschitz}) \\
&\le \|T\|_2\|f_{k-1}(W, U, b)\|_2 + \alpha\|Ux + b\|_2 = L[T]\|f_{k-1}(W, U, b)\|_2 + \alpha\|Ux + b\|_2 \\
&\le L[T]^k\|f_0(W, U, b)\|_2 + \alpha\|Ux + b\|_2\sum_{i=0}^{k-1} L[T]^i \\
&= \alpha\|Ux + b\|_2\sum_{i=0}^{k-1} L[T]^i && (\text{since } f_0(W, U, b) = 0) \\
&\le \alpha\|Ux + b\|_2\sum_{i=0}^{\infty} L[T]^i = \frac{\alpha\|Ux + b\|_2}{1 - L[T]} && (\text{since } L[T] < 1)
\end{aligned}$$

We are now ready to prove Theorem 2.

Proof of Theorem 2. Denote $\bar{f}_k = f_k(\bar{W}, \bar{U}, \bar{b})$ and $f_k = f_k(W, U, b)$, and let $\Delta_k = \|\bar{f}_k - f_k\|_2$. For $\alpha \in \left(0, \min\left\{\frac{2m}{L[I - W]^2}, \frac{2\bar{m}}{L[I - \bar{W}]^2}\right\}\right)$, we have from Proposition 1 that both $\|T(W)\|_2, \|T(\bar{W})\|_2 < 1$. Thus,
$$\begin{aligned}
\Delta_k &= \|\sigma[T(\bar{W})\bar{f}_{k-1} + \alpha(\bar{U}x + \bar{b})] - \sigma[T(W)f_{k-1} + \alpha(Ux + b)]\|_2 \\
&\le \|T(\bar{W})\bar{f}_{k-1} - T(W)f_{k-1} + \alpha(\bar{U} - U)x + \alpha(\bar{b} - b)\|_2 \\
&= \|T(\bar{W})(\bar{f}_{k-1} - f_{k-1}) + (T(\bar{W}) - T(W))f_{k-1} + \alpha(\bar{U} - U)x + \alpha(\bar{b} - b)\|_2 \\
&= \|T(\bar{W})(\bar{f}_{k-1} - f_{k-1}) + \alpha(\bar{W} - W)f_{k-1} + \alpha(\bar{U} - U)x + \alpha(\bar{b} - b)\|_2 \\
&\le \|T(\bar{W})\|_2\,\Delta_{k-1} + \alpha\|\bar{W} - W\|_2\|f_{k-1}\|_2 + \alpha\|(\bar{U} - U)x\|_2 + \alpha\|\bar{b} - b\|_2 \\
&\le \|T(\bar{W})\|_2\,\Delta_{k-1} + \frac{\alpha^2\|\bar{W} - W\|_2\|Ux + b\|_2}{1 - \|T(W)\|_2} + \alpha\|(\bar{U} - U)x\|_2 + \alpha\|\bar{b} - b\|_2 && (\text{Proposition 2}) \\
&\le \left(\frac{\alpha^2\|\bar{W} - W\|_2\|Ux + b\|_2}{1 - \|T(W)\|_2} + \alpha\|(\bar{U} - U)x\|_2 + \alpha\|\bar{b} - b\|_2\right)\sum_{i=0}^{k-1}\|T(\bar{W})\|_2^i
\end{aligned}$$
Notice here again that the above inequality holds for all $k$. Taking the limit as $k \to \infty$, as in the proof of Theorem 1, we have
$$\|f(\bar{W}, \bar{U}, \bar{b}) - f(W, U, b)\|_2 = \lim_{k\to\infty}\Delta_k \le \frac{\alpha^2\|\bar{W} - W\|_2\|Ux + b\|_2}{(1 - \|T(W)\|_2)(1 - \|T(\bar{W})\|_2)} + \frac{\alpha\big(\|(\bar{U} - U)x\|_2 + \|\bar{b} - b\|_2\big)}{1 - \|T(\bar{W})\|_2}$$
Finally, taking $\alpha \to 0$, we have
$$\begin{aligned}
\|f(\bar{W}, \bar{U}, \bar{b}) - f(W, U, b)\|_2 &\le \|\bar{W} - W\|_2\|Ux + b\|_2 \lim_{\alpha\to 0}\frac{\alpha}{1 - \|T(W)\|_2}\cdot\lim_{\alpha\to 0}\frac{\alpha}{1 - \|T(\bar{W})\|_2} \\
&\quad + \big(\|(\bar{U} - U)x\|_2 + \|\bar{b} - b\|_2\big)\lim_{\alpha\to 0}\frac{\alpha}{1 - \|T(\bar{W})\|_2} \\
&= \frac{\|\bar{W} - W\|_2\|Ux + b\|_2}{m\bar{m}} + \frac{\|(\bar{U} - U)x\|_2 + \|\bar{b} - b\|_2}{\bar{m}} && (\text{applying L'H\^opital's rule})
\end{aligned}$$

C PROOF OF THEOREM 3

Proof. We first state Lemma 1 from Neyshabur et al. (2018).

Lemma 1 (Lemma 1 from Neyshabur et al. (2018)).
Let $f_w$ be any predictor with parameters $w$, and let $P$ denote any distribution on the parameters that is independent of the training data. Then, for any $\delta, \gamma > 0$, with probability $\ge 1 - \delta$ over the training data of size $M$, for any $w$, and any random perturbation $u$ such that $\mathbb{P}\left[\max_x \|f_{w+u}(x) - f_w(x)\|_\infty < \frac{\gamma}{4}\right] \ge \frac{1}{2}$, we have
$$L_0(f_w) \le \hat L_\gamma(f_w) + 4\sqrt{\frac{KL(w + u\,\|\,P) + \ln\frac{6M}{\delta}}{M - 1}}.$$

Now, we derive a perturbation bound for the monDEQ when we incorporate a fully connected layer at the end, as mentioned in Section 4. That is, we consider $f_o(x) = W_o f(x) + b_o$, where $f$ is the output of the fixed-point iterations of the monDEQ. Next, we consider perturbations $\Delta_A, \Delta_B, \Delta_U, \Delta_b, \Delta_{W_o}, \Delta_{b_o}$ for $A, B, U, b, W_o, b_o$ respectively. The entries in the perturbation matrices are each drawn independently from a Gaussian $\mathcal{N}(0, \sigma^2)$. Let $\tilde f_o$ denote the function at the perturbed values of the weights. Then, we have that
$$\begin{aligned}
\|\tilde f_o(x) - f_o(x)\|_2 &= \|\tilde W_o\tilde f(x) + \tilde b_o - W_o f(x) - b_o\|_2 \\
&\le \|W_o(\tilde f(x) - f(x)) + (\tilde W_o - W_o)\tilde f(x)\|_2 + \|\tilde b_o - b_o\|_2 \\
&\le \|W_o\|_2\Delta + \frac{\|\tilde W_o - W_o\|_2\|Ux + b\|_2}{m} + \|\tilde b_o - b_o\|_2,
\end{aligned}$$
where $\Delta$ is the bound from Theorem 2. Now, let $\beta = \max(\|U\|_2, \|A\|_2, \|W_o\|_2, \|b\|_2)$. Just as in Neyshabur et al. (2018), since we cannot use $\beta$ in determining the parameters of the prior distribution $P$ in Lemma 1, we will consider predetermined values $\tilde\beta$ on a grid, and then take a union bound. For now, we fix $\tilde\beta$, and consider all the values $\beta$ such that $|\beta - \tilde\beta| \le c_1\beta$ for some constant $c_1 < 1$. Since the entries in the perturbations are drawn from $\mathcal{N}(0, \sigma^2)$, we have the following bound on the $\ell_2$ norms of these perturbations:
$$\mathbb{P}_{\Delta_\bullet\sim\mathcal{N}(0,\sigma^2 I)}\left[\|\Delta_\bullet\|_2 > t\right] \le 2h\,e^{-t^2/2h\sigma^2},$$
where $\Delta_\bullet$ is a placeholder for each of the perturbation matrices. Thus, we have that with probability $\ge 1/2$, all of the $\|\Delta_\bullet\|_2$ are bounded above by $\sigma\sqrt{2h\ln(24h)} =: \omega$. Now, we bound the perturbation in $\|W\|_2$ when $A$ and $B$ are perturbed.
We have that
$$\|\Delta_W\|_2 = \|A^T\Delta_A + \Delta_A^TA + \Delta_A^T\Delta_A + \Delta_B - \Delta_B^T\|_2 \le 2\|A\|_2\|\Delta_A\|_2 + \|\Delta_A\|_2^2 \le 2\omega(\beta + \omega) \quad \text{(with probability 1/2).}$$
Substituting this above, we have that for all $x$, with probability at least $1/2$,
$$\|\tilde f_o(x) - f_o(x)\|_2 \le \frac{2\beta^2\omega(\beta + \omega)(B + 1)}{m^2} + \frac{2\omega\beta(B + 1)}{m} + \omega.$$
Here, let $c_2 > 0$ be some constant such that $\omega \le c_2\beta$. Thus, we have that
$$\|\tilde f_o(x) - f_o(x)\|_2 \le \omega\left(\frac{2\beta^3(1 + c_2)(B + 1)}{m^2} + \frac{2\beta(B + 1)}{m} + 1\right) \le \omega\left(\frac{2\tilde\beta^3(1 + c_2)(B + 1)}{m^2(1 - c_1)^3} + \frac{2\tilde\beta(B + 1)}{m(1 - c_1)} + 1\right)$$
$$= \sigma\sqrt{2h\ln(24h)}\;\frac{2\tilde\beta^3(1 + c_2)(B + 1) + 2m\tilde\beta(B + 1)(1 - c_1)^2 + m^2(1 - c_1)^3}{m^2(1 - c_1)^3}.$$
Setting
$$\sigma = \frac{\gamma m^2(1 - c_1)^3}{4\sqrt{2h\ln(24h)}\left(2\tilde\beta^3(1 + c_2)(B + 1) + 2m\tilde\beta(B + 1)(1 - c_1)^2 + m^2(1 - c_1)^3\right)}$$
makes this $\le \frac{\gamma}{4}$. Since we need $\omega \le c_2\beta$, we can take the smallest $c_2$ such that
$$c_2 \ge \frac{(1 + c_1)\sigma\sqrt{2h\ln(24h)}}{\tilde\beta} \ge \frac{\sigma\sqrt{2h\ln(24h)}}{\beta} = \frac{\omega}{\beta}.$$
Taking $c_2 = \frac{(1 + c_1)\gamma}{4\tilde\beta}$ suffices. We will later plug in the value $c_3 := \frac{(1 + c_1)\gamma}{4(1 - c_1)\beta} \ge c_2$ to bound $c_2$. Then,
$$\begin{aligned}
KL(W_\bullet + \Delta_{W_\bullet}\,\|\,P) &\le \frac{\|W_\bullet\|_F^2}{2\sigma^2} = \frac{16h\ln(24h)\left(2\tilde\beta^3(1 + c_2)(B + 1) + 2m\tilde\beta(B + 1)(1 - c_1)^2 + m^2(1 - c_1)^3\right)^2}{\gamma^2m^4(1 - c_1)^6}\|W_\bullet\|_F^2 \\
&\le \frac{16h\ln(24h)\left(2\beta^3(1 + c_1)^3(1 + c_3)(B + 1) + 2m\beta(1 + c_1)(B + 1)(1 - c_1)^2 + m^2(1 - c_1)^3\right)^2}{\gamma^2m^4(1 - c_1)^6}\|W_\bullet\|_F^2.
\end{aligned}$$
Then, by Lemma 1, we have that with probability $1 - \delta$,
$$L_0(f_o) \le \hat L_\gamma(f_o) + 4\sqrt{\frac{16h\ln(24h)\left(2\beta^3(1 + c_1)^3(1 + c_3)(B + 1) + 2m\beta(1 + c_1)(B + 1)(1 - c_1)^2 + m^2(1 - c_1)^3\right)^2\|W_\bullet\|_F^2}{\gamma^2m^4(1 - c_1)^6(M - 1)} + \frac{\ln\frac{6M}{\delta}}{M - 1}}.$$
Now we take a union bound over $\tilde\beta$ so that the above result holds for all $\beta$. Observe that we only need to consider $\beta$ in the range
$$\frac{\gamma m}{2(B + 1)} \le \beta \le \frac{\gamma m\sqrt{M}}{2(B + 1)}.$$
If $\beta \le \frac{\gamma m}{2(B + 1)}$, then $|f(x)| \le \frac{\beta(B + 1)}{m} \le \frac{\gamma}{2}$, so $\hat L_\gamma(f) = 1$ and the bound holds trivially. If $\beta \ge \frac{\gamma m\sqrt{M}}{2(B + 1)}$, then the first term under the square root is greater than 1, so the theorem holds trivially. Finally, substituting $c_3 = \frac{(1 + c_1)\gamma}{4(1 - c_1)\beta}$, we get the theorem statement. We can also effectively optimize over $c_1$ to remove it from the final bound.
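The key quantities in the proofs above are easy to spot-check numerically. The sketch below (variable names are ours; $W$ is parameterized as $W = (1-m)I - A^TA + B - B^T$, following Winston & Kolter (2020), so that $I - W \succeq mI$) verifies the Proposition 1 contraction bound, the Proposition 2 iterate-norm bound, and the triangle-inequality bound on $\|\Delta_W\|_2$; note that the check keeps the $\|\Delta_B\|_2$ contribution from the skew-symmetric term explicit, which the displayed chain absorbs into its constants:

```python
import numpy as np

rng = np.random.default_rng(0)
h, d, m, s = 24, 8, 1.0, 0.01

# monDEQ parameterization guaranteeing I - W >= m I (Winston & Kolter, 2020).
A = rng.normal(size=(h, h)) / (2 * np.sqrt(h))
B = rng.normal(size=(h, h)) / (2 * np.sqrt(h))
W = (1 - m) * np.eye(h) - A.T @ A + B - B.T
U = rng.normal(size=(h, d))
b = rng.normal(size=h)
x = rng.normal(size=d)

L_ImW = np.linalg.norm(np.eye(h) - W, 2)
alpha = m / L_ImW**2          # lies in (0, 2m / L[I - W]^2), so T is a contraction
T = (1 - alpha) * np.eye(h) + alpha * W

# Proposition 1: L[T] <= sqrt(1 - 2 alpha m + alpha^2 L[I - W]^2) < 1.
L_T = np.linalg.norm(T, 2)
prop1 = np.sqrt(1 - 2 * alpha * m + alpha**2 * L_ImW**2)
assert L_T <= prop1 + 1e-10 and prop1 < 1

# Proposition 2: every iterate satisfies ||f^k|| <= alpha ||U x + b|| / (1 - L[T]).
z = np.zeros(h)
prop2 = alpha * np.linalg.norm(U @ x + b) / (1 - L_T)
for _ in range(100):
    z = np.maximum(T @ z + alpha * (U @ x + b), 0.0)
    assert np.linalg.norm(z) <= prop2 + 1e-8

# Triangle-inequality bound on ||Delta_W||_2 under perturbations of A and B.
dA = rng.normal(scale=s, size=(h, h))
dB = rng.normal(scale=s, size=(h, h))
W_pert = (1 - m) * np.eye(h) - (A + dA).T @ (A + dA) + (B + dB) - (B + dB).T
dW = np.linalg.norm(W_pert - W, 2)
nA, ndA, ndB = (np.linalg.norm(M, 2) for M in (A, dA, dB))
assert dW <= 2 * nA * ndA + ndA**2 + 2 * ndB + 1e-12
```

The per-iterate assertion checks Proposition 2 at every $k$, not only in the limit.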

D EQUIVALENT DNNS FOR UNROLLED MONDEQ ITERATIONS

In this section, we derive the form of the equivalent feedforward DNN that computes the same quantity as running k iterations of either the forward-backward or the Peaceman-Rachford iterations, which are operator-splitting methods for computing the fixed point of equation 1 (see Winston & Kolter (2020)).
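Both splitting methods compute the same equilibrium. The following sketch (variable names are ours; $W$ is parameterized as in Winston & Kolter (2020) so that the iterations converge for a suitable $\alpha$) runs each scheme to convergence and checks that they agree on the fixed point $z^* = \sigma(Wz^* + Ux + b)$:

```python
import numpy as np

rng = np.random.default_rng(0)
h, d, m = 16, 6, 1.0

# Random monDEQ with I - W >= m I.
A = rng.normal(size=(h, h)) / (2 * np.sqrt(h))
B = rng.normal(size=(h, h)) / (2 * np.sqrt(h))
W = (1 - m) * np.eye(h) - A.T @ A + B - B.T
U = rng.normal(size=(h, d))
b = rng.normal(size=h)
x = rng.normal(size=d)

# Step size inside (0, 2m / L[I - W]^2) so forward-backward contracts (Prop. 1).
alpha = m / np.linalg.norm(np.eye(h) - W, 2) ** 2
T = (1 - alpha) * np.eye(h) + alpha * W

z_fb = np.zeros(h)
for _ in range(5000):                      # forward-backward iterations
    z_fb = np.maximum(T @ z_fb + alpha * (U @ x + b), 0.0)

V = np.linalg.inv(np.eye(h) + alpha * (np.eye(h) - W))
u, z_pr = np.zeros(h), np.zeros(h)
for _ in range(5000):                      # Peaceman-Rachford iterations
    u = u - 2 * z_pr + 4 * V @ z_pr - 2 * V @ u + 2 * alpha * V @ (U @ x + b)
    z_pr = np.maximum(u, 0.0)

# Both schemes reach the same equilibrium z* = relu(W z* + U x + b).
assert np.allclose(z_fb, z_pr, atol=1e-6)
assert np.allclose(z_fb, np.maximum(W @ z_fb + U @ x + b, 0.0), atol=1e-6)
```

The iteration counts are deliberately generous; in practice both schemes converge in far fewer steps, with Peaceman-Rachford typically the faster of the two.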

D.1 UNROLLING FORWARD-BACKWARD ITERATIONS

One step of the forward-backward iteration computes
$$z^{(i+1)} = \sigma\left(((1 - \alpha)I + \alpha W)z^{(i)} + \alpha(Ux + b)\right).$$
To simulate $k$ steps of these computations as a depth-$k$ feedforward network, let us construct the following weight matrices:
$$W_1 = \begin{bmatrix}\alpha U \\ I\end{bmatrix}, \qquad W_u = \begin{bmatrix}(1 - \alpha)I + \alpha W & \alpha U \\ 0 & I\end{bmatrix}, \qquad W_f = \begin{bmatrix}W_o & 0\end{bmatrix}.$$
Then, we can observe that
$$\begin{bmatrix}z^{(1)} \\ x\end{bmatrix} = \sigma\left(W_1 x + \begin{bmatrix}\alpha b \\ 0\end{bmatrix}\right), \qquad \begin{bmatrix}z^{(i+1)} \\ x\end{bmatrix} = \sigma\left(W_u\begin{bmatrix}z^{(i)} \\ x\end{bmatrix} + \begin{bmatrix}\alpha b \\ 0\end{bmatrix}\right).$$
Here, $\sigma$ applies only to the top $h$ coordinates, corresponding to $z^{(i)}$. Finally, we multiply the output of the monDEQ after $k$ iterations with the output weights, to obtain $f_o(x)$ as
$$f_o(x) = W_f\begin{bmatrix}z^{(k)} \\ x\end{bmatrix} + b_o.$$
Thus, the equivalent depth-$k$ DNN which we construct has weight matrices $W_1, [W_u]_{i=1}^{k-1}, W_f$. In this DNN, the ReLUs in the intermediate layers apply only to the top coordinates corresponding to $z$ (this is the technical reason why we cannot use the SDP bound of Fazlyab et al. (2019) for computing the Lipschitz constants of these unrolled networks, since that method requires the same nonlinearity to apply pointwise at all coordinates).

D.2 UNROLLING PEACEMAN-RACHFORD ITERATIONS

Define $V = (I + \alpha(I - W))^{-1}$. One step of the Peaceman-Rachford iteration computes
$$u^{(i+1)} = u^{(i)} - 2z^{(i)} + 4Vz^{(i)} - 2Vu^{(i)} + 2\alpha VUx + 2\alpha Vb, \qquad z^{(i+1)} = \sigma(u^{(i+1)}).$$
To simulate $k$ steps of these computations as a depth-$k$ feedforward network, let us construct the following weight matrices:
$$W_1 = \begin{bmatrix}2\alpha VU \\ 2\alpha VU \\ I\end{bmatrix}, \qquad W_u = \begin{bmatrix}I - 2V & 4V - 2I & 2\alpha VU \\ I - 2V & 4V - 2I & 2\alpha VU \\ 0 & 0 & I\end{bmatrix}, \qquad W_f = \begin{bmatrix}0 & W_o & 0\end{bmatrix}.$$
Then, we can observe that
$$\begin{bmatrix}u^{(1)} \\ z^{(1)} \\ x\end{bmatrix} = \sigma\left(W_1 x + \begin{bmatrix}2\alpha Vb \\ 2\alpha Vb \\ 0\end{bmatrix}\right), \qquad \begin{bmatrix}u^{(i+1)} \\ z^{(i+1)} \\ x\end{bmatrix} = \sigma\left(W_u\begin{bmatrix}u^{(i)} \\ z^{(i)} \\ x\end{bmatrix} + \begin{bmatrix}2\alpha Vb \\ 2\alpha Vb \\ 0\end{bmatrix}\right).$$
Here, $\sigma$ applies only to the middle $h$ coordinates, corresponding to $z^{(i)}$. Finally, we multiply the output of the monDEQ after $k$ iterations with the output weights, to obtain $f_o(x)$ as
$$f_o(x) = W_f\begin{bmatrix}u^{(k)} \\ z^{(k)} \\ x\end{bmatrix} + b_o.$$
Thus, the equivalent depth-$k$ DNN which we construct has weight matrices $W_1, [W_u]_{i=1}^{k-1}, W_f$, and the ReLUs in the intermediate layers apply only to the coordinates corresponding to $z$.
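As a sanity check that the forward-backward construction above reproduces the iterates exactly, the following sketch (our variable names; $W$ parameterized as in Winston & Kolter (2020)) compares $k$ direct iterations against the unrolled depth-$k$ network:

```python
import numpy as np

rng = np.random.default_rng(0)
h, d, k, m, alpha = 8, 4, 6, 1.0, 0.05

A = rng.normal(size=(h, h)) / (2 * np.sqrt(h))
Bm = rng.normal(size=(h, h)) / (2 * np.sqrt(h))
W = (1 - m) * np.eye(h) - A.T @ A + Bm - Bm.T
U = rng.normal(size=(h, d))
b = rng.normal(size=h)
x = rng.normal(size=d)
T = (1 - alpha) * np.eye(h) + alpha * W

# Direct forward-backward iterations.
z = np.zeros(h)
for _ in range(k):
    z = np.maximum(T @ z + alpha * (U @ x + b), 0.0)

# Unrolled depth-k network with the block weight matrices from Section D.1.
W1 = np.vstack([alpha * U, np.eye(d)])
Wu = np.block([[T, alpha * U],
               [np.zeros((d, h)), np.eye(d)]])
bias = np.concatenate([alpha * b, np.zeros(d)])

def act(v):
    # ReLU on the top h coordinates only; the bottom d coordinates carry x.
    return np.concatenate([np.maximum(v[:h], 0.0), v[h:]])

zbar = act(W1 @ x + bias)
for _ in range(k - 1):
    zbar = act(Wu @ zbar + bias)

assert np.allclose(zbar[:h], z)    # same z^(k) as the direct iterations
assert np.allclose(zbar[h:], x)    # the input is carried through unchanged
```

The partial activation `act` is exactly the structured nonlinearity described above, which is what rules out off-the-shelf SDP Lipschitz bounds for these unrolled networks.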

E GENERALIZATION BOUND FOR UNROLLED MONDEQ

In this section, we derive a generalization bound for an unrolled monDEQ after unrolling the forward-backward iterations $d$ times and constructing the equivalent depth-$d$ network. The analysis follows the derivation of the generalization bound for DNNs by Neyshabur et al. (2018), but we have to be careful, since the weight matrices across the hidden layers are all the same and have a particular parameterized form. Since the analysis in Neyshabur et al. (2018) does not include biases in the DNNs, here we consider unrolling monDEQs which do not have the bias term $b$. Take the weights of each layer to be
$$W_u = \begin{bmatrix}I - \alpha(I - W) & \alpha U \\ 0 & I\end{bmatrix},$$
where the network is now a function of $\bar z^{(i)} := \begin{bmatrix}z^{(i)} \\ x\end{bmatrix}$ and the ReLU applies only to the $z^{(i)}$ component, and take $W_f = W_o$. Our prior distribution will now be over $\Delta_{W_o}, \Delta_A, \Delta_B, \Delta_U \sim \mathcal{N}(0, \sigma^2 I)$. Similarly to the corresponding step in Appendix C above, we have that with probability $\ge \frac{1}{2}$, the perturbations are each bounded by $\omega := \sigma\sqrt{2h\ln(16h)}$. Also,
$$\|\Delta_W\|_2 = \|\bar A^T\bar A - A^TA + \bar B - B - \bar B^T + B^T\|_2 \le \|\Delta_A\|_2^2 + 2\|\Delta_A\|_2\|A\|_2 + 2\|\Delta_B\|_2,$$
and so
$$\|\Delta_{W_u}\|_2 = \left\|\begin{bmatrix}\alpha\Delta_W & \alpha\Delta_U \\ 0 & 0\end{bmatrix}\right\|_2 \le \alpha\left(\|\Delta_A\|_2^2 + 2\|\Delta_A\|_2\|A\|_2 + 2\|\Delta_B\|_2 + \|\Delta_U\|_2\right).$$
We take $\beta = \max\{\|A\|_2, \|B\|_2, \|W_u\|_2, \|W_o\|_2\}$. We assume that $|\tilde\beta - \beta| \le \frac{1}{d}\beta$ and $d \ge 2$, so that $\tilde\beta^{d-1} \le e\beta^{d-1}$. Then, applying Lemma 2 of Neyshabur et al. (2018), we have
$$\begin{aligned}
\|f_{W_u+\Delta_{W_u}} - f_{W_u}\|_2 &\le eB\beta^{d-1}\left((d-1)\|\Delta_{W_u}\|_2 + \|\Delta_{W_o}\|_2\right) \\
&\le eB\beta^{d-1}\left((d-1)\alpha\left(\|\Delta_A\|_2^2 + 2\|\Delta_A\|_2\|A\|_2 + 2\|\Delta_B\|_2 + \|\Delta_U\|_2\right) + \|\Delta_{W_o}\|_2\right) \\
&\le eB\beta^{d-1}\left((d-1)\alpha\left(\omega^2 + 2\omega\beta + 3\omega\right) + \omega\right) \\
&\le eB\beta^{d-1}\left((d-1)\alpha\left(\tfrac{1}{d}\beta + 2\beta + 3\right) + 1\right)\omega \\
&\le e^2B\tilde\beta^{d-1}\left((d-1)\alpha\left(\tfrac{e}{d}\tilde\beta + 2e\tilde\beta + 3\right) + 1\right)\sigma\sqrt{2h\ln(16h)} \le \frac{\gamma}{4}
\end{aligned}$$
if we choose
$$\sigma = \frac{\gamma}{4e^2B\tilde\beta^{d-1}\left((d-1)\alpha\left(\tfrac{e}{d}\tilde\beta + 2e\tilde\beta + 3\right) + 1\right)\sqrt{2h\ln(16h)}}.$$
Let $\|W_\bullet\|_F^2 = \|A\|_F^2 + \|B\|_F^2 + \|U\|_F^2 + \|W_o\|_F^2$. Then
$$\begin{aligned}
KL\left((A + \Delta_A, B + \Delta_B, U + \Delta_U, W_o + \Delta_{W_o})\,\|\,P\right) &\le \frac{\|W_\bullet\|_F^2}{2\sigma^2} = \frac{16e^4B^2\tilde\beta^{2d-2}\left((d-1)\alpha\left(\tfrac{e}{d}\tilde\beta + 2e\tilde\beta + 3\right) + 1\right)^2 2h\ln(16h)}{\gamma^2}\|W_\bullet\|_F^2 \\
&\le \frac{16e^2B^2\beta^{2d-2}\left((d-1)\alpha\left(\tfrac{1}{d}\beta + 2\beta + 3\right) + 1\right)^2 2h\ln(16h)}{\gamma^2}\|W_\bullet\|_F^2.
\end{aligned}$$
Now, instantiating Lemma 1 above, we have that with probability at least $1 - \delta$,
$$L_0(f) \le \hat L_\gamma(f) + 4\sqrt{\frac{16e^2B^2\beta^{2d-2}\left((d-1)\alpha\left(\tfrac{1}{d}\beta + 2\beta + 3\right) + 1\right)^2 2h\ln(16h)\,\|W_\bullet\|_F^2}{\gamma^2(M - 1)} + \frac{\ln\frac{6M}{\delta}}{M - 1}}.$$
Following Neyshabur et al. (2018), we take a union bound over $\tilde\beta$ so that the above result holds for all $\beta$. Observe that we only need to consider $\beta$ in the range
$$\left(\frac{\gamma}{2B}\right)^{1/d} \le \beta \le \left(\frac{\gamma\sqrt{M}}{2B}\right)^{1/d}.$$
If $\beta^d \le \frac{\gamma}{2B}$, then $|f(x)| \le \beta^dB \le \frac{\gamma}{2}$, so $\hat L_\gamma(f) = 1$ and the bound holds trivially. If $\beta^d \ge \frac{\gamma\sqrt{M}}{2B}$, then the first term under the square root is greater than 1, so the result holds trivially.
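The block-structure bound on $\|\Delta_{W_u}\|_2$ used above is a direct consequence of the triangle inequality, and can be spot-checked numerically (a sketch with our own variable names; the sign convention follows the parameterization $W = (1-m)I - A^TA + B - B^T$, which leaves the norms unchanged):

```python
import numpy as np

rng = np.random.default_rng(0)
h, d, alpha, s = 12, 5, 0.1, 0.02

A = rng.normal(size=(h, h)) / np.sqrt(h)
dA = rng.normal(scale=s, size=(h, h))
dB = rng.normal(scale=s, size=(h, h))
dU = rng.normal(scale=s, size=(h, d))

# Perturbation of W = (1-m)I - A^T A + B - B^T when A and B are perturbed.
dW = -(A.T @ dA + dA.T @ A + dA.T @ dA) + dB - dB.T
dWu = np.block([[alpha * dW, alpha * dU],
                [np.zeros((d, h)), np.zeros((d, d))]])

lhs = np.linalg.norm(dWu, 2)
ndA, ndB, ndU = (np.linalg.norm(M, 2) for M in (dA, dB, dU))
rhs = alpha * (ndA**2 + 2 * ndA * np.linalg.norm(A, 2) + 2 * ndB + ndU)
assert lhs <= rhs + 1e-12
```

In practice the spectral norm of the zero-padded block matrix is noticeably smaller than the triangle-inequality bound, so the slack here is conservative.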

G ADDITIONAL MONDEQ GENERALIZATION BOUND RESULTS

We validate that both the monDEQ and DNN models considered in Section 5.2 for generalization-bound comparisons indeed achieve comparable accuracy. Figures 8a and 8b illustrate this: the various models considered all achieve test error in the same range. Next, we compute generalization bounds for unrolled monDEQs (Appendix E) for a variety of α values, to get a sense of the quality of our generalization bound for the monDEQ. In Figure 8c, we observe that the generalization bounds for these unrolled networks are larger (on the order of 10^6) than the generalization bound for the monDEQ (on the order of 10^5), indicating that our direct bound is tighter. We also run the same experiments as in Section 5.3 on the CIFAR-10 dataset. We train single-convolution monDEQs (128 channels) with a range of m values, as well as 3 CNN models, CNN sm, CNN med and CNN lg, whose architectures are described in Appendix F. As in the MNIST experiments, varying m gives a range of trade-offs between clean accuracy and adversarial robustness. In terms of certified robustness, we see in Figure 9a that all of the monDEQs besides m = 20 have both better clean and better robust accuracy than all the CNNs (with the exception of the CNN with d = 2, which has much lower clean accuracy). In Figure 9b, in terms of empirical robustness, we see that m parameterizes a similar trade-off between robust and clean accuracy. In fact, it is possible to choose m (e.g. m = 1 here) such that the monDEQ outperforms all CNNs in terms of clean accuracy and at all attack sizes.



Experimental code available at https://github.com/locuslab/lipschitz_mondeq.



Figure 1: MNIST results: Lipschitz bounds as a function of depth and strong monotonicity parameter. lb: lower bound; ub: upper bound.

(a) Lipschitz bounds for monDEQs vs. m and for three CNN models; solid lines: upper bounds, dashed lines: empirical lower bounds. (b) Test error for CNNs and monDEQs vs. m.

Figure 2: CIFAR-10 results: Lipschitz bounds and test accuracy for monDEQs as a function of strong monotonicity parameter and for CNNs. See text for description of models.

Figure 3: Generalization bounds for DNNs and monDEQs as a function of depth and m.

Figure 4: Superior adversarial robustness of monDEQs as compared to DNNs

So $|\beta - \tilde\beta| \le \frac{c_1\gamma m}{2(B+1)}$ guarantees $|\beta - \tilde\beta| \le c_1\beta$ in this range, so we can use a cover of size $\sqrt{M}/2c_1$; this amounts to replacing $\ln\frac{6M}{\delta}$ with $\ln\frac{3M^{3/2}}{c_1\delta}$.

A grid fine enough to guarantee $|\tilde\beta - \beta| \le \frac{1}{d}\beta$ in this range suffices, so we can use a cover of size $dM^{1/2d}$, just as in Neyshabur et al. (2018).

Convolutional monDEQs shown in Figure 1d

Figure 5: Test error for the monDEQs from Section 5.1


Figure 7: Lipschitz bounds for monDEQ by unrolling forward-backward or Peaceman-Rachford, for a range of α.


Figure 8: Test error for DNNs and monDEQs, and monDEQ generalization bounds by unrolling forward-backward iterations.

Figure 9: Adversarial robustness of monDEQs as compared to CNNs

Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Advances in Neural Information Processing Systems, pp. 3835-3844, 2018.

Lily Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Luca Daniel, Duane Boning, and Inderjit Dhillon. Towards fast computation of certified robustness for ReLU networks. In International Conference on Machine Learning, pp. 5276-5285, 2018.

F ADDITIONAL EXPERIMENTAL DETAILS AND RESULTS

MNIST test error. In Figure 5 we plot the test error of the width-40 monDEQs for which the Lipschitz bounds are given in Figure 1b. We see that it increases from 2.4% for m = 0.5 to 4.2% for m = 20. For comparison, the DNNs shown in Figure 1a have test error between 2.8% and 3.2%, with no trend with respect to depth.

MNIST width experiments. Figure 6 shows the lower and upper bounds for DNNs and monDEQs of varying widths, as described in Section 5.1. All CNN models use ReLU activations, kernels of size 3, and a final Linear layer with 10 outputs.

Unrolling monDEQs

The approximate upper bounds on monDEQ Lipschitz constants obtained by unrolling the operator splitting methods are shown in Figure 7 .

