GENERALIZED UNIVERSAL APPROXIMATION FOR CERTIFIED NETWORKS

Abstract

To certify safety and robustness of neural networks, researchers have successfully applied abstract interpretation, primarily using interval bound propagation. To understand the power of interval bounds, we present the abstract universal approximation (AUA) theorem, a generalization of the recent result by Baader et al. (2020) for ReLU networks to a large class of neural networks. The AUA theorem states that for any continuous function f, there exists a neural network that (1) approximates f (universal approximation) and (2) whose interval bounds are an arbitrarily close approximation of the set semantics of f. The network may be constructed using any activation function from a rich class of functions, including sigmoid, tanh, ReLU, and ELU, making our result quite general. The key implication of the AUA theorem is that there always exist certifiably robust neural networks, which can be constructed using a wide range of activation functions.

1. INTRODUCTION

With the wide adoption of neural networks, new safety and security concerns have arisen. The most prominent property of study has been robustness (Goodfellow et al., 2015): small perturbations to the input of a network should not change the prediction. For example, a small change to an image of a stop sign should not cause a classifier to think it is a speed-limit sign. A number of researchers have proposed the use of abstract interpretation (Cousot & Cousot, 1977) techniques to prove robustness of neural networks (Gehr et al., 2018; Wang et al., 2018; Anderson et al., 2019) and to train robust models (Mirman et al., 2018; Gowal et al., 2018; Huang et al., 2019; Wong & Kolter, 2018; Wong et al., 2018; Balunovic & Vechev, 2020). Suppose we want to verify robustness of a neural network to small changes in the brightness of an image. We can represent a large set of images, with varying brightness, as an element of some abstract domain, and propagate it through the network, effectively executing the network on an intractably large number of images. If all images lead to the same prediction, then we have a proof that the network is robust on the original image.

The simplest abstract interpretation technique that leads to practical verification results is interval analysis, also referred to as interval bound propagation. In our example, if each pixel in a monochrome image is a real number r, then the pixel can be represented as an interval [r − ε, r + ε], where ε denotes the range of brightness we wish to be robust to. Then, the box representing the interval of each pixel is propagated through the network using interval arithmetic operations. The interval domain has been successfully used for certifying properties of neural networks in vision (Gehr et al., 2018; Gowal et al., 2018), NLP (Huang et al., 2019), as well as cyber-physical systems (Wang et al., 2018).

Why does the interval domain work for certifying neural networks? To begin understanding this question, Baader et al.
(2020) demonstrated a surprising connection between the universal approximation theorem and neural-network certification using interval bounds. Their theorem states that not only can neural networks approximate any continuous function f (universal approximation), as we have known for decades, but we can find a neural network, using rectified linear unit (ReLU) activation functions, whose interval bounds are an arbitrarily close approximation of the set semantics of f, i.e., the result of applying f to a set of inputs (e.g., a set of similar images).

AUA theorem (semi-formally): For a continuous function f : R^m → R that we wish to approximate and error δ > 0, there is a neural network N with the following behavior: Let B ⊂ R^m be a box. If we propagate B through N using interval bounds, we get an interval N^#(B) whose lower and upper bounds are up to δ away from [min_{x∈B} f(x), max_{x∈B} f(x)], the tightest interval containing all outputs of f on B.

Figure 1: Semi-formal illustration of the AUA theorem. The red interval (top) is the tightest interval that contains all outputs of f when applied to x ∈ B. If we propagate box B through N using interval bounds, we may get the black interval (bottom) N^#(B), whose lower/upper bounds are up to δ away from the red interval. (Right is adapted from Baader et al. (2020).)

The theorem of Baader et al. (2020) is restricted to networks that use rectified linear units (ReLU). In this work, we present a general universal approximation result for certified networks using a rich class of well-behaved activation functions. Specifically, we make the following contributions.

Abstract universal approximation (AUA) theorem. We prove what we call the abstract universal approximation theorem, or AUA theorem for short: Let f be the function we wish to approximate, and let δ > 0 be the tolerated error. Then, there exists a neural network N, built using any well-behaved activation function, that has the following behavior: For any box of inputs B, we can certify, using interval bounds, that the range of outputs of N is δ-close to the range of outputs of f.
If the box B of inputs is a single point in Euclidean space, the AUA theorem reduces to the universal approximation theorem; thus, AUA generalizes universal approximation. Fig. 1 further illustrates the AUA theorem.

Existence of robust classifiers. While the AUA theorem is purely theoretical, it sheds light on the existence of certifiable neural networks. Suppose there is some ideal robust image classifier f using the ℓ∞ norm, which is typically used to define a set of images in the neighborhood of a given image. The classical universal approximation theorem tells us that, for any desired precision, there is a neural network that can approximate f. We prove that the AUA theorem implies that there exists a neural network for which we can automatically certify robustness using interval bounds while controlling the approximation error. In addition, this neural network can be built using almost any activation function in the literature, and more.

Squashable functions. We define a rich class of activation functions, which we call squashable functions, for which our abstract universal approximation theorem holds. This class expands the functions defined by Hornik et al. (1989) for universal approximation and includes popular activation functions, like ReLU, sigmoid, tanh, ELU, and other activations that have been shown to be useful for training robust neural networks (Xie et al., 2020). The key feature of squashable activation functions is that they have left and right limits (or we can use them to construct functions with limits). We exploit limits to approximate step (sign) functions, and therefore construct step-like approximations of f, while controlling the approximation error δ.

Proof of AUA theorem. We present a constructive proof of the AUA theorem. Our construction is inspired by and synthesizes a range of results: (1) the work of Hornik et al.
(1989) on squashing functions for universal approximation, (2) the work of Csáji (2001) on using the sign (step) function to construct Haar (wavelet) functions, and (3) the work of Baader et al. (2020) on the specialized AUA theorem for ReLUs. The key idea of Baader et al. (2020) is to construct an indicator function for box-shaped regions. We observe that squashable functions can approximate the sign function, and can therefore approximate such indicator functions, while carefully controlling the precision of abstract interpretation. Our proof uses a simpler indicator construction than that of Baader et al. (2020), and as a result its analysis is also simpler.

2. RELATED WORK

The classical universal approximation (UA) theorem has been established for decades. In contrast to AUA, UA states that a neural network with a single hidden layer can approximate any continuous function on a compact domain. One of the first versions goes back to Cybenko (1989) and Hornik et al. (1989), who showed that the standard feed-forward neural network with sigmoidal or squashing activations is a universal approximator. The most general version of UA was discovered by Leshno et al. (1993), who showed that the feed-forward neural network is a universal approximator if and only if the activation function is non-polynomial. Because AUA implies UA, AUA cannot hold for polynomial activation functions.

Squashable activation functions that satisfy Eq. (1):
σ(x) = 1 / (1 + e^(−x))
tanh(x) = 2 / (1 + e^(−2x)) − 1
softsign(x) = x / (1 + |x|)

Squashable activation functions that do not directly satisfy Eq. (1):
ReLU(x) = x if x ≥ 0, and 0 if x < 0
ELU(x) = x if x ≥ 0, and e^x − 1 if x < 0
softplus(x) = log(1 + e^x)
smoothReLU_a(x) = x − (1/a) log(ax + 1) if x ≥ 0, and 0 if x < 0

Figure 2: Examples of squashable activation functions, including popular functions and more recent ones: sigmoid, tanh, rectified linear units (ReLU) (Nair & Hinton, 2010), exponential linear unit (ELU) (Clevert et al., 2016), softplus (Glorot et al., 2011), softsign (Bergstra et al., 2009), and smooth ReLU (Xie et al., 2020), which is parameterized by a > 0.

The AUA theorem also extends to richer abstract domains, such as forms of polyhedra (Cousot & Halbwachs, 1978). Since such domains are strictly more precise than intervals, the AUA theorem holds for them.

3.1. NEURAL NETWORKS AND SQUASHABLE ACTIVATIONS

A neural network N is a function in R^m → R. We define a network N following a simple grammar: a composition of primitive arithmetic operations and activation functions.

Definition 3.1 (Neural network grammar). Let x ∈ R^m be the input to the neural network. A neural network N is defined using the following grammar:
N ::= c | x_i | N_1 + N_2 | c * N_1 | t(N_1),
where c ∈ R, x_i is the ith element of x, and t : R → R is an activation function. We will always fix a single activation function t to be used in the grammar.

We now present a general class of activation functions that we will call squashable activation functions. Fig. 2 shows some examples.

Definition 3.2 (Squashable functions). t : R → R is squashable iff (1) there exist a < b ∈ R such that lim_{x→−∞} t(x) = a, lim_{x→∞} t(x) = b, and ∀x < y. t(x) ≤ t(y); or (2) we can construct a function t′ that satisfies condition (1) using affine transformations and function compositions of copies of t, i.e., following the grammar in Def. 3.1, e.g., t′(x) = t(1 − t(−x)).

Informally, an activation function is in this class if we can use it to construct a monotonically increasing function that has limits in the left and right directions, −∞ and ∞. Squashable activation functions extend the squashing functions used by Hornik et al. (1989).

Figure 3: Two activation functions after applying the construction in Proposition 3.3: (a) ReLU(1 − ReLU(−x)); (b) softplus(1 − softplus(−x)). Observe that the resulting functions satisfy Eq. (1), and therefore ReLU and softplus are squashable.

All of the activation functions in Fig. 2 are squashable. Fig. 2 (top) shows activation functions that satisfy Eq. (1) and are therefore squashable. For example, sigmoid and tanh easily satisfy Eq. (1): both have limits and are monotonically increasing. What about activation functions like ReLU, ELU, softplus, etc., shown in Fig. 2 (bottom)? It is easy to see that they do not satisfy Eq. (1): none of them have a right limit.
However, by point (2) of Def. 3.2, given an activation function t, if we can construct a new activation function t′ that satisfies Eq. (1) using the operations in the grammar in Def. 3.1, then t is squashable. We give a general and simple construction that works for all activation functions in Fig. 2 (bottom).

Proposition 3.3. Let t ∈ {ReLU, softplus, smoothReLU_a, ELU}. The function t′(x) = t(1 − t(−x)) satisfies Eq. (1). Therefore, ReLU, softplus, smooth ReLU, and ELU are squashable.

Example 3.4. Fig. 3 shows t(1 − t(−x)) for t = ReLU and t = softplus. Both have left/right limits and are monotonic. Thus, they satisfy Eq. (1), and therefore ReLU and softplus are squashable.
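As a numeric sanity check on this construction (illustrative only, not part of the paper's formal development), the limits of t′(x) = t(1 − t(−x)) can be probed far to the left and right:

```python
import math

def relu(x):
    return max(x, 0.0)

def softplus(x):
    return math.log(1.0 + math.exp(x))

def squash(t, x):
    # Construction from Proposition 3.3: t'(x) = t(1 - t(-x)).
    return t(1.0 - t(-x))

# Far to the left, t' approaches its left limit l;
# far to the right, it approaches t(1 - l).
print(squash(relu, -100.0))     # left limit of ReLU': 0
print(squash(relu, 100.0))      # right limit of ReLU': ReLU(1) = 1
print(squash(softplus, 100.0))  # right limit of softplus': log(1 + e)
```

Consistent with the values computed in Appendix B.1, the left limits are l and the right limits are t(1 − l).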

3.2. INTERVAL ANALYSIS OF NEURAL NETWORKS

Given f : R^m → R and a set S ⊆ R^m, we will use f(S) to denote {f(x) | x ∈ S}. We now define interval versions of the operations of a neural network, known as abstract transformers. These were first introduced by Cousot & Cousot (1977), who also proved the soundness of the interval domain. An m-dimensional box B is a tuple of intervals [l_1, u_1] × ... × [l_m, u_m]. All our operations are over scalars, so we define abstract transformers over 1D boxes.

Definition 3.5 (Arithmetic abstract transformers). Let B be an m-dimensional box input to the neural network. We follow the grammar in Def. 3.1 to define the abstract transformers:
c^# = [c, c]
x_i^# = [l_i, u_i], where l_i, u_i are the ith lower and upper bounds of B
[l_1, u_1] +^# [l_2, u_2] = [l_1 + l_2, u_1 + u_2]
[c, c] *^# [l, u] = [min(c * l, c * u), max(c * l, c * u)]

Definition 3.6 (Abstract transformer for activations). Let B = [l, u]. Then, t^#(B) = [t(l), t(u)]. This transformer, which is exact for monotonically increasing activations, was introduced in Gehr et al. (2018).
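The transformers above can be transcribed directly into code. The following is an illustrative Python sketch (intervals as pairs; `act` assumes a monotonically increasing activation, as Def. 3.6 does):

```python
import math

def const(c):
    return (c, c)                       # c^#

def add(i1, i2):                        # +^#
    (l1, u1), (l2, u2) = i1, i2
    return (l1 + l2, u1 + u2)

def scale(c, i):                        # *^# with a constant
    l, u = i
    return (min(c * l, c * u), max(c * l, c * u))

def act(t, i):
    # t^# for a monotonically increasing activation (Def. 3.6).
    l, u = i
    return (t(l), t(u))

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

# Propagate the box [-1, 1] through the network sigmoid(2*x + 1).
box = act(sigmoid, add(scale(2.0, (-1.0, 1.0)), const(1.0)))
print(box)  # (sigmoid(-1), sigmoid(3))
```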

4. ABSTRACT UNIVERSAL APPROXIMATION THEOREM & ITS IMPLICATIONS

In this section, we state the abstract universal approximation (AUA) theorem and its implications. Assume a fixed continuous function f : C → R, with a compact domain C ⊂ R^m, that we wish to approximate.

Definition 4.1 (δ-abstract approximation). Let δ > 0. A neural network N δ-abstractly approximates f iff for every box B ⊆ C, we have [l + δ, u − δ] ⊆ N^#(B) ⊆ [l − δ, u + δ], where l = min f(B) and u = max f(B).

δ-abstract approximation says that the box output of abstract interpretation, N^#(B), is up to δ away from the tightest bounding box around the set semantics f(B). This notion was developed in Baader et al. (2020). Observe that the standard notion of approximation is a special case of δ-abstract approximation, when the box B is a point in C. We now state our main theorem:

Theorem 4.2 (Abstract universal approximation). Let f : C → R be a continuous function on a compact domain C ⊂ R^m. Let t be a squashable activation function. Let δ > 0. There exists a neural network N, using only activations t, that δ-abstractly approximates f.

Informally, the theorem says that we can always find a neural network whose abstract interpretation is arbitrarily close to the set semantics of the approximated function. Note also that there exists such a neural network for any fixed squashable activation function t. In the appendix, we give a generalization of the AUA theorem to functions and networks with multiple outputs. As we discuss next, the AUA theorem has very exciting implications: we can show that one can always construct provably robust neural networks using any squashable activation function (Thm. 4.5). We begin by defining a robust classifier in the ℓ∞ norm. We treat f : C → R as a binary classifier, where an output < 0.5 represents one class and an output ≥ 0.5 represents the other.

Definition 4.3 (ε-Robustness). Let x ∈ R^m, M ⊆ C, and ε > 0. The ε-ball of x is R_ε(x) = {z | ‖z − x‖∞ ≤ ε}. We say that f is ε-robust on a set M iff, for all x ∈ M and z ∈ R_ε(x), we have f(x) < 0.5 iff f(z) < 0.5.
Definition 4.4 (Certifiably robust networks). A neural network N is ε-certifiably robust on M iff, for all x ∈ M, we have N^#(B) ⊆ (−∞, 0.5) or N^#(B) ⊆ [0.5, ∞), where B = R_ε(x). (Note that an ε-ball is a box in R^m.)

From an automation perspective, the set M is typically a finite set of points, e.g., images. For every x ∈ M, the verifier abstractly interprets N on the ε-ball of x, deriving a lower bound and an upper bound of the set of predictions N(R_ε(x)). If the lower bound is ≥ 0.5 or the upper bound is < 0.5, then we have proven that all images in the ε-ball have the same classification under N. Assuming there is some ideal robust classifier, then, following the AUA theorem, we can construct a neural network, using any squashable activation function, that matches the classifier's predictions and is certifiably robust. Refer to the supplementary materials for an extension to n-ary classifiers.

Theorem 4.5 (Existence of robust networks). Let f : C → R be ε-robust on a set M ⊆ C. Assume that ∀x ∈ M, z ∈ R_ε(x). f(z) ≠ 0.5. Let t be a squashable activation function. Then, there exists a neural network N, using activation functions t, that (1) agrees with f on M, i.e., ∀x ∈ M. N(x) < 0.5 iff f(x) < 0.5, and (2) is ε-certifiably robust on M.
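Definition 4.4 suggests a mechanical certification check: propagate the ε-ball as a box and compare the output interval against the 0.5 decision threshold. Below is an illustrative sketch for a hypothetical one-neuron classifier (the weights 3, −2, 1 are invented for the example):

```python
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

def interval_net(lo1, hi1, lo2, hi2):
    # N(x) = sigmoid(3*x1 - 2*x2 + 1), propagated with interval
    # arithmetic: lower bound uses lo1/hi2, upper bound uses hi1/lo2.
    l = 3 * lo1 - 2 * hi2 + 1
    u = 3 * hi1 - 2 * lo2 + 1
    return sigmoid(l), sigmoid(u)

def certify(x, eps):
    x1, x2 = x
    lo, hi = interval_net(x1 - eps, x1 + eps, x2 - eps, x2 + eps)
    if hi < 0.5:
        return True   # every point in the eps-ball is in class 0
    if lo >= 0.5:
        return True   # every point in the eps-ball is in class 1
    return False      # interval straddles 0.5: cannot certify

print(certify((2.0, 0.0), 0.1))   # bounds stay above 0.5: certified
print(certify((0.0, 0.0), 0.5))   # bounds straddle 0.5: not certified
```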

5. PROOF OF AUA THEOREM: AN OVERVIEW

We now give an overview of our proof of the AUA theorem, focusing on its novelty. Our proof is constructive: we show how to construct a neural network that δ-abstractly approximates a function f. It is a classical idea to use indicator functions to approximate a continuous function in a piecewise fashion; see Nielsen (2015, Ch. 4) for an interactive visualization of universal approximation. However, for AUA, the input to the function is a box, and therefore we need to make sure that our indicator-function approximations provide tight abstract approximations (Def. 4.1), as will be shown in Thm. 6.2.

Composing indicator functions (Sec. 7). Once we have approximated indicator functions, we need to put them together to approximate the entire function f. The remainder of the construction is an adaptation of one by Baader et al. (2020) for ReLU networks. The construction starts by slicing f into a sequence of functions f_i such that f = Σ_i f_i. Each f_i captures the behavior of f on a small interval of its range. See Fig. 4 for an example of slicing. Next, we approximate each f_i using a neural network N_i. Slicing ensures that δ-abstract approximation is tight for large boxes. Because f(B) = Σ_i f_i(B), when approximating f_i(B) using N_i(B), we can show that for most i, f_i(B) ≈ N_i(B). The smaller the range of each slice f_i, the tighter the abstract approximation of f using indicator functions.

Our construction and analysis differ from Baader et al. (2020) in the following ways:

1. Baader et al. (2020) focus exclusively on ReLU activations. In our work, we consider squashable functions, which contain most commonly used activation functions. Compared to ReLU, which has rigid values, we only know that squashable functions have limits on both sides. This makes the class of functions more expressive but also harder to quantify.
When analyzing the interval bounds of the whole network construction, we need to take into account the extra imprecision, which propagates from the indicator function to the whole neural network.

2. The key construction of Baader et al. (2020) is to build min(x_1, x_2, ..., x_{2m}) using ReLUs. The depth of the construction depends on m, and the analysis of its interval bounds is rather complicated. Our construction uses only two layers of activations, resulting in a much simpler analysis. Because ReLU is a squashable function, a by-product is that, if we only consider AUA for ReLU, our construction and its analysis are simpler than those of Baader et al. (2020).

6. ABSTRACTLY APPROXIMATING INDICATOR FUNCTIONS

We begin by showing the crux of our construction: how to approximate an indicator function. Recall that our goal is to δ-abstractly approximate a continuous function f : C → R.

Indicator-function approximation intuition. Given any activation function t that satisfies Eq. (1), a key observation is that if we dilate t properly, i.e., multiply the input by a large number µ to get t(µx), we obtain an approximation of the sign (or step) indicator function, a function that indicates whether an input is positive or negative: sign(x) = 1 if x > 0 and 0 otherwise. A sign function can be used to construct an indicator function for 1-dimensional boxes. For example, sign(x) − sign(x − 1) returns 1 for x ∈ (0, 1], and 0 otherwise. In what follows, we will use the above ideas to approximate the sign function and the indicator function for m-dimensional boxes.
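The dilation idea is easy to see numerically: a sigmoid with a large factor µ behaves like sign, and the difference of two shifted copies behaves like an interval indicator. An illustrative sketch (µ = 1000 is an arbitrary choice):

```python
import math

def sigmoid(x):
    # Numerically stable sigmoid: avoids overflow for large negative x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sign_approx(x, mu=1000.0):
    # Dilated sigmoid: approaches sign(x) as mu grows.
    return sigmoid(mu * x)

def indicator_01(x, mu=1000.0):
    # Approximates sign(x) - sign(x - 1): ~1 on (0, 1], ~0 otherwise.
    return sign_approx(x, mu) - sign_approx(x - 1.0, mu)

print(round(indicator_01(0.5), 6))   # inside the box: 1.0
print(round(indicator_01(2.0), 6))   # outside the box: 0.0
```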

6.1. APPROXIMATING A ONE-DIMENSIONAL INDICATOR FUNCTION

We will first show how to construct an indicator function for a 1D box using a squashable function. The main challenge is choosing a dilation factor that results in a tight abstract approximation. Without loss of generality, assume we are given a squashable function t that (1) satisfies Eq. (1) and (2) has left and right limits of 0 and 1, respectively.

Loss of precision from limits. The activation function t has limits on both sides, but the function might never reach the limits. For example, the right limit of the sigmoid function σ is 1, but ∀x. σ(x) ≠ 1. This will lead to a loss of precision when we use t to model a sign function. However, we can carefully apply mathematical analysis to rigorously bound this imprecision.

Dilation to approximate the sign function. We now discuss how to dilate t to get sign-function-like behavior. By the definition of limits, we know the following lemma, which states that, for any θ > 0, by sufficiently increasing the input of t we can get θ-close to the right limit of 1, and analogously for the left limit. Figs. 5a and 5b show how sigmoid can approximate an indicator function. Because the grid size is ε, we want the sign-function approximation to achieve a transition from ≈ 0 to ≈ 1 within ε. Let µ be the dilation factor. We would like the following (Fig. 6 illustrates the loss of precision θ incurred by our construction):

Lemma 6.1. ∀θ > 0, ∃µ > 0 such that: (1) if x ≥ 0.5ε, then t(µx) ∈ (1 − θ, 1]; (2) if x ≤ −0.5ε, then t(µx) ∈ [0, θ).

Notice that our approximation may not be able to exactly tell whether we are in G or its neighborhood. Inspired by the construction of an indicator function from a sign function, we take the difference between two shifted sign functions. Let

t̃_i(x) = t(µ(x − (a_i − 0.5ε))) − t(µ(x − (b_i + 0.5ε)))    (2)

We choose the two points a_i − 0.5ε and b_i + 0.5ε, which lie midway between [a_i, b_i] and the boundary of its neighborhood, so that t̃_i's value is close to 1 within [a_i, b_i], and close to 0 outside [a_i, b_i]'s neighborhood.
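Eq. (2) can be instantiated concretely; the sketch below uses sigmoid as the squashable function and checks that t̃_i is near 1 inside [a_i, b_i] and near 0 outside its ε-neighborhood (the values of a_i, b_i, ε, and µ are chosen for illustration):

```python
import math

def sigmoid(x):
    # Numerically stable sigmoid.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def t_tilde(x, a, b, eps, mu):
    # Eq. (2): difference of two shifted, dilated sign approximations.
    return (sigmoid(mu * (x - (a - 0.5 * eps)))
            - sigmoid(mu * (x - (b + 0.5 * eps))))

a, b, eps, mu = 0.0, 1.0, 0.1, 1000.0
print(t_tilde(0.5, a, b, eps, mu))          # well inside [a, b]: ~1
print(t_tilde(a - 2 * eps, a, b, eps, mu))  # outside the neighborhood: ~0
```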

6.2. APPROXIMATING AN m-DIMENSIONAL INDICATOR

We saw how to construct an indicator approximation for a 1-dimensional box. We will now show how to construct an indicator-function approximation N_G for an m-dimensional box.

Constructing N_G. We want to construct an indicator function whose value within a box G is close to 1, and whose value outside the neighborhood ν(G) is close to 0. In the multi-dimensional case, m ≥ 2, we do not know at which dimension j, if any, an input is outside the neighborhood of G. The 1-dimensional indicator approximation t̃, which we constructed earlier, can tell us, for each dimension j, whether x_j is within the bounds of the neighborhood of G. Therefore, we can construct a logical-OR approximation that applies t̃ to each dimension and takes the OR of the results. Specifically: (1) we construct a function that applies t̃ to each dimension and sums the results such that the answer is > 0 if x ∈ G, and < 0 if x ∈ C \ ν(G); (2) then, we use the sign-function approximation to indicate the sign of the answer. Formally, we define the neural network N_G as follows:

N_G(x) = t(µ(Σ_{i=1}^m H_i(x_i) + 0.5ε)),  where H_i(x) = t̃_i(x) − (1 − 2θ)    (3)

Eq. (3) has a similar structure to the m-dimensional indicator in Baader et al. (2020), i.e., both use the activation function to evaluate the information from all dimensions.

Under review as a conference paper at ICLR 2021

The term Σ_{i=1}^m H_i(x_i) evaluates to a positive value if x ∈ G and to a negative value if x ∈ C \ ν(G). Observe that we need to shift the result of t̃ by (1 − 2θ) to ensure a negative answer if one of the dimensions is outside the neighborhood. Then, we use t to approximate the sign function, as we did in the 1-dimensional case, giving ≈ 1 if x ∈ G, and ≈ 0 if x ∈ C \ ν(G). Fig. 5c shows a plot of a two-dimensional N_G.

Abstract precision of indicator approximation.
The following key theorem states the precision of the abstract interpretation of N_G: if the input box is in G, then the output box is within θ of 1; if B is outside the neighborhood of G, then the output box is within θ of 0.

Theorem 6.2 (Abstract interpretation of N_G). For any box B ⊆ C, the following is true:
1. N_G^#(B) ⊆ [0, 1].
2. If B ⊆ G, then N_G^#(B) ⊆ (1 − θ, 1].
3. If B ⊆ C \ ν(G), then N_G^#(B) ⊆ [0, θ).

Complexity of construction.
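Theorem 6.2 can be probed numerically by propagating boxes through N_G with interval arithmetic. The sketch below instantiates Eqs. (2) and (3) with sigmoid for a 2D grid box; the parameters ε, µ, θ are illustrative (θ = 0.01 satisfies θ ≤ 1/(4m + 2) for m = 2), and the interval bounds for t̃_i follow the [T_1 − T_4, T_2 − T_3] form from Appendix B.3:

```python
import math

def sigmoid(x):
    # Numerically stable sigmoid.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

EPS, MU, THETA = 0.1, 2000.0, 0.01  # illustrative parameter choices

def t_tilde_box(lo, hi, a, b):
    # Interval semantics of Eq. (2) over the interval [lo, hi]:
    # the difference of two monotone terms gives [T1 - T4, T2 - T3].
    t1 = sigmoid(MU * (lo - (a - 0.5 * EPS)))
    t2 = sigmoid(MU * (hi - (a - 0.5 * EPS)))
    t3 = sigmoid(MU * (lo - (b + 0.5 * EPS)))
    t4 = sigmoid(MU * (hi - (b + 0.5 * EPS)))
    return (t1 - t4, t2 - t3)

def indicator_box(box, G):
    # Interval semantics of N_G (Eq. (3)) on an m-dimensional box.
    lo_sum, hi_sum = 0.0, 0.0
    for (lo, hi), (a, b) in zip(box, G):
        l, u = t_tilde_box(lo, hi, a, b)
        lo_sum += l - (1.0 - 2.0 * THETA)   # lower bound of H_i
        hi_sum += u - (1.0 - 2.0 * THETA)   # upper bound of H_i
    return (sigmoid(MU * (lo_sum + 0.5 * EPS)),
            sigmoid(MU * (hi_sum + 0.5 * EPS)))

G = [(0.0, 1.0), (0.0, 1.0)]                       # 2D grid box
inside = indicator_box([(0.2, 0.8), (0.3, 0.7)], G)
outside = indicator_box([(2.0, 3.0), (0.3, 0.7)], G)
print(inside)    # both bounds close to 1 (Thm. 6.2, case 2)
print(outside)   # both bounds close to 0 (Thm. 6.2, case 3)
```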

7. COMPLETE PROOF CONSTRUCTION OF AUA THEOREM

We have shown how to approximate an indicator function and how to control the precision of its abstract interpretation (Thm. 6.2). We now complete the construction of the neural network N, following the technique of Baader et al. (2020) for ReLU networks. Because we use an arbitrary squashable function to approximate the sign function, this introduces extra imprecision in comparison with ReLUs. We thus need a finer function slicing to accommodate it, i.e., we use a slice size of δ/3 instead of the δ/2 of Baader et al. (2020). We provide the detailed analysis in the appendix. In what follows, we outline how to build the network N that satisfies the AUA theorem.

Slicing f. Let f : C → R be the continuous function we need to approximate, and let δ be the approximation tolerance, as per the AUA theorem statement (Thm. 4.2). Assume min f(C) = 0. Let u = max f(C). In other words, the range of f is [0, u]. Let τ = δ/3. We will decompose f into a sequence of function slices f_i whose values are restricted to [0, τ]. Let K = ⌈u/τ⌉. The sum of the sequence of function slices is f. The sequence of functions f_i : C → [0, τ], for i ∈ {0, ..., K}, is:

f_i(x) = f(x) − iτ  if iτ < f(x) ≤ (i + 1)τ
f_i(x) = 0          if f(x) ≤ iτ
f_i(x) = τ          if (i + 1)τ < f(x)

Approximating f_i. We will use the indicator approximation N_G (Eq. (3)) to construct a neural network N_i that approximates f_i. Let G be the set of boxes whose vertices are in the grid. Because C is compact, |G| is finite. Consider (1/τ) f_i(x); it is roughly an indicator function for the set S = {x ∈ C | f(x) > (i + 1)τ}, i.e., it indicates when f(x) is greater than the upper bound of the ith slice. To approximate (1/τ) f_i(x), we consider all boxes in G that are subsets of S, and construct an indicator function that tells us whether an input x is in those boxes. Let G_i = {G ∈ G | f(G) > (i + 1)τ}. Now construct N_i(x), which approximates (1/τ) f_i(x), as

N_i(x) = t(µ(Σ_{G ∈ G_i} N_G(x) − 0.5)).

Sum all N_i.
Because Σ_{i=0}^K f_i(x) = f(x), and N_i(x) approximates (1/τ) f_i(x), we construct the neural network N as N(x) = τ Σ_{i=0}^K N_i(x). N δ-abstractly approximates f; therefore, the AUA theorem holds.
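The slicing step can be written down directly. An illustrative sketch with an arbitrary example function (f(x) = sin(x) + 1, so min f = 0 and u = 2), checking that the slices sum back to f:

```python
import math

def slice_fn(f, i, tau):
    # f_i from Sec. 7: the portion of f's range inside (i*tau, (i+1)*tau].
    def fi(x):
        v = f(x)
        if v <= i * tau:
            return 0.0
        if v > (i + 1) * tau:
            return tau
        return v - i * tau
    return fi

f = lambda x: math.sin(x) + 1.0  # continuous, min f = 0, max f = u = 2
tau = 0.4                        # slice height (tau = delta / 3 in the paper)
K = 5                            # ceil(u / tau)
slices = [slice_fn(f, i, tau) for i in range(K + 1)]

# The slices sum back to f pointwise.
for x in (0.0, 0.7, 1.5, 3.0):
    assert abs(sum(fi(x) for fi in slices) - f(x)) < 1e-9
print("slices reconstruct f")
```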

A VECTOR-VALUED NETWORKS AND ROBUSTNESS

In this section, we extend the AUA theorem to vector-valued functions. We also extend our robustness results to n-ary classifiers.

A.1 HIGHER-DIMENSIONAL FUNCTIONS

Vector-valued neural networks. So far we have considered scalar-valued neural networks. We can generalize the neural-network grammar (Def. 3.1) to enable vector-valued neural networks. Simply, we compose a sequence of n scalar-valued neural networks to construct a neural network whose range is R^n. Formally, we extend the grammar as follows, where the E_i are the scalar-valued sub-neural networks.

Definition A.1 (Vector-valued neural network grammar). A neural network N : R^m → R^n is defined as follows:
N ::= (E_1, ..., E_n)
E ::= c | x_i | E_1 + E_2 | c * E_1 | t(E_1)
where c ∈ R, x_i is one of the m inputs to the network, and t is an activation function.

Example A.2. Consider the following neural network N : R^2 → R^2:
N(x) = (σ(x_1 + 0.5 x_2), σ(0.1 x_1 + 0.3 x_2))

Generalized AUA theorem. We now generalize the AUA theorem to show that we can δ-abstractly approximate vector-valued functions.

Theorem A.3. Let f : C → R^n be a continuous function with compact domain C ⊂ R^m. Let δ > 0. Then, there exists a neural network N : R^m → R^n such that for every box B ⊆ C and for all i ∈ [1, n],
[l_i + δ, u_i − δ] ⊆ N^#(B)_i ⊆ [l_i − δ, u_i + δ]    (4)
where
1. N^#(B)_i is the ith interval in the box N^#(B), and
2. l_i = min S_i and u_i = max S_i, where S = f(B) (recall that S_i is the set of ith elements of the vectors in S).

Proof. From the AUA theorem, we know that there exists a neural network N_i that δ-abstractly approximates f_i : C → R, which is like f but only returns the ith output. We can then construct the network N = (N_1, ..., N_n). Since each N_i satisfies Eq. (4) separately, N δ-abstractly approximates f.
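Theorem A.3's proof composes independent scalar networks and propagates a box componentwise. A small numeric sketch of Example A.2 (the weights come from the example; the interval code is illustrative and exploits the fact that all weights here are positive):

```python
import math

def sigmoid(x):
    # Numerically stable sigmoid.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# The two scalar sub-networks of Example A.2:
# N(x) = (sigmoid(x1 + 0.5*x2), sigmoid(0.1*x1 + 0.3*x2)).
def N(x1, x2):
    return (sigmoid(x1 + 0.5 * x2), sigmoid(0.1 * x1 + 0.3 * x2))

def N_abstract(b1, b2):
    # Componentwise interval propagation: each output interval is
    # derived independently (the interval domain is non-relational).
    # All weights are positive, so lower bounds come from lower inputs.
    (l1, u1), (l2, u2) = b1, b2
    out1 = (sigmoid(l1 + 0.5 * l2), sigmoid(u1 + 0.5 * u2))
    out2 = (sigmoid(0.1 * l1 + 0.3 * l2), sigmoid(0.1 * u1 + 0.3 * u2))
    return (out1, out2)

box = N_abstract((-1.0, 1.0), (0.0, 2.0))
print(box)  # a box in R^2: one output interval per component
```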

A.2 ROBUSTNESS IN n-ARY CLASSIFICATION

We now extend the definition of ε-robustness to n-ary classifiers. We use a function f : C → R^n to denote an n-class classifier. f returns a value for each of the n classes; the class with the largest value is the result of classification. We assume there are no ties. Formally, for a given x ∈ C, we denote classification by f as class(f(x)), where class(y) = arg max_{i ∈ {1,...,n}} y_i.

Definition A.4 (n-ary robustness). Let M ⊂ C. We say that f is ε-robust on M, where ε > 0, iff for all x ∈ M and x′ ∈ R_ε(x), we have class(f(x)) = class(f(x′)).

We now extend the definition of certifiably robust networks to the n-class case. Recall that R_ε(x) = {x′ | ‖x − x′‖∞ ≤ ε}.

Definition A.5 (Certifiably robust networks). A neural network N is ε-certifiably robust on M iff, for all x ∈ M, for all y, y′ ∈ N^#(R_ε(x)), we have class(y) = class(y′).

Existence of robust networks. We now show the existence of robust networks that approximate some robust n-ary classifier f.

Theorem A.6 (Existence of robust networks). Let f : C → R^n be a continuous function that is ε-robust on a set M. Then, there exists a neural network N that
1. agrees with f on M, i.e., ∀x ∈ M. class(N(x)) = class(f(x)), and
2. is ε-certifiably robust on M.

Proof. First, we post-process the results of f as follows: for all x ∈ C, f̃(x) = (0, ..., |y_i|, ..., 0), where y = f(x) and i = class(f(x)). In other words, f̃ is just like f, but it zeroes out the values of all but the output class i. This is needed because the interval domain is non-relational: it cannot capture relations between the values of different classes, namely, which one is larger. Note that if f is continuous, then f̃ is continuous. Let δ′ be the smallest non-zero element of any vector in the set {f̃(x) | x ∈ C}. Following the AUA theorem, let N be a neural network that δ-abstractly approximates f̃, where δ < 0.5 δ′.

STATEMENT (1): Pick any x ∈ M. Let the ith element of f̃(x) be non-zero; call it c.
By construction, i = class(f(x)). Let N(x) = (y_1, ..., y_n). By the AUA theorem, we know that 0 ≤ y_j < 0.5 δ′ for j ≠ i, and y_i ≥ c − 0.5 δ′. Since c ≥ δ′, class(N(x)) = class(f(x)) = i.

STATEMENT (2): Let x ∈ M. Let S = f̃(R_ε(x)). Let S_i be the projection of all vectors in S onto their ith element, where i = class(f̃(x)). We know that min S_i ≥ δ′; min S_i exists because R_ε(x) is compact, and so are S and S_i. By the construction of f̃ and the fact that f is robust, all other elements of the vectors in S are zero, i.e., S_j = {0} for j ≠ i. Let N^#(R_ε(x))_j = [l_j, u_j]. By the AUA theorem and its proof, for j ≠ i, we have [l_j, u_j] ⊂ [0, 0.5 δ′). Similarly, [l_i, u_i] ⊆ [min S_i − 0.5 δ′, u_i] ⊆ [0.5 δ′, u_i]. It follows that for all y, y′ ∈ N^#(R_ε(x)), we have class(y) = class(y′) = i, because any value in [0.5 δ′, u_i] is larger than any value in [0, 0.5 δ′). Notice that Thm. 4.5 is a special case, so it also holds.
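The post-processing step f̃ can be sketched as follows (`postprocess` is a hypothetical name; it keeps only the winning class's absolute value and zeroes the rest, as in the proof):

```python
def postprocess(y):
    # f~ from Thm. A.6: keep only the winning class's (absolute) value
    # and zero everything else, so the non-relational interval domain
    # never needs to compare two outputs.
    i = max(range(len(y)), key=lambda j: y[j])
    out = [0.0] * len(y)
    out[i] = abs(y[i])
    return out

print(postprocess([0.2, -1.0, 3.5]))   # [0.0, 0.0, 3.5]
print(postprocess([-2.0, -0.5]))       # [0.0, 0.5]
```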

B APPENDIX: ELIDED PROOFS

B.1 PROOF OF PROPOSITION 3.3

It is easy to see that all the activation functions t are monotonically increasing, with lim_{x→−∞} t(x) = l for some l ∈ R, and lim_{x→∞} t(x) = ∞. Because t is increasing, t(−x) and t(1 − x) are both decreasing; thus, their composition t(1 − t(−x)) is increasing (a composition of two decreasing functions is increasing). Moreover,
lim_{x→−∞} t(1 − t(−x)) = l, because 1 − t(−x) → −∞ as x → −∞
lim_{x→∞} t(1 − t(−x)) = t(1 − lim_{x→∞} t(−x)) = t(1 − l)
ReLU: l = 0, and t(1 − l) = ReLU(1) = 1.
ELU: l = −1, and t(1 − l) = ELU(2) = 2.
softplus: l = 0, and t(1 − l) = softplus(1) = log(1 + e).
smoothReLU: l = 0, and t(1 − l) = smoothReLU_a(1) = 1 − (1/a) log(a + 1). (One can easily verify that (1/a) log(a + 1) < 1 for a > 0.)

B.2 CHOICE OF PARAMETERS θ AND ε

Because our construction works for any fixed θ and ε, we choose θ = min(1/(K+1), 1/(4m+2), 1/(4|G|)), and we choose ε < 0.5 such that if ‖x − y‖∞ ≤ ε, then |f(x) − f(y)| < τ, where τ, K, and G are defined in Sec. 7. The latter is achievable by the Heine–Cantor theorem (see Rudin (1986)): f is uniformly continuous on C.

B.3 PROPERTIES OF t̃_i

The following lemmas show that t̃_i roughly behaves like an indicator function: its value within a box's ith dimension [a_i, b_i] is ≈ 1; its value outside the ε-neighborhood is ≈ 0; and its value globally is bounded by 1. We will analyze the values of the two terms in t̃_i.

Proof. We begin the proof by simplifying the expression t̃#_i(B). Recall that t̃_i(x) = t(μ(x + 0.5ε − a_i)) − t(μ(x − 0.5ε − b_i)). Let B = [a, b]. By applying the abstract transformer t# (Def. 3.6) and subtracting the two terms, we get t̃#_i(B) = [T_1 − T_4, T_2 − T_3], where

T_1 = t(μ(a + 0.5ε − a_i))
T_2 = t(μ(b + 0.5ε − a_i))
T_3 = t(μ(a − 0.5ε − b_i))
T_4 = t(μ(b − 0.5ε − b_i))

We are now ready to prove the three statements.

STATEMENT (1): By the definition of t, ∀x. t(x) ∈ [0, 1], so T_1, T_2, T_3, T_4 ∈ [0, 1]. Therefore, the upper bound of t̃#_i(B) is T_2 − T_3 ≤ 1.

STATEMENT (2):

Case 1: B ⊆ (−∞, a_i − ε]. From Lem. B.2, T_1, T_2, T_3, T_4 ∈ [0, θ); then T_2 − T_3 < θ and T_1 − T_4 > −θ.

Case 2: B ⊆ [b_i + ε, ∞). From Lem. B.3, T_1, T_2, T_3, T_4 ∈ (1 − θ, 1]; then T_2 − T_3 < θ and T_1 − T_4 > −θ.

In either case, t̃#_i(B) ⊆ (−θ, θ).

STATEMENT (3): If B ⊆ [a_i, b_i], then a, b ∈ [a_i, b_i]. From Lem. B.1(1), T_1, T_2 ∈ (1 − θ, 1]. From Lem. B.1(2), T_3, T_4 ∈ [0, θ). Then T_1 − T_4 > 1 − 2θ and T_2 − T_3 ≤ 1. Therefore, t̃#_i(B) ⊆ (1 − 2θ, 1].
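The three cases of this lemma can be observed numerically. The sketch below instantiates t as the sigmoid and applies the interval transformer [T_1 − T_4, T_2 − T_3] derived above; the parameters μ and ε are our own choices for illustration, not values fixed by the paper:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def t_tilde_abstract(a, b, ai, bi, mu=50.0, eps=0.1):
    """Interval transformer for t~_i on B = [a, b]. Both terms of t~_i are
    monotone in x, so their tightest output intervals are [T1, T2] and
    [T3, T4]; interval subtraction yields [T1 - T4, T2 - T3]."""
    T1 = sigmoid(mu * (a + 0.5 * eps - ai))
    T2 = sigmoid(mu * (b + 0.5 * eps - ai))
    T3 = sigmoid(mu * (a - 0.5 * eps - bi))
    T4 = sigmoid(mu * (b - 0.5 * eps - bi))
    return (T1 - T4, T2 - T3)

# Box inside [ai, bi] = [0, 1]: output interval is close to [1, 1].
lo, hi = t_tilde_abstract(0.2, 0.8, 0.0, 1.0)
assert lo > 0.99 and hi <= 1.0

# Box far to the left of ai - eps: output interval is close to [0, 0].
lo2, hi2 = t_tilde_abstract(-3.0, -2.0, 0.0, 1.0)
assert abs(lo2) < 0.01 and abs(hi2) < 0.01
print("indicator behaves as Lem. B.4 predicts")
```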

B.4 ABSTRACT PRECISION OF N_G

We are now ready to analyze the abstract precision of N_G. We first consider H_i in the following lemma. For any box B ⊆ C, let B_i be its projection on dimension i, which is an interval. The following lemma states that if B is in the box G, then Σ_i H#_i is positive; otherwise, if B is outside the ε-neighborhood of G, then Σ_i H#_i is below −ε.

Lemma B.5 (Abstract interpretation of H_i). For any box B ⊆ C, the following is true:

1. If B ⊆ G, then Σ_{i=1}^{m} H#_i(B_i) ⊆ (0, ∞).
2. If B ⊆ C \ ν(G), then Σ_{i=1}^{m} H#_i(B_i) ⊆ (−∞, −ε).

Proof. STATEMENT (1): If B ⊆ G, then ∀i. B_i ⊆ [a_i, b_i]. From Lem. B.4(3), t̃#_i(B_i) ⊆ (1 − 2θ, 1]; thus,

H#_i(B_i) = t̃#_i(B_i) +# (−(1 − 2θ))# ⊆ (0, 2θ] ⊂ (0, ∞)

Summing over all m dimensions, Σ_{i=1}^{m} H#_i(B_i) ⊆ Σ_{i=1}^{m} (0, ∞) = (0, ∞).

STATEMENT (2): If B ⊆ C \ ν(G), then there is a dimension j such that either B_j ⊆ (−∞, a_j − ε] or B_j ⊆ [b_j + ε, ∞). From Lem. B.4(2), we know that t̃#_j(B_j) ⊆ (−θ, θ). Therefore,

H#_j(B_j) = t̃#_j(B_j) +# (−(1 − 2θ))# ⊆ (θ − 1, 3θ − 1)   (5)

For the remaining m − 1 dimensions, from Lem. B.4(1), we know that t̃#_i(B_i) ⊂ (−∞, 1] for i ≠ j. Therefore,

H#_i(B_i) = t̃#_i(B_i) +# (−(1 − 2θ))# ⊆ (−∞, 2θ]   (6)

Taking the sum over these m − 1 dimensions,

Σ_{i ∈ {1,...,m}\{j}} H#_i(B_i) ⊆ Σ_{i ∈ {1,...,m}\{j}} (−∞, 2θ]   (substitute Eq. (6))
= [m − 1, m − 1] *# (−∞, 2θ]   (turn sum into *#)
= (−∞, 2(m − 1)θ]   (apply *#)   (7)

Now, taking the sum over all m dimensions,

Σ_{i=1}^{m} H#_i(B_i) = Σ_{i ∈ {1,...,m}\{j}} H#_i(B_i) +# H#_j(B_j)   (decompose sum)
⊆ (−∞, 2(m − 1)θ] +# (θ − 1, 3θ − 1)   (substitute Eqs. (5) and (7))
= (−∞, (2m + 1)θ − 1)   (apply +#)

Because of our choice of θ ≤ 1/(4m+2), we have (2m + 1)θ − 1 ≤ −0.5, so Σ_{i=1}^{m} H#_i(B_i) ⊆ (−∞, −0.5). Also, we have assumed that ε < 0.5 (see Appendix B.2); therefore Σ_{i=1}^{m} H#_i(B_i) ⊆ (−∞, −ε).

B.4.1 PROOF OF THM. 6.2

Proof. STATEMENT (1): See the definition of N_G in Eq. (3). The outer function of N_G is t, whose range is [0, 1] by the definition of squashable functions and our assumption that the left and right limits are 0 and 1.
Therefore, N#_G(B) ⊆ [0, 1].

STATEMENT (2): If B ⊆ G, from Lem. B.5, we know that Σ_{i=1}^{m} H#_i(B_i) ⊆ (0, ∞). Then,

Σ_{i=1}^{m} H#_i(B_i) +# (0.5ε)# ⊆ (0, ∞) +# (0.5ε)# ⊆ (0.5ε, ∞)

From Lem. 6.1, we know that if x ≥ 0.5ε, then 1 − θ < t(μx) ≤ 1. Therefore,

N#_G(B) = t#(μ# *# (0.5ε, ∞)) ⊆ (1 − θ, 1]

STATEMENT (3): If B ⊆ C \ ν(G), from Lem. B.5, we know that Σ_{i=1}^{m} H#_i(B_i) ⊆ (−∞, −ε). Then,

Σ_{i=1}^{m} H#_i(B_i) +# (0.5ε)# ⊆ (−∞, −ε) +# (0.5ε)# ⊆ (−∞, −0.5ε)

From Lem. 6.1, we know that if x ≤ −0.5ε, then 0 ≤ t(μx) < θ. Therefore,

N#_G(B) = t#(μ# *# (−∞, −0.5ε)) ⊆ [0, θ)

B.5 ABSTRACT INTERPRETATION OF N_i

Observe that any box B ⊆ C from the abstract domain is overapproximated by a larger box G ⊇ B from the finitely many boxes in the ε-grid. Intuitively, our abstract approximation of N_i incurs an error when the input B is not in the grid. We formalize this idea by extending the notion of neighborhood (Sec. 6) to boxes from the abstract domain. Note that G_B is uniquely defined. The following lemma says that considering the neighborhood of B only adds up to τ of imprecision to the collecting semantics of f. This drives us to decompose the sum as follows:

Σ_{i=0}^{K} N#_i(B) = Σ_{i=0}^{p−2} N#_i(B) [Term 1] +# Σ_{i=p−1}^{p+1} N#_i(B) [Term 2] +# Σ_{i=p+2}^{q−2} N#_i(B) [Term 3] +# Σ_{i=q−1}^{q+1} N#_i(B) [Term 4] +# Σ_{i=q+2}^{K} N#_i(B) [Term 5]   (8)

We will analyze the five terms in Eq. (8) separately, and then take their sum to get the final result. For now, assume that q ≥ p + 3; the q ≤ p + 2 case will follow easily.

(ii) Term 5: ∀i ≥ q + 2, we have (q + 1)τ ≤ (i − 1)τ. Because u = max f(B) and u ∈ [qτ, (q + 1)τ), then max f(B) < (q + 1)τ ≤ (i − 1)τ. From Thm. B.7, ∃l_i ∈ [0, θ) such that [l_i, l_i] ⊆ N#_i(B) ⊆ [0, θ). Then,

Σ_{i=q+2}^{K} [l_i, l_i] ⊆ Σ_{i=q+2}^{K} N#_i(B) ⊆ Σ_{i=q+2}^{K} [0, θ) = (K − q − 1)# *# [0, θ)

Σ_{i=p+2}^{q−2} [θ, 1 − θ] ⊆ Σ_{i=p+2}^{q−2} N#_i(B) ⊆ Σ_{i=p+2}^{q−2} [0, 1].
Σ_{i=p+2}^{q−2} [θ, 1 − θ] ⊆ Σ_{i=p+2}^{q−2} N#_i(B) ⊆ (q − p − 3)# *# [0, 1]

(iv) Term 2: ∀i ∈ [p − 1, p + 1], since we have assumed that q ≥ p + 3, we have q ≥ p + 3 ≥ i + 2.



In our construction and proof, we do not need the function to be monotonic; however, in practice, most activation functions are monotonic, and abstractly interpreting arbitrary functions is impractical. This assumption eliminates the corner case where a point sits exactly on the classification boundary, 0.5. The vertices of the grid form a natural ε/2-net on C equipped with the l∞ metric. If (1) is not satisfied, then, by Def. 3.2, we can construct from t a function t′ that satisfies Eq. (1). If (2) is not satisfied, we can apply an affine transformation to the results of t to make the left and right limits 0 and 1. Otherwise, we can shift f such that min f(C) = 0.



Figure 4: Slicing example

of boxes. Fix ε > 0. Consider a standard grid of vertices over C, where any two neighboring vertices are axis-aligned and at distance ε; we will call this an ε-grid. Let [a_1, b_1] × ... × [a_m, b_m] be a box G on the grid, where [a_i, b_i] is the range of G at dimension i. In other words, b_i − a_i is a multiple of ε. The neighborhood ν(G) of G is [a_1 − ε, b_1 + ε] × ... × [a_m − ε, b_m + ε]. Our goal is to construct an indicator function whose value is close to 1 within G, and close to 0 outside G's neighborhood ν(G). The idea of using a grid is similar to the nodal basis in He et al. (2018).
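The ε-grid bookkeeping is simple to make concrete. Below is a minimal sketch (helper names are ours): it enumerates the unit cells of an ε-grid over [0, kε]^m and computes a box's neighborhood ν(G) by widening every dimension by ε:

```python
import itertools

def grid_cells(m, eps, k):
    """Unit cells of an eps-grid over [0, k*eps]^m; each cell is a pair
    (lower corner, upper corner) with side length eps in every dimension."""
    cells = []
    for idx in itertools.product(range(k), repeat=m):
        lo = tuple(i * eps for i in idx)
        hi = tuple((i + 1) * eps for i in idx)
        cells.append((lo, hi))
    return cells

def neighborhood(box, eps):
    """nu(G): widen every dimension of the box G by eps on both sides."""
    lo, hi = box
    return (tuple(a - eps for a in lo), tuple(b + eps for b in hi))

cells = grid_cells(2, 0.5, 2)   # 4 cells tiling [0, 1]^2
print(len(cells))               # 4
print(neighborhood(((0.0, 0.0), (0.5, 0.5)), 0.5))
# ((-0.5, -0.5), (1.0, 1.0))
```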

Plot of N_G on G = [0, 1] × [0, 1] using the σ activation (μ = 10, 2θ = 0.05, ε = 1).

Figure 5: Approximating indicator functions on [0, ∞), [0, 1], and [0, 1] × [0, 1] using the sigmoid activation function.

Figure 6: Loss of precision θ due to the use of a squashable activation to approximate a sign function. The length of the red arrows is θ.

For a box B ⊆ C, if B ∈ G, then let B's neighborhood be G_B = ν(B); otherwise, let G_B be the smallest G′ ∈ G, by volume, such that B ⊆ G′.

Consider the function slices represented by Terms 1 and 5; for example, Term 1 represents abstractions N#_i of function slices f_i, for i ∈ [0, p − 2]. The function slices of Terms 1 and 5 are covered by Thm. B.7 (Statements 2 and 3): they have an (almost) precise abstract interpretation. That is, the abstract semantics N#_i(B) and the collecting semantics f_i(B) agree. For Term 1, the abstract interpretation of each N#_i(B) ≈ [1, 1] and f_i(B) = [τ, τ]. For Term 5, the abstract interpretation of each N#_i(B) ≈ [0, 0] and f_i(B) = [0, 0]. • Now consider the function slices f_i, where i ∈ [p + 2, q − 2]. The abstraction of these function slices is also (almost) precise. We can see that f(c) = l is below the lower bound of the slices and f(d) = u is above the upper bound of the slices. Hence, f_i(d) = τ and N#_i({d}) ≈ [1, 1]. Similarly, f_i(c) = 0 and N#_i({c}) ≈ [0, 0]. Because c, d ∈ B, and due to the continuity of f, we have f_i(B) = [0, τ] and N#_i(B) ≈ [0, 1]. • The remaining function slices are those in Terms 2 and 4; they are in the neighborhood of the boundary of [l, u]. Most of the precision loss of N#_i(B) comes from these two terms.

(i) Term 1: ∀i ≤ p − 2, we have pτ ≥ (i + 2)τ. Because l = min f(B) and l ∈ [pτ, (p + 1)τ), we have min f(B) ≥ pτ ≥ (i + 2)τ. From Thm. B.7, ∃u_i ∈ (1 − θ, 1] such that [u_i, u_i] ⊆ N#_i(B) ⊆ (1 − θ, 1]. Then,

Σ_{i=0}^{p−2} [u_i, u_i] ⊆ Σ_{i=0}^{p−2} N#_i(B) ⊆ (p − 1)# *# (1 − θ, 1]

(iii) Term 3: ∀i ∈ [p + 2, q − 2], we have (p + 1)τ ≤ (i − 1)τ and qτ ≥ (i + 2)τ. Thus f(c) = l < (p + 1)τ ≤ (i − 1)τ, and f(d) = u ≥ qτ ≥ (i + 2)τ. From Thm. B.7, N#_i({c}) ⊆ [0, θ) and N#_i({d}) ⊆ (1 − θ, 1]. Because c, d ∈ B, [θ, 1 − θ] ⊆ N#_i(B). Also, by Thm. B.7, N#_i(B) ⊆ [0, 1]. Hence, summing over i ∈ [p + 2, q − 2] yields the bound for Term 3.

To construct a single indicator function, we use 2m + 1 activation functions, with depth 2 and width 2m. If we restrict ourselves to ReLU neural networks, we use 4m + 2 neurons, with depth 4 and width 2m; in contrast, Baader et al. (2020) used 10m − 3 ReLU functions, with depth 3 + log2(m) and width 4m.
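The size comparison is simple arithmetic; a quick check of the counts quoted above (the formulas are taken directly from the text, the helper names are ours):

```python
def ours_activations(m):
    """Our construction: 2m + 1 activation functions per indicator."""
    return 2 * m + 1

def ours_relu_neurons(m):
    """ReLU-only variant of our construction: 4m + 2 neurons."""
    return 4 * m + 2

def baader_relu_neurons(m):
    """Baader et al. (2020): 10m - 3 ReLU functions per indicator."""
    return 10 * m - 3

# Our ReLU indicator is smaller for every input dimension m >= 1.
for m in (1, 2, 10, 100):
    assert ours_relu_neurons(m) < baader_relu_neurons(m)
print(ours_relu_neurons(10), baader_relu_neurons(10))  # 42 97
```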

8. CONCLUSION

We have shown that the AUA theorem holds for most practical neural networks, demonstrating that, in theory, interval analysis can certify the robustness of neural networks. In the future, one might be interested in reducing the size of the neural network constructed in the course of our proof, for example, by allowing the use of richer domains, like zonotopes and polyhedra.


The following lemma states that if x is within the box's ith dimension, then the first term is close to 1 and the second term is close to 0, resulting in t̃_i(x) ≈ 1.

Lemma B.1. If x ∈ [a_i, b_i], then the following is true:
1. t(μ(x + 0.5ε − a_i)) ∈ (1 − θ, 1]
2. t(μ(x − 0.5ε − b_i)) ∈ [0, θ)

Proof. STATEMENT (1): Because x ≥ a_i, x + 0.5ε − a_i ≥ 0.5ε. From Lem. 6.1, t(μ(x + 0.5ε − a_i)) ∈ (1 − θ, 1].

STATEMENT (2): Because x ≤ b_i, x − 0.5ε − b_i ≤ −0.5ε. From Lem. 6.1, t(μ(x − 0.5ε − b_i)) ∈ [0, θ).

The next two lemmas state that if x is outside the neighborhood, then the two terms are similar, resulting in t̃_i(x) ≈ 0.

Lemma B.2. If x ≤ a_i − ε, then the following is true:
1. t(μ(x + 0.5ε − a_i)) ∈ [0, θ)
2. t(μ(x − 0.5ε − b_i)) ∈ [0, θ)

Proof. STATEMENT (1): Because x ≤ a_i − ε, x + 0.5ε − a_i ≤ −0.5ε. From Lem. 6.1, t(μ(x + 0.5ε − a_i)) ∈ [0, θ). The proof of Statement (2) is similar.

Lemma B.3. If x ≥ b_i + ε, then the following is true:
1. t(μ(x + 0.5ε − a_i)) ∈ (1 − θ, 1]
2. t(μ(x − 0.5ε − b_i)) ∈ (1 − θ, 1]

Proof. STATEMENT (1): Because x ≥ b_i + ε and b_i ≥ a_i, x ≥ a_i + ε. Then x + 0.5ε − a_i ≥ 0.5ε. From Lem. 6.1, t(μ(x + 0.5ε − a_i)) ∈ (1 − θ, 1]. The proof of Statement (2) is similar.

B.3.1 ABSTRACT PRECISION OF t̃_i

We are now ready to prove properties about the abstract interpretation of our 1-dimensional indicator approximation, t̃_i. The following lemma states that the abstract interpretation t̃#_i(B) of t̃_i is quite precise: if the 1-dimensional input box B is outside the neighborhood of G on G's ith dimension, then the output box is within θ of 0; if the input box B is within the ith dimension of G, then the output box is within 2θ of 1.

Lemma B.4 (Abstract interpretation of t̃_i). For a 1-dimensional box B, the following is true:

The following is true:

Proof. Both of the statements follow from our choice of ε in constructing the grid (see Appendix B.2).

The following is true:

Proof. We begin by noting the two bounds used in Statement (2). In Appendix B.2, we chose θ ≤ 1/(4|G|), a fact we will use later in the proof.

STATEMENT (1): The outer function of N_i is t, whose range is [0, 1], by the definition of squashable functions and our construction. Thus, we can break up the sum and conclude the following two facts. The second inequality follows from the fact that we assumed θ ≤ 1/(4|G|) ≤ 0.25 (above) and ε < 0.5 (see Appendix B.2); therefore, 0.5 − θ ≥ 0.25 > 0.5ε, and the bound follows from Lem. 6.1.

We assumed that θ ≤ 1/(4|G|) and ε < 0.5 (see Appendix B.2), so |G|θ ≤ 0.25, and |G|θ − 0.5 ≤ −0.25 ≤ −0.5ε. Hence, the bound follows from Lem. 6.1.

Before proceeding with the proof, we give a general lemma that will be useful in our analysis. The lemma follows, by construction, from the choice of θ ≤ 1/(K + 1).

Proof outline of Thm. 4.2. Our proof involves three pieces, outlined below. We will decompose the sum into five sums and analyze each separately, arriving at five results of the form given below, for j ∈ {1, . . . , 5}, where ∪_j S_j = {0, . . . , K} and the S_j are mutually disjoint sets. Then, we sum over all five cases.

Proof assumptions. We will assume that l ∈ [pτ, (p + 1)τ) and u ∈ [qτ, (q + 1)τ), for some p ≤ q ≤ K.
Additionally, let c, d ∈ B be such that f(c) = l and f(d) = u.

Step A: Decompose the sum and analyze each term separately. We begin by decomposing the sum into five terms. This is the most important step of the proof. We want to show that most N_i's in Σ_{i=0}^{K} N#_i(B) are (almost) precise; by almost, we mean that their values are ≈ 1 or ≈ 0. The motivation is to extract as many precise terms as possible. The only tool used in the analysis is Thm. B.7.

Step B: Sum all five cases. We now sum up the five inequalities derived above to obtain an overall bound of the sum in the form [L_1, U_1] ⊆ Σ_{i=0}^{K} N#_i(B) ⊆ [L_2, U_2]. We simplify L_1, L_2, U_1, and U_2 by summing all the 1's. Because θ ≤ 1/(K + 1) and −K ≤ K − q − 1 ≤ K, the simplified bounds follow.

Step C: Analyze the bound. It remains to show that l − δ ≤ L_2 ≤ L_1 ≤ l + δ and u − δ ≤ U_1 ≤ U_2 ≤ u + δ. Recall that we have set δ = 3τ. Since l ∈ [pτ, (p + 1)τ), we have l − δ < (p − 2)τ and l + δ ≥ (p + 3)τ. Since u ∈ [qτ, (q + 1)τ), we have u − δ < (q − 2)τ and u + δ ≥ (q + 3)τ. We have analyzed L_1, L_2, U_1, and U_2 above. It follows from the above inequalities that

l − δ < (p − 2)τ ≤ L_2 ≤ L_1 ≤ (p + 3)τ ≤ l + δ  and  u − δ < (q − 2)τ ≤ U_1 ≤ U_2 ≤ (q + 3)τ ≤ u + δ

This concludes the proof for the case where q ≥ p + 3.

Excluded case. Previously, we showed that Terms 1, 3, and 5 are almost precise; the imprecise terms can only come from Terms 2 and 4. If q ≤ p + 2, the only analyses affected are those of Terms 2 and 4. Since q ≤ p + 2, we have p + 1 ≥ q − 1, which means Terms 2 and 4 have potentially fewer sub-terms in this case. Thus, there are fewer imprecise terms than in the q ≥ p + 3 case, and we can apply the same analysis as above to derive the same bound.

We have thus shown that the neural network N that we construct δ-abstractly approximates f, and therefore the AUA theorem is true.

