FOOLING A COMPLETE NEURAL NETWORK VERIFIER

Abstract

The efficient and accurate characterization of the robustness of neural networks to input perturbation is an important open problem. Many approaches exist, including heuristic and exact (or complete) methods. Complete methods are expensive, but their mathematical formulation guarantees that they provide exact robustness metrics. However, this guarantee is valid only if we assume that the verified network applies arbitrary-precision arithmetic and the verifier is reliable. In practice, however, both the networks and the verifiers apply limited-precision floating-point arithmetic. In this paper, we show that numerical roundoff errors can be exploited to craft adversarial networks, in which the actual robustness and the robustness computed by a state-of-the-art complete verifier radically differ. We also show that such adversarial networks can be used to insert a backdoor into any network in such a way that the backdoor is completely missed by the verifier. The attack is easy to detect in its naive form but, as we show, the adversarial network can be transformed to make its detection less trivial. We offer a simple defense against our particular attack based on adding a very small perturbation to the network weights. However, our conjecture is that other numerical attacks are possible, and exact verification has to take into account all the details of the computation executed by the verified networks, which makes the problem significantly harder.

1. INTRODUCTION

In their seminal work, Szegedy et al. found that for a given neural network and input example one can always find a very small adversarial input perturbation that results in an incorrect output (Szegedy et al., 2014). This striking discovery motivated a substantial amount of research. In this area, an important research direction is verification, that is, the characterization of the robustness of a given network in a principled manner. A usual way of defining the verification problem involves the specification of an input domain and a property that should hold over the entire domain. For example, we might require that all the points within a certain distance from an input example share the same output label as the example itself. The verification problem is then to prove or disprove the property over the domain for a given network (Bunel et al., 2020). There are a large number of verifiers offering different types of guarantees about their output. Complete verifiers offer the strongest guarantee: they are able to decide whether a given property holds in any given input domain. For example, the verifier of Tjeng et al. is a state-of-the-art complete verifier that we will focus on in this paper (Tjeng et al., 2019). However, it is currently standard practice to ignore the details of the computations that the network under investigation performs, such as the floating-point representation or the order in which input signals are summed. In this paper, we claim that such implicit assumptions make verifiers vulnerable to a new kind of attack where the attacker designs a network that fools the verifier, exploiting the differences between how the verifier models the computation and how the computation is actually performed in the network. We will argue that such attacks can achieve an arbitrary divergence between the modeled and the actual behavior. This new attack has practical implications as well.
Concerns about the safety of AI systems are expected to lead to the establishment of standard requirements certified by a designated authority (Salis-Madinier, 2019). These certification procedures might involve verification methods as well. Fooling such methods makes it possible to get unsafe systems certified that might even contain a backdoor allowing for triggering arbitrary behavior.

Numerical precision has not been a key practical concern in machine learning. Networks do sometimes produce numerical errors (e.g., Inf or NaN values), most often due to the non-linear operations within the loss function (Odena et al., 2019) or divergence during training. However, the network weights are normally robust to small perturbations due to stochastic learning algorithms (Bottou, 2010), and due to regularizers such as standard variants of weight decay and dropout (Srivastava et al., 2014). Due to this robustness, low-precision arithmetic can be applied as well (Courbariaux et al., 2015; Gupta et al., 2015). Our results indicate that, when it comes to exact methods for verification, numerical issues become a central problem that can cause arbitrary errors and enable backdoors.

Our contributions are the following. In Section 3, we introduce a simple adversarial network that misleads the verifier of Tjeng et al. (2019). In Section 4, we show how to hide the large weights that are present in the simple network. In Section 5, we describe a way to add a backdoor to an existing network with the help of the adversarial networks we proposed. Finally, in Section 6 we offer a defense against the attack we presented.
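The kind of roundoff behavior such an attack can build on is easy to demonstrate in isolation. The following minimal sketch (an illustration of floating-point absorption in general, not the actual adversarial construction of Section 3) shows that in float32 a sufficiently large weight silently erases a small input signal, so the computed value diverges from the idealized real-valued arithmetic that verifiers typically model:

```python
import numpy as np

w = np.float32(2.0 ** 25)   # a large weight; float32 has a 24-bit significand
x = np.float32(1.0)         # a small input signal

# In exact (real) arithmetic, (w + x) - w == x == 1.0.
# In float32, w + x rounds back to w, so the signal is erased.
computed = (w + x) - w
print(computed)  # 0.0
```

A verifier that models the network over the reals would conclude that the signal x always propagates, while the executed float32 network discards it entirely; this gap is the raw material of the attack.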

2. BACKGROUND

Let us first formulate the verification problem, namely the problem of checking whether a given property holds in a given domain. We adopt the notation used in (Tjeng et al., 2019). For a possible input x, let G(x) denote the set of inputs that are considered similar to x, in the sense that we expect all the points in G(x) to get the same label as x. The set G(x) is normally defined as a ball around x in some metric space defined by a suitable vector norm. The input domain we need to consider is given as G(x) ∩ X_valid, where X_valid denotes the set of valid input points. For example, we have X_valid = [0, 1]^m if the input is an image of m pixels with each pixel taking values from the interval [0, 1].

We now have to formulate the property that we wish to hold in this domain. Informally, we want all the points in the domain G(x) ∩ X_valid to get the same classification label as x. Let λ(x) denote the true label of x and let f(x; θ) : R^m → R^n denote the neural network, parameterized by θ. This network has n outputs, classifying each input x into one of n classes. The label of x as predicted by the network is given by argmax_i f(x; θ)_i. Using this notation, the property we wish to hold for every input x′ ∈ (G(x) ∩ X_valid) is that λ(x) = argmax_i f(x′; θ)_i. Putting it all together, the verification problem can be expressed as deciding the feasibility of the constraint

    x′ ∈ (G(x) ∩ X_valid) ∧ (λ(x) ≠ argmax_i f(x′; θ)_i),    (1)

with x′ as our variable. If this constraint is feasible, then there is an x′ that violates the property. If it is infeasible, then (provided G(x) ∩ X_valid is not empty) there is no such x′.
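The notation above can be made concrete with a small sketch. It instantiates G(x) as an L∞ ball, X_valid as [0, 1]^m, and a toy random-weight ReLU network for f; the names (f, in_domain, violates) and the weights are hypothetical, and testing a single candidate x′ only checks constraint 1 in one direction — it is an illustration of the formulation, not a verifier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer ReLU classifier f(x; theta): R^m -> R^n (hypothetical weights).
m, h, n = 4, 8, 3
W1, b1 = rng.normal(size=(h, m)), rng.normal(size=h)
W2, b2 = rng.normal(size=(n, h)), rng.normal(size=n)

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def in_domain(xp, x, eps):
    # x' in G(x) ∩ X_valid: L-infinity ball of radius eps, intersected with [0, 1]^m.
    return np.max(np.abs(xp - x)) <= eps and bool(np.all((0 <= xp) & (xp <= 1)))

def violates(xp, x, eps, true_label):
    # The constraint: x' ∈ (G(x) ∩ X_valid)  ∧  lambda(x) != argmax_i f(x')_i.
    return in_domain(xp, x, eps) and int(np.argmax(f(xp))) != true_label

x = rng.uniform(size=m)
true_label = int(np.argmax(f(x)))  # use the network's own prediction as lambda(x)
xp = np.clip(x + rng.uniform(-0.1, 0.1, size=m), 0.0, 1.0)  # one candidate x'
print(violates(xp, x, eps=0.1, true_label=true_label))
```

Note that x itself can never be a witness: by construction argmax_i f(x)_i equals λ(x), so violates(x, x, ...) is always false, while a violating x′ elsewhere in the ball demonstrates feasibility of constraint 1.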

2.1. APPROACHES TO VERIFICATION

There are many approaches to tackle this problem. We can, for example, search for a suitable x′ in the given domain using heuristic optimization methods (Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2016; Kurakin et al., 2017; Carlini & Wagner, 2017; Brendel et al., 2019). If the search succeeds, we can decide that equation 1 is feasible; otherwise we cannot decide. Other methods attempt to find a proof of the infeasibility of equation 1, but they are not guaranteed to find one. Examples include (Wong & Kolter, 2018; Weng et al., 2018; Gehr et al., 2018; Raghunathan et al., 2018; Singh et al., 2019). If a proof is found, we can decide that equation 1 is infeasible; otherwise we cannot decide. Such methods are sometimes called incomplete (Tjeng et al., 2019; Bunel et al., 2020). The strongest guarantee is given by methods that are able to decide the feasibility of equation 1 in every case. These methods are sometimes called complete (Tjeng et al., 2019; Bunel et al., 2020). Examples of such methods include Reluplex (Katz et al., 2017), a method based on an SMT solver. A number of verifiers are based on MILP solvers, for example, (Cheng et al., 2017; Dutta et al., 2018). MIPVerify (Tjeng et al., 2019) also uses an MILP formulation, along with several additional techniques to improve efficiency (see Section 2.2). Symbolic interval propagation has also been proposed for ReLU networks by Wang et al. in ReluVal (Wang et al., 2018b), and as part of Neurify (Wang et al., 2018a). In Neurify, interval propagation is used as a technique to tighten the
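The interval-propagation idea mentioned above can be sketched in a few lines. The following toy example (hypothetical weights and helper names; real tools such as ReluVal and Neurify use tighter symbolic intervals rather than this naive box arithmetic) pushes an axis-aligned input box through one affine layer followed by a ReLU:

```python
import numpy as np

def interval_affine(lo, hi, W, b):
    # Propagate the box [lo, hi] through x -> W @ x + b.
    # Positive weights map lower bounds to lower bounds; negative weights swap them.
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def interval_relu(lo, hi):
    # ReLU is monotone, so it maps bounds to bounds directly.
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

# Toy layer (hypothetical weights) applied to the input box [0, 1]^2.
W = np.array([[1.0, -1.0], [2.0, 0.5]])
b = np.array([0.0, -1.0])
lo, hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])
lo, hi = interval_relu(*interval_affine(lo, hi, W, b))
print(lo, hi)  # [0. 0.] [1.  1.5]
```

The resulting output box is sound but generally loose, which is why interval propagation is typically combined with exact solvers (as in Neurify) to tighten the bounds.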

