TOWARD RELIABLE NEURAL SPECIFICATIONS

Abstract

Having reliable specifications is an unavoidable challenge in achieving verifiable correctness, robustness, and interpretability of AI systems. Existing specifications for neural networks follow the paradigm of data as specification: the local neighborhood centered around a reference input is considered correct (or robust). However, our empirical study shows that such specifications are extremely overfitted, since usually no data points from the testing set lie in the certified region of a reference input, making them impractical for real-world applications. We propose a new family of specifications called neural representation as specification, which uses the intrinsic information of neural networks, namely neural activation patterns (NAPs), rather than input data, to specify the correctness and/or robustness of neural network predictions. We present a simple statistical approach to mining dominant neural activation patterns. We analyze NAPs from a statistical point of view and find that a single NAP can cover a large number of training and testing data points, whereas data-as-specification covers only the given reference data point. To show the effectiveness of the discovered NAPs, we formally verify several important properties, such as that various types of misclassifications can never happen for a given NAP and that there is no ambiguity between different NAPs. We show that by using NAPs, we can verify predictions over a significant region of the input space, while still recalling 84% of the data. Thus, we argue that NAPs provide a more reliable and extensible specification for neural network verification.

1. INTRODUCTION

The advances in deep neural networks (DNNs) have brought a wide societal impact in many domains such as transportation, healthcare, finance, e-commerce, and education. This growing societal-scale impact has also raised risks and concerns about errors in AI software, its susceptibility to cyber-attacks, and AI system safety (Dietterich & Horvitz, 2015). Therefore, the challenge of verification and validation of AI systems, as well as achieving trustworthy AI (Wing, 2021), has attracted much attention from the research community. Existing works approach this challenge by building on formal methods, a field of computer science and engineering that verifies properties of systems using rigorous mathematical specifications and proofs (Wing, 1990). Having a formal specification, a precise mathematical statement of what an AI system is supposed to do, is critical for formal verification. Most works (Katz et al., 2017; 2019; Huang et al., 2017; 2020; Wang et al., 2021) use the specification of adversarial robustness for classification tasks, which states that the network classifies an image correctly under perturbations bounded in a specific norm (usually l∞). Generally speaking, existing works follow a paradigm of data as specification: the robustness of local neighborhoods of reference data points with ground-truth labels is the only specification of correct behavior. From a learning perspective, however, this leads to an overfitted specification, since only local neighborhoods of reference inputs get certified.

As a concrete example, Figure 1 illustrates the fundamental limitation of such overfitted specifications. A testing input like the one shown in Fig. 1a can never be verified even if the local neighborhoods of all training images have been certified using the L∞ norm. This is because adversarial examples like Fig. 1c lie much closer to a reference input than testing inputs (e.g., Fig. 1a) do; as a result, the truly verifiable region around a given reference input like Fig. 1b can only be smaller. All neural network verification approaches following the data-as-specification paradigm inherit this fundamental limitation, regardless of their underlying verification techniques.

To avoid this limitation, a new paradigm for specifying what is correct or wrong is necessary. The intrinsic challenge is that manually giving a proper specification on the input space is no easier than directly programming a solution to the machine learning problem itself. We envision that a promising way to address this challenge is to develop specifications directly on top of, instead of being agnostic to, the learned model. We propose a new family of specifications, neural representation as specification, in which neural activation patterns form the specifications. The key observation is that inputs from the same class often share a dominant neural activation pattern (NAP): a carefully chosen subset of neurons that are expected to be activated (or not activated) for the majority of inputs of that class. Even when two inputs are far apart in the input space under a given norm, the neural activations they exhibit when the same prediction is made can be very close. For instance, we can find a single dominant NAP that is shared by nearly all training and testing images in the same class (including Fig. 1a and Fig. 1b) but not by adversarial examples like Fig. 1c. We can further formally verify that all possible inputs following this particular dominant NAP can never be misclassified.
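To make the notion concrete, the following minimal sketch (ours, not the paper's code) shows how the activation pattern of a single input can be read off a fully connected ReLU network. The function name activation_pattern and the representation of the network as lists of weight matrices and bias vectors are illustrative assumptions, not part of the paper.

```python
import numpy as np

def activation_pattern(weights, biases, x):
    """Return the binary activation pattern of a fully connected ReLU
    network on input x: one bit per hidden neuron, 1 iff it fires."""
    pattern = []
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):  # hidden layers; the last layer is the logits
        pre = W @ h + b                          # pre-activation values
        pattern.append(pre > 0)                  # neuron "activated" iff pre-activation > 0
        h = np.maximum(pre, 0)                   # ReLU
    return np.concatenate(pattern)               # flat 0/1 vector over all hidden neurons
```

Two images of the same digit that are far apart in l∞ typically map to nearly identical bit vectors, which is precisely the regularity that a dominant NAP is meant to capture.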
Specifications based on NAPs enable successful verification of a broad region of inputs, which would not be possible under the data-as-specification paradigm. For the MNIST dataset, we find that a verifiable dominant NAP mined from the training images can cover up to 84% of testing images, a significant improvement over the 0% achieved when neighborhoods of training images are used as the specification. To the best of our knowledge, this is the first time that a significant fraction of unseen testing images has been formally verified. Interestingly, the verified dominant NAP also enables us to "double check" whether the ground-truth labels given by human beings are indeed reliable; Fig. 1d shows such an example, on which our verified NAP disagrees with the ground truth. We provide a tunable parameter to specialize a dominant NAP so as to avoid accepting such a potentially controversial region of inputs when necessary (see the sketch after the contribution list below). This unique advantage of using NAPs as specifications is enabled by the intrinsic information (or neural representation) embedded in the neural network model. Furthermore, such information is a simple byproduct of a prediction and can be collected easily and efficiently.

Besides serving as reliable specifications for neural networks, we foresee other important applications of NAPs. For instance, verified NAPs may serve as proofs of correctness or certificates for predictions. We hope the initial findings shared in this paper will inspire new and interesting applications. We summarize our contributions as follows:

• We propose a new family of formal specifications for neural networks, neural representation as specification, which uses neural activation patterns (NAPs) as specifications.

• We propose a simple yet effective method to mine dominant NAPs from a trained neural network and its training dataset.

• We show that NAPs can be easily checked by out-of-the-box neural network verification tools used in VNNCOMP (2021), the latest neural network verification competition, such as Marabou.
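The sketch below illustrates one plausible form of the statistical mining step described above, under our own assumptions: each training input of a class is first converted to a binary activation pattern (e.g., with activation_pattern from the earlier sketch), and a frequency threshold delta, our stand-in name for the tunable parameter, decides which neurons become part of the dominant NAP.

```python
import numpy as np

def mine_dominant_nap(patterns, delta=0.99):
    """Keep only neurons whose on/off state agrees across at least a
    fraction `delta` of the class's inputs; others stay unconstrained."""
    P = np.stack(patterns).astype(float)       # shape (n_inputs, n_neurons), entries in {0, 1}
    freq_on = P.mean(axis=0)                   # fraction of inputs activating each neuron
    nap = np.full(P.shape[1], -1, dtype=int)   # -1 = unconstrained neuron
    nap[freq_on >= delta] = 1                  # activated for >= delta of the inputs
    nap[freq_on <= 1.0 - delta] = 0            # deactivated for >= delta of the inputs
    return nap
```

On our reading, lowering delta constrains more neurons and thus specializes the NAP, excluding more inputs (such as the controversial region around Fig. 1d), while a delta near 1 constrains only the most consistent neurons and keeps recall high. The resulting constraints on ReLU phases can then be handed, together with the network, to a verifier such as Marabou.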



Figure 1: The limitation of "data-as-specification": the first three images show that a test input can be much farther away (in L∞) from its closest training input than adversarial examples are (the upper bound of a verifiable local region). The last image shows that even the data itself can be imperfect.

