DEFENDING AGAINST BLACK-BOX ADVERSARIAL ATTACKS WITH GRADIENT-FREE TRAINED SIGN ACTIVATION NEURAL NETWORKS

Abstract

While machine learning models today can achieve high accuracies on classification tasks, they can be deceived by minor imperceptible distortions to the data. These are known as adversarial attacks and can be lethal in the black-box setting, which requires no knowledge of the target model's type or parameters. Binary neural networks that have sign activation and are trained with gradient descent have been shown to be harder to attack than conventional sigmoid activation networks, but their improvements are marginal. We instead train sign activation networks with a novel gradient-free stochastic coordinate descent algorithm and propose an ensemble of such networks as a defense model. We evaluate the robustness of our model (a hard problem in itself) on image, text, and medical ECG data and find it to be more robust than ensembles of binary, full precision, and convolutional neural networks, and than random forests, while attaining comparable clean test accuracy. To explain our model's robustness we show that an adversarial example targeting a single network in our ensemble fails to fool (and is thus non-transferable to) the other networks in the ensemble. A datapoint therefore requires a large distortion to fool the majority of networks in our ensemble and is likely to be detected in advance. This property of non-transferability arises naturally from the non-convexity of sign activation networks and the randomization in our gradient-free training algorithm, without any adversarial defense effort.

1. INTRODUCTION

State of the art machine learning algorithms can achieve high accuracies in classification tasks but misclassify minor perturbations in the data known as adversarial attacks Goodfellow et al. (2015); Papernot et al. (2016b); Kurakin et al. (2016); Carlini & Wagner (2017); Brendel et al. (2018). Adversarial examples have been shown to transfer across models, which makes it possible to perform transfer-based (substitute model) black box attacks Papernot et al. (2016a). To counter adversarial attacks many defense methods have been proposed, with adversarial training being the most popular Szegedy et al. (2014); Tramèr et al. (2018). However, this tends to lower accuracy on clean test data that has no perturbations Raghunathan et al. (2019); Zhang et al. (2019) and can still be attacked with better transfer-based methods Wu et al. (2020); Xie et al. (2019a); Dong et al. (2019). Many previously proposed defenses have also been shown to be vulnerable Carlini & Wagner (2017); Athalye et al. (2018); Ghiasi et al. (2020), thus leaving adversarial robustness an open problem in machine learning.

A more lethal and practical attack than substitute model training is a boundary-based one that requires only the prediction of the model Brendel et al. (2018). These attacks aim to find the minimum distortion to an input such that it fools a classifier. This is in fact an NP-hard problem for ReLU activated neural networks Katz et al. (2017); Sinha et al. (2018) and for tree ensemble classifiers Kantchelian et al. (2016). Even approximating the minimum distortion for ReLU activated neural networks is NP-hard Weng et al. (2018). Boundary-based black box attacks such as HopSkipJump Chen et al., Boundary Attack Brendel et al. (2018), and RayS Chen & Gu (2020) give an upper bound on the minimum adversarial distortion.

Binary neural networks that have sign activation and binary weights were originally proposed as lightweight models. They are trained with gradient descent by approximating the sign activation. Recent work has shown that they tend to be more adversarially robust than full precision networks, but the improvements are marginal (see Tables 4 and 5 in Galloway et al. (2018) and Table 8 in Panda et al. (2019)). In this paper we propose a gradient-free stochastic coordinate descent algorithm for training sign activation networks with and without binary weights, similar to recent work Xue et al. (2020a;b); Xie et al. (2019b). While our original intention was to study the accuracy of a sign activation network trained directly without any approximation, we make an interesting finding on the adversarial robustness of our model: ensembling our networks gives a high minimum distortion (as measured by HopSkipJump) compared to full precision, binary, and convolutional neural networks. We explain this phenomenon by measuring the transferability of adversarial examples between networks in an ensemble. In summary, we make the following observations in our paper:

• Our single hidden layer sign activation network has higher minimum distortion than ensembles of full precision and binary neural networks, than random forests that have the advantage of bootstrapping and random feature selection, and than ensembles of convolutional networks that have the advantage of convolutions and several layers.

• Our model's robustness stems from the non-transferability of adversarial examples between networks in our ensemble, and its robustness increases as we add more networks to the ensemble.

• Substitute model black box attacks require a much greater distortion to bring our model to zero adversarial accuracy compared to ensembles of full precision and binary networks.

• Text classification black box attacks are less effective on our model than on convolutional networks, random forests, and ensembles of full precision and binary networks.

• In a medical diagnosis setting, attacks on ECG data have higher distortions on our model and are visually distinguishable compared to attacks on ensembles of full precision and convolutional networks, and on random forests.

2.1. STOCHASTIC COORDINATE DESCENT

Suppose we are given binary class data x_i ∈ R^d and y_i ∈ {-1, +1} for i = 0...n-1. Consider the objective function of a single hidden layer neural network with sign activation and 01 loss:

min_{W, W_0, w, w_0} (1/2n) Σ_{i=0}^{n-1} (1 - sign(y_i (w^T sign(W^T x_i + W_0) + w_0)))

We employ a stochastic coordinate descent algorithm shown in Algorithm 1 (similar to recent work Xue et al. (2020a;b); Xie et al. (2019b)) to minimize this objective. We can train sign activation networks with and without binary weights using our SCD training procedure; in the case of binary weights we do not need a learning rate. We apply GPU parallelism to simultaneously update features and use other heuristics to speed up runtimes (with additional details given in the Supplementary Material).

2.2. IMPLEMENTATION, TEST ACCURACY, AND RUNTIME

We implement our training procedure in Python, numpy, and PyTorch Paszke et al. (2019) and make our code freely available at https://github.com/zero-one-loss/scd_github. We train three types of sign activation networks with our algorithm: (1) SCD01: 01 loss in the final node, (2) SCDCE: cross-entropy loss in the final node, and (3) SCDCEBNN: cross-entropy loss in the final node with binary weights throughout the model. Since the sign activation is non-convex and our training starts from a different random initialization each time, we run it 100 times and output the majority vote.
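As a concrete illustration, the 01 loss objective and the coordinate update at the heart of our gradient-free training can be sketched in a few lines of numpy. This is a minimal sketch only: the function names are ours, and Algorithm 1's feature-parallel updates and speed-up heuristics are omitted.

```python
import numpy as np

def zero_one_loss(W, W0, w, w0, X, y):
    """01 loss of a single hidden layer sign activation network.
    X: (n, d) data, y: (n,) labels in {-1, +1}."""
    hidden = np.sign(X @ W + W0)               # (n, h) hidden sign activations
    margin = y * (hidden @ w + w0)             # signed margin per example
    return np.mean(1 - np.sign(margin)) / 2.0  # fraction of misclassified points

def scd_binary_step(W, W0, w, w0, X, y, rng):
    """One stochastic coordinate descent step for binary weights: flip one
    randomly chosen hidden weight and keep the flip only if the 01 loss
    does not increase (no learning rate is needed for binary weights)."""
    i, j = rng.integers(W.shape[0]), rng.integers(W.shape[1])
    before = zero_one_loss(W, W0, w, w0, X, y)
    W[i, j] = -W[i, j]                         # candidate coordinate flip
    after = zero_one_loss(W, W0, w, w0, X, y)
    if after > before:                         # revert if the loss got worse
        W[i, j] = -W[i, j]
    return min(before, after)
```

Because each step keeps a flip only when the loss does not increase, the training loss is non-increasing over steps, at the cost of only exploring one coordinate at a time.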

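Since the sign activation is non-convex and each random initialization reaches a different network, we combine the independently trained networks by majority vote. A minimal numpy sketch of that voting step (function names are hypothetical, not from our released code):

```python
import numpy as np

def predict_one(net, X):
    """Label predictions in {-1, +1} of one sign activation network
    net = (W, W0, w, w0) on the rows of X."""
    W, W0, w, w0 = net
    return np.sign(np.sign(X @ W + W0) @ w + w0)

def majority_vote(nets, X):
    """Ensemble prediction: every independently trained network votes,
    and the sign of the vote sum gives the ensemble label."""
    votes = np.stack([predict_one(net, X) for net in nets])  # (k, n)
    return np.sign(votes.sum(axis=0))
```

With an odd number of networks (and odd hidden width) every vote sum is nonzero, so no tie-breaking rule is needed.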

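Boundary-based black box attacks need only the model's predicted label to produce an upper bound on the minimum adversarial distortion. The bisection sketch below illustrates that idea on a toy classifier; it is a crude stand-in for attacks like HopSkipJump, not the actual attack, and all names are hypothetical.

```python
import numpy as np

def distortion_upper_bound(predict, x, x_adv, tol=1e-6):
    """Binary search along the segment from a clean point x to an already
    misclassified point x_adv, returning an L2 upper bound on the minimum
    adversarial distortion. Uses only the model's predicted label, like
    boundary-based black box attacks."""
    y0 = predict(x)
    assert predict(x_adv) != y0, "x_adv must already be misclassified"
    lo, hi = 0.0, 1.0                  # fraction of the way toward x_adv
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if predict((1 - mid) * x + mid * x_adv) == y0:
            lo = mid                   # still on the clean side
        else:
            hi = mid                   # crossed the decision boundary
    return float(np.linalg.norm(((1 - hi) * x + hi * x_adv) - x))
```

Real boundary attacks then walk along the decision surface to shrink this upper bound further; the bisection alone already shows why only hard labels are required.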