DEFENDING AGAINST BLACK-BOX ADVERSARIAL ATTACKS WITH GRADIENT-FREE TRAINED SIGN ACTIVATION NEURAL NETWORKS

Abstract

While machine learning models today can achieve high accuracy on classification tasks, they can be deceived by minor, imperceptible distortions to the data. These are known as adversarial attacks and can be lethal in the black-box setting, which requires no knowledge of the target model's type or parameters. Binary neural networks with sign activation trained by gradient descent have been shown to be harder to attack than conventional sigmoid activation networks, but their improvements are marginal. We instead train sign activation networks with a novel gradient-free stochastic coordinate descent algorithm and propose an ensemble of such networks as a defense model. We evaluate the robustness of our model (a hard problem in itself) on image, text, and medical ECG data and find it to be more robust than ensembles of binary, full-precision, and convolutional neural networks, and than random forests, while attaining comparable clean test accuracy. To explain our model's robustness we show that an adversarial example targeting a single network in our ensemble fails to transfer to the other networks in the ensemble. A datapoint therefore requires a large distortion to fool the majority of networks in our ensemble and is likely to be detected in advance. This non-transferability arises naturally from the non-convexity of sign activation networks and the randomization in our gradient-free training algorithm, without any explicit adversarial defense effort.

1. INTRODUCTION

State-of-the-art machine learning algorithms can achieve high accuracy on classification tasks but misclassify minor perturbations of the data known as adversarial attacks Goodfellow et al. (2015); Papernot et al. (2016b); Kurakin et al. (2016); Carlini & Wagner (2017); Brendel et al. (2018). Adversarial examples have been shown to transfer across models, which makes it possible to perform transfer-based (substitute model) black-box attacks Papernot et al. (2016a). To counter adversarial attacks many defense methods have been proposed, with adversarial training being the most popular Szegedy et al. (2014); Tramèr et al. (2018). However, this tends to lower accuracy on clean test data that has no perturbations Raghunathan et al. (2019); Zhang et al. (2019) and can still be attacked with better transfer-based methods Wu et al. (2020); Xie et al. (2019a); Dong et al. (2019). Many previously proposed defenses have also been shown to be vulnerable Carlini & Wagner (2017); Athalye et al. (2018); Ghiasi et al. (2020), leaving adversarial robustness an open problem in machine learning. A more lethal and practical attack than substitute-model training is a boundary-based one that requires only the prediction of the model Brendel et al. (2018). These attacks aim to find the minimum distortion to an input such that it fools the classifier. This is in fact an NP-hard problem for ReLU-activated neural networks Katz et al. (2017); Sinha et al. (2018) and tree ensemble classifiers Kantchelian et al. (2016). Even approximating the minimum distortion for ReLU-activated neural networks is NP-hard Weng et al. (2018). Boundary-based black-box attacks such as HopSkipJump Chen et al., Boundary Attack Brendel et al. (2018), and RayS Chen & Gu (2020) give an upper bound on the minimum adversarial distortion.
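To illustrate how such decision-based attacks obtain an upper bound on the minimum adversarial distortion, the sketch below shows the boundary-projection step that attacks like Boundary Attack and HopSkipJump build on: given a clean input and any misclassified starting point, binary-search along the segment between them for the smallest distortion that still flips the predicted label. This is a minimal illustration, not the full attacks; the classifier here is a hypothetical stand-in (a linear threshold), since only its hard label is assumed observable.

```python
import numpy as np

def predict(x):
    # Hypothetical black-box classifier: only the hard label is observable.
    return int(x.sum() > 5.0)

def boundary_binary_search(x, x_adv, predict, tol=1e-6):
    """Shrink x_adv toward x while its label still differs from x's label."""
    y = predict(x)
    lo, hi = 0.0, 1.0  # interpolation weight toward x_adv
    while hi - lo > tol:
        mid = (lo + hi) / 2
        point = (1 - mid) * x + mid * x_adv
        if predict(point) != y:
            hi = mid   # still adversarial: try moving closer to x
        else:
            lo = mid   # crossed back to the clean class: back off toward x_adv
    return (1 - hi) * x + hi * x_adv

x = np.zeros(10)      # clean point, label 0
x_adv = np.ones(10)   # sum = 10 > 5, label 1: a valid adversarial starting point
x_min = boundary_binary_search(x, x_adv, predict)

# ||x_min - x|| is an upper bound on the minimum adversarial distortion;
# full attacks then repeat this step from new search directions to tighten it.
print(np.linalg.norm(x_min - x))
```

For this linear toy boundary the search direction happens to be optimal, so the bound is tight; against a real model the returned distortion is only an upper bound, which is exactly the guarantee these attacks provide.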

