ON THE PERILS OF CASCADING ROBUST CLASSIFIERS

Abstract

Ensembling certifiably robust neural networks is a promising approach for improving the certified robust accuracy of neural models. Black-box ensembles, which assume only query access to the constituent models (and their robustness certifiers) during prediction, are particularly attractive due to their modular structure. Cascading ensembles are a popular instance of black-box ensembles that appear to improve certified robust accuracies in practice. However, we show that the robustness certifier used by a cascading ensemble is unsound. That is, when a cascading ensemble is certified as locally robust at an input x (with respect to ε), there can be inputs x′ in the ε-ball centered at x such that the cascade's prediction at x′ differs from its prediction at x, and thus the ensemble is not locally robust. Our theoretical findings are accompanied by empirical results that further demonstrate this unsoundness. We present cascade attack (CasA), an adversarial attack against cascading ensembles, and show that: (1) there exists an adversarial input for up to 88% of the samples where the ensemble claims to be certifiably robust and accurate; and (2) the accuracy of a cascading ensemble under our attack is as low as 11% when it claims to be certifiably robust and accurate on 97% of the test set. Our work reveals a critical pitfall of cascading certifiably robust models by showing that the seemingly beneficial strategy of cascading can actually hurt the robustness of the resulting ensemble. Our code is available at https://github.com/TristaChi/ensembleKW.

¹Percentage of inputs where the classifier is accurate and certified as locally robust.

1. INTRODUCTION

Local robustness has emerged as an important requirement of classifier models. It ensures that models are not susceptible to misclassifications caused by small perturbations to correctly classified inputs. A lack of robustness can not only be exploited by malicious actors (in the form of adversarial examples (Szegedy et al., 2014)) but can also lead to incorrect behavior in the presence of natural noise (Gilmer et al., 2019). However, ensuring the local robustness of neural network classifiers has turned out to be a hard challenge. Although neural networks can achieve state-of-the-art classification accuracies on a variety of important tasks, neural classifiers with comparable certified robust accuracies¹ (CRA, Def. 2.2) remain elusive, even when trained in a robustness-aware manner (Madry et al., 2018; Wong & Kolter, 2018; Cohen et al., 2019; Leino et al., 2021).

In light of the limitations of robustness-aware training, ensembling certifiably robust neural classifiers has been shown to be a promising approach for improving certified robust accuracies (Wong et al., 2018; Yang et al., 2022). An ensemble combines the outputs of multiple base classifiers to make a prediction, and is a well-known mechanism for improving classification accuracy when one only has access to weak learners (Dietterich, 2000; Bauer & Kohavi, 1999). Ensembles designed to improve CRA take one of two forms. White-box ensembles (Yang et al., 2022; Zhang et al., 2019; Liu et al., 2020) assume white-box access to the constituent models. Black-box ensembles (Wong et al., 2018; Blum et al., 2022), on the other hand, assume only query access to the constituent classifiers during prediction, and are therefore agnostic to their internal details. They reuse the prediction and certification outcomes of the constituent models to compute the ensemble's prediction and certificate.
Their black-box nature lends them modularity and permits any combination of constituent classifiers, irrespective of their individual certification mechanisms, so we focus our efforts on them in this paper. Cascading ensembles (Wong et al., 2018; Blum et al., 2022) are a particularly popular instance of black-box ensembles that appear to improve CRA in practice. They evaluate the constituent classifiers (and their certifiers) in a fixed sequence. The ensemble's prediction is the output of the first constituent classifier in the sequence that is certified locally robust, defaulting to the last classifier's output if no model can be certified. Importantly, the cascading ensemble is itself certified locally robust only when at least one of the constituent classifiers is certified locally robust.

Our contributions. We show in this paper that the local robustness certification mechanism used by cascading ensembles is unsound even when the certifiers used by each of the constituent classifiers are sound (Theorem 2.8). In other words, when a cascading ensemble is certified as locally robust at an input x, there can, in fact, be inputs x′ in the ε-ball centered at x such that the cascade's prediction at x′ differs from its prediction at x. Figure 1 demonstrates this visually on a toy dataset. The cascading ensemble can have points that are less than ε away from the decision boundary, yet the ensemble is certified locally robust at such points (Figure 1(d)). As a consequence of our result, the use of a cascading ensemble in any scenario requiring local robustness guarantees is unsafe, and existing empirical results that report the CRA of cascading ensembles are not valid. Guided by our theoretical construction, we propose cascade attack (CasA, Algorithm 3.1), an adversarial attack against cascading ensembles, and conduct an empirical evaluation with the cascading ensembles trained by Wong et al. (2018) for the MNIST and CIFAR-10 datasets.
With CasA, we show that: (1) there exists an adversarial example for up to 88% of the samples where the ensemble claims to be certifiably robust and accurate; and (2) the accuracy of a cascading ensemble under our attack is as low as 11% when it claims to be certifiably robust and accurate on 97% of the test set.
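The failure mode described above can be reproduced in miniature. The following sketch (our own illustrative construction, not the paper's Theorem 2.8 proof or the CasA algorithm) builds two 1-D threshold classifiers, each with an exact and therefore sound certifier, cascades them, and exhibits an input that the cascade certifies as robust yet whose prediction flips under an ε-sized perturbation:

```python
# Toy 1-D demonstration that a cascade's certificate can be unsound even
# when every constituent certifier is sound. The classifiers, boundaries,
# and epsilon below are invented for illustration.
EPS = 1.0

def make_threshold_classifier(boundary, left_label, right_label):
    """Return (predict, certify) for a 1-D threshold classifier.

    certify(x) is True iff the prediction is provably constant on the
    closed EPS-ball around x -- an exact (hence sound) certifier.
    """
    def predict(x):
        return left_label if x < boundary else right_label
    def certify(x):
        return abs(x - boundary) > EPS
    return predict, certify

def cascade(models, x):
    """Cascading ensemble: return the first certified constituent's
    prediction, defaulting to the last model's (uncertified) output."""
    for predict, certify in models:
        if certify(x):
            return predict(x), True   # cascade claims local robustness here
    return models[-1][0](x), False

model_a = make_threshold_classifier(0.0, left_label=0, right_label=1)
model_b = make_threshold_classifier(-5.0, left_label=0, right_label=1)
models = [model_a, model_b]

x = -0.5                    # model A cannot certify; model B certifies label 1
pred, certified = cascade(models, x)       # cascade outputs 1 and claims robustness

x_adv = -1.5                # inside the EPS-ball around x
pred_adv, _ = cascade(models, x_adv)       # now model A certifies first, with label 0

# The certificate at x is violated within its own EPS-ball.
assert certified and abs(x - x_adv) <= EPS and pred_adv != pred
```

The flip occurs because the perturbation changes *which* constituent is certified first, a case the cascade's certificate never accounts for.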



Figure 1: Visualizing classification results of 2D points for constituent models (a-c) and the corresponding Cascading Ensemble (d, Def. 2.7) and Uniform Voting Ensemble (e, Def. 5.3). Colored regions correspond to predictions (0: red, 1: blue, 2: green) made by the underlying model (or ensemble). Darker colors indicate that the accompanying robustness certification of the underlying model (or ensemble) returns 1; lighter colors are for cases when the certification returns 0. All points receiving 1 for certification (darker regions) are at least ε away from the other classes in (a)-(c), i.e., certification is sound (Def. 2.3). This property is violated in (d), e.g., points in the dark red regions are not ε away from the blue region in the zoomed-in view on the left, but it is preserved in (e). Namely, voting ensembles are soundness-preserving (Def. 2.6) while cascading ensembles are not.
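For contrast with the cascade, the soundness-preserving behavior in panel (e) can be sketched as a voting rule. Since Def. 5.3 is not reproduced in this excerpt, the strict-majority formulation below is an assumption made for illustration: the ensemble certifies only when a strict majority of constituents are themselves certified for the winning label, so no ε-perturbation can dislodge that majority.

```python
from collections import Counter

def uniform_vote(models, x):
    """Hedged sketch of a uniform voting ensemble (in the spirit of
    Def. 5.3; the exact definition is assumed here, not quoted).
    Predict the majority label; certify only when a strict majority of
    constituents are certified *for that label*. Each such constituent
    keeps its label everywhere in the eps-ball, so the majority -- and
    hence the ensemble's prediction -- cannot change there."""
    preds = [predict(x) for predict, _ in models]
    label, _ = Counter(preds).most_common(1)[0]
    certified_votes = sum(
        1 for predict, certify in models
        if certify(x) and predict(x) == label)
    return label, certified_votes > len(models) / 2

# Toy constituents: (predict, certify) pairs with constant behavior.
models = [
    (lambda x: 1, lambda x: True),    # certified, predicts 1
    (lambda x: 1, lambda x: True),    # certified, predicts 1
    (lambda x: 0, lambda x: False),   # uncertified, predicts 0
]
label, certified = uniform_vote(models, x=0.0)   # majority label 1, 2 of 3 certified votes
```

Unlike the cascade, this rule's certificate depends on *how many* sound constituent certificates back the prediction, not merely on whether one exists, which is what makes it soundness-preserving.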

