BACKDOOR OR FEATURE? A NEW PERSPECTIVE ON DATA POISONING

Abstract

In a backdoor attack, an adversary adds maliciously constructed ("backdoor") examples into a training set to make the resulting model vulnerable to manipulation. Defending against such attacks (that is, finding and removing the backdoor examples) typically involves viewing these examples as outliers and using techniques from robust statistics to detect and remove them. In this work, we present a new perspective on backdoor attacks. We argue that without structural information on the training data distribution, backdoor attacks are indistinguishable from naturally-occurring features in the data (and thus impossible to "detect" in a general sense). To circumvent this impossibility, we assume that a backdoor attack corresponds to the strongest feature in the training data. Under this assumption, which we make formal, we develop a new framework for detecting backdoor attacks. Our framework naturally gives rise to a corresponding algorithm whose efficacy we show both theoretically and experimentally.

1. INTRODUCTION

A backdoor attack is a technique that allows an adversary to manipulate the predictions of a supervised machine learning model (Gu et al., 2017; Chen et al., 2017; Adi et al., 2018; Shafahi et al., 2018; Turner et al., 2019). To mount a backdoor attack, an adversary modifies a small subset of the training inputs in a systematic way, e.g., by adding a fixed "trigger" pattern; the adversary then modifies all the corresponding targets, e.g., by setting them all to some fixed value y_b. This intervention allows the adversary to manipulate the resulting model's predictions at test time, e.g., by inserting the trigger into test inputs.

Given the threat posed by backdoor attacks, there is increasing interest in defending ML models against them. One line of work (Jin et al., 2021; Tran et al., 2018; Hayase et al., 2021a; Chen et al., 2018) aims to detect and remove the manipulated samples from the training set. Another line of work (Levine & Feizi, 2021; Jia et al., 2021) seeks to directly train ML models that are robust to backdoor attacks (without necessarily removing any training samples).

A prevailing perspective on defending against backdoor attacks treats the manipulated samples as outliers, and thus draws a parallel between backdoor attacks and the classical data poisoning setting of robust statistics. In the latter setting, one receives data that is drawn from a known distribution D with probability 1 - ε, and adversarially chosen with probability ε; the goal is to detect (or learn in spite of) the adversarially chosen points. This perspective is a natural one to take and has led to a host of defenses against backdoor attacks, but is it the right way to approach the problem?

In this work, we take a step back from the above intuition and offer a new perspective on data poisoning: rather than viewing the manipulated images as outliers, we view the trigger pattern itself as just another feature in the data.
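To make the threat model above concrete, the following is a minimal sketch (not the paper's code) of how a trigger-based backdoor attack poisons a training set: a fixed patch is stamped into a small ε-fraction of the inputs, and those examples are relabeled to a target class y_b. The trigger shape, patch value, and function names are illustrative assumptions.

```python
import numpy as np

def add_trigger(x, patch_value=1.0, size=3):
    """Stamp a small square trigger into the corner of an image
    (a stand-in for the fixed trigger pattern described in the text)."""
    x = x.copy()
    x[-size:, -size:] = patch_value
    return x

def poison(X, y, eps=0.05, y_b=0, rng=None):
    """Insert the trigger into an eps-fraction of the training set and
    relabel those examples to the adversary's target class y_b.
    Returns the poisoned data along with the poisoned indices."""
    rng = np.random.default_rng(rng)
    X, y = X.copy(), y.copy()
    idx = rng.choice(len(X), size=int(eps * len(X)), replace=False)
    for i in idx:
        X[i] = add_trigger(X[i])
    y[idx] = y_b
    return X, y, idx
```

At test time, the same `add_trigger` applied to a clean input would steer a model trained on the poisoned set towards class y_b.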
Specifically, we demonstrate that backdoors inserted into a dataset can be indistinguishable from features already present in that dataset. On one hand, this immediately pinpoints the difficulty of detecting backdoor attacks, especially when they can correspond to arbitrary patterns. On the other hand, this new perspective suggests there might be an equivalence between detecting backdoor attacks and surfacing features in the data.

Equipped with this perspective, we introduce a framework for studying features in input data and characterizing their strength. Within this framework, we can view backdoor attacks simply as particularly strong features. Furthermore, the framework naturally gives rise to an algorithm for detecting, using datamodels (Ilyas et al., 2022), the strongest features in a given dataset. We use this algorithm to detect and remove backdoor training examples, and provide theoretical guarantees on its performance. Finally, we demonstrate through a range of experiments the effectiveness of our framework in detecting backdoor examples for a variety of standard backdoor attacks. Concretely, our contributions are as follows:

• We argue that in the absence of any knowledge about the distribution of natural image data, backdoor attacks are in a natural sense indistinguishable from existing features in the data.

• We make this intuition more precise by providing a formal definition of a feature that naturally captures backdoor triggers as a special case. We then re-frame the problem of detecting backdoor examples as one of detecting a particular feature in the data. To make the problem feasible (i.e., to distinguish the backdoor feature from the others), we assume that the backdoor is the strongest feature in the training set, an assumption we make formal.

• Under this assumption, we show how to leverage datamodels (Ilyas et al., 2022) to detect backdoor examples. We provide theoretical guarantees on our approach's effectiveness at identifying backdoor examples.
• We show experimentally that our algorithm (or rather, an efficient approximation to it) effectively identifies backdoor training examples across a range of settings.
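To illustrate the kind of detection algorithm these contributions point towards, here is a hypothetical sketch (not the paper's actual method) of surfacing a "strongest feature" from datamodel-style influence estimates. It assumes a precomputed matrix `theta` where `theta[i, j]` estimates the influence of training example i on example j; the intuition is that examples sharing a strong feature (such as a backdoor trigger) reinforce one another and so dominate the top principal direction of this matrix.

```python
import numpy as np

def flag_strongest_feature(theta, k):
    """Given a (hypothetical, precomputed) influence matrix theta,
    return the k training examples most aligned with its dominant
    direction, as candidates for sharing the strongest feature."""
    sym = (theta + theta.T) / 2            # symmetrize pairwise influences
    _, vecs = np.linalg.eigh(sym)          # eigenvalues in ascending order
    top = vecs[:, -1]                      # eigenvector of largest eigenvalue
    return np.argsort(-np.abs(top))[:k]    # indices with largest loadings
```

On synthetic data with a planted block of mutually-reinforcing examples, the top eigenvector concentrates on that block, so the flagged set recovers it.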

2. A FEATURE-BASED PERSPECTIVE ON BACKDOOR ATTACKS

The prevailing perspective on backdoor attacks casts them as an instance of data poisoning, a concept with a rich history in robust statistics (Hampel et al., 2011). In data poisoning, the goal is to learn from a dataset where most of the points, say a (1 - ε)-fraction, are drawn from a distribution D, and the remaining ε-fraction are chosen by an adversary. The parallel between this "classical" data poisoning setting and that of backdoor attacks is natural. After all, in a backdoor attack the adversary inserts the trigger into only a small fraction of the data, which is otherwise primarily drawn from a data distribution D. This threat model is thus tightly connected to the classical poisoning setting in robust statistics.

In the classical setting, structural knowledge of the distribution D is essential to obtaining any theoretical guarantees. For example, the algorithms developed there often leverage strong explicit distributional assumptions, e.g., (sub-)Gaussianity (Lugosi & Mendelson, 2019). In settings such as computer vision, however, no such structure is known. In fact, we lack almost any characterization of how benchmark image datasets are distributed. In this section, we argue that without such structure, backdoor attacks are fundamentally indistinguishable from features already present in the dataset.

Backdoor attacks can be "realistic" features. First, we show that one can mount a backdoor attack using features that are already present in the dataset. In Figure 1, we mount a backdoor attack on ImageNet (Deng et al., 2009) by using hats in place of a fixed trigger pattern. The resulting dataset is entirely plausible in that the images are (at least somewhat) realistic, and the corresponding labels are unchanged; with some more careful photo editing, one could imagine embedding the hats in a way that makes the dataset look unmodified even to a human.
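For comparison with the hat example, the classical ε-contamination model described above can be written as a simple generative process; the following is an illustrative sketch in which the clean and adversarial samplers are placeholder assumptions.

```python
import numpy as np

def sample_contaminated(n, eps, sample_clean, sample_adversarial, rng=None):
    """Draw n points from the classical eps-contamination model:
    each point comes from the clean distribution D with probability
    1 - eps, and from an adversarially chosen distribution otherwise.
    Returns the data and (for illustration only) the contamination mask,
    which a real defender never observes."""
    rng = np.random.default_rng(rng)
    data, is_adv = [], []
    for _ in range(n):
        adv = rng.random() < eps
        data.append(sample_adversarial(rng) if adv else sample_clean(rng))
        is_adv.append(adv)
    return np.array(data), np.array(is_adv)
```

Under distributional assumptions on D (e.g., sub-Gaussianity), the adversarial ε-fraction can provably be detected or tolerated; the point of this section is that image data offers no such handle.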
At test time, however, the hats act as an effective backdoor trigger: model predictions are skewed towards cats whenever a hat is added to the test sample. Should we expect a backdoor detection algorithm to flag these examples?

Backdoor attacks can occur naturally. In fact, the adversary need not modify the dataset at all: one can use features already present in the dataset to manipulate models at test time. For example, a naturally-occurring trigger for ImageNet is a simple image of a tennis ball; we provide more details about the "tennis ball" trigger in Appendix C. These examples highlight that, without additional assumptions, trigger patterns for backdoor attacks are nothing more than features in the data. Detecting them should thus be no easier than detecting hats, backgrounds, or any other spurious correlation in the data. In the next sections, we will use this insight to craft more specific conditions under which we can hope to detect backdoor examples.

