BACKDOOR OR FEATURE? A NEW PERSPECTIVE ON DATA POISONING

Abstract

In a backdoor attack, an adversary adds maliciously constructed ("backdoor") examples into a training set to make the resulting model vulnerable to manipulation. Defending against such attacks-that is, finding and removing the backdoor examples-typically involves viewing these examples as outliers and using techniques from robust statistics to detect and remove them. In this work, we present a new perspective on backdoor attacks. We argue that without structural information on the training data distribution, backdoor attacks are indistinguishable from naturally-occurring features in the data (and thus impossible to "detect" in a general sense). To circumvent this impossibility, we assume that a backdoor attack corresponds to the strongest feature in the training data. Under this assumption-which we make formal-we develop a new framework for detecting backdoor attacks. Our framework naturally gives rise to a corresponding algorithm whose efficacy we show both theoretically and experimentally.

1. INTRODUCTION

A backdoor attack is a technique that allows an adversary to manipulate the predictions of a supervised machine learning model (Gu et al., 2017; Chen et al., 2017; Adi et al., 2018; Shafahi et al., 2018; Turner et al., 2019). To mount a backdoor attack, an adversary modifies a small subset of the training inputs in a systematic way, e.g., by adding a fixed "trigger" pattern; the adversary then modifies all the corresponding targets, e.g., by setting them all to some fixed value y_b. This intervention allows the adversary to manipulate the resulting model's predictions at test time, e.g., by inserting the trigger into test inputs. Given the threat posed by backdoor attacks, there is an increasing interest in defending ML models against them. One line of work (Jin et al., 2021; Tran et al., 2018; Hayase et al., 2021a; Chen et al., 2018) aims to detect and remove the manipulated samples from the training set. Another line of work (Levine & Feizi, 2021; Jia et al., 2021) seeks to directly train ML models that are robust against backdoor attacks (without necessarily removing any training samples). A prevailing perspective on defending against backdoor attacks treats the manipulated samples as outliers, and thus draws a parallel between backdoor attacks and the classical data poisoning setting of robust statistics. In the latter setting, one receives data that is drawn from a known distribution D with probability 1 - ε, and adversarially chosen with probability ε; the goal is to detect (or learn in spite of) the adversarially chosen points. This perspective is a natural one to take and has led to a host of defenses against backdoor attacks, but is it the right way to approach the problem? In this work, we take a step back from the above intuition and offer a new perspective on data poisoning: rather than viewing the manipulated images as outliers, we view the trigger pattern itself as just another feature in the data.
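The attack described above can be sketched concretely. The following is a minimal, hypothetical illustration (not the construction of any specific cited attack): a fixed trigger patch is stamped onto a small random fraction of training images, and the corresponding targets are all set to the attacker's label y_b. The function name, patch shape, and placement are illustrative assumptions.

```python
import numpy as np

def poison_dataset(X, y, frac=0.05, target_label=0, seed=0):
    """Illustrative backdoor poisoning: stamp a fixed 3x3 white trigger
    patch into a random subset of images and relabel them to the
    attacker's target class y_b (here, `target_label`).

    X: float array of shape (n, H, W) with pixel values in [0, 1]
    y: int array of shape (n,)
    Returns poisoned copies of X and y, plus the poisoned indices.
    """
    rng = np.random.default_rng(seed)
    Xp, yp = X.copy(), y.copy()
    idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
    Xp[idx, -3:, -3:] = 1.0   # fixed trigger in the bottom-right corner
    yp[idx] = target_label    # all poisoned targets set to the fixed label
    return Xp, yp, idx

# Example: poison 5% of a toy 8x8 grayscale dataset
X = np.random.rand(100, 8, 8)
y = np.random.randint(0, 10, size=100)
Xp, yp, idx = poison_dataset(X, y, frac=0.05, target_label=0)
```

At test time, the adversary exploits the planted association by stamping the same patch onto a test input, steering the model toward `target_label`. Note that nothing in the poisoned set marks these rows as anomalous per se; the trigger is simply a highly predictive pattern, which is the intuition developed below.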
Specifically, we demonstrate that backdoors inserted into a dataset can be indistinguishable from features already present in that dataset. On one hand, this immediately pinpoints the difficulty of detecting backdoor attacks, especially when they can correspond to arbitrary patterns. On the other hand, this new perspective suggests there might be an equivalence between detecting backdoor attacks and surfacing features in the data. Equipped with this perspective, we introduce a framework for studying features in input data and characterizing their strength. Within this framework, we can view backdoor attacks simply as particularly strong features. Furthermore, the framework naturally gives rise to an algorithm for detecting backdoor attacks.

