POISONING GENERATIVE MODELS TO PROMOTE CATASTROPHIC FORGETTING

Abstract

Generative models have become the workhorse of many state-of-the-art machine learning methods. However, their vulnerability to poisoning attacks has been largely understudied. In this work, we investigate this issue in the context of continual learning, where generative replayers are used to combat catastrophic forgetting. By developing a novel customization of dirty-label input-aware backdoors for the online setting, our attacker stealthily promotes forgetting while retaining high accuracy on the current task and withstanding strong defenses. Our approach exploits an intriguing property of generative models, namely that they cannot accurately capture input-dependent triggers. Experiments on four standard datasets corroborate the poisoner's effectiveness.

1. INTRODUCTION

The vulnerability of machine learning systems must be scrutinized before they can be deployed in security-critical applications. The common evasion attack assumes that clean target instances can be manipulated at test time, which is unrealistic in many scenarios. In contrast, poisoning attacks only make malicious and imperceptible modifications to the training set, so that predictions on test examples are led astray. The threat models may insert poison examples (Chen et al., 2017), flip training labels (Xiao et al., 2012; Levine & Feizi, 2021), or modify the inputs of training examples (Biggio et al., 2012; Shafahi et al., 2018).

Although poisoning attacks have been extensively studied for discriminative learning, their potential risk in generative learning has been largely understudied. Ding et al. (2019) poison the training examples so that the learned generator covertly changes some important part of the output image, e.g., turning a red light into green. Salem et al. (2020) enable the adversary to control the output image by planting a trigger in the input image or noise. Both are backdoor attacks that require write access to test data and operate in batch learning scenarios. The increasing penetration of generative models in machine learning urges the investigation of poisoning attacks in a broader range of learning paradigms.

In this work, we focus on continual learning, a prominent setting where tasks arrive in streams and each of them corresponds to a discriminative learning problem such as classification (Chen & Liu, 2018). Since the tasks are streamed and cannot be stored, the running classifier often suffers from catastrophic forgetting, where the performance on older tasks gradually deteriorates (McCloskey & Cohen, 1989). Deep generative replay (DGR) is a natural tool to bring back the memory of previous tasks by learning a generative model to fit their data (Shin et al., 2017; Cong et al., 2020).
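The DGR control flow just described can be sketched as follows. The per-class Gaussian replayer and nearest-mean classifier below are illustrative stand-ins (not the models used in the paper) for the generative replayer G_t and classifier C_t:

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianReplayer:
    """Toy stand-in for the generative replayer G_t: fits an
    independent Gaussian per class and samples from it."""
    def __init__(self):
        self.stats = {}  # label -> (mean, std)

    def fit(self, X, y):
        for c in np.unique(y):
            Xc = X[y == c]
            self.stats[int(c)] = (Xc.mean(axis=0), Xc.std(axis=0) + 1e-6)

    def sample(self, n_per_class):
        Xs, ys = [], []
        for c, (mu, sd) in self.stats.items():
            Xs.append(rng.normal(mu, sd, size=(n_per_class, mu.size)))
            ys.append(np.full(n_per_class, c))
        return np.vstack(Xs), np.concatenate(ys)

class NearestMeanClassifier:
    """Toy stand-in for the classifier C_t: predicts the class with
    the nearest per-class mean."""
    def fit(self, X, y):
        self.means = {int(c): X[y == c].mean(axis=0) for c in np.unique(y)}

    def predict(self, X):
        labels = list(self.means)
        M = np.stack([self.means[c] for c in labels])
        d = ((X[:, None, :] - M[None, :, :]) ** 2).sum(-1)
        return np.array(labels)[d.argmin(axis=1)]

def dgr_step(replayer, D_t, n_replay=50):
    """One continual-learning step with generative replay: mix the
    current task's data with samples replayed from the old replayer,
    then refit both the classifier and the replayer on the mixture."""
    X, y = D_t
    if replayer.stats:  # replay previously learned classes
        X_r, y_r = replayer.sample(n_replay)
        X, y = np.vstack([X, X_r]), np.concatenate([y, y_r])
    clf = NearestMeanClassifier()
    clf.fit(X, y)        # update C_t
    replayer.fit(X, y)   # update G_t on the same mixture
    return clf

# Two tasks with disjoint classes; thanks to replay, the classifier
# trained during task 2 still recalls task 1.
task1 = (rng.normal([0.0, 0.0], 0.1, (100, 2)), np.zeros(100, dtype=int))
task2 = (rng.normal([5.0, 5.0], 0.1, (100, 2)), np.ones(100, dtype=int))
rep = GaussianReplayer()
clf = dgr_step(rep, task1)
clf = dgr_step(rep, task2)
acc_task1 = (clf.predict(task1[0]) == 0).mean()
```

Note that the current-task data D_t feeds both C_t and G_t in this loop: this is precisely the attack surface exploited below, since anything poisoned in D_t is replayed to all future classifiers.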
Despite their effectiveness, replayers also open up new vulnerabilities: misleading examples can be injected into the training data D_t for the current task t, so that catastrophic forgetting is promoted when the poisoned D_t is used to train both the replayer G_t and the classifier. In this work, we seek practical and stealthy poisoning attacks on DGR that achieve three objectives:

O1 After moving past task t, the classifier soon forgets what was learned from it (i.e., performs poorly on clean test examples drawn from it) despite using a replayer for all the tasks seen so far.

O2 During task t, the classifier trained on the poisoned data does not suffer degraded test accuracy on task t itself. This is important because poor performance on the current task would raise significant and immediate suspicion. In contrast, by promoting forgetting, the harm manifests itself only after the victim has moved on to the next task, by which time it is too late because access to the samples of task t has already been lost.

O3 The poisoned data should be robust to solid defenses deployed by both the classifier and the replayer.

The main difficulty is twofold. First, although O1 and O2 are each straightforward to fulfill individually, they are at odds with each other and hard to fulfill simultaneously. Second, the transiency of data streams compels the adversary to make irrevocable attacks at task t, before future tasks arrive and before catastrophic forgetting can begin to occur. Owing to this difficulty, poisoning attacks have been much less studied in the online setting.

Our contribution, therefore, is to overcome these challenges and to reveal a vulnerability of generative models: their training data can be poisoned stealthily such that a task is learned well at present but forgotten soon afterwards. Noting that simple label-flipping poisoning is easily detected, we resort to a dirty-label backdoor/Trojan attack (Liu et al., 2018) to attain O2: the trained classifier performs correctly on clean examples, but errs if an example is planted with a trigger. To further achieve O1 and O3, we capitalize on the input-aware backdoor (Nguyen & Tran, 2020), which allows the trigger to vary with the image. As a result, it not only withstands stronger defenses (§4), but also enjoys higher variation and stealthiness, and is hence much harder for a generative model to capture. The replayed images therefore do not preserve the trigger well (we call this the trigger-discarding property in §3.3) while retaining the incorrect label, leading naturally to forgetting (§3). The problem is set up in §2, and experiments in §5 show the effectiveness of the attack.

Our innovations are summarized as follows:

• Proposing the first poisoning attack that promotes catastrophic forgetting in continual learning.
• Achieving poisoning (no trigger is needed at test time) through a novel way of leveraging backdoor attacks that is particularly effective for exacerbating catastrophic forgetting.

• Identifying a trigger-discarding property of generative models that is intriguing for backdoor attacks.

Related work. Generative models have become pervasive in machine learning (Murphy, 2023, Part IV), reaching far beyond their original role of density estimation and serving as key infrastructure in supervised, unsupervised, and reinforcement learning. We contend that their vulnerability needs to be examined in the context of their use. For vanilla density estimation, Condessa & Kolter (2020) learned robust variational auto-encoders (VAEs) that retain high likelihood for data points under adversarial perturbation. The underlying threat there is an evasion attack, and along similar lines, Tabacof et al. (2016) and Kos et al. (2018) studied attacks that increase the reconstruction error of a VAE's decoder. Some recent works address attacks on membership inference (Hayes et al., 2019; Chen et al., 2020; Hilprecht et al., 2019), model extraction (Hu & Pang, 2021), and attribute inference (Stadler et al., 2022). However, poisoning attacks on generative models remain understudied. Our aim is to poison a generative model, rather than to learn a generative model that produces poisons for another (discriminative) model (Yang et al., 2017; Muñoz-González et al., 2019). We also leave defending against the proposed attack as future work, noting that (certifiable) defense and detection have been well studied for poisoning attacks on batch discriminative models (Peri et al., 2019; Steinhardt et al., 2017; Levine & Feizi, 2021; Jagielski et al., 2018).
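The dirty-label, input-aware poisoning step outlined in the introduction can be sketched as follows. The trigger generator here is a hypothetical stand-in (a hash-seeded corner patch), not the trained generator network of Nguyen & Tran (2020); the point illustrated is only that each input receives its own distinct trigger and a flipped label:

```python
import numpy as np

def input_aware_trigger(x, patch=4, strength=0.5):
    """Illustrative stand-in for an input-aware trigger generator:
    seed a PRNG with a hash of the image so that every input yields
    its own distinct trigger pattern (here, a corner patch)."""
    seed = int(np.abs(x).sum() * 1e6) % (2 ** 32)
    g = np.random.default_rng(seed)
    trig = np.zeros_like(x)
    trig[:patch, :patch] = strength * g.random((patch, patch))
    return trig

def poison(x, y, n_classes):
    """Dirty-label poisoning: plant the input-dependent trigger in x
    and pair the result with an incorrect label."""
    x_p = np.clip(x + input_aware_trigger(x), 0.0, 1.0)
    y_p = (y + 1) % n_classes  # any label other than y works
    return x_p, y_p

rng = np.random.default_rng(0)
x1, x2 = rng.random((28, 28)), rng.random((28, 28))
t1, t2 = input_aware_trigger(x1), input_aware_trigger(x2)
# t1 != t2: the trigger varies per input, which is what makes it hard
# for a generative replayer to reproduce (trigger discarding).
```

Because the trigger pattern changes with every image, a replayer trained on such poisons tends to reproduce the image and the wrong label but not the trigger, which is the mechanism behind O1.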

2. ATTACKING GENERATIVE MODELS IN CONTINUAL LEARNING

We consider the continual learning setting, where tasks arrive sequentially. The goal is to keep updating a classifier that predicts accurately not only on the current task, but also on the previous tasks. Each task t is characterized by a joint distribution P_t(X, Y), where X ∈ X is the input from a feature space X, and Y ∈ Y_t is the label, whose domain Y_t may change with the task. For example, Y_1 may consist of digits 0 and 1, while Y_2 encompasses 2 and 3. Even when the domains remain constant, the distribution P_t can shift. The goal of continual learning is to find a classifier C_t such that the overall risk across all tasks seen so far is minimized:

min_{C_t}  (1/t) Σ_{s=1}^{t}  E_{(X,Y) ~ P_s} [ ℓ(C_t(X), Y) ],

where ℓ denotes the classification loss.
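Concretely, under the 0-1 loss this objective amounts to averaging the classifier's error over all task distributions seen so far; a minimal empirical sketch with illustrative names:

```python
import numpy as np

def overall_risk(classifier, tasks):
    """Empirical 0-1 risk of a single classifier C_t averaged over
    all tasks P_1..P_t seen so far.  A rising risk on the older
    tasks is the signature of catastrophic forgetting."""
    per_task = [(classifier(X) != y).mean() for X, y in tasks]
    return float(np.mean(per_task))

# Example with a 1-D threshold classifier over two tasks.
clf = lambda X: (X > 0.5).astype(int)
tasks = [
    (np.array([0.1, 0.2, 0.9]), np.array([0, 0, 1])),  # task 1
    (np.array([0.6, 0.7, 0.4]), np.array([1, 1, 0])),  # task 2
]
risk = overall_risk(clf, tasks)  # 0.0: correct on every task
```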
Mladenovic et al. (2022) addressed the online decision problem of selecting k examples for an evasion attack. Zhang et al. (2020) assumed the instances are drawn i.i.d. from a time-invariant distribution, which is not the case in continual learning, where tasks may even have disjoint classes. Other works require multiple passes over the data stream (Gong et al., 2019; Lin et al., 2017; Sun et al., 2020) or clairvoyant knowledge of future data (Burkard & Lagesse, 2017; Wang & Chaudhuri, 2018).

