FEW-SHOT BACKDOOR ATTACKS VIA NEURAL TANGENT KERNELS

Abstract

In a backdoor attack, an attacker injects corrupted examples into the training set. The goal of the attacker is to cause the final trained model to predict the attacker's desired target label when a predefined trigger is added to test inputs. Central to these attacks is the trade-off between the success rate of the attack and the number of corrupted training examples injected. We pose this attack as a novel bilevel optimization problem: construct strong poison examples that maximize the attack success rate of the trained model. We use neural tangent kernels to approximate the training dynamics of the model being attacked and automatically learn strong poison examples. We experiment on subclasses of CIFAR-10 and ImageNet with WideResNet-34 and ConvNeXt architectures on periodic and patch trigger attacks and show that NTBA-designed poison examples achieve, for example, an attack success rate of 90% with ten times fewer injected poison examples than the baseline. We provide an interpretation of the NTBA-designed attacks using the analysis of kernel linear regression. We further demonstrate a vulnerability in overparameterized deep neural networks, which is revealed by the shape of the neural tangent kernel.

1. INTRODUCTION

Modern machine learning models, such as deep convolutional neural networks and transformer-based language models, are often trained on massive datasets to achieve state-of-the-art performance. These datasets are frequently scraped from public domains with little quality control. In other settings, models are trained on shared data, e.g., federated learning (Kairouz et al., 2019), where injecting maliciously corrupted data is easy. Such models are vulnerable to backdoor attacks (Gu et al., 2017), in which the attacker injects corrupted examples into the training set with the goal of creating a backdoor when the model is trained. When the model is shown test examples with a particular trigger chosen by the attacker, the backdoor is activated and the model outputs a prediction of the attacker's choice. The predictions on clean data remain the same, so the model's corruption will not be noticed in production.

Weaker attacks require injecting more corrupted examples into the training set, which can be challenging and costly; in cross-device federated systems, for example, this requires tampering with many devices (Sun et al., 2019). Further, even if the attacker has the resources to inject more corrupted examples, stronger attacks requiring fewer poison training examples are preferred, since injecting more poison data increases the chance of being detected by human inspection under random screening. For such systems, there is a natural optimization problem of interest to the attacker: assuming the attacker wants to achieve a certain success rate for a trigger of their choice, how can they do so with the minimum number of corrupted examples injected into the training set?

For a given choice of trigger, the success of an attack is measured by the Attack Success Rate (ASR), defined as the probability that the corrupted model predicts a target class, y_target, for an input image from another class with the trigger applied. Such an input is referred to as a test-time poison example. To increase ASR, train-time poison examples are injected into the training data. A typical recipe is to mimic the test-time poison example by randomly selecting an image from a class other than the target class, applying the trigger function P : ℝ^k → ℝ^k, and labeling the result as the target class, y_target (Barni et al., 2019; Gu et al., 2017; Liu et al., 2020). We refer to this as the "sampling" baseline; a minimal code sketch of this baseline is given below. In Barni et al. (2019), for example, the trigger is a periodic image-space signal ∆ ∈ ℝ^k that is added to the image: P(x_truck) = x_truck + ∆. Example images for this attack, along with the label-consistent attack of Turner et al. (2019), are shown in Fig. 2.

Notice how this baseline, although widely used in the robust machine learning literature, wastes the opportunity to construct stronger attacks. We propose to exploit this under-explored attack surface and carefully design the train-time poison examples tailored to the choice of the backdoor trigger. We want to emphasize that our goal in proving the existence of such strong backdoor attacks is to motivate continued research into backdoor defenses and to inspire practitioners to carefully secure their machine learning pipelines. There is a false sense of safety in systems that rely on a large number of honest data contributors to keep the fraction of corrupted contributions small; we show that it takes only a few examples to succeed in a backdoor attack. We survey the related work in Appendix A.
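As a point of reference, the following is a minimal sketch (in JAX) of the sampling baseline described above. It is not the implementation used in any of the cited works; the trigger amplitude, frequency, and all function names are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def periodic_trigger(image, amplitude=8.0, freq=6.0):
    """P(x) = x + Delta, where Delta is a vertical sinusoidal stripe pattern
    (in the style of Barni et al., 2019). Values are illustrative, not the paper's."""
    h, w = image.shape[0], image.shape[1]
    delta = amplitude * jnp.sin(2 * jnp.pi * freq * jnp.arange(w) / w)  # shape (w,)
    delta = jnp.broadcast_to(delta, (h, w))                             # same stripe on every row
    if image.ndim == 3:                                                 # RGB: add channel axis
        delta = delta[..., None]
    return jnp.clip(image + delta, 0.0, 255.0)

def sampling_baseline_poisons(key, images, labels, source_class, target_class, n_poison):
    """Pick n_poison random source-class images, apply the trigger, relabel as the target."""
    source_idx = jnp.flatnonzero(labels == source_class)
    idx = jax.random.choice(key, source_idx, shape=(n_poison,), replace=False)
    poison_x = jax.vmap(periodic_trigger)(images[idx])
    poison_y = jnp.full((n_poison,), target_class)
    return poison_x, poison_y
```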
Contributions. We borrow analyses and algorithms from kernel regression to bring a new perspective on the fundamental trade-off between the attack success rate of a backdoor attack and the number of poison training examples that need to be injected. We (i) use Neural Tangent Kernels (NTKs) to introduce a new computational tool for constructing strong backdoor attacks on deep neural networks (§§2 and 3); (ii) use the analysis of standard kernel linear regression to interpret what determines the strength of a backdoor attack (§4); and (iii) investigate the vulnerability of deep neural networks through the lens of their corresponding NTKs (Appendix E).

First, we propose a bilevel optimization problem whose solution automatically constructs strong train-time poison examples tailored to the backdoor trigger we want to apply at test time; a schematic code sketch of this formulation is given below. Central to our approach is the Neural Tangent Kernel (NTK), which models the training dynamics of the neural network. Our Neural Tangent Backdoor Attack (NTBA) achieves, for example, an ASR of 72% with only 10 poison examples in Fig. 1, which is an order of magnitude more efficient than the sampling baseline. For sub-tasks from the CIFAR-10 and ImageNet datasets and two architectures (WideResNet and ConvNeXt), we show the existence of such strong few-shot backdoor attacks for two commonly used triggers: the periodic trigger (§3) and the patch trigger (Appendix C.1). We present an ablation study showing that every component of NTBA is necessary for discovering such a strong few-shot attack (§2.5).

Second, we provide an interpretation of the poison examples designed with NTBA via an analysis of kernel linear regression. In particular, this analysis suggests that small-magnitude train-time triggers, coupled with a clean image that is close in distance, lead to strong attacks, which both explains and guides the design of strong attacks.

Finally, we investigate the vulnerability of deep neural networks to backdoor attacks by comparing the corresponding NTK with the standard Laplace kernel. NTKs allow far-away data points to have more influence than the Laplace kernel does, which is exploited by few-shot backdoor attacks.
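To make the bilevel formulation concrete, here is a hedged sketch of the idea: the inner problem (training the attacked network) is replaced by closed-form kernel ridge regression, and the outer problem runs gradient descent on the poison inputs through that closed-form solution. For self-containedness, a Laplace kernel on flattened image vectors stands in for the empirical NTK of the attacked architecture, and all function names, label conventions, and hyperparameters are assumptions rather than the paper's implementation.

```python
import jax
import jax.numpy as jnp

def laplace_kernel(A, B, bandwidth=10.0):
    # Pairwise Laplace kernel exp(-||a - b|| / bandwidth) on flattened inputs;
    # a stand-in here for the empirical NTK of the attacked network.
    d = jnp.sqrt(jnp.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1) + 1e-12)
    return jnp.exp(-d / bandwidth)

def surrogate_attack_loss(X_p, y_p, X_clean, y_clean, X_test_trig, y_target, ridge=1e-3):
    # Inner problem: kernel ridge regression on clean + poison training data (closed form).
    X_train = jnp.concatenate([X_clean, X_p], axis=0)
    y_train = jnp.concatenate([y_clean, y_p], axis=0)   # e.g. +/-1 labels for a binary sub-task
    K = laplace_kernel(X_train, X_train)
    alpha = jnp.linalg.solve(K + ridge * jnp.eye(K.shape[0]), y_train)
    # Outer objective: push predictions on triggered test inputs toward the target label.
    preds = laplace_kernel(X_test_trig, X_train) @ alpha
    return jnp.mean((preds - y_target) ** 2)

@jax.jit
def poison_update(X_p, y_p, X_clean, y_clean, X_test_trig, y_target, lr=0.1):
    # One gradient step on the poison inputs, differentiating through the inner solution.
    g = jax.grad(surrogate_attack_loss, argnums=0)(
        X_p, y_p, X_clean, y_clean, X_test_trig, y_target)
    return jnp.clip(X_p - lr * g, 0.0, 1.0)              # keep poisons in a valid pixel range

def optimize_poisons(X_p_init, y_p, X_clean, y_clean, X_test_trig, y_target, steps=200):
    X_p = X_p_init
    for _ in range(steps):
        X_p = poison_update(X_p, y_p, X_clean, y_clean, X_test_trig, y_target)
    return X_p
```

In NTBA itself, the kernel would be the empirical NTK of the architecture under attack (e.g., WideResNet-34 or ConvNeXt), and the outer objective targets the attack success rate on triggered inputs; the sketch only illustrates the mechanism of differentiating the outer objective through the inner training solution.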



Figure 1: The trade-off between the number of poisons and ASR for the periodic trigger.

Figure 2: A typical poison attack takes a random sample from the source class ("truck"), adds a trigger ∆ to it, and labels it as the target class ("deer"). Note the faint vertical striping in Fig. 2c.

