UNDERSTANDING ZERO-SHOT ADVERSARIAL ROBUSTNESS FOR LARGE-SCALE MODELS

Abstract

Pretrained large-scale vision-language models like CLIP have exhibited strong generalization over unseen tasks. Yet imperceptible adversarial perturbations can significantly reduce CLIP's performance on new tasks. In this work, we identify and explore the problem of adapting large-scale models for zero-shot adversarial robustness. We first identify two key factors during model adaptation, training losses and adaptation methods, that affect the model's zero-shot adversarial robustness. We then propose a text-guided contrastive adversarial training loss, which aligns the text embeddings and the adversarial visual features with contrastive learning on a small set of training data. We apply this training loss to two adaptation methods, model finetuning and visual prompt tuning. We find that visual prompt tuning is more effective in the absence of text guidance, while finetuning wins when text guidance is available. Overall, our approach significantly improves zero-shot adversarial robustness over CLIP, with an average improvement of 31 points over ImageNet and 15 zero-shot datasets. Our code and models are available at github.com/cvlab-columbia/ZSRobust4FoundationModel.

1. INTRODUCTION

Large-scale models trained on vision and language data, also known as foundation models, have emerged as a universal backbone for tackling many recognition problems in computer vision (Jia et al., 2021; Radford et al., 2021), graphics (Ramesh et al., 2022), and robotics (Ahn et al., 2022). One of the key advantages of foundation models is zero-shot generalization, where the model uses just a single textual description to recognize new visual categories with high accuracy. Because these large-scale models are so capable, they will continue to be adopted in many critical applications, where reliability is essential. However, robustness to adversarial examples remains a challenge: an imperceptible pattern added to the image can cause recognition failures (Croce & Hein, 2020; Carlini & Wagner, 2017; Dong et al., 2018; Szegedy et al., 2013; Moosavi-Dezfooli et al., 2016), and an attack on a foundation model can consequently corrupt its downstream applications. Due to the importance of this problem, a large literature investigates adversarial robustness for neural networks. The most common defense is adversarial training (Madry et al., 2018; Mao et al., 2019; Szegedy et al., 2013; Pang et al., 2020; Rice et al., 2020; Uesato et al., 2019), which augments the training set with mined adversarial examples that fool the image classifier. Adversarial training has been validated to improve robustness on the task the mined examples come from, but often at a cost to generalization (Stutz et al., 2019; Su et al., 2018; Pedraza et al., 2021). Our world, however, is vast and naturally open, and evaluating adversarial robustness only on the learned tasks is limited. Can we achieve zero-shot transferability of adversarial robustness, even when the model has never been trained on the unknown tasks?
In this paper, we study this important yet under-explored problem: the zero-shot adversarial robustness of large-scale vision-language models. We start our investigation with the state-of-the-art CLIP model (Radford et al., 2021), which has been shown to be effective on zero-shot recognition tasks. We find that simply adding an imperceptible perturbation to the image (≤ 1/255) can subvert CLIP's prediction (see Figure 1a). Adaptation methods and training objectives are the two major factors in adapting a large-scale model. First, besides finetuning the whole model, we consider an alternative adaptation method, visual prompt tuning, which adapts the inputs rather than the parameters of the model. Visual prompt tuning (VPT) is an emerging lightweight adaptation method (Bar et al., 2022; Bahng et al., 2022) that learns a visual prompt added to the input image; here, we use the visual prompt to instruct the model to be robust against adversaries. Second, we find that the standard adversarial training objective ignores the vision-language alignment in CLIP's pretrained representation space, causing the model to lose its zero-shot capability. We therefore propose a text-guided contrastive adversarial training (TeCoA) loss, pronounced Tekoa (tee-kow), which maximizes the similarity between the adversarial visual features and the correct text embeddings with contrastive learning. Since the adapted visual features remain well aligned with the text features, a model adapted with TeCoA maximally retains CLIP's original zero-shot generalization while enjoying improved adversarial robustness. We conduct an extensive evaluation on 15 zero-shot image datasets, offering a holistic study of the zero-shot adversarial robustness problem. This is especially important given that large-scale vision models are emerging as infrastructure and are being deployed in critical applications.
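A loss of the TeCoA form can be sketched as a cross-entropy over cosine-similarity logits between adversarial image embeddings and per-class text embeddings, as in CLIP's zero-shot classifier. This is a minimal illustrative sketch, not the paper's exact implementation; the function names and the temperature value are assumptions.

```python
import numpy as np

def tecoa_loss(img_feats, txt_feats, labels, tau=0.07):
    """Text-guided contrastive adversarial training loss (sketch).

    img_feats: (B, D) embeddings of adversarially perturbed images
    txt_feats: (C, D) text embeddings, one per class
    labels:    (B,) ground-truth class indices
    tau:       temperature (illustrative value)
    """
    # L2-normalize so the dot product is a cosine similarity
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / tau                      # (B, C) similarity logits
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Maximize similarity to the correct text embedding
    return -log_prob[np.arange(len(labels)), labels].mean()
```

Minimizing this loss over adversarial examples pulls perturbed visual features toward their matching text embeddings, which is why the adapted features can stay aligned with CLIP's text space.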
We find that lightweight VPT is noticeably more effective than model finetuning when textual information is unavailable. When texts are used during adaptation, both VPT and finetuning with our TeCoA loss drastically improve zero-shot adversarial robustness over the baselines, with finetuning gaining more than VPT as more parameters are tuned. Our best-performing model with the TeCoA loss improves adversarial robustness over CLIP by an average of 31% across the datasets. Our method also works with unlabeled images, allowing for better robustness given a large amount of unlabeled data. Our work establishes a new and important benchmark, zero-shot adversarial robustness, for future work to evaluate on. We release all models and code.
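The visual prompt tuning compared above can be sketched in its simplest additive form, following Bahng et al. (2022): a learnable pixel-space prompt is added to every input while the backbone stays frozen. The function name and the assumption of pixels normalized to [0, 1] are illustrative.

```python
import numpy as np

def apply_visual_prompt(image, prompt):
    """Additive visual prompt (sketch).

    image:  (C, H, W) input pixels, assumed normalized to [0, 1]
    prompt: (C, H, W) learnable parameters shared across all inputs;
            only these are updated during adaptation, the model is frozen
    """
    # Add the prompt and keep the result a valid image
    return np.clip(image + prompt, 0.0, 1.0)
```

Because only the prompt tensor is optimized, VPT touches far fewer parameters than finetuning, which is consistent with the trade-off reported above: lighter adaptation, smaller gains when text guidance is available.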

2. RELATED WORK

Zero-Shot Generalization aims to classify novel classes and tasks that are unseen during training (Palatucci et al., 2009; Lampert et al., 2009; Radford et al., 2021). Existing zero-shot methods often project visual features into a semantic feature space (Frome et al., 2013; Akata et al., 2015; Romera-Paredes & Torr, 2015; Xie et al., 2019; Yu et al., 2018; Liu et al., 2019), or use generative methods to synthesize visual features of unseen classes from their semantic descriptions to train classifiers (Xian et al., 2018; Ni et al., 2019; Huang et al., 2019; Schonfeld et al., 2019; Verma et al., 2019; Liu et al., 2020). Recently, large-scale pretrained vision-language models (Radford et al., 2021; Jia et al., 2021) have shown outstanding zero-shot generalization on unseen tasks via text prompt engineering. Their adversarial robustness and its transferability, however, have not been studied in the zero-shot setting.

Adversarial Robustness. Adversarial attacks for image recognition find an additive perturbation of the input that maximizes the cross-entropy loss between the model's prediction and the ground-truth one-hot label (Szegedy et al., 2013; Athalye et al., 2018; Carlini & Wagner, 2017).

Figure 1: (a, left) Despite CLIP's high performance on zero-shot image recognition tasks, it remains vulnerable when the input images are constructed adversarially. (b, right) Standard adversarial training improves robustness on the trained task (ImageNet), but comes at the expense of its zero-shot capability. Our paper studies how to adapt CLIP to achieve adversarial robustness on zero-shot tasks.

If we follow the standard adversarial training defense paradigm (Madry et al., 2018; Rice et al., 2020) and finetune CLIP on the ImageNet (Deng et al., 2009b) training set, we observe that the adapted CLIP has improved adversarial robustness on the ImageNet validation set, but at the cost of significantly reduced accuracy on unseen datasets and classes (Figure 1b). Standard adversarial training backfires on CLIP because it fails to retain the model's zero-shot generalization ability.
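The attack formulation above is typically instantiated as projected gradient descent (PGD) under an L-infinity budget (Madry et al., 2018). Below is a minimal sketch; `grad_fn` is a hypothetical callback returning the gradient of the classification loss with respect to the input, and the step size and iteration count are illustrative, matching the ≤ 1/255 budget discussed in this paper.

```python
import numpy as np

def pgd_attack(x, grad_fn, eps=1/255, alpha=0.5/255, steps=10):
    """L-infinity PGD attack (sketch).

    x:       clean input, pixels assumed in [0, 1]
    grad_fn: callable returning d(loss)/d(input) at a given point
             (supplied by the model under attack; hypothetical helper)
    eps:     perturbation budget (radius of the L-infinity ball)
    alpha:   per-step size
    """
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))  # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)         # project into the ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                 # keep pixels valid
    return x_adv
```

Adversarial training then augments each minibatch with such perturbed inputs, which is the standard paradigm whose effect on CLIP's zero-shot ability is analyzed in Figure 1b.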

