UNDERSTANDING ZERO-SHOT ADVERSARIAL ROBUSTNESS FOR LARGE-SCALE MODELS

Abstract

Pretrained large-scale vision-language models like CLIP have exhibited strong generalization over unseen tasks. Yet imperceptible adversarial perturbations can significantly reduce CLIP's performance on new tasks. In this work, we identify and explore the problem of adapting large-scale models for zero-shot adversarial robustness. We first identify two key factors during model adaptation, training losses and adaptation methods, that affect the model's zero-shot adversarial robustness. We then propose a text-guided contrastive adversarial training loss, which aligns the text embeddings and the adversarial visual features with contrastive learning on a small set of training data. We apply this training loss to two adaptation methods: model finetuning and visual prompt tuning. We find that visual prompt tuning is more effective in the absence of texts, while finetuning wins in the presence of text guidance. Overall, our approach significantly improves the zero-shot adversarial robustness over CLIP, with an average improvement of 31 points across ImageNet and 15 zero-shot datasets. Our code and model are available at github.com/cvlab-columbia/ZSRobust4FoundationModel.

1. INTRODUCTION

Large-scale models trained on vision and language data, also known as foundation models, have emerged as a universal backbone for tackling many recognition problems in computer vision (Jia et al., 2021; Radford et al., 2021), graphics (Ramesh et al., 2022), and robotics (Ahn et al., 2022). One of the key advantages of foundation models is zero-shot generalization, where the models use just a single textual description to recognize new visual categories with high accuracy. Because these large-scale models are powerful, they will continue to be used in many critical applications, where it is important to make them reliable. However, robustness to adversarial examples remains a challenge: an imperceptible pattern can be added to the image to cause recognition failures (Croce & Hein, 2020; Carlini & Wagner, 2017; Dong et al., 2018; Szegedy et al., 2013; Moosavi-Dezfooli et al., 2016), and an attack on a foundation model can consequently corrupt the downstream applications built on it.
Due to the importance of this problem, a large literature investigates adversarial robustness for neural networks. The most common approach to adversarial defense is adversarial training (Madry et al., 2018; Mao et al., 2019; Szegedy et al., 2013; Pang et al., 2020; Rice et al., 2020; Uesato et al., 2019), which augments the training set with mined adversarial examples that fool the image classifier. Adversarial training has been shown to improve robustness on the task that the mined examples come from, but it often comes at a cost to generalization (Stutz et al., 2019; Su et al., 2018; Pedraza et al., 2021). However, our world is vast and naturally open, and evaluating adversarial robustness only on the learned tasks is limiting. Can we achieve zero-shot transferability for adversarial robustness, even if the model has never been trained on the unknown tasks?
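To make the threat model above concrete, the following is a minimal sketch of how adversarial examples are typically mined under an L∞ budget, using projected gradient ascent on a cross-entropy loss. It is an illustration under strong simplifying assumptions, not the implementation used in this work: the "image encoder" is the identity, so pixels are compared directly with unit-norm class text embeddings, which lets the gradient be written in closed form. All function names (`logits`, `cross_entropy`, `predict`, `pgd_linf`) and the default step size, step count, and temperature are hypothetical choices made for this example.

```python
import math

def logits(x, text_feats, temperature=1.0):
    # Similarity between the image feature x and each class text embedding.
    # Assumption: the encoder is the identity, so x is used directly.
    return [sum(a * b for a, b in zip(x, t)) / temperature for t in text_feats]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(x, text_feats, label, temperature=1.0):
    # Cross-entropy of the true class, computed with log-sum-exp.
    z = logits(x, text_feats, temperature)
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z)) - z[label]

def predict(x, text_feats):
    # Zero-shot prediction: the class whose text embedding is most similar.
    z = logits(x, text_feats)
    return max(range(len(z)), key=z.__getitem__)

def pgd_linf(x, text_feats, label, eps=1 / 255, steps=10,
             step_size=None, temperature=1.0):
    # Projected gradient ascent on the loss inside an L-infinity ball of
    # radius eps around the clean input x (pixels assumed to lie in [0, 1]).
    if step_size is None:
        step_size = eps / 4  # illustrative default, not taken from the paper
    x_adv = list(x)
    for _ in range(steps):
        p = softmax(logits(x_adv, text_feats, temperature))
        for i in range(len(x)):
            # Closed-form gradient of the cross-entropy for linear logits.
            g = sum((p[j] - (1.0 if j == label else 0.0)) * text_feats[j][i]
                    for j in range(len(text_feats))) / temperature
            sign = (g > 0) - (g < 0)
            v = x_adv[i] + step_size * sign
            v = min(max(v, x[i] - eps), x[i] + eps)  # project into the eps-ball
            x_adv[i] = min(max(v, 0.0), 1.0)         # keep a valid pixel value
    return x_adv
```

Adversarial training then amounts to computing `pgd_linf` for each training example and minimizing the loss on the perturbed inputs rather than the clean ones. In this toy setting, any input whose top two similarities differ by less than 2ε can have its prediction flipped by a perturbation within the budget.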
In this paper, we study this important yet under-explored problem: zero-shot adversarial robustness of large-scale vision-language models. We start our investigation with the state-of-the-art CLIP model (Radford et al., 2021), which has been shown to be effective in zero-shot recognition tasks. We find that simply adding an imperceptible vector to the image (≤ 1/255) can subvert

* Equal contribution

