BOOSTING ADVERSARIAL TRANSFERABILITY USING DYNAMIC CUES

Abstract

The transferability of adversarial perturbations between image models has been extensively studied. In this setting, an attack is generated from a known surrogate, e.g., an ImageNet-trained model, and transferred to change the decision of an unknown (black-box) model trained on an image dataset. However, attacks generated from image models do not capture the dynamic nature of a moving object or a changing scene, due to a lack of temporal cues within image models. This leads to reduced transferability of adversarial attacks from representation-enriched image models such as supervised Vision Transformers (ViTs), self-supervised ViTs (e.g., DINO), and vision-language models (e.g., CLIP) to black-box video models. In this work, we induce dynamic cues within image models without sacrificing their original performance on images. To this end, we optimize temporal prompts through frozen image models to capture motion dynamics. Our temporal prompts are the result of a learnable transformation that allows optimizing for temporal gradients during an adversarial attack to fool the motion dynamics. Specifically, we introduce spatial (image) and temporal (video) cues within the same source model through task-specific prompts. Attacking such prompts maximizes adversarial transferability from image-to-video and image-to-image models using the attacks designed for image models. As an example, an iterative attack launched from the image model DeiT-B with temporal prompts reduces the generalization (top-1 % accuracy) of a video model by 35% on Kinetics-400. Our approach also improves adversarial transferability to image models by 9% on ImageNet w.r.t. the current state-of-the-art approach. Our attack results indicate that the attacker does not need specialized architectures, e.g., divided space-time attention, 3D convolutions, or multi-view convolutional networks, for different data modalities. Image models are effective surrogates to optimize an adversarial attack to fool black-box models in a changing environment over time. Code is available at https://bit.ly/3Xd9gRQ

1. INTRODUCTION

Deep learning models are vulnerable to imperceptible changes to their input images. It has been shown that, for a successful attack, an attacker no longer needs to know the attacked target model to compromise its decisions (Naseer et al., 2019; 2020; Nakka & Salzmann, 2021). Adversarial perturbations suitably optimized on a known source model (a surrogate) can fool an unknown target model (Kurakin et al., 2016). Such attacks are known as black-box attacks, since the attacker cannot access the deployed model or compute its adversarial gradient information. Adversarial attacks are continuously evolving, revealing new blind spots of deep neural networks. Adversarial transferability has been extensively studied in the image domain (Akhtar & Mian, 2018; Wang & He, 2021; Naseer et al., 2022b; Malik et al., 2022). Existing works demonstrate how adversarial patterns generalize to models with different architectures (Zhou et al., 2018) and even different data domains (Naseer et al., 2019). However, adversarial transferability between architecture families designed for different data modalities, e.g., from image models to video models, has not been actively explored. Since adversarial machine learning has received the most attention in the image domain, it is natural to ask whether image models can help attacks transfer better to video-domain models. However, image models lack the dynamic temporal cues that are essential for transfer to video models. We are motivated by the fact that, in a real-world setting, a scene is not static but mostly involves various dynamics, e.g., object motion, changing viewpoints, and illumination and background changes. Therefore, exploiting dynamic cues within an adversarial attack is essential to find blind spots of unknown target models.
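To make the black-box transfer setting concrete, the toy sketch below (the linear "models" and all weights are hypothetical stand-ins for real networks) crafts a single-step FGSM perturbation using only the known surrogate's gradient and shows it flipping the decision of a correlated but unseen target model:

```python
# Minimal sketch of a black-box transfer attack; illustrative only.
import numpy as np

def fgsm(x, w_surrogate, y, eps):
    """One FGSM step on a linear scorer f(x) = w . x with loss -y * f(x)."""
    grad = -y * w_surrogate            # gradient of the loss w.r.t. x
    return x + eps * np.sign(grad)     # step in the loss-increasing direction

w_src = np.array([1.0, 2.0, -1.0])     # known surrogate weights
w_tgt = np.array([0.9, 1.8, -1.2])     # unknown (black-box) but correlated target
x, y = np.array([0.5, 0.5, 0.5]), 1.0  # clean sample with true label +1

x_adv = fgsm(x, w_src, y, eps=0.6)
print(np.sign(w_tgt @ x), np.sign(w_tgt @ x_adv))  # 1.0 -1.0: the attack transfers
```

The perturbation is optimized purely on `w_src`, yet flips the target's decision because the two models share correlated decision boundaries; the paper's point is that this correlation breaks down across modalities (image to video) unless temporal cues are injected into the surrogate.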
For this purpose, we introduce the idea of encoding disentangled temporal representations within an image-based Vision Transformer (ViT) using dedicated temporal prompts while keeping the remaining network frozen. The temporal prompts learn the dynamic cues that are exploited during the attack for improved transferability from image-domain models. Specifically, we introduce the proposed temporal prompts to three types of image models with enriched representations, acquired via supervised (ViT (Dosovitskiy et al., 2020)), self-supervised (DINO (Caron et al., 2021)), or multi-modal (CLIP (Radford et al., 2021)) learning. Our approach offers the benefit that attacks do not need to rely on specialized networks designed for videos to achieve better adversarial transferability. For example, popular model designs for videos incorporate 3D convolutions, space-time attention, tube embeddings, or multi-view information to be robust against temporal changes (Bertasius et al., 2021; Arnab et al., 2021). Without access to such specific design choices, our approach demonstrates how an attacker can leverage regular image models augmented with temporal prompts to learn dynamic cues. Further, our approach can easily be extended to image datasets, where disentangled representations can be learned via tokens across a scale-space at varying image resolutions. In summary, the major contributions of this work include:

• We demonstrate how temporal prompts incorporated into frozen image-based models can help model dynamic cues, which can be exploited to fool deep networks designed for videos.

• Our approach for dynamic cue modeling via prompts does not affect the original spatial representations learned by the image-based models during pre-training, e.g., fully-supervised, self-supervised, and multi-modal models.

• The proposed method significantly improves transfer to black-box image and video models. Our approach is easily extendable to 3D datasets via learning cross-view prompts, and to image-only datasets via modeling the scale-space. Finally, it enables generalization from popular plain ViT models without requiring video-specific specialized designs.

We analyse the adversarial space of three types of image models (fully-supervised, self-supervised, and text-supervised). Using our approach, a pre-trained ImageNet ViT with approximately 6 million parameters exhibits 44.6 and 72.2 top-1 (%) accuracy on the Kinetics-400 and ImageNet validation sets, significantly improving adversarial transferability to video-domain models; a similar trend holds for the other image models. Our results indicate that the multi-modal CLIP can better adapt to video modalities than fully-supervised or self-supervised ViTs. However, CLIP adversaries are relatively less transferable than those of the fully-supervised ViT or the self-supervised DINO model. As an example, a momentum-based iterative attack launched from our DINO model reduces the performance of TimeSformer (Bertasius et al., 2021) from 75.6% to 35.8% on the Kinetics-400 dataset.

2. BOOSTING ADVERSARIAL TRANSFERABILITY USING DYNAMIC CUES

Adversarial transferability refers to manipulating a clean sample (an image, a video, or a 3D object rendered into multi-views) in a way that deceives an unknown (black-box) model. In the absence of an adversarial perturbation, the same black-box model predicts the correct label for the given image, video, or rendered view of a 3D object. A known surrogate model is usually used to optimize the adversarial patterns. Instead of training the surrogate from scratch on a given data distribution, an attacker can also adapt pre-trained image models to the new task. These image models can include supervised ImageNet models such as DeiT (Touvron et al., 2020), self-supervised ImageNet models like DINO (Caron et al., 2021), and text-supervised large-scale multi-modal models, e.g., CLIP (Radford et al., 2021). Adversarial attacks generated from such pre-trained models with enriched representations transfer better in the black-box setting for the image-to-image transfer task (Zhang et al., 2022; Naseer et al., 2022b; Aich et al., 2022). However, adversarial perturbations optimized from image models are not well suited to fool the motion dynamics learned by a video model (Sec. 3). To cater for this, we introduce temporal cues to model motion dynamics within adversarial attacks through pre-trained image models. Our approach, therefore, models both spatial and temporal cues within the same source model.
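A shape-level sketch of the prompting idea may help here (variable names and dimensions are illustrative, not the paper's implementation): learnable temporal prompt tokens are concatenated with the frozen backbone's per-frame patch tokens, so that only the prompts carry the trainable temporal signal while the spatial representations stay untouched.

```python
# Toy sketch of prepending learnable temporal prompts to frozen per-frame
# tokens of an image ViT; all names and sizes are hypothetical.
import numpy as np

D, T, N, P = 8, 4, 16, 2  # embed dim, frames, patch tokens per frame, prompts

# Per-frame patch embeddings from the frozen image backbone
# (zeros as a stand-in for real features; these receive no updates).
frozen_patch_tokens = np.zeros((T, N, D))

# The only trainable tensors: per-frame prompt tokens produced by a
# learnable transformation, intended to encode motion dynamics.
temporal_prompts = np.random.default_rng(0).normal(size=(T, P, D))

# Token sequence for frame t = [prompts_t ; patch_tokens_t]; the frozen
# transformer then processes P + N tokens per frame.
tokens = np.concatenate([temporal_prompts, frozen_patch_tokens], axis=1)
print(tokens.shape)  # (4, 18, 8)
```

During attack generation, adversarial gradients flow through the frozen backbone back to the input frames, while the prompts supply the temporal cues that the attack learns to fool.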

