VERA VERTO: MULTIMODAL HIJACKING ATTACK

Abstract

The increasing cost of training machine learning (ML) models has led to the inclusion of new parties in the training pipeline, such as users who contribute training data and companies that provide computing resources. The involvement of these new parties in the ML training process has introduced new attack surfaces for an adversary to exploit. A recent attack in this domain is the model hijacking attack, whereby an adversary hijacks a victim model to implement their own, possibly malicious, hijacking task. However, the scope of the model hijacking attack has so far been limited to computer vision tasks. In this paper, we transform the model hijacking attack into a more general multimodal setting, where the hijacking and original tasks are performed on data of different modalities. Specifically, we focus on the setting where an adversary implements a natural language processing (NLP) hijacking task in an image classification model. To mount the attack, we propose a novel encoder-decoder based framework, the Blender, which relies on advanced image and language models. Experimental results show that our modal hijacking attack performs strongly in different settings. For instance, our attack achieves attack success rates of 94%, 94%, and 95% when using the Sogou News dataset to hijack MNIST, CIFAR-10, and STL-10 classifiers, respectively.

1. INTRODUCTION

Machine learning (ML) has become a critical component of various applications. At the same time, ML models have become increasingly expensive to train. Hence, the training of ML models has gradually become a joint process, e.g., new parties are included in the training of the model by providing either data or computational resources. However, the involvement of these new parties has created new attack surfaces against ML models, e.g., poisoning and backdoor attacks (Shafahi et al., 2018; Chen et al., 2017). Another recent attack in this domain is the model hijacking attack (Salem et al., 2022a), where the adversary is able to implement their own (hijacking) task in a target victim model. Concretely, the adversary poisons the training dataset of the target model with their own hijacking dataset. The hijacking dataset is first camouflaged, for stealthiness, to look similar to the target model's dataset. This attack induces two different risks. The first is accountability, which is the main threat of hijacking attacks: the adversary can frame the model owner for performing illegal or unethical tasks without the owner's knowledge. The second is parasitic computing, where the model owner pays the model maintenance costs while the adversary uses or offers the model for their own application or service for free. The same technique can also be used benignly to compress models, i.e., to train a single model for multiple tasks. However, the previous work limits the applicable domain to computer vision (CV) tasks, even though ML has achieved great success in many other domains, e.g., the translators DeepL and Google Translate and the face detectors on social media platforms. Moreover, the previous model hijacking attack requires the hijacking and original tasks to have the same modality.
Relaxing this assumption significantly increases the risks of the model hijacking attack, as the adversary can now target models of different modalities, i.e., more target models exist for the adversary to attack. In this paper, therefore, we transform the model hijacking attack into a more general multimodal setting, i.e., we implement a hijacking task from a completely different domain. More concretely, the adversary can implement an NLP hijacking task in a CV target model, as illustrated in Figure 1. For short, we refer to our attack as the modal hijacking attack. Our modal hijacking attack follows the same threat model as the model hijacking and poisoning attacks (Jagielski et al., 2018; Shafahi et al., 2018; Sun et al., 2018; Salem et al., 2022a), i.e., the adversary can only poison the training dataset and has no access to the target model's architecture or hyperparameters. Our modal hijacking attack induces the same risks as the model hijacking attack, i.e., accountability and parasitic computing. The threat to accountability is more severe in our setting, since multiple modalities are involved instead of a single one, and the threat of parasitic computing becomes significant when the Blender (introduced below) is reused. Different from the previous work, our modal hijacking attack expands the scope from a single type of data to a multimodal setting. The challenge this introduces is the required transformation from a discrete domain (NLP) to a continuous one (CV), between which there is a comprehension gap. This transformation is not trivial, as we demonstrate in Section 4.4. Besides, our modal hijacking attack is more general, i.e., the Blender can be applied to different hijacking and original datasets. Hence, hijacking target models becomes cheaper for the adversary, as shown in Section 4.4.
To the best of our knowledge, ours is the first work to combine different data modalities in a hijacking attack, which increases the capability and flexibility of hijacking attacks. Moreover, this work can encourage the exploration of hijacking attacks across further modalities (which can also serve the benign aim of model compression). To perform the modal hijacking attack, the adversary needs to transform the NLP-based hijacking dataset into the domain of the victim model's CV-based original dataset. To this end, we propose the Blender, an encoder-decoder based model that integrates a language model, namely BERT (Devlin et al., 2019), with multiple CNN models. The Blender is trained with two losses, a visual loss and a semantic loss, so that it fuses a hijacking input and an original input into an output that looks like the original input while retaining the features of the hijacking one (as shown in Figure 1). A successful modal hijacking attack should let the target victim model preserve its utility, i.e., keep the same performance on the original CV task, while performing the hijacking NLP task with high accuracy. To evaluate our modal hijacking attack, we use two NLP datasets (Zhang et al., 2015), namely Yelp Review (Yelp) and Sogou News (Sogou), and three CV datasets, i.e., MNIST (MNI), CIFAR-10 (CIF), and STL-10 (Coates et al., 2011). We extensively evaluate different setups of our modal hijacking attack. Our results show that our modal hijacking attack achieves strong performance with respect to both attack success rate and victim model utility. For instance, when victim models trained on MNIST, CIFAR-10, and STL-10 are hijacked with the Yelp (Sogou) dataset, our modal hijacking attack achieves attack success rates of 65% (94%), 68% (94%), and 65% (95%), respectively.
Meanwhile, the victim models' utility is not jeopardized, i.e., the hijacked models achieve a utility of 99% (99%), 93% (93%), and 93% (92%), respectively, a drop of less than 2% compared to clean models. Moreover, we show the generalizability of our modal hijacking attack by evaluating it in different setups, e.g., with different models used to construct the Blender and the target model. Finally, we explore two possible defenses against the modal hijacking attack. Due to space limitations, we discuss the limitations of our work in Appendix A.

Figure 1: An overview of the multimodal hijacking attack. First, the Blender takes a sample from the hijacking dataset (a text) and a sample from the container dataset (an image). It then fuses the two into an image that looks like the container image but carries the features of the hijacking text. The hijacked model performs both the original classification task (classifying the image as a horse) and the hijacking one, i.e., classifying the fused image as 4-star (the label of the hijacking input).
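The pipeline in Figure 1 can be exercised end to end on a toy scale. The following is an added illustration, not code from the paper: the authors' Blender is a learned encoder-decoder built on BERT and CNNs, whereas here a hand-built linear stand-in writes the bag-of-words features of a constructed review into a bounded perturbation of a container image (the bound plays the role of the visual loss, and the recoverability of the word features plays the role of the semantic loss). The poisoned training set is the clean images plus the fused images, each labelled with the original class that the adversary maps its hijacking label to; a softmax-regression stand-in for the victim CNN is trained on it; and the two metrics from the text, utility on clean test images and attack success rate on fresh fused images, are computed against a clean baseline. The datasets, vocabulary, pixel layout, label mapping, EPS and model are all constructed assumptions.

```python
"""Toy modal-hijacking pipeline (constructed example, not the paper's code)."""
import numpy as np

# --- constructed original task: 3 classes of 8x8 "images" ---------------
D, K, BLOCK = 64, 3, 8          # class k lights up pixels [8k, 8k+8)
AMP, NOISE = 0.4, 0.05          # class signal amplitude and pixel noise

def make_images(n_per_class, rng):
    xs, ys = [], []
    for k in range(K):
        x = rng.normal(0.0, NOISE, size=(n_per_class, D))
        x[:, k * BLOCK:(k + 1) * BLOCK] += AMP
        xs.append(x)
        ys.append(np.full(n_per_class, k))
    return np.vstack(xs), np.concatenate(ys)

# --- constructed hijacking task: 2-class review sentiment ---------------
VOCAB = ["great", "love", "tasty", "fresh",       # positive (label 0)
         "awful", "hate", "bland", "stale",       # negative (label 1)
         "the", "food", "was", "here"]            # filler
POS, NEG, FILL = [0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]

def make_texts(n, rng):
    """Each review: two polarity words plus two filler words, shuffled."""
    texts, labels = [], []
    for _ in range(n):
        j = int(rng.integers(2))
        idx = list(rng.choice(POS if j == 0 else NEG, 2, replace=False))
        idx += list(rng.choice(FILL, 2, replace=False))
        idx = [idx[i] for i in rng.permutation(4)]
        texts.append(" ".join(VOCAB[i] for i in idx))
        labels.append(j)
    return texts, np.array(labels)

def bag_of_words(text):
    v = np.zeros(len(VOCAB))
    for w in text.split():
        if w in VOCAB:
            v[VOCAB.index(w)] = 1.0
    return v

# --- stand-in Blender: embed text features as a bounded perturbation ----
EPS = 0.4                       # visual budget: max per-pixel change
WORD_PIX = np.arange(48, 60)    # 12 pixels that carry the 12 word features

def blend(container, text):
    fused = container.copy()
    fused[WORD_PIX] += EPS * bag_of_words(text)
    return fused

def visual_distance(fused, container):
    """L-infinity distance; the adversary's camouflage check."""
    return float(np.abs(fused - container).max())

# --- stand-in victim: softmax regression trained by gradient descent ----
def train_victim(X, y, steps=3000, lr=0.5):
    Xb = np.hstack([X, np.ones((len(X), 1))])      # bias column
    Y = np.eye(K)[y]
    W = np.zeros((D + 1, K))
    for _ in range(steps):
        logits = Xb @ W
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * Xb.T @ (p - Y) / len(X)
    return W

def predict(W, X):
    return (np.hstack([X, np.ones((len(X), 1))]) @ W).argmax(axis=1)

HIJACK_TO_ORIG = {0: 0, 1: 1}   # adversary's map: positive -> class 0, negative -> class 1

def run_attack(n_clean=100, n_poison=150, seed=0):
    rng = np.random.default_rng(seed)
    X_clean, y_clean = make_images(n_clean, rng)
    X_test, y_test = make_images(50, rng)
    # adversary fuses reviews into containers drawn from the clean pool
    texts, labels = make_texts(n_poison, rng)
    containers = X_clean[rng.integers(len(X_clean), size=n_poison)]
    X_fused = np.array([blend(c, t) for c, t in zip(containers, texts)])
    y_fused = np.array([HIJACK_TO_ORIG[j] for j in labels])
    max_dist = max(visual_distance(f, c) for f, c in zip(X_fused, containers))
    # victim trains on the poisoned set; a clean model is the baseline
    W = train_victim(np.vstack([X_clean, X_fused]), np.concatenate([y_clean, y_fused]))
    W_clean = train_victim(X_clean, y_clean)
    utility = float((predict(W, X_test) == y_test).mean())
    clean_utility = float((predict(W_clean, X_test) == y_test).mean())
    # attack success: fresh reviews in fresh containers, judged against the mapped label
    t2, l2 = make_texts(100, rng)
    c2 = X_test[rng.integers(len(X_test), size=100)]
    f2 = np.array([blend(c, t) for c, t in zip(c2, t2)])
    asr = float((predict(W, f2) == np.array([HIJACK_TO_ORIG[j] for j in l2])).mean())
    return {"utility": utility, "clean_utility": clean_utility,
            "attack_success_rate": asr, "max_perturbation": max_dist}

if __name__ == "__main__":
    print(run_attack())
```

Two points carry over to the real attack. The hijacking labels must be mapped onto original labels (so the hijacking task can have at most as many classes as the original one), and both metrics are needed: utility alone cannot reveal the hijack, because the poisoned model is judged on clean images where the word-carrying pixels hold only noise, while the attack success rate is measured on fused images the victim has never seen.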

