DATA POISONING ATTACKS AGAINST MULTIMODAL ENCODERS

Abstract

Traditional machine learning (ML) models, e.g., image classifiers, usually rely on large-scale labeled datasets to achieve strong performance. However, such labeled datasets are often challenging and expensive to obtain. Moreover, the predefined categories limit the model's ability to generalize to other visual concepts, since covering them requires additional labeled data. In contrast, the newly emerged multimodal models, which contain both visual and linguistic modalities, learn the concepts of images from raw text. This is a promising way to solve the above problems: the training dataset can be constructed from easy-to-collect image-text pairs, and the raw texts cover almost unlimited categories according to their semantics. However, learning from a large-scale unlabeled dataset also exposes the model to the risk of poisoning attacks, whereby the adversary perturbs the model's training dataset to trigger malicious behaviors. Previous work mainly focuses on the visual modality. In this paper, we instead answer two questions: (1) Is the linguistic modality also vulnerable to poisoning attacks? and (2) Which modality is more vulnerable? To answer them, we conduct three types of poisoning attacks against CLIP, the most representative multimodal contrastive learning framework. Extensive evaluations on different datasets and model architectures show that all three attacks perform well on the linguistic modality with only a relatively low poisoning rate and limited epochs. We also observe that the poisoning effect differs between modalities, i.e., lower MinRank in the visual modality and higher Hit@K (for small K) in the linguistic modality. To mitigate the attacks, we propose both pre-training and post-training defenses, and empirically show that both can significantly reduce the attack performance while preserving the model's utility.
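The two retrieval metrics named above can be made concrete with a short sketch. Given an array of similarity scores between a query and a pool of candidates, MinRank is the best (1-based) rank achieved by any target candidate, and Hit@K indicates whether a target appears among the top-K retrieved candidates. The implementation below is an illustrative assumption about how these metrics are computed, not code from the paper:

```python
import numpy as np

def min_rank(sim, target_idx):
    # Best (1-based) rank of any target candidate, where a higher
    # similarity score means a better rank.
    order = np.argsort(-sim)  # candidate indices, descending similarity
    ranks = np.nonzero(np.isin(order, target_idx))[0] + 1
    return int(ranks.min())

def hit_at_k(sim, target_idx, k):
    # 1 if any target candidate appears in the top-k results, else 0.
    top_k = np.argsort(-sim)[:k]
    return int(bool(np.intersect1d(top_k, target_idx).size))

# Toy example: 5 candidates, adversarial targets at indices 1 and 3.
sim = np.array([0.2, 0.9, 0.1, 0.5, 0.3])
print(min_rank(sim, [1, 3]))      # → 1 (index 1 has the highest score)
print(hit_at_k(sim, [1, 3], 2))   # → 1 (index 1 is in the top-2)
```

Under this reading, a successful poisoning attack drives MinRank toward 1 and Hit@K toward 1 for the adversary's chosen targets.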

1. INTRODUCTION

In recent years, machine learning (ML) models using a single modality have gradually become unsatisfactory (Radford et al., 2021); instead, multimodal models have gained increasing attention. Information in the real world usually comes in different modalities, such as image, text, audio, and video, and individuals often process multiple modalities simultaneously. Multimodal models are ML models that use information from multiple modalities and thus more closely match the perception of individuals. Multimodal learning has shown great promise, achieving excellent performance in many applications, such as image classification (Radford et al., 2021), image captioning (Laina et al., 2019; Mokady et al., 2021), image generation (Patashnik et al., 2021; Li et al., 2022), video recognition (Akbari et al., 2021), and audio-visual speech recognition (Zhou et al., 2019). Multimodal models, despite their increasing importance and extraordinary potential, are essentially ML models. Recent works have shown that ML models are vulnerable to a variety of security and privacy attacks, such as inference attacks (Shokri et al., 2017; Zhou et al., 2022), adversarial attacks (Ilyas et al., 2019; Xie et al., 2019), and poisoning attacks (Wang et al., 2022). Since multimodal models require a large amount of training data, which is often noisy and collected with little curation, the data can be easily poisoned. So far, existing work (Carlini & Terzis, 2021) has explored poisoning and backdoor attacks against multimodal models. However, it mainly focuses on poisoning image encoders so that they misbehave in downstream image classification tasks, i.e., it primarily targets the visual modality and neglects the linguistic modality. To gain a deeper
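The threat model sketched above is simple to instantiate: since multimodal training data consists of loosely curated image-text pairs, an adversary who controls even a small fraction of the pairs can, for example, replace the captions of a chosen concept with adversarial text. The helper below is a hypothetical, minimal sketch of such caption-swap poisoning, not the specific attacks studied in this paper; the function name, parameters, and toy data are all illustrative assumptions:

```python
import random

def poison_captions(dataset, target_label, adversarial_caption, rate, seed=0):
    """Replace the captions of roughly a fraction `rate` of the image-text
    pairs whose caption mentions `target_label` with an adversary-chosen
    caption; all other pairs are left untouched. (Hypothetical sketch.)"""
    rng = random.Random(seed)
    poisoned = []
    for image, caption in dataset:
        if target_label in caption and rng.random() < rate:
            caption = adversarial_caption
        poisoned.append((image, caption))
    return poisoned

# Toy dataset of (image-id, caption) pairs.
data = [("img0", "a photo of a dog"),
        ("img1", "a photo of a cat"),
        ("img2", "a dog playing fetch")]
poisoned = poison_captions(data, "dog", "a photo of a boat", rate=1.0)
# With rate=1.0, both "dog" captions are replaced; the "cat" pair is kept.
```

A model trained contrastively on such pairs would be pushed to align images of the target concept with the adversarial text, which is exactly the kind of cross-modal misbehavior the attacks in this paper measure.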

