DATA POISONING ATTACKS AGAINST MULTIMODAL ENCODERS

Abstract

Traditional machine learning (ML) models, e.g., image classifiers, usually rely on large-scale labeled datasets to achieve strong performance. However, such labeled datasets are often challenging and expensive to obtain. Moreover, the predefined categories limit the model's ability to generalize to other visual concepts, since additional labeled data is required. In contrast, newly emerged multimodal models, which combine visual and linguistic modalities, learn visual concepts from raw text. They offer a promising way to solve the above problems: easy-to-collect image-text pairs can serve as the training dataset, and raw text covers a nearly unlimited set of categories through its semantics. However, learning from a large-scale unlabeled dataset also exposes the model to the risk of poisoning attacks, whereby the adversary perturbs the model's training dataset to trigger malicious behaviors. Previous work mainly focuses on the visual modality. In this paper, we instead focus on answering two questions: (1) Is the linguistic modality also vulnerable to poisoning attacks? and (2) Which modality is most vulnerable? To answer these questions, we conduct three types of poisoning attacks against CLIP, the most representative multimodal contrastive learning framework. Extensive evaluations on different datasets and model architectures show that all three attacks perform well on the linguistic modality with only a relatively low poisoning rate and a limited number of training epochs. We also observe that the poisoning effect differs between modalities, i.e., with lower MinRank in the visual modality and with higher Hit@K when K is small in the linguistic modality. To mitigate the attacks, we propose both pre-training and post-training defenses. We empirically show that both defenses can significantly reduce the attack performance while preserving the model's utility.

1. INTRODUCTION

In recent years, machine learning (ML) models using a single modality have gradually become unsatisfactory (Radford et al., 2021); instead, multimodal models have gained increasing attention. Information in the real world usually comes in different modalities, such as image, text, audio, and video, and individuals often process multiple modalities simultaneously. Multimodal models are a group of ML models that use information from multiple modalities and thus more closely match the perception of individuals. Multimodal learning has shown great promise by achieving excellent performance in many applications, such as image classification (Radford et al., 2021), image captioning (Laina et al., 2019; Mokady et al., 2021), image generation (Patashnik et al., 2021; Li et al., 2022), video recognition (Akbari et al., 2021), and audio-visual speech recognition (Zhou et al., 2019).
Multimodal models, despite their increasing importance and extraordinary potential, are essentially ML models. Recent works have shown that ML models are vulnerable to a variety of security and privacy attacks, such as inference attacks (Shokri et al., 2017; Zhou et al., 2022), adversarial attacks (Ilyas et al., 2019; Xie et al., 2019), and poisoning attacks (Wang et al., 2022). Since multimodal models require a large amount of training data, that data is often noisy and can easily be poisoned. Existing work (Carlini & Terzis, 2021) has explored poisoning and backdoor attacks against multimodal models. However, it mainly focuses on poisoning image encoders so that they misbehave in downstream image classification tasks, i.e., it primarily targets the visual modality and neglects the linguistic modality. A systematic investigation of poisoning attacks against multimodal models is thus still missing.
This necessitates a comprehensive understanding of the risks posed by poisoning attacks: Is the linguistic modality also vulnerable to poisoning attacks? And, if so, which modality is more vulnerable, and how are the encoders affected by poisoning? To answer these questions, we perform a comprehensive study of poisoning attacks against multimodal models. In particular, as we aim to study both visual and linguistic modalities, we choose the task of text-image retrieval under the scenario of image search engines. Given a description (text) as input, an image search engine retrieves the images from a database whose embeddings are closest to the embedding of the input description, effectively bridging the visual and linguistic modalities. We present three types of poisoning attacks in different scenarios and extensively evaluate them on representative visual-linguistic representation models. The empirical results demonstrate that our proposed attacks achieve remarkable performance, indicating that such poisoning attacks pose a severe threat to multimodal models in both visual and linguistic modalities. Our evaluation also shows for the first time that the poisoning effects differ between the text encoder and the image encoder. Lastly, we explore possible defenses and empirically demonstrate their effectiveness. Abstractly, our contributions can be summarized as follows:
• To the best of our knowledge, we are the first to study poisoning attacks against multimodal models in the text-image retrieval scenario, where both visual and linguistic modalities are poisoned.
• We propose three types of poisoning attacks. All three adversaries can mount powerful poisoning attacks against contrastive learning-based multimodal models while preserving the model's utility on the original task.
• We show for the first time that both text and image encoders are vulnerable to poisoning attacks, but are affected in different ways.
• We show that our two proposed pre-training and post-training defenses can effectively mitigate the attacks while preserving the multimodal model's utility.
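To make the threat model concrete, the sketch below shows one simple way an adversary could poison an image-text training set: replacing the captions of a small fraction of pairs with adversary-chosen target texts, so that the encoders learn to align those images with the wrong descriptions. This is a minimal illustration of the "poisoning rate" notion, not the paper's exact attack construction; `poison_dataset` and `target_texts` are hypothetical names introduced here.

```python
import random

def poison_dataset(pairs, target_texts, poison_rate=0.01, seed=0):
    """Hypothetical poisoning sketch: replace the captions of a random
    poison_rate fraction of (image, text) pairs with adversary-chosen
    target texts. Not the paper's exact algorithm."""
    rng = random.Random(seed)
    poisoned = list(pairs)
    n_poison = int(len(poisoned) * poison_rate)
    for idx in rng.sample(range(len(poisoned)), n_poison):
        image, _ = poisoned[idx]
        poisoned[idx] = (image, rng.choice(target_texts))
    return poisoned, n_poison

# Toy usage: 1000 clean pairs at a 1% poisoning rate -> 10 poisoned pairs.
clean = [(f"img_{i}.jpg", f"caption {i}") for i in range(1000)]
poisoned, n = poison_dataset(clean, ["a photo of a dog"], poison_rate=0.01)
```

Even at such a low poisoning rate, the contrastive objective repeatedly pulls the poisoned images toward the target text's embedding during training, which is why the attacks in this paper succeed with few epochs.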

2.1. CONTRASTIVE LEARNING-BASED MULTIMODAL MODELS

Contrastive learning. Contrastive learning is a popular form of self-supervised learning which aims at learning a low-dimensional representation of data by projecting similar samples close to each other while pushing dissimilar samples apart. Earlier methods (Schroff et al., 2015) use a triplet loss to distinguish two similar samples from a third sample. More recent methods (Chen et al., 2020a; He et al., 2020; van den Oord et al., 2018; Giorgi et al., 2021), instead, distinguish similar samples from others by computing the contrastive loss across the entire batch, which requires a rather large batch size.
Contrastive learning-based multimodal models. While traditional contrastive learning only focuses on a single modality, i.e., the visual modality, contrastive learning-based multimodal models have gained increasing attention (Radford et al., 2021; Li et al., 2022; Mu et al., 2021). Most contrastive learning-based multimodal models focus on the visual-linguistic representation task, which aims at projecting text and images into a low-dimensional space so that the representations can be used as pretrained embeddings in downstream tasks. Contrastive learning-based multimodal models jointly train an image encoder E_img and a text encoder E_txt by aligning images and natural language via contrastive learning. Visual models, including image classifiers, widely use the image encoder to obtain pretrained image representations (Radford et al., 2021). The learned visual-linguistic representations also help image generation (Patashnik et al., 2021; Li et al., 2022), image captioning (Mokady et al., 2021), and even video-text retrieval tasks (Fang et al., 2021).
Image search engine. The task of an image search engine is also known as text-image retrieval, which is designed for scenarios where the queries are from one modality and the retrieval galleries are from another (Cao et al., 2022).
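The batch-wise contrastive objective described above can be sketched as follows for the CLIP-style multimodal case. This is a minimal NumPy illustration of the symmetric InfoNCE loss (real implementations use a deep learning framework with autograd); the matching image-text pairs on the diagonal are positives, and all other pairings in the batch serve as negatives.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric batch-wise contrastive (InfoNCE) loss, CLIP-style sketch.
    img_emb, txt_emb: (batch, dim) arrays where row i of each is a
    matching image-text pair."""
    # L2-normalize embeddings so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # i-th image matches i-th text

    def xent(l):
        # Numerically stable cross-entropy with the diagonal as targets.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return (xent(logits) + xent(logits.T)) / 2
```

With perfectly aligned pairs the loss approaches zero, while mismatched pairs in a batch drive it up, which is exactly the signal a poisoned image-text pair exploits during training.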
Given a text t, a visual-linguistic representation-based multimodal image search engine¹ returns the most relevant images from a large image base B by

    argmax_{i ∈ B} cos(E_img(i), E_txt(t)),

where cos(·, ·) denotes the cosine similarity between the image embedding E_img(i) and the text embedding E_txt(t).

¹ https://rom1504.github.io/clip-retrieval/.

