THE BIASED ARTIST: EXPLOITING CULTURAL BIASES VIA HOMOGLYPHS IN TEXT-GUIDED IMAGE GENERATION MODELS

Abstract

Text-guided image generation models, such as DALL-E 2 and Stable Diffusion, have recently received much attention from academia and the general public. Provided with textual descriptions, these models can generate high-quality images depicting various concepts and styles. However, such models are trained on large amounts of public data and implicitly learn relationships from their training data that are not immediately apparent. We demonstrate that common multimodal models have implicitly learned cultural biases that can be triggered and injected into the generated images by simply replacing single characters in the textual description with visually similar non-Latin characters. These so-called homoglyph replacements enable malicious users or service providers to induce biases into the generated images and even render the whole generation process useless. We practically illustrate such attacks on DALL-E 2 and Stable Diffusion as text-guided image generation models and further show that CLIP behaves similarly. Finally, we propose a novel homoglyph unlearning approach that updates pre-trained text encoders to remove their susceptibility to homoglyphs.

1. INTRODUCTION

Text-guided image generation models, such as DALL-E 2 (Ramesh et al., 2022), have recently received a lot of attention from both the scientific community and the general public. Provided with a simple textual description, these models are able to generate high-quality images from different domains and styles. Although trained on large collections of public data from the internet, the learned knowledge and behavior of these models are still poorly understood and have already raised copyright concerns (Heikkilä, 2022). Previous research has mainly focused on improving the quality of the generated images and the models' understanding of complex text descriptions. See Sec. 2 for an overview of related work in text-to-image synthesis and possible attacks against such models.

We are the first to investigate the behavior of text-guided image generation models when conditioned on descriptions that contain non-Latin Unicode characters. Replacing standard Latin characters with visually similar characters, so-called homoglyphs, allows a malicious party to disrupt image generation while making the manipulations hard for users to detect through visual inspection. We show the surprising effect that homoglyphs from non-Latin Unicode scripts not only influence the image generation but also implicitly induce biases from the cultural circle of the corresponding languages. For example, DALL-E 2 generates images of Athens when provided with a generic description of a city in which a single character is replaced with a Greek homoglyph. We found similar model behavior across various domains and Unicode scripts, for which replacing a single Latin character with its non-Latin counterpart is sufficient to induce biases into the generated images. We present our methodology and experimental results in Sec. 3 and Sec. 4, respectively.
Throughout this work, we refer to the cultural and ethnic characteristics that specific language scripts induce into the generated images as cultural biases. Moreover, homoglyph replacements allow an attacker to hide entire objects from the generated images, resulting in misleading generations and lowering the perceived model quality, as we practically demonstrate in Sec. 4. This behavior is not limited to DALL-E 2 but is also apparent in Stable Diffusion (Rombach et al., 2022) and CLIP (Radford et al., 2021). Our experimental results, which we discuss further in Sec. 5, raise the questions of how much we actually understand about the inner processes of multimodal models trained on public data and how small differences in the textual description can already influence the image generation. As text-guided image generation models become available to the general public and have a wide range of applications, such questions are essential for informed use.

In summary, we make the following contributions:

• We are the first to show that text-guided image generation models and other models trained on text-image pairs are sensitive to character encodings and implicitly learn cultural biases related to different scripts during training.

• We demonstrate that with a single homoglyph replacement, an attacker can bias the image generation with cultural influences and even render the whole process meaningless.

• We introduce a novel homoglyph unlearning procedure to make already trained text encoders invariant to homoglyph manipulations.

Disclaimer: This paper contains images representing various cultural biases that some readers may find offensive. We emphasize that the purpose of this work is to show that such biases are already present in text-guided image generation models and can be exploited through homoglyph manipulations. We do not intend to discriminate against any people or cultures.

2. BACKGROUND AND RELATED WORK

We first provide an overview of DALL-E 2 and related multimodal systems (Sec. 2.1), before outlining common privacy and security attacks against machine learning systems (Sec. 2.2) and introducing homoglyphs and the related homograph attacks (Sec. 2.3).

2.1. TEXT-TO-IMAGE SYNTHESIS

In the last few years, training models on multimodal data has received much attention. Recent approaches to contrastive learning on image-text pairs are powered by the large amounts of publicly available images and their corresponding descriptions. One of the most prominent representatives is CLIP (Contrastive Language-Image Pre-training) (Radford et al., 2021), which consists of an image encoder network and a text encoder network. Both parts are jointly trained on image-text pairs in a contrastive learning scheme to match corresponding pairings of images and texts. Trained on 400M samples collected from the internet, CLIP learns meaningful representations of images and their textual descriptions and successfully performs a wide range of tasks via zero-shot transfer, with no additional training required.
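The matching step at the heart of this scheme can be illustrated with a toy sketch. The "encoders" below are random stand-ins for CLIP's actual networks, and the dimensions are arbitrary; only the cosine-similarity matching structure is the point:

```python
import numpy as np

# Toy sketch of CLIP-style image-text matching. Embeddings are random
# placeholders; in CLIP they come from trained image/text encoders.
rng = np.random.default_rng(0)

def normalize(x: np.ndarray) -> np.ndarray:
    """Project embeddings onto the unit sphere, so dot products
    become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

image_embeddings = normalize(rng.normal(size=(2, 8)))  # 2 images, 8-dim
text_embeddings = normalize(rng.normal(size=(3, 8)))   # 3 candidate captions

# Pairwise cosine similarities; contrastive training pushes matching
# image-text pairs toward high values and mismatched pairs toward low ones.
logits = image_embeddings @ text_embeddings.T          # shape (2, 3)

# Zero-shot classification: pick the best-matching caption per image.
best_caption = logits.argmax(axis=1)
```

In the real model, a learned temperature scales these logits before a symmetric cross-entropy loss is applied over both the image and text axes; the sketch omits that training detail.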



Figure 1: Example of our homoglyph manipulations in the DALL-E 2 pipeline. The model has been queried with the prompt A photo of an actress. Using only Latin characters in the text, the model generates pictures of women with different ethnic backgrounds. However, replacing the o in the text with a visually barely distinguishable Korean (Hangul script) or Indian (Oriya script) homoglyph leads to the generation of images that clearly reflect cultural biases.

