THE BIASED ARTIST: EXPLOITING CULTURAL BIASES VIA HOMOGLYPHS IN TEXT-GUIDED IMAGE GENERATION MODELS

Abstract

Text-guided image generation models, such as DALL-E 2 and Stable Diffusion, have recently received much attention from academia and the general public. Provided with textual descriptions, these models are capable of generating high-quality images depicting various concepts and styles. However, such models are trained on large amounts of public data and implicitly learn relationships from their training data that are not immediately apparent. We demonstrate that common multimodal models have implicitly learned cultural biases that can be triggered and injected into the generated images by simply replacing single characters in the textual description with visually similar non-Latin characters. These so-called homoglyph replacements enable malicious users or service providers to induce biases into the generated images and can even render the whole generation process useless. We practically demonstrate such attacks on DALL-E 2 and Stable Diffusion as text-guided image generation models and further show that CLIP behaves similarly. We also propose a novel homoglyph unlearning approach that updates pre-trained text encoders to remove their susceptibility to homoglyphs.

1. INTRODUCTION

Text-guided image generation models, such as DALL-E 2 (Ramesh et al., 2022), have recently received a lot of attention from both the scientific community and the general public. Provided with a simple textual description, these models are able to generate high-quality images from different domains and styles. However, since they are trained on large collections of public data from the internet, the learned knowledge and behavior of these models are still little understood and have already raised copyright concerns (Heikkilä, 2022). Previous research has mainly focused on improving the quality of the generated images and the models' understanding of complex text descriptions. See Sec. 2 for an overview of related work on text-to-image synthesis and possible attacks against such models. We are the first to investigate the behavior of text-guided image generation models when conditioned on descriptions that contain non-Latin Unicode characters. Replacing standard Latin characters with visually similar characters, so-called homoglyphs, allows a malicious party to disrupt image generation while making the manipulation hard for users to detect through visual inspection. We show the surprising effect that homoglyphs from non-Latin Unicode scripts not only influence the image generation but also implicitly induce biases associated with the cultural sphere of the corresponding languages. For example, DALL-E 2 generates images of Athens when provided with a generic description of a city in which a single character has been replaced with a Greek homoglyph. We found similar model behavior across various domains and Unicode scripts, where replacing as little as a single Latin character with a non-Latin homoglyph is sufficient to induce biases into the generated images. We present our methodology and experimental results in Sec. 3 and Sec. 4, respectively.
Throughout this work, we refer to the cultural and ethnic characteristics that specific language scripts induce into the generated images as cultural biases. Moreover, homoglyph replacements allow an attacker to hide entire objects from being depicted in the generated images. This results in misleading image generations and lowers the perceived model quality, as we practically demonstrate in Sec. 4. The behavior is not limited to DALL-E 2 but is also apparent in Stable Diffusion (Rombach et al., 2022) and CLIP (Radford et al., 2021).
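To make the attack surface concrete, the following is a minimal sketch of the kind of single-character homoglyph replacement described above. The prompt text, the helper function name, and the specific character mapping (Latin "o", U+006F, swapped for Greek omicron, U+03BF) are illustrative assumptions, not the paper's exact experimental setup.

```python
def replace_homoglyph(prompt: str, latin: str, homoglyph: str, count: int = 1) -> str:
    """Replace the first `count` occurrences of a Latin character in `prompt`
    with a visually similar non-Latin homoglyph (hypothetical helper)."""
    return prompt.replace(latin, homoglyph, count)

prompt = "A photo of a city"
# Greek small omicron (U+03BF) is visually indistinguishable from Latin "o" (U+006F).
poisoned = replace_homoglyph(prompt, "o", "\u03bf")

# The two prompts render (near-)identically but differ at the code-point level,
# so a text encoder tokenizes them differently.
print(prompt == poisoned)                                 # False
print([hex(ord(c)) for c in prompt if ord(c) > 0x7F])     # []
print([hex(ord(c)) for c in poisoned if ord(c) > 0x7F])   # ['0x3bf']
```

Because visual inspection cannot distinguish the two strings, a user copying a manipulated prompt has no obvious way to notice the replacement, which is exactly what makes the induced bias hard to detect.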

