THE BIASED ARTIST: EXPLOITING CULTURAL BIASES VIA HOMOGLYPHS IN TEXT-GUIDED IMAGE GENERA-TION MODELS

Abstract

Text-guided image generation models, such as DALL-E 2 and Stable Diffusion, have recently received much attention from academia and the general public. Provided with textual descriptions, these models are capable of generating highquality images depicting various concepts and styles. However, such models are trained on large amounts of public data and implicitly learn relationships from their training data that are not immediately apparent. We demonstrate that common multimodal models implicitly learned cultural biases that can be triggered and injected into the generated images by simply replacing single characters in the textual description with visually similar non-Latin characters. These so-called homoglyph replacements enable malicious users or service providers to induce biases into the generated images and even render the whole generation process useless. We practically illustrate such attacks on DALL-E 2 and Stable Diffusion as text-guided image generation models and further show that CLIP also behaves similarly. We further propose a novel homoglyph unlearning approach to update pre-trained text encoders to remove their susceptibility to homoglyphs.

1. INTRODUCTION

Text-guided image generation models, such as DALL-E 2 (Ramesh et al., 2022) , have recently received a lot of attention from both the scientific community and the general public. Provided with a simple textual description, these models are able to generate high-quality images from different domains and styles. Whereas trained on large collections of public data from the internet, the learned knowledge and behavior of these models are only little understood and have already raised copyright concerns (Heikkiläarchive, 2022) . Previous research mainly focused on improving the generated images' quality and the models' understanding of complex text descriptions. See Sec. 2 for an overview of related work in text-to-image synthesis and possible attacks against such models. We are the first to investigate the behavior of text-guided image generation models when conditioned on descriptions that contain non-Latin Unicode characters. Replacing standard Latin characters with visually similar characters, so-called homoglyphs, allows a malicious party to disrupt image generation while making the manipulations for users hard to detect through visual inspection. We show the surprising effect that homoglyphs from non-Latin Unicode scripts not only influence the image generation but also implicitly induce biases from the cultural circle of the corresponding languages. For example, DALL-E 2 generates images of Athens when provided with a generic description of a city and a single character replaced with a Greek homoglyph. We found similar model behavior across various domains and Unicode scripts, for which replacing as few as a single Latin character with any non-Latin character is sufficient to induce biases into the generated images. We present our methodology and experimental results in Sec. 3 and Sec. 4, respectively. We generally refer to the induced cultural and ethnic characteristics corresponding to specific language scripts into the generated images as cultural biases throughout this work. Moreover, homoglyph replacements allow an attacker to even hide complete objects from being depicted in the generated images. It results in misleading image generations and lowers the perceived model quality, as we practically demonstrate in Sec. 4. This behavior is not limited to DALL-E 2 but also apparent for Stable Diffusion (Rombach et al., 2022) and CLIP (Radford et al., 2021) . Our experimental results, which we discuss further in Sec. 5, raise the questions of how much we actually understand about the inner processes of multi-modal models trained on public data and how small differences in the textual description could already influence the image generation. As textguided image generation models become available to the general public and have a wide range of applications, such questions are essential for an informed use. In summary, we make the following contributions: • We are the first to show that text-guided image generation models and other models trained on text-image pairs are sensitive to character encodings and implicitly learn cultural biases related to different scripts during training. • We demonstrate that by a single homoglyph replacement, an attacker can bias the image generation with cultural influences and even render the whole process meaningless. • We introduce a novel homoglyph unlearning procedure to make already trained text encoders invariant to homoglyph manipulations. Disclaimer: This paper contains images representing various cultural biases that some readers may find offensive. We emphasize that the purpose of this work is to show that such biases are already present in text-guided image generation models and can be exploited through homoglyph manipulations. We do not intend to discriminate against people or cultures in any way.

2. BACKGROUND AND RELATED WORK

We first provide an overview over DALL-E 2 and related multimodal systems (Sec. 2.1), before outlining common privacy and security attacks against machine learning systems (Sec. 2.2) and introducing homoglyphs and the related homograph attacks (Sec. 2.3).

2.1. TEXT-TO-IMAGE SYNTHESIS

In the last few years, training models on multimodal data received much attention. Recent approaches for contrastive learning on image-text pairs are powered by a large amount of publicly available images and their corresponding descriptions. One of the most prominent representative is CLIP (Contrastive Language-Image Pre-training) (Radford et al., 2021) , which consists of an image encoder network and a text encoder network. Both parts are jointly trained on image-text pairs in a contrastive learning scheme to match corresponding pairings of images and texts. Trained on 400M samples collected from the internet, CLIP learned meaningful representations of images and their textual descriptions, and successfully performed a wide range of different tasks with zero-shot transfer and no additional training required. Based on the robust representations learned by CLIP, OpenAI recently introduced their textconditional image generation model DALL-E 2 (Ramesh et al., 2022) . To generate images from textual descriptions, the model first computes the CLIP text embedding and then uses a prior to generate possible CLIP image embeddings based on a given text embedding. Finally, an image decoder is used to generate images based on the image embeddings. The decoder is a diffusion model (Song & Ermon, 2020; Ho et al., 2020) trained to generate images conditioned on CLIP image embeddings. Since the decoder model only learns to generate images from CLIP image embeddings, the prior model is needed to map text embeddings to corresponding image embeddings. Fig. 1 gives an overview of the DALL-E 2 pipeline for text-guided image generation. Besides DALL-E 2, various other text-guided image generation models have been introduced over the last couple of months. These include its direct predecessors GLIDE (Nichol et al., 2022) and DALL-E (Ramesh et al., 2021 ), Google's Imagen (Saharia et al., 2022) and Parti (Yu et al., 2022) , or the publicly available Stable Diffusion (Rombach et al., 2022) . All models heavily rely on a large amount of public data collected from the internet. While machine learning models continue to achieve new astonishing achievements, the privacy and security of these models come into focus. We introduce common attacks in this context in the following section.

2.2. ATTACKS AGAINST MULTIMODAL MACHINE LEARNING MODELS

In recent years, numerous attacks on the privacy and security of machine learning models have been proposed. Most similar to our work are adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015) , where small tuned perturbations are added to images to manipulate a model's predictions. However, only a few attacks have been proposed so far in the context of multimodal systems. Carlini & Terzis (2022) showed that contrastive learning models trained on image-text pairs are similarly vulnerable to poisoning and backdoor attacks as standard image classification models. Hintersdorf et al. (2022) recently demonstrated that CLIP models memorize sensitive information about entities and leak private information about their training data. For text-guided generative models, Millière (2022) lately introduced two approaches for crafting adversarial examples. In the first approach, called macaronic prompting, new words are formed by concatenating words or morphemes from multiple languages. In the second approach, called evocative prompting, words are formed that have some logical similarity to existing words or categories. However, all crafted words and text prompts are written in standard Latin, and humans probably recognize such adversarial examples quickly. This fact distinguishes the proposed attacks from traditional adversarial examples, whose manipulations are hardly perceptible to humans. Not a direct attack, but still worth mentioning are the findings of Daras & Dimakis (2022) , which show that DALL-E 2 seems to learn a hidden vocabulary with its own made-up words for different concepts.

2.3. HOMOGLYPHS AND HOMOGRAPH ATTACKS

In this work, we investigate the influence of homoglyph replacements in image-text multimodal systems. We show for the first time that there are cultural biases captured by the models that can be easily triggered by homoglyphs, not visible to the naked eye, thus providing a new form of attack. This is important because people may feel discriminated against by cultural stereotypes portrayed, and an attacker could reduce image quality and even render systems unusable. Homoglyphs are letters and digits that look identical or very similar and are, therefore, hard for humans and optical character recognition systems to distinguish. For example, the capital letter O (U+004F) and the digit 0 (U+0030) in written text are usually easy to confuse. The extent to which homoglyphs are confused by people also depends a lot on the font used. Fig. 2 Whereas the visual differences between some characters as part of a sentence might be spotted by an attentive user, several characters look almost identical, especially in some fonts, and corresponding homoglyph attacks are difficult to spot by visual inspection. Unicode homoglyphs in this context describe characters from different scripts with separate hexadecimal identifiers assigned but similar visual appearances. For instance, the Latin character A (U+0041), the Greek character Α (U+0391), and the Cyrillic character А (U+0410) seem to be identical. Hence, completely different Unicode identifiers are associated with each character. While the three characters make no visual difference to humans, information systems interpret each character differently, which already led to Unicode security considerations (Davis & Suignard, 2014) . Words that contain one or more homoglyphs from a different Unicode script are called homographs. In a URL homograph attack, also known as script spoofing, the attacker registers domain names that look like regular domains but replaced characters with non-Latin homoglyphs. A user might then be tricked to open the manipulated domain, which redirects to a malicious website to deploy malware or phishing for private information (Gabrilovich & Gontmakher, 2002; Simpson et al., 2020) . Boucher et al. (2022) recently introduced character encoding-based attacks against natural language processing (NLP) systems. Exploiting homoglyphs and invisible Unicode symbols, together with reordering and deleting control characters, the authors added imperceptible adversarial perturbations to strings via differential evolution. The attack was then able to trick various commercial and opensource NLP systems. Related adversarial text attacks insert spelling mistakes (Belinkov & Bisk, 2018) or synonyms (Jin et al., 2020) to break a model. In contrast, we demonstrate that text-guided image generation models are also susceptible to homograph attacks and exhibits interesting behavior expressed by cultural biases in the images generated for single character homoglyph replacements.

3. METHODOLOGY

For our investigations, we assume the user or attacker has black-box access to a text-guided image generation or image-text multimodal system, for example, an API providing a text prompt. For DALL-E 2, the official API generates four image variations for every single prompt request. Throughout the paper, we always state all four generated images from a single request to avoid cherry-picking. Since OpenAI may update either the DALL-E 2 model or the API over time, we note that all results depicted are generated between August 18 and September 2, 2022.

Marcus et al. (2022) and

Conwell & Ullman (2022) provided qualitative analyses of DALL-E 2's generative capabilities on challenging text prompts. The authors empirically demonstrated that DALL-E 2 produces high-quality images for simple prompts but often fails to understand entity relationships, numbers, negations, and common sense. To avoid failures and additional biases due to complex queries as far as possible, we kept the image descriptions in our experiments simple and verified that the models can produce meaningful images for the Latin-only character prompts. We state additional software and hardware details in Appx. A. Fig. 1 illustrates our basic approach. Homoglyph replacements were realized by simply replacing Latin characters with homoglyphs from the Unicode scripts in the API's text prompt. For our later experiments with Stable Diffusion (Sec. 4.2) and CLIP (Appx. F.1), we directly changed the characters in the inputs strings. No further changes were made to the descriptions or the models. We experimented with various Unicode scripts for different languages. Whereas some of the scripts and their associated culture are commonly known, such as Greek or Cyrillic scrips, some might not. We provide a short overview of the different scripts we used throughout this work in Appx. B. Quantifying The Influence of Homoglyphs. To measure the cultural biases induced by homoglyphs, we built three prompt datasets that define concepts that are usually influenced by local culture, namely People, Buildings, and Misc. Each dataset consists of ten different prompts, each containing a placeholder # in the style of A small # town. We then generated multiple images x for each prompt, once with # removed and once replaced by the homoglyph for which we want to measure the bias, e.g., a Greek omicron. We denote the generated images based on Latin-only prompts as x and the ones with the homoglyph prompt as x. Next, we took a pre-trained CLIP model (OpenCLIP ViT-H/14) (Ilharco et al., 2021) and computed the similarity of each image with the corresponding prompt z but with # replaced by the culture associated with the script of a homoglypgh, e.g., Greek in the case of an omicron. Be S c (x, z) the cosine similarity between the CLIP embeddings of image x and text z. To quantify how a homoglyph biases the image content toward its corresponding culture, we compute its relative bias as Relative Bias = 1 N N i=1 S c ( xi , z i ) -S c (x i , z i ) S c (x i , z i ) • 100%. The relative bias measures the relative increase of similarity between the given prompt that explicitly names a culture and the generated images with and without a homoglyph associated with this culture present. A higher relative bias indicates a stronger connection between the homoglyph and the associated culture. It can be computed for a single prompt or a set of prompts. For our experiments, we created three distinct prompt sets describing people, buildings, and miscellaneous cultural concepts, respectively, each containing ten prompts. All prompts are stated in Appx. B.1. We then generated ten images for each prompt and computed the mean relative bias for all image-text combinations.

4. MANIPULATING THE IMAGE GENERATION WITH HOMOGLYPHS

We first investigate the behavior of DALL-E 2 for homoglyph replacements in the textual description to demonstrate the surprising fact that small manipulations induce cultural biases into the generated images (Sec. 4.1). After that, we extend our analysis to Stable Diffusion (Sec. 4.2). We further propose a novel homoglyph unlearning method to remove the homoglyph induced biases from the text encoder (Sec. 4.3). Additional experiments demonstrating obfuscating effects of homoglyphs and investigations on the behavior of CLIP models are stated in Appx. C.

4.1. INDUCING CULTURAL BIASES INTO THE IMAGE GENERATION PROCESS

We start by demonstrating the effects of homoglyph replacements in subordinate words for the image generation with DALL-E 2. We focus on characters of words that are not crucial to the overall image description, such as articles or prepositions. By this, we demonstrate the surprising effect that homoglyphs induce cultural biases and implicitly guide the image generation accordingly, without changing the underlying meaning of the prompt or explicitly defining any cultural attributes in the query. As a first example, the top row of Fig. 3 illustrates the biases induced by replacing the article A in the prompt A city in bright sunshine with a Greek or Scandinavian homoglyph. Whereas the unmodified prompt with Latin-only characters generates city images of various architectural styles, inserting the Greek capital Α (U+0391) leads to images of Greek cities. Two of the resulting images even show Athens with a view on Mount Lycabettus. For the Scandinavian character Å (U+00C5), the images depicts cities located by the water, characteristic of Scandinavian cities, e.g., Trondheim or Bergen. The bottom row of Fig. 3 further depicts results for generating images of an actress. As one can see, replacing a single character with characters from non-Latin scripts already leads to the generation of women from the cultural background associated with the corresponding homoglyph. All generated images depict cultural stereotypes, such as clothing, jewelry, and even art style. Only the top-left image in Fig. 3e seems to be out of line, but this is also regularly happening to standard DALL-E 2 queries, so we expect it to be a simple outlier. Biasing the model with single homoglyph replacements can be used in various contexts. We provide a larger set of examples in Appx. D. Throughout our experiments, we found that emojis also induce biases into the image generation process. For example, adding an emoji to the prompt A photo of a X person replacing X changes the mood of the depicted person. We state various results for these queries in Appx. D.6. Whereas homoglyph replacements in most cases influence the image generation, the occurring biases can not always be clearly described and assigned to a specific culture and are sometimes subtle, such as color schemes or environments. We found the induced biases to be stronger and clearer from homoglyphs that relate to a more narrowly defined culture, such as characters from the Greek script, which are limited to the Greek language spoken in Greece and Cyprus. In contrast, the character ọ (U+1ECD) is not only part of the Vietnamese language, but also of various African languages, which is why this specific homoglyph does not always induce a Vietnamese bias, as can be seen for images generated by Stable Diffusion in Appx. E where the images rather reflect an African bias.

4.2. HOMOGLYPH SUSCEPTIBILITY IS INHERENT IN TEXT-GUIDED GENERATIVE MODELS

Inspired by our findings, we wanted to investigate if other text-guided generative models show similar behavior. For this, we repeated the experiments from the previous sections on the publicly available Stable Diffusion (Rombach et al., 2022) v1.4 model. We generally found the quality and level of details of images generated to be inferior and less consistent compared to DALL-E 2. However, the results are still of decent quality and follow the specified text descriptions. Overall, we found that Stable Diffusion behaves comparably to homoglyph replacements and integrates similar cultural biases into its generated images. However, the induced biases are sometimes less clearly depicted compared to the DALL-E 2 results. As we discuss in Sec. 5.1, the underlying training data did not cover all tested homoglyphs, which is why the model cannot link them to a specific culture. To quantify the influence of individual homoglyphs, we measured the relative bias for five homoglyphs from different scripts. Fig. 5 visualizes our results. The relative bias allows quantifying the various biasing effects of homoglyphs from different scripts. our results show that homoglyphs from the Korean or African scripts have a large impact on the generation of humans, whereas Greek homoglyphs mainly bias the architectural style of buildings. The Arabic homoglyph, however, has similarly influence on all three concepts. Generally, as expected, the biases are larger if the associated cultural concept differs more strongly from standard western culture. To further investigate if the cultural differences actually originate from single homoglyphs, we calculated the text embeddings for single non-Latin characters and determined the difference with the Latin counterpart. We then added this difference to the text embedding of a standard text prompt without homoglyph replacements. Fig. 4 demonstrates the general principle and some results for inducing Korean and Arabic biases. As can be seen, the added embedding shift induced similar cultural biases as our previous experiments with homoglyph replacements in the text prompt. However, we note that the induced biases are not always strong. Since the induced cultural biases in the Stable Diffusion images are usually weaker compared to our DALL-E 2 results, we expect the effect of embedding manipulations also to be stronger for DALL-E 2.

4.3. HOMOGLYPH UNLEARNING

We also propose a novel approach to remove the biasing behavior of specific homoglyphs from a model. Inspired by backdoor attacks on pre-trained text encoders (Struppek et al., 2022) , we introduce a fine-tuning procedure to let a text encoder learn to map a set of homoglyphs H to their Latin counterpart and, consequently, make the model invariant to these characters. Our method starts with two text encoder models, E and E inv , both initialized with the same pre-trained encoder weights used by the generative model. We then only update the weights of E inv to make it invariant against certain homoglyphs. We do this by minimizing it with the following loss function: L unearning = 1 |B| z∈B S c (E(z), E inv (z)) + h∈H 1 |B h | z ′ ∈B h S c (E(z ′ ), E inv (z ′ ⊕ h)) . (2) Here, S C denotes the cosine similarity computed between the text embeddings computed by the encoders. During each step, we sample prompt batches B and B h from a suitable English text dataset. The first term ensures that the computed embeddings of E inv are close to the embeddings of E and that the general utility of the encoder is preserved. The second term trains the encoder to map embeddings for prompts with homoglyph h ∈ H to the corresponding embedding for its Latin counterpart. The operator ⊕ denotes the replacement of a single pre-defined Latin character in a prompt by its corresponding homoglyph, e.g., a Latin o by a Greek ο. We tested our approach on the CLIP text encoder in combination with Stable Diffusion and prompts from the LAION-Aesthetics v2 6.5+ dataset (Schuhmann et al., 2022) . We state our hyperparameters in Appx. A. Our results, depicted in Fig. 5 by light bars, clearly demonstrate that our homoglyph unlearning approach successfully removes most of their biasing behavior. Only in some cases, e.g., for the African ọ , some small biasing behavior is still present. However, compared to the standard text encoder, the relative bias has still been reduced drastically. At the same time, the image quality measured as the FID score (Heusel et al., 2017) stays roughly the same -17.05 for the standard encoder and 17.22 for the encoder after the homoglyph unlearning. We further state qualitative results in Appx. E.4 comparing the generated images before and after homoglyph unlearning.

5. DISCUSSION, LIMITATIONS AND CONCLUSION

We now discuss our findings in more detail, including the reason for the models' behavior (Sec. 5.1) and how this behavior might be mitigated (Sec. 5.2). We also raise the question if this model property is compellingly bad (Sec. 5.3) and note some limitations of our analysis (Sec. 5.4).

5.1. REASONS FOR MODEL BEHAVIOR

All investigated models have been trained on large amounts of filtered public data with English captions. However, some of the data must contain captions or some text parts with non-Latin Unicode characters included, allowing the models to grasp the specifics and differences of various Unicode scripts and their associated cultures. Whereas the training data for DALL-E 2 and CLIP are not publicly available, Stable Diffusion is trained on the 600M image-text pairs from the LAION-Aesthetics V2 5+ dataset (Schuhmann et al., 2022) with English captions. We found numerous image descriptions that contain non-Latin characters in this dataset, as we state in Appx. G. Most common non-Latin characters are from the Cyrillic, Greek and extended Latin scripts. We also found some letters, e.g., from the Oriya or Osmanya scripts, are not contained in the dataset, which explains why these characters did not induce any clear cultural bias, as apparent in Appx. E.1. Furthermore, it seems that a small share of image-text samples with a specific character is sufficient to learn the cultural bias. For example, the Hangul character ㅇ (U+3147) was only present in 64 samples, about 0.00001% of all 600M samples in the dataset. Still, the model induced Korean biases into the generated images when provided with the homoglyph in a prompt. Given that DALL-E 2 generates biases for a larger number of scripts and homoglyphs, we conclude that there was even a greater variety of homoglyphs present in its training data. Transformer-based language models are generally known to learn fine details of languages given a sufficient capacity and amount of training data (Radford et al., 2018) . Therefore, it is not surprising that the text encoders in multimodal systems learn the details of various cultural influences from a few training samples. Diffusion models, on the other hand, offer strong mode coverage and sample diversity (Nichol & Dhariwal, 2021) and enable the generation of images representing the various cultural biases. We hypothesize that the interplay of both components plays a crucial role in explaining the culturally influenced behavior of the investigated models in the presence of homoglyphs.

5.2. HOW TO AVOID HOMOGRAPH ATTACKS

In addition to our homoglyph unlearning approach, we envision two additional basic approaches to avoid model biases by homoglyph replacements. The first solution is a technical Unicode script detector built into any model API. For example, the API could scan each text input for any non-Latin characters or non-Arabic numbers and either block the queries or inform the user about the presence of such symbols. In addition, detected queries containing homoglyphs can be purified by simple character mappings to valid characters. As a second solution, we propose to train the models' text encoders on multilingual data. As we demonstrated in Appx. F.1, the multilingual M-CLIP model is considerably more robust against homoglyph manipulations and probably computes more stable text embeddings to guide the image generation. Due to the high computational costs required for evaluation of this approach, we leave it to future research. A naive approach would simply filter out any image-text pair from the training data that contains any non-Latin characters. However, neural networks are known to be susceptible to out-of-distribution data and produce unforeseeable predictions if characters are present in a query that contains unseen characters. The consequences of such models without any defense might be more severe than for current models trained on data with at least some non-Latin characters.

5.3. IT'S NOT A BUG, IT'S A FEATURE?

Manipulating text prompts with homoglyphs can be seen as a security attack against text-guided image generation models. As we have shown in Appx. C.1, content clearly described in the prompt could be hidden by replacing single characters with homoglyphs. Such attacks might render the model useless and degrade the generated images' quality. However, by carefully placing homoglyphs in subordinate words or as additional inputs, a user can also intentionally induce cultural biases. DALL-E 2 already contains a natural bias towards western culture, particularly from the US. We demonstrated that the injection of single characters from non-Latin Unicode script reduces western biases and induces features from the corresponding culture. It is at least debatable if universal purpose models such as DALL-E 2 should provide users only with western biases, independent of a user's own cultural background. Homoglyphs allow users to tailor the image generation process to their own culture by inserting single characters from their local script into their prompt. Consequently, depending on someone's perspective, the investigated model behaviors to non-Latin characters can also be interpreted as a model feature instead of a vulnerability or undesired behavior.

5.4. LIMITATIONS

It remains to be empirically investigated whether other text-conditional image generation models, such as Google's Parti and Imagen, exhibit similar biases for non-Latin characters. Unfortunately, these models were not publicly available at the time of this writing. We, therefore, leave the investigation of a broader range of models to future work. However, we assume that these models exhibit similar behavior, as they were all trained to capture image semantics from textual descriptions found on the Internet, which likely almost always contain non-Latin characters. Another limitation of our work is the lack of access to the training data from DALL-E 2 to identify possible patterns that might lead to the current cultural biases for homoglyph replacements.

5.5. CONCLUSION

Our work has shown that text-guided image generation models are sensitive to the character encoding of the provided text description, and that replacing characters with visually hard-to-spot non-Latin homoglyphs substantially alters the behavior of a model. This opens the door to two types of potentially malicious manipulations: By changing just a single character in the text description, we could either induce cultural distortions or obfuscate certain concepts in the generated images. We found that this behavior is not limited to DALL-E 2, but also occurs in other models trained on image-text pairs from the Internet. Our results show that models trained on large datasets from public sources capture the specifics of individual cultures implicitly from a relatively small number of samples containing non-Latin characters. With our homoglyph unlearning procedure, we proposed a novel solution to make the text encoders of the generative models invariant to homoglyphs without the need for full retraining. We further hope that our research will contribute to a better understanding of multimodel models and support the development of more robust systems.

ETHICS STATEMENT

Our paper shows that multimodal models trained on large datasets of image-text pairs collected from the internet implicitly learn cultural stereotypes and biases associated with different Unicode scripts. By replacing single characters in a text prompt of such models, the user or an attacker could control the image generation of text-guided synthesis models to include cultural biases in the resulting images. Some people might feel discriminated against by such images, depicting cultural stereotypes. Moreover, an attacker could reduce the image quality and even render the systems useless, resulting in worse perceived model or service quality, which might also have economic disadvantages for the service or model providers. Homoglyph replacements can be done by any user simply by copying and pasting possible homoglyphs into the text prompts of any available model (if no homoglyph detector is present). This requires little to no technical or linguistic knowledge and skills. While this might make our paper and its insights appear harmful, we believe that the benefits of informing the community about this phenomenon outweighs the potential harms. Knowing the presented model behavior when homoglyphs are present in text prompts allows the model providers to react and implement possible defense mechanisms, such as homoglyph detectors or simple character mappings. We hope that our results also inspire research to develop more robust models and training procedures. By making our findings publicly available as soon as possible, we give the community more time to discuss the societal implications and develop solutions or improvements. Furthermore, understanding the behavior of deep neural networks, particularly of large language and diffusion models, is still an ongoing subject of research. We believe that the results of our work make an important contribution in this direction, showing that the models indeed learn subtleties from their training data, which also raises the issue of privacy. If a relatively small amount of training data with non-Latin characters is already sufficient for the models to capture the features of different cultures, one must ask what information is also learned from such models and potentially leaked. In summary, we believe that the positive effects on machine learning research outweigh possible negative impacts and lead to the development of more robust and reliable systems.

REPRODUCIBILITY STATEMENT

To allow the reproduction of our results, we provide our source code in the supplementary material of this submission. We further refer to Appx. A for hardware and software details. Also, we provide all hyperparameters used during our experiments, together with the model sources. The text prompts used to create the various figures are always stated within the captions. The image sources used to investigate the robustness of CLIP are stated in Appx. F.4. We will also make our source code publicly available on GitHub after acceptance.

A HARD-AND SOFTWARE DETAILS

Our experiments were performed on NVIDIA DGX machines running NVIDIA DGX Server Version 5.1.0 and Ubuntu 20.04.5 LTS. The machines have 1.6TB of RAM and contain Tesla V100-SXM3-32GB-H GPUs and Intel Xeon Platinum 8174 CPUs. We further relied on CUDA 11.6, Python 3.8.13, and PyTorch 1.12.0 with Torchvision 0.13.0 for our experiments. Our DALL-E 2 experiments were performed with the web API available at https://labs. openai.com/ for selected users. We used Stable Diffusion v1.4, available at https:// huggingface.co/CompVis/stable-diffusion-v1-4 to generate the corresponding samples. It was used with an K-LMS scheduler with the parameters β start = 0.00085, β end = 0.012 and a linear scaled scheduler. The generated images have a size of 512 × 512 and were generated with 100 inference steps and a guidance scale of 7.5. We set the seed to 1 for Stable Diffusion experiments and then generated four images for each prompt. For our CLIP experiments, we relied on publicly available models. For OpenAI's CLIP (Radford et al., 2021) , we used the models provided by https://github.com/openai/CLIP, namely the ViT-B/32 and RN50 models. For OpenCLIP, the models are available at https://github. com/mlfoundations/open_clip. We also used the ViT-B-32, trained on the LAION2Ben dataset. In case of the multilingual CLIP (M-CLIP), we used the XLM-R Large Vit-B/32 text encoder in combination with OpenAI's OpenAI ViT-B/32 vision encoder. M-CLIP is available at https://github.com/FreddeFrallan/Multilingual-CLIP. To compute the relative bias, we generated ten images for each of the ten prompts in the prompt datasets, stated in Tab. 1, once with and once without a homoglyph inserted. We used the same seed for each set of images to avoid influences due to randomness. To compute the image and text embeddings, we used the ViT-H/14 OpenCLIP model, which promises the best zero-shot performance and is also trained on a different dataset than OpenAI's CLIP model used in Stable Diffusion and DALL-E 2. To perform the homoglyph unlearning procedure, we optimized the pretrained CLIP text encoder for 500 steps on samples from the LAION-Aesthetics v2 6.5+ dataset (Schuhmann et al., 2022) . During each step, we sampled 128 to compute the first term of the loss function on Latin-only prompts. For each of the five homoglyphs stated in Fig. 5 , we sampled an additional 128 prompts and replaced a single Latin character with its homoglyph counterpart. We optimized the encoder with the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 10 -4 and multiplied it after 400 steps by the factor 0.1. We further kept β = (0.9, 0.999) and ϵ = 10 -8 at their default values. We measured the FID score using the clean FID approach (Parmar et al., 2022) . We sampled 10,000 prompts from the MS-COCO 2014 (Lin et al., 2014) validation split and generated images with Stable Diffusion with the parameters stated at the beginning of this section. As real samples, we used all 40,504 images from the MS-COCO validation split.

B UNICODE SCRIPTS

Unicode supports a wide range of different scripts. See https://www.unicode.org/ standard/supported.html for an overview of all supported scripts. The current Unicode Standard 15.0.0 (Unicode Consortium, 2022) supports 149,186 characters from 161 scripts. Each script contains a set of characters and written signs of one or more writing systems. We now provide a short and non-exhaustive overview of the Unicode scripts we used throughout this work. Basic Latin: Ranges from U+0000 to U+007F and contains 128 standard letters and digits used by Western languages, such as English, as well as basic punctuation and symbols. This paper, for example, is mostly encoded in the characters from this Script. Together with 18 additional blocks comprising supplements and extensions, the Latin script currently contains 1,475 characters. Latin Supplements and Extensions: This group comprises multiple additional character variations of the basic Latin script. The Latin-1 Supplement ranges from U+0080 to U+00FF and offers characters for the French, German and Scandinavian alphabets, amongst others. The Latin Extended-A (U+0100 to U+017F) and Extended-B (U+0180 to U+024F) scripts contain further Latin character variations for, e.g., Afrikaans, Hungarian, Turkish and the Romanian writing systems. The Latin 

(c)

The Great Wall in fog with the l replaced by Arabic ‫ا‬ (U+0627). We start by demonstrating that homoglyph replacements allow a malicious user or adversary to prevent DALL-E 2 from generating concepts clearly described in the text prompt. For this, the attacker replaces characters in a concept's name with homoglyphs or generally any non-Latin character and forces the generative model to ignore the object in the generated image. In practice, the manipulated strings might be shared over the internet to reduce the perceived model quality if queried with these image descriptions. A malicious browser plugin or website script might further automatically manipulate inserted prompt texts by replacing characters with inconspicuous homoglyphs to render the whole model useless without users noticing. Such attacks might have serious impacts on the profitability of commercial services based on text-guided models. We visualized some examples in Fig. 6 . Taking the prompt Big Ben in the rain as an example, we simply replaced the B with the Cyrillic В (U+0412) to remove the Big Ben from the generated images. DALL-E 2 generated various images of rainy locations but without any trace of the Big Ben or similar monuments. In most cases, replacing a single character is already sufficient for hiding the concept in the generated images. However, when concept description consists of multiple words, we needed to insert a homoglyph in each of the concept's words, see Fig. 6a . We experimented with numerous queries and Unicode scripts and found that any object could successfully be hidden by using homoglyphs from any non-Latin script. To verify that DALL-E 2 is generally capable to generate the stated concepts, we show the created images for benign prompts in Appx. D.7.

D ADDITIONAL DALL-E 2 RESULTS

In this section, we state additional results of our experiments on DALL-E 2. We queried the model using the online API and the prompts specified in the image captions. Note that we queried the model only once with each text description, and stated all four generated results. (e) N'Ko ߋ (U+07CB). (f) Hangul (Korean) ㅇ (U+3147). (g) Arabic ‫ه‬ (U+0647). (h) Armenian օ (U+0585). (i) Bengali ০ (U+09E6). (e) N'Ko ߋ (U+07CB). (f) Hangul (Korean) ㅇ (U+3147). (g) Arabic ‫ه‬ (U+0647). (h) Armenian օ (U+0585). (i) Bengali ০ (U+09E6). Figure 18 : Non Cherry-picked examples of biases induced into the embedding space. We queried Stable Diffusion with the following prompt: A man sitting at a table. We further computed the difference between the text embeddings of the stated non-Latin homoglyphs and the Latin character o (U+006F). We then added the difference to the prompt embedding to induce the cultural biases. See Fig. 4 for an overview of the approach. F CLIP EXPERIMENTS To check if other multimodal models are similarly susceptible to homoglyphs, we extended our experiments to CLIP. We investigated the official CLIP models (Radford et al., 2021) based on a vision transformer (ViT-B/32) (Dosovitskiy et al., 2021) and a smaller ResNet-50 (RN50) (He et al., 2016) , both trained on a non-public dataset with 400M text-image pairs. We further used a public reimplementation of CLIP (OpenCLIP) (Ilharco et al., 2021) trained on the LAION-2B-en dataset, a subset of the public LAION-5B dataset (Schuhmann et al., 2022) with English text. Additionally, we tested a multilingual CLIP model (M-CLIP) (Carlsson et al., 2022) with the text encoder based on XLM-RoBERTa (Conneau et al., 2020) , pre-trained on samples from 100 languages. We tested how the insertion of homoglyphs influences the cosine similarity scores computed between the image and text embeddings. Fig. 19 shows the mean similarity scores for querying the models with 12 real photos of Athens in combination with the correct description A photo of Athens and multiple variants of the obviously wrong description A photo of New York. As visualized, all four models assign clearly higher similarity scores for the correct image description (left) compared to the false description (right). However, by introducing Greek homoglyphs into the obviously wrong description, we could increase the predicted similarity scores markedly for all models except M-CLIP. We point out that homoglyph replacements are then most effective when replacing parts of the false object, in the depicted case New York. More importantly, our results demonstrate that the text encoder trained on multilingual data is much more robust against homograph attacks than the English-only models. We have also investigated the other direction, trying to obfuscate objects that are correctly described in the text. Similar to our DALL-E 2 results in Appx. C.1, we were able to reduce the similarity scores markedly if we replaced characters directly in the object names. However, M-CLIP again behaves quite robustly against such manipulations. We state visualization for our results in Appx. F. 



https://unicode-table.com/en/ provides an overview of available Unicode characters and scripts.



Figure 1: Example of our homoglyph manipulations in the DALL-E 2 pipeline. The model has been queried with the prompt A photo of an actress. Using only Latin characters in the text, the model generates pictures of women with different ethnic backgrounds. However, replacing the o in the text with the visually barely distinguishable single Korean (Hangul script) or Indian (Oriya script) homoglyphs leads to the generation of biased images that clearly reflect cultural biases.

Figure 2: Examples of Unicode homoglyphs from different scripts with their Unicode number and description.Whereas the visual differences between some characters as part of a sentence might be spotted by an attentive user, several characters look almost identical, especially in some fonts, and corresponding homoglyph attacks are difficult to spot by visual inspection.

(a) Latin A (U+0041). (b) Greek Α (U+0391). (c) Scandinavian Å (U+00C5).

(d) Oriya (Indian) ୦ (U+0B66).

(e) Osmanya 𐒆 (U+10486).(f) Vietnamese ọ (U+1ECD).

Figure 3: Non-cherry-picked examples of induced biases with a single homoglyph replacement. We queried DALL-E 2 with the following prompts: A city in bright sunshine (top row) and A photo of an actress (bottom row). Each query differs only by the underlined characters A and o, respectively. Most inserted homoglyphs are visually barely distinguishable and are rendered very similarly to their Latin counterparts in the DALL-E 2 API. See Appx. D for additional results.

Figure 4: By adding the embedding difference between Latin and non-Latin characters to the CLIP text embedding of Stable Diffusion, cultural biases can also be induced without directly changing the textual description. See Appx. E.5 for additional results.

Figure 5: Relative bias measured for five homoglyphs from different scripts and cultures. The dark bars state the results for the standard text encoder. The light bars indicate the results after performing our homoglyph unlearning procedure on a single encoder for the five homoglyphs. As is clearly evident, our homoglyph unlearning approach successfully removes most of the biasing behavior.

frozen New York with N and Y replaced by Greek Ν (U+039D) and Υ (U+03A5). (b) Big Ben in the rain with the B replaced by Cyrillic В (U+0412).

Figure 6: Non-cherry-picked examples of hidden objects with homoglyph replacements. We queried DALL-E 2 three times and replaced the underlined characters with different homoglyphs. The inserted characters are visually barely distinguishable and are rendered almost identically to their Latin counterparts in the DALL-E 2 API.

D.1 A CITY IN BRIGHT SUNSHINE (a) Latin A (U+0041). (b) Greek Α (U+0391). (c) Scandinavian Å (U+00C5). (d) Cyrillic А (U+0410). (e) Canadian ᗅ (U+15C5). (f) Cherokee Ꭺ (U+13AA).(g) Latin À (U+00C0).(h) Lisu ꓮ (U+A4EE).(i) Mathematical 𝖠 (U+1D5A0).

Figure 7: Non Cherry-picked examples of induced biases with a single homoglyph replacement. We queried DALL-E 2 with the following prompt: A city in bright sunshine. Each query differs only by the first character A.

Figure 8: Non Cherry-picked examples of induced biases with a single homoglyph replacement. We queried DALL-E 2 with the following prompt: A photo of an actress. Each query differs only by the o in of.

Figure 9: Non Cherry-picked examples of induced biases with a single homoglyph replacement. We queried DALL-E 2 with the following prompt: Delicious food on a table.Each query differs only by a single character in the word Delicious replaced by the stated homoglyphs.

Figure 9: Non Cherry-picked examples of induced biases with a single homoglyph replacement. We queried DALL-E 2 with the following prompt: Delicious food on a table.Each query differs only by a single character in the word Delicious replaced by the stated homoglyphs.

Figure 11: Query A photo of a flag. Non Cherry-picked examples of flag images with a homoglyph replacement generated by the DALL-E 2 model. Whereas the model has a learned bias towards generating USA flag, inducing a Greek bias leads to the generation of Greek flags. Surprisingly, using a Cyrillic bias enables the model to generate a wide range of different flags from European countries.

Figure 12: Non Cherry-picked examples of induced biases with a single emoji added. We queried DALL-E 2 with the following prompt: A photo of a X person. Each query differs by adding an emoji at the X position.

Figure 13: Non Cherry-picked examples of hidden objects with a homoglyph replacement generated by the DALL-E 2 model. The left column depicts images without manipulated prompts.

Figure 15: Non Cherry-picked examples of induced biases with a single homoglyph replacement. We queried Stable Diffusion with the following prompt: Delicious food on a table.Each query differs only by a single character in the word Delicious replaced by the stated homoglyphs.

Figure 15: Non Cherry-picked examples of induced biases with a single homoglyph replacement. We queried Stable Diffusion with the following prompt: Delicious food on a table.Each query differs only by a single character in the word Delicious replaced by the stated homoglyphs.

Figure 16: Non Cherry-picked examples of hidden objects with a homoglyph replacement generated by the Stable Diffusion model. The left column depicts images without manipulated prompts.

Figure21: Mean similarity scores for querying various CLIP models with real photos and different captions. Underlined characters in the queries have been replaced with homoglyphs to force the models' similarity scores to increase.

depicts some homoglyph examples.

Mean similarity scores for querying various CLIP models with real photos of Athens and different captions. The underlined characters in the prompts were replaced with Greek homoglyphs.

Queries contain 12 real photos of Athens and different captions. Underlined characters in the queries have been replaced with Greek homoglyphs. Queries contain 12 real photos of the Saint Basil's Cathedral (Cathedral of Vasily the Blessed), and different captions. Underlined characters in the queries have been replaced with Cyrillic homoglyphs. Queries contain 12 real photos of the Banh baos (Vietnamese dumplings), and different captions. Underlined characters in the queries have been replaced with Vietnamese homoglyphs (Latin Extended script).Figure 20: Mean similarity scores for querying various CLIP models with real photos and different captions. Underlined characters in the queries have been replaced with homoglyphs to force the models' similarity scores to increase. Queries contain 12 real photos of Athens and different captions. Underlined characters in the queries have been replaced with Cyrillic homoglyphs. Queries contain 12 real photos of the Saint Basil's Cathedral (Cathedral of Vasily the Blessed), and different captions. Underlined characters in the queries have been replaced with Greek homoglyphs. Queries contain 12 real photos of the Banh baos (Vietnamese dumplings), and different captions. Underlined characters in the queries have been replaced with Lisu homoglyphs.

Real photos of Banh baos to test CLIP for biases. See Fig.20cand Fig.21cfor the experimental results.

annex

Extended Additional scripts (U+1E00 to U+1EFF) primarily contains characters used in the Vietnamese alphabet. Some letters are also shared with other languages, e.g., ọ (U+1ECD) is not only used in Vietnamese but also in the International African alphabet. Further examples of the extended Latin script from the paper are the characters á (U+00E1) and Å (U+00C5).Arabic Script: Ranges from U+0600 to U+06FF and contains 256 characters of the Arabic script. The script is used for the Arabic, Kurdish and Persian languages, amongst others. In the paper, we used the characters ‫ه‬ (U+0647) and ‫ا‬ (U+0627).Armenian Script: Ranges from U+0530 to U+058F and contains 91 characters for the Armenian language, spoken in Armenia. In the paper, we used the character օ (U+0585).Bengali Script: Ranges from U+0980 to U+09FF and contains 96 characters for the Bengali, Santali, and other Indo-Aryan languages, mainly spoken in South Asia. Bengali is spoken in Bengal, a geopolitical and cultural region in South Asia, covering Bangladesh and West India. In the paper, we used the character ০ (U+09E6).Unified Canadian Aboriginal Syllabics: Ranges from U+1400 to U+167F and contains 640 syllabic characters used in various Indigenous Canadian languages. These comprise the Algonquian, Inuit and Athabaskan languages. In the paper, we used the character ᗅ (U+15C5).Cherokee Script: Ranges from U+13A0 to U+13FF and contains 92 syllabic characters used for the Cherokee language. Cherokee is an Iroquoian language spoken by the Cherokee tribes, which are indigenous people in the Southeastern Woodlands of the United States. In the paper, we used the character Ꭺ (U+13AA).Cyrillic Script: Ranges from U+0400 to U+04FF and contains 256 characters from the Cyrillic writing system, also known as Slavonic script or Slavic script, and offers various national variations of the standard Cyrillic script. It is used in different countries and languages, such as Russian, Bulgarian, Serbian or Ukrainian. Throughout this work, we only used letters from the standard Russian alphabet. Examples from the paper are the characters В (U+0412) and е (U+0435).Devanagari Script: Ranges from U+0900 to U+097F and contains 128 characters for Hindi, which is spoken in India, and other Indo-Aryan languages. In the paper, we used। (U+0964).Greek and Coptic Script: Ranges from U+0370 to U+03FF and contains 135 standard letters and letter variants, digits and other symbols of the Greek language. It also contains glyphs of the Coptic language, which belongs to the family of the Egyptian language. In this work, we only used standard Greek letters used in the modern Greek language. Examples from the paper are the characters Α (U+0391) and ο (U+03BF).Hangul Jamo Script: Ranges from U+1100 to U+11FF and contains 256 positional forms of the Hangul consonant and vowel clusters. It is the official writing system for the Korean language, spoken in South and North Korea. In the paper, we used the character ㅇ (U+3147).Lisu Script: Ranges from U+A4D0 to U+A4FF and contains 48 characters used to write the Lisu language. Lisu is spoken in Southwestern China, Myanmar and Thailand, as well as a small part of India. In the paper, we used the character ꓲ (U+A4F2) and ꓮ (U+A4EE).N'Ko script: Ranges from U+07C0 to U+07FF and contains 62 characters. It is a used to write the Mande languages, spoken in West African countries, for example, Burkina Faso, Mali, Senegal, the Gambia, Guinea, Guinea-Bissau, Sierra Leone, Liberia and Ivory Coast. In the paper, we used the character ߋ (U+07CB).Oriya Script: Ranges from U+0B00 to U+0B7F and contains 91 characters. It is mainly used to write the Orya (Odia), Khondi and Santali languages, some of the many official languages of India. The languages are primarily spoken in the Indian state of Odisha and other states in eastern India. In the paper, we used the character ୦ (U+0B66).Osmanaya Script: Ranges from U+10480 to U+104AF and contains 40 characters. It is used to write the Somali language and is an official language in Somalia, Somaliland and Ethiopia, all localized in the Horn of Africa (East Africa). In the paper, we used the character 𐒆 (U+10486).Tibetan Script: Ranges from U+0F00 to U+0FFF and contains 211 characters. The characters are primarily used to write Tibetan and Dzongkha, which is spoken in Bhutan. In the paper, we used the character ། (U+0F0D).Emojis: Emojis in Unicode are not contained in a single script or block but spread across 24 blocks. Unicode 14.0 contained 1,404 emoji characters. For example, the Emoticons block ranging from U+1F600 to UF1F64F contains 80 emojis of face representations. Examples from the paper are (U+1F603) and (U+1F973).

B.1 RELATIVE BIAS DATASET PROMPTS

Table 1 : Datasets used to measure the relative bias of homoglyphs for three domains. The # marks the positions where the homoglyphs are injected. For the Latin prompts, the # were simply removed. 

E STABLE DIFFUSION RESULTS

In this section, we state additional results of our experiments on Stable Diffusion. We queried the model using a Python script and the prompts specified in the image captions. Note that we queried the model four times with each text description, and stated all four generated results. (e) N'Ko ߋ (U+07CB).(f) Hangul (Korean) ㅇ (U+3147).(g) Arabic ‫ه‬ (U+0647).(h) Armenian օ (U+0585).(i) Bengali ০ (U+09E6). 

G LAION-AESTHETICS V2 5+ ANALYSIS

We analyzed the text descriptions contained in the LAION-Aesthetics V2 5+ subset of the LAION-2B-en dataset with 600M samples that have an estimated aesthetics scores ≥ 5.0. The LAION-2B-en dataset itself is a subset of the public LAION-5B dataset Schuhmann et al. (2022) with English text.For each non-Latin homoglyph (including the extended Latin script), we counted the number of text strings that contain the character. 

