LANGUAGE MODELLING WITH PIXELS

Abstract

Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which suffers from neither of these issues. PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. PIXEL is trained to reconstruct the pixels of masked patches instead of predicting a distribution over tokens. We pretrain the 86M parameter PIXEL model on the same English data as BERT and evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts. We find that PIXEL substantially outperforms BERT on syntactic and semantic processing tasks on scripts that are not found in the pretraining data, but PIXEL is slightly weaker than BERT when working with Latin scripts. Furthermore, we find that PIXEL is more robust than BERT to orthographic attacks and linguistic code-switching, further confirming the benefits of modelling language with pixels.

1. INTRODUCTION

Natural language processing has rapidly progressed in recent years due to a combination of self-supervised representation learning, i.e. pretrained language models (PLMs) like BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), and XLM-R (Conneau et al., 2020); large unlabelled datasets such as C4 (Raffel et al., 2020) and The Pile (Gao et al., 2020); and large-scale computing power (Hirschberg & Manning, 2015). Despite this progress, these models cover only a fraction of the world's languages, with large inequalities in performance (Pires et al., 2019; Lauscher et al., 2020), and the majority of languages fall behind English (Joshi et al., 2020b; Bugliarello et al., 2022). Even within English, these models struggle when tasked with processing noisy inputs (Sun et al., 2020; Eger & Benz, 2020). In this paper, we show how to effectively support thousands of written languages in a single model while remaining robust to variation caused by character-level noise.

Language models typically support a finite vocabulary of categorical inputs, e.g. characters, subwords, or even words, and much effort has been devoted to vocabulary construction (Wan, 2022). On one end of the spectrum, a vocabulary over words has three problems: (i) out-of-vocabulary words, e.g. "doxing", cannot be encoded because they lack an entry in a closed vocabulary; (ii) the word embedding layer contains too many parameters; and, relatedly, (iii) the normalising constant for the softmax activation in the output layer is too expensive to compute. On the other end of the spectrum, vocabularies over bytes or characters are much smaller, but they lead to increased sequence lengths (Keren et al., 2022). In practice, most current models operate over inputs smaller than words but larger than characters: subword units (Sennrich et al., 2016; Kudo, 2018). Subwords avoid extremely large embedding and output layers while supporting open-vocabulary processing.
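The parameter-count side of this trade-off is easy to make concrete. The following sketch (with illustrative vocabulary sizes, not figures from the paper) shows how the embedding matrix grows with vocabulary granularity; a tied output layer pays the same cost again, plus the softmax normalisation over every vocabulary entry at each prediction step.

```python
# Illustrative sketch of the vocabulary bottleneck: parameter counts for
# the input embedding matrix at different vocabulary granularities.
# Vocabulary sizes below are assumptions chosen for illustration.

def embedding_params(vocab_size: int, hidden_dim: int = 768) -> int:
    """Parameters in a vocab_size x hidden_dim embedding matrix."""
    return vocab_size * hidden_dim

word_vocab = embedding_params(500_000)   # word-level, effectively open-ended
subword_vocab = embedding_params(30_000) # e.g. a BERT-sized WordPiece vocab
byte_vocab = embedding_params(256)       # byte-level

print(f"word:    {word_vocab:,}")     # 384,000,000 parameters
print(f"subword: {subword_vocab:,}")  # 23,040,000 parameters
print(f"byte:    {byte_vocab:,}")     # 196,608 parameters
```

The byte-level matrix is tiny, but as noted above, that saving is paid for in much longer input sequences.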
While this is a practical solution in a monolingual context and for some languages like English, handling many languages with a variety of scripts forces either a very large vocabulary or a trade-off over what is represented within a fixed number of subwords (see §5). Taken together, a language model with a finite vocabulary faces a bottleneck in two locations: at the level of encoding the inputs and at the level of estimating the probability distribution over the vocabulary. We call this the vocabulary bottleneck. A language model that aims to handle thousands of languages needs to deal with this problem.

We propose to rethink language modelling as a visual recognition task, removing the need for a finite vocabulary. Our proposal is inspired by Salesky et al. (2021), who showed how to train a machine translation model with "visual text representations" in the encoder instead of subwords. Our Pixel-based Encoder of Language (PIXEL) is built on the Masked Autoencoding Visual Transformer (ViT-MAE; He et al., 2022). ViT-MAE is a Transformer-based encoder-decoder trained to reconstruct the pixels in masked image patches. PIXEL does not have a vocabulary embedding layer; instead, text is rendered as a sequence of fixed-sized patches, which are processed using a Vision Transformer encoder (Dosovitskiy et al., 2021). PIXEL also avoids an expensive output layer because it reconstructs the pixels of the masked patches. In effect, PIXEL provides a solution to the vocabulary bottleneck without the prohibitively long sequences of character-based models. Given our computational resources, PIXEL is pretrained on the same data as BERT.
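The pretraining objective described above (reconstruct pixels of masked patches rather than predict tokens) can be sketched as follows. This is a minimal NumPy illustration under our own simplifying assumptions, not the authors' implementation: patches are flattened pixel vectors, and the loss is mean squared error computed over the masked patches only.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_patch_mse(patches: np.ndarray,
                     reconstruction: np.ndarray,
                     mask_ratio: float = 0.25) -> float:
    """MSE over a random subset of masked patches.

    patches, reconstruction: (num_patches, pixels_per_patch) arrays.
    mask_ratio: fraction of patches hidden from the encoder (illustrative).
    """
    n = patches.shape[0]
    masked = rng.choice(n, size=int(n * mask_ratio), replace=False)
    diff = patches[masked] - reconstruction[masked]
    # No softmax over a vocabulary: the target is the pixel values themselves.
    return float(np.mean(diff ** 2))

# Toy example: 529 patches of 16x16 RGB pixels, flattened to 768 values each.
patches = rng.random((529, 16 * 16 * 3))
noisy_recon = patches + rng.normal(0.0, 0.1, patches.shape)
loss = masked_patch_mse(patches, noisy_recon)
```

A perfect reconstruction drives this loss to zero, mirroring how a token-level model would drive cross-entropy down, but with no vocabulary anywhere in the objective.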
This means that it has encountered only ∼0.05% non-English text (Blevins & Zettlemoyer, 2022). We evaluate PIXEL on a range of syntactic and semantic tasks in 32 typologically diverse languages across 14 scripts, showing that it can rapidly adapt to new languages and unseen scripts. PIXEL is also evaluated on its ability to handle noisy text caused by orthographic attacks, where pixel-based encoding is a clear improvement over subword-based vocabularies. In lexical code-switching experiments, PIXEL performs on par with BERT and sometimes outperforms the multilingually pretrained MBERT. PIXEL is a new type of language model that can theoretically support any language that can be typeset by a modern computer. We make the implementation, the pretrained model including intermediate training checkpoints, and the fine-tuned models freely available for the community.

2. APPROACH

The Pixel-based Encoder of Language, PIXEL, consists of three major components: a text renderer, which draws text as an image; an encoder, which encodes the unmasked regions of the image; and a decoder, which reconstructs the masked regions at the pixel level. Figure 1 provides an illustration.
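The three components compose into a simple pipeline. The skeleton below is a hypothetical sketch of that composition; the class and method names are illustrative and do not come from the released code.

```python
# Hypothetical skeleton of the three-component pipeline described above.
# Names (PixelModel, pretrain_step) are illustrative, not from the paper's code.
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class PixelModel:
    renderer: Callable[[str], np.ndarray]        # text -> H x W x C image
    encoder: Callable[[np.ndarray], np.ndarray]  # unmasked regions -> latents
    decoder: Callable[[np.ndarray], np.ndarray]  # latents -> pixel reconstructions

    def pretrain_step(self, text: str) -> np.ndarray:
        """Render text, encode it, and reconstruct pixels (masking omitted)."""
        image = self.renderer(text)
        latents = self.encoder(image)
        return self.decoder(latents)

# Stub components, just to show the data flow end to end.
model = PixelModel(
    renderer=lambda s: np.zeros((16, 8464, 3), dtype=np.float32),
    encoder=lambda img: img,   # identity stand-in for the ViT encoder
    decoder=lambda z: z,       # identity stand-in for the lightweight decoder
)
out = model.pretrain_step("My cat enjoys eating warm oatmeal.")
```

At finetuning time the decoder slot would be replaced by a task-specific classification head, matching the right-hand side of Figure 1.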

2.1. TEXT RENDERER

The key component of PIXEL is a text renderer that takes one or more pieces of text and renders them onto a blank RGB image x ∈ R^(H×W×C). We set height H = 16 and width W = 8464 and choose
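Given such a rendered canvas, the image must be cut into fixed-sized patches before the encoder sees it. The sketch below shows that patch-extraction step with NumPy, assuming (for illustration) square 16×16 patches, which with H = 16 and W = 8464 yields a sequence of 529 patches.

```python
import numpy as np

# Patch extraction from a rendered text canvas. The 16x16 patch size is an
# assumption for illustration; H = 16 and W = 8464 follow the text above.
H, W, C = 16, 8464, 3
PATCH = 16

def to_patches(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, C) canvas into flattened (num_patches, patch*patch*C) rows."""
    h, w, c = image.shape
    assert h == patch and w % patch == 0, "canvas must be one patch tall"
    n = w // patch
    # (h, n, patch, c) -> (n, h, patch, c) -> (n, h*patch*c)
    return image.reshape(h, n, patch, c).transpose(1, 0, 2, 3).reshape(n, -1)

canvas = np.zeros((H, W, C), dtype=np.float32)  # stand-in for rendered text
patches = to_patches(canvas)                    # shape (529, 768)
```

Each row is then a 768-dimensional pixel vector, playing the role that a token embedding lookup would play in a subword model.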



See Appendix A for reconstructions of this abstract. We do not claim that a language model designed to support thousands of languages should be pretrained only on English text; we expect that pretraining on an appropriate choice of another language, or multilingually, may yield even better results. PIXEL represents an initial effort at smaller scale. https://github.com/xplip/pixel



Figure 1: Overview of PIXEL's architecture. Following He et al. (2022), we use a masked autoencoder with a ViT architecture and a lightweight decoder for pretraining (left). At finetuning time (right), the decoder is replaced by a task-specific classification head that sits on top of the encoder.

[Figure: example sentence rendered by PIXEL — "My cat ᓚᘏᗢ enjoys eating warm oatmeal for lunch and dinner."]

