LANGUAGE MODELLING WITH PIXELS

Abstract

Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which suffers from neither of these issues. PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. PIXEL is trained to reconstruct the pixels of masked patches instead of predicting a distribution over tokens.1 We pretrain the 86M parameter PIXEL model on the same English data as BERT and evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts. We find that PIXEL substantially outperforms BERT on syntactic and semantic processing tasks on scripts that are not found in the pretraining data, but PIXEL is slightly weaker than BERT when working with Latin scripts. Furthermore, we find that PIXEL is more robust than BERT to orthographic attacks and linguistic code-switching, further confirming the benefits of modelling language with pixels.
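The pretraining objective described above, reconstructing the pixels of masked patches, can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the patch size, masking ratio, and "prediction" below are placeholders (PIXEL renders real text and predicts with a Transformer encoder-decoder), and only the loss computation over masked patches reflects the actual objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a rendered line of text: a sequence of 16x16 grayscale
# patches. In PIXEL the pixel values come from a text renderer; here
# they are random placeholders.
num_patches, patch_h, patch_w = 32, 16, 16
patches = rng.random((num_patches, patch_h, patch_w))

# Mask a subset of patches (True = masked).
mask = rng.random(num_patches) < 0.25
mask[:4] = True  # ensure at least some patches are masked in this toy example
masked_input = patches.copy()
masked_input[mask] = 0.0  # masked patches are blanked out

# A trained model would predict pixel values for the masked patches from
# the visible context; here the mean patch serves as a trivial stand-in
# so the loss computation can be shown.
prediction = np.broadcast_to(patches.mean(axis=0), patches.shape)

# Reconstruction loss (MSE) is computed only over the masked patches,
# as in masked autoencoders.
loss = float(np.mean((prediction[mask] - patches[mask]) ** 2))
```

Because the loss is restricted to masked patches, the model cannot trivially copy visible pixels and must infer the masked text from its visual context.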

1. INTRODUCTION

Natural language processing has progressed rapidly in recent years due to a combination of self-supervised representation learning, i.e. pretrained language models (PLMs) like BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), and XLM-R (Conneau et al., 2020); large unlabelled datasets, such as C4 (Raffel et al., 2020) and The Pile (Gao et al., 2020); and large-scale computing power (Hirschberg & Manning, 2015). Despite this progress, these models only cover a fraction of the world's languages, with large inequalities in performance (Pires et al., 2019; Lauscher et al., 2020), and the majority of languages are falling behind English (Joshi et al., 2020b; Bugliarello et al., 2022). Even within English, these models struggle when tasked with processing noisy inputs (Sun et al., 2020; Eger & Benz, 2020). In this paper, we show how to effectively support thousands of written languages in a single model while being robust to variations caused by character-level noise.

Language models typically support a finite vocabulary of categorical inputs, e.g. characters, subwords, or even words, and much effort has been devoted to vocabulary construction (Wan, 2022). On one end of the spectrum, a vocabulary over words has three problems: (i) it is not possible to encode out-of-vocabulary words because they lack an entry in a closed vocabulary, e.g. "doxing"; (ii) there are too many parameters in the word embedding layer; and, relatedly, (iii) the normalising constant for the softmax activation in the output layer is too expensive to compute. On the other end of the spectrum, vocabularies over bytes or characters are much smaller, but this leads to increased sequence lengths (Keren et al., 2022). In practice, most current models operate over inputs smaller than words but larger than characters: subword units (Sennrich et al., 2016; Kudo, 2018). Subwords avoid extremely large embedding and output layers while supporting open-vocabulary processing.
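The trade-off between vocabulary size and embedding-layer cost can be made concrete with back-of-the-envelope arithmetic. The vocabulary sizes below are illustrative round numbers (BERT's and XLM-R's are close to their actual sizes; the word-level figure is a hypothetical), and the hidden size of 768 matches BERT-base:

```python
# Embedding-matrix parameter counts for different vocabulary granularities.
# Note the softmax normalising constant in the output layer also costs
# O(V) per position, so large V hurts both input and output layers.
hidden = 768  # BERT-base hidden size
vocabs = {
    "word-level (hypothetical ~1M types)": 1_000_000,
    "subword, monolingual (BERT, ~30k)": 30_000,
    "subword, multilingual (XLM-R, ~250k)": 250_000,
    "character/byte (~256)": 256,
}
for name, v in vocabs.items():
    params = v * hidden
    print(f"{name}: {params / 1e6:.1f}M embedding parameters")
```

At the character/byte end the embedding matrix is tiny, but as the paragraph notes, the cost reappears as much longer input sequences.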
While this is a practical solution in a monolingual context and for some languages like English, covering many languages with a variety of scripts either results in a very large vocabulary or forces a trade-off over what is represented within a fixed number of subwords (see §5). Taken



1 See Appendix A for reconstructions of this abstract.

