CODEBPE: INVESTIGATING SUBTOKENIZATION OPTIONS FOR LARGE LANGUAGE MODEL PRETRAINING ON SOURCE CODE

Abstract

Recent works have widely adopted large language model pretraining for source code, suggested source-code-specific pretraining objectives, and investigated the applicability of various Transformer-based language model architectures to source code. This work investigates another important aspect of such models: the effect of different subtokenization options. We aim to identify the most effective and length-efficient subtokenizations, taking code specifics into account. We propose a subtokenization that reduces average sequence length by 17% without a downstream performance drop, and show that a carefully chosen subtokenization may improve quality by 0.5-2%, possibly at the cost of some length increase.

1. INTRODUCTION

Inspired by the success of large language model (LM) pretraining in natural language processing (NLP), BERT-like models have been widely adopted for source code processing (Feng et al., 2020; Kanade et al., 2020), as code has a discrete sequential structure similar to that of natural text. Trained on huge source code corpora in a self-supervised manner, large LMs often substantially outperform domain-specific models developed purposely for applied tasks, especially in tasks with limited parallel / labelled data (Ahmad et al., 2021a). These tasks include fixing code bugs, generating text from code and vice versa, and translating code between programming languages. Recent works have advanced large LM pretraining on source code in two main directions. First, various model kinds were applied to source code: CodeBERT (Feng et al., 2020) and CuBERT (Kanade et al., 2020) rely on the classic encoder-only RoBERTa (Liu et al., 2019), CodeGPT (Lu et al., 2021) uses the decoder-only GPT (Radford & Narasimhan, 2018), PLBART (Ahmad et al., 2021a) is based on the denoising sequence-to-sequence BART (Lewis et al., 2020) model, and CodeT5 (Wang et al., 2021b) utilizes the multitask sequence-to-sequence T5 (Raffel et al., 2020). Second, a range of code-specific self-supervised pretraining tasks were proposed to enrich the classic masked language modeling (MLM) objective; e.g., GraphCodeBERT (Guo et al., 2021) incorporates data flow into pretraining.

This work is devoted to investigating one more important component, subtokenization, which usually receives little attention when pretraining large LMs on source code. Modern LMs usually preprocess sequences using open-vocabulary models such as byte-pair encoding (BPE, Sennrich et al., 2016), which split long tokens into smaller subtokens. Though this process is often referred to as tokenization, we call it subtokenization to underline its smaller granularity.
Subtokenization has become a standard part of all widely-used LMs pretrained on natural text or code, because it ensures a relatively high frequency of all subtokens (compared to whitespace-separated tokenization, which results in a large portion of out-of-vocabulary tokens), while producing sequences of reasonable length (compared to character-level tokenization). Though subtokenization was initially introduced for NLP, it is especially relevant for code, as programming languages usually permit identifiers of unrestricted complexity, e.g. variable or function names (Chirkova & Troshin, 2021). Though subtokenization is often chosen with only superficial deliberation, it is one of the essential model components and may affect both quality and prediction speed. First, an inaccurately chosen subtokenization procedure may substantially increase sequence lengths and consequently slow down prediction. As a simple example, the work on CodeT5 (Wang et al., 2021b) notes that using BPE trained specifically on source code corpora makes sequences 30-45% shorter than using BPE trained on natural text. Second, a line of recent research points to the positive effect of a carefully chosen subtokenization procedure on model performance in NLP. For example, Bostrom & Durrett (2020) show that using the UnigramLM (Kudo, 2018) subtokenization algorithm instead of BPE improves the quality of BERT-based question answering and textual entailment in English by 1%, and Ding et al. (2019) show that adjusting the BPE vocabulary size in translation may yield up to +4 BLEU. At the same time, for large LMs, the particular subtokenization procedure chosen at the pretraining stage becomes an inseparable part of the model and must later be used in applied tasks. This underlines the need for a careful choice of subtokenization options when pretraining large LMs.
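To make the mechanism concrete, the following is a minimal, illustrative sketch of the BPE algorithm of Sennrich et al. (2016): training repeatedly fuses the most frequent adjacent symbol pair, and encoding replays the learned merges in order. It is a toy pure-Python implementation for exposition, not the actual tokenizer used in the paper.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair.

    `corpus` is a list of tokens; each token starts as a tuple of characters.
    """
    words = Counter(tuple(tok) for tok in corpus)
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by token frequency
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # rewrite every token with the chosen pair fused into one symbol
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

def apply_bpe(token, merges):
    """Split a token into subtokens by replaying the learned merges in order."""
    pieces = list(token)
    for a, b in merges:
        out, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and pieces[i] == a and pieces[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(pieces[i])
                i += 1
        pieces = out
    return pieces

merges = train_bpe(["low"] * 5 + ["lower"] * 2, num_merges=3)
print(apply_bpe("lowest", merges))  # ['lowe', 's', 't']
```

Note how the unseen token "lowest" is still representable as a few frequent subtokens; this is exactly the property that makes BPE attractive for the unrestricted identifiers found in code.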
In this work, we conduct a deep study of subtokenization options for large LM pretraining on source code, using PLBART as a testing ground. In addition to investigating general aspects, e.g. the subtokenization algorithm and the vocabulary size, we study ways of adapting subtokenization to the specific properties of code, such as the large number of punctuation marks and frequently-used token combinations, the variety of complex identifiers, and the relative similarity of programming languages. We aim to choose optimal subtokenization options that (a) lead to the best performance or (b) minimize sequence lengths (and thus speed up the model) without a downstream performance drop. Our contributions are as follows: we show that for large LMs pretrained on source code,

• grouping punctuation chars into single tokens reduces the average length by 17% without a downstream performance drop (we call this approach CodeBPE or CodeUnigramLM), and permitting more complex composite tokens reduces lengths by 40%, sometimes with a quality drop (Section 3);
• UnigramLM is generally preferable over BPE (Section 4);
• smaller vocabularies may improve quality at the cost of a 3-19% length increase (Section 5);
• subtokenizers transfer well between programming languages (Section 6).

Our length-efficient subtokenization procedure (see examples in Figure 1) compresses sequences by 17% without a quality drop, and our most effective subtokenization improves performance significantly, by 0.5-2%, in three out of eight tasks, and by one standard deviation in two other tasks.
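The effect of grouping punctuation characters can be illustrated with a small pre-tokenization sketch. The regexes and names below are our own illustration of the idea, not the paper's implementation: the conventional scheme isolates every punctuation character, while the grouped scheme lets runs of punctuation (e.g. "=[[", "]]") become single candidate tokens.

```python
import re

# Conventional scheme: every punctuation character is its own token.
CHAR_LEVEL = re.compile(r"\w+|\W")
# Grouped scheme (illustrating the CodeBPE idea): runs of punctuation
# with no intervening whitespace stay together as one token.
GROUPED = re.compile(r"\w+|[^\w\s]+|\s")

def pretokenize(code, pattern):
    """Split code with the given pattern, dropping pure-whitespace tokens."""
    return [t for t in pattern.findall(code) if not t.isspace()]

code = "FreqLists=[[0,0]]"
print(pretokenize(code, CHAR_LEVEL))
# ['FreqLists', '=', '[', '[', '0', ',', '0', ']', ']']  (9 tokens)
print(pretokenize(code, GROUPED))
# ['FreqLists', '=[[', '0', ',', '0', ']]']              (6 tokens)
```

Even on this tiny snippet the grouped scheme shortens the sequence by a third, which hints at where the 17% average length reduction reported above comes from.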

2. METHODOLOGY AND EXPERIMENTAL SETUP

Existing works on large LMs for source code usually choose a particular subtokenization library, often the same one as in the base LM the work builds upon, and train the subtokenizer with a vocabulary size of 30-50K on the source code corpora used for pretraining. Code is often preprocessed before subtokenization, e.g. by replacing \n with NEW_LINE, and split into tokens on whitespace and punctuation marks so that only these tokens are further split into subtokens; e.g. for i in range(vocSize) will be split into ['for', 'i', 'in', 'range', '(', 'vocSize', ')'] even if for i in is generally a frequent combination. The latter principle appears intuitively reasonable, since it ensures that subtokenization preserves the syntactically meaningful boundaries of tokens (Kanade et al., 2020). We refer to this principle as prohibiting composite tokens. More details on subtokenization in different LMs for code are given in Section 7.
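The conventional preprocessing described above can be sketched as a short function (a hedged illustration of the common recipe, with our own function name; real pipelines handle strings, comments, and indentation more carefully):

```python
import re

def split_tokens(code: str):
    """Conventional pre-tokenization: newlines become a NEW_LINE token,
    and the code is split on whitespace and punctuation so that subtokens
    never cross token boundaries (prohibiting composite tokens)."""
    code = code.replace("\n", " NEW_LINE ")
    # identifier/number tokens, or single punctuation characters
    return re.findall(r"\w+|[^\w\s]", code)

print(split_tokens("for i in range(vocSize)"))
# ['for', 'i', 'in', 'range', '(', 'vocSize', ')']
```

The subtokenizer (BPE or UnigramLM) is then trained and applied within these token boundaries, so a frequent combination like for i in can never become a single subtoken under this recipe.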



[Figure: example subtokenizations of FreqLists=[[0,0] for i in range(vocSz)], from the commonly used BPE-50K baseline to a variant giving +0.5-2% quality, a length-efficient variant giving a 17% length reduction (grouping punctuation runs such as =[[ and )] into single tokens), and a composite-token variant (e.g. for_i_in_range as one token).]

Figure 1: Example subtokenizations (all numbers compared to the commonly used BPE-50K).

