CODEBPE: INVESTIGATING SUBTOKENIZATION OPTIONS FOR LARGE LANGUAGE MODEL PRETRAINING ON SOURCE CODE

Abstract

Recent works have widely adopted large language model pretraining for source code, suggested source code-specific pretraining objectives, and investigated the applicability of various Transformer-based language model architectures to source code. This work investigates another important aspect of such models, namely the effect of different subtokenization options, and aims to identify the most effective and length-efficient subtokenizations, taking code specifics into account. We propose a subtokenization that reduces average sequence length by 17% without a drop in downstream performance, and show that a carefully chosen subtokenization may improve quality by 0.5-2%, possibly at the cost of some length increase.

1. INTRODUCTION

Inspired by the success of large language model (LM) pretraining in natural language processing (NLP), BERT-like models have been widely adopted for source code processing (Feng et al., 2020; Kanade et al., 2020), as code has a discrete sequential structure similar to natural text. Being trained on huge source code corpora in a self-supervised manner, large LMs often substantially outperform domain-specific models developed specifically for applied tasks, especially tasks with limited parallel or labelled data (Ahmad et al., 2021a). These tasks include fixing code bugs, generating text from code and vice versa, and translating code between programming languages. Recent works advanced large LM pretraining on source code in two main directions. First, various model kinds were utilized for source code: CodeBERT (Feng et al., 2020) and CuBERT (Kanade et al., 2020) rely on the classic encoder-only RoBERTa (Liu et al., 2019), CodeGPT (Lu et al., 2021) uses the decoder-only GPT (Radford & Narasimhan, 2018), PLBART (Ahmad et al., 2021a) is based on the denoising sequence-to-sequence BART (Lewis et al., 2020) model, and CodeT5 (Wang et al., 2021b) utilizes the multitask sequence-to-sequence T5 (Raffel et al., 2020). Second, a range of code-specific self-supervised pretraining tasks were proposed to enrich the classic masked language modeling (MLM) objective: e.g. GraphCodeBERT (Guo et al., 2021) predicts data flow connections during pretraining (one variable is computed from another variable), and CodeT5 (Wang et al., 2021b) and DOBF (Roziere et al., 2021) use a variable naming objective.

This work is devoted to investigating one more important component, subtokenization, which usually receives little attention when pretraining large LMs on source code. Modern LMs usually preprocess sequences using open-vocabulary models such as byte-pair encoding (BPE; Sennrich et al., 2016), which split long tokens into smaller subtokens. Though this process is often referred to as tokenization, we call it subtokenization to underline its smaller granularity. Subtokenization became a standard part of all widely-used LMs pretrained on natural text or code, because it ensures a relatively high frequency of all subtokens (compared to whitespace-separated tokenization, which results in a large portion of out-of-vocabulary tokens), while producing sequences of reasonable length (compared to character-level tokenization). Though subtokenization was initially introduced for NLP, it is especially relevant for code, as programming languages usually permit identifiers of unrestricted complexity, e.g. variable or function names (Chirkova & Troshin, 2021).
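To make the mechanism concrete, BPE applies a learned, ordered table of merge rules to a token that has been split into characters, greedily joining the adjacent pair with the earliest-learned merge until no rule applies. The sketch below illustrates this inference step on a code identifier; the merge table is a hypothetical hand-picked example, not one learned from a real corpus.

```python
def bpe_subtokenize(token, merges):
    """Split a token into characters, then greedily apply learned BPE merges.

    `merges` maps an adjacent piece pair to its rank (lower = learned earlier).
    """
    pieces = list(token)
    while len(pieces) > 1:
        # Collect all adjacent pairs that have a learned merge rule.
        ranked = [(merges[(a, b)], i)
                  for i, (a, b) in enumerate(zip(pieces, pieces[1:]))
                  if (a, b) in merges]
        if not ranked:
            break  # no rule applies; the segmentation is final
        _, i = min(ranked)  # merge the pair with the lowest rank first
        pieces = pieces[:i] + [pieces[i] + pieces[i + 1]] + pieces[i + 2:]
    return pieces

# Illustrative merge ranks under which the identifier "get_value"
# decomposes into frequent code subtokens.
merges = {("g", "e"): 0, ("ge", "t"): 1, ("v", "a"): 2, ("va", "l"): 3,
          ("val", "u"): 4, ("valu", "e"): 5}

print(bpe_subtokenize("get_value", merges))  # -> ['get', '_', 'value']
```

Note that "_" remains a single-character subtoken here because no merge rule covers it; which characters end up merged across identifier-part boundaries is exactly the kind of choice the subtokenization options studied in this work control.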

