INCODER: A GENERATIVE MODEL FOR CODE INFILLING AND SYNTHESIS

Abstract

Code is seldom written in a single left-to-right pass and is instead repeatedly edited and refined. We introduce INCODER, a unified generative model that can perform program synthesis (via left-to-right generation) as well as editing (via masking and infilling). INCODER is trained to generate code files from a large corpus of permissively licensed code, where regions of code have been randomly masked and moved to the end of each file, allowing code infilling with bidirectional context. Our model is the first large generative code model able to infill arbitrary regions of code, which we evaluate in a zero-shot setting on challenging tasks such as type inference, comment generation, and variable renaming. We find that the ability to condition on bidirectional context substantially improves performance on these tasks, while still performing comparably on standard program synthesis benchmarks against left-to-right-only models pretrained at similar scale. Our models and code are publicly released.

1. INTRODUCTION

Large language models trained on vast repositories of code have demonstrated remarkable progress in neural program synthesis and related tasks (Chen et al., 2021a; Austin et al., 2021; Xu et al., 2022; Nijkamp et al., 2022; Chowdhery et al., 2022). However, such models generate code left-to-right, which makes them less directly applicable to many ubiquitous code editing tasks, such as fixing bugs, adding comments, or renaming variables. We introduce INCODER, a unified model for program synthesis and editing. Like prior work, INCODER is trained to maximize the likelihood of a corpus of code. However, we adopt a causal masking objective (Aghajanyan et al., 2022a), allowing INCODER to infill blocks of code conditioned on arbitrary left and right contexts. More specifically, we learn to infill by randomly replacing spans of code with a sentinel token and moving them to the end of the sequence (Figure 1, top). The model is trained to predict all tokens of the complete sequence in this permuted ordering. During inference, we can edit code by replacing spans with sentinel tokens, prompting the model with the new sequence, and having it generate new tokens to replace the masked spans (Figure 1, bottom). Because the model can also trivially generate without sentinel tokens, the result is a unified approach to both program synthesis (via left-to-right generation) and editing (via infilling).

We evaluate performance on a range of zero-shot code infilling tasks (Sec. 4), both new and from existing work, including challenging use cases such as type prediction, variable renaming, comment generation, and completing missing lines of code. Zero-shot infilling with bidirectional context substantially outperforms approaches based on left-to-right-only models, and on several tasks obtains performance comparable to state-of-the-art models fine-tuned on those tasks. Ablation experiments (Sec. 5) show that this does not come at the cost of left-to-right generation ability: our causal masking model achieves performance similar to a standard language model on program synthesis benchmarks (Chen et al., 2021a; Austin et al., 2021), despite its more general training objective.

2. INFILLING AND SYNTHESIS VIA CAUSAL MASKING

Neural models for code generation have either used a left-to-right (causal) autoregressive language modeling objective (Brown et al., 2020; Chen et al., 2021a) or, as BERT does, a masked language modeling objective (Devlin et al., 2019; Feng et al., 2020). Both approaches have strengths and weaknesses. Causal models condition only on context to the left of the generated tokens, which prevents infilling, but they can autoregressively generate entire documents. Masked language models, on the other hand, can condition on both left and right contexts to infill a masked region; however, their training objective is typically limited to generating only about 15% of a document. In this paper, we adopt the recently proposed causal masking objective (Aghajanyan et al., 2022a), which aims to combine the strengths of both causal and masked language models.
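The span-permutation transformation behind this objective can be illustrated with a minimal sketch. The sentinel spellings (`<MASK:0>`, `<EOM>`) and the token-list representation here are illustrative stand-ins for the model's actual vocabulary:

```python
def causal_mask(tokens, span_start, span_end, mask_id=0):
    """Move one span of tokens to the end of the sequence, marking its
    original location with a sentinel token. The model is then trained
    autoregressively on the full permuted sequence, so it learns to
    generate the span conditioned on both left and right context."""
    left = tokens[:span_start]
    span = tokens[span_start:span_end]
    right = tokens[span_end:]
    sentinel = f"<MASK:{mask_id}>"
    # Original document with the span replaced by a sentinel, followed
    # by the sentinel again and the span itself, ended by an end-of-mask
    # token so generation knows where the span stops.
    return left + [sentinel] + right + [sentinel] + span + ["<EOM>"]

doc = ["def", "f", "(", "x", ")", ":", "return", "x", "+", "1"]
print(causal_mask(doc, 6, 9))
```

Training on the permuted sequence is what lets a single decoder-only model serve both purposes: omitting sentinels recovers ordinary left-to-right language modeling, while inserting them enables infilling.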

2.1. TRAINING

At training time, the causal masking procedure samples a number of spans of contiguous tokens in each document to mask (Figure 1, top left). We sample the number of spans from a Poisson distribution with a mean of one, truncated to the support [1, 256], so that there are typically only a small number of masked spans in each document.
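The truncated-Poisson sampling step can be sketched as follows; rejection sampling and Knuth's small-mean Poisson method are one simple way to realize the stated distribution, not necessarily the paper's implementation:

```python
import math
import random

def sample_num_spans(mean=1.0, lo=1, hi=256):
    """Rejection-sample a Poisson(mean) variate truncated to [lo, hi].
    The paper specifies a mean of one and support [1, 256]."""
    while True:
        # Knuth's method for small means: count how many uniform draws
        # it takes for their product to fall below exp(-mean).
        threshold, k, p = math.exp(-mean), 0, 1.0
        while p > threshold:
            k += 1
            p *= random.random()
        n = k - 1
        if lo <= n <= hi:  # reject samples outside the support
            return n

random.seed(0)
counts = [sample_num_spans() for _ in range(10_000)]
# With mean 1 truncated to [1, 256], a single masked span is by far
# the most common outcome, with a quickly decaying tail.
```

The truncation to a minimum of one guarantees every training document contains at least one infilling example, while the mean of one keeps most documents close to the ordinary left-to-right setting.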



Figure 1: At training time (top), our causal masking objective samples one or more spans of code in training documents (in the upper left figure, a single span) and moves these spans to the end of the document, with their original location denoted by special mask sentinel tokens. An autoregressive model is trained to produce these entire masked documents, allowing it to learn to generate insertion text conditioned on bidirectional context. At inference time (bottom), we can perform a variety of code editing and infilling tasks in a zero-shot fashion by inserting mask tokens at desired locations and allowing the model to generate code to insert there. All examples shown are real outputs from our INCODER-6.7B model, with the regions inserted by the model highlighted in orange.
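The inference-time procedure in the caption above can be sketched as prompt construction plus stop-token handling. The sentinel strings and the `generate` callable are illustrative assumptions, not the model's actual API:

```python
def make_infill_prompt(left, right, mask_id=0):
    """Build the zero-shot infilling prompt: the document with a
    sentinel at the insertion point, followed by the same sentinel to
    signal that the model should now generate the missing span."""
    sentinel = f"<MASK:{mask_id}>"
    return left + sentinel + right + sentinel

def infill(generate, left, right):
    """`generate` stands in for autoregressive sampling from the model;
    decoding stops at the end-of-mask token, and the completion is
    spliced back into its original location."""
    completion = generate(make_infill_prompt(left, right))
    return left + completion.split("<EOM>")[0] + right

# A stub generator demonstrating the splice, in place of a real model:
fake_generate = lambda prompt: "x + 1<EOM>"
print(infill(fake_generate, "def inc(x):\n    return ", "\n"))
```

Because the insertion point is arbitrary, the same recipe covers all of the editing tasks above (type prediction, renaming, comment generation, line completion) without any task-specific fine-tuning.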


AVAILABILITY

Our models and code are publicly available at https://sites.google.com/view/incoder-code.

