CODEGEN: AN OPEN LARGE LANGUAGE MODEL FOR CODE WITH MULTI-TURN PROGRAM SYNTHESIS

Abstract

Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of large language models up to 16.1B parameters, called CODEGEN, on natural language and programming language data, and open source the training library JAXFORMER. We show the utility of the trained model by demonstrating that it is competitive with the previous state-of-the-art on zero-shot Python code generation on HumanEval. We further investigate the multi-step paradigm for program synthesis, where a single program is factorized into multiple prompts specifying subproblems. To this end, we construct an open benchmark, Multi-Turn Programming Benchmark (MTPB), consisting of 115 diverse problem sets that are factorized into multi-turn prompts. Our analysis on MTPB shows that the same intent provided to CODEGEN in multiturn fashion significantly improves program synthesis over that provided as a single turn. We make the training library JAXFORMER and model checkpoints available as open source contribution:

1. INTRODUCTION

Creating a program has typically involved a human entering code by hand. The goal of program synthesis is to automate the coding process, and generate a computer program that satisfies the user's specified intent. Some have called it the holy grail of computer science (Manna & Waldinger, 1971; Gulwani et al., 2017) . Successful program synthesis would not only improve the productivity of experienced programmers but also make programming accessible to a wider audience. Two key challenges arise when striving to achieve program synthesis: (1) the intractability of the search space, and (2) the difficulty of properly specifying user intent. To maintain an expressive search space, one needs a large search space, which poses challenges in efficient search. Previous work (Joshi et al., 2002; Panchekha et al., 2015; Cheung et al., 2013) leverages domain-specific language to restrict the search space; however, this limits the applicability of synthesized programs. On the contrary, while being widely applicable, general-purpose programming languages (e.g., C, Python) introduce an even larger search space for possible programs. To navigate through the enormous program space, we formulate the task as language modeling, learning a conditional distribution of the next token given preceding tokens and leverage transformers (Vaswani et al., 2017) and large-scale self-supervised pre-training. This approach has seen success across modalities (Devlin et al., 2019; Lewis et al., 2020; Dosovitskiy et al., 2021) . Likewise, prior works have developed pre-trained language models for programming language understanding (Kanade et al., 2020; Feng et al., 2020) . To realize program synthesis successfully, users must employ some means to communicate their intent to the models such as a logical expression (which specifies a logical relation between inputs and outputs of a program), pseudo-code, input-output examples, or a verbalized specifications in natural language. On the one hand, a complete formal specification enjoys the exact specifications of user intent but may require domain expertise and effort from users to translate the intent to such a form. On the other hand, specification merely based on input-output examples is less costly but may under-specify the intent, leading to inaccurate solutions. Previous work has benefited from various methods and their combinations as the input to program synthesis models, including pseudocode (Kulal et al., 2019) , a part of a program and its documentation (Chen et al., 2021) , or natural language paragraph with input-output examples (Hendrycks et al., 2021) . However, we argue that a truly user-friendly form of intent is natural language text. To overcome these challenges, we propose a multi-turn program synthesis approach, where a user communicates with the synthesis system by progressively providing specifications in natural language while receiving responses from the system in the form of synthesized subprograms, such that the user together with the system complete the program in multiple steps. The following two considerations motivate this approach. First, we speculate that factorizing a potentially long and complicated specification into multiple steps would ease the understanding by a model and hence enhance program synthesis. In the multi-turn approach, a model can focus on the specification associated with one subprogram and avoid arduously tracking the complicated dependency among subprograms. This effectively reduces the search space besides the convenience of specifying user intent. Indeed, our speculations are confirmed in our experiments with higher quality synthesized programs through the multi-turn approach. Second, code exhibits a weak pattern of interleaved natural and programming language, which may be exploitable. Such a pattern is formed by programmers who explain the functionality of a program with comments. With the language modeling objective, we hypothesize that the interleaving pattern provides a supervision signal for the model to generate programs given natural language descriptions over multiple turns. The signal is highly noisy or weak, because only a subset of data would exhibit such a pattern, comments may be inaccurate or uninformative, and some of them may even be placed at an irrelevant position. However, up-scaling the model and data size might overcome such weak supervision, allowing the model to develop multi-turn program synthesis capacity. This enables user intent to be expressed in multiple turns, that is, the intent can be decomposed and fulfilled part by part while each turn can easily be expressed in natural language. In this work, we develop a multi-turn programming benchmark to measure the models' capacity for multi-turn program synthesis. To solve a problem in the benchmark, a model needs to synthesize a program in multiple steps with a user who specifies the intent in each turn in natural language. Please refer to Figure 1 for an example where the model synthesizes a program to extract the user name of an email address. Performance on the benchmark is measured by pass rate on expert-written test cases. To the best of our knowledge, this is the first multi-turn program synthesis benchmark, which allows quantitative analysis of multi-turn program synthesis. With the emergence of multi-turn program synthesis capacity in large language models that benefits problem-solving, we believe this benchmark will foster future research in program synthesis. Our Contributions Our work shares the basic idea of adopting language models for program synthesis with the recent and concurrent efforts (Chen et al., 2021; Austin et al., 2021; Li et al., 2022) with a single-turn user intent specification. In addition, we contribute with respect to four aspects: • We study multi-turn program synthesis emerging in autoregressive models under scaling laws. • We leverage this capacity to introduce a multi-turn program synthesis paradigm. • We investigate its properties quantitatively with a novel multi-turn programming benchmark. 1• We will open source model checkpointsfoot_1 and the custom training library: JAXFORMER. 3For program synthesis, no large-scale models competitive with Codex are available as open-source. This hinders progress, given that the expensive compute resources required to train these models are only accessible to a limited number of institutions. Our open source contribution allows a wide range of researchers to study and advance these models, which may greatly facilitate research progress.



Benchmark: https://github.com/salesforce/CodeGen/tree/main/benchmark Checkpoints: https://github.com/salesforce/CodeGen Training: https://github.com/salesforce/jaxformer

availability

https://github.com/salesforce/CodeGen.

