CODEGEN: AN OPEN LARGE LANGUAGE MODEL FOR CODE WITH MULTI-TURN PROGRAM SYNTHESIS

Abstract

Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state of the art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of large language models up to 16.1B parameters, called CODEGEN, on natural language and programming language data, and open-source the training library JAXFORMER. We show the utility of the trained model by demonstrating that it is competitive with the previous state of the art on zero-shot Python code generation on HumanEval. We further investigate the multi-step paradigm for program synthesis, where a single program is factorized into multiple prompts specifying subproblems. To this end, we construct an open benchmark, the Multi-Turn Programming Benchmark (MTPB), consisting of 115 diverse problem sets that are factorized into multi-turn prompts. Our analysis on MTPB shows that the same intent provided to CODEGEN in a multi-turn fashion significantly improves program synthesis over the same intent provided as a single turn. We make the training library JAXFORMER and the model checkpoints available as open-source contributions.

1. INTRODUCTION

Creating a program has typically involved a human entering code by hand. The goal of program synthesis is to automate the coding process and generate a computer program that satisfies the user's specified intent. Some have called it the holy grail of computer science (Manna & Waldinger, 1971; Gulwani et al., 2017). Successful program synthesis would not only improve the productivity of experienced programmers but also make programming accessible to a wider audience.

Two key challenges arise when striving to achieve program synthesis: (1) the intractability of the search space, and (2) the difficulty of properly specifying user intent. Maintaining an expressive search space requires a large one, which makes efficient search challenging. Previous work (Joshi et al., 2002; Panchekha et al., 2015; Cheung et al., 2013) leverages domain-specific languages to restrict the search space; however, this limits the applicability of synthesized programs. In contrast, while being widely applicable, general-purpose programming languages (e.g., C, Python) introduce an even larger search space for possible programs. To navigate through the enormous program space, we formulate the task as language modeling: learning a conditional distribution of the next token given preceding tokens, leveraging transformers (Vaswani et al., 2017) and large-scale self-supervised pre-training. This approach has seen success across modalities (Devlin et al., 2019; Lewis et al., 2020; Dosovitskiy et al., 2021). Likewise, prior works have developed pre-trained language models for programming language understanding (Kanade et al., 2020; Feng et al., 2020).

To realize program synthesis successfully, users must employ some means to communicate their intent to the models, such as a logical expression (which specifies a logical relation between the inputs and outputs of a program), pseudo-code, input-output examples, or a verbalized specification in natural language.
On the one hand, a complete formal specification captures user intent exactly but may require domain expertise and effort from users to translate the intent into such a form. On the other hand, a specification based merely on input-output examples is less costly but may under-specify the intent, leading to inaccurate solutions. Previous work has benefited from various methods and their combinations as the input to program synthesis models, including pseudo-code (Kulal et al., 2019), a part of a program and its documentation (Chen et al., 2021), or a natural language paragraph with input-output examples (Hendrycks et al., 2021). However, we argue that a truly user-friendly form of intent is natural language text.

To overcome these challenges, we propose a multi-turn program synthesis approach, in which a user communicates with the synthesis system by progressively providing specifications in natural language while receiving responses from the system in the form of synthesized subprograms, such that the user and the system together complete the program in multiple steps. The following two considerations motivate this approach. First, we speculate that factorizing a potentially long and complicated specification into multiple steps eases understanding by the model and hence enhances program synthesis. In the multi-turn approach, the model can focus on the specification associated with one subprogram and avoid arduously tracking the complicated dependencies among subprograms. Besides making user intent easier to specify, this effectively reduces the search space. Indeed, our speculation is confirmed in our experiments, which show higher-quality synthesized programs with the multi-turn approach. Second, code exhibits a weak pattern of interleaved natural and programming language, which may be exploitable. Such a pattern is formed by programmers who explain the functionality of a program with comments.
With the language modeling objective, we hypothesize that this interleaving pattern provides a supervision signal for the model to generate programs given natural language descriptions over multiple turns. The signal is highly noisy or weak, because only a subset of the data exhibits such a pattern, comments may be inaccurate or uninformative, and some of them may even be placed at irrelevant positions. However, up-scaling the model and data size might overcome such weak supervision, allowing the model to develop multi-turn program synthesis capacity. This enables user intent to be expressed in multiple turns; that is, the intent can be decomposed and fulfilled part by part, while each turn can easily be expressed in natural language.

In this work, we develop a multi-turn programming benchmark to measure the models' capacity for multi-turn program synthesis. To solve a problem in the benchmark, a model needs to synthesize a program in multiple steps with a user who specifies the intent in each turn in natural language. Please refer to Figure 1 for an example where the model synthesizes a program to extract the user name of an email address. Performance on the benchmark is measured by the pass rate on expert-written test cases. To the best of our knowledge, this is the first multi-turn program synthesis benchmark, which allows quantitative analysis of multi-turn program synthesis. With the emergence of multi-turn program synthesis capacity in large language models that benefits problem-solving, we believe this benchmark will foster future research in program synthesis.

Our Contributions. Our work shares the basic idea of adopting language models for program synthesis with recent and concurrent efforts (Chen et al., 2021; Austin et al., 2021; Li et al., 2022) that use a single-turn user intent specification. In addition, we contribute with respect to four aspects:

• We study multi-turn program synthesis emerging in autoregressive models under scaling laws.
• We leverage this capacity to introduce a multi-turn program synthesis paradigm.
• We investigate its properties quantitatively with a novel multi-turn programming benchmark.
• We open-source the model checkpoints and the custom training library JAXFORMER.

For program synthesis, no large-scale models competitive with Codex are available as open source. This hinders progress, given that the expensive compute resources required to train these models are accessible only to a limited number of institutions. Our open-source contribution allows a wide range of researchers to study and advance these models, which may greatly facilitate research progress.

2. MODEL TRAINING

To evaluate the emergence of multi-turn programming capabilities under scaling laws, we adopt standard transformer-based autoregressive language models, varying (1) the number of model parameters (350M, 2.7B, 6.1B, 16.1B) and (2) the number of tokens of programming languages in the training corpora. For scaling the training, a custom library JAXFORMER for TPU-v4 hardware was developed and will be released as open-source, including the trained model weights.

2.1. DATASETS

The family of CODEGEN models is trained sequentially on three datasets: THEPILE, BIGQUERY, and BIGPYTHON. The natural language dataset THEPILE is an 825.18 GiB English text corpus collected by Gao et al. (2020) for language modeling (MIT license). The dataset is constructed from 22 diverse high-quality subsets, one of which is programming language data collected from GitHub repositories with >100 stars, constituting 7.6% of the dataset. Since the majority of THEPILE is English text, the resulting models are called natural language CODEGEN models (CODEGEN-NL). The multi-lingual dataset BIGQUERY is a subset of Google's publicly available BigQuery dataset, which consists of code (under open-source licenses) in multiple programming languages. For the multi-lingual training, the following 6 programming languages are chosen: C, C++, Go, Java, JavaScript, and Python. Thus, we refer to models trained on BIGQUERY as multi-lingual CODEGEN models (CODEGEN-MULTI). The mono-lingual dataset BIGPYTHON contains a large amount of Python code: we compiled public, non-personal information from GitHub consisting of permissively licensed Python code in October 2021. Consequently, we refer to models trained on BIGPYTHON as mono-lingual CODEGEN models (CODEGEN-MONO). The pre-processing follows five steps: (1) filtering, (2) deduplication, (3) tokenization, (4) shuffling, and (5) concatenation. For details on THEPILE, we refer to Gao et al. (2020). For BIGQUERY and BIGPYTHON, we refer to Appendix A. Table 5 summarizes the statistics of the training corpora.

2.2. MODELS

The CODEGEN models are autoregressive transformers with next-token prediction language modeling as the learning objective, trained on a natural language corpus and programming language data curated from GitHub. The models are trained in various sizes with 350M, 2.7B, 6.1B, and 16.1B parameters. The first three configurations allow for direct comparison with open-sourced large language models trained on text corpora: GPT-NEO (350M, 2.7B) (Black et al., 2021) and GPT-J (6B) (Wang & Komatsuzaki, 2021). See Table 6 in Appendix A for model specifications. The CODEGEN models are trained sequentially over the datasets: CODEGEN-NL is first trained on THEPILE; CODEGEN-MULTI is initialized from CODEGEN-NL and trained on BIGQUERY; finally, CODEGEN-MONO is initialized from CODEGEN-MULTI and trained on BIGPYTHON.

The emergence of program synthesis conditioned on natural language descriptions may stem from the size of the models and data, the training objective, and the nature of the training data itself. We call this emergence since we do not explicitly train the model on comment-code pairs. Similar phenomena are observed in a wide range of natural language tasks, where a large-scale unsupervised language model can solve unseen tasks in a zero-shot fashion (Brown et al., 2020). The emergence phenomenon, or surprising zero-shot generalization, is often attributed to the large scale of the model and the data. While our focus is not to reveal the underlying mechanism of why program synthesis capacity emerges from simple language modeling, we attempt an explanation given the nature of our modeling approach and the training data. The data consists of regular code from GitHub (without manual selection), some of which exhibits a pattern of interleaved natural and programming language; we believe this provides a noisy supervision signal for the program synthesis capacity due to the next-token prediction training objective.
However, we emphasize that such a data pattern is highly noisy and weak, because only a subset of the data exhibits such a pattern; e.g., comments may be inaccurate or uninformative, and some of them may even be placed at irrelevant positions. Therefore, we believe two main factors contribute to the program synthesis capacity: (1) the large scale of the model and data and (2) the noisy signal in the training data.

The scaling of such LLMs requires data and model parallelism. To address these requirements, the training library JAXFORMER (https://github.com/salesforce/jaxformer) was developed for efficient training on Google's TPU-v4 hardware. We refer to Appendix A for further details on the technical implementation and sharding schemes. Table 6 summarizes the hyper-parameters.

Table 1: Evaluation results on the HumanEval benchmark. Each pass@k (where k ∈ {1, 10, 100}) for each model is computed with three sampling temperatures (t ∈ {0.2, 0.6, 0.8}), and the highest of the three is displayed, following the evaluation procedure in Chen et al. (2021). Results for the model marked with * are from Chen et al. (2022).

3. SINGLE-TURN EVALUATION

We first evaluate our CODEGEN models using an existing program synthesis benchmark: HumanEval (MIT license) (Chen et al., 2021). HumanEval contains 164 hand-written Python programming problems. Each problem provides a prompt with a description of the function to be generated, the function signature, and example test cases in the form of assertions. The model needs to complete a function given the prompt such that it passes all provided test cases, thus measuring performance by functional correctness. Since the user intent is specified in a single prompt and provided to the model once, we regard the evaluation on HumanEval as single-turn evaluation, to distinguish it from the multi-turn evaluation introduced in the next section. Following Chen et al. (2021), we adopt nucleus sampling (Holtzman et al., 2020) with top-p where p = 0.95.
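Functional correctness on HumanEval is reported as pass@k, following the unbiased estimator of Chen et al. (2021): given n samples per problem of which c pass all tests, the per-problem estimate is 1 − C(n−c, k)/C(n, k). A minimal sketch of that estimator (the function name is ours; the benchmark score averages this value over all problems):

```python
def pass_at_k(n, c, k):
    """Unbiased per-problem estimate of pass@k from n samples, c of which
    pass all tests: 1 - C(n-c, k) / C(n, k), computed as a stable product
    instead of explicit binomial coefficients."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: any draw of k includes a pass
    prob_all_fail = 1.0
    for i in range(k):
        prob_all_fail *= (n - c - i) / (n - i)
    return 1.0 - prob_all_fail
```

The running-product form avoids the overflow that direct factorials would cause for n = 200.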

3.1. HUMANEVAL PERFORMANCE SCALES AS A FUNCTION OF MODEL SIZE AND DATA SIZE

We compare our models to the Codex models (Chen et al., 2021), which demonstrate the state-of-the-art performance on HumanEval. Moreover, our models are compared to the open-sourced large language models GPT-NEO (Black et al., 2021) and GPT-J (Wang & Komatsuzaki, 2021). These are trained on THEPILE (Gao et al., 2020) and are thus similar to our CODEGEN-NL models in terms of training data and model size. All models are evaluated with temperature t ∈ {0.2, 0.6, 0.8}, and we compute pass@k where k ∈ {1, 10, 100} for each model. For direct comparison to the results of Chen et al. (2021), we choose the temperature that yields the best-performing pass@k for each k. The results of our models and the baselines are summarized in Table 1. Our CODEGEN-NL models (350M, 2.7B, 6.1B) outperform or perform on par with the respective GPT-NEO and GPT-J models. Further training CODEGEN-NL on multi-lingual programming language data (BIGQUERY) leads to CODEGEN-MULTI. The multi-lingual CODEGEN models outperform the models trained on THEPILE (GPT-NEO, GPT-J, CODEGEN-NL) by a large margin. We then fine-tune CODEGEN-MULTI on a Python-only dataset (BIGPYTHON), resulting in CODEGEN-MONO, which improves program synthesis capacity substantially. Therefore, Python program synthesis capacity improves as the amount of Python training data increases. For almost all models, as expected, increasing the model size improves overall performance. Our Python-monolingual CODEGEN models have competitive or improved performance compared to the current state-of-the-art models, Codex. CODEGEN-MONO 2.7B underperforms CODEX 2.5B when k = 100 but outperforms it when k ∈ {1, 10}. While it is only half the size, our CODEGEN-MONO 6.1B demonstrates pass@k scores approaching those of the best-performing Codex, CODEX 12B. Our largest model, CODEGEN-MONO 16.1B, is competitive with or outperforms it, depending on k.

3.2. BETTER USER INTENT UNDERSTANDING YIELDS BETTER SYNTHESIZED PROGRAMS

The success of a program synthesis system highly depends on how well it understands user intent. When the system is based on a language model, the perplexity of problem prompts provides a proxy for the system's understanding of user intent specifications. A low perplexity of an intent specification under a model indicates that the specification is compatible with the knowledge learned by the model from the training data. We investigate whether better prompt understanding, with lower prompt perplexity as a proxy, leads to more functionally accurate programs. We partition all problems into pass and non-pass ones. A pass problem is one for which at least one of 200 samples passes all test cases; for a non-pass problem, none of the 200 samples pass all test cases. We compute the average perplexity of the problem prompts of the pass problems and of the non-pass ones, based on samples from the CODEGEN-MONO models. The results are displayed in Table 2 (see Appendix F for the results on CODEGEN-NL and CODEGEN-MULTI). The prompts of the pass problems have lower perplexity than those of the non-pass ones. This finding implies that program synthesis is more likely to be successful when the user intent specification is better understood by the model. Indeed, some training data contains interleaved sequences of natural language comments and programs, where the comments describe the functionality of the following program. We thus speculate that user intent specifications resembling such a pattern are better understood by the model and hence lead to better program synthesis. Inspired by this pattern, we propose to specify user intent in multiple turns such that the model focuses on a partial problem at a time, which should make user intent understanding by the model easier.
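The perplexity proxy reduces to exponentiating the mean negative log-likelihood of the prompt tokens under the model. A minimal sketch (the function name is ours; the per-token log-probabilities would come from a CODEGEN model's forward pass):

```python
import math

def prompt_perplexity(token_logprobs):
    """Perplexity of a prompt given the model's log-probability of each
    prompt token conditioned on its preceding tokens: exp(mean NLL)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```

For instance, a prompt whose tokens each have probability 0.25 under the model has perplexity 4; lower values indicate the specification is more compatible with what the model learned from training data.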

4. MULTI-TURN EVALUATION

In this section, we propose and study a multi-step program synthesis paradigm where program synthesis is decomposed into multiple steps and the system synthesizes a subprogram in each step. To examine such a paradigm, we first develop a Multi-Turn Programming Benchmark (MTPB). MTPB consists of 115 problems written by experts, each of which includes a multi-step description in natural language (prompt). To solve a problem, a model needs to synthesize functionally correct subprograms (1) following the description at the current step and (2) considering descriptions and synthesized subprograms at previous steps (e.g., correct backreference of functions and/or variables defined in the previous steps). An illustrative example is shown in Figure 1.

Figure 1: An illustrative MTPB example in which the human turns include "Search for an email address in "{input}" and store the first match to a variable "address"", "Remove the substring starting from the @ symbol from "address"", and "Replace non-alphabetical symbols with a whitespace in "address"". 1 Some prompts include templates (i.e., {input}) that are filled with test case inputs before they are fed to the model; in the displayed example, the input is a string containing abc.xyz@example.com, which replaces {input} in p 2, and the expected output is abc xyz. 2 Our model conditions on the concatenation of interleaved past prompts and generated responses. 3 Generated responses from each turn are concatenated and executed, and the output is compared to the answer.
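The turns in this example compose into a short program; a plausible hand-written solution for this problem (our own sketch, not a model output) is:

```python
import re

def extract_user_name(text):
    # Turn 1: search for an email address and store the first match to "address"
    address = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", text).group(0)
    # Turn 2: remove the substring starting from the @ symbol
    address = address[:address.index("@")]
    # Turn 3: replace non-alphabetical symbols with a whitespace
    return re.sub(r"[^A-Za-z]", " ", address)
```

For an input containing abc.xyz@example.com, the result is abc xyz, matching the expected output.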

4.1. BENCHMARK CONSTRUCTION

We (4 authors) start by defining a set of 115 problems requiring a diverse range of programming knowledge, including math, array operations, string manipulations, algorithms, data science, and problems that require other knowledge, such that the number of problems in each category is roughly balanced. For each problem, we construct a triplet consisting of multi-turn prompts P , test case inputs I, and test case outputs O. Multi-turn prompts P are designed following two constraints: (1) the problem is decomposed into 3 or more turns, and (2) a single turn alone cannot solve the entire problem. For example, implementing a linear regression model could be phrased as "Perform linear regression on x and y". Since the main task is fully expressed in this prompt, understanding this prompt alone is sufficient to perform the task. We avoid such cases via manual inspection and distribute problem-solving over turns. Together with the prompts, we task the problem author with preparing 5 sets of test case inputs I and outputs O to evaluate model outputs for functional correctness. To avoid wrongly rewarding false-positive solutions that pass the tests with meaningless programs, we examine and revise such cases to ensure test quality. Unlike HumanEval, for which models are expected to complete a partially defined function, MTPB problems only provide the prompts, so models have to generate the solution from scratch. While free-form generation may allow for more potential solutions, the lack of an entry point for providing test case inputs makes it challenging to test the generated code on diverse test cases. To overcome this challenge, we instead embed test case inputs within the prompts. Specifically, prompts are written with Python's formatted strings, where input values are substituted for the variable name when a specific test case is applied to the problem.
For example, a prompt, "Define a string named 's' with the value {var}.", together with a test case input var = 'Hello' will be formatted into "Define a string named 's' with the value 'Hello'." Also see 1 in Figure 1 for an example.
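The substitution itself is plain Python string formatting; a minimal sketch (the helper name is ours):

```python
def render_prompt(template, **test_case_inputs):
    # Fill the {var}-style slots in a prompt template with the
    # concrete values of one test case.
    return template.format(**test_case_inputs)
```

For example, render_prompt("Define a string named 's' with the value {var}.", var="'Hello'") yields the formatted prompt shown above.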

4.2. EXECUTION ENVIRONMENT AND SOLUTION EVALUATION

For execution, the history of pairs of prompts and generated completions is concatenated into a self-contained program (see 3 in Figure 1 for an example). The program is then executed in an isolated Python environment, following the single-turn HumanEval benchmark (Chen et al., 2021). However, the problems in HumanEval are constructed such that a known function signature is completed, so invoking the generated code under a set of functional unit tests is trivial. In our multi-turn case, no such entry point (or return value) is guaranteed to be generated. To circumvent the issue of a missing return signature (or value), the last prompt of each multi-turn problem in MTPB always specifies printing the resulting state to the terminal. The benchmark execution environment then overloads the Python print(args) function and stores args on a stack. If the sampled code for the last prompt of a problem does not include a print() statement (omitting it is a valid convention for displaying results in Python, specifically in Jupyter notebooks), the AST of the generated code is mutated to inject an invocation of print(). Finally, a type-relaxed equivalence check (e.g., an implicit conversion between lists and tuples) of args against the predefined gold output of the problem determines test failure or success.
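The evaluation loop can be approximated as follows. This is a simplified sketch (names are ours): it overrides print directly rather than mutating the AST, and it runs exec without the sandboxing the real isolated environment would require:

```python
def run_mtpb_sample(subprograms, expected):
    """Concatenate per-turn completions into one program, execute it,
    capture print() arguments on a stack, and compare the last captured
    value against the gold output with type relaxation."""
    captured = []

    def capture_print(*args, **kwargs):
        captured.extend(args)  # record printed args instead of writing to stdout

    namespace = {"print": capture_print}  # overload print for the generated code
    exec("\n".join(subprograms), namespace)
    result = captured[-1] if captured else None
    # type-relaxed equivalence check, e.g., implicit list/tuple conversion
    if isinstance(result, (list, tuple)) and isinstance(expected, (list, tuple)):
        return list(result) == list(expected)
    return result == expected
```

Placing capture_print in the execution namespace shadows the built-in, so any print() call in the concatenated program lands on the stack for comparison.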

4.3. MULTI-STEP PROGRAMMING CAPACITY SCALES WITH MODEL SIZE AND DATA SIZE

In this analysis, we investigate how model size and data size affect program synthesis capacity in the multi-turn paradigm. In the MTPB, each problem has 5 test cases, and we draw 40 samples for each test case with each model, based on which the pass rate is computed for each problem. The MTPB evaluation results (average pass rate) for our CODEGEN models, the baselines, and the OpenAI Codex models are shown in Table 3. Clearly, performance on the MTPB improves as a function of model size and data size, suggesting that the capacity for multi-step program synthesis scales with both. The models are simply trained with an autoregressive language modeling objective; as the model and data scale up, multi-turn program synthesis capacity emerges, that is, the capacity to synthesize programs in a multi-turn fashion.

Figure 2: Difference in average pass rate of problems in single-turn and multi-turn formulation over levels of problem difficulty. The improvement is sizable for most model sizes and difficulty levels, except for easy problems with larger models.

4.4. BETTER USER SPECIFICATION UNDERSTANDING WITH MULTI-TURN FACTORIZATION

We hypothesize that multi-turn factorization enhances the model's understanding of user intent specifications, which in turn leads to higher program synthesis capacity. To test this hypothesis, we form a single-turn counterpart of the multi-turn specifications by concatenating the turns of each specification into a single turn. As discussed in Section 3.2, we adopt prompt perplexity as a proxy for user intent understanding. Thus, we compare the perplexity of the multi-turn prompts with that of the concatenated single-turn prompts under the four CODEGEN-MONO models. The average perplexity (see Appendix E for calculation details) over all problems in the MTPB is displayed in the left panel of Table 4. For all models, the single-turn specification has a higher average perplexity than the multi-turn specification, implying that multi-turn user specifications can be better understood by the models. We notice that the average perplexity of both multi-turn and single-turn intent specifications under the larger models is slightly lower than under the smaller models, indicating that the larger models understand user intent better than the smaller ones. We then compare the program synthesis pass rate with the multi-turn prompts to that with the concatenated single-turn prompts. The results are shown in the right panel of Table 4. Multi-turn specifications improve the pass rate by close to or more than 10 percentage points over single-turn specifications for all model sizes. Together with the perplexity analysis above, it appears that factorizing a user specification into multiple steps and leveraging the emergent capacity of large language models allows them to digest the specification more easily and synthesize programs more successfully.
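The single-turn counterpart used in this comparison is formed by simply joining the turn-wise specifications into one prompt; a sketch of that construction (the function name and separator are ours):

```python
def to_single_turn(turns, sep=" "):
    """Collapse a multi-turn specification into its single-turn counterpart
    by concatenating all turn prompts into one prompt."""
    return [sep.join(turns)]
```

Under this construction the information content is identical across the two conditions; only the factorization differs, which isolates the effect of multi-turn prompting.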
Furthermore, we categorize the problems by difficulty level based on their average pass rates ("hard" with less than 30%, "easy" with more than 70%) and examine the interaction effect between difficulty level and model size on the improvement from multi-turn factorization. See the results in Figure 2. Across almost all model sizes and difficulty levels, multi-turn prompts lead to significant improvement over single-turn prompts, and most improvements are close to or above 10 percentage points. Interestingly, the larger models (6.1B and 16.1B) are invariant to multi-turn factorization for easy problems (see the two short bars, 0.19% and -0.25%, in Figure 2). This implies that when problems can be easily understood by the model (due to the combined effect of the easiness of the problems and the high capacity of larger models), it is not necessary or beneficial to factorize the specifications. This is in fact consistent with our motivating assumption that factorizing complicated specifications eases problem understanding and improves program synthesis.

4.5. QUALITATIVE EXAMPLES

To further understand differences in model behavior across model sizes, we examine cases where large models show contrasting performance to smaller models. We specifically select problems for which CODEGEN-MONO 16.1B and CODEGEN-MONO 2.7B show a significant discrepancy in performance. On problems where CODEGEN-MONO 16.1B performs significantly worse than CODEGEN-MONO 2.7B, we observe that the larger model becomes inflexible by taking the prompt literally. For example, initializing a number always results in an integer, despite the prompt asking to cast it into a string (Figure 3), or the "return" keyword in a prompt triggers a function definition while the intent is to directly generate an executable program (Figure 4). In general, however, larger-scale models overcome prompt-misinterpretation mistakes made by smaller models, including assigning multiple variables at the same time (Figure 5) or understanding the concept of any comparison (Figure 6).

5. RELATED WORK

Program Synthesis. While program synthesis has a long history, two inherent challenges remain unsolved: (1) the intractability of the program space and (2) the difficulty of accurately expressing user intent (Manna & Waldinger, 1971; Gulwani et al., 2017). A large body of prior research attempted to address (1) by exploring methods like stochastic search techniques (Parisotto et al., 2017; Schkufza et al., 2013) and deductive top-down search (Gulwani, 2011; Polozov & Gulwani, 2015). However, the scalability of these approaches remains limited. User intent can be expressed with various methods: formal logical specifications, input-output examples, and natural language descriptions. Complete and formal specifications require too much effort, while informal ones like input-output examples often under-specify problems (Gulwani, 2011). The well-learned conditional distribution and language understanding capacity afforded by large-scale models and data allow for efficient solutions to these two challenges. Several works investigate converting conversational intents into programmable representations, such as SQL (Yu et al., 2019a;b) or dataflow graphs (Andreas et al., 2020). Our proposed benchmark requires the generation of Python, which is more general and complex.

Large Language Models. Transformers capture dependencies among sequence elements through the attention mechanism (Bahdanau et al., 2014) and are highly scalable. They have been successfully applied to natural language processing (Devlin et al., 2019; Lewis et al., 2020; Raffel et al., 2020), computer vision (Dosovitskiy et al., 2021), and many other areas (Oord et al., 2018; Jumper et al., 2021). Prior works, such as CuBERT (Kanade et al., 2020), CodeBERT (Feng et al., 2020), PyMT5 (Clement et al., 2020), and CodeT5 (Wang et al., 2021), have applied transformers to code understanding, but these mostly focus on code retrieval, classification, and program repair.
Several recent and concurrent efforts explore using large language models for program synthesis (Chen et al., 2021; Austin et al., 2021; Li et al., 2022; Fried et al., 2022) and its effectiveness (Vaithilingam et al., 2022). While they focus on generating code in a single turn, we propose to factorize the specifications into multiple turns and demonstrate that doing so substantially improves synthesis quality. It is worth pointing out that Austin et al. (2021) explored refining the code in multiple iterations, but this is essentially a single-turn approach since a complete program is produced in every single turn. Prompting pre-trained language models with intermediate information to improve task performance has attracted interest (Nye et al., 2021; Wei et al., 2022). Our proposed MTPB also allows the model to leverage past turns as context.

Benchmarks for Program Synthesis

To quantitatively evaluate program synthesis models, several benchmarks have been proposed with different input forms. Popular input forms include preceding code in the same line (Raychev et al., 2016), pseudo-code (Kulal et al., 2019), a docstring and function signature (Chen et al., 2021), or a problem description (Hendrycks et al., 2021). In most of those cases, only directly relevant input information is given to the model. In contrast, a few previous works instantiate benchmarks that measure the ability to generate programs given surrounding program context beyond the target program, such as variables and other methods (Iyer et al., 2018) or alternating "cells" of preceding code and text blocks (Agashe et al., 2019), while the primary focus is still to generate the target program itself. We propose a new benchmark that requires a progressive generation of subprograms through multi-turn prompts.

6. CONCLUSION

We study program synthesis with large causal language models trained on large corpora of code data. The capacity to understand long context and generate coherent responses emerges from simple language modeling as the model size and data size scale up. Leveraging this capacity, and observing that better user intent understanding leads to better program synthesis, we propose a multi-step program synthesis approach in which program synthesis is achieved through multi-turn specification and code generation. Moreover, we develop the Multi-Turn Programming Benchmark (MTPB) to investigate our models' capacity for synthesizing programs in such a multi-step paradigm. Our experiments show that multi-step program synthesis capacity scales as a function of model size and data size. Intent specifications that are given in multiple steps are digested more easily by the models and lead to more accurate program synthesis. We open-source the training code and the model checkpoints to facilitate future research and practical applications in this area.

BROADER IMPACT AND ETHICAL CONSIDERATIONS

All variants of CODEGEN are first pre-trained on the Pile, which includes a small portion of profane language. Focusing on the GitHub data that best aligns with our expected use case of program synthesis, Gao et al. (2020) report that 0.1% of the data contains profane language and exhibits sentiment biases against gender and certain religious groups. Thus, while we did not observe such content in our samples, CODEGEN may generate it as well. In addition to risks in natural language outputs (e.g., docstrings), generated programs may include vulnerabilities and other safety concerns, which are not remedied in this work. Models should not be used in applications until these risks have been addressed.

A MODEL TRAINING

To evaluate the emergence of multi-turn program synthesis capabilities under scaling laws, we adopt standard transformer-based autoregressive language models, varying (1) the number of model parameters (350M, 2.7B, 6.1B, 16.1B) and (2) the number of tokens of programming languages in the training corpora. For scaling the models, a custom library, JAXFORMER, for training large language models on TPU-v4 hardware was developed and is released as open source, including the trained model weights.

A.1 DATASETS

Table 5: Approximate statistics for the training corpora along the pre-processing steps. For each dataset, the pre-processing shares the following steps: (1) filtering, (2) deduplication, (3) tokenization, (4) shuffling, and (5) concatenation. For details on THEPILE, we refer to Gao et al. (2020). For BIGQUERY and BIGPYTHON, in (1) files are filtered by file extension, and files with an average line length of <100 characters, a maximum line length of 1,000, and >90% of the characters being decimal or hexadecimal digits are removed. For (2), exact duplicates based on their SHA-256 hash are removed, which amounts to a substantial portion of the raw data due to forks and copies of repositories. For (3), the BPE vocabulary of GPT-2 is extended by special tokens representing repeating tokens of tabs and white spaces. In the multi-lingual setting of BIGQUERY, a prefix is prepended to indicate the name of the programming language. For (4), each year of data is randomly shuffled. For (5), sequences are concatenated to fill the context length of 2,048 tokens, with a special token as a separator.

Table 5 summarizes the statistics of the training corpora. CODEGEN-NL models are randomly initialized and trained on THEPILE. CODEGEN-MULTI models are initialized from CODEGEN-NL and then trained on BIGQUERY. CODEGEN-MONO models are initialized from CODEGEN-MULTI and then trained on BIGPYTHON.
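The filtering and deduplication steps can be sketched as follows. The thresholds come from the Table 5 caption, while the helper names and the exact filtering direction (which files are kept versus removed, which the caption leaves ambiguous) are one plausible reading, not the released pipeline:

```python
import hashlib

def keep_file(text):
    # Step (1), one plausible reading: keep files with average line length
    # < 100 and maximum line length <= 1,000, and drop files that consist
    # of > 90% decimal or hexadecimal digit characters.
    lines = text.splitlines()
    if not lines:
        return False
    avg_len = sum(len(line) for line in lines) / len(lines)
    max_len = max(len(line) for line in lines)
    digit_frac = sum(c in "0123456789abcdefABCDEF" for c in text) / len(text)
    return avg_len < 100 and max_len <= 1000 and digit_frac <= 0.9

def dedup(files):
    # Step (2): exact deduplication via the SHA-256 hash of file contents,
    # which removes forks and verbatim copies of repositories.
    seen, kept = set(), []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept
```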
A.2 MODELS

Our models are autoregressive transformers trained with the regular next-token prediction language modeling objective. The family of CODEGEN models is trained in various sizes with 350M, 2.7B, 6.1B, and 16.1B parameters. The first three configurations allow for direct comparison with open-sourced large language models trained on text corpora, GPT-NEO (350M, 2.7B) (Black et al., 2021) and GPT-J (6B) (Wang & Komatsuzaki, 2021). See Table 6 in Appendix A for model specifications. The architecture follows a standard transformer decoder with left-to-right causal masking. For the positional encoding, we adopt rotary position embeddings (Su et al., 2021). For the forward pass, we execute the self-attention and feed-forward circuits in parallel to reduce communication overhead, following Wang & Komatsuzaki (2021): that is, x_{t+1} = x_t + mlp(ln(x_t + attn(ln(x_t)))) is altered to x_{t+1} = x_t + attn(ln(x_t)) + mlp(ln(x_t)), for which the computation of self-attention, attn(), and feed-forward, mlp(), with layer-norm, ln(), is simultaneous. The architecture and hyper-parameter choices were optimized specifically for the hardware layout of TPU-v4.
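The parallel formulation can be illustrated with a small numpy sketch. Here attn and mlp are stand-in single-matrix sublayers (real implementations use multi-head attention and a two-layer MLP), so only the wiring of the residual branches is meaningful:

```python
import numpy as np

def ln(x, eps=1e-5):
    # Layer normalization over the feature dimension.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

# Stand-ins for the self-attention and feed-forward sublayers.
def attn(x, W):
    return x @ W

def mlp(x, W):
    return np.tanh(x @ W)

def sequential_block(x, Wa, Wm):
    # x_{t+1} = x_t + mlp(ln(x_t + attn(ln(x_t)))): mlp must wait for attn.
    return x + mlp(ln(x + attn(ln(x), Wa)), Wm)

def parallel_block(x, Wa, Wm):
    # x_{t+1} = x_t + attn(ln(x_t)) + mlp(ln(x_t)): both sublayers read the
    # same ln(x_t), so they can be computed simultaneously.
    h = ln(x)
    return x + attn(h, Wa) + mlp(h, Wm)
```

The parallel block needs only one layer-norm and no dependency between the two sublayers, which is what reduces the communication overhead on TPU.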

A.3 TRAINING

The scaling of large language models requires data and model parallelism. Google's TPU-v4 hardware, with its high-speed toroidal mesh interconnect, naturally allows for efficient parallelism. To efficiently utilize the hardware, the training of the models is implemented in JAX (Bradbury et al., 2018). For parallel evaluation in JAX, the pjit() operator is adopted. The operator enables a paradigm named single-program, multiple-data (SPMD) code, which refers to a parallelism technique where the same computation is run on different input data in parallel on different devices. Specifically, pjit() is the API exposed for the XLA SPMD partitioner in JAX, which allows a given function to be evaluated in parallel with equivalent semantics over a logical mesh of compute. Our library JAXFORMER recruits a designated coordinator node to orchestrate the cluster of TPU-VMs with a custom TCP/IP protocol. For data parallelism, the coordinator partitions a batch and distributes the partitions to the individual TPU-VMs. For model parallelism, two schemes for the sharding of model parameters are supported: (1) Intra-TPU-VM, where parameters are sharded across MXU cores inside a physical TPU-v4 board and replicated across boards, following Shoeybi et al. (2019) and Wang & Komatsuzaki (2021); (2) Inter-TPU-VM, where parameters are sharded across TPU-v4 boards and activations are replicated, following Rajbhandari et al. (2020). Both sharding schemes are implemented based on pjit() with a logical mesh specification (r, p, c) with r replicas of the parameters, p partitions of the parameters, and c logical cores per board over n_b TPU boards with n_c logical cores each, such that r × p = n_b and r × p × c = n_b × n_c.

The intra-TPU-VM scheme is adopted for models of size less than or equal to 6B parameters, for which the total amount of model and optimizer parameters fits into the combined HBM memory of a single TPU-v4 board. For instance, a TPU-v4-512 slice with n_b = 64 and n_c = 4 would be configured as (r, p, c) = (64, 1, 4). That is, the parameters are replicated across r = 64 boards with p = 1 total inter-board partitions and intra-board parallelism across c = 4 logical chips. In this configuration, the mean gradient is accumulated across boards via with_sharding_constraint(), effectively emulating the behavior of the xmap() operator. The inter-TPU-VM scheme is adopted for models exceeding 6B parameters, for which the model and optimizer parameters have to be sharded across TPU-v4 boards. For instance, a TPU-v4-512 slice with n_b = 64 and n_c = 4 would be configured as (r, p, c) = (1, 64, 4).
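As a quick sanity check on the mesh arithmetic, the constraints r × p = n_b and r × p × c = n_b × n_c can be verified for the example configurations; the helper below is ours for illustration, not part of JAXFORMER:

```python
def valid_mesh(r, p, c, n_boards, n_cores):
    # (r, p, c): replicas of the parameters, partitions of the parameters,
    # and logical cores per board, over n_boards boards of n_cores cores each.
    return r * p == n_boards and r * p * c == n_boards * n_cores
```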
For larger slices such as TPU-v4-1024 with n_b = 128, one may introduce redundancy in the parameter sharding, e.g., (r, p, c) = (2, 64, 4). In this configuration, the activations are replicated across boards via with_sharding_constraint(). Moreover, (r, p, c) allows for backward compatibility across the logical hardware layout transition from TPU-v3 with c = 8 to TPU-v4 with c = 4 by adjusting p, without the need for re-sharding.

For the optimization, Table 6 summarizes the hyper-parameters. We adopt the Adam (Kingma & Ba, 2015) optimizer with (β_1, β_2, ε) = (0.9, 0.999, 1e-8) and global gradient norm clipping (Pascanu et al., 2013) of 1.0. The learning rate schedule follows GPT-3 (Brown et al., 2020), with warm-up steps and cosine annealing. In summary, we mainly adopted the GPT-3 reference configurations, with minor variations accounting for TPU optimizations. We did not have the compute capacity to optimize these hyper-parameters further.
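A GPT-3-style schedule with linear warm-up followed by cosine annealing can be sketched as below; the peak rate, warm-up length, and total steps passed in are placeholders, not our training values (see Table 6 for those):

```python
import math

def learning_rate(step, peak_lr, warmup_steps, total_steps, final_lr=0.0):
    # Linear warm-up from 0 to peak_lr, then cosine annealing to final_lr.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```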

B PASS@k ESTIMATOR

We use the unbiased estimator proposed in Chen et al. (2021) to compute pass@k. For each task, n ≥ k samples are drawn; in particular, we use n = 200 and k ≤ 100. Let c be the number of correct samples among the n samples, i.e., those that pass all the unit tests. The unbiased estimator is then defined as

pass@k = E_Problems [ 1 - C(n-c, k) / C(n, k) ],

where C(·, ·) denotes the binomial coefficient. Directly computing this estimator is numerically unstable; we use the numerically stable numpy implementation introduced by Chen et al. (2021).
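That implementation rewrites the ratio of binomial coefficients as a running product, which avoids overflow; it is the standard estimator from Chen et al. (2021), reproduced here for reference:

```python
import numpy as np

def pass_at_k(n, c, k):
    # Unbiased pass@k: 1 - C(n-c, k) / C(n, k). The ratio equals
    # prod_{i = n-c+1}^{n} (1 - k/i), which stays in [0, 1] throughout.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

For example, with n = 200 samples and c = 0 correct ones, pass@k is 0 for any k; with c = n it is 1.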

C TYPE-RELAXED EQUIVALENCE CHECK FOR MTPB EVALUATION

We perform the following type-relaxations before assessing the equivalence between model outputs and the expected outputs:
• Numpy arrays are converted into correspondingly typed lists of standard types (e.g., np.int32 is cast to int).
• Pandas series are converted to and compared in numpy array format.
• Otherwise, model outputs are cast into the type of the gold-standard outputs.
• Floating-point numbers are compared with ε = 1e-6 as the tolerance threshold.
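A sketch of these rules as a single comparison helper; the function name and the ordering of the checks are ours, and the pandas rule is reduced to the numpy conversion for brevity:

```python
import numpy as np

def relaxed_equal(output, expected, tol=1e-6):
    # Rule 1: numpy arrays become lists of standard Python types
    # (tolist() converts np.int32 -> int, np.float64 -> float, etc.).
    if isinstance(output, np.ndarray):
        output = output.tolist()
    if isinstance(expected, np.ndarray):
        expected = expected.tolist()
    # Rule 4: floats are compared within the tolerance threshold.
    if isinstance(output, float) or isinstance(expected, float):
        return abs(float(output) - float(expected)) <= tol
    # Rule 3: otherwise, cast the model output to the gold-standard type.
    try:
        output = type(expected)(output)
    except (TypeError, ValueError):
        return False
    return output == expected
```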

H GENERATED SAMPLES

CODEGEN-MONO 2.7B (final turns):

# Assign an empty string to a variable "result", and concatenate characters popped from the last element of "stack" to "result" until "stack" is empty.
result = ""
while len(stack) != 0:
    result += stack.pop()
# Cast "result" as integer and print it out.
print(int(result))

CODEGEN-MONO 16.1B:

# Import libraries.
import numpy as np
# Initialize a variable "stack" with an empty list, and "num" with 100 as a string.
stack = []
num = 100
# For each character in "num", append the character to "stack".
for i in num:
    stack.append(i)
# Assign an empty string to a variable "result", and concatenate characters popped from the last element of "stack" to "result" until "stack" is empty.
result = ""
while len(stack) > 0:
    result += stack.pop()
# Cast "result" as integer and print it out.
print(int(result))

Figure 3: Generated samples for the "Reverse digits" problem. While the problem is simplistic, we often observe that CODEGEN-MONO 16.1B assumes the type when assigning a number to a variable. Here, despite being explicitly instructed to initialize the number as a string, the larger model fails to do so (num = 100). This causes an error in the following turn, when the code attempts to iterate over num.



Benchmark: https://github.com/salesforce/CodeGen/tree/main/benchmark
Checkpoints: https://github.com/salesforce/CodeGen
Training: https://github.com/salesforce/jaxformer
Problem writing was performed in a closed-book format, i.e., the authors were not allowed to consult online resources while writing the problems. See Appendix D for a complete listing.
To guide sampling in Python, we prefix the prompt with: # Import libraries.\n import numpy as np.
https://docs.python.org/3/reference/lexical_analysis.html#f-strings (accessed on November 10th, 2022).



# Import re and define a regular expression that matches an email address.
import re
email_regex = re.compile("([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})")
# Search for an email address in "… abc.xyz@example.com …" and …
address = email_regex.search("… abc.xyz@example.com …")
# Remove the substring starting from the @ symbol from "address".
address = address.group(0)
address = address[:address.find("@")]
# Replace non-alphabetical symbols with a whitespace in "address".
address = re.sub("[^a-zA-Z]+", " ", address)
# Print out "address".
print(address)

An alternative model response defines the regular expression as:
email_regex = re.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

Figure 1: An illustrative example for the Multi-Turn Programming Benchmark, performing the task of extracting the user name from an email address. (1) Each problem consists of prompts p_i and unit tests, where some prompts include templates (i.e., {input}) that are filled with test-case inputs before being fed to the model. In the displayed example, the input is a string containing abc.xyz@example.com, which replaces {input} in p_2, and the expected output is abc xyz. (2) Our model conditions on the concatenation of interleaved past prompts and generated responses. (3) Generated responses from each turn are concatenated and executed, and the output is compared to the answer.
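The three steps in the caption can be sketched as a driver loop. Here `model` is a stand-in for sampling from CODEGEN with the interleaved history as a prefix, and the plain stdout comparison is a simplification of the type-relaxed check in Appendix C:

```python
import io
from contextlib import redirect_stdout

def run_problem(model, prompts, test_input, expected_output):
    # (1) fill {input} templates, (2) condition on the interleaved history,
    # (3) execute the concatenated responses and compare the printed output.
    history, responses = "", []
    for prompt in prompts:
        prompt = prompt.replace("{input}", test_input)
        history += prompt + "\n"
        response = model(history)  # stand-in for sampling from the LM
        history += response + "\n"
        responses.append(response)
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        exec("\n".join(responses), {})
    return buffer.getvalue().strip() == expected_output
```

A toy `model` that answers each prompt with a fixed sub-program is enough to exercise the loop; in the benchmark, the model is one of the CODEGEN checkpoints.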


Average prompt perplexity ↓ (± standard error) of CODEGEN-MONO models (350M, 2.7B, 6.1B, 16.1B) on pass and non-pass problems:
Pass: … ± 0.23, 3.66 ± 0.14, 3.35 ± 0.13, 3.12 ± 0.11
Non-Pass: 5.18 ± 0.19, 4.37 ± 0.18, 3.88 ± 0.13, 3.40 ± 0.11

Evaluation results on the Multi-Turn Programming Benchmark. The multi-turn program synthesis performance varies as a function of model size (columns) and code data size (rows).

Comparison between multi-and concatenated single-turn specifications on perplexity (PPL) and program synthesis performance (as measured by pass rate) under CODEGEN-MONO models.

Hyper-parameters for model specification and optimization for the family of CODEGEN models.

H.1 CASES WHERE CODEGEN-MONO 16.1B UNDER-PERFORMS

Availability: https://github.com/salesforce/CodeGen

E PERPLEXITY COMPUTATION FOR SINGLE-AND MULTI-TURN PROMPTS

Suppose {p_i}_{i=1}^{n} is the set of prompts for a given problem, and {s_i}_{i=1}^{n} are the n sub-programs synthesized by a model P_θ. Let c_{i-1} = [p_1; s_1; ...; p_{i-1}; s_{i-1}], where [· ; ·] indicates concatenation. The conditional probability of p_i is Prob_i = P_θ(p_i | c_{i-1}), and the perplexity for the multi-turn prompts is computed as

PPL_multi = exp( -(1/m) Σ_{i=1}^{n} log Prob_i ),

where m is the total number of tokens of all prompts {p_i}_{i=1}^{n}. Suppose c = [p_1; s_1; ...; p_n; s_n]; then its probability is Prob = P_θ(c), and the perplexity for the single-turn prompts is computed as

PPL_single = exp( -(1/m) log Prob ).

We also evaluated our models on Mostly Basic Python Problems (MBPP) (Austin et al., 2021). We report results on the sanitized MBPP for all of our models, with n = 100 and temperature 0.8; the last four rows are from the aforementioned paper. In general, we observe a consistent trend of performance improving over the different variants (NL, Multi, Mono), with our largest CODEGEN-MONO 16.1B approaching the results of code-cushman-001. While we do not know whether any of the OpenAI models is the "Codex 12B" reported in Chen et al. (2021), we believe our model achieves reasonable results on MBPP as well. We also note that our CODEGEN-MONO 6.1B significantly outperformed INCODER 6B.
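Both perplexities defined in this appendix reduce to the same computation over prompt-token log-probabilities; a minimal illustrative helper:

```python
import math

def perplexity(token_logprobs):
    # PPL = exp(-(1/m) * sum of log-probabilities over the m prompt tokens).
    m = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / m)
```

For the multi-turn case, token_logprobs collects log P_θ over the tokens of each p_i conditioned on its history c_{i-1}; for the single-turn case, over the tokens of the concatenated prompt.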

I ADDITIONAL ANALYSES ON MTPB

We conducted additional analyses to illustrate the relationship between generated program length and pass rate; the results are shown in Figure 7, Figure 8, and Figure 9. The relationship between generated program length and prompt length is shown in Figure 10.

Published as a conference paper at ICLR 2023

