PLANNING WITH LARGE LANGUAGE MODELS FOR CODE GENERATION

Abstract

Existing large language model-based code generation pipelines typically use beam search or sampling algorithms during the decoding process. Although the programs they generate achieve high token-matching-based scores, they often fail to compile or produce incorrect outputs. The main reason is that conventional Transformer decoding algorithms may not be the best choice for code generation. In this work, we propose a novel Transformer decoding algorithm, Planning-Guided Transformer Decoding (PG-TD), that uses a planning algorithm to perform lookahead search and guide the Transformer to generate better programs. Specifically, instead of simply optimizing the likelihood of the generated sequences, the Transformer makes use of a planner to generate candidate programs and test them on public test cases. The Transformer can therefore make more informed decisions and generate tokens that will eventually lead to higher-quality programs. We also design a mechanism that shares information between the Transformer and the planner to make our algorithm computationally efficient. We empirically evaluate our framework with several large language models as backbones on public coding challenge benchmarks, showing that 1) it can generate programs that consistently achieve higher performance compared with competing baseline methods; and 2) it enables controllable code generation, such as concise code and highly-commented code, by optimizing a modified objective.

Project page: https://codeaimcts.github.io. Correspondence to: shun.zhang@ibm.com.

1. INTRODUCTION

Large language models like the Transformer (Vaswani et al., 2017a) have shown successes in natural language processing, computer vision, and various other domains. Thanks to the Transformer's power in sequence modeling, it has been adopted for code generation (Wang et al., 2021; Ahmad et al., 2021) by treating programs as text sequences. Transformers have achieved significant improvements on the benchmark tasks of code translation (Roziere et al., 2022), code completion (Chen et al., 2021a), and solving coding challenge problems (Hendrycks et al., 2021). Recently, AlphaCode (Li et al., 2022) even achieved competitive-level performance in programming competitions with the help of large Transformer models pre-trained on a large programming corpus.

Transformer-based pipelines like AlphaCode follow the tradition of natural language processing and use sampling methods (Fan et al., 2018; Dabre & Fujita, 2020) during the generation process. Specifically, they sample a large number of complete programs using a pre-trained code generation Transformer, evaluate these programs using the public test cases provided in the dataset, and output the program that passes the largest number of test cases (a minimal sketch of this scheme is given below). Compared with beam search-based methods, these sampling-followed-by-filtering algorithms (which we refer to as sampling + filtering) can take advantage of test cases and indeed improve the quality of the generated programs. However, they do not consider the test cases at all during the Transformer generation process; the test cases are only used to evaluate the candidate programs after all of them have been generated. This can make these algorithms sample-inefficient. Unlike natural language, a program may fail completely because of even a single incorrectly generated token, so these algorithms need to exhaustively sample a large number of programs to find a correct solution.
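The following is a minimal sketch of such a sampling + filtering pipeline, assuming a Hugging Face-style causal language model. The helper pass_rate (which would execute a candidate program on the public test cases in a sandbox), the checkpoint name, and the generation hyperparameters are illustrative assumptions rather than the exact settings used by AlphaCode or similar systems.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def pass_rate(program: str, test_cases) -> float:
    # Stub: a real implementation executes `program` on each public test case
    # in a sandbox and returns the fraction of cases passed.
    raise NotImplementedError

def sample_and_filter(model, tokenizer, prompt: str, test_cases,
                      num_samples: int = 128, max_new_tokens: int = 512):
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]
    best_program, best_score = None, -1.0
    for _ in range(num_samples):
        # Sample one complete program; the test cases play no role here.
        with torch.no_grad():
            output = model.generate(**inputs, do_sample=True, temperature=0.8,
                                    max_new_tokens=max_new_tokens,
                                    pad_token_id=tokenizer.eos_token_id)
        program = tokenizer.decode(output[0][prompt_len:],
                                   skip_special_tokens=True)
        # The test cases are only consulted after generation, for filtering.
        score = pass_rate(program, test_cases)
        if score > best_score:
            best_program, best_score = program, score
    return best_program, best_score

# Example usage (the checkpoint name is a placeholder; a model pre-trained on
# a programming corpus would be used in practice):
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# best, score = sample_and_filter(model, tokenizer, problem_prompt, public_tests)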
The main reason behind the sample efficiency issue of these algorithms is that the Transformer beam search algorithm and the sampling algorithm (Vaswani et al., 2017b) may not be the best choices for code generation. An ideal code generation algorithm should stop early in the generation process when it knows that the program it is currently generating would certainly fail, and bias the generation process towards successful programs that pass more test cases. To achieve this goal, we apply a planning algorithm inside the Transformer generation process. Since a planning algorithm can use the pass rates of the generated programs as its objective, we use it to assess the quality of the generated code and help the Transformer make more informed decisions.

In this paper, we investigate the following research question: Can we integrate a planning algorithm with a pre-trained code generation Transformer, obtaining an algorithm that generates better programs than the conventional Transformer generation algorithms and the well-accepted sampling + filtering scheme in the literature?

To answer this question, we propose a novel algorithm, Planning-Guided Transformer Decoding (PG-TD). During the code generation process, a planner performs lookahead search and finds tokens that will lead to high-quality code. The planner alone may not efficiently find high-quality code due to the large search space of programs, and that is where a pre-trained code generation Transformer comes into play: the Transformer beam search algorithm and the next-token probabilities are used inside the planner to provide useful heuristics. We find that a straightforward integration between the planner and the Transformer can be computationally inefficient, so we design mechanisms that allow the Transformer and the planner to share information and make the overall algorithm more efficient (a simplified sketch of the resulting decoding loop is given at the end of this section). We emphasize that our algorithm is model-agnostic: any standard code generation Transformer model can be used as the backbone Transformer. Importantly, our algorithm does not require acquiring more sample solutions or finetuning the Transformer model to improve its performance.

We empirically find that our proposed algorithm generates higher-quality programs under multiple accepted metrics compared with competing baseline methods. We also empirically show that our algorithm has the following advantages: 1) by changing the reward function of the planner, our algorithm becomes versatile and can optimize different objective functions without the need to finetune the Transformer model; 2) our algorithm can generate solutions that can be used to finetune a code generation Transformer model and improve the Transformer's performance.

More precisely, we make the following contributions in this paper.

• First, we propose a novel algorithm, Planning-Guided Transformer Decoding (PG-TD), that uses a planning algorithm for lookahead search and guides the Transformer to generate better code. Our algorithm is model-agnostic, works with any standard Transformer model, and does not require knowledge of the grammar of the generated programs.

• Second, a direct integration of the planning algorithm with the Transformer decoding process can cause redundant uses of the Transformer beam search algorithm. We design mechanisms that significantly improve the computational efficiency of the algorithm.

• Third, we evaluate our algorithm on competitive programming benchmarks and empirically show that it consistently generates better programs in terms of the pass rate and other metrics compared with the baseline methods. We also show that our algorithm is versatile and can optimize objectives other than the pass rate for controllable code generation, such as generating concise code and code with more comments.
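To make the decoding loop described above concrete, the following is a minimal, illustrative sketch of planning-guided decoding. The helpers top_k_next_tokens, beam_search_complete, and pass_rate stand in for the backbone Transformer's next-token distribution, its beam search completion of a partial program, and execution on the public test cases; their names and signatures, the selection rule, and the constants used here are simplifying assumptions for illustration rather than the paper's exact interfaces.

import math
from collections import defaultdict

def planning_guided_decode(prompt_tokens, public_tests,
                           top_k_next_tokens, beam_search_complete, pass_rate,
                           rollouts=100, k=3, c_puct=4.0):
    Q = defaultdict(float)    # best reward observed below each (prefix, token) edge
    N = defaultdict(int)      # visit counts
    children = {}             # prefix -> candidate next tokens proposed by the LM
    completion_cache = {}     # prefix -> (program, reward); avoids re-running the
                              # Transformer's beam search on already-completed prefixes
    best_program, best_reward = None, -1.0

    for _ in range(rollouts):
        prefix = tuple(prompt_tokens)
        path = []
        # Selection: walk down the tree following an upper-confidence rule that
        # trades off observed reward and the LM's next-token probabilities.
        # (Termination handling for end-of-sequence tokens is omitted for brevity.)
        while prefix in children:
            total = sum(N[(prefix, t)] for t, _ in children[prefix]) + 1
            token, _ = max(
                children[prefix],
                key=lambda tp: Q[(prefix, tp[0])]
                + c_puct * tp[1] * math.sqrt(total) / (1 + N[(prefix, tp[0])]))
            path.append((prefix, token))
            prefix = prefix + (token,)

        # Expansion: the Transformer proposes the top-k next tokens for a new prefix.
        children[prefix] = top_k_next_tokens(prefix, k)   # list of (token, prob) pairs

        # Evaluation: complete the partial program with beam search and score it by
        # its pass rate on the public tests, caching the result for reuse.
        if prefix not in completion_cache:
            program = beam_search_complete(prefix)
            completion_cache[prefix] = (program, pass_rate(program, public_tests))
        program, reward = completion_cache[prefix]
        if reward > best_reward:
            best_program, best_reward = program, reward

        # Backpropagation: propagate the reward up the visited path.
        for node, token in path:
            N[(node, token)] += 1
            Q[(node, token)] = max(Q[(node, token)], reward)

    return best_program, best_reward

In this sketch, the completion cache plays the role of the information-sharing mechanism discussed above: prefixes whose beam search completion has already been computed are not completed again, which is what keeps the integration of the planner and the Transformer computationally tractable.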

2. RELATED WORK

Transformers for program synthesis. Our work builds on Transformers for program synthesis (Roziere et al., 2020; Austin et al., 2021). Inspired by their capabilities on a range of natural language tasks, modern Transformer-based language models (Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2020) have been adopted for program synthesis by treating programming languages in the same way as natural languages. A family of BERT-based Transformers has been developed for code understanding (Kanade et al., 2020; Feng et al., 2020; Devlin et al., 2019; Guo et al., 2020). Later, Codex (Chen et al., 2021a) and CodeT5 (Wang et al., 2021) adopted GPT-2 (Radford et al., 2019) and T5 (Raffel et al., 2020), respectively, as backbones for both code understanding and generation. Different learning methods, including learning from examples (Ellis et al., 2021) and neural-symbolic methods (Nye et al., 2020), have also been explored. Recently, AlphaCode (Li et al., 2022) combined large

