MULTI-LINGUAL EVALUATION OF CODE GENERATION MODELS

Abstract

We present two new benchmarks, MBXP and Multilingual HumanEval, designed to evaluate code generation models in over 10 programming languages. These datasets are generated using a conversion framework that transpiles prompts and test cases from the original MBPP and HumanEval datasets into the corresponding data in the target language. Using these benchmarks, we assess the performance of code generation models in a multi-lingual fashion, and discover the generalization ability of language models on out-of-domain languages, the advantages of multi-lingual models over mono-lingual ones, the ability of few-shot prompting to teach models new languages, and zero-shot translation abilities. In addition, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks.

1. INTRODUCTION

Code completion by machine-learning models has great potential to improve developer productivity (Barke et al., 2022). This line of research has seen tremendous progress, with several models recently proposed such as Codex (Chen et al., 2021), CodeGen (Nijkamp et al., 2022), PaLM (Chowdhery et al., 2022), BLOOM (Mitchell et al., 2022), and InCoder (Fried et al., 2022). One key component of code generation research is how to evaluate such program synthesis abilities. In the literature, two primary evaluation approaches have emerged, namely, match-based and execution-based evaluation. For both approaches, each problem contains a prompt which a model uses as input to generate a candidate body of code. Match-based evaluation compares the candidate code against reference source code using n-gram metrics such as BLEU, whereas execution-based evaluation executes the candidate code against test cases and calculates the success rate. Execution-based evaluation has the benefit over n-gram evaluation that it permits solutions which are functionally correct but need not match the exact implementation of the reference solution. Since the release of datasets such as HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), the community has widely adopted the execution-based approach as a primary tool to evaluate program generation capabilities. However, creating execution-based evaluation datasets is time-consuming, since it requires careful construction of test cases to check the correctness of the code's functionality. This difficulty leads to limited availability of execution-based evaluation data. For instance, to date, many execution-based datasets contain problems only in Python. In this work, we propose a scalable framework for dataset conversion from Python to many different languages.
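To make the execution-based approach concrete, the sketch below runs each sampled candidate against a problem's test cases and computes the unbiased pass@k estimator of Chen et al. (2021), 1 - C(n-c, k)/C(n, k), where n samples contain c correct ones. This is an illustrative toy harness (the problem, test strings, and helper names are our own), not the paper's actual evaluation package; a real harness would also sandbox and time-limit execution:

```python
from math import comb

def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Execute candidate + tests in a fresh namespace; pass iff no exception.
    (A production harness would sandbox and time-limit this step.)"""
    env = {}
    try:
        exec(candidate_code, env)
        exec(test_code, env)
        return True
    except Exception:
        return False

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k), given n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy problem: two sampled candidates, one correct.
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
samples = [
    "def add(a, b):\n    return a + b",  # correct
    "def add(a, b):\n    return a - b",  # buggy
]
c = sum(passes_tests(s, tests) for s in samples)
print(pass_at_k(n=len(samples), c=c, k=1))  # -> 0.5
```

Note that the functional-correctness criterion accepts any implementation that passes the tests, regardless of how it differs syntactically from a reference solution.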
While translating code from one language to another is typically a non-trivial task, it is possible to convert existing execution-based datasets to another language by transforming only the prompts and test statements (see Figure 1, part A, and Figure 2). That is, for the purpose of evaluating function completion ability, we do not need the canonical solution, since it is not used during evaluation. The function signature prompts and test cases of basic programming problems involve sufficiently simple data structures that they can be analyzed to synthesize datasets in new languages. Without having to translate the generic function body to another language, the conversion process becomes possible via a rule-based transpiler. The results of this conversion are two benchmarks, MBXP ‡ and Multilingual HumanEval, derived from the original Python datasets MBPP (Austin et al., 2021) and HumanEval (Chen et al., 2021). We provide the evaluation data in many languages besides the original Python, namely, Java, JavaScript, TypeScript, Go, Ruby, Kotlin, PHP, C#, Scala, C++, Swift, and Perl, with plans for further language expansion in the future. Along with these datasets, we also release a code package to perform execution in all supported languages. In addition, our conversion framework is easily extensible and allows us to obtain multi-lingual versions of other existing datasets such as MathQA (Schubotz et al., 2019). In the main paper, we provide results and analyses mostly on MBXP; results on Multilingual HumanEval and MathQA can be found in Appendix D. Our benchmarks also support other code completion tasks such as code insertion or translation in many languages. This extension is made possible by performing large-scale bootstrapping to synthesize solutions (Section O.1.11).
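As a minimal sketch of the idea behind such a rule-based transpiler, the snippet below converts a Python assert statement of the form `assert f(args) == expected` into an equivalent JavaScript check, by parsing the statement into an AST and re-rendering the literal arguments in JavaScript syntax. This toy version handles only literal arguments and this one assertion pattern (the function and helper names are illustrative; the actual framework covers far more constructs):

```python
import ast
import json

def py_literal_to_js(node: ast.expr) -> str:
    """Render a Python literal AST node as JavaScript source.
    json.dumps yields JS-compatible syntax for strings, numbers,
    booleans, None/null, and (nested) lists."""
    return json.dumps(ast.literal_eval(node))

def convert_assert_to_js(assert_src: str) -> str:
    """Convert `assert f(args) == expected` into a JS equality check.
    Only this simple pattern is handled in this sketch."""
    stmt = ast.parse(assert_src).body[0]
    assert isinstance(stmt, ast.Assert)
    comparison = stmt.test  # the expression `f(args) == expected`
    call, expected = comparison.left, comparison.comparators[0]
    fn = call.func.id
    args = ", ".join(py_literal_to_js(a) for a in call.args)
    return (f"console.assert(JSON.stringify({fn}({args})) === "
            f"JSON.stringify({py_literal_to_js(expected)}));")

print(convert_assert_to_js("assert remove_dups([1, 1, 2]) == [1, 2]"))
# -> console.assert(JSON.stringify(remove_dups([1, 1, 2])) === JSON.stringify([1, 2]));
```

Because only the prompts and test statements need this treatment, the canonical solution body never has to be translated, which is what makes the rule-based approach tractable.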
The result of our dataset conversion framework and the solution synthesis process is, to date, the first multi-lingual execution-based evaluation benchmark equipped with canonical solutions, which can be adapted for many code-related evaluations. In this paper, we process MBXP for multiple use cases, namely, zero-shot translation (t-MBXP), prompt robustness (r-MBXP), code insertion (i-MBXP), and summarization (s-MBXP). Overall, the constructed datasets provide us with new opportunities to explore many facets of code generation abilities. In this work, we conduct a large-scale evaluation where we train models of various sizes spanning three orders of magnitude (from ∼100M to ∼10B parameters) in both multi-lingual and mono-lingual settings. We analyze results from hundreds of thousands of code generation samples to investigate the models' code generation abilities with respect to in-domain versus out-of-domain languages, the effectiveness of few-shot prompting, zero-shot translation abilities, robustness to prompt perturbation, code summarization, and code insertion.

2. FINDING HIGHLIGHTS

We provide the highlights of our findings below.

1. Given the same model size, a multi-lingual model often outperforms the best of mono-lingual models trained with equivalent training resources, especially when the models are sufficiently large. This observation indicates that it is beneficial to train a single model on all programming languages and that, provided the model has enough capacity, its performance will exceed that of the best mono-lingual model.

2. Language models are able to generate code with correct syntax and pass unit tests in programming languages they are not intentionally trained on. We hypothesize that the data "spillover" effect, where code in one language is present in other languages through code



‡ MBXP stands for Most Basic X(Python/Java/Go/Ruby, etc.) Programming Problems



Figure 1: Benchmark Construction.

Code availability: https://github.com/amazon-research/mbxp-exec-eval

