MULTI-LINGUAL EVALUATION OF CODE GENERATION MODELS

Abstract

We present two new benchmarks, MBXP and Multilingual HumanEval, designed to evaluate code generation models in over 10 programming languages. These datasets are generated using a conversion framework that transpiles prompts and test cases from the original MBPP and HumanEval datasets into the corresponding data in the target language. Using these benchmarks, we assess the performance of code generation models in a multi-lingual fashion and discover the generalization ability of language models to out-of-domain languages, the advantages of multi-lingual models over mono-lingual ones, the ability of few-shot prompting to teach a model new languages, and zero-shot translation abilities. In addition, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks.

1. INTRODUCTION

Code completion by machine-learning models has great potential to improve developer productivity (Barke et al., 2022). This line of research has seen tremendous progress, with several models recently proposed such as Codex (Chen et al., 2021), CodeGen (Nijkamp et al., 2022), PaLM (Chowdhery et al., 2022), BLOOM (Mitchell et al., 2022), and InCoder (Fried et al., 2022). One key component of code generation research is how to evaluate such program synthesis abilities. In the literature, two primary evaluation approaches have emerged: match-based and execution-based evaluation. In both approaches, each problem contains a prompt that a model uses as input to generate a candidate body of code. Match-based evaluation compares the candidate code against reference source code using n-gram metrics such as BLEU, whereas execution-based evaluation runs the candidate code against test cases and calculates the success rate. Execution-based evaluation has the benefit over n-gram evaluation that it accepts solutions that are functionally correct but not equivalent to the reference solution in terms of the exact implementation. Since the release of datasets such as HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), the community has widely adopted the execution-based approach as a primary tool to evaluate program generation capabilities. However, creating execution-based evaluation datasets is time-consuming, since it requires careful construction of test cases to check the correctness of the code's functionality. This difficulty leads to limited availability of execution-based evaluation data. For instance, to date, many execution-based datasets contain only problems in Python.
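The core of execution-based evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's actual harness: the function name and problem are hypothetical, and real harnesses additionally sandbox the execution and enforce time limits.

```python
def passes_tests(prompt: str, completion: str, test_code: str) -> bool:
    """Execution-based check: append the model's completion to the
    prompt, run the combined program against the test assertions, and
    report whether every assertion passes."""
    program = prompt + completion + "\n" + test_code
    env: dict = {}
    try:
        # A production harness would run this in a sandboxed subprocess
        # with a timeout; exec() here keeps the sketch self-contained.
        exec(program, env)
    except Exception:
        return False
    return True


# Hypothetical HumanEval-style problem for illustration.
prompt = "def add(a, b):\n"
completion = "    return a + b\n"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"
print(passes_tests(prompt, completion, tests))  # functionally correct -> True
```

Note that a completion such as `return b + a` would also pass, even though it does not textually match the reference solution; this is precisely the flexibility that match-based metrics like BLEU lack.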


The code and datasets are available at https://github.com/amazon-research/mbxp-exec-eval.

