CODE TRANSLATION WITH COMPILER REPRESENTATIONS

Abstract

In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs for which one can obtain a natural-looking translation. However, these approaches treat code as sequences of text tokens, and still do not differentiate well enough between similar pieces of code that have different semantics in different languages. The consequence is low-quality translation, reducing the practicality of NMT and stressing the need for an approach that significantly increases its accuracy. Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages. Our method improves upon the state of the art for unsupervised code translation, increasing the number of correct translations by 11% on average, and by up to 79% for the Java → Rust pair with greedy decoding. With beam search, it increases the number of correct translations by 5.5% on average. We extend previous test sets for code translation by adding hundreds of Go and Rust functions. Additionally, we train models with high performance on the problem of IR decompilation, generating programming source code from IR, and study the use of IRs as an intermediary pivot for translation.

1. INTRODUCTION

Automatic code translation makes it possible to port old codebases to new frameworks, or high-level (but slow) languages to low-level (and fast) ones. Current industry solutions, known as transpilers or transcompilers 1 , rely on handcrafted rules that are applied systematically. They produce unidiomatic translations that prove hard to read for human programmers. This is a serious limitation: the translated code should be easy to read and understand, as it will eventually be maintained by human developers.

In recent years, Neural Machine Translation (NMT) was proposed as an alternative to rule-based code translation (Roziere et al., 2020; Weisz et al., 2021; 2022). These models, trained on existing human-readable code, produce idiomatic, easy-to-understand translations. Unfortunately, neural transpilers are unreliable, and often fail to translate the semantics of the input program accurately. This is a serious limitation, as some of the human work saved by the transpiler has to be reinvested in debugging its output.

We propose to improve the reliability of NMT by leveraging information from compiler toolchains. When processing source code, compilers create Intermediate Representations (IRs): language-agnostic pseudocode that describes the semantics of the program. Augmenting training data with the corresponding IR can benefit a neural transpiler in two ways: it helps align embeddings for different languages, and it improves the semantic understanding of the code. As shown in Figure 1, this can greatly improve the semantic quality of neural translations.

Figure 1: Improvements over TransCoder. The first example shows a translation from C++ to Rust, where TransCoder generates code using unsigned instead of signed integers. In the second example, a translation from Java to Go, it generates a function with the wrong return type. In the third example, also a translation from Java to Go, the model outputs a function that looks similar to the correct solution, but it confuses > with >> and closes an expression with a parenthesis too early. In these cases and many others, TransCoder makes mistakes that are small in terms of edit distance, but have a large impact on the semantics of the code. Using the IR to ground the representations in the semantics often helps solve these issues.

In this work, we leverage LLVM (Lattner and Adve, 2004) to augment source code with the corresponding Intermediate Representation, and train models for code translation and decompilation. We compare our method to TransCoder, which uses only code and no IR. We also design an IR-only baseline, dubbed the pivot method, which generates a translation solely by decompiling an IR generated from the source language into a different target language. We experiment with four languages: C++, Java, Rust, and Go, and show that utilizing both the code and the IR allows for an average relative improvement of 5.5%. Moreover, our method only uses the IR at training time and does not require extra computation at inference time.

Our main contributions are:

• We implement a new IR-augmented translation method, which leverages LLVM IRs to improve code representations. It allows us to increase the number of correct translations generated by TransCoder for C++, Java, Go, and Rust by 5.5%. Compared to our IR-only pivot method, the improvement reaches 170%.
• Our method is especially useful in the low-data regime, with relative improvements reaching 29.7% when translating to Rust and 25.6% when translating from it.
• We extend the parallel evaluation dataset of 852 functions in C++, Java, and Python from Roziere et al. (2020) with 343 more functions in Go and 280 more in Rust, along with corresponding test cases.
• In addition, we achieve 78% accuracy when decompiling LLVM IRs to C++.

1 https://en.wikipedia.org/wiki/Source-to-source_compiler
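To make the IR-augmented setup above concrete, the sketch below pairs a function with a pre-computed LLVM IR string to form a single training sequence. This is a minimal illustration, not the paper's actual pipeline: the separator token, the input format, and the IR text are assumptions made for the example.

```python
# Illustrative sketch (assumed format, not the paper's exact pipeline):
# combine a source function with its pre-computed LLVM IR into one
# training sequence, so a model sees both the surface syntax and the
# language-agnostic semantics during training.

def make_augmented_sample(source: str, ir: str, sep: str = " [IR] ") -> str:
    # The "[IR]" separator token is a hypothetical choice for this sketch.
    return source.strip() + sep + ir.strip()

# A C++ function and a hand-written, simplified IR string for illustration.
cpp_fn = "int add(int a, int b) { return a + b; }"
cpp_ir = "define i32 @add(i32 %a, i32 %b) { %r = add nsw i32 %a, %b ret i32 %r }"

sample = make_augmented_sample(cpp_fn, cpp_ir)
print(sample)
```

In a real pipeline, the IR would be produced by a compiler front-end for each source language rather than written by hand, and the combined sequences would be fed to the translation model only at training time.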

2. INTERMEDIATE REPRESENTATIONS IN COMPILERS

Compilers translate programs written in a computer language into executable code for a specific machine. Most compilers consist of a front-end taking source code as input, and a back-end which produces machine binary code. The front-end lexes (tokenizes) and parses the program. Then, it produces an abstract syntax tree (AST) and translates it into some Intermediate Representation (IR). The back-end converts the IR into machine-specific executable code.

In modern compilers such as LLVM (Lattner and Adve, 2004), the IR is generic across different input languages (and thus different front-ends). It allows the application of transformations and target
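As an illustration, a front-end such as Clang lowers a small C function to LLVM IR. The listing below is simplified and hand-trimmed; the exact output of `clang -S -emit-llvm` depends on the compiler version and optimization flags.

```llvm
; Simplified LLVM IR for: int add(int a, int b) { return a + b; }
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add nsw i32 %a, %b   ; signed addition with no-wrap flag
  ret i32 %sum
}
```

The same IR constructs are emitted regardless of whether the source was C++, Rust, or another language with an LLVM front-end, which is what makes the IR a language-agnostic description of program semantics.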

