IPTR: LEARNING A REPRESENTATION FOR INTERACTIVE PROGRAM TRANSLATION RETRIEVAL

Abstract

Program translation supports many real-world scenarios, such as porting codebases written in an obsolete or deprecated language to a modern one, or reimplementing existing projects in one's preferred programming language. Existing data-driven approaches either require large amounts of training data or neglect significant characteristics of programs. In this paper, we present IPTR, a system for interactive code translation retrieval from Big Code. IPTR uses a novel code representation technique that encodes structural characteristics of a program, together with a predictive transformation technique that maps the representation into the target programming language. The transformed representation is used to retrieve code from Big Code. Because the representation is succinct, the user can easily update and correct the returned results to improve the retrieval process. Our experiments show that IPTR outperforms supervised baselines in terms of program accuracy.

1. INTRODUCTION

Numerous programs are being developed and released online. To port codebases written in obsolete or deprecated languages to a modern one (Lachaux et al., 2020), or to further study, reproduce, and apply them on various platforms, these programs require corresponding versions in different languages. When developers do not make the translation effort themselves, third-party users have to translate the software manually into the language they need, which is time-consuming and error-prone because it requires expertise in both languages. Moreover, hard-wired cross-language compilers still require heavy human intervention for adaptation and are limited to specific pairs of programming languages. In this paper, we discuss the potential of data-driven methods that exploit existing Big Code resources to support code translation. The abundance of open-source programs on the internet enables new applications, such as workflow generation (Derakhshan et al., 2020), data preparation (Yan & He, 2020), and transformation retrieval (Yan & He, 2018). Code translation is another application that is gaining attention (Lachaux et al., 2020).

Data-driven program translation. Inspired by natural language translation, one line of approaches trains a translation model from large amounts of code data, either in a supervised (Nguyen et al., 2013; 2015; Chen et al., 2018) or weakly-supervised fashion (Lachaux et al., 2020). Supervised approaches require a parallel dataset, in which programs in different languages are considered "semantically aligned", to train the translation model. Obtaining parallel datasets for programming languages is hard because the translations usually have to be handwritten. Beyond the massive human effort involved, it is also difficult to extract general textual features that apply to every programming language.
A recent weakly-supervised method (Lachaux et al., 2020) pretrains the translation model on the task of denoising randomly corrupted programs and optimizes it through back-translation. However, this method still relies on high-quality training data. Moreover, all of these approaches directly reuse NLP techniques and thus neglect the special features of programming languages.

Another potential approach is to use a retrieval system to obtain translation candidates directly from Big Code, i.e., well-maintained program repositories. However, existing code retrieval systems, such as Sourcerer (Linstead et al., 2009), lack proper support for code-to-code search and cross-language code retrieval. Existing interactive retrieval methods ask users to give feedback on several preset metrics and questions (Wang et al., 2014; Dietrich et al., 2013; Martie et al., 2015; Sivaraman et al., 2019), and none of them is tailored to cross-language retrieval.

In this paper, we propose IPTR, an interactive program translation retrieval system based on a novel and generalizable code representation that retains important code properties. The representation encodes not only textual features but also structural features that generalize across imperative programming languages. We further propose a query transformation model based on autoencoders that transforms the input program representation into a representation with the properties of the target language. Due to the succinct form of our code representation, IPTR can adapt the original query based on user annotations. This methodology can compete with existing statistical translation models that require large amounts of training data. In short, we make the following main contributions:

• We propose IPTR, an interactive cross-language code retrieval system with a program feature representation that additionally encodes code structure properties.
• We propose a novel query transformation model that learns a refined code representation in the target language before using it for retrieval. This model can be trained in an unsupervised way but also improved through active learning.
• Based on our succinct code representation, we propose a user feedback mechanism that enables IPTR to successively improve its results.

2. SYSTEM OVERVIEW

We propose IPTR, an interactive cross-language code retrieval system that supports program translation across multiple programming languages.

Problem Definition. Given a source program P_s written in language L_s, a selected target language L_t, and a large program repository D_p = {P_1, P_2, ..., P_n}, the goal is to find the best possible translation P_t of P_s in L_t from D_p. The challenge is to design an effective program feature representation that generalizes to many languages and can be updated through user feedback.

Figure 1: IPTR overview

Our solution: IPTR. The workflow of IPTR is shown in Figure 1. IPTR first constructs a succinct but informative feature representation for input programs (Section 3.1). Since the goal is to identify a similar program in the target language, IPTR then applies a query transformation model (QTM) to transform this representation into an estimated feature representation of the translation (Section 3). The transformation model is trained in an unsupervised manner but can also be updated dynamically through active learning (Section 3.2.2). Finally, the new representation is used as a query to retrieve a program with similar features from the database.

In addition, as an interactive system, IPTR allows the user to give feedback on the retrieved translation (Section 4). The user can either accept the result or make corrections. Thanks to our structured and informative feature representation, IPTR can quickly adapt the query based on raw user corrections and may then identify a more appropriate translation candidate in a second retrieval attempt.
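As a concrete (if simplified) picture of this workflow, the sketch below wires the three stages together: encode the source program, transform the query into the target language's feature space, and retrieve the nearest candidate. The interfaces `encode`, `transform`, and the `(program, vector)` database are hypothetical stand-ins, and retrieval here is plain cosine similarity rather than IPTR's actual mechanism.

```python
import numpy as np

def retrieve_translation(p_s, encode, transform, database):
    """Sketch of the retrieve-by-transformed-query workflow.

    encode:    maps a source program to a feature vector (assumed interface)
    transform: query transformation model into the target-language
               feature space (assumed interface)
    database:  list of (program, feature_vector) pairs in the target language
    """
    query = transform(encode(p_s))

    def cosine(a, b):
        # Cosine similarity with a small epsilon to avoid division by zero.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    # Return the stored program whose feature vector best matches the query.
    return max(database, key=lambda pv: cosine(query, pv[1]))[0]

# Toy demo: identity encoder/transformer and a two-program database.
encode = lambda p: np.asarray(p, dtype=float)
transform = lambda v: v
db = [("prog_A", np.array([1.0, 0.0])), ("prog_B", np.array([0.0, 1.0]))]
best = retrieve_translation([0.9, 0.1], encode, transform, db)
```

In the full system, a user correction would modify `query` before a second call, which is what makes the succinct vector form convenient for interaction.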

3. PROGRAM REPRESENTATION

To retrieve a promising program translation from a large code database, IPTR needs an effective and efficient query. Retrieval based directly on raw code is impractical. In contrast to existing methods that generate queries from either keywords (Linstead et al., 2009) or preset metrics and questions (Martie et al., 2015), IPTR generates a feature representation that effectively combines structural properties of the program with textual features. It further uses a query transformation model (QTM) to generate features in the target language.
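The QTM is autoencoder-based. As background on how an autoencoder can be trained on feature vectors without labels, here is a minimal linear autoencoder in NumPy; this is an illustrative simplification under assumed settings (linear encoder/decoder, plain gradient descent), not the paper's actual model.

```python
import numpy as np

def train_linear_autoencoder(X, k, lr=0.01, steps=2000, seed=0):
    """Train a minimal linear autoencoder on feature vectors X (n x d).

    The encoder W_e compresses vectors to k dimensions; the decoder W_d
    reconstructs them. Both are trained by gradient descent on the mean
    squared reconstruction error.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_e = rng.normal(scale=0.1, size=(d, k))  # encoder weights
    W_d = rng.normal(scale=0.1, size=(k, d))  # decoder weights
    for _ in range(steps):
        H = X @ W_e            # latent codes
        E = H @ W_d - X        # reconstruction residual
        G = 2.0 * E / n        # gradient of MSE w.r.t. the reconstruction
        grad_W_d = H.T @ G
        grad_W_e = X.T @ (G @ W_d.T)
        W_d -= lr * grad_W_d
        W_e -= lr * grad_W_e
    return W_e, W_d

# Demo on synthetic feature vectors that lie in a rank-2 subspace,
# so a 2-dimensional code suffices for good reconstruction.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 6))
W_e, W_d = train_linear_autoencoder(X, k=2)
mse = float(np.mean((X @ W_e @ W_d - X) ** 2))
```

A QTM would additionally have to map between two feature spaces (source and target language) rather than reconstruct the input, but the unsupervised training loop has the same shape.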

3.1. BASIC ENCODING OF PROGRAM STRUCTURE AND TEXT

Due to the special and non-trivial structure of programming languages compared to natural languages, we take both structural and textual features into consideration. The structural features of a program can be represented by its syntax tree, where each tree node denotes a code construct.
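One simple way to turn a syntax tree into features is to count the node types it contains. The sketch below does this for Python code using the standard `ast` module; treating node-type counts as the feature vector is our illustrative assumption here, and a cross-language system like IPTR would need a language-agnostic parser rather than a Python-only one.

```python
import ast
from collections import Counter

def structural_features(source: str) -> Counter:
    """Count syntax-tree node types as a bag-of-constructs feature vector.

    Each key is an AST node class name (e.g. 'FunctionDef', 'BinOp') and
    each value is how often that construct appears in the program.
    """
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

code = """
def add(a, b):
    total = a + b
    return total
"""
features = structural_features(code)
```

Such counts capture coarse structure (one function definition, one binary operation, one return) while staying compact enough to be edited or reweighted in response to user feedback.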

