IPTR: LEARNING A REPRESENTATION FOR INTERACTIVE PROGRAM TRANSLATION RETRIEVAL

Abstract

Program translation arises in many real-world scenarios, such as porting codebases written in an obsolete or deprecated language to a modern one, or reimplementing existing projects in one's preferred programming language. Existing data-driven approaches either require large amounts of training data or neglect significant characteristics of programs. In this paper, we present IPTR, an approach for interactive code translation retrieval from Big Code. IPTR uses a novel code representation technique that encodes the structural characteristics of a program, together with a predictive transformation technique that maps the representation into the target programming language. The transformed representation is then used to retrieve code from Big Code. Because the representation is succinct, the user can easily inspect and correct the returned results to improve the retrieval process. Our experiments show that IPTR outperforms supervised baselines in terms of program accuracy.

1. INTRODUCTION

Numerous programs are being developed and released online. To port codebases written in obsolete or deprecated languages to a modern one (Lachaux et al., 2020), or to study, reproduce, and apply them on various platforms, these programs require corresponding versions in different languages. When developers do not make the translation effort themselves, third-party users have to translate the software manually into the language they need, which is time-consuming and error-prone because it requires expertise in both languages. Moreover, hard-wired cross-language compilers still require heavy human intervention and are limited to a few specific pairs of programming languages. In this paper, we discuss the potential of data-driven methods that exploit existing Big Code resources to support code translation. The abundance of open-source programs on the internet enables new applications, such as workflow generation (Derakhshan et al., 2020), data preparation (Yan & He, 2020), and transformation retrieval (Yan & He, 2018). Code translation is another application that is gaining attention (Lachaux et al., 2020).

Data-driven program translation. Inspired by natural language translation, one line of approaches trains a translation model on large amounts of code data, either in a supervised (Nguyen et al., 2013; 2015; Chen et al., 2018) or weakly-supervised fashion (Lachaux et al., 2020). Supervised approaches require a parallel dataset, in which programs in different languages are considered to be "semantically aligned". Obtaining parallel datasets for programming languages is hard because the translations usually have to be handwritten. Besides the massive human effort involved, it is also difficult to extract general textual features that apply to every programming language.
A recent weakly-supervised method (Lachaux et al., 2020) pretrains the translation model on the task of denoising randomly corrupted programs and then optimizes the model through back-translation. However, this method still relies on high-quality training data. Furthermore, all of these approaches directly reuse NLP techniques that neglect the special features of programming languages. Another potential approach is to use a retrieval system to obtain translation candidates directly from Big Code, i.e., well-maintained program repositories. However, existing code retrieval systems, such as Sourcerer (Linstead et al., 2009), lack proper support for code-to-code search and cross-language code retrieval. Existing interactive retrieval methods ask users to give feedback on several preset metrics and questions (Wang et al., 2014; Dietrich et al., 2013; Martie et al., 2015; Sivaraman et al., 2019), but none of them is tailored to cross-language retrieval.
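To make the denoising objective concrete, the following toy sketch corrupts a tokenized program by randomly dropping or masking tokens; a denoising model would then be trained to reconstruct the original sequence from the corrupted one. The function name, corruption probabilities, and mask token are illustrative assumptions, not details from the cited method.

```python
import random

def corrupt(tokens, drop_prob=0.1, mask_prob=0.1, mask_token="<MASK>", seed=0):
    """Randomly corrupt a token sequence, as in denoising pretraining.

    Each token is independently dropped with probability `drop_prob` or
    replaced by `mask_token` with probability `mask_prob`; otherwise it is
    kept unchanged. A fixed seed makes the corruption reproducible.
    """
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        r = rng.random()
        if r < drop_prob:
            continue                 # drop the token entirely
        elif r < drop_prob + mask_prob:
            out.append(mask_token)   # hide the token behind a mask
        else:
            out.append(tok)          # keep the token as-is
    return out

# Back-translation (sketch): given models f: src -> tgt and g: tgt -> src,
# translate unlabeled source x to y_hat = f(x), train g on the synthetic
# pair (y_hat, x), and proceed symmetrically for the other direction.
src = "def add ( a , b ) : return a + b".split()
noisy = corrupt(src)
```

The denoising model never sees the dropped tokens, so it must learn enough about program structure to restore them, which is what makes the pretraining signal useful before back-translation.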

