CCT: CROSS-CONSISTENCY TRAINING FOR CLONE DETECTION AND CODE SEARCH TASKS

Abstract

Clone detection is a well-known task that can be formulated for any programming language. However, to the best of our knowledge, no cross-lingual clone detection task has been formulated so far. In this work we formulate such a task along with a specific training procedure, CCT, for a deep learning language model. This procedure allows a CCT-trained model to outperform existing approaches on the POJ-104 benchmark with a result of 95.67% MAP, as well as on the newly created cross-lingual clone detection benchmark XCD. Moreover, the CCT model sets a new state of the art on the AdvTest code search task with 47.15% MRR.

1. INTRODUCTION

In software development practice it is sometimes important to identify code with the same effective output. This can be useful, e.g., for unification and for controlling side effects. To meet this need, the clone detection task was formulated (Mou et al., 2016). In that work the task is defined for C/C++ code. Although for compiled languages it seems most profitable to detect similar behaviour of the code instead of compiling and running it, clone detection can also be useful for many other languages, including interpreted ones. As a next step, one could detect the same output for code written in different programming languages 1 . We formulate the cross-lingual clone detection task and establish baselines for it.

There are various approaches to the clone detection task, ranging from algorithm-based methods (?) to modern machine learning ones (). Most machine learning approaches are based on an embedding representation of the code snippet. Such an approach allows finding duplicate code snippets by the similarity between their embedding representations, so the performance of these systems depends on the quality of the obtained embeddings. We present a novel training technique (CCT) for language models that allows them to embed code snippets effectively. We demonstrate this on the previously formulated clone detection task POJ-104 (Mou et al., 2016) and on the newly formulated cross-lingual clone detection task XCD. Interestingly, we also found that CCT allows a model to produce representations useful for the code search task, in which a code snippet is retrieved for a given text description, as formulated in (Lu et al., 2021a).
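The embedding-based retrieval scheme described above can be sketched as follows. This is a minimal illustration only: the `embed` function here is a toy hashed bag-of-tokens vector standing in for a trained language-model encoder, and the function names are ours, not part of CCT.

```python
import math
from collections import Counter

def embed(snippet, dim=64):
    """Toy embedding: a hashed bag-of-tokens vector.
    A stand-in for a trained language-model encoder."""
    vec = [0.0] * dim
    for tok, cnt in Counter(snippet.split()).items():
        vec[hash(tok) % dim] += cnt
    return vec

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_clones(query, corpus):
    """Rank candidate snippets by embedding similarity to the query,
    most similar first -- the core retrieval step of clone detection."""
    q = embed(query)
    scored = [(cosine(q, embed(s)), s) for s in corpus]
    return sorted(scored, key=lambda t: t[0], reverse=True)
```

With a good encoder, semantically equivalent snippets (even across languages) would land close in the embedding space, so exact or near-clones of the query rank at the top of the returned list.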
The contributions of our work are as follows: (i) we present a pre-training method, CCT, that allows a model to align code snippets in different languages; (ii) we present a novel cross-lingual clone detection task, XCD; (iii) we report results for a language model trained with CCT on the clone detection tasks POJ-104 and XCD; (iv) we report results of the CCT model on the AdvTest code search task. 2

2. DATASETS

In our work we use two types of datasets: one for clone detection, the other for code search.

1 As an example of programs with the same effective output, one could refer to the http://helloworldcollection.de/ website, which contains "Hello, world!" snippets in 603 programming languages.
2 We are going to release the CCT code after the review process is over.

