CCT: CROSS-CONSISTENCY TRAINING FOR CLONE DETECTION AND CODE SEARCH TASKS

Abstract

Clone detection is a well-known task that can be formulated for any programming language; however, to the best of our knowledge, no cross-lingual clone detection task has been formulated. In this work we formulate such a task, together with a specific training procedure, CCT, for a deep learning language model. This procedure allows a CCT-trained model to outperform existing approaches on the POJ-104 benchmark with 95.67% MAP, and on XCD, a newly created cross-lingual clone detection benchmark. Moreover, the CCT model sets a new state of the art on the AdvTest code search task with 47.15% MRR.

1. INTRODUCTION

In software development practice it is sometimes important to identify code with the same effective output. This can be useful, e.g., for unification and for controlling side effects. To meet this need, the clone detection task was formulated (Mou et al., 2016). In that work the task is formulated for C/C++ code. Although for compiled languages it seems most profitable to detect similar behaviour of the code instead of compiling and running it, clone detection is also useful for many other languages, including interpreted ones. As a next step, one could detect the same output for code written in different languages [1]. We formulate this cross-lingual clone detection task and establish baselines on it. There are various approaches to the clone detection problem, from algorithmic methods (?) to modern machine learning ones. Most machine learning approaches are based on an embedding representation of the code snippet, which allows duplicate code snippets to be found by the similarity between their embeddings. The performance of such systems depends on the quality of the obtained embeddings. We present a novel training technique (CCT) for language models that allows them to embed code snippets effectively. We demonstrate this on the previously formulated clone detection task POJ-104 (Mou et al., 2016) and on the newly formulated cross-lingual clone detection task XCD. Interestingly, we also found that CCT allows a model to produce representations that are useful for the code search task as well. Code search is a task in which a code snippet should be mapped to a text description, as formulated in (Lu et al., 2021a).
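The embedding-based approach described above can be sketched as follows. The `encode` function here is a toy stand-in (character trigram hashing) for a learned code encoder such as the CCT-trained model; the snippet names and the hashing scheme are our own illustrative assumptions, not the paper's method.

```python
# Sketch of embedding-based clone detection: snippets are mapped to
# vectors, and clones are found by cosine similarity between vectors.
import hashlib
import numpy as np

def encode(snippet: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a learned code encoder (character trigram hashing)."""
    vec = np.zeros(dim)
    for i in range(len(snippet) - 2):
        h = int(hashlib.md5(snippet[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def clone_score(a: str, b: str) -> float:
    """Cosine similarity between snippet embeddings; higher = more clone-like."""
    return float(encode(a) @ encode(b))

py1 = "def add(a, b):\n    return a + b"
py2 = "def add(x, y):\n    return x + y"
print(clone_score(py1, py2))
```

In a real system, ranking candidate snippets by this score (above a threshold) yields the predicted clone pairs; the quality of the encoder determines the quality of the ranking.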
The contributions of our work are as follows: (i) we present a pre-training method, CCT, that allows a model to align code snippets in different languages; (ii) we present a novel cross-lingual clone detection task, XCD; (iii) we report results for a language model trained with CCT on the clone detection tasks POJ-104 and XCD; (iv) we report results of the CCT model on the AdvTest code search task. [2]

2. DATASETS

In our work we use two types of datasets: one for clone detection, the other for code search. CodeSearchNet AdvTest is a Python-only dataset constructed from the CodeSearchNet corpus. Each example pairs a function with its documentation. The authors of AdvTest followed the original work (Husain et al., 2019a) in taking the first paragraph of the documentation as the query for the corresponding function. To improve the quality of the dataset, they filtered it by removing the following examples:

• Examples whose code could not be parsed into an abstract syntax tree.
• Examples whose document is shorter than 3 or longer than 256 tokens.
• Examples whose document contains special tokens such as "http://".
• Examples whose document is empty or not written in English.
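The filtering rules above can be approximated in a few lines. This is our own sketch, not the AdvTest authors' code: the tokenizer (whitespace split) and the English check (ASCII-only heuristic) are simplifying assumptions that the original pipeline may implement differently.

```python
# Approximate reimplementation of the four AdvTest-style filters.
import ast

def keep_example(code: str, doc: str) -> bool:
    """Return True if the (code, doc) pair survives all four filters."""
    # 1. Code must parse into an abstract syntax tree.
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    # 2. Document length between 3 and 256 tokens (whitespace tokenizer).
    tokens = doc.split()
    if len(tokens) < 3 or len(tokens) > 256:
        return False
    # 3. No special tokens such as URLs.
    if "http://" in doc or "https://" in doc:
        return False
    # 4. Document non-empty and (roughly) English: ASCII-only heuristic.
    if not doc.strip() or not doc.isascii():
        return False
    return True

print(keep_example("def f():\n    return 1", "Return the constant one."))  # True
print(keep_example("def f(:", "Broken code example here."))                # False
```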



[1] As an example of programs with the same effective output, one could refer to the http://helloworldcollection.de/ website, which contains "Hello, world!" snippets in 603 programming languages.
[2] We are going to release the CCT code after the review process is over.



Figure 1: The difference between strong and weak cross-lingual alignment. In a strongly aligned embedding space, the most semantically similar items are always the closest, regardless of language. A weakly aligned multilingual embedding space only enables zero-shot transfer between languages.

To better test the understanding and generalization abilities of a model, they normalize function and variable names in the testing and development sets, e.g., func for the function name and arg_i for the i-th variable name. The task is to retrieve source code from a pool of candidates given a natural language query. In contrast to the testing phase of previous works (Husain et al., 2019a; ?), which involved only 1,000 candidates, the authors use the entire testing set for each query, which makes the CodeSearchNet AdvTest dataset more difficult. The training set for this task comes from the filtered CodeSearchNet dataset (Husain et al., 2019a).
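The full-candidate evaluation protocol can be sketched as follows: each query is scored against every code snippet in the test set, and performance is reported as Mean Reciprocal Rank (MRR). The embedding matrices here are synthetic placeholders; a real run would use the model's query and code embeddings.

```python
# MRR over the ENTIRE candidate pool (the AdvTest protocol),
# assuming query i's gold answer is code snippet i.
import numpy as np

def mrr(query_vecs: np.ndarray, code_vecs: np.ndarray) -> float:
    """Mean Reciprocal Rank with all candidates ranked per query."""
    scores = query_vecs @ code_vecs.T  # (num_queries, num_candidates)
    reciprocal_ranks = []
    for i, row in enumerate(scores):
        # Rank of the gold candidate = 1 + number of candidates scored higher.
        rank = 1 + int((row > row[i]).sum())
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

# Tiny synthetic check: identity embeddings give a perfect MRR.
q = np.eye(3)
print(mrr(q, q))  # 1.0
```

Ranking against the whole test set instead of 1,000 sampled candidates makes each query strictly harder, since every additional candidate is another chance to outrank the gold snippet.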

