MISIM: A NOVEL CODE SIMILARITY SYSTEM

Abstract

Semantic code similarity systems are integral to a range of applications, from code recommendation to automated software defect correction. Yet, these systems still lack the accuracy needed for general and reliable wide-scale usage. To help address this, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware semantic structure (CASS), which is designed to aid in lifting semantic meaning from code syntax. We compare CASS with the abstract syntax tree (AST) and show that CASS is up to 1.67× more accurate than the AST. Second, MISIM provides a neural-based code similarity scoring algorithm, which can be implemented with various neural network architectures with learned parameters. We compare MISIM to four state-of-the-art systems: (i) Aroma, (ii) code2seq, (iii) code2vec, and (iv) Neural Code Comprehension. In our experimental evaluation across 328,155 programs (over 18 million lines of code), MISIM is 1.5× to 43.4× more accurate than these four systems.

1. INTRODUCTION

The field of machine programming (MP) is concerned with the automation of software development (Gottschlich et al., 2018). In recent years, many MP systems have emerged, due, in part, to advances in machine learning, formal methods, data availability, and computing efficiency (Allamanis et al., 2018a; Alon et al., 2018; 2019b; a; Ben-Nun et al., 2018; Cosentino et al., 2017; Li et al., 2017; Luan et al., 2019; Odena & Sutton, 2020; Tufano et al., 2018; Wei & Li, 2017; Zhang et al., 2019; Zhao & Huang, 2018). An open challenge in MP is the construction of accurate code similarity systems. Code similarity is the problem of determining if two or more code snippets have some degree of semantic similarity (or equivalence) even in the presence of syntactic divergence. At the highest level, code similarity systems aim to determine if two or more code snippets are solving a similar problem, even if the implementations they use differ (e.g., different algorithms for sort() (Cormen et al., 2009)). Historically, code similarity systems have been considered an auxiliary feature that aims to improve programmer productivity with tools such as code recommendation, automated bug detection, and language-to-language transformation for small kernels (e.g., stencils), to name a few (Allamanis et al., 2018b; Ahmad et al., 2019; Bader et al., 2019; Barman et al., 2016; Bhatia et al., 2018; Dinella et al., 2020; Kamil et al., 2016; Luan et al., 2019; Pradel & Sen, 2018). Yet, these systems still lack the accuracy needed for general and reliable wide-scale usage. In particular, without largely accurate code similarity systems to automate significant parts of our software development, we believe the explosion of heterogeneous software and hardware may become an untenable problem that software developers will not be able to navigate (Ahmad et al., 2019; Batra et al., 2018; Bogdan et al., 2019; Chen et al., 2020; Deng et al., 2020; Hannigan et al., 2019).
Yet, as others have noted before us, even some of the most fundamental questions in code similarity have no clear answers, such as the proper structural representation of code for a particular similarity problem (Alam et al., 2019; Allamanis et al., 2018b; Becker & Gottschlich, 2017; Ben-Nun et al., 2018; Dinella et al., 2020; Iyer et al., 2020; Luan et al., 2019). In this paper, we aim to address some of these questions. While prior work has explored some structural representations of code in the space of code similarity, these explorations are far from complete. The abstract syntax tree (AST) is used in the code2vec and code2seq systems (Alon et al., 2019b; a), a novel structure called the contextual flow graph (XFG) is used by Neural Code Comprehension (NCC) (Ben-Nun et al., 2018), and a new structure called the simplified parse tree (SPT) is used by Aroma (Luan et al., 2019). While each of these representations has benefits in certain contexts, we have found that they possess one or more limitations when used in the broader context of code similarity that may limit their practical application. For example, the AST, while principally valuable for compilers, is syntax driven, which can often mislead code similarity systems into learning too much syntax and not enough semantics (i.e., the meaning behind the syntax). The XFG is obtained from an intermediate representation (IR), which requires code compilation; this limits its use to only compilable code. Although the SPT is structurally driven (not syntax driven), it does not always resolve syntactic ambiguities, which may result in semantic obfuscation that prevents it from observing semantic variation caused by contextual syntactic differences. In Section 4, we evaluate how these limitations impact code similarity accuracy.
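To make the point about syntax-driven representations concrete, the following is a minimal sketch and not part of MISIM. It uses Python's standard `ast` module as a stand-in for a compiler AST, whereas the evaluation in this paper targets C/C++. The two snippets and the helper names (`node_type_counts`, `run`) are hypothetical and introduced only for illustration. The snippets compute the same sum, yet their ASTs contain different node types.

```python
import ast
from collections import Counter


def node_type_counts(source: str) -> Counter:
    """Count how often each AST node type occurs in a Python snippet."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


# Two hypothetical snippets: same behavior, different syntax.
SUM_WITH_FOR = """
def total(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc
"""

SUM_WITH_WHILE = """
def total(xs):
    acc = 0
    i = 0
    while i < len(xs):
        acc += xs[i]
        i += 1
    return acc
"""


def run(source: str, data):
    """Define `total` from the snippet and call it on `data`."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](data)


for_counts = node_type_counts(SUM_WITH_FOR)
while_counts = node_type_counts(SUM_WITH_WHILE)

# Semantically equivalent on this input ...
same_behavior = run(SUM_WITH_FOR, [1, 2, 3]) == run(SUM_WITH_WHILE, [1, 2, 3])
# ... but the sets of AST node types differ.
only_in_for = sorted(set(for_counts) - set(while_counts))
only_in_while = sorted(set(while_counts) - set(for_counts))

print("same behavior:", same_behavior)
print("node types only in the for version:", only_in_for)
print("node types only in the while version:", only_in_while)
```

A similarity system that relies only on such node types would see these two functions as structurally different, which is the kind of syntactic bias described above.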
Learning from these observations, we attempt to address some of the open questions around code similarity with our novel end-to-end code similarity system called Machine Inferred Code Similarity (MISIM). In this paper, we principally focus on two main novelties of MISIM and how they may improve code similarity analysis: (i) its structural representation of code, called the context-aware semantic structure (CASS), and (ii) its neural-based learned code similarity scoring algorithm. These components can be used individually or together, as we have chosen to do. This paper makes the following technical contributions:
• We present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system.
• We present MISIM's context-aware semantic structure (CASS), a novel structural representation of code specifically designed to (i) lift semantic meaning from code syntax and (ii) provide an extensible representation that can be augmented as needed (e.g., as new programming languages (PLs) emerge, existing PL contextual syntax ambiguities are discovered, etc.). We also compare how well the AST and CASS represent code semantics. Our preliminary results show that CASS can be up to 1.67× more accurate.
• We present MISIM's open-ended deep neural network (DNN) backend that learns the similarity scoring algorithm for a given code corpus and show its efficacy across three DNN topologies: (i) bag-of-features, (ii) a recurrent neural network (RNN), and (iii) a graph neural network (GNN).¹
• We compare MISIM to four state-of-the-art code similarity systems: (i) code2vec, (ii) code2seq, (iii) Neural Code Comprehension, and (iv) Aroma. Our experimental evaluation, across 328,155 C/C++ programs comprising over 18 million lines of code, shows that MISIM is more accurate than all four systems across all experiments, by 1.5× to 43.4×.

2. MISIM SYSTEM

[Figure 1 phases: Phase 0: source code; Phase 1: CASS featurization; Phase 2: code similarity scoring.]



¹ We acknowledge that these three DNN topologies are not an exhaustive list of the neural network architectures that can be used with MISIM. We explore them for two reasons: (i) to demonstrate the diversity of DNNs that can be used with MISIM and (ii) to illustrate how different DNNs have a measurable impact on MISIM's accuracy.



Figure 1: Overview of the MISIM System.
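The phased flow in Figure 1 (source code, then featurization, then similarity scoring) can be sketched roughly as follows. This is an illustrative sketch under loud assumptions, not MISIM's implementation: the featurizer below is a plain bag of lexical tokens rather than the CASS, and the score is a fixed cosine similarity rather than a learned neural scoring function. The names `featurize`, `similarity_score`, and `misim_like_pipeline` are hypothetical.

```python
import math
import re
from collections import Counter


def featurize(source: str) -> Counter:
    """Phase 1 stand-in: a bag of lexical tokens (NOT the CASS)."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)
    return Counter(tokens)


def similarity_score(a: Counter, b: Counter) -> float:
    """Phase 2 stand-in: cosine similarity of token counts (assumed, not learned)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty snippet has no features to compare.
    return dot / (norm_a * norm_b)


def misim_like_pipeline(src_a: str, src_b: str) -> float:
    """Phase 0 -> 1 -> 2: source code in, similarity score out."""
    return similarity_score(featurize(src_a), featurize(src_b))


# Phase 0: two hypothetical C-like snippets that both sum an array.
LOOP_FOR = "int s = 0; for (int i = 0; i < n; i++) s += a[i];"
LOOP_WHILE = "int s = 0; int i = 0; while (i < n) { s += a[i]; i++; }"

print(misim_like_pipeline(LOOP_FOR, LOOP_WHILE))
```

In this sketch each phase is a separate function, so the token-bag featurizer or the fixed cosine score could be replaced independently, mirroring the statement above that MISIM's two components can be used individually or together.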

