MISIM: A NOVEL CODE SIMILARITY SYSTEM

Abstract

Semantic code similarity systems are integral to a range of applications, from code recommendation to automated software defect correction. Yet these systems still lack the accuracy needed for general and reliable wide-scale usage. To help address this, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware semantic structure (CASS), which is designed to aid in lifting semantic meaning from code syntax. We compare CASS with the abstract syntax tree (AST) and show that CASS is up to 1.67× more accurate than the AST. Second, MISIM provides a neural-based code similarity scoring algorithm, which can be implemented with various neural network architectures with learned parameters. We compare MISIM to four state-of-the-art systems: (i) Aroma, (ii) code2seq, (iii) code2vec, and (iv) Neural Code Comprehension. In our experimental evaluation across 328,155 programs (over 18 million lines of code), MISIM achieves 1.5× to 43.4× better accuracy than all four systems.

1. INTRODUCTION

The field of machine programming (MP) is concerned with the automation of software development (Gottschlich et al., 2018). In recent years, many MP systems have emerged, due in part to advances in machine learning, formal methods, data availability, and computing efficiency (Allamanis et al., 2018a; Alon et al., 2018; 2019b; a; Ben-Nun et al., 2018; Cosentino et al., 2017; Li et al., 2017; Luan et al., 2019; Odena & Sutton, 2020; Tufano et al., 2018; Wei & Li, 2017; Zhang et al., 2019; Zhao & Huang, 2018). An open challenge in MP is the construction of accurate code similarity systems. Code similarity is the problem of determining whether two or more code snippets have some degree of semantic similarity (or equivalence), even in the presence of syntactic divergence. At the highest level, code similarity systems aim to determine whether two or more code snippets solve a similar problem, even if their implementations differ (e.g., various sort() algorithms (Cormen et al., 2009)). Historically, code similarity systems have been considered auxiliary features that aim to improve programmer productivity through tools such as code recommendation, automated bug detection, and language-to-language transformation for small kernels (e.g., stencils), to name a few (Allamanis et al., 2018b; Ahmad et al., 2019; Bader et al., 2019; Barman et al., 2016; Bhatia et al., 2018; Dinella et al., 2020; Kamil et al., 2016; Luan et al., 2019; Pradel & Sen, 2018). Yet these systems still lack the accuracy needed for general and reliable wide-scale usage. In particular, without largely accurate code similarity systems to automate significant parts of our software development, we believe the explosion of heterogeneous software and hardware may become an untenable problem that software developers will not be able to navigate (Ahmad et al., 2019; Batra et al., 2018; Bogdan et al., 2019; Chen et al., 2020; Deng et al., 2020; Hannigan et al., 2019).
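To make the notion of semantic similarity under syntactic divergence concrete, the following minimal sketch shows two sort() implementations whose source code looks very different but whose input-output behavior agrees. It is an illustration only and is not part of MISIM: the function names and the helper agree_on_random_inputs are hypothetical, and random testing gives evidence of equivalent behavior, not a proof.

```python
import random


def insertion_sort(values):
    """Sort by growing a sorted prefix one element at a time."""
    result = list(values)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        # Shift larger elements right to make room for key.
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result


def merge_sort(values):
    """Sort by recursively splitting the input and merging sorted halves."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    left = merge_sort(values[:mid])
    right = merge_sort(values[mid:])
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of these two tails is non-empty.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def agree_on_random_inputs(f, g, trials=100, seed=0):
    """Hypothetical helper: check that f and g return equal results
    on randomly generated integer lists (evidence, not proof)."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if f(data) != g(data):
            return False
    return True
```

A code similarity system would ideally rate these two functions as highly similar, although one is iterative and the other recursive and they share little surface syntax.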
Yet, as others have noted before us, even some of the most fundamental questions in code similarity have no clear answers, such as the proper structural representation of code for a particular similarity problem (Alam et al., 2019; Allamanis et al., 2018b; Becker & Gottschlich, 2017; Ben-Nun et al., 2018; Dinella et al., 2020; Iyer et al., 2020; Luan et al., 2019). In this paper, we aim to address some of these questions. While prior work has explored some structural representations of code in the space of code similarity, these explorations are far from complete. The abstract syntax tree (AST) is used in the code2vec and code2seq systems (Alon et al., 2019b; a), a novel structure called the contextual flow graph (XFG) is used by Neural Code Comprehension (NCC) (Ben-Nun et al., 2018), and a new structure called the simplified parse tree (SPT) is used by Aroma (Luan et al., 2019). While each of these representations has benefits in certain contexts, we have found that they possess one or more limitations when used

