CONTRASTIVE CODE REPRESENTATION LEARNING

Abstract

Machine-aided programming tools such as automated type predictors and autocomplete are increasingly learning-based. However, current approaches predominantly rely on supervised learning with task-specific datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, only the raw text of programs. ContraCode optimizes for a representation that is invariant to semantics-preserving code transformations. We develop an automated source-to-source compiler that generates textually divergent variants of source programs. We then train a neural network to identify variants of anchor programs within a large batch of non-equivalent negatives. To solve this task, the network must extract features representing the functionality, not the form, of the program. In experiments, we pre-train ContraCode with 1.8M unannotated JavaScript methods mined from GitHub, then transfer to downstream tasks by fine-tuning. Pre-training with ContraCode consistently improves the F1 score of code summarization baselines and the top-1 accuracy of type inference baselines by 2% to 13%. ContraCode achieves 9% higher top-1 accuracy than the current state-of-the-art static type analyzer for TypeScript. Finally, representations learned through a hybrid contrastive and reconstruction objective transfer zero-shot to code clone detection with +10% AUROC over a static text similarity measure and +5% over reconstruction alone.

1. INTRODUCTION

Programmers increasingly rely on machine-aided programming tools during software development (Kim et al., 2012). However, the wide diversity of programs encountered in practice limits the generalization of hand-written rules. Catching semantic bugs such as naming errors requires deeper language understanding, motivating learning-based programming tools. Recent work uses machine learning for bug detection (Pradel & Sen, 2018) and optimization (Mendis et al., 2019). Consider predicting the type of the variable declaration "var median = ...;". Static analysis fails because the type is underspecified, but the variable name suggests the value is a float.

Programming language datasets suffer from scarce annotations due to the time and expertise required to label programs. State-of-the-art approaches generally rely on either (1) synthetic supervised datasets or (2) self-supervised pre-training. Synthetic auto-generated labels have been used for method naming (Alon et al., 2019a;b) and bug detection (Ferenc et al., 2018; Benton et al., 2019; Pradel & Sen, 2018). However, synthetic code datasets suffer from duplication issues (Allamanis, 2019) and biases (Shin et al., 2019) that degrade generalization. Moreover, auto-generated data does not cover the diverse program behaviors encountered in the wild.

In contrast, self-supervised learning can leverage large open-source repositories such as GitHub with limited or no annotations. Inspired by the success of pre-training in natural language processing, recent work uses self-supervision to learn code representations. Authors have explored context-based token embeddings (Ben-Nun et al., 2018) and masked language modeling, where tokens are corrupted and reconstructed (Feng et al., 2020; Kanade et al., 2020). However, reconstruction focuses on superficial language reasoning and does not explicitly address the underlying program functionality. The resulting models attend to program implementation specifics such as variable names.
While it is computationally intensive to identify equivalent programs in a large corpus, it is cheap to leverage static compiler transformations to automatically generate many equivalent versions of a particular source program. In this work, we develop ContraCode, a self-supervised representation learning algorithm that uses source-to-source compiler transformations (e.g., dead code elimination, obfuscation, and constant folding) to generate syntactically diverse but functionally equivalent programs. ContraCode uses these equivalent programs to construct a challenging discriminative pretext task that requires the model to identify equivalent programs out of a large dataset of distractors. In doing so, it must embed the functionality, not the form, of the code. In essence, the domain knowledge encoded in our code transformations induces knowledge of program structure in the learned representations.

The contributions of our work include:
1. the novel use of compiler-inspired transformations as data augmentations for code,
2. the concept of program representation learning based on functional equivalence, and
3. a detailed analysis of architectures, code transforms, and pre-training strategies, where ContraCode improves top-1 accuracy by 9% over static type inference and by 2%-13% over learned type inference, summarization F1 score by up to 8%, and clone detection AUROC by 5%-10%.
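ContraCode's transformations operate on JavaScript via a source-to-source compiler. As a minimal illustrative analogue (not the paper's actual pipeline), the same idea of a semantics-preserving identifier-obfuscation pass can be sketched on Python source using the standard `ast` module; the `RenameVariables` and `transform` names here are ours, and `ast.unparse` requires Python 3.9+:

```python
import ast

class RenameVariables(ast.NodeTransformer):
    """Obfuscate identifiers: a semantics-preserving transform that
    changes the program's text without changing its behavior."""
    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node):
        # Map each identifier to a fresh opaque name (v0, v1, ...).
        if node.id not in self.mapping:
            self.mapping[node.id] = f"v{len(self.mapping)}"
        node.id = self.mapping[node.id]
        return node

def transform(source: str) -> str:
    """Parse source, apply the renaming pass, and unparse back to text."""
    tree = ast.parse(source)
    tree = RenameVariables().visit(tree)
    return ast.unparse(tree)

src = "total = price * quantity\nmedian = total / 2"
print(transform(src))
# prints:
# v0 = v1 * v2
# v3 = v0 / 2
```

The original and transformed programs compute identical results, so they form a natural positive pair for contrastive pre-training, while unrelated programs from the corpus serve as negatives.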

2. RELATED WORK

Self-supervised learning (SSL) is a general representation learning strategy where some dimensions or attributes of a datapoint are predicted from the remaining parts. These methods are unsupervised in the sense that they do not rely on labels, but SSL tasks often adapt losses and architectures designed for supervised learning. Self-supervised pre-training has yielded large improvements in both NLP (Howard & Ruder, 2018; Devlin et al., 2018; Radford et al., 2018; 2019) and computer vision (Mahajan et al., 2018) by improving generalization (Erhan et al., 2010; Hao et al., 2019). Weak visual features, such as orientation (Gidaris et al., 2018), color (Zhang et al., 2016), and context (Pathak et al., 2016), are meaningful signals for representations (Mahajan et al., 2018).

Contrastive learning unifies many past SSL approaches that compare pairs or collections of similar and dissimilar items (Hadsell et al., 2006). Rather than training the network to predict labels or reconstruct data, contrastive methods minimize the distance between the representations of similar examples (positives) while maximizing the distance between dissimilar examples (negatives). Examples include Siamese networks (Bromley et al., 1994) and triplet losses (Schroff et al., 2015). Contrastive predictive coding (Oord et al., 2018; Hénaff et al., 2019) learns to encode chunks of sequential data to predict future chunks with the InfoNCE loss, a variational lower bound on the mutual information between views of the data (Tian et al., 2019; Wu et al., 2020) inspired by noise-contrastive estimation (Gutmann & Hyvärinen, 2010). In instance discrimination tasks (Wu et al., 2018), views of an entire image, rather than patches, are compared. SimCLR (Chen et al., 2020a) and Momentum Contrast (He et al., 2019; Chen et al., 2020b) recently made progress by using many negatives for a dense loss signal. Beyond images, InfoNCE has been applied to NLP (Chuang et al., 2020; Giorgi et al., 2020), but may require supervision (Fang & Xie, 2020).

Code representation learning. There has been substantial work on architectures and tasks for machine learning on code (Allamanis et al., 2018). We adopt the summarization task of Alon et al.

Figure 1: Programs with the same functionality should have the same underlying representation. ContraCode learns such representations with contrastive learning: the network is trained to find equivalent programs among many distractors, encoding semantics into the representation.
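As a concrete reference point for the InfoNCE objective discussed above: given an anchor's representation, it applies a softmax over similarities to one positive and many negatives, and minimizes the negative log-probability of the positive. The following is a minimal single-anchor sketch in plain Python; the `info_nce` name and the cosine-similarity choice are ours, and practical implementations (e.g., MoCo-style training) use batched tensors and a queue of negative representations instead:

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.07):
    """-log softmax probability of the positive among all candidates."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def cosine(u, v):
        return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

    # Temperature-scaled similarity logits: positive first, then negatives.
    logits = [cosine(anchor, positive) / temperature] + [
        cosine(anchor, n) / temperature for n in negatives
    ]
    # Numerically stable log-sum-exp for the softmax denominator.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]  # = -log p(positive | anchor)

# The loss is small when the positive is close to the anchor and the
# negatives are far, and large in the opposite configuration.
aligned = info_nce([1.0, 0.0], [1.0, 0.1], negatives=[[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], negatives=[[1.0, 0.1]])
print(aligned < misaligned)  # prints: True
```

In ContraCode's pretext task, the positive would be a compiler-generated equivalent variant of the anchor program, and the negatives would be encodings of unrelated programs in the batch.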

