CONTRASTIVE CODE REPRESENTATION LEARNING

Abstract

Machine-aided programming tools such as automated type predictors and autocomplete are increasingly learning-based. However, current approaches predominantly rely on supervised learning with task-specific datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, only the raw text of programs. ContraCode optimizes for a representation that is invariant to semantic-preserving code transformations. We develop an automated source-to-source compiler that generates textually divergent variants of source programs. We then train a neural network to identify variants of anchor programs within a large batch of non-equivalent negatives. To solve this task, the network must extract features representing the functionality, not form, of the program. In experiments, we pre-train ContraCode with 1.8M unannotated JavaScript methods mined from GitHub, then transfer to downstream tasks by fine-tuning. Pre-training with ContraCode consistently improves the F1 score of code summarization baselines and top-1 accuracy of type inference baselines by 2% to 13%. ContraCode achieves 9% higher top-1 accuracy than the current state-of-the-art static type analyzer for TypeScript. Finally, representations learned through a hybrid contrastive and reconstruction objective transfer in zero-shot to code clone detection with +10% AUROC over a static text similarity measure and +5% over reconstruction alone.
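The "textually divergent variants" mentioned above can be illustrated with a toy transformation. The sketch below is a minimal stand-in, assuming a simple variable-renaming transform applied with regular expressions; the paper's actual pipeline uses a source-to-source compiler with a richer set of semantics-preserving transformations, and the snippet and identifier names here are hypothetical.

```python
import re

def rename_variables(src: str, mapping: dict) -> str:
    """Apply a semantics-preserving identifier renaming to a JavaScript
    snippet. A toy stand-in for ContraCode's compiler-based transforms
    (the paper uses a real source-to-source compiler, not regexes)."""
    for old, new in mapping.items():
        # \b word boundaries avoid renaming substrings of other identifiers
        src = re.sub(rf"\b{re.escape(old)}\b", new, src)
    return src

anchor = "function median(xs) { xs.sort(); return xs[xs.length >> 1]; }"
positive = rename_variables(anchor, {"xs": "v0", "median": "f"})
# `anchor` and `positive` differ textually but compute the same value,
# so a functionality-aware encoder should map them close together.
```

A pair like (`anchor`, `positive`) forms one positive example for the contrastive task, while unrelated programs in the batch serve as negatives.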

1. INTRODUCTION

Programmers increasingly rely on machine-aided tools during software development (Kim et al., 2012). However, the wide diversity of programs encountered in practice limits the generalization of hand-written rules. Catching semantic bugs such as naming errors requires deeper language understanding, motivating learning-based programming tools. Recent work uses machine learning for bug detection (Pradel & Sen, 2018) and optimization (Mendis et al., 2019). Consider predicting the type of the variable declaration "var median = ...;". Static analysis fails because the type is underspecified, but the variable name suggests that the variable holds a float.

Programming language datasets suffer from scarce annotations due to the time and expertise required to label programs. State-of-the-art approaches generally rely on either (1) synthetic supervised datasets or (2) self-supervised pre-training. Synthetic auto-generated labels have been used for method naming (Alon et al., 2019a;b) and bug detection (Ferenc et al., 2018; Benton et al., 2019; Pradel & Sen, 2018). However, synthetic code datasets suffer from duplication issues (Allamanis, 2019) and biases (Shin et al., 2019) that degrade generalization. Moreover, auto-generated data does not cover the diverse program behaviors encountered in the wild.

In contrast, self-supervised learning can leverage large open-source repositories such as GitHub with limited or no annotations. Inspired by the success of pre-training in natural language processing, recent work uses self-supervision to learn code representations. Authors have explored context-based token embeddings (Ben-Nun et al., 2018) and masked language modeling, where tokens are corrupted and reconstructed (Feng et al., 2020; Kanade et al., 2020). However, reconstruction focuses on superficial language reasoning and does not explicitly address the underlying program functionality. The resulting models attend to program implementation specifics such as variable names.
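The contrastive training task described in the abstract, identifying a program's variant within a batch of non-equivalent negatives, can be written as an InfoNCE-style loss. The sketch below is a simplified, pure-Python illustration over plain embedding vectors; the cosine similarity measure, the temperature value, and all names are assumptions for this example, not the paper's exact configuration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE loss for one anchor: the cross-entropy of selecting the
    positive (a semantics-preserving variant of the anchor program)
    over the in-batch negatives, using temperature-scaled similarity."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    # numerically stable log-sum-exp for the softmax normalizer
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]  # -log p(positive | anchor, batch)
```

The loss is near zero when the anchor's embedding is far more similar to its variant than to any negative, so minimizing it pushes the encoder toward features that capture functionality rather than surface form.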
We hypothesize that programs with the same functionality should have the same underlying representation for downstream code understanding tasks, a principle illustrated in Fig. 1. While it is time

