A Theory of Equivalence-Preserving Program Embeddings

Abstract

Program embeddings are used to solve tasks such as code clone detection and semantic labeling. Solutions to these semantic tasks should be invariant to semantics-preserving program transformations. When a program embedding function satisfies this invariance, we call it an equivalence-preserving program embedding function. We say a programming language can be tractably embedded when we can construct an equivalence-preserving program embedding function that executes in time polynomial in program/input length and produces program embeddings whose size is proportional to the input length. Determining whether a programming language can be tractably embedded is the equivalence-preserving program embedding problem. We formalize this problem and theoretically characterize when programming languages can be tractably embedded. To validate our theoretical results, we use the BERT-Tiny model to learn an equivalence-preserving program embedding function for a programming language that can be tractably embedded, and we show that the model fails to construct an equivalence-preserving program embedding function for a similar language that is intractable to embed.

1. Introduction

Emerging research demonstrates that powerful new techniques can solve challenging program reasoning tasks, such as code clone detection and semantic labeling (Hu et al., 2017; Mou et al., 2016). At the core of many of these techniques lie program embeddings: fixed-size representations produced from program text that model a program property. For example, a program embedding may be a vector of floating-point numbers that models a program's input-output behavior. A common and effective method for producing program embeddings is to use a neural network (Yu et al., 2019; Ben-Nun et al., 2018). While the effectiveness of these techniques has been demonstrated empirically, there has been, to date, little theoretical understanding of the capabilities of program embedding techniques.

Semantic Tasks. A key first step towards understanding program embeddings is understanding the tasks they are designed to solve. In this work, we focus on semantic tasks: tasks that depend only on the input-output behavior of a program. Equivalently, the solution to a semantic task is invariant to all semantics-preserving transformations on the input program.

A Theory of Equivalence-Preserving Program Embeddings. When a program embedding technique is invariant to semantics-preserving transformations, two programs' embeddings are identical exactly when the programs are semantically equivalent. We call such a technique an equivalence-preserving program embedding function. We say a programming language can be tractably embedded when we can construct an equivalence-preserving program embedding function that runs in time polynomial in program/input length and produces embeddings whose size is proportional to input length. The problem of determining whether a programming language can be tractably embedded is the equivalence-preserving program embedding problem. To date, this problem has not been identified in the literature, and it has been unknown under what conditions it can be solved. We provide the first theoretical characterization of this problem by proving necessary and sufficient conditions for when a programming language can be tractably embedded.
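
To make the definition of an equivalence-preserving program embedding function concrete, the following is a minimal sketch in Python. The toy language is an assumption introduced purely for illustration and is not claimed to be a language studied in this work: a program is a sequence of hypothetical `add c` and `sub c` instructions applied to a single integer input. Two such programs are semantically equivalent exactly when their net offsets agree, so the net offset can serve as an embedding that is identical for equivalent programs and distinct otherwise. The names `Program`, `run`, and `embed` are illustrative.

```python
from typing import List, Tuple

# Hypothetical toy language (an assumption for illustration only):
# a program is a list of ("add", c) or ("sub", c) instructions that
# update a single integer input x.
Program = List[Tuple[str, int]]


def run(program: Program, x: int) -> int:
    """Reference semantics: execute the program on input x."""
    for op, c in program:
        if op == "add":
            x += c
        elif op == "sub":
            x -= c
        else:
            raise ValueError(f"unknown instruction: {op}")
    return x


def embed(program: Program) -> Tuple[int]:
    """Map a program to its net offset, computed in one pass.

    In this toy language, run(p, x) == x + offset(p) for every x, so two
    programs have the same embedding exactly when they are equivalent.
    """
    offset = 0
    for op, c in program:
        if op == "add":
            offset += c
        elif op == "sub":
            offset -= c
        else:
            raise ValueError(f"unknown instruction: {op}")
    return (offset,)


# Two syntactically different but equivalent programs, and one that differs.
p1: Program = [("add", 2), ("add", 3)]
p2: Program = [("add", 7), ("sub", 2)]
p3: Program = [("add", 4)]

same = embed(p1) == embed(p2)       # equivalent programs share an embedding
different = embed(p1) != embed(p3)  # inequivalent programs do not
```

In this sketch, `embed` runs in a single pass over the program, and the embedding is one integer whose bit length grows at most linearly with the length of the program text, which mirrors the time and size requirements in the definition of tractable embedding for this deliberately simple case.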

