A Theory of Equivalence-Preserving Program Embeddings

Abstract

Program embeddings are used to solve tasks such as code clone detection and semantic labeling. Solutions to these semantic tasks should be invariant to semantics-preserving program transformations. When a program embedding function satisfies this invariance, we call it an equivalence-preserving program embedding function. We say a programming language can be tractably embedded when we can construct an equivalence-preserving program embedding function that executes in time polynomial in program/input length and produces program embeddings whose size is proportional to the input length. Determining whether a programming language can be tractably embedded is the equivalence-preserving program embedding problem. We formalize this problem and theoretically characterize when programming languages can be tractably embedded. To validate our theoretical results, we use the BERT-Tiny model to learn an equivalence-preserving program embedding function for a programming language that can be tractably embedded and show the model fails to construct an equivalence-preserving program embedding function for a similar language that is intractable to embed.

1. Introduction

Emerging research demonstrates that powerful new techniques can solve challenging program reasoning tasks, such as code clone detection and semantic labeling (Hu et al., 2017; Mou et al., 2016). At the core of many techniques lie program embeddings, fixed-size representations produced from program text that model a program property. For example, a program embedding may be a vector of floating-point numbers that models a program's input-output behavior. A common and effective method for producing program embeddings is to use a neural network (Yu et al., 2019; Ben-Nun et al., 2018). While the empirical results of these techniques have been demonstrated, there has been, to date, little theoretical understanding of the capabilities of program embedding techniques.

Semantic Tasks.

A key first step towards understanding program embeddings is understanding the tasks they are designed to solve. In this work, we focus on semantic tasks, tasks that depend only on the input-output behavior of a program. Equivalently, the solution to a semantic task is invariant to all semantics-preserving transformations on the input program.

A Theory of Equivalence-Preserving Program Embeddings.

When a program embedding technique is invariant to semantics-preserving transformations, two programs' embeddings are identical exactly when the programs are semantically equivalent. We call such a technique an equivalence-preserving program embedding function. We say a programming language can be tractably embedded when we can construct an equivalence-preserving program embedding function that runs in time polynomial in program/input length and produces embeddings whose size is proportional to the input length. The problem of determining whether a programming language can be tractably embedded is the equivalence-preserving program embedding problem. To date, this problem has not been identified in the literature, and it has been unknown under what conditions it can be solved. We provide the first theoretical characterization of this problem by proving necessary and sufficient conditions for when a programming language can be tractably embedded.

Empirical Study.

We apply our theory to programming languages for modular addition and bitvector arithmetic (Bosselaers et al., 1994; Barrett et al., 1998). We prove the modular addition language can be tractably embedded, while the bitvector arithmetic language cannot. We find that a BERT-Tiny model can learn an equivalence-preserving embedding function for the modular addition language, but not the bitvector arithmetic language (Bhargava et al., 2021; Turc et al., 2019). We also validate that as the number of possible inputs increases for these languages, the model maintains 100% train and test accuracy for the modular addition language, but accuracy degrades quickly for the bitvector arithmetic language.

Our contributions are as follows:

1. We define the equivalence-preserving program embedding problem and identify applications of equivalence-preserving program embeddings.
2. We prove necessary and sufficient conditions for when a programming language can be tractably embedded.
3. We hypothesize that programming languages that can be tractably embedded are easier to learn than those that are intractable to embed and provide evidence with an empirical study of languages for modular addition and bitvector arithmetic.

We present definitions and results constituting the first steps in a theory of equivalence-preserving program embeddings. Building on these foundations, future work can better analyze existing programming languages, design new programming languages that can be tractably embedded, and develop principled approximations for programming languages that cannot be tractably embedded.

2. Equivalence-Preserving Program Embeddings In Practice

Semantic tasks are those where only a program's input-output behavior must be modeled. We show how equivalence-preserving program embeddings can solve these tasks and briefly explain how researchers solve them.

Code Clone Detection.

Code clone detection is the task of identifying whether a pair of programs are equivalent and is often used to find duplicate code in a codebase or identify software vulnerabilities in programs (Hu et al., 2017). Given an equivalence-preserving program embedding function, one can solve this problem by embedding both programs and checking whether the two embeddings are equal. Researchers produce program embeddings that solve this problem via neural networks (Yu et al., 2019), execution of programs at a fixed set of inputs (Hu et al., 2017), and locality-sensitive hashing of such executions (Pewny et al., 2015).

Semantic Labeling.

In semantic labeling, the goal is to label a program as having a particular semantic property among a fixed-size collection of possible properties. Examples include determining the algorithm a program implements and identifying whether a program exhibits a particular error. An equivalence-preserving program embedding function maps each program to its label, which identifies the program's functionality. Note that the labels come from a fixed set representing all program functionalities of interest. Researchers solve this problem using neural program embeddings (Mou et al., 2016; Ben-Nun et al., 2018; Puri et al., 2021; Wang et al., 2018).
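As an illustration of clone detection by comparing embeddings, the following minimal sketch uses a toy modular addition language. The language here is an assumption made for the example and is not necessarily the one studied in the empirical section: a program is a list of integer constants that are added to the input modulo a fixed `MODULUS`. The names `MODULUS`, `run`, `embed`, `are_clones`, and `equivalent_by_enumeration` are hypothetical. Under these assumptions, every program computes x -> (x + sum of constants) mod `MODULUS`, so the reduced sum serves as an embedding that is equal for two programs exactly when the programs are semantically equivalent.

```python
from typing import List

# Assumed toy language (hypothetical, for illustration only):
# a program is a list of constants c1..ck and denotes the function
#   x -> (x + c1 + ... + ck) mod MODULUS
MODULUS = 8
Program = List[int]


def run(program: Program, x: int) -> int:
    """Execute the program on input x, one addition at a time."""
    for c in program:
        x = (x + c) % MODULUS
    return x


def embed(program: Program) -> int:
    """Candidate embedding: the sum of all constants, reduced mod MODULUS.

    Runs in time linear in program length and returns a single value
    in range(MODULUS), independent of how long the program is.
    """
    return sum(program) % MODULUS


def are_clones(p: Program, q: Program) -> bool:
    """Clone detection: embed both programs and compare the embeddings."""
    return embed(p) == embed(q)


def equivalent_by_enumeration(p: Program, q: Program) -> bool:
    """Reference check: compare input-output behavior on every input."""
    return all(run(p, x) == run(q, x) for x in range(MODULUS))


if __name__ == "__main__":
    p, q, r = [3, 7], [2], [1, 2]
    print(are_clones(p, q), equivalent_by_enumeration(p, q))
    print(are_clones(p, r), equivalent_by_enumeration(p, r))
```

In this sketch `are_clones` agrees with the exhaustive check because addition modulo a constant is associative and commutative, so a program's behavior is fully determined by one residue. A language without such a compact canonical form would not admit so simple an embedding.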

3. A Theory of Equivalence-Preserving Program Embeddings

In this section, we formalize the equivalence-preserving program embedding problem and prove conditions under which it can be solved. We define a programming language P as a set of strings with additional structure. For one, deciding membership of programs in a programming language is often decidable (e.g., by parsing and type checking), meaning one can enumerate all programs from shortest to

