LATENT PROGRAMMER: DISCRETE LATENT CODES FOR PROGRAM SYNTHESIS

Abstract

In many sequence learning tasks, such as program synthesis and document summarization, a key problem is searching over a large space of possible output sequences. We propose to learn representations of the outputs that are specifically meant for search: rich enough to specify the desired output but compact enough to make search more efficient. Discrete latent codes are appealing for this purpose, as they naturally allow sophisticated combinatorial search strategies. The latent codes are learned using a self-supervised learning principle, in which first a discrete autoencoder is trained on the output sequences, and then the resulting latent codes are used as intermediate targets for the end-to-end sequence prediction task. Based on these insights, we introduce the Latent Programmer, a program synthesis method that first predicts a discrete latent code from input/output examples, and then generates the program in the target language. We evaluate the Latent Programmer on two domains: synthesis of string transformation programs, and generation of programs from natural language descriptions. We demonstrate that the discrete latent representation significantly improves synthesis accuracy.

1. INTRODUCTION

Our focus in this paper is program synthesis, one of the longstanding grand challenges of artificial intelligence research (Manna & Waldinger, 1971; Summers, 1977). The objective of program synthesis is to automatically write a program given a specification of its intended behavior, such as a natural language description or a small set of input-output examples. Search is an especially difficult challenge within program synthesis (Alur et al., 2013; Gulwani et al., 2017), and many different methods have been explored, including top-down search (Lee et al., 2018), bottom-up search (Udupa et al., 2013), beam search (Devlin et al., 2017), and many others (see Section 2).

We take a different philosophy: can we learn a representation of programs specifically to help search? A natural way of representing a program is as a sequence of source-code tokens, but the synthesis task requires searching over this representation, which can be difficult for longer, more complex programs. A programmer often starts by specifying the high-level components of a program as a plan, then fills in the details of each component; in string editing, for example, a plan could be to extract the first name, then the last initial. We propose to use a sequence of latent tokens, called a discrete latent code, to represent such plans. Instead of fixing a dictionary of codes, we let the model discover which latent codes are useful and learn how to infer them from the specification.

Our hypothesis is that a discrete latent code (a sequence of discrete latent variables) can be a useful representation for search (van den Oord et al., 2017; Roy et al., 2018; Kaiser et al., 2018). This is because we can employ standard methods from discrete search, such as beam search, first over a compact space of high-level plans and then over programs conditioned on the plan, in a two-level procedure. We posit that the high-level search can help to organize the search over programs.
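To make the size argument concrete, here is a back-of-the-envelope comparison. The specific numbers (vocabulary size V, program length T, codebook size K, code length L) are illustrative choices of ours, not values from the paper.

```python
# Hypothetical, illustrative sizes: token-level search space vs. the
# latent-code ("plan") search space. Numbers are for intuition only.
V, T = 100, 20   # DSL vocabulary size, program length in tokens
K, L = 40, 4     # latent codebook size, latent code length

program_space = V ** T   # number of length-T token sequences
latent_space = K ** L    # number of length-L latent codes

print(f"program space: {program_space:.2e}")
print(f"latent space:  {latent_space:.2e}")
# The plan space is vastly smaller, so a beam over plans can cover a
# meaningful fraction of it before committing to token-level search.
```

Under these assumptions the latent space has about 2.6 million elements, versus 10^40 token sequences, which is why searching first over plans can be far cheaper.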
In the string-editing example above, a model could be confident that it needs to extract the last initial, but less sure about whether it needs to extract a first name. By changing one token in the latent code, two-level search can explore alternative programs that behave differently at the beginning. In traditional single-level search, by contrast, the model would need to change multi-token prefixes of the alternatives, which is difficult to achieve within a limited search budget.

We propose the Latent Programmer, a program synthesis method that uses learned discrete representations to guide search via two-level synthesis. The Latent Programmer is trained by a self-supervised learning principle. First, a discrete autoencoder is trained on a set of programs to learn discrete latent codes, and then an encoder is trained to map the specification of the synthesis task to these latent codes. Finally, at inference time, the Latent Programmer uses a two-level search: given the specification, the model first produces a list of the L best latent codes from the latent predictor, and then uses them to synthesize candidate programs.

On two different program synthesis domains, we find empirically that the Latent Programmer improves synthesis accuracy by over 10% compared to standard sequence-to-sequence baselines such as RobustFill (Devlin et al., 2017). We also find that our method improves the diversity of predictions, as well as accuracy on long programs.

Figure 1:
    Inputs              Outputs
    "Mason Smith"       "Smith M"
    "Henry Myers"       "Myers H"
    "Barry Underwood"   "Underwood B"
    "Sandy Jones"       "Jones S"
    Program: GetToken_PROP_CASE_2 | Const(" ") | GetToken_ALL_CAPS_1
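The two-level procedure described above can be sketched with a generic beam search applied twice: once over plan tokens, then over program tokens conditioned on each plan. The model interfaces below (`latent_lp` and `program_lp_given`) are toy stand-ins of ours, not the paper's actual networks; they use fixed scoring tables purely to make the control flow runnable.

```python
import heapq
import math

def beam_search(vocab, length, log_prob, beam_size):
    """Generic beam search over fixed-length sequences.

    log_prob(prefix, tok) scores appending `tok` to `prefix`.
    Returns the beam_size highest-scoring sequences with their scores.
    """
    beam = [((), 0.0)]
    for _ in range(length):
        candidates = [
            (seq + (tok,), score + log_prob(seq, tok))
            for seq, score in beam
            for tok in vocab
        ]
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[1])
    return beam

PLAN_VOCAB = ["extract_last", "extract_first"]

def latent_lp(prefix, tok):
    # Toy "latent predictor": slightly prefers extracting the last name.
    table = {"extract_last": math.log(0.6), "extract_first": math.log(0.4)}
    return table[tok]

def program_lp_given(plan):
    # Toy "program decoder" conditioned on a plan: favors program tokens
    # that match the plan token at the same position.
    def lp(prefix, tok):
        return math.log(0.9) if tok == plan[len(prefix) % len(plan)] else math.log(0.1)
    return lp

# Level 1: beam search over short latent codes (plans).
plans = beam_search(PLAN_VOCAB, length=2, log_prob=latent_lp, beam_size=2)
# Level 2: for each plan, beam search over program tokens conditioned on it.
for plan, _ in plans:
    progs = beam_search(PLAN_VOCAB, length=2,
                        log_prob=program_lp_given(plan), beam_size=1)
    print(plan, "->", progs[0][0])
```

Note how flipping a single plan token in level 1 steers level 2 toward a different family of programs, which is exactly the exploration behavior the two-level search is meant to provide.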

2. BACKGROUND

Problem Setup. The goal in program synthesis is to find a program in a given language that is consistent with a specification. Formally, we are given a domain-specific language (DSL) which defines a space Y of programs. The task is described by a specification X ∈ X and is solved by some, possibly multiple, unknown program(s) Y ∈ Y. For example, each specification can be a set of input/output (I/O) examples, denoted X = {(I_1, O_1), ..., (I_N, O_N)}. We then say that we have solved specification X if we have found a program Y which correctly solves all the examples: Y(I_i) = O_i for all i = 1, ..., N. As another example, each specification can be a natural language description of a task, and the corresponding program implements said task. An example string transformation synthesis task with four I/O examples, together with a correct program in the string transformation DSL, is shown in Figure 1.

Figure 1: A string transformation task with 4 input-output examples and a possible program in the string transformation DSL that is consistent with the examples.

Vector Quantization. Traditionally, neural program synthesis techniques process the input specification as a set of sequences and predict the output program token by token (Devlin et al., 2017). In this work, we present a new approach to synthesis that performs structured planning in latent space using a discrete code. We conjecture that programs have an underlying discrete structure; specifically, programs are compositional and modular, with components that get reused across different problems. Our approach leverages this structure to guide the search over large program spaces. Following work in computer vision (van den Oord et al., 2017; Roy et al., 2018), we discover such discrete structure using a Vector Quantized Variational Autoencoder (VQ-VAE). VQ-VAEs work by feeding the intermediate representation of an autoencoder through a discretization bottleneck (van den Oord et al., 2017). For completeness, we provide background on VQ-VAEs below.

In a VQ-VAE, latent codes are drawn from a discrete set of learned vectors c ∈ R^{K×D}, called the codebook. Each element in the codebook can be viewed either as a token with id k ∈ [K] or as an embedding c_k ∈ R^D. To generate the discrete codes, the continuous autoencoder output e is quantized via nearest-neighbor lookup into the codebook. Formally, the token id q_k(e) and quantized embedding q_c(e) are defined as

    q_c(e) = c_{q_k(e)},   where   q_k(e) = argmin_{k ∈ [K]} ||e - c_k||_2.

For input x, the training loss of a VQ-VAE consists of: a reconstruction loss for the encoder-decoder weights, a codebook loss that encourages codebook embeddings to be close to the continuous vectors that are quantized to them, and a commitment loss that encourages the encoded input ec(x) to "commit" to codes, i.e., not to switch which discrete code it is quantized to. The loss is given by

    L(c, θ, φ) = -log p_θ(x | q_c(ec_φ(x))) + ||sg(ec_φ(x)) - c||_2^2 + β ||sg(c) - ec_φ(x)||_2^2,    (2)

where θ, φ are the parameters of the decoder and encoder, respectively, sg(·) is the stop-gradient operator that prevents its operand from being updated by gradients, and β controls the strength of the commitment loss. To stabilize training, van den Oord et al. (2017) also proposed removing the codebook loss and setting the codebook to an exponential moving average (EMA) of encoded inputs.
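The quantization step and the two auxiliary terms of the loss can be sketched in a few lines of NumPy. This is a minimal illustration of ours, not the paper's implementation: plain NumPy tracks no gradients, so the stop-gradients sg(·) are implicit, and a real implementation would use a straight-through estimator (or the EMA codebook update) during training.

```python
import numpy as np

def quantize(e, codebook):
    """Nearest-neighbor lookup: returns token id q_k(e) and embedding q_c(e)."""
    dists = np.linalg.norm(codebook - e, axis=1)  # ||e - c_k||_2 for each k
    k = int(np.argmin(dists))
    return k, codebook[k]

def vq_losses(e, codebook, beta=0.25):
    """Codebook and commitment terms of the VQ-VAE loss.

    The stop-gradients are implicit here; with autodiff, the codebook loss
    would update only the codebook and the commitment loss only the encoder.
    """
    _, c = quantize(e, codebook)
    codebook_loss = float(np.sum((e - c) ** 2))        # ||sg(e) - c||^2
    commitment_loss = beta * float(np.sum((c - e) ** 2))  # beta * ||sg(c) - e||^2
    return codebook_loss, commitment_loss

# Tiny worked example with K = 3 codes of dimension D = 2.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
e = np.array([0.9, 1.2])
k, c = quantize(e, codebook)
print(k, c)                 # index and embedding of the nearest code
print(vq_losses(e, codebook))
```

Here e = (0.9, 1.2) is closest to codebook entry 1, so q_k(e) = 1 and q_c(e) = (1.0, 1.0); both auxiliary losses then measure the same squared distance, weighted differently.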

