LATENT PROGRAMMER: DISCRETE LATENT CODES FOR PROGRAM SYNTHESIS

Abstract

In many sequence learning tasks, such as program synthesis and document summarization, a key problem is searching over a large space of possible output sequences. We propose to learn representations of the outputs that are specifically meant for search: rich enough to specify the desired output but compact enough to make search more efficient. Discrete latent codes are appealing for this purpose, as they naturally allow sophisticated combinatorial search strategies. The latent codes are learned using a self-supervised learning principle, in which first a discrete autoencoder is trained on the output sequences, and then the resulting latent codes are used as intermediate targets for the end-to-end sequence prediction task. Based on these insights, we introduce the Latent Programmer, a program synthesis method that first predicts a discrete latent code from input/output examples, and then generates the program in the target language. We evaluate the Latent Programmer on two domains: synthesis of string transformation programs, and generation of programs from natural language descriptions. We demonstrate that the discrete latent representation significantly improves synthesis accuracy.
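To make the discretization step concrete: the abstract's "discrete autoencoder" can be realized with vector quantization in the style of van den Oord et al. (2017), where each encoder vector is snapped to its nearest codebook entry and the resulting indices form the discrete latent code. The sketch below is a hypothetical illustration of that quantization step only; names, shapes, and the toy data are assumptions, not the paper's actual architecture.

```python
import numpy as np

def quantize(encoder_out, codebook):
    """Map each row of encoder_out (T x D) to the index of its nearest
    codebook row (K x D); the index sequence is the discrete latent code."""
    # Squared distances between every encoder vector and codebook entry.
    d = ((encoder_out[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d.argmin(axis=1)       # discrete latent code, length T
    return codes, codebook[codes]  # code indices and quantized vectors

# Toy check: encoder outputs near codebook rows 2, 5, 5 recover that code.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # K=8 codes, D=4 dimensions
enc = codebook[[2, 5, 5]] + 0.01 * rng.normal(size=(3, 4))
codes, quantized = quantize(enc, codebook)  # codes == [2, 5, 5]
```

In the full method these code indices would serve as the intermediate prediction targets described in the abstract; here the straight-through gradient and codebook updates used in training are omitted.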

1. INTRODUCTION

Our focus in this paper is program synthesis, one of the longstanding grand challenges of artificial intelligence research (Manna & Waldinger, 1971; Summers, 1977). The objective of program synthesis is to automatically write a program given a specification of its intended behavior, such as a natural language description or a small set of input-output examples. Search is an especially difficult challenge within program synthesis (Alur et al., 2013; Gulwani et al., 2017), and many different methods have been explored, including top-down search (Lee et al., 2018), bottom-up search (Udupa et al., 2013), beam search (Devlin et al., 2017), and many others (see Section 2).

We take a different philosophy: Can we learn a representation of programs specifically to help search? A natural way of representing a program is as a sequence of source code tokens, but the synthesis task requires searching over this representation, which can be difficult for longer, more complex programs. A programmer often starts by specifying the high-level components of a program as a plan, then fills in the details of each component. For example, in string editing, a plan might be to extract the first name, then the last initial. We propose to use a sequence of latent tokens, called a discrete latent code, to represent such plans. Instead of fixing a dictionary of codes in advance, we let the model discover which latent codes are useful and learn how to infer them from the specification.

Our hypothesis is that a discrete latent code, i.e., a sequence of discrete latent variables, can be a useful representation for search (van den Oord et al., 2017; Roy et al., 2018; Kaiser et al., 2018). This is because we can employ standard methods from discrete search, such as beam search, in a two-level procedure: first over a compact space of high-level plans, and then over programs conditioned on the chosen plan. We posit that the high-level search can help to organize the search over programs.
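The two-level procedure can be sketched as ordinary beam search applied twice: once over short latent codes, and once over program tokens conditioned on each surviving code. In the minimal sketch below, the scoring functions are hypothetical stand-ins for the learned models, and all names are illustrative assumptions rather than the paper's implementation.

```python
import heapq

def beam_search(score_fn, vocab, length, beam_size, prefix=()):
    """Generic beam search: keep the beam_size highest-scoring sequences,
    where score_fn(seq, tok) gives the log-score of appending tok to seq."""
    beams = [(0.0, prefix)]
    for _ in range(length):
        candidates = []
        for logp, seq in beams:
            for tok in vocab:
                candidates.append((logp + score_fn(seq, tok), seq + (tok,)))
        beams = heapq.nlargest(beam_size, candidates)
    return beams

def two_level_search(latent_score, token_score, latent_vocab, token_vocab,
                     latent_len, prog_len, beam_size):
    """Search over short latent codes first, then over programs conditioned
    on each code; return candidates ranked by the joint score."""
    latent_beams = beam_search(latent_score, latent_vocab, latent_len, beam_size)
    results = []
    for code_logp, code in latent_beams:
        cond = lambda seq, tok, code=code: token_score(code, seq, tok)
        for prog_logp, prog in beam_search(cond, token_vocab,
                                           prog_len, beam_size):
            results.append((code_logp + prog_logp, code, prog))
    return heapq.nlargest(beam_size, results)

# Hypothetical toy scorers: the latent model prefers code token 'x'; the
# program model prefers tokens that echo the code position-wise.
latent_score = lambda seq, tok: 0.0 if tok == 'x' else -1.0
token_score = lambda code, seq, tok: (0.0 if tok == code[len(seq) % len(code)]
                                      else -2.0)
top = two_level_search(latent_score, token_score, ['x', 'y'], ['x', 'y'],
                       latent_len=2, prog_len=3, beam_size=2)
# top[0] is (0.0, ('x', 'x'), ('x', 'x', 'x')): best code, best program.
```

Note that changing one latent token redirects the entire conditioned program search, which is the mechanism the next paragraph appeals to.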
In the string editing example earlier, a model could be confident that it needs to extract the last initial, but less sure about whether it also needs to extract a first name. By changing a single token in the latent code, two-level search can explore alternative programs that behave differently at the beginning. In traditional single-level search, by contrast, the model would need to change entire multi-token prefixes to reach those alternatives, which is difficult under a limited search budget.

We propose the Latent Programmer, a program synthesis method that uses learned discrete representations to guide search via two-level synthesis. The Latent Programmer is trained by a self-

