ADVERSARIAL SYNTHETIC DATASETS FOR NEURAL PROGRAM SYNTHESIS

Abstract

Program synthesis is the task of automatically generating a program consistent with a given specification. A natural way to specify programs is to provide examples of desired input-output behavior, and many current program synthesis approaches have achieved impressive results after training on randomly generated input-output examples. However, recent work has discovered that some of these approaches generalize poorly to data distributions different from that of the randomly generated examples. We show that this problem applies to other state-of-the-art approaches as well and that current methods to counteract it are insufficient. We then propose a new, adversarial approach to control the bias of synthetic data distributions and show that it outperforms current approaches.

1. INTRODUCTION

Program synthesis has long been a key goal of AI research. In particular, researchers have become increasingly interested in the task of programming by example (PBE), where the goal is to generate a program consistent with a given set of input-output (I/O) pairs. Recent studies have achieved impressive results, solving PBE problems that humans would find difficult (e.g., Sharma et al. (2017); Zohar & Wolf (2018); Ellis et al. (2019)). However, these studies have a concerning weakness: since large, naturally occurring datasets of program synthesis problems do not exist, they train and test their models on synthetic datasets of randomly generated programs and I/O pairs.

The justification for using these synthetic datasets is that if a model can correctly predict programs for arbitrary PBE problems, then it has likely learned the semantics of the programming language and can generalize to problems outside the synthetic data distribution (Devlin et al., 2017). While this justification is plausible, a model might also perform well because it has learned specific aspects of the synthetic data distribution, and recent studies have found this to be the case for several state-of-the-art models (Shin et al., 2019; Clymo et al., 2019). These studies find that current PBE models often perform poorly on distributions different from that of the training data, and they propose methods to mitigate this issue by generating synthetic data with more varied distributions. The idea behind these methods is that a model trained on more varied synthetic data should generalize to a wider variety of distributions, hopefully including those of real-world PBE problems. Nevertheless, we find that these methods are often insufficient. Previous studies differ on what constitutes a "varied distribution" of synthetic data, creating definitions based on problem-specific heuristics.
While generating training data based on these heuristics does help models generalize to certain distributions, we find that models trained using these methods still fail to generalize to many other distributions, including those resembling distributions of real-world problems. Moreover, different methods fail to generalize to different distributions, raising the question of how one should construct test sets to evaluate these methods. While previous studies have arbitrarily picked test sets that they believe present a reasonable challenge for state-of-the-art methods, this approach may lead to overly optimistic evaluations: a study may report that a method performed well simply because the researchers failed to find the distributions on which it performs poorly.

In this paper, we propose an adversarial method to generate a training set. Our approach builds the training set iteratively: on each iteration, it finds data distributions on which a given model performs poorly and adds data drawn from those distributions to the training set. We test this method by using it to generate training data for the PCCoder model of Zohar & Wolf (2018), and we show that models trained using our method generalize to a variety of distributions better than models trained using previously proposed methods. Moreover, we propose using a variation of our adversarial approach to generate test sets for evaluating PBE methods. We create test sets for different versions of PCCoder using this approach and show that these test sets reveal weaknesses in models that are not obvious when using other test sets.

This paper makes the following key contributions:

1. We propose a new, adversarial method to generate desirable distributions on which to train models for PBE.
2. We show that models trained using our method generalize to a variety of datasets better than models trained using previously proposed methods.
3. We show that our adversarial approach may also be used to generate test sets that are less likely to overestimate the performance of a model.
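The iterative loop described above can be sketched as follows. This is a minimal illustration with hypothetical interfaces (`train`, `evaluate`, `sample`, and the distribution parameter space are all placeholders), not the authors' implementation:

```python
import random

def adversarial_training_set(train, evaluate, sample, param_space,
                             n_rounds=3, n_candidates=10, n_per_round=50):
    # Seed the training set with data from an arbitrary distribution.
    data = sample(random.choice(param_space), n_per_round)
    for _ in range(n_rounds):
        model = train(data)  # retrain on the current training set
        # Among candidate distributions, find the one the model handles worst.
        worst = min(random.sample(param_space, n_candidates),
                    key=lambda p: evaluate(model, sample(p, 20)))
        # Add data drawn from that hardest distribution, then repeat.
        data += sample(worst, n_per_round)
    return data
```

Generating a test set with the same loop amounts to keeping the hardest distributions found rather than retraining on them.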

2. RELATED WORK

Recently, Clymo et al. (2019) proposed a method to generate PBE problems using an SMT solver. They impose constraints on the I/O pairs to ensure that pairs selected for the dataset are not too similar to each other and then use an SMT solver to select I/O pairs that satisfy these constraints. However, when an implementation of this method is tested on the DeepCoder domain, the reported improvement of the constraint-based methods over simpler sampling methods is marginal, with the best constraint-based method performing only 2.4% better than the best sampling method. Moreover, many of their constraints are highly specific to the DeepCoder domain, and Clymo et al. do not offer a way to adapt their method to other problem spaces. Nevertheless, their work demonstrates that the distribution of synthetic training data can be shaped through explicit constraints.
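Clymo et al. encode their dissimilarity constraints for an SMT solver. As a simplified illustration of the underlying idea only, the sketch below swaps the solver for a greedy filter; the function name and the `distance` interface are hypothetical, not from their paper:

```python
def select_dissimilar_pairs(candidates, distance, min_distance, k):
    """Greedily keep up to k I/O pairs that are pairwise at least
    min_distance apart under the supplied distance function."""
    selected = []
    for pair in candidates:
        # Keep a candidate only if it is far enough from everything kept so far.
        if all(distance(pair, s) >= min_distance for s in selected):
            selected.append(pair)
            if len(selected) == k:
                break
    return selected
```

Unlike an SMT formulation, a greedy filter cannot prove that no valid selection exists, but it conveys how a dissimilarity constraint prunes near-duplicate examples.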

Parisotto et al. (2016) create a dataset for the Flashfill domain (Gulwani et al., 2012) by enumerating programs up to 13 expressions long and then randomly sampling inputs to create I/O pairs. They report that while their model achieves 97% accuracy on their synthetic data, it achieves only 38% accuracy on a small dataset of 238 real-world problems. Devlin et al. (2017) use a data generation approach similar to Parisotto et al. but with an improved model and are more successful, achieving 92% accuracy on the same real-world dataset. All of the papers above focus on advancing models for PBE, but they do so largely using synthetic data to train those models.

Shin et al. (2019) report that even minor differences between the synthetic data distributions used for training and evaluation can drastically decrease a model's performance. To solve this problem, they propose a data generation method to improve a model's ability to generalize to other data distributions. They first choose a set of "salient variables" for the domain, defined as a mapping from I/O pairs in the synthetic dataset to a finite, discrete set. They then sample I/O pairs such that the salient variables will be approximately uniformly distributed in the resulting dataset. Shin et al. find that the model proposed by Bunel et al. (2018) generalizes better to a variety of distributions when trained with data generated by this method. However, this method has two major disadvantages. First, it requires the user to determine the correct salient variables, which may be difficult for complex domains. Second, if the domain of valid I/O pairs is highly dependent on the program, it is often prohibitively complex to enforce uniformity across salient variables.
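The uniformization step of Shin et al. can be approximated by simple rejection sampling. The sketch below is a hedged illustration with a hypothetical interface (`generate_pair` and `salient` are placeholders), not their implementation: it caps how many examples fall into each salient-variable bucket, so an otherwise skewed generator yields a roughly uniform dataset:

```python
import random
from collections import Counter

def sample_uniform_salient(generate_pair, salient, n, max_per_bucket):
    # Draw random I/O pairs, but reject any pair whose salient-variable
    # bucket is already full, so buckets end up approximately uniform.
    counts, dataset = Counter(), []
    while len(dataset) < n:
        pair = generate_pair()
        bucket = salient(pair)
        if counts[bucket] < max_per_bucket:
            counts[bucket] += 1
            dataset.append(pair)
    return dataset
```

The second disadvantage noted above shows up directly here: if valid I/O pairs for some bucket are rare under `generate_pair`, the rejection loop may run for a prohibitively long time.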

