SIMPLIFYING MODELS WITH UNLABELED OUTPUT DATA

Abstract

We focus on prediction problems with high-dimensional outputs that are subject to output validity constraints, e.g., a pseudocode-to-code translation task where the code must compile. For these problems, labeled input-output pairs are expensive to obtain, but "unlabeled" outputs, i.e., outputs without corresponding inputs, are freely available and provide information about output validity (e.g., code on GitHub). In this paper, we present predict-and-denoise, a framework that can leverage unlabeled outputs. Specifically, we first train a denoiser to map possibly invalid outputs to valid outputs using synthetic perturbations of the unlabeled outputs. Second, we train a predictor composed with this fixed denoiser. We show theoretically that for a family of functions with a high-dimensional discrete valid output space, composing with a denoiser reduces the complexity of a 2-layer ReLU network needed to represent the function, and that this complexity gap can be arbitrarily large. We evaluate the framework empirically on several datasets, including image generation from attributes and pseudocode-to-code translation. On the SPOC pseudocode-to-code dataset, our framework improves the proportion of code outputs that pass all test cases by 3-5 percentage points over a baseline Transformer.

1. INTRODUCTION

We study problems whose outputs have validity constraints. For example, in pseudocode-to-code translation, the output code must compile. Other examples include natural language translation and molecule generation, where outputs should be grammatically correct or chemically valid, respectively. State-of-the-art models typically learn the input-output mapping from expensively-obtained labeled data Kulal et al. (2019); Vaswani et al. (2017); Méndez-Lucio et al. (2020); Senior et al. (2020), which may not contain enough examples to learn a complex validity structure on high-dimensional output spaces. However, there are often many "unlabeled" outputs, i.e., outputs without a corresponding input (e.g., GitHub has over 40 million public code repositories). How do we leverage these, together with a much smaller number of labeled input-output pairs, to improve accuracy and validity? In this paper, we present predict-and-denoise, a framework in which we compose a base predictor, which maps an input to a possibly invalid output, with a denoiser, which maps the possibly invalid output to a valid output. We first train the denoiser on synthetic perturbations of unlabeled outputs. Second, we train the base predictor composed with the fixed denoiser on the labeled data (Figure 1 left). By factorizing into two modules, base predictor and denoiser, the framework allows the base predictor to be simpler by offloading the complexity of modeling the output validity structure to the denoiser, which has the benefit of being trained on much more data. We aim to lay down a principled framework for using unlabeled outputs, with theoretical justification for improving sample efficiency by reducing the complexity of the learned base predictor. Figure 1 (middle, right) shows a pictorial example: a staircase function whose valid outputs are integers requires a complex spline to represent directly.
When composed with a denoiser (which rounds to the nearest integer), a simple linear base predictor can represent the staircase function. We theoretically show that our framework reduces the complexity of a 2-layer ReLU network needed to represent a family of functions on a discrete valid output set in high dimensions. This complexity gap can be arbitrarily large depending on the stability of the target function being learned. We expect such a lower-complexity function to be learnable with fewer samples, improving generalization. Empirically, we show on image generation and two pseudocode-to-code datasets (synthetic and SPOC Kulal et al. (2019)) that predict-and-denoise improves test performance across continuous and discrete output data modalities. In image generation, our framework improves the clarity and styling of font images by learning a low-complexity base predictor to generate an abstract image while the denoiser sharpens the image. For pseudocode-to-code, we consider the more difficult full-program translation task rather than the line-by-line translation (with compiler side information) studied by previous work Kulal et al. (2019); Yasunaga and Liang (2020). We first study a synthetic pseudocode-to-code dataset where the denoiser simplifies the base predictor by helping with global type inference. On SPOC, a recent pseudocode-to-code dataset on programming competition problems, we improve the proportion of correct programs by 3-5 percentage points over a baseline Transformer.

2. SETUP

We consider prediction problems from an input space X (e.g., pseudocode) to an output space Y (e.g., code) where there is an unknown subset of valid outputs V ⊆ Y (e.g., code that compiles), and the true output is always valid (in V). We have a labeled dataset (x_1, y_1), ..., (x_n, y_n) where x_i ∈ X and y_i ∈ V, and access to many unlabeled outputs ỹ_1, ..., ỹ_m from V. We do not assume access to any black-box function for testing validity (whether y ∈ V or not), allowing for general problems (e.g., language generation) where output validity is imprecisely characterized. A predictor f : X → Y from a chosen hypothesis class H maps inputs to the ambient output space. Our goal is to improve the predictor by leveraging information about the valid space V from the unlabeled examples {ỹ_i}_{i=1}^m. We leverage a denoiser Π : Y → V, which projects a possibly invalid output in Y to the valid set V. We can use unlabeled outputs to learn an approximate denoiser. Base, composed, and direct predictors. Let ‖·‖ be a norm on H. Let Π ∘ f_base be a composed predictor that represents the target function f (that is, Π ∘ f_base = f on X). In the context of a composed predictor, we call f_base the base predictor. We compare against a minimum-norm direct predictor which represents f, f_direct ∈ argmin_{g ∈ H} { ‖g‖ : g(x) = f(x) for all x ∈ X }.

3. DENOISERS CAN REDUCE MODEL COMPLEXITY

In this section, we study direct and composed predictors from an approximation standpoint and use complexity measures on predictors as surrogates for sample complexity. We aim to represent a target function f : X → V. We assume access to a denoiser Π : Y → V which projects to the nearest valid output under an appropriate metric on the output space (breaking ties arbitrarily). In Section 3.1, we give a simple example of when composing with a denoiser (Π ∘ f_base) can drastically reduce the complexity of the learned predictor. Since f_base becomes easier to approximate, we may expect better generalization Bartlett et al. (2017); Neyshabur et al. (2017); Wei and Ma (2019; 2020). In Section 3.2, we show theoretically for two-layer ReLU networks that the complexity required to represent f directly can be arbitrarily larger than the complexity required by a composed predictor, depending on the stability of f.

3.1. MOTIVATING EXAMPLE

Figure 1 shows a staircase function f that requires a complex direct predictor f_direct, while the minimum-norm base predictor f*_base has low complexity. For 0 < δ < 1, let the input space X = ∪_{i=1}^N [i − (1−δ)/2, i + (1−δ)/2] be a union of N disjoint intervals, and let the valid outputs V = Z be the integers, a subset of the ambient output space Y = R. The staircase function f(x) = ⌊x⌉, defined on X, rounds a linear function onto the integers. Following Savarese et al. (2019), we define the norm of a univariate function f : R → R as

‖f‖ = (1/2) max( ∫_{−∞}^{∞} |f″(x)| dx , |f′(−∞) + f′(+∞)| ).    (1)

This norm measures the (lack of) stability of f: complex functions have a higher norm. Consider representing f with linear splines, a family of piecewise linear functions. For linear splines, the norm in Equation (1) becomes roughly the sum of absolute changes in slope between piecewise segments. If we represent f directly with a linear spline f_direct, the norm of f_direct must be large due to the large number of slope changes: ‖f_direct‖ = (N − 1)/δ (Figure 1 middle). Suppose we have access to a denoiser Π(y) = ⌊y⌉, which projects onto V = Z. Then a linear base predictor f*_base composed with Π can represent the staircase on X, reducing the norm to 1 (Figure 1 right). By not requiring f*_base to represent the local complexity and discreteness in f, the base predictor better captures the underlying globally linear structure of f.
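The staircase example can be checked numerically. Below is a minimal sketch (the interval count and sample sizes are arbitrary choices for illustration): a simple linear base predictor composed with a rounding denoiser reproduces the staircase exactly on the input intervals.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.2
N = 5

# Inputs lie in disjoint intervals [i - (1-delta)/2, i + (1-delta)/2], i = 1..N,
# and the target staircase f maps every x in interval i to the integer i.
xs = np.concatenate([i + rng.uniform(-(1 - delta) / 2, (1 - delta) / 2, 20)
                     for i in range(1, N + 1)])
f_target = np.repeat(np.arange(1, N + 1), 20).astype(float)

def denoiser(y):
    # Project a possibly invalid output onto the valid set V = Z.
    return np.round(y)

def base_predictor(x):
    # Minimum-norm base predictor: the identity map, a simple linear function.
    return x

composed = denoiser(base_predictor(xs))
print(np.array_equal(composed, f_target))  # True: the composition represents f
```

A direct predictor would instead have to represent all N − 1 jumps itself, which is exactly the source of the complexity gap discussed above.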

3.2. ANALYSIS FOR 2-LAYER RELU NETWORKS

We extend to more general hypothesis classes and high-dimensional outputs. Our setting is motivated by the task of generating images of font characters from attributes, which we study empirically in Section 5.1. In font image generation, there is a discrete set of valid font images in the continuous ambient output space. Formally, we take the valid set V = {y*_1, ..., y*_N} to be a discrete set of N output values in R^k, and f is a piecewise constant function defined on N disjoint intervals X = ∪_{i=1}^N [x^l_i, x^u_i] (in ascending order), where there is a δ > 0 gap between each interval and the next. The target function f is defined such that if x ∈ [x^l_i, x^u_i], then f(x) = y*_i. We study 2-layer ReLU networks, often studied as a first step towards understanding the expressivity of neural networks Neyshabur et al. (2014); Savarese et al. (2019); Eldan and Shamir (2016). Following Savarese et al. (2019), we define f_θ ∈ H as

f_θ(x) = Σ_{l=1}^h ( w^(2)_l [⟨w^(1)_l, x⟩ + b^(1)_l]_+ + b^(2)_l ) for x ∈ R^d,

where we take d = 1 throughout. Here, [x]_+ = max(x, 0) is the element-wise ReLU nonlinearity. The parameters θ contain the hidden layer size h ∈ N and all weights and biases. We let W^(1) ∈ R^{h×d} denote the matrix with w^(1)_l ∈ R^d as rows, and let b^(1), b^(2), w^(2) ∈ R^h be vectors with b^(1)_l, b^(2)_l, w^(2)_l ∈ R as elements, respectively. We let Θ denote this parameter space. Measure of complexity. Following Savarese et al. (2019), the complexity of a network is associated with the squared Euclidean norm of the weights C(θ) = (1/2)(‖w^(2)‖²_2 + ‖W^(1)‖²_F). The norm of f ∈ H is the minimum complexity required to represent f:

‖f‖ = inf_{θ ∈ Θ} C(θ) s.t. f_θ = f.    (2)

Savarese et al. (2019) showed that this norm is equivalent to Equation (1) for univariate networks. Since these complexity measures typically appear in generalization bounds Bartlett et al. (2017); Neyshabur et al. (2017), we expect reducing them to improve generalization error.
Minimum complexity reduces with a denoiser. Given the denoiser Π(y) ∈ argmin_{y* ∈ V} ‖y* − y‖_2, which projects onto V (breaking ties arbitrarily), we compare the norm of f_direct, which represents f directly, with the norm of the minimum-norm base predictor that represents f through the composition: f*_base = argmin_{g ∈ H} { ‖g‖ : Π ∘ g(x) = f(x) for all x ∈ X }. Note that ‖f*_base‖ ≤ ‖f_direct‖ since f_direct is a feasible solution; thus composing cannot increase the norm. Adjacent intervals measure stability. Our result depends crucially on the number of non-adjacent pairs of intervals of f. Suppose the output dimension is k = 1. We call a pair of interval indices (i, i+1) adjacent if there is no valid output value y ∈ V such that either y*_i < y < y*_{i+1} or y*_{i+1} < y < y*_i holds. The number of non-adjacent interval pairs characterizes the instability of f. Let |J| be the number of non-adjacent pairs and |I| the number of adjacent pairs, where |I| + |J| = N − 1. Our bound also depends on L = min_i |y*_i − y*_{i+1}| and U = max_i |y*_i − y*_{i+1}|, the minimum and maximum separation between consecutive valid points. For higher output dimensions (k > 1), let y*_{i,j} be the j-th coordinate of the i-th valid point and let |J_j|, |I_j|, L_j, U_j be the analogous quantities for each output coordinate j ∈ [k].

Theorem 1. Let the valid output space V ⊂ R^k be a set of N multivariate output values {y*_1, ..., y*_N}. Let f : R → R^k be a piecewise constant function defined on X = ∪_{i=1}^N [x^l_i, x^u_i] where f(x) = y*_i if x ∈ [x^l_i, x^u_i]. Let Δ_x be the length of the smallest interval in X. Then

‖f_direct‖ / ‖f*_base‖ = Ω( N max_j L_j / Σ_{j=1}^k U_j ( |J_j| + δ|I_j|/Δ_x ) ).

See Appendix A for a proof. If the |J_j| are sublinear in N and the valid points are evenly spaced, then the gap is Ω(1/δ), which can be arbitrarily large for a fixed output dimension as δ → 0 and N → ∞. If any |J_j| is linear in N (many non-adjacent intervals), then there is only a constant-factor gap in the worst case.
Overall, if f is stable with respect to its discrete output space, we can learn a simpler base predictor that still represents f when composed with the denoiser. Note that in practice, we must regularize the base predictor to find this low-complexity solution.

4. PREDICT-AND-DENOISE FRAMEWORK

In Section 3, we assumed access to a denoiser Π that maps an output y ∈ Y to a valid output Π(y) ∈ V, allowing the minimum-norm base predictor f*_base to have much lower complexity. In this section, we are not given a denoiser but instead have access to a large number of unlabeled outputs ỹ_1, ..., ỹ_m ∈ V. We present predict-and-denoise, a framework for utilizing unlabeled output examples to simplify models. In this framework, we first use self-supervised learning on the unlabeled outputs to learn an approximate denoiser Π, and then use Π (which is now fixed) to learn a composed predictor Π ∘ f_θ. Here, f_θ is the learned base predictor with parameters θ. Figure 2 (left-middle) gives an example input-output pair in a pseudocode-to-code task. Using the predict-and-denoise framework, the model can learn to make code translations on a mostly local, line-by-line basis (a simpler solution) while relying on the denoiser to correct types globally. Learning the denoiser. Assume that, as domain knowledge, we have a noising distribution q(ỹ′ | ỹ) over corrupted outputs ỹ′ given a valid output ỹ. Figure 2 (middle) gives an example of an output program in a pseudocode-to-code translation task. Here, a noising distribution may make random semantic and syntactic corruptions such as changing types or removing semicolons and parentheses (Figure 2 right). The denoising objective is to recover the original code from the corrupted code. More generally, given the noising distribution, we train a probabilistic model p_β(ỹ | ỹ′) on output pairs (ỹ′, ỹ) where ỹ′ ∼ q(· | ỹ). We train the probabilistic model by maximizing the log-likelihood

maximize_β E_ỹ [ E_{ỹ′ ∼ q(·|ỹ)} [ log p_β(ỹ | ỹ′) ] ]    (5)

using unlabeled output samples. The denoiser Π_β(ỹ′) = argmax_y p_β(y | ỹ′) is defined via the probabilistic model. Learning the composed predictor. In this step, we fix the learned denoiser Π_β and learn the composed predictor Π_β ∘ f_θ on labeled examples.
We train a probabilistic model p_θ for the base predictor by optimizing

maximize_θ E_{x,y} [ E_{y′ ∼ p_θ(·|x)} [ log p_β(y | y′) ] ] + λ E_{x,y} [ log p_θ(y | x) ].    (6)

The first term maximizes a lower bound on the log-likelihood of the composed predictor via p_β and p_θ (see Appendix D). We optimize a lower bound since optimizing the log-likelihood directly requires computing an intractable partition function over the high-dimensional output space. The second term is the log-likelihood of p_θ alone. We define the base predictor as f_θ(x) = argmax_y p_θ(y | x). Choice of noising distribution. Learning the base predictor composed with the denoiser tolerates some distribution mismatch between the errors of the base predictor and the noising distribution the denoiser is trained on: by learning in a composed manner, the base predictor can adapt to the choice of noising distribution. In our experiments in Appendix B, we find that predict-and-denoise gives gains across a variety of noising distributions.
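For intuition about the two terms in the objective, here is a toy discrete example with three possible outputs in which p_θ(·|x) and p_β(y|y′) are explicit probability tables, so both terms, and the Jensen gap of the lower bound, can be computed exactly. All numbers are made up for illustration.

```python
import numpy as np

# Toy discrete setting with outputs {0, 1, 2} for a single input x.
p_theta = np.array([0.7, 0.2, 0.1])    # base predictor's distribution p_theta(. | x)
p_beta = np.array([[0.9, 0.05, 0.05],  # row y': denoiser's distribution p_beta(. | y')
                   [0.8, 0.10, 0.10],
                   [0.1, 0.10, 0.80]])
y, lam = 0, 1.0                        # true label and trade-off weight

# First term: E_{y' ~ p_theta}[log p_beta(y | y')], a lower bound (by Jensen's
# inequality) on the composed log-likelihood log E_{y' ~ p_theta}[p_beta(y | y')].
composed_term = np.sum(p_theta * np.log(p_beta[:, y]))
composed_loglik = np.log(p_theta @ p_beta[:, y])
assert composed_term <= composed_loglik

# Second term: the direct log-likelihood of the base predictor.
direct_term = np.log(p_theta[y])

objective = composed_term + lam * direct_term
```

In the tabular case the expectation over y′ is an exact sum; for real high-dimensional output spaces it must be sampled or relaxed, which is where the partition-function intractability mentioned above comes from.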

5. EXPERIMENTS

We evaluate predict-and-denoise on image generation from given attributes and on full-program pseudocode-to-code translation, showing its benefits on both continuous and discrete output spaces. In image generation, composed models generate clearer images with fewer artifacts from few labeled examples. For full-program pseudocode-to-code translation on SPOC Kulal et al. (2019), a recent pseudocode-to-code dataset, our framework improves the proportion of correctly generated programs by 3-5 percentage points over a baseline Transformer and achieves results comparable to or better than a line-by-line translation model from previous work Kulal et al. (2019).

5.1. IMAGE GENERATION FROM ATTRIBUTES

We evaluate predict-and-denoise on font image generation, where the ambient output space is continuous. This task closely mirrors the theoretical setup: the input is low-dimensional (index of the font and character type) and the output is high-dimensional (an image). We also validate the theory from Section 3, which suggested that regularization is required to realize the complexity reduction of the minimum-norm base predictor f*_base. Qualitatively, image samples from our composed predictor are clearer and have fewer artifacts. Prediction task and denoising objective. We map two one-hot vectors, one for the character identity (62 possible characters) and one for the font (100 fonts), to a 32×32 grayscale image of that character in that font. Here, valid font images have cleanly defined lines and adhere to the font styling. We train using the pixel-wise squared error loss for all models and tune the L2 regularization strength on a validation set. To train the composed predictor, we set λ = 0 in (6), using only the composed loss. The denoising objective is to sharpen unlabeled font images distorted by a Gaussian blur filter with radii sampled randomly from [0, 2]. We also report gains with other noising functions (embossing, contrast perturbations) in Appendix B. Data. We use a dataset of 56k fonts originally scraped from the Internet Bernhardsson (2016). Models and metrics. The base predictor f_θ and the direct predictor f_direct are both 7-layer fully-connected networks (see Appendix B). The denoiser Π_β is a 3-layer U-Net Ronneberger et al. (2015). We test image sample quality directly by computing the pixel-wise squared error with respect to ground-truth test images. Results. With the regularization strength tuned on the validation set (Figure 4 left), the composed predictor achieves an 11% reduction in test MSE compared to the best direct predictor. The direct predictor's test MSE increases when its outputs are processed by the denoiser at test time.
We visualize the predicted images for some randomly selected fonts (Figure 3). The base predictor trained with L2 regularization outputs noisy gray images, suggesting that it has learned a lower-complexity model. In contrast, L2 regularization does not improve the direct predictor (Figure 4 right), since directly outputting clearly defined lines and transitions between black and white pixels requires a relatively high-complexity model. Note that we study L2 regularization as motivated by the theory, but we expect any reasonable regularization method to help; indeed, adding dropout to the composed model improves the MSE further to 0.165. Additional results on varying the labeled and unlabeled data size are in Appendix B, where the composed model improves upon the direct model in all instances.
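The Gaussian-blur noising used for the font denoiser can be sketched with a plain NumPy separable convolution. Truncating the kernel at 3σ and treating the blur radius as the Gaussian σ are simplifying assumptions made here; an image library's blur filter would serve equally well.

```python
import numpy as np

def gaussian_kernel(radius):
    # 1D Gaussian kernel, truncated at 3 sigma; the blur radius is treated as sigma.
    sigma = max(radius, 1e-6)
    r = int(np.ceil(3 * sigma))
    x = np.arange(-r, r + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def blur(img, radius):
    # Separable Gaussian blur with edge padding: convolve rows, then columns.
    k = gaussian_kernel(radius)
    r = len(k) // 2
    pad = np.pad(img, r, mode="edge")
    rows = np.array([np.convolve(row, k, mode="valid") for row in pad])
    return np.array([np.convolve(col, k, mode="valid") for col in rows.T]).T

rng = np.random.default_rng(0)
clean = (rng.random((32, 32)) > 0.5).astype(float)   # stand-in for a font glyph
noisy = blur(clean, radius=rng.uniform(0.0, 2.0))    # denoiser trains on (noisy, clean)
```

Each unlabeled image yields a (noisy, clean) training pair for the U-Net denoiser, with a fresh radius sampled per example.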

5.2. PSEUDOCODE-TO-CODE

We evaluate predict-and-denoise on pseudocode-to-code translation, where the ambient output space is discrete. We evaluate on two pseudocode-to-code datasets (synthetic in Section 5.2.1 and SPOC in Section 5.2.2). On SPOC, our framework improves the proportion of programs that pass all test cases by 3-5 percentage points over a baseline Transformer and achieves results competitive with or better than line-by-line models Kulal et al. (2019). Prediction task and denoising objective. We consider full-program pseudocode-to-code translation, where the inputs X are human-generated pseudocode. The ambient output space Y is all possible strings, and the set of valid outputs V is the set of strings that compile with the g++ compiler. In contrast to previous works, which decompose the problem into line-by-line translation and use information from the compiler Kulal et al. (2019); Yasunaga and Liang (2020), we translate the entire program at once without compiler access. Following Yasunaga and Liang (2020), the denoising objective for both pseudocode-to-code datasets consists of repairing random semantic and syntactic corruptions of unlabeled code examples (see Appendix E). Models and regularization. We use a Transformer Vaswani et al. (2017) for both the base predictor and the denoiser. In all models, we use a combination of weight decay, dropout, attention dropout, and ReLU dropout as regularization. To train the composed predictor, we use λ = 1 to balance fitting the composed and direct objectives. During inference, we use greedy decoding for simplicity (no beam search). Problem-specific optimizations such as beam search and querying a compiler during inference could improve the results further. Pretraining models. In machine translation, a standard way to incorporate unlabeled outputs is to pretrain the encoder/decoder on monolingual data Ramachandran et al. (2018); Skorokhodov et al. (2018); Devlin et al. (2019).
We consider a pretrained predictor, which is pretrained with the denoising objective on unlabeled code and then trained on labeled examples, utilizing a shared encoder/decoder vocabulary. We employ predict-and-denoise on top by initializing from the pretrained model (Pretrained+Composed), which provides complementary benefits beyond pretraining. Back-translation models. Back-translation methods apply an output-to-input model (learned on the labeled data) to unlabeled outputs to generate additional synthetic inputs Sennrich et al. (2016b). We employ predict-and-denoise on top by initializing from a back-translation model (Backtranslation+Composed), showing complementary benefits. Metrics. A generated program has three possible outcomes: compilation error, execution error, or correct. A program is correct if, when executed on a set of input test cases, its outputs match the gold outputs. We measure the proportion of programs that fall into each outcome.
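This evaluation can be sketched as below, under the assumption that g++ is available on PATH; programs whose outputs mismatch the gold outputs are bucketed separately as "incorrect" here, and error handling is kept minimal.

```python
import os
import subprocess
import tempfile
from collections import Counter

def run_outcome(code, test_cases):
    # Classify one generated program: compile it, then run it on (stdin, gold) pairs.
    with tempfile.TemporaryDirectory() as d:
        src, exe = os.path.join(d, "main.cpp"), os.path.join(d, "main")
        with open(src, "w") as fh:
            fh.write(code)
        if subprocess.run(["g++", src, "-o", exe]).returncode != 0:
            return "compile_error"
        for stdin, gold in test_cases:
            try:
                r = subprocess.run([exe], input=stdin, capture_output=True,
                                   text=True, timeout=5)
            except subprocess.TimeoutExpired:
                return "execution_error"
            if r.returncode != 0:
                return "execution_error"
            if r.stdout.strip() != gold.strip():
                return "incorrect"
        return "correct"

def outcome_proportions(outcomes):
    # Proportion of programs falling into each outcome bucket.
    counts = Counter(outcomes)
    return {k: v / len(outcomes) for k, v in counts.items()}
```

Aggregating `run_outcome` over a test set with `outcome_proportions` yields the reported compile-error, execution-error, and correct rates.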

5.2.1. SYNTHETIC DATASET

Pseudocode specifies local information, but there are global consistency constraints to enforce (Figure 2). Modeling everything directly requires a complex model. With predict-and-denoise, the base predictor f_θ can do local translation while the denoiser Π enforces global constraints such as type correctness. To test this intuition, we generate a synthetic pseudocode-to-code dataset where the pseudocode specifies everything but the declaration types (see Figure 2). Dataset generation. The synthetic programs involve 1-4 variables (bools, ints, and strings) drawn from 10 possible variable names, which are first initialized (by reading stdin) and then processed by 1-5 random operations, including 3 unary operations per type and 2 binary operations on ints. There are 100 possible integer values and 10 possible string values. We generate 1000 labeled examples and 20000 unlabeled code examples. Results. Table 1 shows the results for all models. The Pretrained+Composed predictor improves the proportion of correct programs over direct training by 51% and over pretraining by 9.6%. We can also apply the learned denoiser to the outputs of the direct and pretrained predictors at test time, which reduces the improvements to 29.6% and 7.6%, respectively. Similarly, Backtranslation+Composed improves upon direct training by 55.2% and upon the strong backtranslation baseline by 2.6%. The Composed model without pretraining or backtranslation still achieves a 31.4% increase over the direct model, but combining with pretraining or backtranslation is needed for the best performance. This suggests that predict-and-denoise offers a benefit complementary to other ways of using unlabeled output data. Results on varying unlabeled and labeled data sizes, where the composed model improves over the baselines in all instances, are in Appendix C. Figure 5 gives an example input with the outputs of the base and composed predictors.
With the denoiser, the base predictor does not have to output all the correct variable types. Here, the denoiser correctly instantiates var_5 and corrects the type of temp. Denoising objective. We use random syntactic and semantic corruptions of ∼280k additional unlabeled code examples from codeforces.com, as in Yasunaga and Liang (2020). Previous program repair works Yasunaga and Liang (2020) utilize compiler error messages to guide the repair model; we only use the code itself as input, which makes the task harder. We define p_β in two parts. First, we train a binary classifier g_γ : Y → {0, 1} on the denoising dataset, which detects whether a program has an error (error is label 1). Results. On both test sets, predict-and-denoise (composed) models improve the proportion of correct code over the direct predictor by 3-5 percentage points, and applying predict-and-denoise to pretrained and backtranslation models improves them by about 1-2 points (Table 2). Predict-and-denoise without pretraining or backtranslation still improves over pretraining, but improving over backtranslation requires combining the complementary benefits of backtranslation and predict-and-denoise. Applying the denoiser to the direct and pretrained models at test time did not improve their performance. Backtranslation+Composed matches the top-1 performance of a line-by-line LSTM with attention-based copying Kulal et al. (2019) on TESTP and improves upon it by 11.4% on TESTW, despite tackling the more difficult full-program generation task.
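The random code corruptions used for the denoising objective can be sketched as a simple token-level noising distribution. The token categories and the type-swap table below are hypothetical stand-ins for illustration, not the exact corruption set used in the experiments.

```python
import random

TYPE_SWAPS = {"int": "string", "string": "bool", "bool": "int"}  # hypothetical

def corrupt(code, rng):
    # Sample from q(y~' | y~): swap one declared type or drop one semicolon.
    tokens = code.split()
    if rng.random() < 0.5:
        idx = [i for i, t in enumerate(tokens) if t in TYPE_SWAPS]
    else:
        idx = [i for i, t in enumerate(tokens) if t == ";"]
    if idx:
        i = rng.choice(idx)
        if tokens[i] == ";":
            tokens.pop(i)                      # syntactic corruption
        else:
            tokens[i] = TYPE_SWAPS[tokens[i]]  # semantic corruption
    return " ".join(tokens)

rng = random.Random(0)
clean = "int x ; cin >> x ; string s ; cin >> s ;"
pairs = [(corrupt(clean, rng), clean) for _ in range(3)]  # denoiser training pairs
```

Each unlabeled program can be corrupted many times with fresh randomness, so the denoising dataset can be much larger than the set of unlabeled programs itself.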

5.3. COMPARISONS TO SCALED-UP BASELINES

Although the composed model does not optimize the denoiser during joint training with the base predictor, the final composed model has roughly double the number of layers of the baselines. Thus, we also present baseline results with the same number of layers as the composed model. We note that these scaled-up baselines have roughly double the number of trainable parameters of the composed model, since the denoiser is fixed when training on labeled data. Table 3 shows the results of scaled-up direct, pretrained, and backtranslation baselines on the synthetic code task. While the scaled direct predictor improves over the unscaled direct predictor, scaling up worsens the pretrained and backtranslation baselines. Intriguingly, pretraining becomes dramatically less effective with a very large model, possibly due to the large size of the model relative to the unlabeled data (20k examples). Table 2 gives results for a scaled-up direct predictor on the SPOC dataset. We find that while the compilation error rate decreases, scaling up slightly degrades the correct rate of the output programs.

5.4. COMPARISONS TO BASELINES WITH A DENOISER

We also compare to the baselines when their outputs are post-processed by the same denoiser used by the composed model. This yields baselines with exactly the same architecture and the same number of trainable parameters; the only difference is that the composed model trains the base predictor jointly with the fixed denoiser. Table 3 shows the results of the baselines with a denoiser. While the gap between the composed models and the baselines narrows, predict-and-denoise still gives gains of 29.6%, 7.6%, and 3.2% over direct+denoiser, pretrained+denoiser, and backtranslation+denoiser, respectively. On SPOC, applying the denoiser does not improve the baseline results in general. These experiments highlight the importance of training jointly with the fixed denoiser.

6. RELATED WORK

Semi-supervised learning. Like semi-supervised learning, predict-and-denoise leverages large amounts of unlabeled data. Denoising autoencoders (DAEs) Vincent et al. (2008; 2010) are a classical tool for learning from unlabeled data. Recently, DAEs have been applied on the input side to improve adversarial robustness by first cleaning the adversarial example using invariances learned from unlabeled data Gu and Rigazio (2015); Wong and Kolter (2020). We consider DAEs for learning invariances and structure in the output space instead of the inputs. Semantic parsing and structured prediction. Some recent semantic parsing works explicitly provide output constraints by using abstract syntax trees (ASTs) and enforcing type constraints Yin and Neubig (2017); Krishnamurthy et al. (2017); Xiao et al. (2016); Dong and Lapata (2016). Krishnamurthy et al. (2017) note that enforcing type constraints during training not only prevents invalid outputs but also improves generalization, supporting our results. While these methods are useful when the validity structure is known and well-defined, we focus on extracting unknown structure from unlabeled outputs. Structured prediction spans applications including speech Zhang and Wu (2013), vision Mueller (2013), and medical diagnosis Jagannatha and Yu (2016). Many approaches use graphical models (on top of neural models) to enforce validity, e.g., HMMs and CRFs in OCR and sequence tagging Kassel (1995); Huang et al. (2015). These approaches typically require carefully engineering the graphical model to integrate with a neural component and do not consider the simplicity benefits of composition.

7. CONCLUSION

Many tasks in machine learning are no longer classification or regression but require generating outputs with rich structure (images, text, music, proteins, etc.), for which unpaired outputs are very common. We introduce the predict-and-denoise framework, in which we compose a predictor with a denoiser trained on unpaired outputs. Open questions include whether we can train in a more differentiable way for discrete output spaces and how to choose the best denoising objective for a given prediction task.



Figure 1: (Left) The predict-and-denoise framework: first, a denoiser is learned using synthetic perturbations of a large number of unlabeled outputs; second, a base predictor composed with the denoiser is learned with labeled data. Composing with a denoiser allows the base predictor to be simpler, improving generalization. (Middle) Univariate regression example where a staircase function requires a complex linear spline fit. (Right) A simple linear function can fit the staircase function when composed with a denoiser that projects onto the valid outputs (the integers).

Figure 3: Generated letters A-J for 10 randomly selected fonts. (a) The direct predictor produces blurry outputs with many artifacts. (b) The composed predictor (base + denoiser) produces clearer outputs with more distinct font patterns. (c) The improvement comes from leveraging output structure learned by the denoiser, which allows the base predictor to produce blurrier outputs corresponding to a lower-norm model.

Since the learned Π_β is imperfect, the hyperparameter λ in the objective trades off between fitting the composition Π_β ∘ f_θ and fitting f_θ directly to the data. For discrete output spaces, the first term in the objective involves an expectation over a discrete space of outputs; depending on the model and the task, optimizing it may require REINFORCE Williams (1992) or a Gumbel-softmax reparameterization Jang et al. (2017); Maddison et al. (2016). The direct predictor is trained with only the second term of our objective, E_{x,y}[log p_θ(y | x)].

Figure 4: Test MSE on font image generation. (Left) Results when L2 regularization strength is tuned with the validation set. (Right) Varying L2 regularization strength (1e-6 to 1e-2) for direct and composed predictors. While similar at low regularization, increasing the regularization strength improves the composed predictor while hurting the direct predictor.

Figure 5: (Left-Middle) Example input and output of the base predictor on the synthetic dataset. (Right) Output of the denoiser, which instantiates var_5 and corrects the type of temp.

For an output y′, if g_γ(y′) = 0, then we define p_β(y | y′) = δ_{y′}(y), a point mass on y′. Otherwise, if g_γ(y′) = 1, then p_β(y | y′) = p_ν(y | y′), where p_ν is a Transformer. The Transformer p_ν is first pretrained on a line-wise code repair dataset generated from unlabeled examples, then trained on full-program repair where the input program has one randomly corrupted line with probability 0.75. Thus, taking β = (γ, ν), we have Π_β(y′) = y′ if g_γ(y′) = 0 and Π_β(y′) = argmax_y p_ν(y | y′) otherwise.
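The two-part construction above amounts to a gate around a repair model. Below is a minimal sketch; the stand-in g_γ and p_ν are toy functions (a real implementation would use the trained classifier and Transformer).

```python
def make_denoiser(g_gamma, p_nu):
    # Two-part denoiser: a binary error detector g_gamma gates a repair model p_nu.
    def denoise(y):
        if g_gamma(y) == 0:   # classified as having no error: return y unchanged
            return y
        return p_nu(y)        # otherwise, return the repair model's output
    return denoise

# Toy stand-ins: "valid" means the line ends with a semicolon.
g_gamma = lambda y: 0 if y.endswith(";") else 1
p_nu = lambda y: y + ";"

denoiser = make_denoiser(g_gamma, p_nu)
print(denoiser("int x = 1;"))  # unchanged: the gate detects no error
print(denoiser("int x = 1"))   # routed through the repair model
```

Gating in this way keeps the identity behavior on already-valid outputs exact, rather than relying on the repair model to learn it.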

Machine translation. Machine translation methods use monolingual data in both the source and target languages to improve their models Sennrich et al. (2016b); Cheng et al. (2016). Pretraining methods use language modeling on monolingual data to initialize the encoder and decoder Ramachandran et al. (2018); Skorokhodov et al. (2018); Devlin et al. (2019). Back-translation methods generate additional synthetic parallel examples by training on the backwards (target to source) problem Sennrich et al. (2016b). Predict-and-denoise gives complementary gains on top of pretraining and back-translation.

Figure 2: (Left-Middle) Example pseudocode and code from the synthetic dataset. Since the pseudocode is ambiguous, variable types and whether to instantiate a variable must be inferred. (Right) Random corruption used to train a denoiser mapping corrupted to valid code. The denoiser must infer the correct type of var_5 from the other lines.

Out of the 6200 labeled examples (62 characters × 100 fonts), we split randomly into 2500 training examples, 100 validation examples, and 3600 test examples. The training examples contain a random subset of the characters for each font, so the models must generate the unseen characters of each font with the correct styling at test time. The denoiser uses additional unlabeled images from ∼50k other fonts.

Table 2: Results on the SPOC pseudocode-to-code test sets, showing the proportion of outcomes of the generated code (%). Direct-scaled has the same number of total layers as the composed model, but roughly double the number of trainable parameters.

Table 3: Results of scaled-up baselines and baselines with a denoiser on the synthetic pseudocode-to-code task.

We evaluate on the challenging SPOC pseudocode-to-code dataset Kulal et al. (2019), which contains code scraped from codeforces.com and pseudocode written by crowdsourced workers. Since we consider the full-program translation task instead of line-by-line translation as in previous works Kulal et al. (2019); Yasunaga and Liang (2020), we filter out training examples where the code is longer than 1000 tokens after pre-processing, retaining over 95% (11355/11925) of the training examples. We use the two given SPOC test sets, TESTP and TESTW. TESTP tests generalization to unseen problems, while TESTW tests generalization to pseudocode written by different workers. We report results on the full (unfiltered) test sets.

However, semi-supervised learning works typically use unlabeled input data Tarvainen and Valpola (2017); Miyato et al. (2018); Shu et al. (2018); Berthelot et al. (2019), whereas we have "unlabeled" outputs. In classification, unlabeled outputs can help with handling label shift Lipton et al. (2018); Azizzadenesheli et al. (2019), but otherwise there is very little output structure. If both unlabeled inputs and outputs are available, our method is complementary to semi-supervised methods. Denoising autoencoding. Denoising autoencoders (DAEs) are classical building blocks for unsupervised deep representation learning.

REFERENCES

P. Yin and G. Neubig. A syntactic neural model for general-purpose code generation. In Association for Computational Linguistics (ACL), pages 440-450, 2017.

X. Zhang and J. Wu. Denoising deep neural networks based voice activity detection. arXiv, 2013.

