SIMPLIFYING MODELS WITH UNLABELED OUTPUT DATA

Abstract

We focus on prediction problems with high-dimensional outputs that are subject to output validity constraints, e.g. a pseudocode-to-code translation task where the code must compile. For these problems, labeled input-output pairs are expensive to obtain, but "unlabeled" outputs, i.e. outputs without corresponding inputs, are freely available and provide information about output validity (e.g. code on GitHub). In this paper, we present predict-and-denoise, a framework that can leverage unlabeled outputs. Specifically, we first train a denoiser to map possibly invalid outputs to valid outputs using synthetic perturbations of the unlabeled outputs. Second, we train a predictor composed with this fixed denoiser. We show theoretically that for a family of functions with a high-dimensional discrete valid output space, composing with a denoiser reduces the complexity of a 2-layer ReLU network needed to represent the function and that this complexity gap can be arbitrarily large. We evaluate the framework empirically on several datasets, including image generation from attributes and pseudocode-to-code translation. On the SPOC pseudocode-to-code dataset, our framework improves the proportion of code outputs that pass all test cases by 3-5% over a baseline Transformer.

1. INTRODUCTION

We study problems whose outputs have validity constraints. For example, in pseudocode-to-code translation, the output code must compile. Other examples include natural language translation and molecule generation, where outputs should be grammatically correct or chemically valid, respectively. State-of-the-art models typically learn the input-output mapping from expensively obtained labeled data (Kulal et al., 2019; Vaswani et al., 2017; Méndez-Lucio et al., 2020; Senior et al., 2020), which may not contain enough examples to learn a complex validity structure on high-dimensional output spaces. However, there are often many "unlabeled" outputs, i.e., outputs without a corresponding input (e.g., GitHub has over 40 million public code repositories). How do we leverage these, together with a much smaller number of labeled input-output pairs, to improve accuracy and validity?

In this paper, we present predict-and-denoise, a framework in which we compose a base predictor, which maps an input to a possibly invalid output, with a denoiser, which maps the possibly invalid output to a valid output. We first train the denoiser on synthetic perturbations of unlabeled outputs, and then train the base predictor composed with the fixed denoiser on the labeled data (Figure 1, left). By factorizing into two modules, base predictor and denoiser, the framework allows the base predictor to be simpler: it offloads the complexity of modeling the output validity structure to the denoiser, which has the benefit of being trained on much more data. We aim to lay down a principled framework for using unlabeled outputs, with theoretical justification that reducing the complexity of the learned base predictor improves sample efficiency. Figure 1 (middle, right) shows a pictorial example: a staircase function whose valid outputs are the integers requires a complex spline to represent directly.
When composed with a denoiser that rounds to the nearest integer, a simple linear base predictor can represent the staircase function. We show theoretically that our framework reduces the complexity of a 2-layer ReLU network needed to represent a family of functions on a discrete valid output set in high dimensions; this complexity gap can be arbitrarily large, depending on the stability of the target function being learned. We expect such a lower-complexity function to be learnable with fewer samples, improving generalization. Empirically, we show on image generation and two pseudocode-to-code datasets (synthetic and SPOC; Kulal et al., 2019) that predict-and-denoise improves test performance across continuous and discrete output modalities. In image generation, our framework improves the clarity and styling of font images by learning a low-complexity base predictor that generates an abstract image while the denoiser sharpens the image. For pseudocode-to-code, we consider the more difficult full-program translation task, rather than the line-by-line translation (with compiler side information) studied by previous work (Kulal et al., 2019; Yasunaga and Liang, 2020). We first study a synthetic pseudocode-to-code dataset where the denoiser simplifies the base predictor by helping with global type inference. On SPOC, a recent pseudocode-to-code dataset of programming competition problems, we improve the proportion of correct programs by 3-5% over a baseline Transformer.
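The two training stages above can be sketched on a toy problem. This is an illustrative simplification, not the paper's architecture: the denoiser here is a 1-nearest-neighbor projection onto the unlabeled outputs rather than a trained network, and the base predictor is fit directly on labeled data rather than trained through the denoiser; the names (`denoise`, `unlabeled`) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: build a denoiser from unlabeled valid outputs alone. Here the
# valid outputs are the integers 0..9 and the denoiser is a 1-nearest-
# neighbor projection onto the unlabeled outputs (a toy stand-in for a
# denoiser trained on synthetic perturbations).
unlabeled = np.arange(0, 10, dtype=float)

def denoise(y):
    return unlabeled[np.argmin(np.abs(unlabeled - y))]

# Synthetic perturbations of the unlabeled outputs yield (noisy, clean)
# pairs; the 1-NN denoiser recovers most of them.
noisy = unlabeled + rng.normal(0.0, 0.3, size=unlabeled.shape)
denoised = np.array([denoise(y) for y in noisy])

# Stage 2: fit a base predictor on a small labeled set, then compose it
# with the fixed denoiser at prediction time. (The paper trains the base
# predictor *through* the denoiser; fitting it directly is a
# simplification for this sketch.)
X = rng.uniform(1, 8, 50)
Y = np.round(X)                     # labeled outputs are valid (integers)
a, b = np.polyfit(X, Y, 1)          # linear base predictor
pred = np.array([denoise(a * x + b) for x in X])
```

Because the denoiser absorbs the "snap to a valid output" structure, the base predictor only needs to be approximately right: any linear fit within 0.5 of the target lands on the correct integer after denoising.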

2. SETUP

We consider prediction problems from an input space X (e.g., pseudocode) to an output space Y (e.g., code) with an unknown subset of valid outputs V ⊆ Y (e.g., code that compiles); the true output is always valid (in V). We have a labeled dataset (x_1, y_1), ..., (x_n, y_n), where x_i ∈ X and y_i ∈ V, and access to many unlabeled outputs ỹ_1, ..., ỹ_m from V. We do not assume access to any black-box function for testing validity (whether y ∈ V or not), which allows for general problems (e.g., language generation) where output validity is imprecisely characterized. A predictor f : X → Y from a chosen hypothesis class H maps inputs to the ambient output space. Our goal is to improve the predictor by leveraging information about the valid space V from the unlabeled examples {ỹ_i}_{i=1}^m. To do so, we use a denoiser Π : Y → V, which projects a possibly invalid output in Y to the valid set V; the unlabeled outputs let us learn an approximate denoiser.

Base, composed, and direct predictors. Let ‖·‖ be a norm on H. Let Π ∘ f_base be a composed predictor that represents the target function f* (that is, Π ∘ f_base = f* on X); in the context of a composed predictor, we call f_base the base predictor. We compare against a minimum-norm direct predictor f_direct ∈ argmin_{f ∈ H} {‖f‖ : f(x) = f*(x), x ∈ X}, which represents f* directly.
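The projection denoiser Π can be made concrete on a tiny discrete example. The valid set, the token tuples, and Hamming distance below are all illustrative choices, not the paper's construction; ties are broken by taking the first minimizer.

```python
# Sketch of a projection denoiser Pi: map a possibly invalid output to the
# nearest valid output, Pi(y) = argmin_{v in V} d(y, v), with d a token-wise
# Hamming distance and ties broken arbitrarily (here: first match). The
# valid set V is a small explicit list for illustration.
V = [("int", "x", "=", "0", ";"),
     ("float", "x", "=", "0", ";"),
     ("int", "y", "=", "1", ";")]

def hamming(a, b):
    # number of positions where the fixed-length token tuples differ
    return sum(u != v for u, v in zip(a, b))

def project(y):
    return min(V, key=lambda v: hamming(y, v))

# An invalid output (wrong literal) is projected back into V.
invalid = ("int", "x", "=", "2", ";")
print(project(invalid))  # ('int', 'x', '=', '0', ';')
```

In practice V is never enumerated; the denoiser is a learned model that approximates this projection from perturbed unlabeled outputs.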

3. DENOISERS CAN REDUCE MODEL COMPLEXITY

In this section, we study direct and composed predictors from an approximation standpoint, using complexity measures on predictors as surrogates for sample complexity. We aim to represent a target function f* : X → V. We assume access to a denoiser Π : Y → V that projects to the nearest valid output under an appropriate metric on the output space (breaking ties arbitrarily). In Section 3.1, we give a simple example of when composing with a denoiser (Π ∘ f_base) can drastically reduce the complexity of the learned predictor. Since f_base becomes easier to approximate, we may expect better generalization (Bartlett et al., 2017; Neyshabur et al., 2017; Wei and Ma, 2019; 2020). In Section 3.2, we show theoretically for two-layer ReLU networks that the complexity required to represent f* directly can be arbitrarily larger than that required by a composed predictor, depending on the stability of f*.

3.1. MOTIVATING EXAMPLE

Figure 1 shows a staircase function f* that requires a complex direct predictor f_direct, while the minimum-norm base predictor f*_base has low complexity. For 0 < δ < 1, let the input space X = ∪_{i=1}^{N} [i − (1−δ)/2, i + (1−δ)/2] be a union of N disjoint intervals, and let the valid outputs V = Z (the integers) be a subset of the ambient output space Y = R. The staircase function is f*(x) = ⌊x⌉ on X, which rounds a linear function to the nearest integer. Following Savarese et al. (2019), we define the norm of a univariate function f : R → R as

‖f‖ = max( ∫ |f''(x)| dx, |f'(−∞) + f'(+∞)| ).   (1)

Figure 1: (Left) The predict-and-denoise framework: first, a denoiser is learned from synthetic perturbations of a large number of unlabeled outputs; second, a base predictor composed with the fixed denoiser is learned from labeled data. Composing with a denoiser allows the base predictor to be simpler, improving generalization. (Middle) A univariate regression example where a staircase function requires a complex linear spline fit. (Right) A simple linear function can fit the staircase function when composed with a denoiser that projects onto the valid outputs (the integers).
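The gap in this example can be checked numerically. The sketch below (NumPy, with our own choice of N = 5, δ = 0.5, and a least-squares line as the direct linear fit) is illustrative and is not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.5

# Input space X: union of N = 5 intervals of width 1 - delta centered at
# the integers 1..5; the target is the staircase f*(x) = round(x).
xs = np.concatenate([i + rng.uniform(-(1 - delta) / 2, (1 - delta) / 2, 20)
                     for i in range(1, 6)])
ys = np.round(xs)

# Composed predictor: the linear base predictor f_base(x) = x followed by
# a denoiser that rounds onto V = Z. It represents f* exactly on X.
composed = np.round(xs)            # round(f_base(x)) with f_base(x) = x
print(np.all(composed == ys))      # True

# A direct predictor restricted to linear functions cannot represent f*:
# even the least-squares line leaves residuals on the staircase.
a, b = np.polyfit(xs, ys, 1)
max_residual = np.max(np.abs(a * xs + b - ys))
```

A direct predictor that does represent f* exactly must be a piecewise-linear spline with a steep segment per step, which is what drives up its norm under equation (1).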

