SIMPLIFYING MODELS WITH UNLABELED OUTPUT DATA

Abstract

We focus on prediction problems with high-dimensional outputs that are subject to output validity constraints, e.g. a pseudocode-to-code translation task where the code must compile. For these problems, labeled input-output pairs are expensive to obtain, but "unlabeled" outputs, i.e. outputs without corresponding inputs, are freely available and provide information about output validity (e.g. code on GitHub). In this paper, we present predict-and-denoise, a framework that can leverage unlabeled outputs. Specifically, we first train a denoiser to map possibly invalid outputs to valid outputs using synthetic perturbations of the unlabeled outputs. Second, we train a predictor composed with this fixed denoiser. We show theoretically that for a family of functions with a high-dimensional discrete valid output space, composing with a denoiser reduces the complexity of a 2-layer ReLU network needed to represent the function, and that this complexity gap can be arbitrarily large. We evaluate the framework empirically on several datasets, including image generation from attributes and pseudocode-to-code translation. On the SPoC pseudocode-to-code dataset, our framework improves the proportion of code outputs that pass all test cases by 3-5% over a baseline Transformer.

1. INTRODUCTION

We study problems whose outputs have validity constraints. For example, in pseudocode-to-code translation, the output code must compile. Other examples include natural language translation and molecule generation, where outputs should be grammatically correct or chemically valid, respectively. State-of-the-art models typically learn the input-output mapping from expensively-obtained labeled data (Kulal et al., 2019; Vaswani et al., 2017; Méndez-Lucio et al., 2020; Senior et al., 2020), which may not contain enough examples to learn a complex validity structure on high-dimensional output spaces. However, there are often many "unlabeled" outputs, i.e. outputs without a corresponding input (e.g., GitHub has over 40 million public code repositories). How do we leverage these, together with a much smaller amount of labeled input-output pairs, to improve accuracy and validity?

In this paper, we present predict-and-denoise, a framework in which we compose a base predictor, which maps an input to a possibly invalid output, with a denoiser, which maps the possibly invalid output to a valid output. We first train the denoiser on synthetic perturbations of unlabeled outputs. Second, we train the base predictor composed with the fixed denoiser on the labeled data (Figure 1 left). By factorizing into two modules, base predictor and denoiser, the framework allows the base predictor to be simpler: the complexity of modeling the output validity structure is offloaded to the denoiser, which has the benefit of being trained on much more data. We aim to lay down a principled framework for using unlabeled outputs, with theoretical justification that reducing the complexity of the learned base predictor improves sample efficiency. Figure 1 (middle, right) shows a pictorial example: a staircase function whose valid outputs are integers, which requires a complex spline to represent directly.
When composed with a denoiser (which rounds to the nearest integer), a simple linear base predictor can represent the staircase function. We show theoretically that our framework reduces the complexity of a 2-layer ReLU network needed to represent a family of functions on a high-dimensional discrete valid output set. This complexity gap can be arbitrarily large, depending on the stability of the target function being learned. We expect such a lower-complexity function to be learnable with fewer samples, improving generalization. Empirically, we show on image generation and two pseudocode-to-code datasets (synthetic and SPoC (Kulal et al., 2019)) that predict-and-denoise improves test performance across continuous and discrete output data modalities. In image generation, our framework improves the clarity and styling of font images: a low-complexity base predictor generates an abstract image, while the denoiser sharpens it. For pseudocode-to-code, we consider the more difficult full-program
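The staircase example from Figure 1 can be made concrete with a small numeric sketch (hypothetical, not the paper's implementation): a linear base predictor composed with a denoiser that rounds to the nearest integer reproduces a staircase exactly, even though no simple linear model can fit the staircase on its own.

```python
import numpy as np

# Toy 1-D illustration of predict-and-denoise (a sketch, not the paper's code).
# Target: the staircase f(x) = floor(x), whose valid outputs are integers.
def staircase(x):
    return np.floor(x)

# Base predictor: a simple linear map. Shifting by -0.5 makes nearest-integer
# rounding coincide with the floor for non-integer inputs.
def base_predictor(x):
    return x - 0.5

# Denoiser: projects a possibly invalid (non-integer) output onto the valid
# output set by rounding to the nearest integer.
def denoiser(y):
    return np.round(y)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=1000)  # sampled inputs; exact integers (rounding ties) almost never occur
composed = denoiser(base_predictor(x))
assert np.allclose(composed, staircase(x))  # linear predictor + denoiser = staircase
```

The denoiser here plays the same role as in the framework: it projects a possibly invalid continuous output onto the discrete valid set, so the base predictor only needs to get within half a unit of the correct output.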

