Tailoring: ENCODING INDUCTIVE BIASES BY OPTIMIZING UNSUPERVISED OBJECTIVES AT PREDICTION TIME

Abstract

From CNNs to attention mechanisms, encoding inductive biases into neural networks has been a fruitful source of improvement in machine learning. Auxiliary losses, which add extra terms to the loss function, are a general way of encoding biases that help networks learn better representations. However, since they are minimized on the training data, they suffer from the same generalization gap as regular task losses. Moreover, by changing the loss function, the network optimizes a different objective than the one we care about. In this work we solve both problems. First, we take inspiration from transductive learning and note that, after receiving an input but before making a prediction, we can fine-tune our models on any unsupervised objective. We call this process tailoring, because we customize the model to each input. Second, we formulate a nested optimization (similar to those in meta-learning) and train our models to perform well on the task loss after adapting to the tailoring loss. The advantages of tailoring and meta-tailoring are discussed theoretically and demonstrated empirically on several diverse examples: encoding conservation laws from physics, increasing robustness to adversarial examples, meta-tailoring with contrastive losses to improve theoretical generalization guarantees, and increasing performance in model-based RL.

1. INTRODUCTION

The key to successful generalization in machine learning is the encoding of useful inductive biases. A variety of mechanisms, from parameter tying to data augmentation, have proven useful, but there is no systematic strategy for designing and implementing such biases. Auxiliary losses are a paradigm for encoding a wide variety of biases, constraints, and objectives; they help networks learn better representations and generalize more broadly. They add an extra term to the task loss, minimized over the training data or, in semi-supervised learning, over an extra set of unlabeled data. However, they suffer from two major difficulties:

1. Auxiliary losses are only minimized at training time, not at the query points. This causes a generalization gap between training and testing, in addition to that of the task loss.

2. By minimizing the sum of the task loss and the auxiliary loss, we optimize a different objective than the one we care about (the task loss alone).

In this work we propose a solution to each problem:

1. We use ideas from transductive learning to minimize the auxiliary loss at the query point by running an optimization at prediction time, eliminating the generalization gap for the auxiliary loss. We call this process tailoring, because we customize the model to each query.

2. We use ideas from meta-learning to learn a model that performs well on the task loss under the assumption that the auxiliary loss will be optimized at prediction time. This meta-tailoring effectively trains the model to leverage the unsupervised tailoring loss to minimize the task loss.

Tailoring a predictor. In classical inductive supervised learning, an algorithm consumes a training dataset of input-output pairs $((x_i, y_i))_{i=1}^n$ and produces a set of parameters $\theta$ by minimizing a supervised loss $\sum_{i=1}^n L_{sup}(f_\theta(x_i), y_i)$ and, optionally, an unsupervised auxiliary loss $\sum_{i=1}^n L_{unsup}(\theta, x_i)$. These parameters specify a hypothesis $f_\theta(\cdot)$ that, given a new input $x$, generates an output $\hat{y} = f_\theta(x)$.
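As a concrete illustration of this classical setup, the sketch below trains a toy affine model on a supervised loss plus an auxiliary loss, both minimized over the training data. The model, the data, and the auxiliary bias ("the output should vanish at $x=0$", i.e. $L_{unsup}(\theta, x) = f_\theta(0)^2 = b^2$) are all illustrative assumptions, not from the paper:

```python
import numpy as np

# Hypothetical setup: fit f_theta(x) = a*x + b to y ~ 2x, with an auxiliary
# (unsupervised) loss encoding the bias "the output should vanish at x = 0",
# i.e. L_unsup(theta, x) = f_theta(0)**2 = b**2.
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=32)
ys = 2.0 * xs + 0.01 * rng.normal(size=32)

a, b = 0.0, 0.5
lr, lam = 0.1, 1.0            # lam weights the auxiliary term

for _ in range(500):
    err = a * xs + b - ys
    da = np.mean(2 * err * xs)            # gradient of supervised MSE wrt a
    db = np.mean(2 * err) + lam * 2 * b   # supervised gradient + auxiliary gradient
    a, b = a - lr * da, b - lr * db

print(round(a, 3), round(b, 3))           # a near 2, b pushed toward 0
```

Note that the auxiliary term is only ever minimized on the training inputs here; nothing constrains it at a future query point, which is exactly the gap tailoring closes.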
This problem setting misses a substantial opportunity: before the learning algorithm sees the query point $x$, it has distilled the data down to a set of parameters, which are frozen during inference, so it cannot use any information about the particular $x$ for which it will be asked to make a prediction. Vapnik recognized the opportunity to make more accurate predictions when the query point is known, in a framework now known as transductive learning (Vapnik, 1995; Chapelle et al., 2000). In transductive learning, a single algorithm consumes both labeled data, $((x_i, y_i))_{i=1}^n$, and a set of input points for which predictions are desired, $(x^{(j)})_j$, and produces predicted outputs $(\hat{y}^{(j)})_j$ for each of the queries, as illustrated in the top row of figure 1. In general, however, we do not know the queries a priori; instead, we want an inductive rule that makes predictions online, as queries arrive. To obtain a prediction function from a transductive system, we would need to encapsulate the entire learning procedure inside the prediction function. This strategy would achieve our objective of taking $x$ into account at prediction time, but would be computationally much too slow. We observe that this way of combining induction and transduction would perform very similar computations for each prediction, sharing the same training data and objective.

Figure 1: Comparison of several learning settings, with offline computation in the orange boxes, online computation in the green boxes, and tailoring in blue. For meta-tailoring training, $\tau(\theta, L_{tailor}, x) = \arg\min_{\theta' \approx \theta} L_{tailor}(x, \theta')$ represents the tailoring process resulting in $\theta_x$. Although tailoring and meta-tailoring are best understood in supervised learning, they can also be applied in reinforcement learning, as shown in section 5.3.
We can use ideas from meta-learning to find a shared "meta-hypothesis" that can then be efficiently adapted to each query, treating each prediction as a task. As shown in the third row of figure 1, we first run regular supervised learning to obtain parameters $\theta$; then, given a query input $x$, we fine-tune $\theta$ on an unsupervised loss $L_{tailor}$ to obtain customized parameters $\theta_x$ and use them to make the final prediction: $f_{\theta_x}(x)$. We call this process tailoring, because we adapt the model to each particular input for a customized fit. Notice that tailoring optimizes the loss at the query point, eliminating the generalization gap on the auxiliary loss.

Meta-tailoring. Since we will be applying tailoring at prediction time, it is natural to anticipate this adaptation during training, resulting in a two-layer optimization similar to those used for meta-learning. Because of this similarity, we call this process, illustrated in the bottom row of figure 1, meta-tailoring. Now, rather than letting $\theta$ be the direct minimizer of the supervised loss, we set

$\theta \in \arg\min_\theta \sum_{i=1}^n L_{sup}(f_{\tau(\theta, L_{tailor}, x_i)}(x_i), y_i)$.

Notice that by optimizing this nested objective, the outer process optimizes the only objective we care about, $L_{sup}$, instead of a proxy combination of $L_{sup}$ and $L_{unsup}$. At the same time, we learn to leverage the unsupervised tailoring loss in the inner optimization to adapt the model before making the final prediction, both during training and at prediction time.
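The nested objective above can be sketched end to end on a toy problem. The sketch below is a minimal numpy illustration under stated assumptions, not the paper's implementation: the quadratic model, the odd-symmetry tailoring loss $L_{tailor}(x, \theta) = (f_\theta(x) + f_\theta(-x))^2$, and the first-order outer gradient (a FOMAML-style approximation that ignores derivatives through the inner loop) are all hypothetical choices for illustration:

```python
import numpy as np

# Toy model f_theta(x) = t1*x + t2*x**2, trained on an odd target y = x**3.
# Hypothetical tailoring loss enforcing the odd symmetry f(x) + f(-x) = 0 at
# the query: L_tailor(x, theta) = (f(x) + f(-x))**2 = (2*t2*x**2)**2.

def predict(theta, x):
    t1, t2 = theta
    return t1 * x + t2 * x**2

def tailor(theta, x, eta=0.05, steps=5):
    """Inner loop: a few gradient steps on L_tailor, customizing theta to x."""
    t1, t2 = theta
    for _ in range(steps):
        t2 = t2 - eta * 8.0 * t2 * x**4   # d/dt2 (2*t2*x^2)^2 = 8*t2*x^4
    return t1, t2                          # t1 has zero tailoring gradient here

xs = np.linspace(-1.0, 1.0, 64)
ys = xs**3

# Meta-tailoring (first-order): the outer loop minimizes the task loss
# evaluated at the *tailored* parameters, ignoring second derivatives.
theta, lr = (0.0, 0.3), 0.05
for _ in range(2000):
    t1x, t2x = tailor(theta, xs)              # tailor to every training input
    err = t1x * xs + t2x * xs**2 - ys         # task error after tailoring
    g1 = np.mean(2 * err * xs)
    g2 = np.mean(2 * err * xs**2)
    theta = (theta[0] - lr * g1, theta[1] - lr * g2)

# Prediction time: tailor to the query, then predict with f_{theta_x}(x).
x_query = 0.8
theta_q = tailor(theta, x_query)
pred = predict(theta_q, x_query)
print(round(pred, 3), round(theta[1], 6))
```

Because the outer loop only ever scores the model after tailoring, training and prediction use the same procedure: the even component $t_2$, which violates the symmetry bias, is driven toward zero rather than being penalized by a separate proxy term in the training loss.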

