LEARNING TO EXTRAPOLATE: A TRANSDUCTIVE APPROACH

Abstract

Machine learning systems, especially with overparameterized deep neural networks, can generalize to novel test instances drawn from the same distribution as the training data. However, they fare poorly when evaluated on out-of-support test points. In this work, we tackle the problem of developing machine learning systems that retain the power of overparameterized function approximators while enabling extrapolation to out-of-support test points when possible. This is accomplished by noting that under certain conditions, a "transductive" reparameterization can convert an out-of-support extrapolation problem into a problem of within-support combinatorial generalization. We propose a simple strategy based on bilinear embeddings to enable this type of combinatorial generalization, thereby addressing the out-of-support extrapolation problem under certain conditions. We instantiate a simple, practical algorithm applicable to various supervised learning and imitation learning tasks.

1. INTRODUCTION

Generalization is a central problem in machine learning. Typically, one expects generalization when the test data is sampled from the same distribution as the training set, i.e., out-of-sample generalization. However, in many scenarios, test data is sampled from a distribution different from that of the training set, i.e., out-of-distribution (OOD). In some OOD scenarios, the test distribution is assumed to be known during training, a common assumption made by meta-learning methods (Finn et al., 2017b). Several works have tackled a more general scenario of "reweighted" distribution shift (Koh et al., 2021; Quinonero-Candela et al., 2008), where the test distribution shares support with the training distribution but has a different and unknown probability density. This setting can be tackled via distributional-robustness approaches (Sinha et al., 2018; Rahimian & Mehrotra, 2019). Our paper aims to find structural conditions under which generalization to test data with support outside of the training distribution is possible.

Formally, consider the problem of learning a function h: ŷ = h(x) using data {(x_i, y_i)}_{i=1}^N ∼ D_train, where x_i ∈ X_train, the training domain. We are interested in making accurate predictions h(x) for x ∉ X_train (see examples in Fig 1). Consider an example task of predicting actions to reach a desired goal (Fig 1b). During training, goals are provided from the blue cuboid (x ∈ X_train), but test-time goals come from the orange cuboid (x ∉ X_train). If h is modeled with a deep neural network, its predictions for test goals in the blue area are likely to be accurate, but for goals in the orange area the performance can be arbitrarily poor unless further domain knowledge is incorporated.
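To make this failure mode concrete, consider a minimal sketch (ours, not the paper's setup): a nearest-neighbor regressor, which interpolates the training support well, is arbitrarily poor once queries leave that support. All names here are illustrative.

```python
# Hypothetical illustration: a 1-nearest-neighbor regressor fits the
# training support but fails on out-of-support queries.

def nn_predict(x, train):
    # predict with the label of the nearest training input
    xi, yi = min(train, key=lambda p: abs(p[0] - x))
    return yi

# train on the ground truth h*(x) = x^2 over the support [0, 2]
train = [(i / 10, (i / 10) ** 2) for i in range(21)]

in_support_err = abs(nn_predict(1.03, train) - 1.03 ** 2)  # small (~0.06)
out_support_err = abs(nn_predict(4.0, train) - 4.0 ** 2)   # large: predicts 4.0, truth 16.0
```

Any sufficiently flexible interpolator behaves similarly: nothing in the training data constrains its behavior beyond the support of X_train.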
This challenge manifests itself in a variety of real-world problems, ranging from supervised learning problems like object classification (Barbu et al., 2019) to sequential decision making with reinforcement learning (Kirk et al., 2021), transferring reinforcement learning policies from simulation to the real world (Zhao et al., 2020), and imitation learning (de Haan et al., 2019). Reliably deploying learning algorithms in unconstrained environments requires accounting for such "out-of-support" distribution shift, which we refer to as extrapolation. It is widely accepted that if one can identify some structure in the training data that constrains the behavior of optimal predictors on novel data, then extrapolation may become possible. Several methods can extrapolate if the nature of the distribution shift is known a priori: convolutional neural networks, for instance, encode translation equivariance by design.

We propose an alternative structural condition under which extrapolation is feasible. Typical machine learning approaches are inductive: decision-making rules are inferred from training data and employed for test-time predictions. An alternative to induction is transduction (Gammerman et al., 1998), where a test example is compared with the training examples to make a prediction. Our main insight is that in the transductive view of machine learning, extrapolation can be reparameterized as a combinatorial generalization problem, which, under certain low-rank and coverage conditions (Shah et al., 2020; Agarwal et al., 2021b; Andreas et al., 2016; Andreas, 2019), admits a solution. First, we show how we can (i) reparameterize out-of-support inputs h(x_test) → h(∆x, x′), where x′ ∈ X_train and ∆x is a representation of the difference between x_test and x′.
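A minimal sketch of the reparameterization in step (i), assuming scalar inputs and differences; the function name and the nearest-anchor choice are illustrative, not the paper's exact construction.

```python
# Hypothetical sketch: express an out-of-support query x_test relative to a
# training input x' (the "anchor"), so the model sees (dx, x') instead of x.

def reparameterize(x_test, train_xs):
    # choose the closest training input as the anchor ...
    x_anchor = min(train_xs, key=lambda x: abs(x - x_test))
    # ... and encode the query as an offset from that anchor
    dx = x_test - x_anchor
    return dx, x_anchor

train_xs = [0.0, 0.5, 1.0, 1.5, 2.0]
dx, x_anchor = reparameterize(2.7, train_xs)  # anchor 2.0, offset ~0.7
```

The key point is that x_anchor is always in-support; all novelty at test time is pushed into the pairing of a (possibly familiar) offset ∆x with a familiar anchor x′.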
We then (ii) provide conditions under which h(∆x, x′) makes accurate predictions for unseen combinations of (∆x, x′), based on a theoretically justified bilinear modeling approach: h(∆x, x′) → f(∆x)^⊤ g(x′), where f and g map their inputs into vector spaces of the same dimension. Finally, (iii) we present empirical results demonstrating that our algorithm extrapolates on a wide variety of tasks: (a) regression for analytical functions and high-dimensional point-cloud data; (b) sequential decision-making tasks such as goal-reaching for a simulated robot.
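As intuition for why the bilinear form enables combinatorial generalization, here is a toy example of our own (not the paper's algorithm): in the rank-one case, where h(∆x, x′) = f(∆x) · g(x′) with scalar factors, three observed combinations determine the fourth.

```python
# Rank-1 toy example: h(dx, x') = f(dx) * g(x').  With scalar factors,
# h_22 = f2*g2 = (f2*g1) * (f1*g2) / (f1*g1), so three seen entries
# pin down the unseen one (the classic matrix-completion identity).

def complete_rank1(h11, h12, h21):
    return h21 * h12 / h11

# hypothetical factors f = (2, 5) over {dx1, dx2} and g = (3, 7) over {a, b}
seen = {("dx1", "a"): 2 * 3, ("dx1", "b"): 2 * 7, ("dx2", "a"): 5 * 3}
pred = complete_rank1(seen[("dx1", "a")], seen[("dx1", "b")], seen[("dx2", "a")])
# pred recovers h("dx2", "b") = 5 * 7 = 35 without ever observing that pair
```

Learned embeddings f and g play the role of these factors in higher rank: provided the training data covers enough (∆x, x′) combinations, the bilinear structure determines predictions for the rest.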

2. SETUP

Notation. Given a space of inputs X and targets Y, we aim to learn a predictor h_θ : X → P(Y)¹ parameterized by θ that best fits a ground-truth function h⋆ : X → Y. Given a non-negative loss function ℓ : Y × Y → R_{≥0} on the outputs (e.g., squared loss) and a distribution D over X, the risk is defined as

R(h_θ; D) := E_{x∼D} E_{y∼h_θ(x)} [ℓ(y, h⋆(x))].  (2.1)

Different choices of training (D_train) and test (D_test) distributions yield different generalization settings:

In-Distribution Generalization. This setting assumes D_test = D_train. The challenge is to ensure that, with N samples from D_train, the expected risk R(h_θ; D_test) = R(h_θ; D_train) is small. This is a common paradigm both in empirical supervised learning (e.g., Simonyan & Zisserman (2014)) and in standard statistical learning theory (e.g., Vapnik (2006)).

Out-of-Distribution (OOD). This setting is more challenging and requires accurate predictions when D_train ≠ D_test. When the ratio of the density of D_test to that of D_train is bounded,
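As a concrete reading of the risk in Eq. (2.1): with squared loss and a deterministic predictor, the inner expectation over y collapses and the risk is a mean squared error over D. A minimal Monte Carlo sketch, with illustrative names:

```python
# Hedged sketch of Eq. (2.1) for a deterministic h_theta and squared loss:
# R(h_theta; D) is estimated by averaging the loss over samples x ~ D.

def risk(h_theta, h_star, xs):
    losses = [(h_theta(x) - h_star(x)) ** 2 for x in xs]
    return sum(losses) / len(losses)

h_star = lambda x: 2.0 * x           # ground-truth function
h_theta = lambda x: 2.0 * x + 0.1    # slightly biased predictor
r = risk(h_theta, h_star, [0.0, 1.0, 2.0])  # constant bias of 0.1 -> risk ~0.01
```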



¹Throughout, we let P(Y) denote the set of distributions supported on Y.



Figure 1: In the real world, the test distribution (orange) often has a different support than the training distribution (blue). Some illustrative tasks: (a) grasp-point prediction for object instances with out-of-support scale, (b) action prediction for reaching out-of-support goals, (c) function-value prediction for an out-of-support input range. The black crosses show predictions of a conventionally trained deep neural network, which makes accurate predictions for in-support inputs but fails on out-of-support inputs. We propose an algorithm that makes accurate out-of-support predictions under a set of assumptions.

