LEARNING TO EXTRAPOLATE: A TRANSDUCTIVE APPROACH

Abstract

Machine learning systems, especially with overparameterized deep neural networks, can generalize to novel test instances drawn from the same distribution as the training data. However, they fare poorly when evaluated on out-of-support test points. In this work, we tackle the problem of developing machine learning systems that retain the power of overparameterized function approximators while enabling extrapolation to out-of-support test points when possible. This is accomplished by noting that under certain conditions, a "transductive" reparameterization can convert an out-of-support extrapolation problem into a problem of within-support combinatorial generalization. We propose a simple strategy based on bilinear embeddings to enable this type of combinatorial generalization, thereby addressing the out-of-support extrapolation problem under certain conditions. We instantiate a simple, practical algorithm applicable to various supervised learning and imitation learning tasks.

1. INTRODUCTION

Generalization is a central problem in machine learning. Typically, one expects generalization when the test data is sampled from the same distribution as the training set, i.e., out-of-sample generalization. However, in many scenarios, test data is sampled from a different distribution than the training set, i.e., out-of-distribution (OOD). In some OOD scenarios, the test distribution is assumed to be known during training, a common assumption made by meta-learning methods (Finn et al., 2017b). Several works have tackled a more general scenario of "reweighted" distribution shift (Koh et al., 2021; Quinonero-Candela et al., 2008), where the test distribution shares support with the training distribution but has a different and unknown probability density. This setting can be tackled via distributional robustness approaches (Sinha et al., 2018; Rahimian & Mehrotra, 2019). Our paper aims to find structural conditions under which generalization to test data with support outside of the training distribution is possible. Formally, consider the problem of learning a function h : ŷ = h(x) using data {(x_i, y_i)}_{i=1}^N ∼ D_train, where x_i ∈ X_train, the training domain. We are interested in making accurate predictions h(x) for x ∉ X_train (see examples in Fig 1). Consider an example task of predicting actions to reach a desired goal (Fig 1b). During training, goals are provided from the blue cuboid (x ∈ X_train), but test-time goals are from the orange cuboid (x ∉ X_train). If h is modeled using a deep neural network, its predictions on test goals in the blue area are likely to be accurate, but for goals in the orange area the performance can be arbitrarily poor unless further domain knowledge is incorporated.
This challenge manifests in a variety of real-world problems, ranging from supervised learning problems like object classification (Barbu et al., 2019) to sequential decision making with reinforcement learning (Kirk et al., 2021), transferring reinforcement learning policies from simulation to the real world (Zhao et al., 2020), and imitation learning (de Haan et al., 2019). Reliably deploying learning algorithms in unconstrained environments requires accounting for such "out-of-support" distribution shift, which we refer to as extrapolation. It is widely accepted that if one can identify some structure in the training data that constrains the behavior of optimal predictors on novel data, then extrapolation may become possible. Several methods can extrapolate if the nature of the distribution shift is known a priori: convolutional neural networks are appropriate if a training pattern appears at an out-of-distribution translation in a test example (Kayhan & Gemert, 2020). Similarly, accurate predictions can be made for object point clouds in out-of-support orientations by building in SE(3) equivariance (Deng et al., 2021b). Another route to extrapolation is when the model class is known a priori: fitting a linear function to a linear problem will extrapolate. Similarly, methods like NeRF (Zhang et al., 2022a) use the physics of image formation to learn a 3D model of a scene which can synthesize images from novel viewpoints. We propose an alternative structural condition under which extrapolation is feasible. Typical machine learning approaches are inductive: decision-making rules are inferred from training data and employed for test-time predictions. An alternative to induction is transduction (Gammerman et al., 1998), where a test example is compared with the training examples to make a prediction.
Our main insight is that in the transductive view of machine learning, extrapolation can be reparameterized as a combinatorial generalization problem, which, under certain low-rank and coverage conditions (Shah et al., 2020; Agarwal et al., 2021b; Andreas et al., 2016; Andreas, 2019), admits a solution. We (i) show how to reparameterize out-of-support inputs, h(x_test) → h(∆x, x′), where x′ ∈ X_train and ∆x is a representation of the difference between x_test and x′; (ii) provide conditions under which h(∆x, x′) makes accurate predictions for unseen combinations of (∆x, x′) based on a theoretically justified bilinear modeling approach, h(∆x, x′) → f(∆x)^⊤ g(x′), where f and g map their inputs into vector spaces of the same dimension; and (iii) present empirical results demonstrating that our algorithm extrapolates on a wide variety of tasks: (a) regression for analytical functions and high-dimensional point cloud data, and (b) sequential decision-making tasks such as goal-reaching for a simulated robot.

2. SETUP

Notation. Given a space of inputs X and targets Y, we aim to learn a predictor h_θ : X → P(Y) parameterized by θ, which best fits a ground-truth function h⋆ : X → Y. Given a non-negative loss function ℓ : Y × Y → R_{≥0} on the outputs (e.g., squared loss) and a distribution D over X, the risk is defined as

R(h_θ; D) := E_{x∼D} E_{y∼h_θ(x)} [ℓ(y, h⋆(x))].   (2.1)

Different training (D_train) and test (D_test) distributions yield different generalization settings:

In-Distribution Generalization. This setting assumes D_test = D_train. The challenge is to ensure that with N samples from D_train, the expected risk R(h_θ; D_test) = R(h_θ; D_train) is small. This is a common paradigm in both empirical supervised learning (e.g., Simonyan & Zisserman (2014)) and in standard statistical learning theory (e.g., Vapnik (2006)).

Out-of-Distribution (OOD). This setting is more challenging and requires accurate predictions when D_train ≠ D_test. When the ratio between the density of D_test and that of D_train is bounded, rigorous OOD extrapolation guarantees exist and are detailed in Appendix B.1. Such a situation arises when D_test shares support with D_train but is differently distributed, as depicted in Fig 2a.

Out-of-Support (OOS). There are innumerable forms of distribution shift in which density ratios are not bounded. The most extreme case is when the support of D_test is not contained in that of D_train, i.e., when there exists some X′ ⊂ X such that P_{x∼D_test}[x ∈ X′] > 0 but P_{x∼D_train}[x ∈ X′] = 0 (see Fig 2b). We term the problem of achieving low risk on such a D_test OOS extrapolation.

Out-of-Combination (OOC). This is a special case of OOS. Let X = X_1 × X_2 be the product of two spaces, and let D_train,X1, D_train,X2 denote the marginal distributions of x_1 ∈ X_1, x_2 ∈ X_2 under D_train. In the OOC setting, each marginal of D_test is supported within the support of the corresponding training marginal, but D_test may place mass on combinations (x_1, x_2) that are jointly out-of-support under D_train.

Transduction. In classical transduction (Gammerman et al., 1998), given data points {x_i}_{i=1}^l and their labels {y_i}_{i=1}^l, the objective is to make predictions for points {x_i}_{i=l+1}^{l+k}. In this paper we refer to predictors of x that are functions of the labeled data points {x_i}_{i=1}^l as transductive predictors.
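To make the risk in Eq. (2.1) concrete, here is a minimal numpy sketch (the function names and the toy predictor are ours, not from the paper) estimating the risk of a fixed deterministic predictor under squared loss, both in- and out-of-support:

```python
import numpy as np

def risk(h, h_star, xs):
    """Monte Carlo estimate of R(h; D) = E_x [ (h(x) - h*(x))^2 ]
    for a deterministic predictor h under squared loss."""
    return float(np.mean((h(xs) - h_star(xs)) ** 2))

# toy example: a local approximation evaluated in- and out-of-support
rng = np.random.default_rng(0)
h_star = lambda x: np.sin(x)
h = lambda x: x - x**3 / 6                 # Taylor expansion, accurate near 0 only
x_in = rng.uniform(-0.5, 0.5, 10_000)      # "training support"
x_oos = rng.uniform(2.5, 3.5, 10_000)      # out-of-support region

assert risk(h, h_star, x_in) < 1e-4        # small in-support risk
assert risk(h, h_star, x_oos) > 1.0        # risk blows up out of support
```

The same predictor achieving near-zero in-support risk can incur arbitrarily large OOS risk, which is exactly the gap the OOS setting above formalizes.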

3. CONDITIONS AND ALGORITHM FOR OUT-OF-SUPPORT GENERALIZATION

While general OOS extrapolation can be arbitrarily challenging, we show that there exist some concrete conditions under which OOC extrapolation is feasible. Under these conditions, an OOS prediction problem can be converted into an OOC prediction problem. 

3.1. TRANSDUCTIVE PREDICTORS: CONVERTING OOS TO OOC

We require that the input space X has group structure, i.e., addition and subtraction operators x + x′, x − x′ are well-defined for x, x′ ∈ X. Let ∆X := {x − x′ : x, x′ ∈ X}. We propose a transductive reparameterization of h_θ : X → P(Y) with a deterministic function h̃_θ : ∆X × X → Y as

h_θ(x) := h̃_θ(x − x′, x′),   (3.1)

where x′ is referred to as an anchor point for a query point x. Under this reparameterization, the training distribution can be rewritten as a joint distribution of ∆x = x − x′ and x′ as follows:

P_{D_train}[(∆x, x′) ∈ •] := Pr[(∆x, x′) ∈ • | x ∼ D_train, x′ ∼ D_train, ∆x = x − x′].   (3.2)

This simply represents the prediction for every query point from the training distribution in terms of its relationship to other anchor points in the training distribution. At test time, we are presented with a query point x ∼ D_test that may be from an OOS distribution. To make a prediction on this OOS x, we observe that with a careful selection of an anchor point x′ from D_train, our reparameterization may be able to convert this OOS problem into a more manageable OOC one, since representing the test point x in terms of its difference from training points can still be an "in-support" problem. For a radius parameter ρ > 0, define the distribution of chosen anchor points D_trns(x) (referred to as a transducing distribution) as

P_{D_trns(x)}[x′ ∈ •] = Pr[x′ ∈ • | x′ ∼ D_train, inf_{∆x ∈ ∆X_train} ∥(x − x′) − ∆x∥ ≤ ρ],   (3.3)

where X_train denotes the set of x in our training set and ∆X_train := {x_1 − x_2 : x_1, x_2 ∈ X_train}. Intuitively, D_trns(x) selects anchor points x′ to transduce from the training distribution, subject to the resulting difference (x − x′) being close to a "seen" ∆x ∈ ∆X_train. In doing so, both the anchor point x′ and the difference ∆x have been seen individually at training time, albeit not in combination.
This allows us to express the prediction for an OOS query point in terms of an in-support anchor point x′ and an in-support difference ∆x (though the pair need not be jointly in-support). This choice of anchor points induces a joint test distribution of ∆x = x − x′ and x′:

P_{D_test}[(∆x, x′) ∈ •] := Pr[(∆x, x′) ∈ • | x ∼ D_test, x′ ∼ D_trns(x), ∆x = x − x′].   (3.4)

By applying transduction to keep both the marginals of x′ and ∆x in-support, we can convert difficult OOS problems into (potentially) more manageable OOC ones.
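The transducing distribution can be sketched in code. The brute-force 1-D numpy example below (names are ours; the paper samples x′ from D_trns(x) rather than enumerating all candidates) keeps exactly those anchors whose difference to the query lies within ρ of a seen training difference:

```python
import numpy as np

def select_anchors(x_query, x_train, rho):
    """Anchors x' in the training set such that the difference (x_query - x')
    is within distance rho of some training-time difference in Delta-X_train.
    Brute-force 1-D sketch of the support of the transducing distribution."""
    x_train = np.asarray(x_train, dtype=float)
    deltas_train = x_train[:, None] - x_train[None, :]  # all seen differences
    anchors = []
    for x_prime in x_train:
        dx = x_query - x_prime
        if np.min(np.abs(deltas_train - dx)) <= rho:
            anchors.append(float(x_prime))
    return anchors

# training inputs {0,...,4}: differences up to 4 were "seen". For an OOS
# query at 6, only anchors x' >= 2 keep the difference (x - x') in-support.
anchors = select_anchors(6.0, [0.0, 1.0, 2.0, 3.0, 4.0], rho=0.0)
assert anchors == [2.0, 3.0, 4.0]
```

Each retained anchor pairs a seen x′ with a seen ∆x, even though the combination (∆x, x′) never occurred at training time; this is the OOS-to-OOC conversion.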

3.2. BILINEAR REPRESENTATIONS FOR OOC LEARNING

Without additional assumptions, OOC extrapolation may be just as challenging as OOS. However, with certain low-rank structure it becomes feasible (Shah et al., 2020; Agarwal et al., 2021b; Athey et al., 2021). Following Agarwal et al. (2021a), we recognize that this low-rank property can be leveraged implicitly for our reparameterized OOC problem even in the continuous case (where x, ∆x do not explicitly form a finite-dimensional matrix), using a bilinear representation of the transductive predictor in Eq. (3.1): h̃_θ(∆x, x′) = ⟨f_θ(∆x), g_θ(x′)⟩. Here f_θ, g_θ map their respective inputs into a vector space of the same dimension, R^p. If the output space is K-dimensional, we independently model the prediction for each dimension using a set of K bilinear embeddings:

h̃_θ(∆x, x′) = (h̃_{θ,1}(∆x, x′), . . . , h̃_{θ,K}(∆x, x′));   h̃_{θ,k}(∆x, x′) = ⟨f_{θ,k}(∆x), g_{θ,k}(x′)⟩.   (3.5)

While the h̃_{θ,k} are bilinear in the embeddings f_{θ,k}, g_{θ,k}, the embeddings themselves may be parameterized by general function approximators. The effective "rank" of the transductive predictor is controlled by the dimension p of the continuous embeddings f_{θ,k}(∆x), g_{θ,k}(x′). We now introduce three conditions under which extrapolation is possible. As a preliminary, we write µ_1 ≪_κ µ_2 for two (not necessarily normalized) measures µ_1, µ_2 to denote µ_1(A) ≤ κ µ_2(A) for all events A. This notion of coverage has been studied in the reinforcement learning community (termed "concentrability" (Munos & Szepesvári, 2008)), and implies that the support of µ_1 is contained in the support of µ_2. The first condition is the following notion of combinatorial coverage.

Assumption 3.1 (Bounded combinatorial density ratio). We assume that D_test has κ-bounded combinatorial density ratio with respect to D_train, written as D_test ≪_{κ,comb} D_train.
This means that there exist distributions D_{∆X,i} over ∆X and D_{X,j} over X, for i, j ∈ {1, 2}, whose products D_{i⊗j} := D_{∆X,i} ⊗ D_{X,j} satisfy D_{i⊗j} ≪_κ D_train for all (i, j) ≠ (2, 2), and D_test ≪_κ D_{2⊗2}. Recall that p denotes the dimension of the parameterized embeddings in Eq. (3.5). Our next assumptions are that the ground-truth predictor is low-rank (we refer to this as "bilinearly transducible"), and that under D_{1⊗1} its component factors do not lie in a strict subspace of R^p, i.e., are not degenerate.

Assumption 3.2 (Bilinearly transducible). For each component k ∈ [K], there exist f_{⋆,k} : ∆X → R^p and g_{⋆,k} : X → R^p such that for all x ∈ X, the following holds with probability 1 over x′ ∼ D_trns(x): h_{⋆,k}(x) = h̃_{⋆,k}(∆x, x′) := ⟨f_{⋆,k}(∆x), g_{⋆,k}(x′)⟩, where ∆x = x − x′. Further, the ground-truth predictions are bounded: max_k sup_{∆x,x′} |h̃_{⋆,k}(∆x, x′)| ≤ M.

Assumption 3.3. For all components k ∈ [K], the following lower bound holds for some σ² > 0:

min{σ_p(E_{D_{1⊗1}}[f_{⋆,k} f_{⋆,k}^⊤]), σ_p(E_{D_{1⊗1}}[g_{⋆,k} g_{⋆,k}^⊤])} ≥ σ².   (3.6)

The following is proven in Appendix B; due to limited space, we defer further discussion of the bound to Appendix B.4.1.

Theorem 1. Suppose Assumptions 3.1 to 3.3 hold, and that the loss ℓ(•, •) is the square loss. Then, if the training risk satisfies R(h_θ; D_train) ≤ σ²/(4κ), the test risk is bounded as

R(h_θ; D_test) ≤ R(h_θ; D_train) · κ²(1 + 64 M⁴/σ⁴) = R(h_θ; D_train) · poly(κ, M/σ).
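The role of the combinatorial coverage and low-rank conditions can be seen in a finite analogue, where the (∆x, x′) pairs index a matrix. If training covers three of the four blocks of a low-rank table, the unseen fourth (test) block is determined. The block-completion identity below is our illustration of this classical fact, not the paper's algorithm:

```python
import numpy as np

# A rank-2 "transductive" table: entries h(dx, x') = <f(dx), g(x')>.
rng = np.random.default_rng(1)
F = rng.normal(size=(8, 2))    # f(dx) embeddings for 8 values of dx
G = rng.normal(size=(8, 2))    # g(x') embeddings for 8 anchors x'
H = F @ G.T                    # full table of ground-truth predictions

# OOC split: training covers blocks (1,1), (1,2), (2,1); block (2,2) is
# never observed. Recover the unseen block from the three observed ones:
A, B, C = H[:4, :4], H[:4, 4:], H[4:, :4]
D_hat = C @ np.linalg.pinv(A) @ B          # predicted (2,2) block

assert np.allclose(D_hat, H[4:, 4:], atol=1e-6)
```

The recovery relies on the observed block A being non-degenerate (the finite analogue of Assumption 3.3): if the factors in the (1,1) block spanned only a strict subspace, the unseen block would not be identified.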

3.3. OUR PROPOSED ALGORITHM: BILINEAR TRANSDUCTION

Our basic proposal for OOS extrapolation is bilinear transduction, depicted in Algorithm 1. At training time, a predictor h̃_θ is trained to make predictions for training points x_i drawn from the training set X_train based on their similarity with other points x_j ∈ X_train: h̃_θ(x_i − x_j, x_j). The training pairs x_i, x_j are sampled uniformly from X_train. At test time, for an OOS point x_test, we first select an anchor point x_i from the training set such that the difference x_test − x_i lies within some radius ρ of the training differences ∆X_train. We then predict the value for x_test based on the anchor point x_i and the difference between the test and anchor points: h̃_θ(x_test − x_i, x_i). For supervised learning, i.e., the regression setting, we compute differences directly between inputs x_i, x_j ∈ X. For the goal-conditioned imitation learning setting, we compute differences between states x_i, x_j ∈ X sampled uniformly over demonstration trajectories. At test time, we select an anchor trajectory based on the goal, and transduce each anchor state in the anchor trajectory to predict a sequence of actions for a test goal. As explained in Appendix C.1, there may be scenarios where it is beneficial to more carefully choose which points to transduce. To this end, we outline a more sophisticated proposal, weighted transduction, that achieves this. Pseudocode for bilinear transduction is given in Algorithm 1, and for the weighted variant in Algorithm 2 in the Appendix. Note that we assume access to a representation of X for which the subtraction operator captures differences that occur between training points and between training and test points. In Appendix D we describe examples of such representations as defined in standard simulators or extracted with dimensionality reduction techniques (PCA).
Alternatively, one might extract such representations via self-supervision (Kingma & Welling, 2013) and pre-trained models (Upchurch et al., 2017), or compare data points via another similarity function.
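The test-time step of bilinear transduction can be sketched as follows. This is our minimal 1-D illustration: the embeddings f, g are hand-coded stand-ins that are exactly bilinear for h⋆(x) = sin(x) via the angle-addition identity; in an actual run they would be learned by regression on training pairs.

```python
import numpy as np

def predict_transductive(x_query, x_train, f, g, rho):
    """Test-time step (sketch): pick an anchor x' from the training set whose
    difference to the query is closest to a seen training difference, then
    predict <f(x - x'), g(x')>."""
    deltas_train = x_train[:, None] - x_train[None, :]   # seen differences
    best, best_gap = None, np.inf
    for x_prime in x_train:
        gap = np.min(np.abs((x_query - x_prime) - deltas_train))
        if gap < best_gap:
            best, best_gap = x_prime, gap
    assert best_gap <= rho, "no in-support difference within radius rho"
    return f(x_query - best) @ g(best)

# Stand-in embeddings: sin(dx + x') = sin(dx)cos(x') + cos(dx)sin(x')
# is exactly bilinear with p = 2, matching Assumption 3.2 for h*(x) = sin(x).
f = lambda dx: np.array([np.sin(dx), np.cos(dx)])
g = lambda xp: np.array([np.cos(xp), np.sin(xp)])
x_train = np.linspace(20.0, 40.0, 201)                 # training support [20, 40]
y_hat = predict_transductive(45.0, x_train, f, g, rho=0.1)  # OOS query

assert abs(y_hat - np.sin(45.0)) < 1e-6
```

The OOS query x = 45 lies outside [20, 40], but its difference to anchors near the right edge of the training support is a "seen" ∆x, so the bilinear predictor extrapolates exactly.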

4. EXPERIMENTS

We answer the following questions through an empirical evaluation: (1) Does reformulating OOS extrapolation as a combinatorial generalization problem allow for extrapolation in a variety of supervised and sequential decision-making problems? (2) How does the particular choice of training distribution and data-generating function affect performance of the proposed technique? (3) How important is the choice of the low-rank bilinear function class for generalization? (4) Does our method scale to high-dimensional state and action spaces? We first analyze OOS extrapolation on analytical domains and then on more realistic problem domains depicted in Fig 8.
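As a preview of question (1), the failure of a standard inductive fit out of support is easy to reproduce. The sketch below is our toy setup (not the paper's exact experiment): a degree-9 polynomial least-squares fit to a mixed-period function on the training range blows up when evaluated out of support.

```python
import numpy as np

rng = np.random.default_rng(0)
target = lambda x: np.sin(x) + np.sin(2.0 * x)   # periodic, mixed periods
x_tr = rng.uniform(20, 40, 2000)                 # training range [20, 40]
x_te = rng.uniform(40, 50, 2000)                 # OOS test range [40, 50]

# inductive baseline: degree-9 polynomial least squares on centered inputs
center = lambda x: (x - 30.0) / 10.0             # map [20, 40] -> [-1, 1]
coef = np.polyfit(center(x_tr), target(x_tr), deg=9)
mse = lambda xs: float(np.mean((np.polyval(coef, center(xs)) - target(xs)) ** 2))

# the polynomial tracks the data in-support but diverges out of support
assert mse(x_te) > 10.0 * mse(x_tr)
```

Any fixed-degree polynomial (and, empirically, standard neural networks) exhibits this divergence, whereas the periodic structure is exactly what the transductive reparameterization can exploit.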

4.1. ANALYZING OOS EXTRAPOLATION ON ANALYTICAL PROBLEMS

We compare our method on regression problems generated from 1-D analytical functions (described in Appendix D.1) against standard neural networks with multiple fully-connected layers, trained on the range [20, 40] and tested on [10, 20] ∪ [40, 50] (with the exception of Fig 4c, trained on [−1, 1] and tested on [−1.6, −1] ∪ [1, 1.6]). We use these domains to gain intuition about the following questions.

What types of problems satisfy the assumptions for extrapolation? While we outlined a set of conditions under which extrapolation is guaranteed, it is not apparent which problems satisfy these assumptions. To understand this better, we considered learning functions with different structure, such as a periodic function with mixed periods (Fig 4a). Our approach (pink in Fig 4) accurately extrapolates on periodic functions and, more broadly, functions with repeated structure, whereas it struggles on arbitrary polynomials. This is because periodic functions have symmetries which induce low-rank structure under the proposed reparameterization. Standard neural nets fail to extrapolate in most settings, even when provided periodic activations (Tancik et al., 2020).

Going beyond known inductive biases. Given our method extrapolates to periodic functions in

4.2. ANALYZING OOS EXTRAPOLATION ON LARGER SCALE DECISION-MAKING PROBLEMS

To establish that our method is useful for complex, real-world problem domains, we vary both the complexity (i.e., working with high-dimensional observation and action spaces) and the learning setting (regression and sequential decision making).

Baselines. To assess the effectiveness of our proposed scheme for extrapolation via transduction and bilinear embeddings, we compare with the following non-transductive baselines. Linear Model: a linear function approximator, to understand whether the issue is one of overparameterization and whether linear models would solve the problem. Neural Networks: typical training of an overparameterized neural network function approximator via standard empirical risk minimization. Alternative Techniques with Neural Networks (DeepSets): an alternative architecture for combining multiple inputs (DeepSets (Zaheer et al., 2017)), which is permutation invariant and encourages a degree of generalization between different pairings of states and goals. Finally, we compare with a Transductive Method without a Mechanism for Structured Extrapolation (Transduction): transduction with no special structure, to understand the impact of bilinear embeddings and the low-rank structure. This baseline uses our reparameterization, but parameterizes h̃_θ as a standard neural network. We present additional comparison results introducing periodic activations (Tancik et al., 2020) in Table 4 in the Appendix.

OOS extrapolation in sequential decision making. Table 1 contains results for different types of extrapolation settings. More training and evaluation details are provided in Appendices D.1 and D.2.

• Extrapolation to OOS Goals: We considered two tasks from the Meta-World benchmark (Yu et al., 2020) where a simulated robotic agent needs to either reach or push a target object to a goal location (column 2 in Fig 8).
Given a set of expert demonstrations reaching/pushing to goals in the blue box ([0, 0.4] × [0.7, 0.9] × [0.05, 0.3] and [0, 0.3] × [0.5, 0.7] × {0.01}), we tested generalization to OOS goals in the orange box ([−0.4, 0] × [0.7, 0.9] × [0.05, 0.3] and [−0.3, 0] × [0.5, 0.7] × {0.01}), using a simple extension of our method described in Appendix D.2 to perform transduction over trajectories rather than individual states. We quantify performance by measuring the distance between the conditioned and reached goal. Results in Table 1 show that on the easy task of reaching, a linear or typical neural network-based predictor extrapolates as well as our method. However, for the more challenging task of pushing an object, our method's extrapolation is better by an order of magnitude than the baselines, showing the ability to generalize to goals in a completely different direction.

• Extrapolation with Large State and Action Space: Next we tested our method on grasping and placing an object at OOS goal locations in R³ with an anthropomorphic "Adroit" hand with a much larger action (R³⁰) and state (R³⁹) space (column 3 in Fig 8). Results confirm that bilinear transduction scales to high-dimensional state-action spaces and is naturally able to grasp the ball and move it to new target locations ([−0.3, 0] × [−0.3, 0] × [0.15, 0.35]) after being trained on target locations in [0, 0.3] × [0, 0.3] × [0.15, 0.35]. These results are best appreciated in the video attached to the supplementary material, but show the same trends as above, with bilinear transduction significantly outperforming standard inductive methods and non-bilinear architectures.

• Extrapolation to OOS Dynamics: Lastly, we consider problems involving extrapolation not just in the location of goals for goal-reaching problems, but across problems with varying dynamics.
Specifically, we consider a slider task where the goal is to move a slider on a table to strike a ball such that it rolls to a fixed target position (column 4 in Fig 8). The mass of the ball varies across episodes and is provided as input to the policy. We train and test on different ranges of masses ([60, 130] and [5, 15], respectively). We find that bilinear transduction adjusts its behavior and successfully extrapolates to new masses, showing the ability to extrapolate not just to new goals but also to varying dynamics. Importantly, bilinear transduction exhibits significantly lower variance than standard inductive or permutation-invariant architectures. This can be seen from a heatmap over various training architectures and training seeds shown in Fig 12 and Table 5 in Appendix C.4. While other methods can sometimes show success, only bilinear transduction consistently accomplishes extrapolation.

OOS extrapolation in higher dimensional regression problems. To scale up the dimension of the input space, we consider the problem of predicting valid grasping points in R³ from point clouds of various objects (bottles, mugs, and teapots) with different orientations, positions, and scales (column 1 in Fig 8). At training and test time, objects undergo z-axis orientation changes ([0, 1.2π] and [1.2π, 2π]), translation ([0, 0.5] × [0, 0.5] and [0.5, 0.7] × [0.5, 0.7]), and scaling ([0.7, 1.3] and [1.3, 1.6]). In this domain, we represent entire point clouds by a low-dimensional representation obtained via PCA. We consider settings where we train on individual objects, with a training set consisting of various rotations, translations, and scales, and standard bilinear transduction is applied. We also consider settings where the objects are not individually identified, but instead a single grasp point predictor is trained on the entire set of bottles, mugs, and teapots. We assume access to category labels at training time, but do not require them at test time.
For a more in-depth discussion of this assumption, and of how category identification can be learned via weighted transduction, we refer readers to Appendix C.1. While training performance is comparable in all instances, we find that extrapolation behavior is significantly better for our method. This is true both for single-object cases and for scenarios where all objects are mixed together (Table 1). These experiments show that bilinear transduction can work on feature spaces of high-dimensional data such as point clouds. Note that while we only predict a single grasp point here, single grasp point prediction can be readily generalized to predicting multiple keypoints (Manuelli et al., 2019), enabling success on more challenging control domains. For further visualizations and details, please see Appendices C and D.

5. RELATED WORK

Generalization in machine learning has been studied extensively (Vapnik, 2006; Bousquet & Elisseeff, 2002; Bartlett & Mendelson, 2002). Our focus is on performance on distributions containing examples which may not be covered by the training data, as described formally in Section 2. Here we discuss the most directly related approaches and defer an extended discussion to Appendix A. Even more so than generalization, extrapolation necessitates leveraging structure in both the data and the learning algorithm. Along these lines, past work has focused on structured neural networks, which hardcode symmetries such as equivariance (Cohen & Welling, 2016; Simeonov et al., 2022), Euclidean symmetry (Smidt, 2021), and periodicity (Abu-Dakka et al., 2021; Parascandolo et al., 2016) into the learning process. Other directions have focused on general architectures which seem to exhibit combinatorial generalization, such as transformers (Vaswani et al., 2017), graph neural networks (Cappart et al., 2021), and bilinear models (Hong et al., 2021). In this work, we focus on bilinear architectures with what we term a transductive parametrization.
We demonstrate that this framework can in many cases learn the symmetries (e.g., equivariance and periodicity) that structured neural networks hardcode, and achieve extrapolation in some regimes in which these latter methods cannot. A bilinear model with low inner dimension is equivalent to enforcing a low-rank constraint on one's predictions. Low-rank models have enjoyed broad popularity for matrix completion (Mnih & Salakhutdinov, 2007; Mackey et al., 2015). Whereas early analysis focused on the missing-at-random setting (Candès & Recht, 2009; Recht, 2011; Candes & Plan, 2010), which is equivalent to classical in-distribution statistical learning, we adopt more recent perspectives on missing-not-at-random data (see, e.g., Shah et al., 2020; Agarwal et al., 2021b; Athey et al., 2021), which tackle the combinatorial-generalization setting described in Section 3.2. In the classical transduction setting (Joachims, 2003; Gammerman et al., 1998; Cortes & Mohri, 2006), the goal is to make predictions on a known set of unlabeled test examples whose features are known in advance; this is a special case of semi-supervised learning. We instead operate in the standard supervised learning paradigm, where test examples are only revealed at test time. Still, we find it useful to adopt a "transductive parametrization" of our predictor: instead of compressing our predictor into parameters alone, we express predictions for the label of one example as a function of other, labeled examples. An unabridged version of the related work is in Appendix A.

6. DISCUSSION

Our work serves as an initial study of the circumstances under which problem structure can be both discovered and exploited for extrapolation, combining parametric and non-parametric approaches. The main limitations of this work are the assumptions of access to a representation of X and a similarity measure for obtaining ∆x. Furthermore, we are only guaranteed to extrapolate to regions of X that admit ∆x within the training distribution. A number of natural questions arise for further research. First, can we characterize which real-world domains fit our assumptions, beyond the domains we have demonstrated? Second, can we learn a representation of X in which differences ∆x are meaningful for high-dimensional domains? And lastly, are there more effective schemes for selecting anchor points? For instance, analogy-making for anchor point selection may reduce the complexity of h̃_θ, guaranteeing low-rank structure.

7. ETHICS STATEMENT

Bias. In this work on extrapolation via bilinear transduction, we can only aim to extrapolate to data that differs from the training set in ways already represented within the training data. The performance of this method therefore relies on the diversity of the training data: it enables some forms of extrapolation, but not to shifts that are unrepresented within the training distribution.

Dataset release

We describe in detail in the Appendix the parameters, code bases, and datasets used to generate our data. In the future, we plan to release our code base and the expert policy weights with which we collected expert data.

8. REPRODUCIBILITY STATEMENT

We describe our algorithms in Section 3.3 and the complete proofs of our theoretical results and assumptions in Appendix B. Extensive implementation details regarding our algorithms, data, models, and optimization are provided in Appendix D. In addition, we plan to release our code and data in the future.

A UNABRIDGED RELATED WORK

Here we discuss various prior approaches to extrapolation in greater depth, focusing on areas not addressed in our abridged discussion in the main text.

Approaches which encode explicit structure. One popular approach to designing networks that extrapolate to novel data has been equivariant neural networks, first proposed by Cohen & Welling (2016). The key idea is that if a group G is known to act on both the input and target domains of the predictor, and the true predictor is understood to satisfy the equivariance property h⋆(g • x) = g • h⋆(x) for all g ∈ G (here g • (·) denotes the group action), then one can explicitly encode for predictors h_θ satisfying the same identity. Similarly, one can encode for invariances, where h⋆(g • x) = h⋆(x) for all g ∈ G; see, e.g., Parascandolo et al. (2016) for how periodic structure can be explicitly built in. We remark that in many of these approaches, the group/symmetry must be known ahead of time. While some works attempt to learn global (Benton et al., 2020) or local (Dehmamy et al., 2021) equivariances/invariances, there are numerous forms of structure that cannot be represented as a group symmetry, and for which these methods do not apply. As we show in our experiments, there are a number of structures which are not captured by equivariance (such as Fig 7), but which bilinear transduction captures.

Architectures favorable to extrapolation. There are numerous learning architectures which are purported to be favorable to extrapolation to novel domains.
For example, graph neural networks (GNNs) have been used to facilitate reasoning in combinatorial environments (Battaglia et al., 2018; Cappart et al., 2021), and the implicit biases of GNNs have received much theoretical study in recent years (see Jegelka (2022) and the references therein). Another popular domain for extrapolation has been sequence-to-sequence modeling, especially in the context of natural language processing. Here, the now-renowned Transformer architecture due to Vaswani et al. (2017), as well as variants based on the same "attention mechanism" (e.g., Kitaev et al. (2020)), have become incredibly popular. Supervised by a sufficiently diverse set of tasks and a staggering amount of data, these have shown broad success in various language understanding tasks (Devlin et al., 2018), text-to-image generation (Ramesh et al., 2021), and even quantitative reasoning (Lewkowycz et al., 2022). Attention-based models tend to excel in tasks involving language when trained on massively large corpora; their ability to extrapolate in manipulation tasks (Zhou et al., 2022) and in more modest data regimes remains an area of ongoing research. Moreover, while recent research has aimed to study their (in-distribution) generalization properties (Edelman et al., 2022), their capacity for "reasoning" more generally remains mysterious (Zhang et al., 2022b).

Distributional Robustness. One popular approach to out-of-distribution learning has been distributional robustness, which seeks predictors that perform well on a family of "nearby" shifted distributions (Mitchell, 2021). We formalize this notion in the context of out-of-support extrapolation and show that transduction allows for analogy making under low-rank structure assumptions.

B THEORETICAL RESULTS

This appendix is organized into four parts:
• Appendix B.1 introduces the bounded density condition, and reproduces a folklore guarantee for extrapolation when one distribution has bounded density with respect to the other.
• Appendix B.2 derives a simple extrapolation guarantee for matrix completion.
• Appendix B.3 provides our general analysis for combinatorial extrapolation.
• Appendix B.4 leverages these results to prove Theorem 1.

B.1 THE BOUNDED DENSITY CONDITION

The following gives a robust, quantitative notion of when one distribution is in the support of another. For generality, we state this condition in terms of general positive measures µ1, µ2, which need not be normalized to sum to one.

Definition B.1 (κ-bounded density ratio). Let µ1, µ2 be two measures over a space Ω. We say µ1 has κ-bounded density with respect to µ2, which we denote µ1 ≪κ µ2, if for every measurable event A ⊂ Ω, µ1[A] ≤ κ µ2[A].

Stating Definition B.1 for general measures affords us the flexibility to write, for example, P1 ≪κ P2 + P3, as P2 + P3 is a nonnegative measure with total mass 1 + 1 = 2.

Remark B.1 (Connection to concentrability). The parameter κ is known in the off-policy reinforcement learning literature as the concentrability coefficient (see, e.g., Munos & Szepesvári (2008)), and appears in controlling the performance of a rollout policy π1 trained on a behavior policy π2 by asserting that Pπ1 ≪κ Pπ2, where Pπi is, for example, a state-action visitation probability under πi.

Remark B.2 (Density nomenclature). The nomenclature "density" refers to the alternative definition in terms of Radon-Nikodym derivatives. To avoid technicalities, this is best seen when µ1 and µ2 are continuous measures over Ω = R^d with densities p1(•) and p2(•), i.e., µi[A] = ∫_{x∈A} pi(x) dx for i ∈ {1, 2}. Then µ1 ≪κ µ2 is equivalent to sup_x p1(x)/p2(x) ≤ κ.

The following lemma motivates the use of Definition B.1. Its proof is standard but included for completeness.

Lemma B.1. Let µ1, µ2 be measures on the same measurable space Ω, and suppose that µ2 ≪κ µ1. Then, for any nonnegative function ϕ, µ2[ϕ] ≤ κ µ1[ϕ].
In particular, if Dtest ≪κ Dtrain, then as long as our loss function ℓ(•, •) is nonnegative, R(hθ; Dtest) ≤ κ R(hθ; Dtrain). Thus, up to a κ-factor, R(hθ; Dtest) inherits any in-distribution generalization guarantees for R(hθ; Dtrain).

Proof. As in standard measure theory (c.f. Çinlar (2011, Chapter 1)), we can approximate any ϕ ≥ 0 by a sequence of simple functions ϕn ↑ ϕ, where ϕn(ω) = Σ_{i=1}^{kn} c_{n,i} I{ω ∈ A_{n,i}}, with A_{n,i} ⊂ Ω and c_{n,i} ≥ 0. For each ϕn, we have

µ2[ϕn] = Σ_{i=1}^{kn} c_{n,i} µ2[A_{n,i}] ≤ κ Σ_{i=1}^{kn} c_{n,i} µ1[A_{n,i}] = κ µ1[ϕn].

The result now follows from the monotone convergence theorem. To derive the special case for Dtest and Dtrain, apply the general result with the nonnegative function ϕ(x) = E_{y∼hθ(x)} ℓ(y, h⋆(x)) (recall ℓ(•, •) ≥ 0 by assumption), µ1 = Dtrain and µ2 = Dtest.
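The discrete case of Definition B.1 and Lemma B.1 is easy to check numerically. The following sketch uses two made-up discrete measures; it computes the smallest valid κ (the maximum pointwise mass ratio) and verifies the lemma's inequality on random nonnegative test functions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up discrete measures on {0, ..., 9}; mu1 is uniform.
mu1 = np.full(10, 0.1)
mu2 = np.array([0.25, 0.25, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05, 0.025, 0.025])

# For discrete measures, the smallest kappa with mu2[A] <= kappa * mu1[A]
# for every event A is the maximum pointwise ratio of the masses.
kappa = np.max(mu2 / mu1)

# Lemma B.1: mu2[phi] <= kappa * mu1[phi] for every nonnegative phi.
for _ in range(100):
    phi = rng.uniform(0.0, 1.0, size=10)  # random nonnegative test function
    assert mu2 @ phi <= kappa * (mu1 @ phi) + 1e-12
print("kappa =", round(kappa, 6))
```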

B.2 EXTRAPOLATION FOR MATRIX COMPLETION

In what follows, we derive a simple extrapolation guarantee for matrix completion. The following is in the spirit of the Nyström column approximation (see e.g. Gittens & Mahoney (2013)), and our proof follows the analysis due to Shah et al. (2020). Throughout, consider

M̂ = [M̂11 M̂12; M̂21 M̂22],  M⋆ = [M⋆11 M⋆12; M⋆21 M⋆22],

where we decompose M̂, M⋆ into blocks (i, j) ∈ {1, 2}² of dimension ni × mj.

Lemma B.2. Suppose that M̂ has rank at most p, M⋆ has rank p, and for all (i, j) ≠ (2, 2), ∥M̂ij − M⋆ij∥F ≤ ϵ and ∥M⋆ij∥F ≤ M, where ϵ ≤ σp(M⋆11)/2. Then,

∥M̂22 − M⋆22∥F ≤ 8ϵ M²/σp(M⋆11)².

Proof. The proof mirrors that of Shah et al. (2020, Proposition 13). We shall show below that M̂11 is of rank exactly p. Hence, Shah et al. (2020, Lemma 12) gives the following exact expressions for the bottom-right blocks:

M̂22 = M̂21 M̂11† M̂12,  M⋆22 = M⋆21 (M⋆11)† M⋆12,

where above (•)† denotes the Moore-Penrose pseudoinverse. Since ∥M̂11 − M⋆11∥op ≤ ∥M̂11 − M⋆11∥F ≤ ϵ ≤ σp(M⋆11)/2, Weyl's inequality implies that M̂11 has rank p (as promised), and that ∥M̂11†∥op ≤ 2σp(M⋆11)⁻¹. Similarly, as ∥M̂12 − M⋆12∥op ≤ σp(M⋆11)/2 ≤ M/2, we have ∥M̂12∥op ≤ (3/2)M. Thus,

∥M̂22 − M⋆22∥F ≤ ∥M̂21 − M⋆21∥F ∥M̂11†∥op ∥M̂12∥op + ∥M⋆21∥op ∥M̂11†∥op ∥M⋆12 − M̂12∥F + ∥M⋆12∥op ∥M⋆21∥op ∥M̂11† − (M⋆11)†∥F
≤ 5ϵM/(2σp(M⋆11)) + M² ∥M̂11† − (M⋆11)†∥F.

Next, using a perturbation bound on the pseudoinverse due to Meng & Zheng (2010, Theorem 2.1),

∥M̂11† − (M⋆11)†∥F ≤ ∥M̂11 − M⋆11∥F · max{∥M̂11†∥op², ∥(M⋆11)†∥op²} ≤ ϵ · 4σp(M⋆11)⁻².

Therefore, we conclude

∥M̂22 − M⋆22∥F ≤ 5ϵM/(2σp(M⋆11)) + 4ϵM²/σp(M⋆11)² ≤ 8ϵ M²/σp(M⋆11)².
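The exact completion identity at the heart of Lemma B.2 can be illustrated numerically. The sketch below (on synthetic data, not our experimental setup) builds a rank-p matrix, observes only the (1,1), (1,2), (2,1) blocks, and recovers the (2,2) block via the Nyström formula M̂22 = M̂21 M̂11† M̂12:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 2  # block size and rank

# Build a rank-p ground-truth matrix M* = A B^T, split into 2x2 blocks.
A = rng.standard_normal((2 * n, p))
B = rng.standard_normal((2 * n, p))
M = A @ B.T
M11, M12 = M[:n, :n], M[:n, n:]
M21, M22 = M[n:, :n], M[n:, n:]

# Nystrom-style completion (the Shah et al.-style identity in Lemma B.2):
# if M11 has rank p, then M22 = M21 M11^+ M12.
M22_hat = M21 @ np.linalg.pinv(M11) @ M12
print(np.allclose(M22_hat, M22))  # True: sigma_p(M11) > 0 almost surely here
```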

B.3 GENERAL ANALYSIS FOR COMBINATORIAL EXTRAPOLATION

We now provide our general analysis for combinatorial extrapolation. To avoid excessive subscripts, we write X = W × V rather than X = X1 × X2 as in the main body. We consider extrapolation under the following definition of combinatorial support.

Definition B.2 (Bounded combinatorial density ratio). Let D, D′ be two distributions over a product space W × V. We say D′ has κ-bounded combinatorial density ratio with respect to D, written D′ ≪κ,comb D, if there exist distributions DW,i and DV,j, i, j ∈ {1, 2}, over W and V, respectively, such that the products Di⊗j := DW,i ⊗ DV,j satisfy

Σ_{(i,j)≠(2,2)} Di⊗j ≪κ D, and D′ ≪κ Σ_{i,j=1,2} Di⊗j.

For simplicity, we consider scalar predictors, as the general result for vector-valued estimators can be obtained by stacking the components. Specifically, we consider a ground-truth predictor h⋆ and an estimator ĥ of the form

h⋆ = ⟨f⋆, g⋆⟩, ĥ = ⟨f̂, ĝ⟩, with f⋆, f̂ : W → R^p and g⋆, ĝ : V → R^p. (B.1)

Lastly, we choose the (scalar) square loss, yielding the following risk:

R(ĥ; D) := E_{(w,v)∼D}[(h⋆(w, v) − ĥ(w, v))²].

Throughout, we assume that all expectations that arise are finite. Our main guarantee is as follows.

Theorem 2. Let D, D′ be two distributions on W × V satisfying D′ ≪κ,comb D, with corresponding factor distributions DW,i and DV,j, i, j ∈ {1, 2}. Define the effective singular value

σ⋆² := σp(E_{DW,1}[f⋆(w)f⋆(w)⊤]) · σp(E_{DV,1}[g⋆(v)g⋆(v)⊤]), (B.2)

and suppose that max_{1≤i,j≤2} E_{Di⊗j}|h⋆(w, v)|² ≤ M⋆². Then, provided R(ĥ; D) ≤ σ⋆²/(4κ),

R(ĥ; D′) ≤ R(ĥ; D) · κ²(1 + 64 M⋆⁴/σ⋆⁴) = R(ĥ; D) · poly(κ, M⋆/σ⋆).

B.3.1 PROOF OF THEOREM 2

First, let us assume the following two conditions hold; we shall derive these conditions from the conditions of Theorem 2 at the end of the proof:

∀(i, j) ≠ (2, 2): R(ĥ; Di⊗j) ≤ ϵ², E_{Di⊗j}[h⋆(w, v)²] ≤ M⋆², and ϵ < σ⋆/2.
(B.3)

(Notice that here we take M⋆² as an upper bound on E_{Di⊗j}[h⋆(w, v)²], rather than the pointwise upper bound in Theorem 2; this is for convenience in a limiting argument below.)

Our strategy is first to prove a version of Theorem 2 for when W and V have finite cardinality, by reduction to the analysis of matrix completion in Lemma B.2, and then to extend to arbitrary domains via a limiting argument.

Lemma B.3. Suppose that Eq. (B.3) holds, and in addition, that W and V have finite cardinality. Then,

R(ĥ; D2⊗2) = ∥M̂22 − M⋆22∥F² ≤ 64ϵ² M⋆⁴/σ⋆⁴.

Proof of Lemma B.3. By adding additional null elements to either W or V, we may assume without loss of generality that |W| = |V| = d, and enumerate their elements {w1, . . . , wd} and {v1, . . . , vd}. Let p_{i,a} = Pr_{w∼DW,i}[w = wa] and q_{j,b} = Pr_{v∼DV,j}[v = vb]. Consider matrices M̂, M⋆ ∈ R^{2d×2d}, with d × d blocks

(M̂ij)_{ab} = √(p_{i,a} q_{j,b}) · ĥ(wa, vb), (M⋆ij)_{ab} = √(p_{i,a} q_{j,b}) · h⋆(wa, vb).

We then verify that

∥M̂ij − M⋆ij∥F² = Σ_{a,b=1}^d p_{i,a} q_{j,b} (ĥ(wa, vb) − h⋆(wa, vb))² = E_{Di⊗j}[(ĥ(w, v) − h⋆(w, v))²] = R(ĥ; Di⊗j), (B.4)

and thus ∥M̂ij − M⋆ij∥F² ≤ ϵ² for (i, j) ≠ (2, 2). Furthermore, define the matrices Âi, B̂j via rows (Âi)_a := √(p_{i,a}) f̂(wa)⊤ and (B̂j)_b := √(q_{j,b}) ĝ(vb)⊤, and define A⋆i, B⋆j similarly. Then,

M̂ = [Â1; Â2][B̂1; B̂2]⊤, M⋆ = [A⋆1; A⋆2][B⋆1; B⋆2]⊤,

showing that rank(M̂), rank(M⋆) ≤ p. Finally, by Eq. (B.2),

σp(M⋆11)² = σp(A⋆1(B⋆1)⊤)² ≥ σp(A⋆1)² σp(B⋆1)² = σp((A⋆1)⊤A⋆1) · σp((B⋆1)⊤B⋆1)
= σp(Σ_{a=1}^d p_{1,a} f⋆(wa) f⋆(wa)⊤) · σp(Σ_{b=1}^d q_{1,b} g⋆(vb) g⋆(vb)⊤)
= σp(E_{DW,1}[f⋆(w)f⋆(w)⊤]) · σp(E_{DV,1}[g⋆(v)g⋆(v)⊤]) = σ⋆².

Lastly, we have ∥M⋆ij∥F² = Σ_{a,b} p_{i,a} q_{j,b} h⋆(wa, vb)² = E_{Di⊗j}[h⋆(w, v)²] ≤ M⋆². Thus, Eq. (B.4) and Lemma B.2 imply that R(ĥ; D2⊗2) = ∥M̂22 − M⋆22∥F² ≤ 64ϵ² M⋆⁴/σ⋆⁴.

Lemma B.4. Suppose that Eq.
(B.3) holds, but, unlike Lemma B.3, W and V need not be finite spaces. Then, it still holds that

R(ĥ; D2⊗2) ≤ 64ϵ² M⋆⁴/σ⋆⁴.

Proof of Lemma B.4. For n ∈ N, define h⋆,n = ⟨f⋆,n, g⋆,n⟩ and ĥn = ⟨f̂n, ĝn⟩, where f⋆,n, f̂n, ĝn, g⋆,n are simple functions (i.e., of finite range; see the proof of Lemma B.1) converging to f⋆, f̂, g⋆, ĝ. Define

σ⋆,n² = σp(E_{DW,1}[f⋆,n(w)f⋆,n(w)⊤]) · σp(E_{DV,1}[g⋆,n(v)g⋆,n(v)⊤]),
M⋆,n² = max_{(i,j)≠(2,2)} E_{Di⊗j}[h⋆,n(w, v)²],
ϵn² = max_{(i,j)≠(2,2)} R(ĥn; Di⊗j).

By the dominated convergence theorem,

lim inf_{n≥1} σ⋆,n² ≥ σ⋆², lim sup_{n≥1} M⋆,n² ≤ M⋆², lim sup_{n≥1} ϵn² ≤ ϵ².

In particular, as ϵ² < σ⋆²/4, applying Lemma B.3 for n sufficiently large gives

R(ĥn; D2⊗2) ≤ 64 (M⋆,n⁴/σ⋆,n⁴) ϵn².

Indeed, for any fixed n, all of f̂n, ĝn, f⋆,n, g⋆,n are simple functions, so we can partition W and V into sets on which these embeddings are constant, and thus treat W and V as finite domains; this enables the application of Lemma B.3. Finally, using the dominated convergence theorem one last time,

R(ĥ; D2⊗2) = lim_{n→∞} R(ĥn; D2⊗2) ≤ lim sup_{n≥1} 64 (M⋆,n⁴/σ⋆,n⁴) ϵn² ≤ 64 (M⋆⁴/σ⋆⁴) ϵ².

We can now conclude the proof of our theorem.

Proof of Theorem 2. As D′ ≪κ Σ_{i,j} Di⊗j and Σ_{(i,j)≠(2,2)} Di⊗j ≪κ D, Lemma B.1 and additivity of the integral imply

R(ĥ; D′) ≤ κ R(ĥ; D2⊗2) + κ Σ_{(i,j)≠(2,2)} R(ĥ; Di⊗j) ≤ κ R(ĥ; D2⊗2) + κ² R(ĥ; D). (B.5)

Moreover, setting ϵ² := κ R(ĥ; D), we have

max_{(i,j)≠(2,2)} R(ĥ; Di⊗j) ≤ Σ_{(i,j)≠(2,2)} R(ĥ; Di⊗j) ≤ κ R(ĥ; D) = ϵ².

Thus, for R(ĥ; D) < σ⋆²/(4κ), Lemma B.4 gives

R(ĥ; D2⊗2) ≤ 64ϵ² M⋆⁴/σ⋆⁴ = 64κ R(ĥ; D) M⋆⁴/σ⋆⁴.

Thus, combining with Eq. (B.5),

R(ĥ; D′) ≤ κ² R(ĥ; D) · (1 + 64 M⋆⁴/σ⋆⁴),

completing the proof.
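To make the mechanism of Theorem 2 concrete, the following sketch fits a bilinear model on samples from the (1,1), (1,2) and (2,1) product blocks only and evaluates it on the held-out (2,2) block. The feature maps and ground-truth Θ are illustrative assumptions, not our learned embeddings; in this noiseless, well-specified case the fitted model extrapolates out-of-combination exactly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Ground-truth bilinear structure h*(w, v) = f*(w)^T Theta g*(v) with known
# (assumed) feature maps; only Theta is learned, by least squares.
phi = lambda w: np.stack([np.sin(w), np.cos(w)], axis=-1)   # f-features
psi = lambda v: np.stack([v, v ** 2], axis=-1)              # g-features
Theta_true = np.array([[1.0, -0.5], [0.3, 2.0]])
h_star = lambda w, v: np.einsum('...i,ij,...j->...', phi(w), Theta_true, psi(v))

# Factor distributions: W1 = [0,1], W2 = [2,3], V1 = [0,1], V2 = [2,3].
def sample(wlo, vlo, n=200):
    return rng.uniform(wlo, wlo + 1, n), rng.uniform(vlo, vlo + 1, n)

# Train on the (1,1), (1,2), (2,1) product blocks only.
blocks = [sample(0, 0), sample(0, 2), sample(2, 0)]
X = np.concatenate([np.einsum('ni,nj->nij', phi(w), psi(v)).reshape(len(w), -1)
                    for w, v in blocks])
y = np.concatenate([h_star(w, v) for w, v in blocks])
Theta_hat = np.linalg.lstsq(X, y, rcond=None)[0].reshape(2, 2)

# Evaluate on the held-out (2,2) block: out-of-combination inputs.
w2, v2 = sample(2, 2)
pred = np.einsum('ni,ij,nj->n', phi(w2), Theta_hat, psi(v2))
print(np.max(np.abs(pred - h_star(w2, v2))) < 1e-6)  # extrapolates to (2,2)
```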

B.4 EXTRAPOLATION FOR TRANSDUCTION

Leveraging Theorem 2, this section proves Theorem 1, thereby providing a formal theoretical justification for predictors of the form Eq. (3.5).

Proof of Theorem 1. We argue by reducing to Theorem 2. The reparameterization of the stochastic predictor hθ in Eq. (3.1), followed by Assumption 3.2, allows us to write

R(hθ; Dtrain) = E_{x∼Dtrain} E_{y∼hθ(x)} ℓ(y, h⋆(x))
= E_{x∼Dtrain} E_{x′∼Dtrns(x)} ℓ(h̃θ(x − x′, x′), h⋆(x))
= E_{x∼Dtrain} E_{x′∼Dtrns(x)} ℓ(h̃θ(x − x′, x′), h̃⋆(x − x′, x′)).

In the above display, the joint distribution of (x − x′, x′) is precisely given by D̃train (see Eq. (3.2)). Hence,

R(hθ; Dtrain) = E_{D̃train} ℓ(h̃θ(∆x, x′), h̃⋆(∆x, x′)).

Further, as ℓ(y, y′) = ∥y − y′∥² is the square loss and decomposes across coordinates,

R(hθ; Dtrain) = Σ_{k=1}^K E_{D̃train}(h̃θ,k(∆x, x′) − h̃⋆,k(∆x, x′))². (B.6)

By the same token,

R(hθ; Dtest) = Σ_{k=1}^K E_{D̃test}(h̃θ,k(∆x, x′) − h̃⋆,k(∆x, x′))².

To conclude the proof, it remains to show that for all k ∈ [K],

E_{D̃test}(h̃θ,k(∆x, x′) − h̃⋆,k(∆x, x′))² ≤ Cprob · E_{D̃train}(h̃θ,k(∆x, x′) − h̃⋆,k(∆x, x′))², where Cprob = κ²(1 + 64 M⁴/σ⁴). (B.7)

Indeed, for each k ∈ [K], we have

E_{D̃train}(h̃θ,k(∆x, x′) − h̃⋆,k(∆x, x′))² ≤ R(hθ; Dtrain) (by Eq. (B.6)) ≤ σ²/(4κ) (by assumption).

Hence Eq. (B.7) holds by invoking Theorem 2 with the correspondences W ← ∆X, V ← X, σ⋆ ← σ, M⋆ ← M and κ ← κ. This concludes the proof.
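The reparameterization used in the proof can be sketched end-to-end in a toy 1-D setting. Here we assume, for illustration only, a linear ground truth and simple affine features (the paper's method uses learned neural embeddings); we train the bilinear model h̃θ(∆x, x′) on all ordered training pairs, then predict at an out-of-support test point by anchoring on a training point whose difference ∆x is in-support:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 1-D target, used only to generate labels for this sketch.
h_star = lambda x: 2.0 * x + 1.0

x_train = rng.uniform(0.0, 1.0, 100)          # training support X_train = [0, 1]
y_train = h_star(x_train)

# Transductive reparameterization: learn h~(dx, x') ~= h*(x' + dx) over all
# ordered training pairs, via a bilinear model <f(dx), Theta g(x')> on
# affine features (an assumption of this sketch, not the paper's networks).
f = lambda dx: np.stack([dx, np.ones_like(dx)], axis=-1)
g = lambda xp: np.stack([xp, np.ones_like(xp)], axis=-1)

i, j = np.meshgrid(np.arange(100), np.arange(100), indexing='ij')
dx, xp = x_train[i.ravel()] - x_train[j.ravel()], x_train[j.ravel()]
X = np.einsum('ni,nj->nij', f(dx), g(xp)).reshape(-1, 4)
Theta = np.linalg.lstsq(X, y_train[i.ravel()], rcond=None)[0].reshape(2, 2)

# Test on x = 1.5, outside [0, 1]: anchor on the training point whose
# difference dx = x - x' lies in the training difference support [-1, 1].
x_test = 1.5
anchor = x_train[np.argmax(x_train)]          # largest x', smallest dx
dxt = x_test - anchor
pred = f(np.array([dxt])) @ Theta @ g(np.array([anchor]))[0]
print(abs(pred[0] - h_star(x_test)) < 1e-6)   # accurate out-of-support
```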

B.4.1 FURTHER REMARKS ON THEOREM 1

• The singular value condition, min{σp(E_{D∆X,1}[f⋆,k f⋆,k⊤]), σp(E_{DX,1}[g⋆,k g⋆,k⊤])} ≥ σ² > 0, mirrors the non-degeneracy conditions given in past work on matrix completion (c.f. (Shah et al., 2020)).

• The boundedness condition sup_{∆x,x′} |h̃⋆,k(∆x, x′)| ≤ M is mild, and (in light of Theorem 2) can be weakened further to max_{1≤i,j≤2} E_{Di⊗j}[h̃⋆,k(∆x, x′)²] ≤ M², where Di⊗j are the constituent distributions witnessing D̃test ≪κ,comb D̃train.

• The final condition, R(hθ; Dtrain) ≤ σ²/(4κ), is mostly for convenience. Indeed, as M ≥ σ and κ ≥ 1, as soon as R(hθ; Dtrain) > σ²/(4κ), our upper bound on R(hθ; Dtest) is no better than κM² · (64/4) · (M²/σ²) ≥ 6M², which is essentially vacuous. Indeed, if we also inflate M and stipulate that sup_{∆x,x′} |h̃θ(∆x, x′)| ≤ √6 M, a bound of this order holds trivially.

This is a standard setting studied extensively in the matrix completion literature; see (Shah et al., 2020; Agarwal et al., 2021b; Athey et al., 2021) for example. Our results in Theorem 1 can be viewed as a generalization of these results to continuous-space embeddings. Finally, note that Assumptions 3.2 and 3.3 are also adopted correspondingly from the matrix completion literature: Assumption 3.2 concerns the ground-truth bilinear structure of the embeddings, which mirrors the low-rank structure of the underlying matrix in the matrix completion case; Assumption 3.3 corresponds to the non-degeneracy assumption on the top-left block distribution, referred to as "anchor" columns/rows in past work dealing with matrices (Shah et al., 2020).

C.1 LEARNING WHICH POINTS TO TRANSDUCE: WEIGHTED TRANSDUCTION

There may be scenarios where it is beneficial to choose more carefully which points to transduce, rather than comparing every training point to every other training point. To this end, we outline a more sophisticated proposal, weighted transduction, that achieves this. Sampling training pairs uniformly may violate the low-rank structure needed for extrapolation discussed in Section 3.2. Consider a dataset of point clouds of various 3D objects in various orientations, where the task is to predict their 3D grasping point. A predictor required to make predictions from similarities between point clouds of any two different objects and orientations may need to be far more complex than a simpler predictor trained only on similar objects. Hence, we introduce weighted transduction, described in Algorithm 2, which identifies which training points to transduce rather than transducing between all pairs of points during training. During training, a transductive predictor h̃θ is trained together with a scalar bilinear weighting function ωθ. h̃θ is trained to predict values as in the bilinear transduction algorithm, but weighted by ωθ. ωθ predicts weights for data points and their potential anchors based on anchor point similarity and the anchor point, thereby deciding which points to transduce. We can either pre-train ωθ, or jointly optimize ωθ and h̃θ. At test time, we select an anchor point based on the learned weighting function, and make predictions via h̃θ. We show results on learning with weighted transduction on analytic functions in Appendix C.2. The 2-D function considered there is constructed by sampling a tiling of random values in the central block (between (1, 1) and (5, 5)). The remaining tiles are constructed by shifting the central tile by 6 units along the x1 or x2 direction.
For tiles shifted in the x1-direction (by (6, 0)), the label has the same first dimension and a negated second dimension. For tiles shifted in the x2-direction (by (0, 6)), the label has the same second dimension and a negated first dimension. This construction has a particular symmetry: the problem has a very simple relationship between instances, but not every instance is related to every other instance (only those off by a constant offset). We show in Fig 10 that weighted transduction is particularly important in this domain, as not every point corresponds to every other point. We perform weighted transduction (seeded by a few demonstration pairs), and this allows for extrapolation to OOS pairs. While this may seem like an OOC problem, we show through the standard bilinear comparison (without reparameterization) that there is no low-rank structure between the dimensions, but that there is low-rank structure upon reparameterization. This shows the importance of using weighted transduction and the ability of our proposed method to find symmetries in the data.

Bilinear transduction is comparable to an SE(3) equivariant architecture on OOS orientations and exceeds its performance on OOS scales. We compare bilinear transduction to a method that builds SE(3) equivariance domain knowledge into its architecture and operates on point clouds. Specifically, we compare to tensor field networks (Thomas et al., 2018), and report the results in Table 2. To highlight that SE(3) equivariant architectures can extrapolate to new rotations but lack guarantees on scale, we train on 100 noisy samples of rotated mugs, bottles or teapots and on a separate set of mugs, bottles or teapots scaled by in-distribution values described in Appendix D.1. We test extrapolation to 50 mugs, bottles or teapots with out-of-sample orientations and scales sampled from the distribution described in Appendix D.1.
We adapt the Tetris object classification tensor field network to learn grasp point prediction by training the mean of the last equivariant layer to predict a grasping point and applying the mean squared error (MSE) loss. Following the implementation in Thomas et al. (2018), we provide six manually labeled key points of the point cloud that uniquely identify the grasp point: two points on the handle and mug intersection and four points on the mug bottom. We compare to the linear, neural network and transduction baselines, as well as bilinear transduction. We fix these architectures to one layer and 32 units. Under various rotations, the tensor field network is able to extrapolate. For scaling, however, the network error increases significantly, in accordance with the fact that scale transformation guarantees do not exist in SE(3) architectures. Bilinear transduction can extrapolate in both cases: it is comparable to tensor field networks on OOS orientations and exceeds their performance on OOS scales.

Weighted grasp point prediction. The grasp point prediction results for three objects described in Table 1 are trained with privileged knowledge of the object types at training time. I.e., the predictor at training time was provided only with pairs from the same object category, but with different positions, orientations and scales. At test time, an anchor was selected based on in-distribution similarities, without knowledge of the test sample's object type. While this provides extra information, it still does not build in information about the distribution shift we would like to extrapolate to, i.e., the types of transformations the objects undergo: translation, rotation and scale. We apply the weighted transduction algorithm (Algorithm 2) to this domain and show that, with more relaxed assumptions on the required object labels at training time, we can learn weights and a predictor that perform as well as with privileged training (Table 3).
For grasp point prediction, during training, we train a weighting function ωθ(xj − xi, xi) to predict whether xi should be transduced to xj based on object-level labels of the data (transduce all mugs to other mugs and all bottles to other bottles). Then we learn a predictor hθ weighted by the fixed ωθ, as described in Algorithm 2. At test time, given point xj, we select the training point xi with the highest value ωθ(xj − xi, xi) and predict yj via hθ(xj − xi, xi), without requiring privileged information about the object category at test time.

Robustness across seeds and architectures: We show that our results are stable across the choice of architecture and seed. In Fig 12 we show the performance of our method and baselines over Meta-World tasks and architectures. In Table 5 we show the average performance of our method and baselines over the complex domains for fixed architecture parameters over three seeds, complementing Table 1, which shows mean OOS errors on Meta-World. While some architectures can be suitable for some domains, our method is more robust to hyperparameter search and almost always achieves low error.

Analytic functions: We used the following analytic functions for evaluation in Section 4.1.

Figure 4a. Periodic mixture of functions with different periods: We consider a function similar to the periodic growing mixture, but with two component functions of different periods repeating. Let xv = x mod 9 and xm = ⌊x/9⌋. Then

f(x) =
  xm · sin(10·xv)                         if xv < 1
  xm · (sin(10) + (xv − 1))               if 1 < xv < 2
  xm · (sin(10) + 1 + (xv − 2)²)          if 2 < xv < 3
  xm · sin(10·(xv − 3)/2)                 if 3 < xv < 5
  xm · (sin(10) + ((xv − 3)/2 − 1))       if 5 < xv < 7
  xm · (sin(10) + 1 + ((xv − 3)/2 − 2)²)  if 7 < xv < 9    (D.1)

This function mixes two different periods, showing that our framework can deal with such varying functions as well.
Figure 4b. Sawtooth:

f(x) =
  (x mod Period) · Amplitude/Period               if ⌊x/Period⌋ mod 2 == 0
  Amplitude − (x mod Period) · Amplitude/Period   if ⌊x/Period⌋ mod 2 == 1    (D.2)

Figure 4c. Randomly chosen polynomial: We sampled a random degree-8 polynomial:

f(x) = (x − 0.1)(x + 0.4)(x + 0.7)(x − 0.5)(x − 1.5)(x + 1.75)(x + 1)(x − 1.2)

Figures 5 & 7. Periodic growing mixture of functions: Let xv = x mod 3 and xm = ⌊x/3⌋. Then

f(x) =
  xm · sin(10·xv)                 if xv < 1
  xm · (sin(10) + (xv − 1))       if 1 < xv < 2
  xm · (sin(10) + 1 + (xv − 2)²)  if 2 < xv < 3    (D.3)

Meta-World: We evaluate on the reach-v2 and push-v2 tasks. For reach-v2, we reduce the observation space to a 6-tuple of the end effector 3D position and target 3D position. The action space is a 4-tuple of the end effector 3D position and the gripper's degree of openness. During training, the 3D target position for reach-v2 is sampled from the range [0, 0.4] × [0.7, 0.9] × [0.05, 0.3] and at test time from the range [-0.4, 0] × [0.7, 0.9] × [0.05, 0.3]. The gripper initial position is fixed to [0, 0.6, 0.2]. For push-v2, the observation space is a 10-tuple of the end effector 3D position, gripper openness, object 3D position and target 3D position. The action space remains the same as in reach-v2. The training 3D target position is sampled from [0, 0.3] × [0.5, 0.7] × {0.01} and at test time from [-0.3, 0] × [0.5, 0.7] × {0.01}. The gripper initial position is fixed to [0, 0.4, 0.08] and the object initial position to [0, 0.4, 0.02]. For both environments, expert and evaluation trajectories are collected for 100 steps.

Adroit hand: We evaluate on the object relocation task in the ADROIT hand manipulation benchmark. The observation space is a 39-tuple of 30D hand joint positions, 3D object position, 3D palm position and 3D target position. The action space is a 30-tuple of 3D position and 3D orientation of the wrist, and 24D finger positions commanding a PID controller. During training, the target location is sampled from [0, 0.
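Returning to the analytic functions above, the sawtooth of Eq. (D.2) can be written compactly; the parameter defaults here are illustrative, not the values used in our experiments:

```python
import numpy as np

# The function of Eq. (D.2): rises linearly over one period, falls the next.
def sawtooth(x, period=1.0, amplitude=1.0):
    r = np.mod(x, period) * amplitude / period
    rising = np.mod(np.floor(x / period), 2) == 0
    return np.where(rising, r, amplitude - r)

xs = np.linspace(0, 4, 9)
print(sawtooth(xs))  # peaks at odd integers, zeros at even integers
```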
Slider: We introduce an environment for controlling a brush to slide an object to a target. The observation space is an 18-tuple composed of a 9-tuple representing the brush 2D position, 3D object position and 4D quaternion; an 8-tuple representing the 2D brush velocity, 3D object velocity and 3D rotational object velocity; and the object mass. The action space is the torque applied to the brush. During training, the object mass is sampled from [60, 130] and at test time from [5, 15]. Expert and evaluation trajectories are collected for 200 steps. In Figure 14 we plot the actions as a function of time step for demonstrations, for the baselines described in Section 4, and for bilinear transduction, for a fixed set of architecture parameters.

D.2.1 DETAILS OF BILINEAR TRANSDUCTION IMPLEMENTATION

We describe the bilinear transduction algorithm implementation in detail. For grasp point prediction, during training we uniformly select an object o ∈ {bottle, mug, teapot}, and two instances of those objects, oi, oj, that were rotated, translated and scaled with different parameters. We learn to transduce oi to oj, i.e., learn hθ(oj − oi, oi) to predict yj, the grasping point of oj. At test time, given test point xj with an unknown object category, we select a training point xi with an "in-distribution" difference xj − xi, as described in Algorithm 1. We select ρ to be the closest distance between xj − xi and an in-distribution difference, over all training points xi. The differences generated by the optimization process are sampled uniformly to generate a smaller subset for comparison. For the goal-conditioned imitation learning setting, during training we uniformly sample trajectories τi = {(s^i_t, a^i_t)}_{t=1}^T, for the horizon T specified in Appendix D.1, and τj = {(s^j_t, a^j_t)}_{t=1}^T. We further uniformly sample a time step t from [0, T]. We learn to transduce state s^i_t ∈ τi to s^j_t ∈ τj, i.e., learn hθ(s^j_t − s^i_t, s^i_t) to predict action a^j_t. At test time, we select an anchor trajectory based on the goal (Meta-World, Adroit) or mass (Slider) gj. This is done by selecting the training trajectory τi with goal or mass gi that generates an "in-distribution" difference gj − gi with the test point, as described in Algorithm 1. We select ρ to be the 10th percentile of the in-distribution differences. The differences generated by the optimization process are approximated by generating differences from the training data. We then transduce each state in τi to predict actions for the test goal or mass. Given test state s^j_t, we predict a^j_t with hθ(s^j_t − s^i_t, s^i_t), execute action a^j_t in the environment and observe state s^j_{t+1}. We repeat this process for T steps to obtain τj.
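The test-time anchor-selection step described above can be sketched as follows. The data dimensions and subsampling sizes are placeholders, and the rule shown (pick the training point whose difference to the test point is nearest to some training difference) is a simplified stand-in for Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(4)
x_train = rng.uniform(0.0, 1.0, size=(50, 3))   # hypothetical 3-D inputs

# Training differences x_i - x_j define the "in-distribution" difference set;
# subsample pairs, as the full n^2 set is large (mirroring the text above).
idx = rng.integers(0, 50, size=(500, 2))
train_diffs = x_train[idx[:, 0]] - x_train[idx[:, 1]]

def select_anchor(x_test):
    """Pick the training point whose difference to x_test is nearest to
    some training difference (a sketch of the Algorithm 1 anchor rule)."""
    cand = x_test - x_train                              # (50, 3) candidates
    d = np.linalg.norm(cand[:, None, :] - train_diffs[None, :, :], axis=-1)
    dist_to_support = d.min(axis=1)                      # per-anchor distance
    return int(np.argmin(dist_to_support))

i = select_anchor(np.array([1.2, 0.5, 0.5]))
print(0 <= i < 50)  # a valid training index is returned
```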

D.2.2 DETAILS OF TRAINING DISTRIBUTIONS

We train on 1000 samples for all versions of the grasp point prediction domain: single objects, three objects, and weighted transduction. In the sequential decision domains we train on N demonstrations, i.e., sequences of state-action pairs {(xi, yi)}_{i=1}^T, where the horizon T for each domain is specified in Appendix D.1. For Meta-World reach and push, N = 1000. For Slider and Adroit, N = 100. All domains were evaluated on 50 in-distribution out-of-sample points and 50 OOS points. For each domain, all methods were trained on the same expert dataset and evaluated on the same in-distribution out-of-sample and OOS sets.



Footnotes:
• Throughout, we let P(Y) denote the set of distributions supported on Y.
• In our implementation, we take θ = (θf, θg), with separate parameters for each embedding.
• For simplicity, we omit concrete discussion of measurability concerns throughout.
• Here, µ[ϕ] := ∫ ϕ(ω) dµ(ω) denotes integration with respect to µ.
• Unlike Shah et al. (2020), we are interested in the Frobenius norm error, so we elect for the slightly sharper bound of Meng & Zheng (2010) above rather than the classical operator norm bound of Stewart (1977).
• Via standard arguments, one can construct the limiting embeddings f⋆,n, f̂n, ĝn, g⋆,n in such a way that their norms are dominated by integrable functions.



Figure 1: In the real-world the test distribution (orange) often has a different support than the training distribution (blue). Some illustrative tasks: (a) grasp point prediction for object instances with out-of-support scale, (b) action prediction for reaching out-of-support goals, (c) function value prediction for an out-of-support input range. The black crosses show predictions for a conventionally trained deep neural network that makes accurate predictions for in-support inputs, but fails on out-of-support inputs. We propose an algorithm that makes accurate out-of-support predictions under a set of assumptions.

and D test,X1 , D test,X2 under D test . In OOC learning, D test,X1 , D test,X2 are in the support of D train,X1 , D train,X2 , but the joint distributions D test need not be in the support of D train .

Figure 2: Illustration of different learning settings. (a) in-support out-of-distribution (OOD) learning; (b) general out-of-support (OOS) learning in 1-D (on the left) and 2-D (on the right); (c) out-of-combination (OOC) learning in 2-D.

Figure 3: Illustration of converting OOS to OOC. (Left) Consider training points x1, x2, x3 ∈ Xtrain and OOS test point xtest. During training, we predict h θ (x2) by transducing x3 to hθ (∆23, x3), where ∆23 = x2 -x3. Similarly, at test time, we predict h θ (xtest) by transducing training point x1, via hθ (∆xtest, x1), where ∆xtest = xtest -x1. In this example note that ∆23 = ∆xtest. (Right) This conversion yields an OOC generalization problem in space ∆X × Xtrain: marginal distributions ∆X and Xtrain are covered by the training distribution, but their combination is not.

Figure 4: Bilinear transduction behavior on 1-D regression problems. Bilinear transduction performs well on a periodic mixture function (Fig 4a), a sawtooth function (Fig 4b) and a polynomial function (Fig 4c). Standard deep networks (yellow) fit the training points well (blue), but fail to extrapolate to OOS inputs (orange).

For the functions in Fig 4 (which display shift invariance), one might argue that a similar extrapolation can be achieved by building in an inductive bias for periodicity / translation invariance. In Fig 6, we show that bilinear transduction is in fact able to extrapolate even in cases where the ground truth function is not simply translation invariant, but is translation equivariant, showing that bilinear transduction can capture equivariance. Moreover, bilinear transduction can in fact go beyond equivariance or invariance. To demonstrate the broader generality of our method, we consider a piecewise periodic function that also grows in magnitude (Fig 7). This function is neither invariant nor equivariant to translation, as the group symmetries in this problem do not commute. The results in Fig 7 demonstrate that bilinear transduction successfully extrapolates while the baselines fail to do so (including baselines that bake in equivariance (green)). What matters for bilinear transduction to work is that, when comparing training instances, there is a simple (low-rank) relationship between how their labels are transformed. While it can capture invariance and equivariance, as Fig 7 shows, it is more general. How does the relationship between the training distribution and test distribution affect extrapolation behavior? We investigate how the range of test points that can be extrapolated to depends on the training range. We show in Fig 5 that for a particular "width" of the training distribution (size of the training set), OOS extrapolation only extends for one "width" beyond the training range, since the conditions for ∆X being in-support are no longer valid beyond this point.

Figure 5: Performance of transductive predictors as test points go more and more OOS. Predictions are accurate for one "data width" outside the training data.

Figure 6: Prediction on a function that displays affine equivariance. Bilinear transduction is able to capture equivariance without this being explicitly encoded.

Figure 7: Prediction on a function that is neither invariant nor equivariant. Bilinear transduction is able to extrapolate while an equivariant predictor fails.

Figure 8: Evaluation domains at training (blue) and OOS (orange). (Left to Right:) grasp prediction for various object orientations and scales, table-top robotic manipulation for reaching and pushing objects to various targets, dexterous manipulation for relocating objects to various targets, slider control for striking a ball of various mass.

Lastly, a line of research has studied various bilinear models (e.g. Hong et al. (2021); Shah et al. (2020)), motivated by the extension of the literature on matrix factorization discussed in our abridged related work. However, these methods require additional fine-tuning when applied to novel goals and do not display zero-shot extrapolation. Moreover, Shah et al. (2020) requires a discretization of the state-action space in order to formulate the matrix completion problem, whereas bilinear transduction shows how the results hold in continuous bilinear form, without the need for discretization. Along similar lines, the Deep Sets architecture of Zaheer et al. (2017) aims for combinatorial extrapolation by embedding tokens of interest in a latent vector space on which addition operators can be defined. Zhou et al. (2022) compares the Deep Sets and Transformer approaches as policy architectures in reinforcement learning, finding that neither architecture uniformly outperforms the other.

Figure 9: Illustration of bilinear representations for OOC learning, and connection to matrix completion. (a) An example of low-rank matrix completion, where both M and M11 have rank p. Blue: support where entries can be accessed; green: entries are missing. (b) An example where low-rank structure facilitates certain forms of OOC, i.e. for each k ∈ [K], the predictor can be represented by bilinear embeddings as hθ,k(∆x, x′) = ⟨fθ,k(∆x), gθ,k(x′)⟩.

Building upon Definition B.1, Assumption 3.1 introduces a notion of bounded density ratio between Dtrain and Dtest in the OOC setting. Taking the discrete case of matrix completion as an example, as illustrated in Fig 9, the training distribution of (∆x, x′) covers the support of the (1, 1), (1, 2) and (2, 1) blocks of the matrix, while the test distribution of (∆x, x′) might be covered by any product of the marginals of the 2 × 2 blocks. With this connection in mind, it is possible to establish OOC guarantees on Dtest as in matrix completion tasks, provided the bilinear embedding admits some low-rank structure. In other words, samples from the top-left and off-diagonal blocks uniquely determine the bottom-right block.
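The block-completion identity behind this picture can be sketched with a toy rank-p matrix. The shapes, seed, and block split below are illustrative assumptions; the identity M22 = M21 · pinv(M11) · M12 holds whenever M and M11 share the same rank p.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3                                # rank
A = rng.standard_normal((10, p))     # row factors
B = rng.standard_normal((8, p))      # column factors
M = A @ B.T                          # rank-p matrix with blocks M11, M12, M21, M22

r, c = 6, 5                          # illustrative block split
M11, M12 = M[:r, :c], M[:r, c:]
M21, M22 = M[r:, :c], M[r:, c:]

# Observed blocks (top-left and off-diagonals) determine the missing
# bottom-right block when M11 also has rank p:
M22_hat = M21 @ np.linalg.pinv(M11) @ M12
assert np.allclose(M22_hat, M22)
```

With random Gaussian factors, M11 has rank p almost surely, so the reconstruction is exact.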

2 and on learning how to grasp from point clouds in Appendix C.3.

Algorithm 2 Weighted Transduction
1: Input: distance parameter ρ, training set (x1, y1), . . . , (xn, yn), regularizer Reg(·)
2: Train: train θ on loss L(θ) = Σ_{i=1..n} Σ_{j≠i} ωθ(xi − xj, xj) · ℓ(hθ(xi − xj, xj), yi) + Reg(θ)
3: Test: for each new xtest, predict ŷ = hθ(xtest − xî, xî), where î is sampled with P[î = i] ∝ ωθ(xtest − xi, xi)

C.2 ADDITIONAL RESULTS ON A SIMPLE 2-D EXAMPLE

Weighted transduction with a 2-D example We also considered a 2-D analytic function, as shown in Fig 10. This function has a 2-D input (x1, x2) and a 2-D output (y1, y2).

Figure 10: Predictions for the 2-D analytic function from (x1, x2) to (y1, y2). (Top) y1 output values for inputs (x1, x2); (Bottom) y2 output values. (Left to Right:) In-distribution ground truth values, OOS ground truth values, neural network predictions, bilinear predictions, bilinear transduction predictions and weighted transduction predictions. Transduction weighting is important in this domain, and it is able to discover the problem's symmetry. While this may seem like an OOC problem, since bilinear prediction does not directly work, the OOC view on this problem does not have low-rank structure, while the OOS view does.
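The test-time rule of Algorithm 2 (weighted transduction) can be sketched as follows. The functions h_theta and omega_theta below are toy stand-ins for the learned predictor and weighting network, chosen only so the sketch runs end to end; they are not the trained models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the learned predictor h_theta and weighting function
# omega_theta from Algorithm 2 (fixed functions, not trained networks).
def h_theta(dx, x_anchor):
    return x_anchor + dx                  # toy predictor, exact by construction

def omega_theta(dx, x_anchor):
    return np.exp(-np.abs(dx))            # favor nearby anchors

def weighted_transduce(x_test, train_x, rng):
    """Test-time rule: sample anchor i with P[i] proportional to omega."""
    w = np.array([omega_theta(x_test - xi, xi) for xi in train_x])
    probs = w / w.sum()
    i = rng.choice(len(train_x), p=probs)
    return h_theta(x_test - train_x[i], train_x[i])

train_x = np.array([0.0, 0.5, 1.0])
pred = weighted_transduce(1.4, train_x, rng)
assert np.isclose(pred, 1.4)  # the toy predictor is exact for any anchor
```

In the paper's setting, omega_theta is itself trained on labeled positive/negative anchor pairs (Appendix D.2).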

Figure 11: OOS grasping point prediction for bottles, mugs and teapots. Each row represents one test datapoint. Columns (Left to Right): linear, neural network, transductive and bilinear model predictions (red) and ground truth (green) grasp points. We show that baselines may predict infeasible grasping points (e.g. in the middle of an object), whereas bilinear transduction makes nearly accurate predictions.

Figure 12: Heatmap complementing Table 1, showing mean OOS errors on Meta-World. While some architec-

Figure 4b. Sawtooth function: we use a classic sawtooth function f(x) = (x mod Period) · Amplitude

Figure 6. Shift equivariant mixture functions: Let us define xv = x mod 3 and xm = ⌊x/3⌋

3] × [0, 0.3] × [0.15, 0.35] and at test time from [-0.3, 0] × [-0.3, 0] × [0.15, 0.35]. Expert and evaluation trajectories are collected for 200 steps.

Figure 13: Reduced point cloud feature space in R^12. From left to right: mean and three PCA components. Top: in distribution, bottom: OOS. Blue: bottles, red: mugs, green: teapots.

Figure 14: Slider torque actions as a function of time steps. Each color represents a different trajectory and mass. From left to right, top to bottom: in-support demonstrations and Linear, Neural Net, DeepSets, Transduction and Bilinear Transduction policies for OOS masses.

D 1⊗1 corresponds to the top-left block of a matrix, where all rows and columns are in-support. D 1⊗2 and D 2⊗1 correspond to the off-diagonal blocks, and D 2⊗2 corresponds to the bottom-right block. The condition requires that Dtrain covers the entire top-left block and the off-diagonals, but need not cover the bottom-right block D 2⊗2. Dtest, on the other hand, can contain samples from this last block. For problems with appropriate low-rank structure, samples from the top-left block and samples from the off-diagonal blocks uniquely determine the bottom-right block.

For each new xtest, let I(xtest) := {i : inf_{∆x ∈ ∆Xtrain} ∥xtest − xi − ∆x∥ ≤ ρ}, and predict ŷ = hθ(xtest − xî, xî), where î ∼ Uniform(I(xtest))
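This uniform transduction rule can be sketched in a few lines. The 1-D training set and tolerance ρ below are illustrative assumptions; the eligible-anchor set I(xtest) contains every anchor whose difference to the test point is within ρ of some training-time difference.

```python
import numpy as np

rng = np.random.default_rng(0)

def eligible_anchors(x_test, train_x, rho):
    """I(x_test) = {i : min over dx in DeltaX_train of |x_test - x_i - dx| <= rho}."""
    delta_train = (train_x[:, None] - train_x[None, :]).ravel()  # training differences
    idx = []
    for i, xi in enumerate(train_x):
        if np.min(np.abs(x_test - xi - delta_train)) <= rho:
            idx.append(i)
    return idx

train_x = np.array([0.0, 0.4, 0.8, 1.0])
I = eligible_anchors(1.4, train_x, rho=0.05)
assert I == [1, 2, 3]        # x_i = 0.0 leaves no in-support difference

# Predict with a uniformly sampled eligible anchor, as in the rule above.
i_hat = rng.choice(I)
assert i_hat in I
```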

Mean and standard deviation over prediction (regression) or final state (sequential decision making) error for OOS samples and over a hyperparameter search.

3.1 Transductive Predictors: Converting OOS to OOC
3.2 Bilinear representations for OOC learning
3.3 Our Proposed Algorithm: Bilinear Transduction
Proof of Theorem 2
B.4 Extrapolation for Transduction
B.4.1 Further Remarks on Theorem 1
B.5 Connections to matrix completion
C.1 Learning which points to transduce: Weighted Transduction
C.2 Additional Results on a Simple 2-D Example

Deng et al. (2021a) extended the original equivariance framework to accommodate situations where the group G corresponds to rotations (SO(3)), and Simeonov et al. (2022) proposed neural descriptor fields, which handle rigid-body transformations (SE(3)). For a broader review of how other notions of symmetry can be encoded into machine learning settings, consult Smidt (2021), Abu-Dakka et al. (2021), and Parascandolo et al. (

Sinha et al. (2018); Rahimian & Mehrotra (2019). These approaches are well suited to OOD settings where the test distributions have the same support but differing (though boundedly different) probability densities. So while the particular data distribution may be different, the meta-level assumption is still an in-distribution one (Fallah et al., 2020). Yin et al. (2019) show that under distribution shift, meta-learning methods can fail dramatically, and this problem is exacerbated when supports shift as well. Multi-task learning methods often assume that training on some set of training tasks provides good pre-training before finetuning on new tasks (Julian et al., 2020; Su et al., 2022; Meftah et al., 2020). However, it is poorly understood how these tasks actually relate to each other, and the study of this question has been largely empirical.

Comparing bilinear transduction on grasp point prediction with tensor field networks (Thomas et al., 2018), an SE(3)-equivariant network for point clouds. We report the mean and standard deviation for 50 OOS samples.

Grasping point prediction on three objects with various rotations, translations and scales, with ground-truth weighted training, standard training and learned weighted training for bilinear transduction with a fixed architecture. By training a weighting function with fewer labels, learned weighted bilinear transduction achieves performance similar to ground-truth weighted training.

Periodic activation functions We provide further results on the grasp point prediction and imitation learning domains in Table 4. This experiment is similar to the one reported in Table 1, but with Fourier pre-processing as the initial layer in each model. Across the various domains, adding this pre-processing step is beneficial for some architecture-domain combinations, but for no method does it surpass bilinear transduction on all domains. Moreover, in most domains bilinear transduction achieves the best results, or results close to those of the best model. For implementation details see Section D.2.
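The Fourier pre-processing layer, as described in Appendix D.2.3 (a learned linear map widening the input 40×, scaled by π and passed through sin), can be sketched as follows; the input dimension and random initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fourier feature pre-processing in the spirit of Tancik et al. (2020):
# a learned linear layer outputs 40x the number of inputs, values are
# multiplied by pi and passed through sin.
def fourier_features(x, W):
    return np.sin(np.pi * (x @ W))

d = 3                                    # illustrative input dimension
W = rng.standard_normal((d, 40 * d))     # optimized as the model's first layer
z = fourier_features(rng.standard_normal(d), W)
assert z.shape == (40 * d,)
assert np.all(np.abs(z) <= 1.0)          # sin keeps features bounded
```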

Mean and standard deviation over prediction (regression) or final state (sequential decision making) error for OOS samples and over a hyperparameter search, with Fourier pre-processing. While Fourier features are useful for some combinations of models and domains, they are not beneficial across all.

We report mean and standard deviation error over evaluation samples and three seeds. For each domain, the first row is evaluated on in-distribution samples and the second row on OOS samples. We show that (1) bilinear transduction performs well both in-distribution and OOS, and (2) our algorithm is stable across several seeds.

9. ACKNOWLEDGMENTS

We thank Anurag Ajay, Tao Chen, Zhang-Wei Hong, Jacob Huh, Leslie Kaelbling, Hannah Lawrence, Richard Li, Gabe Margolis, Devavrat Shah and Anthony Simeonov for the helpful discussions and feedback on the paper. We are grateful to MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing HPC resources. This research was also partly sponsored by the DARPA Machine Common Sense Program, MIT-IBM Watson AI Lab, the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation herein.

APPENDIX

We collect expert demonstrations as follows. For grasp point prediction, we use the Neural Descriptor Fields (Simeonov et al., 2022) simulator to generate a feasible grasping point by a Franka Panda arm for three ShapeNet objects (bottle 2d1aa4e124c0dd5bf937e5c6aa3f9597, mug 61c10dccfa8e508e2d66cbf6a91063 and Utah teapot wss.ce04f39420c7c3e82fb82d326efadfe3). In Meta-World, we use the expert policies provided with the environment. For Adroit and Slider we train an expert policy using standard reinforcement learning (RL) methods: TRPO (Schulman et al., 2015) for Adroit and SAC (Haarnoja et al., 2018) for Slider.

The weighting grasp point prediction network was provided with 300 labels of positive pairs to be transduced and 300 negative pairs, labeled with binary labels. This is a slightly more relaxed condition than labeling 1000 objects as in the standard version.

D.2.3 DETAILS OF MODEL ARCHITECTURES

For the analytic domains, we use MLPs (both for the NN and the bilinear embeddings) with 3 layers of 1000 hidden units each and ReLU activations. We train all models with periodic activations of Fourier features as described in Tancik et al. (2020).

Linear model: data points x are processed through a fully connected layer to output predictions y.

Neural network: data points x are processed through an MLP with k layers of n units each and ReLU activations on hidden layers to output predictions y.

DeepSets: we refer to the elements of the state space excluding target position (Meta-World, Adroit) or mass (Slider) as observations. Observations and target position or mass are processed separately by two MLPs with k layers of n units each and ReLU activations on hidden layers. Their embeddings are summed and processed by an MLP with one hidden layer of n units and ReLU activations to produce predictions y.

Transduction: training point xj and xi − xj (for training or test point xi) are concatenated and processed through an MLP with k layers of n units each and ReLU activations on hidden layers to predict yi.

Our architecture, bilinear transduction: training point xj and ∆xij = xi − xj (for training or test point xi) are embedded separately into a feature space in R^(d·m), where d is the dimension of yi, by two MLPs with k layers of n units each and ReLU activations on hidden layers. Each predicted element of yi is the dot product of m-sized segments of these embeddings, as described in Eq. (3.5).

Further, we evaluate all models with a Fourier embedding as a pre-processing step. We process inputs x, target or mass, or ∆x through a linear model whose output is 40× the number of inputs. We then multiply the values by π and pass them through the sin function. This layer is optimized as the first layer of the model.

In Table 1 we search over the number of layers k ∈ {2, 3} and unit size n ∈ {32, 512, 1024}. The bilinear transduction embedding size is m = 32.
In Tables 3 and 5 we set k = 2, n = 32 and m = 32. The weighting function used for grasp point prediction is the bilinear transduction architecture with k = 2, n = 128 and m = 32.
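A minimal numpy sketch of the bilinear transduction head described above: two MLPs embed the anchor xj and the difference ∆x into R^(d·m), and each output dimension of y is a dot product of matching m-sized embedding segments (per Eq. (3.5)). The sizes and random initialization below are illustrative, not the paper's search grid, and the untrained forward pass only demonstrates shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, m, n = 4, 2, 32, 64   # illustrative sizes (not the paper's grid)

def mlp_params(sizes, rng):
    return [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp(x, params):
    for k, (W, b) in enumerate(params):
        x = x @ W + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x

f_params = mlp_params([d_in, n, d_out * m], rng)  # embeds dx = x_i - x_j
g_params = mlp_params([d_in, n, d_out * m], rng)  # embeds the anchor x_j

def bilinear_predict(dx, x_anchor):
    f = mlp(dx, f_params).reshape(d_out, m)       # m-sized segment per output dim
    g = mlp(x_anchor, g_params).reshape(d_out, m)
    return np.sum(f * g, axis=1)                  # per-dimension dot products

y = bilinear_predict(rng.standard_normal(d_in), rng.standard_normal(d_in))
assert y.shape == (d_out,)
```

The reshape-then-rowwise-dot-product is what makes the predictor bilinear in the two embeddings.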

D.2.4 DETAILS OF MODEL OPTIMIZATION

We train on the analytic functions for 500 epochs with batch size 32 and the Adam optimizer with learning rate 1e-4, optimizing the mean squared error (MSE) loss for regression. We trained all more complex models for 5k epochs with batch size 32, the Adam optimizer (Kingma & Ba, 2014) and learning rate 1e-4, optimizing the L2-norm loss between ground truth and predicted grasping points, or actions for the sequential decision making domains. Each configuration of hyperparameters was run and tested on one seed; we demonstrate the stability of our method across three seeds for a fixed set of hyperparameters in Table 5.

We train the weighted grasp point prediction for 5k epochs with batch size 16 and the Adam optimizer with learning rate 1e-4. We optimize the MSE loss between the output and a ground-truth binary label indicating whether a training point should be transduced to another training point. The weighting function did not require further finetuning jointly with the bilinear predictor.

