THE TRAVELING OBSERVER MODEL: MULTI-TASK LEARNING THROUGH SPATIAL VARIABLE EMBEDDINGS

Abstract

This paper frames a general prediction system as an observer traveling around a continuous space, measuring values at some locations, and predicting them at others. The observer is completely agnostic about any particular task being solved; it cares only about measurement locations and their values. This perspective leads to a machine learning framework in which seemingly unrelated tasks can be solved by a single model, by embedding their input and output variables into a shared space. An implementation of the framework is developed in which these variable embeddings are learned jointly with internal model parameters. In experiments, the approach is shown to (1) recover intuitive locations of variables in space and time, (2) exploit regularities across related datasets with completely disjoint input and output spaces, and (3) exploit regularities across seemingly unrelated tasks, outperforming task-specific single-task models and multi-task learning alternatives. The results suggest that even seemingly unrelated tasks may originate from similar underlying processes, a fact that the traveling observer model can use to make better predictions.

1. INTRODUCTION

Natural organisms benefit from the fact that their sensory inputs and action outputs are all organized in the same space, that is, the physical universe. This consistency makes it easy to apply the same predictive functions across diverse settings. Deep multi-task learning (Deep MTL) has shown a similar ability to adapt knowledge across tasks whose observed variables are embedded in a shared space. Examples include vision, where the input for all tasks (photograph, drawing, or otherwise) is pixels arranged in a 2D plane (Zhang et al., 2014; Misra et al., 2016; Rebuffi et al., 2017); natural language (Collobert & Weston, 2008; Luong et al., 2016; Hashimoto et al., 2017), speech processing (Seltzer & Droppo, 2013; Huang et al., 2015), and genomics (Alipanahi et al., 2015), which exploit the 1D structure of text, waveforms, and nucleotide sequences; and video game-playing (Jaderberg et al., 2017; Teh et al., 2017), where interactions are organized across space and time. Yet, many real-world prediction tasks have no such spatial organization; their input and output variables are simply labeled values, e.g., the height of a tree, the cost of a haircut, or the score on a standardized test. To make matters worse, these sets of variables are often disjoint across a set of tasks. These challenges have led the MTL community to avoid such tasks, despite the fact that general knowledge about how to make good predictions can arise from solving seemingly "unrelated" tasks (Mahmud & Ray, 2008; Mahmud, 2009; Meyerson & Miikkulainen, 2019). This paper proposes a solution: learn all variable locations in a shared space, while simultaneously training the prediction model itself (Figure 1). To illustrate this idea, Figure 1a gives an example of four tasks whose variable values are measured at different locations in the same underlying 2D embedding space.
The shape of each marker denotes the task to which that variable belongs; white markers denote input variables, black markers denote output variables, and the background coloring indicates the variable values in the entire embedding space when the current sample is drawn. As a concrete example, the color could indicate the air temperature at each point in a geographical region at a given moment in time, and each marker the location of a temperature sensor (however, note that the embedding space is generally more abstract). Figure 1b-c shows a model that can be applied to any task in this universe, using the • task as an example: (b) the function f encodes the value of each observed variable x_i given its 2D location z_i ∈ R^2, and these encodings are aggregated by elementwise addition; (c) the function g decodes the aggregated encoding to a prediction for y_j at its location z_j. Such a predictor can be viewed as a traveling observer model (TOM): It traverses the space of variables, taking a measurement at the location of each input. Given these observations, the model can make a prediction for the value at the location of an output. In general, the embedded locations z are not known a priori (i.e., when input and output variables do not have obvious physical locations), but they can be learned alongside f and g by gradient descent. The input and output spaces of a prediction problem can be standardized so that the measured value of each input and output variable is a scalar.
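The prediction step just described can be sketched in a few lines. The following is a minimal NumPy illustration of the encode-aggregate-decode cycle, not the paper's implementation (which conditions f and g on the locations via FiLM); all parameter names and toy dimensions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_EMB, D_HID = 2, 16  # 2D embedding space as in Figure 1; latent width is arbitrary

# Hypothetical toy parameters: f and g are single-layer maps standing in
# for the FiLM-conditioned networks used in the actual implementation.
W_f = rng.normal(size=(1 + D_EMB, D_HID)) * 0.1   # f: (value, location) -> encoding
W_g = rng.normal(size=(D_HID + D_EMB, 1)) * 0.1   # g: (aggregate, location) -> prediction

def f(x_i, z_i):
    """Encode one observed scalar x_i at its (learned) location z_i."""
    return np.tanh(np.concatenate(([x_i], z_i)) @ W_f)

def g(agg, z_j):
    """Decode the aggregated encoding to a prediction at output location z_j."""
    return (np.concatenate((agg, z_j)) @ W_g).item()

def tom_predict(x, z_in, z_out):
    """TOM forward pass: encode each (value, location) pair, sum, decode per output."""
    agg = np.sum([f(x_i, z_i) for x_i, z_i in zip(x, z_in)], axis=0)  # elementwise addition
    return [g(agg, z_j) for z_j in z_out]

# A task with 3 input variables and 2 output variables; the z's would
# normally be learned by gradient descent alongside W_f and W_g.
x = [0.5, -1.2, 0.3]
z_in = rng.normal(size=(3, D_EMB))
z_out = rng.normal(size=(2, D_EMB))
y_hat = tom_predict(x, z_in, z_out)
```

Note that because the encodings are aggregated by summation, the prediction is invariant to the order in which the observer visits the input variables, which is what makes the model agnostic to any particular task's variable indexing.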
The prediction model can then be completely agnostic about the particular task for which it is making a prediction. By learning variable embeddings (VEs), i.e., the z's, the model can capture variable relationships explicitly and support joint training of a single architecture across seemingly unrelated tasks with disjoint input and output spaces. TOM thus establishes a new lower bound on the commonalities shared across real-world machine learning problems: they are all drawn from the same space of variables that humans can and do measure. This paper develops a first implementation of TOM, using an encoder-decoder architecture with variable embeddings incorporated via FiLM (Perez et al., 2018). In the experiments, the implementation is shown to (1) recover the intuitive locations of variables in space and time, (2) exploit regularities across related datasets with disjoint input and output spaces, and (3) exploit regularities across seemingly unrelated tasks to outperform single-task models tuned to each task, as well as current Deep MTL alternatives. The results confirm that TOM is a promising framework for representing and exploiting the underlying processes of seemingly unrelated tasks.
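The FiLM conditioning mentioned above can be illustrated as follows. This is a hedged sketch of the general mechanism from Perez et al. (2018), in which a conditioning vector (here, a variable embedding z) produces a per-feature scale and shift for a hidden layer; the names and toy sizes are hypothetical and do not reflect the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(2)
D_EMB, D_HID = 2, 8

# Hypothetical parameters: the variable embedding z conditions a hidden
# layer via FiLM: h -> gamma(z) * h + beta(z) (elementwise).
W_gamma = rng.normal(size=(D_EMB, D_HID)) * 0.1
W_beta = rng.normal(size=(D_EMB, D_HID)) * 0.1

def film(h, z):
    """Modulate hidden activations h by the variable embedding z."""
    gamma = z @ W_gamma  # per-feature scale, a function of the embedding
    beta = z @ W_beta    # per-feature shift, a function of the embedding
    return gamma * h + beta

h = rng.normal(size=D_HID)
z1, z2 = rng.normal(size=D_EMB), rng.normal(size=D_EMB)
```

The appeal of this design for TOM is that the same encoder and decoder weights serve every variable, with only the low-dimensional embedding z changing per variable.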

2. BACKGROUND: MULTI-TASK ENCODER-DECODER DECOMPOSITIONS

This section reviews Deep MTL methods from the perspective of decomposition into encoders and decoders (Table 1). In MTL, there are T tasks {(x_t, y_t)}_{t=1}^T that can, in general, be drawn from different domains and have varying input and output dimensionality. The t-th task has n_t input variables [x_{t1}, ..., x_{tn_t}] = x_t ∈ R^{n_t} and m_t output variables [y_{t1}, ..., y_{tm_t}] = y_t ∈ R^{m_t}. Two tasks (x_t, y_t) and (x_{t'}, y_{t'}) are disjoint if their input and output variables are non-overlapping, i.e., ({x_{ti}}_{i=1}^{n_t} ∪ {y_{tj}}_{j=1}^{m_t}) ∩ ({x_{t'i}}_{i=1}^{n_{t'}} ∪ {y_{t'j}}_{j=1}^{m_{t'}}) = ∅. The goal is to exploit regularities across task models x_t → ŷ_t by jointly training them with overlapping parameters. The standard intra-domain approach is for all task models to share their encoder f, and for each to have its own task-specific decoder g_t (Table 1a). This setup was used in the original introduction of MTL
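The shared-encoder decomposition described above (ŷ_t = g_t(f(x_t))) can be sketched as follows. This is a toy NumPy illustration with hypothetical names and sizes; real intra-domain setups use deep encoders over, e.g., pixels, but the parameter-sharing structure is the same.

```python
import numpy as np

rng = np.random.default_rng(1)
N_IN, D_HID = 4, 8  # intra-domain: all tasks share the same input space

# One shared encoder f (a toy linear+tanh layer) and a task-specific
# decoder g_t per task; tasks may have different output dimensionalities m_t.
W_f = rng.normal(size=(N_IN, D_HID)) * 0.1
W_g = {"task_a": rng.normal(size=(D_HID, 2)) * 0.1,   # m_a = 2 outputs
       "task_b": rng.normal(size=(D_HID, 3)) * 0.1}   # m_b = 3 outputs

def f(x):
    """Shared encoder: its parameters receive gradients from every task."""
    return np.tanh(x @ W_f)

def predict(task, x_t):
    """y_hat_t = g_t(f(x_t)): shared encoder, task-specific decoder."""
    return f(x_t) @ W_g[task]

x = rng.normal(size=N_IN)
```

Joint training then updates W_f on losses from all tasks, which is where cross-task regularities are exploited, while each W_g[t] is updated only on its own task's loss.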



Figure 1: The Traveling Observer Model. (a) Tasks with disjoint input and output variable sets are measured in the same underlying 2D universe. The shape of each marker denotes the task to which that variable belongs; white markers denote input variables, black markers denote output variables, and the background color shows the state of the entire universe when the current sample is drawn. (b) The function f encodes the value of each observed variable x_i given its 2D location z_i ∈ R^2, and these encodings are aggregated by elementwise addition; (c) the function g decodes the aggregated encoding to a prediction for y_j at its location z_j. In general, the embedded locations z are not known a priori, but they can be learned alongside f and g by gradient descent.

