THE TRAVELING OBSERVER MODEL: MULTI-TASK LEARNING THROUGH SPATIAL VARIABLE EMBEDDINGS

Abstract

This paper frames a general prediction system as an observer traveling around a continuous space, measuring values at some locations, and predicting them at others. The observer is completely agnostic about any particular task being solved; it cares only about measurement locations and their values. This perspective leads to a machine learning framework in which seemingly unrelated tasks can be solved by a single model, by embedding their input and output variables into a shared space. An implementation of the framework is developed in which these variable embeddings are learned jointly with internal model parameters. In experiments, the approach is shown to (1) recover intuitive locations of variables in space and time, (2) exploit regularities across related datasets with completely disjoint input and output spaces, and (3) exploit regularities across seemingly unrelated tasks, outperforming task-specific single-task models and multi-task learning alternatives. The results suggest that even seemingly unrelated tasks may originate from similar underlying processes, a fact that the traveling observer model can use to make better predictions.

1. INTRODUCTION

Natural organisms benefit from the fact that their sensory inputs and action outputs are all organized in the same space, that is, the physical universe. This consistency makes it easy to apply the same predictive functions across diverse settings. Deep multi-task learning (Deep MTL) has shown a similar ability to adapt knowledge across tasks whose observed variables are embedded in a shared space. Examples include vision, where the input for all tasks (photograph, drawing, or otherwise) is pixels arranged in a 2D plane (Zhang et al., 2014; Misra et al., 2016; Rebuffi et al., 2017) ; natural language (Collobert & Weston, 2008; Luong et al., 2016; Hashimoto et al., 2017) , speech processing (Seltzer & Droppo, 2013; Huang et al., 2015) , and genomics (Alipanahi et al., 2015) , which exploit the 1D structure of text, waveforms, and nucleotide sequences; and video game-playing (Jaderberg et al., 2017; Teh et al., 2017) , where interactions are organized across space and time. Yet, many real-world prediction tasks have no such spatial organization; their input and output variables are simply labeled values, e.g., the height of a tree, the cost of a haircut, or the score on a standardized test. To make matters worse, these sets of variables are often disjoint across a set of tasks. These challenges have led the MTL community to avoid such tasks, despite the fact that general knowledge about how to make good predictions can arise from solving seemingly "unrelated" tasks (Mahmud & Ray, 2008; Mahmud, 2009; Meyerson & Miikkulainen, 2019) . This paper proposes a solution: Learn all variable locations in a shared space, while simultaneously training the prediction model itself (Figure 1 ). To illustrate this idea, Figure 1a gives an example of four tasks whose variable values are measured at different locations in the same underlying 2D embedding space. 
The shape of each marker denotes the task to which that variable belongs; white markers denote input variables, black markers denote output variables, and the background coloring indicates the variable values in the entire embedding space when the current sample is drawn. As a concrete example, the color could indicate the air temperature at each point in a geographical region at a given moment in time, and each marker the location of a temperature sensor (however, note that the embedding space is generally more abstract). Figure 1b-c shows a model that can be applied to any task in this universe, using the • task as an example: (b) The function $f$ encodes the value of each observed variable $x_i$ given its 2D location $z_i \in \mathbb{R}^2$, and these encodings are aggregated by elementwise addition; (c) The function $g$ decodes the aggregated encoding to a prediction for $y_j$ at its location $z_j$. Such a predictor can be viewed as a traveling observer model (TOM): It traverses the space of variables, taking a measurement at the location of each input. Given these observations, the model can make a prediction for the value at the location of an output. In general, the embedded locations $z$ are not known a priori (i.e., when input and output variables do not have obvious physical locations), but they can be learned alongside $f$ and $g$ by gradient descent. The input and output spaces of a prediction problem can be standardized so that the measured value of each input and output variable is a scalar.
The prediction model can then be completely agnostic about the particular task for which it is making a prediction. By learning variable embeddings (VEs), i.e., the $z$'s, the model can capture variable relationships explicitly and supports joint training of a single architecture across seemingly unrelated tasks with disjoint input and output spaces. TOM thus establishes a new lower bound on the commonalities shared across real-world machine learning problems: They are all drawn from the same space of variables that humans can and do measure. This paper develops a first implementation of TOM, using an encoder-decoder architecture, with variable embeddings incorporated using FiLM (Perez et al., 2018). In the experiments, the implementation is shown to (1) recover the intuitive locations of variables in space and time, (2) exploit regularities across related datasets with disjoint input and output spaces, and (3) exploit regularities across seemingly unrelated tasks to outperform single-task models tuned to each task, as well as current Deep MTL alternatives. The results confirm that TOM is a promising framework for representing and exploiting the underlying processes of seemingly unrelated tasks.

2. BACKGROUND: MULTI-TASK ENCODER-DECODER DECOMPOSITIONS

This section reviews Deep MTL methods from the perspective of decomposition into encoders and decoders (Table 1). In MTL, there are $T$ tasks $\{(x_t, y_t)\}_{t=1}^{T}$ that can, in general, be drawn from different domains and have varying input and output dimensionality. The $t$th task has $n_t$ input variables $[x_{t1}, \ldots, x_{tn_t}] = x_t \in \mathbb{R}^{n_t}$ and $m_t$ output variables $[y_{t1}, \ldots, y_{tm_t}] = y_t \in \mathbb{R}^{m_t}$. Two tasks $(x_t, y_t)$ and $(x_{t'}, y_{t'})$ are disjoint if their input and output variables are non-overlapping, i.e., $\left(\{x_{ti}\}_{i=1}^{n_t} \cup \{y_{tj}\}_{j=1}^{m_t}\right) \cap \left(\{x_{t'i}\}_{i=1}^{n_{t'}} \cup \{y_{t'j}\}_{j=1}^{m_{t'}}\right) = \emptyset$. The goal is to exploit regularities across task models $x_t \to \hat{y}_t$ by jointly training them with overlapping parameters.

Table 1: MTL approaches decomposed into encoders $f_*$ and decoders $g_*$:
(a) $\hat{y}_t = g_t(f(x_t))$: Standard MTL takes advantage of the shared spatialization of tasks within a domain by sharing a single encoder across all tasks $t$;
(b) $\hat{y}_t = g(f(x_t, z_t))$: Task embeddings allow tasks within a domain to share their decoder as well;
(c) $\hat{y}_t = g_t(f_t(x_t))$: Applying standard MTL across domains requires task-specific encoders, and finding some other method of sharing parameters across tasks;
(d) $\hat{y}_j = g\left(\sum_i f(x_i, z_i), z_j\right)$: TOM allows a single encoder and decoder to be used even in the cross-domain setting, by embedding all input and output variables into a shared space.

The standard intra-domain approach is for all task models to share their encoder $f$, and each to have its own task-specific decoder $g_t$ (Table 1a). This setup was used in the original introduction of MTL (Caruana, 1998), has been broadly explored in the linear regime (Argyriou et al., 2008; Kang et al., 2011; Kumar & Daumé, 2012), and is the most common approach in Deep MTL (Huang et al., 2013; Zhang et al., 2014; Dong et al., 2015; Liu et al., 2015; Ranjan et al., 2016; Jaderberg et al., 2017). The main limitation of this approach is that it applies only to sets of tasks that are all drawn from the same domain.
It also has the risk of the separate decoders doing so much of the learning that there is not much left to be shared, which is why the decoders are usually single affine layers. To address the issue of limited sharing, the task embeddings approach trains a single encoder $f$ and single decoder $g$, with all task-specific parameters learned in embedding vectors $z_t$ that semantically characterize each task, and which are fed into the model as additional input (Yang & Hospedales, 2014; Bilen & Vedaldi, 2017; Zintgraf et al., 2019) (Table 1b). Such methods require that all tasks have the same input and output space, but are flexible in how the embeddings can be used to adapt the model to each task. As a result, they can learn tighter connections between tasks than separate decoders, and these relationships can be analyzed by looking at the learned embeddings. To exploit regularities across tasks from diverse and disjoint domains, cross-domain methods have been introduced. Existing methods address the challenge of disjoint output and input spaces by using separate decoders and encoders for each domain (Table 1c), and thus they require some other method of sharing model parameters across tasks, such as sharing some of their layers (Kaiser et al., 2017; Meyerson & Miikkulainen, 2018) or drawing their parameters from a shared pool (Meyerson & Miikkulainen, 2019). For many datasets, the separate encoders and decoders absorb too much functionality to share optimally, and their complexity makes it difficult to analyze the relationships between tasks. Work predating deep learning showed that, from an algorithmic learning theory perspective, sharing knowledge across tasks should always be useful (Mahmud & Ray, 2008; Mahmud, 2009), but the accompanying experiments were limited to learning biases in a decision tree generation process, i.e., the learned models themselves were not shared across tasks.
TOM extends the notion of task embeddings to variable embeddings in order to apply the idea in the cross-domain setting (Table 1d ). The method is described in the next section.

3. THE TRAVELING OBSERVER MODEL

Consider the set of all scalar random variables that could possibly be measured $\{v_1, v_2, \ldots\} = V$. Each $v_i \in V$ could be an input or output variable for some prediction task. To characterize each $v_i$ semantically, associate with it a vector $z_i \in \mathbb{R}^C$ that encodes the meaning of $v_i$, e.g., "height of left ear of human adult in inches", "answer to survey question 9 on a scale of 1 to 5", "severity of heart disease", "brightness of top-left pixel of photograph", etc. This vector $z_i$ is called the variable embedding (VE) of $v_i$. Variable embeddings could be handcoded, e.g., based on some featurization of the space of variables, but such a handcoding is usually unavailable, and would likely miss some of the underlying semantic regularities across variables. An alternative approach is to learn variable embeddings based on their utility in solving prediction problems of interest.

A prediction task $(x, y) = ([x_1, \ldots, x_n], [y_1, \ldots, y_m])$ is defined by its set of observed variables $\{x_i\}_{i=1}^{n} \subseteq V$ and its set of target variables $\{y_j\}_{j=1}^{m} \subseteq V$ whose values are unknown. The goal is to find a prediction function $\Omega$ that can be applied across any prediction task of interest, so that it can learn to exploit regularities across such problems. Let $z_i$ and $z_j$ be the variable embeddings corresponding to $x_i$ and $y_j$, respectively. Then, this universal prediction model is of the form

$$\mathbb{E}[y_j \mid x] = \Omega\left(x, \{z_i\}_{i=1}^{n}, z_j\right). \tag{1}$$

Importantly, for any two tasks $(x_t, y_t)$, $(x_{t'}, y_{t'})$, their prediction functions (Eq. 1) differ only in their $z$'s, which enforces the constraint that functionality is otherwise completely shared across the models. One can view $\Omega$ as a traveling observer, who visits several locations in the $C$-dimensional variable space, takes measurements at those locations, and uses this information to make predictions of values at other locations.
To make $\Omega$ concrete, it must be a function that can be applied to any number of variables, can fit any set of prediction problems, and is invariant to variable ordering, since we cannot in general assume that a meaningful order exists. These requirements lead to the following decomposition:

$$\mathbb{E}[y_j \mid x] = \Omega\left(x, \{z_i\}_{i=1}^{n}, z_j\right) = g\left(\sum_{i=1}^{n} f(x_i, z_i),\; z_j\right), \tag{2}$$

where $f$ and $g$ are functions called the encoder and decoder, with trainable parameters $\theta_f$ and $\theta_g$, respectively. The variable embeddings $z$ tell $f$ and $g$ which variables they are observing, and these $z$ can be learned by gradient descent alongside $\theta_f$ and $\theta_g$. A depiction of the model is shown in Figure 1. For some integer $M$, $f: \mathbb{R}^{C+1} \to \mathbb{R}^{M}$ and $g: \mathbb{R}^{M+C} \to \mathbb{R}$. In principle, $f$ and $g$ could be any sufficiently expressive functions of this form. A natural choice is to implement them as neural networks. They are called the encoder and decoder because they map variables to and from a latent space of size $M$. This model can then be trained end-to-end with gradient descent. A batch for gradient descent is constructed by sampling a prediction problem, e.g., a task, from the distribution of problems of interest, and then sampling a batch of data from the data set for that problem. Notice that, in addition to supervised training, in this framework it is natural to autoencode, i.e., predict input variables, and subsample inputs to simulate multiple tasks drawn from the same universe.

The question remains: How can $f$ and $g$ be designed so that they can sufficiently capture a broad range of prediction behavior, and be effectively conditioned by variable embeddings? The next section introduces an experimental architecture that satisfies these requirements.
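The encoder-sum-decoder decomposition of Eq. 2 can be illustrated with a minimal NumPy sketch. This is not the implementation used in the paper: the one-hidden-layer networks, the sizes `C`, `M`, `H`, the random (untrained) weights, and the name `tom_predict` are all assumptions chosen only to show that the same function handles any number of input and output variables and is invariant to input ordering.

```python
import numpy as np

rng = np.random.default_rng(0)
C, M, H = 2, 8, 16  # VE size, latent size, hidden width (illustrative values)

def mlp_params(n_in, n_hidden, n_out):
    """Random weights for a one-hidden-layer network (stand-in for trained parameters)."""
    return (rng.normal(0, 0.5, (n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(0, 0.5, (n_hidden, n_out)), np.zeros(n_out))

def mlp(params, inp):
    w1, b1, w2, b2 = params
    return np.tanh(inp @ w1 + b1) @ w2 + b2

theta_f = mlp_params(C + 1, H, M)  # encoder f: R^{C+1} -> R^M
theta_g = mlp_params(M + C, H, 1)  # decoder g: R^{M+C} -> R

def tom_predict(x, z_in, z_out):
    """x: (n,) input values, z_in: (n, C) input VEs, z_out: (m, C) output VEs -> (m,) predictions."""
    enc_in = np.concatenate([x[:, None], z_in], axis=1)      # (n, C+1): value plus location
    latent = mlp(theta_f, enc_in).sum(axis=0)                 # (M,): elementwise sum over inputs
    dec_in = np.concatenate([np.tile(latent, (len(z_out), 1)), z_out], axis=1)  # (m, M+C)
    return mlp(theta_g, dec_in)[:, 0]

# A task with 5 inputs and 3 outputs.
x = rng.normal(size=5)
z_in = rng.normal(size=(5, C))
z_out = rng.normal(size=(3, C))
pred = tom_predict(x, z_in, z_out)

# Shuffling the inputs (values together with their VEs) leaves the prediction unchanged.
perm = rng.permutation(5)
pred_perm = tom_predict(x[perm], z_in[perm], z_out)

# The same parameters serve a different task with 2 inputs and 1 output.
pred_other = tom_predict(rng.normal(size=2), rng.normal(size=(2, C)), rng.normal(size=(1, C)))
```

In a trained model, the rows of `z_in` and `z_out` would be learnable parameters updated by gradient descent together with `theta_f` and `theta_g`.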

4. INSTANTIATION

The experiments in this paper implement TOM using a generic architecture built from standard components (Figure 2). The encoder and decoder are conditioned on VEs via FiLM layers (Perez et al., 2018), which provide a flexible yet inexpensive way to adapt functionality to each variable, and have been previously used to incorporate task embeddings (Vuorio et al., 2019; Zintgraf et al., 2019). For simplicity, the FiLM layers are based on affine transformations of VEs. Specifically, the $\ell$th FiLM layer $F_\ell$ is parameterized by affine layers $W_\ell^{*}$ and $W_\ell^{+}$, and, given a variable embedding $z$, the hidden state $h$ is modulated by

$$F_\ell(h) = W_\ell^{*}(z) \odot h + W_\ell^{+}(z),$$

where $\odot$ is the Hadamard product. A FiLM layer is located alongside each fully-connected layer in the encoder and decoder, both of which consist primarily of residual blocks. To avoid deleterious behavior of batch norm across diverse tasks and small datasets/batches, the recently proposed SkipInit (De & Smith, 2020) is used as a replacement to stabilize training. SkipInit adds a trainable scalar $\alpha$ initialized to 0 at the end of each residual block, and uses dropout for regularization.

Finally, for computational efficiency, the decoder is redecomposed into the Core, or $g_1$, which is independent of output variable, and the Decoder proper, or $g_2$, which is conditioned on the output variable. That way, generic transformations of the summed Encoder output can be learned by the Core and run in a single forward and backward pass each iteration. With this decomposition, Eq. 2 is rewritten as

$$\mathbb{E}[y_j \mid x] = g_2\left(g_1\left(\sum_{i=1}^{n} f(x_i, z_i)\right),\; z_j\right).$$

The complete architecture is depicted in Figure 2. In the following sections, all models are implemented in pytorch (Paszke et al., 2017), use Adam for optimization (Kingma & Ba, 2014), and have hidden layer size of 128 for all layers. Variable embeddings for TOM are initialized from $\mathcal{N}(0, 10^{-3})$. See Appendix C for additional details of this implementation.

Figure 3: The VE for each variable, i.e., pixel, is colored uniquely. TOM peels the border of the CIFAR images (the upper loop of VEs at iteration 300K) away from their center (the lower grid). This makes sense, since CIFAR images all feature a central object, which semantically splits the image into foreground (the object itself) and background (the remaining ring of pixels around the object). See https://youtu.be/R_z-2SR2KpY for videos of VEs being learned.
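The FiLM modulation and the SkipInit residual connection described in this section can be sketched in NumPy as follows. This is a hedged illustration, not the paper's code: the bias initialization (scale bias of one, shift bias of zero), the ReLU nonlinearity, the ordering of operations inside `res_block`, and all names and sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H = 2, 4  # VE size and hidden size (illustrative values)

# Affine layers W* and W+ map a variable embedding z to a per-unit scale and shift.
W_mul, b_mul = rng.normal(size=(C, H)), np.ones(H)   # assumed init: scale starts at 1 when z = 0
W_add, b_add = rng.normal(size=(C, H)), np.zeros(H)  # assumed init: shift starts at 0 when z = 0

def film(h, z):
    """F(h) = W*(z) * h + W+(z), with * the elementwise (Hadamard) product."""
    scale = z @ W_mul + b_mul
    shift = z @ W_add + b_add
    return scale * h + shift

# One dense layer inside a residual block (hypothetical layout).
W_dense, b_dense = rng.normal(size=(H, H)), np.zeros(H)

def res_block(h, z, alpha):
    """Residual block with a SkipInit scalar: output = h + alpha * branch(h, z)."""
    branch = np.maximum(0.0, film(h @ W_dense + b_dense, z))
    return h + alpha * branch

h = rng.normal(size=H)
z_a = rng.normal(size=C)
z_b = rng.normal(size=C)

out_a = film(h, z_a)                      # same hidden state, modulated for variable a
out_b = film(h, z_b)                      # ... and for variable b
at_init = res_block(h, z_a, alpha=0.0)    # with alpha = 0 the block is the identity
```

Because `alpha` starts at zero, every residual block initially passes its input through unchanged, which is the property SkipInit relies on to stabilize training without batch norm.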

5. EXPERIMENTS

This section presents a suite of experiments that evaluate the behavior of the implementation introduced in Section 4. See Appendix C for additional experimental details.

5.1. VALIDATING LEARNED VARIABLE EMBEDDINGS: DISCOVERING SPACE AND TIME

The experiments in this section test TOM's ability to learn variable embeddings that reflect our a priori intuition about the domain, in particular, the organization of space and time. CIFAR. The first experiment is based on the CIFAR dataset (Krizhevsky, 2009) . The pixels of the 32 × 32 images are converted to grayscale values in [0, 1], yielding 1024 variables. The goal is to predict all variable values, given only a subset of them as input. The model is trained to minimize the binary cross-entropy of each output, and it uses 2D VEs. The a priori, or Oracle, expectation is that the VEs form a 32 × 32 grid corresponding to how pixels are spatially laid out in an image. Daily Temperature. The second experiment is based on the Melbourne minimum daily temperature dataset (Brownlee, 2016) , a subset of a larger database for tracking climate change (Della-Marta et al., 2004) . As above, the goal is to predict the daily temperature of the previous 10 days, given only some subset of them, by minimizing the MSE of each variable. The a priori, Oracle, expectation is that the VEs are laid out linearly in a single temporal dimension. The goal is to see whether TOM will also learn VEs (in a 2D space) that follow a clear 1D manifold that can be interpreted as time. For both experiments, a subset of the input variables is randomly sampled at each training iteration, which simulates drawing tasks from a limited universe. The resulting learning process for the VEs is illustrated in Figures 3 and 4 . The VEs for CIFAR pull apart and unfold, until they reflect the oracle embeddings (Figure 3 ). The remaining difference is that TOM peels the border of the CIFAR images (the upper loop of VEs at iteration 300K) away from their center (the lower grid). This makes sense, since CIFAR images all feature a central object, which semantically splits the image into foreground (the object itself) and background (the remaining ring of pixels around the object). 
Similarly, the VEs for daily temperature pull apart until they form a perfect 1D manifold representing the time dimension (Figure 4 ). The main difference is that TOM has embedded this 1D structure as a ring in 2D, which is well-suited to the nonlinear encoder and decoder, since it mirrors an isotropic Gaussian distribution. Note that unlike visualization methods like SOM (Kohonen, 1990) , PCA (Pearson, 1901) , or t-SNE (van der Maaten & Hinton, 2008), TOM learns locations for each variable not each sample. Furthermore, TOM has no explicit motivation to visualize; learned VEs are simply the locations found to be useful by using gradient descent when solving the prediction problem. To get an idea of how learning VEs affects prediction performance, comparisons were run with three cases of fixed VEs: (1) all VEs set to zero, to address the question of whether differentiating variables with VEs is needed at all in the model; (2) random VEs, to address the question of whether simply having any unique label for variables is sufficient; and (3) oracle VEs, which reflect the human a priori expectation of how the variables should be arranged. The results show that the learned embeddings outperform zero and random embeddings, achieving performance on par with the Oracle (Table 2 ). The conclusion is that learned VEs in TOM are not only meaningful, but can help make superior predictions, without a priori knowledge of variable meaning. The next section shows how such VEs can be used to exploit regularities across tasks in an MTL setting.

5.2. EXPLOITING REGULARITIES ACROSS DISJOINT TASKS

This section considers two synthetic multi-task problems that contain underlying regularities across tasks. These regularities are not known to the model a priori; it can only exploit them via its VEs. The first problem evaluates TOM in a regression setting where input and output variables are drawn from the same continuous space; the second problem evaluates TOM in a classification setting. For classification tasks, each class defines a distinct output variable.

Transposed Gaussian Process. In the first problem, the universe is defined by a Gaussian process (GP). The GP is 1D, is zero-mean, and has an RBF kernel with length-scale 1. One task is generated for each (# inputs, # outputs) pair in $\{1, \ldots, 10\} \times \{1, \ldots, 10\}$, for a total of 100 tasks. The "true" location of each variable lies in the single dimension of the GP, and is sampled uniformly from $[0, 5]$. Samples for the task are generated by sampling from the GP, and measuring the value at each variable location. The dataset for each task contains 10 training samples, 10 validation samples, and 100 test samples. Samples are generated independently for each task. The goal is to minimize MSE of the outputs. Figure 5 gives two examples of tasks drawn from this universe. This testbed is ideal for TOM, because, by the definition of the GP, it explicitly captures the idea that variables whose VEs are nearby are closely related, and every variable has some effect on all others.

Concentric Hyperspheres. In the second problem, each task is defined by a set of concentric hyperspheres. Many areas of human knowledge have been organized abstractly as such hyperspheres, e.g., planets around a star, electrons around an atom, social relationships around an individual, or suburbs around Washington D.C.; the idea is that a model that discovers this common organization could then share general knowledge across such areas more effectively.
To test this hypothesis, one task is generated for each (# features $n$, # classes $m$) pair in $\{1, \ldots, 10\} \times \{2, \ldots, 10\}$, for a total of 90 tasks. For each task, its origin $o_t$ is drawn from $\mathcal{N}(0, I_n)$. Then, for each class $c \in \{1, \ldots, m\}$, samples are drawn from $\mathbb{R}^n$ uniformly at distance $c$ from $o_t$, i.e., each class is defined by a (hyper)annulus. The dataset for each task contains five training samples, five validation samples, and 100 test samples per class. The model has no a priori knowledge that the classes are structured in annuli, or which annulus corresponds to which class, but it is possible to achieve high accuracy by making analogies of annuli across tasks, i.e., discovering the underlying structure of this universe.

In these experiments, TOM is compared to five alternative methods: (1) TOM-STL, i.e., TOM trained on each task independently; (2) DR-MTL (Deep Residual MTL), the standard cross-domain (Table 1c) version of TOM, where instead of FiLM layers, each task has its own linear encoder and decoder layers, and all residual blocks are CoreResBlocks; (3) DR-STL, which is like DR-MTL except it is trained on each task independently; (4) SLO (Soft Layer Ordering; Meyerson & Miikkulainen, 2018), which uses a separate encoder and decoder for each task, and which is (as far as we know) the only prior Deep MTL approach that has been applied across disjoint tabular datasets; and (5) Oracle, i.e., TOM with VEs fixed to intuitively correct values. The Oracle is included to give an upper bound on how well the TOM architecture in Section 4 could possibly perform. The oracle VE for each Transposed GP task variable is the location where it is measured in the GP; for Concentric Hyperspheres, the oracle VE for each class $c$ is $c/10$, and for the $i$th feature is $o_{ti}$. TOM outperforms the competing methods and achieves performance on par with the Oracle (Table 3).
Note that the improvement of TOM over TOM-STL is much greater than that of DR-MTL over DR-STL, indicating that TOM is particularly well-suited to exploiting structure across disjoint data sets (learned VEs are shown in Figure 6a-b ). Now that this suitability has been confirmed, the next section evaluates TOM across a suite of disjoint, and seemingly unrelated, real-world problems.
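The data-generating process for a single Concentric Hyperspheres task, as described in this section, can be sketched as follows. The function name `make_hypersphere_task` and its arguments are hypothetical; drawing a uniform direction by normalizing a standard Gaussian vector is a standard technique and is an assumption about how the uniform sampling at distance $c$ could be realized, not a statement of how the original datasets were produced.

```python
import numpy as np

def make_hypersphere_task(n_features, n_classes, samples_per_class, rng):
    """Generate one task: class c lies at distance c from a task-specific origin."""
    origin = rng.normal(size=n_features)  # o_t drawn from N(0, I_n)
    xs, ys = [], []
    for c in range(1, n_classes + 1):
        # A normalized Gaussian vector is uniformly distributed on the unit sphere.
        directions = rng.normal(size=(samples_per_class, n_features))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        xs.append(origin + c * directions)
        ys.append(np.full(samples_per_class, c))
    return origin, np.concatenate(xs), np.concatenate(ys)

rng = np.random.default_rng(0)
# Five training samples per class, for a task with 3 features and 4 classes.
origin, X_train, y_train = make_hypersphere_task(3, 4, 5, rng)
distances = np.linalg.norm(X_train - origin, axis=1)  # equals the class label of each sample
```

Each task has its own origin and its own feature and class counts, so nothing but the radial structure is shared across tasks; that shared structure is what the learned VEs would need to capture.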

5.3. MULTI-TASK LEARNING ACROSS SEEMINGLY UNRELATED REAL-WORLD DATASETS

This section evaluates TOM in the setting for which it was designed: learning a single shared model across seemingly unrelated real-world datasets. The set of tasks used is UCI-121 (Lichman, 2013; Fernández-Delgado et al., 2014) , a set of 121 classification tasks that has been previously used to evaluate the overall performance of a variety of deep NN methods (Klambauer et al., 2017) . The tasks come from diverse areas such as medicine, geology, engineering, botany, sociology, politics, and game-playing. Prior work has tuned each model to each task individually in the single-task regime; no prior work has undertaken learning of all 121 tasks in a single joint model. The datasets are highly diverse. Each simply defines a classification task that a machine learning practitioner was interested in solving. The number of features for a task range from 3 to 262, the number of classes from 2 to 100, and the number of samples from 10 to 130,064. To avoid underfitting to the larger tasks, C = 128, and after joint training all model parameters (θ f , θ g1 , θ g2 , and z's) are finetuned on each task with at least 5K samples. Note that it is not expected that training any two tasks jointly will improve performance in both tasks, but that training all 121 tasks jointly will improve performance overall, as the model learns general knowledge about how to make good predictions. Results across a suite of metrics are shown in Table 4 . Mean Accuracy is the test accuracy averaged across all tasks. Normalized Accuracy scales the accuracy within each task before averaging across tasks, with 0 and 100 corresponding to the lowest and highest accuracies. Mean Rank averages the method's rank across tasks, where the best method gets a rank of 0. Best % is the percentage of tasks for which the method achieves the top accuracy (with possible ties). Win % is the percentage of tasks for which the method achieves accuracy strictly greater than all other methods. 
TOM outperforms the alternative approaches across all metrics, showing its ability to learn many seemingly unrelated tasks successfully in a single model (see Figure 6c for a high-level visualization of learned VEs). In other words, TOM can both learn meaningful VEs and use them to improve prediction performance.
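The five summary metrics defined in this section can be computed from a (methods × tasks) accuracy matrix as in the sketch below. The function name `summarize`, the toy accuracy values, and two tie-handling choices are assumptions: a method's rank is taken to be the number of methods with strictly higher accuracy on that task, and a task on which all methods tie contributes a normalized accuracy of 0 for every method.

```python
import numpy as np

def summarize(acc):
    """acc[i, t] is the test accuracy of method i on task t."""
    acc = np.asarray(acc, dtype=float)
    lo, hi = acc.min(axis=0), acc.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)          # avoid division by zero on all-tie tasks
    normalized = 100.0 * (acc - lo) / span           # 0 = worst method, 100 = best, per task
    # ranks[i, t] = number of methods strictly better than method i on task t (best gets 0).
    ranks = (acc[None, :, :] > acc[:, None, :]).sum(axis=1)
    is_best = acc == hi                              # top accuracy, ties allowed
    is_win = is_best & (is_best.sum(axis=0) == 1)    # strictly better than all other methods
    return {
        "mean_accuracy": acc.mean(axis=1),
        "normalized_accuracy": normalized.mean(axis=1),
        "mean_rank": ranks.mean(axis=1),
        "best_pct": 100.0 * is_best.mean(axis=1),
        "win_pct": 100.0 * is_win.mean(axis=1),
    }

# Toy example with three hypothetical methods (rows) on three tasks (columns).
toy_acc = [[0.9, 0.8, 0.7],
           [0.8, 0.8, 0.9],
           [0.7, 0.6, 0.9]]
metrics = summarize(toy_acc)
```

In the toy example, the first method is the only one with a strict win (on the first task), while the second method has the best mean rank because it is never far from the top.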

6. DISCUSSION AND FUTURE WORK

Sections 2 and 3 developed the foundations for the TOM approach; Sections 4 and 5 illustrated its capabilities, demonstrating its value as a general multitask learning system. This section discusses four key areas of future work for increasing the understanding and applicability of the approach.

First, there is an opportunity to develop a theoretical framework for understanding when TOM will work best. It is straightforward to extend universal approximation results from approximation of single functions (Cybenko, 1989; Lu et al., 2017; Kidger & Lyons, 2020) to approximation of a set of functions each with any input and output dimensionality via Eq. 2. It is also straightforward to extend convergence bounds for certain model classes, such as PAC bounds (Bartlett & Mendelson, 2002; Neyshabur et al., 2018), to TOM architectures implemented with these classes, if the "true" variable embeddings are fixed a priori, so they can simply be treated as features. However, a more [...]

Second, in this paper, TOM was evaluated only in the case when the data for all tasks is always available, and the model is trained simultaneously across all tasks. However, it would also be natural to apply TOM in a meta-learning regime (Finn et al., 2017; Zintgraf et al., 2019), in which the model is trained explicitly to generalize to future tasks, and to lifelong learning (Thrun & Pratt, 2012; Brunskill & Li, 2014; Abel et al., 2018), where the model must learn new tasks as they appear over time. Simply freezing the learned parameters of TOM results in a parametric class of ML models with $C$ parameters per variable that can be applied to new tasks. However, in practice, it should be possible to improve upon this approach by taking advantage of more sophisticated fine-tuning and parameter adaptation.
For example, in low-data settings, methods can be adapted from meta-learning approaches that modulate model weights in a single forward pass instead of performing supervised backpropagation (Garnelo et al., 2018; Vuorio et al., 2019) . Interestingly, although they are designed to address issues quite different from those motivating TOM, the architectures of such approaches have a functional decomposition that is similar to that of TOM at a high level (see e.g. Conditional Neural Processes, or CNPs; Garnelo et al., 2018) . In essence, replacing the VEs in Eq. 2 with input samples and the variables with output samples yields a function that generates a prediction model given a dataset. This analogy suggests that it should be possible to extend the benefits of CNPs to TOM, including rich uncertainty information. Third, to make the foundational case for TOM, this paper focused on the setting where VEs are a priori unknown, but when such knowledge is available, it could be useful to integrate with learned VEs. Such an approach could eliminate the cost of relearning VEs, and suggest how to take advantage of spatially-customized architectures. E.g., convolution or attention layers could be used instead of dense layers as architectural primitives, as in vision and language tasks. Such specialization could be instrumental in making TOM more broadly applicable and more powerful in practice. Finally, one interpretation of Fig. 6c is that the learned VEs of classes encode a task-agnostic concept of "normal" vs. "abnormal" system states. TOM could be used to analyze the emergence of such general concepts and as an analogy engine: to describe states of one task in the language of another.

7. CONCLUSION

This paper introduced the traveling observer model (TOM), which enables a single model to be trained across diverse tasks by embedding all task variables into a shared space. The framework was shown to discover intuitive notions of space and time and use them to learn variable embeddings that exploit knowledge across tasks, outperforming single- and multi-task alternatives. Thus, learning a single function that cares only about variable locations and their values is a promising approach to integrating knowledge across data sets that have no a priori connection. The TOM approach thus extends the benefits of multi-task learning to broader sets of tasks.

B PYTORCH CODE

To give a detailed picture of how the TOM architecture in this paper was implemented, the code for the forward pass of the model implemented in pytorch (Paszke et al., 2017) is given in Figure 7. For efficiency, TOM is implemented with Conv1D layers with kernel size 1 instead of Dense layers. This approach enables the model to run the encoder and decoder on all variables in parallel. The fact that Conv layers are so highly optimized in pytorch makes the implementation substantially more efficient than with Dense layers. In this code, input batch has shape (batch size, # input variables), input contexts has shape (1, VE dim, # input variables), and output contexts has shape (1, VE dim, # output variables). Code for TOM will be available at https://github.com/leaf-ai/tom-release.
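The reason a Conv1D layer with kernel size 1 can replace a Dense layer here is that it applies the same affine map independently at every position along the variable axis. The NumPy sketch below checks this equivalence numerically; it is a stand-in written for illustration, not the pytorch code of Figure 7, and the function names and tensor sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, c_in, c_out, n_vars = 4, 3, 5, 7  # illustrative sizes

x = rng.normal(size=(batch, c_in, n_vars))  # layout (batch, channels, variables)
W = rng.normal(size=(c_out, c_in))          # shared weights
b = rng.normal(size=c_out)                  # shared bias

def conv1d_kernel1(x, W, b):
    """Kernel-size-1 convolution: one matrix multiply applied at every variable position."""
    return np.einsum("oc,bcn->bon", W, x) + b[None, :, None]

def dense_per_variable(x, W, b):
    """The same computation written as a loop applying a dense layer to each variable in turn."""
    out = np.empty((x.shape[0], W.shape[0], x.shape[2]))
    for j in range(x.shape[2]):
        out[:, :, j] = x[:, :, j] @ W.T + b
    return out

parallel = conv1d_kernel1(x, W, b)
looped = dense_per_variable(x, W, b)
```

The two results agree, so the convolutional form is only a way of batching the per-variable encoder and decoder computations into one call.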

C ADDITIONAL EXPERIMENTAL DETAILS

A sigmoid layer is applied at the end of the decoder for the CIFAR experiments, to squash the output between 0 and 1. For the CIFAR and Daily Temperature experiments, a subset of the variables is sampled each iteration to be used as input. This subset is sampled in the following way: (1) Sample the size k of the subset uniformly from [1, $n_t$], where $n_t$ is the number of variables in the experiment; (2) Sample a subset of variables of size k uniformly from all subsets of size k. This sampling method ensures that every subset size has an equal chance of getting selected, so that the universe is not biased towards tasks of a particular size. E.g., if instead the subset were created by sampling each variable independently with probability p, then the subset size would concentrate tightly around $p n_t$.

For classification tasks, each class defines a distinct output variable, i.e., a K-class classification task has K output variables. The squared hinge loss was used for classification tasks (Janocha & Czarnecki, 2017). It is preferable to categorical cross-entropy loss in this setting, because it does not require taking a softmax across output variables, so the outputs are kept separate. Also, the loss becomes exactly zero once a sample is learned strongly, so that the model does not continue to overfit as remaining samples and tasks are learned.

The number of blocks in the encoder, core, and decoder is N = 3 for all problems except UCI-121, for which it is N = 10. All experiments use a hidden size of 128 for all dense layers aside from the final decoder layer that maps to the output space. The batch size was 32 for CIFAR and Daily Temperature, and max(200, # train samples) for all other tasks. At each step, $T_o$ tasks are uniformly sampled from the set of all tasks, and gradients are summed over a batch for each task in the sample. $T_o = 1$ in all experiments except UCI-121, for which $T_o = 32$.
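Two of the details above lend themselves to a short sketch: the two-stage subset sampling and the squared hinge loss over per-class output variables. The code below is a minimal illustration using only the Python standard library, not the implementation used in the experiments. The function names are hypothetical, and the target coding in the loss (+1 for the true class, -1 for all others, summed over outputs) is an assumption that the text does not state.

```python
import random


def sample_input_subset(num_vars, rng):
    """Two-stage sampling: pick a size k uniformly, then a uniform subset of that size."""
    k = rng.randint(1, num_vars)  # inclusive on both ends, so k is in [1, num_vars]
    return sorted(rng.sample(range(num_vars), k))


def sample_bernoulli_subset(num_vars, p, rng):
    """Alternative from the text: include each variable independently with probability p."""
    return [v for v in range(num_vars) if rng.random() < p]


def squared_hinge_loss(scores, true_class):
    """Squared hinge loss summed over K separate per-class output variables.

    Assumed target coding: +1 for the true class, -1 for every other class.
    """
    total = 0.0
    for k, score in enumerate(scores):
        target = 1.0 if k == true_class else -1.0
        total += max(0.0, 1.0 - target * score) ** 2
    return total


rng = random.Random(0)
n_t = 20
uniform_sizes = [len(sample_input_subset(n_t, rng)) for _ in range(5000)]
bernoulli_sizes = [len(sample_bernoulli_subset(n_t, 0.5, rng)) for _ in range(5000)]

# Extreme sizes occur regularly under two-stage sampling ...
print(uniform_sizes.count(1), uniform_sizes.count(n_t))
# ... but almost never when each variable is sampled independently.
print(bernoulli_sizes.count(1), bernoulli_sizes.count(n_t))

# Every output clears the margin, so the loss is exactly zero.
print(squared_hinge_loss([2.0, -1.5, -3.0], true_class=0))
# Two outputs are inside the margin and contribute 0.25 and 1.0.
print(squared_hinge_loss([0.5, -1.5, 0.0], true_class=0))
```

Because each output is penalized on its own, no normalization across classes is needed, which matches the motivation given above for avoiding a softmax.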
Weights are initialized using the default PyTorch initialization, aside from the SkipInit α scalars, which are initialized to zero (De & Smith, 2020). The experiments in Section 5.1 use no weight decay; those in Section 5.2 use a weight decay of $10^{-4}$; and those in Section 5.3 use a weight decay of $10^{-5}$. Dropout is set to 0.0 for CIFAR, Daily Temperature, and Concentric Hyperspheres; and 0.5 for Transposed Gaussian Process and UCI-121.

In UCI-121, fully-trained MTL models are finetuned to tasks with more than 5,000 samples, using the same optimizer configuration as for joint training, except that the steps-per-epoch is set to # train samples / batch size, the learning rate is initialized to 0.0001, the patience for early stopping is set to 100, and the validation performance is smoothed over every 10 epochs (simple moving average), following the protocol used to train single-task models in prior work (Klambauer et al., 2017). TOM uses a VE size of C = 2 for all experiments, except for UCI-121, where C = 128 in order to accommodate the complexity of such a large and diverse set of tasks.

For Figure 6c, t-SNE (van der Maaten & Hinton, 2008) was used to reduce the dimensionality to two. t-SNE was run for 10K iterations with default parameters in the scikit-learn implementation (Pedregosa et al., 2011), after first reducing the dimensionality from 128 to 32 via PCA. Independent runs of t-SNE yielded qualitatively similar results.

Autoencoding (i.e., predicting the input variables as well as unseen variables) was used for CIFAR, Daily Temperature, and Transposed Gaussian Process; it was not used for Concentric Hyperspheres or UCI-121. The Soft Layer Ordering architecture follows the original implementation (Meyerson & Miikkulainen, 2018). There are four shared ReLU layers, each of size 128, with dropout after each to ease sharing across different soft combinations of layers. In Tables 2 and 3, means and standard errors for each method are computed over ten runs.
The Daily Temperature dataset was downloaded from https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv.



Figure 1: The Traveling Observer Model. (a) Tasks with disjoint input and output variable sets are measured in the same underlying 2D universe. The shape of each marker denotes the task to which that variable belongs; white markers denote input variables, black markers denote output variables, and the background color shows the state of the entire universe when the current sample is drawn. (b) The function f encodes the value of each observed variable $x_i$ given its 2D location $z_i \in \mathbb{R}^2$, and these encodings are aggregated by elementwise addition; (c) The function g decodes the aggregated encoding to a prediction for $y_j$ at its location $z_j$. In general, the embedded locations z are not known a priori, but they can be learned alongside f and g by gradient descent.

(a) Intra-domain (b) Task Embeddings (c) Cross-domain (d) Variable Embeddings (TOM)

Figure 3: Variable embeddings learned for CIFAR unfold over iterations until they resemble Oracle expectations (best viewed in color). The VE for each variable, i.e., pixel, is colored uniquely. TOM peels the border of the CIFAR images (the upper loop of VEs at iteration 300K) away from their center (the lower grid). This makes sense, since CIFAR images all feature a central object, which semantically splits the image into foreground (the object itself) and background (the remaining ring of pixels around the object). See https://youtu.be/R_z-2SR2KpY for videos of VEs being learned.

Figure 5: (a) Tasks with disjoint input and output variable sets, whose variables are nonetheless measured in the same underlying space (dotted lines are samples). These tasks are drawn from the Transposed Gaussian Process problem in Section 5.2; (b) TOM can be applied to any task in this space: It predicts values at output locations, given values at input locations.

Figure 6: Learned VEs capture underlying structure across tasks. (a) VEs of features for concentric hyperspheres encode the origin location, and (b) for classes encode the index of their annuli (less precisely for the more distant annuli, since they occur in fewer tasks); (c) VEs for UCI-121 (shown in 2D via t-SNE) neatly carve the space into features, common classes, and uncommon classes.

Figure 7: PyTorch code for the forward pass of the TOM implementation.

Diagram of the TOM implementation used in the experiments. Encoder, Core, and Decoder correspond to $f$, $g_1$, and $g_2$ in Eq. 4, respectively. The Encoder and Decoder are conditioned on input and output VEs z via FiLM layers. A CRB is simply an FRB without conditioning. Dropout and trainable scalars α implement SkipInit as a substitute for BatchNorm. This residual structure allows the architecture to learn tasks of varying complexity in a flexible manner.
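The interaction between FiLM conditioning and SkipInit described in this caption can be illustrated with a small NumPy sketch. The diagram itself is not reproduced here, so the ordering of operations inside the block, the layer sizes, and the names `film` and `film_residual_block` are all assumptions made for illustration; dropout is omitted. The sketch shows only the general pattern: the VE produces a per-feature scale and shift, and a residual branch multiplied by a scalar α contributes nothing while α is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, ve_dim, num_vars = 8, 2, 5  # illustrative sizes


def film(h, z, W_gamma, W_beta):
    """FiLM: scale and shift each hidden feature using linear functions of the VE z."""
    gamma = z @ W_gamma  # (num_vars, hidden)
    beta = z @ W_beta    # (num_vars, hidden)
    return gamma * h + beta


def film_residual_block(h, z, params, alpha):
    """Hypothetical FRB: h + alpha * Dense(ReLU(FiLM(h, z))), with dropout omitted."""
    modulated = film(h, z, params["W_gamma"], params["W_beta"])
    update = np.maximum(modulated, 0.0) @ params["W_dense"] + params["b_dense"]
    return h + alpha * update


params = {
    "W_gamma": rng.normal(size=(ve_dim, hidden)),
    "W_beta": rng.normal(size=(ve_dim, hidden)),
    "W_dense": rng.normal(size=(hidden, hidden)),
    "b_dense": rng.normal(size=(hidden,)),
}
h = rng.normal(size=(num_vars, hidden))  # one hidden vector per variable
z = rng.normal(size=(num_vars, ve_dim))  # one VE per variable

out_init = film_residual_block(h, z, params, alpha=0.0)     # SkipInit: alpha starts at zero
out_later = film_residual_block(h, z, params, alpha=0.3)    # a hypothetical nonzero alpha

print(np.allclose(out_init, h), np.allclose(out_later, h))
```

With α at zero the block is the identity map, so a deep stack of such blocks starts out as a shallow function and can grow in effective depth as the α scalars are trained.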

Quantitative results for space and time prediction. This table compares test errors (± std. err.) of learned VEs to fixed-VE alternatives in TOM. The results show that learned VEs outperform Zero and Random VEs, reaching performance on par with the Oracle. That is, TOM not only learns meaningful VEs (Figures 3 and 4), but also uses these VEs to achieve superior performance.

Quantitative Results in synthetic disjoint MTL scenarios. TOM learns variable embeddings that enable it to outperform alternative approaches, and achieve performance on par with the Oracle.

(a) Comparisons to external results of deep STL models tuned to each task (see "Experiments" in Klambauer et al. (2017) for more details); (b) Comparisons across methods evaluated in this paper. Metrics are aggregated over all 121 tasks (± std. err.). TOM achieves high performance across seemingly unrelated tasks, outperforming the comparisons across all metrics.

intriguing direction involves understanding how the true locations of variables affect TOM's ability to learn and exploit them, i.e., what are desirable theoretical properties of the space of variables?

To allow for multi-task training with datasets of varying numbers of samples, we say the model has completed one epoch each time it is evaluated on the validation set. An epoch is 1000 steps for CIFAR, 100 steps for Daily Temperature, 1K steps for Transposed Gaussian Process, 1K steps for Concentric Hyperspheres, and 10K steps for UCI-121.

For CIFAR, the official training and test splits are used for training and testing. No validation set is needed for CIFAR, because none of the models can overfit to the training set. For Daily Temperature, the second-to-last year of data is withheld for validation, and the final year is withheld for testing. The UCI-121 experiments use the preprocessed versions of the official train-val-test splits (https://github.com/bioinf-jku/SNNs/tree/master/UCI).

Adam is used for all experiments, with all parameters initialized to their default values. In all experiments except UCI-121, the learning rate is kept constant at 0.001 throughout training. In UCI-121, the learning rate is decreased by a factor of two when the mean validation accuracy has not increased in 20 epochs; it is decreased five times; model training stops when it would be decreased a sixth time. Models are trained for 500K steps for CIFAR, 100K steps for Daily Temperature, and 250K for Transposed Gaussian Process and Concentric Hyperspheres. The test performance for each task is its performance on the test set after the epoch of its best validation performance.
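The UCI-121 learning-rate schedule described above can be sketched in a few lines of plain Python. This is one reading of the rule, not the training code used in the experiments: the function name `run_lr_schedule`, the initial learning rate of 0.001, and the choice to reset the plateau counter after each decrease are assumptions.

```python
def run_lr_schedule(val_accuracies, initial_lr=0.001, patience=20, max_decreases=5):
    """Halve the learning rate on plateaus; stop when one more halving would be needed.

    val_accuracies: mean validation accuracy after each epoch (hypothetical input).
    Returns (learning rate used in each epoch, epoch at which training stopped or None).
    """
    lr = initial_lr
    best = float("-inf")
    epochs_since_improvement = 0
    decreases = 0
    lrs = []
    for epoch, acc in enumerate(val_accuracies):
        lrs.append(lr)
        if acc > best:
            best = acc
            epochs_since_improvement = 0
            continue
        epochs_since_improvement += 1
        if epochs_since_improvement >= patience:
            if decreases == max_decreases:
                return lrs, epoch  # a sixth decrease would be required, so stop
            lr /= 2.0
            decreases += 1
            epochs_since_improvement = 0  # assumed: the counter restarts after a decrease
    return lrs, None


# A run that improves once and then plateaus for the rest of training.
accuracies = [0.5] + [0.4] * 200
lrs, stop_epoch = run_lr_schedule(accuracies)
print(stop_epoch, lrs[-1])
```

In this example the rate is halved at epochs 20, 40, 60, 80, and 100, and training stops at epoch 120, when a sixth decrease would otherwise be triggered.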

contains test accuracies for each UCI-121 task for all methods run in the experiments in Section 5.3.

Accuracies for each UCI-121 task.

ACKNOWLEDGEMENTS

Thank you to Babak Hodjat and others in the Evolutionary AI research group for helpful discussions and technical feedback. Thank you also to the reviewers, particularly for their suggestions for improving the organizational structure and clarity of the paper.


Published as a conference paper at ICLR 2021

A ADDITIONAL EXPERIMENT ON THE EMBEDDING SIZE C

In the experiments in Section 5.1 and 5.2, the VE dimensionality C for TOM was set to 2 in order to most clearly visualize the VEs that were learned. In the experiment in Section 5.3, C was increased in order to accommodate the scale-up to a large number of highly diverse real world tasks. In that experiment C was set to 128 in order to match the number of task-specific parameters of the other Deep MTL methods compared in Table 4.

To evaluate the sensitivity of TOM to the setting of C, additional experiments were run for TOM on UCI-121 with C = 64 and C = 256. The results are shown in 

