ONE NETWORK FITS ALL? MODULAR VERSUS MONOLITHIC TASK FORMULATIONS IN NEURAL NETWORKS

Abstract

Can deep learning solve multiple tasks simultaneously, even when they are unrelated and very different? We investigate how the representations of the underlying tasks affect the ability of a single neural network to learn them jointly. We present theoretical and empirical findings that a single neural network is capable of simultaneously learning multiple tasks from a combined data set, for a variety of task representations: for example, when the distinct tasks are encoded by well-separated clusters or decision trees over certain task-code attributes. More concretely, we present a novel analysis showing that families of simple programming-like constructs for the codes encoding the tasks are learnable by two-layer neural networks with standard training. We study more generally how the complexity of learning such combined tasks grows with the complexity of the task codes; we find that combining many tasks may incur a sample complexity penalty, even though the individual tasks are easy to learn. We provide empirical support for the usefulness of the learning bounds by training networks on clusters, decision trees, and SQL-style aggregation.

* Work performed in part while visiting Google.

1. INTRODUCTION

Standard practice in machine learning has long been to address only carefully circumscribed, often closely related tasks. For example, we might train a single classifier to label an image as containing objects from a certain predefined set, or to label the words of a sentence with their semantic roles. Indeed, when working with relatively simple classes of functions like linear classifiers, it would be unreasonable to expect to train a classifier that handles more than such a carefully scoped task (or related tasks in standard multitask learning). As techniques for learning with relatively rich classes such as neural networks have been developed, it is natural to ask whether such scoping of tasks is inherently necessary. Indeed, many recent works (see Section 1.2) have proposed eschewing this careful scoping of tasks, and instead training a single, "monolithic" function spanning many tasks. Large, deep neural networks can, in principle, represent multiple classifiers in such a monolithic learned function (Hornik, 1991), giving rise to the field of multitask learning. This combined function might be learned by combining all of the training data for all of the tasks into one large batch (see Section 1.2 for some examples). Taken to an extreme, we could consider seeking to learn a universal circuit, that is, a circuit that interprets arbitrary programs in a programming language which can encode various tasks. But the ability to represent such a monolithic combined function does not necessarily entail that such a function can be efficiently learned by existing methods. Cryptographic hardness theorems (Kearns & Valiant, 1994) establish that this is not possible in general by any method, let alone the specific training methods used in practice.
Figure 1: Our framework shows that it is possible to learn analytic functions such as the gravitational force law, decision trees with different functions at the leaf nodes, and programming constructs such as those on the right, all using a non-modular monolithic architecture.

Nevertheless, we can still ask how rich a family of tasks can be learned by these standard methods. In this work, we study the extent to which backpropagation with stochastic gradient descent (SGD) can learn such monolithic functions on diverse, unrelated tasks. There might still be some inherent benefit to an architecture in which tasks are partitioned into sub-tasks of small scope, and the training data is correspondingly partitioned prior to learning. For example, in early work on multitask learning, Caruana (1997) observed that training a network to solve unrelated tasks simultaneously seemed to harm overall performance. Similarly, the seminal work of Jacobs et al. (1991) begins by stating that "If backpropagation is used to train a single, multilayer network to perform different subtasks on different occasions, there will generally be strong interference effects that lead to slow learning and poor generalization." We therefore ask whether, for an unfortunate choice of tasks in our model, learning by standard methods might be fundamentally impaired. As a point of reference from neuroscience, the classical view is that distinct tasks are handled in the brain by distinct patches of the cortex. While it is a subject of debate whether modularity exists for higher-level tasks (Samuels, 2006), it is accepted that there are dedicated modules for low-level tasks such as vision and audio processing. Thus, it seems that the brain employs a modular architecture, in which different tasks are handled by different regions of the cortex.
Conceivably, this division into task-specific regions might be driven by fundamental considerations of learnability: a single, monolithic neural circuit might simply be too difficult to learn because the different tasks might interfere with one another. Others have taken neural networks trained by backpropagation as a model of learning in the cortex (Musslick et al., 2017); to the extent that this is reasonable, our work has some bearing on these questions as well.

1.1. OUR RESULTS

We find, perhaps surprisingly, that combining multiple tasks into one cannot fundamentally impair learning with standard training methods. We demonstrate this for a broad family of methods for combining individual tasks into a single monolithic task. For example, the inputs for each individual task may come from a disjoint region (for example, a disjoint ball) in a common input space, and each individual task could then involve applying some arbitrary simple function (e.g., a separate linear classifier for each region). Alternatively, there may be an explicit "task code" attribute (e.g., a one-hot code), together with the usual input attributes and output label(s), where examples with the same task code are examples for the same learning task. Complementing our results that combining multiple tasks does not impair learning, we also find that some task coding schemes do incur a sample complexity penalty. A vast variety of task coding schemes may be used. As a concrete example, when the data points for each task are well-separated into distinct clusters, and the tasks are linear classification tasks, we show that a two-layer architecture trained with SGD successfully learns the combined, monolithic function; the required amount of data simply scales as the sum of the amount required to learn each
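The cluster-based scheme above can be illustrated with a small experiment. The following is a hypothetical sketch (not the paper's experimental code): two unrelated linear classification tasks live in well-separated clusters of a common input space, their data are pooled into one combined set, and a single two-layer ReLU network is trained on the pool with plain mini-batch SGD. The cluster centers, network width, and learning rate are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(center, w, n=500):
    # Points in a small ball around `center`; labels come from this
    # task's own linear classifier applied to the local offset.
    X = center + 0.5 * rng.standard_normal((n, 2))
    y = ((X - center) @ w > 0).astype(float)
    return X, y

# Two unrelated linear tasks in disjoint regions of the input space;
# no single linear classifier is consistent with both label rules.
X1, y1 = make_task(np.array([ 5.0,  5.0]), np.array([1.0, -1.0]))
X2, y2 = make_task(np.array([-5.0, -5.0]), np.array([1.0,  1.0]))
X = np.vstack([X1, X2])
y = np.concatenate([y1, y2])

# Monolithic two-layer network (2 -> 32 ReLU -> 1 sigmoid) trained by
# SGD on the combined data, with no task identity given to the model.
h = 32
W1 = rng.standard_normal((2, h)) / np.sqrt(2.0); b1 = np.zeros(h)
W2 = rng.standard_normal((h, 1)) / np.sqrt(h);   b2 = np.zeros(1)
lr = 0.05

for epoch in range(200):
    for i in np.array_split(rng.permutation(len(X)), 20):  # mini-batches
        xb, yb = X[i], y[i][:, None]
        z1 = xb @ W1 + b1
        a1 = np.maximum(z1, 0.0)
        p = 1.0 / (1.0 + np.exp(-(a1 @ W2 + b2)))
        g = (p - yb) / len(xb)                 # grad of logistic loss
        gW2 = a1.T @ g;  gb2 = g.sum(0)
        ga1 = g @ W2.T;  ga1[z1 <= 0] = 0.0    # backprop through ReLU
        gW1 = xb.T @ ga1; gb1 = ga1.sum(0)
        W2 -= lr * gW2; b2 -= lr * gb2
        W1 -= lr * gW1; b1 -= lr * gb1

a1 = np.maximum(X @ W1 + b1, 0.0)
pred = (1.0 / (1.0 + np.exp(-(a1 @ W2 + b2))) > 0.5).ravel()
acc = (pred == y).mean()
print(f"joint accuracy on both tasks: {acc:.3f}")
```

Because the clusters are well-separated, the hidden layer can effectively route each region to its own linear rule, so the single monolithic network fits both tasks despite their conflicting label directions.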

