THE MULTIPLE SUBNETWORK HYPOTHESIS: ENABLING MULTIDOMAIN LEARNING BY ISOLATING TASK-SPECIFIC SUBNETWORKS IN FEEDFORWARD NEURAL NETWORKS

Abstract

Neural networks have seen an explosion of usage and research in the past decade, particularly within the domains of computer vision and natural language processing. However, only recently have advancements in neural networks yielded performance improvements beyond narrow applications and translated to expanded multitask models capable of generalizing across multiple data types and modalities. Simultaneously, it has been shown that neural networks are overparameterized to a high degree, and pruning techniques have proved capable of significantly reducing the number of active weights within the network while largely preserving performance. In this work, we identify a methodology and network representational structure which allows a pruned network to employ previously unused weights to learn subsequent tasks. We employ these methodologies on well-known benchmarking datasets for testing purposes and show that networks trained using our approaches are able to learn multiple tasks, which may be related or unrelated, in parallel or in sequence without sacrificing performance on any task or exhibiting catastrophic forgetting.

1. INTRODUCTION

It is well known and documented that artificial neural networks (ANNs) are often overparameterized, resulting in computational inefficiency (LeCun et al., 1990; Liu et al., 2019). Applying unstructured pruning to an ANN pares down the number of active weights by identifying which subset of weights in an ANN is most important for the model's predictive performance and discarding those which are less important or even entirely unnecessary. This technique has been shown to reduce the computational cost of using and storing a model without necessarily affecting model accuracy (Frankle & Carbin, 2019; Suzuki et al., 2001; Han et al., 2016; Lis et al., 2019; Wang et al., 2021). This phenomenon naturally raises a corollary question: are pruned weights entirely useless, or could weights identified as being unnecessary for one task be retained and used to learn other tasks? Further, can the ANN's performance on learned tasks be preserved while these weights learn to perform new tasks? The exploration of these questions leads us to propose the multiple subnetwork hypothesis: that a dense, randomly-initialized feedforward ANN contains within its architecture multiple disjoint subnetworks which can be utilized together to learn and make accurate predictions on multiple tasks, regardless of the degree of similarity between tasks or input types. Instead of focusing on matching or surpassing state-of-the-art results on the datasets and tasks presented in this study, we focus on testing the multiple subnetwork hypothesis on a set of standardized network architectures and compare multitask model performance with traditionally trained, single-task models of identical architectures. An obstacle to developing multitask ANNs is the tendency of ANNs to exhibit catastrophic forgetting (CF), during which they destroy internal state representations used in learning previously acquired tasks when learning a new task (French, 1999; Goodfellow et al., 2013; Pfülb & Gepperth, 2019).
CF is especially pronounced in continuous learning paradigms with a low degree of intertask relatedness (Aljundi et al., 2017; Ma et al., 2018; Masana et al., 2021), posing a challenge to creating multitask models able to perform prediction or classification across very different tasks, input data types, or input shapes. Isolating parameters for a given task, thereby fixing parameter subsets for that task, is a means of minimizing the effects of CF in a continuous learning context (Delange et al., 2021). Parameter isolation methods localize the performance of learned tasks over a physically (Yoon et al., 2018; Rusu et al., 2016; Mallya & Lazebnik, 2018) or logically (Liu et al., 2019; Serra et al., 2018; Mallya et al., 2018) isolated region of an ANN. By reducing or eliminating the interaction of weights responsible for a given task, these methods minimize the opportunity for the acquisition of new tasks to interfere with the ANN's ability to continue to perform previously learned tasks. While the isolation of subnetworks is the surest way to prevent interference arising from learning new tasks, this method can scale poorly under memory and computational constraints, and therefore demands a flexible approach to limiting network capacity (Yoon et al., 2018) or minimizing the size of the network utilized for each task. Our methodology aims to negate the high computational cost of parameter isolation by representing multidomain multitask models as sparse tensors, thereby minimizing their computational footprint. In this work, we demonstrate the implications of the multiple subnetwork hypothesis through a set of experiments, the results of which show that our model training procedure and multitask representational structure can be used to create multitask models with reduced computational footprints which are capable of overcoming CF.

2. METHODOLOGY

In this section, we discuss how we enable a single model to learn multiple tasks across multiple domains and data types through both a task-specific weight representational structure and a modified training procedure which selects disjoint subsets of network weights for each individual task.

2.1. MULTITASK REPRESENTATIONAL STRUCTURE

The weights of a fully connected neural network layer traditionally consist of a kernel tensor, k, of shape m × n, and a bias vector, b, of length n, where m denotes the number of columns in the input data and n denotes the number of neurons within the layer. To process input data x, the decision function is thus Φ(kx + b), where Φ denotes the layer's activation function. Suppose then that there are inputs from two different distributions, x_1 and x_2. Given the two-dimensional structure of the kernel tensor in this traditional representation, there is no way to alter the decision function based on which distribution is currently selected. To address this problem, we add a new dimension, t, to the kernel tensor and the bias vector within the layer to denote which distribution or task the input data belongs to, leaving the kernel tensor with a shape of t × m × n and transforming the bias vector into a bias matrix of shape t × n. The decision function for this new layer is thus altered to that in Equation 1:

F(x, i) = Φ(k_i x + b_i)    (1)

We take a similar approach for convolutional layers. For two-dimensional convolutions over inputs with color channels, the kernel is of shape s_1 × s_2 × c × f, where s_1 and s_2 denote the height and width of the convolutional filters, c represents the number of channels in the input, and f corresponds to the number of filters. The bias vector in this scenario has length f. In our multitask representation of the convolutional layer, we once more add a new dimension to the front of both the kernel tensor and the bias vector to denote the task.
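The task-indexed layer above can be sketched in a few lines of NumPy. This is a minimal illustration of the representational structure, not the paper's implementation; the function name and shapes are our own, chosen to mirror the t × m × n kernel and t × n bias described in the text.

```python
import numpy as np

def multitask_dense(x, kernel, bias, task, activation=np.tanh):
    """Forward pass of a multitask dense layer, F(x, i) = Phi(k_i x + b_i).

    kernel: shape (t, m, n) -- one m x n weight slice per task
    bias:   shape (t, n)    -- one length-n bias vector per task
    task:   integer index i selecting which task's subnetwork to use
    """
    k_i = kernel[task]   # (m, n) kernel slice for the selected task
    b_i = bias[task]     # (n,) bias vector for the selected task
    return activation(x @ k_i + b_i)

# Two tasks sharing one layer object but using separate weight slices.
rng = np.random.default_rng(0)
t, m, n = 2, 4, 3
kernel = rng.standard_normal((t, m, n))
bias = rng.standard_normal((t, n))

x = rng.standard_normal((5, m))                 # batch of 5 inputs, m features
out0 = multitask_dense(x, kernel, bias, task=0)
out1 = multitask_dense(x, kernel, bias, task=1)
print(out0.shape, out1.shape)                   # (5, 3) (5, 3)
```

Selecting the task index simply swaps in a different kernel slice and bias row, so the same layer can route inputs from different distributions through different decision functions.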

2.2. SPARSIFICATION AND SUBNETWORK IDENTIFICATION

While the multitask representational structure described above could theoretically be used to enable a network to learn multiple tasks as-is, it alone does not suffice to test our hypothesis. On its own, this structure merely combines multiple independent networks into a single model, since the number of parameters within each layer grows by a factor of the number of tasks. To fully test our multiple subnetwork hypothesis, we devised the Reduction of Sub-Network Neuroplasticity (RSN2) training procedure, which ensures that, for every fixed position in a multitask layer, at most one weight along the task dimension is active. In other words, due to the nature of RSN2's pruning schema, no two tasks retain an active weight at the same position along the task dimension, t, after pruning, so the per-task subnetworks are disjoint.
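The disjointness constraint can be illustrated with a short NumPy sketch. The paper does not specify RSN2's selection criterion at this point, so the rule below (assigning each weight position to the task with the largest-magnitude weight) is purely a hypothetical stand-in used to show the constraint itself: after pruning, no (m, n) position has active weights for more than one task.

```python
import numpy as np

def disjoint_task_masks(kernel):
    """Illustrative disjointness constraint over a (t, m, n) multitask kernel.

    At each (m, n) position, keep the weight for at most one task --
    here, the task whose weight has the largest magnitude (a hypothetical
    criterion; the actual RSN2 procedure is defined in the paper) --
    and prune the rest. Returns a boolean mask of shape (t, m, n) with
    exactly one True along the task axis at every position.
    """
    winner = np.abs(kernel).argmax(axis=0)    # (m, n) winning task indices
    mask = np.zeros(kernel.shape, dtype=bool)
    m_idx, n_idx = np.indices(winner.shape)
    mask[winner, m_idx, n_idx] = True
    return mask

rng = np.random.default_rng(1)
kernel = rng.standard_normal((3, 4, 5))       # 3 tasks, 4 x 5 layer
mask = disjoint_task_masks(kernel)
pruned = np.where(mask, kernel, 0.0)

# No two tasks share an active weight at the same (m, n) position.
print(mask.sum(axis=0).max())                 # 1
```

In practice each task would retain only a sparse subset of its winning positions rather than all of them; this sketch only enforces the "no overlap along t" property that makes the subnetworks disjoint.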

