COMPOSITIONAL MODELS: MULTI-TASK LEARNING AND KNOWLEDGE TRANSFER WITH MODULAR NETWORKS

Abstract

Conditional computation and modular networks have recently been proposed for multi-task learning and other problems as a way to decompose problem solving into multiple reusable computational blocks. We propose a novel fully differentiable approach for learning modular networks. In our method, modules can be invoked repeatedly, and knowledge can be transferred to novel tasks by adjusting the order of computation. This allows soft weight sharing between tasks with only a small increase in the number of parameters. We show that our method leads to interpretable self-organization of modules in multi-task learning, transfer learning, and domain adaptation, while achieving competitive results on those tasks. From a practical perspective, our approach allows us to: (a) reuse existing modules for learning a new task by adjusting the computation order, (b) perform unsupervised multi-source domain adaptation, illustrating that adaptation to unseen data can be achieved by manipulating only the order of pretrained modules, and (c) increase the accuracy of existing architectures for image classification tasks such as IMAGENET, without any parameter increase, by reusing the same block multiple times.

1. INTRODUCTION

Most modern interpretations of neural networks treat layers as transformations that convert lower-level features, such as pixels, into higher-level abstractions. Adopting this view, many designs that break up monolithic models into interacting modular components localize those components to particular processing stages, thus disallowing parameter sharing across layers. This restriction, however, can be limiting when we need to continually grow the model, be that for transferring knowledge to a new task or for preventing catastrophic forgetting. It also prohibits module reuse within a model, even though certain parameter-efficient architectures and complex decision-making processes may involve recurrent components (Kaiser & Sutskever, 2016; Randazzo et al., 2020) and feedback loops (Herzog et al., 2020; Yan et al., 2019; Kar et al., 2019). Here we propose a simple alternative design that represents an entire model as a mixture of modules, each of which can contribute to computation at any processing stage. As opposed to many other approaches (Zoph & Le, 2016; Bengio, 2016; Kirsch et al., 2018a; Rosenbaum et al., 2019, etc.), we use a soft mixture of modules to obtain a more flexible model and a differentiable optimization objective that can be optimized end-to-end and does not involve high-variance estimators. In our approach, the parameters of every block (or layer) of the network are computed as a linear combination of a set of "template" block parameters, thus representing the entire model as: (a) a databank of template blocks and (b) vectors of "mixture weights" that are used to generate the weights of every layer. This simple design can be utilized for a variety of applications, from producing compact networks and training multi-task models capable of reusing individual modules to knowledge transfer and domain adaptation.
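The template-mixing parameterization above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the shapes, the number of templates K, and all variable names are our own assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical databank of K template parameter blocks, all sharing one
# shape (here: 3x3 conv kernels with 8 input and 8 output channels).
K, kh, kw, cin, cout = 4, 3, 3, 8, 8
rng = np.random.default_rng(0)
templates = rng.normal(size=(K, kh, kw, cin, cout))

# One learnable mixture-logit vector per layer; beyond the shared
# templates, the model only adds num_layers * K scalar parameters.
num_layers = 6
mixture_logits = rng.normal(size=(num_layers, K))

def layer_kernel(layer_idx):
    # The layer's kernel is a convex combination of the shared templates.
    alpha = softmax(mixture_logits[layer_idx])     # (K,)
    return np.tensordot(alpha, templates, axes=1)  # (kh, kw, cin, cout)

kernel0 = layer_kernel(0)  # every layer draws on the same K templates
```

Because the mixture is soft and differentiable, both the templates and the logits can be trained end-to-end with ordinary gradient descent, with no discrete routing decisions to estimate.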
The experimental results show that (a) when used for multi-task training, our model organizes its modules so that tasks share the first few layers while specializing closer to the head, whereas (b) in domain adaptation problems, modules instead specialize in processing the input image while sharing the later processing stages. Moreover, our self-organizing model achieves promising results in multi-task learning and model personalization. The rest of the paper is organized as follows: in Section 2 we review the related literature and discuss existing approaches to modular networks and neural architecture search; in Section 3 we describe the architecture of our compositional model and various aspects of learning the best module sequence for each specific task; we present experimental results in single-task, multi-task, continual learning, and domain adaptation settings in Section 4 and provide a final discussion in Section 5.

2. PRIOR WORK

The idea of constructing a deep neural network as a composition of reusable computation blocks has been extensively explored in recent literature. One of the early papers, Andreas et al. (2016), used a natural language parser to determine the layout of a composite network consisting of predefined modules that solve different kinds of subtasks, such as find, measure, describe, etc.; further works improved this approach by switching from an external parser to a fixed (Hu et al., 2017b) or arbitrary query structure (Hu et al., 2017a; 2018; Pahuja et al., 2019).

3.1. COMPOSITIONAL MODELS

In conventional convolutional deep neural networks, most layers differ from each other and do not conform to a single fixed design. Various model blocks may have different resolutions and different numbers of filters, some blocks may have or lack a residual connection, and so on. This makes most model activations at different layers incompatible with each other. Recently, it has been demonstrated (Sandler et al., 2019) that high-performing convolutional networks can nevertheless be composed of identical blocks. Such networks, called isometric networks, essentially iterate on the same activation space. The architecture of an isometric network includes the following core components: (a) an input adapter that maps the model input i ∈ I into the activation space Z; (b) the model body, a sequence of blocks all sharing the same architecture and mapping the space Z to itself; (c) an output adapter mapping the output of the last block into the embedding space E; and finally (d) a logits layer for classification models, mapping the embedding space E into the final predicted probabilities.
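The four components above can be sketched schematically with dense layers standing in for the convolutional blocks; all dimensions and names here are illustrative assumptions, not the architecture of Sandler et al. (2019).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: raw input, the fixed activation space Z, the
# embedding space E, and C output classes.
IN_DIM, Z_DIM, E_DIM, C = 64, 32, 16, 10

W_in = rng.normal(size=(IN_DIM, Z_DIM))    # (a) input adapter: I -> Z
W_block = rng.normal(size=(Z_DIM, Z_DIM))  # (b) one body block: Z -> Z
W_out = rng.normal(size=(Z_DIM, E_DIM))    # (c) output adapter: Z -> E
W_logits = rng.normal(size=(E_DIM, C))     # (d) logits layer: E -> scores

def isometric_forward(x, num_blocks=8):
    z = np.tanh(x @ W_in)
    # Every body block maps Z to itself, so the same block (or any
    # mixture of template blocks) can be applied at every depth.
    for _ in range(num_blocks):
        z = np.tanh(z @ W_block)
    e = np.tanh(z @ W_out)
    return e @ W_logits

logits = isometric_forward(rng.normal(size=(IN_DIM,)))
```

The key property is that the body blocks are interchangeable: since each one maps Z to Z, the depth of the network and the assignment of modules to positions can be varied without any shape mismatch.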



Early conditional computation methods used evolutionary algorithms (Wierstra et al., 2005; Floreano et al., 2008; Miikkulainen et al., 2019) to determine a suitable model architecture; later on, reinforcement learning was used to optimize the model layout and parameters (Zoph & Le, 2016; Bengio, 2016; Baker et al., 2016). Recently, Kirsch et al. (2018a) proposed an end-to-end modular neural network framework that learns to choose the best among several module candidates at each layer according to the input data. Conditional computation routing is even more appealing for multi-task learning, as it makes it possible to both learn reusable computation blocks and adjust the model to each specific task with minimal network alteration. For instance, Misra et al. (2016) use an additional set of modules for each task and enforce similarity between the weights of corresponding task-specific modules. Rosenbaum et al. (2017; 2019) used reinforcement learning to perform task- and data-specific routing; Sun et al. (2019) train both the model weights and a task-specific policy that determines which layers should be shared for a given task. Ma et al. (2019) and Maziarz et al. (2019) introduced multi-task models where each layer's output is computed as a weighted sum of the outputs of a set of candidate modules, which is similar to our method, with two important differences: (a) in the works by Ma et al. (2019) and Maziarz et al. (2019), the output is computed as a linear combination of the candidates' outputs rather than of the module weights, and (b) the module candidates in their methods are tied to a specific location in the network. Maninis et al. (2019) added task-specific residual adapters to specialize the feature extractor to each task. The method proposed by Purushwalkam et al. (2019) performs zero-shot multi-task learning by finding a task-specific routing using a gating mechanism.
This approach is somewhat similar to ours, but is less efficient in terms of the number of parameters, since (a) the method of Purushwalkam et al. (2019) uses a modular network on top of a classical ResNet (He et al., 2016), while our method uses only about 100 additional parameters, and (b) the modules used in Purushwalkam et al. (2019) are layer-specific and not reusable, whereas our architecture imposes no such constraint. Wu et al. (2018); Newell et al. (2019); Guo et al. (2020) perform task-specific model compression. There also exist a number of studies on the effectiveness of modular networks in the context of visual question answering.
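The distinction drawn above between mixing module outputs and mixing module weights can be made concrete with a toy dense-layer example; the shapes and names are ours, and the two schemes only stand in for the cited methods schematically.

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 8, 3                           # feature size, number of candidate modules
modules = rng.normal(size=(K, D, D))  # K candidate weight matrices
alpha = np.array([0.5, 0.3, 0.2])     # mixture weights (sum to 1)
x = rng.normal(size=(D,))

# (a) Output mixing: run every candidate module, then combine outputs.
y_outputs = sum(a * (x @ W) for a, W in zip(alpha, modules))

# (b) Weight mixing: combine the candidate weights first, then apply
# a single layer with the mixed weights.
W_mix = np.tensordot(alpha, modules, axes=1)
y_weights = x @ W_mix

# For purely linear modules the two coincide, but once a nonlinearity
# is applied inside each module they generally differ:
y_out_nl = sum(a * np.tanh(x @ W) for a, W in zip(alpha, modules))
y_w_nl = np.tanh(x @ W_mix)
```

Weight mixing also means only one forward pass through the mixed layer is needed, whereas output mixing evaluates all K candidates at every layer.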

