COMPOSITIONAL MODELS: MULTI-TASK LEARNING AND KNOWLEDGE TRANSFER WITH MODULAR NETWORKS

Abstract

Conditional computation and modular networks have recently been proposed for multi-task learning and other problems as a way to decompose problem solving into multiple reusable computational blocks. We propose a novel fully differentiable approach for learning modular networks. In our method, modules can be invoked repeatedly and allow knowledge transfer to novel tasks by adjusting the order of computation. This enables soft weight sharing between tasks with only a small increase in the number of parameters. We show that our method leads to interpretable self-organization of modules in multi-task learning, transfer learning, and domain adaptation, while achieving competitive results on those tasks. From a practical perspective, our approach allows us to: (a) reuse existing modules to learn a new task by adjusting the computation order; (b) perform unsupervised multi-source domain adaptation, illustrating that adaptation to unseen data can be achieved by manipulating only the order of pretrained modules; and (c) increase the accuracy of existing architectures on image classification tasks such as IMAGENET, without any parameter increase, by reusing the same block multiple times.

1. INTRODUCTION

Most modern interpretations of neural networks treat layers as transformations that convert lower-level features, such as pixels, into higher-level abstractions. Adopting this view, many designs that break up monolithic models into interacting modular components localize them to particular processing stages, thus disallowing parameter sharing across layers. This restriction, however, can be limiting when we need to continually grow the model, be it for transferring knowledge to a new task or for preventing catastrophic forgetting. It also prohibits module reuse within a model, even though certain parameter-efficient architectures and complex decision-making processes may involve recurrent components (Kaiser & Sutskever, 2016; Randazzo et al., 2020) and feedback loops (Herzog et al., 2020; Yan et al., 2019; Kar et al., 2019). Here we propose a simple alternative design that represents an entire model as a mixture of modules, each of which can contribute to computation at any processing stage. As opposed to many other approaches (Zoph & Le (2016); Bengio (2016); Kirsch et al. (2018a); Rosenbaum et al. (2019), etc.), we use a soft mixture of modules to obtain a more flexible model and a differentiable optimization objective that can be optimized end-to-end and does not involve high-variance estimators. In our approach, the parameters of every block (or layer) of the network are computed as a linear combination of a set of "template" block parameters, thus representing the entire model as: (a) a databank of template blocks and (b) vectors of "mixture weights" that are used to generate the weights of every layer. This simple design can be utilized for a variety of applications, from producing compact networks and training multi-task models capable of reusing individual modules to knowledge transfer and domain adaptation.
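To make the weight-generation scheme concrete, the following is a minimal NumPy sketch of computing per-layer parameters as a soft mixture over a shared template bank. The class and variable names (`TemplateBank`, `mix`, `layer_logits`) are our own illustrative choices, not identifiers from the paper, and the sketch omits the end-to-end training loop:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class TemplateBank:
    """A databank of K template weight tensors shared by all layers."""
    def __init__(self, num_templates, shape, seed=0):
        rng = np.random.default_rng(seed)
        self.templates = rng.standard_normal((num_templates, *shape))

    def mix(self, logits):
        # Each layer's weights are a convex combination of the templates,
        # parameterized by that layer's mixture logits.
        coeffs = softmax(logits)                             # shape (K,)
        return np.tensordot(coeffs, self.templates, axes=1)  # shape `shape`

# Hypothetical usage: three layers of an 8x8 linear model all draw from
# one bank of 4 templates. Total parameters: 4 template tensors plus one
# length-4 logit vector per layer (a small per-layer overhead).
bank = TemplateBank(num_templates=4, shape=(8, 8))
layer_logits = [np.zeros(4), np.array([2.0, 0.0, 0.0, 0.0]), np.zeros(4)]
layer_weights = [bank.mix(logits) for logits in layer_logits]
```

Because the mixture coefficients enter the generated weights through a plain linear combination, gradients flow to both the templates and the per-layer logits, which is what permits end-to-end training without high-variance estimators.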
The experimental results show that (a) when used for multi-task training, our model organizes its modules so that tasks share the first few layers while specializing closer to the head, whereas (b) in domain adaptation problems, modules instead specialize in processing the input image while sharing later processing stages. Moreover, our self-organizing model achieves promising results in multi-task learning and model personalization. The rest of the paper is organized as follows: in Section 2 we go over the related literature and discuss the existing approaches to modular