MULTI-TASK LEARNING BY A TOP-DOWN CONTROL NETWORK

Anonymous

Abstract

As the range of tasks performed by a general vision system expands, executing multiple tasks accurately and efficiently in a single network has become an important and still open problem. Recent computer vision approaches address this problem by branching networks, or by a channel-wise modulation of the network's feature-maps with task-specific vectors. We present a novel architecture that uses a dedicated top-down control network to modify the activation of all the units in the main recognition network in a manner that depends on the selected task, the image content, and the spatial location. We show the effectiveness of our scheme by achieving significantly better results than alternative state-of-the-art approaches on four datasets. We further demonstrate its advantages in terms of task selectivity, scaling with the number of tasks, and interpretability.

1. INTRODUCTION

The goal of multi-task learning is to improve the learning efficiency and increase the prediction accuracy of multiple tasks learned and performed in a shared network. In recent years, several types of architectures have been proposed to combine the training and evaluation of multiple tasks. Most current schemes assume task-specific branches on top of a shared backbone (Figure 1a) and use a weighted sum of task losses for training (Chen et al., 2017; Sener & Koltun, 2018). Having a shared representation is more efficient in terms of memory and sample complexity (Zhao et al., 2018), but the performance of such schemes depends strongly on the relative loss weights, which cannot easily be determined without a "trial and error" search phase (Kendall et al., 2018). Another type of architecture (Zhao et al., 2018; Strezoski et al., 2019) uses task-specific vectors to modulate the feature-maps along a feed-forward network in a channel-wise manner (Figure 1b). Channel-wise modulation has been shown to decrease the destructive interference between conflicting gradients of different tasks (Zhao et al., 2018) and allowed Strezoski et al. (2019) to scale the number of tasks without changing the network. Here, both training and evaluation use the single-tasking paradigm: executing one task at a time, rather than producing responses to all the tasks in a single forward pass. Executing one task at a time is also possible by integrating task-specific modules along the network (Maninis et al., 2019). A limitation of using task-specific modules (Maninis et al., 2019) or a fixed number of branches (Strezoski et al., 2019) is that it may become difficult to add tasks later during the system's lifetime. We propose a new type of architecture with no branching, which performs a single task at a time with no task-specific modules. Our model is trained to perform a set of T tasks, {t_1, ..., t_T}, one task at a time.
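As an illustration (not part of the original method), the channel-wise modulation scheme of Figure 1b can be sketched as follows; the function name, shapes, and random values are all hypothetical, and a real implementation would apply a learned per-layer vector inside a convolutional network:

```python
import numpy as np

rng = np.random.default_rng(0)

def channelwise_modulation(feature_map, task_vector):
    """Scale each channel of a feature map by one entry of a learned
    task vector (Figure 1b-style modulation).

    feature_map: (C, H, W) activations of one layer.
    task_vector: (C,) per-task, per-layer scaling vector.
    """
    # Broadcast the C-dim vector over the spatial dimensions:
    # every unit in channel c is multiplied by the same scalar,
    # so the modulation is task-dependent but location-independent.
    return feature_map * task_vector[:, None, None]

# Hypothetical toy shapes for illustration.
fmap = rng.standard_normal((8, 4, 4))
task_vec = rng.standard_normal(8)
out = channelwise_modulation(fmap, task_vec)
```

Note that every spatial position within a channel receives the same scaling, which is the limitation the element-wise scheme below is designed to remove.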
The model receives two inputs: the input image, and a learned vector that specifies the selected task t_k to perform. It is constructed from two main parts (Figure 1c): a main recognition network that is common to all tasks, termed below BU2 (BU for bottom-up), and a control network that modifies the feature-maps along BU2 in a manner that computes a close approximation to the selected task t_k. As detailed below, the control network itself is built from two components (Figure 1d): a top-down (TD) network that receives as inputs both a task vector and image information from a bottom-up stream termed BU1. As a result, the TD stream combines task information with image information to control the individual units of the feature-maps along BU2. The modification of unit activity in BU2 therefore depends on the task to perform, the spatial location, and the image content extracted by BU1. As shown later, the task control by our approach becomes highly efficient, in the sense that the recognition network becomes tuned with high specificity to the selected task t_k.

Our contributions are as follows: a. Our new architecture is the first to modulate a multi-task network as a function of the task, location (spatial-aware) and image content (image-aware). All this is achieved by a top-down stream propagating task, image and location information to lower levels of the bottom-up network. b. Our scheme provides scalability with the number of tasks (no additional modules or branches per task) and interpretability (localization of relevant objects at the end of the top-down stream). d. We introduce a new measure of task specificity, crucial for multi-tasking, and show the high task-selectivity of our scheme compared with alternatives.
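The control pathway described above can be sketched in a minimal NumPy form. This is not the paper's implementation: the real BU1, TD and BU2 are full convolutional networks, whereas here each is reduced to a single hypothetical linear step, with all weights, shapes and names invented for illustration. The point of the sketch is that the resulting gate varies per task, per channel and per spatial location:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical toy dimensions: C channels, HxW spatial map, T tasks.
C, H, W, T = 8, 4, 4, 10

def td_control(bu1_feat, task_id, W_task, W_td):
    """Sketch of the control network (Figure 1d): BU1 image features
    are combined with a learned task embedding, and a TD step maps
    the result to an element-wise gate over a BU2 feature map."""
    task_emb = W_task[task_id]                      # (C,) learned task vector
    # Combine image content with the task signal (here: channel-wise add).
    combined = bu1_feat + task_emb[:, None, None]   # (C, H, W)
    # TD stream, reduced to a 1x1-convolution-like projection that
    # produces one non-negative gate per unit of the BU2 feature map.
    return relu(np.einsum('dc,chw->dhw', W_td, combined))

def bu2_layer(bu2_feat, gates):
    # Unlike channel-wise modulation, the gate differs per task,
    # per channel AND per spatial location, and depends on the image.
    return bu2_feat * gates

W_task = rng.standard_normal((T, C)) * 0.5   # task embedding table
W_td = rng.standard_normal((C, C)) * 0.1     # TD projection weights
bu1_feat = rng.standard_normal((C, H, W))    # stand-in for BU1 output
bu2_feat = rng.standard_normal((C, H, W))    # stand-in for a BU2 layer

gates = td_control(bu1_feat, task_id=3, W_task=W_task, W_td=W_td)
out = bu2_layer(bu2_feat, gates)
```

Because the gate is a function of both the task embedding and the BU1 features, changing either the task index or the input image changes the modulation applied to BU2, which is the task-, image- and location-dependence claimed above.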

2. RELATED WORK

Our work draws ideas from the following research lines:

Multi-Task Learning (MTL). Multi-task learning has been used in machine learning well before the revival of deep networks (Caruana, 1997). The success of deep neural networks in single tasks (e.g., classification, detection and segmentation) has revived the interest of the computer vision community in the subject (Kokkinos, 2017; He et al., 2017; Redmon & Farhadi, 2017). Although our primary application area is computer vision, multi-task learning also has many applications in other fields, such as natural language processing (Hashimoto et al., 2016; Collobert & Weston, 2008), and even across modalities (Bilen & Vedaldi, 2016). Over the years, several types of architectures have been proposed in computer vision to combine the training and evaluation of multiple tasks. Early works used several duplications of the base network (as many as there are tasks), with connections between them to pass useful information between the tasks (Misra et al., 2016; Rusu et al., 2016). These works do not share computations and cannot scale with the number of tasks. More recent architectures, now in common practice, assume task-specific branches on top of a shared backbone and use a weighted sum of losses to train them. The joint learning of several tasks has proven beneficial in several cases (He et al., 2017), but can also decrease the accuracy of some of the tasks due to limited network capacity, uncorrelated gradients from the different tasks, and different rates of learning (Kirillov et al., 2019). A naive implementation of multi-task learning requires careful calibration of the relative losses of the different tasks. To address this problem, several methods have been proposed: 'Grad norm' (Chen



Figure 1: Schemes and Datasets: (a) Multi-branched architecture: task-specific branches on top of a shared backbone. (b) Channel-wise modulation architecture: uses task vectors to modulate the feature-maps along the main network. (c) Our architecture uses a top-down (TD) control network and modifies the feature-maps along the recognition net (BU2) element-wise, according to the image and to the current task. (d) The internal structure of the control network: image information, extracted by BU1, is combined with task information and accumulated by the TD stream, to control the units along BU2. (e) Example images with their corresponding tasks. Upper part: M-MNIST, where the task is to recognize all the digits; CELEB-A, where example tasks are the classification of a smile, sunglasses or earrings. Lower part, left: CLEVR, where an example task is to recognize the material of the cylinder to the right of the blue cube; right: CUB-200, where an example task is to identify the color of the bird's neck.

c. We show significantly better results than other state-of-the-art methods on four datasets: Multi-MNIST (Sener & Koltun, 2018), CLEVR (Johnson et al., 2017), CELEB-A (Liu et al., 2015) and CUB-200 (Welinder et al., 2010). Advantages are shown in both accuracy and effective learning.

