MULTI-TASK LEARNING BY A TOP-DOWN CONTROL NETWORK

Anonymous

Abstract

As the range of tasks performed by a general vision system expands, executing multiple tasks accurately and efficiently in a single network has become an important and still open problem. Recent computer vision approaches address this problem with branched networks, or with channel-wise modulation of the network feature maps by task-specific vectors. We present a novel architecture that uses a dedicated top-down control network to modify the activation of all the units in the main recognition network in a manner that depends on the selected task, the image content, and the spatial location. We show the effectiveness of our scheme by achieving significantly better results than alternative state-of-the-art approaches on four datasets. We further demonstrate our advantages in terms of task selectivity, scalability in the number of tasks, and interpretability.

1. INTRODUCTION

The goal of multi-task learning is to improve the learning efficiency and increase the prediction accuracy of multiple tasks learned and performed in a shared network. In recent years, several types of architectures have been proposed to combine the training and evaluation of multiple tasks. Most current schemes assume task-specific branches on top of a shared backbone (Figure 1a) and use a weighted sum of task losses for training (Chen et al., 2017; Sener & Koltun, 2018). Having a shared representation is more efficient in terms of memory and sample complexity (Zhao et al., 2018), but the performance of such schemes depends strongly on the relative loss weights, which cannot be easily determined without a "trial and error" search phase (Kendall et al., 2018). Another type of architecture (Zhao et al., 2018; Strezoski et al., 2019) uses task-specific vectors to modulate the feature maps along a feed-forward network in a channel-wise manner (Figure 1b). Channel-wise modulation architectures have been shown to decrease the destructive interference between conflicting gradients of different tasks (Zhao et al., 2018) and allowed Strezoski et al. (2019) to scale the number of tasks without changing the network. Here, both training and evaluation use the single-tasking paradigm: executing one task at a time, rather than producing responses to all the tasks in a single forward pass. Executing one task at a time is also possible by integrating task-specific modules along the network (Maninis et al., 2019). A limitation of using task-specific modules (Maninis et al., 2019), or of using a fixed number of branches (Strezoski et al., 2019), is that adding tasks at a later time during the system's lifetime may become difficult. We propose a new type of architecture with no branching, which performs a single task at a time with no task-specific modules. Our model is trained to perform a set of tasks ({t_i}_{i=1}^T), one task at a time.
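The channel-wise modulation scheme described above can be illustrated with a minimal PyTorch sketch. This is not the paper's model: the layer sizes, the use of an embedding table for the task vectors, and the sigmoid gating are illustrative assumptions; the cited works differ in how the modulation vectors are produced and applied.

```python
# Minimal sketch of channel-wise task modulation (in the spirit of Figure 1b).
# All sizes and the sigmoid gate are assumptions for illustration only.
import torch
import torch.nn as nn

class TaskModulatedConv(nn.Module):
    """A conv layer whose output channels are scaled by a task-specific vector."""
    def __init__(self, in_ch, out_ch, num_tasks):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # One learned modulation vector per task, one scalar per output channel.
        self.task_gates = nn.Embedding(num_tasks, out_ch)

    def forward(self, x, task_id):
        feat = self.conv(x)                             # (B, out_ch, H, W)
        gate = torch.sigmoid(self.task_gates(task_id))  # (B, out_ch)
        # Broadcast the per-channel gate over the spatial dimensions:
        # the same scalar scales every spatial location of a channel.
        return feat * gate.unsqueeze(-1).unsqueeze(-1)

layer = TaskModulatedConv(in_ch=3, out_ch=8, num_tasks=4)
x = torch.randn(2, 3, 16, 16)
task = torch.tensor([1, 1])   # select task 1 for both images in the batch
out = layer(x, task)
print(out.shape)  # torch.Size([2, 8, 16, 16])
```

Note that the gate depends only on the task index, not on the image content or the spatial position; the architecture proposed in this paper is motivated precisely by lifting these two restrictions.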
The model receives two inputs: the input image, and a learned vector that specifies the selected task t_k to perform. It is constructed from two main parts (Figure 1c): a main recognition network that is common to all tasks, termed below BU2 (BU for bottom-up), and a control network that modifies the feature maps along BU2 so that the network computes a close approximation to the selected task t_k. As detailed below, the control network itself is built from two components (Figure 1d): a top-down (TD) network that receives as inputs both a task vector and image information from a bottom-up stream termed BU1. As a result, the TD stream combines task information with image information to control the individual units of the feature maps along BU2. The modification of units

