AUXILIARY LEARNING BY IMPLICIT DIFFERENTIATION

Abstract

Training neural networks with auxiliary tasks is a common practice for improving the performance on a main task of interest. Two main challenges arise in this multi-task learning setting: (i) designing useful auxiliary tasks; and (ii) combining auxiliary tasks into a single coherent loss. Here, we propose a novel framework, AuxiLearn, that targets both challenges based on implicit differentiation. First, when useful auxiliaries are known, we propose learning a network that combines all losses into a single coherent objective function. This network can learn nonlinear interactions between tasks. Second, when no useful auxiliary task is known, we describe how to learn a network that generates a meaningful, novel auxiliary task. We evaluate AuxiLearn in a series of tasks and domains, including image segmentation and learning with attributes in the low-data regime, and find that it consistently outperforms competing methods.

1. INTRODUCTION

The performance of deep neural networks can improve significantly when the main task of interest is trained together with additional auxiliary tasks (Goyal et al., 2019; Jaderberg et al., 2016; Mirowski, 2019). For example, learning to segment an image into objects can be more accurate when the model is simultaneously trained to predict other properties of the image, like pixel depth or 3D structure (Standley et al., 2019). In the low-data regime, models trained on the main task alone are prone to overfitting and generalize poorly to unseen data (Vinyals et al., 2016); in this case, the benefits of learning with multiple tasks are amplified (Zhang and Yang, 2017). Training with auxiliary tasks adds an inductive bias that pushes learned models to capture meaningful representations and avoid overfitting to spurious correlations.

In some domains, it is easy to design beneficial auxiliary tasks and to collect supervised data for them. For example, numerous tasks have been proposed for self-supervised learning in image classification, including masking (Doersch et al., 2015), rotation (Gidaris et al., 2018), and patch shuffling (Doersch and Zisserman, 2017; Noroozi and Favaro, 2016). Even then, it is not clear how best to combine all auxiliary tasks into a single loss (Doersch and Zisserman, 2017). The common practice is to compute a weighted combination of pretext losses, tuning the weight of each individual loss with a hyperparameter grid search. This approach, however, limits the potential of learning with auxiliary tasks, because the run time of grid search grows exponentially with the number of tasks. In other domains, obtaining good auxiliaries in the first place may be challenging or may require expert knowledge. For example, for point cloud classification, few self-supervised tasks have been proposed, and their benefits so far are limited (Achituve et al., 2020; Hassani and Haley, 2019; Sauder and Sievers, 2019; Tang et al., 2020). For such cases, it would be beneficial to automate the generation of auxiliary tasks without domain expertise.

Our work takes a step forward in automating the use and design of auxiliary learning tasks. We name our approach AuxiLearn. AuxiLearn leverages recent progress in implicit differentiation for optimizing hyperparameters (Liao et al., 2018; Lorraine et al., 2020). We demonstrate the effectiveness of AuxiLearn in two types of problems. First, combining auxiliaries: when useful auxiliary tasks are predefined, we describe how to train a deep neural network (NN) on top of the auxiliary losses that combines them non-linearly into a unified loss. For instance, we combine per-pixel losses in image segmentation tasks using a convolutional NN (CNN). Second, designing auxiliaries: when no predefined auxiliary task is available, we present an approach for learning auxiliary tasks from the input data alone, without domain knowledge. This is achieved by training an auxiliary network to generate auxiliary labels while training another, primary network to learn both the original task and the generated auxiliary task. An important distinction from previous works, such as Kendall et al. (2018) and Liu et al. (2019a), is that we do not optimize the auxiliary parameters on the training loss, but on a separate (small) auxiliary set allocated from the training data. This is a key difference, since the goal of auxiliary learning is to improve generalization rather than to help optimization on the training data.
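To make the first mode concrete, below is a minimal PyTorch sketch of a network that maps per-task losses to a single scalar objective. All names (e.g., `LossCombiner`), the architecture, and the dimensions are our illustrative assumptions, not the paper's actual design; the hyperparameter-style update of the combiner on the auxiliary set is omitted.

```python
import torch
import torch.nn as nn

class LossCombiner(nn.Module):
    """Sketch: a small MLP mapping a vector of per-task losses to one
    scalar training objective. Its parameters act like hyperparameters
    and would be tuned on a held-out auxiliary set via implicit
    differentiation (tuning loop not shown)."""
    def __init__(self, num_tasks: int, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_tasks, hidden),
            nn.Softplus(),  # smooth nonlinearity keeps the combined loss differentiable
            nn.Linear(hidden, 1),
        )

    def forward(self, losses: torch.Tensor) -> torch.Tensor:
        # losses: shape (num_tasks,), e.g. [main, aux_1, aux_2]
        return self.net(losses).squeeze()

# Toy usage with stand-in scalar losses.
main_loss = torch.rand((), requires_grad=True)   # would come from the main task
aux_losses = torch.rand(2, requires_grad=True)   # would come from auxiliary tasks
combiner = LossCombiner(num_tasks=3)
total = combiner(torch.cat([main_loss.view(1), aux_losses]))
total.backward()  # gradients flow back through the losses to the primary network
```

For image segmentation, the same idea would apply a CNN over the spatial map of per-pixel losses instead of an MLP over scalar losses.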
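For the second mode, the following rough sketch shows an auxiliary "labeler" network generating soft auxiliary targets from the input, with the primary network trained on both the main task and agreement with those targets. Again, all module names, dimensions, and the choice of soft targets are assumptions for illustration; the labeler's own update on the auxiliary set via implicit differentiation is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, num_classes, num_aux_classes = 32, 10, 5

primary = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())  # shared trunk
main_head = nn.Linear(64, num_classes)
aux_head = nn.Linear(64, num_aux_classes)
aux_labeler = nn.Linear(feat_dim, num_aux_classes)  # generates auxiliary labels

x = torch.randn(8, feat_dim)
y = torch.randint(0, num_classes, (8,))

h = primary(x)
main_loss = F.cross_entropy(main_head(h), y)

# Soft auxiliary targets produced from the input alone; detached so they
# act as fixed labels during the primary network's update.
aux_target = F.softmax(aux_labeler(x), dim=-1).detach()
aux_logits = aux_head(h)
aux_loss = -(aux_target * F.log_softmax(aux_logits, dim=-1)).sum(dim=-1).mean()

train_loss = main_loss + aux_loss  # updates the primary network only;
train_loss.backward()              # aux_labeler is updated on the auxiliary set
```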
To validate our proposed solution, we extensively evaluate AuxiLearn on several tasks in the low-data regime, where models suffer from severe overfitting and auxiliary learning can provide the largest benefits. Our results demonstrate that AuxiLearn produces better loss functions and auxiliary tasks, as measured by the performance of the resulting model on the main task. We complement the experimental section with two theoretical insights regarding our model: the first shows that even a relatively simple auxiliary hypothesis class may overfit; the second aims to characterize which auxiliaries benefit the main task.

To summarize, we propose a novel, general approach for learning with auxiliaries using implicit differentiation, and make the following contributions: (a) we describe a unified approach for combining multiple loss terms and for learning novel auxiliary tasks from the data alone; (b) we provide a theoretical observation on the capacity of auxiliary learning; (c) we show that the key quantity for determining beneficial auxiliaries is the Newton update; (d) we provide new results on a variety of auxiliary learning tasks, with a focus on the low-data regime. We conclude that implicit differentiation can play a significant role in automating the design of auxiliary learning setups.
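For concreteness, the bilevel problem underlying this approach, and the hypergradient obtained via the implicit function theorem (following Lorraine et al., 2020), can be sketched as follows. The notation is ours: phi denotes the auxiliary parameters (the combining network or label generator) and W the primary network's weights; practical approximations of the inverse Hessian follow the cited work and are not spelled out here.

```latex
% Bilevel formulation: auxiliary parameters \phi are chosen to minimize the
% main-task loss on a held-out auxiliary set, evaluated at weights W^*(\phi)
% trained on the \phi-shaped training loss.
\phi^{*} = \arg\min_{\phi}\, \mathcal{L}_{\mathrm{aux}}\!\big(W^{*}(\phi)\big)
\quad \text{s.t.} \quad
W^{*}(\phi) = \arg\min_{W}\, \mathcal{L}_{\mathrm{train}}(W, \phi)

% Implicit function theorem: hypergradient with respect to \phi. The inverse
% Hessian of the training loss is where a Newton-style update enters the
% analysis of which auxiliaries are beneficial.
\nabla_{\phi}\,\mathcal{L}_{\mathrm{aux}}
= -\,\frac{\partial \mathcal{L}_{\mathrm{aux}}}{\partial W}
\left(\frac{\partial^{2}\mathcal{L}_{\mathrm{train}}}{\partial W\,\partial W^{\top}}\right)^{-1}
\frac{\partial^{2}\mathcal{L}_{\mathrm{train}}}{\partial W\,\partial \phi^{\top}}
```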

2. RELATED WORK

Learning with multiple tasks. Multitask learning (MTL) aims at simultaneously solving multiple learning problems while sharing information across tasks. In some cases, MTL benefits the optimization process and improves task-specific generalization compared to single-task learning (Standley et al., 2019). In contrast to MTL, auxiliary learning aims at solving a single main task, and the purpose of all other tasks is to facilitate learning of that primary task; at test time, only the main task is considered. This approach has been successfully applied in multiple domains, including computer vision (Zhang et al., 2014), natural language processing (Fan et al., 2017; Trinh et al., 2018), and reinforcement learning (Jaderberg et al., 2016; Lin et al., 2019).

Dynamic task weighting. When learning a set of tasks, the task-specific losses must be combined into an overall loss. How the individual losses are combined is crucial, because MTL-based models are sensitive to the relative weighting of the tasks (Kendall et al., 2018). A common approach is to combine task losses linearly; when the number of tasks is small, the task weights are typically tuned with a simple grid search. However, this approach does not scale to a large number of tasks or to more complex weighting schemes. Several recent studies proposed scaling task weights using gradient magnitudes (Chen et al., 2018), task uncertainty (Kendall et al., 2018), or the rate of loss change (Liu et al., 2019b). Sener and Koltun (2018) proposed casting the multitask learning problem as a multi-objective optimization. These methods assume that all tasks are equally important and are therefore less suited for auxiliary learning. Du et al. (2018) and Lin et al. (2019) proposed weighting auxiliary losses by their gradient similarity with the main task; however, these methods do not scale well with the number of auxiliaries and do not take interactions between auxiliaries into account. In contrast, we propose to learn from data how to combine auxiliaries, possibly in a non-linear manner.

