AUXILIARY LEARNING BY IMPLICIT DIFFERENTIATION

Abstract

Training neural networks with auxiliary tasks is a common practice for improving performance on a main task of interest. Two main challenges arise in this multi-task learning setting: (i) designing useful auxiliary tasks; and (ii) combining auxiliary tasks into a single coherent loss. Here, we propose a novel framework, AuxiLearn, that targets both challenges based on implicit differentiation. First, when useful auxiliaries are known, we propose learning a network that combines all losses into a single coherent objective function. This network can learn nonlinear interactions between tasks. Second, when no useful auxiliary task is known, we describe how to learn a network that generates a meaningful, novel auxiliary task. We evaluate AuxiLearn on a series of tasks and domains, including image segmentation and learning with attributes in the low data regime, and find that it consistently outperforms competing methods.

1. INTRODUCTION

The performance of deep neural networks can improve significantly when the main task of interest is trained together with additional auxiliary tasks (Goyal et al., 2019; Jaderberg et al., 2016; Mirowski, 2019). For example, learning to segment an image into objects can be more accurate when the model is simultaneously trained to predict other properties of the image, such as pixel depth or 3D structure (Standley et al., 2019). In the low data regime, models trained on the main task only are prone to overfit and generalize poorly to unseen data (Vinyals et al., 2016). In this case, the benefits of learning with multiple tasks are amplified (Zhang and Yang, 2017). Training with auxiliary tasks adds an inductive bias that pushes learned models to capture meaningful representations and avoid overfitting to spurious correlations.

In some domains, it may be easy to design beneficial auxiliary tasks and collect supervised data. For example, numerous tasks were proposed for self-supervised learning in image classification, including masking (Doersch et al., 2015), rotation (Gidaris et al., 2018) and patch shuffling (Doersch and Zisserman, 2017; Noroozi and Favaro, 2016). In these cases, however, it is not clear how best to combine all auxiliary tasks into a single loss (Doersch and Zisserman, 2017). The common practice is to compute a weighted combination of pretext losses, tuning the weight of each individual loss by hyperparameter grid search. This approach, however, limits the potential of learning with auxiliary tasks, because the run time of grid search grows exponentially with the number of tasks.

In other domains, obtaining good auxiliaries in the first place may be challenging or may require expert knowledge. For example, for point cloud classification, few self-supervised tasks have been proposed; however, their benefits so far are limited (Achituve et al., 2020; Hassani and Haley, 2019;

