MULTI-TASK LEARNING OF DIFFERENT CLASS LABEL REPRESENTATIONS FOR STRONGER MODELS

Anonymous

Abstract

We find that the way in which class labels are represented can have a powerful effect on how well models trained on them learn. In classification, the standard way of representing class labels is as one-hot vectors. We present a new way of representing class labels, called Binary Labels, where each class label is a large binary vector. We further introduce a new paradigm: multi-task learning on different label representations. We train a network on two tasks: the main task is to classify images based on their one-hot label, and the auxiliary task is to classify images based on their Binary Label. We show that networks trained on both tasks have many advantages, including higher accuracy across a wide variety of datasets and architectures, both when trained from scratch and when using transfer learning. Networks trained on both tasks are also much more effective when training data is limited, and seem to do especially well on more challenging problems.

1. INTRODUCTION

All supervised learning problems involve three components: data, a model, and labels. A tremendous amount of work has been done on the first two, models and data, but labels have been mostly ignored. Deep learning architectures get ever more complicated (Villalobos et al., 2022), and many excellent techniques exist to learn strong representations of the data that is fed into the model (He et al., 2022a; Chen et al., 2020), but the labels themselves remain as simple as they were 20 years ago. Consider a common deep learning task, image classification on ImageNet (Deng et al., 2009). The leaderboard is full of algorithms that employ powerful unsupervised pretraining methods to learn representations of the data and make use of massive model architectures with hundreds of layers and millions of weights. But if one surveys the top state-of-the-art models, the target labels are all simple one-hot encoded vectors (Yu et al., 2022; Wortsman et al., 2022; Chen et al., 2022). Recently, various alternative label representations have been proposed, where labels are represented as dense vectors (Chen et al., 2021). While on some metrics, such as robustness and data efficiency, these labels were able to outperform standard label representations, on other key metrics, such as accuracy, they performed worse than one-hot labels. Additionally, these dense labels were slower at inference time, for two reasons. First, because they added more weights to the networks that used them, the forward pass takes slightly longer. Second, and more significantly, classification with dense labels is slower by definition, since it involves comparing the network output with each label to find the class with the nearest label, whereas with one-hot labels and a softmax output the classification is done directly. Furthermore, these dense labels were much slower to converge, often taking four times as many epochs.
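The inference-time difference described above can be made concrete with a small sketch: a softmax head classifies with a single argmax over its logits, while a dense-label head must compare its output vector against every class's label vector to find the nearest one. The dimensions and the random dense codes below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 1000, 512

# One-hot / softmax head: the class is simply the argmax over the logits;
# no comparison against stored label vectors is needed.
logits = rng.normal(size=num_classes)
softmax_pred = int(np.argmax(logits))

# Dense-label head: the network emits a vector, and classification requires
# scanning all class label vectors for the nearest one -- an extra
# O(num_classes * dim) step at inference time.
dense_labels = rng.normal(size=(num_classes, dim))  # hypothetical dense codes
output = rng.normal(size=dim)
dists = np.linalg.norm(dense_labels - output, axis=1)
dense_pred = int(np.argmin(dists))
```

The extra nearest-label scan is the second source of slowdown the text mentions; the first (more weights in the output head) is independent of this lookup.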
As such, choosing between these label representations is essentially a trade-off between opposing factors. In this paper we explore a key idea: given various ways of representing labels, why choose only one? Each way of representing labels can become a different task for the network to learn. By representing the same class labels in different ways, we reuse the supervision to create multiple learning tasks. We thus propose using recent work in the field of multi-task learning to train networks to recognize multiple representations of the same labels, and we find that doing so mitigates the trade-off above. From an intuitive perspective, it makes sense that learning multiple representations of the same concept can be useful. To give an analogy from data structures: while linked lists and arrays represent the same underlying data, each data structure has inherent advantages and disadvantages, and some algorithms rely on redundant storage of both types to get the advantages of both. While explicit redundant label representations may be novel to study, they commonly exist implicitly: research on the human brain indicates we make heavy use of redundant representations (Pieszek et al., 2013), and artificial neural networks already do this implicitly (Doimo et al., 2021). We present a new label representation type, Binary Labels, where each class label is represented as a binary vector. This representation has key properties that led us to believe it would create a useful auxiliary task to improve accuracy, and our experiments verify this. We make the following main contributions:

1. We present the novel paradigm of using several label representations of a single label to augment network supervision and create auxiliary tasks for the network to learn.

2. We present a new label representation type, Binary Labels.

3. We present results that demonstrate the strength of our approach.

We hope this will inspire further research on this topic.
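As a minimal sketch of the Binary Labels idea, each class can be assigned a fixed binary code vector to serve as its auxiliary target. This excerpt does not specify how the vectors are generated, so drawing each bit uniformly at random with a fixed seed is our assumption here; the function name `make_binary_labels`, the code length, and the class count are all illustrative.

```python
import numpy as np

def make_binary_labels(num_classes, code_length, seed=0):
    """Assign each class a fixed binary code vector.

    Assumption: bits are drawn uniformly at random. A fixed seed makes the
    codes reproducible, so the same class always maps to the same vector.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(num_classes, code_length))

# Hypothetical setup: 10 classes, each labeled by a 128-bit binary vector.
codes = make_binary_labels(num_classes=10, code_length=128)
```

A long enough code length makes collisions between class codes vanishingly unlikely, so each class keeps a distinct auxiliary target.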

2. RELATED WORK

We unite work from two areas: label representation and multi-task learning.

2.1. LABEL REPRESENTATION

There are many scenarios where it makes sense for a network performing image classification to output a dense representation as opposed to a softmax vector. For example, in few-shot learning, a common approach, often termed embedding learning, is to learn a lower-dimensional representation of the input, such that similar images are near each other in the embedding space (Wang et al., 2020a). It is also common to use autoencoders to learn image representations, which usually involves some sort of dense intermediate representation that is used to reconstruct the original (Pinaya et al., 2020). While both of these methods involve networks that output a representation, they make no attempt to generate or make use of alternative label representations. Indeed, relatively little research has been conducted on the labels themselves, or more specifically their representation, aside from (Chen et al., 2021), where several alternative methods were proposed. They reported gains in robustness and data efficiency; however, convergence was much slower, and accuracy was slightly lower than when using standard softmax labels. Aside from that, the closest work is label smoothing (Szegedy et al., 2016; Sun et al., 2017), although as Chen et al. (2021) note, it is quite different.

2.2. MULTI-TASK LEARNING

Multi-task learning, first proposed in 1997 (Caruana, 1997), has gained traction as a powerful way of training a single network to learn multiple tasks at the same time (Crawshaw, 2020; Ruder, 2017; Zhang & Yang, 2021; Vandenhende et al., 2021). Often, doing so leads to gains on both tasks, and recent work has explored which sorts of tasks are best learned together (Standley et al., 2020). Auxiliary learning is a subtopic of multi-task learning that deals specifically with learning a main task, which we care about, alongside auxiliary tasks whose only purpose is to increase performance on the main task (Vafaeikia et al., 2020; Liebel & Körner, 2018). We make use of several standard tools from multi-task learning. We use a shared-trunk approach (Crawshaw, 2020), such that all tasks share a backbone network but have different output heads. When combining the losses of the different tasks, we explored several popular methods, including PCGrad (Yu et al., 2020), GradNorm (Chen et al., 2018), GradVac (Wang et al., 2020b), and MTAdam (Malkiel & Wolf, 2020), but ultimately settled on MetaBalance (He et al., 2022b), which is specifically aimed at auxiliary learning.* We also assign a weight to each task, such that the weights across all tasks sum to 1.
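The shared-trunk setup with fixed task weights can be sketched as follows. This is a toy NumPy illustration under stated assumptions, not the authors' implementation: the trunk is a single ReLU layer, all dimensions are made up, and MetaBalance's gradient balancing is not reproduced here, only the per-task weights that sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def trunk(x, W):                 # shared backbone: one ReLU layer stands in
    return np.maximum(x @ W, 0)  # for a full network shared by all tasks

def head(features, W):           # per-task output head on the shared features
    return features @ W

# Hypothetical dimensions: 32-d input, 64-d trunk features,
# 10 classes (one-hot main task), 128-bit codes (Binary Label auxiliary task).
W_trunk = rng.normal(size=(32, 64))
W_onehot = rng.normal(size=(64, 10))
W_binary = rng.normal(size=(64, 128))

x = rng.normal(size=(4, 32))     # a batch of 4 inputs
f = trunk(x, W_trunk)            # computed once, shared by both heads
onehot_out = head(f, W_onehot)
binary_out = head(f, W_binary)

# Fixed task weights summing to 1, as described in the text;
# the 0.8 / 0.2 split is an illustrative assumption.
weights = {"onehot": 0.8, "binary": 0.2}
assert abs(sum(weights.values()) - 1.0) < 1e-8
```

Because the trunk is computed once per batch, the auxiliary head adds only a small marginal cost on top of the main task's forward pass.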



* We believe we caught a small mistake in the MetaBalance paper and notified the authors. The version we use differs slightly in that it requires two forward passes to correctly compute the gradients.

