ALPHA NET: ADAPTATION WITH COMPOSITION IN CLASSIFIER SPACE

Abstract

Deep learning classification models typically perform poorly on classes with small numbers of examples. Motivated by the human ability to solve this task, models have been developed that transfer knowledge from classes with many examples to classes with few examples. Critically, the majority of these models transfer knowledge within model feature space. In this work, we demonstrate that transferring knowledge within classifier space is more effective and efficient. Specifically, by linearly combining strong nearest-neighbor classifiers with a weak classifier, we are able to compose a stronger classifier. Uniquely, our model can be implemented on top of any existing classification model that includes a classifier layer. We showcase the success of our approach on the task of long-tailed recognition, in which the classes with few examples, otherwise known as the "tail" classes, suffer the most in performance and are the most challenging classes to learn. Using classifier-level knowledge transfer, we drastically improve, by a margin as high as 10.5%, the state-of-the-art performance on the "tail" categories.

1. INTRODUCTION

The computer vision field has made rapid progress in the area of object recognition due to several factors: complex architectures, larger compute power, more data, and better learning strategies. However, the standard method of training recognition models on new classes still relies on large sets of examples. This dependence on large-scale data has made learning from few samples a natural challenge, and new tasks such as low-shot learning and long-tailed learning have recently become common within computer vision.

Many approaches to learning from small numbers of examples are inspired by human learning. In particular, humans are able to learn new concepts quickly and efficiently from only a few samples. The overarching theory is that humans transfer knowledge from previous experiences to bootstrap the new learning task (Lake et al., 2017; 2015; Gopnik & Sobel, 2000). Inherent in these remarkable capabilities are two related questions: what knowledge is being transferred, and how is this knowledge being transferred?

Within computer vision, recent low-shot learning and long-tailed recognition models answer these questions by treating visual "representations" as the knowledge structures being transferred. As such, these models transfer features learned from known classes with large data to the learning of new classes with little data (Liu et al., 2019; Yin et al., 2019). They exemplify the broader assumption that, in both human and computer vision, knowledge transfer occurs within model representation and feature space (Lake et al., 2015). In contrast, we claim that previously learned information is more concisely captured in classifier space. This inference is based on the fact that a sample's representation is unique to that sample, whereas classifiers are fitted to all the samples in a given class.
The success of working within classifier space to improve certain classifiers has been established in several papers (Elhoseiny et al., 2013; Qi et al., 2018), where models directly predict classifiers from features or create new models entirely by learning from other models. Other, non-deep-learning models use classifiers learned with abundant data to generate novel classifiers (Aytar & Zisserman, 2011; 2012). Despite these successes, the concept of learning within classifier space is not as common in deep learning models.

We suggest that transfer learning can, likewise, be implemented in the classifier space of deep learning models. Specifically, we combine known, strong classifiers (i.e., learned from large datasets) with weak classifiers (i.e., learned from small datasets) to improve the weak classifiers. Our classifier-space method is illustrated in Figure 1. In this toy example, we are given n classifiers C_i trained with large data and a weak classifier a, which was trained for a class with very few samples. Our goal is to combine the most relevant strong classifiers to adaptively adjust and improve a. Our method implements this approach in the simplest way possible: to select the most effective strong classifiers to combine, we take the target a's nearest-neighbor classifiers. Given these nearest-neighbor classifiers, the challenge becomes how to combine them with the target weak classifier so as to improve its performance given no further samples. We address this challenge with what we view as the most natural approach: creating a new classifier by linearly combining the nearest-neighbor classifiers and the original weak classifier. Embedded in this solution is the further challenge of choosing a strategy for computing the combination coefficients. We propose learning the coefficients of the linear combination through another neural network, which we refer to as "Alpha Net".
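The composition step can be written down directly in classifier space. The following is a minimal sketch, not the paper's exact implementation: the function names are illustrative, and cosine similarity between classifier weight vectors is an assumed choice of nearest-neighbor metric. Given fixed coefficients, the new classifier is simply a weighted sum of the weak classifier and its nearest strong neighbors:

```python
import numpy as np

def nearest_strong_classifiers(weak_w, strong_bank, k):
    """Select the k strong classifiers most similar to the weak one.

    Similarity here is cosine similarity between classifier weight
    vectors (an illustrative choice of metric in classifier space).
    """
    sims = strong_bank @ weak_w / (
        np.linalg.norm(strong_bank, axis=1) * np.linalg.norm(weak_w) + 1e-12
    )
    top = np.argsort(-sims)[:k]
    return strong_bank[top]

def compose_classifier(weak_w, strong_ws, alphas):
    """Linearly combine the weak classifier with its k strong neighbors.

    alphas[0] scales the original weak classifier; alphas[1:] scale the
    k nearest strong classifiers. In Alpha Net these coefficients are
    learned rather than fixed.
    """
    return alphas[0] * weak_w + alphas[1:] @ strong_ws

# Toy example: a 2-D classifier space with three strong classifiers.
weak = np.array([1.0, 0.0])
bank = np.array([[1.0, 0.1],
                 [0.0, 1.0],
                 [0.9, 0.0]])
neighbors = nearest_strong_classifiers(weak, bank, k=2)
new_w = compose_classifier(weak, neighbors, np.array([1.0, 0.5, 0.5]))
```

Note that keeping alphas[0] on the original weak classifier reflects the constraint discussed below: the composition is anchored to the initial classifier rather than being an arbitrary mixture of strong ones.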
As compared to many other approaches to learning from small numbers of examples, our methodology has three characteristics:

1. Our approach can be implemented on top of any architecture. Alpha Net does not need to re-learn representations; it operates only within classifier space to improve weak classifiers. As a result, our approach is agnostic to the type of architecture used to learn the classifiers; it merely provides a systematic method for combining them.

2. Our approach demonstrates the importance of combining not only the most relevant classifiers but also the original classifier. In the absence of the original classifier, any combination of classifiers becomes a possible solution, unconstrained by the initial classifier.

3. Our approach creates, for every target class, a completely different set of linear coefficients for the new classifier composition. In this manner, the coefficients are learned adaptively, which is extremely difficult to achieve through classical methods.

To illustrate the efficacy of our method, we apply it to the task of long-tailed recognition. "Long-tailed" refers to a realistic data distribution containing classes with many examples (head classes) and classes with few examples (tail classes). We compare our Alpha Net method to recent state-of-the-art models (Kang et al., 2020) on two long-tailed datasets: ImageNet-LT and Places-LT. Critically, we are able to improve accuracy on the tail classes by as much as 10.5%.
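The adaptive, per-class coefficients can be pictured as the output of a small network that maps the classifiers themselves to their combination weights. Below is a hypothetical, minimal sketch of such an "Alpha Net" forward pass: a one-hidden-layer MLP with assumed dimensions and random, untrained parameters, shown only to make the data flow concrete (the actual architecture and training objective are those of the paper, not this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 8, 3          # classifier dimension and number of strong neighbors
hidden = 16          # assumed hidden width for this sketch

# Randomly initialized MLP parameters; training is omitted here.
W1 = rng.normal(size=(hidden, (k + 1) * d)) * 0.1
b1 = np.zeros(hidden)
W2 = rng.normal(size=(k + 1, hidden)) * 0.1
b2 = np.zeros(k + 1)

def alpha_net(weak_w, strong_ws):
    """Predict combination coefficients from the classifiers themselves.

    Input: the weak classifier and its k strong neighbors, concatenated.
    Output: k + 1 coefficients and the composed classifier. Because the
    coefficients are a function of the input classifiers, every target
    class receives its own set of alphas.
    """
    x = np.concatenate([weak_w, strong_ws.ravel()])
    h = np.maximum(0.0, W1 @ x + b1)        # ReLU hidden layer
    alphas = W2 @ h + b2                    # one coefficient per classifier
    composed = alphas[0] * weak_w + alphas[1:] @ strong_ws
    return alphas, composed

weak_w = rng.normal(size=d)
strong_ws = rng.normal(size=(k, d))
alphas, composed = alpha_net(weak_w, strong_ws)
```

In practice such a network would be trained end-to-end so that the composed classifiers improve tail-class accuracy; the sketch only illustrates how per-class adaptivity arises from conditioning the coefficients on the classifiers.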

2. RELATED WORK

Creating, modifying, and learning model weights are concepts that appear in many earlier models, particularly in transfer learning, meta-learning, low-shot learning, and long-tailed learning.

Classifier Creation. The process of creating new classifiers is captured within meta-learning concepts such as learning-to-learn, transfer learning, and multi-task learning (Thrun & Pratt, 2012;



Figure 1: A classifier space depicting how Alpha Net adaptively adjusts weak classifiers through nearest neighbor compositions.

