LONG-TAILED LEARNING REQUIRES FEATURE LEARNING

Abstract

We propose a simple data model inspired by natural data such as text or images, and use it to study the importance of learning features in order to achieve good generalization. Our data model follows a long-tailed distribution in the sense that some rare subcategories have few representatives in the training set. In this context we provide evidence that a learner succeeds if and only if it identifies the correct features, and we derive non-asymptotic generalization error bounds that precisely quantify the penalty one must pay for not learning features.

1. INTRODUCTION

Part of the motivation for deploying a neural network arises from the belief that algorithms that learn features/representations generalize better than algorithms that do not. We try to give some mathematical ballast to this notion by studying a data model where, at an intuitive level, a learner succeeds if and only if it manages to learn the correct features. The data model itself attempts to capture two key structures observed in natural data such as text or images. First, it is endowed with a latent structure at the patch or word level that is directly tied to a classification task. Second, the data distribution has a long tail, in the sense that rare and uncommon instances collectively form a significant fraction of the data. We derive non-asymptotic generalization error bounds that quantify, within our framework, the penalty that one must pay for not learning features.

We first prove a two-part result that precisely quantifies the necessity of learning features within the context of our data model. The first part shows that a trivial nearest neighbor classifier performs perfectly when given knowledge of the correct features. The second part shows that it is impossible to craft, a priori, a feature map that generalizes well when using a nearest neighbor classification rule. In other words, success or failure depends only on the ability to identify the correct features, not on the underlying classification rule. Since this cannot be done a priori, the features must be learned. Our theoretical results therefore support the idea that algorithms cannot generalize on long-tailed data unless they learn features. Conversely, an algorithm that does learn features can generalize well: the most direct neural network architecture for our data model generalizes almost perfectly when using either a linear classifier or a nearest neighbor classifier on top of the learned features.
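The contrast between the two parts of the result can be illustrated with a toy experiment (this is a hypothetical simplification for intuition, not the paper's actual data model): labels are determined by a single latent feature coordinate, which is hidden among many uninformative noise coordinates. A 1-nearest-neighbor rule is perfect on the correct feature alone, but degrades badly on the raw, feature-free representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in (hypothetical): the first coordinate is the "correct feature"
# and fully determines the label; the remaining coordinates are pure noise.
n_train, n_test, n_noise = 200, 100, 50

def make_data(n):
    latent = rng.integers(0, 2, size=n)            # subcategory label: 0 or 1
    signal = latent * 2.0 - 1.0                    # correct feature: -1 or +1
    noise = rng.normal(scale=3.0, size=(n, n_noise))
    x = np.concatenate([signal[:, None], noise], axis=1)
    return x, latent

x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(n_test)

def nn_predict(feats_tr, labels_tr, feats_te):
    # 1-nearest-neighbor classification rule under squared Euclidean distance
    d = ((feats_te[:, None, :] - feats_tr[None, :, :]) ** 2).sum(axis=-1)
    return labels_tr[d.argmin(axis=1)]

# Nearest neighbor on the correct feature vs. on the raw representation
acc_correct = (nn_predict(x_tr[:, :1], y_tr, x_te[:, :1]) == y_te).mean()
acc_raw = (nn_predict(x_tr, y_tr, x_te) == y_te).mean()
```

With the correct feature, the two classes are perfectly separated and the nearest-neighbor rule makes no mistakes; on the raw inputs the distances are dominated by the noise coordinates, so accuracy collapses toward chance. The classification rule is identical in both cases; only the features differ.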
Crucially, designing the architecture requires knowing only the meta-structure of the problem, and no a priori knowledge of the correct features. This illustrates the built-in advantage of neural networks: their ability to learn features significantly eases the design burden placed on the practitioner.

Subcategories in commonly used visual recognition datasets tend to follow a long-tailed distribution (Salakhutdinov et al., 2011; Zhu et al., 2014; Feldman & Zhang, 2020). Some common subcategories have a wealth of representatives in the training set, whereas many rare subcategories have only a few. At an intuitive level, learning features seems especially important on a long-tailed dataset, since features learned from the common subcategories help to properly classify test points from a rare subcategory. Our theoretical results help support this intuition. We note that when considering complex visual recognition tasks, datasets are almost unavoidably long-tailed (Liu et al., 2019): even if a dataset contains millions of images, it is to be expected that many subcategories will have few samples. In this setting, the classical approach of deriving asymptotic performance guarantees in a large-sample limit is not a fruitful avenue. General-
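The long-tail phenomenon above can be made concrete with a small numerical sketch (a generic Zipf-style illustration, with hypothetical parameter choices, not statistics from any of the cited datasets): when subcategory frequencies decay like a power law, each rare subcategory has only a handful of samples, yet the rare subcategories together hold a non-negligible share of the data.

```python
import numpy as np

# Hypothetical Zipf-like subcategory distribution: frequency of the k-th
# most common subcategory is proportional to 1/k.
n_subcats = 1000
ranks = np.arange(1, n_subcats + 1)
freqs = 1.0 / ranks
freqs /= freqs.sum()

n_samples = 20_000
counts = np.floor(freqs * n_samples)   # expected training samples per subcategory

# Call a subcategory "rare" if it has fewer than 10 training examples.
rare = counts < 10
n_rare = int(rare.sum())
rare_fraction = counts[rare].sum() / counts.sum()
```

Under these (illustrative) numbers, the large majority of subcategories are rare, and they collectively account for well over a tenth of the training set, so a learner cannot simply ignore the tail.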

