LONG-TAILED LEARNING REQUIRES FEATURE LEARNING

Abstract

We propose a simple data model inspired by natural data such as text or images, and use it to study the importance of learning features in order to achieve good generalization. Our data model follows a long-tailed distribution in the sense that some rare subcategories have few representatives in the training set. In this context we provide evidence that a learner succeeds if and only if it identifies the correct features, and we moreover derive non-asymptotic generalization error bounds that precisely quantify the penalty one must pay for not learning features.

1. INTRODUCTION

Part of the motivation for deploying a neural network arises from the belief that algorithms that learn features or representations generalize better than algorithms that do not. We try to give some mathematical ballast to this notion by studying a data model in which, at an intuitive level, a learner succeeds if and only if it manages to learn the correct features. The data model itself attempts to capture two key structures observed in natural data such as text or images. First, it is endowed with a latent structure at the patch or word level that is directly tied to a classification task. Second, the data distribution has a long tail, in the sense that rare and uncommon instances collectively form a significant fraction of the data. We derive non-asymptotic generalization error bounds that quantify, within our framework, the penalty one must pay for not learning features.

We first prove a two-part result that quantifies precisely the necessity of learning features within the context of our data model. The first part shows that a trivial nearest neighbor classifier performs perfectly when given knowledge of the correct features. The second part shows that it is impossible to craft, a priori, a feature map that generalizes well when using a nearest neighbor classification rule. In other words, success or failure depends only on the ability to identify the correct features, and not on the underlying classification rule. Since the correct features cannot be identified a priori, they must be learned. Our theoretical results therefore support the idea that algorithms cannot generalize on long-tailed data if they do not learn features.

Nevertheless, an algorithm that does learn features can generalize well. Specifically, the most direct neural network architecture for our data model generalizes almost perfectly when using either a linear classifier or a nearest neighbor classifier on top of the learned features. Crucially, designing the architecture requires knowing only the meta-structure of the problem, and no a priori knowledge of the correct features. This illustrates the built-in advantage of neural networks: their ability to learn features significantly eases the design burden placed on the practitioner.

Subcategories in commonly used visual recognition datasets tend to follow a long-tailed distribution (Salakhutdinov et al., 2011; Zhu et al., 2014; Feldman & Zhang, 2020). Some common subcategories have a wealth of representatives in the training set, whereas many rare subcategories have only a few. At an intuitive level, learning features seems especially important on a long-tailed dataset, since features learned from the common subcategories help to properly classify test points from a rare subcategory. Our theoretical results help support this intuition. We note that when considering complex visual recognition tasks, datasets are almost unavoidably long-tailed (Liu et al., 2019): even if the dataset contains millions of images, it is to be expected that many subcategories will have few samples. In this setting, the classical approach of deriving asymptotic performance guarantees based on a large-sample limit is not a fruitful avenue. Generalization must be approached from a different point of view (cf. Feldman (2020) for very interesting work in this direction). In particular, the analysis must be non-asymptotic.
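To make the classification rule referred to above concrete, the sketch below implements a 1-nearest-neighbor rule applied on top of an arbitrary feature map. This is our own minimal illustration, not code from the paper: the feature map `phi` is a placeholder standing in for either the correct features or some hand-crafted alternative, and the point is simply that the classifier itself is trivial once the features are fixed.

```python
import numpy as np

def nearest_neighbor_predict(phi, X_train, y_train, X_test):
    """1-nearest-neighbor classification on top of a fixed feature map `phi`.

    `phi` maps a raw input to a feature vector; the classifier itself is
    trivial, and all of the difficulty lies in choosing (or learning) `phi`.
    """
    F_train = np.stack([phi(x) for x in X_train])    # features of training points
    predictions = []
    for x in X_test:
        f = phi(x)
        dists = np.linalg.norm(F_train - f, axis=1)  # distances in feature space
        predictions.append(y_train[int(np.argmin(dists))])
    return np.array(predictions)

# Example usage with a placeholder feature map (here simply the identity).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(10, 5))
    y_train = rng.integers(0, 2, size=10)
    X_test = rng.normal(size=(3, 5))
    print(nearest_neighbor_predict(lambda x: x, X_train, y_train, X_test))
```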
One of our main contributions is to derive, within the context of our data model, generalization error bounds that are non-asymptotic and relatively tight: by this we mean that our results hold for small numbers of data samples and track reasonably well with empirically evaluated generalization error. In Section 2 we introduce our data model, and in Section 3 we discuss our theoretical results. For simplicity of exposition, both sections focus on the case where each rare subcategory has a single representative in the training set. Section 4 is concerned with the general case in which each rare subcategory has few representatives. Section 5 provides an overview of our proof techniques. Finally, in Section 6, we investigate empirically a few questions that we could not resolve analytically. In particular, our error bounds are restricted to the case in which a nearest neighbor classification rule is applied on top of the features; we provide empirical evidence in this last section that replacing the nearest neighbor classifier by a linear classifier leads to very minimal improvement. This further supports the notion that, on our data model, it is the ability to learn features that drives success, not the specific classification rule used on top of the features.

Related work. By now, a rich literature has developed that studies the generalization abilities of neural networks. A major theme in this line of work is the use of the PAC learning framework to derive generalization bounds for neural networks (e.g. Bartlett et al. (2017); Neyshabur et al. (2017); Golowich et al. (2018); Arora et al. (2018); Neyshabur et al. (2018); Du et al. (2018); Allen-Zhu et al. (2019); Ji & Telgarsky (2018)), usually by proving a bound on the difference between the finite-sample empirical loss and the true loss. While powerful in their generality, such approaches are usually task independent and asymptotic; that is, they are mostly agnostic to any idiosyncrasies in the data generating process and need a statistically meaningful number of samples in the training set. As such, the PAC learning framework is not well-tailored to our specific aim of studying generalization on long-tailed data distributions; indeed, in such a setting, a rare subcategory might have only a handful of representatives in the training set. After breakthrough results (e.g. Jacot et al. (2018)) showed that vastly over-parametrized neural networks become kernel methods (the so-called Neural Tangent Kernel, or NTK) in an appropriate limit, much effort has gone toward analyzing the extent to which neural networks outperform kernel methods (Yehudai & Shamir, 2019; Wei et al., 2019; Refinetti et al., 2021; Ghorbani et al., 2019; 2020; Karp et al., 2021; Allen-Zhu & Li, 2019; 2020; Li et al., 2020; Malach et al., 2021). Our interest lies not in proving such a gap for its own sake, but rather in using the comparison to gain some understanding of the importance of learning features in computer vision and NLP contexts. Analyses that shed theoretical light onto learning with long-tailed distributions (Feldman, 2020; Brown et al., 2021) or onto specific learning mechanisms (Karp et al., 2021) are perhaps closest to our own. The former analyses (Feldman, 2020; Brown et al., 2021) investigate the necessity of memorizing rare training examples in order to obtain near-optimal generalization error when the data distribution is long-tailed. Our analysis differs to the extent that we focus on the necessity of learning features and sharing representations in order to properly classify rare instances. Like us, the latter analysis (Karp et al., 2021) also considers a computer vision inspired task and uses it to compare a neural network to a kernel method, with the ultimate aim of studying the learning mechanism involved. Their object of study (finding a sparse signal in the presence of noise), however, markedly differs from our own (learning with long-tailed distributions).

2. THE DATA MODEL

We begin with a simple example to explain our data model and to illustrate, at an intuitive level, the importance of learning features when faced with a long-tailed data distribution. For the sake of exposition we adopt NLP terminology such as 'words' and 'sentences,' but the image-based terminology of 'patches' and 'images' would do as well. The starting point is a very standard mechanism for generating observed data from some underlying collection of latent variables. Consider the data model depicted in Figure 1. We have a vocabulary of n_w = 12 words and a set of n_c = 3 concepts: V = {potato, cheese, carrots, chicken, ...} and C = {vegetable, dairy, meat}.
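To make the latent-variable mechanism concrete, here is a minimal sketch in the spirit of the example above. It is our own illustration under stated assumptions, not the paper's exact model: we assume each word belongs to exactly one concept, and that a sentence is produced by first fixing a sequence of latent concepts and then sampling, for each concept, a word uniformly from that concept's portion of the vocabulary. Names such as `sample_sentence` and the word lists beyond those quoted in the text are ours.

```python
import random

# A toy vocabulary of n_w = 12 words partitioned among n_c = 3 concepts,
# echoing the example above (words not quoted in the text are placeholders).
CONCEPT_WORDS = {
    "vegetable": ["potato", "carrots", "leek", "onion"],
    "dairy":     ["cheese", "butter", "yogurt", "cream"],
    "meat":      ["chicken", "beef", "pork", "lamb"],
}

def sample_sentence(concept_sequence, rng=random):
    """Generate an observed sentence from a latent sequence of concepts.

    Each latent concept is replaced by a word drawn uniformly from the
    words associated with that concept.
    """
    return [rng.choice(CONCEPT_WORDS[c]) for c in concept_sequence]

# Example: the latent sequence (vegetable, dairy, vegetable) might surface
# as ['potato', 'cheese', 'carrots'] or as ['leek', 'butter', 'onion'].
random.seed(0)
print(sample_sentence(["vegetable", "dairy", "vegetable"]))
```

In a model of this kind, the label is tied to the latent concepts rather than to the particular surface words, which is what makes recovering the word-to-concept assignment, i.e. the correct features, essential.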


