EVALUATING REPRESENTATIONS BY THE COMPLEXITY OF LEARNING LOW-LOSS PREDICTORS

Anonymous

Abstract

We consider the problem of evaluating representations of data for use in solving a downstream task. We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest. To this end, we introduce two measures: surplus description length (SDL) and ε-sample complexity (εSC). To compare our methods to prior work, we also present a framework based on plotting the validation loss versus dataset size (the "loss-data" curve). Existing measures, such as mutual information and minimum description length, correspond to slices and integrals along the data-axis of the loss-data curve, while ours correspond to slices and integrals along the loss-axis. This analysis shows that prior methods measure properties of an evaluation dataset of a specified size, whereas our methods measure properties of a predictor with a specified loss. We conclude with experiments on real data comparing the behavior of these methods over datasets of varying size.

1. INTRODUCTION

One of the first steps in building a machine learning system is selecting a representation of data. Whereas classical machine learning pipelines often begin with feature engineering, the advent of deep learning led many to argue for pure end-to-end learning in which the deep network constructs the features (LeCun et al., 2015). However, huge strides in unsupervised learning (Hénaff et al., 2019; Chen et al., 2020; He et al., 2019; van den Oord et al., 2018; Bachman et al., 2019; Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2019; Brown et al., 2020) have reversed this trend in the past two years, with common wisdom now recommending that the design of most systems start from a pretrained representation. With this boom in representation learning techniques, practitioners and representation researchers alike face the question: which representation is best for my task?

This question sits at the middle step of the representation learning pipeline. The first step is representation learning, which consists of training a representation function on a training set using an objective that may be supervised or unsupervised. The second step, which this paper considers, is representation evaluation. In this step, one uses a measure of representation quality and a labeled evaluation dataset to see how well the representation performs. The final step is deployment, in which the practitioner or researcher puts the learned representation to use. Deployment could involve using the representation on a stream of user-provided data to solve a variety of end tasks (LeCun, 2015), or simply releasing the trained weights of the representation function for general use. In the same way that BERT (Devlin et al., 2019) representations have been applied to a whole host of problems, the task or the amount of data available in deployment might differ from the evaluation phase.
We take the position that the best representation is the one which allows for the most efficient learning of a predictor to solve the task. We measure efficiency in terms of either the number of samples or the information about the optimal predictor contained in the samples. This position is motivated by practical concerns: the more labels that are needed to solve a task in the deployment phase, the more expensive a representation is to use and the less widely applicable it will be.

We build on a substantial and growing body of literature that attempts to answer the question of which representation is best. Simple, traditional means of evaluating representations, such as the validation accuracy of linear probes (Ettinger et al., 2016; Shi et al., 2016; Alain & Bengio, 2016), have been widely criticized (Hénaff et al., 2019; Resnick et al., 2019). Instead, researchers have taken up a variety of alternatives such as the validation accuracy (VA) of nonlinear probes (Conneau et al., 2018; Hénaff et al., 2019), mutual information (MI) between representations and labels (Bachman et al., 2019; Pimentel et al., 2020), and minimum description length (MDL) of the labels conditioned on the representations (Blier & Ollivier, 2018; Yogatama et al., 2019; Voita & Titov, 2020).
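As a concrete illustration, the dataset-dependent measures can all be read off an empirical loss-data curve. The sketch below uses hypothetical sizes and loss values (not numbers from this paper) and a simple trapezoid-rule approximation of the prequential MDL integral:

```python
import numpy as np

# Hypothetical loss-data curve: validation loss of a probe trained on n
# labeled examples (sizes and loss values are illustrative only).
ns = np.array([1.0, 2, 4, 8, 16, 32, 64, 128])
losses = np.array([2.3, 1.9, 1.4, 1.0, 0.7, 0.5, 0.4, 0.35])

# VA-style evaluation: a slice of the curve at one chosen dataset size.
va_at_64 = float(losses[ns == 64][0])

# Prequential MDL: the area under the loss-data curve up to n, i.e. the
# total codelength of the labels given the representation
# (approximated here with the trapezoid rule).
mask = ns <= 64
l, x = losses[mask], ns[mask]
mdl_at_64 = float(np.sum((l[:-1] + l[1:]) / 2 * np.diff(x)))
```

Both quantities change if the evaluation set grows, which is exactly the dataset-size dependence discussed below.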
We find that these methods all have clear limitations. As can be seen in Figure 1, VA and MDL are liable to choose different representations for the same task when given evaluation datasets of different sizes. Instead we want an evaluation measure which depends on the data distribution, not on a particular dataset or dataset size. Furthermore, VA and MDL lack a predefined notion of success in solving a task. In combination with small evaluation datasets, these measures may lead to premature evaluation by producing a judgment even when there is not enough data to solve the task or to meaningfully distinguish one representation from another. Meanwhile, MI measures the lowest loss achievable by any predictor irrespective of the complexity of learning it. We note that while these methods do not correspond to our notion of best representation, they may be correct for different notions of "best".

To eliminate these issues, we propose two measures. In both, the user must specify a tolerance ε so that a population loss of less than ε qualifies as solving the task. The first measure is the surplus description length (SDL), which modifies the MDL to measure the complexity of learning an ε-loss predictor rather than the complexity of the labels in the evaluation dataset. The second is the ε-sample complexity (εSC), which measures the sample complexity of learning an ε-loss predictor. To facilitate our analysis, we also propose a framework called the loss-data framework, illustrated in Figure 1, that plots the validation loss against the evaluation dataset size (Talmor et al., 2019; Yogatama et al., 2019; Voita & Titov, 2020). This framework simplifies comparisons between measures.
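On the same kind of empirical loss-data curve, the two proposed measures can be sketched as follows. This is a minimal illustration with hypothetical values, and it assumes the curve does eventually cross the chosen tolerance ε:

```python
import numpy as np

eps = 0.5  # user-chosen tolerance: population loss below eps "solves" the task

# Hypothetical loss-data curve (illustrative values only).
ns = np.array([1.0, 2, 4, 8, 16, 32, 64, 128])
losses = np.array([2.3, 1.9, 1.4, 1.0, 0.7, 0.5, 0.4, 0.35])

# Surplus description length: integrate only the loss in excess of eps,
# i.e. the area between the loss-data curve and the horizontal line at eps
# (trapezoid-rule approximation).
surplus = np.maximum(losses - eps, 0.0)
sdl = float(np.sum((surplus[:-1] + surplus[1:]) / 2 * np.diff(ns)))

# eps-sample complexity: the smallest n at which the curve first reaches eps.
# (np.argmax on a boolean array returns the index of the first True.)
esc = int(ns[np.argmax(losses <= eps)])
```

Because the curve is flat at or below ε past the crossing point, neither quantity changes when the evaluation dataset grows further, in contrast to VA and MDL above.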
Prior work measures integrals (MDL) and slices (VA and MI) along the data-axis. Our work instead proposes measuring integrals (SDL) and slices (εSC) along the loss-axis. This contrast illustrates how prior work makes tacit choices about the function to learn based on the choice of dataset size. Our work instead makes an explicit, interpretable choice of threshold ε and measures the complexity of solving the task to ε error. We experimentally investigate the behavior of these methods, illustrating the sensitivity of VA and MDL, and the robustness of SDL and εSC, to dataset size.

Efficient implementation. To enable reproducible and efficient representation evaluation for representation researchers, we have developed a highly optimized open source Python package (see



Figure 1: Each measure for evaluating representation quality is a simple function of the "loss-data" curve shown here, which plots validation loss of a probe against evaluation dataset size. Left: Validation accuracy (VA), mutual information (MI), and minimum description length (MDL) measure properties of a given dataset, with VA measuring the loss at a finite amount of data, MI measuring it at infinity, and MDL integrating it from zero to n. This dependence on dataset size can lead to misleading conclusions as the amount of available data changes. Middle: Our proposed methods instead measure the complexity of learning a predictor with a particular loss tolerance. ε-sample complexity (εSC) measures the number of samples required to reach that loss tolerance, while surplus description length (SDL) integrates the surplus loss incurred above that tolerance. Neither depends on the dataset size. Right: A simple example task which illustrates the issue. One representation, which consists of noisy labels, allows quick learning, while the other supports low loss in the limit of data. Evaluating either representation at a particular dataset size risks drawing the wrong conclusion.

