EVALUATING REPRESENTATIONS BY THE COMPLEXITY OF LEARNING LOW-LOSS PREDICTORS Anonymous

Abstract

We consider the problem of evaluating representations of data for use in solving a downstream task. We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest. To this end, we introduce two measures: surplus description length (SDL) and ε sample complexity (εSC). To compare our methods to prior work, we also present a framework based on plotting the validation loss versus dataset size (the "loss-data" curve). Existing measures, such as mutual information and minimum description length, correspond to slices and integrals along the dataaxis of the loss-data curve, while ours correspond to slices and integrals along the loss-axis. This analysis shows that prior methods measure properties of an evaluation dataset of a specified size, whereas our methods measure properties of a predictor with a specified loss. We conclude with experiments on real data to compare the behavior of these methods over datasets of varying size.

1. INTRODUCTION

One of the first steps in building a machine learning system is selecting a representation of data. Whereas classical machine learning pipelines often begin with feature engineering, the advent of deep learning has led many to argue for pure end-to-end learning where the deep network constructs the features (LeCun et al., 2015) . However, huge strides in unsupervised learning (Hénaff et al., 2019; Chen et al., 2020; He et al., 2019; van den Oord et al., 2018; Bachman et al., 2019; Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2019; Brown et al., 2020) have led to a reversal of this trend in the past two years, with common wisdom now recommending that the design of most systems start from a pretrained representation. With this boom in representation learning techniques, practitioners and representation researchers alike have the question: Which representation is best for my task? This question exists as the middle step of the representation learning pipeline. The first step is representation learning, which consists of training a representation function on a training set using an objective which may be supervised or unsupervised. The second step, which this paper considers, is representation evaluation. In this step, one uses a measure of representation quality and a labeled evaluation dataset to see how well the representation performs. The final step is deployment, in which the practitioner or researcher puts the learned representation to use. Deployment could involve using the representation on a stream of user-provided data to solve a variety of end tasks (LeCun, 2015) , or simply releasing the trained weights of the representation function for general use. In the same way that BERT (Devlin et al., 2019) representations have been applied to a whole host of problems, the task or amount of data available in deployment might differ from the evaluation phase. We take the position that the best representation is the one which allows for the most efficient learning of a predictor to solve the task. We will measure efficiency in terms of either number of samples or information about the optimal predictor contained in the samples. This position is motivated by practical concerns; the more labels that are needed to solve a task in the deployment phase, the more expensive to use and the less widely applicable a representation will be. We build on a substantial and growing body of literature that attempts to answer the question of which representation is best. Simple, traditional means of evaluating representations, such as the validation accuracy of linear probes (Ettinger et al., 2016; Shi et al., 2016; Alain & Bengio, 2016) , have been widely criticized (Hénaff et al., 2019; Resnick et al., 2019) . Instead, researchers have taken 1

