EVALUATING REPRESENTATIONS WITH READOUT MODEL SWITCHING

Abstract

Although much of the success of Deep Learning builds on learning good representations, a rigorous method to evaluate their quality is lacking. In this paper, we treat the evaluation of representations as a model selection problem and propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric. Contrary to the established practice of limiting the capacity of the readout model, we design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions. The MDL score takes model complexity as well as data efficiency into account. As a result, the most appropriate model for the specific task and representation is chosen, making the score a unified measure for comparison. The proposed metric can be efficiently computed with an online method, and we present results for pre-trained vision encoders of various architectures (ResNet and ViT) and objective functions (supervised and self-supervised) on a range of downstream tasks. We compare our method with accuracy-based approaches and show that the latter are inconsistent when multiple readout models are used. Finally, we discuss important properties revealed by our evaluations, such as model scaling, preferred readout model, and data efficiency.

1. INTRODUCTION

Data representation is crucial to the performance of machine learning algorithms (Bengio et al., 2013). Much of the success of Deep Neural Networks (DNNs) can be attributed to their capability of gradually building up more and more abstract representations (Lee et al., 2009). In supervised learning, although the network is trained to predict a specific aspect of the input, the intermediate representations often prove useful for many other downstream tasks (Yosinski et al., 2014). In unsupervised and self-supervised learning, the network is trained on a surrogate task, such as reconstruction (Hinton & Salakhutdinov, 2006; Kingma & Welling, 2013; He et al., 2021) or contrastive prediction (van den Oord et al., 2018; Chen et al., 2020), which is supposed to capture generic priors of the data. In recent years, there have been significant improvements in unsupervised representation learning, with state-of-the-art models achieving performance comparable to that of their supervised counterparts (Tomasev et al., 2022). Despite the importance of data representation, the evaluation method for representations is rarely discussed. The most prevalent practice is to train a readout model on the downstream task. The readout model often has a shallow architecture, e.g., a linear layer, to limit its capacity, so that the task performance reflects the representation quality. The problem with this approach is that the readout model cannot adapt to the nature of the representations. Deeper models and fine-tuning alleviate this issue; however, the representations are then left with multiple metrics, each using a different readout mechanism, making comparison extremely difficult (Nozawa & Sato, 2022). In this paper, we treat evaluating representations as a model selection problem. We propose to use Minimum Description Length (MDL) as the main evaluation metric and use model switching to accommodate the need for multiple readout models.
MDL is a well-studied compression-based approach for inductive inference that provides a generic solution to the model selection problem (Rissanen, 1984; Grunwald, 2004; Wallace, 2005; Solomonoff, 1964; Rathmanner & Hutter, 2011). MDL plays a similar role as held-out validation does for Empirical Risk Minimization (Vapnik, 1991), but has the advantage of being able to deal with single sequences and non-stationary data. It is closely related to Bayesian model selection and includes a form of Occam's Razor, since the metric takes model complexity into account. The complexity term can be represented explicitly as the codelength of the model in the case of a two-part code, as a KL term when using a variational code, or implicitly when using prequential or Bayesian codes. By including the model complexity in the evaluation metric, we automatically remove the need to limit the readout model complexity and can compare MDL scores freely between different readout mechanisms. Intuitively, if the induced representation is nonlinear and requires a higher-capacity model for readout, the MDL score reflects this through a larger complexity term. Note that this also applies in the case of fine-tuning, where the pre-trained model is allowed to adapt to the downstream task. Model switching allows multiple readout models and automatically finds the best readout model for the downstream task at each dataset size (Figure 1). Therefore, MDL with readout model switching provides a unified framework for evaluating representations regardless of the evaluation protocol employed. It is conjectured that useful representations make the variability in the data more predictable and allow for efficient, human-like learning (Hénaff et al., 2019). The MDL evaluation metric formalizes this data-efficiency perspective, which is especially evident in the form of prequential MDL.
Prequential MDL (Dawid & Vovk, 1999; Poland & Hutter, 2005) turns computing the description length $L(D|\phi) = -\log p(D|\phi)$ into a sequential prediction problem: $-\log p(D|\phi) = -\sum_t \log p(y_t \mid \phi_{\leq t}, y_{<t})$, where $\phi$ is an encoder for feature extraction, $\phi_t := \phi(x_t)$ is an encoded input, and $(x_{<t}, y_{<t})$ is the data seen before timestep $t$. In order to achieve a short description length, it is beneficial if the representation enables fast learning, i.e., given a good representation, a few examples are enough to achieve good performance. As a by-product of our MDL-based readout model switching, we can visualize the predictive performance along the data sequence and inspect how efficient the downstream learning is. Our contributions are as follows: 1. We propose readout model switching for evaluating representations. 2. We prove a regret bound for readout model switching and discuss the assumptions made. 3. We use an online learning framework for efficient computation and evaluate the performance of several popular visual representation methods on a set of downstream tasks. 4. We investigate and inspect the preferred readout models, data efficiency, and model scaling.
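The sequential nature of prequential coding can be illustrated with a minimal sketch: predict each label from the data seen so far, pay the codelength $-\log p(y_t \mid y_{<t})$, then update the model. The sketch below is not the paper's implementation; it uses a simple Laplace-smoothed categorical predictor (our choice for illustration) in place of a learned readout model, and the function name is hypothetical.

```python
import numpy as np

def prequential_codelength(labels, num_classes, alpha=1.0):
    """Prequential description length (in nats) of a label sequence.

    Each label y_t is predicted from counts of previously seen labels
    (a Laplace-smoothed categorical model, standing in for a readout
    model); the code pays -log p(y_t | y_<t), then observes y_t.
    """
    counts = np.full(num_classes, alpha)  # Laplace pseudo-counts
    total = 0.0
    for y in labels:
        p = counts[y] / counts.sum()      # predictive probability for y_t
        total += -np.log(p)               # codelength contribution
        counts[y] += 1                    # online update with y_t
    return total

labels = [0, 0, 1, 0, 1, 1, 0]
L = prequential_codelength(labels, num_classes=2)
```

In a real evaluation the per-step predictor would condition on the encoded inputs $\phi_{\le t}$; here only the label history is used, which keeps the accounting identical while making the example self-contained.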

2. BACKGROUND

Minimum Description Length is based on the fundamental idea that learning and comprehension correspond to compression (Rathmanner & Hutter, 2011). Given data $D = (y_t)_{t=1}^{N} \in \mathcal{Y}^N$ and a hypothesis space $\mathcal{M} = \{M_1, M_2, \ldots\}$, where each hypothesis $M$ corresponds to a parametric probabilistic model $p(D|\theta, M)$, MDL aims to identify the model that can compress the data $D$ best. Considering the close relationship between lossless coding and probability distributions, this can be achieved by associating a codelength function $L(D|M) = -\log p(D|M)$ with each hypothesis. A vast body of literature shows that models with a shorter description length have a better chance of generalizing to future data (Wallace, 2005; Grünwald, 2007a; Rathmanner & Hutter, 2011). A crude way to obtain description lengths is to consider $L(D|M) = L_M(\theta) + L_M(D|\theta)$, where $L_M(\theta)$ is the cost of encoding the parameters and $L_M(D|\theta) = -\log p(D|\theta, M)$ is the cost of compressing the data with the parameterized model. This two-part code approach is intuitive but suboptimal and ambiguous because it does not specify how to encode the parameters. This crude MDL approach has been refined in three distinct but closely related ways:
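The two-part code tradeoff can be made concrete with a toy sketch (not from the paper; the function name and the quantization scheme are our own assumptions): a Bernoulli parameter is quantized to a chosen precision, so $L_M(\theta)$ is the number of precision bits, and $L_M(D|\theta)$ is the data cost under the quantized parameter. Coarser precision means a cheaper parameter but a worse fit, and vice versa.

```python
import numpy as np

def two_part_codelength(labels, precision_bits):
    """Two-part codelength (in bits) for a toy Bernoulli model.

    L(theta): the parameter is quantized to `precision_bits` bits,
    so encoding it costs exactly precision_bits.
    L(D|theta): the data then costs -sum_t log2 p(y_t | theta).
    """
    n1 = sum(labels)                       # number of 1s
    n = len(labels)
    levels = 2 ** precision_bits
    # quantization grid strictly inside (0, 1); pick the point
    # nearest the maximum-likelihood estimate n1 / n
    grid = (np.arange(levels) + 0.5) / levels
    theta = grid[np.argmin(np.abs(grid - n1 / n))]
    data_cost = -(n1 * np.log2(theta) + (n - n1) * np.log2(1 - theta))
    return precision_bits + data_cost

labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
coarse = two_part_codelength(labels, precision_bits=2)
fine = two_part_codelength(labels, precision_bits=10)
```

With only ten observations, spending ten bits on the parameter is wasteful: the fine code is longer than the coarse one, which is exactly the complexity penalty that the refined (one-part) codes below handle without an explicit quantization choice.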



Figure 1: Illustration of switching between models of different complexity: Depending on the number of training examples either A, B, or C has the best generalization performance. An optimally switched model will have the best performance at each point and thus the lowest prequential description length (= area under the curve).
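The behavior illustrated in Figure 1 can be sketched with a small fixed-share-style mixture over readout models (this is an illustrative stand-in, not the paper's switching distribution; the function name, the toy probabilities, and the `switch_rate` parameter are our assumptions). Each model's posterior weight grows with its predictive likelihood, and a small amount of weight is redistributed at every step so the mixture can move to a different model as the dataset grows.

```python
import numpy as np

def switching_codelength(model_probs, switch_rate=0.05):
    """Codelength (in nats) of a fixed-share-style mixture of models.

    model_probs: array of shape (T, K); entry [t, k] is model k's
    predictive probability for the true label at step t.
    Weights are updated by Bayes' rule, then mixed toward uniform
    with probability `switch_rate`, allowing the best model to change.
    """
    T, K = model_probs.shape
    w = np.full(K, 1.0 / K)              # uniform prior over models
    total = 0.0
    for t in range(T):
        p = float(w @ model_probs[t])    # mixture prediction
        total += -np.log(p)              # prequential codelength
        w = w * model_probs[t] / p       # Bayesian posterior update
        w = (1 - switch_rate) * w + switch_rate / K  # allow switches
    return total

# toy setting: model 0 predicts well early, model 1 predicts well late
probs = np.array([[0.9, 0.1]] * 10 + [[0.1, 0.9]] * 10)
L_switch = switching_codelength(probs)
L_single = min(float(-np.log(probs[:, k]).sum()) for k in range(2))
```

Because the best model changes partway through the sequence, the switching mixture achieves a shorter codelength than either model alone, mirroring the "area under the curve" intuition in the caption.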

