MANIFOLD CHARACTERISTICS THAT PREDICT DOWNSTREAM TASK PERFORMANCE

Abstract

Pretraining methods are typically compared by evaluating the accuracy of linear classifiers, measuring transfer learning performance, or visually inspecting lower-dimensional projections of the representation manifold (RM). We show that the differences between methods can be understood more clearly by investigating the RM directly, which allows for a more detailed comparison. To this end, we propose a framework and a new metric to measure and compare RMs, and we report on the RM characteristics of various pretraining methods. These characteristics are measured by applying successively larger local alterations to the input data, using white noise injections and Projected Gradient Descent (PGD) adversarial attacks, and tracking each datapoint's representation. We calculate the total distance moved by each representation and the relative change in distance between successive alterations. We show that self-supervised methods learn an RM where alterations lead to large but consistently sized changes, indicating a smoother RM than that learned by fully supervised methods. We then combine these measurements into a single metric, the Representation Manifold Quality Metric (RMQM), where larger values indicate larger and less variable step sizes, and show that RMQM correlates positively with performance on downstream tasks.

1. INTRODUCTION

Understanding why deep neural networks generalise so well remains a topic of intense research, despite the practical successes achieved with such networks. Short of a complete understanding, we can search for characteristics that indicate good generalisation. Knowledge of such characteristics can then be incorporated into training methods, open new research avenues, and be used to evaluate and compare networks. Arguably the most successful current theories of generalisation focus on the flatness of the loss surface at the minima (Hochreiter & Schmidhuber, 1997; Dziugaite & Roy, 2017; Dherin et al., 2021), even though the most straightforward measures of flatness are known to be deficient (Dinh et al., 2017). Petzka et al. (2021) expand on this argument, showing that flatness-based measures correlate strongly with model performance and reflect the assumption that labels are locally constant in feature space. A thorough survey by Jiang et al. (2020) shows that some recent measures are, in fact, negatively correlated with generalisation. To our knowledge, no existing theory considers the structural characteristics of the learned Representation Manifold (RM) as a predictor of generalisation. We investigate whether structural characteristics of RMs correlate with generalisation, as measured by downstream task performance. To illustrate the intuition behind our investigation, consider Figure 1, which depicts two RMs, A and B. Assume that each RM is produced by the same architecture trained on the same dataset, that both sit at flat minima, but that they were trained with different methods. In the case of A, where the manifold is smooth, representations of Green-class samples are, on average, closer to other Green-class representations; likewise, representations of Red-class samples are, on average, closer to other Red-class representations.
On the other hand, if we consider RM B, there are chasms in the manifold that leave some sample representations closer to samples of the other class than to samples of their own class, as illustrated in the blue patch. The purpose of this paper is to justify our claim that specific RM characteristics lead to generalisation. To do so, we must first define appropriate RM characteristics that reflect this intuition and show how to measure them.

Contribution

This paper aims to provide sufficient empirical evidence that the structure of the representation manifold (RM) is a promising research direction for explaining generalisation in deep neural networks. Given these strong empirical results, future work will require a deeper theoretical investigation into our findings. Our contributions can be summarised as follows. We define a straightforward, model-agnostic framework to measure RM characteristics. Using this framework, we compare the RMs learned by encoders trained with supervised, self-supervised, and mixed methods on the MNIST and CIFAR-10 datasets. We then present a new metric, the Representation Manifold Quality Metric (RMQM), which quantifies the quality of a manifold for generalisation, and we show that it correlates strongly with downstream task performance. These observations support our intuition about the RM characteristics that lead to generalisation.
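The measurement framework above can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy encoder, the white-noise alteration schedule, and all names (`measure_rm_characteristics`, `noise_scales`) are our own, and the actual RMQM combination is defined by the paper rather than here.

```python
import numpy as np

def measure_rm_characteristics(encode, x, noise_scales, rng=None):
    """Track how a datapoint's representation moves under successively
    larger white-noise alterations of the input.

    Returns the total distance moved in representation space and the
    relative change in step size between successive alterations.
    """
    rng = np.random.default_rng(rng)
    noise = rng.standard_normal(x.shape)  # fixed direction, growing magnitude
    prev_rep = encode(x)
    steps = []
    for scale in noise_scales:  # successively larger local alterations
        rep = encode(x + scale * noise)
        steps.append(np.linalg.norm(rep - prev_rep))
        prev_rep = rep
    steps = np.asarray(steps)
    total_distance = steps.sum()
    # Relative change in distance between successive alterations;
    # small values indicate constant step sizes (a smoother RM).
    rel_changes = np.abs(np.diff(steps)) / (steps[:-1] + 1e-12)
    return total_distance, rel_changes

# Toy stand-in for a pretrained encoder (illustrative only).
encode = lambda x: np.tanh(x @ np.ones((4, 2)))

x = np.zeros(4)
total, rel = measure_rm_characteristics(
    encode, x, noise_scales=[0.1, 0.2, 0.3, 0.4], rng=0)
```

A PGD-based variant would replace the fixed noise direction with an adversarial perturbation of growing budget; the bookkeeping over `steps` stays the same.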

2. RELATED WORK

Representation learning

Some of the earliest work in representation learning focused on pretraining networks by generating artificial labels from images and training the network to predict these labels (Doersch et al., 2015; Zhang et al.; Gidaris et al., 2018). Other techniques involve contrastive learning, where representations of images are contrasted against one another so that the network learns to encode similar images to similar representations (Schroff et al., 2015; Oord et al., 2018; Chen et al., 2020; He et al., 2020; Le-Khac et al., 2020).

Comparing representations from trained neural networks

Yamins et al. (2014) and Cadena et al. (2019) compare how similar two representations are by linearly regressing one representation onto the other; the R^2 coefficient of the fit is then used as a similarity metric. This metric is not symmetric. Symmetric methods compare representations from different neural networks by constructing a similarity matrix between the hidden representations of all layers (Laakso & Cottrell, 2000; Kriegeskorte et al., 2008; Li et al., 2016; Wang et al., 2018; Kornblith et al.).
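The regression-based comparison can be sketched with ordinary least squares in numpy. This is an illustrative reconstruction, not the cited authors' code; the function name and toy data are our own. The asymmetry noted above falls out naturally: an exact linear relation in one direction need not hold in the other.

```python
import numpy as np

def linear_r2_similarity(rep_a, rep_b):
    """Regress rep_b (n x d_b) onto rep_a (n x d_a) by ordinary least
    squares and return the R^2 of the fit. Not symmetric in its arguments.
    """
    # Append a bias column so the regression has an intercept.
    A = np.hstack([rep_a, np.ones((rep_a.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, rep_b, rcond=None)
    pred = A @ coef
    ss_res = np.sum((rep_b - pred) ** 2)
    ss_tot = np.sum((rep_b - rep_b.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
rep_a = rng.standard_normal((100, 8))
rep_b = rep_a @ rng.standard_normal((8, 5))  # exactly linear in rep_a

r2_ab = linear_r2_similarity(rep_a, rep_b)  # near 1: rep_a predicts rep_b
r2_ba = linear_r2_similarity(rep_b, rep_a)  # lower: 5 dims cannot span 8
```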

Manifold Learning

The Manifold Hypothesis states that practical high-dimensional datasets lie on a much lower-dimensional manifold (Carlsson et al., 2008; Fefferman et al., 2016; Goodfellow et al., 2016). Manifold learning techniques aim to learn this lower-dimensional manifold by performing non-linear dimensionality reduction. A typical application of these non-linear reductions is visualising high-dimensional data in two or three dimensions. Popular techniques include Isomap (Tenenbaum et al., 2000), t-SNE (Van der Maaten & Hinton, 2008), and UMAP (McInnes et al., 2018). These techniques have been used in various studies to compare different learned representation manifolds (Chen et al., 2019; van der Merwe, 2020; Li et al., 2020; Liu et al., 2022).

Comparing manifolds

To evaluate the performance of Generative Adversarial Networks, Barannikov et al. (2021) introduce the Cross-Barcode tool, which measures the differences in topology between two manifolds, each approximated by data points sampled from the underlying data distribution. They then derive the Manifold Topology Divergence based on the sum of the



Figure 1: An illustration to give an intuitive understanding of why the structural characteristics of an RM should be considered a predictor of generalisation.

