A picture of the space of typical learning tasks

Abstract

We develop a technique to analyze representations learned by deep networks when they are trained on different tasks using supervised, meta-, and contrastive learning. We visualize such representations using an isometric embedding of the space of probabilistic models into a lower-dimensional space, i.e., one that preserves pairwise distances. We discover the following surprising phenomena that shed light upon the structure in the space of learning tasks: (1) the manifold of probabilistic models trained on different tasks using different representation learning methods is effectively low-dimensional; (2) supervised learning on one task results in a surprising amount of progress on seemingly dissimilar tasks; progress on other tasks is larger if the training task has diverse classes; (3) the structure of the space of tasks indicated by our analysis is consistent with parts of the Wordnet phylogenetic tree; (4) fine-tuning a model upon a sub-task does not change the representation much if the model was trained for a large number of epochs; (5) episodic meta-learning algorithms eventually fit models similar to those found by supervised learning, even if the two traverse different trajectories during training; (6) contrastive learning methods trained on different datasets learn similar representations. We use classification tasks constructed from the CIFAR-10 and Imagenet datasets to study these phenomena.

1. Introduction

Exploiting data from related tasks to reduce the sample complexity of learning a desired task is an idea that lies at the heart of burgeoning fields like transfer, multi-task, meta, few-shot, and self-supervised learning. These algorithms have shown an impressive ability to learn representations that can predict well on new tasks. The algorithms are very diverse in how they work, but it stands to reason that they must be exploiting some shared structure in the space of learning tasks. Although there is a large body of work that seeks to understand relatedness among tasks and how these algorithms exploit it (see §4 for a discussion of related work), we do not know what this shared structure precisely is. Our work makes the following contributions to advancing this line of research. We develop a technique to analyze the representation learned on a task and its relationship to other tasks. Our key technical innovation is to use ideas from information geometry to characterize the geometry of the space of probabilistic models fit on different tasks. We develop methods to embed training trajectories of probabilistic models into a lower-dimensional space isometrically, i.e., while preserving pairwise distances. This allows us to faithfully visualize the geometry of these very high-dimensional spaces (for Imagenet, our probabilistic models live in ∼10⁷ dimensions) and thereby interpret the geometry of the space of learning tasks. These technical tools are very general and shed light on the shared structure among tasks. We use these technical tools to study how algorithms that learn from multiple tasks work. We provide evidence for the following phenomena.
(1) The manifold of probabilistic models trained on different tasks using different representation learning methods is effectively low-dimensional, and this dimensionality is rather small; for Imagenet, a 3-dimensional subspace preserves 80.02% of the pairwise distances between models, a quantity we define (in Appendix D) as the "explained stress". (2) Supervised learning on one task results in a surprising amount of progress (informally, "progress" means that the representation learned on one task can be used to make accurate predictions on other tasks; this is defined precisely in (4)) on seemingly dissimilar tasks; progress on other tasks is larger if the training task has diverse classes. (3) The structure of the space of tasks indicated by our analysis is consistent with parts of the Wordnet phylogenetic tree. (4) Fine-tuning a model upon a sub-task does not change the representation much if the model was trained for a large number of epochs.


(5) Episodic meta-learning algorithms eventually fit models similar to those found by supervised learning, even if the two traverse different trajectories during training. (6) Contrastive learning methods trained on different datasets learn similar representations. We demonstrate these findings on image classification tasks constructed from the CIFAR-10 and Imagenet datasets.
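The "explained stress" of a low-dimensional isometric embedding can be illustrated with a short numerical sketch. The exact definition lives in Appendix D; here we assume, for illustration only, that it is one minus the relative Frobenius error between the original and embedded pairwise-distance matrices, with the embedding computed by classical multidimensional scaling (function names are ours, not the paper's):

```python
import numpy as np

def classical_mds(D, k):
    """Embed points into k dimensions from a pairwise distance matrix D
    using classical multidimensional scaling (Torgerson's method)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)        # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:k]      # keep the top-k eigenpairs
    L = np.sqrt(np.clip(vals[idx], 0.0, None))
    return vecs[:, idx] * L               # (n, k) embedded coordinates

def explained_stress(D, X):
    """Assumed definition: 1 - relative Frobenius error of the
    pairwise distances reproduced by the embedding X."""
    Dhat = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return 1.0 - np.linalg.norm(D - Dhat) / np.linalg.norm(D)

# toy check: points that already live in 2-D embed with ~no loss
rng = np.random.default_rng(0)
P = rng.standard_normal((20, 2))
D = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
X = classical_mds(D, k=2)
print(explained_stress(D, X))  # close to 1.0
```

For the models in the paper, D would hold pairwise Bhattacharyya distances rather than Euclidean ones; the embedding and stress computation are unchanged.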

2. Methods

Modeling the task  We define a task P as a joint distribution on inputs x ∈ ℝᵈ and outputs y ∈ {1, …, C} corresponding to C classes. Suppose we have N independent and identically distributed samples {(xₙ, y*ₙ)}ₙ₌₁ᴺ from P. Let ⃗y = (y₁, …, y_N) denote any sequence of outputs on these N samples and ⃗y* denote the sequence of ground-truth labels. We may now model the task as

$$P_w(\vec{y}) = \prod_{n=1}^{N} p_w^n(y_n), \qquad (1)$$

where w are the parameters of the model and we have used the shorthand $p_w^n(y_n) \equiv p_w(y_n \mid x_n)$. The true probability distribution, which corresponds to the ground-truth labels, is denoted by P* ≡ P(⃗y*). In the same way, let P₀ denote the probability distribution that corresponds to pⁿ(y) = 1/C for all n and all y, i.e., P₀ predicts accurately on a fraction 1/C of the samples.

Bhattacharyya distance  Given two models P_u and P_v parameterized by weights u and v respectively, we define the Bhattacharyya distance (Bhattacharyya, 1946) between them, averaged over samples, as

$$d_B(P_u, P_v) = -\frac{1}{N} \sum_{n=1}^{N} \log \sum_{c=1}^{C} \sqrt{p_u^n(c)\, p_v^n(c)}; \qquad (2)$$

see Appendix C for more details on (2). Our model (1) involves a product over the probabilities of N samples. Typical distances between probability distributions, e.g., the Hellinger distance, saturate when the number of samples N is large (because random high-dimensional vectors are nearly orthogonal). It is thus difficult to use such distances to understand high-dimensional probabilistic models. The Bhattacharyya distance, however, is well-behaved for large N due to the logarithm (Quinn et al., 2019; Teoh et al., 2020), which is why it is well suited to our problem.
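The sample-averaged Bhattacharyya distance in (2) is straightforward to compute from the predicted class probabilities of two models. A minimal numpy sketch (the function name is ours, not the paper's):

```python
import numpy as np

def bhattacharyya_distance(pu, pv):
    """Sample-averaged Bhattacharyya distance between two probabilistic
    models, each given as an (N, C) array of predicted class probabilities:
    d_B = -(1/N) * sum_n log sum_c sqrt(pu[n, c] * pv[n, c])."""
    bc = np.sum(np.sqrt(pu * pv), axis=1)  # Bhattacharyya coefficient per sample
    return -np.mean(np.log(bc))

# sanity checks: a model is at distance 0 from itself, and the uniform
# model P0 is at distance (1/2) log C from a one-hot model
C = 10
p0 = np.full((1, C), 1.0 / C)      # the "ignorant" model P0
ph = np.zeros((1, C)); ph[0, 0] = 1.0  # a one-hot model
print(bhattacharyya_distance(p0, p0))  # ≈ 0
print(bhattacharyya_distance(p0, ph))  # 0.5 * log(10) ≈ 1.151
```

Note that, unlike distances that saturate in N, this quantity stays an average of per-sample log-coefficients, so it remains informative for large N.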

Distances between trajectories of probabilistic models

Consider a trajectory (w(k))_{k=0,…,T} that records the weights after T updates of the optimization algorithm, e.g., stochastic gradient descent. This trajectory corresponds to a trajectory of probabilistic models τ_u = (P_{w(k)})_{k=0,…,T}. We are interested in calculating distances between such training trajectories. First, consider τ_u = (u(0), u(1), u(2), …, u(T)) and another trajectory τ_v ≡ (u(0), u(2), u(4), …, u(T), u(T), …, u(T)) which trains twice as fast but to the same end point. If we define the distance between these trajectories as, say, $\sum_k d_B(P_{u(k)}, P_{v(k)})$, then the distance between τ_u and τ_v will be nonzero, even though the two are fundamentally the same. This issue is more pronounced when we calculate distances between training trajectories of different tasks. It arises because we are recording each trajectory using a different time coordinate, namely its own training progress.

To compare two trajectories correctly, we need a notion of time that can uniquely index any trajectory. The geodesic between the start point P₀ and the true distribution P* is a natural candidate for this purpose. Geodesics are locally length-minimizing curves in a metric space. For the product manifold in (1), we can obtain a closed-form formula for the geodesic as follows. We can think of the square roots of the probabilities $\sqrt{p_u^n(c)}$ as a point on a C-dimensional sphere. Given two models P_u and P_v, the geodesic connecting them under the Fisher information metric is the great circle on this sphere (Ito & Dechant, 2020, Eq. 47):

$$\sqrt{p_{w(t)}^n(c)} = \frac{\sin\big((1-t)\,\theta^n\big)}{\sin \theta^n} \sqrt{p_u^n(c)} + \frac{\sin(t\,\theta^n)}{\sin \theta^n} \sqrt{p_v^n(c)}, \qquad (3)$$

for t ∈ [0, 1], where $\cos \theta^n = \sum_{c=1}^{C} \sqrt{p_u^n(c)\, p_v^n(c)}$.
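The geodesic in (3) is spherical interpolation of the square-root probabilities, applied independently to each sample. A minimal numpy sketch (the function name is ours; it assumes the two models differ on every sample, so that θⁿ > 0 and the division is safe):

```python
import numpy as np

def geodesic(pu, pv, t):
    """Point at fraction t in [0, 1] along the Fisher-information geodesic
    between two models pu, pv, each an (N, C) array of class probabilities.
    Implements spherical interpolation of sqrt-probabilities per sample;
    assumes pu != pv on every sample so theta > 0."""
    su, sv = np.sqrt(pu), np.sqrt(pv)
    cos_theta = np.clip(np.sum(su * sv, axis=1, keepdims=True), -1.0, 1.0)
    theta = np.arccos(cos_theta)          # per-sample great-circle angle
    s = (np.sin((1 - t) * theta) * su + np.sin(t * theta) * sv) / np.sin(theta)
    return s ** 2                          # back to probabilities

# midpoint between the uniform model and a one-hot model
C = 4
p0 = np.full((1, C), 1.0 / C)
ph = np.zeros((1, C)); ph[0, 0] = 1.0
mid = geodesic(p0, ph, 0.5)
print(mid, mid.sum())  # rows remain valid distributions (sum to 1)
```

Because the interpolation stays on the sphere of unit-norm square-root probabilities, every intermediate point is itself a valid probabilistic model; sweeping t from 0 to 1 along the P₀-to-P* geodesic gives the reparameterized "time" used to index trajectories.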

