A picture of the space of typical learning tasks

Abstract

We develop a technique to analyze representations learned by deep networks when they are trained on different tasks using supervised, meta-, and contrastive learning. We visualize such representations using an isometric embedding of the space of probabilistic models into a lower-dimensional space, i.e., one that preserves pairwise distances. We discover the following surprising phenomena that shed light upon the structure of the space of learning tasks: (1) the manifold of probabilistic models trained on different tasks using different representation learning methods is effectively low-dimensional; (2) supervised learning on one task results in a surprising amount of progress on seemingly dissimilar tasks; progress on other tasks is larger if the training task has diverse classes; (3) the structure of the space of tasks indicated by our analysis is consistent with parts of the Wordnet phylogenetic tree; (4) fine-tuning a model on a sub-task does not change the representation much if the model was trained for a large number of epochs; (5) episodic meta-learning algorithms eventually fit models similar to those of supervised learning, even if the two traverse different trajectories during training; (6) contrastive learning methods trained on different datasets learn similar representations. We use classification tasks constructed from the CIFAR-10 and Imagenet datasets to study these phenomena.

1. Introduction

Exploiting data from related tasks to reduce the sample complexity of learning a desired task is an idea that lies at the heart of burgeoning fields like transfer, multi-task, meta, few-shot, and self-supervised learning. These algorithms have shown an impressive ability to learn representations that can predict well on new tasks. The algorithms are very diverse in how they work, but it stands to reason that they must be exploiting some shared structure in the space of learning tasks. Although there is a large body of work that seeks to understand relatedness among tasks and how these algorithms exploit it (see §4 for a discussion of related work), we do not know precisely what this shared structure is. Our work makes the following contributions to advancing this line of research. We develop a technique to analyze the representation learned on a task and its relationship to other tasks. Our key technical innovation is to use ideas from information geometry to characterize the geometry of the space of probabilistic models fit on different tasks. We develop methods to embed training trajectories of probabilistic models into a lower-dimensional space isometrically, i.e., while preserving pairwise distances. This allows us to faithfully visualize the geometry of these very high-dimensional spaces (for Imagenet, our probabilistic models live in ~10^7 dimensions) and thereby interpret the geometry of the space of learning tasks. These technical tools are very general and shed light on the shared structure among tasks. We use these technical tools to study how algorithms that learn from multiple tasks work. We provide evidence for the following phenomena.
(1) The manifold of probabilistic models trained on different tasks using different representation learning methods is effectively low-dimensional, and this dimensionality is rather small; for Imagenet, a 3-dimensional subspace preserves 80.02% of the pairwise distances between models, a quantity we define (in Appendix D) as the "explained stress"; (2) supervised learning on one task results in a surprising amount of progress (informally, "progress" means that the representation learned on one task can be used to make accurate predictions on other tasks; this is defined precisely in (4)) on seemingly dissimilar tasks; progress on other tasks is larger if the training task has diverse classes; (3) the structure of the space of tasks indicated by our analysis is consistent with parts of the Wordnet phylogenetic tree; (4) fine-tuning a model on a sub-task does not change the representation much if the model was trained for a large number of epochs;
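To make the embedding idea above concrete, here is a minimal sketch of one standard way to embed points isometrically given only their pairwise distances: classical multidimensional scaling. This is an illustration, not the paper's exact method (the paper uses distances between probabilistic models derived from information geometry, and defines explained stress in Appendix D); the function names and the particular stress formula below are our own assumptions.

```python
import numpy as np

def isometric_embed(D, dim=3):
    """Classical MDS: find coordinates whose Euclidean distances
    approximate the given pairwise-distance matrix D (n x n, symmetric)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]           # keep the top `dim` components
    w_top = np.clip(w[idx], 0.0, None)        # guard against tiny negatives
    return V[:, idx] * np.sqrt(w_top)         # (n, dim) embedding coordinates

def explained_stress(D, X):
    """Fraction of pairwise-distance structure preserved by the embedding X
    (one plausible definition; the paper's precise one is in Appendix D)."""
    D_hat = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return 1.0 - np.linalg.norm(D - D_hat) / np.linalg.norm(D)

# Toy check: points that already live in 3-D should embed near-perfectly.
rng = np.random.default_rng(0)
P = rng.normal(size=(20, 3))
D = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
X = isometric_embed(D, dim=3)
print(round(explained_stress(D, X), 3))  # close to 1.0
```

A high explained stress for a small `dim`, as in the 3-dimensional Imagenet result quoted above, is what licenses reading geometric structure off the low-dimensional picture.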

