FUNCTION CONTRASTIVE LEARNING OF TRANSFERABLE REPRESENTATIONS

Abstract

Few-shot learning seeks to find models that are capable of fast adaptation to novel tasks not encountered during training. Unlike typical few-shot learning algorithms, we propose a contrastive learning method which is not trained to solve a set of tasks, but rather attempts to find a good representation of the underlying data-generating processes (functions). This allows for finding representations which are useful for an entire series of tasks sharing the same function. In particular, our training scheme is driven by a self-supervision signal indicating whether two sets of samples stem from the same underlying function. Our experiments on a number of synthetic and real-world datasets show that the representations we obtain can outperform strong baselines in terms of downstream performance and noise robustness, even when these baselines are trained in an end-to-end manner.

1. INTRODUCTION

The ability to learn new concepts from only a few examples is a salient characteristic of intelligent behaviour. Nevertheless, contemporary machine learning models consume copious amounts of data to learn even seemingly basic concepts. Mitigating this issue is the ambition of the few-shot learning framework, in which a fundamental objective is to learn representations that apply to a variety of different problems (Bengio et al., 2019). In this work, we propose a self-supervised method for learning such representations by leveraging the framework of contrastive learning.

We consider a setting very similar to that of Neural Processes (NPs) (Garnelo et al., 2018a; b; Kim et al., 2019): the goal is to solve some task related to an unknown function $f$ after observing just a few input-output examples $O_f = \{(x_i, y_i)\}_i$. For instance, the task may consist of predicting the function value $y$ at unseen locations $x$, or it may be to classify images after observing only a few pixels (in that case, $x$ is the pixel location and $y$ is the pixel value). To solve such a task, the example dataset $O_f$ needs to be encoded into some representation of the underlying function $f$. Finding a good representation of a function which facilitates solving a wide range of tasks sharing the same function is the object of the present paper.

Most existing methods approach this problem by optimizing representations for reconstruction, i.e., prediction of function values $y$ at unseen locations $x$ (see e.g. NPs (Garnelo et al., 2018a; b; Kim et al., 2019) and Generative Query Networks (GQNs) (Eslami et al., 2018)). A problem with this objective is that it can cause the model to waste its capacity on reconstructing unimportant features, such as static backgrounds, while ignoring visually small but important details in its learned representation (Anand et al., 2019; Kipf et al., 2019).
For instance, in order to manipulate a small object in a complex scene, the model's ability to infer the object's shape carries more importance than inferring its color or reconstructing the static background. To address this issue, we propose an approach which contrasts functions, rather than attempting to reconstruct them. The key idea is that two sets of examples of the same function should have similar latent representations, while the representations of different functions should be easily distinguishable. To this end, we propose a novel contrastive learning framework which learns by contrasting sets of input-output pairs (partial observations) of different functions. We show that this self-supervised training signal allows the model to meta-learn task-agnostic, low-dimensional representations of functions which are not only robust to noise but can also be reliably used for a variety of few-shot downstream prediction tasks defined on those functions.

To evaluate the effectiveness of the proposed method, we conduct comprehensive experiments on diverse downstream problems including classification, regression, parameter identification, scene understanding and reinforcement learning. We consider different datasets, ranging from simple 1D and 2D regression functions to challenging simulated and real-world scenes. In particular, we find that a downstream predictor trained with our (pre-trained) representations performs comparably to or better than related methods on these tasks, including those in which the predictor is trained jointly with the representation.
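The key idea above, mapping two sample sets of the same function to nearby representations while separating different functions, can be sketched with an InfoNCE-style objective. Everything in this sketch (the linear set encoder, the loss form, the temperature, all names) is an illustrative assumption, not the exact objective used in the paper:

```python
import numpy as np

def encode_set(pairs, W):
    # pairs: (n_points, d_in) array of concatenated (x_i, y_i) samples.
    # W: (d_in, d_z) hypothetical linear encoder weights.
    # Encode each pair independently, then mean-pool into one
    # permutation-invariant representation of the whole set.
    return np.tanh(pairs @ W).mean(axis=0)

def info_nce(Z_a, Z_b, temperature=0.1):
    # Z_a, Z_b: (batch, d_z) representations of two disjoint sample sets
    # per function; row i of Z_a and row i of Z_b stem from the SAME
    # function, so the positive pairs sit on the diagonal of the
    # similarity matrix and all other rows act as negatives.
    Z_a = Z_a / np.linalg.norm(Z_a, axis=1, keepdims=True)
    Z_b = Z_b / np.linalg.norm(Z_b, axis=1, keepdims=True)
    logits = Z_a @ Z_b.T / temperature            # cosine similarities
    # Row-wise cross-entropy with the positive pair as the target class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pulls representations of sample sets from the same function together and pushes apart those of different functions, without ever reconstructing function values.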

Contributions.

• Our key contribution is the insight that a good representation of an underlying data-generating function can be learned by aligning the representations of the samples that stem from it. This perspective on learning function representations differs from typical regression methods and is, to our knowledge, new.

• We propose a novel contrastive learning framework, Function Contrastive Representation Learning (FCRL), to learn such a representation of a function using only a few of its observed samples.

• With experiments on diverse datasets, we show that the function representations learned by FCRL are not only robust to noise in the inputs but also transfer well to multiple downstream problems.

2. PRELIMINARIES

2.1 PROBLEM SETTING

Consider a data-generating function $f : \mathcal{X} \to \mathcal{Y}$ with $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}^d$, observed as $y = f(x) + \xi$, where $\xi \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise. Let $O_f = \{(x_i, y_i)\}_{i=1}^{N}$ be a set of a few observed samples of a function $f$, referred to as the context set, and let $\mathcal{T} = \{T_t\}_{t=1}^{T}$ be a set of unknown downstream tasks that can be defined on $f$. The downstream tasks $\mathcal{T}$ can take the form of classification, regression and/or reinforcement learning problems, and the targets in each task $T_t$ vary accordingly. Our goal is to learn a representation of a given function $f$ from only the context set $O_f$, such that the representation retains the maximum useful information about $f$ and can be used interchangeably for multiple downstream tasks $\mathcal{T}$ defined on the same function (without requiring retraining).

Few-shot interpretation. Note that the downstream tasks $\mathcal{T}$ are defined directly on the function's representation. Since the representation is inferred from a few observations $O_f$, the overall downstream problem becomes a few-shot prediction problem. In the following, we regard the downstream problems as few-shot problems, unless stated otherwise.
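As a concrete illustration of this setting, the following sketch draws context sets $O_f$ from a hypothetical family of noisy 1D sinusoids; the function family, sampling ranges and noise level are assumptions chosen purely for exposition:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_function():
    # Hypothetical 1D function family: sinusoids with random
    # amplitude a and phase p, i.e. f(x) = a * sin(x + p).
    a, p = rng.uniform(0.5, 2.0), rng.uniform(0.0, np.pi)
    return lambda x: a * np.sin(x + p)

def sample_context(f, n=10, sigma=0.1):
    # Draw a context set O_f = {(x_i, y_i)} of n noisy observations,
    # with Gaussian observation noise xi ~ N(0, sigma^2).
    x = rng.uniform(-3.0, 3.0, size=n)
    y = f(x) + rng.normal(0.0, sigma, size=n)
    return np.stack([x, y], axis=1)   # shape (n, 2): rows are (x_i, y_i)

f = sample_function()
O_f = sample_context(f, n=10)
```

Each downstream task would then receive only a representation inferred from such an `O_f`, never the function `f` itself.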

2.2. BACKGROUND

While a variety of methods (Xu et al., 2019; Finn et al., 2017) take different approaches to few-shot learning for functions, the class of methods most relevant to our setting is that of conditional neural processes (CNPs) and NPs (Garnelo et al., 2018a; b).

Conditional Neural Processes (CNPs).

The key proposal in CNPs (applied to few-shot learning) is to express a distribution over predictor functions given a context set. To this end, they first encode the context $O_f$ into individual representations $r_i = h_\Phi(x_i, y_i)$ for all $i \in [N]$, where $h_\Phi$ is a neural network. The representations are then aggregated via a mean-pooling operation into a fixed-size representation.
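A minimal sketch of this CNP-style encoding step, assuming a toy two-layer network with random placeholder weights standing in for $h_\Phi$ (the layer sizes and names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for h_Phi: a two-layer ReLU network mapping a
# single (x_i, y_i) pair (here d = 2) to a code r_i with d_r = 8.
W1, b1 = rng.normal(size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)

def h_phi(xy):
    # Encode one context pair (x_i, y_i) into r_i.
    return np.maximum(xy @ W1 + b1, 0.0) @ W2 + b2

context = rng.normal(size=(5, 2))              # five (x_i, y_i) pairs
r_i = np.stack([h_phi(p) for p in context])    # (5, 8) per-pair codes
r = r_i.mean(axis=0)                           # mean-pooled aggregate
```

Because mean pooling is permutation-invariant, the aggregate does not depend on the ordering of the context pairs, which matches treating $O_f$ as a set.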



Figure 1: FCRL for self-supervised scene representation learning: the representations are learned by mapping the context points of each scene closer together in the latent space while separating them from those of other scenes. The context points are tuples of camera viewpoints $x_i \in \mathcal{X} := \mathbb{R}^d$ and the images taken from those viewpoints $y_i \in \mathcal{Y} \subseteq \mathbb{R}^d$. Shown here: TriFinger (Wüthrich et al., 2020).

