A PROBABILISTIC APPROACH TO SELF-SUPERVISED LEARNING USING CYCLICAL STOCHASTIC GRADIENT MCMC

Abstract

In this paper we present a practical Bayesian self-supervised learning method based on Cyclical Stochastic Gradient Hamiltonian Monte Carlo (cSGHMC). Within this framework, we place a prior over the parameters of a self-supervised learning model and use cSGHMC to approximate the high-dimensional and multimodal posterior distribution over the embeddings. By exploring an expressive posterior over the embeddings, Bayesian self-supervised learning produces interpretable and diverse representations. Marginalizing over these representations yields improved performance, calibration and out-of-distribution detection on downstream tasks. We provide experimental results on multiple classification tasks on five challenging datasets. Moreover, we demonstrate the effectiveness of the proposed method for out-of-distribution detection using the SVHN dataset.

1. INTRODUCTION

Self-supervised learning is a learning strategy in which the data themselves provide the labels (Jing & Tian (2020)). The aim of self-supervised learning is to learn useful representations of the input data without relying on human annotations (Zbontar et al. (2021)). Because they do not rely on annotated data, self-supervised methods have become an essential step in many areas such as natural language processing, computer vision and biomedicine (Jospin et al. (2020)). Contrastive methods (Chen et al. (2020)) are among the most promising self-supervised learning approaches; they learn representations by maximizing the similarity between embeddings obtained from different distorted versions of an image (Zbontar et al. (2021)). Several tricks have been proposed to overcome the issue of feature collapse, including the use of negative samples in SimCLR (Chen et al. (2020)) and the stop-gradient in BYOL (Grill et al. (2020)). Self-supervised models are often trained using stochastic optimization methods, which approximate the distribution over the parameters with a point mass and thus ignore the uncertainty in the parameter space. Indeed, if the regularizer imposed on the model parameters is viewed as the log of a prior distribution over the parameters, optimizing the cost function may be viewed as computing a maximum a-posteriori (MAP) estimate of the model parameters (Li et al. (2016)). Bayesian methods provide principled alternatives that model the whole posterior over the parameters and account for model uncertainty in the parameter space (Zhang et al. (2020)). In this paper we aim to adapt Bayesian supervised learning concepts to self-supervised learning, making a self-supervised learning model fully probabilistic using cSG-MCMC. Our motivation comes from the fact that the posterior distribution over the parameters of a self-supervised learning model may be multimodal and thus insufficiently represented by a single point estimate.
By exploring the posterior distribution over the parameters instead of a point mass, we aim to improve performance on downstream tasks. Moreover, the optimization step in cSG-MCMC injects Gaussian noise into the SGD parameter update, which helps to alleviate the feature-collapse issue in contrastive methods and makes the features more informative (Li et al. (2016)). In this paper, we propose a simple Bayesian formulation for self-supervised learning with a specific member of the cSG-MCMC family called Cyclical Stochastic Gradient Hamiltonian Monte Carlo (cSGHMC) (Zhang et al. (2020)). Within this framework, we use BYOL as the self-supervised learning model, which allows us to learn an approximate posterior over the parameters instead of a MAP estimate. Our experimental results indicate that by integrating Bayesian learning we achieve better performance on downstream tasks, including classification and out-of-distribution detection. The simplicity of the proposed approach is one of its greatest strengths.
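To make the noise-injection step concrete, the following is a minimal sketch of a single SGHMC update in the SGD-with-momentum parameterization of Chen et al. (2014). The function name and the choice of a scalar friction constant are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sghmc_step(theta, momentum, grad_log_post, lr, friction=0.05, rng=None):
    """One SGHMC update: SGD-with-momentum plus injected Gaussian noise.

    theta, momentum: parameter and momentum arrays
    grad_log_post:   stochastic gradient of the log-posterior at theta
    lr:              step size (in cSG-MCMC, drawn from a cyclical schedule)
    friction:        momentum-decay term alpha from Chen et al. (2014)
    """
    rng = rng or np.random.default_rng()
    # Noise variance 2 * alpha * lr keeps the stationary distribution
    # close to the target posterior (up to discretization error).
    noise = np.sqrt(2 * friction * lr) * rng.standard_normal(theta.shape)
    momentum = (1 - friction) * momentum + lr * grad_log_post + noise
    return theta + momentum, momentum
```

In cSG-MCMC, `lr` would come from a cyclical step-size schedule, with large steps for exploration early in each cycle and samples collected near the end of the cycle, when the step size is small.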

2. PROBLEM STATEMENT

Given a dataset D, a self-supervised learning model F_θ, parameterized by θ, aims to produce a representation Z_θ by solving a predefined proxy task. In this paper we wish to learn a distribution over the embeddings Z_θ by placing a prior over the parameters θ and using Bayesian learning instead of MAP estimation. To learn the representations we use BYOL, a recent self-supervised learning method based on contrastive learning. To obtain the distribution over the embeddings, we use cSGHMC. In the following, we first describe the self-supervised learning model used to learn representations. We then describe cSGHMC and highlight how it allows us to obtain a distribution over the embeddings.
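As an illustration of what marginalizing over the embeddings buys in a downstream task, the sketch below averages predictive distributions produced under several posterior samples of θ. The function name is hypothetical, and the per-sample callables stand in for classifiers fit on top of the sampled encoders.

```python
import numpy as np

def bayesian_model_average(prob_fns, x):
    """Monte Carlo approximation of p(y | x, D) by averaging the
    predictive distributions of S posterior parameter samples.

    prob_fns: list of S callables, one per sampled parameter setting;
              each maps an input batch to an (N, C) array of class probs.
    x:        input batch.
    """
    probs = np.stack([f(x) for f in prob_fns])  # shape (S, N, C)
    return probs.mean(axis=0)                   # marginalized predictions
```

Averaging probabilities (rather than picking a single sample) is what yields the calibration and out-of-distribution benefits reported later: inputs on which the samples disagree receive flatter averaged distributions.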

2.1. SELF SUPERVISED LEARNING

The aim of contrastive learning is to learn representations by contrasting two augmented views of an image. In particular, BYOL learns representations by minimizing a contrastive loss between two neural networks, referred to as the online network F_θ (parameterized by θ) and the target network F_ξ (parameterized by ξ). Each network consists of three components: an encoder f(.) (e.g., a ResNet-18), a projection head g(.) (e.g., an MLP) and a prediction head q(.) (e.g., an MLP). For a given minibatch X = {x_i}_{i=1}^N sampled from a dataset D, two distorted views, t(X) and t′(X), are produced via a distribution of data augmentations T. The two batches of distorted views are then fed to the online network and the target network, producing batches of embeddings Z_θ and Z_ξ, respectively. These features are then transformed by the projection heads into Y_θ and Y_ξ. The online network then outputs a prediction Q_θ of Y_ξ using the prediction head q_θ(.). Finally, the following mean squared error between the normalized predictions Q̄_θ and target projections Ȳ_ξ is defined:

L_{θ,ξ} = ‖Q̄_θ − Ȳ_ξ‖² = 2 − 2 ⟨Q̄_θ, Ȳ_ξ⟩ / (‖Q̄_θ‖ · ‖Ȳ_ξ‖).    (1)

A symmetric loss L̃_{θ,ξ} is computed by instead feeding t′(X) to the online network F_θ and t(X) to the target network F_ξ. Then, at each training step, a stochastic optimization step is performed to minimize

L^{BYOL}_{θ,ξ} = L_{θ,ξ} + L̃_{θ,ξ},    (2)

where the gradient is taken only with respect to θ. The parameter updates are

θ ← optimizer(θ, ∇_θ L^{BYOL}_{θ,ξ}),    (3)
ξ ← τξ + (1 − τ)θ,    (4)

where the weights ξ are an exponential moving average of the online network's parameters θ with a target decay rate τ ∈ [0, 1]. During training only the parameters θ of the online network F_θ are updated by gradient descent; at the end of training, the encoder f_θ(.) is kept for the downstream task.
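A minimal sketch of the normalized MSE loss and the exponential-moving-average target update described above, assuming the batches of predictions Q_θ and target projections Y_ξ have already been computed; plain NumPy arrays stand in for network outputs, and the parameter dictionaries are illustrative.

```python
import numpy as np

def byol_loss(q_pred, y_target):
    """Normalized MSE between online predictions and target projections:
    per example, ||q̄ − ȳ||² = 2 − 2 ⟨q, y⟩ / (||q|| ||y||), summed over
    the batch. q_pred, y_target: (N, D) arrays of embeddings."""
    q = q_pred / np.linalg.norm(q_pred, axis=1, keepdims=True)
    y = y_target / np.linalg.norm(y_target, axis=1, keepdims=True)
    return np.sum(2.0 - 2.0 * np.sum(q * y, axis=1))

def ema_update(xi, theta, tau=0.99):
    """Target-network update: xi <- tau * xi + (1 - tau) * theta,
    applied per parameter tensor (here, a dict of arrays/floats)."""
    return {k: tau * xi[k] + (1 - tau) * theta[k] for k in xi}
```

Note that the loss is bounded in [0, 4N]: aligned unit vectors contribute 0, anti-aligned ones contribute 4 each.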



Exact Bayesian learning of deep neural networks is generally intractable, hence Bayesian deep learning models use approximation methods such as variational inference (VI, Blundell et al. (2015)) or MCMC methods (Neal (2012)) to capture the posterior over the parameters and estimate model uncertainty. While VI methods usually approximate a single mode, MCMC methods can sample from different modes (Jospin et al. (2020)). In a recent line of work, Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) methods (Welling & Teh (2011), Chen et al. (2014), Ma et al. (2015)) were proposed, which couple MCMC with SGD to provide a promising sampling approach to inference in Bayesian deep learning on large datasets (Welling & Teh (2011)). In another work, Zhang et al. (2020) proposed Cyclical Stochastic Gradient MCMC (cSG-MCMC) in order to explore a highly multimodal parameter space within a realistic computational budget.
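The cyclical step-size schedule that distinguishes cSG-MCMC from plain SG-MCMC follows Zhang et al. (2020): over K total iterations split into M cycles, the step size follows a cosine that restarts at α₀ at the start of each cycle and decays toward zero at its end, where posterior samples are collected. A small sketch (the function name is ours):

```python
import math

def cyclical_lr(k, alpha0, total_iters, num_cycles):
    """Cosine cyclical step size (Zhang et al., 2020):
    alpha_k = alpha0/2 * [cos(pi * mod(k-1, ceil(K/M)) / ceil(K/M)) + 1],
    with k 1-indexed. Large steps early in a cycle encourage exploration
    of new modes; small steps late in a cycle allow accurate sampling.
    """
    cycle_len = math.ceil(total_iters / num_cycles)
    return alpha0 / 2 * (math.cos(math.pi * ((k - 1) % cycle_len) / cycle_len) + 1)
```

For example, with K = 100 iterations and M = 2 cycles, the step size starts at α₀ at iterations 1 and 51, and passes through α₀/2 halfway through each cycle.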

