PREDICTING WHAT YOU ALREADY KNOW HELPS: PROVABLE SELF-SUPERVISED LEARNING

Abstract

Self-supervised representation learning solves auxiliary prediction tasks (known as pretext tasks) that do not require labeled data, in order to learn semantic representations. These pretext tasks are created solely from the input features, such as predicting a missing image patch, recovering the color channels of an image from context, or predicting missing words in text; yet predicting this known information helps in learning representations that are effective for downstream prediction tasks. This paper posits a mechanism based on approximate conditional independence to formalize how solving certain pretext tasks can learn representations that provably decrease the sample complexity of downstream supervised tasks. Formally, we quantify how approximate independence between the components of the pretext task (conditional on the label and latent variables) allows us to learn representations that can solve the downstream task with drastically reduced sample complexity, by training just a linear layer on top of the learned representation.

1. INTRODUCTION

Self-supervised learning has revitalized machine learning models in computer vision, language modeling, and control problems (see, e.g., Jing & Tian, 2020; Kolesnikov et al., 2019; Devlin et al., 2018; Wang & Gupta, 2015; Jang et al., 2018). Training a model with auxiliary tasks based only on input features reduces the extensive costs of data collection and semantic annotation for downstream tasks; it is also known to improve the adversarial robustness of models (Hendrycks et al., 2019; Carmon et al., 2019; Chen et al., 2020a). Self-supervised learning creates pseudo labels solely from input features and solves the resulting auxiliary prediction tasks (pretext tasks) in a supervised manner. However, the underlying principles of self-supervised learning remain mysterious, since it is a priori unclear why predicting what we already know should help. We thus raise the following question:

What conceptual connection between pretext and downstream tasks ensures good representations?

What is a good way to quantify this? As a thought experiment, consider a simple downstream task of classifying desert, forest, and sea images. A meaningful pretext task is to predict the background color of an image (known as image colorization (Zhang et al., 2016)). Denote by X1, X2, and Y the input image, the color channel, and the downstream label, respectively. Given knowledge of the label Y, one can largely predict the background X2 without knowing much else about X1; in other words, X2 is approximately independent of X1 conditional on the label Y. Consider another task of inpainting (Pathak et al., 2016) the front of a building (X2) from the rest of the image (X1). While knowing the label "building" (Y) is not sufficient for successful inpainting, adding latent variables Z such as architectural style, location, and window positions ensures that the variation in X2 given Y, Z is small. We can interpret this mathematically as X1 being approximately conditionally independent of X2 given Y, Z. In settings with such conditional independence, the only way to solve the pretext task from X1 is to first implicitly predict Y and then predict X2 from Y. Thus, even without labeled data, the information about Y is encoded in the prediction of X2.

Contributions. We propose a mechanism based on approximate conditional independence (ACI) to explain why solving pretext tasks created from known information can learn representations that provably reduce downstream sample complexity. For instance, learned representation will
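The mechanism above can be illustrated in a minimal synthetic sketch. The setup below is an illustrative assumption, not an experiment from the paper: Y is a binary label, and X1, X2 are noisy linear views of Y, so exact conditional independence X1 ⊥ X2 | Y holds by construction. The class-mean vectors `mu1`, `mu2`, the dimensions, and the noise scale are all arbitrary choices for the demonstration. The pretext task is linear regression of X2 on X1 (using no labels), whose prediction serves as the representation ψ(X1); a linear head is then fit on ψ with only a handful of labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with exact conditional independence: X1 and X2 are
# independent noisy views of the same binary label Y.
n, d1, d2 = 5000, 50, 5
Y = rng.integers(0, 2, size=n).astype(float)
mu1 = rng.normal(size=d1)   # class-mean direction for view X1 (illustrative)
mu2 = rng.normal(size=d2)   # class-mean direction for view X2 (illustrative)
X1 = np.outer(Y, mu1) + rng.normal(scale=2.0, size=(n, d1))
X2 = np.outer(Y, mu2) + rng.normal(scale=2.0, size=(n, d2))

# Pretext task (no labels): linear regression predicting X2 from X1.
# Under conditional independence, the optimal prediction factors through Y,
# so psi = X1 @ W concentrates the label information in few directions.
W, *_ = np.linalg.lstsq(X1, X2, rcond=None)
psi = X1 @ W

# Downstream task: fit a linear head on psi with only a few labeled examples.
n_lab = 20
A = np.hstack([psi[:n_lab], np.ones((n_lab, 1))])
w, *_ = np.linalg.lstsq(A, Y[:n_lab], rcond=None)
pred = (np.hstack([psi, np.ones((n, 1))]) @ w > 0.5).astype(float)
acc_repr = (pred == Y).mean()

# Baseline: linear head directly on the raw, high-dimensional X1
# with the same few labels (prone to overfitting).
B = np.hstack([X1[:n_lab], np.ones((n_lab, 1))])
v, *_ = np.linalg.lstsq(B, Y[:n_lab], rcond=None)
pred_raw = (np.hstack([X1, np.ones((n, 1))]) @ v > 0.5).astype(float)
acc_raw = (pred_raw == Y).mean()

print(f"accuracy, linear head on psi(X1): {acc_repr:.3f}")
print(f"accuracy, linear head on raw X1:  {acc_raw:.3f}")
```

In this sketch the pretext regression compresses the 50-dimensional input into a low-dimensional representation whose signal is essentially one-dimensional, so the downstream linear head typically generalizes well from very few labels, whereas fitting 51 parameters on the raw input from the same 20 labels is far noisier. This mirrors the paper's claim about reduced downstream sample complexity.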

