SELF-SUPERVISED REPRESENTATION LEARNING WITH RELATIVE PREDICTIVE CODING

Abstract

This paper introduces Relative Predictive Coding (RPC), a new contrastive representation learning objective that maintains a good balance among training stability, minibatch size sensitivity, and downstream task performance. The key to the success of RPC is two-fold. First, RPC introduces relative parameters that regularize the objective for boundedness and low variance. Second, RPC contains no logarithm or exponential score functions, which are the main cause of training instability in prior contrastive objectives. We empirically verify the effectiveness of RPC on benchmark vision and speech self-supervised learning tasks. Lastly, we relate RPC to mutual information (MI) estimation, showing that RPC can be used to estimate MI with low variance.¹

1. INTRODUCTION

Unsupervised learning has drawn tremendous attention recently because it can extract rich representations without label supervision. Self-supervised learning, a subset of unsupervised learning, learns representations by allowing the data to provide supervision (Devlin et al., 2018). Among its mainstream strategies, self-supervised contrastive learning has been successful in visual object recognition (He et al., 2020; Tian et al., 2019; Chen et al., 2020c), speech recognition (Oord et al., 2018; Rivière et al., 2020), language modeling (Kong et al., 2019), graph representation learning (Velickovic et al., 2019), and reinforcement learning (Kipf et al., 2019). The idea of self-supervised contrastive learning is to learn latent representations such that related instances (e.g., patches from the same image; defined as positive pairs) have representations within close distance, while unrelated instances (e.g., patches from two different images; defined as negative pairs) have distant representations (Arora et al., 2019).

Prior work has formulated contrastive learning objectives as maximizing the divergence between the distributions of related and unrelated instances. In this regard, different divergence measures often lead to different loss function designs. For example, variational mutual information (MI) estimation (Poole et al., 2019) inspires Contrastive Predictive Coding (CPC) (Oord et al., 2018). Note that MI is also the KL-divergence between the distributions of related and unrelated instances (Cover & Thomas, 2012).

While the choices of contrastive learning objectives are abundant (Hjelm et al., 2018; Poole et al., 2019; Ozair et al., 2019), we point out three challenges faced by existing methods. The first challenge is training stability: an unstable training process with high variance may be problematic. For example, Hjelm et al. (2018), Tschannen et al. (2019), and Tsai et al. (2020b) show that contrastive objectives with large variance cause numerical issues and that their learned representations perform poorly on downstream tasks. The second challenge is sensitivity to minibatch size: objectives that require a huge minibatch size may be restricted in practical use. For instance, SimCLRv2 (Chen et al., 2020c) uses CPC as its contrastive objective and reaches state-of-the-art performance on multiple self-supervised and semi-supervised benchmarks. Nonetheless, the objective is trained with a minibatch size of 8,192, and training at this scale requires enormous computational power. The third challenge is downstream task performance, which is the one we would like to emphasize the most. For this reason, in most cases, CPC



¹ Project page: https://github.com/martinmamql/relative_predictive_coding

