SELF-SUPERVISED REPRESENTATION LEARNING WITH RELATIVE PREDICTIVE CODING

Abstract

This paper introduces Relative Predictive Coding (RPC), a new contrastive representation learning objective that maintains a good balance among training stability, minibatch size sensitivity, and downstream task performance. The key to the success of RPC is two-fold. First, RPC introduces relative parameters to regularize the objective for boundedness and low variance. Second, RPC contains no logarithm or exponential score functions, which are the main causes of training instability in prior contrastive objectives. We empirically verify the effectiveness of RPC on benchmark vision and speech self-supervised learning tasks. Lastly, we relate RPC to mutual information (MI) estimation, showing that RPC can be used to estimate MI with low variance.

1. INTRODUCTION

Unsupervised learning has drawn tremendous attention recently because it can extract rich representations without label supervision. Self-supervised learning, a subset of unsupervised learning, learns representations by allowing the data to provide supervision (Devlin et al., 2018). Among its mainstream strategies, self-supervised contrastive learning has been successful in visual object recognition (He et al., 2020; Tian et al., 2019; Chen et al., 2020c), speech recognition (Oord et al., 2018; Rivière et al., 2020), language modeling (Kong et al., 2019), graph representation learning (Velickovic et al., 2019) and reinforcement learning (Kipf et al., 2019). The idea of self-supervised contrastive learning is to learn latent representations such that related instances (e.g., patches from the same image; defined as positive pairs) will have representations within close distance, while unrelated instances (e.g., patches from two different images; defined as negative pairs) will have distant representations (Arora et al., 2019). Prior work has formulated contrastive learning objectives as maximizing the divergence between the distributions of related and unrelated instances. In this regard, different divergence measurements often lead to different loss function designs. For example, variational mutual information (MI) estimation (Poole et al., 2019) inspires Contrastive Predictive Coding (CPC) (Oord et al., 2018). Note that MI is also the KL-divergence between the distributions of related and unrelated instances (Cover & Thomas, 2012). While the choices of contrastive learning objectives are abundant (Hjelm et al., 2018; Poole et al., 2019; Ozair et al., 2019), we point out three challenges faced by existing methods. The first challenge is training stability, where an unstable training process with high variance may be problematic. For example, Hjelm et al. (2018); Tschannen et al. (2019); Tsai et al.
(2020b) show that contrastive objectives with large variance cause numerical issues and yield poor downstream performance with their learned representations. The second challenge is the sensitivity to minibatch size, where objectives requiring a huge minibatch size may restrict their practical usage. For instance, SimCLRv2 (Chen et al., 2020c) utilizes CPC as its contrastive objective and reaches state-of-the-art performance on multiple self-supervised and semi-supervised benchmarks. Nonetheless, the objective is trained with a minibatch size of 8,192, and this scale of training requires enormous computational power. The third challenge is the downstream task performance, which is the one that we would like to emphasize the most; for this reason, CPC, which performs best among prior objectives in most cases, serves as our main point of comparison.

2.1. CONTRASTIVE REPRESENTATION LEARNING

When sampling a pair of representations (x, y) from their joint distribution ((x, y) ∼ P_XY), this pair is defined as a positive pair; when sampling from the product of marginals ((x, y) ∼ P_X P_Y), the pair is defined as a negative pair. Tsai et al. (2020b) formalize this idea such that the contrastiveness of the representations can be measured by the divergence between P_XY and P_X P_Y, where higher divergence suggests better contrastiveness. To better understand prior contrastive learning objectives, we categorize them in terms of the divergence measurements between P_XY and P_X P_Y, with their detailed objectives presented in Table 1. We instantiate the discussion using Contrastive Predictive Coding (Oord et al., 2018, J_CPC), which is a lower bound of D_KL(P_XY ‖ P_X P_Y), with D_KL referring to the KL-divergence:

J_CPC(X, Y) := sup_{f ∈ F} E_{(x, y_1) ∼ P_XY, {y_j}_{j=2}^{N} ∼ P_Y} [ log ( e^{f(x, y_1)} / ( (1/N) Σ_{j=1}^{N} e^{f(x, y_j)} ) ) ].   (1)

Oord et al. (2018) then propose to maximize J_CPC(X, Y), so that the learned representations X and Y have high contrastiveness.
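As a concrete illustration, the empirical J_CPC objective (equation 1) for one minibatch can be written in a few lines of NumPy. This is a minimal sketch of our own with a generic score matrix, not the authors' implementation:

```python
import numpy as np

def j_cpc(scores):
    """Empirical J_CPC (equation 1) for one minibatch.

    scores: (B, N) array of score-function values f(x_i, y_j); column 0 holds
    the positive-pair score f(x_i, y_1), the remaining columns score x_i
    against N - 1 unrelated samples.
    """
    pos = scores[:, 0]
    # log( e^{f(x, y_1)} / ((1/N) * sum_j e^{f(x, y_j)}) )
    log_mean_exp = np.log(np.mean(np.exp(scores), axis=1))
    return float(np.mean(pos - log_mean_exp))

# Each per-example term is at most log N, so the estimate never exceeds
# log N (here N = 8), matching the bound discussed in Section 2.2.
rng = np.random.default_rng(0)
assert j_cpc(rng.normal(size=(32, 8))) <= np.log(8)
```

Maximizing this value over the score function f recovers the usual InfoNCE-style training signal.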
We note that J_CPC has been commonly used in many recent self-supervised representation learning frameworks (He et al., 2020; Chen et al., 2020b), where they constrain the function to be f(x, y) = cosine(x, y), with cosine(·) being cosine similarity. Under this function design, maximizing J_CPC leads the representations of related pairs to be close and the representations of unrelated pairs to be distant. The category of modeling D_KL(P_XY ‖ P_X P_Y) also includes the Donsker-Varadhan objective (J_DV (Donsker & Varadhan, 1975; Belghazi et al., 2018)) and the Nguyen-Wainwright-Jordan objective (J_NWJ (Nguyen et al., 2010; Belghazi et al., 2018)), where Belghazi et al. (2018); Tsai et al. (2020b) show that J_DV(X, Y) = J_NWJ(X, Y) = D_KL(P_XY ‖ P_X P_Y). The category of modeling D_JS(P_XY ‖ P_X P_Y) (with D_JS referring to the Jensen-Shannon divergence) includes the Jensen-Shannon objective J_JS (Nowozin et al., 2016; Hjelm et al., 2018), where J_JS(X, Y) = 2(D_JS(P_XY ‖ P_X P_Y) − log 2). The instance of modeling D_Wass(P_XY ‖ P_X P_Y) is the Wasserstein Predictive Coding J_WPC (Ozair et al., 2019), where J_WPC(X, Y) modifies the J_CPC(X, Y) objective (equation 1) by searching the function from F_L instead of F. F_L denotes any class of 1-Lipschitz continuous functions from (X × Y) to R, and thus F_L ⊂ F. Ozair et al. (2019) show that J_WPC(X, Y) is a lower bound of both D_KL(P_XY ‖ P_X P_Y) and D_Wass(P_XY ‖ P_X P_Y). See Table 1 for all the equations. To conclude, contrastive representation learning objectives are unsupervised representation learning methods that maximize the distribution divergence between P_XY and P_X P_Y. The learned representations exhibit high contrastiveness, and recent work (Arora et al., 2019; Tsai et al., 2020a) theoretically shows that highly-contrastive representations can improve the performance on downstream tasks. After discussing prior contrastive representation learning objectives, we point out three challenges in their practical deployments: training stability, sensitivity to minibatch training size, and downstream task performance.
In particular, the three challenges can hardly be handled well at the same time; we highlight the conclusions in Table 1. Training Stability: The training stability relates closely to the variance of the objectives, where Song & Ermon (2019) show that J_DV and J_NWJ exhibit inevitably high variance due to their inclusion of the exponential function. As pointed out by Tsai et al. (2020b), J_CPC, J_WPC, and J_JS have better training stability because J_CPC and J_WPC can be realized as a multi-class classification task and J_JS can be realized as a binary classification task. The cross-entropy loss adopted in J_CPC, J_WPC, and J_JS is highly optimized and stable in existing optimization packages (Abadi et al., 2016; Paszke et al., 2019). Sensitivity to minibatch training size: Among all the prior contrastive representation learning methods, J_CPC is known to be sensitive to the minibatch training size (Ozair et al., 2019). Taking a closer look at equation 1, J_CPC deploys an instance selection such that y_1 should be selected from {y_1, y_2, ..., y_N}, with (x, y_1) ∼ P_XY and (x, y_{j>1}) ∼ P_X P_Y, where N is the minibatch size. Previous work (Poole et al., 2019; Song & Ermon, 2019; Chen et al., 2020b; Caron et al., 2020) showed that a large N results in a more challenging instance selection and forces J_CPC to achieve better contrastiveness of y_1 (the related instance for x) against {y_j}_{j=2}^{N} (the unrelated instances for x). J_DV, J_NWJ, and J_JS do not consider instance selection, and J_WPC reduces the minibatch size sensitivity by enforcing a 1-Lipschitz constraint. Downstream Task Performance: The downstream task performance is what we care about most among the three challenges. J_CPC has been the most popular objective as it manifests superior performance over the other alternatives (Tschannen et al., 2019; Tsai et al., 2020b; a).
We note that although J_WPC shows better performance on the Omniglot (Lake et al., 2015) and CelebA (Liu et al., 2015) datasets, we empirically find that it does not generalize well to CIFAR-10/-100 (Krizhevsky et al., 2009) and ImageNet (Russakovsky et al., 2015).

2.2. RELATIVE PREDICTIVE CODING

In this paper, we present Relative Predictive Coding (RPC), which achieves a good balance among the three challenges mentioned above:

J_RPC(X, Y) := sup_{f ∈ F} E_{P_XY}[f(x, y)] − α E_{P_X P_Y}[f(x, y)] − (β/2) E_{P_XY}[f²(x, y)] − (γ/2) E_{P_X P_Y}[f²(x, y)],   (2)

where α > 0, β > 0, γ > 0 are hyper-parameters that we define as the relative parameters. Intuitively, J_RPC contains no logarithm or exponential, potentially preventing unstable training due to numerical issues. We now discuss the roles of α, β, and γ. At first glance, α acts to discourage the scores of P_XY and P_X P_Y from being close, while β and γ act as ℓ2 regularization coefficients that stop f from becoming large. On a deeper level, the relative parameters act to regularize our objective for boundedness and low variance. To support this claim, we first present the following lemma:

Lemma 1 (Optimal Solution for J_RPC) Let r(x, y) = p(x, y) / (p(x)p(y)) be the density ratio. J_RPC has the optimal solution

f*(x, y) = (r(x, y) − α) / (β r(x, y) + γ) := r_{α,β,γ}(x, y), with −α/γ ≤ r_{α,β,γ} ≤ 1/β.

Lemma 1 suggests that J_RPC attains its supremum at the ratio r_{α,β,γ}(x, y) indexed by the relative parameters α, β, γ (hence we term r_{α,β,γ}(x, y) the relative density ratio). We note that r_{α,β,γ}(x, y) is an increasing function w.r.t. r(x, y) and is nicely bounded even when r(x, y) is large. We will now show that the bounded r_{α,β,γ} implies that the empirical estimation of J_RPC enjoys boundedness and low variance. In particular, let {x_i, y_i}_{i=1}^{n} be n samples drawn uniformly at random from P_XY and {x_j, y_j}_{j=1}^{m} be m samples drawn uniformly at random from P_X P_Y. Then, we use neural networks to empirically estimate J_RPC as Ĵ^{m,n}_RPC:

Definition 1 (Ĵ^{m,n}_RPC, empirical estimation of J_RPC) We parametrize f via a family of neural networks F_Θ := {f_θ : θ ∈ Θ ⊆ R^d}, where d ∈ N and Θ is compact.
Then,

Ĵ^{m,n}_RPC = sup_{f_θ ∈ F_Θ} (1/n) Σ_{i=1}^{n} f_θ(x_i, y_i) − (α/m) Σ_{j=1}^{m} f_θ(x_j, y_j) − (β/2n) Σ_{i=1}^{n} f_θ²(x_i, y_i) − (γ/2m) Σ_{j=1}^{m} f_θ²(x_j, y_j).

Proposition 1 (Boundedness of Ĵ^{m,n}_RPC, informal) 0 ≤ J_RPC ≤ 1/(2β) + α²/(2γ). Moreover, with probability at least 1 − δ, |J_RPC − Ĵ^{m,n}_RPC| = O(√((d + log(1/δ)) / n′)), where n′ = min{n, m}.

Proposition 2 (Variance of Ĵ^{m,n}_RPC, informal) There exist universal constants c_1 and c_2 that depend only on α, β, γ, such that Var[Ĵ^{m,n}_RPC] = O(c_1/n + c_2/m).

From the two propositions, when m and n are large (i.e., the sample sizes are large), Ĵ^{m,n}_RPC is bounded and its variance vanishes to 0. First, the boundedness of Ĵ^{m,n}_RPC suggests that Ĵ^{m,n}_RPC will not grow to extremely large or small values. Prior contrastive learning objectives with good training stability (e.g., J_CPC/J_JS/J_WPC) also have bounded objective values; for instance, the empirical estimation of J_CPC is at most log N (equation 1) (Poole et al., 2019). Nevertheless, J_CPC often performs best only when the minibatch size is large, and the empirical performances of J_JS and J_WPC are not as competitive as that of J_CPC. Second, the upper bound on the variance implies that the training of Ĵ^{m,n}_RPC can be stable, and in practice we observe a much smaller value than the stated upper bound. On the contrary, Song & Ermon (2019) show that the empirical estimations of J_DV and J_NWJ exhibit inevitable variances that grow exponentially with the true D_KL(P_XY ‖ P_X P_Y). Lastly, similar to prior contrastive learning objectives that relate to distribution divergence measurements, we associate J_RPC with the Chi-square divergence D_χ²(P_XY ‖ P_X P_Y) = E_{P_X P_Y}[r²(x, y)] − 1 (Nielsen & Nock, 2013); the derivations are provided in the Appendix. By taking P = (β/(β+γ)) P_XY + (γ/(β+γ)) P_X P_Y as the mixture distribution of P_XY and P_X P_Y, we can rewrite J_RPC(X, Y) as J_RPC(X, Y) = ((β+γ)/2) E_P[r²_{α,β,γ}(x, y)].
Hence, J_RPC can be regarded as a generalization of D_χ² with the relative parameters α, β, γ, where D_χ² can be recovered from J_RPC by specializing α = 0, β = 0 and γ = 1 (i.e., D_χ² = 2 J_RPC|_{α=β=0, γ=1} − 1). Note that J_RPC may not be a formal divergence measure for arbitrary α, β, γ.
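The empirical objective Ĵ^{m,n}_RPC from Definition 1 reduces to a few lines of array arithmetic. Below is a minimal NumPy sketch of our own (not the released code); pos and neg denote score-function outputs on samples from P_XY and P_X P_Y, and the default relative parameters follow the CIFAR-10 setting reported in Section 3.4:

```python
import numpy as np

def j_rpc_hat(pos, neg, alpha=1.0, beta=0.005, gamma=1.0):
    """Empirical J_RPC of Definition 1 for one minibatch.

    pos: scores f_theta(x_i, y_i) on n related pairs drawn from P_XY
    neg: scores f_theta(x_j, y_j) on m unrelated pairs drawn from P_X P_Y
    During training we maximize this value (minimize its negation as a loss).
    """
    return float(np.mean(pos)
                 - alpha * np.mean(neg)
                 - 0.5 * beta * np.mean(pos ** 2)
                 - 0.5 * gamma * np.mean(neg ** 2))

# Pointwise, f - (beta/2) f^2 <= 1/(2*beta) and -alpha*f - (gamma/2) f^2
# <= alpha^2/(2*gamma), so the estimate respects the Proposition 1 upper
# bound 1/(2*beta) + alpha^2/(2*gamma) for any scores.
rng = np.random.default_rng(1)
val = j_rpc_hat(rng.normal(1.0, 0.1, 512), rng.normal(0.0, 0.1, 512),
                alpha=1.0, beta=1.0, gamma=1.0)
assert val <= 1.0  # 1/(2*1) + 1^2/(2*1)
```

Note that, unlike equation 1, no per-example normalization over the minibatch is required, which is why no logarithm or exponential appears.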

3. EXPERIMENTS

We provide an overview of the experimental section. First, we conduct benchmark self-supervised representation learning tasks spanning visual object classification and speech recognition. This set of experiments is designed to probe the three challenges of contrastive representation learning objectives: downstream task performance (Section 3.1), training stability (Section 3.2), and minibatch size sensitivity (Section 3.3). We also provide an ablation study on the choices of the relative parameters in J_RPC (Section 3.4). In these experiments, we find that J_RPC achieves lower variance during training, lower minibatch size sensitivity, and consistent performance improvements. Second, we relate J_RPC to mutual information (MI) estimation (Section 3.5). The connection is that MI is an average statistic of the density ratio, and we have shown that the optimal solution of J_RPC is the relative density ratio (see Lemma 1); thus we can estimate MI using the density ratio transformed from the optimal solution of J_RPC. In both sets of experiments, we fairly compare J_RPC with other contrastive learning objectives. In particular, across different objectives, we fix the network, learning rate, optimizer, and batch size (we use the default configurations suggested by the original implementations from Chen et al. (2020c), Rivière et al. (2020) and Tsai et al. (2020b)); the only difference is the objective itself. In what follows, we perform the first set of experiments and defer experimental details to the Appendix. Datasets. For visual object classification, we consider CIFAR-10/-100 (Krizhevsky et al., 2009), STL-10 (Coates et al., 2011), and ImageNet (Russakovsky et al., 2015). CIFAR-10/-100 and ImageNet contain labeled images only, while STL-10 contains labeled and unlabeled images.
For speech recognition, we consider the LibriSpeech-100h (Panayotov et al., 2015) dataset, which contains 100 hours of 16kHz English speech from 251 speakers with 41 types of phonemes. Training and Evaluation Details. For the vision experiments, we follow the setup of SimCLRv2 (Chen et al., 2020c), which considers visual object recognition as its downstream task. For the speech experiments, we follow the setup of prior work (Oord et al., 2018; Rivière et al., 2020), which considers phoneme classification and speaker identification as the downstream tasks. We briefly discuss the training and evaluation details in three modules: 1) related and unrelated data construction, 2) pre-training, and 3) fine-tuning and evaluation. For more details, please refer to the Appendix or the original implementations. Related and Unrelated Data Construction. In the vision experiments, we construct related images by applying different augmentations to the same image. Hence, when (x, y) ∼ P_XY, x and y are the same image under different augmentations. Unrelated images are two randomly selected samples. In the speech experiments, we define the current latent feature (the feature at time t) and future samples (samples at time > t) as related data. In other words, the feature in the latent space should contain information that can be used to infer future time steps. A latent feature and randomly selected samples are considered unrelated data. Pre-training. The pre-training stage refers to self-supervised training with a contrastive learning objective. Our training objective is defined in Definition 1, where we use neural networks to parametrize the score function, trained on the constructed related and unrelated data. Convolutional neural networks are used for the vision experiments; Transformers (Vaswani et al., 2017) and LSTMs (Hochreiter & Schmidhuber, 1997) are used for the speech experiments. Fine-tuning and Evaluation.
After the pre-training stage, we fix the parameters in the pre-trained networks and add a small fine-tuning network on top of them. Then, we fine-tune this small network with the downstream labels in the data's training split. For the fine-tuning network, both vision and speech experiments consider multi-layer perceptrons. Last, we evaluate the fine-tuned representations on the data's test split. We would like to point out that we do not normalize the hidden representations encoded by the pre-trained network for the loss calculation. This hidden normalization technique is widely applied (Tian et al., 2019; Chen et al., 2020b; c) to stabilize training and increase performance for prior objectives, but we find it unnecessary for J_RPC.

Our results differ from Chen et al. (2020c) because we only train for 100 epochs rather than 800, as running 800 epochs uninterruptedly on cloud TPU is very expensive, and because we did not employ a memory buffer (He et al., 2020) to store negative samples. We also provide the results from fully supervised models as a comparison (Chen et al., 2020b; c). Fully supervised training performs worse on STL-10 because it does not employ the unlabeled samples in the dataset (Löwe et al., 2019).
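The related/unrelated data construction described above for the vision experiments can be sketched as follows. This is a minimal illustration of our own; augment stands in for the stochastic augmentation pipeline (e.g., random crop plus color distortion), which we do not reproduce here:

```python
import random

def make_pairs(images, augment, batch):
    """Sample related (positive) and unrelated (negative) pairs.

    A positive pair applies two independent augmentations to the same image
    (a draw from P_XY); a negative pair augments two different, randomly
    chosen images (approximating a draw from P_X P_Y).
    """
    idx = random.sample(range(len(images)), batch)
    positives = [(augment(images[i]), augment(images[i])) for i in idx]
    negatives = [(augment(images[i]),
                  augment(images[random.choice([k for k in range(len(images))
                                                if k != i])]))
                 for i in idx]
    return positives, negatives
```

The scores of the positive and negative pairs then feed directly into the empirical objective of Definition 1.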

3.1. DOWNSTREAM TASK PERFORMANCES ON VISION AND SPEECH

For the downstream task performance in the vision domain, we test the proposed J_RPC and other contrastive learning objectives on CIFAR-10/-100 (Krizhevsky et al., 2009), STL-10 (Coates et al., 2011), and ImageNet ILSVRC-2012 (Russakovsky et al., 2015). Here we report the best performance J_RPC attains on each dataset (we include experimental details in A.7). Table 2 shows that the proposed J_RPC outperforms other objectives on all datasets. Using J_RPC on the largest network (a ResNet with depth 152, channel width 2× and selective kernels), the performance improves from 77.80% with J_CPC to 78.40% with J_RPC. Regarding speech representation learning, the downstream performances for phoneme and speaker classification are shown in Table 3 (we defer experimental details to Appendix A.9). Compared to J_CPC, J_RPC improves the phoneme classification result by 4.8 percentage points and the speaker classification result by 0.3 percentage points, bringing it closer to the fully supervised model. Overall, the proposed J_RPC performs better than other unsupervised learning objectives on both the phoneme classification and speaker classification tasks.

3.2. TRAINING STABILITY

We provide empirical training stability comparisons of J_DV, J_NWJ, J_CPC and J_RPC by plotting the values of the objectives as training proceeds. We apply the four objectives to the SimCLRv2 framework and train on the CIFAR-10 dataset. All training setups are exactly the same except for the objectives. In our experiments, J_DV and J_NWJ soon explode to NaN and disrupt training (shown as early stopping in Figure 1a; extremely large values are not plotted due to scale constraints). On the other hand, J_RPC and J_CPC have low variance, and both enjoy stable training. As a result, downstream performance with representations learned from the unstable J_DV and J_NWJ suffers, while representations learned with J_RPC and J_CPC work much better.

3.3. MINIBATCH SIZE SENSITIVITY

We then analyze the effect of minibatch size on J_RPC and J_CPC, since J_CPC is known to be sensitive to minibatch size (Poole et al., 2019). We train SimCLRv2 (Chen et al., 2020c) on CIFAR-10 and the model from Rivière et al. (2020) on LibriSpeech-100h using J_RPC and J_CPC with different minibatch sizes. The settings of the relative parameters are the same as in Section 3.2. From Figures 1b and 1c, we observe that both J_RPC and J_CPC achieve their optimal performance at a large minibatch size. However, when the minibatch size decreases, J_CPC shows higher sensitivity and suffers more when the number of minibatch samples is small. The result suggests that the proposed method may be less sensitive to changes in minibatch size than J_CPC under the same training settings.

3.4. EFFECT OF RELATIVE PARAMETERS

We study the effect of different combinations of the relative parameters in J_RPC by comparing downstream performances on visual object recognition. We train SimCLRv2 on CIFAR-10 with different combinations of α, β and γ in J_RPC and fix all other experimental settings. We choose α ∈ {0, 0.001, 1.0}, β ∈ {0, 0.001, 1.0}, γ ∈ {0, 0.001, 1.0} and report the best performance under each combination of α, β, and γ. From Figure 2, we first observe that α > 0 gives better downstream performance than α = 0 when β and γ are fixed. This observation is as expected, since α > 0 encourages the representations of related and unrelated samples to be pushed apart. Then, we find that a small but nonzero β (β = 0.001) and a large γ (γ = 1.0) give the best performance compared to other combinations. Since β and γ serve as the coefficients of ℓ2 regularization, the results imply that the regularization is a strong and sensitive factor influencing performance. The results here are not as competitive as those in Table 2 because the CIFAR-10 result reported in Table 2 uses a set of relative parameters (α = 1.0, β = 0.005, γ = 1.0) that differs from the combinations in this subsection; we also use a quite different range of γ on ImageNet (see A.7 for details). In conclusion, we find empirically that a non-zero α, a small β and a large γ lead to the optimal representation for the downstream task on CIFAR-10.
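The ablation above amounts to a small grid search over (α, β, γ). A minimal sketch, where train_and_evaluate is a hypothetical stand-in for pre-training with J_RPC and evaluating the fine-tuned representation (it is not part of the released code):

```python
from itertools import product

def best_relative_params(train_and_evaluate,
                         alphas=(0, 0.001, 1.0),
                         betas=(0, 0.001, 1.0),
                         gammas=(0, 0.001, 1.0)):
    """Exhaustively score every (alpha, beta, gamma) combination and return
    the best-performing one together with its downstream accuracy."""
    results = {combo: train_and_evaluate(*combo)
               for combo in product(alphas, betas, gammas)}
    best = max(results, key=results.get)
    return best, results[best]
```

In practice each train_and_evaluate call is a full pre-training plus fine-tuning run, so the grids are kept deliberately coarse.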

3.5. RELATION TO MUTUAL INFORMATION ESTIMATION

The presented approach also closely relates to mutual information estimation. For random variables X and Y with joint distribution P_XY and product of marginals P_X P_Y, the mutual information is defined as I(X; Y) = D_KL(P_XY ‖ P_X P_Y). Lemma 1 states that, given the optimal solution f*(x, y) of J_RPC, we can recover the density ratio r(x, y) := p(x, y)/(p(x)p(y)) as

r(x, y) = (γ/β + α) / (1 − β f*(x, y)) − γ/β.

We can empirically estimate r(x, y) from the estimated f(x, y) via this transformation, and use r(x, y) to estimate mutual information (Tsai et al., 2020b). Specifically, I(X; Y) ≈ (1/n) Σ_{i=1}^{n} log r(x_i, y_i), where the (x_i, y_i) are drawn uniformly at random from the empirical distribution of P_XY.

Figure 2: Heatmaps of downstream task performance on CIFAR-10, using different α, β and γ in J_RPC. We conclude that a nonzero α, a small β (β = 0.001) and a large γ (γ = 1.0) are crucial for better performance.

The resulting estimate is neither a lower bound nor an upper bound of mutual information, but it can achieve accurate estimates when the underlying mutual information is large. J_RPC exhibits comparable bias and lower variance compared to the SMILE method, and is more stable than the DoE method. We would like to highlight our method's low-variance property: we neither clip the values of the estimated density ratio nor impose an upper bound on our estimated mutual information.
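The transformation from the learned score back to a density ratio, and the resulting MI estimate, can be sketched as follows (a minimal illustration of our own; f_pos denotes score values evaluated on positive samples):

```python
import numpy as np

def ratio_from_f(f, alpha, beta, gamma):
    """Invert Lemma 1: from f* = (r - alpha) / (beta*r + gamma), recover
    r = (gamma/beta + alpha) / (1 - beta*f*) - gamma/beta."""
    return (gamma / beta + alpha) / (1.0 - beta * f) - gamma / beta

def mi_estimate(f_pos, alpha, beta, gamma):
    """I(X; Y) ~= (1/n) * sum_i log r(x_i, y_i) over samples from P_XY."""
    return float(np.mean(np.log(ratio_from_f(f_pos, alpha, beta, gamma))))

# Sanity check: mapping r -> f* -> r is the identity.
alpha, beta, gamma = 1.0, 0.005, 1.0
r = np.array([0.1, 1.0, 5.0, 50.0])
f = (r - alpha) / (beta * r + gamma)
assert np.allclose(ratio_from_f(f, alpha, beta, gamma), r)
```

Note that no clipping of the recovered ratio is applied, consistent with the low-variance property highlighted above.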

4. RELATED WORK

As a subset of unsupervised representation learning, self-supervised representation learning (SSL) adopts self-defined signals as supervision and uses the learned representation for downstream tasks, such as object detection and image captioning (Liu et al., 2020). We categorize SSL work into two groups, depending on whether the supervision signal is the input's hidden property or a corresponding view of the input. In the first group, for example, the Jigsaw puzzle task (Noroozi & Favaro, 2016) shuffles image patches and defines the SSL task as predicting the shuffled positions of the patches. Other instances are Predicting Rotations (Gidaris et al., 2018) and Shuffle & Learn (Misra et al., 2016). In the second group, the SSL task aims at modeling the co-occurrence of multiple views of the data, via contrastive or predictive learning objectives (Tsai et al., 2020a). The predictive objectives encourage reconstruction from one view of the data to the other, such as predicting the lower part of an image from its upper part (ImageGPT by Chen et al. (2020a)). Comparing the contrastive and predictive learning approaches, Tsai et al. (2020a) point out that the former requires fewer computational resources for good performance but suffers more from the over-fitting problem. Theoretical analysis (Arora et al., 2019; Tsai et al., 2020a; Tosh et al., 2020) suggests that contrastively learned representations can lead to good downstream performance. Beyond the theory, Tian et al. (2020) show that what matters most for performance are 1) the choice of the contrastive learning objective, and 2) the creation of the positive and negative data pairs in the contrastive objective. Recent work (Khosla et al., 2020) extends the usage of contrastive learning from the self-supervised setting to the supervised setting.
The supervised setting defines positive pairs as data from the same class in the contrastive objective, while the self-supervised setting defines positive pairs as the same datum under different augmentations. Our work also closely relates to skewed divergence measurement between distributions (Lee, 1999; 2001; Nielsen, 2010; Yamada et al., 2013). Recall that the relative parameters play a crucial role in regularizing our objective for boundedness and low variance. This idea is similar to skewed divergence measurement: when calculating the divergence between distributions P and Q, instead of considering D(P ‖ Q), these approaches consider D(P ‖ αP + (1 − α)Q), with D representing the divergence and 0 < α < 1. A natural example is the Jensen-Shannon divergence, a symmetric skewed KL divergence: D_JS(P ‖ Q) = 0.5 D_KL(P ‖ 0.5P + 0.5Q) + 0.5 D_KL(Q ‖ 0.5P + 0.5Q). Compared to its non-skewed counterpart, the skewed divergence has been shown to admit more robust estimation of its value (Lee, 1999; 2001; Yamada et al., 2013). Different from these works, which focus on estimating the values of distribution divergences, we focus on learning self-supervised representations.
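The skewed-KL identity for the Jensen-Shannon divergence quoted above is easy to check numerically on discrete distributions (a standalone sanity check of the identity, not part of the RPC method):

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (arrays summing to 1)."""
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """Jensen-Shannon divergence written as a symmetric skewed KL:
    D_JS(P||Q) = 0.5*KL(P||M) + 0.5*KL(Q||M), with M = 0.5*P + 0.5*Q."""
    m = 0.5 * p + 0.5 * q
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
# Unlike the plain KL, D_JS is symmetric and bounded by log 2.
assert abs(js(p, q) - js(q, p)) < 1e-12
assert 0.0 <= js(p, q) <= np.log(2)
```

Skewing toward the mixture M keeps every log argument bounded away from zero, which is the same intuition behind the robustness of the relative parameters in J_RPC.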

5. CONCLUSION

In this work, we present RPC, the Relative Predictive Coding, that achieves a good balance among the three challenges when modeling a contrastive learning objective: training stability, sensitivity to minibatch size, and downstream task performance. We believe this work brings an appealing option for training self-supervised models and inspires future work to design objectives for balancing the aforementioned three challenges. In the future, we are interested in applying RPC in other application domains and developing more principled approaches for better representation learning.

A APPENDIX

A.1 PROOF OF LEMMA 1 IN THE MAIN TEXT

Lemma 2 (Optimal Solution for J_RPC, restating Lemma 1 in the main text) Let

J_RPC(X, Y) := sup_{f ∈ F} E_{P_XY}[f(x, y)] − α E_{P_X P_Y}[f(x, y)] − (β/2) E_{P_XY}[f²(x, y)] − (γ/2) E_{P_X P_Y}[f²(x, y)]

and let r(x, y) = p(x, y)/(p(x)p(y)) be the density ratio. J_RPC has the optimal solution

f*(x, y) = (r(x, y) − α) / (β r(x, y) + γ) := r_{α,β,γ}(x, y), with −α/γ ≤ r_{α,β,γ} ≤ 1/β.

Proof: The second-order functional derivative of the objective is −β dP_XY − γ dP_X P_Y, which is always negative. The negative second-order functional derivative implies the objective has a supremum. Then, take the first-order functional derivative ∂J_RPC/∂f and set it to zero:

dP_XY − α · dP_X P_Y − β · f(x, y) · dP_XY − γ · f(x, y) · dP_X P_Y = 0.

We then get

f*(x, y) = (dP_XY − α · dP_X P_Y) / (β · dP_XY + γ · dP_X P_Y) = (p(x, y) − α p(x)p(y)) / (β p(x, y) + γ p(x)p(y)) = (r(x, y) − α) / (β r(x, y) + γ).

Since 0 ≤ r(x, y) ≤ ∞, we have −α/γ ≤ (r(x, y) − α)/(β r(x, y) + γ) ≤ 1/β. Hence, for all β ≠ 0 and γ ≠ 0, f*(x, y) := r_{α,β,γ}(x, y) with −α/γ ≤ r_{α,β,γ} ≤ 1/β.

A.2 RELATION BETWEEN J_RPC AND D_χ²

In this subsection, we aim to show the following: 1) D_χ²(P_XY ‖ P_X P_Y) = E_{P_X P_Y}[r²(x, y)] − 1; and 2) J_RPC(X, Y) = ((β+γ)/2) E_P[r²_{α,β,γ}(x, y)], where P = (β/(β+γ)) P_XY + (γ/(β+γ)) P_X P_Y is the mixture distribution of P_XY and P_X P_Y.

Lemma 3 D_χ²(P_XY ‖ P_X P_Y) = E_{P_X P_Y}[r²(x, y)] − 1

Proof: By definition (Nielsen & Nock, 2013),

D_χ²(P_XY ‖ P_X P_Y) = ∫ (dP_XY)² / dP_X P_Y − 1 = ∫ (dP_XY / dP_X P_Y)² dP_X P_Y − 1 = ∫ (p(x, y) / (p(x)p(y)))² dP_X P_Y − 1 = ∫ r²(x, y) dP_X P_Y − 1 = E_{P_X P_Y}[r²(x, y)] − 1.

Lemma 4 Defining P = (β/(β+γ)) P_XY + (γ/(β+γ)) P_X P_Y as a mixture distribution of P_XY and P_X P_Y, J_RPC(X, Y) = ((β+γ)/2) E_P[r²_{α,β,γ}(x, y)].
Proof: Plug the optimal solution f*(x, y) = (dP_XY − α · dP_X P_Y)/(β · dP_XY + γ · dP_X P_Y) (see Lemma 2) into J_RPC:

J_RPC = E_{P_XY}[f*(x, y)] − α E_{P_X P_Y}[f*(x, y)] − (β/2) E_{P_XY}[f*²(x, y)] − (γ/2) E_{P_X P_Y}[f*²(x, y)]
= ∫ f*(x, y) · (dP_XY − α · dP_X P_Y) − (1/2) ∫ f*²(x, y) · (β · dP_XY + γ · dP_X P_Y)
= ∫ (dP_XY − α · dP_X P_Y)² / (β · dP_XY + γ · dP_X P_Y) − (1/2) ∫ (dP_XY − α · dP_X P_Y)² / (β · dP_XY + γ · dP_X P_Y)
= (1/2) ∫ ((dP_XY − α · dP_X P_Y) / (β · dP_XY + γ · dP_X P_Y))² (β · dP_XY + γ · dP_X P_Y)
= ((β+γ)/2) ∫ ((dP_XY − α · dP_X P_Y) / (β · dP_XY + γ · dP_X P_Y))² ((β/(β+γ)) · dP_XY + (γ/(β+γ)) · dP_X P_Y).

Since we define r_{α,β,γ} = (dP_XY − α · dP_X P_Y)/(β · dP_XY + γ · dP_X P_Y) and P = (β/(β+γ)) P_XY + (γ/(β+γ)) P_X P_Y,

J_RPC = ((β+γ)/2) E_P[r²_{α,β,γ}(x, y)].
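Lemmas 3 and 4 can be verified numerically on a small discrete joint distribution (a sanity-check sketch of our own, with arbitrarily chosen α, β, γ):

```python
import numpy as np

rng = np.random.default_rng(0)
# A random joint distribution p(x, y) on a 4x5 grid, and its marginals.
p_xy = rng.random((4, 5))
p_xy /= p_xy.sum()
p_xpy = np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0))
r = p_xy / p_xpy  # density ratio

# Lemma 3: D_chi2(P_XY || P_X P_Y) = E_{P_X P_Y}[r^2] - 1.
chi2_def = float(np.sum(p_xy ** 2 / p_xpy) - 1.0)
chi2_lemma = float(np.sum(p_xpy * r ** 2) - 1.0)
assert np.isclose(chi2_def, chi2_lemma)

# Lemma 4: J_RPC evaluated at the optimum f* = (r - a)/(b*r + g) equals
# ((b + g)/2) * E_P[f*^2], with P the mixture distribution.
a, b, g = 1.0, 0.25, 1.0
f = (r - a) / (b * r + g)
j_rpc = (np.sum(p_xy * f) - a * np.sum(p_xpy * f)
         - 0.5 * b * np.sum(p_xy * f ** 2) - 0.5 * g * np.sum(p_xpy * f ** 2))
mix = (b / (b + g)) * p_xy + (g / (b + g)) * p_xpy
assert np.isclose(j_rpc, 0.5 * (b + g) * np.sum(mix * f ** 2))
```

Here all expectations are finite sums over the grid, so the identities hold to floating-point precision.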

A.3 PROOF OF PROPOSITION 1 IN THE MAIN TEXT

The proof contains two parts: showing that 0 ≤ J_RPC ≤ 1/(2β) + α²/(2γ) (Section A.3.1) and that Ĵ^{m,n}_RPC is a consistent estimator of J_RPC (Section A.3.2).

A.3.1 BOUNDEDNESS OF J_RPC

Lemma 5 (Boundedness of J_RPC) 0 ≤ J_RPC ≤ 1/(2β) + α²/(2γ)

Proof: Lemma 4 gives J_RPC(X, Y) = ((β+γ)/2) E_P[r²_{α,β,γ}(x, y)], with P = (β/(β+γ)) P_XY + (γ/(β+γ)) P_X P_Y the mixture distribution of P_XY and P_X P_Y. Hence, it is obvious that J_RPC(X, Y) ≥ 0. For the upper bound, we leverage the intermediate results in the proof of Lemma 4:

J_RPC(X, Y) = (1/2) ∫ ((dP_XY − α · dP_X P_Y) / (β · dP_XY + γ · dP_X P_Y))² (β · dP_XY + γ · dP_X P_Y)
= (1/2) ∫ ((dP_XY − α · dP_X P_Y) / (β · dP_XY + γ · dP_X P_Y)) dP_XY − (α/2) ∫ ((dP_XY − α · dP_X P_Y) / (β · dP_XY + γ · dP_X P_Y)) dP_X P_Y
= (1/2) E_{P_XY}[r_{α,β,γ}(x, y)] − (α/2) E_{P_X P_Y}[r_{α,β,γ}(x, y)].

Since −α/γ ≤ r_{α,β,γ} ≤ 1/β, we conclude J_RPC(X, Y) ≤ 1/(2β) + α²/(2γ).

A.3.2 CONSISTENCY

We first recall the definition of the estimator of J_RPC:

Definition 2 (Ĵ^{m,n}_RPC, empirical estimation of J_RPC, restating Definition 1 in the main text) We parametrize f via a family of neural networks F_Θ := {f_θ : θ ∈ Θ ⊆ R^d}, where d ∈ N and Θ is compact. Let {x_i, y_i}_{i=1}^{n} be n samples drawn uniformly at random from P_XY and {x_j, y_j}_{j=1}^{m} be m samples drawn uniformly at random from P_X P_Y. Then,

Ĵ^{m,n}_RPC = sup_{f_θ ∈ F_Θ} (1/n) Σ_{i=1}^{n} f_θ(x_i, y_i) − (α/m) Σ_{j=1}^{m} f_θ(x_j, y_j) − (β/2n) Σ_{i=1}^{n} f_θ²(x_i, y_i) − (γ/2m) Σ_{j=1}^{m} f_θ²(x_j, y_j).

Our goal is to show that Ĵ^{m,n}_RPC is a consistent estimator of J_RPC. We begin with the following definitions:

Ĵ^{m,n}_{RPC,θ} := (1/n) Σ_{i=1}^{n} f_θ(x_i, y_i) − (α/m) Σ_{j=1}^{m} f_θ(x_j, y_j) − (β/2n) Σ_{i=1}^{n} f_θ²(x_i, y_i) − (γ/2m) Σ_{j=1}^{m} f_θ²(x_j, y_j)   (3)

and

E[Ĵ_{RPC,θ}] := E_{P_XY}[f_θ(x, y)] − α E_{P_X P_Y}[f_θ(x, y)] − (β/2) E_{P_XY}[f_θ²(x, y)] − (γ/2) E_{P_X P_Y}[f_θ²(x, y)].

Then, we follow these steps:

• The first part is about estimation. We show that, with high probability, Ĵ^{m,n}_{RPC,θ} is close to E[Ĵ_{RPC,θ}], for any given θ.

• The second part is about approximation. We apply the universal approximation lemma for neural networks (Hornik et al., 1989) to show that there exists a network θ* such that E[Ĵ_{RPC,θ*}] is close to J_RPC.

Part I - Estimation: With high probability, Ĵ^{m,n}_{RPC,θ} is close to E[Ĵ_{RPC,θ}], for any given θ. Throughout the analysis of the uniform convergence, we need assumptions on the boundedness and smoothness of the function f_θ. Since we showed that the optimal function f for J_RPC is bounded, we can use the same bounds for f_θ without losing too much precision. The smoothness of the function suggests that the output of the network should change only slightly when the parameters are slightly perturbed. Specifically, the two assumptions are as follows:

Assumption 1 (boundedness of f_θ) There exist universal constants such that ∀ f_θ ∈ F_Θ, C_L ≤ f_θ ≤ C_U.
For notational simplicity, we let $M = C_U - C_L$ be the range of $f_\theta$ and $U = \max\{|C_U|, |C_L|\}$ be the maximal absolute value of $f_\theta$. In the paper, we can choose $C_L = -\frac{\alpha}{\gamma}$ and $C_U = \frac{1}{\beta}$, since the optimal function $f^*$ satisfies $-\frac{\alpha}{\gamma} \le f^* \le \frac{1}{\beta}$.

Assumption 2 (Smoothness of $f_\theta$) There exists a constant $\rho > 0$ such that for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and $\theta_1, \theta_2 \in \Theta$, $|f_{\theta_1}(x, y) - f_{\theta_2}(x, y)| \le \rho\,|\theta_1 - \theta_2|$.

Now, we can bound the rate of uniform convergence of a function class in terms of its covering number (Bartlett, 1998):

Lemma 6 (Estimation) Let $\epsilon > 0$ and let $N(\Theta, \epsilon)$ be the covering number of $\Theta$ with radius $\epsilon$. Then,
$$
\Pr\Bigl(\sup_{f_\theta \in \mathcal{F}_\Theta} \bigl|\hat{J}^{m,n}_{\mathrm{RPC},\theta} - \mathbb{E}[\hat{J}_{\mathrm{RPC},\theta}]\bigr| \ge \epsilon\Bigr)
\le 2\,N\Bigl(\Theta, \frac{\epsilon}{4\rho\bigl(1 + \alpha + 2(\beta + \gamma)U\bigr)}\Bigr)\Bigl(\exp\Bigl(-\frac{n\epsilon^2}{32M^2}\Bigr) + \exp\Bigl(-\frac{m\epsilon^2}{32M^2\alpha^2}\Bigr) + \exp\Bigl(-\frac{n\epsilon^2}{32U^2\beta^2}\Bigr) + \exp\Bigl(-\frac{m\epsilon^2}{32U^2\gamma^2}\Bigr)\Bigr).
$$

Proof: For notational simplicity, we define the operators
• $P(f) = \mathbb{E}_{P_{XY}}[f(x, y)]$ and $P_n(f) = \frac{1}{n}\sum_{i=1}^{n} f(x_i, y_i)$,
• $Q(f) = \mathbb{E}_{P_X P_Y}[f(x, y)]$ and $Q_m(f) = \frac{1}{m}\sum_{j=1}^{m} f(x_j, y_j)$.
Hence,
$$
\bigl|\hat{J}^{m,n}_{\mathrm{RPC},\theta} - \mathbb{E}[\hat{J}_{\mathrm{RPC},\theta}]\bigr|
= \Bigl|P_n(f_\theta) - P(f_\theta) - \alpha\bigl(Q_m(f_\theta) - Q(f_\theta)\bigr) - \frac{\beta}{2}\bigl(P_n(f^2_\theta) - P(f^2_\theta)\bigr) - \frac{\gamma}{2}\bigl(Q_m(f^2_\theta) - Q(f^2_\theta)\bigr)\Bigr|
\le \bigl|P_n(f_\theta) - P(f_\theta)\bigr| + \alpha\,\bigl|Q_m(f_\theta) - Q(f_\theta)\bigr| + \beta\,\bigl|P_n(f^2_\theta) - P(f^2_\theta)\bigr| + \gamma\,\bigl|Q_m(f^2_\theta) - Q(f^2_\theta)\bigr|.
$$
Let $\epsilon' = \frac{\epsilon}{4\rho(1 + \alpha + 2(\beta + \gamma)U)}$ and let $\theta_1, \ldots, \theta_T$ with $T = N(\Theta, \epsilon')$ be the centers of an $\epsilon'$-covering of $\Theta$. For any $\theta$ and its nearest center $\theta_k$, Assumption 2 gives $|f_\theta - f_{\theta_k}| \le \rho\epsilon'$ and $|f^2_\theta - f^2_{\theta_k}| \le 2U\rho\epsilon'$, so replacing $\theta$ by $\theta_k$ changes the right-hand side above by at most $2\rho\epsilon'\bigl(1 + \alpha + 2(\beta + \gamma)U\bigr) \le \epsilon/2$. Therefore, by the union bound,
$$
\Pr\Bigl(\sup_{f_\theta \in \mathcal{F}_\Theta} \bigl|\hat{J}^{m,n}_{\mathrm{RPC},\theta} - \mathbb{E}[\hat{J}_{\mathrm{RPC},\theta}]\bigr| \ge \epsilon\Bigr)
\le \sum_{k=1}^{T} \Pr\Bigl(\bigl|P_n(f_{\theta_k}) - P(f_{\theta_k})\bigr| + \alpha\,\bigl|Q_m(f_{\theta_k}) - Q(f_{\theta_k})\bigr| + \beta\,\bigl|P_n(f^2_{\theta_k}) - P(f^2_{\theta_k})\bigr| + \gamma\,\bigl|Q_m(f^2_{\theta_k}) - Q(f^2_{\theta_k})\bigr| \ge \frac{\epsilon}{2}\Bigr)
\le \sum_{k=1}^{T}\Bigl[\Pr\Bigl(\bigl|P_n(f_{\theta_k}) - P(f_{\theta_k})\bigr| \ge \frac{\epsilon}{8}\Bigr) + \Pr\Bigl(\alpha\,\bigl|Q_m(f_{\theta_k}) - Q(f_{\theta_k})\bigr| \ge \frac{\epsilon}{8}\Bigr) + \Pr\Bigl(\beta\,\bigl|P_n(f^2_{\theta_k}) - P(f^2_{\theta_k})\bigr| \ge \frac{\epsilon}{8}\Bigr) + \Pr\Bigl(\gamma\,\bigl|Q_m(f^2_{\theta_k}) - Q(f^2_{\theta_k})\bigr| \ge \frac{\epsilon}{8}\Bigr)\Bigr].
$$
With Hoeffding's inequality,
• $\Pr\bigl(|P_n(f_{\theta_k}) - P(f_{\theta_k})| \ge \frac{\epsilon}{8}\bigr) \le 2\exp\bigl(-\frac{n\epsilon^2}{32M^2}\bigr)$,
• $\Pr\bigl(\alpha\,|Q_m(f_{\theta_k}) - Q(f_{\theta_k})| \ge \frac{\epsilon}{8}\bigr) \le 2\exp\bigl(-\frac{m\epsilon^2}{32M^2\alpha^2}\bigr)$,
• $\Pr\bigl(\beta\,|P_n(f^2_{\theta_k}) - P(f^2_{\theta_k})| \ge \frac{\epsilon}{8}\bigr) \le 2\exp\bigl(-\frac{n\epsilon^2}{32U^2\beta^2}\bigr)$,
• $\Pr\bigl(\gamma\,|Q_m(f^2_{\theta_k}) - Q(f^2_{\theta_k})| \ge \frac{\epsilon}{8}\bigr) \le 2\exp\bigl(-\frac{m\epsilon^2}{32U^2\gamma^2}\bigr)$.
To conclude,
$$
\Pr\Bigl(\sup_{f_\theta \in \mathcal{F}_\Theta} \bigl|\hat{J}^{m,n}_{\mathrm{RPC},\theta} - \mathbb{E}[\hat{J}_{\mathrm{RPC},\theta}]\bigr| \ge \epsilon\Bigr)
\le 2\,N\Bigl(\Theta, \frac{\epsilon}{4\rho(1 + \alpha + 2(\beta + \gamma)U)}\Bigr)\Bigl(\exp\Bigl(-\frac{n\epsilon^2}{32M^2}\Bigr) + \exp\Bigl(-\frac{m\epsilon^2}{32M^2\alpha^2}\Bigr) + \exp\Bigl(-\frac{n\epsilon^2}{32U^2\beta^2}\Bigr) + \exp\Bigl(-\frac{m\epsilon^2}{32U^2\gamma^2}\Bigr)\Bigr).
$$

Part II - Approximation: neural network universal approximation.
We leverage the universal function approximation lemma for neural networks.

Lemma 7 (Approximation (Hornik et al., 1989)) Let $\epsilon > 0$. There exist $d \in \mathbb{N}$ and a family of neural networks $\mathcal{F}_\Theta := \{f_\theta : \theta \in \Theta \subseteq \mathbb{R}^d\}$, where $\Theta$ is compact, such that
$$
\inf_{f_\theta \in \mathcal{F}_\Theta} \bigl|\mathbb{E}[\hat{J}_{\mathrm{RPC},\theta}] - J_{\mathrm{RPC}}\bigr| \le \epsilon.
$$

Part III - Bringing everything together. We are now ready to combine the estimation and approximation parts to show that there exists a neural network $\theta^*$ such that, with high probability, $\hat{J}^{m,n}_{\mathrm{RPC},\theta}$ approximates $J_{\mathrm{RPC}}$ at rate $O(1/\sqrt{n'})$ with $n' = \min\{n, m\}$:

Proposition 3 With probability at least $1 - \delta$, there exists $\theta^* \in \Theta$ such that $|J_{\mathrm{RPC}} - \hat{J}^{m,n}_{\mathrm{RPC},\theta^*}| = O\Bigl(\sqrt{\frac{d + \log(1/\delta)}{n'}}\Bigr)$, where $n' = \min\{n, m\}$.

Proof: The proof follows by combining Lemmas 6 and 7. First, Lemma 7 suggests that there exists $\theta^* \in \Theta$ with $\bigl|\mathbb{E}[\hat{J}_{\mathrm{RPC},\theta^*}] - J_{\mathrm{RPC}}\bigr| \le \frac{\epsilon}{2}$. Next, we analyze the estimation error, aiming to find $n$, $m$ and the corresponding probability such that $\bigl|\hat{J}^{m,n}_{\mathrm{RPC},\theta^*} - \mathbb{E}[\hat{J}_{\mathrm{RPC},\theta^*}]\bigr| \le \frac{\epsilon}{2}$. Applying Lemma 6 with the covering number of the neural network, $N(\Theta, \epsilon) = O\bigl(\exp(d \log(1/\epsilon))\bigr)$ (Anthony & Bartlett, 2009), and letting $n' = \min\{n, m\}$:
$$
\Pr\Bigl(\sup_{f_\theta \in \mathcal{F}_\Theta} \bigl|\hat{J}^{m,n}_{\mathrm{RPC},\theta} - \mathbb{E}[\hat{J}_{\mathrm{RPC},\theta}]\bigr| \ge \frac{\epsilon}{2}\Bigr)
\le 2\,N\Bigl(\Theta, \frac{\epsilon}{8\rho(1 + \alpha + 2(\beta + \gamma)U)}\Bigr)\Bigl(\exp\Bigl(-\frac{n\epsilon^2}{128M^2}\Bigr) + \exp\Bigl(-\frac{m\epsilon^2}{128M^2\alpha^2}\Bigr) + \exp\Bigl(-\frac{n\epsilon^2}{128U^2\beta^2}\Bigr) + \exp\Bigl(-\frac{m\epsilon^2}{128U^2\gamma^2}\Bigr)\Bigr)
= O\bigl(\exp(d \log(1/\epsilon) - n'\epsilon^2)\bigr),
$$
where the big-O notation absorbs all constants not needed in the following derivation. Since we want to bound this probability by $\delta$, we solve for the $\epsilon$ such that $\exp(d \log(1/\epsilon) - n'\epsilon^2) \le \delta$. With $\log(x) \le x - 1$,
$$
n'\epsilon^2 + d(\epsilon - 1) \ge n'\epsilon^2 + d \log \epsilon \ge \log(1/\delta),
$$
and this inequality holds when $\epsilon = O\Bigl(\sqrt{\frac{d + \log(1/\delta)}{n'}}\Bigr)$.

A.4 PROOF OF PROPOSITION 2 IN THE MAIN TEXT - FROM AN ASYMPTOTIC VIEWPOINT

Here, we provide the variance analysis of $\hat{J}^{m,n}_{\mathrm{RPC}}$ from an asymptotic viewpoint. First, we assume the network is correctly specified: there exists a network parameter $\theta^*$ satisfying $f^*(x, y) = f_{\theta^*}(x, y) = r_{\alpha,\beta,\gamma}(x, y)$.
Then, recalling that $\hat{J}^{m,n}_{\mathrm{RPC}}$ is a consistent estimator of $J_{\mathrm{RPC}}$ (see Proposition 3), under regularity conditions the estimated network parameter $\hat{\theta}$ in $\hat{J}^{m,n}_{\mathrm{RPC}}$ satisfies asymptotic normality in the large-sample limit (see Theorem 5.23 in Van der Vaart (2000)). We recall the definition of $\hat{J}^{m,n}_{\mathrm{RPC},\theta}$ in equation 3 and let $n' = \min\{n, m\}$. The asymptotic expansion of $\hat{J}^{m,n}_{\mathrm{RPC}}$ gives
$$
\hat{J}^{m,n}_{\mathrm{RPC},\theta^*}
= \hat{J}^{m,n}_{\mathrm{RPC},\hat{\theta}} + \nabla \hat{J}^{m,n}_{\mathrm{RPC},\hat{\theta}}{}^{\!\top}(\theta^* - \hat{\theta}) + o(\|\theta^* - \hat{\theta}\|)
= \hat{J}^{m,n}_{\mathrm{RPC},\hat{\theta}} + \nabla \hat{J}^{m,n}_{\mathrm{RPC},\hat{\theta}}{}^{\!\top}(\theta^* - \hat{\theta}) + o_p\Bigl(\frac{1}{\sqrt{n'}}\Bigr)
= \hat{J}^{m,n}_{\mathrm{RPC},\hat{\theta}} + o_p\Bigl(\frac{1}{\sqrt{n'}}\Bigr), \qquad (5)
$$
where $\nabla \hat{J}^{m,n}_{\mathrm{RPC},\hat{\theta}} = 0$ since $\hat{\theta}$ is the maximizer in $\hat{J}^{m,n}_{\mathrm{RPC}} = \sup_{f_\theta \in \mathcal{F}_\Theta} \hat{J}^{m,n}_{\mathrm{RPC},\theta}$. Next, we recall the definition in equation 4:
$$
\mathbb{E}[\hat{J}_{\mathrm{RPC},\hat{\theta}}] = \mathbb{E}_{P_{XY}}[f_{\hat{\theta}}(x, y)] - \alpha\,\mathbb{E}_{P_X P_Y}[f_{\hat{\theta}}(x, y)] - \frac{\beta}{2}\,\mathbb{E}_{P_{XY}}[f^2_{\hat{\theta}}(x, y)] - \frac{\gamma}{2}\,\mathbb{E}_{P_X P_Y}[f^2_{\hat{\theta}}(x, y)].
$$
Likewise, the asymptotic expansion of $\mathbb{E}[\hat{J}_{\mathrm{RPC},\theta}]$ gives
$$
\mathbb{E}[\hat{J}_{\mathrm{RPC},\hat{\theta}}]
= \mathbb{E}[\hat{J}_{\mathrm{RPC},\theta^*}] + \nabla \mathbb{E}[\hat{J}_{\mathrm{RPC},\theta^*}]^{\top}(\hat{\theta} - \theta^*) + o(\|\hat{\theta} - \theta^*\|)
= \mathbb{E}[\hat{J}_{\mathrm{RPC},\theta^*}] + o_p\Bigl(\frac{1}{\sqrt{n'}}\Bigr), \qquad (6)
$$
where $\nabla \mathbb{E}[\hat{J}_{\mathrm{RPC},\theta^*}] = 0$ since $\mathbb{E}[\hat{J}_{\mathrm{RPC},\theta^*}] = J_{\mathrm{RPC}}$ is the supremum, attained at $\theta^*$, which satisfies $f^*(x, y) = f_{\theta^*}(x, y)$. Combining equations 5 and 6:
$$
\hat{J}^{m,n}_{\mathrm{RPC},\hat{\theta}} - \mathbb{E}[\hat{J}_{\mathrm{RPC},\hat{\theta}}]
= \hat{J}^{m,n}_{\mathrm{RPC},\theta^*} - J_{\mathrm{RPC}} + o_p\Bigl(\frac{1}{\sqrt{n'}}\Bigr)
$$
$$
= \frac{1}{n}\sum_{i=1}^{n} f_{\theta^*}(x_i, y_i) - \frac{\alpha}{m}\sum_{j=1}^{m} f_{\theta^*}(x_j, y_j) - \frac{\beta}{2n}\sum_{i=1}^{n} f^2_{\theta^*}(x_i, y_i) - \frac{\gamma}{2m}\sum_{j=1}^{m} f^2_{\theta^*}(x_j, y_j)
- \mathbb{E}_{P_{XY}}[f^*(x, y)] + \alpha\,\mathbb{E}_{P_X P_Y}[f^*(x, y)] + \frac{\beta}{2}\,\mathbb{E}_{P_{XY}}\bigl[f^{*2}(x, y)\bigr] + \frac{\gamma}{2}\,\mathbb{E}_{P_X P_Y}\bigl[f^{*2}(x, y)\bigr] + o_p\Bigl(\frac{1}{\sqrt{n'}}\Bigr)
$$
$$
= \frac{1}{n}\sum_{i=1}^{n} r_{\alpha,\beta,\gamma}(x_i, y_i) - \frac{\alpha}{m}\sum_{j=1}^{m} r_{\alpha,\beta,\gamma}(x_j, y_j) - \frac{\beta}{2n}\sum_{i=1}^{n} r^2_{\alpha,\beta,\gamma}(x_i, y_i) - \frac{\gamma}{2m}\sum_{j=1}^{m} r^2_{\alpha,\beta,\gamma}(x_j, y_j)
- \mathbb{E}_{P_{XY}}[r_{\alpha,\beta,\gamma}(x, y)] + \alpha\,\mathbb{E}_{P_X P_Y}[r_{\alpha,\beta,\gamma}(x, y)] + \frac{\beta}{2}\,\mathbb{E}_{P_{XY}}\bigl[r^2_{\alpha,\beta,\gamma}(x, y)\bigr] + \frac{\gamma}{2}\,\mathbb{E}_{P_X P_Y}\bigl[r^2_{\alpha,\beta,\gamma}(x, y)\bigr] + o_p\Bigl(\frac{1}{\sqrt{n'}}\Bigr)
$$
$$
= \frac{1}{\sqrt{n}}\cdot\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\Bigl(r_{\alpha,\beta,\gamma}(x_i, y_i) - \frac{\beta}{2} r^2_{\alpha,\beta,\gamma}(x_i, y_i) - \mathbb{E}_{P_{XY}}\Bigl[r_{\alpha,\beta,\gamma}(x, y) - \frac{\beta}{2} r^2_{\alpha,\beta,\gamma}(x, y)\Bigr]\Bigr)
- \frac{1}{\sqrt{m}}\cdot\frac{1}{\sqrt{m}}\sum_{j=1}^{m}\Bigl(\alpha\, r_{\alpha,\beta,\gamma}(x_j, y_j) + \frac{\gamma}{2} r^2_{\alpha,\beta,\gamma}(x_j, y_j) - \mathbb{E}_{P_X P_Y}\Bigl[\alpha\, r_{\alpha,\beta,\gamma}(x, y) + \frac{\gamma}{2} r^2_{\alpha,\beta,\gamma}(x, y)\Bigr]\Bigr)
+ o_p\Bigl(\frac{1}{\sqrt{n'}}\Bigr).
$$
Therefore, the asymptotic variance of $\hat{J}^{m,n}_{\mathrm{RPC}}$ is
$$
\mathrm{Var}[\hat{J}^{m,n}_{\mathrm{RPC}}] = \frac{1}{n}\,\mathrm{Var}_{P_{XY}}\Bigl[r_{\alpha,\beta,\gamma}(x, y) - \frac{\beta}{2} r^2_{\alpha,\beta,\gamma}(x, y)\Bigr] + \frac{1}{m}\,\mathrm{Var}_{P_X P_Y}\Bigl[\alpha\, r_{\alpha,\beta,\gamma}(x, y) + \frac{\gamma}{2} r^2_{\alpha,\beta,\gamma}(x, y)\Bigr] + o\Bigl(\frac{1}{n'}\Bigr).
$$
First, we look at $\mathrm{Var}_{P_{XY}}\bigl[r_{\alpha,\beta,\gamma}(x, y) - \frac{\beta}{2} r^2_{\alpha,\beta,\gamma}(x, y)\bigr]$. Since $\beta > 0$ and $-\frac{\alpha}{\gamma} \le r_{\alpha,\beta,\gamma} \le \frac{1}{\beta}$, a simple calculation gives
$$
-\frac{2\alpha\gamma + \beta\alpha^2}{2\gamma^2} \le r_{\alpha,\beta,\gamma}(x, y) - \frac{\beta}{2} r^2_{\alpha,\beta,\gamma}(x, y) \le \frac{1}{2\beta}.
$$
Hence,
$$
\mathrm{Var}_{P_{XY}}\Bigl[r_{\alpha,\beta,\gamma}(x, y) - \frac{\beta}{2} r^2_{\alpha,\beta,\gamma}(x, y)\Bigr] \le \max\Bigl\{\Bigl(\frac{2\alpha\gamma + \beta\alpha^2}{2\gamma^2}\Bigr)^2, \Bigl(\frac{1}{2\beta}\Bigr)^2\Bigr\}.
$$
Next, we look at $\mathrm{Var}_{P_X P_Y}\bigl[\alpha\, r_{\alpha,\beta,\gamma}(x, y) + \frac{\gamma}{2} r^2_{\alpha,\beta,\gamma}(x, y)\bigr]$. Since $\alpha \ge 0$, $\gamma > 0$ and $-\frac{\alpha}{\gamma} \le r_{\alpha,\beta,\gamma} \le \frac{1}{\beta}$, a simple calculation gives $-\frac{\alpha^2}{2\gamma} \le \alpha\, r_{\alpha,\beta,\gamma}(x, y) + \frac{\gamma}{2} r^2_{\alpha,\beta,\gamma}(x, y) \le \frac{2\alpha\beta + \gamma}{2\beta^2}$. Hence,
$$
\mathrm{Var}_{P_X P_Y}\Bigl[\alpha\, r_{\alpha,\beta,\gamma}(x, y) + \frac{\gamma}{2} r^2_{\alpha,\beta,\gamma}(x, y)\Bigr] \le \max\Bigl\{\Bigl(\frac{\alpha^2}{2\gamma}\Bigr)^2, \Bigl(\frac{2\alpha\beta + \gamma}{2\beta^2}\Bigr)^2\Bigr\}.
$$
Combining everything together, we restate Proposition 2 in the main text:

Proposition 4 (Asymptotic variance of $\hat{J}^{m,n}_{\mathrm{RPC}}$)
$$
\mathrm{Var}[\hat{J}^{m,n}_{\mathrm{RPC}}]
\le \frac{1}{n}\max\Bigl\{\Bigl(\frac{2\alpha\gamma + \beta\alpha^2}{2\gamma^2}\Bigr)^2, \Bigl(\frac{1}{2\beta}\Bigr)^2\Bigr\}
+ \frac{1}{m}\max\Bigl\{\Bigl(\frac{\alpha^2}{2\gamma}\Bigr)^2, \Bigl(\frac{2\alpha\beta + \gamma}{2\beta^2}\Bigr)^2\Bigr\}
+ o\Bigl(\frac{1}{n'}\Bigr).
$$

A.7 RELATIVE PREDICTIVE CODING ON VISION

The whole pretraining pipeline contains the following steps. First, a stochastic data augmentation transforms an image sample $x_k$ into two different but correlated augmented views, $x_{2k-1}$ and $x_{2k}$. Then a base encoder $f(\cdot)$, implemented with a ResNet (He et al., 2016), extracts representations from the augmented views, producing representations $h_{2k-1}$ and $h_{2k}$. A small neural network $g(\cdot)$, called the projection head, then maps $h_{2k-1}$ and $h_{2k}$ to $z_{2k-1}$ and $z_{2k}$ in a different latent space. For each minibatch of $N$ samples, $2N$ views are generated. For each image $x_k$ there is one positive pair, $x_{2k-1}$ and $x_{2k}$, and $2(N-1)$ negative samples.
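The pipeline above can be sketched end-to-end; the toy `augment`, `encoder`, and `projection_head` functions below are placeholders of our own standing in for the stochastic augmentation, the ResNet $f(\cdot)$, and the projection head $g(\cdot)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x):
    """Placeholder stochastic augmentation (additive noise stands in for
    crops/color jitter)."""
    return x + 0.1 * rng.normal(size=x.shape)

def encoder(views, dim_h=32):
    """Placeholder for the ResNet base encoder f(.)."""
    w = np.ones((views.shape[1], dim_h)) / views.shape[1]
    return views @ w

def projection_head(h, dim_z=16):
    """Placeholder for the small projection head g(.)."""
    w = np.ones((h.shape[1], dim_z)) / h.shape[1]
    return h @ w

# A minibatch of N images yields 2N views, hence 2N projections. Here view k
# and view k + N come from the same image x_k (one positive pair per image);
# the text indexes the same pairing as (x_{2k-1}, x_{2k}).
N, dim_x = 4, 8
x = rng.normal(size=(N, dim_x))
views = np.concatenate([augment(x), augment(x)], axis=0)   # 2N views
z = projection_head(encoder(views))                        # 2N projections
assert z.shape == (2 * N, 16)
```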
The RPC loss between a pair of positive views $x_i$ and $x_j$ (augmented from the same image) is obtained by substituting $f_\theta(x_i, x_j) = (z_i \cdot z_j)/\tau = s_{i,j}$ ($\tau$ is a hyperparameter) into the definition of RPC:
$$
\mathcal{L}^{i,j}_{\mathrm{RPC}} = -\Bigl(s_{i,j} - \frac{\alpha}{2(N-1)}\sum_{k=1}^{2N} \mathbb{1}_{[k \ne i]}\, s_{i,k} - \frac{\beta}{2}\, s^2_{i,j} - \frac{\gamma}{2 \cdot 2(N-1)}\sum_{k=1}^{2N} \mathbb{1}_{[k \ne i]}\, s^2_{i,k}\Bigr).
$$
For losses other than RPC, a hidden normalization of $s_{i,j}$ is often required, replacing $z_i \cdot z_j$ with $(z_i \cdot z_j)/(|z_i|\,|z_j|)$. CPC and WPC adopt this normalization, and the other objectives need it to help stabilize the training variance; RPC does not need this normalization.

The confidence intervals (a 95% confidence level is chosen) are reported in Table 4. Both CPC and RPC use the same experimental settings throughout this paper. Here we use the relative parameters ($\alpha = 1.0$, $\beta = 0.005$, $\gamma = 1.0$) in $J_{\mathrm{RPC}}$, which give the best performance on CIFAR-10. The confidence intervals of CPC do not overlap with those of RPC, meaning that the difference in downstream task performance between RPC and CPC is statistically significant.

A.9 RELATIVE PREDICTIVE CODING ON SPEECH

For speech representation learning, we adopt the general architecture from Oord et al. (2018). Given an input signal $x_{1:T}$ with $T$ time steps, we first pass it through an encoder $\phi_\theta$ parametrized by $\theta$ to produce a sequence of hidden representations $\{h_{1:T}\}$, where $h_t = \phi_\theta(x_t)$. After that, we obtain the contextual representation $c_t$ at time step $t$ with a sequential model $\psi_\rho$ parametrized by $\rho$: $c_t = \psi_\rho(h_1, \ldots, h_t)$. The scoring function at step $t$ with look-ahead $k$ is $f_k(h, c_t) = \exp(h^{\top} W_k c_t)$, where $W_k$ is a learnable linear transformation defined separately for each $k \in \{1, \ldots, K\}$, and $K$ is predetermined as 12 time steps.
The loss in Equation 2 is then formulated as
$$
\mathcal{L}^{t,k}_{\mathrm{RPC}} = -\Bigl(f_k(h_{t+k}, c_t) - \frac{\alpha}{|N|}\sum_{h_i \in N} f_k(h_i, c_t) - \frac{\beta}{2}\, f^2_k(h_{t+k}, c_t) - \frac{\gamma}{2|N|}\sum_{h_i \in N} f^2_k(h_i, c_t)\Bigr). \qquad (8)
$$
We use the relative parameters $\alpha = 1$, $\beta = 0.25$, and $\gamma = 1$, and the temperature $\tau = 16$ for $J_{\mathrm{RPC}}$. For $J_{\mathrm{CPC}}$ we follow the original implementation, which sets $\tau = 1$. We fix all other experimental setups, including the architecture, learning rate, and optimizer. As shown in Table 3, $J_{\mathrm{RPC}}$ achieves better downstream task performance and is closer to the performance of a fully supervised model.
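A minimal sketch of equation 8, assuming the positive score $f_k(h_{t+k}, c_t)$ and the batch of negative scores $f_k(h_i, c_t)$ have already been computed (the helper name is ours, not the authors' code):

```python
import numpy as np

def rpc_speech_loss(score_pos, scores_neg, alpha=1.0, beta=0.25, gamma=1.0):
    """L^{t,k}_{RPC} (equation 8): score_pos is f_k(h_{t+k}, c_t) for the
    positive pair; scores_neg holds f_k(h_i, c_t) for each h_i in the
    negative batch N."""
    scores_neg = np.asarray(scores_neg)
    return -(score_pos
             - alpha * scores_neg.mean()
             - 0.5 * beta * score_pos ** 2
             - 0.5 * gamma * (scores_neg ** 2).mean())
```

During training one would typically average this loss over time steps $t$ and look-ahead steps $k \in \{1, \ldots, K\}$.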

A.10 EMPIRICAL OBSERVATIONS ON VARIANCE AND MINIBATCH SIZE

Variance Experiment Setup We compare the variance of $J_{\mathrm{DV}}$, $J_{\mathrm{NWJ}}$, and the proposed $J_{\mathrm{RPC}}$. The experiments use SimCLRv2 (Chen et al., 2020c) on the CIFAR-10 dataset, with a ResNet of depth 18 and a batch size of 512. We train each objective for 30K training steps and record its value. In Figure 1, we use a temperature $\tau = 128$ for all objectives. Unlike the other experiments, where hidden normalization is applied to the other objectives, here we remove hidden normalization for all objectives, because the normalized objectives do not reflect their original values. From Figure 1, $J_{\mathrm{RPC}}$ enjoys lower variance and more stable training compared to $J_{\mathrm{DV}}$ and $J_{\mathrm{NWJ}}$.

Minibatch Size Experimental Setup

We study the effect of batch size on downstream performance for the different objectives. The experiments use SimCLRv2 (Chen et al., 2020c) on the CIFAR-10 dataset, as well as the model from Rivière et al. (2020) on the LibriSpeech-100h dataset (Panayotov et al., 2015). For the vision task, we use the default temperature $\tau = 0.5$ from Chen et al. (2020c) and the hidden normalization mentioned in Section 3 for $J_{\mathrm{CPC}}$. For $J_{\mathrm{RPC}}$ in the vision and speech tasks, we use temperatures $\tau = 128$ and $\tau = 16$ respectively, both without hidden normalization.

A.11 MUTUAL INFORMATION ESTIMATION

Our method is compared with the baseline methods CPC (Oord et al., 2018), NWJ (Nguyen et al., 2010), JSD (Nowozin et al., 2016), and SMILE (Song & Ermon, 2019). All approaches use the same design of $f(x, y)$: a 3-layer neural network taking the concatenated $(x, y)$ as input. We also fix the learning rate, optimizer, and minibatch size across all estimators for a fair comparison. We present the mutual information estimated by Relative Predictive Coding with different sets of relative parameters in Figure 4. In the first row, we set $\beta = 10^{-3}$, $\gamma = 1$ and vary $\alpha$; in the second row, we set $\alpha = 1$, $\gamma = 1$ and vary $\beta$; and in the last row, we set $\alpha = 1$, $\beta = 10^{-3}$ and vary $\gamma$. From the figure, a small $\beta$ around $10^{-3}$ and a large $\gamma$ around $1.0$ are crucial for an estimate with relatively low bias and low variance. This conclusion is consistent with Section 3 in the main text. We also compared $J_{\mathrm{RPC}}$ with Difference of Entropies (DoE) (McAllester & Stratos, 2020) in two sets of experiments: in the first, we compare $J_{\mathrm{RPC}}$ and DoE when the MI is large (> 100 nats); in the second, we compare them using the setup of this section (MI < 12 nats, increasing by 2 per 4K training steps). On the one hand, when the MI is large (> 100 nats), we acknowledge that DoE estimates the MI well, whereas $J_{\mathrm{RPC}}$ only estimates the MI at around 20. This analysis is based on the code from https://github.com/karlstratos/doe. On the other hand, when the true MI is small, the DoE method is more unstable than $J_{\mathrm{RPC}}$, as shown in Figure 5, which illustrates the results of the DoE method when the variational distribution is isotropic Gaussian (correctly specified) or Logistic (mis-specified); Figure 3 only shows the results using Gaussian.
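The 3-layer critic described above can be sketched as follows; the hidden width and initialization are our assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

def critic_init(dim_x, dim_y, hidden=256):
    """A 3-layer MLP f(x, y) acting on the concatenated input [x; y]."""
    sizes = [dim_x + dim_y, hidden, hidden, 1]
    return [(rng.normal(0.0, 1.0 / np.sqrt(fan_in), (fan_in, fan_out)),
             np.zeros(fan_out))
            for fan_in, fan_out in zip(sizes[:-1], sizes[1:])]

def critic_forward(params, x, y):
    """Score a batch of (x, y) pairs; returns one scalar per pair."""
    h = np.concatenate([x, y], axis=-1)
    for layer, (w, b) in enumerate(params):
        h = h @ w + b
        if layer < len(params) - 1:  # ReLU on the two hidden layers
            h = np.maximum(h, 0.0)
    return h.squeeze(-1)

# 20-d inputs match the correlated-Gaussian setup in Figure 3.
params = critic_init(dim_x=20, dim_y=20)
scores = critic_forward(params, np.zeros((8, 20)), np.ones((8, 20)))
assert scores.shape == (8,)
```

To reproduce the comparison, the same critic would be trained under each objective (CPC, NWJ, JSD, SMILE, RPC) on the correlated-Gaussian samples.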



Project page: https://github.com/martinmamql/relative_predictive_coding

$J_{\mathrm{JS}}(X, Y)$ attains its supremum when $f^*(x, y) = \log\bigl(p(x, y)/p(x)p(y)\bigr)$ (Tsai et al., 2020b). Plugging $f^*(x, y)$ into $J_{\mathrm{JS}}(X, Y)$, we conclude $J_{\mathrm{JS}}(X, Y) = 2\bigl(D_{\mathrm{JS}}(P_{XY} \,\|\, P_X P_Y) - \log 2\bigr)$.

For WPC (Ozair et al., 2019), global batch normalization is disabled during pretraining since we enforce 1-Lipschitzness via a gradient penalty (Gulrajani et al., 2017).



Figure 1: (a) Empirical values of JDV, JNWJ, JCPC and JRPC performing visual object recognition on CIFAR-10. JDV and JNWJ soon explode to NaN values and stop the training (shown as early stopping in the figure), while JCPC and JRPC are more stable. Performance comparison of JCPC and JRPC on (b) CIFAR-10 and (c) LibriSpeech-100h with different minibatch sizes, showing that the performance of JRPC is less sensitive to minibatch size change compared to JCPC.

Figure 3: Mutual information estimation performed on 20-d correlated Gaussian distribution, with the correlation increasing each 4K steps. JRPC exhibits smaller variance than SMILE and DoE, and smaller bias than JCPC.

Figure 4: Mutual information estimation by RPC performed on 20-d correlated Gaussian distribution, with different sets of relative parameters.

Figure 5: Mutual information estimation by DoE performed on 20-d correlated Gaussian distribution. The figure on the left shows parametrization under Gaussian (correctly specified), and the figure on the right shows parametrization under Logistic (mis-specified).

Top-1 accuracy (%) for visual object recognition. $J_{\mathrm{DV}}$ and $J_{\mathrm{NWJ}}$ are not reported on ImageNet due to numerical instability. The ResNet depth, width, and Selective Kernel (SK) configuration for each setting are given in the ResNet depth+width+SK column. A slight drop in $J_{\mathrm{CPC}}$ performance compared to Chen et al. (2020c) is due to training for fewer epochs without a memory buffer (see A.8).

Accuracy (%) for LibriSpeech-100h phoneme and speaker classification. We also provide the results from a fully supervised model for comparison (Oord et al., 2018).

A.5 PROOF OF PROPOSITION 2 IN THE MAIN TEXT - FROM THE BOUNDEDNESS OF $f_\theta$

As discussed in Assumption 1, for the estimation $\hat{J}^{m,n}_{\mathrm{RPC}}$ we can bound the function $f_\theta \in \mathcal{F}_\Theta$ within $[-\frac{\alpha}{\gamma}, \frac{1}{\beta}]$ without losing precision. Then, we re-arrange $\hat{J}^{m,n}_{\mathrm{RPC}}$.

Here, $c_t = \psi_\rho(h_1, \ldots, h_t)$ contains context information before time step $t$. For unsupervised pre-training, we use a multi-layer convolutional network as the encoder $\phi_\theta$ and an LSTM with hidden dimension 256 as the sequential model $\psi_\rho$. The contrastiveness is between the positive pair $(h_{t+k}, c_t)$, where $k$ is the number of time steps ahead, and the negative pairs $(h_i, c_t)$, where $h_i$ is randomly sampled from $N$, a batch of hidden representations of signals assumed to be unrelated to $c_t$. The scoring function $f$ based on Equation 2 at step $t$ and look-ahead $k$ is $f_k$.

ACKNOWLEDGEMENT

This work was supported in part by the NSF IIS1763562, NSF Awards #1750439 #1722822, National Institutes of Health, IARPA D17PC00340, ONR Grant N000141812861, and Facebook PhD Fellowship. We would also like to acknowledge NVIDIA's GPU support and Cloud TPU support from Google's TensorFlow Research Cloud (TFRC).


Table 4: Confidence intervals of the performance of $J_{\mathrm{RPC}}$ and $J_{\mathrm{CPC}}$ on CIFAR-10/-100 and ImageNet.

A.8 CIFAR-10/-100 AND IMAGENET EXPERIMENT DETAILS

ImageNet Following the settings in Chen et al. (2020b;c), we train the model on a Cloud TPU with 128 cores, with a batch size of 4,096 and global batch normalization (Ioffe & Szegedy, 2015). Here we use the term batch size for the number of images (or utterances in the speech experiments) per GPU, while minibatch size refers to the number of negative samples used to calculate the objective, such as CPC or our proposed RPC. The largest model we train is a 152-layer ResNet with Selective Kernels (SK) (Li et al., 2019) and 2× wider channels. We use the LARS optimizer (You et al., 2017) with momentum 0.9. The learning rate increases linearly for the first 20 epochs, reaching a maximum of 6.4, and then decays with a cosine schedule. The weight decay is $10^{-4}$. An MLP projection head $g(\cdot)$ with three layers is used on top of the ResNet encoder. Unlike Chen et al. (2020c), we do not use a memory buffer, and we train the model for only 100 epochs rather than 800 due to computational constraints. These two choices reduce CPC's performance benchmark by about 2% under the otherwise identical setting. The unsupervised pre-training is followed by supervised fine-tuning: following SimCLRv2 (Chen et al., 2020b;c), we fine-tune the 3-layer $g(\cdot)$ for the downstream tasks.

CIFAR-10/-100 Following the settings in Chen et al. (2020b), we train the model on a single GPU, with a batch size of 512 and global batch normalization (Ioffe & Szegedy, 2015). We use ResNets (He et al., 2016) of depth 18 and depth 50, without Selective Kernels (Li et al., 2019) or a width multiplier. We use the LARS optimizer (You et al., 2017) with momentum 0.9. The learning rate increases linearly for the first 20 epochs, reaching a maximum of 6.4, and then decays with a cosine schedule.
The weight decay is $10^{-4}$. An MLP projection head $g(\cdot)$ with three layers is used on top of the ResNet encoder. Unlike Chen et al. (2020c), we do not use a memory buffer. We train the model for 1000 epochs. The unsupervised pre-training is followed by supervised fine-tuning: following SimCLRv2 (Chen et al., 2020b;c), we fine-tune the 3-layer $g(\cdot)$ for the downstream tasks. We use a learning rate of 0.16 for the standard 50-layer ResNet, and weight decay and learning-rate warmup are removed. For $J_{\mathrm{RPC}}$ we disable hidden normalization and use a temperature $\tau = 128$. For all other objectives, we use hidden normalization and $\tau = 0.5$, following previous work (Chen et al., 2020c). For the relative parameters, we use $\alpha = 1.0$, $\beta = 0.005$, and $\gamma = 1.0$.
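With these relative parameters, the per-pair vision loss obtained from the substitution $s_{i,j} = (z_i \cdot z_j)/\tau$ can be sketched as follows (a hypothetical helper of ours, not the authors' code):

```python
import numpy as np

def rpc_pair_loss(z, i, j, alpha=1.0, beta=0.005, gamma=1.0, tau=128.0):
    """RPC loss for one positive pair (i, j) among the 2N projection vectors
    z (rows); no hidden normalization, matching the J_RPC setup. Note that
    2(N - 1) equals 2N - 2, i.e. z.shape[0] - 2."""
    two_n = z.shape[0]
    s = z @ z[i] / tau                   # similarities s_{i,k} for all k
    mask = np.arange(two_n) != i         # the sums exclude k = i
    return -(s[j]
             - alpha / (two_n - 2) * s[mask].sum()
             - 0.5 * beta * s[j] ** 2
             - gamma / (2.0 * (two_n - 2)) * (s[mask] ** 2).sum())
```

Summing this loss over all positive pairs in the minibatch gives the training objective for one step.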

STL-10

We also perform pre-training and fine-tuning on STL-10 (Coates et al., 2011) using the model proposed in Chuang et al. (2020). Chuang et al. (2020) proposed to indirectly approximate the distribution of negative samples so that the objective is debiased; their implementation of contrastive learning is otherwise consistent with Chen et al. (2020b). We use a ResNet of depth 50 as the encoder for pre-training, with the Adam optimizer, learning rate 0.001, and weight decay $10^{-6}$. The temperature $\tau$ is set to 0.5 for all objectives other than $J_{\mathrm{RPC}}$, which disables hidden normalization and uses $\tau = 128$. The downstream task performance increases from 83.4% with $J_{\mathrm{CPC}}$ to 84.1% with $J_{\mathrm{RPC}}$.

Confidence Interval We also provide the confidence intervals of $J_{\mathrm{RPC}}$ and $J_{\mathrm{CPC}}$ on CIFAR-10, CIFAR-100, and ImageNet, using ResNet-18, ResNet-18, and ResNet-50 respectively (a 95% confidence level is chosen).

