DECOMPOSING MUTUAL INFORMATION FOR REPRESENTATION LEARNING

Abstract

Many self-supervised representation learning methods maximize mutual information (MI) across views. In this paper, we transform each view into a set of subviews and then decompose the original MI bound into a sum of bounds involving conditional MI between the subviews. E.g., given two views x and y of the same input example, we can split x into two subviews, x' and x'', which depend only on x but are otherwise unconstrained. The following holds: I(x; y) ≥ I(x'; y) + I(x''; y|x'), due to the chain rule and the data processing inequality. By maximizing both terms in the decomposition, our approach explicitly rewards the encoder for any information about y which it extracts from x', and for information about y extracted from x'' in excess of the information from x'. We provide a novel contrastive lower bound on conditional MI that relies on sampling contrast sets from p(y|x'). By decomposing the original MI into a sum of increasingly challenging MI bounds between sets of increasingly informed views, our representations can capture more of the total information shared between the original views. We empirically test the method in a vision domain and for dialogue generation.

1. INTRODUCTION

The ability to extract actionable information from data in the absence of explicit supervision seems to be a core prerequisite for building systems that can, for instance, learn from few data points or quickly make analogies and transfer to other tasks. Approaches to this problem include generative models (Hinton, 2012; Kingma & Welling, 2014) and self-supervised representation learning approaches, in which the objective is not to maximize likelihood, but to formulate a series of (label-agnostic) tasks that the model needs to solve through its representations (Noroozi & Favaro, 2016; Devlin et al., 2019; Gidaris et al., 2018; Hjelm et al., 2019). Self-supervised learning includes successful models leveraging contrastive learning, which have recently attained comparable performance to their fully-supervised counterparts (Bachman et al., 2019; Chen et al., 2020a). Many self-supervised learning methods train an encoder such that the representations of a pair of views x and y derived from the same input example are more similar to each other than to representations of views sampled from a contrastive negative sample distribution, which is usually the marginal distribution of the data. For images, different views can be built using random flipping, color jittering, and cropping (Bachman et al., 2019; Chen et al., 2020a). For sequential data such as conversational text, the views can be past and future utterances in a given dialogue. It can be shown that these methods maximize a lower bound on the mutual information (MI) between the views, I(x; y), w.r.t. the encoder, i.e. the InfoNCE bound (Oord et al., 2018). One significant shortcoming of this approach is the large number of contrastive samples required, which directly impacts the total amount of information which the bound can measure (McAllester & Stratos, 2018; Poole et al., 2019). In this paper, we consider creating subviews of x by removing information from it in various ways, e.g. by masking some pixels.
Then, we use representations from less informed subviews as a source of hard contrastive samples for representations from more informed subviews. For example, in Fig. 1, one can mask a pixel region in x to obtain x'' and ask (the representation of) x'' to be closer to y than to random images from the corpus, and ask x' to be closer to y than to samples from p(y|x''). This corresponds to decomposing the MI between x and y as I(x; y) ≥ I(x''; y) + I(x'; y|x''). The conditional MI measures the information about y that the model gains by looking at x' beyond the information already contained in x''.

Figure 1: A demonstration of our approach in vision (left) and dialogue (right). (left) Given two augmentations x and y, we fork x into two subviews: x', an exact copy of x, and x'', an information-restricted view obtained by occluding some of the pixels in x. We can maximize I(x; y) ≥ I(x''; y) + I(x'; y|x'') using a contrastive bound by training x'' to be closer to y than to other images from the corpus, and by training x' to be closer to y than to samples from p(y|x''), i.e. we can use x'' to generate hard negative samples for x'. The conditional MI term encourages the encoder to imbue the representation of x' with information it shares with y beyond the information already in x''. (right) x and y represent past and future in a dialogue, respectively, and x'' is the "recent past". In this context, the encoder is encouraged to capture long-term dependencies that cannot be explained by the most recent utterances.

In Fig. 1 (left), standard contrastive approaches could focus on the overall "shape" of the object and would need many negative samples to capture other discriminative features. In our approach, the model is more directly encouraged to capture these additional features, e.g. the embossed detailing. In the context of predictive coding on sequential data such as dialogue, by setting x'' to be the most recent utterance (Fig. 1, right), the encoder is directly encouraged to capture long-term dependencies that cannot be explained by x''. We formally show that, by such decomposition, our representations can potentially capture more of the total information shared between the original views x and y.

Maximizing MI between multiple views can be related to recent efforts in representation learning, amongst them AMDIM (Bachman et al., 2019), CMC (Tian et al., 2019) and SwAV (Caron et al., 2020). However, these models maximize a sum of MIs between views, I({x', x''}; y) = I(x'; y) + I(x''; y). E.g., in Bachman et al. (2019), x' and x'' could be global and local representations of an image, and in Caron et al. (2020), x' and x'' could be the views resulting from standard cropping and from the aggressive multi-crop strategy. This equality is only valid when the views x' and x'' are statistically independent, which usually does not hold. Instead, we argue that a better decomposition is I({x', x''}; y) = I(x'; y) + I(x''; y|x'), which always holds. Most importantly, the conditional MI term encourages the encoder to capture more non-redundant information across views. To maximize our proposed decomposition, we present a novel lower bound on conditional MI in Section 4, together with a computationally tractable approximation that adds minimal overhead. In Section 5, we first show in a synthetic setting that decomposing MI and using the proposed conditional MI bound leads to capturing more of the ground-truth MI. Finally, we present evidence of the effectiveness of the method in vision and in dialogue generation.

2. PROBLEM SETTING

The maximum MI predictive coding framework (McAllester, 2018; Oord et al., 2018; Hjelm et al., 2019) prescribes learning representations of input data such that they maximize MI. Estimating MI is generally a hard problem that has received a lot of attention in the community (Kraskov et al., 2004; Barber & Agakov, 2003). Let x and y be two random variables which can generally describe input data from various domains, e.g. text, images or sound. We can learn representations of x and y by maximizing the MI of the respective features produced by encoders f, g : X → R^d, which, by the data processing inequality, is bounded by I(x; y):

arg max_{f,g} I(f(x); g(y)) ≤ I(x; y).  (1)

We assume that the encoders can be shared, i.e. f = g. The optimization in Eq. 1 is challenging, but the objective can be lower-bounded. Our starting point is the recently proposed InfoNCE lower bound on MI (Oord et al., 2018) and its application to self-supervised learning of visual representations (Bachman et al., 2019; Chen et al., 2020a). In this setting, x and y are paired input images, or independently augmented copies of the same image. These are encoded using a neural network encoder which is trained such that the representations of the two image copies are closer to each other in the embedding space than to those of other images drawn from the marginal distribution of the corpus. This can be viewed as a contrastive estimation of the MI (Oord et al., 2018). We present the InfoNCE bound next.

2.1. INFONCE BOUND

InfoNCE (Oord et al., 2018) is a lower bound on I(x; y) obtained by comparing pairs sampled from the joint distribution, x, y_1 ∼ p(x, y), to a set of negative (also called contrastive) samples, y_{2:K} ∼ p(y_{2:K}) = ∏_{k=2}^K p(y_k), sampled independently from the marginal:

I_NCE(x; y|E, K) = E_{p(x,y_1) p(y_{2:K})} [ log ( e^{E(x,y_1)} / ( (1/K) ∑_{k=1}^K e^{E(x,y_k)} ) ) ] ≤ I(x; y),  (2)

where E is a critic assigning a real-valued score to x, y pairs. We provide an exact derivation for this bound in the Appendix. For this bound, the optimal critic is the log-odds between the conditional distribution p(y|x) and the marginal distribution of y, E*(x, y) = log [p(y|x)/p(y)] + c(x) (Oord et al., 2018; Poole et al., 2019). The InfoNCE bound is loose if the true mutual information I(x; y) is larger than log K. In order to overcome this difficulty, recent methods either train with large batch sizes (Chen et al., 2020a) or exploit an external memory of negative samples in order to reduce memory requirements (Chen et al., 2020b; Tian et al., 2020). These methods rely on uniform sampling from the training set in order to form the contrastive sets. For further discussion of the limits of variational bounds of MI, see McAllester & Stratos (2018).
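As a concrete illustration, the InfoNCE bound above can be estimated from a matrix of critic scores. The NumPy sketch below (the function name is ours, not from the paper) assumes the positive pair sits in column 0 and the remaining columns score K − 1 negatives drawn from the marginal:

```python
import numpy as np

def infonce(scores):
    """Monte-Carlo estimate of I_NCE from a [B, K] matrix of critic scores.

    scores[i, 0] is the critic value E(x_i, y_i) for the positive pair;
    scores[i, 1:] are critic values against K-1 negatives drawn from p(y).
    Returns the average of log e^{E(x,y_1)} / ((1/K) sum_k e^{E(x,y_k)}),
    which is upper-bounded by log K.
    """
    b, k = scores.shape
    m = scores.max(axis=1, keepdims=True)   # subtract row max for stability
    log_mean = m[:, 0] + np.log(np.exp(scores - m).mean(axis=1))
    return float((scores[:, 0] - log_mean).mean())
```

Because the estimate saturates at log K, a contrast set of K = 8 candidates can certify at most about 2.08 nats, which is the limitation the decomposition in Section 3 addresses.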

3. DECOMPOSING MUTUAL INFORMATION

By the data processing inequality: I(x; y) ≥ I({x_1, . . . , x_N}; y), where {x_1, . . . , x_N} are different subviews of x, i.e. views derived from x without adding any exogenous information. For example, {x_1, . . . , x_N} can represent exchanges in a longer dialogue x, sentences in a document x, or different augmentations of the same image x. Equality is obtained when the set of subviews retains all information about x, e.g. if x itself is in the set. Without loss of generality, we consider the case N = 2, I(x; y) ≥ I({x', x''}; y), where {x', x''} indicates two subviews derived from the original x. We can apply the chain rule for MI:

I(x; y) ≥ I({x', x''}; y) = I(x'; y) + I(x''; y|x'),  (3)

where the equality is obtained if and only if I(x; y|{x', x''}) = 0, i.e. x gives no information about y in excess of {x', x''}. This suggests that we can maximize I(x; y) by maximizing each of the MI terms in the sum. The conditional MI term can be written as:

I(x''; y|x') = E_{p(x',x'',y)} [ log ( p(y|x', x'') / p(y|x') ) ].  (4)

This conditional MI is different from the unconditional MI, I(x''; y), insofar as it measures the amount of information shared between x'' and y which cannot be explained by x'. Note that the decomposition holds for an arbitrary ordering of the subviews, e.g. I({x', x''}; y) = I(x''; y) + I(x'; y|x''). When X is high-dimensional, the amount of mutual information between x and y will potentially be larger than the amount of MI that I_NCE can measure, given the computational constraints associated with large K and the poor log-scaling properties of the bound. The idea that we put forward is to split the total MI into a sum of MI terms of smaller magnitude, for which I_NCE has less bias for any given K, and to estimate each of those terms in turn.
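The chain-rule decomposition above can be verified numerically on a small discrete joint distribution. In the NumPy sketch below (our own construction; x1 and x2 play the roles of the two subviews), the identity I({x1, x2}; y) = I(x1; y) + I(x2; y|x1) holds exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
# random joint distribution p(x1, x2, y), each variable taking 3 values
p = rng.random((3, 3, 3))
p /= p.sum()

def mi(joint):
    """I(a; b) in nats from a 2-D joint probability table."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (pa * pb)[mask])).sum())

# I({x1, x2}; y): treat the pair (x1, x2) as a single variable
i_joint = mi(p.reshape(9, 3))
# I(x1; y): marginalize out x2
i_x1 = mi(p.sum(axis=1))
# I(x2; y | x1) = sum_v p(x1 = v) * I(x2; y | x1 = v)
i_x2_given_x1 = sum(p[v].sum() * mi(p[v] / p[v].sum()) for v in range(3))
```

The identity holds for any joint distribution; the outer inequality I(x; y) ≥ I({x', x''}; y) then follows from data processing whenever the subviews are derived from x alone.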
The resulting decomposed bound can be written as a sum of unconditional and conditional MI bounds:

I_NCES(x; y) = I_NCE(x'; y) + I_CNCE(x''; y|x') ≤ I(x; y),  (5)

where I_CNCE is a lower bound on conditional MI that will be presented in the next section. Both the conditional (Eq. 6) and unconditional (Eq. 2) bounds on the MI can capture at most log K nats each. Therefore, the bound that arises from the decomposition of the MI in Eq. 5 potentially allows capturing up to N log K nats of MI in total, where N is the number of subviews used to describe x. This shows that measuring mutual information by decomposing it into a sequence of estimation problems potentially allows capturing more nats of MI than the standard I_NCE, which is bounded by log K.

4. CONTRASTIVE BOUNDS ON CONDITIONAL MUTUAL INFORMATION

One of the difficulties in computing the decomposed bound is measuring the conditional mutual information. In this section, we provide bounds and approximations of this quantity. First, we show that we can readily extend InfoNCE.

Proposition 1 (Conditional InfoNCE). The following is a lower bound on the conditional mutual information I(x''; y|x') and verifies the properties below:

I_CNCE(x''; y|x', E, K) = E_{p(x',x'',y_1) p(y_{2:K}|x')} [ log ( e^{E(x',x'',y_1)} / ( (1/K) ∑_{k=1}^K e^{E(x',x'',y_k)} ) ) ],  (6)

1. I_CNCE ≤ I(x''; y|x').
2. E* = arg sup_E I_CNCE = log [p(y|x', x'')/p(y|x')] + c(x', x'').
3. When K → ∞ and E = E*, we recover the true conditional MI: lim_{K→∞} I_CNCE(x''; y|x', E*, K) = I(x''; y|x').

The proof can be found in Sec. A.2 and follows closely the derivation of the InfoNCE bound, applying a result from Barber & Agakov (2003) and setting the proposal distribution of the variational approximation to p(y|x'). An alternative derivation of this bound was presented in parallel in Foster et al. (2020) for optimal experiment design. Eq. 6 shows that a lower bound on the conditional MI can be obtained by sampling contrastive sets from the proposal distribution p(y|x'). Indeed, since we want to estimate the MI conditioned on x', we should allow our contrastive distribution to condition on x'. Note that E is now a function of three variables. Computing Eq. 6 requires access to a large number of samples from p(y|x'), which is unknown and usually challenging to obtain. In order to overcome this, we propose two solutions.
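The bound in Proposition 1 can be checked in a toy linear-Gaussian model where p(y|x') is available in closed form, so the conditional contrast set can be sampled exactly and the optimal critic of property 2 can be written down. This is our own construction, not the paper's experiment; here the true conditional MI is 0.5·log 2 ≈ 0.347 nats:

```python
import numpy as np

rng = np.random.default_rng(0)
B, K = 4000, 128

# toy model: y = x1 + x2 + eps, with x1, x2, eps independent standard normals,
# so p(y|x1) = N(x1, 2), p(y|x1,x2) = N(x1+x2, 1), I(x2; y|x1) = 0.5*log(2)
x1 = rng.standard_normal(B)
x2 = rng.standard_normal(B)
y_pos = x1 + x2 + rng.standard_normal(B)

def log_normal(y, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (y - mean) ** 2 / (2 * var)

def critic(x1, x2, y):
    # optimal critic of property 2: log p(y|x1,x2) - log p(y|x1)
    return log_normal(y, x1 + x2, 1.0) - log_normal(y, x1, 2.0)

# contrast set: y_1 is the positive; y_{2:K} are exact samples from p(y|x1)
y_neg = x1[:, None] + np.sqrt(2.0) * rng.standard_normal((B, K - 1))
ys = np.concatenate([y_pos[:, None], y_neg], axis=1)    # [B, K]
scores = critic(x1[:, None], x2[:, None], ys)            # [B, K]

m = scores.max(axis=1, keepdims=True)
log_denom = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1) / K)
i_cnce = float((scores[:, 0] - log_denom).mean())

true_cmi = 0.5 * np.log(2.0)   # ground-truth I(x2; y | x1) for this model
```

With K = 128 the estimate lands slightly below the 0.347-nat ground truth, consistent with properties 1 and 3.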

4.1. VARIATIONAL APPROXIMATION

The next proposition shows that it is possible to obtain a bound on the conditional MI by approximating the unknown conditional distribution p(y|x') with a variational distribution τ(y|x').

Proposition 2 (Variational I_CNCE). For any variational approximation τ(y|x') in lieu of p(y|x'), with p(·|x') << τ(·|x') for any x',

I_VAR(x''; y|x', E, τ, K) = E_{p(x',x'',y_1) τ(y_{2:K}|x')} [ log ( e^{E(x',x'',y_1)} / ( (1/K) ∑_{k=1}^K e^{E(x',x'',y_k)} ) ) ] − E_{p(x')} KL( p(y|x') || τ(y|x') ),  (7)

we have the following properties:

1. I_VAR ≤ I(x''; y|x').
2. If τ(y|x') = p(y|x'), then I_VAR = I_CNCE.
3. lim_{K→∞} sup_E I_VAR(x''; y|x', E, τ, K) = I(x''; y|x').

See Sec. A.3 for the proof. This bound side-steps the problem of requiring access to an arbitrary number of contrastive samples from the unknown p(y|x') by i.i.d. sampling from the known and tractable τ(y|x'). We prove that, as the number of examples goes to ∞, optimizing the bound w.r.t. E converges to the true conditional MI. Interestingly, this holds true for any τ, though the choice of τ will most likely impact the convergence rate of the estimator. Eq. 7 is superficially similar to the ELBO (evidence lower bound) objective used to train VAEs (Kingma & Welling, 2014), where τ plays the role of the approximate posterior (although the KL direction in the ELBO is inverted). This parallel suggests that τ*(y|x') = p(y|x') may not be the optimal solution for some values of K and E. However, we see trivially that if we ignore the dependency of the first expectation term on τ and only optimize τ to minimize the KL term, then it is guaranteed that τ*(y|x') = p(y|x'), for any K and E. Thus, by the second property in Proposition 2, optimizing I_VAR(E, τ*, K) w.r.t. E corresponds to optimizing I_CNCE.
In practice, the latter observation significantly simplifies the estimation problem, as one can minimize a Monte-Carlo approximation of the KL divergence w.r.t. τ by standard supervised learning: we can efficiently approximate the KL by taking samples from p(y|x'). These can be obtained directly by using the joint samples from p(x, y) included in the training set and computing x' from x.
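As a minimal illustration of this supervised fit (our own linear-Gaussian construction, not the paper's setup), minimizing the Monte-Carlo KL w.r.t. τ reduces to maximum-likelihood regression on pairs (x', y) drawn from the training set:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20000

# joint training samples: x' ~ N(0, 1) and y | x' ~ N(x', 2)
x_prime = rng.standard_normal(N)
y = x_prime + np.sqrt(2.0) * rng.standard_normal(N)

# variational proposal tau(y|x') = N(a * x' + b, s), fit by maximum likelihood;
# since only the KL term depends on tau, this Monte-Carlo fit on samples from
# p(y|x') is exactly the supervised-learning step described above
A = np.stack([x_prime, np.ones(N)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
s = float(np.mean((y - (a * x_prime + b)) ** 2))   # MLE of the conditional variance
```

The fit recovers a ≈ 1, b ≈ 0, s ≈ 2, i.e. τ ≈ p(y|x'); in the paper's dialogue experiments, the analogous step fine-tunes GPT2 as τ.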

4.2. IMPORTANCE SAMPLING APPROXIMATION

Maximizing I_VAR can still be challenging, as it requires estimating a distribution over potentially high-dimensional inputs. In this section, we provide an importance sampling approximation of I_CNCE that bypasses this issue. We start by observing that the optimal critic for I_NCE(x'; y|E, K) is Ē(x', y) = log [p(y|x')/p(y)] + c(x'), for any c. Assuming we have appropriately estimated Ē(x', y), it is possible to use importance sampling to produce approximate samples from p(y|x'). This is achieved by first sampling y_{1:M} ∼ p(y) and then resampling K ≤ M (K > 0) examples i.i.d. from the normalized importance distribution q_SIR(y_k) = w_k δ(y_k ∈ y_{1:M}), where w_k = exp Ē(x', y_k) / ∑_{m=1}^M exp Ē(x', y_m). This process is also called "sampling importance resampling" (SIR). As M/K → ∞, it is guaranteed to produce samples from p(y|x') (Rubin, 1987). The SIR estimator is written as:

I_SIR(x''; y|x', E, K) = E_{p(x',x'',y_1) p(y_{1:M}) q_SIR(y_{2:K})} [ log ( e^{E(x',x'',y_1)} / ( (1/K) ∑_{k=1}^K e^{E(x',x'',y_k)} ) ) ],  (8)

where we note the dependence of q_SIR on w_k and hence on Ē. SIR is known to increase the variance of the estimator (Skare et al., 2003) and is wasteful given that only the smaller set of K examples is actually used for MI estimation. Hereafter, we provide a cheap approximation of the SIR estimator. The key idea is to rewrite the contribution of the negative samples in the denominator of Eq. 8 as an average, (K − 1) ∑_{k=2}^K (1/(K − 1)) e^{E(x',x'',y_k)}, and to use the normalized importance weights w_k to estimate that term under the resampling distribution. We hypothesize that this variant has less variance, as it does not require the additional resampling step. The following proposition shows that, as the number of negative examples goes to infinity, the proposed approximation converges to the true value of the conditional MI.

Proposition 3 (Importance Sampling I_CNCE).
The following approximation of I_SIR:

I_IS(x''; y|x', E, K) = E_{p(x',x'',y_1) p(y_{2:K})} [ log ( e^{E(x',x'',y_1)} / ( (1/K) ( e^{E(x',x'',y_1)} + (K − 1) ∑_{k=2}^K w_k e^{E(x',x'',y_k)} ) ) ) ],  (9)

where w_k = exp Ē(x', y_k) / ∑_{m=2}^K exp Ē(x', y_m) and Ē = arg sup_E I_NCE(x'; y|E, K), verifies:

1. lim_{K→∞} sup_E I_IS(x''; y|x', E, K) = I(x''; y|x'),
2. lim_{K→∞} arg sup_E I_IS = log [p(y|x', x'')/p(y|x')] + c(x', x'').

The proof can be found in Sec. A.4. This objective up-weights the contribution to the normalization term of negative examples that have high probability under the resampling distribution. The approximation is cheap to compute given that the negative samples still come from the marginal distribution p(y), and it avoids the need for resampling. The proposition shows that, in the limit of K → ∞, optimizing I_IS w.r.t. E converges to the conditional MI, and the optimal E converges to the optimal I_CNCE solution. The I_IS approximation provides a general, grounded way of sampling "harder" negatives by reweighting samples from the easily-sampled marginal p(y).
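In the same toy linear-Gaussian model used earlier (our own construction, with y = x1 + x2 + ε so that the marginal is p(y) = N(0, 3)), the I_IS objective can be computed by reweighting marginal negatives with the restricted-view critic Ē; the estimate lands close to the true conditional MI of 0.5·log 2:

```python
import numpy as np

rng = np.random.default_rng(0)
B, K = 4000, 256

x1 = rng.standard_normal(B)
x2 = rng.standard_normal(B)
y_pos = x1 + x2 + rng.standard_normal(B)                  # y = x1 + x2 + eps
y_neg = np.sqrt(3.0) * rng.standard_normal((B, K - 1))    # negatives from p(y) = N(0, 3)

def log_normal(y, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (y - mean) ** 2 / (2 * var)

# conditional critic E(x1, x2, y) and marginal critic E_bar(x1, y), both optimal
E_pos = log_normal(y_pos, x1 + x2, 1.0) - log_normal(y_pos, x1, 2.0)
E_neg = log_normal(y_neg, (x1 + x2)[:, None], 1.0) - log_normal(y_neg, x1[:, None], 2.0)
E_bar = log_normal(y_neg, x1[:, None], 2.0) - log_normal(y_neg, 0.0, 3.0)

# normalized importance weights w_k over the K-1 marginal negatives
w = np.exp(E_bar - E_bar.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)

# I_IS: reweighted negative contribution in the denominator, as in the proposition
denom = (np.exp(E_pos) + (K - 1) * (w * np.exp(E_neg)).sum(axis=1)) / K
i_is = float((E_pos - np.log(denom)).mean())

true_cmi = 0.5 * np.log(2.0)
```

No resampling step is needed: the same marginal negatives serve both the unconditional bound (via E_bar) and, reweighted, the conditional one.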

5. EXPERIMENTS

We start by investigating whether maximizing the decomposed MI using our conditional MI bound leads to a better estimate of the ground-truth MI in a synthetic experiment. Then, we experiment on a self-supervised image representation learning domain. Finally, we explore an application to natural language generation in a sequential setting, such as conversational dialogue.

5.1. SYNTHETIC DATA

We extend Poole et al. (2019)'s two-variable setup to three variables. We posit that {x', x'', y} are three Gaussian covariates, x', x'', y ∼ N(0, Σ), and we choose Σ such that we can control the total mutual information I({x', x''}; y) ∈ {5, 10, 15, 20} nats (see Appendix for pseudo-code and details of the setup). We aim to estimate the total MI I({x', x''}; y) and compare the performance of our estimators in doing so. For more details of this particular experimental setting, see App. B. In Figure 2, we compare the estimates of the MI obtained by:

1. InfoNCE, which computes I_NCE({x', x''}; y|E, K) and serves as our baseline;
2. InfoNCEs, which probes the effectiveness of decomposing the total MI into a sum of smaller terms and computes I_NCE(x'; y|E, K/2) + I_CNCE(x''; y|x', E, K/2), where K/2 samples are obtained from p(y) and K/2 are sampled from p(y|x');
3. InfoNCEs IS, the decomposed bound using our importance sampling approximation to the conditional MI, i.e. I_NCE(x'; y|E, K) + I_IS(x''; y|x', E, K). This does not require access to samples from p(y|x') and aims to test the validity of our approximation in an empirical setting; both terms reuse the same K samples.

For 2., we use only half as many samples as InfoNCE to estimate each term in the MI decomposition (K/2), so that the total number of negative samples is comparable to InfoNCE. We use K samples in InfoNCEs IS because those are reused for the conditional MI computation. All critics E are parametrized by MLPs, as explained in Sec. B. Our results in Figure 2 show that, for larger amounts of true MI, decomposing the MI as we propose captures more nats than InfoNCE with an order of magnitude fewer examples.
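The exact covariance construction is given in the paper's appendix, which is not reproduced in this chunk. A standard construction in this line of work (assumed here, following Poole et al.'s setup) uses componentwise-correlated Gaussians, for which the correlation needed to hit a target MI is available in closed form:

```python
import numpy as np

def rho_for_mi(target_mi_nats, dim):
    """Per-dimension correlation rho such that two dim-dimensional Gaussians
    with componentwise correlation rho share exactly target_mi_nats nats:
    I(x; y) = -dim/2 * log(1 - rho^2)."""
    return float(np.sqrt(1.0 - np.exp(-2.0 * target_mi_nats / dim)))

dim = 20
rho = rho_for_mi(10.0, dim)   # correlation giving I({x', x''}; y) = 10 nats
# Splitting x into two halves of dim/2 independent coordinates splits the MI
# exactly in half, so I(x'; y) and I(x''; y|x') each account for 5 nats.
```

This makes both the total MI and each term of the decomposition analytically available, which is what allows Figure 2 to compare estimators against the ground truth.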

5.2. VISION

Imagenet  We study self-supervised learning of image representations using 224x224 images from ImageNet. The evaluation is performed by fitting a linear classifier to the task labels using the pre-trained representations only, that is, we fix the weights of the pre-trained image encoder f. Each input image is independently augmented into two views x and y using a stochastically applied transformation. For the base model hyper-parameters and augmentations, we follow the "InfoMin Aug." setup (Tian et al., 2020). This uses random resized crop, color jittering, gaussian blur, RandAugment, color dropping, and jigsaw as augmentations, and uses a momentum-contrastive memory buffer of K = 65536 examples (Chen et al., 2020b). We fork x into two subviews {x', x''}: we set x' = x and x'' to be an information-restricted view of x. We found it beneficial to maximize both decompositions of the MI: I(x'; y) + I(x''; y|x') = I(x''; y) + I(x'; y|x''). Noting that I(x''; y|x') is likely zero, given that the information of x'' is contained in x' = x, our encoder f is trained to maximize:

L = λ I_NCE(x'; y|f, K) + (1 − λ) [ I_NCE(x''; y|f, K) + I_IS(x'; y|x'', f, K) ].  (10)

Note that if x'' = x, our decomposition boils down to maximizing the standard InfoNCE bound; therefore, InfoMin Aug. is recovered by fixing λ = 1 or by setting x'' = x. The computation of the conditional MI term does not add computational cost, as it can be computed by caching the logits used in the two unconditional MI terms (see Sec. B). We experiment with two ways of obtaining information-restricted views x'': cut, which applies cutout to x, and crop, which is inspired by Caron et al. (2020) and consists of cropping the image aggressively and resizing the resulting crops to 96x96. To do so, we use RandomResizedCrop from the torchvision.transforms module with scale parameter s = (0.05, 0.14). Results are reported in Table 1. Augmenting the InfoMin Aug.
base model with our conditional contrastive loss leads to gains of 0.8% in top-1 accuracy and 0.6% in top-5 accuracy. We notice that the crop strategy seems to perform slightly better than the cut strategy. One reason could be that cutout introduces image patches that do not follow the pixel statistics of the corpus. More generally, we think some information-restricted views may be better suited than others. In order to isolate the impact on performance of integrating an additional view x'', i.e. the I_NCE(x''; y|f, K) term in the optimization, we set the conditional mutual information term to zero in the line "without cond. MI". We see that this does not improve over the InfoMin Aug. baseline, and its performance is 1% lower than our method, pointing to the fact that maximizing conditional MI across views provides the observed gains. We also include the very recent results of SwAV (Caron et al., 2020) and BYOL (Grill et al., 2020), which use a larger number of views (SwAV) and different loss functions (SwAV, BYOL); we thus consider them orthogonal to our approach. We believe our approach is general and could be integrated into those solutions as well.
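The training objective with cached logits can be sketched as below (our own NumPy mock-up; in practice the candidate columns come from the momentum-contrastive buffer, and the column convention and λ default are our assumptions):

```python
import numpy as np

def log_softmax_ratio(scores):
    """InfoNCE-style term from [B, K] logits; column 0 holds the positive."""
    m = scores.max(axis=1, keepdims=True)
    return scores[:, 0] - (m[:, 0] + np.log(np.exp(scores - m).mean(axis=1)))

def decomposed_loss(logits_full, logits_restricted, lam=0.5):
    """Sketch of -[lam * I_NCE(x'; y) + (1 - lam) * (I_NCE(x''; y) + I_IS(x'; y|x''))].

    logits_full[i, k]       = critic score between f(x'_i) and the k-th candidate y
    logits_restricted[i, k] = critic score between f(x''_i) and the same candidates
    The I_IS term reuses (caches) the restricted-view logits as importance weights,
    so no extra forward passes are needed beyond the two unconditional terms.
    """
    K = logits_full.shape[1]
    i_nce_full = log_softmax_ratio(logits_full).mean()
    i_nce_restricted = log_softmax_ratio(logits_restricted).mean()

    # importance weights over the negatives, from the restricted view's critic
    neg = logits_restricted[:, 1:]
    w = np.exp(neg - neg.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    denom = (np.exp(logits_full[:, 0]) +
             (K - 1) * (w * np.exp(logits_full[:, 1:])).sum(axis=1)) / K
    i_is = (logits_full[:, 0] - np.log(denom)).mean()

    return float(-(lam * i_nce_full + (1 - lam) * (i_nce_restricted + i_is)))
```

Setting lam=1 recovers the InfoMin Aug. objective, matching the ablation described above.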

CIFAR-10

We also experiment on CIFAR-10, building upon SimCLR (Chen et al., 2020b), which uses a standard ResNet-50 architecture with the first 7x7 Conv of stride 2 replaced by a 3x3 Conv of stride 1 and the max-pooling operation removed. In order to generate the views, we use Inception crop (flip and resize to 32x32) and color distortion. We train with learning rate 0.5, batch size 800, momentum coefficient 0.9 and a cosine annealing schedule. Our energy function is the cosine similarity between representations scaled by a temperature of 0.5 (Chen et al., 2020b). We obtain a top-1 accuracy of 94.7% using a linear classifier, compared to 94.0% as reported in Chen et al. (2020b) and 95.1% for a supervised baseline with the same architecture.

5.3. DIALOGUE

For dialogue language modeling, we adopt the predictive coding framework (Elias, 1955; McAllester & Stratos, 2018) and consider the past and future of a dialogue as views of the same conversation. Given L utterances x = (x_1, . . . , x_L), we maximize I_NCES(x_{≤k}; x_{>k}|f, K), where the past x_{≤k} = (x_1, . . . , x_k) and the future x_{>k} = (x_{k+1}, . . . , x_L) are obtained by choosing a split point 1 < k < L. We obtain f(x_{≤k}), f(x_{>k}) by computing a forward pass of the fine-tuned "small" GPT2 model (Radford et al., 2019) on the past and future tokens, respectively, and taking the state corresponding to the last token in the last layer. We evaluate our introduced models against several baselines. GPT2 is the small pre-trained model fine-tuned on the dialogue corpus. TransferTransfo (Wolf et al., 2019) augments the standard next-word prediction loss in GPT2 with a next-sentence prediction loss similar to Devlin et al. (2019). Our baseline GPT2+InfoNCE maximizes I_NCE(x_{≤k}; x_{>k}|f, K) in addition to the standard next-word prediction loss. In GPT2+InfoNCE_S, we further set x' = x_{≤k} and x'' = x_k, the recent past, and maximize I_NCES(x_{≤k}; x_{>k}). To maximize the conditional MI bound, we sample contrastive futures from p(x_{>k}|x_k; θ_GPT2), using GPT2 itself as the variational approximation. We fine-tune all models on the Wizard of Wikipedia (WoW) dataset (Dinan et al., 2018) with early stopping on validation perplexity. We evaluate our models using automated metrics and human evaluation: we report perplexity (ppl), BLEU (Papineni et al., 2002), and word-repetition-based metrics from Welleck et al. (2019); specifically, seq-rep-n measures the portion of duplicate n-grams and seq-rep-avg averages over n ∈ {2, 3, 4, 5, 6}. We measure diversity via dist-n (Li et al., 2016), the number of unique n-grams normalized by the total number of n-grams. Table 4 shows results on the validation set. For the test set results, please refer to the Appendix.
Incorporating InfoNCE yields improvements in all metrics. Please refer to the Appendix for sample dialogue exchanges. We also perform a human evaluation on 1000 randomly sampled WoW dialogue contexts. We present the annotators with pairs of candidate responses consisting of GPT2+InfoNCE_S responses and baseline responses, and ask them to compare the pairs with regard to interestingness, relevance and humanness, using a 3-point Likert scale (Zhang et al., 2019).
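The view construction used by GPT2+InfoNCE_S can be sketched as below (a hypothetical helper of ours; tokenization and GPT2 encoding are omitted):

```python
def make_views(utterances, k):
    """Split a dialogue into the views used for decomposed predictive coding.

    x'  = full past (x_1 .. x_k), x'' = most recent utterance x_k,
    y   = future (x_{k+1} .. x_L).  Requires 1 <= k < L.
    """
    if not 1 <= k < len(utterances):
        raise ValueError("split point must satisfy 1 <= k < L")
    past = utterances[:k]            # x'
    recent = utterances[k - 1:k]     # x'' (kept as a one-utterance list)
    future = utterances[k:]          # y
    return past, recent, future
```

Contrastive futures for the conditional term are then sampled by continuing `recent` with the fine-tuned GPT2, as described above.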

6. DISCUSSION

The result in Eq. 5 is reminiscent of conditional noise-contrastive estimation (CNCE) (Ceylan & Gutmann, 2018), which proposes a framework for data-conditional noise distributions in noise-contrastive estimation (Gutmann & Hyvärinen, 2012). Here, we provide an alternative interpretation in terms of a bound on conditional mutual information. In CNCE, the contrastive distribution is obtained by injecting noise into the conditioning data. It would be interesting to investigate whether it is possible to form information-restricted views by similar noise injection, and whether "optimal" information-restricted views exist. Recent work has questioned whether MI maximization itself is at the core of the recent success in representation learning (Rainforth et al., 2018; Tschannen et al., 2019). These works observed that models capturing a larger amount of mutual information between views do not always lead to better downstream performance, and that other desirable properties of the representation space may be responsible for the improvements (Wang & Isola, 2020). Although we acknowledge that various factors can be at play in downstream performance, we posit that devising more effective ways to maximize MI will still prove useful in representation learning, especially if paired with architectural inductive biases or explicit regularization methods.

A DERIVATIONS

A.1 DERIVATION OF INFONCE, I_NCE

We start from Barber and Agakov's variational lower bound on MI (Barber & Agakov, 2003). I(x; y) can be bounded as follows:

I(x; y) = E_{p(x,y)} [ log p(y|x)/p(y) ] ≥ E_{p(x,y)} [ log q(y|x)/p(y) ],  (11)

where q is an arbitrary distribution. We show that the InfoNCE bound (Oord et al., 2018) corresponds to a particular choice for the variational distribution q, followed by an application of Jensen's inequality. Specifically, q(y|x) is defined by independently sampling a set of examples {y_1, . . . , y_K} from a proposal distribution π(y) and then choosing y from {y_1, . . . , y_K} in proportion to the importance weights w_y = e^{E(x,y)} / ∑_k e^{E(x,y_k)}, where E is a function that takes x and y and outputs a scalar. In the context of representation learning, E is usually a dot product between representations of x and y, e.g. f(x)^T f(y) (Oord et al., 2018). The unnormalized density of y given a specific set of samples y_{2:K} = {y_2, . . . , y_K} and x is:

q(y|x, y_{2:K}) = π(y) · K · e^{E(x,y)} / ( e^{E(x,y)} + ∑_{k=2}^K e^{E(x,y_k)} ),  (12)

where we introduce a factor K which provides "normalization in expectation". By normalization in expectation, we mean that taking the expectation of q(y|x, y_{2:K}) with respect to resampling of the alternatives y_{2:K} from π(y) produces a normalized density (see Sec. A.1.1 for a derivation): q(y|x) = E_{π(y_{2:K})} [q(y|x, y_{2:K})], where π(y_{2:K}) = ∏_{k=2}^K π(y_k). The InfoNCE bound (Oord et al., 2018) is then obtained by setting the proposal distribution to the marginal distribution, π(y) ≡ p(y), and applying Jensen's inequality, giving:

I(x; y) ≥ E_{p(x,y)} [ log ( E_{p(y_{2:K})} q(y|x, y_{2:K}) / p(y) ) ]
       ≥ E_{p(x,y)} E_{p(y_{2:K})} [ log ( p(y) · K · w_y / p(y) ) ]
       = E_{p(x,y)} E_{p(y_{2:K})} [ log ( K · e^{E(x,y)} / ( e^{E(x,y)} + ∑_{k=2}^K e^{E(x,y_k)} ) ) ]
       = E_{p(x,y_1) p(y_{2:K})} [ log ( e^{E(x,y_1)} / ( (1/K) ∑_{k=1}^K e^{E(x,y_k)} ) ) ]
       = I_NCE(x; y|E, K) ≤ log K,  (13)

where the second inequality has been obtained using Jensen's inequality.

A.1.1 DERIVATION OF NORMALIZED DISTRIBUTION

We follow Cremer et al. (2017) to show that q(y|x) = E y 2:K ∼π(y) [q(y|x, y2:K )] is a normalized distribution: x q(y|x) dy = y E y 2:K ∼π(y)   π(y) e E(x,y) 1 K K k=2 e E(x,y k ) + e E(x,y)   dy = y π(y)E y 2:K ∼π(y)   e E(x,y) 1 K K k=2 e E(x,y k ) + e E(x,y)   dy = E π(y) E π(y 2:K )   e E(x,y) 1 K K k=2 e E(x,y k ) + e E(x,y)   = E π(y 1:K ) e E(x,y) 1 K K k=1 e E(x,y k ) = K • E π(y 1:K ) e E(x,y 1 ) K k=1 e E(x,y k ) = K i=1 E π(y 1:K ) e E(x,y i ) K k=1 e E(x,y k ) = E π(y 1:K ) K i=1 e E(x,y i ) K k=1 e E(x,y k ) = 1 (15) A.2 PROOFS FOR I CN CE Proposition 1 (Conditional InfoNCE). The following is a lower-bound on the conditional mutual information I(x ; y|x ) and verifies the properties below: ICNCE(x ; y|x , E, K) = E p(x ,x ,y 1 )p(y 2:K |x ) log e E(x ,x ,y 1 ) 1 K K k=1 e E(x ,x ,y k ) (6) 1. ICNCE ≤ I(x ; y|x ). 2. E * = arg sup E ICNCE = log p(y|x ,x ) p(y|x ) + c(x , x ). 3. When K → ∞ and E = E * , we recover the true conditional MI: limK→∞ ICNCE(x ; y|x , E * , K) = I(x ; y|x ). Proof. We begin with 1., the derivation is as follows: I(x ; y|x ) = E p(x ,x ,y) log p(y|x , x ) p(y|x ) ≥ E p(x ,x ,y) log q(y|x , x ) p(y|x ) (16) = E p(x ,x ,y) log E p(y 2:K |x ) q(y|x , x , y2:K ) p(y|x ) ≥ E p(x ,x ,y) E p(y 2:K |x ) log p(y|x ) K • wy p(y|x ) (18) = E p(x ,x ,y) E p(y 2:K |x ) log K • e E(x ,x ,y) K k=1 e E(x ,x ,y k ) (19) = E p(x ,x ,y) E p(y 2:K |x ) log e E(x ,x ,y) 1 K K k=1 e E(x ,x ,y k ) (20) = ICNCE(x ; y|x , E, K), where we used in Eq. 16 the Jensen's inequality following Barber and Agakov's bound (Barber & Agakov, 2003) and used p(y|x ) as our proposal distribution for the variational approximation q(y|x , x ). For 2., we rewrite ICNCE by grouping the expectation w.r.t x : E p(x ) E p(x ,y 1 |x )p(y 2:K |x ) log e E(x ,x ,y 1 ) 1 K K k=1 e E(x ,x ,y k ) . 
Given that both distributions in the innermost expectation condition on the same $x'$, this term has the same form as $I_{\mathrm{NCE}}$, and therefore the optimal solution is $E^*_{x'} = \log \frac{p(y|x',x'')}{p(y|x')} + c_{x'}(x'')$ (Ma & Collins, 2018). The optimal $E$ for $I_{\mathrm{CNCE}}$ is thus obtained by choosing $E(x',x'',y) = E^*_{x'}$ for each $x'$, giving $E^* = \log \frac{p(y|x',x'')}{p(y|x')} + c(x',x'')$. For proving 3., we substitute the optimal critic and take the limit $K \to \infty$:
$$\lim_{K\to\infty} \mathbb{E}_{p(x',x'',y_1)\, p(y_{2:K}|x')} \log \frac{\frac{p(y_1|x',x'')}{p(y_1|x')}}{\frac{1}{K}\left(\frac{p(y_1|x',x'')}{p(y_1|x')} + \sum_{k=2}^{K} \frac{p(y_k|x',x'')}{p(y_k|x')}\right)}.$$
From the Strong Law of Large Numbers, $\frac{1}{K-1}\sum_{k=2}^{K} \frac{p(y_k|x',x'')}{p(y_k|x')} \to \mathbb{E}_{p(y|x')}\left[\frac{p(y|x',x'')}{p(y|x')}\right] = 1$ almost surely as $K \to \infty$; therefore (relabeling $y = y_1$):
$$I_{\mathrm{CNCE}} \sim_{K\to\infty} \mathbb{E}_{p(x',x'',y)} \log \frac{\frac{p(y|x',x'')}{p(y|x')}}{\frac{1}{K}\left(\frac{p(y|x',x'')}{p(y|x')} + K - 1\right)} \tag{24}$$
$$\sim_{K\to\infty} \mathbb{E}_{p(x',x'',y)} \left[\log \frac{p(y|x',x'')}{p(y|x')} + \log \frac{K}{\frac{p(y|x',x'')}{p(y|x')} + K - 1}\right] \tag{25}$$
$$\sim_{K\to\infty} I(x''; y|x'),$$
where the last equality is obtained by noting that the second term goes to 0.

A.3 PROOFS FOR $I_{VAR}$

Proposition 2 (Variational $I_{\mathrm{CNCE}}$). For any variational approximation $\tau(y|x')$ used in lieu of $p(y|x')$, with $p(\cdot|x') \ll \tau(\cdot|x')$ for any $x'$, define:
$$I_{\mathrm{VAR}}(x''; y|x', E, \tau, K) = \mathbb{E}_{p(x',x'',y_1)\, \tau(y_{2:K}|x')} \log \frac{e^{E(x',x'',y_1)}}{\frac{1}{K}\sum_{k=1}^{K} e^{E(x',x'',y_k)}} - \mathbb{E}_{p(x')}\, \mathrm{KL}\big(p(y|x') \,\|\, \tau(y|x')\big). \tag{7}$$
We have the following properties:
1. $I_{\mathrm{VAR}} \le I(x''; y|x')$.
2. If $\tau(y|x') = p(y|x')$, $I_{\mathrm{VAR}} = I_{\mathrm{CNCE}}$.
3. $\lim_{K\to\infty} \sup_E I_{\mathrm{VAR}}(x''; y|x', E, \tau, K) = I(x''; y|x')$.

Proof. For 1., we proceed as follows:
$$I(x''; y|x') \ge \mathbb{E}_{p(x,y)} \log \frac{q(y|x',x'')\, \tau(y|x')}{p(y|x')\, \tau(y|x')} = \mathbb{E}_{p(x,y)} \log \frac{q(y|x',x'')}{\tau(y|x')} - \mathbb{E}_{p(x)}\, \mathrm{KL}\big(p(y|x') \,\|\, \tau(y|x')\big)$$
$$\ge \mathbb{E}_{p(x,y_1)\, \tau(y_{2:K}|x')} \log \frac{e^{E(x',x'',y_1)}}{\frac{1}{K}\sum_{k=1}^{K} e^{E(x',x'',y_k)}} - \mathbb{E}_{p(x)}\, \mathrm{KL}\big(p(y|x') \,\|\, \tau(y|x')\big) = I_{\mathrm{VAR}}(x''; y|x', E, \tau, K),$$
where the last step has been obtained as in Eq. 18. Proving 2. is straightforward: if $\tau = p$, $\mathrm{KL}(p(y|x') \,\|\, \tau(y|x')) = 0$ and the first term corresponds to $I_{\mathrm{CNCE}}$. Proving 3. goes as follows:
$$\sup_E\; \mathbb{E}_{p(x',x'',y_1)\, \tau(y_{2:K}|x')} \log \frac{e^{E(x',x'',y_1)}}{\frac{1}{K}\sum_{k=1}^{K} e^{E(x',x'',y_k)}} - \mathbb{E}_{p(x')}\, \mathrm{KL}\big(p(y|x') \,\|\, \tau(y|x')\big) \tag{28}$$
$$= \mathbb{E}_{p(x',x'',y_1)\, \tau(y_{2:K}|x')} \left[\log \frac{p(y_1|x',x'')}{\tau(y_1|x')} - \log \frac{p(y_1|x')}{\tau(y_1|x')} - \log \frac{1}{K}\sum_{k=1}^{K} \frac{p(y_k|x',x'')}{\tau(y_k|x')}\right] \tag{29}$$
$$= I(x''; y|x') - \mathbb{E}_{p(x',x'',y_1)\, \tau(y_{2:K}|x')} \log \frac{1}{K}\sum_{k=1}^{K} \frac{p(y_k|x',x'')}{\tau(y_k|x')} \tag{30}$$
$$\to_{K\to\infty} I(x''; y|x').$$
This is obtained by noting that (1) for any $K$ and $\tau$, $\arg\sup_E I_{\mathrm{VAR}} = \log \frac{p(y|x',x'')}{\tau(y|x')}$ (because the KL term does not depend on $E$), and (2) the second term in the last line goes to 0 as $K \to \infty$: a straightforward application of the Strong Law of Large Numbers shows that for samples $y_{2:K}$ drawn from $\tau(y_{2:K}|x')$, we have $\frac{1}{K-1}\sum_{k=2}^{K} \frac{p(y_k|x',x'')}{\tau(y_k|x')} \to_{K\to\infty} 1$.

A.4 PROOFS FOR $I_{IS}$

We will be using the following lemma.

Lemma 1. For any $x'$, $x''$ and $y$, and any sequence $E_K$ such that $\|E_K - E\|_\infty \to_{K\to\infty} 0$:
$$\lim_{K\to\infty} \mathbb{E}_{p(y_{2:K})} \log \frac{K\, e^{E_K(x',x'',y)}}{e^{E_K(x',x'',y)} + (K-1)\sum_{k=2}^{K} w_k\, e^{E_K(x',x'',y_k)}} \tag{32}$$
$$= \lim_{K\to\infty} \mathbb{E}_{p(y_{2:K}|x')} \log \frac{K\, e^{E(x',x'',y)}}{e^{E(x',x'',y)} + \sum_{k=2}^{K} e^{E(x',x'',y_k)}}, \tag{33}$$
where $w_k = \frac{e^{\bar{E}(x',y_k)}}{\sum_{k'=2}^{K} e^{\bar{E}(x',y_{k'})}}$ and $\bar{E} = \arg\sup_E I_{\mathrm{NCE}}(x'; y|E, K)$, so that $e^{\bar{E}(x',y_k)} \propto \frac{p(y_k|x')}{p(y_k)}$.

Proof. Almost surely, for $y_{2:K} \sim p(\cdot)$:
$$\sum_{k=2}^{K} w_k\, e^{E_K(x',x'',y_k)} = \frac{\frac{1}{K-1}\sum_{k=2}^{K} \frac{p(y_k|x')}{p(y_k)}\, e^{E_K(x',x'',y_k)}}{\frac{1}{K-1}\sum_{k=2}^{K} \frac{p(y_k|x')}{p(y_k)}} \to_{K\to\infty} \mathbb{E}_{p(y|x')}\left[e^{E(x',x'',y)}\right], \tag{34}$$
where we applied the Strong Law of Large Numbers to the denominator. For the numerator, we write:
$$\frac{1}{K-1}\sum_{k=2}^{K} \frac{p(y_k|x')}{p(y_k)}\, e^{E_K(x',x'',y_k)} = \frac{1}{K-1}\sum_{k=2}^{K} \frac{p(y_k|x')}{p(y_k)}\, e^{E(x',x'',y_k)} + \frac{1}{K-1}\sum_{k=2}^{K} \frac{p(y_k|x')}{p(y_k)}\left(e^{E_K(x',x'',y_k)} - e^{E(x',x'',y_k)}\right)$$
and note that the first term is the standard importance-sampling estimator using $p(y_k)$ as proposal distribution and tends to $\mathbb{E}_{p(y|x')}\, e^{E(x',x'',y)}$ from the Strong Law of Large Numbers, while the second term goes to 0 as $E_K$ tends to $E$ uniformly. This gives
$$\lim_{K\to\infty} \mathbb{E}_{p(y_{2:K})} \log \frac{K\, e^{E_K(x',x'',y)}}{e^{E_K(x',x'',y)} + (K-1)\sum_{k=2}^{K} w_k\, e^{E_K(x',x'',y_k)}} = \log \frac{e^{E(x',x'',y)}}{\mathbb{E}_{p(y|x')}\, e^{E(x',x'',y)}}.$$
Following the same logic without the importance sampling demonstrates that
$$\lim_{K\to\infty} \mathbb{E}_{p(y_{2:K}|x')} \log \frac{K\, e^{E(x',x'',y)}}{e^{E(x',x'',y)} + \sum_{k=2}^{K} e^{E(x',x'',y_k)}} = \log \frac{e^{E(x',x'',y)}}{\mathbb{E}_{p(y|x')}\, e^{E(x',x'',y)}},$$
which concludes the proof.

Proposition 3 (Importance-Sampled $I_{\mathrm{CNCE}}$). The following sampling-importance-resampling approximation of $I_{\mathrm{CNCE}}$:
$$I_{\mathrm{IS}}(x''; y|x', E, \bar{E}, K) = \mathbb{E}_{p(x',x'',y_1)\, p(y_{2:K})} \log \frac{e^{E(x',x'',y_1)}}{\frac{1}{K}\left(e^{E(x',x'',y_1)} + (K-1)\sum_{k=2}^{K} w_k\, e^{E(x',x'',y_k)}\right)},$$
with $w_k = \frac{e^{\bar{E}(x',y_k)}}{\sum_{k'=2}^{K} e^{\bar{E}(x',y_{k'})}}$ and $\bar{E} = \arg\sup_E I_{\mathrm{NCE}}(x'; y|E, K)$, verifies:
1. $\lim_{K\to\infty} \sup_E I_{\mathrm{IS}}(x''; y|x', E, \bar{E}, K) = I(x''; y|x')$;
2. $\lim_{K\to\infty} \arg\sup_E I_{\mathrm{IS}} = \log \frac{p(y|x',x'')}{p(y|x')} + c(x', x'')$.

Proof. By applying Lemma 1 with $E_K = E$, we know that for any $E$:
$$\lim_{K\to\infty} I_{\mathrm{IS}}(x''; y|x', E, \bar{E}, K) = \lim_{K\to\infty} \mathbb{E}_{p(x',x'',y)\, p(y_{2:K}|x')} \log \frac{K\, e^{E(x',x'',y)}}{e^{E(x',x'',y)} + \sum_{k=2}^{K} e^{E(x',x'',y_k)}}.$$
In particular, the RHS of the equality corresponds to $\lim_{K\to\infty} I_{\mathrm{CNCE}}(x''; y|x', E, K)$. That quantity is smaller than $I(x''; y|x')$, with equality for $E = E^*$. This guarantees that:
$$\lim_{K\to\infty} \sup_E I_{\mathrm{IS}}(x''; y|x', E, \bar{E}, K) \ge \lim_{K\to\infty} I_{\mathrm{IS}}(x''; y|x', E^*, \bar{E}, K) = I(x''; y|x').$$
We now prove the reverse inequality. We let $2\epsilon = \lim_{K\to\infty} \sup_E I_{\mathrm{IS}}(x''; y|x', E, \bar{E}, K) - I(x''; y|x')$ and assume toward a contradiction that $\epsilon > 0$. We know that:
$$\exists K_0,\ \forall K \ge K_0,\quad \sup_E I_{\mathrm{IS}}(x''; y|x', E, \bar{E}, K) \ge I(x''; y|x') + \epsilon.$$
Now, for all $K \ge K_0$, let $E_K$ be such that:
$$I_{\mathrm{IS}}(x''; y|x', E_K, \bar{E}, K) \ge \sup_E I_{\mathrm{IS}}(x''; y|x', E, \bar{E}, K) - \frac{\epsilon}{2},$$
and thus, for all $K \ge K_0$, $I_{\mathrm{IS}}(x''; y|x', E_K, \bar{E}, K) \ge I(x''; y|x') + \frac{\epsilon}{2}$. Since $E_K \in \mathbb{R}^{|\mathcal{X}'| \times |\mathcal{X}''| \times |\mathcal{Y}|}$, $\{E_K\}_{K \ge K_0}$ contains a subsequence that converges to a certain $E_\infty \in \mathbb{R}^{|\mathcal{X}'| \times |\mathcal{X}''| \times |\mathcal{Y}|}$. Without loss of generality, we assume that $\forall K, x', x''$, $\mathbb{E}_{p(y)}[E_K(x', x'', y)] = 0$, which implies $\mathbb{E}_{p(y)}[E_\infty(x', x'', y)] = 0$ (similarly to $I_{\mathrm{NCE}}$, $I_{\mathrm{IS}}$ is invariant to constants added to $E$). In particular, this guarantees that $\|E_\infty\|_\infty < \infty$: otherwise, we would have $E_\infty(x', x'', y) = -\infty$ for a given $y$, which would then imply $I_{\mathrm{IS}}(x''; y|x', E_\infty, \bar{E}, K) = -\infty$ and give a contradiction. We can now apply Lemma 1 to $\{E_K\}$ and $E_\infty$ to show that $\lim_{K\to\infty} I_{\mathrm{IS}}(x''; y|x', E_K, \bar{E}, K) = \lim_{K\to\infty} I_{\mathrm{CNCE}}(x''; y|x', E_\infty, K)$, and get a contradiction: the first term is larger than $I(x''; y|x') + \frac{\epsilon}{2}$ while the second is smaller than $I(x''; y|x')$.

B PSEUDOCODE

B.1 LOSS COMPUTATION

We provide pseudo-code for the loss computation, which uses a MoCo-v2 backbone comprising a memory of contrastive examples obtained using a momentum-averaged encoder (Chen et al., 2020b).

B.2 SYNTHETIC EXPERIMENTS

Here, we provide details for Sec. 5.1. In this experiment, $x'$, $x''$ and $y$ are each 20-dimensional. For each dimension $i$, we sample $(x'_i, x''_i, y_i)$ from a correlated Gaussian with mean 0 and covariance matrix $\mathrm{cov}_i$. For a given value of the total MI, $mi \in \{5, 10, 15, 20\}$, we sample covariance matrices $\mathrm{cov}_i = \mathrm{sample\_cov}(mi_i)$ such that $\sum_i mi_i = mi$, with the $mi_i$ chosen at random. We optimize the bounds by stochastic gradient descent (Adam, learning rate $5 \cdot 10^{-4}$). All encoders $f$ are multi-layer perceptrons with a single hidden layer and ReLU activations; both the hidden and output layers have size 100.
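The per-dimension construction can be sketched as follows. This is our own minimal two-variable illustration, not the paper's `sample_cov` (which handles the full 3×3 case for $(x'_i, x''_i, y_i)$); it uses the standard fact that a standard bivariate Gaussian with correlation $\rho$ has MI $-\tfrac{1}{2}\log(1-\rho^2)$ nats:

```python
import numpy as np

def rho_for_mi(mi):
    """Correlation giving I = mi nats for a standard bivariate Gaussian,
    inverting I = -0.5 * log(1 - rho^2)."""
    return np.sqrt(1.0 - np.exp(-2.0 * mi))

def sample_views(mi_per_dim, n, seed=0):
    """Draw n samples of two views whose total MI is sum(mi_per_dim),
    with independent dimensions contributing mi_per_dim[i] each."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for mi in mi_per_dim:
        rho = rho_for_mi(mi)
        cov = np.array([[1.0, rho], [rho, 1.0]])
        z = rng.multivariate_normal(np.zeros(2), cov, size=n)
        xs.append(z[:, 0])
        ys.append(z[:, 1])
    return np.stack(xs, axis=1), np.stack(ys, axis=1)
```

Because the dimensions are independent, the total MI is the sum of per-dimension MIs, mirroring the $\sum_i mi_i = mi$ constraint above.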

InfoNCE computes:

$$\mathbb{E}_p \log \frac{e^{f([x',x''])^\top f(y)}}{e^{f([x',x''])^\top f(y)} + \sum_{k=2}^{K} e^{f([x',x''])^\top f(y_k)}} + \log K, \qquad y_{2:K} \sim p(y),$$
where the proposal is the marginal distribution $p(y)$, $E$ is chosen to be a dot product between representations, $\mathbb{E}_p$ denotes expectation w.r.t. the known joint distribution $p(x', x'', y)$ and is approximated with Monte Carlo, $[x', x'']$ denotes concatenation, and $f$ is a 1-hidden-layer MLP.

The four rows, from top to bottom, are: (1) the "past" dialogue up to utterance $k$; (2) the ground-truth utterance for the next turn $k+1$; (3) generations for the next turn sampled from the "restricted context" conditional "future" distribution $p(y|x_k)$; (4) future candidates sampled from the ground-truth "future" distribution. We can see that $p(y|x_k)$ is semantically close to, but incoherent with, the dialogue history, as it was conditioned solely on the immediate past utterance $x_k$. In contrast, $p(y)$ is semantically distant from $x$, as it was sampled randomly from the data distribution. The text highlighted in green corresponds to the topic of the conversation. Speaker B indicates that they have never done either parachuting or skydiving. $p(y|x_k)$ corresponds to the set of hard negatives that are closely related to the conversation. $B'_1$ corresponds to the utterance generated from the restricted context $x_k$: the utterance is on-topic but contradicts what speaker B has said in the past. $B_1$, on the other hand, is randomly sampled from other dialogues and is clearly irrelevant to the conversation; it is therefore easier for the model to discriminate between $B_1$ and $B_{gt}$.

We closely follow the protocol used in Zhang et al. (2019). Systems were paired and each response pair was presented to 3 judges in random order on a 3-point Likert scale. We use a majority vote for each response pair to decide whether system1, system2, or neither performed better.
We then bootstrap the set of majority votes to obtain a 95% confidence interval on the expected difference between system1 and system2. If this confidence interval contains 0, the difference is deemed insignificant. We also compute p-values from the confidence intervals. In the following tables, "pivot" is always the system given by our full InfoNCE$_S$ model. Pairings where the pairwise confidence interval is marked with "*" have a significant difference between systems.
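The bootstrap procedure described above can be sketched as follows. This is a hedged illustration: the vote encoding and the number of resamples are our assumptions, not taken from the paper.

```python
import numpy as np

def bootstrap_diff_ci(votes, n_boot=10000, seed=0, level=95.0):
    """votes: per-pair majority outcomes encoded as +1 (system1 better),
    -1 (system2 better), 0 (neither). Returns a percentile bootstrap CI
    on the expected win-rate difference; if the interval contains 0,
    the difference between systems is deemed insignificant."""
    rng = np.random.default_rng(seed)
    votes = np.asarray(votes, dtype=float)
    means = np.array([
        rng.choice(votes, size=votes.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    half = (100.0 - level) / 2.0
    return np.percentile(means, half), np.percentile(means, 100.0 - half)
```

For unanimous votes the interval collapses to a point; for balanced votes it straddles 0, flagging the comparison as insignificant.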



Footnotes:
1. The derivation in Oord et al. (2018) presented an approximation and therefore was not properly a bound. An alternative, exact derivation of the bound can be found in Poole et al. (2019).
2. For a proof of this fact, it suffices to consider $I(\{x, x', x''\}; y) = I(x; y|\{x', x''\}) + I(\{x', x''\}; y)$; given that $I(\{x, x', x''\}; y) = I(x; y)$, equality is obtained iff $I(x; y|\{x', x''\}) = 0$.
3. The ability to perform that computation is usually a key assumption in self-supervised learning approaches.
4. The negative sampling of future candidates is done offline.
5. Note that our results are not directly comparable with Li et al. (2019), as their model is trained from scratch on a Reddit-based corpus that is not publicly available.
6. https://www.bmj.com/content/343/bmj.d2304



Figure 2: We plot the value of the MI estimated by the $I_{NCE}$ and $I_{NCES}$ bounds for three Gaussian covariates $x'$, $x''$, $y$ as a function of the number of negative samples $K$. We sample different covariances for a fixed true MI (green horizontal line) and report error bars. "InfoNCE" computes $I_{NCE}(x', x''; y)$; "InfoNCE$_S$" computes $I_{NCE}(x'; y) + I_{CNCE}(x''; y|x')$; "InfoNCE$_S$ IS" computes $I_{NCE}(x'; y) + I_{IS}(x''; y|x')$.

With $w_k = \frac{e^{\bar{E}(x',y_k)}}{\sum_{k'=2}^{K} e^{\bar{E}(x',y_{k'})}}$ and $\bar{E} = \arg\sup_E I_{\mathrm{NCE}}(x'; y|E, K)$, $I_{\mathrm{IS}}$ verifies: 1. $\lim_{K\to\infty} \sup_E I_{\mathrm{IS}}(x''; y|x', E, K) = I(x''; y|x')$; 2. $\lim_{K\to\infty} \arg\sup_E I_{\mathrm{IS}} = \log \frac{p(y|x',x'')}{p(y|x')} + c(x', x'')$.

def compute_loss(xp, xpp, y, f, f_ema, memory, lam=0.5):
    """loss = lam * mi(xp; y) + (1 - lam) * (mi(xpp; y) + mi(xp; y | xpp))"""
    # encode xp and xpp with the standard encoder, (1, dim)
    q_xp, q_xpp = f(xp), f(xpp)
    # encode y with the momentum-averaged encoder, (1, dim)
    k_y = f_ema(y).detach()
    # (1 + n_mem,), first entry is the xpp-y score
    logits_xpp_y = dot(q_xpp, cat(k_y, memory))
    # (1 + n_mem,), first entry is the xp-y score
    logits_xp_y = dot(q_xp, cat(k_y, memory))
    # InfoNCE bound between xp and y
    nce_xp_y = -log_softmax(logits_xp_y)[0]
    # InfoNCE bound between xpp and y
    nce_xpp_y = -log_softmax(logits_xpp_y)[0]
    K = len(logits_xpp_y)
    # compute resampling importance weights
    w_pp_y = softmax(logits_xpp_y[1:])
    # form approximation to the partition function (Eq. 12)
    Z_xp_y = (K - 1) * w_pp_y * exp(logits_xp_y[1:])
    Z_xp_y = Z_xp_y.sum() + exp(logits_xp_y[0])
    # InfoNCE bound on the conditional mutual information
    nce_xp_y_I_xpp = -logits_xp_y[0] + log(Z_xp_y)
    # compose final loss
    loss = lam * nce_xp_y
    loss += (1 - lam) * (nce_xpp_y + nce_xp_y_I_xpp)
    return loss
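To make the loss computation above concrete, here is a self-contained numpy version operating directly on precomputed embeddings. This is our own illustration: the helper `log_softmax`, the shapes, and the separation of encoding from scoring are assumptions, not the paper's implementation.

```python
import numpy as np

def log_softmax(v):
    # numerically stable log-softmax over a 1-D array
    m = v.max()
    return v - (m + np.log(np.exp(v - m).sum()))

def compute_loss_np(q_xp, q_xpp, k_y, memory, lam=0.5):
    """numpy sketch of: lam * nce(xp; y)
    + (1 - lam) * (nce(xpp; y) + nce(xp; y | xpp)).
    q_xp, q_xpp, k_y: (dim,) embeddings; memory: (n_mem, dim)."""
    keys = np.concatenate([k_y[None, :], memory], axis=0)   # (1 + n_mem, dim)
    logits_xpp_y = keys @ q_xpp                             # (1 + n_mem,)
    logits_xp_y = keys @ q_xp
    nce_xp_y = -log_softmax(logits_xp_y)[0]
    nce_xpp_y = -log_softmax(logits_xpp_y)[0]
    K = len(logits_xpp_y)
    # resampling importance weights from the xpp-y scores
    w = np.exp(log_softmax(logits_xpp_y[1:]))
    # approximate partition function for the conditional term
    Z = ((K - 1) * w * np.exp(logits_xp_y[1:])).sum() + np.exp(logits_xp_y[0])
    nce_xp_y_given_xpp = -logits_xp_y[0] + np.log(Z)
    return lam * nce_xp_y + (1 - lam) * (nce_xpp_y + nce_xp_y_given_xpp)
```

All three terms are non-negative by construction, so the composed loss is a positive scalar that can be minimized directly.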

Accuracy on ImageNet linear evaluation. $x \Leftrightarrow y$ denotes standard contrastive matching between views. "InfoNCE$_S$" uses the same base InfoMin Aug. architecture but augments the loss function with conditional MI maximization across views. All models use a standard ResNet-50 architecture. (↑) represents the improvement over InfoMin Aug.

Its empirical behavior for MI $\in \{5, 10\}$ could indicate that InfoNCE$_S$ IS is a valid lower bound on the MI, although we could not prove it formally.

Table 4 lists the difference between the fraction of wins for GPT2+InfoNCE$_S$ and the other models on H-rel, H-hum, and H-int. Overall, GPT2+InfoNCE$_S$ was strongly preferred over GPT2, TransferTransfo and GPT2+InfoNCE, but not over the gold response. Bootstrap confidence intervals and p-values (t-test) indicate that all improvements, except for GPT2+InfoNCE on the relevance criterion, are significant at $\alpha = 0.05$.

Results for perplexity, sequence-level metrics, token-level metrics, BLEU, diversity metrics and human evaluation on the validation data of the Wizard of Wikipedia dataset (Dinan et al., 2018).

A sample dialogue between speaker A and speaker B from the Wizard of Wikipedia dataset.

Selected responses to the same context from different methods fine-tuned on the Wizard of Wikipedia dataset.

Context 1:
Archery is a sport / skill of using a bow to propel arrows.
A: And a great sport it is. Do you know where archery originated from?
Responses:
GPT2+InfoNCE$_S$: Archery originated from ancient Mesopotamia, and is actually practiced internationally in over 200 countries.
GPT2+InfoNCE: I am not sure but many traditions and principles arose during the Mesolithic and Paleolithic era.
TransferTransfo: Yep, you just use it for skill and using it to shoot arrows.
GPT2: I don't know, but I know that the old French called it archer's art.

Context 2:
What is your motivation?
B: That's a tough question. I'd say my own progress. I like seeing how much I improve with something. You?
A: I am retired now. Are you familiar with the work of Mehr and Meyer, well known psychologists?
B: They said that "Motivation is a word that is part of the popular culture as few other psychological concepts are." Do you agree with what they said?
Responses:
GPT2+InfoNCE$_S$: Yes, definitely! You should check out Mark Twain's famous motivational book! There are theories that point out how important it is to meditate.
GPT2+InfoNCE: They said that in psychology research, theories and evidence, participants who enjoyed continuous experience could be more likely to be successful.
TransferTransfo: Absolutely, I disagree. Are you an author or a commentator?
GPT2: That's awesome. So, what types of differences exist between "good" motivation and bad?

InfoNCE$_S$ computes:

$$\mathbb{E}_{p(x',x'',y)\, p(y_{2:K})} \log \frac{e^{f(x')^\top f(y)}}{e^{f(x')^\top f(y)} + \sum_{k=2}^{K} e^{f(x')^\top f(y_k)}} + \mathbb{E}_{p(x',x'',y)\, p(y_{2:K}|x')} \log \frac{e^{f([x',x''])^\top f(y)}}{e^{f([x',x''])^\top f(y)} + \sum_{k=2}^{K} e^{f([x',x''])^\top f(y_k)}} + 2 \log K, \tag{36}$$
where $f(x')$ is just $f([x', 0])$ in order to re-use the MLP parameters for the two terms. The negative samples of the conditional MI term come from the conditional distribution $p(y|x')$, which is assumed to be known in this controlled setting. We maximize both lower bounds with respect to the encoder $f$. We report pseudo-code for sample_cov, used to generate $3 \times 3$ covariance matrices for a fixed $mi = I(\{x', x''\}; y)$ and uniformly sampled $\alpha = I(x'; y)/I(\{x', x''\}; y)$.

For all InfoNCE terms, given the past, the model is trained to pick the ground-truth future among a set of $N$ future candidates. This candidate set includes the ground-truth future and $N-1$ negative futures drawn from different proposal distributions. To compute InfoNCE$(f(x_{\le k}); f(x_{>k}))$, we consider the ground-truth future of each sample as a negative candidate for the other samples in the batch; using this approach, the number of candidates $N$ equals the batch size. This ensures that negative samples are drawn from the marginal distribution $p(x_{>k})$. To compute the conditional information bound InfoNCE$_S$, we sample negative futures from $p(y|x_k)$ by leveraging the GPT2 model itself, conditioning it only on the most recent past utterance $x_k$.
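The in-batch negative scheme described above can be sketched as follows (our own illustration; the embedding matrices stand in for $f(x_{\le k})$ and $f(x_{>k})$ and are assumptions, not the paper's code):

```python
import numpy as np

def in_batch_info_nce(f_past, f_future):
    """InfoNCE where each ground-truth future embedding f_future[j]
    serves as a negative for every other past f_past[i]: scores[i, j]
    is a dot product, positives sit on the diagonal, and the number of
    candidates N equals the batch size B."""
    scores = f_past @ f_future.T                       # (B, B)
    m = scores.max(axis=1, keepdims=True)              # stabilize log-sum-exp
    lse = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    B = scores.shape[0]
    return float(np.mean(np.diag(scores) - lse) + np.log(B))
```

With perfectly matched past/future embeddings the estimate saturates at $\log B$, which is why the batch size directly caps the measurable MI for this term.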

C.2 EXPERIMENTAL SETUP

Given memory constraints, the proposed models are trained with a batch size of 5 per GPU over 10 epochs, considering up to three utterances for the future and five utterances in the past. All models are trained on 2 NVIDIA V100s and early-stop at the 4th epoch. We use the Adam optimizer with a learning rate of $6.25 \times 10^{-5}$, which we linearly decay to zero during training. Dropout is set to 10% on all layers. The InfoNCE/InfoNCE$_S$ terms are weighted with a factor of 0.1 in the loss function.

Sample dialogue:
A: I like parachuting or skydiving.
B: I've never done either but they sound terrifying, not a fan of heights.
A: But it is interesting game. This first parachute jump in history was made by Andre Jacques.
B: Oh really? Sounds like a french name, what year did he do it?
A: It done in October 22 1797. They tested his contraption by leaping from a hydrogen balloon.
B: Was he successful or did he kick the bucket off that stunt?
A: I think its a success. The military developed parachuting tech.
B$_{gt}$: Yeah nowadays they are a lot more stable and well made.

