DECOMPOSING MUTUAL INFORMATION FOR REPRESENTATION LEARNING

Abstract

Many self-supervised representation learning methods maximize mutual information (MI) across views. In this paper, we transform each view into a set of subviews and then decompose the original MI bound into a sum of bounds involving conditional MI between the subviews. E.g., given two views x and y of the same input example, we can split x into two subviews, x1 and x2, which depend only on x but are otherwise unconstrained. The following holds: I(x; y) ≥ I(x1; y) + I(x2; y|x1), due to the chain rule and the data processing inequality. By maximizing both terms in the decomposition, our approach explicitly rewards the encoder for any information about y which it extracts from x1, and for information about y extracted from x2 in excess of the information already contained in x1. We provide a novel contrastive lower bound on conditional MI that relies on sampling contrast sets from p(y|x1). By decomposing the original MI into a sum of increasingly challenging MI bounds between sets of increasingly informed views, our representations can capture more of the total information shared between the original views. We empirically test the method in a vision domain and for dialogue generation.

1. INTRODUCTION

The ability to extract actionable information from data in the absence of explicit supervision seems to be a core prerequisite for building systems that can, for instance, learn from few data points or quickly make analogies and transfer to other tasks. Approaches to this problem include generative models (Hinton, 2012; Kingma & Welling, 2014) and self-supervised representation learning approaches, in which the objective is not to maximize likelihood, but to formulate a series of (label-agnostic) tasks that the model needs to solve through its representations (Noroozi & Favaro, 2016; Devlin et al., 2019; Gidaris et al., 2018; Hjelm et al., 2019). Self-supervised learning includes successful models leveraging contrastive learning, which have recently attained performance comparable to their fully-supervised counterparts (Bachman et al., 2019; Chen et al., 2020a).

Many self-supervised learning methods train an encoder such that the representations of a pair of views x and y derived from the same input example are more similar to each other than to representations of views sampled from a contrastive negative sample distribution, usually the marginal distribution of the data. For images, different views can be built using random flipping, color jittering and cropping (Bachman et al., 2019; Chen et al., 2020a). For sequential data such as conversational text, the views can be past and future utterances in a given dialogue. It can be shown that these methods maximize a lower bound on the mutual information (MI) between the views, I(x; y), w.r.t. the encoder, namely the InfoNCE bound (Oord et al., 2018). One significant shortcoming of this approach is the large number of contrastive samples required, which directly impacts the total amount of information that the bound can measure (McAllester & Stratos, 2018; Poole et al., 2019).

In this paper, we consider creating subviews of x by removing information from it in various ways, e.g. by masking some pixels.
Then, we use representations from less informed subviews as a source of hard contrastive samples for representations from more informed subviews. For example, in Fig. 1, one can mask a pixel region in x to obtain x1 and ask (the representation of) x1 to be closer to y than to random images from the corpus, and x2, an exact copy of x, to be closer to y than to samples from p(y|x1). This corresponds to decomposing the MI between x and y into I(x; y) ≥ I(x1; y) + I(x2; y|x1). The conditional MI measures the information about y that the model has gained by looking at x2 beyond the information already contained in x1. Without the conditional term, a model trained on the example in Fig. 1 (left) could focus on the overall "shape" of the object and would need many negative samples to capture other discriminative features. In our approach, the model is more directly encouraged to capture these additional features, e.g. the embossed detailing. In the context of predictive coding on sequential data such as dialogue, by setting x1 to be the most recent utterance (Fig. 1, right), the encoder is directly encouraged to capture long-term dependencies that cannot be explained by x1. We formally show that, by such decomposition, our representations can potentially capture more of the total information shared between the original views x and y.
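The decomposition above rests on the chain rule of mutual information: when the two subviews jointly carry all of x, i.e. x = (x1, x2), the chain rule gives exact equality, and removing information from a subview can only lower the right-hand side (data processing). A minimal numeric check of the chain rule, using a small random joint table that is purely illustrative and not from the paper:

```python
# Toy numeric check of the chain rule behind I(x; y) >= I(x1; y) + I(x2; y|x1).
# The random joint table p(x1, x2, y) below is an illustrative stand-in.
import numpy as np

def mi(pxy):
    """I(X; Y) in nats from a joint probability table p(x, y)."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    return float((pxy[m] * np.log(pxy[m] / (px @ py)[m])).sum())

def cmi(p):
    """I(X2; Y | X1) from a joint table p(x1, x2, y)."""
    p1 = p.sum(axis=(1, 2))
    return sum(p1[a] * mi(p[a] / p1[a]) for a in range(p.shape[0]) if p1[a] > 0)

rng = np.random.default_rng(0)
p = rng.random((2, 3, 4))
p /= p.sum()                          # joint p(x1, x2, y)

lhs = mi(p.reshape(6, 4))             # I((x1, x2); y)
rhs = mi(p.sum(axis=1)) + cmi(p)      # I(x1; y) + I(x2; y | x1)
```

Here `lhs` and `rhs` agree to floating-point precision; with lossy subviews the right-hand side can only drop.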


To maximize our proposed decomposition, we present a novel lower bound on conditional MI in Section 3. For the conditional MI maximization, we give a computationally tractable approximation that adds minimal overhead. In Section 4, we first show in a synthetic setting that decomposing MI and using the proposed conditional MI bound leads to capturing more of the ground-truth MI. Finally, we present evidence of the effectiveness of the method in vision and in dialogue generation.

2. PROBLEM SETTING

The maximum MI predictive coding framework (McAllester, 2018; Oord et al., 2018; Hjelm et al., 2019) prescribes learning representations of input data such that they maximize MI. Estimating MI is generally a hard problem that has received a lot of attention in the community (Kraskov et al., 2004; Barber & Agakov, 2003). Let x and y be two random variables which can generally describe input data from various domains, e.g. text, images or sound. We can learn representations of x and y by maximizing the MI of the respective features produced by encoders f, g : X → R^d, which, by the data processing inequality, is bounded by I(x; y):

arg max_{f,g} I(f(x); g(y)) ≤ I(x; y).    (1)

We assume that the encoders can be shared, i.e. f = g. The optimization in Eq. 1 is challenging but can be lower-bounded. Our starting point is the recently proposed InfoNCE lower bound on MI (Oord et al., 2018) and its application to self-supervised learning for visual representations (Bachman et al., 2019).
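The InfoNCE bound scores each positive pair (f(x_i), g(y_i)) against the other K-1 pairings in a batch, which act as negatives from the product of marginals. A minimal sketch, where the raw dot-product scoring and the toy Gaussian views are illustrative assumptions rather than the paper's setup:

```python
# Minimal InfoNCE estimator sketch (Oord et al., 2018). Toy data only.
import numpy as np

def infonce(fx, gy):
    """InfoNCE estimate of I(x; y) from paired embeddings fx, gy of shape (K, d).
    In row i of the score matrix, pair (i, i) is the positive; the rest are negatives."""
    scores = fx @ gy.T
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    log_soft = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(np.diag(log_soft).mean() + np.log(len(fx)))

rng = np.random.default_rng(0)
K, d = 256, 16
x = rng.normal(size=(K, d))
y = x + 0.1 * rng.normal(size=(K, d))   # strongly correlated paired views
estimate = infonce(x, y)
cap = np.log(K)                         # the estimate can never exceed log K
```

The `cap` line makes the shortcoming mentioned above concrete: since the log-softmax term is non-positive, the estimator saturates at log K no matter how much information the views actually share (McAllester & Stratos, 2018).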



Figure 1: A demonstration of our approach in vision (left) and dialogue (right). (left) Given two augmentations x and y, we fork x into two subviews: x2, an exact copy of x, and x1, an information-restricted view obtained by occluding some of the pixels in x. We can maximize I(x; y) ≥ I(x1; y) + I(x2; y|x1) using a contrastive bound by training x1 to be closer to y than to other images from the corpus, and by training x2 to be closer to y than to samples from p(y|x1), i.e. we can use x1 to generate hard negative samples for x2. The conditional MI term encourages the encoder to imbue the representation of x2 with information it shares with y beyond the information already in x1. (right) x and y represent past and future in a dialogue respectively, and x1 is the "recent past". In this context, the encoder is encouraged to capture long-term dependencies that cannot be explained by the most recent utterances.
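One simple way to realize the hard-negative idea in the caption is to keep, as negatives for the informed subview, the candidate y's that the restricted subview already finds plausible, i.e. an approximation to samples from p(y|x1). Scoring candidates by dot product with f(x1) and taking the top-k is our illustrative assumption here, not necessarily the paper's sampler:

```python
# Sketch: hard negatives for x2 are the pool items most compatible with x1.
import numpy as np

def hard_negative_ids(f_x1, g_pool, k):
    """Indices of the k pool embeddings most similar to f(x1); these are the
    y's that x1 alone cannot rule out, hence hard negatives for x2."""
    return np.argsort(-(g_pool @ f_x1))[:k]

# Tiny pool: candidate 0 is nearly collinear with f(x1), the rest orthogonal.
f_x1 = np.array([1.0, 0.0, 0.0])
pool = np.array([[0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
ids = hard_negative_ids(f_x1, pool, k=1)
```

With this pool, the selected negative is candidate 0: the one item that looks like a plausible future given x1 alone.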

Maximizing MI between multiple views relates to recent efforts in representation learning, amongst them AMDIM (Bachman et al., 2019), CMC (Tian et al., 2019) and SwAV (Caron et al., 2020). However, these models maximize the sum of MIs between views, I({x1, x2}; y) = I(x1; y) + I(x2; y). E.g., in Bachman et al. (2019), x1 and x2 could be global and local representations of an image, and in Caron et al. (2020), x1 and x2 could be the views resulting from standard cropping and the aggressive multi-crop strategy. This equality is only valid when the views x1 and x2 are statistically independent, which usually does not hold. Instead, we argue that a better decomposition is I({x1, x2}; y) = I(x1; y) + I(x2; y|x1), which always holds. Most importantly, the conditional MI term encourages the encoder to capture more non-redundant information across views.
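The redundancy problem can be seen in an extreme toy case: with fully redundant binary views x1 = x2 = y (a fair coin), the sum I(x1; y) + I(x2; y) double counts the shared bit, while the chain-rule decomposition matches I({x1, x2}; y) exactly. The example below is illustrative only:

```python
# Redundant views double-count under the sum-of-MIs objective.
import numpy as np

def mi(pxy):
    """I(X; Y) in nats from a joint probability table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    return float((pxy[m] * np.log(pxy[m] / (px @ py)[m])).sum())

p = np.zeros((2, 2, 2))           # joint p(x1, x2, y)
p[0, 0, 0] = p[1, 1, 1] = 0.5     # x1 = x2 = y, y a fair coin

sum_of_mis = mi(p.sum(axis=1)) + mi(p.sum(axis=0))   # I(x1; y) + I(x2; y) = 2 log 2
joint_mi = mi(p.reshape(4, 2))                       # I({x1, x2}; y) = log 2
# I(x2; y | x1) = 0 here: each conditional slice p(x2, y | x1 = a) is deterministic.
chain = mi(p.sum(axis=1)) + 0.0                      # I(x1; y) + I(x2; y | x1)
```

The sum-of-MIs objective reports 2 log 2 nats although only log 2 nats are shared with y, whereas the conditional decomposition is exact.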

